
Applied Econometrics

July 15, 2010

Contents

1 Introduction and Motivation
  1.1 Three motivations for the CLRM
2 The Classical Linear Regression Model (CLRM): Parameter Estimation by OLS
3 Assumptions of the CLRM
4 Finite sample properties of the OLS estimator
5 Hypothesis Testing under Normality
6 Confidence intervals
  6.1 Simulation
7 Goodness-of-fit measures
8 Introduction to large sample theory
  8.1 A. Modes of stochastic convergence
  8.2 B. Law of Large Numbers (LLN)
  8.3 C. Central Limit Theorems (CLTs)
  8.4 D. Useful lemmas of large sample theory
  8.5 Large sample properties of the OLS estimator
9 Time Series Basics (Stationarity and Ergodicity)
10 Generalized Least Squares
11 Multicollinearity
12 Endogeneity
13 IV estimation
14 Questions for Review


1 Introduction and Motivation

What is econometrics?
Econometrics = economic statistics/data analysis + economic theory + mathematics
Conceptually:

Data are perceived as realizations of random variables
Parameters are real numbers, not random variables
The joint distributions of the random variables depend on the parameters

General regression equations
Generalization: $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \epsilon_i$ (linear model) with:
$y_i$ = dependent variable (observable)
$x_{ik}$ = explanatory variable (observable)
$\epsilon_i$ = unobservable component
Index of observations i = 1, 2, ..., n (for individuals, time etc.) and regressors k = 1, 2, ..., K

$\underbrace{y_i}_{(1\times1)} = \underbrace{x_i'}_{(1\times K)}\,\underbrace{\beta}_{(K\times1)} + \underbrace{\epsilon_i}_{(1\times1)}$   (1)

$\beta = (\beta_1, ..., \beta_K)'$ and $x_i = (x_{i1}, ..., x_{iK})'$   (2)

Structural parameters are the parameters suggested by economic theory.

The key problem of econometrics: we deal with non-experimental data
Unobservable variables, interdependence, endogeneity, (reversed) causality

1.1 Three motivations for the CLRM

Theory: Glosten/Harris
Notation:

Transaction price: $P_t$
Indicator of transaction type:

$Q_t = \begin{cases} +1 & \text{buyer initiated trade} \\ -1 & \text{seller initiated trade} \end{cases}$   (3)

Trade volume: $v_t$
Drift parameter: $\mu$
Earnings/costs of the market maker: $c$ (operational costs, asymmetric information costs)
Private information: $Q_t z_t$
Public information: $\epsilon_t$
Efficient/fair price: $m_t = \underbrace{\mu + m_{t-1} + \epsilon_t}_{\text{random walk with drift}} + Q_t z_t$ with $z_t = z_0 + z_1 v_t$

Market maker sets:

Sell price (ask): $P_t^a = \mu + m_{t-1} + \epsilon_t + z_t + c$
Buy price (bid): $P_t^b = \mu + m_{t-1} + \epsilon_t - z_t - c$
Spread: the market maker anticipates the price impact

Transactions occur at the ask or the bid price:

$P_t - P_{t-1} = \mu + m_{t-1} + \epsilon_t + Q_t z_t + cQ_t - [m_{t-1} + cQ_{t-1}]$   (4)

$\Delta P_t = \mu + z_0 Q_t + z_1 v_t Q_t + c\,\Delta Q_t + \epsilon_t$   (5)

Observable: $\Delta P_t$, $Q_t$, $v_t$; so $x_i = [1, Q_t, v_t Q_t, \Delta Q_t]'$ and $y_i = \Delta P_t$
Unobservable: $\epsilon_t$
Estimation of the unknown structural parameters $\beta = (\mu, z_0, z_1, c)'$
Theory: Asset pricing
Finance theory: investors are compensated for holding risk; they demand an expected return beyond the risk-free rate.
$x_{t+1}^j$: payoff of risky asset j; $f_{1,t+1}, ..., f_{K,t+1}$ (f: risk factor)
Linear (risk) factor models:

$\underbrace{E(R_{t+1}^{ej})}_{\text{expected excess return of asset j}} = \underbrace{\beta_j'}_{\text{exposure of asset j to factor k risk}}\underbrace{\lambda}_{\text{risk price of factor k risk}}$ with $\beta_j = (\beta_{1j}, ..., \beta_{Kj})'$ and $\lambda = (\lambda_1, ..., \lambda_K)'$

$R_{t+1}^{ej} = \underbrace{R_{t+1}^j}_{\text{gross return}} - \underbrace{R_{t+1}^f}_{\text{risk-free rate}}$   (6)

with gross return $R_{t+1}^j = \frac{x_{t+1}^j}{P_t^j} = \frac{P_{t+1}^j + d_{t+1}^j}{P_t^j}$

Single risk factor: $\beta_j = \frac{Cov(R^{ej}, f)}{Var(f)}$

Some asset pricing models:

CAPM: $f = R^{em} = \underbrace{(R^m - R^f)}_{\text{market risk}}$
Fama-French (FF): $f = (R^{em}, HML, SMB)'$

If $f_1, ..., f_K$ are excess returns themselves, then $\lambda = [E(f_1), ..., E(f_K)]'$
CAPM: $E(R_{t+1}^{ej}) = \beta_j E(R_{t+1}^{em})$ and FF: $E(R_{t+1}^{ej}) = \beta_1^j E(R_{t+1}^{em}) + \beta_2^j E(HML_{t+1}) + \beta_3^j E(SMB_{t+1})$

To estimate the risk loadings $\beta$, we formulate a sampling (regression, compatible) model:

$R_{t+1}^{ej} = \beta_1 R_{t+1}^{em} + \beta_2 HML_{t+1} + \beta_3 SMB_{t+1} + \epsilon_{t+1}^j$   (7)

Assume $E(\epsilon_{t+1}^j | R_{t+1}^{em}, HML_{t+1}, SMB_{t+1}) = 0$ (implies $E(\epsilon_{t+1}^j) = 0$).
This model is compatible with the theoretical model, which becomes clear when you take expectations on both sides.
This sampling model does not contain a constant: $\beta_0^j = 0$ (from theory) = testable restriction
Theory: Mincer equation

$\ln(WAGE_i) = \beta_1 + \beta_2 S_i + \beta_3 TENURE_i + \beta_4 EXPR_i + \epsilon_i$   (8)

Notation:

Logarithm of the wage rate: $\ln(WAGE_i)$
Years of schooling: $S_i$
Experience in the current job: $TENURE_i$
Experience in the labor market: $EXPR_i$


Estimation of the parameters $\beta_k$, where $\beta_2$: return to schooling

Statistical specification
$E(y_i|x_i) = x_i'\beta$ (a linear function of the x's), with $x_i = (x_{i1}, ..., x_{iK})'$ and $\beta = (\beta_1, ..., \beta_K)'$

Marginal effect: $\frac{\partial E(y_i|x_i)}{\partial x_{iK}} = \beta_K$   (9)

Compatible regression model: $y_i = x_i'\beta + \epsilon_i$ with $E(y_i|x_i) = x_i'\beta$. The LTE (law of total expectation) implies:

$E(\epsilon_i|x_i) = 0$ and $Cov(\epsilon_i, x_{ik}) = 0 \;\forall k$   (10)

The specification $E(y_i|x_i) = x_i'\beta$ is ad hoc. Alternative: non-parametric regression (leaves the functional form open).
Justifiable by a normality assumption:

$\begin{pmatrix} Y \\ X \end{pmatrix} \sim BVN\left(\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \sigma_y^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_x^2 \end{pmatrix}\right)$   (11)

$\underbrace{\begin{pmatrix} Y \\ X_1 \\ \vdots \\ X_k \end{pmatrix}}_{(k+1)\times 1} \sim MVN\left(\underbrace{\begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}}_{(k+1)\times 1}, \begin{pmatrix} \sigma_y^2 & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix}\right)$   (12)

with $\mu_x = E(x) = [E(x_1), ..., E(x_k)]'$ and $\Sigma_x = Var(x)$ ($k \times k$)

$E(y|x) = \mu_y + \Sigma_{yx}\Sigma_x^{-1}(x - \mu_x) = \tilde{\alpha} + \tilde{\beta}'x$ (linear conditional mean)
with $\tilde{\alpha} = \mu_y - \tilde{\beta}'\mu_x$ and $\tilde{\beta} = \Sigma_x^{-1}\Sigma_{xy}$
$Var(y|x) = \sigma_y^2 - \Sigma_{yx}\Sigma_x^{-1}\Sigma_{xy}$ (does not depend on x: homoscedasticity)

Rubin's causal model = regression analysis of experiments

STAR experiment = small class experiment
Treatment: binary variable $D_i \in \{0, 1\}$ (class size)
Outcome: $y_i$ (SAT scores)
Does $D_i \to y_i$? (causal effect)
Potential outcomes (one of the two outcomes is hypothetical):

$\begin{cases} y_{1i} & \text{if } D_i = 1 \\ y_{0i} & \text{if } D_i = 0 \end{cases}$   (13)

Actual outcome: observed $y_i = y_{0i} + \underbrace{(y_{1i} - y_{0i})}_{\text{causal effect}}D_i$ (the effect may be identical or different across individuals i)
This model uses causality explicitly, in contrast to the other models.

$y_i = \underbrace{\alpha}_{E(y_{0i})} + \underbrace{\rho}_{y_{1i} - y_{0i}}D_i + \underbrace{\eta_i}_{y_{0i} - E(y_{0i})} + z_i'\gamma$ (STAR controls: gender, race, free lunch)   (14)

$\rho$ is constant across i. Goal: estimate $\rho$.

STAR: experiment → random assignment of i to treatment: $E(y_{0i}|D_i = 1) = E(y_{0i}|D_i = 0)$
In a non-experiment → selection bias: $E(y_{0i}|D_i = 1) - E(y_{0i}|D_i = 0) \neq 0$

$E(y_i|D_i = 1) = \alpha + \rho + \underbrace{E(\eta_i|D_i = 1)}_{E(y_{0i}|D_i=1) - E(y_{0i})}$   (15)

and

$E(y_i|D_i = 0) = \alpha + \underbrace{E(\eta_i|D_i = 0)}_{E(y_{0i}|D_i=0) - E(y_{0i})}$   (16)

$E(y_i|D_i = 1) - E(y_i|D_i = 0) = \rho + \underbrace{E(y_{0i}|D_i = 1) - E(y_{0i}|D_i = 0)}_{\text{selection bias}}$

Another way, apart from experiments, to avoid selection bias is natural (quasi) experiments.

2 The Classical Linear Regression Model (CLRM): Parameter Estimation by OLS

Classical linear regression model:

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \epsilon_i = \underbrace{x_i'}_{(1\times K)}\underbrace{\beta}_{(K\times1)} + \epsilon_i$   (17)

$y_i$ (i = 1, ..., n): dependent variable, observed
$x_i' = (x_{i1}, ..., x_{iK})$: explanatory variables, observed
$\beta' = (\beta_1, ..., \beta_K)$: unknown parameters
$\epsilon_i$: disturbance component, unobserved
$b = (b_1, ..., b_K)'$: estimate of $\beta$
$e_i = y_i - x_i'b$: estimated residual


Based on i.i.d. variables, which refers to the independence of the $\epsilon_i$'s, not the x's.
Preferred technique: least squares (instead of maximum likelihood and method of moments).
Two-sidedness (ambiguity): x's as random variables or as realizations.
For convenience we introduce matrix notation:

$\underbrace{y}_{(n\times1)} = \underbrace{X}_{(n\times K)}\underbrace{\beta}_{(K\times1)} + \underbrace{\epsilon}_{(n\times1)}$   (18)

Constant: $(x_{11}, ..., x_{n1}) = (1, ..., 1)$

Written out extensively, this is a system of linear equations:

$y_1 = \beta_1 + \beta_2 x_{12} + ... + \beta_K x_{1K} + \epsilon_1$   (19)
$y_2 = \beta_1 + \beta_2 x_{22} + ... + \beta_K x_{2K} + \epsilon_2$   (20)
...   (21)
$y_n = \beta_1 + \beta_2 x_{n2} + ... + \beta_K x_{nK} + \epsilon_n$   (22)

OLS estimation in detail:

Estimate $\beta$ by choosing b to minimize the sum of squared residuals:
$S(b) = \sum e_i^2 = \sum(y_i - x_i'b)^2 = \sum(y_i - x_{i1}b_1 - ... - x_{iK}b_K)^2 \to \min_b$

Differentiate with respect to $b_1, ..., b_K$ → FOC:

$\frac{\partial S(b)}{\partial b_1} = -2\sum[y_i - x_i'b] = 0 \Rightarrow \frac{1}{n}\sum e_i = 0$   (24)
$\frac{\partial S(b)}{\partial b_2} = -2\sum[y_i - x_i'b]x_{i2} = 0 \Rightarrow \frac{1}{n}\sum e_i x_{i2} = 0$   (25)
...   (26)
$\frac{\partial S(b)}{\partial b_K} = -2\sum[y_i - x_i'b]x_{iK} = 0 \Rightarrow \frac{1}{n}\sum e_i x_{iK} = 0$   (27)


System of K equations and K unknowns: solve for b (OLS estimator)

Characteristics:
$\bar{e} = 0$
$\widehat{Cov}(e_i, x_{i2}) = 0$, ..., $\widehat{Cov}(e_i, x_{iK}) = 0$
With one regressor (plus constant):
$y_i = \beta_1 + \beta_2 x_{i2} + \epsilon_i$
$b_2 = \frac{\frac{1}{n}\sum y_i x_{i2} - \bar{y}\bar{x}_2}{\frac{1}{n}\sum x_{i2}^2 - \bar{x}_2^2} = \frac{\text{sample cov}}{\text{sample var}}$

The solution is more complicated for K ≥ 2. The system of K equations is solved by matrix algebra:
e = y - Xb; FOC rewritten:

$\sum e_i = 0$   (29)
$\sum x_{i2}e_i = 0$   (30)
...   (31)
$\sum x_{iK}e_i = 0$   (32)
$X'e = 0$   (33)

Extensively:

$\begin{pmatrix} 1 & ... & 1 \\ x_{12} & ... & x_{n2} \\ ... & ... & ... \\ x_{1K} & ... & x_{nK} \end{pmatrix}\begin{pmatrix} e_1 \\ e_2 \\ ... \\ e_n \end{pmatrix} = 0$   (34)

$X'e = X'(y - Xb) = \underbrace{X'y}_{K\times1} - \underbrace{X'X}_{K\times K}\underbrace{b}_{K\times1} = 0$

If X'X has rank K (full rank), then $[X'X]^{-1}$ exists and premultiplying by $[X'X]^{-1}$ gives:
$[X'X]^{-1}X'y - [X'X]^{-1}X'Xb = 0$
$[X'X]^{-1}X'y - Ib = 0$
$b = [X'X]^{-1}X'y$
Alternative notation:

$b = (\frac{1}{n}X'X)^{-1}\frac{1}{n}X'y = \underbrace{(\frac{1}{n}\sum x_ix_i')^{-1}}_{\text{matrix of sample means}}\underbrace{\frac{1}{n}\sum x_iy_i}_{\text{vector of sample means}}$   (35)

Questions:
Properties? Unbiased, efficient, consistent?
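A minimal numerical sketch of the OLS formula $b = (X'X)^{-1}X'y$ in Python/NumPy (simulated data; all variable names and parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 200, 3                      # sample size, number of regressors (incl. constant)
    X = np.column_stack([np.ones(n),   # constant
                         rng.normal(size=(n, K - 1))])
    beta = np.array([1.0, 0.5, -2.0])  # "true" parameters of the simulated DGP
    y = X @ beta + rng.normal(size=n)  # linear model y = X*beta + eps

    b = np.linalg.solve(X.T @ X, X.T @ y)  # OLS: b = (X'X)^{-1} X'y
    e = y - X @ b                          # residuals
    print(b)          # close to beta
    print(X.T @ e)    # FOC: X'e = 0 (up to floating-point error)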


3 Assumptions of the CLRM

The four core assumptions of the CLRM

1.1 Linearity in parameters: $y_i = x_i'\beta + \epsilon_i$
This is not too restrictive, because reformulation is possible, e.g. using ln, quadratics.
1.2 Strict exogeneity: $E(\epsilon_i|X) = E(\epsilon_i|x_{11}, ..., x_{1K}, ..., x_{i1}, ..., x_{iK}, ..., x_{n1}, ..., x_{nK}) = 0$
Implications:
a) $E(\epsilon_i|X) = 0 \Rightarrow E(\epsilon_i|x_{ik}) = 0 \Rightarrow E(\epsilon_i) = 0$ (by LTE)
b) $E(\epsilon_ix_{jk}) = 0 \Rightarrow Cov(\epsilon_i, x_{jk}) = 0 \;\forall i, j, k$ (use LTE, LIE)
→ unconditional moment restrictions (compare to the OLS FOC)
Examples where this may be violated:
$\ln(wage_i) = \beta_1 + \beta_2 S_i + ... + \epsilon_i$, with ability in $\epsilon_i$ (+)
$crime_i = \beta_1 + \beta_2 Police_i + ... + \epsilon_i$ (in district i), with social factors in $\epsilon_i$ (-)
$unempl_i = \beta_1 + \beta_2 Lib_i + ... + \epsilon_i$ (in country i), with a macro shock in $\epsilon_i$ (-)

When $E(\epsilon_i|X) \neq 0$: endogeneity (prevents us from estimating the $\beta$'s consistently).


Discussion: endogeneity and sample selection bias
Rubin's causal model: $y_i = \alpha + \rho D_i + \eta_i$; $E(y_i|D_i = 1) - E(y_i|D_i = 0) = \rho + \underbrace{E(y_{0i}|D_i = 1) - E(y_{0i}|D_i = 0)}_{\text{selection bias}}$

$D_i$ and $\underbrace{\{y_{0i}, y_{1i}\}}_{\text{partly unobservable}}$ assumed independent: $\{y_{0i}, y_{1i}\} \perp D_i$
With independence: $E(y_{0i}|D_i = 1) = E(y_{0i}|D_i = 0) = E(y_{0i})$, so that $E(\eta_i|D_i) = E(\eta_i) = 0$ ($\eta_i$ plays the role of $\epsilon_i$, $D_i$ the role of $x_{ik}$)

Independence normally does not hold, because of endogeneity and sample selection bias.
Conditional independence assumption (CIA): $\{y_{0i}, y_{1i}\} \perp D_i | x_i$ → the selection bias vanishes conditioning on $x_i$
How do we operationalize the CIA? By adding control variables to the right-hand side of the equation.
Example: Mincer-type regression
$\ln(wage_i) = \beta_1 + \beta_2 Highschool_i + \underbrace{\beta_3 Ten_i + \beta_4 Exp_i + \beta_5 Ability_i + \beta_6 Family_i}_{\text{control variables}} + \epsilon_i$
We assume $\epsilon_i \perp Highschool_i \,|\, Ability, Family, ...$, which justifies $E(\epsilon_i|Highschool, ...) = 0$
The CIA justifies the inclusion of control variables and $E(\epsilon_i|x) = 0$
Matching = sorting individuals into groups and then comparing the outcomes
1.3 No exact multicollinearity: $P(rank(X) = K) = 1$ (rank(X) is a Bernoulli-distributed random variable)
No linear dependencies in the data matrix, otherwise $(X'X)^{-1}$ does not exist.
Does not refer to a (merely) high correlation between the X's.
1.4 Spherical disturbances: $Var(\epsilon_i|X) = E(\epsilon_i^2|X) = Var(\epsilon_i) = \sigma^2 \;\forall i$ → homoscedasticity
$Cov(\epsilon_i, \epsilon_j|X) = E(\epsilon_i\epsilon_j|X) = E(\epsilon_i\epsilon_j) = 0$ → no serial correlation

$E[\epsilon\epsilon'|X] = \begin{pmatrix} E(\epsilon_1^2|X) & ... & E(\epsilon_1\epsilon_n|X) \\ ... & ... & ... \\ E(\epsilon_n\epsilon_1|X) & ... & E(\epsilon_n^2|X) \end{pmatrix} = \begin{pmatrix} \sigma^2 & ... & 0 \\ ... & ... & ... \\ 0 & ... & \sigma^2 \end{pmatrix} = Cov(\epsilon|X)$   (36)

By LTE: $E(\epsilon\epsilon') = Var(\epsilon) = \sigma^2 I_n$
$Var(\epsilon_i) = E(\epsilon_i^2) = \sigma^2$; $Cov(\epsilon_i, \epsilon_j) = E(\epsilon_i\epsilon_j) = 0 \;\forall i \neq j$

Interpreting the parameters of different types of linear equations
Linear model: $y_i = \beta_1 + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \epsilon_i$. A one-unit increase in the independent variable $x_{iK}$ changes the dependent variable by $\beta_K$ units.
Semi-log form: $\ln(y_i) = \beta_1 + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \epsilon_i$. A one-unit increase in the independent variable changes the dependent variable by approximately $100\cdot\beta_k$ percent.
Log-linear model: $\ln(y_i) = \beta_1\ln(x_{i1}) + \beta_2\ln(x_{i2}) + ... + \beta_K\ln(x_{iK}) + \epsilon_i$. A one-percent increase in $x_{iK}$ changes the dependent variable $y_i$ by approximately $\beta_k$ percent.
e.g. $y_i = A\,x_{i1}^{\alpha}x_{i2}^{\beta}\epsilon_i$ (Cobb-Douglas) → $\ln y_i = \ln A + \alpha\ln x_{i1} + \beta\ln x_{i2} + \ln\epsilon_i$


Before the OLS proofs, a useful tool: the law of total expectation (LTE)

$E(y|X = x) = \int y\,f_{y|x}(y|x)\,dy = \int y\,\frac{f_{xy}(x,y)}{f_x(x)}\,dy$   (37)

Using the random variable x:

$E(y|x) = \underbrace{\int y\,\frac{f_{xy}(x,y)}{f_x(x)}\,dy}_{g(x)}$   (38)

$E_x(g(x)) = \int g(x)f_x(x)\,dx = \int\left[\int y\,\frac{f_{xy}(x,y)}{f_x(x)}\,dy\right]f_x(x)\,dx = \int\int y\,f_{xy}(x,y)\,dx\,dy = \int y\,f_y(y)\,dy = E(y)$   (39)

$E_x[\underbrace{E_{y|x}[y|x]}_{\text{measurable function of x}}] = E_y(y)$, i.e. $E[E(y|x)] = E(y)$   (40)

Notes:

works when X is a vector
forecasting interpretation

LTE extension: law of iterated expectations (LIE)

$E_{z|x}[\underbrace{E_{y|x,z}(y|x,z)}_{\text{measurable function of x,z}}|x] = E(y|x)$   (41)

Other important laws

Double Expectation Theorem (DET):

$E_x[E_{y|x}(g(y)|x)] = E_y(g(y))$   (42)

Generalized DET:

$E_x[E_{y|x}(g(x,y))|x] = E_{x,y}(g(x,y))$   (43)

Linearity of conditional expectations:

$E_{y|x}[g(x)y|x] = g(x)E_{y|x}[y|x]$   (44)


4 Finite sample properties of the OLS estimator

Finite sample properties of $b = (X'X)^{-1}X'y$

1. With 1.1-1.3 and holding for any sample size: $E[b|X] = \beta$, and by LTE $E[b] = \beta$ → unbiasedness
2. With 1.1-1.4: $Var[b|X] = \sigma^2[X'X]^{-1}$ (important for testing; depends on the data) → conditional variance
3. With 1.1-1.4: OLS is efficient: $Var[b|X] \leq Var[\hat{\beta}|X]$ for any linear unbiased $\hat{\beta}$ → Gauss-Markov theorem
4. OLS is BLUE

Starting point: the sampling error

$b - \beta = [X'X]^{-1}X'y - \beta = [X'X]^{-1}X'[X\beta + \epsilon] - \beta = [X'X]^{-1}X'\epsilon$   (45)

Defining $A = [X'X]^{-1}X'$, so that:

$b - \beta = A\epsilon = \begin{pmatrix} a_{11} & ... & a_{1n} \\ ... & ... & ... \\ a_{K1} & ... & a_{Kn} \end{pmatrix}\begin{pmatrix} \epsilon_1 \\ ... \\ \epsilon_n \end{pmatrix}$   (46)

where, conditional on X, A is treated as a constant.
Derive unbiasedness
Step 1:

$E(b - \beta|X) = E(A\epsilon|X) = A\underbrace{E(\epsilon|X)}_{=0\ \text{(assumption 1.2)}} = 0$   (47)

Step 2: b is conditionally unbiased

$E(b - \beta|X) = E(b|X) - E(\beta|X) = E(b|X) - \beta = 0 \;\Rightarrow\; E(b|X) = \beta$   (48)

Step 3: OLS is unbiased

$E_X[E(b|X)] = E(b) = \beta$ (by LTE and Step 2)   (49)


Derive the conditional variance

$Var(b|X) = Var(b - \beta|X) = Var(A\epsilon|X) = A\,Var(\epsilon|X)\,A' = A\,E(\epsilon\epsilon'|X)\,A' = A\,\sigma^2 I_n\,A' = \sigma^2 AA'$   (50)

Using $[BA]' = A'B'$ and inserting for A:

$Var(b|X) = \sigma^2[X'X]^{-1}X'X[X'X]^{-1} = \sigma^2[X'X]^{-1}$   (51)

And we have:

$Var(b|X) = \begin{pmatrix} Var(b_1|X) & Cov(b_2, b_1|X) & ... \\ Cov(b_1, b_2|X) & ... & ... \\ ... & ... & Var(b_K|X) \end{pmatrix}$   (52)

Derive the Gauss-Markov theorem
Claim: $Var(\hat{\beta}|X) \geq Var(b|X)$, meaning that the element-wise difference $Var(\hat{\beta}|X) - Var(b|X)$ is positive semi-definite.
Insertion:
A and B: separate matrices of the same size
$A \geq B$ if A - B is positive semi-definite
$\underbrace{C}_{k\times k}$ is p.s.d. if $\underbrace{x'}_{1\times k}\underbrace{C}_{k\times k}\underbrace{x}_{k\times1} \geq 0$ for all $x \neq 0$

$a'[Var(\hat{\beta}|X) - Var(b|X)]a \geq 0 \;\forall a \neq 0$   (53)

If this is true for all a, it is true in particular for a = [1, 0, ..., 0]', which gives $Var(\hat{\beta}_1|X) \geq Var(b_1|X)$.
This works also for $b_2, ..., b_K$ with the respective a. This means that every conditional variance of $\hat{\beta}$ is at least as big as the respective conditional variance of b.
The proof:
Note that b = Ay.
$\hat{\beta}$ is linear in y and unbiased, with $\hat{\beta} = Cy$ and C a function of X.
Define D = C - A, so that C = D + A.
Step 1:

$\hat{\beta} = (D + A)y = Dy + Ay = D(X\beta + \epsilon) + b = DX\beta + D\epsilon + b$   (54)

Step 2:

$E(\hat{\beta}|X) = DX\beta + E(D\epsilon|X) + E(b|X) = DX\beta + D\underbrace{E(\epsilon|X)}_{=0} + \beta = DX\beta + \beta$   (55)

As we are working with an unbiased estimator: DX = 0.


Step 3: Going back to Step 1 (with DX = 0):

$\hat{\beta} = D\epsilon + b$   (56)
$\underbrace{\hat{\beta} - \beta}_{\text{s.e. of }\hat{\beta}} = D\epsilon + \underbrace{b - \beta}_{\text{s.e. of }b\,=\,A\epsilon} = [D + A]\epsilon$   (57)

Step 4:

$Var(\hat{\beta}|X) = Var(\hat{\beta} - \beta|X) = Var[(D + A)\epsilon|X] = [D + A]Var(\epsilon|X)[D + A]' = \sigma^2[D + A][D + A]'$   (58)
$= \sigma^2[DD' + AD' + DA' + AA']$   (59)

We have:
$AA' = [X'X]^{-1}X'X[X'X]^{-1} = [X'X]^{-1}$
$AD' = [X'X]^{-1}X'D' = [X'X]^{-1}[DX]' = 0$ (as DX = 0)
$DA' = D[[X'X]^{-1}X']' = DX[X'X]^{-1} = 0$ (as DX = 0)

$Var(\hat{\beta}|X) = \sigma^2[DD' + [X'X]^{-1}]$   (60)

Step 5: It must therefore hold that

$a'[\sigma^2[DD' + [X'X]^{-1}] - \sigma^2[X'X]^{-1}]a \geq 0$   (61)
$a'[\sigma^2 DD']a \geq 0$   (62)

We have to show that $a'DD'a \geq 0 \;\forall a$, i.e. that DD' is p.s.d.
With z = D'a: $z'z = \sum z_i^2 \geq 0 \;\forall a$, so the above is true.
The OLS estimator is BLUE:

OLS is linear: holds under assumption 1.1
OLS is unbiased: holds under assumptions 1.1-1.3
OLS is the best (minimum-variance) linear unbiased estimator: holds under the Gauss-Markov theorem


OLS anatomy

$\hat{y}_i = x_i'b$; $\hat{y} = Xb$; $e = y - \hat{y}$
$P = X[X'X]^{-1}X'$: projection matrix; use: $\hat{y} = Py = Xb$ (lies in the space spanned by the columns of X); P is symmetric and idempotent (PP = P)
$M = I_n - P$: residual maker; use: e = My; M is symmetric and idempotent
$y = Py + My$ [X'e = 0 from the FOC] (e is orthogonal to the columns of X, which span the space) → $\hat{y}'e = 0$
$e = M\epsilon \Rightarrow e'e = \sum e_i^2 = \epsilon'M\epsilon$

With a constant:
$\sum e_i = 0$ [FOC]
$\bar{y} = \bar{x}'b$ (for $\bar{x}$ and $\bar{y}$ vectors of sample means)
$\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{n}\sum\hat{y}_i = \bar{\hat{y}}$
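A small sketch (reusing the NumPy setup from the earlier OLS example; the DGP is illustrative) that verifies this anatomy numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    n, K = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

    P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix
    M = np.eye(n) - P                      # residual maker
    y_hat, e = P @ y, M @ y

    print(np.allclose(P @ P, P), np.allclose(M @ M, M))  # idempotent: True True
    print(np.isclose(y_hat @ e, 0))   # fitted values orthogonal to residuals
    print(np.isclose(e.sum(), 0))     # residuals sum to zero (constant included)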

5 Hypothesis Testing under Normality

Economic theory provides hypotheses about parameters.

E.g. the asset pricing example: $R_t^e = \alpha + \beta R_t^{em} + \epsilon_t$; hypothesis implied by the APT: $\alpha = 0$
If the theory is right → testable implications.
In order to test we need the distribution of b; hypotheses can't be tested without distributional assumptions about $\epsilon$.
In addition to 1.1-1.4 we assume that $\epsilon_1, ..., \epsilon_n|X$ are normally distributed:
Distributional assumption (assumption 1.5): normality of the conditional distribution: $\epsilon|X \sim MVN(E(\epsilon|X) = 0, Var(\epsilon|X) = \sigma^2 I_n)$ (can be dispensed with later).
$\epsilon_i|X \sim N(0, \sigma^2)$
Useful tools/results:

Fact 1: If $x_i \sim N(0,1)$, i = 1, ..., m, and the $x_i$ are independent, then $y = \sum_{i=1}^m x_i^2 \sim \chi^2(m)$
If $x \sim N(0,1)$ and $y \sim \chi^2(m)$ and x, y are independent: $t = \frac{x}{\sqrt{y/m}} \sim t(m)$
Fact 2 omitted
Fact 3: If x and y are independent, so are f(x) and g(y)
Fact 4: Let $x \sim MVN(\mu, \Sigma)$ with $\Sigma$ nonsingular: $(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi^2$ (a random variable)
Fact 5: $W \sim \chi^2(a)$, $g \sim \chi^2(b)$ and W, g independent: $\frac{W/a}{g/b} \sim F(a, b)$


Fact 6: $x \sim MVN(\mu, \Sigma)$ and y = c + Ax with c, A a non-random vector/matrix → $y \sim MVN(c + A\mu, A\Sigma A')$

We use $b - \beta = \underbrace{(X'X)^{-1}X'\epsilon}_{\text{sampling error}}$:
Assuming $\epsilon|X \sim MVN(0, \sigma^2 I_n)$ from 1.5 and using Fact 6:

$b - \beta|X \sim MVN((X'X)^{-1}X'E(\epsilon|X),\;(X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1})$   (64)
$b - \beta|X \sim MVN(0, \sigma^2(X'X)^{-1})$   (65)
$b - \beta|X \sim MVN(E(b - \beta|X), Var(b - \beta|X))$   (66)

Note that $Var(b_k|X) = \sigma^2((X'X)^{-1})_{kk}$

By the way: $e|X \sim MVN(0, \sigma^2 M)$ — show using $e = M\epsilon$ and M symmetric and idempotent.
Testing hypotheses about individual parameters (t-test)
Null hypothesis: $H_0: \beta_k = \bar{\beta}_k$ (a hypothesized value, a real number). $\bar{\beta}_k$ is often assumed to be 0. It is suggested by theory.
Alternative hypothesis: $H_A: \beta_k \neq \bar{\beta}_k$
We control the type 1 error (rejection when H0 is right) by fixing a significance level. We then aim for a high power (rejection when false), i.e. a low type 2 error (no rejection when false).
Construction of the test statistic:
By Fact 6: $b_k - \beta_k \sim N(0, \sigma^2((X'X)^{-1})_{kk})$, where $((X'X)^{-1})_{kk}$ is the k-th row, k-th column element of $(X'X)^{-1}$.
The OLS estimator is conditionally normally distributed if $\epsilon|X$ is multivariate normal:
$b_k|X \sim N(\beta_k, \sigma^2((X'X)^{-1})_{kk})$
Under $H_0: \beta_k = \bar{\beta}_k$:
$b_k|X \sim N(\bar{\beta}_k, \sigma^2((X'X)^{-1})_{kk})$
If H0 is true, $E(b_k) = \bar{\beta}_k$:

$z_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\sigma^2((X'X)^{-1})_{kk}}} \sim N(0, 1)$   (67)

Standard normally distributed under the null hypothesis
We don't say anything about the distribution under the alternative hypothesis
The distribution under H0 does not depend on X
The value of the test statistic depends on X and on $\sigma^2$, whereas $\sigma^2$ is unknown


Call $s^2$ an unbiased estimate of $\sigma^2$:

$t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{s^2((X'X)^{-1})_{kk}}} \sim t(n - K)$, or approximately N(0, 1) for (n - K) > 30   (68)

The nuisance parameter $\sigma^2$ can be estimated:

$\sigma^2 = E(\epsilon_i^2|X) = Var(\epsilon_i|X) = E(\epsilon_i^2) = Var(\epsilon_i)$   (69)

We don't know $\epsilon_i$, but we use the estimate $e_i = y_i - x_i'b$:

$\hat{\sigma}^2 = \widehat{Var}(e_i) = \frac{1}{n}\sum(e_i - \bar{e})^2 = \frac{1}{n}\sum e_i^2 = \frac{1}{n}e'e$   (70)

$\hat{\sigma}^2$ is a biased estimator:

$E(\hat{\sigma}^2|X) = \frac{n - K}{n}\sigma^2$   (71)

Proof: in Hayashi p.30-31. Use:

$\sum e_i^2 = e'e = \epsilon'M\epsilon$
trace(A) = $\sum a_{ii}$ (sum of the diagonal of the matrix)

Note that $\hat{\sigma}^2$ is asymptotically unbiased, since for $n \to \infty$:

$\lim_{n\to\infty}\frac{n - K}{n} = 1$   (72)

Having an estimator whose bias and variance vanish as $n \to \infty$ is what we call a consistent estimator.
An unbiased estimator of $\sigma^2$ (see Hayashi p.30-31):
For $s^2 = \frac{1}{n-K}\sum e_i^2 = \frac{1}{n-K}e'e$ we get an unbiased estimator:

$E(s^2|X) = \frac{1}{n-K}E(e'e|X) = \sigma^2$   (73)
$E(E(s^2|X)) = E(s^2) = \sigma^2$   (74)

Using this provides an unbiased estimator of $Var(b|X) = \sigma^2(X'X)^{-1}$:

$\widehat{Var}(b|X) = s^2(X'X)^{-1}$   (75)

t-statistic under H0:

$t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{(\widehat{Var}(b|X))_{kk}}} = \frac{b_k - \bar{\beta}_k}{SE(b_k)} \sim t(n - K)$   (76)


Hayashi p.36-37 shows that with the unbiased estimator of $\sigma^2$ we have to replace the distribution of the t-stat by the t-distribution. Sketch of the proof:

$t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}\sqrt{\frac{\sigma^2}{s^2}} = \frac{z_k}{\sqrt{e'e/(\sigma^2(n - K))}} = \frac{z_k}{\sqrt{q/(n - K)}}$   (77)

We need to show that $q = e'e/\sigma^2 \sim \chi^2(n - K)$ and that q and $z_k$ are independent.

Decision rule for the t-test

1. State your null hypothesis: $H_0: \beta_k = \bar{\beta}_k$, often with $\bar{\beta}_k = 0$; $H_A: \beta_k \neq \bar{\beta}_k$
2. Given $\bar{\beta}_k$, the OLS estimate $b_k$ and $s^2$, compute $t_k = \frac{b_k - \bar{\beta}_k}{SE(b_k)}$
3. Fix the significance level $\alpha$ of the two-sided test
4. Fix non-rejection and rejection regions
5. Decision

Remark:
$\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}$: standard deviation of $b_k|X$
$\sqrt{s^2[(X'X)^{-1}]_{kk}}$: standard error of $b_k|X$
We want a test that keeps its size and at the same time has a high power.
Find critical values such that $Prob[\underbrace{-t_{\alpha/2}(n-K)}_{\text{lower crit. value (real number)}} < \underbrace{t(n-K)}_{\text{r.v.: t-stat}} < \underbrace{t_{\alpha/2}(n-K)}_{\text{upper crit. value (real number)}}] = 1 - \alpha$

Example: n - K = 30, $\alpha$ = 0.05

$t_{\alpha/2}(n-K) = t_{.975}(30) = 2.042$
The probability that the t-stat takes on a value of 2.042 or smaller is 97.5%.
$-t_{\alpha/2}(n-K) = t_{.025}(30) = -2.042$
The probability that the t-stat takes on a value of -2.042 or smaller is 2.5%.
Conventional $\alpha$'s: 0.01; 0.05; 0.1

Ways to report results:
Use stars
Use p-values: compute the quantile taking your t-stat as the x (interpretation!)
Use the t-stat
Report standard errors
Report confidence intervals (interpretation!)
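A sketch of steps 1-5 in Python (scipy.stats; the estimate, standard error and hypothesized value are illustrative numbers, not from any dataset):

    from scipy import stats

    b_k, se_k, beta_bar = 0.85, 0.30, 0.0   # illustrative estimate, std. error, H0 value
    n, K, alpha = 34, 4, 0.05

    t_k = (b_k - beta_bar) / se_k                        # t-statistic under H0
    crit = stats.t.ppf(1 - alpha / 2, df=n - K)          # critical value t_{alpha/2}(n-K)
    p_value = 2 * (1 - stats.t.cdf(abs(t_k), df=n - K))  # two-sided p-value

    print(f"t = {t_k:.3f}, critical value = {crit:.3f}, p = {p_value:.3f}")
    print("reject H0" if abs(t_k) > crit else "do not reject H0")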


F-test/Wald test
Model more complex hypotheses (linear hypotheses) = testing joint hypotheses.
Example 1: Glosten/Harris model
$\Delta P_i = \mu + cQ_i + z_0 Q_i + z_1 v_i Q_i + \epsilon_i$
Hypothesis: no informational content of trades
$H_0: z_0 = 0$ and $z_1 = 0$; $H_A: z_0 \neq 0$ or $z_1 \neq 0$
Example 2: Mincer equation
$\ln(wage_i) = \beta_1 + \beta_2 S_i + \beta_3 tenure_i + \beta_4 exper_i + \epsilon_i$
Hypothesis: one more year of schooling has the same effect as one more year of tenure, AND experience has no effect
$H_0: \beta_2 = \beta_3$ and $\beta_4 = 0$; $H_A: \beta_2 \neq \beta_3$ or $\beta_4 \neq 0$
For the F-test, write the null hypothesis as a system of linear equations:

$\underbrace{R}_{\#r\times K}\underbrace{\beta}_{K\times1} = \underbrace{r}_{\#r\times1}$   (78)

with:
#r = number of restrictions
R = matrix of real numbers
r = vector of real numbers
Example 1:

$\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \mu \\ c \\ z_0 \\ z_1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$   (79)

Example 2:

$\begin{pmatrix} 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$   (80)

Use Fact 6: Y = AZ → $Y \sim MVN(A\mu, A\Sigma A')$

In our context:
Replace $\beta = (\beta_1, ..., \beta_K)'$ by the estimator $b = (b_1, ..., b_K)'$:
$Rb = \tilde{r}$ (b is conditionally normally distributed under 1.1-1.5)
For the Wald test you only need the unconstrained parameter estimates.
For the other two principles, you need both the constrained and the unconstrained parameter estimates.


The Wald test keeps its size and gives the highest power.
The restrictions are fixed by H0.
Definition of the F-test statistic:
$E(\tilde{r}|X) = R\,E(b|X) = R\beta$ (under 1.1-1.3)
$Var(\tilde{r}|X) = R\,Var(b|X)\,R' = R\sigma^2(X'X)^{-1}R'$ (under 1.1-1.4)
$\tilde{r}|X = Rb|X \sim MVN(R\beta, R\sigma^2(X'X)^{-1}R')$
Using Fact 4 to construct the test for:
$H_0: R\beta = r$
$H_A: R\beta \neq r$
Under the hypothesis:
E(Rb|X) = r (the hypothesized value)
$Var(Rb|X) = R\sigma^2(X'X)^{-1}R'$ (unaffected by the hypothesis)
$Rb|X \sim MVN(r, R\sigma^2(X'X)^{-1}R')$ (this is where the hypothesis goes in)
Problem: Rb is a random vector and its distribution depends on X. We want a random variable whose distribution under H0 does not depend on X.
With Fact 4: $(Rb - E(Rb|X))'(Var(Rb|X))^{-1}(Rb - E(Rb|X))$
This leads to the Wald statistic:

$(Rb - r)'[R\sigma^2(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2(\#r)$   (81)

The smallest value of this statistic is zero (which is the case when the restrictions are perfectly met and the parentheses are zero). Thus we always have a one-sided test here.
Replace $\sigma^2$ by its unbiased estimate $s^2 = \frac{1}{n-K}\sum e_i^2 = \frac{1}{n-K}e'e$ and divide by #r (to find the distribution, using Fact 5):

$F = (Rb - r)'[Rs^2(X'X)^{-1}R']^{-1}(Rb - r)/\#r \sim F(\#r, n - K)$   (82)

$F = \frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/\#r}{(e'e)/(n - K)}$   (83)
$= (Rb - r)'[R\,\widehat{Var}(b|X)\,R']^{-1}(Rb - r)/\#r \sim F(\#r, n - K)$   (84)

The F-test is one-sided.
For applied work: as $n \to \infty$, $s^2$ (and $\hat{\sigma}^2$) get close to $\sigma^2$, so we can use the $\chi^2$ approximation.

Decision rule for the Wald test

1. Specify H0 in the form $R\beta = r$, and $H_A: R\beta \neq r$
2. Calculate the F-statistic
3. Look up the entry in the table of the F-distribution for #r and n - K at the given significance level
4. The null is not rejected at significance level $\alpha$ if F is less than $F_\alpha(\#r, n - K)$
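A sketch of the F-statistic (84) in NumPy, testing H0: $\beta_2 = \beta_3$ and $\beta_4 = 0$ as in the Mincer example (simulated data; the DGP is an illustrative assumption under which H0 is true):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, K = 120, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    y = X @ np.array([1.0, 0.3, 0.3, 0.0]) + rng.normal(size=n)  # H0 holds in this DGP

    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - K)                       # unbiased estimate of sigma^2
    V = s2 * np.linalg.inv(X.T @ X)            # estimated Var(b|X)

    R = np.array([[0, 1, -1, 0],               # beta_2 = beta_3
                  [0, 0,  0, 1]])              # beta_4 = 0
    r = np.zeros(2)
    num_r = R.shape[0]                         # number of restrictions (#r)

    d = R @ b - r
    F = d @ np.linalg.solve(R @ V @ R.T, d) / num_r
    p = 1 - stats.f.cdf(F, num_r, n - K)
    print(f"F = {F:.3f}, p = {p:.3f}")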
Alternative representation of the Wald/F-statistic

Minimization of the unrestricted sum of squared residuals: $\min\sum(y_i - x_i'b)^2 \to SSR_U$
Minimization of the restricted sum of squared residuals (constraints imposed): $\min\sum(y_i - x_i'\tilde{b})^2 \to SSR_R$
$SSR_U \leq SSR_R$ (always!)

$F\text{-ratio}: \;F = \frac{(SSR_R - SSR_U)/\#r}{SSR_U/(n - K)}$   (85)

This F-ratio is equivalent to the Wald ratio. Problem: for the F-ratio we need two estimates of b.
This is also known as the likelihood-ratio principle, in contrast to the Wald principle.
(Third principle: only using restricted estimates → Lagrange multiplier principle)

6 Confidence intervals

Duality of t-test and confidence interval

Probability of non-rejection for the t-test:

$P(\underbrace{-t_{\alpha/2}(n-K) \leq t_k \leq t_{\alpha/2}(n-K)}_{\text{do not reject if this event occurs}}) = \underbrace{1 - \alpha}_{\text{if H0 is true}}$   (86)

($t_k$ is a random variable)

In $(1-\alpha)\cdot100$% of our samples, $t_k$ lies inside the critical values.
Rewrite (86):

$P(b_k - SE(b_k)t_{\alpha/2}(n-K) \leq \bar{\beta}_k \leq b_k + SE(b_k)t_{\alpha/2}(n-K)) = 1 - \alpha$   (87)

($b_k$ and $SE(b_k)$ are random variables)
The confidence interval
Confidence interval for $\beta_k$ (if H0 is true):

$P(b_k - SE(b_k)t_{\alpha/2}(n-K) \leq \beta_k \leq b_k + SE(b_k)t_{\alpha/2}(n-K)) = 1 - \alpha$   (88)

The values inside the bounds of the confidence interval are the values for which you cannot reject the null hypothesis.
In $(1-\alpha)\cdot100$% of our samples, $\beta_k$ would lie inside our bounds (the bounds would enclose $\beta_k$).


The confidence bounds are random variables!

$b_k - SE(b_k)t_{\alpha/2}(n-K)$: lower bound of the 1-$\alpha$ confidence interval
$b_k + SE(b_k)t_{\alpha/2}(n-K)$: upper bound of the 1-$\alpha$ confidence interval

Wrong interpretation: the true parameter $\beta_k$ lies with probability 1-$\alpha$ within the bounds of the confidence interval. Problem: the confidence bounds are not fixed; they are random!
H0 is rejected at significance level $\alpha$ if the hypothesized value does not lie within the bounds of the 1-$\alpha$ interval.
[Figure omitted]
Count the number of times that $\beta_k$ is inside the confidence interval. Over many samples, this relative frequency approaches the coverage probability 1-$\alpha$. ($\beta_k$ not inside the confidence interval is equivalent to the event that we reject H0 using the t-statistic.)
This works the same way if $\beta_k \neq \bar{\beta}_k$.
On the correct interpretation of confidence intervals, given:

b
$s^2$ and $se(b_k)$
$t_{\alpha/2}(n-K)$
$(b_k - \bar{\beta}_k)/se(b_k)$ (reject if outside the bounds of the critical values)

Rephrasing: for which values $\bar{\beta}_k$ do we (not) reject?
$b_k - se(b_k)\,t_{\alpha/2}(n-K) < \bar{\beta}_k < b_k + se(b_k)\,t_{\alpha/2}(n-K)$ → do not reject; otherwise → reject
We want small confidence intervals, because they allow us to reject hypotheses (narrow range = high power). We achieve this by increasing n (decreasing the standard error) or by increasing $\alpha$.

6.1 Simulation

Frequentists work with the idea that we can conduct experiments over and over again.
To avoid type 2 errors we need smaller standard errors (a narrower distribution), and thus we have to increase the sample size.
We run a simulation with multiple draws. To check unbiasedness, we compute the mean over all the draws and compare it to the true parameters. To get closer, we have to increase the number of draws.
Increasing n leads to a smaller standard error.
Increasing the number of draws leads to more precise estimates of the sampling distribution, but the standard error doesn't decrease.


You can use a confidence interval to determine the power of a test: determine in how many % of the cases a wrong parameter value lies inside the confidence interval (for the true parameter this should be 1-$\alpha$). This is the power of the test. To increase it, increase the sample size.
The closer the hypothesized value and the true value are, the lower the power of the test.
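A coverage simulation in this spirit (a sketch; the DGP and the number of replications are arbitrary choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, beta, alpha, draws = 50, 0.5, 0.05, 5000
    hits = 0
    for _ in range(draws):
        x = rng.normal(size=n)
        y = 1.0 + beta * x + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        s2 = e @ e / (n - 2)
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
        if b[1] - t_crit * se <= beta <= b[1] + t_crit * se:
            hits += 1
    print(hits / draws)  # close to 1 - alpha = 0.95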

7 Goodness-of-fit measures

Needed when comparing conflicting equations (theories).

Uncentered $R^2$: $R^2_{uc}$
Variability of $y_i$: $\sum y_i^2 = y'y$
Decomposition of y'y:
$y'y = (Xb + e)'(Xb + e) = (\hat{y} + e)'(\hat{y} + e) = \hat{y}'\hat{y} + 2e'\hat{y} + e'e = \underbrace{\hat{y}'\hat{y}}_{\text{explained variation}} + \underbrace{e'e}_{\text{unexplained variation (SSR)}}$
(since $e'\hat{y} = (\hat{y}'e)' = \hat{y}'e = (Xb)'e = b'X'e = 0$, as X'e = 0)

$R^2_{uc} = \frac{\hat{y}'\hat{y}}{y'y}\cdot100\%$ (% of explained variation) $= 1 - \frac{e'e}{y'y} = 1 - \frac{\sum e_i^2}{\sum y_i^2}$ (one minus the % of unexplained variation)

A good model explains much, and therefore the residual variation is very small compared to the explained variation.

Centered $R^2$: $R^2_c$
Use the centered $R^2$ if there is a constant in the model ($x_{i1} = 1$).
Variance of $y_i$: $\frac{1}{n}\sum(y_i - \bar{y})^2$
Decomposition: $\frac{1}{n}\sum(y_i - \bar{y})^2 = \underbrace{\frac{1}{n}\sum(\hat{y}_i - \bar{y})^2}_{\text{variance of predicted values}} + \underbrace{\frac{1}{n}\sum e_i^2}_{\text{SSR}/n}$

Proof:
$\frac{1}{n}\sum(y_i - \bar{y})^2 = \frac{1}{n}\sum(\hat{y}_i + e_i - \bar{y})^2 = \frac{1}{n}[\sum(\hat{y}_i - \bar{y})^2 + \sum e_i^2 + 2\sum(\hat{y}_i - \bar{y})e_i]$
with $\sum(\hat{y}_i - \bar{y})e_i = \underbrace{\sum\hat{y}_ie_i}_{0} - \bar{y}\underbrace{\sum e_i}_{0} = 0$

$R^2_c = \frac{\frac{1}{n}\sum(\hat{y}_i - \bar{y})^2}{\frac{1}{n}\sum(y_i - \bar{y})^2} = 1 - \frac{\frac{1}{n}\sum e_i^2}{\frac{1}{n}\sum(y_i - \bar{y})^2}$

Both uncentered and centered $R^2$ lie in the interval [0, 1], but they describe different models. They are not comparable.
The centered $R^2$ of a model with only a constant is 0, whereas the uncentered $R^2$ is not 0 as long as the constant is $\neq 0$ (but then we explain only the level of y and not its variation). Without a constant the centered $R^2$ cannot be compared, because $\sum e_i = 0$ only holds with a constant.

$R^2$: coefficient of determination; $= [corr(y, \hat{y})]^2$

Explanatory power of the regression beyond the constant:
The test that $\beta_2 = ... = \beta_K = 0$ is the same as the test that $R^2 = 0$ ($R^2$ is a random variable/estimate).
Overall F-test:
$SSR_R$: $y_i = \beta_0 + \epsilon_i \to \sum(y_i - \bar{y})^2$
$SSR_{UR}$: $y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik} \to \sum e_i^2$
$F = \frac{(SSR_R - SSR_{UR})/k}{SSR_{UR}/(n - k - 1)}$

Trick: increase $R^2$ by adding more regressors (k).

Model comparison:

Conduct an F-test between the parsimonious and the extended model
Use model selection criteria, which give a penalty for parameterization

Selection criterion:

$R^2_{adj} = 1 - \frac{SSR/(n-k)}{SST/(n-1)} = 1 - \underbrace{\frac{n-1}{n-k}}_{>1\ \text{for}\ k>1}\frac{SSR}{SST}$   (89)

(can become negative)

Alternative model selection criteria
Implement Occam's razor (the simpler explanation, i.e. the parsimonious model, tends to be the right one):
Akaike criterion (from information theory): $\log[SSR/n] + 2k/n$ (minimize; can be negative)
Schwarz criterion (from Bayesian theory): $\log[SSR/n] + k\log(n)/n$ (minimize; can be negative)
Schwarz tends to penalize more heavily for larger n (favors parsimonious models).
No single criterion is favored (all three are generally accepted), but one has to argue why one is used.
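A sketch computing these measures for a fitted model (continuing the NumPy conventions above; the function name is illustrative and a constant is assumed to be included in X):

    import numpy as np

    def goodness_of_fit(y, X, b):
        """Centered R^2, adjusted R^2, Akaike and Schwarz criteria for an OLS fit."""
        n, k = X.shape
        e = y - X @ b                       # residuals
        ssr = e @ e                         # sum of squared residuals
        sst = np.sum((y - y.mean()) ** 2)   # total (centered) variation
        r2 = 1 - ssr / sst
        r2_adj = 1 - (ssr / (n - k)) / (sst / (n - 1))
        aic = np.log(ssr / n) + 2 * k / n
        bic = np.log(ssr / n) + k * np.log(n) / n
        return r2, r2_adj, aic, bic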

8 Introduction to large sample theory

Basic concepts of large sample theory

Using large sample theory we can dispense with basic assumptions of finite sample theory:
1.2 $E(\epsilon_i|X) = 0$: strict exogeneity
1.4 $Var(\epsilon|X) = \sigma^2I_n$: homoscedasticity
1.5 $\epsilon|X \sim N(0, \sigma^2I_n)$: normality of the error term

The approximate/asymptotic distribution of b and of the t- and F-statistics can be obtained.

Contents:
A. Stochastic convergence
B. Law of large numbers
C. Central limit theorems
D. Useful lemmas
A-D: pillars and building blocks of modern applied econometrics
CAN (consistent asymptotically normal) property of OLS (and other estimators)

8.1 A. Modes of stochastic convergence

First: non-stochastic convergence

$\{c_n\} = (c_1, c_2, ...)$ = sequence of real numbers
Q: Can you find $n(\epsilon)$ such that $|c_n - c| < \epsilon$ for all $n \geq n(\epsilon)$, for all $\epsilon > 0$?
A: Yes? Then $\{c_n\}$ converges to c.
Other notations: $\lim_{n\to\infty}c_n = c$; $c_n \to c$; $c = \lim c_n$
Examples:
$c_n = 1 - \frac{1}{n}$, so that $c_n \to 1$
$c_n = \exp(-n)$, so that $c_n \to 0$
$c_n = (1 + \frac{a}{n})^n$, so that $c_n \to \exp(a)$
Second: stochastic convergence
$\{z_n\} = (Z_1, Z_2, ...)$ with $Z_1, Z_2, ...$ being random variables.
Illustration:
All $Z_n$ are iid, e.g. N(0, 1).
They can be described using their distribution function.

Example 1: almost sure convergence (a.s.):
$\{u_n\}$ sequence of random variables ($u_i$ iid with $E(u_i) = \mu < \infty$, $Var(u_i) < \infty$)
$z_n = \frac{1}{n}\sum u_i$, with $\{z_n\} = \{u_1, \underbrace{\tfrac{1}{2}(u_1 + u_2)}_{z_2}, ..., \underbrace{\tfrac{1}{100}(u_1 + ... + u_{100})}_{z_{100}}, ...\}$

[Figure omitted]
For every possible realization (a sequence of real numbers) of $\{z_n\}$, the limit is $\mu$.

Mathematically: $\lim_{n\to\infty}z_n = \mu$ almost surely; also written as $z_n \to_{a.s.} \mu$
a.s. is the strongest mode of stochastic convergence.
More formally:
Q: Can you find an $n(\epsilon, \omega)$ such that $|z_n - \mu| < \epsilon$ for all $n > n(\epsilon, \omega)$, for all $\epsilon > 0$ and for all $\omega$?

A: Yes? Then $z_n \to_{a.s.} \mu$.



Note: the sample mean sequence $\{z_n\}$ is a special case of a.s. convergence. Generally: $P(\lim_{n\to\infty}Z_n = \mu) = 1$; $z_n \to_{a.s.} \mu$: $z_n$ converges almost surely to $\mu$.

Example 2: convergence in probability ($\to_p$):
[Figure omitted: with n = 5, 1/3 of the realizations lie within the limits; n = 10: 2/3; n = 20: 3/3]
When the number of replications $\to\infty$, the relative frequency of realizations within the limits converges to the corresponding probability.
Formalization:
Q: $\lim_{n\to\infty}P[|z_n - \mu| < \epsilon] = 1$ for all $\epsilon > 0$?
A: Yes? Then $z_n \to_p \mu$; $plim\,z_n = \mu$; $z_n$ converges in probability to $\mu$.
If you define $P[|z_n - \mu| < \epsilon]$ as a sequence of probabilities $p_n$, you can express this as $p_n \to 1$.
A different illustration (reaching towards consistency): [Figure omitted]

Example 3: mean square convergence ($\to_{m.s.}$, similar to $\to_p$):
$\lim_{n\to\infty}E[(Z_n - \mu)^2] = 0$

m.s. implies p (m.s. is the stronger concept)
Implies that $\lim_{n\to\infty}E(Z_n) = \mu$ and $\lim_{n\to\infty}Var(Z_n) = 0$
Additional remarks on the modes of convergence:

$\mu$ (the limit) can also be a random variable (not so important for us), so that $Z_n \to_p Z$, with Z a r.v. (also for a.s., m.s.)
Extends to vector/matrix sequences of r.v.: $Z_{n1} \to_p \mu_1$, $Z_{n2} \to_p \mu_2$, ...
Notation: $Z_n \to_p \mu$ (element-wise convergence). This is used later to write: $b_n \to_p \beta$

Convergence in distribution ($\to_d$)
Illustration: [Figure omitted] $f_z$: limiting distribution of $Z_n$
Definition: Denote by $F_n$ the cdf: $P(Z_n \leq a) = F_n(a)$. $\{Z_n\}$ converges in distribution to a r.v. Z if the cdf $F_n$ of $Z_n$ converges to the cdf $F_z$ of Z at each point (of continuity) of $F_z$.

$Z_n \to_d Z$; $F_z$: limiting distribution of $Z_n$

If the distribution of Z is known, we write e.g. $Z_n \to_d N(0, 1)$


Extension to a sequence of random vectors:

$\underbrace{Z_n}_{(k\times1)} \to_d \underbrace{Z}_{(k\times1)}$, with $Z_n = (Z_{n1}, ..., Z_{nk})'$ and $Z = (Z_1, ..., Z_k)'$   (90)

$F_n$ of $Z_n$ converges at each point to the cdf of Z:

cdf of Z: $F_Z(a) = P[Z_1 \leq a_1, ..., Z_k \leq a_k]$
cdf of $Z_n$: $F_{Z_n}(a) = P[Z_{n1} \leq a_1, ..., Z_{nk} \leq a_k]$

8.2 B. Law of Large Numbers (LLN)

Weak law of large numbers (Khinchin):

$\{Z_i\}$ sequence of iid random variables
$E(Z_i) = \mu < \infty$; compute $\bar{Z}_n = \frac{1}{n}\sum Z_i$. Then:

$\lim_{n\to\infty}P[|\bar{Z}_n - \mu| > \epsilon] = 0 \;(\forall\epsilon > 0)$, i.e. $plim\,\bar{Z}_n = \mu$; $\bar{Z}_n \to_p \mu$
Extensions: the WLLN holds for
1. Multivariate extension (sequence of random vectors):
$\{Z_i\}$ with $Z_i = (z_{i1}, ..., z_{ik})'$, i = 1, 2, ...   (91)
$E(Z_i) = \mu < \infty$, $\mu$ being the vector of expectations of the rows: $[E(z_1), ..., E(z_k)]'$
Compute sample means for each element of the sequence over the rows:

$\frac{1}{n}\sum_{i=1}^n Z_i = [\frac{1}{n}\sum Z_{1i}, ..., \frac{1}{n}\sum Z_{ki}]'$   (92)

Element-wise convergence: $\frac{1}{n}\sum Z_i \to_p \mu$

2. Relax independence:
Relax iid; allow for dependence in $\{Z_i\}$ through $Cov(Z_i, Z_{i-j}) \neq 0$, $j \neq 0$ (especially important in time series; the draws are still from the same distribution)
→ ergodic theorem.
3. Functions of $Z_i$:

$\{\underbrace{h(z_i)}_{\text{measurable function}}\}$, e.g. $\underbrace{\{\ln(Z_i)\}}_{\text{new sequence}}$   (93)



If $\{Z_i\}$ allows application of the LLN and $E(h(Z_i)) < \infty$, we have: $\frac{1}{n}\sum h(z_i) \to_p E(h(z_i))$ (so the LLN can also be applied here)
Application:
$\{Z_i\}$ iid, $E(z_i) = \mu < \infty$
$h(Z_i) = (z_i - \mu)^2$
$E(h(z_i)) = Var(z_i) = \sigma^2$

$\frac{1}{n}\sum(z_i - \mu)^2 \to_p Var(z_i)$
4. Vector-valued functions $f(Z_i)$:
Vector-valued functions: one or many arguments and one or many returns.
$\{Z_i\} \to \{f(Z_i)\}$ (with $f(Z_i) = f(z_{i1}, ..., z_{ik})$)
If $\{Z_i\}$ allows application of the LLN and $E(f(Z_i)) < \infty$:

$\frac{1}{n}\sum f(Z_i) \to_p E(f(Z_i))$   (94)

Example:
$\{Z_i\} = \{(z_{i1}, z_{i2})\}$ vector sequence
$f(Z_i) = f(z_{i1}, z_{i2}) = (z_{i1}z_{i2}, z_{i1}, z_{i2})$
Apply the above result (WLLN)
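A quick LLN illustration (a sketch; the exponential distribution is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(4)
    mu = 2.0                       # E(Z_i) for Exponential draws with scale 2
    for n in [10, 100, 10_000, 1_000_000]:
        z = rng.exponential(scale=mu, size=n)
        print(n, z.mean())         # sample mean -> mu as n grows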

8.3 C. Central Limit Theorems (CLTs)

Univariate CLT
If $\{Z_i\}$ iid, $E(Z_i) = \mu < \infty$, $Var(Z_i) = \sigma^2 < \infty$ and the WLLN works:
Then: $\sqrt{n}(\bar{Z}_n - \mu) \to_d N(0, \sigma^2)$ (without $\sqrt{n}$ the variance and mean would go to zero as n goes to infinity; also used with: square-root-n consistent estimators)
The CLT implies the LLN.
Other ways of writing this:
$\sqrt{n}(\bar{Z}_n - \mu) \sim_a N(0, \sigma^2)$
If $y_n = \sqrt{n}(\bar{Z}_n - \mu)$, then $\bar{z}_n = \frac{y_n}{\sqrt{n}} + \mu$ and thus $\bar{z}_n \sim_a N(\mu, \frac{\sigma^2}{n})$

Multivariate CLT
$\{Z_i\}$ is iid and can be drawn from ANY distribution, but the n needed for convergence may differ.

$Z_i = (z_{i1}, ..., z_{ik})'$, i.e. the sequence $Z_1, Z_2, ..., Z_n$ of random vectors   (95)


$E(Z_1) = E(Z_2) = ... = \mu = [E(Z_1), ..., E(Z_k)]'$ and

$Var(Z_i) = \begin{pmatrix} Var(z_{i1}) & ... & ... \\ ... & ... & ... \\ Cov(z_{ik}, z_{i1}) & ... & Var(z_{ik}) \end{pmatrix} = \Sigma < \infty$   (96)

and the WLLN works.

Then: $\sqrt{n}(\bar{z}_n - \mu) \to_d MVN(0, \Sigma)$
Other ways of writing this:
$\sqrt{n}(\bar{z}_n - \mu) \sim_a MVN(0, \Sigma)$, or:
$\bar{Z}_n \sim_a MVN(\mu, \Sigma/n)$
iid can be relaxed, but not as far as with the WLLN
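A sketch illustrating the univariate CLT with strongly non-normal draws:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, reps = 500, 20_000
    mu, sigma = 1.0, 1.0                       # mean and std of Exponential(1)
    z_bar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    t = np.sqrt(n) * (z_bar - mu) / sigma      # should be approximately N(0,1)
    print(t.mean(), t.std())                   # close to 0 and 1
    print(stats.kstest(t, "norm").pvalue)      # Kolmogorov-Smirnov test vs N(0,1)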

8.4 D. Useful lemmas of large sample theory



Lemmas 1 & 2: continuous mapping theorem (the limits $\to_p$ and $\to_d$ pass through continuous functions)
Lemma 1:
$Z_n$ can be a scalar, vector or matrix sequence.
If $Z_n \to_p \alpha$, with a(·) a continuous function that does not depend on n, then:
$a(Z_n) \to_p a(\alpha)$, i.e. $plim_{n\to\infty}\,a(z_n) = a(plim_{n\to\infty}\,z_n)$
Examples:
$x_n \to_p \beta \Rightarrow \ln(x_n) \to_p \ln(\beta)$
$x_n \to_p \beta$ and $y_n \to_p \gamma \Rightarrow x_n + y_n \to_p \beta + \gamma$

Lemma 2:
If $z_n \to_d z$, then:
$a(z_n) \to_d a(z)$ (for z: different degrees of difficulty to find it)
Example:
$z_n \to_d z \sim N(0,1) \Rightarrow z_n^2 \to_d \chi^2(1)$ (as $[N(0,1)]^2 = \chi^2(1)$)

Lemma 3:
If $\underbrace{x_n}_{(m\times1)} \to_d x$ and $y_n \to_p \alpha$ ($y_n$ treated as non-stochastic in the limit, a vector of real numbers), then:
$x_n + y_n \to_d x + \alpha$
Example:
$x_n \to_d N(0, 1)$, $y_n \to_p \alpha \Rightarrow x_n + y_n \to_d N(\alpha, 1)$


Lemmas 4 & 5 (& 6): Slutsky's theorem

Lemma 4:
If $x_n \to_d x$ and $y_n \to_p 0$ (we say that $y_n$ is $o_P$), then:
$x_n'y_n \to_p 0$
Lemma 5:
If $\underbrace{x_n}_{(m\times1)} \to_d x$ and $\underbrace{A_n}_{(k\times m)} \to_p A$, then:
$A_nx_n \to_d Ax$ (A treated as non-random, x as a random vector)
Example:
$x_n \to_d MVN(0, \Sigma)$, $A_n \to_p A \Rightarrow A_nx_n \to_d MVN(0, A\Sigma A')$
Lemma 6:
If $x_n \to_d x$ and $A_n \to_p A$, then:
$x_n'A_n^{-1}x_n \to_d x'A^{-1}x$ (x is random and A is non-random)
Example:
$x_n \to_d x \sim MVN(0, A)$ and $A_n \to_p A \Rightarrow x_n'A_n^{-1}x_n \to_d x'A^{-1}x \sim \chi^2(rows(x))$

8.5 Large sample properties of the OLS estimator

(2.1) maintained: linearity: $y_i = x_i'\beta + \epsilon_i$, i = 1, 2, ..., n
(2.2) Assumptions regarding the dependence of $\{y_i, x_i\}$ (get rid of iid)
(2.3) Replacing strict exogeneity: orthogonality/predetermined regressors: $E(x_{ik}\epsilon_i) = 0$.
If $x_{ik} = 1$: $E(\epsilon_i) = 0$ and $Cov(x_{ik}, \epsilon_i) = 0 \;\forall i, k$. If violated: endogeneity.
(2.4) Rank condition: $\underbrace{E(x_ix_i')}_{K\times K} = \Sigma_{xx}$ is non-singular.

Remarks on the assumptions:

$\{y_i, x_i\}$ allows application of the WLLN (so the DGP is not necessarily iid)
We assume $\{y_i, x_i\}$ are (jointly) stationary & ergodic (identically distributed but not independent)
$\{x_i\epsilon_i\} = \{g_i\}$ with $g_i = (x_{i1}\epsilon_i, ..., x_{iK}\epsilon_i)'$ allows application of a central limit theorem   (97)

$\{g_i\}$ is a martingale difference sequence (m.d.s.): $E(g_i|g_{i-1}, ...) = 0$ ($g_i$ is uncorrelated with its past = zero autocorrelation)
Assuming iid $\{y_i, x_i, \epsilon_i\}$ covers this (iid is a special case), but it is not necessary (and not useful in time series)


Overview
We get for $b = (X'X)^{-1}X'y$:

$b_n = [\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum x_iy_i$   (98)

WLLN, continuous mapping & Slutsky theorems:

$b_n \to_p \beta$ (consistency = asymptotically unbiased)
$\sqrt{n}(b_n - \beta) \to_d MVN(0, Avar(b))$, or $b \sim_a MVN(\beta, \frac{Avar(b)}{n})$ (approximate distribution)
$b_n$ is consistent and asymptotically normal (CAN). (We lose unbiasedness & efficiency here.)

We show that $b_n = (X'X)^{-1}X'y$ is consistent:
$b_n = [\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum x_iy_i$
$b_n - \beta = [\frac{1}{n}\sum x_ix_i']^{-1}\underbrace{\frac{1}{n}\sum x_i\epsilon_i}_{\text{sampling error}}$

We show: $b_n \to_p \beta$.
When the sequence $\{y_i, x_i\}$ allows application of the WLLN:

$\frac{1}{n}\sum x_ix_i' \to_p E(x_ix_i') = \begin{pmatrix} E(x_{i1}^2) & ... & E(x_{i1}x_{iK}) \\ ... & ... & ... \\ E(x_{i1}x_{iK}) & ... & E(x_{iK}^2) \end{pmatrix}$   (99)

So by Lemma 1:

$[\frac{1}{n}\sum x_ix_i']^{-1} \to_p [E(x_ix_i')]^{-1}$
$\frac{1}{n}\sum x_i\epsilon_i \to_p E(x_i\epsilon_i) = 0$ (we assume predetermined regressors)

Lemma 1 implies:
$b_n - \beta = [\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum x_i\epsilon_i \to_p E(x_ix_i')^{-1}E(x_i\epsilon_i) = E(x_ix_i')^{-1}\cdot 0 = 0$
→ $b_n = (X'X)^{-1}X'y$ is consistent.
If the assumption $E(x_i\epsilon_i) = 0$ does not hold, we have inconsistent estimators (= the death of estimators).

We show that $b_n = (X'X)^{-1}X'y$ is asymptotically normal:
The sequence $\{g_i\} = \{x_i\epsilon_i\}$ allows applying the CLT to $\frac{1}{n}\sum x_i\epsilon_i = \bar{g}$:

$\sqrt{n}(\bar{g} - E(g_i)) \to_d MVN(0, E(g_ig_i'))$

$\sqrt{n}(b_n - \beta) = \underbrace{[\frac{1}{n}\sum x_ix_i']^{-1}}_{A_n}\underbrace{\sqrt{n}\,\bar{g}}_{x_n}$ (starting point: the sampling error expression times $\sqrt{n}$)

Applying Lemma 5:

$A_n = [\frac{1}{n}\sum x_ix_i']^{-1} \to_p A = \Sigma_{xx}^{-1}$
$x_n = \sqrt{n}\,\bar{g} \to_d x \sim MVN(0, E(g_ig_i'))$



$\sqrt{n}(b_n - \beta) \to_d MVN(0, \underbrace{\Sigma_{xx}^{-1}E(g_ig_i')\Sigma_{xx}^{-1}}_{Avar(b)})$

Practical use: $b_n \sim_a MVN(\beta, \frac{Avar(b)}{n})$
$b_n$ is CAN
In detail: CLT on $\{x_i\epsilon_i\}$:
$\sqrt{n}\,\frac{1}{n}\sum x_i\epsilon_i = \sqrt{n}\,\bar{g}$
CLT: $\sqrt{n}(\bar{g} - E(g_i)) \to_d MVN(0, \underbrace{E[(g_i - E(g_i))(g_i - E(g_i))']}_{Var(g_i)})$
With predetermined $x_i$: $E(x_i\epsilon_i) = 0 \Rightarrow E(g_i) = 0 \Rightarrow Var(g_i) = E(g_ig_i') = E(\epsilon_i^2x_ix_i')$
How to estimate Avar(b):
$Avar(b) = \Sigma_{xx}^{-1}E(g_ig_i')\Sigma_{xx}^{-1}$

1. $\Sigma_{xx} = E(x_ix_i')$: $S_{xx} = \frac{1}{n}\sum x_ix_i' \to_p E(x_ix_i') = \Sigma_{xx}$ (also for the inverse, using Lemma 1)
2. $E(g_ig_i')$: $S_b = ? \to_p E(g_ig_i') = E(\epsilon_i^2x_ix_i')$
Consistent estimate of Avar(b): $\widehat{Avar}(b) = S_{xx}^{-1}S_bS_{xx}^{-1} \to_p \Sigma_{xx}^{-1}E(g_ig_i')\Sigma_{xx}^{-1}$

So how to estimate $S_b = E(\epsilon_i^2x_ix_i')$:

Recall: if $Z_i$ (vector) allows application of the LLN, then $\frac{1}{n}\sum f(Z_i) \to_p E(f(Z_i))$
Our $Z_i = [\epsilon_i, x_i]'$:
$\frac{1}{n}\sum\epsilon_i^2x_ix_i' \to_p E(\epsilon_i^2x_ix_i')$
Problem: $\epsilon_i$ is unknown → we use $e_i = y_i - x_i'b$:
$S_b = \frac{1}{n}\sum e_i^2x_ix_i' \to_p E(\epsilon_i^2x_ix_i') = E(g_ig_i')$
Hayashi p.123 shows that $S_b$ is a consistent estimator.

Result:
$\widehat{Avar}(b) = [\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum e_i^2x_ix_i'[\frac{1}{n}\sum x_ix_i']^{-1} \to_p E(x_ix_i')^{-1}E(g_ig_i')E(x_ix_i')^{-1} = Avar(b)$
→ estimated without assumption 1.4 (homoscedasticity)

Developing a test statistic under the assumption of conditional homoscedasticity
Assumption: $E(\epsilon_i^2|x_i) = \sigma^2$ (> 0 and < $\infty$) (we don't have to condition on all x's, only on $x_i$)
$E[E(\epsilon_i^2|x_i)] = \sigma^2 = E(\epsilon_i^2) = Var(\epsilon_i)$
$\widehat{Avar}(b) = [\frac{1}{n}\sum x_ix_i']^{-1}\,\hat{\sigma}^2\frac{1}{n}\sum x_ix_i'\,[\frac{1}{n}\sum x_ix_i']^{-1} = \hat{\sigma}^2[\frac{1}{n}\sum x_ix_i']^{-1}$
with $S_b = \frac{\hat{\sigma}^2}{n}\sum x_ix_i'$ and $\hat{\sigma}^2 = \frac{1}{n}\sum e_i^2$; $S_b$ is still a consistent estimate if $E(\epsilon_i^2|x_i) = \sigma^2$
This is the asymptotic variance-covariance matrix $\widehat{Avar}(b)$. The variance of b, $\widehat{Var}(b) = \frac{\widehat{Avar}(b)}{n} = \hat{\sigma}^2[\sum x_ix_i']^{-1} = \hat{\sigma}^2(X'X)^{-1}$, is the same as before.
When using the above assumption, we get the same results as before asymptotic theory, but we also have the general result for $\widehat{Avar}(b)$ without the above assumption:

1. If you assume conditional homoscedasticity (cf. 1.4):

$\widehat{Avar}(b) = \hat{\sigma}^2[\frac{1}{n}\sum x_ix_i']^{-1}$


We estimate Var(b):
$\widehat{Var}(b) = \frac{\widehat{Avar}(b)}{n} = \frac{\hat{\sigma}^2[\frac{1}{n}\sum x_ix_i']^{-1}}{n} = \hat{\sigma}^2(X'X)^{-1}$
$\hat{\sigma}^2 = \frac{1}{n}\sum e_i^2$ (consistent but biased)

We can use $s^2 = \frac{1}{n-K}\sum e_i^2 \to_p \sigma^2$ instead of $\hat{\sigma}^2$ (consistent and unbiased)
2. If you don't assume homoscedasticity:
$\widehat{Avar}(b) = [\frac{1}{n}\sum x_ix_i']^{-1}[\frac{1}{n}\sum e_i^2x_ix_i'][\frac{1}{n}\sum x_ix_i']^{-1}$
In EViews: the heteroscedasticity-consistent variance-covariance matrix is used to compute t-stats, s.e. and F-tests which are robust (we don't need the homoscedasticity assumption).
3. Assuming conditional homoscedasticity wrongly: t-stats, s.e. and F-tests are wrong (more precisely: inconsistent).
White standard errors
Adjusting the test statistics to make them robust against violations of conditional homoscedasticity.
t-ratio:

$t_k = \frac{b_k - \bar{\beta}_k}{\sqrt{\frac{1}{n}\left([\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum e_i^2x_ix_i'[\frac{1}{n}\sum x_ix_i']^{-1}\right)_{kk}}} \sim_a N(0, 1)$   (100)

Holds under $H_0: \beta_k = \bar{\beta}_k$
F-ratio:
$W = (Rb - r)'[R\frac{\widehat{Avar}(b)}{n}R']^{-1}(Rb - r) \sim_a \chi^2(\#r)$
Holds under $H_0: R\beta - r = 0$; allows for linear restrictions on $\beta$
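A sketch of the heteroscedasticity-consistent (White) sandwich estimator (the function name is illustrative):

    import numpy as np

    def white_se(y, X):
        """OLS with heteroscedasticity-consistent (White) standard errors."""
        n, K = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        Sxx_inv = np.linalg.inv(X.T @ X / n)          # [1/n sum x_i x_i']^{-1}
        Sb = (X * e[:, None] ** 2).T @ X / n          # 1/n sum e_i^2 x_i x_i'
        avar = Sxx_inv @ Sb @ Sxx_inv                 # sandwich: Avar(b)-hat
        se = np.sqrt(np.diag(avar) / n)               # standard errors of b
        return b, se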

9 Time Series Basics (Stationarity and Ergodicity)

Dependence in the data:

In time series analysis there is a certain degree of dependence in the data; only one realization of the data generating process is given.
The CLT and WLLN rely on iid data, but real-world data are dependent.
Examples: inflation rate, stock market returns
Stochastic process: a sequence of random variables/vectors indexed by time, $\{z_1, z_2, ...\}$ or $\{z_i\}$ with i = 1, 2, ...
A realization/sample path: one possible outcome of the process (p.97)
Distinguish between:

Realization (sample path) = sequence of real numbers
Process $\{Z_i\}$ = sequence of random variables (not real numbers)

Hayashi: p.99f
Annual inflation rate = sequence of real numbers, vs. three realizations (r) of the stochastic process of US inflation rates = sequence of random variables. [Figures omitted]

Ensemble mean: $\frac{1}{R}\sum_r infl_{1995,r} \to_p E[infl_{1995}]$ (by the LLN; here R = 3)

What can we infer from one realization about the process?

We can learn something when the DGP has certain properties/restrictions. Condition: stationarity of the process.
Stationarity
We draw from the same distribution over and over again, but maybe the draws are dependent. Additionally, the dependence doesn't change over time.
Strict stationarity:
The joint distribution of $z_i, z_{i_1}, ..., z_{i_r}$ depends only on the relative positions $i_1 - i, ..., i_r - i$ (distances) but not on i itself.
In other words: the joint distribution of $(z_i, z_{i-r})$ is the same as the joint distribution of $(z_j, z_{j-r})$ if the distances agree.
Weak stationarity:
$E(z_i) = \mu$ (finite/exists, but does not depend on i)
$Cov(z_i, z_{i-j})$ depends on j (distance), but not on i (absolute position); this implies $Var(z_i) = \sigma^2$ (does not depend on i)
(We only need the first two moments and not the whole distribution.)
If all moments exist: strict stationarity implies weak stationarity.
Illustration: [Figures omitted]
In the first figure it seems that $E(z_i)$ does depend on i. In the second it seems that $E(z_i) = \mu$ does not depend on i: the realization oscillates around the expected value (some process drags it back, although there are some stretches of dependence above or below the mean), and $Var(z_i) = \sigma^2$ does not depend on i. This second graph seems to fulfill the first two requirements.
We now turn to the assumption that $Cov(z_i, z_{i-j})$ only depends on the distance.
For j = 0 → $Var(z_i)$. [Figure omitted] In the third figure the ensemble means are still constant, but $Var(z_i)$ increases with i.
Example: random walk: $z_i = z_{i-1} + \epsilon_i$ [$\epsilon_i \sim N(0, \sigma^2)$ iid]
Stationarity is testable
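A sketch contrasting a stationary AR(1) with the (non-stationary) random walk just mentioned (parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    n, phi = 1000, 0.8
    eps = rng.normal(size=n)

    ar1 = np.zeros(n)   # stationary AR(1): z_i = phi*z_{i-1} + eps_i, |phi| < 1
    rw = np.zeros(n)    # random walk: z_i = z_{i-1} + eps_i (variance grows with i)
    for i in range(1, n):
        ar1[i] = phi * ar1[i - 1] + eps[i]
        rw[i] = rw[i - 1] + eps[i]

    # variance of the AR(1) stabilizes; variance of the random walk keeps growing
    print(ar1[:500].var(), ar1[500:].var())
    print(rw[:500].var(), rw[500:].var())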


Ergodicity
If I space two random variables apart in the process, the dependence has to decrease with the distance (restricted memory of the process; in the limit, the past is forgotten).
A stationary process is also called ergodic if, asymptotically, it behaves like an independent process:

$\lim_{n\to\infty}E[f(z_i, ..., z_{i+k})\,g(z_{i+n}, ..., z_{i+n+l})] = E[f(z_i, ..., z_{i+k})]\,E[g(z_{i+n}, ..., z_{i+n+l})]$   (101)

Ergodic processes are stationary processes which allow the application of a LLN. Ergodic theorem:
If the sequence $\{z_i\}$ is stationary and ergodic with $E(z_i) = \mu$, then

$\bar{z}_n = \frac{1}{n}\sum z_i \to_p \mu$   (102)

So for an ergodic process we are back to the fact that one realization is enough to infer from.
Martingale difference sequence
Stationarity and ergodicity are not enough for applying the CLT. To derive the CAN property of the OLS estimator we assume:
$\{g_i\} = \{x_i\epsilon_i\}$
$\{g_i\}$ is a stationary and ergodic martingale difference sequence (m.d.s.):
$E(g_i|\underbrace{g_{i-1}, g_{i-2}, ...}_{\text{history of the process}}) = 0 \Rightarrow E(g_i) = 0$ by LTE

Derive the asymptotic distribution of $b_n = (X'X)^{-1}X'y$. We assume $\{\underbrace{x_i\epsilon_i}_{g_i}\}$ is a stationary & ergodic m.d.s.:

$g_i = \begin{pmatrix} \epsilon_i\cdot1 \\ \epsilon_ix_{i2} \\ ... \\ \epsilon_ix_{iK} \end{pmatrix}$   (103)

Why? Because there exists a CLT for stationary & ergodic m.d.s.
$E(g_i|g_{i-1}, ...) = 0$ with $x_{i1} = 1 \Rightarrow E(\epsilon_i|g_{i-1}, ...) = 0$; by LTE $E(\epsilon_i) = 0$ and also by LIE $E(\epsilon_i\epsilon_{i-j}) = 0 \;\forall j = 1, 2, ...$, so that $Cov(\epsilon_i, \epsilon_{i-j}) = 0$.
Summary
We assume for large-sample OLS:
$\{y_i, x_i\}$ stationary & ergodic

LLN (ergodic theorem): $\frac{1}{n}\sum x_ix_i' \to_p E(x_ix_i')$ and $\frac{1}{n}\sum x_iy_i \to_p E(x_iy_i)$
→ consistency of $b = (X'X)^{-1}X'y$
$g_i = \epsilon_ix_i$ is a stationary & ergodic m.d.s.
$E(g_i|g_{i-1}, ...) = 0$; $E(g_i) = [E(\epsilon_i), ..., E(\epsilon_ix_{iK})]' = 0$
CLT for stationary & ergodic m.d.s. applied to $\{g_i\}$, with $\frac{1}{n}\sum\epsilon_ix_i = \bar{g}_n$:



$\sqrt{n}(\bar{g}_n - \underbrace{E(g_i)}_{=0}) \to_d MVN(0, \underbrace{E(g_ig_i')}_{Var(g_i)})$

Distribution of b:
$b = [\frac{1}{n}\sum x_ix_i']^{-1}\frac{1}{n}\sum x_iy_i$ is CAN
Notes:
$\{\epsilon_ix_i\}$ being a stationary & ergodic m.d.s. rules out $Cov(\epsilon_i, \epsilon_{i-j}) \neq 0$.
Ruling out this autocorrelation is fine for cross-sectional data, but less so for time series. It can be relaxed (but is more complex): HAC properties (= heteroscedasticity and autocorrelation consistent).
Robust inference:
Robust s.e./t-stats etc. (HAC) and those using the restrictive assumptions (CAN) should not differ too much (differences often happen when one tries to account for autocorrelation).

10 Generalized Least Squares

Assumptions of GLS
Linearity: $y_i = x_i'\beta + \epsilon_i$
Full rank: rank(X) = K (with probability 1)
Strict exogeneity: $E(\epsilon_i|X) = 0 \Rightarrow E(\epsilon_i) = 0$ and $Cov(\epsilon_i, x_{ik}) = E(\epsilon_ix_{ik}) = 0$
Not assumed: $Var(\epsilon|X) = \sigma^2I_n$.
Instead:

$Var(\epsilon|X) = E(\epsilon\epsilon'|X) = \begin{pmatrix} Var(\epsilon_1|X) & Cov(\epsilon_1,\epsilon_2|X) & ... & Cov(\epsilon_1,\epsilon_n|X) \\ Cov(\epsilon_1,\epsilon_2|X) & Var(\epsilon_2|X) & ... & ... \\ ... & ... & ... & ... \\ Cov(\epsilon_1,\epsilon_n|X) & ... & ... & Var(\epsilon_n|X) \end{pmatrix}$   (104)

$Var(\epsilon|X) = E(\epsilon\epsilon'|X) = \sigma^2V(X)$ (with $\sigma^2$ a positive real number extracted from $Var(\epsilon|X)$)
Consequences:

$b_n = (X'X)^{-1}X'y$ is no longer BLUE, but still unbiased
$\widehat{Var}(b_n|X) \neq \sigma^2(X'X)^{-1}$
Deriving the GLS estimator
Derived under the assumption that V(X) is known, symmetric and positive definite. Then:
$V = P\Lambda P'$ (with P = matrix of characteristic vectors/eigenvectors of V ($n\times n$), $\Lambda$ = diagonal matrix with the eigenvalues on the diagonal) = spectral decomposition; set $C = \Lambda^{-1/2}P'$ (C is ($n\times n$) and nonsingular)
$V(X)^{-1} = C'C$
Transformation: $\tilde{y} = Cy$, $\tilde{X} = CX$:
$y = X\beta + \epsilon$
$Cy = CX\beta + C\epsilon$
$\tilde{y} = \tilde{X}\beta + \tilde{\epsilon}$

Least squares estimation of $\beta$ using the transformed data:
$b_{GLS} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} = (X'C'CX)^{-1}X'C'Cy = (X'\frac{1}{\sigma^2}V^{-1}X)^{-1}X'\frac{1}{\sigma^2}V^{-1}y$
$= [X'[Var(\epsilon|X)]^{-1}X]^{-1}X'[Var(\epsilon|X)]^{-1}y$
The GLS estimator is BLUE:

linearity
full rank of $\tilde{X}$
strict exogeneity
$Var(\tilde{\epsilon}|X) = \sigma^2I_n$

$Var(b_{GLS}|X) = \sigma^2[X'V^{-1}X]^{-1}$
Assumptions 1.1-1.4 are fulfilled for the transformed model.
Problems:

Difficult to work out the asymptotic properties of $b_{GLS}$
In real-world applications $Var(\epsilon|X)$ is not known (and this time we have a lot more unknowns)
If $Var(\epsilon|X)$ is estimated, the BLUE property of $b_{GLS}$ is lost
Special case of GLS: weighted least squares (WLS)

$E(\epsilon\epsilon'|X) = Var(\epsilon|X) = \sigma^2\begin{pmatrix} V_1(X) & 0 & ... & 0 \\ 0 & V_2(X) & ... & 0 \\ ... & ... & ... & ... \\ 0 & 0 & ... & V_n(X) \end{pmatrix} = \sigma^2V(X)$   (105)

As $V(X)^{-1} = C'C$:

$C = \begin{pmatrix} \frac{1}{\sqrt{V_1(X)}} & 0 & ... & 0 \\ 0 & \frac{1}{\sqrt{V_2(X)}} & ... & 0 \\ ... & ... & ... & ... \\ 0 & 0 & ... & \frac{1}{\sqrt{V_n(X)}} \end{pmatrix} = \begin{pmatrix} \frac{1}{s_1} & 0 & ... & 0 \\ 0 & \frac{1}{s_2} & ... & 0 \\ ... & ... & ... & ... \\ 0 & 0 & ... & \frac{1}{s_n} \end{pmatrix}$   (106)

$b_{WLS} = \arg\min_b\sum\left(\frac{y_i}{s_i} - b_1\frac{1}{s_i} - b_2\frac{x_{i2}}{s_i} - ... - b_K\frac{x_{iK}}{s_i}\right)^2$

Observations are weighted by their standard deviation (higher penalty for wrong estimates on low-variance observations).
For this we would need the conditional variances $Var(\epsilon_i|x_i)$. We assume that we can use a variance conditioned on $x_i$.


Notes on GLS
If the regressors X are not strictly exogenous, GLS may be inconsistent (OLS does not have this problem: robustness property of OLS).
Cross-section data: $E(\epsilon_i|X) = 0$ is better justified and $E(\epsilon_i\epsilon_j) = 0$, but:
Conditional heteroscedasticity: $Var(\epsilon_i|x_i) = \sigma^2V(x_i)$ → WLS using $\sigma^2V(x_i)$
Functional form of $V(x_i)$? Typically not known, but has to be assumed. E.g.:
$V(x_i) = \underbrace{z_i'}_{\text{observables}}\underbrace{\alpha}_{\text{parameters to be estimated}}$
Feasible GLS (FGLS) ($\hat{\alpha}$, $\hat{\beta}$ estimated in one step)
The finite sample properties of GLS are then lost.
For large sample properties:
FGLS may be more efficient (asymptotically) than OLS, but the functional form of $V(x_i)$ has to be correctly specified. If this is not the case, FGLS may be even less efficient than OLS.
So we face a trade-off between consistency and efficiency.
One can work out the distribution of the GLS estimator also assuming a stationary & ergodic m.d.s., and one again obtains a normal distribution.
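A WLS sketch under an assumed, purely illustrative skedastic function $Var(\epsilon_i|x_i) = \sigma^2x_{i2}^2$ (i.e. $s_i \propto x_{i2}$):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 300
    x = rng.uniform(1, 5, size=n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.normal(size=n) * x              # heteroscedastic: sd proportional to x
    y = X @ np.array([1.0, 0.5]) + eps

    s = x                                     # assumed sqrt(V(x_i)) up to scale
    Xw, yw = X / s[:, None], y / s            # transform: divide each row by s_i
    b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(b_ols, b_wls)                       # both consistent; WLS more efficient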

11 Multicollinearity

Exact multicollinearity
One regressor can be expressed as a linear combination of (an)other regressor(s).
rank(X) ≠ K: no full rank → assumption 1.3 or 2.4 is violated and $(X'X)^{-1}$ does not exist.
Example: seasonal dummies/sex dummies together with a constant
Often economic variables are correlated to some degree.
Example: tenure and work experience
The BLUE result is not affected
Large sample results are not affected
Var(b|X) is affected
Effects of multicollinearity and solutions to the problem
Effects:

Coefficients may have high standard errors and low significance levels
Estimates may have the wrong sign (compared to the theoretical sign)
Small changes in the data produce wide swings in the parameter estimates


Recall: $Var(b_k|X) = \sigma^2[(X'X)^{-1}]_{kk}$. Suppose X contains a constant and 2 regressors $x_1$, $x_2$.
Then $Var(b_k|X) = \frac{\sigma^2}{(1 - r_{x_1,x_2}^2)\sum(x_{ik} - \bar{x}_k)^2}$ (with $r_{x_1,x_2}^2$ the squared empirical correlation of the regressors, and the second factor the sample variation of the k-th regressor).
In general: $[(X'X)^{-1}]_{kk} = \frac{1}{(1 - R_{k.}^2)\sum(x_{ik} - \bar{x}_k)^2}$ (with $R_{k.}^2$ = explained variance of the regression of $x_k$ on the rest of the data matrix, excluding $x_k$).
The stronger the correlation between the regressors, the smaller the denominator of the variance and the higher the variance of $b_k$: less precision, and we can hardly reject the null hypothesis (small t-stat).
The smaller $\sigma^2$, the smaller $Var(b_k)$.
The higher n, the higher the sample variation in the regressor $x_k$, and thus a high n can compensate for high correlation: the larger the sample variation, the smaller $Var(b_k)$.
Solutions:

Increase precision by using more data (higher n → more variation in $x_k$) (costly!)
Build a better fitting model that leaves less unexplained (smaller $\sigma^2$)
Exclude some regressors (omitted variable bias!) (smaller correlation)
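A variance-inflation-factor sketch based on the $\frac{1}{1 - R_{k.}^2}$ term above (the function name is illustrative; intended for a non-constant column k):

    import numpy as np

    def vif(X, k):
        """Variance inflation factor 1/(1 - R_k^2) of column k of X (X incl. constant)."""
        xk = X[:, k]
        X_rest = np.delete(X, k, axis=1)                  # regress x_k on the others
        g = np.linalg.lstsq(X_rest, xk, rcond=None)[0]
        e = xk - X_rest @ g
        r2 = 1 - (e @ e) / np.sum((xk - xk.mean()) ** 2)  # centered R^2 of that regression
        return 1 / (1 - r2)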

12 Endogeneity

Omitted variable bias

Correctly specified model: $y = X_1\beta_1 + X_2\beta_2 + \epsilon$
In a regression of y on $X_1$ only, $X_2$ gets into the error term: $y = X_1\beta_1 + \underbrace{X_2\beta_2 + \epsilon}_{\tilde{\epsilon}}$
Omitted variable bias:
$b_1 = (X_1'X_1)^{-1}X_1'y$
$= (X_1'X_1)^{-1}X_1'\underbrace{(X_1\beta_1 + X_2\beta_2 + \epsilon)}_{y}$
$= \beta_1 + \underbrace{(X_1'X_1)^{-1}X_1'X_2}_{\text{regression of }X_2\text{ on }X_1}\beta_2 + \underbrace{(X_1'X_1)^{-1}X_1'\epsilon}_{=0\ \text{in expectation if}\ E(\epsilon|X)=0}$

The OLS estimator is biased, i.e. $(X_1'X_1)^{-1}X_1'X_2\beta_2 \neq 0$, if:

$\beta_2 \neq 0$ (otherwise $X_2$ would not be part of the model in the first place), and
$(X_1'X_1)^{-1}X_1'X_2 \neq 0$ (the regression of $X_2$ on $X_1$ gives non-zero coefficients)

Recall:
1. $E(\epsilon_i|X) = 0$ = strict exogeneity → $E(\epsilon_ix_i) = 0$
2. $E(\epsilon_ix_{ik}) = 0 \;\forall k$ = large sample OLS assumption; predetermined/orthogonal regressors
Endogeneity: $E(\epsilon_ix_i) \neq 0$, or $E(\epsilon_ix_{ik}) \neq 0$ for some k


Endogeneity bias: working example

Simultaneous equations model of market equilibrium (structural form):
$q_i^d = \alpha_0 + \alpha_1p_i + u_i$ ($u_i$ = demand shifter, e.g. preferences)
$q_i^s = \beta_0 + \beta_1p_i + v_i$ ($v_i$ = supply shifter, e.g. temperature)
Assumptions: $E(u_i) = 0$, $E(v_i) = 0$, $Cov(u_i, v_i) = 0$
Markets clear: $q_i^d = q_i^s$
Our data show only the market equilibria. It is not possible to estimate $\alpha_0, \alpha_1, \beta_0, \beta_1$ (levels and slopes), as we do not know whether changes in the market equilibrium are due to supply or demand shocks. We observe many possible equilibria, but we cannot identify the slopes of the demand and supply curves from the data.
Here: simultaneous equations bias
From structural form to reduced form
Solving for $q_i$ and $p_i$ ($q_i^d = q_i^s = q_i$) yields the reduced form (only exogenous things on the right-hand side):
$p_i = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{v_i - u_i}{\alpha_1 - \beta_1}$
$q_i = \frac{\alpha_1\beta_0 - \alpha_0\beta_1}{\alpha_1 - \beta_1} + \frac{\alpha_1v_i - \beta_1u_i}{\alpha_1 - \beta_1}$

Both price and quantity are functions of the two error terms (no regression can be run at this point, as we don't observe the error terms): $v_i$ = supply shifter and $u_i$ = demand shifter
Endogeneity: correlation between errors and regressors; the regressors are not predetermined, as the price is a function of these error terms.
Calculating the covariance of $p_i$ and the demand shifter $u_i$ (using the reduced form):
$Cov(p_i, u_i) = -\frac{Var(u_i)}{\alpha_1 - \beta_1}$
If endogeneity is present, the OLS estimator is not consistent.
The FOCs in the simple regression context yield:
$b_1 = \frac{\frac{1}{n}\sum(q_i - \bar{q})(p_i - \bar{p})}{\frac{1}{n}\sum(p_i - \bar{p})^2} \to_p \frac{Cov(p_i, q_i)}{Var(p_i)}$ (formula for estimation)
But here (from the structural form): $Cov(p_i, q_i) = \alpha_1Var(p_i) + Cov(p_i, u_i)$
$\frac{Cov(p_i, q_i)}{Var(p_i)} = \alpha_1 + \frac{Cov(p_i, u_i)}{Var(p_i)} \neq \alpha_1$
→ OLS is not consistent
The same holds for $\beta_1$
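A simulation sketch of this simultaneity bias (arbitrary parameter values; with these numbers $Cov(p,u)/Var(p) = 1$, so the OLS slope converges to 0 instead of $\alpha_1 = -1$):

    import numpy as np

    rng = np.random.default_rng(8)
    n = 100_000
    a0, a1 = 10.0, -1.0          # demand: q = a0 + a1*p + u
    b0, b1 = 2.0, 1.0            # supply: q = b0 + b1*p + v
    u, v = rng.normal(size=n), rng.normal(size=n)

    p = (b0 - a0) / (a1 - b1) + (v - u) / (a1 - b1)   # reduced-form price
    q = a0 + a1 * p + u                                # equilibrium quantity

    ols_slope = np.cov(p, q)[0, 1] / p.var()
    print(ols_slope)   # converges to a1 + Cov(p,u)/Var(p), not to a1 = -1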
Instruments for the market model
Properties of instruments: uncorrelated with the errors (instruments are predetermined) and correlated with the endogenous regressors.
New model:
$q_i^d = \alpha_0 + \alpha_1p_i + u_i$
$q_i^s = \beta_0 + \beta_1p_i + \beta_2x_i + \epsilon_i$
$q_i^d = q_i^s$
with:
$E(\epsilon_i) = 0$

$Cov(x_i, \epsilon_i) = 0$
$Cov(x_i, u_i) = 0$
$Cov(x_i, p_i) = \frac{\beta_2}{\alpha_1 - \beta_1}Var(x_i) \neq 0$
$x_i$ as an instrument for $p_i$ yields the new reduced form:
$p_i = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{\beta_2}{\alpha_1 - \beta_1}x_i + \frac{\epsilon_i - u_i}{\alpha_1 - \beta_1}$
$q_i = \frac{\alpha_1\beta_0 - \alpha_0\beta_1}{\alpha_1 - \beta_1} + \frac{\alpha_1\beta_2}{\alpha_1 - \beta_1}x_i + \frac{\alpha_1\epsilon_i - \beta_1u_i}{\alpha_1 - \beta_1}$
$Cov(x_i, q_i) = \frac{\alpha_1\beta_2}{\alpha_1 - \beta_1}Var(x_i) = \alpha_1Cov(x_i, p_i)$ (from the reduced form)
Using method-of-moments theory (not regression): $\alpha_1 = \frac{Cov(x_i, q_i)}{Cov(x_i, p_i)}$; by the WLLN the corresponding sample-moment estimator $\hat{\alpha}_1 \to_p \alpha_1$ (estimator of the form $\frac{\frac{1}{n}\sum x_iy_i - \bar{x}\bar{y}}{\frac{1}{n}\sum x_iz_i - \bar{x}\bar{z}}$)

A simple macroeconomic model: Haavelmo

Aggregate consumption function: $C_i = \alpha_0 + \alpha_1Y_i + u_i$
GDP identity: $Y_i = C_i + I_i$
$Y_i$ affects $C_i$, but at the same time $C_i$ influences $Y_i$.
Reduced form: $Y_i = \frac{\alpha_0}{1 - \alpha_1} + \frac{1}{1 - \alpha_1}I_i + \frac{u_i}{1 - \alpha_1}$
$C_i$ cannot simply be regressed on $Y_i$, as the regressor is correlated with the residual (using the reduced form):
$Cov(Y_i, u_i) = \frac{Var(u_i)}{1 - \alpha_1} > 0$
The OLS estimator is inconsistent: upward biased (as $Cov(C_i, Y_i) = \alpha_1Var(Y_i) + Cov(Y_i, u_i)$ from the structural form):
$\frac{Cov(C_i, Y_i)}{Var(Y_i)} = \alpha_1 + \frac{Cov(Y_i, u_i)}{Var(Y_i)} \neq \alpha_1$
Valid instrument for income $Y_i$: investment $I_i$:
$Cov(I_i, u_i) = 0$
$Cov(I_i, Y_i) = \frac{Var(I_i)}{1 - \alpha_1} \neq 0$ (from the reduced form)
Errors in variables
Explanatory variable is measured with error (e.g. reporting errors)
Classical example: Friedmans permanent income hypothesis
Permanent consumption is proportional to permanent income Ci = kYi
Observed variables:
Yi = Yi +yi
Ci = Ci +ci
ci = kyi + ui
Endogeneity due to measurement errors
Solution: IV-estimators; here: xi = 1
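A sketch of the measurement-error example (all numbers assumed): OLS on the observed variables is attenuated towards zero, while the constant as instrument (the ratio of sample means) recovers k.

import numpy as np

rng = np.random.default_rng(4)
n = 100_000
k = 0.9
y_star = 5.0 + rng.normal(size=n)          # permanent income
Y = y_star + 0.8 * rng.normal(size=n)      # observed income (measurement error)
C = k * y_star + 0.5 * rng.normal(size=n)  # observed consumption (with error)

ols = np.cov(Y, C)[0, 1] / np.var(Y, ddof=1)
iv = C.mean() / Y.mean()                   # IV with the constant x_i = 1
print(ols, iv)  # ols < 0.9 (attenuation bias), iv ~ 0.9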

13 IV estimation

General solution to endogeneity problem: IV


Note: change in notation: zi =regressors, xi =instruments


Linear regression: y_i = z_i'δ + ε_i

But the assumption of predetermined regressors does not hold: E(z_i ε_i) ≠ 0
To get consistent estimators, instrumental variables are needed: x_i = [x_i1, ..., x_iK]'
x_i is correlated with the endogenous regressors, but uncorrelated with the error term.
Assumptions for IV estimators
3.1 Linearity: y_i = z_i'δ + ε_i
3.2 Ergodic stationarity: K instruments and L regressors. The data sequence w_i = {y_i, z_i, x_i} is stationary
and ergodic [LLN can be applied as with OLS]
We focus on K = L → IV (other possibility: K > L → GMM)
3.3 Orthogonality conditions: E(x_i ε_i) = 0
E(x_i1(y_i − z_i'δ)) = 0
...
E(x_iK(y_i − z_i'δ)) = 0
⇔ E(x_i(y_i − z_i'δ)) = 0 ⇔ E(x_i ε_i) = E(g_i) = 0
Later we also assume that a CLT can be applied to g_i, as for OLS.
Later we also assume that a CLT can be applied to gi as for OLS

3.4 Rank condition for identification: rank(Σ_xz) = L with

Σ_xz = E(x_i z_i') = [ E(x_i1 z_i1) ... E(x_i1 z_iL) ; ... ; E(x_iK z_i1) ... E(x_iK z_iL) ]   (K×L)   (107)

(= full column rank)

Σ_xz^{-1} exists if K = L.

Core question: Do the moment conditions provide enough information to determine δ uniquely?

Some notes on IV estimation
Start from:
g_i = x_i(y_i − z_i'δ) = g(w_i, δ) (with w_i = the data)
Assumption 3.3, E(g_i) = 0 or E(g(w_i, δ)) = 0, is the orthogonality condition: E(x_i(y_i − z_i'δ)) = 0 is a
system of K equations relating L unknown parameters to K unconditional moments.
Core question: Are the population moment conditions sufficient to determine δ uniquely
as a solution to the system of equations? δ solves it — but is it the only solution?
Let δ̃ be a hypothetical value that solves E(g(w_i, δ̃)) = 0:
E(x_i y_i) − E(x_i z_i')δ̃ = 0, i.e. Σ_xz δ̃ = σ_xy with σ_xy = E(x_i y_i) (K×1) and Σ_xz (K×L)
(a system of linear equations Ax = b)
Necessary & sufficient condition that δ̃ = δ is the only solution:
Σ_xz has full column rank (L)
⇒ reason for assumption 3.4


Another necessary (but not sufficient) condition is the order condition: K (#instruments) ≥ L (#regressors).
Deriving the IV estimator (K = L):
E(x_i ε_i) = E(x_i(y_i − z_i'δ)) = 0
E(x_i y_i) − E(x_i z_i')δ = 0
δ = [E(x_i z_i')]^{-1} E(x_i y_i)

b_IV = [(1/n) Σ x_i z_i']^{-1} [(1/n) Σ x_i y_i] →p δ

To show this, proceed as for OLS and start from the sampling error b_IV − δ.
Alternative notation (X the matrix of instruments, Z the matrix of regressors):
b_IV = (X'Z)^{-1} X'y
If K = L the rank condition implies that Σ_xz^{-1} exists and the system is exactly identified.

Applying WLLN, CLT and the lemmas it can be shown that the IV estimator b_IV is CAN.

To show this, proceed as for OLS and start from √n(b_IV − δ). Assume that a CLT can be applied to
{g_i}: √n((1/n) Σ g_i − E(g_i)) →d MVN(0, E(g_i g_i'))

√n(b_IV − δ) →d MVN(0, Avar(b_IV)) (find an expression for the variance and an estimator of the
variance)
So:
Show consistency and asymptotic normality of b_IV. b_IV is CAN: b_IV ~a MVN(δ, Avar(b_IV)/n)

Relate OLS to IV estimators:

If the regressors are predetermined, E(ε_i z_i) = 0, use the regressors as instruments: x_i = z_i
b_IV = [(1/n) Σ z_i z_i']^{-1} (1/n) Σ z_i y_i = OLS estimator
⇒ OLS is a special case of IV (see the sketch below)
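A minimal sketch of b_IV = (X'Z)^{-1} X'y in matrix form, and the check that IV with x_i = z_i reproduces OLS (simulated data; here the regressor is actually exogenous, so both are consistent):

import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.normal(size=n)            # instrument
z = 0.8 * x + rng.normal(size=n)  # regressor
y = 2.0 * z + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])  # instrument matrix (incl. constant)
Z = np.column_stack([np.ones(n), z])  # regressor matrix (incl. constant)

b_iv = np.linalg.solve(X.T @ Z, X.T @ y)   # (X'Z)^{-1} X'y
b_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)  # IV with X = Z collapses to OLS
print(b_iv, b_ols)  # both close to (0, 2)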


14 Questions for Review

First set
1. Ragnar Frisch's statement in the first Econometrica issue claims that econometrics is not (only)
economic statistics and not (only) mathematics applied to economics. What is it, then?
Econometrics=
Economic statistics (Basis)
Economic theory (Basis and application)
Mathematics (Methodology)
2. How does the fundamental asset value evolve over time in the Glosten/Harris model presented in
the lecture? Which parts of the final equation are associated with public and which parts with private
information that influences the fundamental asset value?

m_t = μ + m_{t−1} + ε_t + Q_t z_0 + Q_t z_1 v_t   (108)

where m_t is the fundamental asset value, μ + m_{t−1} + ε_t the random walk with drift, and Q_t z_0 + Q_t z_1 v_t the private information.

ΔP_t = μ + z_0 Q_t + z_1 v_t Q_t + c ΔQ_t + ε_t   (109)

where μ and ε_t reflect public information and z_0 Q_t + z_1 v_t Q_t private information.

3. Explain the components of the equations that give the market maker buy (bid) and the market
maker sell (ask) price in the Glosten/Harris model.

P_t^a = μ + m_{t−1} + ε_t + z_t + c   (110)

with μ = drift parameter, m_{t−1} = last fair price, ε_t = public info, z_t = volume-dependent component, c = earnings/costs
(for the bid price the signs are reversed: P_t^b = μ + m_{t−1} + ε_t − z_t − c).

4. Explain how in the Glosten/Harris model the market maker anticipates the impact that a trade
event exerts on the fundamental asset price when setting buy and sell prices.
The market maker sells at the ask price, thus z_t and c enter positively into the price. This means
the market maker adds a volume-dependent component and his costs to the price to be rewarded
appropriately for his work.
5. Why should it be interesting for a) an investor and b) for a stock exchange to estimate the
parameters of the Glosten/Harris model (z_0, z_1, c, μ)?
a) To find out the fundamental value and thus prevent the investor from paying too much
b) To trade at efficient prices (avoid bubbles, avoid exploitation by the market maker)


6. What does the (bid-ask) spread mean? What are explanations for the existence of a positive spread,
and what could be the reason that the spread differs when the same asset is traded in different market
environments?

P_t^a − P_t^b = 2z_t + 2c   (111)

Positive spread: earnings of the market maker to earn a living and to cover costs (operational costs,
protection/storage costs)
Market environment: influences costs and trade volume, e.g. in a crisis
7. Which objects (variables and parameters) in the Glosten/Harris model are a) observable to the
market maker, but not to the econometrician and b) observable to the econometrician?
a) costs/earnings c
b) Pt , Qt , vt
8. The final equation of the Glosten/Harris model contains observable objects, unknown parameters
and an unobservable component. What are these? What is the meaning of the unobservable variable
in the final equation?
Observable: P_t, Q_t, v_t
Unobservable: ε_t (public information)
Parameters: μ, z_0, z_1, c
9. Why did we transform the two equations for the market maker's buy and sell price that the market
maker posts at point in time t into the equation for ΔP_t?
So that we can estimate the structural parameters.
10. In the "Mincer equation" derived from human capital theory, what are the observable variables and
what would be a sensible interpretation of the unobservable component?
Observable: Wage, S, Tenure, Exp
Unobservable: ε_i (e.g. ability, luck, motivation, gender)
11. Why should the government (ministry of labour or education) be interested in estimating the
parameters of the Mincer equation from human capital theory?
To determine how many years people should go to school (costs!)
12. Explain why we can conceive ln(WAGEi), Si, TENUREi, and EXPRi in the Mincer equation as
random variables.
The individuals are randomly drawn. The variables are the results of a random experiment and thus
random variables
13. An empirical researcher using a linear regression model is confronted with the critique "this is
measurement without theory". What is meant by that and why is this a problem?
A correlation is found, but it may be too small or without meaning for theory. Always start with a
theoretical assumption. Without a theoretical assumption behind the statistics, we can't test hypotheses.
Causality is a problem, because we can't refer to a model/reasons for the form of our regression.
14. The researcher counters that argument by stating that he is only interested in prediction and
causality issues do not matter for him. Exploiting a correlation would be just fine for his purposes.
Provide a critical assessment of this statement.
To act upon such a test, causality is needed. Correlation could be due to self-selection, unobserved
variables, reversed causality, interdependence, endogeneity.
15. A possibility to motivate the linear regression model is to assume that E(Y|X) = X'β. What does
this imply for the disturbance term ε?
ε is uncorrelated with the Xs and its (conditional) expectation is zero: E(ε|X) = 0.
16. Some researchers argue that nonparametric analysis provides a better way to model the conditional
mean of the dependent variable. What can be said in favor of it and what is the drawback?
E(y_i|x_i) = x_i'β = assuming linearity (parametric) vs. E(y_i|x_i) = f(x_i) = no functional form
assumed (nonparametric)
In favor: no linearity assumed (robust)
Drawback: data intensive (with a lot of variety in X), curse of dimensionality
17. The parameters of asset pricing models relating expected excess returns to risk factors can be
estimated by regression analysis. Describe the idea. Which implications for the "compatible regression
model" follow from theory with respect to the regression parameters and the distribution of the
disturbance term? How can we test the parameter restriction?

E(R^{ej}_{t+1}) = β_j E(R^{em}_{t+1})   (112)

The compatible model assumes that the (un)conditional mean of ε^j is 0:

R^{ej}_{t+1} = α_j + β_j R^{em}_{t+1} + ε^j_{t+1}   (113)

Parameter restriction: theory implies that the intercept α_j = 0; this can be tested with a t-test (or a Wald test).
18. Despite the fact that the Fama-French model is in line with the fundamental equation from
finance theory, which states that E(R^{ej}) = β_j'λ, it is nevertheless prone to the "measurement without
theory" critique. Why is that so? What do β and λ stand for in the first place?
β = exposure of the asset's payoff to factor k risk
λ = price of factor k risk
Causality problem: if risk increases, does that cause the expected return to increase, or is that because
of other factors (unobserved variables, reversed causality, ...)?


Addendum: if one has a stable correlation, one can predict without causality, e.g. animals leaving the shore
and tsunamis.
Second set
1. Explain the key difference between experimental data and the "process-produced" economic data that is
typically available.
In process-produced data: self-selection (individuals are not randomly assigned).
2. Why is the Tennessee STAR study a proper experiment while the SQ study of the sociology student
(probably) not?
STAR participants were randomly assigned to the two groups (small or big classes). The SQ study suffers
from self-selection bias and lacks a control group and a pre-test.
3. Explain the term "selection bias". Write down the expression and explain its terms to your fellow
student.
Selection bias:
Self-selection: individuals decide themselves which group they want to be part of.
Bias: the parameters are biased because of self-selection; errors in choosing the individuals/groups
that participate.
4. Comparing the average observable outcome y_i (e.g. entry income) of a random sample of individuals
who received a treatment d_i (e.g. attended an SQ course) with the average outcome of a random
sample who did not receive a treatment (did not attend an SQ course) yields a consistent estimate
of E(y_i|d_i = 1) − E(y_i|d_i = 0) (by a law of large numbers). Under which conditions does this yield
a consistent estimate of a treatment effect assumed to be identical for all i? What causes a bias of
the treatment effect? Write down an explicit expression for this bias. How can this bias be avoided?
The individuals have to be randomly assigned to either treatment or non-treatment. Otherwise:
E(y_i|d_i = 1) − E(y_i|d_i = 0) = δ + selection bias, with selection bias = E(y_{0i}|d_i = 1) − E(y_{0i}|d_i = 0)

5. Consider the police-crime example. Put it in a Rubin causal framework. Why would we suspect a
selection bias, i.e. E(y_{0i}|d_i = 1) − E(y_{0i}|d_i = 0) ≠ 0?
Y_i = y_{0i} + (y_{1i} − y_{0i})D_i
Y_i: crime
D_i: 0 = police force does not change; 1 = police force increases
y_{1i} = crime if D_i = 1 and y_{0i} = crime if D_i = 0

Social factors matter, which means police and ε_i are correlated. Police officers are not randomly assigned to
the districts.
6. What is a quasi-experiment?


=a natural experiment:
Without assigning individuals to groups, those groups form in a way they do in experiments (random
assignment)
7. In the lecture I promoted the "experimental ideal" for the measurement of causal effects. For instance,
the US senate passed (in 2004) a law that educational studies have to be based on experimental or
quasi-experimental designs. Discuss the pros and cons of setting up an experiment in a socio-economic
context. Give arguments against conducting such experiments even though they do a good thing, as they
help to estimate the true causal effect of treatments.
Pro: measures causal effects; self-selection bias avoided
Contra: unethical, costly
8. Take the Mincer equation as an example. Why are the years of schooling "endogenous", i.e. the
result of an economic decision process that is not made explicit in the equation? What is the role of
the disturbance term ε in this context? Place this endogeneity in the context of the US senate's law
mentioned above.
Years of schooling depend on ability, which is part of ε (S_i could be written as a function of
IQ). No decision is based on the Mincer equation.
9. Estimating the parameters of the Mincer equation yields an estimate of the coefficient that multiplies
with the years of schooling of 0.12. Interpret that number economically. When would the US senate
rely on this result when it has to decide on budget measures that involve endowing the US education
system with tax money? What are the concerns? What should be done to convince the senate that
the result is trustworthy?
An additional year of schooling increases the wage by 12%. The US senate would require an experiment
which would ask some students to leave school earlier or stay longer than planned (random
assignment). As this is unethical, the selection bias can't be avoided. One might want to collect as
many pieces of information as possible to avoid endogeneity.
10. Provide another economic example (like the crime example or the liberalization example, but one
that is not given in the lecture) where endogeneity is present. Cast your example in a CLRM. What
is the economic interpretation of  in your model?
???
11. There is a special name for model parameters which have an economic meaning. What do we call
them?
Structural parameters
12. Based on a sample of firms, a researcher estimates by OLS the coefficients of a Cobb-Douglas
production function. The CD function is nonlinear, but the regression coefficients are computed
despite the first CLRM assumption of linearity. How is this possible? Estimates of the production
function parameters yield numbers equal to 0.62 and 0.41, respectively. How do you interpret these
estimates?


Log-linear model.
A 1% increase in capital leads to a 0.62% increase in production (analogously 0.41% for the other input).
13. Prove your algebraic knowledge and write the CLRM in matrix and vector notation. Indicate the
dimensions of scalars, vectors and matrices explicitly!

y = Xβ + ε with y = (y_1, ..., y_n)' (n×1), X the (n×K) matrix with rows (1, x_i2, ..., x_iK),
β = (β_1, ..., β_K)' (K×1) and ε = (ε_1, ..., ε_n)' (n×1)   (114)

14. Which objects in the CLRM are observable and which are not?
Observable: y,X
Unobservable: , 
15. In the lecture we had a possible choice of b = (X'X)^{-1}X'y. Write out that expression in detail
using the definition of the matrix X and the vector y. Can you write the right-hand side expression
using x_i = (x_i1, x_i2, ..., x_iK)' and y_i instead of the matrix X and the vector y?

b = (X'X)^{-1}X'y, with X' the (K×n) matrix with rows (1, ..., 1), (x_12, ..., x_n2), ..., (x_1K, ..., x_nK)
and y = (y_1, ..., y_n)'   (115)

b = ((1/n) Σ_i x_i x_i')^{-1} ((1/n) Σ_i x_i y_i)   (116)

16. Using observed data on explanatory variables and the dependent variable you can compute the
OLS estimator as b = (X 0 X)1 X 0 y. This yields a K dimensional column vector of real numbers.
Thats an easy task using a modern computer. Explain why we conceive the OLS estimator also as
a vector of random variables (a random vector), and that the concrete computation of b yields a
particular realization of this vector of random variables. Explain the source of randomness of b.
Random vector, because b is computed as a function of X and y; and X and y are random variables
because we assume they are drawn randomly. When individuals are drawn, those are the realizations
and we can compute the realized b.
17. Linearity (y_i = x_i'β + ε_i) seems a restrictive assumption. Provide an example (not from the lecture)
where an initially nonlinear relation of dependent variable and regressors can be reformulated such
that an equation linear in parameters results. Give another example (not in the lecture) where this is not
feasible.

y = x^α z^β ⇒ ln y = α ln x + β ln z (linear in the parameters after taking logs);
y = x^α + z^β as an impossible variation (cannot be transformed to be linear in the parameters).
18. When are two random variables orthogonal?
For r.v.: E(xy) = 0
For vectors: x'y = 0
19. Strict exogeneity is a restrictive assumption. In another assignment you will be asked to prove the
result (given in the lecture) that strict exogeneity implies that the regressors and the disturbance term
are uncorrelated (or, equivalently, have zero covariance or, also equivalently, are orthogonal). Explain,
using one of the economic examples given in the lecture (crime example or liberalization example), that
assuming that error term and regressor(s) are uncorrelated is doubtful from an economic perspective
(You have to give ε an economic meaning, like in the wage example where ε measured the unobserved
ability of an individual.)
Social factors as part of ε_i influence both crime and police.
20. Why is Z = rank(X) a random variable? Which values can Z take (X is an n×K matrix with
n ≥ K)? What do we assume w.r.t. the probability function of Z? Why do we make this assumption in
the first place and why is it important for the computation of the OLS estimator?
rank(X) is a function of the random matrix X, hence a random variable.
Z = rank(X) can take on the values 1, ..., K. Probability distribution: probability 1 for full
rank, 0 for all other outcomes → degenerate distribution.
We assume P(Z = rank(X) = K) = 1, as OLS is otherwise not computable.
21. How many random variables are contained in the equations
a) y = Xβ + ε: y (n r.v.), X (n×K r.v.), ε (n r.v.)
b) y = Xb + e: y (n r.v.), X (n×K r.v.), b (K r.v.), e (n r.v.)
Third set
1. Show the following results explicitly by writing out in detail the expectations for the case of continuous
and discrete random variables.
Useful:

E(g(x,y)) = ∫∫ g(x,y) f_xy(x,y) dx dy   (117)

E(g(x,y)|x) = ∫ g(x,y) f_xy(x,y)/f_x(x) dy   (118)

a) E_X[E_{Y|X}(Y|X)] = E_Y(Y) (Law of Total Expectations, LTE)

E_{Y|X}(Y|X) = ∫ y f_xy(x,y)/f_x(x) dy
E_X[E_{Y|X}(Y|X)] = ∫ [∫ y f_xy(x,y)/f_x(x) dy] f_x(x) dx = ∫∫ y f_xy(x,y) dx dy = ∫ y f_y(y) dy = E_Y(Y)


b) E_X[E_{Y|X}(g(Y)|X)] = E_Y(g(Y)) (Double Expectation Theorem, DET)

E_{Y|X}(g(Y)|X) = ∫ g(y) f_xy(x,y)/f_x(x) dy
E_X[E_{Y|X}(g(Y)|X)] = ∫ [∫ g(y) f_xy(x,y)/f_x(x) dy] f_x(x) dx = ∫∫ g(y) f_xy(x,y) dx dy = ∫ g(y) f_y(y) dy = E_Y(g(Y))

c) E_{XZ}[E_{Y|X,Z}(Y|X,Z)] = E_Y(Y) (LTE)

E_{Y|X,Z}(Y|X,Z) = ∫ y f_xyz(x,y,z)/f_xz(x,z) dy
E_{XZ}[E_{Y|X,Z}(Y|X,Z)] = ∫∫ [∫ y f_xyz(x,y,z)/f_xz(x,z) dy] f_xz(x,z) dx dz = ∫∫∫ y f_xyz(x,y,z) dx dz dy
= ∫ y f_y(y) dy = E_Y(Y)

d) E_{Z|X}[E_{Y|X,Z}(Y|X,Z)|X] = E_{Y|X}(Y|X) (Law of Iterated Expectations, LIE)

E_{Y|X,Z}(Y|X,Z) = ∫ y f_xyz(x,y,z)/f_xz(x,z) dy
E_{Z|X}[E_{Y|X,Z}(Y|X,Z)|X] = ∫ [∫ y f_xyz(x,y,z)/f_xz(x,z) dy] f_xz(x,z)/f_x(x) dz
= ∫ y (1/f_x(x)) [∫ f_xyz(x,y,z) dz] dy = ∫ y f_xy(x,y)/f_x(x) dy = E_{Y|X}(Y|X)

e) E_X[E_{Y|X}(g(X,Y)|X)] = E_{XY}(g(X,Y)) (Generalized DET)

E_{Y|X}(g(X,Y)|X) = ∫ g(x,y) f_xy(x,y)/f_x(x) dy
E_X[E_{Y|X}(g(X,Y)|X)] = ∫ [∫ g(x,y) f_xy(x,y)/f_x(x) dy] f_x(x) dx = ∫∫ g(x,y) f_xy(x,y) dx dy = E_{XY}(g(X,Y))

f) E_{Y|X}[g(X)Y|X] = g(X)E_{Y|X}(Y|X) (Linearity of Conditional Expectations)

E_{Y|X}[g(X)Y|X] = ∫ g(x) y f_xy(x,y)/f_x(x) dy = g(x) ∫ y f_xy(x,y)/f_x(x) dy = g(x) E_{Y|X}(Y|X)

2. Show that for K = 2, E(ε_i|x_i1, x_i2) = 0 implies

a) E(ε_i) = 0
E(ε_i|x_i1, x_i2) = 0 ⇒ E(ε_i|x_i1) = 0 by LIE
E(ε_i|x_i1) = 0 ⇒ E(ε_i) = 0 by LTE
b) E(ε_i x_i1) = 0
Show: E(ε_i x_i1) = 0
E(ε_i x_i1) = E_{x_i1}[E(ε_i x_i1|x_i1)] by DET
= E_{x_i1}[x_i1 E(ε_i|x_i1)] = E(0) = 0 by linearity of conditional expectations (E(ε_i|x_i1) = 0)

Show: Cov(ε_i, x_i1) = 0

Cov(ε_i, x_i1) = E(x_i1 ε_i) − E(x_i1)·E(ε_i) = 0 (both terms are zero)

c) Show that E(ε_i|X) = 0 implies that Cov(ε_i, x_ik) = 0 ∀i,k.

E_X[E(ε_i|X)] = E(ε_i) = 0 by LTE
E(ε_i x_ik) = E_X[E(ε_i x_ik|X)] by DET
= E_X[x_ik E(ε_i|X)] = E(0) = 0 by linearity of conditional expectations
Cov(x_ik, ε_i) = E(x_ik ε_i) − E(x_ik)E(ε_i) = 0


3. Explain how the positive semi-definiteness of the difference of the two variance-covariance matrices
of two alternative estimators is related to the relative efficiency of one estimator vs. another.
Positive semi-definiteness: a'[Var(β̃|X) − Var(b|X)]a ≥ 0 for all a.
When positive semi-definiteness holds we can choose a = (1,0,...,0)', a = (0,1,0,...,0)' etc., so that
Var(β̃_1|X) ≥ Var(b_1|X), Var(β̃_2|X) ≥ Var(b_2|X) etc. for all k.
Thus positive semi-definiteness of the difference means (relative) efficiency of b.
4. Show by writing out in detail for K = 2, β̃ = (β̃_1, β̃_2)', that a'[Var(β̃|X) − Var(b|X)]a ≥ 0 ∀a ≠ 0 implies
that Var(β̃_1|X) ≥ Var(b_1|X) for a = (1, 0)'.

a'[Var(β̃|X) − Var(b|X)]a ≥ 0   (119)

(1,0) [ (Var(β̃_1|X)  Cov(β̃_1,β̃_2|X); Cov(β̃_2,β̃_1|X)  Var(β̃_2|X))
      − (Var(b_1|X)  Cov(b_1,b_2|X); Cov(b_2,b_1|X)  Var(b_2|X)) ] (1,0)' ≥ 0   (120)

= (1,0) (Var(β̃_1|X) − Var(b_1|X)   Cov(β̃_1,β̃_2|X) − Cov(b_1,b_2|X);
         Cov(β̃_2,β̃_1|X) − Cov(b_2,b_1|X)   Var(β̃_2|X) − Var(b_2|X)) (1,0)'   (121)

With a generic matrix (a b; c d): (1,0)(a b; c d)(1,0)' = a, so here
a = Var(β̃_1|X) − Var(b_1|X) ≥ 0   (122)

⇒ Var(β̃_1|X) ≥ Var(b_1|X)

5. We require a variance-covariance matrix to be symmetric and positive definite. Why is the latter
a sensible restriction? Definition of a positive (semi-)definite matrix: A is a symmetric matrix, x is a
nonzero vector.
1. if x'Ax > 0 for all nonzero x, then A is positive definite
2. if x'Ax ≥ 0 for all nonzero x, then A is positive semi-definite
Variance-covariance matrices are symmetric by construction.
X ~ (μ, Σ); for a' = (a_1, ..., a_n), Z = a'X has Var(Z) = a'Var(X)a, which has to be ≥ 0.

6. How does the conditional independence assumption (CIA) justify the inclusion of additional regressors
(control variables) in addition to the key regressor, i.e. the explanatory variable of most interest
(e.g. small class, schooling, foreign indebtedness)? In which way can one, by invoking the CIA, get
closer to the experimental ideal?
CIA in general: p(a|b,c) = p(a|c) or a ⊥ b | c: a is conditionally independent of b given c.
ε_i ⊥ x_1 | x_2, x_3, ...
The more control variables, the closer we come to the ideal that ε_i ⊥ x_1.


7. The CIA motivates the idea of a matching estimator briefly discussed in the lecture. What is the
simple basic idea of it?
To measure treatment effects without random assignment:
Matching means sorting individuals into similar groups, where one received treatment, and then comparing
the outcomes.
E(y_i|x_i, d_i = 1) − E(y_i|x_i, d_i = 0) = local average treatment effect
Sort groups according to the x's. Replace the expectations by the sample means for the individuals; the
difference gives the treatment effect of a certain group.
E_{x_i}[E(y_i|x_i, d_i = 1) − E(y_i|x_i, d_i = 0)] = average treatment effect
Averaging over all x's, so we get the treatment effect over all groups.
Fourth set
0. Where does the null hypothesis enter into the t-statistic?
In the numerator, as we replace β_k by the hypothesized value β̄_k.
1. Explain what unbiasedness, efficiency, and consistency mean. Give an explanation of how we can
use these concepts to assess the quality of the OLS estimator b = (X'X)^{-1}X'y. (We will be more
specific about consistency in later lectures, this question is just to make you aware of the issue.)
Unbiasedness, E(b) = β: if an estimator is unbiased, its probability distribution has an expected value
equal to the parameter it is supposed to estimate.
Efficiency, Var(β̃|X) ≥ Var(b|X): we would like the distribution of the sample estimates to be as
tightly distributed around the true β as possible.
Consistency, P(|b − β| > ε) → 0 as n → ∞: b is asymptotically unbiased.
b is BLUE = best (efficiency) linear unbiased estimator.
2. Refresh your basic mathematics and statistics:

a) v = aZ. Z is an (n×1) vector of random variables, a a (1×n) vector of real numbers.
b) V = AZ. Z is an (n×1) vector of random variables, A an (m×n) matrix of real numbers.
Compute for v the mean and variance, as well as the mean vector and variance-covariance matrix of V.
a) E(v) = E(aZ) = a E(Z), dimensions (1×n)(n×1)
Var(v) = Var(aZ) = a Var(Z) a', dimensions (1×n)(n×n)(n×1)

b) E(V) = E(AZ) = A E(Z), dimensions (m×n)(n×1)
Var(V) = Var(AZ) = A Var(Z) A', dimensions (m×n)(n×n)(n×m)


3. Under which assumptions can the OLS estimator b = (X'X)^{-1}X'y be called BLUE?
Best estimator: Gauss-Markov theorem (assumptions 1.1-1.4)
Linear: 1.1
Unbiased: 1.1-1.3
4. Which additional assumption is used to establish that the OLS estimator is conditionally normally
distributed? Which key results from mathematical statistics have we employed to show the
normality of the OLS estimator b = (X'X)^{-1}X'y which results from this assumption?
Assumption 1.5 (normality of the error term): ε|X ~ MVN(0, σ²I_n)
As we know b − β = (X'X)^{-1}X'ε and using Fact 6:
b − β|X ~ MVN(E(b − β|X), Var(b − β|X))
b − β|X ~ MVN(0, σ²(X'X)^{-1})
b|X ~ MVN(β, σ²(X'X)^{-1})
5. Why are the elements of b random variables in the first place?
b is a function of X and y, which are randomly drawn; b is the result of a random experiment.
6. For which purpose is it important to know the distribution of the parameter estimate in the first
place?
Hypothesis testing (including confidence intervals, critical values, p-values)
7. What does the expression "a random variable is distributed under the null hypothesis as ..." mean?
A random variable has a distribution, as the realizations are only one way the estimate can turn out.
"Under the null hypothesis" means that we suppose that the null hypothesis is true, i.e. β_k = β̄_k.
8. Have we said something about the unconditional distribution of the OLS estimate b? Have we said
something about the unconditional means? The unconditional variances and covariances?
Unconditional distribution: no
Unconditional means: yes (LTE)
Unconditional variances: no
9. What can we say about the conditional (on X) and unconditional distribution of the t and z statistics
(under the null hypothesis)? What is their respective mean and variance?
The type of distribution stays the same when conditioning:
z_k ~ N(0,1) conditionally and unconditionally, mean 0 and variance 1;
t_k ~ t(n−K) conditionally and unconditionally, mean 0 and variance (n−K)/(n−K−2).
10. Explain
- what is meant by a type 1 error and a type 2 error?
- what is meant by significance level = size?
- what is meant by the power of a test?
Type 1 error α: H0 rejected although true; type 2 error: H0 accepted although false.
Significance level/size: the probability of making a type 1 error.
Power = 1 − P(type 2 error): rejecting H0 when it is false.

11. What are the two key properties that a good statistical test has to provide?
Tests should have a high power (=reject a false null) and should keep its size
12. There are two main schools of statistics: frequentist and Bayesian. Describe their main differences!
Frequentists: fix the probability (significance level) → H0 can be rejected → rejection approach
(favored)
Bayesians: the probability changes with the evidence we get from the data → H0 cannot be rejected,
but we get closer to it → confirmation approach
13. What is the null and the alternative hypothesis of the t-test presented in the lecture? How is the
test statistic constructed?
H0: β_k = β̄_k, HA: β_k ≠ β̄_k
b_k|X ~ N(β_k, σ²[(X'X)^{-1}]_kk)
Under H0: b_k|X ~ N(β̄_k, σ²[(X'X)^{-1}]_kk)
z_k = (b_k − β̄_k)/√(σ²[(X'X)^{-1}]_kk) ~ N(0,1)

14. On which grounds could you criticize a researcher for choosing a significance level of 0.00000001
(or 0.95, respectively)? Give examples for more conventional significance levels.
0.00000001: hypothesis almost never rejected (type 1 error too low)
0.95: hypothesis is too often rejected (type 1 error too high)
15. Assume the value of a t-test statistic equals 3. The number of observations n in the study is 105,
the number of explanatory variables is 5. Give a quick interpretation of this result.
t-stat = 3; n = 105 and K = 5 leads to df = 100.
We can use the standard normal distribution as df > 30.
For the 5% significance level the critical values are ±2 (rule of thumb), so we would reject the null
hypothesis.
16. Assume the value of a t-test statistic equals 0.4 and n−K = 100. Give a quick interpretation of this
result.
We can't reject H0 at the 5% significance level (the t-statistic is approximately normal and thus we can
use the rule of thumb for the critical values).
17. In a regression analysis with K = 4 you obtain parameter estimates b_2 = 0.2 and b_3 = 0.04. You
are interested in testing (separately) the hypotheses β_2 = 0.1 against β_2 ≠ 0.1 and β_3 = 0 against
β_3 ≠ 0.
You obtain

s²(X'X)^{-1} =
[ 0.00125  0.023   0.0017  0.0005 ]
[ 0.023    0.0016  0.015   0.0097 ]
[ 0.0017   0.015   0.0064  0.0006 ]
[ 0.0005   0.0097  0.0006  0.001  ]   (123)

where s² is an unbiased estimate of σ². Compute the standard errors of the two estimates, the two
t-statistics and the associated p-values. Note that the t-test (as we have defined it) is two-sided. I
assume that you are familiar with p-values, students of Quantitative Methods definitely should be.
Otherwise, refresh your knowledge!

b_2 = 0.2, β̄_2 = 0.1, standard error of b_2 = √0.0016 = 0.04
b_3 = 0.04, β̄_3 = 0, standard error of b_3 = √0.0064 = 0.08
t-stat 2: (0.2 − 0.1)/0.04 = 2.5
t-stat 3: (0.04 − 0)/0.08 = 0.5
p-value 2 (two-sided): 2·(1 − 0.9938) = 0.0124
p-value 3 (two-sided): 2·(1 − 0.6915) = 0.617
Computing the p-value is computing the implied α. (A small computation sketch follows below.)
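A small sketch verifying the numbers above (standard errors from the diagonal of s²(X'X)^{-1}, two-sided p-values from the standard normal cdf):

import math

def norm_cdf(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

se_b2 = math.sqrt(0.0016)         # 0.04
se_b3 = math.sqrt(0.0064)         # 0.08
t2 = (0.2 - 0.1) / se_b2          # 2.5
t3 = (0.04 - 0.0) / se_b3         # 0.5
p2 = 2 * (1 - norm_cdf(abs(t2)))  # ~ 0.0124
p3 = 2 * (1 - norm_cdf(abs(t3)))  # ~ 0.617
print(t2, p2, t3, p3)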
18. List the assumptions that are necessary for the result that the z-statistic
z_k = (b_k − β̄_k)/√(σ²[(X'X)^{-1}]_kk) is standard normally distributed under the null hypothesis, i.e. z_k ~ N(0,1).
Required: assumptions 1.1-1.5.
In addition: σ², i.e. Var(b|X), has to be known (alternatively, with the estimate s² and n−K > 30 the
t-statistic is approximately standard normal).
19. Is the t-statistic nuisance-parameter-free?
Yes, as the nuisance parameter σ² is replaced by s².
Definition: a nuisance parameter is any parameter which is not of immediate interest but must be
accounted for in the analysis of those parameters which are of interest.
20. I argue that using the quantile table of the standard normal distribution instead of the quantile
table of the t-distribution doesn't make much of a difference in many applications: the respective
quantiles are very similar anyway. Do you agree? Discuss.
This is correct for n−K > 30. For lower values of n−K the t-distribution is more spread out in the tails.
21. I argue that with respect to the distribution of the t-statistic under the null, substituting for σ²
the unbiased estimator s² = e'e/(n−K) does not make much of a difference. When would you
subscribe to that argument? Discuss.
For n → ∞ the difference between s² and the biased estimator e'e/n vanishes. Thus the statement is OK
for large n. For small n the unbiased estimator should be used.
22.
a. Show that P = X(X'X)^{-1}X' and M = I_n − P are symmetric (i.e. A = A') and idempotent (i.e.
A = A·A). A useful result is: (BA)' = A'B'.
b. Show that ŷ = Xb = Py and e = y − Xb = My are orthogonal vectors, i.e. ŷ'e = 0.
c. Show that e = Mε and Σ e_i² = e'e = ε'Mε.
d. Show that if x_i1 = 1 we have Σ e_i = 0 (use the OLS FOC) and that ȳ = x̄'b (i.e. the regression
hyperplane passes through the sample means).
a.
P symmetric:
P' = [X(X'X)^{-1}X']' = X[(X'X)^{-1}]'X' = X[(X'X)']^{-1}X' = X(X'X)^{-1}X' = P
M symmetric:
M' = [I_n − X(X'X)^{-1}X']' = I_n − X(X'X)^{-1}X' = M
P idempotent:
PP = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X(X'X)^{-1}X' = P
M idempotent:
MM = (I_n − P)(I_n − P) = I_n − P − P + PP = I_n − P = M
b.
ŷ'e = (Py)'(My) = y'P'My = y'PMy, and PM = P(I_n − P) = P − PP = 0 ⇒ ŷ'e = 0
c.
e = My = M(Xβ + ε) = MXβ + Mε = Mε, since MX = (I_n − X(X'X)^{-1}X')X = X − X = 0
Σ e_i² = e'e = (Mε)'(Mε) = ε'M'Mε = ε'Mε
d.
Starting point (FOC): X'(y − Xb) = X'y − X'Xb = 0   (124)
First row only (x_i1 = 1):
Σ y_i − [n, Σ x_i2, ..., Σ x_iK] b = 0 ⇒ ȳ = b_1 + b_2 x̄_2 + ... + b_K x̄_K   (125)
Moreover, y_i = x_i'b + e_i = ŷ_i + e_i
(1/n) Σ y_i = (1/n) Σ ŷ_i + (1/n) Σ e_i, with (1/n) Σ e_i = 0 if x_i1 = 1
⇒ ȳ equals the mean of the fitted values (a numerical check follows below)
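A numerical sketch of a.-d. on simulated data (this just checks the algebra; it is not part of the original notes):

import numpy as np

rng = np.random.default_rng(9)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant included
eps = rng.normal(size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P
yhat, e = P @ y, M @ y
b = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(P, P.T), np.allclose(P @ P, P))  # P symmetric, idempotent
print(np.isclose(yhat @ e, 0.0))                   # yhat'e = 0
print(np.allclose(e, M @ eps))                     # e = M*eps
print(np.isclose(y.mean(), X.mean(axis=0) @ b))    # ybar = xbar'b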
Fifth set
1. What is the difference between the standard deviation of a parameter estimate bk and the standard
error of the estimate?

Standard deviation is the square root of the true variance and the standard error is the square root
of the estimated variance (and thus is dependent on the sample size).
2. Discuss the pros and cons of alternative ways to present the results for a t-test:
a) parameter estimate and
*** for significant parameter estimate (at =1%)
** for significant parameter estimate (at =5%)
* for significant parameter estimate (at =10%)
b) parameter estimate and p-value
c) parameter estimate and t-statistic
d) parameter estimate and parameter standard error
e) your preferred choice
a. Pro: choice for all conventional significance levels and easy to recognize; Con: not as much information
as with a p-value (condensed information)
b. Pro: p-value with information for any significance level; Con: not as fast/recognizable as stars; for
a specific null hypothesis
c. Pro: the t-stat is the basis for p-values; Con: no clear choice possible based on the information given
(table); for a specific null hypothesis
d. Pro: is the basis for t-stats and p-values and shows the variance, and can be used to test different
null hypotheses; Con: no clear choice possible based on the information given (table, t-stat)
e. I would take case b. because it gives the best information very fast.
4. Consider Fama-French's asset pricing model and its compatible regression representation (see
your lecture notes of the first week). Suppose you want to test the restriction that none of these three
risk factors plays a role in explaining the expected excess return of the asset (that is, the parameters
in the regression equation are all zero). State the null and alternative hypothesis in proper statistical
terms and construct the Wald statistic for that hypothesis, i.e. define R and r.

R^{ej}_{t+1} = β_1 R^{em}_{t+1} + β_2 HML_{t+1} + β_3 SMB_{t+1} + ε^j_{t+1}
Hypothesis:
H0: β_1 = β_2 = β_3 = 0; HA: at least one β ≠ 0, i.e.
H0: Rβ = r; HA: Rβ ≠ r

R = [1 0 0; 0 1 0; 0 0 1]; β = (β_1, β_2, β_3)'; r = (0, 0, 0)'; b = (b_1, b_2, b_3)'   (126)

5. How is the result from multivariate statistics that (z − μ)'Σ^{-1}(z − μ) ~ χ²(rows(z)) with z ~
MVN(μ, Σ) used to construct the Wald statistic to test linear hypotheses about the parameters?
We compute E(Rb|X) = R E(b|X) = Rβ; Var(Rb|X) = R Var(b|X) R' = R σ²(X'X)^{-1} R';
Rb|X ~ MVN(Rβ, R σ²(X'X)^{-1} R')
Under the null hypothesis:
E(Rb|X) = r (= hypothesized value)
Var(Rb|X) = R σ²(X'X)^{-1} R' (= unaffected by the hypothesis)
Rb|X ~ MVN(r, R σ²(X'X)^{-1} R') (= this is where the null hypothesis goes in)
Using Fact 4:
[Rb − E(Rb|X)]'[Var(Rb|X)]^{-1}[Rb − E(Rb|X)]
= [Rb − r]'[R σ²(X'X)^{-1} R']^{-1}[Rb − r] ~ χ²(#r)
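A numerical sketch of the Wald statistic for H0: β_2 = β_3 = 0 in a simulated model with a constant (all numbers are assumed):

import numpy as np

rng = np.random.default_rng(10)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)  # H0 is true here

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - 3)
V = s2 * XtX_inv  # estimated Var(b|X)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
d = R @ b - r
W = d @ np.linalg.solve(R @ V @ R.T, d)
print(W)  # compare with the chi^2(2) critical value 5.99 (5% level)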
Sixth set
Suppose you have estimated a parameter vector b = (0.55, 0.37, 1.46, 0.01)' with an estimated variance-
covariance matrix

Var̂(b|X) =
[ 0.01    0.023   0.0017  0.0005 ]
[ 0.023   0.0025  0.015   0.0097 ]
[ 0.0017  0.015   0.64    0.0006 ]
[ 0.0005  0.0097  0.0006  0.001  ]   (127)

a) Compute the 95% confidence interval for each parameter b_k.
b) What does the specific confidence interval computed in a) tell you?
c) Why are the bounds of a confidence interval for β_k random variables?
d) Another estimation yields an estimated b_k with the corresponding standard error se(b_k). You
conclude from computing the t-statistic t_k = (b_k − β̄_k)/se(b_k) that you can reject the null hypothesis
H0: β_k = β̄_k at the α% significance level. Now you compute the (1−α)% confidence interval. Will β̄_k
lie inside or outside the confidence interval?
a. Confidence intervals (bounds as realizations of r.v.):
b_1: 0.55 ± 1.96·0.1 = [0.354; 0.746]
b_2: 0.37 ± 1.96·0.05 = [0.272; 0.468]
b_3: 1.46 ± 1.96·0.8 = [−0.108; 3.028]
b_4: 0.01 ± 1.96·0.032 = [−0.0527; 0.0727]

b. For H0: β_k = 0, the first two reject H0, the other two don't reject at the 5% significance level
(two-sided test).
c. The bounds are a function of X (= r.v.): se(b_k) = √(s²[(X'X)^{-1}]_kk)
d. β̄_k lies outside, as we rejected based on the t-statistic. (A computation sketch for a) follows below.)
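A small sketch reproducing a) (the numbers are taken from the matrix above):

import numpy as np

b = np.array([0.55, 0.37, 1.46, 0.01])
V = np.array([[0.01,   0.023,  0.0017, 0.0005],
              [0.023,  0.0025, 0.015,  0.0097],
              [0.0017, 0.015,  0.64,   0.0006],
              [0.0005, 0.0097, 0.0006, 0.001]])
se = np.sqrt(np.diag(V))
for bk, sek in zip(b, se):
    print(round(bk - 1.96 * sek, 4), round(bk + 1.96 * sek, 4))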
2. Suppose computing the lower bound of the 95% confidence interval yields b_k − t_{α/2}(n−K)·se(b_k)
= −0.01, and the upper bound is b_k + t_{α/2}(n−K)·se(b_k) = 0.01. Which of the following statements
are correct?

1. With probability of 5% the true parameter β_k lies in the interval between −0.01 and 0.01.
2. The null hypothesis H0: β_k = β̄_k cannot be rejected for values −0.01 ≤ β̄_k ≤ 0.01 at
the 5% significance level.
3. The null hypothesis H0: β_k = 1 can be rejected at the 5% significance level.
4. The true parameter β_k is with probability 1 − α = 0.95 greater than −0.01 and smaller than
0.01.
5. The stochastic bounds of the 1 − α confidence interval overlap the true parameter with probability
1 − α.
6. If the hypothesized parameter value β̄_k falls within the range of the 1 − α confidence interval
computed from the estimates b_k and se(b_k), then we do not reject H0: β_k = β̄_k at the
significance level of 5%.

1. false
2. correct
3. correct
4. false
5. correct (if we suppose that the null hypothesis is true)
6. correct

Seventh Set
1.
a) Show that if the regression includes a constant:
y_i = β_1 + β_2 x_i2 + ... + β_K x_iK + ε_i
then the variance of the dependent variable can be written as:

(1/N) Σ_{i=1}^N (y_i − ȳ)² = (1/N) Σ_{i=1}^N (ŷ_i − ȳ)² + (1/N) Σ_{i=1}^N e_i²   (128)

Hint: ȳ equals the mean of the fitted values ŷ_i
b) Take your result from a) and formulate an expression for the coefficient of determination R².
c) Suppose you estimated a regression with an R² = 0.63. Interpret this value.
d) Suppose you estimate the same model as in c) without a constant. You know that you cannot
compute a meaningful centered R². Therefore, you compute the uncentered R²_uc = ŷ'ŷ/y'y = 0.84.
Compare the two goodness-of-fit measures in c) and d). Would you conclude that the constant can
be excluded because R²_uc > R²?

a)

(1/N) Σ (y_i − ȳ)²   [SST]   (129)
= (1/N) Σ (ŷ_i + e_i − ȳ)² = (1/N) Σ ((ŷ_i − ȳ) + e_i)² = (1/N) Σ (ŷ_i − ȳ)² + (1/N) Σ e_i²
  + (2/N) Σ (ŷ_i − ȳ)e_i   (130)
= (1/N) Σ (ŷ_i − ȳ)² + (1/N) Σ e_i² + (2/N) Σ ŷ_i e_i − (2/N) ȳ Σ e_i   (131)
= (1/N) Σ (ŷ_i − ȳ)² + (1/N) Σ e_i² + (2/N) b'X'e − (2/N) ȳ Σ e_i   (132)
with b'X'e = 0 by the FOC and Σ e_i = 0 with a constant:
= (1/N) Σ (ŷ_i − ȳ)²   [SSE]   + (1/N) Σ e_i²   [SSR]   (133)

(This is the desired result, as ȳ equals the mean of the ŷ_i if x_i1 = 1.)

b)

R² = SSE/SST = 1 − SSR/SST   (134)

c) 63% of the (sample) variance of the dependent variable is explained by the regression.
d) R² and R²_uc can never be compared as they are based on different models. (A numerical check of
a)-b) follows below.)
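A numerical sketch checking SST = SSE + SSR and R² on simulated data (with a constant; all numbers are assumed):

import numpy as np

rng = np.random.default_rng(6)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
e = y - yhat

sst = np.sum((y - y.mean())**2)
sse = np.sum((yhat - y.mean())**2)  # explained sum of squares
ssr = np.sum(e**2)                  # residual sum of squares
print(np.isclose(sst, sse + ssr))   # decomposition holds with a constant
print(sse / sst, 1 - ssr / sst)     # the two R^2 expressions coincide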

2. In a hedonic price model the price of an asset is explained by its characteristics. In the following
we assume that housing prices can be explained by size sqrft (measured in square feet), the
number of bedrooms bdrms and the size of the lot lotsize (also measured in square feet). Therefore,
we estimate the following equation with OLS:
log(price) = β_1 + β_2 log(sqrft) + β_3 bdrms + β_4 log(lotsize) + ε
Results of the estimation can be found in the following table:
(a) Interpret the estimates b_2 and b_3.
(b) Compute the missing values for Std. Error and t-Statistic in the table and comment on the
statistical significance of the estimated coefficients (H0: β_j = 0 vs. H1: β_j ≠ 0, j = 1,2,3,4).
(c) Test the null hypothesis H0: β_1 = 1 vs. H1: β_1 ≠ 1.
(d) Compute the p-value for the estimate b_2 and interpret the result.
(e) What is the null hypothesis of that specific F-statistic missing in the table? How does it relate
to the R²? Compute the missing value of the statistic and interpret the result.
(f) Interpret the value of R-squared.
(g) An alternative specification of the model that excludes the lot size as an explanatory variable
provides you with values for the Akaike information criterion of -0.313 and a Schwarz criterion of
-0.229. Which specification would you prefer?
a) Everything else constant, an increase of 1% in size leads to a 0.7% increase in price.
Everything else constant, a one-unit increase in bedrooms leads to a 3.696% increase in price.
b) Starting point: Coefficient/Std.Err = t-stat
se(b_2) = 0.0929
t-stat(b_3) = 1.3425
The constant, sqrft and lotsize are significant at the 5% significance level; bdrms is not significant.
c) t = (−1.297041 − 1)/0.65128 = −3.527. As the critical value is ±1.98, we reject at the 5% significance
level.
d) b_2 = 0.70023, t-stat 7.5403 ⇒ p-value almost 0 (in the table for N(0,1))
e) H0: β_2 = β_3 = β_4 = 0, i.e.

R = [0 1 0 0; 0 0 1 0; 0 0 0 1], β = (β_1, β_2, β_3, β_4)', r = (0, 0, 0)'   (135)

F = [R²/(K−1)] / [(1 − R²)/(n − K)] = (0.64297/3) / ((1 − 0.64297)/84) ≈ 50.4   (136)

So we reject for any conventional significance level.
Alternative: get SSR_R = Σ (y_i − ȳ)² = n · (sample variance of y_i)

f) 64% of the variance of the dependent variable is explained by the regression.
g) -0.313 vs. -0.4968 for Akaike and -0.229 vs. -0.3842 for Schwarz. We prefer the first model (with
lot size) as both AIC and SIC are smaller. (A sketch for e) follows below.)
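A one-line check of the F statistic in e) (values from the table above; 3 restrictions, n − K = 84):

R2, q, n_minus_K = 0.64297, 3, 84
F = (R2 / q) / ((1 - R2) / n_minus_K)
print(F)  # ~ 50.4, far beyond any conventional critical value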


3. What can you do to narrow the parameter confidence bounds? Discuss the possibilities:
- Increase α
- Increase n
Can you explain how the effect of an increase of the sample size on the confidence bounds works?
- Increase α: not good, as the type 1 error increases
- Increase n: the standard error decreases → the bounds narrow
4. When would you use the uncentered R²_uc and when would you use the centered R²? Why is the
uncentered R²_uc higher than the centered R²? What is the range of R² and R²_uc?
R²_uc: without a constant → higher (as the level is explained too)
R²_c: with a constant → lower (explains only the variance around the level)
The range for both is from 0 to 1.
5. How do you interpret an R² of 0.38?
38% of the variance of the dependent variable is explained by the regression.
6. Why would you use an adjusted R²_adj? What is the idea behind the adjustment of the R²_adj? Which
values can the R²_adj take?
R²_adj: any value up to 1 (it can become negative). It adds a penalty term for the use of degrees of
freedom (heavy parameterization) (Occam's razor).
7. What is the intuition behind the computation of AIC and SBC?
Find the smallest SSR, but add a penalty term for the use of degrees of freedom (= heavy
parameterization).
Eighth set
1. A researcher conducts an OLS regression with K = 4 with a computer software that is unfortunately
not able to report p-values. Besides the four coefficients and their standard errors, the program only
reports the t-statistics that test the null hypothesis H0: β_k = 0 against HA: β_k ≠ 0 for k = 1,2,3,4.
Interpret the t-statistics below and compute the associated p-values. (Interpret the p-values for a
reader who works with a significance level α = 5%.)
a) t1 = −1.99
b) t2 = 0.99
c) t3 = −3.22
d) t4 = 2.3
The t-stat shows where we are on the x-axis of the cdf.
Assumption: n−K > 30. Then we reject for β_1, β_3, β_4 using the critical value ±1.96.
P-values (two-sided):
a) Φ(1.99) = 0.9767; 2·(1 − 0.9767) = 0.0466 → reject
b) Φ(0.99) = 0.8389; 2·(1 − 0.8389) = 0.3222 → don't reject
c) Φ(3.22) is close to one, so the p-value is close to zero → reject
d) Φ(2.3) = 0.9893; 2·(1 − 0.9893) = 0.0214 → reject


2. Explain in your own words: What is convergence in probability? What is convergence almost surely?
Which concept is stronger?
Convergence in probability: the probability that the n-th element of the sequence lies within an
ε-neighborhood of a certain number (e.g. the mean) goes to 1 as n goes to infinity.
Almost sure convergence: the probability that the sequence converges to a certain number (e.g. the
mean) is equal to 1.

a.s. is stronger, as it requires Z_n itself to converge to α, whereas →p only requires Z_n to lie within
bounds around α with probability approaching 1.
3. Illustrate graphically the concept of convergence in probability. Illustrate graphically a random
sequence that does not converge in probability.
See the lecture notes. (A small simulation sketch follows below.)
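A simulation sketch of convergence in probability (an assumed example: exponential draws with mean 1): the fraction of sample means farther than ε from the true mean shrinks as n grows.

import numpy as np

rng = np.random.default_rng(7)
eps, reps = 0.1, 5_000
for n in (10, 100, 1000):
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)  # mu = 1
    print(n, np.mean(np.abs(means - 1.0) > eps))  # P(|z_n - mu| > eps) -> 0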
4. Review the notion of non-stochastic convergence. In which way do we refer to the notion of non-
stochastic convergence when we consider almost sure convergence, convergence in probability and
convergence in mean square? What are the non-stochastic sequences in each of the three modes of
stochastic convergence?
In non-stochastic convergence we review what happens to a sequence when n goes to infinity (there is
no randomness in the elements of the sequence).

In →a.s., →p and →m.s. we have n going to infinity as the non-stochastic component. As the stochastic
component we have to consider the different "worlds" of infinite draws.
Examples of non-stochastic sequences:
- the sequence of probabilities P(|Z_n − α| < ε) → 1 (convergence in probability)
- for a fixed world ω, the sequence of realizations Z_n(ω) → α (almost sure convergence)
5. Explain in your own words: What does convergence in mean square mean? Does convergence in
mean square imply convergence in probability? Or does convergence in probability imply convergence
in mean square?
Convergence in mean square: the expected squared deviation E[(Z_n − α)²] goes to 0 as n grows to
infinity. Convergence in mean square implies convergence in probability (not the other way round).
6. Illustrate graphically the concept of convergence in distribution. What does convergence in distribution
mean? Think of an example and provide a graphical illustration where the c.d.f. of the sequence
of random variables does not converge in distribution.
Convergence in distribution: the sequence Z_n converges to Z if the distribution function F_n of Z_n
converges to the cdf of Z at each point of F_Z.
With this mode of convergence, we increasingly expect the next outcome in a sequence of
random experiments to be better and better modeled by the given probability distribution.
Examples where convergence in distribution doesn't work:
1. We draw random variables with probability 0.5 each from a normal distribution with either mean 0
or with mean 1.
2. The distribution of b_n, as it doesn't have a limit distribution (only a point mass).
3. We draw from a distribution for which the variance doesn't exist. Computing the means doesn't
give a limit distribution.
7. I argue that extending convergence almost surely/in probability/in mean square to vector sequences (or
matrices) does not increase complexity as the basic concept remains the same. However, extending
convergence in distribution of a random sequence to a random vector sequence entails an increase in
complexity. Why?

For →a.s., →p and →m.s., extending the concepts to vectors only requires element-wise convergence.

For →d, all K elements of the vector have to be lower than or equal to a constant at the same time for
every of the n elements of the sequence Z_n (joint distribution).
8. Convergence in distribution implies the existence of a limit distribution. What do we mean by
that?
{Z_n} has a distribution F_n. The limit distribution F of Z_n is the distribution to which we assume
that F_n converges at every point (F and F_n belong to the same class of distributions and we just
adjust the parameters).
9. Suppose that the random sequence z_n converges in distribution to z, where z is a χ²(2) random
variable. Write this formally using two alternative notations.

z_n →d z ~ χ²(2)

z_n →d χ²(2)
10. What assumptions have to be fulfilled so that you can apply Khinchin's WLLN?
- {Z_i} is a sequence of iid random variables
- E(Z_i) = μ < ∞
11. I argue that extending the WLLN to vector random sequences does not add complexity to the
case of a scalar sequence, it just saves space because of notational convenience. Why?
For a vector we have to use element-wise convergence (of the means towards the expectation values).
We use the same argument as for convergence in probability (we just have to define the means and
expectations for the rows for element-wise convergence).
Ninth set
1. What does the Lindeberg-Levy (LL) central limit theorem state? What assumptions have to be
fulfilled so that you can apply the LL CLT?

Formula: √n(z̄_n − μ) →d N(0, σ²).
The difference between the mean of a sequence and its expected value, scaled by √n, follows a normal
distribution in the limit, and thus the mean of the sequence is approximately normal as well:
z̄_n ~a N(μ, σ²/n).
Assumptions:
- {Z_i} iid
- E(Z_i) = μ < ∞
- Var(Z_i) = σ² < ∞
- WLLN can be applied
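A simulation sketch of the Lindeberg-Levy CLT (an assumed example: skewed exponential draws): √n(z̄_n − μ) has variance close to σ² and its distribution approaches the normal.

import numpy as np

rng = np.random.default_rng(8)
reps = 20_000
for n in (5, 50, 500):
    draws = rng.exponential(scale=1.0, size=(reps, n))  # mu = 1, sigma^2 = 1
    z = np.sqrt(n) * (draws.mean(axis=1) - 1.0)
    print(n, z.var().round(3), np.mean(z <= 1.645).round(3))  # var ~ 1, P ~ 0.95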
2. Name the concept that is associated with the following shorthand notations and explain their meaning:

a. z_n →d N(0,1): convergence in distribution
b. plim_{n→∞} z_n = α: convergence in probability
c. z_n →m.s. α: convergence in mean square
d. z_n →d z: convergence in distribution
e. √n(z̄_n − μ) →d MVN(0, Σ): Lindeberg-Levy CLT (multivariate CLT)
f. y_n →p α: convergence in probability
g. z̄_n ~a N(μ, σ²/n): CLT (follows from the univariate CLT)
h. z_n →a.s. α: almost sure convergence
i. z_n² →d χ²(1): convergence in distribution/Lemma 2
3. Apply the "useful lemmas" of the lecture. State the name of the respective lemma(s) that you use.
Whenever matrix multiplications or summations are involved, assume that the respective operations
are possible. The underscore means that we deal with vector or matrix sequences. γ and A indicate
vectors or matrices of real numbers. I_m is the identity matrix of dimension m:

a. z̄_n →p α, then: plim exp(z̄_n) = ?
b. z_n →d z ~ N(0,1), then: z_n² →d ?
c. z̄_n →d z ~ MVN(μ, Σ), A_n →p A, then: A_n(z̄_n − μ) ~a ?
d. x_n →d MVN(μ, Σ), y_n →p γ, A_n →p A, then: A_n x_n + y_n →d ?
e. x_n →d x, y_n →p 0 (o_p(1)), then: x_n + y_n →d ?
f. x_n →d x ~ MVN(μ, Σ), y_n →p 0, then: plim x_n'y_n = ?
g. x_n →d MVN(0, I_m), A_n →p A, then: z_n = A_n x_n →d ?
and then: z_n'(A_n A_n')^{-1} z_n ~a ?

a. Using Lemma 1: plim exp(z̄_n) = exp(α) (as exp(·) is continuous and does not depend on n)
b. Using Lemma 2: z_n² →d [N(0,1)]² = χ²(1)
c. Using Lemma 2: z̄_n − μ →d MVN(0, Σ) and then using Lemma 5: A_n(z̄_n − μ) ~a MVN(0, AΣA')
d. Using Lemma 5: A_n x_n →d Ax ~ MVN(Aμ, AΣA') and then using Lemma 3: A_n x_n + y_n →d
Ax + γ ~ MVN(Aμ + γ, AΣA')
e. Using Lemma 3: x_n + y_n →d x + 0 = x
f. Using Lemma 4: plim x_n'y_n = 0
g. Using Lemma 5: z_n = A_n x_n →d Ax ~ MVN(0, AA') and then using Lemma 6 + Fact 4:
z_n'(A_n A_n')^{-1} z_n ~a (Ax)'(AA')^{-1}(Ax) ~ χ²(rows(z)), i.e. z'[Var(z)]^{-1}z with z = Ax

4. When large sample theory is used to derive the properties of the OLS estimator, the set of assumptions
for the finite sample properties of the OLS estimator is altered. Which assumptions are
retained? Which are replaced?
Two are retained:
(1.1)→(2.1) linearity
(1.3)→(2.4) rank condition
Four are replaced:
iid sampling→(2.2) dependencies allowed (stationarity and ergodicity)
(1.2)→(2.3) strict exogeneity is replaced by orthogonality: E(x_ik ε_i) = 0
(1.4)→deleted
(1.5)→deleted
5. What are the properties of b_n = (X'X)^{-1}X'y under the "new" assumptions?
b_n →p β (consistency)
b_n ~a MVN(β, Avar(b)/n) (approximate normal distribution)
Unbiasedness & efficiency are lost.
6. What does CAN mean?
CAN = consistent and asymptotically normal
7. Does consistency of b_n need an iid sample of dependent variable and regressors?
iid is a special case of a martingale difference sequence. Necessary is only a stationary & ergodic
m.d.s., which means an identically but not necessarily independently distributed sample.
8. Where does a WLLN and where does a CLT come into play when deriving the properties of
b = (X'X)^{-1}X'y?
Deriving consistency: we use Lemma 1 twice. But to apply Lemma 1 on [(1/n) Σ x_i x_i'], we have to
assure that (1/n) Σ x_i x_i' →p E(x_i x_i'), which we do by applying a WLLN.

Deriving the distribution: we define (1/n) Σ x_i ε_i = ḡ and apply a CLT to get
√n(ḡ − E(g_i)) →d MVN(0, E(g_i g_i')).
This is then used together with Lemma 5.
Tenth Set
1. At which stage of the derivation of the consistency property of the OLS estimator do we have to
invoke a WLLN?
2. What does it mean when an estimator has the CAN property?
3. At which stage of the derivation of the asymptotic normality of the OLS estimator do we have to
invoke a WLLN and when a CLT?
1-3: see Ninth set


4. Which of the useful lemmas 1-6 is used at which stage of a) the consistency proof and b) the
asymptotic normality proof?

a) We use Lemma 1 to show that, together with the WLLN: [(1/n) Σ x_i x_i']^{-1} →p [E(x_i x_i')]^{-1}

b) We know that √n(b − β) = [(1/n) Σ x_i x_i']^{-1} · √n ḡ = A_n x_n with:

A_n →p A = Σ_xx^{-1} and

x_n = √n ḡ →d x ~ MVN(0, E(g_i g_i'))

So we can apply Lemma 5: √n(b − β) →d MVN(0, Σ_xx^{-1} E(g_i g_i') Σ_xx^{-1})

5. Explain the difference of the assumptions regarding the variances of the disturbances in the finite
sample context and using asymptotic reasoning.
Finite sample variance: E(ε_i²|X) = σ² (assumption 1.4)
Large sample variance: E(ε_i²|x_i) = σ²
We only condition on x_i and not on all x's at the same time.
6. There is a special case when the finite sample variance of the OLS estimator based on finite
sample assumptions and based on large sample theory assumptions (almost) coincide. When does this
happen?
When we have only one observation i.
7. What would you counter an argument of someone who says that working with the variance-covariance
estimate s²(X'X)^{-1} is quite OK as it is mainly consistency of the parameter estimates that
counts?
Using s²(X'X)^{-1} we would have to assume conditional homoskedasticity; using the sandwich estimator
S_xx^{-1} Ŝ S_xx^{-1} (with S_xx = (1/n) Σ x_i x_i' and Ŝ = (1/n) Σ e_i² x_i x_i') allows us to get rid
of that assumption.
8. Consider the following assumptions:
(a) linearity
(b) rank condition: the K×K matrix E(x_i x_i') = Σ_xx is nonsingular
(c) predetermined regressors: E(g_i) = 0 where g_i = x_i ε_i
(d) g_i is a martingale difference sequence with finite second moments
i) Show that under those assumptions the OLS estimator is approximately normally distributed:

√n(b − β) →d N(0, Σ_xx^{-1} E(ε_i² x_i x_i') Σ_xx^{-1})   (137)

Starting point: the sampling error

We have √n(b − β) = [(1/n) Σ x_i x_i']^{-1} · √n ḡ = A_n x_n. For:

A_n →p A = Σ_xx^{-1} (only possible if Σ_xx is nonsingular)

x_n = √n ḡ →d x ~ MVN(0, E(g_i g_i')) (as √n(ḡ − E(g_i)) →d MVN(0, E(g_i g_i')) by the CLT)



So we can apply Lemma 5: √n(b − β) →d MVN(0, Σ_xx^{-1} E(g_i g_i') Σ_xx^{-1}) = MVN(0, Avar(b))
b_n ~a MVN(β, Avar(b)/n)
ii) Further, show that assumption (d) implies that the ε_i are serially uncorrelated, i.e. E(ε_i ε_{i−j}) = 0.
E(g_i|g_{i−1}, ..., g_1) = 0. As x_i1 = 1 we focus on the first row:
E(ε_i|ε_{i−1}, ..., ε_1, ε_{i−1}x_{i−1,2}, ..., ε_1 x_{1K}) = 0
By LIE, E_{z|x}[E(y|x,z)|x] = E(y|x) with x = ε_{i−1}, ..., ε_1; y = ε_i and z = ε_{i−1}x_{i−1,2}, ..., ε_1 x_{1K}:
E(ε_i|ε_{i−1}, ..., ε_1) = 0
By LTE: E(ε_i) = 0
Second step in exercise 19.
9. Show that the test statistic

t_k = (b_k − β̄_k) / √(Avar̂(b_k)/n)   (138)

converges in distribution to a standard normal distribution. Note that b_k is the k-th element of b
and Avar(b_k) is the (k,k) element of the K×K matrix Avar(b). Use the facts that √n(b_k − β̄_k) →d
N(0, Avar(b_k)) and Avar̂(b) →p Avar(b). Why is the latter true? Hint: use continuous mapping and
the Slutsky theorem (the "useful lemmas")!

t_k = (b_k − β̄_k) / √(Avar̂(b_k)/n) = √n(b_k − β̄_k) / √(Avar̂(b_k)) ~a N(0,1)
We know:
1. √n(b_k − β̄_k) →d N(0, Avar(b_k)) by the CLT and from the joint distribution of the estimators.
2. 1/√(Avar̂(b_k)) →p 1/√(Avar(b_k)) by Lemma 1 and from the derivation of Avar̂(b).
Lemma 5:
t_k = √n(b_k − β̄_k)/√(Avar̂(b_k)) →d N(0, Avar(b_k)/Avar(b_k)) = N(0,1)

10. Show that the Wald statistic

W = (Rb − r)' [R (Avar̂(b)/n) R']^{-1} (Rb − r)   (139)

converges in distribution to a chi-square with degrees of freedom equal to the number of restrictions.
As a hint, rewrite the equation above as W = c_n' Q_n^{-1} c_n. Use Hayashi's Lemma 2.4(d) and the
footnote on page 41.

W = [Rb − r]' [R (Avar̂(b)/n) R']^{-1} [Rb − r]

Under the null: Rb − r = Rb − Rβ = R(b − β)

c_n = R√n(b_n − β) →d c ~ MVN(0, R Avar(b) R') (Lemma 2)

Q_n = R Avar̂(b) R' →p Q = R Avar(b) R' = Var(c) (derivation of Avar̂(b) and Lemma 1)

W_n = c_n' Q_n^{-1} c_n →d c' Q^{-1} c ~ χ²(#r) (Lemma 6 + Fact 4, with Q = Var(c) and #r = rows(c))
11. What is an ensemble mean?

An ensemble consists of a large number of mental copies (of the data-generating process).
The ensemble mean is obtained by averaging all ensemble forecasts. This has the effect of filtering
out features of the forecast that are less predictable.
12. Why do we need stationarity and ergodicity in the first place?
Especially for time series we want to get rid of the iid assumption, as we observe dependencies. Stationarity & ergodicity allow for dependence, but we still need draws from the same distribution, and the dependence between draws that are far apart from each other should die out.
But: mean and variance then do not change over time → we need this for correct estimates and tests.
13. Explain the concepts of weak stationarity and strong stationarity.
Strong stationarity: the joint distribution of (z_i, z_{i_1}, ..., z_{i_r}) is the same as the joint distribution of (z_j, z_{j_1}, ..., z_{j_r}) whenever i − i_s = j − j_s for all s, i.e. only the relative position in the sequence matters, not the absolute one.
Weak stationarity: only the first two moments are constant (not the whole distribution), and cov(z_i, z_{i−j}) depends only on j.
15. When is a stationary process ergodic?
If the dependence decreases with distance, so that:
lim_{n→∞} E[f(z_i, z_{i+1}, ..., z_{i+k}) · g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l})]
= E[f(z_i, z_{i+1}, ..., z_{i+k})] · E[g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l})]
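A simulation sketch (our illustration; the AR(1) parameters are arbitrary) connecting questions 11, 12 and 15: for a stationary AR(1) with |φ| < 1, ergodicity means the time average along one long realization approaches the ensemble mean μ = c/(1 − φ).

```python
import numpy as np

rng = np.random.default_rng(4)
c, phi, sigma = 1.0, 0.8, 1.0      # z_t = c + phi * z_{t-1} + u_t, |phi| < 1
mu = c / (1 - phi)                 # unconditional (ensemble) mean

def ar1_path(T):
    z = np.empty(T)
    z[0] = mu                      # start at the mean: near-stationary start
    for t in range(1, T):
        z[t] = c + phi * z[t - 1] + rng.normal(0, sigma)
    return z

time_avg = ar1_path(100_000).mean()                            # one long realization
ensemble = np.mean([ar1_path(200)[-1] for _ in range(5_000)])  # many copies, fixed date
print("mu:", mu, "| time average:", time_avg, "| ensemble mean:", ensemble)
```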
16. Which assumptions have to be fulfilled to apply Khinchin's WLLN? Which of these assumptions are weakened by the ergodic theorem? Which assumption is used instead?
Assumptions:
{z_i} iid → weakened by the ergodic theorem; instead: stationarity & ergodicity
E(z_i) = μ < ∞
17. Which assumptions have to be fulfilled to apply the Lindeberg-Levy CLT? Is stationarity and ergodicity sufficient to apply a CLT? What property of the sequence {g_i} = {x_i·ε_i} do we assume to apply a CLT?
Assumptions:
{z_i} iid → instead: stationary & ergodic m.d.s.
E(z_i) = μ < ∞
Var(z_i) = σ² < ∞
Stationarity and ergodicity alone are not sufficient: they only deliver a WLLN (ergodic theorem). For a CLT we additionally assume that {g_i} is a martingale difference sequence.
18. How do you call a stochastic process for which E(g_i | g_{i−1}, g_{i−2}, ...) = 0?
A martingale difference sequence (m.d.s.).
19. Show that if a constant is included in the model, it follows from E(g_i | g_{i−1}, g_{i−2}, ...) = 0 that cov(ε_i, ε_{i−j}) = 0 for all j ≠ 0. Hint: use the law of iterated expectations.
From exercise 8: E(ε_i | ε_{i−1}, ...) = E(ε_i) = 0
Cov(ε_i, ε_{i−j}) = E(ε_i·ε_{i−j}) − E(ε_i)·E(ε_{i−j}), where the second term is 0, so it remains to show E(ε_i·ε_{i−j}) = 0:
E(ε_i·ε_{i−j}) = E[E(ε_i·ε_{i−j} | ε_{i−1}, ..., ε_{i−j}, ..., ε_1)] by the LTE (backwards)
= E[ε_{i−j}·E(ε_i | ε_{i−1}, ..., ε_{i−j}, ..., ε_1)] = E(ε_{i−j}·0) = 0 by linearity of conditional expectations (the inner expectation is 0, as shown above)
⇒ Cov(ε_i, ε_{i−j}) = 0
20. When using the heteroskedasticity-consistent covariance matrix

Avar(b) = Σ_xx⁻¹ E(ε_i² x_i x_i′) Σ_xx⁻¹   (140)

which assumption regarding the covariances of the disturbances ε_i and ε_{i−j} do we make?
Cov(ε_i, ε_{i−j}) = 0
Eleventh Set
1. When and why would you use GLS? Describe the limiting nature of the GLS approach.
GLS is used to estimate the unknown parameters of a linear regression model when heteroskedasticity is present and/or when the observations are correlated in the error term (cf. WLS).
Assumptions:
Linearity: y_i = x_i′β + ε_i
Full rank
Strict exogeneity
Var(ε|X) = σ²·V(X), where V(X) has to be known, symmetric and positive definite.
Limits:
V(X) is normally not known, and thus we have to estimate even more. If Var(ε|X) is estimated, the BLUE property is lost and b_GLS might even be worse than OLS.
If X is not strictly exogenous, GLS might be inconsistent.
Large sample properties are more difficult to obtain.
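A minimal numpy sketch (our illustration; the diagonal V is assumed known here, which is exactly the limiting assumption discussed above) of the GLS estimator b_GLS = (X′V⁻¹X)⁻¹·X′V⁻¹y:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x = rng.uniform(1, 4, n)
X = np.column_stack([np.ones(n), x])

# Var(eps|X) = sigma^2 * V with V known; here V is diagonal with entries x_i^2
eps = rng.normal(0, 1, n) * x
y = X @ np.array([2.0, -1.0]) + eps

V_inv = np.diag(1.0 / x**2)
b_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print("OLS:", b_ols, "GLS:", b_gls)
```

Equivalently, one can premultiply y and X by V^(−1/2) (here: divide by x_i) and run OLS on the transformed data.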
2. A special case of the GLS approach is weighted least squares (WLS). What difficulties could arise in a WLS estimation? How are the weights constructed?
Observations are weighted by the inverse of their standard deviations, so that noisy observations get less weight (a higher penalty for misfitting precise observations). WLS is used when all the off-diagonal entries of Var(ε|X) are zero.
Here: Var(ε_i|x_i) = σ²·v(x_i) with v(x_i) = z_i′·α (v(x_i) typically unknown).
So again, the variance has to be estimated, and thus our estimates might be inconsistent and even less efficient than OLS.
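A feasible-WLS sketch (our illustration; the skedastic function v(x_i) = α_1 + α_2·x_i², fitted to squared OLS residuals, is an assumption of this example) showing the two-step logic and why estimating the variance adds noise:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, n) * x   # sd grows with x

# Step 1: OLS residuals, then fit v(x_i) = alpha_1 + alpha_2 * x_i^2
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e2 = (y - X @ b_ols) ** 2
Z = np.column_stack([np.ones(n), x**2])
alpha = np.linalg.solve(Z.T @ Z, Z.T @ e2)
v_hat = np.clip(Z @ alpha, 1e-6, None)   # guard against negative fitted variances

# Step 2: weight by 1/v_hat, i.e. b_WLS = (X' W X)^-1 X' W y with W = diag(1/v_hat)
w = 1.0 / v_hat
b_wls = np.linalg.solve((X * w[:, None]).T @ X, (X * w[:, None]).T @ y)
print("OLS:", b_ols, "FWLS:", b_wls)
```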
3. When does exact multicollinearity occur? What happens to the OLS estimator in this case?
It occurs when one regressor is a linear combination of the other regressors: rank(X) < K.
Then Assumption 1.3/2.4 is violated, (X′X)⁻¹ does not exist, and thus the OLS estimator cannot be computed. We can still estimate linear combinations of the parameters, though.
4. How are the OLS estimator and its standard error affected by (not exact) multicollinearity?
The BLUE result is not affected.
Var(b|X) is affected: coefficients have high standard errors (effect: wide confidence intervals and low t-stats, and thus difficulties with rejection); see the sketch below.
Estimates may have the wrong sign.
Small changes in the data produce wide swings in the parameter estimates.
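A quick simulated illustration (ours; the design and correlation values are arbitrary) of the standard-error inflation: the same model is estimated once with uncorrelated and once with highly correlated regressors.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

def se_of_last_coef(corr):
    x1 = rng.normal(size=n)
    x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=n)  # corr(x1, x2) = corr
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (n - 3)                 # classical s^2
    return np.sqrt(s2 * XtX_inv[2, 2])   # standard error of b_3

print("se(b3) with corr 0.00:", se_of_last_coef(0.0))
print("se(b3) with corr 0.99:", se_of_last_coef(0.99))
```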
5. Which steps can be taken to overcome the problem of multicollinearity?
Increase n (and thus the variation in the x's)
Get a smaller σ² (= a better-fitting model)
Get smaller correlation (= exclude some regressors; but: omitted variable bias)
Other ways:
Don't do anything (OLS is still BLUE)
Joint hypothesis testing
Use additional empirical data
Combine the multicollinear variables
Twelfth Set
1. When does an omitted regressor not cause biased estimates?
With an omitted regressor we get: b_1 = β_1 + (X_1′X_1)⁻¹·X_1′X_2·β_2 + (X_1′X_1)⁻¹·X_1′ε, where the last term is 0 under strict exogeneity, so the bias is the middle term. It is zero if:
β_2 = 0, i.e. X_2 is not part of the model in the first place, or
(X_1′X_1)⁻¹·X_1′X_2 = 0, i.e. the regression of X_2 on X_1 gives you zero coefficients (X_1 and X_2 orthogonal).
A simulation of this formula follows below.
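A simulation sketch (our example; the data generating process is arbitrary) of the omitted-variable formula: with standardized regressors, the short regression of y on x_1 alone converges to β_1 + ρ·β_2, where ρ is the coefficient of a regression of x_2 on x_1; the bias vanishes for ρ = 0.

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta1, beta2 = 100_000, 1.0, 0.5

def short_regression_slope(rho):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr(x1, x2) = rho
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    return (x1 @ y) / (x1 @ x1)      # OLS slope when x2 is omitted

print("rho = 0.6:", short_regression_slope(0.6), "-> plim =", beta1 + 0.6 * beta2)
print("rho = 0.0:", short_regression_slope(0.0), "-> plim =", beta1)
```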
2. Explain the term endogeneity bias. Give a practical economic example where an endogeneity bias occurs.
Endogeneity = no strict exogeneity, or no predetermined regressors: the X's are correlated with the error term → the estimator is biased.
Example (Haavelmo's consumption function): C_i = α_0 + α_1·Y_i + u_i, where investment I_i is part of u_i and correlated with income Y_i.
3. What is the solution to the endogeneity problem in a linear regression framework?
Instrumental variables: use instruments that are correlated with the endogenous regressors but uncorrelated with the error term (equivalently, pull the variables hidden in u_i that are correlated with the regressors out of the error term).
4. Use the same tools used to derive the CAN property of the OLS estimator to derive the CAN property of the IV estimator. Start with the sampling error and make your assumptions on the way (applicability of WLLN, CLT, ...).
Compute the sampling error:
δ̂ = (X′Z)⁻¹·X′y
= (X′Z)⁻¹·X′(Zδ + ε)
= (X′Z)⁻¹·X′Z·δ + (X′Z)⁻¹·X′ε
= δ + (X′Z)⁻¹·X′ε, so δ̂ − δ = (X′Z)⁻¹·X′ε = [(1/n)·Σ x_i z_i′]⁻¹ · (1/n)·Σ x_i ε_i

Consistency: δ̂ = [(1/n)·Σ x_i z_i′]⁻¹ · (1/n)·Σ x_i y_i →p δ
1. (1/n)·Σ x_i z_i′ →p E(x_i z_i′), and by Lemma 1: [(1/n)·Σ x_i z_i′]⁻¹ →p [E(x_i z_i′)]⁻¹
2. (1/n)·Σ x_i ε_i →p E(x_i ε_i) = 0
⇒ δ̂ − δ →p [E(x_i z_i′)]⁻¹ · E(x_i ε_i) = 0

Asymptotic normality: √n·(δ̂ − δ) →d MVN(0, Avar(δ̂))
√n × sampling error: √n·(δ̂ − δ) = [(1/n)·Σ x_i z_i′]⁻¹ · √n·ḡ = A_n·x_n
1. A_n = [(1/n)·Σ x_i z_i′]⁻¹ →p A = E(x_i z_i′)⁻¹
2. x_n = √n·ḡ →d x ~ MVN(0, E(g_i g_i′)) (apply the CLT as E(g_i) = 0; cf. p. 33)
Lemma 5: √n·(δ̂ − δ) →d MVN(0, E(x_i z_i′)⁻¹·E(g_i g_i′)·E(x_i z_i′)⁻¹), which is Avar(δ̂).
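A simulation sketch (our illustration, keeping the notation above: z_i is the endogenous regressor, x_i the instrument) showing that OLS is inconsistent while the simple IV estimator δ̂ = [Σ x_i z_i]⁻¹·Σ x_i y_i recovers δ:

```python
import numpy as np

rng = np.random.default_rng(9)
n, delta = 50_000, 2.0

u = rng.normal(size=n)                        # structural error
x = rng.normal(size=n)                        # instrument: correlated with z, not with u
z = 1.0 * x + 0.8 * u + rng.normal(size=n)    # endogenous regressor, E(z_i u_i) != 0
y = delta * z + u

d_ols = (z @ y) / (z @ z)   # OLS: inconsistent because E(z_i u_i) != 0
d_iv = (x @ y) / (x @ z)    # IV:  [sum x_i z_i]^-1 sum x_i y_i
print("OLS:", d_ols, "| IV:", d_iv, "| true delta:", delta)
```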

5. When would you use an IV estimator instead of an OLS estimator? (Hint: which assumption of the OLS estimation is violated and what is the consequence?)
When the regressors and ε_i are correlated (violation of Assumption 2.3, predetermined regressors); OLS is then inconsistent.
6. Describe the basic idea of instrumental variables estimation. How are the unknown parameters related to the data generating process?
IV can be used to obtain consistent estimators in the presence of omitted variables: x_i and z_i are correlated, while x_i and u_i are not correlated → x_i = instrument. The orthogonality conditions then tie the unknown parameters to population moments of the observable data: δ = [E(x_i z_i′)]⁻¹·E(x_i y_i) (see question 8).

7. Which assumptions are necessary to derive the instrumental variables estimator and the CAN
property of the IV estimator?
Assumptions:
3.1 Linearity
3.2 Ergodic stationarity
3.3 Orthogonality conditions
3.4 Rank condition for identification (full rank)
8. Where do the assumptions enter the derivation of the instrumental variables estimator and the CAN property of the IV estimator?
Derivation:
E(x_i ε_i) = E(x_i·(y_i − z_i′δ)) = 0   (ε_i = y_i − z_i′δ by 3.1; the expectation is zero by 3.3)
E(x_i y_i) − E(x_i z_i′)·δ = 0
δ = [E(x_i z_i′)]⁻¹·E(x_i y_i)   (the inverse exists by 3.4)
δ̂ →p δ by 3.2 (ergodic stationarity lets the WLLN apply to the sample moments)

9. Show that the OLS estimator can be conceived as a special case of the IV estimator.
Relation: if E(ε_i z_i) = 0, the regressors can serve as their own instruments, x_i = z_i. Then:
δ̂ = [(1/n)·Σ z_i z_i′]⁻¹ · (1/n)·Σ z_i y_i = OLS estimator
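A short numeric check (illustrative; data simulated) that setting x_i = z_i in the IV formula reproduces OLS exactly:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # regressors
y = Z @ np.array([1.0, 0.7]) + rng.normal(size=n)

b_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)
X = Z                                       # regressors instrument themselves
d_iv = np.linalg.solve(X.T @ Z, X.T @ y)    # IV formula with x_i = z_i
print(np.allclose(b_ols, d_iv))             # True
```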
10. What are possible sources of endogeneity?
See additional page.
