Povilas Lastauskas
23rd November 2013
1 Theory
A research agenda for applied economics must address these major issues:
1. Causal relationship;
2. Ideal experiment;
3. Identification strategy;
4. Statistical inference.
Reality manifests itself in many ways that complicate empirical enquiry, especially when questions about causality are raised. Drawing from Angrist and Pischke (2008), such questions hinge on the experimentalist paradigm and potential outcomes. This paradigm in econometrics requires some vocabulary:
$Y_{1i}$: outcome if $i$ is treated;
$Y_{0i}$: outcome if $i$ is not treated.
In effect, $Y_{1i}$ (potential outcome under treatment) and $Y_{0i}$ (potential outcome without treatment) are outcomes in alternative states of the world. Since parallel worlds do not exist, treatment effects cannot be measured at the individual level. Notice that the observed outcome $Y_i$ can be expressed in terms of the potential ones:
$$Y_i = \begin{cases} Y_{1i}, & \text{if } D_i = 1, \\ Y_{0i}, & \text{if } D_i = 0, \end{cases}$$
or $Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) D_i$. One hope is to analyse average treatment effects instead of individual ones. What about a comparison of the difference in means for the treated and untreated?
We obtain
$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0],$$
and subtracting and adding $E[Y_{0i} \mid D_i = 1]$ produces
$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] + E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]. \tag{1}$$
The result decomposes into two components: the average treatment effect on the treated (ATT) and selection bias. The latter is simply the difference in average outcome under non-treatment between those who were treated and those who were not.
Problem 1. Actual treatment status $D_i$ is not independent of potential outcomes.
One solution is to randomly assign treatment to individuals in the population: $D_i$ then becomes independent of potential outcomes, and so the selection bias disappears. Under this,
$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1],$$
since $E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] = 0$. This enables us to infer the ATT, which coincides with the average treatment effect (ATE) in the entire population,¹ from the difference in means.
Under random assignment, we can obtain the ATT and ATE by running the OLS regression $Y_i = \alpha + \rho D_i + \eta_i$. For a moment suppose that the treatment effect is the same for everybody, such that $Y_{1i} - Y_{0i} = \rho$. Using $Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) D_i$, we can re-express it as $Y_i = E[Y_{0i}] + (Y_{1i} - Y_{0i}) D_i + (Y_{0i} - E[Y_{0i}]) = \alpha + \rho D_i + \eta_i$, where $\alpha = E[Y_{0i}]$ and $\eta_i = Y_{0i} - E[Y_{0i}]$. Obviously,
$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \rho + E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0].$$
Claim 2. The selection bias amounts to a nonzero correlation between the regression error term $\eta_i$ and the regressor $D_i$: what you previously called endogeneity bias is known as selection bias here.
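A minimal simulation sketch makes the claim concrete (not from Angrist and Pischke; the data-generating process, effect size, and variable names are illustrative assumptions): when take-up depends on $Y_{0i}$, the difference in means is contaminated by selection bias, while random assignment recovers the effect.

```python
# Sketch: selection into treatment versus random assignment.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y0 = rng.normal(0.0, 1.0, n)            # potential outcome without treatment
y1 = y0 + 2.0                           # homogeneous treatment effect rho = 2

# Selection: units with high Y0 are more likely to take the treatment,
# so D is not independent of potential outcomes.
d_sel = (y0 + rng.normal(0, 1, n) > 0).astype(float)
y_sel = y0 + (y1 - y0) * d_sel
naive = y_sel[d_sel == 1].mean() - y_sel[d_sel == 0].mean()

# Random assignment: D independent of (Y0, Y1), so selection bias vanishes.
d_rnd = rng.integers(0, 2, n).astype(float)
y_rnd = y0 + (y1 - y0) * d_rnd
randomized = y_rnd[d_rnd == 1].mean() - y_rnd[d_rnd == 0].mean()

print(f"difference in means under selection:     {naive:.3f}")       # biased away from 2
print(f"difference in means under randomisation: {randomized:.3f}")  # close to 2
```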
¹ By the law of iterated expectations,
$$E_x\left[E_y[y \mid x]\right] = \int E_y[y \mid x]\, f(x)\,dx = \int\left(\int y\, f(y \mid x)\,dy\right) f(x)\,dx = \int\!\!\int y\, f(y \mid x)\, f(x)\,dy\,dx$$
$$= \int\!\!\int y\, f(y, x)\,dy\,dx = \int y\left(\int f(y, x)\,dx\right)dy = \int y\, f(y)\,dy = E[y].$$
Best predictor property.
$E[Y_i \mid X_i]$ is the best (minimum mean squared error, MMSE) predictor of $Y_i$ in that it solves $\min_h E\left[(Y_i - h(X_i))^2\right]$, where $h(X_i)$ is any function of $X_i$. The proof follows immediately by subtracting and adding $E[Y_i \mid X_i]$ inside the brackets, so that
$$E\left[(Y_i - h(X_i))^2\right] = E\left[(Y_i - E[Y_i \mid X_i])^2\right] + 2E\big[(Y_i - E[Y_i \mid X_i])(E[Y_i \mid X_i] - h(X_i))\big] + E\left[(E[Y_i \mid X_i] - h(X_i))^2\right].$$
The first term does not involve $h(X_i)$, the second one is zero by the decomposition property, and the last one is minimised at zero, yielding the result that the optimal $h(X_i)$ is the CEF.³
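A quick numerical check of this property (an illustrative sketch; the quadratic CEF and the alternative predictors are arbitrary choices): the true CEF attains the smallest mean squared error among candidate functions $h(X)$.

```python
# Sketch: E[Y|X] beats any other h(X) in mean squared error.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 500_000)
y = x**2 + rng.normal(0, 1, x.size)   # by construction, E[Y|X] = X^2

candidates = {
    "CEF      h(x) = x^2":  x**2,
    "linear   h(x) = 2x-1": 2 * x - 1,                    # arbitrary alternative
    "constant h(x) = E[Y]": np.full_like(x, (x**2).mean()),
}
for name, h in candidates.items():
    print(f"{name}: MSE = {np.mean((y - h)**2):.4f}")
# The CEF's MSE is about 1.0 (the noise variance); every other h(X) does worse.
```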
We have not yet specified $h(X_i)$, nor really linked the CEF to regression. Define the population regression parameter vector as the solution to the following minimisation problem:⁴
$$\beta = \arg\min_b E\left[(Y_i - X_i'b)^2\right].$$
The first-order condition is
$$E\left[X_i\left(Y_i - X_i'\beta\right)\right] = 0,$$
which can be used to produce $\beta = E[X_i X_i']^{-1} E[X_i Y_i]$. An unbiased estimator satisfies $E[\hat\beta] = \beta$, and its (asymptotic) variance-covariance matrix is $E[X_i X_i']^{-1} E[X_i X_i' e_i^2]\, E[X_i X_i']^{-1}$, where $e_i$ is the regression residual.
A useful property is that the linear predictor takes the form
$$E^*[Y_i \mid X_i] = X_i'\beta,$$
and the solution to the Best Linear Predictor (BLP) problem, i.e., $\min_b E\left[(Y_i - X_i'b)^2\right]$, yields $\beta$. If the CEF is linear, then the population regression function is exactly the CEF. The CEF is linear when $Y$ and $X$ are jointly normal, and also when the model is saturated, i.e., a model with a separate parameter for every possible combination of values that the regressors can take.
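The moment formula $\beta = E[X_iX_i']^{-1}E[X_iY_i]$ can be illustrated by replacing expectations with sample averages; the sketch below (illustrative parameter values) confirms that this coincides with least squares.

```python
# Sketch: the sample analogue of E[XX']^{-1} E[XY] equals the OLS solution.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one regressor
beta_true = np.array([1.0, 0.5])
y = x @ beta_true + rng.normal(size=n)

Exx = x.T @ x / n                   # sample analogue of E[X_i X_i']
Exy = x.T @ y / n                   # sample analogue of E[X_i Y_i]
beta_mom = np.linalg.solve(Exx, Exy)
beta_ols, *_ = np.linalg.lstsq(x, y, rcond=None)
print(beta_mom, beta_ols)           # identical up to floating point
```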
³ See Angrist and Pischke (2008) for the ANOVA theorem and yet another CEF property.
⁴ Wooldridge (2010) approaches this by using the linear projection of $Y_i$ on $X_i$. Let $E[Y_i \mid X_i]$ denote the true regression function. The linear projection of $Y_i$ on $X_i$, denoted $L(Y_i \mid X_i) = X_i'\beta$, is such that the parameters solve $\arg\min_b E\left[(Y_i - X_i'b)^2\right]$. In other words, the parameters in the linear projection $L(Y_i \mid X_i)$ provide the best (population) mean square error approximation to the true regression function.

2 Examples

2.1 Linear Regression

Consider the linear regression model
$$E[y_i \mid x_i] = \alpha + \beta x_i,$$
where $x_i$ is a dummy variable taking the values 0 and 1.
Solution
The OLS estimator for $\beta$ in the model $y = \alpha + \beta x + \varepsilon$ is
$$b = \frac{\sum xy - n^{-1}\sum x \sum y}{\sum x^2 - n^{-1}\left(\sum x\right)^2}. \tag{2}$$
Since $x$ is a dummy, $\sum x = n_1$, $\sum x^2 = n_1$ and $\sum xy = \sum_{x=1} y$, where $n_1$ is the number of observations with $x = 1$ (and $n_2 = n - n_1$), so
$$b = \frac{\sum_{x=1} y - n^{-1} n_1 \sum y}{n_1 - n^{-1} n_1^2}. \tag{3}$$
Since $\sum y = \sum_{x=1} y + \sum_{x=0} y$,
$$b = \frac{\frac{n_2}{n}\sum_{x=1} y - \frac{n_1}{n}\sum_{x=0} y}{\frac{n_1 n_2}{n}} = \frac{\sum_{x=1} y}{n_1} - \frac{\sum_{x=0} y}{n_2} = \bar y_1 - \bar y_2. \tag{4}$$
For the intercept, $a = \bar y - b\bar x$ with $\bar y = (n_1 \bar y_1 + n_2 \bar y_2)/n$ and $\bar x = n_1/n$:
$$a = \frac{n_1 \bar y_1 + n_2 \bar y_2}{n} - \frac{(\bar y_1 - \bar y_2)\, n_1}{n} = \frac{(n_1 + n_2)\,\bar y_2}{n} = \bar y_2. \tag{5}$$
A Method of Moments interpretation: since $x$ takes only two values, we can derive the expectation of $y$ conditional on these values (two moments) to identify the two parameters $\alpha$ and $\beta$. We have that
$$E[y \mid x = 1] = \alpha + \beta E[x \mid x = 1] + E[\varepsilon \mid x = 1] = \alpha + \beta,$$
$$E[y \mid x = 0] = \alpha + \beta E[x \mid x = 0] + E[\varepsilon \mid x = 0] = \alpha,$$
which follows from the usual condition of no correlation between $x$ and $\varepsilon$, $E[\varepsilon \mid x] = 0$ (for all values of $x$). Then, we can replace the population conditional moments by their sample counterparts and the parameters by their estimators. Note that $\bar y_1$ is the sample estimator of $E[y \mid x = 1]$ and $\bar y_2$ is the sample moment corresponding to $E[y \mid x = 0]$. Thus, the moment conditions are $\bar y_1 = a + b$ and $\bar y_2 = a$. Upon solving this system we get the same answer as before, $b = \bar y_1 - \bar y_2$.
The interpretation of the OLS estimator as a MoM estimator can be generalised to any
linear model, not just to this particular problem.
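The following sketch (simulated data with arbitrary coefficients) verifies both results: the OLS slope on a dummy regressor equals $\bar y_1 - \bar y_2$ and the intercept equals $\bar y_2$.

```python
# Sketch: OLS with a binary regressor reproduces group means.
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 2, 1_000).astype(float)
y = 0.5 + 1.5 * x + rng.normal(size=x.size)

X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

ybar1, ybar2 = y[x == 1].mean(), y[x == 0].mean()
print(np.isclose(b, ybar1 - ybar2), np.isclose(a, ybar2))   # True True
```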
2.2 Densities
The joint density function of two random variables, $X$ and $Y$, is given by
$$f_{XY}(x, y) = \begin{cases} y^2 x e^{-xy}, & 0 \le x < \infty,\ 0 \le y \le 1, \\ 0, & \text{otherwise}. \end{cases}$$
Find the marginal density of $X$, $f(x)$, and show that $Y$ is uniformly distributed on $[0, 1]$. Also find the conditional density of $X$ given $Y$. Are $X$ and $Y$ independent?
Solution
The marginal density of $X$ is defined as $f(x) = \int_0^1 f(x, y)\,dy$. Using the functional form provided and integrating by parts twice, we get
$$f(x) = x\int_0^1 y^2 e^{-xy}\,dy = x\left[-\frac{y^2 e^{-xy}}{x}\right]_0^1 + 2\int_0^1 y\, e^{-xy}\,dy = -e^{-x} + 2\left(\left[-\frac{y\, e^{-xy}}{x}\right]_0^1 + \frac{1}{x}\int_0^1 e^{-xy}\,dy\right)$$
$$= -e^{-x} - \frac{2e^{-x}}{x} + \frac{2\left(1 - e^{-x}\right)}{x^2} = \frac{2 - e^{-x}\left(x^2 + 2x + 2\right)}{x^2}.$$
For the marginal density of $Y$,
$$f(y) = y^2 \int_0^\infty x\, e^{-xy}\,dx = y^2\left(\left[-\frac{x\, e^{-xy}}{y}\right]_0^\infty + \frac{1}{y}\int_0^\infty e^{-xy}\,dx\right) = y^2 \cdot \frac{1}{y^2} = 1, \quad 0 \le y \le 1,$$
so $Y$ is indeed uniform on $[0, 1]$. The conditional density of $X$ given $Y$ is
$$f(x \mid y) = \frac{f(x, y)}{f(y)} = y^2 x e^{-xy} \neq f(x),$$
so $X$ and $Y$ are not independent.
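The integration-by-parts results can be verified symbolically; a short sympy check (not part of the original solution) follows.

```python
# Sketch: symbolic verification of the marginal and conditional densities.
import sympy as sp

x, y = sp.symbols("x y", positive=True)
f = y**2 * x * sp.exp(-x * y)        # joint density on 0 <= x, 0 <= y <= 1

fx = sp.simplify(sp.integrate(f, (y, 0, 1)))        # marginal of X
fy = sp.simplify(sp.integrate(f, (x, 0, sp.oo)))    # marginal of Y
print(fx)  # equivalent to (2 - (x**2 + 2*x + 2)*exp(-x))/x**2
print(fy)  # 1, i.e. Y ~ Uniform[0, 1]
print(sp.simplify(f / fy - fx))      # nonzero, so f(x|y) != f(x): dependent
```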
2.3 Bivariate Normal
Let $Z_1$ and $Z_2$ be independent standard normal random variables, and define
$$X_1 = \mu_1 + \sigma_1 Z_1, \qquad X_2 = \mu_2 + \rho\sigma_2 Z_1 + \sqrt{1 - \rho^2}\,\sigma_2 Z_2.$$
Hence, $(X_1, X_2)$ are bivariate normal (such that $X_i \sim N(\mu_i, \sigma_i^2)$). The covariance between $X_1$ and $X_2$ is $\rho\sigma_1\sigma_2$. Recalling the OLS expressions for the estimators,
$$\beta = \mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1) = \rho\sigma_2/\sigma_1, \qquad \alpha = \mu_2 - \beta\mu_1.$$
Let the linear predictor be denoted by $E^*(X_2 \mid X_1) = \alpha + \beta X_1$. Then, the best linear predictor is
$$E^*(X_2 \mid X_1) = \mu_2 - \beta\mu_1 + \beta X_1 = \mu_2 + \beta(X_1 - \mu_1).$$
Note that, employing properties of the normal distribution,
$$E(X_2 \mid X_1) = \mu_2 + \beta(X_1 - \mu_1).$$
So the CEF is linear in this case (this is not true with other distributions).
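A numerical illustration (simulated draws with arbitrary $\mu_i$, $\sigma_i$, $\rho$) shows the conditional mean of $X_2$ given $X_1$ tracking the linear formula $\mu_2 + \beta(X_1 - \mu_1)$.

```python
# Sketch: the CEF of X2 given X1 is linear for jointly normal variables.
import numpy as np

rng = np.random.default_rng(4)
mu1, mu2, s1, s2, rho = 1.0, 2.0, 1.5, 0.8, 0.6
z1, z2 = rng.normal(size=(2, 1_000_000))
x1 = mu1 + s1 * z1
x2 = mu2 + rho * s2 * z1 + np.sqrt(1 - rho**2) * s2 * z2

beta = rho * s2 / s1
# Compare the empirical E[X2 | X1 in a narrow bin] with mu2 + beta*(x1 - mu1).
for lo in (-1.0, 0.5, 2.0):
    m = (x1 > lo) & (x1 < lo + 0.2)
    mid = x1[m].mean()
    print(x2[m].mean(), mu2 + beta * (mid - mu1))   # nearly equal
```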
Geometrically, for two vectors in $N$-dimensional space, the angle $\theta$ between them must always satisfy
$$\cos\theta = \frac{x'y}{\sqrt{x'x}\sqrt{y'y}} = \mathrm{Corr}(x, y).$$
The cosine is one if the two rays point in exactly the same direction, zero if they are perpendicular, and negative one if they point in exactly opposite directions, exactly as with correlations.
3 OLS in Matrix Form
Recall some matrix algebra facts: $\left(A^{-1}\right)^{-1} = A$, $(AB)^{-1} = B^{-1}A^{-1}$, $\left(A'\right)^{-1} = \left(A^{-1}\right)'$. Define the design matrix
$$X = \begin{bmatrix} 1 & x_{12} & \cdots & x_{1K} \\ 1 & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N2} & \cdots & x_{NK} \end{bmatrix}_{N \times K}$$
and a vector $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N)'_{N \times 1}$ containing all of the unobservable determinants of the outcome $y$. The system of equations can be represented as $y_{N \times 1} = X_{N \times K}\,\beta_{K \times 1} + \varepsilon_{N \times 1}$. Its econometric model is $\hat\varepsilon_{N \times 1} = y_{N \times 1} - X_{N \times K}\,\hat\beta_{K \times 1}$, where the residuals $\hat\varepsilon_{N \times 1}$ are obtained once $\hat\beta_{K \times 1}$ is estimated. We want the residuals to be such that the size (or norm) of the vector is minimised, i.e., $\min \lVert\hat\varepsilon\rVert = \min \sqrt{\hat\varepsilon'\hat\varepsilon}$ or, more familiarly, $\min \lVert\hat\varepsilon\rVert^2 = \min \hat\varepsilon'\hat\varepsilon$. In other words, we want to minimise the following expression:
$$\min_{\hat\beta}\ \hat\varepsilon'\hat\varepsilon = \left(y - X\hat\beta\right)'\left(y - X\hat\beta\right) = y'y - y'X\hat\beta - \hat\beta'X'y + \hat\beta'X'X\hat\beta.$$
The optimal $\hat\beta_{OLS}$ solves $-y'X - (X'y)' + 2\hat\beta'_{OLS}X'X = 0$, or $-X'y + X'X\hat\beta_{OLS} = 0$, so that
$$\hat\beta_{OLS} = \left(X'X\right)^{-1}X'y, \quad \text{where } (X'y)' = y'X.$$
From this expression, notice that $\hat\varepsilon = y - X\hat\beta = y - X(X'X)^{-1}X'y = \left(I - X(X'X)^{-1}X'\right)y$. We can decompose $y$ into two components: the orthogonal projection onto the $K$-dimensional space spanned by $X$, namely $X\hat\beta$, and the component that is the orthogonal projection onto the $(N - K)$-dimensional subspace that is orthogonal to the span of $X$, namely $\hat\varepsilon$. Since $\hat\beta$ is chosen to make $\hat\varepsilon$ as short as possible, $\hat\varepsilon$ will be orthogonal to the space spanned by $X$; in this space, $X'\hat\varepsilon = 0$. The FOCs that define the least squares estimator imply that this is so.
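A short numerical check (arbitrary simulated data) of the projection interpretation: the FOC $X'\hat\varepsilon = 0$ holds, and $P = X(X'X)^{-1}X'$ reproduces the fitted values.

```python
# Sketch: residuals are orthogonal to span(X); P projects y onto span(X).
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
P = X @ np.linalg.inv(X.T @ X) @ X.T          # orthogonal projection onto span(X)
e = y - X @ beta

print(np.allclose(X.T @ e, 0))                # FOC: X'e = 0
print(np.allclose(P @ y, X @ beta))           # P y equals the fitted values
print(np.allclose((np.eye(n) - P) @ y, e))    # residual maker M = I - P
```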
For a simple illustration of the differentiation step, suppose $y = x'Ax = ax_1^2 + 2bx_1x_2 + cx_2^2$. Its partial derivatives with respect to $x_1$ and $x_2$, respectively, are simply $2(ax_1 + bx_2)$ and $2(bx_1 + cx_2)$. This holds under symmetry; otherwise $\partial y/\partial x' = x'(A + A')$.
0
0
and observe that PZ X = Z
(Z Z) Z X
is the fitted value from a regression of X
0
on Z (first stage) and E (PZ X) PZ = E (X 0 PZ ) = 0. This is the generalised
instrumental variables estimator, defined as
1
IV = (X 0 PZ X) X 0 PZ y
1
= (X 0 PZ X) X 0 PZ (X + )
1
= + (X 0 PZ X) X 0 PZ ,
1
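A simulation sketch of this estimator (the instruments, endogeneity structure, and coefficients are made-up assumptions): OLS is biased, while $\hat\beta_{IV} = (X'P_ZX)^{-1}X'P_Zy$ recovers the true coefficients.

```python
# Sketch: generalised IV (2SLS) via the projection onto the instruments.
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant + two instruments
u = rng.normal(size=n)                                       # structural error
x_end = z[:, 1] + z[:, 2] + u + rng.normal(size=n)           # endogenous: correlated with u
X = np.column_stack([np.ones(n), x_end])
y = X @ np.array([1.0, 2.0]) + u

Pz_X = z @ np.linalg.solve(z.T @ z, z.T @ X)                 # first-stage fitted values Pz X
beta_iv = np.linalg.solve(Pz_X.T @ X, Pz_X.T @ y)            # (X'PzX)^{-1} X'Pz y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)   # slope biased upward (> 2) because of the endogeneity
print(beta_iv)    # close to [1, 2]
```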
4 Miscellaneous Examples
The estimators.⁸
In the simple linear regression model, $X$ has two columns: a vector of ones and a vector containing the explanatory variable $x$.
⁸ To ease notation, we write $\sum$ instead of $\sum_{i=1}^{n}$.
By working on the general formula for $b$, we obtain
$$b = \left(X'X\right)^{-1}X'y = \left(\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\right)^{-1}\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
$$= \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}^{-1}\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} = \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
$$= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2}\begin{bmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ n\sum x_i y_i - \sum x_i \sum y_i \end{bmatrix}.$$
Consider the second element of the vector $b$,
$$b_2 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{\sum x_i y_i - \sum x_i\, n^{-1}\sum y_i}{\sum x_i^2 - \sum x_i\, n^{-1}\sum x_i} = \frac{\sum x_i (y_i - \bar y)}{\sum x_i (x_i - \bar x)} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}.$$
As for the other element, we have found that
$$b_1 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}.$$
Since $\sum x_i^2 \sum y_i = \bar y\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right) + \bar y\left(\sum x_i\right)^2$ and $\sum x_i \sum x_i y_i = \left(n\sum x_i y_i - \sum x_i \sum y_i\right)\bar x + \bar y\left(\sum x_i\right)^2$, therefore
$$b_1 = \frac{\bar y\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right) - \left(n\sum x_i y_i - \sum x_i \sum y_i\right)\bar x}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \bar y - b_2\bar x.$$
Standard errors.
To find the covariance matrix of $b$, note that $b = \beta + (X'X)^{-1}X'\varepsilon$, so
$$E\left[(b - \beta)(b - \beta)'\right] = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right] = (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} = \sigma^2\left(X'X\right)^{-1},$$
which follows from the facts that $X$ is fixed in repeated samples and $E[\varepsilon\varepsilon'] = \sigma^2 I$.⁹ In the simple regression case,
$$E\left[(b - \beta)(b - \beta)'\right] = \frac{\sigma^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}\begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}.$$
⁹ If $X$ is stochastic, then all the moments are defined conditional on $X$. For example, $E[\varepsilon\varepsilon' \mid X] = \sigma^2 I$.
The standard error of $b_1$ is given by the square root of element $(1,1)$ of the covariance matrix, while the covariance between $b_1$ and $b_2$ is given by its off-diagonal element:
$$se(b_1) = \sqrt{\frac{\sigma^2 \sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}} \quad \text{and} \quad \mathrm{cov}(b_1, b_2) = \frac{-\sigma^2 \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}.$$
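The elementwise formulas can be checked against $\sigma^2(X'X)^{-1}$ directly; a brief sketch with arbitrary simulated regressors follows.

```python
# Sketch: the (1,1) and (1,2) entries of sigma^2 (X'X)^{-1} match the formulas above.
import numpy as np

rng = np.random.default_rng(7)
n, sigma = 500, 2.0
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
V = sigma**2 * np.linalg.inv(X.T @ X)        # sigma^2 (X'X)^{-1}

den = n * (x**2).sum() - x.sum()**2
se_b1 = np.sqrt(sigma**2 * (x**2).sum() / den)
cov_b1_b2 = -sigma**2 * x.sum() / den
print(np.isclose(np.sqrt(V[0, 0]), se_b1))   # True
print(np.isclose(V[0, 1], cov_b1_b2))        # True
```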
In the original regression model, the estimated parameter vector is $b = (X'X)^{-1}X'y$ and the residuals can be calculated as $e = y - \hat y = y - Xb$. Now regress $y$ on $Z = XA$, where $A$ is a nonsingular $K \times K$ matrix. The estimated parameter vector is
$$b_A = (Z'Z)^{-1}Z'y = A^{-1}(X'X)^{-1}(A')^{-1}A'X'y = A^{-1}b,$$
so there is a linear relation between the coefficients estimated in the two regressions. Replace now the definitions of $Z$ and $b_A$ in the residuals for this new regression to see that they are the same as those in the original regression ($y$ on $X$):¹⁰
$$e_A = y - Zb_A = y - (XA)(A^{-1}b) = y - Xb = e.$$
¹⁰ This is due to the fact that the projection matrices in both models are the same. We have that $P_X = X(X'X)^{-1}X'$ and $P_Z = Z(Z'Z)^{-1}Z' = XA(A'X'XA)^{-1}A'X' = XA\,A^{-1}(X'X)^{-1}(A')^{-1}A'X' = X(X'X)^{-1}X' = P_X$, so $e = (I - P_X)y = (I - P_Z)y = e_A$. In words, the space spanned by the columns of $X$ is the same as the span of $Z$; the only difference is the basis.
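A quick check of this invariance result (simulated data, an arbitrary nonsingular $A$): the coefficients rotate by $A^{-1}$ while the fitted values and residuals are unchanged.

```python
# Sketch: regressing y on Z = XA leaves fitted values and residuals unchanged.
import numpy as np

rng = np.random.default_rng(8)
n, k = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)
A = rng.normal(size=(k, k)) + 3 * np.eye(k)    # almost surely nonsingular
Z = X @ A

b = np.linalg.lstsq(X, y, rcond=None)[0]
bA = np.linalg.lstsq(Z, y, rcond=None)[0]
print(np.allclose(bA, np.linalg.solve(A, b)))  # bA = A^{-1} b
print(np.allclose(y - Z @ bA, y - X @ b))      # identical residuals
```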
References
Abadir, K., and J. Magnus (2005): Matrix Algebra, Econometric Exercises. Cambridge University Press.
Angrist, J. D., and J.-S. Pischke (2008): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data. The MIT Press, second edn.