Contents

Part 1. Linear Models

Chapter 1. Introduction
1.1. Introduction

2.1. From the linear population model to the linear regression model
2.3. Difficulties

4.1. Introduction
4.2. Unbiasedness

Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity
5.1. Introduction

7.1. Introduction
7.3. GLS

8.1. Introduction
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)

9.1. Introduction

10.1. Introduction

Part 2. Non-linear models

11.1. Introduction

12.1. Introduction

13.1. Introduction

14.1. Introduction
14.3. Estimation

Bibliography
Part 1
Linear Models
CHAPTER 1
Introduction
1.1. Introduction
Causation is not the same thing as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,

(1.1.1)  y = φ(x, ε),

a mapping from the economic factors of interest, x, and a q×1 random vector of unobservables, ε, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, φ(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; or it may be the demand function for some input of a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising latent technological heterogeneity, and so on.1
The individual φ(x, ε), with its gradient vector of marginal effects, ∂x φ(x, ε), and Hessian matrix, Dxx φ(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function.

1 Wooldridge (2010) prefers to think of φ(x, ε) as a structural conditional expectation: E(y|x, ε) ≡ φ(x, ε).
The key question is under what conditions these estimable statistical objects are informative about φ(x, ε). Evidently, to establish a mapping between the structural economic object of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in φ(x, ε), and to do so in a plausible way. The restrictions used for this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.
The linear population model specifies

(1.2.1)  y = x'β + ε,

where β is a k×1 vector of parameters; then, given P.3, E(y|x) = x'β.

Note that if, for any a ≠ 0, Pr(a'x ≠ 0) = 1, then E(a'xx'a) > 0, which in turn implies a'E(xx')a > 0. Therefore, E(xx') is positive definite and so non-singular, that is, P.2.
Exercise 1. Prove that if x = (1 x1)' then assumption P.2 is equivalent to Var(x1) ≠ 0.

Solution:

E(xx') = [ 1  E(x1) ; E(x1)  E(x1²) ],

whose determinant is E(x1²) − [E(x1)]² = Var(x1); hence E(xx') is non-singular if and only if Var(x1) ≠ 0.
Premultiply both sides of equation (1.2.1) by x and take expectations: since, by P.3, E(xε) = E[x(y − x'β)] = 0, then

E(xy) − E(xx')β = 0,

or E(xy) = E(xx')β. Assumption P.2, then, ensures that the foregoing system can be solved for β to have

(1.2.3)  β = E(xx')⁻¹ E(xy).
At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, φ(x, ε), ε and ∂x φ(x, ε), and, on the other, the observable or estimable objects y, x, E(xx') and E(xy). Indeed, φ(x, ε), ε and ∂x φ(x, ε) are equal to unique known transformations of y, x, E(xx') and E(xy): under (1.2.1), φ(x, ε) = x'β + ε, ε = y − x'β and ∂x φ(x, ε) = β, with β given by (1.2.3). This means that φ(x, ε), ε and ∂x φ(x, ε) can be estimated using estimators for E(xx') and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares (OLS) estimator and its various generalizations.
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality:

P.4: Var(ε|x) = σ².
P.5: ε|x ~ N(0, σ²).

A more general variant of the linear model, instead, replaces assumption P.3 with

P.3b: E(xε) = 0.
Under P.3b it is still true that β = E(xx')⁻¹E(xy), but the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general. The function x'β, with β = E(xx')⁻¹E(xy), is the linear projection of y on x (note that P.3b, together with E(ε) = 0, implies cov(x, ε) = E(xε) − E(x)E(ε) = 0).
CHAPTER 2

Let β = (β1 β2 ... βk)'. For each sampled observation i the population model reads

(2.2.1)  yi = xi'β + εi,

or, stacking all n observations,

y = Xβ + ε,
where y (n×1) stacks the observations y1, ..., yi, ..., yn; X (n×k) stacks the regressor rows x1', ..., xi', ..., xn'; and ε (n×1) stacks the errors ε1, ..., εi, ..., εn.
It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties. LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of the other columns or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi'a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi) = 0, i = 1, ..., n, or E(yi|xi) − xi'β = 0, i = 1, ..., n. Since

E(εi|x1, ..., xi, ..., xn) = E(yi|x1, ..., xi, ..., xn) − xi'β

and, by RS, E(yi|x1, x2, ..., xn) = E(yi|xi), then

E(εi|x1, ..., xi, ..., xn) = E(yi|xi) − xi'β = 0.
If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one easily verifies that

LRM.4: Var(ε|X) = σ²In.
LRM.5: ε|X ~ N(0, σ²In).

While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by

LRM.3b: E(Σᵢ₌₁ⁿ xi εi) = 0.
Some or all of LRM.1-LRM.5 may fail if the population model assumptions and/or the RS mechanism are not verified in reality. Here is a list of the most important population issues.

Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.

Perfect multicollinearity (P.2 fails): some variables in x are indeed linear combinations of the others. LRM.2 fails, but in general this is not a serious problem; it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.

Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.

Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
As we will see in Chapter 4, although multicollinearity does not affect the statistical properties of the estimators in finite samples, it can severely affect the precision of the coefficient estimates in terms of large standard errors.
CHAPTER 3
This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real n×1 vector containing the observations of the dependent variable; X is a real (n×k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data
set US_gasoline.dta.
X (n×k) stacks the regressor rows x1', ..., xi', ..., xn'. The OLS coefficient vector b is defined as the minimizer of the residual sum of squares

S(bo) = (y − Xbo)'(y − Xbo).

Expanding the quadratic form,

S(bo) = y'y − 2bo'X'y + bo'X'Xbo,

whose gradient is

∂S(bo)/∂bo = −2X'y + 2X'Xbo,

so that the first order conditions (OLS normal equations) of the minimization problem, evaluated at bo = b, are

(3.3.1)  −X'y + X'Xb = 0,

which solved for b give

(3.3.2)  b = (X'X)⁻¹X'y.
Notice that:

The estimator exists since X'X is non-singular, X being of full column rank.

The estimator is a true minimizer since the Hessian of S(bo),

∂²S(bo)/∂bo∂bo' = 2X'X,

is a positive definite matrix (i.e. S(bo) is globally convex in bo). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c'Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption, z = Xc ≠ 0 for any c ≠ 0; therefore c'X'Xc = z'z = Σᵢ₌₁ⁿ zi² > 0 for any c ≠ 0.
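The do-file algebra_OLS.do demonstrates results of this kind in Stata; the following minimal sketch in the same spirit (the variable names depvar, x1 and x2 are assumptions) reproduces b by matrix algebra:

* compute b = (X'X)^-1 X'y directly (sketch; assumed variable names)
matrix accum XX = x1 x2              // X'X, with a constant column appended by default
matrix vecaccum yX = depvar x1 x2    // y'X, a 1 x k row vector
matrix b = invsym(XX)*yX'            // solves the OLS normal equations
matrix list b                        // compare with: regress depvar x1 x2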
The OLS residual vector is

(3.3.3)  e = y − Xb,

and the normal equations (3.3.1) can be rewritten as the orthogonality conditions

(3.3.4)  X'(y − Xb) = 0.

Therefore, if X contains a column of all unity elements, say 1, three important implications follow.

(1) The sample mean of e is zero:

1'e = Σᵢ₌₁ⁿ ei = 0 and consequently ē = (1/n) Σᵢ₌₁ⁿ ei = 0.
(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is, ȳ = x̄'b, where ȳ = (Σᵢ₌₁ⁿ yi)/n and

x̄' = ( (1/n) Σᵢ₌₁ⁿ x1i  ...  (1/n) Σᵢ₌₁ⁿ xki )

(this follows from 1'(y − Xb) = 0).
(3) Let

(3.3.5)  ŷ = Xb

denote the OLS predicted values of y, and let their sample mean be (1/n) Σᵢ₌₁ⁿ ŷi; then the sample mean of ŷ equals ȳ, since 1'ŷ = 1'(y − e) = 1'y.
use c:\giovanni\mydata

If the path involves folders with names that include blanks, then enclose the whole path in double quotes. For example:

use "/Users/giovanni/didattica/Greene/dati/ch. 1/mydata"
3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,
help use
will open a new window describing use:
Title

    [D] use  --  Load Stata-format dataset

Syntax

    use filename [, clear nolabel]

Menu

    File > Open...

Description

    use loads a Stata-format dataset previously saved by save into memory.
    If filename is specified without an extension, .dta is assumed.  If your
    filename contains embedded spaces, remember to enclose it in double
    quotes.

    In the second syntax for use, a subset of the data may be read.

Options

    clear specifies that it is okay to replace the data in memory, even though
    the current data have not been saved to disk.

    nolabel prevents value labels in the saved data from being loaded.  It is
    unlikely that you will ever want to specify this option.

Examples

    . use http://www.stata-press.com/data/r11/auto
    . replace rep78 = 3 in 12
    . use http://www.stata-press.com/data/r11/auto, clear
    . keep make price mpg rep78 weight foreign
    . save myauto
    . use make rep78 foreign using myauto
    . describe

Also see

    Manual:  [D] use
    Help:    [D] compress, [D] datasignature, [D] fdasave, [D] haver,
             [D] infile (free format), [D] infile (fixed format), [D] infix,
             [D] insheet, [D] odbc, [D] save, [D] sysuse, [D] webuse
3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in Example 1.2 of Greene (2008), with annual values of US aggregate consumption (c) used as the dependent variable and regressed on annual values of US personal income (y) for the period 1970-1979.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"
. regress c y
      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000      .906381    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
regress includes the constant term (the unity vector) by default and always with the name _cons. If you don't want it, just add the regress option noconstant:

regress depvar x1 x2, noconstant

Notice that, according to a general rule of Stata syntax, the options of any Stata command always follow the comma symbol. This means that if you wish to specify options you have to write the comma after the last argument of the command, so that everything to the right of the comma is treated by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma at all.
After execution, regress leaves behind a number of objects in memory, mainly scalars and matrices, that will stay there, available for use, until the execution of the next estimation command. To know what these objects are, consult the section Saved results in the help for regress, where you will find the following description.
Saved results

Scalars
    e(N)             number of observations
    e(mss)           model sum of squares
    e(df_m)          model degrees of freedom
    e(rss)           residual sum of squares
    e(df_r)          residual degrees of freedom
    e(r2)            R-squared
    e(r2_a)          adjusted R-squared
    e(F)             F statistic
    e(rmse)          root mean squared error
    e(ll)            log likelihood under additional assumption of i.i.d. normal errors
    e(ll_0)          log likelihood, constant-only model
    e(N_clust)       number of clusters
    e(rank)          rank of e(V)

Macros
    e(cmd)           regress
    e(cmdline)       command as typed
    e(depvar)        name of dependent variable
    e(model)         ols or iv
    e(wtype)         weight type
    e(wexp)          weight expression
    e(title)         title in estimation output when vce() is not ols
    e(clustvar)      name of cluster variable
    e(vce)           vcetype specified in vce()
    e(vcetype)       title used to label Std. Err.
    e(properties)    b V
    e(estat_cmd)     program used to implement estat
    e(predict)       program used to implement predict
    e(marginsok)     predictions allowed by margins
    e(asbalanced)    factor variables fvset as asbalanced
    e(asobserved)    factor variables fvset as asobserved

Matrices
    e(b)             coefficient vector
    e(V)             variance-covariance matrix of the estimators
    e(V_modelbased)  model-based variance

Functions
    e(sample)        marks estimation sample
You should already be familiar with some of the e() objects in the Scalars and Matrices parts. At the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts; they are rather technical and, in any case, not relevant for our purposes.

To know the values taken on by the e() objects, execute the command ereturn list just after the regress instruction. In our regression example we have:
. ereturn list

scalars:
               e(N) =  10
            e(df_m) =  1
            e(df_r) =  8
               e(F) =  959.919036180133
              e(r2) =  .9917348458900325
            e(rmse) =  8.193020017500434
             e(mss) =  64435.11918375102
             e(rss) =  537.0046160573024
            e(r2_a) =  .9907017016262866
              e(ll) =  -34.10649331948547
            e(ll_0) =  -58.08502782843004
            e(rank) =  2

macros:
         e(cmdline) : "regress c y"
           e(title) : "Linear regression"
       e(marginsok) : "XB default"
             e(vce) : "ols"
          e(depvar) : "c"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"

matrices:
               e(b) :  1 x 2
               e(V) :  2 x 2

functions:
        e(sample)
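The saved results can be reused in later computations; for instance, right after regress:

display "R-squared = " e(r2)
matrix list e(b)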
R(A) can easily be proved to be a subspace of Rⁿ (it is obvious that R(A) ⊆ Rⁿ; R(A) is a vector space since, given any two vectors a1 and a2 belonging to R(A), then a1 + a2 ∈ R(A) and ca1 ∈ R(A) for any real scalar c). Since each element of R(A) is a vector of n components, R(A) is said to be a vector space of order n. The dimension of R(A), denoted by dim[R(A)], is the maximum number of linearly independent vectors in R(A). Therefore, dim[R(A)] = rank(A) and, if A is of full column rank, then dim[R(A)] = k.

The set of all vectors in Rⁿ that are orthogonal to the vectors of R(A) is denoted by A⊥. I now prove that A⊥ is a subspace of Rⁿ. A⊥ ⊆ Rⁿ by definition. Given any two vectors b1 and b2 belonging to A⊥ and any a ∈ R(A), b1'a = 0 and b2'a = 0, but then also (b1 + b2)'a = 0 and, for any scalar c, (cb1)'a = 0, which completes the proof.

Importantly, it is possible to prove, although not pursued here, that

(3.4.1)  dim(A⊥) = n − rank(A).

A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).

Exercise 3. Prove that any subspace of Rⁿ contains the null vector.
For simplicity, assume A of full column rank and define the operator P[A] as

P[A] = A(A'A)⁻¹A'.

As an exercise you can verify that P[A] is a symmetric (P[A]' = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is called an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,

M[A] = I − P[A]

is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is
symmetric and idempotent).
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n×k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k×k) matrix).

Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k×k). Therefore, P[A] = BK(K'B'BK)⁻¹K'B'. An important result of linear algebra states that, given two conformable matrices C and D, rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, then rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64)),

P[A] = BK(K'B'BK)⁻¹K'B' = BK K⁻¹(B'B)⁻¹(K')⁻¹ K'B' = B(B'B)⁻¹B' = P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is P[A] M[A] = 0.
(3.5.1)  P[A]A = A

and

(3.5.2)  M[A]A = 0.

From (3.3.3) and (3.3.2), the OLS residual vector can be written as

(3.5.3)  e = M[X]y,

where

M[X] = I − X(X'X)⁻¹X'.

Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the residual maker.
From (3.3.2) and (3.3.5) it follows that

ŷ = P[X]y,

and so the vector of OLS predicted (fitted) values, ŷ, is the orthogonal projection of y onto the space spanned by the regressors, R(X). Clearly, e'ŷ = 0 (see Exercise 6 or also equation (3.3.4)); therefore the OLS method decomposes y into two orthogonal components:

(3.5.4)  y = ŷ + e.
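These projection results are easy to verify numerically; a minimal Mata sketch (the regressor names x1 and x2 are assumptions) builds P[X] and M[X] and checks idempotency and orthogonality:

mata
X = st_data(., ("x1","x2")), J(st_nobs(), 1, 1)  // regressors plus a constant (assumed names)
P = X*invsym(cross(X,X))*X'                      // P[X], projector onto R(X)
M = I(rows(X)) - P                               // M[X], the residual maker
max(abs(P*P - P))                                // ~0: P[X] is idempotent
max(abs(P*M))                                    // ~0: P[X] and M[X] are orthogonal
end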
Partition X = (X1 X2) and, accordingly,

b = (b1' b2')'.

Exercise 9. Prove that if X is of full column rank, then M[X1]X2 is of f.c.r.

Solution: I prove the result by contradiction. Suppose M[X1]X2 b = 0 for some vector b ≠ 0; then P[X1]X2 b = X2 b. Therefore, X1 a = X2 b, where a = (X1'X1)⁻¹X1'X2 b, so that X(a' −b')' = 0 with (a' −b')' ≠ 0, which contradicts X being of full column rank.
The normal equations X'(y − Xb) = 0 can be written in partitioned form as

[ X1'X1  X1'X2 ; X2'X1  X2'X2 ] [ b1 ; b2 ] = [ X1'y ; X2'y ],

or

(3.6.1)  X1'y − X1'X1 b1 − X1'X2 b2 = 0
(3.6.2)  X2'y − X2'X1 b1 − X2'X2 b2 = 0.

Solving (3.6.1) for b1 gives

(3.6.3)  b1 = (X1'X1)⁻¹X1'(y − X2 b2),

which shows at once the important result that if X1 and X2 are orthogonal, then

b1 = (X1'X1)⁻¹X1'y  and  b2 = (X2'X2)⁻¹X2'y.

Replacing the right hand side of (3.6.3) into (3.6.2),

X2'y − X2'X1(X1'X1)⁻¹X1'(y − X2 b2) − X2'X2 b2 = 0,

or

X2'(y − X2 b2) − X2'P[X1](y − X2 b2) = 0,

that is,

X2'(I − P[X1])(y − X2 b2) = 0,

so that eventually

b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y.

By symmetry,

b1 = (X1'M[X2]X1)⁻¹X1'M[X2]y.
An important special case is X1 = 1. Then P[1] = 1(1'1)⁻¹1' and M[1] = I − 1(1'1)⁻¹1' transform variables into deviations from their sample means, so that the OLS estimator b2 can be obtained by regressing y demeaned on X2 demeaned.
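The FWL result can be checked in Stata by partialling out; a sketch under assumed variable names y, x1 and x2:

regress y x1
predict double ey, residuals       // M[X1]y
regress x2 x1
predict double ex2, residuals      // M[X1]x2
regress ey ex2, noconstant         // slope equals the coefficient on x2 in: regress y x1 x2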
The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.

Lemma 12. Given the partitioning X = (X1 X2), the following representation of P[X] holds:

(3.6.4)  P[X] = P[X1] + P[M[X1]X2],

with both terms on the right hand side of (3.6.4) being orthogonal.1
3.6.1. Add one regressor. Given the initial regressor matrix, X, include an additional regressor z (z is a non-zero (n×1) vector), so that there is a larger regressor matrix, W, partitioned as W = (X z).

Now, consider the OLS estimator from the regression of y on W,

(d' c')' = (W'W)⁻¹W'y,

with the resulting OLS residual

(3.6.5)  u = y − W(d' c')' = y − Xd − zc,
1 Equation (3.6.4) can be proved directly using the formula for the inverse of the 2×2 partitioned matrix, or indirectly, but more easily, by using the FWL Theorem and noticing that, for any non-null y and X = (X1 X2) of f.c.r., P[X]y = X1b1 + X2b2, where b1 = (X1'X1)⁻¹X1'(y − X2b2) and b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y. You just have to plug the right hand sides of b1 and b2 into the right hand side of P[X]y = X1b1 + X2b2, work through all the algebraic simplifications and eventually notice that the equation you end up with, P[X]y = (P[X1] + P[M[X1]X2])y, holds for any y.
and the formula for c in closed form, obtained as a specific application of the general analysis of the previous section, is

(3.6.6)  c = (z'M[X]z)⁻¹ z'M[X]y,

where M[X] = I − X(X'X)⁻¹X'.
We wish to prove that the residual sum of squares always decreases when one regressor is added to X, i.e., given e defined as in (3.3.3), u'u < e'e. To this purpose, it is convenient to express d as in equation (3.6.3):

d = (X'X)⁻¹X'(y − zc).

Then

u = y − X(X'X)⁻¹X'(y − zc) − zc
  = y − X(X'X)⁻¹X'y + X(X'X)⁻¹X'zc − zc
  = M[X]y − M[X]zc.

Since M[X]y = e (from (3.5.3); remember that M[X] is the residual maker),

u = e − M[X]zc,

so that

u'u = e'e − 2e'M[X]zc + c² z'M[X]z.
Then, replacing e with M[X]y in the second addendum and considering that M[X] is idempotent gives

u'u = e'e − 2(z'M[X]y)c + c² z'M[X]z,

and substituting c = z'M[X]y/(z'M[X]z),

(3.6.8)  u'u = e'e − (z'M[X]y)²/(z'M[X]z).

Since (z'M[X]y)²/(z'M[X]z) > 0,

(3.6.9)  u'u < e'e.
(3.6.9)
Exercise 14. How would formulas for c, d and u0 u change if the new regressor z is
orthogonal to X?
3.6.2. The squared coefficient of partial correlation (not covered in class, but a good and easy exercise). The squared coefficient of partial correlation between y and z,

r²yz = (z'M[X]y)² / (z'M[X]z · y'M[X]y),

measures the extent to which y and z are related net of the variation of X. In this sense it is closely related to c, and indeed

c = r²yz · y'M[X]y / z'M[X]y.

Moreover, combining the definition of r²yz with (3.6.8),

u'u = (1 − r²yz) e'e.
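Stata computes partial correlations directly with pcorr; a sketch under assumed variable names:

pcorr y z x1 x2    // the row for z reports the partial correlation of y and z net of x1 and x2; its square is r2yz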
The total sum of squares is

SST = Σᵢ₌₁ⁿ (yi − ȳ)² = y'M[1]y.

Its degrees of freedom are n − 1 and not n, since M[1]y are the residuals from the regression of y on 1 (see Exercise 11) and so there can be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then given equation (3.4.1), dim(1⊥) = n − 1.

By the orthogonal decomposition for y in (3.5.4),

M[1]y = M[1]ŷ + M[1]e.

But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, thereby

(3.7.1)  M[1]e = e

and

M[1]y = M[1]ŷ + e.

Then,

SST = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.
The degrees of freedom of SSE = e'e are n − k and not n, since in the residual space, X⊥, there can be no more than n − k linearly independent vectors. This follows from (3.4.1) and the assumption that X is of full column rank k.

Since e'M[1]ŷ = e'ŷ = 0 by (3.7.1), the decomposition of SST simplifies to

(3.7.2)  ŷ'M[1]ŷ = y'M[1]y − e'e,

and since the regression sum of squares is SSR = ŷ'M[1]ŷ,

R² ≡ SSR/SST = ŷ'M[1]ŷ / y'M[1]y,

or

R² = 1 − e'e / y'M[1]y.
Therefore, if the constant term is included in the regression, 0 ≤ R² ≤ 1, and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit.2 There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y lies already in R(X), then y = ŷ (and also e'e = 0) and R² = 1, a perfect (but useless) fit.3

2 If the constant term is not included in the regression, then (3.7.1) does not hold and R² may be negative.
A problem with the R² measure is that it never decreases when a regressor is added to X (this emerges straightforwardly from the R² formula and the inequality in (3.6.9)), and in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case of R² = 1 is attained if n = k, since in this case y ends up lying in R(X)). This problem may be obviated by using the corrected (adjusted) R², R̄², defined by including in the formula of R² the appropriate degrees of freedom corrections:

R̄² = 1 − [SSE/(n − k)] / [SST/(n − 1)].

In fact, R̄² does not necessarily increase when one more regressor is added.
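Both measures are easily recomputed from the regress saved results; continuing the consumption example:

quietly regress c y
display "R2     = " 1 - e(rss)/(e(mss) + e(rss))
display "adj R2 = " 1 - (e(rss)/e(df_r))/((e(mss) + e(rss))/(e(N) - 1))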
Exercise 15. Prove that, given W, u and e defined as in Section 3.6.1, the coefficient of determination resulting from the regression of y on W is

R²w = R² + (1 − R²) r²yz,

and that the adjusted R-squared can be written as

R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).
When the constant is included, the centered R-squared can be written as

(3.8.1)  R² = ŷ'M[1]ŷ / y'M[1]y = b'X'M[1]Xb / y'M[1]y = y'P[X]M[1]P[X]y / y'M[1]y,
and the uncentered R-squared is

(3.8.2)  R²u = ŷ'ŷ / y'y = b'X'Xb / y'y = 1 − e'e / y'y,

whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and

(3.8.3)  R² = 1 − e'e / y'M[1]y

if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample mean, that is, M[1]y = y and M[1]X = X. Clearly, in the latter case R² = R²u.
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²y,ŷ; that is,

(3.8.4)  R² = (ŷ'M[1]y)² / (ŷ'M[1]ŷ · y'M[1]y).
This can be proved along the following lines. Since y = ŷ + e,

ŷ'M[1]y = ŷ'M[1](ŷ + e)
        = ŷ'M[1]ŷ + ŷ'M[1]e
        = ŷ'M[1]ŷ + ŷ'e
        = ŷ'M[1]ŷ,

where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals. Replacing ŷ'M[1]y with ŷ'M[1]ŷ in (3.8.4) then yields (3.8.1).

This property is not shared by the uncentered R-squared, unless variables have zero sample means.
3.8.1. A convenient formula for R² when the constant is included (not covered in class). By Lemma 12, P[X] = P[1] + P[M[1]X], so M[1]P[X] = P[M[1]X] and P[X]M[1]P[X] = P[M[1]X], so that eventually

(3.8.5)  R² = y'P[M[1]X]y / y'M[1]y,

which proves at once that the R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X, namely the OLS regression of y in mean-deviations on X in mean-deviations.
CHAPTER 4
4.2. Unbiasedness
Under LRM.1-LRM.3, OLS is unbiased, that is, E(b) = β.

This is proved as follows. From LRM.1, LRM.2 and the OLS formula in (3.3.2),

(4.2.1)  b = β + (X'X)⁻¹X'ε,

so that, by LRM.3,

E(b|X) = β + (X'X)⁻¹X'E(ε|X) = β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.
The conditional covariance matrix of b is

Var(b|X) = E[(b − β)(b − β)'|X]
         = E[(X'X)⁻¹X'εε'X(X'X)⁻¹|X]
         = (X'X)⁻¹X'E(εε'|X)X(X'X)⁻¹,

which, under LRM.4, simplifies to Var(b|X) = σ²(X'X)⁻¹.
Next I prove that OLS has the smallest, in a sense that will be clarified soon, covariance matrix in the class of linear unbiased estimators.

I define the following partial order in the space of the l×l symmetric matrices:

Definition 17. Given any two l×l symmetric matrices A and B, A is said to be no smaller than B if and only if A − B is non-negative definite (n.n.d.).

Given the partial order of Definition 17, OLS has the smallest conditional covariance matrix in the class of linear unbiased estimators. This important result is universally known as the Gauss-Markov Theorem. To prove it, define the generic member of the foregoing class as
bo = Cy,

where C is a generic k×n matrix that depends only on the sample information in X and is such that CX = Ik (how would you explain the last requirement?). OLS is of course a member of the class, with its own C equal to C_OLS = (X'X)⁻¹X'. It follows that Var(bo|X) = σ²CC'.

Define now D = C − (X'X)⁻¹X', and notice that DX = CX − Ik = 0. Then

Var(bo|X) = σ² [D + (X'X)⁻¹X'] [D' + X(X'X)⁻¹]
          = σ²DD' + σ²(X'X)⁻¹
          = Var(b|X) + σ²DD'.
Since DD' is n.n.d., Var(bo|X) is no smaller than Var(b|X), which proves the theorem. Indeed, if A − B is n.n.d., then r'(A − B)r ≥ 0 for any conformable vector r. Thanks to the Gauss-Markov Theorem, we can therefore say that any linear combination of the components of b, r'b, has no larger conditional variance than r'bo. Formally, the theorem implies that r'[Var(bo|X) − Var(b|X)]r ≥ 0. This is important given the fact that in empirical applications we are interested in linear combinations of the population coefficients, as in the following example, where it is shown that the estimates of individual coefficients can always be expressed as specific linear combinations of the k components of the estimators.
Example 18. On noticing that bi = ri'b and boi = ri'bo, i = 1, ..., k, where ri is the k×1 vector with all zero elements except the i-th entry, which equals unity, and given the Gauss-Markov Theorem, we conclude that Var(boi|X) ≥ Var(bi|X). More generally, the estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
Now we prove that the Gauss-Markov Theorem extends to the unconditional variances. From

Var(bo|X) = Var(b|X) + σ²DD',

we have

E_X[Var(bo|X)] = E_X[Var(b|X)] + σ²E_X(DD'),

or

Var(bo) = Var(b) + σ²E_X(DD'),

and since E_X(DD') is n.n.d., we can also state that the unconditional variance of OLS is no greater than that of any linear unbiased estimator.
An unbiased estimator of σ² under LRM.1-LRM.4 is the residual sum of squares divided by the appropriate degrees of freedom correction,

s² = e'e/(n − k).

To prove unbiasedness, note that e = M[X]y = M[X]ε, so that

(4.4.1)  E(s²|X) = (1/(n − k)) E(ε'M[X]ε|X).

Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule of the trace of a matrix product, ε'M[X]ε = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields

E(s²|X) = (1/(n − k)) E[tr(M[X]εε')|X].

Then, exploiting the fact that both trace and expectation are linear operators,

E(s²|X) = (1/(n − k)) tr[E(M[X]εε'|X)]
        = (1/(n − k)) tr[M[X]E(εε'|X)]
(4.4.2) = (1/(n − k)) σ² tr(M[X]),

where the last equality follows from LRM.3 and LRM.4. Now, focus on tr(M[X]):

tr(M[X]) = tr[In − X(X'X)⁻¹X']
         = tr(In) − tr[(X'X)⁻¹X'X]
         = n − k,

so that E(s²|X) = σ².
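In Stata, s is the Root MSE reported by regress, so s² can be recovered in two equivalent ways:

quietly regress c y
display e(rmse)^2 "  " e(rss)/e(df_r)    // s2 computed two ways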
Finally, an unbiased estimator of Var(b|X) is V̂ar(b) = s²(X'X)⁻¹. In fact,

E[V̂ar(b)|X] = σ²(X'X)⁻¹ = Var(b|X),

and since E[V̂ar(b)] = E_X{E[V̂ar(b)|X]} by the law of iterated expectations, then

E[V̂ar(b)] = σ²E_X[(X'X)⁻¹] = Var(b).
If LRM.5 also holds, then

b|X ~ N(β, σ²(X'X)⁻¹).

Also, since e = M[X]ε, e|X ~ N(0, σ²M[X]). Writing both estimators as linear transformations of ε,

[ b − β ; e ] = [ (X'X)⁻¹X' ; M[X] ] ε,

makes it possible to prove at once that b and e are also jointly normal with zero covariances, conditional on X. Specifically,

[ b ; e ] |X ~ N( [ β ; 0 ],  σ² [ (X'X)⁻¹X' ; M[X] ] [ X(X'X)⁻¹  M[X] ] )

or

[ b ; e ] |X ~ N( [ β ; 0 ],  [ σ²(X'X)⁻¹  0 ; 0  σ²M[X] ] ),
where the zero off-diagonal blocks follow since

Cov(b, e|X) = E[(b − β)e'|X]
            = E[(X'X)⁻¹X'εε'M[X]|X]
            = (X'X)⁻¹X'E(εε'|X)M[X]
            = σ²(X'X)⁻¹X'M[X]
            = 0(k×n).
Exercise 21. Verify, by direct computation of Var(e|X), that Var(e|X) = σ²M[X].

Exercise 22. Is Var([ b ; e ]|X) non-singular?1
1 In general, the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]'|z}.
4.5.1. Testing single linear restrictions. Let (X'X)⁻¹ᵢᵢ stand for the i-th main diagonal element of (X'X)⁻¹; then

bi|X ~ N(βi, σ²(X'X)⁻¹ᵢᵢ)

and, given the properties of the normal distribution, bi can be standardized to have

(4.5.1)  (bi − βi) / √(σ²(X'X)⁻¹ᵢᵢ) |X ~ N(0, 1),

i = 1, ..., k. Were σ² known, under the null hypothesis H0: βi = β̄i the statistic

(bi − β̄i) / √(σ²(X'X)⁻¹ᵢᵢ) ~ N(0, 1)

could be used to test H0; since σ² is unknown, however, it cannot be used as it is. With some adjustment we can make it operational, though. Just replace σ² with s²:

(4.5.2)  ti = (bi − βi) / √(s²(X'X)⁻¹ᵢᵢ).

The denominator term in expression (4.5.2) is the standard error estimate for coefficient bi.
First, notice that since s² = e'e/(n − k),

(4.5.3)  (n − k) s²/σ² = (ε/σ)' M[X] (ε/σ).

Recall that if z ~ N(0, I) and A is symmetric and idempotent with rank p, then z'Az ~ χ²(p). The right hand side of (4.5.3) is an idempotent quadratic form in a standard normal vector and, in the light of the foregoing distributional result, has a chi-squared distribution with degrees of freedom equal to rank(M[X]) = n − k:

(n − k) s²/σ² ~ χ²(n − k).
By Theorem 19, b and e are independent conditional on X, hence (bi − βi)/√(σ²(X'X)⁻¹ᵢᵢ) and (n − k)s²/σ² are conditionally independent. Further,

ti = [(bi − βi)/√(σ²(X'X)⁻¹ᵢᵢ)] / √{[(n − k)s²/σ²]/(n − k)},

and since, if z ~ N(0, 1), x ~ χ²(p), and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom, it follows that ti|X ~ t(n − k), i = 1, ..., k. Finally, since the t distribution does not depend on the sample information and, specifically, on X, then ti and any component of X are statistically independent, so that the above holds also unconditionally, that is, ti ~ t(n − k), i = 1, ..., k.
For example, if the slope coefficients of a log-linear production function are the product elasticities of the two inputs, the hypothesis of constant returns to scale is that the two elasticities sum to 1. Single linear restrictions of the general form H0: r'β = q can be tested with

t = (r'b − q) / √(s² r'(X'X)⁻¹r) ~ t(n − k);

the zero restriction on an individual coefficient considered above is the special case with r = ri and q = 0.
4.5.1.1. Stata implementation. For the sake of exposition, I report here the regress output
already seen in Section 3.3.3.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"
. regress c y
      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000      .906381    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). The second column reports the standard error estimates peculiar to each OLS coefficient,

ŝei = √(s²(X'X)⁻¹ᵢᵢ), i = 1, ..., k.

The third column reports the values of the t statistics for the k null hypotheses βi = 0, i = 1, ..., k:

ti = bi / √(s²(X'X)⁻¹ᵢᵢ).

The fourth column (labeled P>|t|) reports the so-called p-value for the two-sided t-test. It is defined as the probability that a t-distributed random variable is more extreme than the outcome of ti in absolute value: Pr[(t < −|ti|) ∪ (t > |ti|)], or more compactly Pr(|t| > |ti|).
Clearly, if the p-value is smaller than the chosen size of the test (the Type I error), then ti certainly falls into the critical region and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of the critical region (the lowest Type I error) we could have fixed and still rejected the null, given the test outcome. In this sense, the p-value is more informative than critical values. In the regress example, if we choose a critical region of 5% size, given that Pr(|t| > 2.42) = 0.042 < 0.05, we can reject at 5% the hypothesis that the constant term is equal to zero, knowing that we could also have rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is the minimum size allowing rejection, and for this reason we can't reject at 1%. This is a clear case of borderline significance, one which we could not have identified with such precision by simply looking at the 5% critical values. On the other hand, the p-value for the coefficient on y is virtually zero (as low as 0.000 at the displayed precision). This indicates that, no matter how conservative we are towards the null, we can reject it at any conventional level of significance (conventional sizes, with an increasing degree of conservativeness, are 10%, 5%, 1%) and also at a less conventional 0.1%.
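These quantities can be reproduced by hand from the saved results; a sketch for the constant term:

quietly regress c y
scalar tcons = _b[_cons]/_se[_cons]
display "t = " tcons "   p-value = " 2*ttail(e(df_r), abs(tcons))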
4.5.2. From tests to confidence intervals. Let us fix a critical region of size α for our two-sided t test of the null H0: βi = β̄i; the null is not rejected whenever

|bi − β̄i| / ŝei < t_(α/2),

an event that, when the null is true, has probability (1 − α). Formally,

Pr(−t_(α/2) < (bi − βi)/ŝei < t_(α/2)) = Pr(−ŝei t_(α/2) < bi − βi < ŝei t_(α/2))
                                      = Pr(bi − ŝei t_(α/2) < βi < bi + ŝei t_(α/2))
                                      = (1 − α),

so that [bi − ŝei t_(α/2), bi + ŝei t_(α/2)] is a (1 − α) confidence interval for βi. But β̄i belongs to [bi − ŝei t_(α/2), bi + ŝei t_(α/2)] if and only if |bi − β̄i|/ŝei < t_(α/2), so the interval contains all of the null hypotheses β̄i that we cannot reject at size α. So, while a given t test is informative only for the specific null it is testing, the confidence interval conveys to the researcher much more information. The last column of the regress output reports the 95% confidence intervals for each OLS coefficient.
Exercise 24. Your regression output for a given coefficient βi yields bi = −9.320 and ŝei = 1.760. 1) Compute the t-statistic for the null H0: βi = 0. 2) In your regression n − k = 334; this implies that t0.025 = 1.967, where t0.025 satisfies Pr(t > t0.025) = 0.025. Will you reject or not H0: βi = 0 against H1: βi ≠ 0 at 5%, and why? 3) Using only your answer to Question 2, will you expect that 0 belongs to the 95% confidence interval for βi? 4) You are told that H0: βi = −6 is not rejected against H1: βi ≠ −6 at 5%. 5) Using only your answers to Question 4, can you assert that the p-value of that test is greater than 0.05? Also, do you expect the absolute value of the t statistic for H0: βi = −6 to be greater or smaller than 1.967, and why? Verify your answer by directly computing the value of the t statistic. 6) Now test H0: βi ≥ 0 against H1: βi < 0 with a 5% significance level. Is the critical value for this test equal to, smaller or greater than 1.967 in absolute value?

Answer: 1) ti = −9.320/1.760 = −5.30, so 2) H0: βi = 0 is rejected at 5% since |ti| > 1.967, and 3) 0 does not belong to the 95% confidence interval. 5) Since H0: βi = −6 is not rejected at 5%, ti falls within the acceptance region and so the test p-value > 0.05; for the same reason the value of |ti| must be smaller than t0.025 = 1.967. Indeed, ti = (−9.320 + 6)/1.760 = −1.89. 6) For the one-sided test the critical value is t0.05, which is smaller than 1.967 in absolute value.
Exercise 25. Your regression output for a given coefficient βi yields ŝei = 3.577. The outcome of the t-test for H0: βi = 0 against H1: βi ≠ 0 shows p-value = 0.07. Can you reject the null at 10%? Can you at 5%?
4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions, H0: Rβ − q = 0, where R and q are, respectively, a J×k matrix and a J×1 vector of fixed known constants, such that no row of R can be obtained as a linear combination of the others, that is, R is of full row rank J.

Under the null,

Rb − q = R(b − β),

and so, given LRM.5,

Rb − q |X ~ N(0, σ²R(X'X)⁻¹R').

Since, for a p×1 random vector x ~ N(0, Σ) with Σ non-singular, x'Σ⁻¹x ~ χ²(p), it follows that

W = (Rb − q)'[σ²R(X'X)⁻¹R']⁻¹(Rb − q) |X ~ χ²(J).

Again, σ² is unknown, so W is not operational; replacing σ² with s² and dividing by J yields the statistic

F = (Rb − q)'[R(X'X)⁻¹R']⁻¹(Rb − q) / (Js²).

Recall also that if x1 ~ χ²(p1) and x2 ~ χ²(p2) are independent, then (x1/p1)/(x2/p2) ~ F(p1, p2).
It is not hard to see that the above result can be applied to F, since it can be reformulated as the ratio of two conditionally independent chi-squared random variables, each corrected by its own degrees of freedom. In fact, in the numerator we have

(Rb − q)'[σ²R(X'X)⁻¹R']⁻¹(Rb − q) / J,

a χ²(J) variable divided by its degrees of freedom, and in the denominator (n − k)s²/σ² divided by n − k. The former is a function of b alone, the latter is a function of e alone. Therefore, in the light of Theorem 19, the two are conditionally independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).

As with the t statistic, since the F distribution does not depend on the sample information, we have that the above holds unconditionally: F ~ F(J, n − k).
When H0 is a set of J exclusion restrictions, then q = 0 and each row of R has all zero elements except a unity in the entry corresponding to the parameter that is set to zero. For example, with β = (β1 β2 β3)' and H0: β1 = 0 and β3 = 0, then J = 2, q' = (0 0) and

R = [ 1 0 0 ; 0 0 1 ],

so that

Rβ = [ β1 ; β3 ] = [ 0 ; 0 ].
The F-test can always be rewritten as a function of the residual sum of squares under the unrestricted model, e'e, and the residual sum of squares under the model with the restrictions imposed, say ē'ē:

F = [(ē'ē − e'e)/J] / [e'e/(n − k)].

This is proved for the case of exclusion restrictions by using Lemma 12.
Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set of exclusion restrictions H0: β2 = 0, where k2 is the number of columns of X2. Here Rb − q = b2 and, by the partitioned-inverse result, R(X'X)⁻¹R' = (X2'M[X1]X2)⁻¹, so that

F = b2'(X2'M[X1]X2)b2 / (k2 s²).

Since b2 = (X2'M[X1]X2)⁻¹X2'M[X1]y, the numerator of the right hand side of the foregoing equation can be written more compactly as y'P[M[X1]X2]y. Hence, by Lemma 12,

F = y'(P[X] − P[X1])y / (k2 s²) = y'(M[X1] − M[X])y / (k2 s²) = [(ē'ē − e'e)/k2] / s².
It is not hard to prove that if the constant term is kept in both models, then

F = [(R² − R*²)/J] / [(1 − R²)/(n − k)],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted model.
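In Stata, joint linear restrictions are tested after regress with the test command; a sketch under assumed variable names:

regress y x1 x2 x3
test x1 x3      // F test of H0: the coefficients on x1 and x3 are jointly zero, with J = 2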
For a random sample, conditional expectations can be stacked in vector form:

E(y|x) = (E(y1|x), ..., E(yn|x))'  and  E(y|w) = (E(y1|w), ..., E(yn|w))'.

If x is a subvector of w, it can be written as x = Aw for a selection matrix such as

A = [ 1 0 0 ; 0 1 0 ],

so that x is a function of w.
Remark 27. Notice that in the formulation of conditional expectations the way the conditioning set is represented is just a matter of notational convenience. What matters are the random scalars that enter the conditioning set, and not the way they are organized therein. For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w), where w = (w1 w2 w3 w4)', or as E(y|W), where

W = [ w1 w3 ; w2 w4 ].
Given Remark 27 the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.
LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f (W ), then E (y|X) = E [E (y|W ) |X].
Paralleling the considerations made above, since f(·) is a generic function, from LIE(vector|matrix) there follows a special LIE for the case in which X is a submatrix of W. Therefore, given W = (W1 W2), by LIE(vector|matrix) we always have
E (y|W1 ) = E [E (y|W ) |W1 ]
and
E (y|W2 ) = E [E (y|W ) |W2 ] .
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.

So far we are in the classical regression framework, but now let x' = (x1' x2'), with x1 being a k1×1 vector and x2 a k2×1 vector, k = k1 + k2, and maintain that x2 is latent or, in any case, not included in the statistical model; let's explore the implications for the statistical model. P.1 implies that

(4.7.1)  y = X1β1 + ν,

where ν = X2β2 + ε.
X1 is of f.c.r. as well. So, in a sense, LRM.1 and LRM.2 continue to hold. But, as far as LRM.3 and LRM.4 are concerned, nothing can be said. Specifically, we do not know whether E(ν|X1) = 0 or Var(ν|X1) = ς²In. The first consequence is that the OLS estimator for β1,

b1 = (X1'X1)⁻¹X1'y,

is likely to be biased. Indeed, the bias can easily be derived as follows. Replacing the right hand side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)⁻¹X1'ν = β1 + (X1'X1)⁻¹X1'X2β2 + (X1'X1)⁻¹X1'ε.
Since RS holds, εi, i = 1, ..., n, is conditional-mean independent of (xi1' xi2'), and therefore

E(b1|X1, X2) = β1 + (X1'X1)⁻¹X1'X2β2;

hence, by the law of iterated expectations, we have the unconditional bias

(4.7.2)  E(b1) − β1 = E[(X1'X1)⁻¹X1'X2]β2.
There are two specific instances, however, in which the bias is zero.

The first instance is that analyzed in Greene (2008), when X1'X2 = 0(k1×k2). In this case (X1'X1)⁻¹X1'X2β2 = 0 and the bias vanishes.

The second instance is when E(X2β2|X1) = 0. If in the population E(ε|x) = 0, then by the general law of iterated expectations also E(ε|x1) = 0. Hence, E(x2'β2 + ε|x1) = 0 and ν in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is that b1 is unbiased.

The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 = 0(k1×k2). But also the converse is not true, and X1'X2 = 0(k1×k2) may hold even if E(X2β2|X1) ≠ 0, as when x1 is a binary variable and x2 = 1 − x1: while E(x2β2|x1) = (1 − x1)β2 ≠ 0, in this case a random sampling of y, x1 and x2 from the population will yield X1'X2 = 0 and E(X2β2|X1) ≠ 0.
Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case, and in general omitted variables will bring about bias and inconsistency in the coefficient estimates. Solutions are typically given by proxy variables, panel data estimators and instrumental variables estimators. The first method is briefly described below, the classical panel data estimators are pursued in Chapter 8, while IV methods are described in Chapter 10.

To conclude, observe that if relevant variables are omitted, LRM.4 does not generally hold either, unless Var(x2'β2 + ε|x1) is constant.
Lemma 29. Given any two non-singular square matrices of the same dimension, A and B, if A − B is n.n.d. then B⁻¹ − A⁻¹ is n.n.d.

The foregoing lemma signifies that, in the space of non-singular square matrices of a given dimension, if A is no smaller than B, then A⁻¹ is no greater than B⁻¹. It is useful in situations in which the difference of inverse matrices is more easily worked out than that of the original matrices.
The following exercise asks you to think through the consequences of overfitting, namely applying OLS to a statistical model with variables that are redundant in the population model.

Exercise 30. Let the population model be y = x'β + ε, with x and β both k×1 vectors and P.1-P.4 satisfied. Assume also that the l×1 vector z of observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also, assume E(ε|x z) = 0 and Var(ε|x z) = σ², so that z is redundant in the population model. Finally, assume there is a sample of size n from the population, such that the elements of the sequence {(yi xi' zi'), i = 1, ..., n} are i.i.d. 1×(1 + k + l) random vectors. Applying the usual notation,
X (n×k) stacks the rows x1', ..., xi', ..., xn'; Z (n×l) stacks the rows z1', ..., zi', ..., zn'; and ε (n×1) stacks the errors ε1, ..., εi, ..., εn. 1) Verify that
y = Xβ + ε

satisfies LRM.1-LRM.4 (of course, this proves that

(4.7.3)  b = (X'X)⁻¹X'y

is BLUE). 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased estimator for β, and call it bofit. 3) Derive the covariance matrix of bofit. 4) Use Lemma 29 and verify that, indeed, the conditional covariance matrix of bofit is no smaller than that of b in (4.7.3). 5) A byproduct of the overfitting strategy is the l×1 vector of OLS coefficients for the variables in Z; let's call it c. Express c using the intermediate result in the FWL theorem as

c = (Z'Z)⁻¹Z'(y − Xbofit)

and prove that the overfitting residual vector eofit ≡ y − Xbofit − Zc equals M[M[Z]X]M[Z]y. 6) Propose an unbiased estimator of σ² based on eofit.
Answer: 1) Obvious, since in the population and the sampling mechanism we have all we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by the FWL theorem: bofit = (X'M[Z]X)⁻¹X'M[Z]y, so that, given E(ε|X Z) = 0, E(bofit|X Z) = β. 3) Var(bofit|X Z) = (X'M[Z]X)⁻¹X'M[Z]E(εε'|X Z)M[Z]X(X'M[Z]X)⁻¹ = σ²(X'M[Z]X)⁻¹. 4) Work out the difference X'X − X'M[Z]X = X'P[Z]X, which is n.n.d., and invoke the lemma. 5) Easy, it's just algebra: replace bofit and c into eofit ≡ y − Xbofit − Zc and rearrange. 6) First, use the formula of the overfitting residual vector derived in the previous question, M[M[Z]X]M[Z]y, to set up the estimator

s²ofit = eofit'eofit / (n − k − l).

Then take the expectation of the trace of the quadratic form, conditional on X and Z, and follow the same steps as when proving unbiasedness of the standard estimator s². In the derivation don't forget that M[Z] and M[M[Z]X] are orthogonal projectors.
4.7.1. The proxy variables solution. Assume for simplicity that there is only one omitted variable x2 in the population equation

(4.7.4)  y = x1'β1 + x2β2 + ε,

and that an l×1 vector z of proxies for x2 is available, satisfying, in particular:

(3) rank E[ (x1' z')'(x1' z') ] = k1 + l. This is analogous to property P.2 in Chapter 1 and permits identification of the coefficients in the proxy variable regression, as we will see below.

Write

(4.7.5)  x2 = E(x2|x1 z) + ν,

where ν = x2 − E(x2|x1 z), and hence E(ν|x1 z) = 0. Assuming the conditional expectation is linear, E(x2|x1 z) = x1'δ1 + z'δ2, replacing the right hand side of equation (4.7.5) into the population equation (4.7.4) yields

(4.7.6)  y = x1'(β1 + δ1β2) + z'(δ2β2) + νβ2 + ε,

where E(νβ2 + ε|x1 z) = 0, and so, along with P.1 and P.2 (given Assumption 3), also P.3 holds for the proxy variable regression.
Now consider the variance of an individual OLS coefficient. Partition the regressor matrix as X = (X₋i xi), where X₋i collects the k − 1 control variables and xi is the regressor of interest, and partition accordingly

b = (b₋i' bi)',

where b₋i is (k − 1)×1. Then

Var(bi|X) = σ²(X'X)⁻¹ᵢᵢ,

where (X'X)⁻¹ᵢᵢ indicates the last entry on the main diagonal of (X'X)⁻¹.
My purpose here is to derive the formula for (X'X)⁻¹ᵢᵢ. As usual when it comes to the derivation of formulas regarding parts of the OLS vector, I invoke the Frisch-Waugh-Lovell (FWL) Theorem. Hence,

bi = (xi'M[X₋i]xi)⁻¹ xi'M[X₋i]y,
so, given

y = X₋iβ₋i + xiβi + ε,

it follows that

bi = (xi'M[X₋i]xi)⁻¹ xi'M[X₋i](xiβi + ε) = βi + (xi'M[X₋i]xi)⁻¹ xi'M[X₋i]ε,

and consequently

Var(bi|X) = E{(xi'M[X₋i]xi)⁻¹ xi'M[X₋i] εε' M[X₋i]xi (xi'M[X₋i]xi)⁻¹ |X}
(4.8.1)   = (xi'M[X₋i]xi)⁻¹ xi'M[X₋i] E(εε'|X) M[X₋i]xi (xi'M[X₋i]xi)⁻¹
          = σ² (xi'M[X₋i]xi)⁻¹.
Comparing with Var(bi|X) = σ²(X'X)⁻¹ᵢᵢ gives

(4.8.2)  (X'X)⁻¹ᵢᵢ = 1 / (xi'M[X₋i]xi).
Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above all, I could get away without referring to the hard-to-remember result on the inverse of the (2×2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in Greene (2008), p. 30).
Exercise 31. Prove (4.8.2) using formula (A-74) for the inverse of the (2×2) partitioned matrix in Greene (2008), p. 966.
4.8.1. The three determinants of Var(bi|X) when 1 is a regressor. Now I get back to Var(bi|X) in equation (4.8.1),

Var(bi|X) = σ² / (xi'M[X₋i]xi),

and assume that X₋i contains the unity vector 1. M[X₋i]xi is the residual vector from the OLS regression of xi on X₋i, so xi'M[X₋i]xi is the residual sum of squares for this regression. Since the unity vector is a column of X₋i, the coefficient of determination for this regression is

Ri² = 1 − xi'M[X₋i]xi / (xi'M[1]xi),

so that xi'M[X₋i]xi = (1 − Ri²) xi'M[1]xi,
and eventually

Var(bi|X) = σ² / [(1 − Ri²) xi'M[1]xi].

Also,

xi'M[1]xi = Σⱼ₌₁ⁿ (xji − x̄i)²,

that is, xi'M[1]xi is the total variation in xi around its sample mean, x̄i. Therefore,

(4.8.3)  Var(bi|X) = σ² / [(1 − Ri²) Σⱼ₌₁ⁿ (xji − x̄i)²].

Expression (4.8.3) shows that Var(bi|X) increases when:

the error variance σ² increases;

other things constant, Ri² increases, in words the correlation between xi and the other regressors increases (this is the multicollinearity effect on the variance of the OLS individual coefficient);

other things constant, the total variation in xi, Σⱼ₌₁ⁿ (xji − x̄i)², decreases.
Alternatively, since X₋i contains 1, Lemma 12 gives P[X₋i] = P[1] + P[M[1]X₋i], or

M[X₋i] = M[1] − P[M[1]X₋i],

and so

xi'M[X₋i]xi = xi'M[1]xi − xi'P[M[1]X₋i]xi = xi'M[1]xi [ 1 − xi'P[M[1]X₋i]xi / (xi'M[1]xi) ],

which shows that

Ri² = xi'P[M[1]X₋i]xi / (xi'M[1]xi).
Remark 32. Multicollinearity, when it does not degenerate into perfect multicollinearity, i.e. det(X'X) = 0, does not affect the finite sample properties of OLS. Nonetheless, it may severely reduce the precision of our estimates, in terms of larger standard errors and confidence intervals.
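After regress, Stata reports the multicollinearity effect through the variance inflation factors, 1/(1 − Ri²); a sketch under assumed variable names:

regress y x1 x2 x3
estat vif      // variance inflation factor of each regressor, 1/(1 - Ri^2)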
Exercise 33. Partition b = (b̃' b0)', where b̃ is of dimension (k − 1)×1 and b0 is the OLS estimator of the constant term β0. Prove that

b0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1)×1 vector of sample means of the regressors (hint: just use the intermediate result from the proof of the FWL Theorem that b1 = (X1'X1)⁻¹X1'(y − X2b2)).
The result can be pushed further to derive the variance of the intercept. Let X̃ collect the non-constant regressors, so that b0 = ȳ − x̄'b̃, and proceed in three steps:

1) Var(ȳ|X) = E{[ȳ − E(ȳ|X)]²|X} = E(ε̄²|X) = σ²/n;

2) Var(x̄'b̃|X) = x̄'Var(b̃|X)x̄;

3) cov(ȳ, x̄'b̃|X) = E[ε̄ · x̄'(X̃'M[1]X̃)⁻¹X̃'M[1]ε |X] = (σ²/n) x̄'(X̃'M[1]X̃)⁻¹X̃'M[1]1 = 0, since M[1]1 = 0.

Combining 1)-3) proves that

Var(b0|X) = σ²/n + x̄'Var(b̃|X)x̄.
Finally, consider the residual vector from the reduced (partialled-out) regression, u = M[X2]y − M[X2]X1b1. I now prove that u is equal to the OLS residual vector e ≡ M[X]y. Since b1 = (X1'M[X2]X1)⁻¹X1'M[X2]y,

u = M[X2]y − M[X2]X1(X1'M[X2]X1)⁻¹X1'M[X2]y = (M[X2] − P[M[X2]X1])y.

Applying Lemma 12 to the partition X = (X1 X2), P[X] = P[X2] + P[M[X2]X1], or M[X2] = M[X] + P[M[X2]X1]. Then

u = (M[X] + P[M[X2]X1] − P[M[X2]X1])y = M[X]y = e.
CHAPTER 5

The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity

Consider two separate regression models for the male and female groups,

ym = Xmβm + εm  and  yf = Xfβf + εf,

with group-specific error variances

σ²m = σ²ω²m  and  σ²f = σ²ω²f,

not necessarily equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two separate regressions satisfy
bm|Xm ~ N(βm, σ²m(Xm'Xm)⁻¹)  and  bf|Xf ~ N(βf, σ²f(Xf'Xf)⁻¹),

and both are BLUE.

The question I ask is whether it is possible to embed the two models into a single regression model by pooling the two sub-samples into a larger one with size equal to n = nm + nf, and continue to estimate βm and βf efficiently. Define
y = [ ym ; yf ],  Xw = [ Xm ; Xf ],  ε = [ εm ; εf ].
Let 1 denote the (n×1) vector of all unity elements and construct the (n×1) vector d such that its first nm entries are all unity and the last nf are all zero.

Variables like d are usually referred to as dummy variables or indicator variables, since they indicate whether any observation in the sample belongs or not to a given group. In this particular case, d is the male dummy variable, indicating whether any observation in the sample is specific to the male group. Since the two groups are mutually exclusive, the female dummy variable, indicating whether any observation in the sample belongs to the female group, can be constructed as the complementary vector 1 − d. By construction, d and 1 − d are orthogonal, that is, d'(1 − d) = 0.

Let xwi' be a (1×k) row vector indicating the i-th row of Xw, and let yi, εi and di be scalars indicating the i-th components of y, ε and d, respectively.
With this in hand, the model for the generic worker i = 1, ..., n is

(5.2.1)  yi = di xwi'βm + (1 − di) xwi'βf + εi.

Defining

β = [ βm ; βf ]  and  X = [ Xm  0(nm×k) ; 0(nf×k)  Xf ],

where 0(s×t) indicates an (s×t) matrix of all zero elements, model (5.2.1) can be reformulated in matrix form as

(5.2.2)  y = Xβ + ε.
Exercise 35. Prove that X has f.c.r. if and only if both Xm and Xf have f.c.r.

Summing up, we have two equivalent representations of the same model: 1) that in Greene (2008), with the two separate regressions; 2) that presented here, with a single regression model, represented by (5.2.2). It turns out that the two frameworks are equivalent as far as efficient estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS estimator, b, from model (5.2.2) is numerically identical to the OLS estimators from the two separate regressions as presented in Greene (2008), i.e. b = (bm' bf')'. Let
b = (X'X)⁻¹X'y.

By construction,

X'y = [ Xm'ym ; Xf'yf ]  and  X'X = [ Xm'Xm  0(k×k) ; 0(k×k)  Xf'Xf ].

Then, by a well-known property of the inverse of a block diagonal matrix (see (A-73) in Greene (2008)),

(X'X)⁻¹ = [ (Xm'Xm)⁻¹  0(k×k) ; 0(k×k)  (Xf'Xf)⁻¹ ].
74
Hence,
0
b = @
0
= @
0
= @
0 X )
(Xm
m
0(kk)
0(kk)
Xf0 Xf
1
0 X ) 1 X0 y
(Xm
m
m
m
A
1
0
Xf Xf
Xf0 yf
1
bm
A.
bf
0
b=@
bm
bf
10
A@
0 y
Xm
m
Xf0 yf
1
A
1
A
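In Stata, the pooled model (5.2.2) can be estimated by fully interacting the regressors with the group dummy; a sketch assuming a 0/1 dummy male and a single regressor x:

generate double x_m = male*x
generate double x_f = (1 - male)*x
generate byte one_m = male
generate byte one_f = 1 - male
regress y one_m x_m one_f x_f, noconstant    // reproduces bm and bf from the separate regressions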
It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances ε, although independently distributed, suffer from what is usually referred to as group-wise heteroskedasticity, as the model does not maintain σ²m = σ²f. The covariance matrix of ε is

Σ = [ σ²ω²m Inm  0(nm×nf) ; 0(nf×nm)  σ²ω²f Inf ].

In this sense, model (5.2.2) is not a classical regression model. Does this mean that b is not BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and the Oaxaca model is one such case. This is verified in the next section. A general necessary and sufficient condition for OLS to be BLUE is postponed to the last section of this tutorial.
Define the weighting matrix

H = [ ωm⁻¹ Inm  0(nm×nf) ; 0(nf×nm)  ωf⁻¹ Inf ].

As stated by the exercise below, the matrix H, when premultiplied to any conformable vector, transforms the vector so that its first nm elements get divided by ωm and the remaining nf by ωf. This is what we refer to as weighting.

Exercise 37. Verify by direct inspection that, given any (nm×1) vector xm, any (nf×1) vector xf and

x = [ xm ; xf ],

then

Hx = [ ωm⁻¹ xm ; ωf⁻¹ xf ].
+"
e=X
where the tilde indicates weighted variables. Two important facts are worth observing at this
point. First, the population parameters vector,
model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to
n,
(so, if LRM.5 holds, they are independent normal variables), since

Var(ε̃|X̃) = H Var(ε|X) H' = HΣH
          = [ ωm⁻¹Inm  0 ; 0  ωf⁻¹Inf ] [ σ²ω²m Inm  0 ; 0  σ²ω²f Inf ] [ ωm⁻¹Inm  0 ; 0  ωf⁻¹Inf ]
          = σ²In.
Therefore, the weighted model is a classical regression model that identifies the parameters of interest, and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted model (5.3.1), referred to as the weighted least squares (WLS) estimator, bw, is BLUE:

bw = (X̃'X̃)⁻¹X̃'ỹ
   = [ ωm⁻²Xm'Xm  0(k×k) ; 0(k×k)  ωf⁻²Xf'Xf ]⁻¹ [ ωm⁻²Xm'ym ; ωf⁻²Xf'yf ]
   = [ (Xm'Xm)⁻¹Xm'ym ; (Xf'Xf)⁻¹Xf'yf ],

which proves that b = bw, namely that in the Oaxaca model the OLS estimator coincides with the optimal WLS estimator.
Does this imply that we can do inference in the Oaxaca model by feeding the Stata regress command with the variables of model (5.2.2) without further caution? Not quite. Although
the single OLS regression provides the BLUE estimator for the population coefficients β, the conventional covariance estimator

V̂ar(b|X) = s² [ (Xm'Xm)⁻¹  0(k×k) ; 0(k×k)  (Xf'Xf)⁻¹ ],

with s² obtained from the sum of squares of the pooled residuals, is biased. The reason is that V̂ar(b|X) forces the regression variance estimate to be constant across the two samples.
Luckily, the same is not true for the separate regressions on the two subsamples, which provide us with the unbiased estimators of the model coefficients, bm and bf, and the unbiased estimator of the covariance matrix

V̂ar(b|X) = [ s²m(Xm'Xm)⁻¹  0(k×k) ; 0(k×k)  s²f(Xf'Xf)⁻¹ ],

where s²m = (1/(nm − k)) Σᵢ₌₁^nm ei² and s²f = (1/(nf − k)) Σᵢ₌nm₊₁^n ei². An alternative is a feasible version of the weighted regression explained above, using sm and sf as weights. But this is clearly more computationally cumbersome than carrying out the two separate regressions.
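A sketch of the feasible weighted route in Stata, reusing the interacted variables built above (names are assumptions):

regress y x if male
scalar s2m = e(rmse)^2                          // first-step variance, male group
regress y x if !male
scalar s2f = e(rmse)^2                          // first-step variance, female group
generate double w = cond(male, 1/s2m, 1/s2f)    // group-specific weights
regress y one_m x_m one_f x_f [aweight=w], noconstant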
The general necessary and sufficient condition for OLS to be BLUE is Zyskind's condition, which requires that P[X] commute with the error covariance matrix: ΣP[X] = P[X]Σ. Under LRM.4 the condition holds trivially, since Σ = σ²In and σ²In P[X] = P[X] σ²In.
That Zyskind's condition is also verified in the Oaxaca model is straightforwardly proved upon elaborating P[X]:

P[X] = X(X'X)⁻¹X'
     = [ Xm  0 ; 0  Xf ] [ (Xm'Xm)⁻¹  0 ; 0  (Xf'Xf)⁻¹ ] [ Xm'  0 ; 0  Xf' ]
     = [ P[Xm]  0(nm×nf) ; 0(nf×nm)  P[Xf] ].

Therefore,

ΣP[X] = [ σ²m Inm  0 ; 0  σ²f Inf ] [ P[Xm]  0 ; 0  P[Xf] ]
      = [ σ²m P[Xm]  0 ; 0  σ²f P[Xf] ]
      = [ P[Xm]  0 ; 0  P[Xf] ] [ σ²m Inm  0 ; 0  σ²f Inf ]
      = P[X]Σ.
As a final remark, recall that Zyskind's condition ensures only that the OLS coefficients are BLUE, saying nothing about the properties of the OLS standard error estimates; indeed, we have seen in the previous sections that they may be biased even if b is BLUE. The following exercise on partitioning provides another instance of such an occurrence.
Exercise 38. Consider the partitioned model

(5.4.1)  y = X1β1 + X2β2 + ε,

maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation by M[X2] boils down to the reduced regression model

(5.4.2)  ỹ = X̃1β1 + ε̃,

where ỹ = M[X2]y, X̃1 = M[X2]X1 and ε̃ = M[X2]ε. 2) How can you interpret the variables in model (5.4.2)? 3) As far as β1 is concerned, does OLS applied to model (5.4.2) yield the same estimator as OLS applied to model (5.4.1)? Why or why not? 4) Does the reduced model (5.4.2) satisfy LRM.1-LRM.4? Which ones, if any, are not satisfied? Notice, in particular, that

Var(ε̃|X) = σ²M[X2].

5) The degrees of freedom of the reduced regression are n − k1. Do you think that the resulting OLS estimate for σ² is unbiased? Argue that one should instead use n − k degrees of freedom to correct the OLS residual sum of squares (which is nonetheless the same for both models (5.4.1) and (5.4.2), as we learn from Section 4.9). 6) Verify that OLS on (5.4.2) is nonetheless BLUE for β1; you have just to verify that

M[X2]P[M[X2]X1] = P[M[X2]X1]M[X2],

which is readily done by noting that M[X2], symmetric and idempotent, is the first and the last factor in P[M[X2]X1].
The within regression examined in Chapter 8 (equation (8.2.7)) is a special case of model
(5.4.2) in the foregoing exercise.
CHAPTER 6
2.
Assume also that T1 > k. It is worth noting right now that we are not
assuming T2 > k, so that the test we introduce here, dierently from the Classical Chow test,
goes through also in the case in which the second subsample does not permit identification of
the (k 1) vector of population coecients
y=@
y1
y2
A, X = @
X1
X2
A and " = @
"1
"2
A,
so that the top blocks of the partitioned matrices contain the first T1 observations and those
in the bottom contain the last T2 observations. The model under the null hypothesis is just
the classical regression model
(6.1.1)
and the OLS estimator for
y = X + ".
is
b = X 0 X
X 0 y.
The structural break is thought of as time-specific additive shocks hitting the model over the
second subsample. Therefore, the model in the presence of the structural break is formulated
81
82
as
y 1 = X1 + " 1
(6.1.2)
y 2 = X2 + IT2 + " 2 ,
shocks. Written in a more compact form where 0(lm) indicates a (l m) null matrix, the
general model becomes
0
@
y1
y2
A=@
X1 0(T1 T2 )
X2
IT2
10
A@
A+@
"1
"2
A,
from which it is clear that model (6.1.2) uses a larger regressor matrix than model (6.1.1),
including also the time dummies specific to the observations over the second subsample
0
and
D=@
0(T1 T2 )
I T2
b
c
A = W 0W
W 0 y,
where W = (X D) is the extended regressor matrix. Therefore, the null hypothesis can be
formally expressed as H0 :
significance
(6.1.3)
F =
(e0 e e0 e) /T2
F (T2 , T1
e0 e/ (T1 k)
k) ,
where e = y
83
Xb indicate the OLS residual (T 1) vector from the model under the null
and
(6.1.4)
e=y
W@
b
c
1
A
F =
where e1 = y1
X1 and accordingly
b1 = X10 X1
(6.1.6)
X1 y 1 .
Dierently from Greene (2008) I prove this result by using the FWL Theorem. We just need
to work out e and to do this I obtain the separate expressions for b and c. Lets get started
with b. By the FWL Theorem
b = X 0 M[D] X
X 0 M[D] y.
M[D] = In
0
@
0(T1 T2 )
I T2
1 20
10 0
13
0(T1 T2 )
0
A6
A @ (T1 T2 ) A7
4@
5
IT2
I T2
10
0(T1 T2 )
I T2
10
A .
1The degrees-of-freedom correction in the denominator of the F ratio follows from the fact that the number of
84
Therefore,
M[D] = In
(6.1.7)
= In
0(T1 T2 )
0(T1 T1 ) 0(T1 T2 )
@
@
I T2
10
A@
0(T2 T1 )
0(T1 T2 )
IT2
I T2
1
10
A
A.
The second matrix in equation (6.1.7) transforms any conformable vector it premultiplies
so that its first T1 values are replaced by zeroes and its last T2 values are left unchanged.
Accordingly, M[D] carries out the complement operation, transforming any conformable vector
to which it is premultiplied in a way that its first T1 values remain unchanged and the last T2
values get replaced by zeroes. Therefore,
0
M[D] X = @
X1
0(T2 k)
A,
D0 (y
Xb)
c = @
(6.1.8)
0(T1 T2 )
I T2
0(T1 T2 )
= y2
X2 b 1 .
= @
I T2
10
A (y
10 0
A @
Xb1 )
y1
X1 b 1
y2
X2 b 1
1
A
85
Therefore, the OLS coecients c are the predicted residuals for the second subsample by
means of the estimates b1 , obtained from the first subsample. Finally, replacing the right
hand sides of equations (6.1.6) and (6.1.8) into equation (6.1.4) yields
e = y Xb1
0
1
y1
A
= @
y2
0
e1
= @
0(T2 1)
D (y2 X2 b1 )
0
1 0
1
X1 b 1
0(T1 T2 )
@
A @
A
X2 b 1
y 2 X2 b 1
1
A,
which proves that e0 e = e01 e1 and consequently the F test expression of equation (6.1.5).
Remark 40. Since nothing of the foregoing derivation hinges upon the fact that the T2
observations are contiguous in the sample (the OLS estimator and residuals are invariant to
permutations of the rows), there is a more general lesson that can be learnt here. Regardless of
the data-set format, be it a time series, a cross-section or a panel data, extending the matrix of
regressors to dummy variables that indicate each a single observation will actually exclude all
involved observations from the estimation sample. Therefore, the above procedure can be used
both to test that given observations in the sample are not outliers and, in case of rejection,
to neglect the outliers from the estimation sample, without materially removing records from
the data-set.
6.2. An equivalent reformulation of the Chow predictive test
Here we stress the interpretation of the Chow predictive test as a test of zero prediction
errors in the period with insucient observations. In doing this, we reformulate the test using
the formula based on the Wald statistic.
From equation (6.1.8) derived in the previous section we have
c = y2
X2 b 1 .
86
when we use the first sub-period estimates to predict y in the second sub period, given X2 .
If the elements in c are not significantly dierent from zero jointly, then we have evidence for
not rejecting the null hypothesis of parameter constancy (zero
Chow predictive test in (6.1.5)
F =
normal ".
As usual, F can be rewritten as the Wald statistic divided by the number of restrictions
(T2 in this case):
1
\
c0 V ar
(c|X)
F =
T2
(6.2.1)
\
(c is the discrepancy vector and V ar
(c|X) is the OLS estimator of V ar (c|X) ). The F
formula in (6.2.1) can be made operational by elaborating the conditional covariance matrix
of the prediction error, V ar (c|X) , under the null hypothesis of the test, H0 :
If
+ (X10 X1 )
X2
X2 X10 X1
+ X10 X1
1
X10 "1 .
= 0.
X10 "1
Therefore, under H0
V ar (c|X) = E
"2
X2 X10 X1
X10 "1
h
= E "2 "02 |X + E X2 X10 X1
h
i
1
2
=
IT2 + X2 X10 X1
X20 ,
"2
1
X2 X10 X1
X10 "1
1
X20 |X
|X
87
where the second equality follows from the assumption of spherical disturbances over the whole
sample,
20
V ar 4@
Hence,
F =
"1
"2
A |X 5 =
IT .
h
\
V ar
(c|X) = s21 IT2 + X2 X10 X1
i
X20 ,
h
X2 b1 )0 IT2 + X2 (X10 X1 )
X20
T2 s21
(y2
X2 b 1 )
and
2,
2.
2I
T.
partitioning holds:
0
y=@
y1
y2
A, X = @
X1
X2
A and " = @
"1
"2
A,
so that the top blocks of the partitioned matrices contain the first T1 observations and those in
the bottom contain the last T2 observations. Assume that both X1 and X2 have f.c.r. Notice
that this is not ensured by f.c.r. for X as the following example show.
88
1
1
2
B
C
B
C
B 1 2 C
B
C
B
C
C
X=B
B 1 2 C
B
C
B
C
B 1 0 C
@
A
1 0
1 2
B
C
B
C
X1 = B 1 2 C
@
A
1 2
has rank=1.
+ "1
y 2 = X2
+ "2
or more compactly
0
y=W@
(6.3.1)
where
and
Let
and
W =@
A+"
X1
0T1 k
0T2 k
X2
are (k 1) vectors.
D=@
0T1 T2
I T2
1
A
A.
89
premultiplied by P[D] , is transformed into the interaction of x and the time dummy for the
second sub period. Hence, the general model (6.3.1) can be equivalently written as
y = M[D] X
or still equivalently, given M[D] = I
(6.3.2)
where
1.
+ P[D] X
+ ",
P[D] ,
y=X
+ P[D] X + ",
2,
is
by constructing just one set of interacted variables P[D] X and 2) carrying out an F test of
joint significance for the coecients on the interacted variables after OLS estimation of model
(6.3.2).
As it turns out, the classical Chow test is a special case of the predictive Chow. Consider
model (6.3.2) and reformulate it by expanding P[D]
y=X
+ D D0 D
D0 X + "
or
y = X + D + ",
where
shows that
and
= (D0 D)
D0 X = X2 , then
= X2 , which
2 RT2 . This implies that, given rank (X2 ) = k, the two tests are identically
CHAPTER 7
91
then
By strict exogeneity
E
then
X 0"
|X
n
and so
V ar
X 0"
n
1 0
X 0X
X"
+ plim
n
n
0
X"
+ Q 1 plim
.
n
X 0X
n
plim (b) =
V ar
X 0"
n
X 0"
n
1
E
n
1 X 0 X
n n
=0
1
= E
n
X 0 ""0 X
|X
n
X 0 X
n
which goes to zero as n ! 1 by assumption OLS.1. Hence X 0 "/n converges in squared mean,
and consequently in probability, to zero.
Clearly, the above implies that OLS is consistent in the classical case of LRM.4.
92
6
6
6
6
=6
6
6
4
2
1
0
2
2
0
..
.
..
..
.
..
.
0
..
.
7
7
7
7
7
0 7
7
5
2
n
and
1X
V ar (xi "i ) =
n
xi x0 i
1X 2
0
i E xi xi
n
i=1
0
X X
.
= E
n
i=1
Therefore,
2
iE
1X
lim
V ar (xi "i ) = lim E
n!1 n
n!1
i=1
X 0 X
n
which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,
p X
n
n
X 0"
X X
xi "i p ! N 0, plim
.
n
n
n d
i=1
93
Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),
p
)!N
n (b
and so
p
n (b
X 0X
n
0, Q
X 0"
p !Q
n d
plim
0"
p ,
n
1X
X 0 X
n
plim
X 0 X
n
\
Avar
(b) = X 0 X
2
2
6 e1 0
6
6 0 e2
2
6
=6 . .
6 ..
..
6
4
0
X 0 X
X 0X
..
.
..
.
0
0
..
.
7
7
7
7
7.
0 7
7
5
e2n
where the symbol stands for the element-by-element matrix product (also known as Hadamard
product).
Econometric softwares routinely compute robust OLS standard errors: these are just the
\
square roots of the main diagonal elements of Avar
(b) in (7.2.1). In Stata this is done through
the regress option vce(robust) (or, equivalently, simply robust).
94
7.2.4. Whites heteroskedasticity test. The Whites estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical
discrepancy between s2 (X 0 X)
and (X 0 X)
(X 0 X)
X 0 X
of homoskedasticity, the discrepancy will be small. This is the essence of the Whites heteroskedasticty test. The statistics measuring such discrepancy can be implemented through
the following auxiliary regression including the constant term.
(1) Generate the squared OLS residuals, e2 = e e
(2) Do the OLS auxiliary regression that uses e2 as the dependent variable and the
2 (p
2 (p
1).
We may implement the White test manually, saving the OLS residuals through predict and
then generating squares and interactions as appropriate, or more easily by giving the following
post-estimation command after regress: imtest, white.
95
Clustering cannot be neglected in empirical work. In the case of firm data, for example,
it is likely that there is correlation across the productivity shocks hitting firms in the same
sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.
The White estimator can be made robust to cluster correlation quite easily. I explain
this in terms of the firm data example. Assume that we have cross-sectional data of n firms,
indexed by i = 1, ..., n. There are G sectors, indexed by g = 1, ..., G and we know which sector
g = 1, ..., G firm i = 1, ..., n belongs to. This information is contained in the the n G matrix
D of sectoral indicators: the element of D in row i and column j, say d (i, j), is unity if firm
i belongs to sector j and zero if not. The cluster-correlation and heteroskedasticity consistent
estimator for the asymptotic covariance matrix of b is then assembled by simply replacing
in Equation (7.2.1) with
c = ee0 DD0 .
Stata does this through the regress option vce(cluster clustervar ), where clustervar is
the name of the cluster identifier in the data set.
Chapter 9 will cover cases of multi-clustering, that is data that are grouped along more
than one dimension.
7.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of
the average variance
2
n
1X
=
n
2
i,
i=1
is given by
s2n =
1X 2
ei ,
n
i=1
s2n
2
n
is a sequence).
Since
s2n
"0 "
=
n
"0 X
n
X 0X
n
X 0"
n
2
n
as
2
n
7.3. GLS
plim
s2n
= plim
= plim
96
"0 "
n
"0 "
n
+ 00 Q
By the RS assumption the squared errors, "2i , are all independently distributed with means
E "2i
2
i,
"0 "
1X 2
=
"i ,
n
n
i=1
n
"0 "
1X 2
plim
i = 0.
n
n
i=1
7.3. GLS
The estimation strategy described in the previous sections is based on OLS estimates for the
regression coecients with standard errors estimates corrected for heteroskedasticity and/or
cluster correlation. The drawback of the approach is a loss in eciency, if the departures from
LRM.4 are of a known form. We will see that in this case the BLUE can always be found.
To formalize the new set-up, let V ar ("|X) =
positive definite (p.d.) (n n) matrix and
2 ,
= C
C0
7.3. GLS
97
and
(7.3.2)
where
1/2
is the inverse of ,
1/2
= C
1/2 1/2
C 0,
1,
1/2
is
a diagonal matrix with main-diagonal elements equal to the square root reciprocals of the
main-diagonal elements of .
Consider the GLS transformed model
y = X + " ,
(7.3.3)
such that y
1/2 C 0 y,
1/2 C 0 X
and "
1/2 C 0 ".
= and
1/2 1/2
1.
= C
C 0 CC 0 = C
IC 0 = C
= I. Then,
C 0 = CC 0 = I
and
= CC 0 C
C 0 = CI
C 0 = C
1/2
C 0 = CC 0 = I.
is diagonal and so
0
2I
2I
1 X;
1/2 1/2
0
2) X " = X 0
1.
1 ";
then use the general law of iterated expectation to prove that also
Given the results of the foregoing exercise, the OLS applied to the transformed model
(7.3.3) is the Gauss-Marcov estimator for
bGLS =
(7.3.4)
X X
X 0
X y
1
X 0
7.3. GLS
X 0
1/2
1X
(7.3.5)
1/2
1/2 C 0
= C
98
y=
1/2
X +
1/2
"
is also a GLS transformation, that is OLS applied to model (7.3.5) yields bGLS .
Solution: By exercise 42
X 0
1/2
1/2
X 0
1/2
1/2
y = X 0
X 0
y.
The estimator bGLS is OLS applied to a classical regression model and as such it is BLUE.
The following exercise asks you to verify by direct inspection that GLS is better than OLS
in terms of covariance.
Exercise 45. Prove that
2
X 0X
X 0 X X 0 X
X 0
X 0
is a n.n.d. matrix.
Solution: We define a k n matrix D as
D X 0X
X0
X 0
Therefore,
X 0X
X 0 = X 0
X 0
+ D.
X 0
X 0
i h
+D
X 0 X X 0 X
X X 0
X 0
X
1
+ D0
+ DD0 .
=
=
7.3. GLS
99
result.
X 0 1X
n
X 0 1"
,
n
X "
,
n
prove that
plim (bGLS ) =
under assumption GLS.1 and strict exogeneity (SE).
Solution. Easy: just write
0
bGLS =
then consider that V ar (" |X ) =
2I
X X
n
in Section 7.2.1.
7.3.2. Asymptotic normality. I prove asymptotic normality for bGLS under GLS.1,
SE and RS (again, remember that V ar (" |X ) =
2I
n ).
0
E xi xi
7.3. GLS
100
and
n
1X
V ar (xi "i ) =
n
1
n
i=1
0
E xi xi
n
X
i=1
=
Therefore,
X 0 1X
n
1X
lim
V ar (xi "i ) =
n!1 n
lim E
n!1
i=1
X 0 1X
n
i=1
and since
p
n (bGLS
then
p
X 0 1X
n
n (bGLS
X 0 1"
p
!Q
d
n
) ! N 0,
1X
Avar (bGLS ) =
and is estimated by
\
Avar
(bGLS ) = s2GLS X 0
where
s2GLS =
=
(y
X bGLS )0 (y
n k
(y
XbGLS )0 1 (y
n k
X bGLS )
XbGLS )
0 1"
7.3. GLS
101
Exercise 47. (This may be skipped) Under GLS.1, SE and RS, prove that plim s2GLS =
2.
7.3.3. Feasible GLS. In general situations we may know the form of but not the
values taken on by its elements. Therefore to make GLS operational we need an estimate of
b Replacing by
b into (7.3.4) delivers the feasible GLS, henceforth FGLS:
, say .
b
bF GLS = X 0
b
X 0
y.
Since GLS is consistent, to know that bGLS and bF GLS are asymptotically equivalent, i.e.
plim (bF GLS
) ! N 0,
n (bF GLS
) and
n (bF GLS
totically equivalent, or
p
(7.3.6)
n (bF GLS
bGLS ) ! 0.
p
plim
plim
b 1"
X 0
p
n
b 1X
X 0
n
X 0 1"
p
n
X 0 1X
n
!
!
X 0 1X
n
plim
X 0 1"
p
n
= Q,
= q0 ,
= 0
= 0.
n (bF GLS
) be asymp-
7.3. GLS
102
where q0 is a finite vector, and the Slutsky Theorem (if g is a continuous function, then
plim [g (z)] = g [plim (z)], p. 1113) to verify that
b 1"
X 0
p
n
plim
b 1X
X 0
n
plim
X 0 1"
p
n
X 0 1X
n
= 0
= 0
Solution: Given
plim
and
plim
X 0 1X
n
b 1X
X 0
n
plim
then
X 0 1X
n
+ plim
=Q
X 0 1X
n
b 1X
X 0
n
= 0,
X 0 1X
n
= Q,
b 1X
X 0 1X
X 0
+
n
n
X 0 1X
n
b 1X
X 0
n
= plim
and
(7.3.9)
plim
b 1X
X 0
n
=Q
plim
b 1"
X 0
p
n
= q0 .
b 1" p
X 0
b
p
= n X 0
n
b
X 0
",
=Q
X 0 1X
n
1
b 1X
b
= X 0
X 0
bF GLS
1"
b 1X
X 0
n
X 0 1" p
p
= n X 0
n
and bGLS
X 0 1X
n
103
1
= X 0
1X
b 1" p
X 0
p
= n (bF GLS
n
X 0 1" p
p
= n (bGLS
n
X 0
1
",
X 0
1 ",
then
),
).
The last two equalities, along with the maintained conditions (7.3.7) and (7.3.8), the asympp
totic results (7.3.9) and (7.3.10) and the Slutsky Theorem, prove that both n (bGLS
)
p
and n (bF GLS
) converge in probability to the same limit, Q 1 q0 .
Conditions (7.3.7) and (7.3.8) must be verified on a case-by-case basis. Importantly, they
b is not consistent, as shown in the context of FGLS panel
may hold even in cases in which
data estimators by Prucha (1984).
n (b
) ! N 0,
0 d
(2) plim XnX = Q
(3) plim s2 =
2Q 1
2.
and consider the following lemma, referred to as the product rule. For more on this see
White (2001) p. 67 (notice that the product rule is not mentioned in Greene (2008), although
implicitly used for proving the asymptotic distributions of the tests).
104
Lemma 49. (The product rule) Let An be a sequence of random (l k) matrices and bn a
sequence of random (k 1) vectors such that plim (An ) = 0 and bn ! z. Then, plim (An bn ) =
d
0.
7.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis Ho :
o
k.
Ho
p
(7.4.1)
n (bk
q
o
k)
! N (0, 1)
2Q 1
kk
where (X 0 X)kk1
reformulated as
x0k M[X(k) ] xk
then
(7.4.2)
o
(bk
k)
t= q
,
s2 (X 0 X)kk1
p
n (bk
4
plim q
0
s2 XnX
80
<
1
plim @ q
0
:
s2 XnX
1
kk
o
k)
1
kk
2Q 1
kk
o
k)
,
1
kk
p
1
A
n (bk
q
n (bk
o
k)
2Q 1
kk
o
k) 5
9
=
;
where the second equality follows from the product rule, given that, by results 2-3 and the
Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim
converges in probability to zero and, by result 1., the second factor converges in distribution
to a normal random scalar. Hence, the two sequences in the plim of equation (7.4.2) are
105
asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given
(7.4.1), this proves that
o
(bk
k)
q
s2 (X 0 X)kk1
! N (0, 1) .
= q, where r is a non-
zero (k 1) vector of non-random constants and q is a non-random scalar. Using the same
approach as above it is possible to prove that
r0 (b
)
q
s2 r0 (X 0 X)
(7.4.3)
r0 Q
! N (0, 1) .
0
Exercise 50. (skip) Prove (7.4.3). Hint: by the Slutsky Theorem, plim r0 XnX
r =
1 r.
7.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis
Ho : R
(b
F =
R0
R (b
)
.
n (b
2 RQ 1 R0 .
) R
"
s R
X 0X
n
p
R n (b
).
(7.4.4)
= A1/2
A
1/2
1/2
p
R n (b
) ! N (0, IJ ) .
d
h
plim A
(7.4.5)
p
R n (b
h
plim A
1/2
1/2
106
p
R n (b
p
A 1/2 R n (b
1/2
where the second equality follows from the product rule given that
plim A
1/2
=A
1/2
) ! N (0, A) ,
d
by result 1. and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two
sequences in the left-hand-side plim of equation (7.4.5) have the same limiting distribution
and given (7.4.4), this proves that
A
Let w A
1/2 R
1/2
p
R n (b
) ! N (0, IJ ) .
d
n (b
w0 w !
(7.4.6)
But since
A
1/2
1/2
= A
"
(J) .
= s R
X 0X
n
then
w0 w =
p
p
n (b
p
)0 R0 A 1/2 A 1/2 R n (b
0 1 # 1
p
XX
2
s R
R0
R n (b
n
n (b
"
)0 R 0
) =
)
JF,
and so by (7.4.6)
JF
(J) .
107
CHAPTER 8
8.2. The Fixed Eect Model (or Least Squares Dummy Variables Model)
Consider the following panel data regression model expressed at the observation level, that
is for individual i = 1, ...N and time t = 1, ..., T :
(8.2.1)
where x0it = (x1it , ..., xkit ),
B
C
B . C
= B .. C
@
A
k
and i is a scalar denoting the time-constant, individual specific eect for the individual i.
108
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
109
Define djit as the value taken on by the dummy variable indicating individual j = 1, ..., N
at the observation (i, t) , that is
djit
8
< 1 if i = j, any t = 1, ..., T
=
.
: 0 if i 6= j, any t = 1, ..., T
B
B
B
B
B
yi = B
B
(T 1)
B
B
B
@
and
yi1
..
.
x0i1
..
.
C
B
C
B
C
B
C
B
C
B
B 0
yit C
C , Xi = B xit
(T k)
C
B .
.. C
B .
. C
B .
A
@
yiT
x0iT
8
< 1T
d ji =
: 0
T
C
C
C
C
C
C,
C
C
C
C
A
if i = j
B
B
B
B
B
"i = B
B
(T 1)
B
B
B
@
"i1
..
.
C
C
C
C
C
"it C
C
.. C
C
. C
A
"iT
if i 6= j
1T indicates the (T 1) vector of all unity elements and 0T the (T 1) vector of all zero
elements.
Stacking data by individuals, an even more compact representation of the regression model
(8.2.2), at the level of the whole data-set, is
(8.2.3)
y = X + D + ",
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
110
where
0
B
B
B
B
B
y =B
B
(N T 1)
B
B
B
@
y1
..
.
yi
..
.
yN
C
C
C
C
C
C,
C
C
C
C
A
X1
..
.
B
C
B
C
B
C
B
C
B
C
X =B
Xi C
B
C,
(N T k)
B . C
B . C
B . C
@
A
XN
B
B
B
B
B
" =B
B
(N T 1)
B
B
B
@
"1
..
.
"i
..
.
"N
C
C
C
C
C
C,
C
C
C
C
A
1
..
.
B
C
B
C
B
C
B
C
B
C
=B
i C
B
C
(N 1)
B . C
B . C
B . C
@
A
N
0T C
C
0T C
C
.. C
. C
C.
C
C
0T C
A
1T
or equivalently D = (d1 d2 ... dN ) . Under the following assumptions model (8.2.3) is a classical
regression model that includes individual dummies:
FE.1: The extended regressor matrix (X D) has f.c.r. Therefore, not only is X of f.c.r.,
but also none of its columns can be expressed as a linear combination of the dummy
variables, which boils down to saying that no column of X can be time-constant,
which in turn implies that X does not include the unity vector (indeed, there is a
constant term in model (8.2.3), but one that jumps across individuals).
FE.2: E ("|X) = 0. Hence, the variables in X are strictly exogenous with respect to ",
but the statistical relationship with is left completely unrestricted. Model (8.2.3),
therefore, automatically accommodates any form of omitted-variable bias due to the
omission of time constant regressors. Notice that D is taken as a non-random matrix,
therefore conditioning on (X D) or simply X is exactly the same.
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
FE.3: V ar ("|X) =
111
2
" IN T .
treatment of the topic, as in Arellano (2003) for example, but see also Section 8.7
(and Chapter 9, which can however be skipped for the exam).
Exercise 51. Prove that the following model with the constant term is an equivalent
reparametrization of Model (8.2.3):
(8.2.4)
where D
y = 1N T
0 + X + D
1
= (d2 ... dN ),
1N
1 1 ,
1
1
+ ",
and
1)
=@
y = d1 1 + X + D
1 ) 1N
1 1
+"
= 1N T or equivalently
d1 + D
1 1N 1
= 1N T .
1 1
= 1N T 1 + X + D
1 1
= 1N T
0 + X + D
1
1
+ ".
where
0 1 and
1N
1 1 .
(d1 + D
1 1N 1 1
+"
1 1N 1 ) 1
+"
1 1N 1 ) 1
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
112
Remark 52. Exercise 51 demonstrates that after the reparametrization the interpretation
of the
coecients is unchanged, the constant term is 1 and the coecients on the remaining
individual dummies are no longer the individual eects of the remaining individuals, i , i =
2, ..., N , but rather the contrasts of i with respect to 1 , i = 2, ..., N . Of course, the reference
individual must not necessarily be the first one in the data-set and can be freely chosen among
the N individuals by the researcher at her/his own convenience. In Stata this is implemented
by using regress followed by the dependent variable, the X regressors and N
1 dummy
X 0 M[D] y
and
aLSDV = D0 D
D0 (y
XbLSDV )
is the LSDV estimator for . As already mentioned, both are BLUEs, but while bLSDV
converges in probability to
to only when T ! 1. This discrepant large-sample behavior of bLSDV and aLSDV is due
to the fact that the dimension of increases as N increases, whereas that of
to k.
is kept fixed
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
113
D0 D
B 1/T
B
B 0
B
=B .
B ..
B
@
0
D0 =
1 0
TD
D0 D
B
B
B
B
B
B
B
B
1 0
Dz=B
B
B
B
B
B
B
B
@
1/T
..
.
..
0
..
.
0
1/T
C
C
C
C
C.
C
C
A
T
X
z1t C 0
C
t=1
C
C B
..
C B
.
C B
C B
T
X
C B
1
B
zit C
CB
T
C B
t=1
C B
..
C B
.
C @
C
T
C
X
A
1
z
Nt
T
z1
..
.
z1
..
.
zN
C
C
C
C
C
Cz
C .
C
C
C
A
t=1
D0 transforms it into an (N 1)
, where each mean is taken over the group of observations peculiar to the
vector of means, z
same individual and for this reason is said a group mean. Therefore,
i
aLSDVi = y
0i bLSDV ,
x
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
P[D] z = D D0 D
B
B
B
B
B
B
B
B
B
B
B
B
B
B
1 0
B
Dz=B
B
B
B
B
B
B
B
B
B
B
B
B
B
@
z1
..
.
z1
..
.
zi
..
.
zi
..
.
zN
..
.
zN
x
1i
... x
ki
114
. It is
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C.
C
C
C
C
C
C
C
C
C
C
C
C
C
A
obtain bLSDV by applying OLS to the model transformed in group-means deviations, that is
regressing M[D] y on M[D] X.
Exercise 55. Verify (in a couple of seconds...) that P[D] =
1
0
T DD .
2
"
2
"
X 0 M[D] X
M[D] XbLSDV :
s2LSDV =
e0LSDV eLSDV
.
NT N k
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
2
".
115
tell yourself BRAVO! I just give you a few hints. First, on noting that y is determined by
the right hand side of (8.2.3) prove that e = M[M[D] X ] M[D] ", then elaborate the conditional
mean of "0 M[D] M[M[D] X ] M[D] " using the trace operator as we did for s2 , finally apply the law
of iterated expectations.
It is not hard to verify (do it) that bLSDV can be obtained from the OLS regression of
model (8.2.3) transformed in group-means deviations (this transformation is referred to in the
panel-data literature as the within transformation)
(8.2.7)
The intuition is simple: since the group mean of any time constant element, as i , coincides
with the element itself, all time-constant elements in model (8.2.3) are wiped out, this also
explains why X cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (8.2.3),
making it look like almost as a classical LRM. In particular, it can be proved easily that
LRM.1-LRM.3 hold. Notice, however, that errors in the transformed model, M[D] ", have a
non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular, can
you derive it?). Specifically, the vector M[D] " presents within-group serial correlation, since
for each individual group there are only T
consequence, LRM.4 does not apply to model (8.2.7). All the same, bLSDV is BLUE. This
is true because the condition of Theorem 38 in Section 5.4 is met (if you have answered the
previous question on the covariance matrix of M[D] ", you should be able to verify also this
claim).
One should not conclude from the foregoing discussion that OLS on the within transformed
model (8.2.7) is a safe strategy. As in the Oaxacas pooled model of section 5.2, the fact that
the error covariance matrix is not spherical, presenting in this specific case within group serial
correlation, has bad consequences as far as standard error estimates are concerned. Indeed,
8.2. THE FIXED EFFECT MODEL (OR LEAST SQUARES DUMMY VARIABLES MODEL)
116
should we leave the econometric software free to treat model (8.2.7) as a classical LRM, and so
regress M[D] y on M[D] X, it would compute coecient estimates just fine, but would estimate
V ar (bLSDV |X) by s2 X 0 M[D] X
k) 6= s2LSDV , which is
biased since it uses a wrong degrees of freedom correction. The econometric software cannot
be aware that for each individual in the sample there are only T
1 linearly independent
demeaned errors and so, rather than dividing the residual sum of squares by N (T
1)
k, it
divides it by N T
k. The upshot is that standard errors estimated in this way needs rectifying
p
by multiplying each of them by the correction factor (N T k) / (N T N k).
An interesting assumption to test is that of the absence of individual heterogeneity, H0 :
1 = 2 = ... = N . Under the restriction implied by H0 , model (8.2.3) pools together all
data with no attention to the individual clustering and can be written as
y = X
(8.2.8)
+ ",
where
X = (1N T X) ,
=@
(8.2.9)
A.
X y
eP OLS = y
X bP OLS ,
F =
F (N
1, N T
k) .
117
If F does not reject H0 , POLS is a legitimate, more ecient than LSDV, estimation procedure.
If F rejects H0 , then POLS is biased and LSDV should be adopted.
Exercise 57. On reparametrizing the LSDV model as in Exercise 51, the hypothesis of no
B
C
B . C
= B .. C
@
A
k
and i is a scalar denoting the time-constant, individual specific eect for individual i. The
statistical properties of model (8.3.1) are dierent, though. Without loss of generality, write
i as i = 0 + ui and let
B
B
B
B
B
u =B
B
(N 1)
B
B
B
@
u1
..
.
ui
..
.
uN
C
C
C
C
C
C.
C
C
C
C
A
(8.3.2)
y = X
where
X = (1N T X) ,
=@
118
+ w,
(N T N )
2
" IN T ,
V ar (u|X ) =
2
u IN ,
2
" IN T
2
0
u DD.
This means that under RE.1-3 w, although homoskedastic, is non-diagonal and the POLS
estimator in (8.2.9) is unbiased (verify this) but not BLUE (unless
estimator for
RE
0
= X
y.
2
u
119
1.
2
u
1
DD0 ,
T
then
where
2
1
2
"
+T
2
u.
2
" IN T
2
" IN T
2
" M[D]
bGLS RE
2
" P[D]
2
" P[D]
2
u P[D]
+T
2
1 P[D] .
Therefore,
(8.3.3)
and
2
u P[D]
+T
= X
1
2
"
2
"
2
"
2 P[D]
1
2
1 P[D]
2
1
M[D] +
M[D] +
M[D] +
2
" M[D]
2 P[D]
1
1
2
"
M[D] +
2 P[D]
1
y.
1
2
"
M[D] +
2 P[D]
1
= IN T
bGLS RE
= X
RE
M[D] +
M[D] +
RE
"
1
2
"
2 P[D]
1
y.
"
M[D] +
P[D] = IN T
1 ) P[D] ,
"
"
1
120
P[D] .
pre-multiplies in quasi-mean deviations, or partial deviations, in the sense that it only removes
a portion of the group-mean from the variable. For this reason, the coecients on timeconstant variables are identified in the RE model: time-constant variables when premultiplied
by M[D] + ( " /
1 ) P[D]
"/ 1.
M[D] + ( " /
1 ) P[D]
y = M[D] + ( " /
1 ) P[D]
V ar
M[D] + ( " /
1 ) P[D]
w|X =
+ M[D] + ( " /
1 ) P[D]
2
" IN T .
RE ,
say bF GLS
RE ,
the one
that is actually implemented in econometric softwares, can be obtained through the method
by Swamy and Arora (1972). The estimator for
2
"
2
1
is obtained as follows.
Define the Between residual vector eB as
(8.3.7)
where bB =
eB = P[D] y
X P[D] X
P[D] X bB
regression of the group means of y on the group means of X . The resulting estimator, bB ,
is referred to in the panel data literature as the Between estimator1. Then, based on eB ,
1Technical note: I maintain that no column of X is either time-constant or already in group-mean deviations,
0
so that both bLSDV and bB are uniquely defined (in fact, with such an assumption X P[D] X and X 0 M[D] X
are both non-singular). Indeed, this is only made for simplicity, since it is possible to prove that s2B and s2LSDV
121
as
2
1
s2B =
e0B eB
.
N k 1
2
"
2
u.
+T
Same hint as for exercise 56: First, on noting that y is determined by the right hand
side of (8.3.2) prove that eB = M[P[D] X ] P[D] w, then elaborate the conditional mean of
w0 P[D] M[P[D] X ] P[D] w using the trace operator as we did for s2 , finally apply the law of
iterated expectations.
Solution. Replacing the formula of bB into the right hand side of equation (8.3.7) gives
eB =
=
=
P[D] X
P[D] X X P[D] X
X P[D] X
P[D] X X P[D] X
X P[D] P[D] y
0
X P[D]
P[D] X
+ P[D] w
X P[D] P[D] w
= M[P[D] X ] P[D] w.
Therefore,
122
e0B eB |X
= E tr M[P[D] X ] P[D] ww |X
2
1 P[D] ,
2
" M[D]
2
1 P[D] ,
= tr M[P[D] X ] P[D]
2
1 tr
M[P[D] X ] P[D]
1. Since
P[D] X X P[D] X
1.
X P[D] ,
tr P[D] X X P[D] X
tr Ik+1
k
2
1.
X P[D]
= X
RE ,
123
is
s2LSDV
M[D] +
P[D] X
s2B
s2LSDV
M[D] +
P[D] y.
s2B
e0LSDV eLSDV
|X
NT N k
2
"
(hint: follow the same steps as above, noticing that M[D] w = M[D] ".)
Exercise 63. Derive the formula for the subvector of bGLS
the
RE ,
say bGLS
RE ,
estimating
vector.
Solution. Simply apply the FWL Theorem to the GLS-transformed RE model in (8.3.6),
noticing that by the well-known properties of orthogonal projectors P[D] P[1N T ] = P[1N T ] (remember 1N T = D1N and so 1N T 2 R (D)) so that
"
M[D] +
"
1
P[D]
= M[D] +
2
"
2
1
P[D]
P[1N T ]
and eventually
bGLS
RE
M[D] +
2
"
2
1
P[D]
P[1N T ]
M[D] +
2
"
2
1
P[D]
P[1N T ]
y.
124
is needed. Notice that if you include all 100 dummies, then the constant term should be
removed by the noconstant option. Alternatively, you can leave it there and include N
dummies. While the bLSDV estimates remain unchanged, the coecient estimates on the
included dummies do not. The latter must now be thought of as contrasts with respect to
the constant estimate, which turns out to equal the individual eect estimate peculiar to
the individual excluded from the regression, who is therefore treated as the base individual.
Nothing is lost by choosing either identification strategy.
125
When N is large the foregoing regress strategy is not practical. The bLSDV estimator
can, then, be manually implemented by applying the within transformation, carrying out OLS
on the transformed model and then correct standard errors appropriately. Implementation of
bF GLS
RE
by hands is more tricky and one goes along the following steps: 1) get the two
variance components estimates from within and between regressions; 2) transform variables
(including the constant) in partial deviations and 3) apply OLS to the transformed variables.
Details can be found in a Stata do.file available on the learning space.
I recommend to use always the ocial xtreg command to implement the standard panel
data estimators in empirical applications, unless strictly necessary to do otherwise (for example, if I explicitly ask you to!).
126
statistical dierence between the two estimators should be not significantly dierent from zero
in large samples.
Hausman proves that, under RE.1-RE.3, such dierence can be measured by the statistics
H = (bLSDV
bF GLS
RE )
Avar (bLSDV\
bF GLS
RE )
(bLSDV
bF GLS
RE )
H !
d
(k) .
Hausman also provides a useful computational result. He shows that since bF GLS
RE
is
bF GLS
RE ,
bF GLS
RE )
= 0,
so
Acov (bLSDV
bF GLS
RE ,
bF GLS
RE )
RE )
RE )
= 0
and
Avar (bLSDV
bF GLS
RE )
= Avar (bLSDV )
RE ) .
Hence,
H = (bLSDV
bF GLS
h
0
)
Avar\
(bLSDV )
RE
\
Avar (b
F GLS
RE )
(bLSDV
bF GLS
RE ) .
Wooldridge (2010) (pp. 328-334) evidences two diculties with the Hausmans test.
First, Avar (bLSDV )
RE )
such as time dummies. Therefore, along with the coecients on time-constant variables, also
those on aggregate variables must be excluded from the Hausman statistics.
127
Second, and more importantly, if RE.3 fails, then, on the one hand, the asymptotic distribution of H is not standard even if RE.2 holds, so that H would be of little guidance in
detecting violations of RE.2, with an actual size that may be significantly dierent from the
nominal size. On the other hand, H is designed to detect violations of RE.2 and not RE.3.
In fact, if RE.2 holds both LSDV and FGLS-RE are consistent, regardless of RE.3, and H
converges in distribution rather then diverging, which means that the probability of rejecting
RE.3 when it is false does not tend to unity as N ! 1, making H inconsistent. The solution
is so to consider H as a test of RE.2 only, but in a version that is robust to violations of RE.3.
The approach I describe next is well suited to solve both diculties at once.
8.5.2. The Mundlaks test. Mundlak (1978) asks the following question. Is it possible
to find an estimator that is more ecient than LSDV within a framework that allows correlation between individual eects, taken as random variables, and X? To provide an answer,
he takes the move from model (8.2.3) and supposes that the individual eects are linearly
correlated with regressors according to the following equation
= 1N 0 + D0 D
h
with E (|X) = E | (D0 D)
D0 X + u
i
D0 X , and so E (u|X) = 0. Pre-multiplying both sides of the
foregoing equation by D and then replacing the right-hand side of the resulting equation into
(8.2.3) yields
(8.5.1)
y = 1N T 0 + X + P[D] X + Du + ",
which is evidently a RE model extended to the inclusion of the P[D] X regressors. Model (8.5.1)
springs up from a restriction in (8.2.3) and hence seems promising for more ecient estimates.
But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (8.2.3)
returns the LSDV estimator, bLSDV for the
coecients, bB
128
and b0B for the constant term 0 , where b0B and bB are the components of the between
estimator, bB , presented in Section 8.3.1.
To summarize Mundlaks results
The standard LSDV estimator for
RE estimator for
The standard FGLS-RE estimator in the RE model (equation (8.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model
(8.5.1) with constraints = 0.
Therefore, the validity of the RE model can be tested by applying a standard Wald test of
joint significance for the null hypothesis that = 0 in the context of Mundlaks equation
(8.5.1):
M = (bLSDV
Under H0 : = 0, M !
d
bB )0 Avar (b\
LSDV
bB )
(bLSDV
bB ) .
2 (k).
Hausman and Taylor (1981) proves that the statistics H and M are numerically identical
(for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust
to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial
correlation) using the standard robustness options available for regression commands in most
econometric packages. In addition, it is relatively easy to detect and solve singularity problems
in the context of regression-based tests.
8.5.3. Stata implementation. The Stata implementation of most results in this section
is demonstrated through a Stata do file available on the course learning space.
129
B X1
B ..
B .
B
B
X=B
B Xi
B .
B .
B .
@
XN
(8.6.1)
C
C
C
C
C
C,
C
C
C
C
A
with Xi indicating the (T k) block of observations peculiar to individual i = 1, ..., N. Similarly, observations in the (N T 1) vectors y and " are stacked by individuals.
The projection matrix M[D] projects onto the space orthogonal to that spanned by the
columns of the individual dummies matrix D and any conformable vector that is post-multiplied
to it gets transformed into group mean deviations. It is not hard to see that M[D] is a block
diagonal matrix, with blocks all equal to
M[1T ] = IT
1T 10T
.
T
So,
2
(8.6.2)
M[D]
6 M[1T ]
6
.
6 0
M[1T ] . .
6
=6
..
..
..
6
.
.
.
6
4
0
0
..
.
0
M[1T ]
7
7
7
7
7.
7
7
5
130
, is given by
X 0 M[D] y.
8.6.2. Large-sample properties of LSDV. Let V ar ("|X) = , where is an arbitrary and unknown p.d. matrix.
X 0 M[D] M[D] X
N
= lim E
N !1
X 0 M[D] M[D] X
N
Q , a positive def-
X M[D] X
LSDV.2: p limN !1
= Q a positive definite and finite matrix
N
Exercise 64. (This has been done in class) Prove that under LSDV.1 and LSDV.2
p limN !1 bLSDV = .
8.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS
and the following
0
..
.
2
..
.
..
.
..
.
0
0
..
.
131
7
7
7
7
7
0 7
7
5
N
is a block diagonal (N T N T ) positive definite matrix. Notice that the blocks of are
arbitrary and heterogenous, so that both arbitrary correlation across the time observations
of the same individual (referred to as within-group serial correlation) and heteroskedasticity
across individuals and over time are permitted. What is not permitted by the block-diagonal
structure is correlation of the " realizations across dierent individuals.
Now focus on the generic individual i = 1, ..., N and notice that, given the block-diagonal
form of M[D] as in (6.1.7),
2
6 M[1T ]
6
.
6 0
M[1T ] . .
6
M[D] X = 6
..
..
..
6
.
.
.
6
4
0
0
..
.
0
M[1T ]
X1
..
.
M[1T ] X1
..
.
B
C B
C B
7B
C B
7B
B
C B
7B
C B
7B
B
7 B Xi C
C = B M[1T ] Xi
7B
C
7 B .. C B
..
5B . C B
.
B
@
A @
XN
M[1T ] XN
1
C
C
C
C
C
C
C
C
C
C
A
The proof of asymptotic normality for bLSDV parallels that in 7.2.2 with the only dierence
that now the random objects whence we move are not k 1 vectors at the observation level
but k 1 vectors at the individual level, Xi0 M[1T ] "i , i = 1, ..., N .
First, by strict exogeneity E Xi0 M[1T ] "i = 0 and hence
V ar Xi0 M[1T ] "i |X
132
so that
V ar Xi0 M[1T ] "i = E Xi0 M[1T ] i M[1T ] Xi .
Then, averaging across individuals
N
1 X
V ar Xi0 M[1T ] "i
N
i=1
N
1 X
E Xi0 M[1T ] i M[1T ] Xi
N
i=1
0
X M[D] M[D] X
= E
.
N
Therefore,
0
N
X M[D] M[D] X
1 X
V ar Xi0 M[1T ] "i = lim E
Q ,
N !1 N
N !1
N
lim
i=1
which is a finite matrix by assumption LSDV.1, so that the Lindberg-Feller theorem applies
to yield
Finally, since
p
N
X 0 M[D] "
NX 0
Xi M[1T ] "i p
! N (0, Q ) .
d
N
N
i=1
N (bLSDV
p
X 0 M[D] X
N
N (bLSDV
X 0 M[D] "
p
!Q
d
N
) ! N 0, Q
d
Q Q
1X
0M
[D] "
Avar (bLSDV ) =
1
Q
N
Q Q
M[1T ] Xi bLSDV ,
133
i = 1, ..., N, a consistent estimator for the asymptotic covariance matrix of bLSDV in equation
(8.6.3) is given by the Whites estimator:
(8.7.1)
Avar\
(bLSDV ) = X 0 M[D] X
[D] X X 0 M[D] X
X 0 M[D] M
LSDV DD .
Remark 65. The estimator in (8.7.1) is robust to arbitrary heteroskedasticity and withingroup serial correlation. Stock and Watson (2008) prove that in the LSDV model the Whites
is a diagonal matrix with generic
estimator correcting for heteroskedasticity only, where
element e2LSDV,it (see the first formula of section 9.6.1 in Greene (2008)), is inconsistent for
N ! 1. The crux of Stock and Watsons argument is essentially algebraic, in that demeaned
residuals are correlated over time by construction and this correlation does not vanish for
N ! 1. The recommendation for practitioners is then to correct for both heteroskedasticity
and within-group serial correlation using the estimator (8.7.1), which is not aected by the
Stock and Watsons critique.
Remark 66. In Stata the robust covariance matrix of LSDV is computed easily by using
the xtreg command with the options fe and vce(cluster id), where id is the name of the
individual categorical variable in your Stata data set.
A similar correction can be carried out for POLS and FGLS-RE. For POLS we have
where
0
Avar\
bP OLS = X X
0
X 0 X
X X
and the POLS residual vector defined as in equation (8.2.10), whereas for FGLS-RE we have
Avar \
bF GLS
where
RE
0
b
= X
= eF GLS
0
b
X
134
1/2 b 1/2
0
RE eF GLS RE
0
b
X X
DD0 ,
RE
=y
X bF GLS
RE .
Remark 67. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE
is estimated by using, respectively, the regress and the xtreg, re commands, both with the
option vce(cluster id), as in the LSDV case.
PN
i=1 Ti .
The LSDV estimator is implemented without any problem either creating individual dummies
or taking variables in group-mean deviations, where group means are at the individual level.
135
The random eect estimator requires only some algebraic modifications in the formulas allowing for unbalancedness. Arellano estimator also requires simple modifications in notations to
accommodate unbalancedness: there is now a (Ti 1) LSDV residual vector given by
eLSDV,i = M[1T ] yi
i
M[1T ] xi bLSDV ,
i
CHAPTER 9
137
can occur along more than one dimension. In a student survey, for example, there could be an
additional level of clustering given by teachers, or classes, within schools. Similarly, patients
can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals.
In a cross-sectional data-set of bilateral trade-flows, the cross-sectional units are the pairs of
countries and these are naturally clustered along two dimensions: the first and the second
country in the pair (Cameron et al., 2011). ln matched-employers-employees data there is the
worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).
Is it possible to do inference that is robust to multi-way clustering as we do inference that is
robust to one-way clustering? A recent paper by Cameron et al. (2011) oers a computationally
simple solution extending the White estimator to multi-way contexts. In essence, their method
boils down to computing a number of one-way robust covariance estimators, that are then
combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial
for the accuracy of the multi-way estimator that the one-way estimators be also accurate,
and so that the data-set have dimensions with a large number of clusters. Such asymptotic
requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both
individual- and time-clustering in the typical micro-econometric panel data set, where T is
fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator have
good finite-sample properties in data-sets with dimensions of 100 clusters.
To illustrate the method I focus on two-way clustering, using a notation that is close to
that inCameron et al. (2011).
138
Survey of patients with, at least, moderately large numbers of doctors and hospitals
Bilateral trade-flows data with, at least, a moderately large number of countries.
Matched-employers-employees data with, at least, moderately large numbers of firms
and workers
For each dimension, it is known to which cluster a given observation i = 1, ..., n belongs. This
information is contained in the mappings g : {1, ..., n} ! {1, ..., G}
g (i) = [g 2 {1, ..., G} : observation i belongs to cluster g] , i = 1, ..., n.
and h : {1, ..., n} ! {1, ..., H}
h (i) = [h 2 {1, ..., H} : observation i belongs to cluster h] , i = 1, ..., n.
From the mappings g and h we can also construct the n G dummy variables matrix DG and
the n H dummy variables matrix DH , as the following definitions indicates
Definition 68. Let
8
< 1 if
dig =
: 0
dih
8
< 1 if
=
: 0
g (i) = g
else
h (i) = h
else
i 2 {1, ..., n}, h 2 {1, ..., H}. Then, DG and DH are the n G and n H matrices with (i, g)
element dig and (i, h) element dih , respectively.
Given g and h, we can define an intersection dimension, say G\H, such that each cluster in
G\H contains only observations that belong to one unique cluster in {1, ..., G} and one unique
cluster in {1, ..., H} . This yields the matrix of dummy variables DG\H . By construction, the
139
DG\H
B
B
B
B
B
B
B
=B
B
B
B
B
B
@
1 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 1 0
0 0 0 1
C
C
C
C
C
C
C
C.
C
C
C
C
C
A
This framework allows that in a survey of patients, for example, there could be more than
one patients admitted to the same hospital and under the assistance of the same doctor. Or,
similarly, that in a panel data matching workers with firms the same worker can move across
firms over time or that, conversely, the same firm may employ dierent workers over time.
0
S G has ijth entry equal to one if observations i and j share any cluster g in {1, ..., G}
; zero otherwise.
S H has ijth entry equal to one if observations i and j share any cluster h in {1, ..., H};
zero otherwise.
140
S G\H has ijth entry equal to one if observations i and j share any cluster g in
{1, ..., G} and any cluster h in {1, ..., H}; zero otherwise.
Also, the iith entries in S G , S H and S G\H equal one for all i = 1, ..., n, so the three indicator
matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering
yi, = x0i, + "i
i = 1, ..., n and let
B
B
B
B
B
"=B
B
B
B
B
@
"1
..
.
"i,
..
.
"n
C
C
C
C
C
C.
C
C
C
C
A
Assumptions LRM.1-LRM.3 hold. Assumption LRM.4 is here replaced with a more general
one permitting arbitrary heteroskedasticity and maintaining zero correlation only between
errors peculiar to observations that share no cluster in common. For example, the latent error
of patient i is not correlated to the latent error of patient j only if the two subjects are under
the assistance of dierent doctors, say g (i) 6= g (j), and in dierent hospitals, h (i) 6= h (j).
Formally,
LRM.4b: V ar ("|X) = E (""0 |X) with E ("i "j |X) = 0 unless g (i) = g (j) or
h (i) = h (j), i, j = 1, ..., n.
Importantly, LRM.4b can equivalently be expressed as
(9.2.1)
= E ""0 S G |X + E ""0 S H |X
E ""0 S G\H |X ,
where the symbol stands for the element-by-element matrix product (also known as Hadamard
product) between matrices with equal dimension (verify equivalence of LRM.4b and (9.2.1)).
141
As we know, OLS, in this case, are consistent and unbiased but not ecient. More importantly, OLS standard errors are biased, and so we need a two-way robust covariance estimator
for inference. The covariance estimator devised by Cameron et al. (2011) is the combination
of three one-way covariance estimators a la White. It is constructed along the following steps.
Carry out OLS, obtain the OLS residuals
ei,g(i),h(i) = yi,g(i),h(i)
x0i,g(i),h(i) b
\
Avar
(b) = X 0 X
GX X 0X
X 0
\
G = ee0 S G . Avar
where
(b) is a White estimator that is robust to clustering only along
the G dimension.
The second one-way covariance estimator is
\
Avar
(b)
\
H = ee0 S H . Avar
where
(b)
= X 0X
H X X 0X
X 0
the H dimension.
The third one-way covariance estimator is
\
Avar
(b)
G\H
= X 0X
G\H X X 0 X
X 0
\
G\H = ee0 S G\H . Avar
where
(b)
G\H
142
\
\
\
Avar
(b) = Avar
(b) + Avar
(b)
\
Avar
(b)
G\H
\
Avar
(b) is robust to clustering along both G and H dimensions and is the estimator that
is used to construct our robust tests.
\
Remark 69. Writing Avar
(b) as
\
Avar
(b) = X 0 X
G +
H
X0
G\H X X 0 X
and then considering equation (9.2.1) uncovers the analogy principle on which the two-way
robust covariance estimator rests.
\
Remark 70. Cameron et al. (2011) also present a general multi-way version of Avar
(b),
which is derived from a simple extension of the foregoing analysis. The additional cost is only
in terms of a more cumbersome notation. For the formulas I refer you to that paper.
9.3. Stata implementation
\
While there is no ocial command for the two-way Avar
(b) in Stata, it can be simply
implemented by means of three one-way OLS regressions. Suppose that in our data-set the
two categorical variables for dimensions G and H are called doctor and hospital. You can
\
assembleAvar
(b) along the following steps.
(1) Create the categorical variable for the intersection dimension, G \ H, through the
following instruction
egen doc hosp = group (doctor hospital)
where doc_hosp is a name of choice.
143
(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix
V_robust=V_d+V_h-V_dh (V_robust is a name of choice). To see the content of
V_robust do: matrix list V_robust.
1It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one
observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no
problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting
for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).
CHAPTER 10
10.1. Introduction
The conditional-mean-independence of " and x maintained by P.2 (Section 2.1) often fails
in economic structures, where some of the x variables are chosen by the economic subjects
and as such may depend on the latent factors at the equilibrium. These x variables are said
endogenous.
In economics, think of a production function, where (some of) the observable input quantities are under the firms control. The same consideration holds for the education variable
in a wage equation. These are all cases of omitted variable bias (Section 4.7), which makes
standard estimation techniques not usable.
As we have seen in Section 4.7.1, the proxy variables solution maintains that there is information external to the model that is able to fully explain the correlation between observed and
unobserved variables. For example observed IQ scores, clearly redundant in a wage equation
with latent ability, are an imperfect measure of latent ability, but the discrepancy between the
144
10.1. INTRODUCTION
145
two variables is likely to be unrelated with the individual education levels. Such information,
so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individual and/or over time and there is a paneldata set, the endogeneity problem is solved by applying the panel-data methods introduced
in Chapter 8. But not always panel data are available and even when they are, the disturbing
omitted factors may not meet the time-constancy requirement. For example, idiosyncratic
productivity shocks may well be related to input factors in the estimation of a production
function.
Neither proxy variables or panel data methods are generally usable when endogeneity
springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise
variable as a determinant of individual health, hinting for an endogeneity bias due to reverse
causality. If the exercise activity is indeed aected by the health status, exercise would depend
on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized method of moment (GMM) estimators oer
a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity
problem into two stages. The first stage attempts to identify the exogenous-variation components of the x, through a set of exogenous variables, some of which are external to the model,
said instrumental variables. The second stage applies regression analysis using only the firststage exogenous components as explanatory variables. IV and GMM methods are preferred
tools of econometric analysis, compared to alternative techniques, since often the first stage
can be justified on the ground of economic theory.
There are various IV GMM applications showing the methods of this chapter: iv.do using
mus06data.dta, IV_GMM_panel.do using costfn.dta and IV_GMM_DPD.do and abest.do both
using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.
146
10.2.1. The linear regression model. Consider the linear model of Chapter 1 and the
system of moment conditions (1.2.3)
E (xy) = E xx0
So, the true coecient vector,
= E (xx0 )
0
xi b = 0.
x i yi = X 0 X
i=1
Hence,
b =
n
X
xi xi
i=1
1 n
X
, b , will satisfy
X 0 y,
i=1
10.2.2. The Instrumental Variable (IV) regression model in the just identified
case. Consider the linear model of Chapter 1 but without assumption P.3, E ("|x) = 0, or
even the weaker P.3b, E (x") = 0. This means that some of the variables in x are potentially
endogenous, that is related in some way to ". Assume, instead, conditional mean independence
for a L 1 vector of variablesz, that is E ("|z) = 0, with L = k. The vector z is generally
dierent from x, if it is not then we are back to the classical regression model and there is no
147
endogeneity problem. Then, as before using the law of iterated expectations we have
E (z") = Ez [E (z"|z)]
= Ez [zE ("|z)]
= 0.
So, there are k moment conditions in the population
E z y
or equivalently
x0
=0
E (zy) = E zx0
So, the true coecient vector,
= E (zx0 )
0
xi b = 0.
zi y i = Z 0 X
i=1
Hence,
b
n
X
i=1
zi x i
1 n
X
, b , will satisfy
Z 0 y,
i=1
The intuition is straightforward: since the true coecients solve the population moment
conditions, if the sample moments provide good estimates for the population moments, then
one might expect that the estimator solving the sample moment conditions will provide good
estimates of the true coecients.
What if there are more moment conditions than unknown parameters, that is if L > k?
Then we turn to GMM estimation.
148
b
m
f wi , b
n
i=1
m b =0
hence there are L equations and k unknowns so that no estimator b can solve the system of
sample moment conditions. Instead, there exists a b that can make m b as close to zero as
possible:
GM M
= arg min Q b
b
0
where Q b m b Am b is a quadratic criterion function of the sample moments and
A is a positive definite matrix weighting the squares and the cross-products of the sample
moments in Q b .
Note that Q b
0 and since A is positive definite, Q b = 0 only if m b = 0.
Thus, Q b can be made exactly zero in the just identified case and is strictly greater than
zero in the over-identified case.
149
10.2.4. The TSLS estimator. It is not hard to prove that the well-known Two Stages
Least Squares estimator (TSLS) in the overidentified linear model belongs to the class of GMM
estimators. Consider the linear regression model of Section 10.2.2 with L > k instruments.
Then, there are the following population moments
m( ) E z y
x0
E z y
x0
=0
The L sample moments are collected into the (L 1) vector m b
n
1X
m b
zi y i
n
1
0
xi b Z 0 y
n
i=1
Xb
Suppose we choose a quadratic criterion function with the following weighting matrix
n
A
Then
1X 0
zi zi
n
1
Q b
y
n
i=1
= n Z 0Z
0
X b Z Z 0Z
Z0 y
Xb
@Q b
2
1 0
X 0Z Z 0Z
Z y Xb = 0
n
@b
that solved for b yield the TSLS estimator
or more compactly
T SLS
X 0Z Z 0Z
b
T SLS
Z 0X
X 0 P[Z] X
X 0Z Z 0Z
X 0 P[Z] y.
Z 0 y.
150
The estimator's name derives from the fact that it can be computed in two stages:
(1) Regress each column of X on Z using OLS to obtain the OLS fitted values of X: P[Z]X = Z(Z'Z)⁻¹Z'X. This splits X into an approximately exogenous component, P[Z]X, whose covariance with ε goes to zero as n → ∞, and a residual, potentially endogenous, component, M[Z]X. Only P[Z]X is used in the second stage.
(2) Regress y on the fitted values, P[Z]X, to obtain b_TSLS.
If the population moment conditions are true, then Q(b_TSLS) should not be significantly different from zero. This provides a test for the validity of the L − k over-identifying moment conditions, based on the statistic

S = nQ(b_TSLS),

which under the null is asymptotically distributed as χ²(L − k).
In the just identified case, L = k, the matrices X'Z and Z'X are square and (given the rank conditions) invertible, so the TSLS estimator collapses to the IV estimator of Section 10.2.2:

b_TSLS = [X'Z(Z'Z)⁻¹Z'X]⁻¹ X'Z(Z'Z)⁻¹Z'y
       = (Z'X)⁻¹(Z'Z)(X'Z)⁻¹ X'Z(Z'Z)⁻¹Z'y
       = (Z'X)⁻¹Z'y.
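In Stata, TSLS and the test of the over-identifying restrictions are available through ivregress. A minimal sketch, where y, x1, the endogenous x2 and the instruments z1, z2 are hypothetical names:

    * TSLS with one endogenous regressor (x2) and two instruments (z1, z2)
    ivregress 2sls y x1 (x2 = z1 z2)
    * Sargan-type test of the L - k over-identifying restrictions
    estat overid

In the just identified case (one instrument for x2) there are no over-identifying restrictions for estat overid to test.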
10.4.1. Choosing the weighting matrix Λ̂. The weighting matrix in the optimal two-step GMM estimator is

(10.4.1) A = [Z'Λ̂Z/n]⁻¹

(see Hansen 1982): A is a consistent estimate of the inverse of Var[zi(yi − xi'β)]. Choice of Λ̂:

If ε is homoskedastic and independent, then Λ̂ = I (the resulting GMM estimator collapses to TSLS). It is implemented through the ivregress gmm option wmatrix(unadjusted).

If ε is heteroskedastic and independent, then Λ̂ is a diagonal matrix with generic diagonal element equal to the squared residual from some one-step consistent estimator, the TSLS for example:

Λ̂ = diag(e1², e2², ..., en²), with ei = yi − xi'b_TSLS.

If errors are correlated within clusters, then Λ̂ is block-diagonal,

Λ̂ = diag(e1e1', e2e2', ..., eNeN'),

where ei is not a scalar: it is the vector of residual observations peculiar to cluster i = 1, ..., N. It is implemented through the ivregress gmm option wmatrix(cluster clustervar).
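As a sketch, the foregoing choices of Λ̂ map into ivregress gmm calls as follows (variable and cluster names hypothetical):

    * homoskedastic weighting: two-step GMM collapses to TSLS
    ivregress gmm y x1 (x2 = z1 z2), wmatrix(unadjusted)
    * heteroskedasticity-consistent weighting (squared one-step residuals)
    ivregress gmm y x1 (x2 = z1 z2), wmatrix(robust)
    * cluster weighting: blocks of residual cross-products within firm_id
    ivregress gmm y x1 (x2 = z1 z2), wmatrix(cluster firm_id)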
10.4.2. Iterative GMM. The GMM procedure can be iterated by adding the option
igmm. The resulting estimator is asymptotically equivalent to the two-step estimator. However,
Hall (2005) suggests that it may have a better finite-sample performance.
The corresponding robust variance estimator for TSLS takes the sandwich form

Var(b_TSLS) = (X'P[Z]X)⁻¹ X'P[Z]Λ̂P[Z]X (X'P[Z]X)⁻¹.
The H test is not robust to heteroskedastic and clustered errors, though. Wu suggests an alternative. But before that, do this exercise, which will prove useful in the derivations below.

Exercise 72. Let X = (X1 X2), where X2 collects the k2 potentially endogenous regressors, and let Z = (X1 Z1) be the instrument matrix. Prove that the TSLS estimator of β2 is

b_{2,TSLS} = (X2'P[M[X1]Z1]X2)⁻¹ X2'P[M[X1]Z1]y.
The DWH test provides a robust version of the H test. It maintains instrument validity, E(ε|Z) = 0, and is based on the so-called control-function approach, which recasts the endogeneity problem as a misspecification problem affecting the structural equation

(10.6.1) y = Xβ + ε = Xβ + Vρ + u,

where X = (X1 X2), β = (β1' β2')', E(u|X, V) = 0 and V is the (n × k2) matrix of the errors in the first-stage equations of the variables X2. As such, V is responsible for the endogeneity of X2. Replacing V in (10.6.1) with the residuals from the first-stage regressions, V̂ = M[Z]X2, makes the DWH test operational as a simple test of joint significance for ρ in the auxiliary OLS regression

(10.6.2) y = Xβ + M[Z]X2 ρ + u*.
The test works well since under the alternative of ρ ≠ 0 OLS estimation of the auxiliary regression yields the TSLS estimator. This is proved as follows. Since the exogenous regressors X1 are part of Z, M[Z]X1 = 0 and M[Z] = M[X1] − P[M[X1]Z1], so the auxiliary regression can be rewritten as

y = P[Z]Xβ + M[Z]X2(β2 + ρ) + u*,

and since P[Z]X and M[Z]X2 are orthogonal the FWL Theorem assures that the OLS estimator for β is

b_TSLS = (X'P[Z]X)⁻¹ X'P[Z]y,

and also

(β2 + ρ)-hat = (X2'M[Z]X2)⁻¹X2'M[Z]y
             = (X2'M[Z]X2)⁻¹(X2'M[X1]y − X2'P[M[X1]Z1]y)
             = K b_{2,OLS} + (I − K) b_{2,TSLS},

with K ≡ (X2'M[Z]X2)⁻¹X2'M[X1]X2 and where the last equation follows from Exercise 72. Therefore

ρ̂ = (β2 + ρ)-hat − b_{2,TSLS} = K b_{2,OLS} + (I − K) b_{2,TSLS} − b_{2,TSLS} = K(b_{2,OLS} − b_{2,TSLS}),

proving that the test indeed follows the Hausman test general principle of assessing the distance between an efficient estimator and a consistent but inefficient estimator under the null hypothesis.
One great advantage of the DWH test over a conventional Hausman test is that it can
be easily robustified for heteroskedasticity and/or clustered errors by estimating (10.6.2) with
regress and a suitable robust option, vce(cluster clustervar ) for example.
DWH can be immediately implemented in Stata through the ivregress postestimation
command estat endogenous.
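The control-function regression (10.6.2) can also be run by hand, which makes the robustification explicit. A minimal sketch, with x2 the only endogenous variable and all names hypothetical:

    * first stage: regress x2 on the exogenous variables and the instruments
    regress x2 x1 z1 z2
    predict double v2, residuals          // v2 is M[Z]x2
    * auxiliary regression (10.6.2) with cluster-robust standard errors
    regress y x1 x2 v2, vce(cluster clustervar)
    * DWH: joint significance of the first-stage residuals
    test v2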
Consider, finally, the following simulation design with a binary endogenous regressor:

yi = x1i β1 + x2i β2 + εi,
x2i* = x1i π1 + zi π2 + νi,
x2i = 1 if x2i* > 0, x2i = 0 otherwise,

with

(εi, νi)' ~ N(0, Σ), Σ = (σε²  σεν ; σεν  σν²).
10.9.1. Three-Stage Least Squares. It is a system estimator including structural equations for all endogenous variables. Identification is ensured by standard (sufficient) rank and (necessary) order conditions. It is seldom used, as it is inconsistent in the presence of heteroskedastic errors, which is the norm in most micro applications. The Stata command is reg3.
10.10. Dynamic panel data models

Consider the AR(1) panel data model

(10.10.1) yit = α + γ yi,t−1 + εit,

t = 1, ..., T, i = 1, ..., N. Model (10.10.1) can be easily extended to allow for time invariant individual terms:

(10.10.2) yit = γ yi,t−1 + ηi + εit,

t = 1, ..., T, i = 1, ..., N. In vector notation, stacking time observations for each individual,

yi = γ y−1,i + ηi 1T + εi,

i = 1, ..., N, where
yi = (yi1, ..., yit, ..., yiT)', y−1,i = (yi0, ..., yi,t−1, ..., yi,T−1)' and εi = (εi1, ..., εit, ..., εiT)' are (T × 1) vectors.
Notice that for each individual there are T + 1 observations available in the data set, from yi0 to yiT, but only T are usable since one is lost to taking lags. The problem here is that y−1,i is not strictly exogenous: yi,t−1 = f(yi0, εi1, εi2, ..., εi,t−1) (exercise: can you work out the exact expression of the right-hand side of yi,t−1 = f(yi0, εi1, εi2, ..., εi,t−1)?), so the strict exogeneity conditions E(εit|y−1,i, ηi) = 0 fail. Nonetheless, we may maintain conditional mean independence between εi,t and the t-th and more remote lags of y, say yi^{t−1} = (yi,t−1, ..., yi0)', as in Arellano (2003). More formally, we maintain throughout (see Arellano (2003) for a discussion)

A.1: E(εit | yi^{t−1}, ηi) = 0, t = 1, ..., T.

Assumption A.1 is also considered in Wooldridge (chapter 11, 2010), where it is referred to as sequential exogeneity conditional on the unobserved effect. It may be convenient sometimes to maintain also the following (sequential) conditional homoskedasticity assumption

A.2: E(εit² | yi^{t−1}, ηi) = σ².

It is not hard to prove that Equation (10.10.2) and Assumption A.1 imply the following (prove it using the LIE and εi,t−j = yi,t−j − γ yi,t−j−1 − ηi)

A.3: E(εit εi,t−j | yi^{t−1}, ηi) = 0, j = 1, ..., t − 1.
Under the foregoing assumptions the LSDV estimator, γ̂_LSDV, is inconsistent for N → ∞. Nickell (1981) was the first to derive the inconsistency. Given

γ̂_LSDV = γ + [(NT)⁻¹ Σ_i Σ_t (yi,t−1 − ȳ−1,i)(εit − ε̄i)] / [(NT)⁻¹ Σ_i Σ_t (yi,t−1 − ȳ−1,i)²],

where ȳ−1,i and ε̄i denote individual time averages, he showed that

plim_{N→∞} (NT)⁻¹ Σ_i Σ_t (yi,t−1 − ȳ−1,i)(εit − ε̄i) = E[T⁻¹ Σ_{t=1}^T (yi,t−1 − ȳ−1,i)(εit − ε̄i)]
= −(σ²/T²) [(T − 1) − Tγ + γ^T] / (1 − γ)² ≠ 0.
Hence, the bias vanishes for T ! 1, but it does not for N ! 1 and T fixed. For this reason,
the LSDV estimator is inaccurate in panel data sets with large N and small T and is said to
be semi-inconsistent (see also Sevestre and Trognon, 1996).
Since Nickell (1981) a number of consistent IV and GMM estimators have been proposed in the econometric literature as an alternative to LSDV. Anderson and Hsiao (1982) suggest two simple IV estimators that, upon transforming the model in first differences to eliminate the unobserved individual heterogeneity, use the second lags of the dependent variable, either differenced or in levels, as an instrument for the differenced one-time lagged dependent variable. Arellano and Bond (1991) propose a GMM estimator for the first-differenced model which exploits all available lags of the dependent variable in levels as instruments. Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses only linear moment restrictions, suggest a set of nonlinear restrictions that may be used in addition to the linear ones to obtain more efficient estimates. Blundell and Bond (1998) observe that with highly persistent data first-differenced IV or GMM estimators may suffer from a severe small sample bias due to weak instruments. As a solution, they suggest a system GMM estimator with first-differenced instruments for the equation in levels and instruments in levels for the first-differenced equation. Some of the foregoing methods are nowadays very popular and are surveyed below.
10.10.2. The Anderson and Hsiao IV Estimator. One typical solution is to take model (10.10.2) in first differences to eliminate the individual effects:

(10.10.3) yit − yi,t−1 = γ(yi,t−1 − yi,t−2) + εit − εi,t−1.

This makes the disturbances MA(1) with unit root, and so induces correlation between the lagged endogenous variable and the disturbances. This problem can be solved by using instruments for Δyi,t−1: either yi,t−2 or Δyi,t−2, both of which are uncorrelated with εit − εi,t−1. The resulting IV estimators are consistent, but generally not optimal and with a high root mean squared error in applications.
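Since (10.10.3) is a linear IV model, the Anderson and Hsiao estimators can be sketched with ivregress and time-series operators (panel identifiers id, year and variable y are hypothetical):

    xtset id year
    * instrument the lagged difference with the second lag in levels ...
    ivregress 2sls D.y (LD.y = L2.y), vce(cluster id)
    * ... or with the second lag in differences
    ivregress 2sls D.y (LD.y = L2D.y), vce(cluster id)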
10.10.3. The Arellano and Bond GMM estimator. Arellano and Bond (1991) propose a more efficient estimator, using a larger set of instruments. Given (10.10.2), yi1 = f(ηi, εi1), yi2 = f(ηi, εi1, εi2), ..., yit = f(ηi, εi1, εi2, ..., εit), and after differencing the first useable period in the sample is t = 2:

yi2 − yi1 = γ(yi1 − yi0) + εi2 − εi1.

One finds that the value yi0 is a valid instrument for (yi1 − yi0), since E[yi0(εi2 − εi1)] = 0. For the next period,

yi3 − yi2 = γ(yi2 − yi1) + εi3 − εi2,

both yi0 and yi1 are valid instruments for (yi2 − yi1). This approach adds an extra valid instrument with each forward period, so that in the last period T the instrument set is (yi0, yi1, yi2, ..., yi,T−2). The instrument matrix for individual i is therefore the block-diagonal ((T − 1) × L) matrix

Zi = diag[(yi0), (yi0, yi1), ..., (yi0, yi1, ..., yi,T−2)],

and the full instrument matrix is Z = (Z1', Z2', ..., ZN')',
where L = T(T − 1)/2. In first differences the model reads Δy = γΔy−1 + Δε, where Δy, Δy−1 and Δε are (N(T − 1) × 1) vectors, so Z is a (N(T − 1) × L) matrix. The instrumental variables satisfy for each individual i the (L × 1) vector of population moment conditions m(γ) ≡ E[Zi'Δεi] = 0, where Δεi stands for the i-th block of Δε. We can define the (L × 1) vector of sample moment conditions

m(g) = N⁻¹ Z'Δε(g) = N⁻¹ Z'(Δy − gΔy−1),
where, according to what was seen in Subsection 10.4.1, A is a consistent estimator of the inverse of

(10.10.5) Var(Zi'Δεi) = σ² E(Zi'GZi),

with G the ((T − 1) × (T − 1)) tridiagonal matrix

    G = [  2  -1   0  ...   0 ]
        [ -1   2  -1  ...   0 ]
        [  0  -1   2  ...   0 ]
        [ ...          ... -1 ]
        [  0   0  ...  -1   2 ],

that is, the covariance matrix of the first-differenced errors up to the scale factor σ². Hence

A = [N⁻¹ Σ_{i=1}^N Zi'GZi]⁻¹.
Substituting A in (10.10.4) yields the one-step Arellano and Bond estimator

γ̂1 = [Δy−1'Z (Σ_i Zi'GZi)⁻¹ Z'Δy−1]⁻¹ Δy−1'Z (Σ_i Zi'GZi)⁻¹ Z'Δy.

If εit is heteroskedastic, the optimal weighting matrix can be estimated by

A = [N⁻¹ Σ_{i=1}^N Zi'Δê1i Δê1i'Zi]⁻¹,

where Δê1i = Δyi − γ̂1 Δy−1,i are the one-step residuals. This yields the two-step estimator:

γ̂2 = [Δy−1'Z (Σ_i Zi'Δê1i Δê1i'Zi)⁻¹ Z'Δy−1]⁻¹ Δy−1'Z (Σ_i Zi'Δê1i Δê1i'Zi)⁻¹ Z'Δy.
To test instrument validity one can apply the Hansen-Sargan test of overidentifying restrictions:

S = (Σ_i Zi'Δê2i)' (Σ_i Zi'Δê2i Δê2i'Zi)⁻¹ (Σ_i Zi'Δê2i),

where Δê2i = Δyi − γ̂2 Δy−1,i are the two-step residuals; under the null of valid moment conditions S is asymptotically distributed as a χ² with L − 1 degrees of freedom.

A second specification test suggested by Arellano and Bond (1991) tests for lack of AR(2) correlation in the first-differenced residuals Δê1 or Δê2.
Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full instrument set has zero power when T, and consequently the number of moment conditions, becomes too large for a given N.
10.10.4. Blundell and Bond (1998) System estimator. Blundell and Bond (1998) demonstrate that in the presence of highly persistent data (γ close to unity) the lagged levels of y are only weakly correlated with the subsequent first differences, so that the Arellano and Bond instruments become weak. This is easily seen by considering the following example taken from Blundell and Bond. Let T = 2; then after taking the model in first differences there is only a cross-section available for estimation:

Δyi,2 = γΔyi,1 + Δεi,2, i = 1, ..., N,

and only one moment condition,

N⁻¹ Σ_{i=1}^N (Δyi,2 − gΔyi,1) yi,0 = 0.

To what extent is yi,0 related to Δyi,1? To answer this question it suffices to work out the reduced form for Δyi,1:

Δyi,1 = (γ − 1) yi,0 + ηi + εi,1,

from which it is clear that the closer γ is to unity, the weaker the correlation between yi,0 and Δyi,1.
To solve the problem they suggest exploiting the following additional moment restrictions

(10.10.6) E[(yi,t − γ yi,t−1) Δyi,t−1] = 0, t = 2, ..., T,

which are valid if, along with Assumption A.1, we maintain that the process for yi,t is mean-stationary, that is

A.4: E(yi,0|ηi) = ηi/(1 − γ).

Assumption A.4 is justified if the process started in the distant past. Starting from the model at observation t = 0 and going backward in time recursively,

yi,0 = γ yi,−1 + ηi + εi,0
     = γ(γ yi,−2 + ηi + εi,−1) + ηi + εi,0
     = γ² yi,−2 + (1 + γ)ηi + γεi,−1 + εi,0
     = ...
     = ηi/(1 − γ) + Σ_{τ=0}^{∞} γ^τ εi,−τ
     ≡ ηi/(1 − γ) + ui,0.

To verify that A.4 makes (10.10.6) valid, consider t = 2:

E[(yi,2 − γ yi,1) Δyi,1] = E[(ηi + εi,2)[(γ − 1) yi,0 + ηi + εi,1]]
= E[(ηi + εi,2)[(γ − 1)(ηi/(1 − γ) + ui,0) + ηi + εi,1]]
= E[(ηi + εi,2)[(γ − 1) ui,0 + εi,1]]
= 0,

where the third equality uses (γ − 1)ηi/(1 − γ) + ηi = 0 and the last follows since ui,0 and εi,1 predate εi,2 and are mean-independent of ηi.
Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instruments in first differences for the equation in levels.

Hahn (1999) evaluates the efficiency gains brought by exploiting the stationarity of the initial condition as done by Blundell and Bond, finding that they are substantial also for large T.

Stata's xtabond performs the Arellano and Bond GMM estimator. Then there is xtdpdsys, which implements the GMM system estimator. Third, xtdpd is a more general command that allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2 (Roodman 2009) is certainly the most powerful code in Stata to implement dynamic panel data models.
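A minimal sketch of difference and system GMM on a generic AR(1) panel model (identifiers and variable names hypothetical; the exact option set depends on the application):

    xtset id year
    * Arellano-Bond difference GMM, one- and two-step
    xtabond y, lags(1)
    xtabond y, lags(1) twostep vce(robust)
    * Blundell-Bond system GMM
    xtdpdsys y, lags(1) twostep
    * the same difference GMM with the user-written xtabond2
    xtabond2 y L.y, gmm(L.y) noleveleq robust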
10.10.5. Application. Arellano and Bond (1991) showcase their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler versions of their model through differenced and system GMM using xtabond and xtabond2. The do-file abbest.do by D. M. Roodman replicates exactly the Arellano and Bond results using xtabond2.
10.10.6. Bias corrected LSDV. IV and GMM estimators in dynamic panel data models are consistent for N large, so they can be severely biased and imprecise in panel data with a small number of cross-sectional units. This certainly applies to most macro panels, but also to micro panels where heterogeneity concerns force the researcher to restrict estimation to small subsamples of individuals.

Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999) demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV and GMM estimators. So, an alternative approach based upon the correction of LSDV for the finite sample bias has recently become popular in the econometric literature. Kiviet (1995) uses higher-order asymptotic expansion techniques to approximate the small sample bias of the LSDV estimator, including terms of at most order 1/(NT). Monte Carlo evidence therein shows that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence by Judson and Owen (1999) strongly supports LSDVC when N is small, as in most macro panels. In Kiviet (1999) the bias approximation is made more accurate by including terms of higher order. Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).

Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the new user-written Stata command xtlsdvc to implement LSDVC.
Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
10.10.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the xtabond and xtabond2 applications of Subsection 10.10.5, here we do not use all the information available to estimate the parameters of the labour demand equation in abdata.dta. Instead, we follow a strategy that, exploiting the industry partition of the cross-sectional dimension as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily accomplished by restricting the usable data to the panel of firms belonging to a given industry. While such a strategy leads to a less restrictive specification for the firm labour demand, it causes a reduced number of cross-sectional units for use in estimation, so the researcher must be prepared to deal with a potentially severe small sample bias in any of the industry regressions. Clearly, xtlsdvc is the appropriate solution in this case.

The demonstration is kept as simple as possible, considering regressions for only one industry panel, ind=4. The following instructions are implemented in a Stata do-file.
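A minimal sketch of the industry-specific LSDVC estimation (options as in Bruno 2005b; the regressors loosely follow the abdata.dta employment equation, so treat the exact specification as illustrative):

    use abdata.dta, clear
    xtset id year
    * bias-corrected LSDV for firms in industry 4, initialized by Arellano-Bond;
    * bias approximation up to third order, bootstrap variance, 100 replications
    xtlsdvc n w k if ind==4, initial(ab) bias(3) vcov(100)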
Part 2
Non-linear models
CHAPTER 11
The nonlinear least squares estimator minimizes the criterion

Σ_{i=1}^n [yi − μ(xi, b)]².
(11.3.1) E(y|x) = exp(x'β).

Or, equivalently,

y = exp(x'β) + u,

where u = y − E(y|x). Equation (11.3.1) implies E[y − exp(x'β)|x] = 0 and, by the Law of Iterated Expectations, there are zero covariances between u and x:

(11.3.2) E_{y,x}[x(y − exp(x'β))] = 0.

The method-of-moments (GMM) estimator, b_GMM, solves the sample analogue of (11.3.2), a system of k equations in k unknowns:

(11.3.3) Σ_{i=1}^n xi[yi − exp(xi'b_GMM)] = 0.

Assume now that y|x is distributed as a Poisson random variable with mean λ = exp(x'β),

f(y|x) = e^{−λ} λ^y / y!.

Importantly, the Poisson model has the equidispersion property: Var(y) = E(y) = λ.
The log-likelihood function is then

lnL = Σ_{i=1}^n [−exp(xi'β) + yi xi'β − ln(yi!)],

and the conventional ML variance estimator is

(11.3.4) V(b_ML) = [Σ_{i=1}^n λ̂i xi xi']⁻¹, with λ̂i = exp(xi'b_ML).
It is easily seen that the k first order conditions that maximize lnL coincide with the equations in (11.3.3), so that b_ML = b_GMM. This proves two things: 1) the GMM estimator is asymptotically efficient if the conditional mean function is correctly specified and the density function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the correct density function, as long as the conditional mean is correctly specified. In such cases, when the likelihood function is not correctly specified, we refer to the ML estimator as a pseudo ML estimator, and a robust covariance matrix estimator should be used for inference rather than (11.3.4):

V_rob(b_ML) = [Σ_i λ̂i xi xi']⁻¹ [Σ_i (yi − λ̂i)² xi xi'] [Σ_i λ̂i xi xi']⁻¹.

Note that under equidispersion (yi − λ̂i)² is on average close to λ̂i, so that V(b_ML) is close to V_rob(b_ML).
The consistency result for the (pseudo) ML estimator holds in general if two conditions are verified:
(1) the conditional mean is correctly specified;
(2) the density function belongs to an exponential family.

Definition 73. An exponential family of distributions is one whose conditional log-likelihood function at a generic observation is of the form

lnL(y|x, θ) = a(y) + b[μ(x, θ)] + y c[μ(x, θ)].

A member of the family is identified by the specific forms of a, b and c. In the Poisson regression model, with μ(x, θ) = exp(x'β): a(y) = −ln(y!), b[μ(x, θ)] = −exp(x'β) and y c[μ(x, θ)] = y x'β.

The Normal distribution with a known variance σ² is another member:

f(y|x, θ) = (1/(σ√(2π))) exp{−[y − μ(x, θ)]²/(2σ²)},

with a(y) = −ln(σ√(2π)) − y²/(2σ²), b[μ(x, θ)] = −μ(x, θ)²/(2σ²) and y c[μ(x, θ)] = y μ(x, θ)/σ².
The Stata command that implements Poisson regression is poisson, with a syntax close to regress. It computes b_ML with standard error estimates obtained by V(b_ML). If the vce(robust) option is given, then Stata recognizes the more robust pseudo-ML set-up and still provides the b_ML coefficient estimates, but with the robust covariance matrix V_rob(b_ML).
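A minimal sketch (hypothetical count variable y and regressors):

    * Poisson regression, classical ML standard errors
    poisson y x1 x2
    * pseudo-ML: same coefficients, robust (sandwich) covariance matrix
    poisson y x1 x2, vce(robust)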
Suppose that, given x and an unobserved heterogeneity term ν > 0 with E(ν|x) = 1 and Var(ν|x) = η², y is Poisson distributed with mean λν:

f(y|x, ν) = e^{−λν}(λν)^y / y!.

By iterated expectations,

E(y) = E[E(y|ν)] = λ

and

Var(y) = E[Var(y|ν)] + Var[E(y|ν)] = E[λν] + Var[λν] = λ + η²λ² = (1 + η²λ)λ > λ.

The marginal density of y is f(y) = ∫ e^{−λν}(λν)^y / y! g(ν)dν. To find it in closed form we need to specify the marginal density function for ν. If ν ~ Gamma(1, η²), then f(y) is a negative binomial density function, NB(λ, η²), with E(y) = λ and Var(y) = (1 + η²λ)λ. Clearly, if η² = 0 the NB model collapses to the Poisson.

Specifying λ = exp(x'β) yields the NB regression model, and β and η² are estimated via ML based on NB(exp(x'β), η²). Testing for overdispersion within this framework boils down to testing the null hypothesis η² = 0.

The Stata command that implements the NB regression is nbreg, with a syntax close to regress and poisson. The output also gives the overdispersion (LR) test of η² = 0.
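A minimal sketch (hypothetical variables; note that Stata's output labels the overdispersion parameter alpha):

    * NB regression; the footer reports the LR test of alpha = 0
    nbreg y x1 x2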
Alternatively, one can test the null of equidispersion (η² = 0, therefore Poisson) against the alternative of overdispersion (Var(y|x) > E(y|x), therefore NB regression) using a Lagrange Multiplier test. This is based on an auxiliary regression implemented after poisson estimation using an estimate of [Var(y|x)/λ] − 1, namely [(yi − λ̂i)² − yi]/λ̂i, as the dependent variable.
CHAPTER 12
Let A denote the event of interest and define

(12.2.1) y = 1(A).

Then Pr(y = 1) = Pr(A) ≡ p and Pr(y = 0) = 1 − p, so that

E(y) = p and Var(y) = p(1 − p).

The binary regression model specifies the response probability conditional on x,

(12.2.2) E(y|x) = Pr(y = 1|x) = F(x'β),

or, equivalently,

(12.2.3) y = F(x'β) + u,

where u = y − E(y|x).
12.2.1. Latent regression. When F(·) is a distribution function the binary model can be motivated as a latent regression model. In microeconomics this is a convenient way to model individual choices.

Introduce the latent continuous random variable y* with

(12.2.4) y* = x'β + ε,

and let ε be a zero mean random variable that is independent from x and with ε ~ F, where F is a distribution function that is symmetric around zero. Then, let y = 1(y* > 0). In our example of immigrant identity we may think of y* as the utility variation faced by a subject with observable and latent characteristics x and ε, respectively, when he/she decides to conform to the host-country culture, so that event A occurs if and only if y* > 0. Then,

(12.2.5) y = 1(y* > 0) = 1(ε > −x'β) ⟹ Pr(y = 1|x) = Pr(ε > −x'β | x).

Since ε and x are independent, and by the symmetry of F, Pr(ε > −x'β) = 1 − F(−x'β) = F(x'β). Because the scale of ε is not identified (only β/σ is), a normalization is required: in the probit model ε is standard normal, so σ is fixed to 1; in the logit model ε is standard logistic, so σ² must be fixed to π²/3.
The ML estimator maximizes the log-likelihood

lnL = Σ_{i=1}^n yi ln F(xi'β) + (1 − yi) ln[1 − F(xi'β)].

We have b_ML →p β, with estimated asymptotic variance

(12.2.6) V(b_ML) = {Σ_{i=1}^n f(xi'b_ML)² xi xi' / (F(xi'b_ML)[1 − F(xi'b_ML)])}⁻¹.

If the latent error is heteroskedastic, with conditional standard deviation σi = exp(zi'γ), the response probability becomes

Pr(yi = 1|x) = Φ(xi'β / exp(zi'γ)),

which reduces to the homoskedastic probit model, with σi² = 1, when γ = 0.
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides a LR test for the null of homoskedasticity (γ = 0).

12.2.4. Clustering. Differently from heteroskedasticity, it makes sense to adjust standard error estimates to within-cluster correlation. This is the case since within-cluster correlation leaves the conditional expectation of an individual observation unaffected, so that the ML estimator can be motivated as a partial ML estimator, which remains consistent even if observations are not independent (see Wooldridge 2010, p. 609). The Stata option vce(cluster clustervar), therefore, can be conveniently included in both probit and logit statements.
12.3. Coefficient estimates and marginal effects

There is no exact relationship between the coefficient estimates from the three foregoing models. Amemiya (1981) works out the following rough conversion factors:

b_logit ≈ 4 b_ols,
b_probit ≈ 2.5 b_ols,
b_logit ≈ 1.6 b_probit.

The marginal effects of x at observation i are estimated by probit and logit as

(∂xFi)_probit = f_probit,i b_probit = φ(xi'b_probit) b_probit,
(∂xFi)_logit = f_logit,i b_logit = Λ(xi'b_logit)[1 − Λ(xi'b_logit)] b_logit.

The post-estimation command margins with the option dydx(varlist) estimates marginal effects for each of the variables in varlist. Marginal effects can be estimated at a point x (conventionally, the sample mean when variables are continuous, in which case the option atmeans must be supplied) or can be averaged over the sample (the default).
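A minimal sketch (hypothetical binary outcome y and regressors):

    probit y x1 x2
    * average marginal effects (the default)
    margins, dydx(x1 x2)
    * marginal effects at the sample means of the regressors
    margins, dydx(x1 x2) atmeans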
Define the predicted outcome

ŷi = 1 if F(xi'b) ≥ 0.5, ŷi = 0 otherwise.

The opcp is given by the number of times ŷi = yi over n. A problem with this measure is that it can be high also in cases where the model poorly predicts one outcome. It may be more informative in these cases to compute the percent correctly predicted for each outcome separately: 1) the number of times ŷi = yi = 1 over the number of times yi = 1 and 2) the number of times ŷi = yi = 0 over the number of times yi = 0. This is done through the post-estimation command estat classification.

Test the discrepancy between the actual frequency of an outcome and the estimated average probability of the same outcome within a subsample S of interest (for example, females in a sample of workers):

ȳ_S = (1/n_S) Σ_{i∈S} yi vs. p̄_S = (1/n_S) Σ_{i∈S} F(xi'b).

Doing this on the whole sample makes little sense because the two measures are always very close (equal in the logit model with the intercept).

Evaluate the pseudo R-squared: R² = 1 − L(b)/L(ȳ), where L(b) is the value of the maximized log-likelihood of the model and L(ȳ) that of a model including only the intercept.
Consider now a probit latent regression with a neglected heterogeneity term c,

y* = x'β + c + ε,

where c is independent of x and ε, with c ~ N(0, σc²) and ε ~ N(0, 1). Then c + ε|x ~ N(0, 1 + σc²) and

y*/√(1 + σc²) = x'[β/√(1 + σc²)] + (c + ε)/√(1 + σc²)

is a legitimate probit model. In fact, y*/√(1 + σc²) is latent,

(c + ε)/√(1 + σc²) | x ~ N(0, 1),

and so

Φ(x'β/√(1 + σc²)) = Pr(y = 1|x).
It follows that we can apply standard probit ML estimation: the resulting estimator, b, is consistent for the scaled coefficient β/√(1 + σc²), and so Φ(x0'b) is consistent for the response probabilities Pr(y = 1|x0). As an estimator of β, instead, b is biased downward (Yatchew and Griliches (1985)). Nonetheless, if our interest centers on marginal effects ∂x Pr(y|c, x) averaged over c (AMEs), E[∂x Pr(y|c, x)], this is no problem.

Indeed, given f(c|x), the conditional density function of c, it is generally true that

Pr(y|x) = ∫ Pr(y|x, c) f(c|x) dc.

Hence, under mild regularity conditions that permit interchanging integrals and derivatives,

∂x Pr(y|x) = E[∂x Pr(y|c, x)].

The above result is important, for it establishes that to estimate Pr(y|x) and ∂x Pr(y|x) is to estimate E[Pr(y|c, x)] and E[∂x Pr(y|c, x)], respectively. So, Φ(x0'b) is a consistent estimator for E[Pr(y|c, x0)]; likewise, φ(x0'b)b is a consistent estimator of E[∂x Pr(y|c, x0)] (see Wooldridge (2005) and Wooldridge (2010)). If evaluated at a given point x0, the AMEs are averages over c alone. To estimate E_{x,c}[Pr(y|c, x)] and E_{x,c}[∂x Pr(y|c, x)], just average Φ(xi'b) and φ(xi'b)b over the sample.
Multivariate probit models are constructed by supplementing the random vector y defined in (12.2.1) with the latent regression model

(12.7.1) yj* = x'βj + εj, j = 1, ..., m,

where x collects all the explanatory variables and εj is the equation-j error term. Stacking all the εj's into the vector ε ≡ (ε1, ..., εm)', we assume ε|x ~ N(0, R). The covariance matrix R is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing βj to have zeroes in the positions of the variables in x that are excluded from equation j. Cross-equation restrictions on the βj's are also permitted. R is normalized for scale and so has unit diagonal elements and arbitrary off-diagonal elements, ρij, which allows for possible cross-equation correlation of errors. It may or may not present constraints beyond normalization. If m = 2 we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to probit.
Consider now the recursive bivariate system

(12.7.2) y1* = x'β1 + γy2 + ε1,
         y2* = x'β2 + ε2,

with yj = 1(yj* > 0), j = 1, 2. It is then evident that estimating a bivariate recursive probit model is ancillary to estimation of a univariate probit model with a binary endogenous regressor, the first equation of system (12.7.2).
The feature that makes the recursive multivariate probit model appealing is that it accommodates endogenous, binary explanatory variables without special provisions for endogeneity, simply maximizing the log-likelihood function as if the explanatory variables were all ordinary exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof, Roodman 2011). This can be easily seen here in the case of the recursive bivariate model:

Pr(y1 = 1, y2 = 1|x) = Pr(y1 = 1|y2 = 1, x) Pr(y2 = 1|x)
= Pr(ε1 > −x'β1 − γ | ε2 > −x'β2, x) Pr(ε2 > −x'β2 | x)
= Pr(ε1 > −x'β1 − γ, ε2 > −x'β2 | x)
= Φ2(x'β1 + γ, x'β2, ρ12),

where Φ2(·, ·, ρ12) denotes the bivariate standard normal c.d.f., and so no endogeneity issue emerges when working out the joint probability as a joint normal distribution. The other three joint probabilities are similarly derived, so that eventually the likelihood function is assembled exactly as in a conventional multivariate probit model¹.
Starting with the contributions of Evans and Schwab (1995) and Greene (1998), there are by now many econometric applications of this model, including the recent articles by Fichera and Sutton (2011) and Entorf (2012). The user-written command mvprobit deals with m > 2; it evaluates multiple integrals by simulation (see Cappellari and Jenkins (2003)).

¹Wooldridge 2010 argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (12.7.2) substantial identification requires some zeroes in β1, where the corresponding variables may then be thought of as instruments for y2.

The recent user-written command cmp (see Roodman (2011)) is a more general simulation-based procedure that can estimate many multiple-response and multivariate models.
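A minimal sketch of the recursive bivariate probit (12.7.2), where z1 is a hypothetical instrument excluded from the y1 equation:

    * recursive bivariate probit: y2 enters the y1 equation as a regressor
    biprobit (y1 = x1 x2 y2) (y2 = x1 x2 z1)

The output reports the estimate of the cross-equation correlation rho together with a test of rho = 0, which is informative on the endogeneity of y2.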
CHAPTER 13
y=
8
>
<y = y
>
:y = L
if y > L
if y L
Think of a utility maximizer individual with latent and observable characteristics " and
x, respectively, choosing y subject to the inequality constraint y
unconstrained solution. For a part of individuals the constraint is binding (y = L ) and for
the others is not (y > L). Focussing on the former subpopulation, the regression model is
y = E (y|x, y > L) + u
(13.2.1)
x0
+u
x0
+u
187
where u = y
188
E(y|x, y* > L). The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874-876). For z ~ N(μ, σ²) and a constant a:

f(z|z > a) = f(z)/Pr(z > a),
f(z|z < a) = f(z)/Pr(z < a),
E(z|z > a) = μ + σ φ[(a − μ)/σ] / {1 − Φ[(a − μ)/σ]},
E(z|z < a) = μ − σ φ[(a − μ)/σ] / Φ[(a − μ)/σ].

The foregoing equalities are all based on the following representations of the general normal cumulative distribution function, F_(μ,σ²), and density, f_(μ,σ²):

F_(μ,σ²)(a) = Pr[(z − μ)/σ ≤ (a − μ)/σ] = Φ[(a − μ)/σ],
f_(μ,σ²)(z) = (1/(σ√(2π))) exp{−(z − μ)²/(2σ²)}.

Applying these results to (13.2.1) yields

y = x'β + σ φ[(L − x'β)/σ] / {1 − Φ[(L − x'β)/σ]} + u
  = x'β + σ φ[(x'β − L)/σ] / Φ[(x'β − L)/σ] + u,

where the second equality follows from the symmetry of the normal density, and, for L = 0,

y = x'β + σ φ(x'β/σ)/Φ(x'β/σ) + u.
13.2.1. Estimation. There is a random sample {yi, xi}, i = 1, ..., n, for estimation. Let di = 1(yi > L). Estimation can be via ML or two-step LS.

The log-likelihood function assembles the density functions peculiar to the subsample of individuals with di = 1 and those peculiar to individuals with di = 0 (left-censored). For an individual with di = 1, yi = yi* and we know that yi*|xi ~ N(xi'β, σ²), so the likelihood contribution is the normal density at the single point yi,

f(yi|x) = (1/σ) φ[(yi − xi'β)/σ].

For an individual with di = 0 all we know is that yi* ≤ L, an event of probability Φ[(L − xi'β)/σ]. Therefore,

lnL = Σ_{i=1}^n di ln{(1/σ) φ[(yi − xi'β)/σ]} + (1 − di) ln Φ[(L − xi'β)/σ].

The two-step estimator, b_2step, exploits the representation of E(y|x, y* > L) derived above: the nonlinear term λ = φ[(x'β − L)/σ]/Φ[(x'β − L)/σ] can be estimated in a first step by a probit of di on xi, λ̂i = φ(xi'b_probit)/Φ(xi'b_probit) (recall that b_probit is indeed a consistent estimate of β/σ and L/σ is subsumed in the constant term). In the second step apply OLS regression of yi on xi and λ̂i, restricting to the unconstrained subsample di = 1. b_2step is consistent, but standard errors need to be adjusted since in the second step there is an estimated regressor.
Upper limits can be dealt with similarly:

y = y*  if y* < U,
y = U   if y* ≥ U,

and so can two-limit censoring:

y = L   if y* ≤ L,
y = y*  if L < y* < U,
y = U   if y* ≥ U.

The Stata command that computes b_ML in the tobit model is tobit. The syntax is similar to regress, requiring in addition options specifying lower limits, ll(#), and upper limits, ul(#).
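A minimal sketch for a left-censoring point at zero (hypothetical variables):

    * tobit with lower limit L = 0
    tobit y x1 x2, ll(0)
    * marginal effects on the truncated expectation E(y | x, y* > 0)
    margins, dydx(x1 x2) predict(e(0,.))
    * marginal effects on the latent mean E(y*|x) are the coefficients themselves
    margins, dydx(x1 x2)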
Marginal effects of interest are

∂x E(y*|x) = β

and, letting w = (x'β − L)/σ and λ(w) = φ(w)/Φ(w),

∂x E(y|x, y* > L) = β[1 − λ(w)(w + λ(w))] = β[1 − wλ(w) − λ²(w)].
Selection process:

s* = z'γ + ν, d = 1(s* > 0).

Process for y:

y* = x'β + ε,
y = y*       if d = 1,
y = missing  if d = 0.

Interest is on β, with

(ε, ν)' | z, x ~ N(0, Σ), Σ = (σε²  ρσε ; ρσε  1).

The log-likelihood assembles the contributions of the two regimes,

lnL = Σ_{i=1}^n {di ln[f(yi, di = 1|xi, zi)] + (1 − di) ln[Pr(di = 0)]}.
The Stata command that computes b_ML in the selection model is heckman. The syntax is similar to regress, requiring in addition an option specifying the list of variables in the selection process, d and z: select(varlist_s).
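A minimal sketch (hypothetical variables; z1 and z2 affect selection but are excluded from the outcome equation):

    * ML estimation of the selection model
    heckman y x1 x2, select(d = x1 z1 z2)
    * Heckman's original two-step estimator
    heckman y x1 x2, select(d = x1 z1 z2) twostep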
CHAPTER 14
Quantile regression
14.1. Introduction
Define the conditional c.d.f. of Y: F(y|x) = Pr(Y ≤ y|x). Instead of E(y|x), as in the CRM, we model quantiles of F(y|x).

The conditional median function of y, Q0.5(y|x), is an example. Specifically, for given x and F(y|x), Q0.5(y|x) is a function that assigns the median of F(y|x) to x, and is implicitly defined as

F[Q0.5(y|x)|x] = 0.5

or, explicitly,

(14.1.1) Q0.5(y|x) = F⁻¹(0.5|x).
More generally, given the quantile q ∈ (0, 1), define the conditional quantile function, Qq(y|x), as

(14.1.2) Qq(y|x) = F⁻¹(q|x).

Let ŷ(x) denote a predictor of y and ε(y, x) = y − ŷ(x) the associated prediction error; then Qq(y|x) is the predictor that minimizes the expected asymmetric absolute loss

Lq = q E_{y,x}[|ε(y, x)| 1(ε(y, x) ≥ 0)] + (1 − q) E_{y,x}[|ε(y, x)| 1(ε(y, x) < 0)].

Qq(y|x) is equivariant to monotone transformations: let h(·) be a monotonic function, then Qq[h(y)|x] = h[Qq(y|x)].

In the case of the median: L0.5 = E_{y,x}(|ε(y, x)|) ⟹ Q0.5(y|x) minimizes E_{y,x}(|ε(y, x)|). In other words, Q0.5(y|x) is the minimum mean absolute error predictor.
14.3. Estimation
There is a sample {yi , xi } , i = 1, ..., n, for estimation.
14.3.1. Marginal effects. We can put the QR model in close relationship with the CRM. Let

yi = E(yi|xi) + ui,

where ui = yi − E(yi|xi).

14.3.1.1. The i.i.d. case. If ui is independent of xi, then Qq(ui|xi) = Qq(ui) ≡ αq, so that

Qq(yi|xi) = E(yi|xi) + αq,

which implies that here the marginal effects computed by the regression model coincide with those computed from the QR. Therefore,

E(yi|xi) = xi'β ⟹ ∂x Qq(yi|xi) = β.

14.3.1.2. General case. If the i.i.d. assumption does not hold (e.g. because of heteroskedasticity), then

∂x Qq(yi|xi) = ∂x E(yi|xi) + ∂x Qq(ui|xi)

and

E(yi|xi) = xi'β ⟹ ∂x Qq(yi|xi) = β + ∂x Qq(ui|xi).

Also, if Qq(ui|xi) is linear in xi, Qq(ui|xi) = xi'γq, then

(14.3.1) E(yi|xi) = xi'β ⟹ ∂x Qq(yi|xi) = β + γq.
14.3.2. The linear QR. The linear quantile regression model specifies Qq(yi|xi) as a linear function,

Qq(yi|xi) = xi'βq,

or equivalently

yi = xi'βq + u_{q,i},

where u_{q,i} = yi − xi'βq. The QR estimator is

bq = argmin_b { Σ_{i: yi ≥ xi'b} q|yi − xi'b| + Σ_{i: yi < xi'b} (1 − q)|yi − xi'b| }.

Asymptotically,

bq ≈ N(βq, n⁻¹ B⁻¹AB⁻¹),

where A = q(1 − q) plim n⁻¹ Σ_i xi xi' and B = plim n⁻¹ Σ_i f_{u_q}(0|xi) xi xi'.
The main Stata command implementing QR is qreg. Its syntax is similar to regress.
The quantile(.##) option in qreg indicates the quantile of choice (e.g. to get the median, which is also the default, set .##=.50). To produce QR estimates with bootstrap standard errors apply bsqreg. The reps(#) option in bsqreg indicates the number of bootstrap replications.
Implementing the same model at various quantiles through repeated qreg regressions can shed light on the discrepancies in behavior across different regions of the variable of interest. To evaluate the statistical significance of such discrepancies, though, it is necessary to estimate a larger covariance matrix encompassing covariances between coefficient estimators across quantiles. This is done by sqreg, which provides simultaneous estimates from the quantile regressions chosen by the researcher.
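A minimal sketch (hypothetical variables):

    * median regression (the default quantile)
    qreg y x1 x2
    * third-quartile regression with 200 bootstrap replications
    bsqreg y x1 x2, quantile(.75) reps(200)
    * simultaneous quantile regressions and a cross-quantile test on x1
    sqreg y x1 x2, quantiles(.25 .50 .75) reps(200)
    test [q25]x1 = [q75]x1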
and we expect that an OLS regression will yield coefficient estimates close to the foregoing marginal effects. Also,

Qq(u|x) = Qq[(0.1 + 0.5x2)ε|x] = (0.1 + 0.5x2) Qq(ε|x) = 0.1αq + 0.5αq x2,

where the first equality is obvious, the second follows from the equivariance property and the last from independence of ε and x2, yielding Qq(ε|x) = Qq(ε) ≡ αq. We then have

∂x2 Qq(y|xi) = 1 + 0.5αq,
∂x3 Qq(y|xi) = 1.

Therefore, we expect that quantile regressions will yield coefficient estimates for x3 close to the OLS estimate, regardless of the quantile considered, whilst for x2 we will observe various discrepancies with the OLS estimate depending on the quantile regression. This is confirmed by the outcome of the do-file qr.do.
Bibliography
Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms. Econometrica 67, 251-333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 570-606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional Wald tests in IV regression with weak instruments. Journal of Econometrics 139, 116-132.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics 49 (4), 431-434.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277-297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115-143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models. Economics Letters 77, 211-220.
Bruno, G. S. F., 2005a. Approximating the bias of the LSDV estimator for dynamic unbalanced panel data models. Economics Letters 87, 361-366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473-500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79, 145-152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering. Journal of Business & Economic Statistics 29, 238-249.
Cameron, A. C., Trivedi, P. K., 2009. Microeconometrics Using Stata. Stata Press, College Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum likelihood. The Stata Journal 3, 278-294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222-240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence under juvenile and adult criminal law. European Journal of Political Economy 28, 414-429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do Catholic schools make a difference? The Quarterly Journal of Economics 110, 941-974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics 30, 1164-1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal of Economic Education 29, 291-300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029-1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251-1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects. Econometrica 49, 1377-1398.
Hayashi, F., 2000. Econometrics. Princeton University Press.
Judson, R. A., Owen, A. L., 1999. Estimating dynamic panel data models: a guide for macroeconomists. Economics Letters 65, 9-15.
Kiviet, J. F., 1995. On bias, inconsistency and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68, 53-78.
Kiviet, J. F., 1999. Expectation of expansions for estimators in a dynamic panel data model; some results for weakly exogenous regressors. In: Hsiao, C., Lahiri, K., Lee, L.-F., Pesaran, M. H. (Eds.), Analysis of Panels and Limited Dependent Variable Models. Cambridge University Press, Cambridge, pp. 199-225.
Maddala, G. S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge.
Mikusheva, A., Poi, B. P., 2006. Tests and confidence sets with correct size when instruments are potentially weak. The Stata Journal 6, 335-347.
Moulton, B. R., 1990. An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics 72 (2), 334-338.
Mundlak, Y., 1978. On the pooling of time series and cross section data. Econometrica 46, 69-85.
Nickell, S. J., 1981. Biases in dynamic models with fixed effects. Econometrica 49, 1417-1426.
Prucha, I. R., 1984. On the asymptotic efficiency of feasible Aitken estimators for seemingly unrelated regression models with error components. Econometrica 52, 203-207.
Rao, C. R., 1973. Linear Statistical Inference and Its Applications. New York: Wiley.
Roodman, D. M., 2009. How to do xtabond2: An introduction to difference and system GMM in Stata. The Stata Journal 9 (1), 86-136.
Roodman, D. M., 2011. Fitting fully observed recursive mixed-process models with cmp. The Stata Journal 11, 159-206.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76, 155-174.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40 (2), 261-275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126, 25-51.
Wooldridge, J. M., 2005. Unobserved heterogeneity and estimation of average partial effects. In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition. The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics and Statistics 67, 134-139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092-1109.