
skip 3.6.1 and cover 3.6.2 only for the definition of the squared coefficient of partial correlation between two variables; going through the remaining few parts not covered in class is a useful exercise), Ch. 4, Ch. 7 (do 7.3 limited to what was done in class; skip all proofs of asymptotic normality and the derivations in 7.4; do pp. 297-299 in Greene (2012)), Ch. 8 (skip the reference to Ch. 5 at p. 115, skip all derivations of the large-sample results in 8.6), Ch. 10 (skip 10.4.2, 10.7, 10.9), Ch. 11, Ch. 12 (skip 12.5).

Econometric Models: Lecture Notes for 2014/15


ESS/DES Econometrics (revised February 2015)
Giovanni Bruno
Department of Economics, Bocconi University, Milano
E-mail address: giovanni.bruno@unibocconi.it

Contents

Part 1. Linear Models

Chapter 1. Introduction
1.1. Introduction
1.2. The linear population model

Chapter 2. The linear regression model
2.1. From the linear population model to the linear regression model
2.2. The properties of the LRM
2.3. Difficulties

Chapter 3. The Algebraic properties of OLS
3.1. Motivation, notation, conventions and main assumptions
3.2. Linear combinations of vectors
3.3. OLS: definition and properties
3.4. Spanning sets and orthogonal projections
3.5. OLS residuals and fitted values
3.6. Partitioned regression
3.7. Goodness of fit and the analysis of variance
3.8. Centered and uncentered goodness-of-fit measures

Chapter 4. The finite-sample statistical properties of OLS
4.1. Introduction
4.2. Unbiasedness
4.3. The Gauss-Markov Theorem
4.4. Estimating the covariance matrix of OLS
4.5. Exact tests of significance with normally distributed errors
4.6. The general law of iterated expectation
4.7. The omitted variable bias
4.8. The variance of an OLS individual coefficient
4.9. Residuals from partitioned OLS regressions

Chapter 5. The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity
5.1. Introduction
5.2. Embedding the Oaxaca model into a pooled regression framework
5.3. The OLS estimator in the Oaxaca model is BLUE
5.4. A general result

Chapter 6. Tests for structural change
6.1. The Chow predictive test
6.2. An equivalent reformulation of the Chow predictive test
6.3. The classical Chow test

Chapter 7. Large sample results for OLS and GLS estimators
7.1. Introduction
7.2. OLS with non-spherical error covariance matrix
7.3. GLS
7.4. Large sample tests

Chapter 8. Fixed and Random Effects Panel Data Models
8.1. Introduction
8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)
8.3. The Random Effect Model
8.4. Stata implementation of standard panel data estimators
8.5. Testing fixed effects against random effects models
8.6. Large-sample results for the LSDV estimator
8.7. A Robust covariance estimator
8.8. Unbalanced panels

Chapter 9. Robust inference with cluster samplings
9.1. Introduction
9.2. Two-way clustering
9.3. Stata implementation

Chapter 10. Issues in linear IV and GMM estimation
10.1. Introduction
10.2. The method of moments
10.3. Stata implementation of the TSLS estimator
10.4. Stata implementation of the (linear) GMM estimator
10.5. Robust Variance Estimators
10.6. Durbin-Wu-Hausman Exogeneity test
10.7. Endogenous binary variables
10.8. Testing for weak instruments
10.9. Inference with weak instruments
10.10. Dynamic panel data

Part 2. Non-linear models

Chapter 11. Non-linear regression models
11.1. Introduction
11.2. Non-linear least squares
11.3. Poisson model for count data
11.4. Modelling and testing overdispersion

Chapter 12. Binary dependent variable models
12.1. Introduction
12.2. Binary models
12.3. Coefficient estimates and marginal effects
12.4. Tests and Goodness of fit measures
12.5. Endogenous regressors
12.6. Independent latent heterogeneity
12.7. Multivariate probit models

Chapter 13. Censored and selection models
13.1. Introduction
13.2. Tobit models
13.3. A simple selection model

Chapter 14. Quantile regression
14.1. Introduction
14.2. Properties of conditional quantiles
14.3. Estimation
14.4. A heteroskedastic regression model with simulated data

Bibliography

Part 1

Linear Models

CHAPTER 1

Introduction

1.1. Introduction
Indeed, causation is not the same thing as correlation. Econometrics uses economic theory, mathematics and statistics to quantify economic structural relationships, often in the search of causal links among the variables of interest.
Although rather schematic, the following discussion should convey the basic intuition of
how this process works.
Economic theory provides the econometrician with an economic structural model,
(1.1.1)   y = ω(x, ε)
where ω : R^(k+q) → R. Often, the structural relationship is formulated as a probabilistic model for a given population of interest. So, y denotes a random scalar, x = (x1 ... xk)' ∈ X ⊆ R^k is a k × 1 random vector of explanatory variables of interest and ε is a q × 1 random vector of latent variables. A structural model can be understood as one showing a causal relationship
from the economic factors of interest, x, to the economic response or dependent variable y. Often in applications q = 1, which means that ε is treated as a catch-all random scalar.
For example, ω(x, ε) may be the expenditure function in a population of (possibly) heterogeneous consumers, with preferences ε and facing income and prices x; or it may be the Marshallian demand function for some good in the same population, with x denoting prices and total consumption expenditure; also it may be the demand function for some input of a population of (possibly) heterogeneous firms facing input and output prices x, with ε comprising technological latent heterogeneity, and so on.¹
The individual ω(x, ε), with its gradient vector of marginal effects, ∂x ω(x, ε), and hessian matrix, Dxx ω(x, ε), are typically the structural objects of interest, but sometimes attention is centered upon aggregate structural objects, such as the population-averaged structural function,
∫ ω(x, ε) dF(ε),
the population-averaged marginal effects,
∫ ∂x ω(x, ε) dF(ε),
or the population-averaged hessian matrix
∫ Dxx ω(x, ε) dF(ε).

Statistics supplements the probabilistic model with a sampling mechanism in order to estimate characteristics of the population from a sample of observables. The population objects of interest may be, for example, the joint probability distribution of the observables, F(y, x), and its moments, E(y), E(x), E(xx'), E(xy), the conditional distribution of y given x, F(y|x), and its moments, E(y|x) and Var(y|x), or functions of the above.

¹Wooldridge (2010) prefers to think of ω(x, ε) as a structural conditional expectation: E(y|x, ε) ≡ ω(x, ε). There is nothing in the present analysis that prevents such interpretation.


The key question is under what conditions these estimable statistical objects are informative on ω(x, ε). Evidently, to establish a mapping between the structural economic object of interest and the foregoing statistical objects, the econometrician needs to model the relationship between observables and unobservables in ω(x, ε) and do so in a plausible way. The restrictions that are used to this purpose are called identification restrictions. The next sections describe the simplest probabilistic model for equation (1.1.1), the linear population model.

1.2. The linear population model

Equation (1.1.1) is a linear model of the population if the following assumptions hold.
P.1: Linearity: ω(x, ε) = x'β + ε, with ε being a random scalar (q = 1) and β a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k or, equivalently, det[E(xx')] ≠ 0.
P.3: Conditional-mean-independence of ε and x: E(ε|x) = 0.
Under linearity (P.1) equation (1.1.1) becomes
(1.2.1)   y = x'β + ε,
then, given P.3, E(y|x) = x'β.

An equivalent, but easier to interpret, formulation of assumption P.2 states:
P.2b: No element of x in X can be obtained as a linear combination of the others with probability equal to one: Pr(a'x = 0) = 1 only if a = 0.
The following proves equivalence of P.2 and P.2b (not crucial and rather technical, it can be skipped towards the exam). I exploit the properties of the expectation and rank operators. Assume P.2 and Pr(a'x = 0) = 1 for some conformable constant vector a. Then E(a'xx'a) = 0, and so a'E(xx')a = 0, which implies a = 0 by P.2, proving P.2b. Now, assume P.2b and pick any a ≠ 0. Then Pr(a'x = 0) ≠ 1 and so Pr(a'x ≠ 0) > 0. But since a'x ≠ 0 is equivalent to a'xx'a > 0, then Pr(a'xx'a > 0) = Pr(a'x ≠ 0) > 0. So, since Pr(a'xx'a ≥ 0) = 1, E(a'xx'a) > 0, which in turn implies a'E(xx')a > 0. Therefore, E(xx') is positive definite and so non-singular, that is P.2.
Exercise 1. Prove that if x = (1 x1)' then assumption P.2 is equivalent to Var(x1) ≠ 0.
Solution:
E(xx') = [ 1, E(x1) ; E(x1), E(x1²) ]
and so det E(xx') = E(x1²) − E²(x1) = Var(x1), and the claim is proved by noting that for any k × k matrix A, rank(A) = k if and only if det(A) ≠ 0.


By equation (1.1.1) and assumption P.1, the latent part of ω(x, ε), ε, satisfies the following equation
ε = y − x'β
and the marginal effects, ∂x ω(x, ε), satisfy the following:
∂x ω(x, ε) = β.
By assumption P.3 and the law of iterated expectations E(xε) = 0. Since ε = y − x'β, then we have the system of k moment conditions
(1.2.2)   E[x(y − x'β)] = 0
or E(xy) = E(xx')β. Assumption P.2, then, ensures that the foregoing system can be solved for β to have
(1.2.3)   β = [E(xx')]^(-1) E(xy).

At this point the linear probabilistic model establishes a precise mapping between, on the one hand, the structural objects of interest, ω(x, ε), ε and ∂x ω(x, ε), and on the other the observable or estimable objects y, x, E(xx') and E(xy). Indeed, ω(x, ε), ε and ∂x ω(x, ε) are equal to unique known transformations of y, x, E(xx') and E(xy). This means that ω(x, ε), ε and ∂x ω(x, ε) can be estimated using estimators for E(xx') and E(xy), whose choice depends on the underlying sampling mechanism. The most basic strategy is to carry out estimation within the linear regression model and its variants. In essence, the linear regression model is the linear probabilistic model supplemented by a random sampling assumption. This ensures optimal properties of the ordinary least squares estimator (OLS) and its various generalizations.
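To make the mapping concrete, here is a small simulation sketch (not part of the notes' do-files; the sample size and the variable names y, x1, x2 are invented for illustration). It generates data from a linear population model satisfying P.1-P.3 and checks that OLS, the sample analogue of [E(xx')]^(-1) E(xy), recovers β.

* Illustrative sketch only: y = 1 + 0.5*x1 - 2*x2 + eps, with E(eps|x) = 0 by construction
clear
set seed 12345
set obs 10000
generate x1  = rnormal()
generate x2  = rnormal()
generate eps = rnormal()
generate y   = 1 + 0.5*x1 - 2*x2 + eps
* OLS is the sample analogue of [E(xx')]^(-1) E(xy); estimates should be close to (0.5, -2, 1)
regress y x1 x2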
A more restrictive specification of the linear model maintains the assumptions of conditional homoskedasticity and normality:
P.4: Var(ε|x) = σ².
P.5: ε|x ∼ N(0, σ²).

A more general variant of the linear model, instead, replaces assumption P.3 with
P.3b: E(xε) = 0.
Under P.3b it is still true that β = [E(xx')]^(-1) E(xy) and ∂x ω(x, ε) = β, with the virtue that the conditional expectation E(y|x) is left unrestricted. Therefore, with P.3b replacing P.3, the model is more general.
The function x'β, with β = [E(xx')]^(-1) E(xy), is relevant in either version of the linear model and is called the linear projection of y onto x.


Exercise 2. Prove that if x contains 1, then E(xε) = 0 is equivalent to E(ε) = 0 and cov(ε, x) = 0 (hint: remember that cov(ε, x) = E(xε) − E(x)E(ε)).
Solution: Assume E(xε) = 0. Since 1 is an element of x, the first component of E(xε) is E(ε) = 0; then, given cov(ε, x) = E(xε) − E(x)E(ε), cov(ε, x) = 0. Assume E(ε) = 0 and cov(ε, x) = 0. Then, E(xε) = E(x)E(ε) = 0.

CHAPTER 2

The linear regression model


The linear regression model is a statistical model; as such, it incorporates a probabilistic model of the population and a sampling mechanism that draws the data from the population.

2.1. From the linear population model to the linear regression model

Consider the linear model of the previous chapter: the population equation (1.1.1)
y = ω(x, ε)
with the assumptions
P.1: Linearity: ω(x, ε) = x'β + ε, with x = (x1 x2 ... xk)' being a k × 1 random vector, ε a random scalar and β = (β1 β2 ... βk)' a k × 1 vector of fixed parameters.
P.2: rank[E(xx')] = k for all x ∈ X.
P.3: Conditional-mean-independence of ε and x: E(ε|x) = 0.
Now, add the random sampling assumption
RS: There is a sample of size n from the population equation, such that the elements of the sequence {(yi xi1 xi2 ... xik), i = 1, ..., n} are independently identically distributed (i.i.d.) random vectors.
Given P.1-P.3 and RS, we have the linear regression model (LRM)
(2.1.1)   yi = xi'β + εi
with xi' = (xi1 xi2 ... xik), i = 1, ..., n, and {εi = yi − xi'β, i = 1, ..., n} is a sequence of unobserved i.i.d. error terms.


2.2. The properties of the LRM


It is convenient to express the LRM in compact matrix form as follows
(2.2.1)   y = Xβ + ε
where y is the n × 1 vector stacking the yi, X is the n × k matrix whose i-th row is xi', and ε is the n × 1 vector stacking the εi:
y = (y1, ..., yi, ..., yn)',   X = (x1, ..., xi, ..., xn)',   ε = (ε1, ..., εi, ..., εn)'.

It is not hard to see that model (2.2.1), given P.1-P.3 and RS, satisfies the following properties:
LRM.1: Linearity in the parameters.
LRM.2: X has full column rank, that is rank(X) = k.
LRM.3: The variables in X are strictly exogenous, that is E(εi|x1', ..., xi', ..., xn') = 0, i = 1, ..., n, or more compactly, E(ε|X) = 0.

LRM.1 is obvious. LRM.2 requires that no column of X can be obtained as a linear combination of other columns of X or, equivalently, that a = 0 if Xa = 0, or also equivalently that for any a ≠ 0 there exists at least one observation i = 1, ..., n such that xi'a ≠ 0. P.2 ensures that this occurs with non-zero probability, which approaches unity as n → ∞. LRM.3, instead, is a consequence of P.3 and RS. This is proved as follows. By P.3, E(εi|xi') = 0, i = 1, ..., n, or E(yi|xi') − xi'β = 0, i = 1, ..., n. Since
E(εi|x1', ..., xi', ..., xn') = E(yi|x1', ..., xi', ..., xn') − xi'β
and by RS, E(yi|xi') = E(yi|x1', x2', ..., xn'), then
E(εi|x1', ..., xi', ..., xn') = E(yi|xi') − xi'β = 0.

If, in addition, P.4 (conditional homoskedasticity) and P.5 (conditional normality) hold for the population model, then one easily verifies that
LRM.4: Var(ε|X) = σ²In
LRM.5: ε|X ∼ N(0, σ²In).
While LRM.1-LRM.5 are less restrictive than P.1-P.5 and RS and, in most cases, sufficient for accurate and precise inference, they are still strong assumptions to maintain. Finally, if P.3 is replaced by P.3b, E(xε) = 0, then LRM.3 gets replaced by
LRM.3b: E(xiεi) = 0, i = 1, ..., n, or more compactly, E(X'ε) = E(Σ_{i=1}^n xiεi) = 0.

2.3. Difficulties

Some or all of LRM.1-LRM.5 may not be verified if the population model assumptions and/or the RS mechanism are not verified in reality. Here is a list of the most important population issues.
Non-linearities (P.1 fails): the model is non-linear in the parameters. This leads LRM.1 to fail.
Perfect multicollinearity (P.2 fails): some variables in x are indeed linear combinations of the others. LRM.2 fails, but in general this is not a serious problem; it simply indicates that the model has not been parametrized correctly to begin with. A different parametrization will restore identification in most cases.
Endogeneity (P.3 fails): some variables in x are related to ε. LRM.3 fails.
Conditional heteroskedasticity (P.4 fails): the conditional variance depends on x. LRM.4 fails.
Non-normality (P.5 fails): ε is not conditionally normal. LRM.5 fails.


Other important problems are instead with the RS assumption.
Omitted variables: some of the variables in x are not sampled. This implies that the missing variables cannot enter the conditioning set and have to be treated as unobserved errors, along with ε, which could make LRM.3-LRM.5 fail.
Measurement error: some of the variables in x are measured with error. We have the wrong variables in the conditioning set. As in the case of omitted variables, LRM.3-LRM.5 may fail.
Endogenous selection: some units in the sample are missing due to events related to ε. Also in this case, LRM.3-LRM.5 are likely to fail.
Notice that often problems in the RS mechanism have their roots in the population model. For example, the presence of non-random variables in x is not in general compatible with an identically distributed sample and, in consequence, with RS. It is easy to verify, though, that non-random x along with a weaker sampling mechanism only requiring independent sampling is compatible with LRM.1-LRM.5. Also, the presence of variables in x at different levels of aggregation may not be compatible with independent sampling, as observed by Moulton (1990). In this case, the sampling mechanism can be relaxed by maintaining independence only across groups of observations and not across observations themselves. See for example the sampling mechanism described in Section 8.6 for panel data models, in which the sample is neither identically distributed nor independent across observations.
Finally, it is important to emphasize that even if all the population assumptions and the RS mechanism are valid, data problems may arise in the form of multicollinearity among regressors.
Multicollinearity: some of the variables in X are almost collinear. In the population this is reflected by det[E(xx')] ≃ 0, and in the sample by det(X'X) ≃ 0.


As we will see in Chapter 4, although multicollinearity does not affect the statistical properties of the estimators in finite samples, it can severely affect the precision of the coefficient estimates in terms of large standard errors.

CHAPTER 3

The Algebraic properties of OLS

3.1. Motivation, notation, conventions and main assumptions


We do not agree with Larry (the adult croc), do we? Algebra may be boring, but only if its purpose is left obscure. Algebra in econometrics provides the bricks to construct estimators and tests. The fact that most estimators and tests are automatically implemented by statistical packages is no excuse to neglect the underlying algebra. First, because most does not mean all, and it may be the case that for our research work we have to build the technique by ourselves. This is especially true for the most recent techniques. A robust Hausman test for panel data models and multiway cluster-robust standard errors are just a few examples of techniques that are not yet coded in the popular statistical packages. Second, even if the technique is available as a built-in procedure in our preferred statistical package, to use it correctly we have to know how it is made, which boils down to understanding its algebra. Finally, interpretation of results often requires that we are aware of the algebraic properties of estimators and tests. So the material here may seem rather intricate at times, but it is certainly of practical use.

This chapter is based on my lecture notes in matrix algebra as well as on Greene (2008), Searle (1982) and Rao (1973). Throughout, I denotes a conformable identity matrix; 0 denotes a conformable null matrix, vector or scalar, with the appropriate meaning being clear from the context; y is a real n × 1 vector containing the observations of the dependent variable; X is a real (n × k) regressor matrix of full column rank.
The do-file algebra_OLS.do demonstrates the results of this chapter using the Stata data set US_gasoline.dta.

3.2. Linear combinations of vectors

Given the real (n × k) matrix A, the columns of A are said to be linearly dependent if there exists some non-zero (k × 1) vector b such that Ab = 0.
Given the real (n × k) matrix A, the columns of A are said to be linearly independent if Ab = 0 only if b = 0.
Two real non-zero (n × 1) vectors a and b are said to be orthogonal if a'b = 0. Given two real non-zero (n × k) matrices A and B, if each column of A is orthogonal to all columns of B, so that A'B = 0, then A and B are said to be orthogonal.

3.3. OLS: definition and properties


We do not have any model in mind here, just data for the response variable, the n × 1 vector
y = (y1, ..., yi, ..., yn)',
and the n × k regressor matrix
X = (x1, ..., xi, ..., xn)',
where xi' = (xi1 xi2 ... xik). I only maintain that rank(X) = k.


We aim at finding an optimal approximation of y using the information contained in y and X. One such approximation can be obtained through the ordinary least squares estimator (OLS), b, defined as the minimizer of the residual sum of squares S(bo):
b = arg min_bo S(bo),
where
S(bo) = (y − Xbo)'(y − Xbo).
Geometrically, Xb is an optimal approximation of y in that it minimizes the euclidean distance from the vector y to the hyperplane Xbo. As such, b satisfies
∂S(bo)/∂bo |_{bo = b} = 0.
By expanding (y − Xbo)'(y − Xbo):
S(bo) = y'y − bo'X'y − y'Xbo + bo'X'Xbo
      = y'y − 2y'Xbo + bo'X'Xbo.
Then, taking the partial derivatives
∂S(bo)/∂bo = −2X'y + 2X'Xbo,
so that the first order conditions (OLS normal equations) of the minimization problem are
(3.3.1)   −X'y + X'Xb = 0,
with the resulting formula for the OLS estimator
(3.3.2)   b = (X'X)^(-1) X'y.

Notice that
- The estimator exists since X'X is non-singular, X being of full column rank.
- The estimator is a true minimizer since the Hessian of S(bo),
∂²S(bo)/∂bo∂bo' = 2X'X,
is a positive definite matrix (i.e. S(bo) is globally convex in bo). The latter is easily proved as follows. A matrix A is said to be positive definite if the quadratic form c'Ac > 0 for any conformable vector c ≠ 0. By the full column rank assumption z = Xc ≠ 0 for any c ≠ 0, therefore c'X'Xc = z'z = Σ_{i=1}^n zi² > 0 for any c ≠ 0.
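As a quick illustration of formula (3.3.2), the following Mata sketch (not from algebra_OLS.do; it uses Stata's built-in auto dataset and arbitrary regressors purely as an example) computes b = (X'X)^(-1) X'y directly, to be compared with the coefficients reported by regress.

sysuse auto, clear
regress price weight length                                // benchmark OLS output
mata:
y = st_data(., "price")
X = st_data(., ("weight", "length")), J(st_nobs(), 1, 1)   // append the constant column
b = invsym(cross(X, X)) * cross(X, y)                      // b = (X'X)^(-1) X'y
b'                                                         // compare with the regress coefficients
end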

The OLS residuals are defined as
(3.3.3)   e = y − Xb.
By (3.3.1) it follows that e and X are orthogonal:
(3.3.4)   X'(y − Xb) = 0.
Therefore, if X contains a column of all unity elements, say 1, three important implications follow.
(1) The sample mean of e is zero: 1'e = Σ_{i=1}^n ei = 0 and consequently, ē = (1/n) Σ_{i=1}^n ei = 0.


(2) The OLS regression line passes through the point of sample means (ȳ, x̄), that is ȳ = x̄'b, where ȳ = (Σ_{i=1}^n yi)/n and
x̄' = ( (1/n) Σ_{i=1}^n x1i   ...   (1/n) Σ_{i=1}^n xki )
(it follows straightforwardly from 1'e = 1'(y − Xb) = 0).
(3) Let
(3.3.5)   ŷ = Xb
denote the OLS predicted values of y, and let their sample mean be (1/n) Σ_{i=1}^n ŷi; then, since the sample mean of Xb equals x̄'b, the sample mean of ŷ equals ȳ.
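A quick numerical check of implications (1)-(3) — again using the built-in auto dataset, chosen only for convenience — with the post-estimation command predict discussed in Section 3.5.1:

sysuse auto, clear
quietly regress price weight length
predict yhat                    // fitted values (the default calculation)
predict ehat, res               // OLS residuals
summarize price yhat ehat       // mean(ehat) is 0 and mean(yhat) equals mean(price)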
3.3.1. Stata implementation: get your Stata data file with use. All Stata data
files can be recognized by their filetype dta. Suppose you have y and X within a Stata data
file called, say, mydata.dta, stored in your Stata working directory and that you have just
launched Stata on your laptop. To get your data into memory, from the Stata command line
execute use followed by the name of the data file (specifying the filetype dta is not necessary
since use only supports dta files):
use mydata
If mydata.dta is not in your Stata working directory but somewhere else in your laptop,
then you must specify the path of the dta file. For example, if you have a mac and your data
file is in the folder /Users/giovanni you will write
use /Users/giovanni/mydata
If you have a pc and your file is in c:\giovanni


use c:\giovanni\mydata

If the path involves folders with names that include blanks, then include the whole path
into double quotes. For example:

use "/Users/giovanni/didattica/Greene/dati/ch. 1/mydata"

3.3.2. Stata implementation: the help command. To know syntax, options, usage
and examples for any Stata command, write help from the command line followed by the
name of the command for which you want help. For example,

help use
will open a new window describing use:


Title
    [D] use  --  Use Stata dataset

Syntax
    Load Stata-format dataset
        use filename [, clear nolabel]

    Load subset of Stata-format dataset
        use [varlist] [if] [in] using filename [, clear nolabel]

Menu
    File > Open...

Description
    use loads a Stata-format dataset previously saved by save into memory.  If
    filename is specified without an extension, .dta is assumed.  If your
    filename contains embedded spaces, remember to enclose it in double quotes.

    In the second syntax for use, a subset of the data may be read.

Options
    clear specifies that it is okay to replace the data in memory, even though
    the current data have not been saved to disk.

    nolabel prevents value labels in the saved data from being loaded.  It is
    unlikely that you will ever want to specify this option.

Examples
    . use http://www.stata-press.com/data/r11/auto
    . replace rep78 = 3 in 12
    . use http://www.stata-press.com/data/r11/auto, clear

    . keep make price mpg rep78 weight foreign
    . save myauto
    . use make rep78 foreign using myauto
    . describe

    . use if foreign == 0 using myauto
    . tab foreign, nolabel

    . use using myauto if foreign==1
    . tab foreign, nolabel

Also see
    Manual:  [D] use
    Help:    [D] compress, [D] datasignature, [D] fdasave, [D] haver,
             [D] infile (free format), [D] infile (fixed format), [D] infix,
             [D] insheet, [D] odbc, [D] save, [D] sysuse, [D] webuse


3.3.3. Stata implementation: OLS estimates with regress. Now that you have
loaded your data into memory, Stata can work with them. Suppose your dependent variable
y is called depvar and that X contains two variables, x1 and x2. To run the OLS regression of
depvar on x1 and x2 with the constant term included, you write regress followed by depvar
and, then, the names of the regressors:
regress depvar x1 x2
The following example shows the regression in example 1.2 of Greene (2008) with annual
values of US aggregate consumption (c) used as the dependent variable and regressed on
annual values of US personal income (y) for the period 1970-1979.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"
. regress c y
      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

------------------------------------------------------------------------------
           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000     .9063809    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
------------------------------------------------------------------------------

regress includes the constant term (the unity vector) by default and always with the name _cons. If you don't want it, just add the regress option noconstant:
regress depvar x1 x2, noconstant
Notice that, according to a general rule of Stata syntax, the options of any Stata command always follow the comma symbol. This means that if you wish to specify options you have to write the comma after the last argument of the command, so that everything to the right of the comma is held by Stata as an option. There can be more than one option. Of course, if you do not wish to include options, don't write the comma symbol.
After execution, regress leaves behind a number of objects in memory, mainly scalars
and matrices, that will stay there, available for use, until execution of the next estimation
command. To know what these objects are, consult the section Saved results in the help
of regress, where you will find the following description.
Saved results

regress saves the following in e():


Scalars
e(N)
e(mss)
e(df_m)
e(rss)
e(df_r)
e(r2)
e(r2_a)
e(F)
e(rmse)
e(ll)
e(ll_0)
e(N_clust)
e(rank)

number of observations
model sum of squares
model degrees of freedom
residual sum of squares
residual degrees of freedom
R-squared
adjusted R-squared
F statistic
root mean squared error
log likelihood under additional assumption of i.i.d.
normal errors
log likelihood, constant-only model
number of clusters
rank of e(V)

Macros
e(cmd)
e(cmdline)
e(depvar)
e(model)
e(wtype)
e(wexp)
e(title)
e(clustvar)
e(vce)
e(vcetype)
e(properties)
e(estat_cmd)
e(predict)
e(marginsok)
e(asbalanced)
e(asobserved)

regress
command as typed
name of dependent variable
ols or iv
weight type
weight expression
title in estimation output when vce() is not ols
name of cluster variable
vcetype specified in vce()
title used to label Std. Err.
b V
program used to implement estat
program used to implement predict
predictions allowed by margins
factor variables fvset as asbalanced
factor variables fvset as asobserved

Matrices
e(b)
e(V)
e(V_modelbased)

coefficient vector
variance-covariance matrix of the estimators
model-based variance

Functions
e(sample)

marks estimation sample


You should already be familiar with some of the e() objects in the Scalars and Matrices parts. At the end of the course you will be able to understand most of them. Don't worry about the Macros and Functions parts: they are rather technical and, in any case, not relevant for our purposes.
To know the values taken on by the e() objects, execute the command ereturn list just after the regress instruction. In our regression example we have:
. ereturn list

scalars:
               e(N) =  10
            e(df_m) =  1
            e(df_r) =  8
               e(F) =  959.919036180133
              e(r2) =  .9917348458900325
            e(rmse) =  8.193020017500434
             e(mss) =  64435.11918375102
             e(rss) =  537.0046160573024
            e(r2_a) =  .9907017016262866
              e(ll) =  -34.10649331948547
            e(ll_0) =  -58.08502782843004
            e(rank) =  2

macros:
         e(cmdline) : "regress c y"
           e(title) : "Linear regression"
       e(marginsok) : "XB default"
             e(vce) : "ols"
          e(depvar) : "c"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"

matrices:
               e(b) :  1 x 2
               e(V) :  2 x 2

functions:
        e(sample)

3.4. Spanning sets and orthogonal projections


Consider the n-dimensional Euclidean space Rn and the (n k) real matrix A. Then, each
column of A belongs to Rn and the set of all linear combinations of the columns of A is said
the space spanned by the columns of A (or also the range of A), denoted by R (A) .

3.4. SPANNING SETS AND ORTHOGONAL PROJECTIONS

28

R (A) can be easily proved to be a subspace of Rn (it is obvious that R (A) Rn ; R (A)
is a vector space since, given any two vectors a1 and a2 belonging to R (A), then a1 + a2 2
R (A) and ca1 2 R (A) for any real scalar c). Since each element of R (A) is e vector of n
components, R (A) is said to be a vector space of order n. The dimension of R (A), denoted
by dim [R (A)], is the maximum number of linearly independent vectors in R (A). Therefore,
dim [R (A)] = rank (A) and if A is of full column rank, then dim [R (A)] = k.
The set of all vectors in Rn that are orthogonal to the vectors of R (A) is denoted by A? .
I now prove that A? is a subspace of Rn . A? Rn by definition. Given any two vectors
b1 and b2 belonging to A? and for any a 2 R (A), b01 a = 0 and b02 a = 0, but then also
(b1 + b2 )0 a = 0 and, for any scalar c, (cb01 ) a = 0, which completes the proof.
Importantly, it is possible to prove, but not pursued here, that
(3.4.1)


dim A? = n

rank (A) .

A⊥ is commonly referred to as the space orthogonal to R(A), or also the orthogonal complement of R(A).
Exercise 3. Prove that any subspace of R^n contains the null vector.
For simplicity, assume A of full column rank and define the operator P[A] as
P[A] = A(A'A)^(-1)A'.
As an exercise you can verify that P[A] is a symmetric (P[A]' = P[A]) and idempotent (P[A]P[A] = P[A]) matrix. With these two properties, P[A] is said to be an orthogonal projector. In geometrical terms, P[A] projects vectors onto R(A) along a direction that is parallel to the space orthogonal to R(A), A⊥. Symmetrically,
M[A] = I − P[A]
is the orthogonal projector that projects vectors onto A⊥ along a direction that is parallel to the space orthogonal to A⊥, R(A).
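The defining properties of P[A] and M[A] are also easy to verify numerically. The following Mata sketch (an illustration with an arbitrary random matrix, not material from the notes) checks symmetry, idempotency and M[A]A = 0 up to rounding error.

mata:
A = rnormal(20, 3, 0, 1)             // an arbitrary 20 x 3 matrix, full column rank with probability one
P = A * invsym(cross(A, A)) * A'     // P[A]
M = I(20) - P                        // M[A]
mreldif(P, P')                       // symmetry of P[A]: relative difference is ~0
mreldif(P*P, P)                      // idempotency of P[A]
mreldif(M*M, M)                      // idempotency of M[A]
max(abs(M*A))                        // M[A]A = 0 up to rounding
end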


Exercise 4. Prove that M[A] is an orthogonal projector (hint: just verify that M[A] is symmetric and idempotent).
The properties of orthogonal projectors, established by the following exercises, are readily understood once one grasps the geometrical meaning of orthogonal projectors. They can also be demonstrated algebraically, which is what the exercises require.
Exercise 5. Given two (n × k) real matrices A and B, both of full column rank, prove that if A and B span the same space then P[A] = P[B] (hint: prove that A can always be expressed as A = BK, where K is a non-singular (k × k) matrix).
Solution: If R(A) coincides with R(B), then every column of A belongs to R(B), and as such every column of A can be expressed as a linear combination of the columns of B, A = BK, where K is (k × k). Therefore, P[A] = BK(K'B'BK)^(-1)K'B'. An important result of linear algebra states that, given two conformable matrices C and D, rank(CD) ≤ min[rank(C), rank(D)] (see Greene (2008), p. 957, (A-44)). Since both A and B have rank equal to k, in the light of the foregoing inequality, k ≤ min[k, rank(K)], which implies that rank(K) ≥ k, and since rank(K) > k is not possible, then rank(K) = k and K is non-singular. Finally, by the property of the inverse of the product of square matrices (see Greene (2008), p. 963, (A-64))
P[A] = BK(K'B'BK)^(-1)K'B' = BK K^(-1)(B'B)^(-1)(K')^(-1) K'B' = B(B'B)^(-1)B' = P[B].
Exercise 6. Prove that P[A] and M[A] are orthogonal, that is P[A]M[A] = 0.


3.5. OLS residuals and fitted values


The foregoing results are useful to properly understand the properties of OLS. But before going on, do the following exercise.
Exercise 7. Given any (n × 1) real vector v lying in R(A), prove that P[A]v = v and M[A]v = 0 (hint: express v as v = Ac, where c is a real (k × 1) vector).
From exercise 7 it clearly follows that
(3.5.1)

P[A] A = A

and
(3.5.2)

M[A] A = 0.

Using the OLS formula in (3.3.2),
(3.5.3)   e = M[X] y,
where
M[X] = I − X(X'X)^(-1)X'.
Therefore, the OLS residual vector, e, is the orthogonal projection of y onto the space orthogonal to that spanned by the regressors, X⊥. For this reason M[X] is called the residual maker. From (3.3.2) and (3.3.5) it follows that
ŷ = P[X] y
and so the vector of OLS predicted (fitted) values, ŷ, is the orthogonal projection onto the space spanned by the regressors, R(X). Clearly, e'ŷ = 0 (see exercise 6 or also equation (3.3.4)), therefore the OLS method decomposes y into two orthogonal components
(3.5.4)   y = ŷ + e.


3.5.1. Stata implementation: an important post-estimation command, predict.
In applications ŷ and e are useful for a number of purposes. They can be obtained as new variables in your Stata data set using the post-estimation command predict. The way predict works is simple. Imagine you have just executed your regress instruction. To have now ŷ in your data as a variable called, say, y_hat, just write from the command line:
predict y_hat
You have thereby created a new variable with name y_hat that contains the ŷ values. Fitted values are the default calculation of predict; if you want residuals just add the res option:
predict resid, res
and you have got a new variable in your data called resid that contains the e values.
It is important to stress that predict supports any estimation command, not only regress. So it can be implemented, for example, after xtreg in the context of panel data.

3.6. Partitioned regression

Partition X = (X1 X2) and, accordingly, b = (b1', b2')'.

Exercise 8. Prove that if X is of full column rank, so are X1 and X2 .

Exercise 9. Prove that if X is of full column rank, then M[X1]X2 is of f.c.r. Solution: I prove the result by contradiction and assume that P[X1]X2 b = X2 b for some vector b ≠ 0. Therefore, X1 a = X2 b, where a = (X1'X1)^(-1)X1'X2 b, or Xc = 0, where c' = (a', −b'), which leads to a contradiction since c ≠ 0 and X is of f.c.r.


Reformulate the normal equations accordingly:
[ X1' ; X2' ] y − [ X1'X1 , X1'X2 ; X2'X1 , X2'X2 ] [ b1 ; b2 ] = 0
or
(3.6.1)   X1'y − X1'X1 b1 − X1'X2 b2 = 0
(3.6.2)   X2'y − X2'X1 b1 − X2'X2 b2 = 0.
Solving the first system of equations for b1,
(3.6.3)   b1 = (X1'X1)^(-1) X1'(y − X2 b2),
which shows at once the important result that if X1 and X2 are orthogonal, then
b1 = (X1'X1)^(-1) X1'y
and
b2 = (X2'X2)^(-1) X2'y,
that is b1 (b2) can be obtained by the reduced OLS regression of y on X1 (X2).


In general situations things are not so simple, but it is still possible to work out both
components of b in closed form solution. Replace the right hand side of equation (3.6.3) into
the second system (3.6.2) to obtain
X20 y

X20 X1 X10 X1

X10 (y

X 2 b2 )

X20 X2 b2 = 0

or equivalently, using the orthogonal projector notation P[X1 ] for X1 (X10 X1 )


X20 y

X20 P[X1 ] (y

X 2 b2 )

X20 X2 b2 = 0.

X10 ,

3.6. PARTITIONED REGRESSION

33

Rearrange the foregoing equation


X20 I

P[X1 ] y + X20 P[X1 ]

I Xb2 = 0

so that eventually
b2 = X20 M[X1 ] X2

X20 M[X1 ] y.

b1 = X10 M[X2 ] X1

X10 M[X2 ] y.

By symmetry,

By inspecting either formula it emerges that b2 (b1 respectively) can be obtained by a


regression where the dependent variable is the residual vector obtained by regressing y on
X1 (X2 ) and the regressors are the residuals obtained from the regressions of each column
of X2 (X1 ) on X1 (X2 ). This important result is known in the econometric literature as
Frisch-Waugh-Lovell Theorem.
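The theorem is easy to see at work in Stata. The sketch below (built-in auto data, with purely illustrative variable choices) partials the other regressors out of both y and one regressor and shows that the residual-on-residual regression reproduces the coefficient from the full regression.

sysuse auto, clear
regress price weight length foreign      // full regression: note the coefficient on weight
quietly regress price length foreign     // residualize y on X1 = (1, length, foreign)
predict ey, res
quietly regress weight length foreign    // residualize the X2 column (here weight) on X1
predict ew, res
regress ey ew, noconstant                // same coefficient on weight as in the full regression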
Since b exists, so do its components, which proves that X1'M[X2]X1 and X2'M[X1]X2 are non-singular when X is of full column rank. This result can also be verified by direct inspection, as suggested by the following exercise.
Exercise 10. Prove that X2'M[X1]X2 is positive definite if X is of f.c.r. (hint: use exercise 9 to prove that M[X1]X2 is of full column rank and then the fact that M[X1] is symmetric and idempotent). Solution: By exercise 9, M[X1]X2 a ≠ 0 if a ≠ 0. Let z = M[X1]X2 a; hence a'X2'M[X1]X2 a = z'z is a sum of squares with at least one positive element. Therefore, a'X2'M[X1]X2 a > 0.

Exercise 11. Partitioning X = (X1 1), where 1 is the (n × 1) vector of all unity elements, prove that M[1] = I − 1(1'1)^(-1)1' transforms all variables in deviations from their sample means, and so that the OLS estimator b1 can be obtained by regressing y demeaned on X1 demeaned.


The following result on the decomposition of orthogonal projectors into orthogonal components will be useful on a number of occasions later on.
Lemma 12. Given the partitioning X = (X1 X2), the following representation of P[X] holds
(3.6.4)   P[X] = P[X1] + P[M[X1]X2],
with both terms on the right hand side of (3.6.4) being orthogonal.¹

3.6.1. Add one regressor. Given the initial regressor matrix, X, include an additional regressor z (z is a non-zero (n × 1) vector), so that there is a larger regressor matrix, W, partitioned as W = (X z).
Now, consider the OLS estimator from the regression of y on W,
(d' c)' = (W'W)^(-1) W'y,
with the resulting OLS residual
(3.6.5)   u = y − W(d' c)' = y − Xd − zc,

¹Equation (3.6.4) can be proved directly using the formula for the inverse of a 2 × 2 partitioned matrix, or indirectly, but more easily, by using the FWL Theorem and noticing that, for any non-null y and X = (X1 X2) of f.c.r., P[X]y = X1 b1 + X2 b2, where b1 = (X1'X1)^(-1)X1'(y − X2 b2) and b2 = (X2'M[X1]X2)^(-1)X2'M[X1]y. You just have to plug the right hand sides of b1 and b2 into the right hand side of P[X]y = X1 b1 + X2 b2, work through all the algebraic simplifications and eventually notice that the equation you end up with, P[X]y = (P[X1] + P[M[X1]X2]) y, must hold for any non-null y, so that P[X] = P[X1] + P[M[X1]X2].


and the formula for c in closed form solution, obtained as a specific application of the general analysis of the previous section,
(3.6.6)   c = (z'M[X]z)^(-1) z'M[X]y,
where M[X] = I − X(X'X)^(-1)X'.

Exercise 13. Derive the formula for d in closed form solution.

We wish to prove that the residual sum of squares always decreases when one regressor is
added to X, i.e. given e defined as in (3.3.3) u0 u < e0 e. To this purpose, it is convenient to
express d as in equation (3.6.3)
1

d = X 0X

X 0 (y

zc) .

Replacing the foregoing equation into (3.6.5) yields


u = y

X X 0X

X 0 (y

= y

X X 0X

X 0y + X X 0X

= M[X] y

zc)

zc
1

X 0 zc

zc

M[X] zc.

Since M[X] y = e (from (3.5.3), remember that M[X] is the residual maker)
u=e

M[X] zc,

so that the residual sum of squared for the enlarged regression is


u 0 u = e0 e
(3.6.7)

= e0 e

e0 M[X] zc

cz0 M[X] e + c2 z0 M[X] z

2cz0 M[X] e + c2 z0 M[X] z.


Then, replacing e with M[X]y in the second addendum of equation (3.6.7) and considering that M[X] is idempotent gives
u'u = e'e − 2cz'M[X]y + c²z'M[X]z.
Finally, replace c from (3.6.6) into the foregoing equation to have
u'u = e'e − 2 (z'M[X]y)²/(z'M[X]z) + (z'M[X]y)²/(z'M[X]z)
(3.6.8)   = e'e − (z'M[X]y)²/(z'M[X]z)
and since (z'M[X]y)²/(z'M[X]z) > 0,
(3.6.9)   u'u < e'e.

Exercise 14. How would formulas for c, d and u0 u change if the new regressor z is
orthogonal to X?

3.6.2. The squared coefficient of partial correlation (not covered in class, but good and easy exercise). The squared coefficient of partial correlation between y and z, r²yz,
r²yz = (z'M[X]y)² / (z'M[X]z · y'M[X]y),
measures the extent to which y and z are related net of the variation of X. In this sense it is closely related to c and indeed
c = r²yz · y'M[X]y / (z'M[X]y).
Moreover, by (3.5.3), y'M[X]y = e'e, and hence given equation (3.6.8) we have
u'u = e'e (1 − r²yz).
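A hedged numerical check of these relations (built-in auto data, with arbitrary variable roles: y = price, z = weight, X = (1, length)):

sysuse auto, clear
quietly regress price length            // short regression: e = M[X]y
scalar ee = e(rss)                      // e'e
predict ey, res
quietly regress weight length           // M[X]z
predict ez, res
quietly correlate ey ez
scalar r2yz = r(rho)^2                  // squared partial correlation of price and weight given length
quietly regress price length weight     // long regression: u'u = e(rss)
display "u'u = " e(rss) "   e'e(1 - r2yz) = " ee*(1 - r2yz)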


3.7. Goodness of fit and the analysis of variance


Assume that the unity vector, 1, is part of the regressor matrix X. Total variation in y can be expressed by the following sum of squares, referred to as the Sum of Squared Total deviations,
SST = Σ_{i=1}^n (yi − ȳ)²
or, given what was established in exercise 11,
SST = y'M[1]y.
Notice that SST is the sample variance of y, SST/(n − 1), times the appropriate degrees-of-freedom correction n − 1. Incidentally, the degrees-of-freedom correction in the sample variance is just n − 1 and not n, since M[1]y are the residuals from the regression of y on 1 (see exercise 11) and so there could be no more than n − 1 linearly independent vectors in the space to which M[1]y belongs, 1⊥. In fact, since rank(1) = 1, then given equation (3.4.1), dim(1⊥) = n − 1.
By the orthogonal decomposition for y in (3.5.4),
M[1]y = M[1]ŷ + M[1]e.
But since e and X are orthogonal and X contains 1, it follows that 1'e = 0, thereby
(3.7.1)   M[1]e = e
and
M[1]y = M[1]ŷ + e.
Then,
SST = ŷ'M[1]ŷ + 2e'M[1]ŷ + e'e.


But since equation (3.7.1) holds, ŷ = Xb, and e and X are orthogonal, then e'M[1]ŷ = e'ŷ = e'Xb = 0 and so SST simplifies to
SST = ŷ'M[1]ŷ + e'e.
Throughout, I stick to Greene's acronyms and refer to ŷ'M[1]ŷ as SSR (regression sum of squares) and e'e as SSE (error sum of squares). Notice, however, that the name error sum of squares may be misleading in contexts where the distinction between random errors and estimated residuals is crucial, given that e'e actually represents the part of total variation explained by OLS residuals and not errors. For this reason, I continue to refer to e'e as the residual sum of squares and sometimes abbreviate it with Greene's acronym SSE.
As it happens for SST, SSE is the sample variance of residuals times the appropriate degrees-of-freedom correction, n − k. Again, the degrees-of-freedom correction in the sample variance is just n − k and not n since in the residual space, X⊥, there could be no more than n − k linearly independent vectors. This follows from the assumption that X is of full column rank, thereby rank(X) = k; then, given equation (3.4.1), dim(X⊥) = n − k.

The coefficient of determination, R², is defined as
(3.7.2)   R² = SSR/SST = ŷ'M[1]ŷ / y'M[1]y
and since ŷ'M[1]ŷ = y'M[1]y − e'e,
R² = 1 − e'e / (y'M[1]y).
Therefore, if the constant term is included in the regression, 0 ≤ R² ≤ 1 and R² measures the portion of total variation in y explained by the OLS regression; in this sense R² is a measure of goodness of fit². There are two interesting extreme cases. If all regressors, apart from 1, are null vectors, then ŷ lies in the space spanned by 1 and M[1]ŷ = 0, so that eventually R² = 0. Only the constant term has explanatory power in this case, and the regression is a horizontal line with intercept equal to the sample mean of y. If y lies already in R(X), then y = ŷ (and also e'e = 0) and R² = 1, a perfect (but useless) fit³.
A problem with the R² measure is that it never decreases when a regressor is added to X (this emerges straightforwardly from the R² formula and the inequality in (3.6.9)) and in principle one can obtain an artificially high R² by inflating the model with regressors (the extreme case of R² = 1 is attained if n = k, since in this case y ends up lying in R(X)). This problem may be obviated by using the corrected R², R̄², defined by including in the formula of R² the appropriate degrees-of-freedom corrections:
R̄² = 1 − [SSE/(n − k)] / [SST/(n − 1)].
In fact, R̄² does not necessarily increase when one more regressor is added.

²If the constant term is not included in the regression then (3.7.1) does not hold and R² may be negative.
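After regress, both measures can be recomputed from the saved results described in Section 3.3.3; a minimal sketch (the built-in auto dataset is used purely as an example):

sysuse auto, clear
quietly regress price weight length
scalar SSE = e(rss)
scalar SST = e(mss) + e(rss)
display "R2     = " 1 - SSE/SST "   e(r2)   = " e(r2)
display "adj R2 = " 1 - (SSE/e(df_r))/(SST/(e(N) - 1)) "   e(r2_a) = " e(r2_a)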
Exercise 15. Prove that, given W, u and e defined as in section 3.6.1, the coefficient of determination resulting from the regression of y on W is
R²_W = R² + (1 − R²) r²yz,
where R² is the goodness-of-fit measure from the reduced regression.
Exercise 16. Prove that
R̄² = 1 − [(n − 1)/(n − k)] (1 − R²).

3.8. Centered and uncentered goodness-of-fit measures


Consider the OLS regression of y on the sample regressor matrix X and let b denote the OLS vector. The centered and uncentered R-squared measures (see Hayashi (2000), p. 20) for this regression are defined as
(3.8.1)   R² = ŷ'M[1]ŷ / y'M[1]y = b'X'M[1]Xb / y'M[1]y = y'P[X]M[1]P[X]y / y'M[1]y
and
(3.8.2)   Ru² = ŷ'ŷ / y'y = b'X'Xb / y'y = y'P[X]y / y'y,
respectively. It is easy to prove that 0 ≤ Ru² ≤ 1 and
Ru² = 1 − e'e / y'y
whether or not the unity vector 1 is included in X. In fact, since y = Xb + e and X'e = 0, y'y = ŷ'ŷ + e'e. The same is not true for the centered measure. Indeed, 0 ≤ R² ≤ 1 and
(3.8.3)   R² = 1 − e'e / (y'M[1]y)
if and only if a) the constant is included, or b) all of the variables (y, X) have zero sample mean, that is M[1]y = y and M[1]X = X. Clearly in the latter case, R² = Ru².

³I'm maintaining throughout the obvious assumption that in any case y ∉ R(1). Why?
A convenient property of the centered R-squared, when 1 is included in X, is that it coincides with the squared simple correlation between y and ŷ, r²_{y,ŷ}, that is
(3.8.4)   R² = (ŷ'M[1]y)² / (ŷ'M[1]ŷ · y'M[1]y),
where R² is defined in (3.8.1) and ŷ = Xb. Given the definition of R² in (3.7.2), proving equation (3.8.4) boils down to proving ŷ'M[1]y = ŷ'M[1]ŷ, which can be accomplished easily along the following lines. Since y = ŷ + e, then
ŷ'M[1]y = ŷ'M[1](ŷ + e)
        = ŷ'M[1]ŷ + ŷ'M[1]e
        = ŷ'M[1]ŷ + ŷ'e
        = ŷ'M[1]ŷ,
where the third equality follows from M[1]e = e, since the constant is included, and the last from the orthogonality of ŷ and the OLS residuals.
This property is not shared by the uncentered R-squared, unless variables have zero sample means.

3.8.1. A convenient formula for R² when the constant is included (not covered in class, but good and easy exercise). Partitioning X as X = (X̃ 1) and using Lemma 12 gives P[X] = P[1] + P[M[1]X̃], which replaced into (3.8.1) gives in turn
R² = y'(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) y / y'M[1]y.
But
(P[1] + P[M[1]X̃]) M[1] (P[1] + P[M[1]X̃]) = P[1]M[1]P[1] + P[M[1]X̃]M[1]P[1] + P[1]M[1]P[M[1]X̃] + P[M[1]X̃]M[1]P[M[1]X̃] = P[M[1]X̃],
so that eventually
(3.8.5)   R² = y'P[M[1]X̃]y / y'M[1]y,


which proves at once that R² defined in (3.8.1) can also be obtained as the uncentered R-squared from the OLS regression of M[1]y on M[1]X̃, namely the OLS regression of y in mean-residuals and X̃ in mean-residuals.

CHAPTER 4

The finite-sample statistical properties of OLS


4.1. Introduction
This chapter is on the finite-sample statistical properties of OLS applied to the LRM.
Finite-sample means that we focus on a fixed sample size n as opposed to n → ∞, a case that
will be covered in Chapter 7. We will learn under what assumptions on the LRM and in which
sense the estimator is optimal. We will also learn how to test linear restrictions on the model
parameters. Finally, we will study an important case of inaccuracy for the OLS, which is the
omitted-variables problem.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using
the data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).

4.2. Unbiasedness
Under LRM.1-LRM.3, OLS is unbiased, that is E(b) = β.
This is proved as follows. From LRM.1, LRM.2 and the OLS formula in (3.3.2),
(4.2.1)   b = β + (X'X)^(-1)X'ε.
From LRM.3, then,
E(b|X) = β + (X'X)^(-1)X'E(ε|X) = β.
Finally, using the law of iterated expectations,
E(b) = E_X[E(b|X)] = E_X[β] = β.
Notice that unbiasedness does not follow if we replace LRM.3 with the weaker LRM.3b.
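Unbiasedness can also be illustrated with a small Monte Carlo sketch (an assumed setup for illustration only, not the author's statistics_OLS.do): with the slope set to 2 in the data-generating process, the average of the OLS slope over many replications is very close to 2.

program define olsdraw, rclass
    drop _all
    set obs 50
    generate x = rnormal()
    generate y = 1 + 2*x + rnormal()     // true slope is 2, intercept is 1
    regress y x
    return scalar b_x = _b[x]
end
set seed 2015
simulate b_x = r(b_x), reps(2000) nodots: olsdraw
summarize b_x                            // the mean of b_x is close to 2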

4.3. The Gauss-Markov Theorem

Let's work out the conditional and unconditional covariance matrices for OLS under LRM.1-LRM.4.
I start with Var(b|X). Since
Var(b|X) = E[(b − β)(b − β)'|X],
then, given equation (4.2.1),
Var(b|X) = E[(X'X)^(-1)X'εε'X(X'X)^(-1)|X] = (X'X)^(-1)X'E(εε'|X)X(X'X)^(-1) = σ²(X'X)^(-1),
where the last equality follows from LRM.3 and LRM.4.
Now I turn to Var(b). Using the law of decomposition of variance,
Var(b) = E_X[Var(b|X)] + Var_X[E(b|X)] = σ² E_X[(X'X)^(-1)],
where the last equality follows from E(b|X) = β.


Next I prove that OLS has the smallest, in a sense that will be clarified soon, covariance matrix in the class of linear unbiased estimators.
I define the following partial order in the space of the l × l symmetric matrices:
Definition 17. Given any two l × l symmetric matrices A and B, A is said to be no smaller than B if and only if A − B is non-negative definite (n.n.d.).
Given the partial order of Definition 17, OLS has the smallest conditional covariance matrix in the class of linear unbiased estimators. This important result is universally known as the Gauss-Markov Theorem. To prove it, define the generic member of the foregoing class as
bo = Cy,
where C is a generic k × n matrix that depends only on the sample information in X and such that CX = Ik (how would you explain the last requirement?). OLS is of course a member of the class, with its own C equal to C_OLS = (X'X)^(-1)X'. It is not hard to prove that Var(bo|X) = σ²CC'. Define now D = C − C_OLS; then DX = 0 and so
Var(bo|X) = σ² [D + (X'X)^(-1)X'] [D' + X(X'X)^(-1)] = Var(b|X) + σ²DD'.
Since σ²DD' is n.n.d., according to Definition 17, Var(b|X) is no greater than the variance of any linear unbiased estimator.


The natural question arises of whether the partial order of Definition 17 is of any relevance in real-world applications. It is, since it readily translates into the total order of
real numbers, which is the domain of the variances of random scalars.
no smaller than B, then r0 (A

B) r

Indeed, if A is

0, for any conformable r. But then, according

to the Gauss-Marcov Theorem, we can say that any linear combination of the components

4.3. THE GAUSS-MARCOV THEOREM

46

of b, r0 b, has smaller conditional variance than r0 bo . Formally, the theorem implies that
r0 [V ar (bo |X)

V ar (b|X)] r

0. Then, V ar (r0 b|X) = r0 V ar (b|X) r and V ar (r0 bo |X) =

r0 V ar (bo |X) r and hence V ar (r0 bo |X)

V ar (r0 b|X) . The importance of this hinges upon

the fact that in empirical applications we are interested in the linear combinations of population coecients, as in the following example, where it is shown that the estimates of individual
coecients can always be expressed as specific linear combinations of the k components of the
estimators.

Example 18. On noticing that bi = r0i b and boi = r0i bo , i = 1, ..., k, where ri is the
k 1 vector with all zero elements except the i.th entry, which equals unity, and given the
Gauss-Marcov Theorem, we conclude that V ar (boi |X)

V ar (bi |X) , i = 1, ..., k.

In general, we have that the OLS estimator of any linear combination r'β is given by r'b and, as the foregoing discussion demonstrates, under LRM.1-LRM.4 r'b is BLUE (you can easily verify that E(r'b) = r'β).
Now we prove that the Gauss-Markov Theorem extends to the unconditional variances. From
Var(bo|X) = Var(b|X) + σ²DD',
we have
E_X[Var(bo|X)] = E_X[Var(b|X)] + σ²E_X(DD'),
or
Var(bo) = Var(b) + σ²E_X(DD'),
and since E_X(DD') is n.n.d., we can also state that the unconditional variance of OLS is no greater than that of any linear unbiased estimator.


4.4. Estimating the covariance matrix of OLS


Since σ² is unknown, so are Var(b|X) and Var(b). Unbiased estimators of Var(b|X) and Var(b), therefore, require an unbiased estimator of σ². We now prove that, under LRM.1-LRM.4, the residual sum of squares divided by the appropriate degrees-of-freedom correction, s² = e'e/(n − k), is one such estimator, namely E(s²) = σ².
From e = M[X]y and LRM.1 it follows that e = M[X]ε. Hence,
(4.4.1)   E(s²|X) = [1/(n − k)] E(ε'M[X]ε|X).
Since ε'M[X]ε is a scalar, ε'M[X]ε = tr(ε'M[X]ε) and so, by the permutation rule of the trace of a matrix product, ε'M[X]ε = tr(ε'M[X]ε) = tr(M[X]εε'). Replacing the right hand side of the foregoing equation into equation (4.4.1) yields
E(s²|X) = [1/(n − k)] E[tr(M[X]εε')|X].
Then, exploiting the fact that both trace and expectation are linear operators,
(4.4.2)   E(s²|X) = [1/(n − k)] tr[E(M[X]εε'|X)] = [1/(n − k)] tr[M[X]E(εε'|X)] = [σ²/(n − k)] tr(M[X]),
where the last equality follows from LRM.3 and LRM.4. Now, focus on tr(M[X]):
tr(M[X]) = tr[In − X(X'X)^(-1)X'] = tr(In) − tr[(X'X)^(-1)X'X] = n − k,
and so the numerator and denominator in (4.4.2) simplify to have E(s²|X) = σ². Finally, by the law of iterated expectations,
E(s²) = σ².
With s² at hand we can get an unbiased estimator for Var(b). It is
V̂ar(b) = s²(X'X)^(-1).
In fact
E[V̂ar(b)|X] = σ²(X'X)^(-1) = Var(b|X),
and since E[V̂ar(b)] = E_X{E[V̂ar(b)|X]} by the law of iterated expectations, then
E[V̂ar(b)] = σ² E_X[(X'X)^(-1)] = Var(b).
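In Stata, s² and s²(X'X)^(-1) are exactly what regress stores after estimation. A minimal check (auto data used for illustration only):

sysuse auto, clear
quietly regress price weight length
display "s^2 = e(rss)/e(df_r) = " e(rss)/e(df_r) "   = Root MSE squared: " e(rmse)^2
matrix list e(V)                                      // s^2 (X'X)^(-1) as reported by regress
mata:
y = st_data(., "price")
X = st_data(., ("weight", "length")), J(st_nobs(), 1, 1)
e = y - X*invsym(cross(X, X))*cross(X, y)             // OLS residuals
s2 = cross(e, e)/(rows(X) - cols(X))                  // e'e/(n - k)
s2 * invsym(cross(X, X))                              // matches e(V) (order: weight, length, _cons)
end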

4.5. Exact tests of significance with normally distributed errors


Assume LRM.5: ε|X ∼ N(0, σ²I). Since equation (4.2.1) holds,
b|X ∼ N(β, σ²(X'X)^(-1)).
Also, since e = M[X]ε, e|X ∼ N(0, σ²M[X]). Using a result in Rao (1973), it is also possible to prove at once that b and e are jointly normal with zero covariances, conditional on X. Specifically, since
(b ; e) = (β ; 0) + [ (X'X)^(-1)X' ; M[X] ] ε,
then by (8a.2.9) in Rao (1973) we have
(b ; e) |X ∼ N( (β ; 0), σ² [ (X'X)^(-1)X' ; M[X] ] [ X(X'X)^(-1) , M[X] ] )
or
(b ; e) |X ∼ N( (β ; 0), [ σ²(X'X)^(-1) , 0 ; 0 , σ²M[X] ] ).
Therefore, being normally distributed with zero conditional covariances, conditional on X, b and e are also independent, conditional on X. This general result is important and therefore stated as a theorem for future reference.
Theorem 19. Assume LRM.5. Then b and e are independent, conditional on X.
Exercise 20. Verify, by direct computation of Cov(b, e|X), that Cov(b, e|X) = 0 (a k × n null matrix).
Solution: Since E(e|X) = 0 (verify), then¹
Cov(b, e|X) = E[(b − β) e'|X]
or
Cov(b, e|X) = E[(X'X)^(-1)X'εε'M[X] |X] = (X'X)^(-1)X'E(εε'|X)M[X] = σ²(X'X)^(-1)X'M[X] = 0.
Exercise 21. Verify, by direct computation of Var(e|X), that Var(e|X) = σ²M[X].
Exercise 22. Is Var[(b ; e)|X] non-singular? Why or why not?

¹In general the matrix of conditional covariances between two random vectors x and y, conditional on z, is E{[x − E(x|z)][y − E(y|z)]'|z}.

4.5. EXACT TESTS OF SIGNIFICANCE WITH NORMALLY DISTRIBUTED ERRORS

50

4.5.1. Testing single linear restrictions. Let (X'X)⁻¹_ii stand for the i-th main diagonal
element of (X'X)⁻¹; then

b_i|X ~ N(β_i, σ²(X'X)⁻¹_ii)

and, given the properties of the normal distribution, b_i can be standardized to have

(4.5.1)    (b_i − β_i)/√(σ²(X'X)⁻¹_ii) |X ~ N(0, 1),    i = 1, ..., k.

Were σ² known, the above statistics could be used to test hypotheses on β_i by replacing the
unknown β_i with a value of interest fixed by the researcher. For example, to test H0: β_i = β_i,0,
where β_i,0 is the value of interest fixed by the researcher, one would use, under H0,

(b_i − β_i,0)/√(σ²(X'X)⁻¹_ii) ~ N(0, 1).

The problem is, of course, that σ² is generally unknown and so the foregoing approach cannot
be used as it is. With some adjustment we can make it operational, though. Just replace σ²
with s² in the expression for the standardized b_i to get

(4.5.2)    t_i = (b_i − β_i)/√(s²(X'X)⁻¹_ii)

and then prove that t_i has a t distribution with n − k degrees of freedom. The denominator
term in expression (4.5.2) is the standard error estimate for coefficient b_i.

First, notice that since s² = e'e/(n − k) = ε'M[X]ε/(n − k),

(4.5.3)    (n − k)s²/σ² = (ε/σ)'M[X](ε/σ).

Now, consider the following distributional result:

if z ~ N(0, I) and A is a conformable idempotent matrix, then z'Az ~ χ²(p) with p = rank(A).

Since ε/σ ~ N(0, I) and M[X] is idempotent, (n − k)s²/σ² is an idempotent quadratic form in a
standard normal vector and, in light of the foregoing distributional result, has a chi-squared
distribution with degrees of freedom equal to rank(M[X]) = n − k:

(n − k)s²/σ² ~ χ²(n − k).

Since Theorem 19 holds, any function of b is independent of any function of e, conditionally
on X, hence (b_i − β_i)/√(σ²(X'X)⁻¹_ii) and (n − k)s²/σ² are conditionally independent. Further,

t_i = [(b_i − β_i)/√(σ²(X'X)⁻¹_ii)] / √{[(n − k)s²/σ²]/(n − k)},

therefore, in light of the following second distributional result,

if z ~ N(0, 1), x ~ χ²(p) and z and x are independent, then z/√(x/p) has a t distribution with p degrees of freedom,

it follows that t_i|X ~ t(n − k), i = 1, ..., k.

Finally, since the t distribution does not depend on the sample information and, specifically,
on X, t_i and any component of X are statistically independent, so that the above holds also
unconditionally, that is t_i ~ t(n − k), i = 1, ..., k.

Often we wish to test hypotheses involving linear combinations of β, r'β, where r is a k×1
vector of known constants.

Example 23. If we are estimating a two-input Cobb-Douglas production function and β_1 and β_2
are the product elasticities of the two inputs, the hypothesis of constant returns to scale is
clearly important, so that our null is β_1 + β_2 = 1.

In general we express the null involving a linear combination of population coefficients as
H0: r'β − q = 0, where q is a known constant. In the Cobb-Douglas example r = (1 1)' and q = 1.

The OLS estimator for r'β is r'b, which is normally distributed conditional on X:
r'b|X ~ N[r'β, σ²r'(X'X)⁻¹r]. Therefore we have the following t test

(r'b − q)/√[s²r'(X'X)⁻¹r] ~ t(n − k),

which can be used to test H0: r'β − q = 0.
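In Stata, tests on a single linear combination r'β after OLS are available directly. A minimal
sketch, with hypothetical variables lnq, lnk and lnl for a Cobb-Douglas production function in
logs (the names are illustrative, not from the notes' data-sets):

regress lnq lnk lnl
lincom lnk + lnl - 1          // t test of H0: beta_k + beta_l = 1 (constant returns to scale)
test lnk + lnl = 1            // the same hypothesis as an F(1, n-k) test; here F = t^2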

4.5.1.1. Stata implementation. For the sake of exposition, I report here the regress output
already seen in Section 3.3.3.
. use "/Users/giovanni/didattica/Greene/dati/ch. 1/1_1.dta"
. regress c y

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  959.92
       Model |  64435.1192     1  64435.1192           Prob > F      =  0.0000
    Residual |  537.004616     8   67.125577           R-squared     =  0.9917
-------------+------------------------------           Adj R-squared =  0.9907
       Total |  64972.1238     9  7219.12487           Root MSE      =   8.193

           c |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           y |   .9792669    .031607    30.98   0.000     .9063809    1.052153
       _cons |  -67.58063   27.91068    -2.42   0.042    -131.9428   -3.218488
The OLS coefficient estimates, b, are displayed in the first column (labeled Coef.). The
second column reports the standard error estimate peculiar to each OLS coefficient,
se_i = √(s²(X'X)⁻¹_ii), i = 1, ..., k. The third column reports the values of the t statistics for
the k null hypotheses β_i = 0, i = 1, ..., k:

t_i = b_i/√(s²(X'X)⁻¹_ii).

The test is two-sided in that the alternative is H1: β_i > 0 or β_i < 0. The fourth column
reports the so-called p-value for the two-sided t-test. It is defined as the probability that
a t-distributed random variable is more extreme than the outcome of t_i in absolute value:
Pr[(t < −|t_i|) ∪ (t > |t_i|)], or more compactly Pr(|t| > |t_i|). Clearly, if the p-value is smaller
than the chosen size of the test (the Type I error), then t_i falls for sure into the critical region
and we reject the null at the chosen size. In other words, the p-value indicates the lowest size of
the critical region (the lowest Type I error) we could have fixed to reject the null, given the test
outcome. In this sense, the p-value is more informative than critical values. In the regress
example, if we choose a critical region of 5% size, given that Pr(|t| > 2.42) = 0.042 < 0.05,
we can reject at 5% that the constant term is equal to zero, knowing that we could also have
rejected at, say, 4.5%, but not at 1%. A 1% size is smaller than the test p-value, which is
the minimum size allowing rejection, and for this reason we cannot reject at 1%. This is a clear
case of borderline significance, one which we could not have identified with such precision by
simply looking at the 5% critical values. On the other hand, the p-value for the coefficient
on y is virtually zero (as low as 0.000). This therefore indicates that no matter how
conservative we are towards the null, we can reject it at any conventional level of significance
(conventional sizes, with an increasing degree of conservativeness, are 10%, 5%, 1%) and also
at a less conventional 0.1% (since 0.001 > 0.000).
4.5.2. From tests to confidence intervals. Let us fix the α·100% critical region for our
two-sided t test of the null H0: β_i = β_i,0 against the alternative H1: β_i ≠ β_i,0, and let t_{α/2} be
the corresponding critical value: Pr(t < −t_{α/2} ∪ t > t_{α/2}) = α. Then the probability
of not rejecting the null when it is true is (1 − α). Formally,

Pr(−t_{α/2} < (b_i − β_i)/se_i < t_{α/2}) = Pr(−se_i t_{α/2} < b_i − β_i < se_i t_{α/2})
    = Pr(b_i − se_i t_{α/2} < β_i < b_i + se_i t_{α/2}) = (1 − α).

This proves that (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) is a (1 − α)·100% confidence interval for β_i.
But the interval (b_i − se_i t_{α/2}, b_i + se_i t_{α/2}) contains all of the null hypotheses
that we cannot reject at α·100%. So while a given t test is informative only for
the specific null it is testing, the confidence interval conveys much more information
to the researcher. The last column of the regress output reports the 95% confidence intervals for
each OLS coefficient.
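The interval reported by regress can be reproduced by hand. A minimal Stata sketch, assuming
the data-set 1_1.dta used above is still in memory:

regress c y
scalar b     = _b[y]
scalar se    = _se[y]
scalar tcrit = invttail(e(df_r), 0.025)      // t_{alpha/2} with n-k degrees of freedom
display "95% confidence interval for the coefficient on y: (" b - se*tcrit ", " b + se*tcrit ")"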
Exercise 24. Your regression output for a given coefficient β_i yields b_i = −9.320 and
se_i = 1.760. 1) Compute the t-statistic for the null H0: β_i = 0. 2) In your regression
n − k = 334, which implies that t_{0.025} = 1.967, where t_{0.025}: Pr(t > t_{0.025}) = 0.025. Will you
reject or not H0: β_i = 0 against H1: β_i ≠ 0 at a significance level of 5%? Why? 3) Given
your answer to Question 2, do you expect that 0 belongs to the 95% confidence interval for
β_i? 4) Compute the 95% confidence interval for β_i. On the basis of the information from the
confidence interval alone, do you reject H0: β_i = −6 against H1: β_i ≠ −6 at 5%? Why? 5)
Using only your answers to Question 4, can you assert that the p-value of that test is greater
than 0.05? Also, do you expect the absolute value of the t statistic for H0: β_i = −6 to
be greater or smaller than 1.967? Why? Verify your answer by directly computing the value of
the t statistic for H0: β_i = −6. 6) Consider now the test of H0: β_i ≤ 0 against H1: β_i > 0
with a 5% significance level. Is the critical value for this test equal to, smaller or greater than
1.967?

Answer: 1) t_i = −5.295. 2) Reject, because 5.295 > 1.967. 3) No, since H0: β_i = 0
is rejected at 5%. 4) (−12.782, −5.858). No, since −6 ∈ (−12.782, −5.858). 5) Yes: since
H0: β_i = −6 is not rejected at 5%, t_i falls within the acceptance region and so the test
p-value > 0.05. Since t_i falls within the acceptance region, the value of |t_i| must be smaller
than t_{0.025} = 1.967. Indeed, t_i = −1.886. 6) Smaller: since the test is one-sided, the critical
value is t_{0.05}.
Exercise 25. Your regression output for a given coefficient β_i yields b_i = 6.668 with
se_i = 3.577. The outcome of the t-test for H0: β_i = 0 against H1: β_i ≠ 0 shows
p-value = 0.07. Can you reject the null at 10%? Can you at 5%?

4.5.3. Testing joint linear restrictions. We want to test jointly J linear restrictions:
H0: Rβ − q = 0, where R and q are, respectively, a J×k matrix and a J×1 vector of fixed
known constants and such that no row of R can be obtained as a linear combination of the
others, that is, R is of full row rank J.

Under the null,

Rb − q = R(b − β),

and so, given LRM.5,

Rb − q|X ~ N(0, σ²R(X'X)⁻¹R').

Then, using the distributional result that

given the p×1 random vector x ~ N(μ, Σ), (x − μ)'Σ⁻¹(x − μ) ~ χ²(p),

it has

W = (Rb − q)'[σ²R(X'X)⁻¹R']⁻¹(Rb − q) |X ~ χ²(J).

Again, σ² is not known and so W is unfeasible as a test for H0. We can go about as in the
previous section and replace σ² with s². In addition, then, divide the result by J to get the
statistic

F = (Rb − q)'[R(X'X)⁻¹R']⁻¹(Rb − q)/(Js²).

Now consider another distributional result:

given two independent random scalars x1 ~ χ²(p1) and x2 ~ χ²(p2), then (x1/p1)/(x2/p2) ~ F(p1, p2).

It is not hard to see that the above result can be applied to F, since it can be reformulated
as the ratio of two conditionally independent chi-squared random variables corrected by their
own degrees of freedom. In fact, at the numerator we have

(Rb − q)'[σ²R(X'X)⁻¹R']⁻¹(Rb − q)/J

and at the denominator s²/σ². Conditional on X, the former is a function of b alone, while the
latter is a function of e alone. Therefore, in light of Theorem 19 the two are conditionally
independent and so we can invoke the foregoing distributional result to establish F|X ~ F(J, n − k).

As with the t statistic, since the F distribution does not depend on the sample information,
we have that the above holds unconditionally: F ~ F(J, n − k).

When H0 is a set of J exclusion restrictions, q = 0 and each row of R has all zero
elements except unity in the entry corresponding to the parameter that is set to zero. For
example, with three parameters β = (β1 β2 β3)' and two exclusion restrictions β1 = 0 and
β3 = 0, then J = 2, q' = (0 0) and

R = [1 0 0; 0 0 1],

so that H0 can be formulated as

Rβ = (β1, β3)' = (0, 0)'.

The F-test can always be rewritten as a function of the residual sum of squares under the
unrestricted model, e'e, and the residual sum of squares under the model with the restrictions
imposed, say e*'e*:

F = [(e*'e* − e'e)/J] / [e'e/(n − k)].

This is proved for the case of exclusion restrictions by using Lemma 12.

Partition the sample regressor matrix as X = (X1 X2) and consider the F test for the set
of exclusion restrictions H0: β2 = 0:

F = b2'(X2'M[X1]X2)b2/(k2 s²).

Now apply the FWL Theorem to F to have

F = y'M[X1]X2(X2'M[X1]X2)⁻¹(X2'M[X1]X2)(X2'M[X1]X2)⁻¹X2'M[X1]y/(k2 s²)
  = y'M[X1]X2(X2'M[X1]X2)⁻¹X2'M[X1]y/(k2 s²).

The numerator of the right-hand side of the foregoing equation can be written more compactly
as y'P[M[X1]X2]y. Hence, by Lemma 12,

F = y'(P[X] − P[X1])y/(k2 s²)

or, adding and subtracting I_n within parentheses,

F = y'(M[X1] − M[X])y/(k2 s²) = [(e*'e* − e'e)/k2]/s².

It is not hard to prove that if the constant term is kept in both models, then

F = [(R² − R*²)/J] / [(1 − R²)/(n − k)],

where R² is the R-squared from the unrestricted model and R*² is the R-squared from the restricted
model.
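A minimal Stata sketch of the equivalences above, with hypothetical regressors x1, x2 and x3
(the restricted model drops x2 and x3):

regress y x1 x2 x3
test x2 x3                      // F test of the exclusion restrictions H0: beta_2 = beta_3 = 0
scalar rss_u = e(rss)
scalar r2_u  = e(r2)
scalar dfr   = e(df_r)
regress y x1                    // restricted model
scalar rss_r = e(rss)
scalar r2_r  = e(r2)
display "F (RSS form) = " ((rss_r - rss_u)/2)/(rss_u/dfr)
display "F (R2 form)  = " ((r2_u - r2_r)/2)/((1 - r2_u)/dfr)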


4.6. The general law of iterated expectation


The general form of the law of iterated expectations (LIE) can be stated as in Wooldridge
(2010), pp. 19-20.
LIE(scalar|vector): Given the random variable y and the random vectors w and x,
where x = f (w), then E (y|x) = E [E (y|w) |x].
Since the above result holds for any function f (), x can just be any subvector of w, as the
following example shows.
Example 26. Consider w = (w1 w2 w3)' and x = Aw, where

A = [1 0 0; 0 1 0];

then x = (w1 w2)' and so, by the general law of iterated expectations,

E(y|w1, w2) = E[E(y|w1, w2, w3)|w1, w2].
The law can, of course, be formulated in terms of conditional expectations of random
vectors.
LIE(vector|vector): Given the random vector y and the random vectors w and x,
where x = f(w), then E(y|x) = E[E(y|w)|x], where

E(y|x) = (E(y1|x), ..., E(yn|x))'    and    E(y|w) = (E(y1|w), ..., E(yn|w))'.

Remark 27. Notice that in the formulation of conditional expectations the way the conditioning
set is represented is just a matter of notational convenience. What matters are the
random scalars that enter the conditioning set and not the way they are organized therein.
For example, E(y|w1, w2, w3, w4) can equivalently be expressed as E(y|w') or E(y|w),
where w = (w1 w2 w3 w4)', or E(y|W), where

W = [w1 w3; w2 w4],

or through any other organization of (w1, w2, w3, w4).

Given Remark 27, the general LIE can be formulated with conditional expectations having
the conditioning set organized in the form of random matrices rather than random vectors, as
follows.

LIE(vector|matrix): Given the random vector y and the random matrices W and X,
where X = f(W), then E(y|X) = E[E(y|W)|X].

Paralleling the consideration made above, since f(·) is a generic function, from LIE(vector|matrix)
follows a special LIE for the case in which X is a submatrix of W. Therefore, given W =
(W1 W2), by LIE(vector|matrix) we always have

E(y|W1) = E[E(y|W)|W1]    and    E(y|W2) = E[E(y|W)|W2].

4.7. The omitted variable bias


If explanatory variables that are relevant in the population model are, for some reason, not
included in the statistical model (they may be intrinsically latent, such as individual skills;
the specific data-set in use may not report them; or, although observed and available,
the researcher may have failed to account for them in the model specification), then our OLS estimator
may suffer from what is known in the econometric literature as the omitted variable bias. Let's
see when and why.

Assume that the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied, and consider the RS mechanism

RS: There is a sample of size n from the population equation, such that the elements of
the sequence {(y_i x_{i1} x_{i2} ... x_{ik}), i = 1, ..., n} are independently identically distributed
(i.i.d.) random vectors.

So far we are in the classical regression framework, but now let x' = (x1' x2') with x1 being
a k1×1 vector and x2 a k2×1 vector, k = k1 + k2, and maintain that x2 is latent
or, in any case, not included in the statistical model, and let's explore the implications for the
statistical model. P.1 implies that

(4.7.1)    y = X1β1 + ν,

where ν = X2β2 + ε is now the new n×1 vector of latent realizations. If X is of f.c.r., then
X1 is of f.c.r. as well. So, in a sense, LRM.1 and LRM.2 continue to hold. But, as far as
LRM.3 and LRM.4 are concerned, nothing can be said. Specifically, we do not know whether
E(ν|X1) = 0 or Var(ν|X1) = ς²I_n. The first consequence is that the OLS estimator for β1,

b1 = (X1'X1)⁻¹X1'y,

is likely to be biased. Indeed, the bias can be easily derived as follows. Replacing the right-hand
side of equation (4.7.1) into the OLS formula yields

b1 = β1 + (X1'X1)⁻¹X1'ν = β1 + (X1'X1)⁻¹X1'X2β2 + (X1'X1)⁻¹X1'ε.

Since RS holds, ε_i, i = 1, ..., n, is conditional-mean independent of (x_{i1}' x_{i2}') and statistically
independent of (x_{j1}' x_{j2}'), j ≠ i = 1, ..., n. Therefore, E(ε|X) = 0, which implies that

E(b1|X) = β1 + (X1'X1)⁻¹X1'X2β2

and hence, by the law of iterated expectations, we have the unconditional bias

(4.7.2)    E(b1) − β1 = E[(X1'X1)⁻¹X1'X2β2].

There are two specific instances, however, in which the bias is zero.

The first instance is that analyzed in Greene (2008), when X1'X2 = 0_{k1×k2}. In this case
(X1'X1)⁻¹X1'X2β2 = 0 and so the bias in equation (4.7.2) becomes zero.

The second instance occurs if in the population E(x2'β2|x1) = 0, as I now show. Since in
the population E(ε|x) = 0, then by the general law of iterated expectations also E(ε|x1) = 0.
Hence, E(x2'β2 + ε|x1) = 0, which along with RS yields E(ν|X1) = 0. Therefore, the vector ν
in model (4.7.1) behaves like a conventional error term that satisfies LRM.3. The upshot is
that b1 is unbiased.

The two situations are not related. Clearly, E(X2β2|X1) = 0 does not imply X1'X2 =
0_{k1×k2}. But also the converse is not true, and X1'X2 = 0_{k1×k2} may hold even if E(X2β2|X1) ≠ 0,
as shown by the following example.

Example 28. Let y = x1β1 + x2β2 + ε with x1 a Bernoulli random variable:
Pr(x1 = 1) = π and Pr(x1 = 0) = 1 − π. Let also x2 = 1 − x1. While E(x2β2|x1) = (1 − x1)β2,
x1x2 = 0 with probability one. In this case, a random sampling of y, x1 and x2 from the
population will yield X1'X2 = 0 and E(X2β2|X1) ≠ 0 with probability one.

Be that as it may, the foregoing two instances of unbiasedness constitute a narrow case,
and in general omitted variables will bring about bias and inconsistency in the coefficient
estimates. Solutions are typically given by proxy variables, panel data estimators and instrumental
variables estimators. The first method is briefly described below, the classical panel
data estimators are pursued in Chapter 8, while IV methods are described in Chapter 10.

To conclude, observe that if relevant variables are omitted LRM.4 does not generally hold,
unless Var(x2'β2|x1) = ς² < +∞.
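A minimal simulation sketch of the omitted variable bias in Stata (all variable names and
parameter values below are illustrative):

clear
set obs 1000
set seed 12345
generate x1 = rnormal()
generate x2 = 0.8*x1 + rnormal()          // omitted regressor, correlated with x1
generate y  = 1 + 2*x1 + 1.5*x2 + rnormal()
regress y x1 x2                           // both coefficients estimated without bias
regress y x1                              // coefficient on x1 drifts towards 2 + 1.5*0.8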

Lemma 29. Given any two non-singular square matrices of the same dimension, A and
B, if A − B is n.n.d. then B⁻¹ − A⁻¹ is n.n.d.

The foregoing lemma signifies that, in the space of non-singular square matrices of a given
dimension, if A is no smaller than B, then A⁻¹ is no greater than B⁻¹. It is useful in
situations in which the difference of inverse matrices is more easily worked out than that of
the original matrices.
The following exercise asks you to think through the consequences of overfitting, namely
applying OLS to a statistical model with variables that are redundant in the population model.

Exercise 30. Assume the population model is

y = x'β + ε,

with x and β both k×1 vectors and P.1-P.4 satisfied. Assume also that the l×1 vector z of
observable variables is available, such that rank[E(ww')] = k + l, where w' = (x' z'). Also,
assume E(ε|x, z) = 0 and Var(ε|x, z) = σ², i.e. z is redundant in the population equation.
Finally, assume there is a sample of size n from the population, such that the elements of the
sequence {(y_i x_i' z_i'), i = 1, ..., n} are i.i.d. 1×(1 + k + l) random vectors. Applying the usual
notation for the sample variables, y is the (n×1) vector stacking the y_i, X is the (n×k) matrix
stacking the rows x_i', Z is the (n×l) matrix stacking the rows z_i' and ε is the (n×1) vector
stacking the ε_i. Answer the following questions. 1) Prove that the statistical model

y = Xβ + ε

satisfies LRM.1-LRM.4 (of course, this proves that

(4.7.3)    b = (X'X)⁻¹X'y

is BLUE). 2) Prove that the overfitting strategy of regressing y on X and Z yields an unbiased
estimator for β, and call it b_ofit. 3) Derive the covariance matrix of b_ofit. 4) Use Lemma 29
and verify that, indeed, the conditional covariance matrix of b_ofit is no smaller than that of
b in (4.7.3). 5) A byproduct of the overfitting strategy is the l×1 vector of OLS coefficients
for the variables in Z. Let's call it c. Express c, using the intermediate result in the FWL
Theorem, as

c = (Z'Z)⁻¹Z'(y − X b_ofit)

and prove that the overfitting residual vector e_ofit ≡ y − X b_ofit − Zc equals

e_ofit = M[M[Z]X]M[Z]y.

6) Find an unbiased estimator for σ² based on e_ofit.

Answer: 1) Obvious, since in the population and the sampling mechanism we have all
we need for the statistical properties LRM.1-LRM.4 to be true. 2) This is proved at once by
noting that from RS and E(ε|x, z) = 0, E(ε|X, Z) = 0. 3) Prove that Var(ε|X, Z) = σ²I and
then prove that

Var[(X'M[Z]X)⁻¹X'M[Z]y | X, Z] = σ²(X'M[Z]X)⁻¹.

4) Write X'M[Z]X as X'M[Z]X = X'X − X'P[Z]X and then verify that you have all that is needed to
invoke the lemma. 5) Easy, it's just algebra: replace b_ofit and c into e_ofit ≡ y − X b_ofit − Zc
and rearrange. 6) First, use the formula of the overfitting residual vector derived in the
previous question, M[M[Z]X]M[Z]y, to set up the estimator

s̃² = y'M[Z]M[M[Z]X]M[Z]y/(n − k − l).

Then, notice that

s̃² = ε'M[Z]M[M[Z]X]M[Z]ε/(n − k − l).

Finally, take the expectation of s̃² conditional on X and Z, using the trace trick, and follow the
same steps as when proving unbiasedness of the standard estimator s². In the derivation don't
forget that M[Z] and M[M[Z]X] are orthogonal projections.
4.7.1. The proxy variables solution. Assume for simplicity that there is only one
omitted variable x2 from the population equation

(4.7.4)    y = x1'β1 + x2β2 + ε,

where x1 is a k1×1 vector of observed regressors. Assume also that there is an l×1 vector of
observed variables z such that the following assumptions hold.

(1) The z variables are redundant in the population equation, that is E(y|x, z) = x'β.
(2) Once conditioning on z, the omitted variable x2 and the included explanatory variables x1
are independent in conditional mean: E(x2|x1, z) = E(x2|z). Also E(x2|z) = z'γ.
(3) rank E[(x1', z')'(x1' z')] = k1 + l. This is analogous to property P.2 in Chapter 1 and permits
identification of the coefficients in the proxy variable regression, as we will see below.

Assumption 2 implies that x2 can be written as

(4.7.5)    x2 = z'γ + ν,

where ν = x2 − E(x2|x1, z), and hence E(ν|x1, z) = 0. Replacing the right-hand side of equation
(4.7.5) into the population equation (4.7.4) yields

(4.7.6)    y = x1'β1 + z'(γβ2) + νβ2 + ε,

where E(νβ2|x1, z) = 0. Also, by the redundancy assumption E(y|x, z) = x'β, it follows that
E(ε|x, z) = 0 and so, by the general LIE,

E(ε|x1, z) = E[E(ε|x, z)|x1, z] = 0.

It follows that E(νβ2 + ε|x1, z) = 0 and so, along with P.1 and P.2 (given Assumption 3), also P.3
is satisfied for equation (4.7.6). With the following RS mechanism,

RS(x1, z): There is a sample of size n from the population, such that the elements of
the sequence {(y_i x_{i1} ... x_{ik1} z_{i1} ... z_{il}), i = 1, ..., n} are independently identically
distributed (i.i.d.) random vectors,

the resulting statistical model will satisfy LRM.1-LRM.3 and so yield unbiased OLS estimates.
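A minimal simulation sketch of the proxy-variable solution (variable names and the data-generating
process are illustrative, chosen so that Assumptions 1-3 hold with E(ability|x1, z) = E(ability|z) = z):

clear
set obs 2000
set seed 2015
generate z       = rnormal()                  // observed proxy
generate ability = z + rnormal()              // omitted variable; E(ability|z) = z
generate x1      = 0.5*z + rnormal()          // included regressor, related to ability only through z
generate y       = 1 + 2*x1 + ability + rnormal()
regress y x1           // omitted-variable bias: coefficient on x1 drifts away from 2
regress y x1 z         // proxy regression: coefficient on x1 is unbiased for 2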

4.8. The variance of an OLS individual coefficient

Suppose that attention is centered on a given explanatory variable whose observations
are collected into the (n×1) column vector x_i, and that there are k − 1 control variables
collected into the n×(k − 1) matrix X_{-i}. Without loss of generality partition the (n×k)
regressor matrix as X = (X_{-i} x_i) and, correspondingly, the (k×1) OLS vector as

b = (b_{-i}', b_i)',

where b_{-i} is (k − 1)×1 and b_i is a scalar. Maintain LRM.1-LRM.4, so that

Var(b_i|X) = σ²(X'X)⁻¹_ii,

where (X'X)⁻¹_ii indicates the last entry on the main diagonal of (X'X)⁻¹.

My purpose here is to derive the formula for (X'X)⁻¹_ii. As usual when it comes to the
derivation of formulas regarding parts of the OLS vector, I invoke the Frisch-Waugh-Lovell
(FWL) Theorem. Hence,

b_i = (x_i'M[X_{-i}]x_i)⁻¹ x_i'M[X_{-i}]y,

so, given

y = X_{-i}β_{-i} + x_iβ_i + ε,

it has

b_i = (x_i'M[X_{-i}]x_i)⁻¹ x_i'M[X_{-i}](x_iβ_i + ε) = β_i + (x_i'M[X_{-i}]x_i)⁻¹ x_i'M[X_{-i}]ε,

and consequently

(4.8.1)    Var(b_i|X) = E[(x_i'M[X_{-i}]x_i)⁻¹ x_i'M[X_{-i}]εε'M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)⁻¹ |X]
                     = (x_i'M[X_{-i}]x_i)⁻¹ x_i'M[X_{-i}] E(εε'|X) M[X_{-i}]x_i (x_i'M[X_{-i}]x_i)⁻¹
                     = σ²(x_i'M[X_{-i}]x_i)⁻¹,

which also proves that

(4.8.2)    (X'X)⁻¹_ii = (x_i'M[X_{-i}]x_i)⁻¹.

Equation (4.8.2) is a general algebraic result providing the formula for the generic i-th main
diagonal element of the inverse of any non-singular cross-product matrix X'X. I have proved it
in quite a peculiar way, using a well-known and easy-to-remember econometric result! Above
all, I could get away without referring to the hard-to-remember result on the inverse of the
(2×2) partitioned matrix, which is instead the route followed by Greene (Theorem 3.4 in
Greene (2008), p. 30).

Exercise 31. Prove (4.8.2) using formula (A-74) for the inverse of the (2×2) partitioned
matrix in Greene (2008), p. 966.

4.8.1. The three determinants of Var(b_i|X) when 1 is a regressor. Now I get back
to Var(b_i|X) in equation (4.8.1),

Var(b_i|X) = σ²(x_i'M[X_{-i}]x_i)⁻¹,

and assume that X_{-i} contains the n×1 unity vector 1, or X_{-i} = (X̃_{-i} 1). Notice, now, that
M[X_{-i}]x_i is the residual vector from the OLS regression of x_i on X_{-i}, and so x_i'M[X_{-i}]x_i is the
residual sum of squares for this regression. Since the unity vector is a column of X_{-i}, the
coefficient of determination for this regression is

R_i² = 1 − (x_i'M[X_{-i}]x_i)/(x_i'M[1]x_i),

from which we have that

x_i'M[X_{-i}]x_i = (1 − R_i²) x_i'M[1]x_i

and eventually²

Var(b_i|X) = σ² / [(1 − R_i²) x_i'M[1]x_i].

Also, it has

x_i'M[1]x_i = Σ_{j=1}^{n} (x_{ji} − x̄_i)²,

that is, x_i'M[1]x_i is the total variation in x_i around its sample mean, x̄_i. Therefore,

(4.8.3)    Var(b_i|X) = σ² / [(1 − R_i²) Σ_{j=1}^{n} (x_{ji} − x̄_i)²],

with Var(b_i|X) increasing when

- other things constant, R_i² increases, in words the correlation between x_i and the
  other regressors increases (this is the multicollinearity effect on the variance of the
  OLS individual coefficient);
- other things constant, the total variation in x_i, Σ_{j=1}^{n} (x_{ji} − x̄_i)², decreases;
- other things constant, the regression variance σ² increases.

²An alternative proof is the following. Given Lemma 12, M[X_{-i}] = I − P[1] − P[M[1]X̃_{-i}], and so

Var(b_i|X) = σ² / (x_i'M[1]x_i − x_i'P[M[1]X̃_{-i}]x_i)
           = σ² / { x_i'M[1]x_i [1 − (x_i'P[M[1]X̃_{-i}]x_i)/(x_i'M[1]x_i)] },

where, given (3.8.5), R_i² = (x_i'P[M[1]X̃_{-i}]x_i)/(x_i'M[1]x_i) is the centered R-squared for the regression of x_i on
X_{-i} (or equivalently the uncentered R-squared from the regression of M[1]x_i on M[1]X̃_{-i}).

Multicollinearity is perfect when x_i belongs to R(X_{-i}). In this case R_i² = 1 (see Section
3.7) and the variance of b_i diverges to infinity. Coefficient β_i cannot be estimated given the
available data (X is not of f.c.r. in this case).

Remark 32. Multicollinearity, when it does not degenerate into perfect multicollinearity,
i.e. det(X'X) = 0, does not affect the finite-sample properties of OLS. Nonetheless, it may
severely reduce the precision of our estimates, in terms of larger standard errors and confidence
intervals.
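Stata reports the multicollinearity component of (4.8.3) directly: after regress, estat vif lists
1/(1 − R_i²) for each regressor, the factor by which Var(b_i|X) is inflated relative to the case of
orthogonal regressors. A minimal sketch with hypothetical variables:

regress y x1 x2 x3
estat vif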

Exercise 33. Partition X as X = (X̃ 1) and, accordingly, the OLS (k×1) vector as
b = (b̃', b0)', where b̃ is of dimension (k − 1)×1 and b0 is the OLS estimator of the constant
term β0. Prove that

b0 = ȳ − x̄'b̃,

where ȳ is the sample mean of y and x̄ is the (k − 1)×1 vector of sample means for the X̃
regressors (hint: just use the intermediate result from the proof of the FWL Theorem that
b1 = (X1'X1)⁻¹X1'(y − X2b2)).

Exercise 34. Use Exercise 33 and the following three facts:

1) Var(ȳ|X) = E{[ȳ − E(ȳ|X)]²|X} = E(ε̄²|X) = σ²/n;

2) Var(x̄'b̃|X) = x̄'Var(b̃|X)x̄;

3) Cov(ȳ, x̄'b̃|X) = E{[ȳ − E(ȳ|X)][x̄'b̃ − E(x̄'b̃|X)]|X}
                  = E[(1/n)1'εε'M[1]X̃(X̃'M[1]X̃)⁻¹x̄ |X]
                  = (1/n)1'E(εε'|X)M[1]X̃(X̃'M[1]X̃)⁻¹x̄
                  = (σ²/n)1'M[1]X̃(X̃'M[1]X̃)⁻¹x̄ = 0,

to prove that

Var(b0|X) = σ²/n + x̄'Var(b̃|X)x̄.

4.9. Residuals from partitioned OLS regressions


Consider the partitioned OLS regression of M[X2]y on the columns of M[X2]X1 as regressors
and the corresponding OLS residual vector

u = M[X2]y − M[X2]X1 b1.

I now prove that u is equal to the OLS residual vector e ≡ M[X]y. Since b1 = (X1'M[X2]X1)⁻¹X1'M[X2]y,
replacing it into the right-hand side of the u equation yields

u = M[X2]y − M[X2]X1(X1'M[X2]X1)⁻¹X1'M[X2]y = (M[X2] − P[M[X2]X1])y.

By Lemma 12 it has P[X] = P[X2] + P[M[X2]X1], and so M[X] = M[X2] − P[M[X2]X1], or M[X2] =
M[X] + P[M[X2]X1]. Then

u = (M[X] + P[M[X2]X1] − P[M[X2]X1])y = M[X]y ≡ e.
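A minimal Stata sketch of this result and of the FWL Theorem, with hypothetical variables y, x1
and x2 (here X2 collects x2 and the constant): partialling x2 out of both y and x1 reproduces the
coefficient on x1 and, as just proved, the residuals of the full regression.

regress y x1 x2
predict double e_full, residuals
regress y x2
predict double ytil, residuals
regress x1 x2
predict double x1til, residuals
regress ytil x1til             // slope equal to the coefficient on x1 in the full regression
predict double u, residuals
summarize e_full u             // identical residuals, up to numerical precision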

CHAPTER 5

The Oaxaca model: OLS, optimal weighted least squares and group-wise heteroskedasticity
5.1. Introduction
The Oaxaca model is a good way to check your comprehension of things so far. The
treatment is more complete than Greene (2008)'s. Importantly, it serves as a motivation for
Zyskind's condition, introduced in Section 5.4. It may also serve as an introduction to a number
of topics that will be covered later on: in particular, dummy variables, heteroskedasticity,
generalized least squares estimation and panel data models. Do Exercise 39 at the end.
5.2. Embedding the Oaxaca model into a pooled regression framework

We have two statistically independent samples, not necessarily of equal sizes: 1) a sample
from the population of male workers, with observations on log(wage) collected into the
(n_m×1) column vector y_m and socio-demographic explanatory variables collected into the
(n_m×k) sample regressor matrix X_m; 2) a sample from the population of female workers with
the same variables collected into the (n_f×1) column vector y_f and the (n_f×k) matrix X_f,
respectively.
Assume that both population models,

y_m = X_m β_m + ε_m,
y_f = X_f β_f + ε_f,

meet LRM.1-LRM.5 with regression variances σ_m² ≡ σ²ω_m² and σ_f² ≡ σ²ω_f², not necessarily
equal (group-wise heteroskedasticity). Hence, the resulting OLS estimators from the two
separate regressions, b_m and b_f, are independently normally distributed, with
b_m|X_m ~ N[β_m, σ_m²(X_m'X_m)⁻¹] and b_f|X_f ~ N[β_f, σ_f²(X_f'X_f)⁻¹], and are both BLUE.

The question I ask is whether it is possible to embed the two models into a single regression
model by pooling the two sub-samples into a larger one, with size equal to n = n_m + n_f, and
continue to estimate β_m and β_f efficiently.

Let's try and see. Here is the pooled data-set:

y = (y_m', y_f')',    X_w = (X_m', X_f')',    ε = (ε_m', ε_f')'.

Let 1 denote the (n×1) vector of all unity elements and construct the (n×1) vector d, such
that its first n_m entries are all unity and the last n_f are all zero.

Variables like d are usually referred to as dummy variables or indicator variables, since
they indicate whether any observation in the sample belongs or not to a given group. In this
particular case, d is the male dummy variable, indicating whether any observation in the sample
is specific to the male group. Since the two groups are mutually exclusive, the female dummy
variable, indicating whether any observation in the sample belongs to the female group, can be
constructed as the complementary vector 1 − d. By construction, d and 1 − d are orthogonal,
that is d'(1 − d) = 0.

Let x_wi' be a (1×k) row vector indicating the i-th row of X_w and y_i, ε_i and d_i be scalars
indicating the i-th components of y, ε and d, respectively.

With this in hand, the model for the generic worker i = 1, ..., n is

(5.2.1)    y_i = d_i x_wi'β_m + (1 − d_i) x_wi'β_f + ε_i.

On setting up the (2k×1) parameter vector β as β = (β_m', β_f')' and the (n×2k) regressor
matrix X as

X = [X_m  0_(n_m×k); 0_(n_f×k)  X_f],

where 0_(s×t) indicates a (s×t) matrix of all zero elements, model (5.2.1) can be reformulated
in matrix form as

(5.2.2)    y = Xβ + ε.

Exercise 35. Prove that X has f.c.r. if and only if both X_m and X_f have f.c.r.

Summing up, we have two equivalent representations of the same model: 1) that in Greene
(2008), with the two separate regressions; 2) that presented here, with a single regression model,
represented by (5.2.2). It turns out that the two frameworks are equivalent, as far as efficient
estimation of the population coefficients is concerned. Indeed, as I prove next, the OLS
estimator, b, from model (5.2.2) is numerically identical to the OLS estimators from the two
separate regressions as presented in Greene (2008), i.e. b = (b_m', b_f')'. Let

b = (X'X)⁻¹X'y.

By construction,

X'y = (X_m'y_m ; X_f'y_f)

and

X'X = [X_m'X_m  0_(k×k); 0_(k×k)  X_f'X_f].

Then, by a well-known property of the inverse of a block-diagonal matrix (see (A-73) in Greene
(2008)),

(X'X)⁻¹ = [(X_m'X_m)⁻¹  0_(k×k); 0_(k×k)  (X_f'X_f)⁻¹].

Hence,

b = [(X_m'X_m)⁻¹  0_(k×k); 0_(k×k)  (X_f'X_f)⁻¹] (X_m'y_m ; X_f'y_f)
  = ((X_m'X_m)⁻¹X_m'y_m ; (X_f'X_f)⁻¹X_f'y_f)
  = (b_m ; b_f).

Exercise 36. Prove that b = (b_m', b_f')' using the FWL Theorem.

It must be pointed out that model (5.2.2) does not satisfy assumption LRM.4. The disturbances
ε, although independently distributed, suffer from what is usually referred to as
group-wise heteroskedasticity, as the model does not maintain σ_m² = σ_f². Indeed, the covariance
matrix for ε is

Ω = σ² [ω_m² I_{n_m}  0_(n_m×n_f); 0_(n_f×n_m)  ω_f² I_{n_f}].

In this sense, model (5.2.2) is not a classical regression model. Does this mean that b is not
BLUE? No, and for an important reason. Assumptions LRM.1-LRM.4 are sufficient for the
OLS estimator to be BLUE, as proved in Section 4.3, but not necessary. In specific
circumstances, even if LRM.4 is not met, the OLS estimator is still BLUE, and the Oaxaca
model is one such case. This is verified in the next section. A general necessary and sufficient
condition for OLS to be BLUE is postponed to the last section of this chapter.

5.3. The OLS estimator in the Oaxaca model is BLUE

Model (5.2.2) can be transformed into a classical regression model by using a standard
procedure in econometrics and statistics: weighting. Let

H = [ω_m⁻¹ I_{n_m}  0_(n_m×n_f); 0_(n_f×n_m)  ω_f⁻¹ I_{n_f}].

As stated by the exercise below, the matrix H, when premultiplied to any conformable
vector, transforms the vector so that its first n_m elements get divided by ω_m and the remaining
n_f by ω_f. This is what we refer to as weighting.

Exercise 37. Verify by direct inspection that, given any (n_m×1) vector x_m, any (n_f×1)
vector x_f and x = (x_m', x_f')', then Hx = (ω_m⁻¹x_m', ω_f⁻¹x_f')'.

Premultiply both sides of model (5.2.2) by H:

Hy = HXβ + Hε,

or

(5.3.1)    ỹ = X̃β + ε̃,

where the tilde indicates weighted variables. Two important facts are worth observing at this
point. First, the population parameter vector, β, in the weighted model is the same as in
model (5.2.2). Second, the weighted errors satisfy LRM.4 with covariance matrix equal to σ²I_n
(so, if LRM.5 holds, they are independent normal variables), since

Var(ε̃|X̃) = HΩH' = HΩH
          = σ² [ω_m⁻¹I_{n_m}  0; 0  ω_f⁻¹I_{n_f}] [ω_m²I_{n_m}  0; 0  ω_f²I_{n_f}] [ω_m⁻¹I_{n_m}  0; 0  ω_f⁻¹I_{n_f}]
          = σ² I_n.

Therefore, the weighted model is a classical regression model that identifies the parameters of
interest, and hence, by the Gauss-Markov Theorem, the OLS estimator applied to the weighted
model (5.3.1), referred to as the weighted least squares (WLS) estimator, b_w, is BLUE for β.
Let us work out its formula, using Exercise 37:

b_w = (X̃'X̃)⁻¹X̃'ỹ
    = [ω_m⁻²X_m'X_m  0_(k×k); 0_(k×k)  ω_f⁻²X_f'X_f]⁻¹ (ω_m⁻²X_m'y_m ; ω_f⁻²X_f'y_f)
    = ((X_m'X_m)⁻¹X_m'y_m ; (X_f'X_f)⁻¹X_f'y_f),

which proves that b = b_w, namely that in the Oaxaca model the OLS estimator coincides
with the optimal WLS estimator.

Does this imply that we can do inference in the Oaxaca model feeding the Stata regress
command with the variables of model (5.2.2) without further caution? Not quite. Although
the single OLS regression provides the BLUE estimator for the population coefficients β, the
OLS estimate of Var(b|X) that would be computed by regress,

V̂ar(b|X) = s² [(X_m'X_m)⁻¹  0_(k×k); 0_(k×k)  (X_f'X_f)⁻¹],

with s² obtained from the sum of squares of the pooled residuals, is biased. The reason is
that V̂ar(b|X) forces the regression variance estimate to be constant across the two samples.
Luckily, the same is not true for the separate regressions on the two subsamples, which provide us
with the unbiased estimators of the model coefficients, b_m and b_f, and the unbiased estimator
of the covariance matrix

V̂ar(b|X) = [s_m²(X_m'X_m)⁻¹  0_(k×k); 0_(k×k)  s_f²(X_f'X_f)⁻¹],

where s_m² = [1/(n_m − k)] Σ_{i=1}^{n_m} e_i² and s_f² = [1/(n_f − k)] Σ_{i=n_m+1}^{n} e_i². Alternatively, one can implement
a feasible version of the weighted regression explained above, using s_m and s_f as weights.
But this is clearly more computationally cumbersome than carrying out the two separate
regressions.
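A minimal Stata sketch of the equivalence between the two representations, with hypothetical
variables lwage, educ, exper and a male dummy (the factor-variable notation builds the
block-diagonal regressor matrix of model (5.2.2)):

regress lwage educ exper if male == 1                           // b_m and s_m^2
regress lwage educ exper if male == 0                           // b_f and s_f^2
regress lwage ibn.male ibn.male#(c.educ c.exper), noconstant    // pooled (5.2.2): same coefficients, but a single pooled s^2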

5.4. A general result


Zyskind (1967) provides a general necessary and sufficient condition for the OLS estimator
to be BLUE.

Theorem 38. Given the regressor matrix, X, and the conditional covariance matrix Ω,
Var(ε|X) = Ω, the OLS estimator, b = (X'X)⁻¹X'y, is BLUE if and only if P[X]Ω = ΩP[X].

If LRM.4 holds, that is Ω = σ²I_n, Zyskind's condition holds for any X, since

P[X] σ²I_n = σ²I_n P[X].

That Zyskind's condition is also verified in the Oaxaca model is straightforwardly proved
upon elaborating P[X]Ω:

P[X] = X(X'X)⁻¹X'
     = [X_m  0_(n_m×k); 0_(n_f×k)  X_f] [(X_m'X_m)⁻¹  0_(k×k); 0_(k×k)  (X_f'X_f)⁻¹] [X_m'  0_(k×n_f); 0_(k×n_m)  X_f']
     = [X_m(X_m'X_m)⁻¹X_m'  0_(n_m×n_f); 0_(n_f×n_m)  X_f(X_f'X_f)⁻¹X_f']
     = [P[X_m]  0_(n_m×n_f); 0_(n_f×n_m)  P[X_f]].

Therefore,

P[X]Ω = [P[X_m]  0; 0  P[X_f]] σ²[ω_m²I_{n_m}  0; 0  ω_f²I_{n_f}]
      = σ²[ω_m²P[X_m]  0; 0  ω_f²P[X_f]]
      = σ²[ω_m²I_{n_m}  0; 0  ω_f²I_{n_f}] [P[X_m]  0; 0  P[X_f]]
      = ΩP[X].

As a final remark, bear in mind that the Zyskind condition ensures only that the OLS coefficients are
BLUE, saying nothing about the properties of the OLS standard error estimates, and indeed
we have seen in the previous sections that they may be biased even if b is BLUE. The following
exercise on partitioning provides another instance of such an occurrence.

Exercise 39. Consider the partitioned regression

(5.4.1)    y = X1β1 + X2β2 + ε,

maintaining LRM.1-LRM.4. 1) Verify that premultiplying both sides of the foregoing equation
by M[X2] boils down to the reduced regression model

(5.4.2)    ỹ = X̃1β1 + ε̃,

where ỹ = M[X2]y, X̃1 = M[X2]X1 and ε̃ = M[X2]ε.
2) How can you interpret the variables in model (5.4.2)? 3) As far as β1 is concerned, does
OLS applied to model (5.4.2) yield the same estimator as OLS applied to model (5.4.1)? Why
or why not? 4) Does the reduced model (5.4.2) satisfy LRM.1-LRM.4? Which ones, if any, are
not satisfied? 5) The degrees of freedom of the reduced regression are n − k1. Do you think
that the resulting OLS estimate for σ² would be unbiased? 6) Verify that the reduced model
(5.4.2) satisfies the Zyskind condition.

Solution. 1) It does, since M[X2]X2 = 0. 2) The variables ỹ and X̃1 in model (5.4.2) are
the residuals from k1 + 1 separate regressions using y and each column of X1 as dependent
variables and the columns of X2 as regressors. 3) Yes, by the FWL Theorem. 4) LRM.1-LRM.3
are met, but LRM.4 fails with

Var(ε̃|X̃1) = σ²M[X2].

5) It is biased, since we know that the unbiased OLS estimator uses n − k degrees of freedom
to correct the OLS residual sum of squares (which is nonetheless the same for both models
(5.4.1) and (5.4.2), as we learn from Section 4.9). 6) You have just to verify that

M[X2] P[M[X2]X1] = P[M[X2]X1] M[X2],

which is readily done by noting that M[X2], symmetric and idempotent, is the first and the
last factor in P[M[X2]X1].

The within regression examined in Chapter 8 (equation (8.2.7)) is a special case of model
(5.4.2) in the foregoing exercise.

CHAPTER 6

Tests for structural change


6.1. The Chow predictive test
There is a time-series data-set with T = T1 + T2 observations. X denotes the (T×k)
regressor matrix of full column rank, y is the (T×1) vector of observations for the dependent
variable and ε is the error vector, which is normally distributed, given X, with zero mean and
constant variance σ². Assume also that T1 > k. It is worth noting right now that we are not
assuming T2 > k, so that the test we introduce here, differently from the classical Chow test,
goes through also in the case in which the second subsample does not permit identification of
the (k×1) vector of population coefficients β.

Partition y, X and ε row-wise,

y = (y1', y2')',    X = (X1', X2')'    and    ε = (ε1', ε2')',

so that the top blocks of the partitioned matrices contain the first T1 observations and those
in the bottom contain the last T2 observations. The model under the null hypothesis is just
the classical regression model

(6.1.1)    y = Xβ + ε,

and the OLS estimator for β is

b* = (X'X)⁻¹X'y.

The structural break is thought of as time-specific additive shocks hitting the model over the
second subsample. Therefore, the model in the presence of the structural break is formulated
as

(6.1.2)    y1 = X1β + ε1,    y2 = X2β + I_{T2}γ + ε2,

where I_{T2} is the identity matrix of order T2 and γ is the (T2×1) vector of time-specific
shocks. Written in a more compact form, where 0_(l×m) indicates a (l×m) null matrix, the
general model becomes

(y1 ; y2) = [X1  0_(T1×T2); X2  I_{T2}] (β ; γ) + (ε1 ; ε2),

from which it is clear that model (6.1.2) uses a larger regressor matrix than model (6.1.1),
including also the time dummies specific to the observations over the second subsample,

D = (0_(T1×T2) ; I_{T2}),

and that the time-specific shocks γ are the coefficients on those time dummies. The OLS
estimator for β and γ can then be expressed as

(b ; c) = (W'W)⁻¹W'y,

where W = (X D) is the extended regressor matrix. Therefore, the null hypothesis can be
formally expressed as H0: γ = 0_(T2×1) and can be tested by a standard F-test of joint
significance:

(6.1.3)    F = [(e*'e* − e'e)/T2] / [e'e/(T1 − k)] ~ F(T2, T1 − k),

where e* = y − Xb* indicates the OLS residual (T×1) vector from the model under the null
and

(6.1.4)    e = y − W(b ; c)

is the OLS residual (T×1) vector from the unrestricted model.¹

It is possible to prove that the F-ratio in (6.1.3) equals

(6.1.5)    F = [(e*'e* − e1'e1)/T2] / [e1'e1/(T1 − k)],

where e1 = y1 − X1b1 denotes the OLS residual (T1×1) vector obtained by regressing y1 on
X1, and accordingly

(6.1.6)    b1 = (X1'X1)⁻¹X1'y1.

Differently from Greene (2008), I prove this result by using the FWL Theorem. We just need
to work out e and, to do this, I obtain the separate expressions for b and c. Let's get started
with b. By the FWL Theorem,

b = (X'M[D]X)⁻¹X'M[D]y.

Expand M[D] to obtain

M[D] = I_T − (0_(T1×T2) ; I_{T2}) [(0_(T1×T2) ; I_{T2})'(0_(T1×T2) ; I_{T2})]⁻¹ (0_(T1×T2) ; I_{T2})'.

The matrix in brackets in the foregoing expression reduces to

0_(T2×T2) + I_{T2} = I_{T2}.

Therefore,

(6.1.7)    M[D] = I_T − (0_(T1×T2) ; I_{T2})(0_(T2×T1)  I_{T2}) = I_T − [0_(T1×T1)  0_(T1×T2); 0_(T2×T1)  I_{T2}].

The second matrix in equation (6.1.7) transforms any conformable vector it premultiplies
so that its first T1 values are replaced by zeroes and its last T2 values are left unchanged.
Accordingly, M[D] carries out the complementary operation, transforming any conformable vector
to which it is premultiplied in a way that its first T1 values remain unchanged and the last T2
values get replaced by zeroes. Therefore,

M[D]X = (X1 ; 0_(T2×k)),

implying that b = b1, defined in equation (6.1.6).

Turning to c, by the FWL Theorem (equation (3.6.3)),

c = (D'D)⁻¹D'(y − Xb),

and since b = b1 and D'D = I_{T2},

(6.1.8)    c = (0_(T1×T2) ; I_{T2})'(y − Xb1) = (0_(T2×T1)  I_{T2}) (y1 − X1b1 ; y2 − X2b1) = y2 − X2b1.

Therefore, the OLS coefficients c are the prediction errors for the second subsample obtained by
means of the estimates b1 from the first subsample. Finally, replacing the right-hand
sides of equations (6.1.6) and (6.1.8) into equation (6.1.4) yields

e = y − Xb1 − D(y2 − X2b1) = (y1 − X1b1 ; y2 − X2b1) − (0_(T1×1) ; y2 − X2b1) = (e1 ; 0_(T2×1)),

which proves that e'e = e1'e1 and consequently the F-test expression of equation (6.1.5).

Remark 40. Since nothing in the foregoing derivation hinges upon the fact that the T2
observations are contiguous in the sample (the OLS estimator and residuals are invariant to
permutations of the rows), there is a more general lesson that can be learnt here. Regardless of
the data-set format, be it a time series, a cross-section or a panel, extending the matrix of
regressors with dummy variables that each indicate a single observation will actually exclude all
involved observations from the estimation sample. Therefore, the above procedure can be used
both to test that given observations in the sample are not outliers and, in case of rejection,
to exclude the outliers from the estimation sample, without materially removing records from
the data-set.

¹The degrees-of-freedom correction in the denominator of the F-ratio follows from the fact that the number of
estimated parameters under the alternative is k + T2.
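A minimal Stata sketch of the procedure, with a hypothetical time index t running from 1 to 50
and a second subsample of T2 = 10 observations (t = 41, ..., 50); the observation-specific dummies
implement the D matrix and testparm delivers the F(T2, T1 − k) statistic:

forvalues s = 41/50 {
    generate d`s' = (t == `s')
}
regress y x d41-d50
testparm d41-d50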
6.2. An equivalent reformulation of the Chow predictive test
Here we stress the interpretation of the Chow predictive test as a test of zero prediction
errors in the period with insufficient observations. In doing so, we reformulate the test using
the formula based on the Wald statistic.

From equation (6.1.8) derived in the previous section we have

c = y2 − X2b1.

Therefore c, the OLS estimator of γ, can be thought of as the prediction error incurred
when we use the first sub-period estimates to predict y in the second sub-period, given X2.
If the elements in c are not jointly significantly different from zero, then we have evidence for
not rejecting the null hypothesis of parameter constancy (zero idiosyncratic shocks γ). The
Chow predictive test in (6.1.5),

F = [(e*'e* − e1'e1)/T2] / [e1'e1/(T1 − k)],

assesses exactly this in a rigorous way, since F ~ F(T2, T1 − k) under the null hypothesis and
normal ε.

As usual, F can be rewritten as the Wald statistic divided by the number of restrictions
(T2 in this case):

(6.2.1)    F = c'[V̂ar(c|X)]⁻¹c / T2

(c is the discrepancy vector and V̂ar(c|X) is the OLS estimator of Var(c|X)). The F
formula in (6.2.1) can be made operational by elaborating the conditional covariance matrix
of the prediction error, Var(c|X), under the null hypothesis of the test, H0: γ = 0.

If γ = 0, then y2 = X2β + ε2 and b1 = β + (X1'X1)⁻¹X1'ε1, and hence

c = X2β + ε2 − X2[β + (X1'X1)⁻¹X1'ε1] = ε2 − X2(X1'X1)⁻¹X1'ε1.

Therefore, under H0,

Var(c|X) = E{[ε2 − X2(X1'X1)⁻¹X1'ε1][ε2 − X2(X1'X1)⁻¹X1'ε1]'|X}
         = E(ε2ε2'|X) + E[X2(X1'X1)⁻¹X1'ε1ε1'X1(X1'X1)⁻¹X2'|X]
         = σ²[I_{T2} + X2(X1'X1)⁻¹X2'],

where the second equality follows from the assumption of spherical disturbances over the whole
sample,

Var[(ε1 ; ε2)|X] = σ²I_T.

Hence,

V̂ar(c|X) = s1²[I_{T2} + X2(X1'X1)⁻¹X2'],

where s1² = e1'e1/(T1 − k). This implies that the predictive F test can also be computed as

F = (y2 − X2b1)'[I_{T2} + X2(X1'X1)⁻¹X2']⁻¹(y2 − X2b1) / (T2 s1²).

6.3. The classical Chow test


Now both T1 > k and T2 > k, so that there is enough information in both subsamples to
identify two subsample-specific (k×1) beta vectors, β1 and β2, and base a parameter-constancy
test on H0: β1 = β2.

As before, X denotes the (T×k) regressor matrix of full column rank, y is the (T×1)
vector of observations for the dependent variable and ε is the error vector, which is normally
distributed, given X, with zero mean and covariance matrix σ²I_T. The usual row-wise
partitioning holds:

y = (y1', y2')',    X = (X1', X2')'    and    ε = (ε1', ε2')',

so that the top blocks of the partitioned matrices contain the first T1 observations and those in
the bottom contain the last T2 observations. Assume that both X1 and X2 have f.c.r. Notice
that this is not ensured by f.c.r. for X, as the following example shows.

Example 41. The matrix

X = [1 2; 1 2; 1 2; 1 0; 1 0]

has rank 2, but

X1 = [1 2; 1 2; 1 2]

has rank 1.

The general model for the Chow test is

y1 = X1β1 + ε1,    y2 = X2β2 + ε2,

or, more compactly,

(6.3.1)    y = W(β1 ; β2) + ε,

where

W = [X1  0_(T1×k); 0_(T2×k)  X2]

and β1 and β2 are (k×1) vectors. Let

D = (0_(T1×T2) ; I_{T2}).

As demonstrated in Section 6.1, a generic (T×1) vector x, when premultiplied by M[D], is
transformed into the interaction of x and the time dummy for the first sub-period and, when
premultiplied by P[D], is transformed into the interaction of x and the time dummy for the
second sub-period. Hence, the general model (6.3.1) can be equivalently written as

y = M[D]Xβ1 + P[D]Xβ2 + ε,

or, still equivalently, given M[D] = I − P[D],

(6.3.2)    y = Xβ1 + P[D]Xδ + ε,

where δ = β2 − β1. Therefore, the null hypothesis of the Chow test, H0: β1 = β2, is
equivalent to the exclusion restrictions H0: δ = 0, and so the test can be implemented 1)
by constructing just one set of interacted variables, P[D]X, and 2) carrying out an F test of
joint significance for the coefficients on the interacted variables after OLS estimation of model
(6.3.2).

As it turns out, the classical Chow test is a special case of the predictive Chow. Consider
model (6.3.2) and reformulate it by expanding P[D]:

y = Xβ1 + D(D'D)⁻¹D'Xδ + ε,

or

y = Xβ1 + Dγ + ε,

where γ = (D'D)⁻¹D'Xδ. But since (D'D)⁻¹D'X = X2, then γ = X2δ, which
shows that γ ∈ R(X2) ⊆ R^{T2}, so that the γ's here are not arbitrary as in the predictive
Chow test, where γ ∈ R^{T2}. This implies that, given rank(X2) = k, the two tests are identically
equal if and only if T2 = k; only in this case, in fact, R(X2) = R^{T2}.
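A minimal Stata sketch of the classical Chow test as an F test on interactions, with a hypothetical
dummy post equal to one in the second subsample and regressors x1 and x2 (the interaction of
post with the constant is post itself):

generate postx1 = post*x1
generate postx2 = post*x2
regress y x1 x2 post postx1 postx2
testparm post postx1 postx2          // H0: delta = 0, i.e. no break in intercept or slopes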

CHAPTER 7

Large sample results for OLS and GLS estimators


7.1. Introduction
The linear regression model may present departures from LRM.4, such as heteroskedasticity
and/or cluster correlation. In this chapter we study common econometric techniques that
accommodate these issues, for both estimation and inference: primarily, the Generalized LS
(GLS) estimator for the regression coefficients and robust covariance estimators.

All known statistical properties are derived for n → ∞, and so the techniques we consider
in this chapter work well in large samples.

I spell out the assumptions needed for consistency and asymptotic normality of OLS and
GLS estimators, providing the derivation of the large-sample properties.

Strict exogeneity is maintained throughout:

SE: E(ε|X) = 0.

A weaker version of the random sampling assumption, one which does not maintain identical
distributions of records, is invoked when proving asymptotic normality and consistency for the
variance estimators:

RS: There is a sample of size n, such that the elements of the sequence {(y_i x_i'), i = 1, ..., n}
are independent (NB not necessarily identically distributed) random vectors.
Results in this chapter are demonstrated through the do-file statistics_OLS.do using the
data-sets US_gasoline.dta and mus06data.dta (from Cameron and Trivedi 2009).


7.2. OLS with non-spherical error covariance matrix


I prove consistency in the general case of Var(ε|X) = Ω, where Ω is an arbitrary and
unknown symmetric, p.d. matrix.

7.2.1. Consistency. The following assumptions hold.

OLS.1: plim X'ΩX/n = lim_{n→∞} E(X'ΩX/n), a positive definite, finite matrix;
OLS.2: plim X'X/n = Q, a positive definite, finite matrix.

The proof of consistency goes as follows. It has

b = β + (X'X/n)⁻¹(X'ε/n),

and then

plim(b) = β + Q⁻¹ plim(X'ε/n).

By strict exogeneity,

E(X'ε/n | X) = 0,

and so

Var(X'ε/n) = E[Var(X'ε/n | X)] = (1/n) E(X'ΩX/n),

which goes to zero as n → ∞ by assumption OLS.1. Hence X'ε/n converges in squared mean,
and consequently in probability, to zero, so that plim(b) = β.

Clearly, the above implies that OLS is consistent in the classical case of LRM.4.

7.2.2. Asymptotic normality with heteroskedasticity. Assumptions OLS.1 and OLS.2
hold, along with RS and the following

OLS.3: Var(ε|X) = Ω, where

Ω = diag(σ1², σ2², ..., σn²).

OLS.3 permits heteroskedasticity but not correlation. Partition X row-wise,

X = (x1, x2, ..., xn)'.

By strict exogeneity E(x_iε_i) = 0 and hence

Var(x_iε_i) = E[E(ε_i² x_i x_i'|x_i)] = σ_i² E(x_i x_i')

and

(1/n) Σ_{i=1}^{n} Var(x_iε_i) = (1/n) Σ_{i=1}^{n} σ_i² E(x_i x_i') = E(X'ΩX/n).

Therefore,

lim_{n→∞} (1/n) Σ_{i=1}^{n} Var(x_iε_i) = lim_{n→∞} E(X'ΩX/n),

which is a finite matrix by assumption and so, by the (multivariate) Lindeberg-Feller theorem,

X'ε/√n = (1/√n) Σ_{i=1}^{n} x_iε_i →_d N(0, plim X'ΩX/n).

Eventually, given the rules for limiting distributions (Theorem D.16 in Greene (2008)),

√n(b − β) = (X'X/n)⁻¹ (X'ε/√n) →_d Q⁻¹ N(0, plim X'ΩX/n),

and so

√n(b − β) →_d N(0, Q⁻¹ [plim X'ΩX/n] Q⁻¹).
7.2.3. White's heteroskedasticity-consistent estimator for the OLS standard
errors. Under OLS.3 and given the OLS residuals e_i = y_i − x_i'b, i = 1, ..., n, a heteroskedasticity-consistent
estimator for the asymptotic covariance matrix of b,

Avar(b) = (1/n) Q⁻¹ [plim X'ΩX/n] Q⁻¹,

is given by White's estimator:

(7.2.1)    Âvar(b) = (X'X)⁻¹ X'Ω̂X (X'X)⁻¹,

where

Ω̂ = diag(e1², e2², ..., en²).

An equivalent way to express Ω̂, one that will be used intensively in Chapters 8 and 9, is
the following:

Ω̂ = ee' ⊙ I_n,

where the symbol ⊙ stands for the element-by-element matrix product (also known as the Hadamard
product).

Econometric software routinely computes robust OLS standard errors: these are just the
square roots of the main diagonal elements of Âvar(b) in (7.2.1). In Stata this is done through
the regress option vce(robust) (or, equivalently, simply robust).
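A minimal sketch with hypothetical variables: the same OLS fit with classical and with
White-robust standard errors.

regress y x1 x2
regress y x1 x2, vce(robust)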

7.2.4. White's heteroskedasticity test. White's estimator remains consistent under homoskedasticity, therefore one can test for heteroskedasticity by assessing the statistical discrepancy between $s^2\left(X'X\right)^{-1}$ and $\left(X'X\right)^{-1}X'\widehat{\Omega}X\left(X'X\right)^{-1}$. Under the null hypothesis of homoskedasticity, the discrepancy will be small. This is the essence of White's heteroskedasticity test. The statistic measuring such discrepancy can be implemented through the following auxiliary regression including the constant term.

(1) Generate the squared OLS residuals, $e^2=e\odot e$.
(2) Run the OLS auxiliary regression that uses $e^2$ as the dependent variable and the following regressors: all $k$ variables in the $n\times k$ sample matrix $X=(X_1\ 1)$ and all interaction variables and squared variables obtained from $X_1$. This implies that for any two columns of $X_1$, say variables $x_i$ and $x_j$, there are the additional regressors $x_ix_j$, $x_ix_i$ and $x_jx_j$. The auxiliary regression, therefore, has $p\le k(k+1)/2$ regressors.
(3) Save the R-squared of the auxiliary regression, say $R_a^2$, and multiply it by the sample size $n$. The resulting statistic $nR_a^2\xrightarrow{A}\chi^2(p-1)$ measures the statistical discrepancy between the two covariance estimators and so provides a convenient heteroskedasticity test: reject homoskedasticity when $nR_a^2$ is larger than a conventional percentile of choice for $\chi^2(p-1)$.

We may implement the White test manually, saving the OLS residuals through predict and then generating squares and interactions as appropriate (see the sketch below), or more easily by giving the following post-estimation command after regress: imtest, white.
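For concreteness, here is a minimal sketch of the manual route for a toy model with dependent variable y and two non-constant regressors x1 and x2 (all variable names are placeholders, not taken from the course do-files):

* OLS and squared residuals
regress y x1 x2
predict double e, residuals
generate double e2 = e^2

* squares and interactions of the non-constant regressors
generate double x1sq = x1*x1
generate double x2sq = x2*x2
generate double x1x2 = x1*x2

* auxiliary regression and the nR^2 statistic (here p-1 = 5)
regress e2 x1 x2 x1sq x2sq x1x2
scalar nR2 = e(N)*e(r2)
display "White statistic = " nR2 ",  p-value = " chi2tail(5, nR2)

The degrees of freedom equal the number of regressors in the auxiliary regression excluding the constant (five in this toy case).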

7.2.5. Clustering. Clustering of observations along a given dimension is the norm in microeconometric applications. For example, firms cluster across different sectors, households live in different provinces, immigrants in a given country belong to different ethnic groups and so on.

Clustering cannot be neglected in empirical work. In the case of firm data, for example, it is likely that there is correlation across the productivity shocks hitting firms in the same sectoral cluster, with a resulting bias in the standard error estimates, even if White robust.

The White estimator can be made robust to cluster correlation quite easily. I explain this in terms of the firm data example. Assume that we have cross-sectional data on $n$ firms, indexed by $i=1,...,n$. There are $G$ sectors, indexed by $g=1,...,G$, and we know which sector each firm belongs to. This information is contained in the $n\times G$ matrix $D$ of sectoral indicators: the element of $D$ in row $i$ and column $j$, say $d(i,j)$, is unity if firm $i$ belongs to sector $j$ and zero if not. The cluster-correlation and heteroskedasticity consistent estimator for the asymptotic covariance matrix of $b$ is then assembled by simply replacing $\widehat{\Omega}$ in Equation (7.2.1) with
$$\widehat{\Omega}_c=ee'\odot DD'.$$
Stata does this through the regress option vce(cluster clustervar), where clustervar is the name of the cluster identifier in the data set.

Chapter 9 will cover cases of multi-clustering, that is, data that are grouped along more than one dimension.
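As an illustrative sketch (y, x1, x2 and the cluster identifier sector are placeholder names), the heteroskedasticity-robust and the cluster-robust variants differ only in the vce() option:

regress y x1 x2, vce(robust)
regress y x1 x2, vce(cluster sector)

The two calls return identical coefficient estimates; only the standard errors change, because only $\widehat{\Omega}$ in (7.2.1) is replaced.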

7.2.6. Average variance estimate (skip it). I prove now that a consistent estimate of the average variance
$$\bar\sigma_n^2=\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2$$
is given by
$$s_n^2=\frac{1}{n}\sum_{i=1}^{n}e_i^2,$$
in the sense that $\operatorname{plim}\left(s_n^2-\bar\sigma_n^2\right)=0$ (NB I use this formulation and not $\operatorname{plim}s_n^2=\bar\sigma^2$ as $\bar\sigma_n^2$ is a sequence).

Since
$$s_n^2=\frac{\varepsilon'\varepsilon}{n}-\frac{\varepsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n},$$
then
$$\operatorname{plim}\left(s_n^2-\bar\sigma_n^2\right)=\operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}-\bar\sigma_n^2\right)+0'Q^{-1}0=\operatorname{plim}\left(\frac{\varepsilon'\varepsilon}{n}-\bar\sigma_n^2\right).$$
By the RS assumption the squared errors $\varepsilon_i^2$ are all independently distributed with means $E\left(\varepsilon_i^2\right)=\sigma_i^2$, and given that
$$\frac{\varepsilon'\varepsilon}{n}=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2,$$
I can apply Markov's strong law of large numbers to have
$$\operatorname{plim}\left[\frac{\varepsilon'\varepsilon}{n}-\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\right]=0.$$

7.3. GLS

The estimation strategy described in the previous sections is based on OLS estimates for the regression coefficients with standard error estimates corrected for heteroskedasticity and/or cluster correlation. The drawback of the approach is a loss in efficiency if the departures from LRM.4 are of a known form. We will see that in this case the BLUE can always be found.

To formalize the new set-up, let $Var(\varepsilon|X)=\sigma^2\Omega$, where $\Omega$ is a known symmetric, positive definite (p.d.) $(n\times n)$ matrix and $\sigma^2$ is an unknown strictly positive scalar (that is, $Var(\varepsilon|X)$ is known up to a strictly positive multiplicative scalar).

Since $\Omega$ is symmetric and p.d., it can always be factorized as $\Omega=C\Lambda C'$, where $\Lambda$ is an $(n\times n)$ diagonal matrix with the main-diagonal elements all strictly positive and $C$ is an $(n\times n)$ matrix such that $C'C=I$, implying that $C'$ is the inverse of $C$ and consequently that $CC'=I$ ($\Lambda$ and $C$ are said, respectively, the eigenvalues (or characteristic roots) and the eigenvectors (or characteristic vectors) matrices of $\Omega$).

A great benefit of the foregoing factorization is that it permits computing the inverse of $\Omega$ effortlessly. In fact, it is possible to verify that
$$(7.3.1)\qquad \Omega^{-1}=C\Lambda^{-1}C'$$
and
$$(7.3.2)\qquad \Omega^{-1/2}=C\Lambda^{-1/2}C',$$
where $\Omega^{-1}$ is the inverse of $\Omega$, $\Lambda^{-1}$ is the inverse of $\Lambda$ and $\Lambda^{-1/2}$ is a diagonal matrix with main-diagonal elements equal to the square-root reciprocals of the main-diagonal elements of $\Lambda$.

Consider the GLS transformed model
$$(7.3.3)\qquad y^*=X^*\beta+\varepsilon^*,$$
such that $y^*\equiv\Lambda^{-1/2}C'y$, $X^*\equiv\Lambda^{-1/2}C'X$ and $\varepsilon^*\equiv\Lambda^{-1/2}C'\varepsilon$.

Exercise 42. Verify by direct inspection that indeed $\Omega^{-1}\Omega=\Omega\Omega^{-1}=I$ and $\Omega^{-1/2}\Omega^{-1/2}=\Omega^{-1}$.

Solution. Remember that $\Lambda$ is diagonal and so $\Lambda^{-1}\Lambda=\Lambda\Lambda^{-1}=I$. Then,
$$\Omega^{-1}\Omega=C\Lambda^{-1}C'C\Lambda C'=C\Lambda^{-1}\Lambda C'=CC'=I$$
and
$$\Omega\Omega^{-1}=C\Lambda C'C\Lambda^{-1}C'=C\Lambda\Lambda^{-1}C'=CC'=I.$$
The rest is proved similarly on considering that $\Lambda^{-1/2}$ is diagonal and so $\Lambda^{-1/2}\Lambda^{-1/2}=\Lambda^{-1}$:
$$\Omega^{-1/2}\Omega^{-1/2}=C\Lambda^{-1/2}C'C\Lambda^{-1/2}C'=C\Lambda^{-1}C'=\Omega^{-1}.$$

Exercise 43. Use (7.3.1) and (7.3.2) to prove 1) $X^{*\prime}X^*=X'\Omega^{-1}X$; 2) $X^{*\prime}\varepsilon^*=X'\Omega^{-1}\varepsilon$; and 3) $Var(\varepsilon^*|X)=\sigma^2I$; then use the general law of iterated expectation to prove that also $Var(\varepsilon^*|X^*)=\sigma^2I$.

Given the results of the foregoing exercise, the OLS applied to the transformed model (7.3.3) is the Gauss-Markov estimator for $\beta$ and has the formula
$$(7.3.4)\qquad b_{GLS}=\left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^*=\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y,$$
with $Var(b_{GLS}|X)=\sigma^2\left(X'\Omega^{-1}X\right)^{-1}$.
Exercise 44. Let
$$(7.3.5)\qquad \Omega^{-1/2}y=\Omega^{-1/2}X\beta+\Omega^{-1/2}\varepsilon,$$
where $\Omega^{-1/2}=C\Lambda^{-1/2}C'$, and prove that (7.3.5) is also a GLS transformation, that is, OLS applied to model (7.3.5) yields $b_{GLS}$.

Solution: By Exercise 42,
$$X'\Omega^{-1/2}\Omega^{-1/2}X=X'\Omega^{-1}X\qquad\text{and}\qquad X'\Omega^{-1/2}\Omega^{-1/2}y=X'\Omega^{-1}y.$$

The estimator $b_{GLS}$ is OLS applied to a classical regression model and as such it is BLUE. The following exercise asks you to verify by direct inspection that GLS is better than OLS in terms of covariance.
Exercise 45. Prove that
$$\sigma^2\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1}-\sigma^2\left(X'\Omega^{-1}X\right)^{-1}$$
is an n.n.d. matrix.

Solution: We define a $k\times n$ matrix $D$ as
$$D\equiv\left(X'X\right)^{-1}X'-\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}.$$
Therefore,
$$\left(X'X\right)^{-1}X'=\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}+D.$$
On noticing that $DX=0_{k\times k}$,
$$\left(X'X\right)^{-1}X'\Omega X\left(X'X\right)^{-1}=\left[\left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}+D\right]\Omega\left[\Omega^{-1}X\left(X'\Omega^{-1}X\right)^{-1}+D'\right]=\left(X'\Omega^{-1}X\right)^{-1}+D\Omega D'.$$
Since $\Omega$ is p.d., for any $n\times 1$ vector $z$, $z'\Omega z\ge 0$, being equal to zero if and only if $z=0$. But then $z'\Omega z\ge 0$ when, in particular, $z=D'w$ for any $k\times 1$ vector $w$, which is equivalent to saying that $w'D\Omega D'w\ge 0$ for any $k\times 1$ vector $w$, or that $D\Omega D'$ is n.n.d., proving the result.

7.3.1. Consistency of GLS. The following assumption holds.

GLS.1: $\operatorname{plim}\frac{X'\Omega^{-1}X}{n}=\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right)=Q$, a positive definite, finite matrix.

Exercise 46. Given that
$$b_{GLS}=\beta+\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{n},$$
prove that $\operatorname{plim}(b_{GLS})=\beta$ under assumption GLS.1 and strict exogeneity (SE).

Solution. Easy: just write
$$b_{GLS}=\beta+\left(\frac{X^{*\prime}X^*}{n}\right)^{-1}\frac{X^{*\prime}\varepsilon^*}{n},$$
then consider that $Var(\varepsilon^*|X^*)=\sigma^2I$ (see Exercise 43) and, finally, follow the same steps as in Section 7.2.1.

7.3.2. Asymptotic normality. I prove asymptotic normality for $b_{GLS}$ under GLS.1, SE and RS (again, remember that $Var(\varepsilon^*|X^*)=\sigma^2I_n$).

By strict exogeneity $E\left(x_i^*\varepsilon_i^*\right)=0$ and hence
$$Var\left(x_i^*\varepsilon_i^*\right)=\sigma^2E\left(x_i^*x_i^{*\prime}\right)$$
and
$$\frac{1}{n}\sum_{i=1}^{n}Var\left(x_i^*\varepsilon_i^*\right)=\sigma^2\frac{1}{n}\sum_{i=1}^{n}E\left(x_i^*x_i^{*\prime}\right)=\sigma^2E\left(\frac{X'\Omega^{-1}X}{n}\right).$$
Therefore,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}Var\left(x_i^*\varepsilon_i^*\right)=\sigma^2\lim_{n\to\infty}E\left(\frac{X'\Omega^{-1}X}{n}\right),$$
which is a finite matrix by assumption. By the Lindeberg-Feller central limit theorem,
$$\frac{1}{\sqrt n}\sum_{i=1}^{n}x_i^*\varepsilon_i^*=\frac{X^{*\prime}\varepsilon^*}{\sqrt n}=\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\xrightarrow{d}N\left(0,\sigma^2Q\right),$$
and since
$$\sqrt n\left(b_{GLS}-\beta\right)=\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\xrightarrow{d}Q^{-1}N\left(0,\sigma^2Q\right),$$
then
$$\sqrt n\left(b_{GLS}-\beta\right)\xrightarrow{d}N\left(0,\sigma^2Q^{-1}\right).$$
The asymptotic covariance matrix of $b_{GLS}$ is therefore
$$Avar\left(b_{GLS}\right)=\frac{\sigma^2}{n}Q^{-1}$$
and is estimated by
$$\widehat{Avar}\left(b_{GLS}\right)=s^2_{GLS}\left(X'\Omega^{-1}X\right)^{-1},$$
where
$$s^2_{GLS}=\frac{\left(y^*-X^*b_{GLS}\right)'\left(y^*-X^*b_{GLS}\right)}{n-k}=\frac{\left(y-Xb_{GLS}\right)'\Omega^{-1}\left(y-Xb_{GLS}\right)}{n-k}.$$

Exercise 47. (This may be skipped) Under GLS.1, SE and RS, prove that $\operatorname{plim}s^2_{GLS}=\sigma^2$.

7.3.3. Feasible GLS. In general situations we may know the form of $\Omega$ but not the values taken on by its elements. Therefore, to make GLS operational we need an estimate of $\Omega$, say $\widehat{\Omega}$. Replacing $\Omega$ by $\widehat{\Omega}$ in (7.3.4) delivers the feasible GLS, henceforth FGLS:
$$b_{FGLS}=\left(X'\widehat{\Omega}^{-1}X\right)^{-1}X'\widehat{\Omega}^{-1}y.$$
Since GLS is consistent, to know that $b_{GLS}$ and $b_{FGLS}$ are asymptotically equivalent, i.e. $\operatorname{plim}(b_{FGLS}-b_{GLS})=0$, is enough to ensure that $b_{FGLS}$ is consistent, but not that
$$\sqrt n\left(b_{FGLS}-\beta\right)\xrightarrow{d}N\left(0,\sigma^2Q^{-1}\right).$$
For this we need the stronger condition that $\sqrt n(b_{FGLS}-\beta)$ and $\sqrt n(b_{GLS}-\beta)$ be asymptotically equivalent, or
$$(7.3.6)\qquad \sqrt n\left(b_{FGLS}-b_{GLS}\right)\xrightarrow{p}0.$$
Two sufficient conditions for (7.3.6) are the following:
$$(7.3.7)\qquad \operatorname{plim}\left(\frac{X'\widehat{\Omega}^{-1}\varepsilon}{\sqrt n}-\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}\right)=0,$$
$$(7.3.8)\qquad \operatorname{plim}\left(\frac{X'\widehat{\Omega}^{-1}X}{n}-\frac{X'\Omega^{-1}X}{n}\right)=0.$$

Exercise 48. Use condition GLS.1 that
$$\operatorname{plim}\frac{X'\Omega^{-1}X}{n}=Q,$$
the similar condition that
$$\operatorname{plim}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n}=q_0,$$
where $q_0$ is a finite vector, and the Slutsky Theorem (if $g$ is a continuous function, then $\operatorname{plim}[g(z)]=g[\operatorname{plim}(z)]$, p. 1113) to verify that conditions (7.3.7) and (7.3.8) are sufficient for (7.3.6).

Solution: Given
$$\operatorname{plim}\frac{X'\Omega^{-1}X}{n}=Q\qquad\text{and}\qquad \operatorname{plim}\left(\frac{X'\widehat{\Omega}^{-1}X}{n}-\frac{X'\Omega^{-1}X}{n}\right)=0,$$
then
$$\operatorname{plim}\frac{X'\widehat{\Omega}^{-1}X}{n}=\operatorname{plim}\left[\frac{X'\Omega^{-1}X}{n}+\left(\frac{X'\widehat{\Omega}^{-1}X}{n}-\frac{X'\Omega^{-1}X}{n}\right)\right]=Q,$$
and so, applying the Slutsky Theorem twice, we get
$$(7.3.9)\qquad \operatorname{plim}\frac{X'\widehat{\Omega}^{-1}X}{n}=Q.$$
By the same token,
$$(7.3.10)\qquad \operatorname{plim}\frac{X'\widehat{\Omega}^{-1}\varepsilon}{\sqrt n}=q_0.$$
But now, since
$$\sqrt n\left(b_{FGLS}-\beta\right)=\left(\frac{X'\widehat{\Omega}^{-1}X}{n}\right)^{-1}\frac{X'\widehat{\Omega}^{-1}\varepsilon}{\sqrt n}\qquad\text{and}\qquad \sqrt n\left(b_{GLS}-\beta\right)=\left(\frac{X'\Omega^{-1}X}{n}\right)^{-1}\frac{X'\Omega^{-1}\varepsilon}{\sqrt n},$$
the last two equalities, along with the maintained conditions (7.3.7) and (7.3.8), the asymptotic results (7.3.9) and (7.3.10) and the Slutsky Theorem, prove that both $\sqrt n\left(b_{GLS}-\beta\right)$ and $\sqrt n\left(b_{FGLS}-\beta\right)$ converge in probability to the same limit, $Q^{-1}q_0$.

Conditions (7.3.7) and (7.3.8) must be verified on a case-by-case basis. Importantly, they may hold even in cases in which $\widehat{\Omega}$ is not consistent, as shown in the context of FGLS panel data estimators by Prucha (1984).
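To fix ideas on what "replacing $\Omega$ by an estimate" can look like in the simplest setting, here is a rough two-step sketch for groupwise heteroskedasticity (the error variance is constant within each of a few known groups, as in the Oaxaca-type model of Chapter 5, but its values are unknown); the variable names y, x1, x2 and group are placeholders:

* step 1: OLS and group-specific variance estimates from the residuals
regress y x1 x2
predict double ehat, residuals
generate double ehat2 = ehat^2
bysort group: egen double s2g = mean(ehat2)

* step 2: FGLS = weighted least squares with the estimated inverse variances
generate double wgt = 1/s2g
regress y x1 x2 [aweight=wgt]

The second regression is the feasible GLS step for this particular $\widehat{\Omega}$; whether the resulting estimator inherits the asymptotic properties of $b_{GLS}$ is exactly what conditions (7.3.7) and (7.3.8) are about.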

7.4. Large sample tests

7.4.1. Introduction. This section covers large sample tests in more detail than Greene (2008). For the exam you can skip the derivations of the asymptotic results.

Assume the following results hold:
(1) $\sqrt n\,(b-\beta)\xrightarrow{d}N\left(0,\sigma^2Q^{-1}\right)$;
(2) $\operatorname{plim}\frac{X'X}{n}=Q$;
(3) $\operatorname{plim}s^2=\sigma^2$;

and consider the following lemma, referred to as the product rule. For more on this see White (2001), p. 67 (notice that the product rule is not mentioned in Greene (2008), although it is implicitly used for proving the asymptotic distributions of the tests).

Lemma 49. (The product rule) Let $A_n$ be a sequence of random $(l\times k)$ matrices and $b_n$ a sequence of random $(k\times 1)$ vectors such that $\operatorname{plim}(A_n)=0$ and $b_n\xrightarrow{d}z$. Then, $\operatorname{plim}(A_nb_n)=0$.
7.4.2. The t-ratio test (skip derivations). We wish to derive the asymptotic distribution of the t-ratio test for the null hypothesis $H_o:\beta_k=\beta_k^o$. We begin by noting that under $H_o$
$$(7.4.1)\qquad \frac{\sqrt n\,(b_k-\beta_k^o)}{\sqrt{\sigma^2Q^{-1}_{kk}}}\xrightarrow{d}N(0,1)$$
by result 1 and Theorem D.16(2) in Greene (2008) (p. 1050).

Then, the t-ratio test for $H_o$ is
$$t=\frac{b_k-\beta_k^o}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}},$$
where $\left(X'X\right)^{-1}_{kk}=\left(x_k'M_{[X(k)]}x_k\right)^{-1}$ and $X=\left(X(k)\ x_k\right)$ (see Section 4.8). Since $t$ can be reformulated as
$$t=\frac{\sqrt n\,(b_k-\beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}},$$
then
$$(7.4.2)\qquad \operatorname{plim}\left[\frac{\sqrt n\,(b_k-\beta_k^o)}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}}-\frac{\sqrt n\,(b_k-\beta_k^o)}{\sqrt{\sigma^2Q^{-1}_{kk}}}\right]=\operatorname{plim}\left\{\left[\frac{1}{\sqrt{s^2\left(\frac{X'X}{n}\right)^{-1}_{kk}}}-\frac{1}{\sqrt{\sigma^2Q^{-1}_{kk}}}\right]\sqrt n\,(b_k-\beta_k^o)\right\}=0,$$
where the second equality follows from the product rule, given that, by results 2-3 and the Slutsky Theorem (Theorem D.12 in Greene (2008), p. 1045), the first factor in the second plim converges in probability to zero and, by result 1, the second factor converges in distribution to a normal random scalar. Hence, the two sequences in the plim of equation (7.4.2) are asymptotically equivalent and by Theorem D.16(3) have the same limiting distribution. Given (7.4.1), this proves that
$$\frac{b_k-\beta_k^o}{\sqrt{s^2\left(X'X\right)^{-1}_{kk}}}\xrightarrow{d}N(0,1).$$

Consider, now, the general case of a null hypothesis $H_o: r'\beta=q$, where $r$ is a non-zero $(k\times 1)$ vector of non-random constants and $q$ is a non-random scalar. Using the same approach as above it is possible to prove that
$$(7.4.3)\qquad \frac{r'(b-\beta)}{\sqrt{s^2\,r'\left(X'X\right)^{-1}r}}\xrightarrow{d}N(0,1).$$

Exercise 50. (skip) Prove (7.4.3). Hint: by the Slutsky Theorem, $\operatorname{plim}\,r'\left(\frac{X'X}{n}\right)^{-1}r=r'Q^{-1}r$.
7.4.3. The Chi-squared test (skip derivations). We wish to test the null hypothesis $H_o: R\beta-q=0$, where $R$ is a non-random $(J\times k)$ matrix of full-row rank and $q$ is a $(J\times 1)$ column vector. Under $H_o$, $R\beta=q$ and so the F test can be written as
$$F=\frac{(b-\beta)'R'\left[s^2R\left(X'X\right)^{-1}R'\right]^{-1}R(b-\beta)}{J}.$$
The foregoing equation can be rearranged as
$$JF=\sqrt n\,(b-\beta)'R'\left[s^2R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta).$$
Now let $A\equiv\sigma^2RQ^{-1}R'$. Since $A$ is p.d. ($R$ is f.r.r.), there exists a p.d. matrix $A^{1/2}$ such that $A^{1/2}A^{1/2}=A$ and $A^{-1/2}=\left(A^{1/2}\right)^{-1}$. Then, by result 1. and the Slutsky Theorem,
$$(7.4.4)\qquad A^{-1/2}R\sqrt n\,(b-\beta)\xrightarrow{d}N(0,I_J).$$
Similarly, let $\bar A\equiv s^2R\left(X'X/n\right)^{-1}R'$. Since $\bar A$ is p.d., there exists a p.d. matrix $\bar A^{1/2}$ such that $\bar A^{1/2}\bar A^{1/2}=\bar A$ and $\bar A^{-1/2}=\left(\bar A^{1/2}\right)^{-1}$. Then
$$(7.4.5)\qquad \operatorname{plim}\left[\bar A^{-1/2}R\sqrt n\,(b-\beta)-A^{-1/2}R\sqrt n\,(b-\beta)\right]=\operatorname{plim}\left[\left(\bar A^{-1/2}-A^{-1/2}\right)R\sqrt n\,(b-\beta)\right]=0,$$
where the second equality follows from the product rule, given that
$$\operatorname{plim}\bar A^{-1/2}=A^{-1/2}$$
by results 2. and 3. and the Slutsky Theorem, and
$$R\sqrt n\,(b-\beta)\xrightarrow{d}N(0,A)$$
by result 1. and Theorem D.16(2) in Greene (2008). Hence, by Theorem D.16(3) the two sequences in the left-hand-side plim of equation (7.4.5) have the same limiting distribution and, given (7.4.4), this proves that
$$\bar A^{-1/2}R\sqrt n\,(b-\beta)\xrightarrow{d}N(0,I_J).$$
Let $w\equiv\bar A^{-1/2}R\sqrt n\,(b-\beta)$; then, by Theorem D.16(2),
$$(7.4.6)\qquad w'w\xrightarrow{d}\chi^2(J).$$
But since
$$\bar A^{-1/2}\bar A^{-1/2}=\bar A^{-1}=\left[s^2R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1},$$
then
$$w'w=\sqrt n\,(b-\beta)'R'\bar A^{-1/2}\bar A^{-1/2}R\sqrt n\,(b-\beta)=\sqrt n\,(b-\beta)'R'\left[s^2R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}R\sqrt n\,(b-\beta)=JF,$$
and so by (7.4.6)
$$JF\xrightarrow{d}\chi^2(J).$$

CHAPTER 8

Fixed and Random Effects Panel Data Models

8.1. Introduction

This chapter covers the two most important panel data models: the fixed effects and the random effects models.

For simplicity we start directly from the statistical models. The sampling mechanism will be introduced when proving asymptotic normality.

Results in this chapter are demonstrated through the do-file paneldata.do using the data set airlines.dta, a panel data set that I have extracted from costfn.dta (Baltagi et al. 1998).

8.2. The Fixed Effect Model (or Least Squares Dummy Variables Model)

Consider the following panel data regression model expressed at the observation level, that is, for individual $i=1,...,N$ and time $t=1,...,T$:
$$(8.2.1)\qquad y_{it}=x_{it}'\beta+\alpha_i+\varepsilon_{it},$$
where $x_{it}'=(x_{1it},...,x_{kit})$,
$$\beta=\begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}$$
and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$.

Define $d_{jit}$ as the value taken on by the dummy variable indicating individual $j=1,...,N$ at the observation $(i,t)$, that is,
$$d_{jit}=\begin{cases}1 & \text{if } i=j,\ \text{any } t=1,...,T\\ 0 & \text{if } i\neq j,\ \text{any } t=1,...,T.\end{cases}$$
Then, model (8.2.1) can be equivalently written as
$$(8.2.2)\qquad y_{it}=x_{it}'\beta+d_{1it}\alpha_1+...+d_{iit}\alpha_i+...+d_{Nit}\alpha_N+\varepsilon_{it},\qquad i=1,...,N\ \text{and}\ t=1,...,T.$$

In a more compact notation, at the individual level $i=1,...,N$, model (8.2.2) is written as
$$y_i=X_i\beta+d_{1i}\alpha_1+...+d_{ii}\alpha_i+...+d_{Ni}\alpha_N+\varepsilon_i,$$
where
$$\underset{(T\times 1)}{y_i}=\begin{pmatrix}y_{i1}\\ \vdots\\ y_{it}\\ \vdots\\ y_{iT}\end{pmatrix},\qquad \underset{(T\times k)}{X_i}=\begin{pmatrix}x_{i1}'\\ \vdots\\ x_{it}'\\ \vdots\\ x_{iT}'\end{pmatrix},\qquad \underset{(T\times 1)}{\varepsilon_i}=\begin{pmatrix}\varepsilon_{i1}\\ \vdots\\ \varepsilon_{it}\\ \vdots\\ \varepsilon_{iT}\end{pmatrix}$$
and
$$d_{ji}=\begin{cases}1_T & \text{if } i=j\\ 0_T & \text{if } i\neq j;\end{cases}$$
$1_T$ indicates the $(T\times 1)$ vector of all unity elements and $0_T$ the $(T\times 1)$ vector of all zero elements.

Stacking data by individuals, an even more compact representation of the regression model (8.2.2), at the level of the whole data set, is
$$(8.2.3)\qquad y=X\beta+D\alpha+\varepsilon,$$

where
$$\underset{(NT\times 1)}{y}=\begin{pmatrix}y_1\\ \vdots\\ y_i\\ \vdots\\ y_N\end{pmatrix},\qquad \underset{(NT\times k)}{X}=\begin{pmatrix}X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix},\qquad \underset{(NT\times 1)}{\varepsilon}=\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_i\\ \vdots\\ \varepsilon_N\end{pmatrix},\qquad \underset{(N\times 1)}{\alpha}=\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_i\\ \vdots\\ \alpha_N\end{pmatrix}$$
and $D$ is the $(NT\times N)$ matrix of dummy variables
$$D=\begin{pmatrix}1_T & 0_T & \cdots & 0_T\\ 0_T & 1_T & \cdots & 0_T\\ \vdots & \vdots & \ddots & \vdots\\ 0_T & 0_T & \cdots & 1_T\end{pmatrix},$$
or equivalently $D=(d_1\ d_2\ ...\ d_N)$. Under the following assumptions model (8.2.3) is a classical regression model that includes individual dummies:

FE.1: The extended regressor matrix $(X\ D)$ has f.c.r. Therefore, not only is $X$ of f.c.r., but also none of its columns can be expressed as a linear combination of the dummy variables, which boils down to saying that no column of $X$ can be time-constant, which in turn implies that $X$ does not include the unity vector (indeed, there is a constant term in model (8.2.3), but one that jumps across individuals).

FE.2: $E(\varepsilon|X)=0$. Hence, the variables in $X$ are strictly exogenous with respect to $\varepsilon$, but the statistical relationship with $\alpha$ is left completely unrestricted. Model (8.2.3), therefore, automatically accommodates any form of omitted-variable bias due to the omission of time-constant regressors. Notice that $D$ is taken as a non-random matrix, therefore conditioning on $(X\ D)$ or simply on $X$ is exactly the same.

FE.3: $Var(\varepsilon|X)=\sigma_\varepsilon^2I_{NT}$. This is standard. It can be relaxed in more advanced treatments of the topic, as in Arellano (2003) for example, but see also Section 8.7 (and Chapter 9, which can however be skipped for the exam).
Exercise 51. Prove that the following model with the constant term is an equivalent reparametrization of Model (8.2.3):
$$(8.2.4)\qquad y=1_{NT}\gamma_0+X\beta+D_{-1}\gamma_{-1}+\varepsilon,$$
where $D_{-1}=(d_2\ ...\ d_N)$, $\gamma_0\equiv\alpha_1$, $\gamma_{-1}\equiv\alpha_{-1}-1_{N-1}\alpha_1$, $\alpha_{-1}=(\alpha_2\ ...\ \alpha_N)'$ and $1_s$ denotes the $s\times 1$ vector of all unity elements.

Solution. Partition $D=(d_1\ D_{-1})$ and
$$\alpha=\begin{pmatrix}\alpha_1\\ \alpha_{-1}\end{pmatrix}.$$
Then, rewrite model (8.2.3) equivalently as
$$(8.2.5)\qquad y=d_1\alpha_1+X\beta+D_{-1}\alpha_{-1}+\varepsilon.$$
Since $D1_N=1_{NT}$, then $(d_1\ D_{-1})1_N=1_{NT}$, or equivalently
$$d_1+D_{-1}1_{N-1}=1_{NT}.$$
Therefore, we can reparametrize model (8.2.5) adding $1_{NT}\alpha_1$ and subtracting $(d_1+D_{-1}1_{N-1})\alpha_1$ on the right-hand side of (8.2.5) to have
$$y=1_{NT}\alpha_1+d_1\alpha_1+X\beta+D_{-1}\alpha_{-1}-(d_1+D_{-1}1_{N-1})\alpha_1+\varepsilon=1_{NT}\alpha_1+X\beta+D_{-1}\left(\alpha_{-1}-1_{N-1}\alpha_1\right)+\varepsilon=1_{NT}\gamma_0+X\beta+D_{-1}\gamma_{-1}+\varepsilon,$$
where $\gamma_0\equiv\alpha_1$ and $\gamma_{-1}\equiv\alpha_{-1}-1_{N-1}\alpha_1$.

Remark 52. Exercise 51 demonstrates that after the reparametrization the interpretation of the $\beta$ coefficients is unchanged, the constant term is $\alpha_1$, and the coefficients on the remaining individual dummies are no longer the individual effects of the remaining individuals, $\alpha_i$, $i=2,...,N$, but rather the contrasts of $\alpha_i$ with respect to $\alpha_1$, $i=2,...,N$. Of course, the reference individual need not be the first one in the data set and can be freely chosen among the $N$ individuals by the researcher at her/his own convenience. In Stata this is implemented by using regress followed by the dependent variable, the $X$ regressors and $N-1$ dummy variables (see paneldata.do).

Remark 53. The interpretation of the constant in Exercise 51 is different from that in the Stata transformation (see 10/04/12 Exercises) of Model (8.2.3). In the former case the constant term is the effect of the individual whose dummy is removed from the regression, in the latter it is the average of the $N$ individual effects.

The LSDV estimator is just the OLS estimator applied to model (8.2.3) and, given FE.1-3, it is the BLUE. The separate formulas of LSDV for $\beta$ and $\alpha$ are obtained by applying the FWL Theorem to (8.2.3). So,
$$b_{LSDV}=\left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y$$
is the LSDV estimator for $\beta$ and
$$a_{LSDV}=\left(D'D\right)^{-1}D'\left(y-Xb_{LSDV}\right)$$
is the LSDV estimator for $\alpha$. As already mentioned, both are BLUEs, but while $b_{LSDV}$ converges in probability to $\beta$ when $N\to\infty$ or $T\to\infty$ or both, $a_{LSDV}$ converges in probability to $\alpha$ only when $T\to\infty$. This discrepant large-sample behavior of $b_{LSDV}$ and $a_{LSDV}$ is due to the fact that the dimension of $\alpha$ increases as $N$ increases, whereas that of $\beta$ is kept fixed at $k$.

Exercise 54. Verify that
$$\left(D'D\right)^{-1}=\begin{pmatrix}1/T & 0 & \cdots & 0\\ 0 & 1/T & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1/T\end{pmatrix}.$$

Given Exercise 54, $\left(D'D\right)^{-1}D'=\frac{1}{T}D'$ and so for a generic $(NT\times 1)$ vector $z$
$$\left(D'D\right)^{-1}D'z=\begin{pmatrix}\frac{1}{T}\sum_{t=1}^{T}z_{1t}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^{T}z_{it}\\ \vdots\\ \frac{1}{T}\sum_{t=1}^{T}z_{Nt}\end{pmatrix}=\begin{pmatrix}\bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\end{pmatrix}\equiv\bar z.$$
In words, premultiplying any $(NT\times 1)$ vector $z$ by $\left(D'D\right)^{-1}D'$ transforms it into an $(N\times 1)$ vector of means, $\bar z$, where each mean is taken over the group of observations peculiar to the same individual and for this reason is called a group mean. Therefore,
$$a_{LSDV_i}=\bar y_i-\bar x_i'b_{LSDV},$$

where $\bar x_i'=\left(\bar x_{1i}\ ...\ \bar x_{ki}\right)$ is the $(1\times k)$ vector of group means for individual $i$. It is also clear that for any $(NT\times 1)$ vector $z$
$$P_{[D]}z=D\left(D'D\right)^{-1}D'z=\begin{pmatrix}\bar z_1\\ \vdots\\ \bar z_1\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_i\\ \vdots\\ \bar z_N\\ \vdots\\ \bar z_N\end{pmatrix}.$$
In words, premultiplying any $(NT\times 1)$ vector $z$ by $P_{[D]}$ transforms it into a sample-conformable $(NT\times 1)$ vector of group means: each group mean is repeated $T$ times. It follows that $M_{[D]}z=z-P_{[D]}z$ is the $(NT\times 1)$ vector of group-mean deviations. Therefore, one can obtain $b_{LSDV}$ by applying OLS to the model transformed in group-mean deviations, that is, regressing $M_{[D]}y$ on $M_{[D]}X$.

Exercise 55. Verify (in a couple of seconds...) that $P_{[D]}=\frac{1}{T}DD'$.

The conditional variance-covariance matrix of $b_{LSDV}$ is $Var(b_{LSDV}|X)=\sigma_\varepsilon^2\left(X'M_{[D]}X\right)^{-1}$. It is estimated by replacing $\sigma_\varepsilon^2$ with the Anova estimator $s^2_{LSDV}$, based on the LSDV residuals $e_{LSDV}=M_{[D]}y-M_{[D]}Xb_{LSDV}$:
$$(8.2.6)\qquad s^2_{LSDV}=\frac{e_{LSDV}'e_{LSDV}}{NT-N-k}.$$

Exercise 56. Prove that $E\left(s^2_{LSDV}\right)=\sigma_\varepsilon^2$. This is a long one, but when done you can tell yourself BRAVO! I just give you a few hints. First, on noting that $y$ is determined by the right-hand side of (8.2.3), prove that $e_{LSDV}=M_{[M_{[D]}X]}M_{[D]}\varepsilon$; then elaborate the conditional mean of $\varepsilon'M_{[D]}M_{[M_{[D]}X]}M_{[D]}\varepsilon$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.
It is not hard to verify (do it) that $b_{LSDV}$ can be obtained from the OLS regression of model (8.2.3) transformed in group-mean deviations (this transformation is referred to in the panel-data literature as the within transformation)
$$(8.2.7)\qquad M_{[D]}y=M_{[D]}X\beta+M_{[D]}\varepsilon.$$
The intuition is simple: since the group mean of any time-constant element, such as $\alpha_i$, coincides with the element itself, all time-constant elements in model (8.2.3) are wiped out; this also explains why $X$ cannot contain time-constant variables. So, in a sense, the within transformation controls out the whole time-constant heterogeneity, latent or not, in model (8.2.3), making it look almost like a classical LRM. In particular, it can be proved easily that LRM.1-LRM.3 hold. Notice, however, that the errors in the transformed model, $M_{[D]}\varepsilon$, have a non-diagonal conditional covariance matrix (it is, indeed, block-diagonal and singular; can you derive it?). Specifically, the vector $M_{[D]}\varepsilon$ presents within-group serial correlation, since for each individual group there are only $T-1$ linearly independent demeaned errors. As a consequence, LRM.4 does not apply to model (8.2.7). All the same, $b_{LSDV}$ is BLUE. This is true because the condition of Theorem 38 in Section 5.4 is met (if you have answered the previous question on the covariance matrix of $M_{[D]}\varepsilon$, you should be able to verify also this claim).

One should not conclude from the foregoing discussion that OLS on the within-transformed model (8.2.7) is a safe strategy. As in Oaxaca's pooled model of Section 5.2, the fact that the error covariance matrix is not spherical, presenting in this specific case within-group serial correlation, has bad consequences as far as standard error estimates are concerned. Indeed, should we leave the econometric software free to treat model (8.2.7) as a classical LRM, and so regress $M_{[D]}y$ on $M_{[D]}X$, it would compute coefficient estimates just fine, but it would estimate $Var(b_{LSDV}|X)$ by $s^2\left(X'M_{[D]}X\right)^{-1}$, with $s^2=e_{LSDV}'e_{LSDV}/(NT-k)\neq s^2_{LSDV}$, which is biased since it uses a wrong degrees-of-freedom correction. The econometric software cannot be aware that for each individual in the sample there are only $T-1$ linearly independent demeaned errors, and so, rather than dividing the residual sum of squares by $N(T-1)-k$, it divides it by $NT-k$. The upshot is that standard errors estimated in this way need rectifying by multiplying each of them by the correction factor $\sqrt{(NT-k)/(NT-N-k)}$.
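To make the rectification concrete, here is a minimal sketch of the manual within-group route, with placeholder variable names (id, y, x1, x2); it is only meant to illustrate the correction factor, since xtreg, fe (Section 8.4) does all of this internally with the right degrees of freedom:

* within transformation: group-mean deviations
foreach v of varlist y x1 x2 {
    bysort id: egen double m_`v' = mean(`v')
    generate double w_`v' = `v' - m_`v'
}

* OLS on the demeaned variables: right coefficients, wrong dof for the s.e.
regress w_y w_x1 w_x2, noconstant

* correction factor for the reported standard errors
scalar NT = e(N)
scalar k  = e(df_m)
quietly tabulate id
scalar N  = r(r)
display "multiply each reported s.e. by " sqrt((NT-k)/(NT-N-k))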
An interesting assumption to test is that of the absence of individual heterogeneity, $H_0:\alpha_1=\alpha_2=...=\alpha_N$. Under the restriction implied by $H_0$, model (8.2.3) pools together all data with no attention to the individual clustering and can be written as
$$(8.2.8)\qquad y=\bar X\delta+\varepsilon,$$
where
$$\bar X=(1_{NT}\ X),\qquad \delta=\begin{pmatrix}\alpha\\ \beta\end{pmatrix}.$$
Hence, under $H_0$, the pooled OLS (POLS) estimator
$$(8.2.9)\qquad b_{POLS}=\left(\bar X'\bar X\right)^{-1}\bar X'y$$
is the BLUE. Let $e_{POLS}$ indicate the restricted residual vector
$$(8.2.10)\qquad e_{POLS}=y-\bar Xb_{POLS};$$
then, under normality and $H_0$,
$$(8.2.11)\qquad F=\frac{\left(e_{POLS}'e_{POLS}-e_{LSDV}'e_{LSDV}\right)/(N-1)}{e_{LSDV}'e_{LSDV}/(NT-N-k)}\sim F\left(N-1,\ NT-N-k\right).$$
If $F$ does not reject $H_0$, POLS is a legitimate estimation procedure, more efficient than LSDV. If $F$ rejects $H_0$, then POLS is biased and LSDV should be adopted.

Exercise 57. On reparametrizing the LSDV model as in Exercise 51, the hypothesis of no individual heterogeneity becomes $H_0:\gamma_{-1}=0$. Prove that the resulting F-test is numerically identical to $F$ in Equation (8.2.11).

Solution. Easy. Since models (8.2.3) and (8.2.4) are indeed the same model, the resulting F-test is numerically identical to the F-test in Equation (8.2.11). This is demonstrated empirically in the paneldata.do Stata do-file.
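In practice there is no need to assemble (8.2.11) by hand: after fitting the fixed effects model, Stata prints at the bottom of the output an F test of the null that all individual effects are equal. A minimal sketch with placeholder variable names:

xtset id time
xtreg y x1 x2, fe

The line reporting "F test that all u_i=0" corresponds to the statistic in (8.2.11).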
8.3. The Random Effect Model

The random effect model has the same algebraic structure as model (8.2.1). At the observation level, $i=1,...,N$ and $t=1,...,T$, we have
$$(8.3.1)\qquad y_{it}=x_{it}'\beta+\alpha_i+\varepsilon_{it},$$
where $x_{it}'=(x_{1it},...,x_{kit})$, $\beta=(\beta_1\ ...\ \beta_k)'$ and $\alpha_i$ is a scalar denoting the time-constant, individual-specific effect for individual $i$. The statistical properties of model (8.3.1) are different, though. Without loss of generality, write $\alpha_i$ as $\alpha_i=\alpha_0+u_i$ and let
$$\underset{(N\times 1)}{u}=\begin{pmatrix}u_1\\ \vdots\\ u_i\\ \vdots\\ u_N\end{pmatrix}.$$
Model (8.3.1) can then be written compactly as
$$(8.3.2)\qquad y=\bar X\delta+w,$$
where
$$\bar X=(1_{NT}\ X),\qquad \delta=\begin{pmatrix}\alpha_0\\ \beta\end{pmatrix}\qquad\text{and}\qquad w=\varepsilon+Du.$$
The following is maintained.

RE.1: $\bar X$ has f.c.r. This is equivalent to 1) $X$ of f.c.r. and 2) no linear combination of the columns of $X$ being equal to $1_{NT}$. Hence, as long as these two requirements are met, $X$ can contain time-constant variables.

RE.2: $E(\varepsilon|\bar X)=0$ and $E(u|\bar X)=0$. This maintains strict exogeneity of $\bar X$ with respect to both components of $w$, and so with respect to $w$ itself. It is a stringent assumption, implying that the time-constant variables that are not included in the regression are unrelated to the included regressors $X$. Notice that since $1_{NT}$ is non-random, conditioning on $\bar X$ is the same as conditioning on $X$.

RE.3: $Var(\varepsilon|\bar X)=\sigma_\varepsilon^2I_{NT}$, $Var(u|\bar X)=\sigma_u^2I_N$, $Cov(\varepsilon,u|\bar X)=E(\varepsilon u'|\bar X)=0_{(NT\times N)}$.

Let $\Omega\equiv Var(w|\bar X)$. Then, given RE.3,
$$\Omega=Var(\varepsilon|\bar X)+Var(Du|\bar X)=\sigma_\varepsilon^2I_{NT}+\sigma_u^2DD'.$$
This means that under RE.1-3 $w$, although homoskedastic, has a non-diagonal covariance matrix and the POLS estimator in (8.2.9) is unbiased (verify this) but not BLUE (unless $\sigma_u^2=0$). The BLUE estimator for $\delta$ is therefore the GLS random effect estimator
$$b_{GLS-RE}=\left(\bar X'\Omega^{-1}\bar X\right)^{-1}\bar X'\Omega^{-1}y.$$
For implementation of $b_{GLS-RE}$, we need to work out $\Omega^{-1}$.

Exercise 58. Verify that $w$ is homoskedastic and in particular that $Var(w_{it}|\bar X)=\sigma_\varepsilon^2+\sigma_u^2$ for all $i=1,...,N$ and $t=1,...,T$.

Since (see Exercise 55) $P_{[D]}=\frac{1}{T}DD'$, then
$$\Omega=\sigma_\varepsilon^2I_{NT}+T\sigma_u^2P_{[D]}=\sigma_\varepsilon^2M_{[D]}+\sigma_\varepsilon^2P_{[D]}+T\sigma_u^2P_{[D]}=\sigma_\varepsilon^2M_{[D]}+\sigma_1^2P_{[D]},$$
where $\sigma_1^2\equiv\sigma_\varepsilon^2+T\sigma_u^2$. Therefore,
$$(8.3.3)\qquad \Omega^{-1}=\frac{1}{\sigma_\varepsilon^2}M_{[D]}+\frac{1}{\sigma_1^2}P_{[D]}$$
and
$$b_{GLS-RE}=\left[\bar X'\left(\frac{1}{\sigma_\varepsilon^2}M_{[D]}+\frac{1}{\sigma_1^2}P_{[D]}\right)\bar X\right]^{-1}\bar X'\left(\frac{1}{\sigma_\varepsilon^2}M_{[D]}+\frac{1}{\sigma_1^2}P_{[D]}\right)y.$$

Exercise 59. Verify that $\frac{1}{\sigma_\varepsilon^2}M_{[D]}+\frac{1}{\sigma_1^2}P_{[D]}$ is indeed the inverse of $\Omega$, that is,
$$\left(\sigma_\varepsilon^2M_{[D]}+\sigma_1^2P_{[D]}\right)\left(\frac{1}{\sigma_\varepsilon^2}M_{[D]}+\frac{1}{\sigma_1^2}P_{[D]}\right)=I_{NT}$$
(easy, if you remember the properties of $M_{[D]}$ and $P_{[D]}$).


Exercise 60. 1) Verify that $b_{GLS-RE}$ can also be written as
$$(8.3.4)\qquad b_{GLS-RE}=\left[\bar X'\left(M_{[D]}+\frac{\sigma_\varepsilon^2}{\sigma_1^2}P_{[D]}\right)\bar X\right]^{-1}\bar X'\left(M_{[D]}+\frac{\sigma_\varepsilon^2}{\sigma_1^2}P_{[D]}\right)y.$$
2) Verify that premultiplying all variables of model (8.3.2) by $M_{[D]}+\frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ transforms it into a classical regression model, so that $b_{GLS-RE}$ can be obtained at once by applying OLS to the transformed model. 3) Verify that the operator $M_{[D]}+\frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}$ can also be written as
$$(8.3.5)\qquad M_{[D]}+\frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}=I_{NT}-\left(1-\frac{\sigma_\varepsilon}{\sigma_1}\right)P_{[D]}.$$

The operator in (8.3.5), $M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}$, transforms any conformable variable that it pre-multiplies in quasi-mean deviations, or partial deviations, in the sense that it only removes a portion of the group mean from the variable. For this reason, the coefficients on time-constant variables are identified in the RE model: time-constant variables, when premultiplied by $M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}$, are not wiped out but rescaled by a factor $\sigma_\varepsilon/\sigma_1$. The RE model under the GLS transformation is therefore
$$(8.3.6)\qquad \left[M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}\right]y=\left[M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}\right]\bar X\delta+\left[M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w,$$
and you may wish to verify that indeed
$$Var\left(\left[M_{[D]}+(\sigma_\varepsilon/\sigma_1)P_{[D]}\right]w\,\Big|\,\bar X\right)=\sigma_\varepsilon^2I_{NT}.$$

8.3.1. The Feasible GLS. The feasible version of $b_{GLS-RE}$, say $b_{FGLS-RE}$, the one that is actually implemented in econometric software packages, can be obtained through the method by Swamy and Arora (1972). The estimator for $\sigma_\varepsilon^2$ is simply $s^2_{LSDV}$ in (8.2.6), and that for $\sigma_1^2$ is obtained as follows.
Define the Between residual vector $e_B$ as
$$(8.3.7)\qquad e_B=P_{[D]}y-P_{[D]}\bar Xb_B,$$
where $b_B=\left(\bar X'P_{[D]}\bar X\right)^{-1}\bar X'P_{[D]}y$. In words, $e_B$ is the residual vector from the OLS regression of the group means of $y$ on the group means of $\bar X$. The resulting estimator, $b_B$, is referred to in the panel data literature as the Between estimator.¹ Then, based on $e_B$, construct the Anova estimator for $\sigma_1^2$ as
$$s^2_B=\frac{e_B'e_B}{N-k-1}.$$

Exercise 61. Prove that $E\left(s^2_B\right)=\sigma_\varepsilon^2+T\sigma_u^2$. Same hint as for Exercise 56: first, on noting that $y$ is determined by the right-hand side of (8.3.2), prove that $e_B=M_{[P_{[D]}\bar X]}P_{[D]}w$; then elaborate the conditional mean of $w'P_{[D]}M_{[P_{[D]}\bar X]}P_{[D]}w$ using the trace operator as we did for $s^2$; finally, apply the law of iterated expectations.

Solution. Replacing the formula of $b_B$ into the right-hand side of equation (8.3.7) gives
$$e_B=P_{[D]}y-P_{[D]}\bar X\left(\bar X'P_{[D]}\bar X\right)^{-1}\bar X'P_{[D]}y=P_{[D]}w-P_{[D]}\bar X\left(\bar X'P_{[D]}\bar X\right)^{-1}\bar X'P_{[D]}w=M_{[P_{[D]}\bar X]}P_{[D]}w.$$
Therefore,
$$e_B'e_B=w'P_{[D]}M_{[P_{[D]}\bar X]}P_{[D]}w=w'M_{[P_{[D]}\bar X]}P_{[D]}w,$$
where the first equality follows from the idempotence of $M_{[P_{[D]}\bar X]}$ and the second from
$$P_{[D]}M_{[P_{[D]}\bar X]}=M_{[P_{[D]}\bar X]}P_{[D]}$$
and the idempotence of $P_{[D]}$.

Upon noticing that $e_B'e_B$ is a scalar,
$$e_B'e_B=tr\left(w'M_{[P_{[D]}\bar X]}P_{[D]}w\right)=tr\left(M_{[P_{[D]}\bar X]}P_{[D]}ww'\right),$$
and so
$$E\left(e_B'e_B|\bar X\right)=E\left[tr\left(M_{[P_{[D]}\bar X]}P_{[D]}ww'\right)\Big|\bar X\right]=tr\left[M_{[P_{[D]}\bar X]}P_{[D]}E\left(ww'|\bar X\right)\right]=tr\left(M_{[P_{[D]}\bar X]}P_{[D]}\Omega\right).$$
Expressing $\Omega$ in spectral decomposition, $\Omega=\sigma_\varepsilon^2M_{[D]}+\sigma_1^2P_{[D]}$, yields $P_{[D]}\Omega=\sigma_1^2P_{[D]}$, given that $P_{[D]}$ is idempotent and $P_{[D]}M_{[D]}=0_{(NT\times NT)}$. Hence,
$$E\left(e_B'e_B|\bar X\right)=\sigma_1^2\,tr\left(M_{[P_{[D]}\bar X]}P_{[D]}\right).$$
Then, it remains to prove that $tr\left(M_{[P_{[D]}\bar X]}P_{[D]}\right)=N-k-1$. Since
$$M_{[P_{[D]}\bar X]}P_{[D]}=P_{[D]}-P_{[D]}\bar X\left(\bar X'P_{[D]}\bar X\right)^{-1}\bar X'P_{[D]},$$
$$tr\left(M_{[P_{[D]}\bar X]}P_{[D]}\right)=tr\left(P_{[D]}\right)-tr\left[P_{[D]}\bar X\left(\bar X'P_{[D]}\bar X\right)^{-1}\bar X'P_{[D]}\right]=tr\left(I_N\right)-tr\left(I_{k+1}\right)=N-k-1.$$

¹ Technical note: I maintain that no column of $X$ is either time-constant or already in group-mean deviations, so that both $b_{LSDV}$ and $b_B$ are uniquely defined (in fact, with such an assumption $\bar X'P_{[D]}\bar X$ and $X'M_{[D]}X$ are both non-singular). Indeed, this is only made for simplicity, since it is possible to prove that $s^2_B$ and $s^2_{LSDV}$ are uniquely defined even if $b_{LSDV}$ and $b_B$ are not. The proof requires that all inverse matrices in the residual formulas be replaced with generalized inverse matrices. But don't worry, I won't pursue it further.

In conclusion, the Feasible GLS, $b_{FGLS-RE}$, is
$$b_{FGLS-RE}=\left[\bar X'\left(M_{[D]}+\frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)\bar X\right]^{-1}\bar X'\left(M_{[D]}+\frac{s^2_{LSDV}}{s^2_B}P_{[D]}\right)y.$$

Exercise 62. Prove that
$$E\left(\frac{e_{LSDV}'e_{LSDV}}{NT-N-k}\Big|\bar X\right)=\sigma_\varepsilon^2$$
(hint: follow the same steps as above, noticing that $M_{[D]}w=M_{[D]}\varepsilon$).

Exercise 63. Derive the formula for the subvector of $b_{GLS-RE}$, say $b^{\beta}_{GLS-RE}$, estimating the $\beta$ vector.

Solution. Simply apply the FWL Theorem to the GLS-transformed RE model in (8.3.6), noticing that by the well-known properties of orthogonal projectors $P_{[D]}P_{[1_{NT}]}=P_{[1_{NT}]}$ (remember $1_{NT}=D1_N$ and so $1_{NT}\in R(D)$), so that
$$\left[M_{[D]}+\frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right]M_{[1_{NT}]}\left[M_{[D]}+\frac{\sigma_\varepsilon}{\sigma_1}P_{[D]}\right]=M_{[D]}+\frac{\sigma_\varepsilon^2}{\sigma_1^2}\left(P_{[D]}-P_{[1_{NT}]}\right)$$
and eventually
$$b^{\beta}_{GLS-RE}=\left\{X'\left[M_{[D]}+\frac{\sigma_\varepsilon^2}{\sigma_1^2}\left(P_{[D]}-P_{[1_{NT}]}\right)\right]X\right\}^{-1}X'\left[M_{[D]}+\frac{\sigma_\varepsilon^2}{\sigma_1^2}\left(P_{[D]}-P_{[1_{NT}]}\right)\right]y.$$

8.4. Stata implementation of standard panel data estimators

Both fixed effects and random effects estimators are implemented through the Stata command xtreg, with the usual Stata syntax for regression commands: the command is followed by the name of the dependent variable and then the list of regressors. The noconstant option is not admitted in this case.

As a preliminary step, however, a panel data declaration is needed to make Stata aware of which variables in our data identify time and individuals. Suppose that in our data the individual variable is named id and the time variable time; then the panel data declaration is carried out by the instruction

xtset id time

The random effects estimator is the default of xtreg, while the fixed effects (LSDV) estimator requires the option fe.

Sometimes, you may find it convenient to implement FE and RE estimators by hand, using regress rather than xtreg. The greater computational effort may pay for the simple reason that regress, being the most popular estimation command in Stata, is updated more frequently to accommodate the most recent developments in statistics and econometrics, and so typically has more options than any other estimation command in Stata. To implement bLSDV and aLSDV at once you may just apply regress to the LSDV model (8.2.3). This requires generating a full set of individual dummies from the individual identifier id in your panel. This is done through the tabulate command with an option, as follows

tabulate id, gen(id_)

where id_ is a name of choice. If N equals, say, 100, tabulate will add the full set of 100 individual dummies to your data, with names id_1, id_2, ..., id_100, and you can just treat them as regressors in a regress instruction to get bLSDV as the coefficient estimates for the X variables and aLSDV as the coefficient estimates on the id_1-id_100 variables. Degrees of freedom are correctly calculated as $NT-N-k$ and so no correction of standard errors is needed. Notice that if you include all 100 dummies, then the constant term should be removed by the noconstant option. Alternatively, you can leave it there and include $N-1$ dummies. While the bLSDV estimates remain unchanged, the coefficient estimates on the included dummies do not. The latter must now be thought of as contrasts with respect to the constant estimate, which turns out to equal the individual effect estimate peculiar to the individual excluded from the regression, who is therefore treated as the base individual. Nothing is lost by choosing either identification strategy.

When N is large the foregoing regress strategy is not practical. The bLSDV estimator can then be manually implemented by applying the within transformation, carrying out OLS on the transformed model and then correcting standard errors appropriately (see the illustrative sketch in Section 8.2). Implementation of bFGLS-RE by hand is more tricky and one goes along the following steps: 1) get the two variance-components estimates from within and between regressions; 2) transform variables (including the constant) in partial deviations; and 3) apply OLS to the transformed variables. Details can be found in a Stata do-file available on the learning space.

I recommend always using the official xtreg command to implement the standard panel data estimators in empirical applications, unless it is strictly necessary to do otherwise (for example, if I explicitly ask you to!).
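For completeness, here is a rough sketch of the by-hand FGLS-RE route following the formulas of Section 8.3.1 for a balanced panel (variable names are placeholders and the panel is assumed already declared with xtset; xtreg, re implements the same Swamy-Arora idea, possibly with small-sample refinements, so small numerical differences should not surprise):

* 1) variance components from within and between regressions
xtreg y x1 x2, fe
scalar s2e = e(sigma_e)^2
scalar T   = e(N)/e(N_g)
preserve
collapse (mean) y x1 x2, by(id)
regress y x1 x2
scalar s21 = T*e(rss)/e(df_r)
restore
scalar theta = 1 - sqrt(s2e/s21)

* 2) partial (quasi-mean) deviations, including the constant
foreach v of varlist y x1 x2 {
    bysort id: egen double m_`v' = mean(`v')
    generate double q_`v' = `v' - theta*m_`v'
}
generate double q_cons = 1 - theta

* 3) OLS on the transformed variables
regress q_y q_x1 q_x2 q_cons, noconstant

Here theta is $1-\sigma_\varepsilon/\sigma_1$ from (8.3.5): subtracting theta times the group mean is the partial-deviations operator applied observation by observation.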

8.5. Testing fixed effects against random effects models

As Hausman (1978) and Mundlak (1978) independently found (in two papers that appeared in the same Econometrica issue!), the RE model is a special case of the FE model. In fact, while in the former model assumption RE.2 models the relationship between the random individual components, $u$, and $X$ ($E(u|X)=0$), the latter leaves it completely unrestricted. In consequence, the RE model is nested in the FE model, so that a test discriminating between them can be easily implemented with $E(u|X)=0$ as the null hypothesis.

I present here two popular tests that, moving from the foregoing consideration, can provide some guidance in the choice between RE and FE models.

8.5.1. The Hausman test. Under $H_o: E(u|X)=0$, both the LSDV and the FGLS-RE estimators are consistent for $N\to\infty$, but the LSDV estimator is inefficient: redundant individual effects are included in the regression when they could rather have been regarded as random disturbances, saving on degrees of freedom. On the other hand, if $H_o$ is not true the LSDV estimator remains consistent, but FGLS does not, undergoing an omitted-variable bias. The basic idea of the Hausman test (Hausman, 1978), therefore, is that under $H_o$ the statistical difference between the two estimators should not be significantly different from zero in large samples.

Hausman proves that, under RE.1-RE.3, such a difference can be measured by the statistic
$$H=\left(b_{LSDV}-b_{FGLS-RE}\right)'\left[\widehat{Avar}\left(b_{LSDV}-b_{FGLS-RE}\right)\right]^{-1}\left(b_{LSDV}-b_{FGLS-RE}\right)$$
and also that
$$H\xrightarrow{d}\chi^2(k).$$
Hausman also provides a useful computational result. He shows that since $b_{FGLS-RE}$ is asymptotically efficient and $b_{LSDV}$ is inefficient under the null,
$$Acov\left(b_{LSDV}-b_{FGLS-RE},\,b_{FGLS-RE}\right)=0,$$
so
$$Acov\left(b_{LSDV}-b_{FGLS-RE},\,b_{FGLS-RE}\right)=Acov\left(b_{LSDV},b_{FGLS-RE}\right)-Avar\left(b_{FGLS-RE}\right)=0$$
and
$$Avar\left(b_{LSDV}-b_{FGLS-RE}\right)=Avar\left(b_{LSDV}\right)-Avar\left(b_{FGLS-RE}\right).$$
Hence,
$$H=\left(b_{LSDV}-b_{FGLS-RE}\right)'\left[\widehat{Avar}\left(b_{LSDV}\right)-\widehat{Avar}\left(b_{FGLS-RE}\right)\right]^{-1}\left(b_{LSDV}-b_{FGLS-RE}\right).$$

Wooldridge (2010) (pp. 328-334) evidences two difficulties with the Hausman test.

First, $Avar(b_{LSDV})-Avar(b_{FGLS-RE})$ is singular if $X$ includes aggregate variables, such as time dummies. Therefore, along with the coefficients on time-constant variables, also those on aggregate variables must be excluded from the Hausman statistic.

Second, and more importantly, if RE.3 fails then, on the one hand, the asymptotic distribution of $H$ is not standard even if RE.2 holds, so that $H$ would be of little guidance in detecting violations of RE.2, with an actual size that may be significantly different from the nominal size. On the other hand, $H$ is designed to detect violations of RE.2 and not of RE.3. In fact, if RE.2 holds both LSDV and FGLS-RE are consistent, regardless of RE.3, and $H$ converges in distribution rather than diverging, which means that the probability of rejecting RE.3 when it is false does not tend to unity as $N\to\infty$, making $H$ inconsistent as a test of RE.3. The solution is then to consider $H$ as a test of RE.2 only, but in a version that is robust to violations of RE.3. The approach I describe next is well suited to solving both difficulties at once.

8.5.2. The Mundlak test. Mundlak (1978) asks the following question. Is it possible to find an estimator that is more efficient than LSDV within a framework that allows correlation between the individual effects, taken as random variables, and $X$? To provide an answer, he takes the move from model (8.2.3) and supposes that the individual effects are linearly correlated with the regressors according to the following equation
$$\alpha=1_N\pi_0+\left(D'D\right)^{-1}D'X\pi+u,$$
with $E(\alpha|X)=E\left[\alpha\,\big|\,(D'D)^{-1}D'X\right]$, and so $E(u|X)=0$. Pre-multiplying both sides of the foregoing equation by $D$ and then replacing the right-hand side of the resulting equation into (8.2.3) yields
$$(8.5.1)\qquad y=1_{NT}\pi_0+X\beta+P_{[D]}X\pi+Du+\varepsilon,$$
which is evidently an RE model extended to the inclusion of the $P_{[D]}X$ regressors. Model (8.5.1) springs from a restriction in (8.2.3) and hence seems promising for more efficient estimates. But this is not the case. Mundlak proves, in fact, that FGLS-RE applied to equation (8.5.1) returns the LSDV estimator, $b_{LSDV}$, for the $\beta$ coefficients, $b_B-b_{LSDV}$ for the $\pi$ coefficients and $b_{0B}$ for the constant term $\pi_0$, where $b_{0B}$ and $b_B$ are the components of the Between estimator presented in Section 8.3.1.

To summarize Mundlak's results:
(1) The standard LSDV estimator for $\beta$ in the FE model (equation (8.2.3)) is the FGLS-RE estimator for $\beta$ in the general RE model (8.5.1).
(2) The standard FGLS-RE estimator in the RE model (equation (8.3.2)) can be equivalently obtained as a constrained FGLS estimator applied to the general RE model (8.5.1) with the constraints $\pi=0$.

Therefore, the validity of the RE model can be tested by applying a standard Wald test of joint significance for the null hypothesis that $\pi=0$ in the context of Mundlak's equation (8.5.1):
$$M=\left(b_{LSDV}-b_B\right)'\left[\widehat{Avar}\left(b_{LSDV}-b_B\right)\right]^{-1}\left(b_{LSDV}-b_B\right).$$
Under $H_0:\pi=0$, $M\xrightarrow{d}\chi^2(k)$.

Hausman and Taylor (1981) prove that the statistics $H$ and $M$ are numerically identical (for a simple proof see also Baltagi (2008)). Wooldridge (2010), p. 334, nonetheless, recommends using the regression-based version of the test because it can be made fully robust to violations of RE.3 (for example, heteroskedasticity and/or arbitrary within-group serial correlation) using the standard robustness options available for regression commands in most econometric packages. In addition, it is relatively easy to detect and solve singularity problems in the context of regression-based tests.

8.5.3. Stata implementation. The Stata implementation of most results in this section is demonstrated through a Stata do-file available on the course learning space.
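As a rough sketch of what such a do-file typically contains (y, x1, x2, id and time are placeholder names; all commands shown are standard Stata ones), the classical Hausman test and the robust, regression-based Mundlak version can be run as follows:

xtset id time

* classical Hausman test
xtreg y x1 x2, fe
estimates store fe
xtreg y x1 x2, re
estimates store re
hausman fe re

* regression-based Mundlak test, robust to violations of RE.3
foreach v of varlist x1 x2 {
    bysort id: egen double mean_`v' = mean(`v')
}
xtreg y x1 x2 mean_x1 mean_x2, re vce(cluster id)
test mean_x1 mean_x2

The joint Wald test on the group means plays the role of M above; the vce(cluster id) option makes it robust to heteroskedasticity and arbitrary within-group serial correlation, in the spirit of the recommendation by Wooldridge (2010).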

8.6. Large-sample results for the LSDV estimator

8.6.1. Introduction. This section proves consistency and asymptotic normality of the LSDV estimator, then describes the heteroskedasticity and within-group serial correlation consistent covariance estimator and finally provides a remark for practitioners.

Notation is standard. $X$ denotes the $(NT\times k)$ regressor matrix (of all time-varying regressors) and is partitioned by stacking individuals
$$(8.6.1)\qquad X=\begin{pmatrix}X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix},$$
with $X_i$ indicating the $(T\times k)$ block of observations peculiar to individual $i=1,...,N$. Similarly, observations in the $(NT\times 1)$ vectors $y$ and $\varepsilon$ are stacked by individuals.

The projection matrix $M_{[D]}$ projects onto the space orthogonal to that spanned by the columns of the individual-dummies matrix $D$, and any conformable vector that it pre-multiplies gets transformed into group-mean deviations. It is not hard to see that $M_{[D]}$ is a block-diagonal matrix, with blocks all equal to
$$M_{[1_T]}=I_T-\frac{1_T1_T'}{T}.$$
So,
$$(8.6.2)\qquad M_{[D]}=\begin{pmatrix}M_{[1_T]} & 0 & \cdots & 0\\ 0 & M_{[1_T]} & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & M_{[1_T]}\end{pmatrix}.$$
The LSDV estimator for the coefficients on the $X$ regressors, $\beta$, is given by
$$b_{LSDV}=\left(X'M_{[D]}X\right)^{-1}X'M_{[D]}y.$$
Strict exogeneity is maintained throughout:

SE: $E(\varepsilon|X)=0$.

The following random sampling assumption is invoked for the asymptotic normality of $b_{LSDV}$ and the consistency of the $b_{LSDV}$ asymptotic covariance estimator:

RS: There is a sample of size $n=NT$, such that the elements of the sequence $\{(y_i\ X_i),\ i=1,...,N\}$ are independent (NB not necessarily identically distributed) random matrices.

8.6.2. Large-sample properties of LSDV. Let $Var(\varepsilon|X)=\Omega$, where $\Omega$ is an arbitrary and unknown p.d. matrix.

8.6.3. Consistency. The following assumptions hold.

LSDV.1: $\operatorname{plim}_{N\to\infty}\frac{X'M_{[D]}\Omega M_{[D]}X}{N}=\lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Omega M_{[D]}X}{N}\right)\equiv Q_*$, a positive definite and finite matrix.

LSDV.2: $\operatorname{plim}_{N\to\infty}\frac{X'M_{[D]}X}{N}=\bar Q$, a positive definite and finite matrix.

Exercise 64. (This has been done in class) Prove that under LSDV.1 and LSDV.2, $\operatorname{plim}_{N\to\infty}b_{LSDV}=\beta$.
8.6.4. Asymptotic normality. Assumptions LSDV.1 and LSDV.2 hold along with RS and the following

LSDV.3: $Var(\varepsilon|X)=\Omega$, where
$$\Omega=\begin{pmatrix}\Omega_1 & 0 & \cdots & 0\\ 0 & \Omega_2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & \Omega_N\end{pmatrix}$$
is a block-diagonal $(NT\times NT)$ positive definite matrix. Notice that the blocks of $\Omega$ are arbitrary and heterogeneous, so that both arbitrary correlation across the time observations of the same individual (referred to as within-group serial correlation) and heteroskedasticity across individuals and over time are permitted. What is not permitted by the block-diagonal structure is correlation of the $\varepsilon$ realizations across different individuals.

Now focus on the generic individual $i=1,...,N$ and notice that, given the block-diagonal form of $M_{[D]}$ as in (8.6.2),
$$M_{[D]}X=\begin{pmatrix}M_{[1_T]} & 0 & \cdots & 0\\ 0 & M_{[1_T]} & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & 0 & M_{[1_T]}\end{pmatrix}\begin{pmatrix}X_1\\ \vdots\\ X_i\\ \vdots\\ X_N\end{pmatrix}=\begin{pmatrix}M_{[1_T]}X_1\\ \vdots\\ M_{[1_T]}X_i\\ \vdots\\ M_{[1_T]}X_N\end{pmatrix}.$$
The proof of asymptotic normality for $b_{LSDV}$ parallels that in 7.2.2, with the only difference that now the random objects whence we move are not $k\times 1$ vectors at the observation level but $k\times 1$ vectors at the individual level, $X_i'M_{[1_T]}\varepsilon_i$, $i=1,...,N$.

First, by strict exogeneity $E\left(X_i'M_{[1_T]}\varepsilon_i\right)=0$ and hence
$$Var\left(X_i'M_{[1_T]}\varepsilon_i|X\right)=E\left(X_i'M_{[1_T]}\varepsilon_i\varepsilon_i'M_{[1_T]}X_i|X\right)=X_i'M_{[1_T]}\Omega_iM_{[1_T]}X_i,$$

so that
$$Var\left(X_i'M_{[1_T]}\varepsilon_i\right)=E\left(X_i'M_{[1_T]}\Omega_iM_{[1_T]}X_i\right).$$
Then, averaging across individuals,
$$\frac{1}{N}\sum_{i=1}^{N}Var\left(X_i'M_{[1_T]}\varepsilon_i\right)=\frac{1}{N}\sum_{i=1}^{N}E\left(X_i'M_{[1_T]}\Omega_iM_{[1_T]}X_i\right)=E\left(\frac{X'M_{[D]}\Omega M_{[D]}X}{N}\right).$$
Therefore,
$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}Var\left(X_i'M_{[1_T]}\varepsilon_i\right)=\lim_{N\to\infty}E\left(\frac{X'M_{[D]}\Omega M_{[D]}X}{N}\right)\equiv Q_*,$$
which is a finite matrix by assumption LSDV.1, so that the Lindeberg-Feller theorem applies to yield
$$\frac{1}{\sqrt N}\sum_{i=1}^{N}X_i'M_{[1_T]}\varepsilon_i=\frac{X'M_{[D]}\varepsilon}{\sqrt N}\xrightarrow{d}N\left(0,Q_*\right).$$
Finally, since
$$\sqrt N\left(b_{LSDV}-\beta\right)=\left(\frac{X'M_{[D]}X}{N}\right)^{-1}\frac{X'M_{[D]}\varepsilon}{\sqrt N}\xrightarrow{d}\bar Q^{-1}N\left(0,Q_*\right),$$
then
$$\sqrt N\left(b_{LSDV}-\beta\right)\xrightarrow{d}N\left(0,\bar Q^{-1}Q_*\bar Q^{-1}\right)$$
and the asymptotic covariance matrix of $b_{LSDV}$ is given by
$$(8.6.3)\qquad Avar\left(b_{LSDV}\right)=\frac{1}{N}\bar Q^{-1}Q_*\bar Q^{-1}.$$

8.7. A robust covariance estimator

Arellano (1987) demonstrates that, given the $(T\times 1)$ LSDV residual vectors
$$e_{LSDV,i}=M_{[1_T]}y_i-M_{[1_T]}X_ib_{LSDV},\qquad i=1,...,N,$$
a consistent estimator for the asymptotic covariance matrix of $b_{LSDV}$ in equation (8.6.3) is given by the White estimator:
$$(8.7.1)\qquad \widehat{Avar}\left(b_{LSDV}\right)=\left(X'M_{[D]}X\right)^{-1}X'M_{[D]}\widehat{\Omega}M_{[D]}X\left(X'M_{[D]}X\right)^{-1},$$
where $\widehat{\Omega}$ is a block-diagonal matrix with generic block given by $e_{LSDV,i}e_{LSDV,i}'$. More formally,
$$\widehat{\Omega}=e_{LSDV}e_{LSDV}'\odot DD'.$$

Remark 65. The estimator in (8.7.1) is robust to arbitrary heteroskedasticity and within-group serial correlation. Stock and Watson (2008) prove that in the LSDV model the White estimator correcting for heteroskedasticity only, where $\widehat{\Omega}$ is a diagonal matrix with generic element $e^2_{LSDV,it}$ (see the first formula of section 9.6.1 in Greene (2008)), is inconsistent for $N\to\infty$. The crux of Stock and Watson's argument is essentially algebraic, in that demeaned residuals are correlated over time by construction and this correlation does not vanish for $N\to\infty$. The recommendation for practitioners is then to correct for both heteroskedasticity and within-group serial correlation using the estimator (8.7.1), which is not affected by the Stock and Watson critique.

Remark 66. In Stata the robust covariance matrix of LSDV is computed easily by using the xtreg command with the options fe and vce(cluster id), where id is the name of the individual categorical variable in your Stata data set.

A similar correction can be carried out for POLS and FGLS-RE. For POLS we have
$$\widehat{Avar}\left(b_{POLS}\right)=\left(\bar X'\bar X\right)^{-1}\bar X'\widehat{\Omega}\bar X\left(\bar X'\bar X\right)^{-1},$$
where
$$\widehat{\Omega}=e_{POLS}e_{POLS}'\odot DD'$$
and the POLS residual vector is defined as in equation (8.2.10), whereas for FGLS-RE we have

$$\widehat{Avar}\left(b_{FGLS-RE}\right)=\left(\bar X'\widehat{\Omega}^{-1}\bar X\right)^{-1}\bar X'\widehat{\Omega}^{-1/2}\widehat{\Lambda}\,\widehat{\Omega}^{-1/2}\bar X\left(\bar X'\widehat{\Omega}^{-1}\bar X\right)^{-1},$$
where
$$\widehat{\Lambda}=e_{FGLS-RE}\,e_{FGLS-RE}'\odot DD'$$
and the FGLS-RE residual vector is defined as
$$e_{FGLS-RE}=y-\bar Xb_{FGLS-RE}.$$

Remark 67. In Stata the robust asymptotic covariance matrices of POLS and FGLS-RE are estimated by using, respectively, the regress and the xtreg, re commands, both with the option vce(cluster id), as in the LSDV case.

8.8. Unbalanced panels

All of the methods so far have been described with a balanced panel data set in mind, but nothing prevents applying the same methods to unbalanced panels (different numbers of time observations across individuals).

Unbalanced panels only require a slight change in notation. As always we index individuals by $i=1,...,N$, but now the size of each individual cluster, or group, of observations varies across individuals and so the time index is $t=1,...,T_i$. This implies the following three facts.

(1) As in balanced panels, each observation in the data is uniquely identified by the two indexes: the pair $(i,t)$ identifies the $t$.th observation of the $i$.th individual.
(2) Differently from balanced panels, the group size, $T_i$, is no longer constant across clusters.
(3) Differently from balanced panels, the sample size is $n=\sum_{i=1}^{N}T_i$.

The LSDV estimator is implemented without any problem, either creating individual dummies or taking variables in group-mean deviations, where group means are at the individual level. The random effects estimator requires only some algebraic modifications in the formulas allowing for unbalancedness. The Arellano estimator also requires simple modifications in notation to accommodate unbalancedness: there is now a $(T_i\times 1)$ LSDV residual vector given by
$$e_{LSDV,i}=M_{[1_{T_i}]}y_i-M_{[1_{T_i}]}X_ib_{LSDV},\qquad i=1,...,N,$$
and so the matrix $\widehat{\Omega}$ in (8.7.1) is a block-diagonal matrix with blocks that are now of different size. The notation using the Hadamard product does not require adjustments for unbalancedness.

CHAPTER 9

Robust inference with cluster samplings

9.1. Introduction

The panel data sets considered in these notes, with a large individual dimension and a small time dimension, are an example of one-way clustering. If the data set is balanced, there are $n=NT$ observations clustered into $N$ individual groups, each comprising $T$ observations. If the data set is unbalanced, as is often the case with real-world panels, there are $n=\sum_{i=1}^{N}T_i$ observations clustered into $N$ individual groups, each comprising $T_i$ observations, $i=1,...,N$.

One-way clustering can be observed also in cross-section data. Think for example of a large sample of students clustered into many schools. The data structure parallels exactly that of an unbalanced panel: just index schools by $i=1,...,N$ and students within schools by $t=1,...,T_i$. So, any observation in the data is uniquely identified by the values of $i$ and $t$. In other words, observation $(i,t)$ refers to the $t$.th student in the $i$.th school. Therefore, under random sampling of schools and arbitrary sampling of students within schools, all of the statistical methods described in Chapter 8 can be conveniently used. This means that pooled OLS, fixed and random effects estimators can be applied to clustered cross-sections. The F-test on individual effects can be used to gauge the presence of latent heterogeneity. The robust Hausman test can be used to discriminate between fixed and random effects and, importantly, the White-Arellano estimator described in Section 8.7 can be used for computing robust standard errors. For more on the parallel between unbalanced panel data and one-way clustered cross-sections see Chapter 20 in Wooldridge (2010).

Dealing with one-way clustering is an important advance in econometrics. It is often the case, however, that real-world economic data have a multi-dimensional structure, so clustering can occur along more than one dimension. In a student survey, for example, there could be an additional level of clustering given by teachers, or classes, within schools. Similarly, patients can be clustered along the two dimensions, not necessarily nested, of doctors and hospitals. In a cross-sectional data set of bilateral trade flows, the cross-sectional units are the pairs of countries and these are naturally clustered along two dimensions: the first and the second country in the pair (Cameron et al., 2011). In matched employer-employee data there is the worker dimension, the firm dimension and the time dimension (Abowd et al., 1999).

Is it possible to do inference that is robust to multi-way clustering as we do inference that is robust to one-way clustering? A recent paper by Cameron et al. (2011) offers a computationally simple solution extending the White estimator to multi-way contexts. In essence, their method boils down to computing a number of one-way robust covariance estimators, which are then combined linearly to yield the multi-way robust covariance estimator. It is, therefore, crucial for the accuracy of the multi-way estimator that the one-way estimators be also accurate, and so that the data set have dimensions with a large number of clusters. Such an asymptotic requirement makes the analysis in Cameron et al. (2011) not well suited for dealing with both individual- and time-clustering in the typical micro-econometric panel data set, where $T$ is fixed. Indeed, their Monte Carlo experiments show that the robust covariance estimator has good finite-sample properties in data sets with dimensions of 100 clusters.

To illustrate the method I focus on two-way clustering, using a notation that is close to that in Cameron et al. (2011).

9.2. Two-way clustering

Notation is general enough to embrace cases in which cluster affiliations are not sufficient to uniquely identify an observation. There is a data set with $n$ observations indexed by $i\in\{1,...,n\}$. Observations are clustered along two dimensions, $g\in\{1,...,G\}$ and $h\in\{1,...,H\}$. Asymptotics is for both $G$ and $H\to\infty$. The data sets that I have in mind are, for example:

- surveys of students with, at least, moderately large numbers of teachers and schools;
- surveys of patients with, at least, moderately large numbers of doctors and hospitals;
- bilateral trade-flow data with, at least, a moderately large number of countries;
- matched employer-employee data with, at least, moderately large numbers of firms and workers.

For each dimension, it is known to which cluster a given observation $i=1,...,n$ belongs. This information is contained in the mappings $g:\{1,...,n\}\to\{1,...,G\}$,
$$g(i)=\left[g\in\{1,...,G\}:\text{observation } i \text{ belongs to cluster } g\right],\qquad i=1,...,n,$$
and $h:\{1,...,n\}\to\{1,...,H\}$,
$$h(i)=\left[h\in\{1,...,H\}:\text{observation } i \text{ belongs to cluster } h\right],\qquad i=1,...,n.$$
From the mappings $g$ and $h$ we can also construct the $n\times G$ dummy-variables matrix $D^G$ and the $n\times H$ dummy-variables matrix $D^H$, as the following definition indicates.

Definition 68. Let
$$d_{ig}=\begin{cases}1 & \text{if } g(i)=g\\ 0 & \text{else}\end{cases}\qquad i\in\{1,...,n\},\ g\in\{1,...,G\},$$
and
$$d_{ih}=\begin{cases}1 & \text{if } h(i)=h\\ 0 & \text{else}\end{cases}\qquad i\in\{1,...,n\},\ h\in\{1,...,H\}.$$
Then, $D^G$ and $D^H$ are the $n\times G$ and $n\times H$ matrices with $(i,g)$ element $d_{ig}$ and $(i,h)$ element $d_{ih}$, respectively.

Given g and h, we can define an intersection dimension, say G\H, such that each cluster in
G\H contains only observations that belong to one unique cluster in {1, ..., G} and one unique
cluster in {1, ..., H} . This yields the matrix of dummy variables DG\H . By construction, the

9.2. TWO-WAY CLUSTERING

139

number of clusters in the G \ H dimension is at most G H. For example if


        | 1 0 0 |         | 1 0 |
        | 1 0 0 |         | 1 0 |
  D^G = | 0 1 0 | , D^H = | 1 0 | ,
        | 0 1 0 |         | 0 1 |
        | 0 1 0 |         | 0 1 |
        | 0 0 1 |         | 1 0 |

then

            | 1 0 0 0 |
            | 1 0 0 0 |
  D^{G∩H} = | 0 1 0 0 | .
            | 0 0 1 0 |
            | 0 0 1 0 |
            | 0 0 0 1 |

This framework allows that in a survey of patients, for example, there could be more than
one patient admitted to the same hospital and under the assistance of the same doctor. Or,
similarly, that in a panel data set matching workers with firms the same worker can move across
firms over time or that, conversely, the same firm may employ different workers over time.
Then, define three n × n indicator matrices: S^G = D^G D^G', S^H = D^H D^H' and S^{G∩H} = D^{G∩H} D^{G∩H}'.
It is easy to verify that:
- S^G has ij-th entry equal to one if observations i and j share any cluster g in {1, ..., G}; zero otherwise.
- S^H has ij-th entry equal to one if observations i and j share any cluster h in {1, ..., H}; zero otherwise.
- S^{G∩H} has ij-th entry equal to one if observations i and j share any cluster g in {1, ..., G} and any cluster h in {1, ..., H}; zero otherwise.
Also, the ii-th entries in S^G, S^H and S^{G∩H} equal one for all i = 1, ..., n, so the three indicator
matrices have main diagonals with all unity elements.
Consider now a linear regression model allowing for two-way clustering,
y_i = x_i'β + ε_i,   i = 1, ..., n,
and let ε = (ε_1, ..., ε_i, ..., ε_n)' be the n × 1 error vector.
Assumptions LRM.1-LRM.3 hold. Assumption LRM.4 is here replaced with a more general
one permitting arbitrary heteroskedasticity and maintaining zero correlation only between
errors peculiar to observations that share no cluster in common. For example, the latent error
of patient i is not correlated with the latent error of patient j only if the two subjects are under
the assistance of different doctors, say g(i) ≠ g(j), and in different hospitals, h(i) ≠ h(j).
Formally,
LRM.4b: Var(ε|X) = E(εε'|X) with E(ε_i ε_j|X) = 0 unless g(i) = g(j) or h(i) = h(j), i, j = 1, ..., n.
Importantly, LRM.4b can equivalently be expressed as
(9.2.1)  Var(ε|X) = E(εε' ⊙ S^G | X) + E(εε' ⊙ S^H | X) - E(εε' ⊙ S^{G∩H} | X),
where the symbol ⊙ stands for the element-by-element matrix product (also known as the Hadamard
product) between matrices of equal dimension (verify the equivalence of LRM.4b and (9.2.1)).


As we know, OLS is, in this case, consistent and unbiased but not efficient. More importantly, the conventional OLS standard errors are biased, and so we need a two-way robust covariance estimator
for inference. The covariance estimator devised by Cameron et al. (2011) is a combination
of three one-way covariance estimators à la White. It is constructed along the following steps.
Carry out OLS and obtain the OLS residuals
e_{i,g(i),h(i)} = y_{i,g(i),h(i)} - x_{i,g(i),h(i)}' b,   i = 1, ..., n,
and stack them into the n × 1 column vector
e = ( e_{1,g(1),h(1)}, ..., e_{i,g(i),h(i)}, ..., e_{n,g(n),h(n)} )'.
The first one-way covariance estimator is
Avar^G(b) = (X'X)^{-1} X' [(ee') ⊙ S^G] X (X'X)^{-1},
which is a White estimator that is robust to clustering only along the G dimension.
The second one-way covariance estimator is
Avar^H(b) = (X'X)^{-1} X' [(ee') ⊙ S^H] X (X'X)^{-1},
which is a White estimator that is robust to clustering only along the H dimension.
The third one-way covariance estimator is
Avar^{G∩H}(b) = (X'X)^{-1} X' [(ee') ⊙ S^{G∩H}] X (X'X)^{-1},
which is a White estimator that is robust to clustering only along the G∩H dimension.


Finally, the two-way robust covariance estimator is
(9.2.2)  Avar(b) = Avar^G(b) + Avar^H(b) - Avar^{G∩H}(b).
Avar(b) is robust to clustering along both the G and H dimensions and is the estimator that
is used to construct our robust tests.
Remark 69. Writing Avar(b) as
Avar(b) = (X'X)^{-1} X' [ (ee') ⊙ S^G + (ee') ⊙ S^H - (ee') ⊙ S^{G∩H} ] X (X'X)^{-1}
and then considering equation (9.2.1) uncovers the analogy principle on which the two-way
robust covariance estimator rests.
Remark 70. Cameron et al. (2011) also present a general multi-way version of Avar(b),
which is derived from a simple extension of the foregoing analysis. The additional cost is only
in terms of a more cumbersome notation. For the formulas I refer you to that paper.
9.3. Stata implementation
While there is no official command for the two-way Avar(b) in Stata, it can be simply
implemented by means of three one-way OLS regressions. Suppose that in our data-set the
two categorical variables for dimensions G and H are called doctor and hospital. You can
assemble Avar(b) along the following steps.
(1) Create the categorical variable for the intersection dimension, G∩H, through the
following instruction
egen doc_hosp = group(doctor hospital)
where doc_hosp is a name of choice.


(2) Implement the first regress instruction with the option vce(cluster doctor) and
then save the covariance matrix estimate through the command: matrix V_d=e(V)
(V_d is a name of choice).
(3) Implement the second regress instruction with the option vce(cluster hospital)
and then save the covariance matrix estimate with: matrix V_h=e(V) (V_h is a
name of choice).
(4) Implement the last regress instruction with the option vce(cluster doc_hosp) and
then save the covariance matrix estimate with: matrix V_dh=e(V) (V_dh is a name
of choice).1
(5) Finally, work out the two-way robust covariance estimator by executing: matrix
V_robust = V_d + V_h - V_dh (V_robust is a name of choice). To see the content of
V_robust do: matrix list V_robust.
The robust standard errors are just the square roots of the main diagonal elements in V_robust.
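As a worked illustration, the following minimal do-file sketch strings the five steps together. The outcome y and the regressors x1 and x2 are hypothetical placeholder names; only doctor, hospital and doc_hosp are taken from the example above.

* assemble the two-way cluster-robust covariance matrix of Cameron et al. (2011)
egen doc_hosp = group(doctor hospital)           // intersection dimension G∩H
quietly regress y x1 x2, vce(cluster doctor)     // one-way: G dimension
matrix V_d = e(V)
quietly regress y x1 x2, vce(cluster hospital)   // one-way: H dimension
matrix V_h = e(V)
quietly regress y x1 x2, vce(cluster doc_hosp)   // one-way: intersection dimension
matrix V_dh = e(V)
matrix V_robust = V_d + V_h - V_dh               // two-way robust covariance estimator
matrix list V_robust
display sqrt(V_robust[1,1])                      // robust standard error of the first coefficient

Note that the coefficient estimates are the same OLS estimates in all three regressions; only the covariance matrix changes.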

1 It may happen that clusters in the intersection dimension are all singletons (i.e. each cluster has only one observation). In this case Stata will refuse to work with the option vce(cluster doc_hosp). This is no
problem, though, since correcting standard errors when clusters are singletons is clearly equivalent to correcting
for heteroskedasticity. Therefore, instead of vce(cluster doc_hosp), simply write vce(robust).

CHAPTER 10

Issues in linear IV and GMM estimation

10.1. Introduction
The conditional-mean independence of ε and x maintained by P.2 (Section 2.1) often fails
in economic structures, where some of the x variables are chosen by the economic subjects
and as such may depend on the latent factors at the equilibrium. These x variables are said
to be endogenous.
In economics, think of a production function, where (some of) the observable input quantities are under the firm's control. The same consideration holds for the education variable
in a wage equation. These are all cases of omitted variable bias (Section 4.7), which makes
standard estimation techniques unusable.
As we have seen in Section 4.7.1, the proxy-variable solution maintains that there is information external to the model that is able to fully explain the correlation between observed and
unobserved variables. For example, observed IQ scores, clearly redundant in a wage equation
with latent ability, are an imperfect measure of latent ability, but the discrepancy between the


two variables is likely to be unrelated to individual education levels. Such information,
so close to the latent variable, is often unavailable, though.
If the latent variables are invariant across individuals and/or over time and there is a panel-data set, the endogeneity problem is solved by applying the panel-data methods introduced
in Chapter 8. But panel data are not always available and, even when they are, the disturbing
omitted factors may not meet the time-constancy requirement. For example, idiosyncratic
productivity shocks may well be related to input factors in the estimation of a production
function.
Neither proxy-variable nor panel-data methods are generally usable when endogeneity
springs from reverse causality. In the strip, Wally questions the exogeneity of the exercise
variable as a determinant of individual health, hinting at an endogeneity bias due to reverse
causality. If the exercise activity is indeed affected by health status, exercise would depend
on the observable and unobservable determinants of health, and so cannot be exogenous.
Instrumental variables (IV) and Generalized Method of Moments (GMM) estimators offer
a general solution to the endogeneity problem. Roughly speaking, they solve the endogeneity
problem in two stages. The first stage attempts to identify the exogenous-variation components of x through a set of exogenous variables, some of which are external to the model,
called instrumental variables. The second stage applies regression analysis using only the first-stage exogenous components as explanatory variables. IV and GMM methods are preferred
tools of econometric analysis, compared to alternative techniques, since often the first stage
can be justified on the grounds of economic theory.
There are various IV-GMM applications demonstrating the methods of this chapter: iv.do using
mus06data.dta, IV_GMM_panel.do using costfn.dta and IV_GMM_DPD.do and abest.do both
using abdata.dta. There is also a Monte Carlo application implemented by bias_in_AR1_LSDV.do.


10.2. The method of moments


The method of moments estimates the parameters of interest by replacing population
moment conditions with their sample analogs. Almost all popular estimators can be thought
of as method-of-moments estimators. Two examples follow.

10.2.1. The linear regression model. Consider the linear model of Chapter 1 and the
system of moment conditions (1.2.3)
E(xy) = E(xx')β.
So, the true coefficient vector β solves the population moment conditions and is equal to
[E(xx')]^{-1} E(xy). By the analogy principle a consistent estimator for β, say b, will satisfy
the system of k analog sample moment conditions:
(1/n) Σ_{i=1}^n x_i (y_i - x_i'b) = 0.
Hence,
b = ( Σ_{i=1}^n x_i x_i' )^{-1} Σ_{i=1}^n x_i y_i = (X'X)^{-1} X'y,
which is exactly the OLS estimator.

10.2.2. The Instrumental Variable (IV) regression model in the just-identified
case. Consider the linear model of Chapter 1 but without assumption P.3, E(ε|x) = 0, or
even the weaker P.3b, E(xε) = 0. This means that some of the variables in x are potentially
endogenous, that is, related in some way to ε. Assume, instead, conditional mean independence
for an L × 1 vector of variables z, that is E(ε|z) = 0, with L = k. The vector z is generally
different from x; if it is not, then we are back to the classical regression model and there is no
endogeneity problem. Then, as before, using the law of iterated expectations we have
E(zε) = E_z[E(zε|z)] = E_z[z E(ε|z)] = 0.
So, there are k moment conditions in the population,
E[z(y - x'β)] = 0,
or equivalently
E(zy) = E(zx')β.
So, the true coefficient vector β solves the population moment conditions and is equal to
[E(zx')]^{-1} E(zy). By the analogy principle a consistent estimator for β, say b, will satisfy
the system of k analog sample moment conditions:
(1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = 0.
Hence,
b = ( Σ_{i=1}^n z_i x_i' )^{-1} Σ_{i=1}^n z_i y_i = (Z'X)^{-1} Z'y,
which is the classical IV estimator.
The intuition is straightforward: since the true coefficients solve the population moment
conditions, if the sample moments provide good estimates of the population moments, then
one might expect that the estimator solving the sample moment conditions will provide good
estimates of the true coefficients.
What if there are more moment conditions than unknown parameters, that is if L > k?
Then we turn to GMM estimation.


10.2.3. The Generalized Method of Moments. GMM estimation is general: it can
be applied to both linear and non-linear models and in the over-identified case L > k. To see
this, define the column vector of observables in the population w ≡ (y x' z')'. There are L > k
population moments collected into the (L × 1) column vector m(β):
m(β) ≡ E[f(w, β)],
and suppose that the following population moment conditions hold:
m(β) = 0.
Now consider the L sample moments
m(b) ≡ (1/n) Σ_{i=1}^n f(w_i, b)
and the L sample moment conditions
m(b) = 0.
Hence there are L equations and k unknowns, so that in general no estimator b can solve the system of
sample moment conditions exactly. Instead, there exists a b that can make m(b) as close to zero as
possible:
b_GMM = arg min_b Q(b),
where Q(b) ≡ m(b)' A m(b) is a quadratic criterion function of the sample moments and
A is a positive definite matrix weighting the squares and the cross-products of the sample
moments in Q(b).
Note that Q(b) ≥ 0 and, since A is positive definite, Q(b) = 0 only if m(b) = 0.
Thus, Q(b) can be made exactly zero in the just-identified case and is strictly greater than
zero in the over-identified case.


10.2.4. The TSLS estimator. It is not hard to prove that the well-known Two-Stage
Least Squares (TSLS) estimator in the over-identified linear model belongs to the class of GMM
estimators. Consider the linear regression model of Section 10.2.2 with L > k instruments.
Then, there are the following population moments,
m(β) ≡ E[z(y - x'β)],
and population moment conditions,
E[z(y - x'β)] = 0.
The L sample moments are collected into the (L × 1) vector m(b):
m(b) ≡ (1/n) Σ_{i=1}^n z_i (y_i - x_i'b) = (1/n) Z'(y - Xb).
Suppose we choose a quadratic criterion function with the following weighting matrix:
A ≡ [ (1/n) Σ_{i=1}^n z_i z_i' ]^{-1} = n (Z'Z)^{-1}.
Then
Q(b) = (1/n) (y - Xb)' Z (Z'Z)^{-1} Z' (y - Xb),
with the following normal equations for the minimization problem:
∂Q(b)/∂b = -(2/n) X'Z (Z'Z)^{-1} Z'(y - Xb) = 0,
which solved for b yield the TSLS estimator
b_TSLS = [ X'Z (Z'Z)^{-1} Z'X ]^{-1} X'Z (Z'Z)^{-1} Z'y,
or more compactly
b_TSLS = ( X' P_[Z] X )^{-1} X' P_[Z] y.
The estimator's name derives from the fact that it can be computed in two stages:
(1) Regress each column of X on Z using OLS to obtain the OLS fitted values of X:
Z(Z'Z)^{-1}Z'X = P_[Z]X. Thus, X = P_[Z]X + M_[Z]X, where P_[Z]X is an approximately exogenous component, whose covariance with ε goes to zero as n → ∞, and
M_[Z]X is a residual, potentially endogenous, component. Only P_[Z]X is used in the
second stage.
(2) Regress y on the fitted values, P_[Z]X, to obtain b_TSLS.
If the population moment conditions are true, then Q(b_TSLS) should not be significantly
different from zero. This provides a test for the validity of the L - k over-identifying moment
conditions based on the following statistic (Hansen-Sargan test):
S ≡ n Q(b_TSLS), asymptotically distributed as χ²(L - k) under the null.

Exercise 71. Prove that if L = k, TSLS collapses to IV.
Solution: Z'X is square and invertible, so
b_TSLS = [ X'Z (Z'Z)^{-1} Z'X ]^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} (Z'Z) (X'Z)^{-1} X'Z (Z'Z)^{-1} Z'y
       = (Z'X)^{-1} Z'y.

10.3. Stata implementation of the TSLS estimator
It is implemented by the command ivregress 2sls followed by the name of the dependent
variable, the included exogenous variables and, within parentheses, all the right-hand-side endogenous
variables and the excluded exogenous variables (the instruments), as follows:
ivregress 2sls depvar indepvars (endog_vars = instruments), options
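For concreteness, a minimal sketch follows. The variable names (log_wage, exper, educ, mother_educ, father_educ) are hypothetical placeholders, not taken from the course data sets.

* TSLS: educ is endogenous, parents' education provides the excluded instruments
ivregress 2sls log_wage exper (educ = mother_educ father_educ), vce(robust)
estat firststage        // weak-instrument diagnostics (see Section 10.8)
estat endogenous        // Durbin-Wu-Hausman-type exogeneity test (see Section 10.6)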


10.4. Stata implementation of the (linear) GMM estimator
It is implemented by ivregress gmm followed by the name of the dependent variable,
the included exogenous variables and, within parentheses, all the right-hand-side endogenous variables and the
excluded exogenous variables (the instruments), as follows:
ivregress gmm depvar indepvars (endog_vars = instruments), options

10.4.1. Choosing the weighting matrix. The weighting matrix in the optimal two-step GMM estimator is
(10.4.1)  A = [ Z' Λ Z / n ]^{-1}
(see Hansen 1982), where Λ is an estimate of the error covariance pattern, so that A is a consistent estimate of the inverse of Var[z_i(y_i - x_i'β)]. The choice of Λ depends on the assumptions about the errors:
- If ε is homoskedastic and independent, then Λ = I (the resulting GMM estimator collapses to TSLS). It is implemented through the ivregress gmm option wmatrix(unadjusted).
- If ε is heteroskedastic and independent, then Λ is a diagonal matrix with generic diagonal element equal to the squared residual from some one-step consistent estimator, TSLS for example:
Λ = diag( e_1², e_2², ..., e_n² ),
with e_i = y_i - x_i' b_TSLS, i = 1, ..., n. It is implemented through the ivregress gmm option wmatrix(robust). It is the default.
- If errors are clustered, with N clusters, then Λ is a block-diagonal matrix with generic block equal to the outer product of the residuals peculiar to the corresponding cluster. Again, residuals are taken from a one-step consistent regression (TSLS):
Λ = diag( Λ_1, Λ_2, ..., Λ_N ),
with Λ_i = e_i e_i', i = 1, ..., N. Notice that in this case e_i = y_i - x_i' b_TSLS is a vector and
not a scalar: it is the vector of residual observations peculiar to cluster i = 1, ..., N. It is
implemented through the ivregress gmm option wmatrix(cluster cluster_var).
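A hedged illustration of these options, reusing the hypothetical variable names of Section 10.3 and a hypothetical cluster identifier psu:

* two-step GMM with heteroskedasticity-robust weighting (the default)
ivregress gmm log_wage exper (educ = mother_educ father_educ), wmatrix(robust)
estat overid            // Hansen's J test of the overidentifying restrictions

* two-step GMM with cluster-robust weighting
ivregress gmm log_wage exper (educ = mother_educ father_educ), wmatrix(cluster psu)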
10.4.2. Iterative GMM. The GMM procedure can be iterated by adding the option
igmm. The resulting estimator is asymptotically equivalent to the two-step estimator. However,
Hall (2005) suggests that it may have a better finite-sample performance.

10.5. Robust Variance Estimators
The less efficient, but computationally simpler and still consistent, TSLS estimator is
often used in estimation. Its variance-covariance matrix Var(b_TSLS) is consistently estimated by the robust estimator
( X' P_[Z] X )^{-1} X' P_[Z] Λ P_[Z] X ( X' P_[Z] X )^{-1},
where Λ is chosen according to the various departures from homoskedasticity and independence
spelled out above. The Stata implementation of the three variance-covariance estimators is
through the ivregress options vce(unadjusted), vce(robust) and vce(cluster cluster_var).

10.6. Durbin-Wu-Hausman Exogeneity test
A conventional Hausman test can always be implemented, based on the Hausman statistic measuring the statistical difference between the IV and OLS estimates. It is not robust to
heteroskedastic and clustered errors, though. Wu suggests an alternative. But before presenting it, do the following
exercise, which will prove useful in the derivations below.

Exercise 72. Prove that the TSLS estimator for β_2 is
b_{2,TSLS} = ( X_2' P_[M_[X_1] Z_1] X_2 )^{-1} X_2' P_[M_[X_1] Z_1] y.
Solution. Applying the FWL Theorem to the second-stage regression,
b_{2,TSLS} = ( X_2' P_[Z] M_[P_[Z] X_1] P_[Z] X_2 )^{-1} X_2' P_[Z] M_[P_[Z] X_1] P_[Z] y.
By Lemma 12, P_[Z] = P_[X_1] + P_[M_[X_1] Z_1], so that P_[Z] X_1 = X_1 and
b_{2,TSLS} = ( X_2' P_[Z] M_[X_1] P_[Z] X_2 )^{-1} X_2' P_[Z] M_[X_1] P_[Z] y.
But then P_[Z] = P_[X_1] + P_[M_[X_1] Z_1] also ensures that P_[Z] M_[X_1] = P_[M_[X_1] Z_1], proving the result.

The DWH test provides a robust version of the Hausman test. It maintains instrument validity, E(ε|Z) = 0, and is based on the so-called control-function approach, which recasts the
endogeneity problem as a misspecification problem affecting the structural equation
(10.6.1)  y = Xβ + ε,
where X = (X_1 X_2), β = (β_1' β_2')', Z = (X_1 Z_1) and ε = u + Vρ. The component u is such that
E(u|X) = 0, and V is the n × k_2 matrix of the errors in the first-stage equations of the variables
X_2. As such, V is responsible for the endogeneity of X_2.
Replacing V in (10.6.1) with the residuals from the first-stage regressions, M_[Z] X_2,
makes the DWH test operational as a simple test of joint significance for ρ in the auxiliary
OLS regression
(10.6.2)  y = Xβ + M_[Z] X_2 ρ + u*.


The test works well since, under the alternative ρ ≠ 0, OLS estimation of the auxiliary
regression yields the TSLS estimators. This is proved as follows. Write
y = P_[Z] Xβ + M_[Z] Xβ + M_[Z] X_2 ρ + u*,
but M_[Z] Xβ = M_[Z] X_2 β_2, since by Lemma 12 M_[Z] = M_[X_1] - P_[M_[X_1] Z_1] and so M_[Z] X_1 = 0. Therefore,
y = P_[Z] Xβ + M_[Z] X_2 (β_2 + ρ) + u*,
and since P_[Z] X and M_[Z] X_2 are orthogonal, the FWL Theorem ensures that the OLS estimator
for β is
b_TSLS = ( X' P_[Z] X )^{-1} X' P_[Z] y,
and also that the OLS estimator of (β_2 + ρ) is
( X_2' M_[Z] X_2 )^{-1} X_2' M_[Z] y = ( X_2' M_[Z] X_2 )^{-1} ( X_2' M_[X_1] y - X_2' P_[M_[X_1] Z_1] y )
                                  = K b_{2,OLS} + (I - K) b_{2,TSLS},
with K ≡ ( X_2' M_[Z] X_2 )^{-1} X_2' M_[X_1] X_2, where the last equality follows from Exercise 72.
Therefore the OLS estimator of ρ is
K b_{2,OLS} + (I - K) b_{2,TSLS} - b_{2,TSLS} = K ( b_{2,OLS} - b_{2,TSLS} ),
proving that the test indeed follows the Hausman-test general principle of assessing the distance between an efficient estimator and a consistent but inefficient estimator under the null
hypothesis.

One great advantage of the DWH test over a conventional Hausman test is that it can
easily be robustified against heteroskedasticity and/or clustered errors by estimating (10.6.2) with
regress and a suitable robust option, vce(cluster clustervar) for example.
The DWH test can be immediately implemented in Stata through the ivregress postestimation
command estat endogenous.
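A minimal sketch of the manual control-function version of (10.6.2), with the same hypothetical variable names as above (educ endogenous, mother_educ and father_educ excluded instruments, psu a hypothetical cluster identifier):

* first stage: regress the endogenous variable on all exogenous variables and instruments
quietly regress educ exper mother_educ father_educ
predict vhat, residuals                  // first-stage residuals, the sample analogue of M_[Z]X_2
* auxiliary regression (10.6.2) with cluster-robust standard errors
regress log_wage exper educ vhat, vce(cluster psu)
test vhat                                // significance of the control function = DWH test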

10.7. Endogenous binary variables
The linear IV-GMM approach outlined so far fits the case of binary endogenous variables,
producing consistent estimates. However, a first-stage specification fully accounting for the
binary structure of the endogenous variable may provide considerable efficiency gains. The
implied (non-linear) model is as follows:
y_i = x_{1i}'β_1 + x_{2i} β_2 + ε_i,
x*_{2i} = x_{1i}'π_1 + z_i'π_2 + υ_i,
x_{2i} = 1 if x*_{2i} > 0 and x_{2i} = 0 otherwise,
where (ε_i, υ_i) is bivariate normal with zero means, Var(ε_i) = σ², Var(υ_i) normalized to one and correlation ρ.
It is estimated by the Stata procedure
treatreg depvar indepvars, treat(endog_var = instruments) other_options
through either ML (the default) or a consistent two-step procedure (the twostep option).


10.8. Testing for weak instruments
- Staiger and Stock's rule of thumb: the partial F statistic in the first-stage regression should exceed
10. It is not rigorous, tends to reject weak instruments too often and has no obvious
implementation when there is more than one endogenous variable.
- Stock and Yogo's (2005) two tests overcome all of the above difficulties. They are
both based on the minimum eigenvalue of the matrix analog of the partial
F test, a statistic introduced by Cragg and Donald (1993) to test non-identification.
Importantly, the large-sample properties of both tests have been derived under the
assumption of homoskedastic and independent errors. Caution must be taken, then,
when drawing conclusions from the tests if the errors are thought to depart from
those hypotheses.
Both procedures are implemented by the ivregress postestimation command estat firststage.

10.9. Inference with weak instruments
Conditional inference on the coefficients of the endogenous variables in the presence of weak instruments is implemented through the command condivreg by Mikusheva and Poi (2006). The theory
is reviewed and expanded in Andrews et al. (2007). The command produces three alternative
confidence sets for the coefficient of the endogenous regressor, obtained from the conditional
LR, Anderson-Rubin (option ar) and LM statistics (option lm). The syntax of condivreg is
similar to that of ivregress.
10.9.1. Three-Stage Least Squares. It is a system estimator including structural equations for all endogenous variables. Identification is ensured by standard (sufficient) rank and
(necessary) order conditions. It is seldom used, as it is inconsistent in the presence of heteroskedastic errors, which are the norm in most micro applications. The Stata command is
reg3.


10.10. Dynamic panel data
Situations in which past decisions have an impact on current behaviour are ubiquitous in
economics. For example, in the presence of input adjustment costs, short-run input demands
depend also on past input levels. In such cases fitting a static model to the data will lead to what
is referred to as dynamic underspecification. With a panel data set, however, it is possible to
implement a dynamic model from the outset in order to describe the phenomena of interest.
To make things simple let us get started with the simple autoregressive process
(10.10.1)  y_it = α + γ y_{i,t-1} + ε_it,
t = 1, ..., T, i = 1, ..., N.
Model (10.10.1) can be easily extended to allow for time-invariant individual terms:
(10.10.2)  y_it = γ y_{i,t-1} + η_i + ε_it,
t = 1, ..., T, i = 1, ..., N. In vector notation, stacking time observations for each individual,
y_i = γ y_{-1i} + η_i 1_T + ε_i,
i = 1, ..., N, where
y_i = (y_i1, ..., y_it, ..., y_iT)',  y_{-1i} = (y_i0, ..., y_{i,t-1}, ..., y_{i,T-1})'  and  ε_i = (ε_i1, ..., ε_it, ..., ε_iT)'
are (T × 1) vectors.

Notice that for each individual there are T + 1 observations available in the data set,
from y_i0 to y_iT, but only T are usable since one is lost to taking lags. The problem here is
that y_{-1i} is not strictly exogenous. Given (10.10.2), the t-th element of y_{-1i} is y_{i,t-1} = f(y_i0, ε_i1, ε_i2, ..., ε_{i,t-1}),
and so all future elements of y_{-1i}, from y_{i,t} = f(y_i0, ε_i1, ..., ε_it) to
y_{i,T-1} = f(y_i0, ε_i1, ..., ε_it, ..., ε_{i,T-1}), depend on ε_{i,t}, which makes E(ε_{i,t} | y_{i,0}, ..., y_{i,T-1}) = 0 fail
(exercise: can you work out the exact expression of the right-hand side of y_{i,t-1} = f(y_i0, ε_i1, ε_i2, ..., ε_{i,t-1})?).
Nonetheless, we may maintain conditional-mean independence between ε_{i,t} and the t-th and
more remote elements of y_{-1i}, say y_i^{t-1} = (y_{i,t-1}, y_{i,t-2}, ..., y_{i,0}), using the notation in Arellano
(2003). More formally, we maintain throughout (see Arellano (2003) for a discussion)
A.1: E(ε_it | y_i^{t-1}, η_i) = 0 for all t = 1, ..., T.
Assumption A.1 is also considered in Wooldridge (chapter 11, 2010), where it is referred to as
sequential exogeneity conditional on the unobserved effect. It may be convenient sometimes to
maintain also the following (sequential) conditional homoskedasticity assumption:
A.2: E(ε²_it | y_i^{t-1}, η_i) = σ² for all t = 1, ..., T.
It is not hard to prove that Equation (10.10.2) and Assumption A.1 imply the following
(prove it using the LIE and ε_{i,t-j} = y_{i,t-j} - γ y_{i,t-j-1} - η_i):
A.3: E(ε_it ε_{i,t-j} | y_i^{t-1}, η_i) = 0, for all j = 1, ..., t - 1.

10.10.1. Inconsistency of the LSDV Estimator. Since y_{-1i} is not strictly exogenous,
the LSDV estimator, γ_LSDV, is inconsistent for N → ∞. Nickell (1981) was the first to derive
the inconsistency. Given
γ_LSDV - γ = [ (1/(NT)) Σ_i Σ_t (y_{i,t-1} - ȳ_{i,-1})² ]^{-1} (1/(NT)) Σ_i Σ_t (y_{i,t-1} - ȳ_{i,-1})(ε_it - ε̄_i),
he showed that
plim (1/(NT)) Σ_i Σ_t (y_{i,t-1} - ȳ_{i,-1})(ε_it - ε̄_i) = E[ (1/T) Σ_{t=1}^T (y_{i,t-1} - ȳ_{i,-1})(ε_it - ε̄_i) ]
  = - (σ²/T²) (T - 1 - Tγ + γ^T) / (1 - γ)²  ≠ 0,
where ȳ_{i,-1} and ε̄_i denote the individual time means of y_{i,t-1} and ε_it.


Hence, the bias vanishes for T → ∞, but it does not for N → ∞ with T fixed. For this reason,
the LSDV estimator is inaccurate in panel data sets with large N and small T and is said to
be semi-inconsistent (see also Sevestre and Trognon, 1996).
Since Nickell (1981) a number of consistent IV and GMM estimators have been proposed
in the econometric literature as alternatives to LSDV. Anderson and Hsiao (1981) suggest
two simple IV estimators that, upon transforming the model in first differences to eliminate
the unobserved individual heterogeneity, use the second lags of the dependent variable, either
differenced or in levels, as an instrument for the differenced one-period-lagged dependent variable.
Arellano and Bond (1991) propose a GMM estimator for the first-differenced model which,
relying on all available lags of y_{-1i} as instruments, is more efficient than Anderson and Hsiao's.
Ahn and Schmidt (1995), upon noticing that the Arellano and Bond estimator uses only linear moment
restrictions, suggest a set of non-linear restrictions that may be used in addition to the linear
ones to obtain more efficient estimates. Blundell and Bond (1998) observe that with highly
persistent data first-differenced IV or GMM estimators may suffer from a severe small-sample
bias due to weak instruments. As a solution, they suggest a system GMM estimator with
first-differenced instruments for the equation in levels and instruments in levels for the first-differenced equation. Some of the foregoing methods are nowadays very popular and are
surveyed below.

10.10.2. The Anderson and Hsiao IV Estimator. One typical solution is to take
model (10.10.2) in first differences to eliminate the individual effects:
(10.10.3)  y_it - y_{i,t-1} = γ (y_{i,t-1} - y_{i,t-2}) + ε_it - ε_{i,t-1}.

This makes the disturbances MA(1) with a unit root, and so induces correlation between the
lagged differenced dependent variable and the disturbances. This problem can be solved by using
instruments for Δy_{i,t-1}. Anderson and Hsiao (1982) suggest using either y_{i,t-2} or Δy_{i,t-2},
since these are correlated with Δy_{i,t-1} but are uncorrelated with ε_it - ε_{i,t-1}.
This gives an exactly identified IV estimator, consistent under Assumption A.1, but
generally not optimal and with a high root mean squared error in applications.

10.10.3. The Arellano and Bond GMM estimator. Arellano and Bond (1991) propose a more efficient estimator, using a larger set of instruments. Notice that, given
(10.10.2), y_i1 = f(ε_i1), y_i2 = f(ε_i1, ε_i2), ..., y_it = f(ε_i1, ε_i2, ..., ε_it), and that after differencing
the first usable period in the sample is t = 2:
y_i2 - y_i1 = γ (y_i1 - y_i0) + ε_i2 - ε_i1.
One finds that the value y_i0 is a valid instrument for (y_i1 - y_i0): in fact, y_i0 is correlated with
(y_i1 - y_i0) and, given Assumption A.1, E[y_i0 (ε_i2 - ε_i1)] = 0.
In the next period, t = 3, we have
y_i3 - y_i2 = γ (y_i2 - y_i1) + ε_i3 - ε_i2,
and both y_i0 and y_i1 are valid instruments for (y_i2 - y_i1), and so on.

This approach adds an extra valid instrument with each forward period, so that in the
last period T we have (y_i0, y_i1, y_i2, ..., y_{i,T-2}). For each individual i, the matrix of instruments
is therefore

        | y_i0   0          0               . . .   0                       |
        | 0      y_i0 y_i1  0               . . .   0                       |
  Z_i = | 0      0          y_i0 y_i1 y_i2  . . .   0                       |
        | .      .          .                       .                       |
        | 0      0          0               . . .   y_i0 y_i1 ... y_{i,T-2} |

with one row per differenced equation, t = 2, ..., T. Stacking individuals, the overall matrix of instrumental variables is
Z = ( Z_1', Z_2', ..., Z_N' )'.


and model (10.10.3) can be reformulated more compactly as
Δy = γ Δy_{-1} + Δε,
where Δy, Δy_{-1} and Δε are (N(T-1) × 1) vectors. The number of instruments is L = T(T-1)/2,
so Z is an (N(T-1) × L) matrix. The instrumental variables satisfy, for each individual i, the (L × 1)
vector of population moment conditions
m(γ) ≡ E[ Z_i' Δε_i ] = 0,
where Δε_i stands for the i-th block of Δε. We can define the (L × 1) vector of sample moment conditions
m(γ) = (1/N) Z' Δε = (1/N) Z' (Δy - γ Δy_{-1}).
Since L > 1, this is an over-identified case and GMM estimation is needed.
If we also consider Assumption A.2, that is, ε, beyond being not serially correlated, is also
homoskedastic, the optimal GMM estimator can be obtained in one step. It minimizes the
following criterion function
(10.10.4)  Q(b) = m(b)' A m(b),
where, according to what we have seen in Subsection 10.4.1, A is a consistent estimator, up to an irrelevant positive scalar, of the inverse of
(10.10.5)  Var( Z_i' Δε_i ) = σ² E( Z_i' G Z_i ),
where Δε_i = Δy_i - γ Δy_{-1i}.
Here G is the (T-1) × (T-1) matrix

      |  2  -1   0  . . .   0   0 |
      | -1   2  -1  . . .   0   0 |
  G = |  0  -1   2  . . .   0   0 |
      |  .   .   .          .   . |
      |  0   0   0  . . .   2  -1 |
      |  0   0   0  . . .  -1   2 |

The Arellano-Bond one-step estimator b1 uses
A = [ (1/N) Σ_{i=1}^N Z_i' G Z_i ]^{-1}
in (10.10.4), and so
b1 = [ Δy_{-1}' Z ( Σ_i Z_i' G Z_i )^{-1} Z' Δy_{-1} ]^{-1} Δy_{-1}' Z ( Σ_i Z_i' G Z_i )^{-1} Z' Δy.

Without homoskedasticity (that is, without Assumption A.2), b1 is no longer optimal, but it
remains consistent and as such can be used to construct the optimal two-step estimator b2
along the lines described in Subsection 10.4.1. Therefore, b2 minimizes (10.10.4) with
A = [ (1/N) Σ_{i=1}^N Z_i' Δe_{1i} Δe_{1i}' Z_i ]^{-1},
where Δe_{1i} = Δy_i - b1 Δy_{-1i} is the individual-level residual vector from the one-step estimator:
b2 = [ Δy_{-1}' Z ( Σ_i Z_i' Δe_{1i} Δe_{1i}' Z_i )^{-1} Z' Δy_{-1} ]^{-1} Δy_{-1}' Z ( Σ_i Z_i' Δe_{1i} Δe_{1i}' Z_i )^{-1} Z' Δy.
If the ε_it are iid(0, σ²), b1 and b2 are asymptotically equivalent.

To test instrument validity one can apply the Hansen-Sargan test of overidentifying restrictions:
S = ( Σ_i Z_i' Δe_{2i} )' ( Σ_i Z_i' Δe_{2i} Δe_{2i}' Z_i )^{-1} ( Σ_i Z_i' Δe_{2i} ),
where Δe_{2i} = Δy_i - b2 Δy_{-1i} are the individual-level residuals from the two-step estimator.
Under H0: E[ Z_i' Δε_i ] = 0 for all i = 1, ..., N, S is asymptotically distributed as χ²(L - 1).
A second specification test suggested by Arellano and Bond (1991) tests for lack of
AR(2) correlation in Δe_1 or Δe_2, which must hold under Assumption A.1. The AR(2) test statistic
has a limiting standard normal distribution under the null.


10.10.3.1. Inference issues. Monte Carlo studies tend to show that estimated standard
errors from two-step GMM estimators are severely downward biased in finite samples (Arellano
and Bond 1991). This is not the case for one-step GMM standard errors, which instead are
virtually unbiased. A possible explanation for this finding is that the weighting matrix in two-step GMM estimators depends on estimated parameters, whereas that in one-step GMM estimators
does not. Windmeijer (2005) proves that in fact a large portion of the finite-sample bias of
two-step GMM standard errors is due to the variation of estimated parameters in the weighting
matrix. He derives both a general bias correction and a specific one for panel data models
with predetermined regressors, as in the Arellano and Bond model.


Monte Carlo experiments in Bowsher (2002) show that the Sargan test based on the full
instrument set has zero power when T, and consequently the number of moment conditions, becomes too
large for given N.
10.10.4. Blundell and Bond (1998) System estimator. Blundell and Bond (1998)
demonstrate that in the presence of γ close to unity the instruments in levels are weakly correlated
with the first differences, leading to what is known in the econometric literature as a weak-instrument bias.
This is easily seen by considering the following example taken from Blundell and Bond. Let
T = 2; then after taking the model in first differences there is only a cross-section available
for estimation,
Δy_i2 = γ Δy_i1 + Δε_i2,   i = 1, ..., N,
and only one moment condition,
(1/N) Σ_{i=1}^N (Δy_i2 - γ Δy_i1) y_i0 = 0.
To what extent is y_i0 related to Δy_i1? To answer this question it suffices to work out the
reduced form for Δy_i1:
Δy_i1 = (γ - 1) y_i0 + η_i + ε_i1,
from which it is clear that the closer γ is to unity, the weaker the correlation between y_i0 and
Δy_i1.
To solve the problem they suggest exploiting the following additional moment restrictions
(10.10.6)  E[ (y_it - γ y_{i,t-1}) Δy_{i,t-1} ] = 0,   t = 2, ..., T,
which are valid if, along with Assumption A.1, we maintain that the process for y_it is mean-stationary, that is
A.4: E( y_i0 | η_i ) = η_i / (1 - γ).


Assumption A.4 is justified if the process started in the distant past. Starting from the model
at observation t = 0 and going backward in time recursively,
y_i0 = γ y_{i,-1} + η_i + ε_i0
     = γ (γ y_{i,-2} + η_i + ε_{i,-1}) + η_i + ε_i0
     = γ² y_{i,-2} + η_i + γ η_i + γ ε_{i,-1} + ε_i0
     = ...
     = η_i / (1 - γ) + Σ_{s=0}^{∞} γ^s ε_{i,-s}
     = η_i / (1 - γ) + u_i0,
where E( u_i0 | η_i ) = 0 by Assumption A.1.


That the moment restrictions hold under Assumptions 1 and 4 can be seen for t = 2
E [(yi,2

yi,1 ) 4yi,1 ] =

E [(i + i,2 ) [(
1) yi,0 + i + i,1 ]] =

i
E (i + i,2 ) (
1)
+ ui,0 + i + i,1
=
1

E [(i + i,2 ) [(

1) ui,0 + i,1 ]]

Thus, Blundell and Bond (1998) suggest a system GMM estimator, which also uses instruments in first dierences for the equation in levels.
Hahn (1999) evaluates the eciency gains brought by exploiting the stationarity of the
initial condition as done by Blundell and Bond, finding that it is substantial also for large T .
Stata's xtabond performs the Arellano and Bond GMM estimator. Then there is xtdpdsys,
which implements the GMM system estimator. Third, xtdpd is a more general command that
allows more flexibility than both xtabond and xtdpdsys. Finally, the user-written xtabond2
(Roodman 2009) is certainly the most powerful code in Stata to implement dynamic panel
data models. A brief sketch of their use follows.
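The sketch below is a minimal, hedged illustration for a pure AR(1) model; the panel identifiers id and year and the outcome y are hypothetical placeholders, and it assumes the user-written xtabond2 has been installed (e.g. via ssc install xtabond2).

xtset id year
* Arellano-Bond difference GMM, one-step, robust standard errors
xtabond y, lags(1) vce(robust)
estat abond                                   // tests for AR(1) and AR(2) in the differenced residuals
* Blundell-Bond system GMM, two-step with Windmeijer-corrected standard errors
xtdpdsys y, lags(1) twostep vce(robust)
* difference GMM via xtabond2, using lags 2 and longer of y as instruments
xtabond2 y L.y, gmm(y, lag(2 .)) noleveleq robust small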


10.10.5. Application. Arellano and Bond (1991) demonstrate their methods by estimating a dynamic employment equation on a sample of UK manufacturing companies. Their data set
in Stata format is contained in abdata.dta. The do-file IV_GMM_DPD.do implements simpler
versions of their model through differenced and system GMM using xtabond and xtabond2.
The do-file abest.do by D. M. Roodman replicates exactly Arellano and Bond's results
using xtabond2.

10.10.6. Bias-corrected LSDV. IV and GMM estimators in dynamic panel data models
are consistent for N large, so they can be severely biased and imprecise in panel data with a
small number of cross-sectional units. This certainly applies to most macro panels, but also to
micro panels where heterogeneity concerns force the researcher to restrict estimation to small
subsamples of individuals.
Monte Carlo studies (Arellano and Bond 1991, Kiviet 1995 and Judson and Owen 1999)
demonstrate that LSDV, although inconsistent, has a relatively small variance compared to IV
and GMM estimators. So, an alternative approach based upon the correction of LSDV for the
finite-sample bias has become popular in the econometric literature. Kiviet (1995)
uses higher-order asymptotic expansion techniques to approximate the small-sample bias of the
LSDV estimator, including terms of at most order 1/(NT). Monte Carlo evidence therein shows
that the bias-corrected LSDV estimator (LSDVC) often outperforms the IV-GMM estimators
in terms of bias and root mean squared error (RMSE). Another piece of Monte Carlo evidence
by Judson and Owen (1999) strongly supports LSDVC when N is small, as in most macro
panels. In Kiviet (1999) the bias expression is made more accurate by including terms of higher order.
Bun and Kiviet (2003) simplify the approximations in Kiviet (1999).
Bruno (2005a) extends the bias approximations in Bun and Kiviet (2003) to accommodate
unbalanced panels with a strictly exogenous selection rule. Bruno (2005b) presents the new
user-written Stata command xtlsdvc to implement LSDVC.


Kiviet (1995) shows that the bias approximations are even more accurate when there is
a unit root in y. This makes for a simple panel unit-root test based on the bootstrapped
standard errors computed by xtlsdvc.
10.10.6.1. Estimating a dynamic labour demand equation for a given industry. Unlike the
xtabond and xtabond2 applications of Subsection 10.10.5, here we do not use all the information
available to estimate the parameters of the labour demand equation in abdata.dta. Instead,
we follow a strategy that, exploiting the industry partition of the cross-sectional dimension
as defined by the categorical variable ind, lets the slopes be industry-specific. This is easily
accomplished by restricting the usable data to the panel of firms belonging to a given industry.
While such a strategy leads to a less restrictive specification for the firm labour demand, it
reduces the number of cross-sectional units available for estimation, so that the researcher
must be prepared to deal with a potentially severe small-sample bias in any of the industry
regressions. Clearly, xtlsdvc is the appropriate solution in this case.
The demonstration is kept as simple as possible, considering regressions for only one industry panel, ind=4.
The following instructions are implemented in a Stata do-file.
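The do-file itself is not reproduced here; a minimal, hedged sketch of the kind of instructions involved follows. It assumes the user-written xtlsdvc is installed (e.g. via ssc install xtlsdvc), that the panel identifiers in abdata.dta are id and year, and it uses the employment, wage and capital variables n, w and k as in the Arellano-Bond application.

use abdata.dta, clear
xtset id year
keep if ind == 4                      // restrict estimation to the chosen industry panel
* bias-corrected LSDV: Arellano-Bond as the initial consistent estimator,
* second-order bias approximation, bootstrapped standard errors (50 replications)
xtlsdvc n w k, initial(ab) bias(2) vcov(50)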


Part 2

Non-linear models

CHAPTER 11

Non-linear regression models


11.1. Introduction
Non-linear models present three main difficulties.
(1) Closed-form solutions for estimators are generally not available.
(2) Marginal effects do not coincide with the model coefficients and vary over the sample.
(3) Latent heterogeneity components in cross-sections or panel data require special attention.
There are two do-files demonstrating the methods of this chapter: nlmr.do using the data set
mus10data.dta and nlmr2.do using the data set mus17data.dta. Both data-sets are from
Cameron and Trivedi (2009).

11.2. Non-linear least squares
The regression model specifies the mean of y conditional on a vector of exogenous explanatory variables x by using some known, non-linear functional form:
E(y|x) = μ(x, β).
Or, equivalently,
y = μ(x, β) + u,
where u = y - E(y|x). The non-linear least squares estimator, b_NLS, minimizes the non-linear
residual sum of squares
Q = Σ_{i=1}^n [ y_i - μ(x_i, b) ]².


11.3. Poisson model for count data
Let y ∈ N be a count variable: doctor visits, car accidents, etc. The Poisson regression
model is a non-linear regression model with
(11.3.1)  E(y|x) = exp(x'β).
Or, equivalently,
y = exp(x'β) + u,
where u = y - E(y|x).
Equation (11.3.1) implies E[ y - exp(x'β) | x ] = 0,
and by the Law of Iterated Expectations there are zero covariances between u and x:
(11.3.2)  E_{y,x}[ x (y - exp(x'β)) ] = 0.

11.3.1. Estimation. There is a random sample {y_i, x_i}, i = 1, ..., n, for estimation.
Given the population moment restrictions (11.3.2), estimation can be carried out with a limited set of assumptions within a GMM set-up: by the analogy principle the consistent GMM
estimator b_GMM solves the system of k sample analog restrictions
(11.3.3)  Σ_{i=1}^n x_i ( y_i - exp(x_i'b) ) = 0.
Equations (11.3.3) are also the first-order conditions in NLS, so that b_NLS = b_GMM in this case.
Alternatively, we can maintain a Poisson density function for y with mean μ:
f(y) = exp(-μ) μ^y / y!.
Importantly, the Poisson model has the equidispersion property: Var(y) = E(y) = μ.


Letting μ = exp(x'β), we end up with the conditional log-likelihood function
lnL(y_1...y_n | x_1...x_n, β) = Σ_{i=1}^n ln{ exp[-exp(x_i'β)] exp(x_i'β)^{y_i} / y_i! }
                            = Σ_{i=1}^n [ -exp(x_i'β) + y_i x_i'β - ln(y_i!) ],
and the ML estimator b_ML that maximizes it. We have:
- b_ML is consistent: b_ML →p β.
- The covariance matrix estimator of b_ML is
(11.3.4)  V(b_ML) = [ Σ_{i=1}^n m_i x_i x_i' ]^{-1},   where m_i ≡ exp(x_i' b_ML) is the fitted mean.
It is easily seen that the k first-order conditions that maximize lnL coincide with the equations in (11.3.3), so that b_ML = b_GMM. This proves two things: 1) the GMM estimator is
asymptotically efficient if the conditional mean function is correctly specified and the density
function is Poisson; 2) the ML estimator is consistent even if the Poisson density is not the
correct density function, as long as the conditional mean is correctly specified. In such cases,
when the likelihood function is not correctly specified, we refer to the ML estimator as a
pseudo-ML estimator, and a robust covariance matrix estimator should be used for inference
rather than (11.3.4):
V_rob(b_ML) = [ Σ_i m_i x_i x_i' ]^{-1} [ Σ_i (y_i - m_i)² x_i x_i' ] [ Σ_i m_i x_i x_i' ]^{-1}.

- With equidispersion, Var(y|x) = E(y|x) = μ: (y_i - m_i)² is close to m_i on average, so V(b_ML) is close to V_rob(b_ML).
- With overdispersion, Var(y|x) > E(y|x) = μ: (y_i - m_i)² tends to be greater than m_i, so V(b_ML) is inconsistent, with smaller
variance estimates than V_rob(b_ML), which remains consistent.


The consistency result for the (pseudo) ML estimator holds in general if two conditions are
verified:
(1) The conditional mean is correctly specified.
(2) The density function belongs to an exponential family.
Definition 73. An exponential family of distributions is one whose conditional log-likelihood function at a generic observation is of the form
lnL(y|x, θ) = a(y) + b[μ(x, θ)] + y c[μ(x, θ)].
A particular member of the family is identified by the functions a(·), b(·) and c(·).
We verify that the Poisson is an exponential family:
a(y) = -ln(y!),  b[μ(x, θ)] = -exp(x'β)  and  y c[μ(x, θ)] = y x'β.
The Normal distribution with a known variance σ²,
f(y|x, θ) = (1/(σ√(2π))) exp{ -[y - μ(x, θ)]² / (2σ²) },
is an exponential family also:
a(y) = -ln(σ√(2π)) - y²/(2σ²),  b[μ(x, θ)] = -μ(x, θ)²/(2σ²)  and  y c[μ(x, θ)] = y μ(x, θ)/σ².

The Stata command that implements Poisson regression is poisson, with a syntax close
to regress. It computes b_ML with standard-error estimates obtained from V(b_ML). If the
vce(robust) option is given, then Stata adopts the more robust pseudo-ML set-up and
still provides the b_ML coefficient estimates, but with the robust covariance matrix V_rob(b_ML).
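A minimal sketch, with hypothetical variable names (a count outcome visits and regressors age and income):

poisson visits age income, vce(robust)   // pseudo-ML: robust sandwich covariance
margins, dydx(age income)                // average marginal effects on E(y|x)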


11.4. Modelling and testing overdispersion
We start from a Poisson density function for y, conditional on a random scalar ν,
f(y|ν) = exp(-μν) (μν)^y / y!,
with E(ν) = 1 and Var(ν) = σ². We can find the unconditional moments of y by applying
iterated expectations:
E(y) = E[E(y|ν)] = μ
and
Var(y) = E[Var(y|ν)] + Var[E(y|ν)] = E[μν] + Var[μν] = μ + σ²μ² = μ(1 + σ²μ) > μ,
so overdispersion is allowed.
The marginal density function of y, f(y), is what is needed for ML estimation, since ν is
not observable. Its generic expression is
f(y) = E_ν[ exp(-μν) (μν)^y / y! ].
To find it in closed form we need to specify the marginal density function for ν. If ν ~
Gamma(1, σ²), then f(y) is a negative binomial density function, NB(μ, σ²), with E(y) = μ
and Var(y) = (1 + σ²μ)μ. Clearly, if σ² = 0, then ν collapses to its unity mean and f(y) is
Poisson.
Specifying μ = exp(x'β) yields the NB regression model, and β and σ² are estimated via
ML based on NB(exp(x'β), σ²). Testing for overdispersion within this framework boils down
to testing the null hypothesis σ² = 0.
The Stata command that implements the NB regression is nbreg, with a syntax close to
regress and poisson. The output also gives the overdispersion (LR) test of σ² = 0.

Overdispersion can also be tested under the null hypothesis of σ² = 0, therefore under
Poisson regression, against the alternative of Var(y|x) = (1 + σ²μ)μ, therefore NB regression, using a Lagrange Multiplier test. This is based on an auxiliary regression implemented
after poisson estimation, using an estimate of [Var(y|x)/μ] - 1, namely [(y_i - m_i)² - y_i]/m_i, as the
dependent variable and m_i = exp(x_i' b_ML) as the only regressor (no constant). The LM test
is the t-statistic computed for the OLS coefficient estimate on m_i.
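A hedged Stata sketch of this auxiliary regression, continuing the hypothetical poisson example of Section 11.3:

quietly poisson visits age income
predict muhat, n                                     // fitted conditional mean exp(x'b)
generate lmdep = ((visits - muhat)^2 - visits)/muhat
regress lmdep muhat, noconstant                      // the t-statistic on muhat is the LM test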

CHAPTER 12

Binary dependent variable models

12.1. Introduction
Binary dependent variable models have a dependent variable that partitions the sample
into two categories of a given qualitative dimension of interest. For example:
- Labour supply. There are two categories: work/not work (univariate binary model).
- Supplementary private health insurance. There are two categories: purchase/not purchase (univariate binary model).
Binary models are said to be multivariate when there are multiple dimensions that are possibly
related:
- Two related dimensions: [Dimension 1: Being overweight (body mass index > 25), two categories: yes/no] and [Dimension 2: Job satisfaction, two categories: satisfied/dissatisfied] (bivariate binary model).
- Two related dimensions: [Dimension 1: Identity of immigrants with the host country, two categories: yes/no] and [Dimension 2: Identity of immigrants with the country of origin, two categories: yes/no] (bivariate binary model).
In these notes I focus almost exclusively on univariate binary models, except for a digression
on the bivariate probit model as estimated by Stata's biprobit.
The do-file bdvm.do is a Stata application on binary models that uses the data set mus14data.dta
from Cameron and Trivedi (2009).


12.2. Binary models
Let A be the event of interest (e.g. an immigrant identifies with the host-country culture).
Let the indicator function 1(A) be unity if event A occurs and zero if not. Define the discrete
random variable y such that
(12.2.1)  y = 1(A).
Then
Pr(y = 1) = Pr(A) ≡ p and Pr(y = 0) = 1 - p,
E(y) = p and Var(y) = p(1 - p).
We want to assess the impact of x on the probability of A, and to do so we model Pr(y = 1|x)
as a function of x.
Since 0 ≤ Pr(y = 1|x) ≤ 1, a suitable functional form for Pr(y = 1|x) is any cumulative
distribution function evaluated at a linear combination of x, F(x'β). Accordingly, we specify
(12.2.2)  Pr(y = 1|x) = F(x'β).
Two popular choices for F(·) are:
- Probit model: F(·) ≡ Φ(·), the standard Normal distribution function.
- Logit model: F(x'β) ≡ Λ(x'β) ≡ exp(x'β)/[1 + exp(x'β)], the Logistic distribution function, with zero mean and variance π²/3.
Alternatively, we may model Pr(y = 1|x) directly as a linear function of x:
- Linear Probability Model (LPM): F(x'β) ≡ x'β.
Since Pr(y = 1|x) = E(y|x), Model (12.2.2) can always be expressed as the regression model
(12.2.3)  y = F(x'β) + u,  u = y - E(y|x).


12.2.1. Latent regression. When F(·) is a distribution function, the binary model can
be motivated as a latent regression model. In microeconomics this is a convenient way to
model individual choices.
Introduce the latent continuous random variable y* with
(12.2.4)  y* = x'β + ε,
where ε is a zero-mean random variable that is independent of x and with ε ~ F, where
F is a distribution function that is symmetric around zero. Then, let y = 1(y* > 0). In
our example of immigrant identity we may think of y* as the utility variation faced by a
subject with observable and latent characteristics x and ε, respectively, when he/she decides
to conform to the host-country culture, so that event A occurs if and only if y* > 0. Then
(12.2.5)  y = 1(ε > -x'β)  and so  Pr(y = 1|x) = Pr(ε > -x'β | x).
Since ε and x are independent, Pr(ε ≤ x'β | x) = F(x'β). Moreover, by symmetry of F,
Pr(ε > -x'β | x) = F(x'β), and so
Pr(y = 1|x) = F(x'β),
which is exactly Model (12.2.2).
Inspection of (12.2.5) clarifies that β and σ, the standard deviation of ε, cannot be separately identified,
since Pr(ε > -x'β) = Pr[(ε/σ) > -x'(β/σ)]. Therefore, to identify β, σ must be fixed to
some known value. In the probit model σ = 1 and in the logit model σ² = π²/3.


12.2.2. Estimation. There is a random sample {y_i, x_i}, i = 1, ..., n, for estimation. In
the logit and probit models estimation is carried out via ML. The ML estimator b_ML maximizes the conditional log-likelihood function
lnL(y_1...y_n | x_1...x_n, β) = Σ_{i=1}^n { y_i ln F(x_i'β) + (1 - y_i) ln[1 - F(x_i'β)] }.
We have b_ML →p β and
(12.2.6)  V(b_ML) = [ Σ_{i=1}^n f(x_i' b_ML)² x_i x_i' / ( F(x_i' b_ML) [1 - F(x_i' b_ML)] ) ]^{-1},
where f is the density function of F (remember that dF(z)/dz = f(z)).
The Stata commands that compute b_ML and V(b_ML) in the probit and logit models are,
respectively, probit and logit. The syntax is similar to regress.
The LPM assumes F(x'β) = x'β. So, Equation (12.2.3) is a linear regression model that can be
estimated by regress. In this case the model coefficients are identical to the marginal effects
of interest. But Var(u|x) = Var(y|x) = x'β(1 - x'β), so the model is heteroskedastic and
regress should be supplemented with the vce(robust) option.

12.2.3. Heteroskedasticity. Unlike the non-linear models examined in Chapter 11, in
probit and logit models heteroskedasticity brings about misspecification of the conditional
mean, so that the ML estimators of both models become inconsistent. Hence, it makes little sense
to complement probit and logit coefficient estimates with heteroskedasticity-robust standard-error estimates.
Heteroskedasticity can be modelled, though. In the probit model, instead of fixing σ = 1,
one can allow heteroskedasticity by setting σ_i = exp(z_i'γ), so that
(12.2.7)  Pr(y_i = 1|x_i) = Φ( x_i'β / exp(z_i'γ) ).
Stata's hetprobit estimates this heteroskedastic probit model and, importantly, provides an
LR test for the null of homoskedasticity (γ = 0).
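A minimal sketch, with hypothetical variable names (binary outcome insured, regressors age and income, with income also driving the error variance):

hetprobit insured age income, het(income)   // the LR test of homoskedasticity is reported in the output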
12.2.4. Clustering. Differently from heteroskedasticity, it makes sense to adjust standard-error estimates for within-cluster correlation. This is the case since within-cluster correlation leaves the conditional expectation of an individual observation unaffected, so that the ML
estimator can be motivated as a partial ML estimator, which remains consistent even if observations are not independent (see Wooldridge 2010, p. 609). The Stata option vce(cluster
clustervar), therefore, can be conveniently included in both probit and logit statements.
12.3. Coefficient estimates and marginal effects
There is no exact relationship between the coefficient estimates from the three foregoing
models. Amemiya (1981) works out the following rough conversion factors:
b_logit ≈ 4 b_ols,
b_probit ≈ 2.5 b_ols,
b_logit ≈ 1.6 b_probit.
The marginal effects of x at observation i are estimated by probit and logit as
(∂_x F_i)_probit = f_probit,i b_probit = φ(x_i' b_probit) b_probit,
(∂_x F_i)_logit = f_logit,i b_logit = Λ(x_i' b_logit) [1 - Λ(x_i' b_logit)] b_logit,
and by the LPM as
(∂_x F_i)_ols = b_ols.
The following rough relationships hold:
(∂_x F)_logit ≈ 0.25 b_logit ≈ b_ols,
(∂_x F)_probit ≈ 0.4 b_probit ≈ b_ols.
The post-estimation command margins with the option dydx(varlist) estimates marginal
effects for each of the variables in varlist. Marginal effects can be estimated at a point x
(conventionally, the sample mean when variables are continuous; in this case the option
atmeans must be supplied) or can be averaged over the sample (the default).
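A minimal sketch, continuing the hypothetical insurance example (psu is a hypothetical cluster identifier):

probit insured age income, vce(cluster psu)
margins, dydx(age income)              // average marginal effects (the default)
margins, dydx(age income) atmeans      // marginal effects at the sample means
estat classification                   // percent correctly predicted, also by outcome (Section 12.4)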

12.4. Tests and Goodness-of-fit measures
Parameter restrictions can be tested by Wald tests (test) and LR tests (lrtest). As explained above, hetprobit, besides producing coefficient estimates, provides a heteroskedasticity test.
The most common goodness-of-fit measures reported in logit or probit outputs are the following.
- The overall percent correctly predicted (opcp). Define the predictor ŷ_i of y_i as
ŷ_i = 1 if F(x_i'b) ≥ 0.5 and ŷ_i = 0 otherwise.
The opcp is given by the number of times ŷ_i = y_i over n. A problem with this measure
is that it can be high also in cases where the model poorly predicts one outcome. It
may be more informative in these cases to compute the percent correctly predicted
for each outcome separately: 1) the number of times ŷ_i = y_i = 1 over the number of
times y_i = 1, and 2) the number of times ŷ_i = y_i = 0 over the number of times y_i = 0.
This is done through the post-estimation command estat classification.
- Test the discrepancy between the actual frequency of an outcome and the estimated average probability of the same outcome within a subsample S of interest (for example,
females in a sample of workers):
ȳ_S ≡ (1/n_S) Σ_{i∈S} y_i   vs.   p̄_S ≡ (1/n_S) Σ_{i∈S} F(x_i'b).
Doing this on the whole sample makes little sense because the two measures are
always very close (equal in the logit model with the intercept).

- Evaluate the pseudo R-squared: R² = 1 - lnL(b)/lnL(ȳ), where lnL(b) is the value of
the maximized log-likelihood and lnL(ȳ) is the log-likelihood evaluated for the model
with only the intercept. Always 0 < R² < 1.
12.5. Endogenous regressors
In the presence of a continuous endogenous regressor in the latent regression model one
can use an instrumental-variable probit estimator. This is implemented by ivprobit, with a
syntax similar to ivregress. When the endogenous regressor is binary, we can apply a
bivariate recursive probit model, as explained in Subsection 12.7.1.
12.6. Independent latent heterogeneity
In the latent regression model (12.2.4) all explanatory variables are observed. But it
may be the case that relevant explanatory variables are latent, as allowed by the following
specification of the model:
y* = x'β + w'δ + ε,
where the w's are latent variables. There is thus a latent heterogeneity component c ≡ w'δ
to consider in the model along with ε. We make the following assumption: c is latent, with
c | x, ε ~ N(0, τ²).
Then c + ε | x ~ N(0, 1 + τ²) and
y*/√(1 + τ²) = x' ( β/√(1 + τ²) ) + (c + ε)/√(1 + τ²)
is a legitimate probit model. In fact, (c + ε)/√(1 + τ²) | x ~ N(0, 1), and so
Φ( x'β/√(1 + τ²) ) = Pr(y = 1|x).


It follows that we can apply standard probit ML estimation: the resulting estimator, call it b,
is consistent for β/√(1 + τ²), and so Φ(x'b) is consistent for the response probabilities
Pr(y = 1|x).
From the above analysis it clearly emerges that b estimates β with a downward
bias (Yatchew and Griliches (1985)). Nonetheless, if our interest centres on the marginal effects
∂_x Pr(y|c, x) averaged over c (AMEs), E_c[∂_x Pr(y|c, x)], this is no problem.
Indeed, given f(c|x), the conditional density function of c, it is generally true that
Pr(y|x) = ∫ Pr(y|x, c) f(c|x) dc.
But since c and x are independent, f(c|x) = f(c), and so
Pr(y|x) = ∫ Pr(y|x, c) f(c) dc = E_c[Pr(y|c, x)].
Hence, under mild regularity conditions that permit interchanging integrals and derivatives,
∂_x Pr(y|x) = E_c[∂_x Pr(y|c, x)].
The above result is important, for it establishes that to estimate Pr(y|x) and ∂_x Pr(y|x)
is to estimate E_c[Pr(y|c, x)] and E_c[∂_x Pr(y|c, x)], respectively. So, Φ(x_i'b) is a
consistent estimator of E_c[Pr(y|c, x)], and likewise ∂_x Φ(x'b) = φ(x'b) b is a consistent estimator of
E_c[∂_x Pr(y|c, x)] (see Wooldridge (2005) and Wooldridge (2010)).
If evaluated at a given point x_0, the AMEs are averages over c alone. To estimate
E_{x,c}[Pr(y|c, x)] and E_{x,c}[∂_x Pr(y|c, x)], just average Φ(x_i'b) and φ(x_i'b) b over
the sample.

12.7. Multivariate probit models
Multivariate probit models can be conveniently represented using the latent-regression
framework. There are m binary variables, y_1, y_2, ..., y_m, that may be related.

Multivariate probit models are constructed by supplementing the random vector y defined in (12.2.1) with the latent regression model

(12.7.1)    yj* = x'βj + εj,    j = 1, ..., m,

where βj and x are, respectively, the p×1 vectors of parameters and explanatory variables, and εj is the error term. Stacking all εj's into the vector ε ≡ (ε1, ..., εm)', we assume ε|x ∼ N(0, R). The covariance matrix R is subject to normalization restrictions that will be made explicit below. Equation-specific regressors are accommodated by allowing βj to have zeroes in the positions of the variables in x that are excluded from equation j. Cross-equation restrictions on the βj's are also permitted. R is normalized for scale and so has unity diagonal elements and arbitrary off-diagonal elements, ρij, which allows for possible cross-equation correlation of errors. It may or may not present constraints beyond normalization. If m = 2 we have the bivariate probit model, which is estimated by the Stata command biprobit, with a syntax similar to probit.
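A minimal biprobit sketch with hypothetical variable names and equation-specific regressors (x3 enters only the second equation):

    biprobit (y1 = x1 x2) (y2 = x1 x3)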

12.7.1. Recursive models. An interesting class of multivariate probit models is that


of the recursive models. In recursive probit models the variables in y = (y1, y2, ..., ym)' are allowed as right-hand-side variables of the latent system, provided that the m×m matrix of coefficients on y is restricted to be triangular (Roodman 2011). This means that if the model is bivariate, the latent system is

(12.7.2)    y1* = x'β1 + γ y2 + ε1
            y2* = x'β2 + ε2

It is then evident that estimating a bivariate recursive probit model is ancillary to estimation of a univariate probit model with a binary endogenous regressor, the first equation of system (12.7.2).


The feature that makes the recursive multivariate probit model appealing is that it accommodates endogenous, binary explanatory variables without special provisions for endogeneity,
simply maximizing the log-likelihood function as if the explanatory variables were all ordinary
exogenous variables (see Maddala 1983, Wooldridge 2010, Greene 2012 and, for a general proof,
Roodman 2011). This can be easily seen here in the case of the recursive bivariate model
Pr(y1 = 1, y2 = 1|x) = Pr(y1 = 1|y2 = 1, x) Pr(y2 = 1|x)
 = Pr(ε1 > −x'β1 − γ | y2 = 1, x) Pr[y2 = 1|x]
 = Pr(ε1 > −x'β1 − γ | ε2 > −x'β2, x) Pr(ε2 > −x'β2 | x)
 = Pr(ε1 > −x'β1 − γ, ε2 > −x'β2 | x)
 = Φ2(x'β1 + γ, x'β2; ρ12)

The crux of the above derivations is that, given

y1 = 1(ε1 > −x'β1 − γ y2)  and  y2 = 1(ε2 > −x'β2),

ε1 is independent of the lower limit of integration conditional on ε2 > −x'β2, and so no endogeneity issue emerges when working out the joint probability as a joint normal distribution.
The other three joint probabilities are similarly derived, so that eventually the likelihood
function is assembled exactly as in a conventional multivariate probit model.¹
Starting with the contributions of Evans and Schwab (1995) and Greene (1998), there
are by now many econometric applications of this model, including the recent articles by
Fichera and Sutton (2011) and Entorf (2012). The user-written command mvprobit deals
with m > 2; it evaluates multiple integrals by simulation (see Cappellari and Jenkins (2003)).

¹Wooldridge (2010) argues that, although not strictly necessary for formal identification, substantial identification in recursive models may require exclusion restrictions in the equations of interest. For example, in system (12.7.2) substantial identification requires some zeroes in β1, where the corresponding variables may then be thought of as instruments for y2.


The recent user-written command cmp (see Roodman (2011)) is a more general simulation-based procedure that can estimate many multiple-response and multivariate models.
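A hedged sketch of how the recursive system (12.7.2) can be fitted, with hypothetical variable names (z is an excluded instrument supplying the exclusion restriction discussed in the footnote); the cmp lines assume the package is installed:

    * recursive bivariate probit: y2 enters the first equation as a regressor
    biprobit (y1 = x1 x2 y2) (y2 = x1 x2 z)

    * the same model with the user-written command cmp
    cmp setup
    cmp (y1 = x1 x2 y2) (y2 = x1 x2 z), indicators($cmp_probit $cmp_probit)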

CHAPTER 13

Censored and selection models


13.1. Introduction
Censored models (Tobit models): y has lower and/or upper limits
Selection models: some values of y are missing not at random.
13.2. Tobit models
Consider the latent regression model
y* = x'β + ε,

with ε|x ∼ N(0, σ²). y is an observed random variable such that

y = y*  if y* > L
y = L   if y* ≤ L,

where L is a known lower limit.

Think of a utility-maximizing individual with latent and observable characteristics ε and x, respectively, choosing y subject to the inequality constraint y ≥ L, with y* denoting the unconstrained solution. For a part of individuals the constraint is binding (y = L) and for the others it is not (y > L). Focussing on the latter subpopulation, the regression model is

(13.2.1)    y = E(y|x, y > L) + u
              = E(x'β + ε | x, ε > L − x'β) + u
              = x'β + E(ε | x, ε > L − x'β) + u,

where u = y − E(y|x, y > L). The following results for the density and moments of the truncated normal distribution are useful (see Greene 2012, pp. 874-876):

For z ∼ N(µ, σ²) with density f and a truncation point c,

f(z|z > c) = f(z) / {1 − Φ[(c − µ)/σ]},
f(z|z < c) = f(z) / Φ[(c − µ)/σ],
E(z|z > c) = µ + σ φ[(c − µ)/σ] / {1 − Φ[(c − µ)/σ]},
E(z|z < c) = µ − σ φ[(c − µ)/σ] / Φ[(c − µ)/σ].

The foregoing equalities are all based on the following representation of a general normal cumulative distribution function, F_(µ,σ²)(·),

F_(µ,σ²)(c) = Pr[(z − µ)/σ ≤ (c − µ)/σ] = F_(0,1)[(c − µ)/σ],

and of a general normal density φ_(µ,σ²)(z):

φ_(µ,σ²)(z) = (1/(σ√(2π))) exp[−(z − µ)²/(2σ²)],

with Φ ≡ F_(0,1) and φ ≡ φ_(0,1) denoting the standard normal c.d.f. and density.

Then, Model (13.2.1) can be written in closed form as

y = x'β + σ φ[(L − x'β)/σ] / {1 − Φ[(L − x'β)/σ]} + u.

By symmetry of the normal distribution,

(13.2.2)    y = x'β + σ φ[(x'β − L)/σ] / Φ[(x'β − L)/σ] + u.

If L = 0, the foregoing reduces to

y = x'β + σ φ(x'β/σ) / Φ(x'β/σ) + u.


13.2.1. Estimation. There is a random sample {yi , xi } , i = 1, ..., n, for estimation. Let
di = 1 (yi > L). Estimation can be via ML or two-step LS.
The log-likelihood function assembles the density functions peculiar to the subsample of individuals with di = 1 and those peculiar to individuals with di = 0 (left-censored). For an individual with di = 1, yi = yi* and we know that yi*|xi ∼ N(xi'β, σ²). Therefore, we can evaluate the density at the single point yi:

f(yi|xi) = (1/σ) φ[(yi − xi'β)/σ].

For a left-censored individual (di = 0), all we know is that εi ≤ L − xi'β, so we integrate the density over the interval εi ≤ L − xi'β to get Pr(yi = L|xi) = Φ[(L − xi'β)/σ]. Therefore, the log-likelihood function is

lnL(y1 ... yn | x1 ... xn, β, σ) = Σ_{i=1}^n { di ln[(1/σ) φ((yi − xi'β)/σ)] + (1 − di) ln Φ[(L − xi'β)/σ] }.
The ML estimator b_ML is consistent for the parameters, asymptotically normal and asymptotically efficient.


Two-step LS, b_2-step, is based on Equation (13.2.2). In the first step we apply a probit regression using di as the dependent variable to estimate λi ≡ φ[(xi'β − L)/σ] / Φ[(xi'β − L)/σ] by λ̂i = φ(xi'b_probit)/Φ(xi'b_probit) (recall that b_probit is indeed a consistent estimate of β/σ, and L/σ is subsumed in the constant term). In the second step apply an OLS regression of yi on xi and λ̂i, restricting to the unconstrained subsample di = 1. b_2-step is consistent, but standard errors need to be adjusted since in the second step there is an estimated regressor.
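A hedged sketch of the two-step procedure for the case L = 0, with hypothetical variable names (the second-step standard errors reported by regress ignore that lambda is an estimated regressor):

    generate d = (y > 0)
    probit d x1 x2
    predict xb1, xb
    generate lambda = normalden(xb1)/normal(xb1)
    regress y x1 x2 lambda if d == 1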
Upper limits can be dealt with similarly:

y = y*  if y* < U
y = U   if y* ≥ U.


Also, lower and upper limits jointly:

y = L   if y* ≤ L
y = y*  if L < y* < U
y = U   if y* ≥ U.

The Stata command that computes b_ML in the tobit model is tobit. The syntax is similar to regress, requiring in addition options specifying lower limits, ll(#), and upper limits, ul(#).
Marginal effects of interest are

∂x E(y*|x) = β

∂x E(y|x, y > L) = [1 − w λ(w) − λ(w)²] β,  where w = (x'β − L)/σ and λ(w) = φ(w)/Φ(w) = φ[(x'β − L)/σ] / Φ[(x'β − L)/σ].

These are computed by margins and the older mfx.
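A minimal tobit sketch with left-censoring at zero and hypothetical variable names; margins with the e(0,.) prediction averages ∂x E(y|x, y > 0) over the sample:

    tobit y x1 x2, ll(0)
    margins, dydx(*) predict(e(0,.))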


13.2.2. Heteroskedasticity and clustering. The same considerations made for binary models in Sections 12.2.3 and 12.2.4 hold here. While heteroskedasticity breaks down the specification of conditional expectations, clustering does not. Therefore, it makes sense to apply the Stata option vce(cluster clustervar).

13.3. A simple selection model


Two processes: the first selects the units into the sample, the second generates y. The two processes are related: selection is endogenous and cannot be ignored!
Selection process:

s* = z'γ + ν,

d = 1(s* > 0)
Process for y:

y* = x'β + ε,

y = y*       if d = 1
y = missing  if d = 0

Interest is on β. The two processes are related, for ε and ν are jointly distributed as

(ε, ν)' | z, x ∼ N(0, Σ),  with  Σ = [ σ_ε²  σ_εν ; σ_εν  1 ].

Estimation is via ML. The log-likelihood is

lnL = Σ_{i=1}^n { di ln[ f(yi|di = 1) Pr(di = 1) ] + (1 − di) ln[ Pr(di = 0) ] }.

The Stata command that computes b_ML in the selection model is heckman. The syntax is similar to regress, requiring in addition an option specifying the variables in the selection process, d and z: select(varlist_s).
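A minimal heckman sketch with hypothetical variable names, where d is the selection indicator and z1, z2 appear only in the selection equation:

    heckman y x1 x2, select(d = x1 z1 z2)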

CHAPTER 14

Quantile regression
14.1. Introduction
Define the conditional c.d.f. of Y: F(y|x) = Pr(Y ≤ y|x). Instead of E(y|x), as in the CRM, we model quantiles of F(y|x).
The conditional median function of y, Q0.5(y|x), is an example. Specifically, for given x and F(y|x), Q0.5(y|x) is a function that assigns the median of F(y|x) to x, and is implicitly defined as

F[Q0.5(y|x)|x] = 0.5

or, explicitly,

(14.1.1)    Q0.5(y|x) = F⁻¹(0.5|x).

More generally, given the quantile q ∈ (0, 1), define the conditional quantile function, Qq(y|x), as

(14.1.2)    Qq(y|x) = F⁻¹(q|x).

14.2. Properties of conditional quantiles


Define a predictor function ŷ(x) and let ε(y, x) be the corresponding prediction error,

ε(y, x) = y − ŷ(x);

then

Lq = q E_{ε(y,x)≥0}[ε(y, x)] + (q − 1) E_{ε(y,x)<0}[ε(y, x)]

is minimized when ŷ(x) = Qq(y|x).
Qq(y|x) is equivariant to monotone transformations: let h(·) be a monotonic function; then Qq[h(y)|x] = h[Qq(y|x)].
In the case of the median,

L0.5 = (1/2) E_{y,x}(|ε(y, x)|)  ⇒  Q0.5(y|x) minimizes E_{y,x}(|ε(y, x)|).

In other words, Q0.5(y|x) is the minimum mean absolute error predictor.
14.3. Estimation
There is a sample {yi , xi } , i = 1, ..., n, for estimation.
14.3.1. Marginal effects. We can put the QR model in close relationship with the CRM.
Let
Let

yi = E(yi|xi) + ui,

where ui = yi − E(yi|xi); then

Qq(yi|xi) = E(yi|xi) + Qq(ui|xi).

This is proved by invoking the equivariance property of Qq(·|x):

Qq(yi|xi) = Qq[E(yi|xi) + ui|xi] = E(yi|xi) + Qq(ui|xi).

14.3.1.1. i.i.d. case. If {ui}, i = 1, ..., n, are i.i.d. and independent of xi, then Qq(ui|xi) is constant over the sample and varies only with q, Qq(ui|xi) = αq, so that

Qq(yi|xi) = E(yi|xi) + αq,

which implies that here the marginal effects computed from the regression model coincide with those computed from the QR. Therefore

∂x Qq(yi|xi) = ∂x E(yi|xi)

and

E(yi|xi) = xi'β  ⇒  ∂x Qq(yi|xi) = β.

14.3.1.2. General case. If the i.i.d. assumption does not hold (e.g. because of heteroskedasticity), then

∂x Qq(yi|xi) = ∂x E(yi|xi) + ∂x Qq(ui|xi)

and

E(yi|xi) = xi'β  ⇒  ∂x Qq(yi|xi) = β + ∂x Qq(ui|xi).

Also,

(14.3.1)    E(yi|xi) = xi'β and Qq(ui|xi) = xi'γq  ⇒  ∂x Qq(yi|xi) = β + γq.

14.3.2. The linear QR. The linear quantile regression model specifies Qq (yi |xi ) as a
linear function
Qq(yi|xi) = xi'βq

or equivalently

yi = xi'βq + uq,i

where uq,i = yi − xi'βq and, from (14.3.1), βq = β + γq.

A consistent estimator, bq, for βq is found by the analogy principle:

bq = argmin_b { Σ_{i: yi ≥ xi'b} q (yi − xi'b) + Σ_{i: yi < xi'b} (q − 1) (yi − xi'b) }.

Under mild regularity conditions

bq ∼a N(βq, A⁻¹BA⁻¹),

where A = Σi f_{uq}(0|xi) xi xi', B = q(1 − q) Σi xi xi', and f_{uq}(0|xi) is the conditional density of uq,i evaluated at uq,i = 0. The latter makes A⁻¹BA⁻¹ difficult to estimate, so it is better to apply conventional bootstrap procedures.


The main Stata command implementing QR is qreg. Its syntax is similar to regress. The quantile(.##) option in qreg indicates the quantile of choice (e.g. to get the median, which is also the default, set .##=.50). To produce QR estimates with bootstrap standard errors apply bsqreg. The reps(#) option in bsqreg indicates the number of bootstrap replications.
Implementing the same model at various quantiles through repeated qreg regressions can shed light on the discrepancies in behavior across different regions of the variable of interest. To evaluate the statistical significance of such discrepancies, though, it is necessary to estimate a larger covariance matrix encompassing covariances between coefficient estimators across quantiles. This is done by sqreg, which provides simultaneous estimates from the quantile regressions chosen by the researcher.
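A sketch of the three commands with hypothetical variable names; the final test compares the coefficient on x1 across the first and third quartiles using the covariance matrix estimated by sqreg:

    qreg y x1 x2, quantile(.50)
    bsqreg y x1 x2, quantile(.75) reps(400)
    sqreg y x1 x2, quantiles(.25 .5 .75) reps(400)
    test [q25]x1 = [q75]x1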

14.4. A heteroskedastic regression model with simulated data


This is based on ch. 7.4 in Cameron and Trivedi (2009) and clarifies a couple of somewhat
intricate points therein.
Generate the data from a heteroskedastic linear regression model:

y = 1 + x2 + x3 + u,
u = (0.1 + 0.5 x2) ε,
x2 ∼ χ²(1),
x3 | x2 ∼ N(0, 25),
ε | x2, x3 ∼ N(0, 25).

It is not hard to verify that E(y|x) = 1 + x2 + x3. Therefore,

∂x2 E(y|xi) = 1
∂x3 E(y|xi) = 1


and we expect that an OLS regression will yield coefficient estimates close to the foregoing marginal effects. Also,

Qq(u|x) = Qq[(0.1 + 0.5 x2) ε|x] = (0.1 + 0.5 x2) Qq(ε|x) = 0.1 αq + 0.5 αq x2,

where the first equality is obvious, the second follows from the equivariance property and the last from independence of ε and x2, yielding αq = Qq(ε|x). Then, according to (14.3.1) we have

∂x2 Qq(y|xi) = 1 + 0.5 αq
∂x3 Qq(y|xi) = 1.

Therefore, we expect that quantile regressions will yield coefficient estimates for x3 close to the OLS estimate, regardless of the quantile considered, whilst for x2 we will observe various discrepancies with the OLS estimate depending on the quantile regression. This is confirmed by the outcome of the do-file qr.do.
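A hedged sketch of this data-generating process (assuming a χ²(1) draw for x2) and of the regressions it suggests; the actual do-file qr.do may differ in its details:

    clear
    set obs 10000
    set seed 10101
    generate x2 = rchi2(1)          // assumed chi-squared(1) regressor
    generate x3 = 5*rnormal()       // N(0,25) given x2
    generate e  = 5*rnormal()       // N(0,25) error
    generate u  = (0.1 + 0.5*x2)*e  // heteroskedastic disturbance
    generate y  = 1 + x2 + x3 + u
    regress y x2 x3
    qreg y x2 x3, quantile(.25)
    qreg y x2 x3, quantile(.75)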

Bibliography
Abowd, J. M., Kramarz, F., Margolis, D. N., 1999. High wage workers and high wage firms. Econometrica 67, 251–333.
Anderson, T. W., Hsiao, C., 1982. Formulation and estimation of dynamic models using panel data. Journal of Econometrics 18, 570–606.
Andrews, D. W. K., Moreira, M. J., Stock, J. H., 2007. Performance of conditional Wald tests in IV regression with weak instruments. Journal of Econometrics 139, 116–132.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics 49 (4), 431–434.
Arellano, M., 2003. Panel Data Econometrics. Oxford University Press.
Arellano, M., Bond, S., 1991. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–297.
Baltagi, B. H., 2008. Econometric Analysis of Panel Data. New York: Wiley.
Blundell, R., Bond, S., 1998. Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–143.
Bowsher, C. G., 2002. On testing overidentifying restrictions in dynamic panel data models. Economics Letters 77, 211–220.
Bruno, G. S. F., 2005a. Approximating the bias of the LSDV estimator for dynamic unbalanced panel data models. Economics Letters 87, 361–366.
Bruno, G. S. F., 2005b. Estimation and inference in dynamic unbalanced panel data models with a small number of individuals. The Stata Journal 5, 473–500.
Bun, M. J. G., Kiviet, J. F., 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79, 145–152.
Cameron, A. C., Gelbach, J. B., Miller, D. L., 2011. Robust inference with multiway clustering. Journal of Business & Economic Statistics 29, 238–249.
Cameron, A. C., Trivedi, P. K., 2009. Microeconometrics using Stata. Stata Press, College Station, TX.
Cappellari, L., Jenkins, S. P., 2003. Multivariate probit regression using simulated maximum likelihood. The Stata Journal 3, 278–294.
Cragg, J., Donald, S., 1993. Testing identifiability and specification in instrumental variable models. Econometric Theory 9, 222–240.
Entorf, H., 2012. Expected recidivism among young offenders: Comparing specific deterrence under juvenile and adult criminal law. European Journal of Political Economy 28, 414–429.
Evans, W. N., Schwab, R. M., 1995. Finishing high school and starting college: Do Catholic schools make a difference? The Quarterly Journal of Economics 110, 941–974.
Fichera, E., Sutton, M., 2011. State and self investment in health. Journal of Health Economics 30, 1164–1173.
Greene, W. H., 1998. Gender economics courses in liberal arts colleges: Further results. Journal of Economic Education 29, 291–300.
Greene, W. H., 2008. Econometric Analysis, sixth Edition. Upper Saddle River, NJ: Prentice Hall.
Greene, W. H., 2012. Econometric Analysis, seventh Edition. Upper Saddle River, NJ: Prentice Hall.
Hansen, L. P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Hausman, J., 1978. Specification tests in econometrics. Econometrica 46, 1251–1271.
Hausman, J. A., Taylor, W., 1981. Panel data models and unobservable individual effects. Econometrica 49, 1377–1398.
Hayashi, F., 2000. Econometrics. Princeton University Press.
Judson, R. A., Owen, A. L., 1999. Estimating dynamic panel data models: a guide for macroeconomists. Economics Letters 65, 9–15.
Kiviet, J. F., 1995. On bias, inconsistency and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68, 53–78.
Kiviet, J. F., 1999. Expectation of expansions for estimators in a dynamic panel data model; some results for weakly exogenous regressors. In: Hsiao, C., Lahiri, K., Lee, L.-F., Pesaran, M. H. (Eds.), Analysis of Panels and Limited Dependent Variable Models. Cambridge University Press, Cambridge, pp. 199–225.
Maddala, G. S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, Cambridge.
Mikusheva, A., Poi, B. P., 2006. Tests and confidence sets with correct size when instruments are potentially weak. The Stata Journal 6, 335–347.
Moulton, B. R., 1990. An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics 72 (2), 334–338.
Mundlak, Y., 1978. On the pooling of time series and cross section data. Econometrica 46, 69–85.
Nickell, S. J., 1981. Biases in dynamic models with fixed effects. Econometrica 49, 1417–1426.
Prucha, I. R., 1984. On the asymptotic efficiency of feasible Aitken estimators for seemingly unrelated regression models with error components. Econometrica 52, 203–207.
Rao, C. R., 1973. Linear Statistical Inference and Its Applications. New York: Wiley.
Roodman, D. M., 2009. How to do xtabond2: An introduction to difference and system GMM in Stata. The Stata Journal 9 (1), 86–136.
Roodman, D. M., 2011. Fitting fully observed recursive mixed-process models with cmp. The Stata Journal 11, 159–206.
Searle, S. R., 1982. Matrix Algebra Useful for Statistics. New York: Wiley.
Stock, J. H., Watson, M. W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76, 155–174.
Swamy, P. A. B., Arora, S. S., 1972. The exact finite sample properties of the estimators of coefficients in the error components regression models. Econometrica 40 (2), 261–275.
White, H., 2001. Asymptotic Theory for Econometricians, revised Edition. Emerald.
Windmeijer, F., 2005. A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126, 25–51.
Wooldridge, J. M., 2005. Unobserved heterogeneity and estimation of average partial effects. In: Andrews, D. W. K., Stock, J. H. (Eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, New York.
Wooldridge, J. M., 2010. Econometric Analysis of Cross Section and Panel Data, 2nd Edition. The MIT Press, Cambridge, MA.
Yatchew, A., Griliches, Z., 1985. Specification error in probit models. Review of Economics and Statistics 67, 134–139.
Zyskind, G., 1967. On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models. Annals of Mathematical Statistics 36, 1092–1109.
