Académique Documents
Professionnel Documents
Culture Documents
c Michael Creel
List of Figures 10
List of Tables 12
Bibliography 491
Index 492
List of Figures
1.2.1 LYX 15
1.2.2 Octave 16
9.1.1
when there is no collinearity 164
9.1.2
when there is collinearity 165
12
CHAPTER 1
This document integrates lecture notes for a one year graduate level course
with computer programs that illustrate and apply the methods that are stud-
ied. The immediate availability of executable (and modifiable) example pro-
grams when using the PDF1 version of the document is one of the advantages
of the system that has been used. On the other hand, when viewed in printed
form, the document is a somewhat terse approximation to a textbook. These
notes are not intended to be a perfect substitute for a printed textbook. If you
are a student of mine, please note that last sentence carefully. There are many
good textbooks available. A few of my favorites are listed in the bibliography.
With respect to contents, the emphasis is on estimation and inference within
the world of stationary data, with a bias toward microeconometrics. The sec-
ond half is somewhat more polished than the first half, since I have taught that
course more often. If you take a moment to read the licensing information in
the next section, you’ll see that you are free to copy and modify the document.
If anyone would like to contribute material that expands the contents, it would
be very welcome. Error corrections and other additions are also welcome. As
an example of a project that has made use of these notes, see these very nice
lecture slides.
1
It is possible to have the program links open up in an editor, ready to run using keyboard
macros. To do this with the PDF version you need to do some setup work. See the bootable
CD described below.
13
1.2. OBTAINING THE MATERIALS 14
1.1. License
All materials are copyrighted by Michael Creel with the date that appears
above. They are provided under the terms of the GNU General Public License,
which forms Section 23 of the notes. The main thing you need to know is that
you are free to modify and distribute these materials in any way you like, as
long as you do so under the terms of the GPL. In particular, you must make
available the source files, in editable form, for your modified version of the
materials.
GNU Octave has been used for the example programs, which are scattered
though the document. This choice is motivated by two factors. The first is the
high quality of the Octave environment for doing applied econometrics. The
fundamental tools exist and are implemented in a way that make extending
2
”Free” is used in the sense of ”freedom”, but LYX is also free of charge.
1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 15
them fairly easy. The example programs included here may convince you of
this point. Secondly, Octave’s licensing philosophy fits in with the goals of this
project. Thirdly, it runs on Linux, Windows and MacOS. Figure 1.2.2 shows an
Octave program being edited by NEdit, and the result of running the program
in a shell window.
The example programs are available as links to files on my web page in the
PDF version, and here. Support files needed to run these are available here.
The files won’t run properly from your browser, since there are dependencies
1.3. AN EASY WAY TO USE LYX AND OCTAVE TODAY 16
between files - they are only illustrative when browsing. To see how to use
these files (edit and run them), you should go to the home page of this doc-
ument, since you will probably want to download the pdf version together
with all the support files and examples. Then set the base URL of the PDF file
to point to wherever the Octave files are installed. All of this may sound a bit
complicated, because it is. An easier solution is available:
The file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO im-
age file that may be burnt to CDROM. It contains a bootable-from-CD Gnu/Linux
1.4. KNOWN BUGS 17
system that has all of the tools needed to edit this document, run the Octave ex-
ample programs, etcetera. In particular, it will allow you to cut out small por-
tions of the notes and edit them, and send them to me as LYX (or TEX) files for
inclusion in future versions. Think error corrections, additions, etc.! The CD
automatically detects the hardware of your computer, and will not touch your
hard disk unless you explicitly tell it to do so. It is based upon the Knoppix
GNU/Linux distribution, with some material removed and other added. Ad-
ditionally, you can use it to install Debian GNU/Linux on your computer (run
knoppix-installer as the root user). The versions of programs on the CD
may be quite out of date, possibly with security problems that have not been
fixed. So if you do a hard disk installation you should do apt-get update,
apt-get upgrade toot sweet. See the Knoppix web page for more informa-
tion.
Economic theory tells us that the demand function for a good is something
like:
is the quantity demanded
is vector of prices of the good and its substitutes and comple-
ments
is income
is a vector of other variables such as individual characteristics that
affect preferences
The functions +* ?>@ which in principle may differ for all ! have been
restricted to all belong to the same parametric family.
Of all parametric families of functions, we have restricted the model
to the class of linear in the variables functions.
The parameters are constant across individuals.
There is a single unobservable component, and we assume it is addi-
tive.
If we assume nothing about the error term A , we can always write the last
equation. But in order for the coefficients to have an economic meaning,
and in order to be able to estimate them from sample data, we need to make
additional assumptions. These additional assumptions have no theoretical
basis, they are assumptions on top of those needed to prove the existence of
a demand function. The validity of any results we obtain using this model
will be contingent on these additional restrictions being at least approximately
correct. For this reason, specification testing will be needed, to check that the
model seems to be reasonable. Only when we are convinced that the model is
at least approximately correct should we use it for economic analysis.
2. INTRODUCTION: ECONOMIC AND ECONOMETRIC MODELS 20
We would like to ensure that the third reason is not contributing to rejections,
so that rejection will be due to either the first or second reasons. Hopefully the
above example makes it clear that there are many possible sources of misspec-
ification of econometric models. In the next few sections we will obtain results
supposing that the econometric model is entirely correctly specified. Later we
will examine the consequences of misspecification and see some methods for
determining if a model is correctly specified. Later on, econometric methods
that seek to minimize maintained assumptions are introduced.
CHAPTER 3
B
1F I1 28 F
2 &)&(&J2K D F ED 2KA
B]U L 4U ^2/JU
1
For example, cross-sectional data may be obtained by random sampling. Time series data
accumulate historically.
21
3.2. ESTIMATION BY LEAST SQUARES 22
(3.1.1) _ a` b28/T
aw y. xCz .y|M
xY{ }C~% /V
Figure 3.2.1, obtained by running TypicalData.m shows some data that fol-
lows the linear model BgU
01\2= U 2=A-U . The green line is the ”true” regression
line 01V2 U , and the red crosses are the data points U B]U C where A-U is a ran-
dom error that has mean zero and is independent of U . Exactly how the green
line is defined will become clear later. In practice, we only have the data, and
3.2. ESTIMATION BY LEAST SQUARES 23
-5
-10
-15
0 2 4 6 8 10 12 14 16 18 20
X
we don’t know where the green line lies. We need to gain information about
the straight line that best fits the data points.
The ordinary least squares (OLS) estimator is defined as the value that mini-
mizes the sum of the squared errors:
V-
where
f
T
BgU L U4
U(1
_ ` G 4 _ `
_I4_iK#J_4 ` ^ 284 ` 4 `
_ `
3.2. ESTIMATION BY LEAST SQUARES 24
This last expression makes it clear how the OLS estimator is defined: it min-
imizes the Euclidean distance between B and & The fitted OLS coefficients
will define the best linear approximation to B using L as basis functions, where
”best” means minimum Euclidean distance. One could think of other esti-
mators based upon other metrics. For example, the minimum absolute distance
(MAD) minimizes fU(1G BgU L 4U . Later, we will see that which estimator is
best in terms of their statistical properties, rather than in terms of the metrics
that define them, depends upon the properties of A , about which we have as
yet made no assumptions.
T
# ` 4 `
x
Since `
this matrix is positive definite, since it’s a quadratic
form in a p.d. matrix (identity matrix of order , so is in fact a
minimizer.
The fitted values are in the vector _
¡` ¢ &
The residuals are in the vector /
_i `
3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 25
Note that
_
` ^2/
` ^ 2 /
` 4_£ ` 4 `
` 4 c _ ` h
` 4 /
3.3.1. In i¤ Space. Figure 3.3.1 shows a typical fit to data, along with the
true regression line. Note that the true line and the estimated line are different.
This figure was created by running the Octave program OlsFit.m . You can
experiment with changing the parameter values to see how this affects the fit,
and to see how the fitted line will sometimes be close to the true line, and
sometimes rather far away.
10
-5
-10
-15
0 2 4 6 8 10 12 14 16 18 20
X
Observation 2
e = M_xY S(x)
x*beta=P_xY
Observation 1
3.3. GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION 27
3.3.3. Projection Matrices. is the projection of B onto the span of i or
¨i4©3 1 i4B
§
since
ª « B &
/ is the projection of B onto the ® dimensional space that is orthogonal
to the span of . We have that
/
B¯j
B¯jK4: 1 i4©B
°@± fp²K4: 1 i4³0B&
3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 28
So the matrix that projects B onto the space orthogonal to the span of is
We have
/
´3« B&
Therefore
B
ªI« B 2 :
´ « B
^2 / &
These two projection matrices decompose the dimensional vector B into two
orthogonal components - the portion that lies in the dimensional space de-
fined by i and the portion that lies in the orthogonal K dimensional
space.
This is how we define a linear estimator - it’s a linear function of the de-
pendent variable. Since it’s a linear combination of the observations on the
dependent variable, where the weights are detemined by the observations on
the regressors, some observations may have more influence than others. De-
fine
¹U
ªI« U@U
º U4 ª« º U
ªI« º U
» ¼º U
¹U U¶
is the t element on the main diagonal of
ª
« ( ºU is a vector of zeros with
U¶
a in the t position). So ¾½
¹ U ½ V and
So, on average, the weight on the BVU ’s is  . If the weight is much higher, then
the observation has the potential to affect the fit importantly. The weight,
¹U
is referred to as the leverage of the observation. However, an observation may
also be influential due to the value of BVU , rather than the weight it is multiplied
by, which only depends on the U ’s.
To account for this, consider estimation of without using the
U¶ observa-
U(Å
tion (designate this estimator as ÄÃ Y& One can show (see Davidson and MacK-
innon, pp. 32-5 for proof) that
ÇÆ È ¹ UCÉ 4: 1 iU 4 /J U
à (U Å
£
3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS 30
10
-2
0 0.5 1 1.5 2 2.5 3
X
U(Å
¹U
=U £j=U Ã Æ È ¹ UCÉ /] U
While an observation may be influential if it doesn’t affect its own fitted value,
it certainly is influential if it does. A fast means of identifying influential ob-
servations is to plot c 1 Y¶ ÊY¶ Ê h /J U (which I will refer to as the own influence of the
observation) as a function of . Figure 3.4.1 gives an example plot of data, fit,
leverage and influence. The Octave program is InfluentialObservation.m . If
you re-run the program you will see that the leverage of the last observation
(an outlying value of x) is always high, and the influence is sometimes high.
After influential observations are detected, one needs to determine why
they are influential. Possible causes include:
3.5. GOODNESS OF FIT 31
data entry error, which can easily be corrected once detected. Data
entry errors are very common.
special economic factors that affect some observations. These would
need to be identified and incorporated in the model. This is the idea
behind structural change: the parameters may not be constant across all
observations.
pure randomness may have caused us to sample a low-probability ob-
servation.
BT4B
4i4 ^Ë2 # Ì4 i4 /¼ 2 /J 4 /
(3.5.1) BT4B
4i4 ^2 /J 4 /
/ 4 /
Í
È B B
4
4 4
B 4B
ª« B
B
ÎvÏ"Ð Wt0 Y
3.5. GOODNESS OF FIT 32
The uncentered changes if we add a constant to B since this changes
t (see Figure 3.5.1, the yellow vector is a constant, since it’s on the
ÑÒ degree line in observation space). Another, more common defini-
tion measures the contribution of the variables, other than the constant
term, to explaining the variation in B& Thus it measures the ability of
the model to explain the variation of B about its unconditional sample
mean.
3.5. GOODNESS OF FIT 33
´ÖÕ
± fsMÓ-PÓ¨4ÓW $ 1 ¨Ó 4
± fsMÓ×Ó¬4 Â
´ÖÕ B just returns the vector of deviations from the mean. In terms of deviations
from the mean, equation 3.5.1 becomes
BT4 ´ÖÕ B
I 4Ìi4 ´:Õ 2 /J 4 ´ÖÕ'/
/ 4 /
Ø y B ´ÖÕ B
ÈÚÙÜ
¿ ÛZÛ
4 ÛÛ
¿
Ö
´ Õ f
where
ÙÜÛÛ / 4 / and ÛÛ B 4 B = U(1 B]UuB%Ý .
Supposing that contains a column of ones (i.e., there is a constant term),
a
i4 /
ÞÁ /JU
U
so
´ÖÕ'/
/T & In this case
B4 ´ÖÕ B
4i4 ´ÖÕ ^2 /] 4 /
So
Ø
¿ Û Û
4 4:
´ Õ ÛÛ
where
ÛÛ
B
01 12K
2 &)&(&]2K DYED 2MA
(3.6.1) B
1F I1 28 F
2 &)&(&J2K D F ED 2KA
(3.6.2)
) ` 4 `å
æ «
where æ
« is a finite positive definite matrix. This is needed to be able to iden-
tify the individual effects of the explanatory variables.
Independently and identically distributed errors:
Nonautocorrelated errors:
Optionally, we will sometimes assume that the errors are normally dis-
tributed.
Normally distributed errors:
3.7.1. Unbiasedness. We have 4 : 1 4 B . By linearity,
¬ 4 3 1 4 ^28/V
b2¬i43 $ 1 i4/
1
1
Ù 4: i4/ Ù ¬ i4: i4/
i4: $ 1 i4 Ù /
so the OLS estimator is unbiased under the assumptions of the classical model.
Figure 3.7.1 shows the results of a small Monte Carlo experiment where the
OLS estimator was calculated for 10000 samples from the classical model with
B
2# Ö2 / , where
# , èî
aï , and is fixed across samples. We can see
that the appears to be estimated without bias. The program that generates
the plot is Unbiased.m , if you would like to experiment with this.
With time series data, the OLS estimator will often be biased. Figure 3.7.2
shows the results of a small Monte Carlo experiment where the OLS estimator
was calculated for 1000 samples from the AR(1) model with B"U
a 2 & ï B]U 1v2¯/JU ,
where
# and è î
. In this case, assumption 3.6.2 does not hold: the
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 37
0.1
0.08
0.06
0.04
0.02
0
-3 -2 -1 0 1 2 3
regressors are stochastic. We can see that the bias in the estimation of is
about -0.2.
The program that generates the plot is Biased.m , if you would like to ex-
periment with this.
3.7.2. Normality. With the linearity assumption, we have
2= 4 : 1 4 /T&
This is a linear function of / . Adding the assumption of normality (3.6.6, which
implies strong exogeneity), then
içuñðòZNi4: $ 1 è \F ó
since a linear function of a normal random vector is also normally distributed.
In Figure 3.7.1 you can see that the estimator appears to be normally dis-
tributed. It in fact is normally distributed, since the DGP (see the Octave pro-
gram) has normal errors. Even when the data may be taken to be IID, the
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 38
0.12
0.1
0.08
0.06
0.04
0.02
0
-1.2 -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4
3.7.3. The variance of the OLS estimator and the Gauss-Markov theo-
rem. Now let’s make all the classical assumptions except the assumption of
2
Normality may be a good model nonetheless, as long as the probability of a negative value
occuring is negligable under the model. This depends upon the mean being large enough in
relation to the variance.
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 39
normality. We have
^ 2a 4 : 1 4 / and we know that Ù
. So
éÜô À
ÙRõ c¢÷ö h c¢ö h 4×ø
1 1'ú
Ùaù ¬ i43 $ i4/J/]4ÌK¬i4:
4 : 1 è
F
The OLS estimator is a linear estimator, which means that it is a linear func-
tion of the dependent variable, B&
° 4 : 1 4³ B
û B
where û is a function of the explanatory variables only, not the dependent vari-
able. It is also unbiased under the present assumptions, as we proved above.
One could consider other weights ü that are a function of that define some
other linear estimator. We’ll still insist upon unbiasedness. Consider ý
R
ü B
where ü
üþ3 is some Q:j matrix function of i & Note that since ü is
a function of it is nonstochastic, too. If the estimator is unbiased, then we
must have üR
a±'ÿ :
ë¼WüuB
ë7òü F Ë
2 ü/V
ü F
F
Á
üR
±'ÿ
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 40
The variance of ý is
éb
ý
üÚü4©è F &
Define
ü ¡ 4 : 1 4
so
ü
2¬ 4 : 1 4
Since ü
a±'ÿ
so
To illustrate the Gauss-Markov result, consider the estimator that results from
splitting the sample into equally-sized parts, estimating using each part of
the data separately by OLS, then averaging the resulting estimators. You
should be able to show that this estimator is unbiased, but inefficient with
respect to the OLS estimator. The program Efficiency.m illustrates this using
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 41
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
0 0.5 1 1.5 2 2.5 3 3.5 4
a small Monte Carlo experiment, which compares the OLS estimator and a 3-
way split sample estimator. The data generating process follows the classical
model, with
#% . The true parameter value is
#%& In Figures 3.7.3 and
3.7.4 we can see that the OLS estimator is more efficient, since the tails of its
histogram are more narrow.
À 1
We have that
and éÞô
ð iO è but we still need to
Ù ó F
estimate the variance of A , è , in order to have an idea of the precision of the
F
estimates of . A commonly used estimator of è is
F
è F
÷
/ 4 /
This estimator is unbiased:
3.7. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 42
0.1
0.08
0.06
0.04
0.02
0
0 0.5 1 1.5 2 2.5 3 3.5 4
è F
/ 4 /
J
/J4 ´ /
ë¼ è F
¿yÀ / 4 ´ /V
Ù
¿yÀ"´ /]/ 4
Ù
¿yÀ « « ´ /J/ 4
Ù Ù â
è «¿;ÀV´
F Ù
è KQ5
F
è F
3.8. EXAMPLE: THE NERLOVE MODEL 43
3.8.1. Theoretical background. For a firm that takes input prices . and
the output level as given, the cost minimization problem is to choose the
quantities of inputs to solve the problem
y. 4
ã
subject to the restriction
T&
The solution is the vector of factor demands .Ü " . The cost function is ob-
tained by substituting the factor demands into the criterion function:
û .¯
"
.;4 .Ü " Y&
This is one of the reasons the Cobb-Douglas form is popular - the coefficients
are easy to interpret, since they are the elasticities of the dependent variable
3.8. EXAMPLE: THE NERLOVE MODEL 45
Tß .¯
V . ß
û
ß .¯
V
where 8
) w
. So we see that the transformed model is linear in the logs of
the data.
One can verify that the property of HOD1 implies that
* 1
In other words, the cost shares add up to 1.
The hypothesis that the technology exhibits CRTS implies that
so
"&
Likewise, monotonicity implies that the coefficients * !
"'&(&(&) .
3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on
145 electric utility companies’ cost of production, output and input prices. The
data are for the U.S., and were collected by M. Nerlove. The observations are
3.8. EXAMPLE: THE NERLOVE MODEL 46
that the data are sorted by output level (the third column).
We will estimate the Cobb-Douglas model
(3.8.1)
ûÚ
,12K æ 8
2 | ª K
2 "! #ª 8
2 $ ª ÿ 2KA
using OLS. To do this yourself, you need the data file mentioned above, as
well as Nerlove.m (the estimation program) , and the library of Octave func-
tions mentioned in the introduction to Octave that forms section 21 of this
document.3
The results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
*********************************************************
While we will use Octave programs as examples in this document, since fol-
lowing the programming statements is a useful way of learning how theory
is put into practice, you may be interested in a more ”user-friendly” environ-
ment for doing econometrics. I heartily recommend Gretl, the Gnu Regression,
Econometrics, and Time-Series Library. This is an easy to use program, avail-
able in English, French, and Spanish, and it comes with a lot of data ready to
use. It even has an option to save output as LATEX fragments, so that I can just
include the results into this document, no muss, no fuss. Here the results of
the Nerlove model from GRETL:
const &% & Ò (# ' Ò V&*)+) Ñ %,) ¯"& ï+- ) Ò & Ñ -+-
l_output .& )"# % ï Ñ & ) Ñ '+' Ñ Ñ V&# Ñ"ÑÒ & """
l_labor & Ñ %+'(% Ñ &©# ï Ñ - V& Ñ ïVï # &(%+'
l_fuel & Ñ (# ' Ò /) &) V %+' ï Ñ &# Ñ ï Ò & """
l_capita &©#% +ï -+-+- .& %(% ï Ñ # ï .& ' Ñ ) - &Ò - #
3.8. EXAMPLE: THE NERLOVE MODEL 48
Exercises
(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbi-
ased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL,
and provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS
estimation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether
or not you believe that the assumption of independent identically dis-
tributed normal errors is warranted. No need to do formal tests, just look
at the plots. Print out any that you think are relevant, and interpret them.
(5) For a random vector ç ²43 ã 6 5¼ C what is the distribution of w o287 ,
where w and 7 are conformable matrices of contants?
(6) Using Octave, write a little program that verifies that
¿yÀ w
;¿ À ¯w
for w and
4x4 matrices of random numbers. Note: there is an Octave
function trace.
(7) For the model with a constant and a single regressor, B"U
01Ä2 U2A-U ,
which satisfies the classical assumptions, prove that the variance of the
OLS estimator declines to zero as the sample size increases.
CHAPTER 4
This is the joint density of the sample. This density can be factored as
/;?=
W¤69;
: F
;
=
¤ 9;
@ F <= A9; F
B
¤Z69y
:7
W¤69;
: C :DCFE
likelihood function.
50
4.1. THE LIKELIHOOD FUNCTION 51
Note that if @
F and F share no elements, then the maximizer of the condi-
tional likelihood function
;
=
W¤ 9;@V with respect to @ is the same as the max-
imizer of the overall likelihood function
<;?=
W¤69; :7
;
=
W¤ 9;
@" <= 9;T ,
for the elements of : that correspond to @ . In this case, the variables 9 are said
to be exogenous for estimation of @ , and we may more conveniently work with
the conditional likelihood function
;
=
W¤ 9;@" for the purposes of estimating
@
F.
D EFINITION 4.1.1. The maximum likelihood estimator of @
V[
~ /; =
¤ 9y@"
F
B
¤Z@V
BT1 g1Y@" B B1$$
@" B | B1YB $ | @V 0>N>N> Bgf B1IH B '&'&'&-BgU fT$\f%@"
U(1
4.1. THE LIKELIHOOD FUNCTION 52
)
f
\f+4@" B
¤Z
@" B]U U?@V
(U 1
The maximum likelihood estimator may thus be defined equivalently as
a
V[
~
@ \f5A@V C
4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that
may be biased, so that the probability of a heads may not be 0.5. Maybe we’re
interested in estimating the probability of a heads. Let B
¹ º ôJTJ be a binary
variable that indicates whether or not a heads is observed. The outcome of a
toss is a Bernoulli random variable:
Bò F
1
"KF ?Èi F K BLC²S 'VX
<;
B Cj
 S 'VX
/;
Bò
K Ô ÈE 1 K
and
/;
Bò
B Ü
2aÔÈB5 P yiE
4.1. THE LIKELIHOOD FUNCTION 53
Ô ¼iE
BÜi
Ü?Èi
Averaging this over a sample of size gives
à
\à fE@
f B * i
* 1 ÜÔ ¼i
Setting to zero and solving gives
BÝ
à )
à B
<;
B¯i * * ÔÈi * +*
* ?Èi *
B * i +*
[ +*
4.2. CONSISTENCY OF MLE 54
Uniform convergence:
ÍR P QP ì )
\f5A@" f SUT "ë VW\f+4@"
T=A@%
@ F CÔê?@MC N &
We have suppressed ¤ here for simplicity. This requires that almost sure con-
vergence holds for all possible parameter values. For a given parameter value,
an ordinary Law of Large Numbers will usually imply almost sure conver-
gence to the limit of the expectation. Convergence for a single element of
the parameter space, combined with the assumption of a compact parameter
space, ensures uniform convergence.
4.2. CONSISTENCY OF MLE 55
Q6R P ì P
We will use these assumptions to show that @\f @ &
F
First, \f
@ certainly exists, since a continuous function has a maximum on a
compact set.
Second, for any @
í
F @
) B
4@"
ë Æ Æ B A @ ]ÉµÉ »
Æ ë Æ B A4@ @" ]ɵÉ
B
F F
by Jensen’s inequality ( Ô>© is a concave function).
Now, the expectation on the RHS is
F
B
4@ F F
since
B
A@ F is the density function of the observations, and since the integral of
any density is 1 & Therefore, since
ÔJ
)
ë Æ Æ B A 4@ @" ]ÉpÉ »
B
F
or
ëò\fÈ4@" [ IJëò\fy4@ F - » &
Taking limits, this is
/T A@%
@ F GK/T A @ F @ F »
@ [ @
F a.s.
Thus there is only one limit point, and it is equal to the true parameter value
with probability one. In other words,
@
fSUT @
F a& s &
This completes the proof of strong consistency of the MLE. One can use weaker
assumptions to prove weak consistency (convergence in probability to @
F ) of
the MLE. This is omitted here. Note that almost sure convergence implies
convergence in probability.
VY\f5A@V
gf5W¤@V
f
V BgU @V
U(1 ã
f ]UÔA@V C&
U(1
This is the score vector (with dim pN C& Note that the score function has ¤ as an
argument, which implies that it is a random function. ¤ (and any exogeneous
variables) will often be suppressed for clarity, but one should not forget that
they are still there.
The ML estimator @ sets the derivatives to zero:
f
]f5 @" ]UÔ @" &
(U 1
We will show that ëV\ gU?A @"I ]
þ VêY& This is the expectation taken with respect
to the density 4@" Y not necessarily A @ &
F
ë"V\^]UÔ4@" _]
V BgU U?@V _] B]U @V `JBgU
Y
\
V BgU U?@V _] BgU U?@V `JBgU
Y
BgU U? @V \
V BgU U? @V` JBgUP&
Y
4.4. ASYMPTOTIC NORMALITY OF MLE 58
ëaV#\^]U-A@V _]
V
Y
B]U U?
@" `JB]U
'V
4@ [ c @c@ F h
U0d A@ F C
where @ [
fe @y
2 ?; e
`@ F g½ e½ "& Assume A@ [ is invertible (we’ll justify
this in a minute). So h d h
c @pi@ F h
4@ [ $ 1 >04@ F
d
4.4. ASYMPTOTIC NORMALITY OF MLE 59
A@ [
V ,4@ [
O
V \f5A @ [
f
V U-A@ [
U(1
where the notation à
'f 4@" à \\f5à A@V &
@ 4
V
@
Given that this is an average of terms, it should usually be the case that this
satisfies a strong law of large numbers (SLLN). Regularity conditions are a set
of assumptions that guarantee that this will happen. There are different sets
of assumptions that can be used to justify appeal to different SLLN’s. For
example, the
) -U A@1[$ must not be too strongly dependent over time, and
V
their variances must not become infinite. We don’t assume any particular set
here, since the appropriate assumptions will depend upon the particularities
of a given model. However, we assume that a SLLN applies.
je @ þ d
Also, since we know that @ is consistent, and since @ [ 2 ? e
@F
`
we have that
R ì
QP P d
4@" is
@ [ @
F. Also, by the above differentiability assumtion,
continuous in @ . Given
d this, A@ [ converges to d the limit of it’s expectation:
T¾4@%@ F ½ /T=4@ F @ F
i.e., maximizes the limiting objective function. Since there is a unique max-
@
F d
imizer, and by the assumption that 'f+4@" is twice continuously differentiable
(which holds in the limit), then T=A@ must be negative definite, and there-
F
fore of full rank. Therefore the previous inversion is justified, asymptotically,
and we have h d h
c @c@ F h R ì =A@ F $ 1 >0A@ F C&
Q6P P
(4.4.1) h T
, F C& This is h
Now consider h > 4@
f
= f(U 1 =U
h
This is the case for >,4@ F for example. Then the properties of f depend on
the properties of the U?& For example, if the U have finite variances and are
h
not too strongly dependent, then a CLT for dependent processes will apply.
¡
Supposing that a CLT applies, andh noting that
Ù n gf5A@ F we get
o
=A@ F A1 p >gf A @ F cR m
T q\ ±'ÿ ]
where
o
=A@ F
"ë VAW ð Xh \^]f+4@ _],\^]f+4@ _] 4
T
fSUT F F ó
fSUT érVAW ð >]f A@ F ó
This can also be written as h
ç @° T¾4@ F 1 o T=4@ F T=A@ F 1 ³Ä&
c @ss@ F h u
Q
(4.4.3) c @c@ F h R
m þ $énT
h
There do exist, in special cases, estimators that are consistent such that
h 6
c @ss@ F h R & These are known as superconsistent estimators, since nor-
mally, is the highest factor that we can multiply by an still get convergence
to a stable limiting distribution.
D EFINITION 2 (Asymptotic unbiasedness). An estimator @ of a parameter
is asymptotically unbiased if
@
F
) ëaVJ @V
%@ &
(4.4.4) fStT
Estimators that are CAN are asymptotically unbiased, though not all consistent
estimators are asymptotically unbiased. Such cases are unusual, though. An
example is
E XERCISE 4.5. Consider an estimator @ with density
È f 1 @
@ F
@V
f1 H @
1
Show that this estimator is consistent but asymptotically biased. Also ask
yourself how you could define an estimator that would have this density.
4.6. THE INFORMATION MATRIX EQUALITY 63
d
4.6. The information matrix equality
Y
U-A@V `JB so
Y U-A@V `JB
V
Y
V U?A@" - -U 4@" `JB
since all cross products between different periods expect to zero. Finally take
limits, we get d
simplifies to h
c @pi@ F h R ì @° o T=4@ F $ 1 ³
QP P
(4.6.3) d
We can use x
f
o
d x T =A@ F
d ]UÔ @V _]UÔ @ P4
U(1
=A@ F
T "@ Y&
x
Note, one can’t use
± T=4@
n ] f @V r n g f5 @V r 4
F
to estimate the information matrix. Why not?
From this we see that there are alternative ways to estimate é?T=4@
F that are
all valid. These include
x d x
1
é x T=4@ F
n x T=A@ F
é x T=4@ F
od x
T=A @
1 x d x
n
F
énT=4@ F
T A@
1 o T=A@ T=4@ 1
F F F
4.7. THE CRAMÉR-RAO LOWER BOUND 65
These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since
it coincides with the covariance estimator of the quasi-ML estimator.
)
a
fSUT ë"VJ @pý i@"
Differentiate wrt @ 4H
) Y V n ¤Z
@" c @ý c@ JB
V
fO SUT ëaVJ @pý i@V fSUT O hr
this is a matrix of zeros C&
Noting that
V A@V V O ) A@V C we can write
O W¤@"
Y c @ý c@ 4 @" V 4@" ZJBp2 Y W¤@V V c @pý i@ JB
&
fStT h O fStT O h
) Y
c @ ý c@ h z \ V O {}| 4 @" _~ ] A @V `JB
±\ÿ
fSUT
4.7. THE CRAMÉR-RAO LOWER BOUND 66
Note that the bracketed part is just the transpose of the score vector, 0 V C
A@ so
h h
we can write
) n c ?0A@" P4 r
¡±'ÿ
fSUT ëaV @ ý c@ h
h
This means that the covariance of the score function with c @pý i
@h for @ ý
h
any CAN estimator, is an identity matrix. Using this, suppose the variance of
cs
@ ý s@ h tends to énT¾ h @ý C& Therefore,
énT='@"ý
(4.7.1) énT
±'ÿ &
0A @"
n
o
=A@V
T
This simplifies to
4 c érT¾'@Vý G o
T 1 A@" h &
Since
is arbitrary, énT='@Vý Z o
=A@V
T is positive semidefinite. This conludes the
proof.
This means that
o
T 1 A@V is a lower bound for the asymptotic variance of a
CAN estimator.
Exercises
(1) Consider coin tossing with a single possibly biased coin. The density func-
tion for the random variable B
¹ º ôJTJ is
<;
Bò F
"KF ?Èi F 1 K BLC²S 'VX
B Cj
 S 'VX
Suppose that we have a sample of size . We know from above that the ML
estimator is h
B Ý . We also knowd from the theoryd above that
F
YBÜÝ i F çR @° T=@ d F 1 o T= F T= F 1 ³
Q
The Cauchy density has a shape similar to a normal density, but with much
thicker tails. Thus, extremely small and large errors occur much more fre-
quently with this density than would happen if the errors were normally
distributed. Find the score function gf+4@" where @
lc 4 h 4 .
(3) Consider the model classical linear regression model B"U
Ç 4U i2¡A-U where
AÔUç ±± j è . Find the score function gf+4@" where @
c 4 è h 4 .
EXERCISES 69
(4) Compare the first order conditional that define the ML estimators of prob-
lems 2 and 3 and interpret the differences. Why are the first order condi-
tions that define an efficient estimator different in the two cases?
CHAPTER 5
The OLS estimator under the classical assumptions is unbiased and BLUE,
for all sample sizes. Now let’s see what happens when the sample size tends
to infinity.
5.1. Consistency
¬ 4 3 1 4 B
¬i43 $ 1 i4J^28/V
F a2 i4: $ 1 4/
1
2 Æ 4 4/
F É
« « « «
Consider the last two terms. By assumption ) fStT ð f O
æ « Á fSUT ð f O 1
ó ó
æ « 1 since the inverse of a nonsingular matrix is a continuous function of the
« î
elements of the matrix. Considering f O
4 /
f U¬/]U
U(1
Each U¬/JU has expectation zero, so
Æ 4 /
Ù É
e
70
5.2. ASYMPTOTIC NORMALITY 71
é U A-UW U 4U è &
As long as these are finite, and given a technical condition1, the Kolmogorov
SLLN applies, so
f ¬U /]U QR P ì P &
U(1
This implies that
QP P
Rì F&
This is the property of strong consistency: the estimator converges in almost
surely to the true value.
We’ve seen that the OLS estimator is normally distributed under the assump-
tion of normal errors. If the error distribution is unknown, we of course don’t
know the distribution of the estimator. However, we can get asymptotic re-
sults. Assuming the distribution of / is unknown, but the the other classical
assumptions hold:
1
For application of LLN’s and CLT’s, of which there are very many to choose from, I’m going
to avoid the technicalities. Basically, as long as terms of an average have finite variances and
are not too strongly dependent, one will be able to find a LLN or CLT to apply.
5.3. ASYMPTOTIC EFFICIENCY 72
1 i4/
F a2 i4: $
h
F
i4: $ 1 i4/ h
c £j F h
4/ Æ 4 1
É
« «
Now as before, ð f O 1 R æ « 1 &
« ó
Considering Of î the limit of the variance is
h
f
G
BgU 4U
U(1
Supposing that / is normally distributed, the model is
B
i
F 2/%
/ ç ² h è±
F "f C so
f }Y~% ,Æ / U
¬/"
G
U(1 # è g# è É
The joint density for B can be constructed using a change of variables. We have
/
BÜj so á á î O a
±f and
î
á á K O
" so
K h
f BgU U4
Ô &
B5
C
} %
~
G
I
Æ #gè É
U(1 # è
Taking logs, h
)
f BgU 4U
B è
y
# j è #gè &
U(1
It’s clear that the fonc for the MLE of are the same as the fonc for OLS (up
F
to multiplication by a constant), so the estimators are the same, under the present
assumptions. Therefore, their properties are the same. In particular, under the
classical assumptions with normality, the OLS estimator is asymptotically efficient.
As we’ll see later, it will be possible to use (iterated) linear estimation
methods and still achieve asymptotic efficiency even if the assumption that
éÜô À /V Ü
í è ± %f as long as / is still normally distributed. This is not the case if
5.3. ASYMPTOTIC EFFICIENCY 74
so
2 |
01,2K K
75
6.1. EXACT LINEAR RESTRICTIONS 76
B
2/
p
À
Let’s consider how to estimate subject to the restrictions p
À & The most
obvious approach is to set up the Lagrangean
BܲiG 4 B¯²
2Ë# e 4 p À C&
x
The Lagrange multipliers are scaled by 2, which makes things less messy. The
fonc are
¢ e
µ#] 4 B2M#] 4 r=2Ë#V 4 e
x
¢ e
M
r ^ À
We get
1
r
4 4 4B
À &
e
6.1. EXACT LINEAR RESTRICTIONS 77
4 : 1 4 4 w&
µM 4 : 1 ±
±'ÿ 4 : 1 4
µM 4 : 1 4
±'ÿ 4 : 1 4
ª
û
and
±'ÿ ¬ 4 :
1 4ª 1 ±'ÿ 1 4
¬ 4 : û
ª 1 ª
±\ÿ
so
w
±'ÿ
w
1
±\ÿ 4 : 1 4 ª 1 ¬ 4 : 1
1
ª 1 ;ˬ 4 : 1 ±}
£¡ 4 : 1 4 ª 1 c À h
ª 1 c À
h
1 4 ª 1 s
±'ÿ ¡ 4 : 4 :
1 4ª 1 À
ª 1 2
ª 1À
The fact that r and e
are linear functions of makes it easy to determine their
distributions, since the distribution of is already known. Recall that for a
random vector, and for w and 7 a matrix and vector of constants, respectively,
éÜô À w; 27C
¡w é¯ô À w 4 &
Though this is the obvious way to go about finding the restricted estima-
tor, an easier way, if the number of restrictions is small, is to impose them by
substitution. Write
B
1Ô01,28 28/
n Þ1 r 01
À
B
1-¯1 1 À ²1-¯1 1 8
2 28/
BÜj1Ô¯1 1 À
° j1Ô¯1 1 ³0
2/
or with the appropriate definitions,
B(
8
2 /T&
This model satisfies the classical assumptions, supposing the restriction is true.
One can estimate by OLS. The variance of is as before
é
¬i 4 , 1 è F
1 µ4 ª 1 c À h
r
÷¡i4: $
1 µ4 ª 1 À ¡¬i43 $ 1 µ 4 ª 1 =
b2¬i4: ¬i43 $ 1 i4B
b2¬i4: 1 i4©/72¬i43 $ 1 µ4 ª 1 \ À 8p ]+4: 1 µ4 ª 1 =
4©: $ 1 i4/
r^
¬i4: 1 i4/
2 ¬i4: 1 µ 4 ª 1 \À 8 p]
¬i4: 1 µ4 ª 1 =¬i4: 1 i4/
´
¢Û Ù r,
¼
ë rb
\ rb
?4
Noting that the crosses between the second term and the other terms expect to
zero, and that the cross of the first and third has a cancellation with the square
of the third, we obtain
´
¢Û Ù r
4 : 1 è
2 i4: $ 1 µ4 ª 1 \ À p]u\ À p] 4 ª 1 =
¬i43 $ 1
i4: $ 1 µ4 ª 1 =i4: $ 1 è
So, the first term is the OLS covariance. The second term is PSD, and the third
term is NSD.
If the restriction is true, the second term is 0, so we are better off. True
restrictions improve efficiency of estimation.
If the restriction is false, we may be better or worse off, in terms of
MSE, depending on the magnitudes of
À p and è &
6.2. TESTING 81
6.2. Testing
In many cases, one wishes to test economic theories. If theory suggests pa-
rameter restrictions, as in the above homogeneity example, one can test theory
by testing parameter restrictions. A number of tests are available.
B b28d / d
d
À Hp
í À
and one wishes to test the single restriction
F Hp vs. . Under
so À À
£
çRþ 'J &
1
1
=¬ 4 : 4 è F è F = 4 3 4
The problem is that è is unknown. One could use the consistent estimator è
F F
in place of è, but the test would only be valid asymptotically in this case.
F
P ROPOSITION 4.
j ' J
(6.2.1) çav4V
z à PÅ
(6.2.2) 4
ç
e
6.2. TESTING 82
é ªÞª 4
±f
ª 4 é ªÞª 4
ª 4
so
ª é ª 4
±f and thus B¾çuj ± "f . Thus B 4 B=ç , but
(6.2.3) 4 ¯
ç -
An immediate consequence is
(6.2.4) ç À Y &
4 ¯
/ 4 /
/ 4:
´ « /
è F è F
4
Æ è / É ´:« Æ è / É
F F
ç
Now consider (remember that we have only one restriction in this case)
h
« x « Å
À
W
à îO î O
è F = 4 : 1 4
fà ÿ O Å Wz
This will have the vK distribution if and / 4 / are independent. But
2¬ 4 3 1 4 / and
so À À
£
è çav
1
è F = 4 : 4 x
6.2. TESTING 84
d d
In particular, for the commonly encountered test of significance of an individual
*0
a *
coefficient, for which
F H vs.
F H í , the test statistic is
* çav÷
è *
x
Note: the Y test is strictly valid only if the errors are actually normally
distributed. If one has nonnormal errors, one could use the above as-
ymptotic result to justify taking critical values from the j 'J distri-
bution, since v R
m ² \J as
R k
& In practice, a conservative
d
procedure is to take critical values from the distribution if nonnor-
mality is suspected. This will reject less often since the distribu-
F
tion is fatter-tailed than is the normal.
2 2
6.2.2. test. The test allows testing multiple restrictions jointly.
c £ À h 4 ð M 4 :
1 4 1 c À
h
2
è ó ç 2
4T Y&
A numerically equivalent expression is
6.2. TESTING 85
ÙÜÛZÛ ^ Ü Â
Û Û ç
ÙÂ 2
4T Y&
2
Ù ÛÛ
Note: The test is strictly valid only if the errors are truly normally
distributed. The following tests will be appropriate when one cannot
assume normally distributed errors.
6.2.3. Wald-type tests. The Wald principle is based on the idea that if a
restriction is true, the unrestricted model should “approximately” satisfy the
restriction. Given that the least squares estimator is asymptotically normally
h
distributed:
d
c÷
F h R
m ð è F æ « 1 ó
then under HF p F
h À we have
c £ À h R
m ñð è F æ « 1 µ 4 ó
so by Proposition [6]
c À h 4 ð è F æ « 1 µ 4 ó 1 c £ À h R
m 4V
Note that æ
« 1 or è
F are not observable. The test statistic we use substitutes the
consistent estimators. Use ¬ 4 Â , 1 as the consistent estimator of æ « 1 & With
this, there is a cancellation of 4 " and the statistic to use is
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases,
an unrestricted model may be nonlinear in the parameters, but the model is
linear in the parameters under the null hypothesis. For example, the model
B
¬
I¡728/ d
"&
is nonlinear in and but is linear in under
F Ht Estimation of
nonlinear models is a bit more complicated, so one might prefer to have a
test based upon the restricted, linear model. The score test is useful in this
situation.
Score-type tests are based upon the general principle that the gradient
vector of the unrestricted model, evaluated at the restricted estimate,
should be asymptotically normally distributed with mean zero, if the
restrictions are true. The original development was for ML estimation,
but the principle is valid for a wide variety of estimation methods.
1 µ4 1 c À h
e
ð =i4: $ ó
ª 1 c £ À
h
h
Given that
c £ À h R
m ð è F æ « 1 4 ó
under the null hypothesis,
h
e R
m ð è F ª 1 æ « 1 4 ª 1 ó
h
or
e R
m ð è F ) ª 1 æ « 1 µ4 ª 1 ó
6.2. TESTING 87
since the ’s cancel and inserting the limit of a matrix of constants changes
nothing.
However,
e R
m ñð è F ª 1 ó
In this case,
e
1 4 e
Æ4 = 4 è: R
4V
F É m
since the powers of cancel. To get a usable test statistic substitute a consistent
estimator of è
F &
c 4 e h 4 4 : 1 4 e
è F
R
m 4V
Èi4@B2i4 r=2Mµ4 e
6.2. TESTING 88
to get that
µ4 e
i4B ² r0
i4 /1
/ 4 K¬ 4 : 1 4 /( R
4V
è F m
To see why the test is also known as a score test, note that the fonc for restricted
least squares
Èi4@B2i4 r=2Mµ4 e
give us
µ4 e^
i4BÜji4 r
and the rhs is simply the gradient (score) of the unrestricted model, evaluated
at the restricted estimator. The scores evaluated at the unrestricted estimate are
identically zero. The logic behind the score test is that the scores evaluated at
the restricted estimate should be approximately zero, if the restriction is true.
The test is also known as a Rao test, since P. Rao first proposed it in 1948.
6.2. TESTING 89
6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using
the unrestricted model. The score test can be calculated using only the re-
stricted model. The likelihood ratio test, on the other hand, uses both the re-
stricted and the unrestricted estimators. The test statistic is
B
# c B @" G B @"ý h
where @ is the unrestricted estimate and @ ý is the restricted estimate. To show
that it is asymptotically N
take a second order Taylor’s series expansion of
B '@"ý about @¾ H d
d d B
£¢ y c @pý @ h 4 @V c @sý @ h
o
As @g T¾4@
F 4@ F Y by the information matrix equality. So
R k R
c ý @ h 4 o T¾4 @ F c @ý @ h
Q
@p
B
c @pi@ F h
o T=A@ F 1 A1 p 0A@ F C&
Q
Combining
h the last two equations
Q
c p@ ý @ h
; A1 p o T=A@ F 1 µ4 ð o T=A@ F 1 µ 4 ó 1 o T=4@ F $ 1 0A@ F
But since
A1 p 0A@ F R
m þ o T¾A@ F -
the linear function
o T=A@ F 1 A1 p 0A@ F R
m j $ o T=4@ F 1 4 C &
We can see that LR is a quadratic form of this rv, with the inverse of its variance
in the middle, so
B
R
m b" Y&
6.3. The asymptotic equivalence of the LR, Wald and score tests
We have seen that the three tests all converge to random variables. In
fact, they all converge to the same rv, under the null hypothesis. We’ll show
that the Wald and LR tests are asymptotically equivalent. We have seen that
the Wald test is asymptotically equivalent to
ü
c À h 4 ð è F æ « 1 µ 4 ó 1 c £ À h
Q
R
m b"
6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 91
Using
£ F
4 : 1 4 /
and
£ À
= £ F
we get h h
,= £ F
0=i4: $ 1 i4/
4 1 1Ap
Æ É 4/
Using this,
o
A@ F
T =4@ F
) x O 0 F
) x O 4 BÜjè i F
) 4
è
æ «
è
so
o
A@ F 1
è æ « 1
Substituting these last expressions into [??], we get
B
Q / 4 4 4 : 1 4 ðòè F = 4 : 1 4 ó 1 =
4 : 1 4 /
ª
Q / 4 E/
è F
ü
Q
This completes the proof that the Wald and LR tests are asymptotically equiv-
alent. Similarly, one can show that, under the null hypothesis,
2
Q ü
Q BZ´
Q B
6.3. THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS 93
Though the four statistics are asymptotically equivalent, they are numerically
different in small samples. The numeric values of the tests also depend upon
d
how è is estimated, and we’ve already seen than there are several ways to do
this. For example all of the following are consistent for è under
F
îî
f O D
î O î
f
î î
f O¤ D ¤
î O¤ î ¤
f
and in general the denominator call be replaced with any quantity ô such that
) ô Â
"&
6.5. CONFIDENCE INTERVALS 94
Now that we have a menu of test statistics, we need to know how to use
them.
x
6.6. BOOTSTRAPPING 95
" ?È
d
a a¦ confidence interval for F is defined by the bounds of the set of
¥
© è ¸§ p
x
ª
A confidence ellipse for two coefficients jointly would be, analogously, the
set of {,1C[
X
2
such that the (or some other test statistic) doesn’t reject at the
specified critical value. This generates an ellipse, if the estimators are corre-
lated.
The region is an ellipse, since the CI for an individual coefficient de-
fines a (infinitely long) rectangle with total prob. mass
since the
other coefficient is marginalized (e.g., can take on any value). Since the
ellipse is bounded in both dimensions but also contains mass ;
it
must extend beyond the bounds of the individual CI.
From the pictue we can see that:
– Rejection of hypotheses individually does not imply that the joint
test will reject.
– Joint rejection does not imply individal tests will reject.
6.6. Bootstrapping
sample distribution. Also, the distributions of test statistics may not resemble
their limiting distributions at all. A means of trying to gain information on the
small sample distribution of test statistics and estimators is the bootstrap. We’ll
consider a simple example, just to get the main idea.
Suppose that
B
F 2 /
/ ç ±T± è
F
is nonstochastic
Given that the distribution of / is unknown, the distribution of will be un-
known in small samples. However, since we have random sampling, we could
generate artificial data. The steps are:
ß
(1) Draw observations from / with replacement. Call this vector / ý (it’s
a 3jJ C&
ß
ß
(2) Then generate the data by B ý b2 / ý
(3) Now take this and estimate
ß ß
ý
4: 1 i4 B ý &
ß
(4) Save ý
ß
(5) Repeat steps 1-4, until we have a large number, « of ý &
With this, we can use the replications to calculate the empirical distribution of ý ß &
One way to form a 100(1- ¦ confidence interval for
F would be to order the
ß GÂ #
ý from smallest to largest, and drop the first and last «
of the replications,
and use the remaining endpoints as the limits of the CI. Note that this will not
give the shortest CI if the empirical distribution is skewed.
6.7. TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD 98
Suppose one was interested in the distribution of some function of Z
for example a test statistic. Simple: just calculate the transformation
for each " and work with the empirical distribution of the transforma-
tion.
If the assumption of iid errors is too strong (for example if there is
heteroscedasticity or autocorrelation, see below) one can work with a
bootstrap defined by sampling from B with replacement.
How to choose « : « should be large enough that the results don’t
change with repetition of the entire bootstrap. This is easy to check.
If you find the results change a lot, increase « and try again.
The bootstrap is based fundamentally on the idea that the empiri-
cal distribution of the sample data converges to the actual sampling
distribution as becomes large, so statistics based on sampling from
the empirical distribution should converge in distribution to statistics
based on sampling from the actual sampling distribution.
In finite samples, this doesn’t hold. At a minimum, the bootstrap is a
good way to check if asymptotic theory results offer a decent approxi-
mation to the small sample distribution.
À
¡ &
F
where
À ?>@ is a -vector valued function. Write the derivative of the restriction
evaluated at as
À
=G
x O x
We suppose that the restrictions are not redundant in a neighborhood of , so
F
that
W=
-
À 4 c = 1
\¬ 4 3 1 = G 4 h À
R
m b"
è
under the null hypothesis.
Note that this also gives a convenient way to estimate nonlinear functions and
associated asymptotic confidence intervals. If the nonlinear function
À is
F
not hypothesized
h to be zero, we just have
c À G À F h R
m ð $= F æ « 1 = F P 4©è NF ó
à
à ®
4 ®
(note that this is the entire vector of elasticities). The estimated elasticities are
4 ®
Now demand must be positive, and we assume that expenditures sum to in-
come, so we have the restrictions
» * @ ÷ » "aê!
¯
* 1 * ÷
* * * *
* @÷
1 32 4© 6 K2 9 2 /
It is fairly easy to write restrictions such that the shares sum to one, but the
restriction that the shares lie in the \
\}] interval depends on both parameters
and the values of and & It is impossible to impose the restriction that »
* @÷ » for all possible and & In such cases, one might consider whether
or not a linear model is a reasonable specification.
Remember that we in a previous example (section 3.8.3) that the OLS re-
sults for the Nerlove model are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
*********************************************************
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686
*******************************************************
6.8. EXAMPLE: THE NERLOVE DATA 104
Value p-value
F 0.574 0.450
Wald 0.594 0.441
LR 0.593 0.441
Score 0.592 0.442
*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861
*******************************************************
Value p-value
F 256.262 0.000
Wald 265.414 0.000
LR 150.863 0.000
6.8. EXAMPLE: THE NERLOVE DATA 105
Notice that the input price coefficients in fact sum to 1 when HOD1 is im-
posed. HOD1 is not rejected at usual significance levels (e.g., K
&(
). Also,
does not drop much when the restriction is imposed, compared to the un-
restricted results. For CRTS, you should note that j
, so the restriction is
satisfied. Also note that the hypothesis that
is rejected by the test sta-
tistics at all reasonable significance levels. Note that drops quite a bit when
imposing CRTS. If you look at the unrestricted estimation results, you can see
that a t-test for 8
also rejects, and that a confidence interval for
does
not overlap 1.
From the point of view of neoclassical economic theory, these results are
not anomalous: HOD1 is an implication of the theory, but CRTS is not.
The Chow test. Since CRTS is rejected, let’s examine the possibilities more
carefully. Recall that the data is sorted by output (the third column). Define
5 subsamples of firms, with the first group being the 29 firms with the lowest
output levels, then the next 29 firms, etc. The five subsamples can be indexed
by
"$#%'&(&)&( Ò where
for
"Y#%\&)&(&©# ï ,
# for
%
% "\&)&(& Ò -
, etc.
Define a piecewise linear model
where is a superscript (not a power) that inicates that the coefficients may be
different according to the subsample in which the observation falls. That is,
6.8. EXAMPLE: THE NERLOVE DATA 106
the coefficients depend upon which in turn depends upon Y& Note that the
first column of nerlove.data indicates this way of breaking up the sample. The
new model may be written as
Ò ß Ò
where B1 is 29 yVE1 is 29 E is the ¡ vector of coefficient for the
U¶
ß U¶
subsample, and A is the # ï j vector of errors for the subsample.
The Octave program Restrictions/ChowTest.m estimates the above model.
It also tests the hypothesis that the five subsamples share the same parameter
vector, or in other words, that there is coefficient stability across the five sub-
samples. The null to test is that the parameter vectors for the separate groups
are all the same, that is,
|
1
!
$
This type of test, that parameters are constant across different sets of data, is
sometimes referred to as a Chow test.
There are 20 restrictions. If that’s not clear to you, look at the Octave
program.
The restrictions are rejected at all conventional significance levels.
Since the restrictions are rejected, we should probably use the unrestricted
model for analysis. What is the pattern of RTS as a function of the output
6.8. EXAMPLE: THE NERLOVE DATA 107
2.4
2.2
1.8
1.6
1.4
1.2
0.8
1 1.5 2 2.5 3 3.5 4 4.5 5
Output group
group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increas-
ing RTS for small firms, but that RTS is approximately constant for large firms.
6.8. EXAMPLE: THE NERLOVE DATA 108
(1) Using the Chow test on the Nerlove model, we reject that there is coef-
ficient stability across the 5 groups. But perhaps we could restrict the
input price coefficients to be the same but let the constant and output
coefficients vary by group size. This new model is
B µ#;2 ¡2 | 2MA
(b) Compare the means and standard errors of the estimated coeffi-
cients using OLS and restricted OLS, imposing the restriction that
2 |
V&
8
(c) Discuss the results.
(4) Get the Octave scripts bootstrap_example1.m , bootstrap.m , bootstrap_resample_iid.m
and myols.m figure out what they do, run them, and interpret the re-
sults.
CHAPTER 7
/JUç ±± è Y
or occasionally
/JU,ç T± ± ²
è C&
Now we’ll investigate the consequences of nonidentically and/or dependently
distributed errors. We’ll assume fixed regressors for now, relaxing this admit-
tedly unrealistic assumption later. The model is
B
i28/
ë7/V
é^/V
5
4 : 1 4 B
2ai4: $ 1 4/
We have unbiasedness, as before.
The variance of is
ë n £j
\ £
?4 r
ë ° ¬i4: 1 i4/J/J4ÌKi4: $ 1 ³
(7.1.1)
4: 1 i4±58i4: $ 1
Due to this, any test statistic that is based upon è or the probability
limit è of is invalid. In particular, the formulas for the Y 2
based
tests given above do not lead to statistics with these distributions.
is still consistent, following exactly the same argument given before.
If / is normally distributed, then
çuñðòJ4: 1 i4±58i4: $ 1 ó
Without normality,
h and unconditional
h on we still have
c
j h
Äi4©3 $ 1 i4/
1
Æ 4 1Ap 4 /
É
1Ap
Define the limiting variance of 4 / (supposing a CLT applies) as
Suppose 5 were known. Then one could form the Cholesky decomposition
ª 4 ª 5y 1
We have
ª 4ª 5
a± f
7.2. THE GLS ESTIMATOR 113
so
ª 4ª 5 ª 4
ª 4
B [
[b
28/ [ &
ë¼ ª /J/ 4 ª 4
ª 5 ª 4
±f
B [
[
28/ [
ë7/ [
é^/ [
±f
satisfies the classical assumptions. The GLS estimator is simply OLS applied
to the transformed model:
¯
+³
[ 4 [ $ 1 [ 4©B [
4 ª¯ª 4 3 1 4 ª¯ª 4 B
4 5 1 : 1 4 5 1 B
7.2. THE GLS ESTIMATOR 114
The GLS estimator is unbiased in the same circumstances under which the
OLS estimator is unbiased. For example, assuming is nonstochastic
¯ 1 : $ 1 1B ú
ë7 (³
ë ù i45y 4±5y
ë ù i4 5y 1 : $ 1 4±5y 1 i^2/ ú
&
¯
+³
[ ©4 [ $ 1 [ ©4 B [
[ 4© [ $ 1 [ g4 [ ^
28/ [
2¬ [ 4 [ $ 1 [ 4/ [
so
ë õ c¯ (³
j h c ¯ +³
h 4 ø
ë ù [ ©4 [ $ 1 [ ©4 / [ / [ 4 [ ¬ [ 4 [ $ 1 ú
[ 4 [ $ 1 [ 4 [ ¬ [ 4 [ $ 1
[ 4 [ $ 1
i45y 1 : $ 1
The GLS estimator is more efficient than the OLS estimator. This is a
consequence of the Gauss-Markov theorem, since the GLS estimator is
based on a model that satisfies the classical assumptions but the OLS
estimator is not. To see this directly, not that (the following needs to
be completed)
é¯ô À
G8éÜô À ¯ +³
¬i43 $ 1 i45K4©: $ 1 ¡¬i4±5y 1 3 $ 1
w 5 w O
¯
V-Z B¯j
?4*5y 1 BÜjiG
+³
The problem is that 5 isn’t known usually, so this estimator isn’t available.
Consider the dimension of 5 : it’s an ;¼ matrix with ¬ Â # 2Ü
2K, Â # unique elements.
7.3. FEASIBLE GLS 116
5
5 @"
5
5 i @" R 6 5¬i@V
If we replace 5 in the formulas for the GLS estimator with p
5 we obtain the
FGLS estimator. The FGLS estimator shares the same asymptotic properties
as GLS. These are
(1) Consistency
(2) Asymptotic normality
(3) Asymptotic efficiency if the errors are normally distributed. (Cramer-
Rao).
(4) Test procedures are asymptotically valid.
(2) Form 5
5 @"
(3) Calculate the Cholesky factorization
ª
û ¹ ´1µ 5 1 .
(4) Transform the model using
ª 4 B ª 4 i2 ª 4 /
7.4. Heteroscedasticity
ë7/J/J4© 5
is a diagonal matrix, so that the errors are uncorrelated, but have different
variances. Heteroscedasticity is usually thought of as associated with cross
sectional data, though there is absolutely no reason why time series data can-
not also be heteroscedastic. Actually, the popular ARCH (autoregressive con-
ditionally heteroscedastic) models explicitly assume that a time series is het-
eroscedastic.
Consider a supply function
*0
012K6 ª * K
2 Eì Û * 28/ *
where
ª * is price and * is some measure of size of the ! U¶ firm. One might
Û
suppose that unobservable factors (e.g., talent of managers, degree of coordi-
nation between production units, etc.) account for the error term / *& If there
is more variability in these factors for large firms than for small firms, then / *
may have a higher variance when * is high than when it is low.
Û
7.4. HETEROSCEDASTICITY 118
*,
01I286 ª * K
2 59 ´ *
2 /*
where
ª is price and
´ is income. In this case, /* can reflect variations in
preferences. There are more possibilities for expression of preferences when
one is rich, so it is possible that the variance of /* could be higher when
´ is
high.
Add example of group means.
ë Æ 4 /J/ 4
²
fStT K É
This matrix has dimension and can be consistently estimated, even if we
can’t estimate 5 consistently. The consistent estimator, under heteroscedastic-
ity but no autocorrelation is
f
Ë
²
U4 U / U
U(1
One can then modify the previous test statistics to obtain tests that are valid
when there is heteroscedasticity of unknown form. For example, the Wald test
d
7.4. HETEROSCEDASTICITY 119
7.4.2. Detection. There exist many tests for the presence of heteroscedas-
ticity. We’ll discuss three methods.
Goldfeld-Quandt. The sample is divided in to three parts, with G1C and
| observations, where 1 2 |
2i i . The model is estimated using the first
1 |
and third parts of the sample, separately, so that and will be independent.
Then we have
/ 1 4 / 1
/ 1 O ´ 1 / 1 R
1
è è m
and
| | | | |
/ 4 /
/ O ´ / R
|
è è m
so
/ |1 4 / |1 Â I1G 2
1
| Y&
/ 4 / Â |
R
m
The distributional result is exact if the errors are normally distributed. This test
is a two-tailed test. Alternatively, and probably more conventionally, if one has
prior ideas about the possible magnitudes of the variances of the observations,
one could order the observations accordingly, from largest to smallest. In this
case, one would use a conventional one-tailed F-test. Draw picture.
Ordering the observations is an important step if the test is to have
any power.
The motive for dropping the middle observations is to increase the
difference between the average variance in the subsamples, suppos-
ing that there exists heteroscedasticity. This can increase the power of
7.4. HETEROSCEDASTICITY 120
the test. On the other hand, dropping too many observations will sub-
stantially increase the variance of the statistics /
1 4 / 1 and /
| 4 / | & A rule of
thumb, based on Monte Carlo experiments is to drop around 25% of
the observations.
If one doesn’t have any ideas about the form of the het. the test will
probably have low power since a sensible data ordering isn’t available.
White’s test. When one has little idea if there exists heteroscedasticity, and
no idea of its potential form, the White test is a possibility. The idea is that if
there is homoscedasticity, then
ë7/ U WU è Ôê
so that U or functions of U shouldn’t help to explain 뼬/ U C& The test works as
follows:
(1) Since /]U isn’t available, use the consistent estimator /gU instead.
(2) Regress
/ U
è 2M]U4 =2¹¸JU
where \U is a
ª -vector. \U may include some or all of the variables in
U? as well as other variables. White’s original suggestion was to use
U , plus the set of all unique squares and cross products of variables in
U?&
(3) Test the hypothesis that
a & The statistic in this case is
2
ª ^ ª
2
Ù ÛÛ Â ÙÜÛª Û J
Ù ÛÛ
7.4. HETEROSCEDASTICITY 121
¿
Note that
Ù ÛÛ
ÛÛ so dividing both numerator and denomina-
tor by this we get
2
ª J
¼ 8
Note that this is the p or the artificial regression used to test for het-
eroscedasticity, not the of the original model.
This doesn’t require normality of the errors, though it does assume that the
fourth moment of /]U is constant, under the null. Question: why is this neces-
sary?
The White test has the disadvantage that it may not be very power-
ful unless the \U vector is chosen well, and this is hard to do without
knowledge of the form of heteroscedasticity.
It also has the problem that specification errors other than heteroscedas-
ticity may lead to rejection.
Note: the null hypothesis of this test may be interpreted as @
for
¹ ¹
the variance model é^/ U
2² U4 @V C where ?>@ is an arbitrary func-
tion of unknown form. The test is more general than is may appear
from the regression that is used.
Plotting the residuals. A very simple method is to simply plot the residuals
(or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will
7.4. HETEROSCEDASTICITY 122
BgU
4U 2/]U
è U
뼬/ U
W U4 0 _º
/ U W gU4 0 º 2i¸]U
and ¸]U has mean zero. Nonlinear least squares could be used to estimate and
»
consistently, were /]U observable. The solution is to substitute the squared
OLS residuals / U in place of / U since it is consistent by the Slutsky theorem.
»
Once we have and we can estimate è U consistently using
è
g4 0 º R 6 è &
U U U
In the second step, we transform the model by dividing by the standard devi-
ation:
gB U
4U 2 /J U
è5U 5è U 5è U
7.4. HETEROSCEDASTICITY 123
or
B U [
U[ 4 2/ U[ &
Asymptotically, this model satisfies the classical assumptions.
B]U
4U ^2/]U
è U
뼬/ U
è Uº
where 'U is a single variable. There are still two parameters to be esti-
mated, and the model of the variance is still nonlinear in the parame-
ters. However, the search method can be used in this case to reduce the
estimation problem to repeated applications of OLS.
First, we define an interval of reasonable values for
%1]ò& »
e.g.,
»
Cc\
Partition this interval into ´ equally spaced values, e.g., S \&)"\&©#%\&)&(&)$#%& ï %%X&
For each of these values, calculate the variable U ºb¼ &
The regression
/ U
è U ºb¼ 2i¸]U
is linear in the parameters, conditional on
»
9µ so one can estimate è
by OLS.
Save the pairs (è9
» 97 C and the corresponding ÙÜÛZÛ 9µ& Choose the pair
with the minimum
Ù ÛÛ 9 as the estimate.
Next, divide the model by the estimated standard deviations.
Can refine. Draw picture.
Works well when the parameter to be searched over is low dimen-
sional, as in this case.
7.4. HETEROSCEDASTICITY 124
Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a num-
ber of economic agents: e.g., 10 years of macroeconomic data on each of a set
of countries or regions, or daily observations of transactions of 200 banks. This
sort of data is a pooled cross-section time-series model. It may be reasonable to pre-
sume that the variance is constant over time within the cross-sectional units,
but that it differs across them (e.g., firms or countries of different sizes...). The
model is
B *U
*4 U ^28/ * U
뼬/ * U
è * ?ê
where !
"Y#%'&(&(&)$ are the agents, and
"$#%'&(&)&( are the observations on
each agent.
To correct for heteroscedasticity, just estimate each èI* using the natural estima-
tor:
è *
f / *
U(1 U
Note that we use  here since it’s possible that there are more than
regressors, so could be negative. Asymptotically the difference
is unimportant.
7.4. HETEROSCEDASTICITY 125
0.5
-0.5
-1
-1.5
0 20 40 60 80 100 120 140 160
B * U
*4 U 2 / * U
è* è* è*
Do this for each cross-sectional group. This transformed model satis-
fies the classical assumptions, asymptotically.
7.4.4. Example: the Nerlove model (again!) Let’s check the Nerlove data
for evidence of heteroscedasticity. In what follows, we’re going to use the
model with the constant and output coefficient varying across 5 groups, but
with the input price coefficients fixed (see Equation 6.8.3 for the rationale be-
hind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m
plots the residuals. We can see pretty clearly that the error variance is larger
for small firms than for larger firms.
7.4. HETEROSCEDASTICITY 126
Now let’s try out some tests to formally check for heteroscedasticity. The
Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt
tests, using the above model. The results are
Value p-value
White’s test 61.903 0.000
Value p-value
GQ test 10.886 0.000
All in all, it is very clear that the data are heteroscedastic. That means that OLS
estimation is not efficient, and tests of restrictions that ignore heteroscedastic-
ity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were cal-
culated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m
uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-
consistent covariance estimator.1 The results are
Testing HOD1
Value p-value
Wald test 6.161 0.013
Testing CRTS
Value p-value
Wald test 20.169 0.001
1
By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the re-
stricted LS estimator directly to restrict the fully general model with all coefficients
varying to the model with only the constant and the output coefficient varying. But
GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the
model. The methods are equivalent, but the second is more convenient and easier to under-
stand.
7.4. HETEROSCEDASTICITY 127
We see that the previous conclusions are altered - both CRTS is and HOD1 are
rejected at the 5% level. Maybe the rejection of HOD1 is due to to Wald test’s
tendency to over-reject?
From the previous plot, it seems that the variance of A is a decreasing func-
tion of output. Suppose that the 5 size groups have different error variances
(heteroscedasticity by groups):
é¯ô À A * è ß
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800
*********************************************************
*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393
*********************************************************
Testing HOD1
Value p-value
Wald test 9.312 0.002
The first panel of output are the OLS estimation results, which are used to
consistently estimate the è ß . The second panel of results are the GLS estimation
results. Some comments:
The measures are not comparable - the dependent variables are
not the same. The measure for the GLS results uses the transformed
dependent variable. One could calculate a comparable s measure,
but I have not done so.
The differences in estimated standard errors (smaller in general for
GLS) can be interpreted as evidence of improved efficiency of GLS,
since the OLS standard errors are calculated using the Huber-White
estimator. They would not be comparable if the ordinary (inconsis-
tent) estimator had been used.
7.5. AUTOCORRELATION 130
Note that the previously noted pattern in the output coefficients per-
sists. The nonconstant CRTS result is robust.
The coefficient on capital is now negative and significant at the 3%
level. That seems to indicate some kind of problem with the model or
the data, or economic theory.
Note that HOD1 is now rejected. Problem of Wald test over-rejecting?
Specification error in model?
7.5. Autocorrelation
BgU 4U 2/JUP
7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is
the same as in the case of heteroscedasticity - the standard formula does not
apply. The correct formula is given in equation 7.1.1. Next we discuss two
GLS corrections for OLS. These will potentially induce inconsistency when the
regressors are nonstochastic (see Chapter8) and should either not be used in
that case (which is usually the relevant case) or used with caution. The more
recommended procedure is discussed in section 7.5.5.
7.5.3. AR(1). There are many types of autocorrelation. We’ll consider two
examples. The first is the most commonly encountered case: autoregressive
7.5. AUTOCORRELATION 132
B]U
4U ^28/JU
/JU
"/JU 1 2¹½+U
+Ukç
½ !W!¾J [è Í
ë7/JU½ìÔ
- ½
T 9
ë7/ U è Í
9G F
è Í
Èj
If we had directly assumed that /gU were covariance stationary, we could
obtain this using
so
è Í
é^¬/JU
È
The variance is the
U¶ order autocovariance:
é^/JU
F
Note that the variance does not depend on
7.5. AUTOCORRELATION 134
but in this case, the two standard errors are the same, so the -order autocor-
relation Tì is
ì
ì
All this means that the overall matrix 5 has the form
è Í
.. ..
.
..
. .
z Èj{}| ~
5
..
this is the variance
.
f 1 >N>N>
z {}|
~
It turns out that it’s easy to estimate these consistently. The steps are
J/ U
/J U 12¹½ U[
6
Since ]/ U R
/JU? this regression is asymptotically equivalent to the re-
gression
/JU
" /JU 12¹½+U
which satisfies the classical assumptions. Therefore, obtained by ap-
6
plying OLS to ]/ U
/J U 12g½nU[ is consistent. Also, since ½>U[ R ½+U , the
estimator
è Í
f ½ [ R 6 è Í
U( U
(3) With the consistent estimators è Í and E form 5
5 è Í T using the
previous structure of 5µ and estimate by FGLS. Actually, one can omit
the factor è Í Â ?¼ C since it cancels out in the formula
1
¯ (³
c 4 5y 1 h 4 5y 1 B C&
One can iterate the process, by taking the first FGLS estimator of Z re-
estimating and è Í etc. If one iterates to convergences it’s equivalent
to MLE (supposing normal errors).
7.5. AUTOCORRELATION 136
B 1[
TB 1 È
[1
1 È
This somewhat odd-looking result is related to the Cholesky factor-
ization of 5 1& See Davidson and MacKinnon, pg. 348-49 for more
discussion. Note that the variance of B 1[ is è Í asymptotically, so we
see that the transformed model will be homoscedastic (and nonauto-
correlated, since the ½ 4 are uncorrelated with the B 4 " in different time
periods.
7.5.4. MA(1). The linear regression model with moving average order 1
errors is
B]U
4U 2/JU
/JU
+U 2Ëtn½+U 1
½
+Uñç
½ !ò!J è Í
뼬/]U¿½ì[
[ ½
7.5. AUTOCORRELATION 137
In this case,
é/JUW
F
ë ° 4½+U52Ëtr½EU 1Ô ³
è Í 2Mt è Í
è Í Ô2Ët
Similarly
E1
ëª\)b½EU 2Mtn½+U 1- 4½+U 1,2Ëtr½+U _ ]
tEè Í
and
2 tr½EU | _]
( +U 2Ëtr½+U 1- 4½+U Ë
\ 4½
so in this case
2Mt t >N>N>
t 2Ët t
è Í
5
t ..
.
..
.
.. ..
. . t
>N>N> t Z2Ët
+1
1
Á 1z à z À z Å
À Á
F
t
Ô2Mt
7.5. AUTOCORRELATION 138
Again the covariance matrix has a simple structure that depends on only two
parameters. The problem in this case is that one can’t estimate t using OLS on
J/ U +½ U 2Ëtr½EU 1
because the ½EU are unobservable and they can’t be estimated consistently. How-
ever, there is a simple way to estimate the parameters.
éb/JU è î è Í ÔZ2Mt
f
è î è Í ?2Ët / U
(U 1
By the Slutsky theorem, we can interpret this as defining an (uniden-
tified) estimator of both è,Í and t e.g., use this as
¥
f
è Í ?2 t / U
U(1
However, this isn’t sufficient to define consistent estimators of the pa-
rameters, since it’s unidentified.
To solve this problem, estimate the covariance of /gU and /]U
1 using
f
û  ´ ¸¬/]U×[/JU 1-
t è Í
/J U /J U 1
(U
7.5. AUTOCORRELATION 139
¥
f
t è Í /]U /]U 1
U(
Now solve these two equations to obtain identified (and therefore con-
sistent) estimators of both t and è Í & Define the consistent estimator
¥
5 5 t è Í
following the form we’ve seen above, and transform the model us-
ing the Cholesky decomposition. The transformed model satisfies the
classical assumptions asymptotically.
/1
/
i4/
n 1 >N>N> fir
...
/Jf
f
¬U /]U
U(1
f
^U
U(1
so that
f f
²
f SUT ë v (U 1 ^Ub·
¶ ¶
4U · w
U(1
We assume that U is covariance stationary (so that the covariance between U
and ^U
ì does not depend on - C&
¹
Define the ¸Üj autocovariance of U as
ë¼^U¬^4U Ä
¡
í
ÃÅÄ
²
While one could estimate parametrically, we in general have little informa-
tion upon which to base a parametric specification. Recent research has fo-
cused on consistent nonparametric estimators of ² &
Now define
f f
²
f
ë v
¶
^Ub· ¶
^4U · w
U(1 U(1
We have (show that the following is true, by expanding sum and shifting rows to left)
²
f
Ã
2F
à 1,2
Ã
41 2
K# Ã 2
Ã
4 0>N>N>N2 ð Ã f 1I2 Ã
4f 1 ó
ÃÆÄ
The natural, consistent estimator of is
f
¥
ÃÅÄ
^ U ^ 4U Ä &
U( Ä 1
where
U
¡ U /J U
^
(note: one could put  ^Ǹ% instead of  here). So, a natural, but inconsis-
tent, estimator of ² f would be
¥Ã
2F c à 1,2 à 41 h 2 8 # c ¥Ã 2 ¥Ã 4 2¡>N>N>'2 c à  f 12 à  4
¥ ¥
²
f h f 1h
f 1 ¸ ¥ÃÅÄ ¥Ã
¥Ã 2 s c 2 Ä4 h &
F 1
Ä
7.5. AUTOCORRELATION 142
ÃÄ
On the other hand, supposing that tends to zero sufficiently rapidly as ¸
tends to k
a modified estimator
¥ à f'Å c ¥ÃÅÄ ¥
²
f Ã
F 2 Ä 1 2 Ã Ä
4h
6
where R k
as
R k
will be consistent, provided 5 grows sufficiently
slowly.
¥ à f'Å c ¥ÃÅÄ 2 ¥
²
f Ã
2F Ä È È2¡ ¸
É
à Ä
4h &
M1 È
This estimator is p.d. by construction. The condition for consistency
is that A1 p! , R & Note that this is a very slow rate of growth
for T&
This estimator is nonparametric - we’ve placed no parametric
restrictions on the form of ² & It is an example of a kernel estimator.
7.5. AUTOCORRELATION 143
6 Â
æ «
Finally, since ²
f has ²
as its limit, ²
f R ²
& We can now use ²
f and
Breusch-Godfrey test
This test uses an auxiliary regression, as does the White test for heteroscedas-
ticity. The regression is
and the test statistic is the 0s statistic, just as in the White test. There are
ª
restrictions, so the test statistic is asymptotically distributed as a ª C &
The intuition is that the lagged errors shouldn’t contribute to explain-
ing the current error if there is no autocorrelation.
U is included as a regressor to account for the fact that the g/ U are not
independent even if the /]U are. This is a technicality that we won’t go
into here.
This test is valid even if the regressors are stochastic and contain lagged
dependent variables, so it is considerably more useful than the DW
test for typical time series data.
The alternative is not that the model is an AR(P), following the ar-
gument above. The alternative is simply that some or all of the first
ª autocorrelations are different from zero. This is compatible with
many specific forms of autocorrelation.
This will be the case when ë7 4 /"
Ú following a LLN. An important excep-
tion is the case where contains lagged B 4 and the errors are autocorrelated.
A simple example is the case of a single lag of the dependent variable with
7.5. AUTOCORRELATION 146
7.5.8. Examples.
Nerlove model, yet again. The Nerlove model uses cross-sectional data, so
one may not think of performing tests for autocorrelation. However, speci-
fication error can induce autocorrelated errors. Consider the simple Nerlove
model
) ûÚ
012K æ 8
2 | ª K
2 #ª 8
2 ª ÿ 2KA
"! $
) ûu
ß 28 ß æ K
2 | #ª K
2 "! #ª K
2 $ ª ÿ 2KA\&
1
7.5. AUTOCORRELATION 147
1.5
0.5
-0.5
-1
0 1 2 3 4 5 6 7 8 9 10
Value p-value
Breusch-Godfrey test 34.930 0.000
E XERCISE 7.6. Repeat the autocorrelation tests using the extended Nerlove
model (Equation ??) to see the problem is solved.
7.5. AUTOCORRELATION 148
û U
2
F
1 ª U52 ª U 12
| ò ü U 6 2Ëü U I 2MAY1U
The Octave program GLS/Klein.m estimates this model by OLS, plots the
residuals, and performs the Breusch-Godfrey test, using 1 lag of the residu-
als. The estimation and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732
1.5
0.5
-0.5
-1
-1.5
-2
-2.5
0 5 10 15 20 25
*********************************************************
Value p-value
Breusch-Godfrey test 1.539 0.215
and the residual plot is in Figure 7.6.2. The test does not reject the null of
nonautocorrelatetd errors, but we should remember that we have only 21 ob-
servations, so power is likely to be fairly low. The residual plot leads me to
suspect that there may be autocorrelation - there are some significant runs be-
low and above the x-axis. Your opinion may differ.
Since it seems that there may be autocorrelation, lets’s try an AR(1) correc-
tion. The Octave program GLS/KleinAR1.m estimates the Klein consumption
equation assuming that the errors follow the AR(1) pattern. The results, with
the Breusch-Godfrey test for remaining autocorrelation are:
7.5. AUTOCORRELATION 150
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171
*********************************************************
Value p-value
Breusch-Godfrey test 2.129 0.345
The test is farther away from the rejection region than before, and the
residual plot is a bit more favorable for the hypothesis of nonauto-
correlated residuals, IMHO. For this reason, it seems that the AR(1)
correction might have improved the estimation.
Nevertheless, there has not been much of an effect on the estimated
coefficients nor on their estimated standard errors. This is probably
because the estimated AR(1) coefficient is not very large (around 0.2)
EXERCISES 151
Exercises
EXERCISES 152
(1) Comparing the variances of the OLS and GLS estimators, I claimed that the
following holds:
(2)
é¯ô À Ä8éÜô À ¯ (³
w 5 w O
¯
V-Z B¯j
?4*5y 1 BÜjiG
+³
c
j h R
m ð æ « 1 ²¼æ « 1 ó
where
4 /J/ 4
²
fStT ëKÆ É
Explain why
f
Ë
²
4U U / U
U(1
is a consistent estimator of this matrix.
(5) Define the R ¹ autocovariance of a covariance stationary process ÷U ,
¸
where U
as
Ù ÃÅÄ
ë7^U¬^U4 Ä C&
Show that ë¼U¨ 4U
4&
Ä Ã Ä
ûÚ
ß 2K ß ) æ K
2 | #ª K
2 "! ) ª K
2 $ ª ÿ 2MA
1
EXERCISES 153
) æ
assume that é^WA-U UW
2
.
Exercises
(a) Calculate the FGLS estimator and interpret the estimation results.
(b) Test the transformed model to check whether it appears to satisfy ho-
moscedasticity.
CHAPTER 8
Stochastic regressors
The model we’ll deal will involve a combination of the following assumptions
Linearity: the model is a linear function of the parameter vector
F H
BgU
U4 F 28/JUP
or in matrix form,
B
i F
2 /%
where B is ÷i"'
c 1 4 is
>N>N> f h where U iV and F and / are
conformable.
154
8.1. CASE 1 155
L?Ì
Ôê
(8.0.2)
Ù /JU
In both cases, L 4U is the conditional mean of BVU given L U : L
L 4U
Ù B]U UW
8.1. Case 1
eçRñð×Ji4Ì: 1 è 'F ó
8.2. CASE 2 156
If the density of is J,3: Y the marginal density of is obtained by
multiplying the conditional density by J3¢: and integrating over i&
Doing this leads to a nonnormal density for ¢ in small samples.
However, conditional on i the usual test statistics have the Y 2
and
distributions. Importantly, these distributions don’t depend on i so
when marginalizing to obtain the unconditional distribution, nothing
changes. The tests are valid in small samples.
Summary: When is stochastic but strongly exogenous and / is nor-
mally distributed:
(1) is unbiased
(2) is nonnormally distributed
(3) The usual test statistics have the same distribution as with non-
stochastic i&
(4) The Gauss-Markov theorem still holds, since it holds condition-
ally on i and this is true for all &
(5) Asymptotic properties are treated in the next section.
8.2. Case 2
F a2 i4: $ 1 i4/
4 1 4/
F 2 Æ É
8.2. CASE 2 157
Now
Æ 4 1 R 6 æ « 1
É
h
by assumption, and
4 /
1Ap 4/ R6
since the numerator converges to a j æ
« è,C r.v. and the denominator still
goes to infinity. We have unbiasedness and the variance disappearing, so, the
estimator is consistent:
R6
F&
Considering the asymptotic
h distribution
h
1 4/
c ² Fh
²Æ 4 É
Æ 4 1 1Ap 4 /
É
h
so
c÷
F h R
m j æ « 1 è F
directly following the assumptions. Asymptotic normality of the estimator still
holds. Since the asymptotic results on all test statistics only require this, all the
previous asymptotic results on test statistics are also valid in this case.
(5) Tests are asymptotically valid, but are not valid in small samples.
8.3. Case 3
6
]B U
gU4 2
%ìPBgU ì2/]U
ìò1
4U ^2/]U
뼬/JU 1 WU
a
í
f
Sg'UòX S \UPX
(U 1
to converge in probability to a finite limit is that NU be stationary, in some sense.
Strong stationarity requires that the joint distribution of the set
U
U 1I2/JU
/]U§ç ±± ² [è
One can show that the variance of U depends upon in this case.
Stationarity prevents the process from trending off to plus or minus infinity,
and prevents cyclical behavior which would allow correlations between far
removed \U znd Nì to be high. Draw a picture here.
8.4. WHEN ARE THE ASSUMPTIONS REASONABLE? 160
Exercises
(1) Show that for two random variables w and if Ù w
a then Ù w [
. How is this used in the Gauss-Markov theorem?
(2) If it possible for an AR(1) model for time series data, e.g., BU
2 & ï BgU 1C2 /JU
satisfy weak exogeneity? Strong exogeneity? Discuss.
CHAPTER 9
Data problems
In this section well consider problems associated with the regressor matrix:
collinearity, missing observation and measurement error.
9.1. Collinearity
e
1 L 12 e L
2¡>N>N>N2
e ÿZLIÿ 2¹¸
¡
where L,* is the ! U¶ column of the regressor matrix and ¸ is an jM vector.
In the case that there exists collinearity, the variation in ¸ is relatively small, so
that there is an approximately exact linear relation between the regressors.
In the extreme, if there are exact linear relationships (every element of ¸ equal)
then : ½ so 4 3 ½Ë so 4 is not invertible and the OLS estimator
is not uniquely defined. For example, if the model is
¿
apartment is in Barcelona, µ*,
a otherwise. Similarly, define * * and * for
B
60
55
50
45
40
6 35
30
25
4 20
15
-2
-4
-6
-6 -4 -2 0 2 4 6
9.1.2. Back to collinearity. The more common case, if one doesn’t make
mistakes such as these, is the existence of inexact linear relationships, i.e., cor-
relations between the regressors that are less than one in absolute value, but
not zero. The basic problem is that when two (or more) variables move to-
gether, it is difficult to determine their separate influences. This is reflected
in imprecise estimates, i.e., estimates with high variances. With economic data,
collinearity is commonly encountered, and is often a severe problem.
When there is collinearity, the minimizing point of the objective function
that defines the OLS estimator (
, the sum of squared errors) is relatively
poorly defined. This is seen in Figures 9.1.1 and 9.1.2.
9.1. COLLINEARITY 165
100
90
80
70
60
6 50
40
30
4 20
-2
-4
-6
-6 -4 -2 0 2 4 6
é
4: 1 è
sion
L
ü e
2¹¸E&
Since
È Ù ÛÛ Â ¿ Û Û
we have
¿
ÙÜÛÛ ÛÛ Ô¼8
so the variance of the coefficient corresponding to L is
è
é^ Í
¿
ÛÛ Í ?yj Í
We see three factors influence the variance of this coefficient. It will be high if
(1) è is large
(2) There is little variation in L & Draw a picture here.
(3) There is a strong linear relationship between and the other regres-
sors, so that ü can explain the movement in L
well. In this case,
R k Í
will be close to 1. As "
Y
é &
R
Í Í
Intuitively, when there are strong linear relations between the regressors, it
is difficult to determine the separate influence of the regressors on the depen-
dent variable. This can be seen by comparing the OLS objective function in
the case of no correlation between regressors with the objective function with
correlation between the regressors. See the figures nocollin.ps (no correlation)
and collin.ps (correlation), available on the web site.
9.1.3. Detection of collinearity. The best way is simply to regress each ex-
planatory variable in turn on the remaining regressors. If any of these auxiliary
regressions has a high there is a problem of collinearity. Furthermore, this
procedure identifies which parameters are affected.
Sometimes, we’re only interested in certain parameters. Collinearity
isn’t a problem if it doesn’t affect what we’re interested in estimating.
p À 2i¸
where and
À are as in the case of exact linear restrictions, but ¸ is a random
vector. For example, the model could be
B
i28/
p
À 2¹¸
è î ± f ,f Ó(
ÎÏ ÎÏ ÎÏ
/
ç 6Ógf è Ä ± XÐÒ
¸ÑÐÒ ÐÒ
This sort of model isn’t in line with the classical interpretation of parameters
as constants: according to this interpretation the left hand side of p
À 2Ô¸ is
constant but the right is random. This model does fit the Bayesian perspective:
we combine information coming from the model and the data, summarized in
B
b28/
/ ç j è î ± f"
p3çRj À è Ä ±
Since the sample is random it is reasonable to suppose that ë7/<¸ 4
a which is
the last piece of information in the specification. How can you estimate using
9.1. COLLINEARITY 169
this model? The solution is to treat the restrictions as artificial data. Write
B
/
À 2
¸
B
/
^2
QÀ
Q Qu¸
is homoscedastic and can be estimated by OLS. Note that this estimator is bi-
ased. It is consistent, however, given that Q is a fixed constant, even if the
restriction is false (this is in contrast to the case of false exact restrictions). To
see this, note that there are æ restrictions, where æ is the number of rows of
& As R k
these æ artificial observations have no weight in the objective
function, so the estimator has the same limiting objective function as the OLS
estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of
the squared length of :
ë7 4
ë õ c 2a 4 : 1 4 / h 4 c 2a 4 : 1 4 / h ø
4 b28ë ð / 4 K 4 : 1 4 : 1 4 / ó
4 b2 ¿yÀ 4 : 1 è
ÿ
e5*
4©b2Kè * (the trace is the sum of eigenvalues)
1
¥ 4©b2 e"ÕÆÖ× Ã « O «
Å è (the eigenvalues are all positive, since34 is p.d.
9.1. COLLINEARITY 170
so
è,« «
ë7 4
¥ 4 2
Å e ÕØ^Ù
« « Ã O
where e ÕØ^Ù
à O Šis the minimum eigenvalue of 4 (which is the inverse of the
1
maximum eigenvalue of ¬ 4 : Y& As collinearity becomes worse and worse,
4 becomes more nearly singular, so e ÕØ^Ù Ã « O « Å tends to zero (recall that the
determinant is the product of the eigenvalues) and ë¼ 4
tends to infinite. On
the other hand, 4 is finite.
Now considering the restriction ±Nÿ
a X
2 ¸E& With this restriction the model
becomes
B
/
2
Q ±'ÿ
QÚ¸
ÎÏ
1 B
* ZÛ
n 4 Q ±\ÿ r n 4 ±\ÿ r
Q ±\ÿ
m
ÐÒ
This is the ordinary ridge regression estimator. The ridge regression estimator
can be seen to add Q \± ÿ which is nonsingular, to 4 which is more and
more nearly singular as collinearity becomes worse and worse. As Q the
R k
restrictions tend to
e that is, the coefficients are shrunken toward zero.
Also, the estimator tends to
* m ZÛ
ðòi4§2MQ \± ÿ ó 1 i4©B R ðPQ ±'ÿ ó 1 4©B
Q 4 B R
so 4 * ZÛ * ZÛ & This is clearly a false restriction in the limit, if our original
R
m m
model is at al sensible.
9.2. MEASUREMENT ERROR 171
First consider error in measurement of the dependent variable. The data gen-
erating process is presumed to be
B [
^28/
B
B[ i 2 ¸
JUñç
¸ !ò!J è Ä
where Ba[ is the unobservable true dependent variable, and B is what is ob-
served. We assume that / and ¸ are independent and that B [
2i/ satisfies
the classical assumptions. Given this, we have
B2i¸ ^28/
so
B
b28/pc¸
b2cÝ
U§ç
Ý !ò!J è î 28è Ä
As long as ¸ is uncorrelated with this model satisfies the classical
assumptions and can be estimated by OLS. This type of measurement
error isn’t a problem, then.
9.2. MEASUREMENT ERROR 173
BgU
U[ 4 2/JU
U
U[ 2i¸]U
JUñç
¸ !ò!J 65 Ä
ë¼ UßÝ
U
ë[ U[ i
2 ¸]UW 0Ôt¸ U4 ^28/JU [
5 Ä
where
5
Ä
ëb¸]U¸U4 0&
Because of this correlation, the OLS estimator is biased and inconsistent, just as
in the case of autocorrelated errors with lagged dependent variables. In matrix
notation, write the estimated model as
B
^2cÝ
9.2. MEASUREMENT ERROR 174
We have that
4 1 4B
Æ É Æ É
and
1
4
!W Æ É
µ
µ W! [ 4 2Mé 4 ,¬ [ 2ËéÜ
æ «à 25 Ä 1
W! é 4 é
ë f J¸ Ub¸4
(U 1 U
µ
5
Ä
Likewise,
so
µ ò!
æ « à D
2 5 Ä 1 æ « à
So we see that the least squares estimator is inconsistent when the regressors
are measured with error.
Missing observations occur quite frequently: time series data may not be
gathered in a certain year, or respondents to a survey may not answer all ques-
tions. We’ll consider two cases: missing observations on the dependent vari-
able and missing observations on the regressors.
1
\ i1 4 12i4
] \i 14 B1,28i 4 B ]
9.3. MISSING OBSERVATIONS 176
14 1 01
i14 BT1 P
Likewise, an OLS regression using only the second (filled in) observations
would give
4
4 B &
Substituting these into the equation for the overall combined estimator gives
\ 14 1,28i 4 ] 1
4
\ 14 128i 4 ] 1 \(14 128i 4 IJi14 1I]
±'ÿ è\ 14 1,2 4 ] 1 14 1
±'ÿ w &
Now,
ë¼
aw ^2 ±\ÿ w Pë c h
and this will be unbiased only if ë c h
Z&
9.3. MISSING OBSERVATIONS 177
B 2 /
where /
has mean zero. Clearly, it is difficult to satisfy this condition
without knowledge of &
Note that putting B
BÝ 1 does not satisfy the condition and therefore
leads to a biased estimator.
One possibility that has been suggested (see Greene, page 275) is to
estimate using a first round estimation using only the complete ob-
servations
01
¬ i14 1Ô $ 1 i14 BT1
then use this estimate, 01Y to predict B :
B
01
14 1- 1 14 BT1
Now, the overall estimate is a weighted average of 01 and just as
above, but we have
1 i4 B
4
4 1 i4 0 1
01
9.3. MISSING OBSERVATIONS 178
This shows that this suggestion is completely empty of content: the fi-
nal estimator is the same as the OLS estimator using only the complete
observations.
B U[
4U 2/JU
B]U B U [ if B U [
B [ 2/
with é^¬/"
# Ò , but using only the observations for which B"[ ¥u to estimate.
Figure 9.3.1 illustrates the bias. The Octave program is sampsel.m
15
10
-5
-10
0 2 4 6 8 10
In the case that there is only one regressor other than the constant,
subtitution of Ý for the missing U does not lead to bias. This is a special
case that doesn’t hold for ¦¥ #%&
E XERCISE 14. Prove this last statement.
Exercises
(1) Consider the Nerlove model
ûÚ
ß 2K ß ) æ K
2 | #ª K
2 "! ) ª K
2 $ ª ÿ 2MA
1
When this model is estimated by OLS, some coefficients are not significant.
This may be due to collinearity.
Exercises
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
Exercises
(i) Plot the ridge trace diagram
(ii) Check what happens as Q goes to zero, and as Q becomes very
large.
CHAPTER 10
where
) w & Theory suggests that w ¥ ,1 ¥ ¥ | ¥ & This
F
model isn’t compatible with a fixed cost of production since ¸¼
a when
a &
Homogeneity of degree one in input prices suggests that 1¢2a
" while
constant returns to scale implies n
V&
While this modelh may be reasonable
h inh some cases,
h an alternative
182
10.1. FLEXIBLE FUNCTIONAL FORMS 183
The basic point is that many functional forms are compatible with the linear-
in-parameters model, since this model can incorporate a wide variety of non-
linear transformations of the dependent variable and the regressors. For ex-
ample, suppose that 0Ô>© is a real valued function and that Ô>© is a vector-
valued function. The following model is linear in the parameters but nonlinear
in the variables:
U
\UW
B]U
4U ^28/JU
There may be
ª fundamental conditioning variables NU , but there may be re-
gressors, where
ª & For example,
may be smaller than, equal to or larger than
U could include squares and cross products of the conditioning variables in JU?&
Given that the functional form of the relationship between the dependent
variable and the regressors is in general unknown, one might wonder if there
exist parametric models that can closely approximate a wide variety of func-
tional relationships. A “Diewert-Flexible” functional form is defined as one
such that the function, the vector of first derivatives and the matrix of second
derivatives can take on an arbitrary value at a single data point. Flexibility in
this sense clearly requires that there be at least
d
2 ª 2 ð ª ª Â È# 2 ª
ó
free parameters: one for each independent effect that we wish to model.
10.1. FLEXIBLE FUNCTIONAL FORMS 184
B , 28/
4 ,
, 0 02 4 ã 0 ,2 ã# 2
M
Use the approximation, which simply drops the remainder term, as an approx-
imation to 0 7H
ÿ
4 0
0 ¢ , ,2 4 ã 0 ,2 ã#
As the approximation becomes more and more exact, in the sense that
R
ÿ
g 2 4©^2¡ Â # 4 Ã
B
è 2 4@^2¡ Â # 4 Ã 28/
10.1. FLEXIBLE FUNCTIONAL FORMS 185
10.1.1. The translog form. In spite of the fact that FFF’s aren’t really as
flexible as they were originally claimed to be, they are useful, and they are
certainly subject to less bias due to misspecification of the functional form than
are many popular forms, such as the Cobb-Douglas of the simple linear in the
variables model. The translog model is probably the most widely used FFF.
This model is as above, except that the variables are subjected to a logarithmic
tranformation. Also, the expansion point is usually taken to be the sample
mean of the data, after the logarithmic transformation. The model is defined
10.1. FLEXIBLE FUNCTIONAL FORMS 186
by
B
) ¸
)
Ý
) W" ¢ ) CÝ
B
2 4©2 Â # 4 Ã 28/
B
¸ .Ü
V
where . is a vector of input prices and is output. We could add other vari-
ables by extending in the obvious manner, but this is supressed for simplicity.
10.1. FLEXIBLE FUNCTIONAL FORMS 187
¸
Ã
1P1 Ã
1
2 4 ^2M 4 » 2¡ Â # n 4 r
41 Ã P Ã
+1P1îE1
1P1
Ã
+1
P
+1 |
Ã
1
|
Ã
| P| &
P
Note that symmetry of the second derivatives has been imposed.
Then the share equations are just
b2 n à P1 1 Ã
1 r
10.1. FLEXIBLE FUNCTIONAL FORMS 188
Therefore, the share equations and the cost equation have parameters in com-
mon. By pooling the equations together and imposing the (true) restriction
that the parameters of the equations be the same, we can gain efficiency.
To illustrate in more detail, consider the case of two inputs, so
1
÷
&
Note that the share equations and the cost equation have parameters in
common. One can do a pooled estimation of the three equations at once, im-
posing that the parameters are the same. In this way we’re using more ob-
servations and therefore more information, which will lead to imporved effi-
ciency. Note that this does assume that the cost equation is correctly specified
(i.e., not an approximation), since otherwise the derivatives would not be the
true derivatives of the log cost function, and would then be misspecified for
the shares. To pool the equations, write the model in matrix form (adding in
10.1. FLEXIBLE FUNCTIONAL FORMS 189
error terms)
01
»
¸ 1 ã z
ã zz ì z 1 1-
/1
1
+1P1
g1
2
/
1 P
/|
P ||
+1
+1 |
|
This is one observation on the three equations. With the appropriate nota-
tion, a single observation can be written as
B]U U4@È2/JU
The overall model would stack observations on the three equations for a total
of %V observations:
B1
1
/1
B
/
.
.
@ È2
.
.. .. ..
B]f
=f
/Jf
10.1. FLEXIBLE FUNCTIONAL FORMS 190
Next we need to consider the errors. For observation the errors can be placed
in a vector
/1U
/JU
/ U
/ |U
First consider the covariance matrix of this vector: the shares are certainly
correlated since they must sum to one. (In fact, with 2 shares the variances are
equal and the covariance is -1 times the variance. General notation is used to
allow easy extension to the case of more than 2 inputs). Also, it’s likely that
the shares and the cost equation have different variances. Supposing that the
model is covariance stationary, the variance of /gU won4 t depend upon :
è01P1 0è 1 è01 |
é¯ô À /JU
> è P è |
5
F
> > è P| |
Note that this matrix is singular, since the shares sum to 1. Assuming that there
is no autocorrelation, the overall covariance matrix has the seemingly unrelated
10.1. FLEXIBLE FUNCTIONAL FORMS 191
/1
/
éÜô À
5
.
..
/]f
>N>N>
5
F
.. .
. ..
5
F
..
.
..
.
>N>N>
5
F
± fUï
5
F
where the symbol ï indicates the Kronecker product. The Kronecker product of
two matrices w and is
ô 1P1 ô 1 >N>N>Úô514
.. ..
w
ô 1 . .
ï
..
&
.
ôN6¨ >N>N> ôJ66
10.1.2. FGLS estimation of a translog model. So, this model has heteroscedas-
ticity and autocorrelation, so OLS won’t be efficient. The next question is: how
do we estimate efficiently using FGLS? FGLS is based upon inverting the esti-
mated error covariance µ& So we need to estimate 5µ&
5
01
»
¸ 1 ã z
ã zz ì z 1 1Ô
+1P1
/1
1
2
g1
P
/
P ||
+1
+1 |
|
B U [
U [ @y2/ U[
10.1. FLEXIBLE FUNCTIONAL FORMS 193
and in stacked notation for all observations we have the #g observa-
tions:
B 1[
1[
/ [1
B [
[
/ [
.
.
@ È2
.
.. .. ..
B f[
f[
/ f[
B [ [ @y2/ [
/1
5
F [
é¯ô À
/
± ftïD5
5 [
F[
Define 5 as the leading #¾3# block of , and form
F [ 5
F
± tf ï 5 [ &
5 [
F
This is a consistent estimator, following the consistency of OLS and
applying a LLN.
(4) Next compute the Cholesky factorization
ª ¡
û ¹´<µ c 5 [ h 1
F F
and the Cholesky factorization of the overall covariance matrix of the
2 equation model, which can be calculated as
ª
û ¹´<µ 5 [ a
± tf ï ª
F
10.1. FLEXIBLE FUNCTIONAL FORMS 194
(5) Finally the FGLS estimator can be calculated by applying OLS to the
transformed model
ª B [ ª [ @È2 ª / [
ª B [
ª [ @;2 ª /
F K F U [
0128
| * ßq
Å
"Y#%%%&
* 1
These are linear parameter restrictions, so they are easy to impose and
will improve efficiency if they are true.
10.2. TESTING NONNESTED HYPOTHESES 195
(3) The estimation procedure outlined above can be iterated. That is, esti-
+³
mate @ ¯ as above, then re-estimate 5ð[ using errors calculated as
F
/
BÜj @ ¯
(³
Given that the choice of functional form isn’t perfectly clear, in that many
possibilities exist, how can one choose between forms? When one form is a
parametric restriction of another, the previously studied tests such as Wald,
2
LR, score or are all possibilities. For example, the Cobb-Douglas model is a
parametric restriction of the translog: The translog is
B]U è 2 4U 2/
´ 17HB
i28/
/JUkç !W!¾J è î
´
HB 2
9U
ç !W!¾Jd è ñ
dM
´ *
We wish to test hypotheses of the form:
F H is correctly specified versus
H ´ * is misspecified, for !
"Y#%&
One could account for non-iid errors, but we’ll suppress this for sim-
plicity.
There are a number of ways to proceed. We’ll consider the « test, pro-
posed by Davidson and MacKinnon, Econometrica (1981). The idea is
to artificially nest the two models, e.g.,
B
ÔÈ
?b2
9t0 2cÝ
If the first model is correctly specified, then the true value of is zero.
On the other hand, if the second model is correctly specified then :
"&
– The problem is that this model is not identified in general. For
example, if the models share some regressors, as in
BgU
?È
Ô01,2aÔÈ
? +U 2aÔÈ
? | | 5U 2
+1,2
+U 2
| P! U+2cÝU
BgU
[?È
?012
+1- 2a[ÔÈ Ô 2
EU 2aÔÈ
? | | EU 2
| P! U+2cÝU
»
1 2 » UE2 » | | U+2 » ! !PUE2sÝ
U
4
»
The four are consistently estimable, but is not, since we have four equa-
tions in 7 unknowns, so one can’t test the hypothesis that 3
&
The idea of the « test is to substitute
in place of &
This is a consistent
estimator supposing that the second model is correctly specified. It will tend
to a finite probability limit even if the second model is misspecified. Then
estimate the model
B
Ô¼
Pb2
A9 0 2cÝ
F@È2
B 2sÝ
where B
9 A9 4 9p 1 9 4 B
ª = B & In this model, is consistently estimable, and
one can show that, under the hypothesis that the first model is correct, R 6
and that the ordinary -statistic for 3
a is asymptotically normal:
Q
§ R
è ç ² 'N
6
If the second model is correctly specified, then
R k
since tends in
probability to 1, while it’s estimated standard error tends to zero. Thus
the test will always reject the false null model, asymptotically, since the
statistic will eventually exceed any critical value with probability one.
10.2. TESTING NONNESTED HYPOTHESES 198
We can reverse the roles of the models, testing the second against the
first.
It may be the case that neither model is correctly specified. In this case,
the test will still reject the null hypothesis, asymptotically, if we use
critical values from the j 'J distribution, since as long as tends to
6
something different from zero, R k
& Of course, when we switch
the roles of the models the other will also be rejected asymptotically.
In summary, there are 4 possible outcomes when we test two models,
each against the other. Both may be rejected, neither may be rejected,
or one of the two may be rejected.
There are other tests available for non-nested models. The G
« test is
simple to apply when both models are linear in the parameters. The
ª -test is similar, but easier to apply when
´ 1 is nonlinear.
The above presentation assumes that the same transformation of the
dependent variable is used by both models. MacKinnon, White and
Davidson, Journal of Econometrics, (1983) shows how to deal with the
case of different transformations.
Monte-Carlo evidence shows that these tests often over-reject a cor-
rectly specified model. Can use bootstrap critical values to get better-
performing tests.
CHAPTER 11
B b28/
where, for purposes of estimation we can treat as fixed. This means that
when estimating we condition on i& When analyzing dynamic models, we’re
not interested in conditioning on as we saw in the section on stochastic
regressors. Nevertheless, the OLS estimator obtained by treating as fixed
continues to have desirable asymptotic properties even in that case.
199
11.1. SIMULTANEOUS EQUATIONS 200
Demand: vU
12 |
U52 ]B U 28/1U
Supply: vU
01,2K 5U 2/ U
ÎÏ
/1U n /1Uå/ ÈU r
0è 1P1 è01
ë
/ U ÐÒ > è P
5µÔê
Now consider whether 5U is uncorrelated with /1UGH
|
ë õ Æ 2 BgU 2 / 1Uj/ U É /1U ø
G
1 j
0
1
ë7 U¨/1U
0
è P
1 Ä
1
0
è 1
Because of this correlation, OLS estimation of the demand equation will be
biased and inconsistent. The same applies to the supply equation, for the same
reason.
11.1. SIMULTANEOUS EQUATIONS 201
In this model, vU and 5U are the endogenous varibles (endogs), that are deter-
mined within the system. BgU is an exogenous variable (exogs). These concepts
are a bit tricky, and we’ll return to it in a minute. First, some notation. Suppose
we group together current endogs in the vector ¤U?& If there are endogs, ¤U is
"& Group current and lagged exogs, as well as lagged endogs in the vector
Ú
=U , which is "& Stack the errors of the equations into the error vector Ù U?&
The model, with additional assumtions, can be written as
¤ U4Ã U 4 2 Ù U4
Ù ñ U ç j 6 5¼ YÔê
ë7 Ù U Ù ì 4
[µí
¤ Ã
2 Ù
뼬 4 Ù
ÿ Ó¯Å
Ã
ºN¸ ç j 6 Es
¸
Ù
where
¤ 41
14 Ù 14
¤ 4
4
Ù 4
¤
[
Ù
.
..
.
..
.
..
¤ 4f
f4
Ù f4
± fUï 5
11.2. Exogeneity
The model defines a data generating process. The model involves two sets of
variables, ¤EU and UP as well as a parameter vector
@
n ¸ Nº ¸ Ã 4 ¸
ºN¸ 4 ¸
ºN¸ [ 5¼ 4 r 4
In general, without additional restrictions, @ is a 2 2=W 8Ü Â #"2
dimensional vector. This is the parameter vector that were inter-
ested in estimating.
In principle, there exists a joint density function for ¤U and =U? which
depends on a parameter vector t& Write this density as
U-¤EUP- U t o U
11.2. EXOGENEITY 203
where
o
U is the information set in period ¤ 4U Y& This includes lagged
and lagged U ’s of course. This can be factored into the density of ¤U
conditional on U times the marginal density of U :
UÔW¤+U?[=U tI o U
-U W¤+U U×Yt o U U-=U tI o U
This is a general factorization, but is may very well be the case that not
all parameters in t affect both factors. So use tI1 to indicate elements
of t that enter into the conditional density and write t for parameters
that enter into the marginal. In general, tI1 and t may share elements,
of course. We have
U-W¤+UP[=U tI o U
ÔU W¤+U U?$t1Y o U U-=U t o UW
Recall that the model is
¤sU 4 Ã iU 4 2 Ù U 4
Ù ñ U ç j 6 5¼ YÔê
ë7 Ù U Ù ì 4
[µí
Normality and lack of correlation over time imply that the observations are
independent of one another, so we can write the log-likelihood function as the
11.2. EXOGENEITY 204
f
B W ¤ @% o U
-U W¤+UP[=U tI o U
U(1
f
U-¤EU =U?Yt1 o UW U-=U t o UW [
U(1
f f
U-W¤+U =U?Yt,1$ o UW 2 UÔ=U t o U
U(1 U(1
D EFINITION 15 (Weak Exogeneity). U is weakly exogeneous for @ (the
original parameter vector) if there is a mapping from t to @ that is invariant
to t & More formally, for an arbitrary òt1YYt C1 @5òt
@5Wt1- Y&
This implies that t1 and t cannot share elements if U is weakly exoge-
nous, since t1 would change as t changes, which prevents consideration of
arbitrary combinations of òt1CYt .
Supposing that =U is weakly exogenous, then the MLE of tI1 using the joint
density is the same as the MLE using only the conditional density
Of course, we’ll need to figure out just what this mapping is to recover
@ from t,1Y& This is the famous identification problem.
With lack of weak exogeneity, the joint and conditional likelihood func-
tions maximize in different places. For this reason, we can’t treat bU as
fixed in inference. The joint MLE is valid, but the conditional MLE is
not.
In resume, we require the variables in U to be weakly exogenous if
we are to be able to treat them as fixed in estimation. Lagged ¤U sat-
isfy the definition, since they are in the conditioning information set,
e.g., ¤EU 1òC o ?U & Lagged ¤EU aren’t exogenous in the normal usage of the
word, since their values are determined within the model, just earlier
on. Weakly exogenous variables include exogenous (in the normal sense)
variables as well as all predetermined variables.
¤sU 4 Ã
iU 4 2 Ù U 4
é^ Ù UW
5
¤¯U 4
iU 4 Ã 1 2 Ù U 4 Ã 1
iAU 4 ó 2ËéÞU 4
Now only one current period endog appears in each equation. This is the
reduced form.
CU
12
Æ vUj0 1G²/ U É 2 | BgU 28/1U
vU
CU
1G 01I2/ U I2K | gB U 2K /1U
CU
1G 01 2 | BgU 2 / 1U / U
1P12 1ÔB]U 2Ëé01U
2 / U
01,2K 5 U 8
12 |
5U52 B]U+2/1U
5U 5U
1Gj012 | BgU528/1Uj/ U
5U
1Gj01 2 | BgU 2 /1Uj/ U
1 2 P BgU52Ëé U
11.3. REDUCED FORM 207
The interesting thing about the rf is that the equations individually satisfy the
classical assumptions, since BVU is uncorrelated with /1U and /
U by assumption,
* UW
¡ i=1,2, êY& The errors of the rf are
and therefore ë¼BVU¬é
é01U
xCz î
Ê § z î z Ê
§
é U
î Ê xC z §î Ê z
xC
z z z
The variance of é,1U is
è01P1¢#g è01 2 è P
This is constant over time, so the first rf equation is homoscedastic.
Likewise, since the /]U are independent over time, so are the éEU?&
É É
È É
¤¯U 4
iU 4 Ã 1 2 Ù U 4 Ã 1
iAU 4 ó 2ËéÞU 4
so we have that
and that the é5U are timewise independent (note that this wouldn’t be the case
if the
Ù U were autocorrelated).
11.4. IV estimation
The IV estimator may appear a bit unusual at first, but it will grow on you
over time.
The simultaneous equations model is
¤ Ã
2 Ù
Considering the first equation (this is without loss of generality, since we can
always reorder the equations) we can partition the ¤ matrix as
¤
n B ¤,1 ¤ r
B is the first column
¤1 are the other endogenous variables that enter the first equation
¤ are endogs that are excluded from this equation
Similarly, partition as
n 1e r
11.4. IV ESTIMATION 209
1 are the included exogs, and are the excluded exogs.
n / 1 r
Ù Ù
Ã
Assume that has ones on the main diagonal. These are normalization
restrictions that simply scale the remaining coefficients on each equation, and
which scale the variances of the error terms.
Given this scaling and our partitioning, the coefficient matrices can be writ-
ten as
Ã
1
Ã
U+1 Ã
P
Ã
|
01
1
P
B
¤,1I+1,281Ô01,28/
9
»
28/
The problem, as we’ve seen is that 9 is correlated with /% since ¤1 is formed of
endogs.
11.4. IV ESTIMATION 210
Now, let’s consider the general problem of a linear regression model with
correlation between regressors and the error term:
B
2/
/ ç !ò!¾J ± fgè
ë7 4 /V
í &
The present case of a structural equation from a system of equations fits into
this notation, but so do other problems, such as measurement error or lagged
dependent variables with autocorrelated errors. Consider some matrix ü
which is formed of variables uncorrelated with / . This matrix defines a projec-
tion matrix
ª
üþWü 4 üÚ 1 ü 4
so that anything that is projected onto the space spanned by ü will be un-
correlated with /T by the definition of üÖ& Transforming the model with this
projection matrix we get
ª B
ª
^2 ª
/
or
B [
[
28/ [
ë7 4 ª 4 ª V/
ë7 [ 4 / [
ª
ë7 4 /V
11.4. IV ESTIMATION 211
and
ª
üþWü 4 üÚ 1 ü 4
will lead to a consistent estimator, given a few more assumptions. This is the
generalized instrumental variables estimator. ü is known as the matrix of instru-
ments. The estimator is
i4 ª : 1 i
ª
ôZõ 4 B
i4 ª : 1 i
4 ª ^28/V
ôZõ
ª
2a 4 : 4 / 1 ª
so
ª
1 ª
ô`õ÷
¬i4 : $ 4 /
ð i4@üþòü4@üÚ 1 ü4 ó 1 i
4üþWüR4©üÚ $ 1 ü4/
4 ü ü 4 ü 1 ü 4 1 4ü ü 4 ü 1 ü 4/
ôZõ÷j ;Æ Æ É Æ É Æ ÉpÉ Æ É Æ É Æ É
Assuming that each of the terms with a in the denominator satisfies a LLN,
so that
11.4. IV ESTIMATION 212
6 ö
f« O R æ , a finite pd matrix
fO R 6 æ «
a finite matrix with rank (= cols 3 )
f Oî R6
then the plim of the rhs is zero. This last term has plim 0 since we assume that
ü and / are uncorrelated, e.g.,
ë¼Wü U 4 /JU a
h ôZõ R 6 &
Furthermore,
h
scaling by G we have
1 h
c ô`õ÷ h
¶ Æ 4 ü Æ ü 4 ü 1 Æ ü 4 1
Æ 4 ü É Æ ü 4 ü É Æ ü 4 / É
É É É ·
Oî
ö
f R
m ² æ è
then we get h
« 1 « 4 1 è ó
c ô`õ÷ h æ æ æ
ö
R
m ð J
è ô` õ
c B Ö ôZõ h 4 c Ü
Â
B j ôZõ h &
This estimator is consistent following the proof of consistency of the OLS esti-
mator of è when the classical assumptions hold.
11.4. IV ESTIMATION 213
The formula used to estimate the variance of ôZõ is
1Â
é ôZõ
c 4üÚ ,òü4@ü® 1 òü43 h è ôZ õ
The IV estimator is
(1) Consistent
(2) Asymptotically normally distributed
(3) Biased in general, since even though 뼬 4
ª
ª
1 ª
V/ $ë7 4 : 4 /
An important point is that the asymptotic distribution of ô`õ depends upon
æ «
ö
and æ and these depend upon the choice of üÖ& The choice of instru-
ments influences the efficiency of the estimator.
When we have two sets of instruments, ü1 and ü such that üj1ø÷¡ü
then the IV estimator using ü is at least as efficiently asymptotically
as the estimator that used üj1C& More instruments leads to more asymp-
totically efficient estimation, in general.
There are special cases where there is no gain (simultaneous equations
is an example of this, as we’ll see).
The penalty for indiscriminant use of instruments is that the small
sample bias of the IV estimator rises as the number of instruments
increases. The reason for this is that
ª
becomes closer and closer
to itself as the number of instruments increases.
IV estimation can clearly be used in the case of simultaneous equa-
tions. The only issue is which instruments to use.
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 214
B
9
»
28/
where
9
n ¤,1d1 r
Notation:
Let be the total numer of weakly exogenous variables.
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 215
Let [
¸ ¬1[
´<µ
be the number of included exogs, and let [I[
[ be the number of excluded exogs (in this equation).
Let [
Ú¸ ´1µ W¤,1- 2 be the total number of included endogs, and let
ù[I[
8ö[ be the number of excluded endogs.
Now the 1 are weakly exogenous and can serve as their own instru-
ments.
It turns out that exhausts the set of possible instruments, in that
if the variables in don’t lead to an identified model then no other
instruments will identify the model either. Assuming this is true (we’ll
prove it in a moment), then a necessary condition for identification is
that ¸ ´<µ ¸ ´1µ W¤,1- since if not then at least one instrument must
be used twice, so ü will not have full column rank:
òüÚ ½ [ 2 [ Á æ
M =
½ [ 2M [
[I[
[
To show that this is in fact a necessary condition consider some arbi-
trary set of instruments üÖ& A necessary condition for identification is
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 216
that
Æ µ ò! ü 4 9 É
[ 2M [
where
9
n ¤,1d1 r
Recall that we’ve partitioned the model
¤ Ã
2 Ù
as
¤
n B ,¤ 1 ¤ r
n 1e r
Given the reduced form
¤
ó 2Ëé
we can write the reduced form using the same partition
n B ¤,1 ¤ r
n 1 r
1P1 1 1|
ó ó
2 n é,1 é r
| ¸
1
ó
P ó
so we have
¤,1
1 ó 1 28 ó P 2Ëé01
so
ü 49
ü 4 n
1 ó 1 28 ó P 2Mé,1e1¯r
Because the ü ’s are uncorrelated with the éI1 ’s, by assumption, the cross
between ü and é,1 converges in probability to zero, so
µ ò! ü 4 9
µ ò! ü 4 n
1 ó 1 28 ó P 1¯r
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 217
Since the far rhs term is formed only of linear combinations of columns of
the rank of this matrix can never be greater than regardless of the choice of
instruments. If 9 has more than columns, then it is not of full column rank.
When 9 has more than columns we have
[ 2 [
¥
[ ¥ [I[
In this case, the limiting matrix is not of full column rank, and the identification
condition fails.
¤sU 4 Ã
iU 4 2 Ù U
é^ Ù UW
5
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 218
¤ÞU 4
iU 4 Ã 1 2 Ù U Ã 1
iAU 4 ó 2Ëé5U
ébòé5UW
ðà 1ó45 à 1
²
The reduced form parameters are consistently estimable, but none of them are
known a priori, and there are no restrictions on their values. The problem is
that more than one structural form has the same reduced form, so knowledge
of the reduced form parameters alone isn’t enough to determine the structural
parameters. To see this, consider the model
¤ U 4 ÃÆ2
U4 2
2 Ù U2
é^ Ù U 2
2
45 2
¤ U4
U4 ÃÆ2 1
2
2 Ù U 2 ÃÆ2 1
iU 4 2ù2 1 Ã 1 2 Ù U 2ù2 1 Ã 1
iU 4 Ã 1 2 Ù U Ã 1
U 4 ó 2Mé+U
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 219
Since the two structural forms lead to the same rf, and the rf is all that is di-
rectly estimable, the models are said to be observationally equivalent. What we
Ã
need for identification are restrictions on and such that the only admissi-
2
ble is an identity matrix (if all of the equations are to be identified). Take the
coefficient matrices as partitioned before:
Ã
1
Ã
U+1 Ã
P
Ã
|
01
1
P
The coefficients of the first equation of the transformed model are simply these
2
coefficients multiplied by the first column of . This gives
Ã
1
U+1 Ã
P
Ã
1P1
Ã
|
1P1
2
2
01
1
P
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 220
For identification of the first equation we need that there be enough restrictions
so that the only admissible
1P1
2
Ã
1
UE1 Ã
P
t+1
Ã
|
1P1
2
01
1
01
P
then the only way this can hold, without additional restrictions on the model’s
2 2
parameters, is if is a vector of zeros. Given that is a vector of zeros, then
the first equation
n
1P1
Á
Ã
1 r
2
1P1
Therefore, as long as
ÎÏ
Ã
|
P ÐÒ
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 221
then
1P1
1
2
¯
The first equation is identified in this case, so the condition is sufficient for
identification. It is also necessary, since the condition implies that this subma-
trix must have at least rows. Since this matrix has
[I[ 2 [I[
8 [ 2 [I[
rows, we obtain
[ 2 [I[
a
or
[I[
[
which is the previously derived necessary condition.
The above result is fairly intuitive (draw picture here). The necessary con-
dition ensures that there are enough variables not in the equation of interest to
potentially move the other equations, so as to trace out the equation of inter-
est. The sufficient condition ensures that those other equations in fact do move
around as the variables change their values. Some points:
have more instruments than are strictly necessary for consistent esti-
mation. Since estimation by IV with more instruments is more efficient
asymptotically, one should employ overidentifying restrictions if one
is confident that they’re true.
We can repeat this partition for each equation in the system, to see
which equations are identified and which aren’t.
These results are valid assuming that the only identifying informa-
tion comes from knowing which variables appear in which equations,
e.g., by exclusion restrictions, and through the use of a normaliza-
tion. There are other sorts of identifying information that can be used.
These include
(1) Cross equation restrictions
(2) Additional restrictions on parameters within equations (as in the
Klein model discussed below)
(3) Restrictions on the covariance matrix of the errors
(4) Nonlinearities in variables
When these sorts of information are available, the above conditions
aren’t necessary for identification, though they are of course still suffi-
cient.
To give an example of how other information can be used, consider the model
¤ Ã
2 Ù
Ã
where is an upper triangular matrix with 1’s on the main diagonal. This is a
triangular system of equations. In this case, the first equation is
B1
Þ· 1 2 Ù · 1
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 223
B
t 1ÔBT1,2 Þ· 2 Ù ·
This equation has [I[
Ú excluded exogs, and [
# included endogs, so it
fails the order (necessary) condition for identification.
Klein’s Model 1)
U ª U UÈr
¤ÞU 4
on û U ± U§ü U 6 =
and the predetermined variables are all others:
U 4
n ü U sU ¿ U w U ª U 1 U 1 =
U 1Þr &
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS 225
The model assumes that the errors of the equations are contemporaneously
correlated, by nonautocorrelated. The model written as ¤
Ã
2 Ù gives
¯
¯ ¯
Ã
|
t+1
¯
1 ;01
F F F
|
Þ
|
|
Þ Þ
U+1q Þ
Ã
|
Þ
P
|
|
w¡
Þ
|
|
This matrix is of full rank, so the sufficient condition for identification is met.
Counting included endogs, ë[
%% and counting excluded exogs, [I[
Ò so
[I[ B
[
Ò B
%
B
%
11.6. 2SLS 227
11.6. 2SLS
The 2SLS estimator is very simple: in the first stage, each column of ¤1 is re-
gressed on all the weakly exogenous variables in the system, e.g., the entire
matrix. The fitted values are
¤1
K¬i4: 1 i4©¤1
ª« ¤,1
ó 1
Since these fitted values are the projection of ¤I1 on the space spanned by i
and since any vector in this space is uncorrelated with / by assumption, ¤,1 is
11.6. 2SLS 228
uncorrelated with T / & Since ¤1 is simply the reduced-form prediction, it is cor-
related with ¤1Y The only other requirement is that the instruments be linearly
independent. This should be the case when the order condition is satisfied,
since there are more columns in than in ¤,1 in this case.
The second stage substitutes ¤1 in place of ¤1Y and estimates by OLS. This
original model is
B
¤,1I+1,281Ô01,28/
9
»
28/
B
,¤ 1I+1,281Ô01,28/T&
B
ªI« ,¤ 1I+1,2 ª «
1Ô01,28/
ªI« 9 » 28/
»
A9 4 Iª « 9p 1 9 4 ªI« B
which is exactly what we get if we estimate using IV, with the reduced form
predictions of the endogs used as instruments. Note that if we define
ª«
9 9
n ¤, d
1 1Þr
11.6. 2SLS 229
so that 9 are the instruments for 9; then we can write
»
9y4±9p 1 9y©4 B
Important note: OLS on the transformed model can be used to calcu-
late the 2SLS estimate of
»
since we see that it’s equivalent to IV using
a particular set of instruments. However the OLS covariance formula is
not valid. We need to apply the IV covariance formula already seen
above.
ªI«
9 9
n ¤ r
1 1
éb »
dc 9y4 9 h c 9y4 9 h c 9È4±9 h è ôZ õ
but since
ª« is idempotent and since
ª
«
i we can write
ªI«7ªI« ¤,1 ¤ 4 ª« 1
n ¤, 1 1Þr 4 n ¤,1e1Þr
¤ 1 4 1
14 ªI« ¤,1 14 1
Therefore, the second and last term in the variance formula cancel, so the 2SLS
varcov estimator simplifies to
1
é^ »
c 9y4 9 h è ôZ õ
which, following some algebra similar to the above, can also be written as
»
dc 1
é^ 9y4 9 h è ôZ õ
Finally, recall that though this is presented in terms of the first equation, it is
general since any equation can be placed first.
Properties of 2SLS:
(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean esists (the existence of moments is a technical
issue we won’t go into here).
(4) Asymptotically inefficient, except in special circumstances (more on
this later).
11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 231
The selection of which variables are endogs and which are exogs is part of
the specification of the model. As such, there is room for error here: one might
erroneously classify a variable as exog when it is in fact correlated with the
error term. A general test for the specification on the model can be formulated
as follows:
The IV estimator can be calculated by applying OLS to the transformed
model, so the IV objective function at the minimized value is
but
/ ôZõ
< B¯j ôZõ
B¯jKi4 : ª
1 ª
i4 B
ð ± jK 4 ª : 1 4 ª ó B
±ð jK 4 ª : 1 4 ª ¬ b28/V
ó
w ¨2/V
where
± jK¬i4 ª : $ 1
4ª
w
so
ôZõI
¬ /J4g2K4i4@ w 4 ª w ¬i28/V
11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 232
w
Moreover, 4
ª
w is idempotent, as can be verified by multiplication:
w 4ª w ±ð ª K ª 1 ª ª 1 ª
¬i4 : $ i4 ó ±ð j K¬i4 : $ 4 ó
ªð ª K
4 : ª
1
i4 ª
ªð ª K i4 ª : 1 i4 ª
ó ó
±ð ª K¬i4 ª : $ 1 i ª
4ó &
Furthermore, w is orthogonal to
ð ± ²K 4 ª 3 1 4ª ó
w
l²
so
ª
ôZõI J/ 4 4 w /
w
Supposing the / are normally distributed, with variance èIN then the random
variable ª
ôZõ
/ 4 4 w / w
è è
is a quadratic form of a j 'J random variable with an idempotent matrix in
the middle, so
ôZõ ç
w 4 ª
w [
è
è
11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 233
w 4ª aw
𪠪 ª 1 ª
i4 :
K i4 ó
so
¿yÀVª y¿ À i 4 ª
ª
ª
K¬i4 : $ 1
¿yÀ üþòü 4 üÚ 1 ü 4 «
¿yÀ ü 4 ü Wü 4 üÚ 1 «
«
«
where is the number of columns of ü and is the number of
columns of & The degrees of freedom of the test is simply the number
of overidentifying restrictions: the number of instruments we have
beyond the number that is strictly necessary for consistent estimation.
This test is an overall specification test: the joint null hypothesis is that
the model is correctly specified and that the ü form valid instruments
(e.g., that the variables classified as exogs really are uncorrelated with
/%& Rejection can mean that either the model B
9
»
2:/ is misspecified,
or that there is correlation between and /T&
This is a particular case of the GMM criterion test, which is covered in
the second half of the course. See Section 15.8.
Note that since
/ ôZõ
w /
<
11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS 234
and
ôZõI
/ 4 w 4 ª w /
we can write
ôZõ
/ 4 üþòü 4 üÚ 1 ü 4 Wüþòü 4 üÚ 1 ü 4 /V
è / 4/ Â
¢W ÛÛ î üAý  ¿ ÛÛ ¾î ü4ý
, Í
B
b28/
and ü is the matrix of instruments. If we have exact identification then ¸ Wü®
´<µ
ª B
ª
^2 ª
/
ª 1
ðWi4@üþòü4üÚ $ 1 üR4 ó 1
i4 :
WüR4Ì: 1 ð i4@üþòü4@üÚ 1 ó 1
WüR4Ì: 1 òü4üÚ ¨i4(üÚ 1
4ª
ôZõ
cB Ö ôZõ h c BÜj ôZõ h
ªB4 c B¯j ª
ôZõ h ôZ4 õ
i4 c Ü B j ôZõ h
B4 ª c B¯j ôZõ h ôZ4 õ
i4 ª B2 Iô`4 õ i4 ª ôZõ
B 4ª c B¯j ôZõ h ôZ4 õ c 4 ª B28 4 ª ôZõ h
B 4ª c B¯j ôZõ h
11.8. SYSTEM METHODS OF ESTIMATION 236
by the fonc for generalized IV. However, when we’re in the just indentified
case, this is
B4 ª 1 ü4©B
ôZõ
ð BÜj8òü43 $ ó
B4 ª 1 ü4 B
ð ± ÖKòü43 $ ó
B 4 ð üþòü 4 üÚ 1 ü 4 ü Wü 4 üÚ 1 ü 4 Kòü 4 : 1 ü 4 ó B
The value of the objective function of the IV estimator is zero in the just identified
case. This makes sense, since we’ve already shown that the objective function
after dividing by è is asymptotically with degrees of freedom equal to the
number of overidentifying restrictions. In the present case, there are no overi-
dentifying restrictions, so we have a rv, which has mean 0 and variance 0,
e.g., it’s simply 0. This means we’re not able to test the identifying restrictions
in the case of exact identification.
¤ Ã
2 Ù
뼬i4 Ù
ÿ Ó¯Å
Ã
ºN¸ ç j 6 Es
¸
Ù
Since there is no autocorrelation of the
Ù U ’s, and since the columns of
are individually homoscedastic, then
Ù
è01P1 ± f 0è 1 ± f >N>N>uè01 ¯ ± f
è P ± f ..
.
E .
..
. ..
> è ¯¯ ±f
5Ñï
±f
This means that the structural equations are heteroscedastic and cor-
related with one another
In general, ignoring this will lead to inefficient estimation, following
the section on GLS. When equations are correlated with one another
estimation should account for the correlation in order to obtain effi-
ciency.
Also, since the equations are correlated, information about one equa-
tion is implicitly information about all equations. Therefore, overiden-
tification restrictions in any equation improve efficiency for all equa-
tions, even the just identified equations.
Single equation methods can’t use these types of information, and are
therefore inefficient (in general).
11.8. SYSTEM METHODS OF ESTIMATION 238
11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS esti-
mator as a generalized method of moments estimator (see Chapter 15). I no
longer teach the following section, but it is retained for its possible historical
interest. Another alternative is to use FIML (Subsection 11.8.2), if you are will-
ing to make distributional assumptions on the errors. This is computationally
feasible with modern computers.
Following our above notation, each structural equation can be written as
B *å
¤ * E1,2 * 012/ *
9
* » * 28/ *
B1
71
9 >N>N>
»
1
/1
B .. »
/
9
.
2
.
..
..
.
..
.
.
..
.
..
B ¯
>N>N> 9 ¯
»
¯
/ ¯
or
B
9
»
28/
where we already have that
뼬/]/J4© E
5Çï
±f
11.8. SYSTEM METHODS OF ESTIMATION 239
The 3SLS estimator is just 2SLS combined with a GLS correction that takes
advantage of the structure of E¾& Define 9 as
¤1d1 >N>N>
..
¤ .
.. ..
. .
>N>N> ¤ ¯ ¯
ó
à 1
Ã
may be subject to some zero restrictions, depending on the restrictions on
and and ó does not impose these restrictions. Also, note that ó is calculated
using OLS equation by equation. More on this later.
The 2SLS estimator would be
»
9y4±9p 1 9y©4 B
error covariance into the formula, which gives the 3SLS estimator
/ *0
B * ¹
9 * » * H ³<(³
* *
(IMPORTANT NOTE: this is calculated using 9 not 9 C& Then the element
![¾ of 5 is estimated by
è * ßÈ
/ *4 / ß
Substitute 5 into the formula above to get the feasible 3SLS estimator.
Analogously to what we did in the case of 2SLS, the asymptotic distribution
of the 3SLS
h estimator can be shown to be
1
4 45Ñï ± f 1 9 ·
ÎÏ
ë
c | ³<(³ » h R
Q ã
»
ç f SUT
á ¶ 9
ç
æ
ä å ÐÒ
A formula for estimating the variance of the 3SLS estimator in finite samples
(cancelling out the powers of is
é c » | ³<(³ h
c 9È4 c 5y 1 ï
± f h 9 h 1
In the case that all equations are just identified, 3SLS is numerically
equivalent to 2SLS. Proving this is easiest if we use a GMM interpre-
tation of 2SLS and 3SLS. GMM is presented in the next econometrics
course. For now, take it on faith.
The 3SLS estimator is based upon the rf parameter estimator ó calculated
equation by equation using OLS:
ó i4: $ 1 4@¤
which is simply
ó 4 : 1 4 n B 1 B >N>N>RB ¯ r
that is, OLS equation by equation using all the exogs in the estimation of each
column of ó &
It may seem odd that we use OLS on the reduced form, since the rf equa-
tions are correlated:
¤¯U 4
iU 4 Ã 1 2 Ù U 4 Ã 1
U 4 ó 2Ëé U 4
and
é+U
ð Ã 1 ó 4 Ù Uçu c ð Ã 1 ó 4 5 Ã 1 h Ôê
Let this var-cov matrix be indicated by
þ
ðà 1 45 à 1
ó
11.8. SYSTEM METHODS OF ESTIMATION 242
B1 >N>N>
1 1
¸
B ..
.
2
¸
.
..
..
.
..
.
.
..
.
..
B ¯
>N>N>
¯
¸ ¯
where B* is the M vector of observations of the ! U¶ endog, is the entire
j matrix of exogs, * is the ! U¶ column of ó and ¸
* is the ! U ¶ column of é¢&
Use the notation
B
a`
2¹¸
to indicate the pooled model. Following this notation, the error covariance
matrix is
éb4¸%
þ
ï
±f
This is a special case of a type of model known as a set of seemingly
unrelated equations (SUR) since the parameter vector of each equation
is different. The equations are contemporanously correlated, however.
The general case would have a different * for each equation.
Note that each equation of the system individually satisfies the classi-
cal assumptions.
However, pooled estimation using the GLS correction is more efficient,
since equation-by-equation estimation is equivalent to pooled estima-
tion, since ` is block diagonal, but ignoring the covariance informa-
tion.
The model is estimated by GLS, where
þ
is estimated using the OLS
residuals from equation-by-equation estimation, which are consistent.
11.8. SYSTEM METHODS OF ESTIMATION 243
In the special case that all the * are the same, which is true in the
present case of estimation of the rf parameters, SUR OLS. To show
this note that in this case `§
¡± ftïM& Using the rules
(1) w ï
1
w 1 ï 1
(2) w ï
4
w 4 ï 4 and
(3) w ï
v û ï
wµû ï Y we get
1
³
ðY ± tf ïM: 4 þ ï ± f" 1 ± tf ïM: ó ± fUïM: 4 þ ï
± f 1 B
ðJð þ 1 ïMi4 ó ± tf ïM: ó 1 ð þ 1 ïMi4 ó B
ð þ ïui4: $ 1 ó ð þ 1 ïMi4 ó B
° ± ¯ ïu¬ 4 : 1 4 ³ B
1
.
..
¯
¤sU 4 Ã iU 4 2 Ù U 4
Ù ñ U ç j 6 5¼ YÔê
ë7 Ù U Ù ì 4
[µí
Given the assumption of independence over time, the joint log-likelihood func-
tion is
B Ã 65¼
, ) ò# [2p ÿ } Ã Y ) ÿ } 5 1 f ¤ 4 Ã j 4 "5 1 W¤ 4 Ã j 4 4
# # # U(1 U U U U
This is a nonlinear in the parameters objective function. Maximixation
of this can be done using iterative numeric methods. We’ll see how to
do this in the next section.
11.9. EXAMPLE: 2SLS AND KLEIN’S MODEL 1 245
It turns out that the asymptotic distribution of 3SLS and FIML are the
same, assuming normality of the errors.
One can calculate the FIML estimator by iterating the 3SLS estimator,
thus avoiding the use of a nonlinear optimizer. The steps are
| ³<(³ ³/+³
and |
Ã
(1) Calculate as normal.
| ³/+³ Ã ³/1 +³
(2) Calculate ó | & This is new, we didn’t estimate ó in
this way before. This estimator may have some zeros in it. When
Greene says iterated 3SLS doesn’t lead to FIML, he means this for
a procedure that doesn’t update ó but only updates 5 and and
Ã
& If you update ó you do converge to FIML.
(3) Calculate the instruments ¤
ó and calculate 5 using
Ã
and
to get the estimated errors, applying the usual estimator.
(4) Apply 3SLS using these new instruments and the estimate of µ&
5
CONSUMPTION EQUATION
11.9. EXAMPLE: 2SLS AND KLEIN’S MODEL 1 246
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059
*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184
*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427
*******************************************************
The above results are not valid (specifically, they are inconsistent) if the er-
rors are autocorrelated, since lagged endogenous variables will not be valid
instruments in that case. You might consider eliminating the lagged endoge-
nous variables as instruments, and re-estimating by 2SLS, to obtain consistent
parameter estimates in this more complex case. Standard errors will still be
estimated inconsistently, unless use a Newey-West type covariance estimator.
Food for thought...
CHAPTER 12
We’ll begin with study of extremum estimators in general. Let f be the
available data, based on a sample of size .
D EFINITION 12.0.1. [Extremum estimator] An extremum estimator @ is the
optimizing element of an objective function 'f+Äf%@" over a set N .
V- ' f5A@V
Ô Â Å\ _,fs ` f(@<] 4 \ ,_ f ` f(@<]
@
1 _ &
We readily find that @
` 4 ` ` 4 Z
Example: Maximum likelihood
Suppose that the continuous random variable BVU;ç ± ± j
A@]F''J Y& The maxi-
mum likelihood estimator is defined as
f @"
V[
~ 5f A@"
ò# A1 p }C~% BgUi
G
¶
@
U(1 # ·
f B]Ui@V
g
[
b
~ \f5A@"
Â
 Â
Ô , f 4@" Þ # # Ô ,
@
U(1 #
Solution of the f.o.c. leads to the familiar result that @
_Z Ý&
MLE estimators are asymptotically efficient (Cramér-Rao lower bound,
Theorem3), supposing the strong distributional assumptions upon which
they are based are true.
One can investigate the properties of an “ML” estimator supposing
that the distributional assumptions are incorrect. This gives a quasi-
ML estimator, which we’ll study later.
The strong distributional assumptions of MLE may be questionable
in many cases. It is possible to estimate using weaker distributional
assumptions based only on some of the moments of a random vari-
able(s).
1v4@ F
3 .
3 1
3 1v4@gFC is a moment-parameter equation.
In this example, the relationship is the identity function Ä1\A@]FY
g@ FN
3
f Â
3
1 ]B U G&
U(1
12. INTRODUCTION TO THE SECOND HALF 250
Define
1\4@"
3 1CA@" G 3 1
The method of moments principle is to choose the estimator of the
parameter to set the estimate of the population moment equal to the
sample moment, i.e., i1\ V
@
. Then the moment-parameter equation
is inverted to solve for the parameter estimate.
In this case,
f Â
1\ @" @p gB U &
U(1
Since
f ]B U Â R 6 @gF
U(1 by the LLN, the estimator is consistent.
More on the method of moments
Continuing with the above example, the variance of a A@]FC r.v. is
Again, by the LLN, the sample variance is consistent for the true vari-
ance, that is,
fU(1 BgUuB%Ý R 6 (# @ F &
So,
fU(1 BgU®B%Ý
@
#]
12. INTRODUCTION TO THE SECOND HALF 251
-U A@V #(@pBgUÚB%Ý
and
f BgUuB%Ý
4 @" (# @p U(1
&
12. INTRODUCTION TO THE SECOND HALF 252
One of the focal points of the course will be nonlinear models. This is not to
suggest that linear models aren’t useful. Linear models are more general than
12. INTRODUCTION TO THE SECOND HALF 253
they might first appear, since one can employ nonlinear transformations of the
variables:
In spite of this generality, situations often arise which simply can not be con-
vincingly represented by linear in the parameters models. Also, theory that
applies to nonlinear models also applies to linear models, so one may as well
start off with the general case.
Example: Expenditure shares
Roy’s Identity states that the quantity demanded of the !
U¶ of goods is
à à
+*,
à ¸@B Â Â à * &
¸@B B
An expenditure share is
*
*)+*Â B
* \}]ò * 1 *
. No linear in the parameters model
¯
so necessarily C\ and
for +* or * with a parameter space that is defined independent of the data can
guarantee that either of these conditions holds. These constraints will often be
violated by estimated linear models, which calls into question their appropri-
ateness in cases of this sort.
Example: Binary limited dependent variable
12. INTRODUCTION TO THE SECOND HALF 254
/ F j/ 1 ½ ¸
1 Ç w T
c¸ F T
(12.0.1) Ë
B
J
2
î \
M¸
w I]"&
¸
1 T
I
¸ F T
yI
and /]F and /1 are i.i.d. extreme value random variables. That is, utility de-
pends only on income, preferences in both states are homothetic, and a spe-
cific distributional assumption is made on the distribution of preferences in
the population. With these assumptions (the details are unimportant here, see
1
We assume here that responses are truthful, that is there is no strategic behavior and that
individuals are able to order their preferences in this hypothetical situation.
12. INTRODUCTION TO THE SECOND HALF 255
w @V
28 w
Ô2 }C~% Ô; [ 1 &
This is the simple logit model: the choice probability is the logit function of a
linear in parameters function.
Now, B is either or 1, and the expected value of B is 28 w . Thus, we
can write
B
28 w 2
ë7 &
c h
V- BÜ K
2 w -
U
The main point is that it is impossible that K 2 w can be written as a linear
in the parameters model, in the sense that, for arbitrary w , there are no @T m w
such that
28 w
Rm w ? 4*@%Ôê
where m w is a -vector valued function of and @ is a dimensional param-
eter. This is because for any @% we can always find a such that m 4@ will be
negative or greater than " which is illogical, since it is the expectation of a 0/1
binary random variable. Since this sort of problem occurs often in empirical
work, it is useful to study NLS and other nonlinear models.
12. INTRODUCTION TO THE SECOND HALF 256
After discussing these estimation methods for parametric models we’ll briefly
introduce nonparametric estimation methods. These methods allow one, for ex-
ample, to estimate
UW consistently when we are not willing to assume that a
model of the form
B]U
U 2/JU
can be restricted to a parametric form
BgU
U×@" 2/]U
Ë
/JU ½
2
î W t U
@ C N Yt C
where
?>@ and perhaps
2
î tI UW are of known functional form. This is im-
portant since economic theory gives us general information about functions
and the signs of their derivatives, but not about their specific form.
Then we’ll look at simulation-based methods in econometrics. These meth-
ods allow us to substitute computer power for mental power. Since computer
power is becoming relatively cheap compared to mental effort, any econome-
trician who lives by the principles of economic theory should be interested in
these techniques.
Finally, we’ll look at how econometric computations can be done in paral-
lel on a cluster of computers. This allows us to harness more computational
power to work with more complex models that can be dealt with using a desk-
top computer.
CHAPTER 13
Readings: Hamilton, ch. 5, section 7 (pp. 133-139) [ Gourieroux and Mon-
fort, Vol. 1, ch. 13, pp. 443-60 [ ; Goffe, et. al. (1994).
If we’re going to be applying extremum estimators, we’ll need to know
how to find an extremum. This section gives a very brief introduction to what
is a large literature on numeric optimization methods. We’ll consider a few
well-known techniques, and one fairly new technique that may allow one to
solve difficult problems. The main objective is to become familiar with the
issues, and to learn how to use the BFGS algorithm at the practical level.
The general problem we consider is how to find the maximizing element
@ (a -vector) of a function T4@" Y& This function may not be continuous, and
it may not be differentiable. Even if it is twice continuously differentiable, it
may not be globally concave, so local maxima, minima and saddlepoints may
all exist. Supposing 4@" were a quadratic function of @% e.g.,
YV A@" ¢7 2 û @
û 1
so the maximizing (minimizing) element would be @ N7 & This is the sort
of problem we have with linear models estimated by OLS. It’s also the case for
257
13.2. DERIVATIVE-BASED METHODS 258
feasible GLS, since conditional on the estimate of the varcov matrix, we have
a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able
to solve for the maximizer analytically. This is when we need a numeric opti-
mization method.
13.1. Search
The idea is to create a grid over the parameter space and evaluate the func-
tion at each point on the grid. Select the best point. Then refine the grid in
the neighborhood of the best point, and continue until the accuracy is ”good
enough”. See Figure 13.1.1. One has to be careful that the grid is fine enough
in relationship to the irregularity of the function to ensure that sharp peaks are
not missed entirely.
To check values in each dimension of a dimensional parameter space,
ÿ
"
we need to check points. For example, if and o
there would
be " 1 F points to check. If 1000 points can be checked in a second, it would
take % &\/ )%pÖ years to perform the calculations, which is approximately the
age of the earth. The search method is a very reasonable choice if is small,
but it quickly becomes infeasible if is moderate or large.
The iteration method can be broken into two problems: choosing the stepsize
D D
ô (a scalar) and choosing the direction of movement, J which is of the same
dimension of @T so that
D D D D
@ Ã W1 Å
@ Ã ÅM
2 ô J &
A locally increasing direction of search J is a direction such that
à
5ôbH A@yà 2M
ô
ôJT ¥
for ô positive but small. That is, if we go in direction J , we will improve on the
objective function, at least if we don’t go too far in that direction.
13.2. DERIVATIVE-BASED METHODS 260
unless 0 V
&
A@ Every increasing direction can be represented in this
way (p.d. matrices are those such that the angle between and æ , "
4@
D
D D D D
@ Ã W1 Å
@ Ã ÅK
2 ô æ 0A@
and we keep going until the gradient becomes zero, so that there is no increas-
ing direction. The problem is how to choose ô and æ &
D D D D D D
\f+4@" u
¬ 2 Â # ð @pc@ ó 4 4@ ð @ss@ ó
\f+4@ 2c0A@ ?4 ð @c@ ó ¡
D D D D
ý A@"
, 4@ ?4*@È2¡ Â # ð @pi@ ó 4 4 @ ð @pi@ ó
with respect to @%& This is a much easier problem, since it is a quadratic function
in @% so it has linear first order conditions. These are
d
V A@V
0 A@ D 2 D D
A@ ð @pi@ ó
ý
So the solution for the next round estimate
d is
D 1
D D D
@ @ 4@ 1 0 A@
D 1
D D D 1 D
@ @ jô A @ 0A @
A potential problem is that the Hessian may not be negative definite
d
D
when we’re far from the maximizing point. So A@ 1 may not be
13.2. DERIVATIVE-BASED METHODS 263
D D
positive definite, and A@ 1 0 A@ may not define an increasing di-
rection of search. This can happen when the objective function has flat
regions, in which case the Hessian matrix is very ill-conditioned (e.g.,
d
is nearly singular), or when we’re in the vicinity of a local minimum,
D
A@ is positive definite, and our direction is a decreasing direction
of search. Matrix inverses by computers are subject to large errors
when the matrix is ill-conditioned. Also, we certainly don’t want to
go in the direction of a minimum when we’re maximizing. To solve
d
this problem, Quasi-Newton methods simply add a positive definite
d
component to A@" to ensure that the resulting matrix is positive def-
inite, e.g., æ
A@V 72 7] where 7 is chosen large enough so that
13.2. DERIVATIVE-BASED METHODS 264
Stopping criteria
The last thing we need is to decide when to stop. A digital computer is
subject to limited machine precision and round-off errors. For these reasons,
it is unreasonable to hope that a program can exactly find the point that max-
imizes a function. We need to define acceptable tolerances. Some stopping
criteria are:
Negligable change in parameters:
D ßD 1 ½
@ ß c
@ / 1YÔêa
Negligable relative change:
ßD c D
D @ß 1 ½
@ ß 1 / Ôêa
@
D D
A@ G K4@ 1 ½ / |
Gradient negligibly different from zero:
D
ß 4@ ½ // !N?êa
13.2. DERIVATIVE-BASED METHODS 265
Starting values
The Newton-Raphson and related algorithms work well if the objective
function is concave (when maximizing), but not so well if there are convex
regions and local minima or multiple local maxima. The algorithm may con-
verge to a local minimum or to a local maximum that is not optimal. The
algorithm may also have difficulties converging at all.
The usual way to “ensure” that a global maximum has been found
is to use many different starting values, and choose the solution that
returns the highest objective function value. THIS IS IMPORTANT
in practice. More on this later.
Calculating derivatives
The Newton-Raphson algorithm requires first and second derivatives. It
is often difficult to calculate derivatives (especially the Hessian) analytically if
the function \f+?>@ is complicated. Possible solutions are to calculate derivatives
numerically, or to use programs such as MuPAD or Mathematica to calculate
analytic derivatives. For example, Figure 13.2.3 shows MuPAD1 calculating a
derivative that I didn’t know off the top of my head, and one that I did know.
Numeric derivatives are less accurate than analytic derivatives, and
are usually more costly to evaluate. Both factors usually cause opti-
mization programs to be less successful when numeric derivatives are
used.
1
MuPAD is not a freely distributable program, so it’s not on the CD. You can download it from
http://www.mupad.de/download.shtml
13.2. DERIVATIVE-BASED METHODS 266
One could define
GÂ " " U[
V"g U ; [
"" U [
\U Â "" &
[
à
In this case, the gradients § \ f5Ô>© and
\ f+?>@ will both be 1.
x
In general, estimation programs always work better if data is scaled
in this way, since roundoff errors are less likely to become important.
This is important in practice.
There are algorithms (such as BFGS and DFP) that use the sequen-
tial gradient evaluations to build up an approximation to the Hessian.
The iterations are faster for this reason since the actual Hessian isn’t
calculated, but more iterations usually are required for convergence.
Switching between algorithms during iterations is sometimes useful.
13.4. Examples
This section gives a few examples of how some nonlinear models may be
estimated using maximum likelihood.
13.4.1. Discrete Choice: The logit model. In this section we will consider
maximum likelihood estimation of the logit model for binary 0/1 dependent
variables. We will use the BFGS algotithm to find the MLE.
We saw an example of a binary choice model in equation 12.0.1. A more
general representation is
B [
0 Gj/
B
B [ ¥
ªÞÀ B
J
2
î \ 0 _]
@"
f * ) +*
\f5A@" * B @" 2aÔÈB * \@Èi +* @" I]¬
1
For the logit model (see the contingent valuation example above), the prob-
ability has the specific form
Here are some estimation results with " and the true @ 'N 4 &
***********************************************
Trial of MLE estimation of Logit model
Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************
13.4.2. Count Data: The Poisson model. Demand for health care is usu-
ally thought of a a derived demand: health care is an input to a home pro-
duction function that produces health, and health is an argument of the utility
function. Grossman (1972), for example, models health as a capital stock that
is subject to depreciation (e.g., the effects of ageing). Health care visits restore
the stock. Under the home production framework, individuals decide when to
make health care visits to maintain their health stock, or to deal with negative
shocks to the stock in the form of accidents or illnesses. As such, individual
13.4. EXAMPLES 270
}C~% Ô e e K
/;
B5
B &
f
2 B * e5* B *
\f+4@" * P e5* K
1
13.4. EXAMPLES 271
+*§
e }Y~% L *4
L,*§
\@
ª B ±Tû ª ± é
w û± û ](4¬&
Û¢Ù Ù Ù
This ensures that the mean is positive, as is required for the Poisson model.
Note that for this parameterization
à à
 ß
ß7
e
so
ß$TßÈ
è ãÔä
the elasticity of the conditional mean of B with respect to the
U¶ conditioning
variable.
The program EstimatePoisson.m estimates a Poisson model using the full
data set. The results of the estimation, using OBDV as the dependent variable
are here:
OBDV
******************************************************
Poisson model, MEPS 1996 full data set
Information Criteria
CAIC : 33575.6881 Avg. CAIC: 7.3566
BIC : 33568.6881 Avg. BIC: 7.3551
AIC : 33523.7064 Avg. AIC: 7.3452
******************************************************
In some cases the dependent variable may be the time that passes between
the occurence of two events. For example, it may be the duration of a strike,
or the time needed to find a job once one is unemployed. Such variables take
on values on the positive real line, and are referred to as duration data.
13.5. DURATION DATA AND THE WEIBULL MODEL 273
A spell is the period of time between the occurence of initial event and the
concluding event. For example, the initial event could be the loss of a job, and
the final event is the finding of a new job. The spell is the period of unemploy-
ment.
Let
F be the time the initial event occurs, and Y1 be the time the conclud-
ing event occurs. For simplicity, assume that time is measured in years. The
random variable
is the duration of the spell,
$1Äj . Define the density
F
function of
"!
¬ - C with distribution function ¬- Ë ½ - Y&
2 !
Several questions may be of interest. For example, one might wish to know
the expected time one has to wait to find a job given that one has already
waited years. The probability that a spell lasts years is
Ë
¥ J
ÈcË » J
È 2 !
WJ C&
The density of
conditional on the spell already having lasted years is
"!
"!
¥ J È 2 ¬ ! - ò J &
The expectanced additional time required for the spell to end given that is has
already lasted years is the expectation of
with respect to this density, minus
"& #!
ë7 ¥ J Ä8
Æ Y W" J% K
T
2 !
Ù U y òJ É
!
To estimate this function, one needs to specify the density - as a para-
metric density, then estimate by maximum likelihood. There are a number of
possibilities including the exponential density, the lognormal, etc. A reason-
ably flexible model that is a generalization of the exponential density is the
Weibull density
13.5. DURATION DATA AND THE WEIBULL MODEL 274
º Ã U(%Å $ e G e - ¡ 1 &
¬ @V "!
According to this model, ë¼
e ¡ & The log-likelihood is just the product of
the log densities.
To illustrate application of this model, 402 observations on the lifespan of
mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull
model. The ”spell” in this case is the lifetime of an individual mongoose.
¦ "Ò Ò ï Ñ
R
The parameter estimates and standard errors are e
& & % and
& - ',)È & %(% and the log-likelihood value is -659.3. Figure 13.5.1 presents fitted
life expectancy (expected additional years of life) as a function of age, with 95%
confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier
estimate of life-expectancy. This nonparametric estimator simply averages all
spell lengths greater than age, and then subtracts age. This is consistent by the
LLN.
In the figure one can see that the model doesn’t fit the data well, in that it
predicts life expectancy quite differently than does the nonparametric model.
For ages 4-6, the nonparametric estimate is outside the confidence interval that
results from the parametric model, which casts doubt upon the parametric
model. Mongooses that are between 2-6 years old seem to have a lower life
expectancy than is predicted by the Weibull model, whereas young mongooses
that survive beyond infancy have a higher life expectancy, up to a bit beyond
2 years. Due to the dramatic change in the death rate as a function of , one
"!
might specify - as a mixture of two Weibull densities,
The parameters * and e5* !
VY# are the parameters of the two Weibull densi-
»
ties, and is the parameter that mixes the two.
With the same data, @ can be estimated using the mixed model. The results
are a log-likelihood = -623.17. Note that a standard likelihood ratio test can-
not be used to chose between the two models, since under the null that
»
e
(single density), the two parameters and are not identified. It is possi-
ble to take this into account, but this topic is out of the scope of this course.
Nevertheless, the improvement in the likelihood function is considerable. The
parameter estimates are
13.6. NUMERIC OPTIMIZATION: PITFALLS 276
In this section we’ll examine two common problems that can be encoun-
tered when doing numeric optimization of nonlinear models, and some solu-
tions.
13.6.1. Poor scaling of the data. When the data is scaled so that the magni-
tudes of the first and second derivatives are of different orders, problems can
easily result. If we uncomment the appropriate line in EstimatePoisson.m, the
data will not be scaled, and the estimation program will have difficulty con-
verging (it seems to take an infinite amount of time). With unscaled data, the
elements of the score vector have very different magnitudes at the initial value
13.6. NUMERIC OPTIMIZATION: PITFALLS 277
of @ (all zeros). To see this run CheckScore.m. With unscaled data, one element
of the gradient is very large, and the maximum and minimum elements are 5
orders of magnitude apart. This causes convergence problems due to serious
numerical inaccuracy when doing inversions to calculate the BFGS direction
of search. With scaled data, none of the elements of the gradient are very
large, and the maximum difference in orders of magnitude is 3. Convergence
is quick.
13.6.2. Multiple optima. Multiple optima (one global, others local) can
complicate life, since we have limited means of determining if there is a higher
13.6. NUMERIC OPTIMIZATION: PITFALLS 278
maximum the the one we’re at. Think of climbing a mountain in an unknown
range, in a very foggy place (Figure 13.6.1). You can go up until there’s nowhere
else to go up, but since you’re in the fog you don’t know if the true summit
is across the gap that’s at your feet. Do you claim victory and go home, or do
you trudge down the gap and explore the other side?
The best way to avoid stopping at a local maximum is to use many starting
values, for example on a grid, or randomly generated. Or perhaps one might
have priors about possible values for the parameters (e.g., from previous stud-
ies of similar data).
13.6. NUMERIC OPTIMIZATION: PITFALLS 279
Let’s try to find the true minimizer of minus 1 times the foggy mountain
function (since the algoritms are set up to minimize). From the picture, you
can see it’s close to , but let’s pretend there is fog, and that we don’t know
that. The program FoggyMountain.m shows that poor start values can lead to
problems. It uses SA, which finds the true global minimum, and it shows that
BFGS using a battery of random start values can also find the global minimum
help. The output of one run is here:
======================================================
BFGSMIN final results
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1 Param conv 1 Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------
16.000 -28.812
================================================
SAMIN final results
NORMAL CONVERGENCE
3.7417e-02 2.7628e-07
In that run, the single BFGS run with bad start values converged to a point far
from the true minimizer, which simulated annealing and BFGS using a battery
of random start values both found the true maximizaer. battery of random
start values managed to find the global max. The moral of the story is be
cautious and don’t publish your results too quickly.
EXERCISES 282
Exercises
(1) In octave, type ”help bfgsmin_example”, to find out the location of the
file. Edit the file to examine it and learn how to call bfgsmin. Run it, and
examine the output.
(2) In octave, type ”help samin_example”, to find out the location of the
file. Edit the file to examine it and learn how to call samin. Run it, and
examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calcu-
late the probit loglikelihood, and a script to estimate a probit model. Run
it using data that actually follows a logit model (you can generate it in the
same way that is done in the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that
mle_results.m calls, and in turn the functions that those functions call.
Write a complete description of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care
use and give an economic interpretation. Estimate Poisson models for the
other 5 measures of health care usage.
CHAPTER 14
Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24 [& Amemiya, Ch.
4 section 4.1 [ ; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey
and McFadden (1994), “Large Sample Estimation and Hypothesis Testing,” in
Handbook of Econometrics, Vol. 4, Ch. 36.
283
14.2. CONSISTENCY 284
14.2. Consistency
T HEOREM 19. [Consistency of e.e.] Suppose that \f
@ is obtained by maximiz-
ing 'f5A@V over N &
Assume
QR P ì P
Then @\f @gF'&
'f ¼ @\f ¼
\f ¼ A@ F
However,
\f @\ f
T= @V C
9 SUT ¼ ¼
as seen above, and
\f A@ F
T=A@ F
9 StT ¼
by uniform convergence, so
T¾ @V
T=4@ F Y&
14.2. CONSISTENCY 286
with
ª û
&
Discussion of the proof:
This proof relies on the identification assumption of a unique global
maximum at @gFN& An equivalent way to state this is
will not deal with in this course. Just note that conventional hypothe-
sis testing methods do not apply in this case.
Note that 'f¼A@V is not required to be continuous, though <T¾4@" is.
The following figures illustrate why uniform convergence is impor-
tant.
T HEOREM 20. [Uniform Strong LLN] Let S]Þf A@V YX be a sequence of stochastic
real-valued functions on a totally-bounded metric space N T Y& Then
Ð ' sf5A@V QR P ì P
V )
if and only if
Q6P ì P
(a) sf 4 @"
R for each @MCFN where N is a dense subset of N and
F F
(b) S]sf5A@V YX is strongly stochastically equicontinuous..
ÿ
The metric space we are interested in now is simply N ÷ O using
the Euclidean norm.
The pointwise almost sure convergence needed for assuption (a) comes
from one of the usual SLLN’s.
14.3. EXAMPLE: CONSISTENCY OF LEAST SQUARES 289
is
f f
'f5A@V
 BgU U4 @"
 U4 @
U(1 * 1 ð 4U @ F 8
2 /JU ó
f f f
 ð 4U ð¾@ F i @ Jó ó 2Ë#  4U ðI@ F c@  / U
U(1 U(1 ó J/ U ¡
2
U(1
Considering the last term, by the SLLN,
f
 / U QR P ì P Y/.DY10
/ J,3
.
J3
0
è î &
U(1
Considering the second term, since
Ù ¬/V and . and / are indepen-
dent, the SLLN implies that it converges to zero.
Finally, for the first term, for a given @ , we assume that a SLLN applies
so that
f
(14.3.1) Â ð U4 ð¾@ F i@ ì
Q6P P
R Y1.
ð 4 ð¾@ F i
@ Jó ó
.
. .
ð F ó M2 # ð F ó Y . .&J3 2 ð F ó Y . . J,3
ó ð F
ð F ó M 2 #Þð F ó Ù .p ,2 ð× F ó Ù ðò. ó
ó ðò F
Finally, the objective function is clearly continuous, and the parameter space
is assumed to be compact, so the convergence is also uniform. Thus,
/T=4@"
ð F 2Ë#sð F
ó ð× F ó Ù . 2 ðP F j ó Ù ×ð . ó 2Kè î
ó
E XERCISE 21. Show that in order for the above solution to be unique it is
necessary that
Ù . pí & Discuss the relationship between this condition and
the problem of colinearity of regressors.
14.4. ASYMPTOTIC NORMALITY 291
This example shows that Theorem 19 can be used to prove strong consis-
tency of the OLS estimator. There are easier ways to show this, of course - this
is only an example of application of the theorem.
hood of @ F&
Q6P ì P
(b) S"2¢f+4@\f $X
R
2T A@ F Y a finite negative definite matrix, for any sequence
h h
S<@\fTX that converges almost surely to @ F&
\ o T¾A@]FC _]" where o T=4@gFC
fStT3é¯ô À YV \f+4@gFY
VY\f+4@gFC cR m q
h
(c)
1o 1
Then c @c@ F h m \ 2 T A@ F T=A@ F (2T¾4@ F ]
R
where @
£e @È
[ 2 ?È e
Z@gFN » e » "&
Note that @ will be in the neighborhood where
\f+4@"
V exists with
probability one as becomes large, by consistency.
14.4. ASYMPTOTIC NORMALITY 292
So
¯
YV \f+4@ F 2 ° 2 T¾A@ F 2 ´ T6 ÔJ ³ c @p i
@Fh
h h
And
Þ
$V 'f5A@ F 2 ° 2T A@ F 2 ´ 6 ÔJ ³ c @ss@ F h
Now 2T A@gF$ is a finite negative definite matrix, so the
´
6TÔJ term is asymptoti-
cally irrelevant next to 2øh T=4@gFY , so we can writeh
h
Q $V 'f5A@ F 232 ¾
T A@ F h
c @p i@ F h
c @pi@ F h
4 2T=A@ F $ 1 YV \f5A@ F
Q
Because of assumption (c), and the formula for the variance of a linear combi-
nation of r.v.’s, h
c @pi@ F h R
m ° 2 T 4 @ F 1 o T=A@ F 2T A@ F 1 ³
Assumption (b) is not implied by the Slutsky theorem. The Slutsky
0 fV R ì 0 if f R and 0Ô>@ is continuous at &
QP P
theorem says that
the elements of which are not centered (they do not have zero expec-
tation). Supposing a SLLN applies, the almost sure limit of
'f 4@gFC C
V
$V 'f5A@ F 65 687J h
where we use the result of Example 49. If we were to omit the G
we’d have
6 5 6
h
where we use the fact that 5 5
6
C& The sequence
$V 'f5A@ F is centered, so we need to scale by to avoid convergence
to zero.
14.5. Examples
B [
4 £j/
B
B [ ¥
/ ç ² 'N
£Y ã T x W # A1 p }Y~% Ô / # `J"/
is the standard normal distribution function.
In general, a binary response model will require that the choice probability
be parameterized in some form. For a vector of explanatory variables , the
response probability will be parameterized in some manner
ªÞÀ B
@"
If I @V
9 4 @V C we have a logit model. If @V
4 @" Y where ?>@ is the
standard normal distribution function, then we have a probit model.
14.5. EXAMPLES 295
maximizes the uniform almost sure limit of Nf+4@" Y& Noting that
ë B *Ä
I +* @]FY C
and following a SLLN for i.i.d. processes, Nf5A@V converges almost surely to the
expectation of a representative term B @V C& First one can take the expectation
conditional on to get
ë K ã SJB @" I2?ÈjB5 ) \@y£ @V _]WX I @ F @" -2 ° Èi @ F ³ ) \Èi @" I]&
where 3 is the (joint - the integral is understood to be multiple, and < is the
support of ) density function of the explanatory variables . This is clearly
continuous in @T as long as @V is continuous, and if the parameter space is
compact we therefore have uniform almost sure convergence. Note that @"
is continous for the logit and probit models, for example. The maximizing
14.5. EXAMPLES 296
There’s no need to subtract the mean, since it’s zero, following the
f.o.c. in the consistency proof above and the fact that observations are
i.i.d.
The terms in h also drop out by the sameh argument:
) ÀV
f SUT éÜô VYA @ F
So we get à à
o
=A@ F
ë õ à @ B @ F à @ B @ F ø &
T
4
14.5. EXAMPLES 297
Likewise, à
¾A@ F
ë à @ à @ B @ F C&
2 T
4
Expectations are jointly over B and or equivalently, first over B conditional
on then over & From above, a typical element of the objective function is
Now suppose that we are dealing with a correctly specified logit model:
à
à
@"
Ô2 }C~% Ô L 4 @V [ Y} ~% Ô L 4±@" L
@
C
} %
~ L 1 Y} ~% Ô L 4 @" L
Ô2 Ô 4 @V [ 2 }C~% Ô L @V
4
@" PÈi @V [ L
ð @" Ä @" ó L &
So
à
(14.5.3)
à B @ F
° BÜ
@ F ?³ L
@ à
(14.5.4)
o
=4@ F
T
Y
Ù
; ° B K#Y @ F I @ F 23 @ F ³ ,L L *4 3 `J
(14.5.5)
Y °
@ F Gi @ F ³ 0L L 4 3 `J &
Ù Ù
(14.5.6) 2T¾4@ F
Y °
@ F Gi
@ F ³ L0L 4*3 `J &
Note that we arrive at the expected result: the information matrix equality
holds (that is, 2h T=4@gFY
o
T A@]FY [ . With this,
c @pi@ F h R
m ° 2 T 4 @ F $ 1 o T=A@ F 2T A@ F $ 1 ³
h
simplifies to
c @pi@ F h R
m @° ' 42T¾4@ F $ 1 ³
which can also be expressed
h as
c @pi@ F h R
m ° o T A @ F 1 ³ &
On a final note, the logit and standard normal CDF’s are very similar - the
logit distribution is a bit more fat-tailed. While coefficients will vary slightly
between the two models, functions of interest such as estimated probabilities
@" will be virtually identical for the two models.
Ref. Gourieroux and Monfort, section 8.3.4. White, Intn’l Econ. Rev. 1980 is
an earlier reference.
14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 299
B * ¹ +* @ F 28/ *
where
/ * ç!ò!J è
The nonlinear least squares estimator solves
g[ f * ¹ + *
\@ f
* 1 B @V [
We’ll study this more later, but for now it is clear that the foc for minimization
will require solving a set of nonlinear equations. A common approach to the
problem seeks to avoid this difficulty by linearizing the model. A first order
Taylor’s series expansion about the point
F with
à¹
remainder gives
B *
¹ F @ F 2 + * F 4 à F @ F 23= *
where = * encompasses both /* and the Taylor’s series remainder. Note that = *
is no longer a classical error - its mean is not zero. We should expect problems.
Define
à¹
¹ @ F G 4 à ]F
@gFY
[
๠F F
à F
@ F
[
Given this, one might try to estimate [ and [ by applying OLS to
B *
g 2K + * 3
2 =*
Question, will and be consistent for [ and [ ?
14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 300
The answer is no, as one can see by interpreting
and as extremum
estimators. Let
4 4 &
V[ \ f b0
f B *
+ *
* 1
The objective function converges to its expectation
ÍR P Q6P ì P
'f5b0 /T=0
ë « ë ; « BÜ
j
and converges ô+&©"& to the F that minimizes /T¾0 :
F
¡
V[Z ë « ë ; « BÜ
Noting that
ë « ë ; « B¯
4@
ë « ë ; « ð ¹ @ F 2/p
ó
è 28ë « ð ¹ @ F G ó
x
Tangent line x
β
α x x
x x
x Fitted line
x_0
It is clear that the tangent line does not minimize MSE, since, for ex-
ample, if
¹ @gFC is concave, all errors between the tangent line and
the true function are negative.
Note that the true underlying parameter VF
@ is not estimated consis-
tently, either (it may be of a different dimension than the dimension
of the parameter of the approximating model, which is 2 in this exam-
ple).
Second order and higher-order approximations suffer from exactly
the same problem, though to a less severe degree, of course. For
this reason, translog, Generalized Leontiev and other “flexible func-
tional forms” based upon second-order approximations in general suf-
fer from bias and inconsistency. The bias may not be too important for
analysis of conditional means, but it can be very important for analyz-
ing first and second derivatives. In production and consumer analysis,
first and second derivatives (e.g., elasticities of substitution) are often
14.6. EXAMPLE: LINEARIZATION OF A NONLINEAR MODEL 302
Exercises
(1) Suppose that +* ç *G
µ * 2Ë/ * where / * is iid(0,è,Y C&
uniform(0,1), and B
Suppose we estimate the misspecified model B *
2÷ +* 2 ]* by OLS. Find
the numeric values of F and F that are the probability limits of and
(2) Verify your results using Octave by generating data that follows the above
model, and calculating the OLS estimator. When the sample size is very
large the estimator should be very close to the analytical results you ob-
tained in question 1.
(3) Use the asymptotic normality theorem to find the asymptotic distribution
of the ML estimator of F B
l F72¡/T where /3ç ²
for the model \J
ì?> Å
and is independent of & This means finding á z \f5
, 23IFY C á Ã x and
áxáxO á x êê
o
F C& ê
CHAPTER 15
15.1. Definition
We’ve already seen one example of GMM in the introduction, based upon
the distribution. Consider the following example based upon the t-distribution.
The density function of a t-distributed r.v. ¤U is
U ó
\ 4@
f
@
g[Zb
~ 5f 4@"
) <;
Ê BgUP
@"
U(1
This approach is attractive since ML estimators are asymptotically ef-
ficient. This is because the ML estimator uses all of the available infor-
mation (e.g., the distribution is fully specified up to a parameter). Re-
calling that a distribution is completely characterized by its moments,
the ML estimator is interpretable as a GMM estimator that uses all of
304
15.1. DEFINITION 305
This estimator is based on only one moment of the distribution - it uses less
information than the ML estimator, so it is intuitively clear that the MM esti-
mator will be inefficient relative to the ML estimator.
%y4@]FC
3?!
Ù
B U !
A@ F K#", 4@ F Ñ
provided @gF ¥ Ñ & We can define a second moment condition
%;b@" f !
4 @" 4 @8#" b @s Ñ BU
U(1
15.1. DEFINITION 306
&
A second, different MM estimator chooses to set @"
If you
@
solve this you’ll see that the estimate is different from that in equation
15.1.1.
This estimator isn’t efficient either, since it uses only one moment. A GMM es-
timator would use the two moment conditions together to estimate the single
parameter. The GMM estimator is overidentified, which leads to an estima-
tor which is efficient relative to the just identified MM estimators (more on
efficiency later).
As before, setf+A@V
1\A@V C A @" - 4 & The subscript is used to in-
dicate the sample size. Note that :A@VFC
5 6 1Ap C since it is an
average of centered random variables, whereas :A @V @
5 6ÔJ Y, @3
í @gFN
where expectations are taken using the true distribution with param-
eter @]FN& This is the fundamental reason that GMM is consistent.
A GMM estimator requires defining a measure of distance, Jp:4@" - . A
popular choice (for reasons noted below) is to set p:A@" -
4 üifV
J
and we minimize ' f A@V
:A@" 4 üifV:4@" C& We assume üf converges to a
finite positive definite matrix.
In general, assume we have moment conditions, so :4@" is a -vector
and ü is a b matrix.
For the purposes of this course, the following definition of the GMM estimator
is sufficiently general:
-vector,
with ë"V:A@V
þ and üif converges almost surely to a finite
b
symmetric positive definite matrix üÞT .
15.2. CONSISTENCY 307
15.2. Consistency
where 2T¾A@]FC is the almost sure limit of V á z V \f5A@V and T=A@]FY
) f SUT:é¯ô
o À á \f5A@gFC Y&
á á O áV
We need to determine the form of these matrices given the objective function
\f+4@"
^f5A@V 4 üifg^f+4@" C&
Now using the product rule from the introduction,
à à
à \f+4@"
# à Of A@V üifg^fyA@V
@ @ É È
so:
à
(15.3.1)
à A@"
# A@V [üR 4@V 0&
@
(Note that 'f5A@V , f5A@" Y%üf and f5A@V all depend on the sample size G but it
is omitted to unclutter the notation).
15.3. ASYMPTOTIC NORMALITY 309
To take second derivatives, let *
be the [! th row of A @V C& Using the prod-
uct rule,
à à
à à T 4@"
à # * A@V [üifV A@V
@ 4 @
* @ 4 à
# *ü 4g2Ë#g^4@ü à *4
È @ 4 É
at gFN
@ assume that áá V O A@V *4 satisfies a LLN, so that it converges almost surely
to a finite limit. In this case, we have
à
#g:4@ F 4 ü à A @ F *4 QR P ì P
È @ 4 É
where
²
) ë ° 0:4@ F Ô:4@ F 4³ &
T
f SUT
o
=4@ F
Ñ TsüÞT ² TsüFT T4
T
Using
h these results, the asymptotic normality theorem gives us
c @pi@ F h R
m n N s
T üFT
4 1 TpüÞT ² TsüFT 4 TsüÞT 4 1
T T T r
the asymptotic distribution of the GMM estimator for arbitrary weighting ma-
trix ü f & Note that for «T
i to be positive definite,
T must have full row rank,
Ts
Q .
ô
ü
7
15.4. CHOOSING THE WEIGHTING MATRIX 311
with ô much larger than 7N& In this case, errors in the second moment condition
have less weight in the objective function.
T HEOREM 25. If @ is a GMM estimator that minimizes f5A@V
4 üifg^fEA@V C the
R ì
QP
asymptotic variance of @ will be minimized by choosing üf so that üf
üFT
² T 1 where ² T
) fSUTë \ 0:4@gFC Ô:4@gFY 4 ]"&
Proof: For üÞT
è² T 1 the asymptotic variance
simplifies to T
²
T 1 T4 1 &
üÞT
í ² T 1 con-
Now, for any choice such that
sider the difference of the inverses of the variances when ü
² 1 versus
when ü is some arbitrary positive definite matrix:
ð T
²
T 1 T4 ó ¡ TsüFT T4 \ TsüFT ² TÞüFT T4 ] 1 TsüFT 4
T
A1 p
n± ² TA1 p WüÞT T4 Å\ TsüÞT ² TsüFT T4 ] 1 TsüFT ² A1 p A1 p
T T r T T4
² ²
T
(15.4.1) c @pi@ F h R
m n ð T
²
T 1 T4 ó 1 r
15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 313
allows us to treat
²
1 T4 1 ·
@ù¬ R ¶
@ F T T
where the ¬ means ”approximately distributed as.” To operationalize this we
need estimators of
T and ² ¾&
T
Â
The obvious estimator of 4f c @ h which is consistent
áá V T is simply
by the consistency of @% assuming that á V 4f is continuous in @%& Sto-
á
chastic equicontinuity results can give us this result even if á V 4f is
á
not continuous. We now turn to estimation of T & ²
^U and are functions of @% so for now assume that we have some consistent
estimator of @ F so that ^U
^U- @" Y& Now
°
 f  f
²
f ë :A @ F ?:A @ F 4 ³ ë v ¶
^U ·
¶
4U · w
U(1 U(1
f f
ë v  ¶ ^U · ¶ ^U4 · w
U(1 U(1
à 2 à 12 à 41 2 ÷# à 2 à 4 0>N>N>N2 ð à f 12 à 4f 1
F ó
ÃÆÄ
A natural, consistent estimator of is
f
¥
ÃÅÄ
Â Ä ^ U ^4U Ä &
U( 1
(you might use ÷i¸ in the denominator instead). So, a natural, but inconsis-
tent, estimator of ² T would be
2F c ¥Ã ,1 2 ¥Ã 4 2 ÷# c ¥Ã 2 4 h 2¡>N>N>'2 c à f 1,2
¥ ¥ Â Â
1h 4f 1 h
² Ã Ã Ã
f 1 ÷Ѹ ¥ÃÅÄ ¥Ã
¥
Ã
2F Ä c 2 Ä4 h &
1
15.5. ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX 315
ÃÄ
On the other hand, supposing that tends to zero sufficiently rapidly as ¸
tends to k
a modified estimator
¥ à fNÅ c ¥ÃÅÄ ¥
F 2 Ä 1 2 4h
² Ã Ã Ä
6
where R k
as
R k
will be consistent, provided 5 grows sufficiently
slowly. The term
f f Ä
can be dropped because must be
´
6T, C& This allows
information to accumulate at a rate that satisfies a LLN. A disadvantage of
this estimator is that it may not be positive definite. This could cause one to
calculate a negative statistic, for example!
Note: the formula for ²
requires an estimate of :A@VFC Y which in turn
requires an estimate of T
@ which is based upon an estimate of ²
The
¥ à fNÅ c ¥ÃÄ 2 ¥
² Ã
2F Ä y È2¡¸
ÚÉ
à Ä
4h &
1 È
15.6. ESTIMATION USING CONDITIONAL MOMENTS 316
U
¯
^
N 1 ^U 1I2>N>N>N2DN76 ^U 672¹½+U
This is estimated, giving the residuals ½EUP& Then the Newey-West covariance
²
estimator is applied to these pre-whitened residuals, and the covariance is
estimated combining the fitted VAR
ëG¤ö03
Y ;
Æ Y1A
¤ 0 : W¤[: ZJ¤ É "J i&
ëĤ , ¬:
Y ;
Æ Y1A
¤ö03 ¤ : `JE¤ É
: ZJ&
ëG¤ , ¬:
15.6. ESTIMATION USING CONDITIONAL MOMENTS 318
as claimed.
This is important econometrically, since models often imply restrictions on
conditional moments. Suppose a model tells us that the function B"U× UW has
expectation, conditional on the information set ± U? equal to ,
Q U?@V C
where 9 .7U is a j -vector valued function of .¼U and .7U is a set of variables
drawn from the information set ± U?& The 9 .7U are instrumental variables. We
now have moment conditions, so as long as
¥ the necessary condition
for identification holds.
15.6. ESTIMATION USING CONDITIONAL MOMENTS 319
9 1\.µ1Ô 9
.µ1- >N>N>89 .µ1-
1\. . 9 .
Ä9 f
9 9
.. . ..
.
9 1\.7fV 9
.7f" >N>N>89 .7f
9 14
4
9
9 f4
¹ 1\A@"
¹ A @"
^f+4@"
9È4
f
..
.
¹ fEA@V
9È4 ¹ 5f A@V
f
f 9ÄU ¹ -U 4@"
U(1
f ^U-A@V
U(1
· U¶
where 9
à UßH Å is the row of 9ÄfT& This fits the previous treatment. An interesting
question that arises is how one should choose the instrumental variables 9Ü.yU
to achieve maximum efficiency.
15.6. ESTIMATION USING CONDITIONAL MOMENTS 320
f
á á V 4 A@V
Note that with this choice of moment conditions, we have that
(a matrix) is
à
f5A@V
à 9yf 4 ¹ 5f A@V [ 4
@ à
Æ à ¹ 4 b@" 9Äf
@
f É
d
which we can define to be
d f+4@"
f,9Äf &
where f is a matrix that has the derivatives of the individual moment
conditions as its columns. Likewise, define the var-cov. of the moment condi-
tions
²
f
ë ° ^fE4@ F ?^fEA@ F ?4³
ë 9 f 4 ¹ f+4@ F ¹ f5A@ F 4 9Äf
È É
where we have defined f
é¯ô À ¹ f5A@]FC C& Note that matrix is growing with the
sample size and is not consistently estimable without additional assumptions.
The asymptotic normality theorem above says that the GMM estimator us-
ing the optimal weighting matrix
h is distributed as
c @c@ F h R
m j YérTs
15.6. ESTIMATION USING CONDITIONAL MOMENTS 321
d d
where
1
,9Äf
f 9 f 4 f+9¢f 1 9 f4 f4
(15.6.1) énT Æ É Æ É Æ É · &
fSUT
¶
1
Using an argument similar to that used to prove that ² T is the efficient weight-
ing matrix, we can show that putting d
¢f
9 f 1 f 4
d d
causes the above var-cov matrix to simplify to
1
énT
Æ Cf f 1 f 4 &
(15.6.2) f StT
É
and furthermore, this matrix is smaller that the limiting var-cov for any other
choice of instrumental variables. (To prove this, examine the difference of the
inverses of the var-cov matrices with the optimal intruments and with non-
optimal instruments. As above, you can show that the difference is positive
semi-definite). d d
Note that both fT which we should write more properly as f 4@ F C
since it depends on @ F and must be consistently estimated to apply
d
this.
Usually, estimation of d f is straightforward - one just uses
à
à ¹ f4 c @ ý h
@
Note that dynamic moment conditions simplify the var-cov matrix, but are
often harder to formulate. The will be added in future editions. For now, the
Hansen application below is enough.
The first order conditions for minimization, using the an estimate of the
optimal weighting matrix, are
à à
à T @"
# à Of c @ h ² 1 ^
f c @ h
@ È@ É
or
@" ² 1 ^
fE @V
15.8. A SPECIFICATION TEST 323
Consider a Taylor expansion of : @" :
(15.8.1) : @"
^f+4@ F 2 f4 A@ F c @c@ F h 2 ´ 6T?J C&
1
Multiplying by @V ² we obtain
@" ² 1 : @"
@V ² 1 ^f+4@ F 2 @V ² 1 4@ F P4 c @p i@ F 2 ´ 6TÔN
h
The lhs is zero, and since @ tends to @ F and ² tends to ² T , we can write
or
h h
c @pc@ F h
ð 1 T4 ó 1 1 ^f+4@ F
Q
² ²
T T T T
With this, and taking into account the original expansion (equation ??), we
h h h
get
Q
0: @
0^f5A@ F G T4 ð T
²
T 1 T4 ó 1 T
²
T 1 ^f5A@ F Y&
This last
h can be written
h as
Q
: @V
c ² TA1 p T4 ð T
²
T 1 T4 ó 1 T
²
T A1 p h ² T A1 p ^f+A@ F
Orh h
Q
² T A1 p : @V
cN± ²
T A1 p T4 ð T
²
T 1 T4 ó 1 T
²
T A1 p h ² T A1 p ^f+4@ F
h
Now
² T A1 p ^
f5A@ F iR m ² ±
15.8. A SPECIFICATION TEST 324
ª
lcN± ²
T A1 p T4 ð T
²
T 1 T4 ó 1 T
²
T A1 p h
is idempotent of rank
(recall that the rank of an idempotent matrix is
equal to h its trace) so h
c ² T A1 p :
@ h 4 c ² T A1 p :
@ h
0: @ ?4 ² T 1 :
@" iR m b¯
Since ² converges to ² T we also have
0: @" ?4 ² 1 : @ R
m bÜ
or
>]\f+ @" R
m b
supposing the model is correctly specified. This is a convenient test since we
just multiply the optimized value of the objective function by G and compare
with a bÜ critical value. The test is a general test of whether or not the
moments used to estimate are correctly specified.
This won’t work when the estimator is just identified. The f.o.c. are
YV \f5A@V
² 1 : &
@
²
But with exact identification, both and are square and invertible
(at least asymptotically, assuming that asymptotic normality hold), so
&
: @V
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 325
Suppose _
a` F
28/T where /sçRj 65¼ C(5 a diagonal matrix.
The typical approach is to parameterize
5èI C where è is a finite
5
^U-G
U5BgU L 4U
&
L U¬B]U0Ë Â
:
 ^U
 L U L 4U &
U U U
For any choice of üÖ\:G will be identically zero at the minimum, due to exact
identification. That is, since the number of moment conditions is identical to
the number of parameters, the foc imply that :
regardless of ü:& There
is no need to use the “optimal” weighting matrix in this case, an identity matrix
works just as well for the purpose of estimation. Therefore
¥ f 1 c ¥ÃÅÄ ¥
F 2 Ä 1 2 4h &
² Ã Ã Ä
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 327
¥
 f
² Ã
F
¶
^U U4 ·
U(1
 l f L U L 4U c B]U L 4U h w
v
U(1
f
 v L U L 4U / U w
U(1
` 4 ` G
G
where is an : diagonal matrix with / U in the position Y[ .
Therefore,
h the GMM varcov. estimator, which is consistent, is
` 4` ¶ ` 4G ` 1 ` 4` 1
é c c
j h,h
H
Æ É ·åÆ
JÉ I
` 4` 1 ¶ ` 4G ` ` 4` 1
Æ É ·åÆ
É
This is the varcov estimator that White (1980) arrived at in an influential article.
This estimator is consistent under heteroscedasticity of an unknown form. If
there is autocorrelation, the Newey-West estimator can be used to estimate ² -
the rest is the same.
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 328
_
` F 2/
/ ç j 65¼
metric specification (which may also depend upon ` C& In this case, the GLS
estimator is
ý
ð ` 4 5 1 ` ó 1 ` 4 5 1 G_
This estimator can be interpreted as the solution to the moment conditions
L UB]U Â L U L 4U ý
:ý
 &
U 5è U4@ F U è5U-A@ F
That is, the GLS estimator in this case has an obvious representation as a GMM
estimator. With autocorrelation, the representation exists but it is a little more
complicated. Nevertheless, the idea is the same. There are a few points:
or
_
Ä 28/
using the usual construction, where is and /gU is i.i.d. Suppose that
this equation is one of a system of simultaneous equations, so that NU contains
both endogenous and exogenous variables. Suppose that L U is the vector of all
exogenous and predetermined variables that are uncorrelated with /VU (suppose
that L U is
À jJ C&
Define as the vector of predictions of when regressed upon ` , e.g.,
¡` ` ` 1 `
4 4
` ` ` 1 `
4 K4
L " U
Since is a linear combination of the exogenous variables must
be uncorrelated with /T& This suggests the -dimensional moment con-
dition ^U-
" U BgUL 4U
and so
:
 " U BgUM"4U
&
U
Since we have parameters and moment conditions, the GMM
estimator will set identically equal to zero, regardless of üÖ so we
have 1
" UN"4 · 1
" U¬B]UW
dc ÄK4 h Ä©4 _
¶
U
U U
This is the standard formula for 2SLS. We use the exogenous variables and
the reduced form predictions of the endogenous variables as instruments, and
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 330
apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which
is the standard formula for 2SLS), and for how to deal with /VU heterogeneous
and dependent (basically, just use the Newey-West or some other consistent
estimator of ²
and apply the usual formula). Note that /gU dependent causes
lagged endogenous variables to loose their status as legitimate instruments.
B1U
1v?"U?
@ F1 2/1U
B U
?"U?
@ F 2/ U
..
.
B¯U
¯ ?"U×@ ¯F 28/ ¯ PU
or in compact notation
BgU
"UP@ F 2/JUP
where
Ô>@ F
A@ 1F 4 @ F 4 '>N>N>,@ ¯F 4 4 &
is a -vector valued function, and @
We need to find an ; w * j
vector of instruments L,* U? for each equation, that
are uncorrelated with / * UP& Typical instruments would be low order monomials
in the exogenous variables in U? with their lagged values. Then we can define
the c * 1 wy* h ² orthogonality conditions
¯
?"UP@ ¯ - L ¯ U
B ¯ U
¯
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 331
4@"
BT1$B \ &)&(&)[BgfT@"
A@V
Bgf ¤+f 1@" G> ¤Ef 1$@V
A@"
B]f ¤Ef 1$@V G> B]f 1 ¤+f @V G>"&(&)&> BT1? C&
15.9. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 332
érT
¥ ã ââ
ââ f n ââ
ââ
ââ
âä
fU(1 V ) B]U ¤+U 1Y @V + ç ââ
â
ââ
or the inverse of the negative of the Hessian (since the middle and last
term cancel, except for a minus sign):
f 1
¥
érT
vÞ
 V BgU ¤+U 1Y @V w
U(1
or the inverse of the outer product of the gradient (since the middle
and last cancel except for a minus sign, and the first term converges to
minus the inverse of the middle term, which is still inside the overall
inverse)
15.10. EXAMPLE: THE HAUSMAN TEST 334
f 1
¥
énT
OH Â n V BgU ¤EU 1$ @" r n V ) B]U ¤+U 1Y @V r 4 &
U(1 I
This simplification is a special result for the MLE estimator - it doesn’t apply
to GMM estimators in general.
Asymptotically, if the model is correctly specified, all of these forms con-
verge to the same limit. In small samples they will differ. In particular, there
is evidence that the outer product of the gradient formula does not perform
very well in small samples (see Davidson and MacKinnon, pg. 477). White’s
Information matrix test (Econometrica, 1982) is based upon comparing the two
ways to estimate the information matrix: outer product of gradient or negative
of the Hessian. If they differ by too much, this is evidence of misspecification
of the model.
This section discusses the Hausman test, which was originally presented
in Hausman, J.A. (1978), Specification tests in econometrics, Econometrica, 46,
1251-71.
Consider the simple linear regression model B"U
R 4U ^2ËA-U?& We assume that
the functional form and the choice of regressors is correct, but that the some of
the regressors may be correlated with the error term, which as you know will
produce inconsistency of ¢& For example, this will be a problem if
P#QYWR P#QYWR
PVQXW PVQXW
P P
RVQ RU RVQ Z RVQ Z&R RVQ ZS R[Q ZT RVQ ZU R[Q S WQ UU WQ y WQ y&R WQ yS WQ yT WQ yU R R[Q PR [Q PS R[Q PT R[Q PU
If we’re doubting about the consistency of OLS (or QML, etc.), why
should we be interested in testing - why not just use the IV estima-
tor? Because the OLS estimator is more efficient when the regressors
are exogenous and the other classical assumptions (including normal-
ity of the errors) hold. When we have a more efficient estimator that
relies on stronger assumptions (such as exogeneity) than the IV es-
timator, we might prefer to use it, unless we have evidence that the
assumptions are false.
So, let’s consider the covariance between the MLE estimator @ (or any other
fully efficient estimator) and some other CAN estimator, say @ ý . Now, let’s
recall some results from MLE. Equation 4.4.1 is:
h d h
c @c@ F h R ì =A@ F $ 1 >0A@ F C&
Q6P P
T
d
Equation 4.6.2 is
¾A@V
o T=4@" Y&
T
c @pi@ F h R ì ± T=A@ F 1 n0A@ F Y&
QP P
Also, equation 4.7.1 tells us that the asymptotic covariance between any
CAN estimator and the hMLE score vector is
énT='@"ý
érT
±'ÿ &
0A @"
n
o
=A@V
T
15.10. EXAMPLE: THE HAUSMAN TEST 337
Now, consider h h
The asymptotic
h covariance of this is
énT='@"ý
énT
± T=A@" 1 ± T=A@" 1
c @c@ h "ÿ ±\ÿ "ÿ
o
=4@"
T
énT
&
± T¾4@" 1 érT= @"
c @pi@ h
So, the asymptotic covariance between the MLE and any other CAN estima-
tor is equal to the MLE asymptotic variance (the inverse of the information
matrix).
Now, suppose we with to test whether the the two estimators are in fact
both converging to
F , versus the alternative hypothesis that the ”MLE” esti-
@
c @pý c@ F h
h
n ±'ÿ ±\ÿ r
c @ ý @ h
c @pc
@Fh
c @pý @ h R
m c Y énT @"ý GKérT¾ @" h &
So,
1
c @pý @ h 4 c né T= @"ý G8érT= @V h c @pý @ h R
m T C
where is the rank of the difference of the asymptotic variances. A statistic
that has the same asymptotic distribution is
c p@ ý @ h 4 c b 1
é @Vý G é^ @V h c @ý @ h R
m T C&
This is the Hausman test statistic, in its original form. The reason that this
test has power under the alternative hypothesis is that in that case the ”MLE”
estimator will not be consistent, and will converge to
h
, say, where
í
@
@ @
F.
c @ý @ h
Then the mean of the asymptotic distribution of vector will be @
F
@ , a non-zero vector, so the test statistic will eventually reject, regardless of
how small a significance level is used.
to be true, the test will be biased against rejection of the null hypothe-
sis. The contrary holds if we underestimate the rank.
A solution to this problem is to use a rank 1 test, by comparing only
a single coefficient. For example, if a variable is suspected of possibly
being endogenous, that variable’s coefficients may be compared.
This simple formula only holds when the estimator that is being tested
for consistency is fully efficient under the null hypothesis. This means
that it must be a ML estimator or a fully efficient estimator that has
the same asymptotic distribution as the ML estimator. This is quite
restrictive since modern estimators such as GMM and QML are not in
general fully efficient.
Following up on this last point, let’s think of two not necessarily efficient es-
timators, g1
@ and
, where one is assumed to be consistent,
@ but the other may
not be. We assume for expositional simplicity that both @g1 and @ belong to the
same parameter space, and that they can be expressed as generalized method
of moments (GMM) estimators. The estimators are defined (suppressing the
dependence upon data) by
*§
V[ : : A @ * 4 ü * * A@ *
@
V )
ä A @
ç
å
ÎÏ
y1 5y1
5
&
> 5 ÐÒ
and @ (or subvectors of the two) applied to the omnibus GMM estimator, but
with the covariance of the moment conditions estimated as
ÎÏ ¥
y1
5 |
à ¥
Ó z Å
5 &
|
à z
Å
Ó 5
ÐÒ
T ì
(15.11.1) ëb½
¸ U -ì ± UW &
ìò F
The parameter is between 0 and 1, and reflects discounting.
15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 342
where 5U is the price and JU is the dividend in period Y& The price of ¸U
is normalized toV &
Current wealth .7U
Ôp2 À U ?!WU 1 , where !WU 1 is investment in period
; . So the problem is to allocate current wealth between current
consumption and investment to finance future consumption: .ÈU
¸ U$2
!WU .
Future net rates of return À U ìYY ¥ are not known in period : the asset
is risky.
A partial set of necessary conditions for utility maximization have the form:
To see that the condition is necessary, suppose that the lhs < rhs. Then by
reducing current consumption marginally would cause equation 15.11.1 to
drop by ½ 4 ¸ UW C since there is no discounting of the current period. At the
same time, the marginal reduction in consumption finances investment, which
has gross return Ô2 À U 1[ which could finance consumption in period 2V&
This increase in consumption would cause the objective function to increase by
15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 343
where Ès is the coefficient of relative risk aversion ( ½ J . With this form,
¸ ¡U 1
+4¬ ¸ U a
½
ë ð ¸ ¡U 1 ² ù Ô 2 À U 1[ ¸ ¡U 1 1 ú ó ± U ¡
(note that ¸ U can be passed though the conditional expectation since ¸ U is chosen
based only upon information available in time - C&
Suppose that L U is a vector of variables drawn from the information set ± U?&
We can use the necessary conditions to form the expressions
Ø 1
È^?2 À U 1[ c ÊnØ Ê
h ¡ É L U
^U-A@V
È
15.11. APPLICATION: NONLINEAR RATIONAL EXPECTATIONS 344
Note that at time Y0U ì has been observed, and is therefore an element of
the information set. By rational expectations, the autocovariances of the mo-
Ã
ment conditions other than should be zero. The optimal weighting matrix
F
is therefore the inverse of the variance of the moment conditions:
M
) °
Ù :A@ F ?:A@ F ?4³
²F
 f
²
U- @" ?^U- @V ?4
^
U(1
As before, this estimate depends on an initial consistent estimate of T
@ which
can be obtained by setting the weighting matrix ü arbitrarily (to an identity
matrix, for example). After obtaining @% we then minimize
T4 @"
:4 @" ?4 ² 1 :A @V C&
This process can be iterated, e.g., use the new estimate to re-estimate ²
use
this to estimate @gF' and repeat until the estimates don’t change.
This whole approach relies on the very strong assumption that equa-
tion 15.11.2 holds without error. Supposing agents were heteroge-
neous, this wouldn’t be reasonable. If there were an error term here, it
could potentially be autocorrelated, which would no longer allow any
variable in the information set to be used as an instrument..
15.12. EMPIRICAL EXAMPLE: A PORTFOLIO MODEL 345
Ù Û
and J in that order. There are 95 observations (source: Tauchen, « 1986).
As instruments we use 2 lags of ¸ and
À & The estimation results are
***********************************************
Example of GMM estimation of rational expectations model
15.12. EMPIRICAL EXAMPLE: A PORTFOLIO MODEL 346
Value df p-value
X^2 test 6.6841 5.0000 0.2452
Exercises
(1) Show how to cast the generalized IV estimator presented in section 11.4 as
a GMM estimator. Identify what are the moment conditions, U-A@V , what
is the form of the the matrix
f% what is the efficient weight matrix, and
show that the covariance matrix formula given previously corresponds to
the GMM covariance matrix formula.
L
(2) Using Octave, generate data from the logit dgp . Recall that
Ù B"U U
L U?@"
@\ V2 }C~% ? L U @" I] 1 . Consider the moment condtions (exactly iden-
tified):
^U-4@"
\ gB Ui L U?@V _] L UP&
(a) Estimate by GMM, using these moments. Estimate by MLE.
(b) The two estimators should coincide. Prove analytically that the estima-
tors coicide.
(1) Verify the missing steps needed to show that >: @" 4 ² 1 :
@ has a
b¯ distribution. That is, show that the monster matrix is idem-
potent and has trace equal to Ü &
CHAPTER 16
Quasi-ML
348
16. QUASI-ML 349
`
f
\f+T B
T
(U 1 5 U-T
Suppose that we do not have knowledge of the family of densities
5U-T Y& Mistakenly, we may assume that the conditional density of _IU
is a member of the family
ÔU _,U U 1Y ` UP
@" Ca@ÞCèN where there is no
gF
@ such that
UÔ_,U U Y1 ` UP
@gFY
5U-_0U U 1Y ` U×FY C?ê (this is what we
mean by “misspecified”).
This setup allows for heterogeneous time series data, with dynamic
misspecification.
The QML estimator is the argument that maximizes the misspecified average
log likelihood, which we refer to as the quasi-log likelihood function. This
objective function is
ì
QP P
) f )
\f5A@" fSUT ë
R
-U 4@"
Ý A@"
(U 1
We assume that this can be strengthened to uniform convergence, a.s., follow-
ing the previous arguments. The “pseudo-true” value of @ is the value that
maximizes TÝ 4@" :
@ F
V-b
~ Ý A@V
Given assumptions so that theorem 19 is applicable, we obtain
N is compact
– 'f5A@V is continuous and converges pointwise almost surely to Ý A@"
(this means that Ý 4@" will be continuous, and this combined with
compactness of N means Ý A@V is uniformly continuous).
– gF
@ is a unique global maximizer. A stronger version of this as-
sumption that allows for asymptotic normality is that
Ý 4@" ex-
V
c @pi@ F h R
m ° 2 T 4 @ F $ 1 o T=A@ F 2T A@ F $ 1 ³
where
2T¾4@ F
f SU T ë V \f5A@ F
16. QUASI-ML 351
h
and
o
=A@ F
f SU) T é¯ô À YV \f5A@ F Y&
T
Note that asymptotic normality only requires that the additional as-
sumptions regarding 2 and
o
hold in a neighborhood of @ F for 2 and
at @gFN for ¾&
o
not throughout N In this sense, asymptotic normality is a
local property.
Notation: Let gU V
U-A@]FY
We need to estimate h
o
=4@ F
éÜô À h V$'f5A@ F
T
fSUT
éÜô À f V ) UÔ4@ F
fSUT U(1
éÜô À f gU
f SUT
U(1
ë H ¶ f b]Ujë]U _· ¶ f b ]Ujë#]U _· 4
fSUT U(1 U(1 I
16. QUASI-ML 352
[V ë « ë B @ F
ë « ë V ) B @ F
a
F F
16. QUASI-ML 353
f V ) B
@ F j o =
T 4@ F - C&
U(1 R
m
That is, it’s not necessary to subtract the individual means, since they
are zero. Given this, and due to independent observations, a consis-
tent estimator is
f
o
V ) -U @" V ?U @
U(1 O
This is an important case where consistent estimation of the covariance matrix
is possible. Other cases exist, even for dynamically misspecified time series
models.
CHAPTER 17
/JU,ç!ò!¾J è
_
4@" 2/
354
17.1. INTRODUCTION AND DEFINITION 355
È È
(17.1.1) s @ V O @ Y&
In shorthand, use in place of s @ C& Using this, the first order conditions can
be written as
4 _2 4 @
or
n &
(17.1.2) 4 i
_ @" r
This bears a good deal of similarity to the f.o.c. for the linear model - the
derivative of the prediction is orthogonal to the prediction error. If
A@V
` T@
then is simply ` so the f.o.c. (with spherical errors) simplify to
` 4_£ ` 4 `
a
17.2. IDENTIFICATION 356
17.2. Identification
f ° QP ì P
° W @ F G W%@" P³ J,3W" C
U - 4@ F G
-
U A@V ?³
R Y
(17.2.1)
U(1
where 3 is the distribution function of & In many cases, @" will
Given these results, it is clear that a minimizer is @"FN& When considering identi-
fication (asymptotic), the question is whether or not there may be some other
minimizer. A local condition for identification is that
à à
à à T=4@"
à à Y ° @ F G @" P³ J,3
@ @ 4 @ @ 4
à
à à Y ° @ F G
@V ³ J,3
# Y ° V
@ F ?4 ³ ° V O W%@ F ³ 4 J,3
@ @ 4 ê W
ê
êV
ê
the expectation of the outer product of the gradient of the regression function
evaluated at gF'&
@ (Note: the uniform boundedness we have already assumed
17.4. ASYMPTOTIC NORMALITY 358
allows passing the derivative through the integral, by the dominated conver-
gence theorem.) This matrix will be positive definite (wp1) as long as the gra-
dient vector is of full rank (wp1). The tangent space to the regression manifold
must span a -dimensional space if we are to consistently estimate a -
dimensional parameter vector. This is analogous to the requirement that there
be no perfect colinearity in a linear model. This is a necessary condition for
identification. Note that the LLN implies that the above expectation is equal
to
2T A@ F
# ë 4
17.3. Consistency
As in the case of GMM, we also simply assume that the conditions for as-
ymptotic normality as in Theorem 22 hold. The only remaining problem is to
determine the form of the asymptotic variance-covariance matrix. Recall that
the result of theh asymptotic normality theorem is
c @c@ F h R
m ° 2T¾4@ F 1 o T=4@ F 2 ¾ 1
T A@ F ³
¾A@]FC z
where 2 T is the almost sure limit of
á V á á V O \f+4@" evaluated at @gFN and
° V$'f5A@ F ? ³ ° VC\f A@ F ?³ 4 R ì o T=4@ F Y
Q6P P
17.4. ASYMPTOTIC NORMALITY 359
f
\f5A@" \g
B U
L UP@V _]
U(1
So
YV \f+4@"
# f \ ]B U
L UP
@" _]
L U×@" Y&
(U 1 V
Evaluating at @gF'
YV \f+4@ F
# f /]U
L U×@ F Y&
(U 1 V
Ñ f f 4
° ° 4
YV \f+4@ F P³ VY\f+4@ F P³ v /]U V
L U×@ F w v J/ U V
L U?@ F w
(U 1 U(1
Noting that
à
f à ° 4 @ F ?³ 4 /
/JU V
L UP@ F
U(1 @
4/
Ñ
° F ° F 4
YV \f+4@ P³ VC'f 4@ ? ³ ©4 /J/]4X
o
=A@ F
Ñ è ) ë 4
T
where the expectation is with respect to the joint density of and /T& Combin-
ing these expressions for 2øT=4@ F and
o
=A@ F C
T and the result of the asymptotic
normality theorem,
h we get
1
c @c@ F h R
¶ Æ ë 4 è · &
m
É
o
We can consistently estimate the variance covariance matrix using
1
(17.4.1) ¶ 4 · è
where is defined as in equation 17.1.1 and
n _ @V r 4 n
_
@V r
è
the obvious estimator. Note the close correspondence to the results for the
linear model.
}C~% Ô e UW e KU Ê
BgU gB U B]U#C²S '"$#%'&(&)&X&
The mean of BgU is e UP as is the variance. Note that e U must be positive. Suppose
that the true mean is
e
UF
}Y~% L U4 F Y
17.6. THE GAUSS-NEWTON ALGORITHM 361
\f+
¿ f ð }C~% L 4U F 2/JU }Y~% L U4
ó
U(1
¿ f ð }C~% f f
L 4U F }C~% L 4U
2 ¿ / U 2Ë# ¿ J/ U ð }Y~ L 4U F }C~% L 4U G
U(1 ó U(1 (U 1 ó
The last term has expectation zero since the assumption that ë7B"U L UW
}Y~ L 4U IFY
implies that ë/JU L UW
R which in turn implies that functions of L U are uncor-
related with /]UP& Applying a strong LLN, and noting that the objective function
is continuous on a compact parameter space, we get
T¾
ë Í ð }C~% L 4@ F }C~% L ©4 G ó 2ë Í }C~% L 4 F
where the last term comes from the fact that the conditional variance of / is the
same as the variance of B& This function is clearly minimized at
F so the
NLS estimator is consistent as long as identification holds. h
E XERCISE 27. Determine the limiting distribution of c jIF h & This
ì> Å
means finding the the specific forms of á z \f+G , 23IFY Y á Ã x and IFC C&
o
áxáxO á x êê
Again, use a CLT as needed, no need to verify that it can be applied.
ê
_
A@ F 28/T&
_
4@" I23=
where = is a combination of the fundamental error term / and the error due
to evaluating the regression function at @ rather than the true value @ F& Take a
first order Taylor’s series approximation around a point @
1 H
_
4@ 1 2 ° V O _ð @ 1 ó ³ ðI@pi@ 1 ó 23=Þ2 approximationerror.
1
where, as above, sA@
V 4@ 1 is the matrix of derivatives of the
O
1
regression function, evaluated at @ and Ý is = plus approximation error from
the truncated Taylor’s series.
The other new element here is 7 A @ÜD @ 1 C& Note that one could esti-
mate 7 simply by performing OLS on the above equation.
Given ]
7 we calculate a new round estimate of @ F as @
72l@ 1 & With
this, take a new Taylor’s series expansion around @ and repeat the
process. Stop when 7
a (to within a specified tolerance).
17.6. THE GAUSS-NEWTON ALGORITHM 363
To see why this might work, consider the above approximation, but evaluated
at the NLS estimator:
_
@ 2
s @ c @p @ h 2sÝ
The OLS estimate of 7
p @ is
@
dc 1 n
7 4 h 4 _i @" r &
by definition of the NLS estimator (these are the normal equations as in equa-
tion 17.1.2, Since 7
when we evaluate at @ updating would stop.
|
B
01I28 U 2/JU
Characteristics of individual: L
[
¡L 4 2cÝ
Latent labor supply:
/
. [
?"40=2
=5 Ä¡ 4 » 2
4@y28/
17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 365
[
L 4@^2sÝ
. [
4*@;2/T&
Assume that
ÎÏ
Ý è è
çR
&
/ è ÐÒ
We assume that the offer wage and the reservation wage, as well as the latent
variable [ are unobservable. What is observed is
. \ . [ ¥ ]
.p [ &
[ ¡L 4 2 residual
using only observations for which ¥ & The problem is that these observa-
tions are those for which . [
¥ or equivalently, È/ ½ 4 @ and
ë \Ý ²/ ½ 4 @<]Z
í
17.7. APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION 366
[
¡L 42Kè/È2
&
¡L 42Kè뼬/ j/ ½ ±4 @V 2
&
A useful result is that for
ÜçR² \J
W
¥ [
tG+[Y
Ù Ô; [
where t¾P>@ and Ö?>@ are the standard normal density and distribution
function, respectively. The quantity on the RHS above is known as the
inverse Mill’s ratio:
± ´ =? [
tGW [
Ô; [
(17.7.1)
L 4©^28è ¾ t 4 @V 2
Ö 4 @"
n L 4 À XÃ O V?Å r 2 &
V?Å
(17.7.2)
sà O
D
where
è& The error term has conditional mean zero, and is uncorrelated
with the regressors L 4 D À sà WO V?V?Å Å & At this point, we can estimate the equation by
Ãs O
NLS.
Heckman showed how one can estimate this in a two step procedure
where first @ is estimated, then equation 17.7.2 is estimated by least
squares using the estimated value of @ to form the regressors. This
is inefficient and estimation of the covariance is a tricky issue. It is
probably easier (and more efficient) just to do MLE.
The model presented above depends strongly on joint normality. There
exist many alternative models which weaken the maintained assump-
tions. It is possible to estimate consistently without distributional as-
sumptions. See Ahn and Powell, Journal of Econometrics, 1994.
CHAPTER 18
Nonparametric inference
368
18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 369
The coefficient ô is the value of the function at
and the slope is the
value of the derivative at
a & These are of course not known. One might try
estimation by ordinary least squares. The objective function is
f
Â
Wô+7C BgU ¹ UW [ &
U(1
The limiting objective function, following the argument we used to get equa-
tions 14.3.1 and 17.2.1 is
¹ T=
)
 È' 2 Â
We may plot the true function and the limit of the approximation to see the
asymptotic bias as a function of :
(The approximating model is the straight line, the true model has curva-
ture.) Note that the approximating model is in general inconsistent, even at
the approximation point. This shows that “flexible functional forms” based
1
All calculations were done using Scientific Workplace.
18.1. POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION 370
/
tE4 Â
t
Good approximation of the elasticity over the range of will require a good
approximation of both
and
4 over the range of & The approximating
elasticity is
¡ ¹ 4 Â ¹
Plotting the true elasticity and the elasticity obtained from the limiting approx-
imating model
The true elasticity is the line that has negative slope for large & Visually we
see that the elasticity is not approximated so well. Root mean squared error in
the approximation of the elasticity is
1Ap
Æ Y / G [ J
& %% ÒgÑ
F É '
an increasing function of the sample size. Here we hold the set of basis func-
tion fixed. We will consider the asymptotic behavior of a fixed model, which
we interpret as an approximation to the estimator’s behavior in finite samples.
Consider the set of basis functions:
ÿ
9 &
Maintaining these basis functions as the sample size increases, we find that the
limiting objective function is minimized at
Substituting these values into ÿ we obtain the almost sure limit of the ap-
proximation
(18.1.1)
=
2 Î\Ï"Ð5 yÆ, É 2a Ð- 2 Î\Ï"Ð # y Æ, Ñ É 2 Ð[ #
 y' 2 Â
(T )
B
L 2/%
18.3. THE FOURIER FUNCTIONAL FORM 374
ª
where is of unknown form and is a dimensional vector. For
The Fourier form, following Gallant (1982), but with a somewhat different pa-
rameterization, may be written as
(18.3.1)
ÿ L @ ÿ
2 L 4 2 Â # L 4% L 2
4½ ß`§¢ÎvÏ"Ð 4§ L G c¸ ßZ§¢Ð[ ~ 4§ L - &
19ß 1
§
(18.3.2) @
ÿK
S [ I4)
¸ Nº ¸ [ û ?4
½,1P1Y
¸1P1C'&'&'&C
½
¸ XN4¨&
We assume that the conditioning variables
L
have each been trans-
formed to lie in an interval that is shorter than # & This is required
to avoid periodic behavior of the approximation, which is desirable
since economic functions aren’t periodic. For example, subtract sam-
ple means, divide by the maxima of the conditioning variables, and
multiply by # º E where º is some positive number less than #
in value.
The Q § ª vectors
are ”elementary multi-indices” which are simply
formed of integers (negative, positive and zero). The Q § , 3
" Y#%\&)&(&) w
are required to be linearly independent, and we follow the convention
18.3. THE FOURIER FUNCTIONAL FORM 375
n Þ r 4
n # µ# #r 4
(18.3.3)
ÿ L ÿ
L 2
(Ô ß`§¢Ð- 4§ L Äs¸ ß`§Î\Ï"Ð ~04§ L - +~
ã @ ^2
§
19ß 1
\ t½ ]
§
of e L ¹ L
. If we have arguments of the (arbitrary) function , use
¹ L to
indicate a certain partial derivative:
à à
à ¹ L
à ¹ L à
1
z >N>N>
e
When is the zero vector,
¹ L
¹ L
. Taking this definition and the last
few equations into account, we see that it is possible to define ?¯ vector
L so that
9
(18.3.5)
ÿ L @ ÿ
L ?4*@ ÿ &
Both the approximating model and the derivatives of the approximat-
ing model are linear in the parameters.
For the approximating model to the function (not derivatives), write
ÿ L @ ÿ
4@ ÿ for simplicity.
The following theorem can be used to prove the consistency of the Fourier
form.
The modification of the original statement of the theorem that has been
made is to set the parameter space N in Gallant and Nychka’s (1987) Theorem
0 to a single point and to state the theorem in terms of maximization rather
than minimization.
This theorem is very similar in form to Theorem 19. The main differences
are:
(3) There is a denseness assumption that was not present in the other the-
orem.
We will not prove this theorem (the proof is quite similar to the proof of theo-
rem [19], see Gallant, 1987) but we will discuss its assumptions, in relation to
the Fourier form as the approximating model.
18.3.1. Sobolev norm. Since all of the assumptions involve the norm ¹
, we need to make explicit what norm we wish to use. We need a norm that
guarantees that the errors in approximation of the functions we are interested
in are accounted for. Since we are interested in first-order elasticities in the
present case, we need close approximation of both the function
and its
4 C throughout the range of & Let < be an open set that con-
first derivative
tains all values of that we’re interested in. The Sobolev norm is appropriate
in this case. It is defined, making use of our notation for partial derivatives, as:
¹ 9 H ;
à
~ Ð ' ; ¹
9 ê ê
ê ê
To see whether or not the function is well approximated by an approxi-
mating model ÿ @ ÿ , we would evaluate
;
L Gs ÿ L @ ÿ 9 H &
We see that this norm takes into account errors in approximating the function
and partial derivatives up to order & If we want to estimate first order elas-
ticities, as is the case in this example, the relevant would be
V& Further-
more, since we examine the Ð ' over <i convergence w.r.t. the Sobolev means
uniform convergence, so that we obtain consistent estimates for all values of &
18.3. THE FOURIER FUNCTIONAL FORM 379
where
is a finite constant. In plain words, the functions must have bounded
partial derivatives of one order higher than the derivatives we seek to estimate.
18.3.3. The estimation space and the estimation subspace. Since in our
case we’re interested in consistent estimation of first-order elasticities, we’ll
define the estimation space as follows:
¹
-
H
Q
T
1 Q
Use a picture here. The rest of the discussion of denseness is provided just for com-
pleteness: there’s no need to study it in detail. To show that ÿ is a dense subset
¹ I1 H ;
of with respect to it is useful to apply Theorem 1 of Gallant (1982),
who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem
as presented by Gallant, with minor notational changes, for convenience of
reference:
T HEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function
¹ [ L be continuously differentiable up to order on an open set containing
18.3. THE FOURIER FUNCTIONAL FORM 381
Therefore
T4
ÿ
so 4
T
ÿ is a dense subset of , with respect to the norm ¹ 1IH
;
.
With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limit-
ing objective function is
(18.3.6) /TKb
Y ;
L G s0 L - J,3
jè î &
ù T ð 1 ó 8 T ð F ó ú
W
¡ ¢
S
F
Y/; n ð 1 L G L ð F L G
L ó r &
W
¡ ¢
S
F ó J3
By the dominated convergence theorem (which applies since the finite bound
used to define -
H Z is dominated by an integrable function), the limit
and the integral can be interchanged, so by inspection, the limit is zero.
18.3.6. Identification. The identification condition requires that for any point
b
in å :VT¾
T¾ Á y
1IH ;
. This condition is clearly
satisfied given that and are once continuously differentiable (by the as-
sumption that defines the estimation space).
Estimation space
;
H
: the function space in the closure of
-
which the true function must lie.
Consistency norm ¹ ;
1IH & The closure of is compact with respect
to this norm.
Estimation subspace ÿ & The estimation subspace is the subset of
that is representable by a Fourier form with parameter @
ÿ & These are
dense subsets of 3&
Sample objective function 'f+4@ ÿ Y the negative of the sum of squares.
By standard arguments this converges uniformly to the
Limiting objective function /T=a
Y which is continuous in and has
a global maximum in its first argument, over the closure of the infinite
union of the estimation subpaces, at
&
As a result of this, first order elasticities
à
L,* à L
L
+ *
are consistently estimated for all L C£<i&
ÿ L @ ÿ
" 4*@ ÿ &
ÿË
ÿ
¤Ä4 G4ÿ B
ÿ
@
where Ô>© is the Moore-Penrose generalized inverse.
– This is used since 4ÿ ÿ may be singular, as would be the case for
large enough when some dummy variables are included.
4 @ ÿ of the unknown function L
. The prediction, is asymptotically
normally distributed:
h
c "4 @ ÿ
h R
m j w é Y
where
w é
) 4ÿ
ÿ
f SUT Ù
v 4,Æ É è w &
Formally, this is exactly the same as if we were dealing with a para-
metric linear model. I emphasize, though, that this is only valid if
grows very slowly as grows. If we can’t stick to acceptable rates, we
should probably use some other method of approximating the small
sample distribution. Bootstrapping is a possibility. We’ll discuss this
in the section on simulation.
18.4. KERNEL REGRESSION ESTIMATORS 385
B]U , UW 28/JUP
where
¬
J
/ U U
&
Ù
The conditional expectation of B given is 0 Y&
By definition of the
conditional expectation, we have
B ¹ B JB
0
Y
¹ Y B [B5 ZJB
where
¹ is the marginal density of H
¹
Y
[B5 ZJB&
This suggests that we could estimate 0 by estimating
¹ and y B B `JB&
18.4. KERNEL REGRESSION ESTIMATORS 386
Y
J ½
k
and ?>@ integrates to ÞH
k Z J
"&
Y
f
fSUT
) > D
fSUT f k
So, the window width must tend to zero, but not too quickly.
To show pointwise consistency of
¹ for
¹ C first consider the ex-
pectation of the estimator (since the estimator is an average of iid
terms we only need to consider the expectation of a representative
term):
n ¹ r
gY f D \( Â f/] ¹ W ZJT%&
Ù
18.4. KERNEL REGRESSION ESTIMATORS 387
D
Change variables as ,[
" Â f% so
¹f"+[ and m m ìà O
f
ì
we obtain
n ¹ r
D ¹ Ç D
Ù
Y
f [
TfV [ I f J [
Yk
W [ ¹ sfg [ ` J [ &
Now, asymptotically,
Yk
W [ ¹ `J [
¹ Y W [ " J [
¹ Y
since Tf
R and y
W [ " J [
by assumption. (Note: that we
can pass the limit through the integral is a result of the dominated
convergence theorem.. For this to hold we need that
¹ Ô>© be dominated
by an absolutely integrable function.
Next, considering the variance of
¹ C we have, due to the iid assump-
tion
D D f UW Â f<] ø
f é n¹ r
f é õ \( D
f
> >
U(1
D f
f M
é S \) WU Â T f/]X
U(1
By the representative term argument, this is
18.4. KERNEL REGRESSION ESTIMATORS 388
D D
> f é n ¹ r
f éËS \( j  f<]X
Also, since é
G we have
Ù Ù
D D D
> f é n ¹ r
f Ù ù \) 8 Â Tf/]¬ ú sf S Ù \( Â f/]¬ $X
Y If D \) Â f/] ¹ `Jss f D Y
f
D
\( 8 Â f<] ¹ W ZJ ø
õ
Y If D \) Â f/] ¹ `Jss f D n ¹ r
Ù
The second term converges to zero:
D ¹
f Ù n r R
by the previous result regarding the expectation and the fact that f R
& Therefore,
) > D é n ¹
) D
f \( 8 Â f/] ¹ W ZJ &
f SUT f r f SUT Y
Using exactly the same change of variables as before, this can be shown
to be
) > D é n ¹
¹ Y \ [ _] JT [ &
fSUT f r
¹
Since both y \ +[Y _] JT([ and are bounded, this is bounded, and
D
f R k by assumption, we have that
since >
é n¹ r R &
Since the bias and the variance both go to zero, we have pointwise
consistency (convergence in quadratic mean implies convergence in
probability).
18.4. KERNEL REGRESSION ESTIMATORS 389
estimator of
B C& The estimator has the same form as the estimator for ¹ C
only with one dimension more:
Y
B [ B aJB
a
f \( D UW Â f1]
Y
B B `JB
BgU f
(U 1
by marginalization of the kernel, so we obtain
0
¹ Y B B `JB
1f fU(1 B]U ÿ¦¥ à ã ã Ê Å p ¡ >§
1f fU(1 ÿ¦¥ à ã ã ¡Ê ¨ Å p ¡ >§
>
>
f
U(f 1 B]U \( UW Â f<] &
¡¨
U(1 \( UW Â f<]
This is the Nadaraya-Watson kernel regression estimator.
18.4.3. Discussion.
18.4. KERNEL REGRESSION ESTIMATORS 390
 G&
A large window width reduces the variance (strong imposition of flat-
ness), but increases the bias.
A small window width reduces the bias, but makes very little use of
information except points that are in a small neighborhood of U?& Since
relatively little information is used, the variance is large when the win-
dow width is small.
The standard normal density is a popular choice for Ô& and B C
[
This same principle can be used to choose w and « in a Fourier form model.
The previous discussion suggests that a kernel density estimator may easily
be constructed. We have already seen how joint densities may be estimated.
If were interested in a conditional density, for example of B conditional on ,
then the kernel estimate of the conditional density is simply
B
K ã ¹
f ÿ à ¥ ÊÅ p > H ÊÅ p > §
f 1 U(1 Ã K ÿ¦K ¥ ¡ ¡> ¨
à ã ã ¡
f Ê Å p >§
f 1 U(1 Ã ã ¡ ã > ¨ ¡
fU(1 [ f\¬B jB]UW Â fTJ WU Â f/]
Tf U(1 \) UW Â 1f ]
where we obtain the expressions for the joint and marginal densities from the
section on kernel regression.
this link . See also Cameron and Johansson, Journal of Applied Econometrics,
V. 12, 1997.
MLE is the estimation method of choice when we are confident about spec-
ifying the density. Is is possible to obtain the benefits of MLE when we’re not
so confident about the specification? In part, yes.
Suppose we’re interested in the density of B conditional on (both may be
vectors). Suppose that the density
B Y t0 is a reasonable starting approxi-
mation to the true density. This density can be reshaped by multiplying it by
a squared polynomial. The new density is
¹ B 0 B Yt0
t 0
\6TB Y
6
6T YtI 0
where
¹ 6TB 0
6 D B D
D
F
and
6T Yt 0 is a normalizing factor to make the density integrate (sum) to
one.
¹
Because 6 B 0 Â1 6 Yt`, is a homogenous function of @ it is necessary
to impose a normalization:
F is set to 1. The normalization factor
6TWt 0 is
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 393
T
Ù ¤ B B t`,
/;
K F
T ¹ 6B 0 I] <;
B 6Wt 0 B t0
\
F ©
K
T 6 © 6 /; ©
D 1Â
B
B 0
t D B B T6 Wt 0
F D I
6© F F © T 5D
<;
©
6
K
D H B B t I 1Â 6 òtI 0
D
F © K ©F
F
6 © 6 D D Â<
6Wt 0 C&
D F
F
À
By setting
a we get that the normalizing factor is
18.6.1
6 © 6 © ©
(18.6.1)
6TWt 0
D D D
F F
Recall that
F is set to 1 to achieve identification. The
in equation 18.6.1
are the raw moments of the baseline density. Gallant and Nychka (1987) give
conditions under which such a density may be treated as correctly specified,
asymptotically. Basically, the order of the polynomial must increase as the
sample size increases. However, there are technicalities.
Similarly to Cameron and Johannson (1997), we may develop a negative bi-
nomial polynomial (NBP) density for count data. The negative binomial base-
line density may be written (see equation as
/;
B t0
Ã
B2¹Ã : Æ : Æ 3
e
K
Ã
B2J 4: :32 e
É,ª : 2
e
É
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 394
where t
S e
:yX ²
e ¥u
and : ¥R . The usual means of incorporating condi-
tioning variables L is the parameterization e²
º Í O x . When :
eÂ( we have
the negative binomial-I model (NB-I). When :
Â( we have the negative
binomial-II (NP-II) model. For the NB-I density, é^¤
e 2 e . In the case of
the NB-II model, we have é^¤
e 2 e . For both forms, W¤Ü g
e.
Ù
The reshaped density, with normalization to sum to one, is
¹ 6 B 0 I] Ã Bs2¹:
B tI 0
Ã Æ Æ 3
e
É &
<; \ : K
(18.6.2)
6TòtI 0 B2J b:7 :32
à e
É ª : 2
e
(18.6.3)
´ ;
¬-
: ðe º Ue ¹
2 : ó &
ª ª
To illustrate, here are the first through fourth raw moments of the NB density,
calculated using
MuPAD, which is a Computer Algebra System that is free for personal use,
and then programmed in Ox. These are the moments you would need to use a
second order polynomial
#V .
if(k_gam >= 1)
{
m[][0] = lambda;
m[][1] = (lambda .* (lambda + psi + lambda .* psi))
Econometrics/ psi;
}
if(k_gam >= 2)
{
18.6. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 395
For
%' the analogous formulae are impressively (i.e. several pages) long.
This is an example of a model that would be difficult ot formulate without the
help of a program like MuPAD.
It is possible that there is conditional heterogeneity such that the appropri-
ate reshaping should be more local. This can be accomodated by allowing the
D parameters to depend upon the conditioning variables, for example using
polynomials.
Gallant and Nychka, Econometrica, 1987 prove that this sort of density can
approximate a wide variety of densities arbitrarily well as the degree of the
polynomial increases with the sample size. This approach is not without its
drawbacks: the sample objective function can have an extremely large number
of local maxima that can lead to numeric difficulties. If someone could figure
out how to do in a way such that the sample objective function was nice and
smooth, they would probably get the paper published in a good journal. Any
ideas?
Here’s a plot of true and the limiting SNP approximations (with the order
of the polynomial fixed) to four different count data densities, which variously
exhibit over and underdispersion, as well as excess zeros. The baseline model
is a negative binomial density.
18.7. Examples
18.7.3. Kernel regression. We will use the same data generating process as
for the above examples of Fourier Form models. The program kernelreg1.ox
allows you to experiment with different sample sizes, window widths. For a
sample size of
Ò V , here are several plots with different window widths.
Note that too small a window-width (ww = 0.1) leads to a very irregular fit,
while setting the window width too high leads to too flat a fit.
Cross Validation
18.7.4. Kernel density estimation. The second DGP second DGP gener-
ates @ random variables, then estimates their density using kernel density
estimation. The program kerneldens.ox allows you to experiment using dif-
ferent sample sizes, kernels, and window widths. The following figure shows
18.7. EXAMPLES 400
¹ 6 B 0 I] Ã Bs2¹:
B tI 0
Ã Æ Æ 3
e
É
<; \ : K
(18.7.1)
6TòtI 0 B2J b:7 :32
à e
É ª : 2
e
¹ 67B 0
6 D B D &
(18.7.2) D
F
The normalization factor is
6 © 6 © ©
(18.7.3)
6TòtI 0
D D D &
F F
To implement this using a polynomial of order we need the raw moments
of the negative binomial density up to order #C . I couldn’t find the NB moment
generating function anywhere, so a solution is to calculate it using a Computer
Algebra System (CAS). Rather than using one of the expensive alternatives, we
can try out MuPAD, which can be downloaded and is free (in the sense of free
18.7. EXAMPLES 401
beer) for personal use. It is installed on the Linux machines in the computer
room, and if you like you can install the Windows version, too.
The file negbinSNP.mpd, if run using the the command mupad negbinSNP.mpd,
will give you the output that follows:
/ a \a / b \y
gamma(a + y) | ----- | | ----- |
\ a + b / \ a + b /
----------------------------------
gamma(a) gamma(y + 1)
/ a \a
| ----- |
\ a + b /
---------------------
/ a + b - b exp(t) \a
| ---------------- |
\ a + b /
b
18.7. EXAMPLES 403
5 4 4 5 2 3 3 2 2 4
(24 b + 60 a b + a b + 50 a b + 50 a b + 15 a b + 110 a b +
3 3 4 2 2 5 3 4 4 3 3 5
75 a b + 15 a b + 35 a b + 60 a b + 25 a b + 10 a b +
4 4 4 5 4
10 a b + a b ) / a
" t3 = a**-4*(b**5*24.0D0+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5+50.0D
n ~(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4+75.0D0*a**3*b**3
n ~5.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5+60.0D0*a**3*b**4+25.0D0*a**4*
n ~*3+10.0D0*a**3*b**5+10.0D0*a**4*b**4+a**4*b**5)"
"\\frac{24\\, b^5 + 60\\, a\\, b^4 + a^4\\, b + 50\\, a\\, b^5 + 50\\,
\\, b^3 + 15\\, a^3\\, b^2 + 110\\, a^2\\, b^4 + 75\\, a^3\\, b^3 + 15\
a^4\\, b^2 + 35\\, a^2\\, b^5 + 60\\, a^3\\, b^4 + 25\\, a^4\\, b^3 + 1
18.7. EXAMPLES 404
a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)
>> quit
Once you get expressions for the moments and the double sums, you can
use these to program a loglikelihood function in Ox, without too much trouble.
The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let
you estimate NegBinSNP models for the MEPS data. The estimation results
for OBDV using
ª
# and a NB-I baseline model are
***********************************************************************
MEPS data, OBDV
18.7. EXAMPLES 405
negbin_snp_obj results
Strong convergence
Observations = 500
Standard Errors
t-Stats
Information Criteria
***********************************************************************
Note that the CAIC and BIC are lower for this model than for the ordinary
NB-I model. NOTE: density functions formed in this way may have MANY
local maxima, so you need to be careful before accepting the results of a casual
run. To guard against having converged to a local maximum, one can try using
multiple starting values, or one could try simulated annealing as an optimiza-
tion method. To do this, copy maxsa.ox and maxsa.h into your working direc-
tory, and then use the program EstimateNBSNP2.ox to see how to implement
SA estimation of the reshaped negative binomial model. For more details on
18.7. EXAMPLES 407
the Ox implementation of SA, see Charles Bos’ page. Note - in my own experi-
ence, using a gradient-based method such as BFGS with many starting values
is as successful as SA, and is usually faster. Perhaps I’m not using SA as well
as is possible... YMMV.
CHAPTER 19
Simulation-based estimation
19.1. Motivation
B * [ * ^28/ *
(19.1.1) / * çRj ²
B ¬« B [
This mapping is such that each element of B is either zero or one (in
some cases only one element will be one).
Define
wy*0
aw B *
SJB [ B *
« B [ $X
Suppose random sampling of B * [ * . In this case the elements of B *
may not be independent of one another (and clearly are not if ² is not
diagonal). However, B * is independent of B ß , !y
í "&
where
A1 p y / 4 ²
1/
¢¬/T ò# $ 1®
² p ² } Y %
~
È # É
is the multivariate normal density of an
´ -dimensional random vec-
tor. The log-likelihood function is
½
ß¼
ß ^28/ ß
B ß¼
\ ½ ß;¥ ½
D ?ê,Q i
C $Q:í +]
½
ßU
ü ß U¬j/ ß Uò
C S"Y#TX
C S"Y#%'&(&(&)iX
19.1. MOTIVATION 411
Then
B
[ ½
c½1
òü Küj1- ?2/ j
/1
^28/
B \ B [ ¥ ]V
The mean and variance of the Poisson distribution are both equal to e H
ë7B
^
é B5
ge &
This ensures that the mean is positive (as it must be). Estimation by ML is
straightforward.
Often, count data exhibits “overdispersion” which simply means that
If this is the case, a solution is to use the negative binomial distribution rather
than the Poisson. An alternative is to introduce a latent variable that reflects
heterogeneity into the specification:
where ]* has some specified density with support (this density may depend
Û
on additional parameters). Let J,3 V* be the density of g* &
In some cases, the
marginal density of B
:
B
B *
gY }Y~ @\ }C~% * 2 ]* I]u\ Y} ~% * ^2 g* _ ] K ]*
J,3
Ë ³
B *
will have a closed-form solution (one can derive the negative binomial distri-
bution in the way if has an exponential distribution), but often this will not
be possible. In this case, simulation is a means of calculating Ë
B
!P Y which
is then used to do ML estimation. This would be an example of the Simulated
Maximum Likelihood (SML) estimation.
In this case, since there is only one latent variable, quadrature is proba-
bly a better choice. However, a more flexible model with heterogeneity
19.1. MOTIVATION 413
would allow all parameters (not just the constant) to vary. For exam-
ple
:
B
B *
Y }C~% \@ }C~% * * _ ],\ }Y~ ¬ * * I] K *
J,3
Ë ³
B *
entails a e
) * -dimensional integral, which will not be evaluable
ÿ
To estimate a model of this sort, we typically have data that are assumed to be
observations of BgU in discrete points BT1Y]B
'&)&(&ÌB ¯ & That is, though BgU is a continu-
ous process it is observed in discrete time.
To perform inference on @% direct ML or GMM estimation is not usually fea-
sible, because one cannot, in general, deduce the transition density
BU BgU 1$@" Y&
This density is necessary to evaluate the likelihood function or to evaluate mo-
ment conditions (which are based upon expectations with respect to this den-
sity).
The discretization induces a new parameter, t (that is, the t0F which
defines the best approximation of the discretization to the actual (un-
known) discrete time version of the model is not equal to VF
@ which is
the true parameter value). This is an approximation, and as such “ML”
estimation of t (which is actually quasi-maximum likelihood, QML)
based upon this equation is in general biased and inconsistent for the
original parameter, @ . Nevertheless, the approximation shouldn’t be
too bad, which will be useful, as we will see.
The important point about these three examples is that computational
difficulties prevent direct application of ML, GMM, etc. Nevertheless
19.2. SIMULATED MAXIMUM LIKELIHOOD (SML) 415
f )
a
V
-
~ \f+4@"
(U 1 B]U =U?@V
@
®
U¶
where BgU ×U @" is the density function of the observation. When BVU =UP@V
does not have a known closed form, @ is an infeasible estimator. However,
®
it may be possible to define a random function such that
³ f
¡
V[Z
~ \f5A@V
Ü
* 1 ý B]UP[=UP@V
@
®
B ß¼
ø\^½ ß;¥ ½
D YQªCiYQ
í +]
} * ß7 ¶C1 B ý * ß ¶
The draws of / ý * are draw only once and are used repeatedly during
the iterations used to find and ² & The draws are different for each !-&
If the / ý * are re-drawn at every iteration the estimator will not converge.
The log-likelihood function with this simulator is a discontinuous func-
tion of and ² & This does not cause problems from a theoretical point
19.2. SIMULATED MAXIMUM LIKELIHOOD (SML) 417
B ý * ß7
D 9
~ ½ @* D r h 2¡& Ò j n ½ * ß¼
b
c w n ½ *ß b D 9
~ ½ @* D r
1 1
where w is a large positive number. This approximates a step
function such that Bý * ß is very close to zero if ½
*ß is not the max-
imum, and ½
* ߣ
if it is the maximum. This makes B ý * ß a con-
tinuous function of and ²
so that ý * ß and therefore ²
will be continuous and differentiable. Consistency requires that
w , R 6 k
so that the approximation to a step function becomes
arbitrarily close as the sample size increases. There are alternative
methods (e.g., Gibbs sampling) that may work better, but this is
too technical to discuss here.
d
To solve to log(0) problem, one possibility is to search the web for the
slog function. Also, increase if this is a serious problem.
d
19.3. METHOD OF SIMULATED MOMENTS (MSM) 418
2) if
fSUT A1 p  h
ge e
a finite constant, then
c@³ ®
i@ F h R
m j o 1 A @ F -
Suppose we have a DGP B @" which is simulable given @ , but is such that
the density of B is not calculable.
Once could, in principle, base a GMM estimator upon the moment condi-
tions
^U-A@"
\ BgUP UW GKQ0 U?@V _]N\U
where
Q, U?@V
Yk B]U? U IB PU @V `JB
19.3. METHOD OF SIMULATED MOMENTS (MSM) 419
\U is a vector of instruments in the information set and B UP@V is the density
of B conditional on U?& The problem is that this density is not available.
However Q, U?@V is readily simulated using
d ²
}
Qs U?
@"
}B U ¶ WU
¶C1 d
(19.3.1)
³ U-4@"
n B]UP UW G }s
Q U?
@" r \U
} A@V
: f ^
* 1 ³ U-A@" d ²
f
v B]UP UW G ,Q B} ¶ WU w \U
(19.3.2)
U
* 1 C¶ 1
with which we form the GMM criterion and estimate as usual. Note
that the unbiased simulator Q, B} U ¶ U appears linearly within the sums.
19.3.1. Properties. Suppose that the optimal weighting matrix is used. Mc-
Fadden (ref. above) and Pakes and Pollard (refs. above) show that the asymp-
totic distribution of the MSM estimator is very similar to that of the infeasible
GMM estimator. In particular, assuming that the optimal weighting matrix is
d
19.3. METHOD OF SIMULATED MOMENTS (MSM) 420
(19.3.3) c@® ³
®
i@ F h R
m Æ, 2 ð
É T
²
1 T4 ó 1 É
È
where T
²
1 T4 1 is the asymptotic variance of the infeasible GMM esti-
d
mator.
That is, the asymptotic variance is inflated by a factor ,2² Â & For this
d
reason the MSM estimator is not fully asymptotically efficient relative
d
to the infeasible GMM estimator, for finite, but the efficiency loss is
d
small and controllable, by setting reasonably large.
The estimator is asymptotically unbiased even for
V& This is an
advantage relative to SML.
If one doesn’t use the optimal weighting matrix, the asymptotic varcov
d
ý =
T A@"
Y ° Q, @ F GKQ, @" ³ + ` J,3 C&
If you look at equation 19.3.5 a bit, you will see why the variance in-
?Z2 1 .
²
flation factor is
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 422
The choice of which moments upon which to base a GMM estimator can
have very pronounced effects upon the efficiency of the estimator.
^U-4@" V U-A@ ± WU
The efficient method of moments (EMM) (see Gallant and Tauchen (1996),
“Which Moments to Match?”, ECONOMETRIC THEORY, Vol. 12, 1996, pages
657-681) seeks to provide moment conditions that closely mimic the score vec-
tor. If the approximation is very good, the resulting estimator will be very
nearly fully efficient.
The DGP is characterized by random sampling from the density
BgU UP@ F
U-A@ F
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 423
We can define an auxiliary model, called the “score generator”, which sim-
ply provides a (misspecified) parametric density
B ×U e
UÔ e
This density is known up to a parameter e
& We assume that this den-
sity function is calculable. Therefore quasi-ML estimation is possible.
Specifically,
f
a
V-´
~ \f+ e
) -U e Y&
e
(U 1
X ) B]U U× e .
e
After determining we can calculate the score functions
The important point is that even if the density is misspecified, there is
a pseudo-true e
F for which the true expectation, taken with respect to
the true but unknown density of B'B PU @ F C and then marginalized
over is zero:
e
F H"ë « ë ; « ° M B e F ³
)
« B e F ¬B
@ F ZJBÚJ,3
a
Y
« Y
;
U(1
These moment conditions are not calculable, since EU-A@V is not avail-
able, but they are simulable using
d ²
f B} ¶ UP e
^³ f5A@T
e
U
U(1 ¶C1
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 424
¶ ª A @V C
where B ý U is a draw from holding U fixed. By the LLN and
the fact that e converges to e F ,
T A@ F F
} = e
a &
This is not the case for other values of @ , assuming that e F is identified.
The advantage of this procedure is that if
BVU U? e closely approx-
imates B ×U @V C then
} f5A@T e
will closely approximate the optimal
moment conditions which characterize maximum likelihood estima-
tion, which is fully efficient.
If one has prior information that a certain density approximates the
data well, it would be a good choice for
Ô>@ Y&
If one has no density in mind, there exist good ways of approximating
unknown distributions parametrically: Philips’ ERA’s (Econometrica,
1983) and Gallant and Nychka’s (Econometrica, 1987) SNP density es-
timator which we saw before. Since the SNP density is consistent, the
efficiency of the indirect estimator is the same as the infeasible ML
estimator.
19.4.1. Optimal weighting matrix. I will present the theory for finite,
d d
and possibly small. This is done because it is sometimes impractical to esti-
mate with very large. Gallant and Tauchen give the theory for the case of
d
so large that it may be treated as infinite (the difference being irrelevant given
the numerical precision of a computer). The theory for the case of infinite
follows directly from the results presented here.
19.4. EFFICIENT METHOD OF MOMENTS (EMM) 425
The moment condition $\tilde{m}(\theta,\hat{\lambda})$ depends on the pseudo-ML estimate $\hat{\lambda}$. We
can apply Theorem 22 to conclude that
\[ (19.4.2)\qquad\sqrt{n}\left(\hat{\lambda}-\lambda^{0}\right)\overset{d}{\rightarrow}N\left[0,\mathcal{J}(\lambda^{0})^{-1}\mathcal{I}(\lambda^{0})\mathcal{J}(\lambda^{0})^{-1}\right]. \]
If the density $f\left(y_{t}\mid x_{t},\hat{\lambda}\right)$ were in fact the true density $p\left(y\mid x_{t},\theta\right)$, then $\hat{\lambda}$ would
be the maximum likelihood estimator, and $\mathcal{J}(\lambda^{0})^{-1}\mathcal{I}(\lambda^{0})$ would be an identity
matrix, due to the information matrix equality. In the present case, however,
$f$ is only an approximation to $p$, so there is no cancellation. Recall that
$\mathcal{J}(\lambda^{0})\equiv\operatorname{plim}\frac{\partial^{2}}{\partial\lambda\partial\lambda'}s_{n}(\lambda^{0})$; comparing with the moment conditions in Equation
19.4.1, we see that
\[ \mathcal{J}(\lambda^{0})=D_{\lambda'}m\left(\theta^{0},\lambda^{0}\right). \]
As in Theorem 22,
\[ \mathcal{I}(\lambda^{0})=\lim_{n\rightarrow\infty}E\left[n\,\frac{\partial s_{n}(\lambda)}{\partial\lambda}\Big|_{\lambda^{0}}\frac{\partial s_{n}(\lambda)}{\partial\lambda'}\Big|_{\lambda^{0}}\right]. \]
In this case, this is simply the asymptotic variance covariance matrix of the
moment conditions, $\Omega$. Now take a first order Taylor's series approximation to
$\sqrt{n}\,\tilde{m}_{n}(\theta^{0},\hat{\lambda})$ about $\lambda^{0}$:
\[ \sqrt{n}\,\tilde{m}_{n}(\theta^{0},\hat{\lambda})=\sqrt{n}\,\tilde{m}_{n}(\theta^{0},\lambda^{0})+\sqrt{n}\,D_{\lambda'}\tilde{m}(\theta^{0},\lambda^{0})\left(\hat{\lambda}-\lambda^{0}\right)+o_{p}(1). \]
First consider $\sqrt{n}\,\tilde{m}_{n}(\theta^{0},\lambda^{0})$. It is straightforward but somewhat tedious to
show that the asymptotic variance of this term is $\frac{1}{H}\mathcal{I}_{\infty}(\lambda^{0})$.
Next consider the second term $\sqrt{n}\,D_{\lambda'}\tilde{m}(\theta^{0},\lambda^{0})\left(\hat{\lambda}-\lambda^{0}\right)$. Note that
$D_{\lambda'}\tilde{m}_{n}(\theta^{0},\lambda^{0})\overset{a.s.}{\rightarrow}\mathcal{J}(\lambda^{0})$, so we have
\[ \sqrt{n}\,D_{\lambda'}\tilde{m}(\theta^{0},\lambda^{0})\left(\hat{\lambda}-\lambda^{0}\right)=\sqrt{n}\,\mathcal{J}(\lambda^{0})\left(\hat{\lambda}-\lambda^{0}\right),\ a.s. \]
But noting Equation 19.4.2,
\[ \sqrt{n}\,\mathcal{J}(\lambda^{0})\left(\hat{\lambda}-\lambda^{0}\right)\overset{a}{\sim}N\left[0,\mathcal{I}(\lambda^{0})\right]. \]
Now, combining the results for the first and second terms,
\[ \sqrt{n}\,\tilde{m}_{n}(\theta^{0},\hat{\lambda})\overset{a}{\sim}N\left[0,\left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^{0})\right]. \]
Suppose that $\hat{\mathcal{I}}(\lambda^{0})$ is a consistent estimator of the asymptotic variance-covariance
matrix of the moment conditions. This may be complicated if the score gener-
ator is a poor approximator, since the individual score contributions may not
have mean zero in this case (see the section on QML). Even if this is the case,
the individual means can be calculated by simulation, so it is always possible
to consistently estimate $\mathcal{I}(\lambda^{0})$ when the model is simulable. On the other hand,
if the score generator is taken to be correctly specified, the ordinary estimator
of the information matrix is consistent. Combining this with the result on the
efficient GMM weighting matrix in Theorem 25, we see that defining $\hat{\theta}$ as
\[ \hat{\theta}=\arg\min_{\Theta}\;m_{n}(\theta,\hat{\lambda})'\left[\left(1+\frac{1}{H}\right)\hat{\mathcal{I}}(\lambda^{0})\right]^{-1}m_{n}(\theta,\hat{\lambda}) \]
gives the GMM estimator with the efficient choice of weighting matrix.
19.4.2. Asymptotic distribution. With the efficient weighting matrix,
\[ \sqrt{n}\left(\hat{\theta}-\theta^{0}\right)\overset{d}{\rightarrow}N\left[0,\left(D_{\infty}\left[\left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^{0})\right]^{-1}D_{\infty}'\right)^{-1}\right],
\]
where
\[ D_{\infty}=\lim_{n\rightarrow\infty}E\left[D_{\theta}\,m_{n}'\left(\theta^{0},\lambda^{0}\right)\right]. \]
19.4.3. Diagnostic testing. The fact that
\[ \sqrt{n}\,m_{n}(\theta^{0},\hat{\lambda})\overset{a}{\sim}N\left[0,\left(1+\frac{1}{H}\right)\mathcal{I}(\lambda^{0})\right] \]
implies that
\[ n\,m_{n}(\hat{\theta},\hat{\lambda})'\left[\left(1+\frac{1}{H}\right)\hat{\mathcal{I}}(\hat{\lambda})\right]^{-1}m_{n}(\hat{\theta},\hat{\lambda})\overset{a}{\sim}\chi^{2}(q), \]
where $q$ is $\dim(\lambda)-\dim(\theta)$, since without $\dim(\theta)$ moment conditions the model
is not identified, so testing is impossible. One test of the model is simply based
on this statistic: if it exceeds the $\chi^{2}(q)$ critical point, something may be wrong
(the small sample performance of this sort of test would be a topic worth in-
vestigating).
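A minimal Octave sketch of this test, assuming the moment vector, estimated information matrix and dimensions are already available (the function and argument names are illustrative, and chi2cdf requires Octave's statistics functions):

# Sketch: EMM overidentification test.
# m: moment vector m_n(theta_hat, lambda_hat); I_hat: estimated I(lambda_hat)
# n: sample size; H: number of simulations; k: dim(theta)
function emm_test(m, I_hat, n, H, k)
  stat = n * m' * inverse((1 + 1/H) * I_hat) * m;
  q = rows(m) - k;               # degrees of freedom: dim(lambda) - dim(theta)
  pvalue = 1 - chi2cdf(stat, q);
  printf("test statistic: %f, asymptotic p-value: %f\n", stat, pvalue);
endfunction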
Information about what is wrong can be gotten from the pseudo-t statistics:
\[ \left(\operatorname{diag}\left[\left(1+\frac{1}{H}\right)\hat{\mathcal{I}}(\hat{\lambda})\right]\right)^{-1/2}\sqrt{n}\,m_{n}(\hat{\theta},\hat{\lambda}) \]
can be used to test which moments are not well modeled. Since these
moments are related to parameters of the score generator, which are
usually related to certain features of the model, this information can be
used to revise the model. These aren't actually distributed as $N(0,1)$,
since $\sqrt{n}\,m_{n}(\theta^{0},\hat{\lambda})$ and $\sqrt{n}\,m_{n}(\hat{\theta},\hat{\lambda})$ have different distributions (that of
$\sqrt{n}\,m_{n}(\hat{\theta},\hat{\lambda})$ is somewhat more complicated). It can be shown that the
pseudo-t statistics are biased toward nonrejection. See Gourieroux et
al. or Gallant and Long, 1995, for more details.
19.5. Example: estimation of stochastic differential equations
The model is simulated over $\theta$, and the scores of the auxiliary model are calculated and averaged over the simulations:
\[ \tilde{m}_{n}\left(\theta,\hat{\phi}\right)=\frac{1}{N}\sum_{i=1}^{N}m_{n}^{i}\left(\theta,\hat{\phi}\right). \]
$\hat{\theta}$ is chosen to set the simulated scores to zero:
\[ \tilde{m}_{n}\left(\hat{\theta},\hat{\phi}\right)\equiv0. \]
This is only one method of using indirect inference for estimation of differ-
ential equations. There are others (see Gallant and Long, 1995 and Gourieroux
et al.). Use of a series approximation to the transitional density as in Gal-
lant and Long is an interesting possibility since the score generator may have
a higher dimensional parameter than the model, which allows for diagnostic
testing. In the method described above the score generator's parameter $\phi$ is of
the same dimension as is $\theta$, so diagnostic testing is not possible.
CHAPTER 21
Introduction to Octave
Why is Octave being used here, since it’s not that well-known by econome-
tricians? Well, because it is a high quality environment that is easily extensible,
uses well-tested and high performance numerical libraries, it is licensed under
the GNU GPL, so you can get it for free and modify it if you like, and it runs
on GNU/Linux, Mac OSX and Windows systems. It's also quite easy to
learn.
21.1. Getting started
Get the bootable CD, as was described in Section 1.3. Then burn the image,
and boot your computer with it. This will give you this same PDF file, but with
all of the example programs ready to run. The editor is configured with a macro
to execute the programs using Octave, which is of course installed. From this
point, I assume you are running the CD (or sitting in the computer room across
the hall from my office), or that you have configured your computer to be able
to run the *.m files mentioned below.
The objective of this introduction is to learn just the basics of Octave. There
are other ways to use Octave, which I encourage you to explore. These are just
some rudiments. After this, you can look at the example programs scattered
throughout the document (and edit them, and run them) to learn more about
how Octave can be used to do econometrics. Students of mine: your problem
sets will include exercises that can be done by modifying the example pro-
grams in relatively minor ways. So study the examples!
21.2. A short introduction
Octave can be used interactively, or it can be used to run programs that are
written using a text editor. We’ll use this second method, preparing programs
with NEdit, and calling Octave from within the editor. The program first.m
gets us started. To run this, open it up with NEdit (by finding the correct
file inside the /home/knoppix/Desktop/Econometrics folder and click-
ing on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell
menu (see Figure 21.2.1).
Note that the output is not formatted in a pleasing way. That’s because
printf() doesn't automatically start a new line. Edit first.m so that the
8th line reads printf("hello world\n"); and re-run the program.
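For example (a minimal sketch, not the actual contents of first.m):

printf("hello world");   # output runs together with what follows
printf("hello world\n"); # the \n starts a new line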
We need to know how to load and save data. The program second.m
shows how. Once you have run this, you will find the file "x" in the directory
Econometrics/Include/OctaveIntro/. You might have a look at it with
NEdit to see Octave’s default format for saving data. Basically, if you have
data in an ASCII text file, named for example ”myfile.data”, formed of
numbers separated by spaces, just use the command ”load myfile.data”.
After having done so, the matrix ”myfile” (without extension) will contain
the data.
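For example, a minimal sketch of the save/load cycle just described (the file name is illustrative):

x = rand(5, 3);
save -ascii myfile.data x  # numbers separated by spaces, ASCII text
clear x
load myfile.data           # creates the matrix "myfile" (no extension)
disp(myfile)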
Please have a look at CommonOperations.m for examples of how to do
some basic things in Octave. Now that we’re done with the basics, have a look
at the Octave programs that are included as examples. If you are looking at
the browsable PDF version of this document, then you should be able to click
on links to open them. If not, the example programs are available here and the
support files needed to run these are available here. Those pages will allow
you to examine individual files, out of context. To actually use these files (edit
and run them), you should go to the home page of this document, since you
will probably want to download the pdf version together with all the support
files and examples. Or get the bootable CD.
There are some other resources for doing econometrics with Octave. You
might like to check the article Econometrics with Octave and the Econometrics Toolbox,
which is for Matlab, but much of which could be easily used with Octave.
21.3. If you're running a Linux installation...
Then to get the same behavior as found on the CD, you need to:
Get the collection of support programs and the examples, from the
document home page.
Put them somewhere, and tell Octave how to find them, e.g., by putting
a link to the MyOctaveFiles directory in /usr/local/share/octave/site-m
Make sure nedit is installed and configured to run Octave and use
syntax highlighting. Copy the file /home/econometrics/.nedit
from the CD to do this. Or, get the file NeditConfiguration and save
it in your $HOME directory with the name ”.nedit”. Not to put too
fine a point on it, please note that there is a period in that name.
Associate *.m files with NEdit so that they open up in the editor when
you click on them. That should do it.
CHAPTER 22
Reading: [3, Chapter 1].
22.1. Notation for differentiation
Let $s(\theta):\Re^{K}\rightarrow\Re$ be a real valued function of the $K$-vector $\theta$. Then $\frac{\partial s(\theta)}{\partial\theta}$ is
organized as a $K$-vector,
\[ \frac{\partial s(\theta)}{\partial\theta}=\begin{bmatrix}\frac{\partial s(\theta)}{\partial\theta_{1}}\\ \frac{\partial s(\theta)}{\partial\theta_{2}}\\ \vdots\\ \frac{\partial s(\theta)}{\partial\theta_{K}}\end{bmatrix}. \]
Let $f(\theta):\Re^{K}\rightarrow\Re^{n}$ be an $n$-vector valued function of the $K$-vector $\theta$. Let $f(\theta)'$
be the $1\times n$ valued transpose of $f$. Then
\[ \left(\frac{\partial}{\partial\theta}f(\theta)'\right)'=\frac{\partial}{\partial\theta'}f(\theta). \]
22.2. CONVERGENGE MODES 437
EXERCISE 34. For $A$ a $K\times K$ matrix and $x$ a $K$-vector, show that
\[ \frac{\partial x'Ax}{\partial x}=Ax+A'x. \]
EXERCISE 35. For $x$ and $\beta$ both $K$-vectors, show that
\[ \frac{\partial\exp(x'\beta)}{\partial\beta}=\exp(x'\beta)\,x. \]
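The identity in Exercise 34 is easy to check numerically in Octave with central finite differences (a sketch; the dimension, step size and random draws are arbitrary):

# Numerically verify d(x'Ax)/dx = (A + A')x for a random A and x.
K = 4;
A = rand(K); x = rand(K, 1);
g = zeros(K, 1); h = 1e-6;
for i = 1:K
  e = zeros(K, 1); e(i) = h;
  g(i) = ((x+e)'*A*(x+e) - (x-e)'*A*(x-e)) / (2*h); # central difference
endfor
disp(norm(g - (A + A')*x)) # should be near zero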
22.2. Convergence modes
Real-valued sequences and functions: consider a sequence of functions $\left\{ f_{n}(\omega)\right\}$
where $f_{n}:\Omega\rightarrow T\subseteq\Re$. $\Omega$ may be an arbitrary set. In econometrics we typically
deal with stochastic sequences: a sequence of
random variables $\left\{ X_{n}(\omega)\right\}$ is a collection of such mappings, i.e., each $X_{n}(\omega)$ is
a random variable with respect to the probability space $\left(\Omega,\mathcal{F},P\right)$. For example,
given the model $y=X\beta^{0}+\varepsilon$, the OLS estimator
$\hat{\beta}_{n}=\left(X'X\right)^{-1}X'y$, where $n$
is the sample size, can be used to form a sequence of random vectors $\left\{ \hat{\beta}_{n}\right\}$.
A number of modes of convergence are in use when dealing with sequences
of random variables. Several such modes of convergence should already be
familiar: convergence in probability is written $X_{n}\overset{p}{\rightarrow}X$ or $\operatorname{plim}X_{n}=X$;
almost sure convergence is written $X_{n}\overset{a.s.}{\rightarrow}X$.
One can show that
\[ X_{n}\overset{a.s.}{\rightarrow}X\;\Longrightarrow\;X_{n}\overset{p}{\rightarrow}X. \]
DEFINITION 42. [Convergence in distribution] Let the r.v. $X_{n}$ have distribu-
tion function $F_{n}$ and the r.v. $X$ have distribution function $F$. If $F_{n}\rightarrow F$ at
every continuity point of $F$, then $X_{n}$ converges in distribution to $X$.
For example, given the model above,
\[ \hat{\beta}_{n}=\beta^{0}+\left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right), \]
and $X'\varepsilon/n\overset{a.s.}{\rightarrow}0$ by a SLLN. Note that this term is not a function of the parameter
$\beta$. This easy proof is a result of the linearity of the model, which allows us to
express the estimator in a way that separates parameters from random func-
tions. In general, this is not possible. We often deal with the more complicated
situation where the stochastic sequence depends on parameters in a manner
that is not reducible to a simple sequence of random variables. In this case,
we have a sequence of random functions that depend on $\theta$: $\left\{ X_{n}(\omega,\theta)\right\}$, where
each $X_{n}(\omega,\theta)$ is a random variable with respect to a probability space $\left(\Omega,\mathcal{F},P\right)$
and the parameter $\theta$ belongs to a parameter space $\theta\in\Theta$.
22.3. Rates of convergence and asymptotic equality
It’s often useful to have notation for the relative magnitudes of quantities.
Quantities that are small relative to others can often be ignored, which simpli-
fies analysis.
DEFINITION 44. [Little-o] Let $f(n)$ and $g(n)$ be two real-valued functions.
The notation $f(n)=o(g(n))$ means
\[ \lim_{n\rightarrow\infty}\frac{f(n)}{g(n)}=0. \]
DEFINITION 45. [Big-O] Let $f(n)$ and $g(n)$ be two real-valued functions.
The notation $f(n)=O(g(n))$ means there exists some $N$ such that for $n>N$,
\[ \left|\frac{f(n)}{g(n)}\right|<K, \]
where $K$ is a finite constant. This definition doesn't require that $\frac{f(n)}{g(n)}$ have a limit (it may fluctuate bound-
edly).
If $\left\{ f_{n}\right\}$ and $\left\{ g_{n}\right\}$ are sequences of random variables, analogous definitions
are:
DEFINITION 46. [Little-$o_{p}$] The notation $f_{n}=o_{p}(g_{n})$ means $\frac{f_{n}}{g_{n}}\overset{p}{\rightarrow}0$.
EXAMPLE 47. The least squares estimator
\[ \hat{\theta}=\left(X'X\right)^{-1}X'y=\left(X'X\right)^{-1}X'\left(X\theta^{0}+\varepsilon\right)=\theta^{0}+\left(X'X\right)^{-1}X'\varepsilon. \]
Since $\operatorname{plim}\left(X'X\right)^{-1}X'\varepsilon=0$, we can write $\left(X'X\right)^{-1}X'\varepsilon=o_{p}(1)$
and $\hat{\theta}=\theta^{0}+o_{p}(1)$. Asymptotically, the term $o_{p}(1)$ is negligible. This is just a
way of indicating that the LS estimator is consistent.
DEFINITION 48. [Big-$O_{p}$] The notation $f_{n}=O_{p}(g_{n})$ means there exists
some $K_{\varepsilon}$ such that, for every $\varepsilon>0$,
\[ P\left(\left|\frac{f_{n}}{g_{n}}\right|<K_{\varepsilon}\right)>1-\varepsilon, \]
where $K_{\varepsilon}$ is a finite constant.
EXAMPLE 49. If $X_{n}\sim N(0,1)$ then $X_{n}=O_{p}(1)$, since, given $\varepsilon$, there is
always some $K_{\varepsilon}$ such that $P\left(\left|X_{n}\right|<K_{\varepsilon}\right)>1-\varepsilon$.
Useful rules:
\[ O_{p}(n^{r})\,O_{p}(n^{s})=O_{p}(n^{r+s}) \]
\[ o_{p}(n^{r})\,o_{p}(n^{s})=o_{p}(n^{r+s}) \]
Averages of centered (mean zero) quantities typically have plim 0, while
averages of uncentered quantities have finite nonzero plims. Note that the
definition of $O_{p}$ does not mean that $f_{n}$ and $g_{n}$ are of the same order. Asymptotic
equality ensures that this is the case.
DEFINITION 52. [Asymptotic equality] Two sequences of random variables
$\left\{ f_{n}\right\}$ and $\left\{ g_{n}\right\}$ are asymptotically equal (written $f_{n}\overset{a}{=}g_{n}$) if
\[ \operatorname{plim}\left(\frac{f_{n}}{g_{n}}\right)=1. \]
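The following Octave sketch illustrates the behavior of centered versus uncentered averages (the sample sizes and the constant are arbitrary):

# Averages of centered draws shrink toward 0; averages of
# uncentered draws converge to a nonzero constant.
for n = [100 10000 1000000]
  e = randn(n, 1);     # centered: mean zero
  z = 2 + randn(n, 1); # uncentered: mean 2
  printf("n = %8d: mean(e) = %9.6f, mean(z) = %8.6f\n", n, mean(e), mean(z));
endfor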
Exercises
CHAPTER 23
The GPL
This document and the associated examples and materials are copyright
Michael Creel, under the terms of the GNU General Public License. This li-
cense follows:
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and
distribute verbatim copies of this license document, but changing it is not al-
lowed.
Preamble
The licenses for most software are designed to take away your freedom to
share and change it. By contrast, the GNU General Public License is intended
to guarantee your freedom to share and change free software–to make sure the
software is free for all its users. This General Public License applies to most
of the Free Software Foundation’s software and to any other program whose
authors commit to using it. (Some other Free Software Foundation software is
covered by the GNU Library General Public License instead.) You can apply it
to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our
General Public Licenses are designed to make sure that you have the freedom
to distribute copies of free software (and charge for this service if you wish),
that you receive source code or can get it if you want it, that you can change
the software or use pieces of it in new free programs; and that you know you
can do these things.
To protect your rights, we need to make restrictions that forbid anyone to
deny you these rights or to ask you to surrender the rights. These restrictions
translate to certain responsibilities for you if you distribute copies of the soft-
ware, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or
for a fee, you must give the recipients all the rights that you have. You must
make sure that they, too, receive or can get the source code. And you must
show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy, distribute
and/or modify the software.
Also, for each author’s protection and ours, we want to make certain that
everyone understands that there is no warranty for this free software. If the
software is modified by someone else and passed on, we want its recipients to
know that what they have is not the original, so that any problems introduced
by others will not reflect on the original authors’ reputations.
Finally, any free program is threatened constantly by software patents. We
wish to avoid the danger that redistributors of a free program will individually
obtain patent licenses, in effect making the program proprietary. To prevent
this, we have made it clear that any patent must be licensed for everyone’s
free use or not licensed at all.
The precise terms and conditions for copying, distribution and modifica-
tion follow.
2. You may modify your copy or copies of the Program or any portion of
it, thus forming a work based on the Program, and copy and distribute such
modifications or work under the terms of Section 1 above, provided that you
also meet all of these conditions:
a) You must cause the modified files to carry prominent notices stating that
you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in whole
or in part contains or is derived from the Program or any part thereof, to be
licensed as a whole at no charge to all third parties under the terms of this
License.
c) If the modified program normally reads commands interactively when
run, you must cause it, when started running for such interactive use in the
most ordinary way, to print or display an announcement including an appro-
priate copyright notice and a notice that there is no warranty (or else, saying
that you provide a warranty) and that users may redistribute the program un-
der these conditions, and telling the user how to view a copy of this License.
(Exception: if the Program itself is interactive but does not normally print such
an announcement, your work based on the Program is not required to print an
announcement.)
These requirements apply to the modified work as a whole. If identifiable
sections of that work are not derived from the Program, and can be reasonably
considered independent and separate works in themselves, then this License,
and its terms, do not apply to those sections when you distribute them as sep-
arate works. But when you distribute the same sections as part of a whole
which is a work based on the Program, the distribution of the whole must be
on the terms of this License, whose permissions for other licensees extend to
the entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights
to work written entirely by you; rather, the intent is to exercise the right to
control the distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of a
storage or distribution medium does not bring the other work under the scope
of this License.
3. You may copy and distribute the Program (or a work based on it, under
Section 2) in object code or executable form under the terms of Sections 1 and
2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable source
code, which must be distributed under the terms of Sections 1 and 2 above on
a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three years, to give
any third party, for a charge no more than your cost of physically performing
source distribution, a complete machine-readable copy of the corresponding
source code, to be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer to dis-
tribute corresponding source code. (This alternative is allowed only for non-
commercial distribution and only if you received the program in object code
or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for mak-
ing modifications to it. For an executable work, complete source code means
all the source code for all modules it contains, plus any associated interface
definition files, plus the scripts used to control compilation and installation of
the executable. However, as a special exception, the source code distributed
need not include anything that is normally distributed (in either source or bi-
nary form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component itself
accompanies the executable.
If distribution of executable or object code is made by offering access to
copy from a designated place, then offering equivalent access to copy the
source code from the same place counts as distribution of the source code,
even though third parties are not compelled to copy the source along with the
object code.
4. You may not copy, modify, sublicense, or distribute the Program ex-
cept as expressly provided under this License. Any attempt otherwise to copy,
modify, sublicense or distribute the Program is void, and will automatically
terminate your rights under this License. However, parties who have received
copies, or rights, from you under this License will not have their licenses ter-
minated so long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed
it. However, nothing else grants you permission to modify or distribute the
Program or its derivative works. These actions are prohibited by law if you do
not accept this License. Therefore, by modifying or distributing the Program
(or any work based on the Program), you indicate your acceptance of this Li-
cense to do so, and all its terms and conditions for copying, distributing or
modifying the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Pro-
gram), the recipient automatically receives a license from the original licensor
to copy, distribute or modify the Program subject to these terms and condi-
tions. You may not impose any further restrictions on the recipients’ exercise
of the rights granted herein. You are not responsible for enforcing compliance
by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringe-
ment or for any other reason (not limited to patent issues), conditions are im-
posed on you (whether by court order, agreement or otherwise) that contradict
the conditions of this License, they do not excuse you from the conditions of
this License. If you cannot distribute so as to satisfy simultaneously your obli-
gations under this License and any other pertinent obligations, then as a con-
sequence you may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by all those
who receive copies directly or indirectly through you, then the only way you
could satisfy both it and this License would be to refrain entirely from distri-
bution of the Program.
If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply and the
section as a whole is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or
other property right claims or to contest validity of any such claims; this sec-
tion has the sole purpose of protecting the integrity of the free software distri-
bution system, which is implemented by public license practices. Many people
have made generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that system; it is
status of all derivatives of our free software and of promoting the sharing and
reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED
BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING
THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE
PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EX-
PRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICU-
LAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFOR-
MANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PER-
MITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARIS-
ING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUD-
ING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED
INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR
A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PRO-
GRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
free software, and you are welcome to redistribute it under certain conditions;
type ‘show c’ for details.
The hypothetical commands ‘show w’ and ‘show c’ should show the ap-
propriate parts of the General Public License. Of course, the commands you
use may be called something other than ‘show w’ and ‘show c’; they could
even be mouse-clicks or menu items–whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if necessary.
Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program ‘Gnomo-
vision’ (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the li-
brary. If this is what you want to do, use the GNU Library General Public
License instead of this License.
CHAPTER 24
The attic
24.1. MEPS data: more on count models
Note to self: this chapter is yet to be converted to use Octave. To check the
plausibility of the Poisson model, we can compare the sample unconditional
variance with the estimated unconditional variance according to the Poisson
model: $\widehat{V}(y)=\frac{\sum_{t=1}^{n}\hat{\lambda}_{t}}{n}$. For OBDV and ERV, we get

OBDV ERV
Sample 37.446 0.30614
Estimated 3.4540 0.19060

We see that even after conditioning, the overdispersion is not captured in
either case, so the Poisson model does not appear to be plausible.
One way to capture the unmodeled heterogeneity is to let the conditional
mean have a random component:
\[ \lambda=\exp\left(x'\beta+\varepsilon\right)=\exp\left(x'\beta\right)\exp\left(\varepsilon\right)=\phi\nu, \]
where $\phi=\exp\left(x'\beta\right)$ and $\nu=\exp\left(\varepsilon\right)$. Now $\nu$ captures the randomness in the
constant. The problem is that we don't observe $\nu$, so we will need to marginal-
ize it to get a usable density
\[ f_{Y}(y|x)=\int_{-\infty}^{\infty}\frac{\exp\left[-\lambda\right]\lambda^{y}}{y!}f_{\nu}(z)\,dz. \]
This density can be used directly, perhaps using numerical integration to eval-
uate the likelihood function. In some cases, though, the integral will have an
analytic solution. For example, if $\nu$ follows a certain one parameter gamma
density, then
(24.1.1)
\[ f_{Y}(y|x)=\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\phi}\right)^{\psi}\left(\frac{\phi}{\psi+\phi}\right)^{y}, \]
where $\psi$ appears since it is the parameter of the gamma density. Setting
$\psi=\phi/\alpha$ gives the NB-I model, with variance $V(Y)=\phi+\alpha\phi$, while setting
$\psi=1/\alpha$ gives the NB-II model, with variance $V(Y)=\phi+\alpha\phi^{2}$.
So both forms of the NB model allow for overdispersion, with the NB-II model
allowing for a more radical form.
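A sketch of evaluating the density in Equation 24.1.1 in Octave, working in logs with gammaln for numerical stability (the function name is illustrative; mapping phi and psi to $x'\beta$ and $\alpha$ follows the parameterizations above):

# Negative binomial density of Equation 24.1.1, evaluated in logs.
function f = nb_density(y, phi, psi)
  logf = gammaln(y + psi) - gammaln(y + 1) - gammaln(psi) ...
       + psi .* log(psi ./ (psi + phi)) + y .* log(phi ./ (psi + phi));
  f = exp(logf);
endfunction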
Schwartz
2315.3
Hannan-Quinn
2294.8
Akaike
2281.6
*********************************************************************
For the OBDV model, the NB-II model does a better job, in terms of
the average log-likelihood and the information criteria.
Note that both versions of the NB model fit much better than does the
Poisson model.
The t-statistics are now similar for all three ways of calculating them,
which might indicate that the serious specification problems of the
Poisson model for the OBDV data are partially solved by moving to
the NB model.
The estimated $\alpha$ is highly significant.
To check the plausibility of the NB-II model, we can compare the sample un-
conditional variance with the estimated unconditional variance according to
the NB-II model: $\widehat{V}(y)=\frac{\sum_{t=1}^{n}\hat{\lambda}_{t}+\hat{\alpha}\hat{\lambda}_{t}^{2}}{n}$. For OBDV and ERV (estimation results
not reported), we get

OBDV ERV
Sample 37.446 0.30614
Estimated 26.962 0.27620

The overdispersion problem is significantly better than
in the Poisson case, but there is still some overdispersion that is not captured,
for both OBDV and ERV.
Returning to the Poisson model, let's look at actual and fitted count prob-
abilities. Actual relative frequencies are $p(y=j)=\sum_{t}1(y_{t}=j)/n$ and fit-
ted frequencies are $\hat{p}(y=j)=\sum_{t=1}^{n}f_{Y}(j|x_{t},\hat{\theta})/n$. We see that for the OBDV
measure, there are many more actual zeros than predicted. For ERV, there are
somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? Suppose that the patient makes the
decision to contact the doctor for a first visit when he or she is sick, and the doctor
then decides on whether or not follow-up visits are needed. This is a principal/agent type
situation, where the total number of visits depends upon the decision of both
the patient and the doctor. Since different parameters may govern the two
decision-makers choices, we might expect that different parameters govern
the probability of zeros versus the other counts. Let $\lambda_{p}$ be the parameters of
the patient's demand for visits, and let $\lambda_{d}$ be the parameter of the doctor's "de-
mand” for visits. The patient will initiate visits according to a discrete choice
model, for example, a logit model:
\[ \Pr(Y=0)=f_{Y}(0)=\frac{1}{1+\exp\left(-x'\lambda_{p}\right)} \]
\[ \Pr(Y>0)=1-\frac{1}{1+\exp\left(-x'\lambda_{p}\right)} \]
The above probabilities are used to estimate the binary 0/1 hurdle process.
Then, for the observations where visits are positive, a truncated Poisson den-
sity is estimated. This density is
\[ f_{Y}\left(y,\lambda_{d}\mid y>0\right)=\frac{f_{Y}(y,\lambda_{d})}{\Pr(y>0)}=\frac{f_{Y}(y,\lambda_{d})}{1-\exp\left(-\lambda_{d}\right)}, \]
since according to the Poisson model with the doctor's parameters,
\[ \Pr(y=0)=\frac{\exp\left(-\lambda_{d}\right)\lambda_{d}^{0}}{0!}=\exp\left(-\lambda_{d}\right). \]
m
Since the hurdle and truncated components of the overall density for ¤ share
no parameters, they may be estimated separately, which is computationally
more efficient than estimating the overall model. (Recall that the BFGS algo-
rithm, for example, will have to invert the approximated Hessian. The com-
putational overhead is of order $K^{2}$, where $K$ is the number of parameters to be
estimated.) The expectation of $Y$ is
\[ E(Y|x)=\Pr(Y>0)\,E(Y\mid Y>0,x)=\left[1-\frac{1}{1+\exp\left(-x'\lambda_{p}\right)}\right]\left[\frac{\lambda_{d}}{1-\exp\left(-\lambda_{d}\right)}\right]. \]
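Since the components separate, a minimal Octave sketch of the two log-likelihood pieces might look as follows (the function and argument names are hypothetical, and the logit parameterization follows the formulas above):

# Sketch: the two separable parts of the hurdle Poisson log-likelihood.
# y: counts; x: regressors; lambda_p, lambda_d: parameter vectors
function [ll_logit, ll_trunc] = hurdle_loglik(y, x, lambda_p, lambda_d)
  p0 = 1 ./ (1 + exp(-x*lambda_p));   # Pr(y = 0), logit part
  d = (y == 0);
  ll_logit = mean(d .* log(p0) + (1 - d) .* log(1 - p0));
  lam = exp(x*lambda_d);              # doctor's conditional mean
  i = (y > 0);                        # truncated part uses y > 0 only
  ll_trunc = mean(-lam(i) + y(i).*log(lam(i)) - gammaln(y(i) + 1) ...
                  - log(1 - exp(-lam(i))));
endfunction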
Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program
*********************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value -0.58939
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -1.5502 -2.5709 -2.5269 -2.5560
pub_ins 1.0519 3.0520 3.0027 3.0384
priv_ins 0.45867 1.7289 1.6924 1.7166
sex 0.63570 3.0873 3.1677 3.1366
age 0.018614 2.1547 2.1969 2.1807
educ 0.039606 1.0467 0.98710 1.0222
inc 0.077446 1.7655 2.1672 1.9601
Information Criteria
Consistent Akaike
639.89
Schwartz
632.89
Hannan-Quinn
614.96
Akaike
603.39
*********************************************************************
Fitted and actual probabilities (NB-II fits are provided as well) are:
For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit
is not so good. Zeros are exact, but 1’s and 2’s are underestimated, and higher
counts are overestimated. For the NB-II fits, performance is at least as good as
the hurdle Poisson model, and one should recall that many fewer parameters
are used. Hurdle versions of the negative binomial model are also widely used.
24.2.1. Finite mixture models. The finite mixture approach to fitting health
care demand was introduced by Deb and Trivedi (1997). The mixture approach
has the intuitive appeal of allowing for subgroups of the population with dif-
ferent health status. If individuals are classified as healthy or unhealthy then
two subgroups are defined. A finer classification scheme would lead to more
subgroups. Many studies have incorporated objective and/or subjective indi-
cators of health status in an effort to capture this heterogeneity. The available
objective measures, such as limitations on activity, are not necessarily very
informative about a person’s overall health status. Subjective, self-reported
measures may suffer from the same problem, and may also not be exogenous
The finite mixture density is
\[ f_{Y}\left(y,\phi_{1},\ldots,\phi_{p},\pi_{1},\ldots,\pi_{p-1}\right)=\sum_{i=1}^{p-1}\pi_{i}\,f_{Y}^{(i)}\left(y,\phi_{i}\right)+\pi_{p}\,f_{Y}^{(p)}\left(y,\phi_{p}\right), \]
where $\pi_{i}>0$, $i=1,2,\ldots,p$, $\pi_{p}=1-\sum_{i=1}^{p-1}\pi_{i}$, and $\sum_{i=1}^{p}\pi_{i}=1$. Identification re-
quires that the $\pi_{i}$ are ordered in some way, for example, $\pi_{1}\geq\pi_{2}\geq\cdots\geq\pi_{p}$ and
$\phi_{i}\neq\phi_{j}$, $i\neq j$. This is simple to accomplish post-estimation by rearrangement
and possible elimination of redundant component densities.
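A minimal Octave sketch of a two component mixture, with the component densities passed as function handles (all names are illustrative):

# Sketch: two-component finite mixture density.
function f = mixture2(y, f1, f2, pi1)
  f = pi1 * f1(y) + (1 - pi1) * f2(y); # 0 < pi1 < 1
endfunction

# Example usage with two Poisson components:
# f1 = @(y) exp(-1) .* 1.^y ./ factorial(y);
# f2 = @(y) exp(-5) .* 5.^y ./ factorial(y);
# mixture2(0:3, f1, f2, 0.7)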
The following are results for a mixture of 2 negative binomial (NB-I) models,
for the OBDV data, which you can replicate using this estimation program
*********************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value -2.2312
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant 0.64852 1.3851 1.3226 1.4358
pub_ins -0.062139 -0.23188 -0.13802 -0.18729
priv_ins 0.093396 0.46948 0.33046 0.40854
sex 0.39785 2.6121 2.2148 2.4882
age 0.015969 2.5173 2.5475 2.7151
educ -0.049175 -1.8013 -1.7061 -1.8036
inc 0.015880 0.58386 0.76782 0.73281
ln_alpha 0.69961 2.3456 2.0396 2.4029
constant -3.6130 -1.6126 -1.7365 -1.8411
pub_ins 2.3456 1.7527 3.7677 2.6519
priv_ins 0.77431 0.73854 1.1366 0.97338
sex 0.34886 0.80035 0.74016 0.81892
age 0.021425 1.1354 1.3032 1.3387
educ 0.22461 2.0922 1.7826 2.1470
inc 0.019227 0.20453 0.40854 0.36313
ln_alpha 2.8419 6.2497 6.8702 7.6182
logit_inv_mix 0.85186 1.7096 1.4827 1.7883
Information Criteria
Consistent Akaike
2353.8
Schwartz
2336.8
Hannan-Quinn
2293.3
Akaike
2265.2
*********************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.70096 0.12043
The 95% confidence interval for the mix parameter is perilously close
to 1, which suggests that there may really be only one component den-
sity, rather than a mixture. Again, this is not the way to test this - it is
merely suggestive.
Education is interesting. For the subpopulation that is “healthy”, i.e.,
that makes relatively few visits, education seems to have a positive
effect on visits. For the “unhealthy” group, education has a negative
effect on visits. The other results are more mixed. A larger sample
could help clarify things.
The following are results for a 2 component constrained mixture negative bi-
nomial model where all the slope parameters in %ß
Úº Í x ä
e
are the same across
the two components. The constants and the overdispersion parameters ,ß
are
allowed to differ for the two components.
*********************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value -2.2441
t-Stats
params t(OPG) t(Sand.) t(Hess)
constant -0.34153 -0.94203 -0.91456 -0.97943
pub_ins 0.45320 2.6206 2.5088 2.7067
priv_ins 0.20663 1.4258 1.3105 1.3895
sex 0.37714 3.1948 3.4929 3.5319
age 0.015822 3.1212 3.7806 3.7042
educ 0.011784 0.65887 0.50362 0.58331
inc 0.014088 0.69088 0.96831 0.83408
ln_alpha 1.1798 4.6140 7.2462 6.4293
const_2 1.2621 0.47525 2.5219 1.5060
lnalpha_2 2.7769 1.5539 6.4918 4.2243
logit_inv_mix 2.4888 0.60073 3.7224 1.9693
Information Criteria
Consistent Akaike
2323.5
Schwartz
2312.5
Hannan-Quinn
2284.3
Akaike
2266.1
*********************************************************************
Delta method for mix parameter st. err.
mix se_mix
0.92335 0.047318
Now the mixture parameter is even closer to 1.
The slope parameter estimates are pretty close to what we got with the
NB-I model.
24.2.2. Information criteria. The information criteria used above are
\[ \mathrm{CAIC}=-2\ln L+k\left(\ln n+1\right) \]
\[ \mathrm{BIC}=-2\ln L+k\ln n \]
\[ \mathrm{AIC}=-2\ln L+2k, \]
where $\ln L$ is the maximized log-likelihood, $k$ is the number of parameters and
$n$ is the sample size.
It can be shown that the CAIC and BIC will select the correctly specified model
from a group of models, asymptotically. This doesn’t mean, of course, that the
correct model is necessarily in the group. The AIC is not consistent, and will
asymptotically favor an over-parameterized model over the correctly specified
model. Here are information criteria values for the models we’ve seen, for
OBDV. According to the AIC, the best is the MNB-I, which has relatively many
parameters. The best according to the BIC is CMNB-I, and according to CAIC,
the best is NB-I. The Poisson-based models do not do well.
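These criteria are direct to compute from the estimation output; a sketch in Octave, noting that the "Function value" reported by the estimation programs is the average log-likelihood, so it is scaled by n first (one can verify, e.g., the hurdle logit values above: n = 500, k = 7 and average log-likelihood -0.58939 give CAIC 639.89, BIC 632.89 and AIC 603.39):

# Information criteria from average log-likelihood avg_ll,
# number of parameters k and sample size n.
function [caic, bic, aic] = info_crit(avg_ll, k, n)
  logL = n * avg_ll;  # total log-likelihood
  caic = -2*logL + k*(log(n) + 1);
  bic  = -2*logL + k*log(n);
  aic  = -2*logL + 2*k;
endfunction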
24.3. Models for time series data
This section can be ignored in its present form. Just left in to form a basis
for completion (by someone else ?!) at some point.
Hamilton, Time Series Analysis is a good reference for this section. This is
very incomplete and contributions would be very welcome.
Up to now we've considered the behavior of the dependent variable $y_{t}$ as a
function of other variables $x_{t}$. These variables can of course contain lagged
dependent variables, e.g., $x_{t}=\left(w_{t},y_{t-1},\ldots,y_{t-j}\right)$. Pure time series methods
consider the behavior of $y_{t}$ as a function only of its own lagged values, un-
conditional on other observable variables. One can think of this as modeling
the behavior of $y_{t}$ after marginalizing out all other variables. While it's not
marginalize to a linear in the parameters time series model, most time series
work is done with linear models, though nonlinear time series is also a large
and growing field. We’ll stick with linear time series models.
A process is covariance stationary if it has time constant mean and autocovariances of all
orders:
\[ E\left(y_{t}\right)=\mu,\quad\forall t, \]
\[ Cov\left(y_{t},y_{t-s}\right)=\gamma_{s},\quad\forall t. \]
As we've seen, this implies that the autocovariances depend only
on the interval between observations, but not the time of the observations.
A stationary process is ergodic for the mean if time averages converge to the mean:
(24.3.4)
\[ \frac{1}{n}\sum_{t=1}^{n}y_{t}\overset{p}{\rightarrow}\mu. \]
DEFINITION 60 (White noise). White noise is just the time series literature
term for a classical error. $\varepsilon_{t}$ is white noise if i) $E\left(\varepsilon_{t}\right)=0$, $\forall t$, ii) $V\left(\varepsilon_{t}\right)=\sigma^{2}$,
$\forall t$, and iii) $\varepsilon_{t}$ and $\varepsilon_{s}$ are independent, $t\neq s$. Gaussian white noise just adds a
normality assumption.
24.3.2. ARMA models. With these concepts, we can discuss ARMA mod-
els. These are closely related to the AR and MA error processes that we’ve
already discussed. The main difference is that the lhs variable is observed di-
rectly now.
24.3.2.1. MA(q) processes. A $q^{th}$ order moving average (MA) process is
\[ y_{t}=\mu+\varepsilon_{t}+\theta_{1}\varepsilon_{t-1}+\theta_{2}\varepsilon_{t-2}+\cdots+\theta_{q}\varepsilon_{t-q}, \]
where $\varepsilon_{t}$ is white noise. The variance is
\[ \gamma_{0}=E\left[\left(y_{t}-\mu\right)^{2}\right]=\sigma^{2}\left(1+\theta_{1}^{2}+\theta_{2}^{2}+\cdots+\theta_{q}^{2}\right). \]
Similarly, the autocovariances are
\[ \gamma_{j}=\begin{cases}\sigma^{2}\left(\theta_{j}+\theta_{j+1}\theta_{1}+\theta_{j+2}\theta_{2}+\cdots+\theta_{q}\theta_{q-j}\right), & j\leq q\\ 0, & j>q.\end{cases} \]
24.3.2.2. AR(p) processes. An AR(p) process can be represented as
\[ y_{t}=c+\phi_{1}y_{t-1}+\phi_{2}y_{t-2}+\cdots+\phi_{p}y_{t-p}+\varepsilon_{t}. \]
Its dynamic behavior can be studied by writing this $p^{th}$ order difference equation
as a vector first order difference equation
\[ Y_{t}=C+FY_{t-1}+E_{t}, \]
where $Y_{t}=\left(y_{t},y_{t-1},\ldots,y_{t-p+1}\right)'$, $C=\left(c,0,\ldots,0\right)'$, $E_{t}=\left(\varepsilon_{t},0,\ldots,0\right)'$, and $F$ is
the companion matrix with $\left(\phi_{1},\phi_{2},\ldots,\phi_{p}\right)$ in its first row, ones on the subdiag-
onal, and zeros elsewhere. With this, we can recursively work forward in time:
\[ Y_{t+1}=C+FY_{t}+E_{t+1}=C+F\left(C+FY_{t-1}+E_{t}\right)+E_{t+1}=C+FC+F^{2}Y_{t-1}+FE_{t}+E_{t+1}, \]
and in general
\[ Y_{t+j}=C+FC+\cdots+F^{j}C+F^{j+1}Y_{t-1}+F^{j}E_{t}+F^{j-1}E_{t+1}+\cdots+FE_{t+j-1}+E_{t+j}. \]
Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply
\[ \frac{\partial Y_{t+j}}{\partial E_{t}'}\Big|_{(1,1)}=F_{(1,1)}^{j}. \]
If the system is to be stationary, then as we move forward in time this impact
must die off. Otherwise a shock causes a permanent change in the mean of $y_{t}$.
Therefore, stationarity requires that
\[ \lim_{j\rightarrow\infty}F_{(1,1)}^{j}=0. \]
Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that
\[ \left|F-\lambda I_{p}\right|=0. \]
The determinant here can be expressed as a polynomial. For example, for
$p=1$ the matrix $F$ is simply
\[ F=\phi_{1}, \]
so
\[ \left|\phi_{1}-\lambda\right|=0 \]
can be written as
\[ \phi_{1}-\lambda=0. \]
When $p=2$, the matrix $F$ is
\[ F=\begin{bmatrix}\phi_{1} & \phi_{2}\\ 1 & 0\end{bmatrix}, \]
so
\[ F-\lambda I_{p}=\begin{bmatrix}\phi_{1}-\lambda & \phi_{2}\\ 1 & -\lambda\end{bmatrix} \]
and
\[ \left|F-\lambda I_{p}\right|=\lambda^{2}-\lambda\phi_{1}-\phi_{2}. \]
So the eigenvalues are the roots of $\lambda^{2}-\lambda\phi_{1}-\phi_{2}=0$,
which can be found using the quadratic equation. This generalizes. For a
$p^{th}$ order AR process, the eigenvalues are the roots of
\[ \lambda^{p}-\lambda^{p-1}\phi_{1}-\lambda^{p-2}\phi_{2}-\cdots-\lambda\phi_{p-1}-\phi_{p}=0. \]
Supposing that all of the roots of this polynomial are distinct, then the matrix
$F$ can be factored as
\[ F=T\Lambda T^{-1}, \]
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$
is a diagonal matrix with the eigenvalues on the main diagonal. Using this
decomposition, we can write
\[ F^{j}=\left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right), \]
where $T\Lambda T^{-1}$ is repeated $j$ times. This gives
\[ F^{j}=T\Lambda^{j}T^{-1}, \]
and
\[ \Lambda^{j}=\begin{bmatrix}\lambda_{1}^{j} & & & \\ & \lambda_{2}^{j} & & \\ & & \ddots & \\ & & & \lambda_{p}^{j}\end{bmatrix}. \]
Supposing that the $\lambda_{i}$, $i=1,2,\ldots,p$, are all real valued, it is clear that
\[ \lim_{j\rightarrow\infty}F_{(1,1)}^{j}=0 \]
requires that
\[ \left|\lambda_{i}\right|<1,\quad i=1,2,\ldots,p. \]
Dynamic multipliers: $\partial y_{t+j}/\partial\varepsilon_{t}=F_{(1,1)}^{j}$ is a dynamic multiplier or an
impulse-response function. Real eigenvalues lead to steady movements,
whereas complex eigenvalues lead to oscillatory behavior. Of course,
when there are multiple eigenvalues the overall effect can be a mix-
ture (see the sketch below for a computed example).
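The stationarity condition and the dynamic multipliers are easy to compute numerically; a sketch in Octave for an AR(2) with arbitrary coefficients:

phi = [1.2; -0.5];      # AR(2) coefficients (arbitrary values)
F = [phi'; 1, 0];       # companion form matrix
disp(max(abs(eig(F))))  # stationary if less than 1
J = 20; irf = zeros(J+1, 1); Fj = eye(2);
for j = 0:J
  irf(j+1) = Fj(1,1);   # dynamic multiplier dy(t+j)/d eps(t)
  Fj = Fj * F;
endfor
disp(irf')              # oscillatory decay: these eigenvalues are complex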
Invertibility of AR process. To begin with, define the lag operator $L$:
\[ Ly_{t}=y_{t-1}. \]
The lag operator is defined to operate as an algebraic quantity, e.g.,
\[ L^{2}y_{t}=L(Ly_{t})=Ly_{t-1}=y_{t-2}, \]
or
\[ (1-L)(1+L)y_{t}=y_{t}+Ly_{t}-Ly_{t}-L^{2}y_{t}=y_{t}-y_{t-2}. \]
The AR(p) model can be written as
\[ y_{t}-\phi_{1}y_{t-1}-\phi_{2}y_{t-2}-\cdots-\phi_{p}y_{t-p}=\varepsilon_{t} \]
or
\[ \left(1-\phi_{1}L-\phi_{2}L^{2}-\cdots-\phi_{p}L^{p}\right)y_{t}=\varepsilon_{t}. \]
Factor this polynomial as
\[ 1-\phi_{1}L-\phi_{2}L^{2}-\cdots-\phi_{p}L^{p}=\left(1-\lambda_{1}L\right)\left(1-\lambda_{2}L\right)\cdots\left(1-\lambda_{p}L\right). \]
For the moment, just assume that the $\lambda_{i}$ are coefficients to be determined. Since
$L$ is defined to operate as an algebraic quantity, determination of the $\lambda_{i}$ is the
same as determination of the $\lambda_{i}$ such that the following two expressions are
the same for all $z$:
\[ 1-\phi_{1}z-\phi_{2}z^{2}-\cdots-\phi_{p}z^{p}=\left(1-\lambda_{1}z\right)\left(1-\lambda_{2}z\right)\cdots\left(1-\lambda_{p}z\right). \]
Multiply both sides by $z^{-p}$ and now define $\lambda=z^{-1}$, so we get
\[ \lambda^{p}-\phi_{1}\lambda^{p-1}-\phi_{2}\lambda^{p-2}-\cdots-\phi_{p-1}\lambda-\phi_{p}=\left(\lambda-\lambda_{1}\right)\left(\lambda-\lambda_{2}\right)\cdots\left(\lambda-\lambda_{p}\right). \]
The LHS is precisely the determinantal polynomial that gives the eigenvalues
of $F$. Therefore, the $\lambda_{i}$ that are the coefficients of the factorization are simply
the eigenvalues of the matrix $F$.
Now consider a different stationary process
\[ (1-\phi L)y_{t}=\varepsilon_{t}. \]
Multiply both sides by $1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}$ to get
\[ \left(1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}\right)\left(1-\phi L\right)y_{t}=\left(1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}\right)\varepsilon_{t} \]
or, multiplying out the polynomial on the LHS,
\[ \left(1-\phi^{j+1}L^{j+1}\right)y_{t}=\left(1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}\right)\varepsilon_{t}, \]
so
\[ y_{t}=\phi^{j+1}L^{j+1}y_{t}+\left(1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}\right)\varepsilon_{t}. \]
Now as $j\rightarrow\infty$, $\phi^{j+1}L^{j+1}y_{t}\rightarrow0$, since $\left|\phi\right|<1$, so
\[ y_{t}\cong\left(1+\phi L+\phi^{2}L^{2}+\cdots+\phi^{j}L^{j}\right)\varepsilon_{t}, \]
and the approximation becomes arbitrarily good as $j$ increases arbitrarily. There-
fore, for $\left|\phi\right|<1$, define
\[ \left(1-\phi L\right)^{-1}=\sum_{j=0}^{\infty}\phi^{j}L^{j}. \]
Recall that our mean zero AR(p) process
\[ \left(1-\phi_{1}L-\phi_{2}L^{2}-\cdots-\phi_{p}L^{p}\right)y_{t}=\varepsilon_{t} \]
can be written using the factorization
\[ \left(1-\lambda_{1}L\right)\left(1-\lambda_{2}L\right)\cdots\left(1-\lambda_{p}L\right)y_{t}=\varepsilon_{t}, \]
where the $\lambda_{i}$ are the eigenvalues of $F$, and given stationarity, all the $\left|\lambda_{i}\right|<1$.
Therefore, we can invert each first order polynomial on the LHS to get
\[ y_{t}=\left(\sum_{j=0}^{\infty}\lambda_{1}^{j}L^{j}\right)\left(\sum_{j=0}^{\infty}\lambda_{2}^{j}L^{j}\right)\cdots\left(\sum_{j=0}^{\infty}\lambda_{p}^{j}L^{j}\right)\varepsilon_{t}. \]
The RHS is a product of infinite-order polynomials in $L$, which can be repre-
sented as
\[ y_{t}=\left(1+\psi_{1}L+\psi_{2}L^{2}+\cdots\right)\varepsilon_{t}. \]
The $\psi_{i}$ are formed of products of powers of the $\lambda_{i}$, which are in turn
functions of the $\phi_{i}$. The $\psi_{i}$ are real-valued because any complex-valued
$\lambda_{i}$ always occur in conjugate pairs: if $a+bi$ is an eigenvalue of $F$, then
so is $a-bi$, and in multiplication
\[ (a+bi)(a-bi)=a^{2}-abi+abi-b^{2}i^{2}=a^{2}+b^{2}, \]
which is real-valued.
This shows that an AR(p) process is representable as an infinite-order
MA(q) process.
Recall before that by recursive substitution, an AR(p) process can be
written as
\[ Y_{t+j}=C+FC+\cdots+F^{j}C+F^{j+1}Y_{t-1}+F^{j}E_{t}+F^{j-1}E_{t+1}+\cdots+FE_{t+j-1}+E_{t+j}. \]
If the process is mean zero, then everything with a $C$ drops out. Take
this and lag it by $j$ periods to get
\[ Y_{t}=F^{j+1}Y_{t-j-1}+F^{j}E_{t-j}+F^{j-1}E_{t-j+1}+\cdots+FE_{t-1}+E_{t}. \]
As $j\rightarrow\infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors
of zeros except for their first element, so we see that the first equation
here, in the limit, is just
\[ y_{t}=\sum_{j=0}^{\infty}\left(F^{j}\right)_{1,1}\varepsilon_{t-j}. \]
Moments of the AR(p) process: assuming stationarity, $E\left(y_{t}\right)=\mu$, $\forall t$, so
\[ \mu=c+\phi_{1}\mu+\phi_{2}\mu+\cdots+\phi_{p}\mu, \]
so
\[ \mu=\frac{c}{1-\phi_{1}-\phi_{2}-\cdots-\phi_{p}} \]
and
\[ c=\mu-\phi_{1}\mu-\cdots-\phi_{p}\mu, \]
so
\[ y_{t}-\mu=\phi_{1}\left(y_{t-1}-\mu\right)+\phi_{2}\left(y_{t-2}-\mu\right)+\cdots+\phi_{p}\left(y_{t-p}-\mu\right)+\varepsilon_{t}. \]
With this, the second moments are easy to find: the variance is
\[ \gamma_{0}=\phi_{1}\gamma_{1}+\phi_{2}\gamma_{2}+\cdots+\phi_{p}\gamma_{p}+\sigma^{2}, \]
and the autocovariances of orders $j\geq1$ follow the rule
\[ \gamma_{j}=\phi_{1}\gamma_{j-1}+\phi_{2}\gamma_{j-2}+\cdots+\phi_{p}\gamma_{j-p}. \]
Using the fact that $\gamma_{-j}=\gamma_{j}$, one can take the $p+1$ equations for
$j=0,1,\ldots,p$ and solve for the $p+1$ unknowns $\sigma^{2},\gamma_{0},\gamma_{1},\ldots,\gamma_{p}$.
Invertibility of MA(q) process: factoring the MA polynomial as
\[ 1+\theta_{1}L+\cdots+\theta_{q}L^{q}=\left(1-\eta_{1}L\right)\left(1-\eta_{2}L\right)\cdots\left(1-\eta_{q}L\right), \]
each of the $\left(1-\eta_{i}L\right)$ can be inverted as long as $\left|\eta_{i}\right|<1$. If this is the case,
then we can write
\[ \left(1+\theta_{1}L+\cdots+\theta_{q}L^{q}\right)^{-1}\left(y_{t}-\mu\right)=\varepsilon_{t}, \]
where
\[ \left(1+\theta_{1}L+\cdots+\theta_{q}L^{q}\right)^{-1} \]
will be an infinite-order polynomial in $L$, so we get
\[ \sum_{j=0}^{\infty}-\delta_{j}L^{j}\left(y_{t-j}-\mu\right)=\varepsilon_{t}, \]
with $\delta_{0}=-1$, or
\[ \left(y_{t}-\mu\right)-\delta_{1}\left(y_{t-1}-\mu\right)-\delta_{2}\left(y_{t-2}-\mu\right)-\cdots=\varepsilon_{t} \]
or
\[ y_{t}=c+\delta_{1}y_{t-1}+\delta_{2}y_{t-2}+\cdots+\varepsilon_{t}, \]
where
\[ c=\mu-\delta_{1}\mu-\delta_{2}\mu-\cdots. \]
It turns out that one can always manipulate the parameters of an MA(q)
process to find an invertible representation. For example, the two
MA(1) processes
\[ y_{t}-\mu=\left(1-\theta L\right)\varepsilon_{t} \]
and
\[ y_{t}^{*}-\mu=\left(1-\theta^{-1}L\right)\varepsilon_{t}^{*} \]
have exactly the same moments if
\[ \sigma_{\varepsilon^{*}}^{2}=\sigma_{\varepsilon}^{2}\theta^{2}. \]
For example, we've seen that
\[ \gamma_{0}=\sigma^{2}\left(1+\theta^{2}\right). \]
Given the above relationships amongst the parameters,
\[ \gamma_{0}^{*}=\sigma_{\varepsilon}^{2}\theta^{2}\left(1+\theta^{-2}\right)=\sigma^{2}\left(1+\theta^{2}\right), \]
so the variances are the same. It turns out that all the autocovariances
will be the same, as is easily checked. This means that the two MA
processes are observationally equivalent. As before, it’s impossible to
distinguish between observationally equivalent processes on the basis
of data.
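This is easy to check by simulation; a sketch in Octave comparing the sample variance and first autocovariance of the two parameterizations (theta = 0.5 and the sample size are arbitrary):

theta = 0.5; n = 100000;
e = randn(n, 1);                     # sigma = 1
y1 = e - theta*[0; e(1:n-1)];        # y - mu = (1 - theta L) eps
es = theta*randn(n, 1);              # sigma* = sigma theta
y2 = es - (1/theta)*[0; es(1:n-1)];  # y*- mu = (1 - L/theta) eps*
printf("variances: %f %f\n", var(y1), var(y2));
printf("first autocovariances: %f %f\n", ...
       cov(y1(2:n), y1(1:n-1)), cov(y2(2:n), y2(1:n-1)))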
For a given MA(q) process, it’s always possible to manipulate the pa-
rameters to find an invertible representation (which is unique).
It’s important to find an invertible representation, since it’s the only
representation that allows one to represent $\varepsilon_{t}$ as a function of past $y$'s.
The other representations express $\varepsilon_{t}$ as a function of future $y$'s.
Why is invertibility important? The most important reason is that it
provides a justification for the use of parsimonious models. Since an
AR(1) process has an MA($\infty$) representation, one can reverse the ar-
gument and note that at least some MA($\infty$) processes have an AR(1)
representation. At the time of estimation, it’s a lot easier to estimate
the single AR(1) coefficient rather than the infinite number of coeffi-
cients associated with the MA representation.
This is the reason that ARMA models are popular. Combining low-
order AR and MA models can usually offer a satisfactory representa-
tion of univariate time series data with a reasonable number of param-
eters.
Stationarity and invertibility of ARMA models is similar to what we’ve
seen - we won’t go into the details. Likewise, calculating moments is
similar.
Bibliography
[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford
Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ.
Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press.
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for supple-
mentary use only).
Index
Cobb-Douglas model, 21
convergence, almost sure, 438
convergence, in distribution, 439
convergence, in probability, 438
Convergence, ordinary, 437
convergence, pointwise, 437
convergence, uniform, 437
convergence, uniform almost sure, 440
cross section, 17
leverage, 28
likelihood function, 49
matrix, idempotent, 27
matrix, projection, 26
matrix, symmetric, 27
parameter space, 49
Product rule, 436
R-squared, centered, 32
R-squared, uncentered, 31