

Econometrics - Slides
2010/2011

João Nicolau

1 Introduction

1.1 What is Econometrics?

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. Applications of econometrics:

forecasting (e.g. interest rates, inflation rates, and gross domestic product);

studying economic relations;

testing economic theories;

evaluating and implementing government and business policy. For example, what are the effects of political campaign expenditures on voting outcomes? What is the effect of school spending on student performance in the field of education?

1.2 Steps in Empirical Economic Analysis


Formulate the question of interest. The question might deal with testing a certain
aspect of an economic theory, or it might pertain to testing the effects of a government
policy.

Build the economic model. An economic model consists of mathematical equations that
describe various relationships. Formal economic modeling is sometimes the starting point
for empirical analysis, but it is more common to use economic theory less formally, or
even to rely entirely on intuition.

Specify the econometric model.

Collect the data.

Estimate and test the econometric model.

Answer the question in step 1.



1.3 The Structure of Economic Data

1.3.1 Cross-Sectional Data

A cross-sectional data set is a sample of individuals, households, firms, cities, states, countries, etc. taken at a given point in time. An important feature of cross-sectional data is that they are obtained by random sampling from the underlying population. For example, suppose that y_i is the i-th observation of the dependent variable and x_i is the i-th observation of the explanatory variable. Random sampling means that

{(y_i, x_i)} is an i.i.d. sequence.

This implies that for i ≠ j

Cov(y_i, y_j) = 0,   Cov(x_i, x_j) = 0,   Cov(y_i, x_j) = 0.

Obviously, if x_i "explains" y_i we will have Cov(y_i, x_i) ≠ 0.

Cross-sectional data are closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics.

An example of Cross-Sectional Data:



Scatterplots may be adequate for analyzing cross-sectional data.

Models based on cross-sectional data usually satisfy the assumptions covered in the chapter
"Finite-Sample Properties of OLS".

1.3.2 Time-Series Data

A time series data set consists of observations on a variable or several variables over time.
E.g.: stock prices, money supply, the consumer price index, gross domestic product, annual
homicide rates, automobile sales figures, etc.

Time series data cannot be assumed to be independent across time. For example, knowing
something about the gross domestic product from last quarter tells us quite a bit about the
likely range of the GDP during this quarter.

The analysis of time series data is more difficult than that of cross-sectional data. Reasons:

we need to account for the dependent nature of economic time series;

time-series data exhibit unique features such as trends over time and seasonality;

models based on time-series data rarely satisfy the assumptions covered in the chapter
"Finite-Sample Properties of OLS". The appropriate assumptions are covered in the chapter
"Large-Sample Theory", which is theoretically more advanced.

An example of a time series (scatterplots cannot in general be used here, but there are
exceptions):

1.3.3 Pooled Cross Sections and Panel or Longitudinal Data

Data sets have both cross-sectional and time series features.

1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis

Ceteris Paribus: "other (relevant) factors being equal". It plays an important role in causal
analysis.

Example. Suppose that wages depend on education and labor force experience. Your goal
is to measure the "return to education". If your analysis involves only wages and education
you may not uncover the ceteris paribus effect of education on wages. Consider the following
data:

monthly wage (Euros)   years of experience   years of education
1500                   6                     9
1500                   0                     15
1600                   1                     15
2000                   8                     12
2500                   10                    12

Example. In a totalitarian regime, how could you measure the ceteris paribus effect of
another year of education on wages? You might create 100 clones of a "normal" individual,
give each person a different amount of education, and then measure their wages.

Ceteris paribus is relatively easy to analyze with experimental data.

Example (Experimental Data). Consider the effects of new fertilizers on crop yields. Suppose
the crop under consideration is soybeans. Since fertilizer amount is only one factor
affecting yields (others include rainfall, quality of land, and presence of parasites),
this issue must be posed as a ceteris paribus question. One way to determine the causal effect
of fertilizer amount on soybean yield is to conduct an experiment, which might include the
following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer
to each plot and subsequently measure the yields.

In economics we have nonexperimental data, so in principle it is difficult to estimate
ceteris paribus effects. However, we will see that econometric methods can simulate a ceteris
paribus experiment. We will be able to do in nonexperimental environments what natural
scientists are able to do in a controlled laboratory setting: keep other factors fixed.

2 Finite-Sample Properties of OLS

This chapter covers the finite- or small-sample properties of the OLS estimator, that is, the
statistical properties of the OLS estimator that are valid for any given sample size.

2.1 The Classical Linear Regression Model

The dependent variable is related to several other variables (called the regressors or the
explanatory variables).

Let y_i be the i-th observation of the dependent variable.

Let (x_i1, x_i2, ..., x_iK) be the i-th observation of the K regressors. The sample or data is the
collection of those n observations.

The data in economics cannot be generated by experiments (except in experimental economics),
so both the dependent and independent variables have to be treated as random
variables, i.e. variables whose values are subject to chance.

2.1.1 The Linearity Assumption

Assumption (1.1 - Linearity). We have

y_i = β_1 x_i1 + β_2 x_i2 + ... + β_K x_iK + ε_i,   i = 1, 2, ..., n,

where the β's are unknown parameters to be estimated, and ε_i is the unobserved error term.

The β's are the regression coefficients. They represent the marginal and separate effects of the regressors.

Example (1.1). (Consumption function): Consider

con_i = β_1 + β_2 yd_i + ε_i.

con_i: consumption; yd_i: disposable income. Note: x_i1 = 1, x_i2 = yd_i. The error ε_i
represents other variables besides disposable income that influence consumption. They include:
those variables (such as financial assets) that might be observable but the researcher
decided not to include as regressors, as well as those variables (such as the "mood" of the
consumer) that are hard to measure. The equation is called the simple regression model.

The linearity assumption is not as restrictive as it might first seem.

Example (1.2). (Wage equation). Consider

wage_i = e^{β_1} e^{β_2 educ_i} e^{β_3 tenure_i} e^{β_4 expr_i} e^{ε_i},

where wage = the wage rate for the individual, educ = education in years, tenure = years
on the current job, and expr = experience in the labor market. This equation can be written as

log(wage_i) = β_1 + β_2 educ_i + β_3 tenure_i + β_4 expr_i + ε_i.

The equation is said to be in the semi-log form (or log-level form).

Example. Does this model

y_i = β_1 + β_2 x_i2 + β_3 log(x_i2) + β_4 x_i3^2 + ε_i

violate Assumption 1.1?

There are, of course, cases of genuine nonlinearity. For example

y_i = β_1 + e^{β_2 x_i2} + ε_i.

Partial Effects

To simplify, let us consider K = 2 and assume that E(ε_i | x_i1, x_i2) = 0.

What is the impact on the conditional expected value E(y_i | x_i1, x_i2) when x_i2 is increased
by a small amount Δx_i2,

x_i' = (x_i1, x_i2)  →  (x_i1, x_i2 + Δx_i2)   (holding the other variable fixed)?

Let

ΔE(y_i | x_i) ≡ E(y_i | x_i1, x_i2 + Δx_i2) − E(y_i | x_i1, x_i2).

Equation                                               Interpretation of β_2
(level-level)  y_i = β_1 + β_2 x_i2 + ε_i              ΔE(y_i|x_i) = β_2 Δx_i2
(level-log)    y_i = β_1 + β_2 log(x_i2) + ε_i         ΔE(y_i|x_i) ≈ (β_2/100) [100 Δx_i2/x_i2]
(log-level)    log(y_i) = β_1 + β_2 x_i2 + ε_i         100 ΔE(y_i|x_i)/E(y_i|x_i) ≈ (100 β_2) Δx_i2   (100 β_2: semi-elasticity)
(log-log)      log(y_i) = β_1 + β_2 log(x_i2) + ε_i    100 ΔE(y_i|x_i)/E(y_i|x_i) ≈ β_2 [100 Δx_i2/x_i2]   (β_2: elasticity)

Exercise 2.1. Suppose, for example, that the marginal effect of experience on wages declines with
the level of experience. How can this be captured?

Exercise 2.2. Provide an interpretation of β_2 in the following equations:

(a) con_i = β_1 + β_2 inc_i + ε_i, where inc: income, con: consumption (both measured in
dollars). Assume that β_2 = 0.8;

(b) log(wage_i) = β_1 + β_2 educ_i + β_3 tenure_i + β_4 expr_i + ε_i. Assume that β_2 = 0.05;

(c) log(price_i) = β_1 + β_2 log(dist_i) + ε_i, where price = housing price and dist =
distance from a recently built garbage incinerator. Assume that β_2 = 0.6.

2.1.2 Matrix Notation

We have

y_i = β_1 x_i1 + β_2 x_i2 + ... + β_K x_iK + ε_i = [x_i1  x_i2  ...  x_iK] (β_1, β_2, ..., β_K)' + ε_i = x_i'β + ε_i,

where

x_i = (x_i1, x_i2, ..., x_iK)'   (K × 1),      β = (β_1, β_2, ..., β_K)'   (K × 1),

so that

y_i = x_i'β + ε_i.

More compactly,

y = [ y_1        X = [ x_11  x_12  ...  x_1K        ε = [ ε_1
      y_2              x_21  x_22  ...  x_2K              ε_2
      ...              ...   ...        ...               ...
      y_n ],           x_n1  x_n2  ...  x_nK ],           ε_n ],

y = Xβ + ε.

Example. y_i = β_1 + β_2 educ_i + β_3 exp_i + ε_i (y_i = wages in Euros). An example of
cross-sectional data is

y = [ 2000        X = [ 1  12   5
      2500              1  15   6
      1500              1  12   3
      ...               ...  ...  ...
      5000              1  17  15
      1000 ],           1  12   1 ].

Important: y and X (or y_i and x_ik) may be random variables or observed values. We use
the same notation for both cases.

2.1.3 The Strict Exogeneity Assumption

Assumption (1.2 - Strict exogeneity). E(ε_i | X) = 0, for all i.

This assumption can be written as

E(ε_i | x_1, ..., x_n) = 0, for all i.

With random sampling, ε_i is automatically independent of the explanatory variables for
observations other than i. This implies that

E(ε_i | x_j) = 0, for all i, j with i ≠ j.

It remains to be analyzed whether or not

E(ε_i | x_i) = 0.

The strict exogeneity assumption can fail in situations such as:

(Cross-section or time series) Omitted variables;

(Cross-section or time series) Measurement error in some of the regressors;

(Time series, static models) There is feedback from y_i on future values of x_i;

(Time series, dynamic models) There is a lagged dependent variable as a regressor.

Example (Omitted variables). Suppose that wage is determined by

wage_i = β_1 + β_2 x_i2 + β_3 x_i3 + v_i,

where x_2: years of education, x_3: ability. Assume that E(v_i | X) = 0. Since ability is not
observed, we instead estimate the model

wage_i = β_1 + β_2 x_i2 + ε_i,    ε_i = β_3 x_i3 + v_i.

Thus,

E(ε_i | x_i) = β_3 x_i3 ≠ 0  ⇒  E(ε_i | X) ≠ 0.

Example (Measurement error in some of the regressors). Consider y = household savings
and w = disposable income, and

y_i = β_1 + β_2 w_i + v_i,    E(v_i | w) = 0.

Suppose that w cannot be measured absolutely accurately (for example, because of misreporting)
and denote the measured value for w_i by x_i2. We have

x_i2 = w_i + u_i.

Assume: E(u_i) = 0, Cov(w_i, u_i) = 0, Cov(v_i, u_i) = 0. Now substituting w_i = x_i2 − u_i
into y_i = β_1 + β_2 w_i + v_i we obtain

y_i = β_1 + β_2 x_i2 + ε_i,    ε_i = v_i − β_2 u_i.

Hence,

Cov(ε_i, x_i2) = ... = −β_2 Var(u_i) ≠ 0.

Cov(ε_i, x_i2) ≠ 0  ⇒  E(ε_i | X) ≠ 0.
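A quick way to see this bias in practice is to simulate it. The sketch below (hypothetical parameter values, assuming numpy is available) draws data from the model above and compares the OLS slope computed with the true regressor w against the one computed with the mismeasured regressor x_2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta1, beta2 = 100_000, 1.0, 2.0   # hypothetical parameter values

w = rng.normal(10.0, 2.0, n)          # true disposable income
u = rng.normal(0.0, 1.0, n)           # measurement error, Cov(w, u) = 0
v = rng.normal(0.0, 1.0, n)           # structural error, E(v | w) = 0
y = beta1 + beta2 * w + v             # household savings
x2 = w + u                            # mismeasured regressor

def ols_slope(x, y):
    # slope of the OLS regression of y on a constant and x
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

print("slope using w :", ols_slope(w, y))    # close to 2.0
print("slope using x2:", ols_slope(x2, y))   # attenuated toward beta2*Var(w)/(Var(w)+Var(u)) = 1.6
```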

Example (Feedback from y on future values of x). Consider a simple static time-series model
to explain a city's murder rate (y_t) in terms of police officers per capita (x_t):

y_t = β_1 + β_2 x_t + ε_t.

Suppose that the city adjusts the size of its police force based on past values of the murder
rate. This means that, say, x_{t+1} might be correlated with ε_t (since a higher ε_t leads to a
higher y_t).

Example (There is a lagged dependent variable as a regressor). See section 2.1.5.

Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ
denote years of education for the woman. A simple model relating fertility to years of
education is

kids_i = β_1 + β_2 educ_i + ε_i,

where ε_i is the unobserved error. (i) What kinds of factors are contained in ε_i? Are these
likely to be correlated with the level of education? (ii) Will a simple regression analysis uncover
the ceteris paribus effect of education on fertility? Explain.

2.1.4 Implications of Strict Exogeneity

The assumption E(ε_i | X) = 0 for all i implies:

E(ε_i) = 0, for all i;

E(ε_i | x_j) = 0, for all i, j;

E(x_jk ε_i) = 0, for all i, j, k (or E(x_j ε_i) = 0, for all i, j): the regressors are orthogonal to the
error term for all observations;

Cov(x_jk, ε_i) = 0.

Note: if E(ε_i | x_j) ≠ 0 or E(x_jk ε_i) ≠ 0 or Cov(x_jk, ε_i) ≠ 0, then E(ε_i | X) ≠ 0.

2.1.5 Strict Exogeneity in Time-Series Models

For time-series models, strict exogeneity can be rephrased as: the regressors are orthogonal
to the past, current, and future error terms. However, for most time-series models,
strict exogeneity is not satisfied.

Example. Consider

y_i = β y_{i-1} + ε_i,    E(ε_i | y_{i-1}) = 0 (thus E(y_{i-1} ε_i) = 0).

Let x_i = y_{i-1}. By construction we have

E(x_{i+1} ε_i) = E(y_i ε_i) = ... = E(ε_i²) ≠ 0.

The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity.
However, the estimator may possess good large-sample properties without strict exogeneity.

2.1.6 Other Assumptions of the Model

Assumption (1.3 - no multicollinearity). The rank of the n × K data matrix X is K with
probability 1.

None of the K columns of the data matrix X can be expressed as a linear combination of
the other columns of X.

Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed
jobs, then tenure_i = expr_i for all i, in violation of the no multicollinearity assumption.
There is no way to distinguish the tenure effect on the wage rate from the experience effect.
Remedy: drop tenure_i or expr_i from the wage equation.

Example (Dummy Variable Trap). Consider

wage_i = β_1 + β_2 educ_i + β_3 female_i + β_4 male_i + ε_i,

where

female_i = 1 if i corresponds to a female, 0 if i corresponds to a male;    male_i = 1 − female_i.

In vector notation we have

wage = β_1 1 + β_2 educ + β_3 female + β_4 male + ε.

It is obvious that 1 = female + male. Therefore the above model violates Assumption
1.3. One may also justify this using scalar notation: x_i1 = female_i + male_i, because this
relationship implies 1 = female + male. Can you overcome the dummy variable trap by
removing x_i1 ≡ 1 from the equation?

Exercise 2.4. In a study relating college grade point average to time spent in various activities,
you distribute a survey to several students. The students are asked how many hours
they spend each week in four activities: studying, sleeping, working, and leisure. Any activity
is put into one of the four categories, so that for each student the sum of hours in the four
activities must be 168. (i) In the model

GPA_i = β_1 + β_2 study_i + β_3 sleep_i + β_4 work_i + β_5 leisure_i + ε_i

does it make sense to hold sleep, work, and leisure fixed, while changing study? (ii) Explain
why this model violates Assumption 1.3. (iii) How could you reformulate the model so that
its parameters have a useful interpretation and it satisfies Assumption 1.3?

Assumption (1.4 - spherical error variance). The error term satisfies:

E(ε_i² | X) = σ² > 0, for all i   (homoskedasticity);

E(ε_i ε_j | X) = 0, for all i ≠ j   (no correlation between observations).

Exercise 2.5. Under Assumptions 1.2 and 1.4, show that Cov(y_i, y_j | X) = 0 for i ≠ j.

Assumption 1.4 and strict exogeneity imply:

Var(ε_i | X) = E(ε_i² | X) = σ²;

Cov(ε_i, ε_j | X) = 0 for i ≠ j;

E(εε' | X) = σ² I;

Var(ε | X) = σ² I.

Note

E(εε' | X) = [ E(ε_1² | X)      E(ε_1 ε_2 | X)   ...   E(ε_1 ε_n | X)
               E(ε_1 ε_2 | X)   E(ε_2² | X)      ...   E(ε_2 ε_n | X)
               ...              ...              ...   ...
               E(ε_1 ε_n | X)   E(ε_2 ε_n | X)   ...   E(ε_n² | X) ].

Exercise 2.6. Consider the savings function

sav_i = β_1 + β_2 inc_i + ε_i,    ε_i = √(inc_i) · z_i,

where z_i is a random variable with E(z_i) = 0 and Var(z_i) = σ_z². Assume that z_i is
independent of inc_j (for all i, j). (i) Show that E(ε | inc) = 0; (ii) Show that Assumption
1.4 is violated.

2.1.7 The Classical Regression Model for Random Samples

The sample (y, X) is a random sample if {(y_i, x_i)} is i.i.d. (independently and identically
distributed) across observations. A random sample automatically implies:

E(ε_i | X) = E(ε_i | x_i);

E(ε_i² | X) = E(ε_i² | x_i).

Therefore Assumptions 1.2 and 1.4 can be rephrased as

Assumption 1.2: E(ε_i | x_i) = E(ε_i) = 0;

Assumption 1.4: E(ε_i² | x_i) = E(ε_i²) = σ².

2.1.8 “Fixed” Regressors

This is a simplifying (and generally an unrealistic) assumption to make the statistical analysis
tractable. It means that X is exactly the same in repeated samples. Sampling schemes that
support this assumption:

a) Experimental situations. For example, suppose that y represents the yields of a crop
grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation
and fertilizer for each plot. The experiment can be repeated as often as desired, with the
same X. Only y varies across plots.

b) Stratified Sampling (for more details see Wooldridge, chap. 9).



2.2 The Algebra of Least Squares

2.2.1 OLS Minimizes the Sum of Squared Residuals

Residual for observation i (evaluated at β̃):

y_i − x_i'β̃.

Vector of residuals (evaluated at β̃):

y − Xβ̃.

Sum of squared residuals (SSR):

SSR(β̃) = Σ_{i=1}^n (y_i − x_i'β̃)² = (y − Xβ̃)'(y − Xβ̃).

The OLS (Ordinary Least Squares) estimator:

b = argmin_{β̃} SSR(β̃),

i.e. b is such that SSR(b) is minimized.

(For K = 1 the model is y_i = β x_i + ε_i.)

Example. Consider y_i = β_1 + β_2 x_i2 + ε_i. The data:

y     X
1     1  1
3     1  3
2     1  1
8     1  3
12    1  8

Verify that SSR(β̃) = 42 when β̃ = (0, 1)'.
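A minimal numerical check of this exercise (a sketch assuming numpy is available; the variable names are mine):

```python
import numpy as np

y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

beta_tilde = np.array([0., 1.])     # the candidate coefficient vector
resid = y - X @ beta_tilde          # residuals evaluated at beta_tilde
print(resid @ resid)                # SSR(beta_tilde) = 42.0
```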

2.2.2 Normal Equations

To solve the optimization problem min_{β̃} SSR(β̃) we use classical optimization:

First Order Condition (FOC):

∂SSR(β̃)/∂β̃ = 0.

Solve this equation with respect to β̃; let b be such a solution.

Second Order Condition (SOC):

∂²SSR(β̃)/∂β̃ ∂β̃' is a positive definite matrix  ⇒  b is a global minimum point.

To easily obtain the FOC we start by writing SSR(β̃) as

SSR(β̃) = (y − Xβ̃)'(y − Xβ̃)
        = ...
        = y'y − 2 y'Xβ̃ + β̃'X'Xβ̃.

Recalling from matrix algebra that

∂(a'β̃)/∂β̃ = a,    ∂(β̃'Aβ̃)/∂β̃ = 2Aβ̃   (for A symmetric),

we have

∂SSR(β̃)/∂β̃ = −2X'y + 2X'Xβ̃ = 0,

i.e. (replacing β̃ by the solution b)

X'Xb = X'y,   or   X'(y − Xb) = 0.

This is a system of K equations in K unknowns. These equations are called the normal
equations. If

rank(X) = K  ⇒  X'X is nonsingular  ⇒  (X'X)^{-1} exists.

Therefore, if rank(X) = K we have a unique solution:

b = (X'X)^{-1} X'y    (OLS estimator).

The SOC is

∂²SSR(β̃)/∂β̃ ∂β̃' = 2 X'X.

If rank(X) = K then 2X'X is a positive definite matrix, so SSR(β̃) is strictly convex
in R^K. Hence b is a global minimum point.

The vector of residuals evaluated at β̃ = b,

e = y − Xb,

is called the vector of OLS residuals (or simply the residuals).

The normal equations can be written as

X'e = 0  ⟺  (1/n) Σ_{i=1}^n x_i e_i = 0.

This shows that the normal equations can be interpreted as the sample analogue of the
orthogonality conditions E(x_i ε_i) = 0. Notice the reasoning: by assuming the orthogonality
conditions E(x_i ε_i) = 0 in the population, we deduce by the method of moments the
corresponding sample analogue

(1/n) Σ_i x_i (y_i − x_i'β̃) = 0.

We obtain the OLS estimator b by solving this equation with respect to β̃.

2.2.3 Two Expressions for the OLS Estimator

b = (X'X)^{-1} X'y;

b = (X'X/n)^{-1} (X'y/n) = S_xx^{-1} S_xy,   where

S_xx = X'X/n = (1/n) Σ_{i=1}^n x_i x_i'   (sample average of x_i x_i'),
S_xy = X'y/n = (1/n) Σ_{i=1}^n x_i y_i    (sample average of x_i y_i).

Example (continuation of previous example). Consider the data

y     X
1     1  1
3     1  3
2     1  1
8     1  3
12    1  8

Obtain b, e and SSR(b).
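One possible way to carry out this computation numerically (a sketch assuming numpy; np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
n = len(y)

# First expression: b = (X'X)^{-1} X'y, solved as a linear system
b = np.linalg.solve(X.T @ X, X.T @ y)

# Second expression: b = Sxx^{-1} Sxy with the sample averages
Sxx, Sxy = X.T @ X / n, X.T @ y / n
b_alt = np.linalg.solve(Sxx, Sxy)

e = y - X @ b                 # OLS residuals
print(b, b_alt)               # identical, roughly (1.41, 1.18)
print(e @ e)                  # SSR(b), necessarily below SSR((0,1)') = 42
```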



2.2.4 More Concepts and Algebra

The fitted value for observation i: ŷ_i = x_i'b.

The vector of fitted values: ŷ = Xb.

The vector of OLS residuals: e = y − Xb = y − ŷ.

The projection matrix P and the annihilator M are defined as

P = X(X'X)^{-1}X',    M = I − P.

Properties:

Exercise 2.7. Show that P and M are symmetric and idempotent and that

PX = X,
MX = 0,
ŷ = Py,
e = My = Mε,
SSR = e'e = y'My = ε'Mε.
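The identities in Exercise 2.7 are easy to check numerically on the small data set above (a sketch, assuming numpy and the y, X defined earlier):

```python
import numpy as np

y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(len(y)) - P                   # annihilator

print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent
print(np.allclose(M @ X, 0))                        # MX = 0
print(np.allclose(P @ y, X @ np.linalg.solve(X.T @ X, X.T @ y)))  # Py = fitted values
```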

The OLS estimate of σ² (the variance of the error term), denoted s², is

s² = SSR/(n − K) = e'e/(n − K).

Its square root, s, is called the standard error of the regression.

The sampling error:

b − β = ... = (X'X)^{-1} X'ε.

Coefficient of Determination

A measure of goodness of fit is the coefficient of determination

R² = Σ_{i=1}^n (ŷ_i − ȳ)² / Σ_{i=1}^n (y_i − ȳ)² = 1 − Σ_{i=1}^n e_i² / Σ_{i=1}^n (y_i − ȳ)²,    0 ≤ R² ≤ 1.

It measures the proportion of the variation of y that is accounted for by variation in the
regressors, the x_j's. Derivation of R²: [board]

[Three scatterplots of y and the fitted values ŷ against x, illustrating regressions with R² = 0.96, R² = 0.19 and R² = 0.00.]

"The most important thing about R² is that it is not important" (Goldberger). Why?

We are concerned with parameters in a population, not with goodness of fit in the sample;

We can always increase R² by adding more explanatory variables. In the limit, if K = n then R² = 1.

Exercise 2.8. Prove that K = n ⇒ R² = 1 (assume that Assumption 1.3 holds).

It can be proved that

R² = ρ̂²,    ρ̂ = [Σ_i (ŷ_i − ȳ)(y_i − ȳ)/n] / (S_ŷ S_y),

i.e. R² is the squared sample correlation between ŷ and y.

Adjusted coefficient of determination

R̄² = 1 − [(n − 1)/(n − K)] (1 − R²) = 1 − [Σ_{i=1}^n e_i²/(n − K)] / [Σ_{i=1}^n (y_i − ȳ)²/(n − 1)].

Contrary to R², R̄² may decline when a variable is added to the set of independent variables.

2.3 Finite-Sample Properties of OLS

First of all we need to recognize that b and b|X are random!

Assumptions:
1.1 - Linearity: y_i = β_1 x_i1 + β_2 x_i2 + ... + β_K x_iK + ε_i.
1.2 - Strict exogeneity: E(ε_i | X) = 0.
1.3 - No multicollinearity.
1.4 - Spherical error variance: E(ε_i² | X) = σ², E(ε_i ε_j | X) = 0 (i ≠ j).

Proposition (1.1 - finite-sample properties of b). We have:

(a) (unbiasedness) Under Assumptions 1.1-1.3, E(b | X) = β.
(b) (expression for the variance) Under Assumptions 1.1-1.4, Var(b | X) = σ² (X'X)^{-1}.
(c) (Gauss-Markov Theorem) Under Assumptions 1.1-1.4, the OLS estimator is efficient in
the class of linear unbiased estimators (it is the Best Linear Unbiased Estimator). That
is, for any unbiased estimator β̂ that is linear in y, Var(β̂ | X) ≥ Var(b | X) in the matrix
sense (i.e. Var(β̂ | X) − Var(b | X) is a positive semidefinite matrix).
(d) Under Assumptions 1.1-1.4, Cov(b, e | X) = 0. Proof: [board]

Proposition (1.2 - Unbiasedness of s²). Let s² = e'e/(n − K). We have

E(s² | X) = E(s²) = σ².   Proof: [board]

An unbiased estimator of Var(b | X) is

s² (X'X)^{-1}.
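For the small data set used earlier, s² and the estimated standard errors of b could be computed as follows (a sketch assuming numpy):

```python
import numpy as np

y = np.array([1., 3., 2., 8., 12.])
X = np.array([[1., 1.], [1., 3.], [1., 1.], [1., 3.], [1., 8.]])
n, K = X.shape

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

s2 = e @ e / (n - K)                   # unbiased estimate of sigma^2
var_b = s2 * np.linalg.inv(X.T @ X)    # estimated Var(b | X)
se_b = np.sqrt(np.diag(var_b))         # standard errors of b1, b2
print(s2, se_b)
```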
Example. Consider

colGPA_i = β_1 + β_2 HSGPA_i + β_3 ACT_i + β_4 SKIPPED_i + β_5 PC_i + ε_i,

where colGPA: college grade point average (GPA); HSGPA: high school GPA; ACT:
achievement examination for college admission; SKIPPED: average lectures missed per
week; PC: a binary variable (0/1) identifying who owns a personal computer. Using a
survey of 141 students (Michigan State University) in Fall 1994, we obtained the following
results.

These results tell us that n = 141, s = 0.325, R² = 0.259, SSR = 14.37, and

b = (1.356, 0.4129, 0.0133, −0.071, 0.1244)',

s²(X'X)^{-1} = [ 0.3275²   ?         ?        ?        ?
                 ?         0.0924²   ?        ?        ?
                 ?         ?         0.010²   ?        ?
                 ?         ?         ?        0.026²   ?
                 ?         ?         ?        ?        0.0573² ]

(the "?" entries are not reported).

2.4 More on Regression Algebra

2.4.1 Regression Matrices

Matrix P = X(X'X)^{-1}X':
  Py → fitted values from the regression of y on X;
  Pz → ?

Matrix M = I − P = I − X(X'X)^{-1}X':
  My → residuals from the regression of y on X;
  Mz → ?

Consider a partition of X as follows: X = [X_1  X_2].

Matrix P_1 = X_1(X_1'X_1)^{-1}X_1':
  P_1 y → ?

Matrix M_1 = I − P_1 = I − X_1(X_1'X_1)^{-1}X_1':
  M_1 y → ?

2.4.2 Short and Long Regression Algebra

Partition X as

X = [X_1  X_2],    X_1: n × K_1,   X_2: n × K_2,   K_1 + K_2 = K.

Long Regression

We have

y = ŷ + e = Xb + e = [X_1  X_2] (b_1', b_2')' + e = X_1 b_1 + X_2 b_2 + e.

Short Regression

Suppose that we shorten the list of explanatory variables and regress y on X_1. We have

y = ŷ* + e* = X_1 b_1* + e*,

where

b_1* = (X_1'X_1)^{-1} X_1'y,
e* = M_1 y,    M_1 = I − X_1(X_1'X_1)^{-1}X_1'.

How are b_1* and e* related to b_1 and e?

b_1* vs. b_1

We have

b_1* = (X_1'X_1)^{-1} X_1'y
     = (X_1'X_1)^{-1} X_1'(X_1 b_1 + X_2 b_2 + e)
     = b_1 + (X_1'X_1)^{-1} X_1'X_2 b_2 + (X_1'X_1)^{-1} X_1'e        (the last term is 0, since X_1'e = 0)
     = b_1 + (X_1'X_1)^{-1} X_1'X_2 b_2
     = b_1 + F b_2,    F = (X_1'X_1)^{-1} X_1'X_2.

Thus, in general, b_1* ≠ b_1. Exceptional cases: b_2 = 0 or X_1'X_2 = O ⇒ b_1* = b_1.

e* vs. e

We have

e* = M_1 y
   = M_1 (X_1 b_1 + X_2 b_2 + e)
   = M_1 X_1 b_1 + M_1 X_2 b_2 + M_1 e
   = M_1 X_2 b_2 + e            (since M_1 X_1 = 0 and M_1 e = e)
   = v + e,    v = M_1 X_2 b_2.

Thus,

e*'e* = e'e + v'v ≥ e'e.

Thus the SSR of the short regression (e*'e*) exceeds the SSR of the long regression (e'e),
and e*'e* = e'e iff v = 0, that is iff b_2 = 0.

Example. Illustration of b_1* ≠ b_1 and e*'e* ≥ e'e.

Find X, X_1, X_2, b, b_1, b_2, b_1*, e*'e*, e'e.

2.4.3 Residual Regression

Consider

y = Xβ + ε = X_1 β_1 + X_2 β_2 + ε.

Premultiplying both sides by M_1 and using M_1 X_1 = 0, we obtain

M_1 y = M_1 X_1 β_1 + M_1 X_2 β_2 + M_1 ε,
ỹ = X̃_2 β_2 + M_1 ε,

where ỹ = M_1 y and X̃_2 = M_1 X_2. OLS applied to this equation gives

b_2 = (X̃_2'X̃_2)^{-1} X̃_2'ỹ = (X̃_2'X̃_2)^{-1} X̃_2'M_1 y = (X̃_2'X̃_2)^{-1} X̃_2'y.

Thus

b_2 = (X̃_2'X̃_2)^{-1} X̃_2'y.

Another way to prove b_2 = (X̃_2'X̃_2)^{-1} X̃_2'y (you may skip this proof). We have

(X̃_2'X̃_2)^{-1} X̃_2'y = (X̃_2'X̃_2)^{-1} X̃_2'(X_1 b_1 + X_2 b_2 + e)
                       = (X̃_2'X̃_2)^{-1} X̃_2'X_1 b_1 + (X̃_2'X̃_2)^{-1} X̃_2'X_2 b_2 + (X̃_2'X̃_2)^{-1} X̃_2'e
                       = 0 + b_2 + 0
                       = b_2,

since:

(X̃_2'X̃_2)^{-1} X̃_2'X_1 b_1 = (X̃_2'X̃_2)^{-1} X_2'M_1 X_1 b_1 = 0      (because M_1 X_1 = 0);

(X̃_2'X̃_2)^{-1} X̃_2'X_2 b_2 = (X̃_2'X̃_2)^{-1} X_2'M_1 X_2 b_2
                             = (X_2'M_1'M_1 X_2)^{-1} X_2'M_1 X_2 b_2
                             = (X_2'M_1 X_2)^{-1} X_2'M_1 X_2 b_2
                             = b_2;

X̃_2'e = X_2'M_1 e = X_2'e = 0.

The conclusion is that we can obtain b_2 = (X̃_2'X̃_2)^{-1} X̃_2'y = (X̃_2'X̃_2)^{-1} X̃_2'ỹ as follows:

1) Regress X_2 on X_1 to get the residuals X̃_2 = M_1 X_2. Interpretation of X̃_2: X̃_2 is X_2 after the
effects of X_1 have been removed, or X̃_2 is the part of X_2 that is uncorrelated with X_1.

2) Regress y on X̃_2 to get the coefficient b_2 of the long regression.

OR:

1') Same as 1).
2'a) Regress y on X_1 to get the residuals ỹ = M_1 y.
2'b) Regress ỹ on X̃_2 to get the coefficient b_2 of the long regression.

The conclusion of 1) and 2) is extremely important: b_2 relates y to X_2 after controlling for
the effects of X_1. This is why b_2 can be obtained from the regression of y on X̃_2, where
X̃_2 is X_2 after the effects of X_1 have been removed (fixed or controlled for). This means
that b_2 has in fact a ceteris paribus interpretation.

To recover b_1 we consider the equation b_1* = b_1 + F b_2. Regress y on X_1, obtaining
b_1* = (X_1'X_1)^{-1} X_1'y, and now

b_1 = b_1* − (X_1'X_1)^{-1} X_1'X_2 b_2 = b_1* − F b_2.
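The following sketch (assuming numpy; the data are simulated, so the numbers are arbitrary) checks the residual-regression result numerically: the coefficient on X_2 from the long regression equals the coefficient from regressing y on M_1 X_2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one regressor
X2 = 0.5 * X1[:, 1:] + rng.normal(size=(n, 1))           # correlated with X1
y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * 3.0 + rng.normal(size=n)

X = np.hstack([X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ y)                    # long regression

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # annihilator of X1
X2_t = M1 @ X2                                           # X2 purged of X1
b2 = np.linalg.solve(X2_t.T @ X2_t, X2_t.T @ y)          # residual regression

print(b[-1], b2[0])   # identical up to rounding error
```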

Example. Consider the time-series example shown earlier.

Example. Consider X = [1  exper  tenure  IQ  educ] and

X_1 = [1  exper  tenure  IQ],    X_2 = educ.

2.4.4 Applications of Residual Regression

A) Trend Removal (time series)

Suppose that y_t and x_t have a linear trend. Should the trend term be included in the
regression, as in

y_t = β_1 + β_2 x_t2 + β_3 x_t3 + ε_t,    x_t3 = t,

or should the variables first be "detrended" and then used without the trend term, as in

ỹ_t = β_2 x̃_t2 + ε̃_t ?

According to the previous results, the OLS coefficient b_2 is the same in both regressions.
In the second regression b_2 is obtained from the regression of ỹ = M_1 y on x̃_2 = M_1 x_2,
where

X_1 = [1  x_3] = [ 1  1
                   1  2
                   ...
                   1  n ].

Example. Consider (TXDES: unemployment rate, INF: inflation, t: time)

TXDES_t = β_1 + β_2 INF_t + β_3 t + ε_t.

We will show two ways to obtain b_2 (compare EQ01 to EQ04).

EQ01 - Dependent Variable: TXDES (Least Squares, Sample: 1948 2003)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           4.463068      0.425856     10.48023      0.0000
INF         0.104712      0.063329     1.653473      0.1041
@TREND      0.027788      0.011806     2.353790      0.0223

EQ02 - Dependent Variable: TXDES (Least Squares, Sample: 1948 2003)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           4.801316      0.379453     12.65325      0.0000
@TREND      0.030277      0.011896     2.545185      0.0138

EQ03 - Dependent Variable: INF (Least Squares, Sample: 1948 2003)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3.230263      0.802598     4.024758      0.0002
@TREND      0.023770      0.025161     0.944696      0.3490

EQ04 - Dependent Variable: TXDES_ (Least Squares, Sample: 1948 2003)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
INF_        0.104712      0.062167     1.684382      0.0978

Here TXDES_ and INF_ denote the detrended series ỹ and x̃_2; the coefficient on INF_ in EQ04 equals the coefficient on INF in EQ01.

B) Seasonal Adjustment and Linear Regression with Seasonal Data

Suppose that we have data on the variable y, quarter by quarter, for m years. A way to deal
with (deterministic) seasonality is the following:

y_t = β_1 Q_t1 + β_2 Q_t2 + β_3 Q_t3 + β_4 Q_t4 + β_5 x_t5 + ε_t,

where

Q_ti = 1 in quarter i, 0 otherwise.

Let

X = [Q_1  Q_2  Q_3  Q_4  x_5],    X_1 = [Q_1  Q_2  Q_3  Q_4].

The previous results show that b_5 can be obtained from the regression of ỹ = M_1 y on
x̃_5 = M_1 x_5. It can be proved that

ỹ_t = y_t − ȳ_Q1 in quarter 1,
      y_t − ȳ_Q2 in quarter 2,
      y_t − ȳ_Q3 in quarter 3,
      y_t − ȳ_Q4 in quarter 4,

where ȳ_Qi is the seasonal mean of quarter i.

C) Deviations from Means

Let x_1 be the summer vector (the vector of ones). Instead of regressing y on [x_1  x_2  ...  x_K]
to get (b_1, b_2, ..., b_K)', we can regress y on

[ x_12 − x̄_2   ...   x_1K − x̄_K
  ...                 ...
  x_n2 − x̄_2   ...   x_nK − x̄_K ]

to get the same vector (b_2, ..., b_K)'. We sketch the proof. Let

X_2 = [x_2  ...  x_K],

so that

ŷ = x_1 b_1 + X_2 b_2.

1) Regress X_2 on x_1 to get the residuals X̃_2 = M_1 X_2, where

M_1 = I − x_1 (x_1'x_1)^{-1} x_1' = I − x_1 x_1'/n.

As we know,

X̃_2 = M_1 X_2 = M_1 [x_2  ...  x_K] = [M_1 x_2  ...  M_1 x_K]
    = [ x_12 − x̄_2   ...   x_1K − x̄_K
        ...                 ...
        x_n2 − x̄_2   ...   x_nK − x̄_K ].

2) Regress y (or ỹ = M_1 y) on X̃_2 to get the coefficient b_2 of the long regression:

b_2 = (X̃_2'X̃_2)^{-1} X̃_2'y = (X̃_2'X̃_2)^{-1} X̃_2'ỹ.

The intercept can be recovered as

b_1 = b_1* − (x_1'x_1)^{-1} x_1'X_2 b_2.

2.4.5 Short and Residual Regression in the Classical Regression Model

Consider:

y = X_1 b_1 + X_2 b_2 + e    (long regression),

y = X_1 b_1* + e*    (short regression).

The correct specification corresponds to the long regression:

E(y | X) = X_1 β_1 + X_2 β_2 = Xβ,
Var(y | X) = σ² I,  etc.

A) Short-Regression Coefficients

b_1* is a biased estimator of β_1.

Given that

b_1* = (X_1'X_1)^{-1} X_1'y = b_1 + F b_2,    F = (X_1'X_1)^{-1} X_1'X_2,

we have

E(b_1* | X) = E(b_1 + F b_2 | X) = β_1 + F β_2,

Var(b_1* | X) = Var((X_1'X_1)^{-1} X_1'y | X) = (X_1'X_1)^{-1} X_1' Var(y | X) X_1 (X_1'X_1)^{-1}
              = σ² (X_1'X_1)^{-1}.

Thus, in general,

b_1* is a biased estimator of β_1 ("omitted-variable bias"),

unless:

β_2 = 0: corresponds to the case of "irrelevant omitted variables";

F = O: corresponds to the case of "orthogonal explanatory variables" (in sample space).

Var(b_1 | X) ≥ Var(b_1* | X)   (you may skip the proof)

Consider b_1 = b_1* − F b_2:

Var(b_1 | X) = Var(b_1* − F b_2 | X)
             = Var(b_1* | X) + Var(F b_2 | X)      since Cov(b_1*, b_2 | X) = O [board]
             = Var(b_1* | X) + F Var(b_2 | X) F'.

Because F Var(b_2 | X) F' is positive semidefinite (or nonnegative definite), Var(b_1 | X) ≥ Var(b_1* | X).

This relation is still valid if β_2 = 0. In this case (β_2 = 0), regressing y on X_1 and on irrelevant
variables (X_2) involves a cost: Var(b_1 | X) ≥ Var(b_1* | X), although E(b_1 | X) = β_1.

In practice there may be a bias-variance trade-off between the short and the long regression
when the target is β_1.

Exercise 2.9. Consider the standard simple regression model y_i = β_1 + β_2 x_i2 + ε_i under
Assumptions 1.1 through 1.4. Thus, the usual OLS estimators b_1 and b_2 are unbiased for
their respective population parameters. Let b_2* be the estimator of β_2 obtained by assuming
the intercept is zero, i.e. β_1 = 0. (i) Find E(b_2* | X). Verify that b_2* is unbiased for β_2 when
the population intercept β_1 is zero. Are there other cases where b_2* is unbiased? (ii) Find the
variance of b_2*. (iii) Show that Var(b_2* | X) ≤ Var(b_2 | X). (iv) Comment on the trade-off
between bias and variance when choosing between b_2 and b_2*.

Exercise 2.10. Suppose that average worker productivity at manufacturing firms (avgprod)
depends on two factors, average hours of training (avgtrain) and average worker ability
(avgabil):

avgprod_i = β_1 + β_2 avgtrain_i + β_3 avgabil_i + ε_i.

Assume that this equation satisfies Assumptions 1.1 through 1.4. If grants have been given to
firms whose workers have less than average ability, so that avgtrain and avgabil are negatively
correlated, what is the likely bias in the estimate of β_2 obtained from the simple regression of
avgprod on avgtrain?

B) Short-Regression Residuals (skip this)

Given that e* = M_1 y, we have

E(e* | X) = M_1 E(y | X) = M_1 E(X_1 β_1 + X_2 β_2 | X) = X̃_2 β_2,
Var(e* | X) = Var(M_1 y | X) = M_1 Var(y | X) M_1' = σ² M_1.

Thus E(e* | X) ≠ 0, unless β_2 = 0.

Let us see now that the omission of explanatory variables leads to an increase in the expected
SSR. We have, by R5,

E(e*'e* | X) = E(y'M_1 y | X) = tr(M_1 Var(y | X)) + E(y | X)' M_1 E(y | X)
             = σ² tr(M_1) + β_2' X̃_2'X̃_2 β_2 = σ² (n − K_1) + β_2' X̃_2'X̃_2 β_2,

and E(e'e | X) = σ² (n − K), thus

E(e*'e* | X) − E(e'e | X) = σ² K_2 + β_2' X̃_2'X̃_2 β_2 > 0.

Notice that: e*'e* − e'e = b_2' X̃_2'X̃_2 b_2 ≥ 0   (check that E(b_2' X̃_2'X̃_2 b_2 | X) = σ² K_2 + β_2' X̃_2'X̃_2 β_2).

C) Residual Regression

The objective is to characterize

Var(b_2 | X).

We know that b_2 = (X̃_2'X̃_2)^{-1} X̃_2'y. Thus

Var(b_2 | X) = Var((X̃_2'X̃_2)^{-1} X̃_2'y | X)
             = (X̃_2'X̃_2)^{-1} X̃_2' Var(y | X) X̃_2 (X̃_2'X̃_2)^{-1}
             = σ² (X̃_2'X̃_2)^{-1}
             = σ² (X_2'M_1 X_2)^{-1}.

Now suppose that

X = [X_1  x_K]    (i.e. x_K = X_2).

It follows that

Var(b_K | X) = σ² / (x_K'M_1 x_K),

and x_K'M_1 x_K is the sum of the squared residuals in the auxiliary regression

x_K = δ_1 x_1 + δ_2 x_2 + ... + δ_{K−1} x_{K−1} + error.

One can conclude (assuming that x_1 is the summer vector):

R_K² = 1 − x_K'M_1 x_K / Σ_i (x_iK − x̄_K)².

Solving this equation for x_K'M_1 x_K we have

x_K'M_1 x_K = (1 − R_K²) Σ_i (x_iK − x̄_K)².

We get

Var(b_K | X) = σ² / [(1 − R_K²) Σ_i (x_iK − x̄_K)²] = σ² / [(1 − R_K²) S²_xK n].

Var(b_K | X) = σ² / [(1 − R_K²) Σ_i (x_iK − x̄_K)²] = σ² / [(1 − R_K²) S²_xK n].

We can conclude that the precision of b_K is high (i.e. Var(b_K) is small) when:

σ² is low;

S²_xK is high (imagine the regression wage = β_1 + β_2 educ + ε: if most people in the sample
report the same education, S²_xK will be low and β_2 will be estimated very imprecisely);

n is high (a large sample is preferable to a small sample);

R_K² is low (multicollinearity increases R_K²).

Exercise 2.11. Consider: sleep: minutes of sleep at night per week; totwrk: hours worked
per week; educ: years of schooling; female: binary variable equal to one if the individual
is female. Do women sleep more than men? Explain the differences between the estimates
32.18 and -90.969.

Dependent Variable: SLEEP (Least Squares, Sample: 1 706)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3252.407      22.22211     146.3591      0.0000
FEMALE      32.18074      33.75413     0.953387      0.3407
R-squared 0.001289   Adjusted R-squared -0.000129   S.E. of regression 444.4422   Sum squared resid 1.39E+08

Dependent Variable: SLEEP (Least Squares, Sample: 1 706)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           3838.486      86.67226     44.28737      0.0000
TOTWRK      -0.167339     0.017937     -9.329260     0.0000
EDUC        -13.88479     5.657573     -2.454196     0.0144
FEMALE      -90.96919     34.27441     -2.654143     0.0081
R-squared 0.119277   Adjusted R-squared 0.115514   S.E. of regression 417.9581   Sum squared resid 1.23E+08

Example. The goal is to analyze the impact of another year of education on wages. Consider:
wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of
work-related abilities); educ: years of education; exper: years of work experience; tenure:
years with current employer; IQ: IQ score.

Dependent Variable: LOG(WAGE) (Least Squares, Sample: 1 935, White heteroskedasticity-consistent standard errors)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.973062      0.082272     72.60160      0.0000
EDUC        0.059839      0.006079     9.843503      0.0000
R-squared 0.097417   S.E. of regression 0.400320   Sum squared resid 149.5186

Dependent Variable: LOG(WAGE) (Least Squares, Sample: 1 935, White heteroskedasticity-consistent standard errors)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.496696      0.112030     49.06458      0.0000
EDUC        0.074864      0.006654     11.25160      0.0000
EXPER       0.015328      0.003405     4.501375      0.0000
TENURE      0.013375      0.002657     5.033021      0.0000
R-squared 0.155112   S.E. of regression 0.387729   Sum squared resid 139.9610

Dependent Variable: LOG(WAGE) (Least Squares, Sample: 1 935, White heteroskedasticity-consistent standard errors)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           5.210967      0.113778     45.79932      0.0000
EDUC        0.047537      0.008275     5.744381      0.0000
EXPER       0.012897      0.003437     3.752376      0.0002
TENURE      0.011468      0.002686     4.270056      0.0000
IQ          0.004503      0.000989     4.553567      0.0000
KWW         0.006704      0.002070     3.238002      0.0012
R-squared 0.193739   S.E. of regression 0.379170   Sum squared resid 133.5622

Exercise 2.12. Consider

y_i = β_1 + β_2 x_i2 + ε_i,    i = 1, ..., n,

where x_i2 is an impulse dummy, i.e. x_2 is a column vector with n − 1 zeros and only one 1.
To simplify, let us suppose that this 1 is the first element of x_2, i.e.

x_2' = [1  0  ...  0].

Find and interpret the coefficient from the regression of y on x̃_1 = M_2 x_1, where
M_2 = I − x_2(x_2'x_2)^{-1}x_2' (x̃_1 is the residual vector from the regression of x_1 on x_2).

Exercise 2.13. Consider the long regression model (under Assumptions 1.1 through 1.4):

y = X_1 b_1 + X_2 b_2 + e,

and the following coefficients (obtained from the short regressions):

b_1* = (X_1'X_1)^{-1} X_1'y,    b_2* = (X_2'X_2)^{-1} X_2'y.

Decide if you agree or disagree with the following statement: if Cov(b_1, b_2 | X_1, X_2) = O
(zero matrix) then b_1* = b_1 and b_2* = b_2.

2.5 Multicollinearity

If rank(X) < K then b is not defined. This is called strict multicollinearity. When this
happens, the statistical software will be unable to construct (X'X)^{-1}. Since the error is
discovered quickly, this is rarely a problem for applied econometric practice.

The more relevant situation is near multicollinearity, which is often called "multicollinearity"
for brevity. This is the situation in which X'X is near singular, i.e. the columns of X are
close to linearly dependent.

Consequence: the individual coefficient estimates will be imprecise. We have shown that

Var(b_K | X) = σ² / [(1 − R_K²) S²_xK n],

where R_K² is the coefficient of determination in the auxiliary regression

x_K = δ_1 x_1 + δ_2 x_2 + ... + δ_{K−1} x_{K−1} + error.

Exercise 2.14. Do you agree with the following quotations? (a) "But more data is no remedy
for multicollinearity if the additional data are simply 'more of the same.' So obtaining lots
of small samples from the same population will not help" (Johnston, 1984); (b) "Another
important point is that a high degree of correlation between certain independent variables
can be irrelevant as to how well we can estimate other parameters in the model."

Exercise 2.15. Suppose you postulate a model explaining final exam score in terms of class
attendance. Thus, the dependent variable is final exam score, and the key explanatory
variable is the number of classes attended. To control for student abilities and efforts outside
the classroom, you include among the explanatory variables cumulative GPA, SAT score, and
measures of high school performance. Someone says, "You cannot hope to learn anything
from this exercise because cumulative GPA, SAT score, and high school performance are
likely to be highly collinear." What should be your answer?

2.6 Statistical Inference under Normality

Assumption (1.5 - normality of the error term). The distribution of ε conditional on X is normal.

Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that

ε | X ~ N(0, σ² I)    and    y | X ~ N(Xβ, σ² I).

Suppose that we want to test H_0: β_2 = 1. Although Proposition 1.1 guarantees that, on
average, b_2 (the OLS estimate of β_2) equals 1 if the hypothesis H_0: β_2 = 1 is true, b_2 may
not be exactly equal to 1 for a particular sample at hand. Obviously, we cannot conclude
that the restriction is false just because the estimate b_2 differs from 1. In order to
decide whether the sampling error b_2 − 1 is "too large" for the restriction to be true, we
need to construct from the sampling error some test statistic whose probability
distribution is known given the truth of the hypothesis.

The relevant theory is built from the following results:

1. z (n × 1) ~ N(0, I)  ⇒  z'z ~ χ²(n).

2. w_1 ~ χ²(m), w_2 ~ χ²(n), w_1 and w_2 independent  ⇒  (w_1/m)/(w_2/n) ~ F(m, n).

3. w ~ χ²(n), z ~ N(0, 1), w and z independent  ⇒  z/√(w/n) ~ t(n).

4. Asymptotic results:
   v ~ F(m, n)  ⇒  m·v →d χ²(m) as n → ∞;
   u ~ t(n)  ⇒  u →d N(0, 1) as n → ∞.

5. Consider the n × 1 vector y with y | X ~ N(Xβ, Σ). Then,

   w = (y − Xβ)' Σ^{-1} (y − Xβ) ~ χ²(n).

6. Consider the n × 1 vector ε with ε | X ~ N(0, I). Let M be an n × n idempotent
matrix with rank(M) = r ≤ n. Then,

   ε'Mε | X ~ χ²(r).

7. Consider the n × 1 vector ε with ε | X ~ N(0, I). Let M be an n × n idempotent
matrix with rank(M) = r ≤ n, and let L be a matrix such that LM = O. Let t_1 = Mε
and t_2 = Lε. Then t_1 and t_2 are independent random vectors.

8. b | X ~ N(β, σ² (X'X)^{-1}).

9. Let r = Rβ (R: p × K) with rank(R) = p (in Hayashi's notation p is equal to #r).
Then,

   Rb | X ~ N(r, σ² R(X'X)^{-1}R').

10. Let b_k be the kth element of b and q^{kk} the (k, k) element of (X'X)^{-1}. Then,

    b_k | X ~ N(β_k, σ² q^{kk}),    or    z_k = (b_k − β_k)/√(σ² q^{kk}) ~ N(0, 1).

11. w = (Rb − r)' [R(X'X)^{-1}R']^{-1} (Rb − r) / σ² ~ χ²(p).

12. w_k = (b_k − β_k)² / (σ² q^{kk}) ~ χ²(1).

13. w_0 = e'e/σ² ~ χ²(n − K).

14. The random vectors b and e are independent.

15. Each of the statistics e, e'e, w_0, s², s²(X'X)^{-1} is independent of each of the statistics
b, b_k, Rb, w, w_k.

16. t_k = (b_k − β_k)/σ̂_bk ~ t(n − K), where σ̂²_bk is the (k, k) element of s²(X'X)^{-1}.

17. (Rb − Rβ) / [s √(R(X'X)^{-1}R')] ~ t(n − K), where R is of type 1 × K.

18. F = (Rb − r)' [R(X'X)^{-1}R']^{-1} (Rb − r) / (p s²) ~ F(p, n − K).

Exercise 2.16. Prove results #8, #9, #16 and #18 (take the other results as given).

The two most important results are:

t_k = (b_k − β_k)/σ̂_bk = (b_k − β_k)/SE(b_k) ~ t(n − K),

F = (Rb − r)' [R(X'X)^{-1}R']^{-1} (Rb − r) / (p s²) ~ F(p, n − K).

2.6.1 Confidence Intervals and Regions

Let t_{α/2} ≡ t_{α/2}(n − K) be such that

P(|t| < t_{α/2}) = 1 − α.

Let F_α ≡ F_α(p, n − K) be such that

P(F ≤ F_α) = 1 − α.

(1 − α)·100% CI for an individual slope coefficient β_k:

{ β_k : |(b_k − β_k)/σ̂_bk| ≤ t_{α/2} }  ⟺  b_k ± t_{α/2} σ̂_bk.

(1 − α)·100% CI for a single linear combination of the elements of β (p = 1):

{ Rβ : |(Rb − Rβ)/(s √(R(X'X)^{-1}R'))| ≤ t_{α/2} }  ⟺  Rb ± t_{α/2} s √(R(X'X)^{-1}R').

In this case R is a 1 × K vector.

(1 − α)·100% confidence region for the parameter vector λ = Rβ:

{ λ : (Rb − λ)' [R(X'X)^{-1}R']^{-1} (Rb − λ) / s² ≤ p F_α }.

(1 − α)·100% confidence region for the parameter vector β (consider R = I in the previous
case):

{ β : (b − β)' X'X (b − β) / s² ≤ K F_α }.

Exercise 2.17. Consider y_i = β_1 x_i1 + β_2 x_i2 + ε_i where y_i = wages_i − mean(wages),
x_i1 = educ_i − mean(educ), x_i2 = exper_i − mean(exper). The results are

Dependent Variable: Y (Least Squares, Sample: 1 526)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
X1          0.644272      0.053755     11.98541      0.0000
X2          0.070095      0.010967     6.391393      0.0000
R-squared 0.225162   S.E. of regression 3.253935   Sum squared resid 5548.160

X'X = [  4025.4297   -5910.064          (X'X)^{-1} = [ 2.7291e-4   1.6678e-5
        -5910.064    96706.846 ],                      1.6678e-5   1.1360e-5 ].

(a) Build the 95% confidence interval for β_2.

(b) Build the 95% confidence interval for β_1 + β_2.

(c) Build the 95% confidence region for the parameter vector β.
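One way to compute the interval in (a) numerically (a sketch assuming scipy is available; it simply applies the formula b_k ± t_{α/2} σ̂_bk with the numbers reported above):

```python
from scipy import stats

n, K = 526, 2
b2, se_b2 = 0.070095, 0.010967          # reported estimate and standard error
t_crit = stats.t.ppf(0.975, df=n - K)   # two-sided 5% critical value

ci = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)
print(ci)   # roughly (0.0485, 0.0916)
```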

Confidence regions in EViews:

[Figure: 90% and 95% confidence ellipses for the parameter vector, with beta1 on the horizontal axis (0.50 to 0.80) and beta2 on the vertical axis (0.04 to 0.10).]

2.6.2 Testing a Single Parameter

Suppose that we have a hypothesis about the kth regression coefficient:

H_0: β_k = β_k^0

(β_k^0 is a specific value, e.g. zero), and that this hypothesis is tested against the alternative
hypothesis

H_1: β_k ≠ β_k^0.

We do not reject H_0 at the α·100% level if β_k^0 lies within the (1 − α)·100% CI for β_k, i.e.
b_k ± t_{α/2} σ̂_bk; we reject H_0 otherwise. Equivalently, calculate the test statistic

t_obs = (b_k − β_k^0)/σ̂_bk

and:

if |t_obs| > t_{α/2} then reject H_0;

if |t_obs| ≤ t_{α/2} then do not reject H_0.

The reasoning is as follows. Under the null hypothesis we have

t_k^0 = (b_k − β_k^0)/σ̂_bk ~ t(n − K).

If we observe |t_obs| > t_{α/2} and H_0 is true, then a low-probability event has occurred.
We take |t_obs| > t_{α/2} as evidence against the null, and the decision should be to reject H_0.

Other cases:

H_0: β_k = β_k^0 vs. H_1: β_k > β_k^0:
if t_obs > t_α then reject H_0 at the α·100% level; otherwise do not reject H_0.

H_0: β_k = β_k^0 vs. H_1: β_k < β_k^0:
if t_obs < −t_α then reject H_0 at the α·100% level; otherwise do not reject H_0.
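These decision rules are straightforward to implement. A sketch (assuming scipy; b_k, se_k and the hypothesized value are placeholders):

```python
from scipy import stats

def t_test(b_k, se_k, beta0, n, K, alpha=0.05):
    """Two-sided t test of H0: beta_k = beta0 in the classical regression model."""
    t_obs = (b_k - beta0) / se_k
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - K)
    return t_obs, t_crit, p_value, abs(t_obs) > t_crit

# Example with the numbers from Exercise 2.17: H0: beta_2 = 0
print(t_test(0.070095, 0.010967, 0.0, n=526, K=2))
```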

2.6.3 Issues in Hypothesis Testing

p-value

The p-value is the probability of obtaining a test statistic at least as extreme as the one actually
observed, assuming that the null hypothesis is true. It is an informal measure of the evidence
against the null hypothesis.

Example. Consider H_0: β_k = β_k^0 vs. H_1: β_k ≠ β_k^0:

p-value = P(|t_k^0| > |t_obs| ; H_0 is true).

A p-value of 0.02 shows little evidence supporting H_0: at the 5% level you should reject H_0.

Example. Consider H_0: β_k = β_k^0 vs. H_1: β_k > β_k^0:

p-value = P(t_k^0 > t_obs ; H_0 is true).

EViews reports two-sided p-values: for this one-sided test, divide the reported p-value by two.

Reporting the outcome of a test

Correct wording in reporting the outcome of a test involving

H_0: β_k = β_k^0 vs. H_1: β_k ≠ β_k^0:

When the null is rejected we say that b_k (not β_k) is significantly different from β_k^0 at the
α·100% level. Some authors also say "the variable (associated with b_k) is statistically
significant at the α·100% level".

When the null is not rejected we say that b_k (not β_k) is not significantly different from
β_k^0 at the α·100% level, or that the variable is not statistically significant at the α·100% level.

More remarks:

Rejection of the null is not proof that the null is false. Why?

Acceptance of the null is not proof that the null is true. Why? We prefer to use the
language "we fail to reject H_0 at the x% level" rather than "H_0 is accepted at the x%
level."

In a test of the type H_0: β_k = β_k^0, if σ̂_bk is large (b_k is an imprecise estimator) it is more
difficult to reject the null: the sample contains little information about the true value
of the parameter β_k. Remember that σ̂_bk depends on σ², S²_xk, n and R_k².

Statistical Versus Economic Significance

The statistical significance of a variable is determined by the size of t_obs = b_k/SE(b_k),
whereas the economic significance of a variable is related to the size and sign of b_k.

Example. Suppose that in a business activity we have the estimated equation

log(wage_i) = 0.1 + 0.01 female_i + ...,    n = 600,
                    (0.001)

and H_0: β_2 = 0 vs. H_1: β_2 ≠ 0. We have:

t_k^0 = b_2/σ̂_b2 ~ t(600 − K) ≈ N(0, 1) (under the null),

t_obs = 0.01/0.001 = 10,

p-value = P(|t_k^0| > 10 ; H_0 is true) ≈ 0.

Discuss statistical versus economic significance.

Exercise 2.18. Can we say that students at smaller schools perform better than those at
larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan
for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage
of students receiving a passing score on a tenth-grade math test (math10). School size
is measured by student enrollment (enroll). We will control for two other factors, average
annual teacher compensation (totcomp) and the number of staff per one thousand students
(staff). Teacher compensation is a measure of teacher quality, and staff size is a rough
measure of how much attention students receive. The table below reports the results. Answer
the initial question.

Dependent Variable: MATH10 (Least Squares, Sample: 1 408)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           2.274021      6.113794     0.371949      0.7101
TOTCOMP     0.000459      0.000100     4.570030      0.0000
STAFF       0.047920      0.039814     1.203593      0.2295
ENROLL      -0.000198     0.000215     -0.917935     0.3592
R-squared 0.054063   Adjusted R-squared 0.047038   S.E. of regression 10.24384   F-statistic 7.696528   Prob(F-statistic) 0.000052

Exercise 2.19. We want to relate the median housing price (price) in a community to
various community characteristics: nox is the amount of nitrous oxide in the air, in parts
per million; dist is a weighted distance of the community from five employment centers, in
miles; rooms is the average number of rooms in houses in the community; and stratio is
the average student-teacher ratio of schools in the community. Can we conclude that the
elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area;
see Wooldridge, chapter 4.)

Dependent Variable: LOG(PRICE) (Least Squares, Sample: 1 506)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           11.08386      0.318111     34.84271      0.0000
LOG(NOX)    -0.953539     0.116742     -8.167932     0.0000
LOG(DIST)   -0.134339     0.043103     -3.116693     0.0019
ROOMS       0.254527      0.018530     13.73570      0.0000
STRATIO     -0.052451     0.005897     -8.894399     0.0000
R-squared 0.584032   Adjusted R-squared 0.580711   S.E. of regression 0.265003   F-statistic 175.8552   Prob(F-statistic) 0.000000

2.6.4 Testing a Set of Parameters I

Suppose that we have a joint null hypothesis about β:

H_0: Rβ = r vs. H_1: Rβ ≠ r,

where r is p × 1 and R is p × K. The test statistic is

F^0 = (Rb − r)' [R(X'X)^{-1}R']^{-1} (Rb − r) / (p s²).

Let F_obs be the observed test statistic. We have:

reject H_0 if F_obs > F_α (or if p-value < α);

do not reject H_0 if F_obs ≤ F_α.

The reasoning is as follows. Under the null hypothesis we have

F^0 ~ F(p, n − K).

If we observe F_obs > F_α and H_0 is true, then a low-probability event has occurred.

In the case p = 1 (a single linear combination of the elements of β) one may use the test
statistic

t^0 = (Rb − Rβ) / [s √(R(X'X)^{-1}R')] ~ t(n − K).

Example. We consider a simple model to compare the returns to education at junior colleges
and four-year colleges; for simplicity, we refer to the latter as "universities" (see Wooldridge,
chap. 4). The model is

log(wages_i) = β_1 + β_2 jc_i + β_3 univ_i + β_4 exper_i + ε_i.

The population includes working people with a high school degree. jc is the number of years
attending a two-year college and univ is the number of years at a four-year college. Note that
any combination of junior college and college is allowed, including jc = 0 and univ = 0.
The hypothesis of interest is whether a year at a junior college is worth a year at a university;
this is stated as H_0: β_2 = β_3. Under H_0, another year at a junior college and another year
at a university lead to the same ceteris paribus percentage increase in wage. The alternative
of interest is one-sided: a year at a junior college is worth less than a year at a university.
This is stated as H_1: β_2 < β_3.

Dependent Variable: LWAGE (Least Squares, Sample: 1 6763)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.472326      0.021060     69.91020      0.0000
JC          0.066697      0.006829     9.766984      0.0000
UNIV        0.076876      0.002309     33.29808      0.0000
EXPER       0.004944      0.000157     31.39717      0.0000
R-squared 0.222442   S.E. of regression 0.430138   Sum squared resid 1250.544   F-statistic 644.5330

(X'X)^{-1} = [ 0.0023972     9.4121e-5    8.50437e-5   1.6780e-5
               9.41217e-5    0.0002520    1.04201e-5   9.2871e-8
               8.50437e-5    1.0420e-5    2.88090e-5   2.12598e-7
               1.67807e-5    9.2871e-8    2.1259e-7    1.3402e-7 ]

Under the null, the test statistic is

t^0 = (Rb − Rβ) / [s √(R(X'X)^{-1}R')] ~ t(n − K).

We have

R = [0  1  −1  0],

√(R(X'X)^{-1}R') = 0.016124827,

s √(R(X'X)^{-1}R') = 0.430138 × 0.016124827 = 0.006936,

Rb = [0  1  −1  0] (1.472326, 0.066697, 0.076876, 0.004944)' = 0.066697 − 0.076876 = −0.01018,

Rβ = β_2 − β_3 = 0 (under H_0),

t_obs = −0.01018/0.006936 = −1.467,

−t_{0.05} = −1.645.

Since t_obs = −1.467 > −1.645, we do not reject H_0 at the 5% level. There is no evidence
against β_2 = β_3 at the 5% level.
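A quick numerical check of these figures (a sketch assuming numpy and scipy; the (X'X)^{-1} entries are taken from the table above):

```python
import numpy as np
from scipy import stats

b = np.array([1.472326, 0.066697, 0.076876, 0.004944])
s, n, K = 0.430138, 6763, 4

# Only the elements of (X'X)^{-1} needed for R = [0, 1, -1, 0]
V22, V33, V23 = 0.0002520, 2.88090e-5, 1.04201e-5
R = np.array([0., 1., -1., 0.])

Rb = R @ b                                   # b2 - b3
se_Rb = s * np.sqrt(V22 + V33 - 2 * V23)     # s * sqrt(R (X'X)^{-1} R')
t_obs = Rb / se_Rb                           # about -1.47
t_crit = stats.t.ppf(0.05, df=n - K)         # about -1.645 (one-sided, lower tail)
print(t_obs, t_crit, t_obs < t_crit)         # False: do not reject H0
```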

Remark: in this exercise t^0 can be written as

t^0 = Rb / [s √(R(X'X)^{-1}R')] = (b_2 − b_3) / SE(b_2 − b_3).

Exercise 2.20 (continuation). Propose another way to test H_0: β_2 = β_3 against H_1:
β_2 < β_3 along the following lines: define θ = β_2 − β_3, write β_2 = θ + β_3, plug this into
the equation log(wages_i) = β_1 + β_2 jc_i + β_3 univ_i + β_4 exper_i + ε_i and test θ = 0. Use
the database available on the webpage of the course.

2.6.5 Testing a Set of Parameters II

We focus on another way to test

H_0: Rβ = r vs. H_1: Rβ ≠ r

(where r is p × 1 and R is p × K). It can be proved that

F^0 = (Rb − r)' [R(X'X)^{-1}R']^{-1} (Rb − r) / (p s²)
    = [(e*'e* − e'e)/p] / [e'e/(n − K)]
    = [(R² − R*²)/p] / [(1 − R²)/(n − K)]  ~  F(p, n − K),

where * refers to the short regression, i.e. the regression subject to the constraint Rβ = r.

Example. Consider once again the equation log(wages_i) = β_1 + β_2 jc_i + β_3 univ_i +
β_4 exper_i + ε_i and H_0: β_2 = β_3 against H_1: β_2 ≠ β_3. The results of the regression
subject to the constraint β_2 = β_3 are:

Dependent Variable: LWAGE (Least Squares, Sample: 1 6763)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           1.471970      0.021061     69.89198      0.0000
JC+UNIV     0.076156      0.002256     33.75412      0.0000
EXPER       0.004932      0.000157     31.36057      0.0000
R-squared 0.222194   S.E. of regression 0.430175   Sum squared resid 1250.942   F-statistic 965.5576

We have p = 1, e'e = 1250.544, e*'e* = 1250.942 and

F_obs = [(e*'e* − e'e)/p] / [e'e/(n − K)] = [(1250.942 − 1250.544)/1] / [1250.544/(6763 − 4)] = 2.151,

F_0.05 ≈ 3.84.

We do not reject the null at the 5% level, since F_obs = 2.151 < F_0.05 ≈ 3.84.
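The same computation in code (a sketch assuming scipy; the SSR values are those reported in the two tables):

```python
from scipy import stats

ssr_u, ssr_r = 1250.544, 1250.942   # unrestricted and restricted SSR
n, K, p = 6763, 4, 1                # sample size, regressors, restrictions

F_obs = ((ssr_r - ssr_u) / p) / (ssr_u / (n - K))
F_crit = stats.f.ppf(0.95, p, n - K)
p_value = stats.f.sf(F_obs, p, n - K)
print(F_obs, F_crit, p_value)       # about 2.15 and 3.84: do not reject H0
```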

In the case "all slopes zero" (test of significance of the complete regression), it can be
proved that F^0 equals

F^0 = [R²/(K − 1)] / [(1 − R²)/(n − K)].

Under the null H_0: β_k = 0, k = 2, 3, ..., K, we have F^0 ~ F(K − 1, n − K).

Exercise 2.21. Consider the results:

Dependent Variable: Y (Least Squares, Sample: 1 500)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           0.952298      0.237528     4.009200      0.0001
X2          1.322678      1.686759     0.784154      0.4333
X3          2.026896      1.701543     1.191210      0.2341
R-squared 0.300503   Adjusted R-squared 0.297688   S.E. of regression 5.311080   F-statistic 106.7551   Prob(F-statistic) 0.000000

Test: (a) H_0: β_2 = 0 vs. H_1: β_2 ≠ 0; (b) H_0: β_3 = 0 vs. H_1: β_3 ≠ 0; (c)
H_0: β_2 = β_3 = 0 vs. H_1: β_i ≠ 0 for at least one i (i = 2, 3). (d) Are x_i2 and x_i3 truly
relevant variables? How would you explain the results you obtained in parts (a), (b) and (c)?

2.7 Relation to Maximum Likelihood

Having specified the distribution of the error vector, we can use the maximum likelihood
(ML) principle to estimate the model parameters θ = (β', σ²)'.

2.7.1 The Maximum Likelihood Principle

ML principle: choose the parameter estimates to maximize the probability of obtaining the
data. Maximizing the joint density associated with the data, f(y, X; θ̃), leads to the same
solution. Therefore:

ML estimator of θ = argmax_{θ̃} f(y, X; θ̃).

Example (without X). We flip a coin 10 times. If heads then y = 1. Obviously y ~
Bernoulli(θ). We do not know whether the coin is fair, so we treat E(Y) = θ as an unknown
parameter. Suppose that Σ_{i=1}^{10} y_i = 6. We have

f(y; θ̃) = f(y_1, ..., y_n; θ̃) = Π_{i=1}^n f(y_i; θ̃) = θ̃^{y_1}(1 − θ̃)^{1−y_1} ··· θ̃^{y_n}(1 − θ̃)^{1−y_n}
        = θ̃^{Σ_i y_i} (1 − θ̃)^{10 − Σ_i y_i} = θ̃^6 (1 − θ̃)^4.

[Figure: the joint density θ̃^6 (1 − θ̃)^4 plotted as a function of θ̃ ∈ [0, 1]; it peaks at θ̃ = 0.6.]

To obtain the ML estimate of θ we proceed with:

d[θ̃^6 (1 − θ̃)^4]/dθ̃ = 0  ⟺  θ̂ = 6/10,

and since

d²[θ̃^6 (1 − θ̃)^4]/dθ̃² < 0 at θ̂,

θ̂ = 0.6 maximizes f(y; θ̃). θ̂ is the "most likely" value of θ, that is, the value that maximizes
the probability of observing (y_1, ..., y_10). Notice that the ML estimator is ȳ.

Since log x, x > 0, is a strictly increasing function we have: θ̂ maximizes f(y; θ̃) iff θ̂
maximizes log f(y; θ̃), that is

θ̂ = argmax_{θ̃} f(y, X; θ̃)  ⟺  θ̂ = argmax_{θ̃} log f(y, X; θ̃).

In most cases we prefer to solve max log f(y, X; θ̃) rather than max f(y, X; θ̃), since the
log transformation greatly simplifies the likelihood (products become sums).
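The same maximization can also be done numerically, which is how ML estimates are usually obtained when no closed form exists (a sketch assuming scipy):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])   # any sample with sum = 6

def neg_loglik(theta):
    # negative Bernoulli log-likelihood (products become sums after taking logs)
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # about 0.6 = sample mean
```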

2.7.2 Conditional versus Unconditional Likelihood

The joint density f(y, X; ξ) is in general difficult to handle. Consider:

f(y, X; ξ) = f(y | X; θ) f(X; ψ),  ξ = (θ', ψ')',

log f(y, X; ξ) = log f(y | X; θ) + log f(X; ψ).

In general we don't know f(X; ψ).

Example. Consider y_i = β1 x_i1 + β2 x_i2 + ε_i where

ε_i | X ~ N(0, σ²) ⇒ y_i | X ~ N(x_i'β, σ²),  X ~ N(μ_x, σ_x² I).

Thus,

θ = [β; σ²],  ψ = [μ_x; σ_x²],  ξ = [θ; ψ].

If there is no functional relationship between θ and ψ (such as a subset of θ being a function of ψ), then maximizing log f(y, X; ξ) with respect to ξ is achieved by separately maximizing log f(y | X; θ) with respect to θ and maximizing log f(X; ψ) with respect to ψ. Thus the ML estimate of θ also maximizes the conditional likelihood f(y | X; θ).
102

2.7.3 The Log Likelihood for the Regression Model

Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 implies that the distribution of ε conditional on X is N(0, σ²I). Thus,

ε | X ~ N(0, σ²I) ⇒ y | X ~ N(Xβ, σ²I) ⇒

f(y | X; θ) = (2πσ²)^{-n/2} exp[ -(1/(2σ²)) (y - Xβ)'(y - Xβ) ] ⇒

log f(y | X; θ) = -(n/2) log(2πσ²) - (1/(2σ²)) (y - Xβ)'(y - Xβ).

It can be proved that

log f(y | X; θ) = Σ_{i=1}^{n} log f(y_i | x_i) = -(n/2) log(2πσ²) - (1/(2σ²)) Σ_{i=1}^{n} (y_i - x_i'β)².
Proposition (1.5 - ML Estimator of β and σ²). Suppose Assumptions 1.1-1.5 hold. Then,

ML estimator of β = (X'X)^{-1} X'y = b,

ML estimator of σ² = e'e/n ≠ s² = e'e/(n - K).
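A hedged numerical sketch of Proposition 1.5 (simulated data, not from the slides): the ML estimator of β coincides with OLS, and the ML variance estimator e'e/n is slightly smaller than the unbiased s².

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients = ML estimator of beta
e = y - X @ b
sigma2_ml = e @ e / n                   # ML estimator of sigma^2
s2 = e @ e / (n - K)                    # unbiased estimator s^2
print(b, sigma2_ml, s2)                 # sigma2_ml < s2
```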
103

We know that E(s²) = σ². Therefore:

E(e'e/n) ≠ σ², but lim_{n→∞} E(e'e/n) = σ².
Proposition (1.6 - b is the Best Unbiased Estimator BUE). Under Assumptions 1.1-1.5,
the OLS estimator b of β is BUE in that any other unbiased (but not necessarily linear)
estimator has larger conditional variance in the matrix sense.

This result should be distinguished from the Gauss-Markov Theorem that b is minimum
variance among those estimators that are unbiased and linear in y. Proposition 1.6 says
that b is minimum variance in a larger class of estimators that includes nonlinear unbiased
estimators. This stronger statement is obtained under the normality assumption (Assumption
1.5), which is not assumed in the Gauss-Markov Theorem. Put differently, the Gauss-Markov
Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this
possibility is ruled out by the normality assumption.
104

Exercise 2.22. Suppose y_i = x_i'β + ε_i where ε_i | X ~ t(v). Assume that Assumptions 1.1-1.4 hold. Use your intuition to answer "true" or "false" to the following statements:

(a) b is the BLUE;

(b) b is the BUE;

(c) the BUE estimator can only be obtained numerically (i.e. there is not a closed formula
for the BUE estimator).

Just out of curiosity, notice that the log-likelihood function is

Σ_{i=1}^{n} log f(y_i | x_i) = n log Γ((v+1)/2) - n log Γ(v/2) - (n/2) log π - (n/2) log σ² - (n/2) log(v - 2)
- ((v+1)/2) Σ_{i=1}^{n} log[ 1 + (y_i - x_i'β)² / (σ²(v - 2)) ].
105

2.8 Generalized Least Squares (GLS)

We have assumed that

E(ε_i² | X) = Var(ε_i | X) = σ² > 0, ∀i  (homoskedasticity);

E(ε_i ε_j | X) = 0, ∀i, j, i ≠ j  (no correlation between observations).
Matrix notation:

E(εε' | X) = [ E(ε1² | X)  E(ε1ε2 | X) ··· E(ε1εn | X) ;  E(ε1ε2 | X)  E(ε2² | X) ··· E(ε2εn | X) ;  ⋮ ;  E(ε1εn | X)  E(ε2εn | X) ··· E(εn² | X) ]

= [ σ²  0 ··· 0 ;  0  σ² ··· 0 ;  ⋮ ;  0  0 ··· σ² ] = σ²I.
106

The assumption E(εε' | X) = σ²I is violated if either

E(ε_i² | X) depends on X → heteroskedasticity, or

E(ε_i ε_j | X) ≠ 0 → serial correlation (we will analyze this case later).

Let's assume now that

E(εε' | X) = σ²V  (V depends on X).

The model y = Xβ + ε based on Assumptions 1.1-1.3 and E(εε' | X) = σ²V is called the generalized regression model.

Notice that by definition, we always have:

E(εε' | X) = Var(ε | X) = Var(y | X).
107

Example (case where E(ε_i² | X) depends on X). Consider the following model

y_i = β1 + β2 x_i2 + ε_i

to explain household expenditure on food (y) as a function of household income. Typical behavior: low-income households do not have the option of extravagant food tastes: they have few choices and are almost forced to spend a particular portion of their income on food; high-income households could have simple food tastes or extravagant food tastes: income by itself is likely to be relatively less important as an explanatory variable.

[Figure: scatter plot of y (expenditure) against x (income); the dispersion of y increases with income.]
108

If e accurately reflects the behavior of ε, the information in the previous figure suggests that the variability of y_i increases as income increases; thus it is reasonable to suppose that Var(y_i | x_i2) is a function of x_i2.

This is the same as saying that E(ε_i² | x_i2) is a function of x_i2.

For example, if E(ε_i² | x_i2) = σ² x_i2² then

E(εε' | X) = σ² [ x_12²  0 ··· 0 ;  0  x_22² ··· 0 ;  ⋮ ;  0  0 ··· x_n2² ] = σ²V ≠ σ²I.
109

2.8.1 Consequence of Relaxing Assumption 1.4

1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some
other estimator.

2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The same comments apply to the F-test. Note that Var(b | X) is no longer σ²(X'X)^{-1}. In effect,

Var(b | X) = Var((X'X)^{-1}X'y | X) = (X'X)^{-1}X' Var(y | X) X(X'X)^{-1} = σ²(X'X)^{-1}X'VX(X'X)^{-1}.

On the other hand,

E(s² | X) = E(e'e | X)/(n - K) = tr(Var(e | X))/(n - K) = σ² tr(MVM)/(n - K) = σ² tr(MV)/(n - K).

The conventional standard errors are incorrect when Var(y | X) ≠ σ²I. Confidence regions and hypothesis test procedures based on the classical regression model are not valid.
110

3. However, the OLS estimator is still unbiased, because the unbiasedness result (Proposition 1.1 (a)) does not require Assumption 1.4. In effect,

E(b | X) = (X'X)^{-1}X' E(y | X) = (X'X)^{-1}X'Xβ = β,  E(b) = β.

Options in the presence of E(εε' | X) ≠ σ²I:

Use b to estimate β and Var(b | X) = σ²(X'X)^{-1}X'VX(X'X)^{-1} for inference purposes. Note that y | X ~ N(Xβ, σ²V) implies

b | X ~ N(β, σ²(X'X)^{-1}X'VX(X'X)^{-1}).

This is not a good solution: if you know V you may use a more efficient estimator, as we will see below. Later on, in the chapter "Large-Sample Theory" we will find that σ²V may be replaced by a consistent estimator.

Search for a better estimator of β.


111

2.8.2 Efficient Estimation with Known V

If the value of the matrix function V is known, a BLUE estimator for β, called generalized least squares (GLS), can be deduced. The basic idea of the derivation is to transform the generalized regression model into a model that satisfies all the assumptions, including Assumption 1.4, of the classical regression model. Consider

y = Xβ + ε,  E(εε' | X) = σ²V.

We should multiply both sides of the equation by a nonsingular matrix C (depending on X),

Cy = CXβ + Cε
ỹ = X̃β + ε̃,

such that the transformed error ε̃ verifies E(ε̃ε̃' | X) = σ²I, i.e.

E(ε̃ε̃' | X) = E(Cεε'C' | X) = C E(εε' | X) C' = σ²CVC' = σ²I,

that is, CVC' = I.
112

Given CVC' = I, how to find C? Since V is by construction symmetric and positive definite, there exists a nonsingular n × n matrix C such that

V = C^{-1}(C')^{-1}  or  V^{-1} = C'C.

Note

CVC' = C C^{-1}(C')^{-1} C' = I.

It is easy to see that if y = Xβ + ε satisfies Assumptions 1.1-1.3 and Assumption 1.5 (but not Assumption 1.4), then

ỹ = X̃β + ε̃,  where ỹ = Cy, X̃ = CX,

satisfies Assumptions 1.1-1.5. Let

β̂_GLS = (X̃'X̃)^{-1} X̃'ỹ = (X'V^{-1}X)^{-1} X'V^{-1}y.
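A minimal Python sketch (an added illustration with simulated data, not the slides' own code) of the two equivalent ways of computing β̂_GLS: the explicit formula and OLS on the transformed data Cy, CX with V^{-1} = C'C.

```python
import numpy as np

def gls(y, X, V):
    # explicit formula (X'V^{-1}X)^{-1} X'V^{-1}y
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

def gls_via_transform(y, X, V):
    # one valid C with V^{-1} = C'C: the transpose of a Cholesky factor of V^{-1}
    C = np.linalg.cholesky(np.linalg.inv(V)).T
    yt, Xt = C @ y, C @ X
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)   # OLS on the transformed model

# quick check with a hypothetical heteroskedastic V = diag(v_i)
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
v = 1.0 + np.abs(X[:, 1])
V = np.diag(v)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(v)
print(gls(y, X, V), gls_via_transform(y, X, V))    # identical estimates
```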
113

Proposition (1.7 - finite-sample properties of GLS). (a) (unbiasedness) Under Assumptions 1.1-1.3,

E(β̂_GLS | X) = β.

(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption E(εε' | X) = σ²V that the conditional second moment is proportional to V,

Var(β̂_GLS | X) = σ²(X'V^{-1}X)^{-1}.

(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS estimator is efficient in that the conditional variance of any unbiased estimator that is linear in y is greater than or equal to Var(β̂_GLS | X) in the matrix sense.

Remark: Var(b | X) - Var(β̂_GLS | X) is a positive semidefinite matrix. In particular,

Var(b_j | X) ≥ Var(β̂_j,GLS | X).


114

2.8.3 A Special Case: Weighted Least Squares (WLS)

Let's suppose that

E(ε_i² | X) = σ² v_i  (v_i is a function of X).

Recall: C is such that V^{-1} = C'C.

We have

V = diag(v1, v2, ..., vn) ⇒ V^{-1} = diag(1/v1, 1/v2, ..., 1/vn) ⇒ C = diag(1/√v1, 1/√v2, ..., 1/√vn).
115

Now

ỹ = Cy = [ y1/√v1 ;  y2/√v2 ;  ⋮ ;  yn/√vn ],

X̃ = CX = [ 1/√v1  x_12/√v1 ··· x_1K/√v1 ;  1/√v2  x_22/√v2 ··· x_2K/√v2 ;  ⋮ ;  1/√vn  x_n2/√vn ··· x_nK/√vn ].

Another way to express these relations:

ỹ_i = y_i/√v_i,  x̃_ik = x_ik/√v_i,  i = 1, 2, ..., n.
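In code the WLS transformation is just a row-wise rescaling; the sketch below (an added illustration) assumes the weights v_i are known.

```python
import numpy as np

def wls(y, X, v):
    # divide y and every column of X (including the constant) by sqrt(v_i),
    # then run OLS; equivalent to GLS with V = diag(v_1, ..., v_n)
    w = 1.0 / np.sqrt(v)
    yt = y * w
    Xt = X * w[:, None]
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)
```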
116

Example. Suppose that y_i = β1 + β2 x_i2 + ε_i,

Var(y_i | x_i2) = Var(ε_i | x_i2) = σ² e^{x_i2},  Cov(y_i, y_j | x_i2, x_j2) = 0,

V = diag(e^{x_12}, ..., e^{x_i2}, ..., e^{x_n2}).

Transformed model (matrix notation): Cy = CXβ + Cε, i.e.

[ y1/√e^{x_12} ;  ⋮ ;  yn/√e^{x_n2} ] = [ 1/√e^{x_12}  x_12/√e^{x_12} ;  ⋮ ;  1/√e^{x_n2}  x_n2/√e^{x_n2} ] [ β1 ; β2 ] + [ ε1/√e^{x_12} ;  ⋮ ;  εn/√e^{x_n2} ],

or (scalar notation):

ỹ_i = x̃_i1 β1 + x̃_i2 β2 + ε̃_i,  i = 1, ..., n,

y_i/√e^{x_i2} = β1 (1/√e^{x_i2}) + β2 (x_i2/√e^{x_i2}) + ε_i/√e^{x_i2},  i = 1, ..., n.
117

Notice:

Var(ε̃_i | X) = Var( ε_i/√e^{x_i2} | x_i2 ) = (1/e^{x_i2}) Var(ε_i | x_i2) = (1/e^{x_i2}) σ² e^{x_i2} = σ².

Efficient estimation under a known form of heteroskedasticity is called weighted regression (or weighted least squares (WLS)).
Example. Consider wage_i = β1 + β2 educ_i + β3 exper_i + ε_i.

[Figure: scatter plots of WAGE against EXPER and of WAGE against EDUC.]
118

Dependent Variable: WAGE
Method: Least Squares
Sample: 1 526

Variable	Coefficient	Std. Error	t-Statistic	Prob.
C	-3.390540	0.766566	-4.423023	0.0000
EDUC	0.644272	0.053806	11.97397	0.0000
EXPER	0.070095	0.010978	6.385291	0.0000

R-squared 0.225162   Mean dependent var 5.896103
Adjusted R-squared 0.222199   S.D. dependent var 3.693086
S.E. of regression 3.257044   Akaike info criterion 5.205204
Sum squared resid 5548.160   Schwarz criterion 5.229531
Log likelihood -1365.969   Hannan-Quinn criter. 5.214729
F-statistic 75.98998   Durbin-Watson stat 1.820274
Prob(F-statistic) 0.000000

[Figure: scatter plot of the squared OLS residuals (RES2) against EDUC.]

Assume Var(ε_i | educ_i, exper_i) = σ² educ_i². Transformed model:

wage_i/educ_i = β1 (1/educ_i) + β2 (educ_i/educ_i) + β3 (exper_i/educ_i) + ε̃_i,  i = 1, ..., n.
119

Dependent Variable: WAGE/EDUC


Method: Least Squares
Sample: 1 526 IF EDUC>0

Variable Coefficient Std. Error t-Statistic Prob.

1/EDUC -0.709212 0.549861 -1.289800 0.1977


EDUC/EDUC 0.443472 0.038098 11.64033 0.0000
EXPER/EDUC 0.055355 0.009356 5.916236 0.0000

R-squared 0.105221 Mean dependent var 0.469856


Adjusted R-squared 0.101786 S.D. dependent var 0.265660
S.E. of regression 0.251777 Akaike info criterion 0.085167
Sum squared resid 33.02718 Schwarz criterion 0.109564
Log likelihood -19.31365 Hannan-Quinn criter. 0.094721
Durbin-Watson stat 1.777416

Exercise 2.23. Let {y_i, i = 1, 2, ...} be a sequence of independent random variables with distribution N(μ, σ_i²), where σ_i² is known (note: we assume σ_1² ≠ σ_2² ≠ ...). When the variances are unequal, the sample mean ȳ is not the best linear unbiased estimator (BLUE). The BLUE has the form ỹ = Σ_{i=1}^{n} w_i y_i where the w_i are nonrandom weights. (a) Find a condition on the w_i such that E(ỹ) = μ; (b) Find the optimal weights w_i that make ỹ the BLUE. Hint: You may translate this problem into an econometric framework: if {y_i} is a sequence of independent random variables with distribution N(μ, σ_i²) then y_i can be represented by the equation y_i = μ + ε_i, where ε_i ~ N(0, σ_i²). Then find the GLS estimator of μ.
120

Exercise 2.24. Consider

y_i = β x_i1 + ε_i,  β > 0,

and assume E(ε_i | X) = 0, Var(ε_i | X) = 1 + |x_i1|, Cov(ε_i, ε_j | X) = 0. (a) Suppose we have a lot of observations and plot a graph of the observations of y_i and x_i1. What would the scatter plot look like? (b) Propose an unbiased estimator with minimum variance; (c) Suppose we have the 3 following observations of (x_i1, y_i): (0, 0), (3, 1) and (8, 5). Estimate the value of β from these 3 observations.
Exercise 2.25. Consider

y_t = β1 + β2 t + ε_t,  Var(ε_t) = σ² t²,  t = 1, ..., 20.

Find σ²(X'X)^{-1}, Var(b | X) and Var(β̂_GLS | X) and comment on the results. Hint:

σ²(X'X)^{-1} = σ² [ ?  0.01578 ;  0.01578  ? ],  Var(b | X) = σ² [ ?  1.6326 ;  1.6326  ? ],

Var(β̂_GLS | X) = σ² [ ?  0.1895 ;  0.1895  ? ].
121

Exercise 2.26. A researcher first ran an OLS regression. Then she was given the true V matrix. She transformed the data appropriately and obtained the GLS estimator. For several coefficients, standard errors in the second regression were larger than those in the first regression. Does this contradict Proposition 1.7? See the previous exercise.

2.8.4 Limiting Nature of GLS

Finite-sample properties of GLS rest on the assumption that the regressors are strictly
exogenous. In time-series models the regressors are not strictly exogenous and the error
is serially correlated.

In practice, the matrix function V is unknown.

V can be estimated from the sample. This approach is called Feasible Generalized Least Squares (FGLS). But if the function V is estimated from the sample, its value V̂ becomes a random variable, which affects the distribution of the GLS estimator. Very little is known about the finite-sample properties of the FGLS estimator. We need to use the large-sample properties ...
122

3 Large-Sample Theory

The finite-sample theory breaks down if one of the following three assumptions is violated:

1. the exogeneity of regressors,

2. the normality of the error term, and

3. the linearity of the regression equation.

This chapter develops an alternative approach based on large-sample theory (n is "sufficiently large").
123

3.1 Review of Limit Theorems for Sequences of Random Variables

3.1.1 Convergence in Probability, in Mean Square and in Distribution

Convergence in Probability

A sequence of random scalars {z_n} converges in probability to a constant (non-random) α if, for any ε > 0,

lim_{n→∞} P(|z_n - α| > ε) = 0.

We write

z_n →^p α  or  plim z_n = α.

As we will see, z_n is usually a sample mean,

z_n = (Σ_{i=1}^{n} y_i)/n  or  z_n = (Σ_{i=1}^{n} z_i)/n.
124

Example. Consider a fair coin. Let z_i = 1 if the i-th toss results in heads and z_i = 0 otherwise. Let z_n = (1/n) Σ_{i=1}^{n} z_i. The following graph suggests that z_n →^p 1/2 (a simulation sketch is given below).
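The graph can be reproduced with a short simulation; this Python sketch (not from the slides) tracks the running sample mean of simulated coin tosses.

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.integers(0, 2, size=10_000)              # z_i = 1 (heads) or 0 (tails)
zbar = np.cumsum(z) / np.arange(1, z.size + 1)   # running sample mean z_n
print(zbar[[9, 99, 999, 9999]])                  # settles around 0.5 as n grows
```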
125

A sequence of K-dimensional vectors {z_n} converges in probability to a K-dimensional vector of constants α if, for any ε > 0,

lim_{n→∞} P(|z_nk - α_k| > ε) = 0,  ∀k.

We write

z_n →^p α.

Convergence in Mean Square

A sequence of random scalars {z_n} converges in mean square (or in quadratic mean) to α if

lim_{n→∞} E[(z_n - α)²] = 0.

The extension to random vectors is analogous to that for convergence in probability.


126

Convergence in Distribution

Let {z_n} be a sequence of random scalars and F_n be the cumulative distribution function (c.d.f.) of z_n, i.e. z_n ~ F_n. We say that {z_n} converges in distribution to a random scalar z if the c.d.f. F_n of z_n converges to the c.d.f. F of z at every continuity point of F. We write

z_n →^d z,  where z ~ F;

F is the asymptotic (or limiting) distribution of z_n. If F is well known, for example if F is the cumulative normal N(0, 1) distribution, we prefer to write

z_n →^d N(0, 1)  (instead of z_n →^d z and z ~ N(0, 1)).

Example. Consider z_n ~ t(n). We know that z_n →^d N(0, 1).

In most applications z_n is of the type

z_n = √n (ȳ - E(y_i)).

Exercise 3.1. For z_n = √n (ȳ - E(y_i)) calculate E(z_n) and Var(z_n) (assume E(y_i) = μ, Var(y_i) = σ² and {y_i} is an i.i.d. sequence).
127

3.1.2 Useful Results

Lemma (2.3 - preservation of convergence for continuous transformations). Suppose f is a vector-valued continuous function that does not depend on n. Then:

(a) if z_n →^p α then f(z_n) →^p f(α);

(b) if z_n →^d z then f(z_n) →^d f(z).

An immediate implication of Lemma 2.3 (a) is that the usual arithmetic operations preserve convergence in probability:

x_n →^p α, y_n →^p β ⇒ x_n + y_n →^p α + β;

x_n →^p α, y_n →^p β ⇒ x_n y_n →^p αβ;

x_n →^p α, y_n →^p β ⇒ x_n/y_n →^p α/β, β ≠ 0;

Y_n →^p Γ ⇒ Y_n^{-1} →^p Γ^{-1} (Γ invertible).
128

Lemma (2.4). We have

(a) x_n →^d x, y_n →^p α ⇒ x_n + y_n →^d x + α.

(b) x_n →^d x, y_n →^p 0 ⇒ y_n'x_n →^p 0.

(c) x_n →^d x, A_n →^p A ⇒ A_n x_n →^d Ax. In particular, if x ~ N(0, Σ), then A_n x_n →^d N(0, AΣA').

(d) x_n →^d x, A_n →^p A ⇒ x_n'A_n^{-1}x_n →^d x'A^{-1}x (A is nonsingular).

If x_n →^p 0 we write x_n = o_p(1).

If x_n - y_n →^p 0 we write x_n = y_n + o_p(1).

In part (c) we may write A_n x_n =^d A x_n (A_n x_n and A x_n have the same asymptotic distribution).
129

3.1.3 Viewing Estimators as Sequences of Random Variables

Let θ̂_n be an estimator of a parameter vector θ based on a sample of size n. We say that an estimator θ̂_n is consistent for θ if

θ̂_n →^p θ.

The asymptotic bias of θ̂_n is defined as plim_{n→∞} θ̂_n - θ. So if the estimator is consistent, its asymptotic bias is zero.

Wooldridge’s quotation:

While not all useful estimators are unbiased, virtually all economists agree that
consistency is a minimal requirement for an estimator. The famous econometrician
Clive W.J. Granger once remarked: “If you can’t get it right as n goes to infinity,
you shouldn’t be in this business.” The implication is that, if your estimator of a
particular population parameter is not consistent, then you are wasting your time.
130

A consistent estimator θ̂_n is asymptotically normal if

√n (θ̂_n - θ) →^d N(0, Σ).

Such an estimator is called √n-consistent.

The variance matrix Σ is called the asymptotic variance and is denoted Avar(θ̂_n), i.e.

lim_{n→∞} Var(√n (θ̂_n - θ)) = Avar(θ̂_n) = Σ.

Some authors use the notation Avar(θ̂_n) to mean Σ/n (which is zero in the limit).
131

3.1.4 Laws of Large Numbers and Central Limit Theorems

Consider

z̄_n = (1/n) Σ_{i=1}^{n} z_i.

We say that z̄_n obeys the LLN if z̄_n →^p μ, where μ = E(z_i) or lim_n E(z̄_n) = μ.

(A version of Chebychev's weak LLN) If lim E(z̄_n) = μ and lim Var(z̄_n) = 0, then z̄_n →^p μ.

(Kolmogorov's second strong LLN) If {z_i} is i.i.d. with E(z_i) = μ, then z̄_n →^p μ.

These LLNs extend readily to random vectors by requiring element-by-element convergence.


132

Theorem 1 (Lindeberg-Levy CLT). Let {z_i} be i.i.d. with E(z_i) = μ and Var(z_i) = Σ. Then

√n (z̄_n - μ) = (1/√n) Σ_{i=1}^{n} (z_i - μ) →^d N(0, Σ).

Notice that

E(√n (z̄_n - μ)) = 0 ⇒ E(z̄_n) = μ,

Var(√n (z̄_n - μ)) = Σ ⇒ Var(z̄_n) = Σ/n.

Given the previous equations, some authors write

z̄_n ~^a N(μ, Σ/n).
133

Example. Let {z_i} be i.i.d. with distribution χ²(1). By the Lindeberg-Levy CLT (scalar case) we have

z̄_n = (1/n) Σ_{i=1}^{n} z_i ~^a N(μ, σ²/n),

where

E(z̄_n) = (1/n) Σ_{i=1}^{n} E(z_i) = E(z_i) = μ = 1,

Var(z̄_n) = Var( (1/n) Σ_{i=1}^{n} z_i ) = (1/n²) · n · Var(z_i) = Var(z_i)/n = σ²/n = 2/n.
134

[Figures: probability density function of z̄_n (obtained by Monte-Carlo simulation) and probability density function of √n (z̄_n - μ) (exact expressions for n = 5, 10 and 50).]
135

Example. In a random sample of size n = 30 on a variable z with E(z) = 10, Var(z) = 9 but unknown distribution, obtain an approximation to P(z̄_n < 9.5). We do not know the exact distribution of z̄_n. However, from the Lindeberg-Levy CLT we have

√n (z̄_n - μ)/σ →^d N(0, 1)  or  z̄_n ~^a N(μ, σ²/n).

Hence,

P(z̄_n < 9.5) = P( √n (z̄_n - μ)/σ < √30 (9.5 - 10)/3 ) ≈ Φ(-0.9128) = 0.1807,  [Φ is the cdf of N(0, 1)].
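The normal approximation above is a one-liner in code; this hedged sketch simply evaluates Φ at the standardized value.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 10.0, 3.0, 30
print(norm.cdf(sqrt(n) * (9.5 - mu) / sigma))   # approximately 0.1807
```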
136

3.2 Fundamental Concepts in Time-Series Analysis

A stochastic process (SP) is a sequence of random variables. For this reason, it is more adequate to write a SP as {z_i} (meaning the sequence of random variables) rather than z_i (which means the random variable at time i).
137

3.2.1 Various Classes of Stochastic processes

Definition (Stationary Processes). A SP {z_i} is (strictly) stationary if the joint distribution of (z_1, z_2, ..., z_s) equals that of (z_{k+1}, z_{k+2}, ..., z_{k+s}) for any s ∈ N and k ∈ Z.

Exercise 3.2. Consider a SP {z_i} where E(|g(z_i)|) < ∞. Show that if {z_i} is a strictly stationary process then E(g(z_i)) is constant and does not depend on i.

The definition implies that any transformation (function) of a stationary process is itself stationary; that is, if {z_i} is stationary, then {g(z_i)} is. For example, if {z_i} is stationary then {z_i z_i'} is also a stationary SP.

Definition (Covariance Stationary Processes). A stochastic process {z_i} is weakly (or covariance) stationary if: (i) E(z_i) does not depend on i, and (ii) Cov(z_i, z_{i-j}) exists, is finite, and depends only on j but not on i.

If {z_i} is a covariance stationary SP then Cov(z_1, z_5) = Cov(z_1001, z_1005).

A transformation (function) of a covariance stationary process may or may not be a covariance stationary process.
138

Example. It can be proved that {z_i}, with z_i = √(α_0 + α_1 z_{i-1}²) ε_i, where {ε_i} is i.i.d. with mean zero and unit variance, α_0 > 0 and 1/3 ≤ α_1² < 1, is a covariance stationary process. However, w_i = z_i² is not a covariance stationary process as E(w_i²) does not exist.
Exercise 3.3. Consider the SP {u_t} where

u_t = ε_t if t ≤ 2000,  u_t = √((k - 2)/k) η_t if t > 2000,

where ε_t and η_s are independent for all t and s, ε_t ~ i.i.d. N(0, 1) and η_s ~ i.i.d. t(k). Explain why {u_t} is weakly (or covariance) stationary but not strictly stationary.
Definition (White Noise Processes). A white noise process {z_i} is a covariance stationary process with zero mean and no serial correlation:

E(z_i) = 0,  Cov(z_i, z_j) = 0, i ≠ j.


139

[Figure: sample paths of four simulated time series.]
140

In the literature there is not a unique definition of ergodicity. We prefer to call a "weakly dependent process" what Hayashi calls an "ergodic process".

Definition. A stationary process {z_i} is said to be a weakly dependent process (= ergodic in Hayashi's definition) if, for any two bounded functions f: R^{k+1} → R and g: R^{s+1} → R,

lim_{n→∞} | E[ f(z_i, ..., z_{i+k}) g(z_{i+n}, ..., z_{i+n+s}) ] | = | E[ f(z_i, ..., z_{i+k}) ] | · | E[ g(z_{i+n}, ..., z_{i+n+s}) ] |.

Theorem 2 (S&WD). Let {z_i} be a stationary and weakly dependent (S&WD) process with E(z_i) = μ. Then z̄_n →^p μ.

Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov's LLN, is allowed in this theorem, provided that it disappears in the long run. Since, for any function f, {f(z_i)} is S&WD whenever {z_i} is, this theorem implies that any moment of a S&WD process (if it exists and is finite) is consistently estimated by the sample moment. For example, suppose {z_i} is a S&WD process and E(z_i z_i') exists and is finite. Then

z̄_n = (1/n) Σ_{i=1}^{n} z_i z_i' →^p E(z_i z_i').
141

Definition (Martingale). A vector process {z_i} is called a martingale with respect to {z_i} if

E(z_i | z_{i-1}, ..., z_1) = z_{i-1} for i ≥ 2.

The process

z_i = z_{i-1} + ε_i,

where {ε_i} is a white noise process with E(ε_i | z_{i-1}, ..., z_1) = 0, is a martingale since

E(z_i | z_{i-1}, ..., z_1) = z_{i-1} + E(ε_i | z_{i-1}, ..., z_1) = z_{i-1}.

Definition (Martingale Difference Sequence). A vector process {g_i} with E(g_i) = 0 is called a martingale difference sequence (MDS) or martingale differences if

E(g_i | g_{i-1}, ..., g_1) = 0.

If {z_i} is a martingale, the process defined as Δz_i = z_i - z_{i-1} is a MDS.

Proposition. If {g_i} is a MDS then Cov(g_i, g_{i-j}) = 0, j ≠ 0.
142

By definition

Var(ḡ_n) = (1/n²) Var( Σ_{t=1}^{n} g_t ) = (1/n²) [ Σ_{t=1}^{n} Var(g_t) + 2 Σ_{j=1}^{n-1} Σ_{i=j+1}^{n} Cov(g_i, g_{i-j}) ].

However, if {g_i} is a stationary MDS with finite second moments then

Σ_{t=1}^{n} Var(g_t) = n Var(g_t),  Cov(g_i, g_{i-j}) = 0,

so

Var(ḡ_n) = (1/n) Var(g_t).

Definition (Random Walk). Let {g_i} be a vector independent white noise process. A random walk, {z_i}, is a sequence of cumulative sums:

z_i = g_i + g_{i-1} + ... + g_1.

Exercise 3.4. Show that the random walk can be written as

z_i = z_{i-1} + g_i,  z_1 = g_1.
143

3.2.2 Different Formulations of Lack of Serial Dependence

We have three formulations of a lack of serial dependence for zero-mean covariance stationary processes:

(1) {g_i} is independent white noise.

(2) {g_i} is a stationary MDS with finite variance.

(3) {g_i} is white noise.

(1) ⇒ (2) ⇒ (3).


Exercise 3.5 (Process that satisfies (2) but not (1) - the ARCH process). Consider g_i = √(α_0 + α_1 g_{i-1}²) ε_i, where {ε_i} is i.i.d. with mean zero and unit variance, α_0 > 0 and |α_1| < 1. Show that {g_i} is a MDS but not an independent white noise.
144

3.2.3 The CLT for S&WD Martingale Difference Sequences

Theorem 3 (Stationary Martingale Differences CLT (Billingsley, 1961)). Let {g_i} be a vector martingale difference sequence that is a S&WD process with E(g_i g_i') = Σ, and let ḡ_n = (1/n) Σ g_i. Then

√n ḡ_n = (1/√n) Σ_{i=1}^{n} g_i →^d N(0, Σ).

Theorem 4 (Martingale Differences CLT (White, 1984)). Let {g_i} be a vector martingale difference sequence. Suppose that (a) E(g_i g_i') = Σ_i is a positive definite matrix with (1/n) Σ_{i=1}^{n} Σ_i → Σ (a positive definite matrix), (b) g_i has finite 4th moments, and (c) (1/n) Σ_{i=1}^{n} g_i g_i' →^p Σ. Then

√n ḡ_n = (1/√n) Σ_{i=1}^{n} g_i →^d N(0, Σ).
145

3.3 Large-Sample Distribution of the OLS Estimator

The model presented in this section has probably the widest range of economic applications:

No specific distributional assumption (such as the normality of the error term) is required;

The requirement in finite-sample theory that the regressors be strictly exogenous or fixed is replaced by a much weaker requirement that they be "predetermined."

Assumption (2.1 - linearity). y_i = x_i'β + ε_i.

Assumption (2.2 - S&WD). {(y_i, x_i)} is jointly S&WD.

Assumption (2.3 - predetermined regressors). All the regressors are predetermined in the sense that they are orthogonal to the contemporaneous error term: E(x_ik ε_i) = 0, ∀i, k. This can be written as

E(x_i ε_i) = 0  or  E(g_i) = 0 where g_i = x_i ε_i.

Assumption (2.4 - rank condition). E(x_i x_i') = Σ_xx is nonsingular.
146

Assumption (2.5 - {g_i} is a martingale difference sequence with finite second moments). {g_i}, where g_i = x_i ε_i, is a martingale difference sequence (so, a fortiori, E(g_i) = 0). The K × K matrix of cross moments, E(g_i g_i'), is nonsingular. We use S for Avar(ḡ) (the variance of √n ḡ, where ḡ = (1/n) Σ g_i). By Assumption 2.2 and the S&WD martingale differences CLT, S = E(g_i g_i').

Remarks:

1. (S&WD) A special case of S&WD is that {(y_i, x_i)} is i.i.d. (random sampling in cross-sectional data).

2. (The model accommodates conditional heteroskedasticity) If {(y_i, x_i)} is stationary, then the error term ε_i = y_i - x_i'β is also stationary. The conditional moment E(ε_i² | x_i) can depend on x_i without violating any previous assumption, as long as E(ε_i²) is constant.


147

3. (E(x_i ε_i) = 0 vs. E(ε_i | x_i) = 0) The condition E(ε_i | x_i) = 0 is stronger than E(x_i ε_i) = 0. In effect,

E(x_i ε_i) = E( E(x_i ε_i | x_i) ) = E( x_i E(ε_i | x_i) ) = E(x_i · 0) = 0.

4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3 restricts only the contemporaneous relationship between the error term and the regressors. The strict exogeneity assumption (Assumption 1.2) implies that, for any regressor k, E(x_jk ε_i) = 0 for all i and j, not just for i = j. Strict exogeneity is a strong assumption that does not hold in general for time-series models.
148

5. (Rank condition as no multicollinearity in the limit) Since

b = (X'X/n)^{-1} (X'y/n) = ( (1/n) Σ x_i x_i' )^{-1} ( (1/n) Σ x_i y_i ) = S_xx^{-1} S_xy

where

S_xx = X'X/n = (1/n) Σ x_i x_i'  (sample average of x_i x_i'),

S_xy = X'y/n = (1/n) Σ x_i y_i  (sample average of x_i y_i).

By Assumptions 2.2, 2.4 and the S&WD theorem we have

X'X/n = (1/n) Σ_{i=1}^{n} x_i x_i' →^p E(x_i x_i').

Assumption 2.4 guarantees that the limit in probability of X'X/n has rank K.
149

6. (A sufficient condition for {g_i} to be a MDS) Since a MDS is zero-mean by definition, Assumption 2.5 is stronger than Assumption 2.3 (the latter is redundant given Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of the OLS estimator. A sufficient condition for {g_i} to be a MDS is

E(ε_i | F_i) = 0, where

F_i = I_{i-1} ∪ x_i = {ε_{i-1}, ε_{i-2}, ..., ε_1, x_i, x_{i-1}, ..., x_1},

I_{i-1} = {ε_{i-1}, ε_{i-2}, ..., ε_1, x_{i-1}, ..., x_1}.

(This condition implies that the error term is serially uncorrelated and also is uncorrelated with the current and past regressors.) Proof. Notice: {g_i} is a MDS if

E(g_i | g_{i-1}, ..., g_1) = 0,  g_i = x_i ε_i.

Now, using the condition E(ε_i | F_i) = 0,

E(x_i ε_i | g_{i-1}, ..., g_1) = E[ E(x_i ε_i | F_i) | g_{i-1}, ..., g_1 ] = E[ x_i E(ε_i | F_i) | g_{i-1}, ..., g_1 ] = E[ 0 | g_{i-1}, ..., g_1 ] = 0,

thus E(ε_i | F_i) = 0 ⇒ {g_i} is a MDS.
150

7. (When the regressors include a constant) Assumption 2.5 is

E(x_i ε_i | g_{i-1}, ..., g_1) = E( [1, ..., x_iK]' ε_i | g_{i-1}, ..., g_1 ) = 0 ⇒ E(ε_i | g_{i-1}, ..., g_1) = 0,

and then

E(ε_i | ε_{i-1}, ..., ε_1) = E( E(ε_i | g_{i-1}, ..., g_1) | ε_{i-1}, ..., ε_1 ) = 0.

Assumption 2.5 implies that the error term itself is a MDS and hence is serially uncorrelated.

8. (S is a matrix of fourth moments)

S = E(g_i g_i') = E(x_i ε_i x_i' ε_i) = E(ε_i² x_i x_i').

Consistent estimation of S will require an additional assumption.
151

9. (S takes a different expression without Assumption 2.5) In general

Avar(ḡ) = Var(√n ḡ) = Var( (1/√n) Σ_{i=1}^{n} g_i ) = (1/n) Var( Σ_{i=1}^{n} g_i )

= (1/n) [ Σ_{i=1}^{n} Var(g_i) + Σ_{j=1}^{n-1} Σ_{i=j+1}^{n} ( Cov(g_i, g_{i-j}) + Cov(g_{i-j}, g_i) ) ]

= (1/n) Σ_{i=1}^{n} Var(g_i) + (1/n) Σ_{j=1}^{n-1} Σ_{i=j+1}^{n} ( E(g_i g_{i-j}') + E(g_{i-j} g_i') ).

Given stationarity, we have

(1/n) Σ_{i=1}^{n} Var(g_i) = Var(g_i).

Thanks to Assumption 2.5 we have E(g_i g_{i-j}') = E(g_{i-j} g_i') = 0, so

S = Avar(ḡ) = Var(g_i) = E(g_i g_i').


152

Proposition (2.1 - asymptotic distribution of the OLS estimator). (a) (Consistency of b for β) Under Assumptions 2.1-2.4,

b →^p β.

(b) (Asymptotic normality of b) If Assumption 2.3 is strengthened to Assumption 2.5, then

√n (b - β) →^d N(0, Avar(b))

where

Avar(b) = Σ_xx^{-1} S Σ_xx^{-1}.

(c) (Consistent estimate of Avar(b)) Suppose there is available a consistent estimator Ŝ of S. Then under Assumption 2.2, Avar(b) is consistently estimated by

Âvar(b) = S_xx^{-1} Ŝ S_xx^{-1}

where

S_xx = X'X/n = (1/n) Σ_{i=1}^{n} x_i x_i'.
153

Proposition (2.2 - consistent estimation of error variance). Under Assumptions 2.1-2.4,

s² = (1/(n - K)) Σ_{i=1}^{n} e_i² →^p E(ε_i²),

provided E(ε_i²) exists and is finite.

Under conditional homoskedasticity, E(ε_i² | x_i) = σ² (we will see this in detail later), we have

S = E(g_i g_i') = E(ε_i² x_i x_i') = ... = σ² E(x_i x_i') = σ² Σ_xx

and

Avar(b) = Σ_xx^{-1} S Σ_xx^{-1} = Σ_xx^{-1} σ² Σ_xx Σ_xx^{-1} = σ² Σ_xx^{-1},

Âvar(b) = s² S_xx^{-1} = n s² (X'X)^{-1}.

Thus

b ~^a N( β, Âvar(b)/n ) = N( β, s² (X'X)^{-1} ).
154

3.4 Statistical Inference

Derivation of the distribution of test statistics is easier than in finite-sample theory because we are only concerned with the large-sample approximation to the exact distribution.

Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold, and suppose there is available a consistent estimate Ŝ of S. As before, let Âvar(b) = S_xx^{-1} Ŝ S_xx^{-1}. Then

(a) Under the null hypothesis H0: β_k = β̄_k,

t_k = (b_k - β̄_k)/σ̂_bk →^d N(0, 1),  where σ̂²_bk = Âvar(b_k)/n = [ S_xx^{-1} Ŝ S_xx^{-1} ]_kk / n.

(b) Under the null hypothesis H0: Rβ = r, with rank(R) = p,

W = n (Rb - r)' [ R Âvar(b) R' ]^{-1} (Rb - r) →^d χ²(p).
155

Remarks

σ̂_bk is called the heteroskedasticity-consistent standard error, (heteroskedasticity-)robust standard error, or White's standard error. The reason for this terminology is that the error term can be conditionally heteroskedastic. The t-ratio is called the robust t-ratio.

The differences from the finite-sample t-test are: (1) the way the standard error is calculated is different, (2) we use the table of N(0, 1) rather than that of t(n - K), and (3) the actual size or exact size of the test (the probability of Type I error given the sample size) equals the nominal size (i.e., the desired significance level α) only approximately, although the approximation becomes arbitrarily good as the sample size increases. The difference between the exact size and the nominal size of a test is called the size distortion.

Both tests are consistent in the sense that

power = P( rejecting the null H0 | H1 is true ) → 1 as n → ∞.


156

3.5 Estimating S = E(ε_i² x_i x_i') Consistently

How do we select an estimator for a population parameter? One of the most important methods is the analog estimation method, or method of moments. The method of moments principle: to estimate a feature of the population, use the corresponding feature of the sample.

Examples of analog estimators:

Parameter of the population      Estimator
E(y_i)                           ȳ
Var(y_i)                         S_y²
σ_xy                             S_xy
σ_x²                             S_x²
P(y_i ≤ c)                       Σ_{i=1}^{n} I{y_i ≤ c} / n
median(y_i)                      sample median
max(y_i)                         max_{i=1,...,n}(y_i)
157

The analogy principle suggests that E(ε_i² x_i x_i') can be estimated using the estimator

(1/n) Σ_{i=1}^{n} ε_i² x_i x_i'.

Since ε_i is not observable we need another one:

Ŝ = (1/n) Σ_{i=1}^{n} e_i² x_i x_i'.

Assumption (2.6 - finite fourth moments for regressors). E[(x_ik x_ij)²] exists and is finite for all k and j (k, j = 1, ..., K).

Proposition (2.4 - consistent estimation of S). Suppose S = E(ε_i² x_i x_i') exists and is finite. Then, under Assumptions 2.1-2.4 and 2.6, Ŝ is consistent for S.
158

The estimator Ŝ can be represented as

Ŝ = (1/n) Σ_{i=1}^{n} e_i² x_i x_i' = X'BX/n,  where B = diag(e_1², e_2², ..., e_n²).

Thus, Âvar(b) = S_xx^{-1} Ŝ S_xx^{-1} = n (X'X)^{-1} X'BX (X'X)^{-1}. We have

b ~^a N( β, Âvar(b)/n ) = N( β, S_xx^{-1} Ŝ S_xx^{-1}/n ) = N( β, (X'X)^{-1} X'BX (X'X)^{-1} ),

W = n (Rb - r)' [ R Âvar(b) R' ]^{-1} (Rb - r)
  = n (Rb - r)' [ R S_xx^{-1} Ŝ S_xx^{-1} R' ]^{-1} (Rb - r)
  = (Rb - r)' [ R (X'X)^{-1} X'BX (X'X)^{-1} R' ]^{-1} (Rb - r) →^d χ²(p).
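A minimal Python sketch (added; it assumes y is an (n,) array and X an (n,K) design matrix) of the heteroskedasticity-robust covariance just derived, Âvar(b)/n = (X'X)^{-1} X'BX (X'X)^{-1} with B = diag(e_i²), compared with the conventional s²(X'X)^{-1}.

```python
import numpy as np

def ols_with_white_se(y, X):
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    n, K = X.shape
    V_conv = (e @ e / (n - K)) * XtX_inv        # conventional: s^2 (X'X)^{-1}
    meat = (X * e[:, None] ** 2).T @ X          # X' diag(e^2) X
    V_white = XtX_inv @ meat @ XtX_inv          # robust (HC0) covariance
    return b, np.sqrt(np.diag(V_conv)), np.sqrt(np.diag(V_white))
```

The robust column corresponds to the White standard errors reported in the second EViews table below (possibly up to a finite-sample degrees-of-freedom correction applied by the software).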
159

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.724551 -2.164014 0.0309


FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.825934 -1.898382 0.0582


FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000
160

3.6 Implications of Conditional Homoskedasticity

Assumption (2.7 - conditional homoskedasticity). E(ε_i² | x_i) = σ² > 0.

Under Assumption 2.7 we have

S = E(ε_i² x_i x_i') = ... = σ² E(x_i x_i') = σ² Σ_xx  and

Avar(b) = Σ_xx^{-1} S Σ_xx^{-1} = Σ_xx^{-1} σ² Σ_xx Σ_xx^{-1} = σ² Σ_xx^{-1}.

Proposition (2.5 - large-sample properties of b, t, and F under conditional homoskedasticity). Suppose Assumptions 2.1-2.5 and 2.7 are satisfied. Then

(a) (Asymptotic distribution of b) The OLS estimator b is consistent and asymptotically normal with

Avar(b) = σ² Σ_xx^{-1}.

(b) (Consistent estimation of the asymptotic variance) Under the same set of assumptions, Avar(b) is consistently estimated by

Âvar(b) = s² S_xx^{-1} = n s² (X'X)^{-1}.
161

(c) (Asymptotic distribution of the t and F statistics of the finite-sample theory)

Under H0: β_k = β̄_k we have

t_k = (b_k - β̄_k)/σ̂_bk →^d N(0, 1),  where σ̂²_bk = Âvar(b_k)/n = s² [ (X'X)^{-1} ]_kk.

Under H0: Rβ = r with rank(R) = p, we have

pF →^d χ²(p),

where F = (Rb - r)' [ R (X'X)^{-1} R' ]^{-1} (Rb - r) / (p s²).

Notice

pF = (ẽ'ẽ - e'e) / ( e'e/(n - K) ) →^d χ²(p),

where ẽ refers to the short regression, i.e. the regression subject to the constraint Rβ = r.

Remark (No need for the fourth-moment assumption). By S&WD and Assumptions 2.1-2.4, s² S_xx →^p σ² Σ_xx = S. We do not need the fourth-moment assumption (Assumption 2.6) for consistency.
162

3.7 Testing Conditional Homoskedasticity

With the advent of robust standard errors, allowing us to do inference without specifying the conditional second moment, testing conditional homoskedasticity is not as important as it used to be. This section presents only the most popular test, due to White (1980), for the case of random samples.

Let ψ_i be a vector collecting the unique and nonconstant elements of the K × K symmetric matrix x_i x_i'.

Proposition (2.6 - White's Test for Conditional Heteroskedasticity). In addition to Assumptions 2.1 and 2.4, suppose that (a) {(y_i, x_i)} is i.i.d. with finite E(ε_i² x_i x_i') (thus strengthening Assumptions 2.2 and 2.5), (b) ε_i is independent of x_i (thus strengthening Assumption 2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments of ε_i and x_i. Then

nR² →^d χ²(m)

where R² is the R² from the auxiliary regression of e_i² on a constant and ψ_i, and m is the dimension of ψ_i.
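A hedged sketch of White's test (added; it assumes the first column of X is the constant): build ψ_i from the levels, squares and cross products of the nonconstant regressors, regress e_i² on a constant and ψ_i, and compare nR² with χ²(m).

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.stats import chi2

def white_test(e, X):
    # X is assumed to contain the constant in its first column
    n, K = X.shape
    cross = [X[:, i] * X[:, j] for i, j in combinations_with_replacement(range(1, K), 2)]
    Z = np.column_stack([X] + cross)       # constant, levels, squares, cross products
    u = e ** 2
    g = np.linalg.lstsq(Z, u, rcond=None)[0]
    resid = u - Z @ g
    R2 = 1 - (resid @ resid) / ((u - u.mean()) @ (u - u.mean()))
    m = Z.shape[1] - 1                     # nonconstant regressors in the auxiliary regression
    stat = n * R2
    return stat, chi2.sf(stat, m)          # test statistic and asymptotic p-value
```

The statistic corresponds to the "Obs*R-squared" line in the EViews heteroskedasticity test output shown below.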
163

Dependent Variable: WAGE


Method: Least Squares
Sample: 1 526
Included observations: 526

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.724551 -2.164014 0.0309


FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000
164

Heteroskedasticity Test: White

F-statistic 5.911627 Prob. F(13,512) 0.0000


Obs*R-squared 68.64843 Prob. Chi-Square(13) 0.0000
Scaled explained SS 227.2648 Prob. Chi-Square(13) 0.0000

Test Equation:
Dependent Variable: RESID^2

Variable Coefficient Std. Error t-Statistic Prob.

C 47.03183 20.19579 2.328794 0.0203


FEMALE -7.205436 10.92406 -0.659593 0.5098
FEMALE*EDUC 0.491073 0.778127 0.631097 0.5283
FEMALE*EXPER -0.154634 0.168490 -0.917768 0.3592
FEMALE*TENURE 0.066832 0.351582 0.190089 0.8493
EDUC -7.693423 2.596664 -2.962811 0.0032
EDUC^2 0.315191 0.086457 3.645652 0.0003
EDUC*EXPER 0.045665 0.036134 1.263789 0.2069
EDUC*TENURE 0.083929 0.054140 1.550226 0.1217
EXPER 0.000257 0.610348 0.000421 0.9997
EXPER^2 -0.009134 0.007010 -1.303002 0.1932
EXPER*TENURE -0.004066 0.017603 -0.230969 0.8174
TENURE -0.298093 0.934417 -0.319015 0.7498
TENURE^2 -0.004633 0.016358 -0.283255 0.7771

R-squared 0.130510 Mean dependent var 8.664083


Adjusted R-squared 0.108433 S.D. dependent var 22.52940
S.E. of regression 21.27289 Akaike info criterion 8.978999
Sum squared resid 231698.4 Schwarz criterion 9.092525
Log likelihood -2347.477 Hannan-Quinn criter. 9.023450
F-statistic 5.911627 Durbin-Watson stat 1.905515
Prob(F-statistic) 0.000000
165

Dependent Variable: WAGE


Method: Least Squares
Included observations: 526
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable Coefficient Std. Error t-Statistic Prob.

C -1.567939 0.825934 -1.898382 0.0582


FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

R-squared 0.363541 Mean dependent var 5.896103


Adjusted R-squared 0.358655 S.D. dependent var 3.693086
S.E. of regression 2.957572 Akaike info criterion 5.016075
Sum squared resid 4557.308 Schwarz criterion 5.056619
Log likelihood -1314.228 Hannan-Quinn criter. 5.031950
F-statistic 74.39801 Durbin-Watson stat 1.794400
Prob(F-statistic) 0.000000

3.8 Estimation with Parameterized Conditional Heteroskedasticity

Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still
consistent and asymptotically normal, and valid statistical inference can be conducted with
robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case
of a priori knowledge of the functional form of the conditional second moment, it should be
possible to obtain sharper estimates with smaller asymptotic variance.
166

To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5 by assuming that {(y_i, x_i)} is i.i.d.

3.8.1 The Functional Form

The parametric functional form for the conditional second moment we consider is

E(ε_i² | x_i) = z_i'α

where z_i is a function of x_i.

For example, if E(ε_i² | x_i) = α_1 + α_2 x_i2², then z_i' = [ 1  x_i2² ].
167

3.8.2 WLS with Known α

The WLS (also GLS) estimator can be obtained by applying OLS to the regression

ỹ_i = x̃_i'β + ε̃_i

where

ỹ_i = y_i/√(z_i'α),  x̃_ik = x_ik/√(z_i'α),  ε̃_i = ε_i/√(z_i'α),  i = 1, 2, ..., n.

We have

β̂_GLS = β̂(V) = (X̃'X̃)^{-1} X̃'ỹ = (X'V^{-1}X)^{-1} X'V^{-1}y.
168

Note that

E(ε̃_i | x̃_i) = 0.

Therefore, provided that E(x̃_i x̃_i') is nonsingular, Assumptions 2.1-2.5 are satisfied for the equation ỹ_i = x̃_i'β + ε̃_i. Furthermore, by construction, the error ε̃_i is conditionally homoskedastic: E(ε̃_i² | x̃_i) = 1. So Proposition 2.5 applies: the WLS estimator is consistent and asymptotically normal, and the asymptotic variance is

Avar(β̂(V)) = [ E(x̃_i x̃_i') ]^{-1} = plim( (1/n) Σ_{i=1}^{n} x̃_i x̃_i' )^{-1}  (by the S&WD theorem)
= plim( (1/n) X'V^{-1}X )^{-1}.

Thus ( (1/n) X'V^{-1}X )^{-1} is a consistent estimator of Avar(β̂(V)).
169

3.8.3 Regression of e_i² on z_i Provides a Consistent Estimate of α

If α is unknown we need to obtain α̂. Assuming E(ε_i² | x_i) = z_i'α we have

ε_i² = E(ε_i² | x_i) + η_i,

where by construction E(η_i | x_i) = 0. This suggests that the following regression can be considered:

ε_i² = z_i'α + η_i.

Provided that E(z_i z_i') is nonsingular, Proposition 2.1 is applicable to this auxiliary regression: the OLS estimator of α is consistent and asymptotically normal. However, we cannot run this regression as ε_i is not observable. In the previous regression we should replace ε_i by the consistent estimate e_i (despite the presence of conditional heteroskedasticity). In conclusion, we may obtain a consistent estimate of α by considering the regression of e_i² on z_i to get

α̂ = ( Σ_{i=1}^{n} z_i z_i' )^{-1} Σ_{i=1}^{n} z_i e_i².
170

3.8.4 WLS with Estimated α

Step 1: Estimate the equation y_i = x_i'β + ε_i by OLS and compute the OLS residuals e_i.

Step 2: Regress e_i² on z_i to obtain the OLS coefficient estimate α̂.

Step 3: Transform the original variables according to the rules

ỹ_i = y_i/√(z_i'α̂),  x̃_ik = x_ik/√(z_i'α̂),  i = 1, 2, ..., n,

and run OLS on the model ỹ_i = x̃_i'β + ε̃_i to obtain the Feasible GLS (FGLS) estimator:

β̂(V̂) = (X'V̂^{-1}X)^{-1} X'V̂^{-1}y.
171

It can be proved that:

β̂(V̂) →^p β;

√n (β̂(V̂) - β) →^d N(0, Avar(β̂(V)));

( (1/n) X'V̂^{-1}X )^{-1} is a consistent estimator of Avar(β̂(V)).

No finite-sample properties are known for the estimator β̂(V̂).
172

3.8.5 A popular specification for E(ε_i² | x_i)

The specification ε_i² = z_i'α + η_i may lead to z_i'α̂ < 0. To overcome this problem, a popular specification for E(ε_i² | x_i) is

E(ε_i² | x_i) = exp(α'x_i)

(it guarantees that Var(y_i | x_i) > 0 for all α ∈ R^r). It implies log E(ε_i² | x_i) = α'x_i. This suggests the following procedure (see the sketch after this list):

a) Regress y on X to get the residual vector e.

b) Run the LS regression of log e_i² on x_i to estimate α and calculate

σ̂_i² = exp(α̂'x_i).

c) Transform the data ỹ_i = y_i/σ̂_i, x̃_ij = x_ij/σ̂_i.

d) Regress ỹ on X̃ and obtain β̂(V̂).
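A compact Python sketch of steps a)-d) (an added illustration; y is an (n,) array, X an (n,K) design matrix containing a constant, and no OLS residual is assumed to be exactly zero):

```python
import numpy as np

def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fgls_exp_variance(y, X):
    b = ols(y, X)                        # a) first-step OLS
    e = y - X @ b
    alpha = ols(np.log(e ** 2), X)       # b) regress log(e^2) on x_i
    sigma = np.sqrt(np.exp(X @ alpha))   #    fitted sigma_i
    yt = y / sigma                       # c) weight the data by 1/sigma_i
    Xt = X / sigma[:, None]
    return ols(yt, Xt)                   # d) FGLS estimate of beta
```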
173

Notice also that:

E(ε_i² | x_i) = exp(α'x_i),

ε_i² = exp(α'x_i) + v_i,  v_i = ε_i² - E(ε_i² | x_i),

log ε_i² ≈ α'x_i + v_i,

log e_i² ≈ α'x_i + v_i.
Example (Part 1). We want to estimate a demand function for daily cigarette consumption (cigs). The explanatory variables are: log(income) - log of annual income, log(cigprice) - log of the per-pack price of cigarettes in cents, educ - years of education, age, and restaurn - a binary indicator equal to unity if the person resides in a state with restaurant smoking restrictions (source: J. Mullahy (1997), "Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior," Review of Economics and Statistics 79, 586-593).

Based on the information below, are the standard errors reported in the first table reliable?
174

Heteroskedasticity Test: White

F-statistic 2.159258 Prob. F(25,781) 0.0009


Obs*R-squared 52.17245 Prob. Chi-Square(25) 0.0011
Scaled explained SS 110.0813 Prob. Chi-Square(25) 0.0000
Dependent Variable: CIGS
Method: Least Squares
Sample: 1 807 Test Equation:
Dependent Variable: RESID^2
Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob.
C -3.639823 24.07866 -0.151164 0.8799 C 29374.77 20559.14 1.428794 0.1535
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 LOG(INCOME) -1049.630 963.4359 -1.089466 0.2763
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 (LOG(INCOME))^2 -3.941183 17.07122 -0.230867 0.8175
EDUC -0.501498 0.167077 -3.001596 0.0028 (LOG(INCOME))*(LOG(CIGPRIC)) 329.8896 239.2417 1.378897 0.1683
AGE 0.770694 0.160122 4.813155 0.0000 (LOG(INCOME))*EDUC -9.591849 8.047066 -1.191969 0.2336
(LOG(INCOME))*AGE -3.354565 6.682194 -0.502015 0.6158
AGE^2 -0.009023 0.001743 -5.176494 0.0000 (LOG(INCOME))*(AGE^2) 0.026704 0.073025 0.365689 0.7147
RESTAURN -2.825085 1.111794 -2.541016 0.0112 (LOG(INCOME))*RESTAURN -59.88700 49.69039 -1.205203 0.2285
LOG(CIGPRIC) -10340.68 9754.559 -1.060087 0.2894
R-squared 0.052737 Mean dependent var 8.686493 (LOG(CIGPRIC))^2 668.5294 1204.316 0.555111 0.5790
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 (LOG(CIGPRIC))*EDUC 32.91371 59.06252 0.557269 0.5775
S.E. of regression 13.40479 Akaike info criterion 8.037737 (LOG(CIGPRIC))*AGE 62.88164 55.29011 1.137304 0.2558
(LOG(CIGPRIC))*(AGE^2) -0.622371 0.594730 -1.046477 0.2957
Sum squared resid 143750.7 Schwarz criterion 8.078448 (LOG(CIGPRIC))*RESTAURN 862.1577 720.6219 1.196408 0.2319
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 EDUC -117.4705 251.2852 -0.467479 0.6403
F-statistic 7.423062 Durbin-Watson stat 2.012825 EDUC^2 -0.290343 1.287605 -0.225491 0.8217
Prob(F-statistic) 0.000000 EDUC*AGE 3.617048 1.724659 2.097254 0.0363
EDUC*(AGE^2) -0.035558 0.017664 -2.012988 0.0445
EDUC*RESTAURN -2.896490 10.65709 -0.271790 0.7859
AGE -264.1461 235.7624 -1.120391 0.2629
AGE^2 3.468601 3.194651 1.085753 0.2779
AGE*(AGE^2) -0.019111 0.028655 -0.666935 0.5050
AGE*RESTAURN -4.933199 10.84029 -0.455080 0.6492
(AGE^2)^2 0.000118 0.000146 0.807552 0.4196
(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497
RESTAURN -2868.196 2986.776 -0.960299 0.3372

cigs: number of cigarettes smoked per day, log(income): log of annual income, log(cigprice):
log of per pack price of cigarettes in cents, educ: years of education, age and restaurn:
binary indicator equal to unity if the person resides in a state with restaurant smoking re-
strictions.
175

Example (Part 2). Discuss the results of the following figures.

Dependent Variable: CIGS Dependent Variable: CIGS


Method: Least Squares Method: Least Squares
Sample: 1 807 Sample: 1 807
White Heteroskedasticity-Consistent Standard Errors & Covariance
Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob.
C -3.639823 24.07866 -0.151164 0.8799
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 C -3.639823 25.61646 -0.142089 0.8870
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 LOG(INCOME) 0.880268 0.596011 1.476931 0.1401
EDUC -0.501498 0.167077 -3.001596 0.0028 LOG(CIGPRIC) -0.750862 6.035401 -0.124410 0.9010
AGE 0.770694 0.160122 4.813155 0.0000 EDUC -0.501498 0.162394 -3.088167 0.0021
AGE^2 -0.009023 0.001743 -5.176494 0.0000 AGE 0.770694 0.138284 5.573262 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112 AGE^2 -0.009023 0.001462 -6.170768 0.0000
RESTAURN -2.825085 1.008033 -2.802573 0.0052
R-squared 0.052737 Mean dependent var 8.686493
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 R-squared 0.052737 Mean dependent var 8.686493
S.E. of regression 13.40479 Akaike info criterion 8.037737 Adjusted R-squared 0.045632 S.D. dependent var 13.72152
Sum squared resid 143750.7 Schwarz criterion 8.078448 S.E. of regression 13.40479 Akaike info criterion 8.037737
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 Sum squared resid 143750.7 Schwarz criterion 8.078448
F-statistic 7.423062 Durbin-Watson stat 2.012825 Log likelihood -3236.227 Hannan-Quinn criter. 8.053370
Prob(F-statistic) 0.000000 F-statistic 7.423062 Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
176

Example (Part 3). a) Regress y on X to get the residual vector e:

Dependent Variable: CIGS


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

C -3.639823 24.07866 -0.151164 0.8799


LOG(INCOME) 0.880268 0.727783 1.209519 0.2268
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966
EDUC -0.501498 0.167077 -3.001596 0.0028
AGE 0.770694 0.160122 4.813155 0.0000
AGE^2 -0.009023 0.001743 -5.176494 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112

R-squared 0.052737 Mean dependent var 8.686493


Adjusted R-squared 0.045632 S.D. dependent var 13.72152
S.E. of regression 13.40479 Akaike info criterion 8.037737
Sum squared resid 143750.7 Schwarz criterion 8.078448
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370
F-statistic 7.423062 Durbin-Watson stat 2.012825
Prob(F-statistic) 0.000000
177

b) Run the LS regression of log e_i² on x_i.

Dependent Variable: LOG(RES^2)


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

C -1.920691 2.563033 -0.749382 0.4538


LOG(INCOME) 0.291540 0.077468 3.763351 0.0002
LOG(CIGPRIC) 0.195418 0.614539 0.317992 0.7506
EDUC -0.079704 0.017784 -4.481657 0.0000
AGE 0.204005 0.017044 11.96928 0.0000
AGE^2 -0.002392 0.000186 -12.89313 0.0000
RESTAURN -0.627011 0.118344 -5.298213 0.0000

R-squared 0.247362 Mean dependent var 4.207486


Adjusted R-squared 0.241717 S.D. dependent var 1.638575
S.E. of regression 1.426862 Akaike info criterion 3.557468
Sum squared resid 1628.747 Schwarz criterion 3.598178
Log likelihood -1428.438 Hannan-Quinn criter. 3.573101
F-statistic 43.82129 Durbin-Watson stat 2.024587
Prob(F-statistic) 0.000000

Calculate σ̂_i² = exp(α̂'x_i). Notice that α̂'x_1, ..., α̂'x_n are the fitted values of the regression above, i.e. the fitted values of log e_i².
e21; :::; log e2n are the …tted values of the above regression.
178

c) Transform the data

ỹ_i = y_i/σ̂_i,  x̃_ij = x_ij/σ̂_i,

and d) regress ỹ on X̃ to obtain β̂(V̂):

Dependent Variable: CIGS/SIGMA


Method: Least Squares
Sample: 1 807

Variable Coefficient Std. Error t-Statistic Prob.

1/SIGMA 5.635471 17.80314 0.316544 0.7517


LOG(INCOME)/SIGMA 1.295239 0.437012 2.963855 0.0031
LOG(CIGPRIC)/SIGMA -2.940314 4.460145 -0.659242 0.5099
EDUC/SIGMA -0.463446 0.120159 -3.856953 0.0001
AGE/SIGMA 0.481948 0.096808 4.978378 0.0000
AGE^2/SIGMA -0.005627 0.000939 -5.989706 0.0000
RESTAURN/SIGMA -3.461064 0.795505 -4.350776 0.0000

R-squared 0.002751 Mean dependent var 0.966192


Adjusted R-squared -0.004728 S.D. dependent var 1.574979
S.E. of regression 1.578698 Akaike info criterion 3.759715
Sum squared resid 1993.831 Schwarz criterion 3.800425
Log likelihood -1510.045 Hannan-Quinn criter. 3.775347
Durbin-Watson stat 2.049719
179

3.8.6 OLS versus WLS

Under certain conditions we have:

b and β̂(V̂) are consistent.

Assuming that the functional form of the conditional second moment is correctly specified, β̂(V̂) is asymptotically more efficient than b.

It is not clear which estimator is better (in terms of efficiency) in the following situations:

– the functional form of the conditional second moment is misspecified;

– in finite samples, even if the functional form is correctly specified, the large-sample approximation will probably work less well for the WLS estimator than for OLS because of the estimation of the extra parameters (α) involved in the WLS procedure.
180

3.9 Serial Correlation

Because the issue of serial correlation arises almost always in time-series models, we use the subscript "t" instead of "i" in this section. Throughout this section we assume that the regressors include a constant. The issue is how to deal with

E(ε_t ε_{t-j} | x_{t-j}, x_t) ≠ 0.
181

3.9.1 Usual Inference is not Valid

When the regressors include a constant (true in virtually all known applications), Assumption 2.5 implies that the error term is a scalar martingale difference sequence, so if the error is found to be serially correlated (or autocorrelated), that is an indication of a failure of Assumption 2.5.

We have Cov(g_t, g_{t-j}) ≠ 0. In fact,

Cov(g_t, g_{t-j}) = E(x_t ε_t x_{t-j}' ε_{t-j}) = E[ E( x_t ε_t x_{t-j}' ε_{t-j} | x_{t-j}, x_t ) ] = E[ x_t x_{t-j}' E( ε_t ε_{t-j} | x_{t-j}, x_t ) ] ≠ 0.

Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consistent even if the error is autocorrelated. However, the large-sample properties of b, t, and F of Proposition 2.5 are not valid. To see why, consider

√n (b - β) = S_xx^{-1} √n ḡ.
182

We have

Avar(b) = Σ_xx^{-1} S Σ_xx^{-1},  Âvar(b) = S_xx^{-1} Ŝ S_xx^{-1}.

If the errors are not autocorrelated:

S = Var(√n ḡ) = Var(g_t).

If the errors are autocorrelated:

S = Var(√n ḡ) = Var(g_t) + (1/n) Σ_{j=1}^{n-1} Σ_{t=j+1}^{n} ( E(g_t g_{t-j}') + E(g_{t-j} g_t') ).

Since Cov(g_t, g_{t-j}) ≠ 0 and E(g_{t-j} g_t') ≠ 0 we have

S ≠ Var(g_t), i.e. S ≠ E(g_t g_t').

If the errors are serially correlated we cannot use s² (1/n) Σ_{t=1}^{n} x_t x_t' or (1/n) Σ_{t=1}^{n} e_t² x_t x_t' (robust to conditional heteroskedasticity) as consistent estimators of S.
183

3.9.2 Testing Serial Correlation

Consider the regression y_t = x_t'β + ε_t. We want to test whether or not ε_t is serially correlated.

Consider

ρ_j = Cov(ε_t, ε_{t-j}) / √( Var(ε_t) Var(ε_{t-j}) ) = Cov(ε_t, ε_{t-j}) / Var(ε_t) = γ_j/γ_0 = E(ε_t ε_{t-j}) / E(ε_t²).

Since ρ_j is not observable, we need to consider

ρ̃_j = γ̃_j / γ̃_0,  γ̃_j = (1/n) Σ_{t=j+1}^{n} ε_t ε_{t-j},  γ̃_0 = (1/n) Σ_{t=1}^{n} ε_t².
184

Proposition. If {ε_t} is a stationary MDS with E(ε_t² | ε_{t-1}, ε_{t-2}, ...) = σ², then

√n γ̃_j →^d N(0, σ⁴)  and  √n ρ̃_j →^d N(0, 1).

Proposition. Under the assumptions of the previous proposition,

Box-Pierce Q statistic = Q_BP = Σ_{j=1}^{p} (√n ρ̃_j)² = n Σ_{j=1}^{p} ρ̃_j² →^d χ²(p).

However, ρ̃_j is still unfeasible as we do not observe the errors. Thus,

ρ̂_j = γ̂_j / γ̂_0,  γ̂_j = (1/n) Σ_{t=j+1}^{n} e_t e_{t-j},  γ̂_0 = (1/n) Σ_{t=1}^{n} e_t²  (note Σ_t e_t² = SSR).

Exercise 3.6. Prove that ρ̂_j can be obtained from the regression of e_t on e_{t-j} (without intercept).
185

Testing with Strictly Exogenous Regressors

To test H0: ρ_j = 0 we consider the following proposition:

Proposition (testing for serial correlation with strictly exogenous regressors). Suppose that Assumptions 1.2, 2.1, 2.2, 2.4 are satisfied. Then

ρ̂_j →^p 0  and  √n ρ̂_j →^d N(0, 1).
186

To test H0: ρ_1 = ρ_2 = ... = ρ_p = 0 we consider the following proposition:

Proposition (Box-Pierce Q & Ljung-Box Q). Suppose that Assumptions 1.2, 2.1, 2.2, 2.4 are satisfied. Then

Q_BP = n Σ_{j=1}^{p} ρ̂_j² →^d χ²(p),

Q_LB = n(n + 2) Σ_{j=1}^{p} ρ̂_j²/(n - j) →^d χ²(p).

It can be shown that the hypothesis H0: ρ_1 = ρ_2 = ... = ρ_p = 0 can also be tested through the following auxiliary regression:

regression of e_t on e_{t-1}, ..., e_{t-p}.

We calculate the F statistic for the hypothesis that the p coefficients of e_{t-1}, ..., e_{t-p} are all zero. A sketch of the Q statistics is given below.
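A minimal sketch of the residual autocorrelations and the two Q statistics above (added; e is the vector of OLS residuals):

```python
import numpy as np
from scipy.stats import chi2

def q_tests(e, p):
    n = e.size
    gamma0 = e @ e / n
    rho = np.array([(e[j:] @ e[:-j]) / n / gamma0 for j in range(1, p + 1)])
    q_bp = n * np.sum(rho ** 2)                                         # Box-Pierce
    q_lb = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, p + 1)))   # Ljung-Box
    return q_bp, q_lb, chi2.sf(q_lb, p)     # statistics and the Ljung-Box p-value
```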
187

Testing with Predetermined, but Not Strictly Exogenous, Regressors


If the regressors are not strictly exogenous, √n ρ̂_j no longer has a N(0, 1) distribution and the residual-based Q statistic may not be asymptotically chi-squared.

The trick consists in removing the effect of x_t in the regression of e_t on e_{t-1}, ..., e_{t-p} by considering now the

regression of e_t on x_t, e_{t-1}, ..., e_{t-p}

and then calculating the F statistic for the hypothesis that the p coefficients of e_{t-1}, ..., e_{t-p} are all zero. This regression is still valid when the regressors are strictly exogenous (so you may always use this regression).

Given

e_t = γ_1 + γ_2 x_t2 + ... + γ_K x_tK + φ_1 e_{t-1} + ... + φ_p e_{t-p} + error_t,

the null hypothesis can be formulated as

H0: φ_1 = ... = φ_p = 0.

Use the F test (a sketch is given below):

[EViews screenshot: the test is implemented as the Breusch-Godfrey Serial Correlation LM Test.]
189

Example. Consider, chnimp: the volume of imports of barium chloride from China, chempi:
index of chemical production (to control for overall demand for barium chloride), gas: the
volume of gasoline production (another demand variable), rtwex: an exchange rate index
(measures the strength of the dollar against several other currencies).

Equation 1
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131

Variable Coefficient Std. Error t-Statistic Prob.

C -19.75991 21.08580 -0.937119 0.3505


LOG(CHEMPI) 3.044302 0.478954 6.356142 0.0000
LOG(GAS) 0.349769 0.906247 0.385953 0.7002
LOG(RTWEX) 0.717552 0.349450 2.053378 0.0421

R-squared 0.280905 Mean dependent var 6.174599


Adjusted R-squared 0.263919 S.D. dependent var 0.699738
S.E. of regression 0.600341 Akaike info criterion 1.847421
Sum squared resid 45.77200 Schwarz criterion 1.935213
Log likelihood -117.0061 Hannan-Quinn criter. 1.883095
F-statistic 16.53698 Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
190

Equation 2
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 2.337861 Prob. F(12,115) 0.0102


Obs*R-squared 25.69036 Prob. Chi-Square(12) 0.0119

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Presample missing value lagged residuals set to zero.

Variable Coefficient Std. Error t-Statistic Prob.

C -3.074901 20.73522 -0.148294 0.8824


LOG(CHEMPI) 0.084948 0.457958 0.185493 0.8532
LOG(GAS) 0.110527 0.892301 0.123867 0.9016
LOG(RTWEX) 0.030365 0.333890 0.090942 0.9277
RESID(-1) 0.234579 0.093215 2.516546 0.0132
RESID(-2) 0.182743 0.095624 1.911051 0.0585
RESID(-3) 0.164748 0.097176 1.695366 0.0927
RESID(-4) -0.180123 0.098565 -1.827464 0.0702
RESID(-5) -0.041327 0.099482 -0.415425 0.6786
RESID(-6) 0.038597 0.098345 0.392468 0.6954
RESID(-7) 0.139782 0.098420 1.420268 0.1582
RESID(-8) 0.063771 0.099213 0.642771 0.5217
RESID(-9) -0.154525 0.098209 -1.573441 0.1184
RESID(-10) 0.027184 0.098283 0.276585 0.7826
RESID(-11) -0.049692 0.097140 -0.511550 0.6099
RESID(-12) -0.058076 0.095469 -0.608329 0.5442

R-squared 0.196110 Mean dependent var -3.97E-15


Adjusted R-squared 0.091254 S.D. dependent var 0.593374
S.E. of regression 0.565652 Akaike info criterion 1.812335
Sum squared resid 36.79567 Schwarz criterion 2.163504
Log likelihood -102.7079 Hannan-Quinn criter. 1.955030
F-statistic 1.870289 Durbin-Watson stat 2.015299
Prob(F-statistic) 0.033268
191

If you conclude that the errors are serially correlated you have a few options:

(a) You know (at least approximately) the form of autocorrelation, and so you use a feasible
GLS estimator.

(b) The second approach parallels the use of the White estimator for heteroskedasticity:
you don't know the form of autocorrelation, so you rely on OLS but use a consistent
estimator for Avar (b).

(c) You are concerned only with the dynamic specification of the model and with forecasting.
You may try to convert your model into a dynamically complete model.

(d) Your model may be misspecified: you respecify the model and the autocorrelation
disappears.
192

3.9.3 Question (a): feasible GLS estimator

There are many forms of autocorrelation and each one leads to a different structure for the error covariance matrix V. The most popular form is known as the first-order autoregressive process. In this case the error term in
$$y_t = x_t'\beta + \varepsilon_t$$
is assumed to follow the AR(1) model
$$\varepsilon_t = \rho\,\varepsilon_{t-1} + v_t, \qquad |\rho| < 1,$$
where $v_t$ is an error term with mean zero and constant conditional variance that exhibits no serial correlation. We assume that Assumptions 2.1-2.5 would be satisfied if $\rho$ were equal to zero.
193

Initial model:
$$y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t = \rho\,\varepsilon_{t-1} + v_t, \qquad |\rho| < 1.$$

The GLS estimator is the OLS estimator applied to the transformed model
$$\tilde y_t = \tilde x_t'\beta + v_t$$
where
$$\tilde y_t = \begin{cases} \sqrt{1-\rho^2}\, y_1, & t = 1 \\ y_t - \rho\, y_{t-1}, & t > 1 \end{cases} \qquad \tilde x_t' = \begin{cases} \sqrt{1-\rho^2}\, x_1', & t = 1 \\ (x_t - \rho\, x_{t-1})', & t > 1 \end{cases}$$
Without the first observation, the transformed model is
$$y_t - \rho\, y_{t-1} = (x_t - \rho\, x_{t-1})'\beta + v_t.$$

If $\rho$ is unknown we may replace it by a consistent estimator, or we may use the nonlinear least squares estimator (EViews).
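As an illustration only (the slides' EViews AR(1) specification uses nonlinear least squares; this is a hedged numpy sketch of the iterated feasible GLS idea that drops the first observation, often called Cochrane-Orcutt):

```python
import numpy as np

def cochrane_orcutt(y, X, tol=1e-8, max_iter=100):
    """Iterated feasible GLS for AR(1) errors, first observation dropped.
    X is the n x K regressor matrix (including a constant)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    rho = 0.0
    for _ in range(max_iter):
        e = y - X @ beta
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])   # AR(1) coefficient of the residuals
        # quasi-differenced data: y_t - rho*y_{t-1}, x_t - rho*x_{t-1}, for t = 2,...,n
        y_star = y[1:] - rho_new * y[:-1]
        X_star = X[1:] - rho_new * X[:-1]
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho
```

The Prais-Winsten variant would keep the first observation by scaling it with $\sqrt{1-\hat\rho^2}$, as in the transformation above.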
194

Example (continuation of the previous example). Let’s consider the residuals of Equation 1:

Equation 3
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations

Variable Coefficient Std. Error t-Statistic Prob.

C -39.30703 23.61105 -1.664772 0.0985


LOG(CHEMPI) 2.875036 0.658664 4.364949 0.0000
LOG(GAS) 1.213475 1.005164 1.207241 0.2296
LOG(RTWEX) 0.850385 0.468696 1.814362 0.0720
AR(1) 0.309190 0.086011 3.594777 0.0005

R-squared 0.338533 Mean dependent var 6.180590


Adjusted R-squared 0.317366 S.D. dependent var 0.699063
S.E. of regression 0.577578 Akaike info criterion 1.777754
Sum squared resid 41.69947 Schwarz criterion 1.888044
Log likelihood -110.5540 Hannan-Quinn criter. 1.822569
F-statistic 15.99350 Durbin-Watson stat 2.079096
Prob(F-statistic) 0.000000

Inverted AR Roots .31


195

3.9.4 Question (b): Heteroskedasticity and Autocorrelation-Consistent (HAC) Covariance Matrix Estimator

For the sake of generality, assume that you also have a problem of heteroskedasticity.

Given
$$S = \mathrm{Var}\!\left(\sqrt{n}\,\bar g\right) = \mathrm{Var}(g_t) + \frac{1}{n}\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E\!\left(g_t g_{t-j}'\right) + E\!\left(g_{t-j} g_t'\right)\right]$$
$$= E\!\left(\varepsilon_t^2 x_t x_t'\right) + \frac{1}{n}\sum_{j=1}^{n-1}\sum_{t=j+1}^{n}\left[E\!\left(\varepsilon_t \varepsilon_{t-j} x_t x_{t-j}'\right) + E\!\left(\varepsilon_{t-j}\varepsilon_t x_{t-j} x_t'\right)\right],$$
a possible estimator of S based on the analogy principle would be
$$\frac{1}{n}\sum_{t=1}^{n} e_t^2 x_t x_t' + \frac{1}{n}\sum_{j=1}^{n_0}\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right), \qquad n_0 < n.$$
A major problem with this estimator is that it is not positive semi-definite and hence cannot be a well-defined variance-covariance matrix.
196

Newey and West show that with a suitable weighting function $\omega(j)$, the estimator below is consistent and positive semi-definite:
$$\hat S_{HAC} = \frac{1}{n}\sum_{t=1}^{n} e_t^2 x_t x_t' + \frac{1}{n}\sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right)$$
where the weighting function $\omega(j)$ is
$$\omega(j) = 1 - \frac{j}{L+1}.$$
The maximum lag L must be determined in advance. Autocorrelations at lags longer than L are ignored. For a moving-average process, this value is in general a small number.

This estimator is known as the (HAC) covariance matrix estimator and is valid when both conditional heteroskedasticity and serial correlation are present but of an unknown form.
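A minimal numpy sketch of $\hat S_{HAC}$ (illustrative only; X is assumed to be the n x K regressor matrix and e the OLS residual vector from a previous fit):

```python
import numpy as np

def newey_west_S(X, e, L):
    """Newey-West estimate of S using Bartlett weights w(j) = 1 - j/(L+1)."""
    n, K = X.shape
    g = X * e[:, None]                  # rows are g_t' = e_t * x_t'
    S = g.T @ g / n                     # j = 0 term: (1/n) sum_t e_t^2 x_t x_t'
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)
        Gamma_j = g[j:].T @ g[:-j] / n  # (1/n) sum_{t=j+1}^n e_t e_{t-j} x_t x_{t-j}'
        S += w * (Gamma_j + Gamma_j.T)
    return S
```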
197

Example. For $x_t = 1$, $n = 9$, $L = 3$ we have
$$\sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n}\left(e_t e_{t-j} x_t x_{t-j}' + e_{t-j} e_t x_{t-j} x_t'\right) = \sum_{j=1}^{L}\omega(j)\sum_{t=j+1}^{n} 2\, e_t e_{t-j}$$
$$= \omega(1)\left(2e_1e_2 + 2e_2e_3 + 2e_3e_4 + 2e_4e_5 + 2e_5e_6 + 2e_6e_7 + 2e_7e_8 + 2e_8e_9\right)$$
$$+\; \omega(2)\left(2e_1e_3 + 2e_2e_4 + 2e_3e_5 + 2e_4e_6 + 2e_5e_7 + 2e_6e_8 + 2e_7e_9\right)$$
$$+\; \omega(3)\left(2e_1e_4 + 2e_2e_5 + 2e_3e_6 + 2e_4e_7 + 2e_5e_8 + 2e_6e_9\right),$$
with
$$\omega(1) = 1 - \tfrac{1}{4} = 0.75, \qquad \omega(2) = 1 - \tfrac{2}{4} = 0.50, \qquad \omega(3) = 1 - \tfrac{3}{4} = 0.25.$$
198

Newey-West covariance matrix estimator:
$$\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\,\hat S_{HAC}\,S_{xx}^{-1}.$$

EVIEWS:

[Figure: the default truncation lag L plotted as a function of the sample size n (n from 0 to 5000, L roughly between 0 and 10).]

EViews selects $L = \mathrm{floor}\!\left(4\,(n/100)^{2/9}\right)$.
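Continuing the sketch above (newey_west_S is the hypothetical helper defined earlier, not an EViews routine), the default lag rule and the sandwich estimate can be coded as follows. For n = 131 the rule gives L = 4, which matches the "lag truncation=4" reported in Equation 4 below. Since Avar(b) here is the asymptotic variance of sqrt(n)(b - beta), the standard errors are sqrt(diag(Avar)/n).

```python
import numpy as np

def eviews_default_lag(n):
    """Default truncation lag L = floor(4 * (n/100)**(2/9))."""
    return int(np.floor(4.0 * (n / 100.0) ** (2.0 / 9.0)))

def newey_west_se(X, e, L=None):
    """Newey-West standard errors from Avar(b) = Sxx^{-1} S_hac Sxx^{-1}."""
    n = X.shape[0]
    if L is None:
        L = eviews_default_lag(n)
    Sxx_inv = np.linalg.inv(X.T @ X / n)
    avar = Sxx_inv @ newey_west_S(X, e, L) @ Sxx_inv   # asymptotic variance of sqrt(n)(b - beta)
    return np.sqrt(np.diag(avar) / n)
```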
199

Example (continuation ...). Newey-West covariance matrix estimator
$$\widehat{\mathrm{Avar}}(b) = S_{xx}^{-1}\,\hat S_{HAC}\,S_{xx}^{-1}$$

Equation 4
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample: 1978M02 1988M12
Included observations: 131
Newey-West HAC Standard Errors & Covariance (lag truncation=4)

Variable Coefficient Std. Error t-Statistic Prob.

C -19.75991 26.25891 -0.752503 0.4531


LOG(CHEMPI) 3.044302 0.667155 4.563111 0.0000
LOG(GAS) 0.349769 1.189866 0.293956 0.7693
LOG(RTWEX) 0.717552 0.361957 1.982426 0.0496

R-squared 0.280905 Mean dependent var 6.174599


Adjusted R-squared 0.263919 S.D. dependent var 0.699738
S.E. of regression 0.600341 Akaike info criterion 1.847421
Sum squared resid 45.77200 Schwarz criterion 1.935213
Log likelihood -117.0061 Hannan-Quinn criter. 1.883095
F-statistic 16.53698 Durbin-Watson stat 1.421242
Prob(F-statistic) 0.000000
200

3.9.5 Question (c): Dynamically Complete Models

Consider
$$y_t = \tilde x_t'\delta + u_t$$
such that $E(u_t \mid \tilde x_t) = 0$. This condition, although necessary for consistency, does not preclude autocorrelation. You may try to increase the number of regressors to $x_t$ and get a new regression model
$$y_t = x_t'\beta + \varepsilon_t \quad \text{such that} \quad E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \dots) = 0.$$
Written in terms of $y_t$,
$$E(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \dots) = E(y_t \mid x_t).$$

Definition. The model $y_t = x_t'\beta + \varepsilon_t$ is dynamically complete (DC) if
$$E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \dots) = 0 \quad \text{or} \quad E(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \dots) = E(y_t \mid x_t)$$
holds (see Wooldridge).
201

Proposition. If a model is DC then the errors are not serially correlated. Moreover, $\{g_i\}$ is a MDS.

Notice that $E(\varepsilon_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \dots) = 0$ can be rewritten as $E(\varepsilon_i \mid \mathcal{F}_i) = 0$, where
$$\mathcal{F}_i = \mathcal{I}_{i-1} \cup \{x_i\} = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \dots, \varepsilon_1, x_i, x_{i-1}, \dots, x_1\}, \qquad \mathcal{I}_{i-1} = \{\varepsilon_{i-1}, \varepsilon_{i-2}, \dots, \varepsilon_1, x_{i-1}, \dots, x_1\}.$$
Example. Consider
$$y_t = \beta_1 + \beta_2 x_{t2} + u_t, \qquad u_t = \rho\,u_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is a white noise process and $E(\varepsilon_t \mid x_{t2}, y_{t-1}, x_{t-1,2}, y_{t-2}, \dots) = 0$. Set $\tilde x_t' = \begin{pmatrix} 1 & x_{t2} \end{pmatrix}$. The above model is not DC since the errors are autocorrelated. Notice that
$$E(y_t \mid x_{t2}, y_{t-1}, x_{t-1,2}, y_{t-2}, \dots) = \beta_1 + \beta_2 x_{t2} + \rho\,u_{t-1}$$
does not coincide with
$$E(y_t \mid \tilde x_t) = E(y_t \mid x_{t2}) = \beta_1 + \beta_2 x_{t2}.$$
202

However, it is easy to obtain a DC model. Since
$$u_t = y_t - (\beta_1 + \beta_2 x_{t2}) \;\Rightarrow\; u_{t-1} = y_{t-1} - (\beta_1 + \beta_2 x_{t-1,2}),$$
we have
$$y_t = \beta_1 + \beta_2 x_{t2} + u_t = \beta_1 + \beta_2 x_{t2} + \rho\,u_{t-1} + \varepsilon_t = \beta_1 + \beta_2 x_{t2} + \rho\left(y_{t-1} - \beta_1 - \beta_2 x_{t-1,2}\right) + \varepsilon_t.$$
This equation can be written in the form
$$y_t = \gamma_1 + \gamma_2 x_{t2} + \gamma_3 y_{t-1} + \gamma_4 x_{t-1,2} + \varepsilon_t.$$
Let $x_t = (x_{t2}, y_{t-1}, x_{t-1,2})$. The previous model is DC as
$$E(y_t \mid x_t, y_{t-1}, x_{t-1}, \dots) = E(y_t \mid x_t) = \gamma_1 + \gamma_2 x_{t2} + \gamma_3 y_{t-1} + \gamma_4 x_{t-1,2}.$$
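A small simulation can illustrate the point (an illustrative sketch; the parameter values and variable names are ours, not from the slides): regressing $y_t$ only on $x_{t2}$ leaves AR(1) residuals, while the dynamically complete regression that adds $y_{t-1}$ and $x_{t-1,2}$ leaves approximately uncorrelated residuals.

```python
import numpy as np
rng = np.random.default_rng(0)

# y_t = beta1 + beta2*x_t + u_t with AR(1) errors u_t = rho*u_{t-1} + eps_t
n, beta1, beta2, rho = 500, 1.0, 0.5, 0.6
x = rng.normal(size=n)
eps = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + eps[t]
y = beta1 + beta2 * x + u

def acf1(e):
    """First-order autocorrelation of a residual series."""
    return np.sum(e[1:] * e[:-1]) / np.sum(e**2)

# Static (not DC) regression: y_t on (1, x_t)
X_static = np.column_stack([np.ones(n - 1), x[1:]])
e_static = y[1:] - X_static @ np.linalg.lstsq(X_static, y[1:], rcond=None)[0]

# Dynamically complete regression: y_t on (1, x_t, y_{t-1}, x_{t-1})
X_dc = np.column_stack([np.ones(n - 1), x[1:], y[:-1], x[:-1]])
e_dc = y[1:] - X_dc @ np.linalg.lstsq(X_dc, y[1:], rcond=None)[0]

print(acf1(e_static))  # roughly rho = 0.6
print(acf1(e_dc))      # roughly 0
```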


203

Example (continuation ...). Dynamically Complete Model

Equation 5
Dependent Variable: LOG(CHNIMP)
Method: Least Squares
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments

Variable Coefficient Std. Error t-Statistic Prob.

C -11.30596 23.24886 -0.486302 0.6276
LOG(CHEMPI) -7.193799 3.539951 -2.032175 0.0443
LOG(GAS) 1.319540 1.003825 1.314513 0.1911
LOG(RTWEX) -0.501520 2.108623 -0.237842 0.8124
LOG(CHEMPI(-1)) 9.618587 3.602977 2.669622 0.0086
LOG(GAS(-1)) -1.223681 1.002237 -1.220950 0.2245
LOG(RTWEX(-1)) 0.935678 2.088961 0.447915 0.6550
LOG(CHNIMP(-1)) 0.270704 0.084103 3.218710 0.0016

R-squared 0.394405 Mean dependent var 6.180590
Adjusted R-squared 0.359658 S.D. dependent var 0.699063
S.E. of regression 0.559400 Akaike info criterion 1.735660
Sum squared resid 38.17726 Schwarz criterion 1.912123
Log likelihood -104.8179 Hannan-Quinn criter. 1.807363
F-statistic 11.35069 Durbin-Watson stat 2.059684
Prob(F-statistic) 0.000000

Equation 6
Breusch-Godfrey Serial Correlation LM Test:

F-statistic 0.810670 Prob. F(12,110) 0.6389
Obs*R-squared 10.56265 Prob. Chi-Square(12) 0.5667

Test Equation:
Dependent Variable: RESID
Method: Least Squares
Date: 05/12/10 Time: 19:13
Sample: 1978M03 1988M12
Included observations: 130
Presample missing value lagged residuals set to zero.

Variable Coefficient Std. Error t-Statistic Prob.

C 1.025127 26.26657 0.039028 0.9689
LOG(CHEMPI) 1.373671 3.968650 0.346130 0.7299
LOG(GAS) -0.279136 1.055889 -0.264361 0.7920
LOG(RTWEX) -0.074592 2.234853 -0.033377 0.9734
LOG(CHEMPI(-1)) -1.878917 4.322963 -0.434636 0.6647
LOG(GAS(-1)) 0.315918 1.076831 0.293378 0.7698
LOG(RTWEX(-1)) -0.007029 2.224878 -0.003159 0.9975
LOG(CHNIMP(-1)) 0.151065 0.293284 0.515082 0.6075
RESID(-1) -0.189924 0.307062 -0.618520 0.5375
RESID(-2) 0.088557 0.124602 0.710715 0.4788
RESID(-3) 0.154141 0.098337 1.567475 0.1199
RESID(-4) -0.125009 0.098681 -1.266795 0.2079
RESID(-5) -0.035680 0.099831 -0.357407 0.7215
RESID(-6) 0.048053 0.098008 0.490291 0.6249
RESID(-7) 0.129226 0.097417 1.326523 0.1874
RESID(-8) 0.052884 0.099891 0.529420 0.5976
RESID(-9) -0.122323 0.102670 -1.191423 0.2361
RESID(-10) 0.022149 0.099419 0.222788 0.8241
RESID(-11) 0.034364 0.099973 0.343738 0.7317
RESID(-12) -0.038034 0.102071 -0.372628 0.7101

R-squared 0.081251 Mean dependent var -9.76E-15
Adjusted R-squared -0.077442 S.D. dependent var 0.544011
S.E. of regression 0.564683 Akaike info criterion 1.835533
Sum squared resid 35.07532 Schwarz criterion 2.276692
Log likelihood -99.30962 Hannan-Quinn criter. 2.014790
F-statistic 0.512002 Durbin-Watson stat 2.011429
Prob(F-statistic) 0.952295
204

3.9.6 Question (d): Misspecification

In many cases the finding of autocorrelation is an indication that the model is misspecified. If this is the case, the most natural route is not to change your estimator (from OLS to GLS) but to change your model. Several types of misspecification may lead to a finding of autocorrelation in your OLS residuals:

dynamic misspecification (related to question (c));

omitted variables (that are autocorrelated);

$y_t$ and/or $x_{tk}$ are integrated processes, e.g. $y_t \sim I(1)$;

functional form misspecification.


205

Functional form misspecification. Suppose that the true relationship is
$$y_t = \beta_1 + \beta_2 \log t + \varepsilon_t.$$
In the following figure we estimate a misspecified functional form, $y_t = \beta_1 + \beta_2 t + \varepsilon_t$: the residuals are clearly autocorrelated.
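A short simulation sketch of this situation (illustrative values only, not from the slides): fitting a linear trend to data generated by $y_t = \beta_1 + \beta_2 \log t + \varepsilon_t$ leaves smooth, strongly autocorrelated residuals.

```python
import numpy as np
rng = np.random.default_rng(1)

# True model: y_t = b1 + b2*log(t) + eps_t
n, b1, b2 = 200, 1.0, 2.0
t = np.arange(1, n + 1)
y = b1 + b2 * np.log(t) + rng.normal(scale=0.2, size=n)

# Misspecified fit: y_t on (1, t), i.e. a linear time trend
X = np.column_stack([np.ones(n), t])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# First-order autocorrelation of the residuals: close to 1
print(np.sum(e[1:] * e[:-1]) / np.sum(e**2))
```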
