1 205

1
Econometrics - Slides
2010/2011
João Nicolau
2
1 Introduction
1.1 What is Econometrics?
Econometrics is based upon the development of statistical methods for estimating economic
relationships, testing economic theories, and evaluating and implementing government and
business policy. Application of econometrics:
forecast (e.g. interest rates, in‡ation rates, and gross domestic product).
study economic relations;
testing economic theories;
evaluating and implementing government and business policy. For example, what are the
e¤ects of political campaign expenditures on voting outcomes? What is the e¤ect of school
spending on student performance in the …eld of education?
3
1.2 Steps in Empirical Economic Analysis

Formulate the question of interest. The question might deal with testing a certain
aspect of an economic theory, or it might pertain to testing the e¤ects of a government
policy.
Build the economic model. An economic model consists of mathematical equations that
describe various relationships. Formal economic modeling is sometimes the starting point
for empirical analysis, but it is more common to use economic theory less formally, or
even to rely entirely on intuition.
Specify the econometric model.
Collect the data.
Estimate and test the econometric model.
Answer the question in step 1.

4
1.3 The Structure of Economic Data
1.3.1 Cross-Sectional Data
A cross-sectional data: sample of individuals, households, …rms, cities, states, countries,

etc. taken at a given point in time. An important feature of cross-sectional data: they are
obtained by random sampling from the underlying population. For example, suppose that
yi is the i-th observation of the dependent variable and xi is the i-th observation of the
explanatory variable. Random sampling means that
f(yi; xi)g is an i.i.d. sequence.
This implies that for i 6= j
Cov yi; yj = 0; Cov xi; xj = 0; Cov yi; xj = 0:
Obviously, if xi “explains” yi we will have Cov (yi; xi) 6= 0:
Cross-sectional data is closely aligned with the applied microeconomics …elds, such as labor
economics, state and local public …nance, industrial organization, urban economics, demog-
raphy, and health economics.
5
An example of Cross-Sectional Data:

6
Scatterplots may be adequate for analyzing cross-section data:
Models based on Cross-Sectional Data usually satisfy the assumptions cover by the chapter
“Finite-Sample Properties of OLS”.
7
1.3.2 Time-Series Data
A time series data set consists of observations on a variable or several variables over time.
E.g.: stock prices, money supply, consumer price index, gross domestic product, annual
homicide rates, and automobile sales …gures, etc.
Time series data cannot be assumed to be independent across time. For example, knowing
something about the gross domestic product from last quarter tells us quite a bit about the
likely range of the GDP during this quarter ...
The analysis of time series data is more di¢ cult than that of cross-sectional data. Reasons:
we need to account for the dependent nature of economic time series;
time-series data exhibits unique features such as trends over time and seasonality;
models based on time-series data rarely satisfy the assumptions cover be the chapter
“Finite-Sample Properties of OLS”. The most adequate assumptions are cover by chapter
“Large-Sample Theory”, which is theoretically more advanced.
8
An example of a time series (scatterplots cannot in general be used here, but there are
exceptions):
9
1.3.3 Pooled Cross Sections and Panel or Longitudinal Data
Data sets have both cross-sectional and time series features.
1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis
Ceteris Paribus: “other (relevant) factors being equal”. Plays an important role in causal
analysis.
Example. Suppose that wages depend on education and labor force experience. Your goal
is to measure the “return to education”. If your analysis involves only wages and education
you may not uncover the ceteris paribus e¤ect of education on wages. Consider the following
data:
monthly wages (Euros) years of experience years of education

1500 6 9
1500 0 15
1600 1 15
2000 8 12
2500 10 12
10
Example. In a totalitarianism regime how can you measure the ceteris paribus e¤ect of
another year of education on wages? You may create 100 clones of a “normal” individual.
Give to each person an amount of education and then measure their wages.
Ceteris Paribus is relatively easy to analyze in Experimental Data.

Example (Experimental Data). Considered the e¤ects of new fertilizers on crop yields. Sup-
pose the crop under consideration is soybeans. Since fertilizer amount is only one factor
a¤ecting yields— some others include rainfall, quality of land, and presence of parasites—
this issue must be posed as a ceteris paribus question. One way to determine the causal e¤ect
of fertilizer amount on soybean yield is to conduct an experiment, which might include the
following steps. Choose several one-acre plots of land. Apply di¤erent amounts of fertilizer
to each plot and subsequently measure the yields.
In economics you have nonexperimental data, so in principle, it is di¢ cult to estimate the
ceteris paribus e¤ects. However, we will see that econometric methods can simulate a ceteris
paribus experiment. We will be able to do in nonexperimental environments what natural
scientists are able to do in a controlled laboratory setting: keep other factors …xed.
11
2 Finite-Sample Properties of OLS
This chapter covers the …nite- or small-sample properties of the OLS estimator, that is, the
statistical properties of the OLS estimator that are valid for any given sample size.
2.1 The Classical Linear Regression Model
The dependent variable is related to several other variables (called the regressors or the
explanatory variables).
Let yi be the i-th observation of the dependent variable.
Let (xi1; xi2; :::; xiK ) be the i-th observation of the K regressors. The sample or data is a
collection of those n observations.
The data in economics cannot be generated by experiments (except in experimental eco-

nomics), so both the dependent and independent variables have to be treated as random
variables, variables whose values are subject to chance.
12
2.1.1 The Linearity Assumption
Assumption (1.1 - Linearity). We have
yi = 1xi1 + 2xi2 + ::: + K xiK + "i; i = 1; 2; :::; n

where 0s are unknown parameters to be estimated, and "i is the unobserved error term.
0s : regression coe¢ cients. They represent the marginal and separate e¤ects of the regres-
sors.
Example (1.1). (Consumption function): Consider
coni = 1 + 2ydi + "i:

coni : consumption; ydi is disposable income. Note: xi1 = 1; xi2 = ydi: The error "i
represents other variables besides disposable income that in‡uence consumption. They in-
clude: those variables— such as …nancial assets— that might be observable but the researcher
decided not to include as regressors, as well as those variables— such as the “mood” of the
consumer— that are hard to measure. The equation is called the simple regression model.
13
The linearity assumption is not as restrictive as it might …rst seem.

Example (1.2). (Wage equation). Consider
wagei = e 1 e 2educi e 3tenurei e 4expri e"i

where WAGE = the wage rate for the individual, educ = education in years, tenure = years
on the current job, and expr = experience in the labor. This equation can be written as
log (wagei) = 1 + 2educi + 3tenurei + 4expri + "i

The equation is said to be in the semi-log form (or log-level form).
Example. Does this model
yi = 1 + 2xi2 + 3 log xi2 + 4x2i3 + "i

violate Assumption 1.1?
There are, of course, cases of genuine nonlinearity. For example
yi = 1 + e 2xi2 + "i
14
Partial E¤ects
To simplify let’s consider, K = 2; and assume that E ( "ij xi1; xi2) = 0.
What is the impact on the conditional expected value y; E ( yij xi1; xi2) when xi2 is increased
by a small amount
x0i = (xi1; xi2) ! xi 0 = (xi1; xi2 + xi2) (holding the other variable …xed)?
Let
E ( yij xi) E ( yij xi1 = xi1; xi2 = xi2 + xi2) E ( yij xi1; xi2) :
Equation Interpretation of 2
(level-level) yi = 1 + 2xi2 + "i E ( yij xi) = 2 xi2
2 xi2
(level-log) yi = 1 + 2 log (xi2) + "i E ( yij xi) ' 100 x i2
100
E(yi jxi )
(log-level) log (yi) = 1 + 2xi2 + "i 100 ' (100 2) xi2
E(yi jxi )
(100 2: semi-elast.)
E(yi jxi ) xi2
(log-log) log (yi) = 1 + 2 log (xi2) + "i 100 ' 2 xi2 100
E(yi jxi )
( 2: elasticity)
15
Exercise 2.1. Suppose, for example, the marginal e¤ect of experience on wages declines with
the level of experience. How can this be captured?
Exercise 2.2. Provide an interpretation of 2 in the following equations:
(a) coni = 1 + 2inci + "i; where inc: income, con: consumption (both measured in
dollars). Assume that 2 = 0:8;
(b) log (wagei) = 1 + 2educi + 3tenurei + 4expri + "i: Assume that 2 = 0:05:
(c) log (pricei) = 1 + 2 log (disti) + "i where prices = housing price and dist =
distance from a recently built garbage incinerator. Assume that 2 = 0:6:
16
2.1.2 Matrix Notation
We have
2 3
1
h i6 7
6 2 7
yi = 1 xi1 + 2xi2 + ::: + K xiK + "i = xi1 xi2 xiK 6 ... 7 + "i
4 5
K
= x0i + "i
where
2 3 2 3
xi1 1
6 xi2 7 6 7
xi = 6
6 ...
7
7;
6
=6 2 7
... 7
4 5 4 5
xiK K
yi = x0i + "i:
17
More compactly
2 3 2 3 2 3
y1 x11 x12 x1K "1
6 y2 7 6 x x22 x2K 7 6 "2 7
6 7 6 7
y =6 ... 7; X = 6 21 7; "i = 6
6
7
7
4 5 4 ... ... ... 5 4 ... 5
yn xn1 xn2 xnK "n
y = X + ":
Example. yi = 1 + 2educi + 3expi + "i (yi = wages in Euros). An example of

Cross-Sectional Data is
2 3 2 3
2000 1 12 5
6 7 6 7
6 2500 7 6 1 15 6 7
6 7 6 7
6 1500 7 6 1 12 3 7
y =6
6 ... 7;
7 X =6
6 ... ... ... 7:
7
6 7 6 7
6 7 6 7
4 5000 5 4 1 17 15 5
1000 1 12 1
Important: y and X (or yi and xik ) may be random variables or observed values. We use
the same notation for both cases.
18
2.1.3 The Strict Exogeneity Assumption
Assumption (1.2 - Strict exogeneity). E ( "ij X) = 0; 8i
This assumption can be written as
E ( "ij x1; :::; xn) = 0; 8i:

With random sampling "i is automatically independent of the explanatory variables for ob-
servations other than i. This implies that
E "ij xj = 0; 8i; j i 6= j
It remains to be analyzed whether or not
?
E ( "ij xi) = 0:
19
Strict Exogeneity assumption can fail in situations such as:
(Cross-Section or Time Series) Omitted variables;
(Cross-Section or Time Series) Measurement error in some of the regressors;
(Time Series, Static models) There is a feedback from yi on future values of xi.
(Time Series, Dynamic models) There is a lag dependent variable as a regressor.

Example (Omitted variables). Suppose that wage is determined by
wagei = 1 + 2xi2 + 3xi3 + vi;

where x2: years of education, x3: ability. Assume that E ( vij X) = 0: Since ability is not
observed, we instead estimate the model
wagei = 1 + 2xi2 + "i; "i = 3xi3 + vi:

Thus,
E ( "ij xi) = 3xi3 6= 0 ) E ( "ij X) 6= 0:
20
Example (Measurement error in some of the regressors). Consider y = household savings

and w = disposable income and
yi = 1 + 2wi + vi; E ( vij w) = 0:

Suppose that w cannot be measured absolutely accurately (for example, because of misre-
porting) and denote the measured value for wi by xi2: We have
xi2 = wi + ui:
Assume: E (ui) = 0; Cov (wi; ui) = 0; Cov (vi; ui) = 0. Now substituting xi2 = wi + ui
into yi = 1 + 2wi + vi we obtain
yi = 1 + 2xi2 + "i; "i = vi 2 ui :

Hence,
Cov ("i; xi2) = ::: = 2 Var (ui ) 6= 0:

Cov ("i; xi2) 6= 0 ) E ( "ij X) 6= 0:
21
Example (Feedback from y on future values of x). Consider a simple static time-series model
to explain a city’s murder rate (yt) in terms of police o¢ cers per capita (xt):
yt = 1 + 2xt + "t;
Suppose that the city adjusts the size of its police force based on past values of the murder
rate. This means that, say, xt+1 might be correlated with "t (since a higher "t leads to a
higher yt).
Example (There is a lag dependent variable as a regressor). See section 2.1.5.
Exercise 2.3. Let kids denote the number of children ever born to a woman, and let educ
denote years of education for the woman. A simple model relating fertility to years of
education is
kidsi = 1 + 2educi + "i:
where "i is the unobserved error. (i) What kinds of factors are contained in "i? Are these
likely to be correlated with level of education? (ii) Will a simple regression analysis uncover
the ceteris paribus e¤ect of education on fertility? Explain.
22
2.1.4 Implications of Strict Exogeneity
The Assumption E ( "ij X) = 0; 8i implies:
E ("i) = 0; 8i:
E "ij xj = 0; 8i; j:
E xjk "i = 0; 8i; j; k or E xj "i = 0; 8i; j The regressors are orthogonal to the
error term for all observations
Cov xjk ; "i = 0:
Note: if E "ij xj 6= 0 or E xjk "i 6= 0 or Cov xjk ; "i 6= 0 ) E ( "ij X) 6= 0:

23
2.1.5 Strict Exogeneity in Time-Series Models
For time-series models where strict exogeneity can be rephrased as: the regressors are or-
thogonal to the past, current, and future error terms. However, for most time-series models,
strict exogeneity is not satis…ed.
Example. Consider
yi = yi 1 + "i; E ( "ij yi 1) = 0 (thus E (yi 1"i) = 0).
Let xi = yi 1: By construction we have
2
E (xi+1"i) = E (yi"i) = ::: = E "i 6= 0:
The regressor is not orthogonal to the past error term, which is a violation of strict exogeneity.
However, the estimator may possess good large-sample properties without strict exogeneity.
2.1.6 Other Assumptions of the Model
Assumption (1.3 - no multicollinearity). The rank of the n K data matrix X is K with

probability 1.
24
None of the K columns of the data matrix X can be expressed as a linear combination of
the other columns of X.
Example (1.4 - continuation of Example 1.2). If no individuals in the sample ever changed
jobs, then tenurei = expri for all i, in violation of the no multicollinearity assumption.
There no way to distinguish the tenure e¤ect on the wage rate from the experience e¤ect.
Remedy: drop tenurei or expri from the wage equation.
Example (Dummy Variable Trap). Consider
wagei = 1 + 2educi + 3f emalei + 4malei + "i
where
(
1 if i corresponds to a female
f emalei = ; malei = 1 f emalei:
0 if i corresponds to a male
In vectorial notation we have
wage = 11 + 2 educ + 3 female + 4 male + ":
It is obvious that 1 = female + male: Therefore the above model violates Assumption
1.3. One may also justify using scalar notation: xi1 = f emalei + malei because this
relationship implies 1 = female + male: Can you overcome the dummy variable trap by
removing xi1 1 from the equation?
25
Exercise 2.4. In a study relating college grade point average to time spent in various activ-
ities, you distribute a survey to several students. The students are asked how many hours
they spend each week in four activities: studying, sleeping, working, and leisure. Any activity
is put into one of the four categories, so that for each student the sum of hours in the four
activities must be 168. (i) In the model
GP Ai = 1 + 2studyi + 3sleepi + 4worki + 5leisurei + "i

does it make sense to hold sleep, work, and leisure …xed, while changing study? (ii) Explain
why this model violates Assumption 1.3; (iii) How could you reformulate the model so that
its parameters have a useful interpretation and it satis…es Assumption 1.3?
Assumption (1.4 - spherical error variance). The error term satis…es:
2 2 > 0;
E "i X = 8i; Homoskedasticity
E "i "j X = 0; 8i; j ; i 6= j: No correlation between observations.
Exercise 2.5. Under the Assumptions 1.2 and 1.4, show that Cov yi; yj X = 0:
26
Assumption 1.4 and strict exogeneity implies:
Var ( "ij X) = E "2i X = 2:
Cov "i; "j X = 0:
E ""0 X = 2I:
Var ( "j X) = 2I:
Note
2 3
E "21 X E ( " 1 "2 j X ) E ( "1 "n j X )
6 7
6 ( " " j X) 7
0 6 E 1 2 E "22 X E ( "2 "n j X ) 7
E "" X =6
6 ... ... ... ... 7:
7
4 5
E ( " 1 "n j X ) E ( " 2 " n j X ) E "2n X
27
Exercise 2.6. Consider the savings function

q
savi = 1 + 2inci + "i; "i = incizi
where zi is a random variable with E (zi) = 0 and Var (zi) = 2z : Assume that zi is
independent of incj (for all i; j ). (i) Show that E ( "j inc) = 0; (ii) Show that Assumption
1.4 is violated.
2.1.7 The Classical Regression Model for Random Samples
The sample (y; X) is a random sample if f(yi; xi)g is i.i.d. (independently and identically
distributed) across observations. Random sample automatically implies:
E ( "ij X) = E ( "ij xi) ;

2 2
E "i X = E "i xi :
Therefore Assumptions 1.2 and 1.4 can be rephrasing as
Assumption 1.2 E ( "ij xi) = E ("i) = 0

Assumption 1.4 E "2i xi = E "2i = 2
28
2.1.8 “Fixed” Regressors
This is a simplifying (and generally an unrealistic) assumption to make the statistical analysis
tractable. It means that X is exactly the same in repeated samples. Sampling schemes that
support this assumption:
a) Experimental situations. For example, suppose that y represents the yields of a crop
grown on n experimental plots, and let the rows of X represent the seed varieties, irrigation
and fertilizer for each plot. The experiment can be repeated as often as desired, with the
same X. Only y varies across plots.
b) Strati…ed Sampling (for more details see Wooldridge, chap. 9).

29
2.2 The Algebra of Least Squares
2.2.1 OLS Minimizes the Sum of Squared Residuals
Residual for observation i (evaluated at ~ ):
yi x0i ~ :
Vector of residuals (evaluated at ~ ):
y X~:
Sum of squared residuals (SSR):
n
X 2 0
SSR ~ = yi x0i ~ = y X~ y X~ :
i=1
The OLS (Ordinary Least Squares):
b = arg min SSR ~

~
b is such that SSR (b) is minimum.

30
K = 1 ; y i = x i + "i
Example. Consider yi = 1 + 2xi2 + "i: The data:
y X
1 1 1
3 1 3
2 1 1
8 1 3
12 1 8
!
0
Verify that SSR ~ = 42 when ~ = :
1
31
2.2.2 Normal Equations
To solve the optimization proble min ~ SSR ~ we use classical optimization:
First Order Condition (FOC):

@SSR ~
= 0.
@ ~
Solve the previous equation with respect to ~ : Let b such solution.
Second Order Condition (SOC):
@ 2SSR ~
0 is a Positive De…nite Matrix , b is global minimum point.
~
@ @ ~
32
To easily obtain the FOC we start writing SSR ~ as

0
SSR ~ = y X~ y X~
= :::
0 0 ~ ~ 0 0 ~
= y y 2y X + X X :
Recalling from matrix algebra that
0
@ a0 ~ @ ~ A~
= a; = 2A ~ (for A symmetric)
@ ~ @~
we have
@SSR ~ 0 0
= 2 yX + 2X0 X ~ = 0
@ ~
i.e. (replacing ~ by the solution b)
X0Xb = X0y or
X0 (y Xb) = 0:
33
This is a system with K equations and K unknowns. These equations are called the normal
equations. If
1
rank (X) = K ) X0 X is nonsingular ) there exists X0 X :
Therefore, if rank (X) = K we have a unique solution:
1
b = X0 X X0 y OLS estimator.
The SOC is
@ 2SSR ~
0 = 2 X0 X:
@ ~@ ~
If rank (X) = K then 2X0X is a positive de…nite matrix thus SSR ~ is strictly convex
in Rk . Hence b is a global minimum point.
The vector of residuals evaluated at ~ = b;
e=y Xb
is called the vector of OLS residuals (or simply residuals).
34
The normal equations can be written as

n
1X
X0 e =0, xiei = 0:
n i=1
This shows that the normal equations can be interpreted as the sample analogue of the
orthogonality conditions E (xi"i) = 0. Notice the reasoning: by assuming in the popula-
tion the orthogonality conditions E (xi"i) = 0 we deduce by the method of moments the
corresponding sample analogue
1X
xi yi x0i ~ = 0:
n i
We obtain the OLS estimator b by solving this equation with respect to ~ :
35
2.2.3 Two Expressions for the OLS Estimator
1
b = X0 X X0 y
X0 X 1 X0 y
b= n n = Sxx1Sxy ; where
n
X0 X 1X
Sxx = = xix0i (sample average of xix0i)
n n i=1
n
X0 y 1X
Sxy = = xiy (sample average of xiyi).
n n i=1
Example (continuation of previous example). Consider the data.
y X
1 1 1
3 1 3
2 1 1
8 1 3
12 1 8
Obtain b; e and SSR (b) :

36
2.2.4 More Concepts and Algebra
The …tted value for observation i: yî = x0ib.
The vector of …tted value: y

^ = Xb:
The vector of OLS residuals: e = y Xb = y y

^:
The projection matrix P and the annihilator M are de…ned as

1
P=X X0 X X0 ; M=I P:
Properties:
Exercise 2.7. Show that P and M are symmetric and idempotent and
PX = X
MX = 0
y
^ = Py
e = My = M"
SSR = e0e = y0My = "0M":
37
The OLS estimate of 2 (the variance of the error term), denoted s2, is
SSR e0 e
s2 = =
n K n K
s2 is called the standard error of regression.
The sampling error

1
b = ::: = X0X X 0 ":
Coe¢ cient of Determination
A measure of goodness of …t is the coe¢ cient of determination

Pn Pn
(^
yi y )2 e2
R2 = Pi=1
n (y 2
=1 Pn
i=1 i
2
; 0 R2 1:
i=1 i y) i=1 (yi y)
It measures the proportion of the variation of y that is accounted for by variation in the
regressors, x0j s. Derivation of R2: [board]
38
y y
25 R^2 = 0.96 60 y
50 y^
20 40
30 R^2 = 0.19
15 20
y 10
10
y^ 0
x
-3 -2 -1 -10 0 1 2 3
5
-20
0 x -30
-3 -2 -1 0 1 2 3 -40
-5 -50
y
17
16
15
14
13 y
12 y^
11
10 R^2 = 0.00
9
8 x
-3 -2 -1 0 1 2 3
39
“The most important thing about R2 is that it is not important” (Goldberger). Why?
We are concerned with parameters in a population, not with goodness of …t in the sample;
We can always increase R2 by adding more explanatory variables. At the limit, if K =

n ) R 2 = 1:
Exercise 2.8. Prove that K = n ) R2 = 1 (assume that Assumption 1.3 holds).
It can be proved that

P
î
i y y^ (yi y ) =n
R2 = ^2 ; ^= :
Sy^Sy
Adjusted coe¢ cient of determination

Pn 2 = (n
n 1 e k)
R2 = 1 1 R2 = 1 Pn
i=1 i
2
:
n k i=1 (yi y ) = (n 1)
Contrary to R2; R2 may decline when a variable is added to the set of independent variables.
40
2.3 Finite-Sample Properties of OLS
First of all we need to recognize that b and bj X are random!
Assumptions:
1.1 - Linearity: yi = 1xi1 + 2xi2 + ::: + K xiK + "i:
1.2 - Strict exogeneity: E ( "ij X) = 0:
1.3 - No multicollinearity.
1.4 - Spherical error variance: E "2i X = 2; E "i"j X = 0:
Proposition (1.1 - …nite-sample properties of b). We have:

(a) (unbiasedness) Under Assumptions 1.1-1.3, E ( bj X) = :
(b) (expression for the variance) Under Assumptions 1.1-1.4, Var ( bj X) = 2 X0X 1 :
(c) (Gauss-Markov Theorem) Under Assumptions 1.1-1.4, the OLS estimator is e¢ cient in
the class of linear unbiased estimators (also called Best Linear Unbiased Estimator). That
is, for any unbiased estimator ^ that is linear in y, Var ( bj X) Var ^ X in the matrix
sense (i.e. Var ( b j X) Var ( bj X) is a positive semide…nite matrix).
(d) Under Assumptions 1.1-1.4, Cov ( b; ej X) = 0. Proof: [board]
41
Proposition (1.2 - Unbiasedness of s2). Let s2 = e0e= (n K ) : We have

2 2 2
E s X =E s = : Proof: [board]
An unbiased estimator of Var ( bj X) is

1
( bj X) = s2 X0X
Var\ :
Example. Consider
col GP Ai = 1 + 2HSGP Ai + 3ACTi + 4SKIP P EDi + 5P Ci + "i

where: col GP A : college grade point average (GPA); HSGP A : high school GPA; ACT :
achievement examination for college admission; SKIP P ED : average lectures missed per
week; P C is a binary variable (0/1) to identify who owns a personal computer. Using a
survey of 141 students (Michigan State University) in Fall 1994, we obtained the following
results:
42
These results tell us that n = 141, s = 0:325; R2 = 0:259; SSR = 14:37

2 3 2 3
1:356 0:32752 ? ? ? ?
6 7 6 7
6
6 0:4129 7
7
6
6
? 0:09242 ? ? ? 7
7
b =6 0:0133 7; \
Var ( bj X) = 6 ? ? 0:0102 ? ? 7
6 7 6 7
6 7 6 7
4 0:071 5 4 ? ? ? 0:0262 ? 5
0:1244 ? ? ? ? 0:05732
43
2.4 More on Regression Algebra
2.4.1 Regression Matrices
0
Matrix P = X X0X 1 X
Py ! Fitted values from the regression of y on X
Pz ! ?
1 0
Matrix M = I P = I X X0 X X
My ! Residuals from the regression of y on X
Mz ! ?
h i
Consider a partition of X as follows X = X1 X2
1 0
Matrix P1= X1 X01X1 X
1
P1y ! ?
1 0
Matrix M1= I P1 = I X1 X01X1 X
1
M1 y ! ?
44
2.4.2 Short and Long Regression Algebra
Partition X as
h i
X= X 1 X2 ; XK 1 n; XK2 n ; K1 + K2 = K
Long Regression
We have
" #
h i b1
y=y
^ + e = Xb + e = X1 X 2 + e = X1b1 + X2b2 + e:
b2
Short Regression
Suppose that we shorten the list of explanatory variables and regress y on X1: We have
y=y
^ + e = X1b1 + e
where
1
b1 = X01X1 X1 y
e = M1 y ; M1 = I X1 X01X1 X01
45
How are b1 and e related to b1 and e?
b1 vs. b1
We have,
1
b1 = X01X1 X1 y
1
= X01X1 X01 (X1b1 + X2b2 + e)
1 1
= b1 + X01X1 X01X2b2 + X01X1 X01e
| {z }
0
1
= b1 + X01X1 X01X2b2
1
= b1 + Fb2; F = X01X1 X01X2:
Thus, in general, b1 6= b1: Exceptional cases: b2 = 0 or X01X2 = O ) b1 = b1:
46
e vs. e
We have,
e = M1 y
= M1 (X1b1 + X2b2 + e)
= M1X1b1 + M1X2b2 + M1e
= M1X2b2 + e;
= v+e
Thus,
e 0e = e0e + v0v e0e
Thus the SSR of the short regression (e 0e ) exceeds the SSR of the long regression (e0e)
and e 0e = e0e i¤ v = 0; that is i¤ b2 = 0:
47
Example. Illustration of b1 6= b1 and e 0e e0e:
Find X; X1; X2; b; b1; b2; b1; e 0e ; e0e:

48
2.4.3 Residual Regression
Consider
y = X +"
= X 1 1 + X2 2 + ":
Premultiplying both sides by M1 and using M1X1 = 0; we obtain
M1 y = M1 X 1 1 + M 1 X 2 2 + M1 "
y ~ 2 2 + M1 "
~ = X
The OLS gives
1 1 1
~0 X
b2 = X ~ ~0 y
X ~ = ~0 X
X ~ ~ 0 M1 y = X
X ~0 X
~ ~0 y
X
2 2 2 2 2 2 2 2 2
Thus
1
~0 X
b2 = X ~ ~0 y
X
2 2 2
49
1
~0 X
Another way to prove b2 = X ~ ~ 0 y (you may skip this proof). We have
X
2 2 2
1 1
~0 X
X ~ ~0 y =
X ~0 X
X ~ ~ 0 (X1b1 + X2b2 + e)
X
2 2 2 2 2 2
1 1 1
= ~0 X
X ~ ~ 0 X1b1 + X
X ~0 X
~ ~ 0 X2b2 + X
X ~0 X
~ ~0 e
X
2 2 2 2 2 2 2 2 2
| {z } | {z } | {z }
0 b2 0
= b2
since:
1 1
~0 X
X ~ ~ 0 X1b1 = X
X ~0 X
~ X02M1X1b1
2 2 2 2 2
= 0
1 0 1
~0 X
X ~ ~ X2b2 = X
X ~0 X
~ X02M1X2b2
2 2 2 2 2
0 0 1 0
= X 2 M 1 M1 X 2 X2M1X2b2
0 1 0
= X 2 M 1 X2 X2M1X2b2
= b2
~0 e =
X X02M1e
2
= X02e
= 0:
50
1 1
~0 X
The conclusion is that we can obtain b2 = X ~ ~0 y = X
X ~0 X
~ ~0 y
X
2 2 2 2 2 2 ~ as follows:
1) Regress X2 on X1 to get the residuals X ~ 2 = M1X2: Interp. of X ~ 2: X

~ 2 is X2 after the
e¤ects of X1 have been removed or, X ~ 2 is the part X2 that is uncorrelated with X1.
~ 2 to get the coe¢ cient b2 of the long regression.
2) Regress y on X
OR:
1’) Same as 1).
2’a) Regress y on X1 to get the residuals y
~ = M1 y :
2’b) Regress y ~ 2 to get the coe¢ cient b2 of the long regression.
~ on X
The conclusion of 1) and 2) is extremely important: b2 relates y to X2 after controlling for

the e¤ects of X1: This is why b2 can be obtained from the regression of y on X ~ 2 where
~ 2 is X2 after the e¤ects of X1 have been removed (…xed or controlled for). This means
X
that b2 has in fact a ceteris paribus interpretation.
To recover b1 we consider the equation b1 = b1 + Fb2: Regress y on X1; obtaining

0 1 0
b1 = X1X1 X1y and now
1
b1 = b1 X01X1 X01X2b2 = b1 Fb2:
51
Example. Consider the example on page 9.

52
h i
Example. Consider X = 1 exper tenure IQ educ and
h i
X1 = 1 exper tenure IQ ; X2 = educ
53
54
2.4.4 Application of Residual Regression
A) Trend Removal (time series)
Suppose that yt and xt have a linear trend. Should the trend term be included in the
regression as in the case
yt = 1 + 2xt2 + 3xt3 + "t; xt3 = t
or should the variables …rst be “detrended” and then used without the trend term included
as in
y~t = 2x
~t2 + ~
"t ?
According to the previous results, the OLS coe¢ cient b2 is the same in both regressions.
In the second regression b2 is obtained from the regression of y
~ = M1y on x
~ 2 = M1 x 2
where
2 3
1 1
h i 6 7
6 1 2 7
X1 = 1 x 3 = 6 .. .. 7 :
4 . . 5
1 n
55
Example. Consider (TXDES: unemployment rate, INF: in‡ation, t: time)
T XDESt = 1 + 2IN Ft + 3t + "t:

We will show two ways to obtain b2 (compare EQ01 to EQ04).
EQ01
Dependent Variable: TXDES EQ02
Method: Least Squares Dependent Variable: TXDES
Sample: 1948 2003 Method: Least Squares
Sample: 1948 2003
Variable Coefficient Std. Error t-Statistic Prob.
C 4.463068 0.425856 10.48023 0.0000
INF 0.104712 0.063329 1.653473 0.1041 C 4.801316 0.379453 12.65325 0.0000
@TREND 0.027788 0.011806 2.353790 0.0223 @TREND 0.030277 0.011896 2.545185 0.0138
EQ03
Dependent Variable: INF EQ04
Method: Least Squares Dependent Variable: TXDES_
Sample: 1948 2003 Method: Least Squares
Sample: 1948 2003
C 3.230263 0.802598 4.024758 0.0002
@TREND 0.023770 0.025161 0.944696 0.3490 INF_ 0.104712 0.062167 1.684382 0.0978
56
B) Seasonal Adjustment and Linear Regression with Seasonal Data
Suppose that we have data on the variable y; quarter by quarter, for m years. A way to deal
with (deterministic) seasonality is the following
yt = 1Qt1 + 2Qt2 + 3Qt3 + 4Qt4 + 5xt5 + "i
where
(
1 in quarter i
Qti =
0 otherwise.
Let
h i h i
X= Q1 Q2 Q3 Q4 x 5 ; X1 = Q1 Q2 Q3 Q4 :
Previous results show that b5 can be obtained from the regression of y
~ = M1y on x
~ 5=
M1x 5: It can be proved
8
>
> yt yQ1 in quarter 1
>
>
< y yQ2 in quarter 2
t
y~t =
>
> yt yQ3 in quarter 3
>
>
: y yQ4 in quarter 4
t
where yQi is the seasonal mean of quarter i:
57
c) Deviations from Means

h i
Let x 1 be the summer vector. Instead of regressing y on x 1 x 2 x K to get
(b1; b2; :::; bK )0 ; we can regress y on
2 3
x12 x2 x1K xK
6 ... ... 7
4 5
xn2 x2 xnK xK
to get the same vector (b2; :::; bK )0 : We sketch the proof. Let
h i
X2 = x 2 x K
so that
y
^ = x 1b1 + X2b2:
~ 2 = M1X2 where
1) Regress X2 on x 1 to get the residuals X
1 0 x 1x0 1
M1 = I x 1 x0 1x 1 x 1 =I :
n
58
As we know
~ 2 = M1 X2
X
h i
= M1 x 2 x K
h i
= M1 x 2 M1 x K
2 3
x12 x2 x1K xK
6 ... ... 7
= 4 5:
xn2 x2 xnK xK
2) Regress y (or y ~ 2 to get the coe¢ cient b2 of the long regression:

~ = M1y) on X
1 1
~0 X
b2 = X ~ ~0 y = X
X ~0 X
~ ~0 y
X
2 2 2 2 2 2 ~:
The intercept can be recovered as
1 0
b1 = b1 x 1 x0 1x 1 x 1 X2 :
59
2.4.5 Short and Residual Regression in the Classical Regression Model
Consider:
y = X1b1 + X2b2 + e (long regression)

y = X1b1 + e (short regression).
The correct speci…cation corresponds to the long regression:
E ( y j X) = X1 1 + X 2 2
= X
Var ( yj X) = 2 I; etc.
60
A) Short-Regression Coe¢ cients
b1 is a biased estimator of 1
Given that
1 1
b1 = X01X1 X01y = b1 + Fb2; F= X01X1 X01X2:
we have
E ( b1j X) = E ( b1 + Fb2j X) = 1 + F 2;
1 1 1
Var ( b1j X) = Var X01X1 X01y X = X01X1 X01 Var ( yj X) X1 X01X1
2 1
= X01X1
thus, in general,
b1 is a biased estimator of 1 (“omitted-variable bias”)
unless:
2= 0: Corresponds to the case of “Irrelevant Omitted Variables”.

F = O: Corresponds to the case of “Orthogonal Explanatory Variables”(in sample space).
61
Var ( b1j X) Var b1 X (you may skip the proof)
Consider b1 = b1 Fb2
Var ( b1j X) = Var ( b1 Fb2j X)

= Var ( b1j X) + Var ( Fb2j X) since Cov ( b1; b2j X) = O [board]
= Var ( b1j X) + F Var ( b2j X) F0:
Because F Var ( b2j X) F0 is positive semide…nite (or nonnegative de…nite), Var ( b1j X)
Var b1 X .
This relation is still valid if 2 = 0: In this case 2 = 0; regressing y on X1 and on irrelevant

variables (X2) involves a cost: Var ( b1j X) Var b1 X ; although E ( b1j X) = 1:
In practise there may be a bias-variance trade-o¤ between short and long regression when
the target is 1:
62
Exercise 2.9. Consider the standard simple regression model yi = 1 + 2xi2 + "i under
Assumptions 1.1 through 1.4. Thus, the usual OLS estimators b1 and b2 are unbiased for
their respective population parameters. Let b2 be the estimator of 2 obtained by assuming
the intercept is zero i.e. 1 = 0 (i) Find E b2 X . Verify that b2 is unbiased for 2 when
the population intercept 1 is zero. Are there other cases where b2 is unbiased? (ii) Find the
variance of b2. (iii) Show that Var b2 X Var ( b2j X); (iv) Comment on the trade-o¤
between bias and variance when choosing between b2 and b2.
Exercise 2.10. Suppose that average worker productivity at manufacturing …rms (avgprod)
depends on two factors, average hours of training (avgtrain) and average worker ability
(avgabil):
avgprodi = 1 + 2avgtraini + 3avgabili + "i
Assume that this equation satis…es Assumptions 1.1 through 1.4. If grants have been given to
…rms whose workers have less than average ability, so that avgtrain and avgabil are negatively
correlated, what is the likely bias in b2 in obtained from the simple regression of avgprod on
avgtrain?
63
B) Short-Regression Residuals (skip this)
Given that e = M1y we have

~ 2 2;
E ( e j X ) = M1 E ( y j X ) = M1 E ( X 1 1 + X 2 2 j X ) = X
Var ( e j X) = Var ( M1yj X) = M1 Var ( yj X) M01 = 2M1:
Thus E ( e j X) 6= 0; unless 2 = 0:
Let’s see now that the omission of explanatory variables leads to an increase in the expected
SSR. We have, by R5,
0
E e e X = E y0M1y X = tr (M1 Var ( yj X)) + E ( yj X)0 M1 E ( yj X)
= 2 tr (M1) + 0 X ~ 2 = 2 (n K1) + 0 X
~0 X
2 2 2
~0 X
~2
2 2 2
and E e0e X = 2 (n K ) thus
0 0 2 0 ~0 X
~
E e e X E e e X = K2 + 2X 2 2 2 > 0:
Notice that: e 0e ~0 X
e0e = b02X ~ ~0 X
0: (check E b02X ~ 2K
2 2 b2 2 2 b2 X = 2 +
~0 X
0X ~ 2 ).
2
2 2
64
C) Residual Regression
The objective is to characterize

Var ( b2j X) :
1
We know that b2 = X ~
~0 X ~ 0 y: Thus
X
2 2 2
1
Var ( b2j X) = Var ~0 X
X ~ ~0 y X
X
2 2 2
1 1
= ~0 X
X ~ ~ 0 Var ( yj X) X
X ~0 X
~2 X ~
2 2 2 2 2
2 1
= ~0 X
X ~
2 2
2 1
= X02M1X2 :
Now suppose that

h i
X= X1 x K (i.e. x K = X2)
65
If follows that
2
Var ( bK j X) = 0
x K M1 x K
and x0 K M1x K is the sum of the squared residuals in the auxiliary regression
x K = 1x 1 + 2x 2 + ::: + K 1x K 1 + error:
One can conclude (assuming that x 1 is the summer vector):
2 x0 K M1x K
RK =1 P 2
:
(xiK xK )
Solving this equation for x0 K M1x K we have
X
x0 K M1 x K = 1
2
RK (xiK xK )2 :
We get
2 2
Var ( bK j X) = P 2
= :
1 2
RK (xiK xK ) 1 2 S2 n
RK xK
66
2 2
Var ( bK j X) = = :
1 2 P (x
RK xK ) 2
1 2 S2 n
RK
iK xK
We can conclude that the precision of bK is high (i.e. Var (bK ) is small) when:
2 is low;
Sx2K is high (imagine the regression

wage = 1 + 2educ + ":
If most people (in the sample) report the same education, Sx2K will be low and 2 will
be estimated very imprecisely).
n is high (large sample is preferable to small sample).
2 is low (multicollinearity increases R2 ).

RK K
67
Exercise 2.11. Consider: sleep: minutes sleep at night per week; totwrk: hours worked
per week; educ: years of schooling; female: binary variable equal to one if the individual
is female. Do women sleep more than men? Explain the di¤erences between the estimates
32.18 and -90.969.
Dependent Variable: SLEEP

Method: Least Squares
Dependent Variable: SLEEP Sample: 1 706
Sample: 1 706 Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob. C 3838.486 86.67226 44.28737 0.0000
TOTWRK -0.167339 0.017937 -9.329260 0.0000
C 3252.407 22.22211 146.3591 0.0000 EDUC -13.88479 5.657573 -2.454196 0.0144
FEMALE 32.18074 33.75413 0.953387 0.3407 FEMALE -90.96919 34.27441 -2.654143 0.0081
R-squared 0.001289 Mean dependent var 3266.356 R-squared 0.119277 Mean dependent var 3266.356
Adjusted R-squared -0.000129 S.D. dependent var 444.4134 Adjusted R-squared 0.115514 S.D. dependent var 444.4134
S.E. of regression 444.4422 Akaike info criterion 15.03435 S.E. of regression 417.9581 Akaike info criterion 14.91429
Sum squared resid 1.39E+08 Schwarz criterion 15.04726 Sum squared resid 1.23E+08 Schwarz criterion 14.94012
68
Example. The goal is to analyze the impact of another year of education on wages. Consider:
wage: monthly earnings; KWW: knowledge of world work score (KWW is a general test of
work-related abilities); educ: years of education; exper: years of work experience; tenure:
years with current employer
Dependent Variable: LOG(WAGE)
Dependent Variable: LOG(WAGE) Sample: 1 935
Method: Least Squares White Heteroskedasticity-Consistent Standard Errors & Covariance
Sample: 1 935
White Heteroskedasticity-Consistent Standard Errors & Covariance Variable Coefficient Std. Error t-Statistic Prob.
Variable Coefficient Std. Error t-Statistic Prob. C 5.496696 0.112030 49.06458 0.0000
EDUC 0.074864 0.006654 11.25160 0.0000
C 5.973062 0.082272 72.60160 0.0000 EXPER 0.015328 0.003405 4.501375 0.0000
EDUC 0.059839 0.006079 9.843503 0.0000 TENURE 0.013375 0.002657 5.033021 0.0000
R-squared 0.097417 Mean dependent var 6.779004 R-squared 0.155112 Mean dependent var 6.779004
Adjusted R-squared 0.096449 S.D. dependent var 0.421144 Adjusted R-squared 0.152390 S.D. dependent var 0.421144
S.E. of regression 0.400320 Akaike info criterion 1.009029 S.E. of regression 0.387729 Akaike info criterion 0.947250
Sum squared resid 149.5186 Schwarz criterion 1.019383 Sum squared resid 139.9610 Schwarz criterion 0.967958
Dependent Variable: LOG(WAGE)

Sample: 1 935
White Heteroskedasticity-Consistent Standard Errors & Covariance
C 5.210967 0.113778 45.79932 0.0000

EDUC 0.047537 0.008275 5.744381 0.0000
EXPER 0.012897 0.003437 3.752376 0.0002
TENURE 0.011468 0.002686 4.270056 0.0000
IQ 0.004503 0.000989 4.553567 0.0000
KWW 0.006704 0.002070 3.238002 0.0012
R-squared 0.193739 Mean dependent var 6.779004

Adjusted R-squared 0.189400 S.D. dependent var 0.421144
S.E. of regression 0.379170 Akaike info criterion 0.904732
Sum squared resid 133.5622 Schwarz criterion 0.935794
69
Exercise 2.12. Consider
yi = 1 + 2xi2 + "i; i = 1; :::; n

where xi2 is an impulse dummy, i.e. x 2 is a column vector with n 1 zeros and only one
1. To simplify let us suppose that this 1 is the …rst element of x 2; i.e.
h i
x0 2 = 1 0 0 :
Find and interpret the coe¢ cient from the regression of y on x ~ 1 = M2x 1 and M2 =
0 1 0
I x 2 x 2x 2 x 2 (x
~ 1 is the residual vector from the regression x 1 on x 2):
Exercise 2.13. Consider the long regression model (under Assumptions 1.1 through 1.4):
y = X1b1 + X2b2 + e;
and the following coe¢ cients (obtained from the short regressions):
1 1
b1 = X01X1 X01y; b2 = X02X2 X02y:
Decide if you agree or disagree with the following statement: if Cov b1; b2 X1; X2 = O
(zero matrix) then b1 = b1 and b2 = b2:
70
2.5 Multicollinearity
If rank (X) < K then b is not de…ned. This is called strict multicollinearity. When this
happens, the statistical software will be unable to construct X0X 1 : Since the error is
discovered quickly, this is rarely a problem for applied econometric practice.
The more relevant situation is near multicollinearity, which is often called “multicollinearity”
for brevity. This is the situation when the X0X is near singular, when the columns of X are
close to linearly dependent.
Consequence: the individual coe¢ cient estimates will be imprecise. We have shown that
2
Var ( bK j X) = :
1 2 S2 n
RK xK
2 is the coe¢ cient of determination in the auxiliary regression
where RK
x K = 1x 1 + 2x 2 + ::: + K 1x K 1 + error:
71
Exercise 2.14. Do you agree with the following quotations: (a) “But more data is no remedy
for multicollinearity if the additional data are simply "more of the same." So obtaining lots
of small samples from the same population will not help” (Johnston, 1984); (b) “Another
important point is that a high degree of correlation between certain independent variables
can be irrelevant as to how well we can estimate other parameters in the model.”
Exercise 2.15. Suppose you postulate a model explaining …nal exam score in terms of class
attendance. Thus, the dependent variable is …nal exam score, and the key explanatory
variable is number of classes attended. To control for student abilities and e¤orts outside
the classroom, you include among the explanatory variables cumulative GPA, SAT score, and
measures of high school performance. Someone says, “You cannot hope to learn anything
from this exercise because cumulative GPA, SAT score, and high school performance are
likely to be highly collinear.” What should be your answer?
72
2.6 Statistical Inference under Normality
Assumption (1.5 - normality of the error term). "j X N ormal
Assumption 1.5 together with Assumptions 1.2 and 1.4 implies that
"j X N 0; 2I and yj X N X ; 2I :
Suppose that we want to test H0 : 2 = 1. Although Proposition 1.1 guarantees that, on

average, b2 (the OLS estimate of 2) equals 1 if the hypothesis H0 : 2 = 1 is true, b2 may
not be exactly equal to 1 for a particular sample at hand. Obviously, we cannot conclude
that the restriction is false just because the estimate b2 di¤ers from 1. In order for us to
decide whether the sampling error b2 1 is “too large” for the restriction to be true, we
need to construct from the sampling distribution error some test statistic whose probability
distribution is known given the truth of the hypothesis.
The relevant theory is built from the following results:

73
1. z N (0; I) , z0z 2 :
(n)
2 ; w 2 ; w w =m
2. w1 (m) 2 (n) 1 and w2 are independent, w1 =n F (m; n) :
2
3. w 2 ; z
(n) N (0; 1) ; w and z are independent, p z t(n):
w=n
4. Asymptotic Results:
d
v F (m; n) ) mv ! 2(m) as n ! 1
d
u t(n) ) u ! N (0; 1) as n ! 1:
5. Consider the vector n 1 vector yj X N (X ; ) : Then,
w = (y X )0 1 (y X ) 2 :
(n)
74
6. Consider the vector n 1 vector "j X N (0; I) : Let M be a n n idempotent

matrix with rank (M) = r n: Then,
"0M" X 2 :
(r)
7. Consider the vector n 1 vector "j X N (0; I) : Let M be a n n idempotent

matrix with rank (M) = r n: Let L be a matrix such that LM = O: Let t1 = M"
and t2 = L": Then t1 and t2 are independent random vectors.
1
8. bj X N ; 2 X0 X :
9. Let r = R (Rp K ) with rank (R) = p (in Hayashi’s notation p is equal to #r):
Then,
1
Rbj X N r; 2R X0 X R0 :
75
1
10. Let bk be the kth element of b and q kk the (k; k) element of X0X : Then,
b
bk j X N k;
2 q kk or zk = kq k N (0; 1) :
q kk
0 1 1 2 2 :
11. w = (Rb r) R X0 X R0 (Rb r) = (p)
2
(bk k) 2 :
12. wk = 2 q kk (1)
13. w0 = e0e= 2 2
(n K) :
14. The random vectors b and e are independent.
d (b) ; is independent of each of the statistics

15. Each of the statistics e; e0e; w0; s2; Var
b, bk ; Rb; w; wk :
76
b 1
16. tk = k^ k t (n K ) ; where ^ 2b is the (k; k) element of s2 X0X :
bk k
17. q Rb R t (n K ) ; R is of type 1 K
s R(X0 X) 1 R0
0 1 1
18. F = (Rb r) R X0 X R0 (Rb r) = ps2 F (p; n K) :
Exercise 2.16. Prove the results #8, #9, #16 and #18 (take the other results as given).
The two most important results are:

bk k bk k
tk = = t (n K)
^ bk SE (bk )
1 1
F = (Rb r) 0
R X0 X R0 (Rb r) = ps2 F (p; n K) :
77
2.6.1 Con…dence Intervals and Regions
Let t =2 t =2 (n k) be such that
P jtj < t =2 = 1 :
78
Let F F (p; n K ) be such that
P (F > F ) = 1
79
(1 ) 100% CI for an individual slope coe¢ cient k :

8 9
< bj =
k
: t =2 , bk t =2 ^ bk :
: k ^ bk ;
(1 ) 100% CI for a single linear combination of the elements of (p = 1)

8 9
>
< >
= q
Rb R
R : q t =2 , Rb t =2s R (X0X) 1 R0:
>
: 1 >
;
s R (X0X) R0
In this case R is a vector 1 K:

(1 ) 100% Con…dence Region for the parameter vector =R :
( )
1 1
: (Rb )0 R X0X R0 (Rb ) =s2 pF :
(1 ) 100% Con…dence region for the parameter vector (consider R = I in the pre-
vious case)
n o
: (b ) 0
X0 X (b ) =s2 pF :
80
Exercise 2.17. Consider yi = 1xi1 + 2xi2 + "i where yi = wagesi wages; xi1 =
educi educ; xi2 = experi exper: The results are
Dependent Variable: Y
Sample: 1 526
X
X1 0.644272 0.053755 11.98541 0.0000
X2 0.070095 0.010967 6.391393 0.0000
R-squared 0.225162 Mean dependent var 1.34E-15

Log likelihood -1365.969 Hannan-Quinn criter. 5.207752
Durbin-Watson stat 1.820274
" # " #
4025:4297 5910:064 1 2:7291 10 4 1:6678 10 5
X0 X = ; X0 X =
5910:064 96706:846 1:6678 10 5 1:1360 10 5
(a) Build the 95% con…dence interval for 2.
(b) Build the 95% con…dence interval for 1 + 2:
(c) Build the 95% con…dence region for the parameter vector :
81
Con…dence regions in the EVIEWS
.10
.09
.08
beta2
.07
.06
.05
.04
.50 .55 .60 .65 .70 .75 .80
beta1
90% and 95% Con…dence region for the parameter vector

82
2.6.2 Testing on a Single Parameter
Suppose that we have a hypothesis about the kth regression coe¢ cient:
H0 : k = 0k
( 0k is a speci…c value, e.g. zero), and that this hypothesis is tested against the alternative
hypothesis
H1 : k 6= 0k :
We do not reject H0 at the 100% level if
0 lies within the (1 ) 100% CI for k ; i.e., bk t =2 ^ bk ;
k
reject H0 otherwise. Equivalently, calculate the test statistic
bk 0
tobs = k
^ bk
and,
if jtobsj > t =2 then reject H0;
if jtobsj t =2 then do not reject H0:
83
The reasoning is as follow. Under the null hypothesis we have

bk 0
t0k = k t(n K):
^ bk
If we observe jtobsj > t =2 and the H0 is true, then a low-probability event has occurred.
We take jtobsj > t =2 as an evidence against the null and the decision should be to reject
H0 :
Other cases:
H0 : k = 0k vs: H1 : k > 0k ;
if tobs > t then reject H0 at the 100% level; otherwise do not reject H0:
H0 : k = 0k vs: H1 : k < 0k ;
if tobs < t then reject H0 at the 100% level; otherwise do not reject H0:
84
2.6.3 Issues in Hypothesis Testing
p-value
p-value (or p) is the probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming that the null hypothesis is true. p is an informal measure
of evidence of the null hypothesis.
Example. Consider H0 : k = 0k vs: H1 : k 6= 0k
p-value = P t0k > tobs H0 is true :

A p-value = 0:02 shows little evidence supporting H0: At the 5% level you should reject the
H0 hypothesis.
Example. Consider H0 : k = 0k vs: H1 : k > 0k
p-value = P t0k > tobs H0 is true :

EVIEWS: divide the reported p-value by two.
85
Reporting the outcome of a test
Correct wording in reporting the outcome of a test involving
H0 : k = 0k vs. H1 : k 6= 0k
When the null is rejected we say that bk (not k ) is signi…cantly di¤erent from 0k at
100% level. Some authors also say “the variable (associated with bk ) is statistically
signi…cant at 100% level”.
When the null isn’t rejected we say that bk (not k ) is not signi…cantly di¤erent from
0 at 100% level or that the variable is not statistically signi…cant at 100% level.
k
86
More Remarks:
Rejection of the null is not proof that the null is false. Why?
Acceptance of the null is not proof that the null is true. Why? We prefer to use the
language “we fail to reject H0 at the x% level” rather than “H0 is accepted at the x%
level.”
In a test of type H0 : k = 0k , if ^ bk is large (bk is an imprecise estimator) is more

di¢ cult to reject the null. The sample contains little information about the true value
of k parameter. Remember that ^ bk depends on
2; S 2 ; n and Rk2 .
xk
87
Statistical Versus Economic Signi…cance
The statistical signi…cance of a variable is determined by the size of tobs = bk =se (bk ) ;
whereas the economic signi…cance of a variable is related to the size and sign of bk :
Example. Suppose that in a business activity we have
log\
(wagei) = :1 + 0:01 f emale + ::: n = 600
(0:001)
H0 : 2 = 0 vs. H1 = 2 6= 0: We have:
b2
t0k= t(600 K) N (0; 1) (under the null)
^ b2
0:01
tobs = = 10;
0:001
p-value = P t0k > 10 H0 is true 0:
Discuss statistical versus economic signi…cance.
88
Exercise 2.18. Can we say that students at smaller schools perform better than those at
larger schools? To discuss this hypothesis we consider data on 408 high schools in Michigan
for the year 1993 (see Wooldridge, chapter 4). Performance is measured by the percentage
of students receiving a passing score on a tenth grade math test ( math10). School size
is measured by student enrollment ( enroll). We will control for two other factors, average
annual teacher compensation ( totcomp) and the number of sta¤ per one thousand students
( sta¤ ). Teacher compensation is a measure of teacher quality, and sta¤ size is a rough
measure of how much attention students receive. Figure below reports the results. Answer
to the initial question.
Dependent Variable: MATH10

Sample: 1 408
C 2.274021 6.113794 0.371949 0.7101

TOTCOMP 0.000459 0.000100 4.570030 0.0000
STAFF 0.047920 0.039814 1.203593 0.2295
ENROLL -0.000198 0.000215 -0.917935 0.3592

F-statistic 7.696528 Durbin-Watson stat 1.668918
Prob(F-statistic) 0.000052
89
Exercise 2.19. We want to relate the median housing price ( price) in the community to
various community characteristics: nox is the amount of nitrous oxide in the air, in parts
per million; dist is a weighted distance of the community from …ve employment centers, in
miles; rooms is the average number of rooms in houses in the community; and stratio is
the average student-teacher ratio of schools in the community. Can we conclude that the
elasticity of price with respect to nox is -1? (Sample: 506 communities in the Boston area -
see Wooldridge, chapter 4).
Dependent Variable: LOG(PRICE)

Sample: 1 506
C 11.08386 0.318111 34.84271 0.0000

LOG(NOX) -0.953539 0.116742 -8.167932 0.0000
LOG(DIST) -0.134339 0.043103 -3.116693 0.0019
ROOMS 0.254527 0.018530 13.73570 0.0000
STRATIO -0.052451 0.005897 -8.894399 0.0000

90
2.6.4 Test on a Set of Parameter I
Suppose that we have a joint null hypothesis about :
H0 : R = r vs. H1 : R 6= r:
where R p 1; Rp K ). The test statistics is
1 1
F0 = (Rb r)0
R X0 X R0 (Rb r) = ps2 :
Let Fobs be the observed test statistics. We have
reject H0 if Fobs > F (or if p-value < )

do not reject H0 if Fobs F :
The reasoning is as follow. Under the null hypothesis we have
F0 F(p;n K):
If we observe F 0 > F and the H0 is true, then a low-probability event has occurred.
91
In the case p = 1 (single linear combination of the elements of ) one may use the test
statistics
0 Rb R
t = q t (n K ) :
1
s R (X0X) R0
Example. We consider a simple model to compare the returns to education at junior colleges
and four-year colleges; for simplicity, we refer to the latter as “universities”(See Wooldridge,
chap. 4).The model is
log (wagesi) = 1 + 2jci + 3univi + 4experi + "i:

The population includes working people with a high school degree. jc is number of years
attending a two-year college and univ is number of years at a four-year college. Note that
any combination of junior college and college is allowed, including jc = 0 and univ = 0.
The hypothesis of interest is whether a year at a junior college is worth a year at a university:
this is stated as H0 : 2 = 3: Under H0, another year at a junior college and another year
at a university lead to the same ceteris paribus percentage increase in wage. The alternative
of interest is one-sided: a year at a junior college is worth less than a year at a university.
This is stated as H1 : 2 < 3:
92
Dependent Variable: LWAGE

Sample: 1 6763
C 1.472326 0.021060 69.91020 0.0000

JC 0.066697 0.006829 9.766984 0.0000
UNIV 0.076876 0.002309 33.29808 0.0000
EXPER 0.004944 0.000157 31.39717 0.0000

2 3
0:0023972 9:4121 10 5 8:50437 105 1:6780 10 5
6 7
1 6 9:41217 10 5 0:0002520 1:04201 10 5 9:2871 10 8 7
X0 X =6
6
7
7
4 8:50437 10 5 1:0420 10 5 2:88090 10 5 2:12598 10 7 5
1:67807 10 5 9:2871 10 8 2:1259 10 7 1:3402 10 7
Under the null, the test statistics is
Rb R
t0 = q t (n K) :
1
s R (X0X) R0
93
We have
h i
R = 0 1 1 0
q
1
R (X0X) R0 = 0:016124827
q
s R (X0X) 1 R0 = 0:430138 0:016124827 = 0:006936
2 3
1:472326
h i 6 0:066697 7
6 7
Rb = 0 1 1 0 6 7 = 0:01018
4 0:076876 5
0:004944
2 3
1
h i6 7
6 2 7
R = 0 1 1 0 6 7= 2 3 = 0 (under H0)
4 3 5
4
0:01018
tobs = = 1:467
0:006936
t0:05 = 1:645:
We do not reject H0 at the 5% level. There is no evidence against 2 = 3 at 5% level.
94
Remark: in this exercise t0 can be written as

Rb b2 b3
b2 b 3
t0 = q =q = :
1
s R (X0X) R0 Var \
(b2 b3) SE (b2 b3)
Exercise 2.20 (continuation). Propose another way to test H0 : 2 = 3 against H0 :

2 < 3 along the following lines: de…ne = 2 3 ; write 2 = + 3 ; plug this into
the equation log (wagesi) = 1 + 2jci + 3univi + 4experi + "i and test = 0: Use
the database available on the webpage of the course.
95
2.6.5 Test on a Set of Parameter II
We focus on another way to test
H0 : R = r vs. H1 : R 6= r:
(where R p 1; Rp K ). It can be proved that
1 1
F0 = (Rb r)0
R X0 X R0 (Rb r) = ps2
e 0e e0e =p
=
e0e= (n K )
R2 R2 =p
= F (p; n K)
1 R2 = (n K)
where refers to the short regression or the regression subjected to the constraint R = r.
96
Example. Consider once again the equation log (wagesi) = 1 + 2jci + 3univi +
4 experi + "i and H0 : 2 = 3 against H0 : 2 6= 3 : The results of the regression
subjected to the constraint H0 : 2 = 3 are
Dependent Variable: LWAGE

Sample: 1 6763
C 1.471970 0.021061 69.89198 0.0000

JC+UNIV 0.076156 0.002256 33.75412 0.0000
EXPER 0.004932 0.000157 31.36057 0.0000

We have p = 1; e0e = 1250:544; e 0e = 1250:942 and

e 0e e0e =p (1250:942 1250:544) =1
Fobs = 0
= = 2:151;
e e= (n K ) 1250:544= (6763 4)
F0:05 = 3:84:
We do not reject the null at 5% level, since Fobs = 2:151 < F0:05 = 3:84:
97
In the case “all slopes zero” (test of signi…cance of the complete regression), it can be
proved that F o equals
R2= (K 1)
F0 = :
1 R2 = (n K)
Under the null H0 : k = 0; k = 2; 3; :::; K; we have F 0 F (K 1; n K) :

Exercise 2.21. Consider the results:
Dependent Variable: Y
Sample: 1 500
C 0.952298 0.237528 4.009200 0.0001

X2 1.322678 1.686759 0.784154 0.4333
X3 2.026896 1.701543 1.191210 0.2341

Test: (a) H0 : 2 = 0 vs. H1 : 2 6= 0; (b) H0 : 3 = 0 vs. H1 : 3 6= 0; (c)

H0 : 2 = 0; 3 = 0 vs. H1 : 9 i 6= 0 (i = 1; 2) (d) Are xi2 and xi3 truly relevant
variables? How would you explain the results you obtained in parts (a), (b) and (c)?
98
2.7 Relation to Maximum Likelihood
Having speci…ed the distribution of the error vector, we can use the maximum likelihood
(ML) principle to estimate the model parameters = 0; 2 0.
2.7.1 The Maximum Likelihood Principle
ML principle: choose the parameter estimates to maximize the probability of obtaining the
data. Maximizing the joint density associated with the data, f y; X; ~ ; leads to the same
solution. Therefore:
M L estimator of = arg max f y; X; ~ :

~
99
Example (Without X). We ‡ipped a coin 10 times. If heads then y = 1: Obviously y

Bernoulli( ) : We don’t know if the coin is fair, so we treated E (Y ) = as unknown
P10
parameter. Suppose that i=1 yi = 6: We have
n
Y
f y;~ = f y1; :::; yn; ~ = f (yi; ) = y1 (1 )1 y1 ::: yn (1 )1 yn
P Pi=1
= i yi (1 )10 i yi = 6 (1 )4 :
0.0012
joint density
0.0011
0.0010
0.0009
0.0008
0.0007
0.0006
0.0005
0.0004
0.0003
0.0002
0.0001
0.0000
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
theta
100
To obtain the ML estimate of we proceed with:
d 6 (1 )4 6
=0,^=
d 10
and since
d2 6 (1 )4
<0
d 2
^ = 0:6 maximizes f y;~ : ^ is the “most likely” value ; that is the value that maximizes
the probability of observing (y1; :::; y10) : Notice that the ML estimator is y:
Since log x; x > 0 is a strictly increasing function we have: ^ maximizes f y;~ i¤ ^

maximizes log f y;~ ; that is
^ = arg max f y; X; ~ , ^ = arg max log f y; X; ~ :
In most cases we prefer to solve max log f y; X; ~ rather max f y; X; ~ ; since the
transformation log greatly simplify the likelihood (products become sums).
101
2.7.2 Conditional versus Unconditional Likelihood
The joint density f (y; X; ) is in general di¢ cult to handle. Consider:

f (y; X; ) = f ( yj X; ) f (X; ) ; = 0; 0 ;
log f (y; X; ) = log f ( yj X; ) + log f (X; )

In general we don’t know f (X; ) :
Example. Consider yi = 1xi1 + 2xi2 + "i where
"i j X N 0; 2 ) yij X N x0i ; 2
X N ; 2I :
x x
Thus,
" # " # " #
= ; = x ; = :
2 2
x
If there is no functional relationship between and (such as a subset of being a

function of ), then maximizing log f (y; X; ) with respect to is achieved by separately
maximizing f ( yj X; ) with respect to and maximizing f (X; ) with respect to . Thus
the ML estimate of also maximizes the conditional likelihood f ( yj X; ) :
102
2.7.3 The Log Likelihood for the Regression Model
Assumption 1.5 (the normality assumption) together with Assumptions 1.2 and 1.4 imply
that the distribution of " conditional on X is N 0; 2I . Thus,
"j X N 0; 2I ) yj X N X ; 2I )
2 n=2 1 0
f ( y j X; ) = 2 exp 2
(y X ) (y X ) )
2
n 2 1
log f ( yj X; ) = log 2 2
(y X )0 (y X ):
2 2
It can be proved
n
X n
n 2 1 X 2
log f ( yj X; ) = log f ( yij xi) = log 2 2
yi x0i :
i=1 2 2 i=1
Proposition (1.5 - ML Estimator of and 2). Suppose Assumptions 1.1-1.5 hold. Then,
1
M L estimator of = X0 X X0 y :
2 e0e 2 e0 e
M L estimator of = 6= s = :
n n K
103
We know that E s2 = 2: Therefore:
e0 e 6= 2:
E n
e 0e
limn!1 E n = 2:
Proposition (1.6 - b is the Best Unbiased Estimator BUE). Under Assumptions 1.1-1.5,
the OLS estimator b of is BUE in that any other unbiased (but not necessarily linear)
estimator has larger conditional variance in the matrix sense.
This result should be distinguished from the Gauss-Markov Theorem that b is minimum
variance among those estimators that are unbiased and linear in y. Proposition 1.6 says
that b is minimum variance in a larger class of estimators that includes nonlinear unbiased
estimators. This stronger statement is obtained under the normality assumption (Assumption
1.5) which is not assumed in the Gauss-Markov Theorem. Put di¤erently, the Gauss-Markov
Theorem does not exclude the possibility of some nonlinear estimator beating OLS, but this
possibility is ruled out by the normality assumption.
104
Exercise 2.22. Suppose yi = x0i + "i where "ij X t(v): Assume that Assumptions
1.1-1.4 hold. Use your intuition to answer “true” or “false” to the following statements:
(a) b is the BLUE;
(b) b is the BUE;
(c) the BUE estimator can only be obtained numerically (i.e. there is not a closed formula
for the BUE estimator).
Just out of curiosity notice that the log-likelihood function is

n
X n n n
log f ( yij xi) = log 2 log log (v 2)
i=1 2 2 2
0 1
v+1 2
2
n
v+1 X B 1 yi x0i C
+n log v
log @1 + 2
A:
2 i=1 v 2
2
105
2.8 Generalized Least Squares (GLS)
We have assumed that

2 = Var ( "ij X) = 2 > 0;
E "i X 8i; Homoskedasticity
E " i "j X = 0; 8i; j ; i 6= j No correlation between observations.
Matrix notation:
2 3
E "21 X E ( "1 " 2 j X ) E ( "1 "n j X)
6 7
6 ( " " j X) 7
0 6 E 1 2 E "22 X E ( "2 "n j X) 7
E "" X = 6
6 ... ... ... ... 7
7
4 5
E ( "1 "n j X ) E ( "2 "n j X ) E "2n X
2 3
2 0 0
6 2 7
6 0 0 7
= 6 7 = 2I:
.
6 .. . .
.. . . . .. 7
4 5
0 0 2
106
The Assumption E ""0 X = I is violated if either
E "2i X depends on X ! Heteroskedasticity, or
E "i"j X 6= 0 ! Serial Correlation (We will analyze this case later).
Let’s assume now that

0 2
E "" X = V (V depends on X):
The model y = X + " based on the assumptions Assumptions 1.1-1.3 and E ""0 X =
2 V is called generalized regression model.
Notice that by de…nition, we always have:

0
E "" X = Var ( "j X) = Var ( yj X) :
107
Example (case where E "2i X depends on X). Consider the following model
yi = 1 + 2xi2 + "i
to explain household expenditure on food (y ) as a function of household income. Typical
behavior: Low-income household do not have the option of extravagant food tastes: they
have few choices and are almost forced to spend a particular portion of their income on food;
High-income household could have simple food tastes or extravagant food tastes: income by
itself is likely to be relatively less important as an explanatory variable.
20
18
16
y : Expenditure
14
12
10
8
6
4
2
0
6 7 8 9 10 11 12 13
x : Income
108
If e accurately re‡ects the behavior of the "; the information in the previous …gure suggests
that the variability of yi increases as income increases, thus it is reasonable to suppose that
Var ( yij xi2) is a function of xi2:

This is the same as saying that
2 is a function of xi2:
E "i xi2
For example if E "2i xi2 = 2x2i2 then

2 3
x212 0 0
6 2 7
0 26
6 0 x22 0 7
7= 2 I:
E "" X = 6 ... ... ... ... 7 V 6=
4 5
0 0 x2n2
| {z }
V
109
2.8.1 Consequence of Relaxing Assumption 1.4
1. The Gauss-Markov Theorem no longer holds for the OLS estimator. The BLUE is some
other estimator.
2. The t-ratio is not distributed as the t distribution. Thus, the t-test is no longer valid. The
same comments apply to the F-test. Note that Var ( bj X) is no longer 2 X0X 1 : In
e¤ect,
1 1 1
Var ( bj X) = Var X0 X X0 y X = X0 X X0 Var ( yj X) X X0 X
2 1 1
= X0 X X0VX X0X :
On the other hand,
2 tr (MVM) 2 tr (MV)
E e0e X tr (Var ( ej X))
E s2 X = = = = :
n K n K n K n K
The conventional standard errors are incorrect when Var ( yj X) 6= 2I: Con…dence
region and hypothesis test procedures based on the classical regression model are not
valid.
110
3. However, the OLS estimator is still unbiased, because the unbiasedness result (Propo-
sition 1.1 (a)) does not require Assumption 1.4. In e¤ect,
0 1 1
E ( bj X) = X X X0 E ( y j X) = X0 X X0 X = ; E (b) =
Options in the presence of E ""0 X 6= 2I:
1 1
Use b to estimate and Var ( bj X) = 2 X0X X0VX X0X for inference
purposes. Note that yj X N X ; 2V implies
1 1
bj X N ; 2 X0 X X0VX X0X :
This is not a good solution as if you know V you may use a more e¢ cient estimator, as
we will see below. Later on, in chapter “Large Sample Theory” we will …nd that 2V
may be replaced by a consistent estimator.
Search for a better estimator of :

111
2.8.2 E¢ cient Estimation with Known V
If the value of the matrix function V is known, a BLUE estimator for , called generalized
least squares (GLS), can be deduced. The basic idea of the derivation is to transform
the generalized regression model into a model that satis…es all the assumptions, including
Assumption 1.4, of the classical regression model. Consider
y = X + "; 0 2
E "" X = V:
We should multiply both sides of the equation by a nonsingular matrix C (depending on X)
Cy = CX + C"
y ~ +~
~ = X "
" verify E ~
such that the transformed error ~ "0 X = 2I; i.e.
"~
"0 X = E C""0C0 X = C E ""0 X C0 = 2CVC0 = 2I

"~
E ~
that is CVC0 = I:
112
Given CVC0 = I, how to …nd C? Since V is by construction symmetric and positive de…-
nite, there exists a nonsingular n n matrix C such
1 1
V=C C0 or V 1 = C0C
Note
1 0
CVC0 = CC 1 C0 C = I:
It easy to see that if y = X + " satis…es Assumptions 1.1-1.3 and Assumption 1.5 (but
not Assumption 1.4), then
y ~ +~
~=X "; where y ~ = CX
~ = Cy; X
satis…es Assumptions 1.1-1.5. Let
1 1X 1 1 y:
~ 0X
^ GLS = X ~ ~ 0y
X ~ = X0 V X0 V
113
Proposition (1.7 - …nite-sample properties of GLS). (a) (unbiasedness) Under Assumption

1.1-1.3,
E ^ GLS X = :
(b) (expression for the variance) Under Assumptions 1.1-1.3 and the assumption E ""0 X =
2 V that the conditional second moment is proportional to V,
2 1
Var ^ GLS X = X0 V 1 X :
(c) (the GLS estimator is BLUE) Under the same set of assumptions as in (b), the GLS
estimator is e¢ cient in that the conditional variance of any unbiased estimator that is linear
in y is greater than or equal to Var ^ GLS X in the matrix sense.
Remark: Var ( bj X) Var ^ GLS X is a positive semide…nite matrix. In particular,
Var bj X Var ^ j;GLS X :

114
2.8.3 A Special Case: Weighted Least Squares (WLS)
Let’s suppose that

2 2
E "i X = vi (vi is a function of X).
Recall: C is such that V 1 = C0C .
We have
2 3 2 3
v1 0 0 1=v1 0 0
6 0 v 0 7 6 0 1=v2 0 7
V = 6
6 .. 2
... . . . ...
7
7) V 1 6
= 6 ... ... ... ...
7
7)
4 . 5 4 5
0 0 vn 0 0 1=vn
2 p 3
1= v1 0 0
6 p 7
0 1= v2 0
C = 6
6 ... ... ... ...
7
7:
4 5
p
0 0 1= vn
115
Now
2 y 3
2
p 32 3 p1
1= v1 0 0 y1 6 yv1 7
6 p 76 y 7 6 p2 7
6 0 1= v2 0 76 2 7 6 7
y
~ = Cy = 6 .
.. .
.. ... .
.. 7 6 .. 7 = 6 .v2 7
4 54 . 5 6 .. 7
p 4 y 5
0 0 1= vn yn pn
vn
2 p 32 3
1= v1 0 0 1 x12 x1K
6 p 76 1 7
~ = CX = 6 6 0 1 = v2 0 76 x22 x2K 7
X ... ... ... ... 7 6 .. ... ... ... 7
4 54 . 5
p
0 0 1= vn 1 xn2 xnK
2 p p p 3
1= v1 x12= v1 x1K = v1
6 1=pv x =
p
v x =
p 7
v2 7
6 2 22 2 2K
= 6 ... ... ... ... 7:
4 5
p p p
1= vn xn2= vn xnK = vn
Another way to express these relations:

y x
y~i = p i ; ~ik = pik ;
x i = 1; 2; :::; n:
vi vi
116
Example. Suppose that yi = + xi2 + "i;

Var ( yij xi2) = Var ( "ij xi2) = 2exi2 ; Cov yi; yj xi2; xj2 = 0
2 3
ex12 0 0
6 ... ... 7
6 7
6 7
V =6
6 0 exi2 0 7:
7
6 ... ... 7
4 5
0 0 exn2
Transformed model (matrix notation):
Cy = 2
CX + C" 3
2 3 2 3
pyx1 p 1x x
p 12 " # p "1
6 e. 12 7 6 e 12
6 ex12 77 6 ex12 7
6 .. 7 = 6 ... ... 7 1 +6 ... 7
4 5 4 5 4 5
pyxn p 1x pxn2 2 p "xn
e n2 e n2 exn 2 e n2
or (scalar notation):
y~i = x
~i1 1 + x "i ,
~i2 2 + ~ i = 1; :::; n
y 1 xi2 "i
p i =p 1 + p 2 + p , i = 1; :::; n:
x
e i2 x
e i2 x
e i2 x
e i2
117
Notice:
!
" 1 1
"ij X) = Var p ix xi2
Var (~ = x Var ( "ij xi2) = x 2exi2 = 2:
e i2 e i2 e i2
E¢ cient estimation under a known form of heteroskedasticity is called the weighted regression
(or the weighted least squares (WLS)).
Example. Consider wagei = 1 + 2educi + 3experi + "i:
30 30
25 25
20 20
WAGE
WAGE
15 15
10 10
5 5
0 0
0 10 20 30 40 50 60 0 4 8 12 16 20
EXPER EDUC
118
300
250
Dependent Variable: WAGE
Method: Least Squares 200
Sample: 1 526
RES2
150
C -3.390540 0.766566 -4.423023 0.0000
EDUC 0.644272 0.053806 11.97397 0.0000
EXPER 0.070095 0.010978 6.385291 0.0000 100

Adjusted R-squared 0.222199 S.D. dependent var 3.693086 50
Log likelihood -1365.969 Hannan-Quinn criter. 5.214729 0
F-statistic 75.98998 Durbin-Watson stat 1.820274 0 4 8 12 16 20
EDUC
Assume Var ( "ij educi; experi) = 2educ2i : Transformed model:

wagei 1 educi experi
= + 2 + 3 "i ,
+~ i = 1; :::; n
educi educi educi educi
119
Dependent Variable: WAGE/EDUC

Sample: 1 526 IF EDUC>0
1/EDUC -0.709212 0.549861 -1.289800 0.1977

EDUC/EDUC 0.443472 0.038098 11.64033 0.0000
EXPER/EDUC 0.055355 0.009356 5.916236 0.0000

Exercise 2.23. Let fyi; i = 1; 2; :::g be a sequence of independent random variables with
distribution N ; 2i ; where 2i is known (note: we assume 21 6= 22 6= :::). When
the variances are unequal, the sample mean y is not the best linear unbiased estimator,
Pn
i.e. BLUE). The BLUE has the form y~ = i=1 wiyi where wi are nonrandom weights.
(a) Find a condition on wi such that E (~ y ) = ; (b) Find the optimal weights wi that
make y~ the BLUE. Hint: You may translate this problem into an econometric framework:
if fyig is a sequence of independent random variables with distribution N ; 2i then yi
can be represented by the equation yi = + "i; where "i N 0; 2i : Then …nd the GLS
estimator of :
120

yi = xi1 + "i; >0
and assume E ( "ij X) = 0; Var ( "ij X) = 1 + jxi1j ; Cov "i; "j X = 0: (a) Suppose we
have a lot of observations and plot a graph of the observation of yi and xi2. How would the
scattered plot look like? (b) Propose an unbiased estimator with minimum variance; (c)
Suppose we have the 3 following observation of (xi2; yi): (0; 0); (3; 1) and (8; 5). Estimate
the value of from these 3 observations.
yt = 1 + 2t + "t; Var ("i) = 2t2; i = 1; :::; 20

1
Find 2 X0X ; Var ( bj X) and Var ^ GLS X and comment on the results. Hint:
" # " #
2 1 ? 0:01578 ? 1:6326
X0 X = 2 ; Var ( bj X) = 2
0:01578 ? 1:6326 ?
" #
? 0:1895
Var ^ GLS X = 2 :
0:1895 ?
121
Exercise 2.26. A research …rst ran a OLS regression. Then she was given the true V matrix.
She transformed the data appropriately and obtained the GLS estimator. For several coe¢ -
cient, standard errors in the second regression were larger than those in the …rst regression.
Does this contradict 1.7 proposition? See the previous exercise.
2.8.4 Limiting Nature of GLS
Finite-sample properties of GLS rest on the assumption that the regressors are strictly
exogenous. In time-series models the regressors are not strictly exogenous and the error
is serially correlated.
In practice, the matrix function V is unknown.
V can be estimated from the sample. This approach is called the Feasible Generalized
Least Squares (FGLS). But if the function V is estimated from the sample, its value V
becomes a random variable, which a¤ects the distribution of the GLS estimator. Very
little is known about the …nite-sample properties of the FGLS estimator. We need to
use the large-sample properties ...
122
3 Large-Sample Theory
The …nite-sample theory breaks down if one of the following three assumptions is violated:
1. the exogeneity of regressors,
2. the normality of the error term, and
3. the linearity of the regression equation.
This chapter develops an alternative approach based on large-sample theory (n is “su¢ ciently
large”).
123
3.1 Review of Limit Theorems for Sequences of Random Variables
3.1.1 Convergence in Probability in Mean Square and in Distribution
Convergence in Probability
A sequence of random scalars fzng converges in probability to a constant (non-random)

if, for any " > 0,
lim P (jzn j > ") = 0:
n!1
We write
p
zn ! or plim zn = :
As we will see, zn is usually a sample mean

Pn Pn
i=1 yi i=1 zi
zn = or zn = :
n n
124
Example. Consider a fair coin. Let zi = 1 if the ith toss results in heads and zi = 0
1 Pn p
otherwise. Let zn = n i=1 zi : The following graph suggests that zn ! 1=2:
125
A sequence of K dimensional vectors fzng converges in probability to a K -dimensional

vector of constants if, for any " > 0,
lim P (jznk k j > ") = 0; 8k

n!1
We write
p
zn ! :
Convergence in Mean Square
A sequence of random scalars fzig converges in mean square (or in quadratic mean) to a
if
h i
2
lim E (zn ) =0
n!1
The extension to random vectors is analogous to that for convergence in probability.

126
Convergence in Distribution
Let fzng be a sequence of random scalars and Fn be the cumulative distribution function
(c.d.f.) of zn, i.e. zn Fn. We say that fzng converges in distribution to a random scalar
z if the c.d.f. Fn, of zn , converges to the c.d.f. F of z at every continuity point of F . We
write
d
zn ! z; where z F;
F is is the asymptotic (or limiting) distribution of z . If F is well-known, for example, if F
is the cumulative normal N (0; 1) distribution we prefer to write
d d
zn ! N (0; 1) (instead of zn ! z and z N (0; 1)):
d
Example. Consider zn t(n): We know that zn ! N (0; 1) :
In most applications zn is of type

p
zn = n (y E (yi)) :
p
Exercise 3.1. For zn = n (y E (yi)) calculate E (zn) and Var (zn) (assume E (yi) = ;
Var (yi) = 2 and fyig is an i.i.d. sequence).
127
3.1.2 Useful Results
Lemma (2.3 - preservation of convergence for continuous transformation). Suppose f is a

vector-valued continuous function that does not depend on n. Then:
p p
(a) if zn ! ) f (zn) ! f ( ) ;
d d
(b) if zn ! z ) f (zn) ! f (z) :
An immediate implication of Lemma 2.3 (a) is that the usual arithmetic operations preserve
convergence in probability:
p p p
xn ! ; yn ! ) xn + yn ! + :
p p p
xn ! ; yn ! ) xnyn ! :
p p p
xn ! ; yn ! ) xn=yn ! = ; 6= 0:
p p
Yn ! ) Yn 1 ! 1 ( is invertible).
128
Lemma (2.4). We have
d p d
(a) xn ! x; yn ! ) xn + yn ! x + :
d p p
(b) xn ! x; yn ! 0 ) y0nxn ! 0:
d p d
(c) xn ! x; An ! A ) Anxn ! Ax: In particular if x N (0; ) ; then
d
Anxn ! N 0; A A0 :
d p d
(d) xn ! x; An ! A ) x0nAn 1xn ! x0A 1x (A is nonsingular).
p
If xn ! 0 we write xn = op (1) :
p
If xn yn ! 0 we write xn = yn + op (1) :
d
In part (c) we may write Anxn = Axn (Anxn and Axn have the same asymptotic
distribution).
129
3.1.3 Viewing Estimators as Sequences of Random Variables
Let ^n be an estimator of a parameter vector based on a sample of size n. We say that

an estimator ^n is consistent for if
^n p! :
The asymptotic bias of ^n, is de…ned as plimn!1 ^n : So if the estimator is consistent,

its asymptotic bias is zero.
Wooldridge’s quotation:
While not all useful estimators are unbiased, virtually all economists agree that
consistency is a minimal requirement for an estimator. The famous econometrician
Clive W.J. Granger once remarked: “If you can’t get it right as n goes to in…nity,
you shouldn’t be in this business.” The implication is that, if your estimator of a
particular population parameter is not consistent, then you are wasting your time.
130
A consistent estimator ^n is asymptotically normal if

p d
n ^n ! N (0; ):
p
Such an estimator is called n-consistent.
The variance matrix is called the asymptotic variance and is denoted Avar ^n ; i.e.
p
lim Var n ^n = Avar ^n = :
n!1
Some authors use the notation Avar ^n to mean =n (which is zero in the limit).
131
3.1.4 Laws of Large Numbers and Central Limit Theorems
Consider
n
1X
zn = zi:
n i=1
p
We say that zn obeys to the LLN if zn ! where = E (zi) or limn E (zn) = :
(A Version of Chebychev’s Weak LLN) If

lim E (zn) = p
) zn ! .
lim Var (zn) = 0
p
(Kolmogorov’s Second Strong LLN) If fzig is i.i.d. with E (zn) = ) zn ! :
These LLNs extend readily to random vectors by requiring element-by-element convergence.

132
Theorem 1 (Lindeberg-Levy CLT). Let fzig be i.i.d. with E (zn) = and Var (zi) = :
Then
n
p 1 X d
n (zn )=p (zi ) ! N (0; ) :
n i=1
Notice that
p
E n (zn ) = 0 ) E (zn) =
p
Var n (zn ) = ) Var (zn) = =n
Given the previous equations, some authors write
!
a
zn N ; :
n
133
Example. Let fzig be i.i.d. with distribution 2(1): By the Lindeberg-Levy CLT (scalar case)
we have
n !
1X 2
a
zn = zi N ;
n i=1 n
where
n
1X
E (zn) = E (zi) = E (zi) = = 1;
n i=1
0 1
Xn 2
1 1 2
Var (zn) = Var @ ziA = Var (zi) = = :
n i=1 n n n
134
-
3210.1
0.4
0.3
0.2
Probability Density Function of

p
Probability Density Function of zn (obtained by n (zn ) (exact expressions for
Monte-Carlo Simulation) n = 5; 10 and 50)
135
Example. In a random sampling, sample size = 30; on the variable z with E (z ) = 10;
Var (z ) = 9 but unknown distribution, obtain an approximation to P (zn < 9:5) : We do
not know the exact distribution of zn: However, from Lindeberg-Levy CLT we have
!
p (zn ) d 2
a
n ! N (0; 1) or zn N ; :
n
Hence,
!
p (zn ) p (9:5 10)
P (zn < 9:5) = P n < 30
3
' ( 0:9128) , [ is the cdf of N (0; 1) ]
= 0:1807:
136
3.2 Fundamental Concepts in Time-Series Analysis
Stochastic process (SP): is a sequence of random variables. For this reason, it is more
adequate to write a SP as fzig (means a sequence of random variables) rather than zi
(means the random variable at time i).
137
3.2.1 Various Classes of Stochastic processes
De…nition (Stationary Processes). A SP fzig is (strictly) stationary if the joint distribution

of (z1; z2; :::; zs) equals to that of zk+1; zk+2; :::; zk+s for any s 2 N and k 2 Z:
Exercise 3.2. Consider a SP fzig where E (jg (zi)j) < 1: Show that if fzig is a strictly
stationary process then E (g (zi)) is constant and do not depend on t:
The de…nition implies that any transformation (function) of a stationary process is itself
stationary,
n othat is, if fzig is stationary, then fg (zi)g is. For example, if fzig is stationary
then ziz0i is also a SP.
De…nition (Covariance Stationary Processes). A stochastic process fzig is weakly (or co-
variance) stationary if: (i) E (zi) does not depend on i , and (ii) Cov zi; zi j exists, is
…nite, and depends only on j but not on i:
If fzig is a covariance SP then Cov (z1; z5) = Cov (z1001; z1005).
A transformation (function) of a covariance stationary process may or may not be a covari-

ance stationary process.
138
q
Example. It can be proved that fzig ; zi = 0 + 1zi2 1"i; where f"ig is i.i.d. with mean
q
zero and unit variance and 0> 0 and 1=3 1< 1 is a covariance stationary process.
However, wi = zi2 is not a covariance stationary process as E wi2 does not exist.
Exercise 3.3. Consider the SP futg where
8
>
> t if t 2000
<
ut = q
>
> k 2 if t > 2000
:
k t
iid iid
where t and s are independent for all t and s and t N (0; 1) and s t(k). Explain
why futg is weakly (or covariance) stationary but not strictly stationary.
De…nition (White Noise Processes). A white noise process fzig is a covariance stationary
process with zero mean and no serial correlation:
E (zi) = 0; Cov zi; zj = 0, i 6= j:

139
Y Y
8 25
4
20
0
15
-4
10
-8
5
-12
0
-16
-20 -5
25 50 75 100 125 150 175 200 25 50 75 100 125 150 175 200
Y Y5
10 4
3
0
2
-10 1
0
-20
-1
-30 -2
-3
-40
-4
-50 -5
25 50 75 100 125 150 175 200 10 20 30 40 50 60 70 80 90
140
In the literature there is not a unique de…nition of ergodicity. We prefer to call “weakly
dependent process” to what Hayashi calls “ergodic process”.
De…nition. A stationary process fzig is said to be a weakly dependent process (= ergodic in
Hayashi’s de…nition) if, for any two bounded functions f : Rk+1 ! R and g : Rs+1 ! R;
lim E f zi; ::; zi+k g (zi+n; ::; zi+n+s)

n!1
= lim E f zi; ::; zi+k jE (g (zi+n; ::; zi+n+s))j :
n!1
Theorem 2 (S&WD). Let fzig be a stationary weakly dependent (S&WD) process with
p
E (zi) = : Then zn ! :
Serial dependence, which is ruled out by the i.i.d. assumption in Kolmogorov’s LLN, is
allowed in this Theorem, provided that it disappears in the long run. Since, for any function
f , ff (zi)g is a S&WD stationary whenever fzig is, this theorem implies that any moment
of a S&WD process (if it exists and is …nite) is consistently estimated by the sample moment.
For example, suppose fzig is a S&WD process and E ziz0i exists and is …nite. Then
n
1X p
zn = zizi ! E ziz0i :
0
n i=1
141
De…nition (Martingale). A vector process fzig is called a martingale with respect to fzig if
E ( zij zi 1; :::; z1) = zi 1 for i 2:

The process
z i = z i 1 + "i
where f"ig is a white noise process with E ( "ij zi 1) = 0, is a martingale since
E ( zij zi 1; :::; z1) = E ( zij zi 1) = zi 1 + E ( "ij zi 1) = zi 1:

De…nition (Martingal Di¤erence Sequence). A vector process fgig with E (gi) = 0 is called
a martingale di¤erence sequence (MDS) or martingale di¤erences if
E ( gij gi 1; :::; g1) = 0:
If fzig is a martingale, the process de…ned as zi = zi zi 1 is a MDS.

Proposition. If fgig is a MDS then Cov gi; gi j = 0, j 6= 0:
142
By de…nition
0 1 0 1
n
X n
X nX1 X
n
1 @ A
1 @
Var (gn) = 2 Var gt = 2 Var (gt) + 2 Cov gi; gi j A :
n t=1 n t=1 j=1 i=j+1
However, if fgig is a stationary MDS with …nite second moment then
n
X
Var (gt) = n Var (gt) ; Cov gi; gi j = 0;
t=1
so
1
Var (gn) = Var (gt) :
n
De…nition (Random Walk). Let fgig be a vector independent white noise process. A random
walk, fzig, is a sequence of cumulative sums:
zi = gi + gi 1 + ::: + g1:
Exercise 3.4. Show that the random walk can be written as
zi = zi 1 + gi ; z1 = g1:
143
3.2.2 Di¤erent Formulation of Lack of Serial Dependence
We have three formulations of a lack of serial dependence for zero-mean covariance stationary
processes:
(1) fgig is independent white noise.
(2) fgig is stationary MDS with …nite variance.
(3) fgig is white noise.
(1) ) (2) ) (3):

Exercise
q 3.5 (Process that satis…es (2) but not (1) - the ARCH process). Consider gi =
2
0 + 1 gi 1 "i; where f"i g is i.i.d. with mean zero and unit variance and 0 > 0 and
j 1j < 1. Show that fgig is a MDS but not a independent white noise.
144
3.2.3 The CLT for S&WD Martingale Di¤erences Sequences
Theorem 3 (Stationary Martingale Di¤erences CLT (Billingsley, 1961) ). Let fgig be a

vector martingale di¤erence sequence that is S&WD process with E gigi0 = and let
P
gi = n1 gi. Then
n
p 1 X d
ngn = p gi ! N (0; ):
n i=1
Theorem 4 (Martingale Di¤erences CLT (White, 1984)). Let fgig be a vector martingale
di¤erence sequence. Suppose that (a) E gigi0 = t is a positive de…nite matrix with
1 Pn (positive de…nite matrix), (b) g has …nite 4th moment, (c) n1
P p
gigi0 !
n i=1 t!
: Then
n
p 1 X d
ngn = p gi ! N (0; ):
n i=1
145
3.3 Large-Sample Distribution of the OLS Estimator
The model presented in this section has probably the widest range of economic applications:
No speci…c distributional assumption (such as the normality of the error term) is required;
The requirement in …nite-sample theory that the regressors be strictly exogenous or …xed
is replaced by a much weaker requirement that they be "predetermined."
Assumption (2.1 - linearity). yi = x0i + "i:

Assumption (2.2 - S&WD). f(yi; xi)g is jointly S&WD.
Assumption (2.3 - predetermined regressors). All the regressors are predetermined in the
sense that they are orthogonal to the contemporaneous error term: E (xik "i) = 0; 8i; k.
This can be written as
E (xi"i) = 0 or E (gi) = 0 where gi = xi"i:

Assumption (2.4 - rank condition). E xix0i = xx is nonsingular.
146
Assumption (2.5 - fgig is a martingale di¤erence sequence with …nite second moments).
fgig ; where gi = xi"i; is a martingale di¤erence sequence (so a fortiori E (gi) = 0.
The K K matrix of cross moments, E gigi0 , is nonsingular. We use S for Avar (g ) (the
p P
variance of ng; where g = n1 gi). By Assumption 2.2 and S&WD martingale Di¤erences
CLT, S = E gigi0 :
Remarks:
1. (S&WD) A special case of S&WD is that f(yi; xi)g is i.i.d. (random sample in cross-
sectional data).
2. (The model accommodates conditional heteroskedasticity) If f(yi; xi)g is stationary,

then the error term "i = yi x0i is also stationary. The conditional moment
2
E "i xi can depend on xi
without violating any previous assumption, as long as E "2i is constant.

147
3. (E (xi"i) = 0 vs. E ( "ij xi) = 0) The condition E ( "ij xi) = 0 is stronger than
E (xi"i) = 0. In e¤ect,
E (xi"i) = E (E ( xi"ij xi))

= E (xi E ( "ij xi))
= E (xi0) = 0:
4. (Predetermined vs. strictly exogenous regressors) Assumption 2.3, restricts only the
contemporaneous relationship between the error term and the regressors. The exogeneity
assumption (Assumption 1.2) implies that, for any regressor k, E xjk "i = 0 for all i
and j; not just for i = j: Strict exogeneity is a strong assumption that does not hold in
general for time series models.
148
5. (Rank condition as no multicollinearity in the limit) Since

! 1
X0 X X0 y 1X 1 1X
b= = xix0i xiy = Sxx1Sxy
n n n n
where
X0 X 1X
Sxx = = xix0i (sample average of xix0i)
n n
0
Xy 1X
Sxy = = xiyi (sample average of xiyi).
n n
By Assumptions 2.2, 2.4 and theorem S&WD we have
n
X0 X 1X p
= xix0i ! E xix0i :
n n i=1
0
Assumption 2.4 guarantees that the limit in probability of XnX has rank K:
149
6. (A su¢ cient condition for fgig to be a MDS) Since a MDS is zero-mean by de…nition,
Assumption 2.5 is stronger than Assumption 2.3 (this latter is redundant in face of
Assumption 2.5). We will need Assumption 2.5 to prove the asymptotic normality of
the OLS estimator. A su¢ cient condition for fgig to be an MDS is
E ( "ij Fi) = 0 where

Fi = Ii 1 [ xi = f"i 1; "i 2; :::; "1; xi; xi 1; :::; x1g ;
Ii 1 = f"i 1; "i 2; :::; "1; xi 1; :::; x1g :
(This condition implies that the error term is serially uncorrelated and also is uncorrelated
with the current and past regressors). Proof. Notice: fgig is a MDS if
E ( gij gi 1; :::; g1) = 0; gi = xi"i:

Now, using the condition E ( "ij Fi) = 0;
E ( xi"ij gi 1; :::; g1) = E [ E ( xi"ij Fi)j gi 1; :::; g1] = E [0j gi 1; :::; g1] = 0
thus E ( "ij Fi) = 0 ) fgig is a MDS.
150
7. (When the regressors include a constant) Assumption 2.5 is

02 3 1
1
B6 7 C
E ( xi"ij gi 1; :::; g1) = E @ 4 ::: 5 "i gi 1; :::; g1A = 0 ) E ( "ij gi 1; :::; g1) = 0:
xiK
E ( "ij "i 1; :::; "1) = E ( E ( "ij gi 1; :::; g1)j "i 1; :::; "1) = 0:
Assumption 2.5 implies that the error term itself is a MDS and hence is serially uncorrelated.
8. (S is a matrix of fourth moments)
S = E gigi0 = E xi"ix0i"i = E "2i xix0i :

Consistent estimation of S will require an additional assumption.
151
9. (S will take a di¤erent expression without Assumption 2.5) In general

0 1 0 1
p p 1X n Xn
1
Avar (g) = Var ng = Var @ n giA = Var @ p giA
n i=1 n i=1
0 1
n
X
1
= Var @ gi A
n i=1
0 1
n
X nX1 X
n
1@ A
= Var (gi) + Cov gi; gi j + Cov gi j ; gi
n i=1 j=1 i=j+1
1X n
1 nX1 X n
0 0
= Var (gi) + E gigi j + E gi j gi :
n i=1 n j=1 i=j+1
Given stationarity, we have
n
1X
Var (gi) = Var (gi) :
n i=1
Thanks to the assumption 2.5 we have E gigi0 j = E gi j gi0 = 0 so
S = Avar (g) = Var (gi) = E gigi0 :

152
Proposition (2.1- asymptotic distribution of the OLS Estimator). (a) (Consistency of b for
) Under Assumptions 2.1-2.4,
p
b ! :
(b) (Asymptotic Normality of b) If Assumption 2.3 is strengthened as Assumption 2.5, then
p d
n (b ) ! N (0; Avar (b))
where
Avar (b) = 1 1
xx S xx :
^
(c) (Consistent Estimate of Avar (b)) Suppose there is available a consistent estimator S
of S: Then under Assumption 2.2, Avar (b) is consistently estimated by
^ xx1
[ (b) = Sxx1SS
Avar
where
n
X0 X 1X
Sxx = = xix0i:
n n i=1
153
Proposition (2.2 - consistent estimation of error variance). Under the Assumptions 2.1- 2.4,
n
X
1 p
s2 = e2i ! E "2i
n K i=1
provide E "2i exists and is …nite.
Under conditional homocedasticity E "2i xi = 2 (we will see this in detail later) we
have,
S = E gigi0 = E "2i xix0i = ::: = 2 0
E xixi =
2
xx
and
Avar (b) = 1S 1= 1 2 1 = 2 xx1;

xx xx xx xx xx
! 1
0
XX 1
[ (b) = s2
Avar = s2 n X0 X :
n
Thus
0 1
a [ (b)
Avar 1
b N@ ; A=N ; s 2 X0 X
n
154
3.4 Statistical Inference
Derivation of the distribution of test statistics is easier than in …nite-sample theory because
we are only concerned about the large-sample approximation to the exact distribution.
Proposition (2.3 - robust t-ratio and Wald statistic). Suppose Assumptions 2.1-2.5 hold,
^ of S. As before, let Avar
and suppose there is available a consistent estimate of S [ (b) =
^ 1: Then
Sxx1SS xx
(a) Under the null hypothesis H0 : k = 0k

1
bk 0 [ ( bk )
Avar
Sxx1SS
^ xx
d
t0k = k ! N (0; 1) ; where ^ 2bk = = kk :
^ bk n n
(b) Under the null hypothesis H0 : R = r; with rank (R) = p
1 d 2 :
W = n (Rb r)0 RAvar
[ (b) R0 (Rb r) ! (p)
155
Remarks
^ bk is called is called the heteroskedasticity-consistent standard error, (heteroskedastic-

ity) robust standard error, or White’s standard error. The reason for this terminology is
that the error term can be conditionally heteroskedastic. The t-ratio is called the robust
t-ratio.
The di¤erences from the …nite-sample t-test are: (1) the way the standard error is
calculated is di¤erent, (2) we use the table of N (0; 1) rather than that of t(n K),
and (3) the actual size or exact size of the test (the probability of Type I error given
the sample size) equals the nominal size (i.e., the desired signi…cance level ) only
approximately, although the approximation becomes arbitrarily good as the sample size
increases. The di¤erence between the exact size and the nominal size of a test is called
the size distortion.
Both tests are consistent in the sense that
power = P (rejecting the null H0j H1 is true) ! 1 as n ! 1:

156
3.5 Estimating S = E "2i xix0i Consistently
How to select an estimator for a population parameter? One of the most important method
is the analog estimation method or the method of moments. The method of moment
principle: To estimate a feature of the population, use the corresponding feature of the
sample.
Examples of analog estimators:
Parameter of the population Estimator
E (yi) Y
Var (yi) Sy2
xy Sxy
2 2
x Pn Sx
i=1 Ifyi cg
P (yi c) n
median (yi) sample median
max(yi) maxi=1;:::;n (yi)
157
The analogy principle suggests that E "2i xix0i can be estimated using the estimator
n
1X
"2i xix0i:
n i=1
Since "i is not observable we need another one:
Xn
1
^=
S e2i xix0i:
n i=1
2
Assumption (2.6 - …nite fourth moments for regressors). E xik xij exists and is …nite
for all k and j (k; j = 1; :::; K ) :
Proposition (2.4 - consistent estimation of S). Suppose S = E "2i xix0i exists and is …nite.
Then, under Assumptions 2.1-2.4 and 2.6, S ^ is consistent for S:
158
The estimator S can be represented as

2 3
e21 0 0
n
X 0 BX 6 7
1 2 0 X 6 0 e2 0 7
^=
S ei xixi = where B =6 ...2 7:
6 ... ... 7
n i=1 n 4 5
0 0 e2n
1^ 1 1 1
[
Thus, Avar (b) = Sxx SSxx = n X0X X0BX X0X . We have
!
a [ b)
Avar( ^ xx1
Sxx1SS 1 1
b N ; n =N ; n =N ; X0 X X0BX X0X
0 1
W = n (Rb [ (b) R0
r) RAvar (Rb r)
1
= n (Rb r) 0 ^ xx1R0
RSxx1SS (Rb r)
0 1 d
0 1 1 2
= (Rb r) R X0 X X0BX X0 X R (Rb r) ! (p)
159

Sample: 1 526
C -1.567939 0.724551 -2.164014 0.0309

FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000


Sample: 1 526
C -1.567939 0.825934 -1.898382 0.0582

FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

160
3.6 Implications of Conditional Homoskedasticity
Assumption (2.7 - conditional homoskedasticity). E "2i xi = 2 > 0:
Under Assumption 2.7 we have

S = E "2i xix0i = ::: = 2 0
E xixi =
2
xx and
Avar (b) = 1 1 2 1 1 2 1
xx S xx = xx xx xx = xx :
Proposition (2.5 - large-sample properties of b, t , and F under conditional homoskedas-
ticity). Suppose Assumptions 2.1-2.5 and 2.7 are satis…ed. Then
(a) (Asymptotic distribution of b) The OLS estimator b is consistent and asymptotically

normal with
Avar (b) = 2 xx1:
(b) (Consistent estimation of asymptotic variance) Under the same set of assumptions,
Avar (b) is consistently estimated by
1
[ (b) = s2Sxx1 = ns2 X0X
Avar :
161
(c) (Asymptotic distribution of the t and F statistics of the …nite-sample theory)
Under H0 : k = 0k we have
bk 0 [ ( bk )
Avar 1
d
t0k = k ! N (0; 1) ; where ^ 2bk = 2 0
=s XX :
^ bk n kk
Under H0 : R = r with rank (R) = p, we have

d
pF 0 ! 2(p)
1 1
where F 0 = (Rb r)0 R X0X R0 (Rb r) = ps2 :
Notice
e 0e e0e d
pF 0 = 0 ! 2
(p)
e e= (n K )
where refers to the short regression or the regression subjected to the constraint R =r
Remark (No need for fourth-moment assumption) By S&WD and Assumptions 2.1-2.4,
p
s2Sxx ! 2 xx = S: We do not need the fourth-moment assumption (Assumption 2.6)
for consistency.
162
3.7 Testing Conditional Homoskedasticity
With the advent of robust standard errors allowing us to do inference without specifying the
conditional second moment testing conditional homoskedasticity is not as important as it
used to be. This section presents only the most popular test due to White (1980) for the
case of random samples.
Let i be a vector collecting unique and nonconstant elements of the K K symmetric

matrix xix0i.
Proposition (2.6 - White’s Test for Conditional Heteroskedasticity). In addition to Assump-
tions 2.1 and 2.4, suppose that (a) f(yi; xi)g is i.i.d. with …nite E "2i xix0i (thus strength-
ening Assumptions 2.2 and 2.5), (b) "i is independent of xi (thus strengthening Assumption
2.3 and conditional homoskedasticity), and (c) a certain condition holds on the moments of
"i and xi. Then
d
nR2 ! 2(m)
where R2 is the R2 from the auxiliary regression of e2i on a constant and i and m is the
dimension of i:
163

Sample: 1 526
Included observations: 526
C -1.567939 0.724551 -2.164014 0.0309

FEMALE -1.810852 0.264825 -6.837915 0.0000
EDUC 0.571505 0.049337 11.58362 0.0000
EXPER 0.025396 0.011569 2.195083 0.0286
TENURE 0.141005 0.021162 6.663225 0.0000

164
Heteroskedasticity Test: White
F-statistic 5.911627 Prob. F(13,512) 0.0000

Obs*R-squared 68.64843 Prob. Chi-Square(13) 0.0000
Scaled explained SS 227.2648 Prob. Chi-Square(13) 0.0000
Test Equation:
Dependent Variable: RESID^2
C 47.03183 20.19579 2.328794 0.0203

FEMALE -7.205436 10.92406 -0.659593 0.5098
FEMALE*EDUC 0.491073 0.778127 0.631097 0.5283
FEMALE*EXPER -0.154634 0.168490 -0.917768 0.3592
FEMALE*TENURE 0.066832 0.351582 0.190089 0.8493
EDUC -7.693423 2.596664 -2.962811 0.0032
EDUC^2 0.315191 0.086457 3.645652 0.0003
EDUC*EXPER 0.045665 0.036134 1.263789 0.2069
EDUC*TENURE 0.083929 0.054140 1.550226 0.1217
EXPER 0.000257 0.610348 0.000421 0.9997
EXPER^2 -0.009134 0.007010 -1.303002 0.1932
EXPER*TENURE -0.004066 0.017603 -0.230969 0.8174
TENURE -0.298093 0.934417 -0.319015 0.7498
TENURE^2 -0.004633 0.016358 -0.283255 0.7771

165

C -1.567939 0.825934 -1.898382 0.0582

FEMALE -1.810852 0.254156 -7.124963 0.0000
EDUC 0.571505 0.061217 9.335686 0.0000
EXPER 0.025396 0.009806 2.589912 0.0099
TENURE 0.141005 0.027955 5.044007 0.0000

3.8 Estimation with Parameterized Conditional Heteroskedasticity
Even when the error is found to be conditionally heteroskedastic, the OLS estimator is still
consistent and asymptotically normal, and valid statistical inference can be conducted with
robust standard errors and robust Wald statistics. However, in the (somewhat unlikely) case
of a priori knowledge of the functional form of the conditional second moment, it should be
possible to obtain sharper estimates with smaller asymptotic variance.
166
To simplify the discussion, throughout this section we strengthen Assumptions 2.2 and 2.5
by assuming that f(yi; xi)g is i.i.d.
3.8.1 The Functional Form
The parametric functional form for the conditional second moment we consider is
2 0
E "i xi = zi
where zi is a function of xi:
Por example, E ( "ij xi) = 1 + 2x2i2;
z0i = 1 x2i2 :
167
3.8.2 WLS with Known
The WLS (also GLS) estimator can be obtained by applying the OLS to the regression
~0i + ~"i
y~i = x
where
y x "
y~i = q i ; ~ik = q ik ;
x "i = q i ;
~ i = 1; 2; :::; n
z0i z0i z0i
We have
1 1X 1 1 y:
~ 0X
^ GLS = ^ (V) = X ~ ~ 0y
X ~ = X0 V X0 V
168
Note that
"i j x
E (~ ~ i ) = 0:
Therefore, provided that E x ~0i is nonsingular, Assumptions 2.1-2.5 are satis…ed for equa-
~i x
tion y~i = x ~0i +~"i. Furthermore, by construction, the error ~"i is conditionally homoskedastic:
E (~"i j x
~i) = 1. So Proposition 2.5 applies: the WLS estimator is consistent and asymptoti-
cally normal, and the asymptotic variance is
1
Avar ^ (V) = E x ~0i
~i x
0 1 1
n
X
1
= plim @ x ~0iA
~i x (by S&WD theorem)
n i=1
1 0 1
= plim X V 1X :
n
Thus n1 X0V 1X is a consistent estimator of Avar ^ (V) :
169
3.8.3 Regression of e2i on zi Provides a Consistent Estimate of
If is unknown we need to obtain ^ : Assuming E "2i xi = z0i we have
"2i = E "2i xi + i
where by construction E ( ij xi) = 0: This suggest that the following regression can be
considered
"2i = z0i + i
Provided that E ziz0i is nonsingular, Proposition 2.1 is applicable to this auxiliary regres-
sion: the OLS estimator of is consistent and asymptotically normal. However we cannot
run this regression as "i is not observable. In the previous regression we should replace "i
by the consistent estimate ei (despite the presence of conditional heteroskedasticity). In
conclusion, we may obtain a consistent estimate of by considering the regression of e2i on
zi to get
0 1 1
n
X Xn
^ =@ ziz0iA zie2i :
i=1 i=1
170
3.8.4 WLS with Estimated
Step 1: Estimate the equation yi = x0i + "i by OLS and compute the OLS residuals ei:
Step 2: Regress e2i on zi to obtain the OLS coe¢ cient estimate ^ .
Step 3: Transform the original variables according to the rules

y x
y~i = q i ; ~ik = q ik ;
x i = 1; 2; :::; n
0
zi ^ 0
zi ^
~0i
and run the OLS estimator with respect to the model y~i = x "i to obtain the
+~
Feasible GLS (FGLS):
1X 1 1y
^ = X0 V
^ V ^ X0 V
^
171
It can be proved that:
^ V
^ p
!
p d
n ^ V
^ ! N 0; Avar ^ (V)
1 X0 V
^ 1X is a consistent estimator of Avar ^ (V) :
n
No …nite properties are known concerning the estimator ^ V

^ :
172
3.8.5 A popular speci…cation for E "2i xi
The especi…cation "2i = z0i + i may lead to z0i ^ < 0: To overcome this problem a popular
speci…cation for E "2i xi is
n o
E "2i xi = exp 0x
i
(it guarantees that Var ( yij xi) > 0 for all 2 Rr ): It implies log E "2i xi = 0x :
i This
suggests the following procedure:
a) Regress y on X to get the residual vector e:

b) Run the LS regression log e2i on xi to estimate and calculate
n o
^ 2i = exp 0
^ xi :
x
c) Transform the data y~i = ^yi ; ~ij = îj .
x
i i
d) Regress y ~ and obtain ^ V
~ on X ^
173
Notice also that:

n o
E "2i xi = exp 0 xi
n o
"2i = exp 0 xi + vi; vi = "2i 2
E "i xi
log "2i 0x + v
i i
log e2i 0x + v :
i i
Example (Part 1). We want to estimate a demand function for daily cigarette consumption
(cigs). The explanatory variables are: log(income) - log of annual income, log(cigprice) -
log of per pack price of cigarettes in cents, educ - years of education, age and restaurn
- binary indicator equal to unity if the person resides in a state with restaurant smoking
restrictions (source: J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and
Statistics 79, 596-593).
Based on information below, are the standard errors reported in the …rst table reliable?
174
Heteroskedasticity Test: White

Scaled explained SS 110.0813 Prob. Chi-Square(25) 0.0000
Dependent Variable: CIGS
Sample: 1 807 Test Equation:
Dependent Variable: RESID^2
C -3.639823 24.07866 -0.151164 0.8799 C 29374.77 20559.14 1.428794 0.1535
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 LOG(INCOME) -1049.630 963.4359 -1.089466 0.2763
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 (LOG(INCOME))^2 -3.941183 17.07122 -0.230867 0.8175
EDUC -0.501498 0.167077 -3.001596 0.0028 (LOG(INCOME))*(LOG(CIGPRIC)) 329.8896 239.2417 1.378897 0.1683
AGE 0.770694 0.160122 4.813155 0.0000 (LOG(INCOME))*EDUC -9.591849 8.047066 -1.191969 0.2336
(LOG(INCOME))*AGE -3.354565 6.682194 -0.502015 0.6158
AGE^2 -0.009023 0.001743 -5.176494 0.0000 (LOG(INCOME))*(AGE^2) 0.026704 0.073025 0.365689 0.7147
RESTAURN -2.825085 1.111794 -2.541016 0.0112 (LOG(INCOME))*RESTAURN -59.88700 49.69039 -1.205203 0.2285
LOG(CIGPRIC) -10340.68 9754.559 -1.060087 0.2894
R-squared 0.052737 Mean dependent var 8.686493 (LOG(CIGPRIC))^2 668.5294 1204.316 0.555111 0.5790
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 (LOG(CIGPRIC))*EDUC 32.91371 59.06252 0.557269 0.5775
S.E. of regression 13.40479 Akaike info criterion 8.037737 (LOG(CIGPRIC))*AGE 62.88164 55.29011 1.137304 0.2558
(LOG(CIGPRIC))*(AGE^2) -0.622371 0.594730 -1.046477 0.2957
Sum squared resid 143750.7 Schwarz criterion 8.078448 (LOG(CIGPRIC))*RESTAURN 862.1577 720.6219 1.196408 0.2319
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 EDUC -117.4705 251.2852 -0.467479 0.6403
F-statistic 7.423062 Durbin-Watson stat 2.012825 EDUC^2 -0.290343 1.287605 -0.225491 0.8217
Prob(F-statistic) 0.000000 EDUC*AGE 3.617048 1.724659 2.097254 0.0363
EDUC*(AGE^2) -0.035558 0.017664 -2.012988 0.0445
EDUC*RESTAURN -2.896490 10.65709 -0.271790 0.7859
AGE -264.1461 235.7624 -1.120391 0.2629
AGE^2 3.468601 3.194651 1.085753 0.2779
AGE*(AGE^2) -0.019111 0.028655 -0.666935 0.5050
AGE*RESTAURN -4.933199 10.84029 -0.455080 0.6492
(AGE^2)^2 0.000118 0.000146 0.807552 0.4196
(AGE^2)*RESTAURN 0.038446 0.120459 0.319160 0.7497
RESTAURN -2868.196 2986.776 -0.960299 0.3372
cigs: number of cigarettes smoked per day, log(income): log of annual income, log(cigprice):
log of per pack price of cigarettes in cents, educ: years of education, age and restaurn:
binary indicator equal to unity if the person resides in a state with restaurant smoking re-
strictions.
175
Example (Part 2). Discuss the results of the following …gures.
Dependent Variable: CIGS Dependent Variable: CIGS

Method: Least Squares Method: Least Squares
Sample: 1 807 Sample: 1 807
C -3.639823 24.07866 -0.151164 0.8799
LOG(INCOME) 0.880268 0.727783 1.209519 0.2268 C -3.639823 25.61646 -0.142089 0.8870
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966 LOG(INCOME) 0.880268 0.596011 1.476931 0.1401
EDUC -0.501498 0.167077 -3.001596 0.0028 LOG(CIGPRIC) -0.750862 6.035401 -0.124410 0.9010
AGE 0.770694 0.160122 4.813155 0.0000 EDUC -0.501498 0.162394 -3.088167 0.0021
AGE^2 -0.009023 0.001743 -5.176494 0.0000 AGE 0.770694 0.138284 5.573262 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112 AGE^2 -0.009023 0.001462 -6.170768 0.0000
RESTAURN -2.825085 1.008033 -2.802573 0.0052
Adjusted R-squared 0.045632 S.D. dependent var 13.72152 R-squared 0.052737 Mean dependent var 8.686493
S.E. of regression 13.40479 Akaike info criterion 8.037737 Adjusted R-squared 0.045632 S.D. dependent var 13.72152
Sum squared resid 143750.7 Schwarz criterion 8.078448 S.E. of regression 13.40479 Akaike info criterion 8.037737
Log likelihood -3236.227 Hannan-Quinn criter. 8.053370 Sum squared resid 143750.7 Schwarz criterion 8.078448
F-statistic 7.423062 Durbin-Watson stat 2.012825 Log likelihood -3236.227 Hannan-Quinn criter. 8.053370
Prob(F-statistic) 0.000000 F-statistic 7.423062 Durbin-Watson stat 2.012825
176
Example (Part 3). a) Regress y on X to get the residual vector e:
Dependent Variable: CIGS

Sample: 1 807
C -3.639823 24.07866 -0.151164 0.8799

LOG(INCOME) 0.880268 0.727783 1.209519 0.2268
LOG(CIGPRIC) -0.750862 5.773342 -0.130057 0.8966
EDUC -0.501498 0.167077 -3.001596 0.0028
AGE 0.770694 0.160122 4.813155 0.0000
AGE^2 -0.009023 0.001743 -5.176494 0.0000
RESTAURN -2.825085 1.111794 -2.541016 0.0112

177
b) Run the LS regression log e2i on xi
Dependent Variable: LOG(RES^2)

Sample: 1 807
C -1.920691 2.563033 -0.749382 0.4538

LOG(INCOME) 0.291540 0.077468 3.763351 0.0002
LOG(CIGPRIC) 0.195418 0.614539 0.317992 0.7506
EDUC -0.079704 0.017784 -4.481657 0.0000
AGE 0.204005 0.017044 11.96928 0.0000
AGE^2 -0.002392 0.000186 -12.89313 0.0000
RESTAURN -0.627011 0.118344 -5.298213 0.0000

n o
Calculate ^ 2i = exp ^ 0xi \
= exp log e2i :
\
Notice: log \
e21; :::; log e2n are the …tted values of the above regression.
178
c) Transform the data

yi xij
y~i = ; ~ij =
x
î î
and d) Regress y ~ and obtain
~ on X ^ :
V
Dependent Variable: CIGS/SIGMA

Sample: 1 807
1/SIGMA 5.635471 17.80314 0.316544 0.7517

LOG(INCOME)/SIGMA 1.295239 0.437012 2.963855 0.0031
LOG(CIGPRIC)/SIGMA -2.940314 4.460145 -0.659242 0.5099
EDUC/SIGMA -0.463446 0.120159 -3.856953 0.0001
AGE/SIGMA 0.481948 0.096808 4.978378 0.0000
AGE^2/SIGMA -0.005627 0.000939 -5.989706 0.0000
RESTAURN/SIGMA -3.461064 0.795505 -4.350776 0.0000

Adjusted R-squared -0.004728 S.D. dependent var 1.574979
179
3.8.6 OLS versus WLS
Under certain conditions we have:
b and ^ V
^ are consistent.
Assuming that the functional form of the conditional second moment is correctly spec-
i…ed, ^ V
^ is asymptotically more e¢ cient than b.
It is not clear which estimator is better (in terms of e¢ ciency) in the following situations:
– the functional form of the conditional second moment is misspeci…ed;
– in …nite samples, even if the functional form is correctly speci…ed, the large-sample
approximation will probably work less well for the WLS estimator than for OLS
because of the estimation of extra parameters (a) involved in the WLS procedure.
180
3.9 Serial Correlation
Because the issue of serial correlation arises almost always in time-series models, we use the
subscript "t" instead of "i" in this section. Throughout this section we assume that the
regressors include a constant. The issue is how to deal with
E "t"t j xt j ; xt 6= 0:
181
3.9.1 Usual Inference is not Valid
When the regressors include a constant (true in virtually all known applications), Assumption
2.5 implies that the error term is a scalar martingale di¤erence sequence, so if the error
is found to be serially correlated (or autocorrelated), that is an indication of a failure of
Assumption 2.5.
We have Cov gt; gt j 6= 0: In fact,
Cov gt; gt j = E xt"tx0t j "t j

= E E xt"tx0t j "t j xt j ; xt
= E xtx0t j E "t"t j xt j ; xt 6= 0:
Assumptions 2.1-2.4 may hold under serial correlation, so the OLS estimator may be consis-
tent even if the error is autocorrelated. However, the large-sample properties of b, t , and
F of proposition 2.5 are not valid. To see why, consider
p p
n (b ) = Sxx1 ng :
182
We have
Avar (b) = 1
xx S
1
xx ;
\
Avar ^ xx1:
(b) = Sxx1SS
If errors are not autocorrelated:

p
S = Var ng = Var (gt) .
If the errors are autocorrelated:

p 1 nX1 X n
0 0
S = Var ng = Var (gt) + E gtgt j + E gt j gt :
n j=1 t=j+1
Since Cov gt; gt j 6= 0 and E gt j gt0 6= 0 we have
S 6= Var (gt) i.e. S 6= E gtgt0 :

P 0 or 1 Pn e2 x x0 (robust to
If the errors are serial correlated we cannot use n1 n x
t=1 t tx n t=1 t t t
conditional heteroskedasticity) as a consistent estimators of S.
183
3.9.2 Testing Serial Correlation
Consider the regression yt = x0t + "t: We want to test whether or not "t is serial correlated.
Consider
Cov "t; "t j Cov "t; "t j j E " t "t j
j = r = = =
2
:
Var ("t) Var "t j Var (" t ) 0 E "t
Since j is not observable, we need to consider

~j
~j =
~0
n n
1 X 1X
~j = "t "t j ; ~0 = "2t :
n t=j+1 n t=1
184
Proposition. If f"tg is a stationary MDS with E "2t "t 1; "t 2; ::: = 2; then
p d p d
n~j ! N 0; 4 and n~j ! N (0; 1) :
Proposition. Under the assumptions of the previous proposition
p
X p
X
p 2 d
Box-Pierce Q statistics = QBP = n~j =n ~2j ! 2(p):
j=1 j=1
However, ~j is still unfeasible as we do not observe the errors. Thus,

^j
^j =
^0
n n
1 X 1X
^j = etet j ; ^0 = e2t (=SSR).
n t=j+1 n t=1
Exercise 3.6. Prove that ^j can be obtained from the regression et on et j (without inter-
cept).
185
Testing with Strictly Exogenous Regressors
To test H0 : j = 0 we consider the following proposition:

Proposition (testing for serial correlation with strictly exogeneous regressors). Suppose that
Assumptions 1.2, 2.1, 2.2, 2.4 are satis…ed. Then
p
^j ! 0;
p d
n^j ! N (0; 1) :
186
To test H0 : 1 = 2 = ::: = p = 0 we consider the following proposition:

Proposition (Box-Pierce Q & Ljung-Box Q). Suppose that Assumptions 1.2, 2.1, 2.2, 2.4
are satis…ed. Then
p
X d
QBP = n ^2j ! 2(p);
j=1
p
X ^2j d
QLB = n (n + 2) ! 2(p):
j=1 n j
It can be shown that the hypothesis H0 : 1 = 2 = ::: = p = 0 can also be tested

through the following auxiliary regression:
regression et on et 1; :::; et p:
We calculate the F statistic for the hypothesis that the p coe¢ cients of et 1; :::; et p are
all zero.
187
Testing with Predetermined, but Not Strictly Exogenous, Regressors

p
If the regressors are not strictly exogenous, the n^j has no longer N (0; 1) distribution and
the residual-based Q statistic may not be asymptotically chi-squared.
The trick consist in removing the e¤ect of xi in the regression of et on et 1; :::; et p by

considering now the
regression et on xt,et 1; :::; et p
and then calculate the F statistic for the hypothesis that the p coe¢ cients of et 1; :::; et p
are all zero. This regression is still valid when the regressors are strictly exogenous (so you
may always use that regression).
Given
et = 1 + 2xt2 + ::: + K xtK + 1et 1 + ::: + pet p + errort
the null hypothesis can be formulated as
H0 : 1 = ::: = p = 0
Use the F test:
188
EVIEWS
189
Example. Consider, chnimp: the volume of imports of barium chloride from China, chempi:
index of chemical production (to control for overall demand for barium chloride), gas: the
volume of gasoline production (another demand variable), rtwex: an exchange rate index
(measures the strength of the dollar against several other currencies).
Equation 1
Dependent Variable: LOG(CHNIMP)
Sample: 1978M02 1988M12
C -19.75991 21.08580 -0.937119 0.3505

LOG(CHEMPI) 3.044302 0.478954 6.356142 0.0000
LOG(GAS) 0.349769 0.906247 0.385953 0.7002
LOG(RTWEX) 0.717552 0.349450 2.053378 0.0421

190
Equation 2
Breusch-Godfrey Serial Correlation LM Test:

Test Equation:
Dependent Variable: RESID
Sample: 1978M02 1988M12
Presample missing value lagged residuals set to zero.
C -3.074901 20.73522 -0.148294 0.8824

LOG(CHEMPI) 0.084948 0.457958 0.185493 0.8532
LOG(GAS) 0.110527 0.892301 0.123867 0.9016
LOG(RTWEX) 0.030365 0.333890 0.090942 0.9277
RESID(-1) 0.234579 0.093215 2.516546 0.0132
RESID(-2) 0.182743 0.095624 1.911051 0.0585
RESID(-3) 0.164748 0.097176 1.695366 0.0927
RESID(-4) -0.180123 0.098565 -1.827464 0.0702
RESID(-5) -0.041327 0.099482 -0.415425 0.6786
RESID(-6) 0.038597 0.098345 0.392468 0.6954
RESID(-7) 0.139782 0.098420 1.420268 0.1582
RESID(-8) 0.063771 0.099213 0.642771 0.5217
RESID(-9) -0.154525 0.098209 -1.573441 0.1184
RESID(-10) 0.027184 0.098283 0.276585 0.7826
RESID(-11) -0.049692 0.097140 -0.511550 0.6099
RESID(-12) -0.058076 0.095469 -0.608329 0.5442
R-squared 0.196110 Mean dependent var -3.97E-15

191
If you conclude that the errors are serial correlated you have a few options:
(a) You know (at least approximately) the form of autocorrelation and so you use a feasible
GLS estimator.
(b) The second approach, parallels the use of the White estimator for heteroskedasticity:
you don’t know the form of autocorrelation so you rely on the OLS, but you use a
consistent estimator for Avar (b) :
(c) You are concerned only with the dynamic speci…cation of the model and with forecast.
You may try to convert your model into a dynamically complete model.
(d) You model may be misspeci…ed: you respeci…ed the model and the autocorrelation
disappear.
192
3.9.3 Question (a): feasible GLS estimator
There are many forms of autocorrelation and each one leads to a di¤erent structure for the
error covariance matrix V. The most popular form is known as the …rst-order autoregressive
process. In this case the error term in
yt = x0t + "t
is assumed to follow the AR(1) model
"t = "t 1 + vt; j j < 1;

where vt is an error term with mean zero and constant conditional variance that exhibits no
serial correlation. We assume all assumptions 2.1-2.5 was = 0:
193
Initial Model:
yt = x0t + "t; "t = "t 1 + vt; j j<1
The GLS estimator is the OLS estimator applied to the transformed model
~0t + vt
y~t = x
where
( q ( q
1 2y t=1 ; 1 2 x0 t= 1 ;
y~t = 1 ~0t =
x 1
yt yt 1 t > 1 (xt xt 1)0 t > 1
Without the …rst observation, the transformed model is
0
yt yt 1 = (xt xt 1) + vt:
If is unknown we may replace it by a consistent estimator or we may use the nonlinear

least squares estimator (EVIEW).
194
Example (continuation of the previous example). Let’s consider the residuals of Equation 1:
Equation 3
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
Convergence achieved after 8 iterations
C -39.30703 23.61105 -1.664772 0.0985

LOG(CHEMPI) 2.875036 0.658664 4.364949 0.0000
LOG(GAS) 1.213475 1.005164 1.207241 0.2296
LOG(RTWEX) 0.850385 0.468696 1.814362 0.0720
AR(1) 0.309190 0.086011 3.594777 0.0005

Inverted AR Roots .31

195
3.9.4 Question (b): Heteroskedasticity and autocorrelation-consistent (HAC) Co-

variance Matrix Estimator
For sake of generality, assume that you have also a problem of heteroskedasticity.
Given
p 1 nX1 X n
0 0
S = Var ng = Var (gt) + E gtgt j + E gt j gt
n j=1 t=j+1
nX1 X n
1
= E "2t xtx0t + 0 0
E "t"t j xtxt j + E "t j "txt j xt ;
n j=1 t=j+1
a possible estimator of S based on the analogy principle would be
n 0
nX1 X n
1X 1
e2t xtx0t + etet j xtx0t j + et j etxt j x0t ; n0 < n:
n t=1 n j=1 t=j+1
A major problem with this estimator is that it is not positive semi-de…nite and hence cannot
be a well-de…ned variance-covariance matrix.
196
Newey and West show that with a suitable weighting function ! (j ), the estimator below is
consistent and positive semi-de…nite:
Xn XL Xn
1 1
^ HAC =
S e2t xtx0t + ! (j ) etet j xtx0t j + et j etxt j x0t
n t=1 n j=1 t=j+1
where the weighting function ! (j ) is
j
! (j ) = 1 :
L+1
The maximum lag L must be determined in advance. Autocorrelations at lags longer than
L are ignored. For a moving-average process, this value is in general a small number.
This estimator is known as (HAC) covariance matrix estimator and is valid when both
conditional heteroskedasticity and serial correlations are present but of an unknown form.
197
Example. For xt = 1; n = 9; L = 3 we have

L
X n
X
! (j ) etet j xtx0t j + et j etxt j x0t
j=1 t=j+1
XL Xn
= ! (j ) 2etet j
j=1 t=j+1
= ! (1) (2e1e2 + 2e2e3 + 2e3e4 + 2e4e5 + 2e5e6 + 2e6e7 + 2e7e8 + 2e8e9) +
! (2) (2e1e3 + 2e2e4 + 2e3e5 + 2e4e6 + 2e5e7 + 2e6e8 + 2e7e9) +
! (3) (2e1e4 + 2e2e5 + 2e3e6 + 2e4e7 + 2e5e8 + 2e6e9) :
1
! (1) = 1 = 0:75
4
2
! (2) = 1 = 0:50
4
3
! (3) = 1 = 0:25
4
198
Newey-West covariance matrix estimator

[ (b) = Sxx1S
Avar ^ HAC Sxx1:
EVIEWS:
10
L
9
0
0 1000 2000 3000 4000 5000
n
n 2=9
Eviews selects L = f loor(4 100 )
199
Example (continuation ...). Newey-West covariance matrix estimator

[ (b) = Sxx1S
Avar ^ HAC Sxx1
Equation 4
Sample: 1978M02 1988M12
Newey-West HAC Standard Errors & Covariance (lag truncation=4)
C -19.75991 26.25891 -0.752503 0.4531

LOG(CHEMPI) 3.044302 0.667155 4.563111 0.0000
LOG(GAS) 0.349769 1.189866 0.293956 0.7693
LOG(RTWEX) 0.717552 0.361957 1.982426 0.0496

200
3.9.5 Question (c): Dynamically Complete Models
Consider
~0t + ut
yt = x
such that E ( utj x
~t) = 0: This condition although necessary for consistency, does not pre-
clude autocorrelation. You may try to increase the number of regressors to xt and get a new
regression model
yt = x0t + "t such that
E ( "tj xt; yt 1; xt 1; yt 2; :::) = 0:

Written in terms of yt
E ( ytj xt; yt 1; xt 1; yt 2; :::) = E ( ytj xt) :

De…nition. The model yt = x0t + "t is dynamically complete (DC) if
E ( "tj xt; yt 1; xt 1; yt 2; :::) = 0 or

E ( ytj xt; yt 1; xt 1; yt 2; :::) = E ( ytj xt)
holds (see Wooldridge).
201
Proposition. If a model is DC then the errors are not correlated. Moreover fgig is a MDS.
Notice that E ( "tj xt; yt 1; xt 1; yt 2; :::) = 0 can be rewritten as
E ( "ij Fi) = 0 where

Fi = Ii 1 [ xi = f"i 1; "i 2; :::; "1; xi; xi 1; :::; x1g ;
Ii 1 = f"i 1; "i 2; :::; "1; xi 1; :::; x1g :
Example. Consider
yt = 1 + 2xt2 + ut; ut = ut 1 + "t

~0t =
where f"tg is a white noise process and E "tj xt2; yt 1; xt 1;2; yt 2; ::: = 0. Set x
1 xt2 : The above model is not DC since the errors are autocorrelated. Notice that
E ytj xt2; yt 1; xt 1;2; yt 2; ::: = 1 + 2xt2 + ut 1

does not coincide with
E ( ytj x
~t) = E ( ytj xt2) = 1 + 2xt2:
202
However, it is easy to obtain a DC model. Since
ut = yt ( 1 + 2xt2) )
ut 1 = yt 1 ( 1 + 2xt 1;2)
we have
yt = 1 + 2xt2 + ut
= 1 + 2 xt2 + ut 1 + "t
= 1 + 2 xt2 + yt 1 1 + 2 xt 1;2 + "t :
This equation can be written in the form
yt = 1 + 2xt2 + 3yt 1 + 4xt 1;2 + "t:

Let xt = xt2; yt 1; xt 1;2 : The previous models is DC as
E ( ytj xt; yt 1; xt 1; :::) = E ( ytj xt) = 1 + 2xt2 + 3yt 1 + 4xt 1;2:

203
Example (continuation ...). Dynamically Complete Model
Equation 6
Breusch-Godfrey Serial Correlation LM Test:

Test Equation:
Dependent Variable: RESID
Equation 5 Date: 05/12/10 Time: 19:13
Dependent Variable: LOG(CHNIMP) Sample: 1978M03 1988M12
Method: Least Squares Presample missing value lagged residuals set to zero.
Sample (adjusted): 1978M03 1988M12
Included observations: 130 after adjustments
C 1.025127 26.26657 0.039028 0.9689
LOG(CHEMPI) 1.373671 3.968650 0.346130 0.7299
Variable Coefficient Std. Error t-Statistic Prob. LOG(GAS) -0.279136 1.055889 -0.264361 0.7920
LOG(RTWEX) -0.074592 2.234853 -0.033377 0.9734
C -11.30596 23.24886 -0.486302 0.6276 LOG(CHEMPI(-1)) -1.878917 4.322963 -0.434636 0.6647
LOG(GAS(-1)) 0.315918 1.076831 0.293378 0.7698
LOG(CHEMPI) -7.193799 3.539951 -2.032175 0.0443 LOG(RTWEX(-1)) -0.007029 2.224878 -0.003159 0.9975
LOG(GAS) 1.319540 1.003825 1.314513 0.1911 LOG(CHNIMP(-1)) 0.151065 0.293284 0.515082 0.6075
RESID(-1) -0.189924 0.307062 -0.618520 0.5375
LOG(RTWEX) -0.501520 2.108623 -0.237842 0.8124 RESID(-2) 0.088557 0.124602 0.710715 0.4788
LOG(CHEMPI(-1)) 9.618587 3.602977 2.669622 0.0086 RESID(-3) 0.154141 0.098337 1.567475 0.1199
RESID(-4) -0.125009 0.098681 -1.266795 0.2079
LOG(GAS(-1)) -1.223681 1.002237 -1.220950 0.2245 RESID(-5) -0.035680 0.099831 -0.357407 0.7215
LOG(RTWEX(-1)) 0.935678 2.088961 0.447915 0.6550 RESID(-6) 0.048053 0.098008 0.490291 0.6249
LOG(CHNIMP(-1)) 0.270704 0.084103 3.218710 0.0016 RESID(-7) 0.129226 0.097417 1.326523 0.1874
RESID(-8) 0.052884 0.099891 0.529420 0.5976
RESID(-9) -0.122323 0.102670 -1.191423 0.2361
R-squared 0.394405 Mean dependent var 6.180590 RESID(-10) 0.022149 0.099419 0.222788 0.8241
RESID(-11) 0.034364 0.099973 0.343738 0.7317
Adjusted R-squared 0.359658 S.D. dependent var 0.699063 RESID(-12) -0.038034 0.102071 -0.372628 0.7101
R-squared 0.081251 Mean dependent var -9.76E-15
Sum squared resid 38.17726 Schwarz criterion 1.912123 Adjusted R-squared -0.077442 S.D. dependent var 0.544011
Log likelihood -104.8179 Hannan-Quinn criter. 1.807363 S.E. of regression 0.564683 Akaike info criterion 1.835533
F-statistic 11.35069 Durbin-Watson stat 2.059684 Sum squared resid 35.07532 Schwarz criterion 2.276692
Prob(F-statistic) 0.000000 F-statistic 0.512002 Durbin-Watson stat 2.011429
204
3.9.6 Question (d): Misspeci…cation
In many cases the …nding of autocorrelation is an indication that the model is misspeci…ed.
If this is the case, the most natural route is not to change your estimator (from OLS to GLS)
but to change your model. Types of misspeci…cation may lead to a …nding of autocorrelation
in your OLS residuals:
dynamic misspeci…cation (related to question (c));
omitted variables (that are autocorrelated);
yt and/or xtk are integrated processes, e.g. yt I (1) :
functional form misspeci…cation.

205
Functional form misspeci…cation. Suppose that the true linear relationship is
yt = 1 + 2 log t + "t:
In the following …gure we estimate a misspeci…ed functional form: yt = 1 + 2t + "t : The
residuals are clearly autocorrelated

1 205

Transféré par

Informations du document

Description originale:

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

1 205

Transféré par

Droits d'auteur :

Formats disponibles

1

1.1 What is Econometrics?

study economic relations;

testing economic theories;

1.2 Steps in Empirical Economic Analysis

Specify the econometric model.

Collect the data.

Estimate and test the econometric model.

Answer the question in step 1.

1.3 The Structure of Economic Data

1.3.1 Cross-Sectional Data

A cross-sectional data: sample of individuals, households, …rms, cities, states, countries,

An example of Cross-Sectional Data:

Scatterplots may be adequate for analyzing cross-section data:

1.3.2 Time-Series Data

we need to account for the dependent nature of economic time series;

1.3.3 Pooled Cross Sections and Panel or Longitudinal Data

Data sets have both cross-sectional and time series features.

1.3.4 Causality And The Notion Of Ceteris Paribus In Econometric Analysis

monthly wages (Euros) years of experience years of education

Ceteris Paribus is relatively easy to analyze in Experimental Data.

2 Finite-Sample Properties of OLS

2.1 The Classical Linear Regression Model

Let yi be the i-th observation of the dependent variable.

The data in economics cannot be generated by experiments (except in experimental eco-

2.1.1 The Linearity Assumption

Assumption (1.1 - Linearity). We have

yi = 1xi1 + 2xi2 + ::: + K xiK + "i; i = 1; 2; :::; n

coni = 1 + 2ydi + "i:

The linearity assumption is not as restrictive as it might …rst seem.

wagei = e 1 e 2educi e 3tenurei e 4expri e"i

log (wagei) = 1 + 2educi + 3tenurei + 4expri + "i

yi = 1 + 2xi2 + 3 log xi2 + 4x2i3 + "i

There are, of course, cases of genuine nonlinearity. For example

To simplify let’s consider, K = 2; and assume that E ( "ij xi1; xi2) = 0.

2.1.2 Matrix Notation

Example. yi = 1 + 2educi + 3expi + "i (yi = wages in Euros). An example of

2.1.3 The Strict Exogeneity Assumption

Assumption (1.2 - Strict exogeneity). E ( "ij X) = 0; 8i

This assumption can be written as

E ( "ij x1; :::; xn) = 0; 8i:

Strict Exogeneity assumption can fail in situations such as:

(Cross-Section or Time Series) Omitted variables;

(Cross-Section or Time Series) Measurement error in some of the regressors;

(Time Series, Dynamic models) There is a lag dependent variable as a regressor.

wagei = 1 + 2xi2 + 3xi3 + vi;

wagei = 1 + 2xi2 + "i; "i = 3xi3 + vi:

Example (Measurement error in some of the regressors). Consider y = household savings

yi = 1 + 2wi + vi; E ( vij w) = 0:

yi = 1 + 2xi2 + "i; "i = vi 2 ui :

Cov ("i; xi2) = ::: = 2 Var (ui ) 6= 0:

2.1.4 Implications of Strict Exogeneity

The Assumption E ( "ij X) = 0; 8i implies:

Cov xjk ; "i = 0:

Note: if E "ij xj 6= 0 or E xjk "i 6= 0 or Cov xjk ; "i 6= 0 ) E ( "ij X) 6= 0:

2.1.5 Strict Exogeneity in Time-Series Models

2.1.6 Other Assumptions of the Model

Assumption (1.3 - no multicollinearity). The rank of the n K data matrix X is K with

GP Ai = 1 + 2studyi + 3sleepi + 4worki + 5leisurei + "i

Assumption 1.4 and strict exogeneity implies:

Var ( "ij X) = E "2i X = 2:

Cov "i; "j X = 0: