
Econometrics:

A Predictive Modeling Approach

Francis X. Diebold
University of Pennsylvania

July 5, 2017 1 / 247


Copyright © 2013-2017, by Francis X. Diebold.

All rights reserved.

All materials are freely available for your use, but be warned: they are highly
preliminary, significantly incomplete, and rapidly evolving. All are licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License. (Briefly: I retain copyright, but you can use, copy and
distribute non-commercially, so long as you give me attribution and do not
modify. To view a copy of the license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.) In return I ask that you
please cite the books whenever appropriate, as: Diebold, F.X. (year here),
Book Title Here, Department of Economics, University of Pennsylvania,
http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

The painting is Enigma, by Glen Josselsohn, from Wikimedia Commons.

2 / 247
Introduction to Econometrics

3 / 247
Numerous Communities Use Econometrics

Economists, statisticians, analysts, data scientists in:


I Finance (Commercial banking, retail banking, investment
banking, insurance, asset management, real estate, ...)
I Traditional Industry (manufacturing, services, advertising,
brick-and-mortar retailing, ...)
I e-Industry (Google, Amazon, eBay, Uber, Microsoft, ...)
I Consulting (financial services, litigation support, ...)
I Government (treasury, agriculture, environment, commerce,
...)
I Central Banks and International Organizations (FED, IMF,
World Bank, OECD, BIS, ECB, ...)

4 / 247
Econometrics is Special

Econometrics is not just statistics using economic data. Many


properties and nuances of economic data require knowledge of
economics for successful analysis.

I Trend and seasonality


I Cycles (serial correlation)
I Volatility fluctuations (heteroskedasticity)
I Structural change
I Multivariate interactions
I Emphasis on prediction

5 / 247
Let's Elaborate on the Emphasis on Prediction...
Q: What is econometrics about, broadly?

A: Helping people to make better decisions


I Consumers
I Firms
I Investors
I Policy makers
I Courts
Forecasts guide decisions.

Good forecasts promote good decisions.

Hence prediction holds a distinguished place in econometrics,


and it will hold a distinguished place in this course.

6 / 247
There are Many Issues Regarding Types of Recorded
Economic Data

I Time series
I Continuous recording
I Discrete recording
I Equally-spaced
I Unequally-spaced
I Common-frequency
I Mixed-frequency
I Cross section
I Time series of cross sections
I Balanced panel
I Unbalanced panel

7 / 247
Notational Aside

Standard cross-section notation: i = 1, ..., N

Standard time-series notation: t = 1, ..., T

Much of our discussion will be valid in both cross-section and


time-series environments, but still we have to pick a notation.

Without loss of generality, we will use t = 1, ..., T .

8 / 247
A Few Leading Econometrics Web Data Resources
(Clickable)

Indispensable:
I Resources for Economists (AEA)

I FRED (Federal Reserve Economic Data)

More specialized:
I National Bureau of Economic Research

I FRB Phila Real-Time Data Research Center

I Many more

9 / 247
A Few Leading Econometrics Software Environments
(Clickable)

I High-Level: EViews, Stata


I Mid-Level: R (CRAN; RStudio; R-bloggers), Python, Julia
I Low-Level: C, C++, Fortran

High-level does not mean best, and low-level does not


mean worst. There are many issues.

I More than you ever wanted to know about econometric


software, broadly defined: The Econometrics Journal software
links

10 / 247
Graphics Review

11 / 247
Graphics Help us to:

I Summarize and reveal patterns in univariate cross-section


data. Histograms and density estimates are helpful for learning
about distributional shape. Symmetric, skewed, fat-tailed, ...

I Summarize and reveal patterns in univariate time-series data.


Time Series plots are useful for learning about dynamics.
Trend, seasonal, cycle, outliers, ...

I Summarize and reveal patterns in multivariate data


(cross-section or time-series). Scatterplots are useful for
learning about relationships. Does a relationship exist? Is it
linear or nonlinear? Are there outliers?

12 / 247
Histogram Revealing Distributional Shape:
1-Year Government Bond Yield

13 / 247
Time Series Plot Revealing Dynamics:
1-Year Government Bond Yield, Levels

14 / 247
Scatterplot Revealing Relationship:
1-Year and 10-Year Government Bond Yields

15 / 247
Some Principles of Graphical Style

I Know your audience, and know your goals.


I Appeal to the viewer.
I Show the data, and only the data, within the bounds of
reason.
I Avoid distortion. The sizes of effects in graphics should match
their size in the data. Use common scales in multiple
comparisons.
I Minimize, within reason, non-data ink. Avoid chartjunk.
I Choose aspect ratios to maximize pattern revelation.
Bank to 45 degrees.
I Maximize graphical data density.
I Revise and edit, again and again (and again). Graphics
produced using software defaults are almost never satisfactory.

16 / 247
Probability and Statistics Review

17 / 247
Moments, Sample Moments and Their Sampling
Distributions

I Discrete random variable, y

I Discrete probability distribution p(y )

I Continuous random variable y

I Probability density function f (y )

18 / 247
Population Moments: Expectations of Powers of R.V.s

Mean measures location:

μ = E(y) = Σᵢ pᵢ yᵢ   (discrete case)

μ = E(y) = ∫ y f(y) dy   (continuous case)

Variance, or standard deviation, measures dispersion, or scale:

σ² = var(y) = E(y − μ)²

σ easier to interpret than σ². Why?

19 / 247
More Population Moments

Skewness measures skewness (!)

S = E(y − μ)³ / σ³

Kurtosis measures tail fatness relative to a Gaussian distribution.

K = E(y − μ)⁴ / σ⁴

20 / 247
Covariance and Correlation

Multivariate case: Joint, marginal and conditional distributions

f(x, y), f(x), f(y), f(x|y), f(y|x)

Covariance measures linear dependence:

cov(y, x) = E[(y − μ_y)(x − μ_x)]

So does correlation:

corr(y, x) = cov(y, x) / (σ_y σ_x)

Correlation is often more convenient. Why?

21 / 247
Sampling and Estimation

Sample: {yt}, t = 1, ..., T, iid from f(y)

Sample mean:

ȳ = (1/T) Σₜ₌₁ᵀ yt

Sample variance:

σ̂² = Σₜ₌₁ᵀ (yt − ȳ)² / T

Unbiased sample variance:

s² = Σₜ₌₁ᵀ (yt − ȳ)² / (T − 1)
22 / 247
More Sample Moments

Sample skewness:

Ŝ = [(1/T) Σₜ (yt − ȳ)³] / σ̂³

Sample kurtosis:

K̂ = [(1/T) Σₜ (yt − ȳ)⁴] / σ̂⁴

23 / 247
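A minimal NumPy sketch of these sample moments (using the divide-by-T variance σ̂², as on the slides); the function name and implementation are illustrative, not from the text.

import numpy as np

def sample_moments(y):
    """Sample mean, divide-by-T variance, skewness, and kurtosis,
    exactly as defined on the preceding slides."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    sigma2_hat = ((y - ybar) ** 2).sum() / T            # divide-by-T variance
    S_hat = ((y - ybar) ** 3).mean() / sigma2_hat ** 1.5
    K_hat = ((y - ybar) ** 4).mean() / sigma2_hat ** 2
    return ybar, sigma2_hat, S_hat, K_hat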
Still More Sample Moments

Sample covariance:

côv(y, x) = (1/T) Σₜ [(yt − ȳ)(xt − x̄)]

Sample correlation:

côrr(y, x) = côv(y, x) / (σ̂_y σ̂_x)

24 / 247
Exact Finite-Sample Distribution of the Sample Mean
(Requires iid Normality)

Simple random sampling: yt ~ iid N(μ, σ²), t = 1, ..., T

ȳ is unbiased, consistent, normally distributed with variance σ²/T,
and minimum variance unbiased (MVUE).

ȳ ~ N(μ, σ²/T)

(and we estimate σ² consistently using s²)

μ ∈ [ȳ ± t₁₋α/₂(T − 1) · s/√T]   w.p. 1 − α

H₀: μ = μ₀  ⟹  (ȳ − μ₀)/(s/√T) ~ t(T − 1)
25 / 247
Large-Sample Distribution of the Sample Mean
(Requires iid, but not Normality)

Simple random sampling: yt ~ iid (μ, σ²), t = 1, ..., T

ȳ is asymptotically unbiased, consistent, asymptotically normally
distributed with variance σ²/T, and best linear unbiased (BLUE).

ȳ ~ᵃ N(μ, σ²/T),

and we estimate σ² consistently using s². This is an approximate
(large-sample) result, due to the central limit theorem. The "a"
means "asymptotically as T → ∞".

As T → ∞:  μ ∈ [ȳ ± z₁₋α/₂ · s/√T]   w.p. 1 − α

H₀: μ = μ₀  ⟹  (ȳ − μ₀)/(s/√T) ~ N(0, 1)

26 / 247
Wages: Distributions

27 / 247
Wages: Sample Statistics

                        WAGE       log WAGE
Sample Mean            12.19          2.34
Sample Median          10.00          2.30
Sample Maximum         65.00          4.17
Sample Minimum          1.43          0.36
Sample Std. Dev.        7.38          0.56
Sample Skewness         1.76          0.06
Sample Kurtosis         7.93          2.90
Jarque-Bera          2027.86          1.26
                  (p = 0.00)    (p = 0.53)
t(H₀: μ = 12)           0.93       -625.70
                  (p = 0.36)    (p = 0.00)
Correlation                 0.40

28 / 247
Regression

29 / 247
Regression

A. As curve fitting. Tell a computer how to draw a line through a


scatterplot. (Well, sure, but there must be more...)

B. As a probabilistic framework for optimal prediction.

30 / 247
Regression as Curve Fitting

31 / 247
Distributions of Log Wage, Education and Experience

32 / 247
Scatterplot: Log Wage vs. Education

33 / 247
Curve Fitting

Fit a line:

ŷt = β̂₁ + β̂₂ xt

Solve:

min_β Σₜ (yt − β₁ − β₂ xt)²

β is the set of two parameters β₁ and β₂

β̂ is the set of fitted parameters β̂₁ and β̂₂

34 / 247
Log Wage vs. Education with Superimposed Regression
Line

LWAGE-hat = 1.273 + .081 EDUC
35 / 247
Actual Values, Fitted Values and Residuals

The fitted values are

ŷt = β̂₁ + β̂₂ xt,  t = 1, ..., T.

The residuals are the difference between actual and fitted values,

et = yt − ŷt,  t = 1, ..., T.

36 / 247
Multiple Linear Regression (K RHS Variables)
Solve:

min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)²

Fitted hyperplane:

ŷt = β̂₁ + β̂₂ x2t + β̂₃ x3t + ... + β̂K xKt

More compactly:

ŷt = Σᵢ₌₁ᴷ β̂ᵢ xit,

where x1t = 1 for all t.

Wage dataset:

LWAGE-hat = .867 + .093 EDUC + .013 EXPER
37 / 247
Regression as a Probability Model

38 / 247
An Ideal Situation (The Ideal Conditions)
I The data-generating process (DGP) is:

yt = β₁ + β₂ x2t + ... + βK xKt + εt

εt ~ iid N(0, σ²),

and the fitted model matches it exactly.
I εt and xit are independent, for all i, t

1. The fitted model is correctly specified
2. The disturbances are Gaussian
3. The coefficients (β's) are fixed (whether over space or time,
depending on whether we're working in a time-series or cross-section
environment)
4. The relationship is linear
5. The εt's have constant variance σ²
6. The εt's are uncorrelated
39 / 247
Some Crucial Matrix Notation

You already understand matrix (spreadsheet) notation,
although you may not know it.

y = (y₁, y₂, ..., yT)′

X =
⎡ 1  x₂₁  x₃₁  ...  xK₁ ⎤
⎢ 1  x₂₂  x₃₂  ...  xK₂ ⎥
⎢ ⋮    ⋮    ⋮          ⋮ ⎥
⎣ 1  x₂T  x₃T  ...  xKT ⎦

β = (β₁, β₂, ..., βK)′      ε = (ε₁, ε₂, ..., εT)′

40 / 247
Elementary Matrices and Matrix Operations


0 =
⎡ 0  0  ...  0 ⎤
⎢ 0  0  ...  0 ⎥
⎢ ⋮  ⋮        ⋮ ⎥
⎣ 0  0  ...  0 ⎦

I =
⎡ 1  0  ...  0 ⎤
⎢ 0  1  ...  0 ⎥
⎢ ⋮  ⋮   ⋱   ⋮ ⎥
⎣ 0  0  ...  1 ⎦

Transposition: (A′)ij = Aji

Addition: For A and B both n×m, (A + B)ij = Aij + Bij

Multiplication: For A n×m and B m×p, (AB)ij = Σₖ₌₁ᵐ Aik Bkj.

Inversion: For non-singular A n×n, A⁻¹ satisfies
A⁻¹A = AA⁻¹ = I. Many algorithms exist for calculation.

41 / 247
We Used to Write This:

yt = β₁ + β₂ x2t + ... + βK xKt + εt

εt ~ iid N(0, σ²)
εt independent of xit, for all i, t
t = 1, 2, ..., T

42 / 247
Now, Equivalently, We Write This:

y = Xβ + ε    (1)
ε ~ N(0, σ²I)    (2)
ε independent of X

Written out:

⎡ y₁ ⎤   ⎡ 1  x₂₁  x₃₁  ...  xK₁ ⎤ ⎡ β₁ ⎤   ⎡ ε₁ ⎤
⎢ y₂ ⎥ = ⎢ 1  x₂₂  x₃₂  ...  xK₂ ⎥ ⎢ β₂ ⎥ + ⎢ ε₂ ⎥    (1)
⎢ ⋮  ⎥   ⎢ ⋮    ⋮    ⋮          ⋮ ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yT ⎦   ⎣ 1  x₂T  x₃T  ...  xKT ⎦ ⎣ βK ⎦   ⎣ εT ⎦

⎡ ε₁ ⎤     ⎛ ⎡ 0 ⎤   ⎡ σ²  0   ...  0  ⎤ ⎞
⎢ ε₂ ⎥ ~ N ⎜ ⎢ 0 ⎥ , ⎢ 0   σ²  ...  0  ⎥ ⎟    (2)
⎢ ⋮  ⎥     ⎜ ⎢ ⋮ ⎥   ⎢ ⋮   ⋮    ⋱   ⋮  ⎥ ⎟
⎣ εT ⎦     ⎝ ⎣ 0 ⎦   ⎣ 0   0   ...  σ² ⎦ ⎠
43 / 247
The OLS Estimator in Matrix Notation:

As always, the LS estimator solves:

min_β Σₜ εt² = min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)²

It can be shown that the solution is:

β̂_LS = (X′X)⁻¹ X′y

44 / 247
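A small illustrative sketch of the closed-form solution β̂_LS = (X′X)⁻¹X′y; the simulated DGP and function name are assumptions for the example, not from the text.

import numpy as np

def ols(y, X):
    """OLS estimate (X'X)^{-1} X'y, computed via a linear solve rather
    than an explicit inverse for numerical stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example with a simulated DGP satisfying the ideal conditions:
rng = np.random.default_rng(0)
T, K = 500, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 0.5, -0.2])
y = X @ beta + rng.normal(scale=0.7, size=T)
print(ols(y, X))          # should be close to (1.0, 0.5, -0.2)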
The Ideal Conditions, Redux

1. The DGP is:

y = Xβ + ε
ε ~ N(0, σ²I),

and the fitted model matches it exactly
2. X and ε are independent

45 / 247
Sampling Distribution of LS Under the Ideal Conditions

β̂_LS ~ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹:

s² = Σₜ et² / (T − K)

β̂_LS is normally distributed and MVUE

Note the precise parallel with the distribution of the sample mean
in Gaussian iid environments.

46 / 247
Sampling Distribution of LS Under the Ideal Conditions
Less Normality

As T → ∞,

β̂_LS ~ᵃ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹.

β̂_LS is asymptotically normally distributed and BLUE

Note the precise parallel with the distribution of the sample mean
in non-Gaussian iid environments.

47 / 247
Sample Mean (You Earlier Learned All About it)
OLS Regression (You're Now Learning All About it)
What is the Relationship?

Sample mean is just LS regression on nothing but a constant.
(Prove it.)

Moreover, the distributional results are in precise parallel.

Distribution of the sample mean from an iid Gaussian sample:

ȳ ~ N(μ, v),

where v is consistently estimated by s²/T.

Distribution of the LS regression estimator under ideal conditions:

β̂_LS ~ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹.
48 / 247
Conditional Implications of the DGP

Conditional mean:

E(yt | x1t = 1, x2t = x2t*, ..., xKt = xKt*) = β₁ + β₂ x2t* + ... + βK xKt*

or E(yt | xt = xt*) = xt*′β

Conditional variance:

var(yt | xt = xt*) = σ²

Full conditional density:

yt | xt = xt* ~ N(xt*′β, σ²)

49 / 247
Why All the Talk About Conditional Implications?:
The Predictive Modeling Problem
A major goal in econometrics is predicting y. The question is: "If a
new person arrives with characteristics x*, what is my
minimum-MSE prediction of her y?" The answer under quadratic
loss is E(y | x = x*) = x*′β.

The conditional mean is the minimum-MSE (point) predictor

Non-operational version (we don't know β):

E(yt | xt = xt*) = xt*′β

Operational version (use β̂_LS):

Ê(yt | xt = xt*) = xt*′β̂_LS   (regression fitted value at xt = xt*)

LS delivers the operational optimal predictor with great generality

Follows immediately from the LS optimization problem

50 / 247
Interval Prediction

Non-operational:

yt ∈ [xt*′β ± 1.96 σ]   w.p. 0.95

Operational:

yt ∈ [xt*′β̂_LS ± 1.96 s]   w.p. 0.95

51 / 247
Density Prediction

Non-operational version:

yt | xt = xt* ~ N(xt*′β, σ²)

Operational version:

yt | xt = xt* ~ N(xt*′β̂_LS, s²)

52 / 247
Digging More Deeply into Prediction

The environment is:

yt = xt′β + εt,  t = 1, ..., T

εt ~ iid D(0, σ²)

53 / 247
Point Prediction

Assume for the moment that we know the model parameters. That
is, assume that we know β and all parameters governing D. Note
that the mean and variance are in general insufficient to
characterize a non-Gaussian D.

We immediately obtain point forecasts as:

E(yi | xi = x*) = x*′β.

54 / 247
Analytic Density Prediction (And Hence Also Interval
Prediction) for D Gaussian

If D is Gaussian, then the density prediction is immediately

yi | xi = x* ~ N(x*′β, σ²).    (1)

We can calculate any desired interval forecast from the density
forecast. For example, a 95% interval would be x*′β ± 1.96σ.

55 / 247
Simulation Algorithm for Density Prediction for D Gaussian

1. Take R draws from the disturbance density N(0, σ²).
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

As R → ∞, the algorithmic and analytic results coincide.

56 / 247
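A hedged sketch of this simulation algorithm for Gaussian D, assuming known β and σ; the function name, number of draws R, and seed are illustrative choices.

import numpy as np

def simulate_gaussian_density_forecast(x_star, beta, sigma, R=100_000, seed=0):
    """Steps 1-2: simulate y draws at x*; step 4: read off the empirical
    .025 and .975 quantiles as a 95% interval forecast."""
    rng = np.random.default_rng(seed)
    draws = x_star @ beta + rng.normal(0.0, sigma, size=R)
    lo, hi = np.quantile(draws, [0.025, 0.975])
    return draws, (lo, hi)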
Making the Forecasts Feasible

The approaches above are infeasible in that they assume known
parameters. They can be made feasible by replacing unknown
parameters with estimates. For example, the feasible version of the
point prediction is x*′β̂. Similarly, to construct a feasible 95%
interval forecast in the Gaussian case we can take x*′β̂ ± 1.96 σ̂,
where σ̂ is the standard error of the regression
(also earlier denoted s).

57 / 247
Typical Regression Analysis of Wages, Education and
Experience

58 / 247
Top Matter: Background Information

I Dependent variable

I Method

I Date

I Sample

I Included observations

59 / 247
Middle Matter: Estimated Regression Function

I Variable

I Coefficient

I Standard error

I t-statistic

I p-value

60 / 247
Predictive Perspectives

OLS coefficient signs and sizes give the weights put on the
various x variables in forming the best in-sample prediction of y .

The standard errors, t statistics, and p-values let us do statistical


inference as to which regressors are most relevant for predicting y .

61 / 247
Bottom Matter: Statistics

There are many...

62 / 247
Regression Statistics: Mean dependent var 2.342

ȳ = (1/T) Σₜ₌₁ᵀ yt

63 / 247
Predictive Perspectives

The sample, or historical, mean of the dependent variable, y , an


estimate of the unconditional mean of y , is a benchmark forecast.
It is obtained by regressing y on an intercept alone no
conditioning on other regressors.

64 / 247
Regression Statistics: S.D. dependent var .561

SD = √[ Σₜ₌₁ᵀ (yt − ȳ)² / (T − 1) ]

65 / 247
Predictive Perspectives

The sample standard deviation of y is a measure of the in-sample


accuracy of the unconditional mean forecast y .

66 / 247
Regression Statistics: Sum squared resid 319.938

SSR = Σₜ₌₁ᵀ et²

Optimized value of the LS objective; will appear in many places.

67 / 247
Predictive Perspectives

The OLS fitted values, ŷt = xt′β̂, are effectively in-sample
regression predictions.

The OLS residuals, et = yt − ŷt, are effectively in-sample
prediction errors corresponding to use of the regression predictions.

SSR measures total in-sample accuracy of the regression
predictions

SSR is closely related to in-sample MSE:

MSE = (1/T) SSR = (1/T) Σₜ et²

(average in-sample accuracy of the regression predictions)

68 / 247
Regression Statistics: F -statistic 199.626

F = [(SSR_res − SSR)/(K − 1)] / [SSR/(T − K)]

69 / 247
Predictive Perspectives

The F statistic effectively compares the accuracy of the


regression-based forecast to that of the unconditional-mean
forecast.

Helps us assess whether the x variables, taken as a set, have


predictive value for y .

Contrasts with the t statistics, which assess predictive value of


the x variables one at a time.

70 / 247
Regression Statistics: S.E. of regression .492

s² = Σₜ et² / (T − K)

SER = √s² = √[ Σₜ et² / (T − K) ]

71 / 247
Predictive Perspectives

s² is just SSR scaled by T − K, so again, it's a measure of the
in-sample accuracy of the regression-based forecast.

Like MSE, but corrected for degrees of freedom.

72 / 247
Regression Statistics: R-squared .232

R² = 1 − Σₜ et² / Σₜ (yt − ȳ)²

73 / 247
Regression Statistics: Adjusted R-squared .231

R̄² = 1 − [Σₜ et² / (T − K)] / [Σₜ (yt − ȳ)² / (T − 1)]

74 / 247
Predictive Perspectives

R² and R̄² effectively compare the in-sample accuracy of
conditional-mean and unconditional-mean forecasts.

R² is not corrected for d.f. and has MSE on top:

R² = 1 − [(1/T) Σₜ et²] / [(1/T) Σₜ (yt − ȳ)²].

R̄² is corrected for d.f. and has s² on top:

R̄² = 1 − [Σₜ et² / (T − K)] / [Σₜ (yt − ȳ)² / (T − 1)].

75 / 247
Regression Statistics: Log likelihood -938.236

I Intimately related to SSR under normality

I Therefore closely related to prediction as well

76 / 247
Background/Detail: Regression Statistics: Log likelihood
-938.236

I Likelihood: joint density of the data (the yt's)

I Maximum-likelihood estimation is a natural estimation strategy:
find the parameter configuration that maximizes the likelihood
of getting the yt's that you actually did get.

I Log likelihood will have the same max as the likelihood (why?)
but it's more important statistically

I Hypothesis tests and model selection based on log likelihood

77 / 247
Background/Detail: Maximum-Likelihood Estimation

Linear regression model (under conditions) implies that:

yt ~ iid N(xt′β, σ²),

so that

f(yt) = (2πσ²)^(−1/2) exp( −(1/(2σ²)) (yt − xt′β)² ).

Now by independence of the εt's and hence yt's,

L = f(y₁, ..., yT) = f(y₁) ⋯ f(yT) = Πₜ (2πσ²)^(−1/2) exp( −(1/(2σ²)) (yt − xt′β)² )

Note in particular that the β vector that maximizes the likelihood
is the β vector that minimizes the sum of squared residuals.

78 / 247
Background/Detail: Log Likelihood

ln L = −(T/2) ln(2πσ²) − (1/(2σ²)) Σₜ (yt − xt′β)²

- Log turns the product into a sum and eliminates the exponential

- Additive constant can be dropped

79 / 247
Background/Detail: Likelihood-Ratio Tests

Under conditions, asymptotically as T → ∞:

−2(ln L₀ − ln L₁) ~ χ²_d,

where ln L₀ is the maximized log likelihood under the restrictions
implied by the null hypothesis, ln L₁ is the unrestricted log
likelihood, and d is the number of restrictions imposed under the
null hypothesis.

t and F tests are likelihood ratio tests under a normality
assumption. That's why they can be written in terms of minimized
SSR's rather than maximized ln L's.

80 / 247
Regression Statistics: Schwarz criterion 1.435

Well get there shortly...

81 / 247
Regression Statistics: Durbin-Watson stat. 1.926

Well get there in 6-8 weeks

82 / 247
Residual Scatter

83 / 247
Residual Plot

Figure: Wage Regression Residual Plot

84 / 247
Predictive Perspectives

The OLS fitted values, ŷt = xt′β̂, are effectively best in-sample
predictions.

The OLS residuals, et = yt − ŷt, are effectively in-sample
prediction errors corresponding to use of the best predictor.

Residual plots are useful for visually flagging neglected things


that impact forecasting. Residual serial correlation indicates that
point forecasts could be improved. Residual volatility clustering
indicates that interval and density forecasts could be improved.

85 / 247
Non-Quadratic Loss

86 / 247
We Will Generally Use Quadratic Loss...

Recall that the OLS estimator, β̂_OLS, solves:

min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)² = min_β Σₜ εt²

Simple
(analytic closed-form expression, (X′X)⁻¹X′y)

But predictive loss simply may not be quadratic

Other approaches are possible.

87 / 247
...But We Can Also Consider Non-Quadratic Loss

88 / 247
LAD Regression (Absolute-Error Loss)

Loss is linear on each side of 0 with slope 1 on each side.

The LAD estimator, β̂_LAD, minimizes absolute-error loss:

min_β Σₜ |εt|

Not as simple as OLS, but still simple

89 / 247
Quantile Regression (LinLin Loss)
Loss is linear with potentially different slopes on each side of 0.

QR minimizes LinLin loss, or check function loss:

min_β Σₜ check(εt),

where:

check(e) = a|e|, if e ≤ 0
           b|e|, if e > 0

         = a|e| · I(e ≤ 0) + b|e| · I(e > 0).

I(x) = 1 if x is true, and I(x) = 0 otherwise.
I(·) stands for "indicator variable".
Not as simple as OLS, but still simple
90 / 247
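An illustrative sketch of the check (LinLin) loss and the resulting quantile-regression objective; the d-based weights anticipate the link d = b/(a + b) given two slides below, and the function names are assumptions.

import numpy as np

def check_loss(e, a, b):
    """LinLin ('check') loss: slope a for non-positive errors, b for
    positive ones. With a = b it is proportional to absolute-error
    loss, so LAD is the special case d = b/(a+b) = 0.5."""
    e = np.asarray(e, dtype=float)
    return np.where(e <= 0, a * np.abs(e), b * np.abs(e))

def quantile_objective(beta, y, X, d):
    """Sample check-loss objective for the d-th quantile
    (weights a = 1-d on negative errors, b = d on positive errors)."""
    e = y - X @ beta
    return np.sum(np.where(e <= 0, (1 - d) * np.abs(e), d * np.abs(e)))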
Additional Interpretation
What does regression tell us about?

LS: Conditional mean
How does mean(y|X) vary with X?
mean(y|X) = Xβ

LAD: Conditional median
How does median(y|X) vary with X?
median(y|X) = Xβ
Perhaps not much different from LS?

Quantile regression: Conditional quantile
(LAD is a special case. Why?)
How does the 100d-percent quantile, 100d%(y|X), vary with X?
100d%(y|X) = Xβ

e.g., How does the fifth percentile of the distribution of log wage
given education vary with education?
91 / 247
The Link Between d, a, and b

100d%(y|X) = Xβ

where

d = b/(a + b) = 1/(1 + a/b)

Note that a and b matter only through their ratio

92 / 247
Optimal Forecasts Can Be Biased

Symmetric (quadratic) loss: Optimal forecast is conditional


mean; corresponding error has zero mean

Symmetric (absolute) loss: Optimal forecast is conditional


median; corresponding error has zero median

Asymmetric (check) loss: Optimal forecast is conditional


quantile; corresponding error has non-zero mean and median

93 / 247
Quantile Regression (10th Percentile): LWAGE → c, EDUC

[Figure: scatter of LWAGE vs. EDUC with fitted 10th-percentile line]

LWAGE-hat = 0.799 + 0.068 EDUC
94 / 247
Quantile Regression (90th Percentile): LWAGE → c, EDUC

[Figure: scatter of LWAGE vs. EDUC with fitted 90th-percentile line]

LWAGE-hat = 1.894 + 0.083 EDUC
95 / 247
Comparison: LWAGE → c, EDUC

[Figure: LWAGE vs. EDUC with fitted LAD, OLS, 10th-percentile, and 90th-percentile lines]
96 / 247
Misspecification

Do we really believe that the fitted model matches the DGP?

97 / 247
Regression Statistics: Schwarz criterion 1.435

SIC = T^(K/T) · (Σₜ et² / T)

More general ln L version:

SIC = −2 ln L + K ln T

98 / 247
Regression Statistics: Akaike info criterion 1.423

AIC = e^(2K/T) · (Σₜ et² / T)

More general ln L version:

AIC = −2 ln L + 2K

99 / 247
Predictive Perspectives

100 / 247
Predictive Perspectives
Estimate out-of-sample forecast accuracy (which is what we
really care about) on the basis of in-sample forecast accuracy. (We
want to select a forecasting model that will perform well for
out-of-sample forecasting, quite apart from its in-sample fit.)
We proceed by inflating the in-sample mean-squared error
(MSE), in various attempts to offset the deflation from regression
fitting, to obtain a good estimate of out-of-sample MSE.

MSE = Σₜ et² / T

s² = (T/(T − K)) · MSE

AIC = e^(2K/T) · MSE

SIC = T^(K/T) · MSE

The AIC and SIC penalties have certain optimality properties.
101 / 247
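A small sketch computing these four MSE-inflation criteria from a residual vector; the function name is illustrative.

import numpy as np

def in_sample_criteria(e, T, K):
    """MSE, s^2, AIC, and SIC in the SSR-based forms above: each one
    inflates in-sample MSE by a different degrees-of-freedom penalty."""
    mse = np.sum(np.asarray(e) ** 2) / T
    s2 = (T / (T - K)) * mse
    aic = np.exp(2 * K / T) * mse
    sic = T ** (K / T) * mse
    return mse, s2, aic, sic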
Non-Normality and Outliers

Do we really believe that the disturbances are Gaussian?

102 / 247
What We'll Do

Distributional results under non-normality

Detecting non-normality and outliers

Dealing with non-normality and outliers (robust estimation)

103 / 247
Recall Sample Mean Under iid With Normality

ȳ is MVUE, and

ȳ ~ N(μ, σ²/T),

and we estimate σ² consistently using s²

Exact (finite-sample) result

104 / 247
Recall Sample Mean Under iid Without Normality

ȳ is BLUE, and

ȳ ~ᵃ N(μ, σ²/T),

and we estimate σ² consistently using s²

Approximate (large-sample) result,
due to the central limit theorem

The "a" means "asymptotically as T → ∞"

105 / 247
OLS Under Ideal Conditions With Normality

β̂_LS is MVUE, and

β̂_LS ~ N(β, σ²(X′X)⁻¹),

and we estimate σ² consistently using s²

Exact (finite-sample) result

106 / 247
OLS Under Ideal Conditions Without Normality

β̂_LS is BLUE, and

β̂_LS ~ᵃ N(β, σ²(X′X)⁻¹),

and we estimate σ² consistently using s²

Approximate (large-sample) result,
due to the central limit theorem

The "a" means "asymptotically as T → ∞"

107 / 247
Detecting Non-Normality
(In Data or in Residuals)

Sample skewness and kurtosis, Ŝ and K̂

Jarque-Bera test. Under normality we have:

JB = (T/6) · ( Ŝ² + (1/4)(K̂ − 3)² ) ~ χ²₂

Many more

108 / 247
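An illustrative NumPy sketch of the Jarque-Bera statistic as defined above (function name assumed); compare it to a chi-squared(2) critical value.

import numpy as np

def jarque_bera(y):
    """JB = (T/6) * (S^2 + (K-3)^2 / 4), using divide-by-T sample moments;
    distributed chi^2 with 2 d.f. under normality."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    sigma2 = (d ** 2).mean()
    S = (d ** 3).mean() / sigma2 ** 1.5
    K = (d ** 4).mean() / sigma2 ** 2
    return (T / 6.0) * (S ** 2 + 0.25 * (K - 3.0) ** 2)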
Recall Our OLS Wage Regression

109 / 247
OLS Residual Histogram and Statistics

110 / 247
QQ Plots

I We introduced histograms earlier...

I ...but if interest centers on the tails of distributions, QQ plots


often provide sharper insight as to the agreement or
divergence between the actual and reference distributions

I QQ plot is quantiles of the standardized data against quantiles


of a standardized reference distribution (e.g., normal)

I If the distributions match, the QQ plot is the 45 degree line

I To the extent that the QQ plot does not match the 45 degree
line, the nature of the divergence can be very informative, as
for example in indicating fat tails

111 / 247
OLS Wage Regression Residual QQ Plot

112 / 247
Detecting Outliers and Influential Observations:
OLS Residual Plot

113 / 247
Detecting Outliers and Influential Observations:
Leave-One-Out Plot

Consider:

β̂^(−t),  t = 1, ..., T

Leave-one-out plot

114 / 247
Wage Regression

115 / 247
Detecting Outliers and Influential Observations:
Leverage Plot

Leverage ht is the t-th diagonal element of X(X′X)⁻¹X′.

Leverage plot

What's ht all about?

116 / 247
 
et and ht are Two Key Pieces of (β̂ − β̂^(−t))

It can be shown that

β̂ − β̂^(−t) = (X′X)⁻¹ xt et · 1/(1 − ht)

Other things equal, the larger is et, the larger is (β̂ − β̂^(−t))
Other things equal, the larger is ht, the larger is (β̂ − β̂^(−t))

The third key piece is xt

117 / 247
Dealing with Outliers:
Least Absolute Deviations (LAD), Again!
The LAD estimator, β̂_LAD, solves:

min_β Σₜ |εt|

Not as simple as OLS, but still simple

LAD regression is quantile regression with d = .5

Recall that OLS fits the conditional mean function:

mean(y|X) = Xβ

LAD fits the conditional median function (50% quantile):

median(y|X) = Xβ

The two are equal under symmetry, as under the FIC, but not under
asymmetry, in which case the median is a better measure of central
tendency
118 / 247
LAD Wage Regression Estimation

119 / 247
Digging into Prediction (Much) More Deeply (Again)

The environment is:

yt = xt′β + εt,  t = 1, ..., T

εt ~ iid D(0, σ²)
120 / 247
Recall Point Prediction

Assume for the moment that we know the model parameters. That
is, assume that we know β and all parameters governing D. Note
that the mean and variance are in general insufficient to
characterize a non-Gaussian D.

We immediately obtain point forecasts as:

E(yi | xi = x*) = x*′β.

121 / 247
Recall Analytic Density Prediction (And Hence Also
Interval Prediction) for D Gaussian

If D is Gaussian, then the density prediction is immediately

yi | xi = x* ~ N(x*′β, σ²).    (2)

We can calculate any desired interval forecast from the density
forecast. For example, a 95% interval would be x*′β ± 1.96σ.

122 / 247
Recall Simulation Algorithm for Density Prediction for D
Gaussian

1. Take R draws from the disturbance density N(0, σ²).
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

As R → ∞, the algorithmic and analytic results coincide.

123 / 247
Recall Making the Forecasts Feasible

The approaches above are infeasible in that they assume known
parameters. They can be made feasible by replacing unknown
parameters with estimates. For example, the feasible version of the
point prediction is x*′β̂. Similarly, to construct a feasible 95%
interval forecast in the Gaussian case we can take x*′β̂ ± 1.96 σ̂,
where σ̂ is the standard error of the regression
(also earlier denoted s).

124 / 247
Density Prediction for D Parametric Non-Gaussian

Our simulation algorithm still works for non-Gaussian D, so long as
we can simulate from D.
1. Take R draws from the disturbance density D.
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form a 95% interval forecast by sorting the output from step
2, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
Again, as R → ∞, the algorithmic results become arbitrarily
accurate.

125 / 247
Density Prediction for D Non-Parametric
Now assume that we know nothing about distribution D, except
that it has mean 0. In addition, now that we have introduced
feasible forecasts, we will stay in that world.
1. Take R draws from the regression residual density (which is an
approximation to the disturbance density) by assigning
probability 1/N to each regression residual and sampling with
replacement.
2. Add x*′β̂ to each draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form a 95% interval forecast by sorting the output from step
2, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
As R → ∞ and N → ∞, the algorithmic results become arbitrarily
accurate.
126 / 247
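A hedged sketch of this feasible nonparametric (residual-bootstrap) density and interval forecast; names and defaults are illustrative choices.

import numpy as np

def bootstrap_density_forecast(x_star, beta_hat, residuals, R=100_000, seed=0):
    """Step 1: resample the OLS residuals with replacement as approximate
    disturbance draws; step 2: add x*'beta_hat; step 4: take the empirical
    .025 and .975 quantiles as a 95% interval forecast."""
    rng = np.random.default_rng(seed)
    eps_draws = rng.choice(residuals, size=R, replace=True)
    draws = x_star @ beta_hat + eps_draws
    lo, hi = np.quantile(draws, [0.025, 0.975])
    return draws, (lo, hi)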
Density Forecasts for D Nonparametric and Acknowledging
Parameter Estimation Uncertainty

So far: Disturbance uncertainty


Now: Disturbance uncertainty and parameter estimation
uncertainty
The feasible approach to density forecasting sketched above still
fails to acknowledge parameter estimation uncertainty, because it
treats plugged-in parameter estimates as true values, ignoring
the fact that they are only estimates and hence subject to
sampling variability. Parameter estimation uncertainty is often
ignored, as its contribution to overall forecast MSE can be shown
to vanish unusually quickly as sample size grows. But it impacts
forecast uncertainty in small samples and hence should not be
ignored in general.

127 / 247
Algorithm for Density Forecasts for D Nonparametric and
Acknowledging Parameter Estimation Uncertainty
1. Take R approximate disturbance draws by assigning
probability 1/N to each regression residual and sampling with
replacement.
2. Take R draws from the large-N sampling density of β̂, namely
β̂_OLS ~ N(β, σ²(X′X)⁻¹),
as approximated by N(β̂, σ̂²(X′X)⁻¹).
3. To each disturbance draw from 1 add the corresponding x*′β̂
draw from 2.
4. Form a density forecast by fitting a density to the output from
step 3.
5. Form a 95% interval forecast by sorting the output from step
3, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
As R → ∞ and N → ∞, we get precisely correct results.
128 / 247
Indicator Variables in Cross Sections:
Group Effects

Effectively a type of Structural change:


Do we really believe that coefficients are fixed across people?

129 / 247
Dummy Variables for Group Effects

A dummy variable, or indicator variable, is just a 0-1 variable that
indicates something, such as whether a person is female:

FEMALEt = 1 if person t is female, 0 otherwise

(It really is that simple.)

Intercept dummies

Note that the sample mean of a dummy variable is the fraction of
the sample with the indicated attribute.

130 / 247
Histograms for Wage Covariates

131 / 247
Recall Basic Wage Regression on Education and
Experience

LWAGE C , EDUC , EXPER

132 / 247
Basic Wage Regression Results

133 / 247
Basic Wage Regression Residual Scatter

134 / 247
Controlling for Sex, Race, and Union Status
in the Wage Regression

Now:

LWAGE C , EDUC , EXPER, FEMALE , NONWHITE , UNION

135 / 247
Wage Regression on Education, Experience, and Group
Dummies

136 / 247
Residual Scatter from Wage Regression on
Education, Experience, and Group Dummies

137 / 247
Important Issues

I The intercept corresponds to the base case across all
dummies (i.e., when all dummies are simultaneously 0), and
the dummy coefficients give the extra effects (i.e., when the
respective dummies are 1).

I Alternatively, use a full set of dummies for each category (e.g.,
both a union dummy and a non-union dummy) and drop the
intercept. (More useful/common in time-series situations)

I Never include a full set of dummies and an intercept.
Would be totally redundant: perfect multicollinearity

138 / 247
Nonlinearity

Do we really believe that the relationship is linear?

139 / 247
Anscombe's Quartet

140 / 247
Anscombe's Quartet: Regressions

141 / 247
Anscombe's Quartet: Graphics

142 / 247
Parametric and Nonparametric Nonlinearity...

...and the gray area in between.

143 / 247
Log-Log Regression

ln yt = β₁ + β₂ ln xt + εt

For close yt and xt, (ln yt − ln xt) × 100 is approximately the percent
difference between yt and xt. Hence the coefficients in log-log
regressions give the expected percent change in y for a one-percent
change in x. That is, they give the elasticity of y with respect to x.

Example: Cobb-Douglas production function

yt = A Ltᵅ Ktᵝ exp(εt)

ln yt = ln A + α ln Lt + β ln Kt + εt

We expect an α% increase in output
in response to a 1% increase in labor input
144 / 247
Log-Lin Regression

ln yt = β xt + εt

The coefficients in log-lin regressions give the expected percent
change in y for a one-unit (not 1%!) change in x.

Example: Exponential growth

yt = A e^(rt)
ln yt = ln A + r t

Coefficient r gives the expected percent change in y for a one-unit
change in time

Another example: LWAGE regression!

Coefficient on education gives the expected percent change in
WAGE arising from one more year of education.
145 / 247
Intrinsically Non-Linear Models

One example is the S-curve model,

y = 1 / (a + b rˣ)

(0 < r < 1)

No way to transform to linearity

Use non-linear least squares (NLS)

Under the remaining FIC (that is, dropping only linearity), NLS
has a sampling distribution similar to that of LS under the FIC

146 / 247
Taylor Series Expansions

Really no such thing as an intrinsically non-linear model...

In the bivariate case we can think of the relationship as

yt = g (xt , t )

or slightly less generally as

yt = f (xt ) + t

147 / 247
Taylor Series, Continued

Consider Taylor series expansions of f(xt).

The linear (first-order) approximation is

f(xt) ≈ β₁ + β₂ xt,

and the quadratic (second-order) approximation is

f(xt) ≈ β₁ + β₂ xt + β₃ xt².

In the multiple regression case, Taylor approximations also involve
interaction terms. Consider, for example, f(xt, zt):

f(xt, zt) ≈ β₁ + β₂ xt + β₃ zt + β₄ xt² + β₅ zt² + β₆ xt zt + ....

Equally relevant for dummy variables

148 / 247
A Key Insight

The ultimate point is that so-called intrinsically non-linear


models are themselves linear when viewed from the series-expansion
perspective. In principle, of course, an infinite number of series
terms are required, but in practice nonlinearity is often quite gentle
(e.g., quadratic) so that only a few series terms are required.

So non-linearity is in some sense


really an omitted-variables problem

149 / 247
Assessing Non-Linearity

Use AIC and SIC as always.

Use t's and F as always.

150 / 247
Basic Wage Regression

151 / 247
Quadratic Wage Regression

152 / 247
Dummy Interactions?

153 / 247
Everything

154 / 247
So Drop Dummy Interactions and Tighten the Rest

155 / 247
Heteroskedasticity in Cross-Section Regression

Do we really believe that disturbance variances


are constant over space?

156 / 247
Heteroskedasticity is Another Type of Violation of the IC
(This time it's non-constant disturbance variances.)

Consider: ε ~ N(0, Σ)

Heteroskedasticity corresponds to Σ diagonal but Σ ≠ σ²I

Simpler but more important than spatial correlation

Σ =
⎡ σ₁²  0    ...  0   ⎤
⎢ 0    σ₂²  ...  0   ⎥
⎢ ⋮    ⋮     ⋱   ⋮   ⎥
⎣ 0    0    ...  σ_N² ⎦

157 / 247
Causes and Consequences of Heteroskedasticity
Causes:
Can arise for many reasons
Engel curve (e.g., food expenditure vs. income) is classic example

Consequences:
OLS estimation remains largely OK.
Parameter estimates consistent but inefficient.
OLS inference destroyed. Standard errors biased and inconsistent.
Hence t statistics do not have the t distribution in finite samples
and do not have the N(0, 1) distribution asymptotically.

Corresponding predictive consequences:

Point prediction remains largely OK.
We still have Ê(yt | xt = xt*) → E(yt | xt = xt*).
Interval and density forecasts destroyed. So we need to detect
and deal with the heteroskedasticity.
158 / 247
What if You Don't Care About Detecting and Dealing
With Heteroskedasticity...

e.g., perhaps you're only interested in point prediction but still
want to do credible inference regarding the contributions of the
various x variables to the point prediction.

Then use heteroskedasticity-robust standard errors

White standard errors

Just a simple regression option

e.g., in EViews,
instead of ls y,c,x, use ls(cov=white) y,c,x

159 / 247
Wage regression with White Standard Errors

160 / 247
Detecting Heteroskedasticity

I Graphical heteroskedasticity diagnostics

I Formal heteroskedasticity tests

161 / 247
Graphical Diagnostics

Graph ei² against xi, for various regressors

162 / 247
Recall Our Final Wage Regression

163 / 247
Squared Residual vs. EDUC

164 / 247
The Breusch-Pagan-Godfrey Test (BPG)

Limitation of graphing ei² against xi: Purely pairwise

So move to a formal testing framework that blends all information

BPG test:

I Estimate the OLS regression, and obtain the squared residuals
I Regress the squared residuals on all regressors
I To test the null hypothesis of no relationship, examine N·R²
from this regression. In large samples N·R² ~ χ²_K under the
null, where K is the number of regressors in the test
regression.

165 / 247
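A minimal sketch of the BPG statistic N·R² computed by hand: regress the squared residuals on X (which should include a constant column) and form N times the R² of that auxiliary regression. The function name is an assumption; in practice the statistic is compared to a chi-squared critical value.

import numpy as np

def bpg_statistic(e, X):
    """N * R^2 from the auxiliary regression of squared residuals on X."""
    e2 = e ** 2
    gamma = np.linalg.solve(X.T @ X, X.T @ e2)   # auxiliary OLS
    fitted = X @ gamma
    ssr = np.sum((e2 - fitted) ** 2)
    sst = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ssr / sst
    return len(e2) * r2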
BPG Test

166 / 247
White's Test

Like BPG, but replace BPG's linear regression
with a more flexible (quadratic) regression

I Estimate the OLS regression, and obtain the squared residuals

I Regress the squared residuals on all regressors, squared
regressors, and pairwise regressor cross products

I To test the null hypothesis of no relationship, examine N·R²
from this regression. In large samples N·R² ~ χ²_K under the
null.

167 / 247
White's Test

168 / 247
Simulation Algorithm for Density Prediction
D Gaussian, Heteroskedastic Disturbances

1. Take R draws from the disturbance density N(0, σ̂*²), where
σ̂*² is the fitted value from the White regression evaluated at
x = x*.
2. Add x*′β̂ to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

169 / 247
Spatial Correlation in Cross-Section Regression

Do we really believe that the disturbances are uncorrelated over


space?

170 / 247
Spatial Correlation is Another Type of Violation of the IC
(This time it's non-zero disturbance correlations.)

Consider: ε ~ N(0, Σ)

Spatial correlation corresponds to non-diagonal Σ.

Σ =
⎡ σ₁²   σ₁₂  ...  σ₁T ⎤
⎢ σ₂₁   σ₂²  ...  σ₂T ⎥
⎢ ⋮     ⋮     ⋱   ⋮   ⎥
⎣ σT₁   σT₂  ...  σT² ⎦

Advanced topic, and we will not pursue it further here.

Could be block-diagonal (clustering)

171 / 247
Time Series

172 / 247
Misspecification

Do we really believe that the fitted model matches the DGP?


No major changes in time series...

173 / 247
Non-Normality and Outliers

Do we really believe that the disturbances are Gaussian?


No major changes in time series...

174 / 247
Indicator Variables in Time Series:
Trend and Seasonality

Trend and seasonality are effectively types of structural change

Now: Do we really believe that means are fixed over time?

Later: Do we really believe that regression coefficients are


fixed over time?

175 / 247
Liquor Sales

176 / 247
Log Liquor Sales

177 / 247
Linear Deterministic Trend

Trendt = β₁ + β₂ TIMEt

where TIMEt = t

Simply run the least squares regression y → c, TIME, where

TIME = (1, 2, 3, ..., T−1, T)′

178 / 247
Various Linear Trends

179 / 247
Linear Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.454290 0.017468 369.4834 0.0000


TIME 0.003809 8.98E-05 42.39935 0.0000

R-squared 0.843318 Mean dependent var 7.096188


Adjusted R-squared 0.842849 S.D. dependent var 0.402962
S.E. of regression 0.159743 Akaike info criterion -0.824561
Sum squared resid 8.523001 Schwarz criterion -0.801840
Log likelihood 140.5262 Hannan-Quinn criter. -0.815504
F-statistic 1797.705 Durbin-Watson stat 1.078573
Prob(F-statistic) 0.000000

180 / 247
Residual Plot

181 / 247
Deterministic Seasonality

Seasonalt = Σᵢ₌₁ˢ γᵢ SEASit   (s seasons per year)

where SEASit = 1 if observation t falls in season i, 0 otherwise

Simply run the least squares regression y → SEAS₁, ..., SEASs

(or blend: y → TIME, SEAS₁, ..., SEASs)

where (e.g., in the quarterly data case, assuming Q1 start and Q4 end):

SEAS₁ = (1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ..., 0)′
SEAS₂ = (0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..., 0)′
SEAS₃ = (0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, ..., 0)′
SEAS₄ = (0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, ..., 1)′
182 / 247
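An illustrative helper that builds the TIME (optionally TIME²) plus full-seasonal-dummy design matrix for the blend y → TIME, SEAS₁, ..., SEASs; it assumes the sample starts in season 1, and the function name is hypothetical.

import numpy as np

def trend_seasonal_design(T, s, quadratic=False):
    """Columns: TIME (and TIME^2 if requested) plus a full set of s seasonal
    dummies and no separate intercept, so the dummies play the role of
    season-specific intercepts."""
    time = np.arange(1, T + 1, dtype=float)
    cols = [time] + ([time ** 2] if quadratic else [])
    for i in range(s):
        cols.append((np.arange(T) % s == i).astype(float))   # SEAS_{i+1}
    return np.column_stack(cols)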
Linear Trend with Seasonal Dummies

183 / 247
Residual Plot

184 / 247
Seasonal Pattern

185 / 247
Nonlinearity in Time Series

Do we really believe that trends are linear?

186 / 247
Non-Linear Trend: Exponential (Log-Linear)

Trendt = β₁ e^(β₂ TIMEt)

ln(Trendt) = ln(β₁) + β₂ TIMEt

187 / 247
Figure: Various Exponential Trends

188 / 247
Non-Linear Trend: Quadratic

Allow for gentle curvature by including TIME and TIME²:

Trendt = β₁ + β₂ TIMEt + β₃ TIMEt²

189 / 247
Figure: Various Quadratic Trends

190 / 247
Recall Log-Linear Liquor Sales Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.454290 0.017468 369.4834 0.0000


TIME 0.003809 8.98E-05 42.39935 0.0000

R-squared 0.843318 Mean dependent var 7.096188


Adjusted R-squared 0.842849 S.D. dependent var 0.402962
S.E. of regression 0.159743 Akaike info criterion -0.824561
Sum squared resid 8.523001 Schwarz criterion -0.801840
Log likelihood 140.5262 Hannan-Quinn criter. -0.815504
F-statistic 1797.705 Durbin-Watson stat 1.078573
Prob(F-statistic) 0.000000

191 / 247
Residual Plot

192 / 247
Log-Quadratic Liquor Sales Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.231269 0.020653 301.7187 0.0000


TIME 0.007768 0.000283 27.44987 0.0000
TIME2 -1.17E-05 8.13E-07 -14.44511 0.0000

R-squared 0.903676 Mean dependent var 7.096188


Adjusted R-squared 0.903097 S.D. dependent var 0.402962
S.E. of regression 0.125439 Akaike info criterion -1.305106
Sum squared resid 5.239733 Schwarz criterion -1.271025
Log likelihood 222.2579 Hannan-Quinn criter. -1.291521
F-statistic 1562.036 Durbin-Watson stat 1.754412
Prob(F-statistic) 0.000000

193 / 247
Residual Plot

194 / 247
Log-Quadratic Liquor Sales Trend Estimation
with Seasonal Dummies

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

TIME 0.007739 0.000104 74.49828 0.0000


TIME2 -1.18E-05 2.98E-07 -39.36756 0.0000
D1 6.138362 0.011207 547.7315 0.0000
D2 6.081424 0.011218 542.1044 0.0000
D3 6.168571 0.011229 549.3318 0.0000
D4 6.169584 0.011240 548.8944 0.0000
D5 6.238568 0.011251 554.5117 0.0000
D6 6.243596 0.011261 554.4513 0.0000
D7 6.287566 0.011271 557.8584 0.0000
D8 6.259257 0.011281 554.8647 0.0000
D9 6.199399 0.011290 549.0938 0.0000
D10 6.221507 0.011300 550.5987 0.0000
D11 6.253515 0.011309 552.9885 0.0000
D12 6.575648 0.011317 581.0220 0.0000

R-squared 0.987452 Mean dependent var 7.096188


Adjusted R-squared 0.986946 S.D. dependent var 0.402962
S.E. of regression 0.046041 Akaike info criterion -3.277812
Sum squared resid 0.682555 Schwarz criterion -3.118766
Log likelihood 564.6725 Hannan-Quinn criter. -3.214412
Durbin-Watson stat 0.581383
195 / 247
Residual Plot

196 / 247
Serial Correlation in Time-Series Regression

Do we really believe that disturbances are uncorrelated over
time?
(Not possible in cross sections, so we didn't study it before...)

197 / 247
Serially Correlated Regression Disturbances

Disturbance serial correlation, or autocorrelation,
means correlation over time
Current disturbance correlated with past disturbance(s)

Leading example
(AR(1) disturbance serial correlation):

yt = xt′β + εt

εt = φ εt−1 + vt,  |φ| < 1
vt ~ iid N(0, σ²)

(Extension to AR(p) disturbance serial correlation is immediate)

198 / 247
Serial Correlation Implies Σ ≠ σ²I
Recall heteroskedasticity:
Σ diagonal but with different diagonal elements

Now serial correlation:
Σ not even diagonal

Σ =
⎡ γ(0)     γ(1)     ...  γ(T−1) ⎤
⎢ γ(1)     γ(0)     ...  γ(T−2) ⎥
⎢ ⋮        ⋮         ⋱   ⋮      ⎥
⎣ γ(T−1)  γ(T−2)   ...  γ(0)   ⎦

where:
γ(τ) = cov(εt, εt−τ),  τ = 0, 1, 2, ...
Autocovariances: γ(τ), τ = 1, 2, ...
Autocorrelations: ρ(τ) = γ(τ)/γ(0), τ = 1, 2, ...
199 / 247
Why is Neglected Serial Correlation a Problem for
Prediction?
The IC involve Σ = σ²I, and serial correlation implies Σ ≠ σ²I, so
we get inconsistent s.e.'s, just as with heteroskedasticity. But that
was basically inconsequential for point forecasts.

But serial correlation is a bigger problem for prediction.
Here's the intuition:

Serial correlation in disturbances/residuals implies that the
included X variables have missed something that could be
exploited for improved point forecasting (and hence also improved
interval and density forecasting). That is, all types of forecasts are
sub-optimal when serial correlation is neglected.

Put differently:
Serial correlation in forecast errors means that you can forecast
your forecast errors! So something is wrong and can be improved...
200 / 247
What if You Don't Care About Neglected Serial
Correlation?
Hard to imagine

But perhaps you want to do credible inference regarding the
contributions of the various x variables to a point prediction based
only on the x's.

Then use heteroskedasticity and autocorrelation robust standard
errors

HAC standard errors, Newey-West standard errors

Just a simple regression option

e.g., in EViews,
instead of ls y,c,x, use ls(cov=hac) y,c,x

201 / 247
Trend + Seasonal Liquor Sales Regression with HAC
Standard Errors

202 / 247
Detecting Serial Correlation

I Formal tests
I Durbin-Watson
I Breusch-Godfrey

I Graphical diagnostics (actually more sophisticated and useful)


I Residual plot
I Residual scatterplot (et vs. et−1)
I Residual autocorrelations

203 / 247
Recall Our Log-Quadratic Liquor Sales Model

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

TIME 0.007739 0.000104 74.49828 0.0000


TIME2 -1.18E-05 2.98E-07 -39.36756 0.0000
D1 6.138362 0.011207 547.7315 0.0000
D2 6.081424 0.011218 542.1044 0.0000
D3 6.168571 0.011229 549.3318 0.0000
D4 6.169584 0.011240 548.8944 0.0000
D5 6.238568 0.011251 554.5117 0.0000
D6 6.243596 0.011261 554.4513 0.0000
D7 6.287566 0.011271 557.8584 0.0000
D8 6.259257 0.011281 554.8647 0.0000
D9 6.199399 0.011290 549.0938 0.0000
D10 6.221507 0.011300 550.5987 0.0000
D11 6.253515 0.011309 552.9885 0.0000
D12 6.575648 0.011317 581.0220 0.0000

R-squared 0.987452 Mean dependent var 7.096188


Adjusted R-squared 0.986946 S.D. dependent var 0.402962
S.E. of regression 0.046041 Akaike info criterion -3.277812
Sum squared resid 0.682555 Schwarz criterion -3.118766
Log likelihood 564.6725 Hannan-Quinn criter. -3.214412
Durbin-Watson stat 0.581383

204 / 247
Formal Tests: Durbin-Watson (0.59!)

Simple AR(1) environment:

yt = xt′β + εt

εt = φ εt−1 + vt
vt ~ iid N(0, σ²)
We want to test H₀: φ = 0 against H₁: φ ≠ 0

Regress yt → xt and obtain the residuals et

Then form:

DW = Σₜ₌₂ᵀ (et − et−1)² / Σₜ₌₁ᵀ et²

205 / 247
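A one-line NumPy sketch of the DW statistic from a residual vector (function name assumed); values near 2 suggest no AR(1) serial correlation, values near 0 strong positive correlation.

import numpy as np

def durbin_watson(e):
    """Sum of squared first differences of the residuals divided by the
    sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)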
Understanding the Durbin-Watson Statistic

DW = Σₜ₌₂ᵀ (et − et−1)² / Σₜ₌₁ᵀ et² = [(1/T) Σₜ₌₂ᵀ (et − et−1)²] / [(1/T) Σₜ₌₁ᵀ et²]

The numerator is

(1/T) Σₜ₌₂ᵀ et² + (1/T) Σₜ₌₂ᵀ et−1² − 2 (1/T) Σₜ₌₂ᵀ et et−1.

Hence as T → ∞:

DW → [σ² + σ² − 2 cov(εt, εt−1)] / σ² = 1 + 1 − 2 corr(εt, εt−1) = 2(1 − corr(εt, εt−1))

⟹ DW ∈ [0, 4], DW → 2 as φ → 0, and DW → 0 as φ → 1

206 / 247
Formal Tests: Breusch-Godfrey

General AR(p) environment:

yt = xt′β + εt

εt = φ₁ εt−1 + ... + φp εt−p + vt
We want to test H₀: (φ₁, ..., φp) = 0 against H₁: (φ₁, ..., φp) ≠ 0

I Regress yt → xt and obtain the residuals et

I Regress et → xt, et−1, ..., et−p

I Examine T·R². In large samples T·R² ~ χ²_p under the null.
Does this sound familiar?

207 / 247
BG for AR(1) Disturbances
(TR 2 = 168.5, p = 0.0000)

208 / 247
BG for AR(4) Disturbances
(TR 2 = 216.7, p = 0.0000)

209 / 247
BG for AR(8) Disturbances
(TR 2 = 219.0, p = 0.0000)

210 / 247
Residual Plot

211 / 247
Residual Scatterplot (et vs. et−1)

212 / 247
Residual Autocorrelations

213 / 247
Fixing the Serial Correlation Problem:
Including Lags of y as Regressors

Serial correlation in disturbances means that the included x's
(in our case, trends and seasonals)
don't fully account for the dynamics in y.

But the problem is simple to fix:
Just include lags of y as additional regressors.

AR(p) disturbances are fixed by including p lags of y.

(Select p using the usual AIC, SIC, etc.)

AIC selects p = 4, and SIC selects p = 3.

214 / 247
Trend + Seasonal Model
with Four Lags of y

215 / 247
Trend + Seasonal Model
with Four Lags of y
Residual Plot

216 / 247
Residual Scatterplot

217 / 247
Residual Autocorrelations

218 / 247
Forecasting and the "Forecasting the Right-Hand-Side
Variables" Problem

yt = xt′β + εt  ⟹  yt+h = x′t+h β + εt+h

Projecting on current information,

yt+h,t = x′t+h,t β

"Forecasting the right-hand-side variables" problem:
We don't have xt+h,t!

But no problem for trends or seasonals

219 / 247
What About Autoregressions?
e.g., AR(1)

yt = φ yt−1 + εt

Hence:

yt+h = φ yt+h−1 + εt+h

Projecting on current information,

yt+h,t = φ yt+h−1,t

There seems to be a FRHS variables problem for h > 1.
But there's not!
We can build the multi-step forecast recursively.
Wold's chain rule of forecasting
220 / 247
(More General) Structural Change in Time Series:
Drifts and Breaks

Again, do we really believe that coefficients are fixed over time?

221 / 247
Structural Change
Sharp Breakpoint Exogenously Known
For simplicity of exposition, consider a bivariate regression:

yt = β₁₁ + β₂₁ xt + εt,  t = 1, ..., T*
yt = β₁₂ + β₂₂ xt + εt,  t = T* + 1, ..., T

Let

Dt = 0, t = 1, ..., T*
Dt = 1, t = T* + 1, ..., T

Then we can write the model as:

yt = (β₁₁ + (β₁₂ − β₁₁)Dt) + (β₂₁ + (β₂₂ − β₂₁)Dt) xt + εt

We run:
yt → c, Dt, xt, Dt·xt

Use regression to test for structural change (F test)
Use regression to accommodate structural change if present.
222 / 247
Structural Change
Sharp Breakpoint, Exogenously Known, Continued

The Chow test is what we're really calculating:

Chow = [ (e′e − (e₁′e₁ + e₂′e₂)) / K ] / [ (e₁′e₁ + e₂′e₂) / (T − 2K) ]

Distributed F under the no-break null (and the rest of the IC)

223 / 247
Structural Change
Sharp Breakpoint, Endogenously Identified

MaxChow = max over τ ∈ [τ_min, τ_max] of Chow(τ),

where τ denotes potential break location as a fraction of the sample

(Typically we take τ_min = .15 and τ_max = .85)

The null distribution of MaxChow has been tabulated.

224 / 247
Rolling-Window Regression
for Generic Structural Change Assessment

Calculate and examine

β̂_{t−w:t},  for t = w + 1, ..., T

w is window width

What does window width govern?

225 / 247
Expanding-Window (Recursive) Regression
for Generic Structural Change Assessment
Model:

yt = Σₖ₌₁ᴷ βk xkt + εt

εt ~ iid N(0, σ²),  t = 1, ..., T.

OLS estimation uses the full sample, t = 1, ..., T.

Recursive least squares uses an expanding sample.
Begin with the first K observations and estimate the model.
Then estimate using the first K + 1 observations, and so on.
At the end we have a set of recursive parameter estimates:
β̂k,t, for k = 1, ..., K and t = K, ..., T.
226 / 247
Recursive Residuals

At each t, t = K, ..., T − 1, compute a 1-step forecast,

ŷt+1,t = Σₖ₌₁ᴷ β̂k,t xk,t+1.

The corresponding forecast errors, or recursive residuals, are

êt+1,t = yt+1 − ŷt+1,t.

êt+1,t ~ N(0, σ² rt)

where rt > 1 for all t

227 / 247
Standardized Recursive Residuals and CUSUM

wt+1,t ≡ êt+1,t / (σ √rt),  t = K, ..., T − 1.

Under the maintained assumptions,

wt+1,t ~ iid N(0, 1).

Then

CUSUMt ≡ Σ_{τ=K}^{t} wτ+1,τ,  t = K, ..., T − 1

is just a sum of iid N(0, 1)'s.

228 / 247
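A hedged sketch of standardized recursive residuals and their CUSUM. It assumes a value of σ̂ is supplied and uses the standard recursive scaling rt = 1 + x′t+1 (X′1:t X1:t)⁻¹ xt+1, which the slide leaves implicit; names are illustrative.

import numpy as np

def recursive_residuals_cusum(y, X, sigma_hat):
    """Standardized one-step-ahead recursive residuals w_{t+1,t} and their
    cumulative sum, for t = K, ..., T-1 (0-indexed loop)."""
    T, K = X.shape
    w = []
    for t in range(K, T):
        Xt, yt = X[:t], y[:t]
        beta_t = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)   # expanding-sample OLS
        x_next = X[t]
        e_next = y[t] - x_next @ beta_t                  # recursive residual
        r_t = 1.0 + x_next @ np.linalg.solve(Xt.T @ Xt, x_next)
        w.append(e_next / (sigma_hat * np.sqrt(r_t)))
    w = np.array(w)
    return w, np.cumsum(w)                               # residuals, CUSUM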
Recursive Analysis: Constant Parameter Model

229 / 247
Recursive Analysis: Breaking Parameter Model

230 / 247
Heteroskedasticity in Time Series

Do we really believe that


disturbance variances are constant over time?

231 / 247
Varieties of Random (White) Noise

White noise: εt ~ WN(μ, σ²)  (serially uncorrelated)

Zero-mean white noise: εt ~ WN(0, σ²)  (serially uncorrelated)

Independent (strong) white noise: εt ~ iid (0, σ²)

Gaussian white noise: εt ~ iid N(0, σ²)

232 / 247
Linear Models (e.g., AR(1))

rt = φ rt−1 + εt

εt ~ iid(0, σ²),  |φ| < 1

Uncond. mean: E(rt) = 0  (constant)
Uncond. variance: E(rt²) = σ²/(1 − φ²)  (constant)
Cond. mean: E(rt | Ωt−1) = φ rt−1  (varies)
Cond. variance: E([rt − E(rt | Ωt−1)]² | Ωt−1) = σ²  (constant)

Conditional mean adapts, but conditional variance does not

233 / 247
ARCH(1) Process

rt | Ωt−1 ~ N(0, ht)
ht = ω + α rt−1²

E(rt) = 0
E(rt²) = ω / (1 − α)
E(rt | Ωt−1) = 0
E([rt − E(rt | Ωt−1)]² | Ωt−1) = ω + α rt−1²

234 / 247
GARCH(1,1) Process (Generalized ARCH)

rt | Ωt−1 ~ N(0, ht)
ht = ω + α rt−1² + β ht−1

E(rt) = 0
E(rt²) = ω / (1 − α − β)
E(rt | Ωt−1) = 0
E([rt − E(rt | Ωt−1)]² | Ωt−1) = ω + α rt−1² + β ht−1

Well-defined and covariance stationary if
0 < α < 1, 0 < β < 1, α + β < 1
235 / 247
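An illustrative simulation of the GARCH(1,1) process above, started at its unconditional variance (a common but not unique start-up choice); the function name is hypothetical.

import numpy as np

def simulate_garch11(T, omega, alpha, beta, seed=0):
    """Simulate r_t | Omega_{t-1} ~ N(0, h_t) with
    h_t = omega + alpha * r_{t-1}^2 + beta * h_{t-1}."""
    rng = np.random.default_rng(seed)
    r = np.zeros(T)
    h = np.zeros(T)
    h[0] = omega / (1.0 - alpha - beta)          # unconditional variance
    r[0] = np.sqrt(h[0]) * rng.standard_normal()
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
        r[t] = np.sqrt(h[t]) * rng.standard_normal()
    return r, h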
GARCH(1,1) and Exponential Smoothing

Exponential smoothing recursion:

σ̂t² = γ σ̂t−1² + (1 − γ) rt²

⟹ σ̂t² = (1 − γ) Σⱼ γʲ rt−j²

But in GARCH(1,1) we have:

ht = ω + α rt−1² + β ht−1

⟹ ht = ω/(1 − β) + α Σⱼ βʲ⁻¹ rt−j²

236 / 247
Tractable Maximum-Likelihood Estimation

L(θ; r₁, ..., rT) = f(rT | ΩT−1; θ) f(rT−1 | ΩT−2; θ) ⋯ ,

where θ = (ω, α, β)′

If the conditional densities are Gaussian,

f(rt | Ωt−1; θ) = (1/√(2π)) ht(θ)^(−1/2) exp( −(1/2) rt² / ht(θ) ),

so

ln L = const − (1/2) Σₜ ln ht(θ) − (1/2) Σₜ rt² / ht(θ)

237 / 247
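A minimal sketch of the (negative) Gaussian GARCH(1,1) log likelihood above, suitable for handing to a numerical optimizer; the start-up value h₁ = sample variance is an assumption, and constants are dropped as on the slide.

import numpy as np

def garch11_neg_loglik(theta, r):
    """Negative log likelihood for theta = (omega, alpha, beta); minimize
    it (e.g., with a generic numerical optimizer) to obtain the MLE."""
    omega, alpha, beta = theta
    T = len(r)
    h = np.empty(T)
    h[0] = r.var()                               # start-up: sample variance
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    return 0.5 * np.sum(np.log(h) + r ** 2 / h)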
Variations on the GARCH Theme

I Regression with GARCH disturbances

I Fat-tailed conditional densities: t-GARCH

I Asymmetric response and the leverage effect: T-GARCH

238 / 247
Regression with GARCH Disturbances

yt = xt′β + εt

εt | Ωt−1 ~ N(0, ht)

239 / 247
Fat-Tailed Conditional Densities: t-GARCH

If r is conditionally Gaussian, then

rt = √ht · N(0, 1)

But often, with high-frequency data,

rt / √ht is leptokurtic

So take:

rt = √ht · td / std(td)

and treat d as another parameter to be estimated

240 / 247
Asymmetric Response and the Leverage Effect: T-GARCH

Standard GARCH: ht = ω + α rt−1² + β ht−1

T-GARCH: ht = ω + α rt−1² + γ rt−1² Dt−1 + β ht−1

Dt = 1 if rt < 0, 0 otherwise

positive return (good news): α effect on volatility

negative return (bad news): α + γ effect on volatility

γ ≠ 0: Asymmetric news response
γ > 0: Leverage effect

241 / 247
A Useful Specification Diagnostic

εt | Ωt−1 ~ N(0, ht)

εt = √ht vt,  vt ~ iid N(0, 1)

εt / √ht = vt,  vt ~ iid N(0, 1)

Infeasible: examine vt = εt / √ht. iid? Gaussian?

Feasible: examine v̂t = et / √ĥt. iid? Gaussian?

Key potential deviation from iid is volatility dynamics:
Examine correlogram of squared standardized returns, v̂t²
Examine normality of standardized returns, v̂t

242 / 247
Conditional Mean Estimation

243 / 247
Conditional Variance Estimation

244 / 247
Autocorrelations of Squared Standardized Residuals

245 / 247
Distribution of Standardized Residuals

246 / 247
Time Series of Estimated Conditional Standard Deviations

247 / 247
