
Econometrics:

A Predictive Modeling Approach

Francis X. Diebold
University of Pennsylvania

July 5, 2017 1 / 247


Copyright © 2013-2017, by Francis X. Diebold.

All rights reserved.

All materials are freely available for your use, but be warned: they are highly
preliminary, significantly incomplete, and rapidly evolving. All are licensed
under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License. (Briefly: I retain copyright, but you can use, copy and
distribute non-commercially, so long as you give me attribution and do not
modify. To view a copy of the license, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.) In return I ask that you
please cite the books whenever appropriate, as: Diebold, F.X. (year here),
Book Title Here, Department of Economics, University of Pennsylvania,
http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

The painting is Enigma, by Glen Josselsohn, from Wikimedia Commons.

2 / 247
Introduction to Econometrics

3 / 247
Numerous Communities Use Econometrics

Economists, statisticians, analysts, data scientists in:


I Finance (Commercial banking, retail banking, investment
banking, insurance, asset management, real estate, ...)
I Traditional Industry (manufacturing, services, advertising,
brick-and-mortar retailing, ...)
I e-Industry (Google, Amazon, eBay, Uber, Microsoft, ...)
I Consulting (financial services, litigation support, ...)
I Government (treasury, agriculture, environment, commerce,
...)
I Central Banks and International Organizations (FED, IMF,
World Bank, OECD, BIS, ECB, ...)

4 / 247
Econometrics is Special

Econometrics is not just statistics using economic data. Many


properties and nuances of economic data require knowledge of
economics for successful analysis.

I Trend and seasonality


I Cycles (serial correlation)
I Volatility fluctuations (heteroskedasticity)
I Structural change
I Multivariate interactions
I Emphasis on prediction

5 / 247
Let's Elaborate on the Emphasis on Prediction...
Q: What is econometrics about, broadly?

A: Helping people to make better decisions


I Consumers
I Firms
I Investors
I Policy makers
I Courts
Forecasts guide decisions.

Good forecasts promote good decisions.

Hence prediction holds a distinguished place in econometrics,


and it will hold a distinguished place in this course.

6 / 247
There are Many Issues Regarding Types of Recorded
Economic Data

I Time series
I Continuous recording
I Discrete recording
I Equally-spaced
I Unequally-spaced
I Common-frequency
I Mixed-frequency
I Cross section
I Time series of cross sections
I Balanced panel
I Unbalanced panel

7 / 247
Notational Aside

Standard cross-section notation: i = 1, ..., N

Standard time-series notation: t = 1, ..., T

Much of our discussion will be valid in both cross-section and


time-series environments, but still we have to pick a notation.

Without loss of generality, we will use t = 1, ..., T .

8 / 247
A Few Leading Econometrics Web Data Resources
(Clickable)

Indispensable:
I Resources for Economists (AEA)

I FRED (Federal Reserve Economic Data)

More specialized:
I National Bureau of Economic Research

I FRB Phila Real-Time Data Research Center

I Many more

9 / 247
A Few Leading Econometrics Software Environments
(Clickable)

I High-Level: EViews, Stata


I Mid-Level: R (CRAN; RStudio; R-bloggers), Python, Julia
I Low-Level: C, C++, Fortran

High-level does not mean best, and low-level does not


mean worst. There are many issues.

I More than you ever wanted to know about econometric


software, broadly defined: The Econometrics Journal software
links

10 / 247
Graphics Review

11 / 247
Graphics Help us to:

I Summarize and reveal patterns in univariate cross-section


data. Histograms and density estimates are helpful for learning
about distributional shape. Symmetric, skewed, fat-tailed, ...

I Summarize and reveal patterns in univariate time-series data.


Time Series plots are useful for learning about dynamics.
Trend, seasonal, cycle, outliers, ...

I Summarize and reveal patterns in multivariate data


(cross-section or time-series). Scatterplots are useful for
learning about relationships. Does a relationship exist? Is it
linear or nonlinear? Are there outliers?

12 / 247
Histogram Revealing Distributional Shape:
1-Year Government Bond Yield

13 / 247
Time Series Plot Revealing Dynamics:
1-Year Government Bond Yield, Levels

14 / 247
Scatterplot Revealing Relationship:
1-Year and 10-Year Government Bond Yields

15 / 247
Some Principles of Graphical Style

I Know your audience, and know your goals.


I Appeal to the viewer.
I Show the data, and only the data, within the bounds of
reason.
I Avoid distortion. The sizes of effects in graphics should match
their size in the data. Use common scales in multiple
comparisons.
I Minimize, within reason, non-data ink. Avoid chartjunk.
I Choose aspect ratios to maximize pattern revelation.
Bank to 45 degrees.
I Maximize graphical data density.
I Revise and edit, again and again (and again). Graphics
produced using software defaults are almost never satisfactory.

16 / 247
Probability and Statistics Review

17 / 247
Moments, Sample Moments and Their Sampling
Distributions

I Discrete random variable, y

I Discrete probability distribution p(y )

I Continuous random variable y

I Probability density function f (y )

18 / 247
Population Moments: Expectations of Powers of R.V.s

Mean measures location:

μ = E(y) = Σᵢ pᵢ yᵢ   (discrete case)

μ = E(y) = ∫ y f(y) dy   (continuous case)

Variance, or standard deviation, measures dispersion, or scale:

σ² = var(y) = E(y − μ)²

σ easier to interpret than σ². Why?

19 / 247
More Population Moments

Skewness measures skewness (!)

S = E(y − μ)³ / σ³

Kurtosis measures tail fatness relative to a Gaussian distribution.

K = E(y − μ)⁴ / σ⁴

20 / 247
Covariance and Correlation

Multivariate case: Joint, marginal and conditional distributions

f(x, y), f(x), f(y), f(x|y), f(y|x)

Covariance measures linear dependence:

cov(y, x) = E[(y − μ_y)(x − μ_x)]

So does correlation:

corr(y, x) = cov(y, x) / (σ_y σ_x)

Correlation is often more convenient. Why?

21 / 247
Sampling and Estimation

Sample: {yt}, t = 1, ..., T, iid from f(y)

Sample mean:

ȳ = (1/T) Σₜ₌₁ᵀ yt

Sample variance:

σ̂² = Σₜ₌₁ᵀ (yt − ȳ)² / T

Unbiased sample variance:

s² = Σₜ₌₁ᵀ (yt − ȳ)² / (T − 1)
22 / 247
More Sample Moments

Sample skewness:

Ŝ = [(1/T) Σₜ (yt − ȳ)³] / σ̂³

Sample kurtosis:

K̂ = [(1/T) Σₜ (yt − ȳ)⁴] / σ̂⁴

23 / 247
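A minimal NumPy sketch of these sample moments (using the divide-by-T variance σ̂², as on the slides); the function name and implementation are illustrative, not from the text.

import numpy as np

def sample_moments(y):
    """Sample mean, divide-by-T variance, skewness, and kurtosis,
    exactly as defined on the preceding slides."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    sigma2_hat = ((y - ybar) ** 2).sum() / T            # divide-by-T variance
    S_hat = ((y - ybar) ** 3).mean() / sigma2_hat ** 1.5
    K_hat = ((y - ybar) ** 4).mean() / sigma2_hat ** 2
    return ybar, sigma2_hat, S_hat, K_hat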
Still More Sample Moments

Sample covariance:

côv(y, x) = (1/T) Σₜ [(yt − ȳ)(xt − x̄)]

Sample correlation:

côrr(y, x) = côv(y, x) / (σ̂_y σ̂_x)

24 / 247
Exact Finite-Sample Distribution of the Sample Mean
(Requires iid Normality)

Simple random sampling: yt ~ iid N(μ, σ²), t = 1, ..., T

ȳ is unbiased, consistent, normally distributed with variance σ²/T,
and minimum variance unbiased (MVUE).

ȳ ~ N(μ, σ²/T)

(and we estimate σ² consistently using s²)

μ ∈ [ȳ ± t₁₋α/₂(T − 1) · s/√T]   w.p. 1 − α

H₀: μ = μ₀  ⟹  (ȳ − μ₀)/(s/√T) ~ t(T − 1)
25 / 247
Large-Sample Distribution of the Sample Mean
(Requires iid, but not Normality)

Simple random sampling: yt ~ iid (μ, σ²), t = 1, ..., T

ȳ is asymptotically unbiased, consistent, asymptotically normally
distributed with variance σ²/T, and best linear unbiased (BLUE).

ȳ ~ᵃ N(μ, σ²/T),

and we estimate σ² consistently using s². This is an approximate
(large-sample) result, due to the central limit theorem. The "a"
means "asymptotically as T → ∞".

As T → ∞:  μ ∈ [ȳ ± z₁₋α/₂ · s/√T]   w.p. 1 − α

H₀: μ = μ₀  ⟹  (ȳ − μ₀)/(s/√T) ~ N(0, 1)

26 / 247
Wages: Distributions

27 / 247
Wages: Sample Statistics

                        WAGE       log WAGE
Sample Mean            12.19          2.34
Sample Median          10.00          2.30
Sample Maximum         65.00          4.17
Sample Minimum          1.43          0.36
Sample Std. Dev.        7.38          0.56
Sample Skewness         1.76          0.06
Sample Kurtosis         7.93          2.90
Jarque-Bera          2027.86          1.26
                  (p = 0.00)    (p = 0.53)
t(H₀: μ = 12)           0.93       -625.70
                  (p = 0.36)    (p = 0.00)
Correlation                 0.40

28 / 247
Regression

29 / 247
Regression

A. As curve fitting. Tell a computer how to draw a line through a


scatterplot. (Well, sure, but there must be more...)

B. As a probabilistic framework for optimal prediction.

30 / 247
Regression as Curve Fitting

31 / 247
Distributions of Log Wage, Education and Experience

32 / 247
Scatterplot: Log Wage vs. Education

33 / 247
Curve Fitting

Fit a line:

ŷt = β̂₁ + β̂₂ xt

Solve:

min_β Σₜ (yt − β₁ − β₂ xt)²

β is the set of two parameters β₁ and β₂

β̂ is the set of fitted parameters β̂₁ and β̂₂

34 / 247
Log Wage vs. Education with Superimposed Regression
Line

LWAGE-hat = 1.273 + .081 EDUC
35 / 247
Actual Values, Fitted Values and Residuals

The fitted values are

ŷt = β̂₁ + β̂₂ xt,  t = 1, ..., T.

The residuals are the difference between actual and fitted values,

et = yt − ŷt,  t = 1, ..., T.

36 / 247
Multiple Linear Regression (K RHS Variables)
Solve:

min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)²

Fitted hyperplane:

ŷt = β̂₁ + β̂₂ x2t + β̂₃ x3t + ... + β̂K xKt

More compactly:

ŷt = Σᵢ₌₁ᴷ β̂ᵢ xit,

where x1t = 1 for all t.

Wage dataset:

LWAGE-hat = .867 + .093 EDUC + .013 EXPER
37 / 247
Regression as a Probability Model

38 / 247
An Ideal Situation (The Ideal Conditions)
I The data-generating process (DGP) is:

yt = β₁ + β₂ x2t + ... + βK xKt + εt

εt ~ iid N(0, σ²),

and the fitted model matches it exactly.
I εt and xit are independent, for all i, t

1. The fitted model is correctly specified
2. The disturbances are Gaussian
3. The coefficients (β's) are fixed (whether over space or time,
depending on whether we're working in a time-series or cross-section
environment)
4. The relationship is linear
5. The εt's have constant variance σ²
6. The εt's are uncorrelated
39 / 247
Some Crucial Matrix Notation

You already understand matrix (spreadsheet) notation,
although you may not know it.

y = (y₁, y₂, ..., yT)′

X =
⎡ 1  x₂₁  x₃₁  ...  xK₁ ⎤
⎢ 1  x₂₂  x₃₂  ...  xK₂ ⎥
⎢ ⋮    ⋮    ⋮          ⋮ ⎥
⎣ 1  x₂T  x₃T  ...  xKT ⎦

β = (β₁, β₂, ..., βK)′      ε = (ε₁, ε₂, ..., εT)′

40 / 247
Elementary Matrices and Matrix Operations


0 =
⎡ 0  0  ...  0 ⎤
⎢ 0  0  ...  0 ⎥
⎢ ⋮  ⋮        ⋮ ⎥
⎣ 0  0  ...  0 ⎦

I =
⎡ 1  0  ...  0 ⎤
⎢ 0  1  ...  0 ⎥
⎢ ⋮  ⋮   ⋱   ⋮ ⎥
⎣ 0  0  ...  1 ⎦

Transposition: (A′)ij = Aji

Addition: For A and B both n×m, (A + B)ij = Aij + Bij

Multiplication: For A n×m and B m×p, (AB)ij = Σₖ₌₁ᵐ Aik Bkj.

Inversion: For non-singular A n×n, A⁻¹ satisfies
A⁻¹A = AA⁻¹ = I. Many algorithms exist for calculation.

41 / 247
We Used to Write This:

yt = β₁ + β₂ x2t + ... + βK xKt + εt

εt ~ iid N(0, σ²)
εt independent of xit, for all i, t
t = 1, 2, ..., T

42 / 247
Now, Equivalently, We Write This:

y = Xβ + ε    (1)
ε ~ N(0, σ²I)    (2)
ε independent of X

Written out:

⎡ y₁ ⎤   ⎡ 1  x₂₁  x₃₁  ...  xK₁ ⎤ ⎡ β₁ ⎤   ⎡ ε₁ ⎤
⎢ y₂ ⎥ = ⎢ 1  x₂₂  x₃₂  ...  xK₂ ⎥ ⎢ β₂ ⎥ + ⎢ ε₂ ⎥    (1)
⎢ ⋮  ⎥   ⎢ ⋮    ⋮    ⋮          ⋮ ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yT ⎦   ⎣ 1  x₂T  x₃T  ...  xKT ⎦ ⎣ βK ⎦   ⎣ εT ⎦

⎡ ε₁ ⎤     ⎛ ⎡ 0 ⎤   ⎡ σ²  0   ...  0  ⎤ ⎞
⎢ ε₂ ⎥ ~ N ⎜ ⎢ 0 ⎥ , ⎢ 0   σ²  ...  0  ⎥ ⎟    (2)
⎢ ⋮  ⎥     ⎜ ⎢ ⋮ ⎥   ⎢ ⋮   ⋮    ⋱   ⋮  ⎥ ⎟
⎣ εT ⎦     ⎝ ⎣ 0 ⎦   ⎣ 0   0   ...  σ² ⎦ ⎠
43 / 247
The OLS Estimator in Matrix Notation:

As always, the LS estimator solves:

min_β Σₜ εt² = min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)²

It can be shown that the solution is:

β̂_LS = (X′X)⁻¹ X′y

44 / 247
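A small illustrative sketch of the closed-form solution β̂_LS = (X′X)⁻¹X′y; the simulated DGP and function name are assumptions for the example, not from the text.

import numpy as np

def ols(y, X):
    """OLS estimate (X'X)^{-1} X'y, computed via a linear solve rather
    than an explicit inverse for numerical stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example with a simulated DGP satisfying the ideal conditions:
rng = np.random.default_rng(0)
T, K = 500, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 0.5, -0.2])
y = X @ beta + rng.normal(scale=0.7, size=T)
print(ols(y, X))          # should be close to (1.0, 0.5, -0.2)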
The Ideal Conditions, Redux

1. The DGP is:

y = Xβ + ε
ε ~ N(0, σ²I),

and the fitted model matches it exactly
2. X and ε are independent

45 / 247
Sampling Distribution of LS Under the Ideal Conditions

β̂_LS ~ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹:

s² = Σₜ et² / (T − K)

β̂_LS is normally distributed and MVUE

Note the precise parallel with the distribution of the sample mean
in Gaussian iid environments.

46 / 247
Sampling Distribution of LS Under the Ideal Conditions
Less Normality

As T → ∞,

β̂_LS ~ᵃ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹.

β̂_LS is asymptotically normally distributed and BLUE

Note the precise parallel with the distribution of the sample mean
in non-Gaussian iid environments.

47 / 247
Sample Mean (You Earlier Learned All About it)
OLS Regression (You're Now Learning All About it)
What is the Relationship?

Sample mean is just LS regression on nothing but a constant.
(Prove it.)

Moreover, the distributional results are in precise parallel.

Distribution of the sample mean from an iid Gaussian sample:

ȳ ~ N(μ, v),

where v is consistently estimated by s²/T.

Distribution of the LS regression estimator under ideal conditions:

β̂_LS ~ N(β, V),

where V is consistently estimated by s²(X′X)⁻¹.
48 / 247
Conditional Implications of the DGP

Conditional mean:

E(yt | x1t = 1, x2t = x2t*, ..., xKt = xKt*) = β₁ + β₂ x2t* + ... + βK xKt*

or E(yt | xt = xt*) = xt*′β

Conditional variance:

var(yt | xt = xt*) = σ²

Full conditional density:

yt | xt = xt* ~ N(xt*′β, σ²)

49 / 247
Why All the Talk About Conditional Implications?:
The Predictive Modeling Problem
A major goal in econometrics is predicting y. The question is: "If a
new person arrives with characteristics x*, what is my
minimum-MSE prediction of her y?" The answer under quadratic
loss is E(y | x = x*) = x*′β.

The conditional mean is the minimum-MSE (point) predictor

Non-operational version (we don't know β):

E(yt | xt = xt*) = xt*′β

Operational version (use β̂_LS):

Ê(yt | xt = xt*) = xt*′β̂_LS   (regression fitted value at xt = xt*)

LS delivers the operational optimal predictor with great generality

Follows immediately from the LS optimization problem

50 / 247
Interval Prediction

Non-operational:

yt ∈ [xt*′β ± 1.96 σ]   w.p. 0.95

Operational:

yt ∈ [xt*′β̂_LS ± 1.96 s]   w.p. 0.95

51 / 247
Density Prediction

Non-operational version:

yt | xt = xt* ~ N(xt*′β, σ²)

Operational version:

yt | xt = xt* ~ N(xt*′β̂_LS, s²)

52 / 247
Digging More Deeply into Prediction

The environment is:

yt = xt′β + εt,  t = 1, ..., T

εt ~ iid D(0, σ²)

53 / 247
Point Prediction

Assume for the moment that we know the model parameters. That
is, assume that we know β and all parameters governing D. Note
that the mean and variance are in general insufficient to
characterize a non-Gaussian D.

We immediately obtain point forecasts as:

E(yi | xi = x*) = x*′β.

54 / 247
Analytic Density Prediction (And Hence Also Interval
Prediction) for D Gaussian

If D is Gaussian, then the density prediction is immediately

yi | xi = x* ~ N(x*′β, σ²).    (1)

We can calculate any desired interval forecast from the density
forecast. For example, a 95% interval would be x*′β ± 1.96σ.

55 / 247
Simulation Algorithm for Density Prediction for D Gaussian

1. Take R draws from the disturbance density N(0, σ²).
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

As R → ∞, the algorithmic and analytic results coincide.

56 / 247
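A hedged sketch of this simulation algorithm for Gaussian D, assuming known β and σ; the function name, number of draws R, and seed are illustrative choices.

import numpy as np

def simulate_gaussian_density_forecast(x_star, beta, sigma, R=100_000, seed=0):
    """Steps 1-2: simulate y draws at x*; step 4: read off the empirical
    .025 and .975 quantiles as a 95% interval forecast."""
    rng = np.random.default_rng(seed)
    draws = x_star @ beta + rng.normal(0.0, sigma, size=R)
    lo, hi = np.quantile(draws, [0.025, 0.975])
    return draws, (lo, hi)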
Making the Forecasts Feasible

The approaches above are infeasible in that they assume known
parameters. They can be made feasible by replacing unknown
parameters with estimates. For example, the feasible version of the
point prediction is x*′β̂. Similarly, to construct a feasible 95%
interval forecast in the Gaussian case we can take x*′β̂ ± 1.96 σ̂,
where σ̂ is the standard error of the regression
(also earlier denoted s).

57 / 247
Typical Regression Analysis of Wages, Education and
Experience

58 / 247
Top Matter: Background Information

I Dependent variable

I Method

I Date

I Sample

I Included observations

59 / 247
Middle Matter: Estimated Regression Function

I Variable

I Coefficient

I Standard error

I t-statistic

I p-value

60 / 247
Predictive Perspectives

OLS coefficient signs and sizes give the weights put on the
various x variables in forming the best in-sample prediction of y .

The standard errors, t statistics, and p-values let us do statistical


inference as to which regressors are most relevant for predicting y .

61 / 247
Bottom Matter: Statistics

There are many...

62 / 247
Regression Statistics: Mean dependent var 2.342

ȳ = (1/T) Σₜ₌₁ᵀ yt

63 / 247
Predictive Perspectives

The sample, or historical, mean of the dependent variable, y , an


estimate of the unconditional mean of y , is a benchmark forecast.
It is obtained by regressing y on an intercept alone no
conditioning on other regressors.

64 / 247
Regression Statistics: S.D. dependent var .561

SD = √[ Σₜ₌₁ᵀ (yt − ȳ)² / (T − 1) ]

65 / 247
Predictive Perspectives

The sample standard deviation of y is a measure of the in-sample


accuracy of the unconditional mean forecast y .

66 / 247
Regression Statistics: Sum squared resid 319.938

SSR = Σₜ₌₁ᵀ et²

Optimized value of the LS objective; will appear in many places.

67 / 247
Predictive Perspectives

The OLS fitted values, ŷt = xt′β̂, are effectively in-sample
regression predictions.

The OLS residuals, et = yt − ŷt, are effectively in-sample
prediction errors corresponding to use of the regression predictions.

SSR measures total in-sample accuracy of the regression
predictions

SSR is closely related to in-sample MSE:

MSE = (1/T) SSR = (1/T) Σₜ et²

(average in-sample accuracy of the regression predictions)

68 / 247
Regression Statistics: F -statistic 199.626

F = [(SSR_res − SSR)/(K − 1)] / [SSR/(T − K)]

69 / 247
Predictive Perspectives

The F statistic effectively compares the accuracy of the


regression-based forecast to that of the unconditional-mean
forecast.

Helps us assess whether the x variables, taken as a set, have


predictive value for y .

Contrasts with the t statistics, which assess predictive value of


the x variables one at a time.

70 / 247
Regression Statistics: S.E. of regression .492

s² = Σₜ et² / (T − K)

SER = √s² = √[ Σₜ et² / (T − K) ]

71 / 247
Predictive Perspectives

s² is just SSR scaled by T − K, so again, it's a measure of the
in-sample accuracy of the regression-based forecast.

Like MSE, but corrected for degrees of freedom.

72 / 247
Regression Statistics: R-squared .232

R² = 1 − Σₜ et² / Σₜ (yt − ȳ)²

73 / 247
Regression Statistics: Adjusted R-squared .231

R̄² = 1 − [Σₜ et² / (T − K)] / [Σₜ (yt − ȳ)² / (T − 1)]

74 / 247
Predictive Perspectives

R² and R̄² effectively compare the in-sample accuracy of
conditional-mean and unconditional-mean forecasts.

R² is not corrected for d.f. and has MSE on top:

R² = 1 − [(1/T) Σₜ et²] / [(1/T) Σₜ (yt − ȳ)²].

R̄² is corrected for d.f. and has s² on top:

R̄² = 1 − [Σₜ et² / (T − K)] / [Σₜ (yt − ȳ)² / (T − 1)].

75 / 247
Regression Statistics: Log likelihood -938.236

I Intimately related to SSR under normality

I Therefore closely related to prediction as well

76 / 247
Background/Detail: Regression Statistics: Log likelihood
-938.236

I Likelihood: joint density of the data (the yt's)

I Maximum-likelihood estimation is a natural estimation strategy:
find the parameter configuration that maximizes the likelihood
of getting the yt's that you actually did get.

I Log likelihood will have the same max as the likelihood (why?)
but it's more important statistically

I Hypothesis tests and model selection based on log likelihood

77 / 247
Background/Detail: Maximum-Likelihood Estimation

Linear regression model (under conditions) implies that:

yt ~ iid N(xt′β, σ²),

so that

f(yt) = (2πσ²)^(−1/2) exp( −(1/(2σ²)) (yt − xt′β)² ).

Now by independence of the εt's and hence yt's,

L = f(y₁, ..., yT) = f(y₁) ⋯ f(yT) = Πₜ (2πσ²)^(−1/2) exp( −(1/(2σ²)) (yt − xt′β)² )

Note in particular that the β vector that maximizes the likelihood
is the β vector that minimizes the sum of squared residuals.

78 / 247
Background/Detail: Log Likelihood

ln L = −(T/2) ln(2πσ²) − (1/(2σ²)) Σₜ (yt − xt′β)²

- Log turns the product into a sum and eliminates the exponential

- Additive constant can be dropped

79 / 247
Background/Detail: Likelihood-Ratio Tests

Under conditions, asymptotically as T → ∞:

−2(ln L₀ − ln L₁) ~ χ²_d,

where ln L₀ is the maximized log likelihood under the restrictions
implied by the null hypothesis, ln L₁ is the unrestricted log
likelihood, and d is the number of restrictions imposed under the
null hypothesis.

t and F tests are likelihood ratio tests under a normality
assumption. That's why they can be written in terms of minimized
SSR's rather than maximized ln L's.

80 / 247
Regression Statistics: Schwarz criterion 1.435

Well get there shortly...

81 / 247
Regression Statistics: Durbin-Watson stat. 1.926

Well get there in 6-8 weeks

82 / 247
Residual Scatter

83 / 247
Residual Plot

Figure: Wage Regression Residual Plot

84 / 247
Predictive Perspectives

The OLS fitted values, ŷt = xt′β̂, are effectively best in-sample
predictions.

The OLS residuals, et = yt − ŷt, are effectively in-sample
prediction errors corresponding to use of the best predictor.

Residual plots are useful for visually flagging neglected things


that impact forecasting. Residual serial correlation indicates that
point forecasts could be improved. Residual volatility clustering
indicates that interval and density forecasts could be improved.

85 / 247
Non-Quadratic Loss

86 / 247
We Will Generally Use Quadratic Loss...

Recall that the OLS estimator, β̂_OLS, solves:

min_β Σₜ (yt − β₁ − β₂ x2t − ... − βK xKt)² = min_β Σₜ εt²

Simple
(analytic closed-form expression, (X′X)⁻¹X′y)

But predictive loss simply may not be quadratic

Other approaches are possible.

87 / 247
...But We Can Also Consider Non-Quadratic Loss

88 / 247
LAD Regression (Absolute-Error Loss)

Loss is linear on each side of 0 with slope 1 on each side.

The LAD estimator, β̂_LAD, minimizes absolute-error loss:

min_β Σₜ |εt|

Not as simple as OLS, but still simple

89 / 247
Quantile Regression (LinLin Loss)
Loss is linear with potentially different slopes on each side of 0.

QR minimizes LinLin loss, or check function loss:

min_β Σₜ check(εt),

where:

check(e) = a|e|, if e ≤ 0
           b|e|, if e > 0

         = a|e| · I(e ≤ 0) + b|e| · I(e > 0).

I(x) = 1 if x is true, and I(x) = 0 otherwise.
I(·) stands for "indicator variable".
Not as simple as OLS, but still simple
90 / 247
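An illustrative sketch of the check (LinLin) loss and the resulting quantile-regression objective; the d-based weights anticipate the link d = b/(a + b) given two slides below, and the function names are assumptions.

import numpy as np

def check_loss(e, a, b):
    """LinLin ('check') loss: slope a for non-positive errors, b for
    positive ones. With a = b it is proportional to absolute-error
    loss, so LAD is the special case d = b/(a+b) = 0.5."""
    e = np.asarray(e, dtype=float)
    return np.where(e <= 0, a * np.abs(e), b * np.abs(e))

def quantile_objective(beta, y, X, d):
    """Sample check-loss objective for the d-th quantile
    (weights a = 1-d on negative errors, b = d on positive errors)."""
    e = y - X @ beta
    return np.sum(np.where(e <= 0, (1 - d) * np.abs(e), d * np.abs(e)))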
Additional Interpretation
What does regression tell us about?

LS: Conditional mean
How does mean(y|X) vary with X?
mean(y|X) = Xβ

LAD: Conditional median
How does median(y|X) vary with X?
median(y|X) = Xβ
Perhaps not much different from LS?

Quantile regression: Conditional quantile
(LAD is a special case. Why?)
How does the 100d-percent quantile, 100d%(y|X), vary with X?
100d%(y|X) = Xβ

e.g., How does the fifth percentile of the distribution of log wage
given education vary with education?
91 / 247
The Link Between d, a, and b

100d%(y|X) = Xβ

where

d = b/(a + b) = 1/(1 + a/b)

Note that a and b matter only through their ratio

92 / 247
Optimal Forecasts Can Be Biased

Symmetric (quadratic) loss: Optimal forecast is conditional


mean; corresponding error has zero mean

Symmetric (absolute) loss: Optimal forecast is conditional


median; corresponding error has zero median

Asymmetric (check) loss: Optimal forecast is conditional


quantile; corresponding error has non-zero mean and median

93 / 247
Quantile Regression (10th Percentile): LWAGE → c, EDUC

[Figure: scatter of LWAGE vs. EDUC with fitted 10th-percentile line]

LWAGE-hat = 0.799 + 0.068 EDUC
94 / 247
Quantile Regression (90th Percentile): LWAGE → c, EDUC

[Figure: scatter of LWAGE vs. EDUC with fitted 90th-percentile line]

LWAGE-hat = 1.894 + 0.083 EDUC
95 / 247
Comparison: LWAGE → c, EDUC

[Figure: LWAGE vs. EDUC with fitted LAD, OLS, 10th-percentile, and 90th-percentile lines]
96 / 247
Misspecification

Do we really believe that the fitted model matches the DGP?

97 / 247
Regression Statistics: Schwarz criterion 1.435

SIC = T^(K/T) · (Σₜ et² / T)

More general ln L version:

SIC = −2 ln L + K ln T

98 / 247
Regression Statistics: Akaike info criterion 1.423

AIC = e^(2K/T) · (Σₜ et² / T)

More general ln L version:

AIC = −2 ln L + 2K

99 / 247
Predictive Perspectives

100 / 247
Predictive Perspectives
Estimate out-of-sample forecast accuracy (which is what we
really care about) on the basis of in-sample forecast accuracy. (We
want to select a forecasting model that will perform well for
out-of-sample forecasting, quite apart from its in-sample fit.)
We proceed by inflating the in-sample mean-squared error
(MSE), in various attempts to offset the deflation from regression
fitting, to obtain a good estimate of out-of-sample MSE.

MSE = Σₜ et² / T

s² = (T/(T − K)) · MSE

AIC = e^(2K/T) · MSE

SIC = T^(K/T) · MSE

The AIC and SIC penalties have certain optimality properties.
101 / 247
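A small sketch computing these four MSE-inflation criteria from a residual vector; the function name is illustrative.

import numpy as np

def in_sample_criteria(e, T, K):
    """MSE, s^2, AIC, and SIC in the SSR-based forms above: each one
    inflates in-sample MSE by a different degrees-of-freedom penalty."""
    mse = np.sum(np.asarray(e) ** 2) / T
    s2 = (T / (T - K)) * mse
    aic = np.exp(2 * K / T) * mse
    sic = T ** (K / T) * mse
    return mse, s2, aic, sic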
Non-Normality and Outliers

Do we really believe that the disturbances are Gaussian?

102 / 247
What We'll Do

Distributional results under non-normality

Detecting non-normality and outliers

Dealing with non-normality and outliers (robust estimation)

103 / 247
Recall Sample Mean Under iid With Normality

ȳ is MVUE, and

ȳ ~ N(μ, σ²/T),

and we estimate σ² consistently using s²

Exact (finite-sample) result

104 / 247
Recall Sample Mean Under iid Without Normality

ȳ is BLUE, and

ȳ ~ᵃ N(μ, σ²/T),

and we estimate σ² consistently using s²

Approximate (large-sample) result,
due to the central limit theorem

The "a" means "asymptotically as T → ∞"

105 / 247
OLS Under Ideal Conditions With Normality

β̂_LS is MVUE, and

β̂_LS ~ N(β, σ²(X′X)⁻¹),

and we estimate σ² consistently using s²

Exact (finite-sample) result

106 / 247
OLS Under Ideal Conditions Without Normality

β̂_LS is BLUE, and

β̂_LS ~ᵃ N(β, σ²(X′X)⁻¹),

and we estimate σ² consistently using s²

Approximate (large-sample) result,
due to the central limit theorem

The "a" means "asymptotically as T → ∞"

107 / 247
Detecting Non-Normality
(In Data or in Residuals)

Sample skewness and kurtosis, Ŝ and K̂

Jarque-Bera test. Under normality we have:

JB = (T/6) · ( Ŝ² + (1/4)(K̂ − 3)² ) ~ χ²₂

Many more

108 / 247
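An illustrative NumPy sketch of the Jarque-Bera statistic as defined above (function name assumed); compare it to a chi-squared(2) critical value.

import numpy as np

def jarque_bera(y):
    """JB = (T/6) * (S^2 + (K-3)^2 / 4), using divide-by-T sample moments;
    distributed chi^2 with 2 d.f. under normality."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    sigma2 = (d ** 2).mean()
    S = (d ** 3).mean() / sigma2 ** 1.5
    K = (d ** 4).mean() / sigma2 ** 2
    return (T / 6.0) * (S ** 2 + 0.25 * (K - 3.0) ** 2)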
Recall Our OLS Wage Regression

109 / 247
OLS Residual Histogram and Statistics

110 / 247
QQ Plots

I We introduced histograms earlier...

I ...but if interest centers on the tails of distributions, QQ plots


often provide sharper insight as to the agreement or
divergence between the actual and reference distributions

I QQ plot is quantiles of the standardized data against quantiles


of a standardized reference distribution (e.g., normal)

I If the distributions match, the QQ plot is the 45 degree line

I To the extent that the QQ plot does not match the 45 degree
line, the nature of the divergence can be very informative, as
for example in indicating fat tails

111 / 247
OLS Wage Regression Residual QQ Plot

112 / 247
Detecting Outliers and Influential Observations:
OLS Residual Plot

113 / 247
Detecting Outliers and Influential Observations:
Leave-One-Out Plot

Consider:

β̂^(−t),  t = 1, ..., T

Leave-one-out plot

114 / 247
Wage Regression

115 / 247
Detecting Outliers and Influential Observations:
Leverage Plot

Leverage ht is the t-th diagonal element of X(X′X)⁻¹X′.

Leverage plot

What's ht all about?

116 / 247
 
et and ht are Two Key Pieces of (β̂ − β̂^(−t))

It can be shown that

β̂ − β̂^(−t) = (X′X)⁻¹ xt et · 1/(1 − ht)

Other things equal, the larger is et, the larger is (β̂ − β̂^(−t))
Other things equal, the larger is ht, the larger is (β̂ − β̂^(−t))

The third key piece is xt

117 / 247
Dealing with Outliers:
Least Absolute Deviations (LAD), Again!
The LAD estimator, β̂_LAD, solves:

min_β Σₜ |εt|

Not as simple as OLS, but still simple

LAD regression is quantile regression with d = .5

Recall that OLS fits the conditional mean function:

mean(y|X) = Xβ

LAD fits the conditional median function (50% quantile):

median(y|X) = Xβ

The two are equal under symmetry, as under the FIC, but not under
asymmetry, in which case the median is a better measure of central
tendency
118 / 247
LAD Wage Regression Estimation

119 / 247
Digging into Prediction (Much) More Deeply (Again)

The environment is:

yt = xt′β + εt,  t = 1, ..., T

εt ~ iid D(0, σ²)
120 / 247
Recall Point Prediction

Assume for the moment that we know the model parameters. That
is, assume that we know β and all parameters governing D. Note
that the mean and variance are in general insufficient to
characterize a non-Gaussian D.

We immediately obtain point forecasts as:

E(yi | xi = x*) = x*′β.

121 / 247
Recall Analytic Density Prediction (And Hence Also
Interval Prediction) for D Gaussian

If D is Gaussian, then the density prediction is immediately

yi | xi = x* ~ N(x*′β, σ²).    (2)

We can calculate any desired interval forecast from the density
forecast. For example, a 95% interval would be x*′β ± 1.96σ.

122 / 247
Recall Simulation Algorithm for Density Prediction for D
Gaussian

1. Take R draws from the disturbance density N(0, σ²).
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

As R → ∞, the algorithmic and analytic results coincide.

123 / 247
Recall Making the Forecasts Feasible

The approaches above are infeasible in that they assume known
parameters. They can be made feasible by replacing unknown
parameters with estimates. For example, the feasible version of the
point prediction is x*′β̂. Similarly, to construct a feasible 95%
interval forecast in the Gaussian case we can take x*′β̂ ± 1.96 σ̂,
where σ̂ is the standard error of the regression
(also earlier denoted s).

124 / 247
Density Prediction for D Parametric Non-Gaussian

Our simulation algorithm still works for non-Gaussian D, so long as
we can simulate from D.
1. Take R draws from the disturbance density D.
2. Add x*′β to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form a 95% interval forecast by sorting the output from step
2, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
Again, as R → ∞, the algorithmic results become arbitrarily
accurate.

125 / 247
Density Prediction for D Non-Parametric
Now assume that we know nothing about distribution D, except
that it has mean 0. In addition, now that we have introduced
feasible forecasts, we will stay in that world.
1. Take R draws from the regression residual density (which is an
approximation to the disturbance density) by assigning
probability 1/N to each regression residual and sampling with
replacement.
2. Add x*′β̂ to each draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form a 95% interval forecast by sorting the output from step
2, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
As R → ∞ and N → ∞, the algorithmic results become arbitrarily
accurate.
126 / 247
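A hedged sketch of this feasible nonparametric (residual-bootstrap) density and interval forecast; names and defaults are illustrative choices.

import numpy as np

def bootstrap_density_forecast(x_star, beta_hat, residuals, R=100_000, seed=0):
    """Step 1: resample the OLS residuals with replacement as approximate
    disturbance draws; step 2: add x*'beta_hat; step 4: take the empirical
    .025 and .975 quantiles as a 95% interval forecast."""
    rng = np.random.default_rng(seed)
    eps_draws = rng.choice(residuals, size=R, replace=True)
    draws = x_star @ beta_hat + eps_draws
    lo, hi = np.quantile(draws, [0.025, 0.975])
    return draws, (lo, hi)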
Density Forecasts for D Nonparametric and Acknowledging
Parameter Estimation Uncertainty

So far: Disturbance uncertainty


Now: Disturbance uncertainty and parameter estimation
uncertainty
The feasible approach to density forecasting sketched above still
fails to acknowledge parameter estimation uncertainty, because it
treats plugged-in parameter estimates as true values, ignoring
the fact that they are only estimates and hence subject to
sampling variability. Parameter estimation uncertainty is often
ignored, as its contribution to overall forecast MSE can be shown
to vanish unusually quickly as sample size grows. But it impacts
forecast uncertainty in small samples and hence should not be
ignored in general.

127 / 247
Algorithm for Density Forecasts for D Nonparametric and
Acknowledging Parameter Estimation Uncertainty
1. Take R approximate disturbance draws by assigning
probability 1/N to each regression residual and sampling with
replacement.
2. Take R draws from the large-N sampling density of β̂, namely
β̂_OLS ~ N(β, σ²(X′X)⁻¹),
as approximated by N(β̂, σ̂²(X′X)⁻¹).
3. To each disturbance draw from 1 add the corresponding x*′β̂
draw from 2.
4. Form a density forecast by fitting a density to the output from
step 3.
5. Form a 95% interval forecast by sorting the output from step
3, and taking the left and right interval endpoints as the
.025 and .975 quantile values, respectively.
As R → ∞ and N → ∞, we get precisely correct results.
128 / 247
Indicator Variables in Cross Sections:
Group Effects

Effectively a type of Structural change:


Do we really believe that coefficients are fixed across people?

129 / 247
Dummy Variables for Group Effects

A dummy variable, or indicator variable, is just a 0-1 variable that
indicates something, such as whether a person is female:

FEMALEt = 1 if person t is female, 0 otherwise

(It really is that simple.)

Intercept dummies

Note that the sample mean of a dummy variable is the fraction of
the sample with the indicated attribute.

130 / 247
Histograms for Wage Covariates

131 / 247
Recall Basic Wage Regression on Education and
Experience

LWAGE C , EDUC , EXPER

132 / 247
Basic Wage Regression Results

133 / 247
Basic Wage Regression Residual Scatter

134 / 247
Controlling for Sex, Race, and Union Status
in the Wage Regression

Now:

LWAGE C , EDUC , EXPER, FEMALE , NONWHITE , UNION

135 / 247
Wage Regression on Education, Experience, and Group
Dummies

136 / 247
Residual Scatter from Wage Regression on
Education, Experience, and Group Dummies

137 / 247
Important Issues

I The intercept corresponds to the base case across all
dummies (i.e., when all dummies are simultaneously 0), and
the dummy coefficients give the extra effects (i.e., when the
respective dummies are 1).

I Alternatively, use a full set of dummies for each category (e.g.,
both a union dummy and a non-union dummy) and drop the
intercept. (More useful/common in time-series situations)

I Never include a full set of dummies and an intercept.
Would be totally redundant: perfect multicollinearity

138 / 247
Nonlinearity

Do we really believe that the relationship is linear?

139 / 247
Anscombe's Quartet

140 / 247
Anscombe's Quartet: Regressions

141 / 247
Anscombe's Quartet: Graphics

142 / 247
Parametric and Nonparametric Nonlinearity...

...and the gray area in between.

143 / 247
Log-Log Regression

ln yt = β₁ + β₂ ln xt + εt

For close yt and xt, (ln yt − ln xt) × 100 is approximately the percent
difference between yt and xt. Hence the coefficients in log-log
regressions give the expected percent change in y for a one-percent
change in x. That is, they give the elasticity of y with respect to x.

Example: Cobb-Douglas production function

yt = A Ltᵅ Ktᵝ exp(εt)

ln yt = ln A + α ln Lt + β ln Kt + εt

We expect an α% increase in output
in response to a 1% increase in labor input
144 / 247
Log-Lin Regression

ln yt = β xt + εt

The coefficients in log-lin regressions give the expected percent
change in y for a one-unit (not 1%!) change in x.

Example: Exponential growth

yt = A e^(rt)
ln yt = ln A + r t

Coefficient r gives the expected percent change in y for a one-unit
change in time

Another example: LWAGE regression!

Coefficient on education gives the expected percent change in
WAGE arising from one more year of education.
145 / 247
Intrinsically Non-Linear Models

One example is the S-curve model,

y = 1 / (a + b rˣ)

(0 < r < 1)

No way to transform to linearity

Use non-linear least squares (NLS)

Under the remaining FIC (that is, dropping only linearity), NLS
has a sampling distribution similar to that of LS under the FIC

146 / 247
Taylor Series Expansions

Really no such thing as an intrinsically non-linear model...

In the bivariate case we can think of the relationship as

yt = g (xt , t )

or slightly less generally as

yt = f (xt ) + t

147 / 247
Taylor Series, Continued

Consider Taylor series expansions of f(xt).

The linear (first-order) approximation is

f(xt) ≈ β₁ + β₂ xt,

and the quadratic (second-order) approximation is

f(xt) ≈ β₁ + β₂ xt + β₃ xt².

In the multiple regression case, Taylor approximations also involve
interaction terms. Consider, for example, f(xt, zt):

f(xt, zt) ≈ β₁ + β₂ xt + β₃ zt + β₄ xt² + β₅ zt² + β₆ xt zt + ....

Equally relevant for dummy variables

148 / 247
A Key Insight

The ultimate point is that so-called intrinsically non-linear


models are themselves linear when viewed from the series-expansion
perspective. In principle, of course, an infinite number of series
terms are required, but in practice nonlinearity is often quite gentle
(e.g., quadratic) so that only a few series terms are required.

So non-linearity is in some sense


really an omitted-variables problem

149 / 247
Assessing Non-Linearity

Use AIC and SIC as always.

Use t's and F as always.

150 / 247
Basic Wage Regression

151 / 247
Quadratic Wage Regression

152 / 247
Dummy Interactions?

153 / 247
Everything

154 / 247
So Drop Dummy Interactions and Tighten the Rest

155 / 247
Heteroskedasticity in Cross-Section Regression

Do we really believe that disturbance variances


are constant over space?

156 / 247
Heteroskedasticity is Another Type of Violation of the IC
(This time it's non-constant disturbance variances.)

Consider: ε ~ N(0, Σ)

Heteroskedasticity corresponds to Σ diagonal but Σ ≠ σ²I

Simpler but more important than spatial correlation

Σ =
⎡ σ₁²  0    ...  0   ⎤
⎢ 0    σ₂²  ...  0   ⎥
⎢ ⋮    ⋮     ⋱   ⋮   ⎥
⎣ 0    0    ...  σ_N² ⎦

157 / 247
Causes and Consequences of Heteroskedasticity
Causes:
Can arise for many reasons
Engel curve (e.g., food expenditure vs. income) is classic example

Consequences:
OLS estimation remains largely OK.
Parameter estimates consistent but inefficient.
OLS inference destroyed. Standard errors biased and inconsistent.
Hence t statistics do not have the t distribution in finite samples
and do not have the N(0, 1) distribution asymptotically.

Corresponding predictive consequences:

Point prediction remains largely OK.
We still have Ê(yt | xt = xt*) → E(yt | xt = xt*).
Interval and density forecasts destroyed. So we need to detect
and deal with the heteroskedasticity.
158 / 247
What if You Don't Care About Detecting and Dealing
With Heteroskedasticity...

e.g., perhaps you're only interested in point prediction but still
want to do credible inference regarding the contributions of the
various x variables to the point prediction.

Then use heteroskedasticity-robust standard errors

White standard errors

Just a simple regression option

e.g., in EViews,
instead of ls y,c,x, use ls(cov=white) y,c,x

159 / 247
Wage regression with White Standard Errors

160 / 247
Detecting Heteroskedasticity

I Graphical heteroskedasticity diagnostics

I Formal heteroskedasticity tests

161 / 247
Graphical Diagnostics

Graph ei² against xi, for various regressors

162 / 247
Recall Our Final Wage Regression

163 / 247
Squared Residual vs. EDUC

164 / 247
The Breusch-Pagan-Godfrey Test (BPG)

Limitation of graphing ei² against xi: Purely pairwise

So move to a formal testing framework that blends all information

BPG test:

I Estimate the OLS regression, and obtain the squared residuals
I Regress the squared residuals on all regressors
I To test the null hypothesis of no relationship, examine N·R²
from this regression. In large samples N·R² ~ χ²_K under the
null, where K is the number of regressors in the test
regression.

165 / 247
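A minimal sketch of the BPG statistic N·R² computed by hand: regress the squared residuals on X (which should include a constant column) and form N times the R² of that auxiliary regression. The function name is an assumption; in practice the statistic is compared to a chi-squared critical value.

import numpy as np

def bpg_statistic(e, X):
    """N * R^2 from the auxiliary regression of squared residuals on X."""
    e2 = e ** 2
    gamma = np.linalg.solve(X.T @ X, X.T @ e2)   # auxiliary OLS
    fitted = X @ gamma
    ssr = np.sum((e2 - fitted) ** 2)
    sst = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ssr / sst
    return len(e2) * r2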
BPG Test

166 / 247
White's Test

Like BPG, but replace BPG's linear regression
with a more flexible (quadratic) regression

I Estimate the OLS regression, and obtain the squared residuals

I Regress the squared residuals on all regressors, squared
regressors, and pairwise regressor cross products

I To test the null hypothesis of no relationship, examine N·R²
from this regression. In large samples N·R² ~ χ²_K under the
null.

167 / 247
White's Test

168 / 247
Simulation Algorithm for Density Prediction
D Gaussian, Heteroskedastic Disturbances

1. Take R draws from the disturbance density N(0, σ̂*²), where
σ̂*² is the fitted value from the White regression evaluated at
x = x*.
2. Add x*′β̂ to each disturbance draw.
3. Form a density forecast by fitting a density to the output from
step 2.
4. Form an interval forecast (95%, say) by sorting the output
from step 2 to get the empirical cdf, and taking the left and
right interval endpoints as the .025 and .975 quantile values,
respectively.

169 / 247
Spatial Correlation in Cross-Section Regression

Do we really believe that the disturbances are uncorrelated over


space?

170 / 247
Spatial Correlation is Another Type of Violation of the IC
(This time it's non-zero disturbance correlations.)

Consider: ε ~ N(0, Σ)

Spatial correlation corresponds to non-diagonal Σ.

Σ =
⎡ σ₁²   σ₁₂  ...  σ₁T ⎤
⎢ σ₂₁   σ₂²  ...  σ₂T ⎥
⎢ ⋮     ⋮     ⋱   ⋮   ⎥
⎣ σT₁   σT₂  ...  σT² ⎦

Advanced topic, and we will not pursue it further here.

Could be block-diagonal (clustering)

171 / 247
Time Series

172 / 247
Misspecification

Do we really believe that the fitted model matches the DGP?


No major changes in time series...

173 / 247
Non-Normality and Outliers

Do we really believe that the disturbances are Gaussian?


No major changes in time series...

174 / 247
Indicator Variables in Time Series:
Trend and Seasonality

Trend and seasonality are effectively types of structural change

Now: Do we really believe that means are fixed over time?

Later: Do we really believe that regression coefficients are


fixed over time?

175 / 247
Liquor Sales

176 / 247
Log Liquor Sales

177 / 247
Linear Deterministic Trend

Trendt = β₁ + β₂ TIMEt

where TIMEt = t

Simply run the least squares regression y → c, TIME, where

TIME = (1, 2, 3, ..., T−1, T)′

178 / 247
Various Linear Trends

179 / 247
Linear Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.454290 0.017468 369.4834 0.0000


TIME 0.003809 8.98E-05 42.39935 0.0000

R-squared 0.843318 Mean dependent var 7.096188


Adjusted R-squared 0.842849 S.D. dependent var 0.402962
S.E. of regression 0.159743 Akaike info criterion -0.824561
Sum squared resid 8.523001 Schwarz criterion -0.801840
Log likelihood 140.5262 Hannan-Quinn criter. -0.815504
F-statistic 1797.705 Durbin-Watson stat 1.078573
Prob(F-statistic) 0.000000

180 / 247
Residual Plot

181 / 247
Deterministic Seasonality

Seasonalt = Σᵢ₌₁ˢ γᵢ SEASit   (s seasons per year)

where SEASit = 1 if observation t falls in season i, 0 otherwise

Simply run the least squares regression y → SEAS₁, ..., SEASs

(or blend: y → TIME, SEAS₁, ..., SEASs)

where (e.g., in the quarterly data case, assuming Q1 start and Q4 end):

SEAS₁ = (1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, ..., 0)′
SEAS₂ = (0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ..., 0)′
SEAS₃ = (0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, ..., 0)′
SEAS₄ = (0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, ..., 1)′
182 / 247
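An illustrative helper that builds the TIME (optionally TIME²) plus full-seasonal-dummy design matrix for the blend y → TIME, SEAS₁, ..., SEASs; it assumes the sample starts in season 1, and the function name is hypothetical.

import numpy as np

def trend_seasonal_design(T, s, quadratic=False):
    """Columns: TIME (and TIME^2 if requested) plus a full set of s seasonal
    dummies and no separate intercept, so the dummies play the role of
    season-specific intercepts."""
    time = np.arange(1, T + 1, dtype=float)
    cols = [time] + ([time ** 2] if quadratic else [])
    for i in range(s):
        cols.append((np.arange(T) % s == i).astype(float))   # SEAS_{i+1}
    return np.column_stack(cols)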
Linear Trend with Seasonal Dummies

183 / 247
Residual Plot

184 / 247
Seasonal Pattern

185 / 247
Nonlinearity in Time Series

Do we really believe that trends are linear?

186 / 247
Non-Linear Trend: Exponential (Log-Linear)

Trendt = β₁ e^(β₂ TIMEt)

ln(Trendt) = ln(β₁) + β₂ TIMEt

187 / 247
Figure: Various Exponential Trends

188 / 247
Non-Linear Trend: Quadratic

Allow for gentle curvature by including TIME and TIME²:

Trendt = β₁ + β₂ TIMEt + β₃ TIMEt²

189 / 247
Figure: Various Quadratic Trends

190 / 247
Recall Log-Linear Liquor Sales Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.454290 0.017468 369.4834 0.0000


TIME 0.003809 8.98E-05 42.39935 0.0000

R-squared 0.843318 Mean dependent var 7.096188


Adjusted R-squared 0.842849 S.D. dependent var 0.402962
S.E. of regression 0.159743 Akaike info criterion -0.824561
Sum squared resid 8.523001 Schwarz criterion -0.801840
Log likelihood 140.5262 Hannan-Quinn criter. -0.815504
F-statistic 1797.705 Durbin-Watson stat 1.078573
Prob(F-statistic) 0.000000

191 / 247
Residual Plot

192 / 247
Log-Quadratic Liquor Sales Trend Estimation

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

C 6.231269 0.020653 301.7187 0.0000


TIME 0.007768 0.000283 27.44987 0.0000
TIME2 -1.17E-05 8.13E-07 -14.44511 0.0000

R-squared 0.903676 Mean dependent var 7.096188


Adjusted R-squared 0.903097 S.D. dependent var 0.402962
S.E. of regression 0.125439 Akaike info criterion -1.305106
Sum squared resid 5.239733 Schwarz criterion -1.271025
Log likelihood 222.2579 Hannan-Quinn criter. -1.291521
F-statistic 1562.036 Durbin-Watson stat 1.754412
Prob(F-statistic) 0.000000

193 / 247
Residual Plot

194 / 247
Log-Quadratic Liquor Sales Trend Estimation
with Seasonal Dummies

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

TIME 0.007739 0.000104 74.49828 0.0000


TIME2 -1.18E-05 2.98E-07 -39.36756 0.0000
D1 6.138362 0.011207 547.7315 0.0000
D2 6.081424 0.011218 542.1044 0.0000
D3 6.168571 0.011229 549.3318 0.0000
D4 6.169584 0.011240 548.8944 0.0000
D5 6.238568 0.011251 554.5117 0.0000
D6 6.243596 0.011261 554.4513 0.0000
D7 6.287566 0.011271 557.8584 0.0000
D8 6.259257 0.011281 554.8647 0.0000
D9 6.199399 0.011290 549.0938 0.0000
D10 6.221507 0.011300 550.5987 0.0000
D11 6.253515 0.011309 552.9885 0.0000
D12 6.575648 0.011317 581.0220 0.0000

R-squared 0.987452 Mean dependent var 7.096188


Adjusted R-squared 0.986946 S.D. dependent var 0.402962
S.E. of regression 0.046041 Akaike info criterion -3.277812
Sum squared resid 0.682555 Schwarz criterion -3.118766
Log likelihood 564.6725 Hannan-Quinn criter. -3.214412
Durbin-Watson stat 0.581383
195 / 247
Residual Plot

196 / 247
Serial Correlation in Time-Series Regression

Do we really believe that disturbances are uncorrelated over
time?
(Not possible in cross sections, so we didn't study it before...)

197 / 247
Serially Correlated Regression Disturbances

Disturbance serial correlation, or autocorrelation,
means correlation over time
Current disturbance correlated with past disturbance(s)

Leading example
(AR(1) disturbance serial correlation):

yt = xt′β + εt

εt = φ εt−1 + vt,  |φ| < 1
vt ~ iid N(0, σ²)

(Extension to AR(p) disturbance serial correlation is immediate)

198 / 247
Serial Correlation Implies Σ ≠ σ²I
Recall heteroskedasticity:
Σ diagonal but with different diagonal elements

Now serial correlation:
Σ not even diagonal

Σ =
⎡ γ(0)     γ(1)     ...  γ(T−1) ⎤
⎢ γ(1)     γ(0)     ...  γ(T−2) ⎥
⎢ ⋮        ⋮         ⋱   ⋮      ⎥
⎣ γ(T−1)  γ(T−2)   ...  γ(0)   ⎦

where:
γ(τ) = cov(εt, εt−τ),  τ = 0, 1, 2, ...
Autocovariances: γ(τ), τ = 1, 2, ...
Autocorrelations: ρ(τ) = γ(τ)/γ(0), τ = 1, 2, ...
199 / 247
Why is Neglected Serial Correlation a Problem for
Prediction?
The IC involve Σ = σ²I, and serial correlation implies Σ ≠ σ²I, so
we get inconsistent s.e.'s, just as with heteroskedasticity. But that
was basically inconsequential for point forecasts.

But serial correlation is a bigger problem for prediction.
Here's the intuition:

Serial correlation in disturbances/residuals implies that the
included X variables have missed something that could be
exploited for improved point forecasting (and hence also improved
interval and density forecasting). That is, all types of forecasts are
sub-optimal when serial correlation is neglected.

Put differently:
Serial correlation in forecast errors means that you can forecast
your forecast errors! So something is wrong and can be improved...
200 / 247
What if You Don't Care About Neglected Serial
Correlation?
Hard to imagine

But perhaps you want to do credible inference regarding the
contributions of the various x variables to a point prediction based
only on the x's.

Then use heteroskedasticity and autocorrelation robust standard
errors

HAC standard errors, Newey-West standard errors

Just a simple regression option

e.g., in EViews,
instead of ls y,c,x, use ls(cov=hac) y,c,x

201 / 247
Trend + Seasonal Liquor Sales Regression with HAC
Standard Errors

202 / 247
Detecting Serial Correlation

I Formal tests
I Durbin-Watson
I Breusch-Godfrey

I Graphical diagnostics (actually more sophisticated and useful)


I Residual plot
I Residual scatterplot (et vs. et−1)
I Residual autocorrelations

203 / 247
Recall Our Log-Quadratic Liquor Sales Model

Dependent Variable: LSALES


Method: Least Squares
Date: 08/08/13 Time: 08:53
Sample: 1987M01 2014M12
Included observations: 336

Variable Coefficient Std. Error t-Statistic Prob.

TIME 0.007739 0.000104 74.49828 0.0000


TIME2 -1.18E-05 2.98E-07 -39.36756 0.0000
D1 6.138362 0.011207 547.7315 0.0000
D2 6.081424 0.011218 542.1044 0.0000
D3 6.168571 0.011229 549.3318 0.0000
D4 6.169584 0.011240 548.8944 0.0000
D5 6.238568 0.011251 554.5117 0.0000
D6 6.243596 0.011261 554.4513 0.0000
D7 6.287566 0.011271 557.8584 0.0000
D8 6.259257 0.011281 554.8647 0.0000
D9 6.199399 0.011290 549.0938 0.0000
D10 6.221507 0.011300 550.5987 0.0000
D11 6.253515 0.011309 552.9885 0.0000
D12 6.575648 0.011317 581.0220 0.0000

R-squared 0.987452 Mean dependent var 7.096188


Adjusted R-squared 0.986946 S.D. dependent var 0.402962
S.E. of regression 0.046041 Akaike info criterion -3.277812
Sum squared resid 0.682555 Schwarz criterion -3.118766
Log likelihood 564.6725 Hannan-Quinn criter. -3.214412
Durbin-Watson stat 0.581383

204 / 247
Formal Tests: Durbin-Watson (0.59!)

Simple AR(1) environment:

yt = xt′β + εt

εt = φ εt−1 + vt
vt ~ iid N(0, σ²)
We want to test H₀: φ = 0 against H₁: φ ≠ 0

Regress yt → xt and obtain the residuals et

Then form:

DW = Σₜ₌₂ᵀ (et − et−1)² / Σₜ₌₁ᵀ et²

205 / 247
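A one-line NumPy sketch of the DW statistic from a residual vector (function name assumed); values near 2 suggest no AR(1) serial correlation, values near 0 strong positive correlation.

import numpy as np

def durbin_watson(e):
    """Sum of squared first differences of the residuals divided by the
    sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)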
Understanding the Durbin-Watson Statistic

DW = Σₜ₌₂ᵀ (et − et−1)² / Σₜ₌₁ᵀ et² = [(1/T) Σₜ₌₂ᵀ (et − et−1)²] / [(1/T) Σₜ₌₁ᵀ et²]

The numerator is

(1/T) Σₜ₌₂ᵀ et² + (1/T) Σₜ₌₂ᵀ et−1² − 2 (1/T) Σₜ₌₂ᵀ et et−1.

Hence as T → ∞:

DW → [σ² + σ² − 2 cov(εt, εt−1)] / σ² = 1 + 1 − 2 corr(εt, εt−1) = 2(1 − corr(εt, εt−1))

⟹ DW ∈ [0, 4], DW → 2 as φ → 0, and DW → 0 as φ → 1

206 / 247
Formal Tests: Breusch-Godfrey

General AR(p) environment:

yt = xt′β + εt

εt = φ₁ εt−1 + ... + φp εt−p + vt
We want to test H₀: (φ₁, ..., φp) = 0 against H₁: (φ₁, ..., φp) ≠ 0

I Regress yt → xt and obtain the residuals et

I Regress et → xt, et−1, ..., et−p

I Examine T·R². In large samples T·R² ~ χ²_p under the null.
Does this sound familiar?

207 / 247
BG for AR(1) Disturbances
(TR 2 = 168.5, p = 0.0000)

208 / 247
BG for AR(4) Disturbances
(TR 2 = 216.7, p = 0.0000)

209 / 247
BG for AR(8) Disturbances
(TR 2 = 219.0, p = 0.0000)

210 / 247
Residual Plot

211 / 247
Residual Scatterplot (et vs. et−1)

212 / 247
Residual Autocorrelations

213 / 247
Fixing the Serial Correlation Problem:
Including Lags of y as Regressors

Serial correlation in disturbances means that the included x's
(in our case, trends and seasonals)
don't fully account for the dynamics in y.

But the problem is simple to fix:
Just include lags of y as additional regressors.

AR(p) disturbances are fixed by including p lags of y.

(Select p using the usual AIC, SIC, etc.)

AIC selects p = 4, and SIC selects p = 3.

214 / 247
Trend + Seasonal Model
with Four Lags of y

215 / 247
Trend + Seasonal Model
with Four Lags of y
Residual Plot

216 / 247
Residual Scatterplot

217 / 247
Residual Autocorrelations

218 / 247
Forecasting and the "Forecasting the Right-Hand-Side
Variables" Problem

yt = xt′β + εt  ⟹  yt+h = x′t+h β + εt+h

Projecting on current information,

yt+h,t = x′t+h,t β

"Forecasting the right-hand-side variables" problem:
We don't have xt+h,t!

But no problem for trends or seasonals

219 / 247
What About Autoregressions?
e.g., AR(1)

yt = φ yt−1 + εt

Hence:

yt+h = φ yt+h−1 + εt+h

Projecting on current information,

yt+h,t = φ yt+h−1,t

There seems to be a FRHS variables problem for h > 1.
But there's not!
We can build the multi-step forecast recursively.
Wold's chain rule of forecasting
220 / 247
(More General) Structural Change in Time Series:
Drifts and Breaks

Again, do we really believe that coefficients are fixed over time?

221 / 247
Structural Change
Sharp Breakpoint Exogenously Known
For simplicity of exposition, consider a bivariate regression:

yt = β₁₁ + β₂₁ xt + εt,  t = 1, ..., T*
yt = β₁₂ + β₂₂ xt + εt,  t = T* + 1, ..., T

Let

Dt = 0, t = 1, ..., T*
Dt = 1, t = T* + 1, ..., T

Then we can write the model as:

yt = (β₁₁ + (β₁₂ − β₁₁)Dt) + (β₂₁ + (β₂₂ − β₂₁)Dt) xt + εt

We run:
yt → c, Dt, xt, Dt·xt

Use regression to test for structural change (F test)
Use regression to accommodate structural change if present.
222 / 247
Structural Change
Sharp Breakpoint, Exogenously Known, Continued

The Chow test is what we're really calculating:

Chow = [ (e′e − (e₁′e₁ + e₂′e₂)) / K ] / [ (e₁′e₁ + e₂′e₂) / (T − 2K) ]

Distributed F under the no-break null (and the rest of the IC)

223 / 247
Structural Change
Sharp Breakpoint, Endogenously Identified

MaxChow = max over τ ∈ [τ_min, τ_max] of Chow(τ),

where τ denotes potential break location as a fraction of the sample

(Typically we take τ_min = .15 and τ_max = .85)

The null distribution of MaxChow has been tabulated.

224 / 247
Rolling-Window Regression
for Generic Structural Change Assessment

Calculate and examine

β̂_{t−w:t},  for t = w + 1, ..., T

w is window width

What does window width govern?

225 / 247
Expanding-Window (Recursive) Regression
for Generic Structural Change Assessment
Model:

yt = Σₖ₌₁ᴷ βk xkt + εt

εt ~ iid N(0, σ²),  t = 1, ..., T.

OLS estimation uses the full sample, t = 1, ..., T.

Recursive least squares uses an expanding sample.
Begin with the first K observations and estimate the model.
Then estimate using the first K + 1 observations, and so on.
At the end we have a set of recursive parameter estimates:
β̂k,t, for k = 1, ..., K and t = K, ..., T.
226 / 247
Recursive Residuals

At each t, t = K, ..., T − 1, compute a 1-step forecast,

ŷt+1,t = Σₖ₌₁ᴷ β̂k,t xk,t+1.

The corresponding forecast errors, or recursive residuals, are

êt+1,t = yt+1 − ŷt+1,t.

êt+1,t ~ N(0, σ² rt)

where rt > 1 for all t

227 / 247
Standardized Recursive Residuals and CUSUM

wt+1,t ≡ êt+1,t / (σ √rt),  t = K, ..., T − 1.

Under the maintained assumptions,

wt+1,t ~ iid N(0, 1).

Then

CUSUMt ≡ Σ_{τ=K}^{t} wτ+1,τ,  t = K, ..., T − 1

is just a sum of iid N(0, 1)'s.

228 / 247
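A hedged sketch of standardized recursive residuals and their CUSUM. It assumes a value of σ̂ is supplied and uses the standard recursive scaling rt = 1 + x′t+1 (X′1:t X1:t)⁻¹ xt+1, which the slide leaves implicit; names are illustrative.

import numpy as np

def recursive_residuals_cusum(y, X, sigma_hat):
    """Standardized one-step-ahead recursive residuals w_{t+1,t} and their
    cumulative sum, for t = K, ..., T-1 (0-indexed loop)."""
    T, K = X.shape
    w = []
    for t in range(K, T):
        Xt, yt = X[:t], y[:t]
        beta_t = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)   # expanding-sample OLS
        x_next = X[t]
        e_next = y[t] - x_next @ beta_t                  # recursive residual
        r_t = 1.0 + x_next @ np.linalg.solve(Xt.T @ Xt, x_next)
        w.append(e_next / (sigma_hat * np.sqrt(r_t)))
    w = np.array(w)
    return w, np.cumsum(w)                               # residuals, CUSUM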
Recursive Analysis: Constant Parameter Model

229 / 247
Recursive Analysis: Breaking Parameter Model

230 / 247
Heteroskedasticity in Time Series

Do we really believe that


disturbance variances are constant over time?

231 / 247
Varieties of Random (White) Noise

White noise: εt ~ WN(μ, σ²)  (serially uncorrelated)

Zero-mean white noise: εt ~ WN(0, σ²)  (serially uncorrelated)

Independent (strong) white noise: εt ~ iid (0, σ²)

Gaussian white noise: εt ~ iid N(0, σ²)

232 / 247
Linear Models (e.g., AR(1))

rt = φ rt−1 + εt

εt ~ iid(0, σ²),  |φ| < 1

Uncond. mean: E(rt) = 0  (constant)
Uncond. variance: E(rt²) = σ²/(1 − φ²)  (constant)
Cond. mean: E(rt | Ωt−1) = φ rt−1  (varies)
Cond. variance: E([rt − E(rt | Ωt−1)]² | Ωt−1) = σ²  (constant)

Conditional mean adapts, but conditional variance does not

233 / 247
ARCH(1) Process

rt | Ωt−1 ~ N(0, ht)
ht = ω + α rt−1²

E(rt) = 0
E(rt²) = ω / (1 − α)
E(rt | Ωt−1) = 0
E([rt − E(rt | Ωt−1)]² | Ωt−1) = ω + α rt−1²

234 / 247
GARCH(1,1) Process (Generalized ARCH)

rt | Ωt−1 ~ N(0, ht)
ht = ω + α rt−1² + β ht−1

E(rt) = 0
E(rt²) = ω / (1 − α − β)
E(rt | Ωt−1) = 0
E([rt − E(rt | Ωt−1)]² | Ωt−1) = ω + α rt−1² + β ht−1

Well-defined and covariance stationary if
0 < α < 1, 0 < β < 1, α + β < 1
235 / 247
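An illustrative simulation of the GARCH(1,1) process above, started at its unconditional variance (a common but not unique start-up choice); the function name is hypothetical.

import numpy as np

def simulate_garch11(T, omega, alpha, beta, seed=0):
    """Simulate r_t | Omega_{t-1} ~ N(0, h_t) with
    h_t = omega + alpha * r_{t-1}^2 + beta * h_{t-1}."""
    rng = np.random.default_rng(seed)
    r = np.zeros(T)
    h = np.zeros(T)
    h[0] = omega / (1.0 - alpha - beta)          # unconditional variance
    r[0] = np.sqrt(h[0]) * rng.standard_normal()
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
        r[t] = np.sqrt(h[t]) * rng.standard_normal()
    return r, h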
GARCH(1,1) and Exponential Smoothing

Exponential smoothing recursion:

σ̂t² = γ σ̂t−1² + (1 − γ) rt²

⟹ σ̂t² = (1 − γ) Σⱼ γʲ rt−j²

But in GARCH(1,1) we have:

ht = ω + α rt−1² + β ht−1

⟹ ht = ω/(1 − β) + α Σⱼ βʲ⁻¹ rt−j²

236 / 247
Tractable Maximum-Likelihood Estimation

L(θ; r₁, ..., rT) = f(rT | ΩT−1; θ) f(rT−1 | ΩT−2; θ) ⋯ ,

where θ = (ω, α, β)′

If the conditional densities are Gaussian,

f(rt | Ωt−1; θ) = (1/√(2π)) ht(θ)^(−1/2) exp( −(1/2) rt² / ht(θ) ),

so

ln L = const − (1/2) Σₜ ln ht(θ) − (1/2) Σₜ rt² / ht(θ)

237 / 247
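A minimal sketch of the (negative) Gaussian GARCH(1,1) log likelihood above, suitable for handing to a numerical optimizer; the start-up value h₁ = sample variance is an assumption, and constants are dropped as on the slide.

import numpy as np

def garch11_neg_loglik(theta, r):
    """Negative log likelihood for theta = (omega, alpha, beta); minimize
    it (e.g., with a generic numerical optimizer) to obtain the MLE."""
    omega, alpha, beta = theta
    T = len(r)
    h = np.empty(T)
    h[0] = r.var()                               # start-up: sample variance
    for t in range(1, T):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    return 0.5 * np.sum(np.log(h) + r ** 2 / h)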
Variations on the GARCH Theme

I Regression with GARCH disturbances

I Fat-tailed conditional densities: t-GARCH

I Asymmetric response and the leverage effect: T-GARCH

238 / 247
Regression with GARCH Disturbances

yt = xt′β + εt

εt | Ωt−1 ~ N(0, ht)

239 / 247
Fat-Tailed Conditional Densities: t-GARCH

If r is conditionally Gaussian, then

rt = √ht · N(0, 1)

But often, with high-frequency data,

rt / √ht is leptokurtic

So take:

rt = √ht · td / std(td)

and treat d as another parameter to be estimated

240 / 247
Asymmetric Response and the Leverage Effect: T-GARCH

Standard GARCH: ht = ω + α rt−1² + β ht−1

T-GARCH: ht = ω + α rt−1² + γ rt−1² Dt−1 + β ht−1

Dt = 1 if rt < 0, 0 otherwise

positive return (good news): α effect on volatility

negative return (bad news): α + γ effect on volatility

γ ≠ 0: Asymmetric news response
γ > 0: Leverage effect

241 / 247
A Useful Specification Diagnostic

εt | Ωt−1 ~ N(0, ht)

εt = √ht vt,  vt ~ iid N(0, 1)

εt / √ht = vt,  vt ~ iid N(0, 1)

Infeasible: examine vt = εt / √ht. iid? Gaussian?

Feasible: examine v̂t = et / √ĥt. iid? Gaussian?

Key potential deviation from iid is volatility dynamics:
Examine correlogram of squared standardized returns, v̂t²
Examine normality of standardized returns, v̂t

242 / 247
Conditional Mean Estimation

243 / 247
Conditional Variance Estimation

244 / 247
Autocorrelations of Squared Standardized Residuals

245 / 247
Distribution of Standardized Residuals

246 / 247
Time Series of Estimated Conditional Standard Deviations

247 / 247
