
Paul Söderlind

16 January 2015

University of St. Gallen. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. Gallen, Switzerland. E-mail: Paul.Soderlind@unisg.ch. Document name: EmpFinPhDAll.TeX.

Contents

1 Econometrics Cheat Sheet
1.1 The Variance of a Sample Mean
1.2 GMM
1.3 MLE
1.4 Testing (Linear) Joint Hypotheses
1.5 Testing (Nonlinear) Joint Hypotheses: The Delta Method
A Statistical Tables
B Data Sources

2 Basic Asset Pricing Theory
2.1 Three Pricing Principles
2.2 Stochastic Discount Factors
2.3 Beta Pricing Models
2.4 Risk Neutral Distributions

3 Simulating the Finite Sample Properties
3.1 Introduction
3.2 Monte Carlo Simulations
3.3 Bootstrapping

4 Return Distributions
4.1 Estimating and Testing Distributions
4.2 Estimating Risk-neutral Distributions from Options
4.3 Threshold Exceedance and Tail Distribution
4.4 Exceedance Correlations
4.5 Beyond (Linear) Correlations
4.6 Copulas
4.7 Joint Tail Distribution

5.1 A Little Financial Theory and Predictability
5.2 Autocorrelations
5.3 Multivariate (Auto-)correlations
5.4 Other Predictors
5.5 Maximally Predictable Portfolio
5.6 Spurious Regressions and In-Sample Overfitting
5.7 Model Selection
5.8 Out-of-Sample Forecasting Performance
5.9 Evaluating Forecasting Performance

6.1 Basics of Kernel Regressions
6.2 Distribution of the Kernel Regression and Choice of Bandwidth
6.3 Applications of Kernel Regressions

7.1 Heteroskedasticity
7.2 ARCH Models
7.3 GARCH Models
7.4 Value at Risk
7.5 Non-Linear Extensions
7.6 GARCH Models with Exogenous Variables
7.7 Stochastic Volatility Models
7.8 (G)ARCH-M
7.9 Multivariate (G)ARCH
7.10 LAD and Quantile Regressions
7.11 A Closed-Form GARCH Option Valuation Model by Heston and Nandi
7.12 Fundamental Values and Asset Returns in Global Equity Markets, by Bansal and Lundblad
A Using an FFT to Calculate the PDF from the Characteristic Function
A.1 Characteristic Function
A.2 Invert the Characteristic Function

8.1 CAPM Tests: Overview
8.2 Testing CAPM: Traditional LS Approach
8.3 Testing CAPM: GMM
8.4 Testing Multi-Factor Models (Factors are Excess Returns)
8.5 Testing Multi-Factor Models (General Factors)
8.6 Linear SDF Models
8.7 Conditional Factor Models
8.8 Conditional Models with Regimes
8.9 Fama-MacBeth
B.1 Coding of the GMM Estimation of a Linear Factor Model
B.2 Coding of the GMM Estimation of a Linear SDF Model

9.1 Consumption-Based Asset Pricing
9.2 Asset Pricing Puzzles
9.3 The Cross-Section of Returns: Unconditional Models
9.4 The Cross-Section of Returns: Conditional Models
9.5 Ultimate Consumption

10.1 Basic Setup
10.2 Calendar Time and Cross Sectional Regression
10.3 Panel Regressions, Driscoll-Kraay and Cluster Methods
10.4 From CalTime To a Panel Regression
10.5 The Results in Hoechle, Schmid and Zimmermann
10.6 Monte Carlo Experiment
10.7 An Empirical Illustration

11 Expectations Hypothesis of Interest Rates
11.1 Term (Risk) Premia
11.2 Testing the Expectations Hypothesis of Interest Rates
11.3 The Properties of Spread-Based EH Tests

11.1 Overview
11.2 Risk Premia on Fixed Income Markets
11.3 Summary of the Solutions of Some Affine Yield Curve Models
11.4 MLE of Affine Yield Curve Models
11.5 Summary of Some Empirical Findings

12.1 Nonparametric Regression
12.2 Approximating Non-Linear Regression Functions

1 Econometrics Cheat Sheet

Sections denoted by a star (∗) are not required reading.

Reference: Cochrane (2005) 11 and 14; Singleton (2006) 2–4; DeMiguel, Garlappi, and Uppal (2009)

1.1 The Variance of a Sample Mean

Many estimators (including GMM) are based on some sort of sample average. Unless we are sure that the series in the average is iid, we need an estimator of the variance (of the sample average) that takes serial correlation into account. The Newey-West estimator is probably the most popular.

Consider the sample mean $\bar{x}$ of a $K\times 1$ vector $x_t$,

$$\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t. \tag{1.1}$$

If $x_t$ is iid, then

$$\mathrm{Cov}(\sqrt{T}\,\bar{x}) = \mathrm{Cov}(x_t). \tag{1.2}$$

More generally, when $x_t$ is serially correlated,

$$\mathrm{Cov}(\sqrt{T}\,\bar{x}) = \sum_{s=-(T-1)}^{T-1}\left(1-\frac{|s|}{T}\right)R(s), \text{ where } R(s)=\mathrm{Cov}(x_t, x_{t-s}). \tag{1.3}$$

This follows since the variance of a sum is

$$\mathrm{Var}\left(\sum_{t=1}^{T} x_t\right) = \sum_{s=-(T-1)}^{T-1}\left(T-|s|\right)R(s),$$

where, for instance, the $s=\pm 2$ terms contribute $(T-2)\left[\mathrm{Cov}(x_t, x_{t+2}) + \mathrm{Cov}(x_t, x_{t-2})\right]$.

[Figure 1.1: Variance of the √T-scaled sample average of an AR(1) series, Var(√T x̄)/Var(x_t), plotted against the autocorrelation coefficient.]

Example 1.1 (Sample average of an AR(1) series) Let $x_t = \rho x_{t-1} + u_t$, where $u_t$ is iid with variance $\sigma^2$. Let $R(s)$ denote the $s$th autocovariance and notice that $R(s) = \rho^{|s|}\sigma^2/(1-\rho^2)$. The variance of the sample average is then (for large $T$)

$$\mathrm{Var}(\sqrt{T}\,\bar{x}) = \sum_{s=-\infty}^{\infty} R(s) = \frac{\sigma^2}{1-\rho^2}\sum_{s=-\infty}^{\infty}\rho^{|s|} = \frac{\sigma^2}{1-\rho^2}\,\frac{1+\rho}{1-\rho},$$

which is increasing in $\rho$ (provided $|\rho| < 1$, as required for stationarity). The variance of $\sqrt{T}\bar{x}$ is much larger for $\rho$ close to one than for $\rho$ close to zero: the high autocorrelation creates long swings, so the mean cannot be estimated with good precision in a small sample. If we disregard all autocovariances, then we would conclude that the variance of $\sqrt{T}\bar{x}$ is $\sigma^2/(1-\rho^2)$, that is, the variance of $x_t$. This is much smaller (larger) than the true value when $\rho > 0$ ($\rho < 0$). For instance, with $\rho = 0.9$, it is 19 times too small. See Figure 1.1 for an illustration. Notice that $\mathrm{Var}(\sqrt{T}\bar{x})/\mathrm{Var}(x_t) = \mathrm{Var}(\bar{x})/\left[\mathrm{Var}(x_t)/T\right]$, so the ratio also shows the relation between the true variance of $\bar{x}$ and the classical estimator of it (based on the iid assumption).

The Newey-West estimator of the variance-covariance matrix of $\sqrt{T}\bar{x}$ is

$$\widehat{\mathrm{Cov}}(\sqrt{T}\,\bar{x}) = \sum_{s=-n}^{n}\left(1-\frac{|s|}{n+1}\right)\widehat{\mathrm{Cov}}(x_t, x_{t-s}), \tag{1.4}$$

where $n$ is a finite bandwidth parameter. The weights, $1 - |s|/(n+1)$, are clearly tent-shaped: 1 at the zero lag, and lower as the lags become longer. This is similar to (1.3), but the weights decrease quicker (assuming $n < T-1$). This suggests that $n$ should be somewhat larger than the last lag with significant autocorrelation. However, a common rule of thumb is to use $n = \mathrm{floor}(0.75T^{1/3})$, where floor(·) means rounding down to the nearest integer (an alternative rule is $n = \mathrm{floor}(4(T/100)^{2/9})$).

Example 1.2 (Newey-West estimator) With $n = 1$ in (1.4) the Newey-West estimator becomes

$$\widehat{\mathrm{Cov}}(\sqrt{T}\,\bar{x}) = \frac{1}{2}\widehat{\mathrm{Cov}}(x_t, x_{t+1}) + \widehat{\mathrm{Cov}}(x_t, x_t) + \frac{1}{2}\widehat{\mathrm{Cov}}(x_t, x_{t-1}).$$
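Equation (1.4) is easy to code. Below is a minimal sketch in Python (my illustration, not code from the notes); the function name and simulation settings are my own choices, and the rule-of-thumb bandwidth from above is used in the example.

```python
import numpy as np

def newey_west(x, n):
    """Newey-West estimate of Cov(sqrt(T)*xbar) as in (1.4).
    x : (T, K) array of observations; n : bandwidth (number of lags)."""
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    xd = x - x.mean(axis=0)                 # demean
    S = xd.T @ xd / T                       # s = 0 term
    for s in range(1, n + 1):
        w = 1 - s / (n + 1)                 # tent-shaped weight
        G = xd[s:].T @ xd[:-s] / T          # Cov(x_t, x_{t-s})
        S += w * (G + G.T)                  # the +s and -s terms
    return S

# example: AR(1) data; Example 1.1 says Var(sqrt(T)*xbar) = 1/(1-rho)^2 here
rng = np.random.default_rng(0)
rho, T = 0.9, 10_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.standard_normal()
n = int(np.floor(0.75 * T ** (1 / 3)))      # rule-of-thumb bandwidth
print(newey_west(x[:, None], n)[0, 0], 1 / (1 - rho) ** 2)
```

For a very persistent series like this one, the Newey-West estimate is still biased downwards unless n is large, which is what the simulation evidence in Table 1.1 below illustrates.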

Remark 1.3 (VARHAC) The VARHAC estimator of the covariance matrix (see Andrews and Monahan (1992)) is to first fit a VAR(p) to $x_t$,

$$x_t = A_0 + \sum_{i=1}^{p} A_i x_{t-i} + \varepsilon_t,$$

and then calculate $D = I - \sum_{i=1}^{p} A_i$. Finally, the estimated $\mathrm{Cov}(\sqrt{T}\bar{x}) = D^{-1}S(D^{-1})'$, where $S$ is the Newey-West estimate of $\mathrm{Cov}(\sqrt{T}\bar{\varepsilon})$. As an example, let $x_t$ be a scalar that follows an AR(1) process, $x_t = \rho x_{t-1} + \varepsilon_t$. If $\varepsilon_t$ is iid, then $\mathrm{Cov}(\sqrt{T}\bar{\varepsilon}) = \sigma^2$, where $\sigma^2$ is the variance of $\varepsilon_t$. $D = 1-\rho$, so $\mathrm{Cov}(\sqrt{T}\bar{x}) = \sigma^2/(1-\rho)^2$, which is the same as the variance in Example 1.1 (since $(1-\rho^2)/(1+\rho) = 1-\rho$). See Tables 1.1–1.2.
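The VARHAC calculation is also short. The sketch below (again my illustration) treats the scalar AR(1) case from the remark, fixing the lag length at one and reusing the newey_west function from the previous sketch.

```python
import numpy as np

def varhac_ar1(x, nw_lags):
    """VARHAC estimate of Cov(sqrt(T)*xbar) for a scalar series,
    prewhitening with an AR(1); a sketch, not a general VAR(p)."""
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    # fit x_t = rho*x_{t-1} + e_t by least squares
    rho = (xd[1:] @ xd[:-1]) / (xd[:-1] @ xd[:-1])
    e = xd[1:] - rho * xd[:-1]
    D = 1.0 - rho                                # D = I - sum of AR coefficients
    S = newey_west(e[:, None], nw_lags)[0, 0]    # NW estimate on the residuals
    return S / D**2                              # D^{-1} S D^{-1}
```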

ρ:             0      0.375   0.75
Simulated      5.8    9.3     23.1
OLS formula    5.8    6.2     8.6
Newey-West     5.7    8.4     16.3
VARHAC         5.7    9.1     22.4
Bootstrapped   5.5    8.6     19.7

Table 1.1: Standard error of the OLS intercept (%) under autocorrelation (simulation evidence). Model: y_t = 1 + 0.9x_t + ε_t, where ε_t = ρε_{t-1} + ν_t and ν_t is iid normal. NW uses 5 lags. VARHAC uses 5 lags and a VAR(1). The bootstrap uses blocks of size 20. Sample length: 300. Number of simulations: 10000.

Autocorrelation of x_t:     0                        0.75
ρ:             0      0.375   0.75      0      0.375   0.75
Simulated      5.8    6.2     8.7       3.9    5.5     10.9
OLS formula    5.8    6.2     8.6       3.9    4.2     5.8
Newey-West     5.7    6.1     8.4       3.8    5.1     8.9
VARHAC         5.7    6.1     8.5       3.8    5.4     10.5
Bootstrapped   5.8    6.2     8.6       3.8    5.4     10.1

Table 1.2: Standard error of the OLS slope (%) under autocorrelation (simulation evidence). Model: y_t = 1 + 0.9x_t + ε_t, where ε_t = ρε_{t-1} + ν_t and x_t = κx_{t-1} + η_t, with ν_t and η_t iid normal (the two panels have κ = 0 and κ = 0.75). NW uses 5 lags. VARHAC uses 5 lags and a VAR(1). The bootstrap uses blocks of size 20. Sample length: 300. Number of simulations: 10000.

1.2 GMM

The $q\times 1$ sample moment conditions are

$$\bar{g}(\beta) = \frac{1}{T}\sum_{t=1}^{T} g_t(\beta) = 0_{q\times 1}, \tag{1.5}$$

where $\bar{g}(\beta)$ is shorthand notation for the sample average. The notation $g_t(\beta)$ is meant to show that the moment conditions depend on the parameter vector ($\beta$) and on the data for period $t$. We let $\beta_0$ denote the true value of the $k\times 1$ parameter vector.

The GMM estimator is

$$\hat{\beta} = \arg\min_{\beta}\,\bar{g}(\beta)'\,W\,\bar{g}(\beta), \tag{1.6}$$

where $W$ is some symmetric positive definite $q\times q$ weighting matrix. When the model is exactly identified ($q = k$), we do not have to perform an explicit minimization, since all sample moment conditions can be set equal to zero (as many parameters as moment conditions).

Example 1.4 (Moment condition for a mean) To estimate the mean $\mu$ of $x_t$, use

$$g_t = x_t - \mu.$$

Example 1.5 (Moment conditions for OLS) Consider the linear model $y_t = x_t'\beta_0 + u_t$, where $x_t$ and $\beta$ are $k\times 1$ vectors. The $k$ moment conditions are

$$g_t = x_t(y_t - x_t'\beta).$$

Example 1.6 (Moment conditions for estimating a normal distribution) Suppose you specify four moments for estimating the mean $\mu$ and variance $\sigma^2$ of a normal distribution:

$$g_t = \begin{bmatrix} x_t-\mu \\ (x_t-\mu)^2-\sigma^2 \\ (x_t-\mu)^3 \\ (x_t-\mu)^4-3\sigma^4 \end{bmatrix}.$$

The asymptotic distribution of the GMM estimator has a covariance matrix that depends on the covariance matrix of the moment conditions and the mapping from the parameters to the moment conditions.

Let $S_0$ be the ($q\times q$) covariance matrix of $\sqrt{T}\bar{g}(\beta_0)$, evaluated at the true parameter values,

$$S_0 = \mathrm{Cov}\left[\sqrt{T}\,\bar{g}(\beta_0)\right]. \tag{1.7}$$

If $g_t(\beta_0)$ has no autocorrelation, then (1.7) becomes

$$S_0 = \mathrm{Cov}\left[g_t(\beta_0)\right]. \tag{1.8}$$

When there is autocorrelation, then we may use the Newey-West approach to estimate $S_0$.

In practice, $S_0$ is estimated by using the estimated coefficients in the moments to get the data series $g_t(\hat{\beta})$, a $T\times q$ matrix, from which we estimate the covariances needed for (1.7) or (1.8).

Example 1.7 (Estimating a mean, variance) The moment in Example 1.4 (assuming iid data, so we can use (1.8)) gives

$$S_0 = \mathrm{Var}(x_t) = \sigma^2.$$

In practice, we replace the variance by a sample estimate. If we suspect that $x_t$ is autocorrelated, then we may use the NW estimator of $\mathrm{Var}(\sqrt{T}\bar{g})$.

Example 1.8 (OLS, covariance) For the moments in Example 1.5, using $u_t = y_t - x_t'\beta$, we have

$$S_0 = \mathrm{Cov}\left[\frac{\sqrt{T}}{T}\sum_{t=1}^{T} x_t u_t\right].$$

In practice, we replace $u_t$ by the fitted residuals and calculate a sample covariance. If we suspect that $g_t$ is autocorrelated, then we may use the NW estimator of $\mathrm{Var}(\sqrt{T}\bar{g})$.

Example 1.9 (Normal distribution, covariance of the moments) For the moment conditions in Example 1.6 (assuming iid normally distributed data, so we can use (1.8)), it can be shown that

$$S_0 = \begin{bmatrix} \sigma^2 & 0 & 3\sigma^4 & 0 \\ 0 & 2\sigma^4 & 0 & 12\sigma^6 \\ 3\sigma^4 & 0 & 15\sigma^6 & 0 \\ 0 & 12\sigma^6 & 0 & 96\sigma^8 \end{bmatrix}.$$

In practice, we would use the point estimates in the moments and calculate the sample covariance matrix. If we suspect that $g_t$ is autocorrelated, then we may use the NW estimator of $\mathrm{Var}(\sqrt{T}\bar{g})$.

Let $D_0$ be the ($q\times k$) probability limit of the gradient (Jacobian) of the sample moment conditions with respect to the parameters (also evaluated at the true parameters),

$$D_0 = \mathrm{plim}\,\frac{\partial \bar{g}(\beta_0)}{\partial \beta'}. \tag{1.9}$$

Remark 1.10 (Jacobian) The Jacobian is of the following format:

$$\frac{\partial \bar{g}(\beta_0)}{\partial \beta'} = \begin{bmatrix} \partial\bar{g}_1(\beta)/\partial\beta_1 & \cdots & \partial\bar{g}_1(\beta)/\partial\beta_k \\ \vdots & \ddots & \vdots \\ \partial\bar{g}_q(\beta)/\partial\beta_1 & \cdots & \partial\bar{g}_q(\beta)/\partial\beta_k \end{bmatrix} \text{ (evaluated at } \beta_0\text{)}.$$

Example 1.11 (Estimating a mean, Jacobian) For the moment in Example 1.4,

$$D_0 = \frac{\partial}{\partial\mu}\,\frac{1}{T}\sum_{t=1}^{T}(x_t-\mu) = -1,$$

which is just a constant and does not involve any parameter values.

Example 1.12 (OLS, Jacobian) For the moments in Example 1.5,

$$D_0 = \mathrm{plim}\left(-\frac{1}{T}\sum_{t=1}^{T} x_t x_t'\right) = -\Sigma_{xx}.$$

This does not contain any parameters either, but it does include data. In practice, we replace $\Sigma_{xx}$ by a sample estimate.

Example 1.13 (Normal distribution, Jacobian) For the moment conditions in Example 1.6 (assuming iid normally distributed data) we have (the rows are for the four different moment conditions, the columns for the parameters $\mu$ and $\sigma^2$)

$$D_0 = \mathrm{plim}\,\frac{1}{T}\sum_{t=1}^{T}\begin{bmatrix} -1 & 0 \\ -2(x_t-\mu) & -1 \\ -3(x_t-\mu)^2 & 0 \\ -4(x_t-\mu)^3 & -6\sigma^2 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -3\sigma^2 & 0 \\ 0 & -6\sigma^2 \end{bmatrix}.$$

The second equality holds only if the data is indeed normally distributed. In practice, we would use the point estimates in the matrix on the first line and calculate the sample average.

With the optimal weighting matrix, the GMM estimator is asymptotically normal,

$$\sqrt{T}(\hat{\beta}-\beta_0) \xrightarrow{d} N(0, V) \text{ if } W = S_0^{-1}, \text{ where } V = \left(D_0' S_0^{-1} D_0\right)^{-1}, \tag{1.10}$$

which assumes that we have used $S_0^{-1}$ as the weighting matrix. This gives the most efficient GMM estimator, for a given set of moment conditions. The choice of the weighting matrix is irrelevant if the model is exactly identified, so (1.10) can be applied to that case (even if we did not specify any weighting matrix at all).

In practice, the gradient D0 is approximated by using the point estimates and the

available sample of data. The Newey-West estimator is commonly used to estimate the

11

covariance matrix S0 . When the model is exactly identified, then we can typically rewrite

the covariance matrix as V D D0 1 S0 .D0 1 /0 , which might be easier to calculate.

To implement W D S0 1 , an iterative procedure is often used: start with W D Iq ,

estimate the parameters, estimate S0 , then (in a second step) use W D SO0 1 and reesti-

mate. In most cases this iteration is stopped at this stage, but other researchers choose to

continue iterating until the point estimates converge.
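As an illustration of this two-step procedure (my sketch, not code from the notes), the following estimates (μ, σ²) by GMM with the four moment conditions of Example 1.6, assuming iid data so that S_0 can be estimated by the sample covariance of g_t as in (1.8).

```python
import numpy as np
from scipy.optimize import minimize

def moments(theta, x):
    """The four moment conditions of Example 1.6, as a T x 4 matrix."""
    mu, s2 = theta
    e = x - mu
    return np.column_stack([e, e**2 - s2, e**3, e**4 - 3 * s2**2])

def gmm_loss(theta, x, W):
    gbar = moments(theta, x).mean(axis=0)
    return gbar @ W @ gbar

rng = np.random.default_rng(0)
x = 1.0 + 0.5 * rng.standard_normal(1_000)       # true mu = 1, sigma^2 = 0.25

# step 1: W = I_4
th1 = minimize(gmm_loss, [x.mean(), x.var()], args=(x, np.eye(4)),
               method="Nelder-Mead").x
# step 2: W = inv(S0_hat), with S0_hat from the step-1 estimates
W2 = np.linalg.inv(np.cov(moments(th1, x).T))
th2 = minimize(gmm_loss, th1, args=(x, W2), method="Nelder-Mead").x
print(th2)                                        # close to (1, 0.25)
```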

Example 1.14 (Estimating a mean, distribution) For the moment condition in Example 1.4 we have (assuming iid data)

$$\sqrt{T}(\hat{\mu}-\mu_0) \xrightarrow{d} N(0, \sigma^2), \text{ so } \hat{\mu} \approx N(\mu_0, \sigma^2/T).$$

Example 1.15 (OLS, distribution) For the moment conditions in Example 1.5,

$$V = \left(\Sigma_{xx} S_0^{-1} \Sigma_{xx}\right)^{-1}.$$

If $u_t$ is iid and independent of $x_t$, then $S_0 = \sigma^2\Sigma_{xx}$, so

$$V = \left[\Sigma_{xx}\left(\sigma^2\Sigma_{xx}\right)^{-1}\Sigma_{xx}\right]^{-1} = \sigma^2\Sigma_{xx}^{-1}.$$

Example 1.16 (Normal distribution, distribution of the estimates) For the moment conditions in Example 1.6 (assuming iid normally distributed data), the asymptotic covariance matrix of the estimated mean and variance is then ($\left(D_0'S_0^{-1}D_0\right)^{-1}$)

$$\left(\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -3\sigma^2 & 0 \\ 0 & -6\sigma^2 \end{bmatrix}'\begin{bmatrix} \sigma^2 & 0 & 3\sigma^4 & 0 \\ 0 & 2\sigma^4 & 0 & 12\sigma^6 \\ 3\sigma^4 & 0 & 15\sigma^6 & 0 \\ 0 & 12\sigma^6 & 0 & 96\sigma^8 \end{bmatrix}^{-1}\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -3\sigma^2 & 0 \\ 0 & -6\sigma^2 \end{bmatrix}\right)^{-1} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{bmatrix}.$$

In an overidentified model ($k < q$), we can test if the $k$ parameters make all $q$ moment conditions hold. Notice that under the null hypothesis (that the model is correctly specified)

$$\sqrt{T}\,\bar{g}(\beta_0) \xrightarrow{d} N\left(0_{q\times 1}, S_0\right), \tag{1.11}$$

where $q$ is the number of moment conditions. Since $\hat{\beta}$ is chosen in such a way that $k$ linear combinations of the moment conditions always (in every sample) are zero, there are effectively only $q-k$ non-degenerate random variables. We can therefore test the hypothesis that $\bar{g}(\beta_0) = 0$ by the J test:

$$T\,\bar{g}(\hat{\beta})'\,S_0^{-1}\,\bar{g}(\hat{\beta}) \xrightarrow{d} \chi^2_{q-k}, \text{ if } W = S_0^{-1}. \tag{1.12}$$

The left hand side equals $T$ times the value of the loss function in (1.6) evaluated at the point estimates. With no overidentifying restrictions ($q = k$) there are no restrictions to test. Indeed, the loss function value is then always zero at the point estimates.

Example 1.17 (Testing the moments of a normal distribution) Having estimated the mean and the variance, we can test if all four moment conditions in Example 1.6 hold. If data is drawn from a normal distribution, they should (give and take some randomness).

The distribution of the GMM estimates when we use a sub-optimal weighting matrix is similar to (1.10), but the variance-covariance matrix is different (basically reflecting the fact that the approach does not produce the lowest possible variances anymore).

Example 1.18 (Sub-optimal weighting of the normal distribution moments) The model in Example 1.6 is overidentified, since there are four moment conditions but only two parameters. Instead of using the optimal weighting matrix (the inverse of $S_0$ from Example 1.9, assuming the data is iid normally distributed), we could use any other (positive definite) $4\times 4$ matrix, for instance $W = I_4$ or a matrix that puts almost all weight on the first two moment conditions.

It can be shown that if we use another weighting matrix than $W = S_0^{-1}$, then the variance-covariance matrix in (1.10) should be changed to

$$V_2 = \left(D_0'WD_0\right)^{-1}D_0'W\,S_0\,WD_0\left(D_0'WD_0\right)^{-1}. \tag{1.13}$$

Similarly, the test of the overidentifying restrictions becomes

$$T\,\bar{g}(\hat{\beta})'\,\Psi_2^{+}\,\bar{g}(\hat{\beta}) \xrightarrow{d} \chi^2_{q-k}, \tag{1.14}$$

where $\Psi_2^{+}$ is the pseudo inverse of

$$\Psi_2 = \Lambda_2 S_0 \Lambda_2', \text{ where } \Lambda_2 = I_q - D_0\left(D_0'WD_0\right)^{-1}D_0'W. \tag{1.15}$$

Remark 1.19 (Quadratic form with degenerate covariance matrix) If the $n\times 1$ vector $X \sim N(0, \Psi)$, where $\Psi$ has rank $r \le n$, then $Y = X'\Psi^{+}X \sim \chi^2_r$, where $\Psi^{+}$ is the pseudo inverse of $\Psi$.

" # " #

1 2 0:02 0:06

AD , we have AC D :

3 6 0:04 0:12

Suppose we sidestep the whole optimization issue and instead specify $k$ linear combinations of the $q$ moment conditions directly:

$$0_{k\times 1} = \underset{k\times q}{A}\;\underset{q\times 1}{\bar{g}(\hat{\beta})}. \tag{1.16}$$

Example 1.21 (Specified combinations of the moments) The model in Example 1.6 is overidentified, since there are four moment conditions but only two parameters. One possible $A$ matrix would put all weight on the first two moment conditions:

$$A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$

It can be shown that the variance-covariance matrix in (1.10) should then be changed to

$$V_3 = (AD_0)^{-1}A\,S_0\,A'\left[(AD_0)^{-1}\right]'. \tag{1.17}$$

Similarly, in the test of overidentifying restrictions (1.14), we should replace $\Psi_2$ by

$$\Psi_3 = \Lambda_3 S_0 \Lambda_3', \text{ where } \Lambda_3 = I_q - D_0(AD_0)^{-1}A. \tag{1.18}$$

Example 1.22 (Estimating/testing a normal distribution) Continuing Example 1.21, we have that $AD_0$ in (1.17) is

$$AD_0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -3\sigma^2 & 0 \\ 0 & -6\sigma^2 \end{bmatrix} = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix},$$

so (1.17) gives

$$V_3 = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}^{-1}\begin{bmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{bmatrix}\begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}^{-1} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{bmatrix},$$

since $AS_0A'$ picks out the top left $2\times 2$ block of $S_0$.

Continuing the example, $\Lambda_3$ in (1.18) is

$$\Lambda_3 = I_4 - \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -3\sigma^2 & 0 \\ 0 & -6\sigma^2 \end{bmatrix}\begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}^{-1}\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -3\sigma^2 & 0 & 1 & 0 \\ 0 & -6\sigma^2 & 0 & 1 \end{bmatrix}.$$

$\Psi_3$ in (1.18) is therefore

$$\Psi_3 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -3\sigma^2 & 0 & 1 & 0 \\ 0 & -6\sigma^2 & 0 & 1 \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 & 3\sigma^4 & 0 \\ 0 & 2\sigma^4 & 0 & 12\sigma^6 \\ 3\sigma^4 & 0 & 15\sigma^6 & 0 \\ 0 & 12\sigma^6 & 0 & 96\sigma^8 \end{bmatrix}\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -3\sigma^2 & 0 & 1 & 0 \\ 0 & -6\sigma^2 & 0 & 1 \end{bmatrix}' = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 6\sigma^6 & 0 \\ 0 & 0 & 0 & 24\sigma^8 \end{bmatrix}.$$

Continuing the example, the test of the overidentifying restrictions (1.14) (assuming iid normally distributed data to calculate $S_0$) is (notice the generalized inverse of $\Psi_3$)

$$T\begin{bmatrix} 0 \\ 0 \\ \sum_{t=1}^{T}(x_t-\mu)^3/T \\ \sum_{t=1}^{T}\left[(x_t-\mu)^4-3\sigma^4\right]/T \end{bmatrix}'\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1/(6\sigma^6) & 0 \\ 0 & 0 & 0 & 1/(24\sigma^8) \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ \sum_{t=1}^{T}(x_t-\mu)^3/T \\ \sum_{t=1}^{T}\left[(x_t-\mu)^4-3\sigma^4\right]/T \end{bmatrix}$$

$$= T\,\frac{\left[\sum_{t=1}^{T}(x_t-\mu)^3/T\right]^2}{6\sigma^6} + T\,\frac{\left[\sum_{t=1}^{T}\left((x_t-\mu)^4-3\sigma^4\right)/T\right]^2}{24\sigma^8}.$$

When we replace $\mu$ and $\sigma$ by their estimates, this is the same as the Jarque-Bera test of normality.
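Numerically, the statistic is just two sample moments plugged into the expression above. A minimal sketch (my code), useful for checking the Jarque-Bera connection on simulated data:

```python
import numpy as np

def jarque_bera(x):
    """JB statistic, written as the GMM quadratic form above."""
    x = np.asarray(x, dtype=float)
    T = x.size
    mu, s2 = x.mean(), x.var()
    g3 = np.mean((x - mu)**3)                   # third moment condition
    g4 = np.mean((x - mu)**4 - 3 * s2**2)       # fourth moment condition
    return T * (g3**2 / (6 * s2**3) + g4**2 / (24 * s2**4))

rng = np.random.default_rng(0)
print(jarque_bera(rng.standard_normal(500)))    # ~ chi2(2) under normality
```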

Let $R_t$ be a vector of net returns of $N$ assets. We want to estimate the mean vector and the covariance matrix. The moment conditions for the mean vector are

$$\mathrm{E}\,R_t - \mu = 0_{N\times 1}, \tag{1.19}$$

and the moment conditions for the unique elements of the second moment matrix are

$$\mathrm{E}\,\mathrm{vech}(R_tR_t') - \mathrm{vech}(\Lambda) = 0_{N(N+1)/2\times 1}. \tag{1.20}$$

Remark 1.25 (The vech operator) vech($A$), where $A$ is $m\times m$, gives an $m(m+1)/2\times 1$ vector with the elements on and below the principal diagonal stacked on top of each other (column wise). For instance,

$$\mathrm{vech}\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11} \\ a_{21} \\ a_{22} \end{bmatrix}.$$

Stack (1.19) and (1.20) and substitute the sample mean for the population expectation to get the GMM estimator

$$\frac{1}{T}\sum_{t=1}^{T}\begin{bmatrix} R_t \\ \mathrm{vech}(R_tR_t') \end{bmatrix} - \begin{bmatrix} \hat{\mu} \\ \mathrm{vech}(\hat{\Lambda}) \end{bmatrix} = \begin{bmatrix} 0_{N\times 1} \\ 0_{N(N+1)/2\times 1} \end{bmatrix}. \tag{1.21}$$

In this case, $D_0 = -I$, so the covariance matrix of the parameter vector ($\hat{\mu}$, vech($\hat{\Lambda}$)) is just $S_0$ (defined in (1.7)), which is straightforward to estimate.
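A sketch of (1.21) in code (mine, not from the notes; the appendix items B.1–B.2 discuss coding of related GMM estimations). It reuses the newey_west function from Section 1.1 to estimate S_0, which here is also the covariance matrix of √T times the estimation errors.

```python
import numpy as np

def mean_and_second_moments(R, nw_lags=5):
    """GMM point estimates as in (1.21) and their covariance (S0/T, since D0 = -I)."""
    T, N = R.shape
    mu = R.mean(axis=0)
    il = np.tril_indices(N)                        # vech: on/below the diagonal
    RR = np.einsum('ti,tj->tij', R, R)             # outer products R_t R_t'
    Lam = RR.mean(axis=0)
    g = np.column_stack([R - mu, RR[:, il[0], il[1]] - Lam[il]])  # T x q
    S0 = newey_west(g, nw_lags)
    return mu, Lam[il], S0 / T                     # Cov of (mu_hat, vech(Lam_hat))
```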

1.3 MLE

Let $L$ be the likelihood function of a sample, defined as the joint density of the sample,

$$L = \mathrm{pdf}(x_1, x_2, \ldots, x_T; \theta) \tag{1.22}$$
$$= L_1 L_2 \ldots L_T, \tag{1.23}$$

where $\theta$ are the parameters of the density function. In the second line, we define the likelihood function as the product of the likelihood contributions of the different observations. For notational convenience, their dependence on the data and the parameters is suppressed.

The idea of MLE is to pick the parameters to make the likelihood (or its log) value as large as possible:

$$\hat{\theta} = \arg\max_{\theta}\,\ln L. \tag{1.24}$$

Under standard regularity conditions, the MLE is asymptotically normal,

$$\sqrt{T}(\hat{\theta}-\theta_0) \xrightarrow{d} N(0, V), \text{ where } V = I(\theta)^{-1}, \text{ with} \tag{1.25}$$
$$I(\theta) = -\mathrm{E}\,\frac{\partial^2 \ln L}{\partial\theta\partial\theta'}/T \quad\text{or}\quad = -\mathrm{E}\,\frac{\partial^2 \ln L_t}{\partial\theta\partial\theta'},$$

where $I(\theta)$ is the information matrix. In the second line, the derivative is of the whole log likelihood function (1.22), while in the third line the derivative is of the likelihood contribution of observation $t$.

Alternatively, we can use the outer product of the gradients to calculate the information matrix as

$$J(\theta) = \mathrm{E}\left[\frac{\partial \ln L_t}{\partial\theta}\,\frac{\partial \ln L_t}{\partial\theta'}\right]. \tag{1.26}$$

A key strength of MLE is that it is asymptotically efficient, that is, any linear combination of the parameters will have a smaller asymptotic variance than if we had used any other estimation method.

1.3.2 QMLE

An MLE based on the wrong likelihood function (distribution) may still be useful. Suppose we use the likelihood function $L$, so the estimator is defined by

$$\frac{\partial \ln L}{\partial\theta} = 0. \tag{1.27}$$

If this is the wrong likelihood function, but the expected value (under the true distribution) of $\partial \ln L/\partial\theta$ is indeed zero (at the true parameter values), then we can think of (1.27) as a set of GMM moment conditions, and the usual GMM results apply. The result is that this quasi-MLE (or pseudo-MLE) has the same sort of distribution as in (1.25), but with the variance-covariance matrix

$$V = I(\theta)^{-1}J(\theta)I(\theta)^{-1}. \tag{1.28}$$

Example 1.26 (LS and QMLE) In a linear regression, $y_t = x_t'\beta + \varepsilon_t$, the first order condition for MLE based on the assumption that $\varepsilon_t \sim N(0, \sigma^2)$ is $\sum_{t=1}^{T}(y_t - x_t'\hat{\beta})x_t = 0$. This has an expected value of zero (at the true parameters), even if the shocks have, say, a $t_{22}$ distribution.

As an example, suppose $x_t$ is iid $N(0, \sigma^2)$. The pdf of $x_t$ is

$$\mathrm{pdf}(x_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\frac{x_t^2}{\sigma^2}\right). \tag{1.29}$$

The likelihood function of the sample is therefore

$$L = (2\pi\sigma^2)^{-T/2}\exp\left(-\frac{1}{2}\sum_{t=1}^{T}\frac{x_t^2}{\sigma^2}\right), \text{ so} \tag{1.30}$$
$$\ln L = -\frac{T}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}x_t^2. \tag{1.31}$$

The first order condition for an optimum is

$$\frac{\partial \ln L}{\partial\sigma^2} = -\frac{T}{2}\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{t=1}^{T}x_t^2 = 0, \text{ so } \hat{\sigma}^2 = \sum_{t=1}^{T}x_t^2/T. \tag{1.32}$$

Differentiate the log likelihood once again to get

$$\frac{\partial^2 \ln L}{\partial\sigma^2\partial\sigma^2} = \frac{T}{2}\frac{1}{(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{t=1}^{T}x_t^2, \text{ so} \tag{1.33}$$
$$\mathrm{E}\,\frac{\partial^2 \ln L}{\partial\sigma^2\partial\sigma^2} = \frac{T}{2}\frac{1}{(\sigma^2)^2} - \frac{T\sigma^2}{(\sigma^2)^3} = -\frac{T}{2\sigma^4}. \tag{1.34}$$

The information matrix is therefore

$$I(\theta) = -\mathrm{E}\,\frac{\partial^2 \ln L}{\partial\sigma^2\partial\sigma^2}/T = \frac{1}{2\sigma^4}, \tag{1.35}$$

so we have

$$\sqrt{T}(\hat{\sigma}^2 - \sigma^2) \xrightarrow{d} N(0, 2\sigma^4). \tag{1.36}$$
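A quick simulation check of (1.36) (my sketch): across many samples, the variance of σ̂² should be close to 2σ⁴/T.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, T, nsim = 2.0, 200, 20_000
x = np.sqrt(sigma2) * rng.standard_normal((nsim, T))
s2_hat = (x**2).mean(axis=1)             # MLE of sigma^2 in each sample
print(s2_hat.var() * T, 2 * sigma2**2)   # both should be close to 8
```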

1.4 Testing (Linear) Joint Hypotheses

Consider an estimator $\hat{\beta}_{k\times 1}$ which satisfies

$$\sqrt{T}(\hat{\beta}-\beta_0) \xrightarrow{d} N(0, V_{k\times k}), \tag{1.37}$$

and suppose we want to test $q$ linear restrictions on $\beta_0$, collected in

$$\gamma_{q\times 1} = R\beta, \tag{1.38}$$

where $R$ is a $q\times k$ matrix. Under the null hypothesis (that $\gamma_0 = R\beta_0$),

$$\sqrt{T}(R\hat{\beta}-\gamma_0) \xrightarrow{d} N(0, \Lambda_{q\times q}), \text{ where } \Lambda = RVR'. \tag{1.39}$$

Example 1.27 (Testing 2 slope coefficients) Suppose we have estimated a model with three coefficients and the null hypothesis is

$$H_0: \beta_1 = 1 \text{ and } \beta_3 = 0, \text{ that is, } \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$

The test of the joint hypothesis is based on

$$T(R\hat{\beta}-\gamma_0)'\,\Lambda^{-1}\,(R\hat{\beta}-\gamma_0) \xrightarrow{d} \chi^2_q. \tag{1.40}$$

1.5 Testing (Nonlinear) Joint Hypotheses: The Delta Method

Consider an estimator $\hat{\beta}_{k\times 1}$ which satisfies

$$\sqrt{T}(\hat{\beta}-\beta_0) \xrightarrow{d} N(0, V_{k\times k}), \tag{1.41}$$

and suppose we want to test $q$ (possibly nonlinear) restrictions $f(\beta_0) = 0_{q\times 1}$, where $f(\cdot)$ has continuous first derivatives. Under the null hypothesis (that $\beta = \beta_0$),

$$\sqrt{T}\left[f(\hat{\beta}) - f(\beta_0)\right] \xrightarrow{d} N(0, \Lambda_{q\times q}), \text{ where} \tag{1.42}$$
$$\Lambda = \frac{\partial f(\beta_0)}{\partial\beta'}\,V\,\frac{\partial f(\beta_0)'}{\partial\beta}, \text{ where} \tag{1.43}$$

$$\frac{\partial f(\beta)}{\partial\beta'} = \begin{bmatrix} \partial f_1(\beta)/\partial\beta_1 & \cdots & \partial f_1(\beta)/\partial\beta_k \\ \vdots & \ddots & \vdots \\ \partial f_q(\beta)/\partial\beta_1 & \cdots & \partial f_q(\beta)/\partial\beta_k \end{bmatrix}_{q\times k}$$

can be used. Now, a test can be done in the same way as in (1.40).

Example 1.28 (Testing a Sharpe ratio) Stack the mean ($\mu = \mathrm{E}\,x_t$) and the second moment ($\mu_2 = \mathrm{E}\,x_t^2$) as $\beta = (\mu, \mu_2)'$. The Sharpe ratio is calculated as a function of $\beta$,

$$\frac{\mathrm{E}(x)}{\sigma(x)} = f(\beta) = \frac{\mu}{(\mu_2-\mu^2)^{1/2}}, \text{ so } \frac{\partial f(\beta)}{\partial\beta'} = \begin{bmatrix} \dfrac{\mu_2}{(\mu_2-\mu^2)^{3/2}} & \dfrac{-\mu}{2(\mu_2-\mu^2)^{3/2}} \end{bmatrix}.$$

Example 1.29 (Linear function) When $f(\beta) = R\beta$, then the Jacobian is $\partial f(\beta)/\partial\beta' = R$, so $\Lambda = RVR'$, just like in (1.39).
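Example 1.28 translates into a few lines of code. The sketch below (mine) assumes iid data, so S_0 is the sample covariance of the two moment conditions; with autocorrelated data, substitute a Newey-West estimate.

```python
import numpy as np

def sharpe_ratio_se(x):
    """Point estimate and delta-method std error of E(x)/sigma(x)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    mu, mu2 = x.mean(), (x**2).mean()
    v = mu2 - mu**2
    sr = mu / np.sqrt(v)
    # Jacobian of f(mu, mu2) = mu*(mu2 - mu^2)^(-1/2), as in Example 1.28
    J = np.array([mu2 / v**1.5, -mu / (2 * v**1.5)])
    g = np.column_stack([x - mu, x**2 - mu2])      # moment conditions
    S0 = g.T @ g / T                               # iid case; NW otherwise
    var_sr = J @ S0 @ J / T                        # Lambda/T from (1.43)
    return sr, np.sqrt(var_sr)

rng = np.random.default_rng(0)
print(sharpe_ratio_se(0.05 + 0.2 * rng.standard_normal(600)))
```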

Example 1.30 (Testing a correlation) Suppose we want to test the correlation of $x_t$ and $y_t$; to simplify, assume that both variables have zero means. The variances and the covariance can then be estimated by the moment conditions

$$\sum_{t=1}^{T} g_t(\beta)/T = 0_{3\times 1}, \text{ where } g_t = \begin{bmatrix} x_t^2 - \sigma_{xx} \\ y_t^2 - \sigma_{yy} \\ x_ty_t - \sigma_{xy} \end{bmatrix} \text{ and } \beta = \begin{bmatrix} \sigma_{xx} \\ \sigma_{yy} \\ \sigma_{xy} \end{bmatrix}.$$

The covariance matrix of these estimators is estimated as usual in GMM, making sure to account for autocorrelation of the data. The correlation is a simple function of these parameters,

$$\rho(x,y) = f(\beta) = \frac{\sigma_{xy}}{\sigma_{xx}^{1/2}\sigma_{yy}^{1/2}}, \text{ so } \frac{\partial f(\beta)}{\partial\beta'} = \begin{bmatrix} -\dfrac{\sigma_{xy}}{2\sigma_{xx}^{3/2}\sigma_{yy}^{1/2}} & -\dfrac{\sigma_{xy}}{2\sigma_{xx}^{1/2}\sigma_{yy}^{3/2}} & \dfrac{1}{\sigma_{xx}^{1/2}\sigma_{yy}^{1/2}} \end{bmatrix}.$$

In practice, the Jacobian in the delta method is often calculated numerically. One way is a forward difference,

$$\begin{bmatrix} \partial f_1(\beta)/\partial\beta_j \\ \vdots \\ \partial f_q(\beta)/\partial\beta_j \end{bmatrix} = \frac{f(\tilde{\beta}) - f(\beta)}{\Delta}, \text{ where } \tilde{\beta} = \beta \text{ except that } \tilde{\beta}_j = \beta_j + \Delta.$$

Another is a central difference,

$$\begin{bmatrix} \partial f_1(\beta)/\partial\beta_j \\ \vdots \\ \partial f_q(\beta)/\partial\beta_j \end{bmatrix} = \frac{f(\tilde{\beta}) - f(\bar{\beta})}{\Delta}, \text{ where } \tilde{\beta}_j = \beta_j + \Delta/2 \text{ and } \bar{\beta}_j = \beta_j - \Delta/2 \text{ (other elements as in } \beta\text{)}.$$
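Both differences are easy to code; a sketch (mine; the step Δ is fixed here, while in practice it is usually scaled by the size of β_j):

```python
import numpy as np

def num_jacobian(f, beta, delta=1e-6, central=True):
    """Numerical q x k Jacobian of f at beta (forward or central difference)."""
    beta = np.asarray(beta, dtype=float)
    f0 = np.asarray(f(beta))
    J = np.empty((f0.size, beta.size))
    for j in range(beta.size):
        up, dn = beta.copy(), beta.copy()
        if central:
            up[j] += delta / 2
            dn[j] -= delta / 2
            J[:, j] = (np.asarray(f(up)) - np.asarray(f(dn))) / delta
        else:
            up[j] += delta
            J[:, j] = (np.asarray(f(up)) - f0) / delta
    return J

# check against the analytical Jacobian in Example 1.28
f = lambda b: np.array([b[0] / np.sqrt(b[1] - b[0]**2)])
print(num_jacobian(f, [0.05, 0.04]))
```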

1.5.1 GMM & Delta Method Example 1: Confidence Bands around a Mean-Variance

Frontier

of the means and the second moment matrix estimated by (1.21). It is therefore straight-

forward to apply the delta method to calculate a confidence band around the estimate.

Figure 1.2 shows some empirical results. The uncertainty is lowest for the minimum

variance portfolio (in a normal distribution, the uncertainty about an estimated variance is

p

increasing in the true variance, Var. T O 2 / D 2 4 ).

21

[Figure 1.2: Mean-Std frontier of US industry portfolios (A–J), 1947:1–2013:12. Portfolio means and standard deviations (%): A 12.59/14.06, B 12.53/20.87, C 12.38/16.64, D 13.96/18.06, E 13.27/21.46, F 10.68/14.81, G 12.51/16.78, H 13.50/16.90, I 10.84/13.20, J 11.62/17.45. The lower panel compares with the 1/N portfolio: SR(tangency) = 0.75, SR(EW) = 0.59, t-stat of difference = 1.87. Monthly returns are used in the calculations, but the figure shows annualized percentages (mean and std scaled to annual units).]

1.5.2 GMM & Delta Method Example 2: Testing the 1=N vs the Tangency Portfolio

It has been argued that the (naive) 1=N diversification gives a portfolio performance

which is not worse than an optimal portfolio. One way of testing this is to compare the

the Sharpe ratios of the tangency and equally weighted portfolios. Both are functions of

the first and second moments of the basic assets, so a delta method approach similar to

the one for the MV frontier (see above) can be applied. Notice that this approach should

incorporate the way (and hence the associated uncertainty) the first and second moments

affect the portfolio weights of the tangency portfolio.

Figure 1.2 shows some empirical results.

22

Bibliography

Andrews, D. W. K., and J. C. Monahan, 1992, "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator," Econometrica, 60, 953–966.

Cochrane, J. H., 2005, Asset Pricing, Princeton University Press, Princeton, New Jersey, revised edn.

DeMiguel, V., L. Garlappi, and R. Uppal, 2009, "Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy?," Review of Financial Studies, 22, 1915–1953.

Singleton, K. J., 2006, Empirical Dynamic Asset Pricing, Princeton University Press.

A Statistical Tables

          Critical values
n         10%    5%     1%
10        1.81   2.23   3.17
20        1.72   2.09   2.85
30        1.70   2.04   2.75
40        1.68   2.02   2.70
50        1.68   2.01   2.68
60        1.67   2.00   2.66
70        1.67   1.99   2.65
80        1.66   1.99   2.64
90        1.66   1.99   2.63
100       1.66   1.98   2.63
Normal    1.64   1.96   2.58

Table A.1: Critical values (two-sided test) of the t distribution (different degrees of freedom, n) and of the normal distribution.

          Critical values
n         10%     5%      1%
1         2.71    3.84    6.63
2         4.61    5.99    9.21
3         6.25    7.81    11.34
4         7.78    9.49    13.28
5         9.24    11.07   15.09
6         10.64   12.59   16.81
7         12.02   14.07   18.48
8         13.36   15.51   20.09
9         14.68   16.92   21.67
10        15.99   18.31   23.21

Table A.2: Critical values of the chi-square distribution (different degrees of freedom, n).

B Data Sources

The data used in these lecture notes are from the following sources:

1. Website of Kenneth French, http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

2. Datastream

6. OlsenData, http://www.olsendata.com

2 Basic Asset Pricing Theory

References: Back (2010), Cochrane (2005) and Pennacchi (2008)

Remark 2.1 (Complete markets) Markets are complete if there are sufficiently many assets so that you can hedge against any possible outcome. For instance, in a binomial model (where the stock price can jump up or down), two assets are required: a stock and a bond (or a stock and an option, or ...).

Remark 2.2 (Law of one price) The price of a portfolio is the portfolio of the prices. This rules out trivial arbitrage.

Remark 2.3 (No arbitrage) Every asset whose payoff is always nonnegative and sometimes positive has a positive price. This rules out a free lunch: you cannot get something for nothing.

Remark 2.4 (Law of iterated expectations) The law of iterated expectations implies that (using subscripts to denote time)

$$\mathrm{E}_t(x_{t+1}\,\mathrm{E}_{t+1}\,y_{t+2}) = \mathrm{E}_t\,x_{t+1}y_{t+2}.$$

To see why, let $y_{t+2} = \mathrm{E}_{t+1}y_{t+2} + \varepsilon_{t+2}$, so $\varepsilon_{t+2}$ is the surprise. The left hand side can then be written $\mathrm{E}_t\,x_{t+1}(y_{t+2} - \varepsilon_{t+2})$. Rationality of the expectation requires that $\mathrm{E}_t(x_{t+1}\varepsilon_{t+2}) = 0$, that is, $\varepsilon_{t+2}$ is not correlated with $x_{t+1}$: information in $t+1$ cannot predict a surprise in $t+2$. As a special case, $x_{t+1}$ could be a (known) constant.

26

2.1.1 Three Asset Pricing Principles

There are three main ways of pricing financial assets. First, by replication. If the payoff of asset $i$ ($x_i$) is a portfolio (linear combination) of the payoffs of assets $a$ and $b$,

$$x_i = \theta_a x_a + \theta_b x_b, \tag{2.1}$$

then the price of asset $i$ must be the same linear combination of the prices,

$$P_i = \theta_a P_a + \theta_b P_b. \tag{2.2}$$

Example 2.5 (Binomial model) Assume a stock will be worth either 9.5 (low state) or 11 (high state), and that there is a bond that always pays off 1. A call option (with strike price 10) on the stock will be worth either 0 or 1. You can replicate this option by holding 2/3 of the stock and −19/3 of the bonds, since this gives 0 in the low state and 1 in the high state.
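The replication argument is easy to verify numerically; a sketch with the numbers of Example 2.5 (my code):

```python
# payoffs in the (low, high) states
stock = (9.5, 11.0)
bond = (1.0, 1.0)
call = (0.0, 1.0)                  # strike price 10

a, b = 2 / 3, -19 / 3              # holdings of the stock and the bond
for s in (0, 1):
    assert abs(a * stock[s] + b * bond[s] - call[s]) < 1e-12
# hence, by (2.1)-(2.2): P_call = a*P_stock + b*P_bond
```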

The second way is by a stochastic discount factor (SDF, also called a pricing kernel), which is a variable ($m$) such that

$$P_i = \mathrm{E}(m x_i), \tag{2.3}$$

where $x_i$ is the (future) payoff and $P_i$ is today's price. All time subscripts are dropped in order to save ink. The expectation should be interpreted as being conditional on the information available today. Notice that it is the same $m$ that prices all assets.

The third way is by using the discounted risk-neutral expected payoff,

$$P_i = \frac{1}{R_f}\,\mathrm{E}^* x_i, \tag{2.4}$$

where $R_f$ is the riskfree gross rate and $\mathrm{E}^* x_i$ is the expected value of $x_i$ according to the risk-neutral distribution. This distribution is best thought of as a theoretical construction, which cannot be directly observed from data (more details later).

Remark 2.6 (Black-Scholes) Consider a European call option with a strike price $K$. The payoff at expiration is $\max(0, P_{i,1} - K)$, where $P_{i,1}$ is the price of the underlying asset at expiration. If the return of the underlying asset follows a continuous time random walk with normally distributed shocks, then a dynamically rebalanced portfolio of the underlying and a safe asset replicates the call option, so the call option price must equal the price of the portfolio. Alternatively, the call option price equals $\mathrm{E}\left[m\max(0, P_{i,1}-K)\right]$ and $\mathrm{E}^*\left[\max(0, P_{i,1}-K)\right]/R_f$. In each case, the explicit solution of the call option price is the Black-Scholes formula.

Remark 2.7 (Bond pricing) The price of an $n$-period zero-coupon bond equals the cross-moment between the pricing kernel ($m$) and the value of the same bond next period (then an $n-1$ period bond),

$$P_{n,0} = \mathrm{E}(m P_{n-1,1}) = \frac{1}{R_f}\,\mathrm{E}^* P_{n-1,1}.$$

Applying (2.3) to a gross return $R_i$ gives

$$\mathrm{E}(m R_i) = 1. \tag{2.5}$$

Here $R_i$ is the gross return. (Another way to think about this: $R_i$ is the payoff of an asset whose price today is 1.)

For a riskfree asset (which is not correlated with the SDF) we get

$$\mathrm{E}\,m\,\mathrm{E}\,R_f = 1, \text{ so } R_f = 1/\mathrm{E}\,m. \tag{2.6}$$

Remark 2.8 (Risk and asset prices) Using (2.6) in (2.3) gives

$$P_i = \frac{\mathrm{E}\,x_i}{R_f} + \mathrm{Cov}(m, x_i).$$

This says that the price equals the expected discounted payoff plus a risk adjustment. Idiosyncratic risk is not compensated (priced). For instance, with $\mathrm{E}\,x_i = 25$, $R_f = 1.1$ and $\mathrm{Cov}(m, x_i) = -2$,

$$P_i = \frac{25}{1.1} - 2 \approx 22.7 - 2 = 20.7.$$

The investor is only willing to pay 20.7 for an asset with an expected present value of 22.7: this is considered to be a risky asset.

Combining (2.5) and (2.6) gives that an excess return should satisfy

$$\mathrm{E}(m R_i^e) = 0. \tag{2.7}$$

Rewrite (2.7) to get the risk premium

$$\mathrm{E}\,R_i^e = -\frac{\mathrm{Cov}(m, R_i^e)}{\mathrm{E}\,m} = -R_f\,\mathrm{Cov}(m, R_i^e). \tag{2.8}$$

The risk premium is driven by the systematic risk (covariance with the SDF). Idiosyncratic volatility does not matter for pricing (it can be diversified away).

Example 2.9 Using $R_f = 1.1$ and $\mathrm{Cov}(m, R_i^e) = -0.1$ in (2.8) gives the risk premium $\mathrm{E}\,R_i^e = 0.11$.

Divide (2.8) by $\sigma(R_i^e)$ to get the Sharpe ratio,

$$\frac{\mathrm{E}\,R_i^e}{\sigma(R_i^e)} = -\mathrm{Corr}(m, R_i^e)\,\frac{\sigma(m)}{\mathrm{E}\,m}. \tag{2.9}$$

Since $-1 \le \mathrm{Corr}(m, R_i^e) \le 1$, this means (take the absolute value of both sides, notice that $|\mathrm{Corr}(m, R_i^e)| \le 1$, and rearrange)

$$\frac{\sigma(m)}{\mathrm{E}(m)} \ge \frac{|\mathrm{E}\,R_i^e|}{\sigma(R_i^e)}. \tag{2.10}$$

Remark 2.10 (A simple test of an SDF model) Find the highest $\mathrm{E}\,R_i^e/\sigma(R_i^e)$ from a set of assets, and check if $\sigma(m)/\mathrm{E}\,m$ (from your model) is higher. See Figure 2.1.

For an asset whose return is perfectly (positively or negatively) correlated with the SDF, (2.9) holds with equality, so $\mathrm{E}\,R_i$ is

$$\mathrm{E}\,R_i = \begin{cases} R_f + \dfrac{\sigma(m)}{\mathrm{E}(m)}\,\sigma(R_i) & \text{if } \mathrm{E}\,R_i \ge R_f \\[1ex] R_f - \dfrac{\sigma(m)}{\mathrm{E}(m)}\,\sigma(R_i) & \text{if } \mathrm{E}\,R_i \le R_f, \end{cases}$$

which gives the two lines in Figure 2.2 (draw $\mathrm{E}\,R_i$ as a function of $\sigma(R_i)$). Use $1/\mathrm{E}(m) = R_f$.

Consider an investor who maximizes expected utility from consuming now and next period,

$$u(C_0) + \delta\,\mathrm{E}\,u(C_1), \tag{2.11}$$

subject to

$$C_0 + \textstyle\sum_{i=1}^{n}\theta_i P_i = W_0 \quad\text{and} \tag{2.12}$$
$$C_1 = y_1 + \textstyle\sum_{i=1}^{n}\theta_i x_i. \tag{2.13}$$

[Figure 2.1: Comparison of σ(m)/E m for the SDF model m = 0.99(C_t/C_{t-1})^{-γ} (plotted against γ) with the maximum Sharpe ratio in US data 1970:1–2013:12, monthly returns.]

[Figure 2.2: Mean-variance restrictions implied by the SDF: E R plotted against σ(R); two lines through (0, R_f) with slopes ±σ(m)/E m, with all assets between them.]

The first line defines expected utility from consuming now (period 0) and later (period 1). Notice: the subscripts on ($C_0, C_1, y_1, W_0$) are here used to indicate time. However, following the previous analysis we take it for granted that the price of asset $i$ ($P_i$) refers to period 0 and the payoff ($x_i$) to period 1. The second line is the budget restriction today: consumption plus asset purchases ($\theta_i$ assets of type $i$, each at the price $P_i$) equals today's wealth. The third line defines what consumption in period 1 will be: any income $y_1$ plus the payoffs from the assets ($x_i$ from asset $i$, of which we own $\theta_i$).

To solve the optimization problem, substitute for $C_1$ and define the Lagrangian

$$\mathcal{L} = u(C_0) + \delta\,\mathrm{E}\,u\!\left(y_1 + \textstyle\sum_{i=1}^{n}\theta_i x_i\right) + \lambda\left(W_0 - C_0 - \textstyle\sum_{i=1}^{n}\theta_i P_i\right).$$

The first order conditions are

$$\text{wrt } C_0: \quad u'(C_0) = \lambda,$$
$$\text{wrt } \theta_i: \quad \delta\,\mathrm{E}\left[u'(C_1)\,x_i\right] = \lambda P_i \text{ for } i = 1,\ldots,n. \tag{2.16}$$

Combine to get

$$\mathrm{E}(m_1 x_i) = P_i, \text{ where } m_1 = \delta\,\frac{u'(C_1)}{u'(C_0)}. \tag{2.18}$$

Apply this to a riskfree asset, $x_{i,1} = 1$ and $P_i = 1/R_f$:

$$\delta\,\mathrm{E}\,\frac{u'(C_1)}{u'(C_0)} = \frac{1}{R_f}, \text{ so } R_f = \frac{1}{\delta}\,\frac{u'(C_0)}{\mathrm{E}\,u'(C_1)}. \tag{2.19}$$

With CRRA utility, the SDF is

$$m_1 = \delta\left(\frac{C_1}{C_0}\right)^{-\gamma}. \tag{2.20}$$

Example 2.12 (Riskfree rate) With CRRA as in (2.20) with $\gamma = 3$ and $\delta = 0.95$, and $(C_0, C_1) = (1.95, 2)$, we get (assuming no uncertainty)

$$R_f = \frac{1}{0.95}\left(\frac{2}{1.95}\right)^3 \approx 1.14,$$

so the net interest rate is approximately 14%. Instead, with $C_1 = 2.1$ we get

$$R_f = \frac{1}{0.95}\left(\frac{2.1}{1.95}\right)^3 \approx 1.31,$$

so the net rate is approximately 31%. The reason is that if the future is bright, then investors want a large compensation for saving (rather, they would like to borrow).

Example 2.13 (The SDF as a function of consumption) With CRRA as in (2.20) with $\gamma = 3$ and $\delta = 0.95$, and $(C_0, C_1) = (1.95, 2)$, we get

$$m_1 = 0.95\left(\frac{2}{1.95}\right)^{-3} \approx 0.88.$$

Instead, with $C_1 = 2.1$ we get

$$m_1 = 0.95\left(\frac{2.1}{1.95}\right)^{-3} \approx 0.76,$$

so the ratio is random (since $C_1$ is) and moves inversely with $C_1$.

[Figure 2.3: Utility function with tangents, and marginal utility, as functions of consumption.]

Notice that $u'(C_1)$, and therefore $m$ in (2.18), moves inversely with $C_1$; see Figure 2.3. Approximate $m_1 \approx a - bC_1$, where $b > 0$. Using this in (2.8) gives

$$\mathrm{E}\,R_i^e = R_f\,b\,\mathrm{Cov}(C_1, R_i^e).$$

Procyclical assets have high expected returns (as they pay off when marginal utility is low). The reason why risky assets have high risk premia is, of course, that otherwise no one would like to buy those assets.

In CAPM, the SDF is (in effect) an affine function of the market excess return,

$$m = a - bR_m^e, \text{ with } b > 0.$$

This would, for instance, be the case in a Lucas model where consumption equals the market return and the utility function is quadratic.

Remark 2.15 (Deriving CAPM from marginal utility) If utility depends on the portfolio return $R_p$ and is of CARA form, $u(R_p) = -\exp(-kR_p)$, where $k$ is a measure of risk aversion, then the optimal portfolio will be on the mean-variance frontier. In equilibrium, CAPM holds.

Remark 2.16 (The equity premium puzzle, log-normal version) With CRRA (2.20), equation (2.5) says that the average gross return of an asset should be determined by

$$\mathrm{E}\left[\delta\left(\frac{C_1}{C_0}\right)^{-\gamma} R_i\right] = 1, \text{ that is, } \mathrm{E}\exp(\ln\delta - \gamma\Delta c + r_i) = 1,$$

where $\Delta c = \ln(C_1/C_0)$ and $r_i = \ln R_i$. If $\Delta c$ and $r_i$ are jointly (and indeed) normally distributed, then the previous equation can be written

$$\mathrm{E}\,r_i = -\ln\delta + \gamma\,\mathrm{E}\,\Delta c - \left(\gamma^2\sigma_{cc}/2 + \sigma_{rr}/2 - \gamma\sigma_{cr}\right),$$

where $\sigma_{cr} = \mathrm{Cov}(\Delta c, r_i)$. For a riskfree asset this simplifies to

$$r_f = -\ln\delta + \gamma\,\mathrm{E}\,\Delta c - \gamma^2\sigma_{cc}/2,$$

so the risk premium is $\mathrm{E}\,r_i + \sigma_{rr}/2 - r_f = \gamma\sigma_{cr}$. The equity premium puzzle (Mehra and Prescott (1985)) is that the covariance is too small to explain the historical average risk premium on US equity.

2.2.2 Where Do SDFs Come from? Version 2: Law of One Price or No Arbitrage

We can (although it is a bit involved) prove the following results. In comparison with the optimization approach, they are very general: they rely on weak assumptions and tell us whether there is an SDF. On the other hand, the results say very little about what the SDF is.

If markets are complete, then there is only one SDF. You may derive it from MV assets or macro series or whatever, but you get the same SDF.

Instead, if markets are incomplete (the more realistic case), then we can still prove something about the existence of an SDF:

- The law of one price and incomplete markets imply that there exists an SDF, and it can be written as a linear function of available assets. However, there may be several alternative SDFs, for instance, written in terms of macro variables. They need not be the same, but they should deliver the same asset prices.

- No arbitrage and incomplete markets imply that there exists at least one positive SDF. There may be other SDFs, and some of them can take on negative values (in some states of the world).

(SDFs that can have negative realizations are problematic, since they can assign negative prices to some derivatives, which we want to avoid. Notice that the optimization based approach creates SDFs that are always positive, provided marginal utilities are.)
A beta pricing model says that the average returns for any asset are

E Ri D Rf C i0 ; (2.22)

where are factor risk premia and i are the regression coefficients from

When there is a single factor which is an excess return, then we can apply (2.22) to

f t and notice that i D 1 (regressing f t on itself gives a slope coefficient equal to one).

[Figure 2.4: Fit of CAPM ($R_i = \alpha_i + \beta_i R_m + \ldots$, $\mathrm{E}\,R_i = \beta_i\lambda$) and of a two-factor model ($R_i = \alpha_i + \beta_{i1}R_m + \beta_{i2}R_m^2 + \ldots$, $\mathrm{E}\,R_i = \beta_{i1}\lambda_1 + \beta_{i2}\lambda_2$, equivalently the SDF $m = 1 + b'(f - \mathrm{E}\,f)$): mean excess returns vs predicted mean excess returns (%), 25 FF portfolios (B/M and size).]

[Figure 2.5: Mean-Std plot of the portfolios; the market return is marked by *.]

Therefore,

$$\mathrm{E}\,f = R_f + \lambda, \text{ so } \lambda = \mathrm{E}\,f - R_f. \tag{2.24}$$

This is a way to identify the factor risk premia, and it also holds when there are several factors. CAPM is a one-factor model, while the Fama-French model is a three-factor model. In CAPM the factor is an excess return, since it is the return of a long-short portfolio, that is, the difference between the returns of the market portfolio and another portfolio (a riskfree asset). In the Fama and French (1993) model, the factors are also excess returns, since they are also based on long-short portfolios (market minus riskfree, value stocks minus growth stocks, small stocks minus big stocks). Carhart (1997) adds a momentum factor and Pastor and Stambaugh (2003) a liquidity factor.

If the factor is indeed a (single) return, then

the beta pricing model (2.22) hold if and only if the factor is on the MV frontier

(and is not equal to the minimum variance return or the riskfree return).

The empirical implication is that

testing a beta pricing model is the same as testing if the return factor (e.g. the

market return) is MV efficient.

This can be done in several ways, but the easiest approach is run the regression

e

Ri;t D i C i0 f t C ui;t ; (2.25)

and to test if D 0. Notice that this works also with several factors, provided all of them

are excess returns.

Example 2.17 (CAPM) When $f = R_m$ (the market return), then we have CAPM. Testing CAPM is the same as testing if the market portfolio is MV efficient. See Figure 2.5 for an empirical example.

Remark 2.18 (Typical results from testing CAPM) It is often found that $\alpha > 0$ for small stocks and value stocks (and the opposite for growth stocks). Dynamic portfolios that bet on short-run reversal, medium-run momentum and long-run reversion also tend to have $\alpha > 0$. Finally, firms with unexpectedly high earnings growth tend to have high returns also after the surprise, while firms with high uncertainty or high asset growth tend to underperform. The average mutual fund has a negative alpha, and it seems as if alphas are not particularly correlated over time.
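A sketch of the α test in (2.25) on simulated data (my illustration: one asset, one factor, iid errors so classical LS standard errors suffice; with real data one would use NW/GMM standard errors as in Chapter 1).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 600
f = 0.5 + 2.0 * rng.standard_normal(T)          # factor (an excess return)
Re = 0.2 + 1.1 * f + rng.standard_normal(T)     # test asset: true alpha = 0.2

X = np.column_stack([np.ones(T), f])
b = np.linalg.lstsq(X, Re, rcond=None)[0]       # (alpha, beta)
u = Re - X @ b
V = np.linalg.inv(X.T @ X) * (u @ u / (T - 2))  # classical LS covariance
print(b[0], b[0] / np.sqrt(V[0, 0]))            # alpha and its t-stat
```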

Pricing factors need not be returns (e.g. inflation or market volatility). In principle, that does not change much: the beta pricing model (2.22)–(2.23) would still be true, although we need another method than (2.24) to identify the factor risk premia.

In practice, it is often convenient to work with factor mimicking portfolios instead. To construct such a portfolio, regress the factor on a constant and a vector of asset excess returns. Then use the fitted values (minus the intercept) as the return of a factor mimicking portfolio.

Instead of using a regression, factor mimicking portfolios are often approximated by long-short portfolios based on asset sorts (for instance, small minus big firms).

Example 2.19 (Fama-French) The Fama and French (1993) (see also Fama and French (1996)) SMB and HML portfolios can be thought of as factor mimicking portfolios, perhaps mimicking the credit cycle and the degree of optimism/pessimism on the market. See Figure 2.6.

Example 2.20 (Carry trade factor) Lustig, Roussanov, and Verdelhan (2011) use a carry trade risk factor mimicking portfolio (labelled HML_FX) which is (almost) a portfolio of exchange rates where high-minus-low refers to the interest rate level. This turns out to be an important factor for explaining the cross-section of exchange rate returns. It is argued that HML_FX is strongly related to macroeconomic risk.

We can always think of the SDF as the factor in a beta pricing model. To see this, rewrite (2.8) as

$$\mathrm{E}\,R_i^e = \underbrace{\frac{\mathrm{Cov}(m, R_i^e)}{\mathrm{Var}(m)}}_{\beta_{i,m}}\;\underbrace{\left[-R_f\,\mathrm{Var}(m)\right]}_{\lambda_m}. \tag{2.26}$$

However, we often want to go beyond this by providing more information about what is behind the SDF.

[Figure 2.6: Cumulated excess returns (index levels) on the FF factors: market, SMB and HML, 1960–2010.]

Example 2.21 ((2.26) with a utility based SDF) If the SDF is the ratio of marginal CRRA utilities as in (2.20), then (2.26) explains the expected average return as a function of the return's regression coefficient on $(C_1/C_0)^{-\gamma}$, that is, $\beta_i$ from $R_i^e = \alpha + \beta_i(C_1/C_0)^{-\gamma} + u$.

It is well known that consumption based asset pricing models (like the CRRA model discussed above) have a problem with explaining the historical average returns on equity. They are perhaps somewhat better at explaining the cross-section of different equity returns. More sophisticated consumption based models that incorporate habits (Campbell and Cochrane (1999)), or that focus on longer horizons (Parker and Julliard (2005)), or that allow for (small) long-run movements in growth (Bansal and Yaron (2004)) do a bit better.

Given a beta pricing model against an SDF like (2.26), it is straightforward to show that there is always (another) beta representation, against a return on the MV frontier:

$$\mathrm{E}\,R_i^e = \beta_{i,mv}\,\lambda_{mv}, \tag{2.27}$$

where $\beta_{i,mv}$ is the regression coefficient of asset $i$ against that return on the MV frontier. This result says that MV frontiers are crucial for all asset pricing. However, it remains to be tested whether your favourite factor is really on the frontier. For instance, if the market return happens to be on the MV frontier, then CAPM holds; otherwise it does not.

Proof. (Proof of (2.26) ⇒ (2.27)) Recall from (2.9)–(2.10) that an asset that is perfectly correlated (positively or negatively) with the SDF has the highest possible |Sharpe ratio|, which means that it is on the MV frontier. The perfect correlation also means that the return of this asset ($R_{mv}$) must satisfy $m = \kappa + \varphi R_{mv}$, where $\kappa$ and $\varphi$ are two constants. Using this in (2.26) gives

$$\mathrm{E}\,R_i^e = \frac{\mathrm{Cov}(\kappa+\varphi R_{mv}, R_i^e)}{\mathrm{Var}(\kappa+\varphi R_{mv})}\left[-R_f\,\mathrm{Var}(\kappa+\varphi R_{mv})\right] = \underbrace{\frac{\mathrm{Cov}(R_{mv}, R_i^e)}{\mathrm{Var}(R_{mv})}}_{\beta_{i,mv}}\;\underbrace{\left[-R_f\,\varphi\,\mathrm{Var}(R_{mv})\right]}_{\lambda_{mv}}.$$

It can also be shown that an SDF that is a linear function of some factors ($f$) is the same thing as having a linear factor model. To be precise, there is a beta pricing model with the factors $f$ if and only if there is an SDF that is linear (affine) in those factors, $m = a + b'f$. Given

$$m = a + b'(f - \mathrm{E}\,f), \quad 0 = \mathrm{E}(m R_i^e), \tag{2.28}$$

we can find a $\lambda$ such that the beta pricing model (2.22) holds. Conversely, given $\lambda$ in (2.22) we can find $b$ such that (2.28) holds. (Notice: $\beta_i$ can be estimated, and $a$ is only important if we use some returns, not just excess returns. In that case, $a = 1/R_f$.) See Figure 2.7 for a numerical example and Figure 2.4 for an empirical illustration.

Example 2.23 (From consumption to other factors) If the ratio of marginal utilities (consumption growth) in Example 2.21 depends on a vector of factors in a linear way, then we have an SDF like in (2.28). For instance, the macroeconomic equilibrium might imply that marginal utility is a linear function of some key macroeconomic variables like output ($y$), interest rates ($i$) and inflation ($\pi$),

$$m = \kappa_0 + \kappa_y y + \kappa_i i + \kappa_\pi \pi,$$

where the $\kappa$'s are constants. The factor model for the asset prices then includes the same macro factors.

[Figure 2.7: Calculations of the SDF and of beta representations against both the SDF and R_m, in a two-asset example: means 0.090 and 0.060, riskfree rate 0.010; the tangency portfolio R_m has weights (0.506, 0.494), mean 0.075 and std 0.112. The implied SDF is m = 0.9901 − 5.134(R_m − E R_m). Pricing with CAPM (λ = E R_m − R_f = 0.065, β = 1.227 and 0.767) gives expected excess returns 0.080 and 0.050, the same as the beta representation against m (λ_m = −Var(m)/E m = −0.335, β_m = −0.239 and −0.149, β_mλ_m = 0.080 and 0.050).]

A linear SDF model is thus effectively the same as an old-fashioned linear factor model. The choice between a linear SDF model and a beta pricing model is therefore based on what is more convenient (or already established in the literature). For most asset classes this means a beta pricing model, although studies of bonds and derivatives often work with SDFs. What is more important is the assumption of linearity, and the choice of the factors.

Proof. (Proof of (2.28)) Combine (2.28) and (2.22) to get

$$\lambda = -\frac{1}{a}\,\mathrm{Var}(f)\,b \quad\text{or}\quad b = -a\,\mathrm{Var}(f)^{-1}\lambda,$$

where $\mathrm{Var}(f)$ is the variance-covariance matrix of $f$. By definition, the betas are multiple regression coefficients, so they are

$$\beta_i = \mathrm{Var}(f)^{-1}\,\mathrm{Cov}(f, R_i^e).$$

Example 2.24 (An SDF for CAPM) Suppose $\lambda = 0.08$, where $\beta_i$ is the regression coefficient in $R_i^e = \alpha_i + \beta_i R_m^e + u_i$ and $R_m^e$ is the market excess return ($f = R_m^e$). Suppose $\mathrm{Var}(R_m^e) = 0.16^2$. From the proof of (2.28), and using $\mathrm{E}\,m = a = 1$ (which is not important since we are dealing with excess returns), we have

$$\lambda = -b\,\mathrm{Var}(R_m^e), \text{ or } 0.08 = -b\times 0.16^2 \;\Rightarrow\; b = -3.125.$$

Use this in (2.28):

$$m = 1 - 3.125\,(R_m^e - \mathrm{E}\,R_m^e).$$

Notice the sign: $m$ proxies marginal utility, so it is low when the market return is high. In terms of risk premia, recall (2.8) and combine it with the equation for $m$ to get

$$\mathrm{E}\,R_i^e = -R_f\,\mathrm{Cov}(m, R_i^e) = 3.125\,R_f\,\mathrm{Cov}(R_m^e, R_i^e).$$

(In this section I use time subscripts, since they are needed to clarify the concepts.)

Most asset pricing theories are conditional:

$$1 = \mathrm{E}_0(m_1 R_{i,1}), \tag{2.29}$$

where $\mathrm{E}_0$ denotes expectations at the time of the portfolio formation and $R_{i,1}$ is the gross return of asset $i$ in period 1. However, we typically want an expression in terms of unconditional expectations, since those can be approximated by sample averages of available data.

Use iterated expectations to get

$$1 = \mathrm{E}(m_1 R_{i,1}). \tag{2.30}$$

This is correct, but explores only a very limited set of the model properties. Instead, notice that for any $z_0$ known in period 0,

$$z_0 = z_0\,\mathrm{E}_0(m_1 R_{i,1}) \;\Rightarrow\; \mathrm{E}\,z_0 = \mathrm{E}(m_1 R_{i,1} z_0), \tag{2.31}$$

where the second part follows from the law of iterated expectations. Notice that $z_0$ must be correlated with $m_1 R_{i,1}$ for this expression to differ from (2.30); otherwise the expectation can be factored as $\mathrm{E}(m_1R_{i,1}z_0) = \mathrm{E}\,z_0\,\mathrm{E}(m_1 R_{i,1})$. This suggests that $z_0$ must be able to predict $m_1 R_{i,1}$ for it to matter. For this reason, empirical conditional asset pricing often checks if the proposed state variable $z_0$ has any predictive power.

Interpretation 1: (2.31) captures more of the conditional model than just conditioning down ($z_0 = 1$ is a special case).

Interpretation 2: $R_{i,1}z_0$ is the payoff of a managed portfolio and $z_0$ its price. See Figure 2.8 for an example where the return distribution is different in different time periods (driven by a state variable) and Figure 2.9 for an empirical illustration.

[Figure 2.8: Mean-Std frontiers in two different states (state −1 and state 1), for the basic assets and for managed portfolios: optimal (state-dependent) vs constant portfolio weights.]

2.4 Risk Neutral Distributions

[Figure 2.9: Fit of CAPM and of a CAPM-LSTAR model, $R_i^e = \alpha + [1-G(z)]\beta_1R_m^e + G(z)\beta_2R_m^e + \varepsilon$, where $G(z)$ is a logistic function and $z$ is a lagged momentum return: mean excess returns vs predicted mean excess returns, 25 FF portfolios (B/M, size), monthly US data 1957:1–2013:12.]

To simplify the analysis, assume that there are $k$ different states of the world and that state $j$ has probability $\pi_j$. The asset pricing model (2.3) can then be written (for asset $i$)
$$P_i = \mathrm{E}(m x_i) = \sum_{j=1}^{k}\pi_j m_j x_{ij}. \tag{2.32}$$

Define the risk neutral probabilities as

$$\pi_j^* = R_f m_j \pi_j. \tag{2.33}$$

(They sum to unity, since $\sum_j \pi_j m_j = \mathrm{E}\,m = 1/R_f$.) Then (2.32) becomes

$$P_i = \sum_{j=1}^{k}\pi_j m_j x_{ij} = \sum_{j=1}^{k}\frac{\pi_j^*}{R_f}\,x_{ij} = \mathrm{E}^* x_i / R_f, \tag{2.34}$$

which says that the asset price equals the discounted risk neutral expected value. Once we have the risk neutral distribution of asset $i$ for a given investment horizon, we can price any European derivative (whose expiration is at the investment horizon) on that asset. For instance, if we have the risk neutral distribution for the bond price 3 months from now,

then we can also price a European put option on that bond.

Clearly, (2.34) can be rearranged as

$$\frac{\mathrm{E}^* x_i}{P_i} = R_f, \tag{2.35}$$

which says that the risk neutral expected (gross) return on any asset equals the (gross) riskfree rate.

Recall that a forward contract has a zero price and a payoff $x_{i,1} = P_{i,1} - F_i$, where $P_{i,1}$ is the price of the underlying asset $i$ next period and $F_i$ is the contracted forward price. Apply (2.34) to get

$$0 = \mathrm{E}^*(P_{i,1} - F_i)/R_f, \text{ so } F_i = \mathrm{E}^* P_{i,1}.$$

This shows that the mean of the risk-neutral distribution (of asset $i$) equals the forward price.

Remark 2.26 (Binomial model) Consider the process for the underlying asset $i$,

$$P_{i,1} = \begin{cases} P_{i,0}\,u & \text{with probability } \pi \\ P_{i,0}\,d & \text{with probability } 1-\pi, \end{cases}$$

where $P_{i,0}$ is today's price and $P_{i,1}$ is the price next period. For simplicity, we assume no dividends. We know from basic option pricing that a derivative that is worth $C_u$ in the up state and $C_d$ in the down state must have the current price

$$C_0 = \frac{1}{R_f}\left[\pi^* C_u + (1-\pi^*)C_d\right], \text{ with } \pi^* = \frac{R_f - d}{u - d},$$

where $\pi^*$ is the risk neutral probability. From (2.33) we know that $\pi_j^*/R_f = m_j\pi_j$, so

$$\pi^*/R_f = m_u\pi \quad\text{and}\quad (1-\pi^*)/R_f = m_d(1-\pi).$$

We can therefore also write the price as

$$C_0 = m_u\pi C_u + m_d(1-\pi)C_d.$$

Example 2.27 (Binomial model) Suppose $u = 1.1$, $d = 0.95$, $R_f = 1$, $\pi = 2/3$, $C_u = 1$ and $C_d = 0$. Then we get

$$C_0 = \frac{1}{1}\,\pi^*\times 1 = \frac{1}{3}, \text{ with } \pi^* = \frac{1 - 0.95}{1.1 - 0.95} = \frac{1}{3}.$$

It follows that $m_u = 1/2$ and $m_d = 2$. Notice that $\mathrm{E}\,m = 2/3\times 1/2 + 1/3\times 2 = 1$, so the gross riskfree rate is indeed 1. Also, notice that using $m$ to price the derivative gives $C_0 = 2/3\times 1/2\times 1 = 1/3$.
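A sketch of Remark 2.26 and Example 2.27 in code (mine): compute π*, back out the SDF values in the two states, and check that both routes deliver the same derivative price.

```python
u, d, Rf, p = 1.1, 0.95, 1.0, 2 / 3    # up/down factors, riskfree rate, physical prob
Cu, Cd = 1.0, 0.0                      # derivative payoffs

p_star = (Rf - d) / (u - d)            # risk neutral probability = 1/3
C0_rn = (p_star * Cu + (1 - p_star) * Cd) / Rf

m_u = p_star / (Rf * p)                # from pi*_j = Rf * m_j * pi_j
m_d = (1 - p_star) / (Rf * (1 - p))
C0_sdf = p * m_u * Cu + (1 - p) * m_d * Cd

print(C0_rn, C0_sdf)                   # both 1/3
print(p * m_u + (1 - p) * m_d)         # E(m) = 1/Rf = 1
```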

Suppose the log asset price is univariate normal (we call this the physical distribution),

$$p_{i,1} \sim N(\mu_p, \sigma_{pp}), \tag{2.37}$$

and also that the log SDF is normal,

$$\ln m_1 \sim N(\mu_m, \sigma_{mm}), \text{ with } \mathrm{Cov}(\ln m_1, p_{i,1}) = \sigma_{mp}. \tag{2.38}$$

Direct calculations then give today's gross interest rate (recall: if $x \sim N(\mu, \sigma^2)$, then $\mathrm{E}\exp(x) = \exp(\mu + \sigma^2/2)$):

$$\frac{1}{R_f} = \mathrm{E}\,m_1 = \exp(\mu_m + \sigma_{mm}/2). \tag{2.39}$$

Similarly, today's price of the underlying asset is (assuming no dividends, so the future price is the payoff)

$$P_{i,0} = \mathrm{E}(m_1 P_{i,1}) = \mathrm{E}\exp(\ln m_1 + p_{i,1}) = \underbrace{\exp(\mu_m + \sigma_{mm}/2)}_{1/R_f}\exp(\mu_p + \sigma_{pp}/2 + \sigma_{mp}) = \frac{1}{R_f}\exp(\mu_p + \sigma_{pp}/2 + \sigma_{mp}). \tag{2.40}$$

Suppose the risk neutral distribution of the future log asset price is also normal; it is then

$$p_{i,1} \sim N(\mu_p + \sigma_{mp}, \sigma_{pp}) \text{ under the risk neutral distribution.} \tag{2.41}$$

This distribution has the same variance as the physical distribution of $p_{i,1}$ in (2.37), but a different mean. This simple result is due to the assumption of lognormally distributed variables. See Figure 2.10 for an example.

To illustrate that this works, notice that

$$P_{i,0} = \frac{1}{R_f}\,\mathrm{E}^* P_{i,1} = \frac{1}{R_f}\,\mathrm{E}^*\exp(p_{i,1}) = \frac{1}{R_f}\exp(\mu_p + \sigma_{mp} + \sigma_{pp}/2), \tag{2.42}$$

which is the same as (2.40). To apply the risk neutral distribution, we could, for instance, price a European put option with strike price $K$ as

$$\mathrm{Put}_0 = \frac{1}{R_f}\,\mathrm{E}^*\max(0, K - P_{i,1}) = \frac{1}{R_f}\int_{-\infty}^{\ln K}\left[K - \exp(p)\right]\phi(p;\, \mu_p+\sigma_{mp},\, \sigma_{pp})\,dp, \tag{2.43}$$

where $\phi(p; \mu, \sigma^2)$ is notation for the pdf of a $N(\mu, \sigma^2)$ distribution evaluated at $p$.

Example 2.28 (Lognormal distribution) Suppose $\mu_m = -0.04$, $\sigma_{mm} = 0.04$, $\mu_p = 0.03$, $\sigma_{pp} = 0.01$ and $\sigma_{mp} = -0.015$. Then (2.39)–(2.40) give

$$\frac{1}{R_f} = \exp(-0.04 + 0.04/2) \approx 0.98, \text{ so } \ln R_f = 0.02,$$
$$P_{i,0} = \exp(-0.04 + 0.04/2)\exp(0.03 - 0.015 + 0.01/2) = 1.$$

Example 2.29 (Risk neutral lognormal distribution) In Example 2.28, the physical distribution of the log payoff ($p_{i,1}$) is $N(0.03, 0.01)$, while the risk neutral distribution (2.41) is $N(0.03 - 0.015, 0.01)$, that is, $N(0.015, 0.01)$. See Figure 2.10.
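A sketch of the put pricing formula (2.43) with the numbers of Examples 2.28–2.29, integrating over the risk neutral density numerically (my code; the strike K = 1 is an assumption made for the illustration).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu_p, s_pp, s_mp = 0.03, 0.01, -0.015
Rf = 1 / np.exp(-0.04 + 0.04 / 2)           # from (2.39): ln Rf = 0.02
mu_star = mu_p + s_mp                       # risk neutral mean, as in (2.41)

K = 1.0                                     # strike price (assumed)
integrand = lambda p: (K - np.exp(p)) * norm.pdf(p, mu_star, np.sqrt(s_pp))
put0, _ = quad(integrand, mu_star - 10 * np.sqrt(s_pp), np.log(K))
put0 /= Rf
print(put0)                                 # price of the European put
```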

To interpret the risk neutral pdf in (2.41), notice that an asset with a negative covariance with the pricing kernel tends to pay off in the "wrong" states (for instance, in booms), so it is considered a risky asset and will have a low price. For a risk-neutral investor to make an equally low valuation, he must be more pessimistic about the future payoff: the distribution is shifted down (to the left).

[Figure 2.10: Physical and risk-neutral pdfs of the log payoff (return) in the lognormal example: the physical distribution has mean 0.030 and the risk-neutral distribution has mean 0.015, both with std 0.100.]

[Figure: A joint distribution of the log SDF (ln m) and an excess return (R^e) in a non-normal case.]

When the log SDF and the return are not jointly lognormally distributed, the transformation from the physical to the risk neutral distribution involves much more than just a horizontal shift.

[Figure: Physical and risk-neutral pdfs of an excess return when ln m is normal but R^e has a mixture-of-normals distribution: physical mean 0.340 and std 1.800; risk-neutral mean 0.000 and std 2.110.]

Bibliography

Back, K. E., 2010, Asset Pricing and Portfolio Choice Theory, Oxford University Press, Oxford.

Bansal, R., and A. Yaron, 2004, "Risks for the Long Run: A Potential Resolution of Asset Pricing Puzzles," The Journal of Finance, 59, 1481–1509.

Campbell, J. Y., and J. H. Cochrane, 1999, "By Force of Habit: A Consumption-Based Explanation of Aggregate Stock Market Behavior," Journal of Political Economy, 107, 205–251.

Cochrane, J. H., 2005, Asset Pricing, Princeton University Press, Princeton, New Jersey, revised edn.

Fama, E. F., and K. R. French, 1993, "Common Risk Factors in the Returns on Stocks and Bonds," Journal of Financial Economics, 33, 3–56.

Fama, E. F., and K. R. French, 1996, "Multifactor Explanations of Asset Pricing Anomalies," Journal of Finance, 51, 55–84.

Lustig, H. N., N. L. Roussanov, and A. Verdelhan, 2011, "Common Risk Factors in Currency Markets," Review of Financial Studies, 24, 3731–3777.

Mehra, R., and E. Prescott, 1985, "The Equity Premium: A Puzzle," Journal of Monetary Economics, 15, 145–161.

Parker, J., and C. Julliard, 2005, "Consumption Risk and the Cross Section of Expected Returns," Journal of Political Economy, 113, 185–222.

Pastor, L., and R. F. Stambaugh, 2003, "Liquidity Risk and Expected Stock Returns," Journal of Political Economy, 111, 642–685.

3 Simulating the Finite Sample Properties

Reference: Greene (2000) 5.3 and Horowitz (2001)

3.1 Introduction

Additional references: Cochrane (2001) 15.2; Davidson and MacKinnon (1993) 21; Davison and Hinkley (1997); Efron and Tibshirani (1993) (bootstrapping, chapter 9 in particular); and Berkowitz and Kilian (2000) (bootstrapping in time series models).

We know the small sample properties of regression coefficients in linear models with fixed regressors and iid normal error terms. When these conditions are not satisfied, we may use Monte Carlo simulations and bootstrapping to understand the small sample properties.

How they should be implemented depends crucially on the properties of the model and data: whether the residuals are autocorrelated, heteroskedastic, or perhaps correlated across regression equations. These notes summarize a few typical cases.

The need for using Monte Carlos or bootstraps varies across applications and data sets. For a case where it is not needed, see Figure 3.1, and for a case where it matters, compare the traditional and bootstrapped t-stats in Tables 3.1-3.2.

              2y       3y       4y       5y
factor       1.00     1.87     2.69     3.47
            (6.59)   (6.77)   (6.97)   (7.17)
constant     0.00     0.00     0.00     0.00
            (0.00)  (-0.44)  (-0.83)  (-1.21)
R2           0.14     0.15     0.16     0.17
obs           591      591      591      591

Table 3.1: Regression of different excess (1-year) holding period returns (in columns, indicating the maturity of the respective bond) on a single forecasting factor and a constant. Numbers in parentheses are t-stats. U.S. data for 1964:1-2014:3.

[Figure 3.1: Mean excess return against beta (against the market) for US industry portfolios, 1970:1-2013:12, together with the following table of alphas and t-stats:

            alpha   t LS   t NW   t boot
all          NaN    NaN    NaN    NaN
A (NoDur)   3.56   2.71   2.52   2.24
C (Manuf)   0.71   0.75   0.71   0.64
D (Enrgy)   3.92   1.77   1.76   1.85
E (HiTec)  -2.00  -1.11  -1.12  -1.00
F (Telcm)   1.93   1.15   1.12   0.97
G (Shops)   1.35   0.94   0.90   0.89
H (Hlth)    2.31   1.35   1.38   1.37
I (Utils)   2.80   1.58   1.53   1.58
J (Other)  -0.61  -0.58  -0.56  -0.47

NW uses 1 lag. The bootstrap samples (y_t, x_t) in blocks of 10. 3000 simulations.]

              2y       3y       4y       5y
factor       1.00     1.87     2.69     3.47
            (3.83)   (4.00)   (4.16)   (4.32)
constant     0.00     0.00     0.00     0.00
            (0.00)  (-0.21)  (-0.39)  (-0.56)
R2           0.14     0.15     0.16     0.17
obs           591      591      591      591

Table 3.2: Regression of different excess (1-year) holding period returns (in columns, indicating the maturity of the respective bond) on a single forecasting factor and a constant. U.S. data for 1964:1-2014:3. Numbers in parentheses are t-stats. Bootstrapped standard errors, with blocks of 10 observations.

3.2 Monte Carlo Simulations

Monte Carlo simulation is essentially a way to generate many artificial (small) samples from a parameterised model and then to estimate the statistic of interest (for instance, a slope coefficient) on each of those samples. The distribution of the statistic across the artificial samples is then used as the small sample distribution of the estimator.

The following is an example of how Monte Carlo simulations could be done in the special case of a linear model with a scalar dependent variable,

y_t = x_t' \beta + u_t,   (3.1)

where u_t is iid N(0, \sigma^2) and x_t is stochastic but independent of u_s for all s. This means that x_t cannot include lags of y_t.

Suppose we want to find the small sample distribution of a function of the estimate, g(\hat{\beta}). To do a Monte Carlo experiment, we need information on (i) the coefficients \beta; (ii) the variance of u_t, \sigma^2; and (iii) a process for x_t.

The process for x_t is typically estimated from the data on x_t (for instance, a VAR system x_t = A_1 x_{t-1} + A_2 x_{t-2} + e_t). Alternatively, we could simply use the actual sample of x_t and repeat it.

The values of \beta and \sigma^2 are often a mix of estimation results and theory. In some cases, we simply take the point estimates. In other cases, we adjust the point estimates so that g(\beta) = 0 holds, that is, so you simulate the model under the null hypothesis in order to study the size of tests and to find valid critical values for small samples. Alternatively, you may simulate the model under an alternative hypothesis in order to study the power of the test, using critical values from either the asymptotic distribution or from a (perhaps simulated) small sample distribution.

To make it a bit concrete, suppose you want to use these simulations to get a 5% critical value for testing the null hypothesis g(\beta) = 0. The Monte Carlo experiment follows these steps (a code sketch follows after the list).

1. Draw random numbers \tilde{u}_t for t = 1, ..., T from a prespecified distribution (e.g. multivariate normal) and use those together with the artificial sample of \tilde{x}_t to calculate an artificial sample \tilde{y}_t for t = 1, ..., T from

\tilde{y}_t = \tilde{x}_t' \beta + \tilde{u}_t,   (3.2)

by using the prespecified values of the coefficients (perhaps your point estimates).

2. Calculate an estimate \hat{\beta} and perhaps also g(\hat{\beta}) and the test statistic of the hypothesis that g(\beta) = 0.

3. Repeat the previous steps N (3000, say) times. The more times you repeat, the better is the approximation of the small sample distribution.

4. Sort your simulated \hat{\beta}, g(\hat{\beta}), and the test statistic in ascending order. For a one-sided test (for instance, a chi-square test), take the (0.95N)th observation in these sorted vectors as your 5% critical value. For a two-sided test (for instance, a t-test), take the (0.025N)th and (0.975N)th observations as the 5% critical values. You may also record how many times the 5% critical values from the asymptotic distribution would reject a true null hypothesis.

5. You may also want to plot a histogram of \hat{\beta}, g(\hat{\beta}), and the test statistic to see if there is a small sample bias, and what the distribution looks like. Is it close to normal? How wide is it?
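A minimal sketch of these steps in Python (hypothetical parameter values; the regressor sample is simply reused across replications, and there is no intercept):

import numpy as np

rng = np.random.default_rng(42)
T, N = 100, 3000            # sample length, number of replications
beta, sigma = 0.5, 1.0      # prespecified coefficient and error std
x = rng.standard_normal(T)  # artificial regressor sample, reused in every replication

b_hat = np.empty(N)
for i in range(N):
    u = sigma * rng.standard_normal(T)     # step 1: draw errors
    y = beta * x + u                       # step 1: artificial sample, as in (3.2)
    b_hat[i] = x @ y / (x @ x)             # step 2: LS estimate

b_hat.sort()                               # step 4: sort the simulated estimates
print("2.5% and 97.5% critical values:", b_hat[int(0.025 * N)], b_hat[int(0.975 * N)])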

We have the same basic procedure when y_t is a vector, except that we might have to consider correlations across the elements of the vector of residuals u_t. For instance, we might want to generate the vector \tilde{u}_t from a N(0, \Sigma) distribution, where \Sigma is the variance-covariance matrix of u_t.

Remark 3.1 (Generating N(\mu, \Sigma) random numbers) Suppose you want to draw an n x 1 vector \epsilon_t of N(\mu, \Sigma) variables. Use the Cholesky decomposition of \Sigma to calculate the lower triangular P such that \Sigma = PP'. Draw u_t from an N(0, I_n) distribution, and define \epsilon_t = \mu + P u_t. Note that Cov(\epsilon_t) = E P u_t u_t' P' = P I_n P' = \Sigma.

It is straightforward to sample the errors from other distributions than the normal, for instance, a student-t distribution. Equipped with uniformly distributed random numbers, you can always (numerically) invert the cumulative distribution function (cdf) to generate random variables from any distribution, by using the probability transformation method. See Figure 3.2 for an example.

Remark 3.2 Let X \sim U(0,1) and consider the transformation Y = F^{-1}(X), where F^{-1}() is the inverse of a strictly increasing cumulative distribution function F; then Y has the cdf F.

Example 3.3 The exponential cdf is x = 1 - \exp(-\theta y), with inverse y = -\ln(1-x)/\theta. Draw x from U(0,1) and transform to y to get an exponentially distributed variable.
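A short sketch of Example 3.3 in Python (\theta = 2 is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(1)
theta = 2.0
x = rng.uniform(size=100_000)        # U(0,1) draws
y = -np.log(1.0 - x) / theta         # inverse of the exponential cdf
print(y.mean())                      # should be close to 1/theta = 0.5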

[Figure 3.2: Results from a Monte Carlo experiment with thick-tailed errors. Panels show the distribution of the LS t-stat, t = (b - 0.9)/Std(b), for T = 5 and T = 100, and the pdfs of a chi-square(2) and a normal distribution. Model: R_t = 0.9 f_t + \epsilon_t, where \epsilon_t = v_t - 2 and v_t has a chi-square(2) distribution; 25,000 simulations. Kurtosis of the t-stat: 46.75 (T = 5) vs 3.05 (T = 100); frequency of |t-stat| > 1.64: 0.29 vs 0.11; frequency of |t-stat| > 1.96: 0.23 vs 0.05.]

If x_t contains lags of y_t, then we must set up the simulations so that this feature is preserved in every artificial sample that we create. For instance, suppose x_t includes y_{t-1} and another set of regressors z_t,

y_t = x_t' \beta + u_t = \gamma y_{t-1} + \delta' z_t + u_t.   (3.3)

We can then generate an artificial sample as follows. First, create a sample \tilde{z}_t for t = 1, ..., T by some time series model (for instance, a VAR) or by taking the observed sample itself. Second, observation t of (\tilde{x}_t, \tilde{y}_t) is generated recursively as

\tilde{x}_t = [\tilde{y}_{t-1}; \tilde{z}_t] and \tilde{y}_t = \tilde{x}_t' \beta + \tilde{u}_t for t = 1, ..., T.   (3.4)

We clearly need the initial value \tilde{y}_0 to start up the artificial sample; the rest of the sample (t = 1, 2, ...) is then calculated recursively. See Figures 3.4-3.5 for an example.
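A sketch of the recursion (3.4) in Python, for an AR(1) with one exogenous regressor (illustrative parameter values; \tilde{z}_t is just simulated noise here):

import numpy as np

rng = np.random.default_rng(2)
T, gamma, delta, sigma = 200, 0.9, 0.5, 1.0
z = rng.standard_normal(T)             # or: a simulated VAR / the observed sample
u = sigma * rng.standard_normal(T)

y = np.empty(T)
y_lag = 0.0                            # initial value y_0
for t in range(T):                     # build the sample recursively, as in (3.4)
    y[t] = gamma * y_lag + delta * z[t] + u[t]
    y_lag = y[t]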

For instance, for a VAR(2) model (where there is no z_t),

y_t = A_1 y_{t-1} + A_2 y_{t-2} + u_t,   (3.5)

[Figure 3.3: Results from a Monte Carlo experiment on two methods of testing for randomness (an autocorrelation test and a runs test, rejecting randomness if |t-stat| > 1.96). Panels show the frequency of rejecting randomness as a function of the sample size T, for persistence parameters 0.75, 0.5 and 0.25; the innovations are iid normal.]

the procedure is straightforward. First, estimate the model on data and record the estimates (A_1, A_2, Var(u_t)). Second, draw a new time series of residuals, \tilde{u}_t for t = 1, ..., T, and construct an artificial sample recursively (first t = 1, then t = 2 and so forth) as

\tilde{y}_t = A_1 \tilde{y}_{t-1} + A_2 \tilde{y}_{t-2} + \tilde{u}_t.   (3.6)

(This requires some starting values for y_{-1} and y_0.) Third, re-estimate the model on the artificial sample, \tilde{y}_t for t = 1, ..., T.

It is more difficult to handle non-iid errors, like those with autocorrelation and heteroskedasticity. We then need to model the error process and generate the errors from that model.

If the errors are autocorrelated, then we could estimate that process from the fitted residuals and generate artificial residuals from it, for instance, from an AR(2),

\tilde{u}_t = a_1 \tilde{u}_{t-1} + a_2 \tilde{u}_{t-2} + \tilde{\epsilon}_t.   (3.7)

Similarly, if the errors are heteroskedastic, we could estimate and simulate a GARCH(1,1) model,

u_t \sim N(0, \sigma_t^2), where \sigma_t^2 = \omega + \alpha u_{t-1}^2 + \beta \sigma_{t-1}^2.   (3.8)

However, this specification does not account for any link between the volatility and the regressors (squared), as tested for by White's test. Such a link would invalidate the usual OLS standard errors and therefore deserves to be taken seriously. A simple, but crude, approach is to generate residuals from a N(0, \sigma_t^2) process, where \sigma_t^2 is approximated by the fitted values from

\epsilon_t^2 = c' w_t + \eta_t,   (3.9)

where w_t includes the squares and cross products of all the regressors.

[Figure 3.4: Results from a Monte Carlo experiment of LS estimation of the AR coefficient. True model: y_t = 0.9 y_{t-1} + \epsilon_t, where \epsilon_t is iid normal. Panels: average LS estimate, its standard deviation (simulation vs asymptotic), and sqrt(T) times the standard deviation, as functions of the sample size T.]

[Figure 3.5: Results from a Monte Carlo experiment of LS estimation of the AR coefficient. Estimated model: y_t = a + \rho y_{t-1} + u_t; 25,000 simulations. Distribution of the LS estimator for T = 25 (mean 0.74, std 0.16) and T = 100 (mean 0.86, std 0.06).]

3.3 Bootstrapping

The bootstrap approach amounts to generating the artificial samples by (re-)sampling from the actual data. The advantage of the bootstrap is then that we do not have to try to estimate the process of the errors and regressors (as we do in a Monte Carlo experiment). The real benefit of this is that we do not have to make any strong assumption about the distribution of the errors.

The bootstrap approach works particularly well when the errors are iid and independent of x_s for all s. This means that x_t cannot include lags of y_t. We here consider bootstrapping the linear model (3.1), for which we have point estimates (perhaps from LS) and fitted residuals. The procedure is similar to the Monte Carlo approach, except that the artificial sample is generated differently. In particular, Step 1 in the Monte Carlo simulation is replaced by the following: construct an artificial sample from

\tilde{y}_t = x_t' \hat{\beta} + \tilde{u}_t,   (3.10)

where \tilde{u}_t is drawn (with replacement) from the fitted residuals and where \hat{\beta} is the point estimate.

Example 3.4 With T = 3, the artificial sample could be

(\tilde{y}_1, \tilde{x}_1)     (x_1' \hat{\beta} + u_2, x_1)
(\tilde{y}_2, \tilde{x}_2)  =  (x_2' \hat{\beta} + u_1, x_2)
(\tilde{y}_3, \tilde{x}_3)     (x_3' \hat{\beta} + u_2, x_3).

When there is a system of regressions, drawing the entire cross-section of fitted residuals for the same time period will then help retain the cross-sectional correlation of the residuals.
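A minimal sketch of the residual bootstrap in (3.10), in Python (the data set is simulated for illustration):

import numpy as np

rng = np.random.default_rng(3)
T, N = 200, 3000
x = rng.standard_normal(T)
y = 0.5 * x + rng.standard_normal(T)               # some data

b_hat = x @ y / (x @ x)                            # LS point estimate
u_hat = y - b_hat * x                              # fitted residuals

b_boot = np.empty(N)
for i in range(N):
    u_star = rng.choice(u_hat, size=T, replace=True)   # resample residuals
    y_star = b_hat * x + u_star                        # artificial sample, as in (3.10)
    b_boot[i] = x @ y_star / (x @ x)

print("bootstrapped std of b:", b_boot.std())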

When x_t contains lagged values of y_t, then we have to modify the approach in (3.10) since \tilde{u}_t can become correlated with x_t. For instance, if x_t includes y_{t-1} and we happen to sample \tilde{u}_t = u_{t-1}, then we get a non-zero correlation. The easiest way to handle this is as in the Monte Carlo simulations in (3.4), but where the \tilde{u}_t are drawn (with replacement) from the sample of fitted residuals. The same carries over to the VAR model in (3.5)-(3.6).

Suppose now that the errors are heteroskedastic, but serially uncorrelated. If the heteroskedasticity is unrelated to the regressors, then we can still use (3.10).

In contrast, if the heteroskedasticity is related to the regressors, then the traditional LS covariance matrix is not correct (this is the case that White's test for heteroskedasticity tries to identify). It would then be wrong to pair x_t with just any \tilde{u}_t = u_s, since that destroys the relation between x_t and the variance of the residual.

An alternative way of bootstrapping can then be used: generate the artificial sample by drawing (with replacement) pairs (y_s, x_s). That is, we let the artificial pair in t be (\tilde{y}_t, \tilde{x}_t) = (y_s, x_s) = (x_s' \hat{\beta} + u_s, x_s) for some random draw of s, so we are always pairing the residual, u_s, with the contemporaneous regressors, x_s. Note that we are always sampling with replacement; otherwise the approach of drawing pairs would just re-create the original data set. This approach works also when y_t is a vector of dependent variables.

Example 3.5 With T = 3, the artificial sample from drawing pairs could be

(\tilde{y}_1, \tilde{x}_1)     (y_2, x_2)     (x_2' \hat{\beta} + u_2, x_2)
(\tilde{y}_2, \tilde{x}_2)  =  (y_3, x_3)  =  (x_3' \hat{\beta} + u_3, x_3)
(\tilde{y}_3, \tilde{x}_3)     (y_3, x_3)     (x_3' \hat{\beta} + u_3, x_3).
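A sketch of the pairs bootstrap in Python (simulated data with heteroskedastic errors, which is the case where pairing matters):

import numpy as np

rng = np.random.default_rng(4)
T, N = 200, 3000
x = rng.standard_normal(T)
u = x * rng.standard_normal(T)              # heteroskedastic errors, related to x
y = 0.5 * x + u

b_boot = np.empty(N)
for i in range(N):
    idx = rng.integers(0, T, size=T)        # draw time indices with replacement
    x_star, y_star = x[idx], y[idx]         # keep the (y_s, x_s) pairs together
    b_boot[i] = x_star @ y_star / (x_star @ x_star)

print("pairs-bootstrapped std of b:", b_boot.std())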

It could be argued (see, for instance, Davidson and MacKinnon (1993)) that bootstrapping the pairs (y_s, x_s) makes little sense when x_s contains lags of y_s, since the random sampling of the pair (y_s, x_s) destroys the autocorrelation pattern in the regressors.

It is quite hard to handle the case when the errors are serially dependent, since we must sample in such a way that we do not destroy the autocorrelation structure of the data. A common approach is to fit a model for the residuals, for instance, an AR(1), and then bootstrap the (hopefully iid) innovations to that process.

Another approach amounts to resampling blocks of data. For instance, suppose the sample has 10 observations, and we decide to create blocks of 3 observations. The first block is (\hat{u}_1, \hat{u}_2, \hat{u}_3), the second block is (\hat{u}_2, \hat{u}_3, \hat{u}_4), and so forth until the last block, (\hat{u}_8, \hat{u}_9, \hat{u}_{10}). If we need a sample of length 3\tau, say, then we simply draw \tau of those blocks randomly (with replacement) and stack them to form a longer series. To handle end point effects (so that all data points have the same probability to be drawn), we also create blocks by wrapping the data around a circle. In practice, this means that we add the following blocks: (\hat{u}_{10}, \hat{u}_1, \hat{u}_2) and (\hat{u}_9, \hat{u}_{10}, \hat{u}_1). The length of the blocks should clearly depend on the degree of autocorrelation, but T^{1/3} is sometimes recommended as a rough guide. An alternative approach is to use non-overlapping blocks. See Berkowitz and Kilian (2000) for some other approaches.

Example 3.6 With T = 9 and a block size of 3, the artificial sample could be

(u_2, u_3, u_4), (u_7, u_8, u_9), (u_4, u_5, u_6),

that is, blocks 2, 7 and 4.
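A sketch of the circular block bootstrap in Python, mirroring Example 3.6 (the residuals are just 1, ..., 9 for illustration):

import numpy as np

rng = np.random.default_rng(5)
u_hat = np.arange(1.0, 10.0)                  # 9 fitted residuals, as in Example 3.6
B = 3                                         # block length
# overlapping blocks, wrapped around a circle to handle the end points
blocks = np.array([np.roll(u_hat, -s)[:B] for s in range(len(u_hat))])
draw = rng.integers(0, len(blocks), size=len(u_hat) // B)
u_star = np.concatenate(blocks[draw])         # stacked blocks form the new series
print(u_star)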

There are many other ways to do bootstrapping. For instance, we could sample the regressors and residuals independently of each other and construct an artificial sample of the dependent variable \tilde{y}_t = \tilde{x}_t' \hat{\beta} + \tilde{u}_t. This clearly makes sense if the residuals and regressors are independent of each other and the errors are iid. In that case, the advantage of this approach is that we do not keep the regressors fixed.

\theta:           0     0.375    0.75
Simulated        5.8     8.0    10.2
OLS formula      5.8     6.2     7.2
Newey-West       5.7     7.5     9.5
VARHAC           5.7     8.1    11.0
Bootstrapped     5.5     7.5     9.5

Table 3.3: Standard error of OLS intercept (%) under autocorrelation (simulation evidence). Model: y_t = 1 + 0.9 x_t + \epsilon_t, where \epsilon_t = \nu_t + \theta \nu_{t-1} and \nu_t is iid normal. NW uses 5 lags. VARHAC uses 5 lags and a VAR(1). The bootstrap uses blocks of size 20. Sample length: 300. Number of simulations: 10000.

Bibliography

Berkowitz, J., and L. Kilian, 2000, "Recent developments in bootstrapping time series," Econometric Reviews, 19, 1-48.

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey.

Davidson, R., and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics, Oxford University Press, Oxford.

Davison, A. C., and D. V. Hinkley, 1997, Bootstrap Methods and Their Applications, Cambridge University Press.

Efron, B., and R. J. Tibshirani, 1993, An Introduction to the Bootstrap, Chapman and Hall, New York.

Greene, W. H., 2000, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 4th edn.

Horowitz, J. L., 2001, "The Bootstrap," in J. J. Heckman, and E. Leamer (ed.), Handbook of Econometrics, vol. 5, Elsevier.

4 Return Distributions

Sections denoted by a star (*) are not required reading.

Reference: Harvey (1989) 260, Davidson and MacKinnon (1993) 267, Silverman (1986); Mittelhammer (1996), DeGroot (1986)

4.1 Estimating and Testing Distributions

The cdf (cumulative distribution function) measures the probability that the random variable X_i is below or at some numerical value x_i,

F_i(x_i) = Pr(X_i \leq x_i).   (4.1)

For instance, with an N(0,1) distribution, F(-1.64) = 0.05. Clearly, the cdf values are between (and including) 0 and 1. The distribution of X_i is often called the marginal distribution of X_i, to distinguish it from the joint distribution of X_i and X_j. (See below for more information on joint distributions.)

The pdf (probability density function) f_i(x_i) is the "height" of the distribution, in the sense that the cdf F_i(x_i) is the integral of the pdf from minus infinity to x_i,

F_i(x_i) = \int_{-\infty}^{x_i} f_i(s) ds.   (4.2)

(Conversely, the pdf is the derivative of the cdf, f_i(x_i) = \partial F_i(x_i)/\partial x_i.) The Gaussian pdf (the normal distribution) is bell shaped.

The \alpha quantile of a distribution is the value of x such that there is a probability \alpha of a lower value. We can solve for the quantile by inverting the cdf, \alpha = F(\xi_\alpha), as \xi_\alpha = F^{-1}(\alpha). For instance, the 5% quantile of a N(0,1) distribution is -1.64 = \Phi^{-1}(0.05), where \Phi^{-1}() denotes the inverse of an N(0,1) cdf, also called the quantile function. See Figure 4.1 for an illustration.

[Figure 4.1: Pdf, cdf and quantile function of a N(0,1) distribution. Panels illustrate Pr(x \leq -1) = 0.16 and Pr(x \leq 0.5) = 0.69, the cdf value as a function of x, and the inverse of the cdf.]

4.1.2 QQ Plots

Are returns normally distributed? Mostly not, but it depends on the asset type and on the data frequency. Option returns typically have very non-normal distributions (in particular, since the return is -100% on many expiration days). Stock returns are typically distinctly non-normal at short horizons, but can look somewhat normal at longer horizons.

To assess the normality of returns, the usual econometric techniques (Bera-Jarque and Kolmogorov-Smirnov tests) are useful, but a visual inspection of the histogram and a QQ plot also give useful clues. See Figures 4.2-4.5 for illustrations.

Remark 4.2 (Reading a QQ plot) A QQ plot is a way to assess if the empirical distribution conforms reasonably well to a prespecified theoretical distribution, for instance, a normal distribution where the mean and variance have been estimated from the data. Each point in the QQ plot shows a specific percentile (quantile) according to the empirical as well as according to the theoretical distribution. For instance, if the 2nd percentile (the 0.02 quantile) is at -10 in the empirical distribution, but at only -3 in the theoretical distribution, then this indicates that the two distributions have very different left tails.

There is one caveat to this way of studying data: it only provides evidence on the unconditional distribution. Suppose instead that we have estimated a model for the time-variation in the mean and variance (denoted \mu_t and \sigma_t^2, respectively); then it makes more sense to study the distribution (QQ plot) of the standardised return

\tilde{R}_t = (R_t - \mu_t)/\sigma_t.   (4.3)

As a simple example, the mean could be estimated by an AR(1) model (so we would have \mu_t = a + \rho R_{t-1}) and the variance by a GARCH model (so we would have \sigma_t^2 = \omega + \alpha u_{t-1}^2 + \beta \sigma_{t-1}^2, where u_{t-1} is the surprise to the return in t - 1). See Figure 4.6 for an illustration.

[Figure 4.2: Histograms of daily S&P 500 excess returns, 1957:1-2014:3, shown at several scales (number of days per return bin).]

[Figure 4.3: QQ plot of daily S&P 500 returns, 1957:1-2014:3, 0.1st to 99.9th percentiles; empirical quantiles against quantiles from an estimated N(\mu, \sigma^2).]

The skewness, kurtosis and Bera-Jarque test for normality are useful diagnostic tools. They are

skewness = \frac{1}{T} \sum_{t=1}^T [(x_t - \mu)/\sigma]^3 \to_d N(0, 6/T)
kurtosis = \frac{1}{T} \sum_{t=1}^T [(x_t - \mu)/\sigma]^4 \to_d N(3, 24/T)   (4.4)
Bera-Jarque = \frac{T}{6} skewness^2 + \frac{T}{24} (kurtosis - 3)^2 \to_d \chi^2_2.

This is implemented by using the estimated mean and standard deviation. The distributions stated on the right hand side of (4.4) are under the null hypothesis that x_t is iid N(\mu, \sigma^2). The excess kurtosis is defined as the kurtosis minus 3.

The intuition for the \chi^2_2 distribution of the Bera-Jarque test is that both the skewness and kurtosis are, if properly scaled, N(0,1) variables. It can also be shown that they, under the null hypothesis, are uncorrelated. The Bera-Jarque test statistic is therefore a sum of the squares of two uncorrelated N(0,1) variables, which has a \chi^2_2 distribution.
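A quick sketch of (4.4) in Python (on simulated fat-tailed data; scipy also provides this test as scipy.stats.jarque_bera):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.standard_t(df=5, size=2000)          # fat-tailed "returns"

z = (x - x.mean()) / x.std()
T = len(x)
skew, kurt = np.mean(z**3), np.mean(z**4)
bj = T / 6 * skew**2 + T / 24 * (kurt - 3) ** 2
p_value = 1 - stats.chi2.cdf(bj, df=2)
print(bj, p_value)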

[Figure 4.4: QQ plots of daily returns, weekly returns and returns at a longer horizon; empirical quantiles against quantiles from N(\mu, \sigma^2).]

The Bera-Jarque test can also be given an interpretation in GMM. The moment conditions

g(\mu, \sigma^2) = \frac{1}{T} \sum_{t=1}^T [x_t - \mu; (x_t - \mu)^2 - \sigma^2; (x_t - \mu)^3; (x_t - \mu)^4 - 3\sigma^4]   (4.5)

should all be zero if x_t is N(\mu, \sigma^2). We can estimate the two parameters, \mu and \sigma^2, by using the first two moment conditions only, and then test if all four moment conditions are satisfied. It can be shown that this is the same as the Bera-Jarque test if x_t is indeed iid N(\mu, \sigma^2).

[Figure 4.5: QQ plots of 5-minute, hourly and daily returns, 1998:1-2013:11; empirical quantiles against quantiles from N(\mu, \sigma^2).]

[Figure 4.6: QQ plots of standardised residuals from an AR(1) model (u_t/\sigma) and from an AR(1)&GARCH(1,1) model (u_t/\sigma_t), daily S&P 500 returns 1954:1-2014:3, 0.1th to 99.9th percentiles, against quantiles from N(0,1).]

The Kolmogorov-Smirnov test is designed to test whether an empirical distribution function, EDF(x), conforms with a theoretical cdf, F(x). The empirical distribution function is defined as the fraction of observations which are less than or equal to x, that is,

EDF(x) = \frac{1}{T} \sum_{t=1}^T \delta(x_t \leq x), where \delta(q) = 1 if q is true and 0 otherwise.   (4.6)

The EDF(x_t) and F(x_t) are often plotted against the sorted (in ascending order) sample {x_t}_{t=1}^T. See Figure 4.7 for an illustration.

[Figure 4.7: Empirical distribution function of the data 5, 3.5, 4 and the theoretical cdf of N(4,1); at a data point, the correct EDF value is the higher one.]

Example 4.3 (EDF) Suppose we have a sample with three data points: x_1, x_2, x_3 = 5, 3.5, 4. The empirical distribution function is then as in Figure 4.7.

The Kolmogorov-Smirnov test statistic is the largest absolute difference between the EDF and the theoretical cdf,

D_T = \max_x |EDF(x) - F(x)|.   (4.7)

Example 4.4 (Kolmogorov-Smirnov test statistic) Figure 4.7 also shows the cumulative distribution function (cdf) of a normally distributed variable. The test statistic (4.7) is then the largest difference (in absolute terms) of the EDF and the cdf, among the observed values of x_t. See Figure 4.8.

[Figure 4.8: Kolmogorov-Smirnov test: the K-S test statistic sqrt(T) D_T is the length of the longest arrow between the EDF of the data and the cdf of N(4,1).]

We reject the null hypothesis that EDF(x) = F(x) if \sqrt{T} D_T > c, where c is a critical value which can be calculated from

\lim_{T \to \infty} Pr(\sqrt{T} D_T \leq c) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 c^2}.   (4.8)

It can be approximated by replacing \infty with a large number (for instance, 100). For instance, c = 1.35 provides a 5% critical value. See Figure 4.9. There is a corresponding test for comparing two empirical cdfs.
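A sketch of the K-S statistic in Python (on simulated data; scipy.stats.kstest does the same calculation, including the p-value):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.sort(rng.standard_normal(500))
T = len(x)
F = stats.norm.cdf(x)                        # theoretical cdf at the data points
edf_hi = np.arange(1, T + 1) / T             # EDF just at/after each point
edf_lo = np.arange(0, T) / T                 # EDF just before each point
D = max(np.max(np.abs(edf_hi - F)), np.max(np.abs(edf_lo - F)))
print(np.sqrt(T) * D)                        # compare with c = 1.35 (5% level)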

Pearson's \chi^2 test does the same thing as the K-S test, but for a discrete distribution. Suppose you have K categories with N_i values in category i. The theoretical distribution predicts that the fraction p_i should be in category i, with \sum_{i=1}^K p_i = 1. Then

\sum_{i=1}^K \frac{(N_i - T p_i)^2}{T p_i} \to_d \chi^2_{K-1}.   (4.9)

[Figure 4.9: Cdf of the Kolmogorov-Smirnov test statistic, \sqrt{T} D_T. Critical values c: 1.138 (15%), 1.224 (10%), 1.358 (5%), 1.480 (2.5%), 1.628 (1%).]

A normal distribution often fits return data poorly. If we need a distribution, then a mixture of two normals is typically much better, and still fairly simple.

The pdf of this distribution is just a weighted average of two different (bell shaped) pdfs of normal distributions (also called mixture components),

f(x_t; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi) = (1 - \pi) \phi(x_t; \mu_1, \sigma_1^2) + \pi \phi(x_t; \mu_2, \sigma_2^2),   (4.10)

where \phi(x; \mu_i, \sigma_i^2) is the pdf of a normal distribution with mean \mu_i and variance \sigma_i^2. It thus contains five parameters: the means and the variances of the two components and their relative weight (\pi). See Figures 4.10-4.12 for an illustration.

[Figure 4.10: Distribution of daily S&P 500 excess returns, 1957:1-2014:3, with a fitted normal pdf.]

Remark 4.5 (Estimation of the mixture normal pdf) With 2 mixture components, the log likelihood is just

LL = \sum_{t=1}^T \ln f(x_t; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi),

where f() is the pdf in (4.10). A numerical optimization method could be used to maximize this likelihood function. However, this is tricky, so an alternative approach is often used. This is an iterative approach in three steps:

(1) Guess values of \mu_1, \mu_2, \sigma_1^2, \sigma_2^2 and \pi. For instance, pick \mu_1 = x_1, \mu_2 = x_2, \sigma_1^2 = \sigma_2^2 = Var(x_t) and \pi = 0.5.

(2) Calculate

\gamma_t = \frac{\pi \phi(x_t; \mu_2, \sigma_2^2)}{(1-\pi) \phi(x_t; \mu_1, \sigma_1^2) + \pi \phi(x_t; \mu_2, \sigma_2^2)} for t = 1, ..., T.

(3) Calculate (in this order)

\mu_1 = \frac{\sum_{t=1}^T (1-\gamma_t) x_t}{\sum_{t=1}^T (1-\gamma_t)},  \sigma_1^2 = \frac{\sum_{t=1}^T (1-\gamma_t)(x_t - \mu_1)^2}{\sum_{t=1}^T (1-\gamma_t)},
\mu_2 = \frac{\sum_{t=1}^T \gamma_t x_t}{\sum_{t=1}^T \gamma_t},  \sigma_2^2 = \frac{\sum_{t=1}^T \gamma_t (x_t - \mu_2)^2}{\sum_{t=1}^T \gamma_t},  and
\pi = \sum_{t=1}^T \gamma_t / T.

Iterate over (2) and (3) until the parameter values converge. (This is an example of the EM algorithm.) Notice that the calculation of \sigma_i^2 uses \mu_i from the same (not the previous) iteration.

[Figure 4.11: Distribution of daily S&P 500 excess returns, 1957:1-2014:3, with a fitted mixture normal pdf. Component 1: mean 0.03, std 0.66, weight 0.84; component 2: mean -0.04, std 2.01, weight 0.16.]
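A compact sketch of the EM iterations in Remark 4.5, in Python (on simulated mixture data; a fixed number of iterations is used instead of a formal convergence check):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
x = np.where(rng.uniform(size=5000) < 0.2,
             rng.normal(0.0, 2.0, 5000), rng.normal(0.0, 0.7, 5000))

mu1, mu2 = x[0], x[1]                       # step (1): initial guesses
s1, s2 = x.var(), x.var()
pi = 0.5
for _ in range(200):
    f1, f2 = norm.pdf(x, mu1, np.sqrt(s1)), norm.pdf(x, mu2, np.sqrt(s2))
    g = pi * f2 / ((1 - pi) * f1 + pi * f2)          # step (2)
    mu1 = np.sum((1 - g) * x) / np.sum(1 - g)        # step (3)
    s1 = np.sum((1 - g) * (x - mu1) ** 2) / np.sum(1 - g)
    mu2 = np.sum(g * x) / np.sum(g)
    s2 = np.sum(g * (x - mu2) ** 2) / np.sum(g)
    pi = g.mean()
print(mu1, mu2, np.sqrt(s1), np.sqrt(s2), pi)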

A histogram is just a count of the relative number of observations that fall in (pre-specified) non-overlapping intervals. If we also divide by the width of the interval, then the area under the histogram is unity, so the scaled histogram can be interpreted as a density function. Suppose the intervals (bins) are h wide around a point x, that is, from x - h/2 to x + h/2. The scaled histogram at the point x (say, x = 2.3) can then be defined as

\hat{f}(x) = \frac{1}{T} \sum_{t=1}^T \frac{1}{h} \delta(|x_t - x| \leq h/2), where \delta(q) = 1 if q is true and 0 otherwise.   (4.11)

[Figure 4.12: QQ plot of daily S&P 500 returns, 1957:1-2014:3, 0.1th to 99.9th percentiles; empirical quantiles against quantiles from an estimated mixture normal.]

In fact, \frac{1}{h} \delta(|x_t - x| \leq h/2) is the pdf value of a uniformly distributed variable (over the interval x - h/2 to x + h/2). This shows that our estimate of the pdf (here: the histogram) can be thought of as an average of hypothetical pdf values of the data in the neighbourhood of x. However, we can gain efficiency and get a smoother (across x values) estimate by using another density function than the uniform. In particular, using a density function that tapers off continuously instead of suddenly dropping to zero (as the uniform density does) improves the properties. In fact, the N(0, h^2) pdf is often used. The kernel density estimator of the pdf at some point x is then

\hat{f}(x) = \frac{1}{T} \sum_{t=1}^T \frac{1}{h \sqrt{2\pi}} \exp[-\frac{1}{2} (\frac{x_t - x}{h})^2].   (4.12)

Notice that the function in the summation is the density function of a N(x, h^2) distribution.

See Figure 4.13 for an example of the weights.

The value h = 1.06 Std(x_t) T^{-1/5} is sometimes recommended, since it can be shown to be the optimal choice (in MSE sense) if data is normally distributed and the Gaussian kernel is used. (See below for a proof.) The bandwidth h could also be chosen by a leave-one-out cross-validation approach (discussed below).

[Figure 4.13: Weights used for the histogram and for the kernel density estimator, both at x = 4; the histogram uses a uniform weighting over the bin 4 plus/minus 0.25, while the kernel density estimator uses a smooth N() weighting.]

See Figure 4.14 for an example and Figure 4.15 for a cross-validation approach to determine the bandwidth.
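A sketch of (4.12) with the rule-of-thumb bandwidth, in Python (scipy.stats.gaussian_kde is a library alternative):

import numpy as np

rng = np.random.default_rng(9)
x_data = rng.standard_normal(1000)
h = 1.06 * x_data.std() * len(x_data) ** (-1 / 5)    # rule-of-thumb bandwidth

def kernel_pdf(x, data=x_data, h=h):
    """Gaussian kernel density estimate at point(s) x, as in (4.12)."""
    u = (data[None, :] - np.atleast_1d(x)[:, None]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

print(kernel_pdf(np.array([0.0, 1.0])))   # compare with the N(0,1) pdf: 0.399, 0.242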

It can be shown (see Silverman (1986) 3.3) that, under the assumption that x_t is iid, the variance and bias of the estimator at the value x are approximately (for general kernel functions)

Var \hat{f}(x) = \frac{1}{Th} f(x) \int_{-\infty}^{\infty} K(u)^2 du
Bias \hat{f}(x) = \frac{h^2}{2} \frac{d^2 f(x)}{dx^2} \int_{-\infty}^{\infty} K(u) u^2 du.   (4.13)

In these expressions, f(x) is the true density of x and K(u) the kernel (pdf) used as a weighting function for u = (x_t - x)/h. With an N(0,1) kernel these expressions can be simplified to

Var \hat{f}(x) = \frac{1}{Th} f(x) \frac{1}{2\sqrt{\pi}}
Bias \hat{f}(x) = \frac{h^2}{2} \frac{d^2 f(x)}{dx^2}.   (4.14)

Proof. (of (4.14)) We know that

\int_{-\infty}^{\infty} K(u)^2 du = \frac{1}{2\sqrt{\pi}} and \int_{-\infty}^{\infty} K(u) u^2 du = 1,

if K(u) is the density function of a standard normal distribution. (We are effectively using the N(0,1) pdf for the variable (x_t - x)/h.) Use in (4.13).

Proof. (Best bandwidth if f(x) is a normal pdf) Use (4.14) to write the mean integrated squared error (MISE) as

MISE = \int MSE(x) dx = \frac{1}{Th} \frac{1}{2\sqrt{\pi}} \int f(x) dx + \frac{h^4}{4} \int (\frac{d^2 f(x)}{dx^2})^2 dx.

Notice that \int f(x) dx = 1, and if f(x) is the pdf of N(\mu, \sigma^2), then the last integral is 3/(8\sqrt{\pi} \sigma^5). Combining gives

MISE = \frac{1}{Th} \frac{1}{2\sqrt{\pi}} + \frac{h^4}{4} \frac{3}{8\sqrt{\pi} \sigma^5}.

The first order condition for h is

0 = -\frac{1}{Th^2} \frac{1}{2\sqrt{\pi}} + h^3 \frac{3}{8\sqrt{\pi} \sigma^5}, or \frac{4\sigma^5}{3T} = h^5, so

h = (\frac{4}{3T})^{1/5} \sigma \approx 1.06 \sigma T^{-1/5}.

It can then be shown that (with iid data and a Gaussian kernel) the asymptotic distribution is

\sqrt{Th} [\hat{f}(x) - f(x)] \to_d N(0, \frac{1}{2\sqrt{\pi}} f(x)),   (4.15)

provided h is decreased (as T increases) slightly faster than T^{-1/5} (for instance, suppose h = T^{-1.1/5} h_0, where h_0 is a constant). Notice the \sqrt{Th} term on the left hand side (the usual expression in parametric models includes only \sqrt{T}).

Remark 4.6 (Asymptotic bias) The condition that h decreases faster than T^{-1/5} ensures that the bias of \sqrt{Th} \hat{f}(x) vanishes as T \to \infty. This is seen by noticing that the bias in (4.14) is proportional to h^2. Combining gives the bias of \sqrt{Th} \hat{f}(x) as being proportional to T^{1/2} h^{5/2}. If indeed h = T^{-1.1/5} h_0, then we have

T^{1/2} h^{5/2} = T^{-0.05} h_0^{5/2},

which indeed vanishes as T \to \infty.

A leave-one-out cross-validation approach is perhaps the easiest way to find the best bandwidth. Define the integrated squared error (ISE) as \int [f(x) - \hat{f}(x)]^2 dx. (ISE is specific for a sample, while the MISE discussed above is the expected value of ISE across all samples. The difference is small in large samples.) Expand as ISE = \int [f(x)^2 + \hat{f}(x)^2 - 2 f(x) \hat{f}(x)] dx. For the first term, notice that \int f(x)^2 dx does not depend on the bandwidth, so we can treat it as a constant, \kappa. If we use an N(0,1) kernel, then it can be shown that the middle term is

\int \hat{f}(x)^2 dx = \frac{1}{T} \sum_{t=1}^T \hat{g}(x_t), where \hat{g}(x_t) = \frac{1}{T} \sum_{s=1}^T \frac{1}{h} \bar{\phi}(\frac{x_t - x_s}{h}),

where \bar{\phi}(x) is the pdf of a N(0, 2) distribution. The last term is the same as -2 E_x \hat{f}(x), where E_x denotes that the expectation is over which x value that is realized. If we had access to another sample (\tilde{x}_i), then we could plug in those values in the (already estimated) \hat{f}() function to estimate E_x \hat{f}(x) as the average \hat{f}(\tilde{x}_i) value. Instead, we use a leave-one-out approach, which preserves the property that \hat{f}() does not depend on the x_i value: estimate \hat{f}_{-t}(x) by excluding data point t from the sample, and then evaluate it at x = x_t, \hat{f}_{-t}(x_t). (The alternative of using the same sample for both estimation and evaluation would lead to classical overfitting, driving the h value towards zero.) Then, estimate E_x \hat{f}(x) by the average \hat{f}_{-t}(x_t) value. To sum up, pick h to minimize ISE(h), where

ISE(h) = \kappa + \frac{1}{T} \sum_{t=1}^T [\hat{g}(x_t) - 2 \hat{f}_{-t}(x_t)].

The easiest way to handle a bounded support of x is to transform the variable into one with an unbounded support, estimate the pdf for this variable, and then use the "change of variable" technique to transform to the pdf of the original variable.

We can also estimate multivariate pdfs. Let x_t be a d x 1 vector and \hat{\Omega} be the estimated covariance matrix of x_t. We can then estimate the pdf at a point x by using a multivariate Gaussian kernel as

\hat{f}(x) = \frac{1}{T} \sum_{t=1}^T \frac{1}{(2\pi)^{d/2} |H^2 \hat{\Omega}|^{1/2}} \exp[-\frac{1}{2} (x_t - x)' (H^2 \hat{\Omega})^{-1} (x_t - x)].   (4.16)

Notice that the function in the summation is the (multivariate) density function of a N(x, H^2 \hat{\Omega}) distribution. The value H = 1.06 T^{-1/(d+4)} is sometimes recommended.

[Figure 4.14: Kernel density estimate of the pdf of daily S&P 500 returns, 1957:1-2014:3, with the optimal (rule-of-thumb) bandwidth and a higher bandwidth.]

[Figure 4.15: Cross-validation for the bandwidth choice: estimated integrated squared error (ISE) as a function of the bandwidth h, daily S&P 500 returns 1957:1-2014:3, with the rule-of-thumb bandwidth marked.]

Remark 4.8 ((4.16) with d = 1) With just one variable, (4.16) becomes

\hat{f}(x) = \frac{1}{T} \sum_{t=1}^T \frac{1}{H Std(x_t) \sqrt{2\pi}} \exp[-\frac{1}{2} (\frac{x_t - x}{H Std(x_t)})^2],

which is (4.12) with h = H Std(x_t).

Reference: Lo, Mamaysky, and Wang (2000). Topic: is the distribution of the return different after a signal? This paper uses kernel regressions to identify and implement some technical trading rules, and then tests whether the distribution (of the return) after a signal is the same as the unconditional distribution (using Pearson's \chi^2 test and the Kolmogorov-Smirnov test). They reject that hypothesis in many cases, using daily data (1962-1996) for around 50 (randomly selected) stocks.

[Figure 4.16: Price with short and long moving averages, MA(3) and MA(25) (bandwidth 0.01), Jan-Apr 1999; circles at the bottom (top) margin indicate buys (sells); legend: long MA (-), long MA (+), short MA.]

[Figure 4.17: Distribution of daily S&P 500 returns, 1990:1-2014:3, for all days (mean 0.03, std 1.16) and, for the inverted MA rule, after a buy signal (mean 0.06, std 1.71), after a neutral signal (mean 0.04, std 0.96) and after a sell signal (mean 0.02, std 0.88); dashed lines show a 95% confidence band.]

4.2 Estimating Risk-neutral Distributions from Options

Reference: Breeden and Litzenberger (1978); Cox and Ross (1976), Taylor (2005) 16, Jackwerth (2000), Söderlind and Svensson (1997a) and Söderlind (2000)

A European call option with strike price X has the price

C = E M \max(0, S - X),   (4.17)

where M is the nominal discount factor and S is the price of the underlying asset at the expiration date of the option, k periods from now.

We have seen that the price of a derivative is a discounted risk-neutral expectation of the derivative payoff. For the option it is

C = \exp(-ik) E^* \max(0, S - X).   (4.18)

Example 4.9 (Call prices, three states) Suppose that S only can take three values: 90, 100, and 110; and that the risk-neutral probabilities for these events are: 0.5, 0.4, and 0.1, respectively. We consider three European call option contracts with the strike prices 89, 99, and 109. From (4.18) their prices are (if B = 1)

C(X = 89) = 0.5(90 - 89) + 0.4(100 - 89) + 0.1(110 - 89) = 7
C(X = 99) = 0.5 \times 0 + 0.4(100 - 99) + 0.1(110 - 99) = 1.5
C(X = 109) = 0.5 \times 0 + 0.4 \times 0 + 0.1(110 - 109) = 0.1.

Clearly, with information on the option prices, we could in this case back out what the probabilities are.

With a continuous distribution, (4.18) can be written

C = \exp(-ik) \int_X^{\infty} (S - X) h^*(S) dS,   (4.19)

where i is the per period (annualized) interest rate, so \exp(-ik) = B_k, and h^*(S) is the (univariate) risk-neutral probability density function of the underlying price (not its log). Differentiating (4.19) with respect to the strike price and rearranging gives the risk-neutral distribution function

Pr^*(S \leq X) = 1 + \exp(ik) \frac{\partial C(X)}{\partial X}.   (4.20)

Proof. Differentiating the call price with respect to the strike price gives

\frac{\partial C}{\partial X} = -\exp(-ik) \int_X^{\infty} h^*(S) dS = -\exp(-ik) Pr^*(S > X).

Differentiating once more gives the risk-neutral probability density function of S at S = X,

pdf^*(X) = \exp(ik) \frac{\partial^2 C(X)}{\partial X^2}.   (4.21)

Figure 4.18 shows some data and results for German bond options on one trading date.

(A change of variable approach is used to show the distribution of the log asset price.)

[Figure 4.18: Bund options 6 April 1994 (options on German government bonds, traded on LIFFE, expiring in June 1994). Panels: approximate cdf and approximate pdf against the strike price.]

Approximating the first derivative by a centred difference quotient,

\frac{\partial C}{\partial X} \approx \frac{1}{2} [\frac{C(X_{i+1}) - C(X_i)}{X_{i+1} - X_i} + \frac{C(X_i) - C(X_{i-1})}{X_i - X_{i-1}}],   (4.22)

gives the approximate distribution function. The approximate probability density function, obtained by a second-order difference quotient,

\frac{\partial^2 C}{\partial X^2} \approx [\frac{C(X_{i+1}) - C(X_i)}{X_{i+1} - X_i} - \frac{C(X_i) - C(X_{i-1})}{X_i - X_{i-1}}] / \frac{X_{i+1} - X_{i-1}}{2},   (4.23)

is also shown. The approximate distribution function is decreasing in some intervals,

and the approximate density function has some negative values and is very jagged. This

could possibly be explained by some aberrations of the option prices, but more likely

by the approximation of the derivatives: changing approximation method (for instance,

from centred to forward difference quotient) can have a strong effect on the results, but

all methods seem to generate strange results in some interval. This suggests that it might

be important to estimate an explicit distribution. That is, to impose enough restrictions on

the results to guarantee that they are well behaved.
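A sketch of (4.20)-(4.23) in Python, backing out an approximate pdf from call prices on a strike grid. The "data" here are generated from the Black-Scholes formula (so the true answer is a lognormal density), with i = 0 for simplicity and illustrative parameter values:

import numpy as np
from scipy.stats import norm

S0, sigma, k = 100.0, 0.2, 0.5
X = np.linspace(60, 160, 101)                     # strike grid
d1 = (np.log(S0 / X) + 0.5 * sigma**2 * k) / (sigma * np.sqrt(k))
d2 = d1 - sigma * np.sqrt(k)
C = S0 * norm.cdf(d1) - X * norm.cdf(d2)          # call prices (zero interest rate)

cdf = 1.0 + np.gradient(C, X)                     # as in (4.20), with exp(ik) = 1
pdf = np.gradient(np.gradient(C, X), X)           # second difference, as in (4.23)
print(cdf[50], pdf[50])                           # risk-neutral cdf and pdf at X = 110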

Assume that the joint distribution of the logs of M and S, conditional on the information today, is a mixture of n bivariate normal distributions (see Söderlind and Svensson (1997b)). Let \phi(x; \mu, \Omega) denote a normal multivariate density function over x with mean vector \mu and covariance matrix \Omega. The weight of the jth normal distribution is \alpha^{(j)}, so the probability density function is

pdf([\ln M; \ln S]) = \sum_{j=1}^n \alpha^{(j)} \phi([\ln M; \ln S]; [\mu_m^{(j)}; \mu_s^{(j)}], [\sigma_{mm}^{(j)}, \sigma_{ms}^{(j)}; \sigma_{ms}^{(j)}, \sigma_{ss}^{(j)}]),   (4.24)

with \sum_{j=1}^n \alpha^{(j)} = 1. One interpretation of mixing normal distributions is that they represent different macroeconomic states, where the weight is interpreted as the probability of state j.

[Figure 4.19: Bund options 23 February and 3 March 1994 (options expiring in June 1994). Panels: implied volatilities against the strike price (expressed as yield to maturity, %) and estimated pdfs (normal and mixture of normals) over the yield to maturity.]

Let \Phi() be the standardized (univariate) normal distribution function. If \mu_m^{(j)} = \mu_m and \sigma_{mm}^{(j)} = \sigma_{mm} in (4.24), then the marginal distribution of the log SDF is Gaussian (while that of the underlying asset price is not). In this case the European call option price (4.17) has a closed form solution in terms of the spot interest rate, strike price, and the parameters of the bivariate distribution (see footnote 1):

C = \exp(-ik) \sum_{j=1}^n \alpha^{(j)} [ \exp(\mu_s^{(j)} + \sigma_{ms}^{(j)} + \sigma_{ss}^{(j)}/2) \Phi(\frac{\mu_s^{(j)} + \sigma_{ms}^{(j)} + \sigma_{ss}^{(j)} - \ln X}{\sqrt{\sigma_{ss}^{(j)}}}) - X \Phi(\frac{\mu_s^{(j)} + \sigma_{ms}^{(j)} - \ln X}{\sqrt{\sigma_{ss}^{(j)}}}) ].   (4.25)

(For a proof, see Söderlind and Svensson (1997b).) Notice that this is like using the physical distribution, but with \mu_s^{(j)} + \sigma_{ms}^{(j)} instead of \mu_s^{(j)}.

Notice also that this is a weighted average of the option prices that would hold in each state,

C = \sum_{j=1}^n \alpha^{(j)} C^{(j)}.   (4.26)

A forward contract written in t stipulates that, in period \tau, the holder of the contract gets one asset and pays F. This can be thought of as an option with a zero strike price and no discounting, and it is also the mean of the risk-neutral distribution. The forward price then follows directly from (4.25) as

F = \sum_{j=1}^n \alpha^{(j)} \exp(\mu_s^{(j)} + \sigma_{ms}^{(j)} + \sigma_{ss}^{(j)}/2).   (4.27)

There are several reasons for assuming a mixture of normal distributions. First, non-parametric methods often generate strange results, so we need to assume some parametric distribution. Second, it gives closed form solutions for the option and forward prices, which is very useful in the estimation of the parameters. Third, it gives the Black-Scholes model as a special case when n = 1.

To see the latter, let n = 1 and use the forward price from (4.27), F = \exp(\mu_s + \sigma_{ms} + \sigma_{ss}/2), in the option price (4.25) to get

C = \exp(-ik) F \Phi(d) - \exp(-ik) X \Phi(d - \sqrt{\sigma_{ss}}), with d = \frac{\ln(F/X) + \sigma_{ss}/2}{\sqrt{\sigma_{ss}}}.   (4.28)

Footnote 1: Without these restrictions, \alpha^{(j)} in (4.25) is replaced by \tilde{\alpha}^{(j)} = \alpha^{(j)} \exp(\mu_m^{(j)} + \sigma_{mm}^{(j)}/2) / \sum_{k=1}^n \alpha^{(k)} \exp(\mu_m^{(k)} + \sigma_{mm}^{(k)}/2). In this case, \tilde{\alpha}^{(j)}, not \alpha^{(j)}, will be estimated from option data.

[Figure 4.20: Option-implied risk-neutral pdfs of the CHF/EUR exchange rate, from CHF/EUR options, on four dates: 02-Mar-2009, 16-Mar-2009, 16-Nov-2009 and 17-May-2010.]

We want to estimate the marginal distribution of the future asset price, S. From (4.24), it is a mixture of univariate normal distributions with weights \alpha^{(j)}, means \mu_s^{(j)}, and variances \sigma_{ss}^{(j)}. The basic approach is to back out these parameters from data on option and forward prices by exploiting the pricing relations (4.25)-(4.27). For that we need data on at least as many different strike prices as there are parameters to estimate.

Remark 4.10 Figures 4.18-4.19 show some data and results (assuming a mixture of two normal distributions) for German bond options around the announcement of the very high money growth rate on 2 March 1994.

Remark 4.11 Figures 4.20-4.22 show results for the CHF/EUR exchange rate around the period of active (Swiss) central bank interventions on the currency market.

Remark 4.12 (Robust measures of the standard deviation and skewness) Let P_\alpha be the \alpha-th quantile (for instance, \alpha = 0.1) of a distribution. A simple robust measure of the standard deviation is just

Std = P_{1-\alpha} - P_\alpha,

where it is assumed that \alpha < 0.5. Sometimes this measure is scaled so it would give the right answer for a normal distribution. For instance, with \alpha = 0.1, the measure would be divided by 2.56 and for \alpha = 0.25 by 1.35.

One of the classical robust skewness measures was suggested by Hinkley,

Skew = \frac{(P_{1-\alpha} - P_{0.5}) - (P_{0.5} - P_\alpha)}{P_{1-\alpha} - P_\alpha}.

This skewness measure can only take on values between -1 (when P_{1-\alpha} = P_{0.5}) and 1 (when P_\alpha = P_{0.5}). When the median is just between the two percentiles (P_{0.5} = (P_{1-\alpha} + P_\alpha)/2), then it is zero.

[Figure 4.21: CHF/EUR exchange rate: option-implied 3-month 80% confidence band and the forward rate, 2009-2010.]

[Figure 4.22: Robust standard deviation and skewness (based on the 10th and 90th percentiles) of the option-implied CHF/EUR distribution, 2009-2010.]

4.3 Threshold Exceedance and Tail Distribution

In risk control, the focus is on the distribution of losses beyond some threshold level. This has three direct implications. First, the object under study is the loss

X = -R,   (4.29)

that is, the negative of the return. Second, the attention is on how the distribution looks beyond a threshold, and also on the probability of exceeding this threshold. In contrast, the exact shape of the distribution below that point is typically disregarded. Third, modelling the tail of the distribution is best done by using a distribution that allows for a much heavier tail than suggested by a normal distribution. The generalized Pareto (GP) distribution is often used. See Figure 4.23 for an illustration.

[Figure 4.23: The distribution of the loss: beyond the threshold u, the tail has an unknown shape.]

Remark 4.13 (Cdf and pdf of the generalized Pareto distribution) The generalized Pareto distribution is described by a scale parameter (\beta > 0) and a shape parameter (\xi). The cdf and pdf are

G(z) = 1 - (1 + \xi z/\beta)^{-1/\xi} if \xi \neq 0, and G(z) = 1 - \exp(-z/\beta) if \xi = 0,
g(z) = \frac{1}{\beta} (1 + \xi z/\beta)^{-1/\xi - 1} if \xi \neq 0, and g(z) = \frac{1}{\beta} \exp(-z/\beta) if \xi = 0,

for z \geq 0. The mean is defined (finite) if \xi < 1 and is then E(Z) = \beta/(1-\xi). Similarly, the variance is finite if \xi < 1/2 and is then Var(Z) = \beta^2 / [(1-\xi)^2 (1-2\xi)]. See Figure 4.24 for an illustration.

By inverting the cdf, we can notice that if u is uniformly distributed on (0, 1], then we can construct random variables with a GPD by

z = \frac{\beta}{\xi} [(1-u)^{-\xi} - 1] if \xi \neq 0, and z = -\beta \ln(1-u) if \xi = 0.

Consider the loss X (the negative of the return) and let u be a threshold. Assume that the threshold exceedance (X - u) has a generalized Pareto distribution. Let P_u be the probability of X \leq u, that is, P_u = Pr(X \leq u). Then, the cdf of the loss for values greater than the threshold (Pr(X \leq x) for x > u) can be written

F(x) = P_u + (1 - P_u) G(x - u), for x > u,   (4.30)

where G(z) is the cdf of the generalized Pareto distribution. Notice that the cdf value is P_u at x = u (or just slightly above u), and that it becomes one as x goes to infinity.

Clearly, the pdf is

f(x) = (1 - P_u) g(x - u), for x > u,   (4.31)

where g(z) is the pdf of the generalized Pareto distribution. Notice that integrating the pdf from x = u to infinity shows that the probability mass of X above u is 1 - P_u. Since the probability mass below u is P_u, it adds up to unity (as it should). See Figure 4.26 for an illustration.

[Figure 4.24: Pdf of the generalized Pareto distribution (\beta = 0.15) for \xi = 0, 0.25 and 0.45.]

It is often of interest to calculate the tail probability Pr(X > x), which in the case of the cdf in (4.30) is

1 - F(x) = (1 - P_u) [1 - G(x - u)].   (4.32)

The VaR_\alpha (say, \alpha = 95%) is the \alpha-th quantile of the loss distribution,

VaR_\alpha = cdf_X^{-1}(\alpha),   (4.33)

where cdf_X^{-1}() is the inverse cumulative distribution function of the losses. That is, VaR_\alpha is the \alpha quantile of the loss distribution. For instance, VaR_{95%} is the 0.95 quantile of the

loss distribution. This clearly means that the probability that the loss is less than VaR_\alpha equals

Pr(X \leq VaR_\alpha) = \alpha.   (4.34)

[Figure 4.25: Comparison of a normal distribution, N(0.08, 0.16^2), and a generalized Pareto distribution (\xi = 0.22, \beta = 0.16) for the tail of losses (loss > 12, with Pr(loss > 12) = 10%, losses in %). VaR(95%) and ES(95%): 18.2 and 25.3 for the normal distribution; 24.5 and 48.4 for the GP distribution.]

Assuming \alpha \geq P_u (so that VaR_\alpha \geq u), the cdf (4.30) together with the form of the generalized Pareto distribution give

VaR_\alpha = u + \frac{\beta}{\xi} [(\frac{1-\alpha}{1-P_u})^{-\xi} - 1] if \xi \neq 0, and VaR_\alpha = u - \beta \ln(\frac{1-\alpha}{1-P_u}) if \xi = 0, for \alpha \geq P_u.   (4.35)

Proof. (of (4.35)) Set F(x) = \alpha in (4.30) and use z = x - u in the cdf from Remark 4.13 and solve for x.

If we assume \xi < 1 (to make sure that the mean is finite), then straightforward integration using (4.31) shows that the expected shortfall is

ES_\alpha = E(X | X \geq VaR_\alpha) = \frac{VaR_\alpha}{1-\xi} + \frac{\beta - \xi u}{1-\xi}, for \alpha > P_u and \xi < 1.   (4.36)

The expected exceedance of a GP distribution (with \xi < 1) over any threshold \nu > u is

e(\nu) = E(X - \nu | X > \nu) = \frac{\xi \nu}{1-\xi} + \frac{\beta - \xi u}{1-\xi}, for \nu > u and \xi < 1.   (4.37)

Proof. (of (4.37)) Substitute \nu for VaR_\alpha in the expected shortfall (4.36),

E(X | X > \nu) = \frac{\nu}{1-\xi} + \frac{\beta - \xi u}{1-\xi},

and subtract \nu from both sides to get (4.37).
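A sketch of (4.35)-(4.36) in Python, with the parameter values from Figure 4.25 (losses expressed as fractions here, so the printed numbers correspond roughly to the VaR(95%) and ES(95%) in that figure):

u, P_u = 0.12, 0.90          # threshold and Pr(X <= u)
xi, beta = 0.22, 0.16        # GP shape and scale
alpha = 0.95

VaR = u + beta / xi * (((1 - alpha) / (1 - P_u)) ** (-xi) - 1)   # (4.35)
ES = VaR / (1 - xi) + (beta - xi * u) / (1 - xi)                 # (4.36)
print(VaR, ES)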

The expected exceedance of a generalized Pareto distribution (with 0 < \xi < 1) is increasing with the threshold level \nu. This indicates that the tail of the distribution is very long. In contrast, a normal distribution would typically show a negative relation (see Figure 4.26 for an illustration). This provides a way of assessing which distribution best fits the tail of the historical histogram. For a normal distribution, X \sim N(\mu, \sigma^2), the corresponding expression is

E(X - \nu | X > \nu) = \mu - \nu + \sigma \frac{\phi(\nu_0)}{1 - \Phi(\nu_0)}, with \nu_0 = (\nu - \mu)/\sigma,

where \phi() and \Phi() are the pdf and cdf of a N(0,1) variable respectively.

The expected exceedance over \nu is often compared with an empirical estimate of the same thing: the mean of X_t - \nu for those observations where X_t > \nu,

\hat{e}(\nu) = \frac{\sum_{t=1}^T (X_t - \nu) \delta(X_t > \nu)}{\sum_{t=1}^T \delta(X_t > \nu)}, where \delta(q) = 1 if q is true and 0 otherwise.   (4.38)

If it is found that \hat{e}(\nu) is increasing (more or less) linearly with the threshold level (\nu), as in (4.37), then it is reasonable to model the tail of the distribution from that point as a generalized Pareto distribution.

The estimation of the parameters of the distribution (\xi and \beta) is typically done by maximum likelihood. Alternatively, a comparison of the empirical exceedance (4.38) with the theoretical (4.37) can help. Suppose we calculate the empirical exceedance for

different values of the threshold level (denoted \nu_i, all large enough so the relation looks linear); then we can estimate (by LS)

\hat{e}(\nu_i) = a + b \nu_i + \epsilon_i.   (4.39)

Then, the theoretical exceedance (4.37) for a given starting point of the GPD, u, is related to this regression according to

a = \frac{\beta - \xi u}{1-\xi} and b = \frac{\xi}{1-\xi}, or \xi = \frac{b}{1+b} and \beta = a(1-\xi) + \xi u.   (4.40)

See Figure 4.27 for an illustration.

[Figure 4.26: Expected exceedance, E(X - \nu | X > \nu), as a function of the threshold \nu (in %), for a N(0.08, 0.16^2) distribution and a generalized Pareto distribution (\xi = 0.22, \beta = 0.16, u = 12).]

Remark 4.16 (Log likelihood function of the loss distribution) Since we have assumed that the threshold exceedance (X - u) has a generalized Pareto distribution, Remark 4.13 shows that the log likelihood for an observation of a loss above the threshold (X_t > u) is

L = \sum_{t s.t. X_t > u} \ln L_t, where
\ln L_t = -\ln \beta - (1/\xi + 1) \ln[1 + \xi (X_t - u)/\beta] if \xi \neq 0, and
\ln L_t = -\ln \beta - (X_t - u)/\beta if \xi = 0.

This likelihood is maximized with respect to (\xi, \beta), while u is typically not estimated but imposed a priori (based on the expected exceedance).

[Figure 4.27: Estimation of a generalized Pareto distribution for the tail of daily S&P 500 returns (losses), 1957:1-2013:12. Panels: the empirical expected exceedance (loss minus threshold) against the threshold \nu (one panel reports u = 1.3, \xi = 0.29, \beta = 0.52), the estimated tail distribution (another reports \xi = 0.24, \beta = 0.57), and a QQ plot of empirical quantiles against quantiles from the estimated GPD.]

Example 4.17 (Estimation of the generalized Pareto distribution on S&P daily returns) Figure 4.27 (upper left panel) shows that it may be reasonable to fit a GP distribution with a threshold u = 1.3. The upper right panel illustrates the estimated distribution, while the lower left panel shows that the highest quantiles are well captured by the estimated distribution.

4.4 Exceedance Correlations

It is often argued that most assets are more strongly correlated in down markets than in up markets. If so, diversification may not be such a powerful tool as we would otherwise believe.

A straightforward way of examining this is to calculate the correlation of two returns (x and y, say) for specific intervals. For instance, we could specify that x_t should be between h_1 and h_2 and y_t between k_1 and k_2,

Corr(x_t, y_t | h_1 \leq x_t \leq h_2, k_1 \leq y_t \leq k_2).   (4.41)

For instance, by setting the lower boundaries (h_1 and k_1) to -\infty and the upper boundaries (h_2 and k_2) to 0, we get the correlation in down markets.

A (bivariate) normal distribution has little probability mass at very low returns, which leads to the correlation being squeezed towards zero as we only consider data far out in the tail. In short, the tail correlation of a normal distribution is always closer to zero than the correlation for all data points. This is illustrated in Figure 4.28.
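A sketch of (4.41) in Python (simulated bivariate normal data, illustrating how the down-market correlation is squeezed below the full-sample correlation of 0.5):

import numpy as np

rng = np.random.default_rng(10)
z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100_000)
x, y = z[:, 0], z[:, 1]

def exceedance_corr(x, y, h1, h2, k1, k2):
    """Correlation of (x, y) when x is in [h1, h2] and y in [k1, k2], as in (4.41)."""
    keep = (x >= h1) & (x <= h2) & (y >= k1) & (y <= k2)
    return np.corrcoef(x[keep], y[keep])[0, 1]

print(exceedance_corr(x, y, -np.inf, 0, -np.inf, 0))   # down-market correlation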

In contrast, Figures 4.29-4.30 suggest (for two US portfolios) that the correlation in the lower tail is almost as high as for all the data and considerably higher than for the

upper tail. This suggests that the relation between the two returns in the tails is not well

described by a normal distribution. In particular, we need to use a distribution that allows

for much stronger dependence in the lower tail. Otherwise, the diversification benefits (in

down markets) are likely to be exaggerated.

4.5 Beyond (Linear) Correlations

The standard correlation (also called Pearson's correlation) measures the linear relation between two variables, that is, to what extent one variable can be explained by a linear function of the other variable (and a constant). That is adequate for most issues in finance, but we sometimes need to capture non-linear relations. It also turns out to be easier to calibrate/estimate copulas (see below) by using other measures of dependency.

[Figure 4.28: Correlation in the lower tail when data is drawn from a bivariate N(0,1) distribution with correlation 0.5 or 0.25 (the correlation of the full bivariate distribution); the tail correlation shrinks towards zero as the upper boundary (the probability of a lower value) decreases.]

Spearman's rank correlation (called Spearman's rho and often denoted \rho_S) measures to what degree two variables have a monotonic relation: it is the correlation of their

respective ranks. It measures if one variable tends to be high when the other also is, without imposing the restriction that this relation must be linear.

It is computed in two steps. First, the data is ranked from the smallest (rank 1) to the largest (ranked T, where T is the sample size). Ties (when two or more observations have the same values) are handled by averaging the ranks. The following illustrates this for two variables:

x_t   rank(x_t)   y_t   rank(y_t)
 2      2.5        7      2
10      4          8      3          (4.42)
-3      1          2      1
 2      2.5       10      4

In the second step, simply estimate the correlation of the ranks of the two variables,

Spearman's \rho = Corr[rank(x_t), rank(y_t)].   (4.43)

(Alternatively, there is an exact formula for the rank correlation based on the difference of the ranks, d_t = rank(x_t) - rank(y_t): \rho_S = 1 - 6 \sum_{t=1}^T d_t^2 / (T^3 - T). It gives the same result if there are no tied ranks.) See Figure 4.31 for an illustration.
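A sketch of (4.42)-(4.43) in Python, using the numbers above:

import numpy as np
from scipy import stats

x = np.array([2.0, 10.0, -3.0, 2.0])
y = np.array([7.0, 8.0, 2.0, 10.0])
rx, ry = stats.rankdata(x), stats.rankdata(y)    # ties get averaged ranks, as in (4.42)
rho_s = np.corrcoef(rx, ry)[0, 1]                # (4.43)
print(rx, ry, rho_s)
print(stats.spearmanr(x, y)[0])                  # the same number from scipy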

[Figure 4.29: Extreme returns of two portfolios (small and large stocks, in %); lines mark the 5th and 95th percentiles. Correlations and frequencies: all data 0.69 (freq 1.00), mid 0.52 (0.84), high 0.50 (0.02).]

The rank correlation can be tested by using the fact that under the null hypothesis the rank correlation is zero. We then get

\sqrt{T-1} \hat{\rho}_S \to_d N(0, 1).   (4.44)

(For samples of 20 to 40 observations, it is often recommended to use \sqrt{(T-2)/(1-\hat{\rho}_S^2)} \hat{\rho}_S, where \hat{\rho}_S denotes the estimated Spearman's \rho. This has a t_{T-2} distribution.)

Remark 4.18 (Spearman's \rho for a distribution) If we have specified the joint distribution of the random variables X and Y, then we can also calculate the implied Spearman's \rho (sometimes only numerically) as Corr[F_X(X), F_Y(Y)], where F_X(X) is the cdf of X and F_Y(Y) of Y.

Kendall's \tau is a bit different: it is based on comparing changes of x_t (compared to x_1, ..., x_{t-1}) with the corresponding changes of y_t. For instance, with three data points ((x_1, y_1), (x_2, y_2), (x_3, y_3)) we first calculate

Changes of x      Changes of y
x_2 - x_1         y_2 - y_1
x_3 - x_1         y_3 - y_1          (4.45)
x_3 - x_2         y_3 - y_2,

[Figure 4.30: Lower and upper tail correlations of two US equity portfolios (from 10 size deciles), daily US data 1979:1-2013:12, as functions of the quantile level that defines the tail.]

[Figure 4.31: Scatter plots of four simulated data sets with different dependence patterns, reporting the linear correlation, Spearman's \rho and Kendall's \tau for each: for instance, Corr = 0.90, \rho_S = 0.88, \tau = 0.69 in one panel and Corr = 0.03, \rho_S = 0.03, \tau = 0.01 in another; the remaining panels have \rho_S = 0.84, \tau = 0.65 and \rho_S = 1.00, \tau = 1.00.]

which gives T(T-1)/2 (here 3) pairs. Then, we investigate if the pairs are concordant (same sign of the change of x and y) or discordant (different signs),

ij is concordant if (x_j - x_i)(y_j - y_i) > 0
ij is discordant if (x_j - x_i)(y_j - y_i) < 0.   (4.46)

Finally, we count the number of concordant (T_c) and discordant (T_d) pairs and calculate Kendall's tau as

Kendall's \tau = \frac{T_c - T_d}{T(T-1)/2}.   (4.47)

It can be shown that

Kendall's \tau \to_d N(0, \frac{4T + 10}{9T(T-1)}),   (4.48)

so it is straightforward to test \tau by a t-test.

Example 4.19 (Kendall's tau) Suppose the data is

x     y
 2     7
10     9
-3    10.

We then get

Changes of x                Changes of y
x_2 - x_1 = 10 - 2 = 8      y_2 - y_1 = 9 - 7 = 2      concordant
x_3 - x_1 = -3 - 2 = -5     y_3 - y_1 = 10 - 7 = 3     discordant
x_3 - x_2 = -3 - 10 = -13   y_3 - y_2 = 10 - 9 = 1,    discordant.

Kendall's tau is therefore

\tau = \frac{1 - 2}{3(3-1)/2} = -\frac{1}{3}.
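A sketch of (4.47) on the numbers in the example, in Python:

import numpy as np
from scipy import stats

x = np.array([2.0, 10.0, -3.0])
y = np.array([7.0, 9.0, 10.0])

Tc = Td = 0
for i in range(len(x)):                    # count concordant/discordant pairs
    for j in range(i + 1, len(x)):
        s = (x[j] - x[i]) * (y[j] - y[i])
        Tc += s > 0
        Td += s < 0
tau = (Tc - Td) / (len(x) * (len(x) - 1) / 2)   # (4.47)
print(tau)                                 # -1/3
print(stats.kendalltau(x, y)[0])           # the same number from scipy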

If x and y actually have a bivariate normal distribution with correlation \rho, then it can be shown that on average we have

Spearman's \rho = \frac{6}{\pi} \arcsin(\rho/2)   (4.49)
Kendall's \tau = \frac{2}{\pi} \arcsin(\rho).   (4.50)

In this case, all three measures give similar messages (although Kendall's \tau tends to be lower than the linear correlation and Spearman's rho). This is illustrated in Figure 4.32. Clearly, when data is not normally distributed, these measures can give distinctly different answers.

1

Spearmans

Kendalls

0.8 The variables have joint normal

distributions with correlations

according to the horizontal axis

0.6

0.4

0.2

0

0 0.2 0.4 0.6 0.8 1

Correlation

Figure 4.32: Spearmans rho and Kendalls tau if data has a bivariate normal distribution

A joint -quantile exceedance probability measures how often two random variables

(x and y, say) are both above their quantile. Similarly, we can also define the probability

that they are both below their quantile

In practice, this can be estimated from data by first finding the empirical -quantiles

(x; and Oy; ) by simply sorting the data and then picking out the value of observation

O

T of this sorted list (do this individually for x and y). Then, calculate the estimate

1 XT

GO D t ; where (4.52)

T t D1

(

t D

0 otherwise.

97

Pr(x < Q(x, q), y < Q(y, q)), Gauss Pr(x < Q(x, q), y < Q(y, q)), Gauss

100 5

corr = 0.25

corr = 0.5 4

3

%

%

50

2

1

0 0

0 50 100 0 5 10

Quantile level (q), % Quantile level (q), %

4.6 Copulas

Reference: McNeil, Frey, and Embrechts (2005), Alexander (2008) 6, Jondeau, Poon, and

Rockinger (2007) 6

Portfolio choice and risk analysis depend crucially on the joint distribution of asset

returns. Empirical evidence suggest that many returns have non-normal distribution, es-

pecially when we focus on the tails. There are several ways of estimating complicated

joint (non-normal) distributions: using copulas is one. This approach has the advantage

that it proceeds in two steps: first we estimate the marginal distribution of each return

separately, then we model the comovements by a copula.

ui D Fi .xi /;

where c./ is a copula density function and ui D Fi .xi / is short-hand notation for the cdf

value as in (4.1). The extension to three or more random variables is straightforward.

Equation (4.53) means that if we know the joint pdf f1;2 .x1 ; x2 /and thus also the

cdfs F1 .x1 / and F2 .x2 /then we can figure out what the copula density function must

be. Alternatively, if we know the marginal (univariate) pdfs f1 .x1 / and f2 .x2 /and thus

also the cdfs F1 .x1 / and F2 .x2 /and the copula function, then we can construct the joint

distribution. (This is called Sklars theorem.) This latter approach will turn out to be

98

useful.

The correlation of x1 and x2 depends on both the copula and the marginal distribu-

tions. In contrast, both Spearmans rho and Kendalls tau are determined by the copula

only. They therefore provide a way of calibrating/estimating the copula without having to

involve the marginal distributions directly.

Example 4.20 (Independent X and Y ) If X and Y are independent, then we know that

f1;2 .x1 ; x2 / D f1 .x1 /f2 .x2 /, so the copula density function is just a constant equal to

one.

Remark 4.21 (Joint cdf) A joint cdf of two random variables (X1 and X2 ) is defined as

This cdf is obtained by integrating the joint pdf f1;2 .x1 ; x2 / over both variables

Z x1 Z x2

F1;2 .x1 ; x2 / D f1;2 .s; t/dsdt:

sD 1 tD 1

(Conversely, the pdf is the mixed derivative of the cdf, f1;2 .x1 ; x2 / D @2 F1;2 .x1 ; x2 /=@x1 @x2 .)

See Figure 4.34 for an illustration.

Remark 4.22 (From joint to univariate pdf) The pdf of x1 (also called the marginal pdf

R1

of x1 ) can be calculate from the joint pdf as f1 .x1 / D x2 D 1 f1;2 .x1 ; x2 /dx2 .

pdf of bivariate N() distribution, corr = 0.8 cdf of bivariate N() distribution, corr = 0.8

0.2 1

0.1 0.5

2 2

0 0

2 0 2 0

0 x 0 x

y 2 2 y 2 2

99

Remark 4.23 (Joint pdf and copula density, n variables) For n variables (4.53) general-

izes to

ui D Fi .xi /;

Remark 4.24 (Cdfs and copulas ) The joint cdf can be written as

where C./ is the unique copula function. Taking derivatives gives (4.53) where

@2 C.u1 ; u2 /

c.u1 ; u2 / D :

@u1 @u2

Notice the derivatives are with respect to ui D Fi .xi /, not xi . Conversely, integrating the

density over both u1 and u2 gives the copula function C./.

2 2

1 21 2 C 2 22

1

c.u1 ; u2 / D p exp , with (4.54)

1 2 2.1 2 /

1

i D .ui /;

Notice that when using this function in (4.53) to construct the joint pdf, we have to

first calculate the cdf values ui D Fi .xi / from the univariate distribution of xi (which

may be non-normal) and then calculate the quantiles of those according to a standard

normal distribution i D 1 .ui / D 1 Fi .xi /. This is used in (4.54) (and finally in

(4.53)). See Figure 4.35 for an illustration.

It can be shown that assuming that the marginal pdfs (f1 .x1 / and f2 .x2 /) are normal

and then combining with the Gaussian copula density recovers a bivariate normal dis-

tribution. However, the way we typically use copulas is to assume (and estimate) some

other type of univariate distribution, for instance, with fat tailsand then combine with a

(Gaussian) copula density to create the joint distribution.

A zero correlation ( D 0) makes the copula density (4.54) equal to unityso the

joint density is just the product of the marginal densities. A positive correlation makes the

100

copula density high when both x1 and x2 deviate from their means in the same direction.

The easiest way to calibrate a Gaussian copula is therefore to set

as suggested by (4.49).

Alternatively, the parameter can be calibrated to give a joint probability of both x1

and x2 being lower than some quantileto match the properties of data: see (4.52). The

value of this probability (according to a copula) is easily calculated by finding the copula

function (essentially the cdf) corresponding to a copula density. Some results are given in

remarks below. See Figure 4.33 for results from a Gaussian copula. This figure shows that

a higher correlation implies a larger probability that both variables are very lowbut that

the probabilities quickly become very small as we move towards lower quantiles (lower

returns).

Remark 4.25 (The Gaussian copula function ) The distribution function corresponding

to the Gaussian copula density (4.54) is obtained by integrating over both u1 and u2 and

the value is C.u1 ; u"2 I #

/ D" .1 ; 2 / where i is defined in (4.54) and is the bivariate

#!

0 1

normal cdf for N ; . Most statistical software contains numerical routines

0 1

for calculating this cdf.

Remark 4.26 (Multivariate Gaussian copula density ) The Gaussian copula density for

n variables is

1 1 0

c.u/ D p exp .R 1

In / ;

jRj 2

where R is the correlation matrix with determinant jRj and is a column vector with

i D 1 .ui / as the i th element.

The Gaussian copula is useful, but it has the drawback that it is symmetricso the

downside and the upside look the same. This is at odds with evidence from many financial

markets that show higher correlations across assets in down markets. The Clayton copula

density is therefore an interesting alternative

c.u1 ; u2 / D . 1 C u1 C u2 / 2 1=

.u1 u2 / 1

.1 C /; (4.56)

where 0. When > 0, then correlation on the downside is much higher than on the

upside (where it goes to zero as we move further out the tail).

101

See Figure 4.35 for an illustration.

For the Clayton copula we have

Kendalls D , so (4.57)

C2

2K

D ; (4.58)

1 K

where K denotes Kendalls . The easiest way to calibrate a Clayton copula is therefore

to set the parameter according to (4.58).

Figure 4.36 illustrates how the probability of both variables to be below their respec-

tive quantiles depend on the parameter. These parameters are comparable to the those

for the correlations in Figure 4.33 for the Gaussian copula, see (4.49)(4.50). The figure

are therefore comparableand the main point is that Claytons copula gives probabilities

of joint low values (both variables being low) that do not decay as quickly as according to

the Gaussian copulas. Intuitively, this means that the Clayton copula exhibits much higher

correlations in the lower tail than the Gaussian copula doesalthough they imply the

same overall correlation. That is, according to the Clayton copula more of the overall

correlation of data is driven by synchronized movements in the left tail. This could be

interpreted as if the correlation is higher in market crashes than during normal times.

Remark 4.27 (Multivariate Clayton copula density ) The Clayton copula density for n

variables is

Pn n 1= Qn 1 Qn

c.u/ D 1 nC C .i

i D1 ui i D1 ui i D1 1 1/ :

Remark 4.28 (Clayton copula function ) The copula function (the cdf) corresponding to

(4.56) is

C.u1 ; u2 / D . 1 C u1 C u2 / 1= :

The following steps summarize how the copula is used to construct the multivariate

distribution.

1. Construct the marginal pdfs fi .xi / and thus also the marginal cdfs Fi .xi /. For in-

stance, this could be done by fitting a distribution with a fat tail. With this, calculate

the cdf values for the data ui D Fi .xi / as in (4.1).

2. Calculate the copula density as follows (for the Gaussian or Clayton copulas, re-

spectively):

102

Gaussian copula density, corr = -0.5 Gaussian copula density, corr = 0

5 5

0 0

2 2

0 0

x2 2 0 2 x2 2 2

2 x1 2 x1 0

Gaussian copula density, corr = 0.5 Clayton copula density, = 1.5 ( = 0.43)

5 5

0 0

2 2

0 0

x2 2 0 2 x2 2 2

2 x1 2 x1 0

Pr(x < Q(x, q), y < Q(y, q)), Gauss Pr(x < Q(x, q), y < Q(y, q)), Clayton

5 5

corr = 0.25 = 0.38

4 corr = 0.5 4 = 1.00

3 3

%

2 2

1 1

0 0

0 5 10 0 5 10

Quantile level (q), % Quantile level (q), %

The values are calibrated

to give correlations of

0.25 and 0.5

i. assume (or estimate/calibrate) a correlation to use in the Gaussian cop-

ula

103

ii. calculate i D 1

.ui /, where 1

./ is the inverse of a N.0; 1/ distribu-

tion

iii. combine to get the copula density value c.u1 ; u2 /

(b) for the Clayton copula (4.56)

i. assume (or estimate/calibrate) an to use in the Clayton copula (typically

based on Kendalls as in (4.58))

ii. calculate the copula density value c.u1 ; u2 /

3. Combine the marginal pdfs and the copula density as in (4.53), f1;2 .x1 ; x2 / D

c.u1 ; u2 /f1 .x1 /f2 .x2 /, where ui D Fi .xi / is the cdf value according to the marginal

distribution of variable i .

Joint pdf, Gaussian copula, corr = -0.5 Joint pdf, Gaussian copula, corr = 0

2 2

1 1

x2

x2

0 0

1 1

2 2

2 1 0 1 2 2 1 0 1 2

x1 x1

Notice: marginal distributions are N(0,1)

Joint pdf, Gaussian copula, corr = 0.5 Joint pdf, Clayton copula, = 1.5

2 2

1 1

x2

x2

0 0

1 1

2 2

2 1 0 1 2 2 1 0 1 2

x1 x1

104

Joint pdf, Gaussian copula, corr = -0.5 Joint pdf, Gaussian copula, corr = 0

2 2

1 1

x2

x2

0 0

1 1

2 2

2 1 0 1 2 2 1 0 1 2

x1 x1

Notice: marginal distributions are t5

Joint pdf, Gaussian copula, corr = 0.5 Joint pdf, Clayton copula, = 1.5

2 2

1 1

x2

x2

0 0

1 1

2 2

2 1 0 1 2 2 1 0 1 2

x1 x1

Remark 4.29 (Tail Dependence ) The measure of lower tail dependence starts by finding

the probability that X1 is lower than its qth quantile (X1 F1 1 .q/) given that X2 is

lower than its qth quantile (X2 F2 1 .q/)

It can be shown that a Gaussian copula gives zero or very weak tail dependence,

unless the correlation is 1. It can also be shown that the lower tail dependence of the

Clayton copula is

l D 2 1= if > 0

105

4.7 Joint Tail Distribution

The methods for estimating the (marginal) distribution of the lower tail for each return can

be combined with a copula to model the joint tail distribution. In particular, combining the

generalized Pareto distribution (GPD) with the Clayton copula provides a flexible way.

This can be done by first modelling the loss (X t D R t ) beyond some threshold (u),

that is, the variable X t u with the GDP. To get a distribution of the return, we simply use

the fact that pdfR . z/ D pdfX .z/ for any value z. Then, in a second step we calibrate the

copula by using Kendalls for the subsample when both returns are less than u. Figures

4.394.41 provide an illustration.

Remark 4.30 Figure 4.39 suggests that the joint occurrence (of these two assets) of re-

ally negative returns happens more often than the estimated normal distribution would

suggest. For that reason, the joint distribution is estimated by first fitting generalized

Pareto distributions to each of the series and then these are combined with a copula as in

(4.41) to generate the joint distribution. In particular, the Clayton copula seems to give a

long joint negative tail.

100 5

Data

estimated N() 4

3

%

50

2

1

0 0

0 50 100 0 5 10

Quantile level, % Quantile level, %

small stocks and large stocks

To find the implication for a portfolio of several assets with a given joint tail distribu-

tion, we often resort to simulations. That is, we draw random numbers (returns for each

of the assets) from the joint tail distribution and then study the properties of the portfolio

(with say, equal weights or whatever). The reason we simulate is that it is very hard to

106

Loss distribution, small stocks Loss distribution, large stocks

u = 0.5, Pr(loss > u) = 17.8% u = 0.5, Pr(loss > u) = 24.6%

0.2 = 0.20, = 0.61 0.2 = 0.12, = 0.67

0.15 Daily US data 1979:12013:12 0.15

0.1 0.1

0.05 0.05

0 0

1 2 3 4 5 1 2 3 4 5

Loss, % Loss, %

(Only lower tail is shown) (Only lower tail is shown)

0.2 0.2

0.15 GP 0.15

0.1 Normal 0.1

0.05 0.05

0 0

5 4 3 2 1 5 4 3 2 1

Return, % Return, %

rely on raw number crunching.

The approach proceeds in two steps. First, draw n values for the copula (ui ; i D

1; : : : ; n). Second, calculate the random number (return) by inverting the cdf ui D

Fi .xi / in (4.53) as

xi D Fi 1 .ui /; (4.59)

Remark 4.31 (To draw n random numbers from a Gaussian copula) First, draw n num-

bers from an N.0; R/ distribution, where R is the correlations matrix. Second, calculate

ui D .xi /, where is the cdf of a standard normal distribution.

Remark 4.32 (To draw n random numbers from a Clayton copula) First, draw xi for

i D 1; : : : ; n from a uniform distribution (between 0 and 1). Second, draw v from a

gamma(1=; 1) distribution. Third, calculate ui D 1 ln.xi /=v 1= for i D 1; : : : ; n.

107

Joint pdf, independent copula Joint pdf, Gaussian copula

1 1

large stocks

large stocks

2 2

3 3

Spearmans = 0.44

4 4

4 3 2 1 4 3 2 1

small stocks small stocks

Daily US data 1979:12013:12

1 The marginal distributions of the losses

large stocks

2

3

Kendalls = 0.30, = 0.87

4

4 3 2 1

small stocks

Remark 4.33 (Inverting a normal and a generalised Pareto cdf) Must numerical soft-

ware packages contain a routine for inverting a normal cdf. Remark 4.13 shows how to

generate random numbers for a generalised Pareto distribution.

Such simulations can be used to quickly calculate the VaR and other risk measures

for different portfolios. A Clayton copula with a high parameter (and hence a high

Kendalls ) has long lower tail with highly correlated returns: when asset takes a dive,

other assets are also likely to decrease. That is, the correlation in the lower tail of the

return distribution is high, which will make the VaR high.

Figures 4.424.43 give an illustration of how the movements in the lower get more

synchronised as the parameter in the Clayton copula increases.

108

Gaussian copula, = 0.49 Clayton copula, = 0.56

4 4

= 0.22

2 2

Asset 2

Asset 2

0 0

2 2

marginal pdf:

normal

4 4

4 2 0 2 4 4 2 0 2 4

Asset 1 Asset 1

Clayton copula, = 1.06 Clayton copula, = 2.06

4 4

= 0.35 = 0.51

2 2

Asset 2

0 Asset 2 0

2 2

4 4

4 2 0 2 4 4 2 0 2 4

Asset 1 Asset 1

Figure 4.42: Example of scatter plots of two asset returns drawn from different copulas

Bibliography

Alexander, C., 2008, Market Risk Analysis: Practical Financial Econometrics, Wiley.

Ang, A., and J. Chen, 2002, Asymmetric correlations of equity portfolios, Journal of

Financial Economics, 63, 443494.

Option Prices, Journal of Business, 51, 621651.

Cox, J. C., and S. A. Ross, 1976, The Valuation of Options for Alternative Stochastic

Processes, Journal of Financial Economics, 3, 145166.

Oxford University Press, Oxford.

109

Quantile of equally weighted portfolio, different copulas

0.8

Notice: VaR95% = (the 5% quantile)

1

1.2

1.4

1.6

1.8

N

2 Clayton, = 0.56

Clayton, = 1.06

Clayton, = 2.06

2.2

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1

Prob of lower outcome

Figure 4.43: Quantiles of an equally weighted portfolio of two asset returns drawn from

different copulas

sachusetts.

Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,

Cambridge University Press.

Hastie, T., R. Tibshirani, and J. Friedman, 2001, The elements of statistical learning: data

mining, inference and prediction, Springer Verlag.

Jackwerth, J. C., 2000, Recovering risk aversion from option prices and realized returns,

Review of Financial Studies, 13, 433451.

Jondeau, E., S.-H. Poon, and M. Rockinger, 2007, Financial Modeling under Non-

Gaussian Distributions, Springer.

Lo, A. W., H. Mamaysky, and J. Wang, 2000, Foundations of technical analysis: com-

putational algorithms, statistical inference, and empirical implementation, Journal of

Finance, 55, 17051765.

110

McNeil, A. J., R. Frey, and P. Embrechts, 2005, Quantitative risk management, Princeton

University Press.

Melick, W. R., and C. P. Thomas, 1997, Recovering an Assets Implied PDF from Op-

tions Prices: An Application to Crude Oil During the Gulf Crisis, Journal of Financial

and Quantitative Analysis, 32, 91115.

Mittelhammer, R. C., 1996, Mathematical statistics for economics and business, Springer-

Verlag, New York.

Ritchey, R. J., 1990, Call option valuation for discrete normal mixtures, Journal of

Financial Research, 13, 285296.

Silverman, B. W., 1986, Density estimation for statistics and data analysis, Chapman and

Hall, London.

Sderlind, P., 2000, Market expectations in the UK before and after the ERM crisis,

Economica, 67, 118.

Sderlind, P., and L. E. O. Svensson, 1997a, New techniques to extract market expecta-

tions from financial instruments, Journal of Monetary Economics, 40, 383420.

Sderlind, P., and L. E. O. Svensson, 1997b, New techniques to extract market expecta-

tions from financial instruments, Journal of Monetary Economics, 40, 383429.

Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University

Press.

111

5 Predicting Asset Returns

Sections denoted by a star ( ) is not required reading.

Reference: Cochrane (2005) 20.1; Campbell, Lo, and MacKinlay (1997) 2 and 7;

Taylor (2005) 57

The traditional interpretation of autocorrelation in asset returns is that there are some

irrational traders. For instance, feedback trading would create positive short term au-

tocorrelation in returns. If there are non-trivial market imperfections, then predictability

can be used to generate economic profits. If there are no important market imperfections,

then predictability of excess returns should be thought of as predictable movements in

risk premia.

To see the latter, let RetC1 be the excess return on an asset. The canonical asset pricing

equation then says

E t m t C1 RetC1 D 0; (5.1)

1

discounted sum of utility E t sD0 s u.c t Cs /. Let Q t be the consumer price index in t.

Then, we have 8

< u0 .ct C1 / Qt if returns are nominal

u0 .c t / Q t C1

m t C1 D

: . t C1 / if returns are real.

u 0 c

u0 .c t /

This says that the expected excess return will vary if risk (the covariance) does. If there is

some sort of reasonable relation between beliefs and the properties of actual returns (not

necessarily full rationality), then we should not be too surprised to find predictability.

112

Example 5.2 (Epstein-Zin utility function) Epstein and Zin (1991) define a certainty

equivalent of future utility as Z t D E t .U t1C1
/1=.1
/ where
is the risk aversionand

then use a CES aggregator function to govern the intertemporal trade-off between current

consumption and the certainty equivalent: U t D .1 /C t1 1= C Z t1 1= 1=.1 1= /

where is the elasticity of intertemporal substitution. If returns are iid (so the consumption-

wealth ratio is constant), then it can be shown that this utility function has the same

pricing implications as the CRRA utility, that is,

E.C t =C t 1/

R t D constant.

Example 5.3 (Portfolio choice with predictable returns) Campbell and Viceira (1999)

specify a model where the log return of the only risky asset follows the time series process

r t C1 D rf C x t C u t C1 ;

where rf is a constant riskfree rate, u t C1 is unpredictable, and the state variable follows

(constant suppressed)

x tC1 D x t C t C1 ;

be non-zero. For instance, with Cov t .u tC1 ; t C1 / < 0, a high return (u t C1 > 0) is

typically associated with an expected low future return (x tC1 is low since t C1 < 0) With

Epstein-Zin preferences, the portfolio weight on the risky asset is (approximately) of the

form

v t D a0 C a1 x t ;

where a0 and a1 are complicated expression (in terms of the model parameterscan be

calculated numerically). There are several interesting results. First, if returns are not

predictable (x t is constant since t C1 is), then the portfolio choice is constant. Second,

when returns are predictable, but the relative risk aversion is unity (no intertemporal

hedging), then v t D 1=.2
/ C x t =
Var t .u t C1 /. Third, with a higher risk aversion and

Cov t .u t C1 ; tC1 / < 0, there is a positive hedging demand for the risky asset: it pays off

(today) when the future investment opportunities are poor.

Example 5.4 (Habit persistence) The habit persistence model of Campbell and Cochrane

(1999) has a CRRA utility function, but the argument is the difference between consump-

tion and a habit level, C t X t , instead of just consumption. The habit is parametrised in

113

terms of the surplus ratio S t D .C t X t /=C t . The log surplus ratio.(s t )is assumed to

be a non-linear AR(1)

s t D s t 1 C .s t 1 /c t :

It can be shown (see Sderlind (2006)) that if .s t 1 / is a constant and the excess return

is unpredictable (by s t ) then the habit persistence model is virtually the same as the CRRA

model, but with
.1 C / as the effective risk aversion.

Example 5.5 (Reaction to news and the autocorrelation of returns) Let the log asset

price, p t , be the sum of a random walk and a temporary component (with perfectly cor-

related innovations, to make things simple)

D ut 1 C .1 C /" t :

so 0 < < 1 (initial overreaction of the price) gives a negative autocorrelation. In short,

mean reversion in the price level means negative autocorrelation of the returnsand vice

versa. See Figure 5.1 for the impulse responses with respect to a piece of news, " t .

5.2 Autocorrelations

The sampling properties of autocorrelations (Os ) are complicated, but there are several

useful large sample results for Gaussian processes (these results typically carry over to

processes which are similar to the Gaussiana homoskedastic process with finite 6th

moment is typically enough, see Priestley (1981) 5.3 or Brockwell and Davis (1991) 7.2

7.3). When the true autocorrelations are all zero (not 0 , of course), then for any i and j

different from zero

p

" # " # " #!

Oi 0 1 0

T !d N ; : (5.3)

Oj 0 0 1

114

Impulse response, = 0.4 Impulse response, = 0.4

1 1

0 0

1 Price 1

Return

0 1 2 3 4 5 0 1 2 3 4 5

period period

pt = ut + t , where ut = ut1 + t mean reversion (momentum) in the price

implies negative (positive) autocorrelation

The figure traces out the response to of the returns

1 = 1, starting from u0 = 0

Figure 5.1: Impulse responses when price is random walk plus temporary component

This result can be used to construct tests for both single autocorrelations (t-test or 2 test)

and several autocorrelations at once (2 test). To apply this on returns, the return horizon

can be whatever (seconds, years,...), but it is important that the returns are non-overlapping

(time aggregation can easily introduce spurious serial correlation).

Example 5.6 (t-test) We want to test the hypothesis that 1 D 0. Since the N.0; 1/ dis-

tribution has 5% of the probability mass below -1.64 and another 5% above 1.64, we

p

can reject the null hypothesis at the 10% level if T jO1 j > 1:64. With T D 100, we

p

therefore need jO1 j > 1:64= 100 D 0:165 for rejection, and with T D 1000 we need

p

jO1 j > 1:64= 1000 0:053.

p

The Box-Pierce test follows directly from the result in (5.3), since it shows that T Oi

p

and T Oj are iid N(0,1) variables. Therefore, the sum of the square of them is distributed

as an 2 variable. The test statistic typically used is

L

X

QL D T Os2 !d L

2

: (5.4)

sD1

Example 5.7 (Box-Pierce) Let O1 D 0:165, and T D 100, so Q1 D 100 0:1652 D

2:72. The 10% critical value of the 21 distribution is 2.71, so the null hypothesis of no

autocorrelation is rejected.

115

SMI SMI daily excess returns, %

6 10

SMI

bill portfolio 5

4

0

2

-5

0 -10

1995 2000 2005 2010 1995 2000 2005 2010

daily weekly monthly

returns 0.03 -0.10 0.02

|returns| 0.29 0.31 0.19

The choice of lag order in (5.4), L, should be guided by theoretical considerations, but

it may also be wise to try different values. There is clearly a trade off: too few lags may

miss a significant high-order autocorrelation, but too many lags can destroy the power of

the test (as the test statistic is not affected much by increasing L, but the critical values

increase).

See Figures 5.25.3.

The main problem with these tests is that the assumptions behind the results in (5.3)

may not be reasonable. For instance, data may be heteroskedastic. One way of handling

these issues is to make use of the GMM framework. (Alternatively, the results in Taylor

(2005) are useful.)

Remark 5.8 (Runs test ) A runs test is a non-parametric test of randomness. Let d t

be an indicator variable (

0 if y t q

dt D

1 if y t > q

where q typically (but not necessarily) is the mean of y t . Let T1 D TtD1 d t , that is the

P

y t q). Also define the numbers of runs (r), that is, the number of changes in the d t

116

Autocorr, daily excess returns Autocorr, weekly excess returns

Autocorr with 90% conf band around 0

0.2 S&P 500, 1979:12014:2 0.2

0.1 0.1

0 0

0.1 0.1

1 2 3 4 5 1 2 3 4 5

lag (days) lag (weeks)

0.2 0.2

0.1 0.1

0 0

0.1 0.1

1 2 3 4 5 1 2 3 4 5

lag (days) lag (weeks)

PT

r D1C t D2 jd t dt 1 j:

that, under the null hypothesis of randomness,

T1 T2

Er D 2 C 1 and

T

.E r 1/.E r 2/

Var.r/ D :

T 1

We can therefore test the null hypothesis of randomness by a t-stat

r Er

!d N.0; 1/:

Var.r/

p

The basic intuition of the test is that a positive autocorrelation would lead to too few runs

117

1st autocorrelation 2nd autocorrelation

0.1 0.1

0.05 0.05

0 0

0.05 0.05

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

size decile size decile

3rd autocorrelation

Autocorrelations of excess returns

on size sorted equity portfolios

0.1

Decile 1 contains the 10% smallest firms,

0.05

while decile 10 the 10% largest firms

0

US daily data 1979:12014:2

0.05

1 2 3 4 5 6 7 8 9 10

size decile

(r < E r): the y t variable would stay on one side of the threshold q for long spells of

timeand hence there would be few changes in x t . Negative autocorrelation is just the

opposite, since it tends to give a zigzag pattern around the mean. See Figure 5.5 for an

example.

This section discusses how GMM can be used to test if a series is autocorrelated. The

analysis focuses on first-order autocorrelation, but it is straightforward to extend it to

higher-order autocorrelation.

Consider a scalar random variable x t with a zero mean (it is easy to extend the analysis

to allow for a non-zero mean). Consider the moment conditions

T

" # " # " #

x t2 2 1 X x t2 2 2

g t ./ D ; so g./

N D , with D :

x t x t 1 2 T tD1 x t x t 1 2

(5.5)

118

t-stat for runs test of returns

10

few runs (positive autocorrelation)

15 US daily excess returns 1979:12014:2

1 2 3 4 5 6 7 8 9 10

size decile

ance. We want to test if D 0. We could proceed along two different routes: estimate

and test if it is different from zero or set to zero and then test overidentifying restrictions.

We are able to arrive at simple expressions for these testsprovided we are willing

to make strong assumptions about the data generating process. (These tests then typically

coincide with classical tests like the Box-Pierce test.) One of the strong points of GMM

is that we could perform similar tests without making strong assumptionsprovided we

use a correct estimator of the asymptotic covariance matrix of the moment conditions.

so the weight matrix does not matter, so the asymptotic distribution is

p d

T .O

1

0 / ! N.0; V /, where V D D00 S0 1 D0

;

where D0 is the Jacobian of the moment conditions and S0 the covariance matrix of the

moment conditions (at the true parameter values). We have

" # " # " #

@gN 1 .0 /=@ 2 @gN 1 .0 /=@ 1 0 1 0

D0 D plim D D ;

@gN 2 .0 /=@ 2 @gN 2 .0 /=@ 2 0 2

119

since D 0 (the true value). The definition of the covariance matrix is

"p T

# "p T #0

T X T X

S0 D E g t .0 / g t .0 / :

T t D1 T t D1

Assume that there is no autocorrelation in g t .0 / (which means, among other things, that

volatility, x t2 ; is not autocorrelated). We can then simplify as

S0 D E g t .0 /g t .0 /0 :

This assumption is stronger than assuming that D 0, but we make it here in order to

illustrate the asymptotic distribution. Moreover, assume that x t is iid N.0; 2 /. In this

case (and with D 0 imposed) we get

" #" #0 " #

x t2 2 x t2 2 .x t2 2 /2 .x t2 2 /x t x t 1

S0 D E DE

xt xt 1 xt xt 1 .x t2 2 /x t x t 1 .x t x t 1 /2

" # " #

E x t4 2 2 E x t2 C 4 0 2 4 0

D D :

0 E x t2 x t2 1 0 4

To make the simplification in the second line we use the facts that E x t4 D 3 4 if x t

N.0; 2 /, and that the normality and the iid properties of x t together imply E x t2 x t2 1 D

E x t2 E x t2 1 and E x t3 x t 1 D E 2 x t x t 1 D 0. Combining gives

p

" #!

O 2 0 1

Cov T D D0 S0 D0 1

O

0" #0 " # 1" #1 1

4

1 0 2 0 1 0

D@ 2 4

A

0 0 0 2

" #

2 4 0

D :

0 1

p

This shows that T O !d N.0; 1/.

5.2.3 Autoregressions

120

and then test if all the slope coefficients are zero with a 2 test. The return horizon can

be whatever (seconds, years,...), but it is important that the returns are non-overlapping.

See Figures 5.65.7 for illustrations. Notice that the results can be sensitive to the sample

period.

This approach is somewhat less general than the Box-Pierce test, but most stationary

time series processes can be well approximated by an AR of relatively low order. To

account for heteroskedasticity and other problems, it can make sense to estimate the co-

variance matrix of the parameters by an estimator like Newey-West. It can be noticed that

when r t D c C a1 r t 1 C " t , then a1 D 1 .

0.1

0

0.05

0.5

0

0 20 40 60 0 20 40 60

Return horizon (months) Return horizon (months)

2 1926:12013:12

e

+ t

Return

2

2 1 0 1 2

lagged return

The autoregression can easily allow for the coefficients to depend on the market situ-

ation. For instance, consider an AR(1), but where the autoregression coefficient may be

121

Slope coefficient (b) R2

Slope with 90% conf band

0.5

0.1

0

0.05

0.5

0

0 20 40 60 0 20 40 60

Return horizon (months) Return horizon (months)

1957:12013:12

e

+ t

(

1 if q is true

.q/ D

0 else.

Remark 5.10 (Pitfall I in testing long-run returns) Let r t in (5.6) be a two period return,

r t D rQt C rQt 1 , where rQt is a one-period (log) return. An AR(1) on overlapping data

would then be

rQt C rQt 1 D c C a.rQt 1 C rQt 2 / C " t :

Even if the one-period returns are uncorrelated, a would tend to be positive and significant

since rQt 1 shows up on both the left and right hand sides: the returns are overlapping.

Instead, the correct specification is

Remark 5.11 (Pitfall 2 in testing long-run returns) A less serious pitfall is to use all

available returns on the left hand side, for instance, all daily two-day returns. Two suc-

122

Autoregression coeff, after negative returns Autoregression coeff, after positive returns

0.1 0.1

with 90% conf band around 0

0.05 S&P 500 (daily), 1979:12014:2 0.05

0 0

0.05 0.05

0.1 0.1

1 2 3 4 5 1 2 3 4 5

lag (days) lag (days)

e e

+ Qt1 Rt1 + t

Qt1 = 1 if rt1 > 0, and zero otherwise

Figure 5.8: Predictability of US stock returns, results from a regression with interactive

dummies

rQt C1 C rQt D c C a.rQt 1 C rQt 2/ C " t C1

There is no problem with the point estimate of a, since the left and right hand side returns

do not overlap. However, the residuals (" t and " t C1 ) are likely to be correlated which has

to be handled in order to make correct inference. To see this, suppose rQt D c=2 C u t

where u t is iid. Clearly, the left and right hand sides are uncorrelated, so D 0. With

this we have

Since u t shows up in both " t and " t C1 , the latter are correlated. See Figure 5.9. This can

be solved by using a Newey-West approach (or something similar), or by skipping every

second observation (there is then no overlap of the residuals).

123

e

Slope (b) in Rte = a + bRt 1 + t Autocorrelation of residual

0.5 1

0 0.5

0.5 0

0 5 10 15 20 0 5 10 15 20

Return horizon (months) Return horizon (months)

Slope with two different 90% conf band, OLS and NW std

There is no reason to restrict the prediction model to only use the lagged returns of the

same asset. See Figure 5.10 for an illustration.

0.1 0.1

0.05 0.05

0 0

0.05 0.05

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

size decile size decile

Multiple regression with lagged excess Decile 1 contains the 10% smallest firms,

return on self and largest decile as while decile 10 the 10% largest firms

regressors:

e e e

Ri,t = + Ri,t1 + R10,t1 + t US daily excess returns 1979:12014:2

124

5.3.1 Momentum or Contrarian Strategy?

A momentum strategy invests in assets that have performed well recentlyand often goes

short in those that have underperformed. The performance is driven by both autocorrela-

tion and spill-over effects from other assets.

See 5.11 for an empirical illustration.

excess return

alpha

8

6

Buy (sell) the 5 assets with highest

(lowest) return over the last month

4

2

Monthly US data 1957:12013:12

25 FF portfolios

0

0 2 4 6 8 10 12

Evaluation horizon, days

To formalize this, let there be N assets with returns R, with means and autocovariance

matrix

E R D and (5.8)

.k/ D E.R t /.R t k /0 :

" #

Cov.R1;t ; R1;t k / Cov.R1;t ; R2;t k/

.k/ D :

Cov.R2;t ; R1;t k / Cov.R2;t ; R2;t k/

N

1 X

Rmt D Ri t D 10 R t =N (5.9)

N i D1

125

with the corresponding mean return

N

1 X

m D i D 10 =N: (5.10)

N i D1

Rt k Rmt k

w t .k/ D ; (5.11)

N

which basically says that wi t .k/ is positive for assets with an above average return k pe-

riods back. Notice that the weights sum to zero, so this is a zero cost portfolio. However,

the weights differ from fixed weights (for instance, put 1=5 into the best 5 assets, and

1=5 into the 5 worst assets) since the overall size of the exposure (10 jw t j) changes over

time. A large dispersion of the past returns means large positions and vice versa. To

analyse a contrarian strategy, reverse the sign of (5.11).

The profit from this strategy is

N N

X Ri t k Rmt k

X Ri t k Ri t

t .k/ D Ri t D Rmt k Rmt ; (5.12)

i D1 N

i D1

N

wit

where the last term uses the fact that iND1 Rmt k Ri t =N D Rmt k Rmt .

N

1 0 N 1 1 X

E t .k/ D 1 .k/1 tr .k/ C tr .k/ C .i m /2 ; (5.13)

N2 N2 N i D1

where the 10 .k/1 sums all the elements of .k/ and tr .k/ sums the elements along

the main diagonal. (See below for a proof.) To analyse a contrarian strategy, reverse the

sign of (5.13).

With a random walk, .k/ D 0, then (5.13) shows that the momentum strategy wins

money: the first two terms are zero, while the third term contributes to a positive perfor-

mance. The reason is that the momentum strategy (on average) invests in assets with high

average returns (i > m ).

The first term of (5.13) sums all elements in the autocovariance matrix and then sub-

tracts the sum of the diagonal elementsso it only depends on the sum of the cross-

covariances, that is, how a return is correlated with the lagged return of other assets. In

general, negative cross-covariances benefit a momentum strategy. To see why, suppose a

126

high lagged return on asset 1 predicts a low return on asset 2, but asset 2 cannot predict

asset 1 (Cov.R2;t ; R1;t k / < 0 and Cov.R1;t ; R2;t k / D 0). This helps the momentum

strategy since we have a negative portfolio weight of asset 2 (since it performed relatively

poorly in the previous period).

" # " #

Cov.R1;t ; R1;t k / Cov.R1;t ; R2;t k / 0 0

.k/ D D :

Cov.R2;t ; R1;t k / Cov.R2;t ; R2;t k/ 0:1 0

Then

1 0 1

1 .k/1 tr .k/ D 0:1 0 D 0:025, and

N 2 22

N 1 2 1

tr .k/ D 0 D 0;

N2 2

so the sum of the first two terms of (5.13) is positive (good for a momentum strategy).

For instance, suppose R1;t k > 0, then R2;t tends to be low which is good (we have a

negative portfolio weight on asset 2).

The second term of (5.13) depends only on own autocovariances, that is, how a return

is correlated with the lagged return of the same asset. If these own autocovariances are

(on average) positive, then a strongly performing asset in t k tends to perform well in

t , which helps a momentum strategy (as the strongly performing asset is overweighted).

See Figure 5.13 for an illustration based on Figure 5.12.

Example 5.14 Figure 5.13 shows that a momentum strategy works reasonably well on

daily data on the 25 FF portfolios. While the cross-covariances have a negative influence

(because they are mostly positive), they are dominated by the (on average) positive auto-

correlations. The correlation matrix is illustrated in Figure 5.12. In short, the small firms

(asset 1-5) are correlated with the lagged returns of most assets, while large firms are not.

" #

0:1 0

.k/ D ;

0 0:1

127

(Auto-)correlation matrix, daily FF returns 1979:12014:2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1 0.12 0.10 0.08 0.07 0.06 0.15 0.11 0.08 0.07 0.07 0.15 0.11 0.07 0.07 0.07 0.16 0.10 0.08 0.05 0.05 0.14 0.10 0.07 0.06 0.05

2 0.09 0.08 0.06 0.05 0.04 0.11 0.08 0.06 0.05 0.05 0.12 0.09 0.06 0.05 0.05 0.13 0.08 0.07 0.04 0.04 0.11 0.09 0.06 0.05 0.04

3 0.05 0.05 0.03 0.02 0.02 0.08 0.05 0.03 0.03 0.03 0.09 0.06 0.03 0.03 0.03 0.09 0.06 0.04 0.02 0.02 0.09 0.07 0.04 0.03 0.03

4 0.04 0.04 0.02 0.01 0.01 0.07 0.04 0.02 0.02 0.02 0.08 0.06 0.03 0.03 0.03 0.08 0.05 0.04 0.02 0.02 0.08 0.06 0.03 0.03 0.02

5 0.08 0.08 0.07 0.06 0.06 0.10 0.08 0.07 0.07 0.07 0.11 0.09 0.07 0.07 0.07 0.11 0.09 0.08 0.06 0.06 0.10 0.08 0.06 0.06 0.06

6 0.07 0.07 0.05 0.04 0.04 0.11 0.08 0.06 0.06 0.05 0.13 0.10 0.07 0.06 0.06 0.14 0.10 0.08 0.06 0.05 0.14 0.11 0.08 0.07 0.06

7 0.03 0.03 0.02 0.01 0.01 0.06 0.04 0.02 0.02 0.02 0.07 0.06 0.03 0.03 0.03 0.08 0.06 0.05 0.02 0.02 0.08 0.07 0.04 0.03 0.03

8 0.01 0.01 0.00 -0.00 -0.00 0.04 0.02 0.01 0.01 0.01 0.05 0.04 0.02 0.02 0.02 0.06 0.04 0.03 0.01 0.01 0.06 0.05 0.03 0.02 0.02

9 0.01 0.01 -0.00 -0.01 -0.01 0.04 0.02 0.00 0.00 0.01 0.05 0.04 0.01 0.02 0.01 0.06 0.04 0.03 0.01 0.01 0.06 0.05 0.02 0.02 0.02

10 0.03 0.02 0.02 0.01 0.02 0.05 0.03 0.02 0.02 0.03 0.05 0.05 0.03 0.03 0.03 0.06 0.05 0.04 0.03 0.03 0.06 0.05 0.03 0.03 0.03

11 0.05 0.05 0.04 0.04 0.03 0.10 0.08 0.06 0.06 0.05 0.11 0.09 0.07 0.06 0.06 0.12 0.09 0.08 0.06 0.05 0.13 0.11 0.08 0.07 0.07

12 0.04 0.04 0.04 0.03 0.03 0.07 0.06 0.05 0.05 0.05 0.09 0.08 0.06 0.06 0.05 0.10 0.08 0.07 0.05 0.05 0.11 0.10 0.07 0.07 0.06

13 0.03 0.03 0.03 0.02 0.02 0.06 0.05 0.04 0.04 0.04 0.07 0.07 0.05 0.05 0.05 0.08 0.07 0.07 0.05 0.05 0.09 0.08 0.06 0.06 0.06

14 0.02 0.03 0.02 0.02 0.02 0.05 0.04 0.03 0.03 0.04 0.05 0.05 0.04 0.04 0.04 0.06 0.06 0.05 0.04 0.04 0.07 0.06 0.05 0.05 0.04

15 0.02 0.03 0.02 0.02 0.02 0.04 0.04 0.03 0.04 0.04 0.05 0.06 0.04 0.05 0.05 0.06 0.06 0.06 0.04 0.04 0.07 0.07 0.05 0.05 0.06

16 0.02 0.03 0.02 0.02 0.01 0.06 0.05 0.04 0.04 0.04 0.08 0.07 0.04 0.05 0.04 0.09 0.06 0.05 0.03 0.03 0.10 0.09 0.06 0.06 0.05

17 0.04 0.04 0.04 0.03 0.03 0.07 0.06 0.06 0.05 0.06 0.08 0.08 0.06 0.06 0.06 0.09 0.08 0.07 0.05 0.05 0.11 0.10 0.08 0.07 0.06

18 0.03 0.04 0.03 0.03 0.03 0.06 0.06 0.05 0.05 0.05 0.07 0.07 0.05 0.06 0.05 0.08 0.07 0.07 0.05 0.05 0.09 0.09 0.07 0.07 0.06

19 0.03 0.03 0.03 0.02 0.02 0.05 0.05 0.04 0.04 0.05 0.06 0.06 0.05 0.06 0.05 0.07 0.06 0.06 0.04 0.04 0.08 0.07 0.06 0.06 0.06

20 0.02 0.03 0.02 0.02 0.02 0.04 0.04 0.03 0.04 0.05 0.05 0.05 0.04 0.05 0.04 0.06 0.06 0.05 0.04 0.04 0.06 0.06 0.05 0.05 0.05

21 -0.05 -0.05 -0.05 -0.05 -0.05 -0.03 -0.03 -0.04 -0.04 -0.03 -0.02 -0.03 -0.04 -0.04 -0.04 -0.02 -0.03 -0.04 -0.05 -0.05 -0.00 -0.02 -0.04 -0.03 -0.04

22 -0.04 -0.04 -0.04 -0.04 -0.04 -0.02 -0.02 -0.03 -0.03 -0.03 -0.01 -0.01 -0.03 -0.02 -0.03 -0.01 -0.02 -0.03 -0.04 -0.04 -0.00 -0.01 -0.03 -0.03 -0.03

23 -0.02 -0.02 -0.02 -0.02 -0.03 -0.01 -0.01 -0.01 -0.01 -0.01 -0.00 0.00 -0.01 -0.01 -0.01 0.00 0.00 -0.01 -0.02 -0.02 0.01 0.01 -0.00 -0.00 -0.01

24 -0.05 -0.04 -0.05 -0.05 -0.05 -0.03 -0.03 -0.04 -0.03 -0.02 -0.02 -0.02 -0.03 -0.02 -0.03 -0.02 -0.02 -0.03 -0.04 -0.04 -0.01 -0.02 -0.03 -0.02 -0.02

25 -0.03 -0.03 -0.03 -0.03 -0.03 -0.02 -0.02 -0.02 -0.02 -0.01 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.03 -0.03 -0.02 -0.02 -0.03 -0.02 -0.02

Dark colours indicate high correlations, light colours indicate low correlations.

Decomposition of momentum return (1-day horizon)

20 25 FF portfolios (B/M and size)

10

0.02

0

10

20 -11.56

Figure 5.13: Decomposition of return from momentum strategy based on daily FF data

then

1 0 1

D .0:2 0:2/ D 0, and

2

1 .k/1 tr .k/

N 22

N 1 2 1

2

tr .k/ D .0:1 C 0:1/ D 0:05;

N 2

128

so the sum of the first two terms of (5.13) is positive (good for a momentum strategy).

Proof. (of (5.13)) Take expectations of (5.12) and use the fact that E xy D Cov.x; y/C

E x E y to get

N

1 X

E t .k/ D Cov.Ri t C 2i Cov.Rmt C 2m :

k ; Ri t / k ; Rmt /

N i D1

Notice that N1 N

i D1 Cov.Ri t D tr .k/=N , where tr denotes the trace. Also, let

P

k ; Ri t /

1 h 0Q 0Q 0 i 1 0 Q Q0 10 .k/1

Cov.Rmt k ; Rmt / D E 2 1 R t 1 Ri t k D E 2 1 R t Ri t k 1 D :

N N N2

2m D N1 N m /2 . Together, these results

P P

i D1 i i D1 .i

give

N

10 .k/1 1 1 X

E t .k/ D C tr .k/ C .i m /2 ;

N2 N N i D1

which can be rearranged as (5.13).

There are many other, perhaps more economically plausible, possible predictors of future

stock returns. For instance, both the dividend-price ratio and nominal interest rates have

been used to predict long-run returns, and lagged short-run returns on other assets have

been used to predict short-run returns.

See Figure 5.14 for an illustration.

Reference: Campbell, Lo, and MacKinlay (1997) 7 and Cochrane (2005) 20.1.

The gross return, R t C1 , is defined as

D t C1 C P t C1 D t C1 C P tC1

R tC1 D , so P t D : (5.14)

Pt R tC1

129

Substituting for P t C1 (and then P t C2 ; :::) gives

D t C1 D t C2 D t C3

Pt D C C C ::: (5.15)

R t C1 R t C1 R t C2 R t C1 R t C2 R tC3

1

X D t Cj

D Qj ; (5.16)

j D1 kD1 R t Ck

counting identity. It is clear that a high price in t must lead to low future returns and/or

high future dividendswhich (by rational expectations) also carry over to expectations

of future returns and dividends.

It is sometimes more convenient to analyze the price-dividend ratio. Dividing (5.15)

and (5.16) by D t gives

Pt 1 D t C1 1 D t C2 D t C1 1 D t C3 D t C2 D t C1

D C C C :::

Dt R t C1 D t R t C1 R tC2 D t C1 D t R t C1 R t C2 R t C3 D t C2 D t C1 D t

(5.17)

1 Y j

X D t Ck =D tCk 1

D : (5.18)

j D1

R t Ck

kD1

As with (5.16) it is just an accounting identity. It must therefore also hold in expectations.

Since expectations are good (the best?) predictors of future values, we have the impli-

cation that the asset price should predict a discounted sum of future dividends, (5.16),

and that the price-dividend ratio should predict a discounted sum of future changes in

dividends.

We now log-linearize the accounting identity (5.18) in order to tie it more closely to the

(typically linear) econometrics methods for detecting predictability The result is

1

X

pt dt s .d t C1Cs d t Cs / rQt C1Cs ; (5.19)

sD0

0:96 if D=P is 4%) and where rQtC1Cs is a one-period log return.

As before, a high price-dividend ratio must imply future dividend growth and/or low

future returns. In the exact solution (5.17), dividends and returns which are closer to the

130

present show up more times than dividends and returns far in the future. In the approxi-

mation (5.19), this is captured by giving a higher weight (higher s ).

Proof. (of (5.19)slow version) Rewrite (5.14) as

D t C1 C P t C1

P tC1 D t C1

R t C1 D D 1C or in logs

Pt Pt P t C1

rQt C1 D p tC1 p t C ln 1 C exp.d t C1 p t C1 / :

Make a first order Taylor approximation of the last term around a steady state value of

d t C1 p t C1 , denoted d p,

h i exp.d p/ h i

ln 1 C exp.d t C1 p t C1 / ln 1 C exp.d p/ C d tC1 p tC1 d p

1 C exp.d p/

constant C .1 / .d t C1 p t C1 / ;

where D 1=1 C exp.d p/ D 1=.1 C D=P /. Combine and forget about the constant.

The result is

rQt C1 p t C1 p t C .1 / .d t C1 p t C1 /

D p t C1 p t C .1 / d t C1 ;

where 0 < < 1. Add and subtract d t from the right hand side and rearrange

rQt C1 .p t C1 d t C1 / .p t d t / C .d t C1 d t / , or

pt d t .p t C1 d t C1 / C .d t C1 dt / rQt C1

This is a (forward looking, unstable) difference equation, which we can solve recursively

forward. Provided lims!1 s .p t Cs d t Cs / D 0, the solution is (5.19). (Trying to solve

for the log price level instead of the log price-dividend ratio is problematic since the

condition lims!1 s p t Cs D 0 may not be satisfied.)

One of the most successful attempts to forecast long-run return is by using the dividend-

price ratio

r t Cq D C q .d t p t / C " t Cq ; (5.20)

where r t Cq is the log return between t and t C q. For instance, CLM Table 7.1, report R2

values from this regression which are close to zero for monthly returns, but they increase

131

to 0.4 for 4-year returns (US, value weighted index, mid 1920s to mid 1990s). See also

Figure 5.14 for an illustration.

By comparing with (5.19), we see that the dividend-ratio in (5.20) is only asked to

predict a finite (unweighted) sum of future returnsdividend growth is disregarded. We

should therefore expect (5.20) to work particularly well if the horizon is long (high q) and

if dividends are stable over time.

From (5.19) we get (from using Cov.x; y z/ D Cov.x; y/ Cov.x; z/) that

1 1

! !

X X

Var.p t d t / Cov p t d t ; s .d tC1Cs d tCs / Cov p t d t ; s r t C1Cs ;

sD0 sD0

(5.21)

which shows that the variance of the price-dividend ratio can be decomposed into the

covariance of price-dividend ratio with future dividend change minus the covariance of

price-dividend ratio with future returns. This expression highlights that if p t d t is not

constant, then it must forecast dividend growth and/or returns.

The evidence in Cochrane suggests that p t d t does not forecast future dividend

growth, so that predictability of future returns explains the variability in the dividend-

price ratio. This fits very well into the findings of the R2 of (5.20). To see that, recall the

following fact.

Remark 5.16 (R2 from a least squares regression) Let the least squares estimate of in

O The fitted values yO t D x 0 .

y t D x t0 0 C u t be . O If the regression equation includes a

b

t

constant, then R2 D Corr .y t ; yO t /2 . In a simple regression where y t D a C bx t C u t ,

b

where x t is a scalar, R2 D Corr .y t ; x t /2 .

The evidence for US stock returns is that long-run returns may perhaps be predicted

by using dividend-price ratio or interest rates, but that the long-run autocorrelations are

weak (long run US stock returns appear to be weak-form efficient but not semi-strong

efficient). Both CLM 7.1.4 and Cochrane 20.1 use small models for discussing this

case. The key in these discussions is to make changes in dividends unforecastable, but

let the return be forecastable by some state variable (E t d t C1Cs E t d t Cs D 0 and

E t r t C1 D r C x t ), but in such a way that there is little autocorrelation in returns. By

taking expectations of (5.19) we see that price-dividend will then reflect expected future

returns and therefore be useful for forecasting.

132

Slope coefficient (b) R2

0.4 0.1

0.2 0.05

0 0

0 20 40 60 0 20 40 60

Return horizon (months) Return horizon (months)

Scatter plot, 36 month returns 1926:12013:12

2

Regression:

1 Rte = a + b log(E/P)t1 + t

Return

2

4 3 2 1

lagged log(E/P)

struct maximally predictable portfolios. The weights on the different assets in this portfo-

lio can also help us to understand more about how the predictability works.

Let Z t be an n 1 vector of demeaned returns

Zt D Rt E Rt ; (5.22)

and suppose that we (somehow) have constructed rational forecasts E t 1 Z t such that

0

Zt D Et 1 Z t C " t , where E t 1 " t D 0, Var t 1 ." t " t / D : (5.23)

Consider a portfolio 0 Z t . The R2 from predicting the return on this portfolio is (as

133

usual) the fraction of the variability of
0 Z t that is explained by
0 E t 1 Zt

D Var. 0 Z t / Var. 0 " t /= Var. 0 Z t /

D Var. 0 E t 1 Z t /= Var. 0 Z t /

D 0 Cov.E t 1 Z t / = 0 Cov.Z t / : (5.24)

The covariance in the denominator can be calculated directly from data, but the covari-

ance matrix in the numerator clearly depends on the forecasting model we use (to create

E t 1 Z t ).

The portfolio (
vector) that gives the highest R2 is the eigenvector (normalized to

sum to unity) associated with the largest eigenvalue (also the value of R2 ) of Cov.Z t / 1 Cov.E t 1 Z t /.

Example 5.17 (One forecasting variable) Suppose there is only one predictor, x t 1,

Z t D x t 1 C "t ;

0

where is n 1. This means that E t 1 Z t D x t 1 , so Cov.E t 1 Z t / D Var.x t 1 /

0 Var.x t 1 / 0

R2 .
/ D :

0 Var.x t 1 / 0
C
0

The first order conditions for maximum then gives (this is very similar to the calculations

of the minimum variance portfolio in mean-variance analysis)

D 1

=10 1

;

then the portfolio weight of asset i is i divided by the variance of the forecast error of

asset i: assets which are hard to predict get smaller weights. We also see that if the sign

of i is different from the sign of 10 1 , then it gets a negative weight. For instance, if

10 1 > 0, so that most assets move in the same direction as x t 1 , then asset i gets a

negative weight if it moves in the opposite direction (i < 0).

134

5.6.1 Spurious Regressions

Ferson, Sarkissian, and Simin (2003) argue that many prediction equations suffer from

spurious regression featuresand that data mining tends to make things even worse.

Their simulation experiment is based on a simple model where the return predictions

are

r t C1 D C Z t C v t C1 ; (5.25)

where Z t is a regressor (predictor). The true model is that returns follow the process

r t C1 D C Z t C u t C1 ; (5.26)

where the residual is white noise. In this equation, Z t represents movements in expected

returns. The predictors follow a diagonal VAR(1)

" # " #" # " # " #!

Zt 0 Zt 1 "t "t

D C , with Cov D : (5.27)

Z t 0 Z t 1 "t "t

In the case of a pure spurious regression, the innovations to the predictors are uncor-

related ( is diagonal). In this case, ought to be zeroand their simulations show that

the estimates are almost unbiased. Instead, there is a problem with the standard deviation

O If is high, then the returns will be autocorrelated.

of .

See Table 5.1 for an illustration.

D0 D 0:75

: 0 0.375 0.75 0 0.375 0.75

Simulated 5:8 6:2 8:7 3:9 5:5 10:9

OLS formula 5:8 6:2 8:6 3:9 4:2 5:8

Newey-West 5:7 6:1 8:4 3:8 5:1 8:9

VARHAC 5:7 6:1 8:5 3:8 5:4 10:5

Bootstrapped 5:8 6:2 8:6 3:8 5:4 10:1

Table 5.1: Standard error of OLS slope (%) under autocorrelation (simulation evidence).

Model: y t D 1 C 0:9x t C t , where t D t 1 C t ; t is iid N(). x t D x t 1 C t ; t is

iid N(). NW uses 5 lags. VARHAC uses 5 lags and a VAR(1). The bootstrap uses blocks

of size 20. Sample length: 300. Number of simulations: 10000.

Under the null hypothesis of D 0, this autocorrelation is loaded onto the residuals.

For that reason, the simulations use a Newey-West estimator of the covariance matrix

(with an automatic choice of lag order). This should, ideally, solve the problem with the

135

inferencebut the simulations show that it doesnt: when Z t is very autocorrelated (0.95

or higher) and reasonably important (so an R2 from running (5.26), if we could, would be

0.05 or higher), then the 5% critical value (for a t-test of the hypothesis D 0) would be

2.7 (to be compared with the nominal value of 1.96). Since the point estimates are almost

unbiased, the interpretation is that the standard deviations are underestimated. In contrast,

with low autocorrelation and/or low importance of Z t , the standard deviations are much

more in line with nominal values.

See Table 5.1 for an illustration. They show that we need a combination of an auto-

correlated residuals and an autocorrelated regressor to create a problem for the usual LS

formula for the standard deviation of a slope coefficient. When the autocorrelation is very

high, even the Newey-West estimator is likely to underestimate the true uncertainty.

To study the interaction between spurious regressions and data mining, Ferson, Sarkissian,

and Simin (2003) let Z t be chosen from a vector of L possible predictorswhich all are

generated by a diagonal VAR(1) system as in (5.27) with uncorrelated errors. It is as-

sumed that the researchers choose Z t by running L regressions, and then picks the one

with the highest R2 . When D 0:15 and the researcher chooses between L D 10

predictors, the simulated 5% critical value is 3.5. Since this does not depend on the im-

portance of Z t , it is interpreted as a typical feature of data mining, which is bad enough.

When the autocorrelation is 0.95, then the importance of Z t start to become important

spurious regressions interact with the data mining to create extremely high simulated

critical values. A possible explanation is that the data mining exercise is likely to pick out

the most autocorrelated predictor, and that a highly autocorrelated predictor exacerbates

the spurious regression problem.

Excluding a relevant regressor will cause a bias of all coefficients (unless those regres-

sors are uncorrelated with the excluded regressor). In contrast, including an irrelevant

regressor is not really dangerous, but is likely to decrease the precision. In particular,

forecasting models with many regressors typically perform very poorly out of sample.

To select the regressors, apply the following rules: rule 1: use economic theory; rule 2:

avoid data mining and mechanical searches for the right regressors; rule 3: maybe use a

general-to-specific approachstart with a general regression and test restrictions,..., keep

136

making it simpler until restrictions are rejected; rule 4: always include a constant.

Remember that R2 can never decrease by adding more regressorsnot really a good

guide. To avoid overfitting, punish models with too many parameters. Perhaps consider

RN 2 instead

T 1

RN 2 D 1 .1 R2 / ; (5.28)

T k

where T is the sample size and k is the number of regressors (including the constant).

This measure includes trade-off between fit and the number of regressors (per data point).

Notice that RN 2 can be negative (while 0 R2 1). Clearly, the model must include a

constant for R2 (and therefore RN 2 ) to make sense. Alternatively, apply Akaikes Informa-

tion Criterion (AIC) and the Bayesian information criterion (BIC). They are

k

AIC D ln 2 C 2 (5.29)

T

k

BIC D ln 2 C ln T; (5.30)

T

where 2 is the variance of the fitted residuals.

These measures also involve trade-offs between fit (low 2 ) and number of parameters

(k, including the intercept). Choose the model with the highest RN 2 or lowest AIC or BIC.

It can be shown (by using R2 D 1 2 = Var.y t / so 2 D Var.y t /.1 R2 /) that AIC and

BIC can be rewritten as

k

AIC D ln Var.y t / C ln.1 R2 / C 2 (5.31)

T

k

BIC D ln Var.y t / C ln.1 R2 / C ln T: (5.32)

T

This shows that both are decreasing in R2 (which is good), but increasing in the number

of regressors per data point (k=T ). It therefore leads to a similar trade-off as in RN 2 . Recall

that the model should always include a constant.

Example 5.18 (Empirical application of model selection) See Table 5.2 for an empirical

example showing a number of possible model specifications. The dependent variable is

the monthly realized variance of S&P 500 returns (calculated from daily returns). The

possible regressors are lags of the dependent variable, the VIX index and the S&P 500

returns. Similarly, Table 5.3 for the the best specification according to AIC. Notice that

AIC tend to favour fairly large models with many regressors.

137

1 2 3 4 5 6 7

RV t 1 0:73 0:33

.10:86/ .1:90/

RV t 2 0:57 0:04

.10:49/ .0:34/

VIX t 1 0:91 0:71

.11:69/ .4:49/

VIX t 2 0:68 0:25

.12:14/ . 1:30/

Rt 1 0:86 0:17

. 3:26/ . 1:37/

Rt 2 0:48 0:06

. 2:77/ .0:67/

constant 4:24 6:86 2:44 2:16 16:45 16:20 0:94

.4:58/ .7:97/ . 1:96/ .2:27/ .16:02/ .15:42/ .1:01/

R2 0:54 0:32 0:56 0:31 0:16 0:05 0:61

obs 277:00 277:00 277:00 277:00 277:00 277:00 277:00

Table 5.2: Regression of monthly realized S&P 500 return volatility 1990:12013:5.

Numbers in parentheses are t-stats, based on Newey-West with 4 lags.

In some cases, even good economic theory leaves us with too many potential regres-

sors. This is often the case when developing forecasting modelswhere it is often noticed

models with many predictors tend to fail out of sample. It then becomes crucial to apply

some model selection technique, that is, a method that sets some regression coefficients

to zero.

If there are K potential regressors, then there are 2K different models. If the list of

models is not too long, then the AIC and BIC in (5.29)(5.30) can be used, see Table 5.3.

Otherwise, we need some sort of sequential approach.

Example 5.19 (3 potential regressors) If the three potential regressors are 1; x1 and x2 ,

then the list of models has 23 1 D 7 possibilities: .1/I .x1 /I .x2 /I .1; x1 /I .1; x2 /I .x1 ; x2 /I .1; x1 ; x2 /.

138

1 2 3 4

RV t 1 0:31 0:31 0:33 0:30

.2:03/ .1:96/ .1:94/ .1:93/

RV t 2 0:05

.0:45/

VIX t 1 0:74 0:88 0:71 0:87

.4:76/ .6:04/ .4:46/ .6:18/

VIX t 2 0:24 0:35 0:22 0:39

. 1:17/ . 2:42/ . 1:15/ . 2:41/

Rt 1 0:16 0:17

. 1:24/ . 1:37/

Rt 2 0:06

.0:71/

constant 0:90 0:39 0:73 0:67

.0:94/ .0:35/ .0:87/ .0:63/

R2 0:61 0:61 0:61 0:61

obs 277:00 277:00 277:00 277:00

Table 5.3: Best four regressions of monthly realized S&P 500 return volatility according

to AIC, 1990:12013:5. Ordered from best (1) to fourth best (4). Numbers in parentheses

are t-stats, based on Newey-West with 4 lags.

.2/ add the variable that improves the fit the most

.3/ repeat .2/ until the fit does not improve much

To specify a stopping rule, first define the residual sum of squares (for a given vector of

coefficients, ) as

RSS./ D TtD1 .y t x t0 /2 : (5.34)

P

In step (2) we would then add the variable that gives the lowest RSS (when added to the

previous selection). In step (3), it is often recommended that we stop adding regressors

when

RSS.Oold / RSS.Onew /

< c1;T k 1 ; (5.35)

RSS.Onew /=.T k 1/

where k is the number of coefficients in Oold (including the intercept) so there are k C 1

coefficients in Onew and c1;T k 1 is the 90% or 95% critical value of an F1;T k 1 distri-

139

1 2 3 4

RV t 1 0:31 0:31

.1:96/ .2:03/

RV t 2

.11:69/ .6:15/ .6:04/ .4:76/

VIX t 2 0:36 0:35 0:24

. 2:11/ . 2:42/ . 1:17/

Rt 1 0:16

. 1:24/

Rt 2

. 1:96/ . 1:64/ .0:35/ .0:94/

R2 0:56 0:59 0:61 0:61

obs 277:00 277:00 277:00 277:00

Table 5.4: Best four regression of monthly realized S&P 500 return volatility according

to a forward step selection (based on t-stats), 1990:12013:5. Ordered from smallest (1)

to fourth smallest (4). Numbers in parentheses are t-stats, based on Newey-West with 4

lags.

bution. For instance, the 90% critical value of F1;100 equals 2:76.

As an alternative to the RSS based rule in (5.34)(5.35), we could instead use t-stats:

in step (2) add the variable with the highest jt-statj and in step (3) stop adding variables

when that jt-statj is lower than 1.64 (or 1.96).

Example 5.20 (Forward stepwise selection) Applying the forward step selection approach

(based on t-stats) to the regression discussed in Example 5.18 gives a sequence of larger

and larger models shown in Table 5.4.

An alternative approach to model selection is the lasso method, which minimizes the

sum of squared residuals (just like OLS), but under a restriction that the sum of the ab-

solute value of the coefficients should not exceed a threshold t . In short, it solves the

following constrained optimization problem

x t0 b/2 subject to

PT PK

minb t D1 .y t i D1 jbi j t; (5.36)

where the value of t is chosen a priori. (This problem can be solved by brute force

140

1 2 3 4

RV t 1 0:32 0:32 0:35

.1:69/ .1:88/ .1:80/

RV t 2

.11:69/ .3:25/ .2:98/ .2:76/

VIX t 2

Rt 1 0:32 0:31

. 3:65/ . 3:61/

Rt 2 0:11

.0:88/

constant 2:44 0:58 1:00 0:71

. 1:96/ . 0:58/ .0:93/ .0:82/

R2 0:56 0:58 0:60 0:61

obs 277:00 277:00 277:00 277:00

Table 5.5: Regression of monthly realized S&P 500 return volatility with model selection

done by lasso, but then estimated with OLS, 1990:12013:5. Ordered from smallest (1)

to fourth smallest (4). Numbers in parentheses are t-stats, based on Newey-West with 4

lags.

minimization if there are few regressors. Otherwise, the lars algorithm by Efron, Hasti,

Johnstone, and Tibshirani (2004) is very efficient and can handle large problems.)

Clearly, when t K O O

i D1 jbi j where bi are the OLS estimates, then the lasso approach

P

reproduces the OLS estimates. For smaller values of t , the lasso will give smaller coeffi-

cients: some bi will be zero and others tend to be closer to zero than OLS would suggest

(like other shrinkage methods like a ridge estimation).

The lasso method can be used as a model selection technique by estimating a sequence

of models with increasingly higher t values. With a sufficiently low t , only one coefficient

is non-zerofor a somewhat higher t value, two coefficients are non-zero and so on. (If

we solve (5.36) with brute force, then this might involve some tweaking of the sequence

of the t values. However, the lars algorithm does this automatically.) Once the L (five,

say) smallest specifications are found, we could re-estimate each of them with OLS. (This

is the lars-OLS hybrid discussed in Efron, Hasti, Johnstone, and Tibshirani (2004).)

Example 5.21 (lasso regression) Applying the lasso approach to the regression discussed

in Example 5.18 gives a sequence of larger and larger models. Re-estimating the four

141

smallest of those models with OLS gives the results in Table 5.5.

Remark 5.22 (Ridge regression ) The ridge regression solves minb TtD1 .y t x t0 b/2 C

P

K 2

P

i D1 bi ;where > 0, so it forms a compromise between OLS and zero coefficients. This

is easiest to see if y t and x t are demeaned so D 0. Then, the first order conditions for

minimization are TtD1 x t .y t x t0 b/

P Q bQ D 0, so bQ D . 1 PT x t x 0 C / 1 1 PT x t y t :

T t D1 t T t D1

Notice that D 0 gives OLS, while D 1 gives b D 0.Q

Remark 5.23 (Application of the lasso/lars algorithms) These algorithms often stan-

dardize x t to have zero means and unit standard deviations, and y t to have zero means.

p

In some cases, they also calculate bi T instead of bi .

References: Goyal and Welch (2008), and Campbell and Thompson (2008)

create et D

rOt r t rOt

1 t 1 t t C1

sample

create e t C1 D

rOt C1 r t C1 rOt C1

1 t t C1

longer sample

The idea of out-of-sample forecasting is to replicate real life forecasting. The predic-

tion equation is estimated on data up to and including t 1, and then a forecast is made

for period t. The forecasting performance of the equation is then compared with using

some benchmark prediction model (also estimated on data up to and including t 1). See

Figure 5.15 for an illustration. Then, the sample is extended with one period (t) and a

forecast is made for t C 1. This continuous until the sample is exhausted. The Mariano-

Diebold tests of the forecasting performance based on (5.45) can clearly be used in this

setting.

142

Goyal and Welch (2008) find that the evidence of predictability of equity returns dis-

appears when out-of-sample forecasts are considered. Campbell and Thompson (2008)

claim that there is still some out-of-sample predictability, provided we put restrictions on

the estimated models.

Campbell and Thompson (2008) first report that only few variables (earnings price

ratio, T-bill rate and the inflation rate) have significant predictive power for one-month

stock returns in the full sample (18712003 or early 1920s2003, depending on predictor).

The comparison is done in terms of the MSE and an out-of-sample R2

1 XT 1 XT

2

ROS D1 .r t rOt /2 = .r t rQt /2 ; (5.37)

T t Ds T t Ds

where s is the first period with an out-of-sample forecast, rOt is the forecast based on the

prediction model (estimated on data up to and including t 1) and rQt is the prediction

from some benchmark model (also estimated on data up to and including t 1). In

practice, the paper uses the historical average (also estimated on data up to and including

t 1) as the benchmark prediction. Comparing the MSE is an application of the Mariano-

Diebold approach, while out-of-sample R2 is non-linear transformation of linear moment

conditions (5.41) that define the MSE. We can therefore apply the delta method to test it.

The evidence shows that the out-of-sample forecasting performance is very weakas

claimed by Goyal and Welch (2008).

It is argued that forecasting equations can easily give strange results when they are

estimated on a small data set (as they are early in the sample). They therefore try different

restrictions: setting the slope coefficient to zero whenever the sign is wrong, setting

the prediction (or the historical average) to zero whenever the value is negative. This

improves the results a bitalthough the predictive performance is still weak.

See Figures 5.165.17 for an illustrations. The evidence suggests that the in-sample

long-run predictability vanishes out-of-sample. It also suggests that there is still some

short-run predictability for small firm stocks.

Further reading: Diebold (2001) 11; Stekler (1991); Diebold and Mariano (1995)

To do a solid evaluation of the forecast performance (of some forecaster/forecast

method/forecast institute), we need a sample (history) of the forecasts and the resulting

forecast errors. The reason is that the forecasting performance for a single period is likely

143

Out-of-sample R2 , excess returns predicted by E/P

0.05

Estimation on data window of 240 months

0 Forecasts for 1957:12013:12

of E/P regression relative to the

0.1 historical average

0.15

0.2

0.25

10 20 30 40 50 60

Return horizon (months)

Out-of-sample R2 , AR model

0.05

Out-of-sample R2 from AR(1) of excess

0.04 returns on size sorted equity portfolios

0.03 while decile 10 the 10% largest firms

0.01

0.01

1 2 3 4 5 6 7 8 9 10

size decile

results for several periods.

Let e t be the forecast error in period t

et D rt rOt ; (5.38)

144

where rOt is the forecast (made in t h) and r t the actual outcome. (Warning: some authors

prefer to work with rOt r t as the forecast error instead.)

Quite often, we compare a forecast method (or forecasting institute) with some kind

of naive forecast like a no change or a random walk. The idea of such a comparison

is to study if the resources employed in creating the forecast really bring value added

compared to a very simple (and inexpensive) forecast.

Ultimately, the ranking of forecasting methods should ideally be done based on the

true benefits/costs of forecast errorswhich may differ between organizations. For in-

stance, a forecasting agency has a reputation (and eventually customers) to lose, while

an investor has more immediate pecuniary losses. Unless the relation between the fore-

cast error and the losses are immediately understood, the ranking of two forecast methods

is typically done based on a number of different criteria. Several of those criteria are

inspired by basic statistics.

Most statistical forecasting methods are based on the idea of minimizing the sum of

squared forecast errors, tTD1 e t2 . For instance, the least squares (LS) method picks the

regression coefficient in

r t D 0 C 1 x t h C t (5.39)

to minimize the sum of squared residuals. This will, among other things, give a zero

mean of the fitted residuals and also a zero correlation between the fitted residual and the

regressor. As usual, rational forecasts should have forecast errors that cannot be predicted

(by past regressors or forecast errors).

Evaluation of a forecast often involve extending these ideas to the forecast method,

irrespective of whether a LS regression has been used or not. In practice, this means

studying (i) whether the forecast error, e t , has a zero mean; (ii) the mean squared (or

absolute value) of the forecast error ; (iii) the fraction of times the squared (or absolute

value) of the forecast error is lower than some threshold; (iv) the profit from investing by

following a forecasting model; (v) if the forecast errors are autocorrelated or correlated

with past information.

ror has a zero correlation with the forecast error h (and more) periods earlier. For in-

stance, with h D 2, let e t C2;t D y t C2 E t y t C2 be the error of forecasting y t C2 using the

information in period t. It should be uncorrelated with e t;t 2 D y t E t 2 y t , since the

latter is known when the forecast E t y t C2 is formed.

145

test is typically performedand it is an application of GMM. To implement it, consider

two different forecasts. For instance, the first forecast could be your estimated model and

the other one might be a naive forecasting model (for instance, no change) that you hope

to beat. To test the different aspects discussed before, let e t and " t denote the forecast

errors, .x/ be an indicator function that is one if x is true and zero otherwise, and let

r te and r t" denote the returns from following the different forecasts. Then, define moment

conditions

gt D et " t , or (5.40)

g t D e t2 "2t or g t D je t j j" t j, or (5.41)

g t D .e t2 < "2t /, or (5.42)

g t D r te r t" , or (5.43)

gt D et et 1 "t "t 1 or g t D e t x t h "t xt h: (5.44)

The different moment conditions correspond to the different aspects of the forecasts dis-

cussed above. For instance, (5.40) is for testing if the two methods have the same average

forecast error. From the usual properties of GMM, we then have

p

T gN !d N.0; S0 /; (5.45)

PT p

where gN D t D1 g t =T is the average moment condition and S 0 D Var. N is the

T g/

variance. The latter can be estimated by, for instance, a Newey-West approach. This can

be used to construct a t-test. However, when models a and b are nested (say, a is a special

case of b), then the asymptotic distribution is non-normal so other critical values must be

applied (see Clark and McCracken (2001)).

Remark 5.25 (Empirical results on predicting annual S&P 500 returns) Table 5.6 com-

pares the out-of-sample predictive performance of an E/P regression estimated recursively

(e) and the historical average return also estimated recursively ("). The historical aver-

age seems to overestimate the returns, but the difference to the E/P regression (which is

almost unbiased) is not significant. In contrast, the regression creates more volatile errors

(although again not significant), as measured by both the squared and the absolute value

of the residuals. Finally, the last line shows that the larger volatility of the regression

errors is partly driven by a higher frequency of large (in absolute terms) residuals (and

this result is significant).

Example 5.26 We want to compare the performance of the two forecast methods a and

146

mean std t-stat

2

Roos 0:11

e 0:23 2:08 0:11

" 1:65 1:91 0:86

e " 1:88 0:86 2:18

e2 334:11 59:66 5:60

2

" 301:91 51:52 5:86

e 2 "2 32:20 27:61 1:17

jej j"j 0:85 0:62 1:36

jej < j"j 0:46 0:05 10:21

The e forecasts are out-of-sample and based on E/P regressions, while the " forecasts

are the historical average returns. Estimation is done on a moving data window of 240

observations.

b. We have the following forecast errors .e1 ; e2 ; e3 / D . 1; 1; 2/ and ."1 ; "2 ; "3 / D

. 1:9; 0; 1:9/. Both have zero means, so there is (in this very short sample) no constant

bias. The mean squared errors are

MSE" D . 1:9/2 C 02 C 1:92 =3 2:41;

so the first forecast a is better according to the mean squared errors criterion. The mean

absolute errors are

MAE" D j 1:9j C j0j C j1:9j=3 1:27;

so the second forecast is better according to the mean absolute errors criterion. The

reason for the difference between these criteria is that the second forecast has fewer but

larger errorsand the quadratic loss function punishes large errors very heavily. Count-

ing the number of times the absolute error (or the squared error) is smaller, we see that

the first forecast is better one time (first period), and the second forecast is better two

times.

As an example, Leitch and Tanner (1991) analyse the profits from selling 3-month

T-bill futures when the forecasted interest rate is above futures rate (forecasted bill price

is below futures price). The profit from this strategy is (not surprisingly) strongly related

147

to measures of correct direction of change (see above), but (perhaps more surprisingly)

not very strongly related to mean squared error, or absolute errors.

Bibliography

Brockwell, P. J., and R. A. Davis, 1991, Time series: theory and methods, Springer Verlag,

New York, second edn.

explanation of aggregate stock market behavior, Journal of Political Economy, 107,

205251.

markets, Princeton University Press, Princeton, New Jersey.

Campbell, J. Y., and S. B. Thompson, 2008, Predicting the equity premium out of sam-

ple: can anything beat the historical average, Review of Financial Studies, 21, 1509

1531.

Campbell, J. Y., and L. M. Viceira, 1999, Consumption and portfolio decisions when

expected returns are time varying, Quarterly Journal of Economics, 114, 433495.

Clark, T. E., and M. W. McCracken, 2001, Tests of equal forecast accuracy and encom-

passing for nested models, Journal of Econometrics, 105, 85110.

Cochrane, J. H., 2005, Asset pricing, Princeton University Press, Princeton, New Jersey,

revised edn.

Business and Economic Statistics, 13, 253265.

Efron, B., T. Hasti, I. Johnstone, and R. Tibshirani, 2004, Least angle regression, The

Annals of Statistics, 32, 407499.

Epstein, L. G., and S. E. Zin, 1991, Substitution, risk aversion, and the temporal behavior

of asset returns: an empirical analysis, Journal of Political Economy, 99, 263286.

148

Ferson, W. E., S. Sarkissian, and T. T. Simin, 2003, Spurious regressions in financial

economics, Journal of Finance, 57, 13931413.

Goyal, A., and I. Welch, 2008, A comprehensive look at the empirical performance of

equity premium prediction, Review of Financial Studies 2008, 21, 14551508.

Hastie, T., R. Tibshirani, and J. Friedman, 2001, The elements of statistical learning: data

mining, inference and prediction, Springer Verlag.

Leitch, G., and J. E. Tanner, 1991, Economic forecast evaluation: profit versus the con-

ventional error measures, American Economic Review, 81, 580590.

Lo, A. W., and A. C. MacKinlay, 1997, Maximizing predictability in the stock and bond

markets, Macroeconomic Dynamics, 1, 102134.

Priestley, M. B., 1981, Spectral analysis and time series, Academic Press.

Sderlind, P., 2006, C-CAPM Refinements and the cross-section of returns, Financial

Markets and Portfolio Management, 20, 4973.

Journal of Forecasting, 7, 375384.

Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University

Press.

149

6 Predicting Asset Returns: Nonparametric Estimation

Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Hrdle (1990); Pagan and Ullah

(1999); Mittelhammer, Judge, and Miller (2000) 21

6.1.1 Introduction

Nonparametric regressions are used when we are unwilling to impose a parametric form

on the regression equationand we have a lot of data.

Let the scalars y t and x t be related as

where " t is uncorrelated over time and where E " t D 0 and E." t jx t / D 0. The function

b./ is unknown and possibly non-linear.

One possibility of estimating such a function is to approximate b.x t / by a polynomial

(or some other basis). This will give quick estimates, but the results are global in the

sense that the value of b.x t / at a particular x t value (x t D 1:9, say) will depend on all

the data pointsand potentially very strongly so. The approach in this section is more

local by down weighting information from data points where xs is far from x t .

Suppose the sample had 3 observations (say, t D 3, 27, and 99) with exactly the same

value of x t , say 1:9. A natural way of estimating b.x/ at x D 1:9 would then be to

average over these 3 observations as we can expect average of the error terms to be close

to zero (iid and zero mean).

Unfortunately, we seldom have repeated observations of this type. Instead, we may

try to approximate the value of b.x/ (x is a single value, 1.9, say) by averaging over (y)

observations where x t is close to x. The general form of this type of estimator is

PT

Ob.x/ D PtD1 w t .x/y t ; (6.2)

T

t D1 w t .x/

where w t .x/= tTD1 w t .x/ is the weight on observation t, which his non-negative and

150

(weakly) decreasing in the the distance of x t from x. Note that the denominator makes

the weights sum to unity. The basic assumption behind (6.2) is that the b.x/ function is

smooth so local averaging (around x) makes sense.

Remark 6.1 (Local constant estimator ) Notice that (6.2) solves the problem min TtD1 w t .x/.y t

P

O

x /2 for each value of x. (The result is b.x/ D x .) This is (for each value of x) like a

weighted regression of x t on a constant. This immediately suggests that the method could

be extended to solving a problem like min TtD1 w t .x/y t x bx .x t x/2 , which

P

which are closest to x and zero weight to all other observations (this is the k-nearest

neighbor estimator, see Hrdle (1990) 3.2). As another example, the weight function

O

could be defined so that it trades off the expected squared errors, Ey t b.x/ 2

, and the

2O

expected squared acceleration, Ed b.x/=dx . This defines a cubic spline (often used

2 2

Remark 6.2 (Easy way to calculate the nearest neighbor estimator, univariate case)

Create a matrix Z where row t is .y t ; x t /. Sort the rows of Z according to the second

column (x). Calculate an equally weighted centred moving average of the first column

(y).

A Kernel regression uses a pdf as the weight function, w t .x/ D K .x t x/= h, where

the choice of h (also called bandwidth) allows us to easily vary the relative weights of

different observations. The perhaps simplest choice is a uniform density function for x t

over x h=2 to x C h=2 (and zero outside this interval). In this case, the weighting

function is

1 x t x

w t .x/ D 1=2 ; where (6.3)

h

( h

1 if q is true

.q/ D

0 else.

This weighting function puts the weight 1= h on all data point in the interval x h=2 and

zero on all other data points.

151

Kernel regression of AR(1) for returns

6

Uniform kernel, interval width: 1

4 Daily S&P 500 returns 1979:12014:3

2

Return

2

Evaluated at -9.5, -8.5,...

Evaluated at more points

4

10 5 0 5 10

Lagged return

However, we can gain efficiency and get a smoother (across x values) estimate by

using a density function that puts more weight to very local information, but also tapers

off more smoothly. The pdf of N.x; h2 / is often used. This weighting function is positive,

so all observations get a positive weight, but the weights are highest for observations close

to x and then taper off in a bell-shaped way. A low value of h means that the weights taper

off fast. See Figure 6.2 for an example.

With the N.x; h2 / kernel, we get the following weights at a point x

h i

xt x 2

exp

h

=2

w t .x/ D p : (6.4)

h 2

Remark 6.3 (Kernel as a pdf of N.x; h2 /) If K.z/ is the pdf of an N.0; 1/ variable, then

K .x t x/= h = h is the same as using an N.x; h2 / pdf of x t . Clearly, the 1= h term

would cancel in (6.2).

Effectively, we can think of these weights as being calculated from an N .0; 1/ density

function, but where we use .x t x/= h as the argument.

When h ! 0, then b.x/ O evaluated at x D x t becomes just y t , so no averaging is

O

done. In contrast, as h ! 1, b.x/ becomes the sample average of y t , so we have global

averaging. Clearly, some value of h in between is needed.

152

Data and weights for b(1.7) Data and weights for b(1.9)

5 5

wt (1.7) wt (1.9)

weights

weights

1 1

yt

yt

4 4

0 0

1.5 2 2.5 1.5 2 2.5

xt xt

Data and weights for b(2.1) Data on yt : 5.0 4.0 3.5

wt (2.1) . denotes the data

denotes the fitted b(x)

weights

1

Left y-axis: data; right y-axis: weights

yt

0

1.5 2 2.5

xt

O

In practice we have to estimate b.x/ at a finite number of points x. This could, for

instance, be 100 evenly spread points in the interval between the minimum and the max-

imum values observed in the sample. Special corrections might be needed if there are a

lot of observations stacked close to the boundary of the support of x (see Hrdle (1990)

4.4). See Figure 6.3 for an illustration.

Example 6.4 (Kernel regression) Suppose the sample has three data points x1 ; x2 ; x3 D

1:5; 2; 2:5 and y1 ; y2 ; y3 D 5; 4; 3:5. Consider the estimation of b.x/ at x D 1:9.

With h D 1, the numerator in (6.4) is

XT 2 2 2

p

w t .x/y t D e .1:5 1:9/ =2 5 C e .2 1:9/ =2 4 C e .2:5 1:9/ =2 3:5 = 2

t D1

p

.0:92 5 C 1:0 4 C 0:84 3:5/ = 2

p

D 11:52= 2:

153

Kernel regression, effect of bandwidth (h)

Data

5 h = 0.25

h = 0.2

4.5

y

3.5

1.4 1.6 1.8 2 2.2 2.4

x

The denominator is

XT p

.1:5 1:9/2 =2 .2 1:9/2 =2 .2:5 1:9/2 =2

w t .x/ D e Ce Ce = 2

t D1

p

2:75= 2:

O

b.1:9/ 11:52=2:75 4:19:

where " t is uncorrelated over time and where E " t D 0 and E." t jx t ; z t / D 0.

This makes the estimation problem more data demanding. To see why, suppose we

use a uniform density function as weighting function (see in (6.3)). However, with two

regressors, the interval becomes a rectangle. With as little as a 20 intervals of each of

x and z, we get 400 bins, so we need a large sample to have a reasonable number of

observations in every bin.

154

Kernel regression of AR(1) for returns

5

Gaussian kernel

Daily S&P 500 returns 1979:12014:3

Return

Small bandwidth

Optimal bandwitdh

Large bandwidth

5

10 5 0 5 10

Lagged return

In any case, the most common way to implement the kernel regressor is to let

PT

O z/ D Pt D1 w t .x/w t .z/y t ;

b.x; (6.6)

T

tD1 w t .x/w t .z/

where w t .x/ and w t .z/ are two kernels like in (6.4) and where we may allow the band-

width (h) to be different for x t and z t (and depend on the variance of x t and y t ). In this

case. the weight of the observation (x t ; z t ) is proportional to w t .x/w t .z/, which is high

if both x t and z t are close to x and z respectively.

Kernel regressions are typically consistent, provided longer samples are accompanied by

smaller values of h, so the weighting function becomes more and more local as the sample

size increases. It can be shown (see Hrdle (1990) 3.1 and Pagan and Ullah (1999) 3.34)

that under the assumption that x t is iid, the mean squared error, variance and bias of the

155

Kernel regression of AR(2) for returns

4

Gaussian kernel

6 Daily S&P 500 returns 1979:12014:3 10

5

10 0

5

0 5

5 10

10

Return lagged twice

Return lagged once

O

MSE.x/ D Varb.x/ O

C Biasb.x/ 2

, with

2

O 1 .x/ R 1

Varb.x/ D 1 K.u/2 du

T h f .x/

2

O 1 d b.x/ df .x/ 1 db.x/ R1

Biasb.x/ 2

Dh 2

C 2

1 K.u/u du: (6.7)

2 dx dx f .x/ dx

In these expressions, 2 .x/ is the variance of the residuals in (6.1), f .x/ the marginal

density of x and K.u/ the kernel (pdf) used as a weighting function for u D .x t x/= h.

The remaining terms are functions of the true regression function.

With a Gaussian kernel these expressions can be simplified to

O 1 2 .x/ 1

Varb.x/ D p

T h f .x/ 2

2

O 1 d b.x/ df .x/ 1 db.x/

Biasb.x/ 2

Dh C : (6.8)

2 dx 2 dx f .x/ dx

156

Proof. (of (6.8)) We know that

R1 2 1 R1

1 K.u/ du D p and 1 K.u/u2 du D 1;

2

if K.u/ is the density function of a standard normal distribution. (We are effectively using

the N.0; 1/ pdf for the variable .x t x/= h.) Use in (6.7).

A smaller h increases the variance (we effectively use fewer data points to estimate

b.x/) but decreases the bias of the estimator (it becomes more local to x). If h decreases

less than proportionally with the sample size (so hT in the denominator of the first term

increases with T ), then the variance goes to zero and the estimator is consistent (since the

bias in the second term decreases as h does).

The variance is a function of the variance of the residuals and the peakedness of the

kernel, but not of the b.x/ function. The more concentrated the kernel is (s K.u/2 du

large) around x (for a given h), the less information is used in forming the average around

x, and the uncertainty is therefore largerwhich is similar to using a small h. A low

density of the regressors (f .x/ low) means that we have little data at x which drives up

the uncertainty of the estimator.

The bias increases (in magnitude) with the curvature of the b.x/ function (that is,

.d b.x/=dx 2 /2 ). This makes sense, since rapid changes of the slope of b.x/ make it hard

2

to get b.x/ right by averaging at nearby x values. It also increases with the variance of

the kernel since a large kernel variance is similar to a large h.

It is clear that the choice of h has a major importance on the estimation results. A

lower value of h means a more local averaging, which has the potential of picking up

sharp changes in the regression functionat the cost of being more affected by random-

ness.

See Figures 6.16.4 for an example.

Remark 6.5 (Rule of thumb value of h) In a simplified case, we case find the h value

that minimizes the MSE. Use (6.8) to construct the MSE D Var.b/Cbias.b/2 . To simplify,

assume the distribution of x is uniform, so f .x/ D 1=.xmax xmin / and df .x/=dx D 0.

In addition, run the regression y D C x C
x 2 C " as an approximation of b.x/.

With this we have d 2 b.x/=dx 2 2
and we approximate 2 by the variance of the fitted

residuals, "2 . Combining, we have

1 2 1

MSE D .xmax xmin / p C h4
2 :

Th " 2

157

Minimizing with respect to h gives the first order condition

1 2 1

" .xmax x min / p C 4h3
2 D 0, so

T h2 2

1 2 1

2

" .xmax xmin / p D h5 , or

T
8

1=5 2=5

T j
j "2=5 .xmax xmin /1=5 0:6 D h:

In practice, replace xmax xmin by the difference between the 90th and 10th percentiles of

x.

cross-validation technique. This approach would, for instance, choose h to minimize the

expected (or average) prediction error

XT h i2

EPE.h/ D yt bO t .x t ; h/ =T; (6.9)

t D1

a sample that excludes observation t , and a bandwidth h. This means that each prediction

is out-of-sample. To calculate (6.9) we clearly need to make T estimations (for each

x t )and then repeat this for different values of h to find the minimum.

See Figure 6.6 for an example.

Step 2: estimate the b.x/ function on all data, but exclude t D 1, then calculate bO 1 .x1 /

and the error y1 bO 1 .x1 /

Step 3: redo Step 2, but now exclude t D 2 and. calculate the error y2 bO 2 .x2 /. Repeat

this for t D 3; 4; :::; T . Calculate the EPE as in (6.9).

Step 4: redo Steps 23, but for another value of h. Keep doing this until you find the best

h (the one that gives the lowest EPE)

Remark 6.7 (Speed and fast Fourier transforms) The calculation of the kernel estimator

can often be speeded up by the use of a fast Fourier transform.

If the observations are independent, then it can be shown (see Hrdle (1990) 4.2,

Pagan and Ullah (1999) 3.36, and also (6.8)) that, with a Gaussian kernel, the estimator

at point x is asymptotically normally distributed

p h i

1 2 .x/

O

T h b.x/ b.x/ ! N 0; pd

; (6.10)

2 f .x/

158

Cross validation, kernel regression

1.015

Gaussian kernel

Daily S&P 500 returns 1979:12014:3

1.01 rule-of-thumb

1.005

1

0.5 1 1.5 2 2.5

Bandwidth

where 2 .x/ is the variance of the residuals in (6.1) and f .x/ the marginal density of x.

(A similar expression holds for other choices of the kernel.) This expression assumes that

the asymptotic bias is zero, which is guaranteed if h is decreased (as T increases) slightly

faster than T 1=5 (for instance, suppose h D T 1:1=5 h0 , where h0 is a constant). To

estimate the density of x, we can apply a standard method, for instance using a Gaussian

kernel and the bandwidth (for the density estimate only) of 1:06 Std.x t /T 1=5 .

Remark 6.8 (Asymptotic bias) The condition that h decreases faster than T 1=5 ensures

p

O

that the bias of T hb.x/ vanishes as T ! 1. This is seen by noticing that the bias in

p

O

(6.8) is proportional to h2 . Combining gives the bias of T hb.x/ as being proportional

1=2 5=2 1:1=5

to T h . If indeed h D T h0 , then we have

0:05 5=2

T 1=2 h5=2 D T h0

residuals on x t

O t /;

"O2t D 2 .x t /, where "O t D y t b.x (6.11)

O t / are the fitted values from the non-parametric regression (6.1). Notice that the

where b.x

O

estimation of 2 .x/ is quite computationally intensive since it requires estimating b.x/ at

159

Kernel regression of AR(1) for returns, with 90% conf band

5

Gaussian kernel

Daily S&P 500 returns 1979:12014:3

Return

5

10 5 0 5 10

Lagged return

every point x D x t in the sample, not just a small grid of x values. To draw confidence

O

bands, it is typically assumed that the asymptotic bias is zero (E b.x/ D b.x/).

See Figure 6.7 for an example where the width of the confidence band varies across

x valuesmostly because the sample contains few observations close to some x values.

(However, the assumption of independent observations can be questioned in this case.)

set Prices, by Ait-Sahalia and Lo (1998)

There seem to be systematic deviations from the Black-Scholes model. For instance,

implied volatilities are often higher for options far from the current spot (or forward)

pricethe volatility smile. This is sometimes interpreted as if the beliefs about the future

log asset price put larger probabilities on very large movements than what is compatible

with the normal distribution (fat tails).

This has spurred many efforts to both describe the distribution of the underlying asset

price and to amend the Black-Scholes formula by adding various adjustment terms. One

strand of this literature uses nonparametric regressions to fit observed option prices to the

160

variables that also show up in the Black-Scholes formula (spot price of underlying asset,

strike price, time to expiry, interest rate, and dividends). For instance, Ait-Sahalia and

Lo (1998) applies this to daily data for Jan 1993 to Dec 1993 on S&P 500 index options

(14,000 observations).

This paper estimates nonparametric option price functions and calculates the implicit

risk-neutral distribution as the second partial derivative of this function with respect to the

strike price.

where S t is the price of the underlying asset, X is the strike price, is time to

expiry, r t is the interest rate between t and t C , and t is the dividend yield

(if any) between t and t C . It is very hard to estimate a five-dimensional kernel

regression, so various ways of reducing the dimensionality are tried. For instance,

by making b./ a function of the forward price, S t exp.r t t /, instead of S t ,

r t , and t separably.

2. Second, the implicit risk-neutral pdf of the future asset price is calculated as

@2 b.S t ; X; ; r t ; t /=@X 2 , properly scaled so it integrates to unity.

3. This approach is used on daily data for Jan 1993 to Dec 1993 on S&P 500 index op-

tions (14,000 observations). They find interesting patterns of the implied moments

(mean, volatility, skewness, and kurtosis) as the time to expiry changes. In par-

ticular, the nonparametric estimates suggest that distributions for longer horizons

have increasingly larger skewness and kurtosis: whereas the distributions for short

horizons are not too different from normal distributions, this is not true for longer

horizons. (See their Fig 7.)

4. They also argue that there is little evidence of instability in the implicit pdf over

their sample.

Bibliography

Ait-Sahalia, Y., and A. W. Lo, 1998, Nonparametric estimation of state-price densities

implicit in financial asset prices, Journal of Finance, 53, 499547.

161

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial

markets, Princeton University Press, Princeton, New Jersey.

Hrdle, W., 1990, Applied nonparametric regression, Cambridge University Press, Cam-

bridge.

bridge University Press, Cambridge.

Press.

162

7 Predicting and Modelling Volatility

Sections denoted by a star ( ) is not required reading.

Reference: Campbell, Lo, and MacKinlay (1997) 12.2; Taylor (2005) 811; Hamil-

ton (1994) 21; Hentschel (1995); Franses and van Dijk (2000); Andersen, Bollerslev,

Christoffersen, and Diebold (2005)

7.1 Heteroskedasticity

and financial data.

The perhaps most straightforward way to gauge heteroskedasticity is to estimate a

time-series of realized variances from rolling samples. For a zero-mean variable, u t ,

this could be

q

2 1X 2

t D u D .u2t 1 C u2t 2 C : : : C u2t q /=q: (7.1)

q sD1 t s

Notice that t2 depends on lagged information, and could therefore be thought of as the

prediction (made in t 1) of the volatility in t . Unfortunately, this method can produce

quite abrupt changes in the estimate.

See Figures 7.17.4 for illustrations.

An alternative is to apply an exponentially weighted moving average (EWMA) es-

timator of volatility, which uses all data points since the beginning of the samplebut

where recent observations carry larger weights. Let the weight for lag s be .1 /s

where 0 < < 1, so

1

X

t2 D .1 / s 1 u2t s D .1 /.u2t 1 C u2t 2 C 2 u2t 3 C : : :/; (7.2)

sD1

t2 D .1 /u2t 1 C t2 1 : (7.3)

163

Realized std (44 days), annualized EWMA std, annualized, = 0.99

50 50

40 40

30 30

20 20

10 10

0 0

1980 1990 2000 2010 1980 1990 2000 2010

AR(1) of excess returns, the figures

show std of residual

EWMA std, annualized, = 0.9

50

40

30

20

10

0

1980 1990 2000 2010

The initial value (before the sample) could be assumed to be zero or (better) the uncondi-

tional variance in a historical sample. The EWMA is commonly used by practitioners. For

instance, the RISK Metrics (formerly part of JP Morgan) uses this method with D 0:94

for use on daily data. Alternatively, can be chosen to minimize some criterion function

like tTD1 .u2t t2 /2 .

See Figure 7.2 for an illustration of the weights. (They clearly sum to one.)

Remark 7.1 (VIX) Although VIX is based on option prices, it is calculated in a way

that makes it (an estimate of) the risk-neutral expected variance until expiration, not the

implied volatility, see Britten-Jones and Neuberger (2000) and Jiang and Tian (2005).

We can also estimate the realized covariance of two series (ui t and ujt ) by

q

1X

ij;t D ui;t s uj;t s ; (7.4)

q sD1

164

Weight on lagged data (u2ts ) in EWMA estimate of volatility

= 0.99

0.05 = 0.94

0.04

t2 = (1 )(u2t1 + u2t2 + 2 u2t3 + ...)

0.03

0.02

0.01

0

0 20 40 60 80 100

lag, s

Figure 7.2: Weights on old data in the EWMA approach to estimate volatility

0.04 0.05

0.04

0.035 0.03

0.02

0.03

Mon Tue Wed Thu Fri 0 6 12 18

Hour

5-minute data on EUR/USD changes, 1998:12014:10

Sample size: 1264870

lations.

See Figures 7.67.7 for illustrations.

165

Monthly std, EUR/USD Monthly std, GBP/USD

based on 5-minute changes,

0.1 1998:12014:10 0.1

0.05 0.05

0 0

2000 2005 2010 2000 2005 2010

0.1 0.1

0.05 0.05

0 0

2000 2005 2010 2000 2005 2010

Std, EWMA estimate, = 0.9 CBOE volatility index (VIX)

50 50

40 40

30 30

20 20

10 10

0 0

1990 1995 2000 2005 2010 2015 1990 1995 2000 2005 2010 2015

S&P 500, daily data 1954:12014:3

contract has a zero price in inception (in t ) and the payoff at expiration (in t C m) is

166

Monthly corr, EUR/USD and GBP/USD Monthly corr, EUR/USD and CHF/USD

1 1

0.5 0.5

0 0

-0.5 -0.5

2000 2005 2010 2000 2005 2010

1 based on 5-minute changes,

1998:12014:10

0.5

-0.5

2000 2005 2010

Correlation of FTSE 100 and DAX 30 Correlation of FTSE 100 and DAX 30

1 1

Corr

Corr

0.5 0.5

0 0

1995 2000 2005 2010 2015 1995 2000 2005 2010 2015

where the variance swap rate (also called the strike or forward price for ) is agreed on at

inception (t) and the realized volatility is just the sample variance for the swap period.

Both rates are typically annualized, for instance, if data is daily and includes only trading

167

days, then the variance is multiplied by 252 or so (as a proxy for the number of trading

days per year).

A volatility swap is similar, except that the payoff it is expressed as the difference

between the standard deviations instead of the variances

p

Volatility swap payoff t Cm = realized variance t Cm volatility swap rate t , (7.7)

If we use daily data to calculate the realized variance from t until the expiration(RV tCm ),

then

252 Pm 2

RV t Cm D sD1 R tCs ; (7.8)

m

where R t Cs is the net return on day t C s. (This formula assumes that the mean return is

zerowhich is typically a good approximation for high frequency data. In some cases,

the average is taken only over m 1 days.)

Notice that both variance and volatility swaps pays off if actual (realized) volatility

between t and t C m is higher than expected in t . In contrast, the futures on the VIX pays

off when the expected volatility (in t C m) is higher than what was thought in t. In a way,

we can think of the VIX futures as a futures on a volatility swap (between t C m and a

month later).

Since VIX2 is a good approximation of variance swap rate for a 30-day contract, the

return can be approximated as

Figures 7.8 and 7.9 illustrate the properties for the VIX and realized volatility of the

S&P 500. It is clear that the mean return of a variance swap (with expiration of 30

days) would have been negative on average. (Notice: variance swaps were not traded

for the early part of the sample in the figure.) The excess return (over a riskfree rate)

would, of course, have been even more negative. This suggests that selling variance

swaps (which has been the speciality of some hedge funds) might be a good dealexcept

that it will incur some occasional really large losses (the return distribution has positive

skewness). Presumably, buyers of the variance swaps think that this negative average

return is a reasonable price to pay for the hedging properties of the contractsalthough

the data does not suggest a very strong negative correlation with S&P 500 returns.

168

VIX (solid) and realized volatility (dashed)

measured over the last 30 days

70

60

50

40

30

20

10

1.5 Daily data on VIX and S&P 500 1990:22014:3

0.5

0

1 0.5 0 0.5 1 1.5 2 2.5

169

7.1.3 Forecasting Realized Volatility

Implied volatility from options (iv) should contain information about future volatilityas

is therefore often used as a predictor. It is unclear, however, if the iv is more informative

than recent (actual) volatility, especially since they are so similarsee Figure 7.8.

Table 7.1 shows that the iv (here represented by VIX) is close to be an unbiased

predictor of future realized volatility since the slope coefficient is close to one. However,

the intercept is negative, which suggests that the iv overestimate future realized volatility.

This is consistent with the presence of risk premia in the iv, but also with subjective beliefs

(pdfs) that are far from looking like normal distributions. By using both iv and the recent

realized volatility, the forecast powers seems to improve.

Remark 7.2 (Restricting the predicted volatility to be positive) A linear regression (like

those in Table 7.1) can produce negative volatility forecasts. An easy way to get around

that is to specify the regression in terms on the log volatility.

Remark 7.3 (Restricting the predicted correlation to be between 1 and 1) The per-

haps easiest way to do that is to specify the regression equation in terms of the Fisher

transformation, z D 1=2 ln.1 C /=.1 /, where is the correlation coefficient.

The correlation coefficient can then be calculated by the inverse transformation D

exp.2z/ 1=exp.2z/ C 1.

lagged RV 0:75 0:24

.10:23/ .2:01/

lagged VIX 0:91 0:66

.12:09/ .8:05/

constant 3:99 2:63 1:34

.4:01/ . 1:98/ . 1:65/

R2 0:56 0:61 0:62

obs 5986:00 6006:00 5986:00

Table 7.1: Regression of 22-day realized S&P return volatility 1990:22014:3. All daily

observations are used, so the residuals are likely to be autocorrelated. Numbers in paren-

theses are t-stats, based on Newey-West with 30 lags.

170

Corr(EUR,GBP) Corr(EUR,CHF) Corr(EUR,JPY)

lagged Corr(EUR,GBP) 0:88

.27:50/

lagged Corr(EUR,CHF) 0:87

.13:38/

lagged Corr(EUR,JPY) 0:83

.19:31/

constant 0:06 0:10 0:05

.3:56/ .1:93/ .2:77/

R2 0:81 0:76 0:68

obs 190:00 190:00 190:00

rates are against the USD. The monthly correlations are calculated from 5 minute data.

Numbers in parentheses are t-stats, based on Newey-West with 1 lag.

y t D x t0 b C u t ; where (7.10)

E u t D 0 and Cov.xi t ; u t / D 0:

In the standard case we assume that u t is iid (independently and identically distributed),

which rules out heteroskedasticity.

In case the residuals actually are heteroskedasticity, least squares (LS) is nevertheless

a useful estimator: it is still consistent (we get the correct values as the sample becomes

really large)and it is reasonably efficient (in terms of the variance of the estimates).

However, the standard expression for the standard errors (of the coefficients) is (except in

a special case, see below) not correct. This is illustrated in Table 7.4.

There are two ways to handle this problem. First, we could use some other estimation

method than LS that incorporates the structure of the heteroskedasticity. For instance,

combining the regression model (7.10) with an ARCH structure of the residualsand

estimate the whole thing with maximum likelihood (MLE) is one way. As a by-product

we get the correct standard errors provided, of course, the assumed distribution is cor-

rect. Second, we could stick to OLS, but use another expression for the variance of the

coefficients: a heteroskedasticity consistent covariance matrix, among which Whites

covariance matrix is the most common.

171

RV(EUR) RV(GBP) RV(CHF) RV(JPY)

lagged RV(EUR) 0:64

.7:32/

lagged RV(GBP) 0:72

.10:40/

lagged RV(CHF) 0:34

.2:60/

lagged RV(JPY) 0:56

.5:07/

constant 0:06 0:04 0:22 0:12

.1:62/ .1:33/ .3:16/ .1:85/

D(Tue) 0:12 0:08 0:13 0:11

.11:47/ .6:82/ .4:41/ .3:77/

D(Wed) 0:11 0:09 0:09 0:13

.9:48/ .7:23/ .4:11/ .4:42/

D(Thu) 0:12 0:09 0:13 0:15

.9:98/ .5:80/ .7:01/ .3:81/

D(Fri) 0:12 0:07 0:13 0:13

.6:42/ .4:33/ .9:05/ .4:32/

R2 0:41 0:52 0:12 0:32

obs 4151:00 4151:00 4151:00 4151:00

Table 7.3: Regression of daily realized variance 1998:12013:11. All exchange rates are

against the USD. The daily variances are calculated from 5 minute data. Numbers in

parentheses are t-stats, based on Newey-West with 1 lag.

To test for heteroskedasticity, we can use Whites test of heteroskedasticity. The null

hypothesis is homoskedasticity, and the alternative hypothesis is the kind of heteroskedas-

ticity which can be explained by the levels, squares, and cross products of the regressors

(denoted w t )clearly a special form of heteroskedasticity. The reason for this specifica-

tion is that if the squared residual is uncorrelated with w t , then the usual LS covariance

matrix applieseven if the residuals have some other sort of heteroskedasticity.

To implement Whites test, let wi be the squares and cross products of the regressors.

The test is then to run a regression of squared fitted residuals on w t

uO 2t D w t0 C v t ; (7.11)

and to test if all the slope coefficients (not the intercept) in
are zero. (This can be done

be using the fact that TR2 =.1 R2 / p2 , p D dim.wi / 1:)

172

Scatter plot, iid residuals Scatter plot, Var(residual) depends on x2

20 20

10 10

0 0

y

y

10 10

20 20

10 5 0 5 10 10 5 0 5 10

x x

y = 0.03 + 1.3x + u

Solid regression lines are based on all data,

dashed lines exclude the crossed out data point

Example 7.4 (Whites test) If the regressors include .1; x1t ; x2t / then w t in (7.11) is the

2 2

vector (1; x1t ; x2t ; x1t ; x1t x2t ; x2t ).

D0
D1

: 0 0.5 1 0 0.5 1

Simulated 7:1 13:0 19:1 13:6 19:1 24:9

OLS formula 7:1 10:1 13:3 13:4 16:2 19:3

Whites 7:0 12:7 18:5 13:3 18:7 24:3

Bootstrapped 7:1 12:7 18:5 13:4 18:8 24:4

Table 7.4: Standard error of OLS slope (%) under heteroskedasticity (simulation evi-

dence). Model: y t D 1C0:9x t C t , where t N.0; t2 /, with t2 D .1C
jz t jCjx t j/2 ,

where z t is iid N(0,1) and independent of x t . Sample length: 200. Number of simulations:

25000. The bootstrap draws pairs .ys ; xs / with replacement.

If we reject the null hypothesis in Whites test, then we either have to model both the

regression equation and the volatility process simultaneously (for instance, using MLE)

or adjust the OLS standard errors.

Remark 7.5 (Whites covariance matrix) Recall that the sample moment conditions for

OLS are

T

1X

gN ./ D x t .y t x t0 b/ D 0k1 ;

T t D1

173

where we have k regressors in x t . For the asymptotic distribution we need covariance

p

matrix of T gN ./ and the Jacobian of the moment conditions with respect to the param-

eters (b). Let u t D y t x t0 b and notice that

"p T

#

T X

S0 D Cov xt ut D Cov.x t u t /;

T t D1

where we have assumed that there is no autocorrelation, but we allow for heteroskedas-

ticity since we are using the covariance matrix of x t u t (not imposing that x t and u t are

unrelated). We can estimate this as

T

1

SO D x t x t0 uO 2t :

X

T tD1

T

!

1X

D0 D plim x t x t0 D xx :

T tD1

Combining gives

p d

T .bO b0 / ! N.0; xx1 S0 xx1 /:

found in financial data which shows volatility clustering.

To test for ARCH features, Engles test of ARCH is perhaps the most straightforward.

It amounts to running an AR(q) regression of the squared zero-mean variable (here de-

noted u t )

u2t D ! C a1 u2t 1 C : : : C aq u2t q C v t ; (7.12)

Under the null hypothesis of no ARCH effects, all slope coefficients are zero and the

R2 of the regression is zero. (This can be tested by noting that, under the null hypothesis,

TR2 =.1 R2 / 2q .) This test can also be applied to the fitted residuals from a regression

like (7.10). However, in this case, it is not obvious that ARCH effects makes the standard

expression for the LS covariance matrix invalid (use Whites test (7.11) for this).

It is straightforward to phrase Engles test in terms of GMM moment conditions. We

simply use a first set of moment conditions to estimate the parameters of the regression

174

model, and then test if the following additional (ARCH related) moment conditions are

satisfied at those parameters

2 3

u2t 1

6 :

E6

7 2

4

::

5 t a0 / D 0q1 :

7 .u (7.13)

2

ut q

see if the squared fitted residuals are autocorrelated. We just have to adjust the degrees of

freedom in the asymptotic chi-square distribution by subtracting the number of parameters

estimated in the regression equation. These tests for ARCH effects will typically capture

GARCH (see below) effects as well.

y t D x t0 b C u t ; where (7.14)

E u t D 0 and Cov.xi t ; u t / D 0:

We will study different ways of modelling how the volatility of the residual is autocorre-

lated.

In the ARCH(1) model the residual in the regression equation (7.14) can be written

u t D v t t ; with (7.15)

v t iid with E v t D 0 and Var.v t / D 1;

! > 0 and 0 < 1:

some authors use a different convention for the time subscripts.) We also assume that v t

175

ARCH std, annualized GARCH std, annualized

50 50

40 40

30 30

20 20

10 10

0 0

1980 1990 2000 2010 1980 1990 2000 2010

with ARCH(1) errors with GARCH(1,1) errors

t2 = + u2t1 t2 = + u2t1 + t1

2

0.31 0.08 0.91

See Figure 7.11 for an illustration.

The non-negativity restrictions on ! and are needed in order to guarantee t2 > 0.

The upper bound < 1 is needed in order to make the conditional variance stationary. To

see the latter, notice that the forecast (made in t) of volatility in t C s is

!

E t t2Cs D N 2 C s 1

t2C1 N 2 , with N 2 D (7.17)

;

1

where N 2 is the unconditional variance and we recall that t2C1 is known in t . The forecast

of the variance is just like in an AR(1) process. A value of < 1 is needed to make the

difference equation stable.

The conditional variance of u t Cs is clearly equal to the expected value of t2Cs

since v t is

independent of t . Morover, E t v t C1 D 1 and E t t C1 D t C1 (known in t). Combine to

2 2 2

get E t tC2

2

D ! C t2C1 . Similarly, E t t2C3 D ! C E t t2C2 . Substitute for E t t2C2 to

get E t t2C3 D ! C .! C t2C1 /, which can be written as (7.17). Further periods follow

176

the same pattern.

To prove (7.18), notice that Var t .u t Cs / D E t v t2Cs t2Cs D E t v t2Cs E t t2Cs since v tCs

and t Cs are independent. In addition, E t v t2Cs D 1, which proves (7.18).

If we assume that v t is iid N.0; 1/, then the distribution of u tC1 , conditional on the

information in t , is N.0; t2C1 /, where tC1

2

is known already in t . Therefore, the one-step

ahead distribution is normalwhich can be used for estimating the model with MLE.

However, the distribution of u t C2 (still conditional on the information in t ) is more com-

plicated. Notice that

q

u t C2 D v tC2 t C2 D v t C2 ! C v t2C1 t2C1 ; (7.19)

which is a nonlinear function of v t C2 and v t C1 (which are standard normal) and it depends

on t C2 which is not known in t. This makes u tC2 have a non-normal distribution. In

fact, it will have fatter tails than a normal distribution with the same variance (excess kur-

tosis). This spills over to the unconditional distribution which has the following kurtosis

(assuming > 0)

(

2

E u4t 3 11 3 2 > 3 if denominator is positive

2 2

D (7.20)

.E u t / 1 otherwise.

As a comparison, the kurtosis of a normal distribution is 3 (you also get this by setting

D 0). This means that we can expect u t to have fat tails, but that the standardized resid-

uals u t = t perhaps look more normally distributed. See Figure 7.13 for an illustration

(although based on a GARCH model).

Example 7.6 (Kurtosis) With D 1=3, the kurtosis is 4, at D 0:5 it is 9 and at D 0:6

it is infinite. With D 0, it is 3.

Proof. (of (7.20)) Since v t and t are independent, we have E.u2t / D E.v t2 t2 / D

E t2 and E.u4t / D E.v t4 t4 / D E. t4 / E.v t4 / D E. t4 /3, where the last equality follows

from E.v t4 / D 3 for a standard normal variable. To find E. t4 /, square (7.16) and take

expectations (and use E t2 D !=.1 /)

E t4 D ! 2 C 2 E u4t 1 C 2! E u2t 1

2 2

D! C E. t4 /3 2

C 2! =.1 /, so

2

1C !

E t4 D :

1 3 2 .1 /

177

Multiplying by 3 and dividing by .E u2t /2 D ! 2 =.1 /2 gives (7.20).

the heteroskedasticity or because we want a more efficient estimator of the regression

equation than LS. We therefore want to estimate the full model (7.14)(7.16) by ML or

GMM.

1 ut 2

1

pdf .x t / D p exp ;

2 2 2 2

so the log-likelihood is

1 1 1 u2t

ln L t D ln .2/ ln 2 :

2 2 2 2

If u t and us are independent (uncorrelated if normally distributed), then the joint pdf

is the product of the marginal pdfsand the joint log-likelihood is the sum of the two

likelihoods.

The most common way to estimate the model is to assume that v t is iid N.0; 1/ and

to set up the likelihood function. The log likelihood is easily found, since the model is

conditionally Gaussian. It is

T T

T 1X 1 X u2t

ln L D ln .2/ ln t2 , if (7.21)

2 2 tD1 2 t D1 t2

v t is iid N.0; 1/:

By plugging in (7.14) for u t and (7.16) for t2 , the likelihood function is written in terms

of the data and model parameters. The likelihood function is then maximized with respect

to the parameters. Note that we need a starting value of 12 D ! C u20 . The most

convenient (and common) way is to maximize the likelihood function conditional on a y0

and x0 . That is, we actually have a sample from (t D) 0 to T , but observation 0 is only

used to construct a starting value of 12 . The optimization should preferably impose the

constraints in (7.16). The MLE is consistent.

178

Remark 7.8 (Coding the ARCH(1) ML estimation) A straightforward way of coding the estimation problem (7.14)-(7.16) and (7.21) is as follows.

First, guess values of the parameters b (a vector), ω and α. The guess of b can be taken from an LS estimation of (7.14), and the guess of ω and α from an LS estimation of û²_t = ω + α û²_{t−1} + ε_t, where û_t are the fitted residuals from the LS estimation of (7.14).

Second, loop over the sample (first t = 1, then t = 2, etc.) and calculate û_t from (7.14) and σ²_t from (7.16). Plug in these numbers in (7.21) to find the likelihood value.

Third, make better guesses of the parameters and do the second step again. Repeat until the likelihood value converges (at a maximum).

To impose the restrictions in (7.16), iterate over values of (b, ω̃, ã) and let ω = ω̃² and α = exp(ã)/[1 + exp(ã)].
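As an illustration of Remark 7.8, the following is a minimal sketch (in Python, using numpy and scipy; the function and variable names are ours, not from the text) of the ARCH(1) negative log-likelihood with the reparametrization above. It is only meant as a sketch, not the code behind any of the figures.

```python
import numpy as np
from scipy.optimize import minimize

def arch1_negloglik(theta, y, X):
    """Negative log-likelihood of y_t = x_t'b + u_t with ARCH(1) errors; theta = (b, omega_tilde, a)."""
    k = X.shape[1]
    b = theta[:k]
    omega = theta[k] ** 2                                        # omega > 0
    alpha = np.exp(theta[k + 1]) / (1 + np.exp(theta[k + 1]))    # 0 < alpha < 1
    u = y - X @ b                                                # residuals from the mean equation
    sigma2 = np.empty_like(u)
    sigma2[0] = np.var(u)                                        # starting value for sigma2_1
    for t in range(1, len(u)):
        sigma2[t] = omega + alpha * u[t - 1] ** 2                # (7.16)
    ll = -0.5 * (np.log(2 * np.pi) + np.log(sigma2) + u ** 2 / sigma2)
    return -ll.sum()

# usage sketch: theta0 from LS estimates of b and of the regression of u^2_t on u^2_{t-1}
# res = minimize(arch1_negloglik, theta0, args=(y, X), method="Nelder-Mead")
```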

It is often found that the fitted normalized residuals, û_t/σ_t, still have too fat tails compared with N(0,1). Estimation using other likelihood functions, for instance for a t-distribution, can then be used. Or the estimation can be interpreted as a quasi-ML (it is typically consistent, but requires a different calculation of the covariance matrix of the parameters).

Another possibility is to estimate the model by GMM using, for instance, the following moment conditions

E [ x_t u_t;  u²_t − σ²_t;  u²_{t−1}(u²_t − σ²_t) ] = 0_{(k+2)×1},   (7.22)

where u_t and σ²_t are given by (7.14) and (7.16).

It is straightforward to add more lags to (7.16). For instance, an ARCH(p) would be

σ²_t = ω + α₁ u²_{t−1} + ... + α_p u²_{t−p}.

We then have to add more moment conditions to (7.22), but the form of the likelihood function is the same except that we now need p starting values and that the upper boundary constraint should now be Σ_{j=1}^p α_j ≤ 1.

Instead of specifying an ARCH model with many lags, it is typically more convenient to specify a low-order GARCH (Generalized ARCH) model. The GARCH(1,1) is a simple model where

σ²_t = ω + α u²_{t−1} + β σ²_{t−1}, with ω > 0; α, β ≥ 0; and α + β < 1.   (7.24)

See Figure 7.11 for an illustration.

[Figure: conditional standard deviation (annualized) from an AR(1) of excess returns with GARCH(1,1) errors (σ²_t = ω + αu²_{t−1} + βσ²_{t−1}, coefficients α = 0.08, β = 0.91) and from EWMA estimators with λ = 0.99 and λ = 0.9.]

The non-negativity restrictions are needed in order to guarantee that σ²_t > 0 in all periods. The upper bound α + β < 1 is needed in order to make σ²_t stationary and therefore the unconditional variance finite. To see the latter, notice that in period t we can forecast the future conditional variance (σ²_{t+s}) as (since σ²_{t+1} is known in t)

E_t σ²_{t+s} = σ̄² + (α + β)^{s−1}(σ²_{t+1} − σ̄²), with σ̄² = ω/(1 − α − β),   (7.25)

where σ̄² is the unconditional variance. This has the same form as in the ARCH(1) model (7.17), but where the sum of α and β is like an AR(1) parameter. The restriction α + β < 1 must hold for this difference equation to be stable.

[Figure 7.13: QQ plots of standardized residuals from an AR(1) (u_t/σ) and from an AR(1) with GARCH(1,1) errors (u_t/σ_t) against N(0,1) quantiles; S&P 500 daily returns 1954:1-2014:3, 0.1th to 99.9th percentiles.]

As for the ARCH model, the conditional variance of u_{t+s} is clearly equal to the expected value of σ²_{t+s}, Var_t(u_{t+s}) = E_t σ²_{t+s}.

Assuming that u_t has no autocorrelation, it follows directly from (7.25) that the expected variance of a longer time period (u_{t+1} + u_{t+2} + ... + u_{t+K}) is

Var_t(Σ_{s=1}^K u_{t+s}) = Σ_{s=1}^K E_t σ²_{t+s} = K σ̄² + Σ_{s=1}^K (α + β)^{s−1}(σ²_{t+1} − σ̄²)
                        = K σ̄² + [1 − (α + β)^K]/[1 − (α + β)] (σ²_{t+1} − σ̄²).   (7.27)

This is useful for portfolio choice and asset pricing when the horizon is longer than one period (a day, perhaps).

See Figures 7.12-7.13 for illustrations.
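A small numerical sketch (Python; the parameter values are hypothetical, not estimates from the text) of the forecasts in (7.25) and (7.27):

```python
import numpy as np

def garch_var_forecasts(omega, alpha, beta, sigma2_next, K):
    """E_t sigma2_{t+s} for s=1..K as in (7.25) and the K-period variance in (7.27)."""
    sbar2 = omega / (1 - alpha - beta)                      # unconditional variance
    s = np.arange(1, K + 1)
    per_period = sbar2 + (alpha + beta) ** (s - 1) * (sigma2_next - sbar2)              # (7.25)
    multi = K * sbar2 + (1 - (alpha + beta) ** K) / (1 - (alpha + beta)) * (sigma2_next - sbar2)  # (7.27)
    return per_period, multi

# example with made-up daily parameters and sigma2_{t+1} = 2
per_period, var_10d = garch_var_forecasts(0.01, 0.08, 0.91, sigma2_next=2.0, K=10)
```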

Proof. (of (7.25)-(7.27)) Notice that E_t σ²_{t+2} = ω + α E_t v²_{t+1} E_t σ²_{t+1} + β σ²_{t+1}, since v_t is independent of σ_t. Moreover, E_t v²_{t+1} = 1 and E_t σ²_{t+1} = σ²_{t+1} (known in t). Combine to get E_t σ²_{t+2} = ω + (α + β) σ²_{t+1}. Similarly, E_t σ²_{t+3} = ω + (α + β) E_t σ²_{t+2}, which can be written as (7.25). Further periods follow the same pattern.

To prove (7.27), use (7.25) and notice that Σ_{s=1}^K (α + β)^{s−1} = [1 − (α + β)^K]/[1 − (α + β)].

Remark 7.10 (EWMA) The GARCH(1,1) has many similarities with the exponential moving average estimator of volatility

σ²_t = (1 − λ) u²_{t−1} + λ σ²_{t−1}.

This method is commonly used by practitioners. For instance, RiskMetrics uses this method with λ = 0.94. Clearly, λ plays the same type of role as β in (7.24) and 1 − λ as α. The main differences are that the exponential moving average does not have a constant and that volatility is non-stationary (the coefficients sum to unity). See Figure 7.12 for a comparison.
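A minimal sketch of the EWMA recursion in Remark 7.10 (Python; the starting value is our own choice, not specified in the text):

```python
import numpy as np

def ewma_variance(u, lam=0.94, sigma2_0=None):
    """EWMA variance: sigma2_t = (1 - lam)*u_{t-1}^2 + lam*sigma2_{t-1}."""
    sigma2 = np.empty(len(u))
    sigma2[0] = np.var(u) if sigma2_0 is None else sigma2_0
    for t in range(1, len(u)):
        sigma2[t] = (1 - lam) * u[t - 1] ** 2 + lam * sigma2[t - 1]
    return sigma2
```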

The unconditional distribution of u_t from a GARCH(1,1) also has excess kurtosis (assuming α > 0)

E u⁴_t / (E u²_t)² = 3[1 − (α + β)²] / [1 − (α + β)² − 2α²] if the denominator is positive (and then it exceeds 3), and infinite otherwise.   (7.28)

If α = 0, then the variance becomes deterministic and the distribution normal (with a kurtosis of 3).

Proof. (of (7.28)) Since v_t and σ_t are independent, we have E(u²_t) = E(v²_t σ²_t) = E σ²_t and E(u⁴_t) = E(v⁴_t σ⁴_t) = E(σ⁴_t) E(v⁴_t) = E(σ⁴_t) 3, where the last equality follows from E(v⁴_t) = 3 for a standard normal variable. We also have E(u²_t σ²_t) = E σ⁴_t. Squaring (7.24) and taking expectations gives

E σ⁴_t = E(ω + αu²_{t−1} + βσ²_{t−1})²
       = ω² + α² E u⁴_{t−1} + β² E σ⁴_{t−1} + 2ωα E u²_{t−1} + 2ωβ E σ²_{t−1} + 2αβ E(u²_{t−1}σ²_{t−1})
       = ω² + α² E(σ⁴_t) 3 + β² E σ⁴_t + 2ωα E σ²_t + 2ωβ E σ²_t + 2αβ E σ⁴_t, so
E σ⁴_t = [ω² + 2ω(α + β) E σ²_t] / [1 − 3α² − β² − 2αβ].

Multiplying by 3 and dividing by (E u²_t)² = ω²/(1 − α − β)² gives (7.28).

The GARCH(1,1) corresponds to an ARCH(∞) with geometrically declining weights, which is seen by solving (7.24) recursively by substituting for σ²_{t−1} (and then σ²_{t−2}, σ²_{t−3}, ...)

σ²_t = ω/(1 − β) + α Σ_{j=0}^∞ β^j u²_{t−1−j}.   (7.29)

This suggests that a GARCH(1,1) might be a reasonable approximation of a high-order ARCH.

Proof. (of (7.29)) Substitute for σ²_{t−1} in (7.24), and then for σ²_{t−2}, etc.

σ²_t = ω + αu²_{t−1} + β(ω + αu²_{t−2} + βσ²_{t−2})
     = ω(1 + β) + αu²_{t−1} + βαu²_{t−2} + β²σ²_{t−2}
     = ...

and repeating the substitution gives (7.29).

To estimate the model consisting of (7.14), (7.15) and (7.24) we can still use the likelihood function (7.21) and do MLE. We typically create the starting value of u²_0 as in the ARCH model (use y_0 and x_0 to create u_0), but this time we also need a starting value of σ²_0. It is often recommended that we use σ²_0 = Var(û_t), where û_t are the residuals from an LS estimation of (7.14). It is also possible to assume another distribution than N(0,1).

To impose the restrictions in (7.24), iterate over values of (b, ω̃, ã, b̃) and let ω = ω̃², α = exp(ã)/[1 + exp(ã) + exp(b̃)] and β = exp(b̃)/[1 + exp(ã) + exp(b̃)].

To estimate the GARCH(1,1) with GMM, we can, for instance, use the following moment conditions (where σ²_t is given by (7.24))

E [ x_t u_t;  u²_t − σ²_t;  u²_{t−1}(u²_t − σ²_t);  u²_{t−2}(u²_t − σ²_t) ] = 0_{(k+3)×1}, where u_t = y_t − x'_t b.   (7.30)

The value at risk (as a fraction of the investment) at the level α (say, α = 0.95) is VaRα = −cdf⁻¹(1 − α), where cdf⁻¹() is the inverse of the cdf of the return, so cdf⁻¹(1 − α) is the 1 − α quantile of the return distribution. See Figure 7.14 for an illustration. When the return has an N(μ, σ²) distribution, then VaR_95% = −(μ − 1.64σ). See Figures 7.15-7.17 for an example of time-varying VaR, based on a GARCH model.

[Figure 7.14: Value at risk and the distribution of returns; the 95% VaR marks (the negative of) the 5th percentile of the return distribution.]

[Figure 7.15: Time-varying VaR_95% for S&P 500, daily data 1954:1-2014:3. The VaR is based on N(μ_t, σ²_t) from an AR(1) of excess returns, R^e_t = a + bR^e_{t−1} + u_t, with GARCH(1,1) errors, σ²_t = ω + αu²_{t−1} + βσ²_{t−1}, so VaR_t = −(μ_t − 1.64σ_t) with μ_t = a + bR^e_{t−1}. Coefficient estimates: b = 0.09, α = 0.08, β = 0.91. The horizontal lines are from the unconditional distribution.]

Backtesting a VaR model amounts to checking if (historical) data fits with the VaR numbers. For instance, we first find the VaR_95% and then calculate what fraction of returns is actually below (the negative of) this number. If the model is correct, it should be 5%. We then repeat this for VaR_96%: only 4% of the returns should be below (the negative of) this number. Figures 7.16-7.17 show results from backtesting a VaR model where the volatility follows a GARCH process (to capture the time-varying volatility). The evidence suggests that this model, combined with the assumption that the return is normally distributed (but with time-varying parameters), works relatively well.

[Figure 7.16: Backtesting VaR from a GARCH model, assuming normally distributed shocks: one-day VaR_95% and realized losses, S&P 500 daily data 1954:1-2014:3; the loss exceeded VaR_95% in 0.051 of the cases.]

Remark 7.12 (Bernoulli and binomial distributions) In a Bernoulli distribution, the ran-

dom variable X can only take two values: 1 or 0, with probability p and 1 p respectively.

This gives E.X/ D p and Var.X/ D p.1 p/. After n independent trials, the number of

successes (y) has a binomial distribution with E.y/ D np and Var.y/ D np.1 p/.

To perform a statistical backtest of a VaR model, define a variable that is one if the loss is greater than the VaR

d_t = 1 if R_t < −VaR, and 0 otherwise.   (7.31)

If the VaR model is correct, then d_t has a Bernoulli distribution with probability 1 − α of a one, so the sample average converges in distribution as

Σ_{t=1}^T d_t / T →d N(1 − α, α(1 − α)/T).   (7.32)

[Figure 7.17: Backtesting VaR from a GARCH(1,1) + N() model, daily S&P 500 returns: the empirical Prob(loss > VaR) plotted against the VaR confidence level, together with the null hypothesis and a 90% confidence band.]

This expression can be used for testing the null hypothesis that R_t < −VaR happens for the fraction 1 − α of the observations. Alternatively, we could use GMM with the moment condition

g_t = d_t − (1 − α), with E g_t = 0.   (7.33)

The sample average of the moment condition then satisfies

√T ḡ →d N(0, S_0), so   (7.34)
Σ_{t=1}^T d_t/T →d N(1 − α, S_0/T),   (7.35)

where ḡ = Σ_{t=1}^T g_t/T is the average moment condition and S_0 = Var(√T ḡ) is the variance. The latter can be estimated by, for instance, a Newey-West approach. Clearly, the only difference between (7.32) and (7.35) is that the former specifies the variance as α(1 − α)/T, while the latter opens up the possibility of using another way of estimating the variance. Compare Figures 7.17 and 7.18.
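As an illustration (not the code behind the figures), the backtest in (7.31)-(7.32) can be coded as follows in Python; the Newey-West variant in (7.35) would only change the standard error.

```python
import numpy as np

def backtest_var(returns, var_alpha, alpha=0.95):
    """Fraction of exceedances and a z-statistic based on (7.31)-(7.32); var_alpha is a positive number."""
    d = (returns < -var_alpha).astype(float)       # exceedance indicator (7.31)
    p_hat = d.mean()                               # empirical Prob(loss > VaR)
    p0 = 1 - alpha                                 # value under the null
    se = np.sqrt(p0 * (1 - p0) / len(d))           # std error implied by (7.32)
    z = (p_hat - p0) / se                          # approximately N(0,1) under the null
    return p_hat, z
```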

[Figure 7.18: Backtesting VaR from a GARCH(1,1) + N() model, daily S&P 500 returns: the empirical Prob(loss > VaR) against the VaR confidence level, with the null hypothesis and a 90% confidence band based on GMM (Newey-West) standard errors.]

A very large number of extensions of the basic GARCH model have been suggested. Estimation is straightforward since MLE is done as for any other GARCH model; just the specification of the variance equation differs.

An asymmetric GARCH (Glosten, Jagannathan, and Runkle (1993)) can be constructed as

σ²_t = ω + α u²_{t−1} + β σ²_{t−1} + γ δ(u_{t−1} > 0) u²_{t−1}, where δ(q) = 1 if q is true and 0 else.

This means that the effect of the shock u²_{t−1} is α if the shock was negative and α + γ if the shock was positive. With γ < 0, volatility increases more in response to a negative u_{t−1} (bad news) than to a positive u_{t−1}.

[Figure 7.19: Conditional standard deviation (annualized) from an AR(1) of excess returns with GARCH(1,1) errors (coefficients α = 0.08, β = 0.91) and with EGARCH(1,1) errors (coefficients α = 0.10, β = 0.99, γ = −0.06).]

Another asymmetric model is the EGARCH (exponential GARCH)

ln σ²_t = ω + α |u_{t−1}|/σ_{t−1} + β ln σ²_{t−1} + γ u_{t−1}/σ_{t−1}.   (7.37)

Apart from being written in terms of the log (which is a smart trick to make σ²_t > 0 hold without any restrictions on the parameters), this is an asymmetric model. The |u_{t−1}| term is symmetric: both negative and positive values of u_{t−1} affect the log volatility in the same way. The linear term in u_{t−1} modifies this to make the effect asymmetric. In particular, if γ < 0, then the log volatility increases more in response to a negative u_{t−1} (bad news) than to a positive u_{t−1}. This model is stationary if |β| < 1.
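A minimal sketch of the EGARCH(1,1) recursion in (7.37) (Python; the starting value is our own choice):

```python
import numpy as np

def egarch_variance(u, omega, alpha, beta, gamma, sigma2_0):
    """ln sigma2_t = omega + alpha*|u_{t-1}|/sigma_{t-1} + beta*ln sigma2_{t-1} + gamma*u_{t-1}/sigma_{t-1}."""
    sigma2 = np.empty(len(u))
    sigma2[0] = sigma2_0
    for t in range(1, len(u)):
        s = np.sqrt(sigma2[t - 1])
        sigma2[t] = np.exp(omega + alpha * abs(u[t - 1]) / s
                           + beta * np.log(sigma2[t - 1]) + gamma * u[t - 1] / s)
    return sigma2
```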

To estimate the model, we can still use the likelihood function (7.21) and do a MLE.

We typically create the starting value of u0 D y0 x00 bO and 02 D Var.uO t /, where uO t D

O See Figure 7.19 for an illustration.

y t x t0 b.

Hentschel (1995) estimates several models of this type, as well as a very general

formulation on daily stock index data for 1926 to 1990 (some 17,000 observations). Most

standard models are rejected in favour of a model where t depends on t 1 and ju t 1

bj3=2 .

188

7.6 GARCH Models with Exogenous Variables

We could easily extend the GARCH(1,1) model by adding exogenous variables x_{t−1}, for instance, VIX

σ²_t = ω + α u²_{t−1} + β σ²_{t−1} + γ x_{t−1},   (7.38)

where care must be taken to guarantee that σ²_t > 0. One possibility is to make sure that x_t > 0 and then restrict γ to be non-negative. Alternatively, we could use an EGARCH formulation like

ln σ²_t = ω + α |u_{t−1}|/σ_{t−1} + β ln σ²_{t−1} + γ x_{t−1}.   (7.39)

These models can be estimated with maximum likelihood.

A stochastic volatility model differs from GARCH models by making the volatility truly stochastic. Recall that in a GARCH model, the volatility in period t (σ_t) is known already in t − 1. This is not the case in a stochastic volatility model, where the log volatility follows an ARMA process. The simplest case is the AR(1) formulation

ln σ²_t = ω + φ ln σ²_{t−1} + σ_η η_t, with η_t iid N(0,1).   (7.40)

The estimation of a stochastic volatility model is complicated, and the basic reason is that it is very difficult to construct the likelihood function. So far, the most practical way to do MLE is by simulations.

Instead, stochastic volatility models are often estimated by quasi-MLE. For the model (7.15) and (7.40), this could be done as follows: square (7.15) and take logs to get

ln u²_t = ln σ²_t + ln v²_t.

We could use this as the measurement equation in a Kalman filter (pretending that ln v²_t − E ln v²_t is normally distributed), and (7.40) as the state equation. (The Kalman filter is a convenient way to calculate the likelihood function.) In essence, this is an AR(1) model with noisy observations. If ln v²_t is normally distributed, then this will give MLE, otherwise just a quasi-MLE.

[Figure 7.20: Estimated stochastic volatility (std, annualized) of S&P 500 daily returns 1954:1-2014:3; the two reported parameter estimates are 0.99 and 0.11.]

It can be shown that E ln v²_t ≈ −1.27 and Var(ln v²_t) = π²/2 (with π = 3.14...), so we could write the measurement equation as

ln u²_t = ln σ²_t − 1.27 + w_t, with w_t ~ N(0, π²/2).

In this case, only the state equation contains parameters that we need to estimate: ω, φ, σ_η. See Figure 7.20 for an example.

7.8 (G)ARCH-M

It can make sense to let the conditional volatility enter the regression (mean) equation, for instance, as a proxy for risk which may influence the expected return.

As a simple motivation, consider a mean-variance investor who maximizes E R_p − (k/2) Var(R_p), subject to R_p = υR_m + (1 − υ)R_f, where R_m is the return on the risky asset (the market index) and R_f is the riskfree return.

[Figure 7.21: GARCH-M standard deviation (annualized), S&P 500 daily returns 1954:1-2014:3, from an AR(1) + GARCH-M model, R^e_t = a + bR^e_{t−1} + φσ_t + u_t with σ²_t = ω + αu²_{t−1} + βσ²_{t−1}; coefficient estimates b = 0.09, φ = 0.08, α = 0.08, β = 0.91.]

The solution is

υ = (1/k) E(R_m − R_f)/σ²_m.

In equilibrium, this weight is one (since the net supply of bonds is zero), so we get

E(R_m − R_f) = k σ²_m,

which says that the expected excess return is increasing in both the market volatility and risk aversion (k).

We modify the mean equation (7.14) to include the conditional variance σ²_t or the standard deviation σ_t (taken from any of the models for heteroskedasticity) as a regressor, for instance

y_t = x'_t b + φσ_t + u_t,   (7.43)

which can be estimated by using the likelihood function (7.21) to do MLE.

It can also be noted (see Gourieroux and Jasiak (2001) 11.3) that a slightly modified

GARCH-M model is the discrete time sampling version of a continuous time stochastic

volatility model (where the mean is affected by one Wiener process and the variance by

another).

See Figure 7.21 for an example.

Remark 7.14 (Coding of (G)ARCH-M) We can use the same approach as in Remark 7.8, except that we use (7.43) instead of (7.14) to calculate the residuals (and that we obviously also need a guess of φ).

191

7.9 Multivariate (G)ARCH

This section gives a brief summary of some multivariate models of heteroskedasticity. Let the model (7.14) be a multivariate model where y_t and u_t are n × 1 vectors. We define the conditional (on the information set in t − 1) covariance matrix of u_t as

Σ_t = E_{t−1} u_t u'_t.   (7.44)

It may seem as if a general multivariate extension of the GARCH(1,1) model would be simple, but it is not. The reason is that it would contain far too many parameters. Although we only need to care about the unique elements of Σ_t, that is, vech(Σ_t), this still gives very many parameters

vech(Σ_t) = C + A vech(u_{t−1}u'_{t−1}) + B vech(Σ_{t−1}).   (7.45)

This typically gives too many parameters to handle and makes it difficult to impose sufficient restrictions to make Σ_t positive definite (compare the restrictions of positive coefficients in (7.24)). For instance, with n = 2 the vech model is

[σ_{11,t}; σ_{21,t}; σ_{22,t}] = C + A [u²_{1,t−1}; u_{1,t−1}u_{2,t−1}; u²_{2,t−1}] + B [σ_{11,t−1}; σ_{21,t−1}; σ_{22,t−1}],

where C is 3 × 1 and A and B are 3 × 3 matrices, which is already hard to manage. We have to limit the number of parameters.

The diagonal model assumes that A and B are diagonal. This means that every element

of t follows a univariate process. To make sure that t is positive definite we have

to impose further restrictions. The obvious drawback of this model is that there is no

spillover of volatility from one variable to another.

192

Example 7.16 (Diagonal model, n D 2) With n D 2 we have

2 3 2 3 2 32 2 3 2 32 3

11;t c1 a1 0 0 u1;t 1 b1 0 0 11;t 1

6 7 6 7 6 76 7 6 76 7

1

t D C C A0 u t 0

1ut 1A C B 0t 1 B; (7.46)

where C is symmetric and A and B are n n matrices. Notice that this equation is

specified in terms of t , not vech. t /. (Recall that a quadratic form is positive definite,

provided the matrices are of full rank.)

" # " # " #0 " #" #

11;t 12;t c11 c12 a11 a12 u21;t 1 u1;t 1 u2;t 1 a 11 a 12

D C C

12;t 22;t c12 c22 a21 a22 u1;t 1 u2;t 1 u22;t 1 a21 a22

" #0 " #" #

b11 b12 11;t 1 12;t 1 b11 b12

;

b21 b22 12;t 1 22;t 1 b21 b22

The constant correlation model assumes that every variance follows a univariate GARCH process and that the conditional correlations are constant. To get a positive definite Σ_t, each individual GARCH model must generate a positive variance (same restrictions as before), and all the estimated (constant) correlations must be between −1 and 1. The price is, of course, the assumption of no movements in the correlations.

For instance, with n = 2 the covariance matrix is

[σ_{11,t} σ_{12,t}; σ_{12,t} σ_{22,t}] = [√σ_{11,t} 0; 0 √σ_{22,t}] [1 ρ_{12}; ρ_{12} 1] [√σ_{11,t} 0; 0 √σ_{22,t}],

and each of σ_{11,t} and σ_{22,t} follows a GARCH process. Assuming a GARCH(1,1) as in (7.24) gives 7 parameters (2 × 3 GARCH parameters and one correlation), which is convenient. To impose the restriction that −1 < ρ < 1, iterate over ρ̃ and let ρ = 1 − 2/[1 + exp(ρ̃)].

Remark 7.20 (Estimating the constant correlation model) A quick (and dirty) method for estimating ρ is to first estimate the individual GARCH processes and then estimate the correlation of the standardized residuals u_{1t}/√σ_{11,t} and u_{2t}/√σ_{22,t}.
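A sketch of the quick method in Remark 7.20 (Python; the univariate GARCH variances are assumed to have been estimated already):

```python
import numpy as np

def constant_correlation(u1, u2, sigma2_1, sigma2_2):
    """Correlation of the standardized residuals u_i / sqrt(sigma2_i)."""
    v1 = u1 / np.sqrt(sigma2_1)
    v2 = u2 / np.sqrt(sigma2_2)
    return np.corrcoef(v1, v2)[0, 1]
```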

The dynamic correlation model (see Engle (2002) and Engle and Sheppard (2001)) allows the correlation to change over time. In short, the model assumes that each conditional variance follows a univariate GARCH process and the conditional correlation matrix is (essentially) allowed to follow a univariate GARCH equation.

The conditional covariance matrix is (by definition)

Σ_t = D_t R_t D_t, with D_t = diag(√σ_{ii,t}),   (7.47)

where R_t is the conditional correlation matrix.

Remark 7.21 (diag(a_i) notation) diag(a_i) denotes the n × n matrix with elements a_1, a_2, ..., a_n along the main diagonal and zeros elsewhere. For instance, if n = 2, then diag(a_i) = [a_1 0; 0 a_2].

The correlation matrix R_t is modelled as a GARCH-type process, but with a transformation that guarantees that it is actually a valid correlation matrix. First, let v_t be the vector of standardized residuals and let Q̄ be the unconditional correlation matrix of v_t. For instance, if we assume a GARCH(1,1) structure for the correlation matrix, then we have

Q_t = (1 − α − β) Q̄ + α v_{t−1} v'_{t−1} + β Q_{t−1}, with v_{i,t} = u_{i,t}/√σ_{ii,t},   (7.48)

where α and β are two scalars and Q̄ is the unconditional covariance matrix of the normalized residuals (v_t). To guarantee that the conditional correlation matrix is indeed a correlation matrix, Q_t is treated as if it were a covariance matrix and R_t is simply the implied correlation matrix. That is,

R_t = diag(1/√q_{ii,t}) Q_t diag(1/√q_{ii,t}).   (7.49)

The basic idea of this model is to estimate a conditional correlation matrix as in (7.49) and then scale it up with conditional variances (from univariate GARCH models) to get a conditional covariance matrix as in (7.47).

See Figures 7.22-7.23 for illustrations, which also suggest that the correlation is close to what an EWMA method delivers. The DCC model is used in a study of asset pricing in, for instance, Duffee (2005).

t is

" # "p #" # "p #

11;t 12;t 11;t 0 1 12;t 11;t 0

D p p ;

12;t 22;t 0 22;t 12;t 1 0 22;t

and each of 11t and 22t follows a GARCH process. To estimate the dynamic correla-

tions, we first calculate (where and are two scalars)

" # " # " #" #0 " #

q11;t q12;t 1 qN 12 v1;t 1 v1;t 1 q11;t 1 q12;t 1

D .1 / C C ;

q12;t q22;t qN 12 1 v2;t 1 v2;t 1 q12;t 1 q22;t 1

p

where vi;t 1 D ui;t 1 = i i;t 1 and qN ij is the unconditional correlation of vi;t and vj;t

and we get the conditional correlations by

" # " p #

1 12;t 1 q12;t = q11;t q22;t

D p :

12;t 1 q12;t = q11;t q22;t 1

.qN 12 ; ; /).

[Figure 7.22: Standard deviations (annualized) of DAX and FTSE returns from GARCH(1,1) models, and the conditional correlation from constant-correlation (CC) and DCC models, where the DCC uses Q_t = (1 − α − β)Q̄ + α v_{t−1}v'_{t−1} + β Q_{t−1} with estimated (α, β) = (0.04, 0.96) and v_t the standardized residuals.]

To see what DCC generates, consider the correlation coefficient from a bivariate model

ρ_{12,t} = q_{12,t} / (√q_{11,t} √q_{22,t}), where   (7.50)
q_{12,t} = (1 − α − β) q̄_{12} + α v_{1,t−1} v_{2,t−1} + β q_{12,t−1}.

This is a complicated expression, but the numerator is the main driver: q_{11,t} and q_{22,t} are variances of normalized variables, so they should not be too far from unity. Therefore, q_{12,t} is close to being the correlation itself. The equation for q_{12,t} shows that it has a GARCH structure: it depends on v_{1,t−1}v_{2,t−1} and q_{12,t−1}. Provided α and β are large numbers, we can expect the correlation to be strongly autocorrelated.
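A sketch (not the estimation code used for the figures) of the bivariate DCC recursion in (7.48)-(7.50), given standardized residuals v1 and v2:

```python
import numpy as np

def dcc_correlation(v1, v2, alpha, beta):
    """Conditional correlation rho12_t implied by the DCC recursion (7.50)."""
    qbar12 = np.corrcoef(v1, v2)[0, 1]            # unconditional correlation of v
    q11, q22, q12 = 1.0, 1.0, qbar12              # start at unconditional values
    rho = np.empty(len(v1))
    for t in range(len(v1)):
        rho[t] = q12 / np.sqrt(q11 * q22)
        # Qbar has (approximately) unit diagonal since the v are standardized
        q11 = (1 - alpha - beta) * 1.0    + alpha * v1[t] ** 2    + beta * q11
        q22 = (1 - alpha - beta) * 1.0    + alpha * v2[t] ** 2    + beta * q22
        q12 = (1 - alpha - beta) * qbar12 + alpha * v1[t] * v2[t] + beta * q12
    return rho
```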

196

[Figure 7.23: Conditional correlation of FTSE 100 and DAX 30 returns from constant-correlation, DCC, and EWMA (λ = 0.99) approaches.]

In principle, it is straightforward to specify the likelihood function of the model and then maximize it with respect to the model parameters. For instance, if u_t is iid N(0, Σ_t), then the log likelihood function is

ln L = −(Tn/2) ln(2π) − (1/2) Σ_{t=1}^T ln|Σ_t| − (1/2) Σ_{t=1}^T u'_t Σ_t^{−1} u_t.   (7.51)

In practice, the optimization problem can be difficult since there are typically many parameters. At least, good starting values are required. One way to get starting values is to estimate GARCH(1,1) models for each variable separately, and then estimate the correlation matrix on the standardized residuals.

Remark 7.24 (Estimation of the dynamic correlation model) Engle and Sheppard (2001) suggest estimating the dynamic correlation matrix by a two-step procedure. First, estimate the univariate GARCH processes. Second, use the standardized residuals to estimate the dynamic correlations by maximizing the likelihood function ((7.51), if we assume normally distributed errors) with respect to the parameters α and β. In this second stage, both the parameters for the univariate GARCH processes and the unconditional covariance matrix Q̄ are kept constant.

Quantile regressions are useful for estimating models where the heteroskedasticity is re-

lated to the regressors.

7.10.1 LAD

The least absolute deviations (LAD) estimator is a special case of quantile regressions and a good way to introduce the general concept. LAD minimizes the sum of absolute residuals (rather than the squared residuals)

b̂_LAD = arg min_b Σ_{t=1}^T |y_t − x'_t b|.   (7.52)

The optimization is a non-linear problem, but a simple iteration works nicely (see below). The estimator is typically less sensitive to outliers than OLS. (There are also other ways to estimate robust regression coefficients.) This is illustrated in Figure 7.24.

See Figure 7.25 for an empirical example.

If we assume that the median of the true residual, u_t, is zero, then (under strict assumptions, discussed below) we have

√T (b̂_LAD − b_0) →d N(0, f(0)⁻² Σ_xx⁻¹/4), where Σ_xx = plim Σ_{t=1}^T x_t x'_t/T,   (7.53)

and where f(0) is the value of the pdf of the residual at zero. Unless we know this density function (or else we would probably have used MLE instead of LAD), we need to estimate it, for instance with a kernel density method. However, to arrive at the result in (7.53) we must assume that the residual is independent of the regressors. (This is discussed in some detail below, see quantile regressions.)

[Figure 7.24: Data and regression lines from OLS and LAD for y = 0.75x + u; the OLS estimates are (0.25, 0.90) and the LAD estimates are (0.00, 0.75).]

Example 7.25 (N(0, σ²)) When u_t ~ N(0, σ²), then f(0) = 1/√(2πσ²), so the covariance matrix in (7.53) becomes (π/2) σ² Σ_xx⁻¹. This is π/2 times larger than when using LS.

Remark 7.26 (Algorithm for LAD) The LAD estimator can be written

b̂_LAD = arg min_b Σ_{t=1}^T w_t û_t(b)², with w_t = 1/|û_t(b)| and û_t(b) = y_t − x'_t b,

so it is a weighted least squares problem. It can be shown that iterating on LS with the weights given by 1/|û_t(b̂)|, where the residuals are from the previous iteration, converges very quickly to the LAD estimator.
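A sketch of the iteration in Remark 7.26 (Python; the cap on the weights is our own safeguard against division by zero):

```python
import numpy as np

def lad_irls(y, X, n_iter=50, eps=1e-8):
    """LAD estimate by iterating weighted LS with weights 1/|residual|."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]         # start at OLS
    for _ in range(n_iter):
        u = y - X @ b
        w = 1.0 / np.maximum(np.abs(u), eps)          # weights from previous iteration
        sw = np.sqrt(w)
        b = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return b
```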

y t D x t0 b C u t : (7.54)

199

[Figure 7.25: OLS and LAD slope estimates for ten US industry portfolios (A-J), monthly data 1947:1-2013:12.]

In the OLS context we typically assume E u_t = 0 and Cov(x_t, u_t) = 0. The latter is the same as E(u_t|x_t) = 0, which means that

E(y_t|x_t) = x'_t b.   (7.55)

We can interpret the LAD estimator as an alternative way of getting good estimates of b, especially when the error distribution has fat tails. In fact, when the errors have a Laplace distribution, f(u) = exp(−|u|/σ)/(2σ), then LAD is the MLE.

u and x,

Cov .x; u/ D Cov x; E .ujx/ :

since E u D Ex E .ujx/ D Ex 0 D 0.

variable, then the mean, , is the solution to min E.u /2 , while the median, m, is the

solution to minm E ju mj. (There are some restrictions on u for this to be true, but we

disregard that here.)

200

The previous remark shows that the LAD estimator (7.52) amounts to finding the b

coefficients (in a linear model) so that

Median.y t jx t / D x t0 b: (7.57)

This is the alternative interpretation of the LAD: it tries to set the median of the residuals,

at a given x t vector, equal to zero. In contrast, OLS tries to set the mean of the residuals,

at a given x t vector, to zero.

A quantile regression does not restrict attention to the 0.5 quantile (the median), as is done in (7.56); it rather states that the qth quantile (conditional on x_t) of the residual is zero, so

Q(y_t|x_t; q) = x'_t b^(q).   (7.59)

Here Q(u_t|x_t; q) denotes the qth quantile of u_t at a particular value of x_t and we also index the coefficients b^(q) to remember that this refers to the qth quantile. Clearly, the LAD is the special case when q = 0.5.

We could estimate (see below for how) such coefficients for various quantiles. When x_t just contains a constant and one more regressor, then it is easy to illustrate. See Figure 7.26 for an example where the slopes differ across the quantiles and Figure 7.27 where they do not. In particular, in Figure 7.26 the data follows a location and scale model

y_t = x'_t β + x'_t γ ε_t, where ε_t is iid and independent of x_t.   (7.60)

This is basically a linear model (y_t = x'_t β plus a residual), but where the residuals (u_t = x'_t γ ε_t) are heteroskedastic. In particular, the volatility of u_t is increasing in |x'_t γ|. This highlights the key feature of quantile regressions: they are well suited for showing how both the typical (median) and tail (for instance, the 0.1th and 0.9th) quantiles are related to the regressors. Notice, however, that we are always referring to conditional quantiles, that is, to quantiles of y_t at a particular value of x_t. We are not referring to unconditional quantiles of y_t. This means that the slopes for a high quantile (0.9, say) do not necessarily describe the relation between y_t and x_t at generally (unconditionally) high y_t (or x_t) values; see Figure 7.26.

[Figure 7.26: Scatter and fitted values for various quantiles (0.1, 0.3, 0.5, 0.7, 0.9) when the data follows y = 0.2 + 0.9x + 0.5(x + 3)ε with ε iid N(0,1). Estimating y = a + bx + u for each quantile gives (a, b) of (2.15, 1.62) at the 0.90 quantile, (0.98, 1.18) at 0.70, (0.18, 0.92) at 0.50, and (−1.75, 0.30) at 0.10.]

Rather, the slopes describe the relation between y_t and x_t at high ε_t values. This is perhaps best seen by using the location and scale model in (7.60), which implies

Q(y_t|x_t; q) = x'_t β + x'_t γ Q(ε_t; q)   (7.62)
             = x'_t [β + γ Q(ε_t; q)].   (7.63)

(In the second line, Q(ε_t; q) need not be conditioned on x_t since ε_t is independent of x_t.) Comparing with (7.59) shows that

b^(q) = β + γ Q(ε_t; q).

For instance, if γ > 0, then b^(q) is increasing with q since Q(ε_t; q) is. For instance, if ε_t ~ N(0,1), then Q(ε_t; 0.05) = −1.64 and Q(ε_t; 0.95) = 1.64. This is the case illustrated in Figure 7.26, where the higher slopes at high quantiles basically capture heteroskedasticity. In contrast, Figure 7.27 shows the case where the γ coefficients on the non-constant regressors are all zero: the b^(q) coefficients (except for the constants) are the same across quantiles.

202

[Figure 7.27: Scatter and fitted values for various quantiles (0.1, 0.3, 0.5, 0.7, 0.9) when the data follows y = 0.2 + 0.9x + 0.5(0·x + 3)ε with ε iid N(0,1), so the scale does not depend on x; the estimated slopes are then roughly the same (around 1) across quantiles.]

Figure 7.28 illustrates these points by showing the predicted quantiles of a return as a

function of the lagged return. The empirical evidence suggests that the typical (median)

effect of a lagged return on todays return is almost zero (there is a weak pattern of neg-

ative return to be followed by positive returns and vice versa). More pronounced is the

smaller dispersion of returns after positive returnsand this is where the real payoff of

the quantile regressions is.

The estimated coefficients for the qth quantile, b^(q), solve the following problem

min_{b^(q)} [ Σ_{t: u_t ≥ 0} q |u_t| + Σ_{t: u_t < 0} (1 − q) |u_t| ], where u_t = y_t − x'_t b^(q).   (7.65)

This is a highly non-linear problem (and the objective function does not have continuous derivatives), which can be solved by either a linear programming method or a derivative-free minimization algorithm. As a special case, q = 0.5 gives the LAD where (7.65) becomes

min_{b^(0.5)} 0.5 Σ_{t=1}^T |u_t|, where u_t = y_t − x'_t b^(0.5).   (7.66)

[Figure 7.28: Fitted values from quantile regressions R_t = a + bR_{t−1} + u_t for quantiles 0.01-0.99, S&P 500 daily returns 1979:1-2014:4. The estimated slope is close to zero at the median (−0.03) but clearly negative at the high quantiles (−0.13 at 0.95 and −0.34 at 0.99), so the dispersion of returns is smaller after positive returns.]

Remark 7.29 (Alternative way of writing (7.65)) Suppose u_1 ≥ 0 and u_2 < 0; then the sum in (7.65) can be written qu_1 + (q − 1)u_2. This suggests that if we define a dummy d_t to be 1 if u_t < 0 and zero otherwise, then we can write the sum as (q − d_1)u_1 + (q − d_2)u_2. In general, the minimisation problem can then be written

min_{b^(q)} Σ_{t=1}^T (q − d_t) u_t,

which is often written as min_{b^(q)} Σ_{t=1}^T ρ_q(u_t). As a function of u_t, the ρ_q(u_t) function has a (skewed) v-shape around u_t = 0. For u_t < 0 the function is (q − 1)u_t, so it is linear with a (negative) slope of q − 1, and for u_t ≥ 0 it is also linear but with a slope of q.
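A sketch of minimizing the check-function objective in (7.65)/Remark 7.29 with a derivative-free method (Python; this is a simple illustration, not the linear-programming estimator):

```python
import numpy as np
from scipy.optimize import minimize

def quantile_loss(b, y, X, q):
    """Sum of check-function losses (q - d_t) u_t with u = y - X b, as in Remark 7.29."""
    u = y - X @ b
    d = (u < 0).astype(float)
    return np.sum((q - d) * u)

def quantile_regression(y, X, q):
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]      # start at OLS
    res = minimize(quantile_loss, b0, args=(y, X, q), method="Nelder-Mead")
    return res.x
```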

Under suitable assumptions (discussed below), the asymptotic distribution of the quantile regression estimator is

√T (b̂^(q) − b_0^(q)) →d N(0, q(1 − q) C⁻¹ Σ_xx C⁻¹), where   (7.67)
Σ_xx = plim Σ_{t=1}^T x_t x'_t/T and C = plim Σ_{t=1}^T f(0|x_t) x_t x'_t/T,

where f(0|x_t) is the value of the pdf of the residual, conditional on the regressor value, at a zero residual. If the residual is independent of the regressor, then f(0|x_t) = f(0), where the latter is the unconditional density of the residual. In this case, the covariance matrix can be written q(1 − q) f(0)⁻² Σ_xx⁻¹, which gives the result in (7.53) once we set q = 0.5.

One way of obtaining a consistent estimate of C is via a kernel density estimate

Ĉ = Σ_{t=1}^T w_t x_t x'_t/T, with w_t = [1/(h√(2π))] exp(−(û_t/h)²/2),   (7.68)

where û_t is the fitted residual and h is a bandwidth.

7.11 A Closed-Form GARCH Option Valuation Model, by Heston and Nandi

This paper (Heston and Nandi (2000)) derives an option price formula for an asset that follows a GARCH process. This is applied to S&P 500 index options, and it is found that the model works well compared to a Black-Scholes formula.

The ARCH and GARCH models imply that volatility is random, so they are (strictly

speaking) not consistent with the B-S model. However, they are often combined with the

B-S model to provide an approximate option price. See Figure 7.29 for a comparison

of the actual distribution of the log asset price at different horizons when the returns

are generated by a GARCH modeland a normal distribution with the same mean and

variance. It is clear that the normal distribution is a good approximation unless the horizon

is short and the ARCH component (1 u2t 1 ) dominates the GARCH component (1 t2 1 ).

[Figure 7.29: Distribution of cumulated returns over 1 and 10 days when the returns are generated by a GARCH model, compared with a normal distribution with the same mean and variance; with GARCH parameters (α, β) = (0.8, 0.09) the ARCH component dominates and the normal approximation is poor at short horizons, while with (α, β) = (0.09, 0.8) it works well.]

Over the period from t to t + Δ, the change of the log asset price minus the riskfree rate (including dividends/accumulated interest), that is, the continuously compounded excess return, is modelled as

ln S_t − ln S_{t−Δ} − r = λ h_t + √h_t z_t, where z_t is iid N(0,1) and   (7.69)
h_t = ω + β₁ h_{t−Δ} + α₁ (z_{t−Δ} − γ₁ √h_{t−Δ})².   (7.70)

The conditional variance h_t is like a GARCH(1,1) process, except for the γ₁√h_{t−Δ} term: this additional term makes the response of h_t to an innovation symmetric around γ₁√h_{t−Δ} instead of around zero. (HN also treat the case when the process is of higher order.)

If 1 > 0 then the return, ln S t ln S t , is negatively correlated with subsequent

volatility h t C as often observed in data. To see this, note that the effect on the return of

z t is linear, but that a negative z t drives up the conditional variance h t C D ! C 1 .z t

p

1 h t /2 C 1 h t more than a positive z t (if 1 > 0). The effect on the correlations is

illustrated in Figure 7.30.

The process (7.69)-(7.70) does of course mean that the conditional (as of t − Δ) distribution of the log asset price ln S_t is normally distributed. This is not enough to price options on this asset, since we cannot use a dynamic hedging approach to establish a no-arbitrage price: there are (by the very nature of the discrete model) jumps in the price of the underlying asset. Recall that the price of a call option with strike price K and expiry at date T can be written

C_t = E_t {M_T max(S_T − K, 0)}   (7.71)
    = e^{−r(T−t)} E*_t {max(S_T − K, 0)},   (7.72)

where E*_t is the expectations operator for the risk neutral distribution. See, for instance, Huang and Litzenberger (1988).

[Figure 7.30: Correlation of ln S_t and h_{t+s} (for leads and lags s) in Heston and Nandi (2000, RFS), evaluated at the parameter values λ = 0.205, ω = 0.502×10⁻⁵, α = 0.132×10⁻⁵, β = 0.589, γ = 421.39; the return is negatively correlated with subsequent volatility.]

For parameter estimates on a more recent sample, see Table 7.5. These estimates

suggests that has the wrong sign (high volatility predicts low future returns) and the

persistence of volatility is much higher than in HN ( is much higher).

t ) is normal, that is

207

[Figure 7.31: Conditional distribution of ln S_T in the Heston-Nandi model for horizons T = 5 and T = 50 days, from simulations and from inverting the characteristic function, compared with a normal distribution; ln(S_0) = ln(100) ≈ 4.605 and the parameters are as in Figure 7.30.]

λ   −2.5
ω   1.22e−006
α   0.00259
β   0.903
γ   6.06

Table 7.5: Estimate of the Heston-Nandi model on daily S&P 500 excess returns, in %. Sample: 1990:1-2011:5.

This is the same as assuming that ln S t and ln M t have a bivariate normal distribution

(conditional on the information in t )since this is what it takes to motivates the BS

model. This type of assumption was first used in a GARCH model by Duan (1995), who

effectively assumed that ln M t was iid normally distributed (this assumption is probably

implicit in HN).

HN show that the risk neutral process must then be as in (7.69)(7.70), but with
1

replaced by
1 D
1 C C 1=2 and replaced by 1=2 (not in
1 , of course). This

208

[Figure 7.32: Physical and risk-neutral distribution of ln S_T (T = 50) in the Heston-Nandi model.]

means that they use the assumption about the conditional (as of t ) distribution of S t

to build up a conditional (as of t ) risk neutral distribution of ST for any T > t. This

risk neutral distribution can be calculated by clever tricks (as in HN) or by Monte Carlo

simulations.

Once we have a risk neutral process it is (in principle, at least) straightforward to

derive any option price (for any time to expiry). For a European call option with strike

price K and expiry at date T , the result is

C t .S t ; r; K; T / D e r

Et max ST K; 0 (7.73)

r

D S t P1 e KP2 ; (7.74)

where P1 and P2 are two risk neutral probabilities (implied by the risk neutral version of

(7.69)(7.70), see above). It can be shown that P2 is the risk neutral probability that ST >

K, and that P1 is the delta, @C t .S t ; r; K; T /=@S t (just like in the Black-Scholes model).

In practice, HN calculate these probabilities by first finding the risk neutral characteristic

function of ST , f ./ D Et exp.i ln ST /, where i 2 D 1, and then inverting to get the

probabilities.

Remark 7.30 (Characteristic function and the pdf) The characteristic function of a ran-

209

dom variable x is

f ./ D E exp.ix/

D x exp.ix/pdf.x/dx;

R

where pdf.x/ is the pdf. This is a Fourier transform of the pdf (if x is a continuous random

variable). For instance, the cf of a N.; 2 / distribution is exp.i 2 2 =2/. The pdf

can therefore be recovered by the inverse Fourier transform as

1 R1

pdf.x/ D exp. ix/f ./d:

2 1

In practice, we typically use a fast (discrete) Fourier transform to perform this calcula-

tion, since there are very quick computer algorithms for doing that (see the appendix).

1

A t D A tC1 C ir C B t C1 ! ln.1 21 B t C1 /

2

1 2 1 .i 1 /2

B t D i . C 1 / 1 C 1 B t C1 C ;

2 2 1 1 B t C1

which can be calculated recursively backwards ((AT ; BT ), then (AT 1 ; BT 1 ), and so

forth until (A0 ; B0 )) starting from AT D 0 and BT D 0, where T is the investment

horizon (time to expiration of the option contract). Notice that i is the imaginary number

such that i 2 D 1. Second, the characteristics function for the horizon T is

Remark 7.32 (Characteristic function in the iid case) In the special case when 1 ,
1 and

1 are all zero, then process (7.69)(7.70) has constant variance. Then, the recursions

give

1 2

A0 D T ir C .T 1/ ! i

2

1 2

B0 D i :

2

210

We can then write the characteristic function as

D exp i ln S0 C T .r C !/ 2 T !=2 ;

T .r C !/ and variance T !.

Returns on the index are calculated by using official index plus dividends. The riskfree

rate is taken to be a synthetic T-bill rate created by interpolating different bills to match

the maturity of the option. Weekly data for 19921994 are used (created by using lots of

intraday quotes for all Wednesdays).

HN estimate the GARCH(1,1)-M process (7.69)(7.70) with ML on daily data on

the S&P500 index returns. It is found that the i parameter is large, i is small, and that

1 > 0 (as expected). The latter seems to be important for the estimated h t series (see

Figures 1 and 2).

Instead of using the GARCH(1,1)-M process estimated from the S&P500 index

returns, all the model parameters are subsequently estimated from option prices. Recall

that the probabilities P1 and P2 in (7.74) depend (nonlinearly) on the parameters of the

risk neutral version of (7.69)(7.70). The model parameters can therefore be estimated

by minimizing the sum (across option price observation) squared pricing errors.

In one of several different estimations, HN estimate the model on option data for

the first half 1992 and then evaluate the model by comparing implied and actual option

prices for the second half of 1992. These implied option prices use the model parameters

estimated on data for the first half of the year and an estimate of h t calculated using

these parameters and the latest S&P 500 index returns. The performance of this model is

compared with a Black-Scholes model (among other models), where the implied volatility

in week t 1 is used to price options in period t. This exercise is repeated for 1993 and

1994.

It is found that the GARCH model outperforms (in terms of MSE) the B-S model. In

particular, it seems as if the GARCH model gives much smaller errors for deep out-of-

the-money options (see Figures 2 and 3). HN argue that this is due to two aspects of the

model: the time-profile of volatility (somewhat persistent, but mean-reverting) and the

negative correlation of returns and volatility.

211

7.12 Fundamental Values and Asset Returns in Global Equity Mar-

kets, by Bansal and Lundblad

This paper studies how stock indices for five major markets are related to news about

future cash flows (dividends and/or earnings). It uses monthly data on France, Germany,

Japan, UK, US, and a world market index for the period 19731998.

BL argue that their present value model (stock price equals the present value of future

cash flows) can account for observed volatility of equity returns and the cross-correlation

across markets. This is an interesting result since most earlier present value models have

generated too small movements in returnsand also too small correlations across mar-

kets. The crucial features of the model are a predictable long-run component in cash flows

and time-varying systematic risk.

The excess return on country index i is modelled as

R^e_{it} = β_i R^e_{mt} + ε_{it},   (7.75)

where R^e_{mt} is the world market index. As in CAPM, the market return is proportional to its volatility, here modelled as a GARCH(1,1) process. We therefore have a GARCH-M (-in-Mean) process

R^e_{mt} = γ σ²_{mt} + ε_{mt}, with E_{t−1} ε_{mt} = 0 and Var_{t−1}(ε_{mt}) = σ²_{mt},   (7.76)
σ²_{mt} = ω + α ε²_{m,t−1} + β σ²_{m,t−1}.   (7.77)
A gross return

Di;t C1 C Pi;t C1

Ri;t C1 D ; (7.78)

Pi t

can be approximated in terms of logs (lower case letters)

zi;tC1 zit gi;tC1

212

where i is the average dividend-price ratio for asset i.

Take expectations as of t and solve recursively forward to get the log price/dividend

ratio as a function of expected future dividend growth rates (gi ) and returns (ri )

1

X

pi t di t D zi t is E t .gi;t CsC1 ri;t CsC1 / : (7.80)

sD0

To calculate the right hand side of (7.80), notice the following things. First, the div-

idend growth (cash flow dynamics) is modelled as an ARMA(1,1)see below for de-

tails. Second, the riskfree rate (rf t ) is assumed to follow an AR(1). Third, the expected

return equals the riskfree rate plus the expected excess returnwhich follows (7.75)

(7.77).

Since all these three processes are modelled as univariate first-order time-series pro-

cesses, the solution is

p_{it} − d_{it} = z_{it} = A_{i,0} + A_{i,1} g_{it} + A_{i,2} σ²_{m,t+1} + A_{i,3} r_{ft}.   (7.81)

(BL use an expected dividend growth instead of the actual but that is just a matter of

convenience, and has another timing convention for the volatility.) This solution can be

thought of as the fundamental (log) price-dividend ratio. The main theme of the paper is

to study how well this fundamental log price-dividend ratio can explain the actual values.

The model is estimated by GMM (as a system), but most of the moment conditions

are conventional. In practice, this means that (i) the betas and the AR(1) for the riskfree

rate are estimated by OLS; (ii) the GARCH-M by MLE; (iii) the ARMA(1,1) process

by moment conditions that require the innovations to be orthogonal to the current levels;

and (iv) moment conditions for changes in pi t di t D zi t defined in (7.81). This is the

overidentified part of the model.

As a benchmark for comparison, consider the case when the right hand side in (7.80)

equals a constant. This would happen when the growth rate of cash flows is unpredictable,

the riskfree rate is constant, and the market risk premium is too (which here requires that

the conditional variance of the market return is constant). In this case, the price-dividend

ratio is constant, so the log return equals the cash flow growth plus a constant.

This benchmark case would not be very successful in matching the observed volatility

and correlation (across markets) of returns: cash flow growth seems to be a lot less volatile

213

than returns and also a lot less correlated across markets.

What if we allowed for predictability of cash flow growth, but still kept the assump-

tions of constant real interest rate and market risk premium? Large movements in pre-

dictable cash flow growth could then generate large movements in returns, but hardly the

correlation across markets.

However, large movements in the market risk premium would contribute to both. It is

clear that both mechanisms are needed to get a correlation between zero and one. It can

also be noted that the returns will be more correlated during volatile periodssince this

drives up the market risk premium which is a common component in all returns.

The growth rate of cash flow, gi t , is modelled as an ARMA(1,1). The estimation results

show that the AR parameter is around 0:95 and that the MA parameter is around 0:85.

This means that the growth rate is almost an iid process with very low autocorrelation

but only almost. Since the MA parameter is not negative enough to make the sum of the

AR and MA parameters zero, a positive shock to the growth rate will have a long-lived

effect (even if small). See Figure 7.33.

1

X

yt D "t C as 1 .a C /" t s :

sD1

.1 C a/.a C /

1 D , and s D as 1 for s D 2; 3; : : :

1 C 2 C 2a

and the conditional expectations are

E t y t Cs D as 1 .ay t C " t /; s D 1; 2; : : :

214

[Figure 7.33: Impulse response and autocorrelation function of an ARMA(1,1), y_t = a y_{t−1} + ε_t + θ ε_{t−1}, with a = 0.9 and θ ∈ {−0.8, 0, 0.8}.]

7.12.5 Results

1. The hypothesis that the CAPM regressions have zero intercepts (for all five country

indices) cannot be rejected.

2. Most of the parameters are precisely estimated, except (the risk aversion).

5. The overidentifying restrictions are rejected , but the model still seems able to ac-

count for quite a bit of the data: the volatility and correlation (across countries) of

the fundamental price-dividend ratios are quite similar to those in the data. Note

that the cross correlations are driven by the common movements in the riskfree rate

and the world market risk premia (driven by mt 2

).

215

A Using an FFT to Calculate the PDF from the Charac-

teristic Function

The characteristic function of a random variable x is

h(φ) = E exp(iφx) = ∫_{−∞}^{∞} exp(iφx) f(x) dx,   (A.1)

where f(x) is the pdf. This is a Fourier transform of the pdf (if x is a continuous random variable). For instance, the cf of a N(μ, σ²) distribution is exp(iφμ − φ²σ²/2). The pdf can therefore be recovered by the inverse Fourier transform as

f(x) = (1/2π) ∫_{−∞}^{∞} exp(−iφx) h(φ) dφ.   (A.2)

In practice, we typically use a fast (discrete) Fourier transform to perform this calculation, since there are very quick computer algorithms for doing that.

Approximate the characteristic function (A.1) as the integral over [x_min, x_max] (assuming the pdf is zero outside)

h(φ) = ∫_{x_min}^{x_max} e^{iφx} f(x) dx,   (A.3)

and approximate this integral with a Riemann sum

h(φ) ≈ Σ_{k=1}^N e^{iφx_k} f(x_k) Δx.   (A.4)

Split up [x_min, x_max] into N intervals of equal size, so the step (and interval width) is

Δx = (x_max − x_min)/N.   (A.5)

The mid point of the kth interval is

x_k = x_min + (k − 1/2)Δx,

which means that x_1 = x_min + Δx/2, x_2 = x_min + 1.5Δx and that x_N = x_max − Δx/2.

216

Example A.1 With (x_min, x_max) = (1, 7) and N = 3, then Δx = (7 − 1)/3 = 2. The x_k values are

k   x_k = x_min + (k − 1/2)Δx
1   1 + (1/2)·2 = 2
2   1 + (3/2)·2 = 4
3   1 + (5/2)·2 = 6.

Write the Riemann sum at a grid of φ values as

h_j ≈ Σ_{k=1}^N e^{iφ_j x_k} f_k Δx, with h_j = h(φ_j) and f_k = f(x_k).   (A.7)

We want

φ_j = b + [2π/(NΔx)] (j − 1),   (A.8)

so we can control the central location of the φ values. Use that, together with x_k = x_min + (k − 1/2)Δx, in the Riemann sum

h_j ≈ Σ_{k=1}^N e^{i[x_min + (k−1/2)Δx][b + 2π(j−1)/(NΔx)]} f_k Δx,   (A.9)

and multiply both sides by exp[−i(x_min + Δx/2)·2π(j − 1)/(NΔx)] to get

q_j ≈ Σ_{k=1}^N e^{i2π(j−1)(k−1)/N} Q_k, with Q_k = e^{ibx_k} f_k Δx,   (A.10)

which (apart from a scale factor) has the same form as the ifft. We should therefore be able to calculate Q_k by applying the fft to q_j. We can then recover the density function as

f_k = e^{−ibx_k} Q_k / Δx.   (A.11)
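A sketch of this algorithm in Python (the grid and the choice of b, which centers the φ grid on zero, are our own; the check uses the N(0,1) characteristic function):

```python
import numpy as np

def pdf_from_cf(cf, x_min, x_max, N):
    """Recover pdf values f(x_k) from a characteristic function, as in (A.7)-(A.11)."""
    dx = (x_max - x_min) / N
    x = x_min + (np.arange(1, N + 1) - 0.5) * dx            # mid points of the x grid
    b = -np.pi / dx                                          # center the phi grid on zero
    j = np.arange(N)
    phi = b + 2 * np.pi * j / (N * dx)                       # phi_j grid, (A.8)
    q = np.exp(-1j * (x_min + dx / 2) * 2 * np.pi * j / (N * dx)) * cf(phi)
    Q = np.fft.fft(q) / N                                    # Q_k = exp(i b x_k) f_k dx
    return x, np.real(np.exp(-1j * b * x) * Q) / dx          # (A.11)

# check on the N(0,1) characteristic function exp(-phi^2/2)
x, f = pdf_from_cf(lambda p: np.exp(-p ** 2 / 2), -8.0, 8.0, 1024)
```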

Bibliography

Amemiya, T., 1985, Advanced econometrics, Harvard University Press, Cambridge, Mas-

sachusetts.

forecasting, Working Paper 11188, NBER.

Bansal, R., and C. Lundblad, 2002, Market efficiency, fundamental values, and the size

of the risk premium in global equity markets, Journal of Econometrics, 109, 195237.

217

Britten-Jones, M., and A. Neuberger, 2000, Option prices, implied price processes, and

stochastic volatility, Journal of Finance, 55, 839866.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial markets, Princeton University Press, Princeton, New Jersey.

Duan, J., 1995, The GARCH option pricing model, Mathematical Finance, 5, 1332.

Duffee, G. R., 2005, Time variation in the covariance between stock returns and con-

sumption growth, Journal of Finance, 60, 16731712.

Engle, R. F., 2002, Dynamic conditional correlation: a simple class of multivariate gen-

eralized autoregressive conditional heteroskedasticity models, Journal of Business and

Economic Statistics, 20, 339351.

Engle, R. F., and K. Sheppard, 2001, Theoretical and empirical properties of dynamic

conditional correlation multivariate GARCH, Discussion Paper 2001-15, University

of California, San Diego.

Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance,

Cambridge University Press.

Glosten, L. R., R. Jagannathan, and D. Runkle, 1993, On the relation between the ex-

pected value and the volatility of the nominal excess return on stocks, Journal of Fi-

nance, 48, 17791801.

Gourieroux, C., and J. Jasiak, 2001, Financial econometrics: problems, models, and

methods, Princeton University Press.

Greene, W. H., 2012, Econometric analysis, Pearson Education Ltd, Harlow, Essex, 7th

edn.

Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.

Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,

Cambridge University Press.

Hentschel, L., 1995, All in the family: nesting symmetric and asymmetric GARCH

models, Journal of Financial Economics, 39, 71104.

218

Heston, S. L., and S. Nandi, 2000, A closed-form GARCH option valuation model,

Review of Financial Studies, 13, 585625.

Huang, C.-F., and R. H. Litzenberger, 1988, Foundations for financial economics, Elsevier

Science Publishing, New York.

Jiang, G. J., and Y. S. Tian, 2005, The model-free implied volatility and its information

content, Review of Financial Studies, 18, 13051342.

347370.

Journal of Econometrics, 63, 289306.

Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University

Press.

219

8 Factor Models

Sections denoted by a star ( ) is not required reading.

Let R^e_{it} = R_{it} − R_{ft} be the excess return on asset i in excess over the riskfree asset, and let f_t = R_{mt} − R_{ft} be the excess return on the market portfolio. CAPM with a riskfree return says that α_i = 0 in

R^e_{it} = α_i + b_i f_t + ε_{it}, where E ε_{it} = 0 and Cov(f_t, ε_{it}) = 0.   (8.1)

changes if the test asset is added to the investment opportunity set. See Figure 8.1 for an

illustration.

The basic test of CAPM is to estimate (8.1) on a single asset and then test if the

intercept is zero. This can easily be extended to several assets, where we test if all the

intercepts are zero.

Notice that the test of CAPM can be given two interpretations. If we assume that Rmt

is the correct benchmark, then it is a test of whether asset Ri t is correctly priced (this is

the approach in mutual fund evaluations). Alternatively, if we assume that Ri t is correctly

priced, then it is a test of the mean-variance efficiency of Rmt (compare the Roll critique).

If the residuals in the CAPM regression are iid, then the traditional LS approach is just

fine: estimate (8.1) and form a t-test of the null hypothesis that the intercept is zero. If the

disturbance is iid normally distributed, then this approach is the ML approach.

220

[Figure 8.1: Mean-variance frontiers (for 2 and 3 assets) before and after adding a test asset with abnormal return α = 0 and with α = 0.05; the panel also reports the betas, the covariance matrix and the tangency portfolio weights of the three-asset example.]

Var(α̂) = [1 + (E f_t)²/Var(f_t)] σ²/T   (8.2)
       = (1 + SR_f²) σ²/T,   (8.3)

where σ² is the variance of the residual in (8.1) and SR_f² is the squared Sharpe ratio of the market portfolio (recall: f_t is the excess return on the market portfolio). We see that the uncertainty about the intercept is high when the disturbance is volatile and when the sample is short, but also when the Sharpe ratio of the market is high. Note that a large market Sharpe ratio means that the market asks for a high compensation for taking on risk. A bit of uncertainty about how risky asset i is then gives a large uncertainty about what the risk-adjusted return should be.

Proof. (of (8.2)) Consider the regression equation y t D x t0 b0 C u t . With iid errors

that are independent of all regressors (also across observations), the LS estimator, bOLs , is

221

asymptotically distributed as

p d

T .bOLs b0 / ! N.0; 2 xx1 /, where 2 D E u2t and xx D E tTD1 x t x t0 =T:

When the regressors are just a constant (equal to one) and one variable regressor, f t , so

x t D 1; f t 0 , then we have

" # " #

1 1 f t 1 E f t

xx D E t D1 x t x t0 =T D E

PT T

D , so

P

T t D1 f t f t2 E f t E f t2

" # " #

2

E f t

2

E f t 2

Var.f t / C .E f t /2

E f t

2 xx1 D D :

E f t2 .E f t /2 E ft 1 Var.f t / E ft 1

(In the last line we use Var.f t / D E f t2 .E f t /2 :) The upper left cell is (8.2).

The t-test of the hypothesis that α_0 = 0 is then

α̂/Std(α̂) = α̂/√[(1 + SR_f²) Var(ε_{it})/T] →d N(0,1) under H_0: α_0 = 0.   (8.4)

Note that this is the distribution under the null hypothesis that the true value of the intercept is zero, that is, that CAPM is correct (in this respect, at least).

Remark 8.1 (Quadratic forms of normally distributed random variables) If the n 1

vector X N.0; /, then Y D X 0 1 X 2n . Therefore, if the n scalar random

variables Xi , i D 1; :::; n, are uncorrelated and have the distributions N.0; i2 /, i D

1; :::; n, then Y D inD1 Xi2 =i2 2n .

Instead of a t-test, we can use the equivalent chi-square test

α̂²/Var(α̂) = α̂²/[(1 + SR_f²) Var(ε_{it})/T] →d χ²_1 under H_0: α_0 = 0.   (8.5)

The chi-square test is equivalent to the t-test when we are testing only one restriction, but it has the advantage that it also allows us to test several restrictions at the same time. Both the t-test and the chi-square test are Wald tests (estimate the unrestricted model and then test the restrictions).

It is quite straightforward to use the properties of minimum-variance frontiers (see Gibbons, Ross, and Shanken (1989), and MacKinlay (1995)) to show that the test statistic in (8.5) can be written

α̂_i²/Var(α̂_i) = [(ŜR_c)² − (ŜR_f)²] / [(1 + (ŜR_f)²)/T],   (8.6)

222

where SRf is the Sharpe ratio of the market portfolio and SRc is the Sharpe ratio of

the tangency portfolio when investment in both the market return and asset i is possible.

(Recall that the tangency portfolio is the portfolio with the highest possible Sharpe ratio.)

If the market portfolio has the same (squared) Sharpe ratio as the tangency portfolio of the

mean-variance frontier of Ri t and Rmt (so the market portfolio is mean-variance efficient

also when we take Ri t into account) then the test statistic, O i2 = Var.O i /, is zeroand

CAPM is not rejected.

Proof. (of (8.6)) From the CAPM regression (8.1) we have

" # " # " # " #

Riet i2 m2 C Var."i t / i m2 ei i C i em

Cov e

D , and D :

Rmt i m2 m2 em em

Suppose we use this information to construct a mean-variance frontier for both Ri t and

Rmt , and we find the tangency portfolio, with excess return Rct

e

. It is straightforward to

show that the square of the Sharpe ratio of the tangency portfolio is e0 1 e , where

e is the vector of expected excess returns and is the covariance matrix. By using the

covariance matrix and mean vector above, we get that the squared Sharpe ratio for the

tangency portfolio, e0 1 e , (using both Ri t and Rmt ) is

2 e 2

ec i2

m

D C ;

c Var."i t / m

which we can write as

i2

.SRc /2 D C .SRm /2 :

Var."i t /

Use the notation f t D Rmt Rf t and combine this with (8.3) and to get (8.6).

It is also possible to construct small sample test (that do not rely on any asymp-

totic results), which may be a better approximation of the correct distribution in real-life

samplesprovided the strong assumptions are (almost) satisfied. The most straightfor-

ward modification is to transform (8.5) into an F1;T 1 -test. This is the same as using a

t -test in (8.4) since it is only one restriction that is tested (recall that if Z tn , then

Z 2 F .1; n/).

An alternative testing approach is to use an LR or LM approach: restrict the intercept

in the CAPM regression to be zero and estimate the model with ML (assuming that the

errors are normally distributed). For instance, for an LR test, the likelihood value (when

D 0) is then compared to the likelihood value without restrictions.

A common finding is that these tests tend to reject a true null hypothesis too often

223

when the critical values from the asymptotic distribution are used: the actual small sam-

ple size of the test is thus larger than the asymptotic (or nominal) size (see Campbell,

Lo, and MacKinlay (1997) Table 5.1). To study the power of the test (the frequency of

rejections of a false null hypothesis) we have to specify an alternative data generating

process (for instance, how much extra return in excess of that motivated by CAPM) and

the size of the test (the critical value to use). Once that is done, it is typically found that

these tests require a substantial deviation from CAPM and/or a long sample to get good

power.

2 3 2 3 2 3 2 3

e

R1t 1 1 "1t

6 : 7 6 : 7 6 : 7

6 :: 7 D 6 :: 7 C 6 :: 7 f t C 6 ::: 7 , where (8.7)

6 7

4 5 4 5 4 5 4 5

e

Rnt n n "nt

E "i t D 0 and Cov.f t ; "i t / D 0:

This is a system of seemingly unrelated regressions (SUR)with the same regressor (see,

for instance, Greene (2003) 14). In this case, the efficient estimator (GLS) is LS on each

equation separately. Moreover, the covariance matrix of the coefficients is particularly

simple.

Under the null hypothesis of zero intercepts and iid residuals (although possibly cor-

related across regressions), the LS estimate of the intercept has the following asymptotic

distribution

p

T O !d N 0n1 ; .1 C SR2 / , where (8.8)

2 3

11 : : : 1n

6 : :: 7

5 with ij D Cov."i t ; "jt /:

D6 :: : 7

4

n1 : : : O nn

PT

In practice, we use the sample moments for the covariance matrix, ij D Oi t "Ojt =T .

t D1 "

This result is well known, but a simple proof is found in Appendix A.

224

To test the null hypothesis that all intercepts are zero, we then use the test statistic

T O 0 .1 C SR2 / 1 1

O 2n , where (8.9)

SR2 D .E f /2 = Var.f /:

To test n assets at the same time when the errors are non-iid we make use of the GMM

framework. A special case is when the residuals are iid. The results in this section will

then coincide with those in Section 8.2.

Write the n regressions in (8.7) on vector form as

E " t D 0n1 and Cov.f t ; "0t / D 01n ;

where and are n 1 vectors. Clearly, setting n D 1 gives the case of a single test

asset.

The 2n GMM moment conditions are that, at the true values of and ,

" # " #

"t Ret f t

g t .; / D D : (8.12)

ft "t f t Ret f t

There are as many parameters as moment conditions, so the GMM estimator picks values

of and such that the sample analogues of (8.11) are satisfied exactly

T T O t

" #

e

O D 1 X

O D 1 X R t O f

N ;

g. O / O /

g t .; D 02n1 ; (8.13)

T t D1 T tD1 f t .Ret O f O t/

which gives the LS estimator. For the inference, we allow for the possibility of non-iid

errors, but if the errors are actually iid, then we (asymptotically) get the same results as in

Section 8.2.

With point estimates and their sampling distribution it is straightforward to set up a

Wald test for the hypothesis that all elements in are zero

d

O 0 Var./

O 1 O ! 2n : (8.14)

225

Remark 8.2 (Easy coding of the GMM Problem (8.13)) Estimate by LS, equation by

equation. Then, plug in the fitted residuals in (8.12) to generate time series of the moments

(will be important for the tests).

Remark 8.3 (Distribution of GMM) Let the parameter vector in the moment condition

have the true value b0 . Define

hp i N 0/

@g.b

S0 D Cov T gN .b0 / and D0 D plim :

@b 0

When the estimator solves min gN .b/0 S0 1 gN .b/ or when the model is exactly identified, the

distribution of the GMM estimator is

p d

T .bO b0 / ! N .0k1 ; V / , where V D .D0 S0 1 .D00 / 1 :

When D0 is invertible (as it would be in an exactly identified model), then we can also

write V D D0 1 S0 .D0 1 /0 .

Note that, with a linear model, the Jacobian of the moment conditions does not involve

the parameters that we want to estimate. This means that we do not have to worry about

evaluating the Jacobian at the true parameter values. The probability limit of the Jacobian

is simply the expected value, which can be written as

" #

@gN t .; / 1 ft

plim D D0 D E In

@; f t f t2

" #" #0 !

1 1

D E In ; (8.15)

ft ft

where is the Kronecker product. (The last expression applies also to the case of several

factors.) Notice that we order the parameters as a column vector with the alphas first and

the betas second. It might be useful to notice that in this case

" #" #0 ! 1

1 1

D0 1 D E In ; (8.16)

ft ft

since .A B/ 1

DA 1

B 1

(if conformable).

226

Remark 8.4 (Kronecker product) If A and B are matrices, then

2 3

a11 B a1n B

6 : :: 7

AB D6 :

4 : : 75:

am1 B amn B

Example 8.5 (Two test assets) With assets 1 and 2, the parameter vector is b D 1 ; 2 ; 1 ; 2 0 .

Write out (8.11) as

2 3 2 3

e

gN 1 .; / R1t 1 1 f t " # " #

6 gN 2 .; / 7 XT 6 Re e

6 7 6 7

1 2t 2 2 f t 7 1 XT 1 R 1t 1 1 f t

6 gN .; / 7 D T

6 7 6 7D ;

t D1 6 f .R e T tD1 e

4 t 1t 1 1 f t / 5 ft R2t 2 2 f t

7

4 3 5

e

gN 4 .; / f t .R2t 2 2 f t /

where gN 1 .; / denotes the sample average of the first moment condition. The Jacobian

is

2 3

@gN 1 =@1 @gN 1 =@2 @gN 1 =@1 @gN 1 =@2

N 6 @gN 2 =@1 @gN 2 =@2 @gN 2 =@1 @gN 2 =@2 7

6 7

@g.; /

D66 7

@1 ; 2 ; 1 ; 2 0 4 @gN 3 =@1 @gN 3 =@2 @gN 3 =@1 @gN 3 =@2 5

7

@gN 4 =@1 @gN 4 =@2 @gN 4 =@1 @gN 4 =@2

2 3

1 0 ft 0 " #" #0 !

6 7

1 XT 6 0 1 0 f t 7 1 XT 1 1

D 6

2

7D I2 :

T t D1 6 f T t D1

4 t 0 ft 0 7 5 ft ft

0 f t 0 f t2

p

The asymptotic covariance matrix of T times the sample moment conditions, eval-

uated at the true parameter values, that is at the true disturbances, is defined as

p T ! 1

T X X

S0 D Cov g t .; / D R.s/, where (8.17)

T tD1 sD 1

R.s/ D E g t .; /g t s .; /0 : (8.18)

227

With n assets, we can write (8.18) in terms of the n 1 vector " t as

R.s/ D E g t .; /g t s .; /0

" #" #0

"t "t s

DE

ft "t ft s "t s

" " # ! " # !0 #

1 1

DE "t "t s : (8.19)

ft ft s

The Newey-West estimator is often a good estimator of S0 , but the performance of the

test improved, by imposing (correct, of course) restrictions on the R.s/ matrices.

From Remark 8.3, we can write the covariance matrix of the 2n 1 vector of param-

eters (n parameters in and another n in ) as

p

" #!

O

Cov T D D0 1 S0 .D0 1 /0 : (8.20)

O

Example 8.6 (Special case 1: f t is independent of"" t s , errors are# iid, and n D 1) With

1 E ft

these assumptions R.s/ D 022 if s 0, and S0 D Var."i t /. Combining

E f t E f t2

with (8.15) gives

1

p

" #! " #

O 1 E ft

Cov T D Var."i t /;

O E f t E f t2

which is the same expression as 2 xx1 in (8.2), which assumed iid errors.

" case 1, #but n 1) With these assumptions

1 E ft

R.s/ D 02n2n if s 0, and S0 D E " t "0t . Combining with (8.15)

E ft E ft 2

gives

# 1

p

" #! "

O 1 E ft

Cov D E " t "0t :

T

O E ft E f 2

t

AC BD (if conformable). This is the same as in the SURE case.

228

8.3.2 CAPM and Several Assets: GMM and an LM Test

tions (8.11) and (8.13). The moment conditions are then

" #

Ret f t

E g./ D E D 02n1 : (8.21)

f t .Ret f t /

Since there are q D 2n moment conditions, but only n parameters (the vector), this

model is overidentified.

We could either use a weighting matrix in the GMM loss function or combine the

moment conditions so the model becomes exactly identified.

With a weighting matrix, the estimator solves

N 0 W g.b/;

minb g.b/ N (8.22)

where g.b/

N is the sample average of the moments (evaluated at some parameter vector b),

and W is a positive definite (and symmetric) weighting matrix. Once we have estimated

the model, we can test the n overidentifying restrictions that all q D 2n moment condi-

O If not, the restriction (null hypothesis)

tions are satisfied at the estimated n parameters .

that D 0n1 is rejected. The test is based on a quadratic form of the moment conditions,

N 0 1 g.b/

T g.b/ N which has a chi-square distribution if the correct matrix is used.

Alternatively, to combine the moment conditions so the model becomes exactly iden-

tified, premultiply by a matrix A to get

The model is then tested by testing if all 2n moment conditions in (8.21) are satis-

fied at this vector of estimates of the betas. This is the GMM analogue to a classical

LM test. Once again, the test is based on a quadratic form of the moment conditions,

N 0 1 g.b/

T g.b/ N which has a chi-square distribution if the correct matrix is used.

Details on how to compute the estimates effectively are given in Appendix B.1.

For instance, to effectively use only the last n moment conditions in the estimation,

we specify " #

h i Ret f t

A E g./ D 0nn In E D 0n1 : (8.24)

f t .Ret f t /

229

This clearly gives the classical LS estimator without an intercept

PT

O f t Ret =T

D Pt D1

T

: (8.25)

2

f

t D1 t =T

Example 8.8 (Combining moment conditions, CAPM on two assets) With two assets we

can combine the four moment conditions into only two by

2 3

e

" # 6 R 1t 1 f t

e

7

0 0 1 0 R 2t 2 f t

A E g t .1 ; 2 / D E6

6 7

7 D 021 :

0 0 0 1 6 f .Re f / 7

4 t 1t 1 t 5

e

f t .R2t 2 f t /

Remark 8.9 (Test of overidentifying assumption in GMM) When the GMM estimator

N 0 S0 1 g./

solves the quadratic loss function g./ N (or is exactly identified), then the J test

statistic is

O 0 S 1 g.

N / O d 2

T g. 0 N / ! q k ;

Remark 8.10 (Distribution of GMM, more general results) When GMM solves minb g.b/ N 0 W g.b/

N

O D 0k1 , the distribution of the GMM estimator and the test of overidentifying

N /

or Ag.

assumptions are different from those in Remarks 8.3 and 8.9.

The size (using asymptotic critical values) and power in small samples is often found

to be disappointing. Typically, these tests tend to reject a true null hypothesis too often

(see Campbell, Lo, and MacKinlay (1997) Table 5.1) and the power to reject a false null

hypothesis is often fairly low. These features are especially pronounced when the sample

is small and the number of assets, n, is high. One useful rule of thumb is that a saturation

ratio (the number of observations per parameter) below 10 (or so) is likely to give poor

performance of the test. In the test here we have nT observations, 2n parameters in and

, and n.n C 1/=2 unique parameters in S0 , so the saturation ratio is T =.2 C .n C 1/=2/.

For instance, with T D 60 and n D 10 or at T D 100 and n D 20, we have a saturation

ratio of 8, which is very low (compare Table 5.1 in CLM).

One possible way of dealing with the wrong size of the test is to use critical values

from simulations of the small sample distributions (Monte Carlo simulations or bootstrap

simulations).

230

US industry portfolios, 1970:12013:12 US industry portfolios, 1970:12013:12

Mean excess return 10 10

D

A D

A

H GC H GC

F F

I JB E I JB E

5 5

0 0

0 0.5 1 1.5 0 5 10

e

(against the market) Predicted excess return (i ERm )

e

+ ei

all NaN 0.04 NaN

A (NoDur) 3.56 0.01 8.66 Factor: US market

B (Durbl) -1.09 0.59 13.58 and (Std(ei ))

C (Manuf ) 0.71 0.47 6.30 are in annualized %

D (Enrgy) 3.92 0.08 14.62

E (HiTec) -2.00 0.26 11.84

F (Telcm) 1.93 0.26 11.03

G (Shops) 1.35 0.37 9.44

H (Hlth ) 2.31 0.17 11.30

I (Utils) 2.80 0.12 11.65

J (Other) -0.61 0.57 6.93

This type of test is typically done on portfolios of assets, rather than on the individual

assets themselves. There are several econometric and economic reasons for this. The

econometric techniques we apply need the returns to be (reasonably) stationary in the

sense that they have approximately the same means and covariance (with other returns)

throughout the sample (individual assets, especially stocks, can change character as the

company moves into another business). It might be more plausible that size or industry

portfolios are stationary in this sense. Individual portfolios are typically very volatile,

which makes it hard to obtain precise estimate and to be able to reject anything.

It sometimes makes economic sense to sort the assets according to a characteristic

(size or perhaps book/market)and then test if the model is true for these portfolios.

Rejection of the CAPM for such portfolios may have an interest in itself.

231

alpha t LS t NW t boot

US industry portfolios, 1970:12013:12 all NaN NaN NaN NaN

15 A (NoDur) 3.56 2.71 2.52 2.24

Mean excess return

C (Manuf ) 0.71 0.75 0.71 0.64

10 D (Enrgy) 3.92 1.77 1.76 1.85

D

A

GC E (HiTec) -2.00 -1.11 -1.12 -1.00

I FH JB E 1.93 1.15 1.12 0.97

F (Telcm)

5 G (Shops) 1.35 0.94 0.90 0.89

H (Hlth ) 2.31 1.35 1.38 1.37

0 I (Utils) 2.80 1.58 1.53 1.58

0 0.5 1 1.5 J (Other) -0.61 -0.58 -0.56 -0.47

(against the market)

NW uses 1 lag

The bootstrap samples (yt , xt ) in blocks of 10

3000 simulations

Fit of CAPM

18

16

14

Mean excess return, %

12

10

6 US data 1957:12013:12

25 FF portfolios (B/M and size)

4 p-value for test of model: 0.00

4 6 8 10 12 14 16 18

Predicted mean excess return (CAPM), %

See Campbell, Lo, and MacKinlay (1997) 6.5 (Table 6.1 in particular) and Cochrane

(2005) 20.2.

232

Fit of CAPM

18

16

14

Mean excess return, %

12

10

6 1 (small)

2

3

4

4 5 (large)

4 6 8 10 12 14 16 18

Predicted mean excess return (CAPM), %

One of the more interesting studies is Fama and French (1993) (see also Fama and

French (1996)). They construct 25 stock portfolios according to two characteristics of the

firm: the size and the book value to market value ratio (BE/ME). In June each year, they

sort the stocks according to size and BE/ME. They then form a 5 5 matrix of portfolios,

where portfolio ij belongs to the ith size quantile and the j th BE/ME quantile (so this is

a double-sort). This is illustrated in Table 8.1.

1 2 3 4 5

Size 1 1 2 3 4 5

2 6 7 8 9 10

3 11 12 13 14 15

4 16 17 18 19 20

5 21 22 23 24 25

Fama and French run a traditional CAPM regression on each of the 25 portfolios

233

Fit of CAPM

18

16

14

Mean excess return, %

12

10

6 1 (low)

2

3

4

4 5 (high)

4 6 8 10 12 14 16 18

Predicted mean excess return (CAPM), %

(monthly data 19631991)and then study if the expected excess returns are related

to the betas as they should according to CAPM (recall that CAPM implies E Riet D

i E Rmte

). However, there is little relation between E Riet and i (see Figure 8.4). This

lack of relation (a cloud in the i E Riet space) is due to the combination of two features

of the data. First, within a size quantile there is a negative relation (across BE/ME quan-

tiles) between E Riet and i in stark contrast to CAPM (see Figure 8.5). Second, within

a BE/ME quantile, there is a positive relation (across size quantiles) between E Riet and

i as predicted by CAPM (see Figure 8.6).

In Figure 8.2, the results are presented in two different ways:

PT

1 W i e

t D1 Ri =T (8.26)

PT e T e

2 W i t D1 Rm

P

=T t D1 Ri =T

In the first approach, CAPM says that all data points (different assets, i ) should cluster

around a straight line with a slope equal to the average market excess return, TtD1 Rm

e

=T .

P

In the second approach, CAPM says that all data points should cluster around a 45-degree

234

line. In either case, the vertical distance to the line is i (which should be zero according

to CAPM).

Reference: Cochrane (2005) 12.1; Campbell, Lo, and MacKinlay (1997) 6.2.1

When the K factors, f t , are excess returns, the null hypothesis typically says that i D 0

in

E "i t D 0 and Cov.f t ; "i t / D 0K1 ;

and i is now an K 1 vector. The CAPM regression is a special case when the market

excess return is the only factor. In other models like ICAPM (see Cochrane (2005) 9.2),

we typically have several factors. We stack the returns for n assets to get

2 3 2 3 2 32 3 2 3

e

R1t 1 11 : : : 1K f1t "1t

6 : 7 6 : 7 6 : : : :: 76 : 7 6 : 7

6 :: 7 D 6 :: 7 C 6 :: : : 7 6 :: 7 C 6 :: 7

4 5 4 5 4 54 5 4 5

e

Rnt n n1 : : : nK fKt "nt

or in vector form

E " t D 0n1 and Cov.f t ; "0t / D 0Kn ;

where is n 1 and is n K. Notice that ij shows how the i th asset depends on the

j th factor.

This is, of course, very similar to the CAPM (one-factor) modeland both the LS and

GMM approaches are straightforward to extend.

The results from the LS approach of testing CAPM generalizes directly. In particular,

(8.9) still holdsbut where the residuals are from the multi-factor regressions (8.27) and

235

where the Sharpe ratio of the tangency portfolio (based on the factors) depends on the

means and covariance matrix of all factors

T O 0 .1 C SR2 / 1 1

O 2n , where (8.29)

SR2 D E f 0 Cov.f / 1

E f:

This result is well known, but some properties of SURE models are found in Appendix

A.

" # ! " # !

1 1

E g t .; / D E "t D E .Ret f t / D 0n.1CK/1 :

ft ft

(8.30)

Note that this expression looks similar to (8.11)the only difference is that f t may now

be a vector (and we therefore need to use the Kronecker product). It is then intuitively

clear that the expressions for the asymptotic covariance matrix of O and O will look very

similar too.

When the system is exactly identified, the GMM estimator solves

N

g.; / D 0n.1CK/1 ; (8.31)

which is the same as LS equation by equation. The model can be tested by testing if all

alphas are zeroas in (8.14).

Instead, when we restrict D 0n1 (overidentified system), then we either specify a

weighting matrix W and solve

N 0 W g./;

min g./ N (8.32)

N

AnKn.1CK/ g./ D 0nK1 : (8.33)

" # !

h i 1

A D 0nKn InK E .Ret f t / : (8.34)

ft

236

More generally, details on how to compute the estimates effectively are given in Appendix

B.1.

Example 8.11 (Moment condition with two assets and two factors) The moment condi-

tions for n D 2 and K D 2 are

2 3

e

R1t 1 11 f1t 12 f2t

e

6 7

6 R2t 2 21 f1t 22 f2t 7

6 7

6 f .Re

6 1t 1t 1 11 f1t 12 f2t / 7

7

E g t .; / D E 6 e

7 D 061 :

6 f1t .R2t 2 21 f1t 22 f2t / 7

6 7

6 f .Re f f / 7

4 2t 1t 1 11 1t 12 2t 5

e

f2t .R2t 2 21 f1t 22 f2t /

For the exactly identified case, we have the following results. The expressions for the

Jacobian D0 and its inverse are the same as in (8.15)(8.16). Notice that in this Jacobian

we differentiate the moment conditions (8.30) with respect to vec.; /, that is, where the

parameters are stacked in a column vector with the alphas first, then the betas for the first

factor, followed by the betas for the second factor etc. The test is based on a quadratic

N 0 1 g.b/

form of the moment conditions, T g.b/ N which has a chi-square distribution if

the correct matrix is used. The covariance matrix of the average moment conditions

are as in (8.17)(8.19).

Fama and French (1993) also try a multi-factor model. They find that a three-factor model

fits the 25 stock portfolios fairly well (two more factors are needed to also fit the seven

bond portfolios that they use). The three factors are: the market return, the return on a

portfolio of small stocks minus the return on a portfolio of big stocks (SMB), and the

return on a portfolio with high BE/ME minus the return on portfolio with low BE/ME

(HML).

Remark 8.12 (The Fama-French factors) The SMB and HML are created by a double

sort. First classify firms according to size: small or big, using the median as a cutoff.

Second, classify firms according the book/market value: low (growth stocks, using 30th

237

pval

US industry portfolios, 1970:12013:12 all NaN 0.00 NaN

A (NoDur) 2.66 0.04 8.44

Mean excess return

10 -0.43 0.64 6.02

C (Manuf )

AD 2.73 0.21 14.16

H GC D (Enrgy)

F E (HiTec) 1.27 0.40 9.88

E I J B

5 F (Telcm) 1.56 0.36 10.79

G (Shops) 0.76 0.60 9.33

H (Hlth ) 4.48 0.01 10.59

0 I (Utils) 0.29 0.86 10.47

0 5 10 J (Other) -2.95 0.00 5.82

Predicted excess return (with = 0)

Fama-French model

Factors: US market, SMB (size), and HML (book-to-market)

and (StdErr of residual) are in annualized %

percentile as cutoff), neutral or high (value stocks, using 70th percentile as cutoff). Create

six value weighted portfolios from the intersection of those groups

Small: Small Growth (SG) Small Neutral (SN) Small Value (SV)

Big: Big Growth (BG) Big Neutral (BN) Big Value (BV)

The SMB is the average of the small portfolios minus the average of the big portfolios:

SMB D 1=3.SG C SN C SV / 1=3.BG C BN C BV /. Rearranging gives SMB D

1=3.S G BG/C1=3.SN BN /CSV /C1=3.SV BV /, which shows that it represents

the return on small stocks (for a given book/market) minus the return on big stocks (for

same book/market). The HML is the average of the value stocks minus the growth stocks,

HML D 1=2.SV C BV / 1=2.SG C BG/, which can be rearranged as HML D

1=2.S V SG/ C 1=2.BV BG/, which shows that it represents the return on value

stocks (for a given size) minus the return on growth stocks (for the same size).

Campbell, Lo, and MacKinlay (1997) Table 6.1 or Fama and French (1993) Table 9c),

but it can still capture a fair amount of the variation of expected returnssee Figures

8.78.10.

238

Fit of FF model

18

16

14

Mean excess return, %

12

10

6 US data 1957:12013:12

25 FF portfolios (B/M and size)

4 p-value for test of model: 0.00

4 6 8 10 12 14 16 18

Predicted mean excess return (FF), %

are related to investor/fund characteristics, we often use the calendar time (CalTime) ap-

proach. First define M discrete investor groups (for instance, age 1830, 3140, etc) and

calculate their respective average excess returns (RNjt

e

for group j )

1 P

RNjt

e

D Re ; (8.35)

Nj i 2Groupj i t

Then, we run a factor model

RNjt

e

D j C j0 f t C vjt ; for j D 1; 2; : : : ; M (8.36)

where f t typically includes various return factors (for instance, excess returns on equity

and bonds). By estimating these M equations as a SURE system with Whites (or Newey-

Wests) covariance estimator, it is straightforward to test various hypotheses, for instance,

that the intercept (the alpha) is higher for the M th group than for the for first group.

239

Fit of FF model

18

16

14

Mean excess return, %

12

10

6 1 (small)

2

3

4

4 5 (large)

4 6 8 10 12 14 16 18

Predicted mean excess return (FF), %

Example 8.13 (CalTime with two investor groups) With two investor groups, estimate the

following SURE system

RN 1t

e

D 1 C 10 f t C v1t ;

RN 2t

e

D 2 C 20 f t C v2t :

The CalTime approach is straightforward and the cross-sectional correlations are fairly

easy to handle (in the SURE approach). However, it forces us to define discrete investor

groupswhich makes it hard to handle several different types of investor characteristics

(for instance, age, trading activity and income) at the same time.

The cross sectional regression (CrossReg) approach is to first estimate the factor

model for each investor

and to then regress the (estimated) betas for the pth factor (for instance, the intercept) on

the investor characteristics

Opi D zi0 cp C wpi : (8.38)

240

Fit of FF model

18

16

14

Mean excess return, %

12

10

6 1 (low)

2

3

4

4 5 (high)

4 6 8 10 12 14 16 18

Predicted mean excess return (FF), %

(for an age group, say) or a continuous variable (age, say). Notice that using a continuous

investor characteristics assumes that the relation between the characteristics and the beta

is linearsomething that is not assumed in the CalTime approach. (This saves degrees of

freedom, but may sometimes be a very strong assumption.) However, a potential problem

with the CrossReg approach is that it is often important to account for the cross-sectional

correlation of the residuals.

Reference: Cochrane (2005) 12.2; Campbell, Lo, and MacKinlay (1997) 6.2.3 and 6.3

241

8.5.1 GMM Estimation with General Factors

Linear factor models imply that all expected excess returns are linear functions of the

same vector of factor risk premia ()

2 3 2 32 3

e

R1t 11 : : : 1K 1

6 : 7 6 : ::

E6 :: 7 6 ::: 7 , or

7 6 7

: 7 D 6 :: :

4 : 5 4 : 54 5

e

Rnt n1 : : : nK K

E Ret D ; (8.40)

where is n K.

When the factors are excess returns, then the factor risk premia must equal the ex-

pected excess returns of those factors. (To see this, let the factor also be one of the test

assets. It will then get a beta equal to unity on itself (for instance, regressing Rmt e

on

itself must give a coefficient equal to unity). This shows that for factor k, k D E Rk t .e

More generally, the factor risk premia can be interpreted as follows. Consider an asset

that has a beta of unity against factor k and zero betas against all other factors. This asset

will have an expected excess return equal to k . For instance, if a factor risk premium is

negative, then assets that are positively exposed to it (positive betas) will have a negative

risk premiumand vice versa.

Remark 8.14 (Factor mimicking portfolios) It is more difficult to estimate and test a

model with general factors than a model with excess return factors. A common approach

to get around the difficulties is to replace any general factor with the linear combination

of excess returns that best mimics the general factor. This linear combination can be

constructed be either forming a regression of the general factor on a vector of excess

returns, or by creating an arbitrage portfolio that is long assets that are highly correlated

with the general factor and short assets that are less or even negatively correlated with

the factor.

The old way of testing this is to do a two-step estimation: first, estimate the i vectors

in a time series model like (8.27) (equation by equation); second, use Oi as regressors in

242

Fit of CAPM Fit of 2-factor model

18 18

16 16

Mean excess return, %

14 14

12 12

= 0.67 = 0.63 -11.27

10 b = -13.1 10 b = -299 15520

8 pval: 0.00 8 pval: 0.00

6 6

4 4

5 10 15 5 10 15

Predicted mean excess return, % Predicted mean excess return, %

25 FF portfolios (B/M and size) Ri = i + i Rm and ERi = i

m = 1 + b (f Ef) Ri = i + i1 Rm + i2 R2m and

ERi = i1 1 + i2 2

T

tD1 Riet =T D Oi0 C ui : (8.41)

cross-sectional regression while the previous tests are time series regressions. The main

problem of the cross-sectional approach is that we have to account for the fact that the

regressors in the second step, Oi , are just estimates and therefore contain estimation errors.

This errors-in-variables problem is likely to have two effects (i) it gives a downwards bias

of the estimates of and an upward bias of the mean of the fitted residuals; and (ii)

invalidates the standard expression of the test of .

A way to handle these problems is to combine the moment conditions for the regres-

sion function (8.30) (to estimate ) with (8.40) (to estimate ) to get a joint system

2 " # 3

1

.Ret f t / 7

E g t .; ; / D E 4 f t 5 D 0n.1CKC1/1 : (8.42)

6

Ret

243

Fit of CAPM Fit of 2-factor model

18 18

16 16

Mean excess return, %

14 14

12 12

= 0.52 = 0.52 -25.26

10 b = -10.1 10 b = -654 34774

8 pval: 0.00 8 pval: 0.00

6 6

4 4

5 10 15 5 10 15

Predicted mean excess return, % Predicted mean excess return, %

25 FF portfolios (B/M and size) Ri = i + i Rm and ERi = i , = ERm

m = 1 + b (f Ef) Ri = i + i1 Rm + i2 R2m and

ERi = i1 1 + i2 2 , 1 = ERm

Figure 8.12: CAPM and quadratic model, market excess is exactly priced

against market x 10

3 against market 2

1.4 5

0

1.2

5

1

10

0 10 20 0 10 20

Portfolio Portfolio

US data 1957:12013:12

25 FF portfolios (B/M and size)

We can then test the overidentifying restrictions of the model. There are n.1 C K C

1/ moment condition (for each asset we have one moment condition for the constant,

K moment conditions for the K factors, and one moment condition corresponding to

244

the restriction on the linear factor model). There are only n.1 C K/ C K parameters

(n in , nK in and K in ). We therefore have n K overidentifying restrictions

which can be tested with a chi-square test. Notice that this is, in general, a non-linear

estimation problem, since the parameters in multiply the parameters in . From the

GMM estimation using (8.42) we get estimates of the factor risk premia and also the

variance-covariance of them. This allows us to not only test the moment conditions, but

also to characterize the risk factors and to test if they are priced (each of them, or perhaps

all jointly) by using a Wald test.

One approach to estimate the model is to specify a weighting matrix W and then solve

a minimization problem like (8.32). The test is based on a quadratic form of the moment

N 0 1 g.b/

conditions, T g.b/ N which has a chi-square distribution if the correct matrix

is used. In the special case of W D S0 1 , the distribution is given by Remark 8.3. For

other choices of the weighting matrix, the expression for the covariance matrix is more

complicated.

It is straightforward to show that the Jacobian of these moment conditions (with re-

spect to vec.; ; /) is

2 " #" #0 ! 3

1 1

6 T1 TtD1 In 0n.1CK/K 7

P

D0 D 6 4 h f t

i f t

7

5 (8.43)

0 I nK

0 n

where the upper left block is similar to the expression for the case with excess return

factors (8.15), while the other blocks are new.

Example 8.15 (Two assets and one factor) we have the moment conditions

2 3

e

R1t 1 1 f t

e

6 7

6 R2t 2 2 f t 7

6 7

6 f .Re

6 t 1t 1 1 f t / 7

7

E g t .1 ; 2 ; 1 ; 2 ; / D E 6 e

7 D 061 :

6 f t .R2t 2 2 f t / 7

6 7

e

6

4 R 1t 1 7

5

e

R2t 2

There are then 6 moment conditions and 5 parameters, so there is one overidentifying

restriction to test. Note that with one factor, then we need at least two assets for this

testing approach to work (n K D 2 1). In general, we need at least one more asset

245

than factors. In this case, the Jacobian is

2 3

1 0 ft 0 0

6 7

6

6 0 1 0 ft 0 7 7

@gN 1 XT 6 f t 0 f t2 0 0 7

D

6 7

@1 ; 2 ; 1 ; 2 ; 0 T 2

6 7

tD1 6 0 ft 0 ft 0 7

6 7

6

4 0 0 0 1 7 5

0 0 0 2

2 " #" #0 ! 3

1

PT 1 1

t D1 I2 041 7

D T

ft ft 5:

6

4

0; I2

Instead of estimating the overidentified model (8.42) (by specifying a weighting matrix),

we could combine the moment equations so they become equal to the number of param-

eters. This can be done, by specifying a matrix A and combine as A E g t D 0. This does

not generate any overidentifying restrictions, but it still allows us to test hypotheses about

some moment conditions and about . One possibility is to let the upper left block of A

be an identity matrix and just combine the last n moment conditions, Ret , to just K

moment conditions

A E g t D 0n.1CK/CK1 (8.44)

2 " # 3

" # 1

In.1CK/ 0n.1CK/n .Ret f t / 7

E4 ft 5D0 (8.45)

6

0Kn.1CK/ Kn

Ret

2 " # 3

1

.Ret f t / 7

E4 ft 5D0 (8.46)

6

.Ret /

Here A has n.1 C K/ C K rows (which equals the number of parameters (; ; /) and

n.1 C K C 1/ columns (which equals the number of moment conditions). (Notice also

that is K n, is n K and is K 1.)

Remark 8.16 (Calculation of the estimates based on (8.45)) In this case, we can estimate

and with LS equation by equationas a standard time-series regression of a factor

246

model. To estimate the K 1 vector , notice that we can solve the second set of K

moment conditions as

1

E.Ret / D 0K1 or D ./ E Ret ;

being the regressors, the instruments, and E Ret the dependent variable).

With D 0 , we get the traditional cross-sectional approach (8.39). The only differ-

ence is we here take the uncertainty about the generated betas into account (in the testing).

Alternatively, let be the covariance matrix of the residuals from the time-series estima-

tion of the factor model. Then, using D 0 gives a traditional GLS cross-sectional

approach.

To test the asset pricing implications, we test if the moment conditions E g t D 0 in

(8.44) are satisfied at the estimated parameters. The test is based on a quadratic form of

the moment conditions, T g.b/N 0 1 g.b/

N which has a chi-square distribution if the correct

matrix is used (typically more complicated than in Remark 8.3).

Example 8.17 (LS cross-sectional regression, two assets and one factor) With the mo-

ment conditions in Example (8.15) and the weighting vector D 1 ; 2 (8.46) is

2 e

3

R1t 1 1 f t

e

R2t 2 2 f t

6 7

6 7

A E g t .1 ; 2 ; 1 ; 2 ; / D E 6

6 7

f t .R e

1 1 f t / 7 D 051 ;

6 1t 7

e

f .R f /

6 7

4 t 2t 2 2 t 5

e e

1 .R1t 1 / C 2 .R2t 2 /

which has as many parameters as moment conditions. The test of the asset pricing model

is then to test if

2 3

e

R1t 1 1 f t

e

6 7

6 R2t 2 2 f t 7

6 7

6 f .Re

6 t 1t 1 1 f t / 7

7

E g t .1 ; 2 ; 1 ; 2 ; / D E 6 e

7 D 061 ;

6 f t .R2t 2 2 f t / 7

6 7

e

6

4 R 1t 1 7

5

e

R2t 2

247

Example 8.18 (Structure of E.Ret /) If there are 2 factors and three test assets,

then 021 D E.Ret / is

E R1t

02 e

3 2 3 1

" # " # 11 12 " #

0 11 12 13 B6 7 1 C

D @4E R2t

e 7

421 22 5 A:

6

5

0 21 22 23 2

E R3t

e

31 32

The test of the general multi-factor models is sometimes written on a slightly different

form (see, for instance, Campbell, Lo, and MacKinlay (1997) 6.2.3, but adjust for the

fact that they look at returns rather than excess returns). To illustrate this, note that the

regression equations (8.27) imply that

E Ret D C E f t : (8.47)

D . E f t /; (8.48)

which is another way of summarizing the restrictions that the linear factor model gives.

We can then rewrite the moment conditions (8.42) as (substitute for and skip the last set

of moments)

"" # #

1

E g t .; / D E .Ret . E f t / f t / D 0n.1CK/1 : (8.49)

ft

Note that there are n.1 C K/ moment conditions and nK C K parameters (nK in and

K in ), so there are n K overidentifying restrictions (as before).

Example 8.19 (Two assets and one factor) The moment conditions (8.49) are

2 3

e

R1t 1 . E f t / 1 f t

e

E

6 7

6 R2t 2 . f t / 2 f t

E g t .1 ; 2 ; / D E 6

7

7 D 041 :

6 f Re

4 t 1t 1 . E f t / f

1 t 5

7

e

f t R2t 2 . E f t / 2 f t

This gives 4 moment conditions, but only three parameters, so there is one overidentifying

restriction to testjust as with (8.45).

248

8.5.4 What If the Factors Are Excess Returns?

It would (perhaps) be natural if the tests discussed in this section coincided with those in

Section 8.4 when the factors are in fact excess returns. That is almost so. The difference is

that we here estimate the K 1 vector (factor risk premia) as a vector of free parameters,

while the tests in Section 8.4 impose D E f t . This can be done in (8.45)(8.46) by doing

two things. First, define a new set of test assets by stacking the original test assets and the

excess return factors " #

Ret

RQ et D ; (8.50)

ft

which is an .n C K/ 1 vector. Second, define the K .n C K/ matrix as

h i

Q D 0Kn IK : (8.51)

D E ft : (8.52)

It is also straightforward to show that this gives precisely the same test statistics as the

Wald test on the multifactor model (8.27).

Proof. (of (8.52)) The betas of the RQ et vector are

" #

Q D

nK

:

IK

" # " #

h i Ret h i

nK

0Kn IK E D 0Kn IK , or

ft IK

E f t D :

Remark 8.20 (Two assets, one excess return factor) By including the factors among the

249

test assets and using the weighting vector D 0; 0; 1 gives

2 e

3

R1t 1 1 f t

e

R2t 2 2 f t

6 7

6 7

6 7

6

6 f t 3 3 f t 7

7

A E g t .1 ; 2 ; 3 ; 1 ; 2 ; 3 ; / D E 6 f t .R1te

1 1 f t / 7 D 071 :

6 7

6 7

e

6

6 f t .R2t 2 2 f t / 7

7

f t .f t 3 3 f t /

6 7

4 5

e e

0.R1t 1 / C 0.R2t 2 / C 1.f t 3 /

conditions and as many parameters. To test the asset pricing model, test if the following

moment conditions are satisfied at the estimated parameters

2 e

3

R1t 1 1 f t

e

6 R2t 2 2 f t 7

6 7

6 7

6

6 f t 3 3 f t

7

7

e

6 f t .R1t 1 1 f t / 7

6 7

E g t .1 ; 2 ; 3 ; 1 ; 2 ; 3 ; / D E 6

6 7

f t .R e

2 2 f t / 7 D 091 :

6 2t 7

6 f t .f t 3 3 f t / 7

6 7

6 7

e

6

6 R 1t 1 7

7

e

R2t 2

6 7

4 5

f t 3

In fact, this gives the same test statistic as when testing if 1 and 2 are zero in (8.14).

Remark 8.21 (What is an excess return? ) Short answer: the return of a zero cost port-

folio. More detailed answer: consider a portfolio with the (net) return

Rp D v1 R1 C v2 R2 C v3 R3 C .1 v1 v2 v3 /R4 ;

where vi is the portfolio weight on asset i which has the net return Ri . The balance

(1 v1 v2 v3 ) is made up of asset 4 with the net return R4 (which may be a riskfree

asset). Rearrange as

returns (even if v1 , v2 and/or v3 happen to be negative and they do not sum to unity). If

250

v3 D v2 , then we can rearrange further to get

Rp R4 D v1 .R1 R4 / C v2 .R2 R3 / :

This is still an excess return, although the excess on the right hand side is over different

returns. When we use excess returns as factors, then we typically require the portfolio

weights (see above) to be constant over time.

8.5.5 When Some (but Not All) of the Factors Are Excess Returns

Zt

ft D ; (8.53)

Ft

where Z t is an v 1 vector of excess return factors and F t is a w 1 vector of general

factors (K D v C w).

It makes sense (and is econometrically efficient) to use the fact that the factor risk

premia of the excess return factors are just their average excess returns (as in CAPM).

This can be done in (8.45)(8.46) by doing two things. First, define a new set of test

assets by stacking the original test assets and the excess return factors

" #

Ret

RQ et D ; (8.54)

Zt

" #

0 I

Q D

vn v

; (8.55)

#wn 0wv

" # " #

E Z

Q D

Z t

D ; (8.56)

F .# F / 1 #.E Ret Z Z /

where the Z and F are just betas of the original test assets on Z t and F t respectively

according to the partitioning

h i

nK D nv Z F

nw : (8.57)

One possible choice of # is # D F 0 , since then F are the same as when running a

251

cross-sectional regression of the expected abnormal return (E Ret Z Z ) on the betas

( F ).

Proof. (of (8.56)) The betas of the RQ et vector are

" #

Z F

Q D nv nw

:

Iv 0vw

Q E RQ et D Q Q Q

" #" # " #" #" #

0vn Iv E Ret 0vn Iv Z

nv F

nw Z

D

#wn 0wv E Zt #wn 0wv Iv 0vw F

" # " #" #

E Zt Iv 0vw Z

D :

#wn E R t

e Z F

#wn nv #wn nw F

Z D E Z t :

# E Ret D # Z Z C # F F ; so

F D .# F / 1 #.E Ret Z Z /:

Example 8.22 (Structure of to identify for excess return factors) Continue Example

e

8.18 (where there are 2 factors and three test assets) and assume that Z t D R3t so the

first factor is really an excess returnwhich we have appended last to set of test assets.

Then 31 D 1 and 32 D 0 (regressing Z t on Z t and F t gives the slope coefficients 1

P If we set .11 ; 12 ; 13 / D .0; 0; 1/, then the moment conditions in Example 8.18

and 0.)

can be written

E R1t

02 e

3 2 3 1

" # " # 11 12 " #

0 0 0 1 B6 7 Z C

D @4E R2t e 7

5 421 22 5 A:

6

0 21 22 23 F

E Zt 1 0

252

The first line reads

" #

h i

Z

0 D E Zt 1 0 , so Z D E Z t :

F

Chen, Roll, and Ross (1986) use a number of macro variables as factorsalong with

traditional market indices. They find that industrial production and inflation surprises are

priced factors, while the market index might not be. Breeden, Gibbons, and Litzenberger

(1989) and Lettau and Ludvigson (2001) estimate models where consumption growth is

the factorwith mixed results.

This section discusses how we can estimate and test the asset pricing equation

E m t Ret D 0: (8.58)

N C b 0 .f t

mt D m E f t /; (8.59)

where the K 1 vector f t contains the factors. Combining with (8.58) gives the sample

moment conditions

T

X

N

g.b/ D g t .b/=T D 0n1 , where (8.60)

t D1

N et C b 0 .f t

g t .b/ D m t Ret D mR fNt /Ret ; (8.61)

since m t is a scalar. There are K parameters (in b) and n moment conditions (the number

of assets). The mean of the SDF cannot be estimated from excess returns (it could if we

used returns), but it is straightforward to show that the choice of m

N does not matter for the

test based on excess returns.

Remark 8.23 (The SDF model and the mean SDF) Take expectations of the moment con-

ditions (8.61) and set equal to zero to get

m

253

N b/ D .0; 0/, which makes no sense. Instead, for any m

This would be satisfied by .m; N 0,

we could have

1 0

E Ret D b Cov.f t ; Ret /;

mN

which allows us to test if there is a K 1 vector b that prices all n assets, given how the

covariance matrix of the returns and factors looks like.

Remark 8.24 (Theoretical relation between a linear factor model and an SDF) Rewrite

the last equation in the previous remark as

1 0

E Ret D b Var.f /Var.f / 1 Cov.f t ; Ret /

mN

0

D 0

This shows the relation between a linear SDF model and the beta pricing relation from a

linear factor model.

To estimate this model with a weighting matrix W , we minimize the loss function

N 0 W g.b/:

J D g.b/ N (8.62)

N

AKn g.b/ D 0K1 : (8.63)

To test the asset pricing implications, we test if the moment conditions E g t D 0 are

satisfied at the estimated parameters. The test is based on a quadratic form of the moment

N 0 1 g.b/

conditions, T g.b/ N which has a chi-square distribution if the correct matrix is

used.

Reference: Ferson (1995); Jagannathan and Wang (2002) (theoretical results); Cochrane

(2005) 15 (empirical comparison); Bekaert and Urias (1996); and Sderlind (1999)

The test of the linear factor model and the test of the linear SDF model are (generally)

not the same: they test the same implications of the models, but in slightly different ways.

The moment conditions look a bit differentand combined with non-parametric methods

254

for estimating the covariance matrix of the sample moment conditions, the two methods

can give different results (in small samples, at least). Asymptotically, they are always the

same, as showed by Jagannathan and Wang (2002).

There is one case where we know that the tests of the linear factor model and the

SDF model are identical: when the factors are excess returns and the SDF is constructed

to price these factors as well. To demonstrate this, let R1t e

be a vector of excess returns

on some benchmarks assets. Construct a stochastic discount factor as in Hansen and

Jagannathan (1991):

mt D m N C .R1te

RN 1t

e 0

/ b; (8.64)

where m

N is a constant and b is chosen to make m t price R1t

e

in the sample, that is, so

tTD1 E R1t

e

m t =T D 0: (8.65)

e

, and SDF performance

1 PT

gN 2t D Re m t : (8.66)

T t D1 2t

Let the factor portfolio model be the linear regression

e e

R2t D C R1t C "t ; (8.67)

; " t / D 0. Then, the SDF-performance (pricing error) is

proportional to a traditional alpha

gN 2t =m

N D :

O (8.68)

Notice that (8.68) allows for the possibility that R1t

e

is the excess return on dynamic

portfolios, R1t D s t 1 R0t , where s t 1 are some information variables (not payoffs as

e e

are some basic bench-

marks (S&P500 and bond, perhaps). The reason is that if R0t are excess returns, so are

e

e

R1t D s t 1 R0te

. Therefore, the typical cross-sectional test (of E Re D 0 ) coincides

with the test of the alphaand also of zero SDF pricing errors.

Notice also that R2te

could be the excess return on dynamic strategies in terms of the

test assets, R2t D z t 1 Rpt

e e

, where z t 1 are information variables and Rpt

e

are basic test

assets (mutual funds say). In this case, we are testing the performance of these dynamic

strategies (in terms of mutual funds, say). For instance, suppose R1t is a scalar and the

255

for z t 1 R1t is positive. This would mean that a strategy that goes long in R1t when z t 1

is high (and vice versa) has a positive performance.

Proof. (of (8.68)) (Here written in terms of population moments, to simplify the nota-

tion.) It follows directly that b D Var.R1t e

/ 1 E R1te

N . Using this and the expression

m

for m t in (8.66) gives

E g2t D E R2t

e

N Cov R2t

e e

Var.R1t

e 1

E R1t

e

N

m ; R1t / m:

We now rewrite this equation in terms of the parameters in the factor portfolio model

(8.67). The latter implies E R2t

e

D C E R1te

, and the least squares estimator of the slope

1

coefficients is D Cov R2t ; R1t Var R1t . Using these two facts in the equation

e e e

The simplest way of introducing conditional information is to simply state that the

factors are not just the usual market indices or macro economic series: the factors are

non-linear functions of them (this is sometimes called scaled factors to indicate that

we scale the original factors with instruments). For instance, if Rmt

e

is the return on the

market portfolio and z t 1 is something else which is thought to be important for asset

pricing (use theory), then the factors could be

e

f1t D Rmt and f2t D z t e

1 Rmt : (8.69)

Since the second factor is not an excess return, the test is done as in (8.42).

An alternative interpretation of this is that we have only one factor, but that the coef-

ficient of the factor is time varying. This is easiest seen by plugging in the factors in the

time-series regression part of the moment conditions (8.42), Riet D C f t C "i t ,

e

Riet D C 1 Rmt C 2 z t e

1 Rmt C "i t

e

D C .1 C 2 z t 1 /Rmt C "i t : (8.70)

The first line looks like a two factor model with constant coefficients, while the second

line looks like a one-factor model with a time-varying coefficient (1 C 2 z t 1 ). This

is clearly just a matter of interpretation, since it is the same model (and is tested in the

same way). This model can be estimated and tested as in the case of general factorsas

256

against Rm against zRm

0.05

1.4

1.2 0

1

0.05

0.8

0 10 20 0 10 20

FF portfolio no. FF portfolio no.

e

+

25 FF portfolios (B/M and size) z: lagged momentum return

zt e

1 Rmtis not a traditional excess return.

See Figure 8.148.15 for an empirical illustration.

Remark 8.25 (Figures 8.148.15, equally weighted 25 FF portfolios) Figure 8.14 shows

the betas of the conditional model. It seems as if the small firms (portfolios with low num-

bers) have a somewhat higher exposure to the market in bull markets and vice versa,

while large firms have pretty constant exposures. However, the time-variation is not

marked. Therefore, the conditional (two-factor model) fits the cross-section of average

returns only slightly better than CAPMsee Figure 8.15.

Conditional models typically have more parameters than unconditional models, which

is likely to give small samples issues (in particular with respect to the inference). It is

important to remember some of the new factors (original factors times instruments) are

probably not an excess returns, so the test is done with an LM test as in (8.42).

It is also possible to estimate non-linear factor models. The model could be piecewise

linear or include higher order times. For instance, Treynor and Mazuy (1966) extend the

CAPM regression by including a squared term (of the market excess return) to capture

market timing.

257

Fit of CAPM Fit of 2-factor model

Mean excess return, % 14 14

12 12

10 10

8 8

6 6

4 4

5 10 15 5 10 15

Predicted mean excess return, % Predicted mean excess return, %

e

+

Rie = + Rm

e

+

z: lagged momentum return

25 FF portfolios (B/M, size)

1

0.4 2 = 0.5

0.3 2 = 0.25

0.5

0.2 2 = 0

=1 0.1

=5

0 0

2 0 2 2 0 2

z z

y = [1 G(z)]1 x + G(z)2 x +

G(z) = 1/[1 + exp ( (z c))], c = 0

1 = 0.25

G(z) is a logistic function with

= 2 and c = 0

Figure 8.16: Logistic function and the effective slope coefficient in a Logistic smooth

transition regression

Alternatively, the conditional model (8.70) could be changed so that the time-varying

coefficients are non-linear in the information variable. In the simplest case, this could be

dummy variable regression where the definition of the regimes is exogenous.

More ambitiously, we could use a smooth transition regression, which estimates both

the abruptness of the transition between regimes as well as the cutoff point. Let G.z/

258

be a logistic (increasing but S -shaped) function

1

G.z/ D ; (8.71)

1 C exp
.z c/

where the parameter c is the central location (where G.z/ D 1=2) and
> 0 determines

the steepness of the function (a high
implies that the function goes quickly from 0 to 1

around z D c.) See Figure 8.16 for an illustration. A logistic smooth transition regression

is

y t D .z t 1 /0 x t C " t

D 1 G.z t 1 / 10 C G.z t 0

xt C "t

1 /2

0 0

D 1 G.z t 1 / 1 x t C G.z t 1 /2 x t C "t : (8.72)

At low z t values, the regression coefficients are (almost) 1 and at high z t values they are

(almost) 2 . See Figure 8.16 for an illustration.

Non-Linear least squares (NLS) by concentrating the loss function: optimize (numeri-

cally) over . ; c/ and let (for each value of . ; c/) the parameters (1 ; 2 ) be the OLS

coefficients on the vector of regressors .1 G.z t 1 / x t ; G.z t 1 /x t /.

LSTAR modellogistic smooth transition auto regression model, see Franses and van

Dijk (2000).

For an empirical application to a factor model, see Figures 8.178.18.

8.9 Fama-MacBeth

Reference: Cochrane (2005) 12.3; Campbell, Lo, and MacKinlay (1997) 5.8; Fama and

MacBeth (1973)

The Fama and MacBeth (1973) approach is a bit different from the regression ap-

proaches discussed so faralthough it seems most related to what we discussed in Section

8.5. The method has three steps, described below.

sion). This is often done on the whole sampleassuming the betas are constant.

259

Slope on factor

2

factor: Rm, state: RMom low state

high state

1.5

0.5

0 5 10 15 20 25

FF portfolio no.

Figure 8.17: Betas on the market in the low and high regimes, 25 FF portfolios

14 14

Mean excess return, %

12 12

10 10

8 8

6 6

4 4

5 10 15 5 10 15

Predicted mean excess return, % Predicted mean excess return, %

Rie = + [1 G(z)]1 Rm

e

+

Rie = + Rm

e

+ G(z)2 Rme

+

Monthly US data 1957:12013:12

G(z) is a logistic function

25 FF portfolios (B/M, size)

z: lagged momentum return

Sometimes, the betas are estimated separately for different sub samples (so we

could let Oi carry a time subscript in the equations below).

Second, run a cross sectional regression for every t. That is, for period t , estimate

260

t from the cross section (across the assets i D 1; : : : ; n) regression

where Oi are the regressors. (Note the difference to the traditional cross-sectional

approach discussed in (8.10), where the second stage regression regressed E Riet on

Oi , while the Fama-French approach runs one regression for every time period.)

T

1X

"Oi D "Oi t for i D 1; : : : ; n, (for every asset) (8.74)

T t D1

T

O D 1 O t :

X

(8.75)

T t D1

Oi are estimated, that is, measured with an error. The effect of this is typically to bias

the estimator of t towards zero (and any intercept, or mean of the residual, is biased

upward). One way to minimize this problem, used by Fama and MacBeth (1973), is to

let the assets be portfolios of assets, for which we can expect that some of the individual

noise in the first-step regressions to average outand thereby make the measurement

error in O smaller. If CAPM is true, then the return of an asset is a linear function of the

market return and an error which should be uncorrelated with the errors of other assets

otherwise some factor is missing. If the portfolio consists of 20 assets with equal error

variance in a CAPM regression, then we should expect the portfolio to have an error

variance which is 1/20th as large.

We clearly want portfolios which have different betas, or else the second step regres-

sion (8.73) does not work. Fama and MacBeth (1973) choose to construct portfolios

according to some initial estimate of asset specific betas. Another way to deal with the

errors-in-variables problem is adjust the tests. Jagannathan and Wang (1996) and Jagan-

nathan and Wang (1998) discuss the asymptotic distribution of this estimator.

We can test the model by studying if "i D 0 (recall from (8.74) that "i is the time

average of the residual for asset i , "it ), by forming a t-test "Oi = Std.O"i /. Fama and MacBeth

(1973) suggest that the standard deviation should be found by studying the time-variation

in "Oi t . In particular, they suggest that the variance of "Oi t (not "Oi ) can be estimated by the

261

(average) squared variation around its mean

T

1X

Var.O"i t / D .O"i t "Oi /2 : (8.76)

T t D1

Since "Oi is the sample average of "Oi t , the variance of the former is the variance of the latter

divided by T (the sample size)provided "Oi t is iid. That is,

T

1 1 X

Var.O"i / D Var.O"i t / D 2 .O"i t "Oi /2 : (8.77)

T T t D1

T

O D 1 X O O 2:

Var./ . t / (8.78)

T 2 t D1

Fama and MacBeth (1973) found, among other things, that the squared beta is not

significant in the second step regression, nor is a measure of non-systematic risk.

Proof. (of (8.8)) Write each of the regression equations in (8.7) on a traditional form

" #

1

Riet D x t0 i C "i t , where x t D :

ft

Define

XT XT

xx D plim x t x t0 =T , and ij D plim "i t "jt =T;

t D1 t D1

then the asymptotic covariance matrix of the vectors Oi and Oj (assets i and j ) is ij xx1 =T

(see below for a separate proof). In matrix form,

2 3

11 : : : 1n

p

Cov. T / O D 6 ::: :: 7 1

6

4 5 xx ;

: 7

n1 : : : O nn

where O stacks O1 ; : : : ; On . As in (8.3), the upper left element of xx1 equals 1 C SR2 ,

where SR is the Sharpe ratio of the market.

Proof. (of distribution of SURE coefficients, used in proof of (8.8) ) To simplify,

262

consider the SUR system

y t D x t C u t

z t D
x t C v t ;

where y t ; z t and x t are zero mean variables. We then know (from basic properties of LS)

that

1

O D C PT .x1 u1 C x2 u2 C : : : xT uT /

t D1 x t x t

1

O D
C PT .x1 v1 C x2 v2 C : : : xT vT / :

t D1 x t x t

In the traditional LS approach, we treat x t as fixed numbers (constants) and also assume

that the residuals are uncorrelated across and have the same variances and covariances

across time. The covariance of O and
O is therefore

!2

O
/ 1

Cov.;

2

O D x1 Cov .u1 ; v1 / C x22 Cov .u2 ; v2 / C : : : xT2 Cov .uT ; vT /

PT

t D1 x t x t

!2

1 P

T

D PT t D1 x t x t uv , where uv D Cov .u t ; v t / ;

t D1 x t x t

1

D PT uv :

t D1 x t x t

Divide and multiply by T to get the result in the proof of (8.8). (We get the same results

if we relax the assumption that x t are fixed numbers, and instead derive the asymptotic

distribution.)

Remark A.1 (General results on SURE distribution, same regressors) Let the regression

equations be

yi t D x t0 i C "i t , i D 1; : : : ; n;

where x t is a K 1 vector (the same in all n regressions). When the moment conditions

are arranged so that the first n are x1t " t , then next are x2t " t

E g t D E.x t " t /;

then Jacobian (with respect to the coefs of x1t , then the coefs of x2t , etc) and its inverse

263

are

D0 D xx In and D0 1 D xx1 In :

sD 1 E g t g t s . As

P

2 3 2 3

gN 1 y1t 1 1 f t

6 gN 2 7

6 7 XT 6 7

7D 1

6 y2t 2 2 f t 7

6 6 7;

6 gN 7 T t D1 6 f .y

4 t 1t 1 1 f t / 5

7

4 3 5

gN 4 f t .y2t 2 2 f t /

and

2 3

@gN 1 =@1@gN 1 =@2 @gN 1 =@1 @gN 1 =@2

@gN @gN 2 =@1@gN 2 =@2 @gN 2 =@1 @gN 2 =@2

6 7

6 7

0

D 6 7

@1 ; 2 ; 1 ; 2 6

4@gN 3 =@1@gN 3 =@2 @gN 3 =@1 @gN 3 =@2 7

5

@gN 4 =@1@gN 4 =@2 @gN 4 =@1 @gN 4 =@2

2 3

1 0 ft 0

6 7

1 XT 6 6 0 1 0 ft 7 1 XT

D 7D x t x t0 I2 :

T t D1 6 f

4 t 0 f t2 0 7 5 T t D1

2

0 ft 0 ft

Remark A.2 (General results on SURE distribution, same regressors, alternative order-

ing of moment conditions and parameters ) If instead, the moment conditions are ar-

ranged so that the first K are x t "1t , the next are x t "2t as in

E g t D E." t x t /;

then the Jacobian (wrt the coefficients in regression 1, then the coefficients in regression

2 etc.) and its inverse are

D0 D In . xx / and D0 1 D In . xx1 /:

2 3 2 3

gN 1 y1t 1 1 f t

6 gN 2 7

6 7

XT 6 f t .y1t 1 1 f t / 7

6

7D 1

7

6 6 7;

6 gN 7 T t D1

4 y2t 2 2 f t 5

6 7

4 3 5

gN 4 f t .y2t 2 2 f t /

264

and

2 3

@gN 1 =@1 @gN 1 =@1 @gN 1 =@2 @gN 1 =@2

@gN @gN 2 =@1 @gN 2 =@1 @gN 2 =@2 @gN 2 =@2 7

6 7

6

D 6 7

@1 ; 1 ; 2 ; 2 0 6

4@gN 3 =@1 @gN 3 =@1 @gN 3 =@2 @gN 3 =@2 7

5

@gN 4 =@1 @gN 4 =@1 @gN 4 =@2 @gN 4 =@2

2 3

1 ft 0 0

6 f t f t2

6 7

1 X T 0 0 7 1 XT 0

D 6 7 D I2 xt xt :

T t D1 6 0 0 1 ft 7 T t D1

4 5

0 0 f t f t2

This section describes how the GMM problem can be programmed. We treat the case

with n assets and K Factors (which are all excess returns). The moments are of the form

" # !

1

gt D .Ret f t /

ft

" # !

1

gt D .Ret f t /

ft

Suppose we could write the moments on the form

gt D zt yt x t0 b ;

to make it easy to use matrix algebra in the calculation of the estimate (see below for how

to do that). These moment conditions are similar to those for the instrumental variable

method. In that case we could let

T T T

1X 1X 0 1X

zy D z t y t and zx D z t x t , so g t D zy zx b:

T t D1 T t D1 T t D1

gN t D zy zx b D 0, so bO D zx1 zy :

265

(It is straightforward to show that this can also be calculated equation by equation.) In the

overidentified case with a weighting matrix, the loss function can be written

0

zx W zy 0

zx W zx bO D 0 and bO D .zx

0 0

W zx / 1 zx W zy :

In practice, we never perform an explicit inversionit is typically much better (in terms of

both speed and precision) to let the software solve the system of linear equations instead.

To rewrite the moment conditions as g t D z t y t x t0 b , notice that

0 1

" # !B " #0 ! C

1 B e 1 C , with b D vec.; /

C

gt D In B R

B t I n b

ft @ f t

C

A

zt x t0

0 1

" # !

1

gt D In @Ret f t0 In b A , with b D vec./

B C

ft

x t0

zt

for the exactly identified and overidentified case respectively. Clearly, z t and x t are ma-

trices, not vectors. (z t is n.1 C K/ n and x t0 is either of the same dimension or has n

rows less, corresponding to the intercept.)

Example B.1 (Rewriting the moment conditions) For the moment conditions in Example

8.11 we have

0 1

2 3B 2 30 2 3C

1 0 B 1 0 1 C

6 7B 6 7 6 7C

6 0 1 7B 0 1 7C

7 6 2 7C

6 7 6

6 7B " # 6

e

6f B 7 6 7C

6 1t 0 7 B R

7 6f 0

1t 1t 7 6 11 7C

g t .; / D 6 :

6

7B e

6 7 6 7C

6 0 f1t 7 B R 0 f 1t 7 621 7C

7 B 2t

B 6

6 6 7 6 7C

6f 7 6 7C

4 2t 0 5 B

7 6f 0

B 4 2t 5 4 12 5C C

0 f2t B @ 0 f 2t 22

C

A

zt x t0

266

Proof. (of rewriting the moment conditions) From the properties of Kronecker prod-

ucts, we know that (i) vec.ABC / D .C 0 A/vec.B/; and (ii) if a is m 1 and c is n 1,

then a c D .a In /c. The first rule allows to write

" # " #0 !

h i 1 1 h i

C f t D In as In vec. /:

ft ft

b

x t0

" # " # !

1 1

.Ret f t / as In .Ret f t /:

ft ft

zt

(For the exactly identified case, we could also use the fact .A B/0 D A0 B 0 to notice

that z t D x t .)

Remark B.2 (Quick matrix calculations of zx and zy ) Although a loop wouldnt take

too long time to calculate zx and zy , there is a quicker way. Put 1 f t0 in row t

of the matrix ZT .1CK/ and Re0t in row t of the matrix RT n . For the exactly identified

case, let X D Z. For the overidentified case, put f t0 in row t of the matrix XT K . Then,

calculate

zx D .Z 0 X=T / In and vec.R0 Z=T / D zy :

E pt 1 D E xt mt ;

where x t are the payoffs and p t 1 the prices of the assets. We can either interpret

p t 1 as actual asset prices and x t as the payoffs, or we can set p t 1 D 1 and let x t be

gross returns, or set p t 1 D 0 and x t be excess returns. Assume that the SDF is linear in

the factors

mt D
0ft ;

267

where the .1 C K/ 1 vector f t contains a constant and the other factors. Combining

gives the sample moment conditions

T

X

N

g.
/ D g t .
/=T D 0n1 , where

t D1

gt D xt mt pt 1 D x t f t0 pt 1:

There are 1 C K parameters and n moment conditions (the number of assets). To estimate

this model with a weighting matrix W , we minimize the loss function

N 0 W g.
/:

J D g.
/ N

N

A.1CK/n g.
/ D 0.1CK/1 :

T T

x t f t0 =T and p D

X X

xf D pt 1 =T:

t D1 t D1

N

g.
/ D xf
p ;

0

J D xf

p W xf p :

0

N /

O

@J @g.

0.1CK/1 D D N /

W g. O

@ @ 0

0

D xf W xf O p , so

0

1 0

O D xf W xf xf W p :

268

In can also be noticed that the Jacobian is

N

@g.
/

D xf :

@
0

N D 0, we have

Axf
Ap D 0, so

D .Axf / 1 Ap :

T T T

E f t /0 =T and p D

X X X

x D x t =T; xf D x t .f t pt 1 =T:

t D1 tD1 t D1

N

g.b/ D x m

N C xf b p

0

J D x m

N C xf b N C xf b

p W x m p :

N is given) are

0K1 D xf 0

W x m N C xf bO p , so

bO D xf 0

1 0

N :

W xf xf W p x m

N D 0, we have

N C Axf b

Ax m Ap D 0, so

b D .Axf / 1 A p N :

x m

Bibliography

Bekaert, G., and M. S. Urias, 1996, Diversification, integration and emerging market

closed-end funds, Journal of Finance, 51, 835869.

269

Breeden, D. T., M. R. Gibbons, and R. H. Litzenberger, 1989, Empirical tests of the

consumption-oriented CAPM, Journal of Finance, 44, 231262.

markets, Princeton University Press, Princeton, New Jersey.

Chen, N.-F., R. Roll, and S. A. Ross, 1986, Economic forces and the stock market,

Journal of Business, 59, 383403.

Christiansen, C., A. Ranaldo, and P. Sderlind, 2010, The time-varying systematic risk

of carry trade strategies, Journal of Financial and Quantitative Analysis, forthcoming.

revised edn.

Fama, E., and J. MacBeth, 1973, Risk, return, and equilibrium: empirical tests, Journal

of Political Economy, 71, 607636.

Fama, E. F., and K. R. French, 1993, Common risk factors in the returns on stocks and

bonds, Journal of Financial Economics, 33, 356.

Fama, E. F., and K. R. French, 1996, Multifactor explanations of asset pricing anoma-

lies, Journal of Finance, 51, 5584.

Ferson, W. E., 1995, Theory and empirical testing of asset pricing models, in Robert A.

Jarrow, Vojislav Maksimovic, and William T. Ziemba (ed.), Handbooks in Operations

Research and Management Science . pp. 145200, North-Holland, Amsterdam.

Ferson, W. E., and R. Schadt, 1996, Measuring fund strategy and performance in chang-

ing economic conditions, Journal of Finance, 51, 425461.

Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance,

Cambridge University Press.

Gibbons, M., S. Ross, and J. Shanken, 1989, A test of the efficiency of a given portfolio,

Econometrica, 57, 11211152.

Greene, W. H., 2003, Econometric analysis, Prentice-Hall, Upper Saddle River, New

Jersey, 5th edn.

270

Hansen, L. P., and R. Jagannathan, 1991, Implications of security market data for models

of dynamic economies, Journal of Political Economy, 99, 225262.

Jagannathan, R., and Z. Wang, 1996, The conditional CAPM and the cross-section of

expectd returns, Journal of Finance, 51, 353.

Jagannathan, R., and Z. Wang, 1998, A note on the asymptotic covariance in Fama-

MacBeth regression, Journal of Finance, 53, 799801.

Jagannathan, R., and Z. Wang, 2002, Empirical evaluation of asset pricing models: a

comparison of the SDF and beta methods, Journal of Finance, 57, 23372367.

Lettau, M., and S. Ludvigson, 2001, Resurrecting the (C)CAPM: a cross-sectional test

when risk premia are time-varying, Journal of Political Economy, 109, 12381287.

MacKinlay, C., 1995, Multifactor models do not explain deviations from the CAPM,

Journal of Financial Economics, 38, 328.

Finance Review, 3, 233237.

Treynor, J. L., and K. Mazuy, 1966, Can Mutual Funds Outguess the Market?, Harvard

Business Review, 44, 131136.

271

9 Consumption-Based Asset Pricing

Reference: Bossaert (2002); Campbell (2003); Cochrane (2005); Smith and Wickens

(2002)

Et 1 R t M t D 1: (9.1)

discount factor (SDF). E t 1 denotes the expectations conditional on the information in

period t 1, that is, when the investment decision is made. This equation holds for

any assets that are freely traded without transaction costs (or taxes), even if markets are

incomplete.

In a consumption-based model, (9.1) is the Euler equation for optimal saving in t 1

where M t is the ratio of marginal utilities in t and t 1, M t D u0 .C t /=u0 .C t 1 /. I will

focus on the case where the marginal utility of consumption is a function of consumption

only, which is by far the most common formulation. This allows for other terms in the

utility function, for instance, leisure and real money balances, but they have to be addi-

tively separable from the consumption term. With constant relative risk aversion (CRRA)

, the stochastic discount factor is

M t D .C t =C t 1/ , so (9.2)

ln M t D ln
c t ; where c t D ln C t =C t 1: (9.3)

The second line is only there to introduce the convenient notation c t for the consumption

growth rate.

The next few sections study if the pricing model consisting of (9.1) and (9.2) can fit

historical data. To be clear about what this entails, note the following. First, general

equilibrium considerations will not play any role in the analysis: the production side will

272

not be even mentioned. Instead, the focus is on one of the building blocks of an otherwise

unspecified model. Second, complete markets are not assumed. The key assumption is

rather that the basic asset pricing equation (9.1) holds for the assets I analyse. This means

that the representative investor can trade in these assets without transaction costs and taxes

(clearly an approximation). Third, the properties of historical (ex post) data are assumed

to be good approximations of what investors expected. In practice, this assumes both

rational expectations and that the sample is large enough for the estimators (of various

moments) to be precise.

To highlight the basic problem with the consumption-based model and to simplify the

exposition, I assume that the excess return, Ret , and consumption growth, c t , have a

bivariate normal distribution. By using Steins lemma, we can write the the risk premium

as

E t 1 Ret D Cov t 1 .Ret ; c t /
: (9.4)

The intuition for this expressions is that an asset that has a high payoff when consumption

is high, that is, when marginal utility is low, is considered risky and will require a risk

premium. This expression also holds in terms of unconditional moments. (To derive that,

start by taking unconditional expectations of (9.1).)

We can relax the assumption that the excess return is normally distributed: (9.4) holds

also if Ret and c t have a bivariate mixture normal distributionprovided c t has the

same mean and variance in all the mixture components (see Section 9.1.1 below). This

restricts consumption growth to have a normal distribution, but allows the excess return

to have a distribution with fat tails and skewness.

Remark 9.1 (Steins lemma) If x and y have a bivariate normal distribution and h.y/ is

a differentiable function such that Ejh0 .y/j < 1, then Covx; h.y/ D Cov.x; y/ Eh0 .y/.

E Re D Cov.Re ; M /= E M:

lemma, x D Re , y D ln M and h./ D exp./.) Finally, notice that Cov.Re ; ln M / D

Cov.Re ; c/.

273

The Gains and Losses from Using Steins Lemma

The gain from using (the extended) Steins lemma is that the unknown relative risk aver-

sion,
, does not enter the covariances. This facilitates the empirical analysis consider-

ably. Otherwise, the relevant covariance would be between Ret and .C t =C t 1 /
.

The price of using (the extended) Steins lemma is that we have to assume that con-

sumption growth is normally distributed and that the excess return have a mixture normal

distribution. The latter is not much of a price, since a mixture normal can take many

shapes and have both skewness and excess kurtosis.

In any case, Figure 9.1 suggests that these assumptions might be reasonable. The

upper panel shows unconditional distributions of the growth of US real consumption per

capita of nondurable goods and services and of the real excess return on a broad US equity

index. The non-parametric kernel density estimate of consumption growth is quite similar

to a normal distribution, but this is not the case for the US market excess return which has

a lot more skewness.

e

Pdf of c Pdf of Rm

1 0.06

Kernel

Normal

0.04

0.5

0.02

0 0

1 0 1 2 20 10 0 10 20

Consumption growth, % Market excess return, %

Figure 9.1: Density functions of consumption growth and equity market excess returns.

The kernel density function of a variable x is estimated by using a N.0; / kernel with

D 1:06 Std.x/T 1=5 . The normal distribution is calculated from the estimated mean

and variance of the same variable.

To allow for a non-normal distribution of the asset return, an extension of Steins lemma

is necessary. The following proposition shows that this is possibleif we restrict the

distribution of the log SDF to be gaussian.

274

Figure 9.2 gives an illustration.

0.2

0.1

2

0

3 0

2

1

0

1 2

2 x

y 3

are drawn at the back.

Proposition 9.2 Assume (a) the joint distribution of x and y is a mixture of n bivari-

ate normal distributions; (b) the mean and variance of y is the same in each of the

n components; (c) h.y/ is a differentiable function such that E jh0 .y/j < 1. Then

Covx; h.y/ D E h0 .y/ Cov.x; y/. (See Sderlind (2009) for a proof.)

This section studies if the consumption-based asset pricing model can explain the histor-

ical risk premium on the US stock market.

To discuss the historical average excess returns, it is convenient to work with the

unconditional version of the pricing expression (9.4)

Table 9.1 shows the key statistics for quarterly US real returns and consumption growth.

275

Mean Std Autocorr Corr with c

c 1:88 0:95 0:44 1:00

e

Rm 6:50 17:09 0:07 0:17

Riskfree 0:96 2:52 0:68 0:25

We see, among other things, that consumption has a standard deviation of only 1%

(annualized), the stock market has had an average excess return (over a T-bill) of 68%

(annualized), and that returns are only weakly correlated with consumption growth. These

figures will be important in the following sections. Two correlations with consumption

growth are shown, since it is unclear if returns should be related to what is recorded as

consumption this quarter or the next. The reason is that consumption is measured as a

flow during the quarter, while returns are measured at the end of the quarter.

Table 9.1 shows that we can write (9.5) as

0:065 0:17 0:17 0:01 : (9.7)

The basic problem with the consumption-based asset pricing model is that investors

enjoy a fairly stable consumption series (either because income is smooth or because it is

easy/inexpensive to smooth consumption by changing savings), so only an extreme risk

aversion can motivate why investors require such a high equity premium. This is the

equity premium puzzle stressed by Mehra and Prescott (1985) (although they approach

the issue from another angle). Indeed, even if the correlation was one, (9.7) would require

38.

In contrast to the traditional interpretation of efficient markets, it has been found that

excess returns might be somewhat predictableat least in the long run (a couple of years).

In particular, Fama and French (1988a) and Fama and French (1988b) have argued that

future long-run returns can be predicted by the current dividend-price ratio and/or current

returns.

Some evidence suggests that excess returns may perhaps have a predictable compo-

276

nent, that is, that (ex ante) risk premia are changing over time. To see how that fits with

the consumption-based model, notice that (9.4) says that the conditional expected excess

return should equal the conditional covariance times the risk aversion.

Figure 9.3.a shows recursive estimates of the mean return of the aggregate US stock

market and the covariance with consumption growth (dated t C 1). The recursive esti-

mation means that the results for (say) 1965Q2 use data for 1955Q21965Q2, the results

for 1965Q3 add one data point, etc. The second subfigure shows the same statistics, but

estimated on a moving data window of 10 years. For instance, the results for 1980Q2 are

for the sample 1971Q31980Q2. Finally, the third subfigure uses a moving data window

of 5 years.

Together these figures give the impression that there are fairly long swings in the

data. This fundamental uncertainty should serve as a warning against focusing on the fine

details of the data. It could also be used as an argument for using longer data series

provided we are willing to assume that the economy has not undergone important regime

changes.

It is clear from the earlier Figure 9.3 that the consumption-based model probably can-

not generate plausible movements in risk premia. In that figure, the conditional moments

are approximated by estimates on different data windows (that is, different subsamples).

Although this is a crude approximation, the results are revealing: the actual average excess

return and the covariance move in different directions on all frequencies.

The CRRA utility function has the special feature that the intertemporal elasticity of sub-

stitution is the inverse of the risk aversion, that is, 1=
. Choosing the risk aversion pa-

rameter, for instance, to fit the equity premium, will therefore have direct effects on the

riskfree rate.

A key feature of any consumption-based asset pricing model, or any consumption/saving

model for that matter, is that the riskfree rate governs the time slope of the consumption

profile. From the asset pricing equation for a riskfree asset (9.1) we have E t 1 .Rf t /

E t 1 .M t / D 1. Note that we must use the conditional asset pricing equationat least as

long as we believe that the riskfree asset is a random variable. A riskfree asset is defined

by having a zero conditional covariance with the SDF, which means that it is regarded

as riskfree at the time of investment (t 1). In practice, this means a real interest rate

(perhaps approximated by the real return on a T-bill since the innovations in inflation are

277

Recursive estimation 10-year data window

e

4 ERm 4

e

Cov(Rm , c)

2 2

0 0

2 2

1960 1980 2000 1960 1980 2000

mean excess return on equity, in percent

4 Covariance with cons growth, in basis points

Initialization: data for first 10 years

2

e e

To test: Et1 Rm = Covt1 (Rm , c)

0

2

1960 1980 2000

small), which may well have a nonzero unconditional covariance with the SDF.1 Indeed,

in Table 9.1 the real return on a T-bill is as correlated with consumption growth as the

aggregate US stockmarket.

When the log SDF is normally distributed (the same assumption as before), then the

log expected riskfree rate is

Before we try to compare (9.9) with data, several things should be noted. First, the log

gross rate is very close to a traditional net rate (ln.1 C z/ z for small z), so it makes

sense to compare with the data in Table 9.1. Second, we can safely disregard the variance

1

As a very simple example, let x t D z t 1 C" t and y t D z t 1 Cu t where " t are u t uncorrelated with each

other and with z t 1 . If z t 1 is observable in t 1, then Cov t D 0, but Cov.x t ; y t / D 2 .z t 1 /.

1 .x t ; y t /

278

term since it is very small, at least as long as we are considering reasonable values of
.

Although the average conditional variance is not directly observable, we know that it must

be smaller than the unconditional variance2 , which is very small in Table 9.1. In fact, the

variance is around 0.0001 whereas the mean is around 0.02.

Proof. (of (9.8)) For a riskfree gross return Rf , (9.1) with the SDF (9.2) says

E t 1 .Rf t / E t 1 .C t =C t 1 /
D 1. Recall that if x N.; 2 / and y D exp.x/

then E y D exp. C 2 =2/. When c t is conditionally normally distributed, the log of

E t 1 .C t =C t 1 /
equals ln
E t 1 c t C
2 Var t 1 .c t /=2/.

According to (9.9) there are two ways to reconcile a positive consumption growth rate

with a low real interest rate (around 1% in Table 9.1): investors may prefer to consume

later rather than sooner ( > 1) or they are willing to substitute intertemporally without

too much compensation (1=
is high, that is,
is low). However, fitting the equity pre-

mium requires a high value of
, so investors must be implausibly patient if (9.9) is to

hold. For instance, with
D 25 (which is a very conservative guess of what we need to

fit the equity premium) equation (9.9) says

(ignoring the variance terms), which requires 1:6. This is the riskfree rate puzzle

stressed by Weil (1989). The basic intuition for this result is that it is hard to reconcile a

steep slope of the consumption profile and a low compensation for postponing consump-

tion if people are insensitive to intertemporal pricesunless they are extremely patient

(actually, unless they prefer to consume later rather than sooner).

Another implication of a high risk aversion is that the real interest rate should be

very volatile, which it is not. According to Table 9.1 the standard deviation of the real

interest rate is perhaps twice the standard deviation of consumption growth. From (9.8)

the volatility of the (expected) riskfree rate should be

if the conditional variance of consumption growth is constant. This expression says that

the standard deviation of expected real interest rate is
times the standard deviation of ex-

pected consumption growth. We cannot observe the conditional expectations directly, and

therefore not estimate their volatility. However, a simple example is enough to demon-

2

Let E.yjx/ and Var.yjx/ be the expectation and variance of y conditional on x. The unconditional

variance is then Var.y/ D VarE.yjx/ C EVar.yjx/.

279

strate that high values of
are likely to imply counterfactually high volatility of the real

interest rate.

As an approximation, suppose both the riskfree rate and consumption growth are

AR(1) processes. Then (9.11) can be written

(9.12)

0:75 0:02 0:3 0:01 (9.13)

where the second line uses the results in Table 9.1. With
D 25, (9.13) implies that the

RHS is much too volatile This shows that an intertemporal elasticity of substitution of

1/25 is not compatible with the relatively stable real return on T-bills.

Proof. (of (9.12)) If x t D x t 1 C " t , where " t is iid, then E t 1 .x t / D x t 1 , so

.E t 1 x t / D .x t 1 /.

The previous section demonstrated that the consumption-based model has a hard time ex-

plaining the risk premium on a broad equity portfolioessentially because consumption

growth is too smooth to make stocks look particularly risky. However, the model does

predict a positive equity premium, even if it is not large enough. This suggests that the

model may be able to explain the relative risk premia across assets, even if the scale is

wrong. In that case, the model would still be useful for some issues. This section takes a

closer look at that possibility by focusing on the relation between the average return and

the covariance with consumption growth in a cross-section of asset returns.

The key equation is (9.5), which I repeat here for ease of reading

E Ret D Cov.Ret ; c t / :

returns on factors with unknown factor risk premia (see, for instance, Cochrane (2005)

chap 12 or Campbell, Lo, and MacKinlay (1997) chap 6).

Remark 9.3 (GMM estimation of (9.5)) Let there be N assets. The original moment

280

conditions are

2 3

.c t c / D 0

T 6

.Riet i / D 0 for i D 1; 2; :::; N

7

1 X6 7

gT ./ D 6 7

T t D1 6

4 .c t c /.Riet i / ci D 0 for i D 1; 2; :::; N 7

5

.Riet ci / D 0 for i D 1; 2; :::; N;

where c is the mean of c t , i the mean of Riet , ci the covariance of c t and Riet .

This gives 1 C 3N moment conditions and 2N C 3 parameters, so there are N 2

overidentifying restrictions.

To estimate, we define the combined moment conditions as

2 3

1 01N 01N 01N

60N 1 IN 0N N 0N N 7

6 7

6 7

A.2N C3/.1C3N / D 60 0

6 N 1 N N I N 0N N 7 ;

7

4 0 01N 01N i0c 5

6 7

These moment conditions mean that means and covariances are estimated in the tradi-

tional way, and that is estimated by a LS regression of E Riet on a constant and ci . The

test that the pricing errors are all zero is a Wald test that gT ./ are all zero, where the

covariance matrix of the moments are estimated by a Newey-West method (using one lag).

This covariance matrix is singular, but that does not matter (as we never have to invert

it).

It can be shown (see Sderlind (2006)) that (i) the recursive utility function in Epstein

and Zin (1991); (ii) the habit persistence model of Campbell and Cochrane (1999) in the

case of no return predictability, as well as the (iii) models of idiosyncratic risk by Mankiw

(1986) and Constantinides and Duffie (1996) also in the case of no return predictability,

all imply that (9.5) hold. The only difference is that the effective risk aversion (
) differs.

Still, the basic asset pricing implication is the same: expected returns are linearly related

to the covariance.

Figure 9.4 shows the results of both C-CAPM and the standard CAPMfor the 25

Fama and French (1993) portfolios. It is clear that both models work badly, but CAPM

actually worse.

281

C-CAPM CAPM

20 20

: 216 : -0.3

15 t-stat of : 1.7 15 t-stat of : -0.2

ERe , %

ERe , %

R2: 0.40 R2: 0.00

10 10

5 5

0 0

0 2 4 6 0 1 2 3 4 5

Cov(c, Re ), bps e

Cov(Rm , Re ), %

C-CAPM C-CAPM

20 20

1 (small) 1 (low)

15 2 15 2

3 3

ERe , %

ERe , %

4 4

10 5 (large) 10 5 (high)

5 5

lines connect same size lines connect same B/M

0 0

0 2 4 6 0 2 4 6

Cov(c, Re ), bps Cov(c, Re ), bps

CAPM CAPM

20 20

15 15

ERe , %

ERe , %

10 10

5 5

lines connect same size lines connect same B/M

0 0

0 1 2 3 4 5 0 1 2 3 4 5

e

Cov(Rm , Re ), % e

Cov(Rm , Re ), %

Figure 9.5 takes a careful look at how the C-CAPM and CAPM work in different

smaller cross-sections. A common feature of both models is that growth firms (low book-

282

to-market ratios) have large pricing errors (in the figures with lines connecting the same

B/M categories, they are the lowest lines for both models).

In contrast, a major difference between the models is that CAPM shows a very strange

pattern when we compare across B/M categories (lines connecting the same size cate-

gory): mean excess returns are decreasing in the covariance with the marketthe wrong

sign compared to the CAPM prediction. This is not the case for C-CAPM.

The conclusion is that the consumption-based model is not good at explaining the

cross-section of returns, but it is no worse than CAPMif it is any comfort.

The basic asset pricing model is about conditional moment and it can be summarizes as

in (9.4) which is given here again

Et 1 Ret D Cov t e

1 .R t ; c t /
: (EPP3c again)

Expression this in terms of unconditional moments as in (9.5) shows only part of the

story. It is, however, fair to say that if the model does not hold unconditionally, then that

is enough to reject the model.

However, it can be shown (see Sderlind (2006)) that several refinements of the con-

sumption based model (the habit persistence model of Campbell and Cochrane (1999)

and also the model with idiosyncratic risk by Mankiw (1986) and Constantinides and

Duffie (1996)) also imply that (9.4) holds, but with a time varying effective risk aversion

coefficient (so
should carry a time subscript).

Lettau and Ludvigson (2001b) use a scaled factor model, where they impose the re-

striction that the time variation (using a beta representation) is a linear function of some

conditioning variables (specifically, the cay variable) only.

The cay variable is defined as the log consumption/wealth ratio. Wealth consists of

both financial assets and human wealth. The latter is not observable, but is assumed to

be proportional to current income (this would, for instance, be true if income follows and

283

AR(1) process). Therefore, cay is modelled as

where c t is log consumption, a t log financial wealth and y t is log income. The coeffi-

cient ! is estimated with LS to be around 0.3. Although (9.14) contains non-stationary

variables, it is interpreted as a cointegrating relation so LS is an appropriate estimation

method. Lettau and Ludvigson (2001a) shows that cay is able to forecast stock returns (at

least, in-sample). Intuitively, cay should be a signal of investor expectations about future

returns (or wage earnings...): a high value is probably driven by high expectations.

The SDF is modelled as time-varying function of consumption growth

M t D a t C b t c t , where (9.15)

a t D
0 C
1 cay t 1 and b t D 0 C 1 cay t 1: (9.16)

This is a conditional C-CAPM. It is clearly the same as specifying a linear factor model

where the coefficients are estimated in time series regression (this is also called a scaled

factor model since the true factor, c, is scaled by the instrument, cay). Then, the

cross-sectional pricing implications are tested by

E Ret D ; (9.18)

Lettau and Ludvigson (2001b) use the 25 Fama-French portfolios as test assets and

compare the results from (9.17)(9.18) with several other models, for instance, a tradi-

tional CAPM (the SDF is linear in the market return), a conditional CAPM (the SDF is

linear in the market return, cay and their product), a traditional C-CAPM (the SDF is

linear in consumption growth) and a Fama-French model (the SDF is linear in the market

return, SMB and HML). It is found that the conditional CAPM and C-CAPM provides a

much better fit of the cross-sectional returns that the unconditional models (including the

Fama-French model)and that the C-CAPM is actually a pretty good model.

284

Duffee (2005) estimates the conditional model (9.4) by projecting both ex post returns

and covariances on a set of instrumentsand then studies if there is a relation between

these projections.

A conditional covariance (here of the asset return and consumption growth) is the

covariance of the innovations. To create innovations (denoted eR;t and ec;t below), the

paper uses the following prediction equations

0

Ret D R YR;t 1 C eR;t (9.19)

c t D c0 Yc;t 1 C ec;t : (9.20)

In practice, only three lags of lagged consumption growth is used to predict consumption

growth and only the cay variable is used to predict the asset return.

Then, the return is related to the covariance as

where .b1 C b2 p t 1 / is a model of the effective risk aversion. In the CRRA model, b2 D

0, so b1 measures the relative risk aversion as in (9.4). In contrast, in Campbell and

Cochrane (1999) p t 1 is an observable proxy of the surplus ratio which measure how

close consumption is to the habit level.

The model (9.19)(9.21) is estimated with GMM, using a number of instruments

(Z t 1 ): lagged values of stock market value/consumption, stock market returns, cay and

the product of demeaned consumption and returns. This can be thought of as first finding

proxies for

bR

Et e 0

D R YR;t and (9.22)

2 .e

1 t 1

bR

Et 1

e

t D b0 C .b1 C b2 p t 2

1 / Cov t 1 .eR;t ; ec;t / C ut : (9.23)

The point of using a (GMM) system is that this allows handling the estimation uncer-

tainty of the prediction equations in the testing of the relation between the predictions.

The empirical results (using monthly returns on the broad U.S. stock market and per

capita expenditures in nondurables and services, 19592001) suggest that there is a strong

negative relation between the conditional covariance and the conditional expected market

285

returnwhich is clearly at odds with a CRRA utility function (compare (9.4)). In addi-

tion, typical proxies of the p t 1 variable do not seem to any important (economic) effects.

In an extension, the paper also studies other return horizons and tries other ways to

model volatility (including a DCC model).

(See also Sderlind (2006) for a related approach applied to a cross-section of returns.)

Parker and Julliard (2005) suggest using a measure of long-run changes in consump-

tion instead of just a one-period change. This turns out to give a much better empirical fit

of the cross-section of risk premia.

To see the motivation for this approach, consider the asset pricing equation based on

a CRRA utility function. It says that an excess return satisfies

Et 1 Ret .C t =C t 1/

D0 (9.24)

E t n .C t Cn =C t /

D Pnt , so (9.25)

Ct D E t n C t Cn =Pn;t : (9.26)

Mn;t D .1=Pn;t /.C t Cn =C t 1/ :

This expression relates the one-period excess return to an n-period SDFwhich involves

the interest rate (1=Pn;t ) and ratio of marginal utilities n periods apart.

If we can apply Steins lemma (possibly extended) and use yn;t D ln 1=Pnt to denote

the n-period log riskfree rate, then we get

e

D Cov t e

1 R t ;
ln.C t Cn =C t 1 / Cov t e

1 R t ; yn;t : (9.28)

This first term is very similar to the traditional expression (9.2), except that we here have

the (nC1)-period (instead of the 1-period) consumption growth. The second term captures

286

the covariance between the excess return and the n-period interest rate in period t (both

are random as seen from t 1). If we set n D 0, then this equation simplifies to the

traditional expression (9.2). Clearly, the moments in (9.28) could be unconditional instead

of conditional.

The empirical approach in Parker and Julliard (2005) is to estimate (using GMM) and

test the cross-sectional implications of this model. (They do not use Steins lemma.) They

find that the model fits data much better with a high value of n (ultimate consumption)

than with n D 0 (the traditional model). Possible reasons could be: (i) long-run changes

in consumption are better measured in national accounts data; (ii) the CRRA model is a

better approximation for long-run movements.

20 20

: 216 : 236

15 t-stat of : 1.7 15 t-stat of : 1.9

ERe , %

ERe , %

R2: 0.40 R2: 0.32

10 10

5 5

0 0

0 2 4 6 0 2 4 6

Cov(c, Re ), bps Cov(c, Re ), bps

Unconditional C-CAPM

Ultimate C-CAPM, 8 quarters US quarterly data 1957Q12013Q3

20

15

ERe , %

10

: 402

5 t-stat of : 2.1

R2: 0.55

0

0 2 4 6

Cov(c, Re ), bps

SDF and Pnt the price of an n-period bond. Clearly, P2t D E t M tC1 P1;t C1 , so P2t D

E t M tC1 E t C1 .M t C2 P0;tC2 /. Use the law of iterated expectations (LIE) and P0;t C2 D 1

to get P2t D E t M t C2 M t C1 . The extension from 2 to n is straightforward, which gives

287

(9.25). To prove (9.27), use (9.26) in (9.24), apply LIE and simplify.

Bibliography

Bossaert, P., 2002, The paradox of asset pricing, Princeton University Press.

Milton Harris, and Rene Stultz (ed.), Handbook of the Economics of Finance . chap. 13,

pp. 803887, North-Holland, Amsterdam.

explanation of aggregate stock market behavior, Journal of Political Economy, 107,

205251.

markets, Princeton University Press, Princeton, New Jersey.

revised edn.

Constantinides, G. M., and D. Duffie, 1996, Asset pricing with heterogeneous con-

sumers, The Journal of Political Economy, 104, 219240.

Duffee, G. R., 2005, Time variation in the covariance between stock returns and con-

sumption growth, Journal of Finance, 60, 16731712.

Epstein, L. G., and S. E. Zin, 1991, Substitution, risk aversion, and the temporal behavior

of asset returns: an empirical analysis, Journal of Political Economy, 99, 263286.

Fama, E. F., and K. R. French, 1988a, Dividend yields and expected stock returns,

Journal of Financial Economics, 22, 325.

Fama, E. F., and K. R. French, 1988b, Permanent and temporary components of stock

prices, Journal of Political Economy, 96, 246273.

Fama, E. F., and K. R. French, 1993, Common risk factors in the returns on stocks and

bonds, Journal of Financial Economics, 33, 356.

Lettau, M., and S. Ludvigson, 2001a, Consumption, wealth, and expected stock returns,

Journal of Finance, 56, 815849.

288

Lettau, M., and S. Ludvigson, 2001b, Resurrecting the (C)CAPM: a cross-sectional test

when risk premia are time-varying, Journal of Political Economy, 109, 12381287.

Mankiw, G. N., 1986, The equity premium and the concentration of aggregate shocks,

Journal of Financial Economics, 17, 211219.

Mehra, R., and E. Prescott, 1985, The equity premium: a puzzle, Journal of Monetary

Economics, 15, 145161.

Parker, J., and C. Julliard, 2005, Consumption risk and the cross section of expected

returns, Journal of Political Economy, 113, 185222.

Smith, P. N., and M. R. Wickens, 2002, Asset pricing with observable stochastic discount

factors, Discussion Paper No. 2002/03, University of York.

Sderlind, P., 2006, C-CAPM Refinements and the cross-section of returns, Financial

Markets and Portfolio Management, 20, 4973.

Sderlind, P., 2009, An extended Steins lemma for asset pricing, Applied Economics

Letters, 16, 10051008.

Weil, P., 1989, The equity premium puzzle and the risk-free rate puzzle, Journal of

Monetary Economics, 24, 401421.

289

10 Alphas /Betas and Investor Characteristics

The task is to evaluate if alphas or betas of individual investors (or funds) are related

to investor (fund) characteristics, for instance, age or trading activity. The data set is

panel with observations for T periods and N investors. (In many settings, the panel is

unbalanced, but, to keep things reasonably simple, that is disregarded in the discussion

below.)

The calendar time (CalTime) approach is to first define M discrete investor groups (for

instance, age 1830, 3140, etc) and calculate their respective average excess returns (yNjt

for group j )

1 P

yNjt D yi t ; (10.1)

Nj i 2Groupj

where Nj is the number of individuals in group j .

Then, we run a factor model

where x t typically includes a constant and various return factors (for instance, excess re-

turns on equity and bonds). By estimating these M equations as a SURE system with

Whites (or Newey-Wests) covariance estimator, it is straightforward to test various hy-

potheses, for instance, that the intercept (the alpha) is higher for the M th group than

for the for first group.

Example 10.1 (CalTime with two investor groups) With two investor groups, estimate the

following SURE system

yN1t D x t0 1 C v1t ;

yN2t D x t0 2 C v2t :

290

The CalTime approach is straightforward and the cross-sectional correlations are fairly

easy to handle (in the SURE approach). However, it forces us to define discrete investor

groupswhich makes it hard to handle several different types of investor characteristics

(for instance, age, trading activity and income) at the same time.

The cross sectional regression (CrossReg) approach is to first estimate the factor

model for each investor

and to then regress the (estimated) betas for the pth factor (for instance, the intercept) on

the investor characteristics

Opi D zi0 cp C wpi : (10.4)

(for age group, say) or a continuous variable (age, say). Notice that using a continuous

investor characteristics assumes that the relation between the characteristics and the beta

is linearsomething that is not assumed in the CalTime approach. (This saves degrees of

freedom, but may sometimes be a very strong assumption.) However, a potential problem

with the CrossReg approach is that it is often important to account for the cross-sectional

correlation of the residuals.

10.3.1 OLS

yi t D xi0 t C "i t ; (10.5)

where xi t is an K 1 vector. Notice that the coefficients are the same across individuals

(and time). Define the matrices

T N

1 XX

xx D xi t xi0 t (an K K matrix) (10.6)

T N t D1 i D1

T N

1 XX

xy D xi t yi t (a K 1 vector). (10.7)

T N t D1 i D1

291

The LS estimator (stacking all T N observations) is then

O D xx1 xy : (10.8)

Remark 10.2 (Panel regression vs average coefficient ) Consider the regression for in-

vestor i

yi t D x t0 i C "i t , i D 1:::N;

where the regressors are the same in all regressionsbut where the coefficients might

differences across investors. Clearly, we have for each i

Oi D Sxx1 Sxyi ;

T T

x t x t0 =T and Sxyi D

X X

where Sxx D x t yi t =T:

t D1 t D1

N N

1 X O 1 X

i D Sxx Sxyi :

N i D1 N i D1

Compare that to (10.6) and notice that since x t is repeated N times, we have xx D Sxx .

Similarly, comparing with (10.7) gives

N

1 X

Sxyi D xy :

N i D1

1

PN O D ,

O where the latter is from the panel regression (10.8).

This shows that N i D1 i

10.3.2 GMM

T N

1X 1 X

hi t D 0K1 , where hi t D xi t "i t D xi t .yi t xi0 t /: (10.9)

T t D1 N i D1

Remark 10.3 (Distribution of GMM estimates) Under fairly weak assumption, the ex-

p d

actly identified GMM estimator T N .O 0 / ! N.0; D0 1 S0 D0 1 /, where D0 is the

p

Jacobian of the average moment conditions and S0 is the covariance matrix of T N

times the average moment conditions.

292

Remark 10.4 (Distribution of O 0 ) As long as T N is finite, we can (with some abuse

p

of notation) consider the distribution of O instead of T N .O 0 / to write

O 0 N.0; D0 1 SD0 1 /;

where S D S0 =.T N / which is the same as the covariance matrix of the average moment

conditions (10.9).

To apply these remarks, first notice that the Jacobian D0 corresponds to (the probabil-

ity limit of) the xx matrix in (10.6)

D0 D xx : (10.10)

Second, notice that

T N

!

1X 1 X

Cov.average moment conditions/ D Cov hi t (10.11)

T t D1 N i D1

In particular, if hi t has no correlation across time (effectively, N1 N i D1 hi t is not auto-

P

T N

!

1 X 1 X

Cov.average moment conditions/ D 2 Cov hi t : (10.12)

T t D1 N i D1

We would then design an estimator that would consistently estimate this covariance matrix

by using the time dimension.

N D 4. Then, (10.11) can be written

1 1

Cov .h1t C h2t C h3t C h4t / C .h1;tC1 C h2;tC1 C h3;tC1 C h4;tC1 / :

24 24

If there is no correlation across time periods, then this becomes

1 1 1 1

Cov .h1t C h2t C h3t C h4t / C 2 Cov .h1;t C1 C h2;t C1 C h3;t C1 C h4;t C1 / ;

22 4 2 4

which has the same form as (10.12).

293

10.3.3 Driscoll-Kraay

O D 1 S 1 ;

Cov./ (10.13)

xx xx

where

T N

1 X 0 1 X

SD 2 h t h t ; with h t D hi t , hi t D xi t "i t ; (10.14)

T t D1 N i D1

where hi t is the LS moment condition for individual i . Clearly, hi t and h t are K 1, so S

is KK. Since we use the covariance matrix of the moment conditions, heteroskedasticity

is accounted for.

Notice that h t is the cross-sectional average moment condition (in t ) and that S is an

estimator of the covariance matrix of those average moment conditions

b

1 PT PN

S D Cov hi t :

T N tD1 i D1

To calculate this estimator, (10.14) uses the time dimension (and hence requires a reason-

ably long time series).

Remark 10.6 (Relation to the notation in Hoechle (2007)) Hoechle writes Cov./ O D

.X 0 X/ 1 SOT .X 0 X/ 1 , where SOT D TtD1 hO t hO 0t ; with hO t D N

i D1 hi t . Clearly, my xx D

P P

(10.14) gives the cross-sectional average in period t

1

ht D .h1t C h2t C h3t C h4t / ;

4

294

and the covariance matrix

T

1 X

SD 2 h t h0t

T t D1

T 2

1 X 1

D 2 .h1t C h2t C h3t C h4t /

T t D1 4

T

1 X 1 2

D 2 .h C h22t C h23t C h24t ;

T t D1 16 1t

C 2h1t h2t C 2h1t h3t C 2h1t h4t C 2h2t h3t C 2h2t h4t C 2h3t h4t /

so we can write

" 4

1 X

SD Var.h

c it /

T 16 i D1

b b b

C 2Cov.h1t ; h2t / C 2Cov.h1t ; h3t / C 2Cov.h1t ; h4t /

bov.h ; h / C 2Cbov.h

C 2C 2t 3t 2t ; h4t /

C2Cbov.h ; h /i :

3t 4t

Notice that S is the (estimate of) the variance of the cross-sectional average, Var.h t / D

Var.h1t C h2t C h3t C h4t /=4.

A cluster method puts restrictions on the covariance terms (of hi t ) that are allowed

to enter the estimate S. In practice, all terms across clusters are left out. This can be

implemented by changing the S matrix. In particular, instead of interacting all i with

each other, we only allow for interaction within each of the G clusters (g D 1; :::; G/

G T

X 1 X g g 0 1

h t h t , where hgt D

X

SD 2

hi t : (10.15)

gD1

T t D1 N i 2 cluster g

(Remark: the cluster sums should be divided by N , not the number of individuals in the

cluster.)

Example 10.7, but assume that individuals 1 and 2 form cluster 1 and that individuals 3

and 4 form cluster 2and disregard correlations across clusters. This means setting the

295

covariances across clusters to zero,

T

1 X 1 2

SD 2

.h1t C h22t C h23t C h24t ;

T t D1 16

2h1t h2t C 2h1t h3t C 2h1t h4t C 2h2t h3t C 2h2t h4t C 2h3t h4t /

0 0 0 0

so we can write

b b

" 4 #

1 X

SD Var.h

c i t / C 2Cov.h1t ; h2t / C 2Cov.h3t ; h4t / :

T 16 i D1

Example 10.9 (Cluster method on N D 4) From (10.15) we have the cluster (group)

averages

1 1

h1t D .h1t C h2t / and h2t D .h3t C h4t / :

4 4

T 0

Assuming only one regressor (to keep it simple), the time averages, T1 hgt hgt , are

P

t D1

then (for cluster 1 and then 2)

T T 2 T

1 X 1 1 0 1X 1 1X 1 2

ht ht D .h1t C h2t / D h1t C h22t C 2h1t h2t , and

T t D1 T t D1 4 T t D1 16

T T

1 X 2 2 0 1X 1 2

ht ht D h3t C h24t C 2h3t h4t :

T t D1 T t D1 16

Finally, summing across these time averages gives the same expression as in Example

10.8. The following 4 4 matrix illustrates which cells that are included (assumption: no

dependence across time)

i 1 2 3 4

1 h21t h1t h2t 0 0

2

2 h1t h2t h2t 0 0

3 0 0 h23t h3t h4t

4 0 0 h3t h4t h24t

In comparison, the iid and Whites cases only sum up the principal diagonal, while the

DK method fills the entire matrix.

Instead, we get Whites covariance matrix by excluding all cross terms. This can be

296

accomplished by defining

T N

1 X 1 X

SD 2 hi t h0i t : (10.16)

T t D1 N 2 i D1

Example 10.10 (Whites method on N D 4) With only one regressor (10.16) gives

T

1 X 1 2

SD 2 h1t C h22t C h23t C h24t

T tD1 16

4

1 X

D Var.h

c it /

T 16 i D1

we get

T N

O 1 XX 2

CovLS ./ D xx s =T N , where s D

1 2 2

" : (10.17)

T N t D1 i D1 i t

It is a special case of Whites approach, but does not allow for x t2 and "2it to interact.

Remark 10.11 (Why the cluster method fails when there is a missing time fixed effect

and one of the regressors indicates the cluster membership) To keep this remark short,

assume yi t D 0qi t C "i t , where qi t indicates the cluster membership of individual i (con-

stant over time). In addition, assume that all individual residuals are entirely due to an

(excluded) time fixed effect, "i t D w t . Let N D 4 where i D .1; 2/ belong to the first

cluster (qi D 1) and i D .3; 4/ belong to the second cluster (qi D 1). (Using the values

qi D 1 gives qi a zero mean, which is convenient.) It is straightforward to demon-

strate that the estimated (OLS) coefficient in any sample must be zero: there is in fact no

uncertainty about it. The individual moments in period t are then hi t D qi t w t

2 3 2 3

h1t wt

6 7 6 7

6 h2t 7 6 w t 7

7D6

6 h 7 6 w 7:

6 7

4 3t 5 4 t 5

h4t wt

297

The matrix in Example 10.9 is then

i 1 2 3 4

1 w t2 w t2 0 0

2 2

2 wt wt 0 0

3 0 0 w t2 w t2

4 0 0 w t2 w t2

P

definition, so its variance should also be zero. In contrast, the DK method adds the off-

diagonal elements which are all equal to w t2 , so summing the whole matrix indeed gives

zero. If we replace the qi t regressor with something else (eg a constant), then we do not

get this result.

To see what happens if the qi variable does not coincide with the definitions of the clus-

ters change the regressor to qi D . 1; 1; 1; 1/ for the four individuals. We then get

.h1t ; h2t ; h3t ; h4t / D . w t ; w t ; w t ; w t /. If the definition of the clusters (for the covari-

ance matrix) are unchanged, then the matrix in Example 10.9 becomes

i 1 2 3 4

1 w t2 w t2 0 0

2 w t2 w t2 0 0

3 0 0 w t2 w t2

4 0 0 w t2 w t2

which sum to zero: the cluster covariance estimator works fine. The DK method also

works since it adds the off-diagonal elements which are

i 1 2 3 4

1 w t2 w t2

2 w t2 w t2

3 w t2 w t2

4 w t2 w t2

which also sum to zero. This suggests that the cluster covariance matrix goes wrong

only when the cluster definition (for the covariance matrix) is strongly related to the qi

regressor.

298

10.4 From CalTime To a Panel Regression

The CalTime estimates can be replicated by using the individual data in the panel. For

instance, with two investor groups we could estimate the following two regressions

yi t D x t0 1 C u.1/

i t for i 2 group 1 (10.18)

yi t D x t0 2 C u.2/

i t for i 2 group 2. (10.19)

More interestingly, these regression equations can be combined into one panel regres-

sion (and still give the same estimates) by the help of dummy variables. Let zj i D 1 if

individual i is a member of group j and zero otherwise. Stacking all the data, we have

(still with two investor groups)

yi t D .z1i x t /0 1 C .z2i x t /0 2 C ui t

" #!0 " #

z1i x t 1

D C ui t

z2i x t 2

# "

z 1i

D .zi x t /0 C ui t , where zi D : (10.20)

z2i

Since the CalTime approach (10.2) and the panel approach (10.20) give the same

coefficients, it is clear that the errors in the former are just group averages of the errors in

the latter

1 P

vjt D u.j / : (10.21)

Nj i 2Group j i t

We know that

1

Var.vjt / D . i i ih / C ih ; (10.22)

Nj

where i i is the average Var.u.j / .j / .j /

i t / and ih is the average Cov.ui t ; uht /. With a large

cross-section, only the covariance matters. A good covariance estimator for the panel

approach will therefore have to handle the covariance with a groupand perhaps also

the covariance across groups. This suggests that the panel regression needs to handle the

cross-correlations (for instance, by using the cluster or DK covariance estimators).

Hoechle, Schmid, and Zimmermann (2009) (HSZ) suggest the following regression on all

299

data (t D 1; : : : ; T and also i D 1; : : : ; N )

yi t D .zi t x t /0 d C vi t (10.23)

D .1; z1i t ; : : : ; zmi t 1; x1t ; : : : ; xk t /0 d C vi t ; (10.24)

i in period t and where xpt is the pth pricing factor. In many cases zj i t is time-invariant

and could even be just a dummy: zj i t D 1 if investor i belongs to investor group j

(for instance being 1830 years old). In other cases, zj i t is still time invariant and con-

tains information about the number of fund switches as well as other possible drivers of

performance like gender. The x t vector contains the pricing factors. In case the charac-

teristics z1i t ; : : : ; zmi t sum to unity (for a given individual i and time t ), the constant in

1; z1i t ; : : : ; zmi t is dropped.

This model is estimated with LS (stacking all N T observations), but the standard

errors are calculated according to Driscoll and Kraay (1998) (DK)which accounts for

cross-sectional correlations, for instance, correlations between the residuals of different

investors (say, v1t and v7t ).

HSZ prove the following two propositions.

exclusive and constant group membership (z1i t D 1 means that investor i belongs to

group 1, so zj i t D 0 for j D 2; :::; m), then the LS estimates and DK standard errors

of (10.23) are the same as LS estimates and Newey-West standard errors of the CalTime

approach (10.2). (See HSZ for a proof.)

This proposition basically says that panel regression is as good as the CT approach.

So why use a panel regression, then? A. Because if allows for (a) many characteristics

(poor, old, men) without having to define a very large set of dummies (poor&old&men,

poor&old&female, poor&young&men,...); (b) a finer (continuous) characteristics grid

(age in years, months, days and...).

switches) The LS estimates and DK standard errors of (10.23) are the same as the LS

estimates of CrossReg approach (10.4), but where the standard errors account for the

cross-sectional correlations, while those in the CrossReg approach do not. (See HSZ for

a proof.)

300

Example 10.14 (One investor characteristic and one pricing factor). In this case (10.23)

is

2 30

1

6 7

6 x1t 7

yi t D 6

6 7 d C vi t ;

4 zi t 5

7

zi t x1t

D d0 C d1 x1t C d2 zi t C d3 zi t x1t C vi t :

In case we are interested in how the investor characteristics (zi t ) affect the alpha (inter-

cept), then d2 is the key coefficient.

This section reports results from a simple Monte Carlo experiment. We use the model

yi t D C f t C gi C "i t ; (10.25)

(demeaned) number of the cluster ( 2; 1; 0; 1; 2) that the individual belongs to. This is

a simplified version of the regressions we run in the paper. In particular, measures how

the performance depends on the number of fund switches.

The experiment uses 3000 artificial samples with t D 1; : : : ; 2000 and i D 1; : : : ; 1665.

Each individual is a member of one of five equally sized groups (333 individuals in each

group). The benchmark return f t is iid normally distributed with a zero mean and a stan-

p

dard deviation equal to 15= 250, while "i t is also normally distributed with a zero mean

and a standard deviation of one (different cross-sectional correlations are shown in the

table). In generating the data, the true values of and are zero, while is oneand

these are also the hypotheses tested below. To keep the simulations easy to interpret, there

is no autocorrelation or heteroskedasticity.

Results for three different GMM-based methods are reported: Driscoll and Kraay

(1998), a cluster method and Whites method. To keep the notation short, let the re-

gression model be yi t D xi0 t b C "i t , where xi t is a K 1 vector of regressors. The (least

301

squares) moment conditions are

1 PT PN

hi t D 0K1 , where hi t D xi t "i t : (10.26)

T N t D1 i D1

Standard GMM results show that the variance-covariance matrix of the coefficients is

O D 1 S 1 , where xx D 1 PT PN

Cov.b/ xi t xi0 t ; (10.27)

xx xx

T N t D1 i D1

and S is covariance matrix of the moment conditions.

The three methods differ with respect to how the S matrix is estimated

1 PT 0

PN

SDK D t D1 h t h t , where h t D i D1 hi t ;

T 2N 2

1 PT PM j j 0 j

X

SC l D t D1 j D1 h t .h t / , where h t D hi t ;

T 2N 2 i 2 cluster j

1 PT PN 0

SW h D t D1 i D1 hi t hi t : (10.28)

T 2N 2

To see the difference, consider a simple example with N D 4 and where i D .1; 2/ belong

to the first cluster and i D .3; 4/ belong to the second cluster. The following matrix shows

the outer product of the moment conditions of all individuals. Whites estimator sums up

the cells on the principal diagonal, the cluster method adds the underlined cells, and the

DK method adds also the remaining cells

2 3

i 1 2 3 4

6 1 h1t h01t h1t h02t h1t h03t h1t h04t 7

6 7

6 7

0 0 0 0 7

6 2 h2t h1t h2t h2t h2t h3t h2t h4t 7 (10.29)

6

6 3 h3t h0 h3t h0 h3t h0 h3t h0 7

6 7

4 1t 2t 3t 4t 5

4 h4t h1t h4t h2t h4t h3t h4t h04t

0 0 0

To generate data with correlated (in the cross-section) residuals, let the residual of indi-

vidual i (belonging to group j ) in period t be

302

where uit N.0; u2 ), vjt N.0; v2 ) and w t N.0; w2 )and the three components

are uncorrelated. This implies that

" #

v2 C w2 if individuals i and k belong to the same group

Cov."i t ; "k t / D

w2 otherwise.

(10.31)

Clearly, when w2 D 0 then the correlation across groups is zero, but there may be corre-

lation within a group. If both v2 D 0 and w2 D 0, then there is no correlation at all across

individuals. For CalTime portfolios (one per activity group), we expect the ui t to average

out, so a group portfolio has the variance v2 C w2 and the covariance of two different

group portfolios is w2 .

The Monte Carlo simulations consider different values of the variancesto illustrate

the effect of the correlation structure.

Table 10.1 reports the fraction of times the absolute value of a t-statistics for a true null

hypothesis is higher than 1.96. The table has three panels for different correlation patterns

the residuals ("i t ): no correlation between individuals, correlations only within the pre-

specified clusters and correlation across all individuals.

In the upper panel, where the residuals are iid, all three methods have rejection rates

around 5% (the nominal size).

In the middle panel, the residuals are correlated within each of the five clusters, but

there is no correlation between individuals that belong to the different clusters. In this

case, but the DK and the cluster method have the right rejection rates, while Whites

method gives much too high rejection rates (around 85%). The reason is that Whites

method disregards correlation between individualsand in this way underestimates the

uncertainty about the point estimates. It is also worth noticing that the good performance

of the cluster method depends on pre-specifying the correct clustering. Further simula-

tions (not tabulated) show that with a completely random cluster specification (unknown

to the econometrician), gives almost the same results as Whites method.

                                      White     Cluster   Driscoll-Kraay

A. No cross-sectional correlation
                                      0.044     0.045     0.045
                                      0.050     0.051     0.050

B. Within-cluster correlations
                                      0.850     0.047     0.048
                                      0.859     0.049     0.050

C. Correlation between all individuals
                                      0.934     0.364     0.046
                                      0.015     0.000     0.050

Table 10.1: Simulated size of different covariance estimators. This table presents the fraction of rejections of true null hypotheses for three different estimators of the covariance matrix: White's (1980) method, a cluster method, and Driscoll and Kraay's (1998) method. The model of individual i in period t, who belongs to cluster j, is r_it = α + β f_t + γ g_i + ε_it, where f_t is a common regressor (iid normally distributed) and g_i is the demeaned number of the cluster that the individual belongs to. The simulations use 3000 repetitions of samples with t = 1, ..., 2000 and i = 1, ..., 1665. Each individual belongs to one of five different clusters. The error term is constructed as ε_it = u_it + v_jt + w_t, where u_it is an individual (iid) shock, v_jt is a shock common to all individuals who belong to cluster j, and w_t is a shock common to all individuals. All shocks are normally distributed. In Panel A the variances of (u_it, v_jt, w_t) are (1, 0, 0), so the shocks are iid; in Panel B the variances are (0.67, 0.33, 0), so there is a 33% correlation within a cluster but no correlation between different clusters; in Panel C the variances are (0.67, 0, 0.33), so there is no cluster-specific shock and all shocks are equally correlated, effectively having a 33% correlation within a cluster and between clusters.

The lower panel has no cluster correlations, but all individuals are now equally correlated (similar to a fixed time effect). For the intercept (α) and the slope coefficient on the common factor (β), the DK method still performs well, while the cluster and White's methods give too many rejections: the latter two methods underestimate the uncertainty since some correlations across individuals are disregarded. Things are more complicated for the slope coefficient on the cluster number (γ). Once again, DK performs well, but both the cluster and White's methods lead to too few rejections. The reason is the interaction of the common component in the residual with the cross-sectional dispersion of the group number (g_i).

To understand this last result, consider a stylised case where y_it = γ g_i + ε_it with γ = 0 and ε_it = w_t, so all residuals are due to an (excluded) time fixed effect. In this case, the matrix above becomes

$$\begin{array}{c|cccc}
i & 1 & 2 & 3 & 4 \\ \hline
1 & w_t^2 & w_t^2 & -w_t^2 & -w_t^2 \\
2 & w_t^2 & w_t^2 & -w_t^2 & -w_t^2 \\
3 & -w_t^2 & -w_t^2 & w_t^2 & w_t^2 \\
4 & -w_t^2 & -w_t^2 & w_t^2 & w_t^2
\end{array} \qquad (10.32)$$

(This follows from g_i = (-1, -1, 1, 1) and since h_it = g_i w_t we get (h_1t, h_2t, h_3t, h_4t) = (-w_t, -w_t, w_t, w_t).) Both White's and the cluster method sum up only positive cells, so S is a strictly positive number. (For the cluster method, this result relies on the assumption that the clusters used in estimating S correspond to the values of the regressor, g_i.) However, that is wrong since it is straightforward to demonstrate that the estimated coefficient in any sample must be zero. This is seen by noticing that Σ_{i=1}^N h_it = 0 (at a zero slope coefficient) holds for all t, so there is in fact no uncertainty about the slope coefficient. In contrast, the DK method also adds the off-diagonal (between-cluster) elements, which are all equal to -w_t², giving the correct result S = 0.
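
A small numerical check of this argument (a sketch of my own, assuming the stylised values above: g = (-1, -1, 1, 1), clusters {1, 2} and {3, 4}, and w_t = 1) sums the cells of h_i h_k the way each method does, ignoring scaling constants:

    import numpy as np

    # Stylised case: the residual is a pure time effect, so the moment is h_it = g_i * w_t.
    g = np.array([-1.0, -1.0, 1.0, 1.0])
    cluster = np.array([0, 0, 1, 1])      # clusters match the values of g
    h = g * 1.0                           # h for one period, with w_t = 1
    HH = np.outer(h, h)                   # matrix of h_i h_k, cf. (10.32)

    S_white = np.diag(HH).sum()           # diagonal cells only: strictly positive
    S_cluster = sum(HH[np.ix_(cluster == c, cluster == c)].sum() for c in (0, 1))  # positive
    S_dk = HH.sum()                       # all cells: zero, the correct answer
    print(S_white, S_cluster, S_dk)       # 4.0 8.0 0.0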

See Table 10.2 for results on a ten-year panel of some 60,000 Swedish pension savers (Dahlquist, Martinez and Söderlind, 2011).

Bibliography

Driscoll, J., and A. Kraay, 1998, Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data, Review of Economics and Statistics, 80, 549-560.

Hoechle, D., 2007, Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence, The Stata Journal, 7, 281-312.

... Calendar Time Portfolio Approach and the Performance of Private Investors, Working paper, University of Basel.

[Figure: Alpha (% per year), 2002-2010, for investor groups sorted by the number of fund changes (0, 1, 2-5, 6-20, 21-50, and 51+).]

Table 10.2: Investor activity, performance, and characteristics

                          I          II         III        IV
                       (2.841)    (3.284)    (2.819)    (3.253)
Default fund            0.406      0.387      0.230      0.217
                       (1.347)    (1.348)    (1.316)    (1.320)
1 change                0.117                 0.125
                       (0.463)               (0.468)
2-5 changes             0.962                 0.965
                       (0.934)               (0.934)
6-20 changes            2.678                 2.665
                       (1.621)               (1.623)
21-50 changes           4.265                 4.215
                       (2.074)               (2.078)
51+ changes             7.114                 7.124
                       (2.529)               (2.535)
Number of changes                  0.113                 0.112
                                  (0.048)               (0.048)
Age                                           0.008      0.008
                                             (0.011)    (0.011)
Gender                                        0.306      0.308
                                             (0.101)    (0.101)
Income                                        0.007      0.009
                                             (0.033)    (0.036)
R-squared (in %)        55.0       55.1       55.0       55.1

The table presents the results of pooled regressions of an individual's daily excess return on return factors, and measures of the individual's fund changes and other characteristics. The return factors are the excess returns of the Swedish stock market, the Swedish bond market, and the world stock market, and they are allowed to vary across the individuals' characteristics. For brevity, the coefficients on these return factors are not presented in the table. The measure of fund changes is either a dummy variable for an activity category or a variable counting the number of fund changes. Other characteristics are the individual's age in 2000, gender, and pension rights in 2000 (a proxy for income). The constant term and the coefficients on the dummy variables are expressed in % per year. The income variable is scaled down by 1,000. Standard errors, robust to conditional heteroscedasticity and spatial cross-sectional correlations as in Driscoll and Kraay (1998), are reported in parentheses. The sample consists of 62,640 individuals followed daily over the 2000 to 2010 period.


11 Expectations Hypothesis of Interest Rates

Remark 11.1 (Prices and yields on zero-coupon bonds) Consider an m-period zero-coupon bond. Its price B(m) and continuously compounded interest rate y(m) are related according to

$$B(m) = \exp[-m\, y(m)].$$

Similarly, the continuously compounded forward rate for an m-period investment that starts k periods ahead is

$$f(k, k+m) = \frac{1}{m} \ln \frac{B(k)}{B(k+m)} = \frac{(k+m)\, y(k+m) - k\, y(k)}{m}.$$
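
As a quick illustration of these relations, the following minimal Python sketch (my own; the yields are made-up numbers, not data from the text) prices zero-coupon bonds from their yields and backs out a forward rate:

    import numpy as np

    # B(m) = exp(-m*y(m));  f(k,k+m) = ln[B(k)/B(k+m)]/m = [(k+m)*y(k+m) - k*y(k)]/m
    def bond_price(m, y):
        return np.exp(-m * y)

    def forward_rate(k, m, y_k, y_km):
        return np.log(bond_price(k, y_k) / bond_price(k + m, y_km)) / m

    # Hypothetical yields: 2% on a 2-period bond, 2.5% on a 3-period bond
    print(forward_rate(k=2, m=1, y_k=0.02, y_km=0.025))   # 0.035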

Term risk premia can be defined in several ways. All these premia are zero (or at least

constant) under the expectations hypothesis (EH).

The (realized) yield term premium is defined as the difference between a long (n-period) interest rate and the average future short (m-period) rates over the same period, and the EH says that it should be constant

$$\varphi^{y}_{t+n} = y_{nt} - \frac{1}{k}\sum_{s=0}^{k-1} y_{m,t+sm}, \text{ with } k = n/m, \text{ and} \qquad (11.1)$$

$$\mathrm{E}_t\, \varphi^{y}_{t+n} = \text{constant under EH.} \qquad (11.2)$$

Example 11.2 (Yield term premium, rolling over 3-month rates for a year). Let y_{1y,t} be the current 1-year rate and y_{3M,t+sM} the 3-month rate s months ahead. Then

$$\varphi^{y}_{t+1y} = y_{1y,t} - \tfrac{1}{4}\, \mathrm{E}_t\left(y_{3M,t} + y_{3M,t+3M} + y_{3M,t+6M} + y_{3M,t+9M}\right).$$
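
A tiny numerical sketch of this example (my own illustration with made-up rates):

    # Yield term premium of Example 11.2 with hypothetical annualized rates
    y_1y = 0.030                                   # current 1-year rate
    y_3m_expected = [0.020, 0.022, 0.024, 0.026]   # expected 3-month rates, quarters 0-3

    phi_y = y_1y - sum(y_3m_expected) / 4
    print(phi_y)                                   # 0.007, i.e. 0.7% per year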

[Figure: time line comparing rolling over m-period bonds (a new m-period bond at 0, m, 2m and 3m) with holding one n = 4m period bond.]

Let f_t(k, k+m) be the forward rate that applies for the future period t+k to t+k+m. The (realized) forward term premium is the difference between a forward rate for an m-period investment (starting k periods ahead) and the short interest rate for the same period

$$\varphi^{f}_{t+k+m} = f_t(k, k+m) - y_{m,t+k}, \text{ and} \qquad (11.3)$$

$$\mathrm{E}_t\, \varphi^{f}_{t+k+m} = \text{constant under EH.} \qquad (11.4)$$

Example 11.3 (Forward term premium, 1-month investment starting 2 months from now) Let f_t(2M, 3M) be the 1-month forward rate starting 2 months ahead and let y_{1M,t+2M} be the one-month interest rate over the same period. Then

$$\varphi^{f}_{t+3M} = f_t(2M, 3M) - y_{1M,t+2M}.$$

[Figure: time line (now, m, 2m, 3m, 4m) marking the m-period investment covered by the forward rate.]

The (realized) holding-period premium is the return of holding an n-period bond between t and t+m (buy it in t for P_nt and sell it in t+m for P_{n-m,t+m}), in excess of holding an m-period bond over the same period

$$\varphi^{h}_{t+m} = \frac{1}{m}\ln\left(P_{n-m,t+m}/P_{nt}\right) - y_{mt}, \text{ and} \qquad (11.5)$$

$$\mathrm{E}_t\, \varphi^{h}_{t+m} = \text{constant under EH.} \qquad (11.6)$$


This version is perhaps the most similar to the definition of risk premia of other assets (for instance, equity). Figure 11.3 illustrates the timing.

Example 11.4 (Holding-period premium, holding a 10-year bond for one year).

$$\begin{aligned}
\mathrm{E}_t\, \varphi^{h}_{t+1y} &= \mathrm{E}_t \ln\left(P_{9y,t+1}/P_{10y,t}\right) - y_{1y,t} \\
&= 10\, y_{10y,t} - 9\, \mathrm{E}_t\, y_{9y,t+1} - y_{1y,t}.
\end{aligned}$$

The second line just rewrites the bond prices in terms of the interest rates.
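
To see the mechanics, here is a minimal sketch (my own, with made-up yields) that computes the expected holding-period premium of Example 11.4 both from log bond prices and directly from the yields:

    # Expected holding-period premium of a 10-year bond held for one year,
    # using hypothetical continuously compounded yields (per year).
    y10_now = 0.040          # current 10-year yield
    y9_next = 0.041          # expected 9-year yield in one year
    y1_now = 0.020           # current 1-year yield

    # Via log prices: ln P = -n*y, so E ln(P_9y,t+1 / P_10y,t) = 10*y10 - 9*E(y9)
    phi_from_prices = (-9 * y9_next) - (-10 * y10_now) - y1_now
    phi_from_yields = 10 * y10_now - 9 * y9_next - y1_now    # second line of the example
    print(phi_from_prices, phi_from_yields)                  # both 0.011 (1.1% per year)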

[Figure 11.3: time line (now, m, 2m, 3m) illustrating the timing of the holding-period premium.]

Notice that these risk premia are all expressed relative to a short(er) rate; they are term premia. Nothing rules out the possibility that the short(er) rate also includes risk premia. For instance, a short nominal interest rate is likely to include an inflation risk premium since inflation over the next period is risky. However, this is not the focus here.

The expectations hypothesis (see (11.2), (11.4) and (11.6)) says that the exp