
UNIVERSITY OF GRONINGEN

Summary SC Microeconometrics

Tomas Geurts
April 8, 2015

1 WEEK 6

1.1 MAXIMUM LIKELIHOOD AND GMM 03/02/2015

MAXIMUM LIKELIHOOD

The likelihood principle is to choose as estimator of the parameter vector $\theta$ that value of $\theta$ that maximizes the likelihood of observing the actual sample. The likelihood function, $L_N(\theta|y, X)$, is given by the conditional density, $f(y|X, \theta)$, viewed as a function of $\theta$ given the data $(y, X)$. For cross-section data, the observations $(y_i, x_i)$ are independent over $i$ and then,
$$L_N(\theta|y, X) = \prod_{i=1}^{N} f(y_i|x_i, \theta). \quad (1.1)$$

We find the MLE $\hat{\theta}$ by maximizing the likelihood function, or more frequently, the log-likelihood function,
$$\mathcal{L}_N(\theta) = \sum_{i=1}^{N} \ln f(y_i|x_i, \theta). \quad (1.2)$$

We equate the derivative of the log-likelihood function with respect to the parameter vector, $\partial\mathcal{L}_N(\theta)/\partial\theta$, to zero; this derivative is called the score vector $s(\theta)$, and solving gives the MLE, $\hat{\theta}_{ML}$.

Under some general conditions, the MLE is consistent for $\theta_0$ and,
$$\sqrt{N}\left(\hat{\theta}_{ML} - \theta_0\right) \xrightarrow{d} N\left(0, A_0^{-1}\right), \quad \text{where } A_0 = -E\left[\frac{\partial^2\mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right]. \quad (1.3)$$

The resulting asymptotic distribution of the MLE is often expressed as,
$$\hat{\theta}_{ML} \overset{a}{\sim} N\left(\theta,\; -E\left[\frac{\partial^2\mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right]^{-1}\right). \quad (1.4)$$

Under regularity, the maximum likelihood estimator has the following asymptotic properties,
- Consistency, see equation (1.3)
- Asymptotic normality, see equation (1.4)
- Asymptotic efficiency (the MLE achieves the Cramér-Rao lower bound for consistent estimators), see equation (1.3)
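To make the principle concrete, the following is a minimal sketch (not part of the original notes) of ML estimation in Python: the log-likelihood of equation (1.2) for a normal sample is maximized numerically. The simulated data, the seed, and the parametrization through log(sigma) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# MLE sketch: estimate (mu, sigma) of a normal sample by maximizing the
# log-likelihood of equation (1.2) numerically.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)   # simulated data

def neg_loglik(theta):
    mu, log_sigma = theta          # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    # minus the sum of ln f(y_i | theta) for the normal density
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (y - mu)**2 / (2 * sigma**2))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)           # should be close to 2.0 and 1.5
```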

GMM
The idea of the Method of Moments (MM) is to replace expectations by their sample counterparts and to define estimators by solving the resulting system in terms of the parameters. For example, if we want to estimate the parameters of a lognormal distribution by means of MM, we are given,

Probability density function of $X$: $f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$

Raw moment equation of $X$: $E(X^r) = \exp\left(r\mu + \frac{1}{2}r^2\sigma^2\right)$

Hence the MM estimator solves (and consequently we present the solution),

$$m_1 = \frac{1}{N}\sum_{i=1}^{N} x_i = \exp\left(\mu + \frac{1}{2}\sigma^2\right)$$
$$m_2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 = \exp\left(2\mu + 2\sigma^2\right)$$
$$\hat{\mu} = 2\ln m_1 - \frac{1}{2}\ln m_2, \qquad \hat{\sigma}^2 = -2\ln m_1 + \ln m_2 \quad (1.5)$$

There are a few notes concerning the MM estimator,
- The moments are consistent by the LLN
- The estimated parameters are consistent by Slutsky's Theorem
- The estimated parameters are not unique
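A numerical sketch of the lognormal MM example above (the simulated data and seed are illustrative assumptions): recover $(\mu, \sigma^2)$ from the first two raw sample moments as in equation (1.5).

```python
import numpy as np

# MM sketch: solve the two moment equations of (1.5) for mu and sigma^2.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.5, sigma=0.8, size=100_000)

m1, m2 = x.mean(), (x**2).mean()
mu_hat = 2 * np.log(m1) - 0.5 * np.log(m2)
sigma2_hat = -2 * np.log(m1) + np.log(m2)
print(mu_hat, sigma2_hat)   # close to 0.5 and 0.64
```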
To overcome the last problem, we introduce GMM. First we compare ML estimation to GMM,
- In order to perform ML estimation, a full parametric specification of the model, including distributional assumptions on all random elements, is required
- GMM estimation can be applied as soon as a sufficient number of estimating equations is available
- GMM is less efficient but more robust than ML
For GMM we assume that there are $r$ moment conditions for $q$ parameters and the following should hold,
$$E[h(w_i, \theta_0)] = 0. \quad (1.6)$$
If $r = q$ the model is said to be just identified and the MM estimator $\hat{\theta}_{MM}$ is the solution to,
$$\frac{1}{N}\sum_{i=1}^{N} h(w_i, \hat{\theta}) = 0. \quad (1.7)$$
If $r > q$ the model is said to be overidentified and there is no exact solution. The generalized method of moments estimator $\hat{\theta}_{GMM}$ minimizes,
$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right]' W_N \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right] \quad (1.8)$$
where the $r \times r$ weighting matrix $W_N$ is symmetric positive definite and with finite probability limit $W_0$. Different choices of the weighting matrix $W_N$ lead to different estimators that, although consistent, have different variances. Application of GMM requires specification of the moment function $h(\cdot)$ and the weighting matrix $W_N$. The asymptotic distribution is,
$$\sqrt{N}\left(\hat{\theta}_{GMM} - \theta_0\right) \xrightarrow{d} N(0, V) \quad (1.9)$$
where, if observations are i.i.d.,
$$V = (G_0' W_0 G_0)^{-1}(G_0' W_0 S_0 W_0 G_0)(G_0' W_0 G_0)^{-1}, \qquad G_0 = E\left[\frac{\partial h}{\partial\theta'}\Big|_{\theta_0}\right], \qquad S_0 = E\left[hh'\big|_{\theta_0}\right]$$
The optimal GMM estimator has the smallest asymptotic variance given a specified function $h(\cdot)$. For overidentified models, the most efficient GMM estimator is obtained by choosing $W_N = S_0^{-1}$. Intuitively, the best moment conditions have large $G_0$ (the moment condition is much violated if $\theta \neq \theta_0$, so the moment is very informative on the true value $\theta_0$) and small $S_0$ (the sampling variation of the moment, the noise, is small).
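A minimal overidentified-GMM sketch (not from the notes): one parameter, two moment conditions, and the identity weighting matrix in $Q_N$ of equation (1.8). The exponential example and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# GMM sketch (r = 2 moments, q = 1 parameter): estimate the rate lam of an
# exponential distribution from E[x - 1/lam] = 0 and E[x^2 - 2/lam^2] = 0.
rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 2.0, size=5_000)   # true lam = 2

def Q(lam):
    g = np.array([np.mean(x - 1 / lam),
                  np.mean(x**2 - 2 / lam**2)])    # stacked sample moments
    return g @ g                                  # W_N = identity

res = minimize_scalar(Q, bounds=(0.1, 10.0), method="bounded")
print(res.x)   # close to 2
```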

1.2 NUMERICAL OPTIMIZATION 04/02/2015

GRID SEARCH METHODS

In the grid search method, the procedure is as follows,


(1) Select many values of $\theta$ along a grid
(2) Compute $Q_N(\theta)$ for each of these values
(3) Choose as estimator the value $\hat{\theta}$ that provides the largest value of $Q_N(\theta)$
An advantage of grid search methods is that if a fine enough grid is chosen, the method will always work. However, a disadvantage is that it is generally impractical to choose a fine enough grid without further restrictions. If we need to estimate 10 parameters and the grid evaluates each parameter at just 10 points (a very sparse grid), there are $10^{10}$ evaluations.
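A one-parameter grid-search sketch; the toy objective is an illustrative assumption, not from the notes.

```python
import numpy as np

# Grid search sketch: evaluate Q_N(theta) on a grid, keep the maximizer.
theta_grid = np.linspace(-5, 5, 10_001)
Q = -(theta_grid - 1.3)**2           # toy concave objective
theta_hat = theta_grid[np.argmax(Q)]
print(theta_hat)                     # 1.3 up to grid resolution
```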
ITERATIVE METHODS

Iterative methods are the most used algorithms for nonlinear optimization. These algorithms update the current estimate of $\theta$ using a particular (updating) rule. Ideally, the new estimate is a move toward the maximum, so that,
$$Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s) \quad (1.10)$$
but in general this cannot be guaranteed, since this method might also end up at a local maximum instead of the global maximum.
Most iterative methods are gradient methods that change $\hat{\theta}_s$ in a direction determined by the gradient. The update formula is given by,
$$\hat{\theta}_{s+1} = \hat{\theta}_s + A_s g_s, \quad s = 1, \dots, S \quad (1.11)$$
where $A_s$ is a $q \times q$ matrix that depends on $\hat{\theta}_s$ and $g_s = \partial Q_N(\theta)/\partial\theta\,\big|_{\hat{\theta}_s}$ is the $q \times 1$ gradient vector evaluated at $\hat{\theta}_s$.

The Newton-Raphson (NR) method is a popular gradient method, especially effective if the objective function is globally concave in $\theta$. The updating rule is given by,
$$\hat{\theta}_{s+1} = \hat{\theta}_s - H_s^{-1} g_s, \quad s = 1, \dots, S \quad (1.12)$$
where,
$$g_s = \frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\hat{\theta}_s}, \qquad H_s = \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\hat{\theta}_s}$$
The basis for the NR method is a second-order Taylor series approximation: we expand $Q_N(\theta)$ around $\hat{\theta}_s$, take the first derivative and equate this to zero. Solving for $\theta$ gives us the updating rule of the NR method. We find that $Q_N(\theta)$ always increases if $H_s$ is negative definite. The NR method works best for maximization problems where the objective function is globally concave (if the objective function is not globally concave, always try a variety of starting values), so that $H_s$ is always negative definite.
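A sketch of the NR update (1.12) for a one-dimensional, globally concave toy objective (an illustrative assumption) with analytic gradient and Hessian; the maximizer is $\ln 3$.

```python
import numpy as np

# Newton-Raphson sketch for Q(theta) = -exp(theta) + 3*theta.
def g(t): return -np.exp(t) + 3.0     # gradient
def H(t): return -np.exp(t)           # Hessian (always negative)

theta = 0.0                           # starting value
for _ in range(50):
    step = g(theta) / H(theta)
    theta = theta - step              # theta_{s+1} = theta_s - H^{-1} g
    if abs(step) < 1e-10:             # stop on a small parameter change
        break
print(theta, np.log(3.0))
```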
Ideally, the iterative procedure should terminate when the gradient is zero. In practice, this will not be possible, primarily because of accumulated rounding error in the computation of the function and its derivatives. Hence there are a few alternative criteria for when to stop,
(1) Small relative change in the objective function $Q_N(\hat{\theta}_s)$
(2) Small change in the gradient vector $g_s$ relative to the Hessian
(3) Small relative change in the parameter estimates $\hat{\theta}_s$
We can use either analytical derivatives (pro: more precise and quicker to compute; con: more coding time) or numerical derivatives (pro: no coding beyond providing the objective function; con: the derivatives have to be computed many times). The latter are computed using,
$$\frac{\partial Q_N(\hat{\theta}_s)}{\partial\theta_j} \approx \frac{Q_N(\hat{\theta}_s + he_j) - Q_N(\hat{\theta}_s - he_j)}{2h}, \quad j = 1, \dots, q \quad (1.13)$$
where $h$ is small and $e_j = (0, \dots, 0, 1, 0, \dots, 0)'$ is a vector with unity in the $j$-th row and zeros elsewhere.
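A direct implementation of the central-difference formula (1.13); the test objective is an illustrative assumption.

```python
import numpy as np

# Central-difference gradient of (1.13): only the objective Q is needed,
# at the cost of 2q function evaluations per gradient.
def num_grad(Q, theta, h=1e-6):
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = 1.0   # unit vector e_j
        grad[j] = (Q(theta + h * e) - Q(theta - h * e)) / (2 * h)
    return grad

print(num_grad(lambda t: -np.sum((t - 1)**2), np.zeros(3)))  # ~ [2, 2, 2]
```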
A common modification of the NR method is the method of scoring (MS), which substitutes $H_s$ with its expected value $E[H_s]$, i.e. minus the information matrix $\mathcal{I}(\theta)$.
There might be situations where it is not possible to obtain an estimate of the parameters.
- Always check descriptive statistics and look for anomalies
- Rescale the data so that the regressors have similar means and variances (e.g. use income in thousands of euros rather than in euros)
- Check for multicollinearity
- Try different starting values
- For dummy variables, check whether there is enough variability in the data

2 WEEK 7

2.1 BINARY CHOICE MODELS 10/02/2015

Binary choice models exist to explain a choice by a certain binary outcome ($y = 1$ or $y = 0$). The models boil down to specifying the probability that $y = 1$ as a function of $x$.
We can use a linear probability model ($y_i = x_i'\beta + \epsilon_i$) to model choices ($E(y_i|x_i) = P(y_i = 1|x_i) = x_i'\beta$), but the probabilities could fall outside $[0, 1]$ and the error term suffers from heteroskedasticity.
A binary choice model describes the probability of a certain choice as follows,
$$P(y_i = 1|x_i) = F(x_i'\beta) \quad (2.1)$$
where $F(\cdot)$ satisfies $0 \leq F(\cdot) \leq 1$. Hence, we choose a c.d.f. for $F$. Different choices for $F$ are,


Probit model
$$F(x_i'\beta) = \Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}t^2\right)dt \quad (2.2)$$

Logit model
$$F(x_i'\beta) = \Lambda(x_i'\beta) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} \quad (2.3)$$

Complementary log-log model
$$F(x_i'\beta) = C(x_i'\beta) = 1 - \exp\left(-\exp(x_i'\beta)\right) \quad (2.4)$$

The standard normal and standard logistic distribution do not differ much: they have the same mean ($E(X) = 0$) but their variances differ ($\sigma^2_{normal} = 1$ vs. $\sigma^2_{logistic} = \pi^2/3$). Hence the estimates from a logit model are roughly a factor 1.8 larger.
Frequently the marginal effects are utilized to interpret the coefficients. For each model we find the following marginal effect for $x_{ik}$,
- Linear probability model: $\beta_k$
- Probit model: $\phi(x_i'\beta)\beta_k$
- Logit model: $\Lambda(x_i'\beta)\left[1 - \Lambda(x_i'\beta)\right]\beta_k$
- Complementary log-log model: $\exp\left(-\exp(x_i'\beta)\right)\exp(x_i'\beta)\beta_k$
There are three possibilities to compute the marginal effect: the marginal effect at a representative value (MER), the marginal effect at the mean (MEM), or the average marginal effect (AME).
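A sketch of the AME and MEM for a logit model, computed directly from the formula $\Lambda(x'\beta)[1 - \Lambda(x'\beta)]\beta_k$; the data and coefficients are illustrative assumptions.

```python
import numpy as np

# Marginal-effect summaries for a logit model.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
beta = np.array([-0.2, 0.8, -0.5])

lam = 1 / (1 + np.exp(-X @ beta))     # Lambda(x_i' beta)
k = 1                                 # regressor of interest
ame = np.mean(lam * (1 - lam)) * beta[k]        # average marginal effect
x_mean = X.mean(axis=0)
lam_m = 1 / (1 + np.exp(-x_mean @ beta))
mem = lam_m * (1 - lam_m) * beta[k]             # marginal effect at the mean
print(ame, mem)
```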

The odds ratio or relative risk can be computed by the ratio $\frac{p}{1-p}$. If the probabilities come from a standard logistic distribution, the odds ratio is given by $\frac{p}{1-p} = \exp(x_i'\beta)$.

These models can also be derived from the underlying latent model, where the underlying latent variable equation is given by,
$$y_i^* = x_i'\beta + \epsilon_i \quad (2.5)$$
where $y_i^*$ is unobserved and,
$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \leq 0. \end{cases} \quad (2.6)$$
This leads us to,
$$P[y_i = 1|x_i] = P[y_i^* > 0|x_i] = P[-\epsilon_i < x_i'\beta\,|\,x_i] = F(x_i'\beta) \quad (2.7)$$
where $F$ is the distribution function of $-\epsilon_i$ or, in the common case of a symmetric distribution, the distribution function of $\epsilon_i$.
We estimate binary choice models by maximum likelihood. The following identities facilitate this estimation,
Likelihood function
$$L_N(\beta) = \prod_{i=1}^{N} P[y_i = 1|x_i, \beta]^{y_i}\, P[y_i = 0|x_i, \beta]^{1-y_i} \quad (2.8)$$
Log-likelihood function
$$\mathcal{L}(\beta) = \sum_{i=1}^{N} y_i \ln F(x_i'\beta) + \sum_{i=1}^{N} (1 - y_i)\ln\left(1 - F(x_i'\beta)\right) \quad (2.9)$$
First order conditions for the MLE $\hat{\beta}_{ML}$
$$\sum_{i=1}^{N} \frac{y_i - F(x_i'\beta)}{F(x_i'\beta)\left(1 - F(x_i'\beta)\right)}\, f(x_i'\beta)\,x_i = 0. \quad (2.10)$$

There is no analytical solution, but maxima are easy to find using NR, as in the case of logit and probit the log-likelihood is globally concave. Furthermore, the MLE is consistent if the probability of success is correctly specified ($p_i = F(x_i'\beta)$). If the conditional density of $y_i$ given $x_i$ is correctly specified, then the MLE is asymptotically efficient,
$$\hat{\beta}_{ML} \overset{a}{\sim} N\left(\beta,\; -E\left[\frac{\partial^2\mathcal{L}_N}{\partial\beta\,\partial\beta'}\right]^{-1}\right). \quad (2.11)$$
The goodness of fit of binary choice models can be measured by the pseudo-$R^2$ (which measures the relative gain),
$$R^2_{RG} = 1 - \frac{\mathcal{L}(\hat{\beta}_{ML})}{\mathcal{L}_0}. \quad (2.12)$$

An alternative way to measure the goodness of fit is to find the proportion of correctly predicted outcomes. In the case of a logit or probit model, this would be,
$$wr_1 = \frac{n_{00} + n_{11}}{N}. \quad (2.13)$$
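A sketch of logit estimation and fit diagnostics; statsmodels is used here as one convenient implementation, and the simulated data are an assumption.

```python
import numpy as np
import statsmodels.api as sm

# Logit MLE with the pseudo-R^2 of (2.12) and the hit rate of (2.13).
rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(2000, 2)))
p = 1 / (1 + np.exp(-(X @ np.array([0.3, 1.0, -0.7]))))
y = rng.binomial(1, p)

res = sm.Logit(y, X).fit(disp=False)
print(res.params)       # beta_ML
print(res.prsquared)    # McFadden pseudo-R^2 (relative log-likelihood gain)
print(np.mean((res.predict(X) > 0.5) == y))   # share correctly predicted
```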

2.2 ORDERED AND MULTINOMIAL RESPONSE MODELS 11/02/2015

For ordered and multinomial response models it holds that there are $m$ alternatives and the dependent variable $y$ is defined to take value $j$ if the $j$-th alternative is taken, $j = 1, \dots, m$. Let us introduce $m$ binary variables for each observation $y$,
$$y_j = \begin{cases} 1 & \text{if } y = j \\ 0 & \text{if } y \neq j. \end{cases} \quad (2.14)$$
The multinomial density for one observation can conveniently be written as,
$$f(y) = p_1^{y_1}\, p_2^{y_2} \cdots p_m^{y_m} = \prod_{j=1}^{m} p_j^{y_j}. \quad (2.15)$$

Next we need to specify a model for the probability that individual $i$ chooses the $j$-th alternative,
$$p_{ij} = P(y_i = j|x_i) = F_j(x_i, \theta), \quad i = 1, \dots, N \text{ and } j = 1, \dots, m. \quad (2.16)$$
We estimate multinomial models by maximum likelihood. The following identities facilitate this estimation,

Log-likelihood function
$$\ln L_N = \sum_{i=1}^{N}\sum_{j=1}^{m} y_{ij}\ln p_{ij} \quad (2.17)$$
First order conditions for the MLE $\hat{\theta}_{ML}$
$$\frac{\partial \ln L_N}{\partial\theta} = \sum_{i=1}^{N}\sum_{j=1}^{m}\frac{y_{ij}}{p_{ij}}\frac{\partial p_{ij}}{\partial\theta} = 0. \quad (2.18)$$

ORDERED RESPONSE MODELS

One has the choice between $m$ alternatives numbered from 1 to $m$. Ordered models are also based upon the underlying latent variable model but with a different mapping from the latent variable $y_i^*$ to the observed one $y_i = 1, 2, \dots, m$. Hence,
$$y_i^* = x_i'\beta + \epsilon_i. \quad (2.19)$$
The observation rule is,
$$y_i = j \quad \text{if } \alpha_{j-1} < y_i^* \leq \alpha_j, \quad j = 1, \dots, m. \quad (2.20)$$

The corresponding probability is given by,
$$P[y_i = j|x_i, \beta, \alpha] = P[\alpha_{j-1} < y_i^* \leq \alpha_j\,|\,x_i, \beta, \alpha], \quad j = 1, \dots, m. \quad (2.21)$$
The type of model (ordered logit/ordered probit) depends on the distribution of the error term. In the ordered probit model, setting $\alpha_1 = 0$ and the error variance to 1 are two normalization constraints. These are required to identify the remaining parameters.
We consider the probabilities and marginal effects of the ordered probit model for $j = 1, \dots, m$.
$j = 1$:
$$P[y_i = 1|x_i] = \Phi(\alpha_1 - x_i'\beta) \quad (2.22)$$
$$\frac{\partial p_{i1}}{\partial x_{ik}} = -\phi(\alpha_1 - x_i'\beta)\beta_k \quad (2.23)$$
$j = 2, \dots, m-1$:
$$P[y_i = j|x_i] = \Phi(\alpha_j - x_i'\beta) - \Phi(\alpha_{j-1} - x_i'\beta) \quad (2.24)$$
$$\frac{\partial p_{ij}}{\partial x_{ik}} = -\left[\phi(\alpha_j - x_i'\beta) - \phi(\alpha_{j-1} - x_i'\beta)\right]\beta_k \quad (2.25)$$
$j = m$:
$$P[y_i = m|x_i] = 1 - \Phi(\alpha_{m-1} - x_i'\beta) \quad (2.26)$$
$$\frac{\partial p_{im}}{\partial x_{ik}} = \phi(\alpha_{m-1} - x_i'\beta)\beta_k. \quad (2.27)$$

Interval regression is an application of ordered response models. The truncation points ($\alpha$) are in this case known. This does not require the normalization of $\sigma$.
MULTINOMIAL MODELS

When we consider the case where there is no natural ordering we speak of multinomial models. There are three types of multinomial models, depending on the type of regressors we have,
- Alternative-varying regressors take different values for different alternatives, e.g. travelling time and costs in a model of transportation choice.
- Alternative-invariant regressors do not vary across alternatives, e.g. socioeconomic status in a model of transportation choice.
The three multinomial models (that are treated in this course) are given by,
(1) Multinomial logit: all regressors are alternative-invariant (e.g. income)
(2) Conditional logit: all regressors (e.g. price) are alternative-specific
(3) Mixed logit: both types of variables included
When regressors do not vary over alternatives, the multinomial logit model (MNL) is used. We need to restrict $\beta_1 = 0$ to ensure model identification and accordingly find,
$$p_{i1} = \frac{1}{1 + \sum_{l=2}^{m}\exp(x_i'\beta_l)}, \quad (2.29)$$
$$p_{ij} = \frac{\exp(x_i'\beta_j)}{1 + \sum_{l=2}^{m}\exp(x_i'\beta_l)}, \quad j = 2, \dots, m. \quad (2.30)$$

The marginal effect of a change in $x_i$ on $p_{ij}$ is given by,
$$\frac{\partial p_{ij}}{\partial x_i} = p_{ij}\left(\beta_j - \bar{\beta}_i\right), \quad \text{where } \bar{\beta}_i = \sum_{l=1}^{m} p_{il}\beta_l. \quad (2.31)$$
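A sketch of the MNL choice probabilities (2.29)-(2.30) with the $\beta_1 = 0$ normalization; dimensions and values are illustrative assumptions.

```python
import numpy as np

# MNL probability sketch: beta_1 fixed to zero for identification.
m, K = 4, 3
rng = np.random.default_rng(5)
x_i = rng.normal(size=K)                  # alternative-invariant regressors
betas = np.vstack([np.zeros(K),           # beta_1 = 0 (normalization)
                   rng.normal(size=(m - 1, K))])

v = betas @ x_i                           # x_i' beta_j for j = 1..m
p = np.exp(v) / np.exp(v).sum()           # choice probabilities, sum to 1
print(p, p.sum())
```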

When regressors do vary over alternatives, the conditional logit model (CL) is used. The model is specified by,
$$p_{ij} = \frac{\exp(x_{ij}'\beta)}{\sum_{l=1}^{m}\exp(x_{il}'\beta)}, \quad j, l = 1, \dots, m \text{ and } i = 1, \dots, N. \quad (2.32)$$

The marginal effect of a change in $x_{ik}$ on $p_{ij}$ is given by,
$$\frac{\partial p_{ij}}{\partial x_{ik}} = p_{ij}\left(\delta_{ijk} - p_{ik}\right)\beta, \quad (2.33)$$
where $\delta_{ijk}$ is an indicator variable equal to 1 if $j = k$ and equal to 0 if $j \neq k$. This effect, called the own effect, is positive when we measure the marginal effect within one group ($j = k$) and $\beta > 0$, since then we find $p_{ij}(1 - p_{ij})\beta > 0$.
We can generalize both models to the mixed logit model, which is specified by,
$$p_{ij} = \frac{\exp(x_{ij}'\beta + w_i'\gamma_j)}{\sum_{l=1}^{m}\exp(x_{il}'\beta + w_i'\gamma_l)}, \quad j, l = 1, \dots, m \text{ and } i = 1, \dots, N, \quad (2.34)$$
where the $x_{ij}$ variables vary over alternatives and the $w_i$ variables do not vary over alternatives.
The coefficients in the CL and MNL models can also be given a more direct logit-like interpretation in terms of relative risk. In the MNL model, the conditional probability of observing alternative $j$, given that either alternative $j$ or alternative $k$ is observed, is,
$$P[y_i = j\,|\,y = j \text{ or } k] = \frac{p_{ij}}{p_{ij} + p_{ik}} = \frac{\exp(x_i'\beta_j)}{\exp(x_i'\beta_j) + \exp(x_i'\beta_k)} = \frac{\exp\left(x_i'(\beta_j - \beta_k)\right)}{\exp\left(x_i'(\beta_j - \beta_k)\right) + 1} \quad (2.35)-(2.37)$$
which is a logit model with coefficient $(\beta_j - \beta_k)$. A similar approach can also be applied to the CL model.
A drawback of MNL and CL models is that the probability ratio between two alternatives $j$ and $k$ does not depend on the other alternatives, which is an undesirable property of the multinomial (conditional) logit model. This assumption is named Independence of Irrelevant Alternatives (IIA). The IIA property can be checked by means of a Hausman test, which compares estimates of the red bus option in a three-choice model (car, red bus and blue bus) with estimates of the red bus option in a two-choice model (car, red bus). If IIA holds then,
- Both models (three-choice and two-choice) will yield consistent estimates
- The estimates from the three-choice model are more efficient than those of the two-choice model

3 WEEK 8

3.1 MULTINOMIAL MODELS 17/02/2015

ADDITIVE RANDOM UTILITY MODELS

Unordered multinomial models more general than multinomial and conditional logit can be obtained using the general framework of additive random utility models (ARUM). In a general $m$-choice multinomial model, the utility of choice $j$ is,
$$U_j = V_j + \epsilon_j, \quad j = 1, \dots, m \quad (3.1)$$
where $V_j$ is the deterministic component of utility and $\epsilon_j$ is the random component of utility. Individual $i$ will choose the alternative that provides the highest utility, so that,
$$p_{ij} = P[y_i = j] = P[U_{ij} \geq U_{ik},\ \forall k \neq j] = P[\epsilon_{ijk} \leq V_{ij} - V_{ik},\ \forall k \neq j] \quad (3.2)-(3.3)$$
where $\epsilon_{ijk} = \epsilon_{ik} - \epsilon_{ij}$ is the difference with respect to the reference alternative $k$. The log-likelihood is then,
$$\ln L_N = \sum_{i=1}^{N}\sum_{j=1}^{m} y_{ij}\ln p_{ij}. \quad (3.4)$$

Different multinomial models arise from different assumptions on the joint distribution of $\epsilon_1, \dots, \epsilon_m$. When we assume that the errors $\epsilon_j$ are i.i.d. type I extreme value,
$$f(\epsilon_j) = e^{-\epsilon_j}\exp\left(-e^{-\epsilon_j}\right), \quad j = 1, \dots, m. \quad (3.5)$$
Consequently, for the $i$-th individual in multinomial outcomes modelled using ARUM, it can be shown that,
$$P[y_i = j] = \frac{e^{V_{ij}}}{e^{V_{i1}} + e^{V_{i2}} + \dots + e^{V_{im}}} \quad (3.6)$$
We obtain the CL model when $V_{ij} = x_{ij}'\beta$ and the MNL model when $V_{ij} = x_i'\beta_j$. The assumption that the errors $\epsilon_j$ are independent across alternatives $j$ is likely to be violated if two alternatives are similar (failure of the IIA assumption).
When we assume that the errors $\epsilon_j$ are jointly distributed of the GEV (generalized extreme value) type, the c.d.f. is given by,
$$F(\epsilon_1, \dots, \epsilon_m) = \exp\left(-G\left(e^{-\epsilon_1}, e^{-\epsilon_2}, \dots, e^{-\epsilon_m}\right)\right). \quad (3.7)$$
If the errors are GEV distributed then,
$$P[y = j] = \frac{e^{V_j}\, G_j\left(e^{V_1}, \dots, e^{V_m}\right)}{G\left(e^{V_1}, \dots, e^{V_m}\right)}, \quad \text{where } G_j(Y_1, \dots, Y_m) = \frac{\partial G(Y_1, \dots, Y_m)}{\partial Y_j}. \quad (3.8)$$
The multinomial (conditional, mixed) logit model is obtained if $G(Y_1, \dots, Y_m) = \sum_{k=1}^{m} Y_k$. The other widely used GEV model is the nested logit model.

NESTED LOGIT MODEL

The nested logit model breaks decision making into groups. We assume that at the top level there are $J$ limbs to choose from and the $j$-th limb has $K_j$ branches. The utility of being on limb $j$ and branch $k$ is presented in the same fashion as equation (3.1). The probability of being on limb $j$ and branch $k$, $p_{jk}$, can be factored as,
$$p_{jk} = p_j \cdot p_{k|j}. \quad (3.9)$$
The nested logit model arises when the error terms $\epsilon_{jk}$ from the ARUM model have a GEV distribution function, see equation (3.7). Furthermore, $G(\cdot)$ has the following form,
$$G(Y) = G(Y_{11}, \dots, Y_{1K_1}, \dots, Y_{J1}, \dots, Y_{JK_J}) = \sum_{j=1}^{J}\left(\sum_{k=1}^{K_j} Y_{jk}^{1/\rho_j}\right)^{\rho_j}, \quad (3.10)$$
where the dissimilarity parameter (or scale parameter) $\rho_j = \sqrt{1 - \text{cor}(\epsilon_{jk}, \epsilon_{jl})}$, with $0 \leq \rho_j \leq 1$. The error terms appearing in the nested logit ARUM model are independent for two alternatives which are in different limbs and correlated for alternatives within a limb. If $\rho_j = 1$ for all $j$, the nested logit model is equivalent to the multinomial logit model.
A typical specification for $V_{jk}$ is,
$$V_{jk} = z_j'\alpha + x_{jk}'\beta, \quad j = 1, \dots, J \text{ and } k = 1, \dots, K_j, \quad (3.11)$$
where $z_j$ varies over limbs only and $x_{jk}$ varies over both limbs and branches. The GEV model yields the nested logit model,
$$p_{jk} = p_j \cdot p_{k|j} = \frac{\exp(z_j'\alpha + \rho_j I_j)}{\sum_{m=1}^{J}\exp(z_m'\alpha + \rho_m I_m)} \cdot \frac{\exp\left(x_{jk}'\beta/\rho_j\right)}{\sum_{l=1}^{K_j}\exp\left(x_{jl}'\beta/\rho_j\right)} \quad (3.12)$$
where,
$$I_j = \ln\left[\sum_{l=1}^{K_j}\exp\left(x_{jl}'\beta/\rho_j\right)\right] \quad (3.13)$$
is called the inclusive-sum or log-sum. We have that the IIA property holds for alternatives within a limb but not for alternatives in different limbs. Finally, some drawbacks of the nested logit model are that the tree structure has to be imposed a priori and that $0 \leq \rho_j \leq 1$ is required (which is not always the case in estimation).
COUNT MODELS

In certain applications we would like to explain the number of times a given event occurs. While the outcomes are discrete and ordered, there are two important differences with ordered response outcomes,
(1) They are of cardinal rather than ordinal nature
(2) There is often no natural upper bound to the outcomes
The Poisson regression model specifies that each $y_i$ is drawn from a Poisson population with parameter $\lambda_i$, which is related to the regressors $x_i$. The primary equation of the model is,
$$P[Y = y_i|x_i] = \frac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \dots \quad (3.14)$$
where $\lambda_i = \exp(x_i'\beta)$. We can easily show that,
- $E[y_i|x_i] = \exp(x_i'\beta)$
- $Var[y_i|x_i] = \exp(x_i'\beta)$, which makes the model intrinsically heteroskedastic.
Assuming that the observations on different individuals are independent, we can estimate $\beta$ by maximum likelihood,
Log-likelihood function
$$\ln L(\beta) = \sum_{i=1}^{N}\left[y_i x_i'\beta - \exp(x_i'\beta) - \ln(y_i!)\right] \quad (3.15)$$
First order conditions for the MLE $\hat{\beta}_{ML}$
$$\frac{\partial \ln L}{\partial\beta} = \sum_{i=1}^{N}\left(y_i - \exp(x_i'\beta)\right)x_i = 0 \quad (3.16)$$

The Poisson MLE is consistent under the assumption that the conditional mean is correctly specified. The log-likelihood function is globally concave (because the Hessian is negative definite) and the NR algorithm yields unique parameter estimates. Note that the impact of a marginal change in $x_{ik}$ on the expected value of $y_i$ is given by,
$$\frac{\partial E[y_i|x_i]}{\partial x_{ik}} = \exp(x_i'\beta)\beta_k \quad (3.17)$$

The Poisson model is criticized frequently because of a phenomenon named overdispersion: for count data the variance usually exceeds the mean (which qualitatively has the same consequences as heteroskedasticity in the linear regression model) and the Poisson model does not incorporate this. Overdispersion leads to grossly deflated standard errors and grossly inflated t-statistics. A simple overdispersion test statistic for $H_0: \alpha = 0$ (where $Var[y_i|x_i] = \lambda_i + \alpha g[\lambda_i]$) can be computed by,
(1) Estimating the Poisson model
(2) Constructing fitted values $\hat{\lambda}_i = \exp(x_i'\hat{\beta})$
(3) Running the auxiliary OLS regression
$$\frac{(y_i - \hat{\lambda}_i)^2 - y_i}{\hat{\lambda}_i} = \alpha\frac{g(\hat{\lambda}_i)}{\hat{\lambda}_i} + u_i. \quad (3.18)$$
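A sketch of the Poisson fit plus the auxiliary regression (3.18), taking $g(\lambda) = \lambda^2$ so that the regressor is simply $\hat{\lambda}_i$; the simulated data and the choice of $g$ are assumptions, and statsmodels is one convenient implementation.

```python
import numpy as np
import statsmodels.api as sm

# Poisson regression and the overdispersion test regression of (3.18).
rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(3000, 1)))
lam = np.exp(X @ np.array([0.5, 0.4]))
y = rng.poisson(lam)

lam_hat = sm.Poisson(y, X).fit(disp=False).predict(X)
lhs = ((y - lam_hat)**2 - y) / lam_hat    # dependent variable of (3.18)
rhs = lam_hat                             # g(lam_hat)/lam_hat for g = lam^2
aux = sm.OLS(lhs, rhs).fit()
print(aux.params, aux.tvalues)            # t-test of H0: alpha = 0
```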

We can generalize the Poisson model by introducing an individual unobserved effect into the conditional mean,
$$\tilde{\lambda}_i = \lambda_i u_i \quad (3.19)$$
Then the distribution of $y_i$ conditioned on $x_i$ and $u_i$ remains Poisson with conditional mean and variance $\lambda_i u_i$. The unconditional distribution $f(y_i|x_i)$ is the expected value (over $u_i$) of $f(y_i|x_i, u_i)$,
$$f(y_i|x_i) = \int_0^{\infty}\frac{\exp(-\lambda_i u_i)(\lambda_i u_i)^{y_i}}{y_i!}\,g(u_i)\,du_i \quad (3.20)$$

3.2 TOBIT AND SELECTION MODELS 18/02/2015

CENSORED AND TRUNCATED MODELS

Censoring arises when information on the dependent variable is lost, but not on the regressors.
Example: individuals of all income levels are included in the sample, but incomes below the poverty line are reported at the poverty line.
Censoring from below is given by,
$$y = \begin{cases} y^* & \text{if } y^* > L \\ L & \text{if } y^* \leq L. \end{cases} \quad (3.21)$$
Censoring from above is given by,
$$y = \begin{cases} y^* & \text{if } y^* \leq U \\ U & \text{if } y^* > U. \end{cases} \quad (3.22)$$

Truncation arises when one attempts to make inferences about a larger population from a sample that is drawn from a distinct subpopulation. Some observations on both the dependent variable and the regressors are lost.
Example: studies of income based only on incomes above some poverty line may be of limited usefulness for inference about the whole population.
Truncation from below is given by,
$$y = y^* \quad \text{if } y^* > L \quad (3.23)$$
Truncation from above is given by,
$$y = y^* \quad \text{if } y^* < U \quad (3.24)$$
OLS estimation using truncated or censored data will lead to inconsistent estimation of the slope parameters. However, it can be easily dealt with if the researcher applies a fully parametric approach (this approach relies on strong distributional assumptions). If the conditional density function of $y^*$ given regressors $x$, $f(y^*|x)$, is specified, then the parameters can be consistently and efficiently estimated. Consider ML estimation given censoring from below; then the conditional density of $y$ is given by,
$$f(y|x) = \begin{cases} f(y^*|x) & \text{if } y > L \\ F(L|x) & \text{if } y = L. \end{cases} \quad (3.25)$$
Accordingly, let $d$ be equal to 1 if $y > L$ and 0 if $y = L$. Then,
$$f(y|x) = f(y^*|x)^d\, F(L|x)^{1-d}. \quad (3.26)$$
And for a sample of $N$ independent observations, the censored MLE maximizes,
$$\ln L_N(\theta) = \sum_{i=1}^{N}\left[d_i \ln f(y_i^*|x_i, \theta) + (1 - d_i)\ln F(L|x_i, \theta)\right]. \quad (3.27)$$

For truncation from below at $L$, and suppressing dependence on $x$, the conditional density of the observed $y$ is,
$$f(y|y > L) = \frac{f(y)}{P(y > L)} = \frac{f(y)}{1 - F(L)}. \quad (3.28)$$
The truncated MLE therefore maximizes,
$$\ln L_N(\theta) = \sum_{i=1}^{N}\left[\ln f(y_i|x_i, \theta) - \ln\left(1 - F(L_i|x_i, \theta)\right)\right] \quad (3.29)$$

TOBIT MODEL

The standard tobit model is also called a censored normal regression model and has censoring from below at zero. The latent model is given by $y^* = x'\beta + \epsilon$ (where $\epsilon \sim N(0, \sigma^2)$) and hence, $y^* \sim N(x'\beta, \sigma^2)$. The observation rule is identical to censoring from below, given by equation (3.21) with $L = 0$. The censored density is in the same fashion as equation (3.26), where $F(0) = P[y^* \leq 0] = P[x'\beta + \epsilon \leq 0] = 1 - \Phi\left(\frac{x'\beta}{\sigma}\right)$. The density and log-likelihood function are given by,

Censored density
$$f(y) = \left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y - x'\beta)^2\right)\right]^d\left[1 - \Phi\left(\frac{x'\beta}{\sigma}\right)\right]^{(1-d)} \quad (3.30)$$
Censored MLE
$$\ln L_N(\beta, \sigma^2) = \sum_{i=1}^{N}\left\{d_i\left(-\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2\right) + (1 - d_i)\ln\left(1 - \Phi\left(\frac{x_i'\beta}{\sigma}\right)\right)\right\} \quad (3.31)$$
Truncated MLE
$$\ln L_N(\beta, \sigma^2) = \sum_{i=1}^{N}\left\{-\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 - \ln\Phi\left(\frac{x_i'\beta}{\sigma}\right)\right\}. \quad (3.32)$$
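A sketch of the censored MLE (3.31) coded directly; the simulated data and the log(sigma) parametrization are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Tobit (censored at zero) MLE sketch implementing equation (3.31).
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y_star = X @ np.array([0.5, 1.0]) + rng.normal(scale=1.2, size=2000)
y = np.maximum(y_star, 0.0)
d = (y > 0).astype(float)

def neg_loglik(theta):
    beta, sigma = theta[:-1], np.exp(theta[-1])
    xb = X @ beta
    ll = d * norm.logpdf(y, loc=xb, scale=sigma) \
       + (1 - d) * norm.logcdf(-xb / sigma)      # ln(1 - Phi(xb/sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(res.x[:-1], np.exp(res.x[-1]))             # beta_hat, sigma_hat
```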

The latent-variable, left-censored and left-truncated means are given by,
Latent variable mean
$$E[y^*|x] = x'\beta \quad (3.33)$$
Left-censored mean (censoring at zero)
$$E[y|x] = P[y = 0|x]\cdot E[y|x, y = 0] + P[y > 0|x]\cdot E[y|x, y > 0] = \Phi\left(\frac{x'\beta}{\sigma}\right)\left(x'\beta + \sigma\frac{\phi(x'\beta/\sigma)}{\Phi(x'\beta/\sigma)}\right) \quad (3.34)$$
Left-truncated mean (truncating at zero)
$$E[y|x, y > 0] = E[y^*|x, y^* > 0] = x'\beta + \sigma\frac{\phi(x'\beta/\sigma)}{\Phi(x'\beta/\sigma)}. \quad (3.35)$$

The standard tobit model imposes a restrictive structure: a variable that increases the probability of having a positive outcome also increases the expected value of that outcome, given that it is positive (for example, age on wage). There are two solutions to this problem,
- Two-part model: a model for the censoring mechanism and a model for the outcome, conditional on the outcome being observed
- Sample selection model: joint distribution for the censoring mechanism and outcome
First we investigate the two-part model. Define a binary indicator variable $d = 1$ for participants ($y > 0$) and $d = 0$ for nonparticipants ($y = 0$). Then the two-part model is,
$$f(y|x) = \begin{cases} P[d = 0|x] & \text{if } y = 0 \\ P[d = 1|x]\, f(y|d = 1, x) & \text{if } y > 0. \end{cases} \quad (3.39)$$
Estimation separates into estimation of,

(1) Discrete choice model (e.g. probit or logit) for the participation decision, using all observations
(2) Parameters of the density $f(y|d = 1, x)$, using only observations with $y > 0$.
The sample we observe is not a random sample from a larger population. People with very low potential wages may decide to not work. Thus, the probability of observing a wage depends upon the potential wage (endogenous sample selection). The tobit II model is often used for this problem. It is also known as the sample selection model. The model consists of two equations: a participation equation and an outcome equation. We have,
Participation equation
$$y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \leq 0. \end{cases} \quad (3.40)$$
Outcome equation
$$y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ \text{not observed} & \text{if } y_{1i}^* \leq 0. \end{cases} \quad (3.41)$$

The standard model is specified by,
$$y_{1i}^* = x_{1i}'\beta_1 + \epsilon_{1i} \quad (3.42)$$
$$y_{2i}^* = x_{2i}'\beta_2 + \epsilon_{2i} \quad (3.43)$$
where,
$$\begin{pmatrix}\epsilon_{1i}\\ \epsilon_{2i}\end{pmatrix} \sim N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}\right). \quad (3.44)$$

It can be shown that the conditional expectation of $y_{2i}$ given that $y_{1i} = 1$ is,
$$E[y_{2i}|x, y_{1i} = 1] = x_{2i}'\beta_2 + \sigma_{12}\lambda(x_{1i}'\beta_1). \quad (3.45)$$

The model can be estimated by maximum likelihood or through the two-step Heckman procedure. The two-step procedure is based on the following regression,
$$y_{2i} = x_{2i}'\beta_2 + \sigma_{12}\lambda(x_{1i}'\beta_1) + \eta_i, \quad \text{where } \lambda(x_{1i}'\beta_1) = \frac{\phi(x_{1i}'\beta_1)}{\Phi(x_{1i}'\beta_1)} \quad (3.46)$$

The two steps are,
(1) Estimate the selection equation by standard probit and compute $\hat{\lambda}(x_{1i}'\hat{\beta}_1)$
(2) Regress $y_{2i}$ on $x_{2i}$ and $\hat{\lambda}(x_{1i}'\hat{\beta}_1)$
This is consistent but not efficient, and the standard errors should be adjusted. We prefer variables in $x_{1i}$ that are not in $x_{2i}$ (exclusion restrictions): variables affecting selection but not the outcome $y_{2i}$ directly. Such variables are hard to find. A test for sample selection bias is a test for $\sigma_{12} = 0$. This can be performed by a standard t-test on the coefficient for Heckman's lambda.
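A two-step sketch following (3.46): probit on the selection equation, then OLS of $y_2$ on $x_2$ and the inverse Mills ratio. The simulated data, the exclusion restriction (z appears only in selection), and the use of statsmodels are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Heckman two-step sketch.
rng = np.random.default_rng(8)
n = 5000
x = rng.normal(size=n); z = rng.normal(size=n)
e1 = rng.normal(size=n); e2 = 0.6 * e1 + rng.normal(size=n)  # sigma_12 = 0.6
select = (0.5 + 1.0 * z + e1 > 0)                  # y1 = 1 if y2 is observed
y2 = 1.0 + 0.8 * x + e2                            # outcome equation

X1 = sm.add_constant(np.column_stack([z]))
probit = sm.Probit(select.astype(float), X1).fit(disp=False)
xb1 = X1 @ probit.params
mills = norm.pdf(xb1) / norm.cdf(xb1)              # lambda(x1' b1)

X2 = sm.add_constant(np.column_stack([x, mills]))[select]
step2 = sm.OLS(y2[select], X2).fit()
print(step2.params)   # const, beta_x, and ~0.6 on Heckman's lambda
```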

4 WEEK 9

4.1 SURVIVAL ANALYSIS 24/02/2015 & 25/02/2015

We want to model the length of time spent in a given state (the classification of an individual at a given point in time) before transitioning (moving from one state to another) to another state. The time spent in a state is named the duration or spell length. There are many possible sampling schemes; inference depends both on the duration model and on the sampling scheme,
- Flow sampling: sample those entering the state at a given point in time
- Stock sampling: sample those in the state at a given point in time
- Population sampling: sample the population at a given point in time regardless of state.
Two estimation methods that do not work (and the accompanying reasons),
- OLS regression on duration $t_i$ ($\log(t_i) = x_i'\beta + \epsilon_i$)
  (1) Censoring and truncation (OLS gives inconsistent estimates for these problems)
  (2) Time-varying covariates: multiple $x_{it}$ for each duration $y_i$
- Binary choice model for an indicator equal to 1 if a transition occurred ($y_i = x_i'\beta + \epsilon_i$ where $\epsilon \sim N(0, 1)$)
  (1) Doesn't take into account time at risk
  (2) One throws away a lot of information (the point in time at which an individual left the state)
Duration in a state is a nonnegative, continuous random variable $T$. The probability that the duration is less than $t$ is,
$$F(t) = P[T \leq t] = \int_0^t f(s)\,ds. \quad (4.1)$$
The opposite probability, also named the survivor function $S(t)$, is given by,
$$S(t) = P[T > t] = 1 - F(t). \quad (4.2)$$
The mean completed spell length (duration) is given by,
$$E[T] = \int_0^{\infty} S(u)\,du. \quad (4.3)$$
The hazard function gives the instantaneous probability of leaving a state conditional on survival to time $t$,
$$\lambda(t) = \lim_{\Delta t \downarrow 0}\frac{P[t \leq T < t + \Delta t \mid T \geq t]}{\Delta t} = \frac{f(t)}{S(t)} = -\frac{\partial\ln S(t)}{\partial t}. \quad (4.4)-(4.6)$$
The hazard function specifies the distribution of $T$ (like the c.d.f.),
$$S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right). \quad (4.7)$$

The conditional hazard rate $\lambda(t|x)$ is our object of interest. We can also compute the cumulative hazard function,
$$\Lambda(t) = \int_0^t \lambda(s)\,ds = -\ln S(t). \quad (4.8)-(4.9)$$

Duration is often measured as an interval. The transition process may be discrete or continuous but the data are always observed discretely (grouped data). If the hazard within the interval is assumed constant we have a discrete-time hazard model,
- Discrete-time hazard function: probability of transition at discrete time $t_j$, $j = 1, 2, \dots$, given survival to time $t_j$,
$$\lambda_j = P[T = t_j \mid T \geq t_j] = \frac{f^d(t_j)}{S^d(t_{j-1})} \quad (4.10)$$
- Discrete-time survivor function: a decreasing step function with steps at $t_j$; it can be obtained recursively from the hazard function,
$$S^d(t) = P[T > t] = \prod_{j|t_j \leq t}(1 - \lambda_j) \quad (4.11)$$
- Discrete-time cumulative hazard function,
$$\Lambda^d(t) = \sum_{j|t_j \leq t}\lambda_j \quad (4.12)$$

The choice depends on the length of the interval of data grouping relative to the typical spell length,
- If we measure survival time in days and the typical spell takes months: use continuous time
- If we measure survival time in days and the typical spell takes less than a week: use discrete time
Random censoring: the completed duration $T_i^*$ and the censoring time $C_i$ are independent for all individuals in the sample,
- We observe $T_i^*$ if the spell ends before the censoring time, or $C_i$ if it does not
- We observe whether or not the spell was censored
- Possible reason: termination of the study
- Data given by:
$$T_i = \min[T_i^*, C_i] \quad (4.13)$$
$$\delta_i = 1[T_i^* < C_i] \quad (4.14)$$

Type I censoring: durations censored above a fixed, known censoring time $t_{ci}$,
- Special case of random censoring with $C_i = t_{ci}$
- Induces right-censoring
Type II censoring: observation of $N$ subjects stops after the $p$-th failure,
- Only the durations of the $p$ shortest spells are completely observed
- The remaining $N - p$ spells are censored at $C_i = t_{(p)}$
All standard duration models assume independent (non-informative) censoring,
- The distribution of $C$ is not informative about the distribution of $T$
- The censoring indicator is exogenous
- Uncensored observations are observed with probability
$$P[T = t, \delta = 1] = P[T = t]\cdot P[\delta = 1] \quad (4.15)$$
- Censored observations are observed with probability
$$P[T = t, \delta = 0] = P[T = t]\cdot P[\delta = 0] \quad (4.16)$$

Types of censoring (spell length not known exactly),
- Right-censoring: observe spells from time 0 up to censoring time $c$
- Left-censoring: start date of spell not observed
- Interval-censoring: completed spell length observed, but only up to an interval
Types of truncation (systematic exclusion of survival times from the sample),
- Left truncation: only cases that survive beyond a minimum duration are included in the sample
- Right truncation: only cases that have left the state by a particular date are included in the sample
Empirical survival functions are useful for descriptive purposes, especially if interest is in the hazard at a few key values of the regressors. First we analyze the empirical survival function in the case of no censoring,
$$\hat{S}(t) = \frac{r_t}{N} \quad (4.17)$$
where $r_t$ is the number of spells of duration greater than $t$ and $N$ is the sample size. We present the following terminology,
- $d_j$ = number of spells ending at time $t_j$
- $m_j$ = number of spells censored in $[t_j, t_{j+1})$
- $r_j$ = number of spells at risk at time $t_j$: $r_j = \sum_{l|l \geq j}(d_l + m_l)$ (the sum over all spells with duration greater than or equal to $t_j$)

We find the following formulas for the empirical survival function in the case of independent censoring,
Sample counterpart of the hazard function,
$$\hat{\lambda}_j = \frac{d_j}{r_j} \quad (4.18)$$
Kaplan-Meier estimator of the survival function,
$$\hat{S}(t) = \prod_{j|t_j \leq t}\left(1 - \hat{\lambda}_j\right) = \prod_{j|t_j \leq t}\frac{r_j - d_j}{r_j} \quad (4.19)$$
Nelson-Aalen estimator of the cumulative hazard,
$$\hat{\Lambda}(t) = \sum_{j|t_j \leq t}\hat{\lambda}_j = \sum_{j|t_j \leq t}\frac{d_j}{r_j} \quad (4.20)$$
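A direct sketch of the Kaplan-Meier product of (4.19); the toy durations and censoring indicators are illustrative assumptions.

```python
import numpy as np

# Kaplan-Meier sketch: product over failure times of (r_j - d_j) / r_j.
t = np.array([2, 3, 3, 5, 6, 7, 7, 9])
delta = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # 1 = completed spell

fail_times = np.unique(t[delta == 1])
S = 1.0
for tj in fail_times:
    r_j = np.sum(t >= tj)                    # spells still at risk at t_j
    d_j = np.sum((t == tj) & (delta == 1))   # spells ending at t_j
    S *= (r_j - d_j) / r_j
    print(tj, S)                             # S_hat(t_j)
```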

We now present a number of continuous-time hazard models, see the table below.

Interpretation              Specification of the hazard
                            Parametric                              Semi-parametric
Proportional hazard         Exponential, Weibull,                   Cox model,
                            Generalized Weibull, Gompertz           Piece-wise constant exponential
Accelerated failure time    Exponential, Weibull, Log-logistic

We specify the hazard (the distribution of duration $T$) so that it consists of two factors,
$$\lambda(t|x) = \underbrace{\lambda_0(t, \alpha)}_{\text{function of } t \text{ alone}} \times \underbrace{\phi(x, \beta)}_{\text{function of } x \text{ alone}} \quad (4.21)$$
where $\lambda_0(t)$ is the baseline hazard. It is mostly characterized as,
- a polynomial baseline hazard, or
- a piecewise constant hazard: a step function with $k$ segments.
And usually $\phi(x, \beta) = \exp(x'\beta)$, to ensure that $\lambda(t|x) > 0$.
When we specify a parametric distribution for $\lambda(t|x)$ we can estimate the parameters by ML,
Weibull distribution
$$\lambda(t|x) = \underbrace{\alpha t^{\alpha-1}}_{\text{baseline hazard}}\,\underbrace{\exp(x'\beta)}_{\phi}, \quad \alpha > 0. \quad (4.22)$$
Monotonically increasing if $\alpha > 1$ and decreasing if $\alpha < 1$.
Gompertz distribution
$$\lambda(t|x) = \underbrace{\exp(\alpha t)}_{\text{baseline hazard}}\,\underbrace{\exp(x'\beta)}_{\phi}. \quad (4.23)$$
Monotonically increasing if $\alpha > 0$, decreasing if $\alpha < 0$ and exponential (constant) if $\alpha = 0$.
Generalized Weibull distribution (allows for a non-monotonic hazard)
$$\lambda(t|x) = \underbrace{\alpha t^{\alpha-1}}_{\text{baseline hazard}}\,\underbrace{S(t)^{-\mu}}_{\text{survivor function}}\,\underbrace{\exp(x'\beta)}_{\phi} \quad (4.24)$$
where $S(t) = (1 - \mu t^{\alpha})^{1/\mu}$. Increasing in $t$ if $\mu > 0$ & $\alpha > 1$, decreasing in $t$ if $\mu < 0$ & $\alpha \leq 1$, and u-shaped if $\mu < 0$ & $\alpha > 1$.
Censoring yields a likelihood that is similar to the tobit,
- Uncensored observations: contribution $f(t|x, \theta)$
- Right-censored observations: contribution $P[T > t] = \int_t^{\infty} f(u|x, \theta)\,du = 1 - F(t|x, \theta) = S(t|x, \theta)$
Then we find,
Density for observation $i$
$$L(\theta) = f(t|x, \theta)^{\delta_i}\, S(t|x, \theta)^{(1-\delta_i)}, \quad \text{where } \delta_i \text{ is a binary variable concerning censoring} \quad (4.25)$$
Log-likelihood
$$\ln L(\theta) = \sum_{i=1}^{N}\Big[\underbrace{\delta_i \ln f(t|x, \theta)}_{\text{completed spells}} + \underbrace{(1 - \delta_i)\ln S(t|x, \theta)}_{\text{right-censored spells}}\Big] \quad (4.26)$$
Asymptotic distribution of the MLE $\hat{\theta}_{ML}$
$$\hat{\theta}_{ML} \overset{a}{\sim} N\left(\theta,\; -E\left[\frac{\partial^2\mathcal{L}}{\partial\theta\,\partial\theta'}\right]^{-1}\right) \quad (4.27)$$
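A sketch of the censored log-likelihood (4.26) for a Weibull hazard without covariates, using $\ln f = \ln\lambda(t) + \ln S(t)$; the parametrization $\lambda(t) = \gamma\alpha t^{\alpha-1}$ and the simulated data are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Censored Weibull MLE sketch: S(t) = exp(-gamma t^alpha).
rng = np.random.default_rng(9)
alpha0, gamma0 = 1.5, 0.2
t_star = rng.weibull(alpha0, size=3000) / gamma0**(1 / alpha0)
c = rng.uniform(0, 5, size=3000)             # random censoring times
t, delta = np.minimum(t_star, c), (t_star < c).astype(float)

def neg_loglik(par):
    a, g = np.exp(par)                       # keep alpha, gamma positive
    ln_haz = np.log(g * a) + (a - 1) * np.log(t)
    ln_S = -g * t**a
    return -np.sum(delta * ln_haz + ln_S)    # (4.26): ln f = ln lam + ln S

res = minimize(neg_loglik, x0=np.zeros(2))
print(np.exp(res.x))                         # close to (1.5, 0.2)
```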

When interpreting the results we ask ourselves two questions.
Shape of the baseline hazard: how does the hazard change with time?
- Positive vs. negative duration dependence
- For the Weibull model: $\alpha < 1$ or $\alpha > 1$
Estimated coefficients on the regressors: how does the hazard vary with the regressors?
- Absolute differences in regressors imply proportionate differences in the hazard
- Regressors have a multiplicative effect on the hazard
- The hazard is scaled by the same amount for every $t$
- $\beta_j > 0$ means that an increase in $x_j$ leads to an increase in the hazard of failure (and a decrease in expected duration)

When modelling accelerated failure time (AFT) models we model $\ln(t)$ instead of $t$,
$$\ln t = x'\beta + u \quad (4.28)$$
Different distributions for $u$ lead to different AFT models, for example lognormal, log-logistic or generalized Gamma distributions. Why accelerated failure time?
- $t = \exp(x'\beta)\,\nu$ where $\nu = \exp(u)$
- Hazard rate: $\lambda(t|x) = \lambda_0(\nu)\exp(-x'\beta)$
- Substitute $\nu = t\exp(-x'\beta)$:
$$\lambda(t|x) = \lambda_0\left(t\exp(-x'\beta)\right)\exp(-x'\beta) \quad (4.29)$$
This is an acceleration of the baseline hazard $\lambda_0(t)$ if $\exp(-x'\beta) > 1$ and a deceleration of the baseline hazard $\lambda_0(t)$ if $\exp(-x'\beta) < 1$. We can interpret the parameter estimates as follows: a unit change in a regressor is associated with a proportionate change in survival time (ceteris paribus).
Estimates of the proportional hazard model (Weibull) can be used to compute other quantities, for example,
Baseline hazard
- Ratio of the hazard rate at survival time $t$ to the hazard at time $u$, given the same covariates: $\frac{\lambda(t|x)}{\lambda(u|x)} = \left(\frac{t}{u}\right)^{\alpha-1}$
Covariates
- Elasticity of the hazard w.r.t. a one unit increase in the $k$-th explanatory variable: $\eta_{ik} = \beta_k x_{ik}$
- If the explanatory variable is in logs: $\beta_k$ is the elasticity of the hazard
- Hazard ratios at a given survival time are related to absolute differences in characteristics: $\frac{\lambda(t, x_1)}{\lambda(t, x_2)} = \exp\left((x_1 - x_2)'\beta\right)$.
Fully parametric models are simple to estimate (even if there is censoring). However, estimates are inconsistent if either part is misspecified (the baseline hazard or the part that depends on covariates). A solution to this problem is the Cox PH semiparametric model. It has the following features,
- Hazard: $\lambda(t|x, \beta) = \lambda_0(t)\,\phi(x, \beta)$
- $\lambda_0$ is not specified
- $\phi$ is fully specified
- Often $\phi(x, \beta) = \exp(x'\beta)$
We categorize observations as those who die or are at risk at each failure time. This gives us the following categories,
- $R(t_j) = \{l : t_l \geq t_j\}$ = set of spells at risk at $t_j$ (risk set)
- $D(t_j) = \{l : t_l = t_j\}$ = set of spells completed at $t_j$
- $d_j = \sum_l 1[t_l = t_j]$ = number of spells completed at $t_j$
The risk set at $t_j$ includes all spells that are not yet completed or censored. The probability of a particular at-risk spell ending at time $t_j$ is,
$$P[T_j = t_j \mid j \in R(t_j)] = \frac{\phi(x_j, \beta)}{\sum_{l \in R(t_j)}\phi(x_l, \beta)} \quad (4.30)$$
The baseline hazard $\lambda_0(t)$ drops out because of the PH assumption. The likelihood for the Cox PH model is given by,


Partial likelihood function (product of $P[T_j = t_j \mid j \in R(t_j)]$ over the $k$ ordered failure times)
$$L_p(\beta) = \prod_{j=1}^{k}\frac{\prod_{m \in D(t_j)}\phi(x_m, \beta)}{\left[\sum_{l \in R(t_j)}\phi(x_l, \beta)\right]^{d_j}} \quad (4.31)$$
Partial log-likelihood function
$$\ln L_p = \sum_{j=1}^{k}\left[\sum_{m \in D(t_j)}\ln\phi(x_m, \beta) - d_j\ln\left(\sum_{l \in R(t_j)}\phi(x_l, \beta)\right)\right] \quad (4.32)$$
Asymptotic distribution of the MLE $\hat{\beta}_{ML}$
$$\hat{\beta}_{ML} \overset{a}{\sim} N\left(\beta,\; \left(-E\left[\frac{\partial^2\ln L_p(\beta)}{\partial\beta\,\partial\beta'}\right]\right)^{-1}\right) \quad (4.33)$$

Given $\hat{\beta}$ we can retrieve a nonparametric estimate of the baseline hazard or survival function,
Write the survivor function as
$$S(t|x, \beta) = S_0(t)^{\phi(x, \beta)} \quad (4.34)$$
The estimated baseline survival function is a step function over the observed failure times,
$$\hat{S}_0(t) = \prod_{j|t_j \leq t}\hat{\alpha}_j \quad (4.35)$$
with one estimated factor $\hat{\alpha}_j$ per failure time. Furthermore, changes in the regressors have a multiplicative effect on the hazard,
$$\frac{\partial\lambda(t|x, \beta)}{\partial x} = \lambda(t|x, \beta)\,\beta \quad (4.36)$$
We can interpret the estimates without knowing the baseline hazard $\lambda_0(t)$.
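A sketch of maximizing the partial log-likelihood (4.32) with $\phi(x, \beta) = \exp(x'\beta)$, assuming no tied failure times; the simulated data are an assumption.

```python
import numpy as np
from scipy.optimize import minimize

# Cox partial-likelihood sketch.
rng = np.random.default_rng(10)
n = 500
x = rng.normal(size=(n, 2))
t = rng.exponential(1 / np.exp(x @ np.array([0.7, -0.4])))  # PH durations
delta = rng.uniform(size=n) < 0.8          # ~20% treated as censored

def neg_plik(b):
    xb = x @ b
    ll = 0.0
    for j in np.where(delta)[0]:           # completed spells only
        risk = t >= t[j]                   # risk set R(t_j)
        ll += xb[j] - np.log(np.sum(np.exp(xb[risk])))
    return -ll

res = minimize(neg_plik, x0=np.zeros(2), method="BFGS")
print(res.x)   # close to (0.7, -0.4)
```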


Up to now we assumed covariates that are time-invariant, as in cross-section data. However, covariates may take different values over the spell. There are two problems with time-varying variables,
(1) Treating a time-varying covariate as fixed is a misspecification
(2) A time-varying covariate may exhibit feedback: it is not strictly exogenous
The maintained assumption is weak exogeneity,
- The process that leads to time variation in the covariates is not taken into account in estimating the hazard function
- External time variation
Fully parametric models make strong assumptions about the shape of the baseline hazard, while the Cox model makes no assumptions about it. The in-between solution is given by the exponential model with a piecewise constant hazard.

5 WEEK 10

5.1 LINEAR PANEL DATA AND DIFF-IN-DIFF 03/03/2015

LINEAR PANEL DATA

Panel data: we observe the same units repeatedly over time,
$$y_{it} = x_{it}'\beta + \epsilon_{it} \quad (5.1)$$
where $i = 1, \dots, n$ indexes units and $t = 1, \dots, T$ indexes time periods (the panel is balanced if $T_i$ is the same for all $i$, unbalanced otherwise). Units are i.i.d., but observations for a given unit are not independent: $E(\epsilon_{it}\epsilon_{is}) \neq 0$ (even if $t \neq s$). Different assumptions on the relationship between $x_{it}$ and $\epsilon_{it}$ lead to different estimators. Note that in panel data, $y_i$ is a $T \times 1$ vector and $X_i$ is a $T \times K$ matrix, where $T$ stands for the number of time periods of the panel and $K$ for the number of regressors per time period. There are a number of advantages to using panel data,
- Efficiency gains (smaller standard errors and lower variance because of dependence)
- Excluding endogenous regressors (transformations of the original regressors are often exogenous)
- Robustness to omitted variables (using only variation within units over time removes all unobserved time-constant variables)
- Identification of individual dynamics (individuals who experienced an event in the past are more likely to experience that event in the future)
  - Spurious state dependence: individuals differ in unobservable ways that influence the likelihood of events (but are not influenced by the event)
  - True state dependence: the event changes preferences/constraints etc. such that the person becomes more likely to experience the event again
We can decompose the error term of equation (5.1) into two parts,
$$\epsilon_{it} = c_i + u_{it} \quad (5.2)$$
where,
- $c_i$ is an individual effect: fixed across repeated observations for individual $i$ (captures unobserved heterogeneity)
- $u_{it}$ is an idiosyncratic shock ($u_{it} \sim$ i.i.d.$(0, \sigma_u^2)$ across observations)
Different estimators result from different assumptions on $c_i$ and $u_{it}$,

Estimator           Assumptions on $c_i$               Assumptions on $u_{it}$
Pooled OLS          $E(x_{it}c_i) = 0$                 $E(x_{it}u_{it}) = 0$
Fixed effects       $E(x_{it}c_i) \neq 0$ allowed      $E(x_{it}u_{is}) = 0$ for all $s, t$
First differences   $E(x_{it}c_i) \neq 0$ allowed      $E\left((x_{it} - x_{i,t-1})(u_{it} - u_{i,t-1})\right) = 0$
Random effects      $E(x_{it}c_i) = 0$                 $E(x_{it}u_{is}) = 0$ for all $s, t$

Pooled OLS
OLS is consistent (for $n \to \infty$ or $T \to \infty$) if $E(x_{it}(c_i + u_{it})) = 0$. This condition is satisfied if $E(x_{it}c_i) = 0$ (lagged dependent variables violate this assumption) and $E(x_{it}u_{it}) = 0$ (weak exogeneity). OLS is not efficient if $\sigma_c^2 \neq 0$ (autocorrelation is present in the composite error term). The usual standard errors are incorrect; hence one should use clustered standard errors.
The Least Squares Dummy Variable (LSDV) estimator
Linear regression model with individual-specific intercepts,
$$y_{it} = c_i + x_{it}'\beta + u_{it} \quad (5.3)$$
where $x_{it}$ does not contain an intercept and $u_{it} \sim$ i.i.d.$(0, \sigma_u^2)$. All $x_{it}$ are usually assumed to be independent of all $u_{it}$. $c = (c_1, \dots, c_n)'$ and $\beta$ can be estimated by OLS (though this is numerically difficult when $n$ is large).
The Fixed Effects (FE) estimator
We estimate an equation similar to (5.3). However, there are two assumptions,
- Allows for $E(x_{it}c_i) \neq 0$: correlation between regressors and individual effects
- Assumes strict exogeneity: $E(x_{it}u_{is}) = 0$ for all $s, t$ (rules out lagged dependent variables and feedback loops)
The model is given by equation (5.3); we subtract the individual means and accordingly find,
$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (u_{it} - \bar{u}_i) \quad (5.4)$$
The within transformation eliminates the individual effects ($c_i$), and accordingly we can use the OLS estimator on the transformed model, which yields the within estimator or fixed effects estimator. Consistency requires no perfect multicollinearity, strict exogeneity, and that $(X_i, y_i)$ are i.i.d. random variables. The asymptotic distribution of the estimator is given by,
$$\sqrt{N}\left(\hat{\beta}_{FE} - \beta\right) \xrightarrow{d} N\left(0, \text{Avar}(\hat{\beta}_{FE})\right) \quad (5.5)$$
Furthermore, the parameter $\sigma_u^2$ can be estimated from the within residuals,
$$\hat{u}_{it} = (y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i)'\hat{\beta}_{FE} \quad (5.6)$$
If $u_{it}$ is not white noise, standard errors will not be correct; consequently, use panel-robust standard errors.
We have seen two ways of getting the FE estimator,

(1) Least Squares Dummy Variables: include a dummy for each individual
(2) Estimation using transformed data: variables in deviations from individual means
There is a third way, namely the Mundlak procedure. It implies that the individual means are included as additional covariates; the coefficients on the means are not interpreted, and the coefficients on the variables themselves are the FE estimates. A few final formulas concerning the FE estimator:
Estimator for $c_i$
$$\hat{c}_i = \bar{y}_i - \bar{x}_i'\hat{\beta}_{FE} \quad (5.7)$$
Within $R^2$
$$R^2_{within}(\hat{\beta}_{FE}) = \text{Corr}^2\left[(x_{it} - \bar{x}_i)'\hat{\beta}_{FE},\; (y_{it} - \bar{y}_i)\right] \quad (5.8)$$
Between estimator $\hat{\beta}_B$
$$\text{OLS of } \bar{y}_i = \bar{x}_i'\beta + \bar{\epsilon}_i \quad (5.9)$$
Between $R^2$
$$R^2_{between}(\hat{\beta}_B) = \text{Corr}^2\left[\bar{x}_i'\hat{\beta}_B,\; \bar{y}_i\right] \quad (5.10)$$
Overall $R^2$
$$R^2_{overall}(\hat{\beta}_{OLS}) = \text{Corr}^2\left[x_{it}'\hat{\beta}_{OLS},\; y_{it}\right] \quad (5.11)$$
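A sketch of the within transformation (5.4) and the resulting FE estimate; the simulated panel (one regressor, correlated with $c_i$) is an illustrative assumption.

```python
import numpy as np

# Within (FE) estimator sketch: demean y and x by unit, then OLS.
rng = np.random.default_rng(11)
n, T = 300, 5
c = rng.normal(size=n)                          # individual effects
x = rng.normal(size=(n, T)) + 0.5 * c[:, None]  # x correlated with c
y = 2.0 * x + c[:, None] + rng.normal(size=(n, T))

x_w = (x - x.mean(axis=1, keepdims=True)).ravel()
y_w = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (x_w @ y_w) / (x_w @ x_w)             # OLS on demeaned data
print(beta_fe)                                   # close to 2.0
```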
First-Difference (FD) estimator
We start from equation (5.3) and take first differences, which gives us,
$$y_{it} - y_{i,t-1} = (x_{it} - x_{i,t-1})'\beta + (u_{it} - u_{i,t-1}) \quad (5.12)$$
Applying OLS to the differenced equation yields the first-difference estimator $\hat{\beta}_{FD}$. Consistency requires that the first-differenced regressors and error terms are uncorrelated (weaker than strict exogeneity). The FD and FE estimators are similar if strict exogeneity holds (and identical when $T = 2$). If FD and FE are dramatically different, two explanations are possible:
(1) Some RHS variables violate strict exogeneity
(2) The model is incorrectly specified (important time-varying variables missing)
Random effects (RE) estimator
Consider, one last time, the linear error-components model given in equation (5.1), where the error term is specified as in (5.2). Now assume that both $c_i$ and $u_{it}$ are white noise and independent. Assume strict exogeneity and that the regressors are uncorrelated with the individual effects. Under these assumptions pooled OLS, FE and FD are all consistent. However, none of them is efficient: OLS disregards the correlation between error terms of a given unit, while FE and FD both disregard all variation between units. The correlation between two error terms at times $t$ and $s$ is given by,
$$\text{Corr}[\epsilon_{it}, \epsilon_{is}] = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_u^2} \quad (5.13)$$

If $\sigma_c^2$ and $\sigma_u^2$ are known: use GLS to derive the BLUE estimator. We can write the GLS estimator for $\beta$ as,
$$\hat{\beta}_{GLS} = \left(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)' + \psi T\sum_{i=1}^{N}(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})'\right)^{-1}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(y_{it} - \bar{y}_i) + \psi T\sum_{i=1}^{N}(\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y})\right) \quad (5.14)$$
where $\psi = \frac{\sigma_u^2}{T\sigma_c^2 + \sigma_u^2}$ and $\bar{x}$ is the overall average of $x_{it}$.
- For $\psi = 0$: $\hat{\beta}_{GLS} = \hat{\beta}_{FE}$
- For $\psi = 1$ ($\sigma_c^2 = 0$): $\hat{\beta}_{GLS} = \hat{\beta}_{OLS}$.
The GLS estimator is the optimal matrix-weighted average of the between and within estimators. If $\sigma_c^2$ and $\sigma_u^2$ are known, we can equivalently perform OLS on the transformed model,
$$(y_{it} - \vartheta\bar{y}_i) = (x_{it} - \vartheta\bar{x}_i)'\beta + \nu_{it} \quad (5.15)$$
where $\vartheta = 1 - \sqrt{\psi}$. The error term $\nu_{it} = \epsilon_{it} - \vartheta\bar{\epsilon}_i = (1 - \vartheta)c_i + u_{it} - \vartheta\bar{u}_i$ is i.i.d. over individuals and time. If $\sigma_c^2$ and $\sigma_u^2$ are not known, one can follow these steps,
(1) Obtain $\hat{\sigma}_u^2$ from the within regression, see equation (5.6)
(2) Given $\hat{\sigma}_u^2$, obtain $\hat{\sigma}_c^2$ from the between regression,
$$\bar{y}_i = \bar{x}_i'\beta + \bar{\epsilon}_i \quad (5.16)$$
where $\bar{\epsilon}_i = c_i + \bar{u}_i$: the between regression yields $\hat{\sigma}_{\bar{\epsilon}}^2$.
(3) Since $\sigma_{\bar{\epsilon}}^2 = \sigma_c^2 + \frac{\sigma_u^2}{T}$, we estimate $\sigma_c^2$ as follows,
$$\hat{\sigma}_c^2 = \hat{\sigma}_{\bar{\epsilon}}^2 - \frac{\hat{\sigma}_u^2}{T} \quad (5.17)$$
(4) Use $\hat{\sigma}_u^2$ and $\hat{\sigma}_c^2$ to compute $\hat{\vartheta}$
(5) Perform FGLS using $\hat{\vartheta}$ to obtain an estimate for $\beta$,
$$(y_{it} - \hat{\vartheta}\bar{y}_i) = (x_{it} - \hat{\vartheta}\bar{x}_i)'\beta + \nu_{it} \quad (5.18)$$
where $\hat{\vartheta} = 1 - \sqrt{\frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + T\hat{\sigma}_c^2}}$

Contrary to FE, time-constant variables can be included in the model, and RE is consistent for either $T$ or $N$ going to infinity.
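A sketch of FGLS steps (1)-(5) above, as shown; the simulated panel (one regressor, $c_i$ uncorrelated with $x$) and the simple variance estimators are illustrative assumptions.

```python
import numpy as np

# Random-effects FGLS sketch: within, between, then quasi-demeaned OLS.
rng = np.random.default_rng(12)
n, T = 500, 6
c = rng.normal(scale=1.0, size=n)
x = rng.normal(size=(n, T))                    # uncorrelated with c (RE case)
y = 1.5 * x + c[:, None] + rng.normal(scale=0.8, size=(n, T))

xw = x - x.mean(axis=1, keepdims=True); yw = y - y.mean(axis=1, keepdims=True)
b_fe = (xw.ravel() @ yw.ravel()) / (xw.ravel() @ xw.ravel())
s2_u = np.sum((yw - b_fe * xw)**2) / (n * (T - 1) - 1)      # step (1)
xb, yb = x.mean(axis=1), y.mean(axis=1)
b_b = np.polyfit(xb, yb, 1)[0]                              # between slope
s2_eps = np.var(yb - b_b * xb, ddof=2)                      # step (2)
s2_c = s2_eps - s2_u / T                                    # step (3)
theta = 1 - np.sqrt(s2_u / (s2_u + T * s2_c))               # step (4)
xq = (x - theta * x.mean(axis=1, keepdims=True)).ravel()
yq = (y - theta * y.mean(axis=1, keepdims=True)).ravel()
print((xq @ yq) / (xq @ xq))                                # step (5), ~1.5
```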

ALL MATERIAL TREATED UNTIL DIFF-IN-DIFF

