Summary SC Microeconometrics
Tomas Geurts
April 8, 2015
1 Week 6

1.1 Maximum likelihood estimation
The likelihood principle is to choose as estimator of the parameter vector θ that value that maximizes the likelihood of observing the actual sample. The likelihood function, L_N(θ|y, X), is given by the conditional density, f(y|X, θ), viewed as a function of θ given the data (y, X). For cross-section data, the observations (y_i, x_i) are independent over i and then,

L_N(θ|y, X) = ∏_{i=1}^N f(y_i|x_i, θ).   (1.1)

The log-likelihood function is,

ln L_N(θ) = ∑_{i=1}^N ln f(y_i|x_i, θ).   (1.2)
We equate the derivative of the log-likelihood function with respect to the parameter vector, the score vector s(θ) = ∂ln L_N(θ)/∂θ, to zero and find the MLE, θ̂_ML.

√N (θ̂_ML − θ₀) →d N(0, A₀⁻¹),  where A₀ = −plim (1/N) ∂²ln L_N(θ)/∂θ∂θ′ |_{θ₀}   (1.3)

θ̂_ML ≈ N( θ₀, [−E( ∂²ln L_N(θ)/∂θ∂θ′ )|_{θ₀}]⁻¹ ).   (1.4)

Under regularity, the maximum likelihood estimator has the following asymptotic properties,
- Consistency, see equation (1.3)
- Asymptotic normality, see equation (1.4)
- Asymptotic efficiency (the MLE achieves the Cramér-Rao lower bound for consistent estimators), see equation (1.3)
GMM
The idea of Method of Moments is to replace expectations by their sample counterparts and to define estimators
by solving the resulting system in terms of parameters. For example, if we want to estimate the parameters of a
lognormal distribution by means of MM, we are given,

- Probability density function of X:  f(x) = (1/(xσ√(2π))) exp( −(ln x − μ)²/(2σ²) )
- Raw moment equation of X:  E(X^r) = exp( rμ + ½r²σ² )
The idea of MM is to replace expectations by their sample counterparts. Hence the MM estimator solves (and consequently we present the solution),

m₁ = (1/N) ∑_{i=1}^N x_i = exp(μ + ½σ²),   m₂ = (1/N) ∑_{i=1}^N x_i² = exp(2μ + 2σ²)

with solution,

μ̂ = 2 ln m₁ − ½ ln m₂   (1.5)
σ̂² = −2 ln m₁ + ln m₂   (1.6)
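The lognormal MM recipe can be sketched numerically. In the sketch below the true values μ = 1 and σ² = 0.25 are illustrative choices (not taken from the text); the closed-form solution (1.5)-(1.6) is applied to simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated lognormal sample; mu = 1.0 and sigma = 0.5 are illustrative
x = rng.lognormal(mean=1.0, sigma=0.5, size=200_000)

# Sample counterparts of the first two raw moments
m1 = x.mean()
m2 = (x ** 2).mean()

# MM estimators from the closed-form solution (1.5)-(1.6)
mu_hat = 2 * np.log(m1) - 0.5 * np.log(m2)
sigma2_hat = -2 * np.log(m1) + np.log(m2)
```

With a sample this large, the estimates land close to the true μ = 1 and σ² = 0.25.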
If r = q the model is said to be just identified and the MM estimator θ̂_MM is the solution to,

(1/N) ∑_{i=1}^N h(w_i, θ̂) = 0.   (1.7)
If r > q the model is said to be overidentified and there is no exact solution. The generalized method of moments estimator θ̂_GMM minimizes,

Q_N(θ) = [ (1/N) ∑_{i=1}^N h(w_i, θ) ]′ W_N [ (1/N) ∑_{i=1}^N h(w_i, θ) ]   (1.8)
where the r × r weighting matrix W_N is symmetric positive definite and has finite probability limit W₀. Different choices of the weighting matrix W_N lead to different estimators that, although consistent, have different variances. Application of GMM requires specification of the moment function h(·) and the weighting matrix W_N. The asymptotic distribution is,

√N (θ̂_GMM − θ₀) →d N(0, V)   (1.9)

where, if observations are i.i.d.,

V = (G₀′ W₀ G₀)⁻¹ (G₀′ W₀ S₀ W₀ G₀) (G₀′ W₀ G₀)⁻¹
G₀ = E[ ∂h/∂θ′ |_{θ₀} ]
S₀ = E[ h h′ |_{θ₀} ]

The optimal GMM estimator usually has the smallest asymptotic variance given a specified moment function h(·). For overidentified models the most efficient GMM estimator is obtained by choosing W_N = Ŝ⁻¹, a consistent estimate of S₀⁻¹. Intuitively, the best moment conditions have a large G₀ (the moment condition is strongly violated if θ ≠ θ₀, so the moment is very informative on the true value θ₀) and a small S₀ (the sample variation of the moment (noise) is small).
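A minimal numerical sketch of objective (1.8), again for the lognormal example: three moment conditions for two parameters (so r > q), minimized with an identity weighting matrix. The particular moment choices and starting values are illustrative assumptions; a second step would replace the identity matrix by Ŝ⁻¹ as described above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.lognormal(mean=1.0, sigma=0.5, size=100_000)
lx = np.log(x)

def g_bar(theta):
    """Sample moment vector (1/N) sum_i h(w_i, theta): three conditions
    for two parameters (mu, sigma2), so the model is overidentified."""
    mu, s2 = theta
    return np.array([lx.mean() - mu,
                     (lx ** 2).mean() - (s2 + mu ** 2),
                     x.mean() - np.exp(mu + 0.5 * s2)])

def Q(theta, W):
    g = g_bar(theta)
    return g @ W @ g  # GMM objective (1.8)

# One-step GMM with identity weighting matrix
res = minimize(Q, x0=[0.5, 0.5], args=(np.eye(3),), method="Nelder-Mead")
mu_hat, s2_hat = res.x
```

Any positive definite W_N gives a consistent estimator here; only the efficiency differs across choices of W_N.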
1.2 Numerical optimization

The estimators above generally have no closed-form solution, so the maximum of the objective function Q_N(θ) must be found iteratively. Ideally the iterations converge to the global maximum,

θ̂ = argmax_θ Q_N(θ)   (1.10)

but in general this cannot be guaranteed, since an iterative method might also end up at a local maximum instead of the global maximum.
Most iterative methods are gradient methods that change θ̂_s in a direction determined by the gradient. The update formula is given by,

θ̂_{s+1} = θ̂_s + A_s g_s,  s = 1, ..., S   (1.11)

where A_s is a q × q matrix that depends on θ̂_s and g_s = ∂Q_N(θ)/∂θ |_{θ̂_s} is the q × 1 gradient vector evaluated at θ̂_s.

The Newton-Raphson (NR) method is a popular gradient method, especially effective if the objective function is globally concave in θ. The updating rule is given by,

θ̂_{s+1} = θ̂_s − H_s⁻¹ g_s,  s = 1, ..., S   (1.12)
where,

g_s = ∂Q_N(θ)/∂θ |_{θ̂_s},   H_s = ∂²Q_N(θ)/∂θ∂θ′ |_{θ̂_s}

The basis for the NR method is a Taylor series approximation: we take a second-order Taylor series expansion of Q_N(θ) around θ̂_s, take its first derivative and equate it to zero. Solving for θ gives the updating rule of the NR method. We find that Q_N(θ) always increases between iterations if H_s is negative definite. The NR method therefore works best for maximization problems where the objective function is globally concave, so that H_s is always negative definite (if the objective function is not globally concave, always try a variety of starting values).
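A scalar sketch of the NR updating rule (1.12), applied to a globally concave toy objective Q(θ) = ln θ − θ (an illustration, not an objective from the course):

```python
def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Iterate the NR rule (1.12): theta_{s+1} = theta_s - H_s^{-1} g_s."""
    theta = theta0
    for _ in range(max_iter):
        g, H = grad(theta), hess(theta)
        step = -g / H
        theta += step
        if abs(step) < tol:  # stop on a small change in the estimate
            break
    return theta

# Q(theta) = ln(theta) - theta is globally concave with maximum at theta = 1
grad = lambda t: 1.0 / t - 1.0
hess = lambda t: -1.0 / t ** 2
theta_hat = newton_raphson(grad, hess, theta0=0.5)
```

Because the Hessian is negative everywhere, every NR step increases the objective and the iterations converge quickly.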
Ideally, the iterative procedure should terminate when the gradient is zero. In practice this will not be possible, primarily because of accumulated rounding error in the computation of the function and its derivatives. Hence there are a few alternative stopping criteria,
(1) A small relative change in the objective function Q_N(θ̂_s)
(2) A small change in the gradient vector g_s relative to the Hessian
(3) A small relative change in the parameter estimates θ̂_s

We can use either analytical derivatives (pro: more precise and quicker to compute; con: more coding time) or numerical derivatives (pro: no coding beyond providing the objective function; con: the derivatives have to be computed many times). The latter are computed using,
∂Q_N(θ̂_s)/∂θ_j ≈ [ Q_N(θ̂_s + h e_j) − Q_N(θ̂_s − h e_j) ] / (2h),  j = 1, ..., q   (1.13)

where h is small and e_j = (0, ..., 0, 1, 0, ..., 0)′ is a vector with unity in the j-th row and zeros elsewhere.
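Equation (1.13) translates directly into code; checking the result against an analytical gradient is a useful habit:

```python
import numpy as np

def num_grad(Q, theta, h=1e-6):
    """Central-difference gradient (1.13), one entry per parameter."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e_j = np.zeros_like(theta)
        e_j[j] = 1.0  # unity in the j-th row, zeros elsewhere
        g[j] = (Q(theta + h * e_j) - Q(theta - h * e_j)) / (2 * h)
    return g

# Check against the analytical gradient of Q(theta) = -theta_1^2 - 2*theta_2^2,
# which equals (-2, -4) at theta = (1, 1)
Q = lambda t: -t[0] ** 2 - 2 * t[1] ** 2
g = num_grad(Q, [1.0, 1.0])
```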
A common modification of the NR method is the method of scoring (MS), which substitutes H_s with its expectation, E[H_s] = −I(θ), minus the information matrix.

There might be situations where it is not possible to obtain an estimate of the parameters. In that case,
- Always check descriptive statistics and look for anomalies
- Rescale the data so that the regressors have similar means and variances (e.g. use income in thousands of euros rather than in euros)
- Check for multicollinearity
- Try different starting values
- For dummy variables, check whether there is enough variability in the data
2 Week 7

2.1 Binary choice models
Binary choice models explain a choice with a binary outcome (y = 1 or y = 0). These models boil down to specifying the probability that y = 1 as a function of x.

We can use a linear probability model (y_i = x_i′β + ε_i) to model choices (E(y_i|x_i) = P(y_i = 1|x_i) = x_i′β), but the fitted probabilities can fall outside [0, 1] and the error term suffers from heteroskedasticity.
A binary choice model describes the probability of a certain choice as follows,

P(y_i = 1|x_i) = F(x_i′β)   (2.1)

Probit model

F(x_i′β) = Φ(x_i′β) = ∫_{−∞}^{x_i′β} (1/√(2π)) exp(−½t²) dt   (2.2)

Logit model

F(x_i′β) = Λ(x_i′β) = e^{x_i′β} / (1 + e^{x_i′β})   (2.3)
The standard normal and standard logistic distribution do not differ much; they have the same mean (E(X) = 0) but their variances differ (σ²_normal = 1 vs. σ²_logistic = π²/3). Hence the estimates from a logit model are roughly a factor 1.8 larger.
Frequently the marginal effects are utilized to interpret the coefficients. For each model we find the following marginal effect for x_ik,

- Probit: ∂p_i/∂x_ik = φ(x_i′β) β_k
- Logit: ∂p_i/∂x_ik = Λ(x_i′β)(1 − Λ(x_i′β)) β_k

There are three possibilities to compute the marginal effect: the marginal effect at a representative value (MER), the marginal effect at the mean (MEM) or the average marginal effect (AME). For the logit model the coefficients can also be interpreted through the odds ratio,

p / (1 − p) = exp(x_i′β).
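As a sketch, the logit AME of a slope coefficient on simulated data; the design and β below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta = np.array([0.5, 1.0])

# Logit probabilities, then the average marginal effect (AME) of the slope:
# the mean over i of Lambda(x_i'beta) * (1 - Lambda(x_i'beta)) * beta_k
p = 1.0 / (1.0 + np.exp(-X @ beta))
ame = np.mean(p * (1.0 - p) * beta[1])
```

Since Λ(1 − Λ) ≤ 1/4, the AME is always smaller in magnitude than β_k/4.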
These models can also be derived from an underlying latent model, where the latent variable equation is given by,

y_i* = x_i′β + ε_i   (2.5)

where y_i* is unobserved and,

y_i = 1  if y_i* > 0,   y_i = 0  if y_i* ≤ 0.   (2.6)

Consequently,

P(y_i = 1|x_i) = P(y_i* > 0|x_i) = P(ε_i > −x_i′β|x_i) = F(x_i′β)   (2.7)

where F is the distribution function of −ε_i or, in the common case of a symmetric distribution, the distribution function of ε_i.
We estimate binary choice models by maximum likelihood. The following identities facilitate this estimation,

Likelihood function

L_N(β) = ∏_{i=1}^N F(x_i′β)^{y_i} (1 − F(x_i′β))^{1−y_i}   (2.8)

Log-likelihood function

ln L_N(β) = ∑_{i=1}^N y_i ln F(x_i′β) + ∑_{i=1}^N (1 − y_i) ln(1 − F(x_i′β))   (2.9)

First-order conditions

∂ln L_N(β)/∂β = ∑_{i=1}^N [ (y_i − F(x_i′β)) / (F(x_i′β)(1 − F(x_i′β))) ] F′(x_i′β) x_i = 0.   (2.10)
There is no analytical solution, but maxima are easy to find using NR since, for logit and probit, the log-likelihood is globally concave. Furthermore, the MLE is consistent if the probability of success is correctly specified (p_i = F(x_i′β)). If the conditional density of y_i given x_i is correctly specified, then the MLE is asymptotically efficient,

β̂_ML ≈ N( β, [−E( ∂²ln L_N/∂β∂β′ )]⁻¹ ).   (2.11)
The goodness of fit of binary choice models can be measured by the pseudo-R² (which measures the relative gain),

R²_RG = 1 − ln L(β̂_ML) / ln L₀.   (2.12)

An alternative way to measure the goodness of fit is to find the proportion of correctly specified predictions. In the case of a logit or probit model, this would be,

wr₁ = (n₀₀ + n₁₁) / N.   (2.13)
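Both fit measures, (2.12) and (2.13), are easy to compute from fitted probabilities. The toy data below are illustrative; ln L₀ is the log-likelihood of an intercept-only model:

```python
import numpy as np

def fit_stats(y, p_hat):
    """Pseudo-R^2 (2.12) and share of correct predictions (2.13)
    for a fitted binary choice model."""
    y = np.asarray(y, dtype=float)
    ll = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    p0 = y.mean()                      # intercept-only fitted probability
    ll0 = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))
    pseudo_r2 = 1.0 - ll / ll0
    hit_rate = np.mean((p_hat > 0.5) == (y == 1))
    return pseudo_r2, hit_rate

y = np.array([0, 0, 1, 1, 1, 0])
p_hat = np.array([0.2, 0.3, 0.8, 0.7, 0.6, 0.4])
r2, hr = fit_stats(y, p_hat)
```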
2.2 Ordered and multinomial response models
For ordered and multinomial response models there are m alternatives and the dependent variable y is defined to take value j if the j-th alternative is taken, j = 1, ..., m. Let us introduce m binary variables for each observation y,

y_j = 1  if y = j,   y_j = 0  if y ≠ j.   (2.14)
The multinomial density for one observation can conveniently be written as,

f(y) = p₁^{y₁} p₂^{y₂} ··· p_m^{y_m} = ∏_{j=1}^m p_j^{y_j}.   (2.15)

Next we need to specify a model for the probability that individual i chooses the j-th alternative,

p_ij = P(y_i = j|x_i) = F_j(x_i, θ),  j = 1, ..., m.   (2.16)
We estimate multinomial models by maximum likelihood. The following identities facilitate this estimation,

Log-likelihood function

ln L_N = ∑_{i=1}^N ∑_{j=1}^m y_ij ln p_ij   (2.17)

First-order conditions

∂ln L_N/∂θ = ∑_{i=1}^N ∑_{j=1}^m (y_ij / p_ij) ∂p_ij/∂θ = 0.   (2.18)

Ordered response models are based on a latent variable y_i* = x_i′β + ε_i with observation rule,

y_i = j  if α_{j−1} < y_i* ≤ α_j,   (2.20)

for j = 1, ..., m, with α₀ = −∞ and α_m = ∞.   (2.21)
The type of model (ordered logit/ordered probit) depends on the distribution of the error term. In the ordered probit model, setting α₁ = 0 and the error variance to 1 are two normalization constraints; these are required to identify the remaining parameters.
We consider the probabilities and marginal effects of the ordered probit model for j = 1, ..., m.

j = 1:

P[y_i = 1|x_i] = Φ(α₁ − x_i′β)   (2.22)
∂p_i1/∂x_ik = −φ(α₁ − x_i′β) β_k   (2.23)

j = 2, ..., m−1:

P[y_i = j|x_i] = Φ(α_j − x_i′β) − Φ(α_{j−1} − x_i′β)   (2.24)
∂p_ij/∂x_ik = −[ φ(α_j − x_i′β) − φ(α_{j−1} − x_i′β) ] β_k   (2.25)

j = m:

P[y_i = m|x_i] = 1 − Φ(α_{m−1} − x_i′β)   (2.26)
∂p_im/∂x_ik = φ(α_{m−1} − x_i′β) β_k.   (2.27)
Interval regression is an application of ordered response models. The threshold points (α) are in this case known, so the normalization of α is not required.
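The ordered probit probabilities (2.22)-(2.26) are successive differences of normal c.d.f. values, which a short helper makes explicit (the cutpoints and index below are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_probs(xb, alphas):
    """Category probabilities (2.22)-(2.26) of an ordered probit with
    increasing cutpoints `alphas` and index xb = x'beta."""
    cuts = np.concatenate([[-np.inf], alphas, [np.inf]])
    cdf = norm.cdf(cuts - xb)
    return np.diff(cdf)  # P[y = j] for j = 1, ..., m

# m = 4 categories from 3 illustrative cutpoints
p = ordered_probit_probs(xb=0.3, alphas=[0.0, 1.0, 2.0])
```

By construction the probabilities are positive and sum to one, whatever the cutpoints.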
Multinomial models
When there is no natural ordering we speak of multinomial models. Which of the three model types applies depends on the type of regressors we have,

- Alternative-varying regressors take different values for different alternatives, e.g. travelling time and costs in a model of transportation choice.
- Alternative-invariant regressors do not vary across alternatives, e.g. socioeconomic status in a model of transportation choice.
The three multinomial methods (that are treated in this course) are given by,
(1) Multinomial logit: all regressors not alternative specific (e.g. income)
(2) Conditional logit: all regressors (e.g. price) alternative specific
(3) Mixed logit: both types of variables included
When regressors do not vary over alternatives, the multinomial logit model (MNL) is used. We need to restrict β₁ = 0 to ensure model identification and accordingly find,

p_i1 = 1 / ( 1 + ∑_{l=2}^m exp(x_i′β_l) ),   (2.29)
p_ij = exp(x_i′β_j) / ( 1 + ∑_{l=2}^m exp(x_i′β_l) ),  j = 2, ..., m.   (2.30)

The marginal effects are,

∂p_ij/∂x_i = p_ij ( β_j − β̄_i ),  where β̄_i = ∑_{l=1}^m p_il β_l.   (2.31)
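The MNL probabilities (2.29)-(2.30) are a softmax over alternative-specific indices with the base-category coefficients fixed at zero. A sketch with illustrative coefficients (the max-subtraction is a standard numerical-stability trick, not part of the formulas):

```python
import numpy as np

def mnl_probs(x, betas):
    """MNL choice probabilities (2.29)-(2.30); betas[0] is restricted to
    zero for identification, one coefficient vector per alternative."""
    v = np.array([x @ b for b in betas])  # index per alternative
    expv = np.exp(v - v.max())            # subtract max for stability
    return expv / expv.sum()

x = np.array([1.0, 0.5])  # intercept + one regressor
betas = [np.zeros(2), np.array([0.2, -0.4]), np.array([-0.1, 0.3])]
p = mnl_probs(x, betas)
```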
When regressors do vary over alternatives, the conditional logit model (CL) is used. The model is specified by,

p_ij = exp(x_ij′β) / ∑_{l=1}^m exp(x_il′β),   (2.32)

with marginal effects,

∂p_ij/∂x_ik = p_ij ( δ_ijk − p_ik ) β,   (2.33)

where δ_ijk is an indicator variable equal to 1 if j = k and equal to 0 if j ≠ k. The own effect (j = k) is positive when β > 0, since p_ij(1 − p_ij) > 0.
We can generalize both models to the mixed logit model, which is specified by,

p_ij = exp(x_ij′β + w_i′γ_j) / ∑_{l=1}^m exp(x_il′β + w_i′γ_l),   (2.34)

where the x_ij variables vary over alternatives and the w_i variables do not vary over alternatives.
The coefficients in the CL and MNL models can also be given a more direct logit-like interpretation in terms of relative risk. In the MNL model, the conditional probability of observing alternative j, given that either alternative j or alternative k is observed, is,

P[y_i = j | y_i = j or k] = p_ij / (p_ij + p_ik)   (2.35)
  = exp(x_i′β_j) / ( exp(x_i′β_j) + exp(x_i′β_k) )   (2.36)
  = exp(x_i′(β_j − β_k)) / ( exp(x_i′(β_j − β_k)) + 1 )   (2.37)

which is a binary logit model with coefficient (β_j − β_k). A similar approach can also be applied to the CL model.
A drawback of MNL and CL models is that the probability ratio between two alternatives j and k does not depend on the other alternatives, which is an undesirable property of the multinomial (conditional) logit model. This assumption is named Independence of Irrelevant Alternatives (IIA). The IIA property can be checked by means of a Hausman test, which compares estimates of the red bus option in a three-choice model (car, red bus and blue bus) with estimates of the red bus option in a two-choice model (car, red bus). If IIA holds then,
- Both models (the three-choice and the two-choice model) yield consistent estimates
- The estimates from the three-choice model are more efficient than those of the two-choice model
3 Week 8

3.1 Random utility models
Unordered multinomial models more general than multinomial and conditional logit can be obtained using the general framework of additive random utility models (ARUM). In a general m-choice multinomial model, the utility of choice j is,

U_j = V_j + ε_j,  j = 1, ..., m   (3.1)

where V_j is the deterministic component of utility and ε_j is the random component of utility. Individual i will choose the alternative that provides the highest utility, so that,

p_ij = P[y_i = j] = P[U_ij ≥ U_ik, for all k ≠ j]   (3.2)
  = P[ε̃_ijk ≤ V_ij − V_ik, for all k ≠ j]   (3.3)

where ε̃_ijk = ε_ik − ε_ij is the difference with respect to the reference alternative k. The log-likelihood is then,

ln L_N = ∑_{i=1}^N ∑_{j=1}^m y_ij ln p_ij.   (3.4)
Different multinomial models arise from different assumptions on the joint distribution of ε₁, ..., ε_m. When we assume that the errors ε_j are i.i.d. type I extreme value,

f(ε_j) = e^{−ε_j} exp(−e^{−ε_j}),  j = 1, ..., m.   (3.5)

Consequently, for the i-th individual in multinomial outcomes modelled using ARUM, it can be shown that,

P[y_i = j] = e^{V_ij} / ( e^{V_i1} + e^{V_i2} + ... + e^{V_im} ).   (3.6)
We obtain the CL model when V_ij = x_ij′β and the MNL model when V_ij = x_i′β_j. The assumption that the errors ε_j are independent across alternatives j is likely to be violated if two alternatives are similar (failure of the IIA assumption).
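The type I extreme value result (3.6) can be verified by simulation: draw Gumbel errors, pick the utility-maximizing alternative, and compare choice frequencies with the closed-form probabilities (the utilities below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
V = np.array([0.0, 0.5, 1.0])  # illustrative deterministic utilities, m = 3
n = 200_000

# i.i.d. type I extreme value (Gumbel) errors; each individual chooses
# the alternative that maximizes U_j = V_j + eps_j
eps = rng.gumbel(size=(n, 3))
choice = np.argmax(V + eps, axis=1)
freq = np.bincount(choice, minlength=3) / n

# Closed-form logit probabilities (3.6)
p = np.exp(V) / np.exp(V).sum()
```

The simulated frequencies match the logit probabilities up to sampling noise.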
When we assume that the errors ε_j are jointly distributed of the GEV (generalized extreme value) type, the c.d.f. is given by,

F(ε₁, ε₂, ..., ε_m) = exp( −G(e^{−ε₁}, e^{−ε₂}, ..., e^{−ε_m}) )   (3.7)

and the choice probabilities are,

p_j = Y_j G_j(Y₁, ..., Y_m) / ∑_{k=1}^m Y_k G_k(Y₁, ..., Y_m),   (3.8)

where Y_j = e^{V_j} and G_j = ∂G(Y₁, ..., Y_m)/∂Y_j.   (3.9)
The nested logit model arises when the error terms ε_jk from the ARUM model have a GEV distribution function, see equation (3.7). Furthermore, G(·) has the following form,

G(Y) = G(Y₁₁, ..., Y₁K₁, ..., Y_J1, ..., Y_JK_J) = ∑_{j=1}^J ( ∑_{k=1}^{K_j} Y_jk^{1/ρ_j} )^{ρ_j},   (3.10)
p
where the dissimilarity parameter (or scale parameter) j = 1 cor ( j k , j l ), with 0 1. The error terms
appearing in the nested logit ARUM model are independent for two alternatives which are in different limbs and
correlated for alternatives within a limb. If j = 1, j the nested logit model is equivalent to the multinomial logit
model.
A typical specification for V_jk is,

V_jk = z_j′α + x_jk′β   (3.11)

where z_j varies over limbs only and x_jk varies over both limbs and branches. The GEV model yields the nested logit model,

p_jk = p_j · p_{k|j} = [ exp(z_j′α + ρ_j I_j) / ∑_{m=1}^J exp(z_m′α + ρ_m I_m) ] · [ exp(x_jk′β/ρ_j) / ∑_{l=1}^{K_j} exp(x_jl′β/ρ_j) ]   (3.12)

where,

I_j = ln ∑_{l=1}^{K_j} exp(x_jl′β/ρ_j)   (3.13)

is called the inclusive value or log-sum. The IIA property holds for alternatives within a limb but not for alternatives in different limbs. Finally, some drawbacks of the nested logit model are that the tree structure has to be imposed a priori and that the estimate of ρ_j does not always satisfy 0 ≤ ρ_j ≤ 1.
Count models
In certain applications we would like to explain the number of times a given event occurs. While the outcomes are
discrete and ordered, there are two important differences with ordered response outcomes,
(1) They are of cardinal rather than ordinal nature
(2) There is often no natural upper bound to the outcomes
The Poisson regression model specifies that each y_i is drawn from a Poisson distribution with parameter λ_i, which is related to the regressors x_i through λ_i = exp(x_i′β). The primary equation of the model is,

P[Y = y_i|x_i] = exp(−λ_i) λ_i^{y_i} / y_i!,  y_i = 0, 1, 2, ...   (3.14)

The log-likelihood is,

ln L_N(β) = ∑_{i=1}^N [ y_i x_i′β − exp(x_i′β) − ln(y_i!) ]   (3.15)

with first-order conditions,

∂ln L_N/∂β = ∑_{i=1}^N ( y_i − exp(x_i′β) ) x_i = 0   (3.16)
The Poisson MLE is consistent under the assumption that the conditional mean is correctly specified. The log-likelihood function is globally concave (because the Hessian is negative definite) and the NR algorithm yields unique parameter estimates. Note that the impact of a marginal change in x_ik on the expected value of y_i is given by,

∂E[y_i|x_i]/∂x_ik = exp(x_i′β) β_k   (3.17)
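Because the log-likelihood (3.15) is globally concave, a hand-rolled NR loop using the score (3.16) and Hessian converges reliably. A sketch on simulated data (design and β are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

# Newton-Raphson on the Poisson log-likelihood (3.15):
# score (3.16) is X'(y - mu), Hessian is -X' diag(mu) X (negative definite)
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    g = X.T @ (y - mu)
    H = -(X * mu[:, None]).T @ X
    beta = beta - np.linalg.solve(H, g)
```

Starting from zero is harmless here precisely because of the global concavity.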
The Poisson model is criticized frequently because of a phenomenon named overdispersion: for count data the variance usually exceeds the mean (which qualitatively has the same consequences as heteroskedasticity in the linear regression model), and the Poisson model does not incorporate this. Overdispersion leads to grossly deflated standard errors and grossly inflated t-statistics. A simple overdispersion test for H₀: α = 0 (where Var[y_i|x_i] = λ_i + α g(λ_i)) can be computed by,
(1) Estimating the Poisson model
(2) Computing the fitted means λ̂_i = exp(x_i′β̂)
(3) Running the auxiliary OLS regression (without intercept),

( (y_i − λ̂_i)² − y_i ) / λ̂_i = α g(λ̂_i)/λ̂_i + u_i   (3.18)

and using the t-statistic on α̂ to test H₀.
We can generalize the Poisson model by introducing an individual unobserved effect into the conditional mean,

μ_i = λ_i u_i   (3.19)

Then the distribution of y_i conditional on x_i and u_i remains Poisson with conditional mean and variance μ_i. The unconditional distribution f(y_i|x_i) is the expected value (over u_i) of f(y_i|x_i, u_i),

f(y_i|x_i) = ∫₀^∞ [ exp(−λ_i u_i)(λ_i u_i)^{y_i} / y_i! ] g(u_i) du_i   (3.20)
3.2 Censoring and truncation
Censoring arises when information on the dependent variable is lost, but not on the regressors.

Example: individuals of all income levels are included in the sample, but incomes below the poverty line are reported at the poverty line.

Censoring from below is given by,

y = y*  if y* > L,   y = L  if y* ≤ L.   (3.21)

Censoring from above is given by,

y = y*  if y* < U,   y = U  if y* ≥ U.   (3.22)
Truncation arises when one attempts to make inferences about a larger population from a sample that is drawn from a distinct subpopulation; observations on both the dependent variable and the regressors are lost.

Example: studies of income based only on incomes above some poverty line may be of limited usefulness for inference about the whole population.

Truncation from below is given by,

y = y*  if y* > L   (3.23)

and truncation from above by,

y = y*  if y* < U.   (3.24)

For censoring from below at L, define d_i = 1 if the observation is uncensored (y_i* > L) and d_i = 0 otherwise. The censored density combines the density of the uncensored observations with the probability mass of the censored ones,

f*(y) = f(y)^d F(L)^{1−d}   (3.26)

so the censored log-likelihood is,

ln L_N(θ) = ∑_{i=1}^N [ d_i ln f(y_i|x_i, θ) + (1 − d_i) ln F(L|x_i, θ) ].   (3.27)
For truncation from below at L, and suppressing dependence on x, the conditional density of the observed y is,

f(y|y > L) = f(y) / P(y > L) = f(y) / (1 − F(L));   (3.28)

the truncated log-likelihood is therefore,

ln L_N(θ) = ∑_{i=1}^N [ ln f(y_i|x_i, θ) − ln(1 − F(L_i|x_i, θ)) ]   (3.29)
Tobit model

The standard tobit model is also called a censored normal regression model and has censoring from below at zero. The latent model is given by y* = x′β + ε (where ε ~ N(0, σ²)) and hence y* ~ N(x′β, σ²). The observation rule is identical to censoring from below, given by equation (3.21) with L = 0. The censored density follows equation (3.26), where F(0) = P[y* ≤ 0] = P[x′β + ε ≤ 0] = 1 − Φ(x′β/σ), and is given by,

Censored density

f*(y) = [ (1/√(2πσ²)) exp( −(y − x′β)²/(2σ²) ) ]^d [ 1 − Φ(x′β/σ) ]^{1−d}   (3.30)
Censored MLE

ln L_N(β, σ²) = ∑_{i=1}^N { d_i [ −½ln(2π) − ½ln σ² − (1/(2σ²))(y_i − x_i′β)² ] + (1 − d_i) ln[ 1 − Φ(x_i′β/σ) ] }   (3.31)

Truncated MLE

ln L_N(β, σ²) = ∑_{i=1}^N { −½ln(2π) − ½ln σ² − (1/(2σ²))(y_i − x_i′β)² − ln Φ(x_i′β/σ) }.   (3.32)
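The censored log-likelihood (3.31) is straightforward to evaluate with normal density routines; a sketch on simulated tobit data (the design and parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def tobit_loglik(params, y, X):
    """Censored log-likelihood (3.31) for the standard tobit
    (censoring from below at zero); params = (beta, log sigma)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)  # parameterized to keep sigma > 0
    xb = X @ beta
    d = (y > 0).astype(float)
    ll_pos = norm.logpdf(y, loc=xb, scale=sigma)  # uncensored contribution
    ll_zero = norm.logcdf(-xb / sigma)            # P[y* <= 0] = 1 - Phi(x'b/sigma)
    return np.sum(d * ll_pos + (1 - d) * ll_zero)

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y_star = X @ np.array([0.2, 1.0]) + rng.normal(size=500)
y = np.maximum(y_star, 0.0)  # observation rule (3.21) with L = 0
ll = tobit_loglik(np.array([0.2, 1.0, 0.0]), y, X)
```

Passing this function to a numerical optimizer would give the tobit MLE; working with log σ avoids boundary problems during optimization.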
The standard tobit model imposes a restrictive structure: a variable that increases the probability of having a positive outcome also increases the expected value of that outcome, given that it is positive (for example, age on wage). There are two solutions to this problem,

- Two-part model: a model for the censoring mechanism and a model for the outcome, conditional on the outcome being observed
- Sample selection model: a joint distribution for the censoring mechanism and the outcome

First we investigate the two-part model. Define a binary indicator variable d = 1 for participants (y > 0) and d = 0 for non-participants (y = 0). Then the two-part model is,

f(y|x) = P[d = 0|x]  if y = 0,
f(y|x) = P[d = 1|x] f(y|d = 1, x)  if y > 0.   (3.39)

Estimation separates into estimation of,
(1) a discrete choice model (e.g. probit or logit) for the participation decision, using all observations
(2) the parameters of the density f(y|d = 1, x), using only the observations with y > 0.
The sample we observe is not a random sample from a larger population: people with very low potential wages may decide not to work. Thus, the probability of observing a wage depends upon the potential wage (endogenous sample selection). The tobit II model, also known as the sample selection model, is often used for this problem. The model consists of two equations: a participation equation and an outcome equation. We have,

Participation equation

y_1i = 1  if y_1i* > 0,   y_1i = 0  if y_1i* ≤ 0.   (3.40)

Outcome equation

y_2i = y_2i*  if y_1i* > 0,   y_2i not observed  if y_1i* ≤ 0.   (3.41)
where,

y_1i* = x_1i′β₁ + ε_1i   (3.42)
y_2i* = x_2i′β₂ + ε_2i   (3.43)

and the errors are jointly normal,

(ε_1i, ε_2i)′ ~ N( (0, 0)′, [ 1, σ₁₂ ; σ₁₂, σ₂² ] ),   (3.44)

where the variance of ε_1i is normalized to 1.
The model can be estimated by maximum likelihood or through the two-step Heckman procedure. The two-step procedure is based on the following regression,

y_2i = x_2i′β₂ + σ₁₂ λ(x_1i′β₁) + ν_i,  where λ(x_1i′β₁) = φ(x_1i′β₁) / Φ(x_1i′β₁)   (3.46)

This estimator is consistent but not efficient, and the standard errors should be adjusted. We prefer variables in x_1i that are not in x_2i (exclusion restrictions): variables affecting selection but not the outcome y_2i directly. Such variables are hard to find. A test for sample selection bias is a test for σ₁₂ = 0, which can be performed by a standard t-test on the coefficient for Heckman's lambda.
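The correction term in (3.46), Heckman's lambda, is the inverse Mills ratio φ/Φ evaluated at the first-step probit index. A sketch of the second-step regressor construction (the index values below are illustrative; in practice they come from an estimated probit):

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(z):
    """Heckman's lambda: phi(z) / Phi(z), as used in equation (3.46)."""
    return norm.pdf(z) / norm.cdf(z)

# Illustrative first-step probit indices x_1i'beta_1 (assumed known here);
# lambda would be appended to x_2i as an extra regressor in the second step
rng = np.random.default_rng(6)
x1b1 = rng.normal(size=1_000)
lam = inverse_mills(x1b1)
```

Lambda is always positive and decreasing in the index: observations with a low participation propensity receive the largest selection correction.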
4 Week 9

4.1 Duration models
We want to model the length of time spent in a given state (the classification of an individual at a given point in time) before transitioning (moving from one state to another) to another state. The time spent in a state is named the duration or spell length. There are many possible sampling schemes, and inference depends both on the duration model and on the sampling scheme,

- Flow sampling: sample those entering the state at a given point in time
- Stock sampling: sample those in the state at a given point in time
- Population sampling: sample the population at a given point in time, regardless of state.

Estimation methods that do not work (and the accompanying reasons) include an OLS regression on the duration t_i (log(t_i) = x_i′β + ε_i), because of,
(1) Censoring and truncation (OLS gives inconsistent estimates for these problems)
The distribution function of the duration T is,

F(t) = P[T ≤ t] = ∫₀^t f(s) ds.   (4.1)

The opposite probability, also named the survivor function S(t), is given by,

S(t) = P[T > t] = 1 − F(t).   (4.2)

The mean completed spell length (duration) is given by,

E[T] = ∫₀^∞ S(u) du.   (4.3)
The hazard function gives the instantaneous probability of leaving a state conditional on survival to time t,

λ(t) = lim_{Δt↓0} P[t ≤ T < t + Δt | T ≥ t] / Δt   (4.4)
     = f(t) / S(t)   (4.5)
     = −∂ln S(t)/∂t.   (4.6)

Conversely, the survivor function can be recovered from the hazard,

S(t) = exp( −∫₀^t λ(u) du ).   (4.7)
The conditional hazard rate λ(t|x) is our object of interest. We can also compute the cumulative hazard function,

Λ(t) = ∫₀^t λ(s) ds   (4.8)
     = −ln S(t)   (4.9)
Duration is often measured as an interval. The transition process may be discrete or continuous, but the data are always observed discretely (grouped data). If the hazard within the interval is assumed constant we have a discrete-time hazard model,

- Discrete-time hazard function: the probability of a transition at discrete time t_j, j = 1, 2, ..., given survival to time t_j,

λ_j = P[T = t_j | T ≥ t_j] = f_d(t_j) / S_d(t_{j−1})   (4.10)

- Discrete-time survivor function: a decreasing step function with steps at the t_j; it can be obtained recursively from the hazard function,

S_d(t) = P[T > t] = ∏_{j|t_j ≤ t} (1 − λ_j)   (4.11)

Spells are often right-censored: for a censored spell we only observe the censoring time C_i rather than the completed duration T_i. The censoring indicator is,

δ_i = 1[T_i < C_i]   (4.14)

Without censoring, the empirical survivor function is,

Ŝ(t) = r_t / N   (4.17)

where r_t is the number of spells of duration greater than t and N is the sample size.
In the case of independent censoring, we find the following formulas for the empirical survivor function,

- Sample counterpart of the hazard function,

λ̂_j = d_j / r_j   (4.18)

where d_j is the number of spells ending at t_j and r_j is the number of spells at risk at t_j.

- Kaplan-Meier (product-limit) estimate of the survivor function,

Ŝ(t) = ∏_{j|t_j ≤ t} (1 − λ̂_j) = ∏_{j|t_j ≤ t} (r_j − d_j) / r_j   (4.19)

- Estimate of the cumulative hazard function,

Λ̂(t) = ∑_{j|t_j ≤ t} λ̂_j = ∑_{j|t_j ≤ t} d_j / r_j   (4.20)
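The product-limit formula (4.19) can be implemented in a few lines; the toy durations below are illustrative, with 1 marking a completed spell and 0 a right-censored one:

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimator (4.19). times: observed durations;
    events: 1 if the spell ended (failure), 0 if right-censored."""
    times, events = np.asarray(times), np.asarray(events)
    fail_times = np.unique(times[events == 1])
    surv = []
    s = 1.0
    for t in fail_times:
        r_j = np.sum(times >= t)                     # at risk just before t
        d_j = np.sum((times == t) & (events == 1))   # failures at t
        s *= (r_j - d_j) / r_j                       # factor (1 - lambda_j)
        surv.append(s)
    return fail_times, np.array(surv)

t = [2, 3, 3, 5, 8, 8, 9, 12]
d = [1, 1, 0, 1, 1, 1, 0, 1]
ft, S = kaplan_meier(t, d)
```

Censored spells leave the risk set without contributing a failure, which is exactly how independent censoring enters (4.19).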
We now present a number of continuous-time hazard models.

Proportional hazard

We specify the hazard (the distribution of duration T) so that it consists of two factors,

λ(t|x) = λ₀(t, α) · φ(x, β)   (4.21)

where λ₀(t, α) is the baseline hazard, a function of t alone, and φ(x, β) is a function of x alone, commonly φ(x, β) = exp(x′β). The Weibull model is the leading example,

λ(t|x) = α t^{α−1} exp(x′β),  α > 0,   (4.22)

with baseline hazard α t^{α−1}. The corresponding survivor function is,

S(t|x) = exp( −t^α exp(x′β) ),   (4.23)

and the density is,

f(t|x) = λ(t|x) S(t|x) = α t^{α−1} exp(x′β) exp( −t^α exp(x′β) ),   (4.24)

so the Weibull hazard is increasing in t if α > 1, constant if α = 1 (the exponential model) and decreasing in t if α < 1.
Censoring yields a likelihood that is similar to the tobit,

- Uncensored observations: contribution f(t|x, θ)
- Right-censored observations: contribution S(t|x, θ)   (4.25)

Log-likelihood

ln L(θ) = ∑_{i=1}^N [ δ_i ln f(t_i|x, θ) + (1 − δ_i) ln S(t_i|x, θ) ]   (4.26)

with the first term for completed spells and the second for right-censored spells. The MLE satisfies,

θ̂ ≈ N( θ, [−E( ∂²ln L(θ)/∂θ∂θ′ )]⁻¹ )   (4.27)
When interpreting the results we ask ourselves two questions.

Shape of the baseline hazard: how does the hazard change with time?
- Positive vs. negative duration dependence
- For the Weibull model: α < 1 or α > 1

Estimated coefficients on the regressors: how does the hazard vary with the regressors?
When modelling accelerated failure time (AFT) models we model ln(t) instead of t,

ln t = x′β + u   (4.28)

Different distributions for u lead to different AFT models, for example the lognormal, log-logistic or generalized gamma distributions. Why accelerated failure time?

- t = exp(x′β) ν, where ν = exp(u)
- Hazard rate: λ(t|x) = λ₀(ν) exp(−x′β)
- Substituting ν = t exp(−x′β),

λ(t|x) = λ₀( t exp(−x′β) ) exp(−x′β)   (4.29)

This is an acceleration of the baseline hazard λ₀(t) if exp(−x′β) > 1 and a deceleration of the baseline hazard λ₀(t) if exp(−x′β) < 1. We can interpret the parameter estimates as follows: a unit change in a regressor is associated with a proportionate change in survival time (ceteris paribus).
Estimates of the proportional hazard model (Weibull) can be used to compute other quantities, for example,

Baseline hazard
- The ratio of the hazard rate at survival time t to the hazard at time u, given the same covariates: λ(t|x)/λ(u|x) = (t/u)^{α−1}

Covariates
- The elasticity of the hazard w.r.t. a one-unit increase in the k-th explanatory variable: η_ik = β_k x_ik
- If the explanatory variable enters in logs: β_k is the elasticity of the hazard
- Hazard ratios at a given survival time are related to absolute differences in characteristics: λ(t, x₁)/λ(t, x₂) = exp( (x₁ − x₂)′β ).
Fully parametric models are simple to estimate (even if there is censoring). However, estimates are inconsistent if either part is misspecified (the baseline hazard or the part that depends on covariates). A solution to this problem is the Cox PH, a semiparametric model. At each failure time t_j we categorize observations into D(t_j), the set of spells completed at t_j, and R(t_j), the risk set of spells still at risk at t_j.
Partial likelihood function (the product of P[T_j = t_j | j ∈ R(t_j)] over the k ordered failure times),

L_p(β) = ∏_{j=1}^k [ ∏_{m∈D(t_j)} φ(x_m, β) ] / [ ∑_{l∈R(t_j)} φ(x_l, β) ]^{d_j}   (4.31)

Log partial likelihood,

ln L_p(β) = ∑_{j=1}^k [ ∑_{m∈D(t_j)} ln φ(x_m, β) − d_j ln( ∑_{l∈R(t_j)} φ(x_l, β) ) ]   (4.32)

with asymptotic distribution,

β̂ ≈ N( β, [−E( ∂²ln L_p(β)/∂β∂β′ )]⁻¹ )   (4.33)
Given β̂ we can retrieve a nonparametric estimate of the baseline hazard or survivor function; the survivor function can be written as a product over the failure times t_j ≤ t, analogous to the Kaplan-Meier estimator (4.19). The marginal effect of the regressors on the hazard is,

∂λ(t|x, β)/∂x = λ(t|x, β) β.   (4.36)
5 Week 10

5.1 Panel data

The linear panel data model is,

y_it = x_it′β + ε_it   (5.1)

where i = 1, ..., n indexes units and t = 1, ..., T indexes time periods (the panel is balanced if T_i is the same for all i, unbalanced otherwise). Units are i.i.d., but observations for a given unit are not independent: E(ε_it ε_is) ≠ 0 (even if t ≠ s). Different assumptions on the relationship between x_it and ε_it lead to different estimators. Note that in panel data, y_i is a T × 1 vector and X_i is a T × K matrix, where T stands for the time periods of the panel and K for the number of regressors per time period. There are a number of advantages to using panel data,
- Efficiency gains (smaller standard errors and lower variance because the dependence is exploited)
- Excluding endogenous regressors (transformations of the original regressors are often exogenous)
- Robustness to omitted variables (using only variation within units over time removes all unobserved time-constant variables)
- Identification of individual dynamics (individuals who experienced an event in the past are more likely to experience that event in the future)
  - Spurious state dependence: individuals differ in unobservable ways that influence the likelihood of events (but are not influenced by the event)
  - True state dependence: the event changes preferences/constraints etc. such that the person becomes more likely to experience the event again
We can decompose the error term of equation (5.1) into two parts,

ε_it = c_i + u_it   (5.2)

where,
- c_i is an individual effect: fixed across repeated observations for individual i (it captures unobserved heterogeneity)
- u_it is an idiosyncratic shock (u_it ~ i.i.d.(0, σ_u²) across observations)
Different estimators result from different assumptions on c_i and u_it,

- Pooled OLS:        E(x_it c_i) = 0;  E(x_it u_it) = 0
- Fixed effects:     E(x_it c_i) ≠ 0 allowed;  E(x_it u_is) = 0 for all s, t
- First differences: E(x_it c_i) ≠ 0 allowed;  E((x_it − x_i,t−1)(u_it − u_i,t−1)) = 0
- Random effects:    E(x_it c_i) = 0;  E(x_it u_is) = 0 for all s, t
Pooled OLS

OLS is consistent (for n → ∞ or T → ∞) if E(x_it(c_i + u_it)) = 0. This condition is satisfied if E(x_it c_i) = 0 (lagged dependent variables violate this assumption) and E(x_it u_it) = 0 (weak exogeneity). OLS is not efficient if σ_c² ≠ 0 (autocorrelation is present in the composite error term). The usual standard errors are incorrect, hence one should use clustered standard errors.
The Least Squares Dummy Variable (LSDV) estimator

A linear regression model with individual-specific intercepts,

y_it = c_i + x_it′β + u_it   (5.3)

where x_it does not contain an intercept and u_it ~ i.i.d.(0, σ_u²). All x_it are usually assumed to be independent of all u_it. The vector c = (c₁, ..., c_n)′ and β can be estimated by OLS (though this is numerically difficult for large n).
The Fixed Effects (FE) estimator

We estimate an equation similar to (5.3). However, there are two assumptions,
- It allows E(x_it c_i) ≠ 0: correlation between the regressors and the individual effects
- It assumes strict exogeneity: E(x_it u_is) = 0 for all s, t (this rules out lagged dependent variables and feedback loops)

The model is given by equation (5.3); we subtract the individual means and accordingly find,

y_it − ȳ_i = (x_it − x̄_i)′β + (u_it − ū_i)   (5.4)

The within transformation eliminates the individual effects (c_i), so we can use the OLS estimator on the transformed model, which yields the within estimator or fixed effects estimator. Consistency requires no perfect multicollinearity, strict exogeneity and that (X_i, y_i) are i.i.d. random variables. The asymptotic distribution of the estimator is given by,

√N (β̂_FE − β) →d N( 0, Avar(β̂_FE) )   (5.5)

Furthermore, the parameter σ_u² can be estimated from the within residuals û_it = y_it − ĉ_i − x_it′β̂_FE,

σ̂_u² = (1/(n(T−1))) ∑_i ∑_t û_it²   (5.6)

If u_it is not white noise, the standard errors will not be correct; consequently, use panel-robust standard errors.
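The within transformation (5.4) can be sketched directly on simulated data in which the regressor is deliberately correlated with c_i (all design values are illustrative), so pooled OLS would be biased but FE is not:

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 500, 4
c = rng.normal(size=n)                              # individual effects
x = rng.normal(size=(n, T)) + 0.5 * c[:, None]      # x correlated with c_i
beta_true = 2.0
y = c[:, None] + beta_true * x + rng.normal(size=(n, T))

# Within transformation (5.4): subtract individual means, then OLS
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
beta_fe = (x_w * y_w).sum() / (x_w ** 2).sum()
```

Demeaning removes c_i exactly, so the correlation between x and c_i does no harm to the within estimator.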
We have seen two ways of getting the FE estimator,
(1) Least Squares Dummy Variables: include a dummy for each individual
(2) Estimation using transformed data: variables in deviations from individual means

There is a third way, namely the Mundlak procedure: the individual means are included as additional covariates. The coefficients on the means are not interpreted, and the coefficients on the variables themselves are the FE estimates. A few final formulas concerning the FE estimator:
- Estimator for c_i:

ĉ_i = ȳ_i − x̄_i′β̂_FE   (5.7)

- Within R²:

R²_within = Corr²( (x_it − x̄_i)′β̂_FE, y_it − ȳ_i )   (5.8)

- Between estimator: OLS of

ȳ_i = x̄_i′β + ε̄_i   (5.9)

- Between R²:

R²_between = Corr²( x̄_i′β̂_B, ȳ_i )   (5.10)

- Overall R²:

R²_overall = Corr²( x_it′β̂_OLS, y_it )   (5.11)
First-Difference (FD) estimator

We start from equation (5.3) and take first differences, which gives us,

y_it − y_i,t−1 = (x_it − x_i,t−1)′β + (u_it − u_i,t−1)   (5.12)

Applying OLS to the differenced equation yields the first-difference estimator β̂_FD. Consistency requires that the first-differenced regressors and error terms are uncorrelated (weaker than strict exogeneity). The FD and FE estimators are similar if strict exogeneity holds (and identical when T = 2). If FD and FE are dramatically different, two explanations are possible:
(1) Some RHS variables violate strict exogeneity
(2) The model is incorrectly specified (important time-varying variables are missing)
Random Effects (RE) estimator

Consider, one last time, the linear error-components model given in equation (5.1), with the error term specified as in (5.2). Now assume that both c_i and u_it are white noise and independent. Assume strict exogeneity and that the regressors are uncorrelated with the individual effects. Under these assumptions pooled OLS, FE and FD are all consistent. However, none of them is efficient: OLS disregards the correlation between the error terms of a given unit, while FE and FD both disregard all variation between units. The correlation between two error terms at times t and s is given by,

Corr[ε_it, ε_is] = σ_c² / (σ_c² + σ_u²),  t ≠ s.   (5.13)
If σ_c² and σ_u² are known, GLS yields the BLUE estimator. We can write the GLS estimator for β as,

β̂_GLS = [ ∑_{i=1}^N ∑_{t=1}^T (x_it − x̄_i)(x_it − x̄_i)′ + ψT ∑_{i=1}^N (x̄_i − x̄)(x̄_i − x̄)′ ]⁻¹ [ ∑_{i=1}^N ∑_{t=1}^T (x_it − x̄_i)(y_it − ȳ_i) + ψT ∑_{i=1}^N (x̄_i − x̄)(ȳ_i − ȳ) ]   (5.14)

where ψ = σ_u² / (Tσ_c² + σ_u²).

- For ψ = 0: β̂_GLS = β̂_FE
- For ψ = 1 (σ_c² = 0): β̂_GLS = β̂_OLS.

The GLS estimator is the optimal matrix-weighted average of the between and within estimators. If σ_c² and σ_u² are known, we can equivalently perform OLS on the quasi-demeaned model,

(y_it − λȳ_i) = (x_it − λx̄_i)′β + ν_it   (5.15)

where λ = 1 − √ψ. The transformed error term ν_it = ε_it − λε̄_i = (1 − λ)c_i + u_it − λū_i is i.i.d. over individuals and time. If σ_c² and σ_u² are not known, one can follow these steps (feasible GLS),
(1) Obtain σ̂_u² from the within regression, see equation (5.6)
(2) Given σ̂_u², obtain σ̂_c² from the between regression,

ȳ_i = x̄_i′β + ε̄_i,  where ε̄_i = c_i + ū_i,   (5.16)

whose residual variance σ̂²_B estimates σ_c² + σ_u²/T, so that,

σ̂_c² = σ̂²_B − σ̂_u²/T   (5.17)

(3) Use σ̂_u² and σ̂_c² to compute λ̂ = 1 − √( σ̂_u² / (σ̂_u² + T σ̂_c²) )
(4) Perform FGLS using λ̂ to obtain an estimate of β,

(y_it − λ̂ȳ_i) = (x_it − λ̂x̄_i)′β + ν_it   (5.18)
Contrary to FE, time-constant variables can be included in the RE model, and RE is consistent for either T or N going to infinity.
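The quasi-demeaning in (5.15) can be sketched on simulated data. For illustration, λ is computed from the true variance components rather than the FGLS steps (all design values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 1_000, 5
sigma_c, sigma_u = 1.0, 0.5
c = rng.normal(scale=sigma_c, size=n)
x = rng.normal(size=(n, T))  # regressors uncorrelated with c_i, as RE requires
y = 1.5 * x + c[:, None] + rng.normal(scale=sigma_u, size=(n, T))

# Quasi-demeaning weight: lambda = 1 - sqrt(sigma_u^2 / (sigma_u^2 + T*sigma_c^2)),
# here computed from the true variances for illustration
lam = 1.0 - np.sqrt(sigma_u ** 2 / (sigma_u ** 2 + T * sigma_c ** 2))

# RE/GLS via OLS on the transformed model (5.15)
x_t = x - lam * x.mean(axis=1, keepdims=True)
y_t = y - lam * y.mean(axis=1, keepdims=True)
beta_re = (x_t * y_t).sum() / (x_t ** 2).sum()
```

With λ strictly between 0 and 1, the RE estimator interpolates between pooled OLS (λ = 0) and the within estimator (λ = 1).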