Erdinç
ECO 311
Econometrics II
1. INTRODUCTION
Panel data-sets follow a random sample of individuals (or firms, households, etc.) over time. The big
advantage of working with panel data is that we will be able to control for individual-specific,
time-invariant, unobserved heterogeneity, the presence of which could lead to bias in standard estimators like
OLS. We can also estimate dynamic equations.
Panel data combine a time series dimension with a cross section dimension, in such a way that there
are data on N individuals (or firms, countries...), followed over T time periods. Not all data-sets that
combine a time series dimension with a cross section dimension are panel data-sets, however. It is
2.1. A Panel Data-Set.
id   year   dummy 1992   dummy 1993   dummy 1994   x1   x2
1    1992   1            0            0            8    1
1    1993   0            1            0            12   1
1    1994   0            0            1            10   1
2    1992   1            0            0            7    0
2    1993   0            1            0            5    0
2    1994   0            0            1            3    0
time dummies, constructed from the year variable; x1 is an example of a time varying variable and x2
is an example of a time invariant variable.
small.
series.
Throughout this lecture, we will focus mostly on the large N, small T case.
• If the time periods for which we have data are the same for all N individuals, e.g. t = 1, 2, ..., T,
then we have a balanced panel. In practice, it is common that the length of the time series and/or
the time periods differ across individuals. In such a case the panel is unbalanced.1
• Repeated cross sections are not the same as panel data. Repeated cross sections are obtained
by sampling from the same population at different points in time. The identity of the individuals
1 Analyzing unbalanced panel data typically raises few additional issues compared with analysis of balanced data. However,
if the panel is unbalanced for reasons that are not entirely random (e.g. because firms with relatively low levels of
productivity have relatively high exit rates), then we may need to take this into account when estimating the model. This
can be done by means of a sample selection model (more on this later). We abstract from this particular problem here.
Lecture Notes ECO311, Econometrics II Prof. Erdinç
(or firms, households etc.) is not recorded, and there is no attempt to follow individuals over
time. This is the key reason why pooled cross sections are different from panel data. Had the
id variable in the example above not been available, we would have referred to this as a pooled
2.2. Advantages of Panel Approach. When we have a dataset with both a time series and a
cross-section dimension, this opens up new opportunities in our research. For example:
• With larger sample size than single cross-section, we should be able to obtain more precise
• We can now ask how certain effects evolve over time (e.g. time trend in dependent variable; or
• Panel data enable us to solve an omitted variables problem. Panel data also enable us to estimate
dynamic equations (e.g. specifications with lagged dependent variables on the right-hand side).
2.3. Using Panel Data To Address an Endogeneity Problem. One of the main advantages of
panel data is that such data can be used to solve an omitted variables problem. Suppose our model is
parameters β1 and β2 .
• Throughout this lecture we will assume that the residual uit, which varies both over time and
• Our problem is that we do not observe αi, which is constant over time for each individual (hence
no t subscript) but varies across individuals. Hence if we estimate the model in levels using OLS,
then αi will go into the error term: $v_{it}^{OLS} = \alpha_i + u_{it}$.
What would be the consequence of αi going into the error term?
• If αi is uncorrelated with xit, then αi is just another unobserved factor making up the residual.
However, OLS will not be BLUE, because the error term $v_{it}^{OLS}$ is serially correlated:
$$\operatorname{corr}\left(v_{it}^{OLS}, v_{i,t-s}^{OLS}\right) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2} \quad \text{for } s = 1, 2, \ldots$$
This suggests that some feasible generalized least squares estimator could be preferable (this is indeed the
case; see below). Notice that OLS would be consistent,2 however, and the only substantive problem with
relying on OLS for this model is that the standard formulas for calculating the standard errors are wrong.
This problem is straightforward to solve, e.g. by clustering the standard errors on the individuals, using
• But if αi is correlated with xit , then putting αi in the error term can cause serious problems. This,
of course, is an omitted variables problem, so we can use some familiar results to understand
the nature of the problem. For the single-regressor model, hence
2 Consistency is a minimal requirement for an estimator. Simplistically, it means that the distribution of the estimator β̂
collapses to the single point of true β as n goes to infinity. It is an asymptotic property. If obtaining more data does not
get us closer to the parameter value β, then we are using a poor estimation procedure.
$$\operatorname{plim}\,\hat{\beta}^{OLS} = \beta + \frac{\operatorname{cov}(x_{it}, \alpha_i)}{\sigma_x^2}$$
which shows that the OLS estimator is inconsistent unless cov(xit , αi ) = 0. If xit is positively correlated
with the unobserved effect, then there is an upward bias. If the correlation is negative, we get a downward
bias.
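The probability limit above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `simulate_ols_bias`, the parameter `rho` and the normal distributions are all illustrative assumptions. It generates a single-regressor model in which x depends on the unobserved effect and compares the OLS slope with β + cov(x, α)/σx².

```python
import numpy as np

def simulate_ols_bias(n=200_000, beta=1.0, rho=0.5, seed=0):
    """Simulate y = beta*x + alpha + u where cov(x, alpha) = rho (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n)              # unobserved effect with variance 1
    x = rho * alpha + rng.normal(size=n)    # cov(x, alpha) = rho, var(x) = rho**2 + 1
    u = rng.normal(size=n)                  # idiosyncratic error
    y = beta * x + alpha + u
    # OLS slope with an intercept equals sample cov(x, y) / sample var(x).
    c = np.cov(x, y)
    b_ols = c[0, 1] / c[0, 0]
    # Value implied by the formula: beta + cov(x, alpha) / var(x).
    predicted = beta + rho / (rho**2 + 1.0)
    return b_ols, predicted

b_ols, predicted = simulate_ols_bias()
print(f"OLS slope: {b_ols:.3f}  formula: {predicted:.3f}")
```

Calling `simulate_ols_bias(rho=-0.5)` reverses the sign of the covariance, so the slope falls below the true β.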
• Can you think about applications for which a specification like the following
• Individual earnings
• Household expenditures
• Firm investment
What factors can reasonably be represented by αi ? Can these be assumed uncorrelated with xit ?
We have seen that if αi is correlated with the variables in the xit vector, there will be an endogeneity
problem which would bias the OLS estimates. Under assumptions 1.1 and 1.2, we can use the Fixed
Effects (FE) or the First Differenced (FD) estimator to obtain consistent estimates of β, allowing αi to be
of this is that FE and FD will not yield consistent estimates if xit contains lagged dependent variables
($y_{i,t-1}, y_{i,t-2}, \ldots$). We will discuss such cases under GMM and instrumental variable estimation.
To see how the FE estimator solves the endogeneity problem that would contaminate the OLS estimates,
begin by taking the average of the equation above for each individual; this gives:
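A sketch of the algebra, assuming the model takes the form $y_{it} = x_{it}'\beta + \alpha_i + u_{it}$ (this form, and the bar notation for individual time averages, are assumptions made here for illustration):

```latex
\begin{align*}
y_{it} &= x_{it}'\beta + \alpha_i + u_{it}, \\
\bar{y}_i &= \bar{x}_i'\beta + \alpha_i + \bar{u}_i,
  \qquad \text{where } \bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}
  \text{ and similarly for } \bar{x}_i,\ \bar{u}_i, \\
y_{it} - \bar{y}_i &= (x_{it} - \bar{x}_i)'\beta + (u_{it} - \bar{u}_i).
\end{align*}
```

The third line is the first minus the second; αi drops out because its time average is αi itself.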
This transformation of the original equation, known as the within transformation, has eliminated αi
from the equation. Hence, we can estimate the coefficients consistently by using OLS on (2.3). This is called
is required for FE in the sense that E(xit , uit−2 ) = 0 is not required. Thus if there is feedback
from uit to xit that takes more than two periods, FD will still be consistent whereas FE will not.
To see how the FD estimator removes the individual effects to solve the endogeneity problem, we
Clearly, by removing the fixed effects, the FD generates consistent estimates based on OLS.
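To make the comparison concrete, here is a minimal simulation sketch. It is an illustration under assumed settings, not the notes' own example: the helper names, the single regressor, the balanced panel and the normal errors are all assumptions. αi is built to be correlated with xit, so pooled OLS should be biased while the within (FE) and first-difference (FD) estimators should both land near the true β.

```python
import numpy as np

def simulate_panel(n=2000, t=5, beta=1.0, seed=1):
    """Balanced panel y_it = beta*x_it + alpha_i + u_it with alpha_i correlated with x_it."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(n, 1))          # individual effect, constant over time
    x = alpha + rng.normal(size=(n, t))      # regressor depends on alpha_i
    u = rng.normal(size=(n, t))              # idiosyncratic error, i.i.d.
    y = beta * x + alpha + u
    return x, y

def slope(x, y):
    """OLS slope without intercept, for data that are already centred or differenced."""
    return float((x * y).sum() / (x * x).sum())

def pooled_ols(x, y):
    # Centre on the overall means, which is equivalent to including an intercept.
    return slope(x - x.mean(), y - y.mean())

def fixed_effects(x, y):
    # Within transformation: subtract each individual's time average.
    return slope(x - x.mean(axis=1, keepdims=True),
                 y - y.mean(axis=1, keepdims=True))

def first_differences(x, y):
    # Difference adjacent periods within each individual.
    return slope(np.diff(x, axis=1), np.diff(y, axis=1))

x, y = simulate_panel()
print(f"POLS: {pooled_ols(x, y):.3f}")
print(f"FE:   {fixed_effects(x, y):.3f}")
print(f"FD:   {first_differences(x, y):.3f}")
```

In this design cov(xit, αi) = 1 and var(xit) = 2, so the omitted-variable formula implies a pooled OLS probability limit of about 1.5 when the true β is 1.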
Which one to choose: FE or FD? Since FE and FD are two alternative ways of removing the fixed
effects, which one should we choose?
• Conversely, under the classical assumption that uit ∼ i.i.d.(0, σu²), FE will be more efficient
than the FD estimator (because in this case ∆uit will exhibit negative serial correlation).
• Hence, it may be useful to test for the presence of a unit root in the residuals to see if FE or FD
should be preferred.
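The negative serial correlation in ∆uit under i.i.d. errors can be worked out directly. A short derivation, assuming only that uit has mean zero, variance σu² and no serial correlation:

```latex
\begin{align*}
\Delta u_{it} &= u_{it} - u_{i,t-1}, \\
\operatorname{var}(\Delta u_{it}) &= \sigma_u^2 + \sigma_u^2 = 2\sigma_u^2, \\
\operatorname{cov}(\Delta u_{it}, \Delta u_{i,t-1})
  &= E\left[(u_{it} - u_{i,t-1})(u_{i,t-1} - u_{i,t-2})\right]
   = -E\left(u_{i,t-1}^2\right) = -\sigma_u^2, \\
\operatorname{corr}(\Delta u_{it}, \Delta u_{i,t-1})
  &= \frac{-\sigma_u^2}{2\sigma_u^2} = -\frac{1}{2}.
\end{align*}
```

By contrast, if uit were a random walk, ∆uit would itself be serially uncorrelated, which is why a unit root test on the residuals is informative for the choice.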
consistently estimate βs using OLS. In this context, we refer to this as the Pooled OLS estimator or
POLS. Standard OLS inference requires homoscedasticity and no serial correlation. The latter assumption
can be restrictive, so it is better to obtain the estimates and their standard errors
in a manner robust to both heteroscedasticity and serial correlation, using the cluster option in Stata,
with the strongest assumptions underlying Pooled OLS (no correlation between time invariant part of
residual and the explanatory variables). Why we need these assumptions will be clear below.
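The idea behind clustering the standard errors on the individual can be sketched outside Stata. The code below is an illustrative implementation of the usual sandwich formula with NumPy, not a reproduction of Stata's output: the function name and the simulated data are assumptions, and no finite-sample correction is applied.

```python
import numpy as np

def pols_with_cluster_se(x, y):
    """Pooled OLS of y on a constant and x, where x and y are (N, T) arrays.

    Returns the slope, its conventional standard error, and its standard
    error clustered on the individual (row) dimension.
    """
    n, t = x.shape
    # Stack observations individual by individual (row-major order).
    X = np.column_stack([np.ones(n * t), x.ravel()])
    yv = y.ravel()
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ yv
    e = yv - X @ b

    # Conventional variance: s^2 (X'X)^{-1}.
    s2 = e @ e / (n * t - k)
    se_conv = np.sqrt(s2 * XtX_inv[1, 1])

    # Cluster-robust sandwich: (X'X)^{-1} [sum_i X_i' e_i e_i' X_i] (X'X)^{-1}.
    Xe = X * e[:, None]                           # row it holds x_it * e_it
    scores = Xe.reshape(n, t, k).sum(axis=1)      # sum within each individual
    meat = scores.T @ scores
    V = XtX_inv @ meat @ XtX_inv
    se_cluster = np.sqrt(V[1, 1])
    return b[1], se_conv, se_cluster

# Hypothetical data: alpha_i is uncorrelated with x_it, but both the regressor
# and the composite error are persistent within individuals.
rng = np.random.default_rng(3)
n, t = 1000, 5
x = rng.normal(size=(n, 1)) + rng.normal(size=(n, t))
alpha = rng.normal(size=(n, 1))
y = 1.0 * x + alpha + rng.normal(size=(n, t))

b, se_conv, se_cluster = pols_with_cluster_se(x, y)
print(f"slope: {b:.3f}  conventional SE: {se_conv:.4f}  clustered SE: {se_cluster:.4f}")
```

The point estimate is the same under either variance formula; only the reported precision changes.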
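Why is Pooled OLS inefficient in this case? This is because the residual ϑit = αi + uit is serially correlated:
$$E(\vartheta_{it}\vartheta_{i,t-s}) = E\left[(\alpha_i + u_{it})(\alpha_i + u_{i,t-s})\right] = E\left[\alpha_i^2 + u_{it}\alpha_i + \alpha_i u_{i,t-s} + u_{it}u_{i,t-s}\right] = E(\alpha_i^2) = \sigma_\alpha^2$$
So,
$$\operatorname{corr}(\vartheta_{it}, \vartheta_{i,t-s}) = \frac{E(\vartheta_{it}\vartheta_{i,t-s})}{\sqrt{\sigma^2_{\vartheta_t}}\sqrt{\sigma^2_{\vartheta_{t-s}}}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}$$
since $\sigma^2_{\vartheta_t} = \sigma^2_{\vartheta_{t-s}} = \sigma^2_{\vartheta} = \sigma_\alpha^2 + \sigma_u^2$. I assume that uit is serially uncorrelated and uncorrelated with αi in the above calculation.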
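The correlation formula can be verified by simulation. This is a small sketch under assumed values: the function name, the normal distributions and the choice σα = 1, σu = 2 are illustrative, not from the notes. It draws the composite error and compares the sample correlation across periods with σα²/(σα² + σu²).

```python
import numpy as np

def composite_error_corr(sigma_alpha=1.0, sigma_u=2.0, n=200_000, t=4, s=1, seed=2):
    """Sample and theoretical correlation of v_it = alpha_i + u_it at lag s."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(scale=sigma_alpha, size=(n, 1))   # constant within individual
    u = rng.normal(scale=sigma_u, size=(n, t))           # serially uncorrelated
    v = alpha + u
    # Correlate the last period's error with the error s periods earlier.
    sample = np.corrcoef(v[:, t - 1], v[:, t - 1 - s])[0, 1]
    theory = sigma_alpha**2 / (sigma_alpha**2 + sigma_u**2)
    return sample, theory

for lag in (1, 2, 3):
    sample, theory = composite_error_corr(s=lag)
    print(f"lag {lag}: sample {sample:.3f}, theory {theory:.3f}")
```

The theoretical value does not depend on the lag, which is the distinctive feature of this error structure: the correlation does not die out as observations move further apart in time.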
Clearly, in the presence of this particular form of serial correlation via individual effects, OLS
estimators remain unbiased and consistent. But the composite error follows a specific type of serial
correlation, and the computed standard errors of the OLS estimators are incorrect. This requires that
RE estimation should rely on a more efficient GLS (Generalized Least Squares) estimator in order to
the estimates can be substantial. But when individuals, firms or countries (in short, the cross sections)
can be treated as one of a kind rather than random draws from a population, then we should use the
FE estimation. However, even when we think RE is appropriate, the FE may be preferred because of
the possible correlation between αi and xit (Assumption 3.1). Still we can rely on the Hausman test to
make a choice. This test is based on a test of the null which states that the αi and xit are uncorrelated
(such that RE is appropriate) against the alternative that they are correlated (favoring FE).
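The logic of the Hausman test can be illustrated for a single regressor in a balanced panel. The sketch below is a simplified illustration and not a substitute for a packaged routine: the function names, the variance-component estimates (within and between residuals) and the simulated data are all assumptions made here. Under the null the statistic is approximately chi-squared with one degree of freedom, so values above about 3.84 reject RE at the 5% level.

```python
import numpy as np

def make_panel(correlated, n=1000, t=5, beta=1.0, seed=4):
    """Hypothetical panel; alpha_i enters x_it only when correlated=True."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(n, 1))
    x = rng.normal(size=(n, t)) + (alpha if correlated else 0.0)
    y = beta * x + alpha + rng.normal(size=(n, t))
    return x, y

def hausman_single_regressor(x, y):
    """Return (b_FE, b_RE, Hausman statistic) for (N, T) arrays x and y."""
    n, t = x.shape
    xm = x.mean(axis=1, keepdims=True)
    ym = y.mean(axis=1, keepdims=True)

    # Fixed effects (within) estimator and the idiosyncratic variance.
    xw, yw = x - xm, y - ym
    sxx_w = (xw * xw).sum()
    b_fe = (xw * yw).sum() / sxx_w
    sigma_u2 = ((yw - b_fe * xw) ** 2).sum() / (n * (t - 1) - 1)

    # Between regression on individual means; its residual variance
    # estimates sigma_alpha^2 + sigma_u^2 / T.
    xb, yb = xm - xm.mean(), ym - ym.mean()
    b_be = (xb * yb).sum() / (xb * xb).sum()
    s_between2 = ((yb - b_be * xb) ** 2).sum() / (n - 2)
    sigma_a2 = max(s_between2 - sigma_u2 / t, 0.0)

    # Random effects by quasi-demeaning (feasible GLS).
    theta = 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + t * sigma_a2))
    xq, yq = x - theta * xm, y - theta * ym
    xq, yq = xq - xq.mean(), yq - yq.mean()      # absorbs the intercept
    sxx_q = (xq * xq).sum()
    b_re = (xq * yq).sum() / sxx_q

    # Hausman statistic: squared difference over the difference in variances.
    var_fe, var_re = sigma_u2 / sxx_w, sigma_u2 / sxx_q
    h = (b_fe - b_re) ** 2 / (var_fe - var_re)
    return b_fe, b_re, h

for flag in (False, True):
    b_fe, b_re, h = hausman_single_regressor(*make_panel(correlated=flag))
    print(f"correlated={flag}: FE {b_fe:.3f}  RE {b_re:.3f}  H {h:.2f}")
```

When αi is correlated with xit, RE is pulled away from FE and the statistic becomes very large; when they are uncorrelated, the two estimates stay close.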