PANEL DATA CLASS NOTES
ECO 311, Econometrics II
Prof. Erdinç
1. INTRODUCTION
Panel data-sets follow a random sample of individuals (or firms, households, etc.) over time. The big
advantage of working with panel data is that we will be able to control for individual-specific,
time-invariant, unobserved heterogeneity, the presence of which could lead to bias in standard
estimators like OLS. We can also estimate dynamic equations.
Panel data combine a time series dimension with a cross section dimension, in such a way that there
are data on $N$ individuals (or firms, countries...), followed over $T$ time periods. Not all data-sets
that combine a time series dimension with a cross section dimension are panel data-sets, however. It is
important to distinguish panel data from repeated cross-sections.

2.1. A Panel Data-Set.

id     year   yr92   yr93   yr94   x1     x2
1      1992   1      0      0      8      1
1      1993   0      1      0      12     1
1      1994   0      0      1      10     1
2      1992   1      0      0      7      0
2      1993   0      1      0      5      0
2      1994   0      0      1      3      0
(...)  (...)  (...)  (...)  0      (...)  (...)

where id is the variable identifying the individual that we follow over time; yr92, yr93 and yr94 are
time dummies, constructed from the year variable; x1 is an example of a time varying variable and x2
is an example of a time invariant variable.
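As a small illustration of how the time dummies are constructed from the year variable, the example rows can be rebuilt as follows. This is only a sketch: the tuple layout and the helper name add_year_dummies are assumptions made here for illustration, not part of the notes.

```python
# Hypothetical in-memory copy of the example rows: (id, year, x1, x2).
rows = [
    (1, 1992, 8, 1), (1, 1993, 12, 1), (1, 1994, 10, 1),
    (2, 1992, 7, 0), (2, 1993, 5, 0), (2, 1994, 3, 0),
]

def add_year_dummies(rows, years=(1992, 1993, 1994)):
    """Return one dict per row with a 0/1 indicator per year (yr92, yr93, yr94)."""
    out = []
    for pid, year, x1, x2 in rows:
        rec = {"id": pid, "year": year, "x1": x1, "x2": x2}
        for y in years:
            # The dummy for year y equals 1 only in the row observed in year y.
            rec["yr%02d" % (y % 100)] = 1 if year == y else 0
        out.append(rec)
    return out

panel = add_year_dummies(rows)
print(panel[1])
```

Each row has exactly one dummy equal to 1, which matches the yr92, yr93 and yr94 columns of the table.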
• In microeconomic data, $N$ (the number of individuals, firms...) is typically large, while $T$ is
small.
• In aggregate data, longer $T$ is more common.
• In the opposite case of small $N$ and large $T$, say $N = 5$ countries and $T = 40$ years, the topic
becomes multiple time series.
Throughout this lecture, we will focus mostly on the large $N$, small $T$ case.
• If the time periods for which we have data are the same for all $N$ individuals, e.g. $t = 1, 2, ..., T$,
then we have a balanced panel. In practice, it is common that the length of the time series and/or
the time periods differ across individuals. In such a case the panel is unbalanced (see footnote 1).
• Repeated cross sections are not the same as panel data. Repeated cross sections are obtained
by sampling from the same population at different points in time. The identity of the individuals
(or firms, households etc.) is not recorded, and there is no attempt to follow individuals over
time. This is the key reason why pooled cross sections are different from panel data. Had the
id variable in the example above not been available, we would have referred to this as a pooled
repeated cross-section data set.

Footnote 1: Analyzing unbalanced panel data typically raises few additional issues compared with
analysis of balanced data. However, if the panel is unbalanced for reasons that are not entirely random
(e.g. because firms with relatively low levels of productivity have relatively high exit rates), then we
may need to take this into account when estimating the model. This can be done by means of a sample
selection model (more on this later). We abstract from this particular problem here.
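The balanced/unbalanced distinction can be checked mechanically. The sketch below assumes the data arrive as (id, period) pairs; the function name is_balanced is hypothetical.

```python
from collections import defaultdict

def is_balanced(observations):
    """Return True if every individual is observed in the same set of periods.

    observations: iterable of (id, period) pairs.
    """
    periods = defaultdict(set)
    for pid, period in observations:
        periods[pid].add(period)
    all_periods = set().union(*periods.values())
    return all(p == all_periods for p in periods.values())

balanced = [(1, 1992), (1, 1993), (1, 1994), (2, 1992), (2, 1993), (2, 1994)]
unbalanced = balanced[:-1]   # individual 2 is no longer observed in 1994
print(is_balanced(balanced), is_balanced(unbalanced))
```

Note that such a check needs the id variable. With repeated cross sections there is no identifier to group on, so the question of balance does not even arise.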
2.2. Advantages of Panel Approach. When we have a dataset with both a time series and a
cross-section dimension, this opens up new opportunities in our research. For example:
• With a larger sample size than a single cross-section, we should be able to obtain more precise
estimates (i.e. lower standard errors).
• We can now ask how certain effects evolve over time (e.g. time trend in dependent variable; or
changes in the coefficients of the model).
• Panel data enable us to solve an omitted variables problem. Panel data also enable us to estimate
dynamic equations (e.g. specifications with lagged dependent variables on the right-hand side).
2.3. Using Panel Data To Address an Endogeneity Problem. One of the main advantages of
panel data is that such data can be used to solve an omitted variables problem. Suppose our model is
$y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where we observe $y_{it}$, $x_{1it}$ and $x_{2it}$, but $\alpha_i$ and $u_{it}$ are not observed. Our goal
is to estimate the parameters $\beta_1$ and $\beta_2$.
• Throughout this lecture we will assume that the residual $u_{it}$, which varies both over time and
across individuals, is serially uncorrelated.
• Our problem is that we do not observe $\alpha_i$, which is constant over time for each individual (hence
no $t$ subscript) but varies across individuals. Hence if we estimate the model in levels using OLS,
then $\alpha_i$ will go into the error term: $v_{it}^{OLS} = \alpha_i + u_{it}$.
What would be the consequence of $\alpha_i$ going into the error term?
• If $\alpha_i$ is uncorrelated with $x_{it}$, then $\alpha_i$ is just another unobserved factor making up
the residual. However, OLS will not be BLUE, because the error term $v_{it}^{OLS}$ is serially correlated:
$\mathrm{corr}(v_{it}^{OLS}, v_{i,t-s}^{OLS}) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}$ for $s = 1, 2, ...$
This suggests some feasible generalized least squares estimator could be preferable (this is indeed the
case; see below). Notice that OLS would be consistent (see footnote 2), however, and the only substantive
problem with relying on OLS for this model is that the standard formula for calculating the standard
errors is wrong. This problem is straightforward to solve, e.g. by clustering the standard errors on the
individuals, using the cluster option in Stata (cluster(id) option after reg).

Footnote 2: Consistency is a minimal requirement for an estimator. Simplistically, it means that the
distribution of the estimator $\hat{\beta}$ collapses to the single point of the true $\beta$ as $n$ goes to
infinity. It is an asymptotic property. If obtaining more data does not get us closer to the parameter
value $\beta$, then we are using a poor estimation procedure.

• But if $\alpha_i$ is correlated with $x_{it}$, then putting $\alpha_i$ in the error term can cause serious
problems. This, of course, is an omitted variables problem, so we can use some familiar results to
understand the nature of the problem. For the single-regressor model,
$\mathrm{plim}(\hat{\beta}^{OLS}) = \beta + \frac{\mathrm{cov}(x_{it}, \alpha_i)}{\sigma_x^2}$
which shows that the OLS estimator is inconsistent unless $\mathrm{cov}(x_{it}, \alpha_i) = 0$. If $x_{it}$ is
positively correlated with the unobserved effect, then there is an upward bias. If the correlation is
negative, we get a downward bias.
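The probability limit for the single-regressor case can be illustrated with a small simulation. This is a sketch under assumed values that are not in the notes: $\beta = 1$, $x_{it} = \alpha_i + e_{it}$ with $\alpha_i$, $e_{it}$ and $u_{it}$ all standard normal, so that $\mathrm{cov}(x_{it}, \alpha_i) = 1$, $\sigma_x^2 = 2$ and the formula predicts a limit of 1.5.

```python
import random

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
beta, n, t = 1.0, 2000, 3          # assumed design, for illustration only
x, y, alpha_rep = [], [], []
for i in range(n):
    alpha = rng.gauss(0, 1)
    for _ in range(t):
        xit = alpha + rng.gauss(0, 1)          # x is correlated with alpha
        x.append(xit)
        y.append(beta * xit + alpha + rng.gauss(0, 1))
        alpha_rep.append(alpha)

b_ols = cov(x, y) / cov(x, x)                  # pooled OLS slope
bias_formula = cov(x, alpha_rep) / cov(x, x)   # sample analogue of cov(x, alpha) / var(x)
print(round(b_ols, 2), round(beta + bias_formula, 2))
```

Both printed numbers should be close to 1.5 rather than to the true value of 1, which is the upward bias described above.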
Can you think about applications for which a specification like the following
$y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

would be appropriate? How about:
• Individual earnings
• Household expenditures
• Firm investment
• Country income per capita.
What factors can reasonably be represented by $\alpha_i$? Can these be assumed uncorrelated with $x_{it}$?
2. Model 1: The Fixed Effects ("Within") Estimator

Model:
(2.1)  $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$
where $t = 1, 2, ..., T$ and $i = 1, 2, ..., N$.

Assumptions about unobserved terms:
• Assumption 2.1: $\alpha_i$ is freely correlated with $x_{it}$.

• Assumption 2.2: $E(x_{it} u_{is}) = 0$ for $s = 1, 2, ..., T$ (strict exogeneity).
We have seen that if $\alpha_i$ is correlated with the variables in the $x_{it}$ vector, there will be an
endogeneity problem which would bias the OLS estimates. Under Assumptions 2.1 and 2.2, we can use the
Fixed Effects (FE) or the First Differenced (FD) estimator to obtain consistent estimates of the $\beta$
coefficients while allowing $\alpha_i$ to be freely correlated with $x_{it}$.

Note that strict exogeneity rules out feedback from past $u_{is}$ shocks to current $x_{it}$. One implication
of this is that FE and FD will not yield consistent estimates if $x_{it}$ contains lagged dependent variables
($y_{i,t-1}$, $y_{i,t-2}$, ...). We will discuss such cases under GMM and instrumental variable estimation.
To see how the FE estimator solves the endogeneity problem that would contaminate the OLS estimates,
begin by taking the average of the equation above for each individual. This gives:
(2.2)  $\bar{y}_i = \beta_1 \bar{x}_{1i} + \beta_2 \bar{x}_{2i} + (\alpha_i + \bar{u}_i)$

where $i = 1, 2, ..., N$ and $\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}$, and so on. Now subtracting (2.2) from (2.1),
$y_{it} - \bar{y}_i = (\beta_1 x_{1it} - \beta_1 \bar{x}_{1i}) + (\beta_2 x_{2it} - \beta_2 \bar{x}_{2i}) + (\alpha_i - \alpha_i) + (u_{it} - \bar{u}_i)$
$y_{it} - \bar{y}_i = \beta_1 (x_{1it} - \bar{x}_{1i}) + \beta_2 (x_{2it} - \bar{x}_{2i}) + (u_{it} - \bar{u}_i)$
which we can write as:
(2.3)  $\ddot{y}_{it} = \beta_1 \ddot{x}_{1it} + \beta_2 \ddot{x}_{2it} + \ddot{u}_{it}$
where $\ddot{y}_{it} = y_{it} - \bar{y}_i$ is the time-demeaned data (and similarly for $\ddot{x}$ and $\ddot{u}$).
This transformation of the original equation, known as the within transformation, has eliminated $\alpha_i$
from the equation. Hence, we can estimate the coefficients consistently by using OLS on (2.3). This is
called the within estimator or the Fixed Effects estimator.
In Stata, we obtain FE estimates from the xtreg command (after declaring the panel identifier, e.g.
with xtset id): xtreg Y X1 X2, fe
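To make the within transformation concrete, here is a minimal sketch for a single regressor. The helper names within_transform and fe_slope, and the simulated design (true $\beta = 1$, with $\alpha_i$ correlated with $x_{it}$), are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def within_transform(ids, values):
    """Subtract each individual's time average from its observations."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, v in zip(ids, values):
        sums[i] += v
        counts[i] += 1
    return [v - sums[i] / counts[i] for i, v in zip(ids, values)]

def fe_slope(ids, x, y):
    """OLS of demeaned y on demeaned x (single regressor, no intercept)."""
    xd, yd = within_transform(ids, x), within_transform(ids, y)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

# x1 from the example table: individual 1 has mean 10, individual 2 has mean 5.
x1_demeaned = within_transform([1, 1, 1, 2, 2, 2], [8, 12, 10, 7, 5, 3])
print(x1_demeaned)   # [-2.0, 2.0, 0.0, 2.0, 0.0, -2.0]

# Simulated panel (assumed design): alpha_i enters both x and y, true beta = 1.
rng = random.Random(1)
ids, x, y = [], [], []
for i in range(2000):
    alpha = rng.gauss(0, 1)
    for _ in range(3):
        xit = alpha + rng.gauss(0, 1)
        ids.append(i)
        x.append(xit)
        y.append(1.0 * xit + alpha + rng.gauss(0, 1))
b_fe = fe_slope(ids, x, y)
print(round(b_fe, 2))
```

The simulated FE slope should be close to 1 even though $\alpha_i$ is correlated with $x_{it}$. Note also that a time-invariant variable such as x2 in the example table is demeaned to zero for every observation, so its coefficient cannot be estimated by FE.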
3. Model 2: The First Differencing Estimator

Model:
(3.1)  $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$
where $t = 1, 2, ..., T$ and $i = 1, 2, ..., N$.

• Assumption 3.1: $\alpha_i$ is freely correlated with $x_{it}$.

• Assumption 3.2: $E(x_{it} u_{is}) = 0$ for $s = t-1, t, t+1$ (so that $\Delta x_{it}$ is uncorrelated with
$\Delta u_{it}$). This is weaker than the strict exogeneity required for FE, in the sense that, for example,
$E(x_{it} u_{i,t-2}) = 0$ is not required. Thus if there is feedback from $u_{it}$ to future values of $x$
that takes two or more periods, FD will still be consistent whereas FE will not.
Why? (Hint: see the FE transformation)
To see how the FD estimator removes the individual effects to solve the endogeneity problem, we
difference the above equation:
(3.2)  $y_{it} - y_{i,t-1} = \beta_1 (x_{1it} - x_{1i,t-1}) + \beta_2 (x_{2it} - x_{2i,t-1}) + (u_{it} - u_{i,t-1})$
(3.3)  $\Delta y_{it} = \beta_1 \Delta x_{1it} + \beta_2 \Delta x_{2it} + \Delta u_{it}$
Clearly, because differencing removes the fixed effects $\alpha_i$, OLS applied to (3.3) yields consistent
estimates. This is the FD estimator.
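The differencing step can be sketched in the same way for a single regressor. The helper names first_differences and fd_slope and the simulated design (true $\beta = 1$, $\alpha_i$ correlated with $x_{it}$) are illustrative assumptions; rows are assumed to be sorted by individual and then by time.

```python
import random

def first_differences(ids, values):
    """Return v_it - v_i,t-1 for consecutive rows of the same individual.

    Assumes rows are sorted by individual and then by time.
    """
    return [values[k] - values[k - 1]
            for k in range(1, len(values)) if ids[k] == ids[k - 1]]

def fd_slope(ids, x, y):
    """OLS of differenced y on differenced x (single regressor, no intercept)."""
    dx, dy = first_differences(ids, x), first_differences(ids, y)
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

# x1 from the example table; the first period of each individual is lost.
print(first_differences([1, 1, 1, 2, 2, 2], [8, 12, 10, 7, 5, 3]))   # [4, -2, -2, -2]

# Simulated panel (assumed design): alpha_i enters both x and y, true beta = 1.
rng = random.Random(6)
ids, x, y = [], [], []
for i in range(2000):
    alpha = rng.gauss(0, 1)
    for _ in range(3):
        xit = alpha + rng.gauss(0, 1)
        ids.append(i)
        x.append(xit)
        y.append(1.0 * xit + alpha + rng.gauss(0, 1))
b_fd = fd_slope(ids, x, y)
print(round(b_fd, 2))
```

Differencing uses $T - 1$ observations per individual, and the simulated slope should again be close to the true value of 1.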
Which one to choose: FE or FD? Since FE and FD are two alternative ways of removing the fixed
effects, which one should we choose?
• First, these two methods are equivalent when $T = 2$. Prove this.

• If $u_{it}$ is a random walk (i.e. $u_{it} = u_{i,t-1} + \varepsilon_{it}$ with $\varepsilon_{it}$ serially
uncorrelated), then $\Delta u_{it}$ is serially uncorrelated and FD will be more efficient than FE. Prove
this. (Easy)
• Conversely, under the classical assumption that $u_{it} \sim i.i.d.(0, \sigma_u^2)$, FE will be more
efficient than the FD estimator (because in this case $\Delta u_{it}$ will exhibit negative serial correlation).
• Hence, it may be useful to test for the presence of a unit root in the residuals to see if FE or FD
should be preferred.
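Before attempting the proof for $T = 2$, it may help to confirm the claim numerically. The following sketch simulates a two-period panel under an assumed design (true $\beta = 0.5$, chosen arbitrarily) and computes both estimators without an intercept.

```python
import random

rng = random.Random(2)
panel = []   # one entry per individual: [(x_i1, y_i1), (x_i2, y_i2)]
for _ in range(500):
    alpha = rng.gauss(0, 1)
    obs = []
    for _ in range(2):
        x = alpha + rng.gauss(0, 1)
        obs.append((x, 0.5 * x + alpha + rng.gauss(0, 1)))
    panel.append(obs)

# FE: demean within each individual, then OLS without intercept.
num = den = 0.0
for (x1, y1), (x2, y2) in panel:
    xbar, ybar = (x1 + x2) / 2, (y1 + y2) / 2
    for x, y in ((x1, y1), (x2, y2)):
        num += (x - xbar) * (y - ybar)
        den += (x - xbar) ** 2
b_fe = num / den

# FD: OLS of (y_i2 - y_i1) on (x_i2 - x_i1) without intercept.
b_fd = (sum((x2 - x1) * (y2 - y1) for (x1, y1), (x2, y2) in panel)
        / sum((x2 - x1) ** 2 for (x1, y1), (x2, y2) in panel))

print(abs(b_fe - b_fd) < 1e-9)
```

The two slopes agree up to floating-point error. A useful hint for the proof: with two periods, each demeaned observation equals plus or minus one half of the first difference.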
4. Model 3: The Pooled OLS Estimator

Model:
(4.1)  $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$
where $t = 1, 2, ..., T$ and $i = 1, 2, ..., N$.

• Assumption 4.1: $\alpha_i$ is uncorrelated with $x_{it}$, i.e. $E(x_{it} \alpha_i) = 0$.

• Assumption 4.2: $E(x_{it} u_{it}) = 0$ ($x_{it}$ is predetermined).
Clearly, under these assumptions, $\vartheta_{it} = \alpha_i + u_{it}$ will be uncorrelated with $x_{it}$,
implying that we can consistently estimate the $\beta$s using OLS. In this context, we refer to this as the
Pooled OLS estimator or POLS. Inference based on the usual OLS standard errors requires homoscedasticity
and no serial correlation. The latter is restrictive here, so it is better to obtain standard errors
that are robust to both heteroscedasticity and serial correlation using the cluster option in Stata,
e.g. reg Y X1 X2, cluster(id).
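To see what clustering does, the sketch below computes the pooled OLS slope for a single regressor together with a conventional and a cluster-robust standard error. The design is assumed for illustration ($\alpha_i$ independent of $x_{it}$, a persistent component in $x_{it}$, true $\beta = 1$), and the cluster formula is the basic sandwich form without any finite-sample correction, so the numbers may differ slightly from software that applies one.

```python
import math
import random
from collections import defaultdict

rng = random.Random(3)
n, t, beta = 500, 5, 1.0           # assumed design, for illustration only
ids, x, y = [], [], []
for i in range(n):
    alpha, c = rng.gauss(0, 1), rng.gauss(0, 1)   # independent, so Assumption 4.1 holds
    for _ in range(t):
        xit = c + rng.gauss(0, 1)                  # c makes x persistent within individual
        ids.append(i)
        x.append(xit)
        y.append(beta * xit + alpha + rng.gauss(0, 1))

sxx = sum(v * v for v in x)
b_pols = sum(p * q for p, q in zip(x, y)) / sxx
resid = [q - b_pols * p for p, q in zip(x, y)]

# Conventional standard error: treats all N*T residuals as independent.
s2 = sum(e * e for e in resid) / (len(x) - 1)
se_conventional = math.sqrt(s2 / sxx)

# Cluster-robust standard error: sum the scores x_it * e_it within each individual first.
score = defaultdict(float)
for i, p, e in zip(ids, x, resid):
    score[i] += p * e
se_cluster = math.sqrt(sum(s * s for s in score.values())) / sxx

print(round(b_pols, 2), se_cluster > se_conventional)
```

The slope is close to the true value, but the conventional standard error understates the sampling uncertainty because it ignores the within-individual correlation induced by $\alpha_i$.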
5. Model 4: The Random Effects Estimator

Model:
(5.1)  $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$
where $t = 1, 2, ..., T$ and $i = 1, 2, ..., N$.

• Assumption 5.1: $\alpha_i$ is uncorrelated with $x_{it}$, i.e. $E(x_{it} \alpha_i) = 0$.

• Assumption 5.2: $E(x_{it} u_{is}) = 0$ for $s = 1, 2, ..., T$ (strict exogeneity).
Note that this combines the strongest assumption underlying FE/FD estimation (strict exogeneity)
with the strongest assumption underlying Pooled OLS (no correlation between the time-invariant part of
the residual and the explanatory variables). Why we need these assumptions will be clear below.
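Why is Pooled OLS inefficient in this case? This is because the residual $\vartheta_{it} = \alpha_i + u_{it}$
is serially correlated. To see this, note that for $s \neq 0$
$E(\vartheta_{it} \vartheta_{i,t-s}) = E[(\alpha_i + u_{it})(\alpha_i + u_{i,t-s})] = E[\alpha_i^2 + u_{it} \alpha_i + \alpha_i u_{i,t-s} + u_{it} u_{i,t-s}] = E(\alpha_i^2) = \sigma_\alpha^2$
So,
$\mathrm{corr}(\vartheta_{it}, \vartheta_{i,t-s}) = \frac{E(\vartheta_{it} \vartheta_{i,t-s})}{\sqrt{\sigma_{\vartheta_t}^2} \sqrt{\sigma_{\vartheta_{t-s}}^2}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}$
since $\sigma_{\vartheta_t}^2 = \sigma_{\vartheta_{t-s}}^2 = \sigma_\vartheta^2 = \sigma_\alpha^2 + \sigma_u^2$. I assume
in the above calculation that $u_{it}$ is serially uncorrelated and uncorrelated with $\alpha_i$.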
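The correlation just derived can be checked by simulation. The sketch assumes $\sigma_\alpha^2 = 1$ and $\sigma_u^2 = 3$ (values chosen here for illustration), for which the formula gives 0.25.

```python
import math
import random

sigma_a, sigma_u = 1.0, math.sqrt(3.0)   # assumed: sigma_alpha^2 = 1, sigma_u^2 = 3
rng = random.Random(4)
v1, v2 = [], []                          # composite errors in two different periods
for _ in range(5000):
    alpha = rng.gauss(0, sigma_a)        # shared by both periods of an individual
    v1.append(alpha + rng.gauss(0, sigma_u))
    v2.append(alpha + rng.gauss(0, sigma_u))

def corr(a, b):
    """Sample correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

theory = sigma_a ** 2 / (sigma_a ** 2 + sigma_u ** 2)
sample = corr(v1, v2)
print(round(theory, 2), round(sample, 2))
```

The sample correlation should be close to the theoretical value. Note that the correlation does not die out as the two periods move further apart, because $\alpha_i$ is common to all periods.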
Clearly, in the presence of this particular form of serial correlation via individual effects, the OLS
estimators remain unbiased and consistent. But because the composite error is serially correlated, the
conventionally computed standard errors of the OLS estimators are incorrect, and OLS is not efficient.
RE estimation therefore relies on the more efficient GLS (Generalized Least Squares) estimator to
estimate the coefficients in the model.
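The notes do not spell out the GLS transformation. In the standard textbook treatment for a balanced panel, it amounts to quasi-demeaning with $\theta = 1 - \sqrt{\sigma_u^2 / (\sigma_u^2 + T \sigma_\alpha^2)}$, followed by OLS on the transformed data. The sketch below assumes that form, treats the variance components as known (in practice they are estimated, which gives feasible GLS), and uses a hypothetical helper re_slope with an assumed simulated design.

```python
import math
import random
from collections import defaultdict

def re_slope(ids, x, y, sigma_a2, sigma_u2, t):
    """Random-effects GLS slope via quasi-demeaning (balanced panel, one regressor)."""
    theta = 1.0 - math.sqrt(sigma_u2 / (sigma_u2 + t * sigma_a2))
    xsum, ysum = defaultdict(float), defaultdict(float)
    for i, a, b in zip(ids, x, y):
        xsum[i] += a
        ysum[i] += b
    # Subtract only a fraction theta of each individual's mean.
    xq = [a - theta * xsum[i] / t for i, a in zip(ids, x)]
    yq = [b - theta * ysum[i] / t for i, b in zip(ids, y)]
    return sum(a * b for a, b in zip(xq, yq)) / sum(a * a for a in xq), theta

rng = random.Random(5)
n, t = 1000, 4                      # assumed design, for illustration only
ids, x, y = [], [], []
for i in range(n):
    alpha, c = rng.gauss(0, 1), rng.gauss(0, 1)   # alpha independent of x (Assumption 5.1)
    for _ in range(t):
        xit = c + rng.gauss(0, 1)
        ids.append(i)
        x.append(xit)
        y.append(1.0 * xit + alpha + rng.gauss(0, 1))

b_re, theta = re_slope(ids, x, y, sigma_a2=1.0, sigma_u2=1.0, t=t)
print(round(theta, 3), round(b_re, 2))
```

Under this form, $\theta = 0$ reproduces Pooled OLS and $\theta = 1$ reproduces the within (FE) transformation, so RE can be read as a compromise between the two.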

6. Choice between The Fixed Effects or The Random Effects?

The choice between FE and RE models is not easy and, particularly when $T$ is small, the differences in
the estimates can be substantial. When individuals, firms or countries (in short, the cross sections)
can be treated as one of a kind rather than random draws from a population, we should use FE
estimation. However, even when we think RE is appropriate, FE may be preferred because of the possible
correlation between $\alpha_i$ and $x_{it}$ (Assumption 3.1; see footnote 3). Still, we can rely on the
Hausman test to make a choice. This test is based on the null hypothesis that $\alpha_i$ and $x_{it}$ are
uncorrelated (such that RE is appropriate) against the alternative that they are correlated (favoring FE).
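For reference, the Hausman statistic is usually written as below. This is the standard textbook form rather than something stated in these notes; $k$ (the number of time-varying regressors) and the $\widehat{Var}$ notation are introduced here for illustration.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{Var}(\hat{\beta}_{FE}) - \widehat{Var}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\overset{a}{\sim}\; \chi^2_k
    \quad \text{under } H_0 : E(x_{it} \alpha_i) = 0
```

A large value of $H$ (a small p-value) rejects the null and favors FE. Failure to reject is consistent with using the more efficient RE estimator.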
Footnote 3: Ignoring this correlation may render RE estimation inconsistent.
