
ECO 311, Econometrics II
Prof. Erdinç

PANEL DATA CLASS NOTES

1. INTRODUCTION
Panel data-sets follow a random sample of individuals (or firms, households, etc.) over time. The big advantage of working with panel data is that we can control for individual-specific, time-invariant, unobserved heterogeneity, the presence of which could lead to bias in standard estimators like OLS. We can also estimate dynamic equations.

Panel data combine a time series dimension with a cross section dimension, in such a way that there are data on $N$ individuals (or firms, countries, ...) followed over $T$ time periods. Not all data-sets that combine a time series dimension with a cross section dimension are panel data-sets, however. It is important to distinguish panel data from repeated cross-sections.


1.1. A Panel Data-Set.

id     year   yr92   yr93   yr94   x1     x2
1      1992   1      0      0      8      1
1      1993   0      1      0      12     1
1      1994   0      0      1      10     1
2      1992   1      0      0      7      0
2      1993   0      1      0      5      0
2      1994   0      0      1      3      0
(...)  (...)  (...)  (...)  0      (...)  (...)

where id is the variable identifying the individual that we follow over time; yr92, yr93 and yr94 are time dummies, constructed from the year variable; x1 is an example of a time-varying variable and x2 is an example of a time-invariant variable.
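As a rough illustration of the layout above, the following Python sketch rebuilds the six listed rows, constructs the time dummies from the year variable, and checks which variables change over time within an individual. The helper names (year_dummies, varies_within) are made up for this example and are not part of the notes.

```python
# Rows copied from the table above: (id, year, x1, x2).
rows = [
    (1, 1992, 8, 1), (1, 1993, 12, 1), (1, 1994, 10, 1),
    (2, 1992, 7, 0), (2, 1993, 5, 0), (2, 1994, 3, 0),
]

def year_dummies(year, years=(1992, 1993, 1994)):
    """Return the time dummies (yr92, yr93, yr94) for one observation."""
    return tuple(int(year == y) for y in years)

def varies_within(rows, col):
    """True if the variable in column `col` changes over time for some id."""
    by_id = {}
    for r in rows:
        by_id.setdefault(r[0], set()).add(r[col])
    return any(len(values) > 1 for values in by_id.values())

# Each panel row is (id, year, yr92, yr93, yr94, x1, x2), as in the table.
panel = [(i, yr) + year_dummies(yr) + (x1, x2) for i, yr, x1, x2 in rows]
for obs in panel:
    print(obs)

print("x1 varies over time:", varies_within(rows, 2))
print("x2 varies over time:", varies_within(rows, 3))
```

The id column is what makes this a panel: it lets us group rows by individual, which is exactly what varies_within relies on.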

• In microeconomic data, $N$ (the number of individuals, firms, ...) is typically large, while $T$ is small.

• In aggregate data, longer $T$ is more common.

• In the opposite case, say $N = 5$ countries and $T = 40$ years, the topic becomes multiple time series.

Throughout this lecture, we will focus mostly on the large $N$, small $T$ case.

• If the time periods for which we have data are the same for all $N$ individuals, e.g. $t = 1, 2, \ldots, T$, then we have a balanced panel. In practice, it is common that the length of the time series and/or the time periods differ across individuals. In such a case the panel is unbalanced.¹

• Repeated cross sections are not the same as panel data. Repeated cross sections are obtained by sampling from the same population at different points in time. The identity of the individuals (or firms, households, etc.) is not recorded, and there is no attempt to follow individuals over time. This is the key reason why pooled cross sections are different from panel data. Had the id variable in the example above not been available, we would have referred to this as a pooled repeated cross-section data set.

¹ Analyzing unbalanced panel data typically raises few additional issues compared with analysis of balanced data. However, if the panel is unbalanced for reasons that are not entirely random (e.g. because firms with relatively low levels of productivity have relatively high exit rates), then we may need to take this into account when estimating the model. This can be done by means of a sample selection model (more on this later). We abstract from this particular problem here.

1.2. Advantages of Panel Approach. When we have a data set with both a time series and a cross-section dimension, this opens up new opportunities in our research. For example:

• With a larger sample size than a single cross-section, we should be able to obtain more precise estimates (i.e. lower standard errors).

• We can now ask how certain effects evolve over time (e.g. a time trend in the dependent variable, or changes in the coefficients of the model).

• Panel data enable us to solve an omitted variables problem. Panel data also enable us to estimate dynamic equations (e.g. specifications with lagged dependent variables on the right-hand side).

1.3. Using Panel Data To Address an Endogeneity Problem. One of the main advantages of panel data is that such data can be used to solve an omitted variables problem. Suppose our model is

$y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where we observe $y_{it}$, $x_{1it}$ and $x_{2it}$, but $\alpha_i$ and $u_{it}$ are not observed. Our goal is to estimate the parameters $\beta_1$ and $\beta_2$.

• Throughout this lecture we will assume that the residual $u_{it}$, which varies both over time and across individuals, is serially uncorrelated.

• Our problem is that we do not observe $\alpha_i$, which is constant over time for each individual (hence no $t$ subscript) but varies across individuals. Hence if we estimate the model in levels using OLS, then $\alpha_i$ will go into the error term: $v_{it}^{OLS} = \alpha_i + u_{it}$.

What would be the consequence of $\alpha_i$ going into the error term?

• If $\alpha_i$ is uncorrelated with $x_{it}$, then $\alpha_i$ is just another unobserved factor making up the residual. However, OLS will not be BLUE, because the error term $v_{it}^{OLS}$ is serially correlated:

$\mathrm{corr}(v_{it}^{OLS}, v_{it-s}^{OLS}) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}$ for $s = 1, 2, \ldots$

This suggests some feasible generalized least squares estimator could be preferable (this is indeed the case; see below). Notice that OLS would be consistent,² however, and the only substantive problem with relying on OLS for this model is that the standard formula for calculating the standard errors is wrong. This problem is straightforward to solve, e.g. by clustering the standard errors on the individuals, using the cluster option in Stata (cluster(id) option after reg).

• But if $\alpha_i$ is correlated with $x_{it}$, then putting $\alpha_i$ in the error term can cause serious problems. This, of course, is an omitted variables problem, so we can use some familiar results to understand the nature of the problem. For the single-regressor model,

$\mathrm{plim}(\hat\beta^{OLS}) = \beta + \frac{\mathrm{cov}(x_{it}, \alpha_i)}{\sigma_x^2}$

which shows that the OLS estimator is inconsistent unless $\mathrm{cov}(x_{it}, \alpha_i) = 0$. If $x_{it}$ is positively correlated with the unobserved effect, then there is an upward bias. If the correlation is negative, we get a downward bias.

² Consistency is a minimal requirement for an estimator. Simplistically, it means that the distribution of the estimator $\hat\beta$ collapses to the single point of the true $\beta$ as $n$ goes to infinity. It is an asymptotic property. If obtaining more data does not get us closer to the parameter value $\beta$, then we are using a poor estimation procedure.
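The bias formula for the single-regressor case can be checked with a small simulation. The sketch below is only illustrative: the sample sizes, the value of beta and the way x is generated (x equal to the individual effect plus noise, so that cov(x, alpha) = 1 and var(x) = 2) are assumptions chosen for the example, not values from the notes. The regression includes an intercept, which is harmless here because all variables have mean zero.

```python
import random

def ols_slope(x, y):
    """Slope from a regression of y on x with an intercept: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
N, T, beta = 2000, 3, 1.0              # illustrative values
x, y = [], []
for i in range(N):
    alpha = random.gauss(0, 1)         # unobserved individual effect
    for t in range(T):
        x_it = alpha + random.gauss(0, 1)   # x is correlated with alpha by construction
        u_it = random.gauss(0, 1)           # idiosyncratic error
        x.append(x_it)
        y.append(beta * x_it + alpha + u_it)

b_ols = ols_slope(x, y)
predicted_plim = beta + 1.0 / 2.0      # beta + cov(x, alpha) / var(x)
print("pooled OLS slope:", round(b_ols, 3))
print("predicted plim  :", predicted_plim)
```

With a positive covariance between x and the individual effect, the pooled OLS slope settles near the predicted probability limit rather than near the true beta.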

• Can you think of applications for which a specification like the following

$y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

would be appropriate? How about:

• Individual earnings

• Household expenditures

• Firm investment

• Country income per capita.

What factors can reasonably be represented by $\alpha_i$? Can these be assumed uncorrelated with $x_{it}$?

2. Model 1: The Fixed Effects ("Within") Estimator


Model:

(2.1) $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$.

Assumptions about unobserved terms:

• Assumption 2.1: $\alpha_i$ is freely correlated with $x_{it}$.

• Assumption 2.2: $E(x_{it} u_{is}) = 0$ for $s = 1, 2, \ldots, T$ (strict exogeneity).

We have seen that if $\alpha_i$ is correlated with the variables in the $x_{it}$ vector, there will be an endogeneity problem which would bias the OLS estimates. Under Assumptions 2.1 and 2.2, we can use the Fixed Effects (FE) or the First Differenced (FD) estimator to obtain consistent estimates of the $\beta$ coefficients while allowing $\alpha_i$ to be freely correlated with $x_{it}$.


Note that strict exogeneity rules out feedback from past $u_{is}$ shocks to current $x_{it}$. One implication of this is that FE and FD will not yield consistent estimates if $x_{it}$ contains lagged dependent variables ($y_{i,t-1}$, $y_{i,t-2}$, ...). We will discuss such cases under GMM and instrumental variable estimation.

To see how the FE estimator solves the endogeneity problem that would contaminate the OLS estimates, begin by taking the average of the equation above for each individual. This gives:

(2.2) $\bar{y}_i = \beta_1 \bar{x}_{1i} + \beta_2 \bar{x}_{2i} + (\alpha_i + \bar{u}_i)$

where $i = 1, 2, \ldots, N$ and $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$, and so on. Now subtracting (2.2) from (2.1),

$y_{it} - \bar{y}_i = (\beta_1 x_{1it} - \beta_1 \bar{x}_{1i}) + (\beta_2 x_{2it} - \beta_2 \bar{x}_{2i}) + (\alpha_i - \alpha_i) + (u_{it} - \bar{u}_i)$

$y_{it} - \bar{y}_i = \beta_1 (x_{1it} - \bar{x}_{1i}) + \beta_2 (x_{2it} - \bar{x}_{2i}) + (u_{it} - \bar{u}_i)$

which we can write as:

(2.3) $\ddot{y}_{it} = \beta_1 \ddot{x}_{1it} + \beta_2 \ddot{x}_{2it} + \ddot{u}_{it}$

where $\ddot{y}_{it} = y_{it} - \bar{y}_i$ is the time-demeaned data, and similarly for the other variables.

This transformation of the original equation, known as the within transformation, has eliminated $\alpha_i$ from the equation. Hence, we can estimate the coefficients consistently by using OLS on (2.3). This is called the within estimator or the Fixed Effects estimator.

In Stata, once the panel structure has been declared (e.g. xtset id year), we obtain FE estimates from the xtreg command: xtreg Y X1 X2, fe
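The within transformation can be traced by hand on a tiny panel. The following Python sketch is a minimal single-regressor illustration: the x values are the x1 column of the example table, while the individual effects, the value of beta and the choice to set the idiosyncratic error to zero are all made up so that the arithmetic is easy to follow. The function names are hypothetical.

```python
def demean_by_group(ids, values):
    """Within transformation: subtract each individual's time average."""
    totals, counts = {}, {}
    for i, v in zip(ids, values):
        totals[i] = totals.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [v - totals[i] / counts[i] for i, v in zip(ids, values)]

def slope_no_intercept(x, y):
    """OLS slope of y on x without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

ids   = [1, 1, 1, 2, 2, 2]
x     = [8.0, 12.0, 10.0, 7.0, 5.0, 3.0]
alpha = {1: 20.0, 2: -5.0}     # made-up individual effects, correlated with x
beta  = 2.0                    # made-up true coefficient
y = [beta * xv + alpha[i] for i, xv in zip(ids, x)]   # u_it set to zero for clarity

x_dd = demean_by_group(ids, x)  # time-demeaned regressor
y_dd = demean_by_group(ids, y)  # time-demeaned outcome: alpha_i has dropped out

b_pooled = slope_no_intercept(x, y)        # levels regression, contaminated by alpha_i
b_fe = slope_no_intercept(x_dd, y_dd)      # OLS on the demeaned equation
print("pooled slope:", round(b_pooled, 3))
print("FE slope    :", round(b_fe, 3))
```

Because alpha_i is constant within each individual, it is removed exactly by the demeaning step, and the FE slope recovers beta while the levels regression does not.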

3. Model 2: The First Differencing Estimator


Model:

(3.1) $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$.

Assumptions about unobserved terms:

• Assumption 3.1: $\alpha_i$ is freely correlated with $x_{it}$.

• Assumption 3.2: $E(x_{it} u_{is}) = 0$ for $s = t, t-1$. This is weaker than the strict exogeneity required for FE, in the sense that $E(x_{it} u_{it-2}) = 0$ is not required. Thus if there is feedback from $u_{it}$ to $x_{it}$ that takes two or more periods, FD will still be consistent whereas FE will not. Why? (Hint: see the FE transformation.)

To see how the FD estimator removes the individual effects to solve the endogeneity problem, we difference the above equation:

(3.2) $y_{it} - y_{it-1} = \beta_1 (x_{1it} - x_{1it-1}) + \beta_2 (x_{2it} - x_{2it-1}) + (u_{it} - u_{it-1})$

(3.3) $\Delta y_{it} = \beta_1 \Delta x_{1it} + \beta_2 \Delta x_{2it} + \Delta u_{it}$

Clearly, because differencing removes the fixed effects, OLS on (3.3) generates consistent estimates.
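First differencing can be sketched in the same way. The example below assumes a single regressor, rows sorted by individual and then by year, made-up individual effects and a made-up beta, and sets the idiosyncratic error to zero so the result is exact. The function names are hypothetical.

```python
def first_difference(ids, values):
    """Return v_it - v_i,t-1 for consecutive rows that belong to the same individual."""
    return [values[k] - values[k - 1]
            for k in range(1, len(values)) if ids[k] == ids[k - 1]]

def slope_no_intercept(x, y):
    """OLS slope of y on x without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

ids   = [1, 1, 1, 2, 2, 2]     # sorted by id, then by year
x     = [8.0, 12.0, 10.0, 7.0, 5.0, 3.0]
alpha = {1: 20.0, 2: -5.0}     # made-up individual effects
beta  = 2.0                    # made-up true coefficient
y = [beta * xv + alpha[i] for i, xv in zip(ids, x)]   # u_it set to zero for clarity

dx = first_difference(ids, x)  # N * (T - 1) = 4 differenced observations
dy = first_difference(ids, y)  # alpha_i cancels in each difference

b_fd = slope_no_intercept(dx, dy)
print("differenced x:", dx)
print("differenced y:", dy)
print("FD slope     :", b_fd)
```

Note that differencing costs one observation per individual, so the estimation sample shrinks from N times T to N times (T - 1) rows.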

Which one to choose: FE or FD? Since FE and FD are two alternative ways of removing the fixed effects, which one should we choose?

• First, these two methods are equivalent when $T = 2$. Prove this.

• If $u_{it}$ is a random walk (i.e. $u_{it} = u_{it-1} + \varepsilon_{it}$ with $\varepsilon_{it}$ serially uncorrelated), then $\Delta u_{it}$ is serially uncorrelated and FD will be more efficient than FE. Prove this. (Easy)

• Conversely, under the classical assumption that $u_{it} \sim i.i.d.(0, \sigma_u^2)$, the FE estimator will be more efficient than the FD estimator (because in this case $\Delta u_{it}$ will exhibit negative serial correlation).

• Hence, it may be useful to test for the presence of a unit root in the residuals to see if FE or FD should be preferred.
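The equivalence of FE and FD when T = 2 can be checked numerically before attempting the proof. In the sketch below the data are arbitrary numbers (no model is imposed), there is a single regressor and no intercept, and the function names are hypothetical. The intuition is that with two periods each demeaned observation equals plus or minus half the first difference, so both slopes reduce to the same ratio.

```python
def slope_no_intercept(x, y):
    """OLS slope of y on x without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def demean_by_group(ids, values):
    """Subtract each individual's time average (within transformation)."""
    totals, counts = {}, {}
    for i, v in zip(ids, values):
        totals[i] = totals.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [v - totals[i] / counts[i] for i, v in zip(ids, values)]

def first_difference(ids, values):
    """Difference consecutive rows that belong to the same individual."""
    return [values[k] - values[k - 1]
            for k in range(1, len(values)) if ids[k] == ids[k - 1]]

# Three individuals observed for T = 2 periods, sorted by id and period.
ids = [1, 1, 2, 2, 3, 3]
x   = [1.0, 4.0, 2.0, 3.0, 5.0, 1.0]
y   = [2.0, 9.0, 1.0, 2.5, 7.0, 0.0]

b_fe = slope_no_intercept(demean_by_group(ids, x), demean_by_group(ids, y))
b_fd = slope_no_intercept(first_difference(ids, x), first_difference(ids, y))
print("FE slope:", b_fe)
print("FD slope:", b_fd)
```

With T greater than 2 the two estimators generally differ, which is why the serial correlation properties of the error matter for the choice.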

4. Model 3: The Pooled OLS Estimator


Model:

(4.1) $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$.

Assumptions about unobserved terms:

• Assumption 4.1: $\alpha_i$ is uncorrelated with $x_{it}$, i.e. $E(x_{it} \alpha_i) = 0$.

• Assumption 4.2: $E(x_{it} u_{it}) = 0$ ($x_{it}$ is predetermined).

Clearly, under these assumptions, $\vartheta_{it} = \alpha_i + u_{it}$ will be uncorrelated with $x_{it}$, implying that we can consistently estimate the $\beta$s using OLS. In this context, we refer to this as the Pooled OLS estimator or POLS. To do inference based on the usual OLS standard errors, we need to assume homoscedasticity and no serial correlation. The latter is restrictive here, so it is better to obtain standard errors that are robust to both heteroscedasticity and serial correlation, using the cluster option in Stata, e.g. reg Y X1 X2, cluster(id).
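To make the idea of clustering concrete, the sketch below computes a conventional and a cluster-robust standard error for a single-regressor pooled OLS without an intercept. It is a simplified illustration, not a reproduction of what Stata does: in particular it omits the finite-sample correction that Stata applies, and the simulated design (sample sizes, a persistent component in x, unit variances) is an assumption made for the example.

```python
import math
import random

def pooled_ols_with_se(ids, x, y):
    """Pooled OLS of y on x (no intercept) with conventional and cluster-robust SEs."""
    sxx = sum(a * a for a in x)
    b = sum(a * c for a, c in zip(x, y)) / sxx
    resid = [c - b * a for a, c in zip(x, y)]
    n = len(x)
    # Conventional formula: assumes homoscedastic, serially uncorrelated errors.
    s2 = sum(e * e for e in resid) / (n - 1)
    se_conventional = math.sqrt(s2 / sxx)
    # Cluster-robust version: sum x_it * e_it within each individual first,
    # so that correlation of the errors within an individual is allowed for.
    score = {}
    for i, a, e in zip(ids, x, resid):
        score[i] = score.get(i, 0.0) + a * e
    meat = sum(s * s for s in score.values())
    se_cluster = math.sqrt(meat) / sxx
    return b, se_conventional, se_cluster

random.seed(1)
N, T, beta = 1000, 5, 1.0               # illustrative values
ids, x, y = [], [], []
for i in range(N):
    alpha = random.gauss(0, 1)          # individual effect, independent of x
    x_i = random.gauss(0, 1)            # persistent component of x
    for t in range(T):
        x_it = x_i + random.gauss(0, 1)
        ids.append(i)
        x.append(x_it)
        y.append(beta * x_it + alpha + random.gauss(0, 1))

b, se_conv, se_clu = pooled_ols_with_se(ids, x, y)
print("slope:", round(b, 3))
print("conventional SE:", round(se_conv, 4), " clustered SE:", round(se_clu, 4))
```

In this design both the regressor and the composite error are positively correlated within individuals, so the conventional formula understates the sampling variability and the clustered standard error comes out larger.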

5. Model 4: The Random Effects Estimator


Model:

(5.1) $y_{it} = \beta_1 x_{1it} + \beta_2 x_{2it} + (\alpha_i + u_{it})$

where $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, N$.

Assumptions about unobserved terms:

• Assumption 5.1: $\alpha_i$ is uncorrelated with $x_{it}$, i.e. $E(x_{it} \alpha_i) = 0$.

• Assumption 5.2: $E(x_{it} u_{is}) = 0$ for $s = 1, 2, \ldots, T$ (strict exogeneity).

Note that this combines the strongest assumptions underlying FE/FD estimation (strict exogeneity) with the strongest assumptions underlying Pooled OLS (no correlation between the time-invariant part of the residual and the explanatory variables). Why we need these assumptions will be clear below.

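Why is Pooled OLS inefficient in this case? This is because the residual $\vartheta_{it} = \alpha_i + u_{it}$ is serially correlated. To see this, note that

$E(\vartheta_{it} \vartheta_{it-s}) = E[(\alpha_i + u_{it})(\alpha_i + u_{it-s})] = E[\alpha_i^2 + u_{it}\alpha_i + \alpha_i u_{it-s} + u_{it} u_{it-s}] = E(\alpha_i^2) = \sigma_\alpha^2$

So,

$\mathrm{corr}(\vartheta_{it}, \vartheta_{it-s}) = \frac{E(\vartheta_{it} \vartheta_{it-s})}{\sqrt{\sigma_{\vartheta_t}^2}\sqrt{\sigma_{\vartheta_{t-s}}^2}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_u^2}$

since $\sigma_{\vartheta_t}^2 = \sigma_{\vartheta_{t-s}}^2 = \sigma_\vartheta^2 = \sigma_\alpha^2 + \sigma_u^2$. I assume that $u_{it}$ is serially uncorrelated in the above calculation.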
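The correlation formula can be checked by simulation. The sketch below draws two periods of the composite error for many individuals and compares the sample correlation with the predicted value. The variances (1 for the individual effect, 4 for the idiosyncratic error) and the sample size are assumptions chosen for the example.

```python
import random

def correlation(a, b):
    """Sample correlation coefficient between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

random.seed(2)
sigma_alpha, sigma_u = 1.0, 2.0     # standard deviations, so the variances are 1 and 4
N = 20000                           # number of individuals
v_lag, v_now = [], []
for _ in range(N):
    alpha = random.gauss(0, sigma_alpha)              # shared by both periods
    v_lag.append(alpha + random.gauss(0, sigma_u))    # composite error in period t-1
    v_now.append(alpha + random.gauss(0, sigma_u))    # composite error in period t

predicted = sigma_alpha ** 2 / (sigma_alpha ** 2 + sigma_u ** 2)
sample = correlation(v_now, v_lag)
print("predicted correlation:", predicted)
print("sample correlation   :", round(sample, 3))
```

The only source of correlation across periods is the shared individual effect, so the correlation does not die out as the gap between periods grows.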

Clearly, this particular form of serial correlation, which arises through the individual effects, still leaves the OLS estimators unbiased and consistent. But because the composite error follows a specific type of serial correlation, OLS is inefficient and its conventionally computed standard errors are incorrect. RE estimation therefore relies on the more efficient GLS (Generalized Least Squares) estimator to estimate the coefficients in the model.
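As a sketch of what the GLS step amounts to, the standard textbook result for a balanced panel is that RE can be computed as OLS on quasi-demeaned data. The notes do not derive this, so the expression below is offered as background rather than as the author's derivation; the symbol $\theta$ is introduced here and does not appear elsewhere in the notes.

```latex
\theta = 1 - \sqrt{\frac{\sigma_u^2}{\sigma_u^2 + T\sigma_\alpha^2}}, \qquad
y_{it} - \theta\bar{y}_i
  = \beta_1 (x_{1it} - \theta\bar{x}_{1i})
  + \beta_2 (x_{2it} - \theta\bar{x}_{2i})
  + (\vartheta_{it} - \theta\bar{\vartheta}_i)
```

When $\theta = 0$ this is Pooled OLS, and when $\theta = 1$ it is the within (FE) transformation, so RE sits between the two. In practice $\sigma_\alpha^2$ and $\sigma_u^2$ are unknown and must be estimated first, which is what makes the estimator feasible GLS.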



6. Choice between Fixed Effects and Random Effects


The choice between FE and RE models is not easy, and particularly when $T$ is small, the differences in the estimates can be substantial. When individuals, firms or countries (in short, the cross-section units) can be treated as one of a kind rather than as random draws from a population, we should use FE estimation. Moreover, even when we think RE is appropriate, FE may be preferred because of the possible correlation between $\alpha_i$ and $x_{it}$ (Assumption 3.1).³ Still, we can rely on the Hausman test to make a choice. The null hypothesis of this test is that $\alpha_i$ and $x_{it}$ are uncorrelated (such that RE is appropriate), against the alternative that they are correlated (favoring FE).
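For a single coefficient, the Hausman statistic takes a simple form: the squared difference between the FE and RE estimates divided by the difference in their variances, compared with a chi-squared distribution with one degree of freedom. The Python sketch below implements only this scalar case. The function name and the numerical inputs are made up for illustration and are not estimates from any data set.

```python
import math

def hausman_single(b_fe, var_fe, b_re, var_re):
    """Hausman statistic and p-value for one coefficient (chi-squared, 1 df)."""
    diff_var = var_fe - var_re      # positive if RE is efficient under the null
    if diff_var <= 0:
        raise ValueError("var_fe - var_re must be positive for this form of the test")
    h = (b_fe - b_re) ** 2 / diff_var
    # For 1 degree of freedom, P(chi2 > h) = P(|Z| > sqrt(h)) = erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value

# Hypothetical estimates: the FE and RE slopes differ noticeably.
h, p = hausman_single(b_fe=1.0, var_fe=0.04, b_re=1.4, var_re=0.03)
print("Hausman statistic:", round(h, 3))
print("p-value          :", p)
```

A large statistic (small p-value) is evidence against the null that the individual effects are uncorrelated with the regressors, which favors FE. With several coefficients the same idea applies in matrix form. In Stata the test is available through the hausman command applied to stored FE and RE estimates.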


³ Ignoring this correlation may render RE estimation inconsistent.

