Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which
the behavior of entities is observed across time. These entities could be states, companies,
individuals, countries, etc. Panel data also goes by other names, such as pooled data and
micropanel data.
Panel data allows you to control for variables you cannot observe or measure, like cultural
factors or differences in business practices across companies, or for variables that change over time
but not across entities (e.g. national policies, federal regulations, international agreements,
etc.). That is, it accounts for individual heterogeneity.
BALANCED PANEL:
A panel is said to be balanced if each subject (firm, individual, etc.) has the same number of
observations.
UNBALANCED PANEL:
A panel is said to be unbalanced if entities have different numbers of observations.
In the panel data literature, two further terms are used:
SHORT PANEL: In a short panel the number of cross-sectional subjects, N, is greater than
the number of time periods, T.
LONG PANEL: In a long panel the number of cross-sectional subjects, N, is less than the
number of time periods, T.
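These definitions are easy to check mechanically: count the observations per entity and compare. A minimal Python sketch (the entity/year pairs are hypothetical, not from the examples below):

```python
from collections import Counter

def panel_shape(observations):
    """Classify a panel given stacked (entity, time) observation pairs.

    Returns (balanced, N, T_max) where `balanced` is True when every
    entity has the same number of time observations.
    """
    counts = Counter(entity for entity, _ in observations)
    n_entities = len(counts)           # N: number of cross-sectional units
    per_entity = set(counts.values())  # distinct observation counts per entity
    balanced = len(per_entity) == 1
    return balanced, n_entities, max(counts.values())

# A balanced panel: 3 firms, each observed in 2 years (N=3 > T=2, a short panel)
balanced_panel = [("A", 2020), ("A", 2021), ("B", 2020),
                  ("B", 2021), ("C", 2020), ("C", 2021)]
# An unbalanced panel: firm C is missing 2021
unbalanced_panel = balanced_panel[:-1]

print(panel_shape(balanced_panel))    # (True, 3, 2)
print(panel_shape(unbalanced_panel))  # (False, 3, 2)
```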
Labour economics, welfare economics and several other fields rely heavily on
household panel studies.
Panels are more informative than simple time series of aggregates, as they allow
tracking individual histories. A 10% unemployment rate is less informative than a
panel showing whether every individual is unemployed 10% of the time or the same 10%
of individuals are always unemployed. Panels are also more informative than
cross-sections, as they reflect dynamics and Granger causality across variables.
Panel data provides a means of resolving or reducing the magnitude of econometric
problems that often arise in empirical studies, namely the often-heard assertion that
the real reason one finds (or does not find) certain effects is the presence of omitted
(mismeasured or unobserved) variables that are correlated with the explanatory variables.
Panels allow us to study individual dynamics (e.g. separating age and cohort effects).
They give information on the time-ordering of events.
They allow us to control for individual unobserved heterogeneity.
y_it = a + b*x_it + e_it,   where i = 1, ..., N and t = 1, ..., T
Note that the double-subscripted notation indicates that we are dealing with a panel data set.
This is a pooled regression model, since we pool all the observations into a single OLS
regression. The model implicitly assumes that the coefficients (including the
intercepts) are the same for all individuals.
In order for the OLS estimates to be unbiased and consistent, the regressors must
satisfy the exogeneity assumption.
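Since pooled OLS simply stacks all (i, t) observations into one regression, with a single regressor it reduces to the textbook least-squares formulas. A pure-Python sketch on toy stacked data (not the working-example data below):

```python
def pooled_ols(xs, ys):
    """Pooled OLS with one regressor: stack all (i, t) observations
    and fit y = a + b*x by the usual least-squares formulas."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return a, b

# Toy stacked observations lying exactly on y = 1 + 2x,
# so OLS recovers intercept 1 and slope 2.
a, b = pooled_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```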
Suppose each individual i has time-invariant but unique effects on the dependent
variable. Since the pooled regression model neglects the heterogeneity across
individuals and assumes the same coefficients for all individuals, those effects unique to
each individual are all subsumed in the error term e_it.
If this is the case, the explanatory variables will no longer be uncorrelated with the
error terms, and the estimates from the pooled OLS regression will be biased and
inconsistent.
y_it = b*x_it + g_1*D_1i + g_2*D_2i + ... + g_N*D_Ni + e_it
Note that we need to drop the dummy variable for one individual (or drop the common
intercept) to avoid perfect multicollinearity.
This dummy technique is called least-squares dummy variables (LSDV) because it is
simply the OLS estimator with a full set of dummy variables.
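Constructing the LSDV dummy columns is mechanical. A sketch that one-hot encodes a stacked entity index, dropping the first entity's dummy to avoid the perfect-multicollinearity trap (the entity labels are hypothetical):

```python
def lsdv_dummies(entities, drop_first=True):
    """Build 0/1 dummy columns, one per entity, from a stacked entity index.

    Dropping one dummy (or the common intercept) avoids perfect
    multicollinearity: the full set of dummies sums to the constant column.
    """
    levels = sorted(set(entities))
    kept = levels[1:] if drop_first else levels
    # One row per observation, one column per kept entity
    D = [[1 if e == lvl else 0 for lvl in kept] for e in entities]
    return kept, D

entities = [1, 1, 2, 2, 3, 3]   # stacked firm index, T = 2 per firm
cols, D = lsdv_dummies(entities)
print(cols)  # [2, 3]
print(D)     # [[0, 0], [0, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
```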
Note that consistent estimates from the LSDV model are obtained only when the error
terms are independent across both dimensions of the panel (across time and across
individuals).
In many cases the prime interest of researchers is not in obtaining the impact of the
unobserved variables (or heterogeneity). For this reason, the parameters of the dummy
variables for the fixed effects are called nuisance parameters.
We can test whether the fixed effects model gives different estimates than the
pooled OLS regression using an F-test.
The null hypothesis associated with this F-test is
H0: g_1 = g_2 = ... = g_N = 0
The caveat of using fixed effects is that if you introduce too many dummy variables
(that is, if N is too large), you will not have enough observations to do a meaningful
statistical analysis. For example, suppose we have N = 2,000 and T = 3; then the fixed
effect of each individual must be estimated from the variation in only 3 observations.
Each entity has its own individual characteristics that may or may not influence the
predictor variables (for example, the political system of a country could have some
effect on trade or GDP, or the business practices of a company may influence its stock
price).
When using FE we assume that something within the individual may impact or bias the
predictor or outcome variables, and we need to control for this. This is the rationale
behind the assumption of correlation between the entity's error term and the predictor
variables. FE removes the effect of those time-invariant characteristics from the
predictor variables so we can assess the predictors' net effect.
Another important assumption of the FE model is that those time-invariant
characteristics are unique to the individual and should not be correlated with other
individual characteristics.
Each entity is different, therefore the entity's error term and the constant (which
captures individual characteristics) should not be correlated with those of the others.
If the error terms are correlated, then FE is not suitable, since inferences may not be
correct, and you need to model that relationship (probably using random effects); this
is the main rationale for the Hausman test.
The equation for the fixed effects model becomes:
Y_it = b1*X_it + a_i + u_it
where a_i (i = 1, ..., n) is the unknown intercept for each entity, Y_it is the
dependent variable (i = entity, t = time), X_it is an independent variable, b1 is its
coefficient, and u_it is the error term.
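The entity-specific intercepts can also be eliminated without dummies by demeaning each entity's data (the "within" transformation). The toy numbers below are constructed so that the true within-entity slope is 3, while pooled OLS on the same stacked data, ignoring the entity intercepts, would find a slope of 0:

```python
def fe_within_slope(groups):
    """Fixed-effects slope via the within transformation: demean x and y
    inside each entity, then run OLS through the origin on the demeaned data.
    The entity intercepts drop out of the demeaned equation."""
    num = den = 0.0
    for xs, ys in groups:
        xbar = sum(xs) / len(xs)
        ybar = sum(ys) / len(ys)
        for x, y in zip(xs, ys):
            num += (x - xbar) * (y - ybar)
            den += (x - xbar) ** 2
    return num / den

# Two entities generated from y = 3*x + a_i, with a_1 = 10 and a_2 = 0
groups = [([1, 2], [13, 16]), ([4, 5], [12, 15])]
print(fe_within_slope(groups))  # 3.0
# Pooled OLS on the stacked data ([1,2,4,5], [13,16,12,15]) gives slope 0:
# the cross-entity variation exactly cancels the within-entity variation.
```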
WORKING EXAMPLE 1:
Greene (1997) provides a small panel data set with information on costs and output of 6
different firms, in 4 different periods of time (1955, 1960, 1965, and 1970). Your job is to
estimate a cost function using basic panel data techniques.
PANEL DATA:
Year  Firm  Cost     Output  D1  D2  D3  D4  D5  D6
1955  1     3.154    214     1   0   0   0   0   0
1960  1     4.271    419     1   0   0   0   0   0
1965  1     4.584    588     1   0   0   0   0   0
1970  1     5.849    1025    1   0   0   0   0   0
1955  2     3.859    696     0   1   0   0   0   0
1960  2     5.535    811     0   1   0   0   0   0
1965  2     8.127    1640    0   1   0   0   0   0
1970  2     10.966   2506    0   1   0   0   0   0
1955  3     19.035   3202    0   0   1   0   0   0
1960  3     26.041   4802    0   0   1   0   0   0
1965  3     32.444   5821    0   0   1   0   0   0
1970  3     41.18    9275    0   0   1   0   0   0
1955  4     35.229   5668    0   0   0   1   0   0
1960  4     51.111   7612    0   0   0   1   0   0
1965  4     61.045   10206   0   0   0   1   0   0
1970  4     77.885   13702   0   0   0   1   0   0
1955  5     33.154   6000    0   0   0   0   1   0
1960  5     40.044   8222    0   0   0   0   1   0
1965  5     43.125   8484    0   0   0   0   1   0
1970  5     57.727   10004   0   0   0   0   1   0
1955  6     73.05    11796   0   0   0   0   0   1
1960  6     98.846   15551   0   0   0   0   0   1
1965  6     138.88   27218   0   0   0   0   0   1
1970  6     191.56   30958   0   0   0   0   0   1
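The table above can be entered directly. A sketch that reconstructs the 24 observations from the per-firm cost and output series and confirms the panel is balanced (6 firms x 4 years):

```python
years = [1955, 1960, 1965, 1970]
cost = {1: [3.154, 4.271, 4.584, 5.849],
        2: [3.859, 5.535, 8.127, 10.966],
        3: [19.035, 26.041, 32.444, 41.18],
        4: [35.229, 51.111, 61.045, 77.885],
        5: [33.154, 40.044, 43.125, 57.727],
        6: [73.05, 98.846, 138.88, 191.56]}
output = {1: [214, 419, 588, 1025],
          2: [696, 811, 1640, 2506],
          3: [3202, 4802, 5821, 9275],
          4: [5668, 7612, 10206, 13702],
          5: [6000, 8222, 8484, 10004],
          6: [11796, 15551, 27218, 30958]}

# Stack into long format: one row per (firm, year); the firm dummies
# D1..D6 are implicit in the firm index.
rows = [{"firm": f, "year": y, "cost": c, "output": o}
        for f in cost
        for y, c, o in zip(years, cost[f], output[f])]

print(len(rows))  # 24 observations: a balanced short panel (N=6 > T=4)
```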
POOLED OLS:
The most basic estimator for panel data sets is pooled OLS (POLS). Johnston & DiNardo (1997)
recall that the POLS estimator ignores the panel structure of the data, treating observations as
serially uncorrelated for a given individual, with homoscedastic errors across individuals and time
periods:
(2) b_POLS = (X'X)^(-1) X'y
gen lnc=log(cost)
gen lny=log(output)
regress lnc lny
Number of obs = 24;  F(1, 22) = 728.51;  Prob > F = 0.0000
R-squared = 0.9707;  Root MSE = .21482;  Model SS = 33.617333 (df = 1)
lny 95% conf. interval: [.8197573, .9562164]
scalar R2OLS=_result(7)
FIXED EFFECTS (WITHIN-GROUPS) ESTIMATOR:
(3) b_W = [X'QX]^(-1) X'Qy
xtreg lnc lny, fe
Number of obs = 24;  Number of groups = 6;  obs per group: avg = 4.0, max = 4
R-sq: between = 0.9833, overall = 0.9707
F(1,17) = 121.66;  Prob > F = 0.0000
lny 95% conf. interval: [.5453044, .8032534]
F test that all u_i=0: F(5, 17) = 9.67
matrix bW=get(_b)
matrix VW=get(VCE)
BETWEEN-GROUPS ESTIMATORS:
(4) b_B = [X'PX]^(-1) X'Py
xtreg lnc lny, be
Between regression (regression on group means)
Number of obs = 24;  Number of groups = 6;  obs per group: avg = 4.0, max = 4
R-sq: between = 0.9833, overall = 0.9707
F(1,4) = 236.23;  Prob > F = 0.0001;  sd(u_i + avg(e_i.)) = .1838474
lny 95% conf. interval: [.7464935, 1.075653]
. matrix bB=get(_b)
. matrix VB=get(VCE)
RANDOM EFFECTS:
(5) b_GLS = [X'Omega^(-1)X]^(-1) X'Omega^(-1)y
where Omega = sigma_u^2 * I_NT + T * sigma_a^2 * P
xtreg lnc lny, re
Random-effects GLS regression
Number of obs = 24;  Number of groups = 6;  obs per group: avg = 4.0, max = 4
R-sq: between = 0.9833, overall = 0.9707
Random effects u_i ~ Gaussian;  corr(u_i, X) = 0 (assumed)
Wald chi2(1) = 268.10;  Prob > chi2 = 0.0000
lny:   z = 16.37, P>|z| = 0.000, 95% conf. interval [.7010002, .8916404]
_cons: z = -8.26, P>|z| = 0.000, 95% conf. interval [-4.222788, -2.6034]
sigma_u = .17296414;  sigma_e = .12463167;  rho = .65823599 (fraction of variance due to u_i)
HAUSMAN TEST:
- Under the null hypothesis: orthogonality, i.e., no correlation between the individual effects and
the explanatory variables. Both the random effects and fixed effects estimators are consistent, but
the random effects estimator is efficient, while fixed effects is not.
- Under the alternative hypothesis: the individual effects are correlated with the X's. In this case,
the random effects estimator is inconsistent, while the fixed effects estimator remains consistent.
Greene (1997) recalls that, under the null, the estimates should not differ systematically. Thus, the
test is based on the contrast vector H:
(6) H = (b_W - b_GLS)' [Var(b_W) - Var(b_GLS)]^(-1) (b_W - b_GLS) ~ chi2(k)
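With a single regressor the contrast collapses to a scalar. A sketch with hypothetical estimates and variances (the point estimates in the output above were lost in extraction, so these numbers are illustrative only):

```python
def hausman_scalar(b_fe, var_fe, b_re, var_re):
    """Hausman statistic for a single coefficient:
    H = (b_FE - b_RE)^2 / (Var(b_FE) - Var(b_RE)),
    distributed chi2(1) under the null of no correlation between
    the individual effects and the regressor."""
    return (b_fe - b_re) ** 2 / (var_fe - var_re)

# Hypothetical numbers: FE is less efficient (larger variance) than RE
H = hausman_scalar(b_fe=0.50, var_fe=0.004, b_re=0.60, var_re=0.002)
print(round(H, 6))  # 5.0
# Compare H with the chi2(1) 5% critical value, 3.84: here H > 3.84,
# so the null would be rejected and fixed effects preferred.
```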
Fixed-effects (within) regression
Number of obs = 24;  Number of groups = 6;  obs per group: avg = 4.0, max = 4
R-sq: between = 0.9833, overall = 0.9707
F(1,17) = 121.66;  Prob > F = 0.0000
lny:   t = 11.03, P>|t| = 0.000, 95% conf. interval [.5453044, .8032534]
_cons: t = -4.72, P>|t| = 0.000, 95% conf. interval [-3.472046, -1.325972]
sigma_u = .36730483;  sigma_e = .12463167;  rho = .89675322 (fraction of variance due to u_i)
F test that all u_i=0: F(5, 17) = 9.67

Random-effects GLS regression
Number of obs = 24;  Number of groups = 6;  obs per group: avg = 4.0, max = 4
R-sq: between = 0.9833, overall = 0.9707
corr(u_i, X) = 0 (assumed);  Wald chi2(1) = 268.10;  Prob > chi2 = 0.0000
lny:   z = 16.37, P>|z| = 0.000, 95% conf. interval [.7010002, .8916404]
_cons: z = -8.26, P>|z| = 0.000, 95% conf. interval [-4.222788, -2.6034]
sigma_u = .17296414;  sigma_e = .12463167;  rho = .65823599 (fraction of variance due to u_i)
So, based on the test above, we can see that the test statistic (10.86) is greater than the critical
value of a chi-squared with 1 df at the 5% level (3.84). Therefore, we reject the null hypothesis,
and the preferred model is the fixed effects model.
Fixed effects (LSDV) regression of lnc on lny and the six firm dummies (no common intercept):
Source   | SS          df  MS
Model    | 280.714267   7  40.1020382
Residual | .264061918  17  .015533054
Total    | 280.978329  24  11.7074304
Number of obs = 24;  F(7, 17) = 2581.72;  Prob > F = 0.0000;  R-squared = 0.9991;  Root MSE = .12463
lny: t = 11.03, P>|t| = 0.000, 95% conf. interval [.5453044, .8032534]
d1 | -2.693527  .3827874  t = -7.04  P = 0.000  [-3.501137, -1.885916]
d2 | -2.911731  .4395755  t = -6.62  P = 0.000  [-3.839154, -1.984308]
d3 | -2.439957  .5286852  t = -4.62  P = 0.000  [-3.555386, -1.324529]
d4 | -2.134488  .5587981  t = -3.82  P = 0.001  [-3.313449, -.955527]
d5 | -2.310839  .55325    t = -4.18  P = 0.001  [-3.478094, -1.143583]
d6 | -1.903512  .6080806  t = -3.13  P = 0.006  [-3.18645, -.6205737]
The slope is obviously the same. The only change is the substitution of the common intercept by 6
dummies, each of them representing a cross-sectional unit. Now suppose we would like to know
whether the difference in the firms' effects is statistically significant.
- Run the fixed effects regression above, including both the intercept and the dummies:
regress lnc lny d1 d2 d3 d4 d5 d6
note: d1 omitted because of collinearity
Source   | SS          df  MS
Model    | 34.368475    6  5.72807917
Residual | .264061918  17  .015533054
Total    | 34.6325369  23  1.50576248
Number of obs = 24;  F(6, 17) = 368.77;  Prob > F = 0.0000;  R-squared = 0.9924;  Root MSE = .12463
lny: t = 11.03, P>|t| = 0.000, 95% conf. interval [.5453044, .8032534]
d1 | (omitted)
d2 | -.2182041  .1052027  t = -2.07  P = 0.054  [-.4401624, .0037542]
d3 | .2535693   .1716665  t = 1.48   P = 0.158  [-.1086153, .6157539]
d4 | .5590387   .1982915  t = 2.82   P = 0.012  [.1406801, .9773973]
d5 | .3826881   .1933058  t = 1.98   P = 0.064  [-.0251516, .7905277]
d6 | .7900151   .2436915  t = 3.24   P = 0.005  [.275871, 1.304159]
_cons | t = -7.04  P = 0.000  [-3.501137, -1.885916]
Note that one of the dummies is dropped (due to perfect collinearity with the constant), and all
other dummies are represented as the difference between their original value and the constant. (The
value of the constant in this second regression equals the value of the dropped dummy in the
previous regression; the dropped dummy serves as the benchmark.)
Obtain the R-squared from restricted (POLS) and unrestricted (fixed effects with dummies) models
. scalar R2LSDV=_result(7)
. scalar list
R2LSDV = .99237532
R2OLS = .97068641
Perform the traditional F-test, comparing the unrestricted regression with the restricted regression:
(7) F = [(R2_u - R2_p)/(N - 1)] / [(1 - R2_u)/(NT - N - k)]
where the subscript "u" refers to the unrestricted regression (fixed effects with dummies), and the
subscript "p" to the restricted regression (POLS). Under the null hypothesis, the restrictions are
valid and POLS is the appropriate model.
. scalar
F=((R2LSDV-R2OLS)/(6-1))/((1-R2LSDV)/(24-6-1))
. scalar list F
F = 9.6715307
The result above can be compared with the critical value of F(5,17), which equals 4.34 at 1% level.
Therefore, we reject the null hypothesis of common intercept for all firms.
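The scalar computation above can be reproduced directly from the two stored R-squared values:

```python
def fe_vs_pols_f(r2_u, r2_p, n, t, k):
    """F-test of the N-1 intercept restrictions: pooled OLS (restricted)
    versus LSDV (unrestricted); F ~ F(N-1, NT-N-k) under the null."""
    return ((r2_u - r2_p) / (n - 1)) / ((1 - r2_u) / (n * t - n - k))

# R-squared values from the scalar list above; N = 6 firms, T = 4 years, k = 1
F = fe_vs_pols_f(r2_u=0.99237532, r2_p=0.97068641, n=6, t=4, k=1)
print(round(F, 4))  # 9.6715 -- matches the scalar F reported above
```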
WORKING EXAMPLE 2:
The following panel data set contains investment data for four companies over 20
years, 1935-1954, where:
Grossinv = gross investment = y
Valuefirm = value of the firm = x2
capstock = capital (stock of plant and equipment) = x3
First we regress gross investment on the value of the firm and the capital stock.
y      x2      x3
33.1   1170.6  97.8
45     2015.8  104.4
77.2   2803.3  118
.      .       .
.      .       .
71.78  864.1   145.5
90.08  1193.5  174.8
68.6   1188.9  213.5
RESULTS:
Dependent Variable: Y
Method: Least Squares
Sample: 1 80
Included observations: 80

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          -63.30414     29.61420     -2.137628     0.0357
X2         0.110096      0.013730     8.018809      0.0000
X3         0.303393      0.049296     6.154553      0.0000

R-squared 0.756528;  Adjusted R-squared 0.750204;  S.E. of regression 142.3682
Sum squared resid 1560690.;  Log likelihood -508.6596
F-statistic 119.6292;  Prob(F-statistic) 0.000000
Mean dependent var 290.9154;  S.D. dependent var 284.8528
Akaike info criterion 12.79149;  Schwarz criterion 12.88081;  Hannan-Quinn criter. 12.82730
Durbin-Watson stat 0.309795
Heteroskedasticity Test:
F-statistic 4.146525;  Prob. F(2,77) 0.0195
Obs*R-squared 7.778406;  Prob. Chi-Square(2) 0.0205
Scaled explained SS 6.247011;  Prob. Chi-Square(2) 0.0440

Test equation (squared residuals on the regressors):
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          15418.30      5174.959     2.979406      0.0039
X2         -1.935848     2.399202     -0.806872     0.4222
X3         23.44746      8.614225     2.721947      0.0080

R-squared 0.097230;  Adjusted R-squared 0.073782;  S.E. of regression 24878.25
Sum squared resid 4.77E+10;  Log likelihood -921.7262
F-statistic 4.146525;  Prob(F-statistic) 0.019486
Mean dependent var 19508.62;  S.D. dependent var 25850.15
Akaike info criterion 23.11815;  Schwarz criterion 23.20748;  Hannan-Quinn criter. 23.15397
Durbin-Watson stat 0.729271
Heteroskedasticity Test (squared regressors):
F-statistic 2.719894;  Prob. F(2,77) 0.0722
Obs*R-squared 5.278799;  Prob. Chi-Square(2) 0.0714
Scaled explained SS 4.239521;  Prob. Chi-Square(2) 0.1201

Test equation (squared residuals on the squared regressors):
Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          20262.24      3695.723     5.482620      0.0000
X2^2       -0.000584     0.000423     -1.379930     0.1716
X3^2       0.011659      0.005000     2.331917      0.0223

R-squared 0.065985;  Adjusted R-squared 0.041725;  S.E. of regression 25305.11
Sum squared resid 4.93E+10;  Log likelihood -923.0872
F-statistic 2.719894;  Prob(F-statistic) 0.072214
Mean dependent var 19508.62;  S.D. dependent var 25850.15
Akaike info criterion 23.15218;  Schwarz criterion 23.24151;  Hannan-Quinn criter. 23.18799
Durbin-Watson stat 0.745009
CONCLUSION:
Hence we can conclude that the results above show a large impact of the stock of plant and
equipment and of the value of the firm on gross investment. Both regressors have positive
estimated effects, and the fitted equation can be written as:
GROSSINV = 0.110096*(VALUE OF FIRM) + 0.303393*(STOCK OF PLANT & EQUIPMENT)
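Using the least-squares coefficients from the RESULTS table (including the intercept, which the equation above omits), a fitted value can be computed for any row of the data. The example row is the first observation from the table above:

```python
def predict_grossinv(valuefirm, capstock):
    """Fitted gross investment from the pooled OLS estimates:
    y-hat = -63.30414 + 0.110096*x2 + 0.303393*x3"""
    return -63.30414 + 0.110096 * valuefirm + 0.303393 * capstock

# First row of the working-example data: x2 = 1170.6, x3 = 97.8
print(round(predict_grossinv(1170.6, 97.8), 3))  # 95.246
```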