Académique Documents
Professionnel Documents
Culture Documents
AN EXTENDED ANALYSIS
Jeffrey M. Wooldridge
Department of Economics
Michigan State University
East Lansing, MI 48824-1038
(517) 353-5972
wooldri1@msu.edu
This version: June 2006
1
ABSTRACT
This is an expanded version of Wooldridge (2003), which provided an overview of
cluster-sample methods in linear models. Here I include additional details for linear models
and provide, as much as possible, a parallel treatment for nonlinear models, including a
summary of strategies for dealing with data sets having a small number of clusters.
Keywords: Cluster Correlation; Generalized Estimating Equations; Minimum Distance
Estimation; Panel Data; Robust Variance Matrix; Unobserved Effect
JEL Classification Codes: C13, C21, C23
2
1. INTRODUCTION
In Wooldridge (2003), I provided a brief overview of econometric approaches to analyzing
cluster samples in the context of a linear regression model. I considered cases with both large
and small cluster sizes (relative to the number of clusters). That treatment was necessarily
terse, and some subtle issues were only briefly mentioned or neglected entirely.
The asymptotic theory for the case with a large number of clusters (and relatively small
cluster sizes), either in linear or nonlinear models, has been pretty well worked out; for a
summary, see, for example, Wooldridge (2002). Just as importantly, popular statistical
packages, such as Stata
, ,
(as G with M
g
fixed) and G -asymptotically normal. See, for example, Wooldridge (2002, Sections 7.8).
Without more assumptions, a robust variance matrix is needed to account for correlation
within clusters or heteroskedasticity in Varv
gm
|x
g
, Z
g
, or both. When v
gm
has the form in (2),
the amount of within-cluster correlation can be substantial, which means the usual OLS
7
standard errors can be very misleading (and, in most cases, systematically too small). In the
case of common cluster sizes, Wooldridge (2002, Section 7.8) gives the formula for a
variance-matrix estimator that assumes no particular kind of within-cluster correlation nor a
particular form of heteroskedasticity. Allowing for different group sizes is straightforward.
Write the equation by cluster as
y
g
W
g
z v
g
, g 1, . . . , G, (4)
where W
g
is the M
g
1 K L matrix of all regressors for group g. Then the
1 K L 1 K L variance matrix estimator is
Avarz
g1
G
W
g
W
g
1
g1
G
W
g
v
g
v
g
W
g
g1
G
W
g
W
g
1
(5)
where v
g
is the M
g
1 vector of pooled OLS residuals for group g. This asymptotic variance
is now computed routinely by packages such as Stata
j
Mg
o
u
2
I
Mg
, (7)
where j
Mg
is the M
g
1 vector of ones and I
Mg
is the M
g
M
g
identity matrix. In the standard
setup, we also make the system homoskedasticity assumption,
Varv
g
|x
g
, Z
g
Varv
g
. (8)
It is important to understand the role of assumption (8): it implies that the conditional
variance-covariance matrix is the same as the unconditional variance-covariance matrix, but it
does not restrict Varv
g
. The particular random effects structure on Varv
g
is given by (7).
Under (7) and (8), the resulting GLS estimator is the well-known random effects (RE)
estimator; see, for example, Wooldridge (2002, Section 10.3).
The random effects estimator is asymptotically more efficient than pooled OLS under (6),
(7), and (8) (and a standard rank condition). The RE estimates and test statistics are computed
routinely by popular software packages. Nevertheless, an important point is often overlooked
in applications of RE: one can, and in many cases should, make inference completely robust to
an unknown form of Varv
g
|x
g
, Z
g
. Wooldridge [2002, equation (7.49)] gives the robust
formula in the balanced case (same group sizes), and the formula extends easily to the
unbalanced case. One way to think about the robust formula is to characterize RE as a pooled
OLS estimator on quasi-time demeaned data. For each g, define
0
g
1 1/1 M
g
o
c
2
/o
u
2
1/2
, where o
c
2
and o
u
2
are estimators of the two variance
parameters. Then the RE estimator is identical to the pooled OLS estimator of
y
gm
0
g
y g
on 1 0
g
, 1 0
g
x
g
, z
gm
0
g
zg
, m 1, . . . , M
g
; g 1, . . . , G; (9)
9
see, for example, Hsiao (1986). For fully robust inference, we can just apply the fully robust
variance matrix estimator in (5) but on the transformed data.
The idea in obtaining a fully robust variance matrix of RE is straightforward, and dates
back at least to the generalized estimating equation (GEE) literature [Liang and Zeger
(1986)]. Even if Varv
g
|x
g
, Z
g
does not have the RE form, the RE estimator is still consistent
and G -asymptotically normal under (6), and it is likely to be more efficient than pooled OLS.
[Except in unusual cases, incorrectly accounting for within-group correlation by assuming
constant correlation within-group correlation is likely to be more efficient than entirely
ignoring within-group correlation.] Yet we should recognize that the RE second moment
assumptions can be violated without causing inconsistency in the RE estimator. Making
inference robust to serial correlation in the idiosyncratic errors for panel data applications,
particularly with more than a few time periods, can be very important. But within-group
correlation in the idiosyncratic errors can arise for cluster samples, too, especially if underlying
(1) is a random coefficient model, where z
gm
,
g
replaces z
gm
,. By estimating a standard
random effects model that assumes common slopes ,, we effectively include z
gm
,
g
, in the
idiosyncratic error; this generally creates within-group correlation because z
gm
,
g
, and
z
gp
,
g
, will be correlated for m p, conditional on Z
g
. Also, the idiosyncratic error will
have heteroskedasticity that is a function of z
gm
. Nevertheless, if we assume
E,
g
|X
g
, Z
g
E,
g
, along with (6), the random effects estimator still consistently
estimates the average slopes, ,. Therefore, in applying random effects to panel data or cluster
samples, it is sensible (with large G) to make the variance estimator of random effects robust to
arbitrary heteroskedasticity and within-group correlation.
With panel data so that there is a natural ordering of the observations within group it
10
may make sense to estimate an unrestricted version of Varv
g
, especially if G is large. Even
in that case, we can use Wooldridge [2002, equation (7.49)] to obtain a variance matrix robust
to Varv
gm
|x
g
, Z
g
Varv
g
.
In economics, the prevailing view is still that robust inference is not necessary when using
GLS. Adopting the perspective of the GEE literature that one often chooses a working
variance-covariance matrix for convenience, but should recognize it may not represent the
conditional variance-covariance matrix is relatively straightforward these days. The
appendix shows how to use GEE software to obtain fully robust inference after random effects
estimation.
If we are only interested in estimating ,, the fixed effects (FE) or within estimator is
attractive. The within transformation subtracts off group averages from the dependent variable
and explanatory variables:
y
gm
y g
z
gm
zg
, u
gm
g
, m 1, . . . , M
g
; g 1, . . . , G, (10)
and this equation is estimated by pooled OLS. (The x
g
get swept away by the within-group
demeaning.) Under a full set of fixed effects assumptions which, unlike random effects,
allows arbitrary correlation between c
g
and the z
gm
inference is straightforward using
standard software. Nevertheless, analogous to the random effects case, it is often important to
allow Varu
g
|Z
g
to have an arbitrary form, including within-group correlation and
heteroskedasticity. Arellano (1987) proposed a fully robust variance matrix estimator for the
fixed effects estimator, and it is consistent (as G increases with the M
g
fixed) with cluster
samples or panel data; see also Wooldridge [2002, equation (10.59)]. For panel data, the
idiosyncratic errors can always have serial correlation or heteroskedasticity, and it is easy to
11
guard against these problems in inference. Reasons for wanting a fully robust variance matrix
estimator for FE applied to cluster samples are similar to the RE case. For example, if ,
g
replaces , in (1), then z
gm
zg
,
g
, appears in u
gm
in (4). The FE estimator is still
consistent if E,
g
|z
g1
zg
, . . . , z
g,Mg
zg
E,
g
, a reasonable assumption that allows
,
g
to be correlated with zg
provide the group slopes are mean-independent of the within-group
deviations from the mean. Nevertheless, u
gm
, u
gp
will be correlated for m p. A fully robust
variance matrix estimator is
Avar,
FE
g1
G
Z
g
1
g1
G
Z
g
g1
G
Z
g
1
, (11)
where Z
g
is the matrix of within-group deviations from means and
g
is the M
g
1 vector of
fixed effects residuals. This estimator is justified with large-G asymptotics.
One benefit of a fixed effects approach, especially in the standard model with constant
slopes but c
g
in the composite error term, is that no adjustments are necessary if the c
g
are
correlated across groups. When the groups represent different geographical units, we might
expect correlation across groups close to each other. If we think such correlation is largely
captured through the unobserved effect c
g
, then its elimination via the within transformation
effectively solves the problem. If we use pooled OLS or a random effects approach, we would
have to deal with spatial correlation across g, in addition to within-group correlation, and this
is a difficult problem.
Recently, in the context of fixed effects estimation and panel data, Kzde (2001) and
Bertrand, Duflo, and Mullainathan (2002) study the finite-sample properties of robust variance
matrix estimators that are theoretically justified only as G . One common finding is that
the fully robust estimator works reasonably well even when the cross-sectional sample size (G)
12
is not especially large relative to the time-series dimension. When Varu
g
|Z
g
does not depend
on Z
g
, a variance matrix that exploits system homoskedasticity can perform better than the
fully robust variance matrix estimator. Kzde (2001) also finds that, without serial correlation,
an estimator that adjusts the variance matrix only for heteroskedasticity works well. This
finding is probably more relevant for cluster sample applications, where independence of the
idiosyncratic errors is often reasonable unless slopes vary by group.
Importantly, the encouraging simulaton findings for fixed effects with panel data are not in
conflict with findings that the robust variance matrix for the pooled OLS estimator with a
small number of clusters can behave poorly. An important difference in the two problems is
that for FE estimation using panel data, the issue is serial correlation in the idiosyncratic errors
u
gm
: m 1, . . . , M
g
, and in simulation studies to date this correlation has been assumed to
die out as the time periods get far apart. [Typically, a stable AR(1) model is assumed.] The
pooled OLS estimator that keeps c
g
in the error term suffers because of the constant correlation
across all observations within cluster. Plus, fixed effects can only estimate ,, while for pooled
OLS on cluster samples the focus is on typically on [. If we are interested in [, an interesting
open question for a small cluster sample size is whether inference with the random effects
estimator, whether carried out under the usual RE assumptions or made robust to arbitrary
within-group correlation and heteroskedasticity, can improve on robust inference for pooled
OLS.
The previous discussion extends immediately to instrumental variables versions of all
estimators. With large G, one can typically afford to make pooled two stage least squares
(2SLS), random effects 2SLS, and fixed effects 2SLS robust to arbitrary within-cluster
correlation and heteroskedasticity. Also, more efficient estimation is possible by applying
13
generalized method of moments (GMM); again, GMM is justified with large G.
Unfortunately, for any of the IV methods there appears to be little simulation evidence for how
large G standard errors work for IV methods when G is not so large.
2.3. Large Group Size Asymptotics
The previous subsection discussed the mixed performance of standard large G variance
matrix estimators when G is not so large. I now discuss estimation of the parameters in (1)
under a standard stratified sampling scheme. In particular, suppose we have defined G strata in
the population, and we obtain our data by randomly sampling from each stratum. As before,
M
g
is the sample size for stratum (group) g. Without specifying how the sample was obtained,
the resulting data set is essentially indistinguishable from that described in Section 2.2, where
we randomly draw clusters from a population of many clusters. But the difference turns out to
be important, and in this subsection I act as if the usual large G asymptotics do not apply.
We are mostly interested in estimating [ in (1), but that generally requires estimating ,, too.
Donald and Lang (2001) (hereafter, DL) study this problem, and I draw on their work while
offering a complementary approach.
DLs examples are for very small G, and one of their examples is Card and Krueger
(1994), who had G 2 states, New Jersey and Pennsylvania. DLs approach can be, and has
been, applied with larger numbers of groups. Moulton (1990), the clusters are states in the
U.S. (G 50) and, in Loeb and Bound (1996), G 36 cohort-division groups. I explicitly
consider the case where G and the M
g
are both (moderately) large in Section 2.4.
14
To illustrate the pitfalls in applying standard methods such as pooled OLS when G is
small, consider a special case of (1), where x
g
is a scalar (as in DL) and z
gm
is not in the
equation. The equation is
y
gm
o [x
g
c
g
u
gm
, m 1, . . . , M
g
; g 1, . . . , G, (12)
where c
g
and u
gm
: m 1, . . . , M
g
are independent of x
g
and u
gm
: m 1, . . . , M
g
is a
mean-zero, independent, identically distributed sequence for each g. Now, if the cluster effect
c
g
is absent from the model, then we can estimate (12) using pooled ordinary least squares, and
inference is straightforward. If we assume Varu
gm
is constant across g then, provided
N M
1
. . . M
G
is large enough (often N as small as 30 suffices) even if G is not large we
can use the usual t statistics from the pooled OLS regression as having an approximate
standard normal distribution. Making inference robust to heteroskedasticity is straightforward,
and tends to work well even for moderate N.
Donald and Lang (2001) allow for the presence of c
g
Normal0, o
c
2
, which they assume
to be independent of u
gm
: m 1, . . . , M
g
. With a common cluster effect, there is no
averaging out within cluster that allows application of the central limit theorem. One way to
see the problem is to note that the pooled OLS estimator, [
g
, along with ,
g
, from an
OLS regression within each group. If G K 1 then, typically, we can solve for 0
o , [
19
uniquely in terms of the G 1 vector o
:. 0
X
1
o
g
on 1, x
g
, g 1, . . . , G. (17)
Under asymptotics such that M
g
g
M where 0
g
1 and M , the minimum distance
estimator 0
g
is just y g
, the sample mean for group g. Let o
g
2
denote the usual sample variance for group g.
Because the y g
are independent across g, the efficient MD estimator uses a diagonal weighting
matrix. As a computational device, the minimum chi-square estimator can be computed by
using the weighted least squares (WLS) version of (17), where group g is weighted by M
g
/o
g
2
(groups that have more data and smaller variance receive greater weight). Conveniently, the
reported t statistics from the WLS regression are asymptotically standard normal as the group
sizes M
g
get large. (With fixed G, the WLS nature of the estimation is just a computational
device; the standard asymptotic analysis of the WLS estimator has G .). The minimum
distance approach works with small G provided G K 1 and each M
g
is large enough so that
normality is a good approximation to the distribution of the (properly scaled) sample average
20
within each group.
If z
gm
is present in the first-stage estimation, we use as the minimum chi-square weights the
inverses of the asymptotic variances for the g intercepts in the separate G regressions. With
large M
g
, we might make these fully robust to heteroskedasticity in Eu
gm
2
|z
gm
using the White
(1980) sandwich variance estimator, or at least allow for different o
g
2
. In any case, once we
have the Avaro
g
which are just the squared reported standard errors for the o
g
we use as
weights 1/Avaro
g
in the computationally simple WLS procedure. We are still using
independence across g in obtaining a diagonal weighting matrix in the MD estimation.
An important by-product of the WLS regression is a minimum chi-square statistic that can
be used to test the G K 1 overidentifying restrictions. The statistic is easily obtained as the
weighted sum of squared residuals, say SSR
w
. Under the null hypothesis in (16),
SSR
w
a
_
GK1
2
as the group sizes, M
g
, get large. If we reject H
0
at a reasonably small
significance level, the x
g
are not sufficient for characterizing the changing intercepts across
groups. If we fail to reject H
0
, we can have some confidence in our specification, and perform
inference using the standard normal distribution for t statistics for testing linear combinations
of the population averages.
We might also be interested in how one of the slopes in ,
g
depends on the group features,
x
g
. Then, we simple replace o
g
with, say ,
g1
, the slope on the first element of z
gm
. Naturally,
we would use 1/Avar,
g1
as the weights in the MD estimation.
The minimum distance approach can also be applied if we impose ,
g
, for all g, as in
the original model (1). Obtaining the o
g
themselves is easy: run the pooled regression
y
gm
on d1
g
, d2
g
, . . . , dG
g
, z
gm
, m 1, . . . , M
g
; g 1, . . . , G (18)
21
where d1
g
, d2
g
, . . . , dG
g
are group dummy variables. Using the o
g
from the pooled regression
(18) in MD estimation is complicated by the fact that the o
g
are no longer asymptotically
independent; in fact, o
g
y g
zg
,, where , is the vector of common slopes, and the presence
of , induces correlation among the intercept estimators. Let V
be the G G estimated
(asymptotic) variance matrix of the G 1 vector o
V
1
X
1
X
V
1
o
V
1
X
1
. If the OLS
regression (17) is used, or the WLS version, the resulting standard errors will be incorrect
because they ignore the across group correlation in the estimators. (With large group sizes the
errors might be small; see the next subsection.)
Intermediate approaches are available, too. Loeb and Bound (1996) (LB for short) allow
different group intercepts and group-specific slopes on education, but impose common slopes
on demographic and family background variable. The main group-level covariate is the
student-teacher ratio. Thus, LB are interested in seeing how the student-teach ratio affects the
relationship between test scores and education levels. LB use both the unweighted estimator
and the weighted estimator and find that the results differ in unimportant ways. Because they
impose common slopes on a set of regressors, the estimated slopes on education (say ,
g1
) are
not asymptotically independent, and perhaps using a nondiagonal estimated variance matrix V
(which would be 36 36 in this case) is more appropriate; but see Section 2.4.
As another example, suppose x
g
is a binary treatment indicator, equal to unity if group g is
in the treatment group and zero otherwise. Then [
g
o
g
O
p
Mg
1/2
, we can
ingore the estimation error in o
g
for large M
g
(Recall that the same large M
g
assumption
underlies the minimum distance approach.) Then, it is as if we are estimating the equation
o
g
o x
g
[ c
g
, g 1, . . . , G by OLS. If the c
g
are drawn from a normal distribution,
classical analysis is applicable because c
g
is assumed to be independent of x
g
. This approach
is desirable when one cannot, or does not want to, find group-level observables that completely
determine the o
g
. It is predicated on the assumption that the other factors in c
g
are not
systematically related to x
g
, a reasonable assumption if, say, x
g
is a randomly assigned
treatment at the group level, a case considered by Angrist and Lavy (2002).
Before settling for conservative inference, which is justified only under normality of c
g
when we can ignore the estimation error in o
g
, it makes sense to choose x
g
carefully and test
Varc
g
0. This is precisely the purpose of the minimum chi-square overidentification
statistic. The minimum chi-square approach is justified very generally when the group sizes,
M
g
, are moderately large; normality is not needed anywhere, nor is homoskedasticity within or
across groups. If the M
g
are not very large but G K 1 is large, the _
GK1
2
distribution may
not be a good approximation to the distribution of the overidentification statistic.
23
Even if the overidentification statistic rejects (16), it may be too rash to adopt the DL
approach with small degrees of freedom. To see why, consider a treatment effect setup.
Suppose there are three large groups (G 3), and groups one and two are the control groups
while group three is the treatment group. So x
g
is a binary variable with x
1
x
2
0 and
x
3
1. For simplicity, there are no control variables. Then the estimated average treatment
effect using the DL regression (13) or the (unweighted) minimum distance regression (18) are
identical, and simply equal to the difference in means for the treatment and control group. It is
conveniently written as
[
y 3
p
1
y 1
p
2
y 2
, (19)
where p
1
M
1
/M
1
M
2
and p
2
M
2
/M
1
M
2
are the proportions of observations for
groups one and two, respectively, out of the total number of control units. In other words, the
mean outcome for the control group is just a weighted average. The efficient minimum
distance estimator would use a different weighting scheme, but that is not important for the
main point here.
Because G 3, we can test the single overidentifying restriction. This is tantamount to
testing whether j
1
j
2
, that is, the population means of the two control groups are the same.
Here is the important point: should we really reject the standard analysis just because there is a
heterogeneous response in the two control groups? I think not. If we do and apply the DL
approach, we wind up using a t distribution with a single degree of freedom to conduct
inference, even though we may have hundreds, or even thousands, of observations from each
group or stratum.
Instead, we can be satisfied with (19) as the estimate of the treatment effect without
worrying whether j
1
j
2
. In fact, if the sample proportions p
1
and p
2
converge to the
24
relative population frequencies, say
1
and
2
, then the probability limit of (19) is simply
j
3
1
j
1
2
j
2
, where the term in parentheses is the average across the two control group
populations. If p
1
and p
2
are not consistent estimates of the population frequencies, then the
control group mean estimates a weighted average of j
1
and j
2
. We may or may not be
interested in the particular weighted average implied by (19). Nevertheless, the DL method
produces exactly the same estimate of the ATE, but DL would insist that the estimated effect
be deemed statistically significant only if the t statistic, from the regression of y g
on 1, x
g
,
g 1, 2, 3, is on the order of 12.8 (for a two-sided, 5% level test).
Rather than either rejecting the minimum distance approach if j
1
j
2
, or simply
accepting the estimated treatment effect as possibly estimating an unteresting weighted
combination of the mean effects for different groups, we can specify ahead of time a particular
linear combination of the means that is of interest. So, for example, suppose that the first G
1
groups are the controls and the remaining G
2
groups are the treated groups. Then, we might
define the treatment effect in terms of two simple averages, say 0
C
j
1
. . . j
G
1
/G
1
and
0
T
j
G
1
1
. . . j
G
/G
2
; the treatment effect would be defined as [ 0
T
0
C
. Even with
only a moderate sample size for each g, we can typically get fairly precise estimates of the j
g
using the sample average. Estimation of [ is then straightforward, and inference is standard
because asymptotic standard errors for 0
C
and 0
T
are easily obtained.
Of course, one might use weighted averages in each case, particularly if one has knowledge
of population frequencies. The general point is that once we have specified an interesting
population parameter, we can obtain a consistent, approximately normal estimator, and
inference can be based on the standard normal distribution when the group sizes are
moderately large. There is no reason to rely on the small degrees of freedom approach of DL.
25
Beyond the treatment effect case, the issue of how to define parameters of interest appears
complicated, and deserves further research.
2.4. What if G and the M
g
are Both Large?
In Section 2.2, I reviewed methods appropriate for a large number of groups and relatively
small group sizes. In Section 2.3, I considered two approaches appropriate for large group
sizes and a small number of groups. The DL and minimum distance approaches use the large
group sizes assumption differently: in its most applicable setting, DL use the large M
g
assumption to ignore the first-stage estimation error entirely, while the asymptotics underlying
the MD approach is based on applying the central limit theorem within each group. Not
surprisingly, more flexibility is afforded if G and M
g
are both large.
For example, suppose we adope the DL specification (with an unobserved cluster effect)
and G 50 (say, states in the U.S.). Further, assume first that the group sizes are large enough
(or the cluster effects are so strong) that the first-stage estimation error can be ignored. Then,
it matters not whether we impose some common slopes or run separate regressions for each
group (state) in the first stage: we can treat the o
g
, g 1, . . . , G, as independent observations
to be used in the second stage. This means we apply regression (17) and apply the usual
inference procedures. The difference now is that with G 50, the usual t statistics have some
robustness to nonnormality of the c
g
, assuming the CLT approximation works well With small
G, the exact inference was based on normality of the c
g
.
Loeb and Bound (1996) essentially use regression (17), but with estimated slopes as the
26
dependent variable in place of estimated intercepts. As mentioned in Section 2.3, LB impose
some common slopes across groups, which means all estimated parameters are dependent
across group. The minimum distance approach without cluster effects is one way to account
for the dependence. Alternatively, one can simply adopt the DL perspective and just assume
the estimation error is swamped by c
g
; then standard OLS analysis is approximately justfied.
LB also apply a weighted least squares procedure that is intended to account for the different
sampling variations in their slope estimates (largely due to different group sample sizes). With
a cluster effect, the LB weights are not the correct ones for eliminating heteroskedasticity.
[Plus, we do not have the actual variances of the o
g
of ,
g1
; we have only estimates of their
asymptotic variances. This is fine from the MD perspective, but not really correct from a WLS
perspective.] There is nothing really wrong with using the incorrect weights provided the
standard errors are appropriately adjusted, but with G 36 one might be concerned about the
finite-sample properties of robust standard errors. With a cluster effect, the same problem
arises with OLS, as it ignores the heteroskedasticity. From the MD perspective, the LB WLS
approach would be correct if each first-stage regression were estimated separately; but they
were not.
In summary, with G somewhat large, the DL approach gains robustness to nonnormality of
the c
g
if the first-stage estimation error can be safely ignored. A WLS approach is justified in
the MD approach when separate regressions can be run for each group. But that weighting is
not the correct one when either a cluster effect is present or one must account for dependence
in the first-stage estimators across groups.
27
3. NONLINEAR MODELS
Many of the issues for nonlinear models are the same as for linear models. The biggest
difference is that, in many cases, standard approaches require distributional assumptions about
the unobserved group effects. In addition, it is more difficult in nonlinear models to allow for
group effects correlated with covariates, especially when group sizes differ. For the small G
case, we offer extensions of the Donald and Lang (2001) approach (with large group sizes) and
the minimum distance approach.
Rather than using a general, abstract setting, the issues for nonlinear models are easily
illustrated with a few widely-used examples. Here I discuss binary response, count data, tobit
models, and fractional response models.
3.1. Large Groups Asymptotics
As in Section 2, we begin with the case where the asymptotic analysis is with fixed group
sizes M
g
with the number of groups G getting large.
3.1.1. Binary Response Models
We can illustrate many issues using an unobserved effects probit model. Let y
gm
be a
binary response, with x
g
and z
gm
, m 1, . . . , M
g
, g 1, . . . , G defined as in Section 2. Assume
28
that
y
gm
1o x
g
[ z
gm
, c
g
u
gm
0
u
gm
|x
g
, Z
g
, c
g
~Normal0, 1
(20)
(21)
(where 1 is the indicator function). Equations (20) and (21) imply
Py
gm
1|x
g
, z
gm
, c
g
Py
gm
1|x
g
, Z
g
, c
g
o x
g
[ z
gm
, c
g
, (22)
where is the standard normal cumulative distribution function (cdf). We assume
throughout that only z
gm
affects the response probability of y
gm
conditional on x
g
and c
g
; the
outcomes of z
gp
for p m are assumed not to matter. This is captured in (22). For pooled
methods we could relax this restriction (as in the linear case), but, with the presence of c
g
, this
affords little generality in practice.
The presence of c
g
in (22) raises several important issues. Before even considering
estimation, we should know how to define the partial effects of the explanatory variables,
x
g
, z
gm
. The elements of [ and , provide the directions of effects and, for continuous
elements, the relative sizes. For example, if the first element of x
g
is continuous,
Py
gm
1|x
g
, z
gm
, c
g
x
g1
[
1
o x
g
[ z
gm
, c
g
, (23)
where is the standard normal density function. From (23), the sign of [
1
tells us whether
x
g1
has a positive or negative partial effect on the response probability. Further, if x
g2
is
another continuous variable, the ratio of partial effects is simply [
1
/[
2
. For discrete
covariates, the signs of the parameters give the direction of effects, although the relative
magnitudes are not free of the unobservable, c
g
. In any case, the magnitude of the partial
effects depends directly on c
g
, as (23) shows in the case of a continuous covariate. Often, we
need to a measure of the change in the response probability given a change in a regressor. To
29
address the dependence of (23) [and its discrete analogs] on c
g
, we can average (23) across the
population distribution of c
g
, for fixed values of the covariates. This results in what I have
called the average partial effect (APE); see, for example, Wooldridge (2002, 2005). The
simplest way to obtain APEs requires is to specify a distribution (or a conditional distribution)
for c
g
.
We can easily derive the APEs, and then consistently estimate them, if we assume c
g
is
independent of x
g
, Z
g
and is normally distributed:
c
g
|x
g
, Z
g
~Normal0, o
c
2
, (24)
where the zero mean is without loss of generality because (20) contains an intercept, o. The
unconditional normality assumption for c
g
implies that the APEs are obtained from the
function
Py
gm
1|x
g
, Z
g
o x
g
[ z
gm
,/1 o
c
2
1/2
o
c
x
g
[
c
z
gm
,
c
, (25)
where o
c
o/1 o
c
2
1/2
, and so on; see, for example, Wooldridge (2002, Chapter 15).
Conveniently, the scaled coefficients are exactly the coefficients estimated by using a simple
pooled probit procedure. So, for estimating the average partial effects, pooled probit is
perfectly acceptable. With large G and small group sizes, we can easily make the standard
errors and test statistics robust to arbitarary within group correlation. This is especially
convenient for panel data applications, where serial correlation in the idiosyncratic errors, u
gm
(where m indexes time) causes extra dependence in the responses y
gm
that is not accounted for
by c
g
. In other words, the cluster-robust variance matrix estimator allows for any kind of
within-group correlation, just as in the linear case.
If we supplement (20),.(21), and (24) with
30
u
g1
, . . . , u
g,Mg
are independent conditional on x
g
, Z
g
, c
g
(26)
then we have the so-called random effects probit model. Under the RE probit assumptions,
o, [, , and o
c
2
are all identified, which means we can estimate the APEs as well as the partial
effects evaluated at the everage value of c
g
, which is zero. We can also compute partial effects
at other values of c
g
that we might select from the normal distribution with estimated standard
deviation o
c
. The details for random effects probit in the balanced panel data case are given in
Wooldridge (2002, Chapter 15). The unbalanced case is similar.
The random effects probit estimator has no known robustness properties that is, if any of
the assumptions (20),.(21), (24), and (26) fail, the parameter estimators are believed to be
inconsistent. Plus, the APEs would not be consistently estimated in general. Therefore, if we
insist that the parameter estimators and APEs are consistently estimated, there is no logical
reason to use robust inference; the usual MLE inference is appropriate. Still, one might view
the RE probit model as an approximation to the true model, and use a sandwich estimator for
robust inference; see White (1982) and Wooldridge (2002, Chapter 12).
Random effects probit relies on a full set of distributional assumptions, while pooled probit
relaxes assumptions on the joint distribution at the cost of inefficient estimators (of the scaled
parameters indexing the APEs). If we maintain the strict exogeneity assumption of the
explanatory variables, there are estimators that are generally more efficient than pooled probit
while remaining consistent under (20), (21), and (24). The generalized estimating equation
(GEE) literature applies to grouped data (panel data as a special case); see Liang and Zeger
(1986). Perhaps the best way to think about GEE is as a multivariate weighted nonlinear least
squares (WNLS) estimator, with explicit recognition that the variance-covariance matrix
underlying the WNLS estimation is likely misspecified. The GEE approach begins by
31
specifying a conditional mean for the M
g
1 vector y
g
. In the binary response case,
Ey
g
|x
g
, Z
g
is simply the vector of response probabilities in (25). [In the context of GEE, the
expression in (25) would be called the population averaged model because the unobserved
effect has been averaged out.] Because y
gm
is binary,
Vary
gm
|x
g
, Z
g
o
c
x
g
[
c
z
gm
,
c
1 o
c
x
g
[
c
z
gm
,
c
. (27)
The pooled weighted nonlinear least squares estimator, using as weights the inverse of the
estimated variance function, produces an an estimator asymptotically equivalent to the pooled
probit estimator. (Remember, this is all as G with the M
g
fixed.) What GEE does is
exploit the within-group correlation to obtain a more efficient estimator. It does this by
specifying a working correlation matrix (WCM) which is, for each g, an M
g
M
g
matrix of
correlations. This matrix is supposed to approximate Corry
g
|x
g
, Z
g
. In most
implementations of GEE, the WCM does not depend on the condititioning variables. For
cluster samples and panel data, a useful WCM is the exchangeable matrix that assumes a
constant correlation, , across any two pairs m and p. Combined with (27), the WCM gives an
approximation to Vary
g
|x
g
, Z
g
. The resulting (simple) structure is known to be wrong if the
random effects probit model, or any extensions (such as modeling the within-group correlation
in u
g1
, . . . , u
g,Mg
). The idea is that accounting for the within-cluster correlation in even a
crude way generally leads to more efficient estimation than ignoring the correlation in
estimation, as with pooled probit. Because the M
g
M
g
variance matrices are assumed to be
misspecified, a sandwich estimator should be used for inference, and this option is standard
(and should always be used) for GEE applications. The GEE estimates are directly comparable
to the pooled probit estimates the parameters index the APEs in both cases and can be
32
compared to random effects estimates after scaling the latter by 1/1 o
c
2
1/2
.
In many large G applications, interest centers on the causal effect of the unit specific
covariates, z
gm
, controlling for differences in unobserved heterogeneity, c
g
. Then, x
g
(and an
intercept) is absorbed into c
g
and correlation between c
g
and z
g1
, z
g2
, . . . , z
g,Mg
is allowed.
For linear models, we saw that the within or fixed effects estimator allows arbitrary
correlation, and does not restrict the within-cluster dependence of u
g1
, . . . , u
g,Mg
.
Unfortunately, allowing general correlation is much more difficult in nonlinear models. In the
balanced case, where the M
g
are the same, a device due to Chamberlain (1980) can be used.
Actually, in practice a more restrictive version is often used: the distribution of c
g
given
z
g1
, z
g2
, . . . , z
gM
is assumed to be homoskedasticity normal with a mean linear in the group
average:
c
g
|Z
g
~Normalp zg
, o
a
2
, (28)
where o
a
2
is the conditional variance Varc
g
|Z
g
. If we use all random effects probit
assumptions but with (28) in place of (24), then we obtain a simple extension of the RE probit
model: simply add the group averages, zg
, as a set of additional explanatory variables. The
marginal distributions are
Py
gm
1|Z
g
p z
gm
, zg
/1 o
a
2
1/2
p
a
z
gm
,
a
zg
a
(29)
where now the coefficients are scaled by a function of the conditional variance. As shown in
Wooldridge (2002, Chapter 15), the average partial effects are consistently estimated by taking
derivatives and changes of
G
1
g1
G
p
a
z,
a
zg
a
(30)
33
with respect to elements of z. In other words, we average out the effects of the group averages,
zg
. The RE probit assumptions identify the original coefficients and o
a
2
. We can also apply
pooled probit or GEE without (26) and directly estimate the scaled coefficients (and only the
scaled coefficients are identified). Also, one is free to add x
g
to the set of covariates, but,
typically, one would not interpret the partial effects in a causal way (because c
g
and x
g
might
be (partially) correlated.)
Unfortunately, while the Chamberlain-Mundlak device is sensible for balanced cluster
samples (including balanced panels), it needs to be modified for the unbalanced case. [Of
course, one can always balance a cluster sample under the assumption that the cluster sizes are
exogenous, and that might be desirable if there is not much variation in the cluster sizes. But
sometimes it entails throwing away a lot of data.] Here I will simply mention what might be
done. At a minimum, (28) should be modified to allow the variances to depend on the cluster
size, M
g
. Under restrictive assumptions, such as joint normality of c
g
, z
g1
, . . . , z
g,Mg
, with the
z
gm
independent and identically distributed within a cluster, one can derive Varc
g
|Z
g
. But
these are strong assumptions. With G large and little variation in the group sizes, M
g
, one
might just allow a different variance for each M
g
. Then the marginal distributions are
Py
gm
1|Z
g
p z
gm
, zg
/1 o
a,Mg
2
1/2
. (31)
A simple approach, and one that allows p, to be p
Mg
,
Mg
, is to estimate a different
equation by pooled probit for each group size [with covariates 1, z
gm
, zg
]. Then, we can
estimate APEs for each M
g
, and then average those. Obtaining standard errors via the delta
method would be tricky, and this approach ignores the constancy of , across groups. [In
effect, we would just be obtaining p
Mg
, ,
Mg
,
Mg
for each group size.] To impose a common
34
,, a minimum distance approach can be used.
Alternatively, for each different value of M
g
we can estimate a random effects probit model
with the zg
included as covariates. In fact, we could be more general and allow p, to be
p
Mg
,
Mg
. Then, after estimating the slope and variance parameters for a set of extended RE
probits, we can then use a minimum distance approach to estimate the common ,. This might
be a useful topic for future research.
In estimating APEs, one can dispense with parametric assumptions entirely, provided
nonparametric estimation is feasible. In particular, consider the two assumptions
Py
gm
1|Z
g
, c
g
Py
gm
1|z
gm
, c
g
Fz
gm
, c
g
(32)
and
Dc
g
|z
g1
, z
g2
, . . . , z
g,Mg
Dc
g
|zg
. (33)
Assumption (32) implies strict exogeneity conditional on c
g
while (33) means that the
distribution of the group effect, c
g
, given all within-group covariates, depends only on the
group average; a specific distributional assumption, such as homoskedastic normality with a
mean linear in zg
, is clearly a special case.
Define H
g
z
gm
, zg
Py
gm
1|z
gm
, zg
. Under (32) and (33), it can be show that the
APEs are obtained from
E
zg
H
g
z, zg
; (34)
see, for example, Wooldridge (2005). If the group sizes differ, H
g
, generally does depend
on g. If there are relatively few group sizes, it makes sense to estimate the H
g
, separately
for each group size M
g
. Then, the APEs can be estimated from
35
G
1
g1
G
g
z, zg
. (35)
As a practical matter, we might just use flexible parametric models, such as pooled probit or
random effects probit. This approach does not impose the constant function F, as specified
in (32), and so more efficient methods should be available, but I will not explore that here.
Other strategies are available for estimating APEs. A procedure commonly known as
fixed effects probit treats the c
g
as parameters to estimate in
Py
gm
1|Z
g
, c
g
Py
gm
1|z
gm
, c
g
z
gm
, c
g
. (36)
With small group sizes M
g
(say, siblings or short panel data sets), treating the c
g
as parameters
to estimate creates an incidental parameters problem. In particular, the estimator of , can be
badly biased. Interestingly, as shown in recent work by Fernndez-Val (2005) for the balanced
panel case and independence across m conditional on Z
g
, c
g
, the average partial effects,
obtained from
G
1
g1
G
z,
g
, (37)
are unbiased for the population APEs up to order M
2
when the group effects are not present.
When group effects are present, Fernndez-Val (2005) proposes a bias correction for the APEs.
See Fernndez-Val (2005) for details.
Some prefer to use a logit response function rather than probit. The use of logit for any of
the methods discussed previously makes estimation of parameters and average partial effects
more difficult, so logit has little to offer as an alternative. Nevertheless, for allowing
correlation between c
g
and z
gm
, logit has an advantage over probit: under the conditional
36
independence assumption
y
g1
, . . . , y
g,Mg
are independent conditional on Z
g
, c
g
, (38)
the conditional maximum likelihood estimator eliminates c
g
and leads to consistent estimation
of ,, without having to specify Dc
g
|Z
g
. While this is useful for obtaining directions and
relative magnitudes, it does not allow us to estimate the APEs. Plus, for panel data
applications it is limited by the conditional independence assumption.
3.1.2. Count Data
Almost all of the discussion for binary response models extends to other nonlinear models
for cluster samples. An important application is to count data (or, in fact, any nonnegative
response variable). A popular model for the conditional mean is
Ey
gm
|x
g
, Z
g
, c
g
Ey
gm
|x
g
, z
gm
, c
g
expo x
g
[ z
gm
, c
g
. (39)
As with the probit model we assume strict exogeneity of the unit-specific covariates, z
gm
,
conditional on x
g
, c
g
. As usual, strict exogeneity can be relaxed for pooled methods but not
for random effects, fixed effects, and GLS-type methods. A convenient normalization in (39)
is Eexpc
g
1.
Under (39), a simple way to estimate o, [, and , is the pooled Poisson quasi-maximum
likelihood estimator (QMLE). In other words, act as if y
gm
has a Poisson distribution given
x
g
, z
gm
and ignore the within-group correlation in estimation. In many statistical packages, it
is simple to make inference robust to arbitrary within-group correlation (including serial
correlation in a panel data setting). The Poisson QMLE is known to be fully robust to
37
distributional misspecification, as well as within-group correlation. All that is required is
Ey
gm
|x
g
, Z
m
expo x
g
[ z
gm
, (40)
(and we can replace Z
m
with z
gm
) along with regularity conditions. See Wooldridge (2002,
Chapter 19) for references and a summary.
As with the linear case, any pooled method is likely to be inefficient. A typical random
effects Poisson approach adds the assumptions
y
gm
|x
g
, Z
g
, c
g
~Poissonexpo x
g
[ z
gm
, c
g
, (41)
y
g1
, . . . , y
g,Mg
are independent conditional on x
g
, Z
g
, c
g
, (42)
and
expc
g
|x
g
, Z
g
~Gammat, t, (43)
where the gamma distribution is parameterized to have mean equal to one. [Of course, other
distributions are possible, but the gamma is very common.] This estimator is programmed in
many packages, and is, of course, efficient under the maintained assumptions. Unfortunately,
the estimator has no known robustness properties to violations of (41), (42), or (43).
As in the binary response case, a middle ground that maintains robustness while likely
being more efficient than the pooled estimator is available. A GEE approach assumes (40),
uses the nominal Poisson variance assumption
Vary
gm
|x
g
, Z
g
Ey
gm
|x
g
, Z
g
expo x
g
[ z
gm
, (44)
and specifies a working correlation matrix. Typically, with a cluster sample the
exchangeable option would be used. As described in Section 3.1.1, the GEE estimator is a
multivariate weighted nonlinear least squares estimator with a (probably) misspecified
38
variance matrix. Thus, inference should be made fully robust using a sandwich estimator, and
this option is available (see appendix).
Unlike in the probit case, for count data we can derive the correct conditional variance
matrix under the nominal random effects Poisson assumption, and the resulting correlation
matrix does not have the exchangeable form. As shown in Wooldridge (2002, Chapter 19),
Vary
gm
|x
g
, Z
g
expo x
g
[ z
gm
, p
2
expo x
g
[ z
gm
,
2
Covy
gm
, y
mp
|x
g
, Z
g
p
2
expo x
g
[ z
gm
, expo x
g
[ z
gp
,
(45)
(46)
where p
2
1/t Varexpc
g
. The conditional correlation clearly depends on the
covariates. Therefore, rather than using a constant working correlation, one uses the variance
matrix implied by (45) and (46). Such an estimator is still fully robust but would be more
efficient than the exchangeable GEE estimator under the RE Poisson assumptions.
Wooldridge (2002, Chapter 19) provides details on estimating p
2
as well as fully robust
inference [to guard against (45) or (46) being incorrect]. As far as I know, this estimator is not
programmed in standard econometrics software, although it would not be difficult to do so.
It is straightforward to extend the previous models to allow correlation between c
g
and z
gm
.
Is in the binary response case, we initially drop x
g
. Then, we might be willing to assume
Eexpc
g
|Z
g
Eexpc
g
|zg
exp0
g
zg
g
(47)
where 0
g
,
g
depends on g through the group size, M
g
. Then
Ey
gm
|Z
m
exp0
g
z
gm
, zg
g
(48)
and now we can apply the pooled Poisson estimator, the GEE estimator, or the WNLS
estimator implied by (45) and (46), but but where we allow the slope and intercepts on zg
to
depend on the group sizes. One might want to use a different estimate of p
2
for each group
39
size, but that is not necessary for consistency. One could also extend the RE Poisson estimator
to allow expc
g
|Z
g
to have a gamma distribution with mean in (47), although estimation
would be much more complicated than extending the other methods. As in the binary response
case, one can add x
g
to the model, subject to the issue of how to interpret the parameters.
Because the elements of , are semi-elasticities (or elasticities), often estimation of , is
sufficient. Still, estimation of APEs can be based on
G
1
g1
G
exp0
g
z, zg
g
. (49)
Rather than restricting the nature of the relationship between c
g
and Z
g
, we can use the
fixed effects Poisson estimator to consistently estimate . Originally, this estimator was
derived as a conditional MLE by Hausman, Hall, and Griliches (1984). Blundell, Griffith, and
Windmeijer (2002) showed that treating the c
g
as parameters to estimate is the same as the
conditional MLE, so the name fixed effects Poisson cannot cause confusion.
As shown by Wooldridge (1999), the FE Poisson estimator has very desirable robustness
properties: it is consistent and G -asymptotically normal under the assumption
Ey
gm
|Z
g
, c
g
expz
gm
, c
g
. (50)
The actual distribution of y
gm
is practically unrestricted (except for regularity conditions), as is
the conditition dependence withing group. If interest lies in ,, the FE Poisson estimator is very
attractive. Of course, one should generally use a sandwich variance matrix estimator [see
Wooldridge (1999) or Wooldridge (2002, Section 19.6.4)]; unfortunately, this is not currently a
standard option in statistical packages.
40
3.1.2. Tobit Models for Corner Solutions
The discussion for Tobit models is very similar for probit models. Here, I assume that
Tobit is applied to a corner solution response, so that y
gm
, the nonnegative variable with
probability mass at zero, is the variable we wish to explain. In particular, I will focus on
estimating APEs on the mean response.
The standard Tobit model with a group effect is
y
gm
max0, o x
g
[ z
gm
, c
g
u
gm
(51)
u
gm
|x
g
, Z
g
, c
g
~Normal0, o
u
2
(52)
Among other things, (51) and (52) imply
Ey
gm
|x
g
, Z
g
, c
g
o x
g
[ z
gm
, c
g
/o
u
o x
g
[ z
gm
, c
g
o
u
o x
g
[ z
gm
, c
g
/o
u
mo x
g
[ z
gm
, c
g
, o
u
2
. (53)
The APEs are obtained by integrating this expression with respect to the distribution of c
g
. A
typical andom effects approach assumes
c
g
|x
g
, Z
g
~Normal0, o
c
2
, (54)
in which case the APEs are remarkably simple to obtain. They are estimated from
mo x[
z,, o
v
2
(55)
where o
v
2
o
c
2
o
u
2
. Conveniently, the pooled Tobit directly provides consistent and
G -asymptotically normal estimators of o, [, , and o
v
2
. A sandwich estimator is easily
obtained to allow arbitrary within-group correlation.
The full random effects Tobit specification adds the same conditional independence
41
assumption as random effects probit; see (26). Then o
c
2
and o
u
2
are separately identified, but
consistency requires all RE Tobit assumptions. In principle, a GEE-type approach can be used
in the Tobit case as well. This could be based on the scores from the Tobit log-likelihood,
although, as far as I know, details remain to be worked out.
As with probit, we can modify pooled Tobit and RE Tobit to allow for correlation between
c
g
and Z
g
, as in (28). But, with different group sizes, we should replace o
a
2
Varc
g
|zg
with
o
a,Mg
2
. As in the probit case, we can estimate a different variance for each group size M
g
if the
group sizes are relatively few. An even easier approach is to allow all parameters to differ by
M
g
, estimate the APEs in each case, and average them together. Either pooled Tobit or RE
Tobit can be used. The APEs would be obtained from
G
1
g1
G
mp
Mg
z,
Mg
zg
Mg
, o
v,Mg
2
(56)
where the M
g
just indicates a different set of estimates for each different group size.
With larger M
g
, one might use fixed effects Tobit. While it is known that the estimator
of , suffers from an incidental parameters problem, one can use
G
1
g1
G
mz,
g
, o
u
2
(57)
to estimate APEs. I know of no simulation evidence that studies APEs based on (57); they
might be reasonable estimates for moderate M
g
. Fernndez-Val (2005) contains some relevant
theory.
3.1.4. Fractional Responses
42
Recently, more applied econometric work is recognizing that fractional responses can
require special treatment. Naturally, one can always opt for a linear model (as in the case of
binary responses, count, and corner solutions responses). But there is more interest in using
functional forms that ensure expected values are in the unit interval. One possibility for
responses that have probability mass at zero and one is to adopt a two-limit Tobit model. The
analysis would then be very similar to Section 3.1.3. In particular, the formulas for the
expected value in the two-limit case can be easily adapted to allow the presence of unobserved
heterogeneity, either assumed independent of the covariates or with Dc
g
|zg
assumed normal.
The drawback of the two-limit Tobit model is that it makes sense only when there is pile up
a zero and one. Even in such cases, if fully specifies the distribution Dy
gm
|x
g
, Z
g
. If we are
primarily interested in effects on the mean response, more robust methods are available. These
methods produce consistent estimators with pile up at neither, both, or only one endpoint.
Papke and Wooldridge (2005) recently proposed methods for fractional responses for
balanced panel data sets. Here I draw on that work and make suggestions for the unbalanced
case. In fact, the methods are virtually identical to the results for the probit case, but with a
change in how we interpret the probit response function. With y
gm
a fractional response, we
begin with
Ey
gm
|x
g
, Z
g
, c
g
o x
g
[ z
gm
,, m 1, . . . , M
g
. (58)
Equation (58) is attractive as a mean response In other words, we simply assume the mean
response has the probit form. [The logit form is also possible but does not lead to simple
estimating equations.] If we add the independence and normality assumption
43
c
g
|x
g
, Z
g
~Normal0, o
c
2
, (59)
just as in (24), then we have an extension of (25):
Ey
gm
|x
g
, Z
g
o
c
x
g
[
c
z
gm
,
c
, (60)
where the c index denotes scaling by 1/1 o
c
2
1/2
. Because the Bernoulli quasi-likelihood
function identifies the parameters of a correctly specified conditional mean see Gourieroux,
Monfort, and Trognon (1984) or Papke and Wooldridge (1996) we can use pooled probit
to consistently estimate the scaled parameters and APEs. Naturally, we should use a fully
robust sandwich variance matrix estimator to account for within-group correlation but also for
the fact that the Bernoulli variance assumption can no longer be expected to hold. See Papke
and Wooldridge (2005) for more discussion.
A random effects probit analysis does not make sense here because we do not have a fully
specified distribution. But the GEE methods described for probit apply immediately. In fact,
using standard GEE commands, you would not even know y
gm
is fractional or binary (except
possibly for a warning that you applying probit to a nonbinary response).
The discussion about allowing correlation between c
g
and Z
g
is identical to the binary
response case. Again, it is important to allow at least Varc
g
|zg
depend on the group size, and
possibly the slopes on zg
as well. A simple, flexible, but inefficient approach is to estimate a
different model for each M
g
, provided there is not much variation in M
g
. Better would be to
apply a minimum distance approach that recognizes a constant , vector on z
gm
in (58).
3.2 Large Group Size Asymptotics
44
Unlike in the linear case covered by Donald and Lang (2001), for nonlinear models no
exact inference is available. Nevertheless, if the group sizes M
g
are reasonably large, we can
extend the approximate inference using the DL approach to nonlinear models. Plus, the
minimum distance approach carries over essentially without change.
We can apply the methods to any of the nonlinear models in Section 3.1 (as well as others).
Here, I illustrate the probit response function (which means it can apply to binary or fractional
responses).
With small G but random sampling of y
gm
, z
gm
: m 1, . . . , M
g
within each g, write
Py
gm
1|z
gm
o
g
z
gm
,
m
, m 1, . . . , M
g
(61)
o
g
o x
g
[, g 1, . . . , G. (62)
As with the linear model, we assume the intercept, o
g
in (61), is a function of the group
features x
g
. With the M
g
moderately large, we can get good estimates of the o
g
. The
o
g
, g 1, . . . , G, are easily obtained by estimating a separate probit for each group.
Under (62), we can apply the minimum distance approach just as before. Let Avaro
g
(so these shrink to zero at the rate 1/M
g
).
For binary response, these are just the usual MLE estimated variances. For fractional response,
these would be from a sandwich estimate of the asymptotic variance. Then, we obtain the
minimum distance estimates as the WLS estimates from
o
g
on 1, x
g
, g 1, . . . , G (63)
using weights 1/Avaro
g
are used as the weights. This is the efficient minimum distance
45
estimator and, conveniently, the proper asymptotic standard errors are reported from the WLS
estimation (even though we are doing large M
g
, not large G, asymptotics.) The
overidentification test is obtained exactly as in the linear case: there are G K 1
degrees-of-freedom in the test.
The same cautions about using the overidentification test to reject the minimum distance
approach apply here as well. In particular, in the treatment effect setup, where x
g
is zero or
one, one might reject a weighted comparision of means simply because the means within the
control or within the treatment group differ, or both. It might make more sense to define, using
other considerations, weighted averages of the population means, and then to estimate those
means using within-group sample averages. Similar considerations may be relevant when x
g
is
a vector with a more complicated structure.
If we impose common slopes ,
g
,, g 1, . . . , G, then in applying MD estimation we
should account for the across group correlation in the o
g
, as described in Section 2.3. This is
also true if we impose that a subset of the slopes are common across g.
If we reject the overidentification restrictions and wish to have conservative inference, then
we can adapt Donald and Lang and treat
o
g
o x
g
[ error
g
, g 1, . . . , G (64)
as approximately satisfying the classical linear model assumptions, provided G K 1, just as
before. As in the linear case, this approach is justified if o
g
o x
g
[ c
g
with c
g
independent of x
g
and c
g
drawn from a homoskedastic normal distribution. It assumes that we
can ignore the estimation error in o
g
, based on o
g
o
g
O1/ M
g
.
Because the DL approach ignores the estimation error in o
g
, it is unchanged if one imposes
46
some constant slopes across the groups, as with the linear model.
Once one estimates o and [, the estimated effect on the response probability can be
obtained by averaging the response probability for a given x:
G
1
g1
G
M
g
1
m1
Mg
o x[
z
gm
,
g
, (65)
where derivatives or differences with respect to the elements of x can be computed. Here, the
minimum distance approach has an important advantage of the DL approach: the finite sample
properties of (65) are viritually impossible to obtain, whereas the large-M
g
asymptotics
underlying minimum distance would be straightforward using the delta method. Whether
bootstrapping can be applied is an interesting question.
Particularly with binary response problems, the two-step methods described here are
problematical when the response does not vary within group. For example, suppose that x
g
is a
binary treatment equal to one for receiving a voucher to attend college and y
gm
is an
indicator of attending college. Each group is a high school class, say. If some high schools
have all students attend college, one cannot use probit (or logit) of y
gm
on z
gm
, m 1, . . . , M
g
.
A linear regression returns zero slope coefficients and intercept equal to unity. Of course, if
randomization occurs at the group level that is, x
g
is independent of group attributes then it
is not necessary to control for the z
gm
. Instead, the within-group averages can be used in a
simple minimum distance approach. In this case, as y
gm
is binary, the DL approximation will
not be valid, as the CLM assumptions will not even approximately hold in the model
y g
o x
g
[ e
g
(because y g
is always a fraction regardless of the size of M
g
).
As in the linear case, more flexibility is afforded if G is somewhat large along with large
M
g
. Then, in implementing the DL approach with or without common slopes imposed in the
47
first stage one gains robustness to nonnormality of c
g
if G is large enough so that
G
1
g1
G
c
g
and G
1
g1
G
x
g
c
g
are approximately normally distributed. If we estimate
separate models in the first stage and do not wish to ignore the estimation error in o
g
, then
error
g
in (64) is heteroskedastic, and we can, perhaps, use heteroskedasticity-robust standard
errors in (64).
4. CONCLUDING REMARKS
My purpose is writing this expanded version of Wooldridge (2003) is to flesh out some
issues only touched on in the shorter paper, while also considering more general frameworks.
This version of the paper contains what are best described as conjectures, or at least
suggestions that may not turn out to work very well. Future research could resolve the issue of
how to allow group heterogeneity to be correlated with the unit-specific covariates when the
group sizes differ. Minimum distance estimation strikes me as promising, but that remains to
be seen. Less parametric approaches should also be considered.
Much more remains to be learned about the small G case, too, particularly for nonlinear
models. A similuation study comparing Donald and Langs (2001) conservative approach to
minimum distance estimation, when neither method is entirely appropriate, would be
worthwhile. For example, in a treatment effect case with G 4, two treatment groups and two
control groups, with different means within each group, one might select different, modest
sizes of M
g
(ranging from maybe 20 to 40). The underlying population distributions can be
normal. Which method, DL or minimum distance, produces the best confidence intervals for
48
the ATE (defined appropriately for a fictitious population)?
49
REFERENCES
Angrist, J.D. and V. Lavy (2002), The Effect of High School Matriculation Awards:
Evidence from Randomized Trials, NBER Working Paper 9389.
Arellano, M. (1987), Computing Robust Standard Errors for Within-Groups Estimators,
Oxford Bulletin of Economics and Statistics 49, 431-434.
Baker, M. and N.M. Fortin (2001), Occupational Gender Composition and Wages in
Canada, 1987-1988, Canadian Journal of Economics 34, 345-376.
Bertrand, M., E. Duflo, and S. Mullainathan (2002), How Much Should We Trust
Differences-in-Differences Estimates? mimeo, MIT Department of Economics.
Blundell, R., R. Griffith, and F. Windmeijer (2002), Individual Effects and Dynamics in
Count Data Models, Journal of Econometrics 108, 113-131.
Campolieti, M. (2004), Disability Insurance Benefits and Labor Supply: Some Additional
Evidence, Journal of Labor Economics 22, 863-889.
Card, D., and A.B. Krueger (1994), Minimum Wages and Employment: A Case Study of
the Fast-Food Industry in New Jersey and Pennsylvania, American Economic Review 84,
772-793.
Donald, S.G. and K. Lang (2001), Inference with Difference in Differences and Other
Panel Data, mimeo, University of Texas Department of Economics.
Fernndez-Val, I. (2005), Estimation of Structural Parameters and Marginal Effects in
Binary Choice Panel Data Models with Fixed Effects, mimeo, Boston University Department
of Economics.
50
Hausman, J.A., B.H. Hall, and Z. Griliches (1984), Econometric Models for Count Data
with an Application to the Patents-R&D Relationship, Econometrica 52, 909-938.
Hsiao, C. (1986), Analysis of Panel Data. Cambridge: Cambridge University Press.
Kzdi, G. (2001), Robust Standard Error Estimation in Fixed-Effects Panel Models,
mimeo, University of Michigan Department of Economics.
Liang, K.-Y., and S.L. Zeger (1986), Longitudinal Data Analysis Using Generalized
Linear Models, Biometrika 73, 13-22.
Loeb, S. and J. Bound (1996), The Effect of Measured School Inputs on Academic
Achievement: Evidence form the 1920s, 1930s and 1940s Birth Cohorts, Review of
Economics and Statistics 78, 653-664
Moulton, B.R. (1990), An Illustration of a Pitfall in Estimating the Effects of Aggregate
Variables on Micro Units, Review of Economics and Statistics 72, 334-338.
Papke, L.E. and J.M. Wooldridge (1996), Econometric Methods for Fractional Response
Variables with an Application to 401(k) Plan Participation Rates, Journal of Applied
Econometrics 11, 619-632.
Papke, L.E. and J.M. Wooldridge (2005), Panel Data Methods for Fractional Response
Variables with an Application to Test Pass Rates, mimeo, Michigan State University
Department of Economics.
Pepper, J.V. (2002), Robust Inferences from Random Clustered Samples: An Application
Using Data from the Panel Study of Income Dynamics, Economics Letters 75, 341-345.
White, H. (1980), A Heterosekdasticity-Consistent Covariance Matrix Estimator and A
Direct Test for Heteroskeasticity, Econometrica 48, 817-838.
White, H. (1982), Maximum Likelihood Estimation with Misspecified Models,
51
Econometrica 50, 1-26.
White, H. (1984), Asymptotic Theory for Econometricians. Academic Press: Orlando, FL.
Wooldridge, J.M. (1999) Distribution-Free Estimation of Some Nonlinear Panel Data
Models, Journal of Econometrics 90, 77-97.
Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data. MIT
Press: Cambridge, MA.
Wooldridge, J.M. (2003), Cluster-Sample Methods in Applied Econometrics, American
Economic Review 93, 133-138.
Wooldridge, J.M. (2005), Unobserved Heterogeneity and Estimation of Average Partial
Effects, in Identification and Inference for Econometric Models: Essays in Honor of Thomas
Rothenberg. D.W.K. Andrews and J.H. Stock (eds.). Cambridge: Cambridge University Press,
27-55.
52
APPENDIX
Below are commands in version 8.0 of Stata that can be used for robust inference with
cluster or panel data sets. All commands assume that there is an identifier, id, that identifies
the cluster or group for each observation.
Section 2.2
Pooled OLS
reg y x1 x2 ... xK z1 z2 ... zL, cluster(id)
Random Effects: Usual Inference
iis id
xtreg y x1 ... xK z1 ... zL, re
Random Effects: Fully Robust Inference
iis id
xtgee y x1 x2 ... xK z1 ... zL, corr(exch) robust
Note: The xtgee and xtreg estimates will generally differ for two reasons. First, xtgee uses
estimates of o
c
2
based on an initial pooled OLS estimation, while xtreg uses a fixed-effects
based estimate. Second, xtgee iterates, while xtreg does not.
Section 2.3
A separate linear model is estimated for each group identifier to obtain the asymptotic
53
variances of the intercepts. In the second step, OLS can be used (as in DL), relying on the
reported t statistics and critical values, or minimum distance (implemented as WLS) using the
standard normal approximation.
reg y z1 ... zL if id g
Section 3.1.1
Pooled Probit
probit y x1 x2 ... xK z1 z2 ... zL, robust cluster(id)
Random Effects Probit
iis id
xtprobit y x1 x2 ... xK z1 z2 ... zK, re
GEE with an Exchangeable Working Correlation Matrix
iis id
xtgee y x1 x2 ... xK z1 z2 ... zK, family(bin) link(probit) corr(exch) robust
or
xtprobit y x1 x2 ... xK z1 z2 ... zK, pa
(The qualifier pa stands for population averaged. Stata simply uses the first command
when pa is used with xtprobit.)
Using the Chamberlain-Mundlak Device
All of the previous commands can include the within-group averages of the unit-specific
covariates. For example, Chamberlains random effects probit model can be estimated by
iis id
54
xtprobit y x1 x2 ... xK z1 z2 ... zL z1bar z2bar ... zLbar, re
But, as discussed in Section 3.1.1, at a minimum these coefficients are all scaled by a
variance that depends on the group size, M
g
. One could do this estimation for each group size
and average the APEs. Even better would be to use minimum distance to impose the
restriction of a common , in the structural model.
Section 3.1.2
Pooled Poisson
poisson y x1 ... xK z1 ... zL, robust cluster(id)
Random Effects Poisson
iis id
xtpois y x1 ... xK z1 ... zL, re
GEE with an Exchangeable WCM
xtgee y x1 ... xK z1 ... zL, family(pois) link(log) corr(exc) robust
The Fixed Effects Poisson Estimator
iis id
xtpois y x1 ... xK z1 ... zL, fe
Note: Currently, Stata does not report robust standard errors for FE Poisson
Section 3.1.3
Pooled Tobit
tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) cluster(id)
55
Random Effects Tobit
xttobit y x1... xK z1 ... zL z1bar ... zLbar, re
Section 3.1.4
Pooled Two-Limit Tobit
tobit y x1... xK z1 ... zL z1bar ... zLbar, ll(0) ul(1) cluster(id)
Note: Unfortunately, xttobit does not currently allow specifying an upper bound, so the RE
version of the two-limit Tobit is not immediately available.
Pooled Quasi-MLE
glm y x1 ... xK z1 ... zL, family(bin) link(probit) robust cluster(id)
GEE with Exchangeable WCM
xtgee y x1 ... xK z1 ... zL, family(bin) link(logit) corr(exch) robust
Using the Chamberlain-Mundlak Device
For GEE, this would be
xtgee y x1 ... xK z1 ... zL z1bar ... zLbar, family(bin) link(probit) corr(exch) robust
Section 3.2
A separate model is estimated for each group identifier to obtain the asymptotic variances
of the intercepts. So, the commands are carried out for each group size, g. In the second step,
OLS can be used (as in DL), relying on the reported t statistics and critical values, or minimum
distance (implemented as WLS) using the standard normal approximation.
Binary Response
56
probit y z1 ... zL if id g
Count Response
glm y z1 ... zL if id g, family(poisson) link(log) robust
Fractional Response
glm y z1 ... zL if id g, family(bin) link(probit) robust
or
glm y z1 ... zL if id g, family(bin) link(logit) robust
57