
LOGISTIC REGRESSION ANALYSIS WITH AGGREGATE DATA: TACKLING THE ECOLOGICAL FALLACY

David G. Steel, University of Wollongong, Australia
Mark Tranmer, University of Southampton, UK
D. Holt, Office for National Statistics, London, UK

David G. Steel, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia (david_steel@uow.edu.au)

Key Words: Ecological analysis, ecological fallacy, multilevel models, random effects, grouping variables.

This research was supported by the UK Economic and Social Research Council, grant Ft000 23 6135.

1. Introduction

An ecological analysis uses aggregate group level data to estimate individual level relationships. The ecological fallacy arises when ecological analysis provides biased estimates of individual level relationships. The groups involved are often small geographic areas such as census Enumeration Districts (EDs). Suppose that we are interested in investigating the relationship between two variables Y and X. The aggregate data available for area g usually consist of the group level means $\bar{Y}_g$ and $\bar{X}_g$.

The ecological fallacy arises because the individuals within the groups are not equivalent to randomly formed groups. Progress in understanding aggregation effects requires allowance for the population structure in the model underpinning the analysis. Steel and Holt (1996) proposed a model for cases where the relationship between the two variables of interest, Y and X, is linear. This model incorporated auxiliary variables, Z, which explain much of the within area homogeneity of the variables of interest. Random group level coefficients were also included to reflect group level effects additional to those due to the auxiliary variables. Based on this model Steel and Holt (1996) developed a method to adjust group level covariance matrices using limited individual level data available for the auxiliary variables. Steel, Holt and Tranmer (1996) evaluated this approach for estimating correlation coefficients using ED level data from the 1991 UK population census. Using a combination of basic demographic and housing variables as auxiliary variables they were able to reduce the aggregation effects in a set of ecological correlations by up to 70 per cent.

In social research the variables are usually categorical at the individual level and the area level means are the corresponding proportions. Various methods of ecological inference in this situation were evaluated by Cleave, Brown and Payne (1995). These included linear ecological regression as originally suggested by Goodman (1959) and a method based on an Aggregated Compound Multinomial (ACM) model. The ACM model assumed that the frequencies in each group have independent multinomial distributions and used a Dirichlet compounding distribution. Their evaluation favoured the ACM based method but they recognized that it is not easy to implement. Recently King (1997) proposed a new method of ecological analysis for categorical data. This method models the conditional proportions of Y given X as random effects with a joint truncated Normal distribution and exploits the constraints implied by the group means.

In this paper we develop and evaluate some simple adjusted ecological analysis procedures based on the idea of incorporating auxiliary variables and using individual level data available for these variables.

2. Adjusted Ecological Analysis for Dichotomous Variables

Let $Y_i$ and $X_i$ be the values of the variables of interest for the ith individual in the population. Suppose there is a sample, s, of groups and within the sampled group, g, a sample of $n_g$ individuals

is used to calculate sample group means:

$$\bar{Y}_g = \frac{1}{n_g}\sum_{i} Y_i \quad \text{and} \quad \bar{X}_g = \frac{1}{n_g}\sum_{i} X_i,$$

where the sums run over the $n_g$ sampled individuals in group g. The variables are dichotomous and so these means are proportions. An important special case occurs when the means are available for all areas and each mean is based on all individuals within the relevant area.

For the sample in the group g let $n_{abg}$ be the number of individuals for which $Y_i = a$ and $X_i = b$. The corresponding population counts are $N_{abg}$. We use "+" to indicate summation over a subscript. Define $p_{ab|g} = n_{abg}/n_{++g}$ and $P_{ab|g} = N_{abg}/N_{++g}$ as the sample and finite population proportions for group g. The conditional proportions are $p_{a|bg} = n_{abg}/n_{+bg}$ and $P_{a|bg} = N_{abg}/N_{+bg}$. Define $p_{ab+} = n_{ab+}/n_{+++}$ and $P_{ab+} = N_{ab+}/N_{+++}$ as the overall sample and finite population proportions. The corresponding conditional proportions are $p_{a|b+} = n_{ab+}/n_{+b+}$ and $P_{a|b+} = N_{ab+}/N_{+b+}$.

The basis of the approach proposed by Steel and Holt (1996) is that, for continuous variables, the conditional probability density function of Y given X can be expressed as

$$f(y \mid x) = \frac{\int f(y \mid x, z)\, f(x \mid z)\, f(z)\, dz}{\int f(x \mid z)\, f(z)\, dz}. \tag{1}$$

When the parameters of $f(y \mid x, z)$, $f(x \mid z)$ and $f(z)$ are distinct, analysis can proceed by using individual level data to estimate the parameters of $f(z)$ and aggregate data to estimate the parameters of $f(y \mid x, z)$ and $f(x \mid z)$.

Assume that there is a single auxiliary variable Z and that Y, X and Z are all categorical. The target of inference is the conditional probability distribution of Y given X, $P(Y \mid X)$. The approach we develop here is based on

$$P(Y \mid X) = \frac{\sum_{Z} P(Y \mid X, Z)\, P(X \mid Z)\, P(Z)}{\sum_{Z} P(X \mid Z)\, P(Z)}. \tag{2}$$

Estimation of $P(Y \mid X)$ can be attempted by using aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$ and then using the individual level data to estimate $P(Z)$. The analysis using aggregate data will be based on linear or logistic regression.

The potential benefit of using group level information about covariates in ecological analysis is discussed by Cleave, Brown and Payne (1995) and King (1997). In both cases the covariates are used at the area level to explain some of the variation in the random coefficients characterizing the relationship between the variables across the areas. Our approach is motivated by the idea that a large part of the variation in the relationships across areas is due to compositional effects and can be removed by inclusion of the auxiliary variables. If this is the case then handling the remaining variation between areas should be easier. We propose attempting to average over the auxiliary variables to estimate the marginal relationship between Y and X. This requires information about the relevant parameters of the individual level distribution of the covariates.

For a single categorical auxiliary variable the information needed to calculate the adjusted marginal probabilities consists only of the proportions in each category. These can be calculated from the weighted group level data or could come from some other source, such as a survey. If two or more categorical auxiliary variables are used then the marginal cross tabulations of these variables are required, unless further assumptions can be made concerning the relationship between the auxiliary variables. No individual level data about the variables of direct interest are used.

3. Empirical Evaluation

Individual and group level data from the 1991 UK population census were used to evaluate several methods. The Small Area Statistics (SAS) database provided data in the form of totals for a range of categorical variables for EDs. This was the source of the group means $\bar{Y}_g$, $\bar{X}_g$ and $\bar{Z}_g$. For the variables analysed in this paper these means are all based on 100 per cent of the census records for the relevant ED. Individual level data are available from a 2 per cent Sample of Anonymized Records (SAR) for Local Authority Districts (LADs). The evaluation used data for the LAD of Manchester, which contained 897 EDs and 7613 individuals in the SAR, of which 5802 were aged 16 or more.

Using these data it is possible to calculate adjusted ecological estimates of the marginal probability distribution $P(Y \mid X)$ based on equation (2) using $\bar{Y}_g$, $\bar{X}_g$ and $\bar{Z}_g$ obtained from the SAS database and using individual level data obtained from the SAR concerning variables chosen as auxiliary variables. In this evaluation we considered the

following estimators of the conditional probability $P(Y = 1 \mid X = b) = \pi_{1|b}$, for b = 0, 1.

(a) SAR relative frequencies. The relative frequency obtained from the SAR, $n_{1b+}/n_{+b+}$.

(b) SAS relative frequencies. For pairs of variables for which the SAS contains the relevant cross tabulation we can calculate $N_{1b+}/N_{+b+}$.

(c) Ecological linear regression. This is built around the relationship

$$\bar{Y}_g = P_{1|0g}(1 - \bar{X}_g) + P_{1|1g}\bar{X}_g. \tag{3}$$

If $P_{1|0g}$ and $P_{1|1g}$ are random variables with $E[P_{1|0g} \mid \bar{X}_g] = \pi_{1|0}$ and $E[P_{1|1g} \mid \bar{X}_g] = \pi_{1|1}$ then a linear regression of $\bar{Y}_g$ on $\bar{X}_g$ gives unbiased estimates of $\pi_{1|0}$ and $\pi_{1|1}$. This is the classical Goodman regression approach (Goodman, 1959). This approach can produce estimates outside [0,1] and simple direct use of this model is not usually recommended.

(d) Ecological logistic regression. A common model used in analysing a dichotomous response variable is logistic regression which assumes $Y_i$ is a Binomial variable based on one trial, $B(1, E[Y_i \mid X_i])$, where

$$\log\left(\frac{E[Y_i \mid X_i]}{1 - E[Y_i \mid X_i]}\right) = \alpha + \beta X_i.$$

If groups were completely homogeneous with respect to X then each group total $Y_g$ would be a Binomial variable based on $N_g$ trials with probability such that

$$\log\left(\frac{E[\bar{Y}_g \mid \bar{X}_g]}{1 - E[\bar{Y}_g \mid \bar{X}_g]}\right) = \alpha + \beta \bar{X}_g.$$

Groups are not homogeneous with respect to X but this model has the advantage of not giving predicted probabilities outside [0,1].

(e) Adjusted ecological linear regression. Here we use linear regression based on aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$. Estimation of $P(Y \mid X)$ is then based on equation (2) with the individual level data being used to estimate $P(Z)$.

(f) Adjusted ecological logistic regression. Here we use logistic regression based on aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$. Estimation of $P(Y \mid X)$ is then based on equation (2) with the individual level data being used to estimate $P(Z)$.

(g) Adjusted correlation approach. For two dichotomous variables the correlation combined with the marginal totals determines the proportions in the cross classification, i.e.

$$P_{11+} = P_{1++}P_{+1+} + R_{YX}\sqrt{P_{1++}(1 - P_{1++})\,P_{+1+}(1 - P_{+1+})} \tag{4}$$

where $R_{YX}$ is the correlation coefficient based on the table of proportions $P_{ab+}$. The method proposed by Steel and Holt (1996) is used to produce adjusted estimates of the correlation $R_{YX}(Z)$ which can then be substituted into equation (4). This method enables use of information about several auxiliary variables even if only the two way cross tabulations are available.

(h) King's ecological inference method. This method is also built around equation (3). The group level conditional proportions $P_{1|0g}$ and $P_{1|1g}$ are assumed to have a joint Normal distribution which is truncated so that they each lie in the [0,1] interval. The method incorporates the fact that given $\bar{Y}_g$ and $\bar{X}_g$, equation (3) implies bounds and constraints for $P_{1|1g}$ and $P_{1|0g}$. This approach produces estimates of $P_{1|1g}$ and $P_{1|0g}$ which can then be combined to produce estimates of $P_{1|1}$ and $P_{1|0}$. This method can incorporate group level covariates to help model the variation between groups.

The variables used in the analysis were as follows:

Y: employed, unemployed,
X: marital status,
Z: age 45-59, age 60+, living in owner occupied dwelling, renting from local authority.

These auxiliary variables were chosen because of their success in removing aggregation effects in correlations in the evaluation by Steel, Holt and Tranmer (1996). The analysis was confined to those aged 16 or more.

Table 1 gives the estimated probabilities of being employed (Y = 1) given marital status (X). The estimates are obtained using the methods listed above. The ecological methods and adjusted correlation methods were implemented using the

SAS package and King's method was implemented using EzI software, developed by King and colleagues (King, 1997). The first row of the table gives the "true" values of the probabilities as estimated from census cross tabulations, available for these particular variables from the SAS database, which were the same as the corresponding estimates obtained from the SAR.

When adjustment variables are not included, the ecological linear and ecological logistic estimates are considerably different from the true values. The inclusion of "owner occupied housing" as an adjustment variable leads to ecological linear and logistic estimates that are much closer to the true values. Used as a single adjustment variable, "aged 60 and over" does not improve the estimates. The estimates obtained by the adjusted ecological linear method are fairly close to the true values when several adjustment variables are used, in particular for those combinations of adjustment variables that include housing tenure. The adjusted linear regression method generally works better than the adjusted correlation method suggested by Steel and Holt (1996). In this example, King's method without covariates gives results similar to those for the ecological linear and logistic methods without covariates. When "owner occupied" is used as a covariate, the estimated probabilities are further from the true values than those obtained using the adjusted linear regression method. However, when the covariate "renting from a local authority" is added the estimates from King's method are slightly better than those based on the adjusted linear regression method. Use of King's method with more than two covariates proved difficult in practice and no results for such cases are included.

For the estimates of the proportion unemployed by marital status given in Table 2, the unadjusted ecological linear and logistic methods provide poor estimates. The linear method leads to an out of range estimate. Including "owner occupied" as an adjustment variable in these methods leads to some improvement. The adjusted ecological linear regression method works reasonably well for combinations of adjustment variables that include housing tenure and age. King's method without covariates works somewhat better in this case than in the previous example, because differences in the proportions unemployed and married across the EDs lead to more informative bounds. However, the estimates are still worse than those obtained from the adjusted logistic regression method. When the variable "owner occupied" is used as a covariate, the estimates obtained from King's method are effectively equal to the true values, while those based on the adjusted linear and logistic regression methods are not.

While these results are limited to two relationships, they highlight the importance of using auxiliary variables in ecological analysis to obtain reasonable estimates. The choice of auxiliary variables is important and methods of identifying effective adjustment variables need to be used, as suggested by Steel and Holt (1996). If appropriate auxiliary variables are used, quite simple methods can perform almost as well as more sophisticated, computer intensive ones. We expect that the simple adjusted ecological methods can be extended to also include random effects within the sort of multilevel framework developed, for example, by Goldstein (1995). Methods that solely use random effects to account for the variation in the relationship between groups will, in general, not be very successful. More information needs to be incorporated in order to get useful estimates. This information may be in the form of individual level data on auxiliary variables, group level covariates or the constraints exploited in King's method.

4. References

Cleave, N., Brown, P.J. and Payne, C.D. (1995). Evaluation of Methods for Ecological Inference. Journal of the Royal Statistical Society, A, 158, pp 55-72.
Goldstein, H. (1995). Multilevel Statistical Models, 2nd Edition, Edward Arnold, London.
Goodman, L.A. (1959). Some alternatives to Ecological regression. American Journal of Sociological Review, 18, 663-664.
King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton University Press.
Steel, D. and Holt, D. (1996). Analysing and Adjusting Aggregation Effects: The Ecological Fallacy Revisited. International Statistical Review, 64, pp 39-60.
Steel, D., Holt, D. and Tranmer, M. (1996). Making unit level inferences from aggregated data. Survey Methodology, 22, 3-15.

Table 1: Estimated probabilities: Y = employed; X = married

                       1(a) Ecological   1(b) Ecological   1(c) Correlation   1(d) King's
                       linear            logistic          method             method
Covariate(s)           P1|0    P1|1      P1|0    P1|1      P1|0    P1|1       P1|0    P1|1
SAS 'truth'            .41     .50       .41     .50       .41     .50        .41     .50
none                   .26     .67       .26     .67       .27     .66        .25     .69
age2                   .23     .71       .29     .65       .22     .70
age1, age2             .28     .65       .33     .59       .20     .73
oo                     .48     .42       .49     .41       .55     .35        .50     .39
oo, rla                .47     .43       .46     .44       .51     .39        .44     .46
oo, rla, age1, age2    .45     .45       .45     .46       .47     .43

Source: 1991 UK census data.
Population: Residents aged 16 or more in households, Manchester LAD.
Y takes the value 1 for 'employed' and 0 for 'not employed'.
X takes the value 1 for 'married' and 0 for 'not married'.
P1|0 means P(Y = 1 | X = 0); P1|1 means P(Y = 1 | X = 1).
Adjustment variables: oo = owner occupied; rla = rented from local authority;
age1 = persons aged 45-59; age2 = persons aged 60 and over.
Blank cells: no estimate reported.
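
The covariate rows for the adjusted regression methods in Table 1 rest on the averaging step in equation (2). As a minimal sketch of that step only, assuming a single categorical auxiliary variable Z and assuming that estimates of P(Y|X,Z), P(X|Z) and P(Z) are already in hand, it could be written in Python as follows. The function name `adjusted_conditional` and every number below are hypothetical illustrations, not census figures and not the authors' implementation.

```python
import numpy as np

def adjusted_conditional(p_y_given_xz, p_x_given_z, p_z):
    """Equation (2): P(Y=1 | X=b) for b = 0, 1, averaging over Z.

    p_y_given_xz : array (2, K), entry [b, k] = P(Y=1 | X=b, Z=k)
    p_x_given_z  : array (K,),   entry [k]    = P(X=1 | Z=k)
    p_z          : array (K,),   entry [k]    = P(Z=k), summing to 1
    """
    p_y_given_xz = np.asarray(p_y_given_xz, dtype=float)
    p_x1 = np.asarray(p_x_given_z, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    # Row 0 holds P(X=0 | Z=k), row 1 holds P(X=1 | Z=k)
    p_x = np.vstack([1.0 - p_x1, p_x1])
    joint_xz = p_x * p_z                               # P(X=b, Z=k)
    numerator = (p_y_given_xz * joint_xz).sum(axis=1)  # sum over Z
    denominator = joint_xz.sum(axis=1)                 # P(X=b)
    return numerator / denominator

# Hypothetical inputs for a dichotomous Z (illustrative values only)
p_y_given_xz = [[0.30, 0.60],   # P(Y=1 | X=0, Z=0), P(Y=1 | X=0, Z=1)
                [0.40, 0.70]]   # P(Y=1 | X=1, Z=0), P(Y=1 | X=1, Z=1)
p_x_given_z = [0.25, 0.50]      # P(X=1 | Z=0), P(X=1 | Z=1)
p_z = [0.6, 0.4]                # would come from individual level data

estimates = adjusted_conditional(p_y_given_xz, p_x_given_z, p_z)
print(estimates)  # [P(Y=1 | X=0), P(Y=1 | X=1)]
```

In the paper's setting the first two inputs would come from linear or logistic regressions on the aggregate data and the third from the SAR; the sketch takes all three as given.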

Table 2: Estimated probabilities: Y = unemployed; X = married

                       2(a) Ecological   2(b) Ecological   2(c) Correlation   2(d) King's
                       linear            logistic          method             method
Covariate(s)           P1|0    P1|1      P1|0    P1|1      P1|0    P1|1       P1|0    P1|1
SAS 'truth'            .14     .07       .14     .07       .14     .07        .14     .07
none                   .24    -.06       .17     .02       .27    -.09        .19     .00
age2                   .24    -.05       .17     .02       .27    -.09
age1, age2             .22    -.03       .11     .09       .28    -.09
oo                     .18     .01       .16     .04       .19    -.00        .14     .07
oo, rla                .19     .01       .15     .04       .18     .01
oo, rla, age1, age2    .16     .04       .11     .09       .16     .03

Source: 1991 UK census data.
Population: Residents aged 16 or more in households, Manchester LAD.
Y takes the value 1 for 'unemployed' and 0 for 'not unemployed'.
X takes the value 1 for 'married' and 0 for 'not married'.
P1|0 means P(Y = 1 | X = 0); P1|1 means P(Y = 1 | X = 1).
Adjustment variables: oo = owner occupied; rla = rented from local authority;
age1 = persons aged 45-59; age2 = persons aged 60 and over.
Blank cells: no estimate reported.
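
The negative entries in the ecological linear panel of Table 2 are a known feature of the regression built around equation (3). The following minimal sketch, which uses made-up group means rather than the Manchester ED data, fits that regression by ordinary least squares and shows that nothing constrains the fitted conditional probabilities to [0,1]. The function name `goodman_regression` and all values are hypothetical.

```python
import numpy as np

def goodman_regression(x_bar, y_bar):
    """OLS fit of y_bar = pi_10 * (1 - x_bar) + pi_11 * x_bar, as in equation (3).

    Returns (pi_10, pi_11), the estimates of P(Y=1 | X=0) and P(Y=1 | X=1).
    """
    x_bar = np.asarray(x_bar, dtype=float)
    y_bar = np.asarray(y_bar, dtype=float)
    # Two regressors, (1 - x_bar) and x_bar, with no separate intercept
    design = np.column_stack([1.0 - x_bar, x_bar])
    coef, *_ = np.linalg.lstsq(design, y_bar, rcond=None)
    return coef[0], coef[1]

# Hypothetical group level proportions of X for four areas
x_bar = np.array([0.2, 0.4, 0.5, 0.7])

# Case 1: conditional probabilities are the same in every group,
# so the regression recovers them
y_const = 0.14 * (1 - x_bar) + 0.07 * x_bar
pi_10_ok, pi_11_ok = goodman_regression(x_bar, y_const)

# Case 2: group means fall steeply with x_bar (an aggregation effect),
# so the fitted line extrapolates below zero at x_bar = 1
y_steep = 0.25 - 0.30 * x_bar
pi_10_bad, pi_11_bad = goodman_regression(x_bar, y_steep)

print(pi_10_ok, pi_11_ok)
print(pi_10_bad, pi_11_bad)
```

In the second case every group mean is itself a valid proportion, yet the implied estimate of P(Y = 1 | X = 1) is negative, which is the behaviour the logistic and adjusted methods are meant to avoid or reduce.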
