SAE

© All Rights Reserved


Michael Hidiroglou

Statistical Innovation and Research Division, Statistics Canada, 16th Floor Section D, R.H. Coats Building, Tunney's Pasture, Ottawa, Ontario, K1A 0T6, Canada

Section on Survey Research Methods

Small area estimation (SAE) was first studied at Statistics Canada in the seventies. Small area estimates have been produced using administrative files or surveys enhanced with administrative auxiliary data since the early eighties. In this paper we provide a summary of existing procedures for producing official small-area estimates at Statistics Canada, as well as a summary of the ongoing research. The use of these techniques is illustrated for a number of applications at Statistics Canada that include: the estimation of health statistics; the estimation of average weekly earnings; the estimation of under-coverage in the census; and the estimation of unemployment rates. We also highlight problems in producing small-area estimates for business surveys.

KEY WORDS: Small Area, Official Statistics, Fay-Herriot

A small domain or area refers to a population for which reliable statistics of interest cannot be produced due to certain limitations of the available data. Examples of domains include a geographical region (e.g. a province, county, municipality, etc.), a demographic group (e.g. age x sex), or a demographic group within a geographic region. The demand for such data for small areas has greatly increased during the past few years (Brackstone, 1987). This increase is due to the usefulness of these data in government policy and program development, allocation of various funds and regional planning.

A number of national and regional statistical agencies, including Statistics Canada, have introduced programs aimed at producing estimates for small areas to meet the new demand. Available data to produce such estimates are based on surveys that are not designed for these levels. However, if administrative sources have data at the small area level, and these data are well correlated with variables of interest at the corresponding level, several procedures are available to estimate various parameters of interest for these lower levels.

[...] discussed in Rao (2003) and illustrates how some of the estimators have been used in practice on a number of surveys at Statistics Canada. It is structured as follows. Section 2 provides a summary of the primary uses of small area estimates as well as criteria for computing them. Section 3 defines the notation and provides a number of typical direct and indirect estimators used in small area estimation. Section 4 provides four examples that reflect the diverse uses of small area estimation at Statistics Canada.

2. Primary uses and Criteria for SAE Production

One of the primary objectives for producing small area estimates is to provide summary statistics to central or local governments so that they can plan for immediate or future resource allocation. Typical small area estimates include employment indicators (employed and unemployed), health indicators (drug use, alcohol use) and business indicators such as average salary.

[...] number of factors. What is the demand for such statistics? What is the commitment and will of the agency to support methodological, systems, and subject matter staff? How much methodology and subject matter expertise exists within the agency? How well correlated are existing auxiliary data with the variables of interest? Is the survey sample size large enough to allow reliable estimates by using both the survey data and the existing auxiliary data? How much bias are the agency and clients willing to tolerate in the estimates, and what are the consequences of making incorrect decisions? The size of the small areas in terms of the number of units that belong to them is also an important consideration. Small areas that are too small may result in confidentiality breaches. Furthermore, small area estimates may be quite different from statistics based on local knowledge.

3. SAE Estimators

3.1 Introduction

A survey population $U$ consists of $N$ distinct elements (or ultimate units) identified through the labels $j = 1, \ldots, N$. A sample $s$ is selected from $U$ with probability $p(s)$, and the probability of including the $j$-th element in the sample is $\pi_j$. The design weight for each selected unit $j \in s$ is defined as $w_j = 1/\pi_j$. Suppose $U_i$ denotes a domain (or subpopulation) of interest. Denote as $s_i = s \cap U_i$ the part of the sample $s$ that falls in domain $U_i$. The realized sample size of $s_i$ is a random variable $n_i$, where $0 \le n_i \le N_i$. Auxiliary data $x$ will either be known at the element level $x_j$ for $j \in s$ or for each small area $i$ as totals $X_i = \sum_{j \in U_i} x_j$ or [...]

The problem is to estimate the domain total $Y_i = \sum_{j \in U_i} y_j$ or the domain mean $\bar{Y}_i = Y_i / N_i$, where $N_i$, the number of elements in $U_i$, may or may not be known. We define $y_{ij}$ to be $y_j$ if $j \in U_i$, and 0 otherwise. An indicator variable $a_{ij}$ is similarly defined: it is equal to one if $j \in U_i$ and 0 otherwise. Note that $Y_i$ can be written as $Y_i = \sum_{j \in U} y_{ij} = \sum_{j \in U} y_j a_{ij}$.

Small area estimation is categorized into two types of estimators: direct and indirect estimators. A direct estimator is one that uses values of the variable of interest, $y$, only from the sample units in the domain of interest. However, a major disadvantage of such estimators is that unacceptably large standard errors may result: this is especially true if the sample size within the domain is small or nil. An indirect estimator uses values of the variable of interest from a domain and/or time period other than the domain and time period of interest. Three types of indirect estimators can be identified. A domain indirect estimator uses values of the variable of interest from another domain but not from another time period. A time indirect estimator uses values of the variable of interest from another time period but not from another domain. An estimator that is both domain and time indirect uses values of the variable of interest from another domain and another time period.

An alternative is to use estimators that borrow strength across small areas, by modelling the dependent variable on independent variables across a number of small areas: they are called indirect estimators. Indirect estimators will be quite good (i.e. indirectly increase the effective sample size and thus decrease the standard error) if the models obtained across small areas still hold at the small area level. Departures from the model will result in unknown biases. There is a wide variety of indirect estimators available, and a good summary is provided in Rao (2003). We will confine ourselves to just a few of them, including the synthetic estimator and the better-known composite estimators.

3.2 Direct Estimation

Let $w_j$ be the design weight associated with $j \in s$. The Horvitz-Thompson estimator is the simplest direct estimator. If the small area total $Y_i$ is to be estimated for small area $U_i$, then the corresponding Horvitz-Thompson estimator is given by $\hat{Y}_{i,HT} = \sum_{j \in s_i} w_j y_j$ provided [...]

Auxiliary information can be available either at the population level or at the domain level. If it is available at the population level, then we use the Generalized Regression Estimator (GREG) given by

$$\hat{Y}_{i,GR} = X' \tilde{\beta}_{i,GREG} + \left( \hat{Y}_{i,HT} - \hat{X}'_{HT} \tilde{\beta}_{i,GREG} \right)$$

where $X' = \sum_{i=1}^{m} \sum_{j \in U_i} x'_j$, $\hat{X}'_{HT} = \sum_{k \in s} x'_k / \pi_k$, and $\tilde{\beta}_{i,GREG}$ is the set of regression coefficients obtained by regressing $y_{ij}$ on $x_j$. That is,

$$\tilde{\beta}_{i,GREG} = \left( \sum_{j \in s} \frac{w_j x_j x'_j}{c_j} \right)^{-1} \sum_{j \in s} \frac{w_j x_j y_{ij}}{c_j},$$

where $c_j$ is a specified constant ($c_j > 0$).

The straight GREG estimator is not efficient, and it is better to use regression estimators that use auxiliary data available as close as possible to the small areas of interest. One such estimator is the domain-specific GREG that uses auxiliary data at the domain level. It is given by

$$\hat{Y}^{*}_{i,GR} = X'_i \hat{\beta}_{i,GREG} + \left( \hat{Y}_{i,HT} - \hat{X}'_{i,HT} \hat{\beta}_{i,GREG} \right)$$

where

$$\hat{\beta}_{i,GREG} = \left( \sum_{j \in s_i} w_j x_j x'_j / c_j \right)^{-1} \sum_{j \in s_i} w_j x_j y_j / c_j .$$

An estimator that is approximately p-unbiased as the overall sample size increases but uses y-values outside the domain is the modified direct estimator given by

$$\hat{Y}_{i,SR} = X'_i \hat{\beta}_{GREG} + \left( \hat{Y}_{i,HT} - \hat{X}'_{i,HT} \hat{\beta}_{GREG} \right)$$

where

$$\hat{\beta}_{GREG} = \left( \sum_{j \in s} w_j x_j x'_j / c_j \right)^{-1} \sum_{j \in s} w_j x_j y_j / c_j .$$

This estimator is also referred to in Woodruff (1966), and Battese, Harter, and Fuller (1988) as the "survey regression estimator".
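As a minimal sketch of the Horvitz-Thompson domain estimator $\hat{Y}_{i,HT} = \sum_{j \in s_i} w_j y_j$ with $w_j = 1/\pi_j$, the Python function below sums the weighted values over the sampled units that fall in the domain. The function name, the argument layout and the toy numbers are illustrative assumptions and do not come from any production system.

```python
def horvitz_thompson_total(y, pi, in_domain):
    """Estimate a domain total as the sum of y_j / pi_j over sampled units in the domain.

    y         -- observed values for the sampled units
    pi        -- inclusion probabilities pi_j of the sampled units
    in_domain -- booleans, True when the sampled unit belongs to the domain U_i
    """
    return sum(yj / pj for yj, pj, a in zip(y, pi, in_domain) if a)


# Toy sample of four units (hypothetical numbers); three of them fall in the domain.
y = [10.0, 20.0, 30.0, 40.0]
pi = [0.5, 0.25, 0.5, 0.125]
in_domain = [True, True, False, True]

ht_est = horvitz_thompson_total(y, pi, in_domain)  # 10/0.5 + 20/0.25 + 40/0.125
```

When no sampled unit falls in the domain the sum is empty and the function returns 0, which is one way to see why direct estimators break down for very small domains.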

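The modified direct (survey regression) estimator of Section 3.2 can be sketched in the same spirit. The code below assumes that each $x_j$ is a short list such as `[1.0, x]`, that the domain totals $X_i$ are known, and that $c_j = 1$ unless supplied. The helper `solve` and all names are hypothetical choices made for this illustration.

```python
def solve(a, b):
    """Solve the small linear system a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * beta[k] for k in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta


def survey_regression_total(x, y, w, in_domain, X_i, c=None):
    """Modified direct estimator: X_i' beta + (HT total of y - HT total of x' beta)."""
    p = len(x[0])
    c = c or [1.0] * len(y)
    # beta_GREG is fitted on the whole sample s, not only on the domain sample s_i
    a = [[sum(wj * xj[r] * xj[k] / cj for xj, wj, cj in zip(x, w, c)) for k in range(p)]
         for r in range(p)]
    b = [sum(wj * xj[r] * yj / cj for xj, yj, wj, cj in zip(x, y, w, c)) for r in range(p)]
    beta = solve(a, b)
    # Horvitz-Thompson estimates of the domain totals of y and of x
    y_ht = sum(wj * yj for yj, wj, d in zip(y, w, in_domain) if d)
    x_ht = [sum(wj * xj[r] for xj, wj, d in zip(x, w, in_domain) if d) for r in range(p)]
    synthetic = sum(X_i[r] * beta[r] for r in range(p))
    correction = y_ht - sum(x_ht[r] * beta[r] for r in range(p))
    return synthetic + correction, beta


# Toy data (hypothetical): y = 2 + 3x exactly, so beta should be recovered as (2, 3).
xs = [1.0, 2.0, 4.0, 5.0, 7.0]
x = [[1.0, v] for v in xs]
y = [2.0 + 3.0 * v for v in xs]
w = [2.0, 4.0, 2.0, 4.0, 8.0]
in_domain = [True, False, True, False, True]
X_i = [10.0, 50.0]  # assumed known domain totals: N_i = 10 units, total of x = 50

est_sr, beta_hat = survey_regression_total(x, y, w, in_domain, X_i)
```

Because the toy data lie exactly on a line, the regression residuals vanish and the estimate reduces to the synthetic part $X'_i \hat{\beta}_{GREG}$. With real data the correction term restores approximate design unbiasedness.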

Hidiroglou and Patak (2004) compared a number of the direct estimators. One of their conclusions was that the direct estimators would be best if the domains of interest coincided as closely as possible with the design strata.

3.2 Indirect Estimation

Some of the most widely used indirect estimators have been the synthetic estimator, the regression-adjusted synthetic estimator, the composite estimator, and the sample-dependent estimator.

The synthetic estimator uses reliable information from a direct estimator for a large area that spans several small areas, and this information is used to obtain an indirect estimator for a small area. It is assumed that the small areas have the same characteristics as the large area. Gonzalez (1978) provides a good account of how these estimators were obtained and used to produce unemployment statistics at levels lower than those planned in the survey design. The National Center for Health Statistics (1968) in the United States pioneered the use of synthetic estimation for developing state estimates of disability and other health characteristics from the National Health Interview Survey (NHIS). Sample sizes in most states were too small to provide [...]

Levy (1971) used mortality data to compute average relative errors of synthetic estimates for States. He used the regression-adjusted synthetic estimator to account for local variation by combining area-specific covariates with the synthetic estimator. These covariates attempted to attenuate the magnitude of potential relative bias associated with the synthetic estimator.

The potential bias associated with indirect estimators $\hat{Y}_{i,INDIR}$ can be attenuated by combining them with the direct estimators $\hat{Y}_{i,DIR}$ via a weighted average. The resulting combined estimator is given by

$$\hat{Y}_{i,COMB} = \phi_i \hat{Y}_{i,DIR} + (1 - \phi_i) \hat{Y}_{i,INDIR}$$

where $0 \le \phi_i \le 1$. The optimal $\phi_i^{*}$ is determined by minimizing the MSE of $\hat{Y}_{i,COMB}$. The resulting composite estimator has a mean square error which is smaller than that of either component estimator. [...] insensitive to poor estimates of the optimum weight. This insensitivity depends on the relative sizes of the mean square errors of the component estimators. The composite estimator is most insensitive when the mean square errors of the two component estimators do not differ greatly. Simple weighting factors for the composite estimators that depend on the realized domain size were given by Drew, Singh and Choudhry (1982), and by Hidiroglou and Särndal (1985).

Small area estimators are split into two main types, depending on how models are applied to the data within the small areas: these two types are known as area level and unit level. Small area estimators are based on area level computations if models link small area means of interest ($y$) to area-specific auxiliary variables (such as $x$ sample means). They are based on unit level computations if the models link unit values of interest to unit-specific auxiliary variables. Area-level small area estimators are computed if unit-level data are not available. They can also be computed if the unit-level data are available, by summarizing them at the appropriate area level.

3.2.1 Area Model

One of the most widely used area-level small area estimators was given by Fay and Herriot (1979) [...] small. Population totals ($Y_i = \sum_{j \in U_i} y_j$) or means [...] small area $U_i$, can be estimated. The Fay-Herriot methodology is usually presented as an estimator of the small area population mean $\bar{Y}_i$ for a given small area $U_i$ where $i = 1, \ldots, m$. The Fay-Herriot estimator for small area $U_i$ is a linear combination of a direct estimator (say $\hat{\bar{Y}}_{i,DIR}$) and a synthetic estimator (say $\hat{\bar{Y}}_{i,SYN}$). The direct estimator of the population mean $\bar{Y}_i$ is given by $\hat{\bar{Y}}_{i,DIR} = \hat{Y}_{i,DIR} / \hat{N}_{i,DIR}$ where $\hat{Y}_{i,DIR} = \sum_{j \in s_i} \tilde{w}_j y_j$ and $\hat{N}_{i,DIR} = \sum_{j \in s_i} \tilde{w}_j$. The weight $\tilde{w}_j$ associated with the $j$-th unit can be the design weight $w_j$ (i.e. $\tilde{w}_j = w_j$) or a final weight that reflects any adjustment (i.e. non-response, calibration, or a product thereof) made to the design weight.

The synthetic portion is estimated as the product of a given auxiliary population mean row-vector (say $\bar{Z}'_i = \sum_{j \in U_i} z'_j / N_i$) for the $i$-th small area of interest times an estimated regression vector (say $\hat{\beta}_{FH}$, where FH stands for Fay-Herriot). The auxiliary data row-vector $z'_j$ is known for all units in the population

small area $U_i$. The regression vector $\hat{\beta}_{FH}$ is computed across a number of small areas in such a way that the model linking the variable of interest (the mean $\hat{\bar{Y}}_{i,DIR}$) to the auxiliary data also holds at the small area level. The Fay-Herriot estimator of a given population mean $\bar{Y}_i$ is:

$$\tilde{\bar{Y}}_{i,FH} = \gamma_i \hat{\bar{Y}}_{i,DIR} + (1 - \gamma_i) \bar{Z}'_i \tilde{\beta}_{FH} \qquad (3.1)$$

The two components (direct estimator and synthetic estimator) of (3.1) are weighted by $\gamma_i$ and $(1 - \gamma_i)$, where $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \psi_{i,DIR})$. The regression vector $\tilde{\beta}_{FH}$ and $\gamma_i$ depend on the population variance $\psi_{i,DIR}$ of the direct estimator $\hat{\bar{Y}}_{i,DIR}$ and the model variance $\sigma_v^2$. Although the sampling variance of $\hat{\bar{Y}}_{i,DIR}$ is easy to compute, it may be unstable if the domain sizes are small. This is remedied by smoothing the estimated variances. We denote the smoothed variances as $\hat{\psi}_{i,DIR}$. The estimated model variance $\hat{\sigma}_v^2$ and $\hat{\beta}_{FH}$ are computed recursively. Details of the required computations for obtaining $\hat{\sigma}_v^2$ can be found in Rao (2003, pp. 118-119). The estimated regression vector $\hat{\beta}_{FH}$ and the factor $\hat{\gamma}_i$ are given by:

$$\hat{\beta}_{FH} = \left[ \sum_{i=1}^{m} \frac{\bar{Z}_i \bar{Z}'_i}{\hat{\psi}_{i,DIR} + \hat{\sigma}_v^2} \right]^{-1} \left[ \sum_{i=1}^{m} \frac{\bar{Z}_i \hat{\bar{Y}}_{i,DIR}}{\hat{\psi}_{i,DIR} + \hat{\sigma}_v^2} \right] \qquad (3.2)$$

and

$$\hat{\gamma}_i = \hat{\sigma}_v^2 / \left( \hat{\psi}_{i,DIR} + \hat{\sigma}_v^2 \right) \qquad (3.3)$$

respectively.

The Fay-Herriot estimator $\hat{\bar{Y}}_{i,FH}$ can also be expressed as:

$$\hat{\bar{Y}}_{i,FH} = \bar{Z}'_i \hat{\beta}_{FH} + \hat{\gamma}_i \left( \hat{\bar{Y}}_{i,DIR} - \bar{Z}'_i \hat{\beta}_{FH} \right) \qquad (3.4)$$

This form of the Fay-Herriot estimator is very similar to the "normal" regression estimator

$$\hat{\bar{Y}}_{i,REG} = \bar{Z}'_i \hat{\beta}_{REG} + \left( \hat{\bar{Y}}_{i,EXP} - \bar{Z}'_i \hat{\beta}_{REG} \right) \qquad (3.5)$$

given in Cochran (1977), where the estimated regression vector is given by

$$\hat{\beta}_{REG} = \left( \sum_{i=1}^{m} \bar{Z}_i \bar{Z}'_i / \hat{\psi}_{i,EXP} \right)^{-1} \left( \sum_{i=1}^{m} \bar{Z}_i \hat{\bar{Y}}_{i,EXP} / \hat{\psi}_{i,EXP} \right)$$

and $\hat{\bar{Y}}_{i,EXP} = \sum_{j \in s_i} w_j y_j / \sum_{j \in s_i} w_j$ is the simple estimator of the mean involving the design weights $w_j$. The computations required to obtain the normal regression estimator do not involve estimating any variance components.

3.2.2 Unit Model

The unit model originates with Battese, Harter and Fuller (1988). They used the nested error regression model to estimate county crop areas using sample survey data in conjunction with satellite information. Their model is given by

$$y_{ij} = x'_{ij} \beta + v_i + e_{ij} \qquad (3.5)$$

where $v_i \overset{iid}{\sim} (0, \sigma_v^2)$ and $e_{ij} \overset{iid}{\sim} (0, \sigma_e^2)$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. The small areas of interest in Battese, Harter and Fuller (1988) were 12 counties ($m = 12$) in North-Central Iowa. Each county was divided into area segments and the areas under corn and soybeans were ascertained for a sample of segments by interviewing farm operators. The number of sampled segments in a county, $n_i$, ranged from 1 to 6. Auxiliary data, in the form of numbers of pixels (a term used for "picture elements" of about 0.45 hectares) classified as corn and soybeans, were also obtained for all the area segments, including the sampled segments, in each county using LANDSAT satellite readings.

The resulting sample mean using (3.5) is given by

$$\bar{y}_{i.} = \bar{x}'_{i.} \beta + v_i + \bar{e}_{i.} \qquad (3.6)$$

where $\bar{y}_{i.}$, $\bar{x}'_{i.}$ and $\bar{e}_{i.}$ are the means of the associated $n_i$ $(y, x)$ observations and e-residuals. The objective of Battese et al. (1988) was to estimate the conditional population mean given the realized cluster (county) effect. Under the assumption of model (3.5), the conditional population mean is given by

$$\bar{Y}_{i.} = \bar{X}'_{i.} \beta + v_i \qquad (3.7)$$

where $\bar{Y}_{i.}$ and $\bar{X}'_{i.}$ are the population means of the associated $N_i$ observations $(y_{ij}, x_{ij})$ in the $i$-th sampled cluster $U_i$. The corresponding predictor $\tilde{y}_{i.}$ for the county mean crop area per segment is $\bar{X}'_{i.} \tilde{\beta} + \tilde{v}_{i.}$, where $\tilde{v}_{i.} = \gamma_i n_i^{-1} \sum_{j=1}^{n_i} (y_{ij} - x'_{ij} \tilde{\beta})$ with

$$\tilde{\beta}_{BHF} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( x_{ij} x'_{ij} - \gamma_i \bar{x}_{i.} \bar{x}'_{i.} \right) \right)^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( x_{ij} y_{ij} - \gamma_i \bar{x}_{i.} \bar{y}_{i.} \right) \qquad (3.8)$$

and $\gamma_i = \sigma_v^2 \left( \sigma_v^2 + n_i^{-1} \sigma_e^2 \right)^{-1}$.
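The unit-level predictor defined by (3.7) and (3.8) can be illustrated with a short Python sketch. It assumes that the variance components $\sigma_v^2$ and $\sigma_e^2$ are known, and that $x_{ij} = (1, x_{ij})'$ contains an intercept and a single covariate so that the 2 x 2 system in (3.8) can be solved in closed form. The function name, the data layout and the toy numbers are assumptions made for illustration.

```python
def bhf_predictors(areas, sigma2_v, sigma2_e):
    """Predict small area means X_bar' beta + gamma_i * (y_bar_i - x_bar_i' beta).

    areas -- list of dicts with sampled 'x' and 'y' values and the known
             population mean 'X_bar' of the covariate (hypothetical layout)
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    summaries = []
    for area in areas:
        x, y = area["x"], area["y"]
        n = len(y)
        gamma = sigma2_v / (sigma2_v + sigma2_e / n)
        x_bar, y_bar = sum(x) / n, sum(y) / n
        summaries.append((gamma, x_bar, y_bar))
        # Accumulate the sums appearing in (3.8) for x_ij = (1, x_ij)'
        a11 += n - n * gamma
        a12 += sum(x) - n * gamma * x_bar
        a22 += sum(v * v for v in x) - n * gamma * x_bar * x_bar
        b1 += sum(y) - n * gamma * y_bar
        b2 += sum(u * v for u, v in zip(x, y)) - n * gamma * x_bar * y_bar
    det = a11 * a22 - a12 * a12
    beta0 = (a22 * b1 - a12 * b2) / det
    beta1 = (a11 * b2 - a12 * b1) / det
    preds = []
    for area, (gamma, x_bar, y_bar) in zip(areas, summaries):
        synthetic = beta0 + beta1 * area["X_bar"]
        v_tilde = gamma * (y_bar - (beta0 + beta1 * x_bar))  # predicted area effect
        preds.append(synthetic + v_tilde)
    return preds, (beta0, beta1)


# Toy data (hypothetical): y = 1 + 2x exactly, so no area effect is left to predict.
areas = [
    {"x": [1.0, 2.0, 3.0], "y": [3.0, 5.0, 7.0], "X_bar": 2.5},
    {"x": [4.0, 6.0], "y": [9.0, 13.0], "X_bar": 5.0},
    {"x": [2.0, 8.0, 5.0, 1.0], "y": [5.0, 17.0, 11.0, 3.0], "X_bar": 3.0},
]
preds, beta_bhf = bhf_predictors(areas, sigma2_v=1.0, sigma2_e=2.0)
```

In this toy case the predictions equal $1 + 2 \bar{X}_{i.}$ for every area. With real data the term `v_tilde` moves each prediction toward the area's own sample mean, and more strongly so as $n_i$ grows.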

The resulting best linear unbiased prediction (BLUP) estimator is $\tilde{y}_{i,BHF} = \gamma_i \bar{y}_{i.} + (\bar{X}'_{i.} - \gamma_i \bar{x}'_{i.}) \tilde{\beta}_{BHF}$ for the $i$-th small area. However, the variance components $\sigma_v^2$ and $\sigma_e^2$ are not known. Battese et al. (1988) use the well-known method of fitting-of-constants to estimate them. The resulting estimator of the $i$-th area mean is known as the EBLUP estimator, because the variance components are estimated.

Prasad and Rao (1990) derived an approximation to $o(m^{-1})$ for the model based mean squared error of the Battese-Harter-Fuller estimator, and also obtained its estimator to $o(m^{-1})$. Prasad and Rao (1999) were the first to include the survey weights in the unit level model: they labelled their estimator a pseudo-EBLUP estimator of the small area mean $\bar{Y}_i$. The Prasad-Rao estimator of $\bar{Y}_i$ is given by

$$\tilde{\bar{Y}}_{i,PR} = \bar{X}'_i \tilde{\beta}_{PR} + \gamma_{iw} \left( \bar{y}_{iw} - \bar{x}'_{iw} \tilde{\beta}_{PR} \right) \qquad (3.9)$$

where $\gamma_{iw} = \sigma_v^2 / \left( \sigma_v^2 + \sigma_e^2 \sum_{j \in s_i} \tilde{w}_{ij}^2 \right)$ with $\bar{y}_{iw} = \sum_{j \in s_i} \tilde{w}_{ij} y_{ij}$; $\tilde{w}_{ij} = w^{*}_{ij} / \sum_{j \in s_i} w^{*}_{ij}$ and $w^{*}_{ij}$ are calibrated weights, and $\tilde{\beta}_{PR}$ is given by

$$\tilde{\beta}_{PR} = \left( \sum_{i=1}^{m} \gamma_{iw} \bar{x}_{iw} \bar{x}'_{iw} \right)^{-1} \sum_{i=1}^{m} \gamma_{iw} \bar{x}_{iw} \bar{y}_{iw} \qquad (3.10)$$

Prasad and Rao (1999) also provided model based expressions for the MSE of their estimator when it includes the estimated variance components $\sigma_v^2$ and $\sigma_e^2$.

The sum of the small area estimates does not necessarily add up to the corresponding direct estimator. You and Rao (2002) proposed an estimator of $\beta$ that ensures self-benchmarking of the small area estimates to the corresponding direct estimator. Their estimator is given by

$$\tilde{\bar{Y}}_{i,YR} = \bar{X}'_i \tilde{\beta}_{YR} + \gamma_{iw} \left( \bar{y}_{iw} - \bar{x}'_{iw} \tilde{\beta}_{YR} \right) \qquad (3.10)$$

where

$$\tilde{\beta}_{YR} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} x_{ij} \left( x_{ij} - \gamma_{iw} \bar{x}_{iw} \right)' \right)^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} \left( x_{ij} - \gamma_{iw} \bar{x}_{iw} \right) y_{ij} .$$

[...] survey-weighted estimator of $\beta$ (say $\tilde{\beta}_{YR}$). The resulting "pseudo-EBLUP" estimator $\hat{\bar{Y}}_{i,PR}$ is given by

$$\hat{\bar{Y}}_{i,PR} = \bar{X}'_i \hat{\beta}_{PR} + \hat{\gamma}_{iw} \left( \bar{y}_{iw} - \bar{x}'_{iw} \hat{\beta}_{PR} \right).$$

Note that the self-benchmarking property means that the sum of the estimated small area totals is equal to the direct estimator of the overall total $Y$. That is,

$$\sum_{i=1}^{m} N_i \hat{\bar{Y}}_{i,PR} = \hat{Y}_w + \left( X - \hat{X}_w \right)' \hat{\beta}_w$$

where $\hat{Y}_w = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} y_{ij}$ and $\hat{X}_w$ is similarly defined.

4. Applications

4.1 Canadian Community Health Survey: Area model

The Canadian Community Health Survey (CCHS) is a cross-sectional health survey carried out by Statistics Canada since 2001. The survey operates on a two-year collection cycle. The first year of the survey cycle, "x.1", is a large-sample (130,000 persons) general population health survey, designed to provide reliable estimates at the health region (sub-provincial areas defined in terms of Census results), provincial and national levels. This portion of the survey collects information related to health status, health care utilization and health determinants for the Canadian population. The second year of the survey cycle, "x.2", has a smaller sample (30,000 persons) and is designed to provide provincial and national level results on specific focused health topics.

The CCHS is based on a multiple frame sampling design that uses two frames. The first one, used as the primary frame, is the area frame designed for the Canadian Labour Force Survey. This survey is basically a two-stage stratified design that uses probability proportional to size sampling without replacement at each stage. Face-to-face interviews take place with individuals selected from that frame. The second frame is a list frame of telephone numbers, used in some of the Health Regions for cost reasons. Individuals selected from that frame are interviewed by telephone.

The area frame uses the Labour Force Survey frame. The resulting sample is a two-stage stratified cluster sample. Firstly, a list of the dwellings that were or had been in [...]


Secondly, a sample of dwellings was selected from this list. The households in the selected dwellings then formed the sample of households. The majority (88%) of the targeted sample was selected from the area frame. Lastly, respondents are randomly selected from households in this frame. Although a single individual is normally randomly selected from each household, the requirement to oversample youths results in a second member being selected in a number of households. Face-to-face interviews are carried out with the selected respondents.

[...] version (Health Regions) of the Canada Phone directory. Simple random sampling takes place within each of the resulting strata. Random digit dialling is carried out in five HRs and the three Territories.

The direct estimator of a population total $Y_i$ for a given domain $i$ is given by $\hat{Y}_i^{DIR} = \sum_{j \in s_i} \tilde{w}^{*}_j y_j$ where $\tilde{w}^{*}_j$ represents the overall weight that incorporates the multiple frame nature of the sampling design, non-response adjustments at each stage, where appropriate, and the calibration (age groups 12 to 19, 20 to 29, 30 to 44, 45 to 64 and 65 or older for each sex within each health region and province). More details of this sampling design are available in Béland (2002).

Estimates of various population parameters can be produced for different domains. In the present example, taken from Hidiroglou, Singh and Hamel (2007), our parameter of interest is the proportion of alcohol abusers within the previously stated domains belonging to the province of British Columbia using the two year (2000-2001) CCHS sample. The associated sample had 18,302 observations with domain sample sizes ranging from 20 to 238 for the 200 domains. Figure 4.1 provides an idea of how the Health Regions are delineated in British Columbia.

Figure 4.1: Health areas in British Columbia

[...] regions $r$ ($r = 1, \ldots, 20$) and age-sex groups $a$ ($a = 1, \ldots, 10$). The direct estimator of the proportion of alcohol abuse is given by $\hat{p}^{DIR}_{r,a} = \hat{Y}^{DIR}_{r,a} / \hat{N}^{DIR}_{r,a}$ where $\hat{N}^{DIR}_{r,a} = \sum_{j \in s_{r,a}} \tilde{w}^{*}_j$. Given that, for domain $ra$, $\hat{\psi}^{DIR}_{r,a}$ denotes the estimated variance for $\hat{p}^{DIR}_{r,a}$ under the [...] is given by

$$deff^{DIR}_{r,a} = \hat{\psi}^{DIR}_{r,a} \Big/ \left( \hat{p}^{DIR}_{r,a} (1 - \hat{p}^{DIR}_{r,a}) / n_{r,a} \right).$$

The smoothed design effect over all $I = 200$ domains is given by $\overline{deff}^{DIR} = \sum_i deff_i^{DIR} / I$. The estimated coefficient of variation, $cv(\hat{p}^{DIR}_{r,a})$, for $\hat{p}^{DIR}_{r,a}$ in a given domain is

$$\sqrt{ \overline{deff}^{DIR} \, \hat{p}^{DIR}_{r,a} (1 - \hat{p}^{DIR}_{r,a}) / n_{r,a} } \Big/ \hat{p}^{DIR}_{r,a} .$$

The common mean model is the simplest one that can be implemented using the Fay-Herriot (1979) methodology. This model assumes that the proportion of alcohol abuse is the same within each of the twenty Health Regions for a given age-sex group: that is, the linking model is given by $P_{r,a} = \beta_a + \nu_{r,a}$ where $P_{r,a}$ is the unknown population proportion of interest, and $\beta_a$ is the common mean across the health regions for the $a$-th age-sex group. The corresponding sampling model is given by $\hat{p}^{DIR}_{r,a} = P_{r,a} + e_{r,a}$. The resulting small area estimate for the $ra$-th domain is given by

$$\hat{p}^{EBLUP}_{r,a} = \hat{\gamma}_{r,a} \hat{p}^{DIR}_{r,a} + (1 - \hat{\gamma}_{r,a}) \hat{\beta}_a , \qquad \hat{\gamma}_{r,a} = \frac{\hat{\sigma}_v^2}{\tilde{\psi}^{DIR}_{r,a} + \hat{\sigma}_v^2}$$

(see Rao 2003, p. 116). The term $\tilde{\psi}^{DIR}_{r,a}$, given by

$$\tilde{\psi}^{DIR}_{r,a} = \overline{deff}^{DIR} \, \frac{\hat{p}^{DIR}_{r,a} (1 - \hat{p}^{DIR}_{r,a})}{n_{r,a}},$$

is obtained using the smoothed design effect $\overline{deff}^{DIR} = \sum_i deff_i^{DIR} / I$ over the $I = 200$ domains. The $\hat{\sigma}_v^2$ term is obtained from the Fay-Herriot methodology: computational details for estimating $\hat{\sigma}_v^2$ can be found in Rao (2003, p. 118). The estimated coefficient of variation for $\hat{p}^{EBLUP}_{r,a}$ is given by $\sqrt{mse(\hat{p}^{EBLUP}_{r,a})} \big/ \hat{p}^{EBLUP}_{r,a}$.
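The smoothing and shrinkage steps of this example can be sketched in Python as follows. The sketch treats $\hat{\sigma}_v^2$ as a given input rather than estimating it, and it computes the common mean as the special case of (3.2) in which the only auxiliary variable is a constant equal to 1; both choices, the function names and the toy numbers are assumptions for illustration, not the production implementation.

```python
from math import sqrt


def smoothed_deff(p_dir, var_dir, n):
    """Average the domain design effects deff = var / (p(1-p)/n) over all domains."""
    deff = [v / (p * (1.0 - p) / m) for p, v, m in zip(p_dir, var_dir, n)]
    return sum(deff) / len(deff)


def common_mean_eblup(p_dir, n, deff_bar, sigma2_v):
    """Shrink the direct proportions of one age-sex group toward their common mean."""
    # Smoothed sampling variances: deff_bar * p(1-p)/n
    psi = [deff_bar * p * (1.0 - p) / m for p, m in zip(p_dir, n)]
    cv_dir = [sqrt(s) / p for s, p in zip(psi, p_dir)]
    gamma = [sigma2_v / (s + sigma2_v) for s in psi]
    # Common mean: weighted average with weights 1 / (psi + sigma2_v) (assumed special case)
    wts = [1.0 / (s + sigma2_v) for s in psi]
    beta = sum(w * p for w, p in zip(wts, p_dir)) / sum(wts)
    p_eblup = [g * p + (1.0 - g) * beta for g, p in zip(gamma, p_dir)]
    return p_eblup, gamma, beta, cv_dir


# Toy inputs (hypothetical): three regions of one age-sex group, design effect of 2.
p_dir = [0.10, 0.20, 0.30]
n = [50, 100, 200]
var_dir = [2.0 * p * (1.0 - p) / m for p, m in zip(p_dir, n)]
deff_bar = smoothed_deff(p_dir, var_dir, n)
p_eblup, gamma, beta, cv_dir = common_mean_eblup(p_dir, n, deff_bar, sigma2_v=0.002)
```

In practice `smoothed_deff` would be called once over all domains, while `common_mean_eblup` would be called separately for each age-sex group. Domains with small $n_{r,a}$ receive a smaller $\hat{\gamma}_{r,a}$ and are therefore pulled more strongly toward the common mean.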

Here, $mse(\hat{p}^{EBLUP}_{r,a}) = \hat{\sigma}_v^2 \tilde{\psi}^{DIR}_{r,a} \big/ \left( \tilde{\psi}^{DIR}_{r,a} + \hat{\sigma}_v^2 \right)$ represents the estimated leading term of $MSE(\hat{p}^{EBLUP}_{r,a})$. Figure 4.2 graphs the estimated coefficients of variation obtained under direct and indirect estimation.

Figure 4.2: Estimated coefficients of variation for the direct (blue) and EBLUP (red) estimators of proportion (vertical axis: estimated cv%; horizontal axis: observed proportion)

4.2 Canadian Survey of Employment, Payrolls and Hours: Unit model

The Canadian Survey of Employment, Payrolls and Hours (SEPH) collects and publishes, on a monthly basis, estimates of payrolls, employment, paid hours and earnings at detailed industrial and geographic levels. Estimates of average weekly earnings (AWE) have been produced since the early nineties by SEPH. These estimates have been produced via the generalized regression (GREG) estimator using a combination of survey and payroll deduction (administrative) data provided to Statistics Canada by [...] approximately design unbiased (ADU).

SEPH is currently being redesigned to redefine primary domains of interest, as well as to incorporate improvements in the use of the administrative data. The resulting sample, estimated to be between 11,00 to 20,00 establishments (depending on budget constraints), will be allocated to the newly defined strata, defined as cross-classifications of geography (provinces) and industry (NAICS3), so that the resulting GREG estimates for AWE satisfy target coefficients of variation. The design strata are also referred to as model groups since the GREG estimators are computed at these levels as well. Estimates below this level can be obtained using domain estimation. As the associated sample will be relatively small (or non-existent), the reliability associated with the GREG [...] error.

Rubin et al. (2007) investigated whether Small Area Estimation (SAE) procedures could be used to produce estimates for AWE with reasonably good estimated mean squared errors for lower levels, namely industry groups at the four-digit level of the North American Industry Classification System (NAICS4) and geography at the level of province, that is, the "NAICS 4 x province" domains. The Average Weekly Earnings for a population domain $i$ ($U_i$) is given by

$$\bar{Y}_i = \sum_{j \in U_i} E_{ij} y_{ij} \Big/ \sum_{j \in U_i} E_{ij}$$

where $y_{ij}$ is the average weekly earnings and $E_{ij}$ is the average number of employees within the $j$-th establishment within that domain.

[...] properties of the GREG estimator and a number of SAE estimators. The y-values for the population used for the study were created for twelve months representing the January to December 2005 calendar year. In-sample y-values were kept as is, and the y-values for the out-of-sample units were synthesized using the nearest neighbour based on the average number of employees and average monthly earnings (available for the whole population). Some 1,000 samples were then independently selected from each of the twelve generated populations, preserving the longitudinal aspect of SEPH (i.e. sample rotation of one-twelfth of the sample on a monthly basis). Summary statistics based on the specific estimators $y^{(r)}_{i,EST}$ of the $i$-th small area ($i = 1, \ldots, I$), computed from the Monte Carlo replicates, included the average relative bias (ARB),

$$\frac{1}{I} \sum_{i=1}^{I} \frac{1}{R \bar{Y}_i} \sum_{r=1}^{R} \left( y^{(r)}_{i,EST} - \bar{Y}_i \right),$$

and the average root relative mean square error (ARMSE),

$$\frac{1}{I} \sum_{i=1}^{I} \left( \frac{1}{R \bar{Y}_i} \sum_{r=1}^{R} \left( y^{(r)}_{i,EST} - \bar{Y}_i \right)^2 \right)^{0.5} .$$

Estimators considered in the Rubin et al. (2007) simulation included the GREG, the Prasad-Rao (1999) pseudo-EBLUP unit level, and the You-Rao (2002) pseudo-EBLUP area level SAE estimators given in Section 3. The GREG estimator is given by

$$\bar{y}_{i,GREG} = \sum_{j \in U_i} \tilde{E}_{ij} x'_{ij} \hat{\beta} + \sum_{j \in s_i} w_{ij} \tilde{E}_{ij} \left( y_{ij} - x'_{ij} \hat{\beta} \right) \qquad (4.1)$$

with $x'_{ij} = (1, x_{ij})$. Here $x_{ij}$ is the average monthly earnings associated with the $j$-th sampled establishment within domain $U_i$, and $\hat{\beta}$ is the


regression estimator resulting from the (Undercount) and the gross number of persons

iid erroneously included in the final Census count

model yij = xij′ βˆ + eij , with eij ~( 0,σ / Eij ) .

2

e (Overcount). The sample size of the RRC is designed

to produce reliable direct estimates for the provinces

Figure 3 and 4 provide the ARB and ARMSE (including the two Territories),and eight age - sex

respectively for construction domains in Canada for groups, with age categories are less than 19, 20 to

2005. The GREG estimator has the smallest ARB amongst the three estimators. The Prasad-Rao (1999) estimator is the best in terms of ARMSE. This is reasonable because the You-Rao (2002) estimator loses efficiency on account of its benchmarking property.

Figure 4.3: Average absolute relative bias for construction domains in Canada for 2005

Figure 4.4: Average relative root mean square error for construction domains in Canada for 2005

4.3 Canadian Census of Population undercoverage

The Census of Canada is conducted every five years. One objective is to provide the Population Estimates Program with accurate baseline counts of the number of persons by age and sex for specified geographic areas. However, not all persons are correctly enumerated. Two errors that occur are undercoverage (exclusion of eligible persons) and overcoverage (erroneous inclusion of persons). This undercoverage varies between 2 and 3%.

A special survey, known as the Reverse Record Check (RRC), with a sample size of 60,000 persons, estimates the net number of persons missed by the Census. This net number combines two types of coverage errors: the gross number of persons missed by the Census

29, 30 to 44, and 45 and over at the national level. The cross tabulation of these two marginal tabulations results in $m = 96$ ($12 \times 8$) cells. These cells are considered as small areas because they have too few observations to sustain reliable direct estimates. The objective is to use small area techniques to improve the reliability of the cell estimates. Dick (1995) applied the Fay-Herriot methodology for this purpose.

quantities. The true (but unknown) Census count is denoted as $T_i$, and the corresponding observed Census count as $C_i$. This means that the difference $(T_i - C_i)$ is the unknown net undercoverage count $(M_i)$. The RRC estimate of this net undercoverage count for the i-th small area is $\hat{M}_i$. The true count $T_i$ can be expressed as the product of the observed count $C_i$ and the true adjustment factor $\theta_i = (M_i + C_i)/C_i = T_i/C_i$. The true adjustment factor can be estimated directly as $y_i = (\hat{M}_i + C_i)/C_i$. However, the direct estimator $\hat{M}_i$ may not be reliable. The problem is cast into a Fay-Herriot context as follows.

The sampling model is $y_i = \theta_i + e_i$, where we assume that $E_p(e_i) = 0$ and $V_p(e_i) = \psi_i$, where $\psi_i$ is assumed to be known. The linking model is given by $\theta_i = z_i'\beta + v_i$, where $z_i$ is a set of auxiliary variables, and $v_i \overset{iid}{\sim} (0, \sigma_v^2)$.

The resulting Fay-Herriot estimator is given as $\hat{\theta}_{i,FH} = z_i'\hat{\beta}_{FH} + \hat{\gamma}_i (y_i - z_i'\hat{\beta}_{FH})$ where $\hat{\gamma}_i = \hat{\sigma}_v^2/(\hat{\sigma}_v^2 + \psi_i)$.

The sampling variances are not known, but can be estimated from $\hat{\psi}_i = v(y_i)$ given the sampling plan for the RRC. As these variances are for domains, they will tend to be variable. Dick (1995) smoothed them


by using $\log v(\hat{M}_i) = \alpha + \beta \log(C_i) + \eta_i$ where it is assumed that $\eta_i \overset{iid}{\sim} N(0, \zeta^2)$. The smoothed variance for the i-th small area is $\tilde{v}(\hat{M}_i) = \exp(\hat{\alpha} + \hat{\beta} \log(C_i))$. Hence, the smoothed variance of $y_i = 1 + \hat{M}_i/C_i$ is $\tilde{\psi}_i = \tilde{v}(\hat{M}_i)/C_i^2$.

Replacing the unknown $\psi_i$ by $\tilde{\psi}_i$ leads to $\tilde{\theta}_{i,FH} = z_i'\tilde{\beta}_{FH} + \tilde{\gamma}_i (y_i - z_i'\tilde{\beta}_{FH})$ where $\tilde{\sigma}_v^2$ and $\tilde{\beta}_{FH}$ are solved iteratively using the algorithm given in the appendix.

The above methodology was used to estimate the 2001 Canadian Census undercoverage. The final z-variables used in the linking model (4.3) were Yukon, Nunavut, Male 20 to 29, Male 30 to 44, Female 20 to 29, British Columbia renters, Ontario renters and Northwest Territories renters.

Figure 4.5 displays the direct and FH estimates of undercoverage ratios by the domain sample sizes. Figure 4.6 displays the corresponding coefficients of variation (CV) of the direct and FH HB estimates.

Figure 4.5: Comparison of Direct and HB Estimates (Source: You and Dick 2004)

Figure 4.5 supports the conclusion that the FH approach leads to smoothed estimates, particularly for the domains with relatively small sample sizes. When the sample size is small, some direct net undercoverage estimates are negative because the overcoverage estimates are larger than the undercoverage estimates. The FH method "corrected" the negative values: all the FH net undercoverage estimates are positive.

Figure 4.6: Comparison of Direct and FH CVs (Source: You and Dick 2004)

In terms of the CV comparison given in Figure 4.6, the HB approach achieves a large CV reduction when the sample sizes are small. As the sample size increases, the CV reduction decreases, and the CVs of the direct and HB estimates become quite similar. State which variables used and for Census (2011). For further details see You, Rao and Dick (2002).

4.4 Labour Force Survey

Unemployment rates are produced on a monthly basis in Canada by the Labour Force Survey (LFS). The LFS samples some 53,000 households based on a stratified multi-stage design. The survey reduces response burden by having one-sixth of its sample replaced each month. For a detailed description of the LFS design, see Gambino, Singh, Dufour, Kennedy and Lindeyer (1998). The published provincial and national estimates of unemployment rates are a key indicator of economic performance in Canada.

Unemployment rates at the sub-provincial level are also of great interest. For instance, the unemployment rates for Census Metropolitan Areas (CMAs, i.e., cities with a population of more than 100,000) and Census Agglomerations (CAs, i.e., other urban centers) receive scrutiny from local governments. However, many of the CAs do not have a large enough sample to produce adequate direct estimates. Their estimates need to be produced using SAE techniques. You, Rao and Gambino (2003) used a cross-sectional and time series model to estimate unemployment for such small areas: their methodology borrowed strength both across time and small areas.

Let $y_{it}$ denote the direct LFS estimate of $\theta_{it}$, the true unemployment rate of the i-th CA (small area) at time $t$, for $i = 1, \dots, m$, $t = 1, \dots, T$, where $m$ is the total number of CAs and $T$ is the (current) time of interest. Assume that the sampling model is $y_{it} = \theta_{it} + e_{it}$,


where the $e_{it}$'s are sampling errors. Since the CAs can be treated as strata, the $e_{it}$'s are uncorrelated between themselves for a given time period $t$. However, the rotation results in a significant level of overlap for the sampled households. This is reflected in the linking model given by $\theta_{it} = x_{it}'\beta + v_i + u_{it}$ where the error structure of the $u_{it}$'s is assumed to follow an AR(1) process, represented as $u_{it} = u_{i,t-1} + \varepsilon_{it}$; $\varepsilon_{it} \overset{iid}{\sim} (0, \sigma^2)$.

The error structure of the $e_{it}$'s is assumed known; as this is not the case in practice, the sample-based estimates need to be smoothed. You, Rao and Gambino (2003) used the Hierarchical Bayes (HB) procedure to estimate the required parameters in the error and linking equations. They compared numerically three estimators of the unemployment rates in June 1999. These estimators were the direct estimator (Direct Est), a small area estimator based only on the current cross-sectional data (the Fay-Herriot), and one using both the cross-sectional and longitudinal data (Space-time).

Figure 4.7 displays these LFS estimates for the June 1999 unemployment rates for the 62 CAs across Canada. The 62 CAs appear in the order of population size with the smallest CA (Dawson Creek, BC, population is 10,107) on the left and the largest CA (Toronto, Ont., population is 3,746,123) on the right. The Fay-Herriot model tends to shrink the estimates towards the average of the unemployment rates. The space-time model leads to moderate smoothing of the direct LFS estimates. For the CAs with large population sizes and therefore large sample sizes, the direct estimates and the HB estimates are very close to each other; for smaller CAs, the direct and HB estimates differ substantially for some regions.

Figure 4.7: Comparison of unemployment rates using Direct, Fay-Herriot, and space-time for June 1999 (vertical axis: Unemployment rate (%); series: Direct Est, Fay-Herriot, Space-Time)

Figure 4.8: Comparison of coefficients of variation of unemployment rates using Direct, Fay-Herriot, and space-time estimates for June 1999 (vertical axis: Est. cv; series: Direct Est, Fay-Herriot, Space-Time)

Acknowledgements: The author would like to acknowledge Jon Rao, Peter Dick and Susana Rubin-Bleuer.

References

Australian Bureau of Statistics (2006). A Guide to Small Area Estimation - Version 1.1. Internal ABS document.

Battese, G.E., Harter, R.M., Fuller, W.A. (1988). An Error-Components Model for Prediction of Crop Areas Using Survey and Satellite Data, Journal of the American Statistical Association, 83, 28-36.

Brackstone, G. J. (1987). Small area data: policy issues and technical challenges. In R. Platek, J. N. K. Rao, C. E. Sarndal, and M. P. Singh, eds., Small Area Statistics, pp. 3-20. John Wiley & Sons, New York.

Béland, Yves (2002). Canadian Community Health Survey: Methodological overview. Health Reports, Statistics Canada, Catalogue no. 82-003-XPE (0030182-003-XIE.pdf), Vol. 13, No. 3, ISSN 0840-6529.

Dick, P. (1995). Modelling Net Undercoverage in the 1991 Canadian Census, Survey Methodology, 21, 45-54.

Drew, D., Singh, M.P., and Choudhry, G.H. (1982). Evaluation of Small Area Estimation Techniques for the Canadian Labour Force Survey, Survey Methodology, 8, 17-47.

Fay, R.E. and Herriot, R.A. (1979). Estimation of Income for Small Places: An Application of James-Stein Procedures to Census Data. Journal of the American Statistical Association, 74, 269-277.

Fuller, W.A. (1999). Environmental Surveys Over Time, Journal of Agricultural, Biological and Environmental Statistics, 4, 331-345.

Gambino, J.G., Singh, M.P., Dufour, J., Kennedy, B. and Lindeyer, J. (1998). Methodology of the Canadian Labour Force Survey, Statistics Canada, Catalogue No. 71-526.


Gonzalez, M.E., and Hoza, C. (1978). Small-Area Estimation with Application to Unemployment and Housing Estimates, Journal of the American Statistical Association, 73, 7-15.

Hidiroglou, M.A., Singh, A., and Hamel, M. (2007). Some thoughts on small area estimation for the Canadian Community Health Survey (CCHS). Internal Statistics Canada document.

Hidiroglou, M.A. and Särndal, C.E. (1985). Small Domain Estimation: A Conditional Analysis, Proceedings of the Social Statistics Section, American Statistical Association, 147-158.

Hidiroglou, M.A. and Patak, Z. (2004). Domain estimation using linear regression. Survey Methodology, 30, 67-78.

Levy, P.S. (1971). The Use of Mortality Data in Evaluating Synthetic Estimates, Proceedings of the Social Statistics Section, American Statistical Association, pp. 328-331.

Prasad, N.G.N., and Rao, J.N.K. (1990). The Estimation of the Mean Squared Error of Small-Area Estimators, Journal of the American Statistical Association, 85, 163-171.

Prasad, N.G.N. and Rao, J.N.K. (1999). On robust small area estimation using a simple random effects model. Survey Methodology, 25, 67-72.

Rao, J.N.K. (2003). Small Area Estimation. New York: Wiley.

Rao, J.N.K. and Choudhry, H. (1995). Small Area Estimation: Overview and Empirical Study. Business Survey Methods, Edited by Cox, Binder, Chinnappa, Christianson, Colledge, Kott, Chapter 27.

Rubin-Bleuer, S., Godbout, S. and Morin, Y. (2007). Evaluation of small domain estimators for the Canadian Survey of Employment, Payrolls and Hours. Paper presented at the third International Conference of Establishment Surveys, July 2007, Statistical of Society Meetings.

Schaible, W.A. (1978). Choosing Weights for Composite Estimators for Small Area Statistics, Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 741-746.

Singh, A.C. and Verret, F. (2006). Mixed Linear Nonlinear Aggregate Level Models for Small Area Estimation for Binary count data from Surveys. Proceedings of the Statistics Canada Symposium.

Singh, A.C. (2006). Some problems and proposed solutions in developing a small area estimation product for clients. ASA Proc. Surv. Res. Meth. Sec.

Singh, M.P., Gambino, J., Mantel, H.J. (1994). Issues and Strategies for Small Area Data, Survey Methodology, 20, 3-22.

Woodruff, R.S. (1966). Use of a Regression Technique to Produce Area Breakdowns of the Monthly National Estimates of Retail Trade, Journal of the American Statistical Association, 61, 496-504.

You, Y., and Rao, J.N.K. (2002). A Pseudo-Empirical Best Linear Unbiased Prediction Approach to Small Area Estimation Using Survey Weights, Canadian Journal of Statistics, 30, 431-439.

You, Y., Rao, J.N.K. and Dick, J.P. (2002). Benchmarking hierarchical Bayes small area estimators with application in census undercoverage estimation. Proceedings of the Survey Methods Section 2002, Statistical Society of Canada, 81-86.

You, Y., Rao, J.N.K., and Gambino, J.G. (2003). Model-based unemployment rate estimation for the Canadian Labour Force Survey: A hierarchical Bayes approach, Survey Methodology, 29, 25-32.

You, Y. and Dick, P. (2004). Hierarchical Bayes Small Area Inference to the 2001 Census Undercoverage Estimation. Proceedings of the ASA Section on Government Statistics, 1836-1840.


Random effects model

Description: Computation

1. Model a smooth function of $Y_i$: $\theta_i = g(Y_i)$, where $Y_i$ is the small area population mean for the i-th small area; $i = 1, \dots, m$.

2. Direct estimate of $\theta_i$: $\hat{\theta}_i = g(\hat{Y}_i)$, where $\hat{Y}_i$ is the observed direct estimate.

3. Auxiliary data: $z_i = (z_{1i}, z_{2i}, \dots, z_{pi})'$.

4. Linking model (connect the $\theta_i$): $\theta_i = z_i'\beta + v_i$; $v_i$ i.i.d. under the model $(0, \sigma_v^2)$; $\sigma_v^2$ = model variance.

5. Sampling model: $\hat{\theta}_i = \theta_i + e_i$; sampling errors $e_i$ independent with $E_p(e_i \mid \theta_i) = 0$ and sampling variance $V_p(e_i \mid \theta_i) = \psi_i$ (assumed known).

6. Combine 4 and 5: $\hat{\theta}_i = z_i'\beta + v_i + e_i$ (Fay-Herriot model).

7. Estimation of $\sigma_v^2$ (method of moments): solve $h(\sigma_v^2) = \sum_{i=1}^{m} (\hat{\theta}_i - z_i'\tilde{\beta}(\sigma_v^2))^2 / (\psi_i + \sigma_v^2) = m - p$ for $\sigma_v^2$ via the iteration $\sigma_v^{2(r+1)} = \sigma_v^{2(r)} + [m - p - h(\sigma_v^{2(r)})] / h_*'(\sigma_v^{2(r)})$, constraining to $\sigma_v^{2(r+1)} \ge 0$, where $h_*'(\sigma_v^2) = -\sum_{i=1}^{m} (\hat{\theta}_i - z_i'\tilde{\beta})^2 / (\psi_i + \sigma_v^2)^2$ is an approximation to the derivative of $h(\sigma_v^2)$.

8. Optimal model-based Fay-Herriot estimator: $\hat{\theta}_{i,FH} = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i) z_i'\hat{\beta} = z_i'\hat{\beta} + \hat{\gamma}_i (\hat{\theta}_i - z_i'\hat{\beta}) = z_i'\hat{\beta} + \hat{v}_i$, where $\hat{\beta}$ is the weighted least squares estimator of $\beta$. Now $\hat{\beta} = \tilde{\beta}(\hat{\sigma}_v^2) = \left[\sum_{i=1}^{m} z_i z_i' / (\psi_i + \hat{\sigma}_v^2)\right]^{-1} \left[\sum_{i=1}^{m} z_i \hat{\theta}_i / (\psi_i + \hat{\sigma}_v^2)\right]$, where $\hat{\gamma}_i = \hat{\sigma}_v^2 / (\psi_i + \hat{\sigma}_v^2)$.

9. Leading term of the MSE of $\hat{\theta}_{i,FH}$: $MSE(\hat{\theta}_{i,FH}) = E(\hat{\theta}_{i,FH} - \theta_i)^2$, where the expectation is with respect to the Fay-Herriot model (see step 6). The leading term $g_{1i}(\sigma_v^2) = \gamma_i \psi_i$ shows that the efficiency of $\hat{\theta}_{i,FH}$ over the direct estimator $\hat{\theta}_i$ is $\gamma_i^{-1}$ for a large number of areas $m$. If $\gamma_i = \sigma_v^2 / (\psi_i + \sigma_v^2) = 1/2$, then the efficiency is 200%, or the gain in efficiency is 100%.

10. Scenarios for large efficiency gains: sampling variance $\psi_i$ large, or model variance $\sigma_v^2$ small relative to $\psi_i$.

11. Nearly unbiased estimator of $MSE(\hat{\theta}_{i,FH})$: $mse(\hat{\theta}_{i,FH})$; see equation (7.1.26), p. 129, Rao (2003); easily programmable.

12. Estimation of the small area mean $Y_i$: $\hat{Y}_{i,FH} = g^{-1}(\hat{\theta}_{i,FH}) = K(\hat{\theta}_{i,FH})$.

13. MSE estimator of $\hat{Y}_{i,FH}$: $mse(\hat{Y}_{i,FH}) = [K'(\hat{\theta}_{i,FH})]^2 \, mse(\hat{\theta}_{i,FH})$; may not be nearly unbiased. Empirical Bayes (EB) and hierarchical Bayes (HB) methods are better suited for handling non-linear cases, $K(\hat{\theta}_{i,FH})$; see p. 133, Rao (2003).
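The method-of-moments iteration, the weighted least squares fit and the shrinkage step listed in this appendix can be sketched in a few lines of Python. This is a minimal illustration and not the production code used at Statistics Canada: the function name `fay_herriot`, the stopping rule, the starting value of zero and the simulated data are all assumptions made for the example, and the sampling variances `psi` are treated as known.

```python
import numpy as np


def fay_herriot(theta_hat, Z, psi, tol=1e-10, max_iter=200):
    """Illustrative Fay-Herriot fit with a method-of-moments sigma_v^2.

    theta_hat: (m,) direct estimates; Z: (m, p) auxiliary data;
    psi: (m,) sampling variances, treated as known and positive.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m, p = Z.shape

    def wls_beta(sigma2):
        # Weighted least squares with weights 1 / (psi_i + sigma_v^2).
        w = 1.0 / (psi + sigma2)
        return np.linalg.solve(Z.T @ (Z * w[:, None]), Z.T @ (w * theta_hat))

    def h_and_slope(sigma2):
        # h(sigma_v^2) and the derivative approximation h*'(sigma_v^2).
        r2 = (theta_hat - Z @ wls_beta(sigma2)) ** 2
        return np.sum(r2 / (psi + sigma2)), -np.sum(r2 / (psi + sigma2) ** 2)

    sigma2 = 0.0  # assumed starting value
    h, slope = h_and_slope(sigma2)
    if h > m - p:  # otherwise no positive root exists and sigma_v^2 stays 0
        for _ in range(max_iter):
            step = (m - p - h) / slope
            sigma2 = max(0.0, sigma2 + step)  # constrain to sigma_v^2 >= 0
            h, slope = h_and_slope(sigma2)
            if abs(step) < tol:
                break

    beta = wls_beta(sigma2)
    synthetic = Z @ beta                      # z_i' beta_hat
    gamma = sigma2 / (psi + sigma2)           # shrinkage weights
    eblup = synthetic + gamma * (theta_hat - synthetic)
    return {"sigma2_v": sigma2, "beta": beta, "gamma": gamma,
            "synthetic": synthetic, "eblup": eblup, "g1": gamma * psi}


# Simulated example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
m = 40
Z = np.column_stack([np.ones(m), rng.normal(size=m)])
psi = rng.uniform(0.2, 1.0, size=m)
theta = Z @ np.array([1.0, 0.5]) + rng.normal(scale=1.0, size=m)
theta_hat = theta + rng.normal(scale=np.sqrt(psi))
fit = fay_herriot(theta_hat, Z, psi)
print(fit["sigma2_v"], fit["gamma"][:3])
```

Areas with a large sampling variance relative to the estimated model variance receive a small weight on the direct estimate and are pulled towards the synthetic value, which is the behaviour described in step 10. The returned `g1` is only the leading MSE term of step 9, not the nearly unbiased MSE estimator of step 11.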

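The back-transformation in steps 12 and 13 of the appendix can be illustrated with one concrete choice of $g$. The sketch below assumes $g = \log$, so that $K(\theta) = \exp(\theta)$ and $K'(\theta) = \exp(\theta)$; this choice, the function name `back_transform_log` and the input values are assumptions made for illustration only.

```python
import math


def back_transform_log(theta_fh, mse_theta_fh):
    """Back-transform a log-scale estimate (assumed g = log, K = exp).

    Returns the estimate of the small area mean and its delta-method
    MSE estimate [K'(theta_hat)]^2 * mse(theta_hat).
    """
    y_fh = math.exp(theta_fh)                         # step 12
    mse_y = math.exp(theta_fh) ** 2 * mse_theta_fh    # step 13
    return y_fh, mse_y


# Hypothetical inputs: a log-scale estimate of log(250) with mse 0.01.
y_fh, mse_y = back_transform_log(theta_fh=math.log(250.0), mse_theta_fh=0.01)
print(y_fh, mse_y)
```

As noted in step 13, this linearized MSE estimator may not be nearly unbiased when $K$ is strongly non-linear.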
