
Sampling, Regression, Experimental Design and
Analysis for Environmental Scientists,
Biologists, and Resource Managers

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca

August 7, 2006
Contents

4 Sampling 2
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1.1 Difference between sampling and experimental design . . 3
4.1.2 Why sample rather than census? . . . . . . . . . . . . . . 3
4.1.3 Principal steps in a survey . . . . . . . . . . . . . . . . . . 3
4.1.4 Probability sampling vs. non-probability sampling . . . . 4
4.1.5 The importance of randomization in survey design . . . . 6
4.1.6 Model vs Design based sampling . . . . . . . . . . . . . . 10
4.1.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Overview of Sampling Methods . . . . . . . . . . . . . . . . . . . 11
4.2.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . 11
4.2.2 Systematic Surveys . . . . . . . . . . . . . . . . . . . . . . 13
4.2.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.4 Multi-stage sampling . . . . . . . . . . . . . . . . . . . . . 19
4.2.5 Multi-phase designs . . . . . . . . . . . . . . . . . . . . . 21
4.2.6 Repeated Sampling . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Simple Random Sampling Without Replacement (SRSWOR) . . 25
4.4.1 Summary of main results . . . . . . . . . . . . . . . . . . 25
4.4.2 Estimating the Population Mean . . . . . . . . . . . . . . 26
4.4.3 Estimating the Population Total . . . . . . . . . . . . . . 27
4.4.4 Estimating Population Proportions . . . . . . . . . . . . . 28
4.4.5 Example - estimating total catch of fish in a recreational
fishery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
What is the population of interest? . . . . . . . . . . . . . 30
What is the frame? . . . . . . . . . . . . . . . . . . . . . . 31
What is the sampling design and sampling unit? . . . . . 31
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . 32
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . 35
JMP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Sample size determination for a simple random sample . . . . . . 47
4.5.1 Example - How many anglers to survey . . . . . . . . . . 49
4.6 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.1 Advantages of systematic sampling . . . . . . . . . . . . . 52


4.6.2 Disadvantages of systematic sampling . . . . . . . . . . . 52
4.6.3 How to select a systematic sample . . . . . . . . . . . . . 53
4.6.4 Analyzing a systematic sample . . . . . . . . . . . . . . . 53
4.6.5 Technical notes - Repeated systematic sampling . . . . . . 54
Example of replicated subsampling within a systematic
sample . . . . . . . . . . . . . . . . . . . . . . . 54
4.7 Stratified simple random sampling . . . . . . . . . . . . . . . . . 57
4.7.1 A visual comparison of a simple random sample vs a strat-
ified simple random sample . . . . . . . . . . . . . . . . . 59
4.7.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7.3 Summary of main results . . . . . . . . . . . . . . . . . . 63
4.7.4 Example - sampling organic matter from a lake . . . . . . 65
4.7.5 Example - estimating the total catch of salmon . . . . . . 70
What is the population of interest? . . . . . . . . . . . . . 71
What is the sampling frame? . . . . . . . . . . . . . . . . 71
What is the sampling design? . . . . . . . . . . . . . . . . 72
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . 72
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . 75
JMP analysis . . . . . . . . . . . . . . . . . . . . . . . . . 79
When should the various estimates be used? . . . . . . . . 85
4.8 Sample Size for Stratified Designs . . . . . . . . . . . . . . . . . . 88
4.8.1 Total sample size . . . . . . . . . . . . . . . . . . . . . . . 88
4.8.2 Allocating samples among strata . . . . . . . . . . . . . . 91
4.8.3 Example: Estimating the number of tundra swans. . . . . 94
4.8.4 Multiple stratification . . . . . . . . . . . . . . . . . . . . 101
4.8.5 Post-stratification . . . . . . . . . . . . . . . . . . . . . . 101
4.8.6 Allocation and precision - revisited . . . . . . . . . . . . . 101
4.9 Ratio estimation in SRS - improving precision with auxiliary in-
formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.9.1 Summary of Main results . . . . . . . . . . . . . . . . . . 104
4.9.2 Example - wolf/moose ratio . . . . . . . . . . . . . . . . . 105
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . 107
SAS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 109
JMP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 113
Post mortem . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.9.3 Example - Grouse numbers - using a ratio estimator to
estimate a population total . . . . . . . . . . . . . . . . . 121
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . 123
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . 125
JMP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 130
Post mortem - a question to ponder . . . . . . . . . . . . 135
4.10 Additional ways to improve precision . . . . . . . . . . . . . . . . 135
4.10.1 Using both stratification and auxiliary variables . . . . . . 135
4.10.2 Regression Estimators . . . . . . . . . . . . . . . . . . . . 136
4.10.3 Sampling with unequal probability - pps sampling . . . . 137
4.11 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


4.11.1 Sampling plan . . . . . . . . . . . . . . . . . . . . . . . . 138


4.11.2 Advantages and disadvantages of cluster sampling com-
pared to SRS . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.11.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.11.4 Summary of main results . . . . . . . . . . . . . . . . . . 142
4.11.5 Example - estimating the density of urchins . . . . . . . . 144
Excel Analysis . . . . . . . . . . . . . . . . . . . . . . . . 145
SAS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 148
JMP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 152
Planning for future experiments . . . . . . . . . . . . . . . 160
4.11.6 Example - estimating the total number of sea cucumbers 161
4.12 Multi-stage sampling - a generalization of cluster sampling . . . . 165
4.12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.12.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.12.3 Summary of main results . . . . . . . . . . . . . . . . . . 167
4.12.4 Example - estimating number of clams . . . . . . . . . . . 169
Excel Spreadsheet . . . . . . . . . . . . . . . . . . . . . . 174
SAS Program . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.12.5 Some closing comments on multi-stage designs . . . . . . 175
4.13 Some final comments on descriptive surveys . . . . . . . . . . . . 176
4.13.1 Unit size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.13.2 Key considerations when designing a survey . . . . . . . . 177
4.14 Analytical surveys - almost experimental design . . . . . . . . . . 177
4.15 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
4.16 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . 183
4.16.1 Confusion about the definition of a population . . . . . . 183
4.16.2 How is N defined . . . . . . . . . . . . . . . . . . . . . . . 184
4.16.3 Multi-stage vs Multi-phase sampling . . . . . . . . . . . . 184
4.16.4 What is the difference between a Population and a frame? 185
4.16.5 How to account for missing transects. . . . . . . . . . . . 185

Chapter 4

Sampling

4.1 Introduction

Today the word "survey" is used most often to describe a method of gathering
information from a sample of individuals or animals or areas. This "sample" is
usually just a fraction of the population being studied.

You are exposed to survey results almost every day. For example, election
polls, the unemployment rate, or the consumer price index are all examples of
the results of surveys. On the other hand, some common headlines are NOT
the results of surveys, but rather the results of experiments - for example,
whether a new drug is as effective as an old drug.

Not only do surveys have a wide variety of purposes, they also can be con-
ducted in many ways – including over the telephone, by mail, or in person.
Nonetheless, all surveys do have certain characteristics in common. All surveys
require a great deal of planning in order that the results are informative.

Unlike a census, where all members of the population are studied, surveys
gather information from only a portion of a population of interest – the size of
the sample depending on the purpose of the study. Surprisingly to many people,
a survey can give better quality results than a census.

In a bona fide survey, the sample is not selected haphazardly. It is
scientifically chosen so that each object in the population will have a
measurable chance of selection. This way, the results can be reliably
projected from the sample to the larger population.


Information is collected by means of standardized procedures. The survey's
intent is not to describe the particular objects which, by chance, are part of
the sample, but to obtain a composite profile of the population.

4.1.1 Difference between sampling and experimental design

There are two key differences between survey sampling and experimental design.

• In experiments, one deliberately perturbs some part of the population to
see the effect of the action. In sampling, one wishes to see what the
population is like without disturbing it.
• In experiments, the objective is to compare the mean response to changes
in levels of the factors. In sampling the objective is to describe the char-
acteristics of the population. However, refer to the section on analytical
sampling later in this chapter for when sampling looks very similar to
experimental design.

4.1.2 Why sample rather than census?

There are a number of advantages of sampling over a complete census:

• reduced cost
• greater speed - a much smaller scale of operations is performed
• greater scope - if highly trained personnel or equipment is needed
• greater accuracy - easier to train small crew, supervise them, and reduce
data entry errors
• reduced respondent burden
• in destructive sampling you can’t measure the entire population - e.g.
crash tests of cars

4.1.3 Principal steps in a survey

The principal steps in a survey are:


• formulate the objectives of the survey - need concise statement


• define the population to be sampled - e.g. what is the range of animals
or locations to be measured? Note that the population is the set of final
sampling units that will be measured - refer to the FAQ at the end of the
chapter for more information.
• establish what data is to be collected - collect a few items well rather than
many poorly
• what degree of precision is required - examine power needed
• establish the frame - this is a list of sampling units that is exhaustive and
exclusive
– in many cases the frame is obvious, but in others it is not
– it is often very difficult to establish a frame - e.g. a list of all streams
in the lower mainland.

• choose among the various designs; will you stratify? There are a variety
of sampling plans some of which will be discussed in detail later in this
chapter. Some common designs in ecological studies are:
– simple random sampling
– systematic sample
– cluster sampling
– multi-stage design
All designs can be improved by stratification, so this should always be
considered during the design phase.
• pre-test - very important to try out field methods and questionnaires
• organization of field work - training, pre-test, etc
• summary and data analysis - easiest part if earlier parts done well
• post-mortem - what went well, poorly, etc.

4.1.4 Probability sampling vs. non-probability sampling

There are two types of sampling plans - probability sampling where units are
chosen in a ‘random fashion’ and non-probability sampling where units are cho-
sen in some deliberate fashion.

In probability sampling


• every unit has a known probability of being in the sample
• the sample is drawn with some method consistent with these probabilities
• these selection probabilities are used when making estimates from the
sample
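As a sketch of how these selection probabilities enter into the estimates, consider estimating a population total when units were drawn with known but unequal inclusion probabilities. The data and function name below are hypothetical illustrations, not taken from these notes; this is the classical inverse-probability (Horvitz-Thompson) idea.

```python
# Sketch: using known selection probabilities when making estimates.
# Each sampled value is weighted by 1/(its inclusion probability).
# Data and function name are hypothetical.

def ht_total(values, incl_probs):
    """Estimate the population total from a probability sample."""
    return sum(y / p for y, p in zip(values, incl_probs))

# Three sampled units; the first was twice as likely to be selected.
sample = [12.0, 8.0, 15.0]
probs = [0.2, 0.1, 0.1]
print(ht_total(sample, probs))  # 12/0.2 + 8/0.1 + 15/0.1 = 290.0
```

A unit that had a small chance of selection stands in for many unsampled units, so its value receives a large weight.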

The advantages of probability sampling

• we can study biases of the sampling plans

• standard errors and measures of precision (confidence limits) can be
obtained

Some types of non-probability sampling plan include:

• quota sampling - select 50 M and 50 F from the population


– less expensive than a probability sample
– may be only option if no frame exists
• judgmental sampling - select ‘average’ or ‘typical’ value. This is a quick
and dirty sampling method and can perform well if there are a few extreme
points which should not be included.
• convenience sampling - select those readily available. This is useful if it is
dangerous or unpleasant to sample directly. For example, selecting blood
samples from grizzly bears.

• haphazard sampling (not the same as random sampling). This is often
useful if the sampling material is homogeneous and spread throughout the
population, e.g. chemicals in drinking water.

The disadvantages of non-probability sampling include

• unable to assess biases in any rational way.


• no estimates of precision can be obtained. In particular the simple use of
formulae from probability sampling is WRONG!
• experts may disagree on what is the “best” sample.


4.1.5 The importance of randomization in survey design

[With thanks to Dr. Rick Routledge for this part of the notes.]

. . . I had to make a ‘cover degree’ study... This involved the use
of a Raunkiaer’s Circle, a device designed in hell. In appearance
it was all simple innocence, being no more than a big metal hoop;
but in use it was a devil’s mechanism for driving sane men mad.
To use it, one stood on a stretch of muskeg, shut one’s eyes, spun
around several times like a top, and then flung the circle as far away
as possible. This complicated procedure was designed to ensure
that the throw was truly ‘random’; but, in the event, it inevitably
resulted in my losing sight of the hoop entirely, and having to spend
an unconscionable time searching for the thing.
Farley Mowat, Never Cry Wolf. McLelland and Stewart, 1963.

Why would a field biologist in the early post-war period be instructed to
follow such a bizarre-looking scheme for collecting a representative sample of
tundra vegetation? Could she not have obtained a typical cross-section of the
vegetation by using her own judgment? Undoubtedly, she could have convinced
herself that by replacing an awkward, haphazard sampling scheme with one de-
pendent solely on her own judgment and common sense, she could have been
guaranteed a more representative sample. But would others be convinced? A
careful, objective scientist is trained to be skeptical. She would be reluctant
to accept any evidence whose validity depended critically on the judgment and
skills of a stranger. The burden of proof would then rest squarely with Farley
Mowat to prove his ability to take representative, judgmental samples. It is typ-
ically far easier for a scientist to use randomization in her sampling procedures
than it is to prove her judgmental skills.

Hovering and Patrolling Bees

It is often difficult, if not impossible, to take a properly randomized
sample. Consider, e.g., the problem faced by Alcock et al. (1977) in studying
the behavior of male bees of the species Centris pallida in the deserts of the
southwestern United States. Females pupate in underground burrows. To maximize
the presence of his genes in the next generation, a male of the species needs
to mate with as many virgin females as possible. One strategy is to patrol the
burrowing area at a low altitude, and nab an emerging female as soon as her
presence is detected. This patrolling strategy seems to involve a relatively high
risk of confrontation with other patrolling males. The other strategy reported
by the authors is to hover farther above the burrowing area, and mate with
those females who escape detection by the hoverers. These hoverers appear to
be involved in fewer conflicts.


Because the hoverers tend to be less involved in aggressive confrontations,
one might guess that they would tend to be somewhat smaller than the more
aggressive patrollers. To assess this hypothesis, the authors took measurements
of head widths for each of the two subpopulations. Of course, they could not
capture every single male bee in the population. They had to be content with
a sample.

Sample sizes and results are reported in the Table below. How are we to
interpret these results? The sampled hoverers obviously tended to be somewhat
smaller than the sampled patrollers, although it appears from the standard
deviations that some hoverers were larger than the average-sized patroller and
vice-versa. Hence, the difference is not overwhelming, and may be attributable
to sampling errors.

Table: Summary of head width measurements on two samples of bees.

Sample       n    mean      SD
Hoverers     50   4.92 mm   0.15 mm
Patrollers   100  5.14 mm   0.29 mm

If the sampling were truly randomized, then the only sampling errors would
be chance errors, whose probable size can be assessed by a standard t-test.
Exactly how were the samples taken? Is it possible that the sampling procedure
used to select patrolling bees might favor the capture of larger bees, for example?
This issue is indeed addressed by the authors. They carefully explain how they
attempted to obtain unbiased samples. For example, to sample the patrolling
bees, they made a sweep across the sampling area, attempting to catch all the
patrolling bees that they observed. To assess the potential for bias, one must
in the end make a subjective judgment.

Why make all this fuss over a technical possibility? It is important to do so
because lack of attention to such possibilities has led to some colossal errors in
the past. Nowhere are they more obvious than in the field of election prediction.
Most of us never find out the real nature of the population that we are sam-
pling. Hence, we never know the true size of our errors. By contrast, pollsters’
errors are often painfully obvious. After the election, the actual percentages are
available for everyone to see.

Lessons from Opinion Polling

In the 1930s, political opinion polling was in its formative years. The pioneers
in this endeavor were training themselves on the job. Of the inevitable errors,
two were so spectacular as to make international headlines.

In 1936, an American magazine with a large circulation, The Literary Digest,
attempted to poll an enormous segment of the American voting public in order
to predict the outcome of the presidential election that autumn. Roosevelt,
the Democratic candidate, promised to develop programs designed to increase
opportunities for the disadvantaged; Landon, the candidate for the Republican
Party, appealed more to the wealthier segments of American society. The Liter-
ary Digest mailed out questionnaires to about ten million people whose names
appeared in such places as subscription lists, club directories, etc. They received
over 2.5 million responses, on the basis of which they predicted a comfortable
victory for Landon. The election returns soon showed the massive size of their
prediction error.

The cumbersome design of this highly publicized survey provided a young,
wily pollster with the chance of a lifetime. Between the time that the Digest
announced its plans and released its predictions, George Gallup planned and
executed a remarkable coup. By polling only a small fraction of these individ-
uals, and a relatively small number of other voters, he correctly predicted not
only the outcome of the election, but also the enormous size of the error about
to be committed by The Literary Digest.

Obviously, the enormous sample obtained by the Digest was not very rep-
resentative of the population. The selection procedure was heavily biased in
favor of Republican voters. The most obvious source of bias is the method
used to generate the list of names and addresses of the people that they con-
tacted. In 1936, only the relatively affluent could afford magazines, telephones,
etc., and the more conservative policies of the Republican Party appealed to a
greater proportion of this segment of the American public. The Digest’s sample
selection procedure was therefore biased in favor of the Republican candidate.

The Literary Digest was guilty of taking a sample of convenience. Samples
of convenience are typically prone to bias. Any researcher who, either by choice
or necessity, uses such a sample, has to be prepared to defend his findings
against possible charges of bias. As this example shows, it can have catastrophic
consequences.

How did Gallup obtain his more representative sample? He did not use
randomization. Randomization is often criticized on the grounds that once in
a while, it can produce absurdly unrepresentative samples. When faced with a
sample that obviously contains far too few economically disadvantaged voters,
it is small consolation to know that next time around, the error will likely not
be repeated. Gallup used a procedure that virtually guaranteed that his sample
would be representative with respect to such obvious features as age, race, etc.
He did so by assigning quotas which his interviewers were to fill. One interviewer
might, e.g. be assigned to interview 5 adult males with specified characteristics
in a tough, inner-city neighborhood. The quotas were devised so as to make the
sample mimic known features of the population.

This quota sampling technique suited Gallup’s needs spectacularly well in
1936, even though he underestimated the support for the Democratic candidate
by about 6%. His subsequent polls contained the same systematic error. In
1948, the error finally caught up with him. He predicted a narrow victory for
the Republican candidate, Dewey. A newspaper editor was so confident of the
prediction that he authorized the printing of a headline proclaiming the victory
before the official results were available. It turned out that the Democrat,
Truman, won by a narrow margin.

What was wrong with Gallup’s sampling technique? He gave his interviewers
the final decision as to who would be interviewed. In a tough inner-city
neighborhood, an interviewer had the option of passing by a house with several
motorcycles parked out in front and sounds of a raucous party coming from
within. In the resulting sample, the more conservative (Republican) voters
were systematically over-represented.

Gallup learned from his mistakes. His subsequent surveys replaced inter-
viewer discretion with an objective, randomized scheme at the final stage of
sample selection. With the dominant source of systematic error removed, his
election predictions became even more reliable.

Implications for Biological Surveys

The bias in samples of convenience can be enormous. It can be surprisingly
large even in what appear to be carefully designed surveys. It can easily exceed
the typical size of the chance error terms. To completely remove the possibility
of bias in the selection of a sample, randomization must be employed. Sometimes
this is simply not possible, as for example, appears to be the case in the study
on bees. When this happens and the investigators wish to use the results of a
nonrandomized sample, then the final report should discuss the possibility of
selection bias and its potential impact on the conclusions.

Furthermore, when reading a report containing the results of a survey, it is
important to carefully evaluate the survey design, and to consider the potential
impact of sample selection bias on the conclusions.

Should Farley Mowat really have been content to take his samples by tossing
Raunkiaer’s Circle to the winds? Definitely not, for at least two reasons. First,
he had to trust that by tossing the circle, he was generating an unbiased sample.
It is not at all certain that some types of vegetation would not be selected with
a higher probability than others. For example, the higher shrubs would tend to
intercept the hoop earlier in its descent than would the smaller herbs. Second,
he has no guarantee that his sample will be representative with respect to the
major habitat types. Leaving aside potential bias, it is possible that the circle
could, by chance, land repeatedly in a snowbed community. It seems indeed
foolish to use a sampling scheme which admits the possibility of including only
snowbed communities when tundra bogs and fellfields may be equally abundant
in the population. In subsequent chapters, we shall look into ways of taking
more thoroughly randomized surveys, and into schemes for combining judgment with
randomization for eliminating both selection bias and the potential for grossly
unrepresentative samples. There are also circumstances in which a systematic
sample (e.g. taking transects every 200 meters along a rocky shore line) may
be justifiable, but this subject is not discussed in these notes.

4.1.6 Model vs Design based sampling

Model-based sampling starts by assuming some sort of statistical model for
the data in the population, and the goal is to select data to estimate the
parameters of this distribution. For example, you may be willing to assume that
the distribution of values in the population is log-normally distributed. The
data collected from the survey are then used along with a likelihood function to
estimate the parameters of the distribution.

Model-based sampling is very powerful because you are willing to make a lot
of assumptions about the data-generating process. However, if your model is
wrong, there are big problems. For example, what if you assume log-normality
but the data are not log-normally distributed? In these cases, the estimates of
the parameters can be extremely biased and inefficient.

Design-based sampling makes no assumptions about the distribution of
data values in the population. Rather, it relies upon the randomization procedure
to select representative elements of the population. Estimates from design-based
methods are unbiased regardless of the distribution of values in the population,
but in “strange” populations can also be inefficient. For example, if a population
is highly clustered, a random sample of quadrats will end up with mostly zero
observations and a few large values and the resulting estimates will have a large
standard error.
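The inefficiency in clustered populations can be made concrete with a small sketch. The quadrat counts and function name below are hypothetical, chosen so that both populations have the same mean:

```python
import statistics as st

# Two hypothetical sets of quadrat counts with the same mean (5):
# one evenly spread, one highly clustered (mostly zeros, a few large values).
even      = [5, 4, 6, 5, 5, 5, 4, 6, 5, 5]
clustered = [0, 0, 0, 0, 48, 0, 0, 0, 2, 0]

def se_mean(y):
    """Standard error of the sample mean (finite population
    correction ignored for simplicity)."""
    return st.stdev(y) / len(y) ** 0.5

print(round(se_mean(even), 2))       # 0.21
print(round(se_mean(clustered), 2))  # 4.78 - far larger, same sample size
```

The design-based estimate of the mean is unbiased in both cases, but the clustered counts give it a standard error more than twenty times larger.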

Most of the results in this chapter on survey sampling are design-based, i.e.
we don’t need to make any assumptions about normality in the population for
the results to be valid.

4.1.7 Software

Unfortunately, there is no common, easy-to-use statistical package for the
analysis of survey data. Fortunately, most of the computations are fairly
straightforward, so many of the common packages, such as JMP or Excel, can
be used to analyze survey data.

SAS includes survey design procedures, but these are not covered in this
course.

For a review of packages that can be used to analyze survey data, please refer to
the article at http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into
account four common characteristics of sample survey data: (1) unequal proba-
bility selection of observations, (2) clustering of observations, (3) stratification
and (4) nonresponse and other adjustments. Point estimates of population pa-
rameters are impacted by the value of the analysis weight for each observation.
These weights depend upon the selection probabilities and other survey de-
sign features such as stratification and clustering. Hence, standard packages
will yield biased point estimates if the weights are ignored. Estimated vari-
ance formulas for point estimates based on sample survey data are impacted by
clustering, stratification and the weights. By ignoring these aspects, standard
packages generally underestimate the estimated variance of a point estimate,
sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually
via a WEIGHT statement added to the program code. Use of standard
statistical packages with a weighting variable may yield the same point estimates for
population parameters as sample survey software packages. However, the esti-
mated variance often is not correct and can be substantially wrong, depending
upon the particular program within the standard software package.
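The role of the weights in the point estimate can be illustrated with a minimal sketch. The data are hypothetical; each weight plays the role of an analysis weight, e.g. the inverse of the unit's selection probability:

```python
# Sketch: a design-weighted point estimate versus the naive one.
# Hypothetical data; each weight is the number of population units
# the observation represents.

def weighted_mean(y, w):
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)

y = [10.0, 20.0, 30.0]
w = [5.0, 1.0, 1.0]  # the first unit stands in for 5 population units

print(round(weighted_mean(y, w), 2))  # 14.29
print(sum(y) / len(y))                # 20.0 - ignoring the weights biases this
```

Note that this reproduces only the point estimate; as discussed above, the variance of a weighted estimate requires the proper survey formulas or specialized survey software.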

For further information about the problems of using standard statistical
software packages in survey sampling, please refer to the article at
http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

NOTE that SAS has specialized routines for the analysis of survey
data that avoid these problems.

4.2 Overview of Sampling Methods

4.2.1 Simple Random Sampling

This is the basic method of selecting survey units. Each unit in the population
is selected with equal probability and all possible samples are equally likely to
be chosen. This is commonly done by listing all the members in the population
(the set of sampling units) and then choosing units using a random number

table.

An example of a simple random sample would be a vegetation survey in
a large forest stand. The stand is divided into 480 one-hectare plots, and a
random sample of 24 plots was selected and analyzed using aerial photos. The
map of the units selected might look like:


Units are usually chosen without replacement, i.e. each unit in the pop-
ulation can only be chosen once. In some cases (particularly for multi-stage
designs), there are advantages to selecting units with replacement, i.e. a unit in
the population may potentially be selected more than once. The analysis of a
simple random sample is straightforward. The mean of the sample is an esti-
mate of the population mean. An estimate of the population total is obtained
by multiplying the sample mean by the number of units in the population. The
sampling fraction, the proportion of units chosen from the entire population,
is typically small. If it exceeds 5%, an adjustment (the finite population cor-
rection) will result in better estimates of precision (a reduction in the standard
error) to account for the fact that a substantial fraction of the population was
surveyed.
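The calculations just described can be sketched as follows (the plot values and function name are hypothetical; the finite population correction is included, which matters once the sampling fraction exceeds about 5%):

```python
import statistics as st

# Sketch of the simple-random-sample estimates described above:
# sample mean, population total, and a standard error with the
# finite population correction. Data are hypothetical.

def srs_estimates(y, N):
    n = len(y)
    ybar = st.mean(y)                  # estimate of the population mean
    total = N * ybar                   # estimate of the population total
    fpc = 1 - n / N                    # finite population correction
    se_mean = (fpc * st.variance(y) / n) ** 0.5
    return ybar, total, se_mean

y = [3.0, 5.0, 4.0, 6.0]               # e.g. biomass in 4 sampled plots
ybar, total, se = srs_estimates(y, N=480)
print(ybar, total)                     # 4.5 2160.0
```

With a small sampling fraction like this one (4/480), the correction barely changes the standard error; with a fraction above 5% it gives a noticeable reduction.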

A simple random sample design is often ‘hidden’ in the details of many
other survey designs. For example, many surveys of vegetation are conducted
using strip transects where the initial starting point of the transect is randomly
chosen, and then every plot along the transect is measured. Here the strips
are the sampling unit, and are a simple random sample from all possible strips.
The individual plots are subsamples from each strip and cannot be regarded
as independent samples. For example, suppose a rectangular stand is surveyed
using aerial overflights. In many cases, random starting points along one edge
are selected, and the aircraft then surveys the entire length of the stand starting
at the chosen point. The strips are typically analyzed section-by-section, but
it would be incorrect to treat the smaller parts as a simple random sample from
the entire stand.

Note that a crucial element of simple random samples is that every sampling
unit is chosen independently of every other sampling unit. For example, in strip
transects plots along the same transect are not chosen independently - when a
particular transect is chosen, all plots along the transect are sampled and so
the selected plots are not a simple random sample of all possible plots. Strip-
transects are actually examples of cluster-samples. Cluster samples are discussed
in greater detail later in this chapter.

4.2.2 Systematic Surveys

In some cases, it is logistically inconvenient to randomly select sample units from


the population. An alternative is to take a systematic sample where every kth
unit is selected (after a random starting point); k is chosen to give the required
sample size. For example, if a stream is 2 km long, and 20 samples are required,
then k = 100 and samples are chosen every 100 m along the stream after a
random starting point. A common alternative when the population does not
naturally divide into discrete units is grid-sampling. Here sampling points are
located using a grid that is randomly located in the area. All sampling points


are a fixed distance apart.
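The stream example above can be sketched directly (a hypothetical illustration; treating the stream as a continuous 2000 m line and the seed as arbitrary):

```python
import random

stream_length = 2000          # metres of stream
n = 20                        # samples required
k = stream_length // n        # sampling interval: every 100 m

random.seed(2)
start = random.uniform(0, k)  # random starting point within the first interval
# One sample point every k metres after the random start
sample_points = [start + i * k for i in range(n)]
```

Note that the single random draw (`start`) determines the entire sample; this is exactly why a systematic sample behaves so differently from a simple random sample of the same size.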

An example of a systematic sample would be a vegetation survey in a large


forest stand. The stand is divided into 480 one-hectare plots. As a total sample
size of 24 is required, this implies that we need to sample every 480/24 = 20th
plot. We pick a random starting point (the 9th plot) in the first row, and then
every 20 plots reading across rows. The final plan could look like:


If a known trend is present in the sample, this can be incorporated into the
analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic
sample follows an elevation gradient that is known to directly influence the
response variable. A regression-type correction can be incorporated into the
analysis. However, note that this trend must be known from external sources -
it cannot be deduced from the survey.

Pitfall: A systematic sample is typically analyzed in the same fashion as


a simple random sample. However, the true precision of an estimator from a
systematic sample can be either worse or better than a simple random sample
of the same size, depending if units within the systematic sample are positively
or negatively correlated among themselves. For example, if a systematic sam-
ple’s sampling interval happens to match a cyclic pattern in the population,
values within the systematic sample are highly positively correlated (the sam-
pled units may all hit the ‘peaks’ of the cyclic trend), and the true sampling
precision is worse than a SRS of the same size. What is even more unfortunate
is that because the units are positively correlated within the sample, the sam-
ple variance will underestimate the true variation in the population, and if the
estimated precision is computed using the formula for a SRS, a double dose of
bias in the estimated precision occurs (Krebs, 1989, p.227). On the other hand,
if the systematic sample is arranged ‘perpendicular’ to a known trend to try
and incorporate additional variability in the sample, the units within a sample
are now negatively correlated, the true precision is now better than a SRS sam-
ple of the same size, but the sample variance now overestimates the population
variance, and the formula for precision from a SRS will overstate the sampling
error. While logistically simpler, a systematic sample is only ‘equivalent’ to a
simple random sample of the same size if the population units are ‘in random
order’ to begin with. (Krebs, 1989, p. 227). Even worse, there is no information
in the systematic sample that allows the manager to check for hidden trends
and cycles.

Nevertheless, systematic samples do offer some practical advantages over


SRS if some correction can be made to the bias in the estimated precision:

• it is easier to relocate plots for long term monitoring


• mapping can be carried out concurrently with the sampling effort because
the ground is systematically traversed. This is less of an issue now with
GPS as the exact position can easily be recorded and the plots revisited
later.
• it avoids the problem of poorly distributed sampling units which can occur
with a SRS [but this can also be avoided by judicious stratification.]

Solution: Because of the necessity for a strong assumption of ‘randomness’


in the original population, systematic samples are discouraged and statistical


advice should be sought before starting such a scheme. If there are no other
feasible designs, a slight variation in the systematic sample provides some pro-
tection from the above problems. Instead of taking a single systematic sample
every kth unit, take 2 or 3 independent systematic samples of every 2kth or 3kth
unit, each with a different starting point. For example, rather than taking a
single systematic sample every 100 m along the stream, two independent sys-
tematic samples can be taken, each selecting units every 200 m along the stream
starting at two random starting points. The total sample effort is still the same,
but now some measure of the large scale spatial structure can be estimated.
This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
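A sketch of replicated sub-sampling for the forest-stand numbers (N = 480, total n = 24, two replicates each taking every 2k = 40th unit) is shown below; the `plot_value` function is a made-up stand-in for the field measurement, used only so the code runs:

```python
import random
import statistics

N, n, reps = 480, 24, 2
k = N // n                     # base interval (20); each replicate uses 2k = 40

def plot_value(u):
    # Hypothetical measurement for plot u - illustration only
    return (u * 7) % 13

random.seed(7)
replicate_means = []
for _ in range(reps):
    start = random.randint(1, reps * k)            # independent random start
    units = range(start, N + 1, reps * k)          # every 2k-th unit (12 plots)
    replicate_means.append(statistics.mean(plot_value(u) for u in units))

# The replicate means are treated as a tiny simple random sample,
# so their spread gives a valid (if crude) standard error:
est = statistics.mean(replicate_means)
se = statistics.stdev(replicate_means) / reps ** 0.5
```

With only two or three replicates the standard error is imprecise, but unlike the single systematic sample it is at least unbiased with respect to hidden spatial structure.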

4.2.3 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For


example, some animals congregate in herds or family units. It is often convenient
to select a random sample of herds and then measure every animal in the herd.
This is not the same as a simple random sample of animals because individual
animals are not randomly selected; the herds are the sampling unit. The strip-
transect example in the section on simple random sampling is also a cluster
sample; all plots along a randomly selected transect are measured. The strips
are the sampling units, while plots within each strip are sub-sampling units.
Another example is circular plot sampling; all trees within a specified radius of
a randomly selected point are measured. The sampling unit is the circular plot
while trees within the plot are sub-samples.

The reason cluster samples are used is that costs can be reduced compared
to a simple random sample giving the same precision. Because units within a
cluster are close together, travel costs among units are reduced. Consequently,
more clusters (and more total units) can be surveyed for the same cost as a
comparable simple random sample.

For example, consider the vegetation survey of previous sections. The 480
plots can be divided into 60 clusters of size 8. A total sample size of 24 is
obtained by randomly selecting three clusters from the 60 clusters present in
the map, and then surveying ALL eight members of the selected clusters. A map
of the design might look like:


Alternatively, clusters are often formed when a transect sample is taken. For
example, suppose that the vegetation survey picked an initial starting point on
the left margin, and then flew completely across the landscape in a straight
line measuring all plots along the route. A map of the design might look like:


In this case, three clusters were chosen from a possible 30 clusters and
the clusters are of unequal size (the middle cluster has only 12 plots measured
compared to the 18 plots measured on each of the other two transects).

Pitfall: A cluster sample is often mistakenly analyzed using methods for sim-
ple random surveys. This is not valid because units within a cluster are typically
positively correlated. The effect of this erroneous analysis is to come up with
an estimate that appears to be more precise than it really is, i.e. the estimated
standard error is too small and does not fully reflect the actual imprecision in
the estimate.

Solution: In order to be confident that the reported standard error really


reflects the uncertainty of the estimate, it is important that the analytical meth-
ods are appropriate for the survey design. The proper analysis treats the clusters
as a random sample from the population of clusters. The methods of simple
random samples are applied to the cluster summary statistics (Thompson, 1992,
Chapter 12).
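The "SRS on cluster summaries" recipe can be sketched numerically. The per-plot values below are invented solely for illustration (three sampled clusters of eight plots each, drawn from the M = 60 clusters of the vegetation example):

```python
import statistics

M, m = 60, 3                       # clusters in population / clusters sampled
# Hypothetical per-plot values for the 3 sampled clusters (8 plots each)
clusters = [
    [2, 3, 2, 4, 3, 2, 3, 4],
    [5, 4, 6, 5, 5, 4, 6, 5],
    [1, 2, 1, 1, 2, 2, 1, 1],
]
totals = [sum(c) for c in clusters]   # the cluster totals are the "data"

t_bar = statistics.mean(totals)       # mean total per cluster
s2_t = statistics.variance(totals)    # variance AMONG cluster totals
f = m / M                             # sampling fraction of clusters

tau_hat = M * t_bar                           # estimated population total
se_tau = M * (s2_t / m * (1 - f)) ** 0.5      # SRS formula on cluster totals
```

The key point is that `s2_t` measures variability among cluster totals, not among individual plots; using the plot-level variance here would understate the true uncertainty.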

4.2.4 Multi-stage sampling

In many situations, there are natural divisions of the population into several
different sizes of units. For example, a forest management unit consists of several
stands, each stand has several cutblocks, and each cutblock can be divided into
plots. These divisions can be easily accommodated in a survey through the
use of multi-stage methods. Selection of units is done in stages. For example,
several stands could be selected from a management area; then several cutblocks
are selected in each of the chosen stands; then several plots are selected in each
of the chosen cutblocks. Note that in a multi-stage design, units at any stage
are selected at random only from those larger units selected in previous stages.

Again consider the vegetation survey of previous sections. The population is


again divided into 60 clusters of size 8. However, rather than surveying all units
within a cluster, we decide to survey only two units within each cluster. Hence,
at the first stage we now sample a total of 12 clusters out of the 60. In each
cluster, we randomly sample 2 of the 8 units. A sample plan might look like
the following where the rectangles indicate the clusters selected, and the checks
indicate the sub-sample taken from each cluster:


The advantage of multi-stage designs is that costs can be reduced com-
pared to a simple random sample of the same size, primarily through improved
logistics. The precision of the results is worse than an equivalent simple ran-
dom sample, but because costs are less, a larger multi-stage survey can often be
done for the same costs as a smaller simple random sample. This often results
in a more precise estimate for the same cost. However, due to the misuse of
data from complex designs, simple designs are often highly preferred and end


up being more cost efficient when costs associated with incorrect decisions are
incorporated.

Pitfall: Although random selections are made at each stage, a common error
is to analyze these types of surveys as if they arose from a simple random sample.
The plots were not independently selected; if a particular cutblock was not
chosen, then none of the plots within that cutblock can be chosen. As in cluster
samples, the consequences of this erroneous analysis are that the estimated
standard errors are too small and do not fully reflect the actual imprecision in
the estimates. A manager will be more confident in the estimate than is justified
by the survey.

Solution: Again, it is important that the analytical methods are suitable for
the sampling design. The proper analysis of multi-stage designs takes into
account that random sampling takes place at each stage (Thompson, 1992, Chapter
13). In many cases, the precision of the estimates is determined essentially by
the number of first stage units selected. Little is gained by extensive sampling
at lower stages.
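The two-stage selection for the vegetation example (12 of 60 clusters at the first stage, then 2 of 8 plots within each chosen cluster) might be sketched as follows; the cluster and plot labels and the seed are hypothetical:

```python
import random

random.seed(3)
M, m = 60, 12          # first stage: clusters in population / clusters sampled
K, k = 8, 2            # second stage: plots per cluster / plots sub-sampled

# Stage 1: simple random sample of clusters
primary = random.sample(range(1, M + 1), m)
# Stage 2: within EACH selected cluster only, a SRS of plots
plan = {c: sorted(random.sample(range(1, K + 1), k)) for c in primary}

n_total = sum(len(v) for v in plan.values())   # 24 plots measured in all
```

Note that the stage-2 draws happen only inside clusters chosen at stage 1, which is precisely why the 24 plots are not a simple random sample of all 480 plots.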

4.2.5 Multi-phase designs

In some surveys, multiple surveys of the same survey units are performed. In the
first phase, a sample of units is selected (usually by a simple random sample).
Every unit is measured on some variable. Then in subsequent phases, samples
are selected ONLY from those units selected in the first phase, not from the
entire population.

For example, refer back to the vegetation survey. An initial sample of 24


plots is chosen in a simple random survey. Aerial flights are used to quickly
measure some characteristic of the plots. A second phase sample of 6 units
(circled below) is then measured using ground based methods.


Multiphase designs are commonly used in two situations. First, it is some-


times difficult to stratify a population in advance because the values of the
stratification variables are not known. The first phase is used to measure the
stratification variable on a random sample of units. The selected units are then
stratified, and further samples are taken from each stratum as needed to mea-
sure a second variable. This avoids having to measure the second variable on
every unit when the strata differ in importance. For example, in the first phase,


plots are selected and measured for the amount of insect damage. The plots are
then stratified by the amount of damage, and second phase allocation of units
concentrates on plots with low insect damage to measure total usable volume of
wood. It would be wasteful to measure the volume of wood on plots with heavy
insect damage.

The second common occurrence is when it is relatively easy to measure a


surrogate variable (related to the real variable of interest) on selected units, and
then in the second phase, the real variable of interest is measured on a subset
of the units. The relationship between the surrogate and desired variable in the
smaller sample is used to adjust the estimate based on the surrogate variable in
the larger sample. For example, managers need to estimate the volume of wood
removed from a harvesting area. A large sample of logging trucks is weighed
(which is easy to do), and weight will serve as a surrogate variable for volume. A
smaller sample of trucks (selected from those weighed) is scaled for volume and
the relationship between volume and weight from the second phase sample is
used to predict volume based on weight only for the first phase sample. Another
example is the count plot method of estimating volume of timber in a stand. A
selection of plots is chosen and the basal area determined. Then a sub-selection
of plots is rechosen in the second phase, and volume measurements are made
on the second phase plots. The relationship between volume and area in the
second phase is used to predict volume from area measurements in the first
phase.

4.2.6 Repeated Sampling

One common objective of long-term studies is to investigate changes over time


of a particular population. This will involve repeated sampling from the popu-
lation. There are three common designs.

First, separate independent surveys can be conducted at each time point.


This is the simplest design to analyze because all observations are independent
over time. For example, independent surveys can be conducted at five year
intervals to assess regeneration of cutblocks. However, precision of the estimated
change may be poor because of the additional variability introduced by having
new units sampled at each time point.

At the other extreme, units are selected in the first survey and the same
units are remeasured over time. For example, permanent study plots can be
established that are remeasured for regeneration over time. The advantage
of permanent study plots is that comparisons over time are free of additional
variability introduced by new units being measured at every time point. One
possible problem is that survey units may become ‘damaged’ over time, and the
sample size will tend to decline over time. An analysis of these types of designs


is more complex because of the need to account for the correlation over time
of measurements on the same sample plot and the need to account for possible
missing values when units become ‘damaged’ and are dropped from the study.

Intermediate to the above two designs are partial replacement designs where
a portion of the survey units are replaced with new units at each time point.
For example, 1/5 of the units could be replaced by new units at each time point
- units would normally stay in the study for a maximum of 5 time periods. The
analysis of these types of designs is very complex.

4.3 Notation

Unfortunately, sampling theory has developed its own notation that is different
from that used for design of experiments or other areas of statistics even
though the same concepts are used in both. It would be nice to adopt a general
convention for all of statistics - maybe in 100 years this will happen.

Even among sampling textbooks, there is no agreement on notation! (sigh).

In the table below, I’ve summarized the “usual” notation used in sampling
theory. In general, large letters refer to population values, while small letters
refer to sample values.

Characteristic       Population value               Sample value

number of elements   N                              n
units                Yi                             yi
total                τ = ΣYi (i = 1..N)             Σyi (i = 1..n)
mean                 µ = (1/N) ΣYi                  ȳ = (1/n) Σyi
proportion           P = τ/N                        p = ȳ (of 0/1 data)
variance             S² = Σ(Yi − µ)²/(N − 1)        s² = Σ(yi − ȳ)²/(n − 1)
variance of a prop   S² = N P(1 − P)/(N − 1)        s² = n p(1 − p)/(n − 1)

Note:

• The population mean is sometimes denoted as Ȳ in many books.


• The population total is sometimes denoted as Y in many books.
• Again note the distinction between the population quantity (e.g. the
population mean µ) and the corresponding sample quantity (e.g. the sample
mean ȳ).

4.4 Simple Random Sampling Without Replace-


ment (SRSWOR)

This forms the basis of many other more complex sampling plans and is the
‘gold standard’ against which all other sampling plans are compared. It often
happens that more complex sampling plans consist of a series of simple random
samples that are combined in a complex fashion.

In this design, once the frame of units has been enumerated, a sample of size
n is selected without replacement from the N population units.

Refer to the previous sections for an illustration of how the units will be
selected.

4.4.1 Summary of main results

It turns out that for a simple random sample, the sample mean (y) is the best
estimator for the population mean (µ). The population total is estimated by
multiplying the sample mean by the POPULATION size. And, a proportion
is estimated by simply coding results as 0 or 1 depending if the sampled unit
belongs to the class of interest, and taking the mean of these 0,1 values. (Yes,
this really does work - refer to a later section for more details).

As with every estimate, a measure of precision is required. We saw in an


earlier chapter that the standard error (se) is such a measure. Recall that the
standard error measures how variable the results of our survey would be if the
survey were to be repeated. The standard error for the sample mean looks very
similar to that for a sample mean from a completely randomized design (refer
to later chapters) with a common correction of a finite population factor (the
(1 − f ) term).

The standard error for the population total estimate is found by multiplying
the standard error for the mean by the POPULATION SIZE.

The standard error for a proportion is found again, by treating each data
value as 0 or 1 and applying the same formula as the standard error for a mean.

The following table summarizes the main results:


Parameter    Population value   Estimator               Estimated se

Mean         µ                  µ̂ = ȳ                  √(s²/n × (1 − f))
Total        τ                  τ̂ = N µ̂ = N ȳ         N × se(µ̂) = N √(s²/n × (1 − f))
Proportion   P                  P̂ = p (mean of 0/1)    √(p(1 − p)/(n − 1) × (1 − f))

Notes:

• Inflation factor The term N/n is called the inflation factor and the
estimator for the total is sometimes called the expansion estimator or the
simple inflation estimator.

• Sampling weight Many statistical packages that analyze survey data


will require the specification of a sampling weight. A sampling weight
represent how many units in the population are represented by this unit
in the sample. In the case of a simple random sample, the sampling weight
is also equal to N/n. For example, if you select 10 units at random from
150 units in the population, the sampling weight for each observation is
15, i.e. each unit in the sample represents 15 units in the population. The
sampling weights are computed differently for various designs so won’t
always be equal to N/n.
• sampling fraction: the term n/N is called the sampling fraction and is
denoted by f

• finite population correction (fpc) the term (1 − f ) is called the finite


population correction factor and reflects that if you sample a substantial
part of the population, the variance of the estimator is smaller than what
would be expected from experimental design results. If f is less than 5%,
this is often ignored.

4.4.2 Estimating the Population Mean

The first line of the above table shows the “basic” results and all the remaining
lines in the table can be derived from this line as will be shown later.

The population mean (µ) is estimated by the sample mean (ȳ). The esti-
mated se of the sample mean is

se(ȳ) = √(s²/n × (1 − f)) = (s/√n) × √(1 − f)

2006
c Carl James Schwarz 28
CHAPTER 4. SAMPLING

Note that if the sampling fraction (f) is small, then the standard error of the
sample mean can be approximated by:

se(ȳ) ≈ √(s²/n) = s/√n

which is the familiar form seen previously. In general, the standard error
formula changes depending upon the sampling method used to collect
the data and the estimator used on the data. Every different sampling
design has its own way of computing the estimator and se.

Confidence intervals for parameters are computed in the usual fashion, i.e.
an approximate 95% confidence interval would be found as: estimator ± 2se.
Some textbooks use a t-distribution for smaller sample sizes, but most surveys
are sufficiently large that this makes little difference.
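The estimator, its standard error with the finite population correction, and the approximate 95% interval can be collected into one small helper (a sketch; the function name and the toy data are my own):

```python
import statistics

def srs_estimates(y, N):
    """Point estimate, se, and approx. 95% CI for a SRSWOR of n units from N."""
    n = len(y)
    f = n / N                                # sampling fraction
    ybar = statistics.mean(y)
    s2 = statistics.variance(y)              # sample variance, divisor n - 1
    se_mean = (s2 / n * (1 - f)) ** 0.5      # includes the fpc term (1 - f)
    return {
        "mean": ybar,
        "se_mean": se_mean,
        "ci_mean": (ybar - 2 * se_mean, ybar + 2 * se_mean),
    }

est = srs_estimates([1, 2, 3, 4, 5], N=100)  # toy data for illustration
```

Because f = 5/100 here, the fpc shrinks the se only slightly; for f below 5% it is commonly ignored, as noted earlier.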

4.4.3 Estimating the Population Total

Many students find this part confusing, because of the term population total.
This does NOT refer to the total number of units in the population, but rather
the sum of the individual values over the units. For example, if you are interested
in estimating total timber volume in an inventory unit, the trees are the sampling
units. A sample of trees is selected to estimate the mean volume per tree. The
total timber volume over all trees in the inventory unit is of interest, not the
total number of trees in the inventory unit.

As the population total is found by N µ (total population size times the pop-
ulation mean), a natural estimator is formed by the product of the population
size and the sample mean, i.e. τ̂ = N ȳ. Note that you must multiply
by the population size, not the sample size.

Its estimated se is found by multiplying the estimated se for the sample
mean by the population size as well, i.e.,

se(τ̂) = N √(s²/n × (1 − f))

In general, estimates for population totals in most sampling designs are found
by multiplying estimates of population means by the population size.

Confidence intervals are found in the usual fashion.


4.4.4 Estimating Population Proportions

A “standard trick” used in survey sampling when estimating a population pro-


portion is to replace the response variable by a 0/1 code and then treat this
coded data in the same way as ordinary data.

For example, suppose you were interested in the proportion of fish in a catch
that was of a particular species. A sample of 10 fish was selected (of course
in the real world, a larger sample would be taken), and the following data were
observed (S=sockeye, C=chum):

S C C S S S S C S S

Of the 10 fish sampled, 3 were chum so that the sample proportion of fish that
were chum is 3/10 = 0.30.

If the data are recoded using 1=Chum, 0=Sockeye, the sample values would
be:

0 1 1 0 0 0 0 1 0 0

The sample average of these numbers gives y = 3/10 = 0.30 which is exactly
the proportion seen.

It is not surprising then that by recoding the sample using 0/1 variables, the
first line in the summary table reduces to the last line in the summary table. In
particular, s2 reduces to np(1 − p)/(n − 1) resulting in the se seen above.

Confidence intervals are computed in the usual fashion.
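The recoding trick can be verified in two lines with the fish data above:

```python
catch = list("SCCSSSSCSS")                           # the 10 fish from the example
coded = [1 if fish == "C" else 0 for fish in catch]  # 1 = chum, 0 = sockeye

p = sum(coded) / len(coded)   # the sample mean of the 0/1 values
print(p)                      # 0.3, exactly the sample proportion of chum
```

Because the mean of 0/1 data is the proportion of 1s, every formula for means (estimate, se, confidence interval) applies unchanged.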

4.4.5 Example - estimating total catch of fish in a recre-


ational fishery

This will illustrate the concepts in the previous sections using a very small
illustrative example.

For management purposes, it is important to estimate the total catch by


recreational fishers. Unfortunately, there is no central registry of fishers, nor
is there a central reporting station. Consequently, surveys are often used to
estimate the total catch.


There are two common survey designs used in these types of surveys (generi-
cally called creel surveys). In access surveys, observers are stationed at access
points to the fishery. For example, if fishers go out in boats to catch the fish, the
access points are the marinas where the boats are launched and are returned.
From these access points, a sample of fishers is selected and interviews con-
ducted to measure the number of fish captured and other attributes. Roving
surveys are commonly used when there is no common access point and you
can move among the fishers. In this case, the observer moves about the fishery
and questions anglers as they are encountered. Note that in this last design,
the chances of encountering an angler are no longer equal - there is a greater
chance of encountering an angler who has a longer fishing episode. And, you
typically don’t encounter the angler at the end of the episode but somewhere
in the middle of the episode. The analysis of roving surveys is more complex -
seek help. The following example is based on a real life example from British
Columbia. The actual survey is much larger involving several thousand anglers
and sample sizes in the low hundreds, but the basic idea is the same.

An access survey was conducted to estimate the total catch at a lake in


British Columbia. Fortunately, access to the lake takes place at a single landing
site and most anglers use boats in the fishery. An observer was stationed at the
landing site, but because of time constraints, could only interview a portion of
the anglers returning, but was able to get a total count of the number of fishing
parties on that day. A total of 168 fishing parties arrived at the landing during
the day, of which 30 were sampled. The decision to sample a fishing party was
made using a random number table as the boat returned to the dock.

The objectives are to estimate the total number of anglers and their catch
and to estimate the proportion of boat trips (fishing parties) that had sufficient
life-jackets for the members on the trip. Here is the raw data - each line is the


results for a fishing party:

Number      Party   Sufficient
Anglers     Catch   Life Jackets?
1 1 yes
3 1 yes
1 2 yes
1 2 no
3 2 no
3 1 yes
1 0 no
1 0 no
1 1 yes
1 0 yes
2 0 yes
1 1 yes
2 0 yes
1 2 yes
3 3 yes
1 0 no
1 0 yes
2 0 yes
3 1 yes
1 0 yes
2 0 yes
1 1 yes
1 0 yes
1 0 yes
1 0 no
2 0 yes
2 1 no
1 1 no
1 0 yes
1 0 yes

What is the population of interest?

The population of interest is NOT the fish in the lake. The Fisheries Department
is not interested in estimating the characteristics of the fish, such as mean fish
weight or the number of fish in the lake. Rather, the focus is on the anglers and
fishing parties. Refer to the FAQ at the end of the chapter for more details.

It would be tempting to conclude that the anglers on the lake are the popula-
tion of interest. However, note that information is NOT gathered on individual


anglers. For example, the number of fish captured by each angler in the party is
not recorded - only the total fish caught by the party. Similarly, it is impossible
to say if each angler had an individual life jacket - if there were 3 anglers in the
boat and only two life jackets, which angler was without? 1

For this reason, the population of interest is taken to be the set of boats
fishing at this lake. The fisheries agency doesn’t really care about the individual
anglers because if a boat with 3 anglers catches one fish, the actual person who
caught the fish is not recorded. Similarly, if there are only two life jackets, does
it matter which angler didn’t have the jacket?

Under this interpretation, the design is a simple random sample of boats


returning to the landing.

What is the frame?

The frame for a simple random sample is a listing of ALL the units in the pop-
ulation. This list is then used to randomly select which units will be measured.
In this case, there is no physical list and the frame is conceptual. A random
number table was used to decide which fishing parties to interview.

What is the sampling design and sampling unit?

The sampling design will be treated as if it were a simple random sample from
all boats (fishing parties) returning, but in actual fact was likely a systematic
sample or variant. As you will see later, this may or may not be a problem.

In many cases, special attention should be paid to identify the correct sam-
pling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were
selected, not individual anglers. This mistake is often made when the data are
presented on an individual basis rather than on a sampling unit basis. As you
will see in later chapters, this is an example of pseudo-replication.
1 If data were collected on individual anglers, then the anglers could be taken as the
population of interest. However, in this case, the design is NOT a simple random sample of
anglers. Rather, the design is a cluster sample where a simple random sample of clusters
(boats) was taken and all members of the cluster (the anglers) were interviewed. As you will
see later in the course, a cluster sample can be viewed as a simple random sample if you
define the population in terms of clusters.


Excel analysis

As mentioned earlier, Excel should be used with caution in statistical analysis.


However, for very simple surveys, it is an adequate tool.

A copy of a sample Excel workbook called creel.xls is available from the


Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/
Notes/MyPrograms.

Here is a condensed view of the spreadsheet within the workbook:


The analysis proceeds in a series of logical steps as illustrated for the number
of anglers in each party variable.

Enter the data on the spreadsheet

The metadata (information about the survey) is entered at the top of the spread-
sheet.

The actual data is entered in the middle of the sheet. One row is used for
each angling party, listing the variables recorded.

Obtain the required summary statistics.

At the bottom of the data, the summary statistics needed are computed using
the Excel built-in functions. This includes the sample size, the sample mean,
and the sample standard deviation.

Obtain estimates of the population quantity

Because the sample mean is the estimator for the population mean if the
design is a simple random sample, no further computations are needed.

In order to estimate the total number of anglers, we multiply the average
number of anglers in each fishing party (1.533 anglers/party) by the POPULA-
TION SIZE (the number of fishing parties for the entire day = 168) to get the
estimated total number of anglers (257.6).

Obtain estimates of precision - standard errors

The se for the sample mean is computed using the formula presented earlier.
The estimated standard error OF THE MEAN is 0.128 anglers/party.

Because we found the estimated total by multiplying the estimated mean
number of anglers/boat trip by the number of boat trips (168), the
estimated standard error of the POPULATION TOTAL is found by multiplying
the standard error of the sample mean by the same factor: 0.128 x 168 = 21.5
anglers.

Hence, a 95% confidence interval for the total number of anglers fishing this
day is found as 257.6 ± 2(21.5).
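These hand computations can be sketched in a few lines of Python (a sketch only, not part of the original workbook; the angler counts are the 30 sampled values given later in the data listing):

```python
import math

# Number of anglers in each of the 30 sampled parties (N = 168 parties in total)
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 2, 1, 3,
           1, 1, 2, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1]
N, n = 168, len(anglers)

mean = sum(anglers) / n                               # 1.533 anglers/party
s2 = sum((x - mean) ** 2 for x in anglers) / (n - 1)  # sample variance
se_mean = math.sqrt(s2 / n) * math.sqrt(1 - n / N)    # 0.128, with finite population correction

total = N * mean            # estimated total: 257.6 anglers
se_total = N * se_mean      # about 21.6
ci = (total - 2 * se_total, total + 2 * se_total)     # approximate 95% confidence interval
```

The same pattern (expand the mean and its se by the population size) applies to any SRSWOR estimate of a total.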

Estimating total catch

A similar procedure is followed in the next column to estimate the total catch.

Estimating proportion of parties with sufficient life-jackets


First, the character values yes/no are translated into 0,1 variables using the IF
statement of Excel.

Then the EXACT same formula as used for estimating the total number of
anglers or the total catch is applied to the 0,1 data!

We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4
percentage points.
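The same recipe applied to 0/1 coded data gives the proportion and its se; a minimal sketch (22 yes and 8 no responses, as in the survey):

```python
import math

# 1 = party had sufficient life-jackets, 0 = not (22 yes, 8 no)
lifej = [1] * 22 + [0] * 8
N, n = 168, len(lifej)

p = sum(lifej) / n                               # 0.733
s2 = sum((x - p) ** 2 for x in lifej) / (n - 1)  # equals n/(n-1) * p * (1-p)
se_p = math.sqrt(s2 / n) * math.sqrt(1 - n / N)  # 0.074, i.e. 7.4 percentage points
```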

SAS analysis

SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of
the sample SAS program called creel.sas and the output called creel.lst are avail-
able from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.

Here is the SAS program:

/* Simple Random Sample - Creel survey */

title ’Creel Survey - Simple Random Sample’;


options nodate nonumber nocenter noovp linesize=75;

/*
For management purposes, it is important to estimate the
total catch by recreational fishers.
Unfortunately, there is no central registry of fishers, nor
is there a central reporting station.
Consequently, surveys are often used to estimate the total
catch.

An access survey was conducted to estimate the total catch at


a lake in British Columbia. Fortunately, access to the lake
takes place at a single landing site and most anglers use
boats in the fishery. An observer was stationed at the
landing site, but because of time constraints, could only
interview a portion of the parties returning, but was able to
get a total count of the number of parties fishing on that
day. A total of 168 boats (fishing parties)
arrived at the landing during the
day, of which 30 were sampled. The decision to sample an
party was made using a random number table as the boats
returned.


The objectives are to estimate the total number of anglers


and their catch and to estimate the proportion of boat trips
that had sufficient life-jackets for the members on the trip.
Here is the raw data. */

data creel; /* read in the survey data */


input angler catch lifej $;
enough = 0;
if lifej = ’yes’ then enough = 1;
datalines;
1 1 yes
3 1 yes
1 2 yes
1 2 no
3 2 no
3 1 yes
1 0 no
1 0 no
1 1 yes
1 0 yes
2 0 yes
1 1 yes
2 0 yes
1 2 yes
3 3 yes
1 0 no
1 0 yes
2 0 yes
3 1 yes
1 0 yes
2 0 yes
1 1 yes
1 0 yes
1 0 yes
1 0 no
2 0 yes
2 1 no
1 1 no
1 0 yes
1 0 yes
;;;;

proc print data=creel;


title2 ’raw data’;


/* add the sampling weights to the data set. The sampling


weights are defined as N/n for an SRSWOR */

data creel;
set creel;
sampweight = 168/30;

proc surveymeans data=creel


total=168 /* total population size */
mean clm /* find estimates of mean, its se, and a 95% confidence interval */
sum clsum /* find estimates of total,its se, and a 95% confidence interval */
;
var angler catch lifej ; /* estimate mean and total for numeric variables, proportions for character variables */
weight sampweight;

/* Note that it is not necessary to use the coded 0/1 variables in this procedure */
run;

The program starts with the metadata so that the purpose of the program
and how the data were collected etc are not lost.

The first section of code reads the data and computes the 0,1 variable from
the life-jacket information. The data is listed so that it can be verified that it
was read correctly.

Most programs for dealing with survey data require that sampling weights
be available for each observation. A sampling weight is the weighting factor
representing how many units in the population this observation represents. In
this case, each of the 30 parties represents 168/30 = 5.6 parties in the population.
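A minimal sketch (not part of the SAS program) of how the sampling weight expands the sample to the population:

```python
# Each sampled party stands in for 168/30 = 5.6 parties in the population,
# so the weighted sum of a variable estimates its population total.
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 2, 1, 3,
           1, 1, 2, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1]
sampweight = 168 / 30
est_total = sum(sampweight * x for x in anglers)  # 257.6 anglers, matching SURVEYMEANS
```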

Finally, the SURVEYMEANS procedure is used to estimate the quantities
of interest. It is not necessary to code any formulas as these are built into
SAS. So how does the SAS program know this is a simple random
sample? This is the default analysis - more complex designs require additional
statements (e.g. a CLUSTER statement) to indicate a more complex design.
As well, equal sampling weights indicate that all items were selected with equal
probability.

Here is the SAS output


Creel Survey - Simple Random Sample


raw data

Obs angler catch lifej enough

1 1 1 yes 1
2 3 1 yes 1
3 1 2 yes 1
4 1 2 no 0
5 3 2 no 0
6 3 1 yes 1
7 1 0 no 0
8 1 0 no 0
9 1 1 yes 1
10 1 0 yes 1
11 2 0 yes 1
12 1 1 yes 1
13 2 0 yes 1
14 1 2 yes 1
15 3 3 yes 1
16 1 0 no 0
17 1 0 yes 1
18 2 0 yes 1
19 3 1 yes 1
20 1 0 yes 1
21 2 0 yes 1
22 1 1 yes 1
23 1 0 yes 1
24 1 0 yes 1
25 1 0 no 0
26 2 0 yes 1
27 2 1 no 0
28 1 1 no 0
29 1 0 yes 1
30 1 0 yes 1

The SURVEYMEANS Procedure

Data Summary

Number of Observations 30
Sum of Weights 168


Class Level Information

Class
Variable Levels Values

lifej 2 no yes

Statistics

Std Error Lower 95% Upper 95%


Variable Mean of Mean CL for Mean CL for Mean
-------------------------------------------------------------------------
angler 1.533333 0.128419 1.270686 1.795980
catch 0.666667 0.139688 0.380972 0.952362
lifej=no 0.266667 0.074425 0.114450 0.418884
lifej=yes 0.733333 0.074425 0.581116 0.885550
-------------------------------------------------------------------------

Statistics

Lower 95% Upper 95%


Variable Sum Std Dev CL for Sum CL for Sum
-------------------------------------------------------------------------
angler 257.600000 21.574442 213.475312 301.724688
catch 112.000000 23.467659 64.003248 159.996752
lifej=no 44.800000 12.503462 19.227550 70.372450
lifej=yes 123.200000 12.503462 97.627550 148.772450
-------------------------------------------------------------------------

All of the results match those from the Excel spreadsheet.

JMP Analysis

Unfortunately, while JMP excels (excuse the pun!) in the analysis of experi-
mental data, it is a bit clumsy to analyze survey data using JMP (future
versions of JMP will include survey sampling modules). There are
two deficiencies:

• There is no way to specify the finite population correction (the 1 − f )
that is applied to standard errors. Fortunately, in many ecological exper-
iments, the sampling fraction f is very close to 0, the finite population
correction is very close to 1, and so there is little effect. In any case,
the standard error reported by JMP will be slightly too large, which is
conservative.
• It is not easy to take the results from the analysis and use them in future
computations. For example, the estimated total is found by multiplying
the estimated mean by the population size - this is usually done by hand
outside of JMP.

Obtain the required summary statistics.

JMP assumes, unless you specify otherwise, that the data are collected from
a simple random sample. This matches the design of the angler survey so JMP
can be used directly.

The data are entered into a JMP spreadsheet directly. A copy of the
JMP data file is called creel.jmp and is available from the Sample Program Li-
brary at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
There is no need to code the categorical variable corresponding to sufficient life-
jackets. Be sure that the angler and catch variables are continuously scaled and
that EnoughLifeJackets is nominally scaled:

The basic summary statistics are found using the Analyze->Distribution
platform:


All three variables can be specified simultaneously and JMP will use the scale
of the variables to decide which statistics to compute.

This gives summary output as shown below:


The display can be improved by converting to a stacked setting (use the Stacked
option in the red-triangle near the Distribution header):


removing the quantile information and the histograms (use the red triangles for
each data variable to remove the display):

and asking for standard errors and confidence intervals for the proportion (right
click in the table of proportions and ask for the appropriate columns):

The final output is:


Estimating the average number of anglers/boat

The estimates are read directly from the output. The estimated average number
of anglers per boat is 1.53 with an estimated se of .14. Notice that the se is
slightly larger than the se reported by Excel - this is a result of not applying
a finite population correction.

Estimating the total catch of fish over all boats

The estimated average catch per boat is read directly above and is 0.667 fish/boat
with a se of .15 fish/boat. To estimate the total catch over all 168 boats, we
multiply both the mean catch/boat and the se of the catch/boat by 168. This
gives an estimated total catch of 112 fish (se 26). [Again, the standard
error is slightly larger than that reported by Excel because the finite population
correction factor was not applied.]

Estimating the proportion of parties with sufficient life jackets

There is no need to code the categorical variable as was done in Excel. Reading
directly from the above output, we estimate that 73% of parties had sufficient
life jackets with a se of .08 (or a se of 8 percentage points). [Again the se
is slightly larger than that reported by Excel because of the lack of a finite
population correction factor.]

What to do if you want to do everything in JMP

It is possible to do the entire analysis in JMP, including multiplying by the
population size and applying the finite population correction factor. This will
be illustrated by estimating the total number of anglers and their catch over
all 168 boats.

For estimating the total number of anglers and their catch, we need the sam-
ple size, the average over the sample and the standard deviation over the sample.
This can be done using the Tables->Summary pop-down menu. Complete the
dialogue box as shown:

This will create a new summary table.


At this point, it may be easier to simply use the summary statistics to


compute the relevant quantities by hand rather than actually programming the
equations in JMP.

We continue by creating new columns to estimate the se for the mean. In
this column we create a formula to estimate the standard error using JMP’s
formula editor. The final result is:

To estimate the total number of anglers and its se, multiply both the estimated
mean number of anglers/boat trip and its se by the number of boat trips (168).
The estimated total number of anglers is 1.53333 x 168 = 257.6 with an
estimated standard error of 0.128 x 168 = 21.5.

A similar procedure is followed to estimate the total catch.

We are also interested in estimating the proportion of BOATS that have a
sufficient number of life-jackets for passengers.

We first transform the yes/no responses into 1/0 using a formula box, and
then repeat the same summary steps as for the mean number of anglers, giving:

We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4
percentage points.


4.5 Sample size determination for a simple random sample

I cannot emphasize too strongly the importance of planning in advance of the
survey.

There are many surveys where the results are disappointing. For example,
a survey of anglers may show that the mean catch per angler is 1.3 fish but
that the standard error is .9 fish. In other words, a 95% confidence interval
stretches from 0 to over 3 fish per angler, something that is known with
near certainty even before the survey was conducted. In many cases, a back of
the envelope calculation would have shown that the precision obtained from a
survey would be inadequate at the proposed sample size even before the survey
was started.

In order to determine the appropriate sample size, you will first need to
specify the precision that must be achieved. For example,
a policy decision may require that the results be accurate to within 5% of the
true value.

This precision requirement usually occurs in one of two formats:

• an absolute precision, i.e. you wish to be 95% confident that the sample
mean will not vary from the population mean by a pre-specified amount.
For example, a 95% confidence interval for the total number of fish cap-
tured should be ± 1,000 fish.
• a relative precision, i.e. you wish to be 95% confident that the sample
mean will be within 10% of the true mean.

The latter is more common than the former, but the two forms are equivalent and
interchangeable. For example, if the actual estimate is around 200, with a se of
about 50, then the 95% confidence interval is ± 100 and the relative precision
is within 50% of the true answer (± 100 / 200). Conversely, a 95% confidence
interval that is within ± 40% of the estimate of 200 turns out to be ± 80 (40%
of 200), and consequently, the se is around 40 (= 80/2).
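The arithmetic in this paragraph can be written out explicitly (a sketch of the conversion only, using the same numbers):

```python
# Relative -> absolute: within 40% of an estimate of 200, 19 times out of 20
est = 200
half_width = 0.40 * est   # the 95% confidence interval is +/- 80
se = half_width / 2       # so the se is around 40

# Absolute -> relative: estimate 200 with se about 50
rel_precision = 2 * 50 / est   # the 95% interval is +/- 50% of the estimate
```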

A common question is:

What is the difference between se/est and 2se/est? When is the
relative standard error divided by 2? Does se/est have anything to
do with a 95% ci?


Precision requirements are stated in different ways (replace blah below by
mean/total/proportion etc).

Expression                                             Mathematics
- within xxx of the blah                               se = xxx
- margin of error of xxx                               2se = xxx
- within xxx of the true value 19 times out of 20      2se = xxx
- within xxx of the true value 95% of the time         2se = xxx
- the width of the 95% confidence interval is xxx      4se = xxx
- within 10% of the blah                               se/est = .10
- a rse of 10%                                         se/est = .10
- a relative error of 10%                              se/est = .10
- within 10% of the blah 95% of the time               2se/est = .10
- within 10% of the blah 19 times out of 20            2se/est = .10
- margin of error of 10%                               2se/est = .10
- width of 95% confidence interval = 10% of the blah   4se/est = .10

As a rough rule of thumb, the following are often used as survey precision
guidelines:

• For preliminary surveys, the 95% confidence interval should be ± 50% of
the estimate.
• For management surveys, the 95% confidence interval should be ± 25% of
the estimate.

• For scientific work, the 95% confidence interval should be ± 10% of the
estimate.

Next, some preliminary guess for the standard deviation of individual items
in the population (S) needs to be taken along with an estimate of the population
size (N ) and possibly the population mean (µ) or population total (τ ). These
are not too crucial and can be obtained by:

• taking a pilot study.

• previous sampling of similar populations


• expert opinion


A very rough estimate of the standard deviation can be found by taking the
usual range of the data/4. If the population proportion is unknown, the value
of 0.5 is often used as this leads to the largest sample size requirement as a
conservative guess.

These are then used with the formulae for the confidence interval to deter-
mine the relevant sample size. Many text books have complicated formulae to
do this - it is much easier these days to simply code the formulae in a spreadsheet
(see examples) and use either trial and error to find an appropriate sample size,
or use the “GOAL SEEKER” feature of the spreadsheet to find the appropriate
sample size. This will be illustrated in the example.

The final numbers are not to be treated as the exact sample size but more
as a guide to the amount of effort that needs to be expended.

If more than one item is being surveyed, these calculations must be done
for each item. The largest sample size needed is then chosen. This may lead
to conflict in which case some response items must be dropped or a different
sampling method must be used for this other response variable.

Precision essentially depends only on the absolute sample size, not on the
relative fraction of the population sampled. For example, a sample of
1000 people taken from Canada (population of 33,000,000) is just as precise as
a sample of 1000 people taken from the US (population of 333,000,000)! This is
highly counter-intuitive and will be explored more in class.
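This can be checked numerically; a sketch using the usual se of a proportion under SRSWOR (p = 0.4 as an illustrative planning value):

```python
import math

def se_proportion(p, n, N):
    """se of a sample proportion under SRSWOR (planning approximation p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt(1 - n / N)

se_canada = se_proportion(0.40, 1000, 33_000_000)
se_us = se_proportion(0.40, 1000, 333_000_000)
# Both ses are about 0.0155; the population size is essentially irrelevant
# because the finite population correction is almost exactly 1 in both cases.
```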

4.5.1 Example - How many anglers to survey

We wish to repeat the angler creel survey next year.

• How many angling-parties should be interviewed to be 95% confident of
being within 10% of the true mean catch?
• What sample size would be needed to estimate the proportion of boats
within 3 percentage points 19 times out of 20? In this case we are asking
that the 95% confidence interval be ±0.03 or that the se = 0.015.

The sample size spreadsheet is available in an Excel workbook called
SurveySampleSize.xls which can be downloaded from the Sample Program Library
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a condensed view of the spreadsheet:


First note that the computations for sample size require some PRIOR in-
formation about population size, the population mean, or the population pro-
portion. We will use information from the previous survey to help plan future
studies.

For example, about 168 boats returned to the landing last year. The mean catch
per angling party was about .667 fish/boat. The standard deviation of the catch
per party was .844. These values are entered in the spreadsheet in column C.

A preliminary sample size of 40 (in green in Column C) was tried. This
led to a 95% confidence interval of ± 35% which did not meet the precision
requirements.

Now vary the sample size (in green) in column C until the 95% confidence
interval (in yellow) is below ± 10%. You will find that you will need to interview
almost 135 parties - a very high sampling fraction indeed. The problem for this
variable is the very high variation of individual data points.

If you are familiar with Excel, you can use the Goal Seeker function to speed
the search.

Similarly, the proportion of boats with sufficient life-jackets last year was around
73%. Enter this in the blue areas of Column E. The initial sample size of 20
is too small as the 95% confidence interval is ± .186 (18.6 percentage points).
Now vary the sample size (in green) until the 95% confidence interval is ± .03.
Note that you need to be careful in dealing with percentages - confidence limits
are often specified in terms of percentage points rather than percents to avoid
problems where percents are taken of percents. This will be explained further
in class.

Try using the spreadsheet to compare the precision of a poll of 1000 people
taken from Canada (population 33,000,000) and 1000 people taken from the US
(population 330,000,000) if both polls have about 40% in favor of some issue.

Technical notes

If you really want to know how the sample size numbers are determined,
here is the lowdown.

Suppose that you wish to be 95% sure that the sample mean is within 10%
of the true mean.
We must solve $z \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N}} \leq \varepsilon\mu$ for $n$, where $z$ is the
multiplier for a particular confidence level (for a 95% c.i. use $z = 2$) and $\varepsilon$ is the
‘closeness’ factor (in this case $\varepsilon = 0.10$).

Rearranging this equation gives $n = \frac{N}{1 + N\left(\frac{\varepsilon\mu}{zS}\right)^2}$.
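A sketch that evaluates the sample-size formula above with the creel-survey planning values reproduces the "almost 135 parties" found earlier by trial and error in the spreadsheet:

```python
import math

# Planning values from last year's creel survey (illustrative check only):
# N = 168 parties, mean catch mu ~ 0.667 fish/party, standard deviation S ~ 0.844
N, mu, S = 168, 0.667, 0.844
z, eps = 2, 0.10   # 95% confidence, within 10% of the true mean

n = N / (1 + N * (eps * mu / (z * S)) ** 2)   # about 133.1 -> interview ~134 parties
```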

4.6 Systematic sampling

Sometimes, logistical considerations make a true simple random sample
inconvenient to administer. For example, in the previous creel survey, a true
random sample would require that a random number be generated for each boat
returning to the marina. In such cases, a systematic sample could be used to
select elements. For example, every 5th angler could be selected after a random
starting point.

4.6.1 Advantages of systematic sampling

The main advantages of systematic sampling are:

• it is easier to draw units because only one random number is chosen


• if a sampling frame is not available but there is a convenient method of
selecting items, e.g. the creel survey where every 5th angler is chosen.
• easier instructions for untrained staff
• if the population is in random order relative to the variable being mea-
sured, the method is equivalent to a SRS. For example, it is unlikely
that the number of anglers in each boat changes dramatically over the
period of the day. This is an important assumption that should be
investigated carefully in any real life situation!
• it distributes the sample more evenly over the population. Consequently
if there is a trend, you will get items selected from all parts of the trend.

4.6.2 Disadvantages of systematic sampling

The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases,
estimates of means and variances may be severely biased! See Section 4.2.2
for a detailed discussion.


• Without making an assumption about the distribution of population units,
there is no estimate of the standard error. This is an important disad-
vantage of a systematic sample! Many studies very casually make the
assumption that the systematic sample is equivalent to a simple random
sample without much justification for this.

4.6.3 How to select a systematic sample

There are several methods, depending on whether you know the population size, etc.
Suppose we need to choose every kth record, where k is chosen to meet sample
size requirements (an example of choosing k will be given in class). All of the
following methods are equivalent if k divides N exactly. These are the two most
common methods.

• Method 1 Choose a random number j from 1 · · · k. Then choose records j,
j + k, j + 2k, · · · . One problem is that different samples may be
of different sizes - an example will be given in class where k doesn’t divide
N exactly. This causes problems in sampling theory, but not too much of
a problem if n is large.

• Method 2 Choose a random number from 1 · · · N . Choose every kth item
and continue in a circle when you reach the end until you have selected
n items. This will always give you the same sized sample; however, it
requires knowledge of N .
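Both methods can be sketched as short functions (illustrative only; the function names are mine, and the random starts are fixed here so the difference in behaviour is visible):

```python
# Method 1: random start j in 1..k, then every k-th record thereafter.
def systematic_method1(j, k, N):
    return list(range(j, N + 1, k))

# Method 2: random start anywhere in 1..N, every k-th record, wrapping in a
# circle until exactly n records are chosen.
def systematic_method2(start, k, N, n):
    return [((start - 1 + i * k) % N) + 1 for i in range(n)]

# With N = 23 and k = 5, Method 1 can give samples of different sizes:
a = systematic_method1(3, 5, 23)      # [3, 8, 13, 18, 23] -- 5 records
b = systematic_method1(4, 5, 23)      # [4, 9, 14, 19]     -- only 4 records
# Method 2 always returns exactly n records:
c = systematic_method2(21, 5, 23, 5)  # [21, 3, 8, 13, 18]
```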

4.6.4 Analyzing a systematic sample

Most surveys casually assume that the population has been sorted in random
order when the systematic sample was selected and so treat the results as if
they had come from a SRSWOR. This is theoretically not correct and if your
assumption is false, the results may be biased, and there is no way of examining
the biases from the data at hand.

Before implementing or analyzing a systematic survey, please consult with an
expert in sampling theory to avoid problems. This is a case where an hour or
two of consultation before spending lots of money could potentially turn a
survey where nothing can be estimated into a survey that has justifiable
results.


4.6.5 Technical notes - Repeated systematic sampling

To avoid many of the potential problems with systematic sampling, a common
device is to use repeated systematic samples on the same population.

For example, rather than taking a single systematic sample of size 100 from
a population, you can take 4 systematic samples (with different starting points)
of size 25.

An empirical method of obtaining a variance estimator from a systematic
sample is to use repeated systematic sampling. Rather than choosing one
systematic subsample of every kth unit, choose m independent systematic sub-
samples of size n/m. Then estimate the mean of each sub-systematic sample.
Treat these means as a simple random sample from the population of possible
systematic samples and use the usual sampling theory. The variation among
the sub-systematic samples provides an estimate of the sampling variance. This
will be illustrated in an example.

Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the
number of breeding pairs of ducks. One breeding area has been divided into
approximately 1000 transects of a certain width, i.e. the breeding area was
divided into 1000 strips.

What is the population of interest? As noted in class, the definition of a
population depends, in part, upon the interest of the researcher. Two possible
definitions are:

• The population is the set of individual ducks on the study area. However,
no frame exists for the individual birds. But a frame can be constructed
based on the 1000 strips that cover the study area. In this case, the design
is a cluster sample, with the clusters being strips.
• The population consists of the 1000 strips that cover the study area and
the number of ducks in each strip is the response variable. The design is
then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly
the same.

Approximately 100 of the transects are flown by an aircraft and spotters on
the aircraft count the number of breeding pairs visible from the aircraft.


For administrative convenience, it is easier to conduct systematic sampling.
However, there is structure to the data; it is well known that ducks do not
spread themselves randomly throughout the breeding area. After discussions
with our Statistical Consulting Service, the researchers flew 10 sets of replicated
systematic samples; each set consisted of 10 transects. As each transect is
flown, the scientists also classify each transect as ‘prime’ or ‘non-prime’ breeding
habitat.

Here is the raw data reporting the number of nests in each set of 10 transects:

          Prime Habitat    Non-Prime Habitat      ALL       Prime    Non-prime
   Set    Total (b)   n     Total        n     Total (a)   mean (c)  mean (d)   Diff (e)
     1        123     3       345        7        468        41.0      49.3       -8.3
     2         57     2        36        8         93        28.5       4.5       24.0
     3         85     5        46        5        131        17.0       9.2        7.8
     4         97     2       131        8        228        48.5      16.4       32.1
     5         34     5        43        5         77         6.8       8.6       -1.8
     6         85     3        67        7        152        28.3       9.6       18.8
     7         56     7        64        3        120         8.0      21.3      -13.3
     8         46     2        65        8        111        23.0       8.1       14.9
     9         37     4        43        6         80         9.3       7.2        2.1
    10         93     2       104        8        197        46.5      13.0       33.5

   Avg       71.3                               165.7                             10.97
     s       29.5                               117.0                             16.38
     n         10                                  10                                10

Est total    7130                               16570                  Est mean   10.97
Est se        885                                3510                  Est se      4.91

Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a)
above). The total number of nests in the breeding area for all types of
habitat is of interest. Column (a) in the above table is the data that will
be used. It represents the total number of nests in the 10 transects of each
set.
The principle behind the estimator is that the 1000 total transects can be
divided into 100 sets of 10 transects, of which a random sample of size
10 was chosen. The sampling unit is the set of transects – the individual
transects are essentially ignored.
Note that this method assumes that the systematic samples are all of the
same size. If the systematic samples had been of different sizes (e.g. some
sets had 15 transects, other sets had 5 transects), then a ratio-estimator
(see later sections) would have been a better estimator.

• compute the total number of nests for each set. This is found in
column (a).
• Then the sets selected are treated as a SRSWOR sample of size 10
from the 100 possible sets. An estimate of the mean number of
nests per set of 10 transects is found as: $\hat{\mu} = (468 + 93 + \cdots +
197)/10 = 165.7$ with an estimated se of $se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} =
\sqrt{\frac{117.0^2}{10}\left(1 - \frac{10}{100}\right)} = 35.1$
• The average number of nests per set is expanded to cover all 100 sets:
$\hat{\tau} = 100\hat{\mu} = 16570$ and $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 3510$

2. Total number of nests in the prime habitat only (refer to column (b)
above). This is formed in exactly the same way as the previous estimate.
This is technically known as estimation in a domain. The number of
elements in the domain in the whole population (i.e. how many of the
1000 transects are in prime-habitat) is unknown but is not needed. All
that you need is the total number of nests in prime habitat in each set –
you essentially ignore the non-prime habitat transects within each set.
The average number of nests per set in prime habitats is found as
before: $\hat{\mu} = \frac{123 + \cdots + 93}{10} = 71.3$ with an estimated se of $se(\hat{\mu}) =
\sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{29.5^2}{10}\left(1 - \frac{10}{100}\right)} = 8.85$.

• because there are 100 sets of transects in total, the estimate of the
population total number of nests in prime habitat and its estimated
se is $\hat{\tau} = 100\hat{\mu} = 7130$ with $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 885$
• Note that the total number of transects of prime habitat is not known
for the population and so an estimate of the density of nests in prime
habitat cannot be computed from this estimated total. However, a
ratio-estimator (see later in the notes) could be used to estimate the
density.

3. Difference in mean density between prime and non-prime habitats
The scientists suspect that the density of nests is higher in prime
habitat than in non-prime habitat. Is there evidence of this in the data?
(refer to columns (c)-(e) above). Here everything must be transformed to
the density of nest per transect (assuming that the transects were all the
same size). Also, pairing (refer to the section on experimental design) is
taking place so a difference must be computed for each set and the dif-
ferences analyzed, rather than trying to treat the prime and non-prime
habitats as independent samples.
Again, this is an example of what is known as domain-estimation.


• Compute the domain means for type of habitat for each set (columns
(c) and (d)). Note that the totals are divided by the number of
transects of each type in each set.
• Compute the difference in the means for each set (column (e))
• Treat this difference as a simple random sample of size 10 taken from
the 100 possible sets of transects. What does the final estimated
mean difference and se imply?
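The estimates in this example can be reproduced with a short sketch (illustrative only, using the set totals from column (a) and the per-set differences from column (e) of the table above):

```python
import math

def srs_mean_se(values, n_pop):
    """Mean and se under SRSWOR when `values` is a sample from n_pop units."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)
    se = math.sqrt(s2 / n) * math.sqrt(1 - n / n_pop)
    return mean, se

# Estimator 1: total nests, from the 10 set totals (100 possible sets of transects)
set_totals = [468, 93, 131, 228, 77, 152, 120, 111, 80, 197]
mean, se = srs_mean_se(set_totals, 100)   # 165.7 nests/set, se 35.1
total, se_total = 100 * mean, 100 * se    # 16570 nests, se about 3510

# Estimator 3: paired difference in mean density, prime minus non-prime
diffs = [-8.3, 24.0, 7.8, 32.1, -1.8, 18.8, -13.3, 14.9, 2.1, 33.5]
mean_d, se_d = srs_mean_se(diffs, 100)    # about 10.98 nests/transect, se about 4.91
```

The estimated difference is roughly twice its se above zero, which is (weak) evidence that density is higher in prime habitat.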

4.7 Stratified simple random sampling

A simple modification to a simple random sample can often lead to dramatic
improvements in precision. This is known as stratification. All survey meth-
ods can potentially benefit from stratification (also known as blocking in the
experimental design literature).

Stratification will be beneficial whenever variability in the response variable
among the survey units can be anticipated and strata can be formed that are
more homogeneous than the original set of survey units.

All stratified designs will have the same basic steps as listed below regardless
of the underlying design.

• Creation of strata. Stratification begins by grouping the survey units
into homogeneous groups (strata) where survey units within strata should
be similar and strata should be different. For example, suppose you wished
to estimate the density of animals. The survey region is divided into a
large number of quadrats based on aerial photographs. The quadrats can
be stratified into high and low quality habitat because it is thought that
the density within the high quality quadrats may be similar but different
from the density in the low quality habitats. The strata do not have to
be physically contiguous – for example, the high quality habitats could
be scattered through out the survey region and can be grouped into one
single stratum.
• Determine total sample size. Use the methods in previous sections to
determine the total sample size (number of survey units) to select. At this
stage, some sort of “average” standard deviation will be used to determine
the sample size.
• Allocate effort among the strata. There are several ways to allocate
the total effort among the strata.


– In equal allocation, the total effort is split equally among all strata.
Equal allocation is preferred when equally precise estimates are re-
quired for each stratum.³
– In proportional allocation, the total effort is allocated to the strata
in proportion to stratum importance. Stratum importance could be
related to stratum size (e.g. when allocating effort among the U.S.
and Canada, because the U.S. is 10 times larger than Canada,
more effort should be allocated to surveying the U.S.). But if density
is your measure of importance, allocate more effort to higher den-
sity strata. Proportional allocation is preferred when more precise
estimates are required in more important strata.
– Neyman allocation. Neyman determined that if you also have in-
formation on the variability within each stratum, then more effort
should be allocated to strata that are more important and more vari-
able, to give you the most precise overall estimate for a given sample
size. This is rarely done in ecology because information on
intra-stratum variability is often unknown.⁴
– Cost allocation. In general, effort should be allocated to more
important strata, more variable strata, or strata where sampling is
cheaper to give the best overall precision for the entire survey. As
in the previous allocation method, ecologists rarely have sufficiently
detailed cost information to do this allocation method.
• Conduct separate surveys in each stratum. Separate independent
surveys are conducted in each stratum. It is not necessary to use the
same survey method in all strata. For example, low density quadrats
could be surveyed using aerial methods, while high density strata may
require ground based methods. Some strata may use simple random sam-
ples, while other strata may use cluster samples. Many textbooks show
examples where the same survey method is used in all strata, but this is
NOT required.
• Obtain stratum specific estimates. Use the appropriate estimators to
estimate stratum means, proportions or totals (along with their se ) for
each stratum.
• Rollup. The separate stratum estimates are then combined to give an
overall value for the entire survey region.
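The allocation rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from these notes; the stratum sizes and standard deviations are hypothetical, and simple rounding is used, so the rounded allocations may not always sum exactly to n:

```python
# Proportional and Neyman allocation of a total sample size n among strata.
# N = stratum sizes, S = stratum standard deviations (hypothetical values).

def proportional_allocation(n, N):
    """Allocate n in proportion to the stratum sizes N_h."""
    total = sum(N)
    return [round(n * Nh / total) for Nh in N]

def neyman_allocation(n, N, S):
    """Allocate n in proportion to N_h * S_h: more effort goes to strata
    that are larger and more variable, which minimizes the variance of the
    overall estimate for a fixed total sample size."""
    weights = [Nh * Sh for Nh, Sh in zip(N, S)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

N = [1000, 3000, 6000]   # hypothetical stratum sizes
S = [20.0, 5.0, 5.0]     # hypothetical stratum standard deviations
print(proportional_allocation(100, N))   # most effort to the largest stratum
print(neyman_allocation(100, N, S))      # extra effort to the small, variable stratum
```

Note how Neyman allocation shifts effort toward the first stratum, which is small but four times as variable as the others.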

Stratification can be carried out prior to the survey (pre-stratification) or
after the survey (post-stratification). Pre-stratification is used if the stratum
variable is known in advance for every plot (e.g. elevation of a plot). Post-
stratification is used if the stratum variable can only be ascertained after mea-
suring the plot, e.g. soil texture or soil pH. The advantages of pre-stratification
are that samples can be allocated to the various strata in advance to optimize the
survey and the analysis is relatively straightforward. With post-stratification,
there is no control over the sample size in each of the strata, and the analysis is
more complicated (the problem is that the sample sizes in each stratum are
now random). Post-stratification can result in significant improvements in
precision but does not allow for the finer control of the sample sizes found in
pre-stratification.

³ Recall from previous sections that the absolute sample size is one of the drivers for precision.
⁴ However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.

Stratification can be used with any type of sampling design – the concepts
introduced here deal with stratification applied to simple random samples but
are easily extended to more complex designs.

The advantages of stratification are:

• variance estimates of the mean or of the total will be more precise when
compared to variances from an unstratified design if the units can be
divided into groups that are more homogeneous within groups than the
whole population.
• the cost of conducting a survey under stratification may be less as units
selected within a stratum are in closer proximity.
• different sampling methods may be used in each stratum for cost or con-
venience reasons. [In the detail below we assume that each stratum has
the same sampling method used, but this is only for simplification.]
• because randomization occurs independently in each stratum, corruption
of the survey design due to problems experienced in the field may be
confined.
• separate estimates for each stratum with a given precision can be obtained
• it may be more convenient to take a stratified random sample for admin-
istrative reasons. For example, the strata may refer to different district
offices.

4.7.1 A visual comparison of a simple random sample vs a stratified simple random sample

You may find it useful to compare a simple random sample of 24 vs a stratified
random sample of 24 using the following visual plans:

Select a sample of 24 in each case.


Simple Random Sampling

Describe how the sample was taken.

[Grid map for selecting the simple random sample not reproduced.]

Stratified Simple Random Sampling

First you will have to define the strata. Suppose that there is a gradient
in response from the top to the bottom of the map. Three strata are defined,
consisting of the first 3 rows, the next 5 rows, and finally, the last two rows.
It was decided to conduct a simple random sample within each stratum, with
sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process
on allocating samples to strata will be covered later.]

[Grid maps for selecting the stratified sample not reproduced.]

In this design, the same design was used in ALL strata, but this is NOT a
requirement for stratification. It is quite possible, and often desirable, to use
different methods in the different strata. For example, it may be more efficient
to survey desert areas using a fixed-wing aircraft, while ground surveys need to
be used in heavily forested areas.

4.7.2 Notation

Common notation is to use h as a stratum index and i or j as unit indices within
each stratum.

Characteristic       Population quantities                         Sample quantities
number of strata     H                                             H
stratum sizes        N_1, N_2, · · · , N_H                         n_1, n_2, · · · , n_H
units                Y_hj, h = 1, · · · , H, j = 1, · · · , N_h    y_hj, h = 1, · · · , H, j = 1, · · · , n_h
stratum totals       τ_h                                           y_h
stratum means        μ_h                                           ȳ_h
variances            S_h²                                          s_h²

Population total:  τ = N Σ_{h=1}^{H} W_h μ_h, where W_h = N_h/N
Population mean:   μ = Σ_{h=1}^{H} W_h μ_h

4.7.3 Summary of main results

It is assumed that from each stratum, a SRSWOR of size n_h is selected inde-
pendently of ALL OTHER STRATA!

The results below summarize the computations, which can most easily be
thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter,
we use a SRS design in each stratum, but it is not necessary to use this
design in a stratum and each stratum could have a different design. In the
case of an SRS, the estimate of the mean for each stratum is found as:

   μ̂_h = ȳ_h

   with associated standard error:

   se(μ̂_h) = √( (1 − f_h) s_h²/n_h )


where the subscript h refers to each stratum.


2. Compute the estimated total and its se for each stratum. In many cases
this is simply the estimated mean for the stratum multiplied by the STRA-
TUM POPULATION size. In the case of an SRS in each stratum this
gives:

   τ̂_h = N_h × μ̂_h = N_h × ȳ_h

   se(τ̂_h) = N_h × se(μ̂_h) = N_h × √( (1 − f_h) s_h²/n_h )

3. Compute the grand total and its se over all strata. This is the sum of the
individual totals. The se is computed in a special way.

   τ̂ = τ̂_1 + τ̂_2 + . . .

   se(τ̂) = √( se(τ̂_1)² + se(τ̂_2)² + . . . )

4. Occasionally, the grand mean over all strata is needed. This is found by
dividing the estimated grand total by the total POPULATION size:

   μ̂ = τ̂ / (N_1 + N_2 + . . .)

   se(μ̂) = se(τ̂) / (N_1 + N_2 + . . .)
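The four steps can be sketched in code. This is a minimal illustrative Python version, assuming a SRSWOR within each stratum; the two strata below are invented data, not an example from these notes:

```python
import math

def stratified_estimates(N_h, samples):
    """Stratified estimates of the grand total and grand mean, assuming a
    SRSWOR within each stratum.  N_h: stratum population sizes;
    samples: one list of observations per stratum (invented data below)."""
    total, var_total = 0.0, 0.0
    for N, y in zip(N_h, samples):
        n = len(y)
        ybar = sum(y) / n
        s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
        se_mean = math.sqrt((1 - n / N) * s2 / n)  # Step 1: stratum mean and se
        total += N * ybar                          # Step 2: stratum total, summed (Step 3)
        var_total += (N * se_mean) ** 2            # Step 3: accumulate se^2 of the totals
    se_total = math.sqrt(var_total)
    N_pop = sum(N_h)
    # Step 4: grand mean = grand total / total population size
    return total, se_total, total / N_pop, se_total / N_pop

# Two hypothetical strata of sizes 200 and 50
tot, se_tot, mean, se_mean = stratified_estimates(
    [200, 50], [[10.0, 12.0, 11.0, 9.0], [30.0, 34.0, 32.0]])
```

For these invented data the estimated grand total is 3700 with se of about 139.5, and the grand mean is 14.8.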

This can be summarized in a succinct form as follows. Note that the stratum
weights W_h are formed as N_h/N and are often used to derive weighted means
etc.:

Quantity   Population value                       Estimator                             se
Mean       μ = Σ_{h=1}^{H} W_h μ_h                μ̂_str = Σ_{h=1}^{H} W_h ȳ_h          √( Σ_{h=1}^{H} W_h² (s_h²/n_h)(1 − f_h) )
Total      τ = N Σ W_h μ_h = Σ τ_h = Σ N_h μ_h    τ̂_str = N Σ W_h ȳ_h = Σ N_h ȳ_h      √( Σ_{h=1}^{H} N_h² (s_h²/n_h)(1 − f_h) )

Notes


• The estimator for the grand population mean is a weighted average of
the individual stratum means using the POPULATION weights rather
than the sample weights. This is NOT the same as the simple unweighted
average of the estimated stratum means unless n_h/n equals N_h/N;
such a design is known as proportional allocation in stratified sampling.
• The estimated standard error for the grand total is found as
√( se_1² + se_2² + · · · + se_H² ),
i.e. the square root of the sum of the individual se² of the strata TO-
TALS.
• The estimators for a proportion are IDENTICAL to that of the mean
except replace the variable of interest by 0/1 where 1=character of interest
and 0=character not of interest.
• Confidence intervals Once the se has been determined, the usual ±2se
will give approximate 95% confidence intervals if the sample sizes are rela-
tively large in each stratum. If the sample sizes are small in each stratum
some authors suggest using a t-distribution with degrees of freedom de-
termined using a Satterthwaite approximation - this will not be covered
in this course.
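The last two notes can be sketched together: recoding each unit as 0/1 and running the coded values through the stratified-mean machinery estimates a proportion, and ±2 se then gives an approximate 95% interval. The strata and 0/1 data below are invented for illustration:

```python
import math

# Two hypothetical strata of sizes 100 and 300; each observation is coded
# 1 = has the character of interest, 0 = does not (invented data, n = 20 each).
N_h = [100, 300]
coded = [[1] * 12 + [0] * 8, [1] * 4 + [0] * 16]

p_hat, var = 0.0, 0.0
N = sum(N_h)
for Nh, y in zip(N_h, coded):
    n = len(y)
    p = sum(y) / n                        # stratum proportion = mean of 0/1 data
    s2 = sum((v - p) ** 2 for v in y) / (n - 1)
    p_hat += (Nh / N) * p                 # weight by the POPULATION weights W_h
    var += (Nh / N) ** 2 * (1 - n / Nh) * s2 / n
se = math.sqrt(var)
lo, hi = p_hat - 2 * se, p_hat + 2 * se   # approximate 95% confidence interval
```

Here p̂ = 0.30 with se of about 0.071, giving an approximate 95% interval of roughly (0.16, 0.44).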

4.7.4 Example - sampling organic matter from a lake

[With thanks to Dr. Rick Routledge for this example].

Suppose that you were asked to estimate the total amount of organic matter
suspended in a lake just after a storm. The first scheme that might occur to
you could be to cruise around the lake in a haphazard fashion and collect a few
sample vials of water which you could then take back to the lab. If you knew the
total volume of water in the lake, then you could obtain an estimate of the total
amount of organic matter by taking the product of the average concentration
in your sample and the total volume of the lake.

The accuracy of your estimate of course depends critically on the extent to
which your sample is representative of the entire lake. If you used the haphazard
scheme outlined above, you have no way of objectively evaluating the accuracy
of the sample. It would be more sensible to take a properly randomized sample.
(How might you go about doing this?)

Nonetheless, taking a randomized sample from the entire lake would still not
be a totally sensible approach to the problem. Suppose that the lake were to be
fed by a single stream, and that most of the organic matter were concentrated
close to the mouth of the stream. If the sample were indeed representative, then
most of the vials would contain relatively low concentrations of organic matter,
whereas the few taken from around the mouth of the stream would contain much


higher concentration levels. That is, there is a real potential for outliers in the
sample. Hence, confidence limits based on the normal distribution would not
be trustworthy.

Furthermore, the sample mean is not as reliable as it might be. Its value
will depend critically on the number of vials sampled from the region close to
the stream mouth. This source of variation ought to be controlled.

Finally, it might be useful to estimate not just the total amount of organic
matter in the entire lake, but the extent to which this total is concentrated near
the mouth of the stream.

You can simultaneously overcome all three deficiencies by taking what is
called a stratified random sample. This involves dividing the lake into two or
more parts called strata. (These are not the horizontal strata that naturally
form in most lakes, although these natural strata might be used in a more
complex sampling scheme than the one considered here.) In this instance, the
lake could be divided into two parts, one consisting roughly of the area of high
concentration close to the stream outlet, the other comprising the remainder of
the lake.

Then if a simple random sample of fixed size were to be taken from within
each of these “strata”, the results could be used to estimate the total amount of
organic matter within each stratum. These subtotals could then be added to
produce an estimate of the overall total for the lake.

This procedure, because it involves constructing separate estimates for each
stratum, permits us to assess the extent to which the organic matter is concen-
trated near the stream mouth. It also permits the investigator to control the
number of vials sampled from each of the two parts of the lake. Hence, the
chance variation in the estimated total ought to be sharply reduced. Finally, we
shall soon see that the confidence limits that one can construct are free of the
outlier problem that invalidated the confidence limits based on a simple random
sampling scheme.

A randomized sample is to be drawn independently from within each stratum.

How can we use the results of a stratified random sample to estimate the
overall total? The simplest way is to construct an estimate of the totals within
each of the strata, and then to sum these estimates. A sensible estimate of the
average within the h'th stratum is ȳ_h. Hence, a sensible estimate of the total
within the h'th stratum is τ̂_h = N_h ȳ_h, and the overall total can be estimated
by τ̂ = Σ_{h=1}^{H} τ̂_h = Σ_{h=1}^{H} N_h ȳ_h.


If we prefer to estimate the overall average, we can merely divide the estimate
of the overall total by the size of the population, N. The resulting estimator is
called the stratified random sampling estimator of the population average, and
is given by μ̂ = Σ_{h=1}^{H} N_h ȳ_h / N.

This can be expressed as a fancy average if we adjust the order of operations
in the above expression. If, instead of dividing the sum by N, we divide each
term by N and then sum the results, we shall obtain the same result. Hence,

   μ̂_stratified = Σ_{h=1}^{H} (N_h/N) ȳ_h = Σ_{h=1}^{H} W_h ȳ_h,

where W_h = N_h/N. These W_h-values can be thought of as weighting factors,
and μ̂_stratified can then be viewed as a weighted average of the within-stratum
sample averages.

The estimated standard error is found as:

   se(μ̂_stratified) = se( Σ_{h=1}^{H} W_h ȳ_h ) = √( Σ_{h=1}^{H} W_h² [se(ȳ_h)]² ),

where the estimated se(ȳ_h) is given by the formula for simple random sampling:
se(ȳ_h) = √( (s_h²/n_h)(1 − f_h) ).

A Numerical Example

Suppose that for the lake sampling example discussed earlier the lake were
subdivided into two strata, and that the following results were obtained. (All
readings are in mg per litre.)

Stratum   N_h         n_h   Sample Observations        ȳ_h     s_h
1         7.5 × 10⁸   5     37.2 46.6 45.3 38.1 40.4   41.52   4.23
2         2.5 × 10⁷   5     365 344 388 347 403        369.4   25.7

We begin by computing the estimated mean for each stratum and its asso-
ciated standard error. The sampling fraction n_h/N_h is so close to 0 it can be safely


ignored. For example, the standard error of the mean for stratum 1 is found as:

   se(μ̂_1) = √( (1 − f_1) s_1²/n_1 ) ≈ √( 4.23²/5 ) = 1.89

This gives the summary table:

Stratum   n_h   μ̂_h    se(μ̂_h)
1         5     41.52   1.8935
2         5     369.4   11.492

Next, we estimate the total organic matter in each stratum. This is found by
multiplying the mean concentration and se of each stratum by the total volume:

   τ̂_h = N_h × μ̂_h

   se(τ̂_h) = N_h × se(μ̂_h)

For example, the estimated total organic matter in stratum 1 is found as:

   τ̂_1 = N_1 × μ̂_1 = 7.5 × 10⁸ × 41.52 = 311.4 × 10⁸

   se(τ̂_1) = N_1 × se(μ̂_1) = 7.5 × 10⁸ × 1.89 = 14.175 × 10⁸

This gives the summary table:

Stratum   n_h   μ̂_h    se(μ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸

Next, we total the organic content of the two strata and find the se of the
grand total as √(14.175² + 2.873²) × 10⁸ to give the summary table:

Stratum   n_h   μ̂_h    se(μ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸
Total                             403.7 × 10⁸   14.46 × 10⁸

Finally, the overall grand mean is found by dividing by the total volume of
the lake, 7.75 × 10⁸, to give:

   μ̂ = 403.7 × 10⁸ / 7.75 × 10⁸ = 52.09 mg/L

   se(μ̂) = 14.46 × 10⁸ / 7.75 × 10⁸ = 1.87 mg/L

The calculations required to compute the stratified estimate can also be done
using the method of weighted averages as shown in the following table:


Stratum   N_h          W_h (= N_h/N)   ȳ_h     W_h ȳ_h   se(ȳ_h)   W_h² [se(ȳ_h)]²
1         7.5 × 10⁸    0.9677          41.52   40.180    1.8935    3.3578
2         2.5 × 10⁷    0.0323          369.4   11.916    11.492    0.1374
Totals    7.75 × 10⁸   1.0000                  52.097              3.4952

se = √3.4952 = 1.870

Hence the estimate of the overall average is 52.097 mg/L, and the associated
estimated standard error is √3.4952 = 1.870 mg/L; an approximate 95%
confidence interval is then found in the usual fashion. As expected, these match
the previous results.
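The lake calculations can be verified with a short script. This sketch uses the sample summaries from the tables above and, as in the text, ignores the negligible sampling fractions:

```python
import math

# Stratum volumes (litres) and sample summaries from the lake example
N_h  = [7.5e8, 2.5e7]
ybar = [41.52, 369.4]    # sample mean concentrations (mg/L)
s    = [4.23, 25.7]      # sample standard deviations
n    = [5, 5]

total, var_total = 0.0, 0.0
for Nh, yb, sh, nh in zip(N_h, ybar, s, n):
    se_mean = sh / math.sqrt(nh)       # fpc ignored: n_h/N_h is essentially 0
    total += Nh * yb                   # estimated total organic matter in stratum
    var_total += (Nh * se_mean) ** 2
se_total = math.sqrt(var_total)
N = sum(N_h)
mu_hat, se_mu = total / N, se_total / N   # overall mean concentration (mg/L)
```

This reproduces the estimates above, μ̂ ≈ 52.10 mg/L with se ≈ 1.87 mg/L; tiny differences from the text come from rounding in the hand calculations.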

This discussion swept a number of practical difficulties under the carpet.
These include (a) estimating the volume of each of the two portions of the lake,
(b) taking properly randomized samples from within each stratum, (c) selecting
the appropriate size of each water sample, (d) measuring the concentration for
each water sample, and (e) choosing the appropriate number of water samples
from each stratum. None of these tasks is simple. Estimating the
volume of a portion of a lake, for example, typically involves taking numerous
depth readings and then applying a formula for approximating integrals. This
problem is beyond the scope of these notes.

The standard error in the estimator of the overall average is markedly re-
duced in this example by the stratification. The standard error of the stratified
estimator was just estimated to be around 2. This result was for a sample
of total size 10. By contrast, for an estimator based on a simple random sample
of the same size, the standard error can be found to be about 20. [This involves
methods not covered in this class.] Stratification has reduced the standard error
by an order of magnitude.

It is also possible that we could reduce the standard error even further with-
out increasing our sampling effort by somehow allocating this effort more effi-
ciently. Perhaps we should take fewer water samples from the region far from
the outlet, and take more from the other stratum. This will be covered later in
this course.

One can also read in more comprehensive accounts how to construct esti-
mates from samples that are stratified after the sample is selected. This is
known as post-stratification. These methods are useful if, e.g. you are sam-
pling a population with a known sex ratio. If you observe that your sample is
biased in favor of one sex, you can use this information to build an improved
estimate of the quantity of interest through stratifying the sample by sex after
it is collected. It is not necessary that you start out with a plan for sampling
some specified number of individuals from each sex (stratum).


Nonetheless, in any survey work, it is crucial that you begin with a plan.
There are many examples of surveys that produced virtually useless results
because the researchers failed to develop an appropriate plan. This should
include a statement of your main objective, and detailed descriptions of how
you plan to generate the sample, collect the data, enter them into a computer
file, and analyze the results. The plan should contain discussion of how you
propose to check for and correct errors at each stage. It should be tested with
a pilot survey, and modified accordingly. Major, ongoing surveys should be
reassessed continually for possible improvements. There is no reason to expect
that the survey design will be perfect the first time that it is tried, nor that
flaws will all be discovered in the first round. On the other hand, one should
expect that after many years experience, the researchers will have honed the
survey into a solid instrument. George Gallup’s early surveys were seriously
biased. Although it took over a decade for the flaws to come to light, once they
did, he corrected his survey design promptly, and continued to build a strong
reputation.

One should also be cautious in implementing stratified survey designs for
long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g.
might be hopelessly out of date 50 years from now, with a substantially altered
configuration of channels and islands. You should anticipate the need to revise
your stratification periodically.

4.7.5 Example - estimating the total catch of salmon

DFO needs to monitor the catch of sockeye salmon as the season progresses so
that stocks are not overfished.

The season in one statistical sub-area in a year was a total of 2 days (!) and
250 vessels participated in the fishery in these 2 days. A census of the catch of
each vessel at the end of each day is logistically difficult.

In this particular year, observers were randomly placed on selected vessels
and at the end of each day the observers contacted DFO managers with a count
of the number of sockeye caught on that day.

Here is the raw data - each line corresponds to the observers’ count for that
vessel for that day. On the second day, a new random sample of vessels was
selected. On both days, 250 vessels participated in the fishery.

Date Sockeye
29-Jul-98 337
29-Jul-98 730


29-Jul-98 458
29-Jul-98 98
29-Jul-98 82
29-Jul-98 28
29-Jul-98 544
29-Jul-98 415
29-Jul-98 285
29-Jul-98 235
29-Jul-98 571
29-Jul-98 225
29-Jul-98 19
29-Jul-98 623
29-Jul-98 180

30-Jul-98 97
30-Jul-98 311
30-Jul-98 45
30-Jul-98 58
30-Jul-98 33
30-Jul-98 200
30-Jul-98 389
30-Jul-98 330
30-Jul-98 225
30-Jul-98 182
30-Jul-98 270
30-Jul-98 138
30-Jul-98 86
30-Jul-98 496
30-Jul-98 215

What is the population of interest?

The population of interest is the set of vessels participating in the fishery on
the two days. [The fact that each vessel likely participated in both days is not
really relevant.] The population of interest is NOT the salmon captured - this
is the response variable for each boat whose total is of interest.

What is the sampling frame?

It is not clear how the list of fishing boats was generated. It seems unlikely that
the aerial survey actually had a picture of the boats on the water from which
DFO selected some boats. More likely, the observers were taken onto the water


in some systematic fashion, and then the observer selected a boat at random
from those seen at this point. Hence the sampling frame is the set of locations
chosen to drop off the observers and the set of boats visible from these points.

What is the sampling design?

The sampling unit is a boat on a day. The strata are the two days. On each
day, a random sample was selected from the boats participating in the fishery.

This is a stratified design with a simple random sample selected each day.

Note in this survey, it is logistically impossible to do a simple random sample
over both days as the number of vessels participating really isn’t known for
any day until the fishery starts. Here, stratification takes the form of adminis-
trative convenience.

Excel analysis

A copy of an Excel workbook called sockeye.xls is available from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A summary of the page appears below:

[Screenshot of the Excel workbook not reproduced.]

The data are listed on the spreadsheet on the left.

Summary statistics

The Excel built-in functions are used to compute the summary statistics (sample
size, sample mean, and sample standard deviation) for each stratum. Some
caution needs to be exercised that the range of each function covers only the
data for that stratum.⁵

You will also need to specify the stratum size (the total number of sampling
units in each stratum), i.e. 250 vessels on each day.

Find estimates of the mean catch for each stratum

Because the sampling design in each stratum is a simple random sample, the
same formulae as in the previous section can be used.

The mean and its estimated se for each day of the opening are reported in the
spreadsheet.

Find the estimates of the total catch for each stratum

The estimated total catch is found by multiplying the average catch per boat by
the total number of boats participating in the fishery. The estimated standard
error for the total for that day is found by multiplying the standard error for
the mean by the stratum size as in the previous section.

For example, in the first stratum (29 July), the estimated total catch is
found by multiplying the estimated mean catch per boat (322) by the number
of boats participating (250) to give an estimated total catch of 80,500 salmon
for the day. The se for the total catch is found by multiplying the se of the
mean (57) by the number of boats participating (250) to give the se of the total
catch for the day of 14,200 salmon.

Find estimate of grand total

Once an estimated total is found for each stratum, the estimated grand total
is found by summing the individual stratum estimated totals. The estimated
standard error of the grand total is found by the square root of the sum of the
squares of the standard errors in each stratum - the Excel function sumsq is
useful for this computation.

Estimates of the overall grand mean


⁵ If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary
statistics for each stratum. An application of Pivot-Tables is demonstrated in the analysis of
a cluster sample where the cluster totals are needed for the summary statistics.


This was not done in the spreadsheet, but is easily computed by dividing
the total catch by the total number of boat days in the fishery (250+250=500).
The se is found by dividing the se of the total catch also by 500.

Note this is interpreted as the mean number of fish captured per day per
boat.
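The spreadsheet steps can be reproduced with a short script. This sketch uses the observer counts listed earlier and includes the finite population correction, so it matches the Excel and SAS results (total catch 131,750 with se of about 16,541):

```python
import math

# Observer counts of sockeye per vessel on each day (from the table above)
day1 = [337, 730, 458, 98, 82, 28, 544, 415, 285, 235, 571, 225, 19, 623, 180]
day2 = [97, 311, 45, 58, 33, 200, 389, 330, 225, 182, 270, 138, 86, 496, 215]
N_h = [250, 250]      # vessels fishing on each day (the stratum sizes)

total, var_total = 0.0, 0.0
for N, y in zip(N_h, [day1, day2]):
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    se_mean = math.sqrt((1 - n / N) * s2 / n)   # includes the fpc
    total += N * ybar                           # estimated total catch that day
    var_total += (N * se_mean) ** 2
se_total = math.sqrt(var_total)
mean_per_boat = total / sum(N_h)                # mean catch per boat-day
```

The stratum totals are 80,500 and 51,250, the grand total is 131,750 (se about 16,541), and the overall mean is 263.5 fish per boat-day, matching the SAS output below.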

SAS analysis

As noted earlier, some care must be used when standard statistical packages are
used to analyze survey data as many packages ignore the design used to select
the data.

A sample SAS program for the analysis of the sockeye example called sockeye.sas
and its output called sockeye.lst is available from the Sample Program Library
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

A copy of the program is as follows:

/* Example of a stratified sample analysis using SAS */

/* On each of two days, a sample of vessels were assigned
   observers who counted the number of sockeye salmon caught
   in that day. On the second day, a new set of vessels was observed. */

title 'Number of sockeye caught - example of stratified simple random sampling';

options nodate nonumber noovp nocenter linesize=75;

data sockeye; /* read in the data */
   length date $8.;
   input date $ sockeye;
   /* compute the sampling weight. In general,
      these will be different for each stratum */
   if date = '29-Jul' then sampweight = 250/15;
   if date = '30-Jul' then sampweight = 250/15;
datalines;
29-Jul 337
29-Jul 730
29-Jul 458
29-Jul 98
29-Jul 82
29-Jul 28


29-Jul 544
29-Jul 415
29-Jul 285
29-Jul 235
29-Jul 571
29-Jul 225
29-Jul 19
29-Jul 623
29-Jul 180
30-Jul 97
30-Jul 311
30-Jul 45
30-Jul 58
30-Jul 33
30-Jul 200
30-Jul 389
30-Jul 330
30-Jul 225
30-Jul 182
30-Jul 270
30-Jul 138
30-Jul 86
30-Jul 496
30-Jul 215
;;;;

data n_boats; /* you need to specify the stratum sizes if you want stratum totals */
   length date $8.;
   date = '29-Jul'; _total_=250; output; /* the stratum sizes must be variable _total_ */
   date = '30-Jul'; _total_=250; output;

proc print data=sockeye;
   title2 'raw data from the survey';

proc print data=n_boats;
   title2 'number of boats in each stratum';

proc surveymeans data=sockeye
     N = n_boats  /* dataset with the stratum population sizes present */
     mean         /* average catch/boat along with standard error */
     sum;         /* request estimates of total */
   strata date / list;  /* identify the stratum variable */
   var sockeye;         /* which variable to get estimates for */
   weight sampweight;
run;


The program starts with reading in the raw data and the computation of the
sampling weights. Because the population size and sample size are the same for
each stratum, the sampling weights are common to all boats. In general, this
is not true, and a separate sampling weight computation is required for each
stratum.

A separate file is also constructed with the population sizes for each stratum
so that estimates of the population total can be constructed.

The SURVEYMEANS procedure then uses the STRATA statement to
identify that this is a stratified design. The default analysis in each stratum is
again a simple random sample.

The SAS output is:

Number of sockeye caught - example of stratified simple random sampling


raw data from the survey

Obs date sockeye sampweight

1 29-Jul 337 16.6667


2 29-Jul 730 16.6667
3 29-Jul 458 16.6667
4 29-Jul 98 16.6667
5 29-Jul 82 16.6667
6 29-Jul 28 16.6667
7 29-Jul 544 16.6667
8 29-Jul 415 16.6667
9 29-Jul 285 16.6667
10 29-Jul 235 16.6667
11 29-Jul 571 16.6667
12 29-Jul 225 16.6667
13 29-Jul 19 16.6667
14 29-Jul 623 16.6667
15 29-Jul 180 16.6667
16 30-Jul 97 16.6667
17 30-Jul 311 16.6667
18 30-Jul 45 16.6667
19 30-Jul 58 16.6667
20 30-Jul 33 16.6667
21 30-Jul 200 16.6667
22 30-Jul 389 16.6667
23 30-Jul 330 16.6667
24 30-Jul 225 16.6667
25 30-Jul 182 16.6667


26 30-Jul 270 16.6667


27 30-Jul 138 16.6667
28 30-Jul 86 16.6667
29 30-Jul 496 16.6667
30 30-Jul 215 16.6667
Number of sockeye caught - example of stratified simple random sampling
number of boats in each stratum

Obs date _total_

1 29-Jul 250
2 30-Jul 250
Number of sockeye caught - example of stratified simple random sampling
number of boats in each stratum

The SURVEYMEANS Procedure

Data Summary

Number of Strata 2
Number of Observations 30
Sum of Weights 500

Stratum Information

Stratum Population Sampling


Index date Total Rate N Obs Variable N
---------------------------------------------------------------------------
1 29-Jul 250 6.00% 15 sockeye 15
2 30-Jul 250 6.00% 15 sockeye 15
---------------------------------------------------------------------------

Statistics

Std Error
Variable Mean of Mean Sum Std Dev
------------------------------------------------------------------------
sockeye 263.500000 33.082758 131750 16541
------------------------------------------------------------------------

The results are the same as before.

The only thing of “interest” is to note that SAS labels the precision of the
estimated grand mean as a standard error, while it labels the precision of the
estimated total as a standard deviation! Both are correct - a standard error is
a standard deviation - not of individual units in the population - but of the
estimates over repeated sampling from the same population. I think it is clearer
to label both as standard errors to avoid any confusion.

If separate analyses are wanted for each stratum, the SURVEYMEANS pro-
cedure has to be run a second time, with a BY statement, to estimate the means
and totals within each stratum.

Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using SAS.

JMP analysis

The data are available in a JMP file called sockeye.jmp available at the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered as usual into JMP. Ensure that the variable that identifies the strata is nominally scaled and that the response variable is continuously scaled:

We start by finding the summary statistics for EACH stratum. This is done
using the Analyze->Distribution platform as in the analysis of the creel data,
but we want a separate analysis for each stratum. This is obtained by using the
By box in the dialogue box:

The output from the Analyze->Distribution platform is stacked and unnecessary information removed as in the creel example. The final summary statistics for each stratum are:

The estimated catch per boat in the first stratum is 322 (se 59) and in the second stratum is 205 (se 35). Note that the differences in se between the JMP and Excel results are minimal because the sampling fraction (15/250 = 6%) is very small.

Unfortunately, we now need to complete the rest of the computations by hand. The estimated total catch in each stratum is found by multiplying the estimated catch per boat by the number of boats, giving an estimated total catch
of 80,500 (se 14,640) in stratum 1 and 51,250 (se 8,757) in stratum 2.

The estimated grand total catch is found by adding the stratum totals, 131,750 = 80,500 + 51,250. The se of the grand total is found as sqrt(14,640² + 8,757²) = 17,060. [Again, all of the se are slightly larger because the finite population correction factor has not been applied, but the differences in the se are only about 3%.]
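These hand computations are easy to check with a short script. A minimal sketch in Python (the text itself uses JMP and Excel; the stratum summaries below are taken from this example, and the finite population correction is omitted, as in the JMP analysis):

```python
import math

# Per-stratum summaries from the sockeye example (no fpc, as in JMP):
# (stratum, mean catch per boat, sample std dev, boats sampled, boats in stratum)
strata = [
    ("29-Jul", 322.0, 226.818, 15, 250),
    ("30-Jul", 205.0, 135.695, 15, 250),
]

grand_total = 0.0
var_grand = 0.0
for day, mean, sd, n, N in strata:
    total = N * mean                  # estimated total catch in the stratum
    se_total = N * sd / math.sqrt(n)  # se(total) = N * se(mean)
    grand_total += total
    var_grand += se_total ** 2        # strata are independent: variances add

se_grand = math.sqrt(var_grand)       # about 17,060 after rounding
print(grand_total, se_grand)
```

Squaring, adding, and square-rooting the stratum se's reproduces the 131,750 (se about 17,060) obtained by hand above.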

Doing everything in JMP

It is possible to do everything in JMP, including the finite population correction factor and totaling over strata. We follow the same steps as in the previous analyses.

We need to find the sample size, the mean, and the standard deviation for
each stratum. We can use the Tables->Summary pop-down menu as shown
below:

Note that we must specify the grouping variable in this case to identify the
strata.

This generates the summary table with statistics for each stratum:

date       N rows  N(sockeye)  Mean  Std dev
29-Jul-98      15          15   322  226.818
30-Jul-98      15          15   205  135.695

At this point, further computations using JMP are clumsy. It may be easier
to use the summary statistics and transfer them to a spreadsheet to continue
the computations.

Add a new column for the number of vessels in each stratum, and two more columns where you estimate the total catch for the day (mean catch × number of vessels) and the squared standard error (se²) for the total in each stratum using the formula:

This generates the revised summary table with statistics for each stratum:

Finally, add the totals and square of the se from each stratum to get the overall
total and the overall squared se. Take the square root of the overall squared se to get the overall se and, voila:

Hence our final estimate is that a total of 131,750 sockeye were caught, with a se of 16,541 fish.

When should the various estimates be used?

In a stratified sample, there are many estimates that are obtained with different standard errors. It can sometimes be confusing as to which estimate should be used for which purpose.

Here is a brief review of the four possible estimates and the level of interest
in each estimate.

1. Stratum mean (µh). Estimator: the stratum sample mean Ȳh. se = sqrt( (sh²/nh) × (1 − fh) ).
   Example: for stratum 1 the estimate is 322 with se 56.8 (not shown earlier). The estimated average catch per boat on 29 July was 322 fish (se 56.8 fish).
   Who would be interested: a fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

2. Stratum total (τh). Estimator: Nh × Ȳh. se = Nh × sqrt( (sh²/nh) × (1 − fh) ).
   Example: for stratum 1 the estimate is 80,500 = 250 × 322 with se 14,195 = 250 × 56.8. The estimated total catch over ALL boats on 29 July was 80,500 (se 14,195).
   Who would be interested: DFO, which wishes to estimate the TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

3. Grand total (τ). Estimator: the sum of the estimated stratum totals, τ̂ = τ̂1 + τ̂2. se = sqrt( se(τ̂1)² + se(τ̂2)² ).
   Example: the estimate is 131,750 = 80,500 + 51,250; the se is sqrt(14,195² + 8,492²) = 16,541. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish).
   Who would be interested: DFO, which wishes to know the total catch over the entire fishing season so that impacts on the stock can be examined.

4. Grand mean (µ). Estimator: µ̂ = τ̂/N. se = se(τ̂)/N.
   Example: with N = 500 vessel-days, the estimate is 131,750/500 = 263.5 and the se is 16,541/500 = 33.0. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish).
   Who would be interested: a fisher who wants to know the average catch per boat per day for the entire season to see if it will meet expenses.

4.8 Sample Size for Stratified Designs

4.8.1 Total sample size

As before, the question arises as how many units should be selected in stratified
designs.

Two questions need to be answered. First, what total sample size is required? Second, how should these units be allocated among the strata?

The total sample size can be determined using the same methods as for a
simple random sample. I would suggest that you initially ignore the fact that
the design will be stratified when finding the initial required total sample size.
If stratification proves to be useful, then your final estimate will be more precise
than you anticipated (always a nice thing to happen!) but seeing as you are
making guesses as to the standard deviations and necessary precision required,
I wouldn’t worry about the extra cost in sampling too much.

If you must, it is possible to derive formulae for the overall sample size that account for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet in which a single cell holds the total sample size and all other cells depend upon this quantity according to the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:

• The sizes (or relative sizes) of each stratum (i.e. the Nh or Wh ).


• The standard deviation of measurements in each stratum. This can be
obtained from past surveys, a literature search, or expert opinion.

• The desired precision – overall – and if needed, for each stratum.

Again refer to the sockeye.xls spreadsheet:

The standard deviations from this survey will be used as ‘guesses’ for what
might happen next year. As in this year’s survey, the total sample size will be
allocated evenly between the two days.

In this case, the total sample size must be allocated to the two strata. You
will see several methods in a later section to do this, but for now, assume
that the total sample will be allocated equally among both strata. Hence the
proposed sample size of 75 is split in half to give a proposed sample size of 37.5
in each stratum. Don’t worry about the fractional sample size - this is only a
planning exercise. We create one cell that has the total sample size, and then
use the formulae to allocate the total sample size equally to the two strata.
The total and the se of the overall total are found as before, and the relative
precision (denoted as the relative standard error (rse) and, unfortunately, in some books as the coefficient of variation, cv) is found as the estimated standard error / estimated total.

Again, this portion of the spreadsheet is setup so that changes in the total
sample size are propagated throughout the sheet. If you change the total sample
size from 75 to some other number, this is automatically split among the two
strata, which then affects the estimated standard error for each stratum, which
then affects the estimated standard error for the total, which then affects the
relative standard error. Again, the proposed total sample size can be varied
using trial and error, or the Excel Goal-Seek option can be used.

Here is what happens when a sample size of 75 is used. Don’t be alarmed by the fractional sample sizes in each stratum – the goal is again to get a rough feel for the required effort for a certain precision.

Total n=75

Stratum      n  Mean  std dev  vessels  Est total  se(Est total)
29-Jul    37.5   322    226.8      250      80500           8537
30-Jul    37.5   205    135.7      250      51250           5107
Total                                      131750           9948
                                                      rse   7.6%

A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less, as shown below:

Total n=145

Stratum      n  Mean  std dev  vessels  Est total  se(Est total)
29-Jul    72.5   322    226.8      250      80500           5611
30-Jul    72.5   205    135.7      250      51250           3357
Total                                      131750           6539
                                                      rse   5.0%
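The Goal Seek step can be mimicked in a few lines. A sketch in Python (equal allocation with the fpc included, standard-deviation guesses and the anticipated total taken from this year's survey, 5% rse target as in the example):

```python
import math

strata = [(226.8, 250), (135.7, 250)]   # (std dev guess, stratum size N_h)
anticipated_total = 131750.0            # denominator for the rse

def rse(n_total):
    """Relative se of the estimated grand total under equal allocation."""
    var = 0.0
    for sd, N in strata:
        n_h = n_total / len(strata)     # equal split; fractional n_h is fine for planning
        var += N ** 2 * (1 - n_h / N) * sd ** 2 / n_h
    return math.sqrt(var) / anticipated_total

n = 2
while rse(n) > 0.05:                    # crude goal seek: smallest n with rse <= 5%
    n += 1
print(n, rse(n))
```

This search stops at n = 144 with an rse just under 5%, in line with the n = 145 found by Goal Seek; the one-unit difference is just rounding.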

4.8.2 Allocating samples among strata

There are a number of ways of allocating a sample of size n among the various strata. For example,

1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. nh = n/H. This allocation is best if the variances of the strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey as discussed in previous sections).

2. Proportional allocation. Under proportional allocation, sample sizes are allocated to be proportional to the number of sampling units in the strata, i.e. ni = n × Ni/N = n × Ni/(N1 + N2 + · · · + NH) = n × Wi. This allocation is simple to plan and intuitively appealing. However, it is not the best design. This design may waste effort because large strata get large sample sizes, but precision is determined by sample size, not by the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum.
3. Optimal allocation - a.k.a. Neyman allocation. In optimal allocation, the sample is allocated to minimize the overall variance, keeping the total sample size fixed. Tedious algebra gives that the sample should be allocated proportional to the product of the stratum size and the stratum standard deviation, i.e. ni = n × NiSi/(N1S1 + N2S2 + · · · + NHSH) = n × WiSi / Σ WhSh. This allocation will be appropriate if the costs of measuring units are the same in all strata. Intuitively, the strata that have the most sampling units should be weighted more heavily, and strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata.
4. Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs Ch to sample each unit in stratum h. Then the total cost of the survey is C = Σ nhCh. The allocation rule is that sample sizes should be proportional to the product of the stratum size, the stratum standard deviation, and the inverse of the square root of the per-unit sampling cost, i.e. ni = n × (NiSi/√Ci) / (N1S1/√C1 + N2S2/√C2 + · · · + NHSH/√CH) = n × (WiSi/√Ci) / Σ (WhSh/√Ch). This implies that large samples are found in strata that are larger, more variable, or cheaper to sample.

In practice, most of the gain in precision occurs from moving from equal to proportional allocation, while often only small improvements in precision are gained from moving from proportional allocation to Neyman allocation. Similarly, unless cost differences are enormous, there isn’t much of an improvement in precision in moving to an optimal allocation based on costs.
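The four rules can be collected into one small helper. A sketch in Python - the stratum sizes and standard deviations are the two sockeye strata from this chapter, while the per-unit costs are made-up numbers purely for illustration:

```python
import math

def allocate(n, N, S=None, C=None):
    """Split a total sample size n among strata.
    Weights: N_h (proportional); N_h*S_h (Neyman); N_h*S_h/sqrt(C_h) (costs)."""
    if S is None:
        w = list(N)
    elif C is None:
        w = [Nh * Sh for Nh, Sh in zip(N, S)]
    else:
        w = [Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]
    total_w = sum(w)
    return [n * wh / total_w for wh in w]

N = [250, 250]            # boats per day (sockeye example)
S = [226.8, 135.7]        # std dev guesses
C = [1.0, 4.0]            # hypothetical per-unit costs

print(allocate(75, N))        # proportional: equal here, since N_1 = N_2
print(allocate(75, N, S))     # Neyman: more effort on the more variable day
print(allocate(75, N, S, C))  # cost-based: day 2 is pricier, so it gets less
```

Equal allocation is simply n/H per stratum, so it needs no helper; the other three rules only differ in the weights used.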

Example - estimating the size of a caribou herd We wish to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type, so the survey area was divided into six strata based on habitat type. The survey design divides each stratum into 4 km² quadrats that will be randomly selected. The number of caribou in the quadrats will be counted from an aerial photograph.

An Excel workbook called caribou.xls is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The key point in examining different allocations is to make a single cell represent the total sample size and then make the formula for each stratum sample size a function of this total.

The total sample size can be found by varying the sample total until the
desired precision is found.

Results from previous year’s survey: Here are the summary statistics from
the survey in a previous year:

Map-squares sampled
Stratum Nh nh y s Est total se (total)
1 400 98 24.1 74.7 9640 2621
2 40 10 25.6 63.7 1024 698
3 100 37 267.6 589.5 26760 7693
4 40 6 179 151.0 7160 2273
5 70 39 293.7 351.5 20559 2622
6 120 21 33.2 99.0 3984 2354
Total 770 211 69127 9172

The estimated size of the herd is 69,127 animals, with an estimated se of 9,172 animals.

Equal allocation

What would happen if an equal allocation were used? We now split the 211
total sample size equally among the 6 strata. In this case, the sample sizes are
‘fractional’, but this is OK as we are interested only in planning to see what
would have happened. Notice that the estimate of the overall population would
NOT change, but the se changes.

Stratum Nh nh y s Est total se (total)


1 400 35.2 24.1 74.7 9640 4810
2 40 35.2 25.6 63.7 1024 149
3 100 35.2 267.6 589.5 26760 8005
4 40 35.2 179 151.0 7160 354
5 70 35.2 293.7 351.5 20559 2927
6 120 35.2 33.2 99.0 3984 1684
Total 770 211 69127 9938

An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail, you see that far too many samples are allocated in an equal allocation to strata 2 and 4 and not enough to strata 1 and 3.

Proportional allocation

What about proportional allocation? Now the sample size is proportional to the
stratum population sizes. For example, the sample size for stratum 1 is found
as 211 × 400/770. The following results are obtained:

Stratum Nh nh y s Est total se (total)


1 400 109.6 24.1 74.7 9640 2431
2 40 11.0 25.6 63.7 1024 656
3 100 27.4 267.6 589.5 26760 9596
4 40 11.0 179 151.0 7160 1554
5 70 19.2 293.7 351.5 20559 4787
6 120 32.9 33.2 99.0 3984 1765
Total 770 211 69127 11263

This has an even worse standard error! It looks like not enough samples are
placed in stratum 3 or 5.

Optimal allocation

What if both the stratum sizes and the stratum variances are to be used in
allocating the sample? We create a new column (at the extreme right) which is
equal to Nh Sh . Now the sample sizes are proportional to these values, i.e. the
sample size for the first stratum is now found as 211 × 29866.4/133893.8. Again
the estimate of the total doesn’t change but the se is reduced.

Stratum Nh nh y s Est total se (total) N h Sh


1 400 47.1 24.1 74.7 9640 4089 29866.4
2 40 4.0 25.6 63.7 1024 1206 2550.0
3 100 92.9 267.6 589.56 26760 1629 58953.9
4 40 9.5 179 151.0 7160 1709 6039.6
5 70 38.8 293.7 351.5 20559 2639 24607.6
6 120 18.7 33.2 99.0 3984 2522 11876.4
Total 770 211 69127 6089 133893.8
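The comparison of allocations above can be reproduced by coding the variance formula for a stratified total. A sketch in Python, using the caribou summary statistics from the tables (se(total) is the square root of the sum over strata of Nh²(1 − nh/Nh)sh²/nh):

```python
import math

N = [400, 40, 100, 40, 70, 120]              # map squares per stratum
s = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]  # stratum std devs

def se_total(n):
    """se of the stratified estimate of the total under allocation n."""
    return math.sqrt(sum(N_h ** 2 * (1 - n_h / N_h) * s_h ** 2 / n_h
                         for N_h, n_h, s_h in zip(N, n, s)))

n_actual = [98, 10, 37, 6, 39, 21]   # allocation actually used
n_equal = [211 / 6] * 6              # equal allocation (fractional is fine)

print(se_total(n_actual))   # ~9,172, as in the original survey
print(se_total(n_equal))    # ~9,938 - equal allocation is worse here
```

Plugging in the proportional and optimal allocations from the tables reproduces their se's in the same way.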

4.8.3 Example: Estimating the number of tundra swans.

The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak.6 The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of the birds' prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than in most other nesting areas. This example is based on one such survey.7

Tundra swans are highly visible on their nesting grounds making them easy
to monitor during aerial surveys.

The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years’ data provide the following information about the strata:

Density    Total         Past     Past
Stratum    Survey Units  Density  Std Dev
High          60           20       10
Medium        68           10        6
Low           58            2        3
Total        186
6. Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7
7. Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.

Based on past years’ results and budget considerations, approximately 30 survey units can be sampled.

The three strata are all approximately the same total area (number of survey units), so allocations based on stratum area would be approximately equal across strata. However, that would place about 1/3 of the effort into the low density stratum, which typically has fewer birds. It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviation, which is often found in biological surveys). Consequently, an allocation based on stratum density was used. This allocation would place about 30 × 20/32 ≈ 18 units in the high density stratum; about 30 × 10/32 ≈ 9 units in the medium density stratum; and the remainder (3 units) in the low density stratum.
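This density-based allocation is a one-liner to compute. A minimal sketch in Python (the densities and the 30-unit budget are from the text; pushing the rounding remainder into the low stratum mirrors the allocation used):

```python
densities = {"High": 20, "Medium": 10, "Low": 2}
n_total = 30
d_sum = sum(densities.values())   # 32

# Truncate each stratum's share, then give the remainder to the Low stratum
alloc = {s: int(n_total * d / d_sum) for s, d in densities.items()}
alloc["Low"] = n_total - alloc["High"] - alloc["Medium"]
print(alloc)
```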

The survey was conducted with the following results:

Survey     Stratum  Area   Swans in  Single         Total
Unit                (km2)  flocks    Birds   Pairs  birds
dilai2 h 148 12 6 24
naka41 h 137 13 15 43
naka43 h 137 6 16 38
naka51 h 16 10 3 2 17
nakb32 h 137 10 10 30
nakb44 h 135 6 18 12 48
nakc42 h 83 4 5 6 21
nakc44 h 109 17 15 47
nakd33 h 134 11 11 33
ugac34 h 65 2 10 22
ugac44 h 138 28 15 58
ugad5/63 h 159 9 20 49
dugad56/4 m 102 7 4 15
guad43 m 137 6 4 14
ugad42 m 137 5 11 15 46
low1 l 143 2 2
low3 l 138 1 1

The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that swans did not adjust their behavior knowing that certain survey units would be sampled on the poor weather days, and so there is no impact of the missing data other than a loss of precision compared to a survey with a full 30 survey units chosen.

Also notice that “blanks” in the table (missing values) represent zeros and
not really missing data.

Finally, not all of the survey units are the same area. This could introduce additional variation into our data, which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample, so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator, which could take the area of each survey unit into account and potentially lead to more precise estimates.

JMP analysis The data are imported into JMP and are available in a JMP file tundra.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data sheet appears below. Notice that zeros have been inserted in the
appropriate locations.

Because the units in each stratum were selected using a simple random sample, the appropriate summary measures are the mean and the standard error of the mean for each stratum. Use the Tables->Summary
to select appropriate statistics for each stratum:

Because the sample fraction is smallish, the finite population correction factor
is ignored in each stratum. This gives the summary table:

In order to estimate the total number of swans in each stratum, augment the table with the number of survey units in each stratum,8 and then multiply the mean and standard error by the number of survey units to estimate the total swans and se(total) in each stratum using the formula editor:9

Hence we estimate that about 2000 swans are present in the H and M strata, but just over 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way, sqrt(230.12² + 29.00² + 714.27²) = 751, either using a calculator or a spreadsheet.

The standard error is larger than desired, mostly because of the very small
sample size in the M stratum where only 3 of the 9 proposed survey units could
be surveyed.
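The stratum and grand totals can be verified from the raw counts. A sketch in Python (total-birds column from the table above, grouped by stratum; both the JMP-style se without the fpc and the SAS-style se with it are shown):

```python
import math
from statistics import mean, stdev

counts = {  # total birds per surveyed unit, by stratum
    "h": [24, 43, 38, 17, 30, 48, 21, 47, 33, 22, 58, 49],
    "m": [15, 14, 46],
    "l": [2, 1],
}
N = {"h": 60, "m": 68, "l": 58}   # survey units per stratum

grand = 0.0
var_no_fpc = 0.0   # the JMP analysis ignored the fpc
var_fpc = 0.0      # the SAS analysis applied it
for h, y in counts.items():
    n = len(y)
    grand += N[h] * mean(y)
    se = N[h] * stdev(y) / math.sqrt(n)
    var_no_fpc += se ** 2
    var_fpc += se ** 2 * (1 - n / N[h])

print(grand, math.sqrt(var_no_fpc), math.sqrt(var_fpc))
```

This reproduces the grand total of 3937 swans, the se of 751 without the fpc, and the se of 729 with it.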

SAS analysis A copy of the SAS program (tundra.sas) is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are read into SAS in the usual fashion with the code fragment:
8 Notice that JMP sorts the strata alphabetically
9 It may be easiest at this point to import back to Excel

2006
c Carl James Schwarz 100
CHAPTER 4. SAMPLING

data swans;
length survey_unit $10 stratum $1;
input survey_unit $ stratum $ area num_flocks num_single num_pairs;
num_swans = num_flocks + num_single + 2*num_pairs;
datalines;
... data inserted here ...

The total number of survey units in each stratum is also read into SAS using
the code fragment. Notice that the variable that has the number of stratum
units must be called _total_ as required by the SurveyMeans procedure.

data total_survey_units;
length stratum $1.;
input stratum $ _total_; /* must use _total_ as variable name */
datalines;
h 60
m 68
l 58
;;;;

Next the data are sorted by stratum (not shown), and the number of survey units actually surveyed in each stratum is found using Proc Means:

proc means data=swans noprint;


by stratum;
var num_swans;
output out=n_units n=n;

Most survey procedures in SAS require the use of sampling weights. These are the reciprocal of the probability of selection. In this case, this is simply the number of units in the stratum divided by the number sampled in each stratum:

data swans;
merge swans total_survey_units n_units;
by stratum;
sampling_weight = _total_ / n;

Now the individual stratum estimates are obtained using the code fragment:

/* first estimate the numbers in each stratum */

proc surveymeans data=swans


total=total_survey_units /* inflation factors */
sum clsum mean clm;
by stratum; /* separate estimates by stratum */
var num_swans;
weight sampling_weight;
ods output statistics=IndivEst;

This gives the output:

Lower Upper Lower Upper


Obs stratum Mean StdErr CLMean CLMean Sum StdDev CLSum CLSum
1 h 35.83 3.43 28.28 43.38 2150 206 1697 2603
2 l 1.50 0.49 -4.74 7.74 87 28 -275 449
3 m 25.00 10.27 -19.19 69.19 1700 698 -1305 4705

The estimates in the L and M strata are not very precise because of the small
number of survey units selected. SAS has incorporated the finite population
correction factor when estimating the se for the individual stratum estimates.

We estimate that about 2000 swans are present in the H and M strata, but just over 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way, sqrt(206² + 28² + 698²) = 729.

Proc SurveyMeans can be used to estimate the grand total number of swans over all strata using the code fragment:

proc surveymeans data=swans


total=total_survey_units /* inflation factors for each stratum */
sum clsum mean clm; /* want to estimate grand totals */
title2 ’Estimate total number of swans’;
strata stratum /list; /* which variable define the strata */
var num_swans; /* which variable to analyze */
weight sampling_weight; /* sampling weight for each obs */

This gives the output:

Lower Upper Lower Upper


Obs Mean StdErr CLMean CLMean Sum StdDev CLSum CLSum
1 21.17 3.92 12.77 29.57 3937 729 2374 5500

The standard error is larger than desired, mostly because of the very small
sample size in the M stratum where only 3 of the 9 proposed survey units could
be surveyed.

4.8.4 Multiple stratification

In some cases, we need to allocate units based upon two or more variables. We first find the optimal allocation for each variable and see how they differ from one another - in many cases, there may not be a big difference.

There are quite complicated schemes available to optimize allocation for two
or more variables. These will not be covered in this course.

4.8.5 Post-stratification

In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling, because the attribute used for stratification is only available once the unit is sampled. For example, we may wish to stratify baby births by birth weight to estimate the proportion of birth defects, or to stratify by family size when looking at day care costs.

In these cases, we take a simple random sample and post-stratify after the sample is taken. We assume that the stratum sizes Nh are still known, say from other sources.

The estimates of the population mean, total, etc., don’t change. However,
the variances must be increased to account for the fact that the sample size
in each stratum is no longer fixed. This introduces an additional source of
variation for the estimate, i.e. estimates will vary from sample to sample not
only because a new sample is drawn each time, but also because the sample size
within a stratum will change, leading to different precisions in each new sample.

This is covered in many standard books on sampling theory and is not cov-
ered more in this course.

4.8.6 Allocation and precision - revisited

A student wrote:

I’m a little confused about sample allocation in stratified sampling.

Earlier in the course, you stated that precision is independent of population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation, where larger strata get larger sample sizes.

Both statements are correct. If you are interested in estimates for individual
populations, then absolute sample size is important.

If you wanted equally precise estimates for BOTH Canada and the US, then you would take equal sample sizes from both populations, say 1000 from each, even though their overall population sizes differ by a factor of 10:1.

However, in stratified sampling designs, you may also be interested in the OVERALL estimate, over both populations. In this case, a proportional allocation, where sample size is allocated proportional to population size, often performs better. Here, the overall sample of 2000 people would be allocated proportional to the population sizes as follows:

Stratum Population Fraction of total population Sample size


US 300,000,000 91% 91% x2000=1818
Canada 30,000,000 9% 9% x2000=181
Total 330,000,000 100% 2000

Why does this happen? Well, if you are interested in the overall population, then the US results essentially drive everything and Canada has little effect on the overall estimate. Consequently, it doesn’t matter that the Canadian estimate is not as precise as the US estimate.

4.9 Ratio estimation in SRS - improving precision with auxiliary information

An association between the measured variable of interest and a second variable of interest can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A
suppose that growth in a sample plot is related to soil nitrogen content. A
simple random sample of plots is selected and the height of trees in the sample
plot is measured along with the soil nitrogen content in the plot. A regression
model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to
account for some of the variation in tree height as a function of soil nitrogen
content. This can be used to make precise predictions of the mean height in
stands if the soil nitrogen content can be easily measured. This method will
be successful if there is a direct relationship between the two variables, and,
the stronger the relationship, the better it will perform. This technique is often
called ratio-estimation or regression-estimation.

Notice that multi-phase designs often use an auxiliary variable, but this second variable is only measured on a subset of the sample units and should not be confused with the ratio estimators in this section.

Ratio estimation has two purposes. First, in some cases, you are interested
in the ratio of two variables, e.g. what is the ratio of wolves to moose in a region
of the province.

Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.

We define the population ratio as R = τY/τX = µY/µX. Here Y is the variable of interest; X is a secondary variable not really of interest. Note that notation differs among books - some books reverse the roles of X and Y.

Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratios and the ratio-of-means estimators. Suppose you had the following data for Y and X, which represent the counts of animals of species 1 and 2 taken on 3 different days:

Sample
1 2 3
Y 10 100 20
X 3 20 1

The mean-of-ratios estimator would compute the estimated ratio between Y and X as:

   R(mean-of-ratios) = (10/3 + 100/20 + 20/1) / 3 = 9.44

while the ratio-of-means would be computed as:

   R(ratio-of-means) = ((10 + 100 + 20)/3) / ((3 + 20 + 1)/3) = (10 + 100 + 20) / (3 + 20 + 1) = 5.42
Which is ”better”?

The mean-of-ratios estimator should be used when you wish to give equal weight to each pair of numbers, regardless of the magnitude of the numbers. For example, you may have three plots of land on which you measure Y and X, but because observer efficiencies differ among plots, the raw numbers cannot be compared. For example, on a cloudy, rainy day it is hard to see animals (first case), but on a clear, sunny day it is easy to see animals (second case). The actual numbers themselves cannot be combined directly.

The ratio-of-means estimator (considered in this chapter) gives every value of Y and X equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important, as we are interested in the ratio over the entire population of animals. Hence, by adding the values of Y and X first, each animal is given equal weight.
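The two estimators are easy to compare directly. A sketch in Python using the three-day counts above:

```python
Y = [10, 100, 20]   # species 1 counts
X = [3, 20, 1]      # species 2 counts

# Each day's ratio gets equal weight, regardless of the counts involved:
mean_of_ratios = sum(y / x for y, x in zip(Y, X)) / len(Y)

# Each animal gets equal weight, so the big day dominates:
ratio_of_means = sum(Y) / sum(X)

print(round(mean_of_ratios, 2), round(ratio_of_means, 2))
```

The two answers (9.44 vs 5.42) differ substantially, which is why the choice between them matters.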

When is a ratio estimator better - what other information is needed? The higher the correlation between Xi and Yi, the better the ratio estimator is compared to a simple expansion estimator. It turns out that the ratio estimator is the ‘best’ linear estimator if

• the relation between Yi and Xi is linear through the origin


• the variation around the regression line is proportional to the X value, i.e.
the spread around the regression line increases as X increases unlike an
ordinary regression line where the spread is assumed to be constant in all
parts of the line.

In practice, plot yi vs. xi from the sample and see what type of relation exists.

When can a ratio estimator be used? A ratio estimator will require that
another variable (the X variable) be measured on the selected sampling units.
Furthermore, if you are estimating the overall mean or total, the total value of
the X-variable over the entire population must also be known. For example, as
see in the examples to come, the total area must be known to estimate the total
animals once the density (animals/ha) is known.

4.9.1 Summary of main results

Quantity   Population value        Sample estimate       se
Ratio      R = τY/τX = µY/µX       r = ȳ/x̄               sqrt( (1/µ²X) × (s²diff/n) × (1 − f) )
Total      τY = R × τX             τ̂Y,ratio = r × τX      τX × sqrt( (1/µ²X) × (s²diff/n) × (1 − f) )
Mean       µY = R × µX             µ̂Y,ratio = r × µX      µX × sqrt( (1/µ²X) × (s²diff/n) × (1 − f) )

Notes

Don’t be alarmed by the apparent complexity of the formulae above. They
are relatively simple to implement in spreadsheets.


• The term s²diff = Σ(yi − r·xi)²/(n − 1), where the sum runs over the n
sampled units, is computed by creating a new column yi − r·xi and finding
the (sample standard deviation)² of this new derived variable. This will be
illustrated in the examples.
• In some cases µ²X in the denominator may not be known, and its estimate
x̄² can be used in its place. There doesn’t seem to be any empirical evidence
that either is better.
• The term τ²X/µ²X reduces to N².

• Confidence intervals: Confidence limits are found in the usual fashion.
In general, the distribution of R is positively skewed and so the upper
bound is usually too small. This skewness is caused by the variation in
the denominator of the ratio. For example, suppose that a random
variable (Z) has a uniform distribution between 0.5 and 1.5, centered on 1.
The inverse of the random variable (i.e. 1/Z) now ranges between 0.667
and 2 - no longer symmetrical around 1. So if a symmetric confidence
interval is created, the width will tend not to match the true distribution.
This skewness is not generally a problem if the sample size is at least 30
and the relative standard errors of ȳ and x̄ are both less than 10%.
• Sample size determination: We once again invert the formula for the
se of the ratio, but this is likely easiest done on a spreadsheet using
trial and error or the solver feature.
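The formulae in the table above can be collected into a short function. This is an illustrative sketch only (Python; the course materials use Excel/SAS/JMP), using x̄² in place of the unknown µ²X as noted above. The wolf/moose data from the next example serve as a check:

```python
import math

def ratio_estimates(y, x, N, tau_X=None):
    """Ratio estimator for a SRSWOR sample: returns r, se(r), and
    (if tau_X, the population total of X, is known) the estimated
    total of Y with its se."""
    n = len(y)
    xbar = sum(x) / n             # used as a stand-in for the unknown mu_X
    r = sum(y) / sum(x)           # same as ybar/xbar
    # s^2_diff: sample variance of the derived variable y_i - r*x_i
    diffs = [yi - r * xi for yi, xi in zip(y, x)]
    s2_diff = sum(d * d for d in diffs) / (n - 1)
    f = n / N                     # sampling fraction
    se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - f))
    if tau_X is None:
        return r, se_r
    return r, se_r, r * tau_X, se_r * tau_X

# Wolf/moose data from the next example (N = 200 sub-areas):
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
r, se_r = ratio_estimates(wolves, moose, N=200)
print(round(r, 5), round(se_r, 5))  # 0.03217 0.00244
```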

4.9.2 Example - wolf/moose ratio

[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs inter-
changes the use of x and y in the ratio.]

Wildlife ecologists interested in measuring the impact of wolf predation on
moose populations in BC obtained estimates by aerial counting of the population
size of wolves and moose on 11 sub-areas (all roughly equal size) selected as
SRSWOR from a total of 200 sub-areas in the game management zone.

In this example, the actual ratio of wolves to moose is of interest.

Here are the raw data:

Sub-areas Wolves Moose


1 8 190
2 15 370
3 9 460


4 27 725
5 14 265
6 3 87
7 12 410
8 19 675
9 7 290
10 10 370
11 16 510

What is the population and parameter of interest?

As in previous situations, there is some ambiguity:

• The population of interest is the 200 sub-areas in the game-management
zone. The sampling units are the 11 sub-areas. The response variables are
the wolf and moose populations in the game management sub-area. We
are interested in the wolf/moose ratio.
• The populations of interest are the moose and wolves. If individual mea-
surements were taken of each animal, then this definition would be fine.
However, only the total number of wolves and moose within each sub-area
are counted - hence a more proper description of this design would be a
cluster design. As you will see in a later section, the analysis of a cluster
design starts by summing to the cluster level and then treating the clusters
as the population and sampling unit as is done in this case.

Having said this, do the number of moose and wolves measured on each
sub-area include young moose and young wolves or just adults? How will im-
migration and emigration be taken care of?

What was the frame? Was it complete?

The frame consists of the 200 sub-areas of the game management zone. Pre-
sumably these 200 sub-areas cover the entire zone, but what about emigration
and immigration? Moose and wolves may move into and out of the zone.

What was the sampling design?

It appears to be an SRSWOR design - the sampling units are the sub-areas
of the zone.

How did they determine the counts in the sub-areas? Perhaps they simply
looked for tracks in the snow in winter - it seems difficult to get estimates from
the air in summer when there is lots of vegetation blocking the view.


Excel analysis

A copy of the workbook to perform the analysis of this data is called wolf.xls
and is available from the Sample Program Library at http://www.stat.sfu.
ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a summary shot of the spreadsheet:

Assessing conditions for a ratio estimator


The ratio estimator works well if the relationship between Y and X is linear,
through the origin, with increasing variance with X. Begin by plotting Y (wolves)
vs X (moose).

The data appears to satisfy the conditions for a ratio estimator.

Compute summary statistics for both Y and X

Refer to the screen shot of the spreadsheet. The Excel builtin functions are
used to compute the sample size, sample mean, and sample standard deviation
for each variable.

Compute the ratio

The ratio is computed using the formula for a ratio estimator in a simple
random sample, i.e. r = ȳ/x̄.

Compute the difference variable

Then for each observation, the difference between the observed Y (the actual
number of wolves) and the predicted Y based on the number of moose (Ŷi = r·Xi)
is found. Notice that the sum of the differences must equal zero.

The standard deviation of the differences will be needed to compute the
standard error for the estimated ratio.

Estimate the standard error of the estimated ratio

Use the formula given at the start of the section.

Final estimate

Our final result is that the estimated ratio is 0.03217 wolf/moose with an
estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval
would be computed in the usual fashion.
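"The usual fashion" here means estimate ± t × se, with n − 1 = 10 degrees of freedom. A minimal sketch (Python, for illustration; the t-value is the standard table entry t_{0.975,10} ≈ 2.228):

```python
# 95% CI for the estimated wolf/moose ratio: r +/- t * se(r).
r, se_r = 0.032169, 0.002438     # estimate and se from this example
t10 = 2.228                      # t_{0.975} with n - 1 = 10 df (from tables)
lower, upper = r - t10 * se_r, r + t10 * se_r
print(round(lower, 4), round(upper, 4))  # 0.0267 0.0376
```

These limits agree with the SAS confidence interval (0.026737, 0.037601) shown later in this section.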

Planning for future surveys

Our final estimate has an approximate rse of 0.00244/0.03217 = 7.5% which
is pretty good. You could try different n values to see what sample size would
be needed to get a rse of better than 5% or perhaps this is too precise and you
only want a rse of about 10%.

The key variable for the standard error is the total sample size (which you
can modify) and the standard deviation of the differences - which is estimated
from the previous survey.

As before, create a new spreadsheet where you can modify the total sample
size and see the effect upon precision. This will be left as an exercise for the
reader.
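The trial-and-error search can also be sketched in code (Python, for illustration only; the quantities are recomputed from the raw data of this example):

```python
import math

# Wolf/moose data from this example; N = 200 sub-areas in the zone.
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
N = 200

n0 = len(wolves)
r = sum(wolves) / sum(moose)
s2_diff = sum((y - r * x) ** 2 for y, x in zip(wolves, moose)) / (n0 - 1)
xbar = sum(moose) / n0        # stand-in for the unknown population mean of X

def rse(n):
    """Projected relative standard error of r for a future sample of size n."""
    se = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - n / N))
    return se / r

# Smallest n giving a rse of 5% or better:
n_needed = next(n for n in range(2, N + 1) if rse(n) <= 0.05)
print(n_needed)  # 24
```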

SAS Analysis

The above computations can also be done in SAS with the program wolf.sas
available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms. It uses Proc SurveyMeans which gives the out-
put contained in wolf.lst.

Here is the SAS program:

/* Example of a ratio estimator in simple random sampling */

/* Wildlife ecologists interested in measuring the impact of wolf


predation on moose populations in BC obtained estimates by aerial


counting of the population size of wolves and moose on 11
subareas (all roughly equal size) selected as SRSWOR from a total of
200 subareas in the game management zone.

In this example, the actual ratio of wolves to moose is of interest. */

title ’Wolf-moose ratio - ratio estimator in SRS design’;


options nodate nonumber noovp nocenter linesize=75;

data wolf;
input subregion wolf moose;
datalines;
1 8 190
2 15 370
3 9 460
4 27 725
5 14 265
6 3 87
7 12 410
8 19 675
9 7 290
10 10 370
11 16 510
;;;

proc print data=wolf;


title2 ’raw data’;
sum wolf moose;

proc plot data=wolf;


title2 ’plot to assess assumptions’;
plot wolf*moose;

proc surveymeans data=wolf ratio clm N=200;


title2 ’Estimate of wolf to moose ratio’;
/* ratio clm - request a ratio estimator with confidence intervals */
/* N=200 specifies total number of units in the population */
var moose wolf;
ratio wolf/moose; /* this statement ask for ratio estimator */

The SAS program again starts with the DATA step to read in the data.
Because the sampling weights are equal for all observations, it is not necessary
to include them when estimating a ratio (the weights cancel out in the formula


used by SAS).

The PLOT procedure creates the plot similar to that in the Excel spread-
sheet.

The RATIO statement in the SURVEYMEANS procedure requests the
computation of the ratio estimator.

Here is the output:

Wolf-moose ratio - ratio estimator in SRS design


raw data

Obs subregion wolf moose

1 1 8 190
2 2 15 370
3 3 9 460
4 4 27 725
5 5 14 265
6 6 3 87
7 7 12 410
8 8 19 675
9 9 7 290
10 10 10 370
11 11 16 510
==== =====
140 4352

Plot of wolf*moose. Legend: A = 1 obs, B = 2 obs, etc.


wolf
27 + A
|
26 +
|
25 +
|
24 +
|
23 +
|
22 +
|
21 +
|


20 +
|
19 + A
|
18 +
|
17 +
|
16 + A
|
15 + A
|
14 + A
|
13 +
|
12 + A
|
11 +
|
10 + A
|
9 + A
|
8 + A
|
7 + A
|
6 +
|
5 +
|
4 +
|
3 + A
---+-------------+-------------+-------------+-------------+--
0 200 400 600 800

moose

The SURVEYMEANS Procedure

Data Summary

Number of Observations 11


Statistics

Std Error Lower 95% Upper 95%


Variable Mean of Mean CL for Mean CL for Mean
------------------------------------------------------------------------
moose 395.636364 56.458162 269.839740 521.432988
wolf 12.727273 1.926872 8.433935 17.020611
------------------------------------------------------------------------

Ratio Analysis

Numerator Denominator Ratio Std Err 95% Confidence Interval


---------------------------------------------------------------------------
wolf moose 0.032169 0.002438 0.026737 0.037601
---------------------------------------------------------------------------

The results are identical to those from the spreadsheet.

Again, it is easier to do planning in the Excel spreadsheet rather than in the
SAS program.

JMP Analysis

The JMP data table is available here in the file wolf.jmp from the Sample Pro-
gram Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

CAUTION. Ordinary regression estimation from standard statistical packages
provides only an APPROXIMATION to the correct analysis of survey data.
There are two problems in using standard statistical packages for regression and
ratio estimation of survey data:

• Unable to use a finite population correction factor. This is usually not a
problem unless the sample size is large relative to the population size.
• Wrong error structure. Standard regression analyses assume that the vari-
ance around the regression or ratio line is constant. In many survey prob-
lems this is not true. This can be partially alleviated through the use
of weighted regression, but this still does not completely fix the problem.
For further information about the problems of using standard statistical
software packages in survey sampling please refer to the article at http:
//www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.


Because the ratio estimator assumes that the variance of the response in-
creases with the value of X, a new column representing the inverse of the X
variable (i.e. 1/the number of moose) has been created. Be sure all variables
are continuously scaled:

The Analyze->Fit Y-by-X platform will be used. The Y variable is the
number of wolves; the X variable is the number of moose. Be sure to specify
that the inverse of the X variable (1/X) is the weighting variable:


The graph looks like it is linear through the origin which is one of the assump-
tions of the ratio estimator. Then use Fit Special:


and select the no intercept option to force the regression line through the origin:


This gives the following output:


We see that the estimated ratio (.032 wolves/moose) matches the Excel output,
but the estimated standard error (.0026) does not quite match Excel. The
difference is slightly larger than can be accounted for by not using the finite
population correction factor.

As a matter of interest, if you use the Analyze->Fit Y-by-X platform WITHOUT
using the inverse of the X variable as the weighting variable, you obtain
an estimated ratio of .0317 (se .0022). All of these estimates are similar and it
likely makes very little difference which is used.

Using JMP the hard way

2006
c Carl James Schwarz 120
CHAPTER 4. SAMPLING

It is possible to reproduce the ratio estimator and its standard error exactly.
As in previous sections, the use of JMP is a bit clumsy.

Use the Tables->Summary command to get the total wolves and total moose as in
previous examples. The estimate is easily obtained as r = ȳ/x̄ =
(total wolves)/(total moose). Create a new column to compute the ratio giving:

Hence our estimate of the wolf/moose ratio is 0.03217 wolves/moose. In
order to compute a se we create a new column in the ORIGINAL data where
we compute yi − r·xi = wolvesi − r × moosei.


Find the standard deviation of this new column again using the Tables->Summary
command. Compute the se as:

This gives the final result, which matches the previous results.


Post mortem

No population numbers can be estimated using the ratio estimator in this case
because of a lack of suitable data.

In particular, if you had wanted to estimate the total wolf population, you
would have to use the simple inflation estimator that we discussed earlier unless
you had some way of obtaining the total number of moose that are present in
the ENTIRE management zone. This seems unlikely.

However, refer to the next example, where the appropriate information is
available.

4.9.3 Example - Grouse numbers - using a ratio estimator to estimate a population total

In some cases, a ratio estimator is used to estimate a population total. In these
cases, the improvement in precision is caused by the close relationship between
two variables.

Note that the population total of the auxiliary variable will have to be known
in order to use this method.

Grouse Numbers

A wildlife biologist has estimated the grouse population in a region containing
isolated areas (called pockets) of bush as follows: She selected 12 pockets
of bush at random, and attempted to count the numbers of grouse in each of
these. (One can assume that the grouse are almost all found in the bush, and
for the purpose of this question, that the counts were perfectly accurate.) The
total number of pockets of bush in the region is 248, comprising a total area of
3015 hectares. Results are as follows:


Area Number
(ha) Grouse
8.9 24
2.7 3
6.6 10
20.6 36
3.7 8
4.1 8
25.8 60
1.8 5
20.1 35
14.0 34
10.1 18
8.0 22

What is the population of interest and parameter to be estimated?

As before, there is some ambiguity:

• The population of interest are the pockets of brush in the region. The
sampling unit is the pocket of brush. The number of grouse in each pocket
is the response variable.
• The population of interest is the grouse. These happen to be clustered
into pockets of brush. This leads back to the previous case.

What is the frame?

Here the frame is explicit - the set of all pockets of bush. It isn't clear if all
grouse will be found in these pockets - will some be itinerant and hence missed?
What about movement between pockets of bush while the counts are being made?

Summary statistics

Variable    n    mean    std dev
area       12   10.53      7.91
grouse     12   21.92     16.95

Simple inflation estimator ignoring the pocket areas

Using our earlier results for the simple inflation estimator, our estimate of the
total number of grouse is τ̂ = N × ȳ = 248 × 21.92 = 5435.33, with an estimated
se of N × sqrt( (s²/n)(1 − f) ) = 248 × sqrt( (16.95²/12)(1 − 12/248) ) = 1183.4.

The estimate isn't very precise with a rse of 1183.4/5435.3 = 22%.
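As a check, the inflation estimator can be computed directly from the counts (Python, for illustration; the results match the values above up to rounding):

```python
import math
import statistics

grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N, n = 248, len(grouse)

tau_hat = N * statistics.mean(grouse)          # inflation estimator N * ybar
se_tau = N * math.sqrt(statistics.variance(grouse) / n * (1 - n / N))
print(round(tau_hat, 1), round(se_tau, 1))  # 5435.3 1183.5
```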


Ratio estimator - why?

Why did the inflation estimator do so poorly? Part of the reason is the relatively
large standard deviation in the number of grouse in the pockets. Why does this
number vary so much?

It seems reasonable that larger pockets of brush will tend to have more
grouse. Perhaps we can do better by using the relationship between the area of
the bush and the number of grouse through a ratio estimator.

Excel analysis

An Excel worksheet is available in grouse.xls from the Sample Program Library
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Preliminary plot to assess if ratio estimator will work

First plot numbers of grouse vs. area and see if this has a chance of succeeding.

The graph shows a linear relationship, through the origin. There is some
evidence that the variance is increasing with X (area of the plot).

Find the ratio between grouse numbers and area

The spreadsheet is set up similarly to the previous example:

The total of the X variable (area) will need to be known.


As before, you find summary statistics for X and Y, compute the ratio esti-
mate, find the difference variables, find the standard deviation of the difference
variable, and find the se of the estimated ratio.

The estimated ratio is: r = ȳ/x̄ = 21.92/10.53 = 2.081 grouse/ha.

The se of r is found as

se(r) = sqrt( (1/x̄²) × (s²diff/n) × (1 − f) ) = sqrt( (1/10.53²) × (4.7464²/12) × (1 − 12/248) ) = 0.1269 grouse/ha.

Expand ratio by total of X

In order to estimate the population total of Y, you now multiply the estimated
ratio by the population total of X. We know the pockets cover 3015
ha, and so the estimated total number of grouse is found by τ̂Y = τX × r =
3015 × 2.081 = 6273.3 grouse.

To estimate the se of the total, multiply the se of r by 3015 as well: se(τ̂Y) =
τX × se(r) = 3015 × 0.1269 = 382.6 grouse.

The precision is much improved compared to the simple inflation estimator.
This improvement is due to the very strong relationship between the number of
grouse and the area of the pockets.
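The whole chain of calculations can be verified directly (Python, for illustration; data from the table above, with τX = 3015 ha known):

```python
import math

area = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N, n, tau_X = 248, len(area), 3015            # 248 pockets totalling 3015 ha

r = sum(grouse) / sum(area)                   # estimated grouse per hectare
s2_diff = sum((y - r * x) ** 2 for y, x in zip(grouse, area)) / (n - 1)
xbar = sum(area) / n
se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - n / N))

print(round(r, 3), round(se_r, 4))                   # 2.081 0.1269
print(round(r * tau_X, 1), round(se_r * tau_X, 1))   # 6273.3 382.6
```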

Sample size for future surveys

If you wish to investigate different sample sizes, the simplest way would be
to modify the cell corresponding to the count of the differences. This will be
left as an exercise for the reader.

The final ratio estimate has a rse of about 6% - quite good. It is relatively
straightforward to investigate the sample size needed for a 5% rse. We find
this to be about 17 pockets.

SAS analysis

The analysis is done in SAS using the program grouse.sas from the Sample Pro-
gram Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Here is a program listing:


/* Example of using SAS to compute a ratio estimator in survey sampling */

/* A wildlife biologist has estimated the grouse population


in a region containing isolated areas (called pockets) of
bush as follows: She selected 12 pockets of bush at random, and
attempted to count the numbers of grouse in each of these.
(One can assume that the grouse are almost all found in the bush, and for the
purpose of this question, that the counts were perfectly accurate.)
The total number of pockets of bush in the region is 248,
comprising a total area of 3015 hectares. */

title ’Number of grouse - Ratio estimator’;


options nodate nonumber noovp nocenter linesize=75;

data grouse;
input area grouse; /* sampling weights not needed */
datalines;
8.9 24
2.7 3
6.6 10
20.6 36
3.7 8
4.1 8
25.8 60
1.8 5
20.1 35
14.0 34
10.1 18
8.0 22
;;;;

proc print data=grouse;


title2 ’raw data’;

proc plot data=grouse;


plot grouse * area;

proc surveymeans data=grouse ratio clm N=248;


/* the ratio clm keywords request a ratio estimator and a confidence interval. */
title2 ’Estimation using a ratio estimator’;
var grouse area;
ratio grouse / area;
ods output ratio=outratio; /* extract information so that total can be estimated */

data outratio;
/* compute estimates of the total */


set outratio;
Est_total = ratio * 3015;
Se_total = stderr* 3015;
UCL_total = uppercl*3015;
LCL_total = lowercl*3015;
format est_total se_total ucl_total lcl_total 7.1;
format ratio stderr lowercl uppercl 7.3;

proc print data=outratio split=’_’;


title2 ’the computed estimates’;
var ratio stderr lowercl uppercl Est_total Se_total LCL_total UCL_total;

The DATA step reads in the data. It is not necessary to include a computa-
tion of the sampling weight if the data are collected in a simple random sample
for a ratio estimator – the weights will cancel out in the formulae used by SAS.

The SURVEYMEANS procedure can estimate the ratio of grouse/ha but
cannot directly estimate the population total. The ODS statement redirects the
results from the RATIO statement to a new dataset that is processed further
to multiply by the total area of the pockets.

The output is as follows:

Number of grouse - Ratio estimator


raw data

Obs area grouse

1 8.9 24
2 2.7 3
3 6.6 10
4 20.6 36
5 3.7 8
6 4.1 8
7 25.8 60
8 1.8 5
9 20.1 35
10 14.0 34
11 10.1 18
12 8.0 22
Number of grouse - Ratio estimator
raw data

Plot of grouse*area. Legend: A = 1 obs, B = 2 obs, etc.


grouse |
|
60 + A
|
|
|
|
|
|
50 +
|
|
|
|
|
|
40 +
|
|
| AA
| A
|
|
30 +
|
|
|
| A
|
| A
20 +
| A
|
|
|
|
|
10 + A
| AA
|
| A
|
| A
|
0 +


|
-+----------+----------+----------+----------+----------+----------+
0 5 10 15 20 25 30

area
Number of grouse - Ratio estimator
Estimation using a ratio estimator

The SURVEYMEANS Procedure

Data Summary

Number of Observations 12

Statistics

Std Error Lower 95% Upper 95%


Variable Mean of Mean CL for Mean CL for Mean
------------------------------------------------------------------------
grouse 21.916667 4.772130 11.413279 32.420054
area 10.533333 2.227746 5.630097 15.436570
------------------------------------------------------------------------

Ratio Analysis

Numerator Denominator Ratio Std Err 95% Confidence Interval


---------------------------------------------------------------------------
grouse area 2.080696 0.126893 1.801406 2.359986
---------------------------------------------------------------------------
Number of grouse - Ratio estimator
the computed estimates

Est Se LCL UCL


Obs Ratio StdErr LowerCL UpperCL total total total total

1 2.081 0.127 1.801 2.360 6273.3 382.6 5431.2 7115.4

The results are exactly the same as before.

Again, it is easiest to do the sample size computations in Excel.


JMP Analysis

The JMP data table is available here in the file grouse.jmp from the Sam-
ple Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms. The data file contains both variables and the derived variable
1/area:

Estimating total grouse using inflation estimator

As in Excel, let us first use the simple inflation estimate by finding the
average number of grouse/pocket and then expanding by the number of pockets.
The Analyze->Distribution platform is used to estimate the mean grouse/pocket:

The estimated mean number of grouse per pocket is 21.9 (se 4.9). The estimated
total number is found by multiplying the mean number per pocket by the total
number of pockets (N = 248) to give an estimated total of 5435 (se 1213) grouse.
The standard error is larger than that computed by Excel because of the lack
of the finite population correction.

Estimating total grouse using ratio estimator

We must first estimate the ratio (grouse/hectare), and then expand this to
estimate the overall number of grouse.

CAUTION. Ordinary regression estimation from standard statistical packages
provides only an APPROXIMATION to the correct analysis of survey data.
There are two problems in using standard statistical packages for regression and
ratio estimation of survey data:

• Unable to use a finite population correction factor. This is usually not a
problem unless the sample size is large relative to the population size.
• Wrong error structure. Standard regression analyses assume that the vari-
ance around the regression or ratio line is constant. In many survey prob-
lems this is not true. This can be partially alleviated through the use
of weighted regression, but this still does not completely fix the problem.
For further information about the problems of using standard statistical
software packages in survey sampling please refer to the article at http:
//www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Because the ratio estimator assumes that the variance of the response in-
creases with the value of X, a new column representing the inverse of the X
variable (i.e. 1/area of pocket) has been created. Be sure all variables are
continuously scaled:

The Analyze->Fit Y-by-X platform will be used. The Y variable is the
number of grouse; the X variable is the area of the pocket. Be sure to specify
that the inverse of the X variable (1/X) is the weighting variable:


The graph looks like it is linear through the origin which is one of the assump-
tions of the ratio estimator. Then use Fit Special: and select the no intercept
option to force the regression line through the origin:

The final estimates are found:


The estimated density is 2.081 (se .123) grouse/hectare. The point estimate is
bang on, and the estimated se is within 1% of the correct se.

This now needs to be multiplied by the total area of the pockets (3015 ha)
which gives an estimated total number of grouse of 6274 (se 371) grouse. [Again
the estimated se is slightly smaller because of the lack of a finite population
correction.]

The ratio estimator is much more precise than the inflation estimator because
of the strong relationship between the number of grouse and the area of the
pocket.


Post mortem - a question to ponder

What if it were to turn out that grouse population size tended to be proportional
to the perimeter of a pocket of bush rather than its area? Would using the above
ratio estimator based on a relationship with area introduce serious bias into the
ratio estimate, increase the standard error of the ratio estimate, or do both?

4.10 Additional ways to improve precision

This section will not be examined on the exams or term tests

4.10.1 Using both stratification and auxiliary variables

It is possible to use both methods to improve precision. However, this comes at
a cost of increased computational complexity.

There are two ways of combining ratio estimators in stratified simple random
sampling.

1. combined ratio estimate: Estimate the numerator and denominator
using stratified random sampling and then form the ratio of these two
estimates:

   r_stratified,combined = µ̂Y,stratified / µ̂X,stratified

   and

   τ̂Y,stratified,combined = τX × µ̂Y,stratified / µ̂X,stratified

   We won't consider the estimates of the se in this course, but they can be
   found in any textbook on sampling.
2. separate ratio estimate: form a ratio estimate of the total for each stratum,
and form a grand ratio by taking a weighted average of these estimates.
Note that we weight by the covariate totals rather than the stratum sizes.
We get the following estimators for the grand ratio and grand total:

   r_stratified,separate = (1/τX) × Σ_h τXh × r_h   (sum over strata h = 1, ..., H)

   and

   τ̂Y,stratified,separate = Σ_h τXh × r_h

   Again, we won't worry about the estimates of the se.

Why use one over the other?

• You need stratum totals of the covariate for the separate estimate, but only
the population total for the combined estimate.
• The combined ratio is less subject to risk of bias (see Cochran, p. 165 and
following). In general, the biases in the separate estimator are added together,
and if they fall in the same direction the total bias can be substantial. In
the combined estimator these biases are reduced through stratification of the
numerator and denominator.
• When the ratio estimate is appropriate (regression through the origin and
variance proportional to the covariate), the combined ratio estimator will
have greater variance than the separate ratio estimator unless R is relatively
constant from stratum to stratum. However, as noted above, the bias may be
more severe for the separate ratio estimator. You must consider the combined
effects of bias and precision, i.e. the MSE.
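The mechanics of the two estimators can be sketched as follows (Python, for illustration only; the two strata, their counts, and the covariate totals below are invented solely to show the computations):

```python
# Hypothetical: 2 strata with known sizes Nh and known covariate totals tau_Xh.
strata = [
    {"Nh": 100, "tau_Xh": 620, "y": [10, 14], "x": [5, 7]},
    {"Nh": 50,  "tau_Xh": 540, "y": [40, 44], "x": [10, 12]},
]
N = sum(s["Nh"] for s in strata)
tau_X = sum(s["tau_Xh"] for s in strata)

# Separate: weight the per-stratum ratios r_h by the covariate totals tau_Xh.
tau_sep = sum(s["tau_Xh"] * (sum(s["y"]) / sum(s["x"])) for s in strata)

# Combined: form stratified means of Y and X first, then take one ratio.
mu_Y = sum(s["Nh"] / N * (sum(s["y"]) / len(s["y"])) for s in strata)
mu_X = sum(s["Nh"] / N * (sum(s["x"]) / len(s["x"])) for s in strata)
tau_comb = tau_X * mu_Y / mu_X

print(round(tau_sep, 1), round(tau_comb, 1))  # 3301.8 3328.7
```

The two answers differ here because the per-stratum ratios differ between the invented strata; with a common ratio R across strata the estimators would agree closely.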

4.10.2 Regression Estimators

A ratio estimator works well when the relationship between Yi and Xi is linear,
through the origin, with the variance of observations about the ratio line in-
creasing with X. In some cases, the relationship may be linear, but not through
the origin.

In these cases, the ratio estimator is generalized to a regression estimator
where the linear relationship is no longer constrained to go through the origin.

We won’t be covering this in this course.

Regression estimators are also useful if there is more than one X variable.

Whenever you use a regression estimator, be sure to plot y vs x to assess if
the assumptions for a ratio estimator are reasonable.
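Although the details are not covered in this course, the basic mechanics can be sketched: fit the least-squares slope b and adjust the sample mean of Y by how far x̄ falls from the known population mean µX (Python, for illustration; the small data set and the value of µX below are hypothetical):

```python
# Regression estimator of the mean of Y: mu_hat = ybar + b * (mu_X - xbar).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
mu_X = 2.2                       # known population mean of X (hypothetical)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))    # least-squares slope
mu_hat_reg = ybar + b * (mu_X - xbar)
print(round(b, 2), round(mu_hat_reg, 3))  # 1.94 4.418
```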

CAUTION: If ordinary statistical packages are used to do regression analysis
on survey data, you could obtain misleading results because the usual packages
ignore the way in which the data were collected. Virtually all standard regression
packages assume you've collected data under a simple random sample.
If your sampling design is more complex, e.g. a stratified design, cluster design,
multi-stage design, etc., then you should use a package specifically designed for
the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.

4.10.3 Sampling with unequal probability - pps sampling

All of the designs discussed in previous sections have assumed that each sample
unit was selected with equal probability. In some cases, it is advantageous
to select units with unequal probabilities, particularly if they differ in their
contribution to the overall total. This technique can be used with any of the
sampling designs discussed earlier. An unequal probability sampling design can
lead to smaller standard errors (i.e. better precision) for the same total effort
compared to an equal probability design. For example, forest stands may be
selected with probability proportional to the area of the stand (i.e. a stand of
200 ha will be selected with twice the probability that a stand of 100 ha in size)
because large stands contribute more to the overall population and it would be
wasteful of sampling effort to spend much effort on smaller stands.
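The mechanics of probability-proportional-to-size selection can be sketched as follows (a Python illustration with made-up stand areas; this is not part of the course software):

```python
import random

# Hypothetical stand areas in hectares.  Selection probability is
# proportional to area, so a 200 ha stand is twice as likely to be
# drawn as a 100 ha stand.
areas = [200, 100, 50, 150]
probs = [a / sum(areas) for a in areas]

random.seed(1)
# draw one stand with probability proportional to its area
stand = random.choices(range(len(areas)), weights=areas, k=1)[0]
```

The estimation formulas that go with such a draw (Horvitz-Thompson style weighting) are beyond this course; the sketch only shows how the selection probabilities are assigned.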

The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit's contribution to the total. For example, in probability proportional to prediction (3P
sampling), all trees in a small area are visited. A simple, cheap characteristic
is measured which is used to predict the value of the tree. A sub-sample of
the trees is then selected with probability proportional to the predicted value,
remeasured using a more expensive measuring device, and the relationship be-
tween the cheap and expensive measurement in the second phase is used with
the simple measurement from the first phase to obtain a more precise estimate
for the entire area. This is an example of two-phase sampling with unequal
probability of selection.

Please consult with a sampling expert before implementing or analyzing an unequal probability sampling design.

4.11 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For


example, some animals congregate in herds or family units. It is often convenient
to select a random sample of herds and then measure every animal in the herd.
This is not the same as a simple random sample of animals because individual
animals are not randomly selected; the herds are the sampling unit. The strip-
transect example in the section on simple random sampling is also a cluster
sample; all plots along a randomly selected transect are measured. The strips
are the sampling units, while plots within each strip are sub-sampling units.
Another example is circular plot sampling; all trees within a specified radius of
a randomly selected point are measured. The sampling unit is the circular plot
while trees within the plot are sub-samples.

Some examples of cluster samples are:

• urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m² along the line.
• aerial surveys - a plane flies along a line and observers count the number
of animals they see in a strip on both sides of the aircraft.
• forestry surveys - often circular plots are located on the ground and ALL trees within that plot are measured.

Pitfall A cluster sample is often mistakenly analyzed using methods for sim-
ple random surveys. This is not valid because units within a cluster are typically
positively correlated. The effect of this erroneous analysis is to come up with
an estimate that appears to be more precise than it really is, i.e. the estimated
standard error is too small and does not fully reflect the actual imprecision in
the estimate.

Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster totals as the response measure.

4.11.1 Sampling plan

In simple random sampling, a frame of all elements was required in order to draw a random sample, and individual units are selected one at a time. In many cases this is impractical: it may not be possible, or may be logistically infeasible, to list all of the individual units. Often the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - you almost always measure things at the individual quadrat level, but the actual sampling unit is the whole transect (the cluster).

This problem is analogous to pseudo-replication in experimental design - the
breaking of the transect into individual quadrats is like having multiple fish
within the tank.

A visual comparison of a simple random sample vs a cluster sample

You may find it useful to compare a simple random sample of 24 vs a cluster sample of 24 using the following visual plans:

Select a sample of 24 in each case.


Simple Random Sampling

Describe how the sample was taken.


Cluster Sampling

First, the clusters must be defined. In this case, the units are naturally
clustered in blocks of size 8. The following units were selected.

Describe how the sample was taken. Note the differences between stratified
simple random sampling and cluster sampling!


4.11.2 Advantages and disadvantages of cluster sampling compared to SRS

• Advantage It may not be feasible to construct a frame for every elemental unit, but it may be possible to construct a frame for the larger units, e.g. it is difficult to locate individual quadrats upon the sea floor, but easy to lay out transects from the shore.
• Advantage Cluster sampling is often more economical. Because all units
within a cluster are close together, travel costs are much reduced.
• Disadvantage Cluster sampling has a higher standard error than an SR-
SWOR of the same total size because units are typically homogeneous
within clusters. The cluster itself serves as the sampling unit. For the
same number of units, cluster sampling almost always gives worse preci-
sion. This is the problem that we have seen earlier of pseudo-replication.
• Disadvantage A cluster sample is more difficult to analyze, but with
modern computing equipment, this is less of a concern. The difficulties
are not arithmetic but rather being forced to treat the clusters as the
survey unit - there is a natural tendency to think that data are being
thrown away.

4.11.3 Notation

The key thing to remember is to work with the cluster TOTALS.

Traditionally, the cluster size is denoted by M rather than by X but, as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.

Attribute            Population value   Sample value
Number of clusters   N                  n
Cluster totals       τ_i                y_i     (NOTE: τ_i and y_i are the TOTALS for cluster i)
Cluster sizes        M_i                m_i
Total area           M

4.11.4 Summary of main results

The key concept in cluster sampling is to treat the cluster TOTAL as the re-
sponse variable and ignore all the individual values within the cluster. Because
the clusters are a simple random sample from the population of clusters, simply
apply all the results you had before for a SRS to the CLUSTER TOTALS.

If the clusters are roughly equal in size, a simple inflation estimator can be used. In many cases there is a strong relationship between the size of the cluster and the cluster total - in these cases a ratio estimator would likely be suitable, where the X variable is the cluster size. If there is no relationship between cluster size and the cluster total, a simple inflation estimator can be used even when the cluster sizes are unequal. You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship holds.

You will also have to know the size of each cluster - this is simply the number
of sub-units within each cluster.

The perils of ignoring a cluster design This design is used frequently, but often analyzed incorrectly. The key thing to note is that the sampling unit is a cluster, not the individual quadrats. In general, whenever the quadrats have been gathered using a transect of some sort, you have a cluster sampling problem.

The biggest danger of ignoring the clustering aspects and treating the indi-
vidual quadrats as if they came from an SRS is that, typically, your estimated se
will be too small. That is, the true standard error from your design may be sub-
stantially larger than your estimated standard error. The precision is thought
to be far better than is justified based on the survey results. This has been seen
before - refer to the paper by Underwood where the dangers of estimation with
positively correlated data were discussed.

Extensions of cluster analysis - unequal size sampling In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total, and would be selected with a higher probability.

Computational formulae

Overall mean:
  Population value:  $\mu = \left.\sum_{i=1}^{N} \tau_i \right/ \sum_{i=1}^{N} M_i$
  Estimator:         $\hat{\mu} = \left.\sum_{i=1}^{n} y_i \right/ \sum_{i=1}^{n} m_i$
  Estimated se:      $se(\hat{\mu}) = \sqrt{\frac{1}{\bar{m}^2}\,\frac{s^2_{diff}}{n}\,(1-f)}$

Overall total:
  Population value:  $\tau = M \times \mu$
  Estimator:         $\hat{\tau} = M \times \hat{\mu}$
  Estimated se:      $se(\hat{\tau}) = \sqrt{M^2 \times \frac{1}{\bar{m}^2}\,\frac{s^2_{diff}}{n}\,(1-f)}$

where $\bar{m}$ is the average size of the sampled clusters and $f = n/N$.
• You never use the mean per unit within a cluster.


• The term $s^2_{diff} = \sum_{i=1}^{n} (y_i - \hat{\mu} m_i)^2 / (n-1)$ is again found in the same fashion as in ratio estimation - create a new variable which is the difference $y_i - \hat{\mu} m_i$, find the sample standard deviation of it, and then square the standard deviation.
• Sometimes the ratio of two variables measured within each cluster is re-
quired, e.g. you conduct aerial surveys to estimate the ratio of wolves to
moose - this has already been done in an earlier example! In these cases,
the actual cluster length is not used.
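To make the recipe concrete, here is a minimal sketch in Python (the quadrat-level data and all names are illustrative, not from the course software) that summarizes the data to cluster totals and applies the formulas above:

```python
import math
from collections import defaultdict

def cluster_ratio_estimate(cluster_ids, y, f=0.0):
    """Ratio-estimator analysis of a cluster sample.  `y` holds the
    quadrat-level responses and `cluster_ids` names the cluster each
    quadrat belongs to; f is the sampling fraction n/N of clusters.
    Returns (mu_hat, se): the estimated mean per quadrat and its se."""
    totals = defaultdict(float)    # cluster totals y_i
    sizes = defaultdict(int)       # cluster sizes m_i
    for c, v in zip(cluster_ids, y):
        totals[c] += v
        sizes[c] += 1
    n = len(totals)                            # number of clusters
    y_i = [totals[c] for c in totals]
    m_i = [sizes[c] for c in totals]
    mu_hat = sum(y_i) / sum(m_i)               # ratio of totals
    m_bar = sum(m_i) / n                       # average cluster size
    # diff column: y_i - mu_hat * m_i, as in ratio estimation
    diffs = [yi - mu_hat * mi for yi, mi in zip(y_i, m_i)]
    s2_diff = sum(d ** 2 for d in diffs) / (n - 1)
    se = math.sqrt((1 / m_bar ** 2) * (s2_diff / n) * (1 - f))
    return mu_hat, se
```

Note that the individual quadrat values are used only to build the cluster totals; after that the analysis never looks inside a cluster.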

Confidence intervals

As before, once you have an estimator for the mean and for the se, use the
usual ±2se rule. If the number of clusters is small, then some text books advise
using a t-distribution for the multiplier – this is not covered in this course.

Sample size determination

Again, this is no real problem - except that you will get a value for the
number of CLUSTERS, not the individual quadrats within the clusters.

4.11.5 Example - estimating the density of urchins

Red sea urchins are considered a delicacy and the fishery is worth several millions
of dollars to British Columbia.

In order to set harvest quotas and in order to monitor the stock, it is im-
portant that the density of sea urchins be determined each year.

To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds. Divers then swim along the transect, and roll a
1 m2 quadrat along the transect line and count the number of legal sized and
sub-legal sized urchins in the quadrat.

The number of possible transects is so large that the correction for finite
population sampling can be ignored.

The raw data is available in an ascii file at urchin.dat from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms. The data isn’t all listed here as it has over 1000 lines!

What is the population of interest and the parameter?


The population of interest is the sea urchins in the harvest area. These
happened to be (artificially) “clustered” into transects which are sampled. All
sea urchins within the cluster are measured.

The parameter of interest is the density of legal sized urchins.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible tran-
sects. Rather they pick random points along the shore and then lay the transects
out from that point.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines
while the quadrats measured within each cluster are similar to pseudo-replicates.
The measurements within a transect are not independent of each other and are
likely positively correlated (why?).

As the points along the shore were chosen using a simple random sample the
analysis proceeds as a SRS design on the cluster totals.

Excel Analysis

An Excel worksheet with the data and analysis is called urchin.xls and is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.


Summarize to cluster level

The key, first step in any analysis of a cluster survey is to first summarize the
data to the cluster level. You will need the cluster total and the cluster size (in
this case the length of the transect). The Pivot Table feature of Excel is quite
useful for doing this automatically. Unfortunately, you still have to play around
with the final table in order to get the data displayed in a nice format.


In many transect studies, there is a tendency to NOT record quadrats with 0 counts as they don't affect the cluster sum. However, you still have to know
the correct size of the cluster (i.e. how many quadrats), so you can’t simply
ignore these ‘missing’ values. In this case, you could examine the maximum
of the quadrat number and the number of listed quadrats to see if these agree
(why?).
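The same check can be sketched programmatically (a Python illustration with made-up records; quadrat numbers within each transect are assumed to run 1, 2, 3, ...):

```python
from collections import defaultdict

# Hypothetical quadrat records: (transect, quadrat_number)
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4)]

max_quad = defaultdict(int)    # largest quadrat number seen per transect
n_listed = defaultdict(int)    # number of rows actually listed per transect
for transect, quad in records:
    max_quad[transect] = max(max_quad[transect], quad)
    n_listed[transect] += 1

# If the largest quadrat number exceeds the number of rows, some
# (presumably zero-count) quadrats are missing from the listing.
suspect = [t for t in max_quad if max_quad[t] != n_listed[t]]
print(suspect)   # transect 2 is missing quadrats 2 and 3
```

The cluster size must count the missing zero-count quadrats, so any transect flagged here needs its size corrected before the analysis.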

Preliminary plot

Plot the cluster totals vs. the cluster size to see if a ratio estimator is appro-
priate, i.e. linear relationship through the origin with variance increasing with
cluster size.

The plot (not shown) shows a weak relationship between the two variables.

Summary Statistics

Compute the summary statistics on the cluster TOTALS. You will need the
totals over all sampled clusters of both variables.

sum(legal)   sum(quad)   n(transect)
      1507        1120            28

Compute the ratio

The estimated density is then $\widehat{density} = \frac{\mathrm{sum(legal)}}{\mathrm{sum(quad)}} = 1507/1120 = 1.345536$ urchins/$m^2$.

Compute the difference column

To compute the se, create the diff column as in the ratio estimation section and
find its standard deviation.

Compute the se of the ratio estimate


The estimated se is then found as: $se(\widehat{density}) = \sqrt{\frac{1}{n_{transects}} \times \frac{s^2_{diff}}{\overline{quad}^2}} = \sqrt{\frac{1}{28} \times \frac{48.09933^2}{40^2}} = 0.2272$ urchins/$m^2$.
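These hand computations are easy to verify from the summary statistics alone (a Python sketch; only the numbers quoted above are used):

```python
import math

n_transects = 28
total_legal = 1507        # sum of the cluster (transect) totals
total_quads = 1120        # sum of the cluster sizes
s_diff = 48.09933         # std deviation of the diff column

density = total_legal / total_quads    # 1.345536 urchins per m^2
quad_bar = total_quads / n_transects   # average transect length: 40 quadrats
# se of the ratio estimate, fpc ignored
se = math.sqrt((1 / n_transects) * s_diff ** 2 / quad_bar ** 2)
```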

(Optional) Expand final answer to a population total

In order to estimate the total number of urchins in the harvesting area, you
simply multiply the estimated ratio and its standard error by the area to be
harvested.


SAS Analysis

SAS v.8 has procedures for the analysis of survey data taken in a cluster design.
A program to analyze the data is urchin.sas and is available from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms.

Here is a program listing:

/* Example of a cluster sample */

/* This dataset consists of the results from a series
of transects conducted perpendicular to the shore.
As divers swam along the transect, they counted
the number of red sea urchins in 1 m**2 quadrats.

The variables are (from left to right):
transect, quadrat, legal size, sublegal sized.

If a quadrat is not present in this listing, then the count was 0
for both variables. It does NOT indicate that the quadrat
was not measured - rather that no urchins were found.

There was no transect numbered 5, 12, 17, 19, or 32. */

filename urchin 'urchin.dat'; /* name of datafile containing urchin data */

title 'Estimating urchin density - example of cluster analysis';

options nodate nonumber noovp nocenter linesize=75;

data urchin;
infile urchin firstobs=2 missover; /* the first record has the variable names */
input transect quadrat legal sublegal;
/* no need to specify sampling weights because transects are an SRS */

/***** First check to see if any transects are missing quadrats *************/

proc sort data=urchin; by transect;

proc means data=urchin noprint;


by transect;
var quadrat legal;
output out=check min=min max=max n=n sum(legal)=tlegal;


data check;
set check;
length problem $16.;
problem = ' ';
if max ^= n then problem = 'missing quadrat?';
drop _type_ _freq_;

proc print data=check;


title2 'check to see if any transect is missing quadrats';

proc plot data=check;


title2 'plot the relationship between the cluster total and cluster size';
plot tlegal*n$transect; /* use the transect number as the plotting character */

/****************************************************************************/

/* Now for the cluster analysis */

proc surveymeans data=urchin; /* do not specify a pop size as fpc is negligble */


cluster transect;
var legal;

Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary to specify the sampling weights.

The key feature of the SAS program is the use of the CLUSTER statement
to identify the clusters in the data.

The population number of transects was not specified as the finite population
correction is negligible.

Here are the results:

Estimating urchin density - example of cluster analysis


check to see if any transect is missing quadrats

Obs transect min max n tlegal problem

1 1 1 100 100 151


2 2 1 30 30 150
3 3 1 100 100 0
4 4 1 73 73 158
5 6 1 39 39 22
6 7 1 21 21 5
7 8 1 46 46 85
8 9 1 24 24 46
9 10 1 37 37 27
10 11 1 24 24 9
11 13 1 40 40 50
12 14 1 37 37 50
13 15 1 32 32 15
14 16 1 21 21 58
15 18 1 32 32 42
16 20 1 42 42 12
17 21 1 21 21 52
18 22 1 13 13 0
19 23 1 88 88 100
20 24 1 15 15 1
21 25 1 23 23 16
22 26 1 17 17 49
23 27 1 16 16 46
24 28 1 18 18 40
25 29 1 30 30 37
26 30 1 39 39 40
27 31 1 42 42 175
28 33 1 100 100 71
Estimating urchin density - example of cluster analysis
plot the relationship between the cluster total and cluster size

Plot of tlegal*n$transect. Symbol points to label.

tlegal |
|
180 +
| > 31
|
|
|
160 + > 4
|
| > 2 1 <
|
|
140 +
|
|
|
|
120 +
|
|
|
|
100 + > 23
|
|
|
| > 8
80 +
|
| 33 <
|
|
60 + > 16
|
| > 21 14 < > 13
| 27 <> 26 > 9
| > 18
40 + > 28 > 30
| > 29
|
| > 10
| > 6
20 +
| > 25 > 15
| > 20
| > 11
| > 7
0 + 22 < > 24 3 <
|
-+------------+------------+------------+------------+------------+-
0 20 40 60 80 100

n
Estimating urchin density - example of cluster analysis
plot the relationship between the cluster total and cluster size

The SURVEYMEANS Procedure

Data Summary

Number of Clusters 28
Number of Observations 1120
Statistics

Std Error Lower 95%


Variable N Mean of Mean CL for Mean
------------------------------------------------------------------------
legal 1120 1.345536 0.227248 0.879261
------------------------------------------------------------------------

Statistics

Upper 95%
Variable CL for Mean
------------------------
legal 1.811810
------------------------

The results are identical to above.

JMP Analysis

The urchin data is available in a JMP file urchin.jmp from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The raw data contains variables for the transect, the quadrat within each transect, and the
number of legal and sub-legal sized urchins.

Use the Tables->Summary platform to get the cluster totals by summing the number of legal sized urchins and counting the number of quadrats present.

This gives a table with one row for each transect:

(note that there was no transect 5, 12, 17, 19, or 32).

We compare the maximum(quadrat) number to the number of quadrat values actually recorded and see that they all match, indicating that no quadrats appear to be missing from the listing.

Now we are back to the case of a ratio estimator with the Y variable being the number of legal sized urchins measured on the transect, and the X variable being the size of the transect. As in the previous examples of a ratio estimator, we create a weighting variable equal to 1/X = 1/(size of transect).

We use the Analyze->Fit Y-by-X platform, and don't forget to specify 1/X as the weighting variable.

After the plot is created, we use the Fit Special option from the red triangle near the plot and request a line with an intercept of 0, which produces the final estimates.

The estimated density is 1.346 (se .216) urchins/m². The se is a bit smaller because of the lack of a finite population correction factor, but it is close to the se computed earlier.

Planning for future experiments

The rse of the estimate is 0.2272/1.3455 = 17% - not terrific. The determination of sample size is done in the same manner as in the ratio estimator case dealt with in earlier sections, except that it is the number of CLUSTERS that is found. If we wanted an rse near 5%, we would need almost 320 transects - this is likely too costly.
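The arithmetic behind that figure follows the usual rule that the rse falls with the square root of the number of clusters (a Python sketch using the estimates above):

```python
import math

n_old = 28
rse_old = 0.2272 / 1.3455    # current rse, about 17%
rse_target = 0.05

# rse scales as 1/sqrt(number of clusters), so solve for the new count
n_needed = n_old * (rse_old / rse_target) ** 2
print(math.ceil(n_needed))   # about 320 transects
```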


4.11.6 Example - estimating the total number of sea cucumbers

Sea cucumbers are considered a delicacy among some, and the fishery is of
growing importance.

In order to set harvest quotas and in order to monitor the stock, it is impor-
tant that the number of sea cucumbers in a certain harvest area be estimated
each year.

The following is an example taken from Griffith Passage in BC 1994.

To do this, the managers lay out a number of transects across the cucumber
harvest area. Divers then swim along the transect, and while carrying a 4 m
wide pole, count the number of cucumbers within the width of the pole during
the swim.

The number of possible transects is so large that the correction for finite
population sampling can be ignored.

Here is the summary information for each transect (the preliminary raw data is unavailable):

Transect Area    Sea Cucumbers
260 124
220 67
200 6
180 62
120 35
200 3
200 1
120 49
140 28
400 1
120 89
120 116
140 76
800 10
1460 50
1000 122
140 34
180 109
80 48


The total harvest area is 3,769,280 m2 as estimated by a GIS system.

The transects were laid out from one edge of the bed and the length of the
edge is 51,436 m. Note that because each transect was 4 m wide, the number
of transects is 1/4 of this value.

What is the population of interest and the parameter?

The population of interest is the sea cucumbers in the harvest area. These
happen to be (artificially) “clustered” into transects which are the sampling
unit. All sea cucumbers within the transect (cluster) are measured.

The parameter of interest is the total number of cucumbers in the harvest area.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects.
Rather they pick random points along the edge of the harvest area, and then
lay out the transect from there.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines while
the quadrats measured within each cluster are similar to pseudo-replicates. The
measurements within a transect are not independent of each other and are likely
positively correlated (why?).

Analysis - abbreviated

As the analysis is similar to the previous example, a detailed description of the Excel, SAS, or JMP versions will not be done.

The workbook cucumber.xls from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms illustrates the computations in Excel. There are three sheets. It also computes the two estimators when two potential outliers are deleted and for a second harvest area.

The SAS program is available in cucumber.sas and the relevant output in cucumber.lst. Because only the summary data is available, you cannot use the
CLUSTER statement of SurveyMeans. Rather, as noted earlier in the notes,
you form a ratio estimator based on the cluster totals.

A JMP file called cucumber.jmp is also available.

Summarize to cluster level


The key, first step in any analysis of a cluster survey is to first summarize the
data to the cluster level. You will need the cluster total and the cluster size (in
this case the area of the transect). This has already been done in the above
data.

Now this summary table is simply an SRSWOR from the set of all transects.
We first estimate the density, and then multiply by the area to estimate the total.

Note that after summarizing up to the transect level, this example proceeds in a fashion analogous to the grouse in pockets of brush example that we looked at earlier.

Preliminary Plot

A plot of the cucumber total vs the transect size shows a very poor rela-
tionship between the two variables. It will be interesting to compare the results
from the simple inflation estimator and the ratio estimator.

Simple Inflation Estimator

First, estimate the number ignoring the area of the transects by using a simple
inflation estimator.

The summary statistics that we need are:

n 19 transects
Mean 54.21 cucumbers/transect
std Dev 42.37 cucumbers/transect

We compute an estimate of the total as $\hat{\tau} = N\,\overline{y} = (51{,}436/4) \times 54.21 = 697{,}093$ sea cucumbers. [Why did we use 51,436/4 rather than 51,436?]

We compute an estimate of the se of the total as: $se(\hat{\tau}) = \sqrt{N^2\, s^2/n \times (1-f)} = \sqrt{(51{,}436/4)^2 \times 42.37^2/19} = 124{,}981$ sea cucumbers.

The finite population correction factor is so small we simply ignore it.

This gives a relative standard error (se/est) of 18%.
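These computations can be verified directly from the transect counts listed above (a Python sketch, ignoring the fpc):

```python
import math

# Sea cucumber counts for the 19 transects in Griffith Passage
cukes = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
         89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(cukes)             # 19 transects
N = 51_436 / 4             # number of possible 4 m wide transects

ybar = sum(cukes) / n      # 54.21 cucumbers per transect
s = math.sqrt(sum((y - ybar) ** 2 for y in cukes) / (n - 1))   # 42.37

total = N * ybar                          # about 697,093 cucumbers
se_total = math.sqrt(N ** 2 * s ** 2 / n)   # about 125,000 (fpc ignored)
rse = se_total / total                    # about 18%
```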

Ratio Estimator

We use the methods outlined earlier for ratio estimators from SRSWOR to get
the following summary table:


          area     cucumbers
Mean    320.00         54.21     per transect

The estimated density of sea cucumbers is then $\widehat{density} = \frac{\mathrm{mean(cucumbers)}}{\mathrm{mean(area)}} = 54.21/320.00 = 0.169$ cucumbers/$m^2$.

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, $s_{diff} = 73.63$. The estimated se of the ratio is then found as: $se(\widehat{density}) = \sqrt{\frac{s^2_{diff}}{n_{transects}} \times \frac{1}{\overline{area}^2}} = \sqrt{\frac{73.63^2}{19} \times \frac{1}{320^2}} = 0.053$ cucumbers/$m^2$.

We once again ignore the finite population correction factor.

In order to estimate the total number of cucumbers in the harvesting area, you simply multiply the above by the area to be harvested:

$\hat{\tau}_{ratio} = \mathrm{area} \times \widehat{density} = 3{,}769{,}280 \times 0.169 = 638{,}546$ sea cucumbers.

The se is found as: $se(\hat{\tau}_{ratio}) = \mathrm{area} \times se(\widehat{density}) = 3{,}769{,}280 \times 0.053 = 198{,}983$ sea cucumbers, for an overall rse of 31%.
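Again, the diff-column recipe can be checked from the transect data (a Python sketch):

```python
import math

# Transect areas (m^2) and sea cucumber counts for Griffith Passage
areas = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
         120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cukes = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
         89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(areas)

density = sum(cukes) / sum(areas)      # 1030/6080 = 0.169 cucumbers per m^2
# diff column: y_i - density * x_i
diffs = [y - density * x for y, x in zip(cukes, areas)]
s_diff = math.sqrt(sum(d ** 2 for d in diffs) / (n - 1))   # about 73.6
area_bar = sum(areas) / n              # 320 m^2 per transect
se_density = math.sqrt(s_diff ** 2 / n) / area_bar

total = 3_769_280 * density            # about 638,546 cucumbers
se_total = 3_769_280 * se_density
```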

Comparing the two approaches

Why did the ratio estimator do worse than the simple inflation estimator in this case, Griffith Passage? A plot of the number of sea cucumbers vs the area of the transect shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator.

In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than one-half of the ratio of their relative variations (CV = std dev/mean), i.e. if $\rho > \frac{CV(X)}{2\,CV(Y)}$. Here, half the ratio of the relative variations is 0.732, while the correlation between the two variables is only 0.041. Hence the ratio estimator will not do well.
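This criterion is easy to check from the data above (a Python sketch):

```python
import math

areas = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
         120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cukes = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
         89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(areas)

xbar, ybar = sum(areas) / n, sum(cukes) / n
sx = math.sqrt(sum((x - xbar) ** 2 for x in areas) / (n - 1))
sy = math.sqrt(sum((y - ybar) ** 2 for y in cukes) / (n - 1))
cov = sum((x - xbar) * (y - ybar) for x, y in zip(areas, cukes)) / (n - 1)

rho = cov / (sx * sy)                        # correlation, about 0.041
threshold = (sx / xbar) / (2 * sy / ybar)    # half the CV ratio, about 0.73

# The ratio estimator pays off only when rho exceeds the threshold,
# which is clearly not the case here.
```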

The Excel workbook also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best-laid plans can go astray - always plot the data.


A third worksheet in the workbook analyses the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide margin.

4.12 Multi-stage sampling - a generalization of cluster sampling

Not part of Stat403/650 Please consult with a sampling expert before im-
plementing or analyzing a multi-stage design.

4.12.1 Introduction

All of the designs considered above select a sampling unit from the population
and then do a complete measurement upon that item. In the case of cluster
sampling, this is facilitated by dividing the sampling unit into small observa-
tional units, but all of the observational units within the sampled cluster are
measured.

If the units within a cluster are fairly homogeneous, then it seems wasteful
to measure every unit. In the extreme case, if every observational unit within a
cluster was identical, only a single observational unit from the cluster needs to be
selected in order to estimate (without any error) the cluster total. Suppose, then, that the observational units within a cluster are not identical, but have some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than in every quadrat on the transect?

This method is called two-stage sampling. In the first stage, larger sampling
units are selected using some probability design. In the second stage, smaller
units within the selected first-stage units are selected according to a probability
design. The design used at each stage can be different, e.g. first stage units
selected using a simple random sample, but second stage units selected using a
systematic design as proposed for the urchin survey above.

This sampling design can be generalized to multi-stage sampling.

Some examples of multi-stage designs are:

• Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of
polygons is selected using a probability-proportional-to-size design. In the
selected polygons, ground measurement stations are selected on a 100 m
grid and crews measure standing timber at these selected ground stations.
• Urchin survey Transects are selected using a simple random sample
design. Every second or third quadrat is measured after a random starting
point.
• Clam surveys Beaches are divided into 1 ha sections. A random sample
of sections is selected and a series of 1 m2 quadrats are measured within
each section.
• Herring spawns biomass Schweigert et al. (1985, CJFAS, 42, 1806-
1814) used a two-stage design to estimate herring spawn in the Strait of
Georgia.
• Georgia Strait Creel Survey The Georgia Strait Creel Survey uses a
multi-stage design to select landing sites within strata, times of days to
interview at these selected sites, and which boats to interview in a survey
of angling effort on the Georgia Strait.

Some consequences of simple two-stage designs are:

• If the selected first-stage units are completely enumerated, then complete cluster sampling results.
• If every first-stage unit in the population is selected, then a stratified
design results.
• A complete frame is required for all first-stage units. However, a frame
of second-stage and lower-stage units need only be constructed for the
selected upper-stage units.
• The design is very flexible allowing (in theory) different selection methods
to be used at each stage, and even different selection methods within each
first stage unit.
• A separate randomization is done within each first-stage unit when select-
ing the second-stage units.
• Multi-stage designs are less precise than a simple random sample of the
same number of final sampling units, but more precise than a cluster
sample of the same number of final sampling units. [Hint: think of what
happens if the second-stage units are very similar.]


• Multi-stage designs are cheaper than a simple random sample of the same
number of final sampling units, but more expensive than a cluster sample
of the same number of final sampling units. [Hint: think of the travel costs
in selecting more transects or measuring quadrats within a transect.]
• As in all sampling designs, stratification can be employed at any level
and ratio and regression estimators are available. As expected, the theory
becomes more and more complex, the more "variations" are added to the
design.

The primary incentives for multi-stage designs are that

1. frames of the final sampling units are typically not available


2. it often turns out that most of the variability in the population occurs
among first-stage units. Why spend time and effort measuring lower-stage
units that are relatively homogeneous within the first-stage unit?

4.12.2 Notation

A sample of n first-stage units (FSU) is selected from a total of N first-stage


units. Within the ith first-stage unit, mi second-stage units (SSU) are selected
from the Mi units available.

Item                            Population value         Sample value
First-stage units               $N$                      $n$
Second-stage units in FSU $i$   $M_i$                    $m_i$
SSUs in population              $M = \sum_i M_i$
Value of SSU                    $Y_{ij}$                 $y_{ij}$
Total of FSU $i$                $\tau_i$                 $\hat{\tau}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij}$
Total in population             $\tau = \sum_i \tau_i$
Mean in population              $\mu = \tau/M$

4.12.3 Summary of main results

We will only consider the case when simple random sampling occurs at both
stages of the design.


The intuitive explanation for the results is that a total is estimated for each
FSU selected (based on the SSU selected). These estimated totals are then used
in a similar fashion to a cluster sample to estimate the grand total.

Parameter   Population value         Estimate                                                Estimated se

Total       $\tau = \sum_i \tau_i$   $\hat{\tau} = \frac{N}{n}\sum_{i=1}^{n}\hat{\tau}_i$    $se(\hat{\tau}) = \sqrt{N^2(1-f_1)\frac{s_1^2}{n} + \frac{N^2 f_1}{n^2}\sum_{i=1}^{n} M_i^2(1-f_{2i})\frac{s_{2i}^2}{m_i}}$

Mean        $\mu = \tau/M$           $\hat{\mu} = \hat{\tau}/M$                              $se(\hat{\mu}) = \sqrt{se^2(\hat{\tau})/M^2}$

where

$$s_1^2 = \frac{\sum_{i=1}^{n}\left(\hat{\tau}_i - \bar{\hat{\tau}}\right)^2}{n-1}$$

$$s_{2i}^2 = \frac{\sum_{j=1}^{m_i}\left(y_{ij} - \bar{y}_i\right)^2}{m_i - 1}$$

$$\bar{\hat{\tau}} = \frac{1}{n}\sum_{i=1}^{n}\hat{\tau}_i$$

$f_1 = n/N$ and $f_{2i} = m_i/M_i$

Notes:

• There are two contributions to the estimated se: variation among first-
stage totals ($s_1^2$) and variation among second-stage units ($s_{2i}^2$).
• If the FSU vary considerably in size, a ratio estimator (not discussed in
these notes) may be more appropriate.

Confidence Intervals The usual large sample confidence intervals can be


used.
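These formulae can be sketched in a few lines of Python under the same assumptions (simple random sampling at both stages). This is an illustrative sketch only; the function name and the (Mi, y) data layout are my own, not from these notes.

```python
import math

def two_stage_estimate(N, samples):
    """Two-stage SRS estimate of a population total and its se.

    N       -- number of first-stage units (FSUs) in the population
    samples -- list of (Mi, y) pairs, one per sampled FSU, where Mi is
               the number of SSUs in that FSU and y is the list of
               values measured on the sampled SSUs.
    """
    n = len(samples)
    f1 = n / N
    # Estimated total of each selected FSU: tau_i = (Mi/mi) * sum(y)
    tau_hats = [Mi / len(y) * sum(y) for Mi, y in samples]
    tau_bar = sum(tau_hats) / n
    s1_sq = sum((t - tau_bar) ** 2 for t in tau_hats) / (n - 1)
    # Second-stage contribution; the coefficient N^2 f1 / n^2 simplifies to N/n
    second = 0.0
    for Mi, y in samples:
        mi = len(y)
        ybar = sum(y) / mi
        s2i_sq = sum((v - ybar) ** 2 for v in y) / (mi - 1)
        second += Mi ** 2 * (1 - mi / Mi) * s2i_sq / mi
    var = N ** 2 * (1 - f1) * s1_sq / n + (N / n) * second
    return N * tau_bar, math.sqrt(var)   # (tau_hat, se)
```

Note that the returned total uses $\hat{\tau} = \frac{N}{n}\sum\hat{\tau}_i = N\bar{\hat{\tau}}$.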


4.12.4 Example - estimating the number of oysters

The Klahoose First Nations (situated near Desolation Sound in the Strait of
Georgia) wished to develop a wild oyster fishery. As a first stage in the
development of the fishery, a survey was needed to establish the current stock
in a number of oyster beds.

This example looks at the estimate of oyster numbers at Lloyd Point from
a survey conducted in 1994.

The survey was conducted by running a line through the oyster bed – the
total length was 105 m. Several random locations were selected along the line
in increments of 1 m. At each randomly chosen location, the width of the bed
was measured and about 3 random locations along the perpendicular transect at
that point were taken. A 1 m2 quadrat was applied at each location, and the
number of oysters of various sizes was counted in the quadrat.

The raw data are available as a data file called wildoyster.dat from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms.


Location  transect (m)  width (m)  quadrat  seed  xsmall  small  med  large  total count  net weight (kg)
Lloyd 5 17 3 18 18 41 48 14 139 14.6
Lloyd 5 17 5 6 4 30 9 4 53 5.2
Lloyd 5 17 10 15 21 44 13 11 104 8.2
Lloyd 7 18 5 8 10 14 5 3 40 6.0
Lloyd 7 18 12 10 38 36 16 4 104 10.2
Lloyd 7 18 13 0 15 12 3 3 33 4.6
Lloyd 18 14 1 11 8 5 9 19 52 7.8
Lloyd 18 14 5 13 23 68 18 11 133 12.6
Lloyd 18 14 8 1 29 60 2 1 93 10.2
Lloyd 30 11 3 17 1 13 13 2 46 5.4
Lloyd 30 11 8 12 16 23 22 14 87 6.6
Lloyd 30 11 10 23 15 19 17 1 75 7.0
Lloyd 49 9 3 10 27 15 1 0 53 2.0
Lloyd 49 9 5 13 7 14 11 4 49 6.8
Lloyd 49 9 8 10 25 17 16 11 79 6.0
Lloyd 76 21 4 3 3 11 7 0 24 4.0
Lloyd 76 21 7 15 4 32 26 24 101 12.4
Lloyd 76 21 11 2 19 14 19 0 54 5.8
Lloyd 79 18 1 14 13 7 9 0 43 3.6
Lloyd 79 18 4 0 32 32 27 16 107 12.8
Lloyd 79 18 11 16 22 43 18 8 107 10.6
Lloyd 84 19 1 14 32 25 39 7 117 10.2
Lloyd 84 19 8 25 43 42 17 3 130 7.2
Lloyd 84 19 15 5 22 61 30 13 131 14.2
Lloyd 86 17 8 1 19 32 10 8 70 8.6
Lloyd 86 17 11 8 17 13 10 3 51 4.8
Lloyd 86 17 12 7 22 55 11 4 99 9.8
Lloyd 95 20 1 17 12 20 18 4 71 5.0
Lloyd 95 20 8 32 4 26 29 12 103 11.6
Lloyd 95 20 15 3 34 17 11 1 66 6.0

These multi-stage designs are complex to analyze. Rather than trying to
implement the various formulae by hand, I would suggest that a proper sampling
package (such as SAS) be used.

The first step after importing the data into JMP is to collect the information
needed to estimate the FSU (transect) totals and to compute some components of
the variance from the second stage of sampling.

As in cluster sampling, use the Tables->Summary command to create summary
statistics. The Grouping variable is the transect. You will need to compute the

tics. The Grouping variable is the transect. You will need to compute the
average of the weights, the standard deviation of the weights, the number of
quadrats measured, and the width of each transect. [The latter is found as the


average of the width variable which was replicated for each individual quadrat.]

This will create a summary table as shown below:

Now you will need to add some columns to estimate the total for each FSU
and the contribution of the second stage sampling to the overall variance. These
columns will be created using the formula boxes as shown below.


First, the formula for the FSU total, i.e. the estimated total weight for the
entire transect. This is simply the average weight per quadrat times the
width of the strip.

Second, the component of variance for the second stage. [Typically, if the
first stage sampling fraction is small, this can be ignored.]

This gives the summary table shown below:


Now to summarize up to the population level. We again must summarize


this table using the Tables->Summary menu item.

This gives us the final summary table

The variance component from the first stage is found as:


And the final overall se is found as:

This gives us the final solution:

Our final estimate is a total biomass of 14,070 kg with an estimated se of


1484 kg.

A similar procedure can be used for the other variables. The nice feature
about JMP is that once a series of operations has been done, you can save the
script and apply it easily to other variables – refer to the JMP manual for more
details.
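As a cross-check on the point-and-click computations above, the two-stage formulae can be applied directly to the net-weight data in a short Python sketch. The code layout is my own; the only inputs from the survey are the 105 m baseline (N = 105 possible 1 m transect locations), the bed widths (the $M_i$), and the three quadrat weights per transect.

```python
import math

# transect location -> (bed width Mi in m, net weights in kg for the 3 quadrats)
transects = {
    5:  (17, [14.6, 5.2, 8.2]),
    7:  (18, [6.0, 10.2, 4.6]),
    18: (14, [7.8, 12.6, 10.2]),
    30: (11, [5.4, 6.6, 7.0]),
    49: (9,  [2.0, 6.8, 6.0]),
    76: (21, [4.0, 12.4, 5.8]),
    79: (18, [3.6, 12.8, 10.6]),
    84: (19, [10.2, 7.2, 14.2]),
    86: (17, [8.6, 4.8, 9.8]),
    95: (20, [5.0, 11.6, 6.0]),
}

N = 105                      # possible 1 m transect locations along the line
n = len(transects)           # 10 transects actually sampled

# Estimated total for each transect: width * mean weight per 1 m^2 quadrat
tau_hats = [M * sum(y) / len(y) for M, y in transects.values()]
tau_hat = N / n * sum(tau_hats)

tau_bar = sum(tau_hats) / n
s1_sq = sum((t - tau_bar) ** 2 for t in tau_hats) / (n - 1)
second = 0.0
for M, y in transects.values():
    m = len(y)
    ybar = sum(y) / m
    s2_sq = sum((v - ybar) ** 2 for v in y) / (m - 1)
    second += M ** 2 * (1 - m / M) * s2_sq / m
se = math.sqrt(N ** 2 * (1 - n / N) * s1_sq / n + (N / n) * second)
# tau_hat is 14070 kg and se is about 1484 kg, matching the results above
```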

Excel Spreadsheet

The above computations can also be done in Excel as shown in the attached
workbook klahoose.xls from the Sample Program Library.

As in the case of a pure cluster sample, the PivotTable feature can be used
to compute summary statistics needed to estimate the various components.


SAS Program

SAS can also be used to analyze the data as shown in the program klahoose.sas
and output Klahoose.lst.

Note that the Proc SurveyMeans computes the se using only the first stage
variance. As the first stage sampling fraction is usually quite small, this will tend
to give only slight underestimates of the true standard error of the estimate.

4.12.5 Some closing comments on multi-stage designs

The above example barely scratches the surface of multi-stage designs. Multi-
stage designs can be quite complex, and the formulae for the estimates and
estimated standard errors can be fearsome. If you have to analyze such a design,
it is likely better to invest some time in learning one of the statistical packages
designed for surveys (e.g. SAS v.8) rather than trying to program the tedious
formulae by hand.

There are also several important design decisions for multi-stage designs.

• Two-stage designs have reduced costs of data collection because units
within the FSU are easier to collect, but they also have poorer precision
compared to a simple random sample with the same number of final sampling
units. However, because of the reduced cost, it often turns out that more
units can be sampled under a multi-stage design, leading to improved
precision for the same cost as a simple random sample design. There is
a tradeoff between sampling more first-stage units and taking a smaller
sub-sample at the secondary stage. An optimal allocation strategy can
be constructed to decide upon the best strategy – consult some of the
reference books on sampling for details.
• As with ALL sampling designs, stratification can be used to improve
precision. The stratification usually takes place at the first sampling unit
stage, but can take place at all stages. The details of estimation under
stratification can be found in many sampling texts.
• Similarly, ratio or regression estimators can also be used if auxiliary in-
formation is available that is correlated with the response variable. This
leads to very complex formulae!

One very nice feature of multi-stage designs is that if the first stage is sam-
pled with replacement, then the formulae for the estimated standard errors
simplify considerably to a single term regardless of the design used in the


lower stages! If there are many first stage units in the population and if the
sampling fraction is small, the chances of selecting the same first stage unit
twice are very small. Even if this occurs, a different set of second stage units
will likely be selected so there is little danger of having to measure the same
final sampling unit more than once. In such situations, the design at second and
lower stages is very flexible as all that you need to ensure is that an unbiased
estimate of the first-stage unit total is available.
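A minimal sketch of this simplification, assuming equal-probability, with-replacement selection at the first stage (the function name is my own): only the spread among the estimated FSU totals enters the se, no matter how the $\hat{\tau}_i$ were obtained, provided each is unbiased for its FSU total.

```python
import math

def with_replacement_se(N, tau_hats):
    """se of the estimated grand total when FSUs are drawn with equal
    probability WITH replacement.  tau_hats holds one unbiased
    estimated FSU total per draw; the lower-stage design never
    appears in the formula."""
    n = len(tau_hats)
    tau_bar = sum(tau_hats) / n
    s1_sq = sum((t - tau_bar) ** 2 for t in tau_hats) / (n - 1)
    return N * math.sqrt(s1_sq / n)
```

For example, with N = 5 FSUs in the population and estimated totals of 10 and 14 from two draws, the se is 5 × sqrt(8/2) = 10.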

4.13 Some final comments on descriptive surveys

4.13.1 Unit size

A typical concern with any of the survey methods occurs when the population
does not have natural discrete sampling units. For example, a large section of
land may be arbitrarily divided into 1 m2 plots, or 10 m2 plots. A natural
question to ask is what is the ‘best size’ of unit. This has no simple answer and
depends upon several factors which must be addressed for each survey:

• Cost. All else being equal, sampling many small plots may be more
expensive than sampling fewer larger plots. The primary difference in
cost is the overhead in traveling and setup to measure the unit.
• Size of unit. An intuitive feeling is that many small plots are better
than a few large plots because the sample size is larger. This will be true
if the characteristic of interest is ‘patchy’, but, surprisingly, makes no
difference if the characteristic is randomly scattered throughout the area
(Krebs, 1989, p. 64). Indeed, if the characteristic shows ‘avoidance’, then
larger plots are better. For example, competition among trees implies they
are spread out more than expected if they were randomly located. Logistic
considerations often influence the plot size. For example, if trampling
the soil affects the response, then sample plots must be small enough to
measure without trampling the soil.
• Edge effects. Because the population does not have natural boundaries,
decisions often have to be made about objects that lie on the edge of the
sample plot. In general larger square or circular plots are better because
of smaller edge-to-area ratio. [A large narrow rectangular plot can have
more edge than a similar area square plot.]

• Size of object being measured. Clearly a 1 m2 plot is not appropriate


when counting mature Douglas-fir, but may be appropriate for a lichen
survey.


A pilot study should be carried out prior to a large scale survey to investigate
factors that influence the choice of sampling unit size.

4.13.2 Key considerations when designing a survey

Key considerations when designing a survey are

• what is the sampling unit? This should be carefully distinguished from


the observational unit.
• is a frame available for all of the sampling units? If so, then direct sampling
can be used. If not, is there a frame of groups of units suitable for a two-
stage sample?
• Are all the units the same size? If so, then a simple random sample (or a
variant thereof) is likely a suitable design. If the units vary considerably
in size, then an unequal probability design may be more suitable.
• Are there precision requirements that will be needed? How will you know
the proper sample size that is needed?
• Can stratification be employed either for administrative convenience or to
improve precision?

When analyzing a survey a key step is to recognize the design that was used
to collect the data. Key pointers to help recognize various designs are:

• How were the units selected? A true simple random sample makes a list
of all possible items and then chooses from that list.
• Is there more than one size of sampling unit? For example, were transects
selected at random, and then quadrats within the selected transects selected
at random? This is usually a multi-stage design.
• Is there a cluster? For example, transects are selected, and these are
divided into a series of quadrats - all of which are measured.

4.14 Analytical surveys - almost experimental design

In descriptive surveys, the objective was to simply obtain information about one
large group. In observational studies, two deliberately selected sub-populations


are selected and surveyed, but no attempt is made to generalize the results
to the whole population. In analytical studies, sub-populations are selected
and sampled in order to generalize the observed differences among the sub-
population to this and other similar populations.

As such, there are similarities between analytical and observational surveys


and experimental design. The primary difference is that in experimental studies,
the manager controls the assignment of the explanatory variables while measur-
ing the response variables, while in analytical and observational surveys, neither
set of variables is under the control of the manager. [Refer back to Examples B,
C, and D in the earlier chapters] The analysis of complex surveys for analytical
purposes can be very difficult (Kish 1987; Kish, 1984; Rao, 1973; Sedransk,
1965a, 1965b, 1966).

As in experimental studies, the first step in analytical surveys is to iden-


tify potential explanatory variables (similar to factors in experimental studies).
At this point, analytical surveys can be usually further subdivided into three
categories depending on the type of stratification:

• the population is pre-stratified by the explanatory variables and surveys


are conducted in each stratum to measure the outcome variables;
• the population is surveyed in its entirety, and post-stratified by the ex-
planatory variables.
• the explanatory variables can be used as auxiliary variables in ratio or
regression methods.

[It is possible that all three types of stratification take place - these are very
complex surveys.]

The choice between the categories is usually made by the ease with which
the population can be pre-stratified and the strength of the relationship between
the response and explanatory variables. For example, sample plots can be easily
pre-stratified by elevation or by exposure to the sun, but it would be difficult
to pre-stratify by soil pH.

Pre-stratification has the advantage that the manager has control over the
number of sample points collected in each stratum, whereas in post-stratification,
the numbers are not controllable, and may lead to very small sample sizes
in certain strata just because they form only a small fraction of the population.

For example, a manager may wish to investigate the difference in regener-


ation (as measured by the density of new growth) as a function of elevation.
Several cut blocks will be surveyed. In each cut block, the sample plots will


be pre-stratified into three elevation classes, and a simple random sample will
be taken in each elevation class. The allocation of effort in each stratum (i.e.
the number of sample plots) will be equal. The density of new growth will be
measured on each selected sample plot. On the other hand, suppose that the
regeneration is a function of soil pH. This cannot be determined in advance,
and so the manager must take a simple random sample over the entire stand,
measure the density of new growth and the soil pH at each sampling unit, and
then post-stratify the data based on measured pH. The number of sampling
units in each pH class is not controllable; indeed it may turn out that certain
pH classes have no observations.

If explanatory variables are treated as auxiliary variables, then there must
be a strong relationship between the response and explanatory variables.
Additionally, we must be able to measure the auxiliary variable precisely for each
unit. Then, methods like multiple regression can also be used to investigate the
relationship between the response and the explanatory variable. For example,
rather than classifying elevation into three broad elevation classes or soil pH into
broad pH classes, the actual elevation or soil pH must be measured precisely to
serve as an auxiliary variable in a regression of regeneration density vs. elevation
or soil pH.

If the units have been selected using a simple random sample, then the
analysis of the analytical surveys proceeds along similar lines as the analysis of
designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses
of analytical surveys, the observed results are postulated to have been taken
from a hypothetical super-population of which the current conditions are just
one realization. In the above example, cut blocks would be treated as a random
blocking factor; elevation class as an explanatory factor; and sample plots as
samples within each block and elevation class. Hypothesis testing about the
effect of elevation on mean density of regeneration occurs as if this were a
planned experiment.

Pitfall: Any one of the sampling methods described in Section 2 for descrip-
tive surveys can be used for analytical surveys. Many managers incorrectly use
the results from a complex survey as if the data were collected using a simple
random sample. As Kish (1987) and others have shown, this can lead to sub-
stantial underestimates of the true standard error, i.e. the precision is thought
to be far better than is justified based on the survey results. Consequently
the manager may erroneously detect differences more often than expected (i.e.
make a Type I error) and make decisions based on erroneous conclusions.

Solution: As in experimental design, it is important to match the analysis


of the data with the survey design used to collect it. The major difficulty in the
analysis of analytical surveys are:


1. Recognizing and incorporating the sampling method used to collect the


data in the analysis. The survey design used to obtain the sampling units
must be taken into account in much the same way as the analysis of
the collected data is influenced by actual experimental design. A table of
‘equivalences’ between terms in a sample survey and terms in experimental
design is provided in Table 1.
Table 1
Equivalences between terms used in surveys and in experimental design.

Survey Term            Experimental Design Term
Simple Random Sample   Completely randomized design
Cluster Sampling       (a) Clusters are random effects; units within a
                       cluster treated as sub-samples; or
                       (b) clusters are treated as main plots; units within
                       a cluster treated as sub-plots in a split-plot
                       analysis.
Multi-stage sampling   (a) Nested designs with units at each stage nested
                       in units at higher stages; effects of units at each
                       stage are treated as random effects; or
                       (b) split-plot designs with factors operating at
                       higher stages treated as main-plot factors and
                       factors operating at lower stages treated as
                       sub-plot factors.
Stratification         Fixed factor or random block depending on the
                       reasons for stratification.
Sampling Unit          Experimental unit or treatment unit
Sub-sample             Sub-sample
There is no quick, easy method for the analysis of complex surveys (Kish,
1987). The super-population approach seems to work well if the selection
probabilities of each unit are known (these are used to weight each observation
appropriately) and if random effects corresponding to the various
strata or stages are employed. The major difficulty caused by complex
survey designs is that the observations are not independent of each other.

2. Unbalanced designs (e.g. unequal numbers of sample points in each combination
of explanatory factors). This typically occurs if post-stratification
is used to classify units by the explanatory variables, but can also occur
in pre-stratification if the manager decides not to allocate equal effort in
each stratum. The analysis of unbalanced data is described by Milliken
and Johnson (1984).
3. Missing cells, i.e. certain combinations of explanatory variables may not
occur in the survey. The analysis of such surveys is complex, but refer to
Milliken and Johnson (1984).
4. If the range of the explanatory variable is naturally limited in the population,
then extrapolation outside of the observed range is not recommended.

More sophisticated techniques can also be used in analytical surveys. For


example, correspondence analysis, ordination methods, factor analysis, multidi-
mensional scaling, and cluster analysis all search for post-hoc associations among
measured variables that may give rise to hypotheses for further investigation.
Unfortunately, most of these methods assume that units have been selected
independently of each other using a simple random sample; extensions to cases
where units have been selected via a complex sampling design have not yet been
developed. Simpler designs are often highly preferred to avoid erroneous
conclusions based on inappropriate analysis of data from complex designs.

Pitfall: While the analysis of analytical surveys and designed experiments


are similar, the strength of the conclusions is not. In general, causation cannot
be inferred without manipulation. An observed relationship in an analytical
survey may be the result of a common response to a third, unobserved variable.
For example, consider the two following experiments. In the first experiment,
the explanatory variable is elevation (high or low). Ten stands are randomly
selected at each elevation. The amount of growth is measured and it appears
that stands at higher elevations have less growth. In the second experiment,
the explanatory variables is the amount of fertilizer applied. Ten stands are
randomly assigned to each of two doses of fertilizer. The amount of growth is
measured and it appears that stands that receive a higher dose of fertilizer have
greater growth. In the first experiment, the manager is unable to say whether
the differences in growth are a result of differences in elevation or amount of
sun exposure or soil quality as all three may be highly related. In the second
experiment, all uncontrolled factors are present in both groups and their effects
will, on average, be equal. Consequently, the assignment of cause to the fertilizer
dose is justified because it is the only factor that differs (on average) among the
groups.

As noted by Eberhardt and Thomas (1991), there is a need for a rigorous
application of the techniques for survey sampling when conducting analytical
surveys. Otherwise they are likely to be subject to biases of one sort or another.
Experience and judgment are very important in evaluating the prospects for
bias, and in attempting to find ways to control and account for these biases. The
most common source of bias is the selection of survey units, and the most common
pitfall is to select units based on convenience rather than on a probabilistic
sampling design. The potential problems that this can lead to are analogous to
those that occur when it is assumed that callers to a radio phone-in show are
representative of the entire population.


4.15 References
• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
One of the standard references for survey sampling. Very technical.
• Gillespie, G.E. and Kronlund, A.R. (1999).
A manual for intertidal clam surveys, Canadian Technical Report of Fish-
eries and Aquatic Sciences 2270. A very nice summary of using sampling
methods to estimate clam numbers.
• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New
York: American Chemical Society.
A series of papers on sampling mainly for environmental contaminants in
ground and surface water, soils, and air. A detailed discussion on sampling
for pattern.
• Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys mostly from a social science
perspective.
• Kish, L. (1984). On Analytical Statistics from complex samples. Survey
Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical sur-
veys.
• Kish, L. (1987). Statistical designs for research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in
analytical surveys. Very technical.
• Krebs, C. (1989). Ecological Methodology.
A collection of methods commonly used in ecology including a section on
sampling
• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999).
Survey methodology for intertidal bivalves. Canadian Technical Report of
Fisheries and Aquatic Sciences 2214. An overview of how to use surveys for
assessing intertidal bivalves - more technical than Gillespie and Kronlund
(1999).
• Myers, W.L. and Shelton, R.L. (1980). Survey methods for ecosystem
management. New York: Wiley.
Good primer on how to measure common ecological data using direct sur-
vey methods, aerial photography, etc. Includes a discussion of common
survey designs for vegetation, hydrology, soils, geology, and human influ-
ences.
• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal
of the Royal Statistical Society, Series B, 27, 264-278.


• Thompson, S.K. (1992). Sampling. New York: Wiley.


A good companion to Cochran (1977). Has many examples of using sam-
pling for biological populations. Also has chapters on mark-recapture,
line-transect methods, spatial methods, and adaptive sampling.

4.16 Frequently Asked Questions (FAQ)

4.16.1 Confusion about the definition of a population


What is the difference between the "population total" and the "pop-
ulation size"?

Population size normally refers to the number of “final sampling” units in


the population. Population total refers to the total of some variable over these
units.

For example, if you wish to estimate the total family income of families in
Vancouver, the “final” sampling units are families, the population size is the
number of families in Vancouver, the response variable is the income of each
family, and the population total will be the total family income over all
families in Vancouver.

Things become a bit confusing when the sampling units differ from the “final”
units – e.g. when the final units are clustered and you are interested in
estimates of the number of “final” units. For example, in the grouse/pocket-brush
example, the population consists of the grouse, which are clustered into 248
pockets of brush. The grouse is the final sampling unit, but the sampling unit
is a pocket of brush. In cluster sampling, you must expand the estimator by the
number of CLUSTERS, not by the number of final units. Hence the expansion factor
is the number of pockets (248), the variable of interest for a cluster is the
number of grouse in each pocket, and the population total is the number of
grouse over all pockets.
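The grouse expansion can be sketched numerically. Only the 248 pockets come from the example above; the per-pocket counts below are invented purely for illustration.

```python
# Invented counts of grouse in a simple random sample of 5 pockets
counts = [3, 0, 5, 2, 7]
n_pockets_in_population = 248   # from the grouse/pocket-brush example

mean_per_pocket = sum(counts) / len(counts)
# Expand by the number of CLUSTERS (pockets), not the number of grouse
total_grouse = n_pockets_in_population * mean_per_pocket   # 248 * 3.4 = 843.2
```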

Similarly, for the oysters on the lease. The population is the oysters on the
lease. But you don’t randomly sample individual oysters – you randomly sample
quadrats which are clusters of oysters. The expansion factor is now the number
of quadrats.

In the salmon example, the boats are surveyed. The fact that the number
of salmon was measured is incidental - you could have measured the amount of
food consumed, etc.

In the angling survey problem, the boats are the sampling units. The fact


that they contain anglers or that they caught fish is what is being measured,
but the set of boats that were at the lake that day is of interest.

4.16.2 How is N defined


How is N (the expansion factor) defined? What is the best way to
find this value?

This can get confusing in the case of cluster or multi-phase designs as there
are different N ’s at each stage of the design. It might be easier to think of N
as an expansion factor.

The expansion factor will be known once the frame is constructed. In some
cases, this can only be done after the fact – for example, when surveying angling
parties, the total number of parties returning in a day is unknown until the
end of the day. For planning purposes, some reasonable guess may have to
be made in order to estimate the sample size. If this is impossible, just choose
some arbitrarily large number – the estimated sample size will be an overestimate
(by a small amount) but close enough. Of course, once the survey is finished,
you would then use the actual value of N in all computations.

4.16.3 Multi-stage vs Multi-phase sampling

What is the difference between Multi-stage sampling and multi-


phase sampling?

In multi-stage sampling, the selection of the final sampling units takes place
in stages. For example, suppose you are interested in sampling angling parties
as they return from fishing. The region is first divided into different landing
sites. A random selection of landing sites is selected. At each landing site, a
random selection of angling parties is selected.

In multi-phase sampling, the units are NOT divided into larger groups.
Rather, a first phase selects some units and they are measured quickly. A
second phase takes a sub-sample of the first-phase units and measures them more
intensively. Returning to the angling survey: a multi-phase design would select
angling parties, and all of the selected parties would fill out a brief
questionnaire. A week later, a sample of the questionnaires is selected, and
those angling parties are RECONTACTED for more details.

The key difference is that in multi-phase sampling, some units are measured
TWICE; in multi-stage sampling, there are different sizes of sampling units
(landing sites vs angling parties), but each sampling unit is only selected once.

4.16.4 What is the difference between a Population and a


frame?

Frame = list of sampling units from which a sample will be taken. The sampling
units may not be the same as the “final” units that are measured. For example,
in cluster sampling, the frame is the list of clusters, but the final units are the
objects within the cluster.

Population = list of all “final” units of interest. Usually the “final units” are
the actual things measured in the field, i.e. what is the final object upon which
a measurement is taken.

In some cases, the frame doesn’t match the population which may cause
biases, but in ideal cases, the frame covers the population.

4.16.5 How to account for missing transects.


What do you do if an entire cluster is “missing”?

Missing data can occur at various parts of a survey and for various reasons.
The easiest data to handle are data ‘missing completely at random’ (MCAR).
In this situation, the missing data provide no information about the problem
that is not already captured by other data points, and the ‘missingness’ is also
non-informative. In this case, and if the design was a simple random sample,
the missing data points are just ignored. So if you wanted to sample 80 transects
but were only able to get 75, only the 75 transects are used. If some of the data
are missing within a transect, the problem changes from a cluster sample to a
two-stage sample, so the estimation formulae change slightly.

If the data are not MCAR, this is a real problem – welcome to a Ph.D. in
statistics on how to deal with it!

