C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
cschwarz@stat.sfu.ca
August 7, 2006
Contents
4 Sampling 2
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1.1 Difference between sampling and experimental design . . 3
4.1.2 Why sample rather than census? . . . . . . . . . . . . . . 3
4.1.3 Principal steps in a survey . . . . . . . . . . . . . . . . . . 3
4.1.4 Probability sampling vs. non-probability sampling . . . . 4
4.1.5 The importance of randomization in survey design . . . . 6
4.1.6 Model vs Design based sampling . . . . . . . . . . . . . . 10
4.1.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Overview of Sampling Methods . . . . . . . . . . . . . . . . . . . 11
4.2.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . 11
4.2.2 Systematic Surveys . . . . . . . . . . . . . . . . . . . . . . 13
4.2.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.4 Multi-stage sampling . . . . . . . . . . . . . . . . . . . . . 19
4.2.5 Multi-phase designs . . . . . . . . . . . . . . . . . . . . . 21
4.2.6 Repeated Sampling . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Simple Random Sampling Without Replacement (SRSWOR) . . 25
4.4.1 Summary of main results . . . . . . . . . . . . . . . . . . 25
4.4.2 Estimating the Population Mean . . . . . . . . . . . . . . 26
4.4.3 Estimating the Population Total . . . . . . . . . . . . . . 27
4.4.4 Estimating Population Proportions . . . . . . . . . . . . . 28
4.4.5 Example - estimating total catch of fish in a recreational
fishery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
What is the population of interest? . . . . . . . . . . . . . 30
What is the frame? . . . . . . . . . . . . . . . . . . . . . . 31
What is the sampling design and sampling unit? . . . . . 31
Excel analysis . . . . . . . . . . . . . . . . . . . . . . . . . 32
SAS analysis . . . . . . . . . . . . . . . . . . . . . . . . . 35
JMP Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Sample size determination for a simple random sample . . . . . . 47
4.5.1 Example - How many anglers to survey . . . . . . . . . . 49
4.6 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.1 Advantages of systematic sampling . . . . . . . . . . . . . 52
© 2006 Carl James Schwarz
Chapter 4
Sampling
4.1 Introduction
Today the word "survey" is used most often to describe a method of gathering
information from a sample of individuals or animals or areas. This "sample" is
usually just a fraction of the population being studied.
You are exposed to survey results almost every day: election polls, the
unemployment rate, and the consumer price index are all results of surveys.
On the other hand, some common headlines are NOT the results of surveys,
but rather the results of experiments - for example, whether a new drug is
just as effective as an old drug.
Not only do surveys have a wide variety of purposes, they also can be con-
ducted in many ways – including over the telephone, by mail, or in person.
Nonetheless, all surveys do have certain characteristics in common: all
require a great deal of planning if the results are to be informative.
Unlike a census, where all members of the population are studied, surveys
gather information from only a portion of a population of interest – the size of
the sample depending on the purpose of the study. Surprisingly to many people,
a survey can give better quality results than a census.
There are two key differences between survey sampling and experimental design.
4.1.2 Why sample rather than census?

Reasons for preferring a sample to a complete census include:

• reduced cost
• greater speed - a much smaller scale of operations is performed
• greater scope - more detailed measurements can be afforded when highly
trained personnel or specialized equipment are needed
• greater accuracy - easier to train small crew, supervise them, and reduce
data entry errors
• reduced respondent burden
• in destructive sampling you can’t measure the entire population - e.g.
crash tests of cars
4.1.3 Principal steps in a survey

• choose among the various designs; will you stratify? There are a variety
of sampling plans, some of which will be discussed in detail later in this
chapter. Some common designs in ecological studies are:
– simple random sampling
– systematic sample
– cluster sampling
– multi-stage design
All designs can be improved by stratification, so this should always be
considered during the design phase.
• pre-test - very important to try out field methods and questionnaires
• organization of field work - training, pre-test, etc
• summary and data analysis - easiest part if earlier parts done well
• post-mortem - what went well, poorly, etc.
4.1.4 Probability sampling vs. non-probability sampling

There are two types of sampling plans: probability sampling, where units are
chosen in a ‘random fashion’, and non-probability sampling, where units are
chosen in some deliberate fashion.
In probability sampling, every unit in the population has a known, non-zero
chance of being selected, and randomization determines which units are
actually chosen.
4.1.5 The importance of randomization in survey design

[With thanks to Dr. Rick Routledge for this part of the notes.]
Sample sizes and results are reported in the Table below. How are we to
interpret these results? The sampled hoverers obviously tended to be somewhat
smaller than the sampled patrollers, although it appears from the standard
deviations that some hoverers were larger than the average-sized patroller and
vice-versa. Hence, the difference is not overwhelming, and may be attributable
to sampling errors.
If the sampling were truly randomized, then the only sampling errors would
be chance errors, whose probable size can be assessed by a standard t-test.
Exactly how were the samples taken? Is it possible that the sampling procedure
used to select patrolling bees might favor the capture of larger bees, for example?
This issue is indeed addressed by the authors. They carefully explain how they
attempted to obtain unbiased samples. For example, to sample the patrolling
bees, they made a sweep across the sampling area, attempting to catch all the
patrolling bees that they observed. To assess the potential for bias, one must
in the end make a subjective judgment.
In the 1930’s, political opinion polling was in its formative years. The pioneers in
this endeavor were training themselves on the job. Of the inevitable errors, two
were so spectacular as to make international headlines.
Obviously, the enormous sample obtained by the Digest was not very rep-
resentative of the population. The selection procedure was heavily biased in
favor of Republican voters. The most obvious source of bias is the method
used to generate the list of names and addresses of the people that they con-
tacted. In 1935, only the relatively affluent could afford magazines, telephones,
etc., and the more conservative policies of the Republican Party appealed to a
greater proportion of this segment of the American public. The Digest’s sample
selection procedure was therefore biased in favor of the Republican candidate.
How did Gallup obtain his more representative sample? He did not use
randomization. Randomization is often criticized on the grounds that once in
a while, it can produce absurdly unrepresentative samples. When faced with a
sample that obviously contains far too few economically disadvantaged voters,
it is small consolation to know that next time around, the error will likely not
be repeated. Gallup used a procedure that virtually guaranteed that his sample
would be representative with respect to such obvious features as age, race, etc.
He did so by assigning quotas which his interviewers were to fill. One interviewer
might, e.g. be assigned to interview 5 adult males with specified characteristics
in a tough, inner-city neighborhood. The quotas were devised so as to make the
sample mimic known features of the population.
by about 6%. His subsequent polls contained the same systematic error. In
1948, the error finally caught up with him. He predicted a narrow victory for
the Republican candidate, Dewey. A newspaper editor was so confident of the
prediction that he authorized the printing of a headline proclaiming the victory
before the official results were available. It turned out that the Democrat,
Truman, won by a narrow margin.
What was wrong with Gallup’s sampling technique? He gave his interviewers
the final decision as to who would be interviewed. In a tough inner-city
neighborhood, an interviewer had the option of passing by a house with several
motorcycles parked out in front and sounds of a raucous party coming from
within. In the resulting sample, the more conservative (Republican) voters
were systematically over-represented.
Gallup learned from his mistakes. His subsequent surveys replaced inter-
viewer discretion with an objective, randomized scheme at the final stage of
sample selection. With the dominant source of systematic error removed, his
election predictions became even more reliable.
Should Farley Mowat really have been content to take his samples by tossing
Raunkier’s Circle to the winds? Definitely not, for at least two reasons. First,
he had to trust that by tossing the circle, he was generating an unbiased sample.
Some types of vegetation may well be selected with a higher probability
than others. For example, the higher shrubs would tend to
intercept the hoop earlier in its descent than would the smaller herbs. Second,
he has no guarantee that his sample will be representative with respect to the
major habitat types. Leaving aside potential bias, it is possible that the circle
could, by chance, land repeatedly in a snowbed community. It seems indeed
foolish to use a sampling scheme which admits the possibility of including only
snowbed communities when tundra bogs and fellfields may be equally abundant
in the population. In subsequent chapters, we shall look into ways of taking more
thoroughly randomized surveys, and into schemes for combining judgment with
randomization for eliminating both selection bias and the potential for grossly
unrepresentative samples. There are also circumstances in which a systematic
sample (e.g. taking transects every 200 meters along a rocky shore line) may
be justifiable, but this subject is not discussed in these notes.
4.1.6 Model vs. design based sampling

Model-based sampling is very powerful because you are willing to make strong
assumptions about the data-generating process. However, if your model is wrong,
there are big problems. For example, what if you assume log-normality but the
data are not log-normally distributed? In these cases, the estimates of the
parameters can be extremely biased and inefficient.
Most of the results in this chapter on survey sampling are design-based, i.e.
we don’t need to make any assumptions about normality in the population for
the results to be valid.
4.1.7 Software
Unfortunately, there is no common, easy-to-use statistical package for the
analysis of survey data. Fortunately, most of the computations are fairly
straightforward, so many of the common packages, such as JMP or Excel, can be
used to analyze survey data.
SAS includes survey design procedures, but these are not covered in this
course.
For a review of packages that can be used to analyze survey data please
refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/
survey-soft.html.
NOTE that SAS has specialized routines for the analysis of survey
data that avoid these problems.
4.2 Overview of Sampling Methods

4.2.1 Simple Random Sampling

This is the basic method of selecting survey units. Each unit in the population
is selected with equal probability and all possible samples are equally likely to
be chosen. This is commonly done by listing all the members in the population
(the set of sampling units) and then choosing units using a random number
table.
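In software, the random number table is typically replaced by a pseudo-random draw without replacement. A minimal sketch (the frame size and sample size here are hypothetical):

```python
import random

N, n = 480, 24                      # hypothetical frame and sample sizes
frame = list(range(1, N + 1))       # enumerate every sampling unit in the frame
sample_ids = sorted(random.sample(frame, n))  # SRS without replacement
```

Because `random.sample` draws without replacement, no unit can be selected twice, matching the usual field procedure.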
Units are usually chosen without replacement, i.e. each unit in the pop-
ulation can only be chosen once. In some cases (particularly for multi-stage
designs), there are advantages to selecting units with replacement, i.e. a unit in
the population may potentially be selected more than once. The analysis of a
simple random sample is straightforward. The mean of the sample is an esti-
mate of the population mean. An estimate of the population total is obtained
by multiplying the sample mean by the number of units in the population. The
sampling fraction, the proportion of units chosen from the entire population,
is typically small. If it exceeds 5%, an adjustment (the finite population cor-
rection) will result in better estimates of precision (a reduction in the standard
error) to account for the fact that a substantial fraction of the population was
surveyed.
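The estimates described in this paragraph can be collected into a small helper. This function is an illustrative sketch only; it always applies the finite population correction, which is harmless when the sampling fraction is small:

```python
import math

def srs_estimates(sample, N):
    """Design-based estimates from a simple random sample drawn
    without replacement from a population of N units."""
    n = len(sample)
    ybar = sum(sample) / n                               # estimates the population mean
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)  # sample variance
    f = n / N                                            # sampling fraction
    se_mean = math.sqrt(s2 / n * (1 - f))                # finite population correction applied
    return {"mean": ybar, "total": N * ybar,
            "se_mean": se_mean, "se_total": N * se_mean}
```

For example, `srs_estimates([2, 4, 4, 6], 100)` estimates a mean of 4.0 and a total of 400.0.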
Note that a crucial element of simple random samples is that every sampling
unit is chosen independently of every other sampling unit. For example, in strip
transects plots along the same transect are not chosen independently - when a
particular transect is chosen, all plots along the transect are sampled and so
the selected plots are not a simple random sample of all possible plots.
Strip-transects are actually examples of cluster samples. Cluster samples are
discussed in greater detail later in this chapter.
4.2.2 Systematic Surveys

If a known trend is present in the sample, this can be incorporated into the
analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic
sample follows an elevation gradient that is known to directly influence the
response variable. A regression-type correction can be incorporated into the
analysis. However, note that this trend must be known from external sources -
it cannot be deduced from the survey.
advice should be sought before starting such a scheme. If there are no other
feasible designs, a slight variation in the systematic sample provides some
protection from the above problems. Instead of taking a single systematic sample
every k-th unit, take 2 or 3 independent systematic samples of every 2k-th or 3k-th
unit, each with a different starting point.
single systematic sample every 100 m along the stream, two independent sys-
tematic samples can be taken, each selecting units every 200 m along the stream
starting at two random starting points. The total sample effort is still the same,
but now some measure of the large scale spatial structure can be estimated.
This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
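A sketch of replicated sub-sampling; the population size, spacing, and number of replicates below are hypothetical parameters:

```python
import random

def replicated_systematic(N, k, replicates=2):
    """Take `replicates` independent systematic samples, each selecting every
    (replicates*k)-th unit from its own random start, so the total effort
    matches a single every-k-th-unit systematic sample."""
    step = replicates * k
    return [list(range(random.randrange(step), N, step))
            for _ in range(replicates)]
```

Comparing the estimates from the replicate samples gives a design-based handle on large-scale structure that a single systematic sample cannot provide.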
4.2.3 Cluster sampling

The reason cluster samples are used is that costs can be reduced compared
to a simple random sample giving the same precision. Because units within a
cluster are close together, travel costs among units are reduced. Consequently,
more clusters (and more total units) can be surveyed for the same cost as a
comparable simple random sample.
For example, consider the vegetation survey of previous sections. The 480
plots can be divided into 60 clusters of size 8. A total sample size of 24 plots is
obtained by randomly selecting three clusters from the 60 clusters present in
the map, and then surveying ALL eight members of the selected clusters. A map
of the design might look like:
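The selection step for this hypothetical 480-plot layout might be sketched as:

```python
import random

# Hypothetical layout: 480 plots grouped into 60 clusters of 8 plots each.
clusters = [list(range(c * 8, (c + 1) * 8)) for c in range(60)]
chosen = random.sample(clusters, 3)         # randomly select 3 of the 60 clusters
surveyed = [plot for cluster in chosen      # survey ALL plots within each
            for plot in cluster]            # chosen cluster
```

Note that randomization happens only at the cluster level; the plots within a chosen cluster are taken with certainty.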
Alternatively, clusters are often formed when a transect sample is taken. For
example, suppose that the vegetation survey picked an initial starting point on
the left margin, and then flew completely across the landscape in a straight
line measuring all plots along the route. A map of the design might look like:
In this case, there are three clusters chosen from a possible 30 clusters, and
the clusters are of unequal size (the middle cluster has only 12 plots measured
compared to the 18 plots measured on the other two transects).
Pitfall: A cluster sample is often mistakenly analyzed using methods for
simple random surveys. This is not valid because units within a cluster are typically
positively correlated. The effect of this erroneous analysis is to come up with
an estimate that appears to be more precise than it really is, i.e. the estimated
standard error is too small and does not fully reflect the actual imprecision in
the estimate.
4.2.4 Multi-stage sampling

In many situations, there are natural divisions of the population into several
different sizes of units. For example, a forest management unit consists of several
stands, each stand has several cutblocks, and each cutblock can be divided into
plots. These divisions can be easily accommodated in a survey through the
use of multi-stage methods. Selection of units is done in stages. For example,
several stands could be selected from a management area; then several cutblocks
are selected in each of the chosen stands; then several plots are selected in each
of the chosen cutblocks. Note that in a multi-stage design, units at any stage
are selected at random only from those larger units selected in previous stages.
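The stage-wise selection can be sketched as below; the stands → cutblocks → plots structure and the per-stage sample sizes are invented for illustration:

```python
import random

def multistage_sample(area, n_stands, n_blocks, n_plots):
    """Select stands, then cutblocks only within the chosen stands,
    then plots only within the chosen cutblocks."""
    out = {}
    for stand in random.sample(sorted(area), n_stands):
        out[stand] = {}
        for block in random.sample(sorted(area[stand]), n_blocks):
            out[stand][block] = random.sample(area[stand][block], n_plots)
    return out

# Hypothetical management unit: 4 stands x 3 cutblocks x 10 plots.
area = {s: {b: list(range(10)) for b in range(3)} for s in range(4)}
selected = multistage_sample(area, n_stands=2, n_blocks=2, n_plots=4)
```

The key point the code makes explicit is that lower-stage draws happen only inside units already chosen at the stage above.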
The advantage of multi-stage designs is that costs can be reduced compared
to a simple random sample of the same size, primarily through improved
logistics. The precision of the results is worse than that of an equivalent simple
random sample, but because costs are less, a larger multi-stage survey can often be
done for the same cost as a smaller simple random sample. This often results
in a more precise estimate for the same cost. However, due to the misuse of
data from complex designs, simple designs are often highly preferred and end
up being more cost efficient when costs associated with incorrect decisions are
incorporated.
Pitfall: Although random selections are made at each stage, a common error
is to analyze these types of surveys as if they arose from a simple random sample.
The plots were not independently selected; if a particular cutblock was not
chosen, then none of the plots within that cutblock can be chosen. As in cluster
samples, the consequences of this erroneous analysis are that the estimated
standard errors are too small and do not fully reflect the actual imprecision in
the estimates. A manager will be more confident in the estimate than is justified
by the survey.
Solution: Again, it is important that the analytical methods are suitable for
the sampling design. The proper analysis of multi-stage designs takes into
account that random sampling takes place at each stage (Thompson, 1992, Chapter
13). In many cases, the precision of the estimates is determined essentially by
the number of first stage units selected. Little is gained by extensive sampling
at lower stages.
4.2.5 Multi-phase designs

In some surveys, the same survey units are surveyed more than once. In the
first phase, a sample of units is selected (usually by a simple random sample).
Every unit is measured on some variable. Then in subsequent phases, samples
are selected ONLY from those units selected in the first phase, not from the
entire population.
plots are selected and measured for the amount of insect damage. The plots are
then stratified by the amount of damage, and second phase allocation of units
concentrates on plots with low insect damage to measure total usable volume of
wood. It would be wasteful to measure the volume of wood on plots with much
insect damage.
4.2.6 Repeated Sampling

At the other extreme, units are selected in the first survey and the same
units are remeasured over time. For example, permanent study plots can be
established that are remeasured for regeneration over time. The advantage
of permanent study plots is that comparisons over time are free of additional
variability introduced by new units being measured at every time point. One
possible problem is that survey units may become ‘damaged’ over time, and the
sample size will tend to decline over time. An analysis of these types of designs
is more complex because of the need to account for the correlation over time
of measurements on the same sample plot and the need to account for possible
missing values when units become ‘damaged’ and are dropped from the study.
Intermediate to the above two designs are partial replacement designs where
a portion of the survey units are replaced with new units at each time point.
For example, 1/5 of the units could be replaced by new units at each time point
- units would normally stay in the study for a maximum of 5 time periods. The
analysis of these types of designs is very complex.
4.3 Notation
Unfortunately, sampling theory has developed its own notation that is different
from that used for design of experiments or other areas of statistics, even
though the same concepts are used in both. It would be nice to adopt a general
convention for all of statistics - maybe in 100 years this will happen.
In the table below, I’ve summarized the “usual” notation used in sampling
theory. In general, capital letters refer to population values, while small letters
refer to sample values.
Population values:  N = number of units;  µ = mean;  τ = total (τ = N µ)
Sample values:  n = number of units;  ȳ = mean;  s² = variance;  f = n/N = sampling fraction
4.4 Simple Random Sampling Without Replacement (SRSWOR)

This forms the basis of many other more complex sampling plans and is the
‘gold standard’ against which all other sampling plans are compared. It often
happens that more complex sampling plans consist of a series of simple random
samples that are combined in a complex fashion.
In this design, once the frame of units has been enumerated, a sample of size
n is selected without replacement from the N population units.
Refer to the previous sections for an illustration of how the units will be
selected.
It turns out that for a simple random sample, the sample mean (ȳ) is the best
estimator of the population mean (µ). The population total is estimated by
multiplying the sample mean by the POPULATION size. And, a proportion
is estimated by simply coding results as 0 or 1 depending if the sampled unit
belongs to the class of interest, and taking the mean of these 0,1 values. (Yes,
this really does work - refer to a later section for more details).
The standard error for the population total estimate is found by multiplying
the standard error for the mean by the POPULATION SIZE.
The standard error for a proportion is found again, by treating each data
value as 0 or 1 and applying the same formula as the standard error for a mean.
In summary:

Parameter       Estimator   Estimated se
mean µ          ȳ           sqrt( (s²/n) (1 − f) )
total τ         N ȳ         N sqrt( (s²/n) (1 − f) )
proportion p    p̂           sqrt( (p̂(1 − p̂)/(n − 1)) (1 − f) )

Notes:
• Inflation factor: The term N/n is called the inflation factor, and the
estimator for the total is sometimes called the expansion estimator or the
simple inflation estimator.
The first line of the above table shows the “basic” results and all the remaining
lines in the table can be derived from this line as will be shown later.
The population mean (µ) is estimated by the sample mean (ȳ). The estimated
se of the sample mean is

se(ȳ) = sqrt( (s²/n) (1 − f) ) = (s/√n) √(1 − f)
Note that if the sampling fraction (f) is small, then the standard error of the
sample mean can be approximated by:

se(ȳ) ≈ sqrt(s²/n) = s/√n
which is the familiar form seen previously. In general, the standard error
formula changes depending upon the sampling method used to collect
the data and the estimator used on the data. Every different sampling
design has its own way of computing the estimator and se.
Confidence intervals for parameters are computed in the usual fashion, i.e.
an approximate 95% confidence interval would be found as: estimator ± 2se.
Some textbooks use a t-distribution for smaller sample sizes, but most surveys
are sufficiently large that this makes little difference.
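As a minimal sketch of the "estimator ± 2 se" rule (the numbers are invented):

```python
def approx_95ci(estimate, se):
    """Approximate 95% confidence interval: estimator +/- 2 se."""
    return estimate - 2 * se, estimate + 2 * se

low, high = approx_95ci(100, 10)  # hypothetical estimate and standard error
```

The same two-line rule applies to means, totals, and proportions; only the estimator and its se change with the design.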
Many students find this part confusing, because of the term population total.
This does NOT refer to the total number of units in the population, but rather
the sum of the individual values over the units. For example, if you are interested
in estimating total timber volume in an inventory unit, the trees are the sampling
units. A sample of trees is selected to estimate the mean volume per tree. The
total timber volume over all trees in the inventory unit is of interest, not the
total number of trees in the inventory unit.
As the population total is N µ (total population size times the population
mean), a natural estimator is formed by the product of the population size and
the sample mean, i.e. τ̂ = N ȳ. Note that you must multiply by the population
size, not the sample size.
In general, estimates for population totals in most sampling designs are found
by multiplying estimates of population means by the population size.
For example, suppose you were interested in the proportion of fish in a catch
that was of a particular species. A sample of 10 fish was selected (of course,
in the real world a larger sample would be taken), and the following data were
observed (S=sockeye, C=chum):
S C C S S S S C S S
Of the 10 fish sampled, 3 were chum so that the sample proportion of fish that
were chum is 3/10 = 0.30.
If the data are recoded using 1=Chum, 0=Sockeye, the sample values would
be:
0 1 1 0 0 0 0 1 0 0
The sample average of these numbers gives ȳ = 3/10 = 0.30, which is exactly
the proportion seen.
It is not surprising then that by recoding the sample using 0/1 variables, the
first line in the summary table reduces to the last line in the summary table. In
particular, s² reduces to n p̂(1 − p̂)/(n − 1), resulting in the se seen above.
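The recoding argument can be checked directly; a sketch using the 10 fish above:

```python
fish = list("SCCSSSSCSS")                      # the 10 sampled fish
coded = [1 if f == "C" else 0 for f in fish]   # 1 = chum, 0 = sockeye
n = len(coded)
p_hat = sum(coded) / n                         # mean of 0/1 values = sample proportion
s2 = sum((y - p_hat) ** 2 for y in coded) / (n - 1)
shortcut = n * p_hat * (1 - p_hat) / (n - 1)   # the n*p(1-p)/(n-1) form
```

Both routes give the same sample variance, so the mean/se machinery for measurements carries over to proportions unchanged.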
4.4.5 Example - estimating total catch of fish in a recreational fishery

This section illustrates the concepts of the previous sections using a very small
example.
There are two common survey designs used in these types of surveys (generi-
cally called creel surveys). In access surveys, observers are stationed at access
points to the fishery. For example, if fishers go out in boats to catch the fish, the
access points are the marinas where the boats are launched and are returned.
From these access points, a sample of fishers is selected and interviews con-
ducted to measure the number of fish captured and other attributes. Roving
surveys are commonly used when there is no common access point and you
can move among the fishers. In this case, the observer moves about the fishery
and questions anglers as they are encountered. Note that in this last design,
the chances of encountering an angler are no longer equal - there is a greater
chance of encountering an angler who has a longer fishing episode. And, you
typically don’t encounter the angler at the end of the episode but somewhere
in the middle of the episode. The analysis of roving surveys is more complex -
seek help. The following example is based on a real life example from British
Columbia. The actual survey is much larger involving several thousand anglers
and sample sizes in the low hundreds, but the basic idea is the same.
The objectives are to estimate the total number of anglers and their catch
and to estimate the proportion of boat trips (fishing parties) that had sufficient
life-jackets for the members on the trip. Here is the raw data - each line is the
record for one interviewed fishing party.

What is the population of interest?
The population of interest is NOT the fish in the lake. The Fisheries Department
is not interested in estimating the characteristics of the fish, such as mean fish
weight or the number of fish in the lake. Rather, the focus is on the anglers and
fishing parties. Refer to the FAQ at the end of the chapter for more details.
It would be tempting to conclude that the anglers on the lake are the
population of interest.¹
For this reason, the population of interest is taken to be the set of boats
fishing at this lake. The fisheries agency doesn’t really care about the individual
anglers because if a boat with 3 anglers catches one fish, the actual person who
caught the fish is not recorded. Similarly, if there are only two life jackets, does
it matter which angler didn’t have the jacket?
What is the frame?

The frame for a simple random sample is a listing of ALL the units in the
population. This list is then used to randomly select which units will be measured.
In this case, there is no physical list and the frame is conceptual. A random
number table was used to decide which fishing parties to interview.
What is the sampling design and sampling unit?

The sampling design will be treated as if it were a simple random sample from
all boats (fishing parties) returning, but in actual fact it was likely a systematic
sample or a variant. As you will see later, this may or may not be a problem.
In many cases, special attention should be paid to identifying the correct
sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were
selected, not individual anglers. This mistake is often made when the data are
presented on an individual basis rather than on a sampling unit basis. As you
will see in later chapters, this is an example of pseudo-replication.
¹ If data were collected on individual anglers, then the anglers could be taken as the
population of interest. However, in this case, the design is NOT a simple random sample of
anglers. Rather, the design is a cluster sample where a simple random sample of clusters
(boats) was taken and all members of the cluster (the anglers) were interviewed. As you will
see later in the course, a cluster sample can be viewed as a simple random sample if you
define the population in terms of clusters.
Excel analysis
The analysis proceeds in a series of logical steps, as illustrated for the
number of anglers per party.
The metadata (information about the survey) is entered at the top of the
spreadsheet.
The actual data are entered in the middle of the sheet, one row per angling
party listing the variables recorded.
At the bottom of the data, the summary statistics needed are computed using
the Excel built-in functions. This includes the sample size, the sample mean,
and the sample standard deviation.
Because the sample mean is the estimator of the population mean when the
design is a simple random sample, no further computations are needed.
The se for the sample mean is computed using the formula presented earlier.
The estimated standard error OF THE MEAN is 0.128 anglers/party. The estimated
total is then 168 × 1.533 = 257.6 anglers, with se 168 × 0.128 = 21.5.
Hence, a 95% confidence interval for the total number of anglers fishing this
day is found as 257.6 ± 2(21.5).
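The spreadsheet steps can be reproduced in a few lines; the angler counts below are transcribed from the raw-data listing in the SAS section:

```python
import math

# Anglers per party for the 30 interviewed parties (from the raw-data listing)
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1, 2, 1, 2, 1, 3,
           1, 1, 2, 3, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1]
N, n = 168, len(anglers)                    # 168 parties fished that day
ybar = sum(anglers) / n                     # ~1.533 anglers/party
s2 = sum((y - ybar) ** 2 for y in anglers) / (n - 1)
se_mean = math.sqrt(s2 / n * (1 - n / N))   # ~0.128, with finite population correction
total, se_total = N * ybar, N * se_mean     # ~257.6 and ~21.5
```

The same three summary statistics (n, mean, standard deviation) drive every estimate; Excel, SAS, and this sketch differ only in bookkeeping.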
A similar procedure is followed in the next column to estimate the total catch.
First, the character values yes/no are translated into 0,1 variables using the IF
statement of Excel.
Then the EXACT same formula as used for estimating the total number of
anglers or the total catch is applied to the 0,1 data!
SAS analysis
SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of
the sample SAS program called creel.sas and the output called creel.lst are avail-
able from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms.
/*
For management purposes, it is important to estimate the
total catch by recreational fishers.
Unfortunately, there is no central registry of fishers, nor
is there a central reporting station.
Consequently, surveys are often used to estimate the total
catch.
*/
data creel;
set creel;
sampweight = 168/30;
/* Note that it is not necessary to use the coded 0/1 variables in this procedure */
run;
The program starts with the metadata so that the purpose of the program and how the data were collected, etc., are not lost.
The first section of code reads the data and computes the 0,1 variable from
the life-jacket information. The data is listed so that it can be verified that it
was read correctly.
Most programs for dealing with survey data require that sampling weights be available for each observation. A sampling weight is the weighting factor representing how many units in the population the observation represents. In this case, each of the 30 parties represents 168/30 = 5.6 parties in the population.
Party  Anglers  Catch  LifeJackets  Coded
1 1 1 yes 1
2 3 1 yes 1
3 1 2 yes 1
4 1 2 no 0
5 3 2 no 0
6 3 1 yes 1
7 1 0 no 0
8 1 0 no 0
9 1 1 yes 1
10 1 0 yes 1
11 2 0 yes 1
12 1 1 yes 1
13 2 0 yes 1
14 1 2 yes 1
15 3 3 yes 1
16 1 0 no 0
17 1 0 yes 1
18 2 0 yes 1
19 3 1 yes 1
20 1 0 yes 1
21 2 0 yes 1
22 1 1 yes 1
23 1 0 yes 1
24 1 0 yes 1
25 1 0 no 0
26 2 0 yes 1
27 2 1 no 0
28 1 1 no 0
29 1 0 yes 1
30 1 0 yes 1
Creel Survey - Simple Random Sample
raw data
Data Summary
Number of Observations 30
Sum of Weights 168
Class Variable    Levels    Values
lifej             2         no yes
JMP Analysis
Unfortunately, while JMP excels (excuse the pun!) in the analysis of experimental data, it is a bit clumsy to analyze survey data using JMP.² There are two deficiencies:

• There is no way to specify the finite population correction (the √(1 − f) factor) that is applied to standard errors. Fortunately, in many ecological experiments, the sampling fraction f is very close to 0, so the finite population correction is negligible.

² Future versions of JMP will include survey sampling modules.
JMP assumes, unless you specify otherwise, that the data are collected from
a simple random sample. This matches the design of the angler survey so JMP
can be used directly.
The data are entered into a JMP spreadsheet directly. A copy of the
JMP data file is called creel.jmp and is available from the Sample Program Li-
brary at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
There is no need to code the categorical variable corresponding to sufficient life-
jackets. Be sure that the angler and catch variables are continuously scaled and
that EnoughLifeJackets is nominally scaled:
All three variables can be specified simultaneously and JMP will use the scale
of the variables to decide which statistics to compute.
The display can be improved by converting to a stacked setting (use the Stacked
option in the red-triangle near the Distribution header):
removing the quantile information and the histograms (use the red triangles for
each data variable to remove the display):
and asking for standard errors and confidence intervals for the proportion (right
click in the table of proportions and ask for the appropriate columns):
The estimates are read directly from the output. The estimated average number of anglers per boat is 1.53 with an estimated se of .14. Notice that the se is slightly larger than the se reported by Excel - this is a result of not applying a finite population correction.
The estimated average catch per boat is read directly above and is 0.667 fish/boat with a se of .15 fish/boat. To estimate the total catch over all 168 boats, we multiply both the mean catch/boat and the se of the catch/boat by 168. This gives an estimated total catch of 112 fish (se 26 fish). [Again, the standard
error is slightly larger than that reported by Excel because the finite population
correction factor was not applied.]
There is no need to code the categorical variable as was done in Excel. Reading
directly from the above output, we estimate that 73% of parties had sufficient
life jackets with a se of .08 (or a se of 8 percentage points). [Again the se
is slightly larger than that reported by Excel because of the lack of a finite
population correction factor.]
For estimating the total number of anglers and their catch, we need the sam-
ple size, the average over the sample and the standard deviation over the sample.
This can be done using the Tables->Summary pop-down menu. Complete the
dialogue box as shown:
To estimate the total number of anglers and its se, multiply the estimated mean number of anglers per boat trip, and its se, by the number of boat trips (168). The estimated total number of anglers is 1.53333 × 168 = 257.6 with an estimated standard error of 0.128 × 168 = 21.5.
We first transform the yes/no responses into 1/0 using a formula box, and then repeat the same summary steps as for the mean number of anglers, giving:
There are many surveys where the results are disappointing. For example,
a survey of anglers may show that the mean catch per angler is 1.3 fish but
that the standard error is .9 fish. In other words, a 95% confidence interval
stretches from 0 to well over 4 fish per angler, something that is known with
near certainty even before the survey was conducted. In many cases, a back-of-the-envelope calculation would have shown, before the survey was even started, that the precision obtainable at the proposed sample size was inadequate.
In order to determine the appropriate sample size, you will first need to specify the precision required. For example, a policy decision may require that the results be accurate to within 5% of the true value. The required precision is usually expressed in one of two ways:
• an absolute precision, i.e. you wish to be 95% confident that the sample
mean will not vary from the population mean by a pre-specified amount.
For example, a 95% confidence interval for the total number of fish cap-
tured should be ± 1,000 fish.
• a relative precision, i.e. you wish to be 95% confident that the sample
mean will be within 10% of the true mean.
The latter is more common than the former, but both are equivalent and
interchangeable. For example, if the actual estimate is around 200, with a se of
about 50, then the 95% confidence interval is ± 100 and the relative precision
is within 50% of the true answer (± 100 / 200). Conversely, a 95% confidence
interval that is within ± 40% of the estimate of 200, turns out to be ± 80 (40%
of 200), and consequently, the se is around 40 (=80/2).
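The arithmetic in this example can be written out directly; an illustrative sketch (not part of the original notes), using z ≈ 2 for 95% confidence:

```python
# Relative precision from an absolute result: estimate 200, se 50.
estimate, se = 200, 50
half_width = 2 * se               # 95% CI is estimate +/- 100
relative = half_width / estimate  # within 50% of the true answer
print(relative)

# Required se from a relative target: within +/- 40% of an estimate of 200.
half_width = 0.40 * 200           # +/- 80
required_se = half_width / 2      # se around 40
print(required_se)
```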
Expression                                 Mathematics
within xxx of the [estimate] (95% conf.)   2 × se = xxx, i.e. se = xxx/2
As a rough rule of thumb, the following are often used as survey precision
guidelines:
• For scientific work, the 95% confidence interval should be ± 10% of the
estimate.
Next, a preliminary guess for the standard deviation of individual items in the population (S) is needed, along with an estimate of the population size (N) and possibly the population mean (µ) or population total (τ). These are not too crucial and can be obtained by:
A very rough estimate of the standard deviation can be found by taking the
usual range of the data/4. If the population proportion is unknown, the value
of 0.5 is often used as this leads to the largest sample size requirement as a
conservative guess.
These are then used with the formulae for the confidence interval to determine the relevant sample size. Many textbooks have complicated formulae to do this - it is much easier these days to simply code the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or use the "GOAL SEEKER" feature of the spreadsheet to find the appropriate sample size. This will be illustrated in the example.
The final numbers are not to be treated as the exact sample size but more
as a guide to the amount of effort that needs to be expended.
If more than one item is being surveyed, these calculations must be done
for each item. The largest sample size needed is then chosen. This may lead
to conflict in which case some response items must be dropped or a different
sampling method must be used for this other response variable.
First note that the computations for sample size require some PRIOR in-
formation about population size, the population mean, or the population pro-
portion. We will use information from the previous survey to help plan future
studies.
For example, about 168 boats were interviewed last year. The mean catch
per angling party was about .667 fish/boat. The standard deviation of the catch
per party was .844. These values are entered in the spreadsheet in column C.
Now vary the sample size (in green) in column C until the 95% confidence
interval (in yellow) is below ± 10%. You will find that you will need to interview
almost 135 parties - a very high sampling fraction indeed. The problem for this
variable is the very high variation of individual data points.
If you are familiar with Excel, you can use the Goal Seeker function to speed
the search.
Similarly, the proportion of people wearing lifejackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small as the 95% confidence interval is ± .186 (18.6 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03.
Note that you need to be careful in dealing with percentages - confidence limits
are often specified in terms of percentage points rather than percents to avoid
problems where percents are taken of percents. This will be explained further
in class.
Try using the spreadsheet to compare the precision of a poll of 1000 people
taken from Canada (population 33,000,000) and 1000 people taken from the US
(population 330,000,000) if both polls have about 40% in favor of some issue.
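The suggested Canada/US comparison can be done directly from the proportion formula; an illustrative sketch (the 40% figure and population sizes are those given above):

```python
import math

def se_proportion(p, n, N):
    """se of a sample proportion under SRSWOR, with finite population correction."""
    s2 = n * p * (1 - p) / (n - 1)
    return math.sqrt(s2 / n * (1 - n / N))

# Poll of 1000 people, about 40% in favor, in two populations
for country, N in [("Canada", 33_000_000), ("US", 330_000_000)]:
    print(country, round(se_proportion(0.40, 1000, N), 5))
```

Both polls give a se of about 0.0155 (1.55 percentage points): the precision is driven almost entirely by the sample size n, not by the population size, because the sampling fraction is tiny in both countries.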
Technical notes
If you really want to know how the sample size numbers are determined,
here is the lowdown.
Suppose that you wish to be 95% sure that the sample mean is within 10% of the true mean.

We must solve

    z (S/√n) √((N − n)/N) ≤ εµ

for n, where z is the multiplier for a particular confidence level (for a 95% c.i. use z = 2) and ε is the 'closeness' factor (in this case ε = 0.10).
Rearranging this equation gives

    n = N / (1 + N (εµ / (zS))²)
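This rearranged formula is easy to code. An illustrative sketch (not part of the original notes) using the creel-survey values quoted in this section (N = 168, mean catch µ ≈ 0.667, S ≈ 0.844, ε = 0.10, z = 2):

```python
import math

def sample_size(N, S, mu, eps, z=2):
    """n needed so that z*(S/sqrt(n))*sqrt((N-n)/N) <= eps*mu under SRSWOR."""
    return N / (1 + N * (eps * mu / (z * S)) ** 2)

n = sample_size(N=168, S=0.844, mu=0.667, eps=0.10)
print(math.ceil(n))
```

The formula gives n ≈ 134 parties, consistent with the "almost 135 parties" found by trial and error in the spreadsheet.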
There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every k-th record, where k is chosen to meet sample size requirements - an example of choosing k will be given in class. All of the following methods are equivalent if k divides N exactly. These are the two most common methods.
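As an illustration, the usual random-start version - choose a random starting point between 1 and k, then take every k-th unit thereafter - can be sketched as follows (a sketch, not part of the original notes):

```python
import random

def systematic_sample(N, k, seed=None):
    """Select every k-th unit (1-based labels) after a random start in 1..k."""
    rng = random.Random(seed)
    start = rng.randint(1, k)
    return list(range(start, N + 1, k))

# e.g. N = 168 units with k = 6 gives a sample of exactly 28 units,
# because k divides N exactly
sample = systematic_sample(168, 6, seed=1)
print(len(sample), sample[:5])
```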
Most surveys casually assume that the population was sorted in random order when the systematic sample was selected, and so treat the results as if they had come from a SRSWOR. This is theoretically incorrect; if the assumption is false, the results may be biased, and there is no way of assessing the bias from the data at hand.
For example, rather than taking a single systematic sample of size 100 from
a population, you can take 4 systematic samples (with different starting points)
of size 25.
A yearly survey has been conducted in the Prairie Provinces to estimate the
number of breeding pairs of ducks. One breeding area has been divided into
approximately 1000 transects of a certain width, i.e. the breeding area was
divided into 1000 strips.
• The population is the set of individual ducks on the study area. However,
no frame exists for the individual birds. But a frame can be constructed
based on the 1000 strips that cover the study area. In this case, the design
is a cluster sample, with the clusters being strips.
• The population consists of the 1000 strips that cover the study area and
the number of ducks in each strip is the response variable. The design is
then a simple random sample of the strips.
In either case, the analysis is exactly the same and the final estimates are exactly
the same.
Here is the raw data reporting the number of nests in each set of 10 transects:
Est total    7130    16570    mean 10.97
Est se        885     3510    se    4.91
same size. If the systematic samples had been of different sizes (e.g. some
sets had 15 transects, other sets had 5 transects), then a ratio-estimator
(see later sections) would have been a better estimator.
• Compute the total number of nests for each set. This is found in column (a).

• The sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as:

    µ̂ = (468 + 93 + ··· + 197)/10 = 165.7

  with an estimated se of

    se(µ̂) = √( s²/n (1 − n/100) ) = √( 117.02²/10 (1 − 10/100) ) = 35.1

• The average number of nests per set is expanded to cover all 100 sets:

    τ̂ = 100 µ̂ = 16570 and se(τ̂) = 100 se(µ̂) = 3510
2. Total number of nests in the prime habitat only (refer to column (b)
above). This is formed in exactly the same way as the previous estimate.
This is technically known as estimation in a domain. The number of
elements in the domain in the whole population (i.e. how many of the
1000 transects are in prime-habitat) is unknown but is not needed. All
that you need is the total number of nests in prime habitat in each set –
you essentially ignore the non-prime habitat transects within each set.
The average number of nests per set in prime habitats is found as before:

    µ̂ = (123 + ··· + 93)/10 = 71.3

with an estimated se of

    se(µ̂) = √( s²/n (1 − n/100) ) = √( 29.5²/10 (1 − 10/100) ) = 8.85

• Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se are τ̂ = 100 µ̂ = 7130 with se(τ̂) = 100 se(µ̂) = 885.

• Note that the total number of transects of prime habitat is not known for the population and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio-estimator (see later in the notes) could be used to estimate the density.
• Compute the domain means for type of habitat for each set (columns
(c) and (d)). Note that the totals are divided by the number of
transects of each type in each set.
• Compute the difference in the means for each set (column (e))
• Treat this difference as a simple random sample of size 10 taken from
the 100 possible sets of transects. What does the final estimated
mean difference and se imply?
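The estimates above (treating the 10 selected sets as a SRSWOR sample of the 100 possible sets) depend only on summary statistics, so they can be sketched directly. An illustrative Python sketch (not part of the original notes) using the means and standard deviations quoted in the text:

```python
import math

def expand_sets(mean, s, n, N):
    """se of the mean per set under SRSWOR, plus the expanded total over all N sets."""
    se_mean = math.sqrt(s ** 2 / n * (1 - n / N))
    return se_mean, N * mean, N * se_mean

# All habitat: mean 165.7 nests/set, s = 117.02, 10 of 100 sets sampled
se_m, total, se_t = expand_sets(165.7, 117.02, 10, 100)
print(round(se_m, 1), total, round(se_t))

# Prime habitat only (domain estimate): mean 71.3 nests/set, s = 29.5
se_m2, total2, se_t2 = expand_sets(71.3, 29.5, 10, 100)
print(round(se_m2, 2), total2, round(se_t2))
```

This reproduces the figures in the text: total 16570 nests (se 3510) over all habitat, and 7130 nests (se 885) in prime habitat.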
All stratified designs will have the same basic steps as listed below regardless
of the underlying design.
– In equal allocation, the total effort is split equally among all strata. Equal allocation is preferred when equally precise estimates are required for each stratum.³
– In proportional allocation, the total effort is allocated to the strata in proportion to stratum importance. Stratum importance could be related to stratum size (e.g. when allocating effort between the U.S. and Canada, then because the U.S. is 10 times larger than Canada, more effort should be allocated to surveying the U.S.). But if density is your measure of importance, allocate more effort to higher density strata. Proportional allocation is preferred when more precise estimates are required in more important strata.
– Neyman allocation. Neyman determined that if you also have information on the variability within each stratum, then more effort should be allocated to strata that are more important and more variable, to give the most precise overall estimate for a given sample size. This is rarely done in ecology because information on intra-stratum variability is often unknown.⁴
– Cost allocation. In general, effort should be allocated to more
important strata, more variable strata, or strata where sampling is
cheaper to give the best overall precision for the entire survey. As
in the previous allocation method, ecologists rarely have sufficiently
detailed cost information to do this allocation method.
• Conduct separate surveys in each stratum Separate independent
surveys are conducted in each stratum. It is not necessary to use the
same survey method in all strata. For example, low density quadrats
could be surveyed using aerial methods, while high density strata may
require ground based methods. Some strata may use simple random sam-
ples, while other strata may use cluster samples. Many textbooks show
examples where the same survey method is used in all strata, but this is NOT required.
• Obtain stratum specific estimates. Use the appropriate estimators to
estimate stratum means, proportions or totals (along with their se ) for
each stratum.
• Rollup The separate stratum estimates are then combined to give an
overall value for the entire survey region.
⁴ However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.
variable is known in advance for every plot (e.g. elevation of a plot). Post-
stratification is used if the stratum variable can only be ascertained after mea-
suring the plot, e.g. soil texture or soil pH. The advantages of pre-stratification
are that samples can be allocated to the various strata in advance to optimize the
survey and the analysis is relatively straightforward. With post-stratification,
there is no control over sample size in each of the strata, and the analysis is
more complicated (the problem is that the sample sizes in each stratum are now random). Post-stratification can result in significant improvements in precision but does not allow the finer control of sample sizes found in pre-stratification.
Stratification can be used with any type of sampling design – the concepts
introduced here deal with stratification applied to simple random samples but
are easily extended to more complex designs.
• variance estimates of the mean or of the total will be more precise when
compared to variances from an unstratified design if the units can be
divided into groups that are more homogeneous within groups than the
whole population.
• the cost of conducting a survey under stratification may be less as units
selected within a stratum are in closer proximity.
• different sampling methods may be used in each stratum for cost or con-
venience reasons. [In the detail below we assume that each stratum has
the same sampling method used, but this is only for simplification.]
• because randomization occurs independently in each stratum, corruption
of the survey design due to problems experienced in the field may be
confined.
• separate estimates for each stratum with a given precision can be obtained
• it may be more convenient to take a stratified random sample for admin-
istrative reasons. For example, the strata may refer to different district
offices.
First you will have to define the strata. Suppose that there is a gradient
in response from the top to the bottom of the map. Three strata are defined,
consisting of the first 3 rows, the next 5 rows, and finally, the last two rows.
It was decided to conduct a simple random sample within each stratum, with
sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process
on allocating samples to strata will be covered later.]
In this design, the same design was used in ALL strata, but this is NOT a
requirement for stratification. It is quite possible, and often desirable, to use
different methods in the different strata. For example, it may be more efficient
to survey desert areas using a fixed-wing aircraft, while ground surveys need to
be used in heavily forested areas.
4.7.2 Notation
The results below summarize the computations, which can be more easily thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter, we use a SRS design in each stratum, but it is not necessary to use this design in a stratum and each stratum could have a different design. In the case of an SRS, the estimate of the mean for each stratum is found as:

    µ̂_h = ȳ_h

with associated standard error:

    se(µ̂_h) = √( s_h²/n_h (1 − f_h) )
3. Compute the grand total and its se over all strata. This is the sum of the individual totals. The se is computed in a special way:

    τ̂ = τ̂₁ + τ̂₂ + ...

    se(τ̂) = √( se(τ̂₁)² + se(τ̂₂)² + ... )

4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand total by the total POPULATION sizes:

    µ̂ = τ̂ / (N₁ + N₂ + ...)

    se(µ̂) = se(τ̂) / (N₁ + N₂ + ...)
This can be summarized in a succinct form as follows. Note that the stratum weights W_h are formed as N_h/N and are often used to derive weighted means etc:

    Total:      τ = N Σ_{h=1..H} W_h µ_h  =  Σ_{h=1..H} τ_h  =  Σ_{h=1..H} N_h µ_h

    Estimator:  τ̂_str = N Σ_{h=1..H} W_h ȳ_h  =  Σ_{h=1..H} N_h ȳ_h

    se:         se(τ̂_str) = √( Σ_{h=1..H} N_h² se²(ȳ_h) )  =  √( Σ_{h=1..H} N_h² (s_h²/n_h)(1 − f_h) )
Notes
Suppose that you were asked to estimate the total amount of organic matter
suspended in a lake just after a storm. The first scheme that might occur to
you could be to cruise around the lake in a haphazard fashion and collect a few
sample vials of water which you could then take back to the lab. If you knew the
total volume of water in the lake, then you could obtain an estimate of the total
amount of organic matter by taking the product of the average concentration
in your sample and the total volume of the lake.
Nonetheless, taking a randomized sample from the entire lake would still not
be a totally sensible approach to the problem. Suppose that the lake were to be
fed by a single stream, and that most of the organic matter were concentrated
close to the mouth of the stream. If the sample were indeed representative, then
most of the vials would contain relatively low concentrations of organic matter,
whereas the few taken from around the mouth of the stream would contain much
higher concentration levels. That is, there is a real potential for outliers in the
sample. Hence, confidence limits based on the normal distribution would not
be trustworthy.
Furthermore, the sample mean is not as reliable as it might be. Its value
will depend critically on the number of vials sampled from the region close to
the stream mouth. This source of variation ought to be controlled.
Finally, it might be useful to estimate not just the total amount of organic
matter in the entire lake, but the extent to which this total is concentrated near
the mouth of the stream.
Then if a simple random sample of fixed size were to be taken from within
each of these “strata”, the results could be used to estimate the total amount of
organic matter within each stratum. These subtotals could then be added to
produce an estimate of the overall total for the lake.
How can we use the results of a stratified random sample to estimate the
overall total? The simplest way is to construct an estimate of the totals within
each of the strata, and then to sum these estimates. A sensible estimate of the
average within the h’th stratum is y h . Hence, a sensible estimate of the total
within the h’th stratum is τbh = Nh y h , and the overall total can be estimated
PH PH
by τb = h=1 τbh = h=1 Nh y h .
If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, N. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by

    µ̂ = Σ_{h=1..H} N_h ȳ_h / N.
A Numerical Example
Suppose that for the lake sampling example discussed earlier the lake were
subdivided into two strata, and that the following results were obtained. (All
readings are in mg per litre.)
We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction n_h/N_h is so close to 0 that it can be safely
ignored. For example, the standard error of the mean for stratum 1 is found as:
    se(µ̂₁) = √( s₁²/n₁ (1 − f₁) ) ≈ √( 4.23²/5 ) = 1.89

    Stratum    n_h    µ̂_h     se(µ̂_h)
    1          5      41.52    1.8935
    2          5      369.4    11.492
2 5 369.4 11.492
Next, we estimate the total organic matter in each stratum. This is found by
multiplying the mean concentration and se of each stratum by the total volume:
    τ̂_h = N_h × µ̂_h

    se(τ̂_h) = N_h × se(µ̂_h)
For example, the estimated total organic matter in stratum 1 is found as τ̂₁ = N₁ µ̂₁ = 7.5 × 10⁸ × 41.52 = 311.4 × 10⁸:

    Stratum    n_h    µ̂_h     se(µ̂_h)   τ̂_h          se(τ̂_h)
    1          5      41.52    1.8935     311.4 ×10⁸   14.175 ×10⁸
    2          5      369.4    11.492     92.3 ×10⁸    2.873 ×10⁸
Next, we total the organic content of the two strata and find the se of the grand total as √(14.175² + 2.873²) × 10⁸ to give the summary table:

    Stratum    n_h    µ̂_h     se(µ̂_h)   τ̂_h          se(τ̂_h)
    1          5      41.52    1.8935     311.4 ×10⁸   14.175 ×10⁸
    2          5      369.4    11.492     92.3 ×10⁸    2.873 ×10⁸
    Total                                 403.7 ×10⁸   14.46 ×10⁸
Finally, the overall grand mean is found by dividing by the total volume of
the lake 7.75 × 108 to give:
    µ̂ = (403.7 × 10⁸) / (7.75 × 10⁸) = 52.09 mg/L

    se(µ̂) = (14.46 × 10⁸) / (7.75 × 10⁸) = 1.87 mg/L
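The whole stratified rollup for this example can be sketched in a few lines. This is an illustration, not part of the original notes; the stratum volumes (N₁ = 7.5 × 10⁸ L and N₂ = 0.25 × 10⁸ L) are inferred from the stratum totals in the table above, since the raw stratum table is not reproduced here.

```python
import math

# (N_h = stratum volume in litres, mean in mg/L, se of the mean) for each stratum
strata = [(7.5e8, 41.52, 1.8935),
          (0.25e8, 369.4, 11.492)]

totals = [(N * m, N * se) for N, m, se in strata]  # tau_h and se(tau_h)
grand_total = sum(t for t, _ in totals)
se_grand = math.sqrt(sum(se ** 2 for _, se in totals))  # strata are independent

N_total = sum(N for N, _, _ in strata)  # 7.75e8 litres
mean = grand_total / N_total
se_mean = se_grand / N_total
print(f"total: {grand_total:.4g} mg, se {se_grand:.4g}")
print(f"mean:  {mean:.2f} mg/L, se {se_mean:.2f}")
```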
The calculations required to compute the stratified estimate can also be done
using the method of weighted averages as shown in the following table:
Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is √3.4963 = 1.870 mg/L; an approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.
The standard error in the estimator of the overall average is markedly re-
duced in this example by the stratification. The standard error was just esti-
mated for the stratified estimator to be around 2. This result was for a sample
of total size 10. By contrast, for an estimator based on a simple random sample
of the same size, the standard error can be found to be about 20. [This involves
methods not covered in this class.] Stratification has reduced the standard error
by an order of magnitude.
It is also possible that we could reduce the standard error even further with-
out increasing our sampling effort by somehow allocating this effort more effi-
ciently. Perhaps we should take fewer water samples from the region far from
the outlet, and take more from the other stratum. This will be covered later in
this course.
One can also read in more comprehensive accounts how to construct esti-
mates from samples that are stratified after the sample is selected. This is
known as post-stratification. These methods are useful if, e.g. you are sam-
pling a population with a known sex ratio. If you observe that your sample is
biased in favor of one sex, you can use this information to build an improved
estimate of the quantity of interest through stratifying the sample by sex after
it is collected. It is not necessary that you start out with a plan for sampling
some specified number of individuals from each sex (stratum).
Nonetheless, in any survey work, it is crucial that you begin with a plan.
There are many examples of surveys that produced virtually useless results
because the researchers failed to develop an appropriate plan. This should
include a statement of your main objective, and detailed descriptions of how
you plan to generate the sample, collect the data, enter them into a computer
file, and analyze the results. The plan should contain discussion of how you
propose to check for and correct errors at each stage. It should be tested with
a pilot survey, and modified accordingly. Major, ongoing surveys should be
reassessed continually for possible improvements. There is no reason to expect
that the survey design will be perfect the first time that it is tried, nor that
flaws will all be discovered in the first round. On the other hand, one should
expect that after many years experience, the researchers will have honed the
survey into a solid instrument. George Gallup’s early surveys were seriously
biased. Although it took over a decade for the flaws to come to light, once they
did, he corrected his survey design promptly, and continued to build a strong
reputation.
DFO needs to monitor the catch of sockeye salmon as the season progresses so
that stocks are not overfished.
The season in one statistical sub-area in a year was a total of 2 days (!) and
250 vessels participated in the fishery in these 2 days. A census of the catch of
each vessel at the end of each day is logistically difficult.
Here is the raw data - each line corresponds to the observers’ count for that
vessel for that day. On the second day, a new random sample of vessels was
selected. On both days, 250 vessels participated in the fishery.
Date Sockeye
29-Jul-98 337
29-Jul-98 730
29-Jul-98 458
29-Jul-98 98
29-Jul-98 82
29-Jul-98 28
29-Jul-98 544
29-Jul-98 415
29-Jul-98 285
29-Jul-98 235
29-Jul-98 571
29-Jul-98 225
29-Jul-98 19
29-Jul-98 623
29-Jul-98 180
30-Jul-98 97
30-Jul-98 311
30-Jul-98 45
30-Jul-98 58
30-Jul-98 33
30-Jul-98 200
30-Jul-98 389
30-Jul-98 330
30-Jul-98 225
30-Jul-98 182
30-Jul-98 270
30-Jul-98 138
30-Jul-98 86
30-Jul-98 496
30-Jul-98 215
It is not clear how the list of fishing boats was generated. It seems unlikely that
the aerial survey actually had a picture of the boats on the water from which
DFO selected some boats. More likely, the observers were taken onto the water
in some systematic fashion, and then the observer selected a boat at random
from those seen at this point. Hence the sampling frame is the set of locations
chosen to drop off the observers and the set of boats visible from these points.
The sampling unit is a boat on a day. The strata are the two days. On each
day, a random sample was selected from the boats participating in the fishery.
This is a stratified design with a simple random sample selected each day.
Excel analysis
Summary statistics
The Excel builtin functions are used to compute the summary statistics (sample
size, sample mean, and sample standard deviation) for each stratum. Some
caution needs to be exercised to ensure that the range of each function covers only the data for that stratum.⁵
You will also need to specify the stratum size (the total number of sampling
units in each stratum), i.e. 250 vessels on each day.
Because the sampling design in each stratum is a simple random sample, the
same formulae as in the previous section can be used.
The mean and its estimated se for each day of the opening are reported in the spreadsheet.
The estimated total catch is found by multiplying the average catch per boat by
the total number of boats participating in the fishery. The estimated standard
error for the total for that day is found by multiplying the standard error for
the mean by the stratum size as in the previous section.
For example, in the first stratum (29 July), the estimated total catch is
found by multiplying the estimated mean catch per boat (322) by the number
of boats participating (250) to give an estimated total catch of 80,500 salmon
for the day. The se for the total catch is found by multiplying the se of the
mean (57) by the number of boats participating (250) to give the se of the total
catch for the day of 14,200 salmon.
Once an estimated total is found for each stratum, the estimated grand total
is found by summing the individual stratum estimated totals. The estimated
standard error of the grand total is found by the square root of the sum of the
squares of the standard errors in each stratum - the Excel function sumsq is
useful for this computation.
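Although the text does these computations in Excel, the same arithmetic can be sketched in a few lines of Python. The summary statistics (means, standard deviations, and sample sizes of 15 boats per day) are taken from the example above; small rounding differences from the spreadsheet values are expected.

```python
import math

# Summary statistics for each stratum (day) from the sockeye example:
# stratum size N, sample size n, sample mean, sample standard deviation.
strata = {
    "29-Jul": dict(N=250, n=15, mean=322.0, sd=226.8),
    "30-Jul": dict(N=250, n=15, mean=205.0, sd=135.7),
}

totals, se_totals = [], []
for s in strata.values():
    f = s["n"] / s["N"]                                  # sampling fraction
    se_mean = s["sd"] / math.sqrt(s["n"]) * math.sqrt(1 - f)
    totals.append(s["N"] * s["mean"])                    # stratum total = N_h * ybar_h
    se_totals.append(s["N"] * se_mean)                   # se of total scales by N_h too

grand_total = sum(totals)                                # 80500 + 51250 = 131750
se_grand = math.sqrt(sum(se ** 2 for se in se_totals))   # about 16,540
```

The last line is exactly the "square root of the sum of squared standard errors" rule that the sumsq function implements in the spreadsheet.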
The grand mean was not computed in the spreadsheet, but is easily found by dividing
the total catch by the total number of boat days in the fishery (250+250=500).
The se is found by dividing the se of the total catch also by 500.
Note this is interpreted as the mean number of fish captured per day per
boat.
SAS analysis
As noted earlier, some care must be used when standard statistical packages are
used to analyze survey data as many packages ignore the design used to select
the data.
A sample SAS program for the analysis of the sockeye example, called sockeye.sas,
and its output, called sockeye.lst, are available from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
29-Jul 544
29-Jul 415
29-Jul 285
29-Jul 235
29-Jul 571
29-Jul 225
29-Jul 19
29-Jul 623
29-Jul 180
30-Jul 97
30-Jul 311
30-Jul 45
30-Jul 58
30-Jul 33
30-Jul 200
30-Jul 389
30-Jul 330
30-Jul 225
30-Jul 182
30-Jul 270
30-Jul 138
30-Jul 86
30-Jul 496
30-Jul 215
;;;;
data n_boats; /* you need to specify the stratum sizes if you want stratum totals */
length date $8.;
date = '29-Jul'; _total_=250; output; /* the stratum sizes must be the variable _total_ */
date = '30-Jul'; _total_=250; output;
The program starts with reading in the raw data and the computation of the
sampling weights. Because the population size and sample size are the same for
each stratum, the sampling weights are common to all boats. In general, this
is not true, and a separate sampling weight computation is required for each
stratum.
A separate file is also constructed with the population sizes for each stratum
so that estimates of the population total can be constructed.
1 29-Jul 250
2 30-Jul 250
Number of sockeye caught - example of stratified simple random sampling
number of boats in each stratum
Data Summary
Number of Strata 2
Number of Observations 30
Sum of Weights 500
Stratum Information
Statistics

                             Std Error
Variable           Mean        of Mean        Sum    Std Dev
------------------------------------------------------------
sockeye      263.500000      33.082758     131750      16541
------------------------------------------------------------
The only thing of “interest” is to note that SAS labels the precision of the
estimated grand mean as a standard error, while it labels the precision of the
estimated total as a standard deviation! Both are correct - a standard error is
a standard deviation - not of individual units in the population, but of the
estimates over repeated sampling from the same population. I think it is clearer
to label both as standard errors to avoid any confusion.
If separate analyses are wanted for each stratum, the SURVEYMEANS procedure
has to be run a second time with a BY statement to estimate the means and
totals in each stratum.
JMP analysis
The data are available in a JMP file called sockeye.jmp from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data are entered as usual into JMP. Ensure that the variable that identi-
fies the strata is nominally scaled and that the response variable is continuously
scaled:
We start by finding the summary statistics for EACH stratum. This is done
using the Analyze->Distribution platform as in the analysis of the creel data,
but we want a separate analysis for each stratum. This is obtained by using the
By box in the dialogue box:
The estimated catch per boat in the first stratum is 322 (se 59) and in the
second stratum is 205 (se 35). Note that the differences in se between the JMP
and the Excel results are minimal because the sampling fraction (15/250 = 6%) is
very small.
We need to find the sample size, the mean, and the standard deviation for
each stratum. We can use the Tables->Summary pop-down menu as shown
below:
Note that we must specify the grouping variable in this case to identify the
strata.
This generates the summary table with statistics for each stratum:
At this point, further computations using JMP are clumsy. It may be easier
to use the summary statistics and transfer them to a spreadsheet to continue
the computations.
Add a new column for the number of vessels in each stratum, and two more
columns where you estimate the total catch for the day (mean catch x number
of vessels) and the se squared of the total in each stratum using the
formula:
This generates the revised summary table with statistics for each stratum:
Finally, add the estimated totals and the squares of the se's from each stratum
to get the overall total and its variance. Take the square root of the sum of
the squared se's to get the overall se and, voila,
Hence our final estimate is that a total of 131,750 sockeye were caught with
a se of 16541 fish.
In a stratified sample, there are many estimates that are obtained with different
standard errors. It can sometimes be confusing as to which estimate is used for
which purpose.
Here is a brief review of the four possible estimates and the level of interest
in each estimate.
Stratum mean
  Estimator: mu-hat_h = ybar_h
  se: sqrt( (s_h^2 / n_h) (1 - f_h) )
  Example: Stratum 1. Estimate is 322; se of 56.8 (not shown). The
  estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July.
  Who would be interested? A fisher who wishes to fish ONLY the first day
  of the season and wants to know if it will meet expenses.

Stratum total
  Estimator: tau-hat_h = N_h mu-hat_h = N_h ybar_h
  se: N_h se(mu-hat_h) = N_h sqrt( (s_h^2 / n_h) (1 - f_h) )
  Example: Stratum 1. Estimate is 80,500 = 250 x 322; se of
  14,195 = 250 x 56.8. The estimated total catch over all boats on 29 July
  was 80,500 (se 14,195).
  Who would be interested? DFO, who wishes to estimate the TOTAL catch over
  ALL boats on this single day so that the quota for the next day can be
  set.

Grand total
  Estimator: tau-hat = tau-hat_1 + tau-hat_2
  se: sqrt( se(tau-hat_1)^2 + se(tau-hat_2)^2 )
  Example: Estimate is 131,750 = 80,500 + 51,250; se is
  sqrt(14195^2 + 8492^2) = 16,541. The estimated total catch over all boats
  over all days is 132,000 fish (se 17,000 fish).
  Who would be interested? DFO, who wishes to know the total catch over the
  entire fishing season so that impacts on the stock can be examined.

Grand mean
  Estimator: mu-hat = tau-hat / N
  se: se(tau-hat) / N
  Example: Grand mean (not shown). N = 500 vessel-days. Estimate is
  131,750 / 500 = 263.5; se is 16,541 / 500 = 33.0. The estimated catch per
  boat per day over the entire season was 263 fish (se 33 fish).
  Who would be interested? A fisher who wants to know the average catch per
  boat per day for the entire season to see if it will meet expenses.
As before, the question arises as to how many units should be selected in
stratified designs. Two questions need to be answered. First, what is the
total sample size required? Second, how should this total be allocated among
the strata?
The total sample size can be determined using the same methods as for a
simple random sample. I would suggest that you initially ignore the fact that
the design will be stratified when finding the initial required total sample size.
If stratification proves to be useful, then your final estimate will be more precise
than you anticipated (always a nice thing to happen!) but seeing as you are
making guesses as to the standard deviations and necessary precision required,
I wouldn’t worry about the extra cost in sampling too much.
If you must, it is possible to derive formulae for the overall sample size when
accounting for stratification, but these are relatively complex. It is likely
easier to build a general spreadsheet where a single cell holds the total sample
size and all other cells depend upon this quantity through the allocation used.
Then the total sample size can be manipulated to obtain the desired precision.
The following information will be required:
The standard deviations from this survey will be used as ‘guesses’ for what
might happen next year. As in this year’s survey, the total sample size will be
allocated evenly between the two days.
In this case, the total sample size must be allocated to the two strata. You
will see several methods in a later section to do this, but for now, assume
that the total sample will be allocated equally among both strata. Hence the
proposed sample size of 75 is split in half to give a proposed sample size of 37.5
in each stratum. Don’t worry about the fractional sample size - this is only a
planning exercise. We create one cell that has the total sample size, and then
use the formulae to allocate the total sample size equally to the two strata.
The total and the se of the overall total are found as before, and the relative
precision (denoted the relative standard error (rse) and, unfortunately, in
some books the coefficient of variation (cv)) is found as the estimated standard
error divided by the estimated total.
Again, this portion of the spreadsheet is set up so that changes in the total
sample size are propagated throughout the sheet. If you change the total sample
size from 75 to some other number, this is automatically split among the two
strata, which then affects the estimated standard error for each stratum, which
then affects the estimated standard error for the total, which then affects the
relative standard error. Again, the proposed total sample size can be varied
using trial and error, or the Excel Goal-Seek option can be used.
Total n=75

Stratum       n   Mean   std dev   vessels   Est total   se(Est total)
29-Jul     37.5    322     226.8       250       80500            8537
30-Jul     37.5    205     135.7       250       51250            5107
Total                                           131750            9948
                                                            rse   7.6%
A sample size of 75 is too small. Try increasing the sample size until the rse
is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to
find the sample size that gives a relative standard error of 5% or less, as
shown below:
Total n=145

Stratum       n   Mean   std dev   vessels   Est total   se(Est total)
29-Jul     72.5    322     226.8       250       80500            5611
30-Jul     72.5    205     135.7       250       51250            3357
Total                                           131750            6539
                                                            rse   5.0%
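The trial-and-error (or Goal-Seek) search is easy to script. The sketch below, in Python rather than Excel, reproduces the two planning tables above; the standard deviations are the planning guesses from this year's survey and the split is the equal allocation assumed in the text.

```python
import math

N = 250                                      # boats per day (stratum size)
sd = {"29-Jul": 226.8, "30-Jul": 135.7}      # planning guesses for std dev
mean = {"29-Jul": 322.0, "30-Jul": 205.0}
grand_total = sum(N * m for m in mean.values())      # 131750

def rse(n_total):
    """Relative se of the grand total under an equal allocation n_h = n/2."""
    n_h = n_total / 2
    var = 0.0
    for day in sd:
        se_total = N * sd[day] / math.sqrt(n_h) * math.sqrt(1 - n_h / N)
        var += se_total ** 2
    return math.sqrt(var) / grand_total

print(round(rse(75), 3))     # about 0.076, matching the 7.6% in the first table
print(round(rse(145), 3))    # about 0.050, matching the second table
```

Wrapping rse() in a loop over candidate totals plays the same role as Excel's Goal Seek.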
There are a number of ways of allocating a sample of size n among the various
strata. For example,
1. Equal allocation. Under an equal allocation scheme, all strata get the
same sample size, i.e. n_h = n/H. This allocation is best if the variances of
the strata are roughly equal, equally precise estimates are required for each
stratum, and you wish to test for differences in means among strata (i.e.
an analytical survey as discussed in previous sections).
Let C_h be the cost of sampling each unit in stratum h. Then the total cost
of the survey is C = sum_h n_h C_h. The allocation rule is that sample sizes
should be proportional to the product of the stratum sizes and stratum
standard deviations, and inversely proportional to the square root of the
cost of sampling, i.e.

  n_i = n x (W_i S_i / sqrt(C_i)) / sum_h (W_h S_h / sqrt(C_h))
      = n x (N_i S_i / sqrt(C_i)) /
            (N_1 S_1 / sqrt(C_1) + N_2 S_2 / sqrt(C_2) + ... + N_H S_H / sqrt(C_H))

This implies that large samples are found in strata that are larger, more
variable, or cheaper to sample.
In practice, most of the gain in precision occurs in moving from equal to
proportional allocation, while often only small improvements in precision are
gained in moving from proportional allocation to Neyman allocation. Similarly,
unless cost differences are enormous, there isn't much of an improvement in
precision in moving to an optimal allocation based on costs.
The total sample size can be found by varying the sample total until the
desired precision is found.
Results from previous year’s survey: Here are the summary statistics from
the survey in a previous year:
Map-squares sampled
Stratum Nh nh y s Est total se (total)
1 400 98 24.1 74.7 9640 2621
2 40 10 25.6 63.7 1024 698
3 100 37 267.6 589.5 26760 7693
4 40 6 179 151.0 7160 2273
5 70 39 293.7 351.5 20559 2622
6 120 21 33.2 99.0 3984 2354
Total 770 211 69127 9172
Equal allocation
What would happen if an equal allocation were used? We now split the 211
total sample size equally among the 6 strata. In this case, the sample sizes are
‘fractional’, but this is OK as we are interested only in planning to see what
would have happened. Notice that the estimate of the overall population would
NOT change, but the se changes.
An equal allocation gives rise to worse precision than the original survey.
Examining the table in more detail, you see that an equal allocation assigns
far too many samples to strata 2 and 4 and not enough to strata 1 and 3.
Proportional allocation
What about proportional allocation? Now the sample size is proportional to the
stratum population sizes. For example, the sample size for stratum 1 is found
as 211 × 400/770. The following results are obtained:
This has an even worse standard error! It looks like not enough samples are
placed in stratum 3 or 5.
Optimal allocation
What if both the stratum sizes and the stratum variances are to be used in
allocating the sample? We create a new column (at the extreme right) which is
equal to Nh Sh . Now the sample sizes are proportional to these values, i.e. the
sample size for the first stratum is now found as 211 × 29866.4/133893.8. Again
the estimate of the total doesn’t change but the se is reduced.
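The three allocations just compared can be computed directly. Here is a Python sketch using the stratum sizes and standard deviations from the summary table above; tiny discrepancies with the figures quoted in the text (e.g. 29,866.4 vs 400 x 74.7 = 29,880) come from rounding of the printed standard deviations.

```python
N = [400, 40, 100, 40, 70, 120]              # stratum sizes (map-squares)
S = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]  # stratum std devs, prior survey
n = 211                                      # total sample size
H = len(N)

equal = [n / H] * H                                    # n_h = n/H
prop = [n * Ni / sum(N) for Ni in N]                   # n_h proportional to N_h
NS = [Ni * Si for Ni, Si in zip(N, S)]
neyman = [n * x / sum(NS) for x in NS]                 # n_h proportional to N_h S_h

# Neyman allocation puts most of the effort into stratum 3,
# which is both large and highly variable.
print([round(x, 1) for x in neyman])
```

Note that all three allocations sum to the same total n; only the split among strata changes, which is why the estimated total stays fixed while the se changes.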
The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan,
is a large bird with white plumage and black legs, feet, and beak. 6 The USFWS
is responsible for conserving and protecting tundra swans as a migratory bird
under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation
Act of 1980. As part of these responsibilities, it conducts regular aerial surveys
at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay
population of tundra swans is of particular interest because suitable nesting
habitat is available there earlier than in most other nesting areas. This
example is based on one such survey. 7
Tundra swans are highly visible on their nesting grounds making them easy
to monitor during aerial surveys.
The Bristol Bay refuge has been divided into 186 survey units, each being a
quarter section. These survey units have been divided into three strata based
on density, and previous years' data provide the following information about
the strata:
asp?id=78&cid=7
7 Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska
The three strata all have approximately the same total area (number of survey
units), so allocations based on stratum area would be approximately equal
across strata. However, that would place about 1/3 of the effort into the low
density stratum, which typically has fewer birds. It was felt that stratum
density is a suitable measure of stratum importance (notice the close
relationship between stratum density and stratum standard deviation that is
often found in biological surveys). Consequently, an allocation based on
stratum density was used. This allocation would place about 30 x 20/32 = 18
units in the high density stratum; about 30 x 10/32 = 9 units in the medium
density stratum; and the remainder (3 units) in the low density stratum.
The first thing to notice from the table above is that not all survey units
could be surveyed because of poor weather. As always with missing data, it is
important to determine if the data are Missing Completely at Random (MCAR).
In this case, it seems reasonable that the swans did not adjust their behavior
knowing that certain survey units would be sampled on the poor weather days,
and so
there is no impact of the missing data other than a loss of precision compared
to a survey with a full 30 survey units chosen.
Also notice that "blanks" in the table (missing values) represent zeros and
should be entered as zeros, not left as missing values.
Finally, not all of the survey units are the same area. This could introduce
additional variation into our data which may affect our final standard errors.
Even though the survey units are of different areas, the survey units were
chosen as a simple random sample, so ignoring the area will NOT introduce bias
into the estimates (why?). You will see in later sections how to compute a
ratio estimator which could take the area of each survey unit into account and
potentially lead to more precise estimates.
JMP analysis

The data are imported into JMP and are available in a JMP file tundra.jmp in
the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data sheet appears below. Notice that zeros have been inserted in the
appropriate locations.
Because units in each stratum were selected using a simple random sample, the
appropriate summary measures are the mean and the standard error of the mean
for each stratum. Use the Tables->Summary pop-down menu:
Because the sample fraction is smallish, the finite population correction factor
is ignored in each stratum. This gives the summary table:
In order to estimate the total number of swans in each stratum, augment the
table with the number of survey units in each stratum, 8 and then multiply the
mean and standard error by the number of survey units to estimate the total
swans and se(total) in each stratum using the formula editor: 9
Hence we estimate that about 2000 swans are present in the H and M strata,
but just under 100 in the L stratum. The grand total is found by adding the
estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard
error of the grand total is found in the usual way,
sqrt(230.12^2 + 29.00^2 + 714.27^2) = 751, either using a calculator or a
spreadsheet.
The standard error is larger than desired, mostly because of the very small
sample size in the M stratum where only 3 of the 9 proposed survey units could
be surveyed.
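The grand-total arithmetic above can be checked with a few lines of Python; the stratum totals and standard errors are read off the JMP summary table.

```python
import math

totals = {"H": 2150, "L": 87, "M": 1700}          # estimated swans per stratum
se = {"H": 230.12, "L": 29.00, "M": 714.27}       # se of each stratum total

grand = sum(totals.values())                            # 3937 swans
se_grand = math.sqrt(sum(s ** 2 for s in se.values()))  # about 751
```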
The data are read into SAS in the usual fashion with the code fragment:
8 Notice that JMP sorts the strata alphabetically
9 It may be easiest at this point to import back to Excel
data swans;
length survey_unit $10 stratum $1;
input survey_unit $ stratum $ area num_flocks num_single num_pairs;
num_swans = num_flocks + num_single + 2*num_pairs;
datalines;
... data inserted here ...
The total number of survey units in each stratum is also read into SAS using
the code fragment. Notice that the variable that has the number of stratum
units must be called _total_ as required by the SurveyMeans procedure.
data total_survey_units;
length stratum $1.;
input stratum $ _total_; /* must use _total_ as variable name */
datalines;
h 60
m 68
l 58
;;;;
Next the data are sorted by stratum (not shown), the number of actual
survey units surveyed in each stratum is found using Proc Means:
Most survey procedures in SAS require the use of sampling weights. These
are the reciprocal of the probability of selection. In this case, this is simply
the number of units in the stratum divided by the number sampled in that
stratum:
data swans;
merge swans total_survey_units n_units;
by stratum;
sampling_weight = _total_ / n;
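The weight computation in the DATA step above amounts to the following one-line rule; the sketch is in Python for illustration, and the only sampled count stated explicitly in the text is the 3 units surveyed in the medium-density stratum.

```python
# Sampling weight = 1 / P(selection) = (units in stratum) / (units sampled).
def sampling_weight(stratum_size, n_sampled):
    return stratum_size / n_sampled

# e.g. the medium-density stratum: 68 survey units, only 3 surveyed,
# so each sampled unit "represents" about 22.7 units.
w_m = sampling_weight(68, 3)
```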
Now the individual stratum estimates are obtained using the code fragment:
The estimates in the L and M strata are not very precise because of the small
number of survey units selected. SAS has incorporated the finite population
correction factor when estimating the se for the individual stratum estimates.
We estimate that about 2000 swans are present in the H and M strata, but
just over 100 in the L stratum. The grand total is found by adding the estimated
totals from the strata 2150+87+1700=3937,
√ and the standard error of the grand
total is found in the usual way 2062 + 282 + 6982 = 729.
Proc SurveyMeans can be used to estimate the grand total number of swans
over all strata using the code fragment:
The standard error is larger than desired, mostly because of the very small
sample size in the M stratum where only 3 of the 9 proposed survey units could
be surveyed.
In some cases, we need to allocate units based upon two or more variables. We
first find the optimal allocation for each variable and see how the allocations
differ from one another - in many cases, there may not be a big difference.
There are quite complicated schemes available to optimize the allocation for
two or more variables. These will not be covered in this course.
4.8.5 Post-stratification
In these cases, we take a simple random sample and post-stratify the units
after the sample is taken. We assume that the stratum sizes Nh are still known,
say from other sources.
The estimates of the population mean, total, etc., don’t change. However,
the variances must be increased to account for the fact that the sample size
in each stratum is no longer fixed. This introduces an additional source of
variation for the estimate, i.e. estimates will vary from sample to sample not
only because a new sample is drawn each time, but also because the sample size
within a stratum will change, leading to different precisions in each new sample.
This is covered in many standard books on sampling theory and is not covered
further in this course.
A student wrote:
Both statements are correct. If you are interested in estimates for individual
populations, then absolute sample size is important.
If you wanted equally precise estimates for BOTH Canada and the US, then you
would use equal sample sizes from both populations, say 1000 from each
population, even though their overall population sizes differ by a factor of
10:1.
Why does this happen? Well, if you are interested in the overall population,
then the US results essentially drive everything and Canada has little effect
on the overall estimate. Consequently, it doesn't matter that the Canadian
estimate is not as precise as the US estimate.
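The point is easy to verify numerically: with the finite population correction, the se of a mean from n = 1000 barely depends on whether the population has 30 million or 300 million members. A short Python sketch (the population sizes are round figures and the standard deviation of 1 is purely illustrative):

```python
import math

def se_mean(s, n, N):
    """se of a sample mean under SRSWOR of size n from a population of N."""
    return s / math.sqrt(n) * math.sqrt(1 - n / N)

canada = se_mean(1.0, 1000, 30_000_000)     # about 0.0316
usa = se_mean(1.0, 1000, 300_000_000)       # about 0.0316
# The two se's differ by less than 0.002%: only n matters here.
```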
the stronger the relationship, the better it will perform. This technique is often
called ratio-estimation or regression-estimation.
Notice that multi-phase designs often use an auxiliary variable as well, but
there the second variable is only measured on a subset of the sample units;
this should not be confused with the ratio estimators in this section.
Ratio estimation has two purposes. First, in some cases, you are interested
in the ratio of two variables, e.g. what is the ratio of wolves to moose in a region
of the province.
Why is the ratio defined in this way? There are two common ratio estimators,
traditionally called the mean-of-ratio and the ratio-of-mean estimators. Suppose
you had the following data for Y and X which represent the counts of animals
of species 1 and 2 taken on 3 different days:
Sample
1 2 3
Y 10 100 20
X 3 20 1
The mean-of-ratio estimator should be used when you wish to give equal
weight to each pair of numbers regardless of the magnitude of the numbers.
For example, you may have three plots of land, and you measure Y and X on
each plot, but because of observer efficiencies that differ among plots, the raw
numbers cannot be compared. For example, on a cloudy, rainy day it is hard
to see animals (first case), but on a clear, sunny day it is easy to see animals
(second case). The actual numbers themselves cannot be combined directly.
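With the tiny 3-day data set above, the two estimators give quite different answers, which a short Python sketch makes concrete:

```python
Y = [10, 100, 20]    # counts of species 1 on each day
X = [3, 20, 1]       # counts of species 2 on each day

# Mean of ratios: average the per-day ratios, weighting each day equally.
mean_of_ratios = sum(y / x for y, x in zip(Y, X)) / len(Y)  # (10/3 + 5 + 20)/3, about 9.44

# Ratio of means: pool the counts first, letting the big day dominate.
ratio_of_means = sum(Y) / sum(X)                            # 130/24, about 5.42
```

The second day, with the largest counts, pulls the ratio-of-means toward 5 while the third day's ratio of 20 dominates the mean-of-ratios.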
In practice, plot yi vs. xi from the sample and see what type of relation exists.
When can a ratio estimator be used? A ratio estimator will require that
another variable (the X variable) be measured on the selected sampling units.
Furthermore, if you are estimating the overall mean or total, the total value of
the X-variable over the entire population must also be known. For example, as
seen in the examples to come, the total area must be known to estimate the
total number of animals once the density (animals/ha) is known.
Notes
• The term s2_diff = sum_{i=1}^{n} (y_i - r x_i)^2 / (n - 1) is computed by
  creating a new column y_i - r x_i and finding the (sample standard
  deviation)^2 of this new derived variable. This will be illustrated in the
  examples.
• In some cases the mu_X^2 in the denominator may or may not be known, and
  either it or its estimate xbar^2 can be used in its place. There doesn't
  seem to be any empirical evidence that either is better.

• The term tau_X^2 / mu_X^2 reduces to N^2.
[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs inter-
changes the use of x and y in the ratio.]
Sub-area   Wolves   Moose
4              27     725
5              14     265
6               3      87
7              12     410
8              19     675
9               7     290
10             10     370
11             16     510
Having said this, do the numbers of moose and wolves measured on each
sub-area include young moose and young wolves or just adults? How will
immigration and emigration be taken care of?
The frame consists of the 200 sub-areas of the game management zone.
Presumably these 200 sub-areas cover the entire zone, but what about emigration
and immigration? Moose and wolves may move into and out of the zone.
How did they determine the counts in the sub-areas? Perhaps they simply
looked for tracks in the snow in winter - it seems difficult to get estimates from
the air in summer when there is lots of vegetation blocking the view.
Excel analysis
A copy of the workbook to perform the analysis of these data is called wolf.xls
and is available from the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The ratio estimator works well if the relationship between Y and X is linear,
through the origin, with increasing variance with X. Begin by plotting Y (wolves)
vs X (moose).
Refer to the screen shot of the spreadsheet. The Excel builtin functions are
used to compute the sample size, sample mean, and sample standard deviation
for each variable.
The ratio is computed using the formula for a ratio estimator in a simple
random sample, i.e.

  r = ybar / xbar
Then for each observation, the difference between the observed Y (the actual
number of wolves) and the predicted Y based on the number of moose
(Y-hat_i = r X_i) is found. Notice that the sum of the differences must equal
zero.
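The spreadsheet computations can be mirrored in Python using the full data set (the same eleven sub-areas that appear in the SAS program later in this section); the frame size of N = 200 sub-areas supplies the finite population correction.

```python
import math

wolf = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
n, N = len(wolf), 200                      # 11 sub-areas sampled of 200

r = sum(wolf) / sum(moose)                 # 140/4352, about 0.03217 wolf/moose

# Differences between observed and predicted wolves; they sum to zero.
diff = [y - r * x for y, x in zip(wolf, moose)]
s2_diff = sum(d ** 2 for d in diff) / (n - 1)

xbar = sum(moose) / n
se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - n / N))  # about 0.00244
```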
Final estimate
Our final result is that the estimated ratio is 0.03217 wolf/moose with an
estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval
would be computed in the usual fashion.
The key variable for the standard error is the total sample size (which you
can modify) and the standard deviation of the differences - which is estimated
from the previous survey.
As before, create a new spreadsheet where you can modify the total sample
size and see the effect upon precision. This will be left as an exercise for the
reader.
SAS Analysis
The above computations can also be done in SAS with the program wolf.sas,
available from the Sample Program Library at
http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. It uses Proc
SurveyMeans, which gives the output contained in wolf.lst.
data wolf;
input subregion wolf moose;
datalines;
1 8 190
2 15 370
3 9 460
4 27 725
5 14 265
6 3 87
7 12 410
8 19 675
9 7 290
10 10 370
11 16 510
;;;
The SAS program again starts with the DATA step to read in the data.
Because the sampling weights are equal for all observations, it is not necessary
to include them when estimating a ratio (the weights cancel out in the formula
used by SAS).
The PLOT procedure creates a plot similar to that in the Excel spreadsheet.
1 1 8 190
2 2 15 370
3 3 9 460
4 4 27 725
5 5 14 265
6 6 3 87
7 7 12 410
8 8 19 675
9 9 7 290
10 10 10 370
11 11 16 510
==== =====
140 4352
20 +
|
19 + A
|
18 +
|
17 +
|
16 + A
|
15 + A
|
14 + A
|
13 +
|
12 + A
|
11 +
|
10 + A
|
9 + A
|
8 + A
|
7 + A
|
6 +
|
5 +
|
4 +
|
3 + A
---+-------------+-------------+-------------+-------------+--
0 200 400 600 800
moose
Data Summary
Number of Observations 11
Statistics
Ratio Analysis
JMP Analysis
The JMP data table is available in the file wolf.jmp from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Because the ratio estimator assumes that the variance of the response increases
with the value of X, a new column representing the inverse of the X variable
(i.e. 1/(number of moose)) has been created. Be sure all variables are
continuously scaled:
The graph looks like it is linear through the origin, which is one of the
assumptions of the ratio estimator. Then use Fit Special:
and select the no intercept option to force the regression line through the origin:
We see that the estimated ratio (.032 wolves/moose) matches the Excel output,
but the estimated standard error (.0026) does not quite match Excel. The
difference is a bit larger than can be accounted for by not using the finite
population correction factor.
It is possible to reproduce the ratio estimator and its standard error exactly.
As in previous sections, the use of JMP is a bit clumsy.
Use the Tables->Summary command to get the total wolves and total moose as in
previous examples. The estimate is easily obtained as r = ybar/xbar =
(total wolves)/(total moose), since the sample sizes cancel. Create a new
column to compute the ratio, giving:
Find the standard deviation of this new column, again using the Tables->Summary
command. Compute the se as:
This gives the final result, which matches the previous results.
Post mortem
No population numbers can be estimated using the ratio estimator in this case
because of a lack of suitable data.
In particular, if you had wanted to estimate the total wolf population, you
would have to use the simple inflation estimator that we discussed earlier unless
you had some way of obtaining the total number of moose that are present in
the ENTIRE management zone. This seems unlikely.
Note that the population total of the auxiliary variable will have to be known
in order to use this method.
Grouse Numbers
Area Number
(ha) Grouse
8.9 24
2.7 3
6.6 10
20.6 36
3.7 8
4.1 8
25.8 60
1.8 5
20.1 35
14.0 34
10.1 18
8.0 22
• The population of interest are the pockets of brush in the region. The
sampling unit is the pocket of brush. The number of grouse in each pocket
is the response variable.
• The population of interest is the grouse. These happen to be clustered
into pockets of brush. This leads back to the previous case.
Here the frame is explicit - the set of all pockets of brush. It isn't clear if all grouse will be found in these pockets - will some be itinerant and hence missed? And what about movement of grouse between pockets while the pockets are being examined?
Summary statistics
Using our earlier results for the simple inflation estimator, our estimate of the total number of grouse is τ̂ = N × ȳ = 248 × 21.92 = 5435.33 with an estimated se of

se(τ̂) = N × sqrt( (s²/n) × (1 − f) ) = 248 × sqrt( (16.95²/12) × (1 − 12/248) ) = 1183.4.
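As a cross-check of this arithmetic, the inflation estimator can be computed in a few lines. A Python sketch is used here purely for illustration; it is not part of the Excel/JMP/SAS analyses in these notes:

```python
import math

# Grouse counts in the n = 12 sampled pockets of brush (data from the notes)
y = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248                                         # total number of pockets
n = len(y)

ybar = sum(y) / n                               # about 21.92 grouse/pocket
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)  # sample variance
f = n / N                                       # sampling fraction

tau_hat = N * ybar                              # inflation estimate of the total
se_tau = N * math.sqrt(s2 / n * (1 - f))        # se with the fpc

print(round(tau_hat, 2), round(se_tau, 1))      # about 5435.33 and 1183.5
```

The tiny difference from the 1183.4 quoted above comes only from rounding s to 16.95 in the hand computation.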
Why did the inflation estimator do so poorly? Part of the reason is the relatively
large standard deviation in the number of grouse in the pockets. Why does this
number vary so much?
It seems reasonable that larger pockets of brush will tend to have more
grouse. Perhaps we can do better by using the relationship between the area of
the bush and the number of grouse through a ratio estimator.
Excel analysis
First plot numbers of grouse vs. area and see if this has a chance of succeeding. The graph shows a linear relationship through the origin, with some scatter that appears to increase with area, as the ratio estimator assumes.
As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variable, find the standard deviation of the difference variable, and find the se of the estimated ratio.

The se of r is found as

se(r) = sqrt( (1/x̄²) × (s²_diff/n) × (1 − f) ) = sqrt( (1/10.533²) × (4.7464²/12) × (1 − 12/248) ) = 0.1269 grouse/ha.
In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by τ̂_Y = τ_X × r = 3015 × 2.081 = 6273.3 grouse.
If you wish to investigate different sample sizes, the simplest way would be
to modify the cell corresponding to the count of the differences. This will be
left as an exercise for the reader.
The final ratio estimate has a rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.
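The ratio-estimator computations for this example can likewise be cross-checked with a short script (Python, for illustration only; the notes themselves use Excel, JMP and SAS):

```python
import math

# Area (ha) and grouse count for the n = 12 sampled pockets (data from the notes)
x = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
y = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N, n = 248, len(y)
tau_x = 3015                 # known total area (ha) of all pockets

xbar, ybar = sum(x) / n, sum(y) / n
r = ybar / xbar              # estimated density, about 2.081 grouse/ha

# se of r via the "diff" variable d_i = y_i - r * x_i (its mean is 0 by construction)
d = [yi - r * xi for xi, yi in zip(x, y)]
s2_diff = sum(di ** 2 for di in d) / (n - 1)
se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - n / N))

tau_y = tau_x * r            # estimated total number of grouse, about 6273
se_tau_y = tau_x * se_r
rse = se_tau_y / tau_y       # about 6%

print(round(r, 3), round(se_r, 4), round(tau_y, 1))
```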
SAS analysis
The analysis is done in SAS using the program grouse.sas from the Sample Pro-
gram Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
data grouse;
input area grouse; /* sampling weights not needed */
datalines;
8.9 24
2.7 3
6.6 10
20.6 36
3.7 8
4.1 8
25.8 60
1.8 5
20.1 35
14.0 34
10.1 18
8.0 22
;;;;
data outratio;   /* compute estimates of the total */
   set outratio;
   Est_total = ratio  * 3015;
   Se_total  = stderr * 3015;
   UCL_total = uppercl*3015;
   LCL_total = lowercl*3015;
   format est_total se_total ucl_total lcl_total 7.1;
   format ratio stderr lowercl uppercl 7.3;
run;
The DATA step reads in the data. It is not necessary to include a computa-
tion of the sampling weight if the data are collected in a simple random sample
for a ratio estimator – the weights will cancel out in the formulae used by SAS.
1 8.9 24
2 2.7 3
3 6.6 10
4 20.6 36
5 3.7 8
6 4.1 8
7 25.8 60
8 1.8 5
9 20.1 35
10 14.0 34
11 10.1 18
12 8.0 22
Number of grouse - Ratio estimator
raw data
[SAS line-printer plot of grouse against area for the 12 sampled pockets, not reproduced: the points rise from about 3 grouse at 1.8 ha to 60 grouse at 25.8 ha and fall close to a straight line through the origin.]
Number of grouse - Ratio estimator
Estimation using a ratio estimator
Data Summary
Number of Observations 12
JMP Analysis
The JMP data table is available here in the file grouse.jmp from the Sam-
ple Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms. The data file contains both variables and the derived variable
1/area:
As in Excel, let us first use the simple inflation estimate by finding the
average number of grouse/pocket and then expanding by the number of pockets.
The Analyze->Distribution platform
The estimated mean number of grouse per pocket is 21.9 (se 4.9). The estimated
total number is found by multiplying the mean number per pocket by the total
number of pockets (N = 248) to give an estimated total of 5435 (se 1213) grouse.
The standard error is larger than that computed by Excel because of the lack
of the finite population correction.
We must first estimate the ratio (grouse/hectare), and then expand this by the total area of the pockets.
Because the ratio estimator assumes that the variance of the response in-
creases with the value of X, a new column representing the inverse of the X
variable (i.e. 1/area of pocket) has been created. Be sure all variables are
continuously scaled:
The graph looks linear through the origin, which is one of the assumptions of the ratio estimator. Then use Fit Special and select the no intercept option to force the regression line through the origin:
The estimated density is 2.081 (se .123) grouse/hectare. The point estimate is bang on, and the estimated se is within 1% of the correct se.
This now needs to be multiplied by the total area of the pockets (3015 ha), which gives an estimated total number of grouse of 6274 (se 371). [Again the estimated se is slightly smaller because of the lack of a finite population correction.]
The ratio estimator is much more precise than the inflation estimator because
of the strong relationship between the number of grouse and the area of the
pocket.
What if it were to turn out that grouse population size tended to be proportional
to the perimeter of a pocket of bush rather than its area? Would using the above
ratio estimator based on a relationship with area introduce serious bias into the
ratio estimate, increase the standard error of the ratio estimate, or do both?
There are two ways of combining ratio estimators in stratified simple random sampling:

r_combined = μ̂_{Y,stratified} / μ̂_{X,stratified}

and

τ̂_{Y,stratified,combined} = τ_X × μ̂_{Y,stratified} / μ̂_{X,stratified}
and

τ̂_{Y,stratified,separate} = Σ_{h=1}^{H} τ_{Xh} × r_h
Again, we won’t worry about the estimates of the se.
• You need the stratum totals of X for the separate estimator, but only the population total of X for the combined estimator.
• The combined ratio is less subject to the risk of bias (see Cochran, p. 165 and following). In general, the biases of the individual stratum estimators add together in the separate estimator, and if they all fall in the same direction there is trouble. In the combined estimator these biases are reduced because the stratification is applied to the numerator and denominator separately.
• When the ratio estimate is appropriate (regression through the origin and variance proportional to the covariate), the last term vanishes. Consequently, the combined ratio estimator will have greater variance than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, as noted above, the bias may be more severe for the separate ratio estimator. You must consider the combined effect of bias and precision, i.e. the MSE.
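A small sketch may make the two estimators concrete (Python, illustrative only; the two-stratum data below are invented for the example):

```python
# Hypothetical stratified SRS: two strata with known stratum sizes N_h,
# sampled (x, y) pairs, and known population totals of X per stratum.
strata = [
    {"N": 10, "x": [1, 1], "y": [2, 4],  "tau_x": 12},
    {"N": 20, "x": [2, 2], "y": [8, 12], "tau_x": 45},
]
N_total = sum(s["N"] for s in strata)

# Stratified means of Y and X (weights W_h = N_h / N)
mu_y = sum(s["N"] / N_total * (sum(s["y"]) / len(s["y"])) for s in strata)
mu_x = sum(s["N"] / N_total * (sum(s["x"]) / len(s["x"])) for s in strata)

# Combined: one overall ratio, expanded by the population total of X
r_combined = mu_y / mu_x
tau_x_total = sum(s["tau_x"] for s in strata)
tau_y_combined = tau_x_total * r_combined

# Separate: a ratio per stratum, each expanded by its stratum total of X
tau_y_separate = sum(s["tau_x"] * (sum(s["y"]) / sum(s["x"])) for s in strata)

print(round(tau_y_combined, 1), round(tau_y_separate, 1))
```

Note how the separate estimator needs τ_{Xh} for every stratum, while the combined estimator only needs their sum.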
A ratio estimator works well when the relationship between Yi and Xi is linear,
through the origin, with the variance of observations about the ratio line in-
creasing with X. In some cases, the relationship may be linear, but not through
the origin.
Regression estimators are also useful if there is more than one X variable.
Note that standard regression packages ignore the way in which the data were collected. Virtually all standard regression packages assume you've collected data under a simple random sample. If your sampling design is more complex, e.g. a stratified design, cluster design, multi-stage design, etc., then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.
All of the designs discussed in previous sections have assumed that each sample
unit was selected with equal probability. In some cases, it is advantageous
to select units with unequal probabilities, particularly if they differ in their
contribution to the overall total. This technique can be used with any of the
sampling designs discussed earlier. An unequal probability sampling design can
lead to smaller standard errors (i.e. better precision) for the same total effort
compared to an equal probability design. For example, forest stands may be
selected with probability proportional to the area of the stand (i.e. a stand of
200 ha will be selected with twice the probability that a stand of 100 ha in size)
because large stands contribute more to the overall population and it would be
wasteful of sampling effort to spend much effort on smaller stands.
In many wildlife surveys, for example, whole herds are located and counted; the individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.
Pitfall A cluster sample is often mistakenly analyzed using methods for sim-
ple random surveys. This is not valid because units within a cluster are typically
positively correlated. The effect of this erroneous analysis is to come up with
an estimate that appears to be more precise than it really is, i.e. the estimated
standard error is too small and does not fully reflect the actual imprecision in
the estimate.
Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster total as the response measure.
In simple random sampling, a frame of all elements was required in order to draw a random sample, and individual units are selected one at a time. In many cases this is impractical: it may not be possible to list all of the individual units, or it may be logistically impossible to do so. Often the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things at the individual quadrat level, but the actual sampling unit is the cluster.
The breaking of the transect into individual quadrats is like having multiple fish within the tank.
Cluster Sampling
First, the clusters must be defined. In this case, the units are naturally
clustered in blocks of size 8. The following units were selected.
Describe how the sample was taken. Note the differences between stratified
simple random sampling and cluster sampling!
4.11.3 Notation

Attribute            Population value   Sample value
Number of clusters   N                  n
Cluster totals       τi                 yi     (NOTE: τi and yi are the TOTALS for cluster i)
Cluster sizes        Mi                 mi
Total area           M
The key concept in cluster sampling is to treat the cluster TOTAL as the re-
sponse variable and ignore all the individual values within the cluster. Because
the clusters are a simple random sample from the population of clusters, simply
apply all the results you had before for a SRS to the CLUSTER TOTALS.
If the clusters are roughly equal in size, a simple inflation estimator can be used. In many cases there is a strong relationship between the size of the cluster and the cluster total - in these cases a ratio estimator would likely be suitable, where the X variable is the cluster size. If there is no relationship between cluster size and cluster total, a simple inflation estimator can be used even in the case of unequal cluster sizes. You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship holds.
You will also have to know the size of each cluster - this is simply the number
of sub-units within each cluster.
The biggest danger of ignoring the clustering aspects and treating the indi-
vidual quadrats as if they came from an SRS is that, typically, your estimated se
will be too small. That is, the true standard error from your design may be sub-
stantially larger than your estimated standard error. The precision is thought
to be far better than is justified based on the survey results. This has been seen
before - refer to the paper by Underwood where the dangers of estimation with
positively correlated data were discussed.
Computational formulae

Overall total: τ = M × μ is estimated by τ̂ = M × μ̂, where μ̂ = (Σ yi)/(Σ mi), with estimated se

se(τ̂) = M × sqrt( (1/m̄²) × (s²_diff/n) × (1 − f) )
• The term s²_diff = Σ_{i=1}^{n} (yi − μ̂ mi)² / (n − 1) is again found in the same fashion as in ratio estimation - create a new variable which is the difference yi − μ̂ mi, find the sample standard deviation of it, and then square the standard deviation.
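These computational formulae can be packaged as a small function (a Python sketch, not part of the notes' own analyses; the toy data at the bottom are invented, chosen so that the cluster totals are exactly proportional to cluster size and the se is therefore zero):

```python
import math

def cluster_ratio_total(y_tot, m, N, M):
    """Estimate a population total from a cluster sample.

    y_tot: cluster TOTALS for the n sampled clusters
    m:     sizes of the sampled clusters
    N:     number of clusters in the population
    M:     total size (e.g. total number of sub-units, or total area)
    """
    n = len(y_tot)
    mbar = sum(m) / n
    mu_hat = sum(y_tot) / sum(m)                 # per-unit mean (ratio estimate)
    d = [yi - mu_hat * mi for yi, mi in zip(y_tot, m)]
    s2_diff = sum(di ** 2 for di in d) / (n - 1)
    se_mu = math.sqrt((1 / mbar ** 2) * (s2_diff / n) * (1 - n / N))
    return M * mu_hat, M * se_mu                 # tau_hat and se(tau_hat)

# Toy example: 3 clusters sampled from N = 100 clusters; M = 600 units in total
tau_hat, se_tau = cluster_ratio_total([10, 20, 30], [1, 2, 3], N=100, M=600)
print(tau_hat, se_tau)    # 6000.0 0.0
```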
• Sometimes the ratio of two variables measured within each cluster is re-
quired, e.g. you conduct aerial surveys to estimate the ratio of wolves to
moose - this has already been done in an earlier example! In these cases,
the actual cluster length is not used.
Confidence intervals
As before, once you have an estimator for the mean and for the se, use the
usual ±2se rule. If the number of clusters is small, then some text books advise
using a t-distribution for the multiplier – this is not covered in this course.
Sample size determination is again no real problem - except that you will get a value for the number of CLUSTERS, not the individual quadrats within the clusters.
Red sea urchins are considered a delicacy and the fishery is worth several million dollars to British Columbia.
In order to set harvest quotas and in order to monitor the stock, it is im-
portant that the density of sea urchins be determined each year.
The number of possible transects is so large that the correction for finite
population sampling can be ignored.
The raw data is available in an ascii file at urchin.dat from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms. The data isn’t all listed here as it has over 1000 lines!
The population of interest is the sea urchins in the harvest area. These
happened to be (artificially) “clustered” into transects which are sampled. All
sea urchins within the cluster are measured.
The frame is conceptual - there is no predefined list of all the possible tran-
sects. Rather they pick random points along the shore and then lay the transects
out from that point.
The sampling design is a cluster sample - the clusters are the transect lines
while the quadrats measured within each cluster are similar to pseudo-replicates.
The measurements within a transect are not independent of each other and are
likely positively correlated (why?).
As the points along the shore were chosen using a simple random sample the
analysis proceeds as a SRS design on the cluster totals.
Excel Analysis
An Excel worksheet with the data and analysis is called urchin.xls and is avail-
able in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/
Stat-650/Notes/MyPrograms. A reduced view appears below:
The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format.
Preliminary plot
Plot the cluster totals vs. the cluster size to see if a ratio estimator is appro-
priate, i.e. linear relationship through the origin with variance increasing with
cluster size.
The plot (not shown) shows a weak relationship between the two variables.
Summary Statistics
Compute the summary statistics on the cluster TOTALS. You will need the
totals over all sampled clusters of both variables.
The estimated density is then d̂ = sum(legal)/sum(quad) = 1507/1120 = 1.345536 urchins/m².
To compute the se, create the diff column as in the ratio estimation section and
find its standard deviation.
In order to estimate the total number of urchins in the harvesting area, you
simply multiply the estimated ratio and its standard error by the area to be
harvested.
SAS Analysis
SAS v.8 has procedures for the analysis of survey data taken in a cluster design.
A program to analyze the data is urchin.sas and is available from the Sample
Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/
MyPrograms.
data urchin;
infile urchin firstobs=2 missover; /* the first record has the variable names */
input transect quadrat legal sublegal;
/* no need to specify sampling weights because transects are an SRS */
/***** First check to see if any transects are missing quadrats *************/
data check;
   set check;
   length problem $15.;
   problem = ' ';
   if max ^= n then problem = 'missing quadrat?';
   drop _type_ _freq_;
run;
/****************************************************************************/
The key feature of the SAS program is the use of the CLUSTER statement
to identify the clusters in the data.
The population number of transects was not specified as the finite population
correction is negligible.
6 7 1 21 21 5
7 8 1 46 46 85
8 9 1 24 24 46
9 10 1 37 37 27
10 11 1 24 24 9
11 13 1 40 40 50
12 14 1 37 37 50
13 15 1 32 32 15
14 16 1 21 21 58
15 18 1 32 32 42
16 20 1 42 42 12
17 21 1 21 21 52
18 22 1 13 13 0
19 23 1 88 88 100
20 24 1 15 15 1
21 25 1 23 23 16
22 26 1 17 17 49
23 27 1 16 16 46
24 28 1 18 18 40
25 29 1 30 30 37
26 30 1 39 39 40
27 31 1 42 42 175
28 33 1 100 100 71
Estimating urchin density - example of cluster analysis
plot the relationship between the cluster total and cluster size
[SAS line-printer plot of the cluster totals (tlegal) against cluster size (n, the number of quadrats), points labelled by transect number, not reproduced: the relationship is weak, and transect 31 (about 175 legal urchins on 42 quadrats) is a clear outlier.]
Data Summary
Number of Clusters 28
Number of Observations 1120
Statistics

Variable   Upper 95% CL for Mean
legal      1.811810
JMP Analysis
The urchin data is available in a JMP file urchin.jmp from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The file contains variables for the transect, the quadrat within each transect, and the number of legal and sub-legal sized urchins.
Use the Tables->Summary command to get the cluster totals by summing the number of legal sized urchins and counting the number of quadrats present:
Now we are back to the case of a ratio estimator with the Y variable being
the number of legal sized urchins measured on the transect, and the X variable
being the size of the transect. As in the previous examples of a ratio estimator,
we create a weighting variable equal to 1/X = 1/size of transect:
After the plot is created, we use the Fit Special from the red triangle near
the plot:
The estimated density is 1.346 (se .216) urchins/m². The se is a bit smaller because of the lack of a finite population correction factor but is within 1% of the correct se.
The rse of the estimate is 0.2274/1.3455 = 17% - not terrific. The determination
of sample size is done in the same manner as in the ratio estimator case dealt
with in earlier sections except that the number of CLUSTERS is found. If we
wanted to get a rse near to 5%, we would need almost 320 transects - this is
likely too costly.
Sea cucumbers are considered a delicacy among some, and the fishery is of
growing importance.
In order to set harvest quotas and in order to monitor the stock, it is impor-
tant that the number of sea cucumbers in a certain harvest area be estimated
each year.
To do this, the managers lay out a number of transects across the cucumber
harvest area. Divers then swim along the transect, and while carrying a 4 m
wide pole, count the number of cucumbers within the width of the pole during
the swim.
The number of possible transects is so large that the correction for finite
population sampling can be ignored.
Here is the summary information on the transect areas (the preliminary raw data are unavailable):
Transect Area (m²)   Sea Cucumbers
      260                124
      220                 67
      200                  6
      180                 62
      120                 35
      200                  3
      200                  1
      120                 49
      140                 28
      400                  1
      120                 89
      120                116
      140                 76
      800                 10
     1460                 50
     1000                122
      140                 34
      180                109
       80                 48
The transects were laid out from one edge of the bed and the length of the
edge is 51,436 m. Note that because each transect was 4 m wide, the number
of transects is 1/4 of this value.
The population of interest is the sea cucumbers in the harvest area. These
happen to be (artificially) “clustered” into transects which are the sampling
unit. All sea cucumbers within the transect (cluster) are measured.
The frame is conceptual - there is no predefined list of all the possible transects.
Rather they pick random points along the edge of the harvest area, and then
lay out the transect from there.
The sampling design is a cluster sample - the clusters are the transect lines while
the quadrats measured within each cluster are similar to pseudo-replicates. The
measurements within a transect are not independent of each other and are likely
positively correlated (why?).
Analysis - abbreviated
The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data.
Now this summary table is simply an SRSWOR from the set of all transects.
We first estimate the density, and then multiply by the area to estimate the total.
Note that after summarizing up to the transect level, this example proceeds
in an analogous fashion as the grouse in pockets of brush example that we looked
at earlier.
Preliminary Plot
A plot of the cucumber total vs the transect size shows a very poor rela-
tionship between the two variables. It will be interesting to compare the results
from the simple inflation estimator and the ratio estimator.
First, estimate the number ignoring the area of the transects by using a simple inflation estimator:

n        19 transects
Mean     54.21 cucumbers/transect
Std Dev  42.37 cucumbers/transect
Ratio Estimator
We use the methods outlined earlier for ratio estimators from SRSWOR to get
the following summary table:
          area    cucumbers
Mean    320.00    54.21       per transect

The estimated density of sea cucumbers is then d̂ = mean(cucumbers)/mean(area) = 54.21/320.00 = 0.169 cucumbers/m².
To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, s_diff = 73.63. The estimated se of the ratio is then found as:

se(d̂) = sqrt( (1/x̄²) × (s²_diff/n_transects) ) = sqrt( (1/320²) × (73.63²/19) ) = 0.053 cucumbers/m².
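Both estimators for this example can be cross-checked with a short script (Python, for illustration only; the fpc is ignored, as in the notes):

```python
import math

# Transect area (m^2) and sea cucumber count for the 19 transects (from the notes)
area = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
        120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cukes = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
         89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(area)

xbar, ybar = sum(area) / n, sum(cukes) / n           # 320.00 and 54.21
s_y = math.sqrt(sum((v - ybar) ** 2 for v in cukes) / (n - 1))  # about 42.37
r = ybar / xbar                                      # density, cucumbers/m^2

d = [y - r * x for x, y in zip(area, cukes)]
s_diff = math.sqrt(sum(di ** 2 for di in d) / (n - 1))   # about 73.63
se_r = math.sqrt((1 / xbar ** 2) * (s_diff ** 2 / n))    # about 0.053

print(round(r, 3), round(s_diff, 2), round(se_r, 3))
```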
Why did the ratio estimator in this case do worse than the simple inflation estimator in Griffiths Passage? A plot of the number of sea cucumbers vs. the area of the transect shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator.
In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator only if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variations (std dev/mean). Computation shows that half of the ratio of their relative variations is 0.732, while the correlation between the two variables is only 0.041. Hence the ratio estimator will not do well.
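This claim can be checked numerically (Python sketch, for illustration; "relative variation" here means the coefficient of variation, std dev/mean):

```python
import math

# Same transect data: area (m^2) and sea cucumber counts for the 19 transects
area = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
        120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cukes = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
         89, 116, 76, 10, 50, 122, 34, 109, 48]
n = len(area)

def mean(v):
    return sum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

cv_x = sd(area) / mean(area)     # relative variation of area
cv_y = sd(cukes) / mean(cukes)   # relative variation of the counts

# Sample correlation between area and count
mx, my = mean(area), mean(cukes)
corr = sum((x - mx) * (y - my) for x, y in zip(area, cukes)) / (
    (n - 1) * sd(area) * sd(cukes))

threshold = 0.5 * cv_x / cv_y    # ratio estimator helps only if corr exceeds this
print(round(threshold, 3), round(corr, 3))   # about 0.732 and 0.041
```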
The Excel workbook also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best laid plans can go astray - always plot the data.
A third worksheet in the workbook analyzes the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide margin.
Not part of Stat 403/650. Please consult with a sampling expert before implementing or analyzing a multi-stage design.
4.12.1 Introduction
All of the designs considered above select a sampling unit from the population
and then do a complete measurement upon that item. In the case of cluster
sampling, this is facilitated by dividing the sampling unit into small observa-
tional units, but all of the observational units within the sampled cluster are
measured.
If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster were identical, only a single observational unit from the cluster would need to be selected in order to estimate (without any error) the cluster total. Suppose then that the observational units within a cluster were not identical, but had some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect?
This method is called two-stage sampling. In the first stage, larger sampling
units are selected using some probability design. In the second stage, smaller
units within the selected first-stage units are selected according to a probability
design. The design used at each stage can be different, e.g. first stage units
selected using a simple random sample, but second stage units selected using a
systematic design as proposed for the urchin survey above.
• Multi-stage designs are cheaper than a simple random sample of the same
number of final sampling units, but more expensive than a cluster sample
of the same number of final sampling units. [Hint: think of the travel costs
in selecting more transects or measuring quadrats within a transect.]
• As in all sampling designs, stratification can be employed at any level
and ratio and regression estimators are available. As expected, the theory
becomes more and more complex, the more "variations" are added to the
design.
4.12.2 Notation

Mean in population: μ = τ/M

We will only consider the case when simple random sampling occurs at both stages of the design.
The intuitive explanation for the results is that a total is estimated for each
FSU selected (based on the SSU selected). These estimated totals are then used
in a similar fashion to a cluster sample to estimate the grand total.
where

s²₁ = Σ_{i=1}^{n} (τ̂i − τ̄)² / (n − 1)

s²₂i = Σ_{j=1}^{mi} (yij − ȳi)² / (mi − 1)

τ̄ = (1/n) Σ_{i=1}^{n} τ̂i   (the mean of the estimated FSU totals)
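A minimal sketch of a two-stage estimator follows (Python, illustrative only; the variance formula used is the standard two-stage SRS formula, e.g. from Cochran, and the toy data are invented):

```python
import math

def two_stage_total(clusters, N):
    """Estimate a population total from a two-stage sample.

    clusters: list of (M_i, y_sample) pairs, where M_i is the number of SSUs
              in the sampled FSU and y_sample holds the m_i measured SSUs.
    N:        number of FSUs in the population. SRS assumed at both stages.
    """
    n = len(clusters)
    tau_i = []          # estimated FSU totals
    stage2 = 0.0        # second-stage variance contribution
    for M_i, ys in clusters:
        m_i = len(ys)
        ybar = sum(ys) / m_i
        tau_i.append(M_i * ybar)
        s2_2i = sum((y - ybar) ** 2 for y in ys) / (m_i - 1)
        stage2 += M_i ** 2 * (1 - m_i / M_i) * s2_2i / m_i
    tau_bar = sum(tau_i) / n
    s2_1 = sum((t - tau_bar) ** 2 for t in tau_i) / (n - 1)
    tau_hat = N * tau_bar
    var_hat = N ** 2 * (1 - n / N) * s2_1 / n + (N / n) * stage2
    return tau_hat, math.sqrt(var_hat)

# Toy example: n = 2 FSUs sampled from N = 4; second FSU measured completely
tau_hat, se = two_stage_total([(4, [2, 4]), (2, [5, 7])], N=4)
print(tau_hat, se)   # 48.0 4.0
```

Note how the fully censused FSU contributes nothing to the second-stage variance term.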
Notes:
The Klahoose First Nations (situated near Desolation Sound in the Strait of Georgia) wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds.
This example looks at the estimate of oyster numbers at Lloyd Point from a survey conducted in 1994.
The survey was conducted by running a line through the oyster bed - the total length was 105 m. Several random locations were chosen along the line in increments of 1 m. At each randomly chosen location, the width of the bed was measured, and about 3 random locations along the perpendicular transect at that point were taken. A 1 m² quadrat was applied at each location, and the number of oysters of various sizes was counted in the quadrat.
The raw data is available as a data file called wildoyster.dat from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The first step after importing the data into JMP is to collect information to estimate the FSU (transect) totals and to compute some components of the variance from the second stage of sampling.
[The width of the transect is obtained as the average of the width variable, which was replicated for each individual quadrat.]
Now you will need to add some columns to estimate the total for each FSU
and the contribution of the second stage sampling to the overall variance. These
columns will be created using the formula boxes as shown below.
First, the formula for the FSU total, i.e. the estimated total weight for the entire transect. This is simply the average weight per quadrat times the width of the strip.
Second, the component of variance for the second stage. [Typically, if the first stage sampling fraction is small, this component can be ignored.]
A similar procedure can be used for the other variables. The nice feature
about JMP is that once a series of operations has been done, you can save the
script and apply it easily to other variables – refer to the JMP manual for more
details.
Excel Spreadsheet
The above computations can also be done in Excel as shown in the attached
workbook klahoose.xls from the Sample Program Library.
As in the case of a pure cluster sample, the PivotTable feature can be used
to compute summary statistics needed to estimate the various components.
SAS Program
SAS can also be used to analyze the data as shown in the program klahoose.sas
and output Klahoose.lst.
Note that Proc SurveyMeans computes the se using only the first-stage variance. As the first-stage sampling fraction is usually quite small, this tends to give only a slight underestimate of the true standard error of the estimate.
The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite complex, and the formulae for the estimates and estimated standard errors can be fearsome. If you have to analyze such a design, it is likely better to invest some time in learning one of the statistical packages designed for surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand.
There are also several important design decisions for multi-stage designs.
One very nice feature of multi-stage designs is that if the first stage is sam-
pled with replacement, then the formulae for the estimated standard errors
simplify considerably to a single term regardless of the design used in the
lower stages! If there are many first stage units in the population and if the
sampling fraction is small, the chances of selecting the same first stage unit
twice are very small. Even if this occurs, a different set of second stage units
will likely be selected so there is little danger of having to measure the same
final sampling unit more than once. In such situations, the design at second and
lower stages is very flexible as all that you need to ensure is that an unbiased
estimate of the first-stage unit total is available.
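The simplification can be sketched in a few lines. Assuming the n first-stage units are drawn by simple random sampling WITH replacement, and each draw yields an unbiased estimate of that unit's total (the numbers below are hypothetical):

```python
import statistics

N = 100                            # first-stage units in the population
t_hat = [40.0, 55.0, 47.0, 52.0]   # unbiased PSU-total estimates (n = 4 draws)

# expand each draw to a population-total estimate, then average
z = [N * t for t in t_hat]
total_hat = statistics.mean(z)

# the single-term variance estimate: just the variance of the z's over n
se = (statistics.variance(z) / len(z)) ** 0.5
print(total_hat)   # 4850.0
```

No lower-stage variance terms appear: all stages of variation are already reflected in the spread of the z values across the with-replacement draws.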
A typical concern with any of the survey methods arises when the population does not have natural discrete sampling units. For example, a large section of land may be arbitrarily divided into 1 m² plots or 10 m² plots. A natural question to ask is what the ‘best’ size of unit is. This has no simple answer and depends upon several factors, which must be addressed for each survey:
• Cost. All else being equal, sampling many small plots may be more
expensive than sampling fewer larger plots. The primary difference in
cost is the overhead in traveling and setup to measure the unit.
• Size of unit. An intuitive feeling is that many smaller plots are better than a few large plots because the sample size is larger. This is true if the characteristic of interest is ‘patchy’, but, surprisingly, it makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p. 64). Indeed, if the characteristic shows ‘avoidance’, then larger plots are better. For example, competition among trees implies they are spread out more than expected if they were randomly located. Logistic considerations often influence the plot size. For example, if trampling the soil affects the response, then sample plots must be small enough to measure without trampling the soil.
• Edge effects. Because the population does not have natural boundaries,
decisions often have to be made about objects that lie on the edge of the
sample plot. In general larger square or circular plots are better because
of smaller edge-to-area ratio. [A large narrow rectangular plot can have
more edge than a similar area square plot.]
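The claim in the size-of-unit bullet can be checked by simulation. The sketch below (illustrative only; the density and plot layouts are hypothetical) compares the coefficient of variation of a density estimate from many small plots versus few large plots covering the same total area, when individuals are scattered completely at random so plot counts are Poisson:

```python
import math
import random
import statistics

random.seed(1)
DENSITY = 5.0   # individuals per unit area

def rpoisson(lam):
    """Simple stdlib-only Poisson sampler (Knuth's multiplication method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def cv_of_density(plot_area, n_plots, reps=2000):
    """CV of the estimated density over repeated surveys of n_plots plots."""
    ests = []
    for _ in range(reps):
        counts = [rpoisson(DENSITY * plot_area) for _ in range(n_plots)]
        ests.append(sum(counts) / (n_plots * plot_area))
    return statistics.stdev(ests) / statistics.mean(ests)

# same total sampled area (100 area units) split two ways
cv_small = cv_of_density(plot_area=1.0, n_plots=100)   # many small plots
cv_large = cv_of_density(plot_area=10.0, n_plots=10)   # few large plots
print(round(cv_small, 3), round(cv_large, 3))   # nearly identical
```

Under complete spatial randomness the two CVs agree (both near 1/√500 here); only when the population is patchy or shows avoidance does plot size matter.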
A pilot study should be carried out prior to a large scale survey to investigate
factors that influence the choice of sampling unit size.
When analyzing a survey a key step is to recognize the design that was used
to collect the data. Key pointers to help recognize various designs are:
• How were the units selected? A true simple random sample makes a list
of all possible items and then chooses from that list.
• Is there more than one size of sampling unit? For example, were transects selected at random, and then quadrats within the selected transects selected at random? This is usually a multi-stage design.
• Is there a cluster? For example, transects are selected, and these are
divided into a series of quadrats - all of which are measured.
In descriptive surveys, the objective is simply to obtain information about one large group. In observational studies, two deliberately chosen sub-populations are selected and surveyed, but no attempt is made to generalize the results to the whole population. In analytical studies, sub-populations are selected and sampled in order to generalize the observed differences among the sub-populations to this and other similar populations.
[It is possible that all three types of stratification take place - these are very
complex surveys.]
The choice between the categories is usually based on the ease with which the population can be pre-stratified and on the strength of the relationship between the response and explanatory variables. For example, sample plots can easily be pre-stratified by elevation or by exposure to the sun, but it would be difficult to pre-stratify them by soil pH.
Pre-stratification has the advantage that the manager has control over the number of sample points collected in each stratum, whereas in post-stratification the numbers are not controllable and may lead to very small sample sizes in certain strata simply because they form only a small fraction of the population.
be pre-stratified into three elevation classes, and a simple random sample will
be taken in each elevation class. The allocation of effort in each stratum (i.e.
the number of sample plots) will be equal. The density of new growth will be
measured on each selected sample plot. On the other hand, suppose that the
regeneration is a function of soil pH. This cannot be determined in advance,
and so the manager must take a simple random sample over the entire stand,
measure the density of new growth and the soil pH at each sampling unit, and
then post-stratify the data based on measured pH. The number of sampling
units in each pH class is not controllable; indeed it may turn out that certain
pH classes have no observations.
If the units have been selected using a simple random sample, then the analysis of an analytical survey proceeds along similar lines to the analysis of designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses
of analytical surveys, the observed results are postulated to have been taken
from a hypothetical super-population of which the current conditions are just
one realization. In the above example, cut blocks would be treated as a random
blocking factor; elevation class as an explanatory factor; and sample plots as
samples within each block and elevation class. Hypothesis testing about the
effect of elevation on mean density of regeneration occurs as if this were a
planned experiment.
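A bare-bones sketch of that last point: the elevation comparison run as a one-way ANOVA F test in plain Python (the densities below are hypothetical, and a full analysis would also carry cut block as a random blocking factor):

```python
import statistics

groups = {                      # density of new growth by elevation class
    "low":  [12.0, 14.0, 11.0, 13.0],
    "mid":  [15.0, 17.0, 16.0, 14.0],
    "high": [20.0, 18.0, 21.0, 19.0],
}

k = len(groups)                                  # number of elevation classes
n = sum(len(g) for g in groups.values())         # total sample plots
grand = sum(sum(g) for g in groups.values()) / n

# between-class and within-class sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2
                 for g in groups.values())
ss_within = sum((y - statistics.mean(g)) ** 2
                for g in groups.values() for y in g)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(F, 2))   # 29.6
```

The F statistic is then referred to an F distribution with (k − 1, n − k) degrees of freedom, exactly as in a planned one-factor experiment.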
Pitfall: Any one of the sampling methods described in Section 2 for descrip-
tive surveys can be used for analytical surveys. Many managers incorrectly use
the results from a complex survey as if the data were collected using a simple
random sample. As Kish (1987) and others have shown, this can lead to sub-
stantial underestimates of the true standard error, i.e. the precision is thought
to be far better than is justified based on the survey results. Consequently
the manager may erroneously detect differences more often than expected (i.e.
make a Type I error) and make decisions based on erroneous conclusions.
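The pitfall is easy to demonstrate by simulation. In the sketch below (a hypothetical population, not an example from these notes), responses are similar within clusters, and the naive ‘SRS’ standard error computed from the individual units is far smaller than the correct cluster-level standard error:

```python
import random
import statistics

random.seed(42)

# population: 200 clusters of 10 units; strong between-cluster variation
clusters = [[random.gauss(mu, 1.0) for _ in range(10)]
            for mu in (random.gauss(50, 5) for _ in range(200))]

sample = random.sample(clusters, 20)          # SRS of 20 whole clusters
units = [y for c in sample for y in c]        # the 200 measured units

# wrong: treat the 200 measured units as a simple random sample of units
se_naive = statistics.stdev(units) / len(units) ** 0.5

# right: the cluster means are the independent observations
means = [statistics.mean(c) for c in sample]
se_cluster = statistics.stdev(means) / len(means) ** 0.5

print(se_naive < se_cluster)   # True: the naive se is far too small
```

Confidence intervals and tests built from the naive standard error would be much too narrow, which is exactly the inflated Type I error rate described above.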
4.15 References
• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
One of the standard references for survey sampling. Very technical.
• Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical Report of Fisheries and Aquatic Sciences 2270.
A very nice summary of using sampling methods to estimate clam numbers.
• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New
York: American Chemical Society.
A series of papers on sampling mainly for environmental contaminants in
ground and surface water, soils, and air. A detailed discussion on sampling
for pattern.
• Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys mostly from a social science
perspective.
• Kish, L. (1984). On Analytical Statistics from complex samples. Survey
Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical sur-
veys.
• Kish, L. (1987). Statistical designs for research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in
analytical surveys. Very technical.
• Krebs, C. (1989). Ecological Methodology.
A collection of methods commonly used in ecology, including a section on sampling.
• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214.
An overview of how to use surveys for assessing intertidal bivalves – more technical than Gillespie and Kronlund (1999).
• Myers, W.L. and Shelton, R.L. (1980). Survey methods for ecosystem
management. New York: Wiley.
Good primer on how to measure common ecological data using direct sur-
vey methods, aerial photography, etc. Includes a discussion of common
survey designs for vegetation, hydrology, soils, geology, and human influ-
ences.
• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal
of the Royal Statistical Society, Series B, 27, 264-278.
For example, if you wish to estimate the total family income of families in Vancouver, the “final” sampling units are families, the population size is the number of families in Vancouver, the response variable is the income of each family, and the population total is the total family income over all families in Vancouver.
Things become a bit confusing when the sampling units differ from the “final” units, i.e. when the final units are clustered and you are interested in estimates of the number of “final” units. For example, in the grouse/pocket-brush example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you must expand the estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the number of pockets (248), the variable of interest for a cluster is the number of grouse in each pocket, and the population total is the number of grouse over all pockets.
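In miniature, the grouse expansion looks like this (the pocket counts are hypothetical; only the 248 pockets comes from the example):

```python
import statistics

N_pockets = 248                        # clusters (pockets of brush) in the frame
counts = [0, 3, 1, 0, 2, 5, 0, 1]      # grouse counted in 8 sampled pockets

# expand by the number of CLUSTERS, not by the number of grouse
total_hat = N_pockets * statistics.mean(counts)
print(round(total_hat))   # 372
```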
Similarly, for the oysters on the lease. The population is the oysters on the
lease. But you don’t randomly sample individual oysters – you randomly sample
quadrats which are clusters of oysters. The expansion factor is now the number
of quadrats.
In the salmon example, the boats are surveyed. The fact that the number
of salmon was measured is incidental - you could have measured the amount of
food consumed, etc.
In the angling survey problem, the boats are the sampling units. The fact that they contain anglers or that they caught fish is incidental to the design; the set of boats that were at the lake that day is the population of interest.
This can get confusing in the case of cluster or multi-stage designs, as there are different N’s at each stage of the design. It might be easier to think of N as an expansion factor.
The expansion factor will be known once the frame is constructed. In some cases, this can only be done after the fact – for example, when surveying angling parties, the total number of parties returning in a day is unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order to estimate the sample size. If this is impossible, just choose some arbitrarily large number – the estimated required sample size will be an overestimate (by a small amount) but close enough. Of course, once the survey is finished, you would then use the actual value of N in all computations.
In multi-stage sampling, the selection of the final sampling units takes place
in stages. For example, suppose you are interested in sampling angling parties
as they return from fishing. The region is first divided into different landing
sites. A random selection of landing sites is selected. At each landing site, a
random selection of angling parties is selected.
In multi-phase sampling, the units are NOT divided into larger groups. Rather, a first phase selects some units, and they are measured quickly. A second phase takes a sub-sample of the first-phase units and measures them more intensively. Returning to the angling survey: a multi-phase design would select angling parties, all of which would fill out a brief questionnaire. A week later, a sample of the questionnaires is selected, and the corresponding angling parties are RECONTACTED for more details.
The key difference is that in multi-phase sampling, some units are measured TWICE; in multi-stage sampling, there are different sizes of sampling units
(landing sites vs angling parties), but each sampling unit is only selected once.
Frame = list of sampling units from which a sample will be taken. The sampling
units may not be the same as the “final” units that are measured. For example,
in cluster sampling, the frame is the list of clusters, but the final units are the
objects within the cluster.
Population = list of all “final” units of interest. Usually the “final units” are
the actual things measured in the field, i.e. what is the final object upon which
a measurement is taken.
In some cases, the frame doesn’t match the population which may cause
biases, but in ideal cases, the frame covers the population.
Missing data can occur at various points in a survey and for various reasons. The easiest data to handle are data ‘missing completely at random’ (MCAR). In this situation, the missing data provide no information about the problem that is not already captured by the other data points, and the ‘missingness’ itself is non-informative. In this case, and if the design was a simple random sample, the missing data points are just ignored. So if you wanted to sample 80 transects but were only able to get 75, only the 75 transects are used. If some of the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample, so the estimation formulae change slightly.