SPECIAL ARTICLE
Received: 8 August 2011 / Accepted: 5 April 2012 / Published online: 6 May 2012
© Dr. K C Chaudhuri Foundation 2012
Indian J Pediatr (November 2012) 79(11):1482–1488

R. Arya
Comprehensive Epilepsy Centre, Division of Neurology, Cincinnati Children’s Hospital Medical Centre, Cincinnati, OH, USA

B. Antonisamy
Department of Biostatistics, Christian Medical College, Vellore, Tamil Nadu, India

S. Kumar
Division of Basic and Translational Research, Department of Surgery, University of Minnesota, Minneapolis, MN, USA

R. Arya (*)
MLC 2015, Cincinnati Children’s Hospital Medical Centre, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
e-mail: Ravindra.Arya@cchmc.org

Abstract Estimation of appropriate sample size for prevalence surveys presents many challenges, particularly when the condition is very rare or has a tendency for geographical clustering. Sample size estimate for prevalence studies is a function of expected prevalence and precision for a given level of confidence expressed by the z statistic. Choice of the appropriate values for these variables is sometimes not straight-forward. Certain other situations do not fulfil the assumptions made in the conventional equation and present a special challenge. These situations include, but are not limited to, smaller population size in relation to sample size, sampling technique or missing data. This paper discusses practical issues in sample size estimation for prevalence studies with an objective to help clinicians and healthcare researchers make more informed decisions whether reviewing or conducting such a study.

Keywords Sample size estimation . Prevalence . Precision

Introduction

Sample size estimation is a key issue in design of most studies. In a study conducted to estimate prevalence of a given condition in a geographic area, the objective is to sample sufficient population to get adequate number of subjects correctly classified as having the condition of interest or not, with a given confidence about the amount to which this estimate might be affected by sampling error. Several factors may potentially contribute to difficulty in this regard, including misclassification of diseased and healthy subjects due to limitations of diagnostic test, the given condition being very common or uncommon, or, limitations of sampling technique. This paper discusses the issues involved in sample size estimation for prevalence studies with an objective to provide practically useful information to clinicians and healthcare researchers.

Let us say our objective is to determine the prevalence of a disease condition in the population living in an isolated geographic area. Here, one simply cannot perform the diagnostic test on the whole population. Such an approach would be impractical due to logistic or financial reasons, and, ethically unacceptable. Moreover, there would still be misclassification because the predictive values of the test would not be 100 %. So our target is to find out sufficient number of people sampled at random from this population, which if subjected to diagnostic testing, would yield an estimate of the prevalence of the condition in the given population with some confidence. It can be argued that if the condition is very rare in the given population, a larger sample would be required to yield sufficient number of
‘cases’ and ‘non-cases’; if it is relatively common we might need a rather small number.¹ Hence, sample size n depends on the expected prevalence of the disease P in the population. Secondly, we have to decide how precisely we want to estimate this prevalence. This is usually measured as the amount of acceptable (or allowable) margin of error in the prevalence estimate. This is a random sampling error which decreases on repeated trials. Hence, n is inversely related to the allowable error d (in the equation that follows, to its square), which is a surrogate measure of precision. These relationships can be expressed by a simple equation.

$$ n = \frac{z^2 \, P(1-P)}{d^2} $$

Where n = sample size, z = z statistic for the level of confidence, P = expected prevalence and d = allowable error. This formula assumes that P and d are decimal values, but would hold correct also if they are percentages, except that the term (1−P) in numerator would become (100−P). In this straightforward equation (Box 1), several practical issues arise in the choice of values for z, P and d.

‘true’ prevalence lies between 20 and 40 % (Bayesian way). Sample size estimation for prevalence surveys has 2 inter-related measures of precision, which are discussed subsequently: the confidence interval (CI) and the allowable margin of error d.

Z Statistic

Whenever we make inferences about population from a sample, an element of uncertainty is introduced. One way to quantify this uncertainty is the use of CI. Z statistic indicates the number of standard deviations an observation is above or below the ‘population’ mean. It captures the level of confidence, assuming a normal distribution. For conventional 95 % confidence level, the z value is 1.96, since 95 % of a normal distribution would lie within ±1.96 standard deviations on either side of the mean. The value of z is chosen by the investigator according to the desired level of confidence.

Allowable Margin of Error (d)
the element of chance is captured by the chosen CI or Z statistic.

Expected Prevalence (P)

This might seem like a paradox of sorts. Our objective in conducting the study is to determine the prevalence of a particular condition in a particular population i.e., it is this P which we are out to find. But to do so, we need to have a prior idea of it! Mostly, this idea can be had from prior studies. Usually, the previous studies will give a range of P and not a single number. It has been suggested that one should err towards P = 0.5 in these situations [2]. That is, if the range provided by previous studies is 30–40 %, then P should be set at 0.4, whereas if the range is 70–80 %, it is better to take P = 0.7, i.e., the value nearer to 0.5 or 50 %. This recommendation is based on the fact that choosing a value nearer to 0.5 leads to the largest n, within certain limitations (Fig. 1).

Suppose one is doing the first ever prevalence study for a particular condition in a given population, then there is no previous study to help estimate P. In such a situation, some authors [3] recommend that n may be calculated using P = 0.5. As ascertained from Fig. 1, this contention is valid if P lies between 10 and 90 %, as it will give the largest estimate for n. However, for rare (P<0.1) and very common (P>0.9) conditions, the sample size estimated with an assumption of P = 0.5 is likely to be unsuitable. Suppose we plug values of P = 0.5, d = 0.05 in our formula and obtain n = 385 (Fig. 1). This will be the empirical sample size estimate for all purposes, when we make a blind assumption of P = 0.5. Assume that we are dealing with a rare condition having a true prevalence of only 1 %, that is to say P = 0.01. In such a case, our sample is likely to capture only 3 or 4 cases. There is a finite probability that it may not capture even a single case! On the contrary, if we are dealing with a very common condition, say whose prevalence is 99 % (P = 0.99), then we will not be able to capture sufficient non-cases with our sample size and as above, we may indeed capture none.
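The arithmetic of this example can be reproduced in a few lines. The following is a minimal sketch rather than part of the original method: the function names are hypothetical, z = 1.96 is assumed for 95 % confidence, the result is rounded up to a whole subject, and the chance of finding no case assumes independent sampling from a large population (a binomial model).

```python
import math


def sample_size(z: float, p: float, d: float) -> float:
    """Conventional estimate n = z^2 * P * (1 - P) / d^2 (hypothetical helper)."""
    return z ** 2 * p * (1 - p) / d ** 2


def prob_no_cases(p: float, n: int) -> float:
    """Probability that n independently sampled subjects include no case.

    Assumes a binomial model; this is an illustrative assumption.
    """
    return (1 - p) ** n


# Blind assumption P = 0.5 with d = 0.05 and z = 1.96, rounded up.
n = math.ceil(sample_size(z=1.96, p=0.5, d=0.05))

# If the true prevalence is only 1 %, how many cases would this sample hold?
true_p = 0.01
expected_cases = n * true_p
p_zero = prob_no_cases(true_p, n)

print(f"n = {n}")
print(f"expected cases at P = {true_p}: {expected_cases:.2f}")
print(f"probability of capturing no case: {p_zero:.3f}")
```

With these inputs the script gives n = 385 and an expected 3.85 cases at a true prevalence of 1 %; the probability of capturing no case at all comes out at roughly 2 %.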
In either case, the assumption of normal approximation is invalidated.

At a minimum this requires an estimate as to whether P is <0.1, falls between 0.1 and 0.9, or is >0.9. If we can estimate this, our value for allowable error (d) will change, as discussed above, and we will get a more meaningful calculation for sample size (n). Table 1 illustrates this calculation for few representative values of P<0.05 and P>0.95. Notice how n is different for these extremes of P from the empirical figure of 385, increasing symmetrically as we approach P = 0 or P = 1. It is obvious intuitively that at P = 0, n will become infinite. If there are no cases of the given condition in a given population, no matter how large a sample we take, we will never find one! Similar argument holds true for P = 1. So, our problem boils down to finding whether P is <0.1, between 0.1 and 0.9 or >0.9. This can be easily estimated by even a crude pilot study. To summarize, if we have an estimate for P from prior studies, we can use it erring towards P = 0.5; otherwise, the best strategy is to conduct a pilot study, estimate P, and use it for calculating a sample size.

Assumption of Normal Approximation

The formula for sample size estimation is based on the assumption that the sample will capture and correctly classify certain minimum number (≥5) of cases and non-cases. This means that nP and n(1−P) must be ≥5. We know that allowable error is directly related to standard error of proportion (SEp) by the equation:

$$ d = z \cdot SE_p $$

Also:

$$ SE_p = \sqrt{\frac{P(1-P)}{n}} $$
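The minimum-count condition and the two relations for d and SEp can be checked numerically. The snippet below is an illustrative sketch under stated assumptions: the function names are hypothetical, the threshold of 5 is the one quoted in the text, and n = 385 is the empirical estimate obtained with P = 0.5 and d = 0.05.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion, SEp = sqrt(P * (1 - P) / n)."""
    return math.sqrt(p * (1 - p) / n)


def margin_of_error(z: float, p: float, n: int) -> float:
    """Allowable error implied by a given sample size, d = z * SEp."""
    return z * standard_error(p, n)


def normal_approximation_ok(p: float, n: int, minimum: float = 5) -> bool:
    """True when both nP and n(1 - P) reach the minimum expected count."""
    return n * p >= minimum and n * (1 - p) >= minimum


n = 385  # empirical estimate for P = 0.5, d = 0.05, z = 1.96

# At P = 0.5 the margin of error recovered from n is just under 0.05.
print(round(margin_of_error(1.96, 0.5, n), 4))

# The expected counts are adequate at P = 0.5 but not at the extremes.
for p in (0.5, 0.01, 0.99):
    print(p, normal_approximation_ok(p, n))
```

For n = 385 the check passes at P = 0.5, but at P = 0.01 the expected number of cases (3.85) falls below 5, and at P = 0.99 the same happens for non-cases.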
Using above 2 equations and re-arranging, we can get the formula for n, discussed above. For z = 1.96 and d = 0.05, the equation would be:

$$ 0.05 = 1.96 \cdot SE_p $$

This means that 1.96 standard errors of our estimate would be equal to 0.05. Stated otherwise, if 1.96*SEp is equal to 0.05, then our prevalence estimate has a 95 % chance of being within 5 percentage points of the true prevalence. This inherently assumes that n is sufficiently large, so that all possible values of P will have a normal distribution. It also means that if we were to repeat our study, about 95 % of times our prevalence estimate will fall within the value specified by P±1.96*SEp. This, indeed, shows that these values have a normal distribution with a mean of P and standard deviation of d/Z, approximately (Box 2). This is nothing but approximating binomial distribution with a normal distribution, justified by the central limit theorem. In this context, P is the proportion of successes in a Bernoulli trial process estimated from a sample of size n with z(1−α/2) as the (1−α/2) percentile of standard normal distribution and α as the error percentile. The central limit theorem is applicable to a binomial distribution when the proportion is not very close to 0 or 1 i.e., the normal approximation fails when sample proportion ‘approaches’ these values.

Table 1 Calculating n for very rare (P<0.05) and very common (P>0.95) conditions

z      d       P      n
1.96   0.005   0.01   1521.27
1.96   0.01    0.02    752.95
1.96   0.015   0.03    496.85
1.96   0.02    0.04    368.79
1.96   0.025   0.05    291.96
1.96   0.025   0.95    291.96
1.96   0.02    0.96    368.79
1.96   0.015   0.97    496.85
1.96   0.01    0.98    752.95
1.96   0.005   0.99   1521.27

Precision is calculated as discussed in text

must be corrected by multiplying with a finite population correction (FPC), to account for the added precision gained by sampling close to a larger percentage of population.

$$ FPC = \sqrt{\frac{N-n}{N-1}} $$

The effect of FPC is annulled when n ≪ N. The formula for sample size calculation including the effect of FPC is:

$$ n' = \frac{N \, z^2 \, P(1-P)}{d^2 (N-1) + z^2 \, P(1-P)} $$
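The effect of the correction is easiest to see with numbers. The sketch below is illustrative only: the function names are hypothetical, N is taken to be the size of the target population (an assumption, since its definition is not shown in this part of the text), and the population of 1,000 is an invented example.

```python
import math


def finite_population_correction(N: int, n: float) -> float:
    """FPC = sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))


def sample_size_fpc(N: int, z: float, p: float, d: float) -> float:
    """Sample size including the FPC:
    n' = N z^2 P(1-P) / (d^2 (N-1) + z^2 P(1-P)).
    """
    core = z ** 2 * p * (1 - p)
    return N * core / (d ** 2 * (N - 1) + core)


z, p, d = 1.96, 0.5, 0.05

# Hypothetical small target population of 1,000 people.
small = sample_size_fpc(1_000, z, p, d)
print(math.ceil(small))

# For a very large population the result approaches the uncorrected
# estimate z^2 P(1-P) / d^2, i.e. about 384.16.
large = sample_size_fpc(10 ** 9, z, p, d)
print(round(large, 2))

# The FPC itself is close to 1 when n is much smaller than N.
print(round(finite_population_correction(10 ** 6, 385), 4))
```

For the hypothetical population of 1,000 the corrected estimate is 278 subjects, compared with 385 without the correction.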
target population is very large (theoretically infinite), hypergeometric distribution can be well approximated by a binomial one, and thus, formula based on latter is acceptable. Any investigator using FPC in sample size estimate should also incorporate adjustment for hypergeometric sampling in analysis.

Where n is the sample size estimated assuming a simple random sampling. The design effect (DE) is actually a ratio measure that describes how much precision is gained or lost if a more complex sampling strategy is used instead of simple random sampling. Usually, complex sampling techniques lead to a decrease of precision, resulting in DE >1. This also implies that for the same precision, cluster or multistage sampling technique would require a larger sample size than simple or systematic random sampling. A classic example is cluster surveys for immunization coverage, where DE has been shown to be approximately 1.9 [2].

Sample size calculations are based on large sample approximation methods. Together FPC, DE and continuity correction are ‘adjustment factors’ which help improve specific sample size approximation to exact distributions. Continuity correction factor is employed to compare two population proportions, usually in a diagnostic evaluation, and is not discussed further here [4].

Non-Response Correction

This corresponds to rejection of the null hypothesis that the true prevalence is P at a 1-sided significance level of α. Taking the natural logarithm, setting α = 0.05 and re-arranging, we get a simple relationship:

$$ n = \frac{-\ln(0.05)}{P} \approx \frac{3}{P} $$

This simplification makes an assumption that ln(1−P) ≈ −P, which holds true only for P < 0.02. Hence, there are two important things in using this formula for sample size estimation. First, a previous estimate of the prevalence in a region with relatively high disease frequency, determined by conventional methods, will be helpful in determining the size of population to be screened in a region thought to be
having lesser prevalence. Secondly, this formula will be suitable only for conditions with prevalence less than 2 % in the target population undergoing screening.

For example, Gaucher disease type 3 is relatively common in the population of Northern Swedish region of Norrbotten having a prevalence of approximately 0.0006. So, if one screens about 3/0.0006 = 5000 people from any population and does not find a case of this disease in the sample, then one could say with 95 % confidence that the prevalence of Gaucher disease type 3 is less than 0.0006 in that population. Extending this argument, if 10000 people from the same population are screened for Gaucher disease type 3 and not a single case is found, then a conclusion can be made with 95 % confidence that the prevalence of this disease is less than 0.0003 in the given population.

If we introduce the concept of statistical power in this formula, it will represent the probability of all persons within the sample of size n being disease free, given the true prevalence of disease in the population is P′ (alternative hypothesis) instead of P (null hypothesis), such that P′ ≪ P. In such a case, power (Θ) will be:

$$ \Theta = (1 - P')^{n} $$

Taking natural logarithms and substituting n = 3/P and Θ = 0.8 yields P′ = 0.074*P. In the example of Gaucher disease type 3, this means that assuming the null prevalence in Northern Swedish region of 0.0006, the alternative prevalence in our hypothetical population should be less than (0.074*0.0006 =) 4.4*10⁻⁵ for this method to have at least 80 % power. This power will increase for yet smaller prevalence.

This method is called “zero patient” design and can only be used for estimating relative prevalence of disease in a population, provided an estimate of its prevalence is available in some other population having relatively higher prevalence of it and the disease has an estimated prevalence of less than 2 % in the population in which we want to have the estimate. This method cannot provide an estimate of true prevalence but can generate a hypothesis for the same.

Conclusions

Sample size estimation for prevalence studies is straight-forward for the most part. Certain challenging situations like extremes of prevalence, small target population, sampling technique, expected misclassification or missing data require diligent choice of calculating formula and plug-in values.

Conflict of Interest None.

Role of Funding Source None.

References

1. Antonisamy B, Christopher S, Samuel PP. Biostatistics: principles and practice. New Delhi: Tata McGraw-Hill; 2010.
2. Macfarlane SB. Conducting a descriptive survey: 2. Choosing a sampling strategy. Trop Doct. 1997;27:14–21.
3. Lwanga SK, Lemeshow S. Sample size determination in health studies: a practical manual. Geneva: World Health Organization; 1991.
4. Fosgate GT. Practical sample size calculations for surveillance and diagnostic investigations. J Vet Diagn Invest. 2009;21:3–14.
5. Yazici H, Biyikli M, van der Linden S, Schouten HJ. The ‘zero patient’ design to compare the prevalences of rare diseases. Rheumatology (Oxford). 2001;40:121–2.