Evaluating New Screening Methodologies

Scientific Impact Paper No. 20


April 2010
1. Introduction
Screening is defined as the examination of asymptomatic individuals with a view to identifying those
who have occult disease or who are likely subsequently to develop disease. The primary aim of screening
is to allow intervention to improve outcome through treatment of disease at an earlier, more tractable
state or to prevent its development. Screening tests can take many forms, including history, physical
examination, biochemical measurements, imaging and cytology. Obstetrics and gynaecology includes
examples of highly successful and high-profile screening, such as Down syndrome screening and cervical cytology.
Many more problems within the specialty could potentially be improved by effective screening and
intervention, such as stillbirth, premature ovarian failure and ovarian cancer.

2. Assessing a screening test


A series of arithmetic measures exist which quantify different aspects of the ability of a screening test to
identify high-risk individuals correctly. Understanding the meaning of these measures is essential to
understanding the ability of a given test to influence the disease in populations or individuals.

Sensitivity: The proportion of all people who have the disease (in a screened population) who screen positive

Specificity: The proportion of all people who do not have the disease (in a screened population) who screen negative

Positive predictive value (PPV): The proportion of people who screen positive who have the disease

Negative predictive value (NPV): The proportion of people who screen negative who do not have the disease

Positive likelihood ratio: The ratio of the odds of disease in someone who screens positive to the odds of disease in the general population

Negative likelihood ratio: The ratio of the odds of disease in someone who screens negative to the odds of disease in the general population

Perhaps the key measures are the sensitivity, as it indicates the proportion of all disease in the population
which is potentially preventable by the screening programme (assuming an effective intervention), and
the PPV, as it indicates how likely people who screen positive are to be affected by the disease. The screen-positive
rate is also of practical importance, as it determines the proportion of the screened population
that will require further investigation.
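
The measures defined above can be illustrated with a short calculation from a 2x2 table of screening result against disease status. The sketch below is illustrative only: the class and function names and the counts are hypothetical and are not taken from any study. The likelihood ratios are computed exactly as defined above, as the odds of disease after the test divided by the odds of disease in the screened population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningTable:
    """Counts from a 2x2 table of screening result against true disease status."""
    true_pos: int   # have the disease, screened positive
    false_neg: int  # have the disease, screened negative
    false_pos: int  # no disease, screened positive
    true_neg: int   # no disease, screened negative


def odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


def screening_measures(table):
    """Return the summary measures of a screening test as a dictionary."""
    diseased = table.true_pos + table.false_neg
    healthy = table.false_pos + table.true_neg
    total = diseased + healthy
    prevalence = diseased / total
    ppv = table.true_pos / (table.true_pos + table.false_pos)
    npv = table.true_neg / (table.true_neg + table.false_neg)
    return {
        "sensitivity": table.true_pos / diseased,
        "specificity": table.true_neg / healthy,
        "ppv": ppv,
        "npv": npv,
        # Odds of disease after a positive (or negative) screen,
        # relative to the odds of disease in the whole screened population.
        "positive_lr": odds(ppv) / odds(prevalence),
        "negative_lr": odds(1 - npv) / odds(prevalence),
        "screen_positive_rate": (table.true_pos + table.false_pos) / total,
    }


# Hypothetical population of 10 000 with a 1% prevalence of disease.
example = ScreeningTable(true_pos=80, false_neg=20, false_pos=990, true_neg=8910)
for name, value in screening_measures(example).items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical example the sensitivity is 80% and the specificity is 90%, yet the PPV is only about 7.5% because the condition is uncommon, and about 11% of the screened population would require further investigation.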

A common method for characterising screening tests is the receiver operating characteristic (ROC) curve.
This is an X/Y plot with 1 − specificity (that is, the false-positive rate) on the X axis and sensitivity on the
Y axis. Each point on the plot is the value for each of these parameters, having chosen a given threshold
to indicate someone who has screened positive. Analysis of the area under the ROC curve is a common
summary statistic of screening tests. Calculation of the area under the curve is performed by comparing
all possible pairwise combinations of those who have the disease and those who do not. The numerical
value of the area under the curve is the proportion of all these comparisons where the person with the
disease was calculated to be at higher risk than the person without the disease. Therefore, a perfectly
informative test has the value 1 and a completely non-informative test has the value 0.5 (that is, the test
value was higher in healthy people as often as it was higher in those with the disease). This measure is not
conceptually easy (although it is useful in research studies) and has less direct meaning than the other
measures above. However, many screening tests involve a continuous measurement and applying such a
test involves setting an arbitrary threshold, beyond which an individual is deemed to have screened
positive. A given threshold of risk will then also determine the sensitivity and the screen-positive rate.
Analysis of the ROC curve can be helpful in setting a threshold. Setting the threshold also depends on
the consequences of failure to detect disease when it is present, balanced against the adverse effects (such
as costs, anxiety, risk) of further tests in unaffected individuals. Depending on the circumstances, the
threshold may be set on the basis of what is regarded as the minimum acceptable level of sensitivity or
a maximum acceptable false-positive rate.
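
The pairwise definition of the area under the ROC curve, and the way a threshold fixes one point on the curve, can be sketched as follows. This is a minimal illustration rather than a recommended implementation: the function names and risk scores are hypothetical, and counting a tied pair as half a comparison is an assumed convention that the text does not specify.

```python
def auc_by_pairwise_comparison(diseased_scores, healthy_scores):
    """Proportion of (diseased, healthy) pairs in which the diseased person scores higher.

    Tied pairs count as half a 'win' (an assumed convention).
    """
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))


def roc_point(diseased_scores, healthy_scores, threshold):
    """Return (false-positive rate, sensitivity) when score >= threshold is screen positive."""
    sensitivity = sum(s >= threshold for s in diseased_scores) / len(diseased_scores)
    false_positive_rate = sum(s >= threshold for s in healthy_scores) / len(healthy_scores)
    return false_positive_rate, sensitivity


# Hypothetical risk scores, for illustration only.
diseased = [0.9, 0.8, 0.4]
healthy = [0.7, 0.3, 0.2, 0.1]

print(auc_by_pairwise_comparison(diseased, healthy))  # 11 of 12 pairs ranked correctly
print(roc_point(diseased, healthy, threshold=0.5))    # one point on the ROC curve
```

Lowering the threshold moves the point up and to the right along the curve: sensitivity rises, but so does the false-positive rate.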

3. Prevalence, incidence and screening


Interpreting all of the above also depends on how commonly the condition occurs within the population.
If a condition is very rare, even a very discriminatory test may not be clinically useful. For example, a
study carried out in 2006 showed that spontaneous preterm birth before 29 weeks of gestation affected
0.24% of a population of nulliparous women who had serum screening performed.1 A model generated
a test with a positive likelihood ratio of 5.5. However, given the rarity of the outcome, women who
screened positive still only had a 1.3% chance of the event. In contrast, if a condition had a 20%
incidence, a positive likelihood ratio of 5.5 would identify people with an approximately 60% risk of
the outcome (see Box 1).
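
The two figures quoted above can be reproduced by converting the prior risk to odds, multiplying by the likelihood ratio and converting back to a probability. The sketch below uses the prior risks and the likelihood ratio from the text; the function names are hypothetical.

```python
def probability_to_odds(probability):
    return probability / (1 - probability)


def odds_to_probability(odds):
    return odds / (1 + odds)


def posterior_probability(prior_probability, likelihood_ratio):
    """Risk after screening, given the prior risk and the likelihood ratio of the result."""
    posterior_odds = probability_to_odds(prior_probability) * likelihood_ratio
    return odds_to_probability(posterior_odds)


# Rare outcome: 0.24% prior risk, positive likelihood ratio 5.5.
print(round(posterior_probability(0.0024, 5.5), 3))  # 0.013, i.e. about 1.3%

# Common outcome: 20% prior risk, same likelihood ratio.
print(round(posterior_probability(0.20, 5.5), 2))    # 0.58, i.e. approximately 60%
```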

4. Evaluating clinical application of a screening programme


One of the key characteristics of a successful screening programme is that it has two distinct components.
The first is the ability to identify people at high risk. The second is the ability to intervene within this
group in a way that improves outcome. Ideally, these two components should be evaluated separately.
To design an adequate interventional trial, the screening properties of the test have to be known. This
requires appropriately designed and analysed studies of new diagnostic methodologies. A guideline exists
(STAndards for the Reporting of Diagnostic accuracy studies or STARD) which aims to standardise
reporting of such tests.2

In reality, there are many situations in obstetrics and gynaecology where research studies have failed to
appreciate the two separate components of screening. An important example is the use of routine
ultrasonography in obstetrics. A series of trials were conducted. Meta-analysis has demonstrated no
effect of routine scanning, including Doppler flow velocimetry of the uteroplacental circulation, on
perinatal mortality.3,4 However, many of the studies lacked a standardised intervention in women who
screened positive. Moreover, the trials were designed in the absence of clear information on the screening
properties of the test as a means of identifying women at increased risk of perinatal death from an
unselected population. It is unclear, therefore, whether the trials failed to show benefit because the
screening test was ineffective at identifying women at high risk, whether the trials were underpowered
or whether there was no available intervention which reduced the risk of the outcome in women who
screened positive. This is discussed in detail elsewhere.5

Box 1. Reminder about odds and likelihood ratios

Likelihood ratios are a means of modifying the odds of a disease on the basis of a test or personal characteristics. Remember: the probability is
different from the odds.

Odds = probability/(1 − probability). Conversely, probability = odds/(1 + odds).

When the condition is rare, the odds and probability are similar. However, when the condition is common, odds and probability become quite
different.

In the example in the text, the prior risk is 20%, i.e. probability = 0.2.

Therefore, the prior odds = 0.2/0.8 = 0.25.

The posterior odds = the prior odds × the LR = 0.25 × 5.5 = 1.375.

To convert the posterior odds back into the probability, divide the odds by (odds + 1):

Posterior probability = 1.375/2.375 = 0.58.

Therefore the risk is approximately 60%, given a prior risk of 20% and a positive likelihood ratio of 5.5.

In general, randomised controlled trials should be conducted within the group who screen positive rather
than analysed at the level of the entire population screened. Failure to do so could result in effective
screening and intervention being dismissed as ineffective through reduced statistical power. This is best
illustrated by a hypothetical but plausible example. Consider a population with a 5% incidence of
gestational diabetes mellitus (GDM); a treatment for GDM which reduced perinatal morbidity from 4%
to 1%, as has been described;6 a background rate of perinatal morbidity (that is, among women without
GDM) of 0.5%; and, for simplicity, a screening test for GDM which is 100% specific and 100%
sensitive. If a randomised controlled trial was conducted by allocating women to being screened or not
screened, the incidence of perinatal morbidity would be 0.525% in the intervention group and 0.675%
in the control group. An adequately powered study would require the inclusion of over 100 000 women
to detect this effect. In contrast, if a randomised controlled trial was conducted comparing the
intervention with routine care among women who screened positive, around 1250 women who screened
positive would be required to detect the reduction in risk from 4% to 1%, which would require screening
of around 25 000 women. The explanation is that power was increased by excluding a greater
proportion of the outcomes which were unrelated to GDM and so could not be influenced by the
screening programme.
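
The arithmetic behind this example can be sketched as follows. The event rates come directly from the figures above. The sample size formula is an assumption, since the text does not state the method used: it is a simple normal approximation for comparing two proportions, with a two-sided 5% significance level and 90% power. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def arm_event_rate(gdm_incidence, rate_with_gdm, background_rate):
    """Overall perinatal morbidity in a trial arm made up of women with and without GDM."""
    return gdm_incidence * rate_with_gdm + (1 - gdm_incidence) * background_rate


def women_per_arm(p_control, p_treated, alpha=0.05, power=0.90):
    """Approximate number per arm to compare two proportions.

    Normal approximation without continuity correction; alpha and power are assumed values.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Trial 1: randomise the whole population to screening or no screening.
not_screened = arm_event_rate(0.05, 0.04, 0.005)  # GDM untreated: 0.675%
screened = arm_event_rate(0.05, 0.01, 0.005)      # GDM treated: 0.525%
population_trial_total = 2 * women_per_arm(not_screened, screened)

# Trial 2: randomise only women who screen positive to intervention or routine care.
screen_positive_trial_total = 2 * women_per_arm(0.04, 0.01)
women_to_screen = round(screen_positive_trial_total / 0.05)  # 5% screen positive

print(f"Whole-population trial: about {population_trial_total} women")
print(f"Screen-positive trial: about {screen_positive_trial_total} women, "
      f"from about {women_to_screen} screened")
```

Under these assumptions the whole-population design needs over 100 000 women, whereas the screen-positive design needs a little over 1000. The second figure is somewhat lower than the 1250 quoted above, which may reflect a different formula (for example, one with a continuity correction), but the contrast between the two designs is the same.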

Assessing the acceptability of screening tests is also an important component of their evaluation, as even
perfectly predictive tests will have little effect on outcomes if most women find them unacceptable.
Hence, the attitudes of consumers to new tests are an important aspect of clinical trials of screening and
intervention. Finally, as introduction of new treatments into care in the NHS and elsewhere is often
based on the cost per quality-adjusted-life-year gained, economic analysis of all new methods of
screening is essential.

5. Common pitfalls
There are a number of common errors in screening studies. First, many studies initially evaluate
screening tests using patients who are inherently high risk (such as those from a tertiary referral centre).
As discussed above, the PPV of a test is one of its key features and this is strongly related to the
prevalence of the condition. Hence, the PPV is often overestimated in initial descriptions of a screening test.
Second, many studies use complex multivariate models to create a method for predicting risk. However,
use of models with large numbers of predictors, especially when applied to small numbers of cases, will
tend to yield estimates of prediction which are over-optimistic. Models should generally be validated in
a different sample of participants from the sample used to derive the model, particularly in studies
containing small numbers of cases and large numbers of predictors. Third, early detection of a disease, in
particular cancer, will lead to prolongation of the interval between diagnosis and death, even in the
absence of an effective intervention. Moreover, screening may detect disease that is less likely to result in
death or symptoms prior to death from other causes. Hence, screening may appear to improve outcome
in the absence of a true effect. Fourth, there is a tendency to judge screening solely on the basis of how
well the test discriminates. Screening should only be considered where the benefit of early detection and
intervention clearly exceeds the potential drawbacks of the programme (costs, anxiety, morbidity from
diagnostic procedures in false positives and so on). Finally, incomplete ascertainment of cases can lead
to misleading conclusions. In particular, the failure to identify participants who were false negatives (that
is, people affected by the disease who screened negative) will yield over-optimistic estimates of screening
efficiency.7
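
The first of these pitfalls can be illustrated numerically. The sketch below applies the same test to a high-risk referral population and to an unselected population. The sensitivity, specificity and prevalences are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV of a test with fixed sensitivity and specificity at a given prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 80% sensitivity, 95% specificity.
referral_centre = positive_predictive_value(0.80, 0.95, prevalence=0.20)
general_population = positive_predictive_value(0.80, 0.95, prevalence=0.01)

print(f"PPV at 20% prevalence: {referral_centre:.0%}")
print(f"PPV at 1% prevalence: {general_population:.0%}")
```

With these hypothetical values the PPV falls from 80% in the high-risk group to about 14% in the unselected population, although the test itself is unchanged.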

6. Opinion
Population screening is widespread in clinical practice and research in obstetrics and gynaecology.
Researching screening is methodologically quite complex and is often performed suboptimally in the
specialty.

References
1. Smith GCS, Shah I, White IR, Pell JP, Crossley JA, Dobbie R. Maternal and biochemical
predictors of spontaneous preterm birth among nulliparous women: a systematic analysis in
relation to the degree of prematurity. Int J Epidemiol 2006;35:1169–77.
2. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards
complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ
2003;326:41–4.
3. Bricker L, Neilson JP. Routine Doppler ultrasound in pregnancy. Cochrane Database Syst Rev
2000;(2):CD001450.
4. Bricker L, Neilson JP. Routine ultrasound in late pregnancy (after 24 weeks gestation).
Cochrane Database Syst Rev 2000;(2):CD001451.
5. Smith GCS, Fretts RC. Stillbirth. Lancet 2007;370:1715–25.
6. Crowther CA, Hiller JE, Moss JR, McPhee AJ, Jeffries WS, Robinson JS. Effect of treatment
of gestational diabetes mellitus on pregnancy outcomes. N Engl J Med 2005;352:2477–86.
7. Cuckle H, Aitken D, Goodburn S, Senior B, Spencer K, Standing S. Age-standardisation when
target setting and auditing performance of Down syndrome screening programmes. Prenat
Diagn 2004;24:851–6.

This Scientific Impact Paper was produced on behalf of the Royal College of Obstetricians and Gynaecologists by:
Professor GCS Smith FRCOG, Cambridge.

It was peer reviewed by: Dr D Aitken, Glasgow; Professor DJ Murphy MRCOG; Dr RW Old, Warwick; Professor PW Soothill FRCOG.

The Scientific Advisory Committee lead reviewer was Ms SM Quenby FRCOG.

At the time of publication, the Chair of the Scientific Advisory Committee was Professor S Thornton FRCOG and the Vice Chair was
Professor R Anderson FRCOG.

The final version is the responsibility of the Scientific Advisory Committee of the RCOG.

The review process will commence in 2013 unless otherwise indicated.