31 (2002) 567–581
Evidence-based diagnosis
in endocrinology
Roman Jaeschke, MD, MSca,
Gordon H. Guyatt, MD, MSca,b,*,
Victor M. Montori, MD, MScc
Any interaction with patients involves one or more tasks that may include
establishing and communicating the prognosis of a particular condition,
choosing, if available, an appropriate therapy or preventive strategy, and
supporting them during therapy. This chain of events clearly begins with a
determination of what is happening, that is, establishing the diagnosis.
A rapidly growing literature that describes the characteristics and usefulness of diagnostic tests and procedures provides information that can help
clinicians make an accurate diagnosis. The concepts used in this literature
(and the vocabulary used to describe them) are reviewed in this article. They
are introduced in the context of a clinical scenario that describes the testing
of a urine sample for the presence of microalbuminuria.
Recent guidelines for the management of diabetes recommend that
patients have annual screening for proteinuria. The presence of even a small
amount of protein in urine (microalbuminuria) has important prognostic
and therapeutic implications. Twenty-four-hour urine collections for albumin remain the gold standard for detecting microalbuminuria, but they are
quite cumbersome, and the need for a simpler, more convenient test is
obvious. A determination of the urine albumin concentration (UAC) or the
urine albumin:creatinine ratio (UACR) in a random urine sample is an
appropriate option. This was assessed in a 1997 article [1] that described the
* Corresponding author.
E-mail address: guyatt@mcmaster.ca (G.H. Guyatt).
0889-8529/02/$ - see front matter © 2002, Elsevier Science (USA). All rights reserved.
PII: S 0 8 8 9 - 8 5 2 9 ( 0 2 ) 0 0 0 1 8 - X
To preclude the possibility that the results of the new diagnostic test are influenced by the results of the reference standard, it is important that the test results and the reference standard be assessed independently of each other (ie, by interpreters who were unaware of the results of the other investigation). This independence of comparisons is not crucial when considering objective, biochemical tests. Its importance arises, however, when the interpretation of one test's results may be influenced by knowledge of the other test's results. Examples include assessing fundoscopy when one knows angiography results, assessing results of clinical examination for neuropathy when one knows the electromyogram (EMG) or nerve-conduction-study results, interpreting bone radiographs when one knows bone scan results, looking at chest radiographs when one knows CT scan results, and conducting heart auscultation when one knows echocardiogram results. The more likely that the interpretation of a new test could be influenced by knowledge of the reference standard result (or vice versa), the greater the importance of the independent interpretation of both tests.
All patients should receive the test under evaluation and the reference standard. This point may be illustrated by the situation in which all patients with suspected peripheral neuropathy have nerve conduction studies but only patients with abnormal velocities have a nerve biopsy. This situation, sometimes called "verification bias" or "work-up bias," was not a problem in the study under consideration, in which patients had both tests.
A diagnostic test is useful only to the extent that it distinguishes between target states or disorders that might otherwise be confused. Almost any test can distinguish the healthy from the severely affected. The pragmatic value of a test is established only in a study that closely resembles clinical practice.
Table 1
Relationship between UAER (gold standard) and UACR (test) at three levels of test results

UACR             UAER >28.8 mg/d    UAER <28.8 mg/d
>26.8 mg/g       61                 6
15–26.8 mg/g     8                  8
<15 mg/g         0                  40
Total            69                 54
Table 2
Relationship between UAER (gold standard) and UACR (test) at two levels of test results (26.8 mg/g cutoff)

UACR             UAER >28.8 mg/d    UAER <28.8 mg/d
>26.8 mg/g       61                 6
<26.8 mg/g       8                  48
Total            69                 54

Table 3
Relationship between UAER (gold standard) and UACR (test) at two levels of test results (15 mg/g cutoff)

UACR             UAER >28.8 mg/d    UAER <28.8 mg/d
>15 mg/g         69                 14
<15 mg/g         0                  40
Total            69                 54
Table 4
Relationship between test results and truth

                 Disease
Test result      Present      Absent
Positive         a (TP)       b (FP)
Negative         c (FN)       d (TN)
Total            a + c        b + d

Abbreviations: FN, false negative; FP, false positive; TN, true negative; TP, true positive.
horizontal axis displays [1 − specificity] (or the false-positive rate) for the same cut-offs. The curve established by connecting the points generated by using different diagnostic cutoffs is called a receiver operating characteristic (ROC) curve. For the data set under consideration, two points of this curve, on the basis of known sensitivities and specificities, are known. A third point could be read from the data provided in the article and represents 100% specificity at the expense of only 42% sensitivity. The resulting ROC curve, which represents a modified ROC curve from the article, is presented in Fig. 1.
Such ROC curves can be used to compare formally the value of different tests by examining the area under each curve; the better the test, the larger the area under the curve. Zelmanovitz et al found that for the detection of abnormal protein secretion by UACR the area under the ROC curve was 0.9689 and that the area for another test described in the same study (UAC) was 0.976 [1]. To put the discriminating abilities of those tests into perspective, one may consider that the area under the ROC curve for ferritin in diagnosing iron deficiency anemia is approximately 0.95 [3].

Fig. 1. Receiver operating characteristic curve for UACR. (Adapted from Gerstein H, Haynes B. Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc; 2001; with permission.)
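The two known points of this curve can be recovered from the counts in Tables 2 and 3, and a rough area estimate obtained by the trapezoidal rule. This is a minimal sketch, not the method used in the article (whose AUC of 0.9689 was computed from the full data set, and which also used the third, 42%-sensitivity point):

```python
# Sketch: ROC points for UACR from the 2x2 counts in Tables 2 and 3,
# plus a trapezoidal area estimate over the few available points.

def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity from 2x2 counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Cutoffs: >26.8 mg/g (Table 2) and >15 mg/g (Table 3)
points = [(0.0, 0.0)]  # (FPR, sensitivity); (0,0) anchors the curve
for tp, fn, fp, tn in [(61, 8, 6, 48), (69, 0, 14, 40)]:
    se, sp = sens_spec(tp, fn, fp, tn)
    points.append((1 - sp, se))
points.append((1.0, 1.0))
points.sort()

# Trapezoidal rule over the connected points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"ROC points (FPR, sens): {points}")
print(f"Approximate AUC: {auc:.3f}")
```

With only these two interior points the trapezoidal estimate (about 0.93) understates the published 0.9689, as expected for a coarse approximation.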
Predictive value

Predictive values represent another way of expressing the properties of a diagnostic test. In applying a given test to a patient, the vertical columns of Table 4 are of limited interest, because if one really knew what column the patient was in, the diagnostic test would not be required. The clinically relevant questions are embedded in the rows. For example: what proportion of patients with a UACR of more than 15 mg/g has abnormal 24-hour urinary albumin secretion? In this study the answer is 69 of 83 patients, or 83% (a proportion called the positive predictive value, or PPV). For the same threshold the probability that a patient with negative test results has no disease is 100% (40/40), a proportion called the negative predictive value, or NPV (Table 3). For the different threshold (26.8 mg/g) the respective values are 91% for PPV and 86% for NPV (Table 2). Using the symbols in Table 4, PPV = a/(a + b), and NPV = d/(c + d).
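The row arithmetic can be sketched directly from the counts; the numbers below are those of Table 3 (the 15 mg/g cutoff), labeled with the a/b/c/d symbols of Table 4:

```python
# Sketch: predictive values from the rows of the 2x2 table.

def predictive_values(a, b, c, d):
    """PPV = a/(a+b) from the positive row; NPV = d/(c+d) from the negative row."""
    return a / (a + b), d / (c + d)

# 15 mg/g UACR cutoff (Table 3): a=69 TP, b=14 FP, c=0 FN, d=40 TN
ppv, npv = predictive_values(a=69, b=14, c=0, d=40)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")  # 83% and 100%, as in the text
```

Running the same function on the Table 2 counts (61, 6, 8, 48) reproduces the 91% PPV and 86% NPV quoted for the 26.8 mg/g threshold.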
The relationship between sensitivity and specificity on one hand and predictive values on the other can be illustrated using a hypothetical example (Table 5), in which one assumes that a population has a smaller proportion of people with the disease of interest. In this example, the sensitivity [a/(a + c)] and specificity [2d/2(b + d) = d/(b + d)] are unchanged, but the PPV is reduced from [a/(a + b)] to [a/(a + 2b)] and the NPV has increased from [d/(c + d)] to [2d/(c + 2d)].

As a general rule, although the sensitivity and specificity do not change, decreasing the disease prevalence decreases the PPV and increases the NPV. Similarly, it can be easily shown that maintaining test sensitivity and specificity but increasing the disease prevalence (2a and 2c) increases the PPV and decreases the NPV. Predictive values reflect the test characteristics and the disease prevalence in the population and are of limited value in populations different from the studied one.
Sometimes sensitivity or specificity is so high that it can be used to rule in or rule out a target disorder. When a test has a high sensitivity, a negative result rules out the diagnosis (a convenient mnemonic is "Sensitive-Negative-out," or SnNout); this result corresponds to a high negative predictive value. When a test has a high specificity, a positive test result rules in the diagnosis ("Specificity-Positive-in," or SpPin); this result corresponds to a high positive predictive value. Calling a test result positive or negative may be useful when the test has a good SpPin or SnNout, but for most tests, creating this dichotomy can lose much information.

Table 5
Relationship between prevalence of disease, sensitivity, specificity, and predictive values

                 Disease
Test result      Present      Absent
Positive         a            2b
Negative         c            2d
Total            a + c        2(b + d)
There are clearly situations in which it is important to maximize sensitivity or specificity. The requirement for high sensitivity is obvious when a test is used as a screening tool. In that situation it is important to identify all patients with a given condition in a population, not just part of them. The high sensitivity and associated high NPV, however, come at the price of lower specificity, or an increased number of false-positive results and an increased need for confirmatory (sometimes invasive) tests. Examples include mammography for breast cancer screening or prostate-specific antigen (PSA) for prostate cancer screening. The opposite occurs when high specificity (and correspondingly high positive predictive value) is required, such as when establishing the diagnosis definitively has important therapeutic or prognostic implications.
Likelihood ratios: pretest and posttest probabilities

Despite the relative simplicity of the concepts described previously, they are limited by the need to choose different thresholds for a test result, which can vary depending on the purpose of the test (ie, screening for or confirming disease). They also lump different degrees of abnormality into a single category: either diabetic neuropathy is present or not; either ketoacidosis is present or not. Unfortunately, this distinction does not correspond to the clinical reality in which, for example, a daily albumin secretion of 400 mg and 4 g, a serum creatinine of 200 or 600 µmol/L, and a serum glucose of 20 and 60 mmol/L are all abnormal but have clearly different clinical implications.

These distinctions are best captured by the concept of likelihood ratios. This concept recognizes the fact that different patients have different probabilities of having the disease of interest because of different risk factors, such as age and comorbidities. The application of any test can be viewed as a way of either increasing or decreasing the probability that the patient has the disease of interest. That is, a test serves to modify the pretest probability of the disease and yields a new posttest probability. The direction and magnitude of this change from pretest to posttest probability is determined by the test's properties, which are called the likelihood ratios.

In the diagnostic process one frequently proceeds through a series of different diagnostic tests (information from history taking, physical examination, laboratory or radiologic tests). If the properties of each of these pieces of information are known, one can move sequentially through them, incorporating each piece of information and continuously recalculating the probability of the target disorder. Clinicians implicitly do proceed in this manner.
Having determined the magnitude and significance of the LRs, how does one use them to go from pretest to posttest probability? LRs cannot be combined directly; their formal use requires converting the pretest probability to odds, multiplying the result by the LR, and converting the consequent posttest odds into a posttest probability. Although not a difficult process (see appendix), this calculation can be tedious and off-putting. Fortunately, there is an easier way. A nomogram proposed by Fagan [4] (Fig. 2) does all the conversions and facilitates the conversion from pretest to posttest probabilities. The first column of this nomogram represents the pretest probability, the second column represents the LR, and the third shows the posttest probability. One may obtain the posttest probability by anchoring a ruler at the pretest probability and rotating it until it lines up with the LR for the observed test result.
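The odds arithmetic that the nomogram replaces can be sketched in a few lines; the pretest probability of 25% and the LR of 9 below are illustrative numbers, not values from the article:

```python
# Sketch of the conversion behind the Fagan nomogram: probability -> odds,
# multiply by the likelihood ratio, odds -> probability.

def posttest_probability(pretest_prob, lr):
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# Illustrative: pretest probability 25%, positive result with LR = 9
p = posttest_probability(0.25, 9)
print(f"Posttest probability: {p:.0%}")  # odds 1/3 * 9 = 3 -> 75%
```

For a sequence of conditionally independent tests, the posttest probability of one test simply becomes the pretest probability of the next, which is the recalculation clinicians perform implicitly.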
Thus, the LR incorporates the information that is generally used when arriving at a diagnosis: the specifics of a given clinical encounter (ie, the individual characteristics of a patient and one's clinical experience) and the external evidence that comes from performing tests in populations of patients. The former determines the assessment of pretest probabilities, and the latter concerns the ability of the test's result to distinguish patients with and without the condition of interest. These two elements are combined to establish estimates of whether the patient has the target disorder (posttest probabilities).

Table 6 provides likelihood ratios that apply to the evaluation of thyroid nodules. Tables such as this can be useful in designing evidence-based diagnostic strategies and interpreting test results at the clinic using the framework presented previously.
How applicable are the study's results and the diagnostic test to different clinical settings?
The value of any test depends on its ability to yield the same result when reapplied to stable patients in one's own clinical setting. Poor reproducibility can result from problems with the test itself (eg, variations in reagents in radioimmunoassay kits for determining hormone levels). A second cause of different test results in stable patients arises whenever a test requires interpretation (eg, the extent of ST-segment elevation on an electrocardiogram). Ideally, an article about a diagnostic test informs readers about how reproducible the test results can be expected to be. This is especially important when expertise is required in performing or interpreting the test.

If the reproducibility of a test in the study setting is mediocre, disagreement between observers is common, and the test still discriminates well between patients with and without the target condition, it is useful. Under these circumstances, it is likely that the test can be applied readily in any clinical setting. If reproducibility of a diagnostic test is high and observer variation is low, either the test is simple and unambiguous or the clinicians who are interpreting it are highly skilled. If the latter applies, less skilled interpreters may not fare as well.

Fig. 2. A likelihood ratio nomogram. (Adapted from Fagan T. Nomogram for Bayes's theorem. N Engl J Med 1975;293:257; with permission. © 1975 Massachusetts Medical Society. All rights reserved.)

Test properties may change with a different mix of disease severity or a different distribution of competing conditions. If the population with the target condition is severely affected, likelihood ratios move away from
Table 6
Likelihood ratios for the diagnosis of malignancy in euthyroid patients with a single or dominant thyroid nodule

Prevalence (pretest probability): 20%. No. of patients included: 722.

Test (no. of patients)                        Result        LR (95% CI)
Fine-needle aspiration cytology,              Malignant     226 (4.4–11,739)
guided with ultrasound (132)                  Suspicious    1.3 (0.52–3.2)
                                              Insufficient  2.7 (0.52–15)
                                              Benign        0.24 (0.11–0.52)
Fine-needle aspiration cytology,              Malignant     34 (15–74)
not guided (868)                              Suspicious    1.7 (0.94–3)
                                              Insufficient  0.5 (0.27–0.76)
                                              Benign        0.23 (0.13–0.42)
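As a worked illustration of how such a table is used, the sketch below applies the LRs listed for unguided fine-needle aspiration cytology to the 20% pretest probability given in Table 6 (the LR values are as reconstructed above):

```python
# Sketch: posttest probabilities from Table 6 LRs at a 20% pretest probability.

def posttest(pretest, lr):
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

pretest = 0.20
for result, lr in [("Malignant", 34), ("Benign", 0.23)]:  # FNA, not guided
    print(f"{result}: posttest probability {posttest(pretest, lr):.0%}")
```

A malignant cytology result raises the probability of malignancy from 20% to roughly 89%, whereas a benign result lowers it to about 5%, which is the SpPin/SnNout behavior described earlier expressed on the probability scale.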
Fig. 3. Diagnostic process: test and treatment thresholds. (Adapted from Gerstein H, Haynes B.
Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc; 2001; with permission.)
patient. The value of an accurate test is undisputed when the target disorder, if left undiagnosed, is dangerous, the test has acceptable risks, and effective treatment exists.
In other clinical situations, tests may be accurate and management even
may change as a result of their application, but their impact on patient outcome may be far less certain. Examples include right heart catheterization
for many critically ill patients and the incremental value of MRI over CT
scanning for various problems.
Acknowledgment

This article is largely based on our previous publication on the subject entitled "How should diagnostic tests be chosen and used?" [11].
References

[1] Zelmanovitz T, Gross JL, Oliveira JR, et al. The receiver operating characteristics curve in the evaluation of a random urine specimen as a screening test for diabetic nephropathy. Diabetes Care 1997;20:516–9.
[2] Evidence-based Medicine Working Group. Users' guides to the medical literature: a manual for evidence-based clinical practice. Chicago: AMA Press; 2001.
[3] Guyatt G, Oxman A, Ali M. Diagnosis of iron deficiency. J Gen Intern Med 1992;7:145–53.
[4] Fagan T. Nomogram for Bayes's theorem. N Engl J Med 1975;293:257.
[5] Hlatky M, Pryor D, Harrell F. Factors affecting sensitivity and specificity of exercise electrocardiography. Am J Med 1984;77:64–71.
[6] Ginsberg J, Caco C, Brill-Edwards P, et al. Venous thrombosis in patients who have undergone major hip or knee surgery: detection with compression US and impedance plethysmography. Radiology 1991;181:651–4.
[7] Lachs MS, Nachamkin I, Edelstein PH, et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135–40.
[8] Irwig L, Tosteson AN, Gatsonis C, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994;120:667–76.
[9] Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. J Clin Epidemiol 1999;52:943–51.
[10] Sackett D, Haynes R, Guyatt G, et al. Clinical epidemiology: a basic science for clinical medicine. 2nd edition. Boston: Little, Brown and Co; 1991.
[11] Gerstein H, Haynes B. Evidence-based diabetes care. Hamilton, Ontario: BC Decker Inc; 2001.