The clinical performance of a laboratory test can be described in terms of diagnostic accuracy, or the ability to correctly classify subjects into clinically relevant subgroups. Diagnostic accuracy refers to the quality of the information provided by the classification device and should be distinguished from the usefulness, or actual practical value, of the information. Receiver-operating characteristic (ROC) plots provide a pure index of accuracy by demonstrating the limits of a test's ability to discriminate between alternative states of health over the complete spectrum of operating conditions. Furthermore, ROC plots occupy a central or unifying position in the process of assessing and using diagnostic tools. Once the plot is generated, a user can readily go on to many other activities such as performing quantitative ROC analysis and comparisons of tests, using likelihood ratios to revise the probability of disease in individual subjects, selecting decision thresholds, using logistic-regression analysis, using discriminant-function analysis, or incorporating the tool into a clinical strategy by using decision analysis.

Indexing Terms: receiver-operating characteristic curves · data analysis · diagnostic accuracy · likelihood ratio · diagnostic threshold · test efficiency · predictive value

Reports evaluating some clinical aspect of laboratory test performance frequently appear in this and many other journals. However, the elements of performance that are addressed vary. What is clinical performance? Terms commonly used include sensitivity and specificity, efficiency, accuracy, utility, value, worth, effectiveness, usefulness, and efficacy. Often the word diagnostic precedes the term, i.e., diagnostic value or diagnostic efficiency. Other terms such as predictive value (positive and negative), likelihood ratio, odds ratio, and likelihood quotient have been used. The meaning of these terms is often vague and variable, particularly utility, worth, value, usefulness, and effectiveness, but even diagnostic accuracy seems to mean different things to different people.

As laboratorians, we are often interested in how well a test performs clinically, because we are considering replacing an existing test with a newer one, adding a new test to our laboratory's menu, eliminating tests where possible, or just because we want to know something about the value of what we are doing. In the first six issues of Clinical Chemistry during 1991, at least 18 studies addressed questions about clinical performance. Some of these studies assessed test performance merely by calculating the mean test results for the various sample groups they studied. Others calculated sensitivity, specificity, efficiency, and (or) predictive value. Five of the 18 studies included receiver- (or relative-) operating characteristic (ROC) plots to represent test performance.³

It is apparent that both the concepts and the measures of performance varied from study to study. The lack of a standardized approach to performance makes for a confusing situation in which the investigators and readers fail to communicate and understand clearly the information of interest to both groups. Without agreement about the concept of performance and how it should be measured and represented, the struggle to understand the meaning of the data gathered is likely to continue.

Here we offer a definition of accuracy to be used as a measure of performance, and review the principles and application of the ROC plot as an index of diagnostic accuracy (1). ROC plots are fundamental; their pivotal position provides a unifying concept in the process of test evaluation (Figure 1). Once the data are collected and the plots generated, numerous other assessments, comparisons, indices, and analyses can follow. Even clinical decision analysis (not shown), a complex but important tool for medical decision making, involves the use of data generated for ROC plotting. It is this central position of ROC plots that we will describe in this review (Figure 1).

ROC methodology is based on statistical decision theory and was developed in the context of electronic signal detection and problems with radar in the early 1950s (2). An ROC-type plot was used in the late 1950s to describe the ability of an automated Pap smear analyzer to discriminate between smears with and without malignant cells. Curves of true-negative vs false-negative results were used to select an operating point for the instrument that would provide an optimum trade-off between false-positive and false-negative results (3).

By the mid-1960s, ROC plots had been used in experimental psychology and psychophysics (2). Following work in psychophysics by Green and Swets (4), Leo Lusted, a radiologist, suggested using ROC analysis in medical decision making in 1967 and began using it in studies of medical imaging devices in 1969 (5, 6). Others wrote about it (7), and eventually ROC analysis made its way into other areas of medicine.

Laboratory tests are ordered to help answer questions about patient management. How much help an individual test result provides varies and, in any case, is a highly
¹ Clinical Pathology Department, Warren G. Magnuson Clinical Center, and ² Biometry and Field Studies Branch, National Institutes of Neurological Diseases and Stroke, National Institutes of Health, Bethesda, MD 20892.
Received August 20, 1991; accepted November 2, 1992.
³ Nonstandard abbreviations: ROC, receiver-operating characteristic; CK, creatine kinase; AMI, acute myocardial infarction; CAD, coronary artery disease; HDL, high-density lipoprotein; LR, likelihood ratio; PV(+), predictive value of a positive test result; and PV(−), predictive value of a negative test result.
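As a concrete illustration of the quantities discussed above — this is a sketch of our own, not code from any of the studies cited, and all data values and function names are invented — the nonparametric ROC points of a test can be computed by sweeping the decision threshold over the observed results, recording sensitivity and 1 − specificity at each setting:

```python
# Sketch: empirical (1 - specificity, sensitivity) pairs forming a
# nonparametric ROC plot. A result >= threshold is called positive;
# higher values are assumed to indicate disease.

def roc_points(diseased, nondiseased):
    """Return ROC points over all decision thresholds, strictest first."""
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every result: no positives
    for t in thresholds:
        sensitivity = sum(x >= t for x in diseased) / len(diseased)
        specificity = sum(x < t for x in nondiseased) / len(nondiseased)
        points.append((1 - specificity, sensitivity))
    return points

# Hypothetical results for 5 diseased and 5 nondiseased subjects
print(roc_points([4, 5, 6, 7, 9], [1, 2, 3, 5, 6]))
```

Each threshold yields one operating point; plotting the full set displays performance over the complete spectrum of operating conditions, which is what distinguishes an ROC plot from a single sensitivity/specificity pair.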
under the trapezoids that comprise the entire area or use the Mann-Whitney version of the Wilcoxon statistic with average ranks. However, if the number of ties is considerable, this trapezoidal area tends to be a biased underestimate of the true area. This can be illustrated for continuous data that have been discretized. For example, Figure 8 presents the ROC plot based on the discretization of the data of Figure 7 into eight bins. Note that the eight-bin ROC plot generally lies below the unbinned ROC plot. Whereas the unbiased area estimate is 0.747 for the unbinned plot, the area under the eight-bin discretization is smaller (0.718). This is because the actual ROC plot tends to be convex and therefore to lie above, rather than below, the diagonals (30, 32, 35-37). If the number of ties is large, it is advisable to report not only the trapezoidal area but also the maximal area, in which all diagonals are replaced by maximal vertical, then horizontal, possible paths. For example, in Figure 8, whereas the unbinned maximal area = 0.747 (only one small diagonal), the maximal area for the eight-bin plot is much larger (0.854). Introduction of ties by binning the data biases the trapezoidal area estimate and increases the difference between that estimate and the maximal-area estimate. The standard deviation of the trapezoidal area estimate is always increased with discretization; in this case, it is 0.036 for the unbinned area and 0.041 for the eight-bin area.

For the parametric approach, there is a graphical method that estimates the parameters of the binormal model to obtain an area estimate (2, 4). A more exact approach involving the above maximum likelihood estimation provides not only an area estimate but also its standard error (8, 28, 29). The latter permits hypothesis testing and confidence interval estimation of the area. However, if a parametric model such as the binormal has been applied, the area obtained by estimation of the parameters can also be biased, unless the parametric assumptions on the counts are well satisfied. A comparison of nonparametric and binormal parametric areas has been done by Centor and Schwartz (38).

Area is in some sense an imperfect measure of the performance of the diagnostic test. For one thing, it is a single global measure: there is necessarily a loss of information in reducing the overall performance of the diagnostic test to a single number. A less global alternative is to restrict the area to a relevant portion, e.g., the area under the curve for observed specificity >0.6 or for sensitivity >0.7. This restriction analysis has been accomplished nonparametrically (39) as well as under the binormal model (40). Because the area under the ROC plot condenses the information of the graph to only a single number, it is usually undesirable to consider area without examining the plot itself. As Figure 9 illustrates, two ROC plots can be quite different in shape and yet have similar areas. At sensitivities below ~0.65-0.70, the HDL/total cholesterol ratio has better specificity (lower false-positive fraction), whereas at higher sensitivities apolipoprotein A-I has better specificity.

[Fig. 9. Nonparametric ROC plots of serum apolipoprotein A-I concentrations and the ratio of HDL to total cholesterol. Same subjects as in Fig. 5 (data from Kottke et al., 18). Areas under the ROC plots are 0.753 and 0.743, respectively.]

Statistical Comparison of Multiple Tests by Use of ROC Plots

Direct statistical comparison of multiple diagnostic tests is frequent in clinical laboratories. Two (or more) tests are usually performed on the same subjects, as in a split-sample comparison. In such cases, the results from two tests are usually correlated or associated. It is also possible, but less common, to have different individuals for the two tests, in which case the test results are independent (and hence uncorrelated). Of these two designs, the paired (split) design using the same individuals for the two tests is more efficient and also controls the patient-to-patient variation. For example, it may be possible to detect a real difference by examining 50 individuals with test A and another 50 with test B, whereas only 40 patients given both tests might have been sufficient. When comparison of test performance is accomplished statistically by using ROC plots or curves, it is referred to as ROC analysis. (See below for other ways to compare diagnostic tests.) ROC graphs for the two diagnostic tests, either nonparametric or parametric, can differ in shape but still may agree at a single point or have the same areas. Therefore, it is always advisable to visualize the entire performances through the ROC graphs.

Note that with only a single (specificity, sensitivity) pair for each test (i.e., only one point on the ROC plot for each test), a comparison of the performance of the two tests is usually impossible. Only if the two points (one for each test) are on the same vertical or same horizontal line on the nonparametric ROC plot can they be compared (i.e., at a common sensitivity or specificity). In such cases, McNe-
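The trapezoidal (Mann-Whitney) area and the maximal area discussed above can be sketched as follows. This is our own minimal illustration, not code from any of the programs named in this review; the data are invented, and the function name is hypothetical. Ties between a diseased and a nondiseased result contribute 1/2 to the trapezoidal area (a diagonal segment) but count fully toward the maximal area (a vertical-then-horizontal path):

```python
# Sketch: Mann-Whitney form of the trapezoidal area under an empirical
# ROC plot, and the "maximal" area in which each tie-induced diagonal
# is replaced by a vertical-then-horizontal path.

def roc_areas(diseased, nondiseased):
    """Return (trapezoidal_area, maximal_area) for an empirical ROC plot."""
    pairs = len(diseased) * len(nondiseased)
    wins = sum(x > y for x in diseased for y in nondiseased)
    ties = sum(x == y for x in diseased for y in nondiseased)
    trapezoidal = (wins + 0.5 * ties) / pairs   # ties count 1/2 (diagonals)
    maximal = (wins + ties) / pairs             # ties count fully
    return trapezoidal, maximal

# Hypothetical data; binning (rounding to fewer distinct values) introduces
# ties, lowering the trapezoidal estimate while raising the maximal one,
# as the text describes.
continuous_d = [2.1, 3.4, 4.8, 5.9, 7.2]
continuous_n = [1.0, 2.5, 3.3, 4.1, 5.5]
binned_d = [round(x) for x in continuous_d]   # coarse 1-unit bins
binned_n = [round(x) for x in continuous_n]
print(roc_areas(continuous_d, continuous_n))
print(roc_areas(binned_d, binned_n))
```

With no ties the two estimates coincide; as ties accumulate, the gap between them widens, which is why reporting both is advisable when the number of ties is large.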
Choice of decision threshold                 P       NP      —    —   NP^c
Likelihood ratio (LR)                        P       NP      NP   P   NP
CI for LR                                    NP      —       NP
Compare two ROC areas                        P       NP, P   NP
Output file for ROC graph                    Y       Y       Y    Y   Y    Y
Commercial, public domain, or shareware^b    PD      PD      S    PD  C    C    S
Macintosh, PC, or mainframe (MF)             PC, MF  M, PC, MF  PC  PC  M  PC  PC, MF

CI, confidence interval; NP, nonparametric; P, parametric; G, gaussian (normal) distribution; O, other; Y, yes. C(n) denotes continuous input, binned into a maximum of n categories for the analysis; B(n) denotes binned (ordered categorical) input, with up to n bins for parametric analysis.
^b Shareware implies a small fee, but the enterprise is not commercial.
^c m = 1.
metric analysis of likelihood ratios through maximum likelihood methods but rather based on the assumption of normality in the original or log-transformed scale.

Metz programs: LABROC1, CLABROC, ROCFIT, CORROC. Charles E. Metz, Department of Radiology, MC2026, The University of Chicago Medical Center, 5841 S. Maryland Ave., Chicago, IL 60637-1470 [FAX (312) 702-6779, Internet address: c-metz[@]uchicago.edu]. The Metz programs are, for a single diagnostic test, LABROC1 for continuous data and ROCFIT for discrete data, and, for two correlated tests, CLABROC and CORROC, respectively. There are slight differences in Table 2 entries for versions on the different computer platforms. Program requesters are asked to specify platform and to include for microcomputer requests two appropriate floppy disks.

ROC ANALYZER. Robert M. Centor, 10806 Stoneycreek Drive, Richmond, VA 23233 [Bitnet address: Centor[@]VCUVAX on BITNET]. This program is described by Centor and Keightley (70).

ROCLAB. James M. DeLeo, Bldg. 12A, Room 2013, Division of Computer Research and Technology, National Institutes of Health, Bethesda, MD 20892 [Bitnet address: deleo.nihdcrt[@]cu.nih.gov]. ROCLAB provides maximal as well as trapezoidal areas for ties. It has the ability to do ROC plots for fuzzy data as well.

RULEMAKER. Digital Medicine, Inc., Hanover, NH 03755 [FAX (603) 643-3686]. RULEMAKER has an interactive capability that enables one to point to a location on the ROC plot and obtain the exact sensitivity and specificity and the decision level to which it corresponds. New decision rules can be formed by combining results from two or more diagnostic tests, and the corresponding ROC plot can be displayed. A release version is anticipated in the second half of 1993.

SIGNAL. SYSTAT, Inc., 1800 Sherman Ave., Evanston, IL 60201. SIGNAL is a module of a much larger commercial package, SYSTAT.

TEP-UH (Test Evaluation Program-University Hospital). Thomas G. Pellar, Department of Clinical Biochemistry, University Hospital, P.O. Box 5339, 339 Windemere Road, London, Ontario, Canada N6A 5A5. Running TEP-UH requires the parent program MUMPS (Micronetics Design Corp., Rockville, MD).

Note that only three of the programs are designed to treat continuous data directly, without binning (forcing into discrete intervals) the data.

Examples from the Literature

Van Steirteghem et al. (16) compared the accuracies of myoglobin, CK-BB, CK-MB, and total CK in discriminating among persons presenting to an emergency room with typical chest pain, with and without AMI. ROC plots could be constructed for any sampling time by using measurements on multiple closely sequential serum samples timed from the onset of pain. The plots showed clearly the superior accuracy of myoglobin at early times such as 5 or 8 h, as well as the impressive accuracy of CK and its isoenzymes at 18 h after the onset of pain. More recently, Leung et al. (71) performed a similarly detailed evaluation of total CK and CK-2 in 310 patients admitted to a cardiac care unit with chest pain. These authors also used ROC plots to describe the changing accuracy at various time intervals after the onset of pain.

Carson et al. (72) investigated the abilities of four assays of prostatic acid phosphatase to discriminate between those subjects with prostatic cancer and those subjects with either some other urologic abnormality or no known urologic abnormality. They concluded from comparisons of ROC plots and areas under the plots that there is little difference in diagnostic accuracy among the four assays. Because ROC was used, the conclusions were not influenced by choice of upper limit of normal. They cited previous studies that had claimed superior performance of