
ORIGINAL ARTICLE

Meaningful Variation in Performance


A Systematic Literature Review
Vicki Fung, PhD,* Julie A. Schmittdiel, PhD,* Bruce Fireman, MA,* Aabed Meer, BA,*
Sean Thomas, MD,† Nancy Smider, PhD,† John Hsu, MD, MBA, MSCE,*
and Joseph V. Selby, MD, MPH*

Background: Recommendations for directing quality improvement initiatives at particular levels (eg, patients, physicians, provider groups) have been made on the basis of empirical components of variance analyses of performance.

Objective: To review the literature on use of multilevel analyses of variability in quality.

Research Design: Systematic literature review of English-language articles (n = 39) examining variability and reliability of performance measures in Medline using PubMed (1949–November 2008).

Results: Variation was most commonly assessed at facility (eg, hospital, medical center) (n = 19) and physician (n = 18) levels; most articles reported variability as the proportion of total variation attributable to given levels (n = 22). Proportions of variability explained by aggregated levels were generally low (eg, <19% for physicians), and numerous authors concluded that the proportion of variability at a specific level did not justify targeting quality interventions to that level. Few articles based their recommendations on absolute differences among physicians, hospitals, or other levels. Seven of 12 articles that assessed reliability found that reliability was poor at the physician or hospital level due to low proportional variability and small sample sizes per unit, and cautioned that public reporting or incentives based on these measures may be inappropriate.

Conclusions: The proportion of variability at levels higher than patients is often found to be "low." Although low proportional variability may lead to poor measurement reliability, a number of authors further suggested that it also indicates a lack of potential for quality improvement. Few studies provided additional information to help determine whether variation was, nevertheless, clinically meaningful.

Key Words: quality improvement, quality measurement, performance measurement, physician profiling, systematic reviews

(Med Care 2010;48:140–148)

From the *Division of Research, Kaiser Permanente Medical Care Program, Oakland, CA; and †Epic Systems Corporation, Verona, WI.
Supported by the Council of Accountable Physician Practices (CAPP) and Office of Research in Women's Health Building Interdisciplinary Careers in Women's Health K12 Career Development Award (K12HD052163) (to J.A.S.).
Reprints: Vicki Fung, PhD, 2000 Broadway, 3rd Floor, Oakland, CA. E-mail: vicki.fung@kp.org.
Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (www.lww-medicalcare.com).
Copyright © 2010 by Lippincott Williams & Wilkins
ISSN: 0025-7079/10/4802-0140

Performance measurement is an important component of national and local efforts to improve quality of care, reduce unwanted practice variation, and increase accountability at levels ranging from the individual provider to the geographic delivery area. Profiles generated by these measurement efforts are used in a number of ways, including public report cards that rank entities by level of performance or quality, as well as pay-for-performance initiatives that link quality ratings with financial incentives.

Since the work of Wennberg et al examining small area variation, there has been a longstanding assumption that a high degree of variability in performance suggests a potential for quality improvement.1,2 This principle has led numerous authors to examine variability at different "levels" of health care delivery, such as physicians, provider groups, hospitals, and health plans. Components of variance are analyzed, often using hierarchical models, to apportion the total observed variation in performance measures in a patient population to 1 or more aggregated levels. The findings of these analyses have then been used to make inferences about the appropriate level for profiling and intervention efforts.3–12

A closely related issue in performance measurement is reliability, which declines as the proportion of total variability (intraclass correlation coefficient [ICC]) at a given level decreases. Reliability is also a function of the number of available patients per unit at the level. Whether units have sufficient sample size to produce reliable measures depends upon the level of analysis (eg, physician, provider group, health plan); the prevalence of the condition; and the type of quality measure selected. Reliability reflects the consistency of the measure, the extent to which it is reproducible rather than random, and is particularly important when performance rankings are being used for public reporting or for determining incentive payments.3,13 If units cannot be reliably ranked, quality improvement efforts based on such measures may unfairly punish or

140 | www.lww-medicalcare.com Medical Care • Volume 48, Number 2, February 2010



FIGURE 1. Article Selection. This figure presents the article selection process. The total citations retrieved includes duplicate titles that were identified both through initial identification and the systematic Medline search using PubMed. The number of articles included in abstract review and article abstraction represent unique titles. Initially identified articles were those identified nonsystematically before the initiation of the systematic review.

reward individuals or groups; in addition, unreliable measures may mislead patients who use this information in a predictive manner to make health care decisions.

We examined studies that explicitly assessed variability and/or reliability of performance measures across 2 or more nested levels (eg, the individual patient and the provider). We specifically assessed whether authors linked recommendations for reporting or providing incentives for quality at specific levels to the proportion of variability observed at that level and, if so, what kind of criteria were given for an acceptable amount of variability and/or reliability to justify performance reporting or incentives.

METHODS

Data Sources and Search Strategy
Our systematic literature search employed a 2-step strategy: a database search for relevant peer-reviewed articles in Medline and a subsequent hand-search of reference lists from articles identified in the first step. We searched English-language articles in Medline using PubMed from inception (1949) through November 25, 2008 using title and abstract keyword searches and Medical Subject Heading Terms. The PubMed keyword searches relied on primary terms (eg, quality measure*, perform* measure*, and profile or profiling) alone and in combination with additional parameters (eg, varia* [with "*" indicating a wild card], multilevel or multi-level). We also searched on Medical Subject Heading Terms that were commonly used to index articles we initially identified as representing the core theme of our literature review3,14: Quality Assurance, Quality Indicators, Physician's Practice Patterns, and Outcome and Process Assessment (complete list of search terms available in the Appendix, Supplemental Digital Content 1, available online at: http://links.lww.com/MLR/A58).

Selection Process
The selection process involved 3 review stages: title, abstract, and article (Fig. 1). Two researchers independently conducted title and abstract reviews. Articles were selected if they appeared to address issues relating to the statistical modeling of quality measurement or measuring variation or reliability of quality measures at 1 or more aggregated levels (eg, physician, hospital, geographic). Because the related issue of case-mix adjustment has received thorough treatment in the literature,15–17 it was not a primary focus of this review.

Content Abstraction and Synthesis
After excluding articles that did not assess either components of variation or the precision/reliability of health care quality measures when grouped at 1 or more levels above the individual person, we abstracted the following information from the remaining articles selected for inclusion: study design, study time period, setting/population, data source(s), levels of analysis, sample size by level, outcome variables, modeling approach, case-mix adjustors, methods for assessing components of variance, methods for assessing reliability, the proportion or amount of variance found at each level, the degree of reliability reported, and the authors' interpretations of their findings (eg, whether they considered the amount of variability to be "high" or "low").

Reliability Scenarios
To further examine important issues related to reliability we conducted sample size calculations and simulations to




examine the amount of instability or random fluctuation in physician rankings at different reliability levels. To assess the number of patients per physician needed to achieve reliability levels of 0.65, 0.80, and 0.90, given selected ICC levels (between 0.01 and 0.20), we used the Spearman-Brown prophecy. The ICC measures the proportion of total variability accounted for by an aggregate group level. For example, at the physician level, ICC = variance between physicians/total variance, where total variance is the sum of between-physician and within-physician variance. Using the Spearman-Brown prophecy, reliability was calculated as (n × ICC)/(1 + [(n − 1) × ICC]), where n represents the number of patients per unit.18 To determine the 90% confidence intervals around a physician's percentile ranking at different reliability levels, we conducted simulations (each with 20 million iterations) for each reliability level (0.65, 0.80, and 0.90) assuming that physicians' true scores are normally distributed.

RESULTS
We identified a total of 9652 titles in the initial stage of the search process; 360 titles were selected for abstract review. After abstract review, 61 citations were selected for article review. We excluded 22 articles that did not assess components of variation or the precision/reliability of health care quality measures when grouped at 1 or more levels above the individual person; a total of 39 articles, published between 1994 and 2008, were included in the review (Fig. 1).

Attributable Variation
Table 1 presents a summary of findings on the components of variability analyses across the studies included in the review; 37 of the 39 studies presented the amount of proportional or absolute variability at 1 or more aggregated levels, and 30 described their approach as using hierarchical models to analyze components of variance. The majority of articles (32) adjusted for basic demographic characteristics (age and sex), 28 adjusted for clinical characteristics (eg, comorbidity levels, measures of disease severity, self-reported health status), and 18 adjusted for other types of sociodemographic information (eg, race/ethnicity, socioeconomic status) that could create case-mix differences among units within a level (eg, physicians); 15 articles adjusted for all of these 3 types of patient characteristics (See Appendix, Supplemental Digital Content 1, available online at: http://links.lww.com/MLR/A58, for summary of analytic approaches).

Eighteen studies assessed variability at the individual physician level; 8 at the physician group level; 19 at the facility level (which includes medical centers, hospitals, and nursing homes); 5 at the health plan level; and 6 at another level such as the hospital ward, clinical team, provider network, or geographic area. Twenty-five articles assessed variability at a single grouped level, 7 articles examined 2 levels, 3 articles examined 3 levels, and 2 examined 4 levels of care.

Twenty-two studies presented variability estimates in terms of the proportion of total variability explained by a given level, or the ICC (Table 1). The range of physician-level variance across these studies was 0% to 19% of total variance, for provider groups 0% to 10%, for facilities 0% to 51%, and for health plans 0% to 3%. Three studies, conducted by overlapping research groups, report the percentage of "explained" variability, rather than of total variability, at 1 or more aggregated levels of care.35,37,45 In these 3 studies, the authors reported proportional variance at multiple levels (eg, physician, site, plan), but calculated these proportions after removing the large residual variability attributable to individual patients from the denominator, yielding much larger reported proportions. For example, Safran et al reported the proportion of variance explained by "delivery system effects" at the physician, practice, health plan, and network levels, with individuals' physicians accounting for 61.7% to 83.9% of the explained variance in physician-patient interaction scores based on the Consumer Assessment of Health Plans Study survey.35 However, only 5.0% to 22.1% of the total variance in the scores examined was explained by physicians, practices, health plans and networks, and case-mix adjustment.

A small number of studies presented information on performance variation at given levels in absolute terms or directly interpretable units, such as actual observed ranges in performance, interquartile ranges, or standard deviations.8,24,30,35,36,40 Other studies went a step further and presented estimated ranges in performance after case-mix adjustment and/or shrinkage (Bayesian methods are often used to shrink estimates, particularly those based on small sample sizes, toward an expected value, such as the overall mean).9,10,21,28,29,30,36,40,43 For example, Cowen and Strawderman found that physicians explained only 0.9% to 4.2% of the total variance in pharmacy expenditures; in absolute terms, the physicians in 1 sample ranged from $52,066 above to $64,102 below their expected expenditures.10 In an article examining resource utilization for Medicare hemodialysis patients, Turenne et al reported standard deviations in Medicare allowable charges per patient per session at the provider ($8.23) and dialysis center ($18.67) levels.43 Tan et al reported the range (2.1%–22.2%) and interquartile range (4.2%–7.9%) across physicians for false-positive rates of screening mammography.40

Sixteen studies made qualitative comments on the degree of variability found, characterizing the proportional variability attributable to 1 or more levels as "low" (Table 1), and noted that it may be less effective or valuable to focus quality improvement efforts on levels with low proportional variability.3–10,19,21,22,24,31,33,36 These recommendations were based largely on findings of proportions of variation attributable to given levels, yet no article specified thresholds for defining low variability.

Reliability
Twelve articles addressed the issue of the reliability or precision of performance measure ranking (Table 2). Seven studies used the Spearman-Brown prophecy to examine reliability,3,5,7,14,26,35,37 and a number suggested a level of 0.80 as a threshold for acceptable reliability.3,5,7,14 Reliability varied substantially across these studies. For example, Hofer et al found that reliability for physician profiles of diabetes




TABLE 1. Summary of Components of Variability Analyses
Article | Performance Measure | Proportional or Absolute Variability by Level | Considered Variation Low at ≥1 Aggregated Levels

Aiello 2003(19) | Patient satisfaction with nursing care | 1% (hospital care unit); 43% (episode of care) | X
Baker 2004(11) | Receipt of preventive care screenings | Significant variation* (physician and health plan) |
Bjertnaes 2008(20) | Physician satisfaction with community mental health centers | 10%–23% (mental health center) |
Bjorngaard 2007(21) | Patient experience and satisfaction ratings | 0% (clinic); 0% (health trust); 2% (clinical team) | X
Cowen 2002(10) | Pharmacy costs | 0.9%–5.9% (physician) | X
Davis 2002(9) | Clinical activity (eg, issuing prescriptions, ordering lab tests) | 4.1%–11.0% (physician) | X
Degenholtz 2006(22) | Nursing facility residents' self-reported quality of life | 9% (nursing home) | X
D'Errigo 2007(23) | 30-d mortality after CABG | 10.10% (cardiac surgery center) |
Djkstra 2004(24) | Adherence to diabetes care guidelines, physiological outcomes | physician: 5.5%–18.8% (process), 0.4%–19.3% (outcomes); hospital: 0.6%–7.9% (process), 1.4%–4.8% (outcomes) | X
Gifford 2008(25) | Psychiatric inpatient length of stay | 51% (hospital) |
Greenfield 2002(14) | Diabetes care process measures, physiological outcomes (eg, HbA1c), and patient satisfaction | 12%–18% (physician) |
Harman 2004(26) | Psychiatric inpatient length of stay | 5%–7%; 11%–36% (hospital) |
Hawley 2006(27) | Surgical treatment for breast cancer | 5.3%–11.1% |
Hayward 1994(8) | Hospital length of stay and total ancillary resource use | 1%–3% (physician) | X
Hofer 1999(3) | Diabetes care performance measures (hospitalization and clinic visit rates, total laboratory resource utilization rate, glycemic control) | 4%–13% (physician) | X
Huang 2005(7) | Asthma care process measures, patient outcomes (eg, emergency visits), and patient satisfaction | 1%–10% (physician group) | X
Kanton 2000(6) | Quality of care and clinical outcome measures for depressed patients | No significant differences† | X
Krein 2002(5) | Diabetes care process measures, intermediate outcomes, linked process-outcome measures, and resource use | physician: 0%–9%; provider group: 0%–3%; medical center: 1%–18% | X
Normand 2007(28) | Receipt of appropriate medication therapy following AMI | Estimated between-hospital variance: 0.03–0.59 |
Normand 1997(29) | 30-d mortality for elderly AMI patients | (−2.53, −0.92): 2.5 and 97.5 percentiles of log-odds of mortality (hospital) |
O'Brien 2007(30) | Multiple CABG performance measures | IQR (eg, mortality): 1.9%–2.9% |
O'Connor 2008(31) | HbA1c levels for diabetes patients | 0.8%–1.4% (physician); <0.1%–2.7% (clinic) | X
Philips 2007(32) | Activities of daily living for nursing home residents | 8%–14% (nursing home) | X
Philips 2008(33) | Activities of daily living for nursing home residents | 9%–20% (nursing home) |
Sa Carvalho 2003(34) | Survival of dialysis patients | var (0.43) (dialysis centers) |
Safran 2006(35) | Patient satisfaction with primary care physicians and practices | physician: 17.8%–83.9%‡; site: 15.8%–81.1%‡; health plan: 0%–2.5%‡; networks: 0%–15.8%‡ |
Sixma 1998(4) | Patient satisfaction with general practitioners | 5%–10% (physician) | X
Sjetne 2007(36) | Patient satisfaction with hospital care | 0.23%–6.5% (hospital) | X
Solomon 2002(37) | Patient satisfaction from Consumer Assessment of Health Plans Study (CAHPS) | 0%–34%‡; 20%–92%‡; 0%–17%‡; RSO: 0%–45%‡ |
Sullivan 2005(38) | Specialist referrals | 3.60% |
Swinkels 2005(39) | No. physical therapy treatment sessions for low back pain | 4.40%; 7.20% |
Tan 2007(40) | False-positive rates of screening mammography | physician: range 2.1%–22.2%, IQR 4.2%–7.9% |
Thomas 1994(41) | Hospital mortality for 4 disease conditions | Between-hospital variation only significant for CHF |
Tuerk 2008(42) | HbA1c levels for diabetes patients | 2% | X
Turenne 2008(43) | Resource utilization for dialysis patients (Medicare allowable charges per patient per session) | provider: SD $8.23; dialysis facility: SD $18.67 |
Veenstra 2003(44) | Patient satisfaction with provision of information from hospital staff | var (2.72), SE (2.05) (hospital ward level) | X
Zaslavsky 2004(45) | Patient satisfaction from Consumer Assessment of Health Plans Study (CAHPS) | 44%‡; MSA: 18%; State: 28%‡ |

Total no. articles reporting variation at each level: physician, 18; provider group, 8; facility (eg, medical center, hospital), 19; health plan, 5; other, 6.

Results from conditional models reported when both unconditional and conditional model results presented.
*Did not use hierarchical models so could not quantify components of variance. Assessed variation by including health plan and physician dummy variables in the model.
†Used hierarchical logistic models; only report that they did not detect significant differences in physician practice and total variability explained by the model (not by physicians).
‡Authors presented the percentage of explained variation attributable to given level, not percentage of total variation.
Abbreviations: coronary artery bypass graft surgery (CABG); congestive heart failure (CHF); interquartile range (IQR); metropolitan statistical area (MSA); regional service organizations (RSO); standard deviation (SD); standard error (SE); variance (var).
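Most of the proportional figures gathered in Table 1 are intraclass correlation coefficients from components-of-variance analyses. As a rough, self-contained illustration of the quantity being reported (simulated data and a balanced one-way ANOVA estimator; this is not the method of any particular study in the table, most of which fit hierarchical models):

```python
import random
import statistics

def icc_one_way(groups):
    """ICC from a balanced one-way layout (eg, patients nested in physicians):
    the share of total variance attributable to the grouping level."""
    n_groups, k = len(groups), len(groups[0])   # physicians, patients per physician
    grand = statistics.mean(x for g in groups for x in g)
    msb = k * sum((statistics.mean(g) - grand) ** 2 for g in groups) / (n_groups - 1)
    msw = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g) / (n_groups * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# simulate 500 physicians with 50 patients each; the physician level is set to
# explain 5% of total variance (patients contribute the remaining 95%)
random.seed(1)
groups = []
for _ in range(500):
    doc_effect = random.gauss(0, 0.05 ** 0.5)   # between-physician variance = 0.05
    groups.append([doc_effect + random.gauss(0, 0.95 ** 0.5) for _ in range(50)])
print(icc_one_way(groups))                      # close to the true ICC of 0.05
```

With unbalanced panels or binary process measures the studies' hierarchical-model estimates would differ in detail, but the interpretation of the ICC as "between-unit variance over total variance" is the same.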




TABLE 2. Summary of Reliability Assessments
Article | Reliability Assessment Method | Study Population | Sample Size N (Level) | Reliability Findings | Considered Reliability to be Poor

Fuhlbrigge 2008(13) | Simulation | Children with persistent asthma enrolled in managed care organizations | 2761 (patients); 39 (practice groups); 3 (managed care organizations) | At a practice size of ≤50 patients, no measures achieved a reproducibility >85%; only at practice size = 100 did all measures achieve ranking reproducibility >85% | Yes
Greenfield 2002 | Spearman-Brown prophecy | Diabetes patients | 1750 (patients); 29 (physicians) | With 1750 patients, the study had 80% power for detecting a 15% difference between specialties given inflation factors due to physician clustering |
Harman 2004(26) | Spearman-Brown prophecy | Hospital discharges for schizophrenia, depression, and bipolar disorder | 62,090–190,014 (discharges); 241 (hospitals) | Reliability = 0.90 with 11–28 patients with each of the respective conditions; 50%–60% of hospitals met these criteria |
Hayward 1994 | Weighted kappas across 1-month summaries + simulation | Four general medicine services and 4 subspecialty ward services at a large hospital | 13,301 (discharges); 195 (residents + attendings) | Kappas ranged from −0.05 to 0.18 | Yes
Hofer 1999 | Spearman-Brown prophecy | Diabetes patients | 3642 (patients); 232 (physicians) | <40% for all measures | Yes
Hofer 1995(46) | Monte Carlo simulations, sensitivity and positive predictive value | Heart, GI, neurologic and pulmonary DRGs | 603,959 (discharges); 190 (hospitals) | Sensitivity: 9%–14%; PPV: 22%–36%; area under ROC: 0.59–0.66 | Yes
Huang 2005 | Spearman-Brown prophecy | Asthma patients | 2515 (patients); 20 (physician groups) | Outcome measures: 0.60–0.87; process measures: 0.77–0.89; satisfaction: 0.91 |
Krein 2002(5) | Spearman-Brown prophecy | Diabetes patients | 12,110 (patients); 258 (PCPs); 42 (provider groups); 13 (facilities) | To achieve 80% reliability: 200 patients/provider needed with PCP effect = 2%; 50 patients/provider needed with PCP effect = 8%; median panel size = 24 | Yes
Normand 2007(28) | Simulation to determine minimum number of patients needed for each measure to demonstrate that a hospital passed a "threshold" | AMI discharges | 10,385 (patients); 449 (hospitals) | Less than half of CA hospitals in study had minimum sample size to accurately assess hospital rankings regardless of assumed mean performance | Yes
Safran 2006(35) | Spearman-Brown prophecy | Massachusetts ambulatory care experiences survey respondents | 9625 (patients); 215 (physicians); 67 (sites); 6 (networks); 6 (plans) | All measures except 2 had reliability ≥70% for provider estimates with samples of 45 patients/provider |
Solomon 2002 | Reliability index = ratio of the variance to the sum of variance + measurement error | CAHPS respondents | 5584 (patients); 49 (practice sites); 30 (medical groups); 13 (RSOs); 3 (health plans) | Out of 16 composite scores, reliability >70% for 6 scores at group level and 7 scores at site level |
Thomas 1994(41) | Simulation | Hospital discharges for stroke, pneumonia, myocardial infarction, and congestive heart failure | 5888 (discharges); 297 (hospitals) | Sample sizes too small for reliable estimates of between-hospital variation in adjusted mortality rates | Yes

quality of care, including hospitalization and office visit rates, glycemic control and laboratory resource use, were less than 0.40, and estimated that a panel size of 100 diabetes patients per physician was needed to achieve a reliability of 0.80.3 Conversely, Safran et al examined physician-level measures of patient-centered care, which have higher ICCs at the physician level, and concluded that reliability would be 0.70 or higher for reporting most measures with just 45 patients sampled per physician.35

Table 3 illustrates the sample sizes needed to achieve reliabilities of 0.65, 0.80, or 0.90 at varying ICCs (0.01, 0.02, 0.05, 0.10, and 0.20). We chose physician-level ICCs typical of those reported in the studies included in this review (ie, <0.20). As the ICC increases, the number of patients needed per physician to achieve a given reliability level decreases substantially. For example, to achieve 0.80 reliability at an ICC = 0.01, 396 patients are needed per physician, as compared with only 76 patients per physician when the ICC = 0.05.

Two articles provided additional information on the amount of certainty available for ranking or classifying performance associated with a given reliability level.35,45 In 1




example, Safran et al provided the probability of misclassifying physicians into performance tiers at 3 levels of reliability (0.70, 0.80, 0.90) and varying proximity to performance cutpoints.35 Misclassification risk decreased with higher reliability, greater distance of the physicians' score from the closest cutpoint, and fewer cutpoints (eg, 38.0% probability of misclassification at 0.70 reliability and 1 point difference between physician score and cutpoint versus <.001% probability of misclassification at 0.90 reliability and 6 point difference between physician score and cutpoint). Table 4 illustrates the 90% confidence intervals for a physician's percentile ranking at 3 different levels of reliability (0.65, 0.80, and 0.90), based on our own simulations. These tables indicate that even at a reliability of 0.90, the 90% confidence intervals on physicians' percentile ranks are wide (eg, a 90% confidence interval of the 30th to 70th percentile for a physician whose true performance is at the 50th percentile).

TABLE 3. Number of Patients per Physician Needed for Three Reliability Levels by Intraclass Correlation Coefficient
ICC | Reliability | No. Patients per Physician Needed
0.01 | 0.65 | 184
0.01 | 0.80 | 396
0.01 | 0.90 | 891
0.02 | 0.65 | 91
0.02 | 0.80 | 196
0.02 | 0.90 | 441
0.05 | 0.65 | 35
0.05 | 0.80 | 76
0.05 | 0.90 | 171
0.10 | 0.65 | 17
0.10 | 0.80 | 36
0.10 | 0.90 | 81
0.20 | 0.65 | 8
0.20 | 0.80 | 16
0.20 | 0.90 | 36
Calculated using the Spearman-Brown prophecy formula.

TABLE 4. Ninety Percent Confidence Intervals for Selected Ranks of Physicians' Scores for Three Reliability Levels
Physician's Rank on Quality Measure (Percentile) | Reliability of the Measure | 90% Confidence Interval on Physician's Percentile Rank
50 | 0.65 | (17–83)
50 | 0.80 | (23–77)
50 | 0.90 | (30–70)
80 | 0.65 | (39–95)
80 | 0.80 | (51–93)
80 | 0.90 | (61–91)
90 | 0.65 | (53–98)
90 | 0.80 | (66–97)
90 | 0.90 | (76–96)
95 | 0.65 | (64–99)
95 | 0.80 | (77–99)
95 | 0.90 | (85–98)
For a normally distributed measure on a continuous scale.

Relative Value of Targeting Specific Levels of Care
Twelve articles5,11,19,20,24,26,35,37,39,43,45 assessed variability at more than 1 aggregate level of care delivery and 5 made recommendations on the value of targeting one level of care versus another for quality improvement based on these analyses.5,7,35,37,43 Huang et al, Krein et al, and Turenne et al recommended focusing on levels higher than the physician group. Huang et al based their recommendation primarily on concerns about reliability at lower levels, due to insufficient sample size.7 Krein et al also cited concerns about poor reliability at the provider level, and additionally noted that a greater proportion of total variation was found at the facility level.5 Turenne et al recommended placing financial incentives at the level of facilities versus physicians for dialysis prospective payments, based on the larger absolute variances at the facility level. In addition, they considered the mechanisms available to facilities to affect physician behavior in their recommendation.43 Safran et al and Solomon et al, who examined satisfaction measures, generally recommended intervening at lower levels of care (eg, physician or physician site) because these levels generally accounted for a greater share of variability in satisfaction scores.35,37

DISCUSSION
There is considerable debate about the use of performance profiles, such as report cards, particularly at the physician level. Proponents of performance reporting assert that this kind of information, though not perfect, is a key tool to improve quality.47 Others raise concerns about unfairly penalizing physicians based on unreliable profiles that do not correctly distinguish true differences in performance.48–50

The majority of studies in this review raised questions as to whether a sufficient degree of variation exists, particularly at the individual provider level, to justify measuring quality at that level. These conclusions were made almost exclusively on the basis of the proportion of variability explained at a given level. However, some authors also discussed situations in which efforts directed at levels that explain only a small proportion of total variability may still yield value. Hofer et al,3 Hayward et al,8 and Krein et al5 each pointed out that interventions may be called for despite low proportional variability if total variability for a measure is large. In these cases, even a small proportion of variation at a given level may translate into large absolute differences across units. Conversely, Huang et al suggest that if performance across units within a level is uniformly poor, it may be useful to intervene at that level despite very small variation across the units.7 These situations highlight the importance of expressing variation in directly interpretable units (eg, dollars, percentage of patients, or physiological units such as mm Hg), using statistics such as the interquartile range or standard deviations, to help determine if the spread and range in performance is sufficiently large to be clinically meaningful.

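The quantities in Tables 3 and 4 can be reproduced directly. In the sketch below, panel sizes come from inverting the Spearman-Brown formula; the percentile-rank intervals use the same normal-true-score model our simulations assumed, but are computed here in closed form, which yields the same rounded values (the closed-form shortcut is an illustration, not how the 20-million-iteration simulations were actually run):

```python
from math import sqrt
from statistics import NormalDist

def patients_needed(reliability, icc):
    """Panel size from inverting Spearman-Brown: n = R(1 - ICC)/(ICC(1 - R)),
    rounded to the nearest patient as in Table 3."""
    return round(reliability * (1 - icc) / (icc * (1 - reliability)))

def percentile_rank_ci(reliability, true_percentile, level=0.90):
    """CI on the observed percentile rank of a physician whose true score is
    standard normal, with measurement-error variance (1 - R)/R chosen so that
    var(true)/var(observed) equals the stated reliability R."""
    nd = NormalDist()
    z_true = nd.inv_cdf(true_percentile / 100)      # physician's true score
    err_sd = sqrt((1 - reliability) / reliability)  # measurement error SD
    obs_sd = sqrt(1 / reliability)                  # SD of observed scores
    half = nd.inv_cdf((1 + level) / 2) * err_sd     # half-width on the score scale
    lo = 100 * nd.cdf((z_true - half) / obs_sd)
    hi = 100 * nd.cdf((z_true + half) / obs_sd)
    return round(lo), round(hi)

print(patients_needed(0.80, 0.01))   # 396 patients per physician, as in Table 3
print(patients_needed(0.90, 0.05))   # 171
print(percentile_rank_ci(0.90, 50))  # (30, 70), as in Table 4
print(percentile_rank_ci(0.65, 50))  # (17, 83)
```

The closed form makes the tradeoff explicit: even a measure with 0.90 reliability leaves a 90% interval roughly 40 percentile points wide around a median performer, which is the pattern Table 4 illustrates.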



The studies reviewed here suggest that the proportion of variability in most performance measures that is explained by physicians, provider groups, or health plans will almost always be "low" (eg, usually less than 20% of total variation), and that the majority of variability will usually be found at the patient level. Nevertheless, it may sometimes be more efficient and effective to intervene through physicians, particularly if they are more directly accessible than patients. Understanding the underlying mechanisms that influence variability (eg, guideline adherence, incentives/reimbursement, organizational culture, health information technology) at each level is critical for determining optimal approaches for improving quality. For example, O'Connor et al found that over 95% of the variance in HbA1c levels among diabetes patients was attributable to patients, but suggested that physicians may still play a major role in influencing some important patient factors, such as whether patients receive drug treatment intensification.31 Intervening at levels higher than the physician, such as the facility or health plan level, may also allow for a marshalling of resources to support changes in provider behavior.43 Given that medical therapy and clinical quality have improved in concert with system-level efforts and new therapies that would be initiated by physicians, it is difficult to accept that low proportional variability translates to either a lack of provider or system influence or to the ineffective nature of interventions directed at these levels.

Although low proportional variability does not necessarily reflect a low potential for quality improvement, or even low absolute variability, it does mean that the reliability of performance measurement is likely to be poor, especially when the sample size per unit is also small, as it often is for individual physicians. Inadequate sample size may make it impossible to reliably assess performance at the individual physician level, especially when performance is assessed in specific patient subpopulations (eg, those with diabetes or congestive heart failure). Focusing performance assessment at higher levels of care (eg, the provider group level), which by definition involve larger sample sizes, ensures greater reliability, all else equal.7 In addition, physician-level ICCs are likely to be greater for measures more directly under physicians' control, such as patient satisfaction with their care or communication style, than for endpoints further downstream from the point of care, such as health outcomes, which may improve the reliability of these measures.35

Reliability is a critical consideration when designing intervention efforts. If reliability is poor, use of these measures for ranking, reporting, and paying for performance may unfairly reward or penalize individual units and misinform patients. While a general consensus has emerged that a minimal reliability of 0.80 should be sought to ensure fairness when public reporting or payment decisions are involved, acceptability likely varies depending on the uses and consequences of the ranking.13 As demonstrated in the reliability scenarios, even when reliability levels are above 0.80, the confidence intervals around individual units' rankings remain substantial and raise concerns about the precision of these rankings. Nevertheless, demonstration of clinically meaningful variation across physicians or facilities may still have value, even in the face of low measurement reliability. Concerns about reliability may be lessened if rankings are simply used within organizations for quality improvement purposes. Quality initiatives could adopt a less threatening approach to using this information, by providing information and feedback to individual units, but foregoing public reporting or incentives based on these measures.

Limitations
This literature review has limitations to note. We did not examine non-English language articles on variability and reliability. We also did not examine the literature on variability and performance in nonhealth-related fields. Disciplines such as manufacturing have their own literature on methods, such as total quality management, that address similar issues to those discussed here. Finally, a more detailed understanding of variability and reliability will require more empirical studies, particularly of longitudinal patterns in variance over time; future data-driven research should continue to examine these issues.

CONCLUSION
This review identified a tendency in the literature to conclude that quality improvement efforts should not focus on organizational levels with low proportional variability in quality measures. Few studies focused on absolute variation or assessed it in clinically meaningful terms. Although low proportional variability may threaten the reliability of performance measures, particularly at the physician level, it does not indicate the absence of meaningful variation in performance or of potential for quality improvement at that level. Considerations of clinically meaningful variation among physicians or hospitals and of the logistics of effecting change at each level, rather than consideration of proportional variability alone, should guide discussions of when and how to intervene to improve quality.

REFERENCES
1. Wennberg JE, Freeman JL, Shelton RM, et al. Hospital use and mortality among Medicare beneficiaries in Boston and New Haven. N Engl J Med. 1989;321:1168–1173.
2. Fisher ES, Wennberg DE, Stukel TA, et al. The implications of regional variations in Medicare spending. Part 2: health outcomes and satisfaction with care. Ann Intern Med. 2003;138:288–298.
3. Hofer TP, Hayward RA, Greenfield S, et al. The unreliability of individual physician "report cards" for assessing the costs and quality of care of a chronic disease. JAMA. 1999;281:2098–2105.
4. Sixma HJ, Spreeuwenberg PM, van der Pasch MA. Patient satisfaction with the general practitioner: a two-level analysis. Med Care. 1998;36:212–229.
5. Krein SL, Hofer TP, Kerr EA, et al. Whom should we profile? Examining diabetes care practice variation among primary care providers, provider groups, and health care facilities. Health Serv Res. 2002;37:1159–1180.
6. Katon W, Rutter CM, Lin E, et al. Are there detectable differences in quality of care or outcome of depression across primary care providers? Med Care. 2000;38:552–561.
7. Huang IC, Diette GB, Dominici F, et al. Variations of physician group profiling indicators for asthma care. Am J Manag Care. 2005;11:38–44.
8. Hayward RA, Manning WG Jr, McMahon LF Jr, et al. Do attending or resident physician practice styles account for variations in hospital resource use? Med Care. 1994;32:788–794.


9. Davis P, Gribben B, Lay-Yee R, et al. How much variation in clinical activity is there between general practitioners? A multi-level analysis of decision-making in primary care. J Health Serv Res Policy. 2002;7:202–208.
10. Cowen ME, Strawderman RL. Quantifying the physician contribution to managed care pharmacy expenses: a random effects approach. Med Care. 2002;40:650–661.
11. Baker LC, Hopkins D, Dixon R, et al. Do health plans influence quality of care? Int J Qual Health Care. 2004;16:19–30.
12. Young GJ. Can multi-level research help us design pay-for-performance programs? Med Care. 2008;46:109–111.
13. Fuhlbrigge A, Carey VJ, Finkelstein JA, et al. Are performance measures based on automated medical records valid for physician/practice profiling of asthma care? Med Care. 2008;46:620–626.
14. Greenfield S, Kaplan SH, Kahn R, et al. Profiling care provided by different groups of physicians: effects of patient case-mix (bias) and physician-level clustering on quality assessment results. Ann Intern Med. 2002;136:111–121.
15. DeLong ER, Peterson ED, DeLong DM, et al. Comparing risk-adjustment methods for provider profiling. Stat Med. 1997;16:2645–2664.
16. Salem-Schatz S, Moore G, Rucker M, et al. The case for case-mix adjustment in practice profiling. When good apples look bad. JAMA. 1994;272:871–874.
17. Landon B, Iezzoni LI, Ash AS, et al. Judging hospitals by severity-adjusted mortality rates: the case of CABG surgery. Inquiry. 1996;33:155–166.
18. Snijders TAB, Bosker RJ. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage Publications Inc; 1999.
19. Aiello A, Garman A, Morris SB. Patient satisfaction with nursing care: a multilevel analysis. Qual Manag Health Care. 2003;12:187–190.
20. Bjertnaes OA, Garratt A, Ruud T. Family physicians' experiences with community mental health centers: a multilevel analysis. Psychiatr Serv. 2008;59:864–870.
21. Bjorngaard JH, Ruud T, Garratt A, et al. Patients' experiences and clinicians' ratings of the quality of outpatient teams in psychiatric care units in Norway. Psychiatr Serv. 2007;58:1102–1107.
22. Degenholtz HB, Kane RA, Kane RL, et al. Predicting nursing facility residents' quality of life using external indicators. Health Serv Res. 2006;41:335–356.
23. D'Errigo P, Tosti ME, Fusco D, et al. Use of hierarchical models to evaluate performance of cardiac surgery centers in the Italian CABG outcome study. BMC Med Res Methodol. 2007;7:29.
24. Dijkstra RF, Braspenning JC, Huijsmans Z, et al. Patients and nurses determine variation in adherence to guidelines at Dutch hospitals more than internists or settings. Diabet Med. 2004;21:586–591.
25. Gifford E, Foster EM. Provider-level effects on psychiatric inpatient length of stay for youth with mental health and substance abuse disorders. Med Care. 2008;46:240–246.
26. Harman JS, Cuffel BJ, Kelleher KJ. Profiling hospitals for length of stay for treatment of psychiatric disorders. J Behav Health Serv Res. 2004;31:66–74.
27. Hawley ST, Hofer TP, Janz NK, et al. Correlates of between-surgeon variation in breast cancer treatments. Med Care. 2006;44:609–616.
28. Normand SL, Wolf RE, Ayanian JZ, et al. Assessing the accuracy of hospital clinical performance measures. Med Decis Making. 2007;27:9–20.
29. Normand SL, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care: issues and applications. J Am Stat Assoc. 1997;92:803–814.
30. O'Brien SM, Shahian DM, DeLong ER, et al. Quality measurement in adult cardiac surgery: part 2: statistical considerations in composite measure scoring and provider rating. Ann Thorac Surg. 2007;83(suppl 4):S13–S26.
31. O'Connor PJ, Rush WA, Davidson G, et al. Variation in quality of diabetes care at the levels of patient, physician, and clinic. Prev Chronic Dis. 2008;5:A15.
32. Phillips CD, Shen R, Chen M, et al. Evaluating nursing home performance indicators: an illustration exploring the impact of facilities on ADL change. Gerontologist. 2007;47:683–689.
33. Phillips CD, Chen M, Sherman M. To what degree does provider performance affect a quality indicator? The case of nursing homes and ADL change. Gerontologist. 2008;48:330–337.
34. Sa Carvalho M, Henderson R, Shimakura S, et al. Survival of hemodialysis patients: modeling differences in risk of dialysis centers. Int J Qual Health Care. 2003;15:189–196.
35. Safran DG, Karp M, Coltin K, et al. Measuring patients' experiences with individual primary care physicians. Results of a statewide demonstration project. J Gen Intern Med. 2006;21:13–21.
36. Sjetne IS, Veenstra M, Stavem K. The effect of hospital size and teaching status on patient experiences with hospital care: a multilevel analysis. Med Care. 2007;45:252–258.
37. Solomon LS, Zaslavsky AM, Landon BE, et al. Variation in patient-reported quality among health care organizations. Health Care Financ Rev. 2002;23:85–100.
38. Sullivan CO, Omar RZ, Ambler G, et al. Case-mix and variation in specialist referrals in general practice. Br J Gen Pract. 2005;55:529–533.
39. Swinkels IC, Wimmers RH, Groenewegen PP, et al. What factors explain the number of physical therapy treatment sessions in patients referred with low back pain; a multilevel analysis. BMC Health Serv Res. 2005;5:74.
40. Tan A, Freeman JL, Freeman DH Jr. Evaluating health care performance: strengths and limitations of multilevel analysis. Biom J. 2007;49:707–718.
41. Thomas N, Longford NT, Rolph JE. Empirical Bayes methods for estimating hospital-specific mortality rates. Stat Med. 1994;13:889–903.
42. Tuerk PW, Mueller M, Egede LE. Estimating physician effects on glycemic control in the treatment of diabetes: methods, effects sizes, and implications for treatment policy. Diabetes Care. 2008;31:869–873.
43. Turenne MN, Hirth RA, Pan Q, et al. Using knowledge of multiple levels of variation in care to target performance incentives to providers. Med Care. 2008;46:120–126.
44. Veenstra M, Hofoss D. Patient experiences with information in a hospital setting: a multilevel approach. Med Care. 2003;41:490–499.
45. Zaslavsky AM, Zaborski LB, Cleary PD. Plan, geographical, and temporal variation of consumer assessments of ambulatory health care. Health Serv Res. 2004;39:1467–1485.
46. Hofer TP, Hayward RA. Can early re-admission rates accurately detect poor-quality hospitals? Med Care. 1995;33:234–245.
47. Epstein A. Performance reports on quality: prototypes, problems, and prospects. N Engl J Med. 1995;333:57–61.
48. Landon BE, Normand SL, Blumenthal D, et al. Physician clinical performance assessment: prospects and barriers. JAMA. 2003;290:1183–1189.
49. Kassirer JP. The use and abuse of practice profiles. N Engl J Med. 1994;330:634–636.
50. Bindman AB. Can physician profiles be trusted? JAMA. 1999;281:2142–2143.
