
Module 3 – Measurement

Learning Objectives:
- Identify different types of outcome measures and understand their limitations
- Define different measurement properties
- Describe methods to evaluate different types of validity, reliability, sensitivity to change
and responsiveness
- Design a study to evaluate the measurement properties of an outcome measure

Measuring Health

What is health?
- According to the WHO, it is a state of complete physical, mental, and social well-being,
and not merely the absence of disease or infirmity
- It is a multi-faceted concept influenced by a person’s experiences, beliefs, expectations,
and perceptions
- It means different things to different people

ICF Model of Health


- ICF = International Classification of Functioning, Disability, and Health
- Model meant to standardize communication about health
- Health outcomes are classified according to their effect on body function and structure
(impairment; includes mental health items), limitations in activities (disability), and
restrictions in participation (handicap)
- Modifiers of these outcomes are: age, coping strategies, social attitudes, education,
experience

Measuring Health in Research


- Using the ICF as a guide, select several outcome measures that each speak to the specific
aspect(s) of health affected by the intervention
- Measure QoL with a questionnaire whose items cover the specific aspects of health that
patients have deemed important and relevant to their disease
o Not all questions are necessarily valued equally; some are more important than others
- Good outcome measures for a study have good measurement properties AND are well
known/commonly used (for ease of interpretation by others)
- Issue: too many independent outcomes inflate the risk of a false-positive finding through
multiple comparisons

Types of Outcome Measures

Predictive Outcome Measure


- An instrument/device/method that predicts a future outcome
o E.g. the MCAT predicts who is likely to perform well on the licensing exam
o E.g. following an acute injury, predicts who is likely to become chronic
- Design a predictive instrument using a prognosis design
- Evaluate predictive validity using a diagnosis design

Discriminative Outcome Measure


- An instrument/device/method that sorts individuals into groups
o E.g. x-ray (fracture present or absent)
- Evaluate validity using a diagnosis design

Evaluative Outcome Measure
- An instrument/device/method that provides data on the quantity/quality of the result of
the experiment
- It is the basis for measuring the effects of the independent variable, i.e. change in the
dependent variable
o E.g. pain measured pre- and post-intervention
- Evaluate using longitudinal construct validity and sensitivity to change

Types of Evaluative Measures


- Surrogate Outcomes
- Patient Important Outcomes

Surrogate Outcomes
- Outcome measures that are not of direct practical importance to patients but are believed
to reflect outcomes that are important
- Validity depends on the magnitude of the association b/w the surrogate and the patient-
important outcome (i.e. its predictive validity)
o E.g. reduction in cholesterol as a surrogate for reduction in mortality
o E.g. increased bone density as a surrogate for reduction in fracture incidence
- We use these outcomes b/c of their efficiency; changes can be measured in all patients
over a shorter time interval

Patient Important Outcomes


- Outcome measures that are of direct importance to patients
o E.g. death/survival, success/failure, patient-reported QoL
- Advantage: validity
- Disadvantage: long time interval needed to measure

Measurement Properties

Validity and Reliability


- Validity (accuracy) is a measure of how close a measurement comes to the true score for
a variable
- Reliability (precision) is a measure of the extent to which repeated measurements come
up with the same value
- All outcome measures need to demonstrate validity and reliability
- Exception: evaluative measures only need to show responsiveness

Validity vs. Reliability


- Improve the precision of an estimate by increasing the number of measurements taken
(the mean of repeated measurements converges toward the true score; see the sketch below)
o This reduces the level of random error and narrows the CI about the value being estimated
- Increasing precision when the experiment contains systematic error is not the solution
o Solution: calibration of the instrument
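
A minimal Python sketch (the true score, bias, and error SD are invented) of why taking
more measurements narrows the CI but cannot remove a systematic error:

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = 50.0
    bias = 3.0         # systematic error from a miscalibrated instrument
    sd_random = 4.0    # SD of random measurement error

    for n in (1, 4, 16, 64):
        # average of n repeated measurements, simulated 10,000 times
        means = rng.normal(true_score + bias, sd_random, size=(10000, n)).mean(axis=1)
        half_width = 1.96 * sd_random / np.sqrt(n)  # 95% CI half-width shrinks with sqrt(n)
        print(f"n={n:3d}  mean={means.mean():.2f}  CI half-width={half_width:.2f}")

    # The CI narrows as n grows, but the estimate stays near 53, not 50:
    # averaging never removes the +3 systematic error; only calibration does.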

Validity: the extent to which an instrument measures what it is intended to measure

Types:
- Face: the extent to which a measurement instrument appears to measure what it is
intended to measure.
- Content: the extent to which a measurement instrument represents all facets of a given
social construct.
- Criterion: examines the extent to which a measure provides results that are consistent
with a gold standard.
o Predictive: compares the measure in question with an outcome assessed at a later
time.
o Concurrent: comparison between the measure in question and an outcome assessed at
the same time.
- Construct: forming theories about the attribute of interest and then assessing the extent to
which the measure under investigation provides results that are consistent with the
theories.
o Convergent: tests the degree to which two measures of constructs that theoretically
should be related are in fact related
o Divergent: tests whether concepts or measurements that are supposed to be unrelated
are, in fact, unrelated

Study Designs: Validity


- In a known-groups design, one group has the disease and the other does not; a valid
measure should be able to distinguish b/w them

Reliability: the extent to which an instrument yields the same results in repeated
administrations in a stable population

Study Designs: Reliability


- All require the disease to be in a stable state; measurements are repeated at least twice
- Test re-test: assumes the rater and disease are consistent and evaluates the
reproducibility of the test (the patient has to perform the test)
- Inter-rater: the extent to which 2 or more raters are able to consistently differentiate
subjects with higher and lower values on an underlying trait
o Assumes the test and disease are consistent and evaluates the reproducibility b/w
different raters (a rater has to perform the test or observe the client)
- Intra-rater: the extent to which a rater is able to consistently differentiate participants
with higher and lower values of an underlying trait on repeated ratings over time
o Assumes the test and disease are consistent and evaluates the reproducibility of one
rater over time (a rater has to perform the test or observe the client)

Statistics to Communicate Reliability

Relative Reliability:
- Reliability = measuring agreement, NOT association
- Cannot use Pearson/Spearman to demonstrate reliability b/c they are measures of
association and do not consider systematic differences b/w measures
- However, both Intra-class Correlation Coefficient (ICC) and Kappa do consider this
(measures of agreement)
- Ideal value = 1
- Measures that are highly associated but systematically different will have a
correlation coefficient that is larger than the agreement statistic
- Measures that are highly associated without a systematic difference will have similar
values for the correlation coefficient and agreement statistic

ICC:
- A measure of reproducibility that compares the variance b/w patients to the total
variance, as illustrated below:
- ICC = (b/w-patient variance) / (b/w-patient variance + within-patient variance)
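
A small Python illustration of the contrast described above, using invented ratings: two
raters whose scores are highly associated but systematically offset by 5 points yield a
high Pearson r but a lower ICC. The form used is the two-way absolute-agreement ICC(2,1)
of Shrout & Fleiss:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 30
    true = rng.normal(60, 10, n)
    rater1 = true + rng.normal(0, 2, n)
    rater2 = true + 5 + rng.normal(0, 2, n)  # same ordering, systematically higher

    pearson = np.corrcoef(rater1, rater2)[0, 1]

    # ICC(2,1) from two-way ANOVA mean squares
    x = np.column_stack([rater1, rater2])  # n subjects x k raters
    k = x.shape[1]
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # b/w subjects
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # b/w raters
    ss_err = np.sum((x - x.mean(axis=1, keepdims=True)
                       - x.mean(axis=0, keepdims=True) + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))
    icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

    print(f"Pearson r = {pearson:.2f}")  # high: strong association
    print(f"ICC(2,1)  = {icc:.2f}")      # lower: the 5-point offset counts as disagreement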

Kappa
- Is a measure of the extent to which observers achieve agreement beyond the level
expected to occur by chance alone.
- Used for categorical (e.g. binary) outcome variables; Kappa = 0 means chance-level
agreement and Kappa = 1 means perfect agreement
- The more discordant the raters are, the lower the value of Kappa
- The weighted Kappa is for ordered categories (see the sketch below)
o Unweighted Kappa counts any discordant rating fully, whereas weighting gives partial
credit for near-agreement on the ordered scale
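
A minimal sketch of unweighted Cohen's Kappa for two raters and a binary outcome (the
ratings below are hypothetical):

    import numpy as np

    def cohens_kappa(r1, r2):
        # agreement beyond the level expected to occur by chance alone
        r1, r2 = np.asarray(r1), np.asarray(r2)
        p_obs = np.mean(r1 == r2)  # observed proportion of agreement
        # chance agreement: product of the raters' marginal proportions per category
        p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in np.union1d(r1, r2))
        return (p_obs - p_chance) / (1 - p_chance)

    # 1 = disease present, 0 = absent
    r1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    r2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
    print(f"kappa = {cohens_kappa(r1, r2):.2f}")  # 0.60: observed 0.8 vs chance 0.5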
Absolute Reliability: Precision – Individual Score
- Standard Error of Measurement (SEM) is a statistic for absolute reliability and is
calculated from test-re-test reliability study design
- SEM allows us to determine how certain we can be about a particular individual’s score
at a particular point in time.
- SEM = √(within-client variance); see the sketch below
- Ideally 0
- Clinician can be x % confident (x defined by confidence level chosen) that the true score
lies within the reported interval
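
A short Python sketch with invented test-retest scores: the SEM is the square root of the
within-client variance (with two administrations, each client's variance is half the
squared difference), and a CI around one observed score follows directly:

    import numpy as np

    # test-retest scores for 10 stable clients (hypothetical)
    t1 = np.array([42, 55, 61, 48, 70, 52, 66, 58, 45, 63], float)
    t2 = np.array([44, 53, 62, 50, 68, 55, 64, 59, 47, 61], float)

    within_var = np.mean((t1 - t2) ** 2 / 2)  # within-client variance, k = 2
    sem = np.sqrt(within_var)

    score, z = 58.0, 1.96  # one individual's observed score; z for 95% confidence
    print(f"SEM = {sem:.2f}")
    print(f"95% CI for the true score: ({score - z * sem:.1f}, {score + z * sem:.1f})")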

Absolute Reliability: Real Change or Error?


- We can use the SEM to determine if there has been a real change in score over time
- We can be x % confident that a true change has occurred, as opposed to one possibly due
to random error within the measurement, if the change exceeds the reported interval,
known as the Minimal Detectable Change/Difference (MDC/D)
- MDC(X) = SEM × z(X) × √2 (the √2 reflects that both the pre and post scores carry
measurement error)
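
As a worked example with an assumed SEM of 2 points: MDC(95) = 2 × 1.96 × √2 ≈ 5.5, so an
individual's score must change by more than about 5.5 points before we can be 95%
confident a real change has occurred rather than random measurement error.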

Sensitivity to Change
- The ability to detect change, whether or not that change is meaningful
- Many statistics exist for expressing this
- The Standardized Response Mean (SRM) is the most common (see the sketch below)
- Study Design: in a population expected to change, administer the new test pre- and post-
change
- SRM = (mean change) / (SD change)
- If SRM > 1, ‘signal/change’ could be detected over the ‘noise/variability’
- Signal = change that occurred from pre- to post-treatment
- Noise = all systematic and random errors
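
A minimal Python sketch of the SRM, using invented pre/post scores in a group expected to
improve:

    import numpy as np

    pre  = np.array([30, 42, 35, 28, 50, 39, 33, 45], float)
    post = np.array([38, 49, 40, 37, 58, 44, 42, 50], float)

    change = post - pre
    srm = change.mean() / change.std(ddof=1)  # signal (mean change) over noise (SD of change)
    print(f"SRM = {srm:.2f}")  # > 1: change is detectable over the variability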

Responsiveness

Responsiveness: the instrument's ability to detect a clinically meaningful change


- Statistic: Minimal Clinically Important Difference (MCID)
- Sensitivity to change is a necessary but not sufficient condition for responsiveness
- NOTE: using the wrong MCID has important implications for sample size
o Within-group: the change within a single treatment group, where every patient changes
from pre- to post-treatment
o B/w-group: the difference we want to detect in a study evaluating two different
treatments; the more similar the treatments, the smaller the expected difference b/w the
groups
o The b/w-group MCID is approx. 20% of the within-group MCID (see the sketch below)
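
A rough Python sketch of the sample-size implication, using the standard two-group formula
for a continuous outcome (alpha = 0.05 two-sided, 80% power) with made-up values for the
SD and within-group MCID; because the b/w-group MCID is ~5x smaller, the required n per
group is ~25x larger:

    import math

    def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.84):
        # n per group to detect a b/w-group difference delta, outcome SD sd
        return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

    sd = 10.0
    within_mcid = 5.0                  # hypothetical within-group MCID
    between_mcid = 0.2 * within_mcid   # ~20% of the within-group value

    print(n_per_group(within_mcid, sd))   # 63 per group: far too optimistic
    print(n_per_group(between_mcid, sd))  # 1568 per group: the realistic target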

Anchor-Based Approach
- A way to establish the interpretability of measures of patient-reported outcomes
- All patients are measured at Time 1 and Time 2
- B/w these times, deliver an intervention that usually produces some improvement
- At Time 2, the Anchor is included = a Global Rating of Change (GRC) questionnaire
o The patient indicates how much better/worse they feel compared to Time 1
o Calculate the average change score of all patients who indicated a small but important
change on the GRC (score of 2 or 3); this represents the within-group MCID for that
instrument (see the sketch below)
- If the magnitude of change in the 'better' and 'worse' groups differs, then averaging
scores across them is not valid
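
A minimal Python sketch of the anchor-based calculation, with invented change scores and
GRC ratings (GRC on a -7 to +7 scale):

    import numpy as np

    change = np.array([2, 8, 5, 12, 1, 6, 7, -1, 9, 4], float)  # instrument change scores
    grc    = np.array([0, 3, 2,  5, 0, 3, 2, -1, 4, 2])         # Global Rating of Change

    # within-group MCID = mean change among patients reporting a small
    # but important improvement (GRC of 2 or 3)
    mcid = change[(grc == 2) | (grc == 3)].mean()
    print(f"within-group MCID = {mcid:.1f}")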

Distribution-Based Approach

- Approach 1
o Measure outcome at two time points in individuals not expected to change
o Calculate change scores for every participant and plot them in distribution
o Choose threshold (MCID) for classifying an individual as not having changed by an
important amount

- Approach 2
o Measure outcome at two time points in individuals expected to change by an important
amount
o Calculate change scores for every participant and plot them in distribution
o Choose threshold (MCID) for classifying an individual as having changed by an
important amount

- The score at the cut-off is the within-group MCID for that instrument
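
A small Python sketch of both approaches on simulated change scores; taking the 95th
percentile of the stable group's distribution is one defensible cut-off (the distributions
and threshold are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(2)

    # Approach 1: individuals not expected to change (error only)
    stable_change = rng.normal(0, 3, 200)
    # Approach 2: individuals expected to change by an important amount
    important_change = rng.normal(10, 3, 200)

    # cut-off: changes beyond the 95th percentile of the stable group
    # are unlikely to be measurement error alone
    cutoff = np.percentile(stable_change, 95)
    print(f"within-group MCID (cut-off) = {cutoff:.1f}")
    print(f"% of truly changed patients exceeding it = "
          f"{100 * np.mean(important_change > cutoff):.0f}%")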

