
Introduction to measurement and scale development Part 5: Validity

Daniel Stahl

Department of Biostatistics & Computing

Last week: Reliability II

• Generalizability theory

– explicitly recognizes that there are multiple sources of error and that measures may have different reliabilities in different situations

(e.g. different hospitals, gender, different countries)

– allows us to determine the sources of error using variance components and multilevel modelling techniques

– is an extension of the reliability model from two to more random and fixed factors

• Reliability and standard error/Confidence intervals

• Calculating the necessary number of items for a good reliability

• Sample sizes for test-retest reliability studies
• Reliability for categorical variables: Kappa

• From item scores to scale scores (total score, subscores)

Last week: From items to total score

• If our final item pool passed
– requirements for internal consistency,
– factor analysis for checking dimensionality,
– and other reliability checks,

we need to combine the items to a total score (or several subscores).

• We need a rule to combine the information of all items into a total score (or a total score for each domain)

• Usually you simply add up the scores of each domain to obtain a subtotal score for that domain

• Adding up usually/often works well

Question from last week:

Categorizing a continuous scale

• If we accept the fact that doctors like to ruin a nice continuous outcome measure by turning it into a dichotomy (depressed vs. not depressed), we need to find a way to choose the best cut-off score.

• This involves trade-offs between sensitivity and specificity
• (or between true positives and false positives)

• http://www.childrensmercy.org/stats/ask/roc.asp

Example

• A new simple depression test was developed for use in psychiatric clinics to screen patients for depression.

• How can we determine a cut-off score to “correctly” classify a patient as depressive?

• We could apply the new test to patients and ask psychiatrists to assess the same patients.

• If our test should classify patients similarly to how psychiatrists would, we can use e.g. ROC curves to determine the optimal cut-off score.

• The ROC curve allows us to evaluate the performance of tests that categorize cases into one of two groups.

Receiver-operating characteristic curve

• An ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for every possible cut-off.

• The plot shows the false positive rate on the X axis and one minus the false negative rate (= sensitivity) on the Y axis.

• A good diagnostic test is one that has small false positive and false negative rates.

• SPSS: Analyze → ROC Curve

Depression score versus doctor’s classification

[Figure: histogram of depression scores (0 to 20) by actual state according to doctor (depressive vs. not depressive); counts up to 200.]

ROC curve

[Figure: ROC curve for the depression score; sensitivity (true positives) on the Y axis against 1 - specificity (false positives) on the X axis. Diagonal segments are produced by ties.]

Area under the curve: 0.97

Area Under the Curve: Test Result Variable(s): Depression score

Area    Std. Error    Asymptotic Sig.    Asymptotic 95% Confidence Interval
.970    .003          .000               .964 to .977

The test result variable Depression score has at least one tie between the positive actual state group and the negative actual state group.

Area under the ROC curve

A good test will result in an ROC curve that rises to the upper left-hand corner very quickly. Therefore, the area under the curve is a measure of the quality of the diagnostic test.

• The larger the area, the better the diagnostic test:

– Area is 1.0 = 100% sensitivity and 100% specificity.

– Area is 0.5 = 50% sensitivity and 50% specificity = no better than flipping a coin.

– In practice, a diagnostic test is somewhere between these two extremes.

What's a good value for the area under the curve?
• 0.50 to 0.75 = fair
• 0.75 to 0.92 = good
• 0.92 to 0.97 = very good
• 0.97 to 1.00 = excellent
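As an illustration, the curve and the AUC can be computed directly from raw data. A minimal sketch in Python with scikit-learn, assuming a pandas DataFrame `df` with a 0-20 "score" column and a binary "depressed" column coded from the doctor's diagnosis; these names are illustrative, not the lecture's data set:

```python
# Sketch: ROC curve and AUC for a screening score.
# Assumes a pandas DataFrame `df` with columns "score" (0-20) and
# "depressed" (1 = depressive, 0 = not depressive) - illustrative names.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(df["depressed"], df["score"])
auc = roc_auc_score(df["depressed"], df["score"])
print(f"Area under the curve: {auc:.3f}")   # the slide's data gave .970

plt.plot(fpr, tpr)                # ROC curve
plt.plot([0, 1], [0, 1], "--")    # chance line (AUC = 0.5)
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.show()
```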

Cut-off point

Choice of cut-off point depends on:
• Importance of correct classification
• Cost of misclassification

• Prevalence (the lower the prevalence, the higher the proportion of false positives among the positive results)

Cut-off point

[Figure: detail of the ROC curve near the candidate cut-off point; sensitivity (true positives) against 1 - specificity (false positives). Diagonal segments are produced by ties.]
 

Cut-off point = 10.5

• 10.5 seems to be a good classification cut-off point: a score > 10 = depressive, a score ≤ 10 = not depressive.
• 91.8% of depressive patients are classified as depressive; 9.2% of not depressed patients are “misclassified” as depressive.
• But the original score still carries more information (most of the wrongly classified patients score around 9 and 10).
• Dichotomizing data causes up to 66% power loss in statistical analysis!

Coordinates of the Curve: Test Result Variable(s): Depression score

Positive if Greater Than or Equal To   Sensitivity   1 - Specificity
  .00                                  1.000         1.000
 1.50                                  1.000          .898
 2.50                                   .999          .813
 3.50                                   .997          .683
 4.50                                   .994          .596
 5.50                                   .993          .529
 6.50                                   .985          .372
 7.50                                   .979          .306
 8.50                                   .969          .240
 9.50                                   .934          .134
10.50                                   .918          .092
11.50                                   .865          .034
12.50                                   .758          .015
13.50                                   .700          .013
14.50                                   .573          .005
15.50                                   .513          .002
16.50                                   .379          .001
17.50                                   .321          .001
18.50                                   .188          .000
19.50                                   .086          .000
21.00                                   .000          .000

The test result variable Depression score has at least one tie between the positive actual state group and the negative actual state group.
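One common (though not the only) rule for picking a cut-off from such a coordinates table is Youden's J = sensitivity + specificity - 1. A minimal sketch using the values above; note that J alone is not decisive, since cost of misclassification and prevalence also matter:

```python
# Sketch: scoring each candidate cut-off from the coordinates table
# with Youden's J = sensitivity - (1 - specificity).
import numpy as np

cutoffs = np.array([0.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5,
                    10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
                    18.5, 19.5, 21.0])
sens = np.array([1.000, 1.000, .999, .997, .994, .993, .985, .979,
                 .969, .934, .918, .865, .758, .700, .573, .513,
                 .379, .321, .188, .086, .000])
fpr = np.array([1.000, .898, .813, .683, .596, .529, .372, .306,
                .240, .134, .092, .034, .015, .013, .005, .002,
                .001, .001, .000, .000, .000])

j = sens - fpr                  # Youden's J for each cut-off
print(cutoffs[np.argmax(j)])    # 11.5 here (J = .831); 10.5 is a close
                                # runner-up (J = .826) and may be preferred
                                # for its higher sensitivity
```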

Check if classification is the same between males and females

[Figure: histograms of depression scores (1 to 20) by actual state according to doctor (depressive vs. not depressive), split by gender (male, female); counts up to 120.]

ROC Curve

[Figure: ROC curves for the depression score, plotted separately by gender (male and female panels); sensitivity against 1 - specificity. Diagonal segments are produced by ties.]

• There seem to be no gender differences in cut-off point.
• Use logistic regression to see the influence of age, gender, ethnicity… on classification:
• Depressive yes/no ~ test score + sex + age + ethnicity (a sketch follows below).
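A hedged sketch of such a logistic regression in Python with statsmodels; the DataFrame `df` and its column names are assumptions for illustration, not the lecture's data:

```python
# Sketch: does classification depend on covariates once the test score
# is taken into account? Assumes a pandas DataFrame `df` with columns
# "depressed" (0/1 doctor's diagnosis), "score", "sex", "age",
# "ethnicity" - illustrative names.
import statsmodels.formula.api as smf

model = smf.logit("depressed ~ score + C(sex) + age + C(ethnicity)",
                  data=df).fit()
print(model.summary())  # significant covariates would suggest the score
                        # classifies differently across those groups
```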

Literature: ROC curves

• Richard M. McFall, Teresa A. Treat (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50: 215–241.

• W. C. Lee (1999). Selecting diagnostic tests by ruling out or ruling in disease: the use of the Kullback-Leibler distance. International Journal of Epidemiology, 28: 521–525. (more than 2 classifications)

• J. J. Strik, A. Honig, R. Lousberg, J. Denollet (2001). Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. 42(5): 423–428.

Validity

Now we have a scale with very good reliability.

But what about validity? Is our scale measuring what it is supposed to measure?

Next step: Validity. Is our scale not only reliable but also valid?

Validity

• A valid measure is one which is measuring what it is supposed to measure.

• Validity refers to getting results that accurately reflect the concept being measured:

• Are we drawing valid conclusions from our measures: does a high score on our IQ scale really mean that the person is intelligent?

Validity is the degree of confidence we can place on inferences about people based on their test scores.

• Validity implies reliability: reliability sets the upper limit on the validity of a scale!

Measurement Validity Types

3 general types of validity with several subcategories:

• Expert validity
• Criterion validity
• Construct validity

Validity Types

3 general types of validity with several subcategories:

• Expert validity: assessment that the items of a test are drawn from the domains being measured (takes place after the initial form of the scale has been developed)

• Criterion validity: correlate measures with a criterion measure known to be valid, e.g. an established test (= other scales of the same or similar measure are available)

• Construct validity: examine whether a measure is related to other variables as required by theory, e.g. a depression score should change in response to a stressful life event (= other scales of the same or similar measure are not available)

Expert validity

Expert validity: do the experts agree?

– Face validity:
• subjective assessment that the instrument/items appear to assess the desired qualities
• Does the operationalization look like a good translation of the construct?
• Assessment by colleagues, friends, target subjects, clinicians…
• Weakest way to demonstrate construct validity

– Content validity:
• closely related to face validity, but a more rigorous assessment, done by an expert panel
• It is concerned with sample-population representativeness:
• subjective assessment that the instrument samples all the important contents/domains of the attribute

Criterion and Construct validity

• Criterion validity: Is the measure consistent with what we already know and what we expect?

– Concurrent validity: correlate measurements of a new scale with a “gold standard” criterion, both of which are given at the same time

– Predictive validity: correlate with a criterion which is not yet available

• Construct validity:

– Sensitivity to change, responsiveness

Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event.

– Discriminant validity: the measure does not associate with constructs that shouldn’t be related.

Criterion and construct validity – another view

• Validity is mainly about generating hypotheses and designing experiments to test them.

• It is a process of hypothesis testing, and the important question is not which kind of validity we apply but: “Does our hypothesis make sense in light of what the scale is designed to measure?”

• However, the different categories of validity may help to generate and design hypotheses.

Criterion validity:

• Criterion validity: Is the measure consistent with what we already know and what we expect?

– Concurrent validity: correlate measurements of a new scale with a “gold standard” criterion, both of which are given at the same time,

e.g. new depression test and Beck Depression inventory

= allows for immediate results

– Predictive validity: correlate with a criterion which is not yet available,

e.g. measuring aggressiveness and comparing it with reported aggressive acts in the following year;

intelligence and final exam results

Criterion Validity

Steps for conducting a criterion-related validation study:

• Identify a suitable criterion and a method for measuring it
(e.g. a new IQ test with the aim of predicting school performance: test scores should be correlated with exam results).

• Identify an appropriate sample (students).

• Correlate test scores and the criterion measure (exam results).

• The degree of relationship is the validity coefficient.

• If we estimate a Pearson correlation coefficient r (or ICC), r is our validation coefficient.

• The interpretation is similar to reliability: the closer r to 1, the better the validity.
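A minimal sketch of the validation coefficient as a Pearson correlation in Python with scipy; the IQ scores and exam marks below are invented for illustration:

```python
# Sketch: criterion validity as the correlation between new-test scores
# and the criterion measure. The numbers are invented for illustration.
from scipy.stats import pearsonr

iq_scores = [98, 112, 105, 91, 120, 108, 95, 115]
exam_marks = [55, 71, 64, 48, 80, 66, 52, 75]

r, p = pearsonr(iq_scores, exam_marks)
print(f"validity coefficient r = {r:.2f} (p = {p:.3f})")
```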

Concurrent validity:

Number of storks as a measure of fertility?

[Figure: millions of newborn babies in Germany plotted against pairs of breeding storks.]

The number of breeding storks correlates almost perfectly with the number of newborn babies!

Association between categorical scores

• Often we classify cases according to our score into two or more categories, e.g.

– neurotic or not neurotic

– extroverted or introverted

• In this case we want to see if there is an association between the categories, e.g. between our neuroticism test result and the psychiatrist’s/doctor’s diagnosis:

   

                         Doctors
                         Neurotic   Not neurotic
New test  Neurotic           20              3
          Not neurotic        2             10

• (SPSS: Analyze → Descriptive Statistics → Crosstabs → Statistics)

Symmetric Measures

                                          Value    Approx. Sig.
Nominal by Nominal   Phi                  .691     .000
                     Cramer's V           .691     .000
                     Contingency Coeff.   .568     .000
N of Valid Cases                          35

• a. Not assuming the null hypothesis.
• b. Using the asymptotic standard error assuming the null hypothesis.

• Phi is 0.691: there is a significant association between test classification and doctor’s opinion.
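The slide's coefficients can be reproduced from the 2x2 table itself; a sketch in Python with scipy, using the counts 20, 3, 2, 10 from above:

```python
# Sketch: phi and the contingency coefficient for the 2x2 table above.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 3],    # new test: neurotic
                  [2, 10]])   # new test: not neurotic
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()

phi = np.sqrt(chi2 / n)                   # .691, as in the slide
contingency = np.sqrt(chi2 / (chi2 + n))  # .568, as in the slide
print(f"phi = {phi:.3f}, C = {contingency:.3f}, p = {p:.4f}")
```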

Problems with categorical association coefficients:

• Different measures of association result in different values (see phi = 0.69 vs. contingency coefficient = 0.57).

• Therefore, comparison of different association measures is not possible.

• It is difficult to evaluate a coefficient with larger contingency tables.

Effect sizes

• 2x2 table (phi)

– small effect (small or weak association): 0.1
– medium effect: 0.3
– large, strong association: 0.5
– perfect association: 1

• Pearson’s correlation

– small effect (small or weak association): 0.3
– medium effect: 0.5
– large, strong association: 0.8
– perfect association: 1

 

In general we want to see if there is an association between our test score and an appropriate measure. The choice of correlation coefficient depends on the types of scales:

(SPSS: Analyze → Descriptive Statistics → Crosstabs → Statistics)

Type of Correlation Coefficient   Types of Scales
Pearson product-moment            Both scales interval/ratio
Spearman rank-order               Both scales ordinal
Phi                               Both scales naturally dichotomous (nominal)
Contingency coefficient           Both scales nominal (more than 2 categories)
Linear-by-linear association      Ordered categorical variables
Tetrachoric                       Both scales artificially dichotomous (nominal)
Point-biserial                    One scale naturally dichotomous (nominal), one scale interval/ratio
Biserial                          One scale artificially dichotomous (nominal), one scale interval/ratio
Gamma                             One scale nominal, one scale ordinal

Measurement Validity Types

3 general types of validity with several subcategories:

• Expert validity: assessment that the items of a test are drawn from the domains being measured

• Criterion validity: correlate measures with a criterion measure known to be valid, e.g. an established test

• Construct validity:

– sensitivity to change, responsiveness

Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event.

– Discriminant validity: the measure does not associate with constructs that shouldn’t be related.

Construct validation

Perhaps the most difficult type of validation (if done correctly)

Construct validity refers to the degree to which a test or other measure assesses the underlying theoretical construct it is supposed to measure.

It is traditionally defined as the experimental demonstration that a test is measuring the construct it claims to be measuring.

Idea behind estimating construct validity

• A construct is an unobservable trait, such as anxiety, self-esteem, xenophobia or IQ, that cannot be measured directly.

• The basic idea behind construct validity is that we can observe and measure behavioural traits (or other traits, e.g. blood pressure) which, according to our hypothesized construct, are influenced by or result from the construct we intend to measure,

e.g. sweaty palms are influenced by anxiety.

• We can now compare measurements of the observable traits with the measurements of our construct.

• If the observed behaviour correlates well with the scores of the test, we assume we have good “construct validity”.

Construct validity: examples

• A simple experiment could take the form of a differential-groups study:

you compare the performances on the test between two groups: one that has the construct and one that does not have the construct:

The serotonin levels of people who score high in depression test will be compared with people who score low.

– If the group with the construct performs differently from the group without the construct, that result provides evidence of the construct validity of the test (extreme-group validity).

Construct validity: examples

You could do an intervention study,

• Theories tell us how constructs respond to interventions,

– e.g. phobic symptoms subside after repeated exposures to the feared stimulus.

• Our scale should measure the differences after the intervention as predicted:

– after repeated exposures, our test should show lower scores.

– If the test scores of the subjects change in the predicted direction, we have evidence of construct validity.

Construct validity: more examples (developmental changes)

• Should the construct change with age?

– e.g. attention increases with age

– e.g. memory retrieval decreases with age

• This should be seen in your scale.

• Give your scale to a group of various ages and check for the pattern predicted by your theory:

Is there a predicted correlation between age and score?

Construct validity: more examples

• We developed a test for “perceived general health”.

• For theoretical reasons we assume that perceived general health should be associated with the number of visits to a GP or levels of self-medication.

• These two hypotheses can be formally tested by correlating the perceived general health score with the number of visits to the GP and with levels of self-medication.

• If the correlational analyses confirm your assumptions, this will add to the evidence of construct validity.

• Quasi-experimental study:

• We developed a test for “Depression”.

• We assume that people who recently had a stressful life event score higher on our test than people who did not have a stressful life event (t-test; a sketch follows below).

• Furthermore, the depression score of a person should increase after a stressful life event (“sensitivity to change”).
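A minimal sketch of that group comparison as a two-sample t-test in Python with scipy; the scores below are invented for illustration:

```python
# Sketch: do people with a recent stressful life event score higher on
# the depression test? Invented scores for illustration.
from scipy.stats import ttest_ind

event_group = [14, 16, 12, 15, 13, 17, 14]    # recent stressful event
no_event_group = [9, 11, 8, 10, 12, 9, 10]    # no recent event

t, p = ttest_ind(event_group, no_event_group)
print(f"t = {t:.2f}, p = {p:.4f}")  # a significantly higher mean in the
                                    # event group supports construct validity
```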

Main steps in construct validity

1. Formulate several hypotheses. Try to formulate hypotheses for each domain of your construct.

2. Define (operationalise) how you want to test your hypotheses: an experimental, quasi-experimental or correlational approach?

3. Develop the scale now!

4. Gather data that will allow you to test the hypotheses.

5. Determine whether your data support the hypotheses, e.g. with ANOVA, correlation, regression.

Construct validity

• So far mainly:

• sensitivity to change, responsiveness

• Convergent and discriminant validity
• Multitrait-multimethod matrix

Convergent validity

• Convergent validity is the degree to which our scale is, in reality, interrelated with the other attributes or variables to which it should be related theoretically.

• Example: A new depression scale should correlate with a measure of “serotonin and norepinephrine imbalance” or with “social inactivity”.

• We assume that social activity is not only influenced by depression and therefore we do not expect a perfect correlation.

• Therefore, we expect a correlation between the measures, but not a very strong one. If the correlation were very strong, our scale would be measuring the same thing. In that case, social activity would be just a proxy for depression and we could use social activity as our measure.

Difference from concurrent validity: we do not compare our measure with similar measures of the same trait, but with related variables.

Discriminant validity

• But by looking only at correlations with similar things, we will never discover a scale’s flaw of measuring too much.

We need a method to tell us that it measures just enough (not too much, not too little): discriminant validity.

• Discriminant validity is the degree to which our scale is, in reality, not interrelated with the attributes or variables to which it should not be related theoretically (e.g. depression and body size, or neuroticism).

• We need to find the traits which are similar according to our theory (high correlations) = convergent validity.

• But we also need to test traits which are different (low or no correlations) = discriminant validity.

• This way we can “narrow down” and check that we measure just right.

Multitrait Multimethod matrix

• If we find a correlation between two variables but we used similar methods of administration (e.g. both are psychometric tests), the correlation may be due to the shared administration method

• (e.g. the wording of items is difficult, or subjects try to be socially desirable in their answers).

• In this case we may find a correlation between two dissimilar trait measures where we do not expect one.

• Convergent and discriminant methods can be used to check the influence of the measurement method on our scores.

• Multitrait-multimethod (MTMM) matrix analysis allows us to disentangle correlations between instruments that are due to similarity of test methods from similarities that are due to tapping the same attribute.

Multitrait Multimethod matrix

• Two or more different traits (both similar and dissimilar traits) are measured by two or more methods (e.g. a psychometric test, a direct observation, a performance measure) at the same time.
• You then correlate the scores with each other.

• To construct an MTMM, you need to arrange the correlation matrix by concepts within methods.

• Essentially, the MTMM is just a correlation matrix between your measures, with one exception:

instead of 1's along the diagonal (as in the typical correlation matrix) we substitute an estimate of the reliability of each measure as the diagonal.

Multitrait Multimethod matrix

• We developed a scale for self-directed learning and want to validate the measure:

• We conduct a study of students and measure two similar traits: self-directed learning (SDL) and knowledge (Know).

• Furthermore, we measure each of the two traits in two different ways: a test measure and an exam rating.

• We assume that the two measures should not correlate.
• We measured the reliability of each measure.
• The results are arrayed in the MTMM.

• Example from: Streiner and Norman 2003, page 184

Homotrait-homomethod: the MTMM reliability diagonal

• The main diagonal entries are the reliabilities (= “correlations with itself”) of our instruments; they should be the highest values in the matrix.

                   SDL             Know
                   Rater   Test    Rater   Test
SDL    Rater       0.53
       Test        0.42    0.79
Know   Rater       0.18    0.17    0.58
       Test        0.15    0.23    0.49    0.88
Homotrait-heteromethod: concurrent validity

• Correlations between similar traits measured with different methods should be high, but lower than the reliability coefficients (here 0.42 for SDL and 0.49 for Know).

Heterotrait-homomethod: discriminant validity

• Correlations between different traits measured with the same method should be low (here 0.18 for the raters and 0.23 for the tests). If these correlations are high, the method has an effect on the scores.

(Same MTMM matrix as above.)

Heterotrait-Heteromethod

• Correlations between different traits measured with different methods should be similarly low to the heterotrait-homomethod correlations if the method has no effect on scores.

   

(Same MTMM matrix as above; the heterotrait-heteromethod correlations here are 0.15 and 0.17.)
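A sketch of how such an MTMM matrix could be assembled in Python with pandas; the DataFrame `df` and its column names are assumptions, with the reliabilities taken from the Streiner & Norman example above:

```python
# Sketch: building an MTMM matrix - correlate all trait-method
# combinations, then replace the 1's on the diagonal with reliability
# estimates. Assumes a DataFrame `df` with one column per measure.
import numpy as np
import pandas as pd

cols = ["SDL_rater", "SDL_test", "Know_rater", "Know_test"]
reliabilities = [0.53, 0.79, 0.58, 0.88]   # from the example above

vals = df[cols].corr().to_numpy()          # correlations between measures
np.fill_diagonal(vals, reliabilities)      # the reliability diagonal
mtmm = pd.DataFrame(vals, index=cols, columns=cols)
print(mtmm.round(2))
```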

Multitrait Multimethod matrix

• can be easily extended to include more similar and dissimilar traits and more methods: see e.g.

http://www.socialresearchmethods.net/kb/mtmmmat.htm

Multitrait Multimethod matrix

• We developed a scale for self esteem and want to validate the measure:

• We conduct a study of students and measure three traits or concepts: Self Esteem (A), Self Disclosure (B) and Locus of Control (C).

• Furthermore, we measure each of these traits in three different ways: a Paper-and-Pencil (1) measure, a Teacher rating (2) and parent assessment (3).

• We assume that Self Esteem should correlate with Self Disclosure but not with Locus of Control (LC).

• We measured the reliability of each measure.
• The results are arrayed in the MTMM.

• Example from: http://www.socialresearchmethods.net/kb/mtmmmat.htm

Have a go for yourself….

see: http://www.socialresearchmethods.net/kb/mtmmmat.htm

Construct validity: measure and theory

• In construct validity we are assessing two types of validity at the same time:

• is our measure valid, and is the theory regarding our construct valid (are the observed traits derived from our hypotheses really measures of the construct; is, e.g., college grade a measure of IQ)?

• If we find high validity, then we have more confidence that both our measure and our theory are correct.

• However, if we only obtain low validity, the problem could be:

– Our measure is good, but the theory is wrong
– The theory is good, but the measure is wrong
– Both theory and measure are wrong

– If we did an experiment (e.g. inducing anxiety), it could also be the experiment that did not work, while theory and measure are correct.

Therefore, many carefully planned validation tests need to be done.

Summary: Evaluating construct validity

• Evaluating construct validity will involve a large number of tests and an assessment of disparate information.

• Trying to confirm the plausible associations and disconfirm the implausible ones is often a long and incremental process.

• There will be no definitive answers and there is no formal way to weigh the overall evidence.

• Reaching conclusions on construct validity is further complicated by the problem that evidence tends to be interpreted in two different ways:

– to use these associations to test whether the instrument is a good measure of the intended constructs (validation of the instrument);

– to use these associations to confirm and clarify the constructs (validation of the underlying construct).

Validity

• Example: A test is developed to measure xenophobia.

• The test was administered to 100 people working in the car industry.

• In addition to the test, the people were asked the following questions: age, type of work (blue-collar or white-collar), political attitude (left to right), handedness, number of friends who are foreigners, and education.

• Can you think of some validity tests?

Possible validity tests

• Older people should show on average more xenophobia than younger ones.

• People with a right-wing political attitude should show on average more xenophobia.

• People in unsafe job situations should show more xenophobia than people with safe jobs: blue-collar workers should show on average more xenophobia than white-collar workers.

• People with higher education should show on average less xenophobia.

• Left-handed people should show a similar degree of xenophobia to right-handed people (discriminant validity).

Validity and reliability

• Validity implies reliability. A valid measure must be reliable, but a reliable measure may not be valid.

• Reliability sets the upper limit on the validity of a scale!

Upper limit of validity

• Validity is to some degree dependent on reliability:

reliability sets the upper limit on the validity of a scale.

reliability of "gold standard", e.g. test - retest = reliability of new test , e.g. test
reliability of "gold standard", e.g. test - retest
= reliability of new test , e.g. test - retest
new test and e.g. gold standard)
= validity(correlation between
xx
xy
yy
r
r
r
yy
xx
xy
=
r
r
r

Example: max validity

If the reliability of our test is 0.8, and the reliability of a gold standard test is 0.7, the maximum correlation (validity coefficient) between the variables is:

$$r_{xy}^{\max} = \sqrt{0.8 \times 0.7} = 0.75$$

Upper limit of validity: Relationship between reliability of criterion and validity of new scale:

maximal validity

.90 .80 .70 reliability of new scale 1.00 0.80 0.60 0.40 0.20 0.00 0.20 0.40 0.60
.90
.80
.70
reliability of new scale
1.00
0.80
0.60
0.40
0.20
0.00
0.20
0.40
0.60
0.80
1.00

reliability of criterion ("gold standard")

The problem: unreliable criterion

• This relationship between reliability and validity can cause a problem in our validity analysis:

• Our measure may be valid, but if we compare it with a not very reliable criterion, we will get a low validity estimate for our measure.

Correcting for low reliability:

• However, we can estimate what the validity coefficient of our new measure would be if both our new scale and the criterion were perfectly reliable:

$$r^{*}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

where
$r^{*}_{xy}$ = estimated validity if both scales were 100% reliable ($r_{xx} = r_{yy} = 1$),
$r_{xy}$ = observed validity (correlation between the new test and the gold standard),
$r_{xx}$ = reliability of the new test (e.g. test-retest),
$r_{yy}$ = reliability of the “gold standard” (e.g. test-retest).

Example: Estimated validity
A correlation between our new measure and a gold standard test results in r = 0.6. What would the validity be if both were perfectly reliable? (Reliability of our test: 0.8; reliability of the gold standard test: 0.7)

$$r^{*}_{xy} = \frac{0.6}{\sqrt{0.8 \times 0.7}} = \frac{0.6}{0.75} = 0.8$$

Divide the observed validity by the maximum validity: validity would be 0.8 if both measures were perfectly reliable.

Example: Estimated validity
A correlation between our new measure and a gold standard test results in r = 0.6. What would the validity be if the gold standard were perfectly reliable? (Reliability of our test: 0.8; reliability of the gold standard test: 0.7)

$$r^{*}_{xy} = \frac{0.6}{\sqrt{0.7}} = \frac{0.6}{0.83} = 0.72$$

Validity would be 0.72 if the criterion measure were perfectly reliable.

Estimating validity if gold standard is perfect

• However, our new measure is not perfectly reliable and won’t be. A more realistic approach is to estimate the validity of our test if only the gold standard were perfectly reliable:

$$r^{*}_{xy} = \frac{r_{xy}}{\sqrt{r_{yy'}}}$$

where
$r^{*}_{xy}$ = estimated validity if the “gold standard” were 100% reliable ($r_{yy'} = 1$),
$r_{xy}$ = observed validity (correlation between the new test and the gold standard),
$r_{yy'}$ = reliability of the “gold standard” (e.g. test-retest).
 

Effect of increasing reliability on validity

 

If the validity of our scale is low due to low reliability, we could improve reliability to increase validity.

• How much do we have to improve the reliability of the scale to obtain an acceptable validity?

$$r^{*}_{xy} = r_{xy}\,\sqrt{\frac{r^{\text{changed}}_{xx'}\; r^{\text{changed}}_{yy'}}{r_{xx}\; r_{yy}}}$$

where
$r^{\text{changed}}_{xx'}$ and $r^{\text{changed}}_{yy'}$ are the changed reliabilities for the two variables,
$r^{*}_{xy}$ = estimated validity,
$r_{xy}$ = observed correlation,
$r_{xx}$, $r_{yy'}$ = observed reliabilities.
Example: Usefulness of validity corrections

• We observed a correlation (validity) of 0.6; the observed reliabilities are 0.8 for our scale and 0.7 for the criterion scale. How much would the validity increase if we could increase the reliability of our scale by 0.1?

With $r^{\text{changed}}_{xx'} = 0.9$, $r^{\text{changed}}_{yy'} = 0.7$, observed correlation $r_{xy} = 0.6$ and observed reliabilities $r_{xx} = 0.8$, $r_{yy'} = 0.7$:

$$r^{*}_{xy} = r_{xy}\,\sqrt{\frac{r^{\text{changed}}_{xx'}\; r^{\text{changed}}_{yy'}}{r_{xx}\; r_{yy}}} = \frac{0.6 \times 0.95 \times 0.84}{0.89 \times 0.84} = 0.64$$

• If our new scale should correlate strongly with another measure, corrections will give us an indication of the true validity of the instrument, and hence an indication that we are measuring the construct.

• A low uncorrected validity score might otherwise lead us to discard our scale, while it is in reality a problem of reliability.

• Is it worth increasing the reliability of our new scale?

• How reliable does our scale need to be to obtain a valid scale?
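The corrections on the last few slides are easy to wrap as small helpers; a sketch in Python (the function names are mine, not standard terminology), reproducing the slides' numbers:

```python
# Sketch: attenuation-related corrections from the preceding slides.
import math

def max_validity(r_xx, r_yy):
    """Upper limit of the validity coefficient given two reliabilities."""
    return math.sqrt(r_xx * r_yy)

def corrected_validity(r_xy, r_xx=1.0, r_yy=1.0):
    """Validity corrected for attenuation: divide by the root of the
    reliabilities; pass only r_yy to correct for the criterion alone."""
    return r_xy / math.sqrt(r_xx * r_yy)

def validity_after_change(r_xy, r_xx, r_yy, new_xx, new_yy):
    """Estimated validity after changing the two reliabilities."""
    return r_xy * math.sqrt((new_xx * new_yy) / (r_xx * r_yy))

print(round(max_validity(0.8, 0.7), 2))             # 0.75
print(round(corrected_validity(0.6, 0.8, 0.7), 2))  # 0.8  (both perfect)
print(round(corrected_validity(0.6, r_yy=0.7), 2))  # 0.72 (criterion perfect)
print(round(validity_after_change(0.6, 0.8, 0.7, 0.9, 0.7), 2))  # 0.64
```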

Which reliability measure to use?

• “Use the type of reliability estimate that treats as error those factors that one decides should be treated as error” (Muchinsky 1996).

• E.g. if you think that the test may suffer from low item coverage of your domains, then you should use Cronbach’s alpha as the reliability measure.

• If you think that subjects may have problems with the test, you should use test-retest reliability estimates (ICC).

• Unfortunately, there is no acceptable psychometric basis for creating validity coefficients by correcting for multiple types of unreliability at once (Muchinsky 1996).

• Perhaps use the lowest one.

Summary of the course

• The aim of psychometric scale development is to develop a tool to measure unobservable traits, latent constructs.

• These unobservable traits are measured by a scale which consists of many items (e.g. questions), all of which should tap into the construct or into domains of the construct.

• The key concepts of classical test development theory are reliability and validity.

 

[Flowchart: overview of the scale development process]

Item pool generation
→ Face and content validity
→ Internal consistency: inter-item correlation, item-total correlation (remove/revise items)
→ Dimensionality: Cronbach’s Alpha, factor analysis (remove/revise items)
→ More reliability (stability): test-retest, inter-observer, intra-observer
→ From items to scale: total score, total subscores
→ Validity: construct, concurrent