Introduction to measurement and scale development Part 5: Validity
Daniel Stahl
Department of Biostatistics & Computing
Last week: Reliability II
• Generalizability theory
– explicitly recognizes that there are multiple sources of error and that measures may have different reliabilities in different situations (e.g. different hospitals, genders, countries)
– allows us to determine the sources of error using variance components and multilevel modelling techniques
– is an extension of the reliability model from two to more random and fixed factors
• Reliability and standard errors/confidence intervals
• Calculating the necessary number of items for a good reliability
• Sample sizes for test-retest reliability studies
• Reliability for categorical variables: Kappa
• From item scores to scale scores (total score, subscores)
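The "necessary number of items" calculation recapped above is the Spearman-Brown prophecy formula. A minimal sketch in Python, with invented numbers (a 10-item scale with reliability 0.70 and a target of 0.90); the function name is mine, not a library API:

```python
# Spearman-Brown prophecy: by what factor must a scale be lengthened
# to reach a target reliability?
# k = rho_target * (1 - rho_observed) / (rho_observed * (1 - rho_target))
import math

def lengthening_factor(rho_observed: float, rho_target: float) -> float:
    """Factor by which the number of items must grow."""
    return rho_target * (1 - rho_observed) / (rho_observed * (1 - rho_target))

# invented example: 10 items, observed reliability 0.70, target 0.90
k = lengthening_factor(0.70, 0.90)
n_items = math.ceil(10 * k)
print(round(k, 2), n_items)  # k ≈ 3.86, so about 39 items are needed
```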
Last week: From items to total score
• If our final item pool passed
– requirements for internal consistency,
– factor analysis for checking dimensionality,
– and other reliability checks,
we need to combine the items into a total score (or several subscores).
• We need a rule to combine the information of all items into a total score (or a total score for each domain).
• Usually you simply add up the scores of each domain to obtain a subtotal score for that domain.
• Adding up usually/often works well
Question from last week:
Categorizing a continuous scale
• If we accept the fact that doctors like to ruin a nice continuous outcome measure by turning it into a dichotomy (depressed – not depressed), we need a way to find the best cutoff score.
• This involves trade-offs between sensitivity and specificity (or between true positives and false positives).
• http://www.childrensmercy.org/stats/ask/roc.asp
Example
• A new simple depression test was developed for use in psychiatric clinics to screen patients for depression.
• How can we determine a cutoff score to “correctly” classify a patient as depressive?
• We could apply the new test to patients and ask psychiatrists to assess the same patients.
• If our test should classify similarly to psychiatrists, we can use e.g. ROC curves to determine the optimal cutoff score.
• The ROC curve allows us to evaluate the performance of tests that categorize cases into one of two groups.
Receiver operating characteristic (ROC) curve
• An ROC curve is a graphical representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) for every possible cutoff.
• The plot shows the false positive rate (1 − specificity) on the X axis and sensitivity on the Y axis.
• A good diagnostic test is one that has small false positive and false negative rates.
• SPSS: Analyze → ROC Curve
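What the ROC procedure computes can be sketched in a few lines of Python. The scores and diagnoses below are invented, and `roc_points`/`auc` are hypothetical helper names, not an SPSS or library API:

```python
# Sketch of an ROC computation: sensitivity (TPR) and 1 - specificity (FPR)
# at every candidate cutoff, plus the area under the curve.

def roc_points(scores, labels):
    """Return (fpr, tpr) lists, sweeping the cutoff from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))

# invented data: depressed patients tend to score higher on the test
scores = [14, 12, 11, 9, 8, 7, 6, 4, 3, 2]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # 1 = psychiatrist says depressed
fpr, tpr = roc_points(scores, labels)
print(round(auc(fpr, tpr), 2))  # 0.96 for this toy data: a "very good" test
```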
(Figure: distribution of depression scores versus doctor's classification)
Area under the ROC curve
• A good test will result in an ROC curve that rises toward the upper left-hand corner very quickly. Therefore, the area under the curve is a measure of the diagnostic quality of the test.
• The larger the area, the better the diagnostic test:
– Area is 1.0 = 100% sensitivity and 100% specificity.
– Area is 0.5 = 50% sensitivity and 50% specificity = not better than flipping a coin.
– In practice, a diagnostic test is somewhere between these two extremes.
What's a good value for the area under the curve?
• 0.50 to 0.75 = fair
• 0.75 to 0.92 = good
• 0.92 to 0.97 = very good
• 0.97 to 1.00 = excellent
Cutoff point
Choice of cutoff point depends on:
• Importance of correct classification
• Cost of misclassification
• Prevalence (the lower the prevalence, the higher the proportion of false positives among the positive results)
Cutoff point

(Figure: ROC curve, sensitivity versus 1 − specificity, with cutoff point = 10.5 marked)

Coordinates of the Curve
Test Result Variable(s): Depression score

Positive if Greater Than or Equal To* | Sensitivity | 1 − Specificity
  .00  | 1.000 | 1.000
 1.50  | 1.000 |  .898
 2.50  |  .999 |  .813
 3.50  |  .997 |  .683
 4.50  |  .994 |  .596
 5.50  |  .993 |  .529
 6.50  |  .985 |  .372
 7.50  |  .979 |  .306
 8.50  |  .969 |  .240
 9.50  |  .934 |  .134
10.50  |  .918 |  .092
11.50  |  .865 |  .034
12.50  |  .758 |  .015
13.50  |  .700 |  .013
14.50  |  .573 |  .005
15.50  |  .513 |  .002
16.50  |  .379 |  .001
17.50  |  .321 |  .001
18.50  |  .188 |  .000
19.50  |  .086 |  .000
21.00  |  .000 |  .000

* The test result variable(s): Depression score has at least one tie between the positive actual state group and the negative actual state group.

• 10.5 seems to be a good classification cutoff point:
– > 10 = depressive
– ≤ 10 = not depressive
• 91.8% of depressives are classified as depressive; 9.2% of the not depressed are "misclassified" as depressive.
• But the original score still has more information (most of the wrongly classified patients score around 9 and 10).
• Dichotomizing data causes up to 66% power loss in statistical analysis!
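A common formal rule for choosing a cutoff from the coordinates of the curve is Youden's index, J = sensitivity − (1 − specificity). A sketch in Python using the coordinates above; note that with these numbers J actually peaks at 11.5, with 10.5 a close second — it keeps sensitivity above 90%, illustrating that the cost of misclassification, not J alone, drives the choice:

```python
# Choosing a cutoff from the "Coordinates of the Curve" output above,
# using Youden's index J = sensitivity - (1 - specificity).
cutoffs = [0.00, 1.50, 2.50, 3.50, 4.50, 5.50, 6.50, 7.50, 8.50, 9.50,
           10.50, 11.50, 12.50, 13.50, 14.50, 15.50, 16.50, 17.50, 18.50,
           19.50, 21.00]
sens = [1.000, 1.000, 0.999, 0.997, 0.994, 0.993, 0.985, 0.979, 0.969,
        0.934, 0.918, 0.865, 0.758, 0.700, 0.573, 0.513, 0.379, 0.321,
        0.188, 0.086, 0.000]
one_minus_spec = [1.000, 0.898, 0.813, 0.683, 0.596, 0.529, 0.372, 0.306,
                  0.240, 0.134, 0.092, 0.034, 0.015, 0.013, 0.005, 0.002,
                  0.001, 0.001, 0.000, 0.000, 0.000]

j = [s - f for s, f in zip(sens, one_minus_spec)]
best = max(range(len(j)), key=j.__getitem__)
print(cutoffs[best], round(j[best], 3))  # 11.5 0.831 (10.5 gives J = 0.826)
```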
(Figure: ROC curves by group, sensitivity versus 1 − specificity; diagonal segments are produced by ties.)
• There seem to be no gender differences in cutoff point.
• Use logistic regression to see the influence of age, gender, ethnicity… on classification:
depressive yes/no ~ test score + sex + age + ethnicity
Literature: ROC curves
• McFall, R. M., & Treat, T. A. (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50, 215–241.
• Lee, W. C. (1999). Selecting diagnostic tests by ruling out or ruling in disease: the use of the Kullback–Leibler distance. International Journal of Epidemiology, 28, 521–525. (more than 2 classifications)
• Strik, J. J., Honig, A., Lousberg, R., & Denollet, J. (2001). Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. 42(5), 423–428.
Validity
Now we have a scale with very good reliability. But what about validity? Is our scale measuring what it is supposed to measure?
Next step: validity. Is our scale not only reliable but also valid?
Validity
• A valid measure is one which is measuring what it is supposed to measure.
• Validity refers to getting results that accurately reflect the concept being measured:
• Are we drawing valid conclusions from our measures? Does a high score on our IQ scale really mean that the person is intelligent?
• Validity is the degree of confidence we can place on inferences about people based on their test scores.
• Validity implies reliability: reliability places the upper limit on the validity of a scale!
Measurement Validity Types
3 general types of validity with several subcategories:
• Expert validity • Criterion validity • Construct validity
Validity Types
3 general types of validity with several subcategories:
• Expert validity: assessment that the items of a test are drawn from the domains being measured (takes place after the initial form of the scale has been developed)
• Criterion validity: correlate the measure with a criterion measure known to be valid, e.g. an established test (= other scales of the same or a similar measure are available)
• Construct validity: examines whether a measure is related to other variables as required by theory, e.g. a depression score should change in response to a stressful life event (= other scales of the same or a similar measure are not available)
Expert validity
Expert validity: do the experts agree?
– Face validity:
• subjective assessment that the instrument/items appear to assess the desired qualities
• Does the operationalization look like a good translation of the construct?
• Assessment by colleagues, friends, target subjects, clinicians…
• Weakest way to demonstrate construct validity
– Content validity:
• closely related to face validity, but a more rigorous assessment, done by an expert panel
• concerned with sample–population representativeness:
• subjective assessment that the instrument samples all the important contents/domains of the attribute
Criterion and Construct validity
• Criterion validity: Is the measure consistent with what we already know and what we expect?
– Concurrent validity: correlate measurements of a new scale with a "gold standard" criterion, both of which are given at the same time
– Predictive validity: correlate with a criterion that is not yet available
• Construct validity: Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event.
– Sensitivity to change, responsiveness
– Discriminant validity: does not associate with constructs that shouldn't be related.
Criterion and construct validity – another view
• Validity is mainly about generating hypotheses and designing experiments to test them.
• It is a process of hypothesis testing, and the important question is not which kind of validity we apply but: "Does our hypothesis make sense in light of what the scale is designed to measure?"
• However, the different categories of validity may help to generate and design hypotheses.
Criterion validity:
• Criterion validity: Is the measure consistent with what we already know and what we expect?
– Concurrent validity: correlate measurements of a new scale with a "gold standard" criterion, both of which are given at the same time,
e.g. a new depression test and the Beck Depression Inventory
= allows for immediate results
– Predictive validity: correlate with a criterion that is not yet available,
e.g. measuring aggressiveness and comparing it with reported aggressive acts in the following year;
intelligence and final exam results
Criterion Validity
Steps for conducting a criterion-related validation study:
• Identify a suitable criterion and a method for measuring it
(e.g. a new IQ test aiming to predict school performance: test scores should be correlated with exam results).
• Identify an appropriate sample (students).
• Correlate test scores and the criterion measure (exam results).
• The degree of relationship is the validity coefficient: if we estimate a Pearson correlation coefficient r (or ICC), r is our validity coefficient.
• The interpretation is similar to reliability: the closer r is to 1, the better the validity.
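The correlation step can be sketched in a few lines of Python. The test and exam scores below are invented, and `pearson_r` is a hand-rolled helper (in practice you would use a statistics package):

```python
# Criterion validation sketch: correlate new-test scores with the criterion.
import math

def pearson_r(x, y):
    """Pearson product-moment correlation, computed from the definitions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

iq_test = [95, 100, 105, 110, 120, 125]  # invented new-test scores
exam = [52, 60, 58, 66, 70, 75]          # invented criterion: exam results (%)
print(round(pearson_r(iq_test, exam), 2))  # validity coefficient ≈ 0.97
```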
Concurrent validity:
Number of storks as a measure of fertility?
Association between categorical scores
• Often we classify cases into two or more categories according to our score, e.g.
– neurotic – not neurotic
– extroverted – introverted
• In this case we want to see if there is an association between the categories, e.g. our neuroticism test result and the psychiatrist's/doctor's diagnosis:
                        Doctors
                        Neurotic | Not neurotic
New test  Neurotic          20   |      3
          Not neurotic       2   |     10
• (SPSS: Analyze → Descriptive Statistics → Crosstabs → Statistics)
Symmetric Measures
                                      Value | Approx. Sig.
Nominal by Nominal  Phi                .691 | .000
                    Cramer's V         .691 | .000
                    Contingency
                    Coefficient        .568 | .000
N of Valid Cases                        35
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
• Phi is 0.691: there is a significant association between test classification and doctor’s opinion.
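The SPSS output above can be reproduced by hand. For a 2×2 table with cells a, b, c, d: phi = (ad − bc)/√((a+b)(c+d)(a+c)(b+d)), and the contingency coefficient is √(χ²/(χ² + n)); for a 2×2 table χ² = n·phi². A sketch using the table values:

```python
# Recomputing the "Symmetric Measures" output from the 2x2 table above
# (new test vs. doctors: a = 20, b = 3, c = 2, d = 10).
import math

a, b, c, d = 20, 3, 2, 10
n = a + b + c + d

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi2 = n * phi ** 2                       # for a 2x2 table, chi2 = n * phi^2
cont_coef = math.sqrt(chi2 / (chi2 + n))  # contingency coefficient

print(round(phi, 2), round(cont_coef, 3))  # 0.69 and 0.568, matching SPSS
```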
Problems with categorical association coefficients:
• Different measures of association give different values (see phi = 0.69 versus contingency coefficient = 0.57).
• Therefore, comparison across different association measures is not possible.
• It is difficult to evaluate a coefficient for larger contingency tables.
Effect sizes
• 2x2 table (phi):
– small effect (weak association): 0.1
– medium effect: 0.3
– large effect (strong association): 0.5
– perfect association: 1
• Pearson's correlation:
– small effect (weak association): 0.3
– medium effect: 0.5
– large effect (strong association): 0.8
– perfect association: 1
In general we want to see if there is an association between our test score and an appropriate measure, such as:
(SPSS: Analyze → Descriptives → Crosstabs → Statistics)

Type of Correlation Coefficient | Types of Scales
Pearson product-moment          | Both scales interval/ratio
Spearman rank-order             | Both scales ordinal
Phi                             | Both scales naturally dichotomous (nominal)
Contingency coefficient         | Both scales nominal (more than 2 categories)
Linear-by-linear association    | Ordered categorical variables
Tetrachoric                     | Both scales artificially dichotomous (nominal)
Point-biserial                  | One scale naturally dichotomous (nominal), one scale interval/ratio
Biserial                        | One scale artificially dichotomous (nominal), one scale interval/ratio
Gamma                           | One scale nominal, one scale ordinal
Measurement Validity Types
3 general types of validity with several subcategories:
• Expert validity
assessment that the items of a test are drawn from the domains being measured
• Criterion validity
correlate measures with a criterion measure known to be valid, e.g. established test
• Construct validity
Is the measure related to other variables as required by theory? E.g. a depression score should change in response to a stressful life event.
– sensitivity to change, responsiveness
– Discriminant validity: does not associate with constructs that shouldn't be related.
Construct validation
• Perhaps the most difficult type of validation (if done correctly).
• Construct validity refers to the degree to which a test or other measure assesses the underlying theoretical construct it is supposed to measure.
• It is traditionally defined as the experimental demonstration that a test is measuring the construct it claims to be measuring.
Idea behind estimating construct validity
• A construct is an unobservable trait, such as anxiety, self-esteem, xenophobia or IQ, that cannot be measured directly.
• The basic idea behind construct validity is that we can observe and measure behaviours (or other traits, e.g. blood pressure) which, according to our hypothesized construct, are influenced by or result from the construct we intend to measure,
e.g. sweaty palms are influenced by anxiety.
• We can then compare measurements of observable traits with the measurements of our construct.
• If the observed behaviour correlates well with the scores of the test, we assume we have good "construct validity".
Construct validity: examples
• A simple experiment could take the form of a differential-groups study:
– You compare performance on the test between two groups: one that has the construct and one that does not. E.g. the serotonin levels of people who score high on the depression test are compared with those of people who score low.
– If the group with the construct performs differently from the group without the construct, that result is said to provide evidence of the construct validity of the test (extreme-group validity).
Construct validity: examples
You could do an intervention study:
• Theories tell us how constructs respond to interventions,
– e.g. phobic symptoms subside after repeated exposures to the feared stimulus.
• Our scale should measure the differences after the intervention as predicted:
– after repeated exposures, our test should show lower scores.
– If the test scores of the subjects change in the predicted direction, we have evidence of construct validity.
Construct validity: more examples Developmental changes
• Should the construct change with age?
– e.g. attention increases with age
– e.g. memory retrieval decreases with age
• This should be seen in your scale.
• Give your scale to a group of various ages and check for the pattern predicted by your theory:
Is there a predicted correlation between age and score?
Construct validity: more examples
• We developed a test for "perceived general health".
• For theoretical reasons we assume that perceived general health should be associated with the number of visits to a GP or with levels of self-medication.
• These two hypotheses can be formally tested by correlating the perceived general health score with the number of visits to the GP and with levels of self-medication.
• If the correlational analyses confirm your assumptions, this adds to the evidence of construct validity.
• Quasi-experimental study:
– We developed a test for "depression".
– We assume that people who recently had a stressful life event score higher on our test than people who did not have a stressful life event (t-test).
– Furthermore, the depression score of a person should increase after a stressful life event ("sensitivity to change").
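The group comparison in the quasi-experimental study can be sketched as a pooled two-sample t statistic, computed from the definitions rather than via a statistics package. The scores for both groups are invented:

```python
# Construct-validity sketch (invented data): do people with a recent
# stressful life event score higher on the new depression test?
import math

def two_sample_t(x, y):
    """Pooled two-sample t statistic (equal-variance Student's t)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled_var = (ssx + ssy) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mx - my) / se

stress_group = [18, 21, 17, 19, 22, 20]   # recent stressful life event
control_group = [10, 12, 9, 11, 13, 10]   # no stressful life event
t = two_sample_t(stress_group, control_group)
print(round(t, 2))  # t ≈ 8.92, far above the df = 10, alpha = .05 critical value (2.23)
```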
Main steps in construct validity
• Formulate several hypotheses; try to formulate hypotheses for each domain of your construct.
• Define (operationalise) how you want to test your hypotheses: an experimental, quasi-experimental or correlational approach?
• Develop the scale now!
• Gather data that will allow you to test the hypotheses.
• Determine whether your data support the hypotheses, e.g. ANOVA, correlation, regression.
Construct validity
• So far mainly: sensitivity to change, responsiveness
• Convergent and discriminant validity
• Multitrait–multimethod matrix
Convergent validity
• Convergent validity is the degree to which our scale is, in reality, interrelated with other attributes or variables with which it should be related theoretically.
• Example: a new depression scale should correlate with a measure of "serotonin and norepinephrine imbalance" or with "social inactivity".
• We assume that social activity is not influenced only by depression, and therefore we do not expect a perfect correlation.
• We therefore expect a correlation between the measures, but not a very strong one. If the correlation were very strong, our scale would measure the same thing; social activity would be just a proxy for depression and we could use social activity as the measure.
• Difference from concurrent validity: we do not compare our measure with similar measures of the same trait but with related variables.
Discriminant validity
• But by looking only at correlations with things that are similar, we will never discover a scale's flaw of measuring too much.
• We need a method to tell that it measures just enough (not too much, not too little).
• Discriminant validity is the degree to which our scale is, in fact, not interrelated with other attributes or variables with which it should not be related theoretically, e.g. depression and body size.
• We need to find the traits which are similar according to our theory (high correlations) = convergent validity.
• But we also need to test traits which are different (low or no correlations) = discriminant validity.
• This way we can "narrow down" and check that we measure just right.
Multitrait Multimethod matrix
• If we find a correlation between two variables but we used similar methods of administration (e.g. both are psychometric tests), the correlation may be due to the shared administration method
(e.g. the wording of items is difficult, or subjects try to answer in a socially desirable way).
• In this case we may find a correlation between two dissimilar trait measures where we do not expect one.
• Convergent and discriminant methods can be used to check the influence of the measurement method on our scores.
• Multitrait–multimethod matrix (MTMM) analysis allows us to disentangle correlations between instruments that are due to similarity of test methods from those due to tapping the same attribute.
Multitrait Multimethod matrix
• Two or more different traits (both similar and dissimilar) are measured by two or more methods (e.g. a psychometric test, a direct observation, a performance measure) at the same time.
• You then correlate the scores with each other.
• To construct an MTMM, you arrange the correlation matrix by concepts within methods.
• Essentially, the MTMM is just a correlation matrix between your measures, with one exception: instead of 1's along the diagonal (as in a typical correlation matrix), we substitute an estimate of the reliability of each measure.
Multitrait Multimethod matrix
• We developed a scale for self-directed learning and want to validate the measure:
• We conduct a study of students and measure two traits: self-directed learning (SDL) and knowledge (Know).
• Furthermore, we measure each of the two traits in two different ways: a test measure and an exam rating.
• We assume that the two traits should not correlate strongly.
• We measured the reliability of each measure.
• The results are arrayed in the MTMM.
(Example from: Streiner and Norman 2003, page 184)
Heterotrait–heteromethod
• Correlations between different traits measured with different methods should be similarly low to the heterotrait–monomethod correlations if the method has no effect on scores.
                     SDL              Know
                  Rater   Test     Rater   Test
SDL   Rater      (0.53)
      Test        0.42   (0.79)
Know  Rater       0.18    0.17    (0.58)
      Test        0.15    0.23     0.49   (0.88)

(reliabilities in parentheses on the diagonal)
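The expected pattern in the matrix above — reliabilities larger than the convergent (same trait, different method) validities, which in turn are larger than the heterotrait correlations — can be checked directly. Values are taken from the table; the variable names are mine:

```python
# Checking the convergent/discriminant pattern in the MTMM matrix above
# (SDL/Knowledge example values from the slide).
reliability = {("SDL", "rater"): 0.53, ("SDL", "test"): 0.79,
               ("Know", "rater"): 0.58, ("Know", "test"): 0.88}
convergent = [0.42, 0.49]                # same trait, different methods
heterotrait = [0.18, 0.17, 0.15, 0.23]   # different traits

# desired pattern: reliabilities > convergent validities > heterotrait rs
assert min(reliability.values()) > max(convergent)
assert min(convergent) > max(heterotrait)
print("MTMM pattern holds: method effects appear small")
```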
Multitrait Multimethod matrix
• can be easily extended to include more similar and dissimilar traits and more methods: see e.g.
http://www.socialresearchmethods.net/kb/mtmmmat.htm
Multitrait Multimethod matrix
• We developed a scale for self-esteem and want to validate the measure:
• We conduct a study of students and measure three traits or concepts: Self-Esteem (A), Self-Disclosure (B) and Locus of Control (C).
• Furthermore, we measure each of these traits in three different ways: a paper-and-pencil measure (1), a teacher rating (2) and a parent assessment (3).
• We assume that self-esteem should correlate with self-disclosure but not with locus of control (LC).
• We measured the reliability of each measure.
• The results are arrayed in the MTMM.
(Example from: http://www.socialresearchmethods.net/kb/mtmmmat.htm)
Have a go for yourself….
see: http://www.socialresearchmethods.net/kb/mtmmmat.htm
Construct validity: measure and theory
• In construct validity we are assessing two types of validity at the same time:
• Is our measure valid, and is the theory regarding our construct valid (are the observed traits derived from our hypotheses really measures of the construct; is, e.g., college grade a measure of IQ)?
• If we have high validity, then we have more confidence that both our measure and our theory are correct.
• However, if we only obtain low validity, the problems could be:
– our measure is good, but the theory is wrong
– the theory is good, but the measure is wrong
– both theory and measure are wrong
– if we did an experiment (e.g. inducing anxiety), it could also be the experiment that did not work, while theory and measure are correct
• Therefore, many carefully planned validation tests need to be done.
Summary: Evaluating construct validity
• Evaluating construct validity will involve a large number of tests and an assessment of disparate information.
• Trying to confirm the plausible associations and disconfirm the implausible ones is often a long and incremental process.
• There will be no definitive answers and there is no formal way to weigh the overall evidence.
• Reaching conclusions on construct validity is further complicated by the problem that evidence tends to be interpreted in two different ways:
– to use these associations to test whether the instrument is a good measure of the intended constructs (validation of the instrument) ;
– to use these associations to confirm and clarify the constructs (validation of the underlying construct).
Validity
• Example: a test is developed to measure xenophobia.
• The test was administered to 100 people working in the car industry.
• In addition to the test, the people were asked about: age, type of work (blue-collar/white-collar worker), political attitude (left to right), handedness, number of foreigners as friends, and education.
• Can you think of some validity tests?
Possible validity tests
• Older people show on average more xenophobia than younger ones.
• People with a right-wing political attitude show on average more xenophobia.
• People in insecure job situations show more xenophobia than people in secure jobs: blue-collar workers should show on average more xenophobia than white-collar workers.
• People with higher education should show on average less xenophobia.
• Left-handed people should show a similar degree of xenophobia to right-handed people.
Validity and reliability
• Validity implies reliability: a valid measure must be reliable, but a reliable measure need not be valid.
• Reliability places the upper limit on the validity of a scale!
Upper limit of validity
• Validity is to some degree dependent on reliability: reliability places the upper limit on the validity of a scale.

Example: maximum validity
If the reliability of our test is 0.8, and the reliability of a gold-standard test is 0.7, the maximum correlation (validity coefficient) between the variables is:

    max r_xy = sqrt(r_xx' × r_yy') = sqrt(0.8 × 0.7) ≈ 0.75

(Figure: maximal validity as a function of the reliability of the criterion ("gold standard"))
The problem: unreliable criterion
• This relationship between reliability and validity can cause a problem in our validity analysis:
• Our measure may be valid, but if we compare it with a not very reliable criterion, we will obtain a low validity estimate for our measure.
Correcting for low reliability:
• However, we can estimate what the validity coefficient of our new measure would be if both our new scale and the criterion were perfectly reliable (correction for attenuation):

    r*_xy = r_xy / sqrt(r_xx' × r_yy')

Estimating validity if the gold standard is perfect
• However, our new measure is not perfectly reliable and won't be. A more realistic approach is to estimate the validity of our test if only the gold standard were perfectly reliable:

    r*_xy = r_xy / sqrt(r_yy')
Effect of increasing reliability on validity
• If the validity of our scale is low due to low reliability, we could improve reliability to increase validity.
• How much do we have to improve the reliability of the scale to obtain an acceptable validity?

    r*_xy = r_xy × sqrt(changed r_xx' × changed r_yy') / sqrt(r_xx' × r_yy')

where changed r_xx' and changed r_yy' are the changed reliabilities of the two variables,
r*_xy is the estimated validity,
r_xy is the observed correlation,
r_xx' and r_yy' are the observed reliabilities.
Which reliability measure to use?
• "Use the type of reliability estimate that treats as error those factors that one decides should be treated as error" (Muchinsky 1996).
• E.g. if you think that the test may suffer from low item coverage of your domains, then you should use Cronbach's alpha as the reliability measure.
• If you think that subjects may have problems with the test, you should use test–retest reliability estimates (ICC).
• Unfortunately, there is no acceptable psychometric basis for creating validity coefficients by correcting for multiple types of unreliability (Muchinsky 1996).
• Perhaps use the lowest one.
Summary of the course
• The aim of psychometric scale development is to develop a tool to measure unobservable traits, latent constructs.
• These unobservable traits are measured by a scale which consists of many items (e.g. questions), which should all tap into the construct or into domains of the construct.
• The key concepts of classical test development theory are reliability and validity.
(Overview flowchart of scale development:)
Item pool generation → Face and content validity
→ Item analysis (inter-item correlation, item–total correlation) → Remove/revise items
→ Dimensionality
→ From items to scale (total score, total subscores)
→ Reliability/stability (inter-observer, intra-observer, test–retest)
→ Validity (construct, concurrent)

12