Topic 8: Reliability and Validity of Assessment Techniques
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the concept of a true score;
2. Compare the different techniques of estimating the reliability of a
test;
3. Compare the different techniques of establishing the validity of a
test; and
4. Discuss the relationship between reliability and validity.
INTRODUCTION
We have discussed the various methods of assessing student performance using
objective tests, essay tests, projects, observation checklists, oral tests and portfolio
assessment. In this topic, we will address two important issues, namely, the
reliability and validity of these assessment methods. How do we ensure that the
techniques we use for assessing the knowledge, skills and values of students are
reliable and valid? We are making important decisions about the abilities and
capabilities of the future generation and obviously we want to ensure that we are
making the right decisions.
Formally, an observed test score, X, is conceived as the sum of a true score, T, and an error term, E (X = T + E). The true score is defined as the average of the scores a student would obtain if the test were administered repeatedly (assuming the student could be made to forget the content of the test between administrations). Because the true score is defined as this average, the observed score in any single administration departs from the true score, and the difference is called measurement error. This departure is not caused by blatant mistakes made by test writers; it is caused by chance elements in a student's performance on a test.
Measurement error mostly comes from the fact that we have sampled only a small portion of a student's capabilities. Ambiguous questions, incorrect marking and so on can contribute to measurement error, but they are only a small part of it. Imagine that there are 10,000 items and that a student would obtain 60 percent if all 10,000 items were administered (which is not practically feasible). Then 60 percent is the true score. Now suppose you sample only 40 items to put in a test. The expected score for the student is 24 items, but the student may get 20, 26 or 30, depending on which items are in the test. This is the main source of measurement error. That is, measurement error is due to the sampling of items, rather than poorly written items.
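The item-sampling account of measurement error described above can be illustrated with a short simulation. The figures (a 10,000-item domain, a true score of 60 percent, 40-item tests) come from the passage itself; the code is only an illustrative sketch, not part of the original text.

```python
import random

random.seed(42)

# Domain of 10,000 items; the student can answer exactly 60 percent correctly.
domain = [1] * 6000 + [0] * 4000  # 1 = answered correctly, 0 = answered wrongly

# Draw many different 40-item tests and record the observed score on each.
observed_scores = []
for _ in range(1000):
    test_items = random.sample(domain, 40)   # sample 40 items without replacement
    observed_scores.append(sum(test_items))  # number answered correctly

mean_score = sum(observed_scores) / len(observed_scores)
print(min(observed_scores), max(observed_scores), round(mean_score, 1))
# Scores on individual tests vary around the expected 24 out of 40 (60 percent);
# that spread across equally fair samples of items is measurement error.
```

Averaged over many hypothetical administrations, the observed scores recover the true score of 24 out of 40, even though any single test may miss it by several items.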
Generally, the smaller the error, the greater the likelihood that you are closer to
measuring the true score of a student. If you are confident that your geometry
test (observed score) has a small error, then you can confidently infer that Swee Leong's score of 66 percent is close to his true score, or his actual ability in solving geometry problems, i.e. what he actually knows. To reduce the error in a test, you
must ensure that your test is reliable and valid. The higher the reliability and
validity of your test, the greater the likelihood that you will be measuring the
true score of your students. We will first examine the reliability of a test.
If there is relatively little error, the ratio of the true score variance to the observed score variance approaches 1.00, which is perfect reliability. If there is a relatively large amount of error, the ratio of the true score variance to the observed score variance approaches 0.00, which is total unreliability.
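The variance ratio described above can be made concrete with a small numeric sketch. The scores below are invented for illustration; the calculation assumes, as the classical model does, that errors average to zero and are uncorrelated with true scores.

```python
from statistics import pvariance

# Hypothetical true scores and measurement errors for five students.
true_scores = [50, 60, 70, 80, 90]
errors = [1, -2, 2, -2, 1]  # chosen to average to zero and be uncorrelated with true scores

observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = true-score variance / observed-score variance.
reliability = pvariance(true_scores) / pvariance(observed)
print(round(reliability, 2))
# Small error variance gives a ratio near 1.00; large error variance pushes it towards 0.00.
```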
High reliability means that the questions of a test tended to pull together.
Students who answered a given question correctly were more likely to answer
other questions correctly. If an equivalent or parallel test were developed by
using similar items, the relative scores of students would show little change. Low
reliability means that the questions tended to be unrelated to each other in terms
of who answered them correctly. The resulting test scores reflect that something is wrong with either the items or the testing situation rather than with students' knowledge of the subject matter. The following guidelines may be used to interpret reliability coefficients for classroom tests, as shown in Table 8.1.
Table 8.1: Guidelines for Interpreting Reliability Coefficients

Reliability       Interpretation
0.90 and above    Excellent reliability (comparable to the best standardised tests)
0.80–0.90         Very good for a classroom test
0.70–0.80         Good for a classroom test; there are probably a few items which could be improved
0.60–0.70         Somewhat low; there are probably some items which could be removed or improved
0.50–0.60         The test needs to be revised
0.50 and below    Questionable reliability; the test should be replaced or needs major revision
If you know the reliability coefficient of a test, can you estimate the true score of
a student on a test? In testing, we use the Standard Error of Measurement (SEM) to estimate the true score. The SEM is obtained from the standard deviation (SD) of the test scores and the reliability coefficient r:

SEM = SD × √(1 − r)

Using the normal curve, you can estimate a student's true score with some degree of certainty based on the observed score and the SEM.
Example:
You gave a history test to a group of 40 students. Khairul obtained a score of 75 in the test, which is his observed score. The standard deviation of your test is 2.0. Earlier, you had established that your history test had a reliability coefficient of 0.7. You are interested in finding out Khairul's true score.

The SEM is 2.0 × √(1 − 0.7) ≈ 1.1. Therefore, based on the normal distribution curve (see Figure 8.1), Khairul's true score should be:

(a) Between 75 − 1.1 and 75 + 1.1, or between 73.9 and 76.1, for 68 percent of the time.

(b) Between 75 − 2.2 and 75 + 2.2, or between 72.8 and 77.2, for 95 percent of the time.

(c) Between 75 − 3.3 and 75 + 3.3, or between 71.7 and 78.3, for 99 percent of the time.
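The interval estimates in the example follow directly from the Standard Error of Measurement. A short sketch of the calculation, using only the figures given above (SD = 2.0, reliability = 0.7, observed score = 75):

```python
import math

observed = 75
sd = 2.0
reliability = 0.7

# Standard Error of Measurement: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))  # about 1.1

# Confidence bands around the observed score (normal-curve approximation):
# 1 SEM ~ 68%, 2 SEM ~ 95%, 3 SEM ~ 99% of the time.
for k, confidence in [(1, 68), (2, 95), (3, 99)]:
    low, high = observed - k * sem, observed + k * sem
    print(f"{confidence}%: between {low:.1f} and {high:.1f}")
```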
SELF-CHECK 8.1
1. Define the reliability of a test.
2. What does the reliability coefficient indicate?
3. Explain the concept of a true score.
(a) Test-Retest
Using the Test-Retest technique, the same test is administered twice to the same group of students. The scores obtained in the first administration of the test are correlated with the scores obtained in the second administration. If the correlation between the two sets of scores is high, then the test can be considered to have high reliability. However, a test-retest situation is somewhat difficult to arrange, as students are unlikely to be prepared to take the same test twice.
There are also practice and memory effects that may influence the correlation. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because the two administrations become less related as time passes. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.
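The test-retest estimate is simply the Pearson correlation between the two sets of scores. As a sketch, with invented scores for five students (the names and numbers below are hypothetical, not from the text):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on two administrations of the same test.
first_administration = [50, 60, 70, 80, 90]
second_administration = [55, 58, 72, 79, 91]

r = pearson(first_administration, second_administration)
print(round(r, 2))  # a high correlation suggests high test-retest reliability
```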
(i) Split-Half
To avoid having to administer the same test twice, the split-half technique is used. In the split-half technique, a test is administered once to a group of students, and the test is divided into two equal halves after the students have completed it. This technique is most appropriate for tests which include multiple-choice items, true-false items and perhaps short-answer essays. The items are usually split by the odd-even method, whereby one half of the test consists of the odd-numbered items while the other half consists of the even-numbered items. The scores obtained for the two halves are then correlated, and the reliability of the whole test is estimated using the Spearman-Brown formula.
The Spearman-Brown formula estimates the reliability of the whole test (r_sb) from the correlation between the two halves (r_xy):

r_sb = 2r_xy / (1 + r_xy)

For a test of k dichotomously scored items with item difficulty indices p_i and total-score variance σ²x, Cronbach's alpha (α) is:

α = [k / (k − 1)] × [1 − Σ p_i(1 − p_i) / σ²x]
Example:
Suppose that in a multiple-choice test consisting of five items or questions, the following difficulty index for each item was observed: p1 = 0.4, p2 = 0.5, p3 = 0.6, p4 = 0.75 and p5 = 0.85, with sample variance (σ²x) = 1.84. Then Σ p_i(1 − p_i) = 0.240 + 0.250 + 0.240 + 0.1875 + 0.1275 = 1.045, and Cronbach's alpha would be calculated as follows:

α = [5 / (5 − 1)] × [1 − 1.045 / 1.840] = 0.54
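The worked example can be checked with a short function that implements the formula above for dichotomously scored items (in this form, the coefficient is also known as KR-20). The function name is ours; the figures are from the example:

```python
def kr20_alpha(p_values, score_variance):
    """Cronbach's alpha (KR-20 form) from item difficulty indices and total-score variance."""
    k = len(p_values)
    # Sum of item variances for dichotomous items: p_i * (1 - p_i)
    item_variance_sum = sum(p * (1 - p) for p in p_values)
    return (k / (k - 1)) * (1 - item_variance_sum / score_variance)

# Figures from the worked example above.
p = [0.4, 0.5, 0.6, 0.75, 0.85]
alpha = kr20_alpha(p, 1.84)
print(round(alpha, 2))  # 0.54, as in the example
```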
A Word of Caution!
When you get a low alpha, you should be careful not to conclude immediately that the test is a bad test. You should check whether the test measures several attributes or dimensions rather than one. If it does, Cronbach's alpha is likely to be deflated. For example, an aptitude test may measure three attributes or dimensions, such as quantitative ability, language ability and analytical ability. Hence, it is not surprising that Cronbach's alpha for the whole test may be low, as the questions may not correlate with each other. Why? Because the items are measuring three different types of human abilities. The solution is to compute three different alphas: one for quantitative ability, one for language ability and one for analytical ability.
SELF-CHECK 8.2
1. What is the main advantage of the split-half technique over the
test-retest technique in determining the reliability of a test?
Messick (1989) was most concerned about the inferences a teacher draws from
the test score, the interpretation the teacher makes about his or her students and
the consequences from such inferences and interpretation. You can imagine the
power an educator holds in his or her hand when designing a test. Your test
could determine the future of many thousands of students. Inferences based on a
test of low validity could give a completely different picture of the actual abilities
and competencies of students.
Three types of validity have been identified: construct validity, content validity and criterion-related validity, which is made up of predictive and concurrent validity (see Figure 8.4).
A construct is a hypothetical attribute of a person, such as intelligence, critical thinking, reading ability and so forth. You might think of construct validity as the correct labelling of something. For example, when you measure what you term critical thinking, is that what you are really measuring?
Thus, to ensure high construct validity, you must be clear about the
definition of the construct you intend to measure. For example, a construct
such as reading comprehension would include vocabulary development,
reading for literal meaning and reading for inferential meaning. Some
experts in educational measurement have argued that construct validity is
the most critical type of validity. You could establish the construct validity
of an instrument by correlating it with another test that measures the same
construct. For example, you could compare the scores obtained on your
reading comprehension test with the scores obtained on another well-
known reading comprehension test administered to the same sample of
students. If the scores for the two tests are highly correlated, then you may
conclude that your reading comprehension test has high construct validity.
For example, the science unit on Energy and Forces may include facts,
concepts, principles and skills on light, sound, heat, magnetism and
electricity. However, it is difficult, if not impossible, to administer a two- to three-hour paper to test all aspects of the syllabus on Energy and Forces (see
Figure 8.5). Therefore, only selected facts, concepts, principles and skills
from the syllabus (or domain) are sampled. The content selected will be
determined by content experts who will judge the relevance of the content
in the test to the content in the syllabus or particular domain.
Copyright Open University Malaysia (OUM)
168 TOPIC 8 RELIABILITY AND VALIDITY OF ASSESSMENT TECHNIQUES
Figure 8.5: Sample of content tested for the unit on Energy and Forces
Content validity will be low if the questions in the test include questions
testing content not included in the domain or syllabus. To ensure content
validity and coverage, most teachers use the Table of Specifications (as
discussed in Topic 3). Table 8.2 is an example of a table of specifications
which specifies the knowledge and skills to be measured and the topics
covered for the unit on Energy and Forces. You cannot measure all the
content of a topic and so you will have to focus on the key areas and give
due weighting to those areas that are important. For example, the teacher has decided that 64 percent of the questions will emphasise the understanding of concepts, while 36 percent will focus on the application of concepts across the five topics. A table of specifications provides teachers with evidence that a test has high content validity, that it covers what should be covered.
Table 8.2: Table of Specifications for the Unit on Energy and Forces

Topics        Understanding of Concepts    Application of Concepts    Total
Light         7                            4                          11 (22%)
Sound         7                            4                          11 (22%)
Heat          7                            4                          11 (22%)
Magnetism     3                            3                          6 (12%)
Electricity   8                            3                          11 (22%)
TOTAL         32 (64%)                     18 (36%)                   50 (100%)
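A table of specifications like Table 8.2 can also be kept as a simple data structure, which makes it easy to check that the item counts and weightings add up as intended. This is only a sketch; the topic counts are taken from the table:

```python
# (understanding, application) item counts per topic, from Table 8.2.
specification = {
    "Light":       (7, 4),
    "Sound":       (7, 4),
    "Heat":        (7, 4),
    "Magnetism":   (3, 3),
    "Electricity": (8, 3),
}

understanding = sum(u for u, a in specification.values())
application = sum(a for u, a in specification.values())
total = understanding + application

print(f"Understanding: {understanding} items ({100 * understanding // total}%)")
print(f"Application:   {application} items ({100 * application // total}%)")
print(f"Total:         {total} items")
```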
Content validity is different from face validity, which refers not to what the
test actually measures, but to what it superficially appears to measure. Face
validity assesses whether the test looks valid to the examinees who take
it, the administrative personnel who decide on its use, and other technically
untrained observers. Face validity is a weak measure of validity, but that does not mean it is incorrect, only that caution is necessary. Its importance should not be underestimated.
Figure 8.6: Graphical representations of the relationship between reliability and validity
On the other hand, your Inductive Reasoning Test can be reliable but not valid. How is that possible? Your test may not measure inductive reasoning, but the scores you obtain each time you administer the test are approximately the same (see the second picture). In other words, the test is consistently and systematically measuring the wrong construct (i.e. something other than inductive reasoning). Imagine the consequences of making judgements about the inductive reasoning of students using such a test!
The worst-case scenario is when the test is neither reliable nor valid (see the first picture). In this scenario, the scores obtained by students are scattered across the top half of the target and consistently miss the centre. Your measure in this case is neither reliable nor valid, and the test should be rejected or improved.
The true score is a hypothetical concept with regard to the actual ability,
competency and capacity of an individual.
The higher the reliability and validity of your test, the greater the likelihood
that you will be measuring the true scores of your students.
Using the Test-Retest technique, the same test is administered again to the
same group of students.
For the parallel or equivalent forms technique, two equivalent tests (or forms)
are administered to the same group of students.
When two or more people mark essay questions, the extent to which there is
agreement in the marks allotted is called inter-rater reliability.
Some people may think of reliability and validity as two separate concepts. In
reality, reliability and validity are related.