
12

Validity and Reliability

Studying this chapter should enable you to


1. Distinguish between validity and reliability.
2. Describe the essential features of the concept of validity.
3. Describe how content-related evidence of validity is obtained.
4. List factors that can lower the validity of achievement assessments.
5. Describe procedures for obtaining criterion-related evidence of validity.
6. Describe procedures for obtaining construct-related evidence of validity.
7. Describe the role of consequences of using an assessment procedure on its validity.
8. Describe the methods for estimating test reliability and the type of information
provided by each.
9. Describe how the standard error of measurement is computed and interpreted.
10. Explain how to determine the reliability of a performance-based assessment.

The two most important questions to ask about a test or other assessment procedure are: (1) to what extent will the interpretation of the results be appropriate, meaningful, and useful? and (2) to what extent will the results be free from errors? The first question is concerned with validity, the second with reliability. An understanding of both concepts is essential to the effective construction, selection, interpretation, and use of tests and other assessment instruments. Validity is the most important quality to consider in the preparation and use of assessment procedures. First and foremost, we want the results to provide a representative and relevant measure of the achievement domain under consideration. Our second consideration is reliability, which refers to the consistency of our assessment results. For example, if we tested individuals at a different time, or with a different sample of equivalent items, we would like to obtain approximately the same results. This consistency of results is important for two reasons. (1) Unless the results are fairly stable we cannot expect them to be valid. For example, if an individual scored high on a test one time and low another time, it would be impossible to validly describe the achievement. (2) Consistency of results indicates smaller errors of measurement and, thereby, more dependable results. Thus, reliability provides the consistency needed to obtain validity and enables us to interpret assessment results with greater confidence.

Although it is frequently unnecessary to make elaborate validation and reliability studies of informal assessment procedures, an understanding of these concepts provides a conceptual framework that can serve as a guide for more effective construction of assessment instruments, more effective selection of standardized tests, and more appropriate interpretation and use of assessment results.

FIGURE 12.1 Types of considerations in determining the validity of assessment results: content representativeness, criterion relationships, construct evidence, and consequences of using the assessment.

Validity

Validity is concerned with the interpretation and use of assessment results. For example, if we infer from an assessment that students have achieved the intended learning outcomes, we would like some assurance that our tasks provided a relevant and representative measure of the outcomes. If we infer that the assessment is useful for predicting or estimating some other performance, we would like some credible evidence to support that interpretation. If we infer that our assessment indicates that students have good "reasoning ability," we would like some evidence to support the fact that the results actually reflect that construct. If we infer that our use of an assessment had positive effects (e.g., increased motivation) and no adverse effects (e.g., poor study habits) on students, we would like some evidence concerning the consequences of its use. These are the kinds of considerations we are concerned with when considering the validity of assessment results (see the summary in Figure 12.1).

The meaning of validity in interpreting and using assessment results can be grasped most easily by reviewing the following characteristics.

1. Validity is inferred from available evidence (not measured).
2. Validity depends on many different types of evidence.
3. Validity is expressed by degree (high, moderate, low).
4. Validity is specific to a particular use.
5. Validity refers to the inferences drawn, not the instrument.
6. Validity is a unitary concept.
7. Validity is concerned with the consequences of using the assessments.

Describing validity as a unitary concept is a basic change in how validity is viewed. The traditional view that there were several different "types of validity" has been replaced by the view that validity is a single, unitary concept that is based on various forms of evidence. The former "types of validity" (content, criterion-related, and construct) are now simply considered to be convenient categories for accumulating evidence to support the validity of an interpretation. Thus, we no longer speak of "content validity," but of "content-related evidence" of validity. Similarly, we speak of "criterion-related evidence" and "construct-related evidence."

For some interpretations of assessment results only one or two types of evidence may be critical, but an ideal validation would include evidence from all four categories. We are most likely to draw valid inferences from assessment results when we have a full understanding of: (1) the nature of the assessment procedure and the specifications that were used in developing it, (2) the relation of the assessment results to significant criterion measures, (3) the nature of the psychological characteristic(s) or construct(s) being assessed, and (4) the consequences of using the assessment. Although in many practical situations the evidence falls short of this ideal, we should gather as much relevant evidence as is feasible within the constraints of the situation. We should also look for the various types of evidence when evaluating standardized tests (see Table 12.1).
TABLE 12.1 Basic Approaches to Validation

Content-Related: How adequately does the sample of assessment tasks represent the domain of tasks to be measured?
Criterion-Related: How accurately does performance on the assessment (e.g., test) predict future performance (predictive study) or estimate present performance (concurrent study) on some other valued measure called a criterion?
Construct-Related: How well can performance on the assessment be explained in terms of psychological characteristics?
Consequences: How well did use of the assessment serve the intended purpose (e.g., improve performance) and avoid adverse effects (e.g., poor study habits)?

Content-Related Evidence

Content-related evidence of validity is critical when we want to use performance on a set of tasks as evidence of performance on a larger domain of tasks. Let's assume, for example, that we have a list of 500 words that we expect our students to be able to spell correctly at the end of the school year. To test their spelling ability, we might give them a 50-word spelling test. Their performance on these words is important only insofar as it provides evidence of their ability to spell the 500 words. Thus, our spelling test would provide a valid measure to the degree to which it provided an adequate sample of the 500 words it represented. If we selected only easy words, only difficult words, or only words that represented certain types of common spelling errors, our test would tend to be unrepresentative and thus the scores would have low validity. If we selected a balanced sample of words that took these and similar factors into account, our test scores would provide a representative measure of the 500 spelling words and thereby provide for high validity.

It should be clear from this discussion that the key element in content-related evidence of validity is the adequacy of the sampling. An assessment is always a sample of the many tasks that could be included. Content validation is a matter of determining whether the sample of tasks is representative of the larger domain of tasks it is supposed to represent.

Content-related evidence of validity is especially important in achievement assessment. Here we are interested in how well the assessment measures the intended learning outcomes of the instruction. We can provide greater assurance that an assessment provides valid results by (1) identifying the learning outcomes to be assessed, (2) preparing a plan that specifies the sample of tasks to be used, and (3) preparing an assessment procedure that closely fits the set of specifications. These are the best procedures we have for ensuring the assessment of a representative sample of the domain of tasks encompassed by the intended learning outcomes.

Although the focus of content-related evidence of validity is on the adequacy of the sampling, a valid interpretation of the assessment results assumes that the assessment was properly prepared, administered, and scored. Validity can be lowered by inadequate procedures in any of these areas (see Box 12.1). Thus, validity is "built in" during the planning and preparation stages and maintained by proper administration and scoring. Throughout this book we have described how to prepare assessments that provide valid results, even though we have not used the word validity as each procedure was discussed.

The makers of standardized tests follow these same systematic procedures in building achievement tests, but the content and learning outcomes included in the test specifications are more broadly based than those used in classroom assessment. Typically, they are based on the leading textbooks and the recommendations of various experts in the area being covered by the test. Therefore, a standardized achievement test may be representative of a broad range of content but be unrepresentative of the domain of content taught in a particular school situation. To determine relevance to the local situation, it is necessary to evaluate the sample of test items in light of the content and skills emphasized in the instruction.

In summary, content-related evidence of validity is of major concern in achievement assessment, whether you are developing or selecting the assessment procedure. When constructing a test, for example, content relevance and representativeness are built in by following a systematic procedure for specifying and selecting the sample of test items, constructing high quality items, and arranging the test for efficient administration and scoring. In test selection, it is a matter of comparing the test sample to the domain of tasks to be measured and determining the degree of correspondence between them. Similar care is needed when preparing and using performance assessments. Thus, content-related evidence of validity is obtained primarily by careful, logical analysis.

BOX 12.1 Factors That Lower the Validity of Assessment Results

1. Tasks that provide an inadequate sample of the achievement to be assessed.
2. Tasks that do not function as intended, due to lack of relevance, ambiguity, clues, bias, inappropriate difficulty, or similar factors.
3. Improper arrangement of tasks and unclear directions.
4. Too few tasks for the types of interpretation to be made (e.g., interpreting mastery of each objective based on a few test items).
5. Improper administration, such as inadequate time limits and poorly controlled conditions.
6. Judgmental scoring that uses inadequate scoring guides, or objective scoring that contains computational errors.
Criterion-Related Evidence

There are two types of studies used in obtaining criterion-related evidence of validity. These can be explained most clearly using test scores, although they could be used with any type of assessment result. The first type of study is concerned with the use of test performance to predict future performance on some other valued measure called a criterion. For example, we might use scholastic aptitude test scores to predict course grades (the criterion). For obvious reasons, this is called a predictive study. The second type of study is concerned with the use of test performance to estimate current performance on some criterion. For instance, we might want to use a test of study skills to estimate what the outcome would be of a careful observation of students in an actual study situation (the criterion). Since with this procedure both measures (test and criterion) are obtained at approximately the same time, this type of study is called a concurrent study.

Although the value of using a predictive study is rather obvious, a question might be raised concerning the purpose of a concurrent study. Why would anyone want to use test scores to estimate performance on some other measure that is to be obtained at the same time? There are at least three good reasons for doing this. First, we may want to check the results of a newly constructed test against some existing test that has a considerable amount of validity evidence supporting it. Second, we may want to substitute a brief, simple testing procedure for a more complex and time-consuming measure. For example, our test of study skills might be substituted for an elaborate rating system if it provided a satisfactory estimate of study performance. Third, we may want to determine whether a testing procedure has potential as a predictive instrument. If a test provides an unsatisfactory estimate of current performance, it certainly cannot be expected to predict future performance on the same measure. On the other hand, a satisfactory estimate of present performance would indicate that the test may be useful in predicting future performance as well. This would inform us that a predictive study would be worth doing.

The key element in both types of criterion-related study is the degree of relationship between the two sets of measures: (1) the test scores and (2) the criterion to be predicted or estimated. This relationship is typically expressed by means of a correlation coefficient or an expectancy table.

Correlation Coefficients. Although the computation of correlation coefficients is beyond the scope of this book, the concept of correlation can easily be grasped. A correlation coefficient (r) simply indicates the degree of relationship between two sets of measures. A positive relationship is indicated when high scores on one measure are accompanied by high scores on the other; low scores on the two measures are similarly associated. A negative relationship is indicated when high scores on one measure are accompanied by low scores on the other. The extreme degrees of relationship it is possible to obtain between two sets of scores are indicated by the following values:

1.00 = perfect positive relationship
.00 = no relationship
-1.00 = perfect negative relationship

When a correlation coefficient is used to express the degree of relationship between a set of test scores and some criterion measure, it is called a validity coefficient. For example, a validity coefficient of 1.00 applied to the relationship between a set of aptitude test scores (the predictor) and a set of achievement test scores (the criterion) would indicate that each individual in the group had exactly the same relative standing on both measures, and would thereby provide a perfect prediction from the aptitude scores to the achievement scores. Most validity coefficients are smaller than this, but the extreme positive relationship provides a useful benchmark for evaluating validity coefficients. The closer the validity coefficient approaches 1.00, the higher the degree of relationship and, thus, the more accurate our predictions of each individual's success on the criterion will be.

A more realistic procedure for evaluating a validity coefficient is to compare it to the validity coefficients that are typically obtained when the two measures are correlated. For example, a validity coefficient of .40 between a set of aptitude test scores and achievement test scores would be considered small because we typically obtain coefficients in the .50 to .70 range for these two measures. Therefore, validity coefficients must be judged on a relative basis, the larger coefficients being favored. To use validity coefficients effectively, one must become familiar with the size of the validity coefficients that are typically obtained between various pairs of measures under different conditions (e.g., the longer the time span between measures, the smaller the validity coefficient).
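To make the idea concrete, the short sketch below computes a validity coefficient (a Pearson r) between a predictor and a criterion. The score lists are invented for illustration and are not data from this chapter.

```python
# Sketch: computing a validity coefficient (Pearson r) between a predictor
# (e.g., aptitude test scores) and a criterion (e.g., later achievement scores).
# The score lists are hypothetical; real data would come from school records.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

aptitude    = [52, 47, 61, 38, 55, 44, 66, 50]   # predictor scores (hypothetical)
achievement = [58, 50, 70, 41, 60, 49, 72, 57]   # criterion scores (hypothetical)

r = pearson_r(aptitude, achievement)
print(f"validity coefficient r = {r:.2f}")   # values near +1.00 mean a strong positive relationship
```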
Expectancy Table. The expectancy table is a simple and practical means of expressing criterion-related evidence of validity and is especially useful for making predictions from test scores. The expectancy table is simply a twofold chart with the test scores (the predictor) arranged in categories down the left side of the table and the measure to be predicted (the criterion) arranged in categories across the top of the table. For each category of scores on the predictor, the table indicates the percentage of individuals who fall within each category of the criterion. An example of an expectancy table is presented in Table 12.2.
TABLE 12.2 Expectancy Table Showing the Relation between Scholastic Aptitude Scores and Course Grades for 30 Students in a Science Course

Grouped Scholastic Aptitude Scores (Stanines) and the Percentage in Each Score Category Receiving Each Grade (E, D, C, B, A):
Above Average (7, 8, 9): C 14, B 43, A 43
Average (4, 5, 6): D 19, C 25, B 37, A 19
Below Average (1, 2, 3): D 29, C 57, B 14

Note in Table 12.2 that of those students who were in the above-average group (stanines 7, 8, and 9) on the test scores, 43 percent received a grade of A, 43 percent a B, and 14 percent a C. Although these percentages are based on this particular group, it is possible to use them to predict the future performance of other students in this science course. Hence, if a student falls in the above-average group on this scholastic aptitude test, we might predict that he or she has 43 chances out of 100 of earning an A, 43 chances out of 100 of earning a B, and 14 chances out of 100 of earning a C in this particular science course. Such predictions are highly tentative, of course, due to the small number of students on which this expectancy table was built. Teachers can construct more dependable tables by accumulating data from several classes over a period of time.

Expectancy tables can be used to show the relationship between any two measures. Constructing the table is simply a matter of (1) grouping the scores on each measure into a series of categories (any number of them), (2) placing the two sets of categories on a twofold chart, (3) tabulating the number of students who fall into each position in the table (based on the student's standing on both measures), and (4) converting these numbers to percentages (of the total number in that row). Thus, the expectancy table is a clear way of showing the relationship between sets of scores. Although the expectancy table is more cumbersome to deal with than a correlation coefficient, it has the special advantage of being easily understood by persons without knowledge of statistics. Thus, it can be used in practical situations to clarify the predictive efficiency of a test.
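The four steps just described can also be carried out with a few lines of code. The sketch below, using invented predictor categories and letter grades rather than the data in Table 12.2, tabulates the twofold chart and converts each row to percentages.

```python
# Sketch: building an expectancy table from paired observations.
# Each pair is (predictor category, criterion category); the data are hypothetical.
from collections import Counter, defaultdict

pairs = [("Above Avg", "A"), ("Above Avg", "B"), ("Average", "B"),
         ("Average", "C"), ("Below Avg", "C"), ("Below Avg", "D"),
         ("Average", "B"), ("Above Avg", "A"), ("Below Avg", "C")]

# Steps 1-3: tabulate how many students fall in each cell of the twofold chart.
cells = defaultdict(Counter)
for predictor_cat, criterion_cat in pairs:
    cells[predictor_cat][criterion_cat] += 1

# Step 4: convert each row's counts to percentages of that row's total.
for predictor_cat, row in cells.items():
    total = sum(row.values())
    percents = {grade: round(100 * count / total) for grade, count in row.items()}
    print(predictor_cat, percents)
```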
Construct-Related Evidence

The construct-related category of evidence focuses on assessment results as a basis for inferring the possession of certain psychological characteristics. For example, we might want to describe a person's reading comprehension, reasoning ability, or mechanical aptitude. These are all hypothetical qualities, or constructs, that we assume exist in order to explain behavior. Such theoretical constructs are useful in describing individuals and in predicting how they will act in many different specific situations. To describe a person as being highly intelligent, for example, is useful because that term carries with it a series of associated meanings that indicate what the individual's behavior is likely to be under various conditions. Before we can interpret assessment results in terms of these broad behavior descriptions, however, we must first establish that the constructs that are presumed to be reflected in the scores actually do account for differences in performance.

Construct-related evidence of validity for a test includes (1) a description of the theoretical framework that specifies the nature of the construct to be measured, (2) a description of the development of the test and any aspects of measurement that may affect the meaning of the test scores (e.g., test format), (3) the pattern of relationship between the test scores and other significant variables (e.g., high correlations with similar tests and low correlations with tests measuring different constructs), and (4) any other type of evidence that contributes to the meaning of the test scores (e.g., analyzing the mental process used in responding, determining the predictive effectiveness of the test). The specific types of evidence that are most critical for a particular test depend on the nature of the construct, the clarity of the theoretical framework, and the uses to be made of the test scores. Although the gathering of construct-related evidence of validity can be endless, in practical situations it is typically necessary to limit the evidence to that which is most relevant to the interpretations to be made.

The construct-related category of evidence is the broadest of the three categories. Evidence obtained in both the content-related category (e.g., representativeness of the sample of tasks) and the criterion-related category (e.g., how well the scores predict performance on specific criteria) is also relevant to the construct-related category because it helps to clarify the meaning of the assessment results. Thus, the construct-related category encompasses a variety of types of evidence, including that from content-related and criterion-related validation studies (see Figure 12.2).

FIGURE 12.2 Construct validation includes all categories of evidence: content-related studies, criterion-related studies, and other relevant evidence all contribute to the validity of inferences.
The broad array of evidence that might be considered can be illustrated by a test designed to measure mathematical reasoning ability. Some of the evidence we might consider is:

1. Compare the sample of test tasks to the domain of tasks specified by the conceptual framework of the construct. Is the sample relevant and representative (content-related evidence)?
2. Examine the test features and their possible influence on the meaning of the scores (e.g., test format, directions, scoring, reading level of items). Is it possible that some features might distort the scores?
3. Analyze the mental process used in answering the questions by having students "think aloud" as they respond to each item. Do the items require the intended reasoning process?
4. Determine the internal consistency of the test by intercorrelating the test items. Do the items seem to be measuring a single characteristic (in this case mathematical reasoning)?
5. Correlate the test scores with the scores of other mathematical reasoning tests. Do they show a high degree of relationship?
6. Compare the scores of known groups (e.g., mathematics majors and nonmajors). Do the scores differentiate between the groups as predicted (see the sketch following this list)?
7. Compare the scores of students before and after specific training in mathematical reasoning. Do the scores change as predicted from the theory underlying the construct?
8. Correlate the scores with grades in mathematics. Do they correlate to a satisfactory degree (criterion-related evidence)?
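As one concrete illustration of item 6 above, the sketch below compares the mean scores of two known groups. The scores are hypothetical, and the comparison is shown only as an example of the kind of check described, not as a procedure prescribed by the chapter.

```python
# Sketch: a known-groups comparison as construct-related evidence.
# Scores are hypothetical; the question is whether the groups differ as predicted.
from statistics import mean, stdev

majors    = [34, 38, 31, 40, 36, 39]   # mathematical reasoning scores (hypothetical)
nonmajors = [27, 30, 25, 33, 28, 29]

diff = mean(majors) - mean(nonmajors)
# A pooled standard deviation gives a rough effect size (difference in SD units).
pooled_sd = ((stdev(majors) ** 2 + stdev(nonmajors) ** 2) / 2) ** 0.5
print(f"mean difference = {diff:.1f} points, about {diff / pooled_sd:.1f} SDs")
```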
Other types of evidence could be added to this list, but it is sufficiently comprehensive to make clear that no single type of evidence is adequate. Interpreting test scores as a measure of a particular construct involves a comprehensive study of the development of the test, how it functions in a variety of situations, and how the scores relate to other significant measures.

Assessment results are, of course, influenced by many factors other than the construct they are designed to measure. Thus, construct validation is an attempt to account for all possible influences on the scores. We might, for example, ask to what extent the scores on our mathematical reasoning test are influenced by reading comprehension, computation skill, and speed. Each of these factors would require further study. Were attempts made to eliminate such factors during test development by using simple vocabulary, simple computations, and liberal time limits? To what extent do the test scores correlate with measures of reading comprehension and computational skill? How do students' scores differ under different time limits? Answers to these and similar questions will help us to determine how well the test scores reflect the construct we are attempting to measure and the extent to which other factors might be influencing the scores.

Construct validation, then, is an attempt to clarify and verify the inferences to be made from assessment results. This involves a wide variety of procedures and many different types of evidence (including both content-related and criterion-related). As evidence accumulates from many different sources, our interpretations of the results are enriched and we are able to make them with great confidence.

Consequences of Using Assessment Results

Validity focuses on the inferences drawn from assessment results with regard to specific uses. Therefore, it is legitimate to ask, What are the consequences of using the assessment? Did the assessment improve learning, as intended, or did it contribute to adverse effects (e.g., lack of motivation, memorization, poor study habits)? For example, assessment procedures that focus on simple learning outcomes only (e.g., knowledge of facts) cannot provide valid evidence of reasoning and application skills, are likely to narrow the focus of student learning, and tend to reinforce poor learning strategies (e.g., rote learning). Thus, in evaluating the validity of the assessment used, one needs to look at what types of influence the assessments have on students. The following questions provide a general framework for considering some of the possible consequences of assessments on students.

1. Did use of the assessment improve motivation?
2. Did use of the assessment improve performance?
3. Did use of the assessment improve self-assessment skills?
4. Did use of the assessment contribute to transfer of learning to related areas?
5. Did use of the assessment encourage independent learning?
6. Did use of the assessment encourage good study habits?
7. Did use of the assessment contribute to a positive attitude toward schoolwork?
8. Did use of the assessment have an adverse effect in any of the above areas?

Judging the consequences of using the various assessment procedures is an important role of the teacher, if the results are to serve their intended purpose of improving learning. Both testing and performance assessments are most likely to have positive consequences when they are designed to assess a broad range of learning outcomes, they give special emphasis to complex learning outcomes, they are administered and scored (or judged) properly, they are used to identify students' strengths and weaknesses in learning, and the students view the assessments as fair, relevant, and useful for improving learning.
Reliability

Reliability refers to the consistency of assessment results. Would we obtain about the same results if we used a different sample of the same type of task? Would we obtain about the same results if we used the assessment at a different time? If a performance assessment is being rated, would different raters rate the performance the same way? These are the kinds of questions we are concerned about when we are considering the reliability of assessment results. Unless the results are generalizable over similar samples of tasks, time periods, and raters, we are not likely to have great confidence in them.

Because the methods for estimating reliability differ for tests and performance assessments, these will be treated separately.

Estimating the Reliability of Test Scores

The score an individual receives on a test is called the obtained score, raw score, or observed score. This score typically contains a certain amount of error. Some of this error may be systematic error, in that it consistently inflates or lowers the obtained score. For example, readily apparent clues in several test items might cause all students' scores to be higher than their achievement would warrant, or short time limits during testing might cause all students' scores to be lower than their "real achievement." The factors causing systematic errors are mainly due to inadequate testing practices. Thus, most of these errors can be eliminated by using care in constructing and administering tests. Removing systematic errors from test scores is especially important because they have a direct effect on the validity of the inferences made from the scores.

Some of the error in obtained scores is random error, in that it raises and lowers scores in an unpredictable manner. Random errors are caused by such things as temporary fluctuations in memory, variations in motivation and concentration from time to time, carelessness in marking answers, and luck in guessing. Such factors cause test scores to be inconsistent from one measurement to another. Sometimes an individual's obtained score will be higher than it should be and sometimes it will be lower. Although these errors are difficult to control and cannot be predicted with accuracy, an estimate of their influence can be obtained by various statistical procedures. Thus, when we talk about estimating the reliability of test scores or the amount of measurement error in test scores, we are referring to the influence of random errors.

Reliability refers to the consistency of test scores from one measurement to another. Because of the ever-present measurement error, we can expect a certain amount of variation in test performance from one time to another, from one sample of items to another, and from one part of the test to another. Reliability measures provide an estimate of how much variation we might expect under different conditions. The reliability of test scores is typically reported by means of a reliability coefficient or the standard error of measurement that is derived from it. Since both methods of estimating reliability require score variability, the procedures to be discussed are useful primarily with tests designed for norm-referenced interpretation.

As we noted earlier, a correlation coefficient expressing the relationship between a set of test scores and a criterion measure is called a validity coefficient. A reliability coefficient is also a correlation coefficient, but it indicates the correlation between two sets of measurements taken from the same procedure. We may, for example, administer the same test twice to a group, with a time interval in between (test-retest method); administer two equivalent forms of the test in close succession (equivalent-forms method); administer two equivalent forms of the test with a time interval in between (test-retest with equivalent forms method); or administer the test once and compute the consistency of the responses within the test (internal-consistency method). Each of these methods of obtaining reliability provides a different type of information. Thus, reliability coefficients obtained with the different procedures are not interchangeable. Before deciding on the procedure to be used, we must determine what type of reliability evidence we are seeking. The four basic methods of estimating reliability and the type of information each provides are shown in Table 12.3.

TABLE 12.3 Methods of Estimating Reliability of Test Scores

Test-retest method: The stability of test scores over a given period of time.
Equivalent-forms method: The consistency of the test scores over different forms of the test (that is, different samples of items).
Test-retest with equivalent forms: The consistency of test scores over both a time interval and different forms of the test.
Internal-consistency methods: The consistency of test scores over different parts of the test.

Note: Scorer reliability should also be considered when evaluating the responses to supply-type items (for example, essay tests). This is typically done by having the test papers scored independently by two scorers and then correlating the two sets of scores. Agreement among scorers, however, is not a substitute for the methods of estimating reliability shown in the table.
Test-Retest Method. The test-retest method requires administering the same form of the test to the same group after some time interval. The time between the two administrations may be just a few days or several years. The length of the time interval should fit the type of interpretation to be made from the results. Thus, if we are interested in using test scores only to group students for more effective learning, short-term stability may be sufficient. On the other hand, if we are attempting to predict vocational success or make some other long-range predictions, we would desire evidence of stability over a period of years.

Test-retest reliability coefficients are influenced both by errors within the measurement procedure and by the day-to-day stability of the students' responses. Thus, longer time periods between testing will result in lower reliability coefficients, due to the greater changes in the students. In reporting test-retest reliability coefficients, then, it is important to include the time interval. For example, a report might state: "The stability of test scores obtained on the same form over a three-month period was .90." This makes it possible to determine the extent to which the reliability data are significant for a particular interpretation.

Equivalent-Forms Method. With this method, two equivalent forms of a test (also called alternate forms or parallel forms) are administered to the same group during the same testing session. The test forms are equivalent in the sense that they are built to measure the same abilities (that is, they are built to the same set of specifications), but for determining reliability it is also important that they be constructed independently. When this is the case, the reliability coefficient indicates the adequacy of the test sample. That is, a high reliability coefficient would indicate that the two independent samples are apparently measuring the same thing. A low reliability coefficient, of course, would indicate that the two forms are measuring different behavior and that therefore both samples of items are questionable.

Reliability coefficients determined by this method take into account errors within the measurement procedures and consistency over different samples of items, but they do not include the day-to-day stability of the students' responses.

Test-Retest Method with Equivalent Forms. This is a combination of both methods. Here, two different forms of the same test are administered with time intervening. This is the most demanding estimate of reliability, since it takes into account all possible sources of variation. The reliability coefficient reflects errors within the testing procedure, consistency over different samples of items, and the day-to-day stability of the students' responses. For most purposes, this is probably the most useful type of reliability, since it enables us to estimate how generalizable the test results are over the various conditions. A high reliability coefficient obtained by this method would indicate that a test score represents not only present test performance but also what test performance is likely to be at another time or on a different sample of equivalent items.

Internal-Consistency Methods. These methods require only a single administration of a test. One procedure, the split-half method, involves scoring the odd items and the even items separately and correlating the two sets of scores. This correlation coefficient indicates the degree to which the two arbitrarily selected halves of the test provide the same results. Thus, it reports on the internal consistency of the test. Like the equivalent-forms method, this procedure takes into account errors within the testing procedure and consistency over different samples of items, but it omits the day-to-day stability of the students' responses.

Since the correlation coefficient based on the odd and even items indicates the relationship between two halves of the test, the reliability coefficient for the total test is determined by applying the Spearman-Brown prophecy formula. A simplified version of this formula is as follows:

Reliability of total test = (2 x reliability for 1/2 test) / (1 + reliability for 1/2 test)

Thus, if we obtained a correlation coefficient of .60 for two halves of a test, the reliability for the total test would be computed as follows:

Reliability of total test = (2 x .60) / (1 + .60) = 1.20 / 1.60 = .75

This application of the Spearman-Brown formula makes clear a useful principle of test reliability: the reliability of a test can be increased by lengthening it. This formula shows how much reliability will increase when the length of the test is doubled. Application of the formula, however, assumes that the test is lengthened by adding items like those already in the test.
Another internal-consistency method of estimating reliability is by use of the Kuder-Richardson Formula 20 (KR-20). Kuder and Richardson developed other formulas, but this one is probably the most widely used with standardized tests. It requires a single test administration, a determination of the proportion of individuals passing each item, and the standard deviation of the total set of scores. The formula is not especially helpful in understanding how to interpret the scores, but knowing what the coefficient means is important. Basically, the KR-20 is equivalent to an average of all split-half coefficients when the test is split in all possible ways. Where all items in a test are measuring the same thing (e.g., math reasoning), the result should approximate the split-half reliability estimate. Where the test items are measuring a variety of skills or content areas (i.e., less homogeneous), the KR-20 estimate will be lower than the split-half reliability estimate. Thus, the KR-20 method is useful with homogeneous tests but can be misleading if used with a test designed to measure heterogeneous content.

Internal-consistency methods are used because they require that the test be administered only once. They should not be used with speeded tests, however, because a spuriously high reliability estimate will result. If speed is an important factor in the testing (that is, if the students do not have time to attempt all the items), other methods should be used to estimate reliability.
be made of them.
important factor in the testing (that is, if the students do not have time to
attempt all the items), other methods should be used to estimate reliability. When a criterion-referenced test is used to determine mastery, our pri-
mary concern is with how consistently our test classifies masters and non-
Standard Error of Measurement The standard error of measurement is an masters. If we administered two equivalent forms of a test to the same group
especially useful way of expressing test reliability because it indicates the of students, for example, we would like the results of both forms to identify
amount of error to allow for when interpreting individual test scores. The the same students as having mastered the material. Such perfect agreement
standard error is derived from a reliability coefficient by means of the fol- is unrealistic, of course, since some !itudents near the cutoff score are likely
lowing formula: to shift from one category to the other on the basis of errors of measurement
(due to such factors as lucky guesses or lapses of memory). However, if too
Standard error of measurement = s{i - rn many students demonstrated mastery on one form but nonmastery on the
\
other, our decisions concerning who mastered the material would be hope-
where s =the standard deviation and rn= the reliability coefficient. In apply- lessly confused. Thus, the reliability of mastery tests can be determined by
ing this formula to a reliability estimate of .60 obtained for a test where 5 = computing the percentage of consistent mastery-nonmastery decisions over
4.5, the following results would be obtained. the two forms of the test.
The procedure for comparing test performance on two equivalent
Standard error of measurement = 4.5-{1- .60
= 4.5-{AD forms of a test is relatively simple. After both forms have been administered
= 4.5 x .63 to a group of students, the resulting data can be placed in a two-by-two table
like that shown in Figure 12.3. These data are based on two forms of a 25-
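The computation above translates directly into code. The sketch below repeats the worked example (s = 4.5, reliability = .60) and also forms the kind of score band discussed next, for an assumed observed score of 35.

```python
# Sketch: standard error of measurement and a "reasonable limits" score band.
import math

def standard_error(sd, reliability):
    # SEM = s * sqrt(1 - r)
    return sd * math.sqrt(1 - reliability)

sem = standard_error(sd=4.5, reliability=0.60)
print(f"SEM = {sem:.1f}")                      # about 2.8, as in the worked example

observed = 35                                  # a student's obtained score (hypothetical)
band = (observed - round(sem), observed + round(sem))
print(f"score band: {band[0]} to {band[1]}")   # roughly 32 to 38
```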
The standard error of measurement shows how many points we must add to, and subtract from, an individual's test score in order to obtain "reasonable limits" for estimating that individual's true score (that is, a score free of error). In our example, the standard error would be rounded to 3 score points. Thus, if a given student scored 35 on this test, that student's score band, for establishing reasonable limits, would range from 32 (35 - 3) to 38 (35 + 3). In other words, we could be reasonably sure that the score band of 32 to 38 included the student's true score (statistically, there are two chances out of three that it does). The standard errors of test scores provide a means of allowing for error during test interpretation. If we view test performance in terms of score bands (also called confidence bands), we are not likely to overinterpret small differences between test scores.

For the test user, the standard error of measurement is probably more useful than the reliability coefficient. Although reliability coefficients can be used in evaluating the quality of a test and in comparing the relative merits of different tests, the standard error of measurement is directly applicable to the interpretation of individual test scores.

Reliability of Criterion-Referenced Mastery Tests. As noted earlier, the traditional methods for computing reliability require score variability (that is, a spread of scores) and are therefore useful mainly with norm-referenced tests. When used with criterion-referenced tests, they are likely to provide misleading results. Since criterion-referenced tests are not designed to emphasize differences among individuals, they typically have limited score variability. This restricted spread of scores will result in low correlation estimates of reliability, even if the consistency of our test results is adequate for the use to be made of them.

When a criterion-referenced test is used to determine mastery, our primary concern is with how consistently our test classifies masters and nonmasters. If we administered two equivalent forms of a test to the same group of students, for example, we would like the results of both forms to identify the same students as having mastered the material. Such perfect agreement is unrealistic, of course, since some students near the cutoff score are likely to shift from one category to the other on the basis of errors of measurement (due to such factors as lucky guesses or lapses of memory). However, if too many students demonstrated mastery on one form but nonmastery on the other, our decisions concerning who mastered the material would be hopelessly confused. Thus, the reliability of mastery tests can be determined by computing the percentage of consistent mastery-nonmastery decisions over the two forms of the test.

The procedure for comparing test performance on two equivalent forms of a test is relatively simple. After both forms have been administered to a group of students, the resulting data can be placed in a two-by-two table like that shown in Figure 12.3. These data are based on two forms of a 25-item test administered to 40 students. Mastery was set at 80 percent correct (20 items), so all students who scored 20 or higher on both forms of the test were placed in the upper right-hand cell (30 students), and all those who scored below 20 on both forms were placed in the lower left-hand cell (6 students). The remaining students demonstrated mastery on one form and nonmastery on the other (4 students).

FIGURE 12.3 Classification of 40 students as masters or nonmasters on two forms of a criterion-referenced test. (Form A masters: 2 nonmasters and 30 masters on Form B; Form A nonmasters: 6 nonmasters and 2 masters on Form B.)
Since 36 of the 40 students were consistently classified by the two forms of the test, we apparently have reasonably good consistency. We can compute the percentage of consistency for this procedure with the following formula:

% Consistency = [Masters (both forms) + Nonmasters (both forms)] / (Total number in group) x 100

% Consistency = (30 + 6) / 40 x 100 = 90%
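The same computation can be run directly on two lists of scores. The sketch below uses invented scores for ten students and the 20-item cutoff from the example; the data are not those behind Figure 12.3.

```python
# Sketch: percentage of consistent mastery/nonmastery decisions across two test forms.
# form_a and form_b hold the same students' scores on the two forms (hypothetical data).
cutoff = 20   # mastery = 20 or more items correct out of 25

form_a = [23, 18, 21, 24, 16, 22, 19, 25, 20, 17]
form_b = [22, 17, 23, 24, 19, 21, 21, 24, 18, 16]

consistent = sum(
    (a >= cutoff) == (b >= cutoff)       # True when both forms give the same decision
    for a, b in zip(form_a, form_b)
)
pct = 100 * consistent / len(form_a)
print(f"decision consistency = {pct:.0f}%")
```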
This procedure is simple to use but it has a few limitations. First, two forms of the test are required. This may not be as serious as it seems, however, since in most mastery programs more than one form of the test is needed for retesting those students who fail to demonstrate mastery on the first try. Second, it is difficult to determine what percentage of decision consistency is necessary for a given situation. As with other measures of reliability, the greater the consistency, the more satisfied we will be, but what constitutes a minimum acceptable level? There is no simple answer to such a question because it depends on the number of items in the test and the consequences of the decision. If a nonmastery decision for a student simply means further study and later retesting, low consistency might be acceptable. However, if the mastery-nonmastery decision concerns whether to give a student a high school certificate, as in some competency testing programs, then a high level of consistency will be demanded. Since there are no clear guidelines for setting minimum levels, we will need to depend on experience in various situations to determine what are reasonable expectations.

More sophisticated techniques have been developed for estimating the reliability of criterion-referenced tests, but the numerous issues and problems involved in their use go beyond the scope of this book. See Box 12.2 for factors that lower the reliability of test scores.

BOX 12.2 Factors That Lower the Reliability of Test Scores

1. Test scores are based on too few items. (Remedy: Use longer tests or accumulate scores from several short tests.)
2. Range of scores is too limited. (Remedy: Adjust item difficulty to obtain a larger spread of scores.)
3. Testing conditions are inadequate. (Remedy: Arrange an opportune time for administration and eliminate interruptions, noise, and other disrupting factors.)
4. Scoring is subjective. (Remedy: Prepare scoring keys and follow them carefully when scoring essay answers.)
Estimating the Reliability of Performance Assessments

Performance assessments are commonly evaluated by using scoring rubrics that describe a number of levels of performance, ranging from high to low (e.g., outstanding to inadequate). The performance for each student is then judged and placed in the category that best fits the quality of the performance. The reliability of these performance judgments can be determined by obtaining and comparing the scores of two judges who scored the performances independently. The scores of the two judges can be correlated to determine the consistency of the scoring, or the proportion of agreement in scoring can be computed.

Let's assume that a performance task, such as a writing sample, was obtained from 32 students and two teachers independently rated the students' performance on a four-point scale where 4 is high and 1 is low. The results of the ratings by the two judges are shown in Table 12.4. The ratings for Judge 1 are presented in the columns and those for Judge 2 are presented in the rows. Thus, Judge 1 assigned a score of 4 to seven students and Judge 2 assigned a score of 4 to eight students. Their ratings agreed on six of the students and disagreed by one score on three of the students. The number of rating agreements can be seen in the boxes on the diagonal from the upper right-hand corner to the lower left-hand corner. The percentage of agreement can be computed by adding the numbers in these diagonal boxes (6 + 7 + 6 + 5 = 24), dividing by the total number of students in the group (32), and multiplying by 100.

Rater agreement = 24 / 32 x 100 = 75%

TABLE 12.4 Classification of Students Based on Performance Ratings by Two Independent Judges

Ratings by Judge 1 (columns) and Judge 2 (rows):
Judge 2 score of 4: 2 students rated 3 by Judge 1, 6 rated 4 (row total 8)
Judge 2 score of 3: 3 students rated 2 by Judge 1, 7 rated 3, 1 rated 4 (row total 11)
Judge 2 score of 2: 2 students rated 1 by Judge 1, 6 rated 2 (row total 8)
Judge 2 score of 1: 5 students rated 1 by Judge 1 (row total 5)
Column totals for Judge 1: 7, 9, 9, 7 (N = 32)
By inspection, we can see that all ratings were within one score of each other. The results also indicate that Judge 2 was a more lenient rater than Judge 1 (i.e., gave more high ratings and fewer low ratings). Thus, a table of this nature can be used to determine the consistency of ratings and the extent to which leniency can account for the disagreements.

Although the need for two raters will limit the use of this method, it seems reasonable to expect two teachers in the same area to make periodic checks on the scoring of performance assessments. This will not only provide information on the consistency of the scoring, but will provide the teachers with insight into some of their rating idiosyncrasies. See Box 12.3 for factors that lower the reliability of performance assessments.

The percentage of agreement between the scores assigned by independent judges is a common method of estimating the reliability of performance assessments. It should be noted, however, that this reports on only one type of consistency, the consistency of the scoring. It does not indicate the consistency of performance over similar tasks or over different time periods. We can obtain a crude measure of this by examining the performance of students over tasks and time, but a more adequate analysis requires an understanding of generalizability theory, which is too technical for treatment here.

BOX 12.3 Factors That Lower the Reliability of Performance Assessments

1. Insufficient number of tasks. (Remedy: Accumulate results from several assessments, for example, several writing samples.)
2. Poorly structured assessment procedures. (Remedy: Define carefully the nature of the tasks, the conditions for obtaining the assessment, and the criteria for scoring or judging the results.)
3. Dimensions of performance are specific to the tasks. (Remedy: Increase generalizability of performance by selecting tasks that have dimensions like those in similar tasks.)
4. Inadequate scoring guides for judgmental scoring. (Remedy: Use scoring rubrics or rating scales that specifically describe the criteria and levels of quality.)
5. Scoring judgments that are influenced by personal bias. (Remedy: Check scores or ratings with those of an independent judge. Receive training in judging and rating if possible.)

Summary of Points

1. Validity is the most important quality to consider in assessment and is concerned with the appropriateness, meaningfulness, and usefulness of the specific inferences made from assessment results.
2. Validity is a unitary concept based on various forms of evidence (content-related, criterion-related, construct-related, and consequences).
3. Content-related evidence of validity refers to how well the sample of tasks represents the domain of tasks to be assessed.
4. Content-related evidence of validity is of major concern in achievement assessment and is built in by following systematic procedures. Validity is lowered by inadequate assessment practices.
5. Criterion-related evidence of validity refers to the degree to which assessment results are related to some other valued measure called a criterion.
6. Criterion-related evidence may be based on a predictive study or a concurrent study and is typically expressed by a correlation coefficient or expectancy table.
7. Construct-related evidence of validity refers to how well performance on assessment tasks can be explained in terms of psychological characteristics, or constructs (e.g., mathematical reasoning).
8. The construct-related category of evidence is the most comprehensive. It includes evidence from both content-related and criterion-related studies plus other types of evidence that help clarify the meaning of the results.
9. Consequences of using the assessment, both positive and negative, are also an important consideration in validity.
10. Reliability refers to the consistency of scores (i.e., to the degree to which the scores are free from measurement error).
11. Reliability of test scores is typically reported by means of a reliability coefficient or a standard error of measurement.
12. Reliability coefficients can be obtained by a number of different methods (e.g., test-retest, equivalent-forms, internal-consistency), and each one measures a different type of consistency (e.g., over time, over different samples of items, over different parts of the test).
13. Reliability of test scores tends to be lower when the test is short, the range of scores is limited, testing conditions are inadequate, and scoring is subjective.
14. The standard error of measurement indicates the amount of error to allow for when interpreting individual test scores.
15. Score bands (or confidence bands) take into account the error of measurement and help prevent the overinterpretation of small differences between test scores.
16. The reliability of criterion-referenced mastery tests can be obtained by computing the percentage of agreement between two forms of the test in classifying individuals as masters and nonmasters.
17. The reliability of performance-based assessments is commonly determined by the degree of agreement between two or more judges who rate the performance independently.

References and Additional Reading

American Educational Research Association, Standards for Educational and Psychological Testing (Washington, DC: AERA, 1999).
Linn, R. L., and Gronlund, N. E., Measurement and Assessment in Teaching, 8th ed. (Upper Saddle River, NJ: Merrill/Prentice-Hall, 2000).
Oosterhof, A. C., Classroom Applications of Educational Measurement, 3rd ed. (Upper Saddle River, NJ: Merrill/Prentice-Hall, 2001).
Thorndike, R., Measurement and Evaluation in Psychology and Education, 6th ed. (Upper Saddle River, NJ: Prentice-Hall, 1997).

Glossary

This glossary of assessment terms focuses primarily on the terms used in this book.

Achievement Assessment: A procedure that is used to determine the degree to which individuals have achieved the intended learning outcomes of instruction. It includes both paper-and-pencil tests and performance assessments, plus judgments concerning learning progress.
Achievement Test: An instrument that typically uses sets of items designed to measure a domain of learning tasks and is administered under specified conditions (e.g., time limits, open or closed book).
Alternate Forms: Two or more forms of a test or assessment that are designed to measure the same abilities (also called equivalent or parallel forms).
Alternative Assessment: An assessment procedure that provides an alternative to paper-and-pencil testing.
Analytic Scoring: The assignment of scores to individual components of a performance or product (e.g., evaluate a writing sample by using separate scores for organization, style, mechanics, etc.).
Anecdotal Record: A brief description of some significant student behavior, the setting in which it occurred, and an interpretation of its meaning.
Authentic Assessment: An assessment procedure that emphasizes the use of tasks and contextual settings like those in the real world.
Battery of Tests: Two or more tests standardized on the same sample of students, so that performance on the different tests can be compared using a common norm group.
Checklist: A list of dimensions of a performance or product that is simply checked present or absent.
Content Standard: A broad educational goal that indicates what a student should know and be able to do in a subject area.
Correlation Coefficient: A statistic indicating the degree of relationship between two sets of test scores or other measures.
Criteria: A set of qualities used in judging a performance, a product, or an assessment instrument.
Criterion-Referenced Interpretation: A description of an individual's performance in terms of the tasks he or she can and cannot perform.
Derived Score: A score that results from converting a raw score to a different score scale (e.g., percentile rank, standard score).
Difficulty Index: Percentage of individuals who obtain the correct answer on a test item or task.
Discrimination Index: The degree to which a test item or task discriminates between high and low scorers on the total test.
Expectancy Table: A twofold chart that shows the relationship between two sets of scores. It can be used to predict the chances of success on one measure
