Printed in Great Britain. All rights reserved. Copyright © 1986 Pergamon Journals Ltd.
Menucha Birenbaum
Tel Aviv University, Israel
The issue of cognitive diagnosis, long ignored by the mainstream of research
psychology, has begun to receive much attention in recent years. Cognitive
psychologists, Artificial Intelligence (AI) experts and psychometricians are
trying to model the thought processes of individuals with respect to how they
solve a particular task. Needless to say, this shift in focus from
product-oriented assessment toward process-oriented assessment, which is meant
to draw inferences about the mental processes underlying the subject's
performance on the task, has important implications for one of the most crucial
issues in measurement, namely test validity. As pointed out by Embretson (1985),
"The traditional approach to psychological measurement has had neither the
theoretical foundation nor the psychometric models that could specify explicitly
those substantive qualities of test stimuli or of persons that underlie
responses" (p. xi).
It seems that the process-oriented approach has great potential in the area of
educational assessment. It is true that the term "diagnosis" is not novel in
educational circles. Educators have been engaged in diagnosis for many years.
Textbooks for training student teachers stress the importance of diagnosis for
prescribing remedial instruction. A variety of commercial diagnostic tests are
available on the market. However, most of the references made to diagnosis in
education are to "deficit analysis", i.e., identifying areas of weakness on the
part of the student with respect to various topics of the subject matter
(Bejar, 1984).
As pointed out by Burton (1981), one can specify four levels of diagnosis. The
simplest is concerned with determining whether or not a student has mastered a
skill. The degree of mastery is represented by a numeric value.
According to Glaser (1981), the future of testing lies in the higher levels
of diagnosis. As he states: "An important skill of teaching is the ability to
synthesize from a student's performance an accurate picture of the
misconceptions that lead to error. This task goes deeper than identifying
incorrect answers and pointing these out to the student: it should identify the
nature of the concept or rule that the student is employing that governs her or
his performance in some systematic way" (Glaser, 1981, p. 926).
Surveys of teachers' attitudes toward tests and test usage suggest that more
knowledge is associated with more positive attitudes toward tests and, given
the resources, with increased test use. Moreover, the quality of instruction
improves when teachers use direct and frequent measurement strategies
(Tollefson et al., 1985). It is therefore important to introduce teachers to
the advantages of diagnostic assessment and thereby affect the quality of
instruction. As stated by Glaser (1981), "Rule assessment approaches assume
that conceptual development can be thought of as an ordered sequence of
learned, partial understandings. Because individuals acquire concepts in
subject-matter domains to various levels of understanding in a reasonably
predictable fashion, the assessment of levels of knowledge can be linked to
appropriate instructional
Rule Assessment Approach 161
METHOD
The Data-Set
For the sake of simplicity and clarity, simulated data were used in the
present study. Twenty-five response vectors on a 10-item fraction addition test
were generated on the basis of "bugs" identified in previous empirical studies,
which consisted of large data sets collected in the United States and in Israel
(Birenbaum & Shaw, 1985; Tatsuoka, 1984).
In order to eliminate noise, careless errors were not included in the
simulation. Moreover, inconsistencies due to strategy changes during the test,
which are quite common in real-life testing, were not considered; nor were
reducing errors at the final stage. This was done in order to enable a coherent
presentation of the method of analysis and of the interpretation of the test
results.
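The simulation step just described can be sketched in code. The bug procedures
and items below are illustrative stand-ins, not the study's actual bug library
or test items; the point is that a simulated examinee applies one rule
consistently, so the resulting response vector contains no careless errors or
strategy shifts.

```python
from fractions import Fraction

# Hypothetical bug procedures (illustrative, not the paper's bug library).
# Each takes two simple fractions b/c and e/f and returns the answer the
# rule would produce.
def bug_add_all(b, c, e, f):
    # classic "add the numerators, add the denominators" bug
    return Fraction(b + e, c + f)

def correct(b, c, e, f):
    return Fraction(b, c) + Fraction(e, f)

# Three made-up items, each encoded as (b, c, e, f) for b/c + e/f.
items = [(1, 2, 1, 3), (2, 5, 1, 5), (3, 4, 1, 6)]

def response_vector(rule):
    # 1 marks a correct response; otherwise the erroneous answer is recorded,
    # so the same error code would later be substituted into the S-P chart.
    return [1 if rule(*it) == correct(*it) else rule(*it) for it in items]

print(response_vector(correct))      # an examinee with no bug
print(response_vector(bug_add_all))  # an examinee applying the bug throughout
```

Because the simulated rule is applied deterministically, each generated vector
is perfectly consistent, which is exactly the noise-free condition described
above.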
The Test
The 10-item fraction addition test used in this study consisted of pairs
of parallel items: two pairs of mixed-fraction items, one with like and the
other with unlike denominators, and three pairs of simple fractions, one with
like and two with unlike denominators, of which one had like numerators.
The items were adopted from a larger test designed for diagnostic purposes,
which was used for the empirical data collection upon which the present
simulation is based. The original test consisted of 48 open-ended items
(Klein et al., 1981). A Task Specification Chart (TSC), which included content
and procedural facets, was used for the test design (Birenbaum & Shaw, 1985).
Analysis
A "bug library", which includes "bugs" that were hypothesized on the basis
of the TSC and confirmed in previous empirical studies, was used for coding
response patterns. (The list of error codes used in the present study appears
in Exhibit 1.) The S-P chart technique (Sato, 1975) was used for presenting
test results, with error codes substituting for incorrect responses.
An S-P chart is a binary student-by-item (row × column) matrix in which the
students have been arranged in descending order of their total scores and the
items in ascending order of difficulty. The S curve is the step-function ogive
of the cumulative distribution of test scores for the group of students. The P
curve is the counterpart for the group of items.
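The rearrangement that produces an S-P chart can be sketched as follows; the
4 × 4 binary score matrix is made up purely for illustration.

```python
# Minimal sketch of arranging a binary student-by-item score matrix into
# S-P form: students (rows) in descending order of total score, items
# (columns) in ascending order of difficulty, i.e., descending order of
# the number of correct responses.
scores = [
    [1, 0, 1, 1],   # student total: 3
    [1, 1, 0, 0],   # 2
    [1, 1, 1, 1],   # 4
    [0, 0, 1, 0],   # 1
]

# Sort students by total score, highest first.
rows = sorted(scores, key=sum, reverse=True)

# Sort items by number of students answering correctly, highest first
# (easiest item leftmost).
col_totals = [sum(col) for col in zip(*rows)]
order = sorted(range(len(col_totals)), key=lambda j: col_totals[j], reverse=True)
sp = [[row[j] for j in order] for row in rows]

# The S curve steps down the matrix after each student's total; the P curve
# is its counterpart over item totals.  Here we just print the matrix.
for row in sp:
    print(row)
```

With the matrix in this form, the S and P step-function ogives can be read off
directly from the cumulative row and column totals.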
Sato has suggested two indices for maximizing the information from test
results: one is the caution index (CI) and the other the disparity coefficient.
EXHIBIT 1: List of the Error Codes in the S-P Chart
(simple fractions are written b/c + e/f and mixed numbers a b/c + d e/f;
entries marked [illegible] could not be recovered from the source)

Code  Algorithm
A     a b/c + d e/f = (a+d) (b+e)/(c+f)
B     b/c + e/f = be/cf
C     b/c + e/f = (b+e)/cf
D     b/c + e/f = (b+e)/LCD
E     b/c + e/f = b/(c+f)              (iff b = e)
F     [illegible]
G     b/c + e/f = bf/ce
H     b/c + e/f = (bf+ef)/f            (iff c = f)
I     a b/c = (ab+c)/c
J     I + A:  a b/c + d e/f = (ab+c)/c + (de+f)/f = ((ab+c)+(de+f))/(c+f)
K     b/c + e/f = (b+f)/(c+e)          (code letter unclear in source)
M     a b/c + d e/f = (a+b+c)/(d+e+f)  (code letter unclear in source)
N     a b/c + d e/f = (a+b+d+e)/(a+c+1+f)   [partially illegible]
R     b/c + e/f = (bf+ec)/2cf
U     [illegible; denominator is the LCD]
V     a b/c + d e/f = (ac+b)/c + (df+e)/f = (a+d) ((ac+b)+(df+e))/(c+f)
W     b/c + e/f = (b+e)/(c+f)          [partially illegible]
Y     a b/c + d e/f = (a+b+c)/c + (d+e+f)/f
Z     Y + A:  a b/c + d e/f = ((a+b+c)+(d+e+f))/(c+f)
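As a worked illustration, two of the simpler erroneous rules in Exhibit 1 can
be applied to a made-up item, 1/2 + 1/3 (the item itself is not from the study;
the rules are bug B, which multiplies numerators and denominators, and bug C,
which adds the numerators over the product of the denominators).

```python
from fractions import Fraction

# Item: b/c + e/f with b/c = 1/2 and e/f = 1/3 (made up for illustration).
b, c, e, f = 1, 2, 1, 3

correct = Fraction(b, c) + Fraction(e, f)   # the true sum, 5/6
bug_B = Fraction(b * e, c * f)              # bug B: be/cf = 1/6
bug_C = Fraction(b + e, c * f)              # bug C: (b+e)/cf = 2/6 = 1/3

print(correct, bug_B, bug_C)
```

Both bugs yield answers distinct from the correct sum and from each other on
this item, so an incorrect response can be traced back to a specific rule.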
FIGURE 1: S-P Chart
[The printed chart, a 25-student by 10-item matrix with error codes substituted
for incorrect responses, together with item difficulties, caution signals,
item-total correlations, and summary statistics (average raw score 3.16, i.e.,
31.60%; Cronbach's alpha .91), is not legible in the source.]
As to student #6, who got a total score of 70% on the test: his/her trouble
is with converting mixed numbers to improper fractions. (S)he multiplies the
whole number by the numerator and adds the denominator. Note that this student
gains an extra 10% on the test score because item #7 (whose whole number is 1)
is not sensitive to this "bug".
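This conversion bug can be made concrete with a short sketch. The numbers are
made up; only the rule itself, computing (a·b + c)/c instead of the correct
(a·c + b)/c for a mixed number a b/c, comes from the analysis above.

```python
from fractions import Fraction

def convert_correct(a, b, c):
    # correct conversion of the mixed number a b/c to an improper fraction
    return Fraction(a * c + b, c)

def convert_bug(a, b, c):
    # student #6's rule: multiply the whole number by the numerator
    # and add the denominator
    return Fraction(a * b + c, c)

# With a whole number of 2 the bug is visible ...
print(convert_correct(2, 1, 3), convert_bug(2, 1, 3))   # 7/3 vs 5/3

# ... but with a whole number of 1 both rules give (b + c)/c, so an item
# whose whole number is 1 cannot detect this bug, and the student picks up
# "free" credit on it.
assert convert_correct(1, 1, 3) == convert_bug(1, 1, 3)
```

This is precisely why item #7 awards the student an undeserved extra 10%.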
Note that the overall picture of the test from a conventional psychometric
perspective is one of a difficult but highly reliable test. All its items fall
into the low CI category: six are classified as having a high difficulty level
and four, a medium difficulty level. The disparity coefficient is relatively
low, indicating a hierarchical structure of the task.
Although the overall quality of the test used here, as measured by
conventional psychometric indices, proved satisfactory, it was shown that the
traditional interpretation, which refers to total test scores, can be
misleading, especially when adaptive remediation is sought. It is well known in
the medical sciences that a disease has several symptoms, yet several diseases
can share the same symptoms (e.g., high fever). Consequently, no responsible
physician would prescribe the same medicine for two patients suffering from
different diseases just because they both share high fever as one of their
symptoms. Similarly, when two students with different misapprehensions get the
same total test score, should the teacher prescribe the same remediation for
correcting their misapprehensions?
Although the method of diagnostic test construction is beyond the scope of
this paper, it should be noted that test design is a crucial matter which
ultimately determines the quality of the diagnosis. One therefore has to choose
the items for the diagnosis carefully in order to maximize the information
about the rules of operation underlying the students' responses. A task
specification chart (Birenbaum & Shaw, 1985) may serve as a useful tool in the
process of test construction. As was illustrated in the chart, when an item
yields the same result under various "bugs", its contribution to rule
assessment is in question.
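A minimal sketch of such a screening check follows, under the assumption of a
hypothetical two-rule library: the correct algorithm, and a bug that places the
sum of the numerators over the least common denominator without first
converting the numerators. The items are made up; an item is diagnostically
weak for a pair of rules whenever both rules yield the same answer on it.

```python
from fractions import Fraction
from itertools import combinations
from math import lcm

# Hypothetical rule library (illustrative, not the paper's bug library).
rules = {
    "correct": lambda b, c, e, f: Fraction(b, c) + Fraction(e, f),
    # sum of numerators over the LCD, numerators not converted
    "bug_lcd": lambda b, c, e, f: Fraction(b + e, lcm(c, f)),
}

# Two candidate items, encoded as (b, c, e, f) for b/c + e/f.
items = {"1/5 + 2/5": (1, 5, 2, 5), "1/2 + 1/3": (1, 2, 1, 3)}

for name, it in items.items():
    for r1, r2 in combinations(rules, 2):
        same = rules[r1](*it) == rules[r2](*it)
        verdict = "indistinguishable" if same else "distinguishable"
        print(f"{name}: {r1} vs {r2} -> {verdict}")
```

Note that on the like-denominator item the LCD bug happens to produce the
correct answer, so that item contributes nothing toward detecting this rule,
while the unlike-denominator item separates the two rules cleanly.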
REFERENCES
BROWN, J.S. and BURTON, R.R. (1978). Diagnostic models for procedural bugs in
basic mathematical skills. Cognitive Science, 2, 155-192.
BURTON, R.R. (1981). Diagnosing bugs in a simple procedural skill. Palo Alto,
CA: Xerox Palo Alto Research Center.
BUTTERFIELD, E.C., NIELSEN, D., TANGEN, K.L. and RICHARDSON, M.B. (1985).
Theoretically based psychometric measures of inductive reasoning. In: S.E.
Embretson (Ed.) Test design: Developments in psychology and psychometrics.
New York: Academic Press.
EMBRETSON, S.E. (Ed.) (1985). Test design: Developments in psychology and
psychometrics. New York: Academic Press.
GLASER, R. (1981). The future of testing: A research agenda for cognitive
psychology and psychometrics. American Psychologist, 36, 923-936.
HARNISCH, D.L. (1983). Item response patterns: Application for educational
practice. Journal of Educational Measurement, 20, 191-206.
HAYES, J.R. and FLOWER, L. (1980). Identifying the organization of writing
processes. In L.W. Gregg and E.R. Steinberg (Eds.) Cognitive processes in
writing. Hillsdale, NJ: Erlbaum.
KLEIN, M., BIRENBAUM, M., STANDIFORD, S.M. and TATSUOKA, K.K. (1981). On the
construction of an error-diagnostic test in fraction arithmetic. Technical
Report 81-6-NIE. Urbana: University of Illinois, Computer-based Education
Research Laboratory.
OHLSSON, S. and LANGLEY, P. (1985). Identifying solution paths in cognitive
diagnosis. Technical Report CMU-RI-TR-85-2. The Robotics Institute,
Carnegie-Mellon University.
PELLEGRINO, J.W. and GLASER, R. (1980). Components of inductive reasoning. In
R.E. Snow, P.A. Federico, and W. Montague (Eds.) Aptitude, learning and
instruction: Vol. 1. Cognitive process analyses of aptitude. Hillsdale,
NJ: Erlbaum.
PELLEGRINO, J.W., RANDAL, J.M. and SHUTE, V.J. (1985). Analyses of spatial
aptitude and expertise. In: S.E. Embretson (Ed.) Test design: Developments
in psychology and psychometrics. New York: Academic Press.
SATO, T. (1975). The construction and interpretation of S-P tables. Tokyo:
Meiji Tosho (in Japanese).
SATO, T. (1980). The S-P chart and the caution index. NEC Educational
Information Bulletin. Computer and Communication Systems Research Laboratories,
Nippon Electric Co., Ltd.
SHAUGHNESSY, M. (1977a). Errors and expectations. New York: Oxford University
Press.
SHAUGHNESSY, M. (1977b). Some needed research on writing. College Composition
and Communication, 28, 317-321.
SIEGLER, R.S. (1976). Three aspects of cognitive development. Cognitive
Psychology, 8, 481-520.
SIEGLER, R.S. (1978). The origins of scientific reasoning. In R.S. Siegler
(Ed.), Children's thinking: What develops? Hillsdale, N.J.: Erlbaum.
SLEEMAN, D.H. and SMITH, M.J. (1981). Modeling students' problem solving.
Artificial Intelligence, 16, 171-187.
STERNBERG, R.J. (1977a). Component processes in analogical reasoning.
Psychological Review, 84, 353-378.
STERNBERG, R.J. (1977b). Intelligence, information processing, and analogical
reasoning: The componential analysis of human abilities. Hillsdale, NJ:
Erlbaum.
STERNBERG, R.J. and McNAMARA, T.P. (1985). The representation and processing of
information in real-time verbal comprehension. In: S.E. Embretson (Ed.) Test
design: Developments in psychology and psychometrics. N.Y.: Academic Press.
TATSUOKA, K.K. (1982). Rule space, the product space of two score components in
signed-number subtraction: An approach to dealing with inconsistent use of
erroneous rules. Urbana, IL: Computer-based Educational Research Laboratory,
University of Illinois at Urbana-Champaign.
TATSUOKA, K.K. and TATSUOKA, M.M. (1983). Spotting erroneous rules of operation
by the individual consistency index. Journal of Educational Measurement, 20,
221-230.
TATSUOKA, K.K. (1983). A probabilistic model for diagnosing misconceptions by a
pattern classification approach. Research Report 83-4-ONR. Computer-based
Educational Research Laboratory. University of Illinois at Urbana-Champaign.
TATSUOKA, K.K. (1984). Analysis of errors in fraction addition and subtraction
problems (Final Report NIE-G-81-0002). Urbana: University of Illinois,
Computer-based Education Research Laboratory.
TOLLEFSON, N., TRACY, D.B., KAISER, J., CHEN, J.S., and KLEINSASSER, A. (1985).
Teachers' attitudes toward tests. Paper presented at the annual meeting of the
American Educational Research Association, Chicago, Illinois.
VANLEHN, K. (1981). Bugs are not enough: Empirical studies of bugs, impasses
and repairs in procedural skills. Technical Report CIS-11. Palo Alto, CA:
Xerox Palo Alto Research Center.
VANLEHN, K. (1983). Felicity conditions for human skill acquisition: Validating
an AI-based theory. Research Report CIS-21. Palo Alto, CA: Xerox Palo Alto
Research Center.
WHITELY, S.E. (1980). Multicomponent latent trait models for ability tests.
Psychometrika, 45, 479-494.
WHITELY, S.E. (1981). Measuring aptitude processes with multicomponent latent
trait models. Journal of Educational Measurement, 18, 67-84.
THE AUTHOR