Académique Documents
Professionnel Documents
Culture Documents
error. Based on this assumption, classical test theory gives in performance between the top quartile and bottom quartile
rise to a number of statistical analyses for test evaluation, students is small; hence the item discrimination index is low.
including item analysis and test analysis.5 The purpose of The point biserial coefficient is a measure of individual
these analyses is to examine if a test is reliable and discrimi- item reliability and is defined as the correlation between the
nating. For a reliable test, similar outcomes are expected if item scores and total scores,9
the test is administered twice 共at different times兲, assuming
the examinees’ performance is stable and the testing condi- rpbi =
X1 − X0
冑P共1 − P兲.
tions are the same. For a discriminating test, results can be x
used to clearly distinguish those who have a robust knowl- Here, X1 is the average total score for those who correctly
edge of tested materials from those who do not. In the fol- answer an item, X0 is the average total score for those who
lowing, we provide brief introductions to both item analysis incorrectly answer the item, x is the standard deviation of
and test analysis. One can follow these analyses to perform total scores, and P is the difficulty index for this item. A
test evaluations. reliable item should be consistent with the rest of the test, so
Item analysis encompasses three measures: item difficulty a fairly high correlation between the item score and the total
level 共P兲, discrimination index 共D兲, and point biserial coef- score is expected. A satisfactory point biserial coefficient10 is
ficient 共rpbi兲. Item difficulty is a measure of the easiness of an rpbi ⱖ 0.2. Once again, higher values are better. If an item
item 共although it is ironically called “item difficulty” level兲 shows a low biserial coefficient, it indicates this item may
and is defined as the proportion of correct responses, not test the same material 共or may not test the material at the
P = N1/N. same level兲 as other items. Revisions may be considered to
make this item more comparable to the rest of the test.
Here N1 is the number of correct responses and N is the total Test analysis has two measures: Kuder-Richardson reli-
number of students taking the test. Ideally, items with diffi- ability index 共rtest兲 and Ferguson’s delta 共␦兲. These two mea-
culty level around 0.5 have the highest reliability because sures are used to evaluate an entire test 共rather than evaluate
questions that are either extremely difficult or extremely easy individual items兲. Kuder-Richardson reliability measures the
do not discriminate between students 共since nearly everyone internal consistency of a test. In other words, it examines
gets them wrong or right兲. Practically, item difficulty values whether or not a test is constructed of parallel items that
ranging from 0.3 to 0.9 are acceptable.6 If items with ex- address the same materials. Higher correlations among indi-
tremely low or extremely high difficulty values are detected, vidual items result in a greater Kuder-Richardson index, in-
one may consider revising these items to make them easier dicating higher reliability of the entire test. For a multiple-
共or more difficult兲. choice test where each item is scored as “correct” or
The item discrimination index measures how powerful an “wrong,” the reliability index is calculated as follows:11,12
冉 冊
item is in distinguishing high-achieving students from low-
achieving students. It is defined as the difference in percent- K 兺 Pi共1 − Pi兲
rtest = 1− .
ages of correct response to an item between the top quartile K−1 2x
and the bottom quartile students,7
This formula is known as KR-20 after the equation number
D = 共NH − NL兲/共N/4兲. in the famous Kuder and Richardson paper.11 Here, K is the
Here NH and NL are the numbers of correct responses in the number of the test items, Pi is the difficulty level for the ith
top quartile and bottom quartile, respectively, and N is the item, and x is the standard deviation of total score. A widely
total number of students. “Quartile” can be determined by accepted criterion is that a test of reliability higher than 0.7 is
using either an “internal” criterion 共students’ scores on the considered reliable for group measurement. An rtest value
test being considered兲 or an “external” criterion 共e.g., stu- greater than 0.8 is the rule of thumb indicating a test is suit-
dents’ grade point averages兲.6 The criterion used in most able for use in assessing individuals.13 If a low reliability
PER discussions is internal. Occasionally researchers use top index is detected, one may first consider examining items
and bottom thirds or some other division of scores, depend- with low discrimination index and low point biserial coeffi-
ing on what best suits their needs. If an item is discrimina- cient. Because these items often are not consistent with other
tive, one can expect the number of correct responses in the items, they can negatively impact the reliability of the entire
top quartile 共NH兲 to be much greater than that 共NL兲 in the test.
bottom quartile, thus a high discrimination index. A com- Ferguson’s delta measures the discriminatory power of an
monly adopted standard8 for a satisfactory item discrimina- entire test. Specifically, it investigates how broadly students’
tory index is D ⱖ 0.3. Higher values are better. In case of a total scores are distributed over the possible range. Gener-
low item discriminatory index, one may need to scrutinize ally, the broader the score distribution is, the better the test is
the item to see if the statement of the question is clear. Some- in discriminating among students at different levels. The cal-
times, a poorly worded item can cause strong students to culation of Ferguson’s delta is given as follows:14,15
overthink the question, posing a negative effect on their per- N2 − 兺 f 2i
formance and thus lowering the item discrimination index. ␦= .
N − N2/共K + 1兲
2
Another possible situation for low item discrimination is
when an item is either too difficult 共low difficulty index兲 or Here, N is the total number of students taking the test, K is
too easy 共high difficulty index兲. In these cases, the difference the number of test items, and f i is the number of students
020103-2
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
r r
TABLE II. Evaluations of the MIET. B B
020103-3
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
FIG. 3. 共Color兲 Sum of the squared projections of the blue arrow FIG. 5. 共Color兲 The sum of the squared projections of the green
onto the red arrow 共through the origin兲 is maximized. The red arrow shadows onto the green vector is maximized. The green vector pre-
represents factor 1. sents factor 2. The orange vector is perpendicular to both the red
and green vectors, representing factor 3. In a general case of mul-
To find the first factor, we add a bidirectional vector tiple dimensions, a rotation of all factors probably is desirable and if
through the origin and spin it around until we have maxi- needed can still maintain orthogonality among them.
mized the sum of the squared projections of the blue vectors
onto it. The red arrow in Fig. 3 represents the first factor. The sible to visualize with more than three factors, the math-
projections of the variables 共blue vectors兲 onto the factor are ematical approach and meaning are the same regardless of
referred to as the factor loadings of factor 1. how many dimensions are involved.
Now place a flat screen at the origin, perpendicular to the
first factor 共red arrow兲. Shine a light toward the origin from
beyond each end of the red vector and cast shadows of the B. Principal component analysis
blue vectors onto the screen 共see Fig. 4兲. These shadows are PCA is a variable reduction procedure. Suppose one ob-
the residual correlations with the first factor partialled out. tains a data set of 30 variables from administering a 30-item
We now place a new bidirectional green vector on the multiple-choice test. Usually several items test the same
screen through the origin and perpendicular to the red vector. topic, and thus students’ scores on these items should ideally
We then rotate the green vector until the sum of the squared display high correlations. PCA groups these highly corre-
projections of the shadows onto it 共not the blue vectors, but lated items together and collapses them into a small number
their shadows on the screen兲 is maximal. The green bidirec- of components through weighted linear combinations as
tional vector in Fig. 5 represents the second factor. Since the shown below:
green vector is perpendicular to the first factor, factor 1 and
factor 2 are orthogonal. C1 = b11Q1 + b12Q2 + b13Q3 + ¯ + b1,30Q30 ,
We now place a third bidirectional orange vector on the
screen through the origin and perpendicular to the red and C2 = b21Q1 + b22Q2 + b23Q3 + ¯ + b2,30Q30 ,
green vectors 共see Fig. 5兲. The projections of the shadows on
the orange vector are called the loadings on the third factor. ¯.
By finally clearing away all the extraneous representa-
tions, we see how the three factor vectors form a coordinate Here C’s are components, Qi represents the ith item in the
system that is more closely aligned with the original vectors test, and b’s are weights of individual items. The main task
than the x, y, and z axes. Figure 6 shows the view looking of PCA is to optimize the weights 共b values兲 so that these
down the factor 1 axis toward the origin. Although impos- components constructed as such account for variance in the
FIG. 6. 共Color兲 Three factors are more aligned with the blue
FIG. 4. 共Color兲 Cast projections of the blue arrows onto the vectors than the x, y, and z axes. Here, the x axis is too close to
green screen perpendicular to factor 1. factor 1 and therefore is not labeled in the picture.
020103-4
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
It is clear from Table III that the first four questions have
large loadings on component 1 and small loadings on com-
ponent 2. Conversely, the last four questions show a reverse U1 U2 U2 U4
pattern. Thus it is reasonable to conclude that the first four
questions can be summarized by component 1 and the last FIG. 8. Common factors and unique factors in common factor
four questions by component 2. It is now the researcher’s job analysis.
020103-5
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
lar boxes represent observed variables 共questions 1–4兲 and TABLE IV. Factor loadings for common factor analysis of eight
elliptical circles indicate latent 共unmeasured兲 common fac- MIET items.
tors. Single-headed arrows signify causal influences on ob-
served variables. As shown, each observed variable receives Factor 1 Factor 2 Content
influences from not only common factors 共factors 1 and 2兲
Q12 0.54* 0.13 Energy graph, bound or unbound system
but also its own underlying factor 共U兲. As opposed to the
common factors, these U factors are called unique factors. Q13 0.66* 0.08 Energy graph, bound or unbound system
The goal of common factor analysis is to identify the com- Q14 0.35* 0.12 Energy graph, bound or unbound system
mon factors to account for the covariance among observed Q16 0.31 0.02 Energy of a bound system
variables. Q27 0.14 0.33* Work-energy theorem
As seen in Fig. 8, each observed variable can be ex- Q31 0.14 0.35* Work-energy theorem
pressed as follows: Q32 0.15 0.44* Work-energy theorem
Q1 = b11F1 + b12F2 + U1 , Q33 0.04 0.53* Work-energy theorem
Q2 = b21F1 + b22F2 + U2 .
cant loadings. Loadings with an asterisk in Table IV are con-
Here Q’s are observed variables, F’s are common factors, sidered significant by SAS.
U’s are unique factors, and b’s are regression coefficients As seen, the results are very similar to those obtained
共weights兲. Similarly to PCA, common factor analysis calcu- from PCA. However, factor loadings in Table IV are gener-
lates optimal weights 共b values兲 by solving eigenvalue equa- ally lower than component loadings in PCA 共Table III兲. This
tions of the observed correlation matrix. However, the diag- is understandable because common factor analysis considers
onal elements of the correlation matrix are no longer 1’s. only the portion of variances explained by common factors,
Rather, they are replaced with variances that are accounted whereas PCA analyzes the total variances in observed data
for by common factors. In other words, the portion of vari- without assuming underlying effects. This is why the purpose
ances explained by unique factors is discarded. This is a of PCA is purely data reduction, while the purpose of com-
major difference between PCA and common factor analysis. mon factor analysis is to understand causes.33,34
Here, this new correlation matrix is called adjusted correla- Another interesting difference between the above results
tion matrix.32 Since the portion of variances due to common and those in Table III is manifested in Q16. Previously, PCA
factors cannot be known a priori, it is practical to use esti- generates a high loading on component 1 “energy in bound
mates for an initial trial. Often commercial software has or unbound system.” Here the loading on factor 1 is barely
functions to generate estimates, followed by iterations to above the minimum suggested by Hair et al.28 In fact, SAS
solve the equations. does not even flag this loading as significant. A closer inspec-
In the following we provide an example of common fac- tion reveals that Q16 differs from the first three questions in
tor analysis using the same data as for PCA. The purpose of that it is not formatted in graphical representations. This dif-
this example is to uncover the underlying structures of mea- ference may have lowered its loading value on factor 1. Put
sured variables. 共Here we do not predetermine possible fac- differently, there may exist factors other than the above two
tors. Such approach is known as exploratory factor analysis. that can better explain the observed data for question 16.
In cases where common factor analysis is used to confirm
predetermined factors, it is called confirmatory factor analy-
D. Practical issues
sis. Readers can refer to Ref. 22 for confirmatory factor
analysis.兲 The initial adjusted correlation matrix is included Both of the above two examples yield a simple pattern of
in Appendix A for interested readers to replicate the results. loadings for observed variables; that is, nearly all the ques-
We choose orthogonal rotation and the maximum likelihood tions have loadings high on one component 共or factor兲 and
method29 共an iteration method in which parameter estimates low on others. This simple pattern makes it easy to interpret
are those most likely to have yielded the observed correlation the meaning of components or factors. In reality, one rarely
matrix兲 in SAS for common factor analysis. Other iteration obtains such a simple pattern, especially when dealing with
methods include unweighted least squares 共minimizing the binary data 共1 or 0兲 collected from dichotomously scored
sum of squared differences between estimated and observed multiple-choice questions. In that case, one can start with
correlation matrices兲 and generalized least squares 共adjusting correlations for dichotomous35 variables, as suggested by
the unweighted least squares by weighing the correlations Heller and Huffman,36 instead of Pearson correlations. An-
inversely to their unique variables, U’s兲.29 Although maxi- other approach is “reduced basis factor analysis”37 proposed
mum likelihood may be the most frequently used method, by Adams et al. In this approach, factor analysis is repeat-
there are no dogmatic criteria as for which one is more pref- edly performed by adjusting the initial factors so as to maxi-
erable than others. A rule of thumb is to choose a method that mally comprise between predetermined results and the re-
best facilitates data interpretation. The reason we choose sults acquired from real data.
maximum likelihood is that for our data this method pro- Results obtained from either PCA or common factor
duces more sensible and easy-to-interpret results. In Table analysis should be used heuristically and not in an absolute
IV, the final factor loadings for each question are shown. As manner. Oftentimes one may find PCA generates results
previously stated, SAS has built-in functions to flag signifi- more interpretable than common factor analysis and vice
020103-6
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
B
A
Measurement 2
E
020103-7
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
4
“atomic spectra” are the highest. Cluster D has the next least
number of students. Students in this cluster demonstrated
3 highest scores on “energy definition,” “system specification,”
and “determination of work and heat.” Cluster C is similar to
A B cluster D but with relatively lower scores. Thus, clusters C
2 E
and D merge into one cluster that further joins with E. In
C D
other words, students in clusters C and D are comparable in
1
the sense that they have fairly high scores on basic energy
concepts and definitions, but their scores on “work-energy
theorem” are significantly lower than those of group E. Stu-
0
S 50 S 100 S 150 S 200 S 250 S 300
dents in cluster E seem to be at a higher level because they
are not only able to grasp basic concepts but also able to
FIG. 11. A dendrogram for cluster analysis of 308 students. The successfully apply the work-energy theorem. As opposed to
horizontal axis represents 308 students 共S1–S308兲; the vertical axis the above three clusters, clusters A and B both have the low-
represents the Euclidian distance among clusters. For instance, the est scores on all topics. Nonetheless, cluster A differs from
distance between clusters A and B is the vertical height of the hori- cluster B on the work-energy theorem and “work and heat”
zontal line connecting A and B, which reads approximately 2.4. topics. This difference is easily noticed from the great verti-
cal height of the horizontal line connecting A and B 共see Fig.
dendrogram46兲 is depicted in Fig. 11. Here, the horizontal 11兲.
axis represents the arbitrarily assigned student identity and In the above example we used the agglomerative method
the vertical axis shows the Euclidian distance between clus- to form clusters. We chose this method mainly because it
ters. As seen, each cluster extends a vertical line upward to generates more interpretable results than others. Another rea-
meet and join with others. When two closest clusters merge, son is that the agglomerative method is more efficient than
a horizontal line is placed to connect them. The height of the the divisive and K-means methods. For the divisive method
connecting horizontal line indicates the distance between the it is not feasible to optimize the initial partitioning when a
two clusters 共see Fig. 11兲. data set is large since the number of ways to divide N sub-
Several major clusters emerged in this example, as can be jects into two groups is 2N−1 − 1. As for the K-means proce-
seen from the top down in Fig. 11. For a quick grasp of the dure, the difficulty lies in the predetermination of the number
results, we examine five clusters that are marked A – E in Fig. of clusters. 关Take the simplest situation for example. If we
11. Here, we focus on five clusters mainly for illustration predetermine to divide students into two clusters on each of
purpose. One certainly can choose a different number of these five concepts 共good or poor兲, then we will have 25
clusters for analysis. For example, one can choose to study = 32 clusters, more than what is manageable.兴 Nonetheless,
only the two biggest clusters 共indicated by the top two ver- both agglomerative and K-means have been adopted in re-
tical lines in Fig. 11兲 or as many as 308 clusters 共correspond- cent physics educational studies. For example, a work by
ing to 308 students兲. Conceivably, results from too few or too Springuel et al.47 used agglomerative method for cluster
many clusters are not as informative. Moreover, it is impor- analysis of student reasoning in kinematics; an unpublished
tant to note that cluster analysis 共like any other statistical work by Montenegro et al.48 employed K-means to investi-
technique兲 only reports how clusters are formed but not why gate student misconceptions in mechanics.
they are formed as such. It is the researcher’s job to find out A final note on differences between cluster analysis and
the unique characteristics of the individuals in each cluster the aforementioned factor analysis is worth mentioning since
that have caused them to be grouped together. In this ex- both seem to perform data classifications. First, cluster
ample, we seek to better understand what has made these five analysis is often used to classify subjects, whereas factor
clusters different. Hence, we further construct Table V to analysis is employed to group variables. Consider the same
examine their scores on each topic. example as before, where you administer a 30-item multiple-
As seen from Table V, cluster E has the least number of choice test among 300 students. If you are interested in ex-
students, but their scores on the “work-energy theorem” and amining how the 30 items interrelate with one another and
TABLE V. Five clusters and their scores 共percentages兲 on different topics covered in the MIET. 共Bold
prints indicate the highest scores among the five clusters.兲
Cluster
020103-8
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
020103-9
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
020103-10
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
冢冣 冢冣 冢冣 冢 冣
1 0 0 0.50 excluded from the analysis.64兲 We consider three models for
Q = 0.50 0 + 0.25 1 + 0.25 0 = 0.25 . analysis: 共1兲 work-energy model, 共2兲 momentum or force-
motion model, and 共3兲 others. In Fig. 16 we display one item
0 0 1 0.25 共Q33兲 to illustrate what each model may look like. In this
Here, the three elements “0.50,” “0.25,” and “0.25” indicate item, choice 共b兲 is the correct answer, thus representing the
the frequencies of applying three different models, respec- “work-energy” model. Many students answered 共c兲 and ar-
tively. Use the square root of each element in Q to form a gued during interviews that the work done is the same in
new vector V,
冢 冣冢 冣
冑0.50 0.71
V= 冑0.25 = 0.50 .
冑0.25 0.50
Now take an outer product of V with a transpose of itself
V 丢 VT to get a matrix, namely, the “density matrix” for each
individual student. Next, take an average over all the stu-
dents to obtain a class density matrix. Depending on how
students use different models, the class density matrix may
display different patterns. Figure 15 shows three examples as
discussed by Bao and Redish.63 The first matrix has only one
⎛1 0 0⎞ ⎛ 0 .5 0 0 ⎞ ⎛ 0 .5 0 .2 0 .1 ⎞
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎜0 0 0⎟ ⎜ 0 0 .3 0 ⎟ ⎜ 0 .2 0 .3 0 .1 ⎟
⎜0 0 ⎟⎠ ⎜ 0 0 .2 ⎟⎠ ⎜ 0 .1 0 .2 ⎟⎠
⎝ 0 ⎝ 0 ⎝ 0 .1
020103-11
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
both time intervals because the change in speed is the same. distracters that represent similar models. These models must
Apparently, these students focused on motion change instead be predetermined 共both identified and categorized兲 before
of energy change; therefore, their model can be described as performing model analysis. In other words, qualitative stud-
the “momentum or force-motion” model. As for choices 共a兲 ies of students’ naive conceptions and incorporation of re-
and 共d兲, we label them as others. sults into item distracters are a necessary precursor to model
We obtain a class density matrix, as shown below, by analysis.
averaging individual density matrices,
冢 冣
VII. SUMMARY AND DISCUSSION
0.275 0.304 0.06
0.304 0.638 0.121 .
In this paper, we discuss the goals, basic algorithms, and
0.06 0.121 0.085 applications of five approaches to analyzing multiple-choice
test data; they are classical test theory, factor analysis, cluster
analysis, item response theory, and model analysis. These
We then solve its eigenequation and find that the largest ei- approaches are valuable tools for PER studies on large-scale
genvalue is max = 0.835 and its corresponding eigenvector is measurement using multiple-choice assessments.
U = 共−0.484, −0.857, −0.177兲T. According to Bao and Among them, classical test theory and item response
Redish,63 the class probability of using the first model 共work- theory are particularly useful for test evaluations during the
energy model兲 can thus be calculated as max ⫻ U21 = 0.835 development stage of an assessment. Both approaches can
⫻ 共−0.484兲2 = 0.20. Similarly, the class probability of using yield detailed information on difficulty and discrimination of
the momentum or force-motion model is 0.61. As for individual items and thus are regularly employed to identify
the third model others, the probability is max ⫻ U21 = 0.835 problematic items. Classical test theory also gives rise to the
⫻ 共−0.177兲2 = 0.03, which is negligible. Note that the sum of measures of an entire test, revealing the overall reliability
the above three probabilities is less than 1, implying that and discrimination of the test. 共This is why it is called “test
these probabilities do not form a complete set. In fact, the theory.”兲 Owing to its simple theoretical framework, classi-
probabilities originating from the second and the third eigen- cal test theory has rather straightforward formulations. Thus,
vectors have not been taken into account. Bao and Redish it is easy to carry out. Nevertheless, classical test theory has
considered these additional eigenvectors as “corrections of some major weaknesses. A prominent one is that results on
less popular features that are not presented by the primary item measures are examinee dependent, and examinees’ total
state.”63 Because of these “corrections,” probabilities from scores are item dependent. Conversely, item response theory
model analysis are lower than what is calculated from a di- overcomes such weaknesses by assuming a single underlying
rect division of the number of students choosing a particular “ability” 共or “skill”兲 and using logistic regression to describe
model by the total number of instances. This is generally true the propensity of correct responses to individual items. As a
except for some extreme cases in which, for example, all result, its estimated item measures and examinee abilities are
students consistently use the same model. Because model mutually independent. Moreover, item response theory can
analysis takes into account the kind of inconsistencies, a also be used to examine how alternative choices function.
class density matrix almost always displays nonzero off- 共Because of its emphasis on individual items, this theory is
diagonal elements. named “item response theory.”兲
Here, we represent the first two probabilities in a “model Factor analysis and cluster analysis are good candidates
plot” 共Fig. 17兲 proposed by Bao and Redish, in which three for grouping data into different categories. However, the
regions are separated by two straight lines passing through
former intends to group variables 共test items兲 into a small
the origin with slopes of 3 and 1/3, respectively.63 The data
number of components 共principle component analysis兲 or
point falls into the region of the “momentum or motion-
factors 共common factor analysis兲, whereas the latter intends
force” model but is close to the borderline. We thus conclude
that students are likely to use both the work-energy and mo- to group subjects 共test examinees兲 into different clusters.
mentum or force-motion models although the latter is more Hence, factor analysis reveals information on how test items
dominant. are interrelated, while cluster analysis illuminates how stu-
For researchers who consider using model analysis, it is dents’ responses differ. A noteworthy aspect about factor
worth noting some requirements this analysis places on us- analysis is the subtle difference between principle compo-
ers. First, one needs to have in a test “a sequence of ques- nent analysis and common factor analysis. Principle compo-
tions or situations in which an expert would use a single nent analysis is a pure data reduction technique and does not
coherent mental model.”63 Simply put, a sequence of ques- presume underlying “causal structures.” But common factor
tions that target the same concept or principle is necessary. In analysis assumes some common factors to be the cause of
fact, a large number of such questions is strongly preferred observed data.
because “the probabilistic character of the student model Model analysis is a useful tool for studies on how
state arises from the presentation of a large number of ques- consistently students answer isomorphic questions. It utilizes
tions or scenarios.”63 Second, these questions ought to have quantum notations and eigenvalue analysis techniques to
020103-12
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
examine the probabilities of students’ using different mental mental models, qualitative studies must be performed first to
models in answering isomorphic questions 共given such ques- ensure these questions meet the needs.
tions are readily available兲. Because model analysis requires All the above approaches are powerful tools for data
that all questions have alternative choices covering the same analysis of multiple-choice tests. However, none is perfect or
can be exclusively used to answer all research questions. As
discussed before, each has its specific purposes and applica-
TABLE VI. This table shows detailed item analysis results for
tions. Hence, caution must be practiced when selecting
MIET based on 389 data collected from two institutions: North
among these approaches. Finally, it is important to note that
Carolina State University and Purdue University. The average total
score is 17.4 out of 33 and the standard deviation is 4.9.
our ultimate goal is to make sense of raw data. Therefore,
when encountering two 共or more兲 equally sound choices, one
Discrimination Point biserial
should always prefer the one that better 共or best兲 facilitates
Item Difficulty index index coefficient data interpretation.
020103-13
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
冢 冣
1 . . . . . . .
0.38537 1 . . . . . .
0.20388 0.20578 1 . . . . .
0.11881 0.21611 0.20957 1 . . . .
R= .
0.09215 0.13006 0.03928 0.13247 1 . . .
0.09739 0.03541 − 0.00308 0.03355 0.16152 1 . .
0.15976 0.12583 0.06139 0.11005 0.13566 0.14278 1 .
0.06961 0.07763 0.06142 0.03742 0.17751 0.17413 0.25602 1
The following is an adjusted correlation matrix Radjust for the aforementioned eight items. Of note is that the diagonal
elements are no longer 1’s. Instead, they are replaced with SAS estimated variances that are accounted for by the common
factors. We use the adjusted correlation matrix Radjust for common factor analysis of eight MIET items,
冢 冣
0.182 . . . . . . .
0.385 0.196 . . . . . .
0.204 0.206 0.090 . . . . .
0.119 0.216 0.210 0.091 . . . .
Radjust = .
0.092 0.130 0.039 0.132 0.076 . . .
0.097 0.035 − 0.003 0.034 0.162 0.062 . .
0.160 0.126 0.061 0.110 0.136 0.143 0.105 .
0.070 0.078 0.061 0.037 0.178 0.174 0.256 0.103
The following is part of the 308⫻ 308 similarity matrix D for 308 students who took the MIET. Each student forms a
response vector 共T1 , T2 , T3 , T4 , T5兲, where Ti represents his or her score on the ith topic. The scores on each topic display both
order and magnitude and thus can be used as interval data.65 The similarities among students are simply calculated as the
Euclidian distances between their response vectors. Therefore, element Dij in the following matrix represents the Euclidian
distance between the response vector of student i and that of student j. Due to limited space, we only show the Euclidian
distances among ten students. Because this matrix is symmetric with respect to its diagonal, we omit its upper half. This matrix
D is used to perform agglomerative cluster analysis,
冢 冣
0 . . . . . . . . .
5.57 0 . . . . . . . .
5.00 3.16 0 . . . . . . .
4.90 2.65 3.61 0 . . . . . .
6.93 1.73 4.58 3.46 0 . . . . .
D= .
6.71 2.00 4.69 3.87 1.00 0 . . . .
5.00 5.29 4.24 6.25 6.40 5.83 0 . . .
9.33 5.83 6.32 7.81 5.92 5.48 5.29 0 . .
3.74 4.36 2.24 3.46 5.83 5.92 4.58 7.81 0 .
9.75 4.47 7.07 6.56 3.32 3.46 8.25 5.48 8.66 0
APPENDIX B: SELECTED QUESTIONS FROM MIET 共c兲 Sum of kinetic energy and potential energy
共d兲 Rest energy
Consider a system that consists of two asteroids in deep 共e兲 None of the above
space. The following figure plots energy of several different
Q13. What do the horizontal lines 共II, III, and IV兲 repre-
states of the two-asteroid system versus distance r between
sent?
them. Questions 12–14 refer to this figure 共see Fig. 18兲.
共a兲 Kinetic energy
Q12. What does the curved line I represent? 共b兲 Potential energy
共a兲 Kinetic energy 共c兲 Sum of kinetic energy and potential energy
共b兲 Potential energy 共d兲 Rest energy
020103-14
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
Energy
Enter Exit
IV
III
r
0
FIG. 21. Diagram of an electron entering and exiting a box.
II I
共e兲 None of the above
Q14. Which of the horizontal lines represents a bound
state?
共a兲 Line II
共b兲 Line III
共c兲 Line IV
共d兲 All of the above
共e兲 None of the above
Q16. In deep space two stars are orbiting around each
other. Consider the sum of kinetic energy and gravitational
potential energy K + U of the two-star system. Which of the
FIG. 18. Energy plot of several different states of the two-
following statements is true? 共See Fig. 19.兲
asteroid system vs the distance r between them.
共a兲 K + U is positive
共b兲 K + U is zero
共c兲 K + U is negative
共d兲 K + U is either positive or zero
共e兲 K + U is either negative or zero
Q27. A box of mass M is moving at speed v toward a
person along a surface of negligible friction. A person leans
against a wall with both arms stretched and applies a pushing
force to stop the box. Finally the person brings the box to a
full stop. There is no temperature change in the box at any
time 共see Fig. 20兲. We can conclude that during the process:
共a兲 The amount of work done on the box by the person
FIG. 19. Diagram of two stars in deep space orbiting around was + 21 M v2
each other. 共b兲 The amount of work done on the box by the person
was positive, but there is not enough information to deter-
mine the value
共c兲 The amount of work done on the box by the person
was 0
v
共d兲 The amount of work done on the box by the person
was − 21 M v2
Top view
v
Puck 1
v=0 Puck 2
FIG. 20. Diagram of a box of mass M moving at speed v toward FIG. 22. Diagram of a low-friction table with two pucks
a person along a surface of negligible friction. launched by pushing them against two identical springs.
020103-15
LIN DING AND ROBERT BEICHNER PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
共e兲 The amount of work done on the box by the person 共e兲 Not enough information to determine
was negative, but there is not enough information to deter-
Q32. On a low-friction table you launch two pucks by
mine the value
pushing them against two identical springs through the same
Q30. A clay ball moving to the right at a certain speed has amount. The two pucks have the same shape and size, but the
kinetic energy of 10 J. An identical clay ball is moving to the mass of puck 2 is twice the mass of puck 1. Then you release
left at the same speed. The two balls smash into each other the pucks and the springs propel them toward the finish line.
and both come to a stop. What happened to the energy of the
At the finish line, how does the kinetic energy of puck 1
two clay-ball system?
共a兲 The kinetic energy of the system did not change compare to the kinetic energy of puck 2? 共See Fig. 22.兲
共b兲 The kinetic energy changed into thermal energy 共a兲 It is the same as the kinetic energy of puck 2
共c兲 The total energy of the system decreased by an amount 共b兲 It is twice the kinetic energy of puck 2
20 J 共c兲 It is half the kinetic energy of puck 2
共d兲 The initial kinetic energy of the system was zero, so 共d兲 It is four times the kinetic energy of puck 2
there was no change in energy 共e兲 It is one-forth the kinetic energy of puck 2
Q31. An electron with speed 1 ⫻ 108 m / s enters a box Q33. A point particle of mass m is initially at rest. A
along a direction depicted in the diagram. Some time later constant force is applied to the point particle during a first
the electron is observed leaving the box with the same speed time interval, while the particle’s speed increases from 0 to
of 1 ⫻ 108 m / s but different direction as shown 共see Fig. v. The constant force continues to act on the particle for a
21兲. From this, we can conclude that in the box: second time interval, while the particle’s speed increases fur-
共a兲 The net force on the electron was nonzero, and the net ther from v to 2v. Consider the work done on the particle.
work done on the electron was nonzero During which time interval is more work done on the point
共b兲 The net force on the electron was nonzero, but the net particle?
work done on the electron was zero 共a兲 The first time interval
共c兲 The net force on the electron was zero, but the net 共b兲 The second time interval
work done on the electron was nonzero 共c兲 The work done during the two time intervals is the
共d兲 The net force on the electron was zero, and the net same
work done on the electron was zero 共d兲 Not enough information to determine
1 See, for example, A. Bork, Letters to the editor, Am. J. Phys. 52, 8 R. Doran, Basic Measurement and Evaluation of Science Instruc-
873 共1984兲; R. Varney, More remarks on multiple choice ques- tion 共NSTA, Washington, DC, 1980兲, p. 99.
tions, ibid. 52, 1069 共1984兲; T. Sandin, On not choosing mul- 9 E. Ghiselli, J. Campbell, and S. Zedeck, Measurement Theory for
tiple choice, ibid. 53, 299 共1985兲; M. Wilson and M. Bertenthal, the Behavioral Sciences 共Freeman, San Francisco, 1981兲.
Systems for State Science Assessment 共National Academies 10 P. Kline, A Handbook of Test Construction: Introduction to Psy-
Press, Washington, DC, 2006兲. chometric Design 共Methuen, London, 1986兲, p. 143.
2
See, for example, R. Lissitz and K. Samuelsen, A suggested 11 G. Kuder and M. Richardson, The theory of the estimation of test
change in terminology and emphasis regarding validity and edu- reliability, Psychometrika 2, 151 共1937兲.
cation, Educ. Res. 36, 437 共2007兲; S. Ramlo, Validity and reli- 12 J. Bruning and B. Kintz, Computational Handbook of Statistics,
ability of the force and motion conceptual evaluation, Am. J. 4th ed. 共Longman, New York, 1997兲.
Phys. 76, 882 共2008兲. 13
3 L. Cronbach and L. Furby, How we should measure “change”: Or R. Doran, Basic Measurement and Evaluation of Science Instruc-
tion 共NSTA, Washington, DC, 1980兲, p. 104.
should we?, Psychol. Bull. 74, 68 共1970兲; L. Suskie, Assessing 14 P. Kline, A Handbook of Test Construction: Introduction to Psy-
Student Learning: A Common Sense Guide, 2nd ed. 共Jossey-
chometric Design 共Methuen, London, 1986兲, p. 150.
Bass, San Francisco, 2009兲; R. R. Hake, Can distance and class- 15 L. Ding, R. Chabay, B. Sherwood, and R. Beichner, Evaluating
room learning be increased?, J. Scholarship Teach. Learn. 2, 1
共2008兲; http://www.physics.indiana.edu/~hake/ an electricity and magnetism assessment tool: Brief electricity
MeasChangeS.pdf and magnetism assessment, Phys. Rev. ST Phys. Educ. Res. 2,
4 T. Kline, Psychological Testing: A Practical Approach to Design 010105 共2006兲.
16 P. Kline, A Handbook of Test Construction: Introduction to Psy-
and Evaluation 共SAGE, Thousand Oaks, CA, 2005兲, pp. 91–
105. chometric Design 共Methuen, London, 1986兲, p. 144.
5 17 L. Ding, Ph.D. thesis, North Carolina State University, 2007
R. Doran, Basic Measurement and Evaluation of Science Instruc-
tion 共NSTA, Washington, DC, 1980兲 共http://eric.ed.gov/ 共http://www.lib.ncsu.edu/theses/available/etd-06032007-
ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/ 181559/兲.
38/90/26.pdf兲. 18
R. Beichner, Testing student interpretation of kinematics graphs,
6 R. Doran, Basic Measurement and Evaluation of Science Instruc-
Am. J. Phys. 62, 750 共1994兲.
tion 共NSTA, Washington, DC, 1980兲, p. 97. 19
D. Maloney, T. O’Kuma, C. Hieggelke, and A. Van Heuvelen,
7 A. Oosterhof, Classroom Applications of Educational Measure-
Surveying students’ conceptual knowledge of electricity and
ment, 3rd ed. 共Merrill, Upper Saddle River, NJ, 2001兲 pp. 176– magnetism, Am. J. Phys. 69, S12 共2001兲.
20
178. C. Singh and D. Rosengrant, Multiple-choice test of energy and
020103-16
APPROACHES TO DATA ANALYSIS OF MULTIPLE-CHOICE… PHYS. REV. ST PHYS. EDUC. RES. 5, 020103 共2009兲
Factor Analysis and Structural Equation Modeling 共SAS Insti- distance兲, one can choose to use the tie-breaking techniques in-
tute, Cary, NC, 1994兲. troduced in SAS/STAT User’s Guide, Version 8 共SAS Institute,
23
R. Gorsuch, Common factor analysis versus component analysis: Cary, NC, 1999兲.
Some well and little known facts, Multivar. Behav. Res. 25, 33 46 M. Aldenderfer and R. Blashfield, Cluster Analysis 共SAGE, Bev-
NY, 1985兲, pp. 13–15. ing to statistical analysis of student reasoning about two-
25 R. Gorsuch, Factor Analysis 共Lawrence Erlbaum, Hillsdale, NJ,
dimensional kinematics, Phys. Rev. ST Phys. Educ. Res. 3,
1983兲, pp. 165–169. 020107 共2007兲.
26
I. Jolliffe, Principal Component Analysis 共Springer-Verlag, New 48
M. Montenegro, G. Aubrecht, and L. Bao, a paper presented at
York, 2002兲, pp. 269–298; G. Dunteman, Principal Component the 2005 Physics Education Research Conference 共2005兲.
Analysis 共SAGE, Newbury Park, CA, 1989兲 pp. 48–49. 49
F. Baker, The Basics of Item Response Theory, 2nd ed. 共ERIC,
27 R. Gorsuch, Factor Analysis 共Lawrence Erlbaum, Hillsdale, NJ,
College Park, MD, 2001兲 共http://eric.ed.gov/ERICDocs/data/
1983兲, pp. 67–71 and 175–238. ericdocs2sql/content_storage_01/0000019b/80/19/5b/97.pdf兲.
28 J. Hair, R. Anderson, R. Tatham, and W. Black, Multivariate Data 50 F. Baker and S. Kim, Item Response Theory: Parameter Estima-
Analysis with Readings, 5th ed. 共Prentice-Hall, Englewood tion Techniques, 2nd ed. 共Dekker, New York, 2004兲.
51 F. Baker, The Basics of Item Response Theory, 2nd ed. 共ERIC,
Cliffs, NJ, 1998兲 p. 111.
29 SAS/STAT, Version 8, Cary, NC, 2000.
College Park, MD, 2001兲, pp. 21–45.
30 52
J. Kim and C. Mueller, Introduction to Factor Analysis: What It D. Thissen, W. Chen, and D. Bock, MULTILOG-7, Scientific Soft-
Is and How to Do It 共SAGE, Beverly Hills, CA, 1978兲. ware International.
31 53
J. Kim and C. Mueller, Factor Analysis: Statistical Methods and R. Bock, Estimating item parameters and latent ability when re-
Practical Issues 共SAGE, Beverly Hills, CA, 1978兲. sponses are scored in two or more nominal categories, Psy-
32
L. Hatcher, A Step-By-Step Approach to Using the SAS System for chometrika 37, 29 共1972兲.
Factor Analysis and Structural Equation Modeling 共SAS Insti- 54 F. Samejima, University of Tennessee Research Report No. 79-4,
case of disputed authorship, Multivar. Behav. Res. 25, 29 S. Dutta, T. Mzoughi, and V. McCauley, Testing the test: Item
共1990兲. response curves and test quality, Am. J. Phys. 74, 449 共2006兲.
34 C. Spearman, General intelligence, objectively determined and 56 Y. Lee, D. Palazzo, R. Warnakulasooriya, and D. Pritchard, Mea-
measured, Am. J. Psychol. 15, 201 共1904兲. suring student learning with item response theory, Phys. Rev. ST
35 For example, consider two items that both are scored either “cor-
Phys. Educ. Res. 4, 010102 共2008兲.
rect” 共1兲 or “incorrect” 共0兲. Thus, possible outcomes are 共1, 1兲, 57
B. Reeve and P. Fayers, in Assessing Quality of Life in Clinical
共1, 0兲, 共0, 1兲, and 共0, 0兲. If a, b, c, and d are used to indicate the Trials: Methods and Practice, 2nd ed., edited by P. Fayers and
frequencies of the four outcomes, respectively, the correlation R. Hays 共Oxford University Press, Oxford, 2005兲, pp. 55–73.
58 M. Linacre, Sample size and item calibration stability, Rasch
between the two dichotomous variables then can be calculated
as ad− bc/ 冑共a + b兲共c + d兲共a + c兲共b + d兲. Measurement Transactions 7, 328 共1994兲.
36 P. Heller and D. Huffman, Interpreting the force concept inven- 59 L. McLeod, K. Swygert, and D. Thissen, in Test Scoring, edited
tory: A reply to Hestenes and Halloun, Phys. Teach. 33, 503 by D. Thissen and H. Wainer 共Erlbaum, Mahwah, NJ, 2001兲, p.
共1995兲. 189.
37 60
W. Adams, K. Perkins, N. Podolesfsky, M. Dubson, N. Finkel- X. Fan, Item response theory and classical test theory: An empiri-
stein, and C. Wieman, New instrument for measuring student cal comparison of their item/person statistics, Educ. Psychol.
beliefs about physics and learning physics: The Colorado Learn- Meas. 58, 357 共1998兲.
61 D. Harris, Comparison of 1-, 2-, and 3-parameter IRT modes, J.
ing Attitudes about Science Survey, Phys. Rev. ST Phys. Educ.
Res. 2, 010101 共2006兲. Educ. Meas. 8, 35 共1989兲.
38 B. Everitt, Cluster Analysis 共Heinemann, London, 1974兲. 62 L. Bao, Ph.D. thesis, University of Maryland, 1999 共http://
39
B. Everitt, Cluster Analysis 共Heinemann, London, 1974兲, p. 40. www.compadre.org/PER/items/detail.cfm?ID⫽4760兲.
40 M. Aldenderfer and R. Blashfield, Cluster Analysis 共SAGE, Bev- 63 L. Bao and E. Redish, Model analysis: Representing and assess-
erly Hills, CA, 1984兲, p. 25. ing the dynamics of student learning, Phys. Rev. ST Phys. Educ.
41
A. Kuo, The Distance Macro: Preliminary Documentation, 2nd Res. 2, 010103 共2006兲.
ed. 共SAS Institute, Cary, NC, 1997兲 共http://citeseerx.ist.psu.edu/ 64
This is not the only way to deal with missing data. Since there are
viewdoc/summary?doi⫽10.1.1.31.208兲. only two possible scores 共1 or 0兲 a student can get for a particu-
42
B. Everitt, Cluster Analysis 共Heinemann, London, 1974兲, pp. lar item, it is reasonable to assign “0” to students for items they
9-15. do not answer.
43 M. Aldenderfer and R. Blashfield, Cluster Analysis 共SAGE, Bev- 65 A. Agresti and B. Finlay, Statistical Methods for the Social Sci-
erly Hills, CA, 1984兲, p. 38–43. ences, 4th ed. 共Prentice-Hall, Upper Saddle River, NJ, 2009兲.
020103-17