
The Normalized Programming State Model: Predicting Student Performance in Computing Courses Based on Programming Behavior
Adam S. Carter*, Christopher D. Hundhausen*, and Olusola Adesope†
Human-centered Environments for Learning and Programming (HELP) Lab
*School of Electrical Engineering and Computer Science

†College of Education
Washington State University
Pullman, WA 99164
+1 509-335-6602
cartera@wsu.edu, hundhaus@wsu.edu, olusola.adesope@wsu.edu
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICER '15, August 9–13, 2015, Omaha, Nebraska, USA.
© 2015 ACM. ISBN 978-1-4503-3630-7/15/08…$15.00
DOI: http://dx.doi.org/10.1145/2787622.2787710

ABSTRACT
Educators stand to benefit from advance predictions of their students' course performance based on learning process data collected in their courses. Indeed, such predictions can help educators not only to identify at-risk students, but also to better tailor their instructional methods. In computing education, at least two different measures, the Error Quotient [14, 23] and Watwin Score [26, 27], have achieved modest success at predicting student course performance based solely on students' compilation attempts. We hypothesize that one can achieve even greater predictive power by considering students' programming activities more holistically. To that end, we derive the Normalized Programming State Model (NPSM), which characterizes students' programming activity in terms of the dynamically-changing syntactic and semantic correctness of their programs. In an empirical study, the NPSM accounted for 41% of the variance in students' programming assignment grades, and 36% of the variance in students' final course grades. We identify the components of the NPSM that contribute to its explanatory power, and derive a formula capable of predicting students' course programming performance with between 36 and 67 percent accuracy, depending on the quantity of programming process data.

Categories and Subject Descriptors
K.3.2 [Computer and Information Science Education]: Computer science education. D.2.8 [Metrics]: Performance measures, Process metrics, Product metrics.

Keywords
Error Quotient, Watwin Score, Normalized Programming State Model, Predictive measures of student performance and achievement, Learning analytics, Educational data mining

1. INTRODUCTION
By collecting a stream of learning process data in their courses, educators create opportunities to continuously assess their students' learning processes and progress. Using techniques from the fields of educational data mining and learning analytics [3, 24], educators can analyze these data in order to identify ways in which learning patterns and attitudes relate to learning outcomes. Such analyses open up new opportunities to better tailor instruction to individual learners, and ultimately to improve student learning outcomes, especially among at-risk learners.

In computing education, employing educational data mining and learning analytics techniques would appear to be particularly appropriate, given computing education's "grand challenge" problem of improving student retention, especially in early computing courses (see, e.g., [6, 9, 28]). Indeed, if computing educators are able to identify, at an early stage in a computing course, students who are at risk of dropping out or failing the course, then they are in a better position to improve retention by tailoring or adapting their instructional approaches.

Recognizing this potential, computing education researchers have become increasingly interested in collecting log data on students' programming processes as they work on course assignments [2]. Two predictive measures in this line of research—the Error Quotient [14] and the Watwin Score [27]—focus exclusively on differences between successive compilation attempts. These metrics associate improved learning outcomes with an ability to quickly remove compilation errors from a program.

While these measures have achieved modest success at predicting student performance in early computing courses, they have at least two shortcomings. First, their narrow focus on compilation behavior ignores other programming behaviors—most notably, debugging [1]—that might be associated with learning success. Second, both measures have been derived within one particular programming course (CS1), language (Java) and novice programming environment (BlueJ); their predictive power has not been tested in other courses, programming languages and environments. These limitations raise two key research questions:

RQ1: How is the predictive power of the Error Quotient and Watwin Score affected by different courses, languages, and programming environments?

RQ2: How well can a more holistic model of students' programming processes predict performance?
By presenting preliminary research that addresses these questions, this paper makes two key contributions. First, we perform a replication study of the Error Quotient and Watwin Score studies, using a different student population (CS2 instead of CS1), different programming language (C/C++ vs. Java), and different programming environment (Visual Studio® vs. BlueJ). Second, we propose the Normalized Programming State Model (NPSM) as a more holistic characterization of students' programming processes. In an empirical study, the NPSM achieved over eight times the explanatory power of the Error Quotient, and over three times the explanatory power of the Watwin Score. In a follow-up study, we use the NPSM to derive a formula that is capable of predicting students' course performance based on the programming process data available at any point in a course.

2. BACKGROUND AND RELATED WORK
A large body of educational research has explored the extent to which various learner variables are able to predict learning outcomes or future learning behaviors. These variables include the learner's background (e.g., [5, 15]), prior knowledge (e.g., [5, 18]), cognitive abilities (e.g., [20]), time-on-task (e.g., [21]) and learning attitudes (e.g., [4]). In computing education, for example, Rosson et al. [19] found strong positive correlations between a number of attitudinal variables, including self-efficacy, and a learner's orientation towards the computing discipline. While this line of research shares our interest in predicting student learning outcomes, it differs in that it relies on only a limited number of data snapshots to make its predictions. Thus, it lacks the ability to furnish predictions of student performance that are dynamic, robust and continuously updated throughout a course.

The research presented here analyzes a continuous stream of data in order to identify patterns of learning associated with positive learning outcomes. As such, it falls within the emerging areas of educational data mining and learning analytics [3, 24], which, in many STEM fields, have been used to gain insights into the processes that underlie student learning, and ultimately to better tailor instruction. A foundational idea is to build learner models that infer learners' background knowledge, learning strategies, and motivations from learning process data [18]. In turn, such models are used to adapt instruction to learner needs.

Within computing education, a legacy of research has studied students' programming processes using think-aloud protocols (e.g., [16, 22]), video analysis (e.g., [12]), and software logs (e.g., [8, 10]). These studies have had a variety of goals, ranging from better understanding how novices approach programming and debugging (e.g., [1]), to developing cognitive models of student programming knowledge (e.g., [16, 22]), to evaluating novice programming environments [8, 10, 12]. While carrying forward its interest in studying students' programming processes in detail, our work differs from this line of work in that it aims to make accurate advance predictions of course performance.

Most closely related to the work presented here are the Error Quotient [14, 23] and Watwin Score [26, 27], which have been proposed as predictive measures of student performance based on programming behavior. Both measures focus on quantifying a student's ability to recover from compilation errors. This is accomplished by examining successive pairs of compilation attempts and awarding points based on whether or not later compilation attempts remove errors identified in earlier compilation attempts. In past studies, the Watwin Score has generally outperformed the Error Quotient as a predictive measure. While the Error Quotient was able to account for between 19% [26] and 25% [14] of the variance of final course grades, the Watwin Score was able to account for between 36% [26] and 42% [27] of the variance in final course grades. However, a refinement of the Error Quotient published more recently appears to raise the Error Quotient's predictive power to nearly 30% [23]. The NPSM presented here can be seen as an expansion of the Error Quotient and Watwin Score—one that aims to harness greater predictive power by considering the programming process more holistically.

3. A PROGRAMMING STATE MODEL
Both the Error Quotient and Watwin Score focus exclusively on students' compilation activities: students who quickly and accurately fix syntax errors in their programs are predicted to perform better than those who do not. While the ability to eliminate syntax errors from a program is an important programming skill, it is widely acknowledged that programming success also hinges on one's ability to identify, diagnose, and repair runtime (semantic) errors (see, e.g., [1]). Thus, one would expect that an ability to eliminate semantic errors would also correlate positively with performance in a computing course.

This observation motivates a more holistic predictive model of student performance rooted in a student's ability to develop both syntactically and semantically-correct programs. Our proposed model aims to approximate the syntactic and semantic correctness of a programming solution at any given point in time (see Table 1). Given a stream of programming data, we map a student's current programming solution to one of the four states in this 2 × 2 space. We can determine the syntactic correctness of a program based on whether the last compilation attempt yielded an error.

Table 1. Dimensions of Program Correctness That Can Be Approximated from Programming Log Data

                         Syntax: Correct                                 Syntax: Incorrect
  Semantics: Incorrect   Syntactically correct/Semantically incorrect    Syntactically incorrect/Semantically incorrect
  Semantics: Unknown     Syntactically correct/Semantically unknown      Syntactically incorrect/Semantically unknown

In contrast, semantic correctness is impossible to determine unequivocally. All we have is a rough proxy: the presence or absence of runtime exceptions in the last execution attempt. If the last execution attempt yielded a runtime exception, we classify the program as semantically incorrect. If the last execution attempt did not yield a runtime exception, we classify the program as semantically unknown. Clearly, our proxy for semantic correctness has significant limitations. For instance, a student's program could meet the assignment specification (and hence be "semantically correct" for the purpose of the assignment), but still raise a runtime exception if it encounters input data that it is not required to process. Conversely, a student's program could run without raising a runtime exception, but its output could be incorrect. Likewise, the student could have failed to test key boundary cases that would have raised a runtime exception.
[Figure 1: Programming State Transition Diagram. The diagram relates eleven states: four editing states (YN: syntactically correct, last debug unsuccessful; YU: syntactically correct, last debug successful; NU: syntactically incorrect, last debug successful; NN: syntactically incorrect, last debug unsuccessful); five execution states (RN, RU, and R/ for executing without the debugger, and DN and DU for debugging); the Unknown (Start) State (UU); and an Idle state. All states time out after 3 minutes of inactivity (e.g., no editor activity, compiles, etc.); upon new activity, a transition is made back to the last known state.]
Figure 1 presents a state-transition diagram that maps our model to the stream of programming log data made available to us by Microsoft® Visual Studio® [25], the IDE used in our study. Note that Visual Studio does not report runtime exceptions that occur outside of debug mode. Thus, our log data do not contain two potentially important transitions: those between RN/RU and YN/YU. For this reason, we are forced to determine the current state based on the results of the student's last compilation and debug attempt. In order to switch from a syntactically incorrect state (NU and NN) to a syntactically correct state (YN and YU), the student's last compilation attempt must have been free of build errors. Likewise, a student switches from a semantically unknown state (YU and NU) to a semantically incorrect state (YN and NN) only if the last debug attempt yielded a runtime exception.

Observe that intermediate execution states are also captured in this state transition diagram. If the student's program is syntactically correct, it can be executed either with or without the debugger in Visual Studio. This leads to the four left-most "Execute" states in the diagram (RN, RU, DN, DU). In contrast, if the student's program is syntactically incorrect, it is not possible to execute it in debug mode. However, in Visual Studio, it is possible to execute the last successful build of a program. This leads to the right-most "Execute" state (R/).

Lastly, two additional states are necessary in this model. First, it is impossible to determine the state if no compilation or execution attempts have been made. To account for this situation, which commonly occurs at the beginning of a programming session, we define an additional state called "Unknown (Start) State" (UU). Second, a prolonged period of inactivity (three minutes) in any state leads to a transition to the Idle state, in which the next editing activity causes a transition back to the previous state.
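To make the classification rules just described concrete, the minimal sketch below maps the outcome of a student's most recent compile and debug attempts onto the four editing states and the Unknown (Start) State. It is an illustration only, not the paper's actual implementation; the function name and input encoding are assumptions.

```python
# Illustrative sketch only (not the OSBIDE implementation): mapping the most
# recent compile/debug results onto the NPSM editing states described above.

def editing_state(last_build_had_errors, last_debug_raised_exception):
    """Return the NPSM editing-state code for a student's current activity.

    last_build_had_errors: True/False, or None if no compile has occurred yet.
    last_debug_raised_exception: True/False, or None if no debug run has
        occurred yet (semantic correctness is then "unknown").
    """
    if last_build_had_errors is None:
        return "UU"  # Unknown (Start) State: no compile or execution yet
    syntax = "Y" if not last_build_had_errors else "N"        # syntactic axis
    semantics = "N" if last_debug_raised_exception else "U"   # semantic axis
    return syntax + semantics

# e.g., editing_state(False, True) -> "YN"; editing_state(True, None) -> "NU"
```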

3.1 Relating States to Student Performance
The four editing states in the model just presented (YN, YU, NU, NN) serve as a rough proxy for the syntactic and semantic status of the program being edited. We speculate that students who spend more of their time in syntactically correct and semantically unknown states will tend to outperform students who spend more of their time in syntactically and semantically incorrect states.

The relationship between the five execution states in the model (RN, RU, DN, DU, R/) and student performance is murkier. Students in the RN and DN states appear to be asking the question, "Why doesn't my program work?" However, students who are in state DN may be approaching that question from a more powerful position, since they are using the debugger. In a similar vein, students in the RU and DU states seem to be asking the question, "Does my program work?" Once again, those who choose to use the debugger (in the DU state) appear to be asking that question from a more powerful position. Finally, it is difficult to say just what students in the R/ state are up to. We suspect many of them may not realize that they are actually executing a previous build of the program. As such, we speculate that time spent in this state may indicate that a student is struggling; we would not expect time spent in this state to be positively associated with course performance.

3.2 Deriving an Explanatory Model: NPSM
The programming model just presented maps, on a moment-to-moment basis, a student's programming activity to one of the 11 different states shown in Figure 1. Given our intuitions about how states might correlate with student performance, how can the model be used as a basis for explaining the variance in student achievement? We adopt the most straightforward approach: For each student, we record the amount of time spent in each state, and then normalize the times relative to the total time the student spent programming.

In a preliminary analysis of data generated from this model, we observed that the Idle state tended to dominate student activity; students tended to spend most of their day not programming. As we wanted to focus exclusively on the time students spent programming, we decided to eliminate the Idle state from the normalization process. In addition, since our normalization is based on the total amount of time students spent programming, we decided to include time-on-task as a variable in the model. We were left with ten state variables plus time-on-task, for a total of 11 data points per student. These 11 data points form the Normalized Programming State Model (NPSM).
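A minimal sketch of this aggregation step follows. It is an illustrative reconstruction rather than the authors' pipeline: it assumes a student's log has already been segmented into (state, duration) intervals, and it expresses the normalized values as percentages of active programming time, which the paper does not specify.

```python
# Minimal sketch of the NPSM aggregation described above. Data layout and
# units are assumptions, not a description of the authors' actual pipeline.
from collections import defaultdict

STATES = ["YN", "YU", "NU", "NN", "RN", "RU", "R/", "DN", "DU", "UU"]

def npsm_vector(intervals):
    """intervals: iterable of (state_code, seconds) pairs for one student."""
    seconds = defaultdict(float)
    for state, duration in intervals:
        seconds[state] += duration
    active = sum(seconds[s] for s in STATES)   # Idle time is excluded entirely
    features = {s: (100.0 * seconds[s] / active if active else 0.0)
                for s in STATES}               # share of active time (assumed to be a percentage)
    features["time_on_task"] = active / 3600.0 # total programming time, in hours (assumed units)
    return features                            # the 11 NPSM data points
```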
4. STUDY I: EVALUATING EXPLANATORY POWER
The goal of our first empirical study was two-fold: (1) to address RQ1 by exploring whether the results of previous studies of the Error Quotient and Watwin Score could be replicated using a different student population, programming language, and programming environment, and (2) to address RQ2 by exploring the explanatory power of the NPSM.

We conducted Study I in a 15-week CS 2 course at Washington State University. Taught by the first author, the course used C++ as its instructional language, and required students to use the Microsoft® Visual Studio® programming environment [25] for course assignments. The course revolved around three weekly 50-minute lectures and one weekly 170-minute lab. We collected programming process and grade data, and used those data as input to the Error Quotient, Watwin Score, and NPSM.

4.1 Participants
The spring 2014 offering of CptS 122, the CS 2 course at Washington State University, enrolled 140 students, of whom 129 completed the course. Of these, 95 students (87 male, 8 female) consented to release their programming log data and course grades for analysis in this study.

4.2 Data Collection Materials and Procedure
Students' programming activities were collected using OSBIDE [7], a plugin for Microsoft® Visual Studio®. We hired a professional programmer to write the software for computing the Error Quotient, Watwin Score, and state transition network on which the NPSM is based. The first author aggregated the state transition data into the eleven variables that make up the NPSM. Because of a key difference between BlueJ and Visual Studio® (BlueJ only displays one build error at a time, whereas Visual Studio® displays all build errors), we had the programmer create two versions of the Watwin Score: one that mimics the originally-published algorithm by considering only one error at a time, and another that considers all build errors. Both versions of the Watwin Score are considered in the analysis that follows.

Three performance indicators were used for the analysis: (1) students' grades on individual assignments; (2) students' overall assignment average; and (3) students' final grades, which were based on the grades received on programming assignments (35%), labs (10%), participation (5%), in-class quizzes (10%), midterm exams (20%), and a final exam (20%). Predictions of individual assignment grades were based exclusively on the programming log data generated while the corresponding assignment was open. (The length of each assignment varied between ten and twenty-three days.) Predictions of students' overall assignment averages and final grades, in contrast, were based on programming log data generated throughout the entire semester.

4.3 Results
4.3.1 Predictions of Individual Assignment Grades
To evaluate the ability of the three measures to explain the variance in individual assignment grades, we performed a linear regression, with measure (Error Quotient, Watwin Score, NPSM) as the predictor variable and individual assignment grades as the outcome variable (see Table 2). Because the NPSM relies on normalized time values, students who spend little time on an assignment are likely to skew the explanatory model. With that in mind, we ran a secondary analysis that considered only students who spent at least one hour on each programming assignment. The results of this analysis, in which 63 out of 665 data points were thrown out, are also presented in Table 2. Significant factors in the multivariate regression run on all data points are presented in Table 3, while Table 4 shows the significant factors in the multivariate regression run on only those data points corresponding to students who were active for at least an hour.

Table 2. Explanation of Variance in Individual Assignment Grades

  Measure               df        F       p        Adj. R²
  Error Quotient        1, 679    53.65   <0.01    0.07
  Watwin Score (one)    1, 662    13.61   <0.01    0.02
  Watwin Score (all)    1, 662    14.63   <0.01    0.02
  NPSM (no min. time)   11, 653   8.92    <0.01    0.08
  NPSM (>1 hr)          10, 591   8.92    <0.01    0.11
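For readers who wish to reproduce this kind of analysis, a regression along these lines could be run with standard statistical tooling, as in the sketch below. The DataFrame layout, column names, and the use of statsmodels are assumptions for illustration, not a description of the authors' scripts.

```python
# Illustration only: one way to run the regressions described above with
# statsmodels. Column names and data layout are assumptions.
import statsmodels.api as sm

def fit_measure(df, predictors, outcome="assignment_grade", min_hours=None):
    """Regress an outcome (e.g., an assignment grade) on one or more predictors."""
    if min_hours is not None:
        df = df[df["time_on_task"] >= min_hours]   # e.g., drop records with <1 hr of activity
    X = sm.add_constant(df[predictors])            # add the intercept term
    return sm.OLS(df[outcome], X).fit()            # inspect .rsquared_adj, .params, .pvalues

# e.g., fit_measure(students, ["YN", "YU", "NU", "NN", "RN", "RU", "R/",
#                              "DN", "DU", "UU", "time_on_task"], min_hours=1)
```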

As indicated in Table 2, all measures were significant but weak predictors of individual assignment scores. If we filter out data corresponding to students who spent less than an hour of programming time on an assignment, the NPSM model accounted for the most variance (adjusted R² = 0.11) in assignment grades. Interestingly, setting a minimum time limit of one hour altered three of the five significant contributing factors in the NPSM.

Table 3. Significant Predictors in NPSM for Individual Assignment Grades (no minimum time)

  Variable        β       t       p
  YN              0.32    2.03    0.04
  RU              0.31    3.77    <0.01
  RN              0.18    2.98    <0.01
  R/              0.12    2.92    <0.01
  Time on task    0.09    2.45    0.02

Table 4. Significant Predictors in NPSM for Individual Assignment Grades (at least 1 hr of programming time)

  Variable        β       t       p
  NU             -0.17   -3.80    <0.01
  UU             -0.15   -3.45    <0.01
  RU              0.14    2.95    <0.01
  Time on task    0.11    2.57    0.01

4.3.2 Predictions of Overall Assignment Averages
We next aggregated an entire semester's worth of IDE data and correlated these data with students' overall assignment averages. Results for each measure are presented in Table 5. Significant contributors in the NPSM model are shown in Table 6.

Table 5. Explanation of Variance in Average Assignment Grades

  Measure               df       F       p        Adj. R²
  Error Quotient        1, 94    7.16    <0.01    0.06
  Watwin Score (one)    1, 94    11.50   <0.01    0.10
  Watwin Score (all)    1, 94    10.93   <0.01    0.10
  NPSM                  10, 84   7.00    <0.01    0.39

Table 6. Significant Predictors in NPSM for Assignment Average Grades

  Variable    β       t       p
  UU         -0.44   -3.88    <0.01
  NU         -0.22   -2.28    0.03

By considering an entire semester's worth of data, two of the three predictive measures improved. The Error Quotient's explanatory power decreased, whereas the Watwin Score increased its explanatory power by a factor of five, and the NPSM increased its explanatory power by more than a factor of three. In absolute terms, the NPSM was a substantially better predictor than the other two measures, nearly quadrupling the explanatory power of its closest rival (the Watwin Score).

Interestingly, when the input dataset was expanded to include all data collected throughout the semester, the number of NPSM variables that made significant contributions shrank from three (NU, UU, RU) to two (UU, NU) (see Table 6). Moreover, both of these variables (UU and NU) were negatively correlated with performance. Recall that the UU state is used when students first begin programming. It makes sense that the longer students go without compiling or running their programs, the more likely it is that they will do poorly on the assignment. Likewise, it makes sense that students who spend large proportions of time in the NU state would tend to do worse on assignments, since students in that state are grappling with syntax errors, and may not ever be able to execute their programs. Indeed, the significance of NU as an explanatory factor aligns well with the Error Quotient and Watwin Score, both of which can be seen as quantifying the rate at which students leave the NU state.

4.3.3 Predictions of Final Grades
Lastly, we consider each measure's ability to explain the variance in students' final course grades. As was the case with assignment averages, we used students' programming behavior over the entire semester as input to each measure. The results of this analysis are presented in Table 7. Significant factors in the NPSM are shown in Table 8. As can be seen in Table 7, the Error Quotient was the only measure that did not significantly correlate with students' final grades. The Watwin Score appears to be slightly better at explaining the variance in final grades than it was at explaining the variance in assignment scores. In contrast, the NPSM appears slightly worse at accounting for the variance in final grades than at accounting for the variance in average assignment scores. As before, however, the NPSM did a substantially better job in absolute terms, furnishing three times the explanatory power of the Watwin Score, its closest competitor. Finally, we note that the two significant factors in the NPSM (UU and NU) remained the same in its correlations with assignment average and final grade.

Table 7. Explanation of Variance in Final Grades

  Measure                       df       F       p        Adj. R²
  Error Quotient                1, 94    3.68    0.06     0.03
  Watwin Score (single error)   1, 94    14.26   <0.01    0.12
  Watwin Score (all errors)     1, 94    14.40   <0.01    0.12
  NPSM                          10, 84   6.63    <0.01    0.36

Table 8. Significant Predictors in NPSM for Final Grades

  Variable    β       t       p
  UU         -0.30   -2.55    0.01
  NU         -0.27   -2.74    <0.01

5. STUDY II: DERIVING A PREDICTIVE MEASURE
The previous study showed that the NPSM was able to account for substantially more variation in student performance than both the Error Quotient and Watwin Scores. Given this potential, it makes sense to derive a predictive measure that can be used in situ to predict performance, rather than post hoc to explain variance.
We now present a follow-up study that uses the results of the previous study to derive a predictive formula rooted in the NPSM.

Given that the NPSM includes eleven predictors, the ideal sample size for achieving full statistical power when deriving a predictive measure would be approximately 220 students (see, e.g., [11, 29]). While it is still possible to detect strong effects on a smaller sample size, running eleven predictors against our sample size of 95 students increases the probability of producing a significant model without any significant predictors. For this reason, we restricted ourselves to the development of a four-variable model—the most appropriate size, given the size of our dataset [11, 29].

We began by examining the seven significant variables identified in Study I: YN, RU, RN, R/, NU, UU, and time on task. A preliminary data analysis using datasets of varying sizes (see Figure 2) revealed a sporadic level of significance for YN, R/, and time on task. Therefore, we decided to drop these variables from further consideration, and settled on the variables UU, NU, RU, and RN for our predictive model.

[Figure 2. Seven Programming Data Sets of Increasing Size as a Percentage of All Programming Data: a bar chart showing, for the cumulative data sets A1, A1-A2, A1-A3, A1-A4, A1-A5, A1-A6, and A1-A7, the percentage (0%-100%) of all programming data that each includes.]

5.1 Method
For Study II, we used the same programming log data and grade data as were used in Study I. However, for this study, we evaluated the NPSM using seven input data sets whose sizes were systematically varied, as illustrated in Figure 2. The first data set consisted solely of the data collected during the first programming assignment. The final six data sets each added an additional assignment's grades and programming data. Therefore, starting with the second data set, the outcome variable was the average of all the programming assignment scores received up to that point in time. It follows that the final data set included all programming data from the semester, matching the dataset reported in Table 5.

5.2 Results
For each of the seven data sets, a multivariate regression was performed using UU, NU, RU, and RN as predictor variables and assignment averages as outcome variables. Table 9 provides the individual contribution of each predictive variable. The bottom two rows of the table compute the average value and weighted average value of each coefficient.

Table 9: NPSM Predictive Power and Coefficients for Seven Datasets of Increasing Size (* = sig. at p < 0.05)

                  Variance    Unstandardized β coefficients               Standardized β coefficients
  Dataset         explained   Constant   UU       NU       RU      RN      UU       NU       RU      RN
  A1              13%         50.01       0.09    -0.25    0.87*   2.33*    0.02    -0.08    0.25*   0.34*
  A1-A2           15%         58.43      -0.05    -0.16    0.81*   2.54*   -0.02    -0.06    0.28*   0.36*
  A1-A3           20%         65.57      -0.07    -0.50*   0.79*   2.17*   -0.03    -0.20*   0.29*   0.33*
  A1-A4           26%         75.32      -0.45    -0.62*   0.63*   1.56*   -0.15    -0.26*   0.27*   0.27*
  A1-A5           37%         77.54      -1.12*   -0.51*   0.58*   1.48*   -0.34*   -0.22*   0.24*   0.25*
  A1-A6           45%         80.80      -1.45*   -0.44*   0.53*   1.16*   -0.45*   -0.19*   0.22*   0.19*
  A1-A7           41%         80.79      -1.34*   -0.52*   0.41*   1.02*   -0.43*   -0.23*   0.19*   0.18*
  Average         N/A         69.78      -0.63    -0.43    0.66    1.75    -0.20    -0.18    0.25    0.27
  Weighted Avg.   N/A         73.42      -0.84    -0.45    0.61    1.58    -0.27    -0.19    0.24    0.25

In examining the coefficients listed in Table 9, we see that the RU and RN variables are consistently significant regardless of the amount of data considered. In contrast, the UU and NU variables only become significant as the amount of data considered increases. However, in examining the standardized beta coefficients, we see that the relative contributions of RU and RN decrease as the size of the data increases, whereas the relative contributions of UU and NU increase as the size of the data increases. Finally, we see a drop in the amount of variance explained when adding data associated with the last assignment. Whether this represents a true ceiling in the NPSM's predictive power remains an interesting question for future research.

5.3 A Predictive Formula
Running the NPSM model with the variables UU, NU, RU, and RN across the seven overlapping data sets reveals a general trend in which the amount of variance increases with the size of the data set. We now use the coefficients from these results in order to formulate two predictive measures. The first is obtained by averaging the unstandardized beta coefficients of each predictor variable across the data sets considered. The second is obtained by using a weighted average of each predictor variable's unstandardized beta coefficients, with the weighted average formulated based on the overall model's variance numbers. Using the averaged coefficient values reported in Table 9, we arrive at the following formula:

  Predicted score = 69.78 - 0.63(UU) - 0.43(NU) + 0.66(RU) + 1.75(RN)

Using the weighted coefficient values yields a slightly different formula:

  Predicted score = 73.42 - 0.84(UU) - 0.45(NU) + 0.61(RU) + 1.58(RN)
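To show how the two formulas would be applied in practice, the short sketch below evaluates both for a hypothetical student profile. The coefficients are those reported in Table 9; treating the state variables as percentages of active programming time is an assumption, as is the example profile itself.

```python
# Illustration of applying the two predictive formulas above. The coefficients
# come from Table 9; treating UU, NU, RU, and RN as percentages of active
# programming time is an assumption, and the student profile is hypothetical.
AVERAGED = {"const": 69.78, "UU": -0.63, "NU": -0.43, "RU": 0.66, "RN": 1.75}
WEIGHTED = {"const": 73.42, "UU": -0.84, "NU": -0.45, "RU": 0.61, "RN": 1.58}

def predicted_score(profile, coeffs):
    """profile: dict of UU/NU/RU/RN values; returns the predicted grade."""
    return coeffs["const"] + sum(coeffs[k] * profile.get(k, 0.0)
                                 for k in ("UU", "NU", "RU", "RN"))

profile = {"UU": 5.0, "NU": 20.0, "RU": 15.0, "RN": 10.0}  # hypothetical student
print(predicted_score(profile, AVERAGED))  # approx. 85.4
print(predicted_score(profile, WEIGHTED))  # approx. 85.2
```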
To verify the accuracy of each formula, we calculated the predicted score for each dataset using both formulas. Next, we performed a linear regression using this predicted score as the predictor variable and the student's actual assignment score as the outcome variable. We deemed a formula to be successful if it closely mirrored the total amount of variance explained by the overall NPSM model. The amount of variance accounted for by each formula, as well as the overall NPSM model, is listed in Table 10. Inspection of Table 10 reveals that both formulas closely mirror each other (within +/- 2%), and that both are close to the overall NPSM model in terms of explanatory power. As such, it would appear that either formula does a good job of transforming the NPSM data into a usable predictive measure.

Table 10: Variance Explained by Overall NPSM Model, Averaged NPSM Formula, and Weighted NPSM Formula

  Dataset   NPSM Model   Averaged Coefficient Formula   Weighted Coefficient Formula
  A1        13%          13%                            13%
  A1-A2     15%          15%                            13%
  A1-A3     20%          20%                            18%
  A1-A4     26%          28%                            27%
  A1-A5     37%          38%                            39%
  A1-A6     45%          43%                            45%
  A1-A7     41%          39%                            41%

6. DISCUSSION
We now turn to a detailed discussion of our results, organized around the two research questions we posed for this research.

6.1 RQ1: Do Prior Results Generalize to Different Populations and Programming Languages/Environments?
The results for the Error Quotient and Watwin measures differ drastically from the results presented in prior research. For example, a recent study found that the Error Quotient and Watwin measures accounted for 18% and 36% of the variance in students' final grades [26], as compared to merely 3% and 12% in our study. How can we account for this large discrepancy? We offer two possible explanations.

First, differences in the instructional emphasis of the courses studied might have contributed to the differences in the Error Quotient and Watwin Score observed across the studies. In previous studies in which the Error Quotient and Watwin Score were calculated, student homework was worth just 25% of the overall grade. In contrast, in our study, student homework accounted for 35% of the overall grade.

Second, the discrepancies in Error Quotient and Watwin Score measures might be related to key differences in the programming environments and languages used in the studies. Previous studies focused on BlueJ [17] and the Java programming language. This study collected data on Microsoft® Visual Studio® and the C++ programming language. Both the Error Quotient and Watwin Score rely on the processing of compilation error messages. Given that C++ compilers tend to produce terser and/or more obtuse compilation error messages, it seems plausible that differences could have occurred with respect to students' compilation behaviors in the two environments. For example, forgetting a semi-colon in BlueJ and Java results in the error message, "error: ';' expected," followed by the exact line on which a semi-colon is missing. In contrast, forgetting a semi-colon in Visual Studio® and C++ results in nine error messages. The first message is a red herring referencing an illegal usage of a type as an expression. For the actual cause, the user must look to the second error message, which states, "syntax error: missing ';' before identifier <x>," with <x> being the line below the statement on which a semi-colon is missing.

Of these two explanations, we find the second one to be the most compelling. Recall that both the Error Quotient and Watwin Score assign penalty points when subsequent compilation attempts either result in more errors, or contain the same error messages as previous compilation attempts. Given that Visual Studio® and C++ generate more error messages per compilation, it stands to reason that the Error Quotient and Watwin would artificially inflate the base penalty assigned to students for each failed compilation. Furthermore, in Visual Studio®/C++, the possibility that both the Error Quotient and Watwin Score will generate false positives (matched compilations that have the same error message but for different reasons) increases. In contrast, the coarser approach taken by the NPSM is not affected by these differences: an error state is an error state, regardless of whether a student generated one or one hundred errors in a given compilation.

Even though the amount of variance accounted for by the Error Quotient and Watwin Score in this analysis is much lower than what has been previously reported, it is still possible to make comparisons with prior studies. For example, in their first study, Watson et al. [27] found that the predictive power of both the Error Quotient and Watwin Score increased as a function of the size of the input data. When they considered only a single assignment's worth of data (roughly 2-3 weeks), the variance explained by both Error Quotient and Watwin was fairly low: 10% for Error Quotient and 6% for Watwin. However, by the end of the term, the variance explained by Error Quotient and Watwin had increased to 19% and 42% respectively. Using relative magnitudes, we see that, in their study, Error Quotient increased by a factor of two and Watwin by a factor of seven.

These results are somewhat consistent with the results of this study, which found that the Error Quotient performed best with smaller data sets, and that the Watwin Score performed best with larger data sets. However, unlike in previously published studies, the variance explained by the Error Quotient in our study actually decreased as the size of the input data set increased.
That the relative trend in the amount of variance explained by the Watwin Score is similar across studies, whereas the relative trend in the amount of variance explained by the Error Quotient is not, lends further credence to the idea that these predictive measures do not perform consistently when applied to different programming environments and languages. However, in order to increase our confidence in this claim, we would need to conduct additional studies of the Error Quotient, Watwin Score, and NPSM using a variety of programming environments and languages.

6.2 RQ2: How Well Can a More Holistic Programming Model Predict Performance?
In this paper, we developed the NPSM, a predictive model based on the time spent in a set of programming states derived from a program's syntactic and semantic correctness. Given the configuration of instructor, assignments, exams and IDE we studied, the NPSM outperformed models that only consider compilation behavior. At the level of individual programming assignments, the NPSM accounted for four times as much variance as the Watwin Score, but only slightly more variance (1%) than the Error Quotient. With respect to students' overall assignment averages, the NPSM accounted for nearly four times as much variance as the Watwin Score, and over six times the variance accounted for by the Error Quotient. With respect to final course grades, the NPSM accounted for three times as much variance as the Watwin Score, and 12 times as much variance as the Error Quotient.

In Study II, we developed a predictive formula based on four NPSM states. RN (execute a semantically incorrect program) and RU (execute a semantically unknown program) were found to be positive contributors to student success. Conversely, UU (default state before the first compilation/execution action is taken in a programming session) and NU (syntactically incorrect program) were found to be negative contributors to a student's success. This seems to indicate that toying with a program's runtime behavior, regardless of semantic correctness, is a successful programming approach. In contrast, writing large portions of code without attempting to compile (UU) is not conducive to success. Indeed, it is easy to imagine that when these students finally do compile, they quickly find themselves in NU, the other state negatively correlated with performance. It is also worth noting that the editing states that follow runtime exceptions (YN, NN) were not significant predictors. Therefore, it might be worthwhile to drop this distinction in a future version of our model.

As revealed by our study, the calculations performed by both the Error Quotient and Watwin measures are based on the least weighted significant contributor in the NPSM model: NU. Interestingly, performing a linear regression with NU as the sole predictor variable explains more variance than either the Error Quotient or Watwin Score for both assignment average, F(1, 93) = 15.06, p < 0.01, R² = 0.13, and final grade, F(1, 93) = 23.676, p < 0.01, R² = 0.19. This strongly suggests that any measurement based on programming behaviors would do well to look beyond compilation behavior.

Finally, we note that the aggregation method used in the NPSM is only one possible approach to quantifying Figure 1's state diagram. It is possible that other approaches, such as one that quantifies the number and types of transitions, would yield better results. Exploring this possibility would be an interesting direction for future research.

7. CONCLUSION AND FUTURE WORK
This paper introduced the NPSM, a holistic model of student programming behavior; compared its explanatory power against two previously established measures; and derived a formula for predicting student performance given a set of programming process data. Our results indicate that, at least in the population considered in this paper, the NPSM is much better at predicting student performance than the Error Quotient and Watwin Score.

Our preliminary research into the NPSM suggests several directions for future work. First, future studies should examine the robustness of the NPSM by performing a replication study with a larger student population under similar classroom conditions. This would allow researchers to test the predictive NPSM formula against a population different from the population used to derive it. Furthermore, the increased power that accompanies an increase in sample size might allow for the discovery of additional significant factors within the NPSM.

Second, future research should examine the suitability of the NPSM as a predictive measure under conditions not considered in our study. This includes applying the NPSM to different programming languages, environments, and computing courses. It will be especially important to explore the predictive power of the NPSM as applied to different programming environments, given that the NPSM includes states that are unreachable in novice programming environments. For example, novice programming environments often do not allow a program to be executed unless it is syntactically correct (i.e., there is no R/ state), and have only one mode of execution (i.e., there is no distinction between RN and DN, and between RU and DU). Might a modified NPSM, with some states eliminated and other states combined, yield the same predictive power as was observed in this study?

Third, one should consider expanding the scope of the NPSM by incorporating predictors that are not based on programming behavior. For example, in ongoing work, we are exploring how a student's online social behavior (see, e.g., [13]) might impact the predictive capabilities of the NPSM.

Lastly, we plan to explore how the NPSM might serve as a foundation for pedagogical interventions derived from a student's NPSM state. For example, a student who appears to be stuck in an unhelpful state (e.g., NU) might be prompted to ask for help. Alternatively, we might be able to use programming behavior to encourage students to improve their programming techniques. For example, for students who spend a lot of time in the RN (execute without debug) state, an intervention could suggest using the debugger (DN) to troubleshoot semantic issues.

For instructors, we envision an online dashboard that could present continuously-updated information on students' NPSM states and programming progress. Using this information, instructors could check in on struggling students, or perhaps devote additional lecture time to topics or strategies that the dashboard indicates may be problematic for many students.

8. ACKNOWLEDGMENTS
This project is funded by the National Science Foundation under grant no. IIS-1321045.

9. REFERENCES
[1] Ahmadzadeh, M., Elliman, D. and Higgins, C. 2005. An analysis of patterns of debugging among novice computer science students. ITiCSE '05: Proceedings of the 10th annual SIGCSE conference on Innovation and technology in computer science education. ACM Press. 84–88.
[2] Altadmri, A. and Brown, N.C.C. 2015. 37 Million Compilations: Investigating Novice Programming Mistakes in Large-Scale Student Data. Proceedings of the 46th ACM Technical Symposium on Computer Science Education (Kansas City, MO, USA, 2015), 522–527.
[3] Baker, R.S.J. and Siemens, G. 2014. Educational data mining and learning analytics. The Cambridge Handbook of the Learning Sciences. Cambridge University Press. 253–274.
[4] Bergin, S., Reilly, R. and Traynor, D. 2005. Examining the role of self-regulated learning on introductory programming performance. Proc. 2005 ACM International Computing Education Research Workshop. ACM Press. 81–86.
[5] Bransford, J., Brown, A.L. and Cocking, R.R. eds. 1999. How people learn: Brain, mind, experience, and school. National Academy Press.
[6] Campbell, P.F. and McCabe, G.P. 1984. Predicting the Success of Freshmen in a Computer Science Major. Commun. ACM. 27, 11 (1984), 1108–1113.
[7] Carter, A.S. 2012. Supporting the virtual design studio through social programming environments. Proceedings of the ninth annual international conference on International computing education research (Auckland, New Zealand, 2012), 157–158.
[8] Goldenson, D.R. and Wang, B.J. 1991. Use of structure editing tools by novice programmers. Empirical Studies of Programmers: Fourth Workshop. Ablex. 99–120.
[9] Graham, M.J., Frederick, J., Byars-Winston, A., Hunter, A.B. and Handelsman, J. 2013. Increasing persistence of college students in STEM. Science. 341, 27 Sept. (2013), 1455–56.
[10] Guzdial, M. 1994. Software-realized scaffolding to facilitate programming for science learning. Interactive Learning Environments. 4, 1 (1994), 1–44.
[11] Harrell, F.E. 2001. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.
[12] Hundhausen, C.D., Brown, J.L., Farley, S. and Skarpas, D. 2006. A methodology for analyzing the temporal evolution of novice programs based on semantic components. Proceedings of the 2006 ACM International Computing Education Research Workshop. ACM Press. 45–56.
[13] Hundhausen, C.D., Carter, A.S. and Adesope, O. 2015. Supporting Programming Assignments with Activity Streams: An Empirical Study. Proc. 2015 SIGCSE Symposium on Computer Science Education (New York, 2015).
[14] Jadud, M.C. 2006. Methods and tools for exploring novice compilation behaviour. Proc. Second International Workshop on Computing Education Research. ACM. 73–84.
[15] Jeske, D., Stamov-Rossnagel, C. and Backhaus, J. 2014. Learner characteristics predict performance and confidence in e-Learning: An analysis of user behavior and self-evaluation. Journal of Interactive Learning Research. 25, 4 (2014), 509–529.
[16] Kessler, C.M. and Anderson, J.R. 1986. A Model of Novice Debugging in LISP. Empirical Studies of Programmers. 198–212.
[17] Kölling, M., Quig, B., Patterson, A. and Rosenberg, J. 2003. The BlueJ system and its pedagogy. Journal of Computer Science Education. 13, 4 (2003), 249–268.
[18] Ma, W., Adesope, O.O., Nesbit, J.C. and Liu, Q. 2014. Intelligent tutoring systems and learning outcomes: A meta-analytic survey. Journal of Educational Psychology. 106, 4 (2014), 901–918.
[19] Rosson, M.B., Carroll, J.M. and Sinha, H. 2011. Orientation of Undergraduates Toward Careers in the Computer and Information Sciences: Gender, Self-Efficacy and Social Support. ACM Transactions on Computing Education. 11, 3 (Oct. 2011), 1–23.
[20] Schunk, D.H. 2012. Learning theories: An educational perspective. Merrill Prentice Hall.
[21] Slavin, R.E. 2011. Educational psychology: Theory and practice. Pearson Education.
[22] Spohrer, J.C. 1992. MARCEL: Simulating the novice programmer. Ablex.
[23] Tabanao, E.S., Rodrigo, M.M.T. and Jadud, M.C. 2011. Predicting at-risk novice Java programmers through the analysis of online protocols. Proceedings of the seventh international workshop on Computing education research (Providence, Rhode Island, USA, 2011), 85–92.
[24] U.S. Department of Education, Office of Educational Technology 2012. Enhancing Teaching and Learning through Educational Data Mining and Learning Analytics: An Issue Brief.
[25] Visual Studio® - Microsoft® Developer Tools: 2015. http://www.visualstudio.com. Accessed: 2015-04-20.
[26] Watson, C., Li, F.W.B. and Godwin, J.L. 2014. No tests required: comparing traditional and dynamic predictors of programming success. Proceedings of the 45th ACM Technical Symposium on Computer Science Education (2014), 469–474.
[27] Watson, C., Li, F.W.B. and Godwin, J.L. 2013. Predicting Performance in an Introductory Programming Course by Logging and Analyzing Student Programming Behavior. Proceedings of the 2013 IEEE 13th International Conference on Advanced Learning Technologies (2013), 319–323.
[28] Wilson, B.C. and Shrock, S. 2001. Contributing to Success in an Introductory Computer Science Course: A Study of Twelve Factors. SIGCSE Bull. 33, 1 (2001), 184–188.
[29] Wilson, C.R., Voorhis, V. and Morgan, B.L. 2007. Understanding power and rules of thumb for determining sample sizes. Tutorials in Quantitative Methods for Psychology. 3, 2 (2007), 43–50.
