International Conference on
Educational Data Mining
(EDM 2013)
July 6-9, Memphis, TN, USA
S. K. D'Mello, R. A. Calvo, & A. Olney (Eds.)
Foreword
Welcome to the sixth installment of the International Conference on Educational Data Mining
(EDM 2013), which will be held in sunny Memphis, Tennessee from the 6th to 9th of July 2013.
Since its inception in 2008, the EDM conference series has featured some of the most innovative
and fascinating basic and applied research centered on data mining, education, and learning
technologies. This tradition of exemplary interdisciplinary research has been kept alive in 2013
as evident through the imaginative, exciting, and diverse set of papers spanning the fields of
Machine Learning, Artificial Intelligence, Learning Technologies, Education, Linguistics, and
Psychology. The EDM 2013 conference program features a rich collection of original research
embodied through oral presentations, posters, invited talks, a young researchers track, tutorials,
interactive demos, and a panel session.
We received 109 submissions for the main track. Each submission was assigned to three
members of the Program Committee based on their areas of expertise. Their reviews were then
examined by the Program Chairs who coordinated discussions among the reviewers in order to
arrive at a decision. Twenty-seven out of the 109 submissions were accepted as full papers (a
25% acceptance rate) and 22 as short papers (a 45% acceptance rate for full and short papers).
An additional 27 were accepted as poster presentations.
In addition to the main track, the conference received 15 submissions to the young researchers track (YRT), 7 to the late-breaking results track, and 9 to the interactive events
track. Six of the YRT submissions were accepted into the YRT with an additional two being
accepted as posters. Five late-breaking results papers were accepted, as were the nine demo
papers.
Each day of the conference will be kick-started by an invited talk from one of three outstanding researchers: Valerie Shute (Florida State University), John Anderson (Carnegie Mellon University), and Ryan Shaun Joazeiro de Baker (Teachers College, Columbia University). The main
conference will end with a panel session on the future of EDM with panelists including Tiffany
Barnes, Ed Dieterle, Neil Heffernan, Taylor Martin, and Sebastian Ventura, and moderated by
Sidney D'Mello and Ryan Baker. The conference will be followed by a series of mini-tutorials
led by Agathe Merceron, David Cooper, and Tristan Nixon.
EDM 2013 has broken a number of records. As noted in the figure on the left, we received a record number of 109 submissions this year, a 36% increase from the last two years and a 140% increase since the first EDM conference. This allowed us to accept a larger number of papers for oral presentation (a 53% increase from EDM 2012) while still maintaining the historic acceptance rate (45% in 2013, compared to an average rate of 41% from 2008 to 2012). Although EDM has historically been a single-track conference, the increase in the number of submissions and accepted papers led us to a blended approach of both single and dual tracks. Another novelty at EDM is the introduction of short mini-tutorials that will be held on July 9th.
The EDM 2013 conference would not have been possible without the vision and dedicated
effort of a number of people. We are indebted to the Program Committee and the additional
reviewers for their exceptional work in reviewing the submissions and helping us select the best
papers for the conference. We would like to acknowledge Tiffany Barnes and Davide Fossati for
organizing the YRT. Kristy Boyer and Usef Faghihi conceived the brilliant idea of having EDM
mini-tutorials and we would like to thank them for putting those together. We would also like
to thank Fazel Keshtkar and Sebastian Ventura for organizing the interactive events. A special
thanks to Phil Pavlik for managing the website and for joining us on the awards committee.
Finally, thanks to the authors for sending us their best work and to all the attendees who bring
EDM to life.
Andrew Olney, Phil Pavlik, and Art Graesser would like to thank Ryan Baker for his encouragement to host the 2013 conference, John Stamper for his efforts in securing sponsorships, and
Natalie Person for masterminding our incredible banquet. We are indebted to a number of students who worked tirelessly in the months leading up to the conference, including Jackie Maas,
Breya Walker, Haiying Li, Nia Dowell, Blair Lehman, Brent Morgan, Carol Forsyth, Patrick
Hays, and Whitney Cade. Additional thanks go to Conference Planning and Operations at the
University of Memphis, especially Courtney Shelton and Holly Stanford. We would also like
to thank our sponsors, The University of Memphis (Office of the Provost), Carney Labs, the
Institute for Intelligent Systems, and Pearson, who generously provided funds to help offset
registration costs for students. Finally, we would like to gratefully acknowledge the National
Science Foundation who provided funds to offset costs for students to attend the YRT and the
conference under grant IIS 1340163.
In summary, 2013 appears to be an excellent year for Educational Data Mining. The
keynotes, oral and poster presentations, live demos, young researchers track, panel session,
mini tutorials, and attendees from all over the world will undoubtedly make the EDM 2013
conference an intellectually stimulating, enjoyable, and memorable event.
Sidney D'Mello, University of Notre Dame, USA
Rafael Calvo, University of Sydney, Australia
Andrew Olney, University of Memphis, USA
Program Committee
Omar Alzoubi
Mirjam Augstein Köck
Tiffany Barnes
Kristy Elizabeth Boyer
Rafael A. Calvo
Min Chi
Christophe Choquet
Cristina Conati
David G. Cooper
Richard Cox
Sidney D'Mello
Michel Desmarais
Hendrik Drachsler
Toby Dragon
Mingyu Feng
Katherine Forbes-Riley
Davide Fossati
Dragan Gasevic
Eva Gibaja
Janice Gobert
Daniela Godoy
Ilya Goldin
Joseph Grafsgaard
Neil Heffernan
Arnon Hershkovitz
Roland Hübscher
Sebastien Iksal
Fazel Keshtkar
Jihie Kim
Evgeny Knutov
Kenneth Koedinger
Irena Koprinska
Diane Litman
Ming Liu
Vanda Luengo
Lina Markauskaite
Noboru Matsuda
Manolis Mavrikis
Riccardo Mazza
Gordon McCalla
Agathe Merceron
Julià Minguillón
Tanja Mitrovic
Jack Mostow
Kasia Muldner
Roger Nkambou
Andrew Olney
Alexandros Paramythis
Abelardo Pardo
Zachary Pardos
Philip I. Pavlik Jr.
Mykola Pechenizkiy
Peter Reimann
Cristobal Romero
Carolyn Rose
Ryan S.J.D. Baker
John Stamper
Jun-Ming Su
Steven Tanimoto
Sebastián Ventura
Katrien Verbert
Stephan Weibelzahl
Fridolin Wild
Martin Wolpers
Kalina Yacef
Michael Yudelson
Amelia Zafra Gómez
Committee affiliations (the name/affiliation pairing did not survive extraction):
University of Memphis
Johannes Kepler University
University of Sydney
Massachusetts Institute of Technology
University of Memphis
Eindhoven University of Technology
University of Sydney
University of Cordoba
Carnegie Mellon University
Columbia University Teachers College
Carnegie Mellon University
National Chiao Tung University
University of Washington
University of Cordoba
K.U. Leuven
National College of Ireland
The Open University
Fraunhofer Institute of Applied Information Technology
University of Sydney
Carnegie Mellon University
University of Cordoba
Additional Reviewers
Baker, Ryan
Fournier-Viger, Philippe
Heffernan, Neil
Hussain, Md. Sazzad
Hussain, Sazzad
Joksimovic, Srecko
Kiseleva, Julia
Kovanovic, Vitomir
Latyshev, Alexey
Li, Nan
Liu, Li
Long, Yanjin
Maclellan, Christopher
Martinez Maldonado, Roberto
Mostow, Jack
Olney, Andrew
Rau, Martina
Sethi, Ricky
Shareghi Najar, Amir
Sodoke, Komi
Stampfer, Eliane
Van Velsen, Martin
Vanhoudnos, Nathan
Wang, Yutao
Zhao, Yu
Table of Contents
Keynotes
Oral Presentations
(Full Papers)
Student Profiling from Tutoring System Log Data: When do Multiple Graphical
Representations Matter? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Ryan Carlson, Konstantin Genin, Martina Rau and Richard Scheines
Unsupervised Classification of Student Dialogue Acts with Query-Likelihood Clustering . . 20
Aysu Ezen-Can and Kristy Elizabeth Boyer
A Spectral Learning Approach to Knowledge Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Mohammad H. Falakmasir, Zachary A. Pardos, Geoffrey J. Gordon and Peter
Brusilovsky
Optimal and Worst-Case Performance of Mastery Learning Assessment with Bayesian
Knowledge Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Stephen Fancsali, Tristan Nixon and Steven Ritter
Automatically Recognizing Facial Expression: Predicting Engagement and Frustration . . . . 43
Joseph Grafsgaard, Joseph B. Wiggins, Kristy Elizabeth Boyer, Eric N. Wiebe and
James Lester
Investigating the Solution Space of an Open-Ended Educational Game Using
Conceptual Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Erik Harpstead, Christopher J. MacLellan, Kenneth R. Koedinger, Vincent Aleven,
Steven P. Dow and Brad A. Myers
Extending the Assistance Model: Analyzing the Use of Assistance over Time . . . . . . . . . . . . . 59
William Hawkins, Neil Heffernan, Yutao Wang and Ryan S.J.D. Baker
Differential Pattern Mining of Students' Handwritten Coursework . . . . . . . . . . . . . . . . . . . . . . . . 67
James Herold, Alex Zundel and Thomas Stahovich
Predicting Future Learning Better Using Quantitative Analysis of Moment-by-Moment
Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Arnon Hershkovitz, Ryan S.J.D. Baker, Sujith M Gowda and Albert T. Corbett
InVis: An Interactive Visualization Tool for Exploring Interaction Networks . . . . . . . . . . . . . . 82
Matthew Johnson, Michael Eagle and Tiffany Barnes
Tag-Aware Ordinal Sparse Factor Analysis for Learning and Content Analytics . . . . . . . . . . . 90
Andrew Lan, Christoph Studer, Andrew Waters and Richard Baraniuk
Oral Presentations
(Short Papers)

Poster Presentations
(Regular Papers)
Do students really learn an equal amount independent of whether they get an item
correct or wrong? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Seth Adjei, Seye Salehizadeh, Yutao Wang and Neil Heffernan
Analysis of students' clustering results based on Moodle log data . . . . . . . . . . . . . . . . . . . . . . . . . 306
Angela Bovo, Stephane Sanchez, Olivier Heguy and Yves Duthen
Mining the Impact of Course Assignments on Student Performance . . . . . . . . . . . . . . . . . . . . . . . 308
Ritu Chaturvedi and Christie Ezeife
Mining Users' Behaviors in Intelligent Educational Games: Prime Climb a Case Study . . . . 310
Alireza Davoodi, Samad Kardan and Cristina Conati
Bringing student backgrounds online: MOOC user demographics, site usage, and online
learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Jennifer Deboer, Glenda S. Stump, Daniel Seaton, Andrew Ho, David E. Pritchard
and Lori Breslow
Detecting Player Goals from Game Log Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Kristen Dicerbo and Khusro Kidwai
A prediction model that uses the sequence of attempts and hints to better predict
knowledge: Better to attempt the problem first, rather than ask for a hint . . . . . . . . . . . . . . . . 316
Hien Duong, Linglong Zhu, Yutao Wang and Neil Heffernan
Towards the development of a classification service for predicting students' performance . . 318
Diego García-Saiz and Marta Zorrilla
Identifying and Visualizing the Similarities Between Course Content at a Learning
Object, Module and Program Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Kyle Goslin and Markus Hofmann
Using ITS Generated Data to Predict Standardized Test Scores . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Kim Kelly, Ivon Arroyo and Neil Heffernan
Joint Topic Modeling and Factor Analysis of Textual Information and Graded Response
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Andrew Lan, Christoph Studer, Andrew Waters and Richard Baraniuk
Component Model in Discourse Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Haiying Li, Art Graesser and Zhiqiang Cai
Poster Presentations
(Late Breaking Results)

Oral Presentations
(Young Researchers Track)

Interactive Events/
Demos
Project CASSI: A Social-Graph Based Tool for Classroom Behavior Analysis and
Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Robert Olson, Zachary Daily, John Malayny and Robert Szkutak
A Moodle Block for Selecting, Visualizing and Mining Students' Usage Data . . . . . . . . . . . . . . 400
Cristobal Romero, Cristobal Castro and Sebastián Ventura
SEMILAR: A Semantic Similarity Toolkit for Assessing Students' Natural Language
Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
Vasile Rus, Rajendra Banjade, Mihai Lintean, Nobal Niraula and Dan Stefanescu
Gathering Emotional Data from Multiple Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
Sergio Salmeron-Majadas, Olga C. Santos, Jesus G. Boticario, Raúl Cabestrero, Pilar
Quirós and Mar Saneiro
A Tool for Speech Act Classification Using Interactive Machine Learning . . . . . . . . . . . . . . . . 406
Borhan Samei, Fazel Keshtkar and Arthur C. Graesser
Keynotes
Valerie Shute (Florida State University)
vshute@fsu.edu
ABSTRACT
"You can discover more about a person in an hour of play than in a year of conversation" (Plato). For the past 6-7 years, I have been
examining ways to leverage good video games to assess and support important student competencies, especially those that are not
optimally measured by traditional assessment formats. The term "stealth assessment" refers to the process of embedding assessments deeply
and invisibly into the gaming environment. Though this approach produces ample real-time data on a player's interactions within the game
environment and preserves player engagement, a primary challenge for using stealth assessment in games is taking this stream of data and
making valid inferences about players' competencies that can be examined at various points in time (to see growth), and also at various
grain sizes (for diagnostic purposes). In this talk, I will present recent work related to creating and embedding three stealth assessments (for
creativity, conscientiousness, and qualitative physics understanding) into Newton's Playground, a game we developed that emphasizes
nonlinear gameplay and puzzle-solving in a 2D physics simulation environment. I will begin by framing the topic in terms of why this type of
research is sorely needed in education, then generally describe the stealth assessment approach, and finally provide some concrete
examples of how to do it and how well it works regarding validity issues, learning, and enjoyment from a recent research study.
SHORT BIO
Valerie Shute is the Mack & Effie Campbell Tyner Endowed Professor in Education in the Department of Educational Psychology and
Learning Systems at Florida State University. Before coming to FSU in 2007, she was a principal research scientist at Educational Testing
Service where she was involved with basic and applied research projects related to assessment, cognitive diagnosis, and learning from
advanced instructional systems. Her general research interests hover around the design, development, and evaluation of advanced systems
to support learning, particularly related to 21st century competencies. An example of current research involves using immersive games
with stealth assessment to support learning of cognitive and non-cognitive knowledge, skills, and dispositions. Her research has resulted
in numerous grants, journal articles, chapters in edited books, a patent, and several recent books such as Innovative assessment for the 21st
century: Supporting educational needs (Shute & Becker, 2010) and Measuring and supporting learning in games: Stealth assessment
(Shute & Ventura, 2013).
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
John Anderson (Carnegie Mellon University)

SHORT BIO
John Anderson received his B.A. from the University of British Columbia in 1968 and his Ph.D. from Stanford University in 1972. He has
been at Carnegie Mellon University since 1978, where he is the Richard King Mellon Professor of Psychology and Computer Science. He
has served as president of the Cognitive Science Society, and has been elected to the American Academy of Arts and Sciences, the
National Academy of Sciences, and the American Philosophical Society. He is the current editor of Psychological Review. He has
received numerous scientific awards including the American Psychological Association's Distinguished Scientific Career Award, the David
E. Rumelhart Prize for Contributions to the Formal Analysis of Human Cognition, the inaugural Dr A.H. Heineken Prize for Cognitive
Science, and the Benjamin Franklin Medal in Computer and Cognitive Science.
He is known for developing ACT-R, which is the most widely used cognitive architecture in cognitive science. Anderson was also an early
leader in research on intelligent tutoring systems. Computer systems based on his cognitive tutors currently teach mathematics to over
500,000 children in American schools. He has published a number of books including Human Associative Memory (1973 with Gordon
Bower), Language, Memory, and Thought (1976), The Architecture of Cognition (1983), The Adaptive Character of Thought (1990),
Rules of the Mind (1993), The Atomic Components of Thought (1998 with Christian Lebiere), and How Can the Human Mind Occur
in the Physical Universe? (2007). His current research interest is focused on combining cognitive modeling and brain imaging to
understand the processes of mathematical learning.
Ryan Shaun Joazeiro de Baker (Teachers College, Columbia University)
Baker2@exchange.tc.columbia.edu
ABSTRACT
We've started to answer the questions of what we can model through EDM, and we're getting better and better at modeling each year. We
publish papers that present solid numbers under reasonably stringent cross-validation, and we find that our models don't just agree with
training labels, but can predict future performance and engagement as well. We're making progress as a field in figuring out how to use
these models to drive and support intervention, although there's a whole lot more to learn.
But when and where can we trust our models? One of the greatest powers of EDM models is that we can use them outside the contexts
in which they were originally developed, but how can we trust that we're doing so wisely and safely? Theory from machine learning
and statistics can be used to study generalizability, and we know empirically that models developed with explicit attention to
generalizability and construct validity are more likely to generalize and to be valid. But our conceptions and characterizations of population
and context remain insufficient to fully answer the question of whether a model will be valid where we will apply it. What's worse, the world is
constantly changing; the model that works today may not work tomorrow, if the context changes in important ways, and we don't know yet
which changes matter.
In this talk, I will illustrate these issues by discussing our work to develop models that generalize across urban, rural, and suburban
settings in the United States, and to study model generalizability internationally. I will discuss work from other groups that starts to think
more carefully about characterizing context and population in a concrete and precise fashion, noting where this work is successful and where it
remains incomplete. By considering these issues more thoroughly, we can become increasingly confident in the applicability, validity, and
usefulness of our models for broad and general use, a necessity for using EDM in a complex and changing world.
SHORT BIO
Ryan Shaun Joazeiro de Baker is the Julius and Rosa Sachs Distinguished Lecturer at Teachers College, Columbia University. He earned
his Ph.D. in Human-Computer Interaction from Carnegie Mellon University. Baker was previously Assistant Professor of Psychology and
the Learning Sciences at Worcester Polytechnic Institute, and he served as the first Technical Director of the Pittsburgh Science of
Learning Center DataShop, the largest public repository for data on the interaction between learners and educational software.
He is currently serving as the founding President of the International Educational Data Mining Society, and as Associate Editor of the
Journal of Educational Data Mining. His research combines educational data mining and quantitative field observation methods in order to
better understand how students respond to educational software, and how these responses impact their learning. He studies these issues
within intelligent tutors, simulations, multi-user virtual environments, and educational games.
Oral Presentations
(Full Papers)
Joseph E. Beck
josephbeck@wpi.edu
Xiaolu Xiong
xxiong@wpi.edu
ABSTRACT
There has been a large body of work in the field of EDM
involving predicting whether the student's next attempt will be
correct. Many promising ideas have resulted in negligible gains
in accuracy, with differences in the thousandths place on RMSE
or R2. This paper explores how well we can expect student
modeling approaches to perform at this task. We attempt to place
an upper limit on model accuracy by performing a series of
cheating experiments. We investigate how well a student model
can perform that has: perfect information about a student's
incoming knowledge, the ability to detect the exact moment when
a student learns a skill (binary knowledge), and the ability to
precisely estimate a student's level of knowledge (continuous
knowledge). We find that the binary knowledge model has an AUC
of 0.804 on our sample data, relative to a baseline PFA model
with an AUC of 0.745. If we weaken our cheating model slightly, such that
it no longer knows students' incoming knowledge but simply
assumes students are incorrect on their first attempt, AUC drops
to 0.747. Consequently, we argue that many student modeling
techniques are relatively close to ceiling performance, and there
are probably not large gains in accuracy to be had. In addition,
knowledge tracing and performance factors analysis, two popular
techniques, correlate with each other at 0.96, indicating few
differences between them. We conclude by arguing that there are
more useful student modeling tasks such as detecting robust
learning or wheel-spinning, and estimating parameters such as
optimal spacing that are deserving of attention.
Keywords
Cheating experiments, student modeling, limits to accuracy,
knowledge tracing, performance factors analysis
1. INTRODUCTION
The field of educational data mining has seen many papers
published on the topic of student modeling, frequently predicting
next item correctness (e.g. [1-6]). Next item correctness refers to
the student modeling task where the student's past performance
on this skill is known, and the goal is to predict whether the
student will respond correctly or incorrectly to the current item.
This task was the topic of the KDD Cup in 2010. It is typically
assumed that data from other students are also available to aid in
fitting modeling parameters.
This research area certainly
appeared to be ripe grounds for rapid improvement, with reported
R2 values for Performance Factors Analysis (PFA; [7]) and
Bayesian knowledge tracing [8] of 0.07 and 0.17, respectively [9].
PFA and Bayesian knowledge tracing were two better known,
baseline techniques, and their apparent poor performance left
tremendous room for improvement by developing more refined
modeling techniques.
Note that RMSE, R2 and AUC values are not comparable across
studies due to differing datasets.
upper limit on AUC is 1.0, and the practical lower limit is 0.5.
AUC evaluates techniques based on how well they order their
predictions. For four problems, if a model predicts that a student
has a 95%, 90%, 87%, and 86% chance of responding correctly
and the student gets the first two items correct and the next two
items incorrect, the AUC will be a perfect 1.0 (assuming a
threshold of 50% is used, which would be a poor choice in this
scenario). Even though the model predicts the student is likely to
get the last two items correct, since those items are relatively less
likely to be correct than the first two, AUC gives a perfect score.
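The ordering property described above can be checked directly. A minimal sketch (assuming scikit-learn is available) of the four-item example:

```python
from sklearn.metrics import roc_auc_score

# Four-item example from the text: predicted probabilities of a correct
# response, and the actual outcomes (1 = correct, 0 = incorrect).
predictions = [0.95, 0.90, 0.87, 0.86]
actuals = [1, 1, 0, 0]

# AUC scores only the relative ordering of the predictions: both correct
# items are ranked above both incorrect ones, so AUC is a perfect 1.0 even
# though the model expected all four responses to be correct.
print(roc_auc_score(actuals, predictions))  # 1.0
```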
R2 is based on the squared error between the predicted and actual
value, but is normalized relative to the variance in the dataset. A
perfect R2 value is 1.0, while 0 is a lower bound for (non-pseudo)
R2. R2 is similar to Root Mean Squared Error (RMSE), but is
more interpretable due to the normalization step. For example, it
is unclear whether an RMSE of 0.3 is good or bad, perhaps a
better error could be obtained simply by predicting the mean
value? However, an R2 of 0.8 indicates the model accounts for
most of the variability in the data. For computational simplicity,
we do not use pseudo-R2 methods such as Nagelkerke's in this
paper. Neither AUC nor R2 is a perfect evaluation metric but,
combined, they account for different aspects of model
performance (relative ordering, and absolute accuracy,
respectively) and provide us a basis for evaluating our models.
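The contrast between the two metrics can be sketched as follows (the data here are illustrative, not taken from either dataset):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """(Non-pseudo) R^2: squared error normalized by the variance of the
    data, so 1.0 is perfect and 0 means no better than predicting the mean."""
    sse = np.sum((actual - predicted) ** 2)
    sst = np.sum((actual - np.mean(actual)) ** 2)
    return float(1.0 - sse / sst)

# Illustrative correctness outcomes (1/0) and model predictions.
actual = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
predicted = np.array([0.9, 0.2, 0.8, 0.7, 0.1])

print(rmse(actual, predicted))       # ~0.195: hard to judge in isolation
print(r_squared(actual, predicted))  # ~0.842: fit relative to the variance
```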
Table 1 shows performance of the baseline PFA model on both
the ASSISTments and KDD Cup data. We can see that the model
does not fit the KDD Cup data set as well as the ASSISTments
data. Also, the AUC scores are reasonably high, indicating PFA is
able to order its predictions relatively well. However, the lower
R2 values indicate the magnitude of the errors is still substantial.
Table 1. Performance of baseline PFA model

Data source    AUC     R2
ASSISTments    0.745   0.170
KDD Cup        0.713   0.100
and difficulty are both in the range [0,1]. Larger values represent
higher degrees of learner knowledge and more difficult items. If
item difficulty is less than or equal to knowledge, this model
maintains the learner will respond correctly. Otherwise the
learner will respond incorrectly to the item. In this manner, it can
represent a student who can respond correctly to some items
within a skill, but get other items wrong. The intuition is that the
model raises knowledge just high enough to account for the student's
correct responses. On observing an incorrect response, it has the
option to decrease student knowledge. The reasoning is similar to
that for CM1: a model that could increase and decrease
knowledge estimates at will (before seeing the student's response)
would achieve perfect accuracy. When a student gives an
incorrect response, it can decrease its knowledge estimate
arbitrarily low, and will lower it enough to account for later
incorrect responses.
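A rough sketch of how such a continuous-knowledge cheating model could operate (the function name, the epsilon adjustment, and the example values are our illustration, not the paper's implementation):

```python
def cm3_sketch(difficulties, outcomes, initial_knowledge=0.0, eps=0.001):
    """Illustrative continuous-knowledge cheating model.

    Predicts a correct response whenever item difficulty <= current
    knowledge, then "cheats": after observing the actual outcome it raises
    knowledge just high enough to cover a correct response, or drops it
    just below the item's difficulty after an incorrect one.
    """
    knowledge = initial_knowledge
    predictions = []
    for difficulty, correct in zip(difficulties, outcomes):
        predictions.append(difficulty <= knowledge)
        if correct:
            knowledge = max(knowledge, difficulty)
        else:
            knowledge = min(knowledge, difficulty - eps)
    return predictions

# Hypothetical trace: known initial knowledge of 0.5, three items.
print(cm3_sketch([0.4, 0.6, 0.5], [True, True, False], initial_knowledge=0.5))
```

Because each update happens only after the response is observed, even this cheating model can err on items where the student's knowledge has just changed, which is consistent with the upper limits reported in the tables below being well short of 1.0.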
The performance of CM3 on the KDD Cup data is seen in the first
row of Table 5. Again, continuous knowledge resulted in strong
performance. For the KDD Cup data, we were a bit stymied as to
the meaning of item difficulty. For these results, we used a
concatenation of problem name and step name. However, many
such pairs were only attempted by 1 student, leading to
considerable over-fitting. Using just the problem name suffers
from the problem of underspecificity, and gives an AUC and R2 of
0.798 and 0.442, respectively.
Table 4. Full performance results on ASSISTments data

Model   Initial knowledge     Continuous knowledge      AUC     R2
CM3     Known                 Yes                       0.884   0.634
CM1     Known                 No                        0.804   0.5
CM2i    Assume incorrect      No                        0.747   0.239
PFA     --                    --                        0.745   0.17
CM2m    Based on difficulty   No (except first item)    0.724   0.273
CM2c    Assumed correct       No                        0.678   0.266
[Worked example of the model's behavior (columns: item difficulty, prediction, knowledge estimate); the table layout did not survive extraction.]
3. EMPIRICAL CHEATING
EXPERIMENTS
In addition to the theoretic cheating experiments, we also examine
data from recent work [12] on ensembling multiple techniques
together. This dataset is of interest as it provides the predictions
of multiple student modeling techniques as a means of estimating
[Example predictions from several techniques; the table caption and row labels did not survive extraction.]

BF      EM      LD      PPS     PFA     prediction
0.31    0.31    0.31    0.25    0.34    0.25
0.60    0.60    0.59    0.60    0.60    0.60
0.35    0.37    0.37    0.29    0.38    0.29
0.46    0.47    0.47    0.42    0.47    0.47
0.37    0.37    0.37    0.36    0.39    0.39
Table 5. Full performance results on KDD Cup data

Model   Initial knowledge     Continuous knowledge      AUC     R2
CM3     Known                 Yes                       0.887   0.673
CM1     Known                 No                        0.762   0.453
CM2i    Assume incorrect      No                        0.754   0.353
CM2m    Based on difficulty   No (except first item)    0.713   0.357
PFA     --                    --                        0.713   0.1
CM2c    Assumed correct       No                        0.711   0.356
Model                 AUC     R2
Empirical cheating    0.831   0.324
PFA                   0.706   0.130
KT-LD                 0.701   0.126
predicting next item correctness has been done, and there are not
large gains in performance remaining to be found.
The cheating models described in this paper are extremely
powerful, and examine the basic cognitive inputs to student
performance. We are unlikely to get perfect detectors of learning
any time in the near future. For student modeling approaches that
rely on determining when a student has learned, trying to infer
incoming knowledge, and account for item difficulty, these
cheating models provide a reasonable upper limit on accuracy.
However, what of approaches that are not based on cognitive
principles? For example, student mistakes could be due to lack of
knowledge, or could be due to a careless error. Such careless
errors appear to be non-random, as such mistakes have been found
to be associated [18] with gaming the system [19], and there is
work on contextual detectors of slip and guess [13]. The potential
improvement from such work is not accounted for by the analyses
presented in this paper.
In addition, approaches such as collaborative filtering [20]
provide an avenue for non-cognitive approaches to improving
student modeling. With collaborative filtering approaches, rather
than modeling student knowledge explicitly, the goal is to
find similar past students and use their performance to make a
prediction for the current student (e.g., [21]).
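A minimal sketch of that idea (the data, function name, and agreement-based similarity measure are hypothetical illustrations; a real system would use richer similarity measures):

```python
import numpy as np

def cf_predict(history, target, item):
    """Sketch of a collaborative-filtering prediction: instead of modeling
    knowledge, weight past students' outcomes on `item` by how similar
    their response patterns are to the current student's."""
    answered = ~np.isnan(target)
    # Similarity = fraction of co-answered items where the students agree.
    similarity = (history[:, answered] == target[answered]).mean(axis=1)
    # Similarity-weighted vote over past students' outcomes on the item.
    return float(np.average(history[:, item], weights=similarity))

# Hypothetical data: rows = past students, columns = items, 1/0 = correct.
history = np.array([[1, 1, 0, 1],
                    [1, 0, 0, 0],
                    [0, 1, 1, 1]], dtype=float)
# Current student answered items 0-2; predict the outcome on item 3.
target = np.array([1, 1, 0, np.nan])
print(cf_predict(history, target, item=3))  # ~0.67
```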
5. FUTURE DIRECTIONS
It is unclear how much additional gain there is from refining student models to achieve ever higher predictive accuracy. Many promising approaches have resulted in little real-world improvement in accuracy. One drawback is the seductive combination of statistical hypothesis testing with increasingly large datasets. It is possible to find statistically reliable results corresponding to very small effects. Even with a relatively small dataset of 48,000 item responses, a result with a p-value of 0.002 corresponded to an improvement of less than 0.001 in R2 [22]. While larger datasets enable us to estimate such minuscule quantities quite precisely (thus, the low p-value), this raises the question of whether such a result is useful in any way.
We should reflect on why so much effort is being devoted to the problem of predicting the student's next response. Two candidate answers are that that's where the data are, and that this task was the goal of the 2010 KDD Cup. Certainly, correctness on each item for each student is a vast source of data. Ten years ago that argument would have been a strong rationale, but now there are large quantities of educational data of all sorts. As a thought experiment, imagine a research result were published in EDM 2014 with a new student modeling approach that achieved an A' of 0.9 (comparable to an AUC of 0.9, but A' has simpler semantics). Effectively, that would mean that given a correct and an incorrect student response, this student model could determine which was which 90% of the time. Such an accomplishment would be a major step forward in our capabilities. But what would we actually do with the model? This question is non-rhetorical, as the authors do not have a good answer. To be clear, there are plenty of useful problems our student models could address, such as the probability of a student receiving an A in the course, or whether the student is ready to move on and learn subsequent material.
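The pairwise semantics of A' described above can be made concrete with a short sketch (the scores are hypothetical): A' is the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one, with ties counted as half.

```python
def a_prime(pos_scores, neg_scores):
    """A' (equivalent to AUC): probability that a randomly chosen positive
    example is scored above a randomly chosen negative, ties counted as half."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# hypothetical predicted scores for correct and incorrect responses
val = a_prime([0.9, 0.8, 0.7], [0.6, 0.8])  # 4 wins + 1 tie over 6 pairs -> 0.75
```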
Ironically, as a field we have settled on a common test problem
that has little impact on tutorial decision making or on informing
the science of learning. We got to this point for good reasons.
Student modeling in ITS is primarily about the estimation of
ACKNOWLEDGMENTS
We want to acknowledge funding from NSF grant DRL-1109483 as well as funding of ASSISTments. See http://www.webcitation.org/67MTL3EIs for the funding sources for ASSISTments.
REFERENCES
Konstantin Genin
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, USA
kgenin@andrew.cmu.edu

Department of Philosophy
Carnegie Mellon University
Pittsburgh, PA, USA
ryancarlson@cmu.edu

Martina Rau
Human-Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, PA, USA
marau@cs.cmu.edu

Richard Scheines
Department of Philosophy
Carnegie Mellon University
Pittsburgh, PA, USA
scheines@cmu.edu

ABSTRACT
We analyze log data generated by an experiment with the Fractions Tutor, an intelligent tutoring system. The experiment compares the educational effectiveness of instruction with single and multiple graphical representations. We extract the error-making and hint-seeking behaviors of each student to characterize their learning strategy. Using an expectation-maximization approach, we cluster the students by learning strategy. We find that a) experimental condition and learning outcome are clearly associated, b) experimental condition and learning strategy are not, and c) almost all of the association between experimental condition and learning outcome is found among students implementing just one of the learning strategies we identify. This class of students is characterized by relatively high rates of error as well as a marked reluctance to seek help. They also show the greatest educational gains from instruction with multiple rather than single representations. The behaviors that characterize this group illuminate the mechanism underlying the effectiveness of multiple representations and suggest strategies for tailoring instruction to individual students. Our methodology can be implemented in an on-line tutoring system to dynamically tailor individualized instruction.

1. INTRODUCTION
2. EXPERIMENT

(see Figure 1). Students in the Single representation condition worked exclusively with either a number line, a circle, or a rectangle. Students in the Fully Interleaved condition saw a different representation than was used in the preceding problem. Students in the intermediate conditions went longer before seeing a different representation.

Figure 1: A partial ordering of experimental conditions by the frequency with which a new representation is presented.

3. METHOD
We proceed in three stages: (1) we extract features characterizing error and hint-seeking behavior from the data logs, (2) we transform the longitudinal log data into a cross-sectional form, with one observation per student, and (3) we estimate a mixture model to identify sub-populations of students, using AIC and BIC to select the number of classes. Once we have clustered our students by their learning strategy, we investigate the interaction between the strategies and the experimental conditions. We construct a contingency table binning the experimental conditions into the clusters estimated by the mixture model. We then run a Chi-squared test for independence between experimental condition and learning strategy. Chi-squared tests are also run to investigate dependence between pre-test outcome and strategy, strategy and post-test outcome, and the conditional dependence of outcome and experimental condition, given a strategic profile.
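The three-stage pipeline can be sketched as follows, with synthetic data and scikit-learn's Gaussian mixture standing in for the latent-class model estimated via EM (the features, class counts, and condition assignment here are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic cross-sectional data: one row per student, two behavioral features
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(4.0, 1.0, (60, 2))])

# fit mixture models with 1..5 classes; choose the number of classes by BIC
fits = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)}
best_k = min(fits, key=lambda k: fits[k].bic(X))
strategy = fits[best_k].predict(X)

# contingency table of (synthetic, random) condition vs. estimated cluster,
# then a chi-squared test for independence
condition = rng.integers(0, 2, size=len(X))
table = np.zeros((best_k, 2))
for s, c in zip(strategy, condition):
    table[s, c] += 1
table = table[table.sum(axis=1) > 0]  # guard against empty clusters
chi2, p, dof, _ = chi2_contingency(table)
```

With a random condition assignment, as here, the test should usually fail to reject independence; with real data the same table tests whether condition and strategy are associated.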
3.1 Extracting Features
The Fractions Tutor captures a detailed log of each student's interactions with the tutor. It stores a time series of correct and incorrect answers, hint requests, interface selections, and durations between interactions. A previous analysis [16] extracted the average number of errors made per step, the average number of hints requested per step, and the average time spent per step from the log data. These variables were used to characterize gross behavioral strategies and dispositions. Similarly, we include the average number of hints requested (HintsRequested) and the number of errors (NumErrors) made per problem by each student.

Figure 2: The x-axis represents the nth interaction with the tutor across all problems. The y-axis is the total number of hints requested at the nth step.
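Transforming a longitudinal log into these per-student averages is a small aggregation; a sketch with a hypothetical transaction layout (the column names and action labels are assumptions, not the Fractions Tutor's schema):

```python
import pandas as pd

# hypothetical transaction log: one row per tutor interaction
log = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "problem": ["p1", "p1", "p2", "p1", "p1", "p1"],
    "action":  ["error", "hint", "hint", "error", "error", "bottom_out_hint"],
})

# count each action type per (student, problem)
per_problem = pd.crosstab([log["student"], log["problem"]], log["action"])

# cross-sectional form: one observation per student (averages per problem)
features = per_problem.groupby("student").mean().rename(columns={
    "error": "NumErrors",
    "hint": "HintsRequested",
    "bottom_out_hint": "NumBOH",
})
```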
We also extract the average number of bottom-out hints (NumBOH) per student per problem; this is the average number of times a student exhausts the available hints in a given problem. We also note that it is not always the average of these features that best characterizes a student. For example, examination of the distribution of hints requested per step across experimental conditions shows a telling picture. Note that students who received only one representation start out requesting the fewest hints, but students in the moderate condition eventually need fewer (see Figure 2). Also, students in the interleaved condition tend to request many hints in the early steps of a problem, potentially reflecting the cognitive load associated with translating between representations [1]. Such considerations suggested that exploiting the timing of student interactions within a problem might expose structural features obscured by stepwise averages (as used in [16]). We fit geometric distributions to the number of steps taken before the first hint request (FirstHintGeometric) and to the number of errors before the first hint (StubbornGeometric). The estimated parameter is used to characterize the student's hint-seeking propensity in general and hint-seeking propensity when faced with adversity. For example, students in the first quintile of StubbornGeometric seek help soon after making a mistake, whereas students in the fifth quintile don't change their hint-seeking behavior even after making a large number of errors. Students in the first quintile of FirstHintGeometric are likely to request hints early in a problem, whereas students in the fifth quintile are unlikely to do so.
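The geometric fits above reduce to a one-line maximum-likelihood estimate (a sketch; the paper does not spell out its estimator): for counts of failures before a first success on {0, 1, 2, ...}, the MLE is p = 1 / (1 + mean).

```python
def fit_geometric(counts):
    """MLE of the success parameter p for a geometric distribution on
    {0, 1, 2, ...} (e.g. steps or errors before the first hint request):
    p_hat = 1 / (1 + mean(counts))."""
    mean = sum(counts) / len(counts)
    return 1.0 / (1.0 + mean)

# students who wait longer before their first hint get a smaller p_hat
p_eager = fit_geometric([0, 0, 1, 1])     # mean 0.5 -> p_hat = 2/3
p_patient = fit_geometric([5, 7, 9, 11])  # mean 8   -> p_hat = 1/9
```

The single estimated parameter is what the clustering step consumes, so each student's hint-timing behavior is compressed into one number per feature.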
3.2 Expectation-Maximization Clustering
Summary statistics for the extracted features (original columns: sd, median, min, max, and quintile cut-points 20%, 40%, 60%, 80%, 100%):

HintsRequested: 0.78, 1.27, 0.34, 11.22; quintiles 0.06, 0.19, 0.5, 1.31, 11.22
NumErrors: 2.21, 1.27, 1.92, 0.34, 8.39; quintiles 1.15, 1.7, 2.18, 3.19, 8.39
FirstHintGeometric: 0.35, 0.27, 0.27, 0.04; quintiles 0.13, 0.2, 0.33, 0.57
StubbornGeometric: 0.36, 0.21, 0.31, 0.07; quintiles 0.19, 0.27, 0.38, 0.47
NumBOH: 0.04, 0.08, 0.62; quintiles 0.01, 0.05, 0.63
4. RESULTS

In the sections that follow we analyze the results of our clustering algorithm. We describe the strategic profiles that were generated and characterize the students fitting each profile. We then consider the relationships between our variables of interest: (a) adjusted delayed post-test score, (b) experimental condition, and (c) learning strategy. Specifically, we run a series of Chi-squared tests for independence to determine how each variable relates to the others, commenting on the importance of each comparison. Finally, we explore the stability of these classes, which bears on whether future systems could detect students' strategic profiles in real time.
4.1
http://userwww.service.emory.edu/~dlinzer/poLCA/
4.2
Figure 4: Visualization of feature distributions for each learning profile. The left-to-right x-axis identifies each feature, the front-to-back y-axis identifies which value that feature takes, and the top-to-bottom z-axis describes the probability that the feature takes the value. Thus, given a feature and a class, the z-axis also describes the probability distribution over that feature in that class.

We then construct terciles of the Adjusted Delayed Post-Test Score and run a Chi-squared test for independence of outcome from experimental condition. Confirming previous results, we reject independence at a p-value of .024 (see Table 2). As expected, students in the multiple representation conditions were more likely to be in the second or third tercile of adjusted delayed post-test score, whereas students in the single representation condition were more likely to be in the first.
4.3

Table 2: Experimental condition by tercile of adjusted delayed post-test score

              33%   66%   99%
blocked        14    29    20
increased      22    20    20
interleaved    13    21    18
moderate       18    13    22
single         30    13    17
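Using the condition-by-tercile counts as recovered above, the chi-squared test can be reproduced with SciPy; the p-value comes out near the reported .024.

```python
from scipy.stats import chi2_contingency

# condition x tercile counts of adjusted delayed post-test score (Table 2)
table = [
    [14, 29, 20],  # blocked
    [22, 20, 20],  # increased
    [13, 21, 18],  # interleaved
    [18, 13, 22],  # moderate
    [30, 13, 17],  # single
]
chi2, p, dof, expected = chi2_contingency(table)
# dof = (5 - 1) * (3 - 1) = 8
```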
Learning strategy by tercile of pre-test and post-test score:

              Pre-Test           Post-Test
              33%  66%  99%      33%  66%  99%
moderate       20   35   29       30   32   22
interactive    33   26   14       37   27
confident      13   15   22       14   31
stubborn       31   20   32       26   22   35
4.4

Experimental condition by learning strategy:

              mod.  inter.  conf.  stub.
blocked        13     15     10     25
increased      21     16     10     15
interleaved    17     18     10
moderate       18     10     12     13
single         15     14     11     20
4.5

Finally, we explore the relationship between learning outcome and experimental condition for each of the strategic profiles we have identified. Interestingly, we find that experimental condition has a substantial effect on learning outcome among the stubborn students, but virtually no effect on learning among the moderate, interactive, and confident profiles (see Table 5). Most stubborn students perform in the second and third terciles when given multiple graphical representations, but are overwhelmingly in the first tercile when given a single representation.
Students in the other three classes are not significantly affected by their representation condition. The learning strategies that these students implement seem to make them resilient to representational choice, at least in this experimental regime. Recall that students exhibiting the stubborn
profile rarely requested hints, even when they encountered
difficulty. We speculate that they lack the metacognitive
skills to judge when their learning strategies are failing, and
thus are not seeking help at appropriate times [4]. They
are the most sensitive to pedagogical decisions because they
are the least equipped to structure and manage their own
learning.
An ITS ought to ensure that these students are targeted
with multiple representations, and perhaps other forms of
metacognitive support. While not all stubborn students
improve when given MGRs, the vast majority of them do.
An ITS might help scaffold effective learning behaviors by
spontaneously offering hints to these students when they
appear to need them the most. A teacher informed that
a student exhibits this learning profile may try to encourage the student to ask for help and target their metacognitive skills more generally. Moreover, studying this subpopulation seems to be a promising avenue for illuminating
the mechanism by which MGRs improve learning outcomes.
Future experiments could test the effect of offering spontaneous hint-support to students that fit the stubborn profile.
Table 5: Condition and Tercile of Adjusted Delayed Post-Test Score, by Learning Strategy. (Four panels, one per profile: moderate, interactive, confident, stubborn; rows are the conditions blocked, increased, interleaved, moderate, single; columns are the 33%, 66%, and 99% terciles.)
We note that there are competing interpretations of our results that also suggest interesting future experiments. Studies have found that well-designed feedback from errors may
be very effective for improving learning outcomes [12]. It
may be that stubborn students, by not shying away from
mistakes, are taking advantage of a more effective support
system than students who avoid mistakes by soliciting hints.
Since instruction with multiple representations is generally
more difficult, stubborn students in a multiple representation condition would get more of this kind of feedback on
average. This interpretation would predict that students
in the interactive profile would benefit if some hints were
withheld [11]. However, this hypothesis could only be tested
by subsequent experiments.
4.6 Profile Stability
If an intelligent tutoring system could implement our classification methodology on-the-fly, it could tailor its pedagogical interventions to the needs of the individual student. To
substantiate the promise of the methodology we investigate
how efficiently the algorithm stabilizes to the final classification. To measure this, we first cluster on the entire corpus
and assign each student to their most likely profile. We
then artificially subset the data by restricting the number
of problems seen by the clustering algorithm, compute the
proportion of students who are in their final profile, and
then iteratively increase the size of the subset. This simulates how well our algorithm identifies student profiles as
they make their way through the material.
Figure 5 shows the percentage of total data used to estimate the model plotted against the proportion of students assigned to their final strategic profile. At each iteration, we look at an additional 10 problems from each student and re-estimate the cluster assignments. The regression estimates that 63% of the data is sufficient to classify three quarters of the students into their final strategic profiles.
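The stability analysis described above can be sketched as follows (synthetic data and k-means in place of the latent-class model; all sizes and features are illustrative assumptions): cluster on growing prefixes of each student's problem stream and measure agreement with the final assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# synthetic per-problem feature stream: 80 students x 50 problems,
# with two underlying profiles separated in the mean
data = rng.normal(size=(80, 50)) + np.repeat([0.0, 2.0], 40)[:, None]

def profiles(n_problems, k=2):
    """Cluster students on features averaged over the first n_problems problems."""
    feats = data[:, :n_problems].mean(axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)

final = profiles(data.shape[1])
stability = {}
for n in range(10, data.shape[1] + 1, 10):
    part = profiles(n)
    # cluster ids are arbitrary, so for k=2 check both label mappings
    stability[n] = max(np.mean(part == final), np.mean(part != final))
```

`stability` maps the number of problems seen to the proportion of students already in their final profile, which is the quantity plotted in Figure 5.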
6. REFERENCES
Aysu Ezen-Can
Department of Computer Science
North Carolina State University
Raleigh, North Carolina 27695
aezen@ncsu.edu
keboyer@ncsu.edu
ABSTRACT
Dialogue acts model the intent underlying dialogue moves. In
natural language tutorial dialogue, student dialogue moves hold
important information about knowledge and goals, and are
therefore an integral part of providing adaptive tutoring.
Automatically classifying these dialogue acts is a challenging
task, traditionally addressed with supervised classification
techniques requiring substantial manual time and effort. There is
growing interest in unsupervised dialogue act classification to
address this limitation. This paper presents a novel unsupervised
framework, query-likelihood clustering, for classifying student
dialogue acts. This framework combines automated natural
language processing with clustering and a novel adaptation of an
information retrieval technique. Evaluation against manually
labeled dialogue acts on a tutorial dialogue corpus in the domain
of introductory computer science demonstrates that the proposed
technique outperforms existing approaches. The results indicate
that this technique holds promise for automatically understanding
corpora of tutorial dialogue and for building adaptive dialogue
systems.
Keywords
Tutorial dialogue, dialogue act modeling, unsupervised machine
learning
1. INTRODUCTION
Tutorial dialogue systems are highly effective at supporting
student learning [1, 8, 9, 11, 13, 14, 20]. However, these systems
are time-consuming to build because of the substantial
engineering effort required within their various components. For
example, understanding and responding to the rich variety of
student natural language input has been the focus of great
attention, addressed by a variety of techniques including latent
semantic analysis [15], enriching natural language input with
spoken language capabilities [21], linear regression for assessing
correlation of dialogue acts with learning [8] and integration of
multiple dialogue policies [12]. However, a highly promising
approach is to automatically mine models of user utterances from
corpora of dialogue using machine learning techniques [16, 24].
A task of particular importance in modeling student utterances is
determining the dialogue act of each utterance [25, 28]. The
premise of dialogue act modeling is that it captures the
communicative goal or action underlying each utterance, an idea
that emerged within linguistic theory and has been leveraged with
great success by dialogue systems researchers [2][27]. Dialogue
act modeling, in practice, is based on creating taxonomies to use
in dialogue act classification. Within tutorial dialogue systems,
first the dialogue act for a student utterance is inferred, and this
label serves as the basis for selecting the next tutorial strategy.
There are two approaches for learning dialogue act models from a
corpus: supervised and unsupervised. Supervised models require a
manually labeled corpus on which to train, while unsupervised
models employ machine learning techniques that rely solely on
the structure of the data and not on manual labels. A rich literature
on supervised modeling of dialogue acts has shown success in this
task by leveraging a variety of lexical, prosodic, and structural
features [29, 30]. However, supervised models face two significant limitations. First, manual annotation is a time-consuming and expensive process, a problem that is compounded
by the fact that many annotation schemes are domain-specific and
must be re-engineered for new corpora. Second, although there
are standard methods to assess agreement of different human
annotators when applying a tagging scheme, developing the
tagging scheme in the first place is often an ill-defined process. In
contrast, unsupervised approaches do not rely on manual tags, and construct a partitioning of the corpus that suggests a fully data-driven taxonomy. Unsupervised approaches have only just begun
to be explored for dialogue act classification, but early results
from the computational linguistics literature suggest that they hold
promise [10, 24], and a very recent finding in the educational data
mining literature has begun to explore these techniques for
learning-centered speech [25].
This paper presents a novel approach toward unsupervised
dialogue act classification: query-likelihood clustering. This
approach adapts an information retrieval (IR) technique based on
query likelihood to first identify utterances that are similar to a
target utterance. These results are then clustered to identify
dialogue acts within a corpus in a fully unsupervised fashion. We
evaluate the proposed technique on a corpus of task-oriented
tutorial dialogue collected through a textual, computer-mediated
dialogue study. How best to evaluate unsupervised techniques is
an open research question since there is no perfect model that
the results can be compared to. We therefore examine two
complementary evaluation criteria that have been used in prior
work: quantitative evaluation with respect to manual labels [10,
25], and detailed qualitative inspection of the clustering to
determine whether it learned natural groupings of utterances
[24]. The results demonstrate that query-likelihood clustering performs significantly better than a majority baseline when compared to manual labels. In addition, the proposed algorithm outperforms a recently reported unsupervised approach for speech
outperforms a recently reported unsupervised approach for speech
act classification within a learning-centered corpus. Finally,
qualitative analysis suggests that the clustering does group
together many categories of utterances in an intuitive way, even
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
20
2. RELATED WORK
Dialogue act classification aims to model the intent underlying
each utterance. Supervised dialogue act modeling has been well
studied in the computational linguistics literature, applying
techniques such as Hidden Markov Models [30] and Maximum
Entropy classifiers [4][29]. For tutorial dialogue, promising
approaches have included an extension of latent semantic analysis
[28], a syntactic parser model [22], and vector-based classifiers
[6].
Compared to the rich body of work on supervised dialogue act
modeling, a much smaller body of work has focused on
unsupervised approaches. A recent non-parametric Bayesian
approach used Dirichlet Process Mixture Models [10], which
attempt to identify the number of clusters non-parametrically.
Another recent work on unsupervised classification of dialogue
acts modeled a corpus of Twitter conversations using Hidden
Markov Models combined with a topic model built using Latent
Dirichlet Allocation [24]. This corpus was composed of small
dialogues about many general subjects discussed on Twitter. In
order for the dialogue act model not to be distracted by different
topics, they separated content words from dialogue act cues with
the help of the topic model. In our tutoring corpus, however, the
content words reveal important information about dialogue acts. For example, the word "help" is generally found in utterances that are requesting a hint. Therefore, our model retains content words.
Rus et al. utilize clustering to classify dialogue acts within an
educational corpus [25], forming vectors of utterances using the
leading tokens (words and punctuation marks), and using string
comparison as the similarity metric. As they mention, this string
comparison may not be sufficient to generalize word types used
within the same context. For example, "hello" and "hi" are different according to string comparison; however, they are part of the same dialogue act, in that they both serve as a greeting. Our
clustering approach uses query likelihood to group similar words that can be used for the same intention, and we use a blended part-of-speech tag and word feature set which overcomes the challenge
introduced by string comparisons. The results suggest that these
extensions improve upon existing clustering techniques.
3. TUTORING CORPUS
The corpus consists of dialogues collected between pairs of tutors
and students collaborating on the task of solving a programming
problem as part of the JavaTutor project during spring 2007. The
tutor and student interacted remotely with textual dialogue
through computer interfaces. There were forty-three dialogues in
total, with 1,525 student utterances (averaging 7.54 words per
utterance) and 3,332 tutor utterances (averaging 9.04 words per
utterance). This paper focuses on classifying the dialogue acts of
student utterances only. Within an automated tutoring system,
tutor utterances are system-generated and their dialogue acts are
therefore known. The corpus was manually segmented and
annotated with dialogue acts, one dialogue act per utterance,
during prior research that focused on supervised dialogue act
annotation and dialogue structure modeling [7]. While the manual
dialogue act labels are not used in model training, they are used to
evaluate the unsupervised clustering. Table 1 shows manually
labeled tags and their frequencies. The Kappa for agreement on
these manual tags was 0.76. An excerpt from the corpus is
presented in Table 2.
Table 1: Manually labeled dialogue act tags and their frequencies

Act    Description                                                Freq
Q      Question                                                    276
EQ     Evaluation Question                                         416
S      Statement: a statement of fact                              211
G      Grounding: acknowledgement of previous utterance            192
EX     Extra-Domain                                                133
PF     Positive Feedback: positive assessment of knowledge or task 116
NF     Negative Feedback: negative assessment of knowledge or task  92
LF     Lukewarm Feedback                                            32
GRE    Greeting: greeting words                                     57
Table 2: An excerpt from the corpus (speaker, utterance, tag)

Student: so obviously here im going to read into the array list and pull what we have in the list so i can do my calculations (S)
Tutor: something like that, yes (LF)
Tutor: by the way, an array list (or ArrayList) is something different in Java. this is just an array. (S)
Student: ok (G)
Student: im sorry i just refer to it as a list because thats what it reminds me it does (S)
Student: stores values inside a listbox(invisible) (S)
Tutor: that's fine (EX)
Tutor: (EQ)
Student: (NF)
4. QUERY-LIKELIHOOD CLUSTERING

This section describes our novel approach of adapting information retrieval (IR) techniques, combined with clustering, to the task of unsupervised dialogue act classification. IR is the process of searching available resources to retrieve results that are similar to a query [3]; such techniques are most familiar from search engines. In the proposed approach, the target utterance to be classified is used as a query, its similar utterances are gathered using query likelihood, and the query-likelihood results are then provided to the clustering technique.
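A sketch of the query-likelihood step (a Dirichlet-smoothed unigram language model; the mini-corpus, smoothing constant, and whitespace tokenization are illustrative assumptions, not the paper's exact configuration):

```python
from collections import Counter

def query_likelihood(query, doc, collection, mu=2.0):
    """Score candidate utterance `doc` against `query` with a Dirichlet-smoothed
    unigram language model:
    P(q|d) = product over query words w of (c(w, d) + mu * P(w|C)) / (|d| + mu)."""
    coll = Counter(w for u in collection for w in u.lower().split())
    total = sum(coll.values())
    d = Counter(doc.lower().split())
    score = 1.0
    for w in query.lower().split():
        score *= (d[w] + mu * coll[w] / total) / (sum(d.values()) + mu)
    return score

corpus = ["how do i sum the digits ?", "is this correct ?", "what is the next step ?"]
target = "how do i do this ?"
# rank the corpus by likelihood of generating the target utterance
ranked = sorted(corpus, key=lambda u: query_likelihood(target, u, corpus), reverse=True)
```

The top-ranked utterances form the similarity list for the target, which the clustering step then consumes.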
Example query and query-likelihood results. Each utterance is reduced to a blended part-of-speech/word form, from which bigram query combinations are generated:

Query: How can I solve this problem?

Utterance: I'm reading it right now -> VB read PRP right now
Utterance: WDT VBZ DT basic structur TO begin DT array ? -> combinations (WDT VBZ) (VBZ DT) (DT basic) (basic structur) (structur TO) (TO begin) (begin DT) (DT array) (array ?)
Utterance: WDT VBD correct -> combinations (WDT VBD) (VBD correct)
Utterance: WRB do PRP think PRP MD start PRP ?
4.3 Clustering
The similarity results from querying were used as the distance
metric in a k-means clustering algorithm. The implementation of
this idea relies on creating binary vectors for similar utterances
and then grouping those vectors. Each utterance that is present in
the similarity list is represented as a 1, while the others are
represented with a value of 0. In this way, each target utterance in
the corpus is represented by a vector indicating the utterances that
are similar to it. The entire unsupervised dialogue act
classification algorithm is summarized in Figure 2.
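The binary-vector construction and clustering described above can be sketched as follows (the similarity lists are hypothetical stand-ins for the query-likelihood output):

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical output of the query-likelihood step:
# utterance index -> indices of its similar utterances
similar = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}

n = len(similar)
vectors = np.zeros((n, n))
for i, sims in similar.items():
    vectors[i, sims] = 1.0  # 1 if utterance j appears in i's similarity list

# group utterances whose similarity vectors look alike
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```

Utterances 0-2, which point at each other, end up in one cluster; utterances 3-4 end up in the other.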
tag for what and which) were utilized within the experiments.
Having determined the proper weight for WDT, different weights
for WRB (POS tag for why, where, how, when) were tried.
Finally, the weight for question marks was set.
Table 4: Mean average precision (MAP) results for weighting interrogative parts of speech and punctuation

Weights      MAP
no weight    0.1239
WDT = 10     0.2359
WDT = 100    0.2326
WRB = 10     0.2358
WRB = 100    0.2339
? = 10       0.2457
? = 100      0.2567
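For reference, the MAP metric reported in Table 4 can be computed as follows (a standard IR definition; the ranked lists and relevance sets here are hypothetical):

```python
def mean_average_precision(ranked_lists, relevant_sets):
    """MAP: for each query, average the precision at each rank where a
    relevant item is retrieved (divided by the number of relevant items),
    then average over queries."""
    ap_sum = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits = 0
        precisions = []
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                hits += 1
                precisions.append(hits / rank)
        ap_sum += sum(precisions) / len(relevant) if relevant else 0.0
    return ap_sum / len(ranked_lists)

# one hypothetical query: relevant utterances "a" and "c" retrieved at ranks 1 and 3
map_score = mean_average_precision([["a", "b", "c"]], [{"a", "c"}])  # (1/1 + 2/3) / 2
```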
For each utterance u_i in the corpus:
    for each utterance u_j:
        set v_i[j] = 1 if u_j is in the query-likelihood similarity list of u_i,
        else v_i[j] = 0
Let the total vector set be V = (v_1, ..., v_i, ..., v_n)
Return clusters C = {c_1, ..., c_k} such that C is the result of k-means(V)

Figure 2: Query-likelihood clustering algorithm.
5. EXPERIMENTS

The goal of the experiments is to apply the novel unsupervised technique of query-likelihood clustering to discover student dialogue act clusters within the corpus of tutorial dialogue. We utilize a two-pronged evaluation consisting of quantitative comparison in terms of accuracy on manual labels, as well as qualitative examination of the resulting clusters. This section first presents the model-learning process, including parameter tuning on a development set, and then presents quantitative and qualitative evaluations on the remainder of the corpus. We also compare performance of the proposed approach to a state-of-the-art unsupervised technique for speech act labeling in a learning-centered corpus.
5    0.343
10   0.34
15   0.342
20   0.346
25   0.351
30   0.352
35   0.351

5    0.517
10   0.432
15   0.409
20   0.398
30   0.395
100  0.307
which are structurally very similar in that they both are questions,
were grouped into one cluster (Cluster 5), which constitutes
58.54% of all Q and EQ tagged utterances. In another cluster,
14.26% of all Q and EQ utterances were grouped together
(Cluster 6).
Cluster 1
- right (G)
- ahh (G)
- ok (G)
- yeah (G)
- yes (PF)
- heheh yeah that would work (PF)
- I see that (PF)
- gotcha (EX)
- Yes Giving me definitions to various commands and such
(EX)
- Ohh yes substantially (EX)
Table: distribution of manually labeled tags (GRE, EX, PF, EQ, NF, LF) across clusters C1-C8.
Cluster 2
- not really yet (NF)
- im not completely sure about how to do this (NF)
- the parsing im not sure about (NF)
- to be honest im not even sure what an array is (NF)
- im not sure how to read into the array (NF)
- I don't know how to do this (NF)
- and I know there is more to this line but I cannot remember
the command (NF)
- Not yet (NF)
- im not so good at ARRAY just yet (NF)
- but I'm not exactly sure how to do that (NF)
- Im not sure if it is asking if the PARAMETER is how ever far
you are away from NUMBER or the actual number you are
away from NUMBER (S)
- I am asking how to do whatever drawing I need to do in the
METHOD_CALL method (S)
Cluster 5
- So what's wrong with this? (Q)
- Can't manually turn an integer into a string? (Q)
- Then how would I incorporate that with the
METHOD_CALL? I think it's asking me to use that in some
why but it's not supplying arguments to do so (Q)
- are we done extracting digits? (Q)
- how do i sum the digits? (Q)
- do i have to set it to a PARAMETER? (EQ)
- thats another thing I was going to ask am I just storing the
values in METHOD_NAME and sending them to
PARAMETER? (EQ)
- why is what I just highlighted underlined in red doesn't that
mean its wrong? (EQ)
- does extracting have to do with METHOD_NAME? or
anything (EQ)
Cluster 6
- What is the next step? (Q)
- What do I write in it? (Q)
- what do i do first? (Q)
- so what can i do to fix what i was doing? (Q)
- does that look ok? (EQ)
- is this correct? (EQ)
- is this what i need to do? (EQ)
6. DISCUSSION
A strength of unsupervised approaches is that because they do not
rely on any manually engineered tagging schemes, they reflect the
structure of the corpus in a fully data-driven way. In our case, the
results highlight challenges of utilizing pedagogically driven
manual dialogue act classification taxonomies within automated
approaches. For example, a cross-cutting issue with the clustering
presented here is that EX dialogue acts are distributed almost
equally across several clusters. In the manual tagging, EX is a
catch-all tag for conversation that was not directly related to
tutoring. This tag was applied at the structural level, so if a
question such as "Should I close the door?" was not task-related it would have been tagged EX, as would its answer, "Yes." This
distinction was desirable from a pedagogical perspective, but from
a linguistic perspective it conflates dialogue act with topic. Future
work will explore combining unsupervised dialogue act modeling
with unsupervised topic modeling in order to address this type of
modeling challenge. From a dialogue act research perspective, it
is important to consider the issue of conflating act with topic
when devising manual tagging schemes that may become the
target of automated approaches in later work.
While the proposed algorithm is promising in that it outperforms
current unsupervised approaches for dialogue act modeling, it has
several notable limitations. One limitation is algorithmic
complexity, which is quadratic over the size of the corpus. This
complexity is inherent in the binary representation of each
utterance as a vector with similarity to other utterances. Another
limitation of the proposed approach arises with clustering
algorithms in general, which is that a significant amount of human
intelligence is often required to decide on the number of suitable
clusters for the corpus. Nonparametric approaches to
automatically identifying the number of clusters performed worse
than parametric approaches in the current analyses; however,
nonparametric approaches in general are an important area for
future study. Finally, the query-likelihood clustering approach
does not consider higher-level dialogue structure; it clusters one
utterance at a time. This limitation leads to trouble disambiguating
utterances with similar surface features. A highly promising
direction to address this limitation is to enhance the algorithm
with structural features such as dialogue history.
8. ACKNOWLEDGMENTS
This work is supported in part by the North Carolina State
University Department of Computer Science and the National
Science Foundation through Grant DRL-1007962 and the STARS
Alliance, CNS-1042468. Any opinions, findings, conclusions, or
recommendations expressed in this report are those of the
participants, and do not necessarily represent the official views,
opinions, or policy of the National Science Foundation.
Zachary A. Pardos (zp@csail.mit.edu)
falakmasir@cs.pitt.edu
ABSTRACT
Bayesian Knowledge Tracing (BKT) is a common way of
determining student knowledge of skills in adaptive educational
systems and cognitive tutors. The basic BKT is a Hidden Markov
Model (HMM) that models student knowledge based on five
parameters: prior, learn rate, forget, guess, and slip. Expectation
Maximization (EM) is often used to learn these parameters from
training data. However, EM is a time-consuming process, and is
prone to converging to erroneous, implausible local optima
depending on the initial values of the BKT parameters. In this
paper we address these two problems by using spectral learning to
learn a Predictive State Representation (PSR) that represents the
BKT HMM. We then use a heuristic to extract the BKT
parameters from the learned PSR using basic matrix operations.
The spectral learning method is based on an approximate
factorization of the estimated covariance of windows from
students' sequences of correct and incorrect responses; it is fast,
local-optimum-free, and statistically consistent. In the past few
years, spectral techniques have been used on real-world problems
involving latent variables in dynamical systems, computer vision,
and natural language processing. Our results suggest that the
parameters learned by the spectral algorithm can replace the
parameters learned by EM; the results of our study show that the
spectral algorithm can improve knowledge tracing parameter-fitting time significantly while maintaining the same prediction
accuracy, or help to improve accuracy while still keeping
parameter-fitting time equivalent to EM.
Keywords
Bayesian Knowledge Tracing, Spectral Learning.
1. INTRODUCTION
Hidden Markov Models and extensions have been one of the most
popular techniques for modeling complex patterns of behavior,
especially patterns that extend over time. In the case of BKT, the
model estimates the probability of a student knowing a particular
skill (latent variable) based on the student's past history of incorrect
and correct attempts at that skill. This probability is the key value
used by many cognitive tutors to determine when the student has
reached mastery in a skill (also called a Knowledge Component, or
KC) [17]. In an adaptive educational system, this probability can be
used to recommend personalized learning activities based on the
detailed representation of student knowledge in different topics.
Geoffrey J. Gordon (ggordon@cs.cmu.edu)
Peter Brusilovsky (peterb@pitt.edu)
2. BACKGROUND
In BKT we are interested in a sequence of student answers to a
series of exercises on different skills (KCs) in a tutoring system
[6]. BKT treats each skill separately, and attempts to model each
skill-specific sequence using a binary model of the student's latent
cognitive state (the skill is learned or unlearned). Treating state as
Markovian, we therefore have five parameters to explain student
mastery in each skill: probabilities for initial knowledge,
knowledge acquisition, forget, guess, and slip. However, in
standard BKT [6], it is typical to neglect the possibility of
forgetting, leaving four free parameters.
The main benefit of the BKT model is that it monitors changes in
student knowledge state during practice. Each time a student
answers a question, the model updates its estimate of whether the
student knows the skill based on the student's answer (the HMM
observation). However, the typical parameter estimation algorithm
for BKT, EM, is prone to converging to erroneous local optima
depending on initialization. In the past few years, by contrast, spectral techniques have emerged as fast, local-optimum-free alternatives for learning such latent variable models.
3. METHODOLOGY
We propose replacing the parameter-learning step of BKT with a
spectral method. In particular, we use spectral learning to discover a
PSR from a small number of sufficient statistics of the observed
sequences of student interactions. We then use a heuristic to extract
an HMM that approximates the learned PSR and read the BKT
parameters off of this extracted HMM. We can finally use these
parameters directly to estimate student mastery levels, and evaluate
prediction accuracy with our method compared to the standard
EM/MLE method of BKT parameter fitting. We call the above
method spectral knowledge tracing or SKT. We also evaluated
using the learned parameters as initial values for EM in order to get
closer to the global optimum. Because the spectral method does not attempt to maximize likelihood, and because the translation from the PSR to BKT parameters introduces some noise, the returned BKT parameters are close to the global maximum but can be improved further with a few EM iterations. The rest of the
section presents a short description of the data along with a brief
summary of our student model and analysis procedure.
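The full spectral algorithm is beyond the scope of a short sketch, but its key ingredient can be illustrated: the matrix of consecutive-observation co-occurrence probabilities, estimated from students' binary response sequences, has rank bounded by the number of hidden states, and its SVD is the starting point of spectral learning. The following is a minimal, self-contained illustration with simulated data, not the authors' implementation; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate binary response sequences from a 2-state BKT-like HMM:
# state = learned / unlearned; observation 1 = correct answer.
p_l0, p_learn, p_guess, p_slip = 0.4, 0.15, 0.2, 0.1
seqs = []
for _ in range(2000):
    know = rng.random() < p_l0
    obs = []
    for _ in range(10):
        p_correct = (1 - p_slip) if know else p_guess
        obs.append(int(rng.random() < p_correct))
        if not know and rng.random() < p_learn:
            know = True
    seqs.append(obs)

# Estimate the consecutive-pair probability matrix
# P21[i, j] ~= P(x_{t+1} = i, x_t = j) from all adjacent windows.
P21 = np.zeros((2, 2))
for s in seqs:
    for t in range(len(s) - 1):
        P21[s[t + 1], s[t]] += 1
P21 /= P21.sum()

# The SVD of P21 is the starting point of spectral learning; the rank
# of P21 is bounded by the number of hidden states (here, at most 2).
U, svals, Vt = np.linalg.svd(P21)
```

In the full method, these low-order statistics (and their third-order analogues) determine the PSR's observable operators, from which the heuristic then extracts BKT parameters.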
http://www.cs.cmu.edu/~ggordon/spectral-learning/
Semester      #Students   #Topics (Templates) tried   #Records
Spring 2008       15          18 (75)                    427
Fall 2008         21          21 (96)                   1003
Spring 2009       20          21 (99)                   1138
Spring 2010       21          21 (99)                    750
Fall 2010         18          19 (91)                    657
Spring 2011       31          20 (95)                   1585
Fall 2011         14          17 (81)                    456
Spring 2012       41          19 (95)                   2486
Fall 2012         41          21 (99)                   2017
Total            222          21 (99)                  10519
The system had no major structural changes since 2008, but the
enclosing adaptive system used some engagement techniques in
order to motivate more students to use the system. This is the
main reason the number of records is higher in the Spring and Fall
semesters of 2012.
[Figure 4: boxplots of the BKT parameters learned by EM; y-axis: probability.]
4. RESULTS
For the purpose of mimicking how the model may be trained and
deployed in a real world scenario, we learn the model from the
first semester data and test it on the second semester, learn the
model from the first and second semester data and test it on the
third semester, and so on. In total, we calculated results for 155
topic-semester pairs. All analysis was conducted in Matlab on a
laptop with a 2.4 GHz Intel Core i5 CPU and 4 GB of RAM.
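The evaluation scheme just described is a forward-chaining loop over semesters: train on all semesters seen so far, test on the next one. A sketch, with placeholder `fit_bkt` and `evaluate` functions standing in for the actual model fitting and scoring:

```python
# Forward-chaining evaluation: train on all semesters seen so far,
# test on the next one -- mimicking real deployment.
semesters = ["S08", "F08", "S09", "S10", "F10", "S11", "F11", "S12", "F12"]

def fit_bkt(train_semesters):
    # Placeholder: would fit BKT parameters to the training semesters.
    return {"trained_on": list(train_semesters)}

def evaluate(model, test_semester):
    # Placeholder: would return prediction accuracy on the test semester.
    return 0.65

results = []
for i in range(1, len(semesters)):
    model = fit_bkt(semesters[:i])        # all earlier semesters
    acc = evaluate(model, semesters[i])   # next, unseen semester
    results.append((semesters[i], acc))
```

Each per-topic model is evaluated this way, yielding the 155 topic-semester pairs reported in the paper.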
4.1 EM Results
In our experiments it took around 36 minutes for EM to fit the
parameters, which is on average 15 seconds per topic-semester pair. In 2 out of 155 cases, EM failed to converge within
the 200-iteration limit. The average accuracy of predicting a
student's answer to the next question using the parameters learned
by EM is 0.650 with RMSE of 0.464. Figure 4 shows the boxplot
of the parameters learned by EM. The average values for prior,
learn, forget, guess and slip are: 0.413, 0.162, 0.019, 0.431, 0.295.
[Figure: boxplots of parameters learned by EM and SKT — P(L0), P(Learn), P(Forget), P(Guess), P(Slip); y-axis: probability.]
[Figures: boxplots of the parameters learned by SEM (P(L0), P(Forget), P(Slip), P(Learn), P(Guess)); and a LOWESS plot (bandwidth = 0.8) of EM log(time) against SKT log(time).]
4.4 Comparison
4.4.1 Time
[Figure: log(time) against semester-by-topic observations (increase in the size of training data), with SKT log-time and fitted values; and a histogram of log(time).]
The LOWESS plot confirms our intuition that the EM time grows
at least linearly compared to the SKT time. To test that hypothesis
we tried linear regression on the log-log plot. A 95% confidence
interval for the intercept is [2.82, 3.18], which excludes an
intercept of 0; a 95% interval for the slope is [.51, .70], which
excludes a slope of 1. This can be interpreted as follows: the time spent learning parameters using EM is on average at least e^2.82 ≈ 16.77 times greater than the time spent learning the parameters using SKT, and the scaling behavior of EM is likely to be worse (the ratio gets higher as the data gets larger).
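The 16.77 factor follows from back-transforming the intercept of the log-log fit: an additive intercept in log space is a multiplicative factor in the original time units.

```python
import math

# Lower bound of the 95% confidence interval for the intercept
# of the log-log regression of EM time on SKT time.
intercept_low = 2.82

# Back-transform: e^intercept is the multiplicative slowdown of EM
# relative to SKT at the lower bound of the interval.
ratio = math.exp(intercept_low)
```

Since the slope's interval [.51, .70] excludes 1, the ratio is not constant but grows with training-set size.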
[Figure: distribution of RMSE for EM, SEM, and SKT.]
5. DISCUSSION
Regarding prediction accuracy, both of our methods significantly
improved the prediction results (p=0.017 SKT vs. EM, p<<0.001
SEM vs. EM, paired t-test, 153 degrees of freedom). Regarding
RMSE, the spectrally learned parameters do not result in a
significant improvement compared to BKT, but the combination
of SKT with EM leads to a significantly better (lower) RMSE
compared to BKT (p<<0.001, paired t-test, 153 dof). Table 2
shows the summary of the results. Figure 11 and Figure 12 show
the boxplot of the prediction accuracy and RMSE respectively.
Table 2: Summary of the results
Method   Accuracy           RMSE
BKT      0.649 (baseline)   0.465 (baseline)
SKT      0.664 (p=0.017)    0.464 (p=0.348)
SEM      0.706 (p<<0.01)    0.422 (p<<0.01)
[Figure: boxplots of prediction accuracy for EM, SEM, and SKT.]
From a practical point of view, the results of our study will help
us improve our adaptive educational system. Currently, JavaGuide
uses a knowledge accumulation approach, based on the total
number of correct answers, to estimate students' mastery within
each topic for adaptation purposes. The SEM model can be used
to improve the system by providing a more accurate (in regard to
predicting the student answer to the next question) estimate of
student knowledge.
7. ACKNOWLEDGMENTS
This work is partially supported by the Director's Interdisciplinary
Graduate Fellowship, and was an extension of a project initiated
at the 8th Annual 2012 LearnLab Summer School at CMU.
8. REFERENCES
9. Ishteva, M., Song, L., Park, H., Parikh, A., and Xing, E. Hierarchical Tensor Decomposition of Latent Tree Graphical Models. The 30th International Conference on Machine Learning (ICML 2013), (2013).
13. Pardos, Z.A., Trivedi, S., Heffernan, N., and Sárközy, G.N. Clustered Knowledge Tracing. Intelligent Tutoring Systems (ITS 2012), Springer Berlin Heidelberg (2012), 405–410.
ABSTRACT
By implementing mastery learning, intelligent tutoring systems
aim to present students with exactly the amount of instruction they
need to master a concept. In practice, determination of mastery is
imperfect. Student knowledge must be inferred from performance,
and performance does not always follow knowledge. A standard
method is to set a threshold for mastery, representing a level of
certainty that the student has attained mastery. Tutors can make
two types of errors when assessing student knowledge: (1) false
positives, in which a student without knowledge is judged to have
mastered a skill, and (2) false negatives, in which a student
is presented with additional practice opportunities after acquiring
knowledge. From this perspective, the mastery threshold
can be viewed as a parameter that controls the relative frequency
of false negatives and false positives. In this paper, we provide a
framework for understanding the role of the mastery threshold in
Bayesian Knowledge Tracing and use simulations to model the
effects of setting different thresholds under different best and
worst-case skill modeling assumptions.
Keywords
Cognitive Tutor, intelligent tutoring systems, knowledge tracing,
student modeling, mastery learning
1. INTRODUCTION
Carnegie Learning's Cognitive Tutors (CTs) [12] and other
intelligent tutoring systems (ITSs) adapt to real-time student
learning to provide efficient practice. Such tutors are structured
around cognitive models, based on the ACT-R theory of cognition
[1-4], that represent knowledge in a particular domain by
atomizing it into knowledge components (KCs). CTs for
mathematics, for example, present students with problems that are
associated with skills that track mathematics KCs in cognitive
models. Content is tailored to student knowledge via run-time assessments that probabilistically track student knowledge/mastery of skills using a framework called Bayesian Knowledge Tracing (BKT) [8].
Even in cases in which BKT mastery learning judgments are
based on parameters that perfectly match student parameters (e.g.,
with idealized, simulated student data), assessment of mastery or
knowledge is imperfect; student performance need not perfectly
track knowledge. In this context, mastery learning assessment is a
kind of classification problem. Like all classifiers, an ITS is
subject to two types of errors when assessing student knowledge:
(1) false positives, in which a student without knowledge is
judged to have mastered a skill, and (2) false negatives, in which a
student is presented with additional practice opportunities after
acquiring knowledge.
Figure 1. Progression to mastery (judgment) over M student-skill opportunities divided into three phases
Despite imperfect assessment of the student, adaptive tutors
attempt to minimize the number of opportunities at which students
practice skills they have already mastered, so they can focus
student practice on skills they have yet to master. We investigate
the impact of several factors on the efficiency of practice,
focusing especially on the threshold used for student mastery
assessments.
We provide a framework for thinking about inherent trade-offs
between the two types of CT assessment errors. We quantify the
notions of lag and over-practice and investigate their
relationships with the BKT probability threshold for mastery
learning, mastery learning skill parameters, and the dynamics of
the student population (or sub-populations) being modeled.
3. SIMULATION REGIME
We use the BKT model to generate idealized data for simulated
students in a manner comparable to [6] and [11]. For example, if
P(L0) = 0.5, P(T) = 0.35, P(G) = 0.1 and P(S)=0.1, then the
simulation would, for each simulated student, place the student in
the known state initially with a probability of 0.5. Students in the
known state would then generate correct responses with probability 0.9 [1 − P(S)]. Those in the unknown state would
generate correct responses with probability 0.1, and have a 0.35
probability of transitioning into the known state. Percent correct
on the first opportunity for all students simulated with this skill
would be 0.5 [P(L0)*(1-P(S))+(1- P(L0))*P(G)].
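This generation process can be sketched directly; the following uses the example parameter values above and checks the first-opportunity percent correct. It is an illustrative reimplementation, not the authors' code.

```python
import random

def simulate_student(p_l0=0.5, p_t=0.35, p_g=0.1, p_s=0.1,
                     n_opps=10, rng=random):
    """Generate one idealized student's correct/incorrect sequence
    from the BKT generative model."""
    known = rng.random() < p_l0
    responses = []
    for _ in range(n_opps):
        p_correct = (1 - p_s) if known else p_g
        responses.append(rng.random() < p_correct)
        if not known and rng.random() < p_t:
            known = True   # unlearned skill may transition to learned
    return responses

# First-opportunity percent correct should approach
# P(L0)*(1-P(S)) + (1-P(L0))*P(G) = 0.5*0.9 + 0.5*0.1 = 0.5
random.seed(0)
first = [simulate_student()[0] for _ in range(20000)]
rate = sum(first) / len(first)
```

Because each simulated student's transition into the known state is recorded, the run-time mastery judgment can be scored against ground truth, as described next.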
Since we know exactly when each virtual student transitioned into
the known state, we can compare the point where this occurred to
the judgment of the BKT run-time mastery algorithm, which can
only observe the generated student actions. We apply this testing
paradigm to scenarios where the runtime system uses the same
BKT parameters as the generating model (best-case), and to a
couple of scenarios where they are significantly different (worst-cases).
We simulate data over skills represented by 14 unique parameter
quadruples, a subset of those identified in [13] as representative of
broad clusters of skills deployed in Cognitive Tutor mathematics
curricula. We ascertain the number of lagged opportunities we expect students to see, the frequency with which the number of lagged opportunities can reasonably be considered over-practice (i.e., beyond the acceptable lag), and the frequency of premature mastery judgment, for one best-case and two worst-case scenarios.
4. RESULTS
There are several ways of thinking about best and worst-case
scenarios; we do not exhaust the space of possibilities. We begin
by considering a best-case scenario.
4.1.2 Over-Practice
We seek to quantify over-practice, and the extent to which ideal
students endure it, as a function of mastery thresholds and CT
mastery learning parameters.
PS(L0) = 1 − PM(L0)
PS(T) = 1 − PM(T)
75% threshold, 3 at the 90% and 95% thresholds, and 4 at the 98% threshold. Again, we do not witness particularly onerous over-practice in general, despite the mismatch of student parameters and mastery assessment parameters.
4.2.2 Over-Practice
Compared to the best-case matching-parameter scenario, Figure 8 shows that the proportions of students that experience over-practice at each mastery threshold are far greater (and increase with increasing mastery threshold). However, the amount of over-practice through which simulated students must work remains modest; the median student that experiences over-practice sees 2 over-practice opportunities per skill over all the skills at the
Figure 9. Frequency (count of skills) of median student over-practice opportunities for four run-time mastery threshold probabilities with random student BKT parameters & 14 cluster skill quadruples as mastery parameters
4.3.2 Over-Practice
As in the previous two cases, we see that the proportions of
students experiencing over-practice per skill generally increase as
we increase the mastery threshold probability (Figure 11).
Further, most values of these proportions fall roughly in between the medians for the previous two cases. Median counts of over-practice opportunities over all skills at each mastery threshold are also similar to those in the other scenarios (median = 2 for 75%, 90%, and 95%; median = 3 for 98%).
6. DISCUSSION
The trade-off, as a function of mastery probability threshold, of premature student mastery judgment (false positives), lagged skill opportunities, and over-practice (false negatives) is consistent across different best-case and worst-case skill modeling assumptions. The value of the mastery probability threshold and the skill modeling assumptions influence the magnitude of these error rates when calculated as proportions of students prematurely judged to have achieved mastery or subjected to over-practice. However, we find that for the median student subjected to over-practice, regardless of the skill modeling scenario, the amount of over-practice as a count of opportunities is not, on the surface, particularly onerous. Thus, BKT and a variety of skill parameters are generally robust to committing the types of errors we have quantified, with the exception of two outlier skills we discovered with P(T) > 0.8, for which premature mastery judgment occurred more frequently in the two worst-case scenarios. This suggests
that without strong empirical evidence that certain skills are
7. ACKNOWLEDGMENTS
The authors acknowledge the helpful comments of Robert
Hausmann, R. Charles Murray, Brendon Towle, and Michael
Yudelson on an early draft of this paper.
8. REFERENCES
[1] Anderson, J.R. 1983. The architecture of cognition. Harvard
UP, Cambridge, MA.
[2] Anderson, J.R. 1990. The adaptive character of thought.
Erlbaum, Hillsdale, NJ.
ABSTRACT
Learning involves a rich array of cognitive and affective states.
Recognizing and understanding these cognitive and affective
dimensions of learning is key to designing informed interventions.
Prior research has highlighted the importance of facial
expressions in learning-centered affective states, but tracking
facial expression poses significant challenges. This paper presents
an automated analysis of fine-grained facial movements that occur
during computer-mediated tutoring. We use the Computer
Expression Recognition Toolbox (CERT) to track fine-grained
facial movements consisting of eyebrow raising (inner and outer),
brow lowering, eyelid tightening, and mouth dimpling within a
naturalistic video corpus of tutorial dialogue (N=65). Within the
dataset, upper face movements were found to be predictive of
engagement, frustration, and learning, while mouth dimpling was
a positive predictor of learning and self-reported performance.
These results highlight how both intensity and frequency of facial
expressions predict tutoring outcomes. Additionally, this paper
presents a novel validation of an automated tracking tool on a
naturalistic tutoring dataset, comparing CERT results with manual
annotations across a prior video corpus. With the advent of
readily available fine-grained facial expression recognition, the
developments introduced here represent a next step toward
automatically understanding moment-by-moment affective states
during learning.
Keywords
Facial expression recognition, engagement, frustration, affect,
computer-mediated tutoring
1. INTRODUCTION
Over the past decade, research has increasingly highlighted ways
in which affective states are central to learning [6, 21]. Learning-centered affective states, such as engagement and frustration, are
inextricably linked with the cognitive aspects of learning. Thus,
understanding and detecting learner affective states has become a
fundamental research problem. In order to identify students'
affective states, researchers often investigate nonverbal behavior.
A particularly compelling nonverbal channel is facial expression,
which has been intensely studied for decades. However, there is
still a need to more fully explore facial expression in the context
of learning [6].
2. RELATED WORK
              AU1      AU2      AU4       AU7      AU14
Frames        12,257   15,183   127,510   9,474    14,462
Event Freq.   15.5%    21.7%    18.6%     13.2%    24.2%
[Figure 5. Automatically recognized facial action units (bold values are above selected threshold of 0.25). AU1: inner brow raiser; AU2: outer brow raiser; AU4: brow lowerer; AU7: lid tightener; AU14: dimpler.]
3.2 Validation
CERT was developed using thousands of posed and
spontaneous facial expression examples of adults outside of the
tutoring domain. However, naturalistic tutoring data often has
special considerations, such as a diverse demographic,
background noise within a classroom or school setting, no
controls for participant clothing or hair, and facial occlusion
from a wide array of hand-to-face gesture movements.
Therefore, we aim to validate CERT's performance within the naturalistic tutoring domain. CERT's adjusted output was
compared to manual annotations from a validation corpus, as
described in Section 3.1.
                     AU1    AU2    AU4    AU7    AU14
CERT                 0.14   0.29   0.05   0.29   0.29
CERT Accuracy        57%    64%    54%    64%    64%

                     AU1    AU2    AU4    AU7    AU14
Manual FACS Coder    0.88   0.82   0.79   0.78   0.73
CERT                 0.86   0.86   0.68   0.71   n/a
CERT Accuracy        93%    93%    85%    100%   86%
4. PREDICTIVE MODELS
Automated facial expression recognition enables fine-grained
analyses of facial movements across an entire video corpus.
5. DISCUSSION
The results highlight that specific facial movements predict
tutoring outcomes of engagement, frustration, and learning.
Particular patterns emerged for almost all of the facial action
units analyzed. We discuss each of the results in turn along with
the insight they provide into mechanisms of engagement,
frustration, and learning as predicted by facial expression.
Average intensity of brow lowering (AU4) was associated with
negative outcomes, such as increased frustration and reduced
desire to attend future tutoring sessions. Brow lowering (AU4)
has been correlated with confusion in prior research [7, 9] and
frequency of AU14 was positively predictive of both self-reported performance and normalized learning gains. While the
effect appears to be fairly subtle (effect size below 0.3 for both),
it appears to be a display of concentration. This leads to the
interesting question of whether AU4 or AU14 better represents a
thoughtful, contemplative state. Further research in this vein
may resolve the question.
While eyelid tightening (AU7) was not added to any of the
predictive models, there appear to be reasons for this.
Observation of CERT processing and the results of the
validation analysis indicate a way to adjust CERT's output of
AU7, enabling refined study of the action unit. AU7 is an
important facial movement to include, as it has been correlated
with confusion [7]. Our proposed method for correcting AU7
output was informed by observing that CERT tends to confuse
AU7 with blinking or eyelid closing. In prior manual annotation
efforts, we explicitly labeled AU7 only when eyelid movements
tightened the orbital region of the eye (as in the FACS manual).
Thus, manual annotation seems more effective due to this
complication of eye movements. However, note that CERT's AU7 output perfectly agreed with manual annotations in our validation analysis. Thus, CERT clearly tracks eyelid movements well. The problem may be that CERT's AU7 output
is overly sensitive to other eyelid movements. One way to
mitigate this problem may be to subtract other eye-related
movements from instances of AU7. For instance, if AU7 is
detected, but CERT also recognizes that the eyelids are closed,
the detected AU7 event could be discarded.
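The proposed correction amounts to a simple per-frame filtering rule; the sketch below illustrates it. Field names and thresholds here are illustrative, not CERT's actual output format.

```python
def filter_au7_events(frames, threshold=0.25):
    """Keep an AU7 (lid tightener) detection only when it cannot be
    explained as a blink or eye closure in the same frame.
    Each frame is a dict of illustrative per-frame detector outputs."""
    kept = []
    for f in frames:
        au7_active = f["au7"] > threshold
        eyes_closed = (f.get("blink", 0.0) > threshold
                       or f.get("eye_closure", 0.0) > threshold)
        if au7_active and not eyes_closed:
            kept.append(f)
    return kept

# A detection co-occurring with a blink is discarded; a clean one is kept.
frames = [
    {"au7": 0.6, "blink": 0.7},   # discarded: likely a blink
    {"au7": 0.5, "blink": 0.1},   # kept: genuine lid tightening
]
```

Subtracting co-occurring eye-closure evidence in this way would leave only the orbital-tightening events that the manual FACS annotations targeted.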
The results demonstrated predictive value not only for frequency
of facial movements, but also intensity. The relationship
between facial expression intensity and learning-centered affect
is unknown, but perhaps action unit intensity is indicative of
higher-arousal internal affective states. Additionally, it is
possible that intensity will inform disambiguation between
learning-centered affective states that may involve similar action
units (e.g., confusion/frustration and anxiety/frustration). Lastly,
intensity of facial movements may be able to aid diagnosis of
low arousal affective states. For instance, a model of low
intensity facial movements may be predictive of boredom, which
current facial expression models have difficulty identifying.
6. CONCLUSION
This paper presented an automated facial recognition approach
to analyzing student facial movements during tutoring using the
Computer Expression Recognition Toolbox (CERT), which
tracks a wide array of well-defined facial movements from the
Facial Action Coding System (FACS). CERT output was
validated by comparing its output values with manual FACS
annotations, achieving excellent agreement despite the
challenges imposed by naturalistic tutoring video. Predictive
models were then built to examine the relationship between
intensity and frequency of facial movements and tutoring
session outcomes. The predictive models highlighted
relationships between facial expression and aspects of
engagement, frustration, and learning.
This novel approach of fine-grained, corpus-wide analysis of
facial expressions has great potential for educational data
mining. The validation analysis confirmed that CERT excels at
tracking specific facial movements throughout tutoring sessions.
Future studies should examine the phenomena of facial
expression and learning in more detail. Temporal characteristics
[13]
[14]
ACKNOWLEDGMENTS
This work is supported in part by the North Carolina State
University Department of Computer Science and the National
Science Foundation through Grant DRL-1007962 and the
STARS Alliance Grant CNS-1042468. Any opinions, findings,
conclusions, or recommendations expressed in this report are
those of the participants, and do not necessarily represent the
official views, opinions, or policy of the National Science
Foundation.
Keywords
Educational Games, Representation Learning, Context-Free
Grammars, Clustering
1. INTRODUCTION
Educational games are a growing sub-field of instructional technology. Researchers see video games as a compelling medium for
instruction because they can offer students the ability to practice
new skills within an authentic context that poses little personal
risk [7]. These promising aspects of games have led many educational game designers to create open games, which allow students to exercise creativity in how they solve problems [12,25].
Open educational games are a form of exploratory learning environment and commonly use ill-defined problems as part of their
designs [11,19]. While the tendency toward open experiences is
compelling for educational game designers, it presents problems
when analyzing student learning, a necessary part of designing
activities to foster robust learning.
When designing an open game experience, the designer surrenders a degree of control over the nature and progression of the
experience to the player [10]. This openness can be problematic to
the designers of educational game experiences who are concerned
that students receive some type of intended instruction and
achieve a desired learning outcome. Educational game designers
require a detailed picture of how students are playing a game in
order to know if disparities exist between the designer's intentions
and player experiences; and, if such disparities do exist, designers
need to know where to focus redesign efforts.
To facilitate designers' and researchers' analysis of open educational games, we propose a methodology for extracting conceptual features from gameplay logs.
1.1 RumbleBlocks
RumbleBlocks is an educational game designed to teach basic
structural stability and balance concepts to children in kindergarten through grade 3 (5-8 years old) [5]. It focuses primarily on
three basic principles of stability: objects with wider bases are
more stable, objects that are symmetrical are more stable, and
objects with lower centers of mass are more stable. These principles are derived from the National Research Council's Framework
for New Science Educational Standards [21] and other science
education curricula for the target age group.
The game follows a sci-fi narrative where the player is helping a
group of aliens who become stranded when their mother ship is
damaged. Each level (see Figure 1 for an example level) consists
of an alien stranded on a cliff with their deactivated space ship
lying on the ground. The player must use an inventory of blocks
to build a structure that is tall enough to reach the alien. In Figure
1, the player is dragging a third block (the highlighted square
block) from the inventory (top left) to the tower under construction (bottom, center). Additionally, the player's structure
must also cover a series of blue energy balls floating in space
which are narratively used to power the space ship, but serve to
both guide and constrain the players designs. Once the student is
confident in their design, they can place the spaceship on top of
their tower triggering an earthquake that serves as a test of the
tower's stability. If, at the end of the quake, the tower is still
standing and the spaceship is still on top, the student passes the
level and proceeds to the next level; otherwise they start the level
over again.
Beyond the limits imposed by the energy ball mechanic and the
types of available blocks, students are not very constrained in the
2.1 Discretization
The first step in the conceptual feature extraction process is discretization, or gathering meaningful data from the logs and converting it from a continuous two-dimensional space into a discrete
two-dimensional space. The input to this step is the raw student
log data, which contains action-by-action traces of student play
sessions at replay fidelity. The logs generated by RumbleBlocks
are intended to be post-processed through a replay analysis engine
[9] which allows researchers to play logs back through an active
instance of the game engine in order to extract information from
live game states. Using this approach we are able to access information on individual game objects, such as collision information
or bounding box dimensions, without having to log everything at
the time of play. Since the logs are being replayed within the same
game engine, the replayed game states are consistent with what
students experienced.
To convert the continuous data from RumbleBlocks into discrete
data we utilized a binning process. To bin a tower, the coordinates
of the extents of each block's bounding box (the smallest rectangle which can be drawn around the block, a property accessible in
the active game state) are translated such that the bottom-left corner of the tower is at position (0,0). After translation, all of the
edge coordinates of each block are divided by the size of the grid cells, yielding discrete cell coordinates.
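The binning step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; in particular, `CELL` (the grid cell size) and the `(min_x, min_y, max_x, max_y)` bounding-box representation are assumptions.

```python
# Illustrative sketch of the binning step: each block's bounding box is
# translated so the tower's bottom-left corner sits at (0, 0), then every
# edge coordinate is divided by the grid cell size and rounded to the
# nearest grid line. CELL is an assumed cell size, not a value from the paper.
CELL = 0.5

def bin_tower(blocks):
    """Map continuous bounding boxes (min_x, min_y, max_x, max_y)
    to discrete grid extents."""
    min_x = min(b[0] for b in blocks)
    min_y = min(b[1] for b in blocks)
    binned = []
    for x0, y0, x1, y1 in blocks:
        # translate so the tower's bottom-left corner is the origin
        x0, y0, x1, y1 = x0 - min_x, y0 - min_y, x1 - min_x, y1 - min_y
        # divide by the cell size and round to integer grid lines
        binned.append(tuple(round(v / CELL) for v in (x0, y0, x1, y1)))
    return binned

# a wide block with a half-size block stacked on its left edge
grid = bin_tower([(1.0, 0.0, 2.0, 0.5), (1.0, 0.5, 1.5, 1.0)])
```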
Figure 3. An example of how grammar (a) can be used to describe towers (b and c).
The extra space and alignment rules of the grammar are omitted for clarity.
Figure 4. The two possible parses of tower (c) after alignment rules are added. Notice that the rules in the red tree
are now similar to the rules in tower (b)'s parse tree.
by the direction of the division. After adding rules for all divisions, an entry is added to the collection mapping the structure to
the new nonterminal and the nonterminal is returned.
The result of the ERG algorithm is a grammar that contains a
nonterminal for every structure present in the set of towers. However, one subtle problem remains. Two towers that are nearly
identical, but are unaligned and consequently have an additional
space somewhere in the tower, end up sharing no intermediate
nonterminal symbols in their parses; see the differences between
towers (b) and (c) in Figure 3. This is a problem because we are
using nonterminals to model spatial features common across towers. To counter this effect, we introduce a set of alignment rules
for every nonterminal NT in our grammar:
NT → NT NSPACE [vertical]
NT → NSPACE NT [horizontal]
NT → NT NSPACE [horizontal]
These rules triple the number of grammar rules, but add additional
parses to towers so that they share common structure with other
similar but differently aligned towers; see Figure 4. We have two
horizontal rules so that we can have additional space on either the left
or the right of a symbol, but only one vertical rule: a block can have
additional negative space above it but not below it, because blocks in
RumbleBlocks are subject to gravity, and any space below a block
would be filled by the block falling into a new position. It is
important to note that while these rules enable the towers to share
similar structure, they do not give them identical parses. This
enables us to relate similar structures
using their parse trees without having to worry about truly different towers being lumped together.
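The alignment-rule expansion above can be sketched as follows. The `(lhs, rhs, orientation)` rule encoding is an assumed representation for illustration, not the authors' data structure.

```python
# Sketch of the alignment-rule expansion: for every nonterminal, two
# horizontal rules (negative space on the left or the right) and one
# vertical rule (negative space above only) are added to the grammar.
NSPACE = "NSPACE"  # nonterminal standing for negative space

def alignment_rules(nonterminals):
    rules = []
    for nt in nonterminals:
        rules.append((nt, (nt, NSPACE), "vertical"))    # space above only
        rules.append((nt, (NSPACE, nt), "horizontal"))  # space on the left
        rules.append((nt, (nt, NSPACE), "horizontal"))  # space on the right
        # no vertical rule with NSPACE below: gravity would fill that space
    return rules
```

Note that the asymmetry in the vertical case directly encodes the gravity argument made in the text.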
2.3 Parsing
After generating a grammar, we can use it to parse the towers and
determine all of the nonterminal symbols that can be derived from
each tower. We use a modified version of the CKY algorithm [6]
that functions over two dimensions instead of one. This algorithm,
which utilizes dynamic programming, is an approach to bottom-up parsing in polynomial time. One feature of the CKY algorithm
is that the amount of time required to compute all parses of a tower is the same as the amount of time required to compute one
parse. Using this approach, we produce all of the parses for every
tower in our set.
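The paper's parser works over two dimensions; as a simplified one-dimensional stand-in, the classic CKY recognizer over a grammar in Chomsky normal form looks like this:

```python
# One-dimensional CKY recognizer over a CNF grammar, shown as a simplified
# stand-in for the two-dimensional variant described above. The dynamic
# programming table records, for every span, every nonterminal that derives
# it, so collecting all analyses costs the same asymptotically as finding one.
def cky(tokens, unary, binary):
    """unary: {terminal: set of NTs}; binary: {(B, C): set of As} for A -> B C.
    Returns the set of nonterminals deriving the whole input."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i + 1] = set(unary.get(tok, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binary.get((b, c), set())
    return table[0][n]
```

For example, with rules S → A B, A → 'a', B → 'b', the call `cky(["a", "b"], {"a": {"A"}, "b": {"B"}}, {("A", "B"): {"S"}})` recognizes the string as an S.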
Each tower is represented as a binary feature vector with one entry per nonterminal in the grammar. These values are initialized to 0 but are set to 1 for every
nonterminal that appears in at least one of a given tower's parse
trees, similar to previous work [17]. Thus, a feature vector is a
concise description of all the structures that are present in all of
the parses of a given tower. Once we have generated these feature
vectors, we can use them to perform a variety of analyses as we
will demonstrate next.
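This feature-vector construction can be sketched as follows (the function and argument names are illustrative assumptions):

```python
# Sketch of converting parse results into a binary feature vector: one entry
# per grammar nonterminal, set to 1 if the nonterminal appears in any parse
# tree of the tower.
def feature_vector(nonterminal_index, parses):
    """nonterminal_index: {nonterminal: vector position};
    parses: iterable of sets of nonterminals, one set per parse tree."""
    vec = [0] * len(nonterminal_index)
    for parse in parses:
        for nt in parse:
            vec[nonterminal_index[nt]] = 1
    return vec
```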
3. DATA
The data we present here comes from a large formative evaluation
of RumbleBlocks, which was performed in two local area elementary schools. The sample includes play sessions from 174 students
from grades K-3 (5-8 years old) who played the game for a total
of 40 min across 2 sessions. The game contained 39 different
levels, each intended to target a specific principle of stability
through the use of the energy balls as scaffolding. Players played
an average of 17.8 unique levels (σ = 7.2), as not all students completed the entire game. Additionally, because students are allowed
to retry levels in which they fail, the data can contain multiple
attempts by a student on each level (μ = 1.24, σ = 0.68). In total, the
dataset contains 6,317 unique structures created by students.
Due to constraints of the conceptual feature extraction process,
some data had to be excluded from analysis. The parsing process
requires that blocks be aligned to a grid such that clear separations
can be drawn between them; because of this, it was necessary to
omit any structures where the binning process caused blocks to
overlap the same grid cell (less than 0.2% of data). Additionally,
rotating a block will sometimes cause its bounding box to intersect with adjacent grid cells, because the bounding box expands to
encompass the maximum left, right, top, and bottom values of the
block's geometry rather than rotating with it. To address these
issues of grid overlap, we exclude any record that contained blocks
whose dimensions intersected or any blocks whose z-axis rotation
was not a multiple of 90 degrees, after rounding to the nearest 15 degrees.
Overall these constraints exclude ~3.5% of our sample.
The final grammar generated from the dataset by the ERG algorithm contains 13 terminals, 6,010 nonterminals, and 30,923 rules.
Each nonterminal was used an average of 50.59 times (σ = 240.2)
across all towers. The average number of levels in which a given
nonterminal was used was 3.09 (σ = 4.14). The average number of
nonterminals per tower was 49.96 (σ = 40.23). Reporting statistics
on the number of nonterminals within an average parse or number
of parses within an average tower is complicated by the inclusion
of alignment rules which add some arbitrary number of parses to
each tower.
4. CLUSTER ANALYSIS
In order to demonstrate the utility of these conceptual features to
guide the design process in educational games, we performed a
clustering analysis of student solutions in RumbleBlocks, to discern how many solutions students were demonstrating. Clustering
takes a series of data points, in our case represented by conceptual feature vectors, and assigns them to groups based on how similar the points are. Clustering similar to ours has been used by
Andersen and Liu et al. to group game states as a way of exploring common paths that players take through a game [18]. Our
approach differs from theirs in that our features are machine
learned rather than defined by designers. This allows us to observe emergent patterns in play without biasing the results with
human input.
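A minimal k-means sketch over such binary feature vectors (Lloyd's algorithm with a deterministic initialization) illustrates the idea; this is not the clustering configuration used in the study, and it assumes at least k distinct points.

```python
# Minimal k-means (Lloyd's algorithm) over binary feature vectors,
# using squared Euclidean distance. Deterministic initialization:
# the first k distinct points become the initial centers.
def kmeans(points, k, iters=20):
    centers = []
    for p in points:
        q = list(p)
        if q not in centers:
            centers.append(q)
        if len(centers) == k:
            break
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Choosing k is a separate question; the reference list cited by the paper includes work on selecting k automatically [8].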
4.1 Method
The three selected levels were chosen because they were part of
an in-game counterbalanced pre-posttest, which did not use the energy ball scaffolding.
Table 1. Clustering measures (completeness, homogeneity, V-measure, and adjusted Rand index)
means and standard deviations after 10 iterations of clustering.
Note that equality clustering is constant and so has no standard deviation.

Level                    Comparison   Completeness (SD)   Homogeneity (SD)   V-Measure (SD)
com_11_noCheck (n=251)   k-means      .74 (.06)           .57 (.10)          .63 (.04)
                         equality     .55 (NA)            .99 (NA)           .71 (NA)
s_13_noCheck (n=249)     k-means      .83 (.02)           .63 (.04)          .72 (.02)
                         equality     .60 (NA)            .99 (NA)           .75 (NA)
wb_03_noCheck (n=254)    k-means      .63 (.02)           .80 (.02)          .71 (.02)
                         equality     .53 (NA)            .99 (NA)           .69 (NA)
4.2 Results
When looking at the measures of clustering effectiveness in Table
1 we see that the k-means algorithm was able to outperform
straight equality grouping in ARI and completeness. This can be
interpreted to mean that k-means clustering is making a higher
percentage of correct decisions in grouping structures, suggesting
that the results of clustering can be validly used in further analysis. In all instances, the equality grouping performs better than k-means clustering in homogeneity score because if direct equality
is used to assign group labels the resulting groups will be, by definition, perfectly homogeneous. In many instances, this causes the
V-measure to also be better because V-measure evenly weights
for completeness and homogeneity. Overall these results can be
interpreted to mean that clustering along conceptual features of
towers provides reasonable grouping accuracy when compared to
human clustering.
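The completeness, homogeneity, and V-measure scores in Table 1 follow the standard entropy-based definitions of Rosenberg and Hirschberg, which can be sketched as follows (a from-scratch illustration, not the evaluation code used in the study):

```python
from collections import Counter
from math import log

# Entropy-based clustering measures: homogeneity = 1 - H(truth|pred)/H(truth),
# completeness = 1 - H(pred|truth)/H(pred), and V-measure is their harmonic mean.
def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _cond_entropy(a, b):
    """H(a | b)."""
    n = len(a)
    h = 0.0
    for bv in set(b):
        sub = [a[i] for i in range(n) if b[i] == bv]
        h += (len(sub) / n) * _entropy(sub)
    return h

def v_measure(truth, pred):
    h_c, h_k = _entropy(truth), _entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1.0 - _cond_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - _cond_entropy(pred, truth) / h_k
    v = 0.0 if homogeneity + completeness == 0 else \
        2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v
```

Putting every point in its own cluster mirrors the equality-grouping pattern discussed above: homogeneity is perfect by construction while completeness suffers.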
When clustering was performed across all levels, the mean homogeneity of the k-means clusters was found to be significantly
greater than the homogeneity from random grouping of student
solutions using a two-sample t-test (p < .001). Assuming that
similar towers would stand or fall together, this further supports
the idea that the clustering algorithm is not separating similar
student solutions.
Overall the clustering algorithm generated an average of 8 clusters
per level (σ = 3.98), compared to the average number of groups as
determined by equality grouping, 56 (σ = 45.77). The smallest
number of clusters (2) was seen in the tutorial level, which contains only 1 block and the spaceship allowing for very little difference between solutions. The highest number of clusters (17) was
found in a later level (centerOfMass_07) which contains 5 larger
blocks and 6 energy balls allowing for nuanced differences in
solution styles.
Our analysis of what percentage of solutions appear similar to the
designers' intended solutions shows a high degree of variability,
see Figure 5. Some levels, like the tutorial and other earlier levels,
are found near the higher end of the spectrum because as introductory levels they do not allow for a large number of solutions.
However, the levels on the lower end of the spectrum indicate that
few students actually created the towers envisioned by the designers. These levels warrant a closer investigation to ascertain what
other kinds of solutions students are producing. For example,
upon further inspection of the solutions to centerOfMass_07, designed to target the principle of low center of mass, we discovered that many student solutions did not match the designers' intended solution (see Figure 6).
5. DISCUSSION
In this paper we have described a process for conceptual feature
extraction from logs of gameplay in an educational game. The
process follows four steps starting with the raw student log files.
The files are discretized and then used to generate a two-dimensional context-free grammar that can be used to parse the
towers and yield a vector of features present in the tower. We
demonstrated how conceptual features could be used to perform a
clustering analysis of common student solutions.
While the results we discussed are specific to RumbleBlocks, aspects of our approach could be generalized to other games or educational technology environments by altering some of the steps in
the process. One example of another game this approach would
work for is Refraction, which has players redirecting laser beams
around a grid-based board by placing laser splitters to make proper fractions [1,18]. This game already takes place on a grid and so
would not require a discretization step, but the other steps would
be applicable. In this game, our approach would learn features
corresponding to patterns of laser splitters on the grid, which
could be used to generate feature vectors for each student solution
and to cluster these feature vectors. These clusters would be similar to those generated by Liu et al. [18] but the features would be
automatically generated rather than human tagged.
When applying our approach more generally, the discretization
step will always be specific to a particular game or interface, as it
requires an intimate knowledge of the context. Employing a replay analysis engine can assist with discretization by providing a
standard format [9]. The ERG algorithm is applicable to any discrete two-dimensional representation of structure in which adjacency relations are meaningful. Converting parses into feature
vectors for analysis is a technique that should be applicable to
most situations.
The features generated with this method can be used by many
different kinds of analyses beyond what we present here. For instance, the feature vectors could be used as a way to represent
game data in a format suitable for DataShop [13], a large open
repository of educational technology interaction data. A feature
vector is analogous to the state of a tutoring system interface and
the changes in the feature vector from step to step correspond to
the student actions. Additionally, virtual agents, such as SimStudent [20], could use this data representation as a way of understanding and interacting with educational games, enabling us to
model student learning in these contexts.
Figure 6. An example of mismatch between designer expectation and student solution from the centerOfMass_07 level.
The designer's answer is on the left.
While the grammars extracted by our method have proven to be
useful, they still have some limitations, such as an inability to
represent towers that cannot be cleanly mapped to a grid or which
contain overlapping or angled substructures. Making the grammar
more descriptive would require relaxing the constraints on how nonterminals can be parsed, e.g., not requiring strict alignment. Another issue has to do with how many different nonterminals map to nearly equivalent structures. Even though we
alignment. Another issue has to do with how many different nonterminals map to nearly equivalent structures. Even though we
attempt to minimize this by introducing the alignment and space
rules, there are still cases where further reductions could be implemented. One potential solution to this problem in general is to implement model merging, condensing pairs of nonterminals that represent similar concepts into single nonterminals
[15]. The ability to merge similar nonterminals is a promising
direction for future work.
In addition to being able to describe more towers, model merging
would also allow the generalization of grammars to cases we have
not seen. Because context-free grammars can be used generatively, the generalized grammar could be used to produce novel towers, similar to the work of Talton et al. [26]. In our case, these
novel towers would give insight into the as-yet-unseen portions of
the solution space. Furthermore, the novel towers could be used as
templates in creating new levels. In future work we will be exploring ways to feed this information, and information from clustering, directly back into the game development environment.
The clustering results not only provide the designers of RumbleBlocks with a picture of how students are playing their game, they
also possess further uses beyond assisting design iteration, such as
exploring research questions. One potential use of the clustering is
as an empirical measure of how open a particular level is, by
counting how many different clusters, i.e. different solutions, that
level affords. Using this measure allows researchers to explore the
interactions of openness with learning and engagement. Exploring
this interpretation of the clustering results will be a part of our
ongoing analysis of RumbleBlocks.
Another intriguing direction for future work would be to explore
the relationship between the conceptual features and the
knowledge components [14] used in building towers in RumbleBlocks. There may exist a mapping between the substructures
used in towers and the conceptual knowledge components related
to stable structures. Exploring this would require measurements of
how a student's use of particular structures changed over time and
how it relates to task performance. If such a mapping exists, then
our approach would not only be useful for automated feature extraction, but also for automatically building models of conceptual
knowledge components.
6. CONCLUSION
Framing game experiences in terms of conceptual features can
help both designers and researchers better understand how students interact with their games. The main contribution of this
paper is an approach for extracting conceptual features from play
logs within educational games and using these features to perform
clustering of student solutions. Designers can use the clusterings
to better understand the space of student solutions and to know
where to focus their attention to improve student learning experiences. Ultimately we envision feeding back this clustering information directly into the game design platform. This information
can also enable researchers to explore important questions, such
as how openness and difficulty relate to student engagement.
While our approach was created with the specific two-dimensional world of RumbleBlocks in mind, it should be generalizable, and we hope others will find it useful in exploring other
educational games.
7. ACKNOWLEDGMENTS
We would like to thank the designers of RumbleBlocks and our
colleagues who conducted the formative evaluation that yielded
our data. This work was supported in part by a Graduate Training
Grant awarded to Carnegie Mellon University by the Department
of Education #R305B090023 and the DARPA ENGAGE research
program under ONR Contract Number N00014-12-C-0284.
8. REFERENCES
[1] Andersen, E., Liu, Y., Apter, E., Boucher-Genesse, F., and
Popović, Z. Gameplay Analysis through State Projection.
Proc. FDG '10, (2010).
[2] Arthur, D. and Vassilvitskii, S. How slow is the k-means
method? Proc. SCG '06, ACM Press (2006), 144.
[3] Arthur, D. and Vassilvitskii, S. K-means++: The Advantages
of Careful Seeding. Proc. ACM-SIAM, (2007), 1027–1035.
[4] Cherubini, A. and Pradella, M. Picture Languages: From
Wang Tiles to 2D Grammars. In S. Bozapalidis and G.
Rahonis, eds., Algebraic Informatics. Springer, Berlin,
Germany, 2009, 13–46.
[5] Christel, M.G., Stevens, S.M., Maher, B.S., et al.
RumbleBlocks: Teaching Science Concepts to Young
Children through a Unity Game. Proc. CGames 2012,
(2012), 162–166.
[6] Cocke, J. Programming Languages and their Compilers:
Preliminary Notes. New York University, 1969.
[7] Gee, J.P. What video games have to teach us about learning
and literacy. Palgrave Macmillan, New York, 2003.
[8] Hamerly, G. and Elkan, C. Learning the k in k-means. Proc.
NIPS '03, (2003).
[9] Harpstead, E., Myers, B.A., and Aleven, V. In Search of
Learning: Facilitating Data Analysis in Educational Games.
Proc. CHI '13, (2013), 79–88.
[10] Hunicke, R., Leblanc, M., and Zubek, R. MDA: A Formal
Approach to Game Design and Game Research. Proc. of the
AAAI Workshop on Challenges in Game AI, (2004), 1–5.
[11] De Jong, T. and Van Joolingen, W.R. Scientific Discovery
Learning with Computer Simulations of Conceptual
Domains. Review of Educational Research 68, 2 (1998), 179–201.
baker2@exchange.tc.columbia.edu
ABSTRACT
In the field of educational data mining, there are competing
methods for predicting student performance. One involves
building complex models, such as Bayesian networks with
Knowledge Tracing (KT), or using logistic regression with
Performance Factors Analysis (PFA). However, Wang and
Heffernan showed that a raw data approach can be applied
successfully to educational data mining with their results from
what they called the Assistance Model (AM), which takes into
account the number of attempts and hints required to answer the
previous question correctly, information that KT and PFA ignore. We
extend their work by introducing a general framework for using
raw data to predict student performance, and explore a new way
of making predictions within this framework, called the
Assistance Progress Model (APM). APM makes predictions based
on the relationship between the assistance used on the two
previous problems. KT, AM and APM are evaluated and
compared to one another, as are multiple methods of ensembling
them together. Finally, we discuss the importance of reporting
multiple accuracy measures when evaluating student models.
Keywords
Student Modeling, Knowledge Tracing, Educational Data Mining,
Assistance Model, Assistance Progress Model
1. INTRODUCTION
Understanding and modeling student behavior is important for
intelligent tutoring systems (ITS) to provide assistance to students
and help them learn. For nearly two decades, Knowledge Tracing
(KT) [5] and various extensions to it [12, 16, 18] have been used
to model student knowledge as a latent variable using Bayesian networks,
as well as to predict student performance. Other models used to
predict student performance include Performance Factors
Analysis (PFA) [14] and Item Response Theory [8]. However,
these models do not take assistance information into account. In
most systems, questions in which hints are requested are marked
as wrong, and students are usually required to answer a question
correctly before moving on to the next one. Therefore, the number
of hints and attempts used by a student to answer a question
correctly is likely valuable information.
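Standard Knowledge Tracing maintains P(L), the probability that the student knows the skill, and updates it after each observed response using the classic guess, slip, and learn (transit) parameters. A minimal sketch of the usual update (parameter values in the test are illustrative only):

```python
# Classic Knowledge Tracing update: Bayesian posterior over knowledge
# given the observed response, followed by the learning-transition step.
def kt_update(p_learn, correct, guess, slip, transit):
    """Return P(L) for the next opportunity given one observation."""
    if correct:
        cond = p_learn * (1 - slip) / (
            p_learn * (1 - slip) + (1 - p_learn) * guess)
    else:
        cond = p_learn * slip / (
            p_learn * slip + (1 - p_learn) * (1 - guess))
    # learning step: the student may acquire the skill between opportunities
    return cond + (1 - cond) * transit

def kt_predict(p_learn, guess, slip):
    """P(correct) on the next question: know-and-don't-slip or guess."""
    return p_learn * (1 - slip) + (1 - p_learn) * guess
```

Note that neither function consults the hints or attempts used on the previous question, which is exactly the information the assistance-based models below exploit.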
2. DATA
The data used here was the same used in [15], which introduced
AM. This dataset comes from ASSISTments, a freely available
web-based tutoring system for 4th through 10th grade mathematics.
While working on a problem within ASSISTments, a student can
receive assistance in two ways: by requesting a hint, or by
entering an incorrect answer, as shown in Figure 1.
3. METHODS
This section begins by giving an overview of KT, then introduces
a framework for building data-driven student models called
"tabling methods", and describes two such methods: AM and
APM. Next, the approaches used to ensemble these individual
models together are briefly discussed. Finally, the procedure and
measures used to evaluate all models are discussed.
             Attempts
Hint %       1          (1, 6]     (6, ∞)
0            0.778      0.594      0.480
(0, 50]      0.560      0.623      0.444
(50, 100)    0.328      0.461      0.444
100          0.264      0.348      0.374
When hints are held constant, different patterns occur with respect
to the number of attempts used. When no hints are used, the
probability of answering the next question correctly decreases as
the number of attempts increases. This relationship is reversed
when all hints are used. Finally, if just some of the hints are used,
making a few attempts (between 2 and 6, inclusive) helps more
than making one attempt, but making many attempts (> 6)
decreases the probability of answering the next question correctly.
The pattern for no hints can be explained by a higher number of
required attempts being indicative of lower student knowledge. When all
hints are used, more attempts may indicate the student is
attempting to learn rather than just requesting hints until the
answer is given to them. Using some of the hints suggests the
student has not mastered the skill, but has some knowledge of it
and is attempting to learn. The relationship between making one
attempt and making a few attempts can be explained by the fact that the more
attempts the student makes, the more they learn, up to a point. An
excessive number of attempts probably indicates the
student is not learning, despite using some of the hints.
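The tabling idea behind AM can be sketched as follows; the bin edges mirror the table above, while the function names and record format are illustrative assumptions:

```python
# Sketch of the "tabling" approach behind the Assistance Model: bin the
# attempts and hint percentage used on the previous question, then predict
# the empirical probability of a correct next answer for that cell.
def attempt_bin(attempts):
    if attempts == 1:
        return "1"
    return "(1, 6]" if attempts <= 6 else "(6, inf)"

def hint_bin(hint_pct):
    if hint_pct == 0:
        return "0"
    if hint_pct <= 50:
        return "(0, 50]"
    return "(50, 100)" if hint_pct < 100 else "100"

def build_table(records):
    """records: (attempts, hint_pct, next_correct) triples from training data.
    Returns {(hint bin, attempt bin): mean next-question correctness}."""
    sums, counts = {}, {}
    for attempts, hint_pct, next_correct in records:
        key = (hint_bin(hint_pct), attempt_bin(attempts))
        sums[key] = sums.get(key, 0) + next_correct
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

Prediction is then a simple lookup of the cell matching the student's previous question.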
The highest probability in the table, 0.778, corresponds to the
case where the previous question was answered correctly. This is
unsurprising since in this case, the student likely has mastered the
skill. The lowest probability, 0.264, corresponds to making only
one attempt while requesting all of the hints. This corresponds to
the case where the student requests hints until the answer is given
to them. This could be caused by the student simply not
understanding the skill, or by the student gaming the system, or
attempting to succeed in an interactive learning environment by
exploiting properties of the system rather than by learning the
material [2]. In either case, not much learning takes place.
In [15], the AM table was constructed using 80% of the data and
used to predict the remaining 20%. In this work, all models were
evaluated using five-fold cross-validation.
                        Hint % Relationship
Attempts Relationship   <              =              >
<                       0.672 (586)    0.611 (1410)   0.567 (60)
=                       0.649 (248)    0.734 (8309)   0.590 (83)
>                       0.541 (85)     0.552 (1019)   0.512 (299)
[Table: a further breakdown by attempt relationship; the cell structure is not fully recoverable. Recovered probabilities (counts): 0.708 (2722), 0.672 (586), 0.611 (1410), 0.649 (248), 0.791 (5028), 0.580 (143), 0.352 (559), 0.551 (1104), 0.512 (299).]
3.4 Evaluation
To evaluate the models, three metrics are computed: MAE,
RMSE, and AUC. These metrics are computed by obtaining
predictions using five-fold cross-validation (using the same
partition for each model), then computing each metric per student.
Finally, the individual student metrics are averaged across
students to obtain the final overall metrics. Computing the
average across students for each metric in this way avoids
favoring students with more data than others, and avoids
statistical independence issues when it comes to computing AUC.
For these reasons, Pardos et al. used average AUC per student as their accuracy measure in their work evaluating several student models and various ways of ensembling them [11].
All three of these metrics are reported because they are concerned
with different properties of the set of predictions and therefore do
not always agree on which model is best. MAE and RMSE are
concerned with how close the real-valued predictions are, on
average, to their actual binary values. On the other hand, AUC is
concerned with how separable the predictions for positive and
negative examples are, or how well the model is at predicting
binary classes rather than real-valued estimates.
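The per-student averaging scheme and the three metrics can be illustrated with a small sketch (not the authors' code; the metric definitions are the standard ones, with AUC computed via pairwise rank comparisons):

```python
import math

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def auc(y, p):
    """Probability a random positive is ranked above a random negative (ties count half)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def per_student_average(metric, data):
    """`data` maps student id -> (actuals, predictions); compute the metric
    per student, then average across students."""
    return sum(metric(y, p) for y, p in data.values()) / len(data)

data = {"s1": ([1, 0, 1], [0.9, 0.2, 0.7]), "s2": ([0, 1], [0.4, 0.6])}
print(per_student_average(auc, data))  # → 1.0
```

Because each student contributes one value regardless of how many responses they produced, students with more data do not dominate the average.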
For example, in Table 4, the first two sets of predictions (P1 and
P2) achieve AUCs of 1 since both perfectly separate the two
classes (0 and 1). However, P2 achieves much better MAE
(0.3960) and RMSE (0.6261) values than P1 (0.5940 and 0.7669,
respectively). What's more, P3 achieves an AUC of only 0.5, but outperforms both P1 and P2 in terms of RMSE (0.5292) and P1 in terms of MAE (0.4400).
Table 4. Example dataset (columns: Actual Value, P1, P2, P3)
4. RESULTS
In this section, the results for both the individual models and the
ensemble models are reported. Given the importance of reporting
multiple accuracy measures as discussed in the preceding section,
three measures are reported for each model: MAE, RMSE and
AUC. Each measure is computed by first computing the measure per student and then averaging across students, as described in Section 3.4.
Table 5. Results for the individual models

Model      MAE      RMSE     AUC
Baseline   0.2510   0.4642   0.5000
AM         0.3657   0.4129   0.5789
APM        0.3844   0.4221   0.5618
KT         0.3358   0.4071   0.6466
Bold-faced type indicates measures that are reliably better than those for KT, and regular type indicates there is no reliable difference between the measures for KT and the model in question. Statistical significance was determined using two-tailed pairwise t-tests and Benjamini and Hochberg's false discovery rate procedure [4].
4.2.1 Mean
The first ensembling method involved taking the simple mean of
the predictions given by the various models. This was done in five
ways: 1) with AM and APM to determine if it outperformed AM
and APM on their own; combining KT with 2) AM and 3) APM
to determine if either AM or APM improved predictions over
using KT on its own; 4) with all three models to determine if it
outperformed any of the individual models, and 5) taking the
mean of AM and APM first, then taking the mean of those results
with KT. The intuition for the last method is that KT performs
better than AM, and most likely APM as well. Therefore, taking
the mean of AM and APM first gives KT more influence in the
final result while still incorporating both AM and APM. The
results for these models are shown in Table 6.
Table 6. Results for the mean models

Model           MAE      RMSE     AUC
AM, APM         0.3751   0.4137   0.5917
KT, AM          0.3508   0.4006   0.6472
KT, APM         0.3601   0.4033   0.6409
KT, AM, APM     0.3620   0.4032   0.6433
KT, (AM, APM)   0.3554   0.4010   0.6469
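The averaging methods above can be sketched as follows (the prediction values are hypothetical; the method numbering follows the list in this subsection):

```python
def mean_ensemble(*prediction_lists):
    """Element-wise mean of several models' prediction vectors."""
    return [sum(vals) / len(vals) for vals in zip(*prediction_lists)]

# Hypothetical per-response predictions from the three models:
kt, am, apm = [0.9, 0.4, 0.7], [0.8, 0.5, 0.6], [0.6, 0.3, 0.8]

flat = mean_ensemble(kt, am, apm)                    # method 4: mean of all three
nested = mean_ensemble(kt, mean_ensemble(am, apm))   # method 5: KT weighted more
print(round(flat[0], 3), round(nested[0], 3))  # → 0.767 0.8
```

In the nested version, KT receives half of the total weight, while AM and APM each receive a quarter, matching the intuition that the stronger model should have more influence.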
The same sub-folds were used for each fold for all decision tree
models. The results for these models are reported in Table 8. The
model names correspond to the same subsets of attributes used for
the linear regression models.
Table 8. Results for the decision tree models

Model           MAE      RMSE     AUC
AM              0.3637   0.4119   0.5793
AM, KT          0.3293   0.4009   0.6385
AM, APM*        0.3586   0.4087   0.5847
AM, APM*, KT    0.3286   0.4008   0.6358
AM, APM         0.3586   0.4090   0.5860
AM, APM, KT     0.3290   0.4012   0.6351

(AM, APM denotes the AM, APM* set along with the APM prediction.)
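As a toy illustration of tree-based ensembling over model predictions, a one-split regression stump can be fit on a single model's predictions as follows (a deliberate simplification: the paper's trees are full decision trees over several models' predictions):

```python
def fit_stump(x, y):
    """Fit a one-split regression stump minimizing squared error:
    a minimal instance of the splitting rule a regression tree applies recursively."""
    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((yi - ml) ** 2 for yi in left) + sum((yi - mr) ** 2 for yi in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

# x = a model's predictions, y = actual correctness (hypothetical values)
model = fit_stump([0.2, 0.3, 0.8, 0.9], [0, 0, 1, 1])
print(model(0.25), model(0.85))  # → 0.0 1.0
```

A full regression tree simply applies this split search recursively to each side, over every input feature.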
Table 7. Results for the linear regression models

Model           MAE      RMSE     AUC
AM              0.3701   0.4148   0.5770
AM, KT          0.3338   0.4024   0.6500
AM, APM*        0.3671   0.4127   0.5753
AM, APM*, KT    0.3319   0.4005   0.6341
AM, APM         0.3647   0.4112   0.5874
AM, APM, KT     0.3316   0.4000   0.6379
Table 9. Results for the random forest models

Model           MAE      RMSE     AUC
AM              0.3505   0.4002   0.6461
AM, KT          0.3054   0.4117   0.6313
AM, APM*        0.3479   0.3985   0.6477
AM, APM*, KT    0.3005   0.4109   0.6358
AM, APM         0.3485   0.3990   0.6468
AM, APM, KT     0.2997   0.4090   0.6375
4.2.5 Overall
For the first three ensembling methods, those that included only
AM and KT performed the best. However, for random forests, it
was the average of KT with the random forest consisting of
predictions from all three individual models. Table 10 reproduces
these results, with bold-faced type indicating values that are
reliably better than KT, and underlined type indicating values that
are reliably worse. Table 11 reports the p-values of the differences
between these models for each accuracy measure, with values
indicating reliable differences in bold-faced type.
Table 10. Results for the best of each ensembling method

Method   MAE      RMSE     AUC
MEAN     0.3508   0.4006   0.6472
LR       0.3338   0.4024   0.6500
TREE     0.3293   0.4009   0.6385
RF       0.2997   0.4090   0.6375
Table 11. p-values for the differences between the best ensembling methods

Pair         MAE      RMSE     AUC
MEAN, LR     0.0000   0.1659   0.4274
MEAN, TREE   0.0000   0.8803   0.1116
MEAN, RF     0.0000   0.0022   0.1400
LR, TREE     0.0000   0.1669   0.0223
LR, RF       0.0000   0.0026   0.0406
TREE, RF     0.0000   0.0001   0.8476
From Tables 10 and 11, it appears that either the decision tree or
random forest (averaged with KT) models could be considered the
best model, depending on which measure is considered the most
important. The random forest model is reliably better than the
decision tree in terms of MAE, but reliably worse in terms of
RMSE.
In general, it appears there is some value in comparing the usage
of assistance over the previous two problems, as ensembling APM
with AM consistently gives better results than using AM on its
own, except when taking means. Despite this, ensemble methods
that use only KT and AM perform better than any other model
studied in this work, including all of those using APM. One
explanation could be that one important thing that APM captures
is learning over the previous two questions, which is already
modeled in KT. The one exception is when a random forest of all
individual models is averaged with KT, which indicates that there
is information that APM takes into account that neither AM nor
KT considers. Right now, it is not clear which of these ensemble
6. ACKNOWLEDGMENTS
All of the opinions expressed in this paper are solely those of the authors and not those of our funding organizations.
7. REFERENCES
[1] Arroyo, I., Cooper, D.G., Burleson, W., and Woolf, B.P. Bayesian Networks and Linear Regression Models of Students' Goals, Moods, and Emotions. in Handbook of educational data mining, CRC Press, Boca Raton, FL, 2010, 323-338.
[2] Baker, R.S.J.d., Is Gaming the System State-or-Trait?
Educational Data Mining Through the Multi-Contextual
Application of a Validated Behavioral Model. in Complete
On-Line Proceedings of the Workshop on Data Mining for
User Modeling at the 11th International Conference on User
Modeling, (Corfu, Greece, 2007), 76-80.
[3] Beck, J.E., Chang, K., Mostow, J., Corbett, A. Does help
help? Introducing the Bayesian Evaluation and Assessment
methodology. Intelligent Tutoring Systems, Springer Berlin
Heidelberg, 2008, 383-394.
[4] Benjamini, Y., Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57(1), 1995, 289-300.
[5] Corbett, A. and Anderson, J. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction, 4(4), 1994, 253-278.
Alex Zundel
Thomas F. Stahovich
Department of Computer Science, University of California, Riverside, 900 University Ave, Riverside, CA 92521
Department of Mechanical Engineering, University of California, Riverside, 900 University Ave, Riverside, CA 92521
Department of Mechanical Engineering, University of California, Riverside, 900 University Ave, Riverside, CA 92521
jhero001@ucr.edu
ABSTRACT
A key challenge in educational data mining research is capturing student work in a form suitable for computational
analysis. Online learning environments, such as intelligent
tutoring systems, have proven to be one effective means for
accomplishing this. Here, we investigate a method for capturing students' ordinary handwritten coursework in digital form. We provided students with Livescribe™ digital pens which they used to complete all of their homework and exams. These pens work as traditional pens but additionally digitize students' handwriting into time-stamped pen
strokes enabling us to analyze not only the final image, but
also the sequence in which it was written. By applying data
mining techniques to digital copies of students' handwritten work, we seek to gain insights into the cognitive processes employed by students in an ordinary work environment.
We present a novel transformation of the pen stroke data,
which represents each students homework solution as a sequence of discrete actions. We apply differential data mining
techniques to these sequences to identify those patterns of
actions that are more frequently exhibited by either good- or
poor-performing students. We compute numerical features
from those patterns which we use to predict performance in
the course. The resulting model explains up to 34.4% of the
variance in students' final course grade. Furthermore, the underlying parameters of the model indicate which patterns best correlate with positive performance. These patterns in turn provide valuable insight into the cognitive processes employed by students, which can be directly used by the instructor to identify and address deficiencies in students' understanding.
stahov@engr.ucr.edu
1. INTRODUCTION
2. RELATED WORK
Data-driven educational research has traditionally been limited by the time-consuming process of monitoring students' learning. For example, substantial research has been performed investigating the correlation between performance and the amount of time and effort spent on homework assignments [2, 5, 8, 16, 18]. Manually watching each student solve each homework assignment would require an intractable amount of time and, additionally, may skew the results of the study. Instead, each of these researchers relied on students or their parents to self-report the amount of time spent on each homework assignment.
Cooper et al. [6] compared the results of each of these studies and found an average correlation of r = 0.14, with a range from −0.25 to 0.65. Cooper et al. summarize this inconsistency in findings when they state that, "to date, the role of research in forming homework policies and practices has been minimal. This is because the influences on homework are complex, and no simple, general finding applicable to all students is possible." This underscores the impact that Educational Data Mining can have on the educational research community. By instrumenting students' natural problem-solving processes, we are able to capture a precise measurement of the actions students perform when solving their homework assignments.
More recently, researchers have applied data mining techniques to ITS and CMS data. For example, Romero et
al. [17] applied data mining techniques to data collected with
the Moodle CMS. This system allows students to both view
and submit various assignments, e.g., homework and exams,
and records detailed logs of students' interactions. These
interaction logs were mined for rare association rules, that
is, patterns which appear infrequently in the data. The resulting rules were then manually inspected to identify fringe
behaviors exhibited by students.
Similarly, Mostow et al. [14] applied data mining techniques to interaction logs taken from Project LISTEN's Reading
Tutor, an ITS. This system tutors young students as they
learn to read by listening to them read stories aloud and
providing feedback. The authors developed a system which
automatically identified meaningful features from these logs
which were then used to train classifiers to predict students' future behavior with the system.
Sequential pattern mining [1] is a technique used to identify
significant patterns in sequences of discrete items, e.g., consumer transaction records [1] or DNA transcripts [4]. These
techniques have typically been used to mine patterns from a
single database of sequences. In Educational Data Mining,
it is often the case that researchers seek to find patterns that
best distinguish students who do and do not perform well
in the course. Thus there is a need for novel pattern mining
techniques aimed at differentiating between two databases
of sequences.
More recently, Ye and Keogh [20] developed a novel technique which identifies patterns that best separate two time-series databases. This technique identifies frequently occurring patterns within each database, as traditional pattern mining techniques do, but furthermore evaluates each pattern by using it to separate sequences from the two databases. If a sequence contains the pattern, that sequence is identified as being part of the same database that the pattern came from. The pattern which provides the greatest information gain is selected as the most discriminative.
3. DATA COLLECTION
In the winter quarter of 2012, students enrolled in an undergraduate Mechanical Engineering Statics course were given Livescribe™ digital pens. Students completed all their coursework with these pens, creating a digital record of their handwritten homework, quiz, and exam solutions. A typical
exam problem is shown in Figure 1. Each problem includes
a figure describing a system subject to external forces. The
Figure 2: A hypothetical solution to a Statics problem. The color of each pen stroke identifies the component to which it refers: cyan = FBD, green = equation, and black = cross-out.
by students, more fine-grained labeling schemes could have
been developed by subdividing each label. For example, instead of labeling a pen stroke as being part of an equation, it
could be labeled according to the type of equation to which
it corresponds, namely, sum of forces in the X or Y direction or the sum of moments. We chose the labeling scheme
presented for two major reasons.
First, this labeling scheme is sufficient for investigating hitherto unverifiable intuitions about the ways students solve
Statics problems, such as the intuition that students who
possess a strong understanding of the material will complete
their FBD entirely before beginning their equation work.
Similarly, we may corroborate the intuition that students
who possess a strong understanding of the material will complete their problems in problem-number order, that is, they
complete problem one entirely before completing problem
two and so on.
Second, by subdividing each of the labels, we risk increasing
the granularity of the resulting action sequences too far, increasing the number of total discrete actions to a point that
prevents patterns from being identified.
4. ACTION SEQUENCES
In this section, we describe how each sketch may be transformed into an action sequence, comprising discrete actions,
that is suitable for differential pattern mining. Each action
is an element of a predefined alphabet of canonical actions.
Each element in the alphabet represents an uninterrupted
period of problem-solving performed by a student as he or
she solves a homework assignment. We seek to characterize the duration, semantic content, and homework problem
number for each action.
We begin by segmenting the pen strokes of each sketch by
semantic type. To do so, we simply identify each index i in L such that l_i ≠ l_{i+1}, and segment the series of pen strokes at each identified index. Each resulting segment contains pen strokes of a single semantic type.
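The segmentation step can be sketched as a run-length pass over the per-stroke labels (the label names FBD/EQN/CROSS are illustrative, not the paper's exact encoding):

```python
def segment_by_label(labels):
    """Split a per-stroke label sequence into maximal runs of a single semantic type.
    Returns (label, start, end) tuples with `end` exclusive."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        # A segment ends where the label changes (or at the end of the sequence).
        if i == len(labels) or labels[i] != labels[i - 1]:
            segments.append((labels[start], start, i))
            start = i
    return segments

strokes = ["FBD", "FBD", "EQN", "EQN", "EQN", "CROSS", "FBD"]
print(segment_by_label(strokes))
# → [('FBD', 0, 2), ('EQN', 2, 5), ('CROSS', 5, 6), ('FBD', 6, 7)]
```

Each resulting run then becomes one discrete action, annotated with its duration and problem number.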
Figure 3: A histogram of the durations of FBD actions across all homework assignments. For example, the first (leftmost) bar indicates that approximately 6,000 FBD actions were between zero and five seconds long.
Figure 4: A histogram of the durations of EQN actions across all homework assignments. For example, the first (leftmost) bar indicates that approximately 3,500 EQN actions were between zero and five seconds long.
Students completed homework assignments three and four
prior to the first midterm exam. Students completed homework assignments five and six after the first midterm exam
and before the second. Students completed homework assignment eight after the second midterm exam and before
the final exam. Each midterm exam only comprised problems similar to those encountered on the homework assignments leading up to it. Thus the first midterm exam required that students solve problems similar to those found
on homework assignment three and four and the second
midterm exam required students to solve problems similar
to those found on homework assignments five and six. The
final exam comprised problems similar to all those encountered on all homework assignments.
Using this schedule of exams and homework assignments,
we assign each action sequence to a group based on performance. An action sequence is assigned to the top-performing
group if the student who performed those actions scored in
the top third on the relevant exam. Similarly, an action
sequence is assigned to the bottom-performing group if the student scored in the bottom third of the class. The differential mining technique employed in this paper requires exactly two databases as input, thus the remaining middle-performing students are excluded from our analysis to help
accentuate the differences in problem-solving behaviors of
top- and bottom-performing students.
Descriptive statistics of the lengths of the action sequences for the two performance groups for each assignment are shown
in Table 1. It is interesting to note that the average action
sequences of the bottom-performing group are always longer
than those of the top-performing group, and in two cases this
difference is significant (p < 0.01).
5. DIFFERENTIAL MINING
To identify patterns that distinguish good performance from
poor performance we employ the differential pattern mining
technique developed by Kinnebrew and Biswas [13]. This
algorithm identifies patterns that are differentially frequent
with respect to two databases of sequences, called the left
and right databases.
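The notion of a pattern being differentially frequent between the left and right databases can be sketched as follows; for brevity this treats a pattern as a contiguous subsequence, a simplification of the actual algorithm, and the sequence contents are hypothetical:

```python
def s_frequency(pattern, database):
    """Fraction of sequences in `database` that contain `pattern` at least once
    (contiguous-match simplification)."""
    def contains(seq, pat):
        return any(seq[i:i + len(pat)] == pat for i in range(len(seq) - len(pat) + 1))
    return sum(contains(seq, pattern) for seq in database) / len(database)

left = [["F", "E", "E"], ["F", "F", "E"], ["F", "E"]]                  # e.g. top performers
right = [["E", "C", "E"], ["C", "E", "C"], ["F", "E", "C"], ["C", "C"]]  # e.g. bottom performers
pat = ["E", "C"]
print(s_frequency(pat, left), s_frequency(pat, right))  # → 0.0 0.75
```

A pattern whose frequency differs sharply between the two databases, like `pat` here, is a candidate differential pattern.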
This algorithm uses two metrics to measure the frequency of a pattern: s-frequency and i-frequency. s-frequency is defined as the number of sequences in a database that contain the pattern, while i-frequency counts the number of occurrences of the pattern within a single sequence.
Table 1. Descriptive statistics of the action sequence lengths for the two performance groups

Group      Mean     Median   Std. Dev.   p
HW4 Bot.   130.25   128      63.97       0.00
HW4 Top    83.28    78       46.67
HW5 Bot.   127.88   119.5    70.89       0.01
HW5 Top    87.14    72       54.42
HW6 Bot.   144.73   140      52.55       0.171
HW6 Top    126.52   122      66.05
HW8 Bot.   82.45    72       73.94       0.17
HW8 Top    62.28    53.5     39.64
6. PERFORMANCE PREDICTION
The differential pattern mining technique identified 98 patterns in total: 6 that were s-frequent in the top-performing
group but not in the bottom-performing group, and 92 that
were s-frequent in the bottom-performing group but not in
the top-performing group.
Our goal is to use these 98 patterns to construct a model
to distinguish between good- and poor-performing students.
We represent each student with 98 binary features. Each
feature indicates whether a particular differential pattern
from a particular assignment is contained within a student's action sequence for that assignment. To avoid computing a model that over-fits the data, we used the Correlation-based Feature Selection (CFS) algorithm with 10-fold cross-validation to identify the subset of the 98 features with the
most predictive power. Those features that were selected in
more than six of the ten folds by the CFS algorithm were
included in the final feature subset. Table 2 shows the 20
features that were ultimately selected in this way.
We then used these 20 features to construct a linear regression model which predicts students' overall performance in the course. While more robust, non-linear classifiers could
have been used, e.g., AdaBoost [7] or Support Vector Machines [9], we use a linear regression model because of the
ease of interpretation; the coefficients that comprise the
model give insight into the predictive power of the features
used to train it. We used the linear regression package available in the WEKA machine learning software suite [10] to
train the model. Our predictive model achieves an R² of 0.343 and includes seven features with non-zero coefficients.
Table 3 lists these seven features.
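The binary-feature construction described above can be sketched as follows (the pattern encoding, e.g. <C, 1-E-S, C>, follows the Discussion section; containment is simplified here to contiguous matching, and the surrounding actions are hypothetical):

```python
def contains(seq, pat):
    """True if `pat` occurs as a contiguous run in `seq` (a simplification)."""
    return any(seq[i:i + len(pat)] == pat for i in range(len(seq) - len(pat) + 1))

def binary_features(action_seq, patterns):
    """One 0/1 feature per differential pattern: does the student's sequence contain it?"""
    return [1 if contains(action_seq, p) else 0 for p in patterns]

# Two patterns from the Discussion section; the example sequence is hypothetical.
patterns = [["C", "1-E-S", "C"], ["2-E-S", "2-E-S"]]
seq = ["1-F-L", "C", "1-E-S", "C", "2-E-S", "2-E-S"]
print(binary_features(seq, patterns))  # → [1, 1]
```

Each student thus maps to a fixed-length 0/1 vector over the 98 mined patterns, suitable as regression input.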
7. DISCUSSION
We manually inspected each of the 98 patterns identified by
the differential pattern mining algorithm and categorized the
different types of cognitive processes they demonstrate. We
identified seven distinct categories. Difficulty is the category
in which students seem to encounter difficulties with a particular problem, evidenced by either repeated cross-outs or
repeated attempts at the same component of the same problem. For example, the pattern <C, 1-E-S, C> describes a
scenario in which the student crossed out work, worked on
equations for problem one for a short time, and then again
crossed out work.
Three categories describe patterns in which actions are repeated: Repeated Equation, Repeated FBD, and Repeated
Cross-out. For instance, <2-E-S 2-E-S> is an example of a
Repeated Equation action. Such sequences may be an indication that a student is taking a break in the middle of a
particular activity to think more carefully before continuing
with that activity.
Two categories describe patterns suggesting that a student
may be revising either a FBD (FBD Revision) or an equation (Equation Revision). These patterns comprise a cross-out
followed by either the FBD or equation they are most likely
revising. Also, when a student moves from working on an
equation back to a FBD, this is likely an indication that the
FBD is being revised; students typically attempt to complete
their FBD before moving on to equations.
8. CONCLUSION
We have presented an application of data mining techniques
to educational data extracted from a novel environment. We
have given undergraduate Mechanical Engineering students Livescribe™ digital pens with which they completed all their
coursework. These pens record students' handwriting as
time-stamped pen strokes enabling us to not only analyze
the final image, but also the sequence in which it was written.
We developed a novel representation of students' handwritten work on an assignment which characterizes the sequence
of actions the student took to solve that problem. This representation comprises an alphabet of 49 canonical actions
that a student may make when solving his or her homework
assignment. Each action is characterized by its duration,
problem number, and semantic content. This representation
allows us, for the first time, to apply traditional data mining techniques to sequences of students' handwritten problem
solutions.
We assigned these sequences to top- and bottom-performing groups according to performance on each sequence's most relevant exam. The most relevant exam for a sequence from
a particular homework assignment is the exam which occurs
most recently after that homework assignment was due. Se-
9. REFERENCES
ah3096@columbia.edu
baker2@exchange.tc.columbia.edu
Sujith M. Gowda
Albert T. Corbett
sujithmg@wpi.edu
ABSTRACT
In recent years, student modeling has been extended from
predicting future student performance on the skills being learned
in a tutor to predicting a student's preparation for future learning (PFL). These methods have predicted PFL from a combination of features of students' behaviors related to meta-cognition.
However, these models have achieved only moderately better
performance at predicting PFL than traditional methods for latent
knowledge estimation, such as Bayesian Knowledge Tracing. We
propose an alternate paradigm for predicting PFL, using
quantitative aspects of the moment-by-moment learning graph.
This graph represents individual students' learning over time and
is developed using a knowledge-estimation model which infers
the degree of learning that occurs at specific moments rather than
the student's knowledge state at those moments. As such, we
analyze learning trajectories in a fine-grained fashion. This new paradigm achieves substantially better student-level cross-validated prediction of students' PFL than previous approaches.
Particularly, we find that learning which is spread out over time,
with multiple instances of significant improvement occurring with
substantial gaps between them, is associated with more robust
learning than either very steady learning or learning characterized
by a single eureka moment or a single period of rapid
improvement.
Keywords
Moment-by-moment learning graph, preparation for future
learning, student modeling.
1. INTRODUCTION
In recent years, there has been increasing emphasis in learning
sciences research on helping students develop robust
understanding that supports a student in achieving preparation for
future learning (PFL) (cf. [9,15,17,26]), with evidence suggesting
that differences in the design of educational experiences can
substantially impact PFL [11,28]. Multiple approaches have now
been found to be successful at supporting PFL. For example,
learning-by-teaching, when implemented with the use of teachable agents (computer characters that the student has to teach during the learning process), has been shown to support PFL
[11,26,28]. Another approach shown to support PFL is the use of
invention activities, during which students are asked to invent
corbett@cmu.edu
2. DATASET
We use attributes of the form of individual students' MBMLG to predict student preparation for future learning. We do so in a
combined data set from three studies, in total comprising 181
undergraduate and high-school students who used an intelligent
tutoring system to learn Genetics. The students enrolled in
Genetics courses at Carnegie Mellon University, or in high school
biology courses in Southwestern Pennsylvania.
2.2 Design
The studies were conducted in computer clusters at Carnegie
Mellon University. All students attended study sessions on two
consecutive days; in studies 1 and 2, each of these lasted 2 hours,
while in study 3, each lasted 2.5 hours. All students engaged in
Cognitive Tutor-supported activities for about one hour in each of
the two sessions. In studies 1 and 3 all students completed
standard Three-Factor Cross problems, as depicted in Figure 1, in
both sessions, while in study 2 all students completed standard
Gene Interaction problems, as depicted in Figure 2, in both
sessions.
3. MOMENT-BY-MOMENT LEARNING
GRAPH
3.1 Construction of the Graph
The construction of the Moment-By-Moment Learning Graph
(MBMLG) is based on a three-phase process, which first infers
moment-by-moment learning using data from the future, then
infers the same construct without data from the future, and then
integrates across inferences over time to create a graph.
4. FEATURE ENGINEERING
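A sketch of how global features might be computed from a moment-by-moment learning graph; the feature names mirror those used in this section, but the exact definitions below are assumptions, not the authors' (in particular, the paper may define peaks as local maxima rather than the largest values):

```python
def mbmlg_features(probs):
    """Simple global features of a moment-by-moment learning graph, where `probs`
    is the inferred amount of learning at each practice opportunity."""
    peaks = sorted(probs, reverse=True)
    return {
        "avgMBML": sum(probs) / len(probs),  # mean moment-by-moment learning
        "sumMBML": sum(probs),               # total inferred learning
        "graphLen": len(probs),              # number of practice opportunities
        "peak": peaks[0],                    # largest single learning moment
        "2ndPeak": peaks[1] if len(probs) > 1 else 0.0,
        "peakIndex": probs.index(peaks[0]),  # when the largest moment occurred
    }

f = mbmlg_features([0.1, 0.6, 0.2, 0.5])
print(f["peak"], f["peakIndex"], f["graphLen"])  # → 0.6 1 4
```

Features like the decrease from the first to the second peak can then be derived from these quantities.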
Feature           Pre-test   PFL
[avgMBML]         -0.35**    -0.48**
[sumMBML]         -0.19**    -0.40**
[graphLen]         0.30**     0.00
[area]            -0.35**    -0.48**
[peak]            -0.09      -0.35**
[2ndPeak]         -0.13      -0.41**
[3rdPeak]         -0.20**    -0.44**
[peakIndex]       -0.15      -0.09
[2ndPeakIndex]    -0.03       0.03
[2PeaksDist]       0.15       0.05
[2PeakRelDist]     0.14       0.07
[2PeakDecr]        0.29**     0.45**
[2PeakRelDecr]     0.26**     0.40**
[3PeakDecr]        0.35**     0.49**
[3PeakRelDecr]     0.21**     0.38**
7. ACKNOWLEDGMENTS
This research was supported by grants NSF #DRL-0910188 and
IES #R305A090549, and by the Pittsburgh Science of Learning
Center, NSF #SBE-0836012. We would also like to thank Adam
Goldstein, Neil Heffernan, Lisa Rossi, Angela Wagner, Zakkai
Kauffman-Rogoff, and Aaron Mitchell for their help in this study.
8. REFERENCES
Michael Eagle
Tiffany Barnes
mjokimoto@gmail.com
maikuusa@gmail.com
ABSTRACT
We introduce InVis, a novel visualization technique and tool for exploring, navigating, and understanding user interaction data. InVis creates an interaction network from student-interaction data extracted from large numbers of students using educational systems, and enables instructors to make new insights and discoveries about student learning. Here we present our novel interaction network model and InVis tool. We also demonstrate that InVis is an effective tool for providing instructors with useful and meaningful insights into how students solve problems.
1.
INTRODUCTION
tmbarnes@ncsu.edu
1.1
Related Work
Student tutor-log data sets are large (representing hundreds of problem attempts with hundreds of problem states) and teachers, who
are not necessarily savvy with graphs, spreadsheets, or statistics,
need support in navigating these large domain models to learn about
student behavior. Ben-Naim and colleagues [3] have developed an
authoring tool that allows teachers to create simulators for science
and explore small graphs of student actions in the simulator. However, this visualization is restricted to teacher-created states, where
teachers label a step in the simulator as one of interest. It is not
fully derived from student data and does not facilitate exploration
of a large, diverse dataset from other tutors.
SpaceTree [14] is software that might enable educators to explore
our static interaction networks more interactively than graph visualization tools like GraphViz [7] and Gephi [2]. However, our
networks are not always trees and can contain cycles and loops.
The SpaceTree layout also particularly highlights the children of a node, while we are interested in the full path from start to finish of a problem-solving sequence, and in seeing a whole set of student behavior at once, to provide an overview and support pattern-finding.
CourseVis is a visualization tool that produces graphical representations of student tracking data collected by a Content Management System, and helps teachers gain an understanding of how students are behaving in their online courses. In CourseVis, the focus is on the behavior of a student over the course of an entire system, whereas in our work the focus is more fine-grained, as we are interested in the behavior of students in single problems. CourseVis does support some techniques to look at student performance, but the focus is on visualizing knowledge components and assessment performance, not problem-solving behavior as in our work [10].
In VisMod, students are provided with a visualization tool for representing and interacting with their own student model, allowing them to develop their meta-cognitive skills [19]. The focus of this tool is not the behavior of the students but rather what the students think about their own behavior.
TADA-Ed is a tool designed for mining educational data generated by digital tutors, much like our work. TADA-Ed's focus is on visualizing the results of several data-mining techniques, such as k-means clustering and decision trees, applied to educational data [11]. Our work differs in that our focus is on student problem-solving behavior.
2.
2.1
2.2
2.2.1 BeadLoom Game
2.2.2
3.
3.1 Guidelines Review
Our visualization was designed with the visual information seeking mantra in mind. As such, the tool supports functionality for overview, zooming and filtering, details on demand, viewing relationships, and extraction. We describe why each element is important for viewing interaction networks, how each element was included and supported, and improvements that can be made. Craft and Cairns state that many developers cite the mantra as the guiding source for the development of their visualizations but often forgo explaining how and where they used it [5]. We use this evaluation to find strengths and weaknesses in our approach, and to confirm that our implementation is based on these principles.
3.1.1 Overview First
The hierarchical graph offers an overview representation of the students' behavior as they work through a single problem. Combined with the edge width representing frequency, it provides a quick understanding of student behavior trends. In addition, the mini-map consistently orients the user within the context in which they are working, always providing an overview for the user to reference as they navigate the Interaction Network graph. A possible improvement would be to map other visual components to other variables; future studies will show the effects of these changes.
3.1.2
3.1.3 Detail on Demand
The details tab is where the user can find specific information regarding a state, action, or student. Details are available in a set of tabs displaying text information about the selected node, including the students who visited the node, the frequency of the state, the actions leading from the node, whether the state is a goal or error state, and the description of the state. These details are the finest granularity we can provide from the log data that was read in. One improvement would be to provide aggregate user statistics for the current graph, for example the number of error states, the total number of states, the number of actions, and the average number of actions per student.
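Such aggregate statistics are straightforward to compute once the network is in memory. The sketch below assumes a minimal, hypothetical representation of the parsed states; none of these names come from InVis itself.

```python
# Sketch of the aggregate statistics suggested above, over a
# hypothetical in-memory representation of interaction-network states.
from dataclasses import dataclass, field

@dataclass
class State:
    students: set                      # ids of students who visited this state
    frequency: int                     # number of visits to this state
    is_error: bool = False             # whether the state is an error state
    actions: list = field(default_factory=list)  # outgoing actions

def aggregate_stats(states):
    """Summarize a whole interaction network for a details panel."""
    n_actions = sum(len(s.actions) for s in states)
    all_students = set().union(*(s.students for s in states)) if states else set()
    return {
        "total_states": len(states),
        "error_states": sum(s.is_error for s in states),
        "total_actions": n_actions,
        "avg_actions_per_student": n_actions / max(len(all_students), 1),
    }
```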
3.1.4 View Relationships
3.1.5 History
3.1.6 Extract
Sharing one's findings about student behavior is important, particularly for teachers, so that challenging issues for students can be addressed. In InVis, users can save an image of the visualization panel so that it can be shared. Teachers can show the sequence of steps to their students and highlight situations where errors were common.
Figure 3: Left) The user has selected several nodes. By constructing a sub-graph, InVis presents the selected nodes in a new tab, shown on the right. Below is an example of the mini-map, with the white box representing the view frame of the left-hand image. Note the size of the graph when visualizing 170 students.
Improved sharing between colleagues could be supported by saving the current layouts, sub-graphs, and graph annotations so that they can be stored and shared easily, allowing others to interact with the data via InVis as well.
4.
4.1
We visualized data from Deep Thought [6] and interviewed the professor responsible for its development. We met for one hour and had him explore tutor data and inform us of the different insights and hypotheses he was able to discover or confirm using InVis. We prepared a data set of thirty students, representing a classroom of students. The professor noticed a student had performed addition rather than conjunction in order to derive A ∧ B, which is an incorrect application of the rule. After he recognized this, he mentioned that it was a common mistake made by students; this was his hypothesis. He then used the action selector and entered ADD, which
4.2
In this example, the professor was interested in the general behavioral trends of students. By including frequencies on nodes and edges, we are able to identify strategies that students use to solve problems. In this case, the hypothesis was that students who change all the implications into ORs likely had no real strategy for completely solving the problem. Over the years, the professor has recognized that students who employ this approach often have difficulty actually solving the problem, and thus students are explicitly instructed in class not to use it.
After loading the data into InVis, the professor looked for the two main strategies performed by the students, that is, the two most frequent sets of steps. The first strategy, with the highest frequency, is the one she teaches her students in class, which we call the professor's strategy; the first node in this strategy has 74 students. The next most common first step has 29 students and is the start of the prohibited strategy, namely changing an implication into an OR. Next, the professor selected the first node of a strategy and performed the select sub-tree action, which selected all states derived after the current state, effectively selecting all the different variations of the professor's strategy. She then created a sub-graph. The same was done for the prohibited strategy. Next, she selected all of the goal nodes of each sub-graph in turn and looked at the combined frequency of the goal nodes for each sub-graph. For the 74 students who applied the professor's strategy, 55 arrived at the goal, giving a 74% success rate. For the prohibited strategy, the sub-graph has a combined goal-node frequency of 17 out of the 29 students, resulting in a noticeably lower 59% success rate. In total there are 174 students, and the two strategies highlighted above are the most common. The next two most common strategies both have frequencies of 11, with 9 and 7 students, respectively, successfully solving the problem. This again suggests that the prohibited strategy has a particularly low success rate.
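The success rates quoted above follow directly from the goal-node frequencies; a one-line helper reproduces the arithmetic.

```python
# Reproducing the success-rate arithmetic from the case study above.
def success_rate(goal_frequency, strategy_frequency):
    """Percentage of students following a strategy who reached the goal."""
    return round(100 * goal_frequency / strategy_frequency)

print(success_rate(55, 74))  # professor's strategy -> 74
print(success_rate(17, 29))  # prohibited strategy -> 59
```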
4.3
One interesting application of InVis is in the debugging of tutors. In the previous examples, the Interaction Network uncovered bugs in the tutor systems, that is, places where the recorded interactions should not have been legal actions. This is notable, as both of these tutors have been used for many years and their data have been subject to extensive analysis; yet these bugs were not discovered until the data were visualized with InVis. Viewing the entire group of user behaviors at once improves the ability to spot peculiar behaviors. In ProofSolver, several solutions were noticeably shorter than the average or skipped to the goal in strange and invalid ways. After examining the series of actions these students performed, the professor confirmed that the interactions were illegal and should not have been permitted.
In the case of Deep Thought, some students were able to reach the
goal by repeatedly performing the same action. In this case, the
students were able to use the instantiation-action inappropriately to
add any proposition to the proof. As a result of this, students could
simply add items directly to the proof rather than use the axioms,
allowing them to game the system and illegally solve the problem.
4.4
We collected game log files from a study performed on the BeadLoom Game (BLG) in 2010. Data came from a total of six classes, ranging from 6th to 8th grade, over a total of four sessions. There were 132 students and 2,438 game-log files. The students were split into two groups (called A-Day and B-Day) and were presented with BLG features in different orders. The A-Day students were given access to custom puzzles (a free-play option), while B-Day students were given a competitive game element in the form of a leaderboard. Due to differences in student timelines, some B-Day classes missed session three. These students followed an abbreviated A-Day schedule during session four. To investigate whether there were different problem-solving patterns between the groups, we colored vertices based on the percentage of students who visited from each group. The values were normalized from green (A-Day) to red (B-Day). We loaded the data into InVis and presented it to the BeadLoom Game developers.
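The green-to-red normalization might be implemented along the following lines; this is a sketch of the idea, not the actual color mapping used by InVis.

```python
# Map a vertex from green (all A-Day visitors) to red (all B-Day
# visitors) based on the fraction of B-Day students who visited it.
def group_color(n_a_day, n_b_day):
    frac_b = n_b_day / max(n_a_day + n_b_day, 1)
    red = int(255 * frac_b)
    green = int(255 * (1 - frac_b))
    return f"#{red:02x}{green:02x}00"

print(group_color(10, 0))   # all A-Day -> "#00ff00" (green)
print(group_color(0, 10))   # all B-Day -> "#ff0000" (red)
```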
Next we met with the BeadLoom Game developers and asked them to explore their log data using our prototype visualization tool. In Figure 4 we have a set of students who worked on the same problem on two different days, the first and third days of the study. By looking at the number of states, we can see a more diverse set of attempts on the first day. As mentioned before, edge width represents frequency; green vertices are from one set of students and red vertices from another, based on how the study was run; the goal has a square vertex. At the start of our investigation we colored the
5. USABILITY STUDY
5.1
5.2 Results
[Survey results for questions Q1, Q3, Q4, Q7, Q9, Q11, Q12, Q14, and Q15 not recovered.]
5.2.1 Qualitative
We now turn to the qualitative survey results and the comments made while using InVis.
An important issue to address is graph layout; two issues regarding the layout were raised. First, the layout would be more intuitive if it were possible to order it in some manner along the breadth (x-axis), based on frequency or some other metric. By applying a more informed layout to the graph, we could order the states along the x-axis, for example placing the most frequent path on the left and the least frequent on the right. The second problem regards strategies, sub-strategies, and ordered states. Participants mentioned they would like the layout to group or cluster approaches based on similar strategies. If two students each have nine identical steps, but the first step in each approach is different, then the layout does not necessarily put the states from these two approaches close to one another. When looking at 100-plus students, this makes the number of distinct strategies difficult to assess. A graph layout that places similar paths next to each other could provide a more intuitive visualization of the interaction network.
All participants responded positively when asked whether they would use the tool to augment their understanding of their current classes' behavior and learning. This suggests a need for these types of tools for exploring and understanding student behaviors in software tutors. We conclude with a quote from one user: "The tool provides a sense of how broadly varying students are in their approaches, how many get stuck, and how many make similar mistakes." We feel this is a good representation of the kinds of insights InVis was designed to detect.
6. CONCLUSION
The main contribution of this work is the discovery and implementation of visualization techniques for user-interaction data from educational systems. This led to new insights into problem solving in the Deep Thought logic tutoring environment, for example the conclusions drawn in the case studies. The use of interactive visualization techniques combined with an interaction network model in InVis allows users to explore and gain insight from interaction-log data. We performed a user study on InVis to show that users can successfully complete relevant tasks, and paired these results with a standardized method for testing the usability of a software tool. Users were able to explore an entire class set of interactions and to confirm some of the hypotheses they had about students, which was a primary goal. This suggests that our technique is effective at allowing users to explore and learn information from the data. This is the first step in creating a domain-independent visualization tool for understanding student behavior in software tutors, and our initial results seem promising for the future development of InVis.
7. REFERENCES
Keywords
Factor analysis, ordinal regression, matrix factorization, personalized learning, block coordinate descent
1. INTRODUCTION
Today's education system typically provides only a one-size-fits-all learning experience that does not cater to the background, interests, and goals of individual learners. Modern machine learning (ML) techniques provide a golden opportunity to reinvent the way we teach and learn by making them more personalized and, hence, more efficient and effective. The last decades have seen a great acceleration in the development of personalized learning systems (PLSs), which can be grouped into two broad categories: (i) high-quality but labor-intensive rule-based systems designed by domain experts that are hard-coded to give feedback in pre-defined scenarios [8], and (ii) more affordable and scalable ML-based systems that mine various forms of learner data in order to make performance predictions for each learner [15, 18, 30].
1.1
Learning analytics (LA, estimating what a learner understands based on data obtained from tracking their inter-
1.2 Contributions
In this paper, we develop Ordinal SPARFA-Tag, a significant extension to the SPARFA framework that enables the exploitation of additional information that is often available in educational settings. First, Ordinal SPARFA-Tag exploits the fact that responses are often graded on an ordinal scale (partial credit) rather than on a binary scale (correct/incorrect). Second, Ordinal SPARFA-Tag exploits tags/labels (i.e., keywords characterizing the underlying knowledge component related to a question) that can be attached by instructors and other users to questions. Exploiting pre-specified tags within the estimation procedure provides significantly more interpretable question–concept associations. Furthermore, our statistical framework can discover new concept–question relationships that are not in the pre-specified tag information but, nonetheless, explain the graded learner–response data. We showcase the superiority of Ordinal SPARFA-Tag over the methods in [24] via a set of synthetic ground-truth simulations and a variety of experiments with real-world educational datasets. We also demonstrate that Ordinal SPARFA-Tag outperforms existing state-of-the-art collaborative filtering techniques in terms of predicting missing ordinal learner responses.
2. STATISTICAL MODEL
We assume that the learners' knowledge levels on a set of abstract latent concepts govern the responses they provide to a set of questions. The SPARFA statistical model characterizes these responses as follows.
2.1
Suppose that we have N learners, Q questions, and K underlying concepts. Let $Y_{i,j}$ represent the graded response (i.e., score) of the $j$-th learner to the $i$-th question, drawn from a set of $P$ ordered labels, i.e., $Y_{i,j} \in \mathcal{O}$, where $\mathcal{O} = \{1, \ldots, P\}$. For the $i$-th question, with $i \in \{1, \ldots, Q\}$, we propose the following model for the learner–response relationships:

$$Z_{i,j} = \bar{w}_i^T c_j + \mu_i, \;\forall (i,j), \qquad Y_{i,j} = \mathcal{Q}(Z_{i,j} + \epsilon_{i,j}), \;\; \epsilon_{i,j} \sim \mathcal{N}(0, 1/\tau_{i,j}), \;\forall (i,j) \in \Omega_{\text{obs}}, \quad (1)$$

where the column vector $\bar{w}_i \in \mathbb{R}^K$ models the concept associations; i.e., it encodes how question $i$ is related to each concept. Let the column vector $c_j \in \mathbb{R}^K$, $j \in \{1, \ldots, N\}$, represent the latent concept knowledge of the $j$-th learner, with its $k$-th component representing the $j$-th learner's knowledge of the $k$-th concept. The scalar $\mu_i$ models the intrinsic difficulty of question $i$, with a large positive value for an easy question. The quantity $\tau_{i,j}$ models the uncertainty of learner $j$ answering question $i$ correctly/incorrectly, and $\mathcal{N}(0, 1/\tau_{i,j})$ denotes a zero-mean Gaussian distribution with precision parameter $\tau_{i,j}$, which models the reliability of the observation of learner $j$ answering question $i$. We will further assume $\tau_{i,j} = \tau$, meaning that all the observations have the same reliability.1 The slack variable $Z_{i,j}$ in (1) governs the probability of the observed grade $Y_{i,j}$. The set $\Omega_{\text{obs}} \subseteq \{1, \ldots, Q\} \times \{1, \ldots, N\}$ contains the indices associated with the observed learner–response data, in case the response data is not fully observed.

In (1), $\mathcal{Q}(\cdot) : \mathbb{R} \to \mathcal{O}$ is a scalar quantizer that maps a real number into $P$ ordered labels according to $\mathcal{Q}(x) = p$ if $\omega_{p-1} < x \leq \omega_p$, $p \in \mathcal{O}$, where $\{\omega_0, \ldots, \omega_P\}$ is the set of quantization bin boundaries satisfying $\omega_0 < \omega_1 < \cdots < \omega_{P-1} < \omega_P$, with $\omega_0$ and $\omega_P$ denoting the lower and upper bounds of the domain of the quantizer $\mathcal{Q}(\cdot)$.2 This quantization model leads to the equivalent input–output relation

$$Z_{i,j} = \bar{w}_i^T c_j + \mu_i, \;\forall (i,j), \quad (2)$$

and

$$p(Y_{i,j} = p \mid Z_{i,j}) = \int_{\omega_{p-1}}^{\omega_p} \mathcal{N}(s \mid Z_{i,j}, 1/\tau_{i,j}) \, ds. \quad (3)$$
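To make the generative model concrete, the sketch below samples ordinal responses from randomly drawn factors. The sizes, sparsity level, precision, and bin boundaries are illustrative assumptions, not values from the paper.

```python
# Toy sampler for the ordinal SPARFA model in (1)-(3).
import numpy as np

rng = np.random.default_rng(0)
N, Q, K, P = 100, 34, 5, 5                 # learners, questions, concepts, labels
tau = 4.0                                  # shared precision (reliability)
omega = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])  # bin boundaries

W = np.abs(rng.normal(size=(Q, K))) * (rng.random((Q, K)) < 0.3)  # sparse, nonneg
C = rng.normal(size=(K, N))                # latent concept knowledge c_j
mu = rng.normal(size=(Q, 1))               # intrinsic difficulties mu_i

Z = W @ C + mu                             # slack variable, eq. (2)
noisy = Z + rng.normal(scale=1 / np.sqrt(tau), size=Z.shape)
Y = np.digitize(noisy, omega[1:-1]) + 1    # quantize into labels 1..P, eq. (1)
```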
2.2 Fundamental assumptions
To solve (OR-W) and (OR-C), we deploy the iterative first-order methods detailed below. To optimize the precision parameter $\tau$, we compute the solution to

$$\underset{\tau > 0}{\text{minimize}} \; \sum_{i,j : (i,j) \in \Omega_{\text{obs}}} -\log p(Y_{i,j} \mid \tau, \bar{w}_i^T c_j).$$
3. ALGORITHM
We start by developing Ordinal SPARFA-M, a generalization of SPARFA-M from [24] to ordinal response data. Then, we detail Ordinal SPARFA-Tag, which considers pre-specified question tags as oracle support information of W, to estimate W, C, and $\mu$ from the ordinal response matrix Y while enforcing the assumptions (A1)–(A4).
3.1 Ordinal SPARFA-M

Ordinal SPARFA-M solves the optimization problem

$$\underset{W, C, \mu}{\text{minimize}} \; \sum_{i,j : (i,j) \in \Omega_{\text{obs}}} -\log p(Y_{i,j} \mid \bar{w}_i^T c_j) + \lambda \sum_i \|w_i\|_1 \quad \text{subject to} \;\; W_{i,k} \geq 0 \;\forall i,k, \;\; \|C\| \leq \eta. \quad (4)$$

Here, the likelihood of each response is given by (2). The regularization term $\lambda \sum_i \|w_i\|_1$ imposes sparsity on each vector $w_i$ to account for (A2). To prevent arbitrary scaling between W and C, we gauge the norm of the matrix C by applying a matrix norm constraint $\|C\| \leq \eta$. For example, the Frobenius norm constraint $\|C\|_F \leq \eta$ can be used. Alternatively, the nuclear norm constraint $\|C\|_* \leq \eta$ can also be used, promoting low-rankness of C [9], motivated by the facts that (i) reducing the number of degrees of freedom in C helps to prevent overfitting to the observed data and (ii) learners can often be clustered into a few groups due to their different demographic backgrounds and learning preferences.

3.2

Problem (4) is solved by alternating between the two subproblems

$$\text{(OR-W)} \quad \underset{w_i : W_{i,k} \geq 0 \,\forall k}{\text{minimize}} \; \sum_j -\log p(Y_{i,j} \mid \bar{w}_i^T c_j) + \lambda \|w_i\|_1, \quad (5)$$

and

$$\text{(OR-C)} \quad \underset{C : \|C\| \leq \eta}{\text{minimize}} \; \sum_{i,j} -\log p(Y_{i,j} \mid \bar{w}_i^T c_j).$$

Both subproblems are solved with accelerated first-order methods: a gradient step $\hat{w}_i^{\ell+1} \leftarrow w_i^\ell - t_\ell \nabla f$, with step size $t_\ell$ at iteration $\ell$, is followed for (OR-W) by the shrinkage step

$$w_i^{\ell+1} \leftarrow \max\{\hat{w}_i^{\ell+1} - \lambda t_\ell, 0\}, \quad (6)$$

and, for (OR-C) with the Frobenius norm constraint, by the projection step

$$C^{\ell+1} \leftarrow \begin{cases} \hat{C}^{\ell+1} & \text{if } \|\hat{C}^{\ell+1}\|_F \leq \eta, \\ \eta \, \hat{C}^{\ell+1} / \|\hat{C}^{\ell+1}\|_F & \text{otherwise.} \end{cases} \quad (7)$$

Here, we assume $\Omega_{\text{obs}} = \{1, \ldots, Q\} \times \{1, \ldots, N\}$ for simplicity; a generalization to the case of missing entries in Y is straightforward.
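The shrinkage and projection steps (6) and (7) are each a few lines of NumPy; the function names below are ours, since the paper does not ship an implementation.

```python
# Sketches of update steps (6) and (7).
import numpy as np

def update_w(w_hat, lam, step):
    """Step (6): soft-threshold by lam*step, then clip to the nonnegative orthant."""
    return np.maximum(w_hat - lam * step, 0.0)

def project_frobenius(C_hat, eta):
    """Step (7): rescale C whenever it leaves the ball ||C||_F <= eta."""
    norm = np.linalg.norm(C_hat)
    return C_hat if norm <= eta else (eta / norm) * C_hat
```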
Figure 1: Performance comparison of Ordinal SPARFA-M vs. K-SVD+ on the errors $E_W$, $E_C$, and $E_\mu$. (a) Impact of the number of learners, $N \in \{50, 100, 200\}$, with the number of questions Q fixed. (b) Impact of the number of questions, $Q \in \{50, 100, 200\}$, with the number of learners N fixed. SP denotes Ordinal SPARFA-M without given support $\Gamma$ of W, SPP denotes the variant with estimated precision $\tau$, and SPT denotes Ordinal SPARFA-Tag. KS stands for K-SVD+, and KST denotes its variant with given support $\Gamma$. [Bar plots omitted.]
For the nuclear-norm constraint $\|C\|_* \leq \eta$, the projection step is given by

$$C^{\ell+1} \leftarrow U \,\text{diag}(s)\, V^T, \quad \text{with} \;\; s = \text{Proj}_\eta(\text{diag}(S)), \quad (8)$$

where $\hat{C}^{\ell+1} = U S V^T$ is the singular value decomposition and $\text{Proj}_\eta(\cdot)$ denotes the projection of the singular values onto the $\ell_1$-ball of radius $\eta$.
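One standard way to realize such a projection is the sorting-based $\ell_1$-ball recipe applied to the singular values; the implementation details below are our assumption, not code from the paper.

```python
# Nuclear-norm projection as in (8): SVD, then project the singular
# values onto the l1 ball of radius eta.
import numpy as np

def project_l1_ball(v, eta):
    """Project a nonnegative vector v onto {x : sum(x) <= eta, x >= 0}."""
    if v.sum() <= eta:
        return v
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - eta
    rho = np.nonzero(u - cssv / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = cssv[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def project_nuclear(C_hat, eta):
    """Project C_hat onto the nuclear-norm ball ||C||_* <= eta."""
    U, s, Vt = np.linalg.svd(C_hat, full_matrices=False)
    return U @ np.diag(project_l1_ball(s, eta)) @ Vt
```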
3.3 Ordinal SPARFA-Tag

Ordinal SPARFA-Tag takes the pre-specified support set $\Gamma$ of W into account and solves

$$\underset{W, C, \mu}{\text{minimize}} \; \sum_{i,j : (i,j) \in \Omega_{\text{obs}}} -\log p(Y_{i,j} \mid \bar{w}_i^T c_j) + \lambda \sum_i \|w_i^{(\bar\Gamma)}\|_1 + \frac{\gamma}{2} \sum_i \|w_i^{(\Gamma)}\|_2^2.$$

Here, $w_i^{(\Gamma)}$ is a vector of those entries in $w_i$ belonging to the set $\Gamma$, while $w_i^{(\bar\Gamma)}$ is a vector of entries in $w_i$ not belonging to $\Gamma$. The $\ell_2$-penalty term on $w_i^{(\Gamma)}$ regularizes the entries in W that are part of the (predefined) support of W; we set $\gamma = 10^{-6}$ in all our experiments. The $\ell_1$-penalty term on $w_i^{(\bar\Gamma)}$ induces sparsity on the entries outside the support. The entries in $\Gamma$ are updated via

$$w_i^{(\Gamma),\ell+1} \leftarrow \max\{\hat{w}_i^{(\Gamma),\ell+1} / (1 + \gamma t_\ell), 0\}.$$
4. EXPERIMENTS
Figure 2: Performance comparison of Ordinal SPARFA-M vs. K-SVD+ obtained by varying the number of quantization bins P: (a) $E_W$ versus P, (b) $E_C$ versus P, (c) $E_\mu$ versus P. SP denotes Ordinal SPARFA-M, KSY denotes K-SVD+ operating on Y, and KSZ denotes K-SVD+ operating on Z in (3) (the unquantized data). [Plots omitted.]
4.1 Synthetic data
We define the following normalized estimation errors:

$$E_W = \frac{\|W - \hat{W}\|_F^2}{\|W\|_F^2}, \quad E_C = \frac{\|C - \hat{C}\|_F^2}{\|C\|_F^2}, \quad E_\mu = \frac{\|\mu - \hat{\mu}\|_2^2}{\|\mu\|_2^2}.$$
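Each of these normalized errors reduces to the same one-line helper; the matrices below are random stand-ins purely for illustration.

```python
# Computing the estimation errors E_W, E_C, E_mu defined above.
import numpy as np

def relative_error(true, est):
    """Squared norm of the estimation error, normalized by the true norm."""
    return np.linalg.norm(true - est) ** 2 / np.linalg.norm(true) ** 2

rng = np.random.default_rng(1)
W_true, W_hat = rng.random((20, 5)), rng.random((20, 5))
print(relative_error(W_true, W_hat))   # E_W for the toy matrices
```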
4.2 Real-world data
We now demonstrate the superiority of Ordinal SPARFA-Tag compared to regular SPARFA as in [24]. In particular, we show the advantages of using tag information directly within the estimation algorithm and of imposing a nuclear-norm constraint on the matrix C. For all experiments, we apply Ordinal SPARFA-Tag to the graded learner-response matrix Y with oracle support information $\Gamma$ obtained from instructor-provided question tags. The parameters $\lambda$ and $\eta$ are selected via cross-validation.
Algebra test: We analyze a dataset from a high school
algebra test carried out on Amazon Mechanical Turk [2], a
crowd-sourcing marketplace. The dataset consists of N = 99
users answering Q = 34 multiple-choice questions covering
topics such as geometry, equation solving, and visualizing
function graphs. The questions were manually labeled with
a set of 13 tags. The dataset is fully populated, with no
[Figure 3 omitted: question–concept association maps, with questions labeled by their intrinsic difficulties. Concepts for the algebra test include Arithmetic, Simplifying expressions, Solving equations, Fractions, Quadratic functions, Geometry, Inequality, Slope, Polynomials, System equations, Plotting functions, Trigonometry, and Limits; a second map uses science concepts such as Classifying matter, Properties of water, Mixtures and solutions, Changes from heat, Uses of energy, Circuits and electricity, Forces and motion, Formation of fossil fuels, Changes to land, Evidence of the past, Earth, sun and moon, Alternative energy, Properties of soil, Earth's forces, Food webs, and Environmental changes.]
Figure 3 shows the question–concept association map estimated by Ordinal SPARFA-Tag using the Frobenius norm constraint $\|C\|_F \leq \eta$. Circles represent concepts, and squares represent questions (labeled by their intrinsic difficulties $\mu_i$). Large positive values of $\mu_i$ indicate easy questions; negative values indicate hard questions. Connecting lines indicate whether a concept is present in a question; thicker lines represent stronger question–concept associations. Black lines represent the question–concept associations estimated by Ordinal SPARFA-Tag, corresponding to the entries in W as specified by $\Gamma$. Red, dashed lines represent the mislabeled associations (entries of W in $\Gamma$) that are estimated to be zero. Green solid lines represent newly discovered associations, i.e., entries in W not in $\Gamma$ that were discovered by Ordinal SPARFA-Tag.

By comparing Fig. 3 with [24, Fig. 9], we can see that Ordinal SPARFA-Tag provides unique concept labels, i.e., one tag is associated with one concept; this enables precise, interpretable feedback to individual learners, as the values in C directly represent the tag knowledge profile for each learner. This tag knowledge profile can be used by a PLS to provide targeted feedback to learners. The estimated question–concept associations can also serve as a useful tool for domain experts or course instructors, as they indicate missing and nonexistent tag–question associations.
4.3
We measure prediction performance by the root mean squared error

$$\text{RMSE} = \sqrt{\frac{1}{|\bar{\Omega}_{\text{obs}}|} \sum_{i,j : (i,j) \in \bar{\Omega}_{\text{obs}}} (Y_{i,j} - \hat{Y}_{i,j})^2},$$

where $\hat{Y}_{i,j}$ is the predicted score for $Y_{i,j}$ and $\bar{\Omega}_{\text{obs}}$ denotes the set of held-out entries, averaged over 50 trials.
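A direct transcription of this RMSE over a held-out index set looks as follows; the data in any usage would come from the prediction step, not from the paper itself.

```python
# RMSE over held-out entries, matching the definition above.
import numpy as np

def rmse(Y, Y_hat, held_out):
    """held_out: iterable of (i, j) index pairs not seen during training."""
    errs = [(Y[i, j] - Y_hat[i, j]) ** 2 for i, j in held_out]
    return float(np.sqrt(np.mean(errs)))
```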
Figure 5 demonstrates that the nuclear-norm variant of Ordinal SPARFA-M outperforms OrdRec, while the performance of the other variants of Ordinal SPARFA is comparable to OrdRec. SVD++ performs worse than all compared methods, suggesting that the use of a probabilistic model considering ordinal observations enables accurate predictions on unobserved responses. We furthermore observe that the variants of Ordinal SPARFA-M that optimize the precision parameter $\tau$ or the bin boundaries deliver almost identical performance. We finally emphasize that Ordinal SPARFA-M not only delivers superior prediction performance over the two state-of-the-art collaborative filtering techniques in predicting learner responses, but also provides interpretable factors, which is key in educational applications.
5. RELATED WORK
A range of different ML algorithms have been applied in educational contexts. Bayesian belief networks have been successfully used to probabilistically model and analyze learner
response data in order to trace learner concept knowledge
and estimate question difficulty (see, e.g., [13, 22, 33, 34]).
Item response theory (IRT) uses a statistical model to analyze and score graded question response data [25, 29]. Our
proposed statistical model shares some similarity to the
Rasch model [28], the additive factor model [10], learning factor analysis [19, 27], and the instructional factors
model [11]. These models, however, rely on pre-defined
question features, do not support disciplined algorithms to
estimate the model parameters solely from learner response
data, or do not produce interpretable estimated factors. Several publications have studied factor analysis approaches on
learner responses [3, 14, 32], but treat learner responses
as real and deterministic values rather than ordinal values
determined by statistical quantities. Several other results
have considered probabilistic models in order to characterize
learner responses [5, 6], but consider only binary-valued responses and cannot be generalized naturally to ordinal data.
While some ordinal factor analysis methods, e.g., [21], have been successful in predicting missing entries in datasets from ordinal observations, our model enables interpretability of the estimated factors, due to (i) the additional structure imposed on the learner–concept matrix (non-negativity combined with sparsity) and (ii) the fact that we associate unique tags with each concept within the estimation algorithm.
6. CONCLUSIONS

7. ACKNOWLEDGMENTS
8. REFERENCES
William W. Cohen
Kenneth R. Koedinger
nli1@cs.cmu.edu
wcohen@cs.cmu.edu
koedinger@cs.cmu.edu
ABSTRACT
One of the key factors affecting how automated tutoring systems make instructional decisions is the quality of the student model built into the system. A student model is a model that can solve problems in the various ways human students do. A good student model that matches student behavior patterns often provides useful information on learning task difficulty and transfer of learning between related problems, and thus often yields better instruction in intelligent tutoring systems. However, traditional ways of constructing such models are often time-consuming, and may still miss distinctions in content and learning that have important instructional implications. Automated methods can be used to find better student models, but they usually require some engineering effort and can be hard to interpret. In this paper, we propose an automated approach that finds student models using a clustering algorithm based on automatically generated problem content features. We demonstrate the proposed approach using an algebra dataset. Experimental results show that the discovered model is as good as one of the best existing models, which was found by a previous automated approach, but without the knowledge engineering effort.
Keywords
student model, machine learning, learner modeling
1. INTRODUCTION
A student model is an essential component of intelligent tutoring systems. It encodes how to solve problems in the various ways human students do. One common way of representing such student models is as a set of knowledge components (KCs) encoded in intelligent tutors to model how students solve problems. As defined in [9], a knowledge component is "an acquired unit of cognitive function or structure that can be inferred from performance on a set of related tasks." The set of KCs includes the component skills, concepts, or percepts that a student must acquire to be successful on
Table: Example content features for two tokenized steps. Each cell indicates whether the feature appears in the step.

Feature   -Nv=N   Nv+N=N
-N        1       0
Nv        1       1
v=        1       0
=N        1       1
-Nv       1       0
Nv=       1       0
In the following sections, we start by describing how to statistically evaluate the quality of a student model. Then, we explain how to generate features and apply a clustering algorithm to find student models that meet such criteria. Next, we report experimental results comparing the clustering-based model with the SimStudent model, along with an in-depth study using a recently developed analysis technique, Focused Benefits Investigation (FBI) [10]. After this, we discuss the generality of the proposed approach and possible improvements that can be made using SimStudent. In closing, we describe related work and the conclusions drawn from this work.
2.
As we have mentioned before, a student model can be represented by a set of knowledge components, where each problem step is associated with one KC that encodes how to proceed given the current step. Therefore, the problem we face is: given a dataset recording how human students solve problems in one domain, how do we find a set of KCs that matches student behavior well?
There are various ways of matching a student model with student data. Although other models are also possible (e.g., [8]), in our case we use the Additive Factor Model (AFM) [6] to measure the quality of a student model. AFM is an instance of logistic regression that models student success using each student, each KC, and the KC-by-opportunity interaction as independent variables:
\ln \frac{p_{ij}}{1 - p_{ij}} = \theta_i + \sum_k \beta_k Q_{kj} + \sum_k Q_{kj} (\gamma_k N_{ik})

where:
i represents a student i.
j represents a step j.
Table (continued): Content features for the example tokenized steps.

Feature   -Nv=N   Nv+N=N
v=N         1       0
v+          0       1
+N          0       1
N=          0       1
Nv+         0       1
v+N         0       1
+N=         0       1
N=N         0       1
Hence, the better the student model is, the more accurate the predictions are. To train the parameters, we use maximum-likelihood estimation (MLE). To avoid overfitting, we use cross-validation (CV) to validate the quality of the student model.
3.
3.1 Preprocessing
Before generating the features, we first tokenize the problem steps, so that all numbers are replaced by N and all variables are represented as v. For example, the tokenized representation of 3x = 6 is Nv = N. The level of tokenization affects the discovered model, since this preprocessing step removes the difference among steps that have the same form but different numbers. This may cause problems in some cases. For instance, solving 3x = 6 can potentially be much easier than solving 452x = 904, but the preprocessing step gives both steps the same tokenized representation Nv = N. As we will discuss later, by making use of SimStudent, we could automatically get different levels of tokenization.
k represents a skill or KC k.
p_ij is the probability that student i is correct on step j.
θ_i is the coefficient for the proficiency of student i.
β_k is the coefficient for the difficulty of the skill or KC k.
Q_kj is the Q-matrix cell for step j using skill k.
γ_k is the coefficient for the learning rate of skill k.
N_ik is the number of practice opportunities student i has had on skill k.
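Given these definitions, the AFM prediction for a single student-step pair can be sketched directly (a minimal illustration under our own naming, not the authors' code; actual parameter fitting uses MLE over the whole dataset):

```python
import math

def afm_probability(theta_i, beta, gamma, q_j, n_i):
    """Predicted probability that student i answers step j correctly
    under the Additive Factor Model.

    theta_i : proficiency of student i
    beta[k] : difficulty coefficient of KC k
    gamma[k]: learning-rate coefficient of KC k
    q_j[k]  : Q-matrix entry, 1 if step j uses KC k, else 0
    n_i[k]  : prior practice opportunities student i had on KC k
    """
    # Log-odds: theta_i + sum_k Q_kj * (beta_k + gamma_k * N_ik)
    logit = theta_i + sum(
        q_j[k] * (beta[k] + gamma[k] * n_i[k]) for k in range(len(beta))
    )
    # Invert the logit to obtain the success probability p_ij.
    return 1.0 / (1.0 + math.exp(-logit))
```

With a positive learning rate γ_k, repeated practice on a KC raises the predicted success probability, which is what the learning curves in the paper visualize.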
3.2 Feature Generation
After preprocessing, we generate features for the tokenized steps. There are two types of features: content features and performance features.
3.2.1 Content Features
Table 2: Performance features used for clustering.
- Average number of incorrect attempts on the current step
- Average number of hint requests on the current step
- Average number of correct attempts on the current step
- Percentage of times the first attempt is incorrect
- Percentage of times the first attempt is a hint request
- Percentage of times the first attempt is correct
- Average number of seconds the student spends on this step
- Average number of seconds the student spends on this step when the student gets the step correct
- Average number of seconds the student spends on this step when the student gets the step incorrect
- Average number of students working on this step
- Average number of opportunities the student has had on the current step
Algorithm 1: K-Means
Input: points to be clustered P, number of clusters k
Output: cluster centroids C, cluster membership M
initialize C with k randomly selected data points in P
forall p_i ∈ P do
    m_i := argmin_{j ∈ 1..k} distance(p_i, c_j)
end
while M changed do
    foreach i ∈ {1..n} do
        recompute c_i as the centroid of {p_j | m_j = i}
    end
    forall p_i ∈ P do
        m_i := argmin_{j ∈ 1..k} distance(p_i, c_j)
    end
end
return C, M
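Algorithm 1 corresponds to the standard k-means loop; it can be sketched in NumPy as follows (our own naming, not the authors' implementation):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means over a (n, d) array; returns (centroids, membership)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly selected data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    membership = np.argmin(
        np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1
    )
    for _ in range(max_iter):
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(membership == j):
                centroids[j] = points[membership == j].mean(axis=0)
        # Reassign every point to its nearest centroid.
        new_membership = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1
        )
        if np.array_equal(new_membership, membership):
            break  # memberships stable: converged
        membership = new_membership
    return centroids, membership
```

In the paper's setting, each point is the PCA-projected feature vector of one tokenized step, and each resulting cluster becomes one KC of the discovered student model.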
3.2.2 Performance Features
The second set of features used in the algorithm are performance features. These features measure the average performance of human students on each form of tokenized step. Examples of such measurements are the response time and whether the student's first attempt was correct. Table 2 shows the full list of performance features used for clustering.
Note that performance features are only used to create the clusters of the training data. Since we are predicting the performance of human students, performance data should not be used in testing. For testing data, we only use the content features to assign the cluster of the current step. In other words, for each testing data point, we calculate the distance of the data point to all of the training data points based on content features, and assign the testing data point to the cluster associated with the closest training data point.
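This test-time assignment is effectively a 1-nearest-neighbor lookup on content features only; a sketch (our own naming):

```python
import numpy as np

def assign_test_cluster(test_content, train_content, train_membership):
    """Assign a test step to the cluster of its nearest training step,
    measuring distance on content features only (performance features
    are never used at test time)."""
    # Euclidean distance from the test point to every training point.
    dists = np.linalg.norm(train_content - test_content, axis=1)
    # Inherit the cluster label of the closest training point.
    return train_membership[int(np.argmin(dists))]
```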
To remove correlations, we apply principal component analysis (PCA) over the features we generated. Principal component analysis is a mathematical procedure that projects a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components. The first principal component points in the direction that accounts for the largest possible variance; each succeeding component is orthogonal to the previous components and accounts for smaller variance.

After this transformation, all of the features in the projected space are orthogonal to each other. Moreover, to remove less informative features, we select only the first 40 principal components in the projected space, which cover approximately 95% of the variance in the data.
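The projection can be sketched with an SVD-based PCA (our own naming; the 40-component cutoff follows the text):

```python
import numpy as np

def pca_project(features, n_components=40):
    """Project feature vectors onto their first n_components
    principal components, via SVD of the centered data matrix.
    Returns the projected data and the fraction of variance kept."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    n = min(n_components, vt.shape[0])
    explained = (s[:n] ** 2).sum() / (s ** 2).sum()
    return centered @ vt[:n].T, explained
```

With the paper's setup one would check that `explained` is roughly 0.95 at 40 components before clustering in the projected space.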
3.4
3.3
Table 3: Cross-validation results for the clustering-based model and the SimStudent model.

Run       SimStudent RMSE   Clustering RMSE
Run 1         0.4105            0.4102
Run 2         0.4109            0.4106
Run 3         0.4113            0.4105
Run 4         0.4107            0.4111
Run 5         0.4106            0.4095
Run 6         0.4109            0.4102
Average       0.4108            0.4104
4.
[Figure 1: Error rates of human students and predicted error rates of the two student models (SimStudent model and clustering-based model) across problem abstractions such as Sv/N = S/S, v = S/N, and S/N = v. S stands for a signed number, N represents an integer, and v is a variable.]
model is better than the human-generated model, and provides useful instructional implications. The clustering-based model was discovered using the approach described above. Each time a student encounters a step that uses some KC is considered an opportunity for that student to show mastery of that KC. In both models, a total of 6507 steps are coded.
To get a better understanding of how the clustering-based model differs from other student models, we further utilized DataShop, a large repository that contains datasets from various educational domains as well as a set of associated visualization and analysis tools, to facilitate the evaluation process, which includes generating learning curve visualizations, AFM parameter estimation, and evaluation statistics including AIC (Akaike Information Criterion) and cross-validation.
EXPERIMENT STUDY
In order to evaluate the effectiveness of the proposed approach, we carried out a study using an algebra dataset.
We compared the clustering-based model with a SimStudent
model. The SimStudent model is discovered by a learning
agent, which is also one of the best student models we have
in the database.
4.1 Method
To generate the SimStudent model, SimStudent was tutored on how to solve linear equations by interacting with
a Carnegie Learning Algebra I Tutor like a human student.
We selected 40 problems that were used to teach real students as the training set for SimStudent. Given all of the
acquired production rules, for each step a real student performed, we assigned the applicable production rule as the
KC associated with that step. In cases where there was
no applicable production rule, we coded the step using a
human-generated KC model (Balanced-Action-Typein). The
human-generated model is the best model constructed by
domain experts. It has been shown that the SimStudent
(Footnote 1: We tried smaller numbers, but it turns out that when k is between 20 and 30, the cross-validation result is often better.)
4.2 Dataset
4.3 Measurements
Table 4: FBI results on selected KCs that are improved in the clustering-based model.

SimStudent KC        SimStudent Model RMSE
ctat-divide                0.5289
ctat-distribute            0.4292
ctat-multiply              0.4634
ctat-clt                   0.3757
ctat-divide-typein         0.3674
To better understand this machine learning approach, we carried out an in-depth study using FBI [10] on the clustering-based model and the SimStudent model. FBI is a recently developed technique designed to analyze which of the differences between two models improves prediction the most, and by how much.
4.4 Experimental Results
As shown in Table 3, in five out of the six runs the clustering-based model gets a lower RMSE than the SimStudent model, which indicates that the clustering-based model is at least as good as the SimStudent model. Averaged over the six runs, the clustering-based model gets an RMSE of 0.4104, while the SimStudent model gets a slightly higher RMSE (0.4108).
As you may have noticed, the difference between the RMSEs of the two models is small, but this does not mean that the difference between the two models is small. Instead of using cross-validation to measure the quality of the model as a whole, we applied FBI to evaluate the difference at the knowledge component level. Table 4 shows the top five KCs in the SimStudent model that are improved in the clustering-based model. As we can see, all of these KC names start with ctat, which means these KCs are from the human-generated model. Recall that in the SimStudent model, if SimStudent could not find any applicable production rule for a step, the step would be coded by the human-generated model. This suggests that the clustering-based approach is more general than the SimStudent approach in the sense that it is able to code steps that are not supported by SimStudent. Among the nine KCs generated by SimStudent, three were improved in the clustering-based student model.
For these five KCs, the clustering-based model reduced the RMSE by at least 8%; for the KC ctat-divide, the RMSE was reduced by around 25%. This indicates that the clustering-based approach is able to find KCs that are better than the existing ones. We can inspect the data more closely to get a better qualitative understanding of how the two models differ and what implications there might be for improved instruction.
We took a closer look at the KC ctat-divide-typein. In the SimStudent model, all steps that require division are assigned to the ctat-divide-typein skill. However, there are differences among these steps. We checked the KCs in the clustering-based model associated with these steps, and found that they were split into different KCs in the clustering-based model. Table 5 shows the five biggest KCs associated with the ctat-divide-typein steps. Since we
Table 4 (continued): Clustering-based model RMSE and % change in RMSE for the same KCs.

SimStudent KC        Clustering-Based Model RMSE   % change in RMSE
ctat-divide                  0.3984                    -24.67
ctat-distribute              0.3553                    -17.21
ctat-multiply                0.3962                    -14.50
ctat-clt                     0.3445                    -8.325
ctat-divide-typein           0.3368                    -8.321
Table 5: Clustering-based model RMSE for the five biggest KCs associated with the ctat-divide-typein steps.

KC                      Clustering-Based Model RMSE
Expression                      0.1547
SignedNumber Variable           0.4654
MinusSign                       0.2969
Variable                        0.4194
MinusSign Number                0.4238
5. DISCUSSION
5.1
One question we should ask is why we should use an automated student model discovery approach rather than manual construction. This is mainly because much of human expertise is only tacitly known: in many cases, we know how to solve a problem, but it can be hard to explain how we solved it. For instance, in language learning, native speakers can accurately select the correct article in a sentence, but do not know why they pick that article. Similarly, most algebra experts have no explicit awareness of subtle transformations they have acquired. Even though instructional designers may be experts in a domain, they may still have blind spots regarding subtle perceptual differences like this one, which may make a real difference for novice learners. A machine learning approach can help get past such blind spots by revealing challenges in the learning process that experts may not be aware of. In addition, the discovered KCs can serve as a basis for traditional ways of student model discovery.
5.2
Furthermore, in this paper, we simply use bigrams and trigrams of the tokenized steps as the content features. Some
Table 5 (continued): SimStudent model RMSE and % change in RMSE for the same KCs.

KC                      SimStudent Model RMSE   % change in RMSE
Expression                      0.2170               40.34
SignedNumber Variable           0.5516               18.54
MinusSign                       0.3205               7.939
Variable                        0.4279               2.016
MinusSign Number                0.4073              -3.898
5.3
5.4
One additional possible study is to try other clustering techniques. In this work, we only applied k-means to discover student models. Other clustering algorithms, such as hierarchical agglomerative clustering and spectral clustering [19], have different properties and may be a better fit for the student model discovery task. In the future, we would like to explore this direction further with other clustering techniques.
5.5 Generality
6. RELATED WORK
have also been put toward comparing the quality of alternative student models. LFA automatically discovers student models, but is limited to the space of human-provided factors. SimStudent is less dependent on human-provided factors, but still needs some knowledge engineering effort in constructing the agent. Moreover, as we have shown in the experiments, the clustering-based algorithm is able to find KCs that are better than those found by SimStudent. Other works such as [16, 27] are less dependent on human labeling, but may suffer from challenges in interpreting the results. In contrast, the clustering-based approach has the benefit that the acquired KCs usually have a straightforward interpretation. Baffes and Mooney [2] apply theory refinement to the problem of modeling incorrect student behavior. Other systems [23, 3] use Q-matrices to find knowledge structure from student response data. Our approach also uses machine learning algorithms to discover student models. In addition to modeling student performance, we emphasize the interpretability of the models by adding content features to the clustering approach.
There has also been a considerable amount of research on using artificial intelligence and machine learning techniques to model human students. Langley and Ohlsson's [13] ACM applies symbolic machine learning techniques to automatically construct student models. Brown and Burton's [5] DEBUGGY and Sleeman and Smith's [20] LMS also make use of artificial intelligence tools to construct models that explain students' behavior in math domains. VanLehn's [25] Sierra models the impasse-driven acquisition of hierarchical procedures for multi-column subtraction from sample solutions. Research on models of high-level learning [12, 1, 22, 21, 24, 18] is also closely related to our work, but to the best of our knowledge has not been evaluated by fit to student learning curve data as we do in this work. In addition, most of these works took a more symbolic approach, while our algorithm is more statistically based.
Other research on creating simulated students [26, 7, 17] also shares some resemblance to our work. VanLehn [25] created a learning system and evaluated whether it was able to learn procedural bugs like real students. Biswas et al.'s [4] system learns causal relations from a conceptual map created by students. None of the above approaches except the SimStudent model discovery approach compared the system with learning curve data. To the best of our knowledge, ours is among the very few works that combine the two, whereby we use cognitive model evaluation techniques to assess the quality of a simulated learner.
7. CONCLUSION
8. ACKNOWLEDGEMENTS
9. REFERENCES
ABSTRACT
Open-ended educational tools can encourage creativity and
active engagement, and may be used beyond the classroom.
Being able to model and predict learner performance in such tools is critical for assisting the student and enabling tool refinement. However, open-ended educational domains typically allow an extremely broad range of learner input. As such, building the kind of cognitive models often used to track and predict student behavior in existing systems is challenging. In addition, the resulting large spaces of user input, coupled with comparatively sparse observed data, limit the applicability of straightforward classification methods. We address these difficulties with a new algorithm that combines Markov models, state aggregation, and
player heuristic search, dynamically selecting between these
methods based on the amount of available data. Applied
to a popular educational game, our hybrid model achieved
greater predictive accuracy than any of the methods alone,
and performed significantly better than a random baseline.
We demonstrate how our model can learn player heuristics
on data from one task that accurately predict performance
on future tasks, and explain how our model retains parameters that are interpretable to non-expert users.
Keywords
Educational games, user modeling
1. INTRODUCTION
Open-ended learning environments offer promises of increased engagement, deep learning, transfer of skills to new
tasks, and opportunities for instructors to observe the learning process. One example of such environments is educational games, where players have an opportunity to explore
and experiment with a particular educational domain [12].
However, many of these exciting potential applications require low-level behavioral models of how players behave. For
example, if we can predict that a player will struggle with
a particular concept, we could try to preempt this confusion with tutorials or choose specific levels designed to address those problems. Additionally, as forcing players to
complete an explicit knowledge test often breaks the game
flow and causes many players to quit, we could estimate a
player's knowledge of target concepts by predicting performance on test levels that are carefully designed to measure
understanding of those concepts. Finally, we might even
be able to compare user populations by examining models
learned from their data and hypothesize optimal learning
pathways for each population.
Accurate predictions of user behavior have been achieved in
existing educational software such as intelligent tutors [10,
9, 11]. However, we cannot directly apply such methods to
educational games for two reasons. First, educational games
often have very large state and action spaces. For instance,
a game involving building one of 10 different structures on each of 100 locations has a state space of size 10^100. Second, games
often increase engagement through the addition of game mechanics that are not directly linked to the main educational
objectives. One option is to use expert insight to define skills
and behavior associated with these skills for the educational
game. However, doing so can be extremely labor intensive:
for intelligent tutors for structured domains that often include activities labeled with skills, it has been estimated
that 200-300 hours of expert development are necessary to
produce one hour of content for intelligent tutors [4]. As
educational games are more open-ended, allowing students
to input a much wider variety of input compared to many
popular intelligent tutoring systems, we expect that tagging
and building structure models for them would be even more
time consuming than for structured topics such as Algebra.
Given these limitations, we would like a method requiring
minimal expert authoring, capable of inferring likely user behavior based on collected data. One popular approach with
these properties from the field of recommendation systems is
collaborative filtering [18, 21]. Collaborative filtering can be
effective with no expert authoring at all if there is enough
data; however, the large state space of many educational
games often results in high degrees of data sparsity. To
maintain accuracy in spite of such sparsity, there has been
an emergence of hybrid models that supplement collaborative filtering with limited context-specific information when
2. RELATED WORK
2.1 Educational Technology
There has been substantial research on predicting student
outcomes on tests. Some of these methods are based on dynamic assessment, an alternative testing paradigm in which
the student receives assistance while working on problems
[8, 13]. Intelligent Tutoring Systems (ITSs) include built-in
scaffolding and hinting systems, and are therefore an ideal
platform for studying dynamic assessment [10]. Studies have
shown that this data has strong predictive power. Feng et al.
show that 40 minutes of dynamic assessment in the ASSISTment system is more predictive of grades on an end-of-year
standardized test than the same amount of static assessment [9]. Feng et al. also showed that longitudinal dynamic assessment data is more effective at predicting standardized test scores for middle school students than short-term
dynamic assessment data [10]. Fuchs et al. showed that dynamic assessment data from third-grade students was useful
for predicting scores on far-transfer problem-solving questions [11]. These methods are useful for predicting student
outcomes on tests. However, we require much finer granularity for applications such as predicting how students will
respond to new levels without any training data or offering
just-in-time hints only when we predict the player is about
to make a particular type of move.
2.2 Collaborative Filtering
2.3
3. GAME DESCRIPTION
We will first describe the game used in our analysis. Refraction is an educational fractions game that involves splitting
lasers into fractional amounts. The player interacts with a
grid that contains laser sources, target spaceships, and asteroids, as shown in Figure 1. The goal is to satisfy the target
spaceships and avoid asteroids by placing pieces on the grid.
Some pieces change the laser direction and others split the
laser into two or three equal parts. To win, the player must
correctly satisfy all targets at the same time, a task that requires both spatial and mathematical problem solving skills.
Some levels contain coins, optional rewards that can be collected by satisfying all target spaceships while a laser of the
correct value passes through the coin.
4. PREDICTIVE TASK
Our objective is to predict player behavior in the educational game Refraction, similar to how student models in
intelligent tutoring systems can be used to predict student
input. We now define some of the notation we use in the rest of the paper. For a given level, our task is the following. Let S be the set of all possible game states on the level. A game state is a particular configuration of pieces on the board, independent of time. Each player i in a set of players P of a level goes through a series of game states. We are concerned with predicting the next substantive move class the player will try, so we preprocess the data to eliminate consecutive duplicate states, leaving us with the list of player i's states, S_{i,1}, ..., S_{i,m_i}. We define a set of collapsed states, C, and a collapse function mapping S → C. These are selected by the designer to reduce states to features of interest, as in Table 1. For s ∈ S, define succ(s) to be the set of collapsed states reachable in one action from s, i.e., succ(s) = {collapse(s′) | s′ is reachable in one move from s}. The predictive model M assigns a probability that the player will enter a collapsed state depending on his history. Given player i's
[Table 1 (excerpt): example collapsed-state feature values, e.g., "2 1/2, East"; "1 W-NS, 2 S-E, 1 N-E, 1 W-N"; "(none)"; "Benders: 1 S-E"; "0.0".]
5. METRICS
In this section we explain how we will evaluate the performance of our predictive model. Our aim is to build models
6.
Here, we describe the three portions of our hybrid predictive model and describe the conditions under which each
is used. Each individual method has different benefits and
drawbacks and is suitable at a different level of data. We use
a combination of them to keep all their benefits, giving us
good predictive power, interpretability, and generalizability.
At the end, we describe the full model in detail.
6.1 Markov
Collaborative filtering models, which search for similar players and use their data to predict the behavior of new players,
are an attractive approach for our problem space because
they are data-driven and model-free. There are a number of
methods for determining the similarity of two players. We
describe and compare two methods: a simple Markov model
with no knowledge of player history and a model with full
awareness of player history.
In the simple Markov model, we compute the probability of a state transition based only on the player's current state. To estimate these state transitions, we use our prior data, aggregating together any examples that start in the same initial state. To prevent the probability of a player from going to 0 when they make a transition that we have not seen before, we add a smoothing parameter r. With probability r, the player chooses between the possible successor states succ(S_{i,j-1}) randomly, and with the remaining 1 - r probability, the player moves according to the Markov model as outlined above. We empirically determine that the best performance is achieved with r = 0.3.
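The smoothed transition probability described above can be sketched as follows (our own naming; `counts` stands for the empirical transition counts from prior data):

```python
def transition_prob(counts, state, next_state, successors, r=0.3):
    """P(next_state | state) under the smoothed Markov model:
    with probability r the player picks uniformly among the legal
    successor states, otherwise follows the empirical distribution.

    counts[state][next_state] holds observed transition counts.
    """
    uniform = 1.0 / len(successors)
    observed = counts.get(state, {})
    total = sum(observed.values())
    # Fall back to the uniform distribution for unseen states.
    empirical = observed.get(next_state, 0) / total if total else uniform
    return r * uniform + (1.0 - r) * empirical
```

Because the uniform component carries probability mass r, even a transition never observed in the training data keeps a nonzero probability, which is the point of the smoothing.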
One weakness of this similarity metric is that it ignores player history. We also attempted other collaborative filtering models. For example, we could consider using only the transitions from other players with the same full history of moves on that level when issuing predictions, reverting to the Markov model if data is sparse. In the limit of infinite data, we would expect this model to outperform all others based only on the order of visits to game states. We found, however, that the performance of the second, history-aware model is worse than the performance of the
[Figure: compression of the Markov model as a function of the backoff threshold (0-14).]
6.2 State Aggregation
[Figure: proportion as a function of the number of players (100-1000).]
[Figure 4 plot: compression (0.64-0.74) as a function of b_a (0-19).]
Figure 4: Selection of b_a, the backoff parameter controlling when to consult transitions from all similar states instead of exact states.
\frac{e^{v(\mathrm{collapse}(S_{i,j}))}}{\sum_i e^{v(C_i)}} \qquad (1)
6.3 Player Heuristic Search
Both of the previously described models have certain advantages. They both require minimal designer input: state
space and transition functions for the Markov model, and a
similarity function for the aggregation model. Both models
also only improve as more data is gathered. Unfortunately,
these methods also have two significant drawbacks: they
perform poorly when there is very little data available, and
they have parameters that are difficult to interpret. An ed-
We optimize the weights to maximize the log-likelihood of the data using Covariance Matrix Adaptation Evolution Strategy (CMA-ES), an optimizer designed to run on noisy functions with difficult-to-compute gradients [14]. Our algorithm differs from the original in that both the possible successor states and the state that the heuristic operates on are collapsed states, since we want to predict the general type of move players will make rather than their exact move. As before, we also introduce a backoff parameter b_h. When searching for transitions from aggregated players, if there are fewer than b_h data points, we switch from the Markov-with-aggregation model to the heuristic model. Empirically, we find that the optimal value is b_h = 4.
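Equation (1) is a softmax over heuristic values of the candidate collapsed states; a numerically stable sketch (our own naming):

```python
import math

def heuristic_probs(successor_values):
    """Softmax over heuristic values v(c) of candidate collapsed
    successor states: P(c) = exp(v(c)) / sum_c' exp(v(c'))."""
    # Subtract the max value before exponentiating for stability.
    m = max(successor_values.values())
    exps = {c: math.exp(v - m) for c, v in successor_values.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```

A successor whose heuristic value is higher receives proportionally more probability mass, while every legal successor keeps a nonzero probability.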
The base heuristics a_1, ..., a_n are designer-specified, and should reflect the components of the game that players pay attention to while choosing between moves. The heuristics we use for Refraction are listed in Table 2. In practice, the selection of heuristics is closely related to the definition of the collapse function used to project raw game states in the prediction task, since both are chosen according to the game features that the designer views as important.
6.4
Tying all the components together, we now provide a description of the full hybrid prediction model for Refraction,
which combines the simple Markov model, state aggregation,
and player heuristic search.
1. Assume a player is currently in state s_a.
2. Consult the Markov model for all other players' transitions to collapsed states from state s_a. If there are b_a or more transitions, predict the observed distribution with the random-action smoothing parameter r.
3. Otherwise, apply the state aggregation function to all the nodes in the graph, and count all transitions from all states with collapsed value collapse(s_a). If there are b_h or more transitions, take the observed distribution, remove any states impossible to reach from s_a, and predict the resulting distribution smoothed by r.
4. Otherwise, apply the heuristic with parameters learned from the training set to each of the successors, using Equation (1) to get the probability of each transition.
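Steps 1-4 can be sketched as a backoff chain over the three components (our own naming; only the selection logic is shown, with smoothing and normalization omitted; the b_a default is a placeholder, while b_h = 4 follows the text):

```python
def hybrid_predict(markov_counts, aggregated_counts, heuristic,
                   state, b_a=9, b_h=4):
    """Backoff scheme of the hybrid model: use exact-state Markov
    transitions when enough data exists, fall back to aggregated
    (collapsed-state) transitions, and finally to the learned
    heuristic when both are too sparse. Returns the component used
    and its raw prediction."""
    # Step 2: enough exact-state data -> simple Markov model.
    exact = markov_counts.get(state, {})
    if sum(exact.values()) >= b_a:
        return "markov", exact
    # Step 3: enough aggregated data -> Markov with state aggregation.
    aggregated = aggregated_counts.get(state, {})
    if sum(aggregated.values()) >= b_h:
        return "aggregation", aggregated
    # Step 4: otherwise fall back to the learned heuristic.
    return "heuristic", heuristic(state)
```

This makes the trade-off in the text concrete: as more players are observed, the data-hungry Markov branch fires in more and more situations, while the heuristic handles the sparse tail.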
7. EXPERIMENTS
The gap between Markov, Markov with state aggregation, and the full backoff model narrows as the amount of data increases. As we gather more players, the amount of probability mass on players in uncommonly visited states shrinks, so the Markov model is used to predict player behavior in more and more situations.
While the heuristic portion of the model seems to offer only
incremental improvements, its true power can be seen when
we attempt to generalize our models to future levels, as
shown in Figure 5(b). Using 1000 players, we first learn
heuristic parameters from level 8. We then use the learned
heuristic to predict player behavior on levels 9, 10, and 11,
comparing these to a Markov with state aggregation model
trained on level 8. To get a sense of what good performance might look like, we also train player search heuristics
and full models learned on the transfer levels and evaluate their compression values with 5-fold cross-validation as
usual. We note some features of the graph here.
We see immediately that the Markov-with-aggregation portion of the model has no generalization power at all. The state space and the possible transitions via succ are completely different on future levels, so it is impossible to find similar players and use their transitions to predict moves later on.
The heuristic portion of the model, on the other hand,
allows it to predict what players will do in future levels.
When compared to full models fit directly to player
data from those levels, it is very good at predicting
behavior on level 9, somewhat good at predicting behavior on level 10, and not very good at predicting
behavior on level 11. Educational games are explicitly
designed to teach players new and better strategies as
they play, so we would expect performance to decrease
over time.
In addition, we can see that by level 11 even a heuristic trained on player data from that level is losing power. This means that the features of the state space players pay attention to are no longer captured by the component heuristic functions a_1, ..., a_n. As the game introduces new concepts such as compound fractions, equivalent fractions, and fraction addition, players need to pay attention to more features of the state space than are represented in our choice of a_1, ..., a_n. This speaks to the importance of choosing these component heuristics for the method's performance.
We caution that the generalization power of our model in
these open-ended learning domains can only reasonably be
expected to be high for the next few tasks and will be poor
if those tasks have radically different state spaces from the
training tasks. These caveats notwithstanding, these are
promising results that suggest the learned heuristic captures something fundamental about how players navigate
the search space of a Refraction level. This might allow designers to guess how players at a certain point in time will
behave on levels without needing to release updated versions
of the game, or allow educators to simulate and evaluate user
performance on assessment levels without needing to interrupt player engagement by actually giving these tasks.
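Concretely, move prediction with a heuristic of this kind can be sketched as a softmax over a weighted sum of component heuristics. This is an illustrative reconstruction, not the paper's code: `heuristic_value`, `move_probabilities`, and the toy component functions are assumed names layered on the description of the components a1, ..., an and their learned weights.

```python
# A minimal sketch (assumed, not the authors' implementation) of predicting a
# player's next move from a weighted linear combination of component heuristics.
import math

def heuristic_value(state, weights, components):
    """Weighted sum of component heuristic functions a_1..a_n over a state."""
    return sum(w * a(state) for w, a in zip(weights, components))

def move_probabilities(candidate_states, weights, components, temperature=1.0):
    """Softmax over heuristic values: higher-valued successor states are more
    likely to be the player's next move."""
    scores = [heuristic_value(s, weights, components) / temperature
              for s in candidate_states]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy components, purely illustrative (not features from the paper):
components = [lambda s: s["satisfied"],       # e.g. sub-goals satisfied
              lambda s: -s["pieces_used"]]    # e.g. penalize piece count
weights = [1.5, 0.4]

states = [{"satisfied": 2, "pieces_used": 3},
          {"satisfied": 1, "pieces_used": 2}]
probs = move_probabilities(states, weights, components)
```

Because the learned weights enter linearly, inspecting them directly is what makes the heuristic human-readable.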
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
111
[Figure: compression (y-axis, 0 to 1.2) vs. number of players (x-axis, 10 to 1280) for the Full, Markov, Markov+Aggregation, Heuristic, and Random models. (a) Performance. (b) Generalization.]
Figure 5: Performance and generalization power of our model. The full model is superior to the other models
at most amounts of data. In addition, the full model learned from level 8 data is able to generalize to predict
behavior in future levels due to the heuristic component.
[Table: the two player populations studied, Kongregate and BrainPOP]
8.
INTERPRETABILITY
These analyses show that the learned parameters in our hybrid model can be valuable tools for game designers, educators, and researchers for analyzing how populations use
their systems. For instance, because Kongregate players are
primarily adults and BrainPOP players are primarily children, we might wonder if children have more difficulty understanding piece directionality and spatial reasoning and
plan their moves less carefully than adults do. A researcher
might attempt to generalize these results to other strategic
tasks, while a game designer might create a different version
of Refraction with easier levels, fewer pieces, and clearer
graphics for children. Either way, the learned parameters
are a useful tool to help understand how players behave.
9.
CONCLUSION
Predicting player behavior in open-ended learning environments is an interesting and complex problem. This ability
could be used for a host of automatic applications to bolster
engagement, learning, or transfer. In this paper, by using a
combination of data-driven and model-based approaches, we
presented a best-of-all-worlds model able to predict player
behavior in an educational game. First, our hybrid model's performance is better than any individual component's. Second, the learned weights of the sub-heuristics are human-readable and can give insights into how players behave. We
used these parameters to formulate hypotheses about how
two populations behave differently and confirmed them with
strong statistical results. Finally, we demonstrated how the
heuristic portion of the model allows us to generalize and
predict how players will behave on levels in which we have
no data at all, opening the door to many adaptive applications involving problem ordering and choice.
There are many possible avenues for future work. On a
lower level, we could use more powerful collaborative filtering models taking advantage of timestamps in order to
find similar players. Automatic generation of state aggregation functions and autotuning the ba and bh parameters
would remove the need for some expert authoring. On a
higher level, trying the same method on other open-ended
educational environments, not necessarily games, could tell
us how well the method generalizes. Using the model for applications such as dynamic hinting systems that intervene just when we predict players will quit or make egregious errors could increase player engagement and learning. Finally, the ability
to estimate behavior on future, unseen problems could be
used to increase transfer by selecting tasks which specifically target incorrect strategies or concepts we believe the player holds, as reflected in the heuristics they use.
10.
ACKNOWLEDGMENTS
11.
REFERENCES
baker2@exchange.tc.columbia.edu
{zhongxiuliu,vpatara,jocumpaugh}@wpi.edu
ABSTRACT
Keywords
Affect, confusion, frustration, affect sequences, affect detection,
learning outcomes, discovery with models, affective persistence
1. INTRODUCTION
Affect has become an area of considerable interest within research
on interactive learning environments [1, 10, 11, 18, 23]. Though
findings relating boredom and engaged concentration to learning have largely accorded with prior hypotheses, there have been
surprising patterns of results for other affective states, with
unstable effects for confusion between studies and often no effects
for frustration [7, 21].
However, many of these early studies investigated overall
proportions of affective states, rather than considering the
potential differential impacts of affect manifesting in different
ways. It may be important to consider the multiple ways a specific
affective state can manifest, especially considering that there can
be considerable variance in how long an affective state lasts [8],
affect may be influenced by behavior and vice-versa [3, 5], and some affective states may not be unitary in nature (for example, [12] refers to "pleasurable frustration," which is presumably different in nature from the non-pleasurable frustration often
2. METHODS
2.1 Tutor Studied
The learning system used in this study was Cognitive Tutor
Algebra I, an interactive learning environment now used by
approximately 500,000 students a year in the USA. The students
Table: Relative frequencies (%) of three-step confusion sequences (N = not confused, C = confused):
NNN 93.78, NNC 1.91, NCN 1.74, NCC 0.23, CNN 1.84, CNC 0.09, CCN 0.23, CCC 0.16
3. RESULTS
3.1 Duration of Affect and Learning Gains
In this section, we compare the relative frequency of sequences of
confusion and frustration to assessments of gains in student
learning over time. Learning gains are computed as post-pre; the
alternate metric of (post-pre)/(1-pre) is difficult to interpret when
some students obtain pre-test scores of 100%, which were seen in
this data set. In order to understand the importance of individual patterns, we apply separate significance tests for each pattern (with post-hoc controls as discussed below), rather than building a unitary model to predict learning gains from a student's combined set of sequences.
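The per-pattern tests and post-hoc controls can be sketched concretely. The cutoff values reported in the tables (e.g. 0.00625 = 0.05 x 1/8) are consistent with a Benjamini-Hochberg ladder over m = 8 sequence tests, with alpha = 0.05 for significance and, by assumption here, alpha = 0.10 for marginal significance:

```python
# Sketch of the post-hoc control implied by the tables' cutoff columns
# (assumed Benjamini-Hochberg; [6] in this paper's reference list).
def bh_cutoffs(m, alpha):
    """The i-th smallest p-value is compared against alpha * i / m."""
    return [alpha * (i + 1) / m for i in range(m)]

def bh_reject(p_values, alpha=0.05):
    """Indices of tests rejected under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoffs = bh_cutoffs(m, alpha)
    k = -1  # largest rank whose p-value falls under its cutoff
    for rank, idx in enumerate(order):
        if p_values[idx] <= cutoffs[rank]:
            k = rank
    return {order[r] for r in range(k + 1)} if k >= 0 else set()

sig = bh_cutoffs(8, 0.05)       # 0.00625, 0.0125, ..., 0.05
marginal = bh_cutoffs(8, 0.10)  # 0.0125, 0.025, ..., 0.1
```

Applying `bh_reject` to the eight confusion p-values reported below yields the empty set, matching the paper's conclusion that no effect survives the control.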
Results for confusion diverged considerably from what might be
predicted based on previous research. As shown in Table 4, only three of the eight possible confusion sequences showed even marginal significance when correlated with learning gains, and all of these effects disappeared after post-hoc controls were applied. That is, contrary to theoretical predictions [10, 22], and the interpretation of the findings in [17], differences in sequences of the affective state of confusion do not appear to be associated with learning gains in this data.
Table 4. Correlations between three-step confusion sequence frequencies and learning gains:

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
NNC    0.210   0.054  0.00625          0.0125
CNC    0.198   0.070  0.0125           0.025
NNN   -0.181   0.097  0.01875          0.0375
NCN    0.179   0.101  0.025            0.05
CNN    0.157   0.151  0.03125          0.0625
NCC    0.149   0.173  0.0375           0.075
CCN    0.131   0.231  0.04375          0.0875
CCC   -0.049   0.654  0.05             0.1

Table: Relative frequencies (%) of three-step frustration sequences (N = not frustrated, F = frustrated):
NNN 96.20, NNF 1.16, NFN 1.09, NFF 0.14, FNN 1.15, FNF 0.08, FFN 0.14, FFF 0.04

Table: Relative frequencies (%) of three-step sequences of any tracked affect (A):
NNN 90.25, NNA 2.94, NAN 2.70, NAA 0.41, ANN 2.86, ANA 0.20, AAN 0.40, AAA 0.24
Table: Correlations between three-step frustration sequence frequencies and learning gains:

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
NFN    0.273   0.011  0.00625          0.0125
NNN   -0.262   0.016  0.0125           0.025
NNF    0.250   0.021  0.01875          0.0375
FNN    0.248   0.022  0.025            0.05
FNF    0.208   0.056  0.03125          0.0625
FFF    0.174   0.111  0.0375           0.075
NFF    0.136   0.215  0.04375          0.0875
FFN    0.136   0.215  0.05             0.1

Table: Correlations between three-step frustration sequence frequencies and pre-test scores (3-step – pre):

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
NNN    0.277   0.010  0.00625          0.0125
NNF   -0.273   0.011  0.0125           0.025
FNN   -0.270   0.012  0.01875          0.0375
NFN   -0.267   0.014  0.025            0.05
NFF   -0.231   0.033  0.03125          0.0625
FFN   -0.231   0.033  0.0375           0.075
FNF   -0.125   0.253  0.04375          0.0875
FFF   -0.020   0.854  0.05             0.1
Table: Correlations between three-step confusion sequence frequencies and pre-test scores:

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
NCC   -0.295   0.006  0.00625          0.0125
CCN   -0.283   0.009  0.0125           0.025
NNC   -0.260   0.016  0.01875          0.0375
NNN    0.255   0.018  0.025            0.05
CNN   -0.226   0.037  0.03125          0.0625
NCN   -0.195   0.074  0.0375           0.075
CNC   -0.161   0.141  0.04375          0.0875
CCC   -0.005   0.967  0.05             0.1
[Tables labeled "3-step – diff" are heavily damaged here; recoverable entries include: NNN/NCC with r = 0.157/-0.155; r = -0.147 (p = 0.180); NNA r = 0.068 (p = 0.539); CNN r = -0.064 (p = 0.561); CCC r = -0.061 (p = 0.579); CNC r = 0.052 (p = 0.635); NNC r = -0.040 (p = 0.716); NCN r = -0.005 (p = 0.966)]

Table: Correlations between three-step any-affect (A) sequence frequencies and learning gains:

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
NAN    0.295   0.006  0.00625          0.0125
NNA    0.284   0.008  0.0125           0.025
NNN   -0.279   0.010  0.01875          0.0375
ANN    0.262   0.015  0.025            0.05
ANA    0.213   0.050  0.03125          0.0625
NAA    0.204   0.061  0.0375           0.075
AAN    0.190   0.081  0.04375          0.0875
AAA    0.010   0.931  0.05             0.1
Table: Correlations for three-step frustration sequence frequencies (outcome heading lost in extraction):

Seq    r       p      p cutoff (sig)   p cutoff (marginal)
FFF    0.177   0.106  0.00625          0.0125
FNF    0.102   0.351  0.0125           0.025
NFF   -0.093   0.396  0.01875          0.0375
FFN   -0.093   0.396  0.025            0.05
NFN    0.025   0.822  0.03125          0.0625
NNF   -0.009   0.937  0.0375           0.075
FNN   -0.008   0.946  0.04375          0.0875
NNN    —       1.000  0.05             0.1
5. ACKNOWLEDGMENTS
This research was supported by the grant "Toward a Decade of PSLC Research: Investigating Instructional, Social, and Learner Factors in Robust Learning through Data-Driven Analysis and Modeling," National Science Foundation award #SBE-0836012. We would like
6. REFERENCES
[1] Arroyo, I., Cooper, D., Burleson, W., Woolf, B. 2010. Bayesian Networks and Linear Regression Models of Students' Goals, Moods, and Emotions. Handbook of Educational Data Mining (Oct. 2010). Taylor and Francis Group, London, UK, 323.
[2] Baker, R.S.J.d., Corbett, A.T., Wagner, A.Z. 2006. Human Classification of Low-Fidelity Replays of Student Actions. Proceedings of the Educational Data Mining Workshop at the 8th International Conference on Intelligent Tutoring Systems (Jhongli, Taiwan, June 26-30, 2006), 29-36.
[3] Baker, R.S.J.d., D'Mello, S., Rodrigo, M., Graesser, A. 2010. Better to be frustrated than bored: The incidence and persistence of affect during interactions with three different computer-based learning environments. International Journal of Human-Computer Studies, 68, 4 (Dec. 2010). Elsevier B.V., Oxford, UK, 223-241.
[4] Baker, R.S.J.d., Gowda, S.M., Wixon, M., Kalka, J., Wagner, A.Z., Salvi, A., Aleven, V., Kusbit, G., Ocumpaugh, J., Rossi, L. 2012. Sensor-free automated detection of affect in a Cognitive Tutor for Algebra. Proceedings of the 5th International Conference on Educational Data Mining (Chania, Greece, June 19-21, 2012), 126-133.
[5] Baker, R.S.J.d., Moore, G., Wagner, A., Kalka, J., Karabinos, M., Ashe, C., Yaron, D. 2011. The Dynamics Between Student Affect and Behavior Occurring Outside of Educational Software. Proceedings of the 4th bi-annual International Conference on Affective Computing and Intelligent Interaction (Memphis, TN, Oct. 9-16, 2011).
[6] Benjamini, Y., Hochberg, Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57, 1 (1995), London, UK, 289-300.
[7] Craig, S., Graesser, A., Sullins, J., Gholson, B. 2004. Affect and Learning: An Exploratory Look into the Role of Affect in Learning with AutoTutor. Journal of Educational Media, 29, 3 (Oct. 2004). Taylor & Francis, London, UK, 241-250.
[8] D'Mello, S.K., Graesser, A.C. 2011. The Half-Life of Cognitive-Affective States during Complex Learning. Cognition and Emotion, 25, 7 (2011). Taylor and Francis Group, London, UK, 1299-1308.
[9] D'Mello, S.K., Lehman, B., Pekrun, R., Graesser, A.C. In Press. Confusion Can Be Beneficial For Learning. Learning and Instruction. Elsevier B.V., Oxford, UK.
[10] D'Mello, S.K., Person, N., Lehman, B.A. 2009. Antecedent-Consequent Relationships and Cyclical Patterns between Affective States and Problem Solving Outcomes. Proceedings of the 14th International Conference on Artificial Intelligence in Education (Brighton, UK, July 6-10, 2009), 57-64.
[11] Forbes-Riley, K., Litman, D. 2009. Adapting to Student Uncertainty Improves Tutoring Dialogues. In Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling (Brighton, UK, July 6-10, 2009), 33-40.
[12] Gee, J.P. 2007. Good video games + good learning: Collected essays on video games, learning, and literacy
ABSTRACT
Large amounts of data are generated while students interact with
computer based learning systems. These data can be analysed
through data mining techniques to find patterns or train models
that can help tutoring systems or teachers to provide better
support. Yet, how can we exploit students' data when they perform small-group face-to-face activities in the classroom? We
propose a novel approach that aims to address this by discovering
the strategies followed by students working in small-groups at a
multi-tabletop classroom. We apply two data mining techniques,
sequence and process mining, to analyse the actions that
distinguish groups that needed more coaching from the ones that
worked more effectively. To validate our approach we analysed
data that was automatically collected from a series of authentic
university tutorial classes. The contributions of this paper are: i)
an approach to mine face-to-face collaboration data unobtrusively
captured at a classroom with the use of multi-touch tabletops,
and ii) the implementation of sequence mining and process
modelling techniques to analyse the strategies followed by
groups of students. The results of this research can be used to
provide real-time or after-class indicators to students; or to help
teachers effectively support group learning in the classroom.
Keywords
Collaborative Learning, Sequence Mining, Process Mining,
Interactive Tabletop, Classroom
1. INTRODUCTION
Collaborative face-to-face activities can offer particular
advantages compared to computer-mediated group work [17].
These include a natural channel for both verbal and non-verbal
communication, improved perception of quality of group
discussions, and an increased productivity in completing tasks
[17, 18]. The classroom is a common environment in which the
teacher can foster face-to-face collaboration skills acquisition by
making use of small-group activities [8]. However, even in small-group activities, it is challenging for teachers to give students the attention that they may require and to be aware of the process followed by each group or their individual contributions [21]. Commonly, teachers try to identify the groups that work effectively so they can let them work more independently and devote time to the groups needing their attention.
Multi-user shared devices, such as interactive tabletops, provide
an enriched space where students can communicate face-to-face
with each other and, at the same time, interact with a large work
area that has access to digital content and allows the creation of
persistent artefacts [14]. Interactive tabletops may afford new
possibilities to support learning but they also introduce
2. RELATED WORK
The use of tabletops in education is growing steadily. More specifically, a number of research projects have used multiple tabletops or shared devices in the classroom.
One of these is Synergynet [1], a multi-tabletop setting that has
served to study the ways school kids collaborate and interact to
achieve group goals. This project also included the design of
tools for the teacher to control the classroom activities. Another
approach was proposed by Do Lenh [5], who developed a setting for training on logistics that consisted of four tangible horizontal devices that could be orchestrated by the teacher using paper-based commands or through a remote computer. This project also offered minimalist indicators of the progress of each small group
presented at a wall display. Even though these two previous
projects included real students and teachers, they were mostly
designed and deployed as experimental scenarios. A different
approach was followed by Martinez-Maldonado et al. [10], who
presented a multi-tabletop system that permitted teachers to
assess the design and enactment of their planned classroom
activities through the use of analytics tools. This is the only
previous work that has focused on exploiting the collected data
from a multi-shared device environment to describe the activities
that occur in an authentic classroom.
In the case of data mining applied to collaborative settings, the
closest study to ours was presented by Martinez-Maldonado et al.
[12]. It consisted of extracting and clustering frequent sequential patterns and then linking them with high-level group actions in a pen-based tabletop learning application called Mysteries. One
important study, even though not related to tabletops, was
performed by Perera et al. [20] who explored the usage of
sequence mining alphabets and clustering to find trends of
interaction associated with effective group-work behaviours in
the context of a software development tool. Moreover, Anaya et
al. [2] analysed a computer-mediated learning tool to classify and
cluster learners according to their level of collaboration.
The work reported in this paper is the first effort we are aware of that proposes an integrated solution, inspired by the authentic needs of teachers in the classroom, to exploit the students' data that can be captured by multiple tabletops through the application of a data mining technique and a process modelling tool.
3. MULTI-TABLETOP TUTORIALS
This section describes our technical infrastructure, which consists of: the multi-tabletop classroom, a teacher's dashboard, the system for capturing identified learners' actions, and a learning tool for building concept maps. We also describe the teacher's design of the tutorials.
5. METHOD

Table 2. Keywords included in the alphabets for the sequential pattern mining.
Alphabet 1 — Resource: Concept (Conc) -C; Link -L; Menu -M. Action type: Add -C,L; Edit (Chg) -C,L; Move -C,L,M; Delete (Del) -C,L; Merge (Move) -L; Open -M; Close -M; Short (Shrt) -B; Long -B. Parallelism / turn taking: Parallel; Other; Same.
Alphabet 2 — Actions on others' objects: Own; NoOwn.
Alphabet 3 — Master map distance: Cruc (C,L); NoCruc (C,L).
[Figure: distribution of the number of actions in a block of activity (x-axis 0-50); annotated values include 71% and 16%]
Table 3. Top-4 most frequent sequences after applying differential sequence mining on each encoded dataset.
Alphabet 1
High achieving groups
Low achieving groups
A- {Menu-Mov-Same}>{Menu-Mov-Same}>{Menu-Mov-Parallel}
B- {Con-Mov-Other}>{Link-Add-Same}>{Con-Mov-Same}>
{Link-Add-Same}
C- {Inact-Shrt}>{Con-Mov-Other}>{Link-Add-Same}
D- {Con-Mov-Other}>{Link-Add-Same}>{Con-Mov-Same}
Alphabet 2
F- {Link-Rem-Same}>{Con-Mov-Same}>{Link-Add-Same}
G- {Link-Add-Same}>{Link-Chg-Same}>{Inact-Long}
H- {Inact-Long}>{Inact-Shrt}>{Con-Mov-Same}
I- {Con-Mov-NoOwn}>{Con-Mov-NoOwn}>{Link-Add-Own}>{Inact-Shrt}
J- {Inact-Shrt}>{Con-Mov-NoOwn}>{Con-Mov-NoOwn}>{Link-Add-Own}
K- {Link-Mov-NoOwn}>{Link-Mov-NoOwn}>{Con-Mov-NoOwn}
L- {Inact-Shrt}>{Con-Mov-NoOwn}>{Con-Mov-NoOwn}
Alphabet 3
E- {Link-Add-Same}>{Link-Rem-Same}>{Con-Mov-Same}
Q- {Con-Mov-Cruc}>{Link-Add-Cruc}>{Con-Mov-Cruc}>
{Link-Add-Cruc}
R- {Inact-Shrt}>{Con-Mov-Cruc}>{Con-Mov-Cruc}>{Link-Add-Cruc}
S- {Link-Add-Cruc}>{Link-Mov-Cruc}>{Con-Mov-Cruc}
T- {Link-Chg-Irr}>{Con-Mov-Cruc}>{Link-Add-Cruc}
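The encoded sequences above lend themselves to a simple frequency-based sketch of the mining step. The function names and the toy encoding below are illustrative assumptions, not the authors' implementation (which extracts and clusters frequent sequential patterns):

```python
# Illustrative sketch: encode raw tabletop actions into Alphabet-1 style
# symbols and compare the relative frequency of short subsequences between
# high- and low-achieving groups, in the spirit of differential sequence mining.
from collections import Counter
from itertools import islice

def encode(action):
    """Map a raw logged action to a symbol such as {Con-Mov-Same}."""
    return "{%s-%s-%s}" % (action["resource"], action["type"], action["turn"])

def ngrams(symbols, n=3):
    """All length-n contiguous subsequences of a symbol list."""
    return zip(*(islice(symbols, i, None) for i in range(n)))

def pattern_freqs(sessions, n=3):
    """Relative frequency of each length-n pattern over a group's sessions."""
    counts, total = Counter(), 0
    for session in sessions:
        for gram in ngrams([encode(a) for a in session], n):
            counts[gram] += 1
            total += 1
    return {g: c / total for g, c in counts.items()} if total else {}

def differential(high, low, n=3):
    """Patterns ranked by how much their frequency differs between groups."""
    hf, lf = pattern_freqs(high, n), pattern_freqs(low, n)
    keys = set(hf) | set(lf)
    return sorted(keys, key=lambda g: abs(hf.get(g, 0) - lf.get(g, 0)),
                  reverse=True)

# Tiny toy groups: each group repeats one kind of action.
high = [[{"resource": "Con", "type": "Mov", "turn": "Same"}] * 5]
low = [[{"resource": "Link", "type": "Add", "turn": "Same"}] * 5]
ranked = differential(high, low)
```

A pattern that tops this ranking is one that characterizes one group much more than the other, which is the kind of contrast Table 3 summarizes.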
[Figure 4 diagrams: fuzzy process models with nodes such as Inact-Short, Inact-Long, HighLow-Owner, HighLow-NoOwner, HighOnly-Owner, HighOnly-NoOwner, LowOnly-Owner, LowOnly-NoOwner, NoImpact-Owner and NoImpact-NoOwner, annotated with transition significances and per-node rates of one, two, or more users (1u/2u/+u)]
Figure 4. Fuzzy models generated from the groups' activity. Left: fuzzy model of high achieving groups (conformance: 86%, cutoff: 0.1). Right: fuzzy model of low achieving groups (conformance: 81%, cutoff: 0.1).
Next, we present the analysis of the number of students involved
in the activities and the validation to determine if the observable
differences can distinguish high from low achieving groups.
Active learners. Table 4 shows the results of the cumulated
distribution of the number of learners involved in the periods of
activity for both high and low achieving groups (partial rates
displayed in the third line of text inside each node of Figure 4).
In both high and low achieving groups, more than half of the blocks of activity were performed by a single student (55% and 54% respectively). The main difference found was that high achieving groups presented more blocks of activity in which more than two learners were involved (+u, 27% versus 19%). In low achieving groups, most of the blocks of activity were performed by either one or two learners.
Table 4. Distribution of the number of active learners in blocks of activity

Achievement   One learner (1u)   Two learners (2u)   More learners (+u)
High          55%                18%                 27%
Low           54%                27%                 19%

[Additional flattened table, context lost: High/Low counts 17, 0, 3, 12]
8. REFERENCES
{lmille, lksoh}@cse.unl.edu
ABSTRACT
Supervised learning (SL) systems have been used to automatically
learn models for analysis of learning object (LO) data. However,
SL systems have trouble accommodating data from multiple distributions and troublesome data that contains irrelevant features or noise, all of which are relatively common in highly diverse LO data. One solution is to break up the available data
into separate areas and then take steps to improve models on areas
containing troublesome data. Unfortunately, finding these areas
in the first place is a far from trivial task that balances finding a
single distribution with having sufficient data to support
meaningful analysis. Therefore, we propose a BoU metareasoning (MR) algorithm that first uses semi-supervised
clustering to find compact clusters with multiple labels that each
support meaningful analyses. After clustering, our BoU MR
algorithm learns a separate model on each such cluster. Finally,
our BoU MR algorithm uses feature selection (FS) and noise
correction (NC) algorithms to improve models on clusters
containing troublesome data. Our experiments, using three
datasets containing over 5000 sessions of student interactions with
LOs, show that multiple models from BoU MR achieve more
accurate analyses than a single model. Further, FS and NC
algorithms are more effective at improving multiple models than a
single model.
Keywords
Learning Object Analysis; Supervised Learning; Clustering;
Meta-Reasoning
1. INTRODUCTION
Learning objects (LOs) are independent and self-standing units of learning content that are predisposed to reuse in multiple instructional contexts [2]. An example of an LO is a self-contained lesson on recursion with a tutorial, interactive exercises, and assessment questions. In general, the analysis of student
interactions with LOs is important for many groups including
students, instructors, researchers, and content developers [16].
First, for students, such analyses can improve student study
strategies and allow for more self-regulated learning [1]. Second,
instructors can use such analyses to choose appropriate LOs for
their students [8]. Third, such analyses can help researchers and
content developers investigate which student interactions are
associated with the different learning outcomes [6].
One previously used approach for the analysis of student
interactions with LOs is supervised learning (SL) systems [16].
SL systems learn a model from previously recorded sessions of
student interactions (features) and learning outcomes (labels) that
can predict the learning outcome for a specific session of student
interactions with a high degree of accuracy.
SL systems have one main advantage over other approaches (e.g.,
statistical analysis): they learn the model automatically without
the need for direct human intervention. First, learning the model
can help students and instructors. Such a model predicts the
learning outcome for a student in real-time based on the observed
student interactions [15]. Such predictions can allow the LO to
adjust the content presented to a student while he or she is taking
the LO and provide real-time updates to instructors on student
mastery of LO content. Second, a model learned automatically
without human intervention provides independent, high-level
guidelines on which types of student interactions are associated
with the learning outcomes [4]. Such guidelines can serve as a
useful starting point for further investigation by researchers and
inform content developers on which parts of the LO may need to
be revised.
However, SL systems have some potential problems which can
limit the effectiveness of their models for analysis:
First, SL systems assume that the training data from previously
recorded sessions comes from a single underlying distribution.
Unfortunately, such training data is likely to come from multiple
distributions and be highly diverse due to a wide variety of factors
including students with different backgrounds, LOs with different
content, instructors providing varying amounts of support for the
LO content, etc. These factors make it difficult to learn a single
model which can fit all this highly diverse training data and
still achieve high accuracy.
Second, SL systems assume that the available training data is relatively clean, being free of student interactions unrelated to the learning outcomes (i.e., irrelevant features) and errors in the student interactions and learning outcomes provided (i.e., noise).
Unfortunately, such training data is all too likely to contain both
irrelevant features and noise. Irrelevant features are relatively
common when researchers are uncertain which student
interactions are relevant and, thus, record as many student
interactions as possible since they cannot retroactively record
additional interactions.
Noise is relatively common when
developers fail to create assessment questions appropriate for all
students and when students are motivated to game the system to
2. BACKGROUND
3. METHODOLOGY
Here we discuss the basic process and equations used for creating
the clusters and deciding whether they are correct or incorrect.
First, to make use of the BoU notion of clusters to find clusters
with a single distribution that support meaningful analysis, we use
a semi-supervised clustering (SSC) algorithm [9] to cluster the
training data. Briefly, SSC algorithms create clusters based on
both similarity in the training data and additional information
available on how the session instances should be clustered (e.g.,
constraint that two instance must/cannot be clustered together).
For our purposeto find BoU clusters, the additional information
that we incorporate for each session instance is whether or not the
model predicts the label correctly or incorrectly. The actual SSC
algorithm used is based on the k-Means variant discussed in Kulis
et al. [9]. The modified objective function for BoU-style clusters
can be expressed as:
[Eq. 1: the modified k-Means objective; equation lost in extraction] where the (lost) symbols denote the cluster under consideration, the cluster member, and the model prediction.
[Eq. 2: the localized estimate; equation lost in extraction]
[Eq. 3: the cluster-typing rule; equation lost in extraction] where the quantities involved are the localized estimate (Eq. 2) and the purity threshold parameter for the confidence interval. Eq. 3 is based on work in Dasgupta & Hsu [5], where clusters are evaluated using confidence intervals on the correctly-labeled member data to decide whether to request further labels for the member data.
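A minimal sketch of this cluster-typing step follows. Because Eqs. 2-3 did not survive in this copy, the exact form is an assumption: localized purity is taken as the fraction of members whose label the current model predicts correctly, and a normal-approximation confidence interval around it decides whether the cluster counts as "correct":

```python
# Assumed reconstruction of the purity-based cluster typing (not the
# authors' exact equations).
import math

def localized_purity(members, model):
    """Fraction of (x, y) cluster members the current model predicts correctly."""
    correct = sum(1 for x, y in members if model(x) == y)
    return correct / len(members)

def cluster_type(members, model, threshold=0.8, z=1.96):
    """Type a cluster 'correct' if the purity CI clears the threshold."""
    p = localized_purity(members, model)
    half_width = z * math.sqrt(p * (1 - p) / len(members))
    if p - half_width >= threshold:
        return "correct"
    return "incorrect"  # candidate for feature selection / noise correction

# Toy model and clusters: the model predicts label 1 for everything.
model = lambda x: 1
pure = [((i,), 1) for i in range(200)]       # every member predicted correctly
mixed = [((i,), i % 2) for i in range(200)]  # only half predicted correctly
```

Clusters typed "incorrect" are the ones the hierarchy continues to split and refine.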
Finally, we also create a hierarchy of BoU clusters. The BoU uses
a hierarchical, top-down approach that iteratively (1) splits the
data into clusters and identifies the correct and incorrect clusters,
and (2) selectively improves the models on incorrect clusters.
Specifically, at each layer of the cluster dendrogram, the SSC
algorithm previously described splits the data into BoU clusters
(Eq. 1). Next, each cluster from the split is assigned a type (Eq. 3)
based on the localized estimate (Eq. 2). If the cluster is deemed
[Figure 3: the 13-step BoU MR algorithm; pseudocode listing lost in extraction]
Figure 3. BoU MR Algorithm.
Table: iLOG dataset statistics (category names lost in extraction):

Dataset     Feature counts by category          Total features   Sessions
iLOG 2008   5, 10, 20, 10, 9, 16, 47, 10        127              426 + 604 = 1030
iLOG 2009   5, 10, 20, 10, 9, 16, 45, 9         124              738 + 1131 = 1869
iLOG 2010   5, 10, 16, 10, 9, 16, 50, 9         125              1228 + 2215 = 3443
Finally, the experiments below compare the single model with the
multiple models from the BoU. For all experiments, we provide
both the test and F1 accuracy results based on ten-fold cross
validation. In Section 4.1, we compare a single model to the BoU
models learned using three SL systems (ANN, SVM, and DTs).
In Section 4.2, we compare a single model to the BoU models after refining the data using FS and NC. This results in six configurations (FS or NC, each paired with ANN, SVM, or DT).
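The single-versus-multiple-models comparison can be sketched with a stand-in learner. This is an assumed setup, not the paper's experimental code: a majority-class "model" substitutes for the ANN/SVM/DT learners, and test rows are routed to their cluster's model:

```python
# Sketch (assumed, stdlib only) of why per-cluster models can beat one
# global model when the data mixes distributions.
from collections import Counter

def majority_model(rows):
    """Trivial stand-in learner: always predict the most common label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, rows):
    return sum(1 for x, y in rows if model(x) == y) / len(rows)

def bou_accuracy(clusters, test_assign):
    """Train one model per cluster; route each test row to its cluster's model."""
    models = {cid: majority_model(rows) for cid, rows in clusters.items()}
    hits = sum(1 for cid, (x, y) in test_assign if models[cid](x) == y)
    return hits / len(test_assign)

# Two toy clusters with opposite majority labels: a single model must fit one
# of them poorly, while per-cluster models fit both.
c0 = [((i,), 0) for i in range(10)]
c1 = [((i,), 1) for i in range(10)]
single = majority_model(c0 + c1)
test = [(0, r) for r in c0] + [(1, r) for r in c1]

single_acc = accuracy(single, c0 + c1)        # one model: half the rows wrong
bou_acc = bou_accuracy({0: c0, 1: c1}, test)  # per-cluster models: all correct
```

In the real experiments the learners are ANN, SVM, and DT models scored with ten-fold cross-validation, but the routing logic is the same.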
Table: Single model vs. BoU multiple models, test and F1 accuracy (ten-fold cross-validation); rows assumed to follow the order ANN, SVM, DT introduced in Section 4.1; asterisks as in the original:

iLOG 2008   Test Single/BoU   F1 Single/BoU   Ave. # clusters
ANN         0.69 / 0.74*      0.63 / 0.67     1.90
SVM         0.68 / 0.73*      0.64 / 0.69*    2.10
DT          0.72 / 0.73       0.68 / 0.68     4.60

iLOG 2009
ANN         0.69 / 0.74*      0.60 / 0.67*    1.90
SVM         0.67 / 0.69*      0.58 / 0.61*    2.00
DT          0.62 / 0.65       0.52 / 0.56     2.00

iLOG 2010
ANN         0.70 / 0.72*      0.56 / 0.61     3.20
SVM         0.69 / 0.71       0.57 / 0.62*    3.60
DT          0.65 / 0.67*      0.50 / 0.51     2.30
Figure 4. Decision trees on the iLOG 2008 dataset created using a single model and multiple models based on BoU clusters.
On the other hand, as shown in Table 2, the accuracy (even for
BoU multiple models) is relatively low on all three iLOG datasets
(e.g., test accuracy in the 60s for DTs). As alluded to earlier, these datasets contain both irrelevant features and noise, both of which are problematic for SL systems in general. In the next
section, we show how the BoU MR can use feature selection and
noise correction algorithms to break through this ceiling and
improve both test and F1 accuracy.
Table: Single model vs. BoU multiple models after data refinement (the heading distinguishing FS/NC was lost in extraction); rows assumed ANN, SVM, DT per dataset:

iLOG 2008   Test Single/BoU   F1 Single/BoU   Ave. # clusters
ANN         0.72 / 0.75*      0.65 / 0.69*    2.00
SVM         0.72 / 0.74*      0.66 / 0.70*    2.00
DT          0.74 / 0.75       0.70 / 0.71     2.43

iLOG 2009
ANN         0.71 / 0.76*      0.63 / 0.68*    2.00
SVM         0.71 / 0.70       0.63 / 0.62     2.00
DT          0.66 / 0.69*      0.55 / 0.60*    2.00

iLOG 2010
ANN         0.73 / 0.74       0.58 / 0.60     2.80
SVM         0.72 / 0.72       0.61 / 0.61     3.30
DT          0.67 / 0.70*      0.52 / 0.55     2.30
[Table (flattened in extraction): counts for the full single model and the BoU cluster models (C1-C3) on the iLOG 2009 dataset. Recoverable row totals: 124 (2009 dataset), 83 (Single), 63 (C1), 78 (C2), 124 (C3).]
[Table: test accuracy, F1 accuracy, and average number of clusters for the single model vs. the BoU multiple models (* = statistically significant difference); row and column labels were lost in extraction. Single/BoU pairs. Test accuracy: 0.73/0.75*, 0.72/0.74, 0.73/0.75*; 0.72/0.75*, 0.70/0.70, 0.64/0.69*; 0.71/0.75*, 0.72/0.70, 0.67/0.70*. F1 accuracy: 0.65/0.69, 0.68/0.70*, 0.68/0.71; 0.56/0.67*, 0.50/0.62*, 0.36/0.59*; 0.52/0.61*, 0.49/0.60, 0.36/0.54*. Average number of clusters: 2.00, 4.10, 4.00; 2.00, 2.00, 2.00; 4.10, 3.50, 2.50.]
Keywords
Probabilistic Graphical Models, Bayesian Knowledge Tracing,
MOOC, Resource model, edX
1. INTRODUCTION
Massive Open Online Courses (MOOCs) are a quickly emerging modality of learning in higher education. They consist of various learning resources, often lecture videos, e-texts, online office hours, and assessments including homework and exams, and have a specific time in which they begin and end, often corresponding closely to that of their residentially offered counterparts. While the efficacy of MOOCs compared to their residential offerings is an open question, from the viewpoint of educational research MOOCs provide several substantial advantages, most notably the detailed digital trail left by students in the form of log data and the size of the student cohorts, which are often several orders of magnitude larger than typical on-campus-only offerings.
Unlike Intelligent Tutoring Systems (ITS), MOOCs do not
currently provide tutorial help on demand at the points of need; instead, the knowledge is self-sought and supplied by a redundancy of information across various types of resources, resulting in a variety of student-selected resources and pathways through the system.
an opportunity to investigate the efficacy of student behavior
under varying conditions; however, MOOCs currently lack a
model of learning with which to instrument this exploration. In
this paper we will show how existing learner modeling techniques
based on Bayesian Knowledge Tracing can be adapted to the
inaugural course, 6.002x: circuit design, on the edX MOOC
platform. We identify three distinct challenges to modeling
MOOC data in section 2, followed by a description of our
check, the order in which she answered the subparts was not known; however, many students elected to click the check button after each consecutive answer. Unlike most ITSs, homework was scored based on the last answer entered by the user instead of the first.
1.2 Dataset
The course drew 154,000 registrants; however, only 108,000 entered the course, with around 10,000 completing the course through the final. Among those, 7,158 received a certificate for having earned at least a 60% weighted average. Our dataset consisted of 2,000 randomly chosen students from the certificate earners. A further reduction of the dataset was made by randomly selecting ten problems (and their subparts) from each of the three types of assessments: homework, lecture sequence, and exam problems.
The data for this course originated from JSON log files produced
on the Amazon EC2 cloud, where the edX platform is hosted. The
original log files were separated out into individual user files and
the JSON records were parsed into a human readable time series
description of user interaction with components of the MOOC.
The final data preparation step compiled an event log by problem,
consisting of one line per student event relevant to that problem.
This included time spent on the event, correctness of each subpart,
when the student entered or changed an answer, the attempt count
of that answer, and resources accessed by the student before and
between responses. An example of this data format is shown in
Table 1.
Table 1. Example of the event log format of our distilled dataset. Columns: User, Res(ource), Time, Resp1, Resp2, Count1, Count2 ("--" = no value; per-row cell alignment was garbled in extraction). The recoverable event sequence for the example student: video (2m 30s); answer (10m 5s, Resp1 correct, Resp2 correct); book (4m 41s); book (40s); answer (20s, Resp1 incorr.); answer (15s, Resp1 incorr.); answer (1m 8s, Resp1 incorr., Resp2 incorr.); answer (28s, Resp2 correct); video (2m 10s); answer (6s, Resp1 correct).
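The distillation pipeline described above can be sketched as follows; the JSON field names here are hypothetical placeholders, not the real edX tracking-log schema:

```python
# Sketch of distilling raw JSON event logs into a per-problem event list.
# Field names ("user", "event_type", "time", "problem", "correct") are
# hypothetical stand-ins for the actual edX log schema.
import json
from collections import defaultdict

raw_lines = [
    '{"user": "u1", "event_type": "video", "time": 150, "problem": "p1"}',
    '{"user": "u1", "event_type": "answer", "time": 605, "problem": "p1", "correct": true}',
    '{"user": "u2", "event_type": "book", "time": 40, "problem": "p1"}',
]

events_by_problem = defaultdict(list)
for line in raw_lines:
    rec = json.loads(line)
    # One entry per student event relevant to that problem.
    events_by_problem[rec["problem"]].append(
        (rec["user"], rec["event_type"], rec["time"], rec.get("correct")))

# Order each problem's events as a time series.
for problem, events in events_by_problem.items():
    events.sort(key=lambda e: e[2])
print(len(events_by_problem["p1"]))
```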
p(L_n \mid \mathrm{correct}_n) = \frac{p(L_n)\,(1 - p(S))}{p(L_n)\,(1 - p(S)) + (1 - p(L_n))\,p(G)}

p(L_n \mid \mathrm{incorrect}_n) = \frac{p(L_n)\,p(S)}{p(L_n)\,p(S) + (1 - p(L_n))\,(1 - p(G))}
The p(L_n) on the right side of the formula is the prior probability of knowledge at that time, while p(L_n | Evidence_n) is the posterior probability of knowledge calculated after taking an observation at that time into account. Both formulas are applications of Bayes' theorem and calculate the likelihood that the explanation for the
observed response is that the student knows the KC. Since the
student will be presented with feedback, there is a chance to learn.
The probability the student will learn the KC from the opportunity
is captured by this formula which calculates the new prior after
adding in the probability of learning:
p(L_{n+1}) = p(L_n \mid \mathrm{Evidence}_n) + (1 - p(L_n \mid \mathrm{Evidence}_n)) \cdot p(T)
These formulas are used in the task of determining mastery; however, this model of knowledge has been extended to serve as a
Footnote 1: The name p(L_0) was used to denote the prior parameter in [1]. In a BKT model, this is symbolically equivalent to p(L_1).
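The Bayesian update and learning-transition formulas above can be sketched in a few lines (standard BKT; the parameter values are illustrative):

```python
# Standard BKT: posterior p(Ln|evidence) by Bayes' theorem, then the
# learning transition to get the next prior. Parameters are illustrative.
def bkt_posterior(p_L, p_guess, p_slip, correct):
    if correct:
        num = p_L * (1 - p_slip)
        den = p_L * (1 - p_slip) + (1 - p_L) * p_guess
    else:
        num = p_L * p_slip
        den = p_L * p_slip + (1 - p_L) * (1 - p_guess)
    return num / den

def bkt_next_prior(posterior, p_learn):
    # New prior after the opportunity to learn from feedback.
    return posterior + (1 - posterior) * p_learn

p_L = 0.3  # p(L1), the initial prior
for correct in [False, True, True]:
    post = bkt_posterior(p_L, p_guess=0.2, p_slip=0.1, correct=correct)
    p_L = bkt_next_prior(post, p_learn=0.15)
print(round(p_L, 3))
```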
[Figure: the basic model. Node representations: K = knowledge node; Q1..n = question nodes (the problem sub-parts). Node states: K and Q are two-state (0 or 1). Parameters: prior p(L0), learn p(T), guess p(G), and slip p(S). The accompanying multi-subpart update equations were garbled in extraction.]
[Figure: the multiple attempt count model. Observable attempt-count nodes C1..C3 condition guess and slip, p(G|Cn) and p(S|Cn), for question nodes Q1..Q3, with prior p(L0) and learn p(T) as in the basic model.]
Recent work has extended BKT to allow for different guess and slip parameters to be modeled per item in a model coined KT-IDEM (Item Difficulty Effect Model) [3]. In ASSISTments, each problem template within a skill builder problem set was allowed to fit different guess and slip parameters, and in the Cognitive Tutor this was done at the level of the problem, where all steps of a given KC shared a guess/slip with one another within a problem, but steps of the same KC that appeared in a different problem could fit different guess and slip parameter values. In both systems, prediction accuracy was improved by ~15% when there was ample data to fit each set of parameters (6 or more data points per parameter). This can be seen as allowing for variation in question difficulty among questions in a KC, or in the case of the Cognitive Tutor, allowing for variation in KC performance depending on problem context. It can also be interpreted as modulating the information gained about the latent variable depending on the question, in much the same way as the count nodes in the count model modulate the information gained about the latent variable from responses depending on attempt count.
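The KT-IDEM idea, per-item guess and slip under a shared knowledge estimate, can be sketched as follows (illustrative parameters):

```python
# Sketch of KT-IDEM's prediction step: the same latent knowledge estimate,
# but guess/slip looked up per item rather than shared across the KC.
# Parameter values are illustrative.
ITEM_PARAMS = {"q1": (0.30, 0.05), "q2": (0.10, 0.15)}  # item -> (guess, slip)

def idem_predict(p_L, item):
    g, s = ITEM_PARAMS[item]
    # Predicted probability of a correct response at knowledge level p_L.
    return p_L * (1 - s) + (1 - p_L) * g

# An easy-to-guess item yields a higher predicted correctness at the
# same knowledge level than a harder one.
print(round(idem_predict(0.5, "q1"), 3), round(idem_predict(0.5, "q2"), 3))
```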
[Figure: the resource model. As in the idem model, but with an observable resource node R conditioning the learning parameter, p(T|R), and per-question guess and slip, p(Gn) and p(Sn), for Q1..Q3. Node states: K and Q are two-state (0 or 1); R is multi-state (1 to M, where M is the number of unique resources in the training data).]
hours per fold, and the highest was the resource model with 15.1 hours per fold. Lecture and exam problems took 1/10th the time to evaluate, suggesting that more answer events occurred in the homework. For future runs, more tractable compute times could be sought with more aggressive filtering of homework students with excessively long attempt counts or by cutting off response sequences at a particular count.
Figure 4. The resource model, based on the idem model with resource access information added and hypothesized to influence learning. [The expression for the number of parameters in this model was garbled in extraction.]
The resource model was built from the idem model without attempt count taken into consideration. The model is the same except for the addition of the observable resource node, R, which conditions the learning parameter, p(T|R). At each time slice, the observable R is given the value corresponding to the current resource type being accessed. This model generalizes the idem model and can be made mathematically identical by removing all resource types except for the answer to the problem at hand, which represents the standard learning parameter capturing the benefit of feedback. When a non-problem resource is accessed, the R node gets the value of that resource type and a time slice with no question answer input is used.
After the parameters of the model are trained, each student answer in the test set is predicted one student at a time and one time slice at a time for that student. This prediction procedure is identical to previous literature evaluating KT, with the difference of accommodating multiple responses per time slice. Walking through the prediction procedure: response data for the first student in the test set is loaded. On the first time slice, observable evidence other than the response is entered, such as attempt counts and the resource type being accessed. If there is an answer recorded for one or multiple subparts in the first time slice, the model is told which subpart or subparts were answered and makes a prediction of the student's response(s) based on the parameters learned from the training set. There will always be at least one response in each time slice except in the case of the resource model, where a time slice can represent a resource access. This prediction is logged along with the actual response. After prediction, the model is told what the student's real responses were, and the model applies the Bayesian update formula to calculate a posterior and then applies the learning transition formula to calculate the new prior for the next time slice. This process is repeated until the end of the student's response sequence, and then the next student is evaluated. Past answers of a student in the test set are used to predict their responses in the next time slice; however, student responses in the test set are not used to aid in prediction of other students in the test set. This form of testing, where data is utilized temporally within an instance, is not typical among classifier evaluations; however, it is a principled way of evaluating student models, since a real-world implementation of the model would have the benefit of a student's past responses in order to predict future performance.
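The per-student, per-time-slice procedure above can be sketched with a simple single-KC updater; the response sequence and parameter values are hypothetical stand-ins for the trained MOOC models:

```python
# Sketch of the sequential prediction loop: predict each response before
# seeing it, then update the knowledge estimate with the real response.
# Data and parameters are illustrative, not the paper's fitted models.
def predict_student(responses, p_L0, p_T, p_G, p_S):
    """responses: list of 0/1 answers in time order. Returns predictions."""
    p_L, preds = p_L0, []
    for obs in responses:
        # Predict before the response is revealed.
        preds.append(p_L * (1 - p_S) + (1 - p_L) * p_G)
        # Reveal the real response: Bayesian update + learning transition.
        if obs:
            post = p_L * (1 - p_S) / (p_L * (1 - p_S) + (1 - p_L) * p_G)
        else:
            post = p_L * p_S / (p_L * p_S + (1 - p_L) * (1 - p_G))
        p_L = post + (1 - post) * p_T
    return preds

preds = predict_student([0, 0, 1, 1], p_L0=0.3, p_T=0.2, p_G=0.25, p_S=0.1)
print([round(p, 3) for p in preds])
```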
[Figure: cross-validated prediction performance by model for homework (HW), lecture (LEC), and exam (EXAM) problems, for both the 2,000- and 200-student samples.]
4. Results
Summarized cross-validated prediction results of the four models (basic, count, idem, and idem count) are presented in this section for the three problem types (homework, lecture, and exam). In addition to predicting our 2,000 sampled students, the models are also evaluated on a smaller 200 student sample to test the reliability of the results with less data. These results are summarized in the next subsection. An analysis of the count model parameters is presented in section 4.2, followed by a deeper analysis of the IDEM model in section 4.3. A two-tailed paired t-test over problems was used to test whether the difference in AUC scores between models was statistically reliable.
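The significance test can be sketched with the standard library; the per-problem AUC values below are illustrative, not the paper's results:

```python
# Sketch of the model comparison: a paired t-test over per-problem AUC
# scores for two models, using only the standard library. Illustrative data.
import math
import statistics

auc_model_a = [0.71, 0.68, 0.74, 0.66, 0.70, 0.73, 0.69, 0.72, 0.67, 0.75]
auc_model_b = [0.73, 0.70, 0.75, 0.69, 0.72, 0.74, 0.72, 0.74, 0.69, 0.77]

# Paired t statistic: mean per-problem difference over its standard error.
diffs = [b - a for a, b in zip(auc_model_a, auc_model_b)]
mean_d = statistics.mean(diffs)
se_d = statistics.stdev(diffs) / math.sqrt(len(diffs))
t_stat = mean_d / se_d

# With 9 degrees of freedom, |t| > 3.25 is significant at p < .01 (two-tailed).
print(round(t_stat, 2))
```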
[Figures: fitted per-problem parameters under the basic, count, idem, and idem count models, including probability-of-guess and attempt-count plots; labeled problems include HW1p1, HW2L3h1, CSDamplifier, CSDamplifierModel, PhaseInverter, SeriesParallelInductors, ChargingAnInductor, ILCdelayedImpulse, L23NonInvertingAmplifier, and OpAmpFET.]
5. Contribution
We have presented a first foray into applying a model of learning to a MOOC. We identified three challenges to model adaptation and found that modeling variation in question difficulty resulted in the largest performance gain given our definition of KC. While our KC definition, with a problem's subparts as members, is not ideal for measuring learning throughout the course, it nevertheless resulted in AUC performance rivaling that of prediction within systems with subject-matter-expert-defined KC models. While we elucidated the potential for knowledge discovery given the unique variation in resource access in MOOC data, much work is left to demonstrate that this information can be seized on to produce more accurate results. This raises the question of how the efficacy of resources generalizes, and of the contexts and background information that need to be considered to identify what works and for whom. Our solutions to the first two challenges, the lack of a KC model and multiple unpenalized attempt counts, will serve as an initial foundation for an efficacy assessment framework for MOOCs.
REFERENCES
[7] Beck, J., Chang, K. M., Mostow, J., & Corbett, A. (2008).
Does help help? Introducing the Bayesian Evaluation and
Assessment methodology. In Intelligent Tutoring Systems
(pp. 383-394). Springer Berlin/Heidelberg.
[8] Pardos, Z.A., Dailey, M. & Heffernan, N. (2011) Learning
what works in ITS from non-traditional randomized
controlled trial data. The International Journal of Artificial
Intelligence in Education, 21(1-2):45-63.
[9] Rau, M., Pardos, Z.A. (2012) Interleaved Practice with
Multiple Representations: Analyses with Knowledge Tracing
Based Techniques. In Proceedings of the 5th annual
International Conference on Educational Data Mining. Crete,
Greece. Pages 168-171
Philip I. Pavlik Jr.
University of Memphis, Psychology Dept.
Memphis, TN 38152
1-901-678-2326
ppavlik@memphis.edu

Henry Hua
University of Memphis, Psychology Dept.
Memphis, TN 38152
1-901-678-5590
hyhua@memphis.edu

Jamal Williams
University of Memphis, Psychology Dept.
Memphis, TN 38152
1-901-678-2364
jawllm10@memphis.edu

Gavin M. Bidelman
University of Memphis, Sch. Comm. Sci. & Disorders
Memphis, TN 38152
1-901-678-5826
gmbdlman@memphis.edu
ABSTRACT
From novice to expert, almost every musician must recognize
musical intervals, the perceived pitch difference between two
notes, but there have not been many empirical attempts to
discover an optimal teaching technique. The current study created
a method for teaching identification of consonant and dissonant
tone pairs. At posttest, participants increased their ability to
discern tritones from octaves, and performance was better for
those who received an interleaving order of the practice trials.
Data mining of the results used a novel method to capture curvilinear forgetting and spacing effects in the data, allowing a deeper analysis of the pedagogical implications of our task that revealed richer information than the pretest-to-posttest comparison alone. Implications for musical
education, generalization learning, and future research are
discussed.
Keywords
Model-based discovery, forgetting, interleaving, spacing effect,
computer adaptive training, musical consonance
1. INTRODUCTION
Music is a rich multi-modal experience which taps a range of both
perceptual and cognitive mechanisms. As in other important facets
of human cognition (e.g., speech/language), music consists of
constituent elements (i.e., scale tones) that can be arranged in a
combinatorial manner to yield high-order units with specific
categorical labels (e.g., intervals, chords). In Western tonal music,
the octave is divided into 12 pitch classes (i.e., semitones). When
combined, these pitch relationships can be used to construct
twelve different chromatic intervals (each one semitone above or
below another) that are labeled according to the relationship
between fundamental frequencies of their tones. For example, two
tones can form an octave (2:1 ratio), a perfect fifth (3:2 ratio), or a
variety of other tonal combinations.
Perceptually, musical intervals are typically described as either
2. METHOD
2.1 Participants
After screening our data for participants (Amazon Turk workers)
who provided a full set of responses, without omissions, we had
220 participants. The average participant age was 29.88 years (SD
= 10.32). Participants had a mean of 4.30 years of musical
experience (SD = 5.70). Parsed into different types of musical
training, participants had on average 1.34 years (SD = 2.31) of
training with a private tutor, 2.27 years (SD = 4.78) studying
music on their own, and 2.86 years (SD = 3.38) of training in a
formal school setting. Prior ear training experience averaged 0.47
years (SD = 1.61), 0.24 years (SD = 0.81) of which focused
specifically on harmonic interval training.
2.2 Procedure
Participants self-selected this study from a list of various available
research participation opportunities on Amazon Mechanical Turk,
an online data-collection service. Participants were paid $3.
Participants began the study with a survey. Items included
demographic information such as age and sex. This survey also
asked for various predictors, including years of different types of
musical training (overall, private tutoring, school), and types of
musical training (ear training, harmony, reading music).
Upon finishing the survey, participants completed the interval
identification task. Prior to the pretest, to ensure that people
understood the task, the following instructions were given for
participants to view (participants clicked a button after reading):
Hint: The octave interval is sometimes described as
smooth, pleasing or pure. The tritone interval is
sometimes described as harsh, diabolic or impure.
Task: Please listen to each 2 second interval, then type
'o' for octave or 't' for tritone. After each incorrect
response, you are provided review to help you learn.
Goal: Practice the sound identification task, attempting
to learn the interval (octave or tritone) between two
notes played at the same time. This pretest portion will
get an initial measure of your skill in the task, and will
be followed by 96 training practices, and finally a
posttest of 32 practices.
This task contained three stages: practice, learning, and posttest. The practice section presented 32 intervals for the participant to label as either a tritone or octave. In the learning section, 96 training trials were presented.
2.3 Conditions
This experiment varied the presentation sequencing to test the
effectiveness of various presentation orders on the task of
identifying tritone and octave harmonies. Practice trials were
presented in combinations of progressive and interleaving orders
organized into four blocks, each containing 24 harmonic intervals.
Conditions were randomized between subjects.
A progressive order presented the harmonic intervals in consecutive blocks, each block containing the same two intervals but presented at a higher pitch register than the previous block. Block 1 contained intervals in a low register (155.6Hz/311.1Hz for octave and 185Hz/261.6Hz for tritone); block 2, intervals of a medium-low register (277.2Hz/554.4Hz for octave and 329.6Hz/466.2Hz for tritone); block 3, intervals from a medium-high register (493.9Hz/987.8Hz for octave and 587.3Hz/830.6Hz for tritone); and block 4, intervals from a high register (880Hz/1760Hz for octave and 1046.5Hz/1480Hz for tritone). Sounds were synthesized by MIDI using instrument 1 (Piano) for a 2-second duration. An antiprogressive order presented harmonic intervals in a way that made each block maximally different from the previous block: block 1 consisted of low register tones; block 2, high register tones; block 3, medium-low tones; and block 4, medium-high tones.

An interleaving order introduced a new register for each of blocks 2-4 according to the antiprogressive or progressive order, with tones already heard from the previous blocks interleaved with the new material. In other words, new registers were taught while practicing the old ones, with an equal distribution for each of the presented tone levels within a block of 24. Conditions lacking an interleaving order did not repeat tones from previous blocks. Therefore, the 4 experimental conditions contained all 4 combinations of progressive and interleaving orders: progressive and no interleaving, antiprogressive and no interleaving, progressive with interleaving, and antiprogressive with interleaving. As a control group, there was one condition that presented the 96 learning trials in 4 fully mixed blocks (just like the pretest and posttest). For all conditions, although each block contained a predetermined set of tones, tones were randomized within each block. Finally, practice trials during the learning blocks were not marked by a brief pause with an introduction screen, as the pretest, learning, and post-test stages were. In other words, transitions between sets of different items during practice were not signaled to subjects.
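The four order conditions can be sketched as a small block-order generator; this is a simplified illustration (register labels only, ignoring the 24-trial composition within each block), not the experiment code:

```python
# Sketch of the progressive / antiprogressive orders, with and without
# interleaving, over the four pitch registers. Simplified illustration.
REGISTERS = ["low", "medium-low", "medium-high", "high"]

def block_order(progressive=True, interleave=False):
    # Progressive introduces registers low -> high; antiprogressive makes
    # each block maximally different from the previous one.
    intro = REGISTERS if progressive else ["low", "high", "medium-low", "medium-high"]
    blocks = []
    for i, reg in enumerate(intro):
        if interleave:
            # New register plus all registers already heard.
            blocks.append(intro[: i + 1])
        else:
            blocks.append([reg])
    return blocks

print(block_order(progressive=True, interleave=True))
```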
3. RESULTS
[Table: correlations of participant factors with posttest performance and improvement (* = significant). Factor labels were lost in extraction; the posttest/improvement pairs are .28*/-.10, .25*/-.10, .26*/-.16*, .30*/-.15*, .30*/-.10, .26*/-.08, .31*/-.07, .13/.03, and .18*/.04.]
Figure 1. Performance in all conditions plotted in blocks of 8 trials. Error bars are 1 SE confidence intervals. First 4 and last 4 blocks
represent pre- and post-test trials.
The second effect that AFM was unable to capture was the benefit of spacing/interleaving that we saw in the interleaved conditions with the ANOVA analysis. This was described in Section 3 as a significant benefit for the 2 conditions which employed more interleaving. We can also see this effect in Figure 1 as a visible difference between the interleaving and blocked conditions at posttest. This spacing effect, which is also very common in verbal memory experiments [16], was not by itself as strong as the effect of forgetting, but it has important implications for education. If musical educators can make use of this effect, our data suggest they may enhance learning. While AFM does not capture such an effect in its original incarnation, more complex models can capture such data. For example, Pavlik and Anderson [16] describe one such model, which functions by proposing less decay as a function of the increased difficulty of more widely spaced practice. Unfortunately, such models have several parameters and no analytic method of solution, so solving these models is extremely difficult due to issues of time and local minima. To resolve some of these issues we wanted to find a way to fit a model similar to Pavlik and Anderson's that relied less on an ad hoc, difficult-to-solve (albeit accurate) model form and more on an established model formalism (logistic regression).
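The logistic-regression route can be sketched with a toy PFA-style fit, where prior success and failure counts enter a logistic model as features. Everything here (data, feature construction, fitting loop) is an illustrative stand-in, not the model fitted in the paper:

```python
# Toy PFA-style logistic regression fit by batch gradient ascent.
# Synthetic data: success probability grows with prior successes (s)
# and shrinks with prior failures (f).
import math
import random
random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

data = [(s, f, 1 if random.random() < sigmoid(-0.5 + 0.4 * s - 0.2 * f) else 0)
        for s in range(6) for f in range(6) for _ in range(40)]

# Gradient ascent on the logistic log-likelihood.
b0 = bs = bf = 0.0
lr, epochs, n = 0.1, 500, len(data)
for _ in range(epochs):
    g0 = gs = gf = 0.0
    for s, f, y in data:
        err = y - sigmoid(b0 + bs * s + bf * f)
        g0 += err
        gs += err * s
        gf += err * f
    b0 += lr * g0 / n
    bs += lr * gs / n
    bf += lr * gf / n
print(round(bs, 2), round(bf, 2))
```

The fitted coefficients recover the signs of the generating effects: a positive weight on prior successes and a negative weight on prior failures.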
Table: Spearman r and mean absolute deviation (MAD), with standard errors, on the train and test sets for each model.

Model           Train Spearman r (SE)   Train MAD (SE)      Test Spearman r (SE)   Test MAD (SE)
AFM             0.10974 (0.00047)       0.25315 (0.0005)    0.1067 (0.00392)       0.25347 (0.00203)
AFMdecay        0.24013 (0.00044)       0.24167 (0.00053)   0.23783 (0.00402)      0.24206 (0.00239)
AFMdecayspace   0.25211 (0.00037)       0.23987 (0.00042)   0.24921 (0.0034)       0.24026 (0.00197)
type. Again we fit a single parameter for both intervals under the assumption that they are learned at equivalent rates. Again we used the I function to sum the columns since they were mutually exclusive predictors in the equation.

answer ~
    I(octave0 + octave10 + octave20 + octave30 +
      tritone0 + tritone10 + tritone20 + tritone30) +
    I(soctave0 + soctave10 + soctave20 + soctave30 +
      stritone0 + stritone10 + stritone20 + stritone30) +
    I(tris + octs) +
    I(trif + octf)

Figure 5. GLM model structure with PFA generalization.

It was also interesting to check how well the model could be used to simulate the experiment. Figure 4 shows graphically how well the model captures the aggregate effects. Note that even the error bars are of very similar magnitude. This simulation was constructed by generating random numbers from 0 to 1 that were then compared to the model's prediction for each trial to determine whether the trial was responded to correctly in the simulated result.

Parameter                   Value
d                           .628
g                           .00106
fixed cost failure          7.054
logistic intercept          0.99
spacing coefficient         .131
decay coefficient           2.36
PFA gain failure coeff.     -.106
PFA gain success coeff.     .0144
latency intercept           .154
latency coeff.              1.296
5. CONCLUSIONS
6. ACKNOWLEDGMENTS
Our thanks to the University of Memphis for support and funding
of this research.
Chris Piech
Stanford University
piech@cs.stanford.edu

Jonathan Huang
Stanford University
jhuang11@stanford.edu

Chuong Do
Coursera
cdo@coursera.org

Zhenghao Chen
Coursera
zhenghao@coursera.org

Andrew Ng
Coursera
ng@coursera.org

Daphne Koller
Coursera
koller@coursera.org
ABSTRACT
In massive open-access online courses (MOOCs), peer grading serves as a critical tool for scaling the grading of complex, open-ended assignments to courses with tens or hundreds of thousands of students. But despite promising initial trials, it does not always deliver accurate results compared to human experts. In this paper, we develop algorithms for estimating and correcting for grader biases and reliabilities, showing significant improvement in peer grading accuracy on real data with 63,199 peer grades from Coursera's HCI course offerings, the largest peer grading networks analysed to date. We relate grader biases and reliabilities to other student factors such as engagement and performance, as well as commenting style. We also show that our model can lead to more intelligent assignment of graders to gradees.
1. INTRODUCTION

[Figure 1: caption lost in extraction.]
sessment data to extend the discourse on how to create an effective grading system. We formulate and evaluate intuitive probabilistic peer grading models for estimating submission grades as well as grader biases and reliabilities, allowing ourselves to compensate for grader idiosyncrasies. Our methods improve upon the accuracy of baseline peer grading systems that simply use the median of peer grades by over 30% in root mean squared error (RMSE).

In addition to achieving more accurate scoring for peer grading, we also show how fair scores (where our system arrives at a similar level of confidence about every student's grade) can be achieved by maintaining estimates of uncertainty of a submission's grade.

Finally, we demonstrate that grader-related quantities in our statistical model, such as bias and reliability, have much to say about other educationally relevant quantities. Specifically, we explore summative influences (what variables correspond with a student being a better grader) and formative results (how peer grading affects future course participation). With the large amount of data available to us, we are
2. DATASETS

In this work, we use datasets collected from two consecutive Coursera offerings of Human Computer Interaction (HCI), taught by Stanford professor Scott Klemmer. The HCI courses used a calibrated peer grading system [17] in order to assess weekly student submissions for assignments which covered a number of different creative design tasks for building a web site. On every assignment, each student evaluated five randomly selected submissions (one of which was a ground truth submission, discussed below) based on a rubric, and in turn, was evaluated by four classmates. The final score given to a submission was determined as the median of the corresponding peer grades. Peer grading was anonymized so that students could not see whom they were evaluating, or who their evaluators were.

After the first offering (HCI1), the peer grading system was refined in several ways. Among other things, HCI2 featured a modified rubric that addressed some of the shortcomings of the original peer grading scheme, and peer graders were divided into language groups (English and Spanish). Counting just those who submitted at least one assignment in the English offerings of the class, there were 3,607 students from the first offering (HCI1) and 3,633 students from the second offering (HCI2). These students came from diverse backgrounds (with a majority of students from outside of the United States). Collectively, these 7,240 students from around the world created 13,972 submissions, receiving 63,199 peer grades in total. See Table 1 for a summary of the dataset. To prevent overfitting our models to one instance of the class, we used the data from HCI2 as a hold-out set.

3. PROBABILISTIC MODELS

True scores: We assume that every student u is associated with a submission that has a true underlying score, denoted s_u, which is unobserved and to be estimated.

Grader biases: Every grader v is associated with a bias, b_v in R. These bias variables reflect a grader's tendency to either inflate or deflate her assessment by a certain number of percentage points.

Grader reliabilities: We also model grader reliability, τ_v in R+, reflecting how close on average a grader's peer assessments tend to land near the corresponding submission's true score after having corrected for bias. Reliabilities will always correspond to the inverse variance of a normal distribution.

Observed grades: Finally, z_uv in R is the observable score given by grader v to submission u.

Below we present, in order of increasing complexity, three statistical models that we have found to be particularly effective.
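The generative story behind the simplest of these models (true score plus grader bias, with noise scaled by grader reliability) can be sketched as follows; the hyperparameter values and the bias-corrected-mean estimator are illustrative, not the inference procedure used in the paper:

```python
# Sketch of the grader model's generative story: z_uv ~ N(s_u + b_v, 1/tau_v),
# where reliability tau_v is an inverse variance. Illustrative values only.
import random
random.seed(0)

n_students = 200
true_score = {u: random.gauss(70, 10) for u in range(n_students)}        # s_u
bias = {v: random.gauss(0, 3) for v in range(n_students)}                # b_v
reliability = {v: random.uniform(0.05, 0.5) for v in range(n_students)}  # tau_v

def observed_grade(u, v):
    # Noise standard deviation is the square root of 1/tau_v.
    return random.gauss(true_score[u] + bias[v], (1 / reliability[v]) ** 0.5)

# Each submission is graded by 4 peers; estimate s_u by a bias-corrected mean.
graders = random.sample(range(1, n_students), 4)
grades = [observed_grade(0, v) for v in graders]
estimate = sum(g - bias[v] for g, v in zip(grades, graders)) / 4
print(round(true_score[0], 1), round(estimate, 1))
```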
Figure 2: (a) The relationship between a grader's homework performance (her grade) and statistics (mean/standard deviation) of grading performance (residual from true grade). (b) The relationship between a gradee's homework performance against statistics of assessments for her submissions. (c) Visualization of all three variables simultaneously, where intensity reflects the mean residual z-score. Empty boxes mean that there is not enough data available to compute a reliable estimate.
3.1 Model PG1

3.2 Model PG2

s_u^(T) ~ N(0, 1/γ0) for every user u, and

z_uv^(T) ~ N(s_u^(T) + b_v^(T), 1/τ_v), for every observed peer grade.
Model PG2 requires that we normalize grades across different homework assignments to a consistent scale. In our experiments, for example, we have noticed that the set of grader biases had different variances on different homework assignments. Using a normalized score (z-score), however, allows us to propagate a student's underlying bias while remaining robust to assignment artifacts.
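The per-assignment normalization can be sketched as follows (grade values invented for illustration): each assignment's grades are mapped to z-scores, so a bias estimated on one homework is on the same scale as on the next.

```python
from statistics import mean, stdev

# raw grades per homework assignment (values invented for illustration)
raw = {"hw1": [62, 75, 88, 70, 95], "hw2": [40, 55, 91, 67, 73]}

def to_zscores(grades):
    """Map one assignment's grades to mean 0, unit variance."""
    m, s = mean(grades), stdev(grades)
    return [(g - m) / s for g in grades]

normalized = {hw: to_zscores(g) for hw, g in raw.items()}
# every assignment now shares a common scale, so a grader's bias
# estimated on hw1 can be propagated to hw2
```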
Note that while a model which captures the dynamics of true
scores and reliabilities across assignments can be similarly
imagined, we have focused only on the dynamics of bias
for this work (which contributes the most towards improved
accuracy while still being equitable).
3.3 Model PG3
PG3 is more constrained, forcing grader reliability to depend on a single parameter instead of being allowed to vary arbitrarily, which prevents our model from overfitting.
Ethics and Incentives. If we are to use probabilistic inference to score students in a MOOC, the end goal cannot simply be to optimize for accuracy. We must also consider fairness when deciding which variables to include in the model. It might be tempting, for example, to include variables such as race, ethnicity, and gender for better accuracy, but almost everyone would agree that these factors could not fairly be used within a scoring mechanism even if they improved prediction accuracy. Another example might be to model the temporal coherence of student grades (we observe a particularly strong temporal correlation between students' grades on consecutive homework assignments, with a Pearson coefficient of 0.46). But incorporating this temporal coherence into a scoring mechanism would not allow students to be given a clean slate on each homework.
An interesting but subtle facet of PG3 is that, by modelling a correlation between grader reliability and how well the grader did on the assignment, not only does getting a better grade on an assignment lead the model to believe that a particular student is a more reliable grader; if a student is a more reliable grader, the model is also led to believe that the student did better on the assignment. This relationship allows a student to convince the mechanism that they did better by grading as accurately as possible. Thus PG3 may in fact incentivize good grading. Giving students bonus points for better grading is not a new idea. The nuance of PG3, however, is that these bonus points are justified in a statistical sense.
3.4
Evaluation. To measure peer grading accuracy, we repeatedly simulate what score would have been assigned to each ground-truth submission had it been peer graded. Our evaluation of how well we would have graded a single ground-truth submission uses a two-step methodology (based on the evaluation method of [13]): (1) We run inference using all of our data except the peer grades of the ground-truth submission being evaluated. This gives us an estimate of each grader's biases and reliabilities, as well as model priors, independent of the submission being evaluated. (2) We run simulations in which we sample four student assessments randomly from the pool of peer grades for the ground-truth submission, estimate the submission's grade using the sampled assessments, and record the residual between our estimated grade and the true grade. For each ground-truth submission we run 3,000 such simulations, from which we report the RMSE, the fraction of simulations that fell within five and within ten percentage points of the true score, the average standard deviation of the errors over each ground truth, and the worst misgrade that the simulations produced.
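Step (2) of this methodology can be sketched as a short simulation loop (the pool of peer grades and the true score below are invented, and the median baseline stands in for the scoring model):

```python
import random
from statistics import median

random.seed(1)
true_score = 82.0
# all peer grades one ground-truth submission received (values invented)
pool = [79, 84, 88, 76, 81, 90, 73, 85]

residuals = []
for _ in range(3000):
    sample = random.sample(pool, 4)      # draw 4 assessments at random
    residuals.append(median(sample) - true_score)

rmse = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
within5 = sum(abs(r) <= 5 for r in residuals) / len(residuals)
print(round(rmse, 2), round(within5, 2))
```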
We compare each of our probabilistic models to the grade estimation algorithm used on Coursera's platform. In this baseline model, the score given to a student is the median of the four peer grades they received. Notably, the baseline estimation does not take into account an individual grader's biases or reliabilities, nor does it incorporate prior knowledge about the distribution of true grades.
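For concreteness, the baseline score for a submission with four peer grades (values invented) is just:

```python
from statistics import median

peer_grades = [78, 85, 81, 62]        # four invented peer assessments
baseline_score = median(peer_grades)  # mean of the two middle grades
print(baseline_score)                 # → 79.5
```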
4. EXPERIMENTAL RESULTS
4.1 Accuracy of reweighted peer grading
Using probabilistic models leads to substantially higher grading accuracy. In our experiments, we are able to reduce the RMS error of our prediction of the ground-truth grade by 33%, from 7.95 to 5.30. Similarly, on the second offering of the course we were able to reduce error by 31%, from 6.43 to 4.73. For the second offering, this means that the num-
RMSE, % within 5pp, % within 10pp, mean standard deviation, and worst grade for each model:

HCI 1:
Model      RMSE   % within 5pp   % within 10pp   Mean Std   Worst Grade
Baseline   7.95   51             81              7.23       -43
PG1-bias   5.42   69             92              5.00       -34
PG1        5.40   69             94              4.96       -30
PG2        5.40   71             94              4.92       -32
PG3        5.30   70             95              4.77       -30

HCI 2:
Model      RMSE   % within 5pp   % within 10pp   Mean Std   Worst Grade
Baseline   6.43   59             88              6.19       -36
PG1-bias   4.84   72             96              4.57       -26
PG1        4.81   73             96              4.52       -26
PG2        4.75   73             97              4.53       -25
PG3        4.73   74             97              4.52       -26
Figure 3: (a) Histogram of errors made using the baseline (median) scoring mechanism. (b) Histogram of errors using PG3. (c) A comparison of model confidence (x-axis) and actual success rate of predictions (y-axis), where being above the diagonal (dark bars) is better. (d) Number of submissions for which our model can declare confidence after K rounds of grading (15% after 2 rounds, 10% after 3, 13% after 4, 16% after 5; 46% need more than 5 rounds).
4.2
Surprisingly, while modeling grader bias is particularly effective, modeling grader reliability does little to improve our performance. To dig deeper into this result, we test our model on a synthetic dataset, one generated exactly from Model PG1. When using this synthetic data with only four grades per student, it is difficult for the model to correctly estimate grader reliability. Modeling variance for each grader only seems to have a notable impact when students
4.3
Applying probabilistic models to peer grading networks allows us to increase grading accuracy and to better allocate which submissions students should grade. Another product of our work is a belief distribution over the true score, grader bias, and grader reliability for each student. We can use this large dataset to derive new understanding about peer grading as both a formative and a summative assessment. We focus our investigation on two questions: (1) what factors influence how well a student grades? and (2) how does grading ability affect future class performance in a MOOC?
their peers' work (the students whose time spent grading has a z-score of less than -0.30) are both unreliable (the variance of their residuals is over 1 standard deviation away from the gradee's true score) and tend to slightly inflate grades. More surprising is that, over the tens of thousands of grades, there is a sweet spot of time spent grading. Students who grade assessments in a time with a z-score of around -0.25 have significantly lower residual standard deviations (p < 0.001, diff = 0.3 standard deviations) than students who take a long time to grade (i.e., whose time spent grading has a z-score > -0.20). This sweet spot is only visible when we look at normalized grading times. For most assignments in the HCI class, the sweet spot corresponds to around 20 minutes of grading. This may reflect both that with any less time a grader does not have enough of a chance to fully examine her gradee's work, and that a long grading session may mean that the grader had trouble understanding some facet of the submission.
Examining the relationship between grader grade, gradee grade, and the residual also reveals a set of notable trends. Graders that score higher on assignments have close to monotonically decreasing biases (Figure 2(a)). Getting a better grade on the homework in general makes students more reliable graders, with the notable exception that the students who get the best grades (+1.75 z-score) are not as accurate as the students who do very well (+0.75 z-score, p = 0.04). The superlative submissions (both the best and the worst) are the easiest to grade, and the submissions which are one standard deviation below the mean are the hardest (Figure 2(b)). Finally, our results show that students are least biased when grading peers with similar scores (Figure 2(c)). The best students significantly downgrade the worst submissions, and the worst students notably inflate the best submissions.
In addition to numerical scores, graders were asked to provide feedback in the form of free-form text comments to their gradees. In order to understand the relationship between grading performance and commenting style, we compare the grading residual against the comment length as well as the sentiment polarity of the comment (Figure 4(c)). To measure the polarity of a comment, we use the sentiment analysis word list from [15] and implement a simple sentiment analyzer that returns a (normalized) polarity score (positive or negative) proportional to the sum of word valences over the comment. For both comment length and polarity, we filter out all non-English words. We observe that comments that correspond to larger negative residuals are typically significantly longer, suggesting perhaps that students write more about the weaknesses of a submission than its strong points. That said, we observe that overall the comments mostly range in polarity from neutral to quite positive, suggesting that rather than being highly negative toward some submissions, many students make an effort to be balanced in their comments to peers.
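The valence-sum polarity score described above can be sketched as follows; the word list here is a tiny invented stand-in for the published list of [15].

```python
# invented mini word list: word -> valence
valence = {"great": 3, "good": 2, "clear": 1, "confusing": -2, "weak": -2}

def polarity(comment):
    """Sum of word valences, normalized by comment length."""
    words = comment.lower().split()
    score = sum(valence.get(w, 0) for w in words)
    return score / max(len(words), 1)

print(polarity("good structure but a confusing and weak argument"))  # → -0.25
```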
Figure 4: (a) Grader consistency (measured using the standard deviation of the grading residual) as a function of time spent grading; a sweet spot is visible. (b) ROC curve comparing performance (with a linear SVM) at predicting future class participation given a student's grade, bias, reliability, or all three (AUC = 0.976 with all features). (c) Commenting style (length of comment and sentiment polarity) as a function of grading residual.
better predict future engagement. We tested this hypothesis by constructing a classification task in which we predict whether a student would participate in the next assignment (or, conversely, which students would drop out). In addition to the student's grade, we experimented with including grader bias and reliability as features in a linear classifier. Our results (Figure 4(b)) show that including grader bias and reliability improved our predictive ability by 5pp, from an area under the curve (AUC) score of 0.93 to an AUC of 0.98. Properties of how a student grades capture a dimension of their engagement which is missed by their assignment grade.
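The kind of comparison behind Figure 4(b) can be sketched with a small pure-Python AUC; the classifier scores and labels below are invented, not the study's data.

```python
def auc(scores, labels):
    """Probability a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 1, 0, 1]   # participated in the next assignment?
grade_only = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7, 0.6, 0.85]
with_grading_feats = [0.95, 0.85, 0.6, 0.3, 0.2, 0.75, 0.4, 0.9]

print(auc(grade_only, labels), auc(with_grading_feats, labels))
```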
5. RELATED WORK
6.
in grading time that we observed would be helpful in teaching graders. The trend between time spent grading and reliability also raises the question of how to incentivize students to spend enough time grading to provide careful, high-quality feedback to their peers, particularly in an open-access course. Using model PG3 for scoring, as we discussed, makes a student's score dependent on grading performance, and may be one way to build a justified incentive directly into the scoring mechanism. There are also a number of challenging theoretical open questions on the mechanism design issues behind peer grading, which [6] has taken steps to address. Similarly, the subtle patterns between grader score and reliability that are visible given our volume of data add an interesting piece of evidence to explore from an educational perspective. While it is unsurprising that good students grade better, we also observe that the very best students in the class were worse graders than students in the 70th percentile. What is the force behind this trend?
There remain a number of issues to be addressed in future work. We have considered the problem of determining which submissions need to be allocated additional graders. However, deciding which grader is best for evaluating a particular submission is an open problem whose solution could depend on a number of variables, from the writing styles of the grader and gradee to their respective cultural or linguistic backgrounds, a particularly important issue for the global-scale course rosters that arise in MOOCs. Moreover, in our study we observed that the mean of hundreds of students who graded the same assignment was more reliable than volunteer staff grades. This points to a more in-depth investigation into how accurate expert grades really are.
Finally, it is not clear how to present scores calculated by a complicated peer grading model to students. While this communication might be easy when a student's final grade is simply set to be the mean or median of peer grades, does each student need to know the inner workings of a more sophisticated statistical backend? Students may be unhappy with the lack of transparency in grading mechanisms, or, on the other hand, might feel more satisfied with their overall grade.
As MOOCs become more widespread, the need for reliable
grading and feedback for open ended assignments becomes
ever more critical. By addressing the shortcomings of current peer grading systems, we hope that students everywhere
can get more from peer grading and consequently, more from
their free online, open access educational experience.
Acknowledgments
We thank Chinmay Kulkarni and Scott Klemmer for providing assistance with the HCI datasets and Leonidas Guibas
and John Mitchell for discussions and support. Jonathan
Huang is supported by an NSF Computing Innovations Fellowship.
7. REFERENCES
Human-Computer Interaction Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
marau@cs.cmu.edu

Richard Scheines
Department of Philosophy, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
scheines@cmu.edu

Vincent Aleven
Human-Computer Interaction Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
aleven@cs.cmu.edu

Nikol Rummel
Institute of Educational Research, Ruhr-Universität Bochum, Universitätsstraße 150, 44801 Bochum, Germany
nikol.rummel@rub.de
ABSTRACT
Conceptual understanding of representations and fluency in using representations are important aspects of expertise. However, little is known about how these competencies interact: does representational understanding facilitate learning of fluency (understanding-first hypothesis), or does fluency enhance learning of representational understanding (fluency-first hypothesis)? We analyze log data obtained from an experiment that investigates the effects of intelligent tutoring system (ITS) support for understanding and fluency in connection-making between fractions representations. The experiment shows that instructional support for both representational understanding and fluency is needed for students to benefit from the ITS. In analyzing the ITS log data, we contrast the understanding-first hypothesis and the fluency-first hypothesis, testing whether errors made during the learning phase mediate the effect of experimental condition. Finding that a simple statistical model does not fit the data, we searched over all plausible causal path analysis models. Our results support the understanding-first hypothesis but not the fluency-first hypothesis.
Keywords
Causal path analysis modeling, multiple representations, intelligent tutoring systems.
1. INTRODUCTION
Representational understanding and representational fluency are
important aspects of learning in any domain [1]. When working
with representations (e.g., formulae, line graphs, path diagrams),
students need conceptual understanding of these representations
(representational understanding). Students also need to use the representations to solve problems quickly and effortlessly (representational fluency). Science and mathematics instruction typically
employs multiple graphical representations to help students learn
about complex domains [2]. For instance, instructional materials
for fractions use circle and rectangle diagrams to illustrate fractions as parts of a whole, and number lines to depict fractions in
the context of measurement [3-5]. Multiple representations have
been shown to lead to better learning than a single representation,
provided that students make connections between them [6-7]: to
benefit from the multiplicity of representations, students need to
conceptually understand how different representations relate to
one another, and they need to translate between them [8-11]. Yet,
students find it difficult to make these connections [8], and tend
not to make them spontaneously [12]. Therefore, they need to be
supported in doing so [7]. Based on [1], we distinguish between
representational understanding as conceptual understanding of
connections between different graphical representations, and re-
Figure 1. Worked example support for representational understanding: students use a worked example with a rectangle (part A,
upper left) to guide their work on a fractions problem with a number line (part B, upper right). At the end (part C, bottom), students are prompted to integrate both representations by responding to drop-down menu questions.
To gain further insights into how support for representational understanding and representational fluency affect students' interactions with an ITS for fractions, we employ causal path analysis. In doing so, we contrast mediation models that correspond to the understanding-first hypothesis and to the fluency-first hypothesis. Specifically, we investigate whether errors that students make during the learning phase mediate the interaction effect between support for representational understanding and representational fluency on students' learning. Our results are in line with the understanding-first hypothesis, but not with the fluency-first hypothesis.
The remainder of this paper is structured as follows. We first describe the ITS that we used to carry out the experimental study. We then provide a brief overview of the experimental design and the results obtained from the analysis of pretests and posttests. The main focus of this paper is on describing the causal path analysis we conducted to investigate the interaction of instructional support for representational understanding and representational fluency on students' learning behaviors as identified by the tutor log data. We end by discussing the implications of our analysis for the instructional design of learning materials, and by outlining open questions that future research should address.
fractions and fraction addition. Taken together, the Fractions Tutor comprises about ten hours of supplemental instructional material. Students solve tutor problems by interacting both with fractions symbols and with the graphical representations. As is common with Cognitive Tutors, students receive error feedback and
hints on all steps. In addition, each tutor problem includes conceptually oriented prompts to help students relate the graphical representations to the symbolic notation of fractions.
3. EXPERIMENT
The goal of the experimental study (cf. [14] for a detailed description) was to investigate the hypothesis that students learn more
robustly when receiving instructional support for both representational understanding and support for representational fluency. We
conducted a classroom experiment with 599 4th- and 5th-grade
students from five elementary schools in the United States. Students worked with the Fractions Tutor for about ten hours during
their regular mathematics class.
We contrasted two experimental factors. One factor, support for representational understanding in making connections, had three levels: no support; auto-linked support, in which the Fractions Tutor automatically made changes in one representation as students manipulated another; and worked examples. Figure 1 provides an example of a Fractions Tutor problem that uses worked examples (WEs) to support representational understanding. Students used a worked example with a familiar representation as a guide to make sense of an isomorphic problem with a less familiar representation. This factor was crossed with a second experimental factor, namely, whether or not students received support for representational fluency in making connections: students had to visually estimate whether different types of graphical representations showed the same fraction. Figure 2 shows an example of a fluency-building problem (FL). Students in all conditions worked on 80 tutor problems: eight problems per topic (e.g., equivalent fractions, addition, subtraction, etc.). In each topic, the first four tutor problems were single-representation problems (i.e., they included only a circle, only a rectangle, or only a number line, and no connection-making support). The last four tutor problems were multiple-representation problems and differed between the experimental conditions. For instance, students in the worked-examples-only condition (WE) received four worked-example problems. Students in the fluency-only condition (FL) received four fluency-building problems. Students in the worked examples plus fluency condition (WE-FL) received two worked-example problems, followed by two fluency-building problems. Table 1 illustrates this procedure for two consecutive topics for each of these three conditions. The same sequence of eight problems was repeated for each of the ten topics the Fractions Tutor covered.

Figure 2. Fluency-building support: students sort graphical representations by dragging-and-dropping them into slots that show equivalent fractions.
Table 1. Problem sequence per condition: for each topic, problems 1-4 (P1-P4) are single-representation problems (S); problems 5-8 (P5-P8) are multiple-representation problems: worked examples (WE) or fluency-building problems (FL).

Cond.   Topic   P1  P2  P3  P4  P5  P6  P7  P8
WE      1       S   S   S   S   WE  WE  WE  WE
WE      2       S   S   S   S   WE  WE  WE  WE
FL      1       S   S   S   S   FL  FL  FL  FL
FL      2       S   S   S   S   FL  FL  FL  FL
WE-FL   1       S   S   S   S   WE  WE  FL  FL
WE-FL   2       S   S   S   S   WE  WE  FL  FL
4. DATA SET
The analyses in this paper are based on the data obtained from the
experimental study just described. Students in the experiment
received a pretest on the day before they started to work with the
Fractions Tutor. The day after students finished working with the
Fractions Tutor, they received an immediate posttest. One week
after the immediate posttest, students were given a delayed posttest. All three tests were equivalent (i.e., they contained the same
items with different numbers). Students worked with the Fractions
Tutor for about ten hours and had to complete each tutor problem.
All interactions with the Fractions Tutor were logged.
(i.e., WE vs. WE-FL for error types that students could make on
worked-example problems, and FL vs. WE-FL for error types that
students could make on fluency-building problems). For both
analyses, we adjusted for multiple comparisons using the Bonferroni correction. On worked-example problems, six error types
differed significantly between conditions, but only two error types
were significant predictors of posttest performance (both of them
passed both the Chi-square test and the regression test). On fluency-building problems, eight error types differed significantly between conditions, and four were significant predictors of posttest
performance (three of them passed both the Chi-square test and
the regression test). Table 3 provides an overview of the error
types we selected for further analyses.
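The condition comparison used in this selection step can be sketched as a simple chi-square test of whether an error type's count differs between two conditions, assuming equal expected counts under the null (the counts below are invented):

```python
def chi_square_2cell(count_a, count_b):
    """Chi-square statistic for two observed counts vs. equal expected counts."""
    expected = (count_a + count_b) / 2
    return sum((o - expected) ** 2 / expected for o in (count_a, count_b))

# invented error counts for one error type in two conditions
stat = chi_square_2cell(120, 180)
print(stat)  # compare against the chi-square critical value (3.84 at df=1, p=.05)
```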
Table 3. Selected error types and number of errors per condition.

Error type             Description                                     # in WE   # in FL   # in WE-FL
place1Error            Locating 1 on the number line given a dot on
                       the number line and the fraction it shows       150       n/a       222
SE-Error               Self-explanation error: response to reflection
                       questions in drop-down menu format              1320      n/a       1629
equivalenceError       Finding equivalent fraction representations     n/a       2899      2157
improperMixedError                                                     n/a       1380      1608
NameCircleMixedError                                                   n/a       355       126
Test scores per condition (SD in parentheses):

Cond.   Pretest     Immediate posttest   Delayed posttest
WE      .36 (.22)   .43 (.20)            .49 (.26)
FL      .31 (.21)   .37 (.22)            .44 (.24)
WE-FL   .39 (.21)   .52 (.24)            .58 (.26)
A first step in this analysis was to use the tutor log data to identify measures of errors that students made on these problems. Rather than using the overall error rate, we applied the knowledge component model [17] that underlies the problem structure of the Fractions Tutor to categorize the errors students made while working on the tutor problems. Doing so allows for a much more fine-grained analysis of students' errors than the overall error rate does. The knowledge component model describes a meaningful set of steps within a tutor problem which provide opportunities to practice a unit of knowledge. For example, every time a student is asked to enter the numerator of a fraction, he/she has the opportunity to practice knowledge about what the numerator of a fraction is. Worked-example problems and fluency-building problems cover different sets of knowledge components, but the same knowledge components occur repeatedly across different worked-example problems and fluency-building problems, respectively. Altogether, the knowledge component model led to 12 types of errors that students could make on worked-example problems, and 11 types of errors that students could make on fluency-building problems.
Next, we had to narrow the number of error categories to include in the causal path analysis model. We included only those error types which (1) were significant predictors of students' posttest performance, while controlling for pretest performance, and (2) significantly differed between conditions. To determine whether an error type was a significant predictor of students' immediate posttest performance, we conducted linear regression analyses with posttest performance as the dependent variable, and pretest performance and number of errors of the type as predictors.
To determine whether error types differed significantly between
conditions, we conducted Chi-square tests with number of error
type as dependent variable and condition as independent variable
In path models of this type, also called "causal graphs" [22], each arrow, or directed edge, represents a direct causal relationship relative to the other variables in the model. For example, in Figure 3 the condition is a direct cause of the mediator
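In a linear path model, the total effect of one variable on another is the sum, over directed paths, of the products of edge coefficients. A minimal sketch (the edge weights below are invented, not the paper's estimates):

```python
# directed edges of a tiny condition -> errors -> posttest path model
edges = {
    ("condition", "errors"): 0.4,     # support increases this error type
    ("errors", "posttest"): -0.5,     # errors depress posttest score
    ("condition", "posttest"): 0.1,   # direct effect of condition
}

direct = edges[("condition", "posttest")]
mediated = edges[("condition", "errors")] * edges[("errors", "posttest")]
total = direct + mediated
print(round(direct, 2), round(mediated, 2), round(total, 2))
```

Here a positive direct effect can be partly cancelled by a negative mediated effect, which is exactly the pattern the mediation analyses below examine.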
Figure 6 shows a model found by GES for the fluency-first hypothesis. The model fits the data well (χ² = 8.32, df = 5, p = .14). Students with higher pretest scores make fewer SE-Errors and perform better on both posttests. Having fluency-building support (i.e., being in the WE-FL condition as opposed to being in the WE condition) increases SE-Errors, which reduces performance on the immediate and the delayed posttest. In other words, SE-Errors mediate a negative effect of fluency-building support. There are no further mediations of having fluency-building support, but there is a direct positive effect of fluency-building support on students' performance on the immediate posttest. (See Table 3 for a description of the errors.)
5.3 Results
Figure 5 shows a model found by GES for the understanding-first hypothesis, with coefficient estimates included. The model fits the data reasonably well (χ² = 16.10, df = 6, p = .013). Students with higher pretest scores make fewer nameCircleMixedErrors, and they perform better on the immediate and the delayed posttest. Receiving worked-example support for representational understanding (i.e., being in the WE-FL condition and not in the FL condition) increases nameCircleMixedErrors, which in turn decreases performance on the immediate posttest. In other words, nameCircleMixedErrors mediate a negative effect of worked examples on students' learning. Receiving worked-example support for representational understanding also reduces equivalenceErrors and improperMixedErrors. Since making more improperMixedErrors leads to worse performance on the immediate and the delayed posttests, equivalenceErrors and improperMixedErrors mediate the positive effect of the worked-example support on students' learning. Support for representational understanding through worked examples does not have a direct impact on students' posttest performance. The overall positive effect of worked examples on students' learning through equivalenceErrors and improperMixedErrors is larger than the negative effect through nameCircleMixedErrors. (See Table 3 for a description of the errors.)
Provided the generating model satisfies the parametric assumptions of the algorithm, the probability that the output equivalence class contains the generating model converges to 1 in the limit as the data grows without bound. In simulation studies, the algorithm is quite accurate on small to moderate samples.

All the DAGs represented by a pattern have the same BIC score, so a pattern's BIC score is computed by taking an arbitrary DAG in its class and computing its BIC score.

The usual logic of hypothesis testing is inverted in path analysis: a low p-value means the model can be rejected.
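The BIC scoring used to rank models in a GES-style search can be sketched as follows (one common sign convention, where lower is better; the log-likelihood and parameter count are invented, and 599 echoes the study's sample size):

```python
import math

def bic_score(log_likelihood, n_params, n_samples):
    """BIC = -2 * logL + k * ln(n); lower is better under this convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# invented fit values for two candidate DAGs over the same data
score_a = bic_score(-1200.0, 9, 599)
score_b = bic_score(-1198.0, 12, 599)   # slightly better fit, more parameters
print(round(score_a, 1), round(score_b, 1))
```

Here the better raw fit of the second model does not pay for its extra parameters, illustrating how the score trades off fit against complexity.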
Figure 5. The model found by GES for the understanding-first hypothesis, with parameter estimates included.
6. DISCUSSION
Taken together, results from the causal path analysis models support the understanding-first hypothesis but not the fluency-first
hypothesis: receiving worked-example support for representational understanding helps students learn from fluency-building prob-
need to develop representational fluency in using graphical representations to solve problems, and they need to effortlessly translate between different kinds of representations. But representational understanding and representational fluency are not limited
to learning with graphical representations: representational understanding and representational fluency also play a role in using
symbolic and textual representations. For example, should students acquire representational fluency in applying a formula to
solve physics problems before understanding the conceptual aspects the formula describes, or should they first conceptually understand the phenomenon of interest and then learn to apply a
formula to solve problems related to that phenomenon? This is a
crucial question for instructional design and one that remains
open. While the analysis presented in this paper takes an important step towards answering this question by providing novel insights into how representational understanding and representational fluency interact, more research is needed to investigate
implications and applications related to the question of how best
to support students to develop expertise with representational
understanding and representational fluency.
The use of search algorithms over plausible causal path analysis
models is a promising method for analyzing the effects of instructional interventions on learning, because it offers insight into how an
intervention affects problem-solving behaviors, and how these
effects account for the advantage of one intervention over the
other. Basing our analysis on cognitive task analysis and knowledge component modeling, we make use of common techniques
in the analysis of tutor log data [23]. The results from our causal
path analysis not only provide insights into the nature of the interaction effect observed in the experimental study, but also raise new hypotheses
that can be empirically tested in future research. In this way, our
findings illustrate that causal path analysis modeling is a useful
technique to augment regular tutor log data analysis.
8. ACKNOWLEDGEMENTS
This work was supported by NSF REESE award 21851-1-1121307 and by the PSLC, funded
by NSF award number SBE-0354420. We thank Ken Koedinger, the DataShop team, and the participating students and teachers.
9. REFERENCES
[1] Koedinger, K.R., Corbett, A.T., Perfetti, C.: The Knowledge-Learning-Instruction Framework: Bridging the Science-Practice Chasm to Enhance Robust Student Learning.
Cognitive Science, 36, 757-798 (2012).
[2] NMAP: Foundations for Success: Report of the National
Mathematics Advisory Board Panel. U.S. Government
Printing Office (2008).
[3] Cramer, K. Using models to build an understanding of
functions. Mathematics teaching in the middle school, 6,
310-318 (2001).
[4] Charalambous, C.Y., Pitta-Pantazi, D.: Drawing on a
Theoretical Model to Study Students' Understandings of
Fractions. Educational Studies in Mathematics, 64, 293-316
(2007).
[5] Siegler, R. S., Thompson, C. A. and Schneider, M. An
integrated theory of whole number and fractions
development. Cognitive Psychology, 62, 273-296 (2011).
[6] Rau, M.A., Aleven, V. Rummel, N.: Intelligent tutoring
systems with multiple representations and self-explanation
prompts support learning of fractions. In: Dimitrova, V. et al.
(eds) Proceedings of the 2009 conference on Artificial
Intelligence in Education. IOS Press, Amsterdam, The
Netherlands. pp. 441-448 (2009).
[7] Bodemer, D., Ploetzner, R., Feuerlein, I., Spada, H.: The
Active Integration of Information during Learning with
Dynamic and Interactive Visualisations. Learning and
Instruction, 14, 325-341 (2004).
[8] Ainsworth, S.: DeFT: A conceptual framework for
considering learning with multiple representations. Learning
and Instruction, 16, 183-198 (2006).
[9] de Jong, T., Ainsworth, S. E., Dobson, M., Van der Meij, J.,
Levonen, J., Reimann, P., Simerly, C. R., Van Someren, M.
W., Spada, H. and Swaak, J. Acquiring knowledge in science
and mathematics: The use of multiple representations in
technology-based learning environments. Oxford (1998).
[10] Bodemer, D. and Faust, U. External and mental referencing
of multiple representations. Computers in Human Behavior,
22, 27-42 (2006).
[11] Schwonke, R., Berthold, K. and Renkl, A. How multiple
external representations are used and how they can be made
more useful. Applied Cognitive Psychology, 23, 1227-1243
(2009).
[12] Rau, M.A., Rummel, N., Aleven, V., Pacilio, L., Tunc-Pekkan, Z.: How to schedule multiple graphical
representations? A classroom experiment with an intelligent
tutoring system for fractions. In: Aalst, J.v., et al. (eds) The
future of learning: Proceedings of the 10th ICLS, ISLS,
Sydney, Australia. 64-71 (2012).
[13] Kellman, P. J., Massey, C. M. and Son, J. Y. Perceptual
Learning Modules in Mathematics: Enhancing Students
Pattern Recognition, Structure Extraction, and Fluency.
Topics in Cognitive Science, 2, 285-305 (2009).
[14] Rau, M. A., Aleven, V., Rummel, N. and Rohrbach, S. Sense
Making Alone Doesn't Do It: Fluency Matters Too! ITS
Support for Robust Learning with Multiple Representations.
Springer Berlin / Heidelberg (2012).
[15] VanLehn, K. The relative effectiveness of human tutoring,
intelligent tutoring systems and other tutoring systems.
Educational Psychologist, 46, 197-221 (2011).
[16] Aleven, V., McLaren, B. M., Sewall, J. and Koedinger, K. R.
Example-tracing tutors: A new paradigm for intelligent
tutoring systems. International Journal of Artificial
Intelligence in Education, 19, 105-154 (2008).
[17] Baker, R. S. J. d., Corbett, A. T. and Koedinger, K. R. The
difficulty factors approach to the design of lessons in
intelligent tutor curricula. International Journal of Artificial
Intelligence in Education, 17, 341-369 (2007).
[18] Rau, M. A., Aleven, V. and Rummel, N. Blocked versus
Interleaved Practice with Multiple Representations in an
Intelligent Tutoring System for Fractions, In: Yacef, K. et al.
(eds), Proceedings of the 5th Intl Conference of ITS, Springer
Berlin / Heidelberg (2010).
[19] Chickering, D. M. Optimal Structure Identification with
Greedy Search. Journal of Machine Learning Research, 3, 507-554 (2002).
[20] Pearl, J. Causality: Models, Reasoning, and Inference.
Cambridge University Press (2000).
[21] Spirtes, P., Glymour, C. and Scheines, R. Causation,
Prediction, and Search. MIT Press (2000).
[22] Robins, J., Scheines, R., Spirtes, P. and Wasserman, L.:
Uniform Consistency in Causal Inference. Biometrika, 90,
491-515 (2003).
[23] Koedinger, K. R. et al.: A data repository for the EDM community: The PSLC DataShop. In: C. Romero (ed.), Handbook of educational data mining (pp. 10-12). Boca Raton,
FL: CRC Press (2010).
Keywords
Cognitive Tutors, Assessment, Mathematics.
1. INTRODUCTION
Cognitive Tutors are primarily developed as instructional systems,
with a focus on improving student learning. While the systems
continually assess student knowledge with respect to a set of
underlying knowledge components [6], the standard for
effectiveness of an educational system is usually taken to be the
ability of that system to produce improved performance on an
external measure, typically a standardized test.
Carnegie Learning's Cognitive Tutors for mathematics have done
well on such measures [13, 15, 18] but, in these studies, the
tutoring system has been, essentially, treated as a black box. We
know that, as a whole, students using a curriculum involving the
Cognitive Tutor outperformed students using a different form of
instruction on standardized tests, but we don't know what specific
aspects of tutor use were associated with improved performance.
An understanding of the process variables (time, errors, hint usage
and other factors) that are correlated with learning can provide us
with insight into the specific student activities that seem to lead to
learning. Another perspective on the Tutor is that, if we are able to
strongly correlate Cognitive Tutor data with standardized test
data, then the Cognitive Tutor itself may be considered an
assessment, which is validated with respect to the external
standardized test. In addition, to the extent that we can identify
process variables that predict external test scores, we can provide
guidance to teachers as to expectations for their students on the
state examinations.
In most cases, Carnegie Learning does not have access to student-level outcome data on standardized tests. For the study reported
here, we partnered with a school district in Eastern Virginia. The
district provided Carnegie Learning with student data for all 3224
middle school students who used Cognitive Tutor in the district.
2. DISTRICT IMPLEMENTATION
The students in this study used Cognitive Tutor software
developed for middle school mathematics (grades 6, 7 and 8) in
12 schools in the district. The software was used by 3224
students: 1060 in sixth grade; 1354 in seventh grade and 810 in
eighth grade.
Carnegie Learning delivers software sequences aligned to
educational standards for these grades. The sixth grade sequence
contains 45 units and 131 sections; the seventh grade sequence
contains 34 units and 92 sections and the eighth grade sequence
contains 37 units and 88 sections. The school district also created
software sequences targeted towards students who were
performing below grade level, as part of a Response to
Intervention (RTI) implementation [10].
Carnegie Learning recommends that, when used as a basal school
curriculum, students use the Cognitive Tutor as part of their math
class two days/week. Assuming 45-minute classes and a 180-day
school year, this would result in approximately 54 hours of use.
Due to scheduling issues, absenteeism and other factors, it is
common for districts to average only 25 hours, however. RTI
implementations may involve more intensive (5 days/week)
practice on prerequisite skills, typically completed in an RTI class
which takes place in parallel with work in the basal class. Thus,
students in an RTI sequence may be asked to work on the Tutor
twice as much (or more) than those who are not in an RTI
sequence. On the other hand, if students in the RTI class are able
to demonstrate mastery of the target material, they are removed
from that class, so the RTI class does not necessarily last all year.
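As a quick check of the arithmetic behind the recommended-usage figure:

```python
# Recommended Cognitive Tutor usage implied by the guidelines above:
# 2 tutor days per week, 45-minute classes, 180-day school year.
school_days, days_per_week = 180, 5
weeks = school_days / days_per_week    # 36 instructional weeks
tutor_sessions = weeks * 2             # two tutor days per week
hours = tutor_sessions * 45 / 60       # 45-minute classes
print(hours)  # 54.0
```

This matches the approximately 54 hours cited, roughly double the 25 hours districts actually average.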
The Tutor is available to students through a browser, so they can
go home (or elsewhere) and continue work that they began at
school. However, many students do not use the Tutor outside of
class and so, for most students, the amount of time that they spend
with the tutor is dictated by the frequency with which their teacher
tells them to use the software.
Our analysis does not distinguish between students who used the
Tutor in an RTI capacity, as a basal curriculum or in some other
capacity.
Across the schools and grade levels in our data set, usage varied
widely, from a median of 1.05 hours in grade 8 in one school to a
median of 29.86 hours in grade 7 in a different school.
Figure 1 provides a schematic for understanding overall
performance of students at the schools involved in the study.
There are three graphs, representing the three grade levels. Each
school is represented by a dot, with the size of the dot
proportional to the number of students in that school and grade
level. The vertical position of the dot represents the schools
overall 2012 SOL score. The horizontal line represents the state
average for the grade level. The figure shows that students in the
district reflect a range of abilities relative to the state, with
students somewhat underperforming the state mean in all grades.
3. ANALYSIS APPROACH
Aggregate variables
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
171
several other activities. This variable was log transformed.
Section-normalized variables
Lunch_status: this is a common proxy for socioeconomic status. We coded this as a binary variable indicating whether students were eligible for free or reduced-price lunch. 72.5% of students were in this category.
Race                     Students
American Indian          20
Asian                    72
Black/African American   1064
White                    785
Multi-racial             72
4. RESULTS
4.1 Fitted Models
The process variables found in M2 and included in M3 and M4 were total_problem_time, skills_encountered, sections_encountered, assistance_per_problem, and sections_mastered_per_hour. A summary of the standardized model (M2) coefficients is shown in Table 1.
Table 1: Cognitive Tutor process variables and standardized coefficients included in the models predicting SOL.

Variable                     Coefficient   p value
assistance_per_problem       -0.351        <2e-16
sections_encountered         0.422         0.004028
sections_mastered_per_hour   0.390         6.71e-10
skills_encountered           -0.456        0.000141
total_problem_time           0.258         0.000502
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
172
Table 2: Standardized coefficients for M3 (Cognitive Tutor process variables plus demographics).

Variable                     Coefficient   p value
assistance_per_problem       -0.340        <2e-16
sections_encountered         0.369         0.011183
sections_mastered_per_hour   0.368         4.04e-09
skills_encountered           -0.403        0.000663
total_problem_time           0.240         0.001043
lunch_status                 -0.106        3.30e-05
age                          -0.071        0.003737

Table 3: Standardized coefficients for M5 (the full model, adding RIT pretest).

Variable                     Coefficient   p value
assistance_per_problem       -0.134        1.02e-05
sections_encountered         0.188         0.141868
sections_mastered_per_hour   0.272         7.80e-07
skills_encountered           -0.262        0.011882
total_problem_time           0.219         0.000669
lunch_status                 -0.070        0.00166
age                          -0.028        0.197271
RIT pretest                  0.476         <2e-16
Table 4: Model fit (BIC and R2) for the five models, fit to the seventh grade data.

Model           BIC        R2
M1 (RIT)        2041.451   0.50
M2 (CT)         2181.015   0.43
M3 (CT+Demog)   2167.764   0.45
M4 (RIT+Demog)  2030.582   0.51
M5 (Full)       1928.369   0.57

[Figure: predicted vs. actual SOL scores, with points marked as Learning Disability, No Disability, or Physical Disability.]
Table 5 shows fit for the models as applied to sixth and eighth
grade students' SOL scores (and as applied to the full population
of students), using the variables and coefficients found by fitting
the seventh grade student data. The model fits the sixth grade data
remarkably well, with an R2 of 0.62, higher even than the fit to the
seventh grade data. The fit to the eighth grade data is not as strong,
with an R2 of 0.32. This may be due both to the smaller original
population in eighth grade and to the relatively low usage. Median
usage for eighth graders was only 16.3 hours (as opposed to 20.3
hours in sixth grade), and only 438 eighth graders (54%) used the
tutor for more than five hours.
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
173
Table 5: R2 for the five models, fit to seventh grade data and applied to other populations.

Model   Grade 6   Grade 8   All grades
M1      0.57      0.30      0.40
M2      0.46      0.18      0.38
M3      0.46      0.18      0.42
M4      0.57      0.30      0.43
M5      0.62      0.32      0.51
Table 6: Coefficients and p values for M5 with RIT posttest as the outcome.

Variable                     Coefficient   p value
assistance_per_problem       -0.186        1.68e-15
sections_encountered         0.267         0.006
sections_mastered_per_hour   0.031         0.45747
skills_encountered           -0.206        0.00927
total_problem_time           0.003         0.95771
lunch_status                 -0.044        0.0092
age                          0.008         0.64494
RIT pretest                  0.677         <2e-16
Figure 4 demonstrates the fit of M5 to the SOL data for all grade
levels.
Table 7 shows fits for the five models, as applied to the RIT
posttest. As expected, RIT pretest predicts RIT posttest better than
the RIT pretest predicts the SOL posttest (R2 for M1/SOL for 7th
grade is 0.50 vs. 0.72 for M1/RIT). Even given this good fit for
M1, process variables and demographics significantly improve the
model, reducing BIC from 1502 to 1409.
Table 7: R2 (by grade) and BIC (grade 7) for the five models predicting the RIT posttest.

Model   Grade 6   Grade 7   Grade 8   All    BIC (Grade 7)
M1      0.67      0.72      0.59      0.68   1502.128
M2      0.46      0.49      0.26      0.41   2085.297
M3      0.48      0.50      0.27      0.40   2077.530
M4      0.68      0.72      0.59      0.68   1501.578
M5      0.71      0.75      0.60      0.71   1408.899
Note that, with RIT as the outcome variable, sections_mastered_per_hour, total_problem_time and age are no longer significant predictors in the model. It may be that RIT, as an adaptive test, imposes less time pressure on students, since there is no apparent set of questions to be completed in a fixed amount of time.
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
174
5. DISCUSSION
The work presented here provides a good model of how we might
use Cognitive Tutor, either with or without additional data, to
predict student test outcomes on standardized tests. The model
was able to generalize to different student populations, and the
variables found for a model to predict SOL provided strong
predictions of RIT as well.
Surprisingly to us, demographic factors proved to be relatively
unimportant to our models.
Since we were able to improve on the RIT pretest model by
adding Cognitive Tutor process variables, our efforts show that
such variables provide predictive power beyond that provided by
a standardized pretest, even when the pre- and post-test are
identical (as is the case with the RIT outcome). A consideration of
the types of information that may be contained in Cognitive Tutor
data but not in pretest data provides us with guidance on how we
might extend this work and improve our model. We will consider
five broad categories of factors: learning, content, question format,
process and motivation.
Learning: The most obvious difference between the RIT pretest
and the Cognitive Tutor process variables is that the RIT provides
information about students at the beginning of the school year,
while Cognitive Tutor data is collected throughout the year. One
extension of this work in exploring the role of learning would be
While we are very encouraged by the results that we have seen in this paper, we recognize that more detailed data may give us a better ability to predict student test outcomes from Cognitive Tutor data.
6. ACKNOWLEDGMENTS
7. REFERENCES
[1] Arnold, K.E. 2010. Signals: Applying Academic Analytics. Educause Quarterly, 33, 1.
[2] Baker, R.S.J.d. 2007. Modeling and Understanding Students' Off-Task Behavior in Intelligent Tutoring Systems. In Proceedings of ACM CHI 2007: Computer-Human Interaction, 1059-1068.
[3] Baker, R.S., Corbett, A.T., Koedinger, K.R. 2004. Detecting Student Misuse of Intelligent Tutoring Systems. Proceedings of the 7th International Conference on Intelligent Tutoring Systems, 531-540.
[4] Baker, R.S.J.d., Gowda, S.M., Wixon, M., Kalka, J., Wagner, A.Z., Salvi, A., Aleven, V., Kusbit, G., Ocumpaugh, J., and Rossi, L. 2012. Towards Sensor-Free Affect Detection in Cognitive Tutor Algebra. In Proceedings of the Fifth International Conference on Educational Data Mining, 126-133.
[5] Beck, J. E., Lia, P. and Mostow, J. 2004. Automatically assessing oral reading fluency in a tutor that listens. Technology, Instruction, Cognition and Learning, 1, 61-81.
[6] Corbett, A.T. and Anderson, J. R. 1992. Student modeling and mastery learning in a computer-based programming tutor. In C. Frasson, G. Gauthier and G. McCalla (Eds.), Intelligent Tutoring Systems: Second international conference proceedings (pp. 413-420). New York: Springer-Verlag.
[7] Corbett, A.T., Anderson, J.R. 1995. Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling & User-Adapted Interaction, 4, 253-278.
[8] Cox, P.A. 2011. Comparisons of Selected Benchmark Testing Methodologies as Predictors of Virginia Standards of Learning Test Scores. Doctoral Thesis. Virginia Polytechnic Institute and State University.
[9] Feng, M., Heffernan, N.T., & Koedinger, K.R. 2009. Addressing the assessment challenge in an Online System that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal), 19(3), 243-266, August, 2009.
[10] Gersten, R., Beckmann, S., Clarke, B., Foegen, A., Marsh, L., Star, J. R., & Witzel, B. 2009. Assisting students struggling with mathematics: Response to Intervention (RtI) for elementary and middle schools (NCEE 2009-4060). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/wwc/publications/practiceguides/
[11] Good, C., Aronson, J., & Harder, J. A. 2008. Problems in the pipeline: Stereotype threat and women's achievement in high-level math courses. Journal of Applied Developmental Psychology, 29, 17-28.
Teachers College Columbia University, 525 W 120th St. New York, NY 10027
Worcester Polytechnic Institute, 100 Institute Rd. Worcester, MA 01609
Keywords
College Enrollment, Affect Detection, Knowledge Modeling,
Educational Data Mining
1. INTRODUCTION
The process leading a student to choose to attend college starts
early, and decisions can begin to solidify as early as middle
school (ages 12-14). Especially in the United States, successful
learning experiences which develop key skills build positive self-beliefs, interests, goals and actions, making students likely to
actively seek and plan higher educational goals and career
aspirations [31]. As students go through middle school, they
increasingly find themselves engaged or disengaged from school
and learning. This process is driven in part by changes in
students' self-perceptions, such as whether they see themselves as smart
and capable of taking the courses in high school. This leads to
students making decisions about how academic achievement,
certain careers, and college majors fit into their self-perception
[14].
It is during middle school that students either start to value
academic achievement or begin to get off track and start to
2. METHODOLOGY
2.1 The ASSISTment System
2.2 Data
2.2.1 ASSISTments Data
Action log files from the ASSISTment system were obtained for
a population of 3,747 students from middle schools in New
England, who used the system at various times from school
years 2004-2005 to 2006-2007 (with a few students
continuing tutor usage until 2007-2008 and 2008-2009). These
students were drawn from three districts that used the
ASSISTment system systematically during the year. One district
was urban, with large proportions of students requiring free or
reduced-price lunches due to poverty, relatively low scores on
state standardized examinations, and large proportions of
students learning English as a second language. The other two
districts were suburban, serving generally middle-class
populations. Overall, the students made 2,107,108 actions within
the software (where an action consisted of entering an answer or
requesting help), within 494,150 problems, with an average of
132 problems per student. Knowledge, affect, and behavior
this way does not impact model goodness, as it does not change
the relative ordering of model assessments.
3. RESULTS
Before developing the model, we looked at our original, non-standardized features and how their values compare between
those who were labeled as having attended college and those who
did not (Table 1).
Table 1. Features for Students who Attended College (1, n = 2166) and did not Attend College (0, n = 1581)

Feature                   College   Mean      Std. Dev.   Std. Error Mean   t-value
Slip/Carelessness         0         0.132     0.066       0.002             -13.361 (p<0.01)
                          1         0.165     0.077       0.002
Student Knowledge         0         0.292     0.151       0.004             -15.481 (p<0.01)
                          1         0.378     0.180       0.004
Correctness               0         0.382     0.161       0.004             -17.793 (p<0.01)
                          1         0.483     0.182       0.004
Boredom                   0         0.287     0.045       0.001             5.974 (p<0.01)
                          1         0.278     0.047       0.001
Engaged Concentration     0         0.483     0.041       0.001             -11.979 (p<0.01)
                          1         0.500     0.044       0.001
Confusion                 0         0.130     0.054       0.001             5.686 (p<0.01)
                          1         0.120     0.052       0.001
Off-Task                  0         0.304     0.119       0.003             1.184 (p=0.237)
                          1         0.300     0.116       0.002
Gaming                    0         0.041     0.062       0.002             8.862 (p<0.01)
                          1         0.026     0.044       0.001
Number of First Actions   0         114.500   91.771      2.308             -8.673 (p<0.01)
                          1         144.560   113.357     2.436
Table 2. Initial Model of College Enrollment

Features                  Coefficient   Chi-Square   p-value   Odds Ratio
Student Knowledge         1.078         16.193       <0.001    2.937
Slip/Carelessness         -1.100        25.873       <0.001    0.333
Correctness               0.758         33.943       <0.001    2.133
Boredom                   0.069         0.308        0.579     1.071
Engaged Concentration     -0.175        2.207        0.137     0.839
Confusion                 0.201         20.261       <0.001    1.223
Off-Task                  -0.036        0.188        0.665     0.965
Gaming                    -0.047        0.720        0.396     0.954
Number of First Actions   0.269         27.094       <0.001    1.308
Constant                  0.354         99.735       <0.001    1.421
This model can be refined by removing all features that are not
statistically significant, using a backwards elimination procedure.
Our final model (Table 3) achieves a cross-validated A' of 0.686
and a cross-validated Kappa value of 0.247, almost identical to
the initial model with a full data set. The reduced model is both
more parsimonious and more interpretable, so it is preferred. (It
is not more generalizable within the initial data set, but its
parsimony increases the probability that it will be generalizable
to entirely new data sets.) This model is also statistically
significantly better than the null model, χ2(df = 6, N = 3747) =
386.502, p < 0.001. Our final model also achieved a fit of R2
(Cox & Snell) = 0.098, R2 (Nagelkerke) = 0.132, indicating that
our predictors explain 9.8% to 13.2% of the variance in college
attendance. Note that for our models, our R2 values
serve as measures of effect sizes; when converted to correlations,
they represent moderate effect sizes in the 0.31-0.36 range.
Table 3. Final Model of College Enrollment

Features                  Coefficient   Chi-Square   p-value   Odds Ratio
Student Knowledge         1.119         17.696       <0.001    3.062
Correctness               0.698         47.352       <0.001    2.010
Number of First Actions   0.261         28.740       <0.001    1.298
Slip/Carelessness         -1.145        28.712       <0.001    0.318
Confusion                 0.217         24.803       <0.001    1.242
Boredom                   0.169         12.249       <0.001    1.184
Constant                  0.351         100.011      <0.001    1.420
5. ACKNOWLEDGMENTS
This research was supported by grants NSF #DRL-1031398, NSF
#SBE-0836012, and grant #OPP1048577 from the Bill &
Melinda Gates Foundation. We also thank Zak Rogoff, Adam
Nakama, Aatish Salvi, and Sue Donas for their assistance in
conducting the field observations, and Adam Goldstein for help
in data processing.
6. REFERENCES
[1] Baker, R.S.J.d. 2007. Modeling and Understanding
Students' Off-Task Behavior in Intelligent Tutoring Systems.
In Proceedings of ACM CHI 2007: Computer-Human
Interaction, 1059-1068.
[2] Baker, R.S.J.d., Corbett, A.T., and Aleven, V. 2008. More
Accurate Student Modeling through Contextual Estimation
of Slip and Guess Probabilities in Bayesian Knowledge
Tracing. In Proceedings of the 9th International Conference
on Intelligent Tutoring Systems (eds Aimeur E. & Woolf
B.), Springer Verlag, Berlin, 406-415.
[3] Baker R.S.J.d., Corbett A.T., Gowda S.M., Wagner A.Z.,
MacLaren B.M., Kauffman L.R., Mitchell A.P., and
Giguere S. 2010. Contextual Slip and Prediction of Student
Performance after Use of an Intelligent Tutor. In Proc.
UMAP 2010, 52-63.
[4] Baker, R.S., Corbett, A.T., and Koedinger, K.R. 2004.
Detecting Student Misuse of Intelligent Tutoring Systems.
In Proceedings of the 7th International Conference on
Intelligent Tutoring Systems, 531-540.
[5] Baker, R.S.J.d., Goldstein, A.B., and Heffernan, N.T. 2011.
Detecting Learning Moment-by-Moment. International
Journal of Artificial Intelligence in Education, 21, 5-25.
[6] Baker, R.S.J.d., Gowda, S.M., Wixon, M., Kalka, J.,
Wagner, A.Z., Salvi, A., Aleven, V., Kusbit, G.,
Ocumpaugh, J., and Rossi, L. 2012. Towards Sensor-Free
Affect Detection in Cognitive Tutor Algebra. In Proceedings of the
Fifth International Conference on Educational Data Mining, 126-133.
Janice D. Gobert
mikesp@wpi.edu
ryan@educationaldatamining.org
jgobert@wpi.edu
ABSTRACT
In this paper, we incorporate scaffolding and change of tutor
context within the Bayesian Knowledge Tracing (BKT)
framework to track students' developing inquiry skills. These
skills are demonstrated as students experiment within interactive
simulations for two science topics. Our aim is twofold. First, we
desire to improve the models' predictive performance by adding
these factors. Second, we aim to interpret these extended models
to reveal whether our scaffolding approach is effective, and whether inquiry
skills transfer across the topics. We found that incorporating
scaffolding yielded better predictions of individual students'
performance than the classic BKT model. By interpreting our
models, we found that scaffolding appears to be effective at
helping students acquire these skills, and that the skills transfer
across topics.
Keywords
Science Microworlds, Science Simulations, Science Inquiry,
Automated Inquiry Assessment, Educational Data Mining,
Validation, Bayesian Knowledge Tracing, User Modeling
1. INTRODUCTION
Many extensions to the classic Bayesian Knowledge Tracing
(BKT) model [1] have been developed to improve performance at
predicting skill within intelligent tutoring systems, and to increase
the interpretability of the model. For example, extensions have
been made to account for individual student differences [2, 3], to
incorporate item difficulty [4], to address learning activities
requiring multiple skills [5], and even to incorporate the effects of
automated support given by the system [6-8]. Extensions have
also been added to increase model interpretability and to provide
insight about tutor effectiveness. For example, [6] incorporated
scaffolding into BKT to determine if automated support improved
students' learning and performance. However, taking into account
the differences in tutor contexts, the different facets of an activity
or problem in which the same skills are applied, has only been
studied in a limited fashion ([8] is one of the few examples).
Context is important to consider because skills learned or
practiced in one context may not transfer to new contexts [9],
[10]. This, in turn, could reduce a model's predictive performance
if it is to be used across contexts. Explicitly considering the
context in which skills are applied within knowledge modeling
may also increase model interpretability and potentially reveal
whether some skills are more generalizable, and thus
transferrable.
In this paper, we explore the impacts of incorporating two new
elements to the BKT framework to track data collection inquiry
skills [cf. 11] within the Inq-ITS inquiry learning environment
[12]. These elements are scaffolding and change of tutor context.
Like [6-8], we incorporate scaffolding by adding an observable
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
185
change the other variables. Doing this lets you tell for sure if
changing the [IV] causes changes to the [DV] ([IV] and [DV] are
replaced with the student's exact hypothesis). Thus, Rex's
scaffolds provide multi-level help, with each level providing more
specific, targeted feedback when the same error is made
repeatedly, similar to Cognitive Tutors [e.g. 1]. A goal of this
paper is to gain insight about the efficacy of this scaffolding
approach.
within the Free Fall activities. Students did not receive any
feedback on their data collection within these activities. These
activities were used to determine the impacts of scaffolding on
transfer of skill across science topics.
6. EXTENSIONS TO BKT
We amalgamated students' performances across activities within a
Bayesian Knowledge-Tracing framework [1]. BKT is a two-state
Hidden Markov Model that estimates the probability a student
possesses latent skill (Ln) after n observable practice opportunities
(Pracn). In our work, latent skill is knowing how to perform the
data collection skills, and a practice opportunity is an evaluation
of whether skill was demonstrated during data collection in an
inquiry activity. A practice opportunity begins when students
P(Ln-1 | Pracn = correct) = P(Ln-1)(1 - S) / [P(Ln-1)(1 - S) + (1 - P(Ln-1)) G]
P(Ln-1 | Pracn = incorrect) = P(Ln-1) S / [P(Ln-1) S + (1 - P(Ln-1)) (1 - G)]
P(Ln) = P(Ln-1 | Pracn) + (1 - P(Ln-1 | Pracn)) T, where T is the probability of learning the skill between practice opportunities, G the probability of guessing correctly, and S the probability of slipping.
This classic BKT model [1] carries a few assumptions. First, the
model assumes that a student's latent knowledge of a skill is
binary; either the student knows the skill or does not. The model
also assumes one set of parameters per skill and that the
parameters are the same for all students. Finally, the classic model
assumes that students do not forget a skill once they know it.
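As a concrete illustration, the classic BKT update can be sketched in a few lines of Python (a minimal sketch; the parameter values below are illustrative, not the fitted values reported later):

```python
def bkt_update(p_l, correct, p_t, p_g, p_s):
    """One classic BKT step: Bayesian posterior on the observed evidence,
    then the learning transition. No forgetting is modeled."""
    if correct:
        post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
    # Knowledge can only increase, via the transition probability P(T).
    return post + (1 - post) * p_t

# Trace a skill estimate over a sequence of practice opportunities
# (True = skill demonstrated, False = not demonstrated).
p_l = 0.5  # P(L0), initial knowledge (illustrative)
for obs in [True, False, True, True]:
    p_l = bkt_update(p_l, obs, p_t=0.2, p_g=0.2, p_s=0.05)
print(round(p_l, 3))
```

Note how an incorrect observation lowers the posterior but the transition term pulls the estimate back up, reflecting the no-forgetting assumption.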
Relevant to this work, the classic BKT model does not take into
account whether students received any scaffolding from the
learning environment [6] and does not account for the topic in
which skills are demonstrated [8]. The same skill in different
topics would either be treated as two separate skills (assuming no
transfer), or as having no differences between topics (assuming
complete transfer). Both of these assumptions are thought to be
questionable [10, 23]. Below, we describe our approach to
incorporate both of these factors.
the science topic changes from Phase Change to Free Fall (just
before the student's first opportunity to practice in Free Fall). The
corresponding P(Ln) modification for the BKT skill degradation
model is:
P(Ln) ← k · P(Ln),
applied once at the topic switch, where k ∈ [0, 1] is the skill degradation parameter.
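Assuming the modification simply multiplies the knowledge estimate by k at the moment the topic changes (our reading of the model description), the adjustment amounts to a one-liner:

```python
def apply_topic_switch(p_l, k):
    """Degrade the knowledge estimate P(Ln) when the science topic changes:
    k = 1 means full transfer (estimate kept), k = 0 means no transfer."""
    return k * p_l

# With a k near 1, the estimate barely drops at the switch (values illustrative):
print(round(apply_topic_switch(0.90, 0.973), 4))
```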
Once this set has been found, another brute force search around
those parameters is run at a grain-size of 0.001 to find a tighter fit.
We bound G to be less than 0.3 and S to be less than 0.1 [cf. 25];
all other parameters can be assigned values in (0.0, 1.0).
When fitting our models, we found the brute force search to be
realistically tractable only up to fitting 5 parameter models. To fit
the combined models with more parameters, we used a two-stage
process. First, we fit a classic BKT model with four parameters
(L0, G, S, T). Then, we fit a combined model using fixed values
for G and S from the classic model. These parameters were fixed
because we believe the extended models described above will
have the most impact on estimates of learning between practice
opportunities and initial knowledge, not on guessing and slipping.
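The two-stage search described above can be sketched as follows. This is our own minimal version: the paper does not specify its fit objective here, so we assume summed squared one-step prediction error, and we use a coarser grain than the paper's 0.01/0.001 so the sketch runs quickly.

```python
from itertools import product

def predict_sse(params, seqs):
    """Summed squared one-step-ahead prediction error of a classic BKT
    model over binary practice sequences (1 = skill demonstrated)."""
    l0, g, s, t = params
    total = 0.0
    for seq in seqs:
        p_l = l0
        for obs in seq:
            p_correct = p_l * (1 - s) + (1 - p_l) * g
            total += (obs - p_correct) ** 2
            post = p_l * (1 - s) / p_correct if obs else p_l * s / (1 - p_correct)
            p_l = post + (1 - post) * t
    return total

def grid(step, hi):
    """Candidate values step, 2*step, ... strictly below hi."""
    out, v = [], step
    while v < hi - 1e-9:
        out.append(round(v, 4))
        v += step
    return out

def brute_force_fit(seqs, step, bounds=(1.0, 0.3, 0.1, 1.0)):
    """Exhaustive search over (L0, G, S, T); G and S bounded as in the paper."""
    return min(product(*(grid(step, hi) for hi in bounds)),
               key=lambda p: predict_sse(p, seqs))

# Stage 1: coarse search (the paper uses grain 0.01; 0.05 here for brevity).
coarse = brute_force_fit([[0, 1, 1, 1], [1, 1, 1, 1]], step=0.05)
# Stage 2 would re-search at grain 0.001 in a neighborhood of `coarse`.
```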
7. RESULTS
We determine if extending the classic BKT model to include
scaffolding and changing of science topics will 1) improve
predictions of future student performance in our learning
environment, and 2) yield insights about the effectiveness of our
scaffolding approach, and the transferability of the inquiry skills.
Table 1. BKT model variant performance predicting whether students will demonstrate skill in their next practice attempt in the learning
environment. The A' values were computed under six-fold student-level cross-validation. Overall, the best model for both skills is the one
in which the learning rate is conditioned on whether or not the student received scaffolding during Phase Change (T_Scaffolded).
BKT Model Variant
T_Scaffolded
T_Topic
kLn_TopicSwitch
A' collapsedb
A' collapsedc
.685
.827
.656
.846
.633
.818
.610
.840
.641
.825
.612
.844
.678
.829
.648
.848
X
X
X
X
X
X
X
X
.630
.826
.601
.845
.680
.837
.638
.852
.676
.836
.645
.853
.635
.817
.613
.841
Classic BKT:
a N = 287 students; b N = 175 students; c N = 132 students
The learning rate for the Free Fall activities, which were
unscaffolded and practiced after the Phase Change activities, was
comparatively lower for each skill, T_UnScaff_FF = .094 for
designing controlled experiments, and T_UnScaff_FF = .089 for
testing stated hypotheses. The meaning of these values is more
difficult to discern because all students had prior opportunity to
practice in Phase Change before attempting the Free Fall tasks. It
could be that the unscaffolded Free Fall activities, like the
unscaffolded Phase Change activities, are less effective for
helping students acquire these inquiry skills. However, it could
also be that the lower learning rates reflect that many students
already mastered the skills in Phase Change and thus these new
activities afforded no additional learning opportunities. We
believe the latter to be the case because 1) more than 85% of
students demonstrated each skill in their first Free Fall practice
opportunity (data not presented in this paper), and 2) the initial
likelihood of knowing the skills (L0) was high.
Finally, the skill degradation parameter k, which captures the
degree of skill transfer between science topics (0 is no transfer, 1 is
full transfer), was high for both skills. For designing controlled
experiments, k = .973 and for testing stated hypotheses, k = .961.
These high values suggest that skill transfers from Phase Change
to Free Fall within our learning environment [cf. 15]. We
elaborate on this finding in more detail in the next section.
Table 2. Means and standard deviations of the parameter values for the full BKT model variant across all six folds.
Skill                            | L0           | G            | S            | T_UnScaff_PhCh | T_Scaff_PhCh | T_UnScaff_FF | k
Designing Controlled Experiments | .470 (0.014) | .196 (0.029) | .050 (0.006) | .190 (0.018)   | .638 (0.035) | .094 (0.010) | .973 (0.006)
Testing Stated Hypotheses        | .602 (0.026) | .198 (0.023) | .042 (0.007) | .158 (0.026)   | .823 (0.057) | .089 (0.011) | .961 (0.009)
9. ACKNOWLEDGMENTS
This research is funded by the National Science Foundation (NSF-DRL#0733286, NSF-DRL#1008649, and NSF-DGE#0742503)
and the U.S. Department of Education (R305A090170 and
R305A120778). Any opinions expressed are those of the authors
and do not necessarily reflect those of the funding agencies.
Special thanks are also given to Joseph Beck for advice and
feedback on this work.
10. REFERENCES
1 Corbett, A.T. and Anderson, J.R. Knowledge-Tracing:
Modeling the Acquisition of Procedural Knowledge. User
Modeling and User-Adapted Interaction, 4 (1995), 253-278.
2 Pardos, Z.A. and Heffernan, N.T. Modeling Individualization
in a Bayesian Networks Implementation of Knowledge
Tracing. In Proceedings of the 18th International Conference
on User Modeling, Adaptation and Personalization (Big
Island, HI 2010), 255-266.
3 Baker, R.S.J.d., Corbett, A.T., Gowda, S.M. et al. Contextual
Slip and Prediction of Student Performance After Use of an
Intelligent Tutor. In Proceedings of the 18th Annual
Conference on User Modeling, Adaptation and
Personalization (Big Island of Hawaii, HI 2010), 52-63.
4 Pardos, Z.A. and Heffernan, N.T. KT-IDEM: Introducing Item
Difficulty to the Knowledge Tracing Model. In Proceedings of
bvds@asu.edu
ABSTRACT
Normally, when considering a model of learning, one compares the model to some measure of learning that has been
aggregated over students. What happens if one is interested
in individual differences? For instance, different students
may have received different help, or may have behaved differently. In that case, one is interested in comparing the
model to the individual learner. In this study, we investigate three models of learning and compare them to student
log data with the goal of seeing which model best describes
individual student learning of a particular skill. The log
data is from students who used the Andes intelligent tutor
system for an entire semester of introductory physics. We
discover that, in this context, the best fitting model is not
necessarily the correct model in the usual sense.
Keywords
data mining, models of student learning
1.
INTRODUCTION
Most Knowledge Component (KC) [15] based models of learning are constructed in a similar manner, following Corbett
and Anderson [8]. First, some measure of learning is selected
(e.g. correct/incorrect on first try) for the j-th opportunity for that student to apply a given KC. This measure
of learning is then aggregated over students (e.g. fraction
of students correct) as a function of j. Finally, the aggregated
measure is compared to some model (e.g. Bayesian
Knowledge Tracing) with model parameters chosen to optimize the model's fit to the data. In principle, given sufficient
student log data, one could uniquely determine which of several competing models best matches the data.
One drawback with this approach is that it does not take
into account individual learner differences or the actual behaviors of students or tutors as they are learning. Thus, a
number of authors have extended their models to include individual student proficiency and actual help received by the
1.1
Correct/Incorrect steps
Our stated goal is to determine student learning for an individual student as they progress through a course. What observable quantities should be used to determine student mastery? One possible observable is correct/incorrect steps,
whether the student correctly applies a given skill at a particular problem-solving step without any preceding errors or
hints. There are other observables that may give us clues
on mastery: for instance, how much time a student takes
to complete a step that involves a given skill. However,
other such observables typically need some additional theoretical interpretation. For example, what is the relation
between time taken and mastery? Baker, Goldstein, and
Heffernan [3] develop a model of learning based on a Hidden
Markov model approach. They start with a set of 25 additional observables (including time to complete a step),
construct their model, and use correct/incorrect steps to calibrate the additional observables and determine which are
significant. Naturally, it is desirable to eventually include
various other observables in any determination of student
learning. However, in the present investigation, we will focus on correct/incorrect steps.
Next, we need to define precisely what we mean by a step. A
student attempts some number of steps when solving a problem using an intelligent tutor system (ITS). Usually, a step
Figure 1: Functional form of the three models of student learning (BKT, logistic, and step model), plotted as probability of correctness against step (opportunity) j.
2.
3.
The BKT and logistic function models are widely used and
we have introduced the step model Pstep (j) as an alternative.
How well do these models match actual student behavior?
Since we will use the step model in subsequent work, it would
be reassuring to know whether it describes the student data
as well as (or better than) the other two models. We will use
the Akaike Information Criterion (AIC) for this purpose [1,
5]. AIC is defined as
AIC = -2 log(L) + 2K    (5)
where L is the maximized likelihood of the model and K is the number of model parameters. The related Bayesian Information Criterion (BIC) is
BIC = -2 log(L) + K log(n)    (6)
where n is the number of data points. Burnham & Anderson [5, Sections 6.3 & 6.4] explain that BIC is more
appropriate in cases where the true model that actually
created the data is relatively simple (few parameters). If
the true model is contained in the set of models being considered, then BIC will correctly identify the true model in
the n → ∞ limit. For BIC to have this property, the true
model must stay fixed as n increases. The authors argue
that, while BIC may be appropriate in some of the physical
sciences and engineering, in the biological and social sciences, medicine, and other noisy sciences, the assumptions
that underlie BIC are generally not met. In particular, as
the sample size increases, it is typical that the underlying
true model also becomes more complicated. This is certainly true in educational data mining: datasets are generally increased by adding data from new schools, or different
years and one generally expects noticeable variation of student behavior from school to school or from year to year.
In such cases, one safely can say that the true model is
complicated (because people are complicated) and becomes
more complicated as a dataset is increased in size. Although
most authors quote both AIC and BIC values, there is good
reason to believe that AIC is generally more appropriate for
educational data mining work.
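The AIC machinery used in this section reduces to a few lines: Akaike weights normalize exp(-ΔAIC/2) across the candidate models. A sketch with hypothetical log-likelihoods (note that with equal fits, each extra parameter costs a factor of e^-1 in weight, which is the relation quoted later for the step and logistic models):

```python
import math

def akaike_weights(log_likelihoods, n_params):
    """Akaike weights from per-model maximized log-likelihoods and
    parameter counts: w_i proportional to exp(-delta_AIC_i / 2),
    normalized so the weights sum to 1."""
    aic = [-2 * ll + 2 * k for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    raw = [math.exp(-(a - best) / 2) for a in aic]
    z = sum(raw)
    return [r / z for r in raw]

# Three models with identical fits but 2, 3, and 4 parameters (hypothetical):
w = akaike_weights([-10.0, -10.0, -10.0], [2, 3, 4])
```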
3.1
Method
We examined log data from 12 students taking an intensive introductory physics course at St. Anselm College during summer 2011. The course covered the same content as
Figure 2: Histogram of the number of distinct student-KC sequences in student dataset A having a given number of steps n.
a normal two-semester introductory course. Log data was
recorded as students solved homework problems while using the Andes intelligent tutor homework system [17]. 231
hours of log data were recorded. Each student problem-solving step is assigned one or more KCs using the heuristic
described in Section 1.1. The dataset contains a total of
2017 distinct student-KC sequences covering a total of 245
distinct KCs. We will refer to this dataset as student dataset
A. See Figure 2 for a histogram of the number of student-KC
sequences having a given number of steps.
Most KCs are associated with physics or relevant math skills
while others are associated with Andes conventions or user-interface actions (such as notation for defining a variable).
The student-KC sequences with the largest number of steps
are associated with user-interface related skills, since these
skills are exercised throughout the entire course.
One of the most remarkable properties of the distribution
in Fig. 2 is the large number of student-KC sequences containing just a few steps. The presence of many student-KC
sequences with just one or two steps may indicate that the
default cognitive model associated with this tutor system
may be sub-optimal; to date, there has not been any attempt to improve on the cognitive model of Andes with, say,
Learning Factors Analysis [6]. Another contributing factor
is the way that introductory physics is taught in most institutions, with relatively little repetition of similar problems.
This is quite different than, for instance, a typical middle
school math curriculum where there are a large number of
similar problems in a homework assignment.
3.2
Analysis
the Pstep and Plogistic models. In this case, since Plogistic has
one fewer parameter than Pstep, it is favored by AIC by a
constant factor and wstep = e^-1 wlogistic. This case is plotted
as the increasing dashed line in Fig. 3.
3.3
Random data
To further investigate the observed strong discrimination between the three models, we constructed an artificial dataset
containing random bit sequences (each step has 50% probability of being correct) of length n ∈ {10, 20, 30, 40, 50},
with 10,000 sequences for each n. This dataset corresponds
to a model of the form
Prandom (j) = 1/2 .
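The artificial dataset can be generated in a few lines (a sketch; the paper does not publish its generation code, so the structure below is our own):

```python
import random

random.seed(0)
# For each length n, 10,000 sequences of fair coin flips (1 = correct),
# mirroring the constant model P_random(j) = 1/2.
dataset = {n: [[random.randint(0, 1) for _ in range(n)] for _ in range(10_000)]
           for n in (10, 20, 30, 40, 50)}

# Sanity check: the empirical correctness rate should be close to 1/2.
mean_correct = sum(map(sum, dataset[10])) / (10 * 10_000)
```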
Figure 3: Scatter plot of Akaike weights for the three models, Pstep, Plogistic, and PBKT, when fit to student-KC sequences from an introductory physics course. The point where all models are equal, wstep = wlogistic = wBKT = 1/3, is marked with the lower cross. The average of the weights is marked with the upper cross. The dashed line on the left represents points where wstep = wBKT. Finally, the dashed line on the right marks data with bit sequences of the form 00…011…1.
Figure 4: Akaike weights for the three models, Pstep, Plogistic, and PBKT, when fit to randomly generated data; one panel shows the scatter plot of Akaike weights for 1024 random bit strings of length n = 10. The point where wstep = wlogistic = wBKT = 1/3 is marked with a cross. For these datasets, each model should perform equally well, since, with an appropriate choice of parameters, they all can be made equal to the model that was used to generate the data.
randomly generated data.
If we repeat this analysis with BIC, we would still find that
the weights converge to a constant value with 1/n leading
errors. The only difference is that the logistic model has a
larger weight than the other two. The differences between
the weights of the three models still persist in the n → ∞
limit.
3.4
Conclusions
Figure 5: Mean Akaike weights ⟨wstep⟩, ⟨wlogit⟩, and ⟨wBKT⟩ for the three models, Pstep, Plogistic, and PBKT, when fit to randomly generated data of length n. (Each mean is calculated by averaging over 10,000 random bit sequences.) Also shown is a fit to a function of the form a + b/n and a dashed line marking the asymptotic value a. Note that the large differences between the weights persist in the n → ∞ limit.
Our results suggest that the step model may be useful for
modeling the learning of an individual student. However,
the step model assumes that learning a skill occurs in a
single step. Is this how people actually learn? Certainly,
everyone has experienced "eureka" learning at some point in
their lives. However, it is unclear how well this describes
the acquisition of other skills, especially since many KCs
are implicit and people are not consciously aware that they
even know them [9]. Certainly, if the student performance
bit sequence is of the form 00 . . . 011 . . . 1, it seems safe to
assume that learning occurred all in one step, corresponding
to the first 1 in the sequence. However, it is possible that
the transition from non-mastery to mastery occurs over some
number of opportunities and the bit sequence of steps takes
on a more complicated form. In a companion paper [14], we
introduce a method (based on AIC) that can describe gradual mastery, even though the step model itself assumes all-at-once learning. In that approach, for a given bit sequence,
one speaks of the probability that learning occurred at a
particular step.
Finally, we see that the scatter plot of Akaike weights for
student data is remarkably similar to the scatter plots for
the random model. This suggests that the student data has a
high degree of randomness, and, in general, that study of the
random model may be quite useful for better understanding
the student data.
4.
ACKNOWLEDGMENTS
5.
REFERENCES
Diane Litman
University of Pittsburgh
Pittsburgh, PA, 15260
University of Pittsburgh
Pittsburgh, PA, 15260
wex12@cs.pitt.edu
litman@cs.pitt.edu
ABSTRACT
Topic modeling is widely used for content analysis of textual
documents. While the mined topic terms are considered a
semantic abstraction of the original text, few people evaluate the accuracy of humans' interpretation of them in the
context of an application based on the topic terms. Previously, we proposed RevExplore, an interactive peer-review
analytic tool that supports teachers in making sense of large
volumes of student peer reviews. To better evaluate the
functionality of RevExplore, in this paper we take a closer
look at its Natural Language Processing component which
automatically compares two groups of reviews at the topicword level. We employ a user study to evaluate our topic
extraction method, as well as the topic-word analysis approach in the context of educational peer-review analysis.
Our results show that the proposed method is better than
a baseline in terms of capturing student reviewing/writing
performance. While users generally identify student writing/reviewing performance correctly, participants who have
prior teaching or peer-review experience tend to have better performance on our review exploration tasks, as well as
higher satisfaction towards the proposed review analysis approach.
Keywords
Educational peer reviews, text analysis, topic modeling, user
study
1.
INTRODUCTION
2.
RELATED WORK
interest during their initial data exploration. In the detail-view, RevExplore automatically abstracts the semantic information of peer reviews at the topic-word level with the
original texts visible on demand. To create the detail-view,
we adapt existing natural language processing techniques to
the peer-review domain for supporting automated analytics.
3.1
3.2
The topic signature algorithm [8] assumes that a target corpus has a single topic, and it computes the topic words for
the target corpus with respect to a general background corpus. For each word, the algorithm computes a likelihood ratio [5] which tests the hypothesis of the word being a topic
word of the target corpus versus the hypothesis that the
word is not a topic word. The -2 log likelihood ratio has
a chi-square distribution, which allows us to test the significance of each word to the topic of the target corpus when
compared against the background corpus. In our work, we
use the existing software TopicS (as mentioned before) for
extracting topic signatures.
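A sketch of the underlying likelihood-ratio computation follows. This is our own minimal re-implementation of the idea (binomial likelihood ratio of a word's rate in the target corpus versus a shared rate with the background), not the actual TopicS code; the counts in the example are hypothetical.

```python
import math

def _log_l(k, n, p):
    """Binomial log-likelihood log P(k successes in n trials | p)."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k_t, n_t, k_b, n_b):
    """-2 log likelihood ratio testing whether a word's rate in the target
    corpus (k_t of n_t tokens) differs from the background (k_b of n_b),
    against a single shared rate. Asymptotically chi-square, 1 dof."""
    p = (k_t + k_b) / (n_t + n_b)      # H0: one shared rate
    p_t, p_b = k_t / n_t, k_b / n_b    # H1: separate rates
    return -2 * (_log_l(k_t, n_t, p) + _log_l(k_b, n_b, p)
                 - _log_l(k_t, n_t, p_t) - _log_l(k_b, n_b, p_b))

# A word seen 30x in 1,000 target tokens vs 50x in 100,000 background tokens:
score = llr(30, 1000, 50, 100000)  # large score => likely a topic word
```

Words whose score exceeds the chi-square significance threshold are kept as topic signatures of the target group.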
1 http://www.tagcrowd.com/blog/2011/03/05/state-of-the-union-2002-vs-2011/
2 TopicS was developed by Anni Louis for evaluating automated text summarization [12].
3.
4.
DATA
5.
5.1
To investigate student writing performance, we split students into high and low performance groups, based on a
median split of students' ratingW. Then we create high
and low groups of reviews accordingly, based on the group
membership of the student who received the reviews. The
hypothesis is that students who are highly rated have different writing issues compared with those who have lower
ratings, and that such differences are reflected in the peer
reviews that students receive.
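The grouping step can be sketched as follows (the tie-handling convention and the ratings shown are our own assumptions; the paper does not specify them):

```python
from statistics import median

def median_split(rating_by_student):
    """Split students into high/low groups by the median of their ratingW.
    Students at or above the median go to 'high' (one possible convention)."""
    med = median(rating_by_student.values())
    high = {s for s, r in rating_by_student.items() if r >= med}
    low = set(rating_by_student) - high
    return high, low

# Hypothetical ratingW values for four students:
ratings = {"s1": 3.2, "s2": 4.1, "s3": 2.8, "s4": 4.6}
high, low = median_split(ratings)
```

The reviews each student received are then assigned to the high or low review group according to the group membership of their author.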
5.2
6.
EXPERIMENT SETUP
23
29
23
29
23
17
23
17
record user background information, especially demographic factors that capture participants' prior experience
in peer review and teaching: whether they have peer-review
experience (expPR), whether they used SWoRD before (expSWoRD), whether they were a TA before (expTA),
and whether they have graded any writing assignment before (expGW). Participant distribution over these factors is
presented in Table 1. Although we also look at other demographic factors such as age, gender, major, etc., due to the
space limit, we do not report them in this paper.
Procedure: Before being exposed to the analysis tasks,
participants were first given instructions about the peer-review assignment, including both the paper topics and the
reviewing rubrics. We also provided a warm-up example to
demonstrate how to analyze peer reviews through our user
study interface. Figure 1 is a screenshot of the interface,
which consists of three parts: the left pane displays the original reviews in lowercase after removing non-ascii characters;
the middle pane shows the topic words extracted from the
two groups of reviews; the right pane shows the analysis
Table 2: Descriptive statistics of user satisfaction. Higher rating means a more positive opinion except for
Q textRef and Q textImp. One-sample t-test, test value = 3 (neutral). Significant items are highlighted in
bold (p < 0.05).
Question     | Mean | Std.Error | Sig.(2-tailed)
Q easyness   | 2.85 | .140      | .291
Q listDiff   | 3.52 | .123      | .000
Q layout     | 3.57 | .154      | .001
Q reviewDiff | 3.54 | .145      | .000
Q largeData  | 3.93 | .177      | .000
Q approach   | 3.46 | .180      | .015
Q textRef    | 1.96 | .189      | .000
Q textImp    | 2.93 | .171      | .705
EXPERIMENT RESULTS
The statistics of the task performance are summarized in Table 3. For measuring task performance, we use the following
scheme to code participants' answers to the three questions:
Answer1 = 1 if correct, -1 if incorrect    (1)
Answer2 = 1 if yes, 0 if no    (2)
Answer3 = 1 if correct, -1 if incorrect    (3)
For each question, we compare participants' answers to random guess using a one-sample t-test to check if using the
topic words (extracted by either method) is generally meaningful for our peer-review analysis tasks. As the table shows,
in general, the proposed approach is better than random
guess (the corresponding test means are 0, 0.5, and 0), and the
proposed topic extraction method (TopicS ) yields better
task performance than the baseline (Freq). However, we
also notice that the task performance varies with the analysis tasks. This motivates us to further examine the effects
of Split and Dim, as well as their interaction with Method,
which is discussed later.
To analyze user satisfaction, we compare participants' ratings
of each survey item to the neutral state (3 points) using a one-sample t-test. As Table 2 shows, although participants
generally think the analysis task is neither easy nor difficult,
Table 3: Summary of estimates of the variables across all different conditions, with the higher mean bolded
between the two extraction methods. It shows that TopicS generally yields a higher mean compared with Freq,
except for predicting the label of topic words when reviews were grouped by ratingR for Logic.
Q1
Q2
Q3
Answer1
Answer2
Answer3
Dim
Split
Method
Mean
Std.Error
Mean
Std.Error
Mean
Std.Error
Flow
ratingR
Freq
TopicS
-.217
.413
.135
.127
.457
.565
.074
.074
.217
.109
ratingW
Freq
TopicS
-.043
.022
.139
.144
.217
.783
.074
.061
.652
.109
ratingR
Freq
TopicS
.043
.326
.132
.128
.522
.543
.074
.074
.370
.130
ratingW
Freq
TopicS
.109
.391
.133
.134
.283
.717
.067
.067
.565
.115
ratingR
Freq
TopicS
.391
-.174
.118
.133
.304
.587
.069
.073
.283
.138
ratingW
Freq
TopicS
.109
.152
.140
.135
.391
.522
.073
.074
.261
.137
Freq
TopicS
.07
.19
.190
.923
.36
.62
.482
.486
.96
2.068
Insight
Logic
Together
Table 4: Summary of Type III F-test significance of fixed effects of Method, Split, Dim and their interactions
on all variables of all three questions. Results are presented as p-values, with significant ones highlighted with
* (p < .05).
Source           | Q1 Answer1 | Q1 Correct | Q2 Answer2 | Q3 Answer3
Dim              | .907       | .000*      | .387       | .289
Split            | .000*      | .297       | .789       | .055
Method           | .196       | .039*      | .000*      | na
Dim*Split        | .015*      | .533       | .912       | .226
Dim*Method       | .008*      | .001*      | .364       | na
Split*Method     | .333       | .863       | .003*      | na
Dim*Split*Method | .001*      | .040*      | .004*      | na
7.1
Do topic words reveal patterns in writing and reviewing performance? When tested on Answer2 using
the linear mixed model, only Method is found to be significant (F (1, 530.831) = 40.015, p < .001). An interaction
exists between Method and Split (F (1, 530.831) = 8.644,
p = .003), and among all factors (F (1, 349.122) = 5.677,
p = .004). This indicates that using the proposed TopicS is
more likely to identify review patterns that differ between groups, regardless of which dimension the reviews are
on. However, the utility of topic words is also influenced
by the grouping, where TopicS typically outperformed Freq
when used for analyzing writing performance, especially on
Flow and Insight (as shown in Table 3).
Does the proposed approach extract more informative topic words? The analysis on the fixed effects of
Method above already showed that TopicS can better support users in peer review analysis. Table 3 also shows that
TopicS is preferred to Freq across all tasks. Further
analysis with a mixed model (Table 4) shows that this preference is not influenced by either Split or Dim.
7.2
With respect to user background differences, we focus on demographic factors that are related to peer-review and teaching. We investigate expPR, expSWoRD, expTA and expGW
by analyzing both user satisfaction and user-study task performance.
7.2.1
For each survey question, we use one-way ANOVA to examine the ratings against each background factor as a binary
independent variable. Results are summarized in Table 5.
Table 5: One-way ANOVA analysis of user-background factors (binary) on user satisfaction.
Factors that are significant (p < 0.05, highlighted
with *) or trending are denoted by the mean value
of the "yes" group.
Question
expPR expSWoRD expTA expGW
Q easyness
3.78*
3.24*
Q listDiff
Q layout
Q reviewDiff
4.0*
Q largeData
4.35
Q approach
3.83*
3.88
Q textRef
1.29*
1.47*
Q textImp
2.41*
2.47*
With respect to participants' peer-review experience, students who did peer review before (expPR = yes) generally find the review analysis tasks much easier than students who never did it before (p = .033). In particular,
SWoRD users (expSWoRD = yes) feel the proposed approach more useful than
non-SWoRD users in helping them
identify the peer-review differences (p = .014). With respect to teaching experience, it is important to note that
students who have teaching experience (expTA = yes) like
our idea of exploring peer reviews by comparing them in
groups using their topic words (p = .039). Their feedback can approximate instructors' opinions towards
7.2.2
Table 6: Summary of mixed-model ANOVA of within-subjects effects, including interactions between user-background factors (between-subjects effects) and Method, Split, Dim (within-subjects effects). Significant
results are presented as p-values (p ≤ .05).
Q1
Q2
Q3
Source
Answer1
Correct
Answer2
Answer3
Dim*expPR
F (2, 66) = 3.1, p = .050 F (2, 66) = 3.6, p = .032
Method*expSWoRD*expTA F (1, 33) = 7.4, p = .001 F (1, 33) = 6.1, p = .019
na
Method*expPR*expTA
F (1, 33) = 9.6, p = .004 F (1, 33) = 4.2, p = .049
na
Method*expTA*expGW
F (1, 33) = 6.1, p = .019
na
Dim*Method*expPR
F (1, 66) = 3.4, p = .040
na
Dim*Method*expSWoRD
F (2, 66) = 4.0, p = .022 F (2, 66) = 5.5, p = .006
na
8.
CONCLUSIONS
In this paper we evaluate topic-word analytics for analyzing educational peer reviews with a user study. The user
study shows that student peer reviews can be used to examine student writing and reviewing performance based on
peer-review topic words, and that the proposed comparison-oriented topic-word extraction method (TopicS) suits our
analytic tasks best compared with the frequency-based method (Freq). However, the utility of the learned topic words
is influenced by the analytic goals (specified through review
grouping) and dimensions, as well as users' prior experience in teaching and peer review. Analysis of user satisfaction shows that participants who have teaching experience favor our approach significantly more than the others,
which suggests the usefulness of the proposed approach in
supporting instructors in analyzing student peer reviews in
the real world. Even though we did not include manual inspection of the original peer reviews as a baseline, we indirectly
compare it with our topic-word approach in the exit survey
(Q approach).
In the future, we would like to evaluate the proposed approach in the context of RevExplore, which allows users to
specify analytic goals at runtime. Finally we hope to integrate RevExplore into SWoRD as part of the teacher dashboard to support interactive review content analytics.
9.
ACKNOWLEDGEMENTS
This work is supported by Andrew Mellon Predoctoral Fellowship 2012-2013 and IES award R305A120370.
10.
REFERENCES
{xiaoxi,tmurray,bev,dasmith}@cs.umass.edu
ABSTRACT
Social deliberative skills are collaborative life-skills. These
skills are crucial for communicating in any collaborative
processes where participants have heterogeneous opinions
and perspectives driven by different assumptions, beliefs,
and goals. In this paper, we describe models using lexical, discourse, and gender demographic features to identify
whether or not participants demonstrate social deliberative
skills from various online dialogues. We evaluate our models using three different corpora with participants of different
educational and motivational levels. We propose a protocol for how to use these features to build models that achieve the best in-domain performance, and we identify the most useful features for building robust models in cross-domain applications. We also reveal lexical and discourse characteristics of
social deliberative skills.
Keywords
Social deliberative skills, collaborative problem solving, collaborative learning, collaborative knowledge-building, discourse analysis, applied machine learning, feature engineering
1. INTRODUCTION
Do participants make a good-faith effort to understand perspectives other than their own? The skills this requires, including cognitive empathy, affective empathy, and reciprocal role-taking, are part of what we call social deliberative skills [14].
Social deliberative skills lie at the overlap of cognitive skills and social/emotional skills. Specifically, a participant should present rational arguments with supporting evidence in order that his view be taken seriously and valued. Similarly, one has to turn down the volume of one's own thoughts to listen attentively to others' opinions, and has to intentionally switch the channel from "me" to "you" to be able to understand or even appreciate another's perspectives. Indeed, this cognitive empathy of "if you were me and I were you," the soul of social deliberative skills, is needed in any sphere of human interaction, from collaborative learning, to marriage, to workplace relationships, to world affairs. The ultimate goal of our research is to support social deliberative skills in online communication. In this research, we explore the possibility of automatically assessing and predicting the occurrence of social deliberative skills.
Creating computational models for assessing social deliberative skills has profound implications on several fronts: it (1) supports more efficient analysis, for research purposes, of online communication and collaboration in social processes; (2) provides assessment measures for evaluating the quality and properties of group dialogues; and (3) provides tools for informing facilitators so they can adjust skill support and intervention efforts [13]. Previous research in the learning sciences has
extensively focused on creating educational software that
supports cognitive skills in collaborative environments, such
as inquiry skills, metacognition and self-regulated learning
skills, and reflective reasoning skills [24, 3, 5, 19, 11]. Research in these areas has provided a deep theoretical context for studying the cognitive aspect of social deliberative
skills. A burgeoning body of research has begun to study
the social relational aspect of collaborative processes, such
as influence [20] and up-taking [21]. This line of research
has mainly used structural features of social interactions,
such as reply structure, linking notes in a conceptual framework, as well as spatial and temporal proximity to address
the questions of who are the central actors in discussions
and whose ideas receive the most development. But collaborative interactions generally take place in the form of natural language. It is reasonable to suppose that language-level features, including lexical features (i.e., what is said) and discourse features (i.e., how it is said), could provide complementary evidence for assessing these skills.
2.
Social deliberative skills involve the application of cognitively oriented higher-order skills to thinking about the perspectives of others and, consequently, of the self as well. In other words, social deliberative skills require that a speaker reflect not only upon a purely objective idea (e.g., a topic) but also upon "my" ideas, "your" ideas, "our" ideas, and "their" ideas. Tracing the origins of this phrase also describes its meaning: to live with others (social) and to balance (deliberative) differences (skills). Our prior research [14] has defined
a theoretical framework for social deliberative skills, which
includes a group of high-order communication skills that are
essential for different tasks and stages of communication that
involves a disequilibrium of diverse perspectives. These component skills include social perspective seeking (i.e., social
inquiry), social perspective monitoring (i.e., self-reflection,
weighting opinions, meta-dialogue, meta-topic, and referencing sources for supporting claims), as well as cognitive
empathy and reciprocal role-taking (i.e., appreciation, apology, inter-subjectivity, and perspective taking). Here is an example of perspective taking from an authentic dialogue in our corpora: "I can't help but imagine what that is like, for her and for her family." As another example, the following statement is about self-reflection: "I am probably extremely bias because I am under 21 years old and in college. I wonder if as a 45 year old I will feel differently."
Social deliberative skills can also be seen as a composite skill [15], which, though less precise, can serve as a general marker of social deliberation for use in evaluation and real-time feedback in interventions. In this study, we focus on creating computational models to assess whether participants in online dialogues demonstrate the use of the composite social deliberative skill (or social deliberative behavior, SDB).
3. CORPORA
reliability scores for both the composite and the component social deliberative skills, as measured using Cohen's Kappa statistic, across domains. Note that social deliberative behavior (the composite social deliberative skill) is an aggregate over the component social deliberative skills. The inter-rater reliability scores of social deliberative behavior for the civic deliberation, college dialogues, and professional community negotiation domains were 73.5%, 64.3%, and 68.4%, respectively.
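The reliability computation can be sketched as follows (standard Cohen's kappa formula; the labels are toy stand-ins, not the paper's annotations):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["SDB", "SDB", "other", "other", "SDB", "other"]
r2 = ["SDB", "other", "other", "other", "SDB", "other"]
print(round(cohens_kappa(r1, r2), 3))   # 0.667
```

Kappa corrects raw agreement (5/6 here) for the agreement expected by chance, which is why it is preferred over simple percent agreement for reporting inter-rater reliability.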
4.
[Table 1 fragment: only the "Participant count" column survived extraction — 32, 16, and 90 across the three corpora (138 in total).]
The goal of the experiments in this section is to address the following two research questions. First, which type of features (i.e., lexical, discourse, or gender demographic features) or which feature combination is best for building social deliberative classifiers for each domain? Second, which type of features is best for building a robust social deliberative classifier across domain changes? To this end, we designed two experimental scenarios.
Scenario 1: In-domain analysis for each domain
Scenario 2: Cross-domain analysis for each pair of domains
of social deliberation is an unexplored research territory. In choosing features for this study, we recognize that we lack sufficient knowledge of what features might be predictive of social deliberative behavior. Therefore, to explore possible features, we turned to the social psychology and psycholinguistics literature. This research is the first to use lexical, discourse, and gender demographic features to characterize linguistic patterns of social deliberative behavior.
4.1.1
4.1.2
4.1.3
4.2
5.
Experimental results (Figure 2) reveal a number of interesting patterns. One of the most salient patterns is that imbalanced class/label distribution hurts predictive performance
(more on recall than precision), regardless of feature configuration. This can be seen in the third sub-column (i.e., the college dialogues domain) of the in-domain results. This observation suggests that before creating a model, it is important to strategically address the imbalanced-data problem, either at the algorithm level (e.g., adjusting class weights or priors) or at the data level (e.g., up-sampling or down-sampling).
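The data-level remedy can be sketched as follows (an illustrative Python sketch, not the authors' pipeline; the toy X and y stand in for a feature matrix and labels):

```python
import numpy as np
from collections import Counter

def upsample_minority(X, y, seed=0):
    """Data-level fix: replicate minority-class rows until classes balance."""
    rng = np.random.default_rng(seed)
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    X_parts, y_parts = [X], [y]
    for label, n in counts.items():
        if label == majority:
            continue
        idx = np.flatnonzero(y == label)
        extra = rng.choice(idx, size=counts[majority] - n, replace=True)
        X_parts.append(X[extra])
        y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [1.3]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])      # 3 positives vs. 5 negatives
X_bal, y_bal = upsample_minority(X, y)
print(dict(Counter(y_bal.tolist())))        # both classes now have 5 rows
```

At the algorithm level, many learners expose an equivalent knob instead, e.g. a class-weight option in scikit-learn-style implementations that reweights the loss by inverse class frequency.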
Gender's effect on predicting social deliberative behavior. The first row in Figure 2 shows that gender alone has no predictive power for social deliberative behavior. Specifically, the classification results in each domain reflect the bias of class distribution in training machine learning models toward predicting all data as coming from the majority class. For example, in the college dialogues domain, as shown in Table 1, the majority class is "other speech acts". Classifiers built with the various feature configurations unanimously used this bias, without any correction from the gender feature, to predict every instance as "other speech acts". This means that every cell in the confusion matrix is zero except the false negatives, and therefore recall and precision are zero. This pattern also applies to the other domains. We speculate that because social deliberative behavior (as a composite skill) contains skills that greatly overlap cognitive and social/emotional skills, features correlated only with emotion-related skills, such as social sensitivity, are not effective in predicting social deliberative behavior.

Different capacities of lexical and discourse features in different domains. First, we examine the performance of each feature type alone, ignoring feature combinations. As can be seen from the in-domain results, compared to LIWC features (70.7% recall), Coh-Metrix features (83.6% recall) have the better predictive power in the civic deliberation domain. The model built with LIWC features added on top of Coh-Metrix features shows a slight (<1%) increase in this domain. In the professional community negotiation domain, compared to Coh-Metrix features (74.0% recall), LIWC features (90.0% recall) have the upper hand. The model built with Coh-Metrix features added on top of LIWC features shows a drastic (>15%) decrease in this domain. The college dialogues domain shows similar patterns to the professional community negotiation domain; in other words, LIWC features are the most predictive for the college dialogues domain. These patterns suggest that lexical and discourse features have different capacities in different domains for the task of predicting social deliberative behavior.

Next, we look at feature combinations. For the civic deliberation domain, Coh-Metrix and LIWC features combined, among all 6 feature configurations, led to the best model in that domain. For the professional community negotiation domain, LIWC features alone, among all 6 feature configurations, led to the best model in that domain. For the college dialogues domain, LIWC and gender features combined led to the best model in that domain. This implies that determining which features or feature combinations to use, and in which order, has an impact on whether and when we will attain the best model. We will explore this point in the text below.
Protocols for using linguistic features to predict social deliberative behavior. The results in Figure 2 imply a protocol for using lexical and discourse features to build a model (i.e., l1 RLR) that achieves the best in-domain performance. This protocol can be described as follows:
1. Use LIWC features to build a model, whose performance (i.e., recall) is denoted by p(l).
2. Use Coh-Metrix features to build a model, whose performance is denoted by p(c).
3. If p(l) > p(c), the best performance is p(l); otherwise combine LIWC and Coh-Metrix to build a model,
whose performance, denoted as p(lc), is the best performance.
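The three steps above can be sketched as follows (a minimal sketch; the train_and_eval stand-in and the recall values, taken from the professional community negotiation column of Figure 2, are illustrative):

```python
def best_in_domain(train_and_eval, liwc, cohmetrix):
    """Return the feature set chosen by the protocol and its recall.
    train_and_eval(features) trains a model and returns its recall."""
    p_l = train_and_eval(liwc)                       # step 1: LIWC alone
    p_c = train_and_eval(cohmetrix)                  # step 2: Coh-Metrix alone
    if p_l > p_c:                                    # step 3: compare
        return liwc, p_l
    combined = liwc + cohmetrix
    return combined, train_and_eval(combined)

# Toy evaluator mimicking the professional community negotiation domain,
# where LIWC alone wins.
recalls = {("liwc",): 0.900, ("cohmetrix",): 0.740, ("liwc", "cohmetrix"): 0.745}
feats, p = best_in_domain(lambda f: recalls[tuple(f)], ["liwc"], ["cohmetrix"])
print(feats, p)   # ['liwc'] 0.9
```

The protocol trains at most three models, which is why it is efficient: the combined feature set is only evaluated when Coh-Metrix beats LIWC on its own.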
This protocol is the most efficient way to find the right feature sets for building a model with the best predictive performance. This protocol also suggests that for certain domains, LIWC features (features related to what is said) are sufficient to predict social deliberative behavior; in these domains, adding Coh-Metrix features (features related to how it is said) might be too overwhelming for the model to achieve good performance. For other domains, the LIWC features alone cannot capture the sophistication of social deliberative behavior, and combining them with Coh-Metrix features can greatly increase model performance.
6.
Table 3: Top 10 LIWC features learned by L1-regularized logistic regression built from the professional community negotiation domain

LIWC feature | Interpretation | Weight
WC | word count | -0.043
Dic | dictionary words | 0.037
Sixltr | big words | -0.011
WPS | words per sentence | -0.01
adverb | adverbs | 0.009
pronoun | pronouns | -0.009
AllPct | total punctuation | -0.009
cogmech | cognitive processes | -0.007
space | space | -0.004
auxverb | auxiliary verbs | -0.004
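To make the modeling setup concrete, here is a self-contained sketch (synthetic data, not the paper's corpus) of L1-regularized logistic regression fit by proximal gradient descent, with the learned weights then ranked by magnitude as in Table 3; the feature names are borrowed from the table for flavor only:

```python
import numpy as np

def l1_logistic(X, y, lam=0.05, lr=0.1, iters=2000):
    """L1-regularized logistic regression via proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * (X.T @ (p - y) / n)             # gradient step on the log-loss
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
names = ["WC", "Dic", "Sixltr", "WPS", "adverb", "pronoun"]
X = rng.normal(size=(200, len(names)))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)   # labels driven by two features
w = l1_logistic(X, y)
ranked = sorted(zip(names, w), key=lambda t: -abs(t[1]))
print(ranked[0][0])   # the dominant feature ("WC" here, by construction)
```

The soft-thresholding step is what makes the penalty an L1 penalty: small weights are driven exactly to zero, which is why such models yield a short interpretable list of top features.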
7. ACKNOWLEDGEMENT
We thank all anonymous reviewers for their thorough evaluation and constructive recommendations for improving this
paper. This research is supported by the US National Science Foundation under grant #0968536 from the division of
Social and Economic Sciences. Any opinions or conclusions
expressed are those of the authors, not necessarily of the
funders.
Figure 2: Predictive performance (in %) of L1-regularized logistic regression built using different feature configurations in different scenarios. The figure's flattened tables are reconstructed below.

In-domain results (Accuracy / Precision / Recall / F2):

Feature configuration | Civic deliberation | Professional community negotiation | College dialogues
Gender | 56.8 / 56.8 / 100.0 / 86.8 | 52.7 / 52.7 / 100.0 / 84.8 | 68.3 / 0.0 / 0.0 / 0.0
LIWC | 55.3 / 58.9 / 70.7 / 68.0 | 55.3 / 54.6 / 90.0 / 79.7 | 67.9 / 46.1 / 9.9 / 11.8
Coh-Metrix | 62.1 / 61.4 / 83.6 / 77.9 | 54.8 / 55.3 / 74.0 / 69.3 | 68.0 / 44.7 / 3.7 / 4.6
LIWC+Gender | 52.3 / 56.8 / 67.1 / 64.8 | 54.6 / 54.2 / 89.6 / 79.3 | 68.0 / 49.2 / 10.6 / 12.6
Coh-Metrix+Gender | 62.4 / 62.7 / 83.6 / 78.3 | 55.5 / 55.6 / 77.5 / 71.8 | 68.2 / 46.5 / 3.5 / 4.3
LIWC+Coh-Metrix | 62.4 / 63.3 / 84.4 / 79.2 | 54.5 / 55.1 / 74.5 / 69.6 | 68.3 / 48.9 / 4.1 / 5.0
All | 62.1 / 62.3 / 84.4 / 78.8 | 54.3 / 55.0 / 74.0 / 69.2 | 68.5 / 48.2 / 4.6 / 5.6

Cross-domain results (Accuracy / Precision / Recall / F2); the original training/testing column labels were scrambled in extraction, so the training→testing pairs below are inferred from the majority-class baseline accuracies:

Feature configuration | Civic→Professional | Civic→College | Professional→Civic | Professional→College | College→Civic | College→Professional
Gender | 52.7 / 52.7 / 100.0 / 84.8 | 31.7 / 31.7 / 100.0 / 69.9 | 56.8 / 56.8 / 100.0 / 86.8 | 31.7 / 31.7 / 100.0 / 69.9 | 43.2 / 0.0 / 0.0 / 0.0 | 47.3 / 0.0 / 0.0 / 0.0
LIWC | 56.2 / 56.9 / 69.7 / 66.7 | 47.1 / 33.4 / 67.6 / 56.1 | 59.6 / 59.6 / 89.3 / 81.2 | 39.3 / 32.7 / 86.9 / 65.2 | 42.4 / 46.8 / 9.8 / 11.6 | 47.3 / 50.0 / 2.6 / 3.2
Coh-Metrix | 53.4 / 53.5 / 90.0 / 79.2 | 36.4 / 30.7 / 80.0 / 60.5 | 53.3 / 58.3 / 62.2 / 61.4 | 39.0 / 30.2 / 70.3 / 55.5 | 41.9 / 36.8 / 3.1 / 3.8 | 47.0 / 42.9 / 1.3 / 1.6
LIWC+Gender | 56.2 / 56.8 / 70.6 / 67.3 | 47.1 / 33.2 / 66.6 / 55.4 | 60.9 / 60.6 / 88.9 / 81.3 | 38.3 / 32.6 / 88.9 / 66.1 | 42.4 / 46.8 / 9.8 / 11.6 | 48.0 / 60.0 / 1.3 / 1.6
Coh-Metrix+Gender | 53.9 / 53.9 / 87.0 / 77.5 | 35.8 / 30.8 / 77.9 / 59.6 | 53.8 / 58.5 / 64.0 / 62.8 | 38.8 / 30.1 / 70.3 / 55.5 | 40.9 / 32.3 / 4.0 / 4.8 | 46.2 / 33.5 / 1.3 / 1.6
LIWC+Coh-Metrix | 54.1 / 54.0 / 87.0 / 77.5 | 38.8 / 30.9 / 75.2 / 58.4 | 55.6 / 60.0 / 65.3 / 64.2 | 39.4 / 30.3 / 70.4 / 55.7 | 41.2 / 35.7 / 4.4 / 5.4 | 48.0 / 66.1 / 1.3 / 1.6
All | 52.7 / 53.1 / 88.7 / 78.2 | 36.9 / 30.6 / 78.6 / 59.8 | 54.8 / 59.7 / 63.1 / 62.4 | 39.8 / 31.6 / 71.0 / 56.8 | 41.4 / 38.7 / 5.3 / 6.4 | 47.7 / 42.9 / 1.3 / 1.6
8. REFERENCES
Oral Presentations
(Short Papers)
Caleb DeFore
Kris Kyle
scrossley@gsu.edu
cdefore1@student.gsu.edu
kkyle3@student.gsu.edu
Jianmin Dai
Danielle S. McNamara
Jianmin.Dai@asu.edu
dsmcnamara1@gmail.com
ABSTRACT
In this paper, we describe an n-gram approach to automatically
assess essay quality in student writing. Underlying this approach
is the development of n-gram indices that examine rhetorical,
syntactic, grammatical, and cohesion features of paragraph types
(introduction, body, and conclusion paragraphs) and entire essays.
For this study, we developed over 300 n-gram indices and
assessed their potential to predict human ratings of essay quality.
A combination of these n-gram indices explained over 30% of the
variance in human ratings for essays in a training and testing
corpus. The findings from this study indicate the strength of using
n-gram indices to automatically assess writing quality. Such
indices not only explain text-based factors that influence human
judgments of essay quality, but also provide new methods for
automatically assessing writing quality.
Keywords
Essay quality, computational linguistics, corpus linguistics,
automatic feedback, intelligent tutoring systems.
1. INTRODUCTION
Academic success often depends on a student's writing proficiency [1]. Unfortunately, for many students, such
proficiency is often difficult to attain and frequently remains
elusive throughout schooling [5]. One major problem in the
teaching of writing skills is that students have limited
opportunities to write and receive feedback from teachers and
peers. Such a problem is related to time constraints inside and
outside of the classroom [5], which minimize opportunities for
students and teachers to interact one on one. A potentially
profitable approach to providing students with greater access to
writing opportunities and ensuring that students receive feedback
on their writing is through the use of automatic writing evaluation
(AWE) systems that provide students with the opportunities to
write essays and automatically receive feedback on the quality of
their writing.
However, AWE systems often lack the sensitivity to respond to a
number of features in student writing and, more specifically, to
those features that relate to instructional efficacy [8]. Our goal in
2. METHODS
Our goal in this study is to develop paragraph-specific n-gram indices to automatically assess the essay quality of student writers in the ITS W-Pal. The purpose of these indices is to provide potentially stronger links between the instructional modules in W-Pal and the automatic scores assigned to essays by the AWE system. If practical and specific elements of texts related to essay quality can be developed, then these elements, in turn, could also inform feedback mechanisms and potentially provide better connections between the instructional modules in W-Pal (i.e., Introduction Building, Body Building, and Conclusion Building) and formative feedback concerning these modules.
2.1 Corpus
The corpus we used to develop the n-gram indices comprised 1,123 argumentative (persuasive) essays. Because our interest is in developing automated indices that are predictive across a broad range of prompts, grade levels, and temporal conditions, we selected a general corpus that contained 16 different prompts, three different grade levels (10th grade, 11th grade, and college freshman), and two temporal conditions (essays that were untimed and essays that were written in 25-minute increments). Not all the essays from this corpus were used to develop the n-gram indices. Only those essays that contained at least three paragraphs were selected, because such essays provide some evidence that the writer had produced an introduction, a body, and a conclusion paragraph, affording the opportunity to examine paragraph-specific n-grams. After removing all essays that contained fewer than three paragraphs, we were left with 971 essays, which we used to develop the n-gram indices. We used the essays in the entire corpus (N = 1123) to train a regression model.
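The filtering step can be sketched as follows (a minimal sketch assuming blank-line-delimited paragraphs, which is an assumption, not the paper's stated segmentation rule):

```python
def paragraphs(essay):
    """Blank-line-delimited paragraphs (an assumed segmentation rule)."""
    return [p for p in essay.split("\n\n") if p.strip()]

corpus = [
    "Intro paragraph.\n\nBody paragraph.\n\nConclusion paragraph.",
    "A single-paragraph essay.",
]
usable = [e for e in corpus if len(paragraphs(e)) >= 3]
print(len(usable))   # 1
```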
We tested the trained regression model on a test set of argumentative essays that were not used in the development process. The essays were written by participants in a W-Pal study. They ranged in grade level from 9th to 12th (M = 10.2, SD = 1.0). Each participant wrote a pretest and a posttest essay (N = 128). The essays were written within the W-Pal essay-writing interface.
2.4 Analyses
For each n-gram grouping, we calculated an incidence score and a
proportion score for the n-grams in the grouping for each
paragraph type (i.e., introduction, body, and conclusion
paragraphs) and for the essay as a whole. We also combined all of
the positive and all of the negative n-grams into separate indices
and computed their incidence in the paragraph types and for the
essays as a whole. These incidence and proportion scores became
our automated indices for the subsequent regression analysis.
Within each essay, all body paragraphs were pooled and treated as
a single entity.
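The paper does not give exact formulas for the two scores, so the sketch below assumes incidence is hits per 1,000 words and proportion is the share of a paragraph's bigrams that fall in the grouping; the grouping itself is a hypothetical example:

```python
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def incidence_and_proportion(paragraph, grouping):
    tokens = paragraph.lower().split()
    grams = bigrams(tokens)
    hits = sum(1 for g in grams if g in grouping)
    incidence = 1000.0 * hits / max(len(tokens), 1)   # hits per 1,000 words
    proportion = hits / max(len(grams), 1)            # share of bigrams
    return incidence, proportion

grouping = {("in", "conclusion"), ("to", "summarize")}
inc, prop = incidence_and_proportion(
    "In conclusion the evidence supports the claim", grouping)
print(round(inc, 1), round(prop, 3))   # 142.9 0.167
```

In the study's design, such scores would be computed separately for introduction, body, and conclusion paragraphs and for the essay as a whole, giving the paragraph-specific indices fed into the regression.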
We used the essays in the entire corpus to create regression
models to predict the human ratings for the essays. We first
conducted correlations between the index scores and the human
ratings of essay quality. We selected all those variables that
demonstrated at least a small effect size (r > .10) and did not
demonstrate strong multicollinearity with one another or with text
length (r < .899). The model from this regression analysis was
then extended to the essays in the testing corpus to examine how
well the model predicted essay quality in an independent corpus.
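The two-step selection rule can be sketched as follows (synthetic data; the index names are hypothetical):

```python
import numpy as np

ratings = np.arange(100, dtype=float)            # stand-in human ratings
indices = {
    "intro_pos": 2.0 * ratings + 1.0,            # strongly correlated (r = 1.0)
    "intro_dup": 2.0 * ratings + 5.0,            # collinear with intro_pos
    "alt_noise": (-1.0) ** np.arange(100),       # essentially uncorrelated
}

kept = []
for name, vals in indices.items():
    r = np.corrcoef(vals, ratings)[0, 1]
    if abs(r) <= 0.10:
        continue                                  # fails the small-effect cutoff
    if any(abs(np.corrcoef(vals, indices[k])[0, 1]) >= 0.899 for k in kept):
        continue                                  # multicollinear with a kept index
    kept.append(name)
print(kept)   # ['intro_pos']
```

Only the variables surviving both filters would enter the regression, which keeps the model from double-counting near-duplicate indices.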
3. Results
3.1 Multiple Regression All Essays
Of the 316 n-gram grouping indices calculated for this study, 163
of the indices demonstrated at least a small effect size with the
human ratings of essay quality (p < .001) for all the essays in the
corpus. Of these, four demonstrated strong correlations with text
length and were removed. Lastly, six indices demonstrated strong
4. Discussion
This study demonstrates that n-gram indices related to rhetorical,
grammatical, and cohesion features of a text can be strongly
predictive of human judgments of essay quality. These n-grams
were calculated at the paragraph level and at the text level. The
indices were tested on essays that contained as few as 1 to 2
paragraphs and on essays that contained only 3 or more
paragraphs. The results of this study provide models of essay
quality that could be implemented in an AWE system to provide
increased accuracy of summative feedback (i.e., holistic scores).
Because many of the n-gram indices are paragraph-specific and many of them are related to rhetorical or cohesion patterns (as compared to syntactic and grammatical patterns), the indices are expected to provide more specific feedback to users within the W-Pal system that will be both more practical and more useful. The feedback that is based on these indices can be linked to instructional modules within the W-Pal system.
The regression model demonstrated that the combination of the 20
variables accounts for 37% of the variance in the human
evaluations of overall writing quality. The most predictive indices
were generally the combined n-gram indices that integrated all the
[Table fragment: standardized regression weights (B) for the 20 retained indices — 0.199, 0.223, -0.087, 0.132, 0.130, -0.088, -0.149, -0.079, 0.090, -0.079, -0.112, 0.059, 0.070, -0.075, 0.060, 0.063, 0.071, -0.052, -0.059, 0.055; the index names and the "Removed" entries were lost in extraction.]
5. Conclusion
While strongly predictive, the n-gram indices investigated here
should be examined in conjunction with more traditional
linguistic indices that have demonstrated predictive power in
explaining essay quality (i.e., lexical, syntactic, and cohesive
features of text; [3]). Such an analysis would assess how
predictive the n-gram indices are when combined with other
variables. More importantly, the indices should be tested to
examine the degree to which they are able to provide more
direct and specific formative feedback and the effects of such
feedback on essay revision and quality.
6. ACKNOWLEDGMENTS
This research was supported in part by the Institute of Education Sciences (IES R305A080589 and IES R305G2001802). Ideas expressed in this material are those of the authors and do not necessarily reflect the views of the IES.
7. REFERENCES
[1] Kellogg, R. and Raulerson, B. 2007. Improving the writing
skills of college students. Psychonomic Bulletin and
Review. 14, 237-242.
[2] Crossley, S., Roscoe, R., and McNamara, D. (in press).
Using natural language processing algorithms to detect
changes in student writing in an intelligent tutoring system.
Manuscript submitted to the 26th International Florida
Artificial Intelligence Research Society Conference.
[3] McNamara, D., Crossley, S., and Roscoe, R. 2013. Natural
language processing in an intelligent writing strategy
tutoring system. Behavioral Research Methods,
Instruments and Computers. Advance online publication.
[4] McNamara, D., Raine, R., Roscoe, R., Crossley, S.,
Jackson, G., Dai, J., Cai, Z., Renner, A., Brandon, R.,
Weston, J., Dempsey, K., Carney, D., Sullivan, S., Kim, L.,
Rus, V., Floyd, R., McCarthy, P., and Graesser, A. 2012.
The Writing-Pal: Natural language algorithms to support
intelligent tutoring on writing strategies. In P. McCarthy &
C. Boonthum-Denecke (Eds.), Applied natural language
processing and content analysis: Identification,
investigation, and resolution (pp. 298-311). Hershey, P.A.:
IGI Global.
[5] National Commission on Writing. 2003. The Neglected
R. College Entrance Examination Board, New York.
[6] Roscoe, R., Kugler, D., Crossley, S., Weston, J., and
McNamara, D. S. 2012. Developing pedagogically-guided
threshold algorithms for intelligent automated essay
feedback. In P. McCarthy & G. Youngblood
(Eds.), Proceedings of the 25th International Florida
Artificial Intelligence Research Society Conference (pp.
466-471). Menlo Park, CA: The AAAI Press.
[7] Scott, M. 2008. WordSmith Tools version 5, Liverpool:
Lexical Analysis Software.
[8] Shermis, M.D., Burstein, J.C. and Bliss, L. 2004. The
impact of automated essay scoring on high stakes writing
assessments. Paper presented at the annual meeting of the
National Council on Measurement in Education, April
2004, San Diego, CA
ABSTRACT
This paper investigates the issue of degeneracy in student modeling with a Dynamic Bayesian Network in Prime Climb, an intelligent educational game for practicing number factorization. We discuss how maximizing the common measure of predictive accuracy (i.e., end accuracy) of the student model may not ensure a trusted assessment of the student's learning, and how it could result in implausible inferences about the student. An approach that bounds the parameters of the model has been applied to largely avoid degeneracy in the student model without significantly diminishing its predictive accuracy.
Keywords
Educational Games, Student Model, Dynamic Bayesian Networks, Predictive Accuracy, Model Degeneracy
1. INTRODUCTION
Assisting individuals to acquire desired knowledge and skills while engaging in a game distinguishes digital educational games (henceforth edu-games) from traditional video games [1, 2]. Edu-games integrate game design methods with pedagogical techniques in order to more appropriately address the learning needs of the new generation, which highly regards doing rather than knowing. Adaptive edu-games, a sub-category of edu-games, leverage a user model to track the evolution of knowledge in students and support tailored interactions with the player; they have been proposed as an alternative to the one-size-fits-all approach used in designing non-adaptive edu-games [2].
Prime Climb (PC) is an adaptive edu-game for students in grades
5 and 6 to practice number factorization concepts. It provides a
test-bed for conducting research on adaptation in edu-games.
Prime Climb uses a Dynamic Bayesian Network (DBN) to construct a student model which maintains and provides an assessment of students' knowledge of the target skills (number factorization skills) during and at the end of the interaction. The model's assessment of the student's knowledge of the desired skills during game play is leveraged by an intelligent pedagogical agent, which applies a heuristic strategy to provide the student with personalized support in the form of varying types of hints [3]. In addition, the model's evaluation of the student's knowledge of the target skills at the end of the game provides predictions of the
student's performance on related problems outside the game environment (for instance, on a post-test). Therefore, an accurate student model is the main component of a system that adapts to users, and any issue that could degrade the effectiveness of the model should be appropriately avoided and resolved.

Cristina Conati
Department of Computer Science
The University of British Columbia
2366 Main Mall, Vancouver, BC, V6T1Z4, Canada
+1 (604) 8224632
conati@cs.ubc.ca
While most of the work on user modeling in educational systems has focused on optimizing the predictive accuracy (predicting students' performance on opportunities to practice skills) of student models [5], there is limited work on the educational implications and conceptual meaning imposed by the student model resulting from the predictive-accuracy optimization process.
This paper investigates the issue of degeneracy in the student model in PC and how it impacts the modeling. Degeneracy is defined as a situation in which the parameters of a parametric student model are estimated such that the model has the highest performance (is at its global maximum, given the performance and limitations of the optimization method) with respect to some standard measures of accuracy, yet it violates the conceptual assumptions (explained later in more detail) underlying the process being modeled [6].
2. RELATED WORK
Difficulties in inferring student knowledge have recently been studied [4, 6, 8, 9, 10, 11] in an approach to educational user modeling called Knowledge Tracing (KT) [7]. Knowledge Tracing assumes a two-state learning model in which a skill is either in the learned or the unlearned state. An unlearned skill might change to the learned state at each opportunity the student has to practice the skill. In KT, it is also assumed that the student's correct/incorrect performance in applying a skill is the direct consequence of the skill being in the learned/unlearned state; yet there is always the possibility of a student correctly applying a rule without knowing the corresponding skill. This is referred to as the probability of guessing. Similarly, the likelihood of a student showing incorrect performance in applying a rule while knowing the underlying skill is called the probability of slipping.
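The guess/slip mechanics can be made concrete with the standard KT update step (a sketch of the classic formulation; the parameter values are illustrative, not fitted):

```python
def kt_update(p_known, correct, p_guess=0.2, p_slip=0.1, p_learn=0.3):
    """One Knowledge Tracing step: Bayesian evidence update, then learning."""
    if correct:
        num = p_known * (1 - p_slip)
        den = num + (1 - p_known) * p_guess
    else:
        num = p_known * p_slip
        den = num + (1 - p_known) * (1 - p_guess)
    posterior = num / den                         # P(learned | observation)
    return posterior + (1 - posterior) * p_learn  # chance to learn afterwards

p = 0.5
for obs in [True, True, False]:                   # right, right, wrong
    p = kt_update(p, obs)
print(round(p, 3))   # 0.893
```

The asymmetry between guess and slip is what lets an incorrect answer lower the knowledge estimate without driving it to zero.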
One issue with KT, called identifiability, was addressed by Beck [4]. Identifiability refers to the existence of multiple equally good mappings from a student's observable performance to her corresponding latent level of knowledge, while each mapping makes different claims about the student's performance and knowledge.
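A prior over the parameters is one way to break such ties. The idea can be sketched in its two-outcome (Beta) special case; the counts and hyperparameters below are hypothetical:

```python
def map_guess(guessed, attempts, alpha=2.0, beta=8.0):
    """MAP estimate of P(guess) under a Beta(alpha, beta) prior (prior mean 0.2)."""
    return (guessed + alpha - 1) / (attempts + alpha + beta - 2)

# Three lucky guesses in three attempts: the raw MLE would be 1.0 (degenerate),
# but the prior pulls the estimate back toward its mean of 0.2.
print(round(map_guess(3, 3), 3))   # 0.364
```

With sparse data the prior dominates and prevents extreme estimates; as attempts grow, the data term takes over.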
To address this issue, Beck introduced the Dirichlet prior approach [4], in which a Dirichlet probability distribution is defined over the model's parameters in KT to bias the estimation of the model parameters toward the mean of the distribution. The Dirichlet prior approach was then extended, and the Multiple Dirichlet Prior approach [8] and the Weighted Dirichlet Prior approach [9] were proposed to further address the identifiability issue in KT. Baker et al. [6] discussed that Knowledge Tracing models may also suffer from the problem of degeneracy. A KT model is degenerate if it updates the probability of a student knowing some skills in such a way that it violates the conceptual assumptions (such as a student being more likely to make a
the period of time that she interacts with Prime Climb. To this end,
the student model consists of time slices representing relevant
temporal states in the process being modeled. Each time slice is
created once a student makes a movement (climbs a mountain).
The smallest student model in PC consists of 23 binary nodes
(random variables) and the largest one contains 131 nodes.
PC's Student Model Nodes: In PC, each student model contains
several binary nodes [5], such as:
Factorization Nodes (FX): Each factorization node, FX, is a
binary random variable representing the probability that the
student has mastered the factorization skill for number X.
Common Factor Node (CF): There is only one CF node. It is a
binary random variable representing the probability that the student
has mastered the concept of common factors between numbers.
PriorX Node: There is one Prior node for each non-root
factorization node in the model. It gives the prior probability that
the student knows the factorization of number X into its factors.
Click Nodes (ClickXY): Once the player makes a move (i.e.,
moves to number X while the partner is on Y), a Click node is
temporarily added as a child of the three random nodes FX, FY, and
CF to form a causal structure. These three nodes are therefore
conditionally dependent on each other given evidence on the Click
node. Such a causal structure allows apportioning blame for wrong
movements [5]. Table 1 and Table 2 show the Conditional
Probability Tables (CPTs) of the FX and Click nodes, respectively.
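The apportioning of blame enabled by this causal structure can be illustrated by enumerating the tiny network. The priors and CPT entries below are invented for illustration; the paper's actual CPT is parameterized by Slip, Guess, and Edu-Guess:

```python
from itertools import product

# Illustrative priors (NOT the paper's values) for the three parent nodes.
PRIOR = {"FX": 0.6, "FY": 0.6, "CF": 0.5}
SLIP, GUESS = 0.1, 0.2

def p_click_correct(fx, fy, cf):
    # A correct move is likely only when all three skills are known;
    # otherwise the student is effectively guessing.
    return 1.0 - SLIP if (fx and fy and cf) else GUESS

def posterior_fx_given_wrong_click():
    # Enumerate the joint over the three parents, condition on Click = wrong,
    # and read off the posterior belief that FX is known.
    num = den = 0.0
    for fx, fy, cf in product([True, False], repeat=3):
        p = (PRIOR["FX"] if fx else 1 - PRIOR["FX"]) \
            * (PRIOR["FY"] if fy else 1 - PRIOR["FY"]) \
            * (PRIOR["CF"] if cf else 1 - PRIOR["CF"]) \
            * (1.0 - p_click_correct(fx, fy, cf))
        den += p
        if fx:
            num += p
    return num / den
```

Under these numbers, observing a wrong click lowers the belief in FX from its 0.6 prior to about 0.53: the blame for the error is shared among all three parent skills.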
Table 1: Model Structure and CPT of FX Factorization Node (parent FA in states Known/Unknown; the CPT entries were lost in extraction)
Table 2: CPT of the Click node, P(ClickXY = C), conditioned on FX, FY, and CF being known (K) or unknown (U); the entries are the parameters 1-Slip, Guess, and Edu-Guess (the exact cell assignments were lost in extraction)
3. PRELIMINARIES/BACKGROUND
Table 3: Estimated parameters (including EduGuess and Max Slip) and end accuracy (M/SD) of the original model per prior probability type — Population: 0.77/0.14; Generic: 0.70/0.15; User-specific: 0.72/0.20. (The parameter columns were garbled in extraction.)
Degeneracy in the Student Model: Degeneracy in student
modeling in Prime Climb is defined as a violation of the conceptual
assumptions behind the modeling of a student's knowledge of
factorization skills during interaction with the game. The
conceptual assumptions in the PC student model are as follows:
Failures in degeneracy tests (M/SD) per prior probability type:

Prior type     Failures in Test 1   Failures in Test 2
Population     268.91/64.26         1.71/2.85
Generic        101.84/28.73         258.35/80.48
User-specific  339.17/75.11         138.86/62.04
The degeneracy conditions are: EduGuess < Guess; 1 - Slip < Guess; 1 - Slip < EduGuess.
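These three inequalities can be checked mechanically against any estimated parameter set; a minimal sketch (names illustrative):

```python
def degeneracy_flags(guess, edu_guess, slip):
    # The three degeneracy conditions: a parameter set is degenerate when an
    # "informed" outcome becomes less likely than a plain guess.
    return {
        "edu_guess < guess": edu_guess < guess,
        "1 - slip < guess": 1.0 - slip < guess,
        "1 - slip < edu_guess": 1.0 - slip < edu_guess,
    }
```

A parameter set with, say, guess = 0.6 and slip = 0.5 trips the conditions, while guess = 0.2 and slip = 0.1 does not.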
Given the parameters estimated for the original model presented in Table
3 and the degeneracy conditions in Table 4, different patterns of
degeneracy can be observed in Prime Climb's original model.
Estimated parameters (including EduGuess and Max Slip) and end accuracy (M/SD) of the bounded model per prior probability type — Population: 0.76/0.15; Generic: 0.68/0.15; User-specific: 0.70/0.18. (The parameter columns were garbled in extraction.)
Comparison of the Models' Accuracy: The results of a paired t-test showed no statistically significant difference between the end
accuracy and AUC (area under the ROC curve) of the original
and bounded models for any of the prior probability types.

Table 7: AUC of the student models

Model     Population  Generic  User-specific
Original  0.7345      0.6762   0.7860
Bounded   0.7375      0.6643   0.7449
As shown in Table 10, the results of a paired t-test show that the
bounded models resulted in a significantly lower number of hints
(p<0.01 in all cases) and significantly higher performance for
the hinting mechanism (except for the student model with the user-specific prior probability type). Note that on average each student
makes 164.5 movements while playing PC. Based on the results,
the hinting mechanism provides 2, 1.7, and 3.3 times more hints in
the original model than in the bounded model with the population, generic, and
user-specific prior probability types, respectively. This shows that,
in general, the bounded model provides more plausible model
parameters than the original student model.
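The quoted ratios can be recomputed directly from the reported mean hint counts:

```python
# Mean number of hints per student for the original vs. bounded models,
# ordered as population, generic, user-specific prior types.
original = [112.5, 82.95, 139.0]
bounded = [55.2, 48.53, 42.0]

# Rounded to one decimal, these reproduce the "2, 1.7 and 3.3 times" figures.
ratios = [round(o / b, 1) for o, b in zip(original, bounded)]
```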
6. CONCLUSIONS/FUTURE WORK
This paper discussed that optimizing the student model in Prime
Climb does not by itself ensure trustworthy student modeling, because the
model might be degenerate. The issue of degeneracy and the sources
and patterns of degeneracy were described, and one approach to
addressing this issue, called the bounded model, was introduced
and compared with the original student model. It was shown that
the bounded model has accuracy comparable to the original
model while containing significantly fewer cases of degeneracy.
The estimated parameters in the bounded model were also more
plausible than the parameters in the original model. In the current
bounded model, the model's parameters are estimated identically
across all students. As future work, we will consider more
personalized model parameters in the bounded model to account for
individual differences between users.
Failures in degeneracy Tests 1 and 2 (Mean/SD) per prior probability type:

Test  Model     Population    Generic       User-specific
1     Original  268.91/64.26  101.84/28.73  339.17/75.11
1     Bounded   1.44/2.0      0.24/0.48     0.53/1.2
2     Original  1.71/2.86*    258.35/80.48  138.86/62.04
2     Bounded   9.17/7.45     8.8/7.29      10.24/10.96

Performance of the hinting mechanism (Mean/SD):

Model     Population  Generic    User-specific
Original  0.24/0.2    0.3/0.24   1/0
Bounded   0.29/0.22   0.55/0.32  0.95/0.08

Number of hints provided (Mean/SD):

Model     Population   Generic      User-specific
Original  112.5/56.62  82.95/27.37  139/39.36
Bounded   55.2/19.84   48.53/19.1   42/18.24

7. REFERENCES
[1]-[13] (reference entries lost in extraction)
Michel Desmarais
Polytechnique Montreal
michel.desmarais@polymtl.ca

Francois Lemieux
Polytechnique Montreal
francois.lemieux@polymtl.ca
ABSTRACT
This paper investigates means to visualize and classify patterns of study in a college math learning environment. We
gathered logs of learner interactions with a drill-and-practice learning environment in college mathematics; detailed
logs of student usage were gathered over four months. Student
activity sessions are extracted from the logs and clustered into
three categories. Visualization of the clusters allows a clear and
intuitive interpretation of the activities within the clustered
sessions. The three clusters are further used to visualize
the global activity of the 69 participating students, which
would otherwise be difficult to grasp without such means of
extracting patterns of use. The results reveal highly distinct
patterns. In particular, they reveal an unexpected and substantial amount of navigation through exercises and notes
without students actually trying the exercises themselves.
This combination of clustering and visualization can prove
useful to learning environment designers who need to better understand how their application software is used in
practice by learners.
1.
INTRODUCTION
Visualization of student interactions is one of the core topics of educational data mining and a few studies have introduced innovative visualization tools in the last decade [11;
10; 9].
This paper focuses on the visualization of temporal sequences
of student activity. This type of data can be represented in
two different forms:
(1) Event sequences. A given event occurs at a specific time.
Events can be considered as having no duration, and
2.1
Event Sequences
2.2
State Sequences
Instead of emphasizing the transition between states, temporal state sequences of student activity emphasize the time
line perspective and the duration of activities. This type of
representation has been used in sociology [1], but has not
received much attention in Educational Data Mining.
The time line perspective representation is well suited for
visualization of activities as a function of time. Each sequence of activity is represented as a single horizontal bar,
3.
3.1
3.2
on problem solving.
Once these adjustments are made to the event sequence, the
next step consists of projecting the sequence of events onto a
time line. A time line represents a series of equal segments of
time, each of which is assigned a state. If the granularity of the time
line is 15 sec., for example, then the state of each segment
of the time line is set to the most recent event in that 15 sec.
interval. This may result in events that never get displayed
if the time segment is longer than the time between events.
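The projection just described can be sketched as follows (a hypothetical helper, not the authors' implementation):

```python
def project_to_timeline(events, granularity=15, end_time=None):
    # events: list of (timestamp_sec, state) pairs. Each segment of the
    # time line takes the state of the most recent event at or before the
    # end of that segment; events closer together than the granularity
    # may therefore never be displayed.
    events = sorted(events)
    if end_time is None:
        end_time = events[-1][0] + granularity
    timeline, state, i = [], None, 0
    for t in range(0, int(end_time), granularity):
        # Advance to the most recent event within this segment.
        while i < len(events) and events[i][0] <= t + granularity:
            state = events[i][1]
            i += 1
        timeline.append(state)
    return timeline
```

For example, with events at 0 s, 5 s and 20 s and a 15 s granularity, the first event is overwritten within its segment and never appears on the time line.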
3.3
3.4
Clustering Algorithm
3.5
3.6
4.
DISCUSSION
This paper introduces a technique to visualize student learning activities within a self-regulated drill-and-practice environment, and reports on an experiment that combines the
visualization method with clustering and classification techniques to obtain a global view of student activities.
The clustering clearly reveals that very distinct study patterns emerge per session. Without the visualization of clus-
Acknowledgements
This study was funded by MATI (Maison des technologies
de formation et d'apprentissage Roland-Giguère) under the
project Services d'apprentissage personnalisés.
5.
REFERENCES
[Figure 1 panels: Type 1, 10 seq. (n=196); Type 2, 10 seq. (n=135); Type 3, 10 seq. (n=123). Legend: Answer Ex., Nav. Exerc., Nav. Notes, Pause, Prblm. solv., Result, Start.]
Figure 1: Random samples of 10 sessions for each type of cluster from various students. Type 1 corresponds to browsing
through notes and exercises with frequent pauses and very little problem solving. Type 2 corresponds to short sessions of
various behaviour. Type 3 sessions are focused on problem solving and answering exercises.
[Figure 2: stacked counts of session types per student; y-axis: number of sessions (0-60).]
Figure 2: Type of sessions per student. Students are on the x-axis, ordered from the shortest time of use (302 sec.) to the
longest (120 hrs.); median time is 1.2 hr. and mean time is 6.1 hrs.
[8] M. Köck and A. Paramythis. Activity sequence modelling and dynamic clustering for personalized e-learning. User Modeling and User-Adapted Interaction, 21:51-97, 2011. doi:10.1007/s11257-010-9087-z.
[9] A. Merceron and K. Yacef. TADA-Ed for educational data mining. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 7(1):267-287, 2005.
[10] J. Mostow, J. Beck, H. Cen, A. Cuneo, E. Gouvea, and C. Heiner. An educational data mining tool to browse tutor-student interactions: Time will tell. In Proceedings of the Workshop on Educational Data Mining, Na-
ABSTRACT
In this paper, we describe a data mining study of the mental health
of undergraduate Engineering students in a large Canadian
university. We created a survey based on guidelines from the
Canadian Mental Health Association, and applied classification
and regression algorithms to the collected data. Our results reveal
interesting relationships between various aspects of mental health
and year of study (first and final year students have lower mental
health scores than second-year students), academic program
(students in competitive programs have lower overall mental
health but higher self-actualization, whereas students in a program
with a flexible curriculum had higher overall scores), and gender
(female Engineering students tend to have lower scores).
Keywords
Mental health data mining, linear regression, rule mining
1. INTRODUCTION
Mental health affects all facets of daily life, and awareness of it is
therefore critical. In particular, the well-being of undergraduate
students is important, as it greatly affects their academic success.
In this paper, we advocate the use of data mining to understand
the factors affecting mental health. We describe a case study in
which we applied regression and classification algorithms to
mental health survey data collected at a large Canadian
university. We focus on Engineering, a competitive and
demanding discipline with a heavy gender bias towards male
students.
We conducted an anonymous survey - both online and in person - in
which we asked students to rate five aspects of their mental
health, as defined by the Canadian Mental Health Association [4].
These five aspects are: Ability to Enjoy Life, Resilience, Balance,
Emotional Flexibility, and Self-Actualization. Our survey also
included questions about potential academic influences on mental
health such as year of study, academic program, gender, academic
workload, and relationship status. We received over 300
responses in total.
We then applied linear regression and classification algorithms to
identify which of the above external influences have the greatest
effect on each aspect of mental health. Examining the regression
coefficients and classification rules revealed interesting insights
into the mental health of Engineering students. We found that the
number of hours of homework was the best predictor of overall
mental health, followed by year of study. In particular, first-year
and final-year students tend to have lower mental health scores
while second-year students have the highest scores. We also
2. RELATED WORK
The importance of mental health and well-being in students is
exemplified by the large number of studies on this topic. Past
research has focused on using surveys to identify factors that
affect mental health, but applying machine learning tools to such
data has not received much attention. One recent example is Li et
al. [3], which examined variables such as ethnicity, gender and
age to classify the mental health of Chinese college students into
three groups using regression. They found that the strongest
predictor of adjustment and severe mental health problems was
the level of satisfaction with one's major. In this paper, we
consider a wider variety of education-related features, including
gender, year of study, academic workload and academic program,
and we examine five different aspects of mental health.
Another interesting example of mining mental health data is
described in Diederich et al. [1], which used machine learning
techniques to identify mental health issues such as schizophrenia
and mania. Their thesis was that, by analyzing data on language
and conversations between psychiatrists and patients,
they could develop more accurate diagnostic classification
systems. The study used various methods, including emotional
classification and clustering algorithms.
In general, previous work on understanding the mental health of
students has investigated factors such as gender and academic
year. The Center for Addiction and Mental Health [6] found
through surveys that females were more prone to report mental
health issues than males, and final-year students were least likely
to report these symptoms as compared to students in other years.
The National Union of Students [5] conducted a survey of
colleges and universities across Scotland and found that examinations,
concerns about future career prospects, and finances were the
major sources of stress. Zacaj [9] studied gender differences in
3. METHODOLOGY
The first part of our survey included questions about potential
academic influences on mental health, summarized in Table 1.
Previous work has focused on factors such as financial situation
and career prospects; in this study, we focus mainly on academic
factors. In particular, we hypothesize that students enrolled in
competitive programs with a high workload and an imbalance of
extra-curricular activities will have lower mental health scores.
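The regression step described later can be sketched with ordinary least squares on a hypothetical encoding of a few survey attributes. The feature names, responses, and weights below are entirely synthetic, chosen so the recovered coefficients are known in advance:

```python
import numpy as np

# Hypothetical encoding of five survey responses (names illustrative):
# columns = [hours_homework, year_of_study, extracurricular_hours, intercept]
X = np.array([
    [30.0, 1, 2.0, 1.0],
    [15.0, 2, 6.0, 1.0],
    [20.0, 3, 4.0, 1.0],
    [35.0, 4, 1.0, 1.0],
    [10.0, 2, 8.0, 1.0],
])
# Synthetic overall scores, generated as -0.2*hw + 0.5*yr + 0.1*ec + 9.
y = np.array([3.7, 7.6, 6.9, 4.1, 8.8])

# Ordinary least squares: the coefficients indicate which academic factors
# move the score most, mirroring the regression analysis in the paper.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the synthetic data are exactly linear, the fit recovers the generating weights; on real survey data the coefficients would be read the same way, with larger magnitudes indicating stronger influences.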
Table 1: Attributes used in the first part of the survey

Attribute | Possible Values
Gender | Male, Female
Year of study | 1, 2, 3, 4
Engineering program | Environmental, Electrical, Computer, Systems Design, Mechanical, Chemical, Geological, Civil
(attribute name lost in extraction) | 0-40
(attribute name lost in extraction) | 0-80
In a committed relationship of more than 6 months? | Yes, No
(attribute name lost in extraction) | Yes, No
Hours of extracurricular activities per week (sports, student politics, student clubs, etc.) | 0-40

4. RESULTS
RMSE and prediction accuracy per mental-health aspect:

Aspect                 RMSE   Prediction Accuracy
Ability to Enjoy Life  1.29   80%
Resilience             1.32   84%
Balance                1.31   80%
Self-Actualization     1.53   83%
Flexibility            1.12   75%

Regression coefficients per Engineering program (the first column header was lost in extraction; the remaining columns are Second Year, Third Year, Fourth Year):

Program         (header lost)  Second Year  Third Year  Fourth Year
Electrical      -1.1848        0.9269       -0.0288     -0.1853
Environmental   -1.2735        0.922        -0.0748     -0.1589
Mechanical      -1.2312        0.9174       -0.0021     -0.1374
Civil           -1.235         0.9194       -0.0175     -0.1042
Systems Design  -1.159         0.8555       0.0714      -0.2464
Computer        -1.2241        0.9146       0.0085      -0.152
Geological      -1.1696        0.9022       -0.0725     -0.0732
Chemical        -1.2026        0.915        -0.0248     -0.13
5. CONCLUSIONS, RECOMMENDATIONS
AND FUTURE WORK
In this paper, we presented a case study of how data mining may
be used to understand factors affecting the mental health of
students.
We applied linear regression and classification
algorithms to mental health surveys completed by Engineering
undergraduate students, which revealed interesting relationships
between various aspects of mental health and the academic
program, year of study, gender, workload and relationship status.
The results of this study suggest a number of recommendations to
help improve the mental health of Engineering undergraduate
students. Given that the number of hours of homework was an
important factor, it may be beneficial to offer first-year students
additional time-management training. Furthermore, more support
should be provided to female Engineering students, e.g.,
counseling services or forums to invite women to talk about their
experiences.
6. REFERENCES
[1] Diederich, J., Al-Ajmi, A., Yellowlees, P. (2007). Ex-ray: Data Mining and Mental Health. Applied Soft Computing, 7(3):923-928.
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10-18.
[3] Li, H., Li, W., Liu, Q., Zhao, A., Prevatt, F., Yang, J. (2008). Variables Predicting the Mental Health Status of Chinese College Students. Asian Journal of Psychiatry, 1(2):37-41.
[4] Meaning of Mental Health, Canadian Mental Health Association (2013), retrieved from http://www.cmha.ca/mental_health/meaning-of-mentalhealth/#.UHr5GVH08YQ.
[5] National Union of Students Scotland (2010). Silently Stressed: A Survey into Student Mental Wellbeing, retrieved from http://www.nus.org.uk/PageFiles/12238/THINK-POSREPORT-Final.pdf.
[6] Paglia-Boak, A., Adlaf, E.M., Hamilton, H.A., Beitchman, J.H., Wolfe, D., Mann, R.E. (2012). The Mental Health and Well-Being of Ontario Students, 1991-2011: Detailed OSDUHS Findings (CAMH Research Document Series No. 34). Toronto, ON: Centre for Addiction and Mental Health.
[7] Soet, J., Sevig, T. (2006). Mental Health Issues Facing a Diverse Sample of College Students: Results from the College Student Mental Health Survey. Journal of Student Affairs Research and Practice, 43(3):786-807.
[8] Trockel, M.T., Barnes, M.D., Egget, D.L. (2000). Health-Related Variables and Academic Performance Among First-Year College Students: Implications for Sleep and Other Behaviors. The Journal of American College Health, 49(3):125-131.
[9] Zacaj, A. (2010). Gender Differences in Engineering Education. M.A.Sc. Thesis, University of Waterloo.
Ilya M. Goldin
goldin@cmu.edu

Kenneth R. Koedinger
koedinger@cmu.edu

Vincent Aleven
aleven@cs.cmu.edu
ABSTRACT
A student using an interactive learning environment (ILE) may
take multiple attempts to solve a problem step, at times using
hints. But how effective are hints? Because data mining
occasionally finds implausible negative effects of hints, a method
is needed to remove selection effects related to hint use.
We distinguish multiple attempts in which a student repeatedly
seeks hints from multiple attempts to answer the problem.
Exploratory analysis of log data from a tutoring system shows that
making a hint request on the first attempt on a problem step
correlates with hint requests on subsequent attempts, and
proficiency on a first attempt correlates with proficiency on
subsequent attempts. Based on this, we devise a multinomial
logistic regression that distinguishes hint-request tendency from
proficiency. We find that seeking just one hint is associated with
repeated hint-seeking, but when students do make attempts to
solve a problem after viewing a hint, they succeed about half of
the time. Thus, the model removes seemingly negative effects
of hints. We also find that individual differences among students
are more prominent in hint-seeking tendency than in proficiency
with hints. We conclude with some ideas to improve our model.
Keywords
Effect of help on performance, individual differences, learning
skills, multilevel Bayesian models, Item Response Theory
1. INTRODUCTION
Work on help-seeking in Interactive Learning Environments
(ILEs) shows that effects of help are not always straightforward.
For instance, different types of hints may differ in effectiveness,
and students may differ in proficiency with hints [5, 6]. Further,
students may exhibit a variety of help-seeking behaviors, such as
help-avoidance (a failure to seek help when the student would
likely benefit from it) and help-abuse (seeking help when the
student can likely answer the problem) [1]. Occasionally, use of
help may be linked with negative effects [2, 4, 5], but the negative
estimate is unsatisfactory. It is doubtful that hints cause incorrect
performance: although a hint may at times confuse a student and
thus contribute to an error, it does not reduce student knowledge.
More plausibly, a hint request is evidence that the student has not
understood the material. In a sense, negative estimates of hint
effects imply that the statistical method behind these estimates is a
poor representation of human performance (or learning). A better
model would reflect a positive or neutral hint effect.
We consider whether the negative hint effects estimated in prior
work are due to student tendency to request multiple hints without
intervening attempts that could solve the problem. For example,
one perspective is that attempt outcomes are effectively binary
indicators of skill mastery, either successful (if correct) or not (if
incorrect or a hint request). An incorrect attempt suggests that
skill mastery is somehow deficient (although it may also be a
slip), and a hint request suggests that the student does not know
enough to answer the problem. Nonetheless, students may request
does not know enough to solve a problem, i.e., attempts that end
in a hint request or incorrect, not in a correct outcome:
Figure 1 presents a distribution of proficiencies (x-axis) and hint-request tendencies (y-axis) for each attempt type. Individual
differences in proficiency and in TAH-NRI characterize a student
across opportunities on each attempt type. Proficiency on attempts
after hints is moderately or strongly related to TAH-NRI on those
types of attempts. Proficiency on attempts after a first incorrect is
weakly or even negatively related to TAH-NRI, but proficiency
after a second incorrect is moderately related to TAH-NRI.
A student who is proficient on one attempt type should also be
proficient on others. We find (Table 2) that proficiency on first
attempts is moderately ( ) related to proficiency on
attempts after a feature-pointing hint and after a second incorrect
outcome ( ), and strongly related to proficiency on
attempts after the first incorrect ( ). Nonetheless,
proficiency on first attempts is not related to proficiency after
other hints, nor to TAH-NRI. In general, we should take first-attempt proficiency into account when predicting performance on
other attempts, but not necessarily when predicting hint requests.
Figure 1: Average student proficiency vs. average TAH-NRI
on each attempt type (in percentage-based measures). From
top left: first attempts, attempts after feature-pointing hints,
after principle-stating hints, after bottom-out hints, after the
first incorrect outcome, after the second incorrect outcome.
bottom-out hint. Feature-pointing hints make salient important
problem features, e.g., by pointing out that two particular angles
are vertical angles. Principle-stating hints give a domain-specific
principle that is necessary to solve the problem, e.g., that vertical
angles are equal in measure. Bottom-out hints show how to find
the answer to the problem, such as by summing known quantities.
Table 1: Rates of hint request and incorrect outcomes

First-attempt rates: hint requests 5%, incorrects 16%. Second-attempt rates: 70%, 5%, 22%, 35% (the mapping of these values to outcome and condition was lost in extraction).

Table 2: Correlation of first-attempt proficiency with proficiency and with TAH-NRI on each attempt type

Attempt type        Proficiency  TAH-NRI
First attempts      1.00         0.01
After FP Hint       0.39         0.12
After PS Hint       0.18         -0.08
After BOH           0.20         0.47
After 1st Incorrect 0.74         0.14
After 2nd Incorrect 0.46         0.30

3. MODELING
percent of hint requests out of all attempts when the student likely
Equation 1: ProfHelp-Multinomial (the equation itself was lost in extraction)
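A multinomial (softmax) formulation in the spirit of ProfHelp-Multinomial, with separate linear predictors for the three attempt outcomes and distinct proficiency and hint-tendency terms, can be sketched as follows. The structure and weights here are illustrative, not the paper's estimated model:

```python
import math

def softmax(zs):
    # Numerically stable softmax over a list of linear predictors.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def outcome_probs(proficiency, hint_tendency, difficulty):
    # Higher proficiency shifts mass toward CORRECT; a stronger
    # hint-request tendency shifts mass toward HINT_REQUEST
    # independently of proficiency, which is the key separation.
    z_correct = proficiency - difficulty
    z_incorrect = 0.0  # reference category
    z_hint = hint_tendency
    p = softmax([z_correct, z_incorrect, z_hint])
    return dict(zip(["CORRECT", "INCORRECT", "HINT_REQUEST"], p))
```

A proficient student with low hint tendency is predicted to answer correctly, while the same proficiency paired with a high hint tendency predicts a hint request, capturing the selection effect the model is meant to remove.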
Percent of dataset and percent of prediction errors by attempt type (First attempts: 67% of dataset, 50% of errors; After FP Hint: 10% of dataset; the remaining recoverable values, 11 and 20, could not be mapped to rows — rows include After PS Hint, After BOH, After 1st Incorrect, After 2nd Incorrect).
5. CONCLUSIONS
This work advances the study of same-step help use in ILEs.
Students have a tendency to request multiple hints in a row rather
than risk an error. Our analysis improves on prior analyses of
TAH-NRI [8] in that we find that TAH-NRI may differ based on
the type of attempt, and that it persists after accounting for
proficiency, for a knowledge component's attractiveness to hints,
and for prior practice. TAH-NRI is distinct from other help-abuse
behavior, e.g., from gaming-the-system. Further, there are
persistent individual differences among students in TAH-NRI.
Formalizing TAH-NRI in the ProfHelp-Multinomial model
alleviates the selection bias that caused another model to estimate
that hints had negative effects. The ProfHelp-Multinomial results
suggest that students make a strategic decision to get help and
stick with it across attempts, i.e., the decision to try to solve vs. to
request a hint is not an independent decision at each attempt.
The improved understanding of help-seeking developed here is a
step towards developing effective and efficient ILEs, including
systems that adapt to individual differences among students.
6. REFERENCES
Confusion matrix of actual vs. predicted outcomes (the first predicted-column header was lost in extraction; it is presumably Predicted INCORRECT):

Actual        Predicted INCORRECT  Predicted HINT_REQUEST  Predicted CORRECT
INCORRECT     607                  951                     3123
HINT_REQUEST  1017                 2078                    1195
CORRECT       235                  953                     15778
{joseg, mostow}@cmu.edu
ABSTRACT
We present the Topical Hidden Markov Model method, which jointly infers a cognitive and student model from longitudinal observations of student performance. Its cognitive diagnostic component specifies which items use which skills. Its knowledge tracing component specifies how to infer students' knowledge of these skills from their observed performance. [...] it uses no expert-engineered [...]

Notation (recovered from the interleaved symbol table): items xu,t with M = # of items; skills qu,t with S = # of skills; knowledge states ku,t; performance yu,t; Tu timesteps per student; U = # of students; plus priors and parameters.

Keywords
Student modeling, knowledge component discovery (remaining keywords garbled in extraction)
1.
INTRODUCTION
[Figure: graphical model of Topical HMM across three timesteps — items x1, x2, x3; skills q1, q2, q3; knowledge of skill 1: k11, k21, k31; knowledge of skill 2: k12, k22, k32; performance: y1, y2, y3.]
2.
yu,t represents student performance as a binary variable (correct or not), observed only during training.
Topical HMM's parameters specify the distributions of these
variables. Since we take a fully Bayesian approach, we model
parameters as random variables:
Qxu,t is the cognitive diagnostic model. It represents
the skill(s) required for item xu,t as a multinomial,
modeling soft membership. For example, Qxu,t =
[0.75, 0.25, 0, 0] means that item xu,t depends mostly
on skill 1, less on skill 2, and not at all on skills 3 or 4.
Unlike prior work where the mapping of items to skills
must be given, Topical HMM allows Q to be hidden,
i.e., discovered entirely from data.
Ks,l is a multinomial that specifies the transition probabilities from knowledge state l of skill s to other knowledge states.
Ds,l is a binomial that specifies the emission (output)
probability of a correct answer given the student's proficiency level l on the required skill s.
Topical HMM uses Dirichlet priors for its parameters.
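The role of a row of Q as soft skill membership can be illustrated with a heavily simplified two-state update. This is an illustrative approximation, not the paper's full Bayesian inference, and all names are ours:

```python
def update_beliefs(beliefs, q, p_learn, p_correct_by_state, correct):
    # beliefs[s]: P(skill s is known); q[s]: soft membership of the current
    # item in skill s (a row of Q); p_correct_by_state = (P(correct | unknown),
    # P(correct | known)), i.e. a guess probability and 1 - slip.
    p_unknown, p_known = p_correct_by_state
    updated = []
    for b, w in zip(beliefs, q):
        lk = p_known if correct else 1.0 - p_known
        lu = p_unknown if correct else 1.0 - p_unknown
        # Bayesian update of this skill's knowledge state...
        post = b * lk / (b * lk + (1.0 - b) * lu)
        # ...blended by how much the item actually exercises the skill,
        b2 = w * post + (1.0 - w) * b
        # followed by a learning transition, also weighted by membership.
        updated.append(b2 + (1.0 - b2) * p_learn * w)
    return updated
```

A skill with zero membership in the item is left untouched by the observation, while a fully exercised skill receives the ordinary knowledge-tracing update.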
3.
EVALUATION
the expert model and do not update its values. In the case
that the expert decided that an item uses multiple skills, we
assign uniform weight to each skill, even though the expert
assumed a conjunctive model. Topical HMM cannot handle
a conjunctive cognitive diagnostic model.
We now describe the values we use for the priors' hyperparameters.
[Figure: area under the ROC curve (y-axis, 0.5-0.8) for the HMM, Item diff., Manual, and Data conditions.]
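AUC, the metric compared in the figure, reduces to the Wilcoxon rank statistic over positive/negative score pairs; a minimal sketch:

```python
def auc(scores_pos, scores_neg):
    # Probability that a randomly chosen positive example outranks a
    # randomly chosen negative one, counting ties as half a win --
    # the Wilcoxon/Mann-Whitney form of the area under the ROC curve.
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Perfect separation yields 1.0 and chance-level scoring yields 0.5, matching the lower bound of the figure's y-axis.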
4.
5.
Acknowledgements
This work was supported in part by the Pittsburgh Science
of Learning Center, the Costa Rican Ministry of Science
and Technology (MICIT), and National Science Foundation
Grant IIS1124240 to Carnegie Mellon University. The opinions expressed are those of the authors and do not necessarily represent the views of PSLC, MICIT or the National
Science Foundation. We thank the educators, students, and
LISTENers who helped in this study and the reviewers for
6.
REFERENCES
geraldine.gray@itb.ie
ABSTRACT
Increasing college participation rates and a more diverse
student population are posing a challenge for colleges in helping all learners achieve their potential. This paper reports on a study investigating the usefulness of data mining
techniques in the analysis of factors deemed significant
to academic performance in the first year of college. Measures
used include data typically available to colleges at the start
of first year, such as age, gender, and prior academic performance. The study also explores the usefulness of additional psychometric measures that can be assessed early in
semester one, specifically measures of personality, motivation, and learning strategies. A variety of data mining models
are compared to assess the relative accuracy of each.
Keywords
Educational data mining, academic performance, ability, personality, motivation, learning style, self-regulated learning.
1.
INTRODUCTION
Educational Data Mining (EDM) has emerged in recent years as an evolving and growing research discipline, covering
the application of data mining techniques to the analysis of
data in educational settings [4, 17]. To date, EDM has given much
attention to datasets generated from students' behaviour in Virtual Learning Environments (VLE) and Intelligent Tutoring Systems (ITS), many of which come from
school education [1]. Less focus has been given to college education and, in particular, to modelling datasets from outside virtual or online learning environments. This paper
reports on the preliminary results of a study analysing the
significance of a range of measures in building deterministic
models of student performance in college. The dataset includes data systematically gathered by colleges for student
registration. The usefulness of additional psychometric measures gathered early in semester one is also assessed.
2.
STUDY CRITERIA
2.1
There is broad agreement that ability is correlated with academic performance, although opinions differ on the range of
sub-factors that constitute ability [8]. For example, some
studies have used specific cognitive ability tests to measure
ability, for which there is extensive validity evidence. However, such tests have been criticised with regard to the
object of measurement. For example, Sternberg 1999 [19]
asserts that the high correlation between cognitive intelligence
scores and academic performance arises because they measure
the same skill set, rather than reflecting a causal relationship. Therefore many studies use data already available to
colleges to measure ability, i.e. grades from second-level education or SAT/ACT (Scholastic Aptitude Test / American
College Testing) scores [18]. In a meta-analysis of 109 studies by Robbins et al 2004 [16], prior academic achievement
based on high school GPA or grades was found to have a moderate correlation with academic performance (0.448, 90% CI). The average correlation between SAT scores and aca-
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
240
2.2 Influence of personality
Factor analysis by a number of researchers, working independently and using different approaches, has resulted in broad agreement on five main personality dimensions, namely openness, agreeableness, extraversion, conscientiousness and neuroticism, commonly referred to as the Big Five [9]. Of the five dimensions, conscientiousness is the best predictor of academic performance [20]. For example, Chamorro et al. 2008 [6] reported a correlation of 0.37 with academic performance (p<0.01, n=158). Openness is the second most significant personality factor, but results are not as consistent. Chamorro et al. 2008 [6] reported a correlation of 0.21 (p<0.01, n=158) between openness and academic performance. However, the strength of the correlation is influenced by assessment type, with open personalities doing better where the assessment method is not restricted by rules and deadlines [10]. Studies on the predictive validity of other measures of personality are inconclusive [20].
2.3 Motivational factors
Motivation is explained by a range of complementary theories, which in turn encompass a number of factors, some of which have been shown to be relevant, directly or indirectly, to academic performance [16]. Factors relevant to academic performance in college include achievement motivation (drive to achieve goals), self-efficacy and self-determined motivation (intrinsic and extrinsic motivation). In the Robbins et al. 2004 meta-analysis of 109 studies, self-efficacy and achievement motivation were found to be the best predictors of academic performance [16]. Correlations with self-efficacy averaged 0.49 ± 0.05 (90% CI); correlations with achievement motivation averaged 0.303 ± 0.04 (90% CI). Self-determined motivation is not as strong a predictor of academic performance [11].
2.4 Learning strategies
The relationship between academic performance and personality or motivation is mediated by a student's approach to the learning task. Such learning strategies include both learning style (such as a deep, strategic or shallow learning approach) [6] and learning effort or self-regulation [13]. Analysing the influence of learning style directly on academic performance, some studies show higher correlations with a deep learning approach [6], while others cite marginally higher correlations with a strategic learning approach [5]. The difference in these results can be explained, in part, by the type of knowledge being tested for in the assessment itself [21]. Many studies argue that there is a negative correlation between a shallow learning approach and academic performance [5].

Self-regulated learning is recognised as a complex concept to define, as it overlaps with a number of other concepts including personality, learning style and motivation, particularly self-efficacy and goal setting [2]. For example, while many students may set goals, being able to self-regulate learning can be the difference between achieving, or not achieving, the goals set. Violet 1996 [21] argued that self-regulated learning is more significant at tertiary level than at earlier levels of education.
2.5
3.
2. Psychometric data: Table 2 lists the additional measures used, which were assessed using an online questionnaire developed for the study (www.howilearn.ie). The questionnaire was completed during first year induction. It covered measures of personality, motivation, learning style and self-regulated learning. Questions are taken from openly available, validated instruments.

3. Academic performance in year 1: A binary class label was used based on end-of-year GPA score, range [0-4]. GPA is an aggregated score based on results from between 10 and 12 modules delivered in first year. The two classes were poor academic achievers who failed overall (GPA < 2.0, n=296), and strong academic achievers who achieved honours overall (GPA ≥ 2.5, n=340). To focus on patterns that distinguish poor academic achievers from strong academic achievers, students with a GPA between 2.0 and 2.5 were excluded, giving a final dataset of n=636.
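The class construction above can be sketched as follows (a minimal illustration; the record layout and names are hypothetical, not from the study):

```python
# Sketch of the binary class construction described above: GPA < 2.0 is
# labelled "poor", GPA >= 2.5 "strong", and the 2.0-2.5 band is excluded
# from the dataset entirely.

def label_students(records):
    """records: iterable of (student_id, gpa); returns (student_id, label) pairs."""
    labelled = []
    for student_id, gpa in records:
        if gpa < 2.0:
            labelled.append((student_id, "poor"))
        elif gpa >= 2.5:
            labelled.append((student_id, "strong"))
        # 2.0 <= gpa < 2.5: dropped, to sharpen the contrast between classes
    return labelled

sample = [("s1", 1.8), ("s2", 2.2), ("s3", 3.1), ("s4", 2.5)]
print(label_students(sample))  # s2 falls in the excluded band
```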
4. RESULTS

5. CONCLUSION
Results from this study show that models of academic performance in tertiary education can achieve good predictive accuracy, particularly if younger students and mature students are modelled separately. This suggests that patterns differ between standard and non-standard students. The preliminary analysis has demonstrated that good accuracy can be achieved based on data already available to colleges. Including additional psychometric measures improves predictive accuracy for mature students, but the evidence so far suggests this is due to missing data on prior academic performance rather than the added value of the psychometric measures themselves.

The most accurate models were SVMs trained on under-21s and over-21s separately. In general, models that can learn more complex patterns and handle high dimensionality achieve higher accuracies for both student groups. The difference in accuracy across models is most pronounced for mature students, suggesting the patterns in that subgroup are more complex. When training a single model for all students, including students for whom prior academic results are not available (i.e. the quality of the dataset is reduced), models give comparable accuracy (73.82% ± 1.8).

The results published here represent early results from the study. Further analysis of the psychometric measures is required to determine their predictive value, and also their usefulness in understanding how the profile of students who fail
                      Attributes  SVM*           NN**           k-NN (k=16)    Naïve Bayes    Decision Tree  Log Reg
Full (n=636)          prior       69.41 ± 7.11   74.32 ± 4.72   73.97 ± 4.46   75.74 ± 5      72.94 ± 5.78   72.01 ± 6.41
                      all         75    ± 4.88   75.33 ± 7.99   74.85 ± 3.63   72.35 ± 6      74.56 ± 5.54   70.84 ± 3.60
Under 21 (n=350)      prior       82.62 ± 9.47   78.1  ± 8.7    76.9  ± 4.77   77.14 ± 5.55   70    ± 4.42   74.94 ± 6.22
                      all         80.71 ± 10.4   78.1  ± 5.3    79.05 ± 5.41   79.29 ± 7.76   69.76 ± 4.89   76.45 ± 6.47
Over 21 (n=286)       prior       77.03 ± 9.56   72.29 ± 11.33  64.99 ± 7.33   57.16 ± 16.79  50.62 ± 1.41   48.96 ± 22.12
                      all         93.45 ± 4.41   77.31 ± 9.3    71.7  ± 4.86   57.16 ± 16.79  50.62 ± 1.41   52.47 ± 18.08
Over 21, no missing   prior       91.03 ± 6.46   78.3  ± 11     71.54 ± 11.47  62    ± 11     51.5  ± 1.36   64.31 ± 10.71
data (n=178)          all         91.03 ± 6.46   79.6  ± 12.12  70.69 ± 6.1    69.26 ± 5.67   51.5  ± 1.36   66.26 ± 10.07

prior: attributes available from the college; all: all attributes
*Anova kernel, epsilon=0.7 and C=1.
**Learning rate=0.7 and momentum=0.4, 2 hidden layers (10,5).
6. ACKNOWLEDGEMENTS

7. REFERENCES
Paul Salvador Inventado, Roberto Legaspi, Rafael Cabredo, and Masayuki Numao
The Institute of Scientific and Industrial Research, Osaka University
8-1 Mihogaoka, Ibaraki, Osaka, Japan, 567-0047
{inventado,cabredo,roberto}@ai.sanken.osaka-u.ac.jp, numao@sanken.osaka-u.ac.jp
ABSTRACT
Much research has been done on affect detection in learning environments because it has been reported to enable better interventions to support student learning. However, students' actions inside these environments are limited by the system's interface and the domain it was designed for. In this research, we investigated a learning environment wherein students had full control over their activities and had to manage their own goals, tasks and affective states. We identified features that describe students' learning behavior in this kind of environment and used them for building affect models. Our results showed that although a general affect model with acceptable performance could be created, user-specific affect models seemed to perform better.
Keywords
affect modeling, educational data mining, student-driven learning
1. INTRODUCTION
Self-regulated learners are likely to be capable of adapting to such environments because they can effectively manage the different aspects of a learning scenario. One of the most important yet difficult skills to learn in self-regulation is monitoring one's cognitive and affective states. Knowledge of one's thoughts and affective states helps students evaluate the current situation and identify if it is better to continue
(also affiliated with: Center for Empathic Human-Computer Interactions, College of Computer Studies, De La Salle University, Manila, Philippines)
2. RELATED WORK
Many researchers have tried improving existing learning systems by incorporating affect detection for better feedback. D'Mello et al. [7], for example, developed affect models using data from students' interactions with a conversational agent in the domain of computer literacy. The features they used for building these models were based on students' responses, the correctness of the students' answers, their progress and the type of feedback provided by the system. The model they built to distinguish each affective state from each other did not perform very well (i.e., Kappa = 0.163); however, the models they built for distinguishing affective states from a neutral state performed better (i.e., Kappa = 0.207 - 0.390). Baker et al. [1] also developed affect models for students using Cognitive Tutor Algebra. They used features that described students' actions, the correctness of their actions and their previous actions. They built affect models which distinguished one affective state from another (e.g., bored vs. not bored, frustrated vs. not frustrated) whose resulting Kappa values ranged from 0.230 to 0.400.
3.
In our previous work, we collected data from one male undergraduate student, one male master's student and two female doctoral students who engaged in research activities as part of their academic requirements [9]. The students were aged between 17 and 30 years old; three of them were taking Information Science while one doctoral student was taking Physics. During the data collection period, two of the students were writing conference papers and two made PowerPoint presentations about their research. Students had control over how they conducted their learning activities and did not receive any direct support from their supervisor. These conditions required all students to manage their own cognitive and affective states as they learned, which satisfied our target learning scenario.
Data was collected in five separate two-hour learning episodes from each student over a span of one week. Students freely decided on the time, location and type of activities but were required to learn in front of a computer that recorded their learning behavior. All students used a computer in doing their research, so the setup was naturalistic and they did not have to change the ways in which they usually learned.
Data about the students' learning behavior was collected by asking them to annotate their behavior after each learning episode using a behavior recording and annotation tool we developed called Sidekick Retrospect [9]. At the beginning of a learning episode, students inputted their learning goals. The system then began logging the applications they used, taking screenshots of their desktop and capturing image stills from their webcam, with corresponding timestamps. After a learning episode, students were presented with a timeline which showed the desktop screenshots and image stills depending on the position of their mouse on the timeline. This helped students recall what happened during the learning episode so they could annotate it.
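The per-episode logging behaviour described above can be sketched roughly as follows. Sidekick Retrospect's internals are not published here, so the capture callables (active-application probe, screenshot grabber, webcam grabber) are hypothetical stand-ins injected as parameters:

```python
import time

# Minimal sketch of a timestamped logging loop: each tick records the
# active application, a desktop screenshot, and a webcam still.

def log_episode(ticks, get_active_app, grab_screenshot, grab_webcam,
                clock=time.time):
    """Record `ticks` timestamped entries of application, screenshot, webcam."""
    log = []
    for _ in range(ticks):
        log.append({
            "timestamp": clock(),
            "application": get_active_app(),
            "screenshot": grab_screenshot(),
            "webcam": grab_webcam(),
        })
    return log

# Stubbed capture functions for demonstration:
entries = log_episode(3,
                      get_active_app=lambda: "browser",
                      grab_screenshot=lambda: b"<png bytes>",
                      grab_webcam=lambda: b"<jpg bytes>")
print(len(entries))  # 3
```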
Students made annotations by selecting a time range and inputting their intentions, activities and affective state. Intentions could be either goal-related or non-goal-related relative to the goals set at the beginning of the learning episode. Activities referred to anything done while learning, whether on the computer (e.g., using a browser) or off the computer (e.g., reading a book). Two sets of affect labels were used for annotating affective states: goal-related activities were annotated as delighted, engaged, confused, frustrated, bored, surprised or neutral, and non-goal-related activities were annotated as delighted, sad, angry, disgusted, surprised, afraid or neutral. Academic emotions [4] were used for annotating goal-related intentions because they gave more contextual information about the learning activity. However, academic emotions might not have captured other emotions outside of the learning context, so Ekman's basic emotions [8] were used to annotate non-goal-related intentions.
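The two label sets can be captured in a small lookup structure (a sketch; the dictionary layout and function name are illustrative, not from the paper):

```python
# Sketch of the two annotation label sets described above, keyed by
# intention type.
AFFECT_LABELS = {
    "goal": {"delighted", "engaged", "confused", "frustrated",
             "bored", "surprised", "neutral"},        # academic emotions [4]
    "non-goal": {"delighted", "sad", "angry", "disgusted",
                 "surprised", "afraid", "neutral"},   # Ekman's basic emotions [8]
}

def validate_annotation(intention, affect):
    """Reject affect labels that do not belong to this intention's label set."""
    if affect not in AFFECT_LABELS[intention]:
        raise ValueError(f"{affect!r} is not a valid {intention} label")
    return (intention, affect)

print(validate_annotation("goal", "engaged"))  # accepted
```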
4. FEATURE ENGINEERING
5. AFFECT MODELING
the past five minutes was selected as a feature, while the frequency of writing notes in the past five minutes was selected instead in another student's model. This indicates that students' affective states are influenced differently by certain tasks. It also shows that individual differences play a part in the affective states experienced by a student, making students behave differently in similar contexts.

The features selected in both the general and user-specific models described the frequency, duration and type of previous actions performed by the students, as well as the students' current learning state. This contextual information about the students' learning state appears to be a good predictor of students' affective states, as shown by the performance of the resulting affect models.
6. CONCLUSION
Acknowledgements
This work was supported in part by the Management Expenses Grants for National Universities Corporations from
the Ministry of Education, Culture, Sports, Science and
Technology of Japan (MEXT) and JSPS KAKENHI Grant
Number 23300059. We would also like to thank all the students who participated in our data collection.
7. REFERENCES
Michael Eagle, John Stamper, Tiffany Barnes
mjokimoto@gmail.com, maikuusa@gmail.com, john@stamper.org, tmbarnes@ncsu.edu
ABSTRACT
We present an algorithm for reducing the size and complexity of the Interaction Network, a data structure used for storing solution paths explored by students in open-ended multi-step problem solving environments. Our method reduces the number of edges and nodes of an Interaction Network by an average of 90% while still accounting for 40% of the actions performed by students, and preserves the most frequent half of solution paths. We compare our method to two other approaches and demonstrate why it is more effective at reducing the size of large Interaction Networks.
1. INTRODUCTION

2. RELATED WORK
3. REDUCTION ALGORITHM
The purpose of this algorithm is to maximize the amount of information we can gain from the data while minimizing the number of nodes and edges, to make common approaches clearer. We also want to be as close as possible to a directed simple graph, defined as a graph containing no loops or parallel edges. Our assumption is that simple graphs are easier to read when following state-transitions because they have no parallel edges. Next, we want to preserve as many paths from the problem start to the goals as possible, to retain as many student solutions as we can. We would also like to provide continuity and solution variations. Continuity in this case implies that the reduced network maintains complete solution paths, so the graph is understandable, as opposed to a list of the most frequent nodes. By providing variations of similar solutions we should be able to provide better estimates of the number of students who performed a particular solution. Without the context of the progression of states, users would be unable to understand how the problems were solved. We want to provide a means for understanding how many students solved the problem, not just which actions were most frequent.
We will use four metrics for measuring our success:

1. Vertex count
2. Edge count
3. Goal count
4. Sum of edge frequencies
The vertex and edge counts will inform us how well we reduced the number of states and actions; we aim for an order-of-magnitude reduction. Goal counts will let us know how many of the solution paths we have maintained, from start to finish. For this metric, not only is the count important, but to maintain continuity all goals must have a path from the start of the problem to the respective goal state. The sum of edge frequencies will inform us of the total number of actions performed by all students, and in turn what percentage of student actions is being preserved in our reduction. Arguably, edges with higher frequencies are more informative, because more students performed that action than an action performed by fewer students. Lastly, the average student frequency per edge will give us an indicator of how important each edge is in the network.
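The metrics above can be computed directly from edge-frequency maps (a hedged sketch with invented toy data; an edge's frequency is the number of students who performed that action):

```python
# Compute reduction and coverage metrics for a reduced Interaction Network.

def reduction_metrics(original_edges, reduced_edges):
    """Each argument: dict mapping (src, dst) -> student frequency."""
    kept = sum(reduced_edges.values())
    return {
        "edge_reduction": 1 - len(reduced_edges) / len(original_edges),
        "interaction_coverage": kept / sum(original_edges.values()),
        "avg_edge_frequency": kept / len(reduced_edges),
    }

original = {("a", "b"): 10, ("b", "c"): 6, ("b", "d"): 1, ("d", "e"): 1}
reduced = {("a", "b"): 10, ("b", "c"): 6}   # frequency-one edges dropped
print(reduction_metrics(original, reduced))
```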
We asked two professors with over a decade of experience teaching logic, and two graduate students who have either taught the course or served as teaching assistants, to provide us with the set of solutions they expected to see from students. These four experts provided eight solutions in total, four of which differed in direction or in the actions used to solve the problem. Problem 3-5 was chosen because it has one of the larger ranges of possible solutions in our problem set. We compare these provided solutions to the reduced network to determine how many of them are preserved.
3.1 Algorithm
The idea for this algorithm is inspired by compression algorithms. We want to identify the edges with the highest frequencies and preserve them, then find goal states which are close to those paths. The Interaction Network for the problem 3-5 data set has 1252 nodes and 1835 edges. The proposed algorithm works by focusing on high-frequency edges, of which there are few, and filtering out the low-frequency, often frequency-one, edges, of which there are hundreds. The algorithm accepts three parameters: the Interaction Network to act upon, the percent of desired reduction, and a growth parameter. Prior to reduction, we first calculate a set of values in a pre-reduction step. In tutors which do not contain undo actions, this step is not necessary. To adjust for the behavior of moving forward followed by an undo, we calculate a table of negative weights. For each state, an incoming action followed by an undo will increment a negative weight counter for that incoming action. This is used to devalue the frequency of these actions. Next we remove the undo edges from the network; this reduces the number of cycles and parallel edges, presumably making the flow of state-transitions in the Interaction Network easier to follow.
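A rough sketch of the pre-reduction and filtering steps described above (the published algorithm also uses the growth parameter and goal-state handling, omitted here; the devaluation rule is our reading of the text, not a verbatim reimplementation):

```python
# Sketch: devalue actions that were immediately undone, drop undo edges,
# then keep only the highest adjusted-frequency fraction of edges.

def reduce_network(edges, undo_pairs, keep_fraction=0.1):
    """edges: dict (src, dst) -> student frequency.
    undo_pairs: subset of edges that are undo actions.
    Returns the top keep_fraction of non-undo edges by adjusted frequency."""
    # Pre-reduction: an incoming action followed by an undo accumulates a
    # negative weight for that incoming action.
    negative_weight = {}
    for (src, dst) in undo_pairs:
        for edge in edges:
            if edge[1] == src and edge not in undo_pairs:
                negative_weight[edge] = (negative_weight.get(edge, 0)
                                         + edges[(src, dst)])
    # Remove undo edges, rank the rest by frequency minus negative weight.
    candidates = {e: f - negative_weight.get(e, 0)
                  for e, f in edges.items() if e not in undo_pairs}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return {e: edges[e] for e in ranked[:keep]}

example = reduce_network({("start", "A"): 8, ("A", "B"): 5,
                          ("B", "A"): 3, ("A", "goal"): 4},
                         undo_pairs={("B", "A")}, keep_fraction=0.7)
print(example)  # the two highest adjusted-frequency edges survive
```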
4. RESULTS

4.1 Comparison
Table 1: The average metric scores across 11 problems and 2239 problem sessions. Original refers to the full network values. Each column is a method and its score, with percentage comparison to the Original network in parentheses. For vertices and edges the percentage is the amount of reduction; for goals and interactions it is the amount of coverage or inclusion.
                   Original   Reduced          Shortest Paths   Greater than Frequency One
Vertices           1172       114 (90.26%)     238 (77.85%)     203 (81.73%)
Edges              1690       132 (92.23%)     237 (84.75%)     348 (78.33%)
Goals              38         20 (52.54%)      38 (100.00%)     12 (33.04%)
Interactions       3332       1283 (39.90%)    1162 (36.89%)    1990 (60.77%)
Avg. Edge Freq.    2.10       12.34            6.60             6.03
set of goal nodes for the problem. Next, Dijkstra's shortest path algorithm [4] is run and the result is the union of the shortest paths to each goal. The frequency-one filter approach simply removes all edges from the network with student frequency one.
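The two baselines can be sketched as follows (a hedged illustration; unweighted BFS stands in for Dijkstra's algorithm, which is equivalent when every edge costs 1):

```python
from collections import deque

# Baseline 1: union of one shortest path from the start to each goal.
# Baseline 2: drop every edge that only one student traversed.

def shortest_path_union(edges, start, goals):
    """Union of the edges on one shortest path from start to each goal."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    kept = set()
    for goal in goals:
        parent, frontier = {start: None}, deque([start])
        while frontier:                      # breadth-first search
            node = frontier.popleft()
            if node == goal:
                break
            for nxt in adj.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    frontier.append(nxt)
        node = goal
        while parent.get(node) is not None:  # walk the path back to start
            kept.add((parent[node], node))
            node = parent[node]
    return kept

def frequency_one_filter(edges):
    """Keep only edges with student frequency greater than one."""
    return {e: f for e, f in edges.items() if f > 1}

edges = {("s", "a"): 5, ("a", "g1"): 4, ("s", "b"): 1, ("b", "g2"): 1}
print(shortest_path_union(edges, "s", ["g1", "g2"]))
print(frequency_one_filter(edges))
```

Note how the shortest-path baseline keeps the frequency-one edges ("s","b") and ("b","g2") because they lie on the only path to g2, while the frequency-one filter drops them and loses that goal entirely, mirroring the trade-off discussed below.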
Referring to Table 1, we can see some advantages and disadvantages of each approach. First, as expected, the shortest path approach naturally has 100% goal coverage; that is, we can see a path to every goal from the original Interaction Network. The disadvantage of this approach is that the paths chosen do not optimize the frequencies of edges, because the shortest path can contain many frequency-one edges. Moreover, the overall reduction rates are half as effective as our method's, leaving on average twice as many nodes and edges. This method preserves fewer actions performed by students while having lower rates of reduction. The resulting shortest paths network for problem 3-5 is shown in figure 2. Note that if the growth parameter is set to infinity, the path to all goals will be preserved, the same as with the shortest path method, though naturally reduction rates will be affected. Thus, our method can facilitate 100% goal coverage.

Alternatively, the frequency-one filter maintains a higher rate of interactions, as we would expect since fewer edges are removed. However, frequency-one filtering suffers from low rates of reduction, having double the number of nodes and triple the number of edges on average, while also having lower goal coverage: 33% compared to our method, which achieved 52%. Figure 3 shows the resulting network for problem 3-5 using the frequency-one filtering process. By comparing the average edge frequencies in Table 1 we can see that our method has double the value of either of the other two approaches. This score is meaningful because it is the average number of actions performed by students per edge within the network.
5. CONCLUSIONS
We provide an algorithm for reducing the complete Interaction Network to a summary of the most common problem solving approaches used by students. We showed that this algorithm is capable of reducing the number of vertices and edges of the Interaction Network by an average of around 90%, while still depicting more than half of the solution paths and accounting for 40% of the interactions performed by students.
6. REFERENCES
Gautam Biswas
john.s.kinnebrew@vanderbilt.edu
dmack@isis.vanderbilt.edu
ABSTRACT
Identifying sequential patterns in learning activity data can be useful for discovering, understanding, and ultimately scaffolding student learning behaviors in computer-based learning environments. Algorithms for mining sequential patterns generally associate some measure of pattern frequency in the data with the relative importance or ranking of the pattern. However, another important aspect of these patterns is the evolution of their usage over the course of a student's learning or problem-solving activities. In order to identify and analyze learning behavior patterns of more interest in terms of both their overall frequency and their evolution over time, we present a data mining technique that combines sequence mining with a novel information-theoretic, temporal-interestingness measure and a corresponding heat map visualization. We demonstrate the utility of this technique through application to student activity data from a recent experiment with the Betty's Brain learning environment and a comparison of our algorithm's pattern rankings with those of an expert. The results support the effectiveness of our approach and suggest further refinements for identification of important behavior patterns in sequential learning activity data.
Keywords
sequence mining, interestingness measure, information gain,
learning behaviors
1. INTRODUCTION
gautam.biswas@vanderbilt.edu
sequential pattern mining to better understand learning behavior in particular conditions or groups (e.g., [6, 8]).

However, once these behavior patterns are mined, researchers must interpret and analyze the often large set of patterns to identify a relevant subset of important patterns to investigate or utilize further. Ideally, these patterns provide a basis for generating models and actionable insights about how students learn, solve problems, and interact with the environment. Algorithms for mining sequential patterns generally use some measure of pattern frequency to rank identified patterns. However, researchers have developed a variety of other measures that exploit properties beyond pattern frequency in ranking mined patterns [4]. These measures are often referred to as interestingness measures and have been applied to results from a variety of data mining techniques. To better analyze student learning and behavior, interestingness measures have been used for tasks like ranking mined association rules (e.g., [7]).
Investigation of the frequency with which a pattern occurs over time can reveal additional information for pattern interpretation. Further, these changes in pattern occurrence may help identify more important patterns, which occur only at certain times or become more or less frequent, rather than patterns with frequent but uniform occurrence over time. Qualitatively, we would like to identify patterns that are not rare overall and have significant variations in their frequency over time. In this paper, we present a novel approach that combines sequence mining with an information-theoretic measure for ranking behavior patterns by both temporal variation in occurrence and overall frequency, providing more effective identification of temporally-interesting patterns. To analyze these patterns effectively and quickly identify trends in the evolution of pattern usage, we employ a related visualization in the form of heat maps. We demonstrate the utility of this technique through application to student activity data from a recent experiment with the Betty's Brain learning environment and a comparison of our algorithm's pattern rankings with those of an expert. The results support the effectiveness of our approach and suggest further refinements for identification of important behavior patterns in sequential learning activity data.
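The exact TIPS measure is not reproduced at this point in the text, but the underlying idea, jointly rewarding overall frequency and temporal non-uniformity, can be illustrated with a simple sketch that weights log-frequency by the KL divergence of a pattern's temporal distribution from uniform (an assumption for illustration, not the published formula):

```python
import math

# Hedged sketch of a temporal-interestingness score (NOT the published
# TIPS formula): patterns that are rare overall, or frequent but uniform
# over time, both score low; spiky or trending patterns score high.

def temporal_interest(bin_counts):
    total = sum(bin_counts)
    if total == 0:
        return 0.0
    k = len(bin_counts)
    # KL(p || uniform) = sum_i p_i * log(p_i * k)
    kl = sum((c / total) * math.log((c / total) * k)
             for c in bin_counts if c > 0)
    return math.log(1 + total) * kl

uniform = temporal_interest([10, 10, 10, 10, 10])  # no temporal variation
spiky = temporal_interest([1, 2, 44, 2, 1])        # usage spike in one fifth
print(spiky > uniform)  # True: the spiking pattern ranks higher
```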
2.
With long sequences of temporal data, such as student learning activities in a computer-based learning environment, re-
3.
4. RESULTS
20% and 40% of their complete sequence of activity. An initial estimation of student behavior by the researchers assumed that students read and added correct links most in the first fifth of their activities, with a decreasing trend as the remaining causal relationships were those that were harder to identify. Instead, the identified usage pattern suggests that students require most of an hour working with the system and reading before reaching peak efficiency in determining correct causal links from the resources.
to those made by an expert: another researcher who coordinated the Betty's Brain study analyzed here, and who has analyzed student activity data but had no knowledge of the TIPS approach. For this comparison, we used the top 20 patterns from each of the TIPS and occurrence rankings, which resulted in 31 total patterns (9 patterns were in the top 20 of both rankings). We presented the expert with only the total occurrence information and the occurrence over time (split into fifths, corresponding to how the data was analyzed with TIPS). The order of the patterns was randomized, and values for both the overall occurrence and the occurrence over time were represented on separate color-coded scales (between low and high values across all included patterns) to provide some visualization for comparison among the patterns. The expert was asked to group the patterns into three relative categories based on the provided information: high interest (10 patterns), medium interest (10 patterns), and low interest (11 patterns).
Table 2 presents the number of patterns identified by the TIPS and occurrence rankings that the expert grouped into each level of interest. All 10 of the expert's high-interest results were in the top 20 identified by TIPS, with only 2 of them also in the top 20 ranked by occurrence. These results suggest that the TIPS ranking is closer to the expert's own interest ranking, given the total occurrence and temporal evolution information about each pattern. Next, we presented the expert with the same information but also included the specific activity pattern for each result. When asked to rank the patterns again with this additional information, the results were more equally balanced between the TIPS and occurrence rankings, with six of each in the high-interest category, and TIPS having two more than the occurrence ranking in the medium-interest category. Overall, these preliminary experiments illustrate the expected point that the activity pattern itself is a major factor in its overall interestingness, but that its occurrence and temporal evolution are both important factors. Further, they suggest that rather than relying on only one interestingness measure for identifying potentially important activity patterns, consideration of the top patterns identified by each of multiple measures, including both occurrence and TIPS, may be the most effective way to analyze mined patterns from learning activity sequences.
5. CONCLUSION
While identification of common and high-occurrence patterns is undoubtedly useful, finding patterns that have an interesting evolution of usage over time is also important for researchers and experts in education, as well as other domains. In this paper, we presented the TIPS technique and interestingness measure for identifying temporally-interesting behavior patterns in learning activity sequences. TIPS is designed to identify patterns with interesting temporal behavior (e.g., spikes of usage during specific time periods or strongly increasing, decreasing, and peaking trends) even when they are not especially frequent, as well as particu-
6. REFERENCES
Matěj Klusáček
Radek Pelánek
xjarusek@fi.muni.cz, matej.klusacek@mail.muni.cz, pelanek@fi.muni.cz
ABSTRACT
Given data about problem solving times, how much can we automatically learn about students' and problems' characteristics? To address this question we extend a previously proposed model of problem solving times to include variability of students' performance and students' learning during a sequence of problem solving tasks. We evaluate the proposed models over simulated data and data from a Problem Solving Tutor. The results show that although the models do not lead to substantially improved predictions, the learnt parameter values are meaningful and capture useful information about students and problems.
1. INTRODUCTION
In intelligent tutoring systems [1, 15], student models typically focus on the correctness of student answers [6], and correspondingly the problems in tutoring systems are designed mainly with a focus on correctness. This focus is partly due to historical and technical reasons: the easiest way to collect and evaluate student responses is multiple choice questions. Thanks to advances in technology, however, it is now relatively easy to create rich interactive problem solving activities. In such environments it is useful to analyze not only the correctness of students' answers, but also timing information about the solution process.
To attain a clear focus, here we consider only the information about problem solving times, i.e., we model students' performance in exercises where the only performance criterion is the time to solve a problem. Examples of such exercises are logic puzzles (like the well-known Sudoku puzzle) or suitably formulated programming and mathematics problems [9].
In previous work [8] we described a model which assumes a linear relationship between problem solving skill and the logarithm of time to solve a problem, i.e., an exponential relation between skill and time. In this work we present extensions of the model in two directions. The first extension models the variability of performance of individual students. The second extension models students' learning during a sequence of problem solving tasks.
2.
2.1 Related Work
2.2
We assume that we have a set of students S, a set of problems P, and data about problem solving times: t_sp is the logarithm of the time it took student s ∈ S to solve a problem p ∈ P. In the following we always work with the logarithm of time, and we use subscript s to index student parameters and subscript p to index problem parameters.

We assume that problem solving performance depends on one latent problem solving skill θ_s and two main problem parameters: a basic difficulty of the problem b_p and a discrimination factor a_p. The basic structure of the model is a simple linear model with Gaussian noise: t_sp = b_p + a_p·θ_s + ε. The basic difficulty b_p describes the expected solving time of a student with average skill. The discrimination factor a_p describes the slope of the function, i.e., it specifies how the problem distinguishes between students with different skills. Finally, ε is random noise given by a normal distribution with zero mean and a constant variance (note that the model in [8] assumes a problem-dependent variance). The presented model is not yet identified, as it suffers from an indeterminacy of scale (analogically to many IRT models). This is solved by normalization: we require that the mean of all θ_s is 0 and the mean of all a_p is -1.
2.3
2.4 Modeling Learning
2.5 Parameter Estimation
To make the derivation more readable, we introduce the following notation: e_sp = t_sp - (b_p + a_p·θ_s) (the prediction error for a student s and a problem p) and v_sp = c_p² + a_p²·σ_s² (the variance for a student s and a problem p). Thus we can write the log-likelihood as:

ln L = Σ_{s,p} [ -e_sp² / (2·v_sp) - (1/2)·ln(v_sp) - (1/2)·ln(2π) ]

The partial derivatives of the log-likelihood with respect to the individual parameters are:

∂ln L / ∂b_p = e_sp / v_sp
∂ln L / ∂θ_s = a_p·e_sp / v_sp
∂ln L / ∂a_p = θ_s·e_sp / v_sp + (a_p·σ_s² / v_sp²)·(e_sp² - v_sp)
∂ln L / ∂c_p² = (1 / (2·v_sp²))·(e_sp² - v_sp)
∂ln L / ∂σ_s² = (a_p² / (2·v_sp²))·(e_sp² - v_sp)
Note that the obtained expressions in most cases have a straightforward intuitive interpretation. For example, the gradient with respect to θ_s is a_p·e_sp / v_sp, which means that the estimation procedure gives more weight to attempts on problems which are more discriminating and have smaller variance.

Stochastic gradient descent can find only local minima. However, with a good initialization we can improve the chance of
finding a global optimum. In our case there is a straightforward way to get a good initial estimate of the parameters:
b_p = mean of t_sp (for the given p);
a_p = -1;
θ_s = mean of b_p - t_sp (for the given s);
c_p² = 1/2 of the variance of b_p - t_sp (for the given p);
σ_s² = 1/2 of the variance of b_p - t_sp (for the given s).
If we return to the original simplifying assumption and assume that the variance is constant (independent of the particular problem and student), then the error function is the basic sum-of-squares error function and the gradient simplifies to ∂E_sp/∂a_p = θ_s·e_sp, ∂E_sp/∂b_p = e_sp, ∂E_sp/∂θ_s = a_p·e_sp. Parameter estimation for the model with learning is analogical.
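Under the constant-variance simplification, the initialization scheme and the stochastic gradient descent loop fit in a short sketch. This is an illustrative Python sketch, not the authors' implementation; all names are ours, and `data` is assumed to be a list of (student index, problem index, log-time) triples:

```python
import random

def fit_times(data, n_students, n_problems, lr=0.01, epochs=200, seed=0):
    """SGD for the constant-variance model t_sp = b_p + a_p*theta_s + noise."""
    data = list(data)
    rng = random.Random(seed)
    # initialization following the scheme above: b_p = mean log-time,
    # a_p = -1, theta_s = mean of (b_p - t_sp)
    b, counts = [0.0] * n_problems, [0] * n_problems
    for s, p, t in data:
        b[p] += t
        counts[p] += 1
    b = [bi / max(c, 1) for bi, c in zip(b, counts)]
    a = [-1.0] * n_problems
    theta, scounts = [0.0] * n_students, [0] * n_students
    for s, p, t in data:
        theta[s] += b[p] - t
        scounts[s] += 1
    theta = [th / max(c, 1) for th, c in zip(theta, scounts)]
    for _ in range(epochs):
        rng.shuffle(data)
        for s, p, t in data:
            e = t - (b[p] + a[p] * theta[s])  # prediction error e_sp
            b[p] += lr * e                    # gradient w.r.t. b_p: e_sp
            theta[s] += lr * a[p] * e         # gradient w.r.t. theta_s: a_p*e_sp
            a[p] += lr * theta[s] * e         # gradient w.r.t. a_p: theta_s*e_sp
    return b, a, theta
```

Because of the scale indeterminacy mentioned earlier, only b_p is directly identified; a_p and θ_s are recovered up to the normalization.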
3. EVALUATION
3.1 Simulated Data
3.2 Predictions
Next we report on the evaluation of predictions of problem solving times for data on real students using the Problem Solving Tutor [7, 9], a free web-based tutoring system for practicing problem solving (available at tutor.fi.muni.cz). The system has more than 10,000 registered students (mainly university and high school students), who have spent more than 13,000 hours solving more than 400,000 problems.
3.3
Even though the more complex models do not lead to substantially improved predictions, they can still carry interesting information. Predictions are useful for guiding the behaviour of tutoring systems, but a small improvement in prediction precision will not change the behaviour of the system in a significant way. The important aim of the more complex models is to give us additional information about students and problems, e.g., the students' learning rate, which can be used for guiding the behaviour of the tutoring system and for providing feedback to users.
Since the model with learning does not improve predictions, it may be that the additional parameters merely overfit the data and thus do not contain any valuable information.
To test this hypothesis we performed the following experiment: we split the data into two disjoint halves, used each half to train one model, and then compared the parameter values in these two independent models. Specifically, we measured the Spearman correlation coefficient for the values of each parameter.
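This split-half reliability check can be sketched as follows (an illustrative Python sketch, not the authors' code; `fit` and `param_of` are hypothetical stand-ins for model fitting and parameter extraction, and the Spearman implementation ignores ties):

```python
def spearman(x, y):
    """Spearman rank correlation (no tie correction; a simple sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(observations, fit, param_of):
    """Train on two disjoint halves and correlate fitted parameter values."""
    half1, half2 = observations[::2], observations[1::2]
    return spearman(param_of(fit(half1)), param_of(fit(half2)))
```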
Table 2 shows results for the model with learning. The results show that the estimates of basic difficulty and basic skill correlate highly; the weakest correlation between the estimates from the two halves is for the discrimination parameter. For the students' learning rate, the additional parameter of the extended model, we get a correlation coefficient between 0.5 and 0.7, a significant correlation which signals that the fitted parameters contain meaningful values.
We also analyzed correlations among different model parameters, e.g., between skill and learning rate. Generally there is only a weak correlation between parameters, which shows that the new parameters bring additional information.
Table 2. Correlations of parameter estimates between models trained on two disjoint halves of the data:
Sokoban: 0.789, 0.394, 0.963, 0.347
Rush Hour: 0.737, 0.509, 0.962, 0.434
Nurikabe: 0.904, 0.570, 0.837, 0.195

4. REFERENCES
Springer, 2006.
[5] A. Corbett and J. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
[6] M. C. Desmarais and R. S. J. de Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Model. User-Adapt. Interact., 22(1-2):9–38, 2012.
[7] P. Jarušek and R. Pelánek. Problem response theory and its application for tutoring. In Educational Data Mining, pages 374–375, 2011.
[8] P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. In Proc. of Intelligent Tutoring Systems (ITS), volume 7315 of LNCS, pages 379–388. Springer, 2012.
[9] P. Jarušek and R. Pelánek. A web-based problem solving tool for introductory computer science. In Proc. of Innovation and Technology in Computer Science Education, page 371. ACM, 2012.
[10] P. Kantor, F. Ricci, L. Rokach, and B. Shapira. Recommender Systems Handbook. Springer, 2010.
[11] Y. Koren and R. Bell. Advances in collaborative filtering. Recommender Systems Handbook, pages 145–186, 2011.
[12] B. Martin, A. Mitrovic, K. R. Koedinger, and S. Mathan. Evaluating and improving adaptive educational systems with learning curves. User Modeling and User-Adapted Interaction, 21(3):249–283, 2011.
[13] A. Newell and P. Rosenbloom. Mechanisms of skill acquisition and the law of practice. Cognitive Skills and Their Acquisition, pages 1–55, 1981.
[14] N. Thai-Nghe, T. Horváth, and L. Schmidt-Thieme. Factorization models for forecasting student performance. In Proc. of Educational Data Mining, pages 11–20, 2011.
[15] K. VanLehn. The behavior of tutoring systems. International Journal of Artificial Intelligence in Education, 16(3):227–265, 2006.
Anna N. Rafferty
Computer Science Division
University of California, Berkeley
rafferty@cs.berkeley.edu
Jodi Davenport
WestEd
Oakland, CA 94612
jdavenp@wested.org
Emma Brunskill
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
ebrun@cs.cmu.edu
ABSTRACT
Estimating students' knowledge based on their interactions with computer-based tutors has the potential to improve learning by decreasing the time spent taking assessments and by facilitating personalized interventions. Although good student models exist for relatively structured topics and tutors, less progress has been made with more open-ended activities. Further, students often complete activities in pairs rather than individually, with no coding to indicate who performed each action. We investigate whether pair interactions with an open-ended chemistry tutor can be used to predict individual students' post-test performance. Using L1-regularized regression, we show that student interactions with the tutor are predictive both of the average post-test score for the pair and of individual scores. Towards better understanding pair dynamics in this setting, we also find that for pairs composed of students with similar pre-test scores, we can predict the difference in the students' post-test scores.
Keywords
Collaboration, embedded assessment, supervised learning
1. INTRODUCTION
Computer-based educational activities have many advantages over traditional tests as a means to assess student knowledge. The function of testing is to provide information about student proficiency. If an analysis of how a student completes an activity can provide similar information, time-intensive post-tests can be eliminated, and students can have access to the immediate feedback known to support learning. Projects such as ASSISTments [13] and stealth assessment [15] have demonstrated the potential of this approach.
Both interactive activities specifically designed for assessment and traditional intelligent tutoring systems provide valuable information about student knowledge. Simulation-based activities that are designed to be assessments have proven effective for measuring science inquiry and reasoning skills (e.g., [5, 12]). Many tutoring systems use student modeling to estimate proficiency as a student works through problems in the tutor. However, estimating students' knowledge based on their work in games and more open-ended environments introduces new challenges [3]. These environments are less structured, lack explicit tags about which tasks correspond to which skills, and may offer few opportunities to practice the same skill repeatedly in a similar context. Despite these challenges, games and more open-ended environments are important as they enable different forms of learning and can be used where formal testing is impractical.
A further challenge in estimating student knowledge from computer-based activities is that in classroom environments, students often work with computers in groups. Though collaboration can improve students' learning from computer-based science activities, automatically logged data rarely captures explicit collaboration, such as which student provided any given input, or what conversations occurred in conjunction with the activity.
In this paper, we explore whether machine-learning-based approaches can predict student knowledge based on interactions with an open-ended chemistry tutor, ChemVLab+. Due to limitations in the number of available computers in many classrooms, students generally use ChemVLab+ in pairs, and we analyze only data from paired interactions. We investigate what predictions we can make about individual student knowledge, corroborated by a separate post-test, based on the students' interactions with ChemVLab+.
2. BACKGROUND
2.1
Many computer-based educational environments have open-ended components in which students explore topics using free-form actions. One approach to understanding student learning is to identify behaviors that are correlated with high or low learning gains. For instance, the WISE platform has identified patterns of inquiry behavior that are common in more successful students [10]. Kinnebrew, Loretz, and Biswas [8] identified patterns of student actions associated with periods of productivity and analyzed which patterns were correlated with high learning gains. In contrast
2.2

3. CHEMVLAB+
ChemVLab+ is a collection of online activities that allow students to apply their chemistry knowledge in authentic, real-world contexts [2]. Each activity involves a separate problem, such as whether factories are reporting accurate pollution levels, and consists of a series of pages. ChemVLab+ activities include both free-form actions, such as virtual labs (see Figure 1), and more constrained actions, such as multiple choice questions. The virtual labs enable similar actions as in a real chemistry lab: students manipulate beakers and use chemical instruments. These virtual labs are a key part
4.
We now explore what we can learn about students' knowledge based on their interactions with ChemVLab+, beginning with prediction of post-test scores. Paired performance data cannot necessarily tell us about individuals: an intraclass correlation shows that for our data, the two post-test scores in a pair are not significantly correlated (r = 0.12, p = .08, n.s.). However, as we will shortly see, we can still predict some interesting aspects of individual and pair performance.
4.1 Methods
We also tried imputation using data only from pairs who were
similar to the current pair on the other activities; this did not
significantly affect predictive performance.
[Table: prediction tasks evaluated: average post-test score, higher post-test score, and lower post-test score.]
4.2 Results
5.
5.1 Methods
Lasso regression is again used for prediction and feature selection, with the same 48 features as in the previous analysis. 10-fold cross-validation is used to fit the model, and the regression is limited to 20 features with non-zero weights. In analyses with pre-test features, these features are the highest pre-test score in the pair, the lowest pre-test score in the pair, and the difference between the two pre-test scores.
5.2 Results
[Table: results broken down by pairs included: all pairs, similar pre-test, high pre-test, and low pre-test.]
6. CONCLUSIONS
Given differences in classroom implementations and the pedagogical benefits of more open-ended tutors, there are many advantages to predicting student performance based on real-world use of these systems. In this paper, we examined data from a series of chemistry activities that students completed in pairs, and found that pairs' interactions with the activities were predictive of individual post-test scores. Though we could make some predictions about differences in post-test scores within a pair, there is likely to be a limit on how well we can perform this task given the lack of data about individuals within the pair. We plan to explore how limited data about individual behavior, collected via classroom observation, can be used to create more accurate models of collaboration, and whether explicitly modeling control of the computer as a latent variable can improve performance. We would also like to explore a broader feature set, including features that capture changes in performance over time and more fine-grained virtual lab features (e.g., from pattern mining [8]). We see this work as a first step in showing the potential of data mining techniques to transform collaborative educational activities into embedded assessments, even when the activities were not designed for this purpose.
Acknowledgements. This research was supported by an NDSEG Fellowship to ANR and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A100069 to WestEd. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
7. REFERENCES
Matthew A. Ung
mung001@ucr.edu
Alexander E. Zundel
azund001@ucr.edu
James Herold
jhero001@ucr.edu
Thomas F. Stahovich
stahov@engr.ucr.edu
nrhod001@ucr.edu
Dept. of Mechanical Engineering
University of California, Riverside
Riverside, CA 92521, USA
ABSTRACT
Numerous studies have shown that self-explanation can lead to improved learning outcomes. Here we examine how the words which students use in their self-explanations correlate with their performance in the course as well as with the effort they expend on their homework assignments. We compute two types of numerical features to characterize students' work: vocabulary-based features and effort-based features. The vocabulary-based features capture the frequency with which individual words and n-grams appear within students' self-explanations. The effort-based features estimate the effort expended on each assignment as the amount of time spent writing a homework solution or self-explanation response.

We use the most predictive vocabulary-based and effort-based features to train a linear regression model to predict students' overall course grades. This model explains up to 19.4% of the variance in students' performance. Furthermore, the underlying parameters of this model provide valuable insights into the ways students explain their own work, and the cognitive processes students employ when asked to self-explain. Additionally, we use the vocabulary-based features to train linear regression models to predict each of the effort-based features. In doing so we demonstrate that the vocabulary employed by a student to self-explain his or her solution to an assignment correlates with the amount of effort that student expends on that particular assignment. Both of these findings serve as a basis for a novel automated assessment technique for evaluating student performance.
1. INTRODUCTION
Self-explanation is the process by which a student explains his or her solution process, summarizing his or her understanding. Prior work has demonstrated that self-explanation can improve a student's metacognitive skills, leading to improved learning gains. These studies have typically focused on summative assessments of students' learning, demonstrating, for example, that students who were asked to provide self-explanations of their homework solutions performed better on exams than students who did not.

In this paper, we present a novel technique which provides a formative analysis of self-explanation, identifying behaviors which correlate with good performance. In particular, we employ machine learning techniques to identify successful patterns latent in students' self-explanations.
This analysis is enabled by our unique dataset of students' handwritten coursework. We conducted a study in which students in an undergraduate Mechanical Engineering Statics course generated handwritten self-explanations of the major steps they followed when solving each of their homework problems. The students completed the homework and self-explanations using Livescribe™ smartpens. These devices produce a digital record of students' handwritten work in the form of time-stamped pen strokes, enabling us to see not only the final ink on the page, but also the order in which it was written.
We compute numerical features from this digital record which characterize the vocabulary used and the effort (time) expended, both in solving problems and in writing self-explanations. Using these features we have computed a statistical model which predicts students' grades on various homework assignments. This model accounts for up to 19.4% of the variance in the students' performance. Furthermore, the underlying parameters of this model provide valuable insights into the ways students explain their own work, and the cognitive processes students employ when asked to self-explain.

Additionally, we use the vocabulary-based features to train linear regression models to predict each of the effort-based features. In doing so we demonstrate that the vocabulary employed by a student to self-explain his or her solution correlates with the amount of effort that student expends
2.

3. EXPERIMENTAL DESIGN
4. DATA PROCESSING
We manually transcribed each handwritten self-explanation, producing 111 text documents. Each document contains all self-explanation written by a single student for a single homework assignment. During this manual transcription we made slight modifications to the students' explanations to make them suitable for later processing. First, we corrected any spelling mistakes, but did not correct grammatical errors. Second, we replaced each verb with its unconjugated form; for example, we replaced "pushed" (past tense) with "push" (infinitive). Our later analysis counts the number of occurrences of words based on exact spelling, and these changes ensure that spelling variations do not prevent words from being correctly identified.
We also developed a thesaurus to replace synonymous words with a single, canonical word. Students use a variety of words to refer to a given concept or object. For example, when students described a free-body diagram, they often used the terms "system" and "body" interchangeably. To ensure that semantically identical words were identified as such, we manually developed a thesaurus that maps a canonical concept to each of the words that may be used to express that concept. For example, we created a free-body diagram element concept category that comprises every word that students used to refer to any component (body) in a free-body diagram, such as "jaw" or "handle". In this example, whenever the word "jaw" was found in a transcript, it was replaced with the token "FBD-Element". We developed a total of ten conceptual categories with the help of a Statics domain expert. There were approximately 1,640 unique words used by students across all documents before correcting spelling or verb tense. After applying our thesaurus-based replacement, there were 750 unique words.
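The thesaurus replacement step can be sketched as a simple token mapping. This is only an illustration, not the authors' code; the word list below contains just the two example words from the text, whereas the real thesaurus covered ten categories:

```python
# Canonical concept -> surface words (illustrative; only "jaw" and "handle"
# are taken from the paper's example).
THESAURUS = {
    "FBD-Element": ["jaw", "handle"],
}
# Invert the thesaurus so each surface word looks up its canonical token.
SURFACE_TO_CANONICAL = {w: c for c, ws in THESAURUS.items() for w in ws}

def canonicalize(text):
    """Lower-case the text, then replace synonyms with their canonical token."""
    return " ".join(SURFACE_TO_CANONICAL.get(w, w) for w in text.lower().split())
```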
5.
6.
Given that there are 750 unique words across all explanations, there would be over 19,000 TF-IDF, bigram, and trigram features computed for each of the 111 documents. This feature set is too large and would lead to an over-fit model with inflated accuracy. To address this issue we use two feature subset selection algorithms to reduce the size of our feature set.
First, we apply the computationally inexpensive RELIEF [10] algorithm to prune our feature set to the top 500 features. The RELIEF algorithm scores each feature by its similarity to the nearest instance of the same class and to the nearest instance of each other class. Next, we apply the computationally expensive, but more rigorous, Correlation Feature Selection (CFS) [7] algorithm to further reduce the feature subset. We use the RELIEF and CFS implementations available in the WEKA [6] machine learning software suite.
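The RELIEF scoring idea can be sketched as follows. This is a minimal two-class version for illustration only (the paper used the WEKA implementation); features are assumed scaled to comparable ranges, and all names are ours:

```python
def relief_scores(X, y):
    """Basic RELIEF feature scoring for a two-class problem.

    For each instance, find its nearest "hit" (same class) and nearest
    "miss" (other class); a feature gains weight when it separates the
    miss and agrees with the hit.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        diff = [j for j in range(n) if y[j] != y[i]]
        dist = lambda j: sum((X[i][k] - X[j][k]) ** 2 for k in range(d))
        hit = min(same, key=dist)   # nearest neighbour of the same class
        miss = min(diff, key=dist)  # nearest neighbour of the other class
        for k in range(d):
            w[k] += abs(X[i][k] - X[miss][k]) - abs(X[i][k] - X[hit][k])
    return w
```

Features with the highest scores are kept; in the paper, the top 500 then go on to CFS.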
7.
The accuracy of our model for predicting student performance is encouraging. More interesting, though, is the fact that the model and its parameters indicate the self-explanation
9. CONCLUSION
subset of features for predicting homework performance. Using this subset, we computed a linear regression model that predicts students' grades on homework assignments. This model accounts for 19.4% of the variance in the students' performance. While this is a strong correlation, what is more valuable are the insights that can be drawn from the underlying parameters of this model. The coefficient weights of the model may be used to guide manual analysis of the students' self-explanation responses, revealing patterns that provide insights into the types of self-explanation behaviors that are indicative of understanding or a lack thereof.
10. REFERENCES
[1] S. Banerjee and T. Pedersen. The design, implementation, and use of the ngram statistics package. Computational Linguistics and Intelligent Text Processing, pages 370–381, 2003.
[2] K. Bielaczyc, P. L. Pirolli, and A. L. Brown. Training in self-explanation and self-regulation strategies: Investigating the effects of knowledge acquisition activities on problem solving. Cognition and Instruction, 13(2):221–252, 1995.
[3] W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, 1994.
[4] M. T. Chi, M. Bassok, M. W. Lewis, P. Reimann, and R. Glaser. Self-explanations: How students study and use examples in learning to solve problems. Cognitive Science, 13(2):145–182, 1989.
[5] K. Forbes-Riley, D. Litman, A. Purandare, M. Rotaru, and J. Tetreault. Comparing linguistic features for modeling learning in computer tutoring. In Proceedings of the 2007 Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work, pages 270–277, Amsterdam, The Netherlands, 2007. IOS Press.
[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[7] M. A. Hall. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, University of Waikato, Hamilton, New Zealand, 1998.
[8] S. Hall and E. A. Vance. Improving self-efficacy in statistics: Role of self-explanation and feedback. Journal of Statistics Education, 18(3), 2010.
[9] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972.
[10] K. Kira and L. A. Rendell. A practical approach to feature selection. In D. H. Sleeman and P. Edwards, editors, Ninth International Workshop on Machine Learning, pages 249–256. Morgan Kaufmann, 1992.
[11] L. Prevost, K. Haudek, M. Urban-Lurain, and J. Merrill. Deciphering student ideas on thermodynamics using computerized lexical analysis of student writing. 2012.
ABSTRACT
This paper applies meta-learning to recommend the best subset of white-box classification algorithms for educational datasets. A case study with 32 Moodle datasets was employed that considered not only traditional statistical features, but also complexity and domain-specific features. Different classification performance measures and statistical tests were used to rank algorithms. Furthermore, a nearest neighbor approach was used to recommend the subset of algorithms for a new dataset. Our experiments show that the best recommendation results are obtained when all three types of dataset features are used.
Keywords
Meta-learning, classification, predicting student performance
2. METHODOLOGY
We propose a meta-learning methodology that consists of two steps (see Figure 1).

1. INTRODUCTION
One of the oldest and best-known problems in educational data mining (EDM) [10] is predicting students' performance as a classification task. A wide range of algorithms has been applied to predict academic success and course results. However, selecting and identifying the most adequate algorithm for a new dataset is a difficult task, because there is no single classifier that performs best on all datasets, as shown by the No Free Lunch (NFL) theorem [6]. Choosing appropriate classification algorithms for a given dataset is therefore of great practical importance. Meta-learning has been used successfully to address this problem [12]. Meta-learning is the study of methods that exploit meta-knowledge to obtain efficient models and solutions by adapting machine learning and the DM process [4].
A recommendation can be presented in various ways, such as the best algorithm in a set, a subset of algorithms, a ranking of algorithms, or the estimated performance of algorithms. We propose to use several classification evaluation measures and statistical tests to rank algorithms, and a nearest neighbor approach to recommend the subset of best algorithms for a given new dataset.
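The nearest neighbor recommendation step can be sketched as follows. This is a minimal illustration, not the authors' code; the names are ours, the distance is plain squared Euclidean over meta-feature vectors, and feature extraction and algorithm ranking are assumed to have been done elsewhere:

```python
def recommend(meta_features, best_algorithms, new_features):
    """Recommend an algorithm subset for a new dataset by copying the
    recommendation of its nearest neighbour in meta-feature space.

    meta_features: {dataset_name: [meta-feature values]}
    best_algorithms: {dataset_name: set of best algorithm names}
    new_features: meta-feature vector of the new dataset
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(meta_features,
                  key=lambda d: dist(meta_features[d], new_features))
    return nearest, best_algorithms[nearest]
```

In practice the meta-features would mix the statistical, complexity, and domain features the paper describes, ideally normalized so no single feature dominates the distance.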
Meta-learning has been used mainly with general-domain and publicly available datasets such as UCI [2]. However, we have not found any papers that tackle algorithm selection using meta-learning in the EDM domain; there is only one related work, on using meta-learning to support the selection of parameter values in a J48 classifier using several educational datasets [8]. In the educational domain, the comprehensibility of the discovered classification models is an important issue, since they should be interpretable by users who are not experts in data mining (such as instructors, course authors and other stakeholders) so that they can be used in decision-making processes. Indeed, white-box DM models based on rules are preferred to black-box DM models such as Bayesian and artificial neural networks, although they are
3. DATASETS
Dataset: Ni, Nna, Nca, Nc, IR, Domain
Dataset 1: 98, 1.08, Report
Dataset 2: 194, 1.39, Report
Dataset 3: 786, 9.8, Quiz
Dataset 4: 658, 9.1, Quiz
Dataset 5: 67, 40, 1.23, Quiz
Dataset 6: 922, 3, 19.27, Quiz
Dataset 7: 910, 3, 19.24, Quiz
Dataset 8: 114, 11, 1.19, Forum
Dataset 9: 42, 11, Forum
Dataset 10: 103, 11, 1.53, Forum
Dataset 11: 114, 11, 1.43, Forum
Dataset 12: 98, 1.91, Forum
Dataset 13: 81, 1.19, Forum
Dataset 14: 33, 12, 32, Forum
Dataset 15: 82, 12, 3.1, Forum
Dataset 16: 113, 40, 23.5, Quiz
Dataset 17: 105, 41, 1.06, Quiz
Dataset 18: 123, 10, 3.89, Quiz
Dataset 19: 102, 10, 1.06, Quiz
Dataset 20: 75, 2.12, Report
Dataset 21: 52, 1.89, Report
Dataset 22: 208, 10, 3.25, Report
Dataset 23: 438, 10, 4, 15.41, Report
Dataset 24: 421, 10, 14.2, Report
Dataset 25: 84, 5.43, Report
Dataset 26: 168, 4, 11.25, Report
Dataset 27: 136, 11.5, Report
Dataset 28: 283, 10, 1.67, Report
Dataset 29: 155, 10, 1.21, Report
Dataset 30: 72, 11, Report
Dataset 31: 40, 10, 1.2, Quiz
Dataset 32: 48, 10, 1.8, Quiz
4. EXPERIMENTS
An initial experiment was carried out to select a subset of whitebox classification algorithms that best predicted the final students
performance for each Moodle dataset. We used only rule-based
and decision trees algorithms due to the fact that they provide
models that can be easily understood by humans and used directly
in the decision-making process.
[Table: average rankings of the white-box algorithms over the evaluation measures Sen, Prec, F-M, Kap, and AUC. Algorithms compared include ConjunctiveRule, DecisionTable, DTNB, NBTree, JRip, REPTree, NNge, DecisionStump, OneR, PART, Ridor, ZeroR, BFTree, J48, LADTree, LMT, RandomForest, RandomTree, and SimpleCart; extracted average ranks include DTNB 5.25, Ridor 5.667, OneR 6.333, ConjunctiveRule 8.083, J48 8.417, and PART 8.833.]
[Table: nearest neighbor recommendations by feature set (statistical, complexity, and domain combinations), with neighbor pairs such as Dataset 13 and Dataset 11, and Dataset 11 and Dataset 22.]
F-Measure = (2 × precision × recall) / (precision + recall)
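The F-measure used among the evaluation measures is the harmonic mean of precision and recall; a minimal helper (illustrative, not the authors' code):

```python
def f_measure(precision, recall):
    """F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```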
5. CONCLUSIONS
In this paper, meta-learning has been used to address the problem of recommending a subset of white-box classifiers for Moodle datasets. Several classification performance measures are used together with several statistical tests to rank and select a subset of algorithms. Results show that the complexity and domain features used to characterize datasets can improve the quality of the recommendation. For future work, we plan to extend the experimentation, for example, using more datasets, algorithms (including black-box models), characteristics, evaluation measures, etc.
6. ACKNOWLEDGMENTS
This work was supported by the Regional Government of Andalusia and the Spanish Ministry of Science and Technology projects P08-TIC-3720 and TIN-2011-22408, and FEDER funds.
7. REFERENCES
[1] Aha, D., Kibler, D. Instance-based learning algorithms. Machine Learning, 6, 37-66, 1991.
[2] Asuncion, A., Newman, D.J. UCI Machine Learning Repository, University of California, Irvine, CA, 2007. (http://www.ics.uci.edu/mlearn/MLRepository.html).
[3] Bhatt, N., Thakkar, A., Ganatra, A. A Survey & Current Research Challenges in Meta Learning Approaches based on Dataset Characteristics. International Journal of Soft Computing and Engineering, 2(10), 234-247, 2012.
[4] Brazdil, P., Giraud-Carrier, C., Soares, C., and Vilalta, R. Metalearning: Applications to Data Mining. Series: Cognitive Technologies. Springer, 2009.
[5] Demsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30, 2006.
[6] Hämäläinen, W., Vinni, M. Classifiers for educational data mining. In Handbook of Educational Data Mining. Chapman & Hall/CRC, 2011.
[7] Ho, T.K., Basu, M. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289-300, 2002.
[8] Molina, M.M., Romero, C., Luna, J.M., Ventura, S. Meta-learning approach for automatic parameter tuning: A case study with educational datasets. 5th International Conference on Educational Data Mining, Chania, Greece, 180-183, 2012.
[9] Orriols-Puig, A., Macià, N., and Ho, T.K. Documentation for the data complexity library in C++. Technical report, La Salle - Universitat Ramon Llull, 2010.
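The ranking step can be sketched as a Friedman-style comparison: each algorithm is ranked per dataset by its score, and ranks are averaged across datasets. This is an illustrative sketch, not the authors' implementation:

```python
def average_ranks(scores):
    """scores: {algorithm: [score per dataset]}, higher is better.
    Returns {algorithm: mean rank}; rank 1 is best on a dataset.
    Ties are broken by sort order rather than averaged, to keep the sketch short."""
    algos = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {a: 0 for a in algos}
    for d in range(n_datasets):
        ordered = sorted(algos, key=lambda a: scores[a][d], reverse=True)
        for rank, a in enumerate(ordered, start=1):
            totals[a] += rank
    return {a: totals[a] / n_datasets for a in algos}

print(average_ranks({"JRip": [0.8, 0.8], "OneR": [0.6, 0.7]}))
# {'JRip': 1.0, 'OneR': 2.0}
```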
G. Tanner Jackson
Laura K. Varner
Erica.L.Snow@asu.edu
Tanner.Jackson@asu.edu
Laura.Varner@asu.edu
Danielle S. McNamara
Tempe, AZ, USA
Arizona State University
Danielle.McNamara@asu.edu
ABSTRACT
The current study investigates the relation between personalizable
feature use, attitudes, and in-system performance in the context of
the game-based system, iSTART-ME. This analysis focuses on a
subset (n=40) of a larger study (n=126) conducted with high
school students. The results revealed a positive relation between
students' frequency of interactions with personalizable features
and their self-reported engagement and perceived system control.
Students who frequently interacted with personalizable features
also demonstrated better overall in-system performance compared
to students who interacted with these features less often. The
current paper adds to the growing literature supporting the
potential positive impact that personalizable features have on
students' attitudes and performance in adaptive learning
environments.
Keywords
Personalization, attitudes, game-based features, off-task behaviors
1. INTRODUCTION
A growing trend in the field of adaptive learning environments
has been the study of educational games and their effects on
users' interest and engagement during learning [1-2]. When games are incorporated
into these learning environments, students have demonstrated
increased engagement and motivation [3-4]. However, few studies
have investigated how features within educational games may lead
to off-task behaviors and ultimately influence in-system
performance (notable exceptions include [5-6]).
An exploration of the interactive features in educational game
environments may allow researchers to identify the aspects of
system interfaces that benefit or hinder students' learning.
Developers have integrated many types of interactive choice-based features into educational game interfaces, including:
personalizable avatars, interactive maps, and customizable
background colors. These features have been found to increase
1.2 iSTART-ME
The Interactive Strategy Training for Active Reading and
Thinking Motivationally Enhanced (iSTART-ME) system is a
game-based learning environment designed to improve students'
reading comprehension ability. This system is an extension of a
previous version of iSTART, which provides students with
instruction and practice using reading comprehension strategies
while reading challenging texts [15].
The iSTART-ME interface is controlled through a selection menu
where students can read and self-explain texts, personalize
avatars, play mini-games, earn points, win trophies, and view their
2. METHODS
Participants in this study included 40 high school students from a
mid-south urban environment. The sample included in the current
work is a subset of 126 students who originally participated in a
larger study that compared three conditions: iSTART-ME, the
original version of iSTART, and a no-tutoring control [17]. The
current study focuses on the students who were assigned to the
iSTART-ME condition. These students had access to the full
game-based system in which the game-based interface features
were available.
All students completed an 11-session experiment consisting of a
pretest, 8 training sessions, a posttest, and a delayed retention test.
During the first session, students completed a pretest that included
survey measures assessing motivation, prior self-explanation
ability, prior reading ability [17], and attitudes toward technology
and games. During the following 8 sessions, students engaged
with the iSTART-ME interface for a minimum of 1 hour, where
they could play games, interact with texts and personalize system
features. After training, students completed a posttest, which
included measures that were similar to the pretest. Finally, 5 days
after the posttest, students returned to complete a retention test,
which contained measures similar to the pretest and posttest (i.e.,
self-explanation and attitudinal measures).
[Table: posttest attitudinal measures and response scales: Enjoyment, Boredom (e.g., "I felt bored"), Motivation, and Lack of Control, each rated on a 1-6 scale]
3. RESULTS
We examined interactions with personalizable features and their
relation to students' performance and attitudes during training
within iSTART-ME. Using the process data from students'
interactions, we calculated the number of times students spent
their earned iBucks on personalizable features. We first examined
how off-task personalization of any kind (avatar, background
theme, or agent) related to in-system performance (achievement
levels and trophies won) and posttest attitudinal composite
measures (i.e., enjoyment, boredom, and motivation) and a single
measure of perceived lack of control. The correlation results in
Table 2 indicate that students' number of avatar edits was
marginally negatively related to posttest boredom and
significantly negatively related to perceived lack of control.
Table 2. Correlations between number of off-task edits and in-system performance and attitude measures

                      Avatar Edits  Background Theme  Pedagogical Agent
                      (n=30)        Edits (n=15)      Edits (n=17)
Achievement Level     .26           .35               .06
Trophies Won          .33           .39               -.10
Enjoyment Composite   .11           .07               .14
Boredom Composite     -.35(M)       -.13              -.31
Motivation Composite  .19           .15               -.04
Lack of Control       -.36*         -.18              -.21

Table 3. Means (standard deviations) for low and high avatar editors

                    Low Avatar Editors  High Avatar Editors
                    M (SD)              M (SD)
Achievement Level   13.76 (6.25)        18.45 (5.66)
Trophies Won        15.15 (11.77)       36.00 (36.98)
Boredom Composite   2.30 (.93)          1.55 (.59)
Lack of Control     2.36 (1.30)         1.18 (.75)
4. DISCUSSION
The incorporation of educational games within learning
environments has demonstrated positive impacts on student
engagement [1, 4]. However, many features within these games
may promote off-task behaviors. The current study aimed to gain
a deeper understanding of the relations among these features, in-system performance, and students' attitudes.
In line with previous work, our results demonstrated significant
negative relations among students' interactions with
personalizable features, boredom, and lack of control. These
results replicate previous work suggesting that personalization
potentially augments students' investment and perception of
control within a system [7]. ANOVAs revealed distinct
differences between high and low avatar editors in their self-reported attitudes and in-system performance. High editors
advanced to higher achievement levels and won more trophies
compared to low editors. High editors also expressed less
boredom and higher levels of perceived control in the system.
In contrast to some prior work [6], the current analyses on off-task
behaviors, in-system performance, and student attitudes provide
Acknowledgments
This research was supported in part by the Institute for
Educational Sciences (IES R305G020018-02; R305G040046,
R305A080589) and National Science Foundation (NSF
REC0241144; IIS-0735682). Any opinions, findings, and
conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the
IES or NSF.
5. REFERENCES
[1] Malone, T., and Lepper, M. 1987. Making learning fun: a
taxonomy of intrinsic motivations for learning. In Aptitude,
learning, and instruction, R. Snow and M. Fair, Eds. Vol. 3.
Cognitive and affective process analyses. Erlbaum, Hillsdale,
NJ, 223-253.
[2] Moreno, R., and Mayer, R. E. 2005. Role of guidance,
reflection, and interactivity in an agent-based multimedia
game. Journal of Educational Psychology, 97, (2005), 117-128.
[3] Jackson, G. T., and McNamara, D. S. 2011. Motivational
impacts of a game-based intelligent tutoring system. In
Proceedings of the 24th International Florida Artificial
Intelligence Research Society Conference, (Palm Beach, FL,
May 18-20, 2011) FLAIRS, AAAI, 519-524.
[4] Clark, D., Nelson, B., Sengupta, P., and D'Angelo, C. 2009.
Rethinking science learning through digital games and
simulations: Genres, examples, and evidence. In Learning
science: Computer games, simulations, and education
workshop, (Washington DC, October 2009). National
Academy of Sciences.
[5] Baker, R. S., Corbett, A. T., Koedinger, K. R., and Wagner,
A. Z. 2004. Off-task behavior in the Cognitive Tutor
classroom: When students game the system. In
Proceedings of Computer-Human Interaction, (Vienna,
Austria, April 24-30, 2004) CHI, ACM Press, 383-390.
[6] Rowe, J., McQuiggan, S., Robison, J., and Lester, J. 2009.
Off-task behavior in narrative-centered learning
environments. In Proceedings of the Workshop on Intelligent
Educational Games at the 14th Annual Conference on
Artificial Intelligence in Education, (Brighton, UK, July 06-10, 2009). AIED, IOS Press, 99-106.
[7] Cordova, D. I., and Lepper, M. R. 1996. Intrinsic motivation
and the process of learning: Beneficial effects of
contextualization, personalization, and choice. Journal of
Educational Psychology, 88, (1996) 715-730.
[8] Facer, K., Joiner, R., Stanton, D., Reid, J., Hull R., and Kirk,
D. 2004. Savannah: Mobile gaming and learning? Journal of
Computer Assisted Learning. 20, (2004) 399-409.
[9] Cocea, M., Hershkovitz, A., and Baker, R. S. 2009. The
Impact of Off-task and Gaming Behaviors on Learning:
Immediate or Aggregate? In Proceedings of the 14th Annual
Conference on Artificial Intelligence in Education,
(Brighton, UK, July 06-10, 2009). AIED, IOS Press, 507-514.
[10] Wallace, J. C., and Vodanovich, S. J. 2003. Workplace safety
performance: conscientiousness, cognitive failure, and their
interaction. Journal of Occupational Health Psychology, 8,
(2003) 316-327.
[11] Baker, R. S. J. 2007. Modeling and understanding students' off-task
behavior in intelligent tutoring systems. In Proceedings
of the ACM Conference on Human Factors in Computing
Systems, (San Jose, CA, April 28-May 03, 2007). CHI,
ACM Press, 1059-1068.
[12] Austin, J. L., and Soeda, J. M. 2008. Fixed-time teacher
attention to decrease off-task behaviors of typically
developing third graders. Journal of Applied Behavior
Analysis, 41, (2008) 279-283.
[13] Rai, D., and Beck, J. 2012. Math learning environment with
game-like elements: An experimental framework.
International Journal of Game Based Learning, 2, (2012),
90- 110.
[14] Jackson, G. T., Boonthum, C., and McNamara, D. S. 2009.
iSTART-ME: Situating extended learning within a game-based environment. In Proceedings of the Workshop on
Intelligent Educational Games at the 14th Annual
Conference on Artificial Intelligence in Education,
(Brighton, UK, July 06-10, 2009). AIED, IOS Press, 59-68.
[15] McNamara, D. S., O'Reilly, T., Rowe, M., Boonthum, C.,
and Levinstein, I. B. 2007. iSTART: A web-based tutor that
teaches self-explanation and metacognitive reading
strategies. In Reading comprehension strategies: Theories,
interventions, and technologies, D. S. McNamara, Ed.
Erlbaum. Mahwah, NJ, 397-420.
[16] Jackson, G. T., and McNamara, D. S. (in press). Motivation
and performance in a game-based intelligent tutoring
system. Journal of Educational Psychology.
[17] MacGinitie, W., and MacGinitie, R. 1989. Gates-MacGinitie
reading tests. Riverside, Chicago, IL.
Aaron D. Likens
G. Tanner Jackson
Erica.L.Snow@asu.edu
Aaron.Likens@asu.edu
TannerJackson@asu.edu
Danielle S. McNamara
Arizona State University
Tempe, AZ, USA
Danielle.McNamara@asu.edu
ABSTRACT
The purpose of this study was to investigate students' patterns of
interactions within a game-based intelligent tutoring system (ITS),
and how those interactions varied as a function of individual
differences. The analysis presented in this paper comprises a
subset (n=40) of a larger study that included 124 high school
students. Participants in the current study completed 11 sessions
within iSTART-ME, a game-based ITS that provides training in
reading comprehension strategies. A random walk analysis was
used to visualize students' trajectories within the system. The
analyses revealed that low-ability students' patterns of interactions
were anchored by one feature category whereas high-ability
students demonstrated interactions across multiple categories. The
results from the current paper indicate that random walk analysis
is a promising visualization tool for learning scientists interested
in capturing students' interactions within ITSs and other
computer-based learning environments over time.
Keywords
Intelligent Tutoring Systems, sequential pattern analysis, random
walk analysis, individual differences
1. INTRODUCTION
A growing trend in the field of educational technology has been to
use aggregated or summative analyses to trace students'
interactions with game-based features inside of Intelligent
Tutoring Systems (ITSs) [1-5]. These analyses capture students'
interactions with the system over time and at fixed intervals [3-5].
For example, an aggregated analysis of the frequency of students'
utilization of game-based features across multiple training
sessions found that patterns of interactions varied as a function of
individual differences in performance orientation [4]. Similarly,
summative methods have been used to investigate how the
availability of game-based elements inside of a system impacts
students' overall enjoyment [3].
Although aggregated and summative analyses shed some light on
users' overall system interactions, those statistical methods cannot
trace nuances in students' paths in adaptive learning
environments. The current study utilizes sequential pattern
1.2 iSTART-ME
The Interactive Strategy Training for Active Reading and
Thinking Motivationally Enhanced (iSTART-ME) is a game-based ITS developed on top of an existing system, iSTART [11].
iSTART was developed to provide instruction and practice in
comprehension strategies and improve student comprehension of
difficult science texts.
iSTART-ME training includes three initial phases where reading
comprehension strategies are introduced, demonstrated, and
practiced (phases are discussed in more detail in [1]). A fourth
phase includes extended practice, where students apply the
strategies across numerous texts and multiple sessions. iSTART-ME situates this extended practice within a game-based selection
menu (see Figure 1), which includes: generative practice games,
personalizable features, achievement screens, and identification
[Figure: directional assignment of feature categories for the random walk, including identification mini-games, personalizable features, and achievement screens]
2. METHODS
2.1 Participants
Participants in the current work (n=40) were a subset of 124 high
school students who participated in a study at a large university
campus in the Mid-Southern United States [12]. The current
analyses focus only on those students who were randomly
assigned to interact with the game-based iSTART-ME system
(other students in the original study were assigned to an ITS or a
no-tutoring control). The students included here consisted of 20
males and 20 females, with an average age of 16 years.
2.2 Procedure
Students in this study completed an 11-session experiment that
consisted of a pretest, 8 training sessions within iSTART-ME, a
posttest, and a delayed retention test. During session 1,
participants completed a pretest to assess their attitudes,
motivation, prior self-explanation (SE) quality, vocabulary
knowledge, and prior reading ability. SE quality was measured at
pretest using the iSTART algorithm, which ranges from 0 (poor)
to 3 (good) [13]. This score provides a rough indicator for the
amount of cognitive processing involved, and represents the
quality of a student's self-explanation [14]. Prior reading ability
was assessed using the Gates MacGinitie Reading Test [15].
Students interacted with the iSTART-ME system during sessions
2 through 9. During session 10, students completed a posttest,
which included measures similar to the pretest. Finally, five days
after the posttest, students returned to complete a retention test,
consisting of similar self-explanation and comprehension
measures.
3. RESULTS
The current study examined students' patterns of interactions with
game-based features and how they may vary as a function of
individual differences. Students' data logs from their eight
sessions in iSTART-ME were used to categorize every interaction
into one of the four possible game-based feature types: generative
practice games, personalizable features, achievement screens, and
identification mini-games. The random walk algorithm was then
used to construct a unique pattern for each participant (see Figure
3 for one student's complete walk pattern).
2.3 Analysis
The current study employs a random walk algorithm to visualize
student interaction patterns across time (sessions 2 through 9).
Game-based features were grouped into four distinct categories
and each was assigned to a vector on an X, Y scatter plot.
Although the current study used only four dimensions, the number
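The random walk construction described in Section 2.3 can be sketched as follows. This is a hypothetical implementation: the particular direction assigned to each category is an assumption for illustration, not taken from the paper.

```python
# Assumed axis assignment: each feature category moves the walk one
# unit in a fixed direction on an X, Y plane.
DIRECTIONS = {
    "generative_game": (1, 0),
    "identification_game": (-1, 0),
    "personalizable_feature": (0, 1),
    "achievement_screen": (0, -1),
}

def walk(interactions):
    """Turn an ordered list of category labels into (x, y) walk points."""
    x, y, points = 0, 0, [(0, 0)]
    for category in interactions:
        dx, dy = DIRECTIONS[category]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

print(walk(["generative_game", "generative_game", "achievement_screen"]))
# [(0, 0), (1, 0), (2, 0), (2, -1)]
```

The slope of a fitted line through these points is what the correlations below summarize.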
Correlations between the slope of each student's walk and individual difference measures:

Measure                 Slope (r)
Prior Reading Ability   -.593**
Prior SE Quality        -.496**
Vocabulary Knowledge    -.297(M)
4. DISCUSSION
The current study used a random walk analysis to view sequences
of patterns in students' interactions within the game-based ITS,
iSTART-ME. We suggest that sequential pattern analysis may
benefit learning scientists by providing a new method for tracking
and viewing students' interactions with game-based features
across time. Using the slopes of each student's unique walk, we
found that there was a relation between a student's trajectory
through the system and their prior reading comprehension ability
and prior self-explanation (SE) quality scores. Investigating this
relation further, we found that students with higher reading ability
and higher SE quality scores showed significantly different
trajectories compared to students with lower reading ability and
those with lower quality self-explanations. Low-ability students
tended to interact more with generative practice games, whereas
high-ability students interacted in a more balanced way with both
generative practice games and identification mini-games.
The implications of the current study are promising for
researchers in two ways. First, we have shown that random walk
5. ACKNOWLEDGEMENTS
This research was supported in part by the Institute for
Educational Sciences (IES R305G020018-02; R305G040046,
R305A080589) and National Science Foundation (NSF
REC0241144; IIS-0735682). Any opinions, findings, and
conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the
IES or NSF.
6. REFERENCES
[3] Rai, D., and Beck, J. 2012. Math learning environment with
game-like elements: An experimental framework.
International Journal of Game Based Learning, 2, (2012),
90- 110.
vstoto@it.usyd.edu.au
lina.markauskaite@sydney.edu.au
michael.jacobson@sydney.edu.au
ABSTRACT
In this paper, we present a technique that we have developed to
transform sequences of technical events into more abstract actions
and semantic activities. The sequences of more abstract units are
then used for discovering patterns of students interaction with
computer models using heuristic miner. Our proposed approach
automatically segments sequences of technical events that
occurred during model runs and pauses and, on the basis of the
nature of technical events that occurred during model runs and
pauses, clusters them into actions. Then, using heuristic rules, it
classifies actions into activities. We demonstrate the usefulness of
our multilevel abstraction for extracting and exploring
characteristic patterns of students' interaction with computer
models. Our study shows that each abstraction level could help to
identify distinct characteristics of students' interaction.
Keywords
Process mining, data pre-processing, multilevel data analysis,
model-based learning.
1. INTRODUCTION
It has been acknowledged that model-based inquiry could be a
very effective pedagogical approach for teaching and learning
complex scientific knowledge. Computer models could help
students to engage in first-hand scientific investigations of
complex social and natural phenomena and construct deep,
authentic understanding of many complex processes, such as
climate change [3, 4]. However, a number of studies that
investigated model-based learning have found that not all learners
succeed in achieving the desired learning outcomes [1, 8, 9]. They
suggested that students' failure or success in learning from
computer models may be related to differences in how
students interact with computer models. For example, studies have
demonstrated that some students, when they explore computer
models, systematically change one parameter after another; in
contrast, other students approach tasks in more haphazard ways
and change different model settings simultaneously [6]. While the
former students usually succeed in completing given inquiry tasks,
the latter students tend to be less successful in demonstrating
desired learning outcomes. However, methods for exploring students'
model-based inquiry processes have been complicated, usually
based on human coding, and thus hard to implement in computer
systems. Only recently have researchers started to explore the
possibility of extracting students' inquiry characteristics and
patterns automatically from log files [2, 7]. These techniques
could help to detect productive and unproductive student
behaviours and scaffold students' inquiry automatically.
Our data pre-processing included three steps. During the first two
steps, we abstracted action-sequences from the event-sequences,
and in the third step, we further abstracted activity-sequences
from action-sequences. The recorded log files comprised process
instances. A process instance is a sequence of events in which
each event represents an interaction with a model (e.g., a click of
the Setup button).
[Table: categories of technical events and example action codes, including model control (Start [S], Stop [E]; e.g., S-00000-E, E-00000-S), model control (Setup; e.g., S-10000-E), tracking (Follow a CO2 molecule On/Off, dichotomous; e.g., S-00010-E), speed change (Change of speed, continuous), and parameter change (Change fossil fuel use, continuous; e.g., S-00100-E)]
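A minimal sketch of the binary indexing behind codes such as S-00010-E. The event vocabulary and its positional order are assumptions for illustration:

```python
# Assumed event vocabulary; position i in the code is 1 iff event i
# occurred somewhere in the segment.
EVENTS = ["setup", "speed", "fossil_fuel_use", "follow_on", "follow_off"]

def bag_of_events(segment):
    """Collapse one event-sequence segment into a binary 'bag of events'."""
    present = set(segment)
    return "".join("1" if e in present else "0" for e in EVENTS)

# A model run during which the students only tracked a CO2 molecule:
print("S-" + bag_of_events(["follow_on"]) + "-E")  # S-00010-E
```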
[Table: heuristic rules mapping action codes and durations (sec) to activities, e.g., S-00000-E with duration >0 maps to Control_Run; S-00100-E with duration >0 to Conf_Run; E-10000-S with duration =0 to Reset_Run; E-X0000-S with duration >0 to Control_Pause; E-X0100-S with duration >0 to Conf_Pause; other run and pause combinations map to Inter_Run, Comb_Run, Inter_Pause, and Comb_Pause]
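Heuristic rules of this shape can be sketched as simple conditionals. This is a hypothetical fragment covering only a few of the codes, for illustration:

```python
def classify(code, seconds):
    """Map an action code and its duration to an activity label (sketch)."""
    if code == "E-10000-S" and seconds == 0:
        return "Reset_Run"        # setup pressed immediately after a stop
    if code == "S-00000-E" and seconds > 0:
        return "Control_Run"      # run with no further interaction
    if code == "S-00100-E" and seconds > 0:
        return "Conf_Run"         # parameter changed while running
    if code == "E-X0000-S" and seconds > 0:
        return "Control_Pause"    # pause with no configuration
    if code == "E-X0100-S" and seconds > 0:
        return "Conf_Pause"       # pause used to configure the model
    return "Other"

print(classify("S-00000-E", 12))  # Control_Run
```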
3. RESULTS
There were 21 event logs that, in total, included 2582 technical
events. The event logs were segmented into 1520 event-sequences.
Each event-sequence was transformed into a bag of
events with binary indexing. All transformed event-sequences
were then clustered into action-sequences and activity-sequences.
In order to demonstrate the capability of our data preparation
method, we present and compare the process patterns of one dyad
created using data at three different levels of abstraction: events,
actions, and activities. We generated three causal dependency
diagrams (nets) from the pre-processed data using the Heuristic
Miner algorithm with the default parameters [9, 10]. The analysis
and comparison of these three causal nets shows that each of the
process models depicts distinct features of the students' interaction
with the simulation (Figures 2-4).
The event process model shows that start and stop dominated
the students' interaction with the model, and each event was
recorded in the log 33 times (Figure 2). The setup event, during
which students reset parameters, also appeared quite often (20
times), whereas the fossil-fuel-use and speed events were
executed only 14 and 11 times, respectively. The model shows
that the stop event plays quite a distinct role in the process pattern,
and two other events, start and fossil-fuel-use, have high
dependencies on the stop event (0.9). This indicates that the dyad
usually took these two actions after stopping the model. Start
and stop events are mutually co-dependent, meaning that these
students often started the simulation and then stopped it without
any further interaction and/or configuration. Nevertheless, start
and stop events form part of a larger three-event loop of
stop, fossil fuel use, and start, with dependencies of 0.9 of these
three events on each other. This indicates that these students usually
configured the model after stopping the simulation. Start,
speed, and stop events form another loop, where the speed
event has a dependency on the start event (0.5) and the stop event has
a dependency on speed (0.7). The dependencies in this event
loop, however, are lower, which indicates that the change of speed
did not necessarily follow the start of the simulation or precede
the stop. Indeed, the extracted process net shows that speed also
depends on fossil fuel use, which suggests that the students
sometimes changed speed just after changing the fossil fuel use
parameter, thus configuring modeling and interaction parameters
together. Further, the setup event has a noticeable dependency on
stop (0.7), suggesting that the dyad usually reset the simulation
after stopping it. However, there is a loop from the setup event to
setup, indicating that students, for some reason, sometimes
pressed the setup button multiple times.
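The dependency values quoted above can be read as Heuristic Miner dependency measures, computed from direct-succession counts; a sketch of the standard measure (cf. [9]):

```python
def dependency(a_then_b, b_then_a):
    """Heuristic Miner dependency a => b.
    a_then_b: times event a is directly followed by b, and vice versa.
    Values lie in (-1, 1); values near 1 mean b reliably follows a."""
    return (a_then_b - b_then_a) / (a_then_b + b_then_a + 1)

print(round(dependency(9, 0), 2))  # 0.9
```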
Figure 3. Process model mined from the action-sequences.
The activity process net provides additional insight into the
students' modeling behavior (Figure 4). As we can see, the students'
activity pattern involves two kinds of model run activities: simple
control of the model (27 activities) and interaction with the model
while the simulation is running (8 activities). These activities are
similar to the actions depicted in the action process model (Figure
3). However, the activity pattern shows that students' activities
during the pauses also form two dominant groups: pauses that
involve simple control (16 activities) and pauses that involve
configuration of the model (13 activities). The control pauses are
strongly co-dependent with just one activity, which is the control
of the simulation during model runs (dependency 0.9). In contrast,
the pauses that involve model configuration are co-dependent with
both types of simulation runs: simple control and interaction with
the simulation while the model was running (dependencies 0.9).
This pattern shows that students' active configuration during
pauses was followed by a mix of passive observation and more
proactive exploration during runs, whereas passive behavior
during pauses was always followed by passive behavior
during runs.
5. ACKNOWLEDGMENTS
The research reported in this paper was funded by an Australian
Research Council Linkage Grant No. LP100100594. We thank
Paul Stokes, Nick Kelly, Kashmira Dave and the teachers and
students for their invaluable assistance in this study.
6. REFERENCES
[1] Bose, R.P.J.C., and Aalst, W.M.P.v.d. 2009. Abstractions in
process mining: A taxonomy of patterns. In Proceedings of
Business Process Management, 159-175.
[2] Buckley, B.C., Gobert, J.D., Horwitz, P., and O'Dwyer, L.M.
2010. Looking inside the black box: Assessing model-based
learning and inquiry in BioLogica. International Journal of
Learning Technology, 5(2), 166-190.
[3] Goldstone, R.L., and Wilensky, U. 2008. Promoting transfer
by grounding complex systems principles. Journal of the
Learning Sciences, 17(4), 465-516.
[4] Kelly, N., Jacobson, M., Markauskaite, L., and Southavilay, V.
2012. Agent-based computer models for learning about climate
change and process analysis techniques. In Proceedings of the 10th
International Conference of the Learning Sciences, International
Society of the Learning Sciences, Sydney, Australia, 25-32.
[5] Levy, S.T., and Wilensky, U. 2010. Mining students' inquiry
actions for understanding of complex systems. Computers &
Education, 56(3), 556-573.
[6] McElhaney, K.W., and Linn, M.C. 2011. Investigations of a
complex, realistic task: Intentional, unsystematic, and exhaustive
experimenters. Journal of Research in Science Teaching, 48(7), 745-770.
[7] Pedro, M.S., Baker, R.S.J., Gobert, J.D., Montalvo, O., and
Nakama, A. 2013. Leveraging machine-learned detectors of
systematic inquiry behavior to estimate and predict transfer of
inquiry skill. User Modeling and User-Adapted Interaction, 23(1), 1-39.
[8] Thompson, K., and Reimann, P. 2010. Patterns of use of an
agent-based model and a system dynamics model: The application
of patterns of use and the impacts on learning outcomes.
Computers & Education, 54(2), 392-403.
[9] Weijters, A.J.M.M., and Ribeiro, J.T.S. 2011. Flexible
heuristics miner (FHM). In CIDM'11, Eindhoven University of
Technology, Eindhoven, 310-317.
Kenneth R. Koedinger
Elizabeth A. McLaughlin
john@stamper.org
koedinger@cmu.edu
mimim@cs.cmu.edu
ABSTRACT
Keywords
Model Selection, AIC, BIC, Cross-validation, KC Modeling.
1. INTRODUCTION
Figure 1. Screenshot of the KC Models page in DataShop (http://pslcdatashop.org) for the dataset Cog Model Discovery Experiment
Spring 2010. Here we can see named models with a different number of KCs in each. Note that all the models with the same number
of observations with KCs (41,756, for example) are comparable with each other. DataShop also allows the user to select the metric
on which to sort the models. In this case, the models are sorted by AIC, where a lower value is better.
Table 1. AIC and BIC correlations against each other and cross-validation
[Table: for five KC-model datasets (with 8,181; 4,957; 71,805; 37,423; and 7,094 observations, respectively), the correlations of AIC and of BIC with student-stratified (SSCV), item-stratified (ISCV), and non-stratified (NSCV) cross-validation, plus the AIC-BIC correlation; reported correlations range from 0.064 to 1.000]
than for AIC (2) for any non-trivially sized data set. In general, this means that BIC favors models with fewer parameters (again, more strongly than AIC) and converges to the true or correct model [1]; however, BIC's usefulness does not require that the true model exist in the set of candidate models [2]. Both reduce the chance of over-fitting the data by penalizing increases in the number of parameters in the model. They are much faster to compute than cross validation and are believed to reasonably predict the results of cross validation, though no systematic investigation of that has been performed, at least for the kinds of EDM models investigated here. Given that AIC is more lenient, one might suspect it would be more susceptible to favoring models that over-fit the data. On the other hand, BIC might over-penalize more complex models that do capture true variability in the data. Many previous efforts to evaluate knowledge component models in EDM have used BIC as the evaluation criterion, including Learning Factors Analysis (LFA) [3], Performance Factors Analysis (PFA) [11], and Instructional Factors Analysis (IFA) [4].
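In symbols, AIC = 2k − 2 log L and BIC = k ln(n) − 2 log L for a model with k parameters fit to n observations at maximized likelihood L. A minimal sketch (the numbers below are hypothetical, not drawn from the datasets in Table 1):

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * log_likelihood

# With n observations, BIC charges ln(n) per parameter versus AIC's
# flat 2, so BIC is the stricter criterion once n exceeds e^2 ~ 7.4.
ll, k, n = -1500.0, 10, 41756
print(aic(ll, k))                  # → 3020.0
print(bic(ll, k, n) > aic(ll, k))  # → True
```

Since both criteria share the −2 log L term, comparing them across models only makes sense when the models are fit to the same observations, which is why DataShop flags models with matching observation counts as comparable.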
Table 2. Correlations and rank-order correlations across the five metrics provided in DataShop (AIC, BIC, SSCV, ISCV and NSCV).

              AIC-BIC  AIC-SSCV  AIC-ISCV  AIC-NSCV  BIC-SSCV  BIC-ISCV  BIC-NSCV  SSCV-ISCV  SSCV-NSCV  ISCV-NSCV
Correlations  0.574    0.824     0.891     0.890     0.522     0.464     0.446     0.812      0.777      0.919
Rank corr.    0.532    0.817     0.852     0.847     0.478     0.403     0.420     0.760      0.735      0.868
each of the three cross validations, regardless of whether the AIC-BIC correlation is strong (rows 2-4), weak (row 5) or average (row 1). In all but one instance, the AIC correlations with cross validation are better than BIC's. Table 2 shows the average correlations and rank correlations between AIC, BIC, and the cross-validations (as stated earlier, three types of ten-fold cross validation are reported in DataShop: student-stratified cross validation (SSCV), item-stratified cross validation (ISCV) and non-stratified cross validation (NSCV)). From these averages in Table 2, AIC and BIC have a correlation with each other of just over 0.5, which makes sense since they often do not agree on the best fitting model. More importantly, AIC is a better predictor than BIC of all three kinds of cross validation. Interestingly, Table 2 shows SSCV is better indicated by AIC than the other CV metrics.
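The three schemes differ only in how observations are grouped into folds. A toy sketch of the fold-assignment logic (not DataShop's implementation; the row format is an assumption):

```python
import random

def cv_folds(rows, n_folds=10, scheme="NSCV", seed=0):
    """Sketch of the three fold-assignment schemes.
    rows: list of (student, item) observation pairs.
    SSCV keeps all of a student's observations in one fold, ISCV all
    of an item's observations, NSCV assigns observations at random."""
    rng = random.Random(seed)
    if scheme == "NSCV":
        return [rng.randrange(n_folds) for _ in rows]
    key = {"SSCV": 0, "ISCV": 1}[scheme]
    units = sorted({row[key] for row in rows})
    shuffled = rng.sample(units, len(units))
    fold_of = {u: i % n_folds for i, u in enumerate(shuffled)}
    return [fold_of[row[key]] for row in rows]

rows = [(s, i) for s in range(5) for i in range(4)]
folds = cv_folds(rows, n_folds=5, scheme="SSCV")
# Under SSCV every observation of a given student lands in one fold.
assert all(len({f for r, f in zip(rows, folds) if r[0] == s}) == 1
           for s in range(5))
```

Stratifying by student tests whether a model generalizes to unseen students; stratifying by item tests generalization to unseen items.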
Thus, on those grounds, it seems as though AIC is the best single
measure. In general, AIC best models average more knowledge
components (53 vs. 34) and more parameters (205 vs. 166) than
BIC best models. It is not surprising, then, that there is a high
level of disagreement between best model selections for AIC and
BIC (68% do not match). When comparing the best models of
AIC and BIC to the best models of all three types of cross
validation, AIC again matches better than BIC (approximately
70% to 10%). This better matching of best models is another
strong argument that AIC is a better metric for model selection.
5. REFERENCES
bvds@asu.edu
ABSTRACT
There are various methods for determining the moment at
which a student has learned a given skill. Using the Akaike
information criterion (AIC), we introduce an approach for
determining the probability that an individual student has
learned a given skill at a particular problem-solving step. We
then investigate how well this approach works when applied
to student log data. Using log data from students using
the Andes intelligent tutor system for an entire semester,
we show that our method can detect statistically significant
amounts of learning, when aggregated over skills or students.
In the context of intelligent tutor systems, one can use this
method to detect when students may have learned a skill
and, from this information, infer the relative effectiveness of
any help given to the student or of any behavior in which
the student has engaged.
Keywords
data mining, information theory
1. INTRODUCTION
test. Heckler and Sayre [4] introduce an experimental technique where they administered a test to a different subgroup
of students in a large physics class each week during the
quarter, cycling through the entire class over the course of
the quarter (a between-students longitudinal study). With
a sufficiently large number of students (1694 students over
five quarters), they were able to produce plots of student
mastery of various skills as a function of time, and identify
exactly which week(s) students learned a particular skill.
However, the shortest time scale that one could imagine for
this kind of approach (administering a test in a classroom
setting) can, at best, be a day or so. Can we do better?
The use of an intelligent tutoring system (ITS) provides a way
forward. In this case, student activity is analyzed and logged
for each user interface element change, with a granularity
of typically several 10s of seconds. Instead of relying on a
distant pre-test or post-test, the experimenter can examine
student (or tutor system) activity in the immediate vicinity
of the event of interest.
Baker, Goldstein, and Heffernan [1] construct a model that
predicts the probability that a student has learned a skill at
a particular time based on the Bayesian Knowledge Tracing
(BKT) algorithm [3]. BKT gives the probability that the
student has mastered a skill at step j using the student's performance on previous opportunities to apply that skill.
The authors supplement the BKT result with information on
student correctness for the two subsequent steps j+1 and j+
2 and infer the probability that the student learned the skill
at that step. Finally, they use their model to train a second
machine-learned model that does not rely on future student
behavior, so it could be run in real time as the student is working.
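A heavily simplified sketch of that flavor of computation, assuming for illustration that no further learning happens during the two lookahead steps (this is not Baker et al.'s exact model, and all parameter values are hypothetical):

```python
def p_learned_at_step(p_know, p_learn, guess, slip, next_two):
    """Probability the skill was learned at the current step, given
    the correctness (0/1) of the next two opportunities. A sketch,
    not Baker et al.'s exact model: the knowledge state is assumed
    frozen during the two lookahead steps."""
    def p_obs(known):
        # Likelihood of the two observed outcomes given the state.
        p = 1.0
        for correct in next_two:
            p_correct = (1 - slip) if known else guess
            p *= p_correct if correct else (1 - p_correct)
        return p
    learned_now   = (1 - p_know) * p_learn * p_obs(True)
    never_learned = (1 - p_know) * (1 - p_learn) * p_obs(False)
    already_knew  = p_know * p_obs(True)
    return learned_now / (learned_now + never_learned + already_knew)

# Two subsequent correct steps raise the posterior that learning
# happened here (all numbers hypothetical):
print(round(p_learned_at_step(0.2, 0.1, 0.2, 0.1, (1, 1)), 3))  # → 0.254
```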
We will address the same problem using an information-theoretic approach. Starting with the Akaike information criterion and a simple model of learning, we use a multi-model strategy to predict the probability that learning has
occurred at a given step, and to predict how much learning
has occurred. We apply our approach to student log data
from an introductory physics course. We find that, for an
individual student and skill, detection of learning has large
uncertainties. However, if one aggregates over skills or students, then learning can be detected at the desired level of
significance.
Correct/Incorrect steps
Our stated goal is to determine student learning for an individual student as they progress through a course. What observable quantities should be used to determine student mastery? One possible observable is correct/incorrect steps,
whether the student correctly applies a given skill at a particular problem-solving step without any preceding errors or
hints. There are other observables that may give us clues
on mastery: for instance, how much time a student takes
to complete a step that involved a given skill. However,
other such observables typically need some additional theoretical interpretation. For example, what is the relation between time taken and mastery? Baker, Goldstein, and
Heffernan [1] develop a model of learning based on a Hidden
Markov model approach. They start with a set of 25 additional observables (for instance, time to complete a step)
and construct their model and use correct/incorrect steps
(as defined above) to calibrate the additional observables
and determine which are statistically significant. Naturally,
it is desirable to eventually include such additional observables in any determination of student learning. However, in
the present investigation, we will focus on correct/incorrect
steps.
What do we mean by a step? A student attempts some
number of steps when solving a problem. Usually, a step j
is associated with creating/modifying a single user interface
object (writing an equation, drawing a vector, defining a
quantity, et cetera) and is a distinct part of the problem
solution (that is, help-giving dialogs are not considered to be
steps). A student may attempt a particular problem-solving step, delete the object, and later attempt that solution step
again. A step is an opportunity to learn a given Knowledge
Component (KC) [6] if the student must apply that KC or
skill to complete the step.
For each KC and student, we select all relevant step attempts and mark each step as correct (or 1) if the student
completes that step correctly without any preceding errors
or requests for help; otherwise, we mark the step as incorrect (or 0). A single student's performance on a single KC can be expressed as a bit sequence, e.g., 00101011.
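The encoding can be sketched as follows; the attempt fields are illustrative, not the Andes log schema:

```python
def kc_bit_sequence(step_attempts):
    """Encode a student's opportunities on one KC as a bit string:
    1 = completed correctly with no preceding errors or hint
    requests, 0 = otherwise. The dict field names are hypothetical."""
    bits = []
    for step in step_attempts:
        ok = step["correct"] and step["errors"] == 0 and step["hints"] == 0
        bits.append("1" if ok else "0")
    return "".join(bits)

steps = [
    {"correct": False, "errors": 1, "hints": 0},  # error first -> 0
    {"correct": True,  "errors": 0, "hints": 1},  # needed a hint -> 0
    {"correct": True,  "errors": 0, "hints": 0},  # clean solve -> 1
]
print(kc_bit_sequence(steps))  # → 001
```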
2.
2.1 Method
Figure 1: Histogram of the number of distinct student-KC sequences in student dataset A having a given number of steps n.
We examined log data from 12 students taking an intensive introductory physics course at St. Anselm College during summer 2011. The course covered the same content as
a normal two-semester introductory course. Log data was
recorded as students solved homework problems while using
the Andes intelligent tutor homework system [7]. 231 hours
of log data were recorded. Each step was assigned to one
or more different KCs. The dataset contains a total of 2017
distinct student-KC sequences covering a total of 245 distinct KCs. We will refer to this dataset as student dataset
A. See Figure 1 for a histogram of the number of student-KC sequences having a given number of steps.
Most KCs are associated with physics or relevant math skills
while others are associated with Andes conventions or user-interface actions (such as notation for defining a variable).
The student-KC sequences with the largest number of steps
are associated with user-interface related skills, since these
skills are exercised throughout the entire course.
One of the most remarkable properties of the distribution
in Fig. 1 is the large number of student-KC sequences containing just a few steps. The presence of many student-KC
sequences with just one or two steps may indicate that the
default cognitive model associated with this tutor system
may be sub-optimal; there has not been any attempt, to
date, to improve on the cognitive model of Andes with, say,
Learning Factors Analysis [2]. Another contributing factor
is the way that introductory physics is taught in most institutions, with relatively little repetition of similar problems.
This is quite different than, for instance, a typical middle
school math curriculum where there are a large number of
similar problems in a homework assignment.
3. MULTI-MODEL APPROACH
[Figure panels: step correctness for the example bit sequence, the AIC for each sub-model, and the weighted gain w_L ΔL at each step.]
Let us illustrate this technique with a simple example. Suppose the bit sequence for a particular student-KC sequence is 00011011 (8 opportunities); see Fig. 2. We fit this bit sequence to 8 sub-models of the step model, corresponding to L ∈ {1, 2, ..., 8}, by maximizing the log likelihood, log L_L. The associated AIC values are given by AIC_L = 2K − 2 log L_L, where K is the number of fit parameters. Note that there are two parameters (s and g) when L > 1 and only one parameter (s) when L = 1. Not surprisingly, the best fit (lowest AIC) corresponds to the first 1 in the bit sequence at step 4. From the AICs, we calculate the Akaike weights
w_L = exp(−AIC_L/2) / Σ_L′ exp(−AIC_L′/2).    (2)

Figure 2: Akaike weights for the sub-models P_step,L(j). This gives the relative probability that the student learned the KC just before step L. The case L = 1 corresponds to no learning occurring during use of the ITS.

4. WEIGHTED GAIN

Figure 4: Histogram of weighted gains w_L ΔL for all steps in all student-KC sequences of dataset A.
where ΔL is computed from the maximum-likelihood estimators for g and s given by sub-model P_step,L(j). For the no-learning case L = 1, we set Δ1 = 0. We will call w_L ΔL the weighted gain associated with P_step,L(j). A calculation of w_L ΔL for an example bit sequence is shown in Fig. 3. Not surprisingly, the largest gain occurs at L = 4, corresponding to the first 1 in the bit sequence. The remaining weighted gains are much smaller.
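Under our reading of the step model (steps before L are correct with probability g, steps from L on with probability 1 − s, both rates fit by maximum likelihood), the sub-model AICs and Akaike weights for a bit sequence can be sketched as:

```python
import math

def bernoulli_loglik(bits):
    """Maximized log-likelihood of i.i.d. Bernoulli bits at the ML rate."""
    n, k = len(bits), sum(bits)
    ll = 0.0
    for count, p in ((k, k / n), (n - k, 1 - k / n)):
        if count:
            ll += count * math.log(p)
    return ll

def akaike_weights(bits):
    """Akaike weights for the sub-models 'learning occurs just before
    step L' (L = 1 means no learning). Our reading of the step model,
    not the paper's code: K = 2 parameters (g, s) for L > 1, K = 1 for L = 1."""
    n = len(bits)
    aics = []
    for L in range(1, n + 1):
        ll = bernoulli_loglik(bits[L - 1:])       # post-learning phase
        k_params = 1
        if L > 1:
            ll += bernoulli_loglik(bits[:L - 1])  # pre-learning phase
            k_params = 2
        aics.append(2 * k_params - 2 * ll)
    best = min(aics)
    raw = [math.exp(-(a - best) / 2) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

w = akaike_weights([0, 0, 0, 1, 1, 0, 1, 1])  # the paper's example 00011011
print(max(range(len(w)), key=w.__getitem__) + 1)  # → 4 (the first 1)
```

For this sequence the L = 1 sub-model gives AIC ≈ 13.1 and the best sub-model, L = 4, gives AIC ≈ 9.0, matching the values shown in Figure 2.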
Our ultimate goal is to distinguish steps that result in learning from steps that do not. Hopefully, one can use this information to infer something about the effectiveness of the
help given on a particular step, or the effectiveness of the
student activity on that step.
The fact that there are so many steps with negative gain is
symptomatic of bit sequences that are very noisy (a lot of
randomness). Indeed, if we compare the histogram for student dataset A with the histogram for a randomly generated
dataset R (we take A and randomly permute the steps) we
find a similar distribution; see Fig. 5.
What would the distribution look like if the data weren't so noisy? To see this, we generated an artificial ideal dataset I in which there were no slips or guesses, but with the same
student dataset
random dataset
ideal dataset
Number of steps
104
1000
100
10
1
-1.0
-0.5
0.0
0.5
1.0
Weighted gain
Figure 5: Histogram of weighted gains wL L for the
student dataset A, a randomly generated dataset R,
and an artificial ideal dataset I.
length distribution as A (Fig. 1). Thus, the bit sequences in I all have the form 00...01...1. In this case, for each
student-KC sequence, we expect a single large weighted gain
(corresponding to the first 1 in the bit sequence) and the
remaining weighted gains to be nearly zero. The resulting
distribution of gains is shown in Fig. 5.
We propose to use the following average of the weighted gains as a quality index for determining how suitable a dataset is for determining the point of learning for an individual student-KC sequence:

Q = (1/N) Σ_α Σ_L w_L ΔL    (3)

where α is an index running over all N student-KC sequences in a dataset. We use the sample standard deviation of the weighted gains w_L ΔL to calculate the standard error associated with Q. An example calculation is shown in Fig. 3.
For the random dataset R, the distribution of weighted gains is symmetric about zero and Q approaches zero as N → ∞. For the ideal dataset I, we expect that, when step L coincides with the first 1 in the bit sequence, w_L will be nearly one, with the associated ΔL also nearly one, so that Q → 1 in the limit of many opportunities. Numerically, we obtain Q = 0.5240 ± 0.0003. The fact that it is smaller than one is due to the large number of student-KC sequences having just a few steps. For the student dataset A, we obtain Q = 0.0467 ± 0.0065, which is small, but significantly larger than zero (p < 0.001). Thus, we conclude that one can detect statistically significant learning when applying our method to this student dataset, with the location of that learning given by the Akaike weights w_L.
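Eq. (3) and its standard error can be sketched as follows; we read the standard error as computed from the per-sequence summed gains, which is one plausible interpretation of the text:

```python
import math

def quality_index(per_sequence_gains):
    """Quality index Q of Eq. (3): the average over all N student-KC
    sequences of the summed weighted gains w_L * dL, plus a standard
    error computed from the per-sequence sums (our reading, not the
    author's code)."""
    sums = [sum(seq) for seq in per_sequence_gains]
    n = len(sums)
    q = sum(sums) / n
    var = sum((s - q) ** 2 for s in sums) / (n - 1)  # sample variance
    return q, math.sqrt(var / n)

# Three hypothetical student-KC sequences' weighted gains:
q, se = quality_index([[0.5, 0.1], [0.3], [0.2, 0.2]])
print(round(q, 4))  # → 0.4333
```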
5. CONCLUSION
6. ACKNOWLEDGMENTS
7. REFERENCES
ABSTRACT
Consider a large database of questions that assess the knowledge of learners on a range of different concepts. In this paper, we study the problem of maximizing the estimation accuracy of each learner's knowledge about a concept while minimizing the number of questions each learner must answer. We refer to this problem as test-size reduction (TeSR). Using the SPARse Factor Analysis (SPARFA) framework, we propose two novel TeSR algorithms. The first algorithm is nonadaptive and uses graded responses from a prior set of learners. This algorithm is appropriate when the instructor has access only to the learners' responses after all questions have been solved. The second algorithm adaptively selects the next best question for each learner based on their graded responses to date. We demonstrate the efficacy of our TeSR methods using synthetic and educational data.
Keywords
Learning analytics, sparse factor analysis, maximum likelihood estimation, adaptive and non-adaptive testing
1. INTRODUCTION
2. PROBLEM FORMULATION
2.1 SPARFA in a nutshell
Suppose we have a total set of Q questions that test knowledge from K concepts. For example, in a high school mathematics course, questions can test knowledge from concepts like solving quadratic equations, evaluating trigonometric identities, or plotting functions on a graph. For each question i = 1, ..., Q, let w_i ∈ R^K be a column vector that represents the association of question i to all K concepts. Note that each question can measure knowledge from multiple concepts. The j-th entry of w_i, which we denote by w_ij, measures the association of question i to concept j. In other words, if question i does not test any knowledge from concept j, then w_ij = 0. Let W = [w_1, ..., w_Q]^T be a sparse,
p(Y_i = 1 | c) = σ(w_i^T c + μ_i),    (1)

where σ(x) = 1/(1 + e^−x) is the inverse logistic link function. In words, (1) says that the probability of answering a question correctly depends on a sparse linear combination of the entries in the concept understanding vector c. This sparsity arises because of the assumption that w_i is sparse, i.e., it only contains a few non-zero entries. Given graded question responses from multiple learners, the factors W and C can be estimated using either the SPARFA-M or SPARFA-B algorithms introduced in [8].
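Eq. (1) in code, with mu_i denoting the question's intrinsic difficulty term (our notation for symbols lost in extraction; a sketch of the model, not the SPARFA reference implementation):

```python
import math

def p_correct(w_i, c, mu_i):
    """Probability that a learner with concept-understanding vector c
    answers question i correctly; w_i holds question i's sparse
    concept associations and mu_i its intrinsic difficulty term."""
    z = sum(w * cj for w, cj in zip(w_i, c)) + mu_i
    return 1.0 / (1.0 + math.exp(-z))  # inverse logistic link

# A question loading only on concept 0, asked of a learner who is
# strong on that concept (all numbers hypothetical):
print(round(p_correct([2.0, 0.0], [1.2, -3.0], -0.5), 3))  # → 0.87
```

Note that the learner's (negative) understanding of concept 1 has no effect here, because the question's loading on that concept is zero.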
2.2
Given the graded responses y_I, it turns out that the covariance of the error (ĉ − c) can be approximated by the inverse of the Fisher information matrix [4], which is defined as follows:

F(W_I, μ_I, c) = Σ_{i∈I} [ exp(w_i^T c + μ_i) / (1 + exp(w_i^T c + μ_i))^2 ] w_i w_i^T.    (2)

The notation W_I refers to the rows of W indexed by I. Similarly, μ_I refers to the entries in μ indexed by I. Thus, a natural strategy for choosing a good subset of questions I is to minimize the uncertainty (formally, the differential entropy) of a multivariate normal random vector with mean zero and covariance F(W_I, μ_I, c)^{−1}. Consequently, the optimization problem considered in the remainder of the paper, referred to as the test-size reduction (TeSR) problem, corresponds to

(TeSR)    Î = arg max_{I ⊆ {1,...,Q}, |I| = q} log det(F(W_I, μ_I, c)).    (3)
3. TESR ALGORITHMS
Our proposed algorithms for solving TeSR, which are data-driven and computationally efficient, are summarized in Algorithms 1 and 2. Due to space constraints, in what follows we only present a high-level summary of the methods.
Nonadaptive TeSR: Algorithm 1 summarizes a nonadaptive method (NA-TeSR) for solving the TeSR problem. To deal with the problem of the unknown c in (2), we notice that the coefficient of the term w_i w_i^T in (2) is simply the variance of a learner's answer to question i. This variance can easily be estimated using the prior student response data Ỹ. The first step in NA-TeSR is to estimate K questions, where K is the number of concepts involved in the question database. We are able to make use of properties of the determinant to formulate TeSR as a convex optimization problem, which we solve using low-complexity methods in [7]. The second step is to select the remaining q − K questions using a greedy algorithm that selects the best question iteratively until all q questions have been selected.
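The Fisher information of Eq. (2) and the greedy log-det step can be sketched as follows. The convex first stage is skipped here; we simply seed the greedy loop with a hypothetical initial set, and all data are synthetic:

```python
import numpy as np

def fisher_information(W, mu, c):
    """Eq. (2): F = sum_i var_i * w_i w_i^T, where var_i = p(1-p) is
    the Bernoulli variance of question i's response under the model."""
    F = np.zeros((W.shape[1], W.shape[1]))
    for w_i, mu_i in zip(W, mu):
        p = 1.0 / (1.0 + np.exp(-(w_i @ c + mu_i)))
        F += p * (1 - p) * np.outer(w_i, w_i)
    return F

def logdet_objective(W, mu, c, subset):
    """The TeSR objective log det F(W_I, mu_I, c) for question set I."""
    idx = np.asarray(subset)
    sign, logdet = np.linalg.slogdet(fisher_information(W[idx], mu[idx], c))
    return logdet if sign > 0 else -np.inf

def greedy_tesr(W, mu, c, q, initial):
    """Greedy second stage of NA-TeSR (a sketch): grow the initial set
    one question at a time, each time adding the question that most
    increases the log-det objective."""
    chosen = list(initial)
    remaining = [i for i in range(len(W)) if i not in chosen]
    while len(chosen) < q:
        best = max(remaining,
                   key=lambda i: logdet_objective(W, mu, c, chosen + [i]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
W = np.abs(rng.normal(size=(8, 2)))   # 8 questions, 2 concepts
mu = rng.normal(size=8)
c = np.array([0.5, -0.5])             # hypothetical abilities
print(greedy_tesr(W, mu, c, q=4, initial=[0, 1]))
```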
Remark 1: Note that when W is a Q × 1 vector of all ones, the SPARFA model reduces to the Rasch model [11]. In this case, (TeSR) reduces to a problem of maximizing the sum of the variance terms over the selected questions. Thus, all the questions can be selected independently of the others when using the Rasch model. On the other hand, when using SPARFA, since we account for the statistical dependencies among questions, the questions can no longer be chosen independently, as is evident from Algorithm 1.
Adaptive TeSR: Our second algorithm, A-TeSR, is designed for the situation where one can iteratively and individually ask questions to a learner and then use the responses to adaptively select the next best question based on the previous responses. Such an approach is often referred to as computerized adaptive testing [12].
4. EXPERIMENTAL RESULTS
[Figure 1: mean RMSE (a) and number of questions (b) for the Greedy, A-TeSR, NA-TeSR, A-Rasch, NA-Rasch, and Oracle methods.]
Figure 2: Mechanical Turk algebra test with 3 concepts; see Figure 1(a) for the legend.
[Figure: mean RMSE (a) and number of questions (b); see Figure 1(a) for the legend.]
5.
Acknowledgements
REFERENCES
Shi Feng
Blair Lehman
University of Memphis
365 Innovation Drive
Memphis, TN 38152
University of Memphis
365 Innovation Drive
Memphis, TN 38152
University of Memphis
365 Innovation Drive
Memphis, TN 38152
bavega@memphis.edu
sfeng@memphis.edu
balehman@memphis.edu
Arthur C. Graesser
Sidney D'Mello
University of Memphis
365 Innovation Drive
Memphis, TN 38152
graesser@memphis.edu
sdmello@nd.edu
ABSTRACT
Boredom and disengagement have been found to negatively
impact learning. Therefore, it is important for learning
environments to be able to track when students disengage
from a learning task. We investigated a method to track
engagement during self-paced reading by analyzing reading
times. We propose that a breakdown in the relationship
between reading time and text complexity can reveal
disengagement. A discrepancy (or decoupling) between
attention resources and text complexity was computed via the absolute difference between reading times and the text's Flesch-Kincaid Grade Level, a measure of text complexity.
As expected, decoupling varied as a function of text
complexity. We also found that text complexity differentially impacted decoupling profiles for different types of
participants (i.e., high vs. low comprehenders, fast vs. slow
readers). These results suggest that decoupling scores may
be a viable method to track disengagement during reading
and could be used to trigger interventions to help students
re-engage with the text and ultimately learn the material
more effectively.
Keywords
Engagement, boredom, reading, text complexity
1. INTRODUCTION
It is widely acknowledged that engagement in a learning
task is a necessary (but not sufficient) condition to achieve
learning gains. There is also data to support this
assumption. For example, student engagement was found to
positively correlate with learning during interactions with
an intelligent tutoring system (ITS) called AutoTutor,
whereas boredom negatively correlated with learning [1].
Given this relationship, learning environments should seek
to maximize engagement and minimize boredom and
disengagement.
A variety of methods have been used to track student
engagement during learning. These include body movements, facial expressions, aspects of language and
discourse, self-reports, and observations by trained judges
2. METHOD
2.1 Participants
There were 64 participants in the present study, recruited from Amazon's Mechanical Turk (AMT).
AMT acts as a mediator between researchers and
individuals to allow people to complete psychological tasks
online for monetary compensation. Participants were
limited to native English speakers of at least 18 years of
age. On average, it took participants 33 minutes to
complete the study and they were compensated US $4 for
their participation. Past research suggests AMT is a reliable
and valid source for collecting experimental data [9].
2.2 Materials
2.2.1 Texts
2.2.2 Knowledge Assessment
2.3 Procedure
Participants signed an electronic consent form and read the
instructions for the self-paced reading task. Self-paced
reading was adopted for this task to eliminate any pressures
from time constraints. Participants were then presented
with the first of four texts. A sentence-by-sentence reading
paradigm was used in which texts were presented one
sentence at a time and participants pressed the space bar to
move on to the next sentence. Reading times were collected
for each individual sentence from each of the four texts.
After participants read the first text, they were presented
with the knowledge assessment for the research methods
concept in that text. Participants then began the second text
and repeated this pattern for all four texts.
[Figure: decoupling score as a function of standardized Flesch-Kincaid Grade Level, for empirical and shuffled data, across participant groups.]
5. REFERENCES
4. CONCLUSION
This study investigated a new method of measuring and
tracking cognitive engagement during reading. The
decoupling score was derived from the absolute difference
between reading times and text complexity. We propose
that this measure assesses cognitive engagement because if
readers are engaged with the text, then their reading times
should be adjusted based on text complexity. In other
words, as the text becomes easier, reading times should
become relatively faster and, conversely, as the text
becomes more difficult reading times should become
relatively slower. We found evidence that the relationship
between reading time and text complexity did seem to
reveal patterns of disengagement. Moreover, we found that
the relationship between decoupling and the complexity of
the text varies based on individual differences in reading
speed and comprehension.
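The decoupling computation described above can be sketched as follows (our reading: standardize both series, then take absolute differences; this is not the authors' exact code):

```python
from statistics import mean, stdev

def decoupling_scores(reading_times, fk_grade_levels):
    """Decoupling score per sentence: standardize sentence reading
    times and Flesch-Kincaid grade levels, then take the absolute
    difference. High values suggest reading time is not tracking
    text complexity (possible disengagement)."""
    def z(xs):
        m, s = mean(xs), stdev(xs)
        return [(x - m) / s for x in xs]
    return [abs(rt - fk)
            for rt, fk in zip(z(reading_times), z(fk_grade_levels))]

# An engaged reader: times rise with complexity -> low decoupling.
times = [2.1, 3.0, 4.2, 5.1]   # hypothetical seconds per sentence
fkgl = [4.0, 6.0, 9.0, 11.0]   # hypothetical grade levels
print([round(d, 2) for d in decoupling_scores(times, fkgl)])
# → [0.01, 0.03, 0.03, 0.01]
```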
Despite these promising initial findings, we were not able
to completely explain the patterns of decoupling for all
types of participants. In particular, the relationship between
decoupling pattern and comprehension scores was not
clearly revealed in the differences between the empirical
data and the shuffled surrogate corpus for participants classified as fast reader-high comprehenders. This highlights a
limitation of using Flesch-Kincaid Grade Level to assess
text complexity. Flesch-Kincaid assesses text complexity at
a rather shallow level. It may be the case that more nuanced
measures of text complexity will be able to shed more light
on how decoupling impacts comprehension. Thus, in future
work we plan to investigate more differentiated measures
of text complexity, such as narrativity, syntactic simplicity,
referential cohesion, word concreteness, and situation
model cohesion using Coh-Metrix [12]. We are specifically
targeting cohesion because past research has shown that
cohesion and breakdowns in cohesion impact learning as
well as interact with prior knowledge [13].
Student engagement over the course of a learning
experience is a vital issue. This paper provides insight on
how text complexity can factor into cognitive engagement
levels and a possible measure for it. More importantly, this measure may be capable of tracking students' cognitive engagement across a span of text by simply using reading
Keywords
code snapshots, programming behavior, first-year challenges
1. INTRODUCTION
2. CONTEXT, PEDAGOGY AND TOOLS
The first two periods for CS majors are packed with mandatory courses: Introduction to Programming Part I (7 weeks,
period 1), Introduction to Programming Part II (7 weeks,
period 2), Software Modeling (7 weeks, period 2), that are
offered by the Department of CS. In addition, students are
expected to enroll into Introduction to University Mathematics (14 weeks, periods 1 and 2) that is organized by the
Department of Mathematics and Statistics.
Both courses Introduction to University Mathematics and
the Introduction to Programming Part I are organized using the Extreme Apprenticeship method (XA) [16], which is
a modern interpretation of apprenticeship-based learning [5,
6]. XA values students' personal effort and intensive interaction between the learner and the advisor, and emphasizes
deliberate practice [7] that aims towards mastering the craft.
As a craft can only be mastered by practising it, XA-based
courses contain lots of exercises. For example, during the first week of their programming course, CS freshmen already work on tens of programming tasks.
As XA stresses that activity be as genuine as possible, the students start working with industry-strength tools from day
one. We use NetBeans, which is an open source IDE (integrated development environment), bundled with an automated assessment service called Test My Code (TMC) [15],
which is used to download and submit exercises; moreover,
TMC is used to run tests on the students code in order to
verify the correctness of an exercise.
In addition to the assessment capabilities, with students' permission, TMC gathers data on students' programming process. Currently, a snapshot is taken whenever a student
saves her code, compiles the code, or pastes code into the
IDE. Each snapshot contains student id, timestamp, source
code changes and possible configuration modifications.
3.
3.1 Features
- hour of working
- minutes to deadline
- minutes between sequential snapshots
- edit distance between sequential snapshots
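Aggregating the features listed above from raw snapshots might look like the sketch below; the snapshot fields, and using the change in source length as a stand-in for edit distance, are illustrative assumptions rather than the TMC schema:

```python
from datetime import datetime

def snapshot_features(snapshots, deadline):
    """Per-student aggregates over a series of snapshots, where each
    snapshot is a (timestamp, source_length) pair (hypothetical
    format). Source-length difference stands in for edit distance."""
    times = [t for t, _ in snapshots]
    gaps = [(b - a).total_seconds() / 60
            for a, b in zip(times, times[1:])]
    edits = [abs(lb - la)
             for (_, la), (_, lb) in zip(snapshots, snapshots[1:])]
    return {
        "mean_hour_of_working": sum(t.hour for t in times) / len(times),
        "mean_minutes_to_deadline":
            sum((deadline - t).total_seconds() / 60 for t in times) / len(times),
        "mean_minutes_between_snapshots": sum(gaps) / len(gaps),
        "mean_edit_distance": sum(edits) / len(edits),
    }

snaps = [(datetime(2013, 9, 2, 20, 0), 120),
         (datetime(2013, 9, 2, 20, 30), 180),
         (datetime(2013, 9, 2, 21, 0), 170)]
feats = snapshot_features(snaps, deadline=datetime(2013, 9, 3, 12, 0))
print(feats["mean_minutes_between_snapshots"])  # → 30.0
```

A feature vector like this, one per student, is what feeds the pass/fail classifier described in the next section.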
[Figure: density distributions of the feature values for the PASS and FAIL groups.]
4.
We treat identifying the students who are likely to pass or fail their introductory mathematics course as a classification problem. The course result (pass/fail) is used as the class to predict, and each feature vector contains the aggregated values from a student's snapshots. In total there are
Week   Accuracy  Precision  Recall  F-Measure
1      84.6 %    85.7 %     85.7 %  85.7 %
2      88.5 %    92.3 %     85.7 %  88.9 %
3      92.3 %    92.3 %     92.3 %  92.3 %
4      94.2 %    100 %      89.3 %  94.3 %
5-6    98.1 %    100 %      96.4 %  98.2 %

Table 1: Results for data that includes students' programming behavior, and excludes compilation errors and style- and programming-related issues.
Accuracy  Precision  Recall  F-Measure
88.5 %    92.3 %     85.7 %  88.9 %
94.2 %    96.3 %     92.9 %  94.5 %
96.2 %    96.4 %     96.4 %  96.4 %
100 %     100 %      100 %   100 %
5.
With our current dataset, we are able to predict students' success or failure in a 14-week introductory mathematics course after only a few weeks of their studies, based on their programming behavior. Our current data indicates that computer science freshmen who have a tendency to crunch their exercises and start working close to the deadline are at a higher risk of failing their introductory mathematics course than students who work over a longer time interval.
There is a need for intervention early in the at-risk students' studies, which would direct the students towards more successful learning styles. However, we do not know if there is a direct causal link between these working habits and course outcomes, and cannot tell if our subjects would really perform better by, e.g., simply starting to work on their assignments earlier.
Our current number of samples (52) is relatively small, and
we need more data from future students.
Our current plan is to evaluate students' working process during Fall 2013 and to perform intervention(s) on the subset of the population that our current classifier indicates to be at risk. We are also seeking more descriptive behavioral indicators in the programming data beyond our current features. Overall, we are interested not only in a single course or a single semester, but in students' success across their whole studies.
6.
REFERENCES
Poster Presentations
(Regular Papers)
ABSTRACT
The field of educational data mining has long paid attention to Knowledge Tracing (KT). Corbett and Anderson assumed that the amount students learn does not depend on whether they get items right or wrong. Ohlsson and others argued that a student should learn more from a previous incorrect performance. We investigated a Bayes network similar to KT that allows learning rates to differ according to whether students answer items correctly. While the idea of allowing higher learning rates after incorrect performances seems intuitive, our experiments showed that it does not always lead to better predictions. Although reasoning from a null result is dangerous, our contribution is to show that this intuitive idea is not one other researchers should spend time on, unless they come up with a different model from ours (which is the naïve way of modifying KT).
2. LP Model
We considered a new assumption for the KT model: students can learn from their previous observed performance when some tutoring is associated with a wrong answer. This resulted in a new model, which we call the Learn from Performance (LP) model, presented in Fig. 1b.
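The LP assumption can be sketched as a one-step update in which the transition (learning) probability depends on the last observation. Parameter names are ours, and standard KT is recovered as the special case where the two learning rates are equal; this is an illustration, not the authors' code.

```python
def kt_posterior(p_know, correct, slip, guess):
    """P(skill known | observation) via Bayes' rule, as in standard KT."""
    if correct:
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    return num / den

def lp_update(p_know, correct, slip, guess, learn_correct, learn_wrong):
    """One step of the LP idea: the learning rate applied after the
    observation differs for correct vs. incorrect responses."""
    post = kt_posterior(p_know, correct, slip, guess)
    learn = learn_correct if correct else learn_wrong
    return post + (1 - post) * learn
```

Setting `learn_wrong > learn_correct` encodes the Ohlsson-style intuition that students learn more from errors.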
Keywords:
Knowledge Tracing, Bayesian Networks, Learn from performance, Tutoring Strategies
1. INTRODUCTION
The field of Educational Data Mining has depended to a large extent on the model developed by Corbett and Anderson [2], and enhanced by a number of authors, for predicting student performance. Over the years many new models have been built to improve upon the prediction accuracy of KT.
Wang and Heffernan have also shown that better predictions are achieved with the inclusion of additional parameters relating to the skills and groups to which a student belongs [4].
Standard KT makes a number of assumptions, including that the rate of learning is constant and that the transition from one knowledge state to the other does not depend on previous performance [2]. Other researchers have introduced different models that seem to deal with this anomaly in the KT model [7], while some have compared these different models to determine which best predicts student
Table 2 displays the MAE, RMSE and AUC values for each skill and for each model.

Table 2. Skill Prediction Performance Comparison Results

SKILL (#) Name              | MAE KT | MAE LP | RMSE KT | RMSE LP | AUC KT | AUC LP
(1) Box and Whisker         | 0.349  | 0.268  | 0.423   | 0.487   | 0.681  | 0.500
(9) Stem and Leaf Plot      | 0.394  | 0.321  | 0.447   | 0.490   | 0.599  | 0.583
(10) Table                  | 0.294  | 0.187  | 0.386   | 0.424   | 0.539  | 0.453
(58) Addition Whole Numbers | 0.195  | 0.116  | 0.313   | 0.330   | 0.557  | 0.500
Mean                        | 0.350  | 0.287  | 0.415   | 0.462   | 0.611  | 0.569
p-values (KT vs LP)         | <0.05           | <0.05             | >0.05
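The three metrics in the tables can be computed from binary outcomes `y` and predicted probabilities `p` as follows; the AUC uses the Wilcoxon rank formulation. This is a sketch for the reader, not the authors' code.

```python
import math

def mae(y, p):
    """Mean absolute error between outcomes and predictions."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    """Root mean squared error between outcomes and predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def auc(y, p):
    """Probability that a randomly chosen positive outscores a randomly
    chosen negative (ties count half): the Wilcoxon formulation of AUC."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

Lower MAE/RMSE and higher AUC are better, which is why LP can win on MAE yet tie or lose on AUC in Table 2.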
4. CONCLUSION
Ohlsson theorized that students do learn from their previous erroneous performance, especially when an explanation of the reasons for the error is provided. On this basis, we developed a new naïve Bayes network model that allows the amount of learning to increase when users get an item wrong. Our experiments with real and simulated data showed that the proposed LP model does not predict student performance better than the standard KT model. Hence we conclude from our experiments that the assumption Corbett and Anderson made was justifiable, even though, following Ohlsson, it is not intuitive. Our contribution is to show that this intuitive idea is not one other researchers should spend time on, unless they come up with a different model from the one we used.
Fold    | MAE KT | MAE LP | RMSE KT | RMSE LP | AUC KT | AUC LP
1       | 0.326  | 0.270  | 0.399   | 0.380   | 0.707  | 0.832
2       | 0.335  | 0.271  | 0.409   | 0.382   | 0.779  | 0.820
3       | 0.336  | 0.284  | 0.409   | 0.391   | 0.798  | 0.816
4       | 0.351  | 0.285  | 0.426   | 0.391   | 0.798  | 0.827
5       | 0.334  | 0.275  | 0.405   | 0.381   | 0.762  | 0.834
Mean    | 0.336  | 0.277  | 0.410   | 0.385   | 0.794  | 0.826
P-value | <0.05           | <0.05             | <0.05

The code and data for these experiments are available at:
http://users.wpi.edu/~saadjei/

REFERENCES
Stéphane Sanchez
Université Toulouse 1, IRIT
stephane.sanchez@irit.fr

Olivier Héguy
Andil
olivier.heguy@andil.fr

Andil
angela.bovo@andil.fr

Yves Duthen
Université Toulouse 1, IRIT
yves.duthen@irit.fr
ABSTRACT
This paper proposes a set of relevant clustering features and reports the results of experiments using them to determine students' learning behaviours by mining Moodle log data. Our clustering experiments investigated whether there is an overall ideal number of clusters and whether the clusters show mostly qualitative or quantitative differences. They were carried out on real data from various courses run by a partner institute on a Moodle platform. We compared several classic clustering algorithms on several groups of students using our defined features and analysed the meaning of the clusters they produced.

Keywords
clustering, Moodle, analysis, prediction

1. INTRODUCTION
1.1 Context of the project
Our project aims to monitor students by storing educational data during their e-learning curriculum and then mining it. The reason for this monitoring is that we want to keep students from falling behind their peers and giving up.
This project is a research partnership between a firm and a university. The partner firm connects our research with its past and current e-learning courses, providing us with real data from a variety of trainings.
All available data comes from a Moodle [5] platform where the courses are located. Moodle's logging system keeps track of what materials students have accessed and when. We then mine through these logs.

1.2
2.
3.
EXPERIMENTAL METHOD
4. CLUSTERING RESULTS
4.1 Best number of clusters
The following figure shows the results of the four algorithms used on each of our three datasets. The first graph shows the frequency at which the X-Means algorithm proposed a given number of clusters. The other three graphs show the error for a given number of clusters for K-Means, hierarchical clustering and Expectation Maximisation. All algorithms generally agree on at most 2 or 3 clusters.
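The error-versus-k procedure can be sketched with a minimal Lloyd's-algorithm K-Means (NumPy only). The synthetic two-blob data and all names here are illustrative, not the paper's datasets or tooling.

```python
import numpy as np

def kmeans_inertia(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # final assignment against the converged centers
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return float(((X - centers[labels]) ** 2).sum())

# error for k = 1..5 on toy two-cluster data; the "elbow" suggests k
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
errors = {k: kmeans_inertia(X, k) for k in range(1, 6)}
```

Plotting `errors` against k and looking for the point where the curve flattens is the elbow heuristic the graphs above rely on.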
4.2
5.
6.
REFERENCES
ABSTRACT
The educational model in higher education has drifted from the traditional classroom to technology-driven models that merge classroom teaching with web-based learning management systems (LMS) such as Moodle and CLEW. Every teaching model has a set of supervised (e.g. quizzes) and/or unsupervised (e.g. assignments) instruments that are used to evaluate the effectiveness of learning. The challenge is in preserving student motivation in the unsupervised instruments, such as assignments, as they are less structured than quizzes and tests. This research applies association rule mining to find the impact of unsupervised course work (e.g. assignments) on overall performance (e.g. exam and total marks).
1. INTRODUCTION
Research has shown that society is gradually drifting from the common teacher-centered classroom teaching model to a hybrid educational model that combines classroom teaching with technology such as the internet [2]. Recent technology-based systems include web-based courses, learning content management systems (LMS), adaptive and intelligent web-based educational systems (intelligent tutoring systems, ITS) [2], and more recent online systems such as MOOCs (massive open online courses). Such web-based courses gather student data through activities such as quizzes, exams and assignments to measure cognitive ability, along with additional data to measure other factors that could influence learning, such as the number of times a student has visited a webpage. The objective of this paper is to study the direct and indirect impact that unsupervised tasks (e.g. assignments) have on final marks and grades using association rule mining (ARM). ARM is an unsupervised learning method that looks for hidden patterns in data. An impact on the overall mark is considered direct (if the student achieves a score x in the assignment, 10% of x contributes to the overall mark). An impact on the final exam is considered indirect: a student performing well in the assignments understands the course concepts well and hence performs proportionately well in the final exam. The motivation behind this research is to offer constructive suggestions to educators to help them decide the optimal number of course assignments and the weight that should be given to them (e.g. giving 15% weight to the assignments as opposed to 10%).
2. RELATED WORK
Data mining techniques such as classification, clustering and association rule mining have been used to provide guidance to students and teachers in activities such as predicting students' performance and failure rate, discovering interesting patterns among student attributes, and finding students who have a low
* This research was supported by the Natural Science and Engineering Research Council (NSERC) of Canada under an operating grant (OGP-0194134) and a University of Windsor grant.
Rule
Assignment_average > 85
70 < Assignment_average <= 85
50 <= Assignment_average <= 70
Assignment_average < 50
Final_exam_mark > 85
70 < Final_exam_mark <= 85
50 <= Final_exam_mark <= 70
Final_exam_mark < 50
Overall_mark > 85
70 < Overall_mark <= 85
50 <= Overall_mark <= 70
Overall_mark < 50
1 < Number_of_visits <= 100
100 < Number_of_visits <= 200
Number_of_visits > 200
4. RESULTS
Experiments were conducted on student data from two semesters of two Computer Science courses: Programming in C for Beginners (codes 106F and 106W) and Key Concepts for End-Users (code 104F). A relative weight of 30% was assigned to assignments in 106F, 10% in 106W and 50% in 104F. The thresholds used for support were 15 and 20, and for confidence 50, since experiments indicated that lowering minimum support and confidence values substantially increased the number of rules generated. An analysis of all generated rules supports our hypothesis that assignment marks have an impact on students' overall and final marks. The confidence of such rules for course 106F is much higher than for 106W, which implies that allocating a 30% weight to the assignments (as in 106F) has a higher impact than 10% (106W). Similarly, rules generated for the last 5 assignments in 106F had 100% confidence in depicting that a student who scores > 85 in an assignment also scores > 85 in the final exam and in the overall mark, and a similar trend was observed with assignment marks < 50. Some rare rules, such as one found in the dataset 106W (Assignment1 > 85 => Final < 50), can be attributed to the fact that assignment 1, being the first one, was either too simple or marked too leniently. Rules generated for all assignments of 104S are uniformly indicative of the fact that achieving a score of > 85 in the assignment and final exam leads to a total mark in the range >= 70 and < 85. An SVM model for all three courses using average assignment marks predicted a student's chance of passing the course with more than 95% accuracy. However, accuracy using individual assignment marks was 81%, 87% and 97% for 106W, 106F and 104S respectively.
5. CONCLUSIONS
This research shows how association rules mined with the Apriori algorithm on student data can extract hidden patterns about course assessment instruments such as assignments and the final exam. Teachers can make informed decisions using such patterns and use them to improve their curriculum and teaching strategy. The assertion that assignment marks correlate directly with final exam and overall marks can motivate students to perform well in the assignments.
6. FUTURE WORK
Step 3: Apply the Apriori algorithm to the dataset obtained as output of step 2 to generate rules such as R4 => R8, R4 => R12, R6 => R1 and so on, as shown in Table 3.
Step 4: The rules generated in step 3 are then manually interpreted using the rules' confidence and support values to answer questions such as "Does the average assignment mark have a favorable impact on final/overall marks?" or "How important is it for students to visit the CLEW site frequently?".
Step 5: An SVM model is created to classify student data (from step 1) as Pass/Fail based on total marks. A total mark of >= 50 is labeled Pass; < 50 is labeled Fail.
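The rule-mining part of the steps above can be sketched as follows. This is a naive single-antecedent miner illustrating support/confidence filtering (the Apriori idea of pruning by support, without the full itemset lattice); it is not the authors' implementation, and the item strings are hypothetical discretized marks.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_conf):
    """Count 1- and 2-item sets, then keep one-antecedent rules that
    meet both the support and the confidence thresholds."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for size in (1, 2):
            for combo in combinations(sorted(t), size):
                counts[combo] = counts.get(combo, 0) + 1
    rules = []
    for itemset, c in counts.items():
        if len(itemset) != 2:
            continue
        a, b = itemset
        for ante, cons in ((a, b), (b, a)):    # both rule directions
            support = c / n
            confidence = c / counts[(ante,)]
            if support >= min_support and confidence >= min_conf:
                rules.append((ante, cons, round(support, 2), round(confidence, 2)))
    return rules
```

Each transaction is one student's set of discretized attributes (e.g. `{"Assignment_average>85", "Overall_mark>85"}`), matching the discretization table above.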
7. REFERENCES
[1] Anwar, M., and Ahmed, N. Knowledge mining in supervised
and unsupervised assessment of students performance. In 2nd
International Conference on Networking and Information
Technology, 2011, 29-36.
[2] Minaei-Bidgoli, B. Data mining for a web-based educational
system. PhD thesis, Michigan State University, 2004.
Samad Kardan
Department of Computer Science
The University of British Columbia
2366 Main Mall, Vancouver, BC,
V6T1Z4, Canada
+1 (604) 8225108
skardan@cs.ubc.ca
Cristina Conati
Department of Computer Science
The University of British Columbia
2366 Main Mall, Vancouver, BC,
V6T1Z4, Canada
+1 (604) 8224632
conati@cs.ubc.ca
ABSTRACT
1. INTRODUCTION
Many adaptive educational systems apply data mining techniques to answer the need for understanding and supporting students' varying learning styles, capabilities and preferences [1, 2, 3]. Along this line of research, we concentrate on understanding how students interact with Prime Climb (PC), an adaptive educational game, and whether there is a connection between students' behavioral patterns and their attributes (for instance, higher vs. lower domain knowledge).
Developing an interactive environment in which more students can learn the desired skills requires a pedagogical agent that maintains an accurate understanding of individual differences between users and provides tailored interventions. For instance, if a pedagogical agent can identify whether one group of students has higher domain knowledge than another, it may be possible to leverage such information to construct a more accurate user model and intervention mechanism.
Behavior discovery has been widely used in educational systems, but there is limited application in educational games like PC, in which educational concepts are embedded in the game with minimal technical notation to maximize the game aspects (i.e. engagement) of the system. In PC, students follow an exploratory mechanism to explore and understand the methods and practice them. This paper describes a first step toward leveraging students' behavioral patterns to build a more effective adaptive edu-game. The ultimate goal is devising mechanisms for extracting abstract, high-level meaning from raw interaction data and leveraging such understanding for real-time identification of characteristic interaction styles to enhance the user modeling and intervention mechanism in an edu-game like Prime Climb.
This result is very similar to the results when data from all 9 mountains is included. Similar patterns can be seen when more interaction data from the upper mountains is included in the pattern analysis.
5. CONCLUSION/FUTURE WORK
This paper discusses behavior discovery in PC. To this end, different sets of features were defined. The features were extracted from students' interactions with PC, in the form of movements from one numbered hexagon to another and usages of the MG tool. To identify frequent patterns of interaction in groups of students, a feature selection mechanism was first applied to select the more relevant features from the set of all features. Then K-Means clustering was applied to group the students into the optimal number of clusters. Once the clusters were built, the Hotspot association rule mining algorithm was applied to the clusters to extract frequent interaction patterns. Finally, the clusters were compared to each other on their members' prior knowledge. When interaction data from all 9 mountains was included in behavior discovery, we found that the students with higher prior knowledge were more engaged in the game and spent more time on making movements. In contrast, the students with lower prior knowledge spent less time on making movements, indicating that they were less involved in the game. Behavior discovery was also conducted on truncated sets of features in which only a fraction of the interaction data was included. The results showed that using the interaction data from the first four mountains produced groups of students that are statistically different in their prior knowledge.
As future work, an online classifier will be built that identifies frequent patterns of interaction, classifies students into different groups in real time, and leverages such information to build a more personalized user model and adaptive intervention mechanism in PC.
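The paper does not name the statistical test used to compare clusters on prior knowledge. Assuming a Welch-style two-sample comparison (a reasonable default when cluster variances may differ), the statistic could be computed as:

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and degrees of freedom for comparing the
    prior-knowledge scores of two clusters (unequal variances allowed)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

The t value and df would then be looked up against the t distribution to decide whether the clusters differ significantly.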
6. REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Jennifer DeBoer, jdeboer@mit.edu
Andrew Ho, Andrew_Ho@gse.harvard.edu
Daniel Seaton, dseaton@mit.edu
Glenda S. Stump, gsstump@mit.edu
David E. Pritchard, dpritch@mit.edu
Lori Breslow, lrb@mit.edu
1 Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139
2 Harvard Graduate School of Education, 455 Gutman Library, 6 Appian Way, Cambridge, MA 02138

ABSTRACT
Keywords

1. RATIONALE

2. RESEARCH QUESTION

3. CONCEPTUAL FRAMEWORK
We apply a social capital lens to our work. While the inputs of formal schooling are important, education researchers have also noted the importance of the acculturation and social preparation to which students are differentially exposed prior to entering school [see, for example, 3, 6]. Once students enroll, the norms and behaviors they understand as beneficial may be differentially rewarded within the school system [1, 8]. The tools that serve more privileged groups of students in traditional settings (e.g., linguistic capital, knowledge of cultural references, highly educated role models) may also be relevant for online learning.

4. DATA
Data come from the students in 6.002x who completed the exit survey. While the survey was announced specifically for course completers, the link to the survey was open on the website; we find that ~800 survey completers did not receive a certificate in the course. It is important to note that these data were gathered using matrix sampling to mitigate non-response due to survey fatigue. We impute missing responses using chained equations [7]. Missingness ranges from 59% to 85% of the over 7,000 students who completed the survey.
5. RESULTS
For the subset of students completing surveys, once we control for key student background information, we find the impact of certain resources on students' total score to be diminished. In addition, when we add controls such as the score on the first homework (a proxy for prior ability), and when we control for the student's country, we remove more bias. This may result, for example, from correlation between initial score and subsequent study strategies, such as referencing previous homeworks or viewing relevant questions on the discussion forum.
Table 1. Additive models predicting partial credit score, OLS

Covariate                 | Naive estimate   | Control for first HW | Controls for country (not listed)
Homework                  | 4.41 (0.28) ***  | 3.70 (0.26) ***      | 3.71 (0.27) ***
Labs                      | 0.61 (0.34)      | 0.45 (0.32)          | 0.52 (0.32)
Lecture problems          | 0.26 (0.10) **   | 0.09 (0.09)          | 0.10 (0.09)
Lecture videos            | 0.38 (0.13) **   | 0.17 (0.13)          | 0.09 (0.13)
Tutorials                 | 0.11 (0.06)      | 0.09 (0.06)          | 0.07 (0.06)
Book                      | -0.26 (0.07) *** | -0.23 (0.07) **      | -0.24 (0.07) ***
Wiki                      | -0.64 (0.07) *** | -0.60 (0.07) ***     | -0.58 (0.07) ***
Discussion board          | 0.30 (0.12) **   | 0.39 (0.11) ***      | 0.33 (0.11) **
Female                    | -1.12 (1.88)     | -1.11 (1.80)         | -1.20 (1.88)
Parent engineer           | 2.16 (0.81) **   | 1.86 (0.79) **       | 1.91 (0.79) **
Worked with other offline | 2.05 (0.66) **   | 1.96 (0.65) **       | 2.11 (0.71) **
Teach EE                  | 0.26 (0.58)      | -0.01 (0.56)         | 0.08 (0.58)
Took diff. equations      | 4.72 (0.56) ***  | 4.44 (0.48) ***      | 4.56 (0.55) ***
First HW                  |                  | 0.57 (0.03) ***      | 0.55 (0.03) ***
5.3 Demographics
The impact of key demographic background factors is consistent across models, including the fully specified model with all covariates, which allows a fixed effect for the student's country of access. Individual factors such as gender and whether the student teaches electrical engineering are not related to achievement. On the other hand, some background factors are strongly related to performance. Specifically, having taken differential equations predicted a higher score, even controlling for the first assignment. Similarly, students who reported offline collaboration also scored higher. This might reflect the same positive role of collaboration as participation in the discussion forum.
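The shrinkage of a covariate's estimate once a prior-ability control enters the model (the pattern between the first two columns of Table 1) can be illustrated with synthetic data and plain least squares. All variable names and effect sizes below are invented for illustration, not the study's data.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares; column 0 is the intercept."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 500
first_hw = rng.normal(0, 1, n)                  # proxy for prior ability
forum = 0.8 * first_hw + rng.normal(0, 1, n)    # resource use tracks ability
score = 2.0 * first_hw + 0.5 * forum + rng.normal(0, 1, n)

ones = np.ones(n)
naive = ols(np.column_stack([ones, forum]), score)                # omits control
controlled = ols(np.column_stack([ones, forum, first_hw]), score)
# the naive forum coefficient absorbs part of the ability effect
```

Because `forum` correlates with `first_hw`, the naive model attributes some of the ability effect to forum use; adding the control shrinks the coefficient toward its true value.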
7. ACKNOWLEDGMENTS
Funding was provided by NSF Grant DRL-1258448. We also acknowledge the support of MIT's RELATE group and edX.
8. REFERENCES
[1] Anyon, J. 2008. Social Class and School Knowledge, in The
Way Class Works, L. Weis (ed). Routledge, chap. 13.
[2] Bowen, W.G., Chingos, M.M., Lack, K.A., & Nygren, T.I.
2012. Interactive Learning Online at Public Universities:
Evidence from Randomized Trials, ITHAKA, May 22, 2012.
[3] Bourdieu, P. 1977. Cultural Reproduction and Social
Reproduction, in Power and Ideology, J. Karabel & A.H.
Halsey (eds), Oxford Press, Chap. 29.
[4] Dziuban, C., Moskal, P., Brophy, J., Shea, P. 2007. Student
satisfaction with asynchronous learning, Journal of
Asynchronous Learning Networks, 11(1), 87-95.
[5] edX. 2012. Frequently asked Questions. From: [edx.org/faq].
[6] Lareau, A. 2003. Unequal Childhoods: Class, Race, and
Family Life. University of California Press.
[7] Little, R.J.A. & Rubin, D.B. 2003. Statistical Analysis with
Missing Data. Technometrics, 45 (4), pp. 364-365.
[8] Meyer, J. 1977. The Effects of Education as an Institution,
American Journal of Sociology, 83(1): 55-76.
[9] Park, J.-H., & Choi, H. J. 2009. Factors influencing adult
learners' decision to drop out or persist in online learning.
Educational Technology & Society, 12(4), 207-217.
[10] Xu, D. & Jaggars, S.S. 2013. Adaptability to Online
Learning: Differences Across Types of Students and
Academic Subject Areas. CCRC, Working paper No 5.
Khusro Kidwai
Pearson
kristen.dicerbo@pearson.com
khusro.kidwai@maine.edu
ABSTRACT
In this paper we describe the development of a detector of the seriousness of pursuit of a particular goal in a digital game. As gaming researchers attempt to make inferences about player characteristics from actions in open-ended gaming environments, understanding game players' goals can provide an interpretive lens for those actions. This research uses Classification and Regression Tree methodology to develop, and then cross-validate, features of game play and related rules through which player behavior in pursuing the goal of completing a quest can be classified as serious or not serious.
Keywords
Detector, goal, stealth assessment, games, educational data
mining.
1. INTRODUCTION
Many recently developed online learning environments provide open spaces for students to explore. At the same time, there is growing interest in stealth assessment [5], the use of data from students' everyday interactions in a digital environment to make inferences about player characteristics. This use of data from natural activity in open-ended environments presents a challenge for interpretation. Much of the evidence we wish to use to assess skill proficiency and player attributes assumes that individuals are working towards the goal of completing sub-tasks or levels within a game. However, game players often appear to pursue differing goals [2], which provide different lenses for interpreting player behavior based on data in game log files. For example, behavior might be categorized as off-task if a player is pursuing a quest but on-task if a player has a goal of exploring the environment. If we are interested in using evidence contained in game log files to assess constructs such as persistence, we have to be careful not to identify a player as lacking persistence when in fact they were very persistently pursuing a different goal.
This paper describes the creation of a detector for a specific goal: serious pursuit of completion of quests in a game. The approach builds on research on detectors of gaming the system [1], which use machine learning to identify features and rules that classify behavior into discrete categories. Here, the focus was on whether a player in the online game Poptropica is seriously pursuing the goal of completing quests. The ability to categorize players by whether or not they are pursuing the goal of quest completion is likely to help with the interpretation of other actions players take in the game. This paper discusses the selection of possible features, or indicators, of goal seriousness, the process of detector creation, and the analysis of the effectiveness of the detector in correctly classifying play.
2. DESCRIPTION OF THE ENVIRONMENT
Poptropica is a virtual world in which players explore islands
with various themes and overarching quests that players can
choose to pursue. Players choose which islands to visit and the
quests generally involve completion of 25 or more steps (for
example, collecting and using assets) that are usually completed
in a particular order. Apart from the quests, players can talk to
other players in highly scripted chats (players can only select
from a pre-determined set of statements to make in the chat
sessions), play arcade-style games head-to-head, and spend time
creating and modifying their avatar.
As with most online gaming environments, the Poptropica gaming engine captures time-stamped event data for each player. On an average day, the actions of over 350,000 Poptropica players generate 80 million event lines.
3. DETECTOR DEVELOPMENT
Prior to building a machine detector of goal-seriousness, it was necessary to establish a human-coded standard from which the computer could learn and verify rules. A total of 527 clips were coded by two raters as being either "serious" or "not serious" about the goal of completing a quest. Cohen's kappa [3] between the two raters for the full set of non-training clips was .72; all disagreements were discussed until accord was reached.
Elements of the log files hypothesized to be indicative of goal
directedness were identified as features including: (1) total
number of events completed on the island, (2) total amount of
time spent on the island, (3) total number of events related to
quest-completion, (4) number of locations (scenes) visited on the
island, (5) number of costumes tried on, and (6) number of
inventory checks. The number of costumes and number of
inventory checks were hypothesized to be negatively correlated to
completing quests.
Researchers employed Classification and Regression Tree (CART) methodology to create the detector. The creation of a decision tree begins with the attempt to create classification rules until the data has been categorized as close to perfectly as possible; however, this can overfit the training data. The software then tries to relax these rules in a process called pruning, to balance accuracy against flexibility to new data. This research employed the J48 algorithm [4] for pruning. The results of the analyses were evaluated using (1) precision, (2) recall, (3) Cohen's kappa, and (4) A'.
4. RESULTS
The final decision tree is displayed in Figure 1. Each branch provides classification rules and an ultimate classification decision at the end. The red boxes end paths that indicate individuals not serious about the goal of quest completion, while
the green boxes indicate serious goal-directed behavior. For example, following the left-most path, we find that people who visited 4 or fewer scenes and completed 2 or fewer quest events were classified as not seriously goal-directed. This rule correctly classified 335 clips and misclassified 18 of the 353 total clips that followed this pattern. Following other branches reveals different rules, all leading to classifications of seriously or not seriously goal-directed.
                         | Humans Disagree | Humans Agree
Computer/Humans Disagree | 49.15% (N=29)   | 10.90% (N=51)
Computer/Humans Agree    | 50.85% (N=30)   | 89.10% (N=417)
6. ACKNOWLEDGMENTS
Our thanks to Ryan Baker and Ilya Golden, who served as mentors during this project's inception at the Pittsburgh Science of Learning Center LearnLab Summer School 2012.
7. REFERENCES
                   | Detector: Not Serious | Detector: Serious
Human: Not Serious | 328                   | 32
Human: Serious     | 48                    | 119
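From the detector-versus-human table above, the usual evaluation metrics can be recomputed as follows, treating "serious" as the positive class. The derived values are our own arithmetic from the table, not figures reported in the paper.

```python
# rows: human label, columns: detector label; order (not serious, serious)
cm = [[328, 32],
      [48, 119]]
n = sum(sum(row) for row in cm)
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # of clips flagged serious, fraction truly serious
recall = tp / (tp + fn)      # of truly serious clips, fraction found
accuracy = (tp + tn) / n

# Cohen's kappa: observed agreement vs. chance agreement from the marginals
p_e = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n ** 2
kappa = (accuracy - p_e) / (1 - p_e)
```

On this table the detector agrees with the human coding on 447 of 527 clips, with kappa around 0.64.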
Linglong Zhu
Yutao Wang
Neil T. Heffernan
hdduong@wpi.edu
lzhu@wpi.edu
yutaowang@wpi.edu
nth@wpi.edu
ABSTRACT
Intelligent Tutoring Systems (ITS) have proven efficient in providing students with assistance and assessing their performance while they do their homework. Much research has analyzed how students' knowledge grows and predicted their performance from within intelligent tutoring systems. Most of it focuses on using the correctness of the previous question or the number of hints and attempts students needed to predict their future performance, but ignores how they ask for hints and make attempts. In this paper, we build a Sequence of Actions (SOA) model that takes advantage of the sequence of hints and attempts a student needed on the previous question to predict the student's performance. We used an ASSISTments dataset of 66 students answering a total of 34,973 problems generated from 5,010 questions over the course of two years. The experimental results showed that the Sequence of Actions model has more reliable predictive accuracy than Knowledge Tracing.
Keywords
Knowledge Tracing, Educational Data Mining, Student Modeling,
Sequence of Action model.
1. INTRODUCTION
Understanding student behavior is crucial for Intelligent Tutoring Systems (ITS) to improve and to provide better tutoring for students. For decades, researchers in ITS have been developing various methods of modeling student behavior using performance as observations. One example is the Knowledge Tracing (KT) model (Corbett and Anderson, 1995), which uses a dynamic Bayesian network to model student learning. But KT focuses on the correctness of students' responses, ignoring the process a student used to solve a problem. Many papers have shown the value of using the raw number of attempts and hints (Feng, Heffernan and Koedinger, 2009; Wang and Heffernan, 2011). However, most EDM models we are aware of
Table 1. Action-sequence categories (a = attempt, h = hint request)

Category                             Examples                       Next Question Correct Percent
Single Attempt                       a                              0.8339
All Attempts                         aa, aaa, ..., aaaaaaaaaaaa     0.7655
All Hints                            ha, hha, ..., hhhhhhha         0.4723
Alternative, Attempt First (a-mix)   aha, aahaaha, ..., aahhhhaaa   0.6343
Alternative, Hint First (h-mix)      haa, haha, ..., hhhhaha        0.4615
From the tabling results, shown in Table 1, we can see that the
percent of next-question-correct is highest among students using
only one attempt, since they have mastered the skill best: they can
correctly answer the next question with the same skill. Students in
the All Attempts category are more self-learning oriented; they try
to learn the skill by making attempts over and over again, so they
get the second highest next-question-correct percent. But students
in the All Hints category do the homework relying only on the hints.
It is reasonable that they don't master the skill well, or don't
even want to learn, so their next-question-correct percent is very
low. An alternating sequence of actions reflects a student's
learning process. Intuitively, these students have a positive
attitude toward studying: they want to get some information from the
hint and then use it to try to solve the problem. But the results
for the two alternative categories are very interesting. Though
students in both categories alternately ask for hints and make
attempts, the first action largely determines their learning
attitude and final results. Students who make an attempt first, if
they get the question wrong, try to learn it by asking for hints.
But students who ask for a hint first seem to have less confidence
in their knowledge. Although they also make some attempts, the
statistics of the action sequences show that they tend to ask for
more hints than they make attempts. This shortage of knowledge, or
negative study attitude, makes their performance as bad as that of
the students relying exclusively on hints.
Table: prediction results of KT and SOA
(one metric's column header was lost in extraction)

Model            (?)      RMSE     AUC
KT               0.3032   0.3921   0.6817
SOA              0.2900   0.3813   0.6841
t-test p value   0.0000   0.0000   0.5286
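The category definitions and the tabling computation described above can be sketched as follows. This is an illustrative sketch, not the authors' code; the helper names and the toy log are made up (a = attempt, h = hint request).

```python
# Hypothetical sketch of the "tabling" step: classify each action
# sequence into the paper's five categories and average the
# next-question correctness per category.

def categorize(seq):
    """Map an action sequence like 'aahha' to one of five categories."""
    if seq == "a":
        return "single attempt"
    if set(seq) == {"a"}:
        return "all attempts"
    # hints only, possibly ending in the final answering attempt
    if set(seq) == {"h"} or set(seq[:-1]) == {"h"}:
        return "all hints"
    return "a-mix" if seq[0] == "a" else "h-mix"

def table_next_correct(logs):
    """logs: list of (sequence, next_question_correct) pairs."""
    sums, counts = {}, {}
    for seq, correct in logs:
        cat = categorize(seq)
        sums[cat] = sums.get(cat, 0) + correct
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

logs = [("a", 1), ("aaa", 1), ("hha", 0), ("aha", 1), ("haa", 0), ("a", 1)]
print(table_next_correct(logs))
```

The tabled averages per category would then be compared against KT predictions, as in the results table above.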
3. CONTRIBUTIONS
In this work, we presented a Sequence of Actions (SOA) model, in
which students' actions of asking for hints and making attempts are
divided into the five categories shown in Table 1. The result of the
tabling method shows that students who make an attempt first did
better on the next question with the same skill than those who ask
for a hint first. The result from logistic regression shows that
paying attention to the sequence of actions increases the prediction
accuracy of students' performance.
4. ACKNOWLEDGMENTS
This work is funded in part by IES and NSF grants to Professor Neil
Heffernan. The first three authors have all been funded by the
Computer Science Department at WPI.
5. REFERENCES
[1] Wang, Q.Y., Pardos, Z. A., & Heffernan, N. T. (2011). The
Tabling Method: A simple and practical complement to Knowledge
Tracing. KDD.
[2] Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing
the assessment challenge in an online system that tutors as it
assesses. UMUAI: The Journal of Personalization Research, 19(3),
243-266.
[3] Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing:
Modeling the acquisition of procedural knowledge. User Modeling and
User-Adapted Interaction, 4, 253-278.
[4] Shih, B., Koedinger, K.R., & Scheines, R. (2010). Discovery of
Learning Tactics using Hidden Markov Model Clustering. Proceedings
of the 3rd International Conference on EDM.
Marta Zorrilla
diego.garcia@unican.es
marta.zorrilla@unican.es
ABSTRACT
Choosing a suitable classifier for a given data set is an important part of a data mining process. Since a large variety
of classification algorithms are proposed in the literature, non-experts, such as teachers, do not know which method should be
used in order to obtain a good model. Hence, a recommender service which guides the process, or automates it,
is welcome. In this paper, we rely on meta-learning in order
to predict the best algorithm for a given data set. More
specifically, our work analyses which meta-features are more
suitable for the problem of predicting student performance,
and also evaluates the viability of the recommender.
data sets and algorithms are chosen. For instance, some authors [3] utilize general, statistical and information-theoretical
measures extracted from data sets, whereas others use landmarkers, as in [4].
This paper is not intended to design a database to store data
mining processes, as there is already one available [1]; its
main aim is to assess the feasibility of our proposal and to
propose a set of measurable features of educational data sets
which can help us to choose the classification algorithm
automatically with certain reliability.
We must also mention that several research projects have targeted meta-learning in recent years, such as the e-LICO project [2].
1. INTRODUCTION
2. EXPERIMENTATION
3. CONCLUSIONS
We have analysed which features are more suitable for describing educational data sets aimed at predicting student
performance. We have also shown that the construction of a
recommender system following a meta-learning approach is
feasible.
In the near future we will work with other kinds of meta-characteristics, such as the aforementioned landmarkers and the
algorithms' parameter settings. Other quality measures of the
models, in addition to accuracy, will also be considered.
Table 1: Features recommended by ClassifierSubsetEval

                    md1        md2        md3
Feature             NB   J48   NB   J48   NB   J48
#N Instances        9    5     8    10    8    8
#N Attributes       7    9     9    2     6    8
#N Numeric att.     1    3     0    4     5    5
#N Nominal att.     9    3     0    0     1    6
Completeness        10   10    8    10    7    9
#Type att.          7    3     6    4     5    4
#N Classes          6    2     1    9     3    5
Is balanced?        9    8     9    7     3    3
Next, we built models using the three meta-data sets generated for
this experimentation. The most accurate model was achieved with
md2. This is because its class attribute has 4 possible values in a
data set with 80 instances, whereas in md1 and md3 it has 12
different values with a slightly higher number of instances.
Moreover, most models built from md1 and md3 were over-fitted.
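The meta-features analysed above can be computed with straightforward bookkeeping. The sketch below is illustrative, with hypothetical function names and a tiny toy data set; it is not the authors' implementation.

```python
# Hypothetical sketch: compute the data-set meta-features listed in
# Table 1 for a small tabular data set held as a list of rows, with
# None marking missing values. The class label is one of the columns.

def meta_features(rows, types, class_index):
    """types: list of 'numeric'/'nominal' per non-class attribute."""
    n = len(rows)
    n_att = len(types)
    cells = n * (n_att + 1)                 # attributes plus the class column
    missing = sum(1 for row in rows for v in row if v is None)
    classes = [row[class_index] for row in rows]
    counts = {c: classes.count(c) for c in set(classes)}
    return {
        "n_instances": n,
        "n_attributes": n_att,
        "n_numeric": types.count("numeric"),
        "n_nominal": types.count("nominal"),
        "completeness": 1 - missing / cells,
        "n_classes": len(counts),
        # illustrative balance heuristic: no class more than twice as
        # frequent as the rarest one
        "is_balanced": max(counts.values()) <= 2 * min(counts.values()),
    }

rows = [[1.0, "yes", "pass"], [2.0, None, "fail"],
        [3.0, "no", "pass"], [4.0, "no", "fail"]]
print(meta_features(rows, ["numeric", "nominal"], class_index=2))
```

A meta-learner would then be trained on such feature vectors, one per data set, labelled with the best-performing classifier.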
Figure 1 depicts a model built with md2 using J48. As expected,
according to the previous feature analysis, it uses the type of
data, the number of instances, the number of attributes and the
completeness to build the model. Of the 81 data sets, 63 were used
for building our recommender and the rest for testing. It achieved
an accuracy of 68.75%, which is a little lower than those obtained
by classifiers built
Markus Hofmann
kylegoslin@gmail.com
markus.hofmann@itb.ie
ABSTRACT
As an educational institute grows, an increase in the number
of programs, each with individual modules and learning objects, can be seen. Learning environments provide a structured environment that can offer an additional level of
insight into the relationships between content.
This paper outlines the identification of similarities at the
Learning Object, Module and Program level utilizing these
inherent structures. Once generated, these results are then
visualized in graph form, providing insight into the overlap between course material.
Keywords
Similarity Detection, Visualization, Data Structures, Moodle
1. INTRODUCTION
objects [2] through the use of metadata [3] and simple string
matching. These approaches, however useful, do not take
into consideration any of the prior knowledge that can be
extracted from the environment to aid the search process.
Each search query is also limited by the relevancy
and accuracy of the search terms entered by the user, which
often may not be as specific and relevant as required [4].
3. SYSTEM OVERVIEW
The Tree Generator creates a tree-based structure of Moodle, including metadata. A second tool, titled the Moodle
Crawler, downloads each file from the Moodle instance locally and associates the Tree record ID with each file. Each
file is then converted into its HTML counterpart and added
to the local tree. Similarities between the data are then
generated using the free and open source data mining tool
RapidMiner [5]. A graph is then generated using a custom
operator and viewed using Gephi [6].
Figure 1 below provides an overview of the generation process from start to finish.
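The similarity step can be sketched with a plain TF-IDF cosine measure over the extracted text. This stands in for the RapidMiner operators the authors actually used; all names and the toy documents below are illustrative.

```python
# Minimal sketch, assuming plain text has already been extracted from
# each file's HTML counterpart: weight terms by TF-IDF, then compare
# learning objects with cosine similarity.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict name -> token list. Returns dict name -> {term: weight}."""
    n = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for name, toks in docs.items()
    }

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "lo1": "loops arrays loops functions".split(),
    "lo2": "loops arrays recursion".split(),
    "lo3": "marketing finance budget".split(),
}
vecs = tfidf_vectors(docs)
print(cosine(vecs["lo1"], vecs["lo2"]), cosine(vecs["lo1"], vecs["lo3"]))
```

Pairs whose similarity exceeds a threshold would become edges in the graph visualized with Gephi.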
As a solution to this, a tool was developed to extract the hierarchy and structures of the learning environment created
by educators during their daily use. Similarity measures between documents are then calculated and can be used, along
with the gathered structural information, to aid the process
of narrowing down and selecting applicable learning objects with
similar content. These results are then visualized in graph
form to aid the process of similarity detection.
2. BACKGROUND
3.1 Dataset
4. VISUALIZATION
During the visualization process a number of different relationships between nodes were created by the Tree Builder.
5. CONCLUSIONS
6. REFERENCES
Ivon Arroyo
Neil Heffernan
kkelly@wpi.edu
iarroyo@wpi.edu
nth@wpi.edu
ABSTRACT
This study suggests that the data generated by intelligent tutoring
systems can be used to accurately predict end-of-year
standardized state test scores. A traditional model including only
past performance on the test yielded an R2 of 0.38, and an
enhanced traditional model that added the current class average
improved predictions (R2=0.50). These models served as baseline
measures for comparing an ITS model. Regression models
that include features such as hint percentage, average number of
attempts and overall percent correct improved the R2 to 0.57. The
data remain nearly as predictive with only a few months
of use. This lends support for increased use of these systems in
the classroom and for nightly homework.
Keywords
Intelligent tutoring system, homework, standardized test,
prediction, regression, classification, decision tree
1. INTRODUCTION
With the introduction of the No Child Left Behind Act in 2001,
assessing student performance became a significant focus of
schools. Given the high-stakes nature of these tests, it is
imperative to identify at-risk students accurately, and as early in
the year as possible, to provide time for interventions.
Intelligent tutoring systems (ITS) allow teachers to evaluate
student performance while students are learning. Furthermore, an
ITS provides data to teachers which can be used to predict
standardized state test scores (Feng, et al. 2006; Feng, et al.
2008). Specifically, help request behavior is effective at
predicting student proficiency (Beck et al. 2003).
While the above studies are promising, the content used to generate
the data was very narrow, consisting of previously released state
test questions; the material therefore mapped directly to the test
being predicted. The present study uses ASSISTments
(www.assistments.org), a web-based intelligent tutoring system
which allows teachers to enter their own content in addition to
using certified problem sets. This content can include in-class
warm-ups, challenge problems, and questions from the textbook. Some
of the problem sets may include tutoring in the form of hints or
scaffolding, while others include correctness-only feedback with
varying numbers of attempts allowed. What impact does this diverse
data have on the previously established usefulness of ITS in
predicting end-of-year test scores? The present research attempts
to determine whether the data collected from student use of an ITS
over an entire school year accurately predicts student performance
on a standardized state test.
2. APPROACH
For the 2010-2011 school year, 129 students in a suburban middle
school used ASSISTments as part of their 7th grade math class. The
types of assignments completed during the course of the year
include classwork, homework and assessments. Student data from
August through May were used to predict MCAS (Massachusetts
Comprehensive Assessment System) scaled scores and to classify
performance. A smaller date range (August through October) was also
considered to determine whether the model is equally effective with
less data, earlier in the school year.
2.1 Modeling
3. RESULTS
Students who were not enrolled in the course for the entire time
period considered in this study were not included in the analysis
(n=8). Finally, students whose 6th or 7th grade MCAS scores were
not available were not included (n=4).
3.1 Prediction
Traditionally, prior performance on a standardized test is used to
predict future performance on the same test. A linear regression
using only the 6th grade MCAS score to predict the 7th grade MCAS
scaled score serves as a comparison model for the more complex
models. These scores are highly correlated (r(115)=0.617, p<0.001),
and the model was 75% accurate in categorizing students.
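As a toy illustration of this baseline, the ordinary-least-squares fit and its R2 can be computed directly. The scores below are fabricated for the example; they are not the study's data.

```python
# Sketch of the baseline comparison model: OLS with the 6th-grade
# score as the single predictor of the 7th-grade score.

def fit_simple_ols(x, y):
    """Return slope, intercept, and R^2 for y ~ a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

grade6 = [210, 230, 238, 250, 262, 270]   # hypothetical MCAS scaled scores
grade7 = [214, 226, 240, 252, 258, 275]
slope, intercept, r2 = fit_simple_ols(grade6, grade7)
print(round(r2, 3))
```

The enhanced models simply add further predictors (class average, ITS features) to the same regression framework.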
The enhanced traditional model included 6th grade MCAS score
(β = 0.341, t(115) = 4.03, p < .001) as well as
Percent_Correct_First_Attempt (β = 0.448, t(115) = 5.30, p <
3.2 Classification
A J48 decision tree with cross validation predicted MCAS
classification with 68.4% accuracy for the full year. The
attributes included in the tree were prior MCAS performance, total
number of questions answered, and percentage of hints used. While
the tree does well at predicting the classification of Advanced and
Proficient students, it was unable to identify the students who
fell in the Needs Improvement category. This is a significant
limitation of the model. However, it is important to note that with
only 2 students falling in the Needs Improvement category, it would
be very challenging to identify them.
A separate decision tree was constructed based on the data from
August through October. This model predicted MCAS classification
better, with 76% accuracy. See Figure 1 for the tree.
The attributes that were included were average number of attempts
and percentage of hints used.
Table: model comparison (the remaining row labels were lost in extraction)

                       R2      Accuracy   Kappa
Traditional            0.381   75%        0.44
Enhanced Traditional   0.505   75%        0.46
(label lost)           0.566   81%        0.61
(label lost)           0.566   82%        0.63
(label lost)           N/A     76%        0.39
(label lost)           N/A     68%        0.36
5. ACKNOWLEDGMENTS
This work was supported by the Bill and Melinda Gates Foundation
via EDUCAUSE, IES grants R305C100024 and R305A120125, and NSF
grants ITR 0325428, HCC 0834847, and DRL 1235958.
6. REFERENCES
[1] Beck, J., Jia, P., Sison, J., & Mostow, J. (2003). Predicting
student help-request behavior in an intelligent tutor for reading.
Proc. of the 9th Int. Conf. on User Modeling, LNAI 2702.
1. INTRODUCTION
2.
SPARFA [4] assumes that graded learner response data consist of N
learners answering a subset of Q questions that involve K ≪ Q, N
underlying (latent) concepts. Let the column vector c_j ∈ R^K,
j ∈ {1, ..., N}, represent the latent concept knowledge of the j-th
learner, let w_i ∈ R^K, i ∈ {1, ..., Q}, represent the associations
of question i to each concept, and let the scalar μ_i ∈ R represent
the intrinsic difficulty of question i. The student-response
relationship is modeled as

  Z_{i,j} = w_i^T c_j + μ_i,  ∀ i, j,  and
  Y_{i,j} ~ Ber(Φ(Z_{i,j})),  (i, j) ∈ Ω_obs,   (1)

where Y_{i,j} ∈ {0, 1} corresponds to the observed binary-valued
graded response variable of the j-th learner to the i-th question,
where 1 and 0 indicate correct and incorrect responses,
respectively. Ber(z) designates a Bernoulli distribution with
success probability z, and Φ(x) = 1/(1 + e^{-x}) denotes the
inverse logit link function. The set Ω_obs contains the indices of
the observed entries (i.e., the observed data
[Figure 1: Average predicted likelihood using SPARFA-Top with
different precision parameters, on the STEMScopes and Algebra test
data sets.]
  A_{i,v} = w_i^T t_v  and  B_{i,v} ~ Pois(A_{i,v}),  ∀ i, v,   (2)

where t_v ∈ R^K_+ is a non-negative column vector that
characterizes the expression of the v-th word in every concept. The
latent factors w_i, c_j, t_v and μ_i are estimated through a block
coordinate descent algorithm, which is detailed in [3].
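The response side of the generative model can be sketched in a few lines. The sizes and random draws below are illustrative only, and this simulates data under Eq. (1); it is not the block coordinate descent estimator of [3].

```python
# Sketch of the SPARFA response model: Y_{i,j} ~ Ber(Phi(w_i^T c_j + mu_i))
# for every observed (question i, learner j) pair.
import math
import random

random.seed(0)
K, Q, N = 3, 5, 4                      # concepts, questions, learners (made up)

W = [[random.gauss(0, 1) for _ in range(K)] for _ in range(Q)]   # w_i rows
C = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]   # c_j rows
mu = [random.gauss(0, 1) for _ in range(Q)]                      # difficulties

def inv_logit(x):
    """Phi(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

Y = [[1 if random.random() < inv_logit(
          sum(W[i][k] * C[j][k] for k in range(K)) + mu[i]) else 0
      for j in range(N)] for i in range(Q)]
print(Y)
```

Fitting reverses this process: given observed Y (and word counts B for SPARFA-Top), the latent W, C, t_v and mu are estimated.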
3. EXPERIMENTS
Table: top three words associated with each estimated concept

Concept 1: Energy, Sand, Soil
Concept 2: Water, Plants, Sample
Concept 3: Earth, Buffalo, Water
Concept 4: Water, Eat, Heat
Concept 5: Percentage, Water, Objects
4. CONCLUSIONS
We have introduced the SPARFA-Top framework, which extends SPARFA by jointly analyzing both the binary-valued
graded learner responses to a set of questions and the text
associated with each question via a Poisson topic model.
Our purely data-driven approach avoids the manual assignment of tags to each question and significantly improves the
interpretability of the estimated concepts by automatically
associating keywords extracted from question text to each
estimated concept. For additional details, please refer to the
full version of this paper on arXiv [3].
5. REFERENCES
Arthur C. Graesser
Zhiqiang Cai
University of Memphis, 202 Psychology Building, Memphis, TN 38152
hli5@memphis.edu
graesser@memphis.edu
zcai@memphis.edu
ABSTRACT
Automated text analysis tools such as Coh-Metrix and Linguistic
Inquiry and Word Count (LIWC) provide an overwhelming number of
indices for text analysis, so fewer underlying dimensions are
desirable. This paper develops an underlying component model for
text analysis. The component model was developed from large English
and Chinese corpora on the basis of results from Coh-Metrix and
from the English and Chinese versions of LIWC.
Keywords
Component model, Coh-Metrix, LIWC, principal component
analysis
1. INTRODUCTION
With the development of computational linguistics, automated text
analysis tools like Coh-Metrix and Linguistic Inquiry and Word
Count (LIWC) have been developed to analyze enormous amounts of
data efficiently.
Coh-Metrix provides 53 language and discourse measures at multiple
levels, related to conceptual knowledge, cohesion, lexical
difficulty, syntactic complexity, and simple incidence scores
(http://cohmetrix.memphis.edu) [1]. Meanwhile, a principal
components analysis performed on 37,520 texts of the TASA corpus
extracts five factors (Coh-Metrix Text Easability Assessor, TEA,
http://tea.cohmetrix.com): Narrativity (word familiarity and oral
language), Referential cohesion (content word overlap), Deep
cohesion (causal, intentional, and temporal connectives), Syntactic
simplicity (familiar syntactic structures), and Word concreteness
(concrete words) [1].
Even though Coh-Metrix provides the five normed dimensions, no
articles describe the details of this model. This paper not only
gives a thorough description of the model, but also uses the method
to build up normed dimensions with the English and Chinese LIWC
text analysis tools.
LIWC is a text analysis software program with a text processing
module and an internal default dictionary [2]. LIWC classifies
2. METHOD
Two reference corpora were used in this study. The English corpus
was TASA (Touchstone Applied Science Associates, Inc.): randomly
collected excerpts comprising 37,520 samples and 10,829,757 words
across nine genres, including language arts, science, social
studies/history, business, health, home economics and industrial
arts.
The Chinese reference corpus was collected according to genres
similar to TASA's, such as classic fiction, modern fiction, history
and science. It comprised 4,679 complete (unsegmented) documents
with 25,184,754 words.
Six factors extracted from LIWC in these two independent corpora
showed significantly high correlations on the dimensions of
cognitive complexity, narrativity, emotions and embodiment [7].
Therefore, these two corpora are able to reflect some common
linguistic and psychological features.
The procedure for building the component model is described below.
First, TASA was analyzed with Coh-Metrix and English LIWC, and the
Chinese corpus was analyzed with Chinese LIWC; thus, three data
sets were generated. Second, PCA was performed to reduce the range
of indices from Coh-Metrix (53) and LIWC (English 64; Chinese 71)
to fewer potential constructs. The fixed number of dimensions
should analyze the writing with English LIWC to obtain the scores
of all the indices (64). Then the means and standard deviations of
the indices in all the categories, and the corresponding
coefficients of the Negative Emotion component in the component
model, are obtained from the reference corpus.
For instance, suppose the verb score for one subject is 1.5.
According to the model, the mean of the verb category is 1.37, the
standard deviation 1.29, and the coefficient -0.06. Thus, the value
of the verb category in the component score is
[(1.5 - 1.37)/1.29] × (-0.06) ≈ -0.006 for this subject. We compute
the value of all the other categories in this way, then sum them,
and finally obtain the value of the Negative Emotion composite
score for this subject.
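The worked example above can be expressed directly in code: z-score each category against the reference corpus, weight by the component's PCA coefficient, and sum. The verb entry echoes the example; the second category and its numbers are made up for illustration.

```python
# Sketch of the composite component score: sum of
# z-score(category) * PCA coefficient over all categories.

def component_score(values, means, sds, coefs):
    """values/means/sds/coefs: dicts keyed by LIWC category name."""
    return sum((values[c] - means[c]) / sds[c] * coefs[c] for c in values)

values = {"verb": 1.5, "negemo": 2.0}
means = {"verb": 1.37, "negemo": 1.6}
sds = {"verb": 1.29, "negemo": 0.8}
coefs = {"verb": -0.06, "negemo": 0.55}   # illustrative coefficients

print(round(component_score(values, means, sds, coefs), 3))
```

Because the means, deviations, and coefficients come from the fixed reference corpus, scores computed this way remain comparable across data sets.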
Thus, each component composite score for any incoming corpus can be
computed and standardized based on these three component models.
4. CONCLUSION
This study developed three component models for text analysis: the
Coh-Metrix component model, the English LIWC component model and
the Chinese LIWC component model. A component model can be used to
generate composite component scores when the data set has a small
sample size and PCA cannot be appropriately performed. The results
are comparable across different data sets.
The limitation of this study is that we didn't evaluate the model
against human judgment. In the future, an evaluation of the model
will be carried out.
5. ACKNOWLEDGMENTS
This work was supported by the National Science Foundation
(BCS 0904909) for the Minerva project: Languages across
Culture.
6. REFERENCES
[1] McNamara, D.S., Graesser, A.C., McCarthy, P., and Cai, Z.
in press. Automated evaluation of text and discourse with
Coh-Metrix. Cambridge University Press, Cambridge.
[2] Pennebaker, J. W., Booth, R. J., and Francis, M. E. 2007.
LIWC2007: Linguistic Inquiry and Word Count. Austin,
Texas: LIWC.net
[3] Huang, J., Chung, C. K., Hui, N., Lin, Y., Xie, Y., Lam, Q.,
Cheng, W., Bond, M., and Pennebaker, J. W. 2012. [The development
of the Chinese Linguistic Inquiry and Word Count dictionary].
Chinese Journal of Psychology, 54(2), 185-201.
[4] Biber, D. 1988. Variation across speech and writing.
Cambridge, England: Cambridge University Press.
[5] Lee, D. Y. W. 2004. Modeling variation in spoken and
written English. Routledge, London/New York.
[6] Hair, J. F., Black, W. C., Babin, B. J., and Anderson, R. E.
2009. Multivariate data analysis. Prentice Hall, New Jersey.
[7] Li, H., Cai, Z., Graesser, A.C., and Duan, Y. 2012. A
comparative study on English and Chinese word uses with LIWC. In
Proceedings of the Twenty-Fifth International Florida Artificial
Intelligence Research Society Conference (California, US, May
23-25, 2012), 238-243.
Xiaolu Xiong
Joseph E. Beck
ABSTRACT
Student modeling has been widely used to predict student
correctness on the immediate next action. Some researchers have
worked on student models that predict delayed performance, that is,
retention. Prior work has found that the factors influencing
retention differ from those that influence short-term performance.
However, this prior research did not use data specifically targeted
at measuring retention. In this study, we describe our experiments
using dedicated retention performance data to test students'
ability to retain, and experiment with a new feature called mastery
speed, which indicates how many problems a student needs to attain
initial mastery. We found that this new feature is the most useful
of our features: it is not only a helpful predictor for 7-day
retention tests, but also a long-term factor that influences
students' later retention tests even after 105 days. We also found
that, although statistically reliable, most features are not useful
predictors; for example, the numbers of a student's previous
correct and incorrect responses are not as helpful in predicting
retention performance as they are in PFA.
Keywords
Educational data mining, Knowledge retention, Robust learning,
Feature selection, Intelligent tutoring system.
1. INTRODUCTION
The Automatic Reassessment and Relearning System (ARRS) is an
extension of the mastery learning problem sets in the ASSISTments
system (www.assistments.org), a non-profit web-based tutoring
system for 4th through 10th grade mathematics. Mastery learning is
a pedagogical strategy which, in most ITS, means that a student is
presented with problems to solve until he masters the skill. The
exact definition of mastery varies from tutor to tutor: some tutors
consider a student to have mastered the skill if his estimated
knowledge is very high, for example over 0.95 (e.g., [3]), while
ASSISTments uses a heuristic of three correct responses in a row.
The idea behind ARRS is that mastering a problem set is not
necessarily an indication of long-term retention. Therefore, ARRS
presents the student with a reassessment test on the same skill at
expanding intervals: first 7 days after the initial mastery, then
14 days after the prior test, then 28 days later, and finally 56
days later. Thus, the retention tests are spread over an interval
of at least 105 (7+14+28+56) days. In this study, we defined
retention performance as the reassessment test performance one week
after a student was assigned a skill (i.e., the first reassessment
test). Note that if a student fails the reassessment test,
ASSISTments will give him an opportunity to relearn the skill. Once
a student relearns (demonstrates mastery of) a skill, he will
receive another reassessment test at the same delay at which he
previously responded incorrectly. In other words, if the student
failed the second reassessment test, he would have to relearn the
skill and
Table: predictive features (one column header was lost in extraction)

Feature              R2      (?)      p-value   R2 gain
mastery_speed        0.379   ---      0.000     0.006
n_correct            0.374   0.010    0.000     0.001
n_incorrect          0.373   -0.007   0.004     0.000
n_day_seen           0.373   0.026    0.002     0.000
g_mean_performance   0.378   1.130    0.000     0.005
g_mean_time          0.373   0.000    0.649     0.000
2.3 Impact of Mastery Speed
From the previous models, we found that mastery speed has a clear
influence on students' 7-day reassessment tests. However, what
about the 14-day, 28-day, and even the 56-day tests? We collected
all student performances on all four reassessment tests. As shown
in Figure 1, we calculated the percentage of correct answers on
each retention test, disaggregated by initial mastery speed.
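The disaggregation just described can be sketched as follows. The function name, bin labels, and records are illustrative, not the study's data or code.

```python
# Sketch: bucket students by how many problems they needed for
# initial mastery, then compute percent correct on each retention test.
from collections import defaultdict

def percent_correct_by_speed(records):
    """records: iterable of (mastery_speed_bin, test_day, correct) triples."""
    agg = defaultdict(lambda: [0, 0])          # (bin, day) -> [n_correct, n]
    for speed_bin, day, correct in records:
        agg[(speed_bin, day)][0] += correct
        agg[(speed_bin, day)][1] += 1
    return {key: 100.0 * c / n for key, (c, n) in agg.items()}

records = [
    ("3-4", 7, 1), ("3-4", 7, 1), ("3-4", 7, 0),   # fast initial mastery
    ("8+", 7, 0), ("8+", 7, 1), ("8+", 7, 0),      # slow initial mastery
]
print(percent_correct_by_speed(records))
```

Plotting these percentages per bin across the 7-, 14-, 28-, and 56-day tests yields a figure like the one described above.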
Students get better as they move to the later retention tests. This
is expected, since they must get the previous tests correct in
order to move on, and some weaker students are forced to repeat and
so are systematically oversampled on the left side of the graph. On
the 7-day retention test, students who mastered a skill quickly
with 3 or 4 attempts (blue line) have a 24% higher chance of
responding correctly than those students who required more than
ACKNOWLEDGMENTS
We want to acknowledge funding from NSF grant DRL-1109483 as well
as the funding of ASSISTments. See
http://www.webcitation.org/67MTL3EIs for the funding sources for
ASSISTments.
REFERENCES
[1] Anderson, J.R. (1993). Rules of the Mind. Lawrence Erlbaum.
[2] Wang, Y., & Beck, J. E. (2012). Using Student Modeling to
Estimate Student Knowledge Retention. In Proceedings of the 5th
International Conference on Educational Data Mining, 176-179.
[3] Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing:
Modeling the acquisition of procedural knowledge. User Modeling and
User-Adapted Interaction, 4(4), 253-278.
Jihie Kim
Sofus A. Macskassy
Erin Shaw
senliu@usc.edu
jihie@isi.edu
sofmac@isi.edu
shaw@isi.edu
Keywords
Collaborative project, group project, SVN, data mining, entropy
analysis, graph theory
1. INTRODUCTION
The goal of this case study is to make progress towards
understanding the impact of collaboration on individual and group
performance in programming courses that use a collaborative code
management system such as SVN (Subversion), which supports
team-based programming projects by providing a complete history of
individual programming activities. Past studies of group work have
analyzed how the characteristics of team members affect group
outcomes [1] and whether certain members or leaders influence
performance [2]. Related work by the authors has shown that team
pacing is highly correlated with group project performance [3].
Building on these results, we explore the following questions:
Does individual coursework performance affect group project
performance?
Does the most interactive (or influential) student affect group
project performance?
Does even work pacing affect group work performance?
Results indicate that when integrating components from different
members, teamwork skills and the usage of teamwork tools may
improve group performance; however, for implementing difficult
programs, individual members' programming skills become more
important. The performance of leaders or central students can
greatly affect group performance, and work pacing and management of
the work throughout the project period can be an important factor
for successful team programming.
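Since the keywords mention graph theory and the study discusses central students, a minimal sketch of one common centrality measure follows. The edge definition (two students linked when they modified the same file) and all names are assumptions for illustration, not the authors' exact method.

```python
# Illustrative sketch: degree centrality over an interaction graph
# built from SVN activity, where an edge links two students who
# modified the same file.
from collections import defaultdict

def degree_centrality(edges):
    """edges: iterable of (student_a, student_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    # fraction of other nodes each student is directly connected to
    return {s: len(nbrs) / (n - 1) for s, nbrs in neighbors.items()}

edges = [("alice", "bob"), ("alice", "carol"), ("alice", "dan"), ("bob", "carol")]
print(degree_centrality(edges))
```

Students with the highest centrality would correspond to the "most interactive" members asked about in the questions above; see [4] for standard network measures.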
2. STUDY CONTEXT
To better prepare students for professional employment, two
undergraduate computer science teachers at the University of
Southern California combined a first and second year course so
that students could work on an authentic project. This case study
of that experiment spans a seven-week period of collaboration
among students in the two classes.
Table: group sizes and SVN activity by semester
(N of Students shown as total (cs200, cs201))

2011 FALL
Group   N of Students   N of Files   N of File Mods
M1      16 (5,11)       4007         5333
M2      14 (5,9)        4007         6142
T1      13 (4,9)        474          1433
T2      13 (4,9)        1740         3173
W1      13 (5,8)        1603         2856
W2      13 (5,8)        1412         3288
W3      14 (6,8)        1994         3845
W4      14 (6,8)        2082         3357
W5      13 (5,8)        2156         3873

2011 SPRING
Group   N of Students   N of Files   N of File Mods
M1      11 (6,5)        2919         5885
M2      10 (5,5)        3332         5276
M3      11 (5,6)        1737         3243
M4      9 (6,3)         2992         4279
T1      10 (5,5)        1770         3370
T2      12 (5,7)        1301         2871
T3      10 (4,6)        1096         1842
W1      12 (7,5)        5711         7287
W2      9 (5,4)         1186         2184
W3      11 (6,5)        2444         4137
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
330
The left group was one of the better performers (project grades
were 92.31 for cs200 and 94.75 for cs201). We have color-coded
the nodes based on cs200/cs201 breakdown as well as the two
best cs201 students (best exam scores). We find that cs201
students are more central than cs200 students and see closer
interaction between the cs201 students. Finally, note that the two
best cs201 performers are also the two most central nodes.
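The centrality analysis behind this observation (degree centrality in the sense of Wasserman & Faust [4]) can be sketched in a few lines of Python; the interaction edges and node names below are invented for illustration, not taken from the study's data:

```python
def degree_centrality(edges):
    """Normalized degree centrality deg(v) / (n - 1) for an
    undirected interaction graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Hypothetical interaction log for one project group:
edges = [("cs201_A", "cs200_X"), ("cs201_A", "cs201_B"),
         ("cs201_A", "cs200_Y"), ("cs201_B", "cs200_Y")]
cent = degree_centrality(edges)
most_central = max(cent, key=cent.get)  # cs201_A touches all 3 others
```

The most central node here is the student with edges to every other member, mirroring the paper's finding that the strongest cs201 students are also the most central.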
4. ACKNOWLEDGEMENT
The authors thank USC CS faculty Drs. David Wilczynski and
Michael Crowley for their assistance. The research was supported
by a grant from the National Science Foundation (#0941950).
5. REFERENCES
[1] Michaelsen, L. K. & Sweet, M. (2008). The Essential Elements of
Team-Based Learning. New Directions for Teaching and Learning,
116, 7-27.
[2] Strijbos, J. W. (2004). The Effect of Roles on Computer-Supported
Collaborative Learning. Open Universiteit Nederland, Heerlen, The
Netherlands (Chapters 3 and 4).
[3] Ganapathy, C., Shaw, E., & Kim, J. (2011). Assessing Collaborative
Undergraduate Student Wikis and SVN with Technology-based
Instrumentation: Relating Participation Patterns to Learning. Proc. of
the American Society of Engineering Education Conference.
[4] Wasserman, S. & Faust, K. (1994). Social Network Analysis.
Cambridge: Cambridge University Press.
Keith Maull
University of Colorado
Institute of Cognitive Science
Boulder, Colorado USA
keith.maull@colorado.edu

Tamara Sumner
University of Colorado
Institute of Cognitive Science
Boulder, Colorado USA
tamara.sumner@colorado.edu
ABSTRACT
As technology continues to disrupt education at nearly all levels,
from K-12 to college and beyond, the challenges of understanding
its impact on teaching continue to mount. One critical area that
remains open is examining teachers' use of technology: collecting
detailed data on their technology use, developing techniques to
analyze those data, and finding meaningful connections that may
show the value of that technology. In this research, we present a
model for predicting test score gains using data points drawn from
typical educational data sources, such as teacher experience,
student demographics, and classroom dynamics, as well as from the
online usage behaviors of teachers. Building upon prior work in
developing a usage typology of teachers using an online curriculum
planning system, the Curriculum Customization Service (CCS), to
assist in the development of their instruction and planning for an
Earth systems curriculum, we apply the results of this typology to
add new information to a model for predicting test score gains on
a district-level Earth systems subject area exam. Using both
multinomial logistic regression and Naïve Bayes algorithms on the
proposed model, we show that even with a simplification of the
highly complex tapestry of variables that go into teacher and
student performance, teacher usage of the CCS proved valuable to
the predictive capability in average and above-average test score
gain cases.
Keywords
online user behavior, teaching, pedagogy, learner gain prediction, instructional planning support
1. BACKGROUND
Within the past 20 years, the use of technology in the classroom
has grown at an unimaginable pace. From K-12 to college and
lifelong learning, students and teachers alike now use a vast array
of tools in their educational endeavors. Teachers, especially, are
using tools in many interesting ways in the hope that these tools
improve their teaching productivity, better engage their learners,
and ultimately provide optimizations that make their jobs less
difficult so that they can maximize their value and skill in
teaching. Tools that educators once relied on to enter grades and
organize lesson plans have transformed into a now-diverse online
ecosystem of intranet-, extranet-, and Internet-based platforms
that allow them to do a multitude of activities: collaborate with
like-minded educators located anywhere in the world, find digital
resources relevant to their curricular objectives, find out how
state and national standards are tied to specific resources
prepared for use in their classroom, examine the progress their
students are making through those lessons, and even manage all
these things in a single portal. These technologies are affecting
everyone in education (administrators, educators, learners,
parents, etc.), and as the state of the art pushes policy and
pedagogy forward, a mounting number of challenges must be sorted
out in its wake, including whether these technology tools are
facilitating educational productivity or hindering it.
It is widely recognized that teachers matter a great deal in the
learning process of students, and many studies suggest that teacher
skill is one of the key predictive forces in learner success. Even
with large efforts such as the Gates Foundation's $45 million,
multi-year study to understand the factors that might accurately
predict teacher effectiveness, it remains unclear from these and
many other studies what impact technology, and specifically online
tools designed to impact pedagogy, is having on the toolkits,
skillsets, and patterns of productivity employed by effective
teachers, where effectiveness in this context is measured by
learner gains. This research aims to explore mechanisms and models
for understanding how teacher utilization of online tools might be
linked with learner gains. By studying usage of the Curriculum
Customization Service (CCS), we try to bridge the gap between the
online behaviors of teachers and learner gains, while at the same
time utilizing common educational data, such as teacher experience,
class demographics, and class dynamics variables.
2. RESEARCH CONTEXT
The research presented here examines the online usage behaviors of
teachers within the context of the learning gains observed over the
course of a single year of data. Earth systems teachers within a
large urban public school district in the U.S. were trained on and
given the Curriculum Customization Service (CCS) to use for their
planning.
User characteristics identified by the usage typology:
Lower overall use of the system; heavier relative use of the
Interactive Resources components of the system.
Heavy, robust use of the CCS.
Overall moderate use of the system.
Heavier relative use of the community and sharing features of the
CCS.
3. METHOD
This section examines the linkages between CCS usage and the other
variables collected in this research. Two years (2008-09 and
2009-10) of standardized test and Benchmark exam score data,
teacher experience, and class data (class size and demographic
makeup) were examined to determine whether there were any
significant differences in population characteristics and student
test score performance. In the first year of data (2008-09) the CCS
was not used, while in the successive year (2009-10) it was. Though
there were significant differences between the Benchmark exam score
gains (χ² = 1039, df = 2, p < 2.2e-16), where the mean letter grade
gain on the 2008-09 Benchmark exam was 0.29 and the mean letter
gain in 2009-10 was 1.04, the other variables, such as demographic
makeup, class size, and teacher experience, showed no significant
differences.
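A year-over-year comparison of this kind can be sketched with a hand-rolled Pearson chi-square statistic on a 2 x 3 contingency table (two years by below/average/above-average gain categories, giving df = 2). The counts below are invented for illustration, not the study's data:

```python
def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table of
    observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts per gain category (below, average, above):
observed = [[120, 250, 80],   # year without the CCS
            [60, 240, 150]]   # year with the CCS
stat = chi2_stat(observed)    # compare to a chi-square with df = 2
```

The resulting statistic would then be compared against the chi-square distribution with (rows - 1)(cols - 1) = 2 degrees of freedom, matching the df = 2 reported above.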
Two classifiers, Naïve Bayes and multinomial logistic regression,
were compared in several different configurations to study the
predictive capability of the model with and without CCS usage. The
best model sensitivity (true positive rate) achieved by any of the
models examined was 0.67 for the average and above-average gains,
corresponding to the CCS usage category + class size and CCS usage
+ demographic category models, respectively. The models performed
poorly at predicting below-average gains for the individual section
gain grouping.
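The sensitivity metric used above can be computed per class directly from paired label lists; the gain-category labels and predictions below are invented for illustration:

```python
def sensitivity(y_true, y_pred, positive):
    """Per-class sensitivity (true-positive rate): TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical gain-category labels and one classifier's predictions:
y_true = ["above", "average", "below", "above", "average"]
y_pred = ["above", "average", "above", "below", "average"]
tpr_above = sensitivity(y_true, y_pred, "above")  # 1 of 2 recovered
```

Computing the score per class, as here, is what exposes the asymmetry the authors report: a model can reach good sensitivity on average and above-average gains while remaining poor on below-average gains.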
6. ACKNOWLEDGEMENTS
7. REFERENCES
Sebastian Gross
TU Clausthal
Clausthal-Zellerfeld, Germany
sebastian.gross@tu-clausthal.de

Niels Pinkwart
TU Clausthal
Clausthal-Zellerfeld, Germany
niels.pinkwart@tu-clausthal.de

Benjamin Paassen
bpaassen@techfak.uni-bielefeld.de

Barbara Hammer
bhammer@techfak.uni-bielefeld.de

bmokbel@techfak.uni-bielefeld.de
ABSTRACT
Assuming that effective feedback strategies can be established
based on appropriate examples, let us restrict ourselves to a
scenario where a set X̃ of examples x̃_j is explicitly given, and a
student solution x̂_i from the set X̂ needs to be associated to the
most suited example. We further assume that examples are themselves
solutions (or are represented in the same form) and we can process
them in the same manner. Let d(x_i, x_j) be a meaningful proximity
measure which indicates the dissimilarity of any two solutions
x_i, x_j ∈ X = X̂ ∪ X̃ by a positive value. Then, a student solution
x̂_i can simply be associated to the most similar (and thus most
suited) example by choosing arg min_j d(x̂_i, x̃_j), j ∈ {1, . . . , |X̃|}.
In the following, we will present general approaches to calculate
this dissimilarity when solutions are represented as formal graphs
with annotations. The overall calculation is not tailored to a
specific learning task or domain if general data representations
are used. To explain the details of the approach and show first
experimental results, we will refer to our example application
scenario: an ITS to support programming courses for the Java
language.
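The arg-min association described above reduces to a one-liner once a dissimilarity d is available. The toy token-set measure below is only an illustrative stand-in for the graph-based measures the paper develops, and the example strings are invented:

```python
def associate(student_solution, examples, d):
    """Associate a student solution with the example minimizing d."""
    return min(examples, key=lambda ex: d(student_solution, ex))

def d_tokens(a, b):
    """Toy dissimilarity: size of the symmetric difference of token
    sets (an illustrative stand-in, not the paper's measure)."""
    return len(set(a.split()) ^ set(b.split()))

examples = ["for loop over chars", "split into words"]
best = associate("loop over chars in string", examples, d_tokens)
```

Any measure d that assigns smaller values to more similar solutions can be dropped in without changing the association step itself.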
1. INTRODUCTION
Each solution x_i is decomposed into m fragments F_s^i,
s ∈ {1, . . . , m}, so that fragment pairs (F_s^i, F_t^j) with
s, t ∈ {1, . . . , m}, s ≠ t, can also be considered.
The goal of fragmenting the graph is to compare distinctive parts
of the solutions independently. To evaluate the dissimilarity of a
single pair of fragments (F_s^i, F_t^j) with i ≠ j and
s, t ∈ {1, . . . , m}, one can rely on established proximity
measures from the literature, as exemplified by d_tfidf or d_align.
This requires only that each fragment can be represented
individually to apply the respective measure, e.g. as a string, a
numeric vector, etc. We call this the signature of the fragment.
Let d(F_s^i, F_t^j) be the dissimilarity of the fragment pair. To
compare the two underlying solutions x_i and x_j as a whole, m
suitable pairs of fragments (F_s^i, F_t^j),
(s, t) ∈ M ⊆ {1, . . . , m}², have to be established for
comparison. Since we want those pairs to yield the best overall
match, i.e. the minimum sum of dissimilarities, we arrive at an
optimal matching problem. For now, we use a simple greedy heuristic
to gain an approximate matching M of all fragments. We then compute
the overall dissimilarity as the mean

d_mean(F^i, F^j) = (1/m) Σ_{(s,t)∈M} d(F_s^i, F_t^j).
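The greedy matching heuristic mentioned in the text can be sketched as follows; here fragments are stood in for by plain numbers with an absolute-difference dissimilarity, purely for illustration:

```python
def greedy_match(frags_i, frags_j, d):
    """Greedily pick fragment pairs in order of increasing
    dissimilarity, using each fragment of either solution at most
    once (an approximation of the optimal matching M)."""
    pairs = sorted((d(a, b), s, t)
                   for s, a in enumerate(frags_i)
                   for t, b in enumerate(frags_j))
    used_s, used_t, match = set(), set(), []
    for dist, s, t in pairs:
        if s not in used_s and t not in used_t:
            used_s.add(s)
            used_t.add(t)
            match.append((s, t, dist))
    return match

def d_mean(frags_i, frags_j, d):
    """Mean dissimilarity over the approximate matching."""
    match = greedy_match(frags_i, frags_j, d)
    return sum(dist for _, _, dist in match) / len(match)

value = d_mean([1.0, 5.0], [4.8, 1.2], lambda a, b: abs(a - b))
```

The greedy pass is O(m² log m) per solution pair, whereas an exact optimal matching would require an assignment solver such as the Hungarian algorithm.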
3.
using String objects or arrays of characters, splitting the
sentence into words, or iterating over the whole input sequence.
Within each group, the 6 programs are only slightly different, with
altered syntactic details like variable names and the sequence of
operations. The Tasks dataset consists of 438 real student
solutions, collected during 3 different programming exams for
business students. Each solution was provided by an individual
student, and the data is class-labeled according to the 3 different
tasks assigned in the respective exam: [I] implementing Newton's
method to find zeros in 2nd-order polynomials (144 solutions); [II]
calculating income tax for a given income profile (155); and [III]
checking if a given sentence contains a palindrome, and if the
sentence is a pangram (139). The TextCheck dataset consists of 68
student solutions which solve the above-mentioned task [III]. Here,
class labels were provided by tutoring experts who were asked to
determine meaningful groups in the solution set. The experts
distinguished them according to 3 design choices very similar to
the ones used in the Artificial set (which was subsequently
created). This resulted in 8 classes corresponding to distinct
strategies to solve the task. In general, solutions are very
heterogeneous and classes are highly imbalanced. Therefore, this
dataset represents a state-of-the-art challenge for a real ITS.
After preprocessing as described, we applied four variants of
proximity measures, evaluating the accuracy of a 3-NN classifier;
see Tab. 1. The results show that with all measures the solutions
from the Artificial and Tasks datasets are classified rather
reliably, which indicates that the measures are semantically
meaningful. Accuracies for the TextCheck data are generally low, as
expected from the challenging scenario. However, the measures based
on fragmentation, d̃_tfidf and d̃_align, showed a performance
increase compared to their simpler counterparts d_tfidf and
d_align. A rigorous evaluation of the approach is the subject of
ongoing work.
Dataset      #fragments   d_tfidf   d̃_tfidf   d_align   d̃_align
Artificial   m = 4        0.94      0.92       0.94      0.94
Tasks        m = 6        0.98      0.83       0.98      0.98
TextCheck    m = 6        0.34      0.32       0.41      0.54

Table 1: Classification accuracies of a 3-NN classifier with
different proximity measures on the experimental datasets. The
number of fragments m in the measures d̃_tfidf and d̃_align was
chosen with regard to the average size of graphs in the respective
dataset.
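Given a precomputed dissimilarity matrix, the 3-NN evaluation can be sketched with a leave-one-out loop; the tiny two-class matrix below is synthetic, not one of the paper's datasets:

```python
from collections import Counter

def knn_predict(dist, labels, i, k=3):
    """Classify item i by majority vote of its k nearest neighbors
    in a precomputed dissimilarity matrix (leave-one-out)."""
    order = sorted((j for j in range(len(labels)) if j != i),
                   key=lambda j: dist[i][j])
    return Counter(labels[j] for j in order[:k]).most_common(1)[0][0]

def loo_accuracy(dist, labels, k=3):
    """Leave-one-out accuracy of the k-NN classifier."""
    hits = sum(knn_predict(dist, labels, i, k) == labels[i]
               for i in range(len(labels)))
    return hits / len(labels)

# Synthetic matrix: two tight clusters of three solutions each.
labels = ["A", "A", "A", "B", "B", "B"]
dist = [[0 if i == j else (1 if (i < 3) == (j < 3) else 9)
         for j in range(6)] for i in range(6)]
acc = loo_accuracy(dist, labels, k=3)  # perfectly separable data
```

Because the classifier consumes only the dissimilarity matrix, any of the four measures from Tab. 1 can be plugged in without changing the evaluation code.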
4. REFERENCES
André Kretzschmar
andre.kretzschmar@uni.lu
Samuel Greiff
samuel.greiff@uni.lu
University of Luxembourg
6, rue Richard Coudenhove Kalergi
1359 Luxembourg-Kirchberg
ABSTRACT
Complex Problem Solving (CPS) is a prominent representative of
transversal, domain-general skills, empirically connected to a
broad range of outcomes and recently included in large-scale
assessments such as PISA 2012. Advancements in the assessment of
CPS now call for (a) broader assessment vehicles allowing the whole
breadth of the concept to unfold and (b) additional efforts with
regard to the exploitation of the log-file data available. Our
paper explores the consequences of heterogeneous tasks with regard
to the applicability of an established measure of strategic
behavior (VOTAT) currently featured in assessment instruments. We
present a modified conception of this strategy suitable for a
broader range of tasks and test its utility on an empirical basis.
Additional value is investigated along the lines of theory-driven
educational data mining of process data.
Keywords
Complex Problem Solving, Theory Driven Data Mining,
Exploration Behavior.
1. INTRODUCTION
Targeting human behavior in problem situations characterized by
dynamic and interactive features [1], widespread application of
Complex Problem Solving (CPS) assessment only began after the
introduction of formal frameworks and the restriction to so-called
minimal complex systems [3]. Recent inclusions of CPS as a
representative of domain-general skills in large-scale studies such
as the Programme for International Student Assessment (PISA) in
2012 can be seen as a direct result of these advancements.
2. METHOD
Participants and procedure: Participants provided data in the
context of a broader assessment at a German school in November
2012. A total of 565 students completed the relevant assessment of
CPS; 54% of participants were female, and the mean age was 14.95
years (SD = 1.30).
Measures: FSA: Five FSA-based tasks were included in the
assessment. The test is called MicroFIN [4], with Micro standing
for the minimal complex systems approach and FIN for the framework
of finite state automata. Separate phases of MicroFIN target
knowledge acquisition and knowledge application. Empirically, we
expect a three-dimensional measurement model for MicroFIN,
separating the application of nested-VOTAT, knowledge acquisition,
and knowledge application.
Scoring of nested-VOTAT: Credit for nested-VOTAT is given per task
for the isolated variation of all input variables in one of the
qualitatively different areas of the task. The number of these
areas varies with the number of states for which the problem shows
different relations to input variations.
LSE: To obtain a comparable measure of CPS based on LSE, eight
MicroDYN tasks were included in testing. MicroDYN [3] includes a
scoring of VOTAT in the exploration phase, which is typically
strongly connected to knowledge acquisition. A two-dimensional
measurement model is expected, conflating strategy application
(i.e., VOTAT) and knowledge acquisition [3].
3. RESULTS
Manifest correlations: Manifest correlations between nested-VOTAT
indicators and knowledge acquisition and knowledge application in
MicroFIN were of rather low size per task. This result is in line
with previous findings with regard to single-task indicators of
MicroFIN and MicroDYN.
Dimensionality and internal consistency: Exploratory factor
analysis for nested-VOTAT indicated a single latent factor.
Internal consistency was poor, with Cronbach's α = .53.
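Cronbach's alpha can be computed directly from per-item scores as k/(k-1) · (1 - Σ item variances / variance of totals); the small indicator matrix below is invented, not the study's data:

```python
def cronbach_alpha(items):
    """Cronbach's alpha. items[i][p] is participant p's score on
    item i; population variances are used throughout."""
    k, n = len(items), len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(items[i][p] for i in range(k)) for p in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Hypothetical 0/1 strategy indicators for 5 tasks, 6 participants:
items = [[0, 1, 1, 0, 1, 0],
         [0, 1, 0, 0, 1, 1],
         [1, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 1],
         [0, 1, 1, 0, 1, 0]]
alpha = cronbach_alpha(items)
```

Perfectly parallel items drive alpha toward 1, while weakly related indicators, as with the heterogeneous tasks discussed here, pull it down.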
Latent modeling: Measurement models: Measurement models for
MicroFIN indicated the separability of nested-VOTAT application and
knowledge acquisition, as well as a third dimension for knowledge
application. The three-dimensional model fitted well; conflating
any two of the dimensions resulted in significantly worse model fit
(χ²(167) = 230.246, p < .001, RMSEA = 0.026, CFI = 0.981,
TLI = 0.978; latent factors correlated r = .71 to .81, all
p < .001). Measurement models for MicroDYN indicated the expected
two-dimensional model fitting well (χ²(251) = 588.234, p < .001,
RMSEA = 0.049, CFI = 0.994, TLI = 0.993; latent correlation of
factors r = .79, p < .001). Separating the use of VOTAT from
knowledge acquisition in MicroDYN resulted in estimation problems
due to very highly correlated factors.
Latent correlations among the MicroFIN and MicroDYN facets:

Facet   (1)    (2)    (3)    (4)
(2)     0.81
(3)     0.80   0.71
(4)     0.66   0.64   0.61
(5)     0.68   0.56   0.62   0.78
4. DISCUSSION
The present study represents a first advance into utilizing the
potential of process data in FSA-based tasks. It shows the
feasibility of including measures of exploration behavior in the
assessment of CPS even when based on heterogeneous tasks.
The notion of nested-VOTAT has been shown to be applicable to a
range of tasks based on FSA with a varying degree of success: the
low internal consistency of α = .53 could mean that different
optimal strategies are necessary in some of the tasks, or it could
be related to measurement problems due to the high heterogeneity of
tasks. A broader variation of item features and alternative
strategy scorings is needed to clarify this aspect.
The latent correlations point to the potential of including
process-related measures in FSA-based CPS assessment: contrary to
the case of LSE-based tests, the higher heterogeneity of FSA seems
to result in a more probabilistic relation between strategies in
exploration and knowledge acquisition.
Future inquiries into the determinants of CPS performance should be
built on broad assessment vehicles and careful examination of
participant behavior. Based on the findings elaborated here,
approaches of educational data mining like text replay tagging can
be utilized even when targeting a heterogeneous set of tasks.
5. REFERENCES
[1]
[2]
[3]
[4]
[5]
ABSTRACT
Mastery learning in intelligent tutoring systems produces a
differential attrition of students over time, based on their levels of
knowledge and ability. This results in a systematic bias when
student data are aggregated to produce learning curves. We
outline a formal framework, based on Bayesian Knowledge
Tracing, to evaluate the impact of differential student attrition in
mastery learning systems, and use simulations to investigate the
impact of this effect in both homogeneous and mixed populations
of learners.
Keywords
Mastery learning, attrition bias, learning curves, aggregate
learning, heterogeneous learner populations, knowledge tracing
Let K_t and U_t be the numbers of remaining students in the known
and unknown states at opportunity t, and let p_G, p_S, p_T, and
p_L0 be the standard BKT guess, slip, learning, and initial
knowledge parameters. The expected proportion correct among the
remaining students is

E[C_t] = (K_t (1 − p_S) + U_t p_G) / (K_t + U_t)

with

K_t = K_{t−1} + L(t)

where L(t) is the number of students who learn the skill at time t,
and so transition from the unknown into the known state. It is also
binomially distributed: L(t) ~ B(U_{t−1}, p_T), where p_T is the
BKT learning (a.k.a. transition) parameter. A_K(t) and A_U(t) give
the numbers of students from the known and unknown states,
respectively, that are judged to have mastered the material by the
system, and so removed from the population. The initial share of
students in the known and unknown states is controlled by p_L0, the
initial knowledge parameter from BKT:

K_1 ~ B(N, p_L0).

From this we can see that the learning curve begins from a
theoretical initial value of

E[C_1] = p_L0 (1 − p_S) + (1 − p_L0) p_G.

In the no-mastery-attrition situation, where A_K(t) and A_U(t) are
always 0, the known share K_t / (K_t + U_t) will tend towards 1.
Therefore the learning curve will converge to a theoretical
maximum:

lim_{t→∞} E[C_t] = 1 − p_S.
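The attrition effect can be explored with a direct simulation. The sketch below assumes the standard BKT parameter roles (p_L0, p_T, p_S, p_G) and a conventional 0.95 mastery threshold; all parameter values are illustrative choices, not values from the paper:

```python
import random

def simulate_mastery_attrition(n=10000, p_l0=0.2, p_t=0.15, p_s=0.1,
                               p_g=0.2, mastery=0.95, steps=15, seed=0):
    """Simulate a homogeneous BKT population under mastery attrition.

    Each student is in a known/unknown state; every step we record a
    noisy response (slip/guess), update the tutor's per-student
    mastery estimate by standard BKT inference, remove students whose
    estimate exceeds the threshold, and log the proportion correct of
    the students remaining at that step.
    """
    rng = random.Random(seed)
    students = [{"known": rng.random() < p_l0, "belief": p_l0}
                for _ in range(n)]
    curve = []
    for _ in range(steps):
        if not students:
            break
        correct, survivors = 0, []
        for s in students:
            obs = rng.random() < ((1 - p_s) if s["known"] else p_g)
            correct += obs
            # BKT posterior given the observation, then learning step
            b = s["belief"]
            if obs:
                post = b * (1 - p_s) / (b * (1 - p_s) + (1 - b) * p_g)
            else:
                post = b * p_s / (b * p_s + (1 - b) * (1 - p_g))
            s["belief"] = post + (1 - post) * p_t
            if not s["known"] and rng.random() < p_t:
                s["known"] = True
            if s["belief"] < mastery:  # mastered students leave the data
                survivors.append(s)
        curve.append(correct / len(students))
        students = survivors
    return curve

curve = simulate_mastery_attrition()
```

Because students judged to have mastered the skill are removed each step, the aggregate curve is computed over a population increasingly depleted of its strongest performers, which is exactly the differential attrition bias the paper analyzes.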
2. HETEROGENEOUS POPULATIONS
So far we have been considering idealized situations in which all
students are instances of a BKT model with a common set of
parameters. Naturally, we wish to investigate what can happen to
aggregate learning curves when we have a heterogeneous
population of different learners. There are many possible ways a
heterogeneous population might be composed, and specially
constructed mixed populations could produce many perverse aggregate
learning curves. We illustrate just a couple of examples that show
interesting aggregate behavior.
For a mixture of subpopulations k, the aggregate curve is the
attrition-weighted combination of the per-group curves:

E[C_t] = Σ_k (K_t^(k) (1 − p_S^(k)) + U_t^(k) p_G^(k)) / Σ_k (K_t^(k) + U_t^(k))
3. CONCLUSIONS
Aggregate learning curves are used to evaluate and improve
instructional systems [3]. However, there are significant
distortions to aggregate measures of student learning created by
the differential attrition bias inherent to mastery learning
systems. Aggregate performance on each step shown by learning
curves need not be representative of the learning of individuals or
groups of students [4]. Aggregate measures of such attrition-biased
data will tend to under-represent the amount of learning occurring.
Explicitly modeling the effect of this attrition bias may be a
fruitful direction for future research.
A mixed population with different learning characteristics can
introduce additional distortions when mastery learning is involved.
There has been much work already on identifying the learning
characteristics of individuals and sub-populations [1][5]. Further
developments in this direction would help build richer and more
accurate models of learning robust to the attrition bias in
mixed-population data.
4. REFERENCES
[1] Baker, R., Corbett, A. T., Aleven, V. 2008. More Accurate
Student Modelling through Contextual Estimation of Slip
and Guess Probabilities in Bayesian Knowledge Tracing.
Proceedings of the 9th International Conference on
Intelligent Tutoring Systems. ITS 2008. 406-415
[2] Corbett, A. T. and Anderson, J. R. 1995. Knowledge tracing:
Modeling the acquisition of procedural knowledge. User
Modeling and User-Adapted Interaction. 4, 4, 253-278
[3] Martin, B., Mitrovic, A., Koedinger, K. R., Mathan, S., 2011.
Evaluating and improving adaptive educational systems with
learning curves. User Modeling and User-Adapted
Interaction. 21, 3, 249-283
[4] Murray, C., Ritter, S., Nixon, T., Schwiebert, R., Hausmann,
R., Towle, B., Fancsali, S., Vuong, A. 2013. Revealing the
Learning in Learning Curves. Proceedings of the 16th
International Conference on Artificial Intelligence in
Education (Memphis, TN, July 9-13 2013). AIED 2013.
[5] Pardos, Z. A., Heffernan, N. T. 2010. Modeling
individualization in a Bayesian networks implementation of
knowledge tracing. Proceedings of the 18th International Conference
on User Modeling, Adaptation and Personalization. 255-266
Kenji YAMANISHI
yamanishi@mist.i.u-tokyo.ac.jp
oeda@j.kisarazu.ac.jp
ABSTRACT
Examination results are used to judge whether an examinee possesses
the desired latent skills. In order to grasp the skills, it is
important to find which skills a question item involves. The
relationship between items and skills may be represented by what we
call a Q-matrix. Recent studies have attempted to extract a
Q-matrix from a set of examinees' test scores with non-negative
matrix factorization (NMF). However, they did not consider the
time-evolving nature of latent skills. In order to comprehend the
learning effects in the educational process, it is important to
study how the distribution of examinees' latent skills changes over
time. In this paper, we propose novel methods for simultaneously
extracting both a Q-matrix and time-evolving latent skills from
examination time series.
1. INTRODUCTION
The relationship between items and skills may be represented by
what we call a Q-matrix [5]. Its original idea came from the rule
space method (RSM) developed by Tatsuoka et al. [5]. The Q-matrix
allows us to determine which skills are necessary to solve each
item. However, the process of determining the skills involved in a
given item is a tedious and heavy task. Recently, there have been
several studies on how to extract a Q-matrix from a set of
examinees' test scores [1, 2]. These studies applied non-negative
matrix factorization (NMF) to the problem of establishing the
skills from examinee performance data. They were applied only to
static examination results obtained at a certain time. However, in
order to comprehend the learning effects in the educational
process, it is significantly important to study how the
distribution of examinees' latent skills changes over time. There
have been studies on cognitive modeling from student performance
over time [3, 4]. From another aspect, we propose novel methods for
extracting time-evolving latent skills, which enable us to extract
both a Q-matrix and time-evolving latent skills from examination
time series simultaneously.
3.
min_{Q_t, S_t} || R_t − Q_t S_t ||²_F    (1)
[Figures 1-3: Q-matrix error (with error bars) and matrix rank over
time for the conventional NMF (Figure 1), the online NMF (Figure
2), and the online NMF with regularization (Figure 3).]
5. CONCLUSIONS
In this paper, we have introduced the online NMF with
regularization for the purpose of extracting a Q-matrix and a
time-evolving S-matrix from time series of examination results. We
have designed it to extract a stable Q-matrix in an online fashion.
We have employed a synthetic data set to demonstrate that it
extracts the Q-matrix more accurately and more stably than the
conventional NMF.
4. EXPERIMENTAL RESULTS
In order to verify the effectiveness of our methods, we made a
synthetic examination time series. We generated a time-varying
S-matrix and a fixed Q-matrix to obtain R_t according to the
equation R_t = Q ∘ S_t. The conjunctive Q-matrix consisted of 31
items and 6 skills. We designed the time series of S_t as a process
of acquiring skills, on the basis of item response theory (IRT)
[6].
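A synthetic conjunctive response matrix of this kind can be generated as below. The dimensions (31 items, 6 skills) follow the text, while the random Bernoulli generation of Q and S and the examinee count are illustrative assumptions, not the authors' exact procedure:

```python
import random

def conjunctive_response(Q, S):
    """R[e][i] = 1 iff examinee e masters every skill item i requires
    (the conjunctive model): S[e][k] >= Q[i][k] for all skills k."""
    return [[int(all(s >= q for q, s in zip(item, skills)))
             for item in Q]
            for skills in S]

rng = random.Random(1)
n_items, n_skills, n_examinees = 31, 6, 100
Q = [[rng.randint(0, 1) for _ in range(n_skills)] for _ in range(n_items)]
S = [[rng.randint(0, 1) for _ in range(n_skills)] for _ in range(n_examinees)]
R = conjunctive_response(Q, S)  # examinees x items response matrix
```

Feeding a sequence of such R_t matrices (with S evolving over t) to an NMF routine is the experimental setup the section describes.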
The Q-matrix error at time t is measured against the true Q-matrix
as

e_t = || Q̂_t − Q ||²_F.    (2)
Figures 1 and 2 show the experimental results obtained using the
conventional NMF and the online NMF. Note that the factorized
solutions obtained using NMF may not be unique due to the
randomness of the initial matrices. Hence we calculated the mean of
the Q-matrix errors over 10 simulation runs and plotted error bars
indicating the standard deviations. The Q-matrix errors in both
Figures 1 and 2 were large at the initial and final stages, while
the ranks of the matrices were small at the same stages. We
calculated the rank of each matrix R_t by means of QR
decomposition. As a result, there was a correlation between the
Q-matrix error and the rank of the matrix. Note that with the
conventional NMF, the Q-matrix error did not become zero at any
stage, and the standard deviations were uniformly large. With the
online NMF, the Q-matrix errors became zero at the middle stages
t = 7, . . . , 11. However, the Q-matrix error gradually increased
after t = 12. Figure 3 shows the result of the online NMF with
regularization, which overcame this problem: the Q-matrix error
remained zero after t = 12.
6. ACKNOWLEDGMENTS
7. REFERENCES
pawla@uwplatt.edu
ABSTRACT
A popular type of problem in online homework involves a set
of several true/false statements and requires that students
submit their answers to all the statements at once. Such
problems can force a student to submit many responses to
the same true/false statement. It is possible to examine student submission patterns to problems of this type with the
goal of determining which of the individual true/false statements exhibit a large proportion of response switches and
which statements exhibit largely consistent responses. This
paper describes algorithms that allow an instructor to uncover those statements that exhibit class-wide randomness
and also those that exhibit a class-wide preference for an incorrect response. The utility of the approach is suggested by
the fact that examining statements which emerge as outliers
according to these metrics uncovers several statements that
probe known student misconceptions.
1. INTRODUCTION
A popular type of problem in the LON-CAPA online homework network
[1] consists of a situation or set of situations followed by five
related true/false statements [2] (an example is shown in Fig. 1).
The student is required to submit answers to all the true/false
statements at once and receives only correct/incorrect feedback. A
student who submits an incorrect answer will not know which of the
statements has been answered incorrectly, or even how many of the
statements are incorrect, and so may submit as many as 2^5 = 32
responses before arriving at the correct answer.
One goal of this work is to develop a means to detect statements to
which the class consistently responds incorrectly. Such response
patterns correlate with strong misconceptions. The definition of a
strong misconception in the context of this work is an intuitive
belief that is in conflict with the concepts taught in the course.
An example is the first statement in the problem shown in Fig. 1.
Research has shown that students in introductory physics courses
have a strong tendency to believe that there must be a net force in
the direction of motion, even when this belief conflicts with
Newton's First Law [3, 4]. Thus, one might expect the class to
consistently answer "True" to this statement, even when forced to
answer multiple times.
Another, complementary goal is to investigate whether significant
class-wide randomness in the answers to a given statement can be an
indicator of incomplete understanding. Again, the problem of Fig. 1
provides a useful illustration. If a significant portion of the
class is indeed convinced that a net force is necessary to produce
constant velocity, this could produce a conflict in the minds of
the students about the consequences of applying more force. Will
the extra force produce a steady acceleration in accordance with
Newton's Second Law, or will there be a transient acceleration
dropping to zero when the appropriate velocity is reached? Because
of these conflicting ideas, one might expect students to exhibit a
tendency to change their answer to the second statement shown in
Fig. 1.
2. ASSESSING CONSISTENCY
A class with an average near 100% correct submissions to a
certain statement is consistently giving the correct response.
However, because a true/false statement has only one incorrect response, an average near 0% correct submissions
also implies consistent responses (the class is continuing to
respond with the one incorrect answer). If the class is answering randomly, the average will approach 50% correct
submissions as the number of tries becomes large. Thus, a
Ns,i,correct Ns,i,incorrect
Nr,tot
(2)
where Ns,i,correct is the number of correct initial submissions, Ns,i,incorrect is the number of incorrect initial submissions and Nr,tot is the total number respondents.
The overall consistency score Ctot is then defined by combining these statement-level scores (3).

3. ASSESSING RANDOMNESS
The second goal is to uncover statements that produce frequent switching of the response. The total number of possible switches for a class making Ns,tot submissions to a statement is Ns,tot - Nr,tot, where Nr,tot is the number of respondents. A switch is realized if the current submission is different from the prior one. A submission-weighted randomness score Rsw can be defined as the fraction of possible switches that are realized:
Rsw = Ns,switch / (Ns,tot - Nr,tot)    (4)
A respondent-weighted randomness score Rrw can be defined analogously as the fraction of respondents whose submissions realize at least one switch:

Rrw = Nr,switch / Nr,tot    (5)
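As a minimal sketch, the two scores can be computed directly from each respondent's chronological answer sequence. The randomness score follows Eq. (4); the absolute-value form of the consistency score is our reading of the partly garbled Eq. (2), so treat it as an assumption:

```python
# A minimal sketch of the two scores from per-respondent answer
# sequences; `answers` maps each respondent to his/her chronological
# True/False submissions and `key` is the statement's correct answer.

def consistency_initial(answers, key):
    # |N_correct - N_incorrect| / N_respondents over initial submissions:
    # near 1 when the class answers uniformly (right OR wrong),
    # near 0 when initial answers are evenly split.
    n_corr = sum(1 for seq in answers.values() if seq[0] == key)
    n_inc = len(answers) - n_corr
    return abs(n_corr - n_inc) / len(answers)

def randomness_sw(answers):
    # Fraction of possible switches that are realized (Eq. 4): a class
    # making Ns,tot submissions can switch at most Ns,tot - Nr,tot
    # times, since each respondent's first answer cannot be a switch.
    n_sub = sum(len(seq) for seq in answers.values())
    n_resp = len(answers)
    n_switch = sum(1 for seq in answers.values()
                   for prev, cur in zip(seq, seq[1:]) if cur != prev)
    return n_switch / (n_sub - n_resp)
```

For example, two respondents answering [True, True, True] and [True, False, True] realize 2 of the 4 possible switches, giving Rsw = 0.5.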
5. CONCLUSIONS
This paper has presented data-mining algorithms for assessing the consistency and the randomness of student responses to individual true/false statements. These algorithms are directly applicable to problems involving several
linked true/false statements, which have been implemented
in online homework. Investigation of examples indicates
that the consistency score can uncover class-wide misconceptions and the randomness score can be a useful indicator
of incomplete understanding among the class. Both scores
can also serve to uncover errors in problem construction.
The promise of the approach is that a simple question format that is suitable for use in online homework or as part
of online courses can uncover the specific concepts that give
a significant portion of the class problems.
6. ACKNOWLEDGMENTS
This paper relies on work done when the author was a postdoc with D.E. Pritchard's RELATE group at the Massachusetts
Institute of Technology. R.E. Teodorescu constructed the
homework sets and provided comments on this work.
The overall randomness score Rtot is then defined:

Rtot = Rsw Rrw.    (6)

7. REFERENCES
[1] http://lon-capa.org
[2] Kashy, E., Gaff, S.J., Pawley, N.H., Stretch, W.L.,
Wolfe, S.L., Morrissey, D.J., and Tsai, Y. 1995.
Conceptual questions in computer-assisted assignments.
Am. J. Phys. 63 (Nov. 1995), 1000-1004.
[3] Clement, J. 1982. Students' preconceptions in introductory mechanics. Am. J. Phys. 50 (Jan. 1982), 66-71.
[4] Hestenes, D., Wells, M., and Swackhamer, G. 1992. Force Concept Inventory. Phys. Teach. 30 (Mar. 1992), 141-158.
Joseph E. Beck
Ivon Arroyo
dovan@wpi.edu
josephbeck@wpi.edu
ivon@cs.umass.edu
ABSTRACT
We are using causal modeling to analyze relationships between pedagogical interventions, students' attitudes, affective states, perceptions, and outcomes, based on data from a math tutor, Wayang Outpost. The generated causal model gives interpretable multi-level interrelationships among the data variables, identifying direct and indirect effects. We observed that among the four affective variables, confidence and frustration are more tightly linked with students' performance and ability, whereas interest and excitement are more related to their attitude toward and appreciation of math and the tutor.
1. INTRODUCTION
In recent years, researchers have found that attending to students' motivational and affective characteristics is as crucial as attending to cognitive aspects for effective learning. Since students' cognitive, affective, and behavioral aspects are interconnected, analyzing their interrelationships gives us a clearer understanding of the learning process.
Causal models [1, 2] are graphical models that make the
additional assumption that the links between nodes represent
causal influence. By causal, we mean that a link A→B means that
if we intervene and change the value of A, then B will change.
Based on the conditional independencies within the data, causal
modeling makes causal inferences among the variables.
We used TETRAD, a free causal modeling software package [2]
to generate causal graphs. We also made use of an extension to
Tetrad, developed by Doug Selent, a graduate student at WPI, that
enabled us to restrict links on the basis of the magnitude of the
relation between the pair of nodes. It measures the magnitude of the relation using R², the amount of variability accounted for by the relationship between the nodes.
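The magnitude-based restriction can be illustrated with a small sketch (not the actual Tetrad extension; the function names are our own): compute the pairwise R² between variables and keep only candidate links that clear a threshold.

```python
# Illustrative sketch (not the actual Tetrad extension): compute the
# pairwise R^2 between variables and keep only candidate links whose
# R^2 clears a threshold.
import itertools

def r_squared(x, y):
    # Squared Pearson correlation: the variability accounted for by a
    # linear relationship between the two variables.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def strong_links(data, threshold):
    # Keep variable pairs whose pairwise R^2 exceeds the threshold,
    # mirroring the magnitude-based restriction on candidate links.
    return [(u, v) for u, v in itertools.combinations(sorted(data), 2)
            if r_squared(data[u], data[v]) > threshold]
```

In a real search these retained pairs would only restrict which edges the PC algorithm may consider; the conditional-independence tests still decide the final structure.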
The data comes from students working with Wayang Outpost, an
adaptive math tutoring system that helps students learn to solve
standardized-test questions, in particular state-based exams taken
at the end of high school in the USA.
We are using data from 94 middle school students, in the 7th and
8th grade (approximately 12 to 14 years old), who were part of
mathematics classes in a rural-area public middle school in
Massachusetts, USA. As part of the activity, students took a
survey on the first day, which assessed their baseline achievement
level, as well as affective and motivational factors related to
mathematics problem solving. The students took an identical survey at the end of the study. Responses were collected on a 5-point Likert scale.
2. CAUSAL MODELING
We input the data into TETRAD and performed a model search using the PC search algorithm, generating the graph shown in Figure 1. The PC search algorithm assumes that there are no hidden common causes between observed variables in the input (i.e., variables from the data set, or observed variables in the input graph) and that the graphical structure sought has no cycles [2]. The causal links are color-coded by R² value (red > 0.5, orange > 0.1, yellow > 0.05); a plus sign denotes a positive relationship and a minus sign a negative one. We also input domain knowledge based on temporal precedence. For example, gender is put on a higher knowledge tier than math self-concept, so that there could be a causal link directed from gender to math self-concept but not the other way round.
3. CONCLUSIONS
Since correlation underdetermines causality, there are multiple equivalent causal models that can be generated from the same set of data. Moreover, the stronger causal assumptions employed by this approach add analytical power but also introduce a higher chance of inaccuracy. Researchers therefore have to be careful about interpreting the causal relations of the model before making any causal claims. Most of the time, causal models are only able to summarize associations rather than uncover new causal mechanisms; whether they can do more depends on whether we are able to observe all possible common causes in our data set. We do not recommend using causal modeling as a tool for making strong causal conclusions, as we would with randomized controlled studies. A thorough understanding of the domain is required before we make causal interpretations of the model. Causal models can only be as good as the variables that we are able to include in the modeling process. But given the data variables, causal modeling is a more powerful exploratory tool than existing statistical approaches such as correlation and regression.
4. REFERENCES
Balaraman Ravindran
Edward F. Gehringer
lramach@ncsu.edu
ravi@cse.iitm.ac.in
ABSTRACT
Reviews of technical articles or documents must be thorough in discussing their content. At times a review may
be based on just one section in a document, say the Introduction. Review coverage is the extent to which a review
covers the important topics in a document. In this paper
we present an approach to evaluate the coverage of a submission by a review. We use a novel agglomerative clustering technique to group the submission's sentences into topic
clusters. We identify topic sentences from these clusters, and
calculate review coverage in terms of the overlaps between
the review and the submission's topic sentences. We evaluate our coverage identification approach on peer-review data
from Expertiza, a collaborative, web-based learning application. Our approach produces a high correlation of 0.51 with
human-provided coverage values.
Keywords
review quality, review coverage, topic identification, agglomerative clustering, lexico-semantic matching
efg@ncsu.edu

1. INTRODUCTION

2. APPROACH
(Table: correlation 0.51 vs. 0.46; avg. # words 108 vs. 100.)
In order to illustrate our approach we use real-world submission and review data from assignments completed using Expertiza [9]. Expertiza is a collaborative, web-based learning application that helps students submit assignments and peer-review each other's work. Figure 1 contains a sample submission with its topic-representative sentences highlighted in bold, and three sample reviews with high, medium, and no coverage of the submission's topic sentences. The first review covers the submission because it mentions ethical principles and ethics. The review with medium coverage mentions just ethics, however, and the review with no coverage does not contain any relevant information.
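The coverage computation can be sketched as follows. This is an illustrative simplification that substitutes plain lexical overlap for the paper's lexico-semantic matching, and the two-word overlap rule is a made-up threshold:

```python
# An illustrative simplification of the coverage computation: plain
# lexical overlap stands in for the paper's lexico-semantic matching.

def tokens(text):
    # Crude tokenizer: lowercased words, surrounding punctuation stripped.
    return {w.strip(".,;:!?").lower() for w in text.split()}

def covers(review, sentence, min_overlap=2):
    # A topic sentence counts as covered if the review shares at least
    # `min_overlap` words with it (hypothetical rule).
    return len(tokens(review) & tokens(sentence)) >= min_overlap

def coverage(review, topic_sentences):
    # Fraction of the submission's topic sentences the review touches.
    if not topic_sentences:
        return 0.0
    return sum(1 for s in topic_sentences if covers(review, s)) / len(topic_sentences)
```

Under this rule, a review mentioning both "ethical" and "principles" covers a topic sentence containing those words, while a review with no shared terms scores zero.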
3. EXPERIMENT

4. CONCLUSION

5. REFERENCES
ssalmeron@bec.uned.es
Olga C. Santos
ocsantos@dia.uned.es

ABSTRACT
Keywords
Affective Computing, Educational Data Mining, Emotions,
Adaptive Systems, User Modeling.
1. RESEARCH APPROACH
Given the strong role emotions play in the learning process, combining in a wise way the use of emotional information and user interactions in an e-learning platform can be expected to have an impact on users' performance and cognition [3]. Ongoing research in the literature aims to progress on managing different information sources, such as physiological sensors or face-tracking systems [1]. To this end, data mining is applied to provide personalized feedback to learners, which aims at supporting them in achieving better results on their tasks [5].
The approach followed in the MAMIPEC project [4] is focused on addressing the problem of emotion detection from a multimodal viewpoint by using different data sources. Our goal here is to combine those data sources to model the learner's current affective state and thus improve on the results obtainable from a single data source [1]. The learner model, which is based on standards, is thus enriched with new features that are used to provide personalized feedback during the learning process. In particular, the research of this ongoing Ph.D. work focuses on identifying and modeling affective states to support adaptive features in educational systems. The top-level hypothesis behind the research is that the application of data mining techniques to different emotional data sources can improve the modeling of the learner's affective state in terms of standards and thus better provide personalized support in open educational service-oriented architectures (i.e., those that take advantage of standards-based models).
Jesus G. Boticario
jgb@dia.uned.es
3. DATA MINING
With this collected data, the work on this Ph.D. focuses on detecting the emotions felt by the users by processing the different data sources obtained during the experience.
4. OPEN ISSUES
The work reported so far has helped to identify several kinds of open issues to deal with: i) the infrastructure for the data gathering and synchronization, ii) the detection of neutral emotion values, iii) the data mining approach itself, and iv) reducing the
5. ONGOING WORK
Several open issues have been identified by analyzing the data gathered. In particular, there is a need to redefine certain aspects of the data gathering in order to provide more meaningful information from which to obtain better mining results in future experiments. The approach proposed at the end of this stage must offer affective-state detection strong enough to provide a robust base for the model generated in the next layers.
The proposed work aims to provide an accurate standards-based user model useful for supplying personalized assistance that takes the learner's affective state into account. The first layer of this work is still being addressed: studying the state of the art on multimodal approaches to emotion detection and detecting different emotions by using data mining.
6. ACKNOWLEDGEMENTS
The authors would like to thank the experiment participants, colleagues in the MAMIPEC project (TIN2011-29221-C03-01), and the Spanish Government for its funding.
7. REFERENCES
[1] D'Mello, S. and Kory, J. 2012. Consistent but modest: a meta-analysis on unimodal and multimodal affect detection accuracies from 30 studies. Proceedings of the 14th ACM International Conference on Multimodal Interaction (New York, NY, USA, 2012), 31-38.
[2] Goetz, T., Frenzel, A.C., Hall, N.C. and Pekrun, R. 2008. Antecedents of academic emotions: Testing the internal/external frame of reference model for academic enjoyment. Contemporary Educational Psychology. 33, 1 (2008), 9-33.
[3] Moore, S.C. and Oaksford, M. 2002. Some long-term effects of emotion on cognition. British Journal of Psychology. 93, 3 (2002), 383-395.
[4] Santos, O.C., Salmeron-Majadas, S., Boticario, J.G. 2013. Emotions detection from math exercises by combining several data sources. Proceedings of the 16th International Conference on Artificial Intelligence in Education (AIED 2013), in press.
[5] Shen, L., Wang, M. and Shen, R. 2009. Affective e-learning: Using emotional data to improve learning in pervasive learning environment. Educational Technology & Society. 12, 2 (2009), 176-189.
Yoav Bergner
David E. Pritchard
dseaton@mit.edu
bergner@mit.edu
dpritch@mit.edu
ABSTRACT
We use a two-parameter family of bounded distribution functions (Kumaraswamy) to fit electronic textbook (etext) usage in 20 blended and online courses from Michigan State
University, MIT, and edX. We observe clusters of courses
in the parameter space that correlate with course structural
features such as frequency of exams.
Keywords
Course structure, etexts, MOOCs, usage mining
1.
INTRODUCTION
When etexts are integrated into Learning Management Systems (LMS), one can extract the number, frequency and
duration of page views from the tracking logs. These data
have fed back into etext interface design [2] and personalized
approaches aimed at understanding student reading habits
and comprehension [3], but an incomplete picture remains
in terms of guiding instructors on how to integrate etexts
into courses. This is particularly salient given studies which
point to low use of traditional textbooks and poor correlation of use with performance [4].
We consider the fraction of etext pages accessed by students
in courses of varying structure. Aspects of course structure include the primary/supplementary role of the etext,
its integration with graded assessment, and the frequency
of exams. Our data come from blended and distance learning courses from Michigan State University (MSU) and from
open online courses from the RELATE group at the Massachusetts Institute of Technology (MIT) and edX. The MSU
populations are typical for introductory science at a large
state university. In contrast, the student populations in open
online courses are highly variable in age and preparation [1].
Courses in this study use either LON-CAPA, with e-texts as
modularized html pages, or edX, which uses digital versions
of traditional textbooks within simple navigation. We fit the distribution of unique page views in each course using a two-parameter family of distribution functions with support on the interval [0,1]. The complementary cumulative distribution function (CCDF) of the Kumaraswamy distribution is given by F(x; a, b) = (1 - x^a)^b. The (a, b) parameters which determine the shape of each distribution may not be familiar. We highlight four relevant regions: bimodal (a, b < 1), low usage (a < 1, b > 1), high usage (a > 1, b < 1), and unimodal (a, b > 1). Note that the probability distributions (PDF, not CCDF) associated with a = b = 0.5 (bimodal) and a = b = 1.5 (unimodal) have the same average use.
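A dependency-free sketch of fitting the Kumaraswamy CCDF and classifying the resulting region follows; the grid-search fit is our own illustrative choice, not the paper's method:

```python
# Sketch: fit the Kumaraswamy CCDF F(x; a, b) = (1 - x^a)^b to observed
# per-student fractions of etext pages viewed, then classify the usage
# region from the fitted (a, b). A grid search keeps the sketch
# dependency-free; a real fit might use scipy.optimize instead.

def ccdf(x, a, b):
    # Complementary CDF of the Kumaraswamy distribution on (0, 1).
    return (1.0 - x ** a) ** b

def empirical_ccdf(fractions, x):
    # Fraction of students whose view fraction exceeds x.
    return sum(1 for f in fractions if f > x) / len(fractions)

def fit_kumaraswamy(fractions):
    # Least-squares grid search over (a, b) in (0, 4].
    grid = [0.1 * k for k in range(1, 41)]
    xs = [0.05 * k for k in range(1, 20)]
    emp = [empirical_ccdf(fractions, x) for x in xs]
    best, best_err = (1.0, 1.0), float("inf")
    for a in grid:
        for b in grid:
            err = sum((ccdf(x, a, b) - e) ** 2 for x, e in zip(xs, emp))
            if err < best_err:
                best, best_err = (a, b), err
    return best

def usage_region(a, b):
    # The four regions highlighted in the text.
    if a < 1 and b < 1:
        return "bimodal"
    if a < 1:
        return "low usage"
    if b < 1:
        return "high usage"
    return "unimodal"
```

Fitting view fractions sampled from an a = b = 1.5 distribution recovers parameters in the unimodal region, as expected.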
2. BLENDED COURSES

3. ONLINE COURSES
http://www.pa.msu.edu/bauer/mmp/
Figure 1: Etext usage via Kumaraswamy (a, b)-parameters for (a) blended courses and (b) online courses. Contour lines represent the average fraction of the etext viewed. CCDFs (bold) with curve fits (smooth) are displayed as an inlay.
MSU distance courses used the same etext and LMS as the previously discussed blended courses. RELATE reformed courses used LON-CAPA with a Mechanics text developed by the RELATE group [5]. MITx MOOCs were disseminated through edX.
The online course data in Fig. 1(b) display similar, if slightly
more nuanced, clustering of usage by course structure. MSU
distance courses cluster near the saddle point (a = b = 1),
but end up within three of the four usage regions. Since the average view fraction remains similar to the blended courses (0.5-0.6), the points on either side of the saddle point lend themselves to the following interpretation: while students in MSU primary (blended) courses typically accessed 55% of the etext (unimodal), students in distance courses viewed either more or less than this (bimodal).
The two RELATE courses in Fig. 1(b) both have a similar average view fraction (≈ 0.78), but one (summer) appears in the unimodal region and the other (spring) in the high-usage region. The spring instance required students to complete all 14 weekly units; in summer, the last three units were optional. Only a small fraction of students completed the optional assignments, explaining the shift. The three MITx MOOCs in a tight cluster are computer science offerings (avf ≈ 0.1). A fourth, Solid-State Chemistry, lies just outside (avf ≈ 0.2). The MOOCs fall in the same region of the parameter space as the MSU supplemental (blended) courses, which is consistent with their similar course structure: all provide their etext as a supplemental text.
4. Acknowledgments
Work partially supported by NSF grant DUE-1044294. We
thank G. Kortemeyer, J.M. Van Thong, and Piotr Mitros,
as well as edX, TLL, and ODL at MIT.
5. REFERENCES
jean.simon@univ-reunion.fr
Abstract
In this paper we propose one possible way to preprocess
data according to Activity theory. Such an approach is
particularly interesting in Educational Data Mining.
Keywords
Activity theory, data preprocessing, preservice teacher
INTRODUCTION
METHODOLOGY
(Figure: activity-system diagram relating subject, object, outcome, tool, rules, community, and division of labor.)
APPLICATION

Average number of producers for one group: 11
Average number of documents per producer for one group: 4
CONCLUSION
REFERENCES
University of Memphis
Memphis, TN, 38152, USA
scotty.craig@asu.edu
ABSTRACT
1. INTRODUCTION
Advanced learning environments [1, 4] are designed to create
the most effective learning gains for students, but students vary,
and not all students benefit equally from these systems. The
issue then becomes discovering individual factors that maximize
the usefulness of the program [9]. Studies have found that
individual factors such as student effort and ability impact learning behaviors and outcomes in computer-based learning environments [3, 8, 9].
Given the existing findings linking effort and ability with
learning, this study analyzed two important characteristics on
which students vary: effort and ability. We investigated the
extent to which effort and math ability differentially influenced
learning gains in a Web-based intelligent tutoring system (ITS)
called ALEKS (the Assessment and LEarning in Knowledge
Spaces) [5].
3. ANALYSIS
The independent variables were math ability and effort. Math ability was measured by TCAP 5th-grade scores. For data mining purposes, we measured student effort as the ratio of the total number of mastered topics to the total number of attempted topics. Although not a pure measure of effort, this ratio was a reliable way to contrast student learning with persistence. First, by using artificial intelligence to map each student's knowledge, ALEKS offered the topics he/she was ready to learn right then [2]; the difficulty of each student's offered topics was thus matched to his/her math ability, so students who seriously tried would master them. Second, ALEKS required students to write down each step when solving every problem, so students had no chance to guess the right answers. Third, this effort measure would not be contaminated by students' absent-minded behaviors, such as spending time clicking on the screen randomly. Therefore, this ratio reflected students' true effort in ALEKS. The dependent variable was the math posttest, the TCAP 6th-grade score. All the variables were normalized.
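The effort measure and normalization described above can be sketched as follows; z-scoring is one common reading of "normalized" and is an assumption here:

```python
# Sketch of the effort measure: effort = mastered / attempted topics,
# then z-scored across students (z-scoring is an assumed reading of
# "normalized").

def effort(mastered, attempted):
    # Ratio of mastered to attempted topics; 0 when nothing attempted.
    return mastered / attempted if attempted else 0.0

def z_scores(values):
    # Population z-scores used to normalize each variable.
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd if sd else 0.0 for v in values]
```

The normalized ability and effort vectors are what a K-means clustering, as used in the discussion below, would take as input.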
Cluster         LH (N=121)   HL (N=73)   HH (N=21)   LL (N=53)
Math ability       -.15         .40        1.21        -.78
Effort              .03        -.07        1.13        -.78
5. ACKNOWLEDGMENTS
This research was supported by the Institute of Education Sciences (IES) Grant R305A090528 to Dr. Xiangen Hu, PI. Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect the views of IES.
6. REFERENCES
[1] Craig, S. et al. 2011. Learning with ALEKS: The Impact of Students' Attendance in a Mathematics After-School Program. Artificial Intelligence in Education (2011), 435-437.
[2] Falmagne J.C., Cosyn E., Doignon J. P., Thiry N.
2006. The assessment of knowledge, in theory and in
practice. Formal Concept Analysis Lecture Notes in
Computer Science 3874 (2006), 61-79.
[3] Welsh, Elizabeth T., et al. "Elearning: emerging uses,
empirical results and future directions." International
Journal of Training and Development 7.4 (2003): 245-258.
[4] Hu, X. et al. 2012. The Effects of a Traditional and Technology-based After-school Setting on 6th Grade Students' Mathematics Skills. Journal of Computers in Mathematics and Science Teaching 31, 1 (2012), 17-38.
[5] Overview of ALEKS,
http://www.aleks.com/about_aleks/overview
[6] Parsons, Jacquelynne E., and Diane N. Ruble. "Development of Integration Processes Using Ability and Effort Information to Predict Outcome." Developmental Psychology 10.5 (1974): 721-732.
[7] Timmers, C.F. et al. 2012. Motivational beliefs, student
effort, and feedback behavior in computer-based formative
assessment. Computers & Education. (2012).
4. DISCUSSION
This study applied K-means to cluster students based on their levels of math ability and effort in ALEKS. The notable result was that in the ALEKS learning system, math ability and effort had a multiplicative interaction on students' math posttest. Expending effort improved students' math performance, particularly for high-ability students. It illustrated that ALEKS could help
Jack Mostow
yanbox@cs.cmu.edu
mostow@cs.cmu.edu
ABSTRACT
Previous work on knowledge tracing has fit parameters per skill (ignoring differences between students), per student (ignoring differences between skills), or independently for each <student, skill> pair (risking sparse training data and overfitting, and undergeneralizing by ignoring overlap of students or skills across pairs). To address these limitations, we first use a higher-order Item Response Theory (IRT) model that approximates students' initial knowledge as their one-dimensional (or low-dimensional) overall proficiency, and combines it with the estimated difficulty and discrimination of each skill to estimate the probability knew of knowing a skill before practicing it. We then fit skill-specific knowledge tracing probabilities for learn, guess, and slip. Using synthetic data, we show that Markov Chain Monte Carlo (MCMC) can recover the parameters of this Higher-Order Knowledge Tracing (HO-KT) model. Using real data, we show that HO-KT predicts performance in an algebra tutor significantly better than fitting knowledge tracing parameters per student or per skill.
2. Approach
IRT's 2-Parameter Logistic model [4] estimates the probability knew_nj of student n already knowing skill j as a logistic function of student proficiency θ_n, skill discrimination a_j, and difficulty b_j:

knew_nj = 1 / (1 + exp(-1.7 a_j (θ_n - b_j)))
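The 2PL estimate can be transcribed directly (D = 1.7 is the usual logistic scaling constant):

```python
# Direct transcription of the 2PL estimate above: theta is the
# student's overall proficiency, a_j the skill's discrimination,
# b_j its difficulty; D = 1.7 is the usual scaling constant.
import math

def p_knew(theta, a_j, b_j, D=1.7):
    return 1.0 / (1.0 + math.exp(-D * a_j * (theta - b_j)))
```

A student of average proficiency facing a skill of matching difficulty gets p_knew = 0.5, and the probability rises with proficiency.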
Keywords
Knowledge tracing, Item Response Theory, higher order models
1. Introduction
Traditional knowledge tracing (KT) [1] estimates the probability
that a student knows a skill by observing attempted steps that
require it, and applying a model with four parameters for each
skill, assumed to be the same for all students: the probabilities
knew of knowing the skill before practicing it, learn of acquiring
the skill from one attempt, guess of succeeding at the attempt
without knowing the skill, and slip of failing despite knowing the
skill. Prior work shows that fitting such parameters for individual students can improve the model's accuracy in predicting student performance [2] or reduce unnecessary practice [3]. Such per-student parameters, however, ignore differences between skills.
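For reference, the conventional four-parameter KT update described above can be sketched as follows (this is standard knowledge tracing, not the HO-KT variant this paper introduces):

```python
# Standard four-parameter knowledge-tracing posterior update.

def kt_update(p_know, correct, learn, guess, slip):
    # Bayes update of P(known) from one observed attempt...
    if correct:
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    posterior = num / den
    # ...then the chance of acquiring the skill from this attempt.
    return posterior + (1 - posterior) * learn
```

A correct attempt raises the estimate above the prior and an incorrect one lowers it, with guess and slip controlling how strongly the evidence counts.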
Fitting KT parameters separately instead for each <student, skill>
pair risks sparse training data and overfitting, and undergeneralizes by ignoring overlap of students or skills across pairs.
Item Response Theory (IRT) [4, 5] predicts a student's performance on an item based on the difficulty and discrimination of the skill(s) the item requires, and a one- (or low-) dimensional static estimate of the student's overall proficiency. Prior work adapted IRT to estimate the static probability of knowing a given skill [6], or dynamic changes in overall proficiency [7]. Here we dynamically estimate individual skills required in observed steps.
3. Experiment
We first generated synthetic data with N=100 students, each of whom practices J=4 skills required in a series of T=100 steps. We used OpenBUGS [8] to implement MCMC estimation for HO-KT in the BUGS language. We simultaneously ran the model in 5 chains for 10,000 iterations with a burn-in of 3,000, each chain starting from randomly generated initial values, and considered MCMC to have converged when all 5 chains overlapped in the OpenBUGS monitor window. Table 1 shows how well the estimated value of learn for each simulated skill recovered its true value; estimates of other parameters were similarly accurate but are omitted here for lack of space. Moreover, MCMC correctly recovered 99.4% of the simulated students' 10,000 hidden binary knowledge states.
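The synthetic-data setup (N students, J skills, T steps) can be sketched with standard KT dynamics; the parameter values below are illustrative, not the paper's:

```python
# Sketch of generating synthetic KT data: each student practices each
# skill for T steps under standard KT dynamics.
import random

def simulate_student(T, p_knew, p_learn, p_guess, p_slip, rng):
    # One student practicing one skill for T steps.
    known = rng.random() < p_knew          # initial knowledge
    obs = []
    for _ in range(T):
        p_correct = (1 - p_slip) if known else p_guess
        obs.append(rng.random() < p_correct)
        if not known and rng.random() < p_learn:
            known = True                   # skill acquired this step
    return obs

def simulate_class(N, J, T, params, seed=0):
    # `params[j]` is the (p_knew, p_learn, p_guess, p_slip) tuple
    # for skill j; returns data[student][skill][step].
    rng = random.Random(seed)
    return [[simulate_student(T, *params[j], rng) for j in range(J)]
            for _ in range(N)]
```

Running an estimator against data generated this way, with known true parameters, is what makes the recovery comparison in Table 1 possible.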
Table 1. Estimation of learn in synthetic data

Skill j      1         2         3         4
learn        0.8       0.6       0.5       0.3
s.d.         0.13      0.05      0.11      0.02
MC_error     0.006599  0.002132  0.005432  7.79E-04
Accuracy
  Overall    87.13%   85.92%   85.15%   85.10%
  Correct    97.76%   96.19%   99.99%   100.00%
  Incorrect  26.43%   27.28%    0.92%    0.00%
Log-likelihood  -5442.50  -5216.23  -5102.15  --
4. Discussion
HO-KT uses IRT to estimate students' initial knowledge of a skill based on its difficulty and discrimination and their overall proficiency, and KT to model learning over time. It outperforms
per-student or per-skill KT by combining information about both.
HO-KT estimates every probability Knew(student, skill) without
requiring training data for every <student, skill> pair, because it
can estimate student proficiency based on other skills, and skill
difficulty and discrimination based on other students.
Future work should compare HO-KT to other methods and on
data from other tutors. We should also test if k-dimensional
student proficiency captures enough additional variance to justify
fitting k times as many parameters. Finally, extending HO-KT to
ACKNOWLEDGMENTS
This work was supported by the Institute of Education Sciences,
U.S. Department of Education, through Grants R305A080157 and
R305A080628 to Carnegie Mellon University, and by the
National Science Foundation under Cyberlearning Grant
IIS1124240. The opinions expressed are those of the authors and
do not necessarily represent the views of the Institute, the U.S.
Department of Education, or the National Science Foundation. We
thank Ken Koedinger for his algebra tutor data.
REFERENCES
[1] Corbett, A. and J. Anderson. Knowledge tracing: Modeling
the acquisition of procedural knowledge. User modeling and
user-adapted interaction, 1995. 4: p. 253-278.
[2] Pardos, Z. and N. Heffernan. Modeling Individualization in a
Bayesian Networks Implementation of Knowledge Tracing.
Proceedings of the 18th International Conference on User
Modeling, Adaptation and Personalization, 255-266. 2010.
Big Island, Hawaii.
[3] Lee, J.I. and E. Brunskill. The Impact on Individualizing
Student Models on Necessary Practice Opportunities
Proceeding of the 5th International Conference on
Educational Data Mining (EDM) 118-125. 2012. Chania,
Greece.
[4] Hambleton, R.K., H. Swaminathan, and H.J. Rogers.
Fundamentals of Item Response Theory. Measurement
Methods for the Social Science. 1991, Newbury Park, CA:
Sage Press.
[5] Birnbaum, A. Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores, 1968: pp. 374-472.
[6] de la Torre, J. and J.A. Douglas. Higher-order latent trait
models for cognitive diagnosis. Psychometrika 2004. 69(3):
p. 333-353.
[7] Martin, A.D. and K.M. Quinn. Dynamic Ideal Point
Estimation via Markov Chain Monte Carlo for the U.S.
Supreme Court, 1953-1999. Political Analysis, 2002. 10: p.
134-153.
[8] Lunn, D., D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine 2009. 28: p. 3049-3067.
[9] Koedinger, K.R., R.S.J.d. Baker, K. Cunningham, A.
Skogsholm, B. Leber, and J. Stamper. A Data Repository for
the EDM community: The PSLC DataShop. In C. Romero, et
al., Editors, Handbook of Educational Data Mining, 43-55.
CRC Press: Boca Raton, FL, 2010.
[10] Chang, K.-m., J.E. Beck, J. Mostow, and A. Corbett. A Bayes
Net Toolkit for Student Modeling in Intelligent Tutoring
Systems. In Proceedings of the 8th International Conference
on Intelligent Tutoring Systems, K. Ashley and M. Ikeda,
Editors. 2006: Jhongli, Taiwan, p. 104-113.
[11] Koedinger, K.R., P.I. Pavlik, J. Stamper, T. Nixon, and S.
Ritter. Avoiding Problem Selection Thrashing with
Conjunctive Knowledge Tracing. In Proceedings of the 4th
International Conference on Educational Data Mining.
2011: Eindhoven, NL, p. 91-100.
[12] Xu, Y. and J. Mostow. Using Logistic Regression to Trace
Multiple Subskills in a Dynamic Bayes Net. Proceedings of
the 4th International Conference on Educational Data
Mining 241-245. 2011. Eindhoven, Netherlands.
Kenneth R. Koedinger
myudelson@carnegielearning.com
koedinger@cmu.edu
ABSTRACT
Educational Data Mining researchers use various prediction
metrics for model selection. Often the improvements one
model makes over another, while statistically reliable, seem
small. The field has been lacking a metric that informs us
on how much practical impact a model improvement may
have on student learning efficiency and outcomes. We propose a metric that indicates how much wasted practice can be avoided (increasing efficiency) and how much extra practice would be added (increasing outcomes) by using a more accurate model. We show that learning can be improved by 15-22% when using machine-discovered skill model improvements across four datasets and by 7-11% by adding individual student estimates to Bayesian Knowledge Tracing.
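The practical-impact idea can be sketched under stated assumptions: prescribe practice until predicted mastery probability reaches a threshold, then compare how many opportunities two models prescribe. The success-independent update, the 0.95 threshold, and the function names are illustrative, not the paper's exact metric:

```python
# Sketch: count prescribed practice opportunities under a model, then
# compare two models; the difference is practice saved or added.

def opportunities_to_mastery(p0, learn, threshold=0.95, cap=100):
    # Steps until predicted P(known) >= threshold, assuming the skill
    # is acquired with probability `learn` at each opportunity.
    p, n = p0, 0
    while p < threshold and n < cap:
        p += (1 - p) * learn
        n += 1
    return n

def practice_difference(p0_a, learn_a, p0_b, learn_b):
    # Opportunities saved (positive) or added (negative) when model B
    # replaces model A for prescribing practice.
    return (opportunities_to_mastery(p0_a, learn_a)
            - opportunities_to_mastery(p0_b, learn_b))
```

For instance, a model that starts a student at P(known) = 0.5 instead of 0 prescribes one fewer opportunity at a learning rate of 0.5, and such per-skill differences aggregate into the wasted-practice percentages reported in the abstract.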
1. INTRODUCTION

2. DATA
We used the datasets from the KDD Cup 2010 EDM Challenge and from the Pittsburgh Science of Learning Center (PSLC) DataShop (www.pslcdatashop.org): Algebra I
3.
MODELS
4.
METHOD
Table 1: Comparing models in terms of root mean squared error, percent of cases where the number of prescribed practice opportunities differs by at least one, average student/skill practice opportunities, and time.

(a) Estimated prediction improvements and practical benefits of replacing hand-made KC models with LFA machine-discovered KC models across four DataShop datasets (RMSE values are given for a student-stratified 10-fold cross-validation).

Dataset             Time/step  KCs (Orig/LFA)  RMSE (Orig/LFA)  % diff Orig.-LFA (>=1 / (-1,1) / <=-1)  Mean stud. opp/KC (Orig/LFA/diff/%)  Stud. time (total/diff/%)
Geometry 1996-97    17.12s     15/18           0.410/0.400      10% / 29% / 61%                         8.8 / 7.8 / 1.5 / 18-20              26m / 2m / 9
Articles 2009       15.09s     13/26           0.437/0.420       5% / 53% / 43%                         7.9 / 7.1 / 1.2 / 15-17              14m / 3m / 23
Geometry 2010       15.10s     46/43           0.240/0.239      95% /  5% /  0%                         8.6 / 10.5 / 1.9 / 18-22             88m / 6m / 7
Numberline 2011     12.77s     12/22           0.459/0.457      41% / 22% / 37%                         15.1 / 15.2 / 2.8 / 19               18m / 32m / 182

Note: these values could be inflated due to the absence of mastery learning in the respective tutors and, as a result, the amount of student work being less optimal.

(b) Estimated prediction improvements and practical benefits of replacing standard BKT models with individualized BKT models across two KDD Cup 2010 datasets (RMSE values are given for a student-stratified 10-fold cross-validation).

Dataset              Time/step  KCs  RMSE (BKT/iBKT)  % diff BKT-iBKT (>=1 / (-1,1) / <=-1)
Algebra 1 (kts)      n/a        515  0.363/0.361      24% / 72% / 4%
Algebra 1 (ss)       n/a        541  0.342/0.341      34% / 63% / 3%
B. to Algebra (kts)  12.81s     807  0.363/0.359      22% / 74% / 5%
B. to Algebra (ss)   12.81s     933  0.359/0.355      27% / 68% / 5%
5.
Table 1 summarizes the model comparisons. Table 1a compares original skill models and the best-fitting machine-discovered
skill models for the DataShop datasets. Table 1b compares
standard BKT and individualized BKT modeling methods
for the same skill models in the KDD Cup 2010 datasets. Despite the vast difference in the size of the datasets (and, inherently, the size of the curriculum), improvements with respect
to student-stratified cross-validated RMSE are quite small.
Just like the improvements in RMSE, the mean absolute differences in mean student opportunities are small: from 1.1
to 2.8 practice attempts. However, in terms of percent practice opportunities, those differences constitute 15-22% in the
DataShop datasets and 7-11% in the KDD Cup 2010 datasets.
The practice opportunity differences are shown in Table 1a
and Table 1b under % diff Orig.-LFA and % diff BKT-iBKT
respectively. Here the column marked >=1 indicates the
percent of student-KC experiences for which the model built
on the LFA-discovered KC model prescribes at least one
opportunity less on average than the model built on the
original KC model. Similarly, the column marked <=-1 indicates the
percent for which the LFA-discovered skill model prescribes at
least one more opportunity.
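The practice-opportunity binning described above can be sketched as follows (a hypothetical illustration; the function and the sample data are ours, not the paper's):

```python
# For each (student, KC) pair, two models each prescribe a number of practice
# opportunities. We bin the per-pair difference (orig - lfa) into three bins:
# >= 1 (at least one opportunity saved by the LFA model), (-1, 1), and
# <= -1 (at least one opportunity added), and report each bin as a percent.
def opportunity_diff_bins(orig, lfa):
    """orig, lfa: dicts mapping (student, kc) -> prescribed opportunities."""
    saved = between = added = 0
    for key in orig:
        d = orig[key] - lfa[key]
        if d >= 1:
            saved += 1      # LFA model prescribes at least one opportunity less
        elif d <= -1:
            added += 1      # LFA model prescribes at least one opportunity more
        else:
            between += 1
    n = len(orig)
    return {"%>=1": 100 * saved / n,
            "%(-1,1)": 100 * between / n,
            "%<=-1": 100 * added / n}

# Illustrative prescriptions for two students and two hypothetical KCs.
pairs_orig = {("s1", "area"): 9, ("s1", "perim"): 7, ("s2", "area"): 8, ("s2", "perim"): 6}
pairs_lfa  = {("s1", "area"): 7, ("s1", "perim"): 7.5, ("s2", "area"): 8.2, ("s2", "perim"): 7.5}
print(opportunity_diff_bins(pairs_orig, pairs_lfa))
```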
The overall amount of time students spend with the tutor differs from dataset to dataset: from 14-18 minutes to
6. ACKNOWLEDGMENTS
7. REFERENCES
zhilin.zheng@tu-clausthal.de
ABSTRACT
Group formation strategies aim to provide the participating
students with good initial conditions for collaborative
learning. Building on existing methods that set up initial
conditions to make peer interaction more likely, we propose
a method for dynamically recomposing learning groups based
on intra-group interaction analysis, optimizing the learning
group formation iteratively.
Keywords
Group Formation, Educational Data Mining, Collaborative
Learning, Dynamic Group Composition, Interaction Analysis
1. INTRODUCTION
Group formation plays a critical role in the success of
collaborative learning groups [2]. Pedagogical experiments
have shown that both homogeneous and heterogeneous group
formation strategies can effectively promote collaboration [1]. In
order to compose heterogeneous or homogeneous learning
groups, plenty of composition approaches have been suggested
[2; 3; 6]. These approaches pay most attention to the
performance of the proposed algorithms, such as solution
optimization and time cost, while the peer interaction within the
formed groups is typically not considered for refining the groups.
In addition, some data mining technologies have recently been
proposed to analyze peer interaction, with results indicating
that there are recurring interaction patterns within groups with strong
peer interaction [7]. Therefore, if we could find a way of
composing groups that leads to groups showing these
interaction patterns, then effective peer interaction within
these groups might be triggered with higher probability.
[Figure: system overview. A Group Formation Generator initializes Groups 1 to n; students and the instructor interact with the groups through a Student Interface and an Instructor Interface.]
2. PROPOSED APPROACH
The proposed method dynamically recomposes groups based
on interaction analysis. We expect to distinguish groups with
strong interaction from weaker ones and to learn group
composition rules from this. In this paper, group composition
rules specify which types of group members, when working together,
trigger either strong or weak peer interaction. Initially, the
collaborative groups are composed by existing composition
approaches (e.g., Graf and Bekele's method) [2]. Learners in each
group are then instructed to complete team tasks collaboratively.
After the completion of the tasks, the peer interaction in the
learning groups is analyzed. Data mining technologies are used
to extract interaction patterns (e.g., sequential patterns) from the
group interaction log files. These patterns, together with the tutor's
assessment, could be used to distinguish the effective interaction
[Figure, continued: an Interaction Analysis component separates weak and strong groups and yields composition rules of the form: if (condition 1) then Weak Interaction; if (condition 2) then Strong Interaction.]
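The recomposition loop sketched above could be made concrete as follows (an illustrative sketch with a naive placeholder rule of our own; this is not the paper's implementation):

```python
# Sketch of one dynamic-recomposition step: keep groups whose interaction was
# labeled strong, dissolve the weak ones, and remix their members into the
# strong groups. The "rule" here (round-robin remixing) is a deliberately
# simple stand-in for the mined composition rules described in the paper.
def recompose(groups, interaction_score, threshold=0.5):
    """groups: list of lists of member ids; interaction_score: dict index -> score."""
    strong, weak_members = [], []
    for i, g in enumerate(groups):
        if interaction_score[i] >= threshold:
            strong.append(list(g))          # strong group: keep intact
        else:
            weak_members.extend(g)          # weak group: dissolve
    if not strong:                          # nothing to remix into
        return [weak_members]
    for j, m in enumerate(weak_members):    # distribute dissolved members
        strong[j % len(strong)].append(m)
    return strong

new_groups = recompose([["a", "b"], ["c", "d"]], {0: 0.9, 1: 0.1})
```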
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
360
[Figure/Table: example output labeling Group 1 as strong and Group 2 as weak interaction, with members drawn from Clusters A and B.]
5. ACKNOWLEDGMENTS
This research is supported by the China Scholarship Council (CSC).
6. REFERENCES
[1] ABRAMI, P.C. and CHAMBERS, B. 1996. Research on
Cooperative Learning and Achievement: Comments on
Slavin. Contemporary Educational Psychology 21, 1 (Jan.
1996), 70-79.
[2] GRAF, S. and BEKELE, R. 2006. Forming heterogeneous
groups for intelligent collaborative learning systems with
ant colony optimization. In the Proceedings of the 8th
international conference on Intelligent Tutoring Systems
(Jhongli, Taiwan, June 26-30, 2006). ITS 2006. Springer-Verlag, Heidelberg, 217-226.
[3] LIN, Y.T., HUANG, Y.M., and CHENG, S.C. 2010. An
automatic group composition system for composing
collaborative learning groups using enhanced particle
swarm optimization. Computers & Education 55, 4 (Dec.
2010), 1483-1493.
[4] LOU, Y., ABRAMI, P.C., SPENCE, J.C., POULSEN, C.,
CHAMBERS, B., and D'APOLLONIA, S. 1996. Within-Class Grouping: A Meta-Analysis. Review of Educational
Research 66, 4 (Winter, 1996), 423-458.
[5] MARTINEZ, R., YACEF, K., KAY, J., KHARRUFA, A.,
and AL-QARAGHULI, A. 2011. Analysing frequent
sequential patterns of collaborative learning activity around
an interactive tabletop. In the Proceedings of the Fourth
International Conference on Educational Data Mining
(Eindhoven, The Netherlands, July 6-8, 2011). EDM
2011, 111-120.
[6] MORENO, J., OVALLE, D.A., and VICARI, R.M. 2012.
A genetic algorithm approach for group formation in
collaborative learning considering multiple student
characteristics. Computers & Education 58, 1 (Jan. 2012),
560-569.
[7] PERERA, D., KAY, J., KOPRINSKA, I., YACEF, K., and
ZAIANE, O.R. 2009. Clustering and sequential pattern
mining of online collaborative learning data. IEEE
Transactions on Knowledge and Data Engineering 21, 6
(June 2009), 759-772.
Poster Presentations
(Late Breaking Results)
Philip Janisiewicz
nicolefv@gmail.com
pjanisiewicz@gmail.com
anie.aghababyan@gmail.com
taylormartin@usu.edu
ABSTRACT
This study investigates ways to interpret and utilize the vast
amount of log data collected from an educational game called
Refraction to understand student fraction learning. Study
participants are elementary students enrolled in an online virtual
school system who played the game over the course of multiple
weeks. Findings suggest that students use a variety of splitting
strategies when solving Refraction levels and that these strategies
are related to learning gains.
Keywords
Educational data mining; hierarchical clustering; learning
analytics; mathematics education; fractions; educational games.
1. INTRODUCTION
Electronic games have become a regular part of childhood and
adolescence [1]. In recent years, the interest in games for learning
has grown, and educational games have increased in their
popularity as means of instruction [3]. These games are
unstructured environments where students can learn educational
concepts through engaging interfaces and at their own pace.
Educational data mining techniques have the potential to
illuminate learning patterns across a large number of students who
play these games. By analyzing the data generated through these
activities and assignments, data scientists can gain insights into
when students have mastered a concept or skill, what excites
them, where they are getting stuck, and what works to support
learning. The ability to discern this for each student and for all
students is a key contributing factor in improving the quality of
education in the U.S.
2. REFRACTION
Refraction (http://play.centerforgamescience.org/refraction/site/)
is an online game based on fraction learning through splitting. It is
an open-access, interactive, and spatially challenging game that
allows researchers to discover students' fraction learning
pathways. In the game level used for this study, students are
required to create laser beams of 1/6 and 1/9 using a combination
of 1/2 and 1/3 splitters. Four 1/2 splitters, four 1/3 splitters, and
seven benders are provided to achieve this goal. One possible
solution is shown in Figure 1(b).
[Figure 1 (a) and (b)]
3. METHOD
3.1 Procedures
The game begins with a short series of introductory levels, and
then students play an in-game pretest level. Following gameplay,
which can include several levels, they play an in-game posttest
level. The pretest and posttest levels are identical. For this study,
we mined only the in-game pretest and in-game posttest levels.
We chose this problem because it requires students to move
beyond repeated halving (such as creating 1/4 or 1/8), and it
requires students to use a combination of 1/2 and 1/3 splitters.
Players reach the pretest after introductory levels that teach them
the mechanics of the game, in order to avoid any pre-post
differences simply being attributable to game unfamiliarity.
3.2 Variables
Every time a splitter is placed on the laser beam in the Refraction
environment, a new board state is logged. We used this data to
examine the process of learning by splitting using hierarchical
cluster analysis and included the following variables:
a. Initial 1/2: the percent of board states in a level that
have a 1/2 splitter as the initial splitter.
b. Initial 1/3: the percent of board states in a level that
have a 1/3 splitter as the initial splitter.
c. Backtrack: this is a binary variable indicating whether the
player returned to using 1/2 as the initial splitter after having
used 1/3 as the initial splitter.
d. Average distance from goal: Average distance from each
board state to the goal state.
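The four variables above can be derived from a board-state log roughly as follows (a sketch; the log format, function name, and sample data are our illustrative assumptions, not the study's actual pipeline):

```python
# Derive the four clustering variables from a per-level log of board states.
# Each state records which splitter started the beam and a distance-to-goal.
def clustering_variables(states):
    """states: list of (initial_splitter, distance), initial_splitter in {'1/2', '1/3'}."""
    n = len(states)
    initial_half = sum(s == "1/2" for s, _ in states) / n * 100   # % of 1/2-initial states
    initial_third = sum(s == "1/3" for s, _ in states) / n * 100  # % of 1/3-initial states
    # backtrack: player returned to a 1/2 initial splitter after having used 1/3
    seen_third, backtrack = False, 0
    for s, _ in states:
        if s == "1/3":
            seen_third = True
        elif seen_third and s == "1/2":
            backtrack = 1
    avg_distance = sum(d for _, d in states) / n                  # mean distance to goal
    return initial_half, initial_third, backtrack, avg_distance

log = [("1/2", 4), ("1/3", 3), ("1/2", 4), ("1/3", 1)]
print(clustering_variables(log))  # (50.0, 50.0, 1, 3.0)
```

These per-player vectors would then feed a hierarchical clustering step, as in the study.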
4. RESULTS
Clustering results show that there are four distinct ways that
students solved the pre-post level. The four clusters can be
described as follows:
a. Halving strategy: Students using this strategy are primarily
exploring the 1/2 space of the game. They display a high
percentage of board states that start with a 1/2 splitter. They
also have high average distance from the goal and a low
percentage of board states that start with a 1/3 splitter. When
they do use the 1/3 splitter, they often backtrack to using the
1/2 splitter.
b. Thirds strategy: Students using this strategy spend the
majority of their time in the 1/3 space of the game. They
have a high percentage of 1/3 initial board states. They rarely
backtrack. Their average distance from the goal is low.
c. Exploring Thirds Strategy: While students using this strategy
still experiment with initial 1/2 board states, they have a
higher percentage of 1/3 initial board states. They do not
backtrack often, but still have a high average distance from
the goal.
d. General Exploring Strategy: Students using this strategy are
exploring the mathematical space of the game more broadly.
They have a high percentage of board states using both the 1/2
and 1/3 splitters, and they backtrack often. They have
medium average distance from the goal.
We conducted one-way ANOVAs on the classifications of
students' game play strategy (prelevel and postlevel) for both the pre-
and posttest. We found a significant main effect of
prelevel strategy type on transfer pretest score, F(3, 2494) = 7.79,
MSE = 23.10, p < .001. Post hoc tests showed that this effect was
primarily accounted for by the Thirds group's significantly greater
performance than the Halving and Exploring Thirds groups (p <
.05). The General Exploring group did not perform significantly
differently from any other group.
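For readers unfamiliar with the statistic, the one-way ANOVA F ratio reported above is the between-group mean square over the within-group mean square; a minimal stdlib sketch on made-up scores (not the study's data):

```python
# One-way ANOVA F statistic: F = MS_between / MS_within, computed from scratch.
def one_way_anova_f(groups):
    all_scores = [x for g in groups for x in g]
    n, k = len(all_scores), len(groups)
    grand = sum(all_scores) / n
    # between-group sum of squares: group sizes times squared mean deviations
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-group sum of squares: squared deviations from each group's mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical pretest scores for three strategy groups.
halving, thirds, exploring = [2, 3, 2, 3], [6, 7, 6, 7], [4, 5, 4, 5]
f = one_way_anova_f([halving, thirds, exploring])
```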
5. DISCUSSION
To interpret the clusters, post hoc comparisons of the means of all
four clustering variables were performed. Duncan's Multiple
Range Test was used to compare the means of these four variables
across four clusters (Hair, Anderson, Tatham, & Grablosky,
1979). In this test, pairwise comparisons are done across clusters
and significant differences are identified at the pre-defined
significance level; in this case, p < 0.1. Furthermore, the test sorts
the clusters into groups wherein the means of the clusters within a
group are not significantly different from each other, but differ at
a statistically significant level from clusters in other groups. For
example, for the variable backtrack, the test sorted our four
clusters into three distinct groups, as seen by the designation of L,
M, and H in Table 1. In this case, the mean of Cluster 1 is
significantly higher than the means for Clusters 2 and 3, but
significantly lower than the mean for Cluster 4.
6. REFERENCES
[1] Greenberg, B. S., Sherry, J., Lachlan, K., Lucas, K., &
Holmstrom, A. 2008. Orientations to video games among gender
and age groups. Simulation & Gaming, 41(2), 238-259.
[2] Lorr, M. (1983). Cluster analysis for social scientists. Jossey-Bass, San Francisco, CA.
[3] Rodrigo, M., Baker, R., D'Mello, S., Gonzalez, M., Lagud, M.,
Lim, S., & Viehland, N. (2008). Comparing learners' affect
while using an intelligent tutoring system and a simulation
problem solving game. Intelligent Tutoring Systems. 40-49.
[4] Ulrich, D. and B. McKelvey (1990), "General Organizational
Classification: An Empirical Test Using the United States and
Japanese Electronic Industry," Organization Science, 1, 99-118.
Jonathan Cobian
Matthew Hunter
MIT
233 Massachusetts Avenue
Cambridge, MA 02139
574-370-7597
sdmello@nd.edu
jcobian@nd.edu
mjhunter@mit.edu
ABSTRACT
We present a fully-automated person-independent approach to
track mind wandering by monitoring eye gaze during reading. We
tracked eye gaze of 84 students who engaged in an approximately
30-minute self-paced reading task on research methods. Mind
wandering reports were collected by auditorily probing students
in-between and after reading certain pages. Supervised classifiers
trained on global and local features extracted from students' gaze
fixations 3, 5, 10, and 15 seconds before each probe were used to
predict mind wandering with a leave-several-subjects-out cross
validation procedure. The most accurate model tracked both
global and local eye gaze in a 5-second window before a probe
and yielded a kappa (accuracy after correcting for chance) of 0.23
on a downsampled corpus containing 50% yes and 50% no
responses to probes. Implications of our findings for adaptive
interventions that restore attention when mind wandering is
detected are discussed.
Keywords
Mind wandering, eye gaze, affective computing, affect detection
1. INTRODUCTION
Mind wandering (or "zoning out") is a phenomenon in which
attention drifts away from the primary task to task-unrelated
thoughts [1]. It is critically important to learning because active
comprehension of information involves extracting meaning from
external sources of information (e.g., text, audio, image) and
aligning this information with existing mental models that are
ultimately consolidated into long-term memory structures. Mind
wandering signals a breakdown in this coupling of external
information and internal representations. Hence, it is no surprise
that mind wandering has disastrous effects on learning and
comprehension because it negatively impacts a learners ability to
attend to external events, to encode information into memory, and
to comprehend learning materials [2, 3]. Therefore, there is a
crucial need for interventions to track and restore attention when
mind wandering is detected.
A system that responds to mind wandering must first detect when
minds wander. In line with this, Drummond and Litman [4]
attempted to identify episodes of "zoning out" while students
were engaged in a spoken dialog with an intelligent tutoring
system (ITS). Students were periodically interrupted to complete a
short survey to indicate the extent to which they were focusing on
the task (low zoning out) or on other thoughts (high zoning out).
J48 decision trees trained on acoustic-prosodic features extracted
from the students' utterances yielded 64% accuracy in
discriminating high vs. low zone-outs. This study was pioneering
in that it represents the first attempt to automatically detect zone-
2. METHOD
2.1 Labeled Data Collection
A Tobii T60 eye tracker was used to record gaze patterns of 84
students while they read four texts on research methods (e.g.,
random assignment, experimental bias) for approximately 30
minutes. Students read the texts on a page-by-page basis (roughly
144 words per page) and used the space bar to navigate forward.
Mind wandering was measured via auditory probes, which is the
standard and validated method for collecting online mind
wandering reports [1]. When a student's gaze fixated on
previously determined probe words, which were pseudo-randomly
inserted in the texts, the system played an auditory cue
(i.e., a beep) to prompt the student to indicate whether or not he or
she was mind wandering by pressing keys marked "Yes" and
"No". These probes are referred to as in-between page probes. In
addition, end of page probes were triggered when students
pressed the space bar to advance to the next page. There were
approximately 10 probes per text and reports of mind wandering
were obtained for 35% of the probes, which is comparable to rates
obtained in previous studies on reading [2].
3. RESULTS
The best results were obtained from the downsampled corpus
without outlier removal. Mean kappas and standard deviations
(across 25 iterations and shown in parentheses) for the best
performing models for each feature type are shown in Table 1.
Table 1. Results for best performing models (standard deviations across 25 iterations in parentheses)

Features        Kappa      RR           Win      Classifier
Global (G)      .14 (.10)  56.2 (5.55)  37 4     Multiboost Adaboost
Local (L)       .08 (.06)  53.7 (4.21)  15 40 2  Naïve Bayes Updatable
Global + Local  .23 (.08)  60.0 (6.35)  37 4     Locally-weighted learning
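The kappa metric used here ("accuracy after correcting for chance") can be computed as follows; the function and the illustrative yes/no probe responses below are ours, not the study's data:

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement),
# where chance agreement comes from the two raters' marginal label frequencies.
def cohens_kappa(y_true, y_pred):
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    p_chance = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (p_obs - p_chance) / (1 - p_chance)

# Illustrative self-reports vs. detector predictions on eight probes.
truth = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
pred  = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]
kappa = cohens_kappa(truth, pred)
```

On a corpus downsampled to 50% yes / 50% no, chance agreement is 0.5, so kappa is simply (accuracy − 0.5) / 0.5.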
4. GENERAL DISCUSSION
Mind wandering is a ubiquitous phenomenon that has disastrous
consequences for learning because it is a quintessential signal of
waning attention. We present a proof of concept of the possibility
of automatically tracking mind wandering during reading. Although
we had some success in developing person-independent models to
detect mind wandering, the accuracy of our models was moderate.
We are currently in the process of refining our models by both
increasing the size of the training data while simultaneously
considering a larger feature space and more sophisticated
classifiers. When coupled with the falling cost of eye trackers and
the potential use of web-cams for low-cost gaze tracking, we
expect that these improvements will yield sufficiently robust and
scalable detectors of mind wandering. In turn, these detectors can
be used to trigger interventions to restore engagement by
reorienting attention to the task at hand.
5. ACKNOWLEDGMENTS
We gratefully acknowledge our collaborators at the University of
Memphis. This research was supported by the National Science
Foundation (NSF) (ITR 0325428, HCC 0834847, DRL 1235958).
Any opinions, findings and conclusions, or recommendations
expressed in this paper are those of the authors and do not
necessarily reflect the views of NSF.
6. REFERENCES
[1] J. Smallwood, et al., When attention matters: The curious
incident of the wandering mind, Memory & Cognition, vol.
36, no. 6, 2008, pp. 1144-1150; DOI 10.3758/mc.36.6.1144.
[2] J. Smallwood, et al., Counting the cost of an absent mind:
Mind wandering as an underrecognized influence on
educational performance, Psychonomic Bulletin & Review,
vol. 14, no. 2, 2007, pp. 230-236.
[3] S. Feng, et al., Mind wandering while reading easy and
difficult texts, Psychonomic Bulletin & Review, in press.
[4] J. Drummond and D. Litman, In the Zone: Towards
Detecting Student Zoning Out Using Supervised Machine
Learning, Intelligent Tutoring Systems, Part Ii, Lecture
Notes in Computer Science 6095, V. Aleven, et al., eds.,
Springer-Verlag, 2010, pp. 306-308.
[5] K. Rayner, Eye movements in reading and information
processing: 20 years of research, Psychological Bulletin,
vol. 124, no. 3, 1998, pp. 372-422.
[6] E.D. Reichle, et al., Eye movements during mindless
reading, Psychological Science, vol. 21, no. 9, 2010, pp.
1300.
[7] D. Smilek, et al., Out of Mind, Out of Sight: Eye Blinking
as Indicator and Embodiment of Mind Wandering,
Psychological Science, vol. 21, no. 6, 2010, pp. 786-789;
DOI 10.1177/0956797610368063.
[8] M. Hall, et al., The WEKA data mining software: an
update, ACM SIGKDD Explorations Newsletter, vol. 11,
no. 1, 2009, pp. 10-18.
Vasile Rus
Dan Stefanescu
nbnraula@memphis.edu
vrus@memphis.edu
dstfnscu@memphis.edu
ABSTRACT
Anaphora resolution is a central topic in dialogue and discourse processing that deals with finding the referents of
pronouns. There are no studies, to the best of our knowledge, that focus on anaphora resolution in the context of
tutorial dialogues. In this paper, we present the first version
of DARE (Deep Anaphora Resolution Engine), an anaphora
resolution engine for dialogue-based Intelligent Tutoring Systems. The development of DARE was guided by dialogues
obtained from two dialogue-based computer tutors: DeepTutor and AutoTutor.
Keywords
Anaphora Resolution, Tutoring System, Dialogue Systems
1. INTRODUCTION
www.deeptutor.org
www.autotutor.org
2. THE METHODOLOGY
As a starting point for developing DARE, we analyzed pronoun use in 24,945 student responses from DeepTutor log
files and 1,978 student responses from AutoTutor logs. The
results are shown in Figure 1. Both the students and the computer tutor use first-person pronouns (i.e., "i", "me", "we",
"us") during the interaction. It should be noted that these
pronouns do not need to be resolved for assessment purposes
(for space reasons we do not elaborate). Of the remaining pronouns, the top two most frequent pronouns, one of
which is "it", account for more than 80% of the anaphors.
Thus, considering a very few, very frequent anaphors may
be a good start for developing DARE. Moreover, it was observed (see Section 3) that most pronouns used in student
responses can be resolved within the same response or by just
looking at the previous system turn, i.e., the previous hint
from the system. Although these aspects of anaphors in
dialogue-based ITSs simplify the problem of pronoun resolution, the task is still challenging because "it" is often
pleonastic, i.e., not always an anaphoric pronoun [1].
Another important aspect of anaphors in tutorial dialogues
is the location of pronouns in students' responses. In our
data, we observed that most pronouns at the beginning of a
student response refer to an antecedent present in the most
recent system response, i.e., the previous dialogue turn. On
the other hand, pronouns that occur in the middle or last
part of a student response most likely refer to an entity in the
current/same student response. Thus, we added to DARE a
classifier that relies on the position of the anaphor to identify the text where the antecedent should be searched for.
Given this, the input to DARE's resolution engine is the
concatenation of the previous tutor turn and the current student response if the pronoun to be resolved occurs at the beginning
of the student response, and the current student response only
if the pronoun occurs in the middle or near the end of the
student response. DARE then uses the coreference resolution module in the Stanford CoreNLP package to
perform anaphora resolution. It should be noted that sometimes a pronoun in the student response may refer to an entity
in the problem description, e.g., the current Physics problem the student is working on. It is also possible that
a pronoun refers to something mentioned much earlier in the
dialogue than the previous system turn. The current version of DARE does not handle these latter cases. Another
case we do not directly handle in DARE currently is the use
of elliptical anaphors; see the first student response in (c) in
Table 1 where, instead of saying "it is increasing", the student
simply says "increasing" (a typical example of ellipsis).
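The position-based context selection described above might be sketched as follows (the function name and the token-position threshold are our illustrative assumptions; the actual classifier's criteria are not specified here):

```python
# Choose the text span in which to search for a pronoun's antecedent:
# pronouns near the start of the student response get the previous tutor
# turn prepended; later pronouns are resolved within the response alone.
def resolution_context(tutor_turn, student_response, pronoun_index, start_window=2):
    """pronoun_index: 0-based token position of the pronoun in the student response."""
    if pronoun_index < start_window:
        # pronoun at the beginning: antecedent likely in the previous tutor turn
        return tutor_turn + " " + student_response
    # pronoun in the middle or near the end: antecedent likely in the same response
    return student_response

ctx = resolution_context("The net force acts on the ball.", "it is increasing", 0)
```

The selected context would then be passed to the coreference module for resolution.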
3.
Figure 1: Pronouns used by (left) DeepTutor students in 24,945 dialogue turns and (right) AutoTutor students in 1,978 dialogue turns.

Antecedents in   Count (%)
H0               85 (75.89%)
A                22 (19.64%)
H1               5 (4.46%)
H2               0 (0.00%)
4.
5. ACKNOWLEDGMENTS
This research was supported in part by the Institute of Education Sciences under award R305A100875.
6. REFERENCES
G. Tanner Jackson
Erica L. Snow
Laura.Varner@asu.edu
Tanner.Jackson@asu.edu
Erica.L.Snow@asu.edu
Danielle S. McNamara
Tempe, AZ, USA
Arizona State University
Danielle.McNamara@asu.edu
ABSTRACT
The current study identifies relations among students' natural
language input, individual differences in reading commitment,
and learning gains in an intelligent tutoring system. Students (n =
84) interacted with iSTART across eight training sessions.
Linguistic features of students' generated self-explanations (SEs)
were analyzed using Coh-Metrix. Results indicated that linguistic
properties of students' training SEs were predictive of learning
gains, and that the strength and nature of these relations differed
for students of low and high commitment to reading.
Keywords
Natural Language Processing, Learning, Intelligent Tutoring
Systems, Reading Commitment
1. INTRODUCTION
Educational learning environments provide students with
instruction intended to enhance particular knowledge, skills, and
strategies in various domains. For example, iSTART is an
intelligent tutoring system (ITS) that teaches students to use
self-explanation (SE) reading strategies to comprehend challenging
science texts. In this system, strategies are introduced and
demonstrated to students. Then, students are offered the
opportunity to practice applying the strategies they have learned
to new texts. A natural language processing (NLP) algorithm
assesses the quality of students' generated SEs and assigns scores
(ranging from 0-3) and feedback to students during training [1].
Empirical studies have found that iSTART improves students'
comprehension and strategy use over control groups at multiple
education levels [2-3]. More recent work has investigated the
impact of individual differences on students' learning gains in the
system [4-5]. Jackson, Varner, Boonthum-Denecke, and
McNamara (under review), for instance, found that iSTART
helped students with low reading commitment to significantly
improve their SE performance and ultimately match (or exceed)
the performance of high commitment readers.
2.2 ANALYSES
We investigated how linguistic properties of students' generated
SEs predict relative learning gains. A stepwise regression analysis
using each student's average Coh-Metrix scores as predictors of
[Table: stepwise regression results; extracted values: R² .12*, .35, -.45, .47, -.45, .53, -.32, .56**, .19*, .10*, .09*, .11*, .07*]
3. DISCUSSION
This study investigated relations among students' prior
commitment to reading, linguistic properties of their generated
SEs, and relative learning gains in the iSTART system. Results
indicated that the relations between the linguistic features of
students' SEs and relative learning gains varied when accounting
for students' reading commitment. In particular, one cohesion
variable accounted for 12% of the variance in low reading
commitment students' relative learning gains, whereas five
predictors combined to account for over 50% of the variance in
the relative learning gains of highly committed students.
4. ACKNOWLEDGMENTS
This research was supported in part by: IES R305G020018-02,
IES R305G040046, IES R305A080589, and NSF REC0241144,
NSF IIS-0735682. Opinions, conclusions, or recommendations do
not necessarily reflect the views of the IES or NSF.
5. REFERENCES
[1] McNamara, D., Boonthum, C., Levinstein, I., Millis, K.:
Evaluating Self-explanations in iSTART: Comparing Word-based and LSA Algorithms. In T. Landauer, D. McNamara,
S. Dennis, W. Kintsch (eds.) Handbook of Latent Semantic
Analysis, pp. 227-241. Mahwah Erlbaum (2007)
[2] Magliano, J., Todar, S., Millis, K., Wiemer-Hastings, K.,
Kim, H., McNamara, D.: Changes in Reading Strategies as a
Function of Reading Training: A Comparison of Live and
Computerized Training. Journal of Educational Computing
Research 32, 185-208 (2005)
[3] O'Reilly, T., Best, R., McNamara, D.: Self-explanation
Reading Training: Effects for Low-knowledge Readers. In:
Proceedings of the 26th Annual Conference of the Cognitive
Science Society pp. 1053-1058. Erlbaum Portland, OR
(2004)
[4] Jackson, G.T., Boonthum, C., McNamara, D.S.: The Efficacy
of iSTART Extended Practice: Low Ability Students Catch
Up. In: The Proceedings of Intelligent Tutoring Systems pp.
349-351. Springer Berlin/Heidelberg, Pittsburgh, PA (2010)
[5] Jackson, G. T., Varner, L. K., Boonthum-Denecke, C.,
McNamara, D. S.: The Impact of Individual Differences on
Learning with an Educational Game and a Traditional ITS.
Manuscript submitted to the International Journal of
Learning Technology (under review)
[6] Jackson, G., McNamara, D.: Motivation and Performance in
a Game- based Intelligent Tutoring System. Journal of
Educational Psychology, (in press)
[7] Varner, L. K., Jackson, G. T., Snow, E. L., McNamara, D. S.:
Does size matter? Investigating user input at a larger
bandwidth. In: Proceedings of the 26th Annual Florida
Artificial Intelligence Research Society Conference, AAAI,
St. Petersburg, FL (in press)
[8] Graesser, A.C., McNamara, D.S., Louwerse, M., Cai, Z.:
Coh-Metrix: Analysis of Text on Cohesion and Language.
Behavior Research Methods 36, 193-202 (2004)
[9] Jackson, G.T., Guess, R.H., McNamara, D.S.: Assessing
Cognitively Complex Strategy Use in an Untrained Domain.
Topics in Cognitive Science 2, 127-137 (2010)
Danielle S. McNamara
ABSTRACT
Intelligent tutoring systems yield data with many properties that
make them potentially ideal for examination with multi-level
models (MLM). Repeated observations within students introduce
dependencies that MLM can model directly, unlike standard
regression or ANOVA. This paper examines the applicability of
MLM to data from the intelligent tutoring system Writing Pal
using intraclass correlations. Further analyses assess the impact
of individual differences on daily essay scores, along with the
differential impact of daily vs. mean attitudinal ratings.
Keywords
Multi-Level models, Writing, Intelligent Tutoring Systems
1. INTRODUCTION
With the advent of intelligent tutoring systems (ITSs), the amount
and complexity of data available to researchers have increased
exponentially. ITSs provide the opportunity for repeated
administration of assessments and, in some cases, ease of scoring
that data. Though most tutoring systems provide multiple
assessments of student progress (e.g., multiple text responses or
worked problems), many researchers assess performance using
pretest-posttest differences or repeated measures analyses,
potentially missing out on the rich data collected between these
two end points.
When a student produces multiple responses, dependency arises
in the data, violating central assumptions underlying both
regression and ANOVA. Dependency, measured using intraclass
correlations (ICCs), is a pervasive problem in educational data,
ranging from less problematic (groups of students within schools)
to highly problematic (observations within individuals) [1]. Even
when only 5% of the variation in a data set is due to nested
structure (i.e., dependency), it is advisable to assess differences
at the highest cluster level.
The Writing Pal (W-Pal, [2]) is an ITS that provides writing
strategy instruction to high school and entering college students.
This system teaches writing strategies that encompass the entire
writing process from prewriting through revision. Students have
the opportunity to watch lesson videos, practice individual
strategies within educational mini-games, and write and receive
feedback on timed, prompt-based (SAT-style) essays.
In addition to providing instruction, W-Pal affords students the
opportunity to practice writing and receive feedback on their
essays. Students write prompt-based persuasive essays within an
essay writing module. Essays are scored using an algorithm
trained on a large corpus of SAT-style essays [3]. In this paper,
we examine the applicability of MLM to these repeated essay and
survey data.
2. METHODS
Sixty-five high school students from a large urban southwestern
city participated for payment in a lab-based study to assess the
effectiveness of W-Pal. All participants were recruited from the
community. The study compared two versions of the W-Pal
system: the full W-Pal system, and a version including only Essay
Practice. In the W-Pal condition, students had access to the entire
W-Pal system, whereas those in the Essay Practice condition only
interacted with the essay practice function. These conditions were
designed to control for time-on-task.
This study consisted of 10 sessions along with a home survey,
which participants completed prior to attending their sessions.
The home survey included basic demographics and measures of
writing habits. The first session was a pretest session during
which participants completed a pretest essay and prior knowledge
assessments.
Participants in all conditions began sessions 2-9 by filling out a
survey about their previous session and current mood, and then
completed an SAT-style practice essay. Based on their randomly
assigned condition, some students interacted with all of W-Pal
(n = 33), while others interacted with only the Essay Practice
module in W-Pal (n = 32). Participants were given a maximum of
25 minutes to complete their essay. They then received feedback
and were given an additional 10 minutes to revise their essays.
Students in the W-Pal condition then completed an assigned
lesson and game-based practice. Students in the Essay condition
completed and revised a second SAT-style essay.
During the final session, students completed a posttest, which was
the same for all participants regardless of condition. For the
current paper, only the essay scores, pretest, and attitudinal
measures will be considered.
2.1 Measures
2.1.1 Essays
Depending on condition, participants wrote either 8 or 16 practice
essays with feedback, and a pretest and posttest essay without
feedback. The essay prompts were adapted from SAT writing
assessments and scored on a 1-6 scale using the W-Pal algorithm
validated by Crossley and colleagues [3]. This algorithm displays
sufficient accuracy (exact agreement of 55% and adjacent
3. RESULTS
3.1 Applicability of Multilevel Models
A series of unconditional models was estimated for all level-1
variables. The variance estimates from these analyses were used
to compute intraclass correlations (ICCs). The ICC for daily essay
scores was .47, suggesting that 47% of the variance in essay
score can be attributed to the individual. For daily survey items,
these values ranged from .37 to .98, suggesting that a substantial
portion of the variance for all of the daily survey items can be
attributed to the individual.
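As an illustration of how such ICCs relate to variance components, the following is a generic one-way random-effects ICC(1) estimator (our own sketch for a balanced design; the paper's ICCs come from unconditional multilevel models fit in MLM software, not from this helper):

```python
import numpy as np

def icc1(clusters):
    """ICC(1) from a one-way random-effects ANOVA: the share of total
    variance attributable to the cluster (here, the student).
    `clusters` is a list of equal-length arrays of observations,
    one array per student (balanced design assumed for simplicity)."""
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    n = len(clusters)                 # number of students
    k = len(clusters[0])              # observations per student
    grand = np.mean(np.concatenate(clusters))
    means = np.array([c.mean() for c in clusters])
    # Between-student and within-student mean squares
    msb = k * np.sum((means - grand) ** 2) / (n - 1)
    msw = sum(((c - c.mean()) ** 2).sum() for c in clusters) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

An ICC near the reported .47 indicates that roughly half of the essay-score variance lies between students, arguing for modeling at the student level.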
.020; γ3 = -.261, p = .049; the contextual effect for mood (γ4) was
marginally significant, γ4 = .373, p = .061. The signs and
magnitudes of the level-2 regressions (daily survey means
predicting daily essay mean) were stronger than those of the
level-1 predictors; however, the effects of sustained levels of
certain feelings about the system (e.g., frustration) seemed to be
more complex, warranting further investigation.
4. DISCUSSION
The data examined in this study exhibit high levels of
dependency, rendering them ideal for multi-level modeling. The
ICC values for the repeated assessments in W-Pal range from .37
to .98, exceeding appropriate values for using regression and
analysis of variance. By using a means-as-outcomes model, we
were able to account for 63% of the variance due to the cluster
(student). The results suggest an advantage for those interacting
with the complete W-Pal system; additionally, individual
differences were important predictors of average daily essay score.
The analysis using the contextual effects model showed that, for
these data, daily and mean values of the attitudinal survey items
had differential effects on essay scores. For instance, while daily
enjoyment had a negative relationship with daily essay score, the
participant's average level of enjoyment had a positive
relationship with average essay score.
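A contextual-effects model of this kind separates each daily rating into a within-student deviation (level 1) and the student's overall mean (level 2). A minimal sketch of that decomposition follows (function and variable names are ours; the authors' actual models were fit in MLM software):

```python
import numpy as np

def contextual_predictors(ratings, student_ids):
    """Group-mean-center a daily (level-1) rating: return the day's
    deviation from the student's mean (the level-1 predictor) and the
    student's mean itself (the level-2 predictor). With both in the
    model, the difference of their coefficients is the contextual effect."""
    ratings = np.asarray(ratings, dtype=float)
    ids = np.asarray(student_ids)
    # Each observation is paired with its student's mean rating
    means = np.array([ratings[ids == s].mean() for s in ids])
    return ratings - means, means

dev, means = contextual_predictors([1, 2, 3, 5, 5, 5],
                                   ["a", "a", "a", "b", "b", "b"])
```

This decomposition is what allows daily enjoyment and average enjoyment to carry opposite signs in the same model.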
Further work will combine these models and investigate the utility
of random slopes for the level-1 variables. Interactions will also
be investigated further. Overall, the data from W-Pal are well
suited to assessment with MLM.
5. ACKNOWLEDGMENTS
The research reported here was supported by the Institute of
Education Sciences, U.S. Department of Education, through Grant
R305A080589 to Arizona State University.
6. REFERENCES
[1] Hedges, L. V., & Hedberg, E. C. (2007). Intraclass
correlation values for planning group-randomized trials in
education. Educational Evaluation and Policy Analysis, 29(1),
60-87.
[2] Roscoe, R. D., & McNamara, D. S. (in press). Writing Pal:
Intelligent tutoring of writing strategies in the high school
classroom. Journal of Educational Psychology.
[3] Crossley, S. A., Roscoe, R., & McNamara, D. S. (in press).
Predicting human scores of essay quality using
computational indices of linguistic and textual features.
Proceedings of the 15th International Conference on Artificial
Intelligence in Education. Auckland, New Zealand: AIED.
[4] Weston, J. L., Roscoe, R., Floyd, R. G., & McNamara, D. S.
(2013, May). The WASSI (Writing Attitudes and Strategies
Self-Report Inventory): Reliability and validity of a new
self-report writing inventory. Poster presented at the 2013
Annual Meeting of the American Educational Research
Association, San Francisco, CA.
[5] Posada, D., & Buckley, T. R. (2004). Model selection and
model averaging in phylogenetics: Advantages of Akaike
information criterion and Bayesian approaches over
likelihood ratio tests. Systematic Biology, 53(5), 793-808.
Oral Presentations
(Young Researchers Track)
Tiffany Barnes
mjeagle@ncsu.edu
tiffany.barnes@ncsu.edu
ABSTRACT
This work explores the effects of using automatically generated
hints in problem-solving tutor environments. Generating hints
automatically removes a large amount of development time for new
tutors, and it is also useful for existing computer-aided
instruction systems that lack intelligent feedback. We focus on a
series of problems after which, in previous analysis, the control
group was shown to be 3.5 times more likely to stop logging into
an online tutor than the group who were given hints. We found a
consistent trend in which students without hints spent more time
on problems than students who were provided hints.
1. INTRODUCTION
Problem solving is an important skill across many fields, including
science, technology, engineering, and math (STEM). Working
open-ended problems may encourage learning at higher levels of
cognitive domains [1]. Intelligent tutors have been shown to be as
effective as human tutors in supporting learning in many domains,
in part because of their individualized, immediate feedback,
enabled by expert systems that diagnose students' knowledge
states [9]. However, it can be difficult to build intelligent support
for students in open problem-solving environments. Intelligent
tutors require content experts and pedagogical experts to work
with tutor developers to identify the skills students are applying
and the associated feedback to deliver [6].
Barnes and Stamper built an approach called the Hint Factory
that uses student data to build a graph of student problem-solving
approaches, which serves as a domain model for automatic hint
generation [7]. The Hint Factory has been applied across domains
[5]. Stamper et al. found that the odds of a student in the control
group dropping out of the tutor were 3.5 times higher than for the
group provided with automatically generated hints [8]. The hints
also affected problem completion rates, with the number of
problems completed in L1 being significantly higher for the Hint
group.
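The mechanism described in [7] can be caricatured in a few lines: treat observed student states as a graph, propagate value back from goal states, and hint the highest-value successor. The toy sketch below (invented state names, simplified step cost and reward) conveys only the idea, not the published implementation:

```python
def hint_policy(successors, goals, gamma=0.9, sweeps=100):
    """Toy Hint Factory-style hint generation: value iteration over a
    graph of observed problem states. `successors` maps each state to
    the states students reached from it; the hint for a state is the
    successor with the highest value (the most promising observed step)."""
    value = {s: 100.0 if s in goals else 0.0 for s in successors}
    for _ in range(sweeps):
        for s, nxt in successors.items():
            if s not in goals and nxt:
                # small step cost plus discounted best continuation
                value[s] = -1.0 + gamma * max(value[n] for n in nxt)
    return {s: max(nxt, key=value.get)
            for s, nxt in successors.items() if nxt and s not in goals}

# On a tiny graph, the hint from "start" follows the path that reaches the goal
policy = hint_policy({"start": ["a", "b"], "a": ["goal"], "b": [], "goal": []},
                     goals={"goal"})
```

Dead-end states (like "b" above) never accumulate value, so hints steer students away from them.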
2.
2.1 Data
in the Control group. Students for whom application log data did
not exist were dropped from the study, resulting in 68 and 37
students in the Hint group, and 28 and 70 students in the Control
group, for the first and second semesters respectively. This yields
a total of 105 students in the Hint group and 98 students in the
Control group. Students from the 6 sections of an introduction to
logic course were assigned 13 logic proofs in the Deep Thought
tutor. The problems are organized into three constructs: level one
(L1), consisting of the first six problems assigned; level two (L2),
consisting of the next five problems assigned; and level three
(L3), consisting of the last two problems assigned. We refer to the
group that received hints as the Hint group, and the group that
did not receive hints as the Control group.
3. RESULTS
five. The ratio is calculated from the difference between the hint
group mean and the control group mean on the log scale. As
lg(x) − lg(y) = lg(x/y), the confidence interval from the logged
data estimates the difference between the population means of the
log-transformed data. Therefore, the anti-logarithms of the
confidence interval bounds provide a confidence interval for the
ratio. We provide the C:H ratios and confidence intervals in
Table 4.
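The back-transformation just described can be sketched as follows (our own helper; a normal critical value stands in for the t quantile to keep the sketch dependency-free, so the intervals are large-sample approximations):

```python
from statistics import NormalDist, mean, stdev

def ratio_ci(log_x, log_y, alpha=0.05):
    """Approximate CI for the ratio of means of two log10-transformed
    samples. Because lg(x) - lg(y) = lg(x/y), a CI on the difference
    of log means back-transforms (via 10**bound) into a CI on the
    ratio. Uses a normal critical value as a large-sample stand-in
    for the t quantile."""
    diff = mean(log_x) - mean(log_y)
    se = (stdev(log_x) ** 2 / len(log_x)
          + stdev(log_y) ** 2 / len(log_y)) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 10 ** diff, (10 ** (diff - z * se), 10 ** (diff + z * se))

ratio, (lo, hi) = ratio_ci([1.0, 1.1, 0.9, 1.05, 0.95],
                           [1.2, 1.3, 1.1, 1.25, 1.15])
```

A log-mean difference of −0.2 back-transforms to a ratio of 10^−0.2 ≈ 0.63, i.e., the first group taking about 63% of the second group's time.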
Table 3: Ratio Between Groups (H:C) in the Original Scale

Prob   Ratio   95% CI low   95% CI high   p-value   t
1.1    0.69    0.50         0.97          0.03      -2.18
1.2    0.72    0.49         1.06          0.10      -1.68
1.3    0.78    0.56         1.10          0.15      -1.43
1.4    0.58    0.44         0.78          0.00      -3.61
1.5    0.62    0.42         0.93          0.02      -2.31
Exploring the total time spent across all five problems also
required a log transformation. The difference in total time spent
on the first 5 problems between the hint group (M = 3.34, SD =
0.4) and the control group (M = 3.44, SD = 0.51) was not
significant, t(198) = 1.41, p = 0.16. This corresponds to an H:C
ratio of 0.81 (95% CI: 0.60 to 1.09), and a C:H ratio of 1.24
(95% CI: 0.92 to 1.66).
4. DISCUSSION
The results of this analysis show that students in the control
group did not, overall, spend significantly more time in the tutor
during these first five problems. However, the control group did
spend significantly more time on some problems than the hint
group. Problems one, three, and four provided students with the
automatically generated hints, while problems two and five had no
hints for either group. We would expect there to be differences in
time to solve for the hint group, and this was the case for problem
one. We would also expect that having no hints on problem two
would not display an effect, as the second problem is too early for
differences to emerge between the groups. Problem three is
interesting, as it is the first problem in which the groups begin to
show preferences for different solution strategies, with the control
group preferring to work backward and the hint group preferring
to work forward (hints are only available for solutions working
forward). Problems four and five, both of which showed
significant differences in time spent, showed a large portion of
control group interactions to be pursuing buggy strategies.
This is interesting because the control group is spending at least
as much, and often more, time in the tutor and yet meeting with
less overall success. The control students are not becoming stuck
in a single bottleneck location within the problems and then
quitting, which would result in lower control group times. Rather,
they are actively trying to solve the problems using strategies that
do not work. The hint group is able to avoid these strategies via
the use of the hints. The hint group students also develop a
preference for solving problems forward, as that is the direction
in which they can ask for hints. It is interesting that these
preferences remain even when hints are not available.
The automatically generated hints appear to let the hint group
spend around 60% of the time per problem compared to the
control group; stated differently, the control group requires about
1.5 times as much time per problem as the hint group. These
results show that the hints provided by the Hint Factory, which
are generated automatically, can produce large differences in the
time students spend solving problems.
5.
This paper has provided evidence that automatically generated
hints can have drastic effects on the amount of time that students
spend solving problems in a tutor. We found a consistent trend in
which students without hints spent more time on problems than
students who were provided hints. Exploration of the interaction
networks for these problems revealed that the control group often
spent this extra time pursuing buggy strategies that did not lead
to solutions. Future work will explore other data available at the
interaction level, such as errors, in order to better understand
what the control group is doing with their extra time in the tutor.
We will also look into developing further interventions that can
help students avoid spending time on strategies that are unlikely
to yield solutions.
6. REFERENCES
ABSTRACT
In this paper, we describe results of a multimodal learning
analytics pilot study designed to understand the differences in
eye-tracking patterns found to exist between students with low
and high performance in three engineering-related computer
games, all of which require spatial ability, problem-solving skills,
and a capacity to interpret visual imagery. In the first game, gears
and chains had to be properly connected so that all gears depicted
on the screen would spin simultaneously. In the second game,
students needed to manipulate lines so as to ensure that no two
intersected. In the final game, students were asked to position
gears in specific screen locations in order to set on-screen objects
in motion. The literature establishes that such abilities are related
to math learning and math performance. In this regard, we believe
that understanding these differences in students' visual
processing, problem solving, and the attention they dedicate to
spatial stimuli will be helpful in making positive interventions in
STEM education for diverse populations.
Keywords
Eye tracking, simulations, games, multimodal learning analytics,
constructionism, spatial ability.
1. INTRODUCTION
The need to engage and motivate more students to learn science
and engineering has raised considerable awareness of
Constructionist [9] and project-based pedagogies in classrooms.
Understanding students' behaviors and cognitive evolution in
these open-ended environments is a challenge being tackled in
the nascent field of Multimodal Learning Analytics [4, 11]. In
particular, this study uses eye tracking to examine students'
capacity to interpret visual imagery in the context of engineering
problem solving.
Computer-based learning tools such as games and simulations
have become pervasive in learning environments. These
technologies can be used by learners to improve their cognitive
abilities and to acquire specific skills [6], including those
involving visuospatial attention and perception [1]. Video and
computer game habits have been shown to be related to the
improvement of visuospatial abilities, including mental rotation
2. EXPERIMENTAL DESIGN
Seven high school students were invited to play three online
games in two separate sessions six days apart. During the first
session, they played Wheels, a game that required them to
connect gears and chains until all gears were spinning (Figure 1),
as well as Lines, a game in which they were required to uncross
3. METHODOLOGICAL APPROACH
4. RESULTS
The sample was composed of 4 males and 3 females, all of whom
played the games on the same two days. The average level
reached during the first session was 5.71 (1.11) for Wheels and
3.71 (0.49) for Lines. The average level during the second session
was 8.8 (2.1) for Gears and 3.6 (0.84) for Lines. Since the
students had a limited time to play, they were given the option to
stop playing at any time for any reason. When they skipped a
game, they were led to the next game. Only one student stopped
the games prior to completion, and he did so in all of the games.
During the first session, he stopped playing Wheels at level 4
6. ACKNOWLEDGEMENTS
This material is based upon work supported by the National
Science Foundation under the CAREER Award #1055130 and the
Lemann Center at Stanford University.
7. REFERENCES
geraldine.gray@itb.ie
ABSTRACT
The increasing numbers enrolling in college courses, and
increased diversity in the classroom, pose a challenge for colleges
in enabling all students to achieve their potential. This paper
reports on a study using data mining techniques to model factors
that are predictive of college academic performance and can be
measured during first-year enrolment. Data was gathered over
three years, focusing on a diverse population of first-year students
from a range of academic disciplines (n ≈ 1100). Initial models
generated on two years of data (n = 713) demonstrate high
accuracy. Advice is sought on additional analysis approaches to
consider.
Keywords
Educational data mining, academic performance, personality, motivation, specific learning difficulties, self-regulation.
1. INTRODUCTION
In tertiary education, learning is typically measured by student
performance on a variety of assessments that are aggregated to
generate a single measure of academic performance. Factors
impacting academic performance have been the focus of research
for many years [9, 14]. It remains an active research topic [5, 12],
indicating the inherent difficulty in defining robust deterministic
models to predict academic performance [16]. Typically,
methodologies for quantitative research in this domain focus on
statistical analysis of performance metrics and their correlations
with, or dependencies on, a wide variety of factors, including
measures of aptitude, motivation, organisation skills, personality
traits, prior academic achievements, and demographic data
[6, 11, 18]. More recently, Educational Data Mining (EDM) has
emerged as an evolving and growing research discipline covering
the application of data mining techniques in educational settings
[1, 4, 10, 19]. There have been calls for greater use of data mining
by educational institutions to realise the potential of the large
amounts of data gathered
2. EXPECTED CONTRIBUTION
3. RESULTS SO FAR
3.1 Study criteria
Limited profiling of students in terms of specific learning
difficulties and some learning preferences was already underway
at ITB. This study extended that initiative, adding measures
relating to four additional factors: aptitude, personality,
motivation, and learning strategies. These were chosen firstly
because research highlights these factors as being directly or
indirectly related to academic performance [21], and secondly
because they can be measured early in semester one. An online
questionnaire was developed to profile students and give
immediate feedback (www.howilearn.ie). Data already available
to college administration on prior academic performance was also
used.⁴ A full list of the factors used is given in Table 1.
² The majority of students in the IoT sector will have attained
between 200 and 400 points in the Leaving Certificate exam, the
state exam at the end of secondary school. The majority of
students in the Irish university sector will have attained over 400
points, including some with the maximum score of 600 points
[13, Appendix A].
³ The NLN assessment team includes an educational
psychologist, assistant psychologist, and occupational therapist
(http://www.nln.ie/Learning-and-AssessmentServices.aspx).
⁴ Prior academic performance is based on state examinations
completed by all students at the end of secondary school in
Ireland.
3.2 Study participants
3.3 Initial results
Modelling was done on the 2010 and 2011 data, using five
dimensions, namely: prior academic performance, motivation,
learning orientation, personality, and age. A binary class label
was used based on end-of-year GPA, range [0-4]. The two classes
comprised poor academic achievers, who failed overall
(GPA < 2.0, n = 296), and strong academic achievers, who
achieved honours overall (GPA ≥ 2.5, n = 340). To focus on
patterns that distinguish poor and strong academic achievement,
students with a GPA between 2.0 and 2.5 were excluded from
initial models, giving a dataset of n = 636.
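The class construction described here amounts to a simple thresholding rule; a hypothetical helper (the cut-offs are those reported in the text, the function name is ours):

```python
def gpa_label(gpa):
    """Binary label used for the initial models: students failing
    overall (GPA < 2.0) vs. achieving honours (GPA >= 2.5);
    the band in between is excluded (returns None)."""
    if gpa < 2.0:
        return "poor"
    if gpa >= 2.5:
        return "strong"
    return None  # GPA in [2.0, 2.5): excluded from the initial models

labels = [gpa_label(g) for g in (1.4, 2.2, 3.1)]
```

Excluding the middle band sharpens the contrast the classifiers must learn, at the cost of a model that says nothing about borderline students.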
Six algorithms were used: Support Vector Machine (SVM),
Neural Network, k-Nearest Neighbour, Naïve Bayes, Decision
Tree, and Logistic Regression, using RapidMiner V5.2
(rapid-i.com). When modelling all students, model performance
was comparable across the six learners, with Naïve
4. OUTSTANDING QUESTIONS
5. REFERENCES
ABSTRACT
BOTS is a social multiplayer online game designed to teach
students introductory computer science concepts such as loops
and functions. Using this game, I plan to explore the use of
user-generated content (UGC) in game-based tutors, increasing
replayability and, ideally, player engagement. BOTS has so far
been used for work toward identifying what makes a level or
puzzle in the game good and how that quality can be identified in
new submissions, as well as for investigating several mechanisms
for moderating submitted content. The use of UGC has the
potential to revolutionize how game-based tutors are created,
drastically reducing the burden of content creation on developers
and educators.
Keywords
Game-Based Tutors, Moderation, Player Engagement,
Self-Evaluation, Serious Games, Social Games, User-Generated
Content.
1. INTRODUCTION
1.1 Background
Computer-assisted learning, and game-based learning in
particular, has been shown to be nearly as effective as one-on-one
human tutoring [8, 10]; however, developers and educators are
required to invest a great deal of time and expert knowledge
[2, 14]. Murray estimated that it takes approximately 300 hours
to create a single hour of educational content. If concerns for
game design, user immersion, and content creation are
considered, this time cost only increases. Additionally, problems
created by educators or developers are often presented in a
sequence, and once the in-game content is exhausted, the
experience is generally over. Replayability is a major component
of successful games [20], and games constructed in this way
simply cannot be replayable experiences.
According to Scott Nicholson, allowing users to create game
content, such as new levels and puzzles, "extends the life of a
game and allows the designers to see how creative users can be
with the toolkits provided." Many principles from the use of
User-Generated Content (UGC) can be used to improve Serious
Games by allowing players to set their own goals [16].
Additionally, design patterns for educational games identified by
a team at
will be able to spend more time developing the core of the game,
ensuring that the game mechanics are both fun for players and
aligned with learning objectives. With these advances, we will be
able to address many of the NSF's goals for Cyberlearning [17],
and BOTS and systems like it may come to play a more important
role in early STEM education.
2. METHODS
In [2], the authors used a machine learning approach based on
the tagging habits of users to identify low-quality Wikipedia
articles. I hope to use similar data-driven methods to analyze
user-created levels. Lacking a large installed user base to tag
submitted levels, I work with player solutions as they are
submitted, the first being the author's own solution to the
submitted level.
While the quality of a game level is subjective, I developed a set
of criteria for a level in our game to be "useful", inspired by the
use of design patterns in level-design analysis [11]. To identify
common design patterns, I examined levels created over the
course of several game sessions with three groups: 24 levels were
created by the first group, 13 by the second, and 11 by the third.
Once I identified these patterns, I detailed how they could impact
gameplay, why a level creator could be motivated to create them,
and how developers could affect that motivation through game
mechanics or incentives.
3. FUTURE WORK
In addition to replicating the above experiment with a larger
group of students, I have begun investigating how to further use
student data to moderate and evaluate submitted levels. Being
able to assess UGC in this way allows us to provide meaningful
problem orderings even for levels I have not analyzed in depth,
and provides a metric that can be used to reward players for
creating specific types of levels, or levels that fill gaps in content
or difficulty. In the future, I will experiment with different
methods of directed level creation using this information, to see
if level creation can be better integrated into the system as a
learning activity in and of itself.
4. ACKNOWLEDGEMENTS
Thanks to my advisor, Dr. Tiffany Barnes, and to the developers
who have worked on BOTS so far, including Veronica Catete,
Monique Jones, Andrew Messick, Thomas Hege, Michael Clifton,
Vincent Bugica, Victoria Cooper, Dustin Culler, Shaun Pickford,
Antoine Campbell, and Javier Olaya. This material is based upon
work supported by the National Science Foundation Graduate
Research Fellowship under Grant No 0900860 and Grant No
1252376.
5. REFERENCES
[1] Aleahmad, T., Aleven, V., and Kraut, R. Open community
authoring of worked example problems. In Proceedings of the
8th International Conference for the Learning Sciences
(ICLS '08), Vol. 3, 3-4.
[2] Anderka, M., Stein, B., and Lipka, N. Predicting quality
flaws in user-generated content: The case of Wikipedia. In
Proceedings of the 35th International ACM SIGIR Conference
on Research and Development in Information Retrieval
(SIGIR '12). ACM, New York, NY, USA, 981-990.
[3] Bayliss, J. D. Using games in introductory courses: Tips
from the trenches. In Proceedings of the 40th ACM Technical
Symposium on Computer Science Education (SIGCSE '09).
ACM, New York, NY, 337-341.
[4] Bartle, R. A. Hearts, clubs, diamonds, spades: Players who
suit MUDs. Journal of MUD Research, 1-19, 1996.
[5] Bartle, R. A. Designing Virtual Worlds. Boston, MA: New
Riders / Pearson Education, 2004.
[6] Boyce, A., Doran, K., Pickford, S., Campbell, A., Culler, D.,
and Barnes, T. BeadLoom Game: Adding competitive, user
generated, and social features to increase motivation. In
Proceedings of the Foundations of Digital Games (FDG '11).
ACM, New York, NY, USA, 243-247.
[7] Carmel, D., Roitman, H., and Yom-Tov, E. On the
relationship between novelty and popularity of user-generated
content. ACM Trans. Intell. Syst. Technol. 3, 4, Article 69
(September 2012).
{inventado,roberto}@ai.sanken.osaka-u.ac.jp, numao@sanken.osaka-u.ac.jp
ABSTRACT
In personalized learning scenarios, students have control over
their learning goals and how they want to learn, which is
advantageous since they tend to be more motivated and immersed
in what they are learning. However, they need to regulate their
motivation, affect, and activities so they can learn effectively. Our
research deals with helping students identify the long-term effects
of their learning behavior and identify effective actions that span
learning episodes, which are not easily identified without in-depth
analysis. In this paper, we discuss how we are trying to identify
such effective learning behaviors and how they can be used to
generate feedback that will help students learn in personalized
learning scenarios.
Keywords
personalized learning, self-regulated learning, reinforcement
learning, user modeling
1. INTRODUCTION
Governments and educational institutions have called for reforms in how students are taught in school, to give them more control over their learning [1]. Allowing students to engage in personalized learning grants them skills that prepare them for the needs of today's society and, more importantly, helps shape them into lifelong learners. In personalized learning, students have control over what they learn and how they learn it, which makes them more motivated and immersed in what they are learning. Teachers no longer serve as the main sources of information but instead become facilitators of the students' learning process. Although teachers can guide students and give them suggestions about what they are learning, teachers can only assess and provide support for a small number of the challenges
also affiliated with: Center for Empathic Human-Computer
Interactions, College of Computer Studies, De La Salle University, Manila, Philippines
that students face. Because students often learn in situations where teachers are unavailable, they can easily become overwhelmed by challenges and fail to achieve their learning goals. It is also possible for students to engage in non-learning activities that hinder their learning. Thus, in this kind of learning scenario, self-regulation is essential for students to manage their goals, time, motivation, affective states, and hindrances to learning.
Self-regulation is not an easy task because it requires considerable motivation and effort [5]. Students face a high cognitive load when performing learning tasks while also managing them: they need to continuously monitor the effects of their actions and decide whether to continue or change them. Furthermore, students must also keep track of effective learning behaviors so they can reuse them in future learning episodes.
We have been developing software that helps students monitor their behavior and reflect on what transpired during a learning episode with the help of webcam and desktop snapshots [3]. After each learning episode, students who used the system were asked to review the episode and annotate their intentions, activities, and affective states so they could better understand and analyze their behavior. According to the results, students who used the system discovered behaviors they were initially unaware of and were able to identify ways to improve ineffective learning behavior. We were also able to analyze and process the students' annotated data to gain a better understanding of their learning behavior.
The students' reflections from the experiment, however, seemed to focus only on the immediate effects of their actions and did not consider their long-term effects on the learning episode. Their reflections also did not incorporate realizations from previous learning episodes. We are currently investigating how to help students identify actions that benefit learning in both the short term and the long term, as well as effective learning behaviors that span different learning episodes.
2.
The data we used for this research was gathered from four students engaging in research-related work, which is an example of a personalized learning scenario. One male master's student and one female doctoral student created a report about their research, involving activities such as information search, reading papers, reading books, and creating a PowerPoint presentation. One male undergraduate student and one female doctoral student wrote a conference paper about their research, involving activities such as information search, reading papers, reading books, running programs and simulations to retrieve data from their experiments, and paper writing. We gathered two hours of data for five different learning episodes from each student within the span of one week.
Unlike other research, our work dealt with students who freely decided on the time, location, and type of activities they engaged in, including non-learning activities. However, they were required to learn in front of a computer running the software we developed for recording and annotating learning behavior.
Although the students worked on different topics and used different applications, all of them processed and performed experiments on previously collected data, searched for related literature, and created a report or document about it. Analyzing the data showed that students performed six types of actions: information search (e.g., using a search engine), viewing an information source (e.g., reading a book, viewing a website), writing notes, seeking help from peers (e.g., talking to a friend), knowledge application (e.g., paper writing, presentation creation, data processing), and off-task behavior (e.g., playing a game).
3. BEHAVIOR EFFECTIVENESS
In a learning episode, students perform many different actions to achieve their goal. Although students can identify the effectiveness of their current action by monitoring its effect, it is more difficult to identify how it will affect, or has affected, their learning in the long run. For example, students spending a long time learning about a topic may seem to be performing well; however, they may experience more stress, have a higher chance of making mistakes, and get confused more easily. It would probably be advantageous for such students to also take a rest once in a while.
We adapted the concept of returns in reinforcement learning [4] to account for this situation, wherein the effectiveness of an action is measured not only by its immediate effects but also by its long-term effects on the learning episode. Moreover, as a student engaged in more learning episodes, a reinforcement learning algorithm updated the rewards of each action, incorporating the effects of actions from previous learning episodes.
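The return-based update described above can be sketched with standard tabular Q-learning (the algorithm named in our results); the states, actions, reward values, and parameters below are illustrative placeholders rather than the study's actual encoding.

```python
# Tabular Q-learning sketch of return-based action values: an action's value
# reflects long-term return, not just its immediate reward. All states,
# actions, and reward values below are illustrative placeholders.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor (illustrative)
Q = defaultdict(float)    # Q[(state, action)] -> estimated return

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: blend the immediate reward with the best
    estimated return available from the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

ACTIONS = ["information_search", "view_source", "write_notes",
           "seek_help", "apply_knowledge", "off_task"]

# Two annotated transitions from a hypothetical learning episode:
q_update("engaged", "view_source", reward=1.0, next_state="engaged", actions=ACTIONS)
q_update("confused", "off_task", reward=0.5, next_state="engaged", actions=ACTIONS)
```

Because the discount factor carries value backward from later states, an action such as a short off-task break can accumulate a positive value even when its immediate reward is modest.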
Due to the lack of control over the students' activities while learning, it was not possible to directly gauge the students' learning progress, which could have been used to define the rewards of their actions. However, their affective states gave an idea of the events that transpired during the learning episode. D'Mello and Graesser's model of affective dynamics [2] describes the relationship between affective states and events that occur in a learning scenario. For example, confusion indicates instances wherein students need to exert more effort to progress in the current activity. Frustration arises when students are too confused, get stuck, and no longer progress in their learning. Too much frustration results in boredom or disengagement from the learning activity.
4. RESULTS
The Q-learning algorithm was applied to each student's data separately, since we assumed that each student would have a different learning policy. Due to the number of features we used for state representation, there were many states, and many of them had high return values. Owing to space limitations, we present only some of the notable state-action pairs from one student's learning policy in Table 2. The majority of the states with high return values contained state-action pairs representing transitions inherent to the domain. For example, high returns were given when students applied knowledge after viewing an information source, which happens naturally, for instance, when a student shifts between reading information sources and creating a PowerPoint presentation. However, some interesting strategies were discovered, such as shifting from an engaged on-task activity to an off-task activity, indicating that off-task activities may actually have positive long-term effects. Students' answers from surveys and personal interviews regarding their thoughts on a recent learning episode correlated with the reward values produced by the algorithm. For example, students identified the need to continue learning despite encountering challenges (e.g., confusion) and to avoid spending too much time on off-task activities.
5. FUTURE DIRECTION
Acknowledgements
This work was supported in part by the Management Expenses Grants for National Universities Corporations from
the Ministry of Education, Culture, Sports, Science and
Technology of Japan (MEXT) and JSPS KAKENHI Grant
Number 23300059. We would also like to thank all the students who participated in our data collection.
6. REFERENCES
Tiffany Barnes
bzmostaf@ncsu.edu
tmbarnes@ncsu.edu
ABSTRACT
When developing an intelligent tutoring system, it is necessary to have a significant number of highly varied problems that adapt to a student's individual learning style. In developing an intelligent tutor for logic proof construction, selecting problems for individual students that effectively aid their progress can be difficult, since logic proofs require knowledge of a number of concepts and problem-solving abilities. The level of variation in the problems needed to satisfy all possibilities would require an infeasible number of problems to develop. Using a proof construction tool called Deep Thought, we have developed a system that chooses existing problem sets for students using knowledge tracing of students' accumulated application of logic proof solving concepts, and we are running a pilot study to determine the system's effectiveness. Our ultimate goal is to use what is learned from this study to automatically generate logic proof problems that fit students' individual learning styles and aid in the mastery of proof construction concepts.
Keywords
Logic Proof, Problem Selection, Knowledge Tracing, Intelligent
Tutor.
1. INTRODUCTION
Logic proof construction is an important skill in several fields,
including computer science, philosophy, and mathematics.
However, proof construction can be difficult for students to learn,
since it requires a satisfactory knowledge of logical operations
and their application, as well as strategies for problem solving.
These required skills make developing an intelligent tutor for logic proof construction challenging, since a number of variables must be taken into account when selecting problems that promote learning of proof concepts and fit each student's individual learning style.
We describe the on-going development of an intelligent tutor, and
an initial experiment to determine the effectiveness of knowledge
tracing methods used to select sets of problems for students. For
the study, we have built upon an existing, non-intelligent proof
construction tool called Deep Thought, which has previously been
used for proof construction assignments, and from which student
performance data has been collected.
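Knowledge tracing of rule-application skill is commonly implemented as Bayesian Knowledge Tracing (BKT); the following is a minimal sketch of that standard model, with parameter values and the observation sequence invented for illustration (the system's exact model and fitted parameters are not specified here).

```python
# Bayesian Knowledge Tracing sketch for tracking mastery of one proof rule.
# The four parameters (initial mastery, learn, slip, guess) and the
# observation sequence are invented for illustration.
P_INIT, P_LEARN, P_SLIP, P_GUESS = 0.2, 0.15, 0.1, 0.2

def bkt_update(p_mastery, correct):
    """Update P(rule mastered) after one observed rule application."""
    if correct:
        cond = (p_mastery * (1 - P_SLIP)) / (
            p_mastery * (1 - P_SLIP) + (1 - p_mastery) * P_GUESS)
    else:
        cond = (p_mastery * P_SLIP) / (
            p_mastery * P_SLIP + (1 - p_mastery) * (1 - P_GUESS))
    return cond + (1 - cond) * P_LEARN   # learning opportunity after evidence

# Trace one hypothetical rule (e.g., Modus Ponens) over four attempts:
p = P_INIT
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
```

Per-rule mastery estimates of this kind can then feed a problem selector: problems exercising low-mastery rules are assigned more often.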
Our long-term goal is to provide a system for logic proof construction that adapts to a student's individual learning abilities,
4. FUTURE WORK
The data from this initial experiment needs further analysis before any new features are added to the system. However, initial results are promising, and it appears that the system is effective in selecting problem sets for students at a general level. Once this data has been analyzed further and compared to previous data from the old version of Deep Thought, we can draw more definite conclusions about the effectiveness of our problem selection. The next step is to apply the system within levels to test specific problem selection based on rule scores and rule ordering, rather than just problem sets. If that proves effective, we can apply methods in development for automatic generation of problems based on individual rule component construction. Overall, we plan to continue developing Deep Thought into a more effective intelligent tutor for logic proof construction.
5. REFERENCES
Based on the system and our goals for it, these paths are what would be expected. The problems at levels 1 and 2 are basic inference problems, designed to be easier to solve for students with the expected requisite knowledge. Level 3 is where the problems were designed to increase in difficulty. If the problem selection was effective, students should not have been able to complete level 3 without showing a higher level of proficiency than had been required up to that point. The fact that most of the class was transferred to the easy path at level 3 indicates that this is the case: students were given problems that were difficult enough to challenge them on the hard path (to the point of being sent to the easy path at the next level) while still being manageable on the easy path.
Since most students did not complete Deep Thought past this
point, the paths from level 4 on are somewhat skewed. However,
the fact that the students who did complete Deep Thought through
level 7 remained on the easy path indicates that the problems in
levels 4, 5, and 6 were overall appropriately difficult. These
problems were meant to be challenging regardless of the path the
student was on, particularly considering that the students did not
have requisite knowledge of replacement rules at this point.
Therefore the fact that most students stayed on the easy path
Demos/Interactive Events
Stéphane Sanchez
Olivier Héguy
Andil
Université Toulouse 1
IRIT
Université Toulouse 1
IRIT
olivier.heguy@andil.fr
angela.bovo@andil.fr
sanchez@univ-tlse1.fr
Andil
Yves Duthen
Université Toulouse 1
IRIT
yves.duthen@univ-tlse1.fr
ABSTRACT
We would like to demonstrate a web application that uses data mining and machine learning techniques to monitor students' progress along their e-learning curriculum and keep them from falling behind their peers.
Keywords
Moodle, analysis, monitoring, machine learning, statistics
1.
2.
2.1 Data consolidation
tors, sometimes in ill-adapted solutions such as in a spreadsheet. This keeps teachers from making meaningful links:
for instance, the student has not logged in this week, but it
is actually normal because they called to say they were ill.
We have already provided forms for importing grades obtained in offline exams, presence at on-site trainings and
commentaries on students. In the future, we will expand
this to an import directly from a spreadsheet, and to other
types of data. From Moodle, we regularly import the relevant data: categories, sections, lessons, resources, activities,
logs and grades.
2.2 Data granularity
2.3 Machine learning
For a more complex output, we use different machine learning methods to analyse the data in more depth and interpret it semantically [1]. We use classical clustering and classification algorithms, as implemented by the free library Weka. We provide the following algorithms: for clustering, Expectation Maximisation, Hierarchical Clustering, Simple K-Means, and X-Means; for classification, Logistic Regression, Linear Regression, Naive Bayes, and Multilayer Perceptron. They can be used with or without cross-validation, and the random seed and number of folds can be selected manually.
For clustering algorithms where the number of clusters is not
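As a rough illustration of the clustering step these algorithms perform, here is a minimal pure-Python k-means; the tool itself calls Weka's Java implementations, and the student feature vectors (weekly logins, grade) below are invented.

```python
# Pure-Python k-means sketch of the clustering the tool delegates to Weka.
# Student feature vectors (weekly logins, grade) are invented; the random
# seed is user-selectable, as in the tool.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=42, iters=20):
    """Return final centroids and one cluster label per point."""
    rng = random.Random(seed)
    centroids = [tuple(map(float, p)) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:   # guard against an emptied cluster
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, labels

students = [(2, 40), (3, 45), (25, 90), (30, 95), (1, 35), (28, 88)]
centroids, labels = kmeans(students, k=2)
```

On this toy data the low-activity and high-activity students separate into the two clusters regardless of which points the seed picks as initial centroids.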
3. FUTURE WORK
4. CONCLUSION
5. REFERENCES
University of Craiova
Bvd. Decebal, 107
+40-251 438198
University of Craiova
Bvd. Decebal, 107
+40-251 438198
University of Craiova
Bvd. Decebal, 107
+40-251 438198
cmariusstefan@gmail.com
mihaescu@software.ucv.ro
burdescu@software.ucv.ro
ABSTRACT
One of the primary concerns in on-line educational environments is the effective and intuitive visualization of the activities performed by students. This paper presents a tool mainly designed for professors, to assist them in better monitoring students' activity. The tool presents the activities performed by students in an intuitive, graphical way. The graphical presentation creates a mental model of the performed activities from the perspective of former generations of students who followed the same activities. The tool integrates the k-means clustering algorithm for grouping students and facilitates the customization of parameters and of the number of clusters displayed.
Keywords
activity visualization, k-means, e-Learning, PCA.
1. INTRODUCTION
One of the main issues of on-line educational environments is the effective and intuitive visualization of activities performed by students. For an e-Learning platform with an average of more than 100 students per module, it may become quite difficult for the professor to visualize the on-line activity performed by each student individually or by all students at once. The paper presents a tool that improves the productivity of a professor by providing an effective way of visualizing and interacting with the students.
This paper presents a tool that displays the activities performed by students in a graphical format. Several characteristics of the tool make it user-friendly and very efficient at presenting results in a synthetic form. The main characteristic is that the display is in 2D (two-dimensional) space, with a single parameter used for each coordinate. The display presents a specified number of groupings according to the professor's settings. The number of groupings (i.e., clusters) is the main parameter for the clustering algorithm that places the students into clusters. For each cluster, the centroid is clearly presented. Students from the same cluster are presented with the same distinct geometric sign in color and shape. Each cluster is divided into three areas: center, close area, and far area. Each area gathers students that have the same behavioral pattern regarding the activities performed within the on-line educational environment.
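The center/close/far assignment can be sketched as follows; the thresholds (thirds of the cluster's maximum distance to its centroid) and the data points are assumptions for illustration, not the tool's documented rule.

```python
# Sketch of the three display areas per cluster. The thresholds (one third
# and two thirds of the cluster's maximum centroid distance) are an
# illustrative assumption; the coordinates are invented.
import math

def band(student, centroid, max_dist):
    """Classify a student as center / close / far within their cluster."""
    d = math.dist(student, centroid)
    if d <= max_dist / 3:
        return "center"
    if d <= 2 * max_dist / 3:
        return "close"
    return "far"

centroid = (10.0, 10.0)
cluster = [(10, 11), (12, 14), (18, 20), (10.5, 10)]
max_dist = max(math.dist(p, centroid) for p in cluster)
areas = [band(p, centroid, max_dist) for p in cluster]
```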
3. SOFTWARE ARCHITECTURE
The application is divided into packages containing classes that implement related functionality. The main classes that perform the business logic of the tool are ClusteringServlet, RunScheduledJobServlet, BuildArffFileScheduledJob, BuildClustersScheduledJob, KMeansClustererStart, ClientServlet, and ClustererClientApplet. The web server administration interface (index.html) allows building a number of clusters and viewing them. This is performed by the ClusteringServlet, which in turn uses the KMeansClustererStart class. KMeansClustererStart generates the clusters (i.e., the model) from an ARFF file using the k-means algorithm. It also contains various methods for manipulating cluster data. In this class, the PCA (Principal Component Analysis) algorithm is used to reduce the dimensionality (number of attributes) of a given dataset.
The RunScheduledJobServlet class is a servlet that starts at server startup and runs the scheduled jobs at specified times. The time and frequency are specified in an XML configuration file. The scheduled jobs are represented by the BuildArffFileScheduledJob and BuildClustersScheduledJob classes. These, as their names imply, build the ARFF file from the database and build the clusters from the ARFF file. The BuildArffFileScheduledJob class uses the ArffGenerator class to build the ARFF file containing the training dataset. On the client side, a Java applet runs in an Internet browser. It connects to the server, retrieves the data it needs, and displays the students grouped according to features chosen by the professor. The interface of the client application allows specifying data for a new student and viewing its position on the chart (in the cluster to which it belongs).
2. RELATED WORKS
There are several examples of educational data mining tools. The
latest research trends place an important emphasis on developing
4. SOFTWARE TOOL
Figure 1 presents the GUI of the visualization tool. In this run there are three clusters of students, built according to the two features on the axes (i.e., testing activity and messaging activity). The application allows the professor to select other features by which to build the clusters. Keeping the mouse pointer over a point on the graph shows the feature values for that student. The points representing the students have different colors for each cluster. For better visualization, each cluster is divided into three areas: center, middle area, and far area. Each area is colored differently, giving intuitive information about how close the points belonging to a cluster are to its centroid.
may be easily selected. The tool may also be used successfully for outlier detection, which in our case means students that can hardly be assigned to any cluster.
5. CONCLUSIONS
This paper presents a visualization tool based on a clustering algorithm. The tool presents the clusters of students in a very intuitive way. The clustered students may be selected and a set of specific actions performed: sending messages, export to PDF, etc. In the future, the tool may be extended by integrating other features for data representation and by providing other advanced functionalities for professors.
The tool uses a total of six features with which clusters can be built, as presented in Figure 1. The last two features are composite features obtained by combining two simple features using Principal Component Analysis (PCA). The MessagingActivity feature is computed as a combination of the NumberOfMessages feature and the AvgNrOfCharacters feature. In the same way, the TestingActivity feature is computed as a combination of the NumberOfTests and AverageOfResults features.
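A composite feature of this kind can be sketched as the first principal component of two standardized features; for two features the leading eigenvector of the correlation matrix has the simple closed form used below. All data values are invented.

```python
# Sketch of a composite feature (e.g., MessagingActivity) as the first
# principal component of two standardized features. Data values are invented.
import math

def first_pc_scores(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n) or 1.0
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n) or 1.0
    xs = [(x - mx) / sx for x in xs]               # standardize both columns
    ys = [(y - my) / sy for y in ys]
    cxy = sum(a * b for a, b in zip(xs, ys)) / n   # correlation
    # Leading eigenvector of [[1, cxy], [cxy, 1]] is (1, +/-1)/sqrt(2),
    # with the sign matching the sign of the correlation.
    s = 1.0 if cxy >= 0 else -1.0
    return [(a + s * b) / math.sqrt(2) for a, b in zip(xs, ys)]

num_messages = [2, 10, 4, 8]      # NumberOfMessages (invented)
avg_chars = [50, 300, 90, 240]    # AvgNrOfCharacters (invented)
messaging_activity = first_pc_scores(num_messages, avg_chars)
```

The resulting scores order the students by overall messaging intensity, which is what makes the composite usable as a single plot axis.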
6. REFERENCES
If feature values are provided for a new student, the tool places a big X mark, in the same color as the other instances from the same cluster, at the corresponding position, and thus the cluster is immediately determined. After viewing the clustered students, the teacher can select one or more students from the chart and save their data as a PDF or send them an e-mail.
The tool may be used for two purposes. One is easy visualization of the students' activity based on different criteria. Once the visual information is retrieved, the tool may be used to interact (i.e., send messages) with a specific set of students that
Mahyar Vaghefi
William H. Batchelder
france@uwm.edu
mahyar@uwm.edu
ABSTRACT
In this paper, we describe a software package called FlexCCT for
analyzing numerical ratings data. FlexCCT is implemented in
MATLAB and incorporates a range of different cultural consensus
theory (CCT) models. We describe the standalone GUI version of
FlexCCT. We give an illustrative example, showing how
FlexCCT can be used to analyze and interpret essay rating data.
Keywords
CCT, maximum likelihood, optimization, essay rating
1. INTRODUCTION
Cultural consensus theory (CCT) is a methodology used to analyze cultural values or truths. CCT has several educational data analysis and data mining applications. It can be used to analyze educational essay/question rating data to i) evaluate rater competency, ii) evaluate rater bias, iii) calculate accurate competency-weighted ratings, and iv) evaluate the easiness or difficulty of rating individual answers. The CCT results can be used to evaluate a set of essay ratings and then recommend actions, for example retraining certain raters or using rater competency to determine the number of raters assigned to rating tasks. CCT can also be used as part of the rating/grading process; for example, the item easiness CCT models can be used to assign additional raters to answers that are deemed difficult to rate.
We present FlexCCT, a software package implementing maximum likelihood CCT. We do not give axiomatic or mathematical descriptions of the class of CCT models described in this paper; these can be found in [1,3,4]. We give an intuitive description of several of the CCT models implemented in the FlexCCT software, concentrating on CCT models for continuous data. We summarize the model features and describe the software implementation of the models. We then describe work that uses CCT to analyze essay grading data.
whbatche@uci.edu
L(d, z | X) = ∏_{k=1}^{m} ∏_{i=1}^{n} (d_i / √(2π)) · exp(−d_i² (x_ik − z_k)² / 2)    (1)
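In practice one works with the log of likelihood (1), where rater i's competency d_i acts as an error precision around the consensus values z_k. A minimal sketch of evaluating that log-likelihood, with invented ratings, competencies, and consensus values:

```python
# Sketch of the log of likelihood (1): rater i has competency d_i and item k
# has consensus value z_k. All numeric values below are invented.
import math

def log_likelihood(X, d, z):
    """Sum over raters i and items k of
    log[(d_i / sqrt(2*pi)) * exp(-d_i**2 * (x_ik - z_k)**2 / 2)]."""
    ll = 0.0
    for i, row in enumerate(X):
        for k, x in enumerate(row):
            ll += math.log(d[i] / math.sqrt(2 * math.pi))
            ll -= d[i] ** 2 * (x - z[k]) ** 2 / 2
    return ll

X = [[3.1, 4.0, 2.2],    # a rater close to the consensus
     [2.0, 5.5, 1.0]]    # a noisier rater
z = [3.0, 4.2, 2.0]
# Assigning high competency to the accurate rater fits the data better:
better = log_likelihood(X, d=[2.0, 0.5], z=z) > log_likelihood(X, d=[0.5, 2.0], z=z)
```

Maximizing this quantity over d and z is what the estimation routines described below carry out.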
3. SOFTWARE DESCRIPTION
FlexCCT consists of a set of MATLAB functions and a compiled, standalone GUI version of the software. The GUI consists of a single input screen, where the user configures the software parameters, and an output screen, which displays the results of the CCT model optimization. The output screen has an option to save the output parameter values. The input screen is given in Figure 1 and a description of the associated options is given in Table 1.
Table 1. Input options.
Data: a CSV file containing the values of X. Rows correspond to raters and columns correspond to items. The data should not contain a header.
Combination: Add Traits (if there is more than one trait/attribute, add the trait values together); Correspondence Analysis (calculates a single continuous trait from multiple qualitative traits).
Data Standardization: None (use raw data); Standardize (subtract the column mean and divide by the column standard deviation); Range Scale (divide by the column range).
Estimation Method: Simple Average (d_i = 1 for all raters and z_k is the average value of x_k); Factor Analysis (utilizes a minimum residual factor analysis as per classical CCT [5]); ML Model (the basic maximum likelihood model from (1)); IE Error Variance (item easiness error variance model); IE Multiply (item easiness where d_ij = d_i · δ_j); IE Add (item easiness where d_ij = d_i + δ_j).
Bias Type: No Bias; Additive Bias; Multiplicative Bias (see section 2).
Optimization Method: Fixed Point (fixed point estimation); Two Stage Fixed Point (the values of z and d are estimated first, followed by the other parameters); Derivative Free (standard MATLAB routine); Gradient (MATLAB gradient descent optimization, utilizing first-order derivatives).
Converge Options: convergence criterion for the optimization (default = 1e-6).
Further options: MissingVal, Max d, Max IE, Options.
Outputs include an n × 1 vector of competencies, an n × 1 vector of biases, and pll (1-3).
4. EDUCATIONAL EXAMPLE
In [4], a detailed example is given to show how CCT can be used in essay (or more general) rating applications. A subset of 50 essays was taken from a set of high school essays. The prompt for the essays was to describe a situation involving laughter. A grading rubric was defined and each essay was graded on 6 attributes, with each attribute having a range from 1 to 6. An overall continuous score was calculated using two approaches. In the first approach, the assumptions of classical test theory were used and the scores for each attribute were added together to give a total score in the range of 6-36. In the second approach, multiple correspondence analysis was used to explicitly scale the multiple ordinal attribute scales into one continuous scale. The essays were graded by 2 expert graders and 10 student graders. Each student grader was given 30 minutes of training and 3 minutes to grade each essay.
Some overall conclusions reached in [4] are that CCT provides useful measures of rater competency, rater bias, and item easiness/difficulty. CCT can be used to help train and evaluate raters and to identify essays where accurate evaluation is difficult. The CCT competencies can be used to produce competency-weighted averages of essay ratings. In the essay rating data analysis, incorporating bias gave improved model fit, and additive bias gave better model fit than multiplicative bias. Likewise, the multiplicative item easiness model gave better model fit than the additive item easiness model.
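A competency-weighted average of the kind mentioned above can be sketched as follows; the scores and competency values are invented.

```python
# Sketch of a competency-weighted rating: an essay's score is the average of
# its raters' scores weighted by their CCT competencies. Values are invented.
def weighted_score(scores, competencies):
    return sum(s * w for s, w in zip(scores, competencies)) / sum(competencies)

essay_scores = [4.0, 5.0, 2.0]     # three raters' scores for one essay
competencies = [0.9, 0.7, 0.2]     # the low-competency rater counts least
consensus = weighted_score(essay_scores, competencies)
```

The outlying score of 2.0 from the low-competency rater pulls the consensus down far less than it would under a plain average.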
6. ACKNOWLEDGMENTS
The third author acknowledges the support of a grant from the
Army Research Office (ARO) and a fellowship from the Oak
Ridge Institute for Science and Education (ORISE).
7. REFERENCES
[1] Batchelder, W. H. and Romney, A. K. New results in test
theory without an answer key. In Roskam, E. E. ed.
Mathematical Psychology in Progress. Springer-Verlag,
Heidelberg, Germany, 1989, 229-248.
[2] Batchelder, W. and Romney, A. 1988. Test theory without an
answer key. Psychometrika, 53, 1 (Mar. 1988), 71-92.
[3] France, S. L. and Batchelder, W. H. 2012. Unsupervised
Consensus Analysis for On-line Review and Questionnaire
Data. Working paper, UC Irvine.
[4] France, S.L. and Batchelder, W. H. 2013. A Maximum
Likelihood Item Difficulty Model for Consensus Analysis.
Working paper, UC Irvine.
[5] Romney, A. K., Weller, S. C. and Batchelder, W. H. 1986. Culture as Consensus: A Theory of Culture and Informant Accuracy. American Anthropologist, 88, 2 (Oct. 1986), 313-338.
Margarita Elkina,
Andreas Pursian
Amrumer Straße 10
13353 Berlin, Germany
Alt Friedrichsfelde 60
10314 Berlin, Germany
{merceron, sschwarzrock}@beuth-hochschule.de
{margarita.elkina, andreas.pursian}@hwr-berlin.de
{liane.beuster, albrecht.fortenbacher, kappe, boris.wenzlaff}@htw-berlin.de
ABSTRACT
2. VISUALIZING INTERACTIONS
Keywords
Learning Management System, Visualization, Interaction, Performance, Visual Analytics.
1. INTRODUCTION
The use of a Learning Management System (LMS) to support teaching and learning is widespread. However, the usage data such systems store is not analyzed on a routine basis by the different stakeholders to retrieve pedagogical information that could support reflection. For example, if a teacher notices that some non-compulsory exercise she has made available in her course is hardly attempted, she might do some further analysis: does it seem to have a positive impact on the mark of the final exam for the few students who solved it? If not, she might consider deleting it from the course for the next semester; if yes, she might change her teaching style so that more students attempt it.
Figure 1 shows this filter with the types assignment and file selected. All available visualizations have these three filters. Figure 2 shows the network of learning objects that results when the order of access, or navigation, is taken into account. Learning objects are nodes colored according to their type. The size of a node is proportional to the number of accesses. Navigational steps between learning objects are depicted by edges. The edges are weighted and color-coded to encode the number of navigational steps. Placing the cursor over a node brings up a tool-tip with the learning object's name, its type, and the total number of accesses or requests for it. For further exploration, a single click on a specific node rearranges the graph so that it focuses on the node of interest, displaying neighboring nodes in the immediate proximity and moving other nodes further away.
students are in the highest interval [95-100] for the second part of the exam, called Klausur-Teil2, while no student has achieved this high performance in the first part of the exam, Klausur-Teil1. Again, the visualization is interactive. Hovering the cursor over a bar shows the exact number. The user can activate or deactivate any test or assignment by clicking on the circle near its name. A hollow circle indicates that an assignment has been deselected. The appearance or disappearance of bars after activation or deactivation occurs progressively, so that the user can follow the change taking place. A second visualization shows the box plots for all assignments. In Figure 4 the same assignments as in Figure 3 have been selected. One can grasp not only that the maximum mark is higher in the second part of the exam, but also that over 75% of the students have done better, since the whole box, including the median, is higher.
3. VISUALIZING PERFORMANCE
An assignment is a generic term for any work that can be graded, such as questions, exercises, tests, exams and so on. Most LMSs make it easy to calculate useful statistics, such as the average and standard deviation, for a given assignment. It is more difficult to visualize and compare the performance of students across several or all assignments, a need raised by teachers. The visualizations presented here cater for this need. It is not rare that different assignments are marked on different scales; for example, assignment 1 may be out of 20 points and assignment 2 out of 50. Sticking to the original scale given by teachers makes comparison awkward, so in the following visualizations all assignments are scaled to 100. The usual filters mentioned earlier can be used to select particular assignments or tests, or to select particular users. Figures 3 and 4 show two visualizations to explore performance; we concentrate here on the diagrams.
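The rescaling step can be sketched as follows; this is our illustration of the idea, not the tool's actual code, and the function name is ours:

```python
def scale_to_100(mark, max_points):
    """Rescale a raw mark to a common 0-100 scale so that
    differently-marked assignments become comparable."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return 100.0 * mark / max_points

# Assignment 1 is out of 20 points, assignment 2 out of 50 (the example from the text):
print(scale_to_100(15, 20))  # 75.0
print(scale_to_100(15, 50))  # 30.0
```

The same raw mark of 15 thus maps to very different positions on the common scale, which is exactly why the comparison is awkward without rescaling.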
4. CONCLUSION
This visual exploration is the first step in analyzing usage data. Once teachers grasp the overall trends in interactions and performance in their course, they should be able to deepen their analysis, seeking answers to questions such as: can students be grouped according to their performance? Or, dually, can assignments be grouped according to how students perform on them? Future work includes implementing means to help answer such more involved questions. A challenge is to select the data mining algorithms and their parameters carefully so as to avoid misinterpretation of the results by stakeholders.
5. ACKNOWLEDGMENTS
This work is partially supported by the Berlin Senatsverwaltung für Wirtschaft, Technologie und Forschung with funding from the European Social Fund.
6. REFERENCES
[1] Beuster, E., Elkina, M., Fortenbacher, A., Kappe, L., Merceron, A., Schwarzrock, S., Wenzlaff, B. 2012. LeMo: Lernprozessmonitoring auf personalisierenden und nicht personalisierenden Lernplattformen. In Proceedings of the GML2 Grundfragen des Multimedialen Lehrens und Lernens Conference (Berlin, Germany, March 15-16, 2012). Waxmann Verlag, 63-76. http://www.gml2012.de/tagungsband/Tagungsband_GML2012_web.pdf
Figure 3. Comparing performance with marks distribution.
[2] Shneiderman, B. 1996. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages (College Park, MD, USA, September 3-6, 1996). IEEE, 336-343.
S. D'Mello, R. Calvo, & A. Olney (Eds.). Proc Educational Data Mining (EDM) 2013
397
1. INTRODUCTION
Although bullying has long been a significant problem, increased awareness has brought the matter to the attention of legislatures. States across the country are passing laws aimed at preventing bullying. Some pieces of legislation, like New York's Dignity for All Students Act, include provisions for information sharing and increased data retention [3], creating an environment ripe for innovation in the fight against bullying.
In this paper, we propose CASSI (Classroom Assisting Social Systems Intelligence), an open-source system designed to be inexpensive and easily integrated into existing educational practices, which will allow for the collection, modeling, and analysis of student behavioral data. Specifically, the collected behavioral data will be used to construct a social graph that represents how dysfunctional the directed relationship between each pair of students is. This social graph can then be used to inform the behavior of a variety of expert systems.
It is hoped that this system, in due time, will be implemented in educational institutions to adhere to both the letter and the spirit of new pieces of anti-bullying legislation. A single central repository for an educational institution's behavioral data would allow the data to be more easily shared amongst educators and formatted into reports for administrators. This repository would also allow for the implementation of expert systems which educators can use in day-to-day classroom management tasks that influence bullying [1]. One expert system will be described that makes use of the data stored in the social graph to improve classroom management by allowing teachers to automatically generate seating charts likely to reduce negative classroom behavior. Other possible expert systems will also be discussed briefly.
2. BACKGROUND
There has been a recent push for software to address cyber-bullying through the analysis of social networking sites. Nahar, Unankard, Li, and Pang [5] describe a method for using sentiment analysis to develop a graph-based approach for identifying cyber-bullies and their victims, while Sanchez and Kumar [1] describe a method for integrating this style of sentiment analysis with Twitter. Although cyber-bullying is a significant problem, neither of these papers addresses problems associated with completeness. It is easy for students to restrict educators' access to their social media accounts, and such restrictions may skew the decisions of an expert system which integrates social media data with observed classroom data. However, both of these methods appear to accurately recognize the asymmetric nature of bullying discussed by Allen [1].
It is also important to note the distinction between social networks and social networking. As noted in Purtell et al. [6], it is possible to extract implicit or inferred social topographies [8] from sources other than social media.
the system requires educators to record is: the victim, the bully, and a rating of the bullying incident on a scale from 1 to 10. It is expected that the system will also record additional book-keeping data, such as a time-stamp, the educator filing the incident report, and a detailed description of the event, that may be utilized in the future. This data, collected via a web form when an incident occurs and stored in a local database, is similar to the directed data described by Purtell et al. [6] as suitable for use in the construction of implicit social graphs of the type described in Roth et al. [8].
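A minimal sketch of how incident reports might feed such a directed graph (our illustration only; the class, field names, and the additive severity rule are our assumptions, not CASSI's actual implementation):

```python
from collections import defaultdict

class SocialGraph:
    """Directed social graph: weight[(bully, victim)] accumulates the
    severity of recorded incidents in that direction only, so the
    relationship can be asymmetric, as bullying typically is."""

    def __init__(self):
        self.weight = defaultdict(float)

    def record_incident(self, bully, victim, severity):
        """Log one incident report (severity rated 1-10 by the educator)."""
        if not 1 <= severity <= 10:
            raise ValueError("severity must be in 1..10")
        self.weight[(bully, victim)] += severity

    def dysfunction(self, a, b):
        """How dysfunctional is the directed relationship a -> b?"""
        return self.weight[(a, b)]

g = SocialGraph()
g.record_incident("alice", "bob", 7)
g.record_incident("alice", "bob", 3)
print(g.dysfunction("alice", "bob"))  # 10.0
print(g.dysfunction("bob", "alice"))  # 0.0 -- the reverse direction is unaffected
```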
4. CLASSROOM OPTIMIZATION
One expert system that has been developed to make use of the behavioral social graph aims at automating the task of finding behaviorally optimal seating charts for rectangular seating arrangements of arbitrary size. This task is accomplished by comparing the social graph to a particular arrangement of students. If two students are seated adjacent to one another in the classroom, their relationship information is extracted from the social graph and added to the ranking of the classroom. This reduces the task of finding an optimal classroom to a simple minimization problem: the lower the ranking, the better the classroom.
Once the system has been tested on genuine behavioral data, there are a number of additional expert system modules that may extend its usefulness. In particular, the researchers of Project CASSI expect that the system may be extended to support behaviorally informed scheduling and time-series forecasting. The former task, grouping students into non-disruptive classes, may be accomplished by a selection algorithm similar to the one proposed for seating chart generation, but operating on combinations of students rather than permutations. The latter task, time-series forecasting using the time-stamps of the incident reports, may be conducted with the ultimate goal of predicting bullying trends and anticipating bullying events before they actually occur.
6. REFERENCES
[1] Allen, K. P. (2010). Classroom Management, Bullying, and Teacher Practices. Professional Educator, 34(1), 1-15.
[2] Dawson, S. (2008). A study of the relationship between student social networks and sense of community. Educational Technology & Society, 11(3), 224-238.
[3] Dignity for All Students Act. 2010. N.Y. Educ. Law 1018.
[5] Nahar, V., Unankard, S., Li, X., & Pang, C. (2012). Sentiment Analysis for Effective Detection. In Web Technologies and Applications (Vol. 7235, pp. 767-774). Springer Berlin Heidelberg.
[6] Purtell, T. J., McLean, D., Teh, S. K., Hangal, S., Lam, M. S., & Heer, J. (2011). An Algorithm and Analysis of Social Topologies from Email and Photo Tags. Workshop on Social Network Mining & Analysis, ACM KDD. San Diego, CA.
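The seating-chart ranking and its minimization can be sketched as follows (our illustration, assuming a brute-force search over permutations and a hypothetical `weight` dictionary of directed dysfunction scores; the actual system's selection algorithm may differ):

```python
from itertools import permutations

def adjacent_pairs(rows, cols):
    """Pairs of seat indices that are horizontally or vertically adjacent
    in a rows x cols rectangular arrangement."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                pairs.append((i, i + 1))
            if r + 1 < rows:
                pairs.append((i, i + cols))
    return pairs

def ranking(seating, pairs, weight):
    """Sum the dysfunction weights (both directions) over adjacent students."""
    total = 0.0
    for i, j in pairs:
        a, b = seating[i], seating[j]
        total += weight.get((a, b), 0.0) + weight.get((b, a), 0.0)
    return total

def best_seating(students, rows, cols, weight):
    """Brute-force minimization: the lower the ranking, the better the classroom."""
    pairs = adjacent_pairs(rows, cols)
    return min(permutations(students), key=lambda s: ranking(s, pairs, weight))

weight = {("ann", "bob"): 9.0}  # hypothetical dysfunction score
best = best_seating(["ann", "bob", "cid", "dee"], 2, 2, weight)
idx = {name: i for i, name in enumerate(best)}
# In a 2x2 room the only non-adjacent seats are the diagonals,
# so ann and bob end up in different rows AND different columns:
print(abs(idx["ann"] // 2 - idx["bob"] // 2), abs(idx["ann"] % 2 - idx["bob"] % 2))  # 1 1
```

Brute force over permutations is only feasible for tiny classrooms; a real implementation would need a heuristic search, but the objective function is the same.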
Keywords
Student usage data, Moodle block, data visualization, data mining.
1. INTRODUCTION
Nowadays, there are a great number of free and commercial general-purpose DM tools and frameworks [2], such as Weka, RapidMiner, KNIME, R, SAS Enterprise Miner, Oracle Data Mining, etc. These tools can be used for mining datasets from any domain or research area. However, none of them is specifically designed for pedagogical/educational purposes and problems, so they are cumbersome for educators to use, being designed more for power and flexibility than for simplicity. For this reason, an increasing number of specific mining tools have been developed to solve different educational problems [5]. Of these, only a small subgroup of tools is specifically oriented to using Moodle data, such as:
2. TOOL DESCRIPTION
Our tool has been developed in PHP and integrated into Moodle as a new block (an item which may be added to the left, right or centre column of any page in Moodle). It consists of two main tabs or components:
Figure 2: Bar diagram of the number of resources accessed.
In the new pop-up window (Figure 2), the instructor can select the
attribute and type of graphics to visualize.
3. CONCLUSIONS
In the future, we intend to add various new data mining algorithms in order to provide more advanced algorithms of each type. We would also like to add a specific pre-processing step/tab that lets the instructor modify the selected data before running the data mining.
4. ACKNOWLEDGMENTS
This work was supported by the Regional Government of Andalusia and the Spanish Ministry of Science and Technology under projects P08-TIC-3720 and TIN-2011-22408, respectively, and by FEDER funds.
5. REFERENCES
[1] Mazza, R., Milani, C. GISMO: a Graphical Interactive
Student Monitoring Tool for Course Management Systems,
In International Conference on Technology Enhanced
Learning, Milan, 1-8. 2004.
[2] Mikut, R., Reischl, M. Data Mining Tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(5), 431-443. 2011.
[3] Pedraza-Perez, R. Romero, C., Ventura, S. A Java Desktop
Tool for Mining Moodle Data. In International Conference
on Educational Data Mining, Eindhoven, 319-320. 2011.
[4] Rabbany, R., Takaffoli, M., Zaïane, O. Analyzing Participation of Students in Online Courses Using Social Network Analysis Techniques. In International Conference on Educational Data Mining, Eindhoven, 21-30. 2011.
Figure 4: Main window for mining data.
The instructor also has to select the attributes to use from the data file and the parameter values of the algorithm. Once the algorithm has been executed, the discovered model is shown and can be saved as a PDF or plain text file.
Vasile Rus
vrus@memphis.edu
Rajendra Banjade
rbanjade@memphis.edu
Mihai Lintean
mclinten@memphis.edu
Nobal Niraula
nbnraula@memphis.edu
Dan Stefanescu
dnstfnscu@memphis.edu
ABSTRACT
We present in this demo SEMILAR, a SEMantic similarity toolkit. SEMILAR offers, in one software environment, several broad categories of semantic similarity methods: vectorial methods such as Latent Semantic Analysis, probabilistic methods such as Latent Dirichlet Allocation, greedy lexical matching methods, optimal lexico-syntactic matching methods based on word-to-word similarities and syntactic dependencies with negation handling, kernel-based methods, and others. During this demo presentation we will demonstrate the efficacy of using SEMILAR to investigate and tune assessment algorithms for evaluating students' natural language input, based on data from the DeepTutor computer tutor.
Keywords
Natural language student inputs, assessment, conversational
tutors.
1. INTRODUCTION
In dialogue-based Intelligent Tutoring Systems (ITSs; Rus, D'Mello, Hu, & Graesser, in press; Evens & Michael, 2005), it is important to understand students' natural language responses. Accurate assessment of students' responses enables the building of accurate student models for both cognition and affect. An accurate student model in turn affects the quality of the tutor's feedback (Rus & Lintean, 2012). In general, accurate student models lead to improved macro- and micro-adaptivity in ITSs, which is needed for effective tutoring (Rus, D'Mello, Hu, & Graesser, in press).
There are at least two different types of natural language assessments in conversational ITSs. First, there is a need for advanced natural language algorithms to interpret the meaning of students' natural language contributions at each turn in the dialogue; student responses in the middle of the dialogue tend to be short, i.e., the length of a sentence or less. There is also a need to assess the more comprehensive, essay-type answers that students are required to provide immediately after being prompted to solve a problem. These essay-type answers can be a paragraph long or even longer, depending on the task and target domain.
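The greedy lexical matching family of methods mentioned in the abstract can be sketched as follows; this is our toy illustration, with exact-match word similarity standing in for the richer word-to-word metrics SEMILAR actually provides:

```python
def word_sim(w1, w2):
    """Toy word-to-word similarity: 1.0 on exact match, else 0.0.
    A real toolkit would plug in a richer metric (e.g. WordNet- or
    corpus-based similarity) here."""
    return 1.0 if w1 == w2 else 0.0

def greedy_similarity(student, reference):
    """Greedy matching: each student word takes its best-matching reference
    word. Unlike optimal (assignment-based) matching, a reference word may
    be matched more than once."""
    s_words = student.lower().split()
    r_words = reference.lower().split()
    if not s_words or not r_words:
        return 0.0
    total = sum(max(word_sim(w, r) for r in r_words) for w in s_words)
    return total / len(s_words)

print(greedy_similarity("the force accelerates the object",
                        "a net force causes the object to accelerate"))  # 0.8
```

With exact matching, "accelerates" finds no partner in the reference, which is one reason word-to-word metrics with morphological and semantic generalization perform better.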
3. ACKNOWLEDGMENTS
This research was supported in part by the Institute of Education Sciences under award R305A100875.
4. REFERENCES
[1] Evens, M., and Michael, J. 2005. One-on-one Tutoring by
Humans and Machines. Mahwah, NJ: Lawrence Erlbaum
Associates.
[2] Graesser, A. C., Rus, V., D'Mello, S. K., & Jackson, G. T. (2008). AutoTutor: Learning through natural language dialogue that adapts to the cognitive and affective states of the learner. In D. H. Robinson & G. Schraw (Eds.), Current perspectives on cognition, learning and instruction: Recent innovations in educational technology that facilitate student learning (pp. 95-125). Information Age Publishing.
[3] Rus, V. & Lintean, M. (2012). A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics. In Proceedings of the Seventh Workshop on Innovative Use of Natural Language Processing for Building Educational Applications, NAACL-HLT 2012, Montreal, Canada, June 7-8, 2012.
Olga C. Santos (ocsantos@dia.uned.es)
Jesus G. Boticario
ssalmeron@bec.uned.es
Raúl Cabestrero, UNED (rcabestrero@psi.uned.es)
Pilar Quirós, UNED (pquiros@psi.uned.es)
ABSTRACT
Collecting and processing data in order to detect and recognize emotions has become a hot research topic in educational scenarios. We have followed a multimodal approach, collecting and processing data from different sources to support emotion detection and recognition. To illustrate the approach, in this demo participants will be shown what emotional data can be gathered while solving Math problems.
Keywords
Affective Computing, Data Mining, Sensor Data, Emotion
Detection, Mathematics
1. INTRODUCTION
Currently there is growing interest in offering emotional support to learners in e-learning platforms through an expanded set of adaptive features. A key issue is to determine the learner's affective state, which is related to their cognitive and metacognitive processes [4], preferably with low-cost sensors [2]. In our approach, affective states are to be defined by jointly mining subjective, physiological, and behavioral data gathered from diverse emotional information sources while the learner interacts with the given e-learning environment. This approach offers possible improvements in emotion detection which, as suggested in the literature, may come from combining different data sources simultaneously [5]. Math problem solving scenarios have provided opportunities to investigate this new approach, as they may elicit different emotions [7].
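One simple way to combine such sources is to align behavioral events from the learning environment with the nearest physiological reading in time. A minimal sketch with hypothetical data (our illustration, not the project's actual instrumentation or field names):

```python
# Hypothetical streams: (timestamp_seconds, heart_rate) and an LMS event log.
physio = [(0.0, 62), (1.0, 64), (2.0, 71)]
events = [(0.4, "open_problem"), (1.7, "submit")]

def nearest_reading(t, readings):
    """Attach the physiological reading closest in time to an event."""
    return min(readings, key=lambda r: abs(r[0] - t))[1]

# Fuse the two modalities into (event, time, reading) records for later mining.
fused = [(label, t, nearest_reading(t, physio)) for t, label in events]
print(fused)  # [('open_problem', 0.4, 62), ('submit', 1.7, 71)]
```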
2. OUR APPROACH
As for emotion detection, our approach is based on the use of
data mining techniques. As shown in Figure 1, we follow a
multimodal gathering approach based on the combination of the
Jesus G. Boticario (jgb@dia.uned.es)
Mar Saneiro (marsanerio@dia.uned.es)
3. ONGOING WORK
This approach is supported by the MAMIPEC project, where we are exploring how to combine different information sources and signals to offer an accessible and personalized learning experience to the learner, which accounts for their affective state and aims to provide, accordingly, affect-based recommendations. To progress on this goal, a large-scale
4. ACKNOWLEDGMENTS
The authors would like to thank the Spanish Government for funding the MAMIPEC project (TIN2011-29221-C03-01).
5. REFERENCES
[1] Arnau, D., Arevalillo-Herráez, M., Puig, L., González-Calero, J.A. 2013. Fundamentals of the design and the operation of an intelligent tutoring system for the learning of the arithmetical and algebraic way of solving word problems. Computers & Education. 63, 119-130.
bsamei@memphis.edu
Fazel Keshtkar (fkshtkar@memphis.edu)
Vasile Rus (vrus@memphis.edu)
Arthur C. Graesser (a-graesser@memphis.edu)
ABSTRACT
In this demo, we introduce a tool that provides a GUI for a previously designed speech act classifier. The tool also provides features to manually annotate data and to evaluate and improve the automated classifier. We describe the interface and evaluate our model with results from two human judges and the computer.
Keywords
Speech act, interactive machine learning, speech act classifier.
1. INTRODUCTION
Speech act classification is the task of classifying a sentence, utterance, or any other discourse contribution into a speech act category selected from a set of predefined categories. Each category represents a particular social discourse function. "What is your name?", for example, is classified as a Question. Other speech act categories include Statement, Greeting, etc.
Our tool is designed to offer annotation facilities in order to improve a previously developed automated speech act classifier (SAC; available online at www.cs.memphis.edu/~vrus/SAC/) [2]. Furthermore, we provide a GUI-based interface to the speech act classifier [2]. We use an interactive machine learning model for this task that allows for manual classification by human judges, which is used to improve the accuracy of our machine learning model. The tool is written in Java. The SAC relies on a decision tree (J48) that has proved to provide the best performance on training data from human-annotated utterances [2].
The decision tree is a machine learning approach that requires a feature set to be designed; the feature set is an important part of machine learning algorithms. Moldovan, Rus, and Graesser [2] designed a feature set and used it to automatically classify chat utterances. They used eight speech act categories, which are shown in Table 1. According to analyses of a variety of corpora, such as chat and multiparty games, we can converge on a set of speech act categories that are both theoretically justified and usable by trained judges [3]. For the feature set, we tokenize the chat utterances using basic regular expressions, and for each utterance five features are extracted: the first three words, the last word, and the length of the utterance in words. Many other feature sets have been experimented with, but the five features just mentioned proved to lead to the highest performance in conjunction with decision trees and naive Bayes methods [2]. The model has been trained and evaluated on data sets from intelligent tutoring systems as well as chat data [2]. Our model is a J48 decision tree built on the training data using the Weka toolkit.
The model and training data can be updated and improved independently. The tool can be used with several training data sets depending on the domain; for example, for the dialogs in a movie we can use a different training data model for that domain.
Table 1. Speech act categories and example utterances: ExpressiveEvaluation ("ed is tough, no doubt."), Greeting ("Hello!"), MetaStatements, Statement, Question, Reaction ("Thank you"), Request, Other.
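The five-feature extraction described above can be sketched as follows (our illustration; the exact tokenizer and feature encoding used by the SAC may differ):

```python
import re

def speech_act_features(utterance):
    """Extract the five features used by the classifier: the first three
    words, the last word, and the utterance length in words.
    Tokenization is done with a basic regular expression, as in the text."""
    tokens = re.findall(r"[A-Za-z']+|[?!.]", utterance.lower())
    words = [t for t in tokens if t not in "?!."]
    first_three = (words + ["", "", ""])[:3]  # pad very short utterances
    return {
        "w1": first_three[0],
        "w2": first_three[1],
        "w3": first_three[2],
        "last": words[-1] if words else "",
        "length": len(words),
    }

print(speech_act_features("What is your name?"))
# {'w1': 'what', 'w2': 'is', 'w3': 'your', 'last': 'name', 'length': 4}
```

Feature vectors of this form would then be fed to a decision tree learner such as Weka's J48, as described above.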
2. THE INTERFACE
We have designed a Graphical User Interface (GUI) for the tool. It can be used on any machine, since it is implemented in Java. Figure 1 shows a snapshot of the starting interface of the tool. The user is able to Run or Annotate the input data (see Figure 1).
By clicking on Run, the tool starts to classify the input file, which contains a collection of utterances. After classifying the utterances, the output is saved as an Excel file. The user can also click on Annotate to annotate the data manually. Clicking on Annotate opens a new GUI (Figure 2) which contains the utterances. The user sees 10 utterances at each step, and for each utterance there is a drop-down list of categories from which the user selects one. The user can go on to the next utterances, or save the current annotation at any time and finish the rest after coming back.
Table 2. Agreement among the human judges and the computer's top three categories.
Human1-Human2: 70%
Human1-Comp1: 50%
Human2-Comp1: 47%
Human1-Comp2: 4%
Human2-Comp2: 3%
Human1-Comp3: 2%
Human2-Comp3: 3%
3. RESULTS
A collection of chat utterances is used as the data set to evaluate the algorithm. The system is trained on a collection of datasets derived from the Auto-Mentor frame board dataset. The test data is chosen from a collection of new chat data. We have chosen a hundred chat utterances from the data and tried to maintain a balanced distribution over the speech act categories, so that each category has 10 to 12 utterances in our test data. The test data is also annotated by two human judges.
The system runs the algorithm on the test data, and for each utterance we show the top three speech act categories based on their probability distribution in the decision tree. The top three categories are the ones with the highest probability; we represent them as Comp1, Comp2, and Comp3. Figure 3 represents the agreement of the human judges with Comp1, Comp2, and Comp3.
As our model is based on interactive machine learning, we compare the automated classification to the human judges in order to improve the model and retrain it with new, enhanced training data. To do this, we have calculated agreement among the humans and the computer. For each utterance we have five output categories: the top three assigned by our model, and the annotations by the two human judges. These five outputs are compared to investigate their agreement and evaluate the current model. We have looked at agreement both by speech act category and overall across the dataset.
Table 2 shows the overall agreement among the classifiers. Our human judges agree on 70.00% of the utterances. Agreement between our model's top category and the human judges is about 50%, while fewer than 5% of the human judgments agree with the second and third categories computed by our model. As mentioned earlier, Comp2 and Comp3 are the second and third most probable categories.
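The overall-agreement numbers of this kind amount to simple label-matching rates; a minimal sketch with hypothetical labels (not the actual study data):

```python
def agreement(labels_a, labels_b):
    """Fraction of utterances on which two annotators assign the same category."""
    assert len(labels_a) == len(labels_b)
    hits = sum(a == b for a, b in zip(labels_a, labels_b))
    return hits / len(labels_a)

# Hypothetical annotations for four utterances:
human1 = ["Question", "Statement", "Greeting", "Reaction"]
comp1 = ["Question", "Statement", "Request", "Reaction"]  # model's top category
print(agreement(human1, comp1))  # 0.75
```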
4. CONCLUSION
As the results show, our model performs close to the human judges. The tool can be used to improve the model by taking both human and computer annotations and enhancing the training data. The main goal of this tool is not only to automate the classification task, but also to provide features to improve the classifier. Both automated and manual annotation are easy to use through the interface, and the tool can be used in several applications and domains. The training data and J48 models can be changed externally for different domains.
5. REFERENCES
[1] Bagley, E. 2011. Stop Talking and Type: Mentoring in a Virtual and Face-to-Face Environmental Education Environment. Ph.D. thesis, University of Wisconsin-Madison.
[2] Moldovan, C., Rus, V., Graesser, A.C. 2011. Automated speech act classification for online chat. In Proceedings of FLAIRS 2011.
[3] D'Andrade, R.G., Wish, M. 1985. Speech act theory in quantitative research on inter-personal behavior. Discourse Processes, 8(2), 229-259.
EDM 2013 Author Index
Adjei, Seth: 304
Aghababyan, Ani: 362
Aleven, Vincent: 51, 161, 232
Andersen, Erik: 106
Anderson, John R.: 2
Arroyo, Ivon: 322, 344
Banjade, Rajendra: 402
Baraniuk, Richard: 90, 292, 324
Barnes, Tiffany: 82, 248, 372, 387
Batchelder, William: 394
Beck, Joseph: 4, 328, 344
Bergner, Yoav: 137, 350
Beuster, Liane: 396
Bidelman, Gavin: 145
Biswas, Gautam: 252
Blikstein, Paulo: 375
Boticario, Jesus G.: 348, 404
Bovo, Angela: 306, 390
Bowers, Alex: 177
Boyer, Kristy Elizabeth: 20, 43
Breslow, Lori: 312
Brunskill, Emma: 106, 260
Brusilovsky, Peter: 28
Burdescu, Dumitru Dan: 392
Butler, Eric: 106
Cabestrero, Raúl: 404
Cabredo, Rafael: 244
Cai, Zhiqiang: 326
Carin, Lawrence: 292
Carlson, Ryan: 12
Castro, Cristobal: 400
Chaturvedi, Ritu: 308
Chen, Zhenghao: 153
Chiritoiu, Marius Stefan: 392
Cobian, Jonathan: 364
Cohen, William: 98
Conati, Cristina: 220, 310
Corbett, Albert T.: 74
Craig, Scotty: 354
Crossley, Scott: 216
D'Mello, Sidney: 296, 364
Dai, Jianmin: 216
Daily, Zachary: 398
Davenport, Jodi: 260
Davoodi, Alireza: 220, 310
Deboer, Jennifer: 312
Defore, Caleb: 216
Desmarais, Michel: 224
Deziel, Melissa: 228
Dicerbo, Kristen: 314
Do, Chuong: 153
Dow, Steven P.: 51
Duong, Hien: 316
Duthen, Yves: 306, 390
Eagle, Michael
Elkina, Margarita
Ezeife, Christie
Ezen-Can, Aysu
Falakmasir, Mohammad H.: 28
Fancsali, Stephen: 35, 169, 338
Feng, Shi: 296
Forsgren Velasquez, Nicole: 362
Fortenbacher, Albrecht: 396
France, Stephen: 394
García-Saiz, Diego: 318
Gehringer, Edward: 346
Genin, Konstantin: 12
Gobert, Janice: 185
Golab, Lukasz: 228
Goldin, Ilya: 232
Gomes, July: 375
González-Brenes, Jose: 236
Gordon, Geoffrey J.: 28
Goslin, Kyle: 320
Gowda, Sujith M: 74
Graesser, Art: 296, 326
Graesser, Arthur: 354
Graesser, Arthur C.: 406
Grafsgaard, Joseph: 43
Gray, Geraldine: 240, 378
Greiff, Samuel: 336
Gross, Sebastian: 334
Hammer, Barbara: 334
Harpstead, Erik: 51
Hawkins, William: 59
Heffernan, Neil
Herold, James
Herold, Jim
Hershkovitz, Arnon
Hicks, Andrew
Ho, Andrew
Hofmann, Markus
Hu, Xiangen
Hua, Henry
Huang, Jon
Huang, Xudong
Hunter, Matthew
Heguy, Olivier
Inventado, Paul Salvador
J. Jacobson, Michael: 280
Jackson, G. Tanner: 272, 276, 368
Janisiewicz, Philip: 362
Jarusek, Petr: 256
Johnson, Matthew: 82, 248
Joshi, Ambarish: 169
Kappe, Leonard: 396
Kardan, Samad: 310
Kay, Judy: 121
Kelly, Kim: 322
Keshtkar, Fazel: 406
Kidwai, Khusro: 314
Kim, Jihie: 330
Kinnebrew, John: 252
Klusáček, Matej: 256
Koedinger, Kenneth: 232, 284, 358
Koedinger, Kenneth R.: 51, 98
Koller, Daphne: 153
Kretzschmar, Andre: 336
Kurhila, Jaakko: 300
Kyle, Kris: 216
Lan, Andrew: 90, 324
Lan, Andrew S.: 292
Legaspi, Roberto: 244, 384
Lehman, Blair: 296
Lemieux, Francois: 224
Lester, James: 43
Li, Haiying: 326
Li, Nan: 98
Li, Shoujing: 328
Likens, Aaron D.: 276
Lin, King-Ip: 354
Lintean, Mihai: 402
Litman, Diane: 200
Liu, Sen: 330
Liu, Yun-En: 106
Liu, Zhongxiu: 114
Luukkainen, Matti: 300
Mack, Daniel: 252
MacLellan, Christopher J.: 51
Macskassy, Sofus: 330
Malayny, John: 398
Mandel, Travis: 106
Markauskaite, Lina: 280
Martin, Taylor: 362
Martinez-Maldonado, Roberto: 121
Maull, Keith: 332
McGuinness, Colm: 378
Mclaughlin, Elizabeth: 284
Mcnamara, Danielle S.: 370
McNamara, Danielle S.: 368
Merceron, Agathe: 396
Mihaescu, Cristian: 392
Miller, L. Dee: 129
Mokbel, Bassam: 334
Mostafavi, Behrooz: 387
Mostow, Jack: 356
Murray, Tom: 208
Myers, Brad A.: 51
Müller, Jonas: 336
Ng, Andrew: 153
Niraula, Nobal: 402
Niraula, Nobal B.: 366
Nixon, Tristan: 35, 169, 338
Numao, Masayuki: 244, 384
O'Rourke, Eleanor: 106
Ocumpaugh, Jaclyn: 114
Oeda, Shinichi: 340
Olawo, Dayo: 228
Olmo, Juan Luis: 268
Olson, Robert: 398
Owende, Philip: 240, 378
Paassen, Benjamin: 334
Pardos, Zachary: 137
Pardos, Zachary A.: 28
Pataranutaporn, Visit: 114
Pavlik Jr., Philip I.: 145
Pawl, Andrew: 342
Pelánek, Radek: 256
Piech, Chris: 153
Pinkwart, Niels: 334
Popovic, Zoran: 106
Pritchard, David: 137
Pritchard, David E.: 312, 350
Pursian, Andreas: 396
Quirós, Pilar: 404
Rafferty, Anna: 260
Rai, Dovan: 344
Ramachandran, Lakshmi: 346
Rau, Martina: 161
Ravindran, Balaraman: 346
Rhodes, Nicholas: 264
Ritter, Steve: 169
Ritter, Steven: 338
Romero, Cristobal: 400
Rummel, Nikol: 161
Rus, Vasile: 402
Stefanescu, Dan: 366, 402
Studer, Christoph: 90, 292, 324
Stump, Glenda S.: 312
Sumner, Tamara: 332
Szkutak, Robert: 398
Tang, Quan: 354
Truchon, Lisa: 228
Ung, Matthew: 264
Vaghefi, Mahyar: 394
van de Sande, Brett: 193, 288
Varner, Laura: 272
Varner, Laura K.: 368
Vats, Divyanshu: 292
Vega, Benjamin: 296
Ventura, Sebastián: 268, 400
Vihavainen, Arto: 300
Wang, Jin: 354
Wang, Yutao: 59, 304, 316
Waters, Andrew: 90, 324
Wenzlaff, Boris: 396
Weston, Jennifer: 370
Wiebe, Eric N.: 43
Wiggins, Joseph B.: 43
Williams, Jamal: 145
Woolf, Beverly Park: 208
Worsley, Marcelo: 375
Xie, Jun: 354
Xiong, Wenting: 200
Xiong, Xiaolu: 4, 328
Xu, Xiaoxi: 208
Xu, Yanbo: 356
Yacef, Kalina: 121
Yamanishi, Kenji: 340
Yassine, Mohamed: 375
Yudelson, Michael: 358
Zheng, Zhilin: 360
Zhu, Linglong: 316
Zorrilla, Marta: 318
Zundel, Alex: 67
Zundel, Alexander: 264
ISBN: 978-0-9839525-2-7