Information and Learning Sciences
www.emeraldinsight.com/2398-5348.htm
© Emerald Publishing Limited, ISSN 2398-5348, DOI 10.1108/ILS-10-2018-0104

Predictive analytic models of student success in higher education: a review of methodology

Ying Cui and Fu Chen
Department of Educational Psychology, University of Alberta, Edmonton, Alberta, Canada
Ali Shiri
Department of Library and Information Studies, University of Alberta, Edmonton, Alberta, Canada

Received 11 October 2018; revised 24 January 2019 and 11 February 2019; accepted 19 February 2019
Abstract
Purpose – Many higher education institutions are investigating the possibility of developing predictive
student success models that use different sources of data available to identify students that might be at risk of
failing a course or program. The purpose of this paper is to review the methodological components related to
the predictive models that have been developed or currently implemented in learning analytics applications in
higher education.
Design/methodology/approach – The literature review was completed in three stages. First, the authors
conducted searches and collected related full-text documents using various search terms and keywords.
Second, they developed inclusion and exclusion criteria to identify the most relevant citations for the purpose
of the current review. Third, they reviewed each document from the final compiled bibliography and focused
on identifying information that was needed to answer the research questions.
Findings – In this review, the authors identify methodological strengths and weaknesses of current
predictive learning analytics applications and provide the most up-to-date recommendations on predictive
model development, use and evaluation. The review results can inform important future areas of research that
could strengthen the development of predictive learning analytics for the purpose of generating valuable
feedback to students to help them succeed in higher education.
Originality/value – This review provides an overview of the methodological considerations for
researchers and practitioners who are planning to develop or currently in the process of developing predictive
student success models in the context of higher education.
Keywords Higher education, Machine learning, Student success, Learning analytics,
Educational data mining, Methodology review, Predictive models
Paper type Literature review
Introduction
The 2016 Horizon Report Higher Education Edition (Johnson et al., 2016) predicts that
learning analytics will be increasingly adopted by higher education institutions across the
globe in the near future to make use of student data gathered through online learning
environments to improve, support and extend teaching and learning. The 2016 Horizon
report defines learning analytics as “an educational application of web analytics aimed at
learner profiling, a process of gathering and analyzing details of individual student
interactions in online learning activities” (p. 38). It can help to “build better pedagogies,
empower active learning, target at-risk student populations, and assess factors affecting
completion and student success” (p. 38). Terms such as “educational data mining,”
“academic analytics” and the more commonly adopted “learning analytics” have been used
in the literature to refer to the methods, tools and techniques for gathering very large
volumes of online data about learners and their activities and contexts. The advantages of
learning analytics have been enumerated by Siemens et al. (2011) and Siemens and Long
(2011), and some of the important ones include: early detection of at-risk students and
generating alerts for learners and educators; personalization and adaptation of the learning
process and content; extension and enhancement of learner achievement, motivation and
confidence by providing learners with timely information about their performance and that
of their peers; higher quality learning design and improved curriculum development;
interactive visualizations of complex information that give learners and educators the
ability to “zoom in” or “zoom out” on data sets; and more rapid achievement of learning
goals by giving learners access to tools that help them to evaluate their progress.
Many higher education institutions are beginning to explore the use of learning analytics
for improving student learning experiences (Sclater et al., 2016). According to a recent
literature review on learning analytics in higher education (Leitner et al., 2017), the most
popular strand of research in the field is to use student data to make predictions of their
performance (36 citations out of the total of 102 found in the literature review). The primary
goal of this area of research is to develop predictive student success models that make use of
different sources of data available within a higher education institution to identify students
who might be at risk of failing a course or program and could benefit from additional help.
This type of learning analytics research and application is important as it generates
actionable information that allows students to monitor and self-regulate their own learning,
as well as allows instructors to develop and implement effective learning interventions and
ultimately help students succeed.
The purpose of the present paper is to systematically review the methodological
components of the predictive models that have been developed or currently implemented in
the learning analytics applications in higher education. Student learning is a complex
phenomenon, as cognitive, social and emotional factors, together with prior experience, all
influence how students learn and perform (Illeris, 2006). As a result, to predict student
performance in a course or a program, many variables need to be considered, such as
cognitive variables associated with targeted knowledge and skills in the domain and socio-
emotional variables, such as engagement, motivation and anxiety. Student demographic
characteristics and past academic history are also often used in model building to reflect
information related to student prior experiences. Supervised machine learning techniques
such as logistic regression and neural networks are then applied to these student variables
to train and test the predictive models so as to estimate the likelihood of a student’s
successful passing of a course. Kotsiantis (2007) specified several key issues that are
consequential to the success of supervised machine learning applications, including variable
(i.e. attributes, features) selection, data preprocessing, choosing specific learning algorithms
and model validation. These issues are directly related to the steps of the typical process of
statistical modeling in quantitative research, which have guided us in terms of identifying
our research questions, as outlined below:
RQ1. What data sources and student variables were used to predict student
performance in higher education?
RQ2. How were data preprocessed and how were missing data handled prior to their
use in training, testing and validating predictive learning analytics models?
RQ3. Which machine learning techniques were used in developing predictive learning
analytics models?
RQ4. How were the accuracy and generalizability of the predictive learning analytics
models evaluated?
The main goal of this review is to provide an overview of the methodological considerations
for researchers and practitioners who are planning to develop or currently in the process of
developing predictive student success models in the context of higher education. The
answers to these four questions can provide a practical guide regarding the steps of
developing and evaluating predictive models of student success, from variable selection and
data preparation through results validation. The review also helps identify methodological
strengths and weaknesses of the current predictive learning analytics applications in higher
education so we can provide the most up-to-date recommendations on predictive model
development, use and evaluation. In this process, we also identify areas where research on
predictive learning analytics is lacking, which will inform important future areas of research
that could strengthen the development of predictive learning analytics for the purpose of
generating valuable feedback to students to help them succeed in higher education.
Method
Our literature review was completed in three stages. First, we conducted searches and
collected related full-text documents using various search terms and keywords related to
predictive learning analytics applications in higher education. The search strings include:
(student performance OR student success OR dropout OR student graduation OR at-risk student)
AND (systems OR application OR method OR process OR system OR technique OR methodology
OR procedure) AND (“educational data mining” OR “learning analytics”) AND (prediction).
We selected “learning analytics” and “educational data mining” as two widely and
interchangeably used search terms in the literature for this study. Siemens and Baker
(2012) enumerate the common research areas, interests and approaches between learning
analytics and educational data mining. Furthermore, Ferguson (2012) makes a
clear distinction between these two terms (i.e. learning analytics and educational data
mining) and academic analytics. Learning analytics and educational data mining address
technical and educational challenges to benefit and support students and faculty, whereas
academic analytics addresses political and economic challenges that benefit funders,
administrators and marketing at institutional, regional and government levels. Also, a quick
exact-phrase search in Google shows the popularity and the extent of information on
learning analytics (1,080,000 hits) and educational data mining (372,000 hits) as compared to
academic analytics (44,100 hits).
We conducted searches in four international databases of well-known academic
resources and publishers, namely, ScienceDirect, IEEE Xplore, ERIC and Springer. The
rationale for the choice of these four databases is that learning analytics, as an emerging
field of research and practice, involves the interdisciplinary area of science, social science,
education, engineering, psychology and other related fields. These four databases together
cover a broad spectrum of the interdisciplinary area involved in learning analytics. In
addition, they offer various scholarly products from conference proceedings, book chapters
and journal articles to funding agencies research reports, dissertations and policy papers.
For instance, ScienceDirect has an international coverage of physical sciences and
engineering, life sciences, health sciences and social sciences and humanities with over 12
million pieces of content from 4,051 academic journals and 28,417 books. ERIC has an
extensive coverage and collection of the literature in education and psychology with links to
more than 330,000 full-text documents. IEEE Xplore has a focus on computer science,
electrical engineering and electronics and allied fields and provides access to more than four
million documents. Springer covers a variety of topics in the sciences, social sciences and
humanities with over ten million scientific documents.
To filter out irrelevant articles, our review was narrowed down to journal articles, full-
text conference papers and book chapters that could be downloaded from the library website.
Full-text conference papers were included in our review based on the consideration that in
some fields such as computer science, conference papers are greatly valued as they are
typically peer reviewed and highly selective and considered to be more timely and with
greater novelty. According to Meyer et al. (2009), acceptance rates at selective computer
science conferences range between 10 and 20 per cent. The authors argued that “it is
important not to use journals as the only yardsticks for computer scientists” (p. 32). In
addition, given the emerging nature of learning analytics as a research and development
domain and that many new learning analytics systems and applications and their empirical
studies tend to be reported at conferences, we decided to include a broad range of scholarly
publications, including conference proceedings, to capture the recent literature of the area. A
cursory look at our reviewed papers shows a reasonable combination of journal articles and
conference papers, with conference papers constituting some of the publications after the
year 2015. We understand that some conference proceedings may not be as rigorous as
journal papers, but we wanted to ensure that the recent studies of the area are captured, even
if they are presented in conference proceedings. Our search process yielded 742 results from
all four databases, which formed the initial list of citations. The publication time of the
selected citations spanned from 2002 to early 2018. Figure 1 displays the number of
publications reviewed over time, which shows that research on learning analytics has
gained more and more popularity in recent years.
Second, we developed inclusion and exclusion criteria to identify the most relevant
citations for the purpose of the current review. For this review, we excluded short conference
papers and abstracts because of their typical lack of detailed information about
methodologies. Because of our focus on the practical methodological considerations during
modeling process of real data applications in higher education, we excluded studies
conducted in educational settings other than higher education (e.g. high schools); we also
excluded citations that are purely theoretical or conceptual without empirical data/results. In
addition, we excluded studies that focused on clustering students into different groups
based on their academic behavior or background. Although students who are grouped
together might share similar profiles that could be linked to student success or dropout, our
review focused on explicit predictive models with specific predictor variables (i.e. variables
used to predict another variable), such as student background variables or activity data
from learning management systems, and the outcome variable (i.e. the variable whose value
depends on predictor variables), such as student course grades or last-year grade point
average (GPA). As a result, of the 742 citations compiled from the first stage of the literature
review, a total of 121 citations remained after applying our exclusion criteria.
Third, we reviewed each document from the final compiled bibliography of 121 articles
and focused on identifying information that was needed to answer our research questions
regarding the four methodological components of predictive algorithms, namely: data
sources and student variables; data preprocessing and missing data handling; machine
learning techniques; and evaluation of accuracy and generalizability.
We synthesize the current practice and findings from the 121 articles and conclude our
review with a number of recommendations for predictive algorithm development, analysis
and use based on the literature and our own evaluation, and in this process, we highlight
important areas for further research.
Results
Based on our review, there are two major categories of studies that focused on the prediction
of student performance in the higher education context. Of the 121 articles reviewed in our
study, the majority of studies (a total of 86 studies) focused on the prediction of student
performance and achievement at the course level in specific undergraduate or graduate
courses. These courses are delivered in a variety of formats, including traditional face-to-
face, online or blended. In these studies, student performance and achievement are typically
measured by their assignment scores or final grades on a variety of different scales,
including continuous scales (e.g. percentages), binary scales (e.g. pass or fail) and categorical
scales (e.g. fail, good, very good, or excellent). Course-level prediction of student course
performance is intended to help individual instructors monitor student progress and predict
how well a student will perform in the course so early interventions can be implemented.
Course-level predictions have also been applied to student outcomes in massive open online
courses (MOOCs) in a number of studies (Al-Shabandar et al., 2017; Boyer and
Veeramachaneni, 2015; Brinton et al., 2016; Chen et al., 2016; Deeva et al., 2017; Hughes and
Dobbins, 2015; Kidziński et al., 2016; Klüsener and Fortenbacher, 2015; Li et al., 2017;
Liang et al., 2016; Pérez-Lemonche et al., 2017; Ruipérez-Valiente et al., 2017;
Xing et al., 2016; Yang et al., 2017; Ye and Biswas, 2014). The primary aim of predictions for
MOOCs is to identify inactive students to prevent early dropout. As a result, the outcome
variable being predicted in MOOCs is typically course completion or dropout. Another type
of course-level prediction is to estimate student performance in future courses (Elbadrawy
et al., 2016; Polyzou and Karypis, 2016; Sweeney et al., 2016), which could help students
select courses in which they are predicted to succeed and therefore create personalized
degree pathways to facilitate successful and timely graduation.
The second category of studies of predicting student performance (a total of 35 studies)
has focused on the program-level prediction of student outcome in higher education
institutions, including student overall academic performance as measured by student
cumulative GPA (CGPA) or GPA at graduation, student retention or degree completion. For
example, Dekker et al. (2009) predicted the dropout of electrical engineering students after
the first semester of their studies. This type of prediction can provide important information
to senior administrators regarding institutional accountability and strategies with the goal
to maintain and improve student retention and graduation rates.
Although the aims of the course-level and program-level predictions are generally
different, these studies share similar methodological components and considerations, with
some minor differences. We present the results of our methodological review of the 121
articles in the following four subsections, each related to one of the research questions
outlined in the Introduction.
Predictor variables
Course-level prediction. Student demographic and academic background variables have
been widely used to predict student course performance (Evale, 2016; Badr et al., 2016).
Examples include student gender, age, nationality, full-time or part-time status, educational
background and admission scores.
In addition to student variables, features of the course such as course modality, discipline
and enrollment (Ornelas and Ordonez, 2017), as well as the quality and teaching styles of the
instructor (Corrigan and Smeaton, 2017; Sweeney et al., 2016), have been used to predict
student performance.
Program-level prediction. At the program level, predictions of student outcome rely
mostly on student demographics and previous academic background, as specific
information related to individual courses is no longer directly relevant. Guarín et al. (2015)
predicted student academic attrition based on these two categories of variables. The
demographic variables the authors chose included age at admission, gender, city of origin,
socio-economic classification and ethnicity. Variables related to the previous academic
background included high school type (public or private), type of access (regular or special
admission program), option in which the student chose the program (from 1, first option,
to 3), the previous program, if one exists, admission test scores in five modules (i.e. Sciences, Math,
Image, Text and Social studies) and classification levels for Basic Math and Literacy.
In addition to the use of conventional student demographic and academic background
variables, Uddin and Lee (2017) analyzed social networking-based data such as Facebook
and Twitter to extract student personality traits (i.e. openness, conscientiousness,
extraversion, agreeableness and neuroticism) and other relevant features (e.g. # of Facebook
friends, # of Facebook posts daily, type of posts involved and liking activity) to improve the
predictive modeling of student performance.
Ogihara and Ren (2017) predicted student retention at the program level using linguistic
features extracted from the college admission application essays from the admission
database CommonApp used by many universities and colleges in the USA for streamlining
admission processes. Three sets of linguistic features were generated from text analysis:
(1) latent Dirichlet allocation (LDA)-based topic modeling with a variety of topic
numbers;
(2) linguistic inquiry and word count; and
(3) part-of-speech (POS) distribution.
The results show that the POS distribution features yielded the best prediction performance
among these three. However, the cross validation error was considerably high, suggesting
that the predictive model was not directly generalizable to other data sets.
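To illustrate the first of the three feature sets, LDA-based topic features can be derived from application essays with standard libraries. The sketch below uses scikit-learn and a handful of invented toy "essays" (these texts, and the choice of two topics, are purely illustrative assumptions, not the features or settings used by Ogihara and Ren):

```python
# Sketch: deriving LDA topic-proportion features from short texts.
# The toy "essays" and the two-topic setting are hypothetical.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

essays = [
    "I love mathematics and solving hard problems",
    "volunteering at the hospital inspired my interest in medicine",
    "building robots taught me engineering and teamwork",
    "reading novels shaped how I think about people",
]

# Bag-of-words counts, then a 2-topic LDA model
counts = CountVectorizer().fit_transform(essays)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row of topic weights per essay; each row sums to 1 and can be
# fed into a downstream retention classifier as predictor variables
topic_features = lda.fit_transform(counts)
```

The resulting topic proportions (here a 4 x 2 matrix) would then serve as numeric predictors alongside other student variables.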
Variable selection. In statistical modeling, variable selection, also known as feature
selection, is the process of selecting a set of relevant variables (i.e. features, predictors) for
use in model construction. As summarized by Guyon and Elisseeff (2003), there are three
main reasons for variable selection in machine learning-related research, namely, improving
the predictive power of the models, making faster and more cost-effective predictions and
providing a better understanding of the processes underlying the data. Variable selection is
especially important when a large number of potential student variables are available but
with a limited sample size.
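A minimal sketch of one common variable selection approach, filter-based ranking, is shown below with scikit-learn on synthetic data (the sample sizes and variable counts are invented for illustration and do not come from any reviewed study):

```python
# Sketch: filter-based variable selection on synthetic "student" data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 200 students, 20 candidate predictors, only 5 truly informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

# Rank predictors by ANOVA F-score and keep the 5 strongest
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (200, 5)
```

This directly addresses the scenario above: many candidate variables, a limited sample, and a need to keep only the predictors that carry signal.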
Among the articles we reviewed, only a few studies briefly discussed their variable
selection techniques. Hart et al. (2017) used all-subsets regression to reduce the total number
of predictor variables that would be entered into their final analysis, dominance analysis,
which is computationally intensive and limited to a maximum of ten predictor
variables. Badr et al. (2016) and Ibrahim and Rusli (2007) used the rankings of the
Jayaprakash et al. (2014) deleted variables with 20 per cent or more missing values. Han
et al. (2017) deleted students with more than two missing values; if a student had one or
two missing values, they replaced each with the mean of the variable. Chuan et al. (2016)
simply deleted all data with missing values. Al-Saleem et al. (2015) replaced all the missing
values with the mean by calculating the average grade in the course from other students.
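The two dominant strategies reported above, listwise deletion and mean imputation, can be contrasted in a few lines of pandas (the grade table below is a hypothetical illustration, not data from the reviewed studies):

```python
# Sketch contrasting two missing-data strategies from the reviewed
# studies: listwise deletion and mean imputation. Data are invented.
import numpy as np
import pandas as pd

grades = pd.DataFrame({
    "quiz1":   [80.0, np.nan, 60.0, 90.0],
    "quiz2":   [70.0, 85.0, np.nan, 95.0],
    "midterm": [75.0, 88.0, 62.0, np.nan],
})

# Strategy 1: delete every student (row) with any missing value
deleted = grades.dropna()

# Strategy 2: replace each missing value with the variable (column) mean
imputed = grades.fillna(grades.mean())
```

Deletion keeps only the one fully observed student here, while imputation retains all four at the cost of shrinking the variance of each variable toward its mean.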
Table II. Machine learning techniques and their corresponding number of publications

Technique                  Publications
DT                         46
Naïve Bayes                32
SVM                        26
Neural networks and MLP    26
RF                         23
Logistic regression        22
K-nearest neighbor         16
Other                      25
previous students in different courses. J48 showed superior performance, with a higher overall
accuracy of 83.75 per cent, compared to that of ID3, 69.27 per cent.
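J48 is Weka's implementation of C4.5; a rough open-source analog is scikit-learn's CART tree with an entropy split criterion. The sketch below uses synthetic data (the sample size, depth limit and accuracy are illustrative assumptions, not results from the studies above):

```python
# Sketch: a decision-tree classifier in the spirit of J48/C4.5
# (scikit-learn's CART with an entropy criterion is a rough analog).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Limiting depth regularizes the tree against overfitting
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=1)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```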
NBC is a simple probabilistic classifier that calculates the conditional probability of the
data (given the class membership) by applying Bayes’ theorem and assuming conditional
independence among the predictors given the class (Friedman et al., 1997). The conditional
independence assumption greatly simplifies the calculation of the conditional probability of
the data by reducing it to the product of the likelihood of each predictor. Despite the
oversimplified assumption that is often violated in practice (e.g. student academic
background and midterm grade may not be conditionally independent), the NBC has shown
excellent performance that could be comparable to more advanced methods such as SVM.
For example, Marbouti et al. (2016) compared the performance of seven different predictive
models for identifying at-risk students in an engineering course and found that NBC
exhibited superior performance compared to other models.
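The conditional-independence simplification described above is visible in a Gaussian NBC sketch: the classifier multiplies per-predictor likelihoods to obtain class posteriors (synthetic data; an illustrative assumption, not a reproduction of Marbouti et al.):

```python
# Sketch: Gaussian naïve Bayes, which multiplies per-predictor
# likelihoods under the conditional-independence assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

nbc = GaussianNB()
nbc.fit(X_train, y_train)

# Posterior probabilities over the two classes sum to 1 per student
posteriors = nbc.predict_proba(X_test)
accuracy = nbc.score(X_test, y_test)
```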
SVM finds a hyperplane that classifies data into two categories (Cortes and Vapnik,
1995). SVM uses a kernel function to map the data from the original space into a new feature
space and finds an optimal decision boundary with the maximum margin from data in both
categories. SVM is suited to learning tasks with a large number of features (or predictors)
relative to the size of training sample. This property makes SVM a desirable technique for
the analysis of the learning management data in which a large number of student features
are available. For example, SVM was adopted by Corrigan et al. (2015) because with SVM,
not all of the extracted features from the log data:
Have to be actually useful in terms of discriminating different forms of student outcome [. . .] we
can be open-minded about how we represent students’ online behaviour and if a feature is not
discriminative, the SVM learns this from the training material (p. 47).
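A minimal SVM sketch with a kernel mapping, many features relative to the sample, and the standard practice of scaling predictors first (the data and settings are illustrative assumptions, not those of Corrigan et al.):

```python
# Sketch: an SVM with an RBF kernel, which maps predictors into a new
# feature space and fits a maximum-margin decision boundary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Many features (30) relative to the training sample, as in LMS data
X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Standardizing predictors first is standard practice for SVMs
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
accuracy = svm.score(X_test, y_test)
```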
ANNs were initially developed to mimic basic principles of biological neural systems where
information processing is modeled as the interactions between numerous interconnected
nerve cells or neurons. ANNs can also serve as a highly flexible nonlinear statistical
technique for modeling complex relationships between inputs and output. MLP is perhaps
the most well-known supervised ANN. An MLP is a network of neurons (i.e. nodes) that are
arranged in a layered architecture. Typically, this type of ANNs consists of three or more
layers: one input layer, one output layer and at least one hidden layer. Statistically, the MLP
functions similar to a nonlinear multivariate regression model. The layer of input neurons is
analogous to the set of predictor variables, whereas the layer of output neurons is analogous
to the outcome variables. The relationship between the input and output layers is parallel to
the mathematical functional form in the regression model. The number of nodes in the
hidden layer is typically chosen by the user to control the degree of nonlinearity between
predictors and the outcome variables. With more nodes in the hidden layer, the relationship
between predictors and outcome variables becomes more nonlinear in the MLP model. It has
been mathematically demonstrated that the MLP, given a sufficient number of hidden
nodes, can approximate any nonlinear function to any desired level of accuracy (Dawson
and Wilby, 2004; Hornik et al., 1989). Rachburee et al. (2015) developed predictive models
with five classification techniques, namely, DT, NBC, k-nearest neighbors, SVM and MLP.
The results show that MLP generates the best prediction with 89.29 per cent accuracy.
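The three-layer architecture described above, with the hidden-layer size as the user-chosen knob controlling nonlinearity, can be sketched as follows (synthetic data; the 16-node hidden layer is an illustrative assumption, not Rachburee et al.'s configuration):

```python
# Sketch: a multilayer perceptron with one hidden layer; more hidden
# nodes allow a more nonlinear mapping from predictors to the outcome.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# One input layer (10 predictors), one 16-node hidden layer, one output
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=4)
mlp.fit(X_train, y_train)
accuracy = mlp.score(X_test, y_test)
```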
RF is an ensemble classifier built on DTs. In DT, improper constraints or regularizations
on trees may result in overfitting the training data. Models with the problem of overfitting
show low bias and high variance, which imply that they cannot be well generalized to other
external data sets. RF was proposed to deal with this overfitting problem to improve the
model prediction and generalizability. In RF, the bagging method, or bootstrap aggregating,
is used to aggregate the predictions. Specifically, a bootstrap sampling approach with Predictive
replacement is used to obtain multiple subsets of the training data. For each subset data, a analytic models
DT is then built, which considers only a subset of features. These DTs for different subset
data constitute a forest (i.e. a multitude of DTs) for the whole data set. Multiple classes or
in higher
predicted values from different DTs thus can be obtained, and RF outputs the mode of education
predicted classes (for classification) or the mean of predicted values (for regression) as the
final prediction. As such, by considering different subsets of samples and features, RF
introduces randomness and diversity into the model, which improves the model
generalizability. RF has been shown to be a powerful and efficient classifier in the literature. For
example, in their study on the prediction of assignment grades with student online learning
behaviors and demographic information extracted from the MOOC data, Al-Shabandar et al.
(2017) found that RF largely outperformed the seven other classifiers considered in the study.
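The bagging-plus-random-feature-subset recipe described above maps directly onto scikit-learn's random forest (synthetic data; the forest size is an illustrative assumption, not Al-Shabandar et al.'s setting):

```python
# Sketch: a random forest, i.e. bagged decision trees that each see a
# bootstrap sample of students and a random subset of predictors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=12, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# 100 bootstrap trees, each splitting on sqrt(12) randomly chosen
# features; the forest outputs the modal class across trees
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=5)
rf.fit(X_train, y_train)
accuracy = rf.score(X_test, y_test)
```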
Logistic regression is a classical multivariate statistical procedure used to predict a
categorical outcome variable from a set of continuous, categorical or both types of predictor
variables. When the outcome variable has only two categories, the probability of the
outcome being in one category can be modeled as a sigmoid function of the linear
combination of predictors. The model parameters can be estimated by maximizing the log
likelihood of obtaining the observed data. For example, Jayaprakash et al. (2014) used
logistic regression, among three other techniques, to predict whether students are at risk or
in good standing in a course. The predictors included student age, gender, SAT scores, full-
time or part-time status, academic standing, cumulative GPA, year of study, score computed
from partial contributions to the final grade, number of Sakai courses sessions opened by
the student and number of times a section is accessed by the student. Logistic regression
was found to outperform other techniques, with a better combination of high recall, low
percentage of false alarms and higher precision in predicting at-risk students.
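The sigmoid-of-a-linear-combination form described above can be verified directly: a fitted binary logistic regression's probabilities equal sigmoid(w·x + b) computed by hand (synthetic data; an illustrative sketch, not Jayaprakash et al.'s model):

```python
# Sketch: logistic regression models the probability of the at-risk
# class as a sigmoid of a linear combination of predictors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=9, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Hand-computed sigmoid(w.x + b) matches predict_proba exactly
z = X_test @ clf.coef_.ravel() + clf.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))
sklearn_prob = clf.predict_proba(X_test)[:, 1]
```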
In k-fold cross validation, the data set is partitioned into k subsamples; in each replication,
one subsample is retained as the testing data and the remaining k - 1 subsamples are used
to train the model. The process is repeated k times, with each of the k subsamples used once
as the testing data. The results from the k replications are then averaged to produce the final
estimation.
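The k-fold procedure above is a one-liner in scikit-learn; the sketch below averages five fold-level accuracy estimates on synthetic data (the classifier and fold count are illustrative assumptions):

```python
# Sketch: k-fold cross validation; each fold serves once as the testing
# data and the k accuracy estimates are averaged.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# cv=5 trains on 4/5 of the data and tests on the held-out 1/5, 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_accuracy = scores.mean()
```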
In addition to the use of cross validation, a few studies we reviewed evaluated the model
generalizability by applying the generated model to data from other academic years or from
other institutions. For example, Gray et al. (2016) trained the predictive model with data
from the 2010 and 2011 student cohort and tested it with data from the 2012 student cohort.
Boyer and Veeramachaneni (2015) referred to the use of models trained on previous courses for
real-time prediction in a subsequent offering of the same course (or other new courses) as
transfer learning. Multiple transfer learning methods were proposed, such as the naïve
transfer method, the multi-task learning method and the logistic-regression-with-prior method. The
authors argue that transfer learning is of great importance for real-time predictions in
learning analytics. Furthermore, the Open Academic Analytics Initiative program
(Jayaprakash et al., 2014) researched issues related to the scaling up of predictive learning
analytics across different higher education institutions. A predictive model trained with
Marist College data was applied to data from several other institutions.
Conclusion
This methodology review aims to provide researchers and practitioners with a survey of the
literature on learning analytics with a particular focus on the predictive analytics in the
context of higher education. Learning analytics is still an emerging field in education (Avella
et al., 2016). The adoption and application of learning analytics in higher education is still
mostly small-scale and preliminary. Student data captured within higher education
institutions (e.g. learning management systems, student information systems and student
services) have yet to be properly integrated, analyzed and interpreted to realize their full
potential for providing valuable insight for students and instructors to facilitate and support
learning. Sound analytical methodology is the central tenet of any high-quality learning
analytics application. The aim of the current study was to help better understand the current
state of the methodology in the development of predictive learning analytic models by
systematically reviewing issues related to:
data sources and student variables;
data preprocessing and handling;
machine learning techniques; and
evaluation of accuracy and generalizability.
Summary of results and conclusions
Data sources and student variables. Most of the reviewed studies make use of multiple data sources and student variables in the modeling process to enhance prediction accuracy. For course-level prediction, student intermediate course performance data (e.g. marks on quizzes and midterms), student log data from learning management systems (e.g. logins and downloads) and student demographics and previous academic history have been the most
often used predictors of student performance. Given that student learning involves both
cognitive and socio-emotional competencies, a few studies collected data through surveys and questionnaires measuring student self-reported learning attitudes, strategies, difficulties and self-evaluations, and used them to predict student performance.
Features of courses and instructors have also been used as predictors considering the
importance of contextual information for learning. For program-level prediction, student
demographic and academic backgrounds are the most typical predictors chosen. Social networking-based variables have also been researched as possible predictors. However, the
results so far are unclear as to whether, and to what extent, these variables have significantly improved prediction accuracy.
Data preprocessing and handling. Although data preprocessing and missing data handling
are critical for successful predictive learning analytic applications, few studies we reviewed
have presented detailed information about this process. Of the few citations that documented data preprocessing, variable normalization, data anonymization, translation
of student records, discretization of continuous variables, removal of irrelevant information in
data and information extraction from raw log files have been reported at the stage of data
preprocessing. Regarding missing data handling, none of the studies we reviewed provided
information on the extent of missing values in the data, the patterns of the missing data and the
justification of the selected approach for handling missing data. For the few studies that
reported how they handled the missing data, simple procedures such as mean replacement and
listwise deletion (i.e. deleting cases with missing values) were often used.
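The two simple procedures named above, mean replacement and listwise deletion, can be sketched as follows. The record layout and variable names are hypothetical; `None` marks a missing value.

```python
def mean_impute(records, field):
    """Mean replacement: fill missing values in one field with the
    mean of the observed values for that field."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = sum(observed) / len(observed)
    return [dict(r, **{field: fill if r[field] is None else r[field]})
            for r in records]

def listwise_delete(records):
    """Listwise deletion: drop every record with any missing value."""
    return [r for r in records if None not in r.values()]

students = [
    {"quiz": 80, "logins": 12},
    {"quiz": None, "logins": 4},   # quiz mark missing
    {"quiz": 60, "logins": None},  # log data missing
]

imputed = mean_impute(students, "quiz")   # second quiz mark becomes 70.0
complete = listwise_delete(students)      # only the first record remains
```

The sketch also makes the cost of each procedure visible: mean replacement distorts the variance of the imputed field, while listwise deletion here discards two-thirds of the sample, which is why the review calls for justifying the chosen approach.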
Machine learning techniques. The most frequently used and successful techniques in the
literature of predictive learning analytics appear to be DT, NBC, SVM, ANNs, RF and
logistic regression. Of these six techniques, SVM and ANNs are considered “black-box”
techniques in the sense that one cannot know exactly how the prediction is derived and how
to interpret the meaning of different parameters in the model. In comparison, results of DT
are highly interpretable, as the set of developed rules is simple to understand and can describe
clearly the process of the prediction. However, the disadvantage of DT is its instability,
meaning that small changes in the data might lead to different tree structures and sets of
rules. For example, Jayaprakash et al. (2014) applied DT to 25, 50, 75 and 100 per cent of the
training data and found that the method exhibited unstable performance when varying the
sample size. RF, logistic regression and NBC appear to be good options for predictive
learning analytic applications.
Evaluation of accuracy and generalizability. Measures based on the percentages of correct
predictions, such as the overall prediction accuracy, precision, recall and F-measure, are the most
often used measures for evaluating the performance of predictive models. However, as
argued by Fawcett (2004), these measures may be problematic for unbalanced classes where
one class dominates the sample. For example, when the class distribution is highly skewed
with 90 per cent of students passing, a model can have a high overall prediction accuracy by
simply assigning every student to the majority class. Unbalanced classes are common in the
area of predictive learning analytics, given that typically a relatively small percentage of
students fail a course or drop out of a program. Good performance measures of predictive
modeling should not be influenced by the class distributions in the sample. An example is
ROC curves, which have the desirable property of being insensitive to changes in class
distributions. Another way to evaluate the performance of predictive models is by
examining the effectiveness of interventions designed based on the model-derived
predictions of student performance. This type of result can strengthen the practical use of
predictive models in real settings.
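The imbalance problem above is easy to demonstrate on a synthetic sample where 90 per cent of students pass: an "everyone passes" baseline reaches 90 per cent accuracy while identifying none of the at-risk students. The class of interest here is failing, coded 0; the data are fabricated for illustration.

```python
def metrics(actual, predicted, positive=0):
    """Overall accuracy, plus precision and recall for the class of
    interest (here 'fail', coded 0)."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

actual = [1] * 90 + [0] * 10   # 90 passing students, 10 failing
majority = [1] * 100           # baseline: predict 'pass' for everyone

acc, prec, rec = metrics(actual, majority)
print(acc, prec, rec)  # 0.9 accuracy, yet 0.0 recall for at-risk students
```

High overall accuracy with zero recall for the failing class is exactly the failure mode Fawcett's argument warns about, and why class-insensitive measures such as ROC curves are preferred for skewed samples.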
To evaluate the generalizability of predictive models, cross validation has been routinely
utilized in the learning analytic literature. This is a good practice considering the possibility
of model overfitting with the use of machine learning techniques in learning analytics
research. Although cross validation is important, it does not provide strong evidence to
show that the model can be generalized to other contexts or settings. Another, perhaps more
rigorous, way to examine the model generalizability is to apply the generated model to data
from other academic years or from other institutions.
Second, the majority of research articles, book chapters and conference presentations
available in the literature to date have focused on the programmatic aspect of model
development, and these publications are mostly led by researchers in the field of computer
science. This aspect of research is important, and continued efforts are needed. However,
student learning is a complex phenomenon, as many factors (e.g. cognitive, socio-emotional and background variables) influence the learning process and outcome (Illeris,
2006). Therefore, understanding the cognitive and socio-emotional aspects of human learning
and achievement is also a crucial component for predictive learning analytics, which has
received much less attention. Based on our review of the predictive learning analytic
literature, there is a clear gap in the development of theoretical frameworks and input from
content experts and educators to support and inform key decision-making during the
process of model building. From a theoretical perspective, two questions arise: What student
features are important predictors of the student outcome? How do these features interact
with each other and together influence the outcome? These are examples of important
questions that cannot be solved solely by computer programs. Results from studies in
cognitive science and learning domain knowledge provide valuable insights into how
students learn content and perform tasks, which should be injected into the data pre-
processing and analysis phase to best address the research questions. This calls for a close
collaboration among educators, domain experts, cognitive scientists and data scientists in
building predictive models that aim at providing useful information to benefit student
learning and classroom teaching.
Third, very few studies have discussed how the results of predictions generated
from the model should be best used to help students. If a model predicts that a student
is likely to fail the course, what information should be provided to the student so that he/she can take action to improve learning? To answer this question, we need to understand how the prediction is made, which information/variable is most relevant and whether changes the student makes can increase his/her likelihood of passing the course. This bears implications for predictive modeling techniques. To
develop a clear understanding of the process that derives the prediction, the black-box
type of techniques, such as SVM and artificial neural networks, may not be ideal for interpretative purposes. If available, student behavioral variables (e.g. student activities recorded in learning management systems) should be considered as potential predictors, as these variables are useful for generating actionable information that helps design interventions. Based on the results, for example, feedback related to
how students can change their behaviors (e.g. participate in group discussions or
submit assignments on time) to increase their chance of success in the course can be
provided. When demographic variables and student past academic history are used as
the only predictors of student performance (Valdiviezo-Díaz et al., 2015; Al-Shabandar
et al., 2017; Roy and Garg, 2017; Guarín et al., 2015; Rubiano and Garcia, 2015),
instructors should be encouraged to generate feedback based on further examination and comparison of resource use and activities between the groups of students predicted as passing and failing the course. Furthermore, instructors can encourage students to have face-to-face meetings with them or to visit various student support centers on campus, such as a student success center or student accessibility services.
On a related note, based on our review, student intermediate performance data are often
used as potential predictors of final course performance. The use of intermediate
performance data seems to be logical as these data can naturally serve as measures/
indicators of student learning progress in the course. It is also a common practice in higher
education that student marks on quizzes and midterms account for certain percentages of
the final marks. When these percentages are high, it is important to make early predictions.
For example, if the midterm performance accounts for a high percentage of the final mark, it
will be desirable to make predictions before the midterm so that students can reflect on their
learning process and change their behaviors to increase their midterm scores, which in turn
increases their chance of success in the course.
References
Abdous, M.H., Wu, H. and Yen, C.J. (2012), “Using data mining for predicting relationships between
online question theme and final grade”, Journal of Educational Technology and Society, Vol. 15
No. 3, pp. 77-88.
Almutairi, F.M., Sidiropoulos, N.D. and Karypis, G. (2017), “Context-aware recommendation-based
learning analytics using tensor and coupled matrix factorization”, IEEE Journal of Selected
Topics in Signal Processing, Vol. 11 No. 5, pp. 729-741.
Al-Saleem, M., Al-Kathiry, N., Al-Osimi, S. and Badr, G. (2015), “Mining educational data to predict
students’ academic performance”, International Workshop on Machine Learning and Data
Mining in Pattern Recognition, Springer, Cham, pp. 403-414.
Al-Shabandar, R., Hussain, A., Laws, A., Keight, R., Lunn, J. and Radi, N. (2017), “Machine learning
approaches to predict learning outcomes in massive open online courses”, 2017 International
Joint Conference on Neural Networks (IJCNN), IEEE, pp. 713-720.
Avella, J.T., Kebritchi, M., Nunn, S.G. and Kanai, T. (2016), “Learning analytics methods, benefits,
and challenges in higher education: a systematic literature review”, Online Learning, Vol. 20
No. 2, pp. 13-29.
Badr, G., Algobail, A., Almutairi, H. and Almutery, M. (2016), “Predicting students’ performance in
university courses: a case study and tool in KSU mathematics department”, Procedia Computer
Science, Vol. 82, pp. 80-89.
Boyer, S. and Veeramachaneni, K. (2015), “Transfer learning for predictive models in massive open online
courses”, International Conference on Artificial Intelligence in Education, Springer, Cham, pp. 54-63.
Brinton, C.G., Buccapatnam, S., Chiang, M. and Poor, H.V. (2016), “Mining MOOC clickstreams: Video-
watching behavior vs. in-video quiz performance”, IEEE Transactions on Signal Processing,
Vol. 64 No. 14, pp. 3677-3692.
Chen, Y., Chen, Q., Zhao, M., Boyer, S., Veeramachaneni, K. and Qu, H. (2016), “DropoutSeer: visualizing
learning patterns in massive open online courses for dropout reasoning and prediction”, 2016
IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, pp. 111-120.