

Predictive analytic models of student success in higher education
A review of methodology

Ying Cui and Fu Chen
Department of Educational Psychology, University of Alberta, Edmonton, Alberta, Canada

Ali Shiri
Department of Library and Information Studies, University of Alberta, Edmonton, Alberta, Canada, and

Yaqin Fan
Department of Educational Technology, Northeast Normal University, Changchun, Jilin, China

Received 11 October 2018
Revised 24 January 2019, 11 February 2019
Accepted 19 February 2019

Abstract
Purpose – Many higher education institutions are investigating the possibility of developing predictive
student success models that use different sources of data available to identify students that might be at risk of
failing a course or program. The purpose of this paper is to review the methodological components related to
the predictive models that have been developed or currently implemented in learning analytics applications in
higher education.
Design/methodology/approach – The literature review was completed in three stages. First, the authors
conducted searches and collected related full-text documents using various search terms and keywords.
Second, they developed inclusion and exclusion criteria to identify the most relevant citations for the purpose
of the current review. Third, they reviewed each document from the final compiled bibliography and focused
on identifying information that was needed to answer the research questions.
Findings – In this review, the authors identify methodological strengths and weaknesses of current
predictive learning analytics applications and provide the most up-to-date recommendations on predictive
model development, use and evaluation. The review results can inform important future areas of research that
could strengthen the development of predictive learning analytics for the purpose of generating valuable
feedback to students to help them succeed in higher education.
Originality/value – This review provides an overview of the methodological considerations for
researchers and practitioners who are planning to develop or currently in the process of developing predictive
student success models in the context of higher education.
Keywords Higher education, Machine learning, Student success, Learning analytics,
Educational data mining, Methodology review, Predictive models
Paper type Literature review

Introduction
The 2016 Horizon Report Higher Education Edition (Johnson et al., 2016) predicts that
learning analytics will be increasingly adopted by higher education institutions across the
globe in the near future to make use of student data gathered through online learning
environments to improve, support and extend teaching and learning. The 2016 Horizon
report defines learning analytics as “an educational application of web analytics aimed at
learner profiling, a process of gathering and analyzing details of individual student
interactions in online learning activities” (p. 38). It can help to “build better pedagogies,
empower active learning, target at-risk student populations, and assess factors affecting
completion and student success” (p. 38). Terms such as “educational data mining,”
“academic analytics” and the more commonly adopted “learning analytics” have been used
in the literature to refer to the methods, tools and techniques for gathering very large
volumes of online data about learners and their activities and contexts. The advantages of
learning analytics have been enumerated by Siemens et al. (2011) and Siemens and Long
(2011), and some of the important ones include: early detection of at-risk students and
generating alerts for learners and educators; personalization and adaptation of learning
process and content; extension and enhancement of learner achievement, motivation and
confidence by providing learners with timely information about their performance and that
of their peers; higher quality learning design and improved curriculum development;
interactive visualizations of complex information that give learners and educators the
ability to “zoom in” or “zoom out” on data sets; and more rapid achievement of learning
goals by giving learners access to tools that help them to evaluate their progress.
Many higher education institutions are beginning to explore the use of learning analytics
for improving student learning experiences (Sclater et al., 2016). According to a recent
literature review on learning analytics in higher education (Leitner et al., 2017), the most
popular strand of research in the field is to use student data to make predictions of their
performance (36 citations out of the total of 102 found in the literature review). The primary
goal of this area of research is to develop predictive student success models that make use of
different sources of data available within a higher education institution to identify students
who might be at risk of failing a course or program and could benefit from additional help.
This type of learning analytics research and application is important as it generates
actionable information that allows students to monitor and self-regulate their own learning,
as well as allows instructors to develop and implement effective learning interventions and
ultimately help students succeed.
The purpose of the present paper is to systematically review the methodological
components of the predictive models that have been developed or currently implemented in
the learning analytics applications in higher education. Student learning is a complex
phenomenon as cognitive, social and emotional factors, together with prior experience, all
influence how students learn and perform (Illeris, 2006). As a result, to predict student
performance in a course or a program, many variables need to be considered, such as
cognitive variables associated with targeted knowledge and skills in the domain and socio-
emotional variables, such as engagement, motivation and anxiety. Student demographic
characteristics and past academic history are also often used in model building to reflect
information related to student prior experiences. Supervised machine learning techniques
such as logistic regression and neural networks are then applied to these student variables
to train and test the predictive models so as to estimate the likelihood of a student’s
successful passing of a course. Kotsiantis (2007) specified several key issues that are
consequential to the success of supervised machine learning applications, including variable
(i.e. attributes, features) selection, data preprocessing, choosing specific learning algorithms
and model validation. These issues are directly related to the steps of the typical process of
statistical modeling in quantitative research, which have guided us in terms of identifying
our research questions, as outlined below:

RQ1. What data sources and student variables were used to predict student
performance in higher education?
RQ2. How were data preprocessed and how were missing data handled prior to their
use in training, testing and validating predictive learning analytics models?
RQ3. Which machine learning techniques were used in developing predictive learning
analytics models?
RQ4. How were the accuracy and generalizability of the predictive learning analytics
models evaluated?
The main goal of this review is to provide an overview of the methodological considerations
for researchers and practitioners who are planning to develop or currently in the process of
developing predictive student success models in the context of higher education. The
answers to these four questions can provide a practical guide regarding the steps of
developing and evaluating predictive models of student success, from variable selection and
data preparation through results validation. The review also helps identify methodological
strengths and weaknesses of the current predictive learning analytics applications in higher
education so we can provide the most up-to-date recommendations on predictive model
development, use and evaluation. In this process, we also identify areas where research on
predictive learning analytics is lacking, which will inform important future areas of research
that could strengthen the development of predictive learning analytics for the purpose of
generating valuable feedback to students to help them succeed in higher education.

Method
Our literature review was completed in three stages. First, we conducted searches and
collected related full-text documents using various search terms and keywords related to
predictive learning analytics applications in higher education. The search strings include:
(student performance OR student success OR Drop out OR student graduation OR at-risk student)
and (systems OR application OR method OR process OR system OR technique OR methodology
OR procedure) AND (“educational data mining” or “learning analytics”) and (prediction).
We selected “learning analytics” and “educational data mining” as two widely and
interchangeably used search terms in the literature for this study. Siemens and Baker
(2012) enumerate the common research areas, interests and approaches between learning
analytics and educational data mining. Furthermore, Ferguson (2012) makes a
clear distinction between these two terms (i.e. learning analytics and educational data
mining) and academic analytics. Learning analytics and educational data mining address
technical and educational challenges to benefit and support students and faculty, whereas
academic analytics addresses political and economic challenges that benefit funders,
administrators and marketing at institutional, regional and government levels. Also, a quick
exact-phrase search in Google shows the popularity and the extent of information on
learning analytics (1,080,000 hits) and educational data mining (372,000 hits) as compared to
academic analytics (44,100 hits).
We conducted searches in four international databases of well-known academic
resources and publishers, namely, ScienceDirect, IEEE Xplore, ERIC and Springer. The
rationale for the choice of these four databases is that learning analytics, as an emerging
field of research and practice, involves the interdisciplinary area of science, social science,
education, engineering, psychology and other related fields. These four databases together
cover a broad spectrum of the interdisciplinary area involved in learning analytics. In
addition, they offer various scholarly products from conference proceedings, book chapters
and journal articles to funding agencies research reports, dissertations and policy papers.
For instance, ScienceDirect has an international coverage of physical sciences and
engineering, life sciences, health sciences and social sciences and humanities with over 12
million pieces of content from 4,051 academic journals and 28,417 books. ERIC has an
extensive coverage and collection of the literature in education and psychology with links to
more than 330,000 full-text documents. IEEE Xplore has a focus on computer science,
electrical engineering and electronics and allied fields and provides access to more than four
million documents. Springer covers a variety of topics in the sciences, social sciences and
humanities with over ten million scientific documents.
To filter the irrelevant articles, our review was narrowed down to journal articles, full-
text conference papers and book chapters that could be downloaded from the library website.
Full-text conference papers were included in our review based on the consideration that in
some fields such as computer science, conference papers are greatly valued as they are
typically peer reviewed and highly selective and considered to be more timely and with
greater novelty. According to Meyer et al. (2009), acceptance rates at selective computer
science conferences range between 10 and 20 per cent. The authors argued that “it is
important not to use journals as the only yardsticks for computer scientists” (p. 32). In
addition, given the emerging nature of learning analytics as a research and development
domain and that many new learning analytics systems and applications and their empirical
studies tend to be reported at conferences, we decided to include a broad range of scholarly
publications, including conference proceedings, to capture the recent literature of the area. A
cursory look at our reviewed papers shows a reasonable combination of journal articles and
conference papers, with conference papers constituting some of the publications after the
year 2015. We understand that some conference proceedings may not be as rigorous as
journal papers, but we wanted to ensure that the recent studies of the area are captured, even
if they are presented in conference proceedings. Our search process yielded 742 results from
all the four databases, which formed the initial list of citations. The publication time of the
selected citations spanned from 2002 to early 2018. Figure 1 displays the number of
publications reviewed over time, which shows that research on learning analytics has
gained more and more popularity in recent years.
Second, we developed inclusion and exclusion criteria to identify the most relevant
citations for the purpose of the current review. For this review, we excluded short conference
papers and abstracts because of their typical lack of detailed information about
methodologies. Because of our focus on the practical methodological considerations during
modeling process of real data applications in higher education, we excluded studies
conducted in educational settings other than higher education (e.g. high schools); we also

Figure 1. Number of publications reviewed over time
excluded citations that are purely theoretical or conceptual without empirical data/results. In
addition, we excluded studies that focused on clustering students into different groups
based on their academic behavior or background. Although students who are grouped
together might share similar profiles that could be linked to student success or dropout, our
review focused on explicit predictive models with specific predictor variables (i.e. variables
used to predict another variable), such as student background variables or activity data
from learning management systems, and the outcome variable (i.e. the variable whose value
depends on predictor variables) such as student course grades or last year grade point
average (GPA). As a result, of the 742 citations compiled from the first stage of the literature
review, a total of 121 citations remained after applying our exclusion criteria.
Third, we reviewed each document from the final compiled bibliography of 121 articles
and focused on identifying information that was needed to answer our research questions
regarding the four methodological components of predictive algorithms, namely:
• data sources and student variables;
• procedures of data handling and processing;
• adopted machine learning techniques; and
• evaluation of accuracy and generalizability.

We synthesize the current practice and findings from the 121 articles and conclude our
review with a number of recommendations for predictive algorithm development, analysis
and use based on the literature and our own evaluation, and in this process, we highlight
important areas for further research.

Results
Based on our review, there are two major categories of studies that focused on the prediction
of student performance in the higher education context. Of the 121 articles reviewed in our
study, the majority of studies (a total of 86 studies) focused on the prediction of student
performance and achievement at the course level in specific undergraduate or graduate
courses. These courses are delivered in a variety of formats, including traditional face-to-
face, online or blended. In these studies, student performance and achievement is typically
measured by their assignment scores or final grades on a variety of different scales,
including continuous scales (e.g. percentages), binary scales (e.g. pass or fail) and categorical
scales (e.g. fail, good, very good, or excellent). Course-level prediction of student course
performance is intended to help individual instructors monitor student progress and predict
how well a student will perform in the course so early interventions can be implemented.
Course-level predictions have been also applied to student outcome in massive open online
courses (MOOCs) in a number of studies (Al-Shabandar et al., 2017; Boyer and
Veeramachaneni, 2015; Brinton et al., 2016; Chen et al., 2016; Deeva et al., 2017; Hughes and
Dobbins, 2015; Kidziński et al., 2016; Klüsener and Fortenbacher, 2015; Liang et al., 2016;
Li et al., 2017; Liang et al., 2016; Pérez-Lemonche et al., 2017; Ruipérez-Valiente et al., 2017;
Xing et al., 2016; Yang et al., 2017; Ye and Biswas, 2014). The primary aim of predictions for
MOOCs is to identify inactive students to prevent early dropout. As a result, the outcome
variable being predicted in MOOCs is typically course completion or dropout. Another type
of course-level prediction is to estimate student performance in future courses (Elbadrawy
et al., 2016; Polyzou and Karypis, 2016; Sweeney et al., 2016), which could help students
select courses in which they are predicted to succeed and therefore create personalized
degree pathways to facilitate successful and timely graduation.
The second category of studies of predicting student performance (a total of 35 studies)
has focused on the program-level prediction of student outcome in higher education
institutions, including student overall academic performance as measured by student
cumulative GPA (CGPA) or GPA at graduation, student retention or degree completion. For
example, Dekker et al. (2009) predicted the dropout of electrical engineering students after
the first semester of their studies. This type of prediction can provide important information
to senior administrators regarding institutional accountability and strategies with the goal
to maintain and improve student retention and graduation rates.
Although the aims of the course-level and program-level predictions are generally
different, these studies share similar methodological components and considerations, with
some minor differences. We present our results of the methodological review of the 121
articles in the following four subsections, each related to one of the research questions
outlined in the Introduction.

Data sources and student variables


Based on our review, a variety of data sources and student variables have been utilized in
building predictive models by various studies. Table I provides a summary of the predictor
variables found in the literature that had been used for course- and program-level prediction.
Course-level prediction. For course-level prediction, student course performance data
during the semester such as marks on quizzes and midterms are considered as potential
predictors of final course performance. Marbouti et al. (2015) and Marbouti et al. (2016) built
course-level predictive models for a first-year engineering course at a large Midwestern
University (USA) with three types of course performance variables: attendance records,
homework and quiz grades. With course performance data only, the predictive model was
able to identify at-risk students and predict students’ success with 85 per cent accuracy.
Another important data source for predicting course-level performance originates
from learning management systems, virtual learning environments or other Web-based
learning environments in which detailed student activity log data are recorded, such as
logins, assignment submissions, resources accessed and frequency and interaction with
discussion forums. The rationale of using student logs or behavioral data collected while
interacting with learning management systems is that they serve as indicators of student
engagement and efforts in the course, which have been shown to be positively related to
student academic performance (Chen and Jang, 2010; Davies and Graff, 2005; de Barba et al.,
2016; Kizilcec et al., 2013; Morris et al., 2005; Tempelaar et al., 2015). Given the large amount
of data accumulated through the course of an academic term, statistical procedures such as
correlation analysis have been used in the literature as an initial step (Chen et al., 2018) to
identify relevant student activities that could help predict course performance.

Table I. Predictor variables for course- and program-level prediction

Course-level prediction:
• Course performance data such as marks on quizzes and midterms
• Student activity data from learning management systems
• Student socio-emotional variables from surveys and questionnaires such as self-reported learning attitudes and self-evaluation
• Student demographics and previous academic history
• Features of the course such as course modality, discipline and enrolment
• Variables related to instructors such as teaching quality and style

Program-level prediction:
• Student demographics and previous academic history
• Social networking-based data such as Facebook and Twitter
• Linguistic features extracted from college admission application essays
In a few studies, data collected through surveys and questionnaires that measure student
self-reported learning attitudes/strategies/difficulties and their self-evaluation (Abdous
et al., 2012) are utilized as predictors in predictive modeling. Sorour et al. (2016) asked students
to freely write comments after each class and extracted words reflecting their learning attitude,
understanding of course contents and difficulties in learning. Results show that the
prediction accuracy reached over 80 per cent based on the student comments.
Institutional data from student systems that record student demographics and previous
academic history have also been used in many studies as key variables for predicting
student course performance (Evale, 2016; Badr et al., 2016). Examples include student
gender, age, nationality, full time or part time status, educational background and admission
scores.
In addition to student variables, features of the course such as course modality, discipline
and enrollment (Ornelas and Ordonez, 2017), as well as the quality and teaching styles of the
instructor (Corrigan and Smeaton, 2017; Sweeney et al., 2016), have been used to predict
student performance.
Program-level prediction. At the program level, predictions of student outcome rely
mostly on student demographics and previous academic background, as specific
information related to individual courses is no longer directly relevant. Guarín et al. (2015)
predicted student academic attrition based on these two categories of variables. The
demographic variables the authors chose included age at admission, gender, city of origin,
socio-economic classification and ethnicity. Variables related to the previous academic
background included high school type (public or private), type of access (regular or special
admission program), option in which the student chose the program (from 1, first option,
to 3), the previous program (if any), admission test scores in five modules (i.e. Sciences, Math,
Image, Text and Social studies) and classification levels for Basic Math and Literacy.
In addition to the use of conventional student demographic and academic background
variables, Uddin and Lee (2017) analyzed social networking-based data such as Facebook
and Twitter to extract student personality traits (i.e. openness, conscientiousness,
extraversion, agreeableness and neuroticism) and other relevant features (e.g. # of Facebook
friends, # of Facebook posts daily, type of posts involved and liking activity) to improve the
predictive modeling of student performance.
Ogihara and Ren (2017) predicted student retention at the program level using linguistic
features extracted from the college admission application essays from the admission
database CommonApp used by many universities and colleges in the USA for streamlining
admission processes. Three sets of linguistic features were generated from text analysis:
(1) latent Dirichlet allocation (LDA)-based topic modeling with a variety of topic
numbers;
(2) linguistic inquiry and word count; and
(3) part-of-speech (POS) distribution.

The results show that the POS distribution features yielded the best prediction performance
among these three. However, the cross validation error was considerably high, suggesting
that the predictive model was not directly generalizable to other data sets.
Variable selection. In statistical modeling, variable selection, also known as feature
selection, is the process of selecting a set of relevant variables (i.e. features, predictors) for
use in model construction. As summarized by Guyon and Elisseeff (2003), there are three
main reasons for variable selection in machine learning-related research, namely, improving
the predictive power of the models, making faster and more cost-effective predictions and
providing a better understanding of the processes underlying the data. Variable selection is
especially important when a large number of potential student variables are available but
with a limited sample size.
Among the articles we reviewed, only a few studies briefly discussed their variable
selection techniques. Hart et al. (2017) used all-subsets regression to reduce the total number of
predictor variables that would be entered into their final analysis, dominance analysis,
which is computationally intensive and limited to a maximum of ten predictor
variables. Badr et al. (2016) and Ibrahim and Rusli (2007) used the rankings of the
correlation coefficients to select variables. Xu et al. (2017) conducted principal
component analysis to reduce the dimensions of predictor variables. Daud et al. (2017)
utilized information gain and gain ratio to select the best variable subset. Finally,
Chuan et al. (2016) used a chi-squared attribute evaluator and ranker search method to
identify the best attributes/variables.
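To make these variable selection approaches concrete, the following is a minimal, hypothetical Python sketch (not taken from any of the reviewed studies) of correlation-based ranking, chi-squared attribute evaluation and principal component analysis; the file name and column names such as quiz_avg are invented for illustration.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("student_records.csv")  # hypothetical institutional extract
X = df[["quiz_avg", "logins", "forum_posts", "admission_score"]]  # invented names
y = df["passed"]  # 1 = pass, 0 = fail

# (a) Rank predictors by absolute correlation with the outcome
print(X.corrwith(y).abs().sort_values(ascending=False))

# (b) Chi-squared attribute evaluation (features must be non-negative)
X_01 = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=2).fit(X_01, y)
print(X.columns[selector.get_support()])

# (c) Principal component analysis to reduce dimensionality
print(PCA(n_components=2).fit_transform(X_01).shape)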

Data preprocessing and missing data handling


Data preprocessing. Very few studies we reviewed briefly reported accuracy checks of
the data (Jayaprakash et al., 2014; Rachburee et al., 2015). No studies we reviewed provided
detailed procedures for checking data quality and errors. As part of the data preprocessing,
a few papers we reviewed normalized the variables prior to model building. The rationale
for normalization is that some machine learning techniques such as multi-layer perceptron
(MLP) require that variables, if measured on different scales, be normalized to ensure that
variables with larger ranges/variabilities do not carry more weight in predicting the
outcome. Different normalization methods have been used in the literature of predictive
learning analytics. Waddington et al. (2016) normalized the data by calculating each
student’s percentile rank of course resource use within each resource category. Gray et al.
(2016) and Sweeney et al. (2016) adopted a standard normal Z-transformation. Meedech et al.
(2016) rescaled the input variables into the range of [0, 1].
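As an illustration of the two normalization schemes mentioned above, the following hedged Python sketch applies a standard normal Z-transformation and a [0, 1] rescaling with scikit-learn; the small array simply stands in for student predictor data.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[12.0, 55.0], [3.0, 80.0], [7.0, 65.0]])  # e.g. logins, quiz marks

X_z = StandardScaler().fit_transform(X)    # standard normal Z-transformation
X_01 = MinMaxScaler().fit_transform(X)     # rescaling into the range [0, 1]

# In practice, fit the scaler on the training set only and reuse it on the
# test set so that no information leaks from the test data into preprocessing.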
In addition to normalization, other data preprocessing has been reported in the literature
to get the data ready for the predictive modeling. For example, data were anonymized to
remove identifier information (Jayaprakash et al., 2014). Data from different systems were
integrated into the same data file (Rubiano and Garcia, 2015). The translation of student
records from the original language to the language of the study was conducted (Badr et al.,
2016; Gray et al., 2016). Discretization of continuous variables into categories was done as
required by certain machine learning models such as Bayesian classifier (Sivakumar and
Selvaraj, 2018). In addition, irrelevant events such as registration place included in the data
set were filtered out (Al-Saleem et al., 2015). When log data from a learning management
system are analyzed, information such as the number of videos watched, posts written and posts
read is extracted from the initial log files for later analysis (Corrigan et al., 2015; Kidziński
et al., 2016).
Missing data handling. Missing data are perhaps one of the most ubiquitous problems
in any data modeling. Predictive learning analytics typically involves multiple sources
of data from different institutional information systems. As a result, the presence of
missing data is unavoidable. However, none of the studies we reviewed report the
extent of missing values in the data and the patterns of the missing data. Only a few
studies have provided brief information regarding procedures for handling missing
data. Badr et al. (2016) used three missing data handling methods to replace missing
grades on optional courses, namely:
(1) replacing the missing grades with the average score in the course;
(2) replacing the missing grades with the student’s grade in an equivalent course; and
(3) eliminating student records with a high percentage of unavailable data.

Jayaprakash et al. (2014) deleted variables with 20 per cent or more missing values. Han
et al. (2017) deleted students with more than two missing values; if a student had one or
two missing values, they were replaced by the mean of the variable. Chuan et al. (2016)
simply deleted all data with missing values. Al-Saleem et al. (2015) replaced all the missing
values with the mean by calculating the average grade in the course from other students.
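The following is a minimal Python sketch, on invented data, of the simple missing data procedures reported in these studies: mean replacement, listwise deletion and dropping variables with 20 per cent or more missing values.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"quiz1": [80, np.nan, 65, 70],
                   "quiz2": [70, 75, np.nan, 60]})

# Mean replacement: substitute each missing grade with the course average
df_mean = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                       columns=df.columns)

# Listwise deletion: drop any student record that contains missing values
df_listwise = df.dropna()

# Drop variables with 20 per cent or more missing values
df_reduced = df.loc[:, df.isna().mean() < 0.20]
print(df_mean, df_listwise, df_reduced, sep="\n")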

Machine learning techniques


Based on our review and analysis, the most active area of research in predictive learning
analytics appears to be the examination and comparison of different machine learning
techniques for predicting student performance. A variety of machine learning techniques
have been proposed and studied in the literature. Table II presents a list of machine learning
techniques utilized in the literature for predictive learning analytics, along with the number
of publications we reviewed that adopted each technique. The most frequently used and
successful techniques include decision tree (DT), naïve Bayes classifier (NBC), support
vector machine (SVM), artificial neural networks (ANNs), random forest (RF) and logistic
regression. To familiarize the reader with each of these techniques, we provide a brief
overview first, followed by our findings. K-nearest neighbor, a clustering technique
designed to group individuals into different clusters, was not reviewed given the focus of
this review on predictive modeling techniques.
DT is a recursive procedure to find a set of rules that partition data into two or more
homogeneous groups with respect to the outcome variable. In each step, a partition rule is
formed by selecting one predictor and splitting data into groups based on the selected
predictor. The process continues until all data in each partitioned group belong to the same
outcome category or all variables have been used (Ferreira et al., 2001). According to our
review, DT is the most often used machine learning technique in the area of predictive
learning analytics in higher education. Of the 121 studies we reviewed, 46 adopted DT,
which demonstrated the popularity of the technique. For example, Al-Saleem et al. (2015)
used two of the most recognized DT classification algorithms, ID3 and J48, to predict
student performance in future courses based on a model developed using the grades of
previous students in different courses. J48 showed superior performance, with a higher overall
accuracy of 83.75 per cent compared with 69.27 per cent for ID3.

Table II. Machine learning techniques and their corresponding number of publications

Technique: No. of publications
DT: 46
Naïve Bayes: 32
SVM: 26
Neural networks and MLP: 26
RF: 23
Logistic regression: 22
K-nearest neighbor: 16
Other: 25
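As a brief illustration of the DT approach described above, the following Python sketch fits a CART-style decision tree with scikit-learn (not the ID3 or J48 implementations used by Al-Saleem et al.) on simulated pass/fail data and prints the resulting partition rules.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))   # overall prediction accuracy on held-out data
print(export_text(tree))            # the learned partition rules in readable form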
NBC is a simple probabilistic classifier that calculates the conditional probability of the
data (given the class membership) by applying Bayes’ theorem and assuming conditional
independence among the predictors given the class (Friedman et al., 1997). The conditional
independence assumption greatly simplifies the calculation of the conditional probability of
the data by reducing it to the product of the likelihood of each predictor. Despite the
oversimplified assumption that is often violated in practice (e.g. student academic
background and midterm grade may not be conditionally independent), the NBC has shown
excellent performance that could be comparable to more advanced methods such as SVM.
For example, Marbouti et al. (2016) compared the performance of seven different predictive
models for identifying at-risk students in an engineering course and found that NBC
exhibited superior performance compared to other models.
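A minimal sketch of an NBC on simulated data is given below; GaussianNB assumes normally distributed continuous predictors, and the data are synthetic rather than drawn from any reviewed study.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

nbc = GaussianNB().fit(X_train, y_train)
print(nbc.predict_proba(X_test[:3]))   # class probabilities for three students
print(nbc.score(X_test, y_test))       # overall prediction accuracy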
SVM finds a hyperplane that classifies data into two categories (Cortes and Vapnik,
1995). SVM uses a kernel function to map the data from the original space into a new feature
space and finds an optimal decision boundary with the maximum margin from data in both
categories. SVM is suited to learning tasks with a large number of features (or predictors)
relative to the size of training sample. This property makes SVM a desirable technique for
the analysis of the learning management data in which a large number of student features
are available. For example, SVM was adopted by Corrigan et al. (2015) because with SVM,
not all of the extracted features from the log data:
Have to be actually useful in terms of discriminating different forms of student outcome [. . .] we
can be open-minded about how we represent students’ online behaviour and if a feature is not
discriminative, the SVM learns this from the training material (p. 47).
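The following hypothetical sketch fits an SVM with a radial basis function kernel on simulated data; predictors are standardized first because SVM is sensitive to feature scale.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # accuracy on the held-out test set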
ANNs were initially developed to mimic basic principles of biological neural systems where
information processing is modeled as the interactions between numerous interconnected
nerve cells or neurons. ANNs can also serve as a highly flexible nonlinear statistical
technique for modeling complex relationships between inputs and output. MLP is perhaps
the most well-known supervised ANN. An MLP is a network of neurons (i.e. nodes) that are
arranged in a layered architecture. Typically, this type of ANN consists of three or more
layers: one input layer, one output layer and at least one hidden layer. Statistically, the MLP
functions similar to a nonlinear multivariate regression model. The layer of input neurons is
analogous to the set of predictor variables, whereas the layer of output neurons is analogous
to the outcome variables. The relationship between the input and output layers is parallel to
the mathematical functional form in the regression model. The number of nodes in the
hidden layer is typically chosen by the user to control the degree of nonlinearity between
predictors and the outcome variables. With more nodes in the hidden layer, the relationship
between predictors and outcome variables becomes more nonlinear in the MLP model. It has
been mathematically demonstrated that the MLP, given a sufficient number of hidden
nodes, can approximate any nonlinear function to any desired level of accuracy (Dawson
and Wilby, 2004; Hornik et al., 1989). Rachburee et al. (2015) developed predictive models
with five classification techniques, namely, DT, NBC, k-nearest neighbors, SVM and MLP.
The results show that MLP generates the best prediction with 89.29 per cent accuracy.
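A minimal sketch of an MLP with a single hidden layer of ten nodes is shown below on simulated data; the hidden layer size, which controls the degree of nonlinearity, is an arbitrary choice made only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# One input layer, one hidden layer with ten nodes, one output layer
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=3))
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))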
RF is an ensemble classifier built on DTs. In DT, improper constraints or regularizations
on trees may result in overfitting the training data. Models with the problem of overfitting
show low bias and high variance, which imply that they cannot be well generalized to other
external data sets. RF was proposed to deal with this overfitting problem to improve the
model prediction and generalizability. In RF, the bagging method, or bootstrap aggregating,
is used to aggregate the predictions. Specifically, a bootstrap sampling approach with
replacement is used to obtain multiple subsets of the training data. For each data subset, a
DT is then built, which considers only a subset of features. These DTs for different subsets of
data constitute a forest (i.e. a multitude of DTs) for the whole data set. Multiple classes or
predicted values from different DTs can thus be obtained, and RF outputs the mode of
predicted classes (for classification) or the mean of predicted values (for regression) as the
final prediction. As such, by considering different subsets of samples and features, RF
introduces randomness and diversity into the model, which improves the model
generalizability. RF has shown to be a powerful and efficient classifier in the literature. For
example, in their study on the prediction of assignment grades with student online learning
behaviors and demographic information extracted from the MOOC data, Al-Shabandar et al.
(2017) found that RF largely outperformed the other seven classifiers considered in the study.
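The following sketch, again on simulated data, grows a random forest of 200 trees, each trained on a bootstrap sample and restricted to a random subset of features at each split, and prints the resulting variable importances.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=12, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# 200 trees, each grown on a bootstrap sample with a random subset of features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=4)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
print(rf.feature_importances_)   # relative contribution of each predictor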
Logistic regression is a classical multivariate statistical procedure used to predict a
categorical outcome variable from a set of continuous, categorical or both types of predictor
variables. When the outcome variable has only two categories, the probability of the
outcome being in one category can be modeled as a sigmoid function of the linear
combination of predictors. The model parameters can be estimated by maximizing the log
likelihood of obtaining the observed data. For example, Jayaprakash et al. (2014) used
logistic regression, among three other techniques, to predict whether students are at risk or
in good standing in a course. The predictors included student age, gender, SAT scores, full-
time or part-time status, academic standing, cumulative GPA, year of study, score computed
from partial contributions to the final grade, number of Sakai course sessions opened by
the student and number of times a section is accessed by the student. Logistic regression
was found to outperform other techniques, with a better combination of high recall, low
percentage of false alarms and higher precision in predicting at-risk students.
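A minimal sketch of logistic regression for at-risk prediction on simulated data is given below; the fitted model returns the estimated probability of the at-risk class through the sigmoid function described above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=8, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(logit.predict_proba(X_test[:3])[:, 1])   # estimated risk for three students
print(logit.coef_)                             # fitted coefficients of the predictors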

Evaluation of accuracy and generalizability


Accuracy evaluation. Once built, the accuracy of the predictive models must be evaluated.
There are several measures that have been used, the most often used one being the overall
prediction accuracy (i.e. the percentage of true positives and true negatives with respect to
the total sample size). Similar measures include precision (i.e. the percentage of true
positives with respect to the total number of model-predicted positives), recall (i.e. the
percentage of true positives with respect to the total number of positives in the sample),
confusion matrix (i.e. a two-by-two matrix listing true positives, true negatives, false positives
and false negatives) and F-measure (i.e. the harmonic mean of the precision and recall).
A few studies we reviewed used the area under the receiver operating characteristics
(ROC) curve as a performance measure of predictive models (Corrigan et al., 2015;
Jayaprakash et al., 2014). ROC graphs are two-dimensional graphs in which the true positive
rate is plotted on the y-axis and the false positive rate is plotted on the x-axis to depict the
relative trade-offs between benefits (true positives) and costs (false positives). For
regression-based modeling, the traditional R2, root mean square residuals and mean
absolute error are often reported as model accuracy measures (Almutairi et al., 2017;
Kidziński et al., 2016; Strecht et al., 2015).
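To illustrate these measures, the following sketch computes overall accuracy, precision, recall, F-measure, the confusion matrix and the area under the ROC curve for a small, invented set of true and predicted pass/fail labels.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 0, 1, 0, 1, 0, 0]                   # 1 = at risk, 0 = in good standing
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]                   # model-predicted labels
y_prob = [0.9, 0.2, 0.1, 0.4, 0.3, 0.8, 0.6, 0.2]   # predicted risk probabilities

print(accuracy_score(y_true, y_pred))     # overall prediction accuracy
print(precision_score(y_true, y_pred))    # true positives / predicted positives
print(recall_score(y_true, y_pred))       # true positives / actual positives
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))   # two-by-two table of prediction outcomes
print(roc_auc_score(y_true, y_prob))      # area under the ROC curve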
Corrigan et al. (2015) evaluated the performance of predictive models by examining the
effectiveness of interventions designed based on the model-derived predictions of student
performance. The goal of predictive learning analytics is to develop actionable feedback that
could be provided to students so they can reflect on their learning process and eventually
improve their learning. By examining the effectiveness of the feedback, the validity of the
predictive model was inferred indirectly. In this study, the authors reported that the
students who received emails each week based on the results of predictive models
outperformed those who opted out by nearly 3 per cent (58.4 vs 61.2 per cent) on average, while
no prior differences were found between the two groups on a number of measures related to
previous academic records.
Generalizability evaluation. Regarding the generalizability of the predictive models,
the majority of the studies we reviewed have cross-validated the results by training and testing
the model with independent data sets to examine whether the model could be generalized to
data that have not been used in the training of the model. The simplest method of cross
validation was to randomly split the original data into a training set to train the model and a
test set to evaluate it (Al-Shabandar et al., 2017; Chen et al., 2018). K-fold validation has also
been frequently reported in the literature (Luo et al., 2015; Sorour et al., 2016), with the basic
idea of splitting the original sample randomly into k equal sized subsamples, one of which is
retained as the testing data to validate the model. The remaining k − 1 subsamples are used
to train the model. The process is repeated k times, with each of the k subsamples used once
as the testing data. The results from the k replications are then averaged to produce the final
estimation.
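The following sketch illustrates, on simulated data, the two cross-validation strategies described above: a simple random train/test split and five-fold cross validation whose fold accuracies are averaged.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=6)

# Hold-out validation: train on 70 per cent of the data, test on the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=6)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))

# K-fold validation: five rotations of training and testing, accuracies averaged
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=6))
print(scores.mean())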
In addition to the use of cross validation, a few studies we reviewed evaluated the model
generalizability by applying the generated model to data from other academic years or from
other institutions. For example, Gray et al. (2016) trained the predictive model with data
from the 2010 and 2011 student cohort and tested it with data from the 2012 student cohort.
Boyer and Veeramachaneni (2015) referred to the use of models trained on previous courses for
real-time prediction in a subsequent offering of the same course (or other new courses) as
transfer learning. Multiple transfer learning methods were proposed, such as the naïve
transfer method, multi-task learning method and logistic regression with prior method. The
authors argue that transfer learning is of great importance for real-time predictions in
learning analytics. Furthermore, the Open Academic Analytics Initiative program
(Jayaprakash et al., 2014) researched issues related to the scaling up of predictive learning
analytics across different higher education institutions. A predictive model trained with Marist College
data was applied to data from several other institutions.

Conclusion
This methodology review aims to provide researchers and practitioners with a survey of the
literature on learning analytics with a particular focus on the predictive analytics in the
context of higher education. Learning analytics is still an emerging field in education (Avella
et al., 2016). The adoption and application of learning analytics in higher education is still
mostly small-scale and preliminary. Student data captured within higher education
institutions (e.g. learning management systems, student information systems and student
services) have yet to be properly integrated, analyzed and interpreted to realize their full
potential for providing valuable insight for students and instructors to facilitate and support
learning. Sound analytical methodology is the central tenet of any high-quality learning
analytics application. The aim of the current study was to help better understand the current
state of the methodology in the development of predictive learning analytic models by
systematically reviewing issues related to:
• data sources and student variables;
• data preprocessing and handling;
• machine learning techniques; and
• evaluation of accuracy and generalizability.
Summary of results and conclusions
Data sources and student variables. Most of the reviewed studies make use of multiple data
sources and student variables in the modeling process to enhance prediction accuracy. For
course-level prediction, student intermediate course performance data (e.g. marks on quizzes
and midterms), student log data from learning management systems (e.g. logins and
downloads) and student demographics and previous academic history have been the most
often used predictors of student performance. Given that student learning involves both
cognitive and socio-emotional competencies, in a few studies, data were collected through
surveys and questionnaires that measure student self-reported learning attitudes/strategies/
difficulties and their self-evaluation, which have been used to predict student performance.
Features of courses and instructors have also been used as predictors considering the
importance of contextual information for learning. For program-level prediction, student
demographic and academic backgrounds are the most typical predictors chosen. The social
networking-based variables have also been researched as possible predictors. However, the
results so far are not clear in terms of whether and to what extent the social networking-
based variables have contributed to a significant improvement of prediction accuracy.
Data preprocessing and handling. Although data preprocessing and missing data handling
are critical for successful predictive learning analytic applications, few studies we reviewed
have presented detailed information about this process. Of the few citations that provided
documentation on data preprocessing, variable normalization, data anonymization, translation
of student records, discretization of continuous variables, removal of irrelevant information in
data and information extraction from raw log files have been reported at the stage of data
preprocessing. Regarding missing data handling, none of the studies we reviewed provided
information on the extent of missing values in the data, the patterns of the missing data and the
justification of the selected approach for handling missing data. For the few studies that
reported how they handled the missing data, simple procedures such as mean replacement and
listwise deletion (i.e. deleting cases with missing values) were often used.
Machine learning techniques. The most frequently used and successful techniques in the
literature of predictive learning analytics appear to be DT, NBC, SVM, ANNs, RF and
logistic regression. Of these six techniques, SVM and MLP are considered as “black-box”
techniques in the sense that one cannot know exactly how the prediction is derived and how
to interpret the meaning of different parameters in the model. In comparison, results of DT
are highly interpretable as the set of developed rules is simple to understand and can describe
clearly the process of the prediction. However, the disadvantage of DT is its instability,
meaning that small changes in the data might lead to different tree structures and set of
rules. For example, Jayaprakash et al. (2014) applied DT to 25, 50, 75 and 100 per cent of the
training data and found that the method exhibited unstable performance when varying the
sample size. RF, logistic regression and NBC appear to be good options for predictive
learning analytic applications.
Evaluation of accuracy and generalizability. Measures based on the percentages of correct
predictions such as the overall prediction accuracy, precision, recall and F-measure are the most
often used measures for evaluating the performance of predictive models. However, as
argued by Fawcett (2004), these measures may be problematic for unbalanced classes where
one class dominates the sample. For example, when the class distribution is highly skewed
with 90 per cent of students passing, a model can have a high overall prediction accuracy by
simply predicting everyone to the majority class. Unbalanced classes are common in the
area of predictive learning analytics, given that typically a relatively small percentage of
students fail a course or drop out of a program. Good performance measures of predictive
modeling should not be influenced by the class distributions in the sample. An example is
the ROC curve, which has the desirable property of being insensitive to changes in class
distributions. Another way to evaluate the performance of predictive models is by
examining the effectiveness of interventions designed based on the model-derived
predictions of student performance. This type of results can strengthen the practical use of
predictive models in real settings.
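The pitfall of accuracy under unbalanced classes can be demonstrated with a few lines of code: with roughly 10 per cent of (simulated) students failing, a model that predicts "pass" for everyone reaches about 90 per cent accuracy while identifying no at-risk students and showing chance-level ROC performance.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)   # 1 = fail, roughly 10 per cent

y_pred = np.zeros_like(y_true)                   # naively predict "pass" for everyone
print(accuracy_score(y_true, y_pred))            # about 0.90 despite no useful model
print(recall_score(y_true, y_pred))              # 0.0: no at-risk student is identified
print(roc_auc_score(y_true, np.zeros(len(y_true))))   # 0.5: chance-level ranking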
To evaluate the generalizability of predictive models, cross validation has been routinely
utilized in the learning analytic literature. This is a good practice considering the possibility
of model overfitting with the use of machine learning techniques in learning analytics
research. Although cross validation is important, it does not provide strong evidence to
show that the model can be generalized to other contexts or settings. Another, perhaps more
rigorous, way to examine the model generalizability is to apply the generated model to data
from other academic years or from other institutions.

Recommendation for practice


Based on our review, we identify several gaps/issues in the literature that could benefit from
more rigorous investigation in the field of learning analytics. First, although a total of 121
publications were found in the area of predictive learning analytics in the context of higher
education, many papers fail to report methodological details, which makes our review and
assessment challenging. For example, very few studies reported procedures of examining
data accuracy prior to any modeling analysis. This step may be cumbersome but is of extreme
importance, given that the quality and validity of the data underwrite the trustworthiness
of the models derived from the data. Screening for data accuracy involves the removal of
duplicated cases, the correction of inconsistent data and the detection of outliers. In their
multivariate statistics textbook, Tabachnick and Fidell (2013) suggested inspecting the
descriptive statistics of each variable (e.g. minimum, maximum, frequencies, means and
standard deviation) for data accuracy in large data sets. Are the values of each variable
within the acceptable range? Are the means and standard deviations of the continuous
variables (or the frequencies of the categorical variables) consistent with expectations? In
addition, how to deal with missing data is a critical issue that deserves more attention and
documentation. Missing data handling requires a careful deliberation of the patterns of
missing. The best scenario is when missing data appear to be completely random and no
systematic patterns/reasons are suspected for why data are missing. In this case, missing
data do not affect the validity of the predictive models, and different missing data handling
procedures may result in similar findings. Non-randomly missing data, on the other hand,
may pose serious problems in the analyses and results due to the potential distortion of
variable distributions and relationships. For example, students who decline to
provide comments after the class may not be well engaged in class learning and activities
and may therefore achieve low performance. If missing data are deleted, the distribution of the
class performance variable would be biased. Therefore, it is desirable to test whether the
missing data are random or systematically related to other variables in the study. One
strategy often recommended by many statistical textbooks (Tabachnick and Fidell, 2013;
Warner, 2008) is to compare groups with and without missing data for a particular variable
and investigate whether these two groups are associated with significant differences on
other variables considered in the study. If no significant differences are found, random
missing can be assumed and decisions on how to deal with missing data are not critical.
Otherwise, it is important to preserve the cases with missing data so the missing values will
need to be estimated. Simple estimation methods such as mean substitution may lead to a
reduction in variable variances. More sophisticated approaches such as expectation-
maximization and multiple imputation can be considered. Graham et al. (2003) provided a
comprehensive summary of different methods for handling missing data. For future
research in the area of predictive learning analytics, a careful and detailed documentation of
the data handling and analysis is of great importance to:
• boost the confidence of stakeholders in the use of developed models;
• promote the healthy and methodologically solid development of the field; and
• sustain the impact on the learning and teaching in higher education.
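Returning to the missing-data check recommended above, the following hypothetical sketch compares students with and without missing values on one variable (here, an invented comment score) and tests whether the two groups differ on another variable in the study.

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "comment_score": [3.0, np.nan, 4.5, np.nan, 2.0, 5.0, np.nan, 4.0],
    "final_grade":   [72, 55, 88, 60, 65, 90, 58, 80],
})

missing = df["comment_score"].isna()
t_stat, p_value = stats.ttest_ind(df.loc[missing, "final_grade"],
                                  df.loc[~missing, "final_grade"])
print(p_value)   # a small p-value suggests the data are not missing completely at random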

Second, the majority of research articles, book chapters and conference presentations
available in the literature to date have focused on the programmatic aspect of model
development, and these publications are mostly led by researchers in the field of computer
science. This aspect of research is important, and continued efforts are needed. However,
student learning is a complex phenomenon as many factors (e.g. cognitive, socio-emotional
and background variables) influence the learning process and outcome (Illeris,
2006). Therefore, understanding the cognitive and socio-emotional aspect of human learning
and achievement is also a crucial component for predictive learning analytics, which has
received much less attention. Based on our review of the predictive learning analytic
literature, there is a clear gap in the development of theoretical frameworks and input from
content experts and educators to support and inform key decision-making during the
process of model building. From a theoretical perspective, two questions arise: What student
features are important predictors of the student outcome? How do these features interact
with each other and together influence the outcome? These are examples of important
questions that cannot be solved solely by computer programs. Results from studies in
cognitive science and learning domain knowledge provide valuable insights into how
students learn content and perform tasks, which should be injected into the data pre-
processing and analysis phase to best address the research questions. This calls for a close
collaboration among educators, domain experts, cognitive scientists and data scientists in
building predictive models that aim at providing useful information to benefit student
learning and classroom teaching.
Third, very few studies have discussed how the results of predictions generated
from the model should be best used to help students. If a model predicts that a student
is likely to fail the course, what information should be provided to the student so that
he/she can act upon it to improve learning? To answer this question, we
need to understand how the prediction is made, which information/variable is most
relevant and whether changes made by the student can increase his/her likelihood of
passing the course. This bears implications for predictive modeling techniques. To
develop a clear understanding of the process that derives the prediction, black-box
techniques such as SVM and artificial neural networks may not be ideal for
interpretative purposes. If available, student behavioral variables (e.g. student
activities recorded from the learning management system) should be considered as potential
predictors as these variables are useful in terms of generating actionable information
that helps design interventions. Based on the results, for example, feedback related to
how students can change their behaviors (e.g. participate in group discussions or
submit assignments on time) to increase their chance of success in the course can be
provided. When demographic variables and student past academic history are used as
the only predictors of student performance (Valdiviezo-Díaz et al., 2015; Al-Shabandar
et al., 2017; Roy and Garg, 2017; Guarín et al., 2015; Rubiano and Garcia, 2015),
instructors should be encouraged to generate feedback based on further examination
and comparison of resource uses and activities between groups of students who have
been predicted as passing and failing the course. Furthermore, instructors can encourage
students to have face-to-face meetings with them or visit various student support
centers on campus such as student success center or student accessibility services.
On a related note, based on our review, student intermediate performance data are often
used as potential predictors of final course performance. The use of intermediate
performance data seems to be logical as these data can naturally serve as measures/
indicators of student learning progress in the course. It is also a common practice in higher
education that student marks on quizzes and midterms account for certain percentages of
the final marks. When these percentages are high, it is important to make early predictions.
For example, if the midterm performance accounts for a high percentage of the final mark, it
will be desirable to make predictions before the midterm so that students can reflect on their
learning process and change their behaviors to increase their midterm scores, which in turn
increases their chance of success in the course.
Last, the majority of publications we reviewed are targeted at predicting student
performance at the course level. It is worth investigating whether a general prediction model
can be developed for use in multiple courses. Obviously, a general model is more efficient
than course-specific models in that the model can be trained once and directly applied to other
courses. However, few would dispute that a single general model cannot fully address
the complexity of all courses because learning objectives, activities and assessments of
different courses can vary a great deal. Model accuracy must therefore be compromised; the
question is to what degree. This is an empirical question, and future research is much needed.

References
Abdous, M.H., Wu, H. and Yen, C.J. (2012), “Using data mining for predicting relationships between
online question theme and final grade”, Journal of Educational Technology and Society, Vol. 15
No. 3, pp. 77-88.
Almutairi, F.M., Sidiropoulos, N.D. and Karypis, G. (2017), “Context-aware recommendation-based
learning analytics using tensor and coupled matrix factorization”, IEEE Journal of Selected
Topics in Signal Processing, Vol. 11 No. 5, pp. 729-741.
Al-Saleem, M., Al-Kathiry, N., Al-Osimi, S. and Badr, G. (2015), “Mining educational data to predict
students’ academic performance”, International Workshop on Machine Learning and Data
Mining in Pattern Recognition, Springer, Cham, pp. 403-414.
Al-Shabandar, R., Hussain, A., Laws, A., Keight, R., Lunn, J. and Radi, N. (2017), “Machine learning
approaches to predict learning outcomes in massive open online courses”, 2017 International
Joint Conference on Neural Networks (IJCNN), IEEE, pp. 713-720.
Avella, J.T., Kebritchi, M., Nunn, S.G. and Kanai, T. (2016), “Learning analytics methods, benefits,
and challenges in higher education: a systematic literature review”, Online Learning, Vol. 20
No. 2, pp. 13-29.
Badr, G., Algobail, A., Almutairi, H. and Almutery, M. (2016), “Predicting students’ performance in
university courses: a case study and tool in KSU mathematics department”, Procedia Computer
Science, Vol. 82, pp. 80-89.
Boyer, S. and Veeramachaneni, K. (2015), “Transfer learning for predictive models in massive open online
courses”, International Conference on Artificial Intelligence in Education, Springer, Cham, pp. 54-63.
Brinton, C.G., Buccapatnam, S., Chiang, M. and Poor, H.V. (2016), “Mining MOOC clickstreams: Video-
watching behavior vs. in-video quiz performance”, IEEE Transactions on Signal Processing,
Vol. 64 No. 14, pp. 3677-3692.
Chen, Y., Chen, Q., Zhao, M., Boyer, S., Veeramachaneni, K. and Qu, H. (2016), “DropoutSeer: visualizing
learning patterns in massive open online courses for dropout reasoning and prediction”, 2016
IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, pp. 111-120.
