Keywords: Air transportation; Deep learning; Support vector machine; System safety; Risk assessment

With the spectacular growth of air traffic demand expected over the next two decades, the safety of the air transportation system is of increasing concern. In this paper, we adopt the “proactive safety” paradigm to increase system safety, with a focus on predicting the severity of abnormal aviation events in terms of their risk levels. To accomplish this goal, a predictive model needs to be developed to examine a wide variety of possible cases and quantify the risk associated with the possible outcome. By utilizing the incident reports available in the Aviation Safety Reporting System (ASRS), we build a hybrid model consisting of a support vector machine and an ensemble of deep neural networks to quantify the risk associated with the consequence of each hazardous cause. The proposed methodology is developed in four steps. First, we categorize all the events, based on the level of risk associated with the event consequence, into five groups: high risk, moderately high risk, medium risk, moderately medium risk, and low risk. Second, a support vector machine model is used to discover the relationships between the event synopsis in text format and the event consequence. In parallel, an ensemble of deep neural networks is trained to model the intricate associations between event contextual features and event outcomes. Third, an innovative fusion rule is developed to blend the prediction results from the two types of trained machine learning models, thereby improving the prediction. Finally, the prediction on risk-level categorization is extended to event-level outcomes through a probabilistic decision tree. By comparing the performance of the developed hybrid model against three individual models with ten-fold cross-validation and statistical tests, we demonstrate the effectiveness of the hybrid model in quantifying the risk related to the consequences of hazardous events.
1. Introduction

The International Air Transport Association (IATA) forecasts that air traffic demand will grow at a 3.7% annual Compound Average Growth Rate (CAGR), and there will be 7.2 billion air travelers in 2035 [1,2]. Thus, compared to the 3.8 billion air travelers in 2016, air travel demand will nearly double over the next two decades. The rapid growth in air traffic demand will put pressure on the air transportation system, which is already struggling to cope with the current demand. System-wide departure delays and en-route congestion will deteriorate due to the massive increase in the number of aircraft within the limited airspace. Such consequences further contribute to an increased number of conflicts in the air traffic, which might escalate to collisions or evolve into other hazardous events [3-5]. The rapid increase of air traffic demand also puts tremendous pressure on air traffic operators in maintaining system safety at the same level as before. All the aforementioned factors could increase the occurrence rate of air traffic incidents.

Since air traffic accidents usually have severe consequences, the safety of the air transportation system has received great attention [6-8]. Over the past decades, numerous efforts have been dedicated to the development of comprehensive safety metrics as well as qualitative/quantitative approaches to detect anomalous behavior, assess the safety, and quantify the risk associated with one subsystem (e.g., airport ground operations, flight tracking, or taxiway and runway system) or a mixture of subsystems in the air transportation system [9]. For example, Netjasov and Janic [10] performed a comprehensive review of four state-of-the-art methods, namely causal models, collision risk models, human error models, and third-party risk models, for the assessment of risk and safety in civil aviation. McCallie et al. [11] described a taxonomy of possible attacks on one of the most critical technical upgrades – the Automatic Dependent Surveillance Broadcast (ADS-B) system – in the Next Generation (NextGen) Air Transportation System; they examined the risks inherent in the ADS-B implementation and their potential impact, thereby supporting risk mitigation and management. Margellos and Lygeros [12] used Monte Carlo Simulation (MCS) to assess the probability of flights meeting the Target Windows (TWs) constraints and the probability of aircraft having conflicts in the
* Corresponding author.
E-mail address: sankaran.mahadevan@vanderbilt.edu (S. Mahadevan).
https://doi.org/10.1016/j.dss.2018.10.009
Received 22 May 2018; Received in revised form 6 September 2018; Accepted 17 October 2018
Available online 05 November 2018
0167-9236/ © 2018 Elsevier B.V. All rights reserved.
X. Zhang, S. Mahadevan Decision Support Systems 116 (2019) 48–63
4-D Trajectory-Based Operations (TBO). Landry et al. [13] developed a conflict detection and resolution (CD&R) decision support system for airport surface operations, and illustrated it with an example from the Hartsfield Atlanta International Airport. Di Ciccio et al. [14] proposed a model to automatically detect flight trajectory anomalies by using semi-publicly available data, so that the receiving parties can be alerted to take responsive actions in a timely manner. Sankararaman et al. [15] investigated the vulnerability of Trajectory-Based Operations (TBO) to technology disruption, and evaluated the degradation of four key performance indicators (KPI) subject to the technological impairment.

In general, there are two principal strategies to enhance air transportation system safety: (I) increase system/subsystem reliability, i.e., reduce the probability of operators making mistakes or systems having malfunctions; and (II) improve the operators' preparedness, so that when a hazardous event occurs, the operator is able to take appropriate safety measures in a timely manner to reduce the cost of the consequence caused by the hazardous event [16-18]. Although tremendous progress has been made over the past decades in enhancing air transportation system safety, most of the studies emphasize identifying accident precursors and initiating corrective actions to prevent future errors, which can be grouped into the first strategy. Among the existing studies, much of the effort has been devoted to investigating human-factor-induced accidents and constructing causal models [10,19]. For example, Wiegmann and Shappell [20] described the Human Factors Analysis and Classification System (HFACS) and provided aviation case studies on human factor analysis with HFACS. However, air transportation is a complex system of systems involving many varied yet interlinked distributed networks of human operators, technical/technological procedures and systems, and organizations/stakeholders. The scarcity of operational data and the involvement of a variety of human operators are major challenges that severely affect the prediction of erroneous operation, and challenge the assessment and improvement of system reliability. In view of this, it would be beneficial to leverage the second strategy and emphasize the “proactive safety” paradigm [21], focused on strengthening the capability and efficiency of the operators in responding to abnormal events with appropriate actions, thereby mitigating the consequence or severity of hazardous events. In this connection, a large number of incident/accident reports have been filed over the past half century. A few machine learning studies have been carried out to analyze these reports. For example, Oza et al. [22] developed two algorithms – Mariana and a nonnegative matrix factorization-based algorithm – to perform multi-label classification for the reports of the Aviation Safety Reporting System (ASRS). Budalakoti et al. [23] presented a set of algorithms to detect and characterize anomalies in discrete symbol sequences arising from recordings of switch sensors in the cockpits of commercial airliners. These studies have not explored the intricate relationships between abnormal event characteristics and the induced consequences. In this paper, we are motivated to fill this gap by quantifying the risk associated with the consequence of hazardous events through a hybrid machine learning model. This type of analysis will help to develop a decision support system that can learn the patterns in the data and help the analyst examine incidents/accidents quickly and systematically, thus assisting the risk manager in risk quantification, priority setting, resource allocation, and decision making in support of the implementation of the proactive safety paradigm.

To do this, we have utilized the Aviation Safety Reporting System (ASRS), a database of incident reports that provides a plethora of incidents/accidents that have occurred over the course of the past several decades. The incident reports are submitted by pilots, air traffic controllers, dispatchers, cabin crew, maintenance technicians, and others, and describe both unsafe occurrences and hazardous situations [24]. ASRS covers almost all the domains of aircraft operations that could go wrong. However, it is a non-trivial task to build a model from the ASRS data for the prediction of risk associated with the consequence of hazardous events due to the following challenges:

1. High-dimensional data. Each incident record consists of more than 50 items ranging from the operational context (weather, visibility, flight phase, and flight conditions) to the characteristics of the anomalous operation (aircraft equipment, malfunction type, and event synopsis). The URL https://asrs.arc.nasa.gov/docs/dbol/ASRS_CodingTaxonomy.pdf gives a detailed description of each item. Besides, air traffic operation is a complex system composed of a number of programs and personnel. An incident can occur at any phase and location due to a variety of factors (e.g., operation violation, visibility, human factors, etc.), which makes it hard for a machine learning algorithm to predict the exact level of risk associated with each event outcome.

2. Primarily categorical data. Over 99% of the items in the ASRS database are categorical, and only one attribute (crew size) is numerical in each record. Although the categorical information offers a high-level description of the context of each incident, very limited information specific to each abnormal event can be derived from it, e.g., how did the incident happen, how did the incident evolve over time in the system, what operation did the pilot take to resolve the issue, etc. Since the categorical features are not very informative, how to blend the predictions of the model trained on the categorical features with the predictions from the model trained on other, more informative indicators (e.g., event synopsis) without compromising model performance is an issue worthy of investigation.

3. Unstructured data. One important unstructured attribute is the event synopsis, which is a concise summary of the incident/accident in the form of text. Mining causal relationships from unstructured text data is a daunting task [25]. A common approach in handling text data is to transform it into numerical data representing the term frequencies of each individual document. Along this direction, researchers have developed numerous techniques to elicit useful knowledge from text data, e.g., support vector machines [26], latent Dirichlet allocation-based topic mining [27-29], Naïve Bayes-based document classification [30], k-nearest-neighbor (k-NN) classification [31], and others [32,33]. Among them, the support vector machine [34] has demonstrated good performance in text categorization because it overcomes over-fitting and local minima, thereby achieving good generalization in applications [35,36].

4. Imbalanced class distribution. In the ASRS, the number of records in one class (i.e., possible outcome) is significantly larger than that of the others. The distribution of the outcomes for all the hazardous events reported between January 2006 and December 2017 is illustrated in Fig. 1. As can be observed, the number of records across different classes is highly imbalanced. Such an imbalanced class distribution poses a serious challenge to machine learning algorithms, which assume a relatively well-balanced distribution [37].

Since ASRS consists of a variety of heterogeneous data (e.g., text data, categorical data, and numerical data), it is challenging to develop one single model to learn from the entire dataset for predicting the risk associated with the consequence of hazardous events. In this paper, we adopt the “divide and conquer” strategy to split the data into two parts, structured and unstructured, and develop a hybrid model to handle the two types of data, respectively. By doing this, we are able to leverage the strengths of each model in processing a certain type of data. Compared with building a single model, the dimension of the problem is reduced significantly. With respect to categorical data, considering its high-dimensional feature space, deep learning might be a good candidate to discover the highly intricate relationship between event contextual characteristics and event consequence due to its powerful ability to establish a dense representation of the feature space, which makes it effective in learning high-order features from the raw data [38,39]. As a result, a deep learning model is developed to process the
categorical data for the purpose of learning the associations between event contextual features and event outcomes. In parallel, a support vector machine model is trained to identify the relationships between the text-based event synopsis and the risk level associated with the consequence of each incident. Afterwards, the predictions from the two machine learning models are fused together for quantifying the risk associated with the consequence of each incident. Finally, the prediction on risk-level categorization is extended to event-level outcomes through a probabilistic decision tree.

Fig. 1. Distribution of outcomes for all incidents/accidents between January 2006 and December 2017.

Compared to the current state of the art in machine learning for aviation safety, we make the following contributions:

1. We have developed a machine learning methodology to learn the relationships between abnormal event characteristics and their consequences. We focus on a data set that has a large number of outcomes and an imbalance of available data regarding these outcomes. This challenge is overcome by grouping the event outcomes into five risk categories and by up-sampling the minority classes.

2. A probabilistic fusion rule is developed to blend the predictions of multiple machine learning models that are built on different segments of the available data. Specifically, a hybrid model blending SVM prediction on unstructured data and a deep neural network ensemble on structured data is developed to quantify the risk of the consequence of hazardous events, in which the record-level prediction probabilities, the class-level prediction accuracy of each respective model, and the proportion of each class in the records with disagreeing predictions are considered in model fusion.

The rest of the paper is structured as follows. In Section 2, we provide a brief introduction to the ASRS database and describe the principal elements in each record. In Section 3, we develop a hybrid model to learn the complex relationships between event characteristics and event outcomes. In Section 4, we evaluate the performance of the trained machine learning model on a test dataset. In Section 5, we provide concluding remarks and discuss future research directions.

2. Aviation safety reporting system

The Aviation Safety Reporting System (ASRS) is a program operated by the National Aeronautics and Space Administration (NASA) with the ultimate goal of increasing aviation system safety by discovering system safety hazards hidden in the multitude of air traffic operations. Over the past few decades, ASRS has become one of the world's largest sources of information on aviation safety and human factors [24]. As one of its primary tasks, ASRS collects, processes, and analyzes voluntarily submitted aviation incident/situation reports from pilots, flight attendants, air traffic controllers, dispatchers, cabin crew, ground workers, maintenance technicians, and others involved in aviation operations.

Reports submitted to ASRS include both unsafe occurrences and hazardous situations. Each submitted report is first screened by two analysts to provide the initial categorization and to determine the triage of processing. During this process, ASRS analysts identify hazardous situations from the reports and issue an alert message to persons in a position to correct them. Based on the initial categorization, ASRS analysts might aggregate multiple reports on the same event to form one database “record”. If any information needs to be further clarified, ASRS analysts might choose to call a reporter over the telephone to gather more information. After all the necessary information is collected, the reports are codified using the ASRS taxonomy and recorded in the database. Next, some critical information in each record is de-identified; then ASRS distributes the incident/accident records gathered from these reports to all the stakeholders in positions of authority for future evaluation and potential corrective action development.

Table 1 presents a sample situation record extracted from the ASRS database. As can be observed, each record has more than 20 fields ranging from event occurrence location to event characteristics. Here, we briefly introduce several primary attributes in each record:

1. Time and location of the abnormal event: The incident in Table 1 occurred in February 2017 at the BUR airport in the USA. Here, BUR is the International Air Transport Association (IATA) code of the airport, and it refers to the Hollywood Burbank Airport located in Los Angeles County, California. Besides, the record also provides the basic altitude above Mean Sea Level (MSL). In this case, the hazardous event occurred when the flight was at an altitude of 2500 ft.

2. Environment: This describes the surrounding conditions encompassing the aircraft operations. Examples are: flight conditions (VMC, IMC, marginal, or mixed), and weather elements such as visibility, light, and ceiling. Here, VMC refers to visual meteorological conditions under visual flight rules (VFR) flight. In VMC, pilots have sufficient visibility to fly the aircraft while maintaining visual separation from the terrain and other aircraft. Different from VMC, instrument meteorological conditions (IMC) describe weather conditions that require pilots to fly primarily by reference to instruments under instrument flight rules (IFR). Visibility and cloud ceiling are two key factors in determining
whether the weather condition is VMC or IMC, and the boundary between IMC and VMC is known as the VMC minima. “Marginal VMC” refers to conditions above but close to one or more VMC minima.

3. Aircraft: This attribute reports the basic aircraft and flight information, including the aircraft make and model, flight type (personal or commercial), the number of crew onboard, flight plan, flight mission, flight phase, and the airspace class the flight is in. Such information details the specific flight phase in which the hazardous event occurs. The basic aircraft information also enables us to have an understanding of the scale of the possible event consequence.

4. Component: If there is any mechanical failure in the aircraft, the record will have the component field, and it describes which aircraft component is faulty (e.g., engine, nosewheel, transponder, etc.) and the type of the problem as well (e.g., malfunctioning, improperly operated, or others). This field might be empty if there is no component malfunction.

5. Person: This describes the fundamental information of the personnel that reports the problem, for example, the location of the person, his/her location in the aircraft, the reporter organization, the qualification of the reporter, and the contributing human factor (e.g., fatigue, distraction, confusion, and time pressure).

6. Events: This attribute provides the basic characteristics of the anomalous behavior and its consequence. In this record, BUR Tower detected the excursion of the flight from the assigned altitude and issued a course correction to avoid the traffic. After taking the course correction to avoid traffic, the pilot did not maintain the same course heading to Burbank while ATC assumed that the pilot was on the correct heading. As a result, the BUR Tower recognized the error, issued a new heading, and commanded the pilot to climb back to the assigned altitude to perform the landing from the very start.

7. Assessments: Assessments provide an evaluation of the root cause and other factors that contribute to the occurrence of the abnormal event.

8. Synopsis: All the above fields are categorical except the crew size. The last rows of Tables 1 and 2 provide several sample event synopses elicited from the ASRS database. As can be observed, an event synopsis gives a brief summary of the cause of the problem and how the abnormal event evolves over time. Such information is helpful for us to assess the severity of the hazardous event and analyze possible consequences.

The above descriptions provide a brief introduction to the physical meanings of some important fields in each record. In ASRS, other records might differ from the above sample records in certain fields, or they might have additional fields which are absent in the sample record due to differences in the type and characteristics of the abnormal event.

Table 1
A sample incident/accident record extracted from ASRS.

Attribute     Content
Time/Day      Date: 201702; Local Time Of Day: 0601–1200
Place         Locale Reference.Airport: BUR.Airport; State Reference: CA; Altitude.MSL.Single Value: 2500
Environment   Flight Conditions: VMC; Weather Elements/Visibility.Visibility: 10; Light: Daylight; Ceiling.Single Value: 12,000
Aircraft      Reference: X; ATC / Advisory.Center: BUR; Aircraft Operator: Personal; Make Model Name: PA-28R Cherokee Arrow All Series; Crew Size.Number Of Crew: 1; Operating Under FAR Part: Part 91; Flight Plan: VFR; Mission: Personal; Flight Phase: Initial Approach; Route In Use: Visual Approach; Airspace.Class D: VNY
Component     Aircraft Component: Engine Air; Problem: Malfunctioning
Person        Reference: 1; Location Of Person: X; Location In Aircraft: Flight Deck; Reporter Organization: Personal; Function.Flight Crew: Single Pilot; Qualification.Flight Crew: Private; Experience.Flight Crew.Total: 175; Experience.Flight Crew.Last 90 Days: 30; Experience.Flight Crew.Type: 175; ASRS Report Number.Accession Number: 1428684; Human Factors: Confusion
Events        Anomaly.Deviation - Altitude: Excursion From Assigned Altitude; Anomaly.Deviation - Track/Heading: All Types; Anomaly.Deviation - Procedural: Clearance; Anomaly.Inflight Event/Encounter: Unstabilized Approach; Detector.Person: Air Traffic Control; When Detected: In-flight; Result.Flight Crew: Returned To Clearance; Result.Flight Crew: Became Reoriented; Result.Air Traffic Control: Issued New Clearance; Result.Air Traffic Control: Issued Advisory/Alert
Assessments   Contributing Factors/Situations: Airspace Structure; Contributing Factors/Situations: Human Factors; Primary Problem: Human Factors
Synopsis      PA28R pilot reported becoming confused during a VFR flight to BUR and lined up on VNY. BUR Tower detected the error and issued a new heading and a climb back to the assigned altitude.

Table 2
Two examples of event synopsis in ASRS.

Date             Event synopsis
February 2017    A319 Flight Attendant reported smoke in the cabin near the overwing exits during climb.
January 2017     A319 flight crew reported fumes in the flight deck. After a short time they began to experience problems concentrating and the onset of incapacitation.

3. Proposed method

In this section, we develop a hybrid method to estimate the risk of the event consequence with consideration of the operational conditions and event characteristics. To build the hybrid model, we investigate the incidents/accidents that occurred from January 2006 to December 2017. The proposed framework is outlined in Fig. 2, which shows a four-step procedure. In the first step, we perform a risk-based event outcome categorization. That is, we employ the level of risk as a quantitative metric to measure the severity of the event outcome and collapse all the possible event outcomes into five categories: high risk, moderately high risk, medium risk, moderately medium risk, and low risk. Given the restructured categories, two models are developed in the second step to process the unstructured data (text data) and structured data (categorical and numerical information). Specifically, a support vector machine (SVM) model is developed to represent the relationship between the event synopsis and the risk pertaining to the event outcome, and an ensemble of deep neural networks (DNN) is trained to predict the level of risk based on the contextual features of each abnormal event. In the third step, we develop an innovative fusion rule to blend the prediction results from the SVM and DNN models. Finally, the risk-level
3.2. Model construction

The ASRS data can be classified into two groups: structured data and unstructured data. In particular, the structured data include numerical data (e.g., crew size) and categorical data (e.g., flight phase, weather visibility, flight conditions, etc.). In ASRS, the event synopsis is the only unstructured data, and it is used to describe how the accident/incident occurred.

Before developing the two machine learning approaches, we up-sample the minority classes in the restructured risk domain. Basically, up-sampling is the process of randomly duplicating observations from the minority class to reinforce its signal. With respect to the reorganized risk categories, we perform resampling with replacement for the three minority classes, namely high risk, moderately high risk, and moderately medium risk, so that the number of records for them matches that of the majority class (medium risk). After the up-sampling operation, we develop two individual models to handle the structured and unstructured data separately. The flowchart of the developed method is illustrated in Algorithm 1.

Algorithm 1. Hybrid model development.

3.2.1. Support vector machine for text-based classification

Regarding the text data, our objective is to automatically categorize the abnormal event into the correct risk category based on the content of the event synopsis. In the past decades, a large number of statistical and computational methods have been developed for classification based on text data [33,43], for example, Naïve Bayes [44,45], support vector machine [46], maximum entropy [47], and others [48]. The support vector machine (SVM) has been found to provide higher prediction accuracy than most other techniques [49]. Different from other classifiers, SVM implements a structural risk minimization function, which entails finding an optimal hyperplane that separates the different classes with the maximum margin. The introduction of the structural risk minimization (regularization) term also enables SVM to have a good generalization characteristic, thereby guaranteeing a lower classification error on unseen instances.

The first essential step in applying SVM to text classification is to transform the text data into numerical feature vectors; this is referred to as text representation. Since the text descriptions of the incidents/accidents appear in the form of long sentences in different records, we employ the bag-of-words (BoW) representation to transform the text data into numerical features. Specifically, we utilize the tokenizer in the natural language processing toolbox to split raw text into sentences, words, and punctuation [50]. Afterwards, all the stop words and punctuation are eliminated from the BoW representation, and we extract distinct words from all the event synopsis records, assign a fixed integer id to each term present in any event synopsis, and represent the content of a document as a vector in the term space, which is also referred to as a vector space model. However, in many cases, term frequency alone might not be adequate for differentiating the documents. For example, since the word ‘the’ is a very common term that appears in every document, the term frequency method will tend to incorrectly emphasize documents with frequent use of terms (such as ‘the’ and ‘a’) that carry very little information about the contents of the document, while more meaningful terms might be shadowed. As a remedy, the term frequency is reweighted by a smoothed inverse document frequency, yielding the tf-idf weight
tf-idf(t, d) = tf(t, d) × (1 + log((1 + n_d) / (1 + df(d, t))))        (1)

where tf(t, d) is the frequency of term t in document d, n_d is the total number of documents, and df(d, t) is the number of documents containing term t.
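The smoothed weighting in Eq. (1) can be sketched in a few lines. The following is a minimal illustration (not the paper's code) on made-up, already-tokenized synopses, assuming the natural logarithm and stop-word-free input; this smoothed variant mirrors the idf used by scikit-learn's TfidfTransformer:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Smoothed tf-idf per Eq. (1): tf(t,d) * (1 + log((1 + n_d) / (1 + df(t))))."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * (1.0 + math.log((1 + n_docs) / (1 + df[t])))
                        for t in tf})
    return weights

# toy "synopses" after tokenization and stop-word removal
docs = [["smoke", "cabin", "climb"],
        ["fumes", "flight", "deck"],
        ["smoke", "flight", "deck"]]
w = tf_idf(docs)
# "cabin" occurs in only one of the three documents, so it receives a
# larger weight than "smoke", which occurs in two of them.
```

Note that a rare term such as "cabin" (df = 1) is weighted 1 + log(4/2), while a more common term such as "smoke" (df = 2) is weighted 1 + log(4/3), so the rarer, more discriminative term dominates the document vector.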
Table 4
Performance metrics for the two trained models. Here, the support column denotes the number of occurrences of each class in the actual observation, the consistent prediction column represents the number of consistent predictions between the two models for each class, and the consistent prediction accuracy denotes the proportion of records in the consistent model predictions that are correctly labeled.

True label 1:  SVM predicted (1-5): A^s_{1,1}, A^s_{1,2}, A^s_{1,3}, A^s_{1,4}, A^s_{1,5} | Support: N_1 | Consistent prediction: N_1^c | Consistent prediction accuracy: λ_1 | DNN predicted (1-5): A^d_{1,1}, A^d_{1,2}, A^d_{1,3}, A^d_{1,4}, A^d_{1,5}
True label 2:  SVM predicted (1-5): A^s_{2,1}, A^s_{2,2}, A^s_{2,3}, A^s_{2,4}, A^s_{2,5} | Support: N_2 | Consistent prediction: N_2^c | Consistent prediction accuracy: λ_2 | DNN predicted (1-5): A^d_{2,1}, A^d_{2,2}, A^d_{2,3}, A^d_{2,4}, A^d_{2,5}
True label 3:  SVM predicted (1-5): A^s_{3,1}, A^s_{3,2}, A^s_{3,3}, A^s_{3,4}, A^s_{3,5} | Support: N_3 | Consistent prediction: N_3^c | Consistent prediction accuracy: λ_3 | DNN predicted (1-5): A^d_{3,1}, A^d_{3,2}, A^d_{3,3}, A^d_{3,4}, A^d_{3,5}
True label 4:  SVM predicted (1-5): A^s_{4,1}, A^s_{4,2}, A^s_{4,3}, A^s_{4,4}, A^s_{4,5} | Support: N_4 | Consistent prediction: N_4^c | Consistent prediction accuracy: λ_4 | DNN predicted (1-5): A^d_{4,1}, A^d_{4,2}, A^d_{4,3}, A^d_{4,4}, A^d_{4,5}
True label 5:  SVM predicted (1-5): A^s_{5,1}, A^s_{5,2}, A^s_{5,3}, A^s_{5,4}, A^s_{5,5} | Support: N_5 | Consistent prediction: N_5^c | Consistent prediction accuracy: λ_5 | DNN predicted (1-5): A^d_{5,1}, A^d_{5,2}, A^d_{5,3}, A^d_{5,4}, A^d_{5,5}
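The shared-attribute columns of Table 4 (per-class support N_i, consistent predictions N_i^c, and consistent-prediction accuracy λ_i) can be computed directly from the two models' validation-set predictions. A hedged sketch follows; the function name and array layout are our own, assuming integer class labels 1 through n_classes:

```python
import numpy as np

def shared_attributes(y_true, pred_svm, pred_dnn, n_classes=5):
    """Per-class support N_i, consistent predictions N_i^c, and
    consistent-prediction accuracy lambda_i for two classifiers."""
    y_true = np.asarray(y_true)
    pred_svm = np.asarray(pred_svm)
    pred_dnn = np.asarray(pred_dnn)
    support, consistent, lam = [], [], []
    for c in range(1, n_classes + 1):
        in_class = y_true == c                     # records whose true label is c -> N_c
        agree = in_class & (pred_svm == pred_dnn)  # the two models give the same label -> N_c^c
        n_agree = int(agree.sum())
        # lambda_c: fraction of the agreeing predictions that are also correct
        n_correct = int((agree & (pred_svm == y_true)).sum())
        support.append(int(in_class.sum()))
        consistent.append(n_agree)
        lam.append(n_correct / n_agree if n_agree else 0.0)
    return support, consistent, lam

# tiny illustration with three classes
sup, con, lam = shared_attributes(y_true=[1, 1, 2, 2, 2],
                                  pred_svm=[1, 2, 2, 2, 3],
                                  pred_dnn=[1, 3, 2, 2, 3],
                                  n_classes=3)
```

In the illustration, class 2 has support 3, all three of its records receive consistent predictions, but only two of those agreeing predictions are correct, so λ_2 = 2/3.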
decimal into an integer, the model prediction (3) after the transformation might be the least probable prediction in the two models. Consequently, the weighted sum operation is inappropriate for this problem.

We develop a probabilistic fusion rule to blend the predictions from the two models. The framework of the proposed fusion rule is demonstrated in Fig. 4. In the first place, if the two models have the same prediction (prediction 1 = prediction 2), then the prediction result is easily determined. If the two models have different predictions, then we calculate the probability of the test record belonging to each class among the five risk categories in each model. To illustrate the proposed method, Table 4 reports the performance metrics of the two models on a validation dataset. The validation dataset consists of N_1, N_2, N_3, N_4, and N_5 records in the five respective classes, where A^s_{i,j} and A^d_{i,j} represent the number of records that actually belong to class i but are labeled as class j by the support vector machine and the DNN ensemble, respectively. The third to the seventh columns present the confusion matrix of the trained support vector machine on the validation dataset, while the confusion matrix of the deep learning ensemble is reported in the thirteenth to seventeenth columns. The ninth column of Table 4 reports the number of consistent predictions between the two trained models on the validation dataset, while the tenth column reports the accuracy of the correctly labeled records among the consistent predictions of the two models in the validation dataset.

When the two models have disagreeing predictions, one important principle underlying the fusion strategy is: since no model is perfect, each model is expected to have misclassifications or make erroneous predictions. However, the valuable information embodied in the misclassifications should be further utilized to correct the model predictions on the subsequent unseen test dataset, in a way that compensates for the class that the observation should belong to if we know how often the model mislabels that class as other classes. Considering the various types of misclassifications that the trained model is prone to make, one way to achieve this objective is to increase the probability of labeling the observation as the correct class, while reducing the probability of labeling the record as the predicted class given by the model. Fortunately, such information can be derived from the confusion matrix. In fact, a confusion matrix provides not only class-level model prediction accuracy, but also the probability of mislabeling a record as other classes. Next, we utilize the model misclassification probability to correct the model prediction on new test records. Suppose the two models have different predictions on a new test record a; there are three important considerations in determining the probability of the new test record a belonging to each of the five classes i in the two models:

1. Record-level prediction probabilities: As mentioned earlier, two models have been trained: a support vector machine and a deep neural network ensemble. The record-level probability p(Ŷ_a = i) measures the probability that the trained model assigns the test record a to a given class i. To be specific, with respect to the DNN ensemble, the record-level prediction probability can be measured as the ratio of the most frequent prediction to the total number of model predictions (which is 10, in this case). For example, if the most frequent prediction of the ten models is class 5, and it appears six times out of the ten model predictions, then the model prediction probability for this particular record belonging to class 5 is 6/10 (0.6). Regarding the support vector machine, since samples far away from the separating hyperplane are presumably more likely to be classified correctly, we can use the distance from the hyperplane as a measure of record-level prediction probability, following the method introduced in Ref. [58].

2. Proportion of each class in the records with disagreeing predictions: When the two models are trained, there are the same number of records belonging to each class in the training dataset. In other words, the data is balanced for each class in the training dataset. However, when we use the trained models to make predictions on test records, we only need decisions when the two trained models have inconsistent predictions. With respect to the set of disagreeing predictions, the actual proportion of records belonging to each class might be different (imbalanced) from the training dataset (balanced). The use of a model trained on the balanced class distribution will result in a biased estimator if it is directly utilized for making predictions on the set of records with inconsistent model predictions.

To address this issue, we introduce a weight factor to correct the trained model to fit the class distribution in the set of records with disagreeing model predictions, thereby ensuring that the updated estimator is unbiased. The number of records with consistent model predictions in each class is complementary to the number of records with inconsistent model predictions in that class. In general, the more records the two classifiers agree on, the fewer records they disagree on. As a result, we formulate the following equation to represent the total number of records with inconsistent model predictions across all the classes considered:

T = Σ_{i=1}^{5} (N_i − λ_i × N_i^c)    (2)

where i is the class label, N_i denotes the actual number of records that should have been labeled as class i, N_i^c represents the number of consistent predictions between the two models on the records in class i, and λ_i is the model accuracy with respect to the consistent predictions.

Considering the total number of disagreeing predictions across all the classes, the proportion w.r.t. class j is:

p(j) = (N_j − λ_j × N_j^c) / T    (3)
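Eqs. (2) and (3) can be sketched in a few lines. The counts below are the validation-set values reported later in Section 4.2: the support N_i per class and the number of correctly labeled consistent predictions λ_i × N_i^c per class.

```python
# Sketch of Eqs. (2)-(3): proportion of each class among the records on
# which the two classifiers disagree. Validation counts from Section 4.2.
support = [947, 1017, 988, 1034, 976]           # N_i
correct_consistent = [401, 922, 339, 897, 804]  # lambda_i * N_i^c

# Eq. (2): total number of records with inconsistent model predictions
T = sum(n - c for n, c in zip(support, correct_consistent))

# Eq. (3): class proportions among the disagreeing records
p = [(n - c) / T for n, c in zip(support, correct_consistent)]

print(T)  # 1599
print([round(x, 2) for x in p])
```

Up to rounding, this reproduces the proportions p = [0.34 0.06 0.40 0.09 0.11] used in the worked example of Section 4.2.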
X. Zhang, S. Mahadevan Decision Support Systems 116 (2019) 48–63
Given the already correctly labeled samples by the two trained models, the probability that a new test record a which results in disagreeing predictions between the two models belongs to class j is p(j).

3. Class-level accuracy: This measures the degree of consistency between model predictions and actual observations. Mathematically, it can be represented as p(Y = i | Ŷ = j), where Ŷ = j represents the samples with model predictions being class j, and p(Y = i | Ŷ = j) (j ≠ i) quantifies the proportion of samples with model predictions being class j that should have been labeled as class i. In particular, when j = i, p(Y = i | Ŷ = i) measures the ratio of samples actually belonging to class i to the number of samples with model predictions being class i, which can also be referred to as a likelihood function. In other words, the likelihood describes how the samples predicted as a given class distribute over the true classes.

The blended probabilities are normalized such that the sum of the probability over the five classes in each model is 1:

p(Y_a = i) = [Σ_{j=1}^{5} p(Y = i | Ŷ = j) × p(Ŷ_a = j) × (N_j − λ_j × N_j^c)/T] / [Σ_{i=1}^{5} Σ_{j=1}^{5} p(Y = i | Ŷ = j) × p(Ŷ_a = j) × (N_j − λ_j × N_j^c)/T]    (7)

With respect to the support vector machine, the likelihood function p(Y = i | Ŷ = j) is calculated as:

p^s(Y = i | Ŷ = j) = A^s_{i,j} / Σ_{i=1}^{5} A^s_{i,j}    (8)

where A^s_{i,j} denotes the number of samples with model predictions being class j, but actually belonging to class i, in the validation dataset.

In a similar way, the metric p(Y = i | Ŷ = j) in the DNN ensemble is computed accordingly. Given the model classification performance p(Y = i | Ŷ = j), the probability of the test record a belonging to each class in each model is computed as below:

p(Y_a^s = i) ∝ Σ_{j=1}^{5} p^s(Y = i | Ŷ = j) × p^s(Ŷ_a = j) × (N_j − λ_j × N_j^c) / Σ_{k=1}^{5} (N_k − λ_k × N_k^c)

sim(d_i, d_j) = (Σ_{m=1}^{k} w_{m,i} × w_{m,j}) / (√(Σ_{m=1}^{k} w_{m,i}²) × √(Σ_{m=1}^{k} w_{m,j}²))    (10)

where d_i and d_j denote two documents, w_{m,i} (w_{m,j}) is the frequency of object (words, or terms) o_m present in document d_i (d_j), and k is the number of unique objects across all the documents. The similarity metric defined in Eq. (10) measures the cosine of the angle between the vector-based representations of the two documents. Since there are multiple documents belonging to the same risk category as the test record a, we formulate the following equation to address such issues:
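Putting Eqs. (7) and (8) together, the blending step can be sketched in plain Python. The confusion counts are the SVM values from Table 7, and the two probability vectors are those used in the worked example of Section 4.2:

```python
# Sketch of Eqs. (7)-(8): column-wise likelihoods p(Y=i | Yhat=j) from a
# confusion matrix, blended with the record-level probabilities and the
# disagreeing-set class proportions, then normalized over the five classes.
# conf[i][j] counts records of true class i predicted as class j (Table 7, SVM).
conf = [[550,  82, 205,  69,  41],
        [ 14, 951,  27,  13,  12],
        [153,  65, 592, 119,  59],
        [ 16,   7,  43, 943,  25],
        [ 20,  17,  34,  22, 883]]

p_pred  = [0.10, 0.20, 0.20, 0.20, 0.30]  # record-level p(Yhat_a = j), Table 8
p_class = [0.34, 0.06, 0.40, 0.09, 0.11]  # disagreeing-set proportions p(j)

col_sum = [sum(row[j] for row in conf) for j in range(5)]
# Eq. (8) applied column-wise, folded into the Eq. (7) numerator
unnorm = [sum(conf[i][j] / col_sum[j] * p_pred[j] * p_class[j]
              for j in range(5)) for i in range(5)]
total = sum(unnorm)
p_final = [u / total for u in unnorm]
print([round(x, 2) for x in p_final])
```

Up to rounding of the intermediate likelihoods, this reproduces the worked example in Section 4.2, which assigns record a to class 3.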
Table 5
The number of records belonging to each risk category.

Class             | High   | Moderately high | Medium | Moderately medium | Low
Number of records | 12,327 | 8261            | 18,841 | 8636              | 16,508

Table 6
Confusion matrix.

                  | Predicted as positive | Predicted as negative
Actually positive | True Positives (TP)   | False Negatives (FN)
Actually negative | False Positives (FP)  | True Negatives (TN)
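The Table 6 cells feed directly into the three metrics used in Section 4 (precision, recall, F1). A minimal sketch with illustrative counts:

```python
# Sketch of the three metrics built from the Table 6 confusion-matrix
# cells; the counts below are made up for illustration.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)       # correct positives over predicted positives
    recall = tp / (tp + fn)          # correct positives over actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```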
Table 7
Performance metrics for the two trained models. Here, the support column denotes the number of occurrences of each class in the actual observation, the consistent prediction column represents the number of consistent predictions between the two models for each class, and consistent prediction accuracy denotes the proportion of records in the consistent model predictions that are correctly labeled.

True label | SVM predicted label (1 2 3 4 5) | Support | Consistent prediction | Consistent prediction accuracy | DNN predicted label (1 2 3 4 5)
1 | 550  82  205  69  41  |  947 | 491 | 0.82 | 567  74  204  46  56
2 |  14 951   27  13  12  | 1017 | 988 | 0.93 |  25 931   28   5  28
3 | 153  65  592 119  59  |  988 | 448 | 0.76 | 208  72  495  97 116
4 |  16   7   43 943  25  | 1034 | 936 | 0.96 |  19  11   48 927  29
5 |  20  17   34  22 883  |  976 | 861 | 0.93 |  38  27   45  28 838
same amount of data as in the majority class to form a balanced dataset. After up-sampling, the dataset contains 18,841 records for each of the high, moderately high, medium, and moderately medium risk classes, and 16,508 records in the low risk category. After the up-sampling operation, there are 91,872 records in total.

All the input variables are summarized in Fig. 5, and they are grouped into two categories: unstructured data and structured data. The structured data includes 29 different variables ranging from aircraft information to event characteristics, while the unstructured (text) data only includes the event synopsis. With the 29 structured variables and one unstructured variable, we train the SVM and DNN models.

Given a trained classifier and a test dataset, the relationship between model predictions and true observations can be represented as a confusion matrix, as illustrated in Table 6. To assess the performance of every machine learning model, we adopt the three most commonly used performance metrics.

1. Precision: Mathematically, Precision = TP/(TP + FP) is the ratio of correctly predicted positive observations to the total predicted positive observations, and high precision typically corresponds to few false positive predictions.
2. Recall: Recall = TP/(TP + FN) quantifies the ratio of correctly labeled positive observations to all the positive observations in the actual class. It measures the ability of the trained model to identify positive observations from all the samples that should have been labeled as positive.
3. The F1 score is the weighted average of precision and recall, mathematically described as F1 = 2 × (precision × recall)/(precision + recall). In the F1 score, the relative contribution of precision and recall is the same. In other words, the F1-measure is the harmonic mean of precision and recall.

4.2. Experimental analysis

We split the 91,872 records into three parts: training set (85%), validation set (5%), and test set (10%). The training set is used to guide the machine learning algorithms to optimize the relevant parameters so as to minimize prediction error; the validation set is utilized to develop the performance metrics for an unseen dataset (i.e., prediction accuracy). With the performance metrics obtained from the validation dataset, we then blend the predictions from the two trained models for the test dataset.

Regarding the SVM estimator for text classification, we leverage grid search to optimize the relevant hyperparameters. Specifically, we optimize the loss function (hinge, log, perceptron, squared loss, etc.), the regularization function (l1, l2, or elasticnet), the coefficient of the regularization function ([10^−2, 10^−5]), and the range of the N-gram (an N-gram is a contiguous sequence of n items from a given sample of text). The grid search exhaustively generates candidates from the set of parameter values, evaluates all the possible combinations of parameter values on the training dataset, and selects the best combination. After the optimal hyperparameters are identified, we run the trained support vector machine algorithm on the validation data to obtain its performance metrics. The left part of Table 7 reports the performance metrics of the trained support vector machine model on the validation dataset.

In the DNN model, all the categorical features are encoded using one-hot encoding. We randomly select 85% of the records from the training dataset to train a deep neural network, then we repeat the same procedure ten times to obtain an ensemble of deep neural networks. Every DNN has the same structure – 8 hidden layers and 40 neurons per layer. We choose an Adam optimizer [61] with a learning rate of 0.001 to perform backward propagation in adjusting the weight variables, with the objective of minimizing the categorical cross entropy as defined in Eq. (13):

L(Y, Ŷ) = −(1/n) Σ_{i=1}^{n} [Y_i log Ŷ_i]    (13)

where n is the number of training samples, Y_i is a one-hot-encoding representation of the actual observation with the corresponding class label being 1 and the values of other classes being zero, and Ŷ_i is the probabilistic estimation of the ensemble deep learning models on record i.

After the ten deep neural networks are trained, we use them to make predictions on the validation dataset, and the results are reported in Table 7. It is observed that the DNN ensemble does not perform as well as the SVM model in terms of precision and recall for classes 2, 3, and 5, due to the low level of information contained in the categorical features. As shown in the ninth column of Table 7, the two trained models agree on 491, 988, 448, 936, and 861 predictions with respect to the five classes. Among the consistent predictions of the two models, 401, 922, 339, 897, and 804 predictions with respect to the five classes are correctly classified, from which we can compute the ratio of consistent predictions that are correctly labeled in the validation dataset. Next, we utilize the two trained models to make predictions on the test dataset. With the performance metrics acquired on the validation dataset, we blend the predictions of the two models. Suppose the two models have probabilistic classifications on the test record a as shown in Table 8; each cell represents the probability that the test example a is a member of the class. Regarding the test record a, the SVM model supports class 5 the most, whereas the DNN ensemble assigns the highest probability to class 2.

Table 8
Model predictions with respect to test record a.

Model | 1    | 2    | 3    | 4    | 5
SVM   | 0.10 | 0.20 | 0.20 | 0.20 | 0.30
DNN   | 0.20 | 0.40 | 0.10 | 0.02 | 0.28

Following the method introduced before, the total number of inconsistent model predictions is T = (947 − 401) +
(1017 − 922) + (988 − 339) + (1034 − 897) + (976 − 804) = 1599. Then the proportion of each class in the disagreeing records is calculated as:

p = [0.34 0.06 0.40 0.09 0.11]

Then the probability of assigning the test record a to class 1 in the SVM model is computed as:

p(Y_a^s = 1) ∝ Σ_{j=1}^{5} [p(Y = 1 | Ŷ = j) × p(Ŷ_a^s = j) × p(j)]
= 550/(550 + 14 + 153 + 16 + 20) × 0.1 × 0.34 + 82/(82 + 951 + 65 + 7 + 17) × 0.2 × 0.06
+ 205/(205 + 27 + 592 + 43 + 34) × 0.2 × 0.40 + 69/(69 + 13 + 119 + 943 + 22) × 0.2 × 0.09
+ 41/(41 + 12 + 59 + 25 + 883) × 0.3 × 0.11
= 0.0468

In a similar way, the probabilities of labeling the test record a as the other classes in the SVM model are calculated:

p(Y_a^s = 2) ≈ 0.0137, p(Y_a^s = 3) ≈ 0.0646, p(Y_a^s = 4) ≈ 0.0193, p(Y_a^s = 5) ≈ 0.0324

After normalization, the probability of the test record belonging to each class in the SVM model is:

p(Y_a^s = 1) = 0.26, p(Y_a^s = 2) = 0.08, p(Y_a^s = 3) = 0.37, p(Y_a^s = 4) = 0.11, p(Y_a^s = 5) = 0.18.

Likewise, with the DNN ensemble, the probabilities of the test record a belonging to classes 1 to 5 are calculated as:

p(Y_a^d = 1) = 0.35, p(Y_a^d = 2) = 0.15, p(Y_a^d = 3) = 0.28, p(Y_a^d = 4) = 0.04, p(Y_a^d = 5) = 0.18.

From the above computational results, test record a has the highest probability (0.37) of being labeled as class 3 in the SVM model. As a result, this test record a is assigned to class 3 in the hybrid model. For the remaining test records, we blend the two model predictions in a similar manner.

To compare the performance of all the investigated models, we randomly split the original dataset into ten equal-sized groups to perform ten-fold cross-validation. The procedure presented in Algorithm 1 is repeated over the ten equal-sized groups. Fig. 6 shows the proportions of the five risk categories in the records with disagreeing predictions. Among the five risk categories, the medium risk class has the largest proportion of records with disagreeing predictions, followed by the low risk and high risk classes. The proportion of each class with disagreeing predictions is relatively stable, with small variability. Due to the imbalanced class distribution in the records with disagreeing predictions, the adjustment factor introduced in Eq. (4) plays an essential role in embodying such information in the subsequent hybrid model predictions.

In addition, we implement ordinal logistic regression (OLR) for the purpose of comparing its performance with the DNN ensemble on the structured data [62,63]. The computational results of ten-fold cross-validation for the four models are demonstrated as a boxplot in Fig. 7. It is worth noting that we shift the values of the three performance indicators of ordinal logistic regression by +0.4 for the sake of demonstrating the four models' performance deviation clearly. As can be observed, the hybrid model outperforms SVM, DNN ensemble, and OLR in terms of precision, recall, and F1 score. More importantly, the hybrid model has a much smaller deviation for all the performance metrics when compared to the other three models. In other words, it has the most stable prediction capability, thereby making it the best candidate for quantifying the risk of abnormal events. Since OLR is inferior to the
Fig. 7. The performance of hybrid model versus support vector machine (SVM), ensemble of deep neural networks (DNN), and ordinal logistic regression (OLR).
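The fold-wise comparison against a null hypothesis of equal means can be sketched as a paired t-test on per-fold scores. The F1 values below are illustrative, not the paper's:

```python
import math
from statistics import mean, stdev

# Illustrative per-fold F1 scores for two models over ten-fold cross-validation
hybrid_f1 = [0.81, 0.80, 0.82, 0.81, 0.80, 0.82, 0.81, 0.80, 0.81, 0.82]
svm_f1    = [0.78, 0.77, 0.79, 0.78, 0.76, 0.79, 0.78, 0.77, 0.78, 0.79]

# Paired t-statistic on the per-fold differences (null: equal mean scores)
diffs = [h - s for h, s in zip(hybrid_f1, svm_f1)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Two-sided critical value for 9 degrees of freedom at alpha = 0.05 is 2.262;
# |t| above it rejects the null hypothesis of equal means.
print(abs(t_stat) > 2.262)  # True
```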
Table 9
Event synopsis of test record a.

Date: March 2017
Event synopsis: After initiating descent on a visual approach with glideslope out of service; an A319 Flight Crew initiated a go-around when flight director caused airspeed increase and climb to intercept altitude set for ILS to previously assigned runway.
other three models, we do not analyze its performance any further. From a quantitative viewpoint, based on the cross-validation results, the proposed hybrid model yields a better performance in precision, with an average score of 0.81, 3% higher than the score of the SVM and 6% higher than that of the DNN ensemble. The proposed hybrid model also outperforms the other two models regarding the recall rate. In other words, the hybrid model has a better performance in correctly identifying the records that actually belong to each class. Besides, the F1 score of the hybrid model is 3% higher than the support vector machine and 6% higher than the deep neural networks on average. Regarding the predictions on the structured data, ordinal logistic regression performs much worse than the deep learning ensemble. In summary, the proposed decision rule to fuse the predictions from the two models is effective in enhancing the hybrid model performance.

In addition to the boxplot illustrated in Fig. 7, a statistical t-test is also used to check whether there is a significant difference in the prediction performance of the four models [64]. Specifically, we make a null hypothesis that the means of the populations from the hybrid model and any other individual model are the same; rejection of this hypothesis indicates that there is sufficient evidence that the means of the populations are significantly different, while failing to reject this hypothesis reveals that the distributions are identical. The t-test rejected the null hypothesis for all three aforementioned performance indicators, thus implying that there is a statistically significant improvement in the performance of the hybrid model compared to the SVM, DNN ensemble, and OLR models.

Fig. 8 shows the confusion matrices of the three trained models for the five considered classes in one test case. It can be observed that the hybrid method significantly increases the number of correct predictions for class 1 and class 3, while maintaining the prediction accuracy for the remaining three classes at almost the same level as the other two algorithms. The confusion matrices in Fig. 8 provide comprehensive information in terms of the number of correctly identified observations in each class. Across the five risk groups, the hybrid model correctly identifies 200 more observations than the SVM model. Besides, the confusion matrices also embody the misclassification information. As illustrated in Fig. 8, it is most probable for all three models to misclassify the records in class 1 as class 3, and vice versa. Such information can be utilized to guide the further refinement of the hybrid model.

Considering that the test record a is labeled as class 3 (medium risk), and the event synopsis of the test record a is shown in Table 9, a tree is built to demonstrate the likelihood of the occurrence of every event outcome in the medium risk category. By measuring the similarity between the event synopsis of test record a and that of the other records in the medium risk category, the probability of test record a having each outcome is obtained, and the result is illustrated in Fig. 9. The red filled node is a chance node used to identify the event in a decision tree where a degree of uncertainty exists. In this case, since the hybrid model does not have the capability to make event-level outcome predictions, we expand the risk-level prediction to an event-level outcome prediction by considering all the possible event outcomes under the corresponding risk category. Along each line is shown the probability of each event outcome occurring. As can be seen, it is most probable for the test record a to have the event outcome "Flight Crew Executed Go Around / Missed Approach", followed by "Flight Crew Became Reoriented" and "Air Traffic Control Issued New Clearance". With respect to other risk categories, similar diagrams can be constructed to represent event-level outcomes. Such event-level outcomes make it possible to connect the root cause (i.e., malfunction) and the consequence of the incident at the event outcome level.

5. Conclusion

This paper developed a hybrid model by blending a support vector machine and an ensemble of deep neural networks to quantify the risk pertaining to the consequence of hazardous events in the air transportation system. The SVM model is trained using the event synopsis, while the DNN ensemble is trained using categorical and numerical data. By merging the predictions from the two models, we formulate a hybrid
Fig. 9. The probabilistic event outcomes for test record a. (For interpretation of the references to color in this figure, the reader is referred to the web version of this
article.)
model to assess the severity of abnormal event outcomes in terms of their risk levels, using 64,573 reports on incidents/accidents that were reported between January 2006 and December 2017.

Several contributions have been made in this paper. First, we develop a risk-based event outcome categorization strategy to project the event outcomes into the space of risk quantification by collapsing the original 36 unique event outcomes into five risk groups. Secondly, this paper proposes a support vector machine and deep learning-based hybrid model to make predictions on the risk level associated with the event outcome by analyzing the event contextual features and event description in an integrated way. Thirdly, an innovative fusion rule is developed to blend the predictions from the two trained machine learning algorithms. Finally, a probabilistic tree is constructed to map the risk-level prediction to event-level outcomes. The results demonstrate that the developed hybrid model outperforms the individual models in terms of precision, recall, and F1 score.

Future work can be carried out in the following directions. If the actions taken by the pilot or other operators involved in the response to abnormal events are available, it will be useful to extend the developed hybrid model to account for such important information. The inclusion of such information helps us to identify risk mitigation actions for unforeseen events in the future by learning from the actions that have been taken in the past. Another direction worthy of investigation is to leverage data mining and pattern recognition techniques to discover the intricate event sequences by incorporating the more comprehensive reports on severe accidents available from the National Transportation Safety Board (NTSB). Since many events interact with each other and tend to exhibit certain patterns, it is helpful to identify such interactions from the available records in ASRS, and to characterize them in a rigorous manner. The modeling of such interactions can facilitate our understanding of the evolution of abnormal events, and help to develop corrective strategies. Last but not least, root cause analysis needs to be performed to identify the major contributing factors and variables that lead to the occurrence of incidents with high-risk consequences. Along this direction, appropriate permutation-based significance measures or other alternative methods need to be utilized to recognize the important incident contributing factors. The identification of crucial contributing variables enables risk-informed decision making, thereby aiding the implementation of a proactive safety paradigm in the future.

Acknowledgement

The research reported in this paper was supported by funds from the NASA University Leadership Initiative program (Grant No. NNX17AJ86A, Project Technical Monitor: Dr. Kai Goebel) through a subcontract to Arizona State University (Principal Investigator: Dr. Yongming Liu). The support is gratefully acknowledged.

References

[1] IATA forecasts passenger demand to double over 20 years, http://www.iata.org/pressroom/pr/Pages/2016-10-18-02.aspx, Accessed: 2017-12-28.
[2] X. Zhang, S. Mahadevan, Aircraft re-routing optimization and performance assessment under uncertainty, Decision Support Systems 96 (2017) 67–82.
[3] The Next Generation Air Transportation System (NextGen), https://www.nasa.gov/sites/default/files/atoms/files/nextgen_whitepaper_06_26_07.pdf, Accessed: 2018-02-08.
[4] P. Fleurquin, J.J. Ramasco, V.M. Eguiluz, Systemic delay propagation in the US airport network, Scientific Reports 3 (2013) 1159.
[5] N.B. Sarter, H.M. Alexander, Error types and related error detection mechanisms in the aviation domain: an analysis of aviation safety reporting system incident reports, The International Journal of Aviation Psychology 10 (2) (2000) 189–206.
[6] I. Hwang, C.E. Seah, Intent-based probabilistic conflict detection for the next generation air transportation system, Proceedings of the IEEE 96 (12) (2008) 2040–2059.
[7] C. Barnhart, D. Fearing, V. Vaze, Modeling passenger travel and delays in the national air transportation system, Operations Research 62 (3) (2014) 580–601.
[8] K. Ng, C. Lee, F. Chan, A robust optimisation approach to the aircraft sequencing and scheduling problem with runway configuration planning, Industrial Engineering and Engineering Management (IEEM), 2017 IEEE International Conference on, IEEE, 2017, pp. 40–44.
[9] L.F. Vismari, J.B.C. Junior, A safety assessment methodology applied to CNS/ATM-based air traffic control system, Reliability Engineering & System Safety 96 (7) (2011) 727–738.
[10] F. Netjasov, M. Janic, A review of research on risk and safety modelling in civil aviation, Journal of Air Transport Management 14 (4) (2008) 213–220.
[11] D. McCallie, J. Butts, R. Mills, Security analysis of the ADS-B implementation in the next generation air transportation system, International Journal of Critical Infrastructure Protection 4 (2) (2011) 78–87.
[12] K. Margellos, J. Lygeros, Toward 4-D trajectory management in air traffic control: a study based on Monte Carlo simulation and reachability analysis, IEEE Transactions
on Control Systems Technology 21 (5) (2013) 1820–1833.
[13] S.J. Landry, X.W. Chen, S.Y. Nof, A decision support methodology for dynamic taxiway and runway conflict prevention, Decision Support Systems 55 (1) (2013) 165–174.
[14] C. Di Ciccio, H. Van der Aa, C. Cabanillas, J. Mendling, J. Prescher, Detecting flight trajectory anomalies and predicting diversions in freight transportation, Decision Support Systems 88 (2016) 1–17.
[15] S. Sankararaman, I. Roychoudhury, X. Zhang, K. Goebel, Preliminary Investigation of Impact of Technological Impairment on Trajectory-Based Operations, 17th AIAA Aviation Technology, Integration, and Operations Conference, AIAA Aviation Forum, Denver, Colorado, United States, 2017.
[16] H.-J. Shyur, A quantitative model for aviation safety risk assessment, Computers & Industrial Engineering 54 (1) (2008) 34–44.
[17] X. Chen, I. Bose, A.C.M. Leung, C. Guo, Assessing the severity of phishing attacks: a hybrid data mining approach, Decision Support Systems 50 (4) (2011) 662–672.
[18] K. Barker, J.E. Ramirez-Marquez, C.M. Rocco, Resilience-based network component importance measures, Reliability Engineering & System Safety 117 (2013) 89–97.
[19] A. Roelen, R. Wever, A. Hale, L. Goossens, R. Cooke, R. Lopuhaä, M. Simons, P. Valk, Causal modeling for integrated safety at airports, Proceedings of ESREL, 2003, pp. 1321–1327.
[20] D.A. Wiegmann, S.A. Shappell, A Human Error Approach to Aviation Accident Analysis: The Human Factors Analysis and Classification System, Routledge, 2017.
[21] K.L. McFadden, E.R. Towell, Aviation human factors: a framework for the new millennium, Journal of Air Transport Management 5 (4) (1999) 177–184.
[22] N. Oza, J.P. Castle, J. Stutz, Classification of aeronautics system health and safety documents, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 39 (6) (2009) 670–680.
[23] S. Budalakoti, A.N. Srivastava, M.E. Otey, Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 39 (1) (2009) 101–113.
[24] ASRS Program Briefing, https://asrs.arc.nasa.gov/docs/ASRS_ProgramBriefing2016.pdf, Accessed: 2017-12-28.
[25] Y. Liu, C. Jiang, H. Zhao, Using contextual features and multi-view ensemble learning in product defect identification from online discussion forums, Decision Support Systems 105 (2018) 1–12.
[26] T. Joachims, Text categorization with support vector machines: learning with many relevant features, European Conference on Machine Learning, Springer, 1998, pp. 137–142.
[27] Y. Wang, W. Xu, Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud, Decision Support Systems 105 (2018) 87–95.
[28] F. Wang, T. Xu, T. Tang, M. Zhou, H. Wang, Bilevel feature extraction-based text mining for fault diagnosis of railway systems, IEEE Transactions on Intelligent Transportation Systems 18 (1) (2017) 49–58.
[29] R. Chen, Y. Zheng, W. Xu, M. Liu, J. Wang, Secondhand seller reputation in online
Citeseer, 1998, pp. 41–48.
[45] X. Zhang, S. Mahadevan, X. Deng, Reliability analysis with linguistic data: an evidential network approach, Reliability Engineering & System Safety 162 (2017) 111–121.
[46] J. Lilleberg, Y. Zhu, Y. Zhang, Support vector machines and word2vec for text classification with semantic features, Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on, IEEE, 2015, pp. 136–140.
[47] S. Zhu, X. Ji, W. Xu, Y. Gong, Multi-labelled classification using maximum entropy method, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2005, pp. 274–281.
[48] D. Isa, L.H. Lee, V. Kallimani, R. Rajkumar, Text document preprocessing with the Bayes formula for classification using the support vector machine, IEEE Transactions on Knowledge and Data Engineering 20 (9) (2008) 1264–1272.
[49] S. Chakrabarti, S. Roy, M.V. Soundalgekar, Fast and accurate text classification via multiple linear discriminant projections, The VLDB Journal 12 (2) (2003) 170–185.
[50] Natural Language Toolbox, https://www.nltk.org/, Accessed: 2018-04-02.
[51] K. Spärck Jones, IDF term weighting and IR research lessons, Journal of Documentation 60 (5) (2004) 521–523.
[52] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[53] J. Evermann, J.-R. Rehse, P. Fettke, Predicting process behaviour using deep learning, Decision Support Systems 100 (2017) 129–140.
[54] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[55] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Processing Magazine 29 (6) (2012) 82–97.
[56] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.
[57] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion 16 (2014) 3–17.
[58] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2002, pp. 694–699.
[59] S. Sankararaman, S. Mahadevan, Model validation under epistemic uncertainty, Reliability Engineering & System Safety 96 (9) (2011) 1232–1241.
[60] P. Ahlgren, C. Colliander, Document-document similarity approaches and science mapping: experimental comparison of five approaches, Journal of Informetrics 3 (1) (2009) 49–63.
[61] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representation, ACM, 2015, pp. 1–13.
[62] J.D. Rennie, N. Srebro, Loss functions for preference levels: regression with discrete ordered labels, Proceedings of the IJCAI multidisciplinary workshop on advances in
markets: a text analytics framework, Decision Support Systems 108 (2018) 96–106. preference handling, Kluwer, Norwell, MA, 2005, pp. 180–186.
[30] A.S. Abrahams, W. Fan, G.A. Wang, Z.J. Zhang, J. Jiao, An integrated text analytic [63] F. Pedregosa-Izquierdo, Feature Extraction and Supervised Learning on fMRI: From
framework for product defect discovery, Production and Operations Management Practice to Theory, Ph.D. thesis Université Pierre et Marie Curie-Paris VI, 2015.
24 (6) (2015) 975–990. [64] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, Chapman and Hall/
[31] G. Guo, H. Wang, D. Bell, Y. Bi, K. Greer, Using kNN model for automatic text CRC, 2012.
categorization, Soft Computing 10 (5) (2006) 423–430.
[32] M.-L. Zhang, Z.-H. Zhou, Multilabel neural networks with applications to functional Xiaoge Zhang received his B.S. degree from Chongqing University of Posts and
genomics and text categorization, IEEE Transactions on Knowledge and Data Telecommunications in 2011, and the M.S. degree from Southwest University in 2014,
Engineering 18 (10) (2006) 1338–1351. both in Chongqing, China. Currently, he is pursuing his Ph.D. degree in Vanderbilt
[33] M. Lan, C.L. Tan, J. Su, Y. Lu, Supervised and traditional term weighting methods University, and is expected to graduate in early 2019. From August to December in 2016,
for automatic text categorization, IEEE Transactions on Pattern Analysis and he interned at the National Aeronautics and Space Administration (NASA) Ames Research
Machine Intelligence 31 (4) (2009) 721–735. Center (ARC), Moffett Field, CA, working at the Prognostics Center of Excellence (PCoE)
[34] S. Tong, D. Koller, Support vector machine active learning with applications to text led by Dr. Kai Goebel. He was a recipient of the Chinese Government Award for
classification, Journal of Machine Learning Research 2 (Nov) (2001) 45–66. Outstanding Self-financed Students Abroad in 2017. He has published more than 30 re-
[35] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998. search papers in peer-reviewed journals, such as IEEE Transactions on Cybernetics, IEEE
[36] N. Li, D.D. Wu, Using text mining and sentiment analysis for online forums hotspot Transactions on Reliability, Decision Support Systems, Information Sciences, Safety
detection and forecast, Decision Support Systems 48 (2) (2010) 354–368. Science, Annals of Operations Research, Reliability Engineering and System Safety, and
[37] Y. Tang, Y.-Q. Zhang, N.V. Chawla, S. Krasser, SVMs modeling for highly im- International Journal of Production Research, among others. Four of his publications are
balanced classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B included in the Essential Science Indicators (ESI) highly cited papers. His current research
39 (1) (2009) 281–288. interests include uncertainty quantification, reliability assessment, risk analysis, network
[38] W. Zhang, T. Du, J. Wang, Deep learning over multi-field categorical data, optimization, and data analytics. He is a student member of IEEE, INFORMS, and SIAM.
European Conference on Information Retrieval, Springer, 2016, pp. 45–57.
[39] Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil, A latent semantic model with con-
volutional-pooling structure for information retrieval, Proceedings of the 23rd ACM Sankaran Mahadevan is John R. Murray Sr. Professor of Engineering, and Professor of
International Conference on Conference on Information and Knowledge Civil and Environmental Engineering at Vanderbilt University, Nashville, Tennessee,
Management, ACM, 2014, pp. 101–110. where he has served since 1988. His research interests are in the areas of uncertainty
[40] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classifi- quantification, model verification and validation, reliability and risk analysis, design
optimization, and system health monitoring, with applications to civil, mechanical and
cation with imbalanced data: empirical results and current trends on using data
intrinsic characteristics, Information Sciences 250 (2013) 113–141. aerospace systems. His research has been extensively funded by NSF, NASA, FAA, DOE,
[41] J.F. Díez-Pastor, J.J. Rodríguez, C. García-Osorio, L.I. Kuncheva, Random balance: DOD, DOT, NIST, GE, GM, Chrysler, Union Pacific, American Railroad Association, and
ensembles of variable priors classifiers for imbalanced data, Knowledge-Based Sandia, Idaho, Los Alamos and Oak Ridge National Laboratories. His research contribu-
Systems 85 (2015) 96–111. tions are documented in more than 600 publications, including two textbooks on relia-
bility methods and 280 journal papers. He has directed 42 Ph.D. dissertations and 24 M.
[42] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority
over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) S. theses, and has taught several industry short courses on reliability and risk analysis
methods. He is a Fellow of AIAA and EMI (ASCE). His awards include the NASA Next
321–357.
[43] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen, X. He, Interpreting the public Generation Design Tools award (NASA), the SAE Distinguished Probabilistic Methods
sentiment variations on Twitter, IEEE Transactions on Knowledge and Data Educator Award, SEC Faculty Award, and best paper awards in the MORS Journal and the
Engineering 26 (5) (2014) 1158–1170. SDM and IMAC conferences. Professor Mahadevan obtained his B.S. from Indian Institute
[44] A. McCallum, K. Nigam, et al., A comparison of event models for Naïve Bayes text of Technology, Kanpur, M.S. from Rensselaer Polytechnic Institute, Troy, NY, and Ph.D.
from Georgia Institute of Technology, Atlanta, GA.
classification, AAAI-98 workshop on learning for text categorization, vol. 752,