
Computers in Biology and Medicine 68 (2016) 27–36


Unsupervised learning assisted robust prediction of bioluminescent proteins

Abhigyan Nath*, Karthikeyan Subbiah*
Department of Computer Science, Banaras Hindu University, Varanasi 221005, India
* Corresponding authors. E-mail addresses: abhigyannath01@gmail.com (A. Nath), karthinikita@gmail.com (K. Subbiah).

Article history:
Received 13 April 2015
Accepted 28 October 2015

Abstract

Bioluminescence plays an important role in nature, for example, it is used for intracellular chemical signalling in bacteria. It is also used as a useful reagent for various analytical research methods ranging from cellular imaging to gene expression analysis. However, identification and annotation of bioluminescent proteins is a difficult task as they share poor sequence similarity among themselves. In this paper, we present a novel approach for within-class and between-class balancing as well as diversifying of a training dataset by effectively combining the unsupervised K-Means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) in order to achieve the true performance of the prediction model. Further, we experimented by varying the balancing ratio of positive data to negative data in the training dataset in order to probe for an optimal class distribution which produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near complete learning with greater generalization on the blind test datasets. The obtained results strongly support the fact that an optimal class distribution with a high degree of diversity is an essential factor to achieve near perfect learning. Using random forest as the weak learner in boosting and training it on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% on a tenfold cross validation test, and an accuracy of 91.7%, sensitivity of 89.3% and specificity of 91.8% on a holdout test set. It is quite possible that the general framework discussed in the current work can be successfully applied to other biological datasets to deal with imbalance and incomplete learning problems effectively.
© 2015 Elsevier Ltd. All rights reserved.

Keywords:
Class imbalance
Training set diversity
Optimal class distribution
K-Means
SMOTE

1. Introduction
There are mainly two phenomena, bioluminescence and biofluorescence, which are responsible for the emission of visible light from living organisms. The mechanisms of these two processes are distinct: the former involves a chemical reaction, while the latter involves absorption of light from external sources and its emission after transformation. Bioluminescence is observed in both terrestrial and marine habitats. The chemical reaction responsible for bioluminescence generates very little heat and can be categorized into oxygen dependent (luciferin-luciferase system) and oxygen independent types (e.g. photoproteins). The colour of the emission is governed by the amino acid sequence, and by accessory proteins like yellow fluorescent proteins (YFP) and green fluorescent proteins (GFP) [1]. Diverse systems for bioluminescence exist in nature; for example, in dinoflagellates, specialized organelles known as scintillons [2,3] exhibit bioluminescence. Bioluminescence plays an important role in
bacterial intracellular chemical signalling and in symbiosis, a common example of which is the association between Euprymna scolopes and Vibrio fischeri [4,5], as well as in attracting mates and repelling predators. The independent evolution of bioluminescence in different organisms has been discussed by Wilson and Hastings [1]. In some organisms, the usefulness of bioluminescence is still unknown.
In silico prediction of bioluminescent proteins (BLPs) was first carried out by Kandaswamy et al. [6]. They developed BLProt, which is an SVM based method. Their prediction model was trained using 544 amino acid physicochemical properties. The prediction of bioluminescent proteins was further improved by Zhao et al. (BLPre) [7] using evolutionary information in the form of PSSMs (Position Specific Scoring Matrices) obtained from PSI-BLAST. Fan et al. [8] used a balanced dataset (an equal number of positive and negative samples for training) with average chemical shift and modified pseudo amino acid composition for the prediction of bioluminescent proteins. Recently, Huang [9] proposed a scoring card method (SCBM) for their prediction.
Imbalanced class ratios are often encountered in protein family classification problems. This causes over-representation of instances belonging to the majority class and under-representation of instances belonging to the minority class in the
training set. Machine learning models trained with an imbalanced training dataset have a classification bias towards the majority class and behave like a majority class classifier. This issue of imbalanced datasets has not been given the attention it deserves in the bioinformatics community.
In the current prediction problem, the bioluminescent proteins (BLPs) are the positive minority class (which is the class of interest) and the majority class consists of all the non-bioluminescent proteins (NBLPs) belonging to various other protein families. The negative class is naturally very large compared to the number of BLPs. So, the bioluminescent protein prediction training dataset is one of the classic examples of an imbalanced dataset. This imbalance in class distribution greatly affects the accuracy of predicting the positive class instances (as the prediction models tend to act as majority class classifiers), which is also quite evident from previous studies [6–9].
When we use any machine learning algorithm to build a prediction model, the main motive is to maximize the generalization ability of the model. This ensures that the trained predictive model will yield good prediction accuracy on future unseen data. Ideally, the training dataset presented to the learning algorithm should be properly diversified by covering representatives from the entire input instance space to achieve the maximum possible generalization ability. If the training data are composed of a large number of very similar instances, the model may become biased towards those instances. This notion holds true for both between-class (inter-class) and within-class (intra-class) instances. So the diversification of the training set is essential to gain enhanced generalization. Both between-class imbalance and within-class imbalance have a negative influence on the performance of machine learning algorithms [10].
In the present study, we have created a diversified and balanced training dataset by using the unsupervised K-Means clustering algorithm (to deal with the within-class imbalance, where each class contains subgroups of similar instances of varying numbers) and then using SMOTE [11] (to selectively amplify the representative minority class sequences for balancing the between-class imbalance). The boosted random forest algorithm, which performed considerably better than the other machine learning algorithms, was used to create our prediction model.
As the next part of this study, we investigated the effect on prediction performance of varying the balancing ratio from the ideal ratio (that is, 1:1) to the original imbalance ratio. Analysis of the experimental results revealed that the best prediction performance can be achieved at an optimal balancing ratio rather than at the ideal balancing ratio. It was found that another performance factor (diversity) gets affected at the ideal balancing ratio of 1:1. This motivated us to probe for the optimal class distribution required to achieve superior accuracy (the one that provides the best trade-off between the inter-class balancing ratio and diversity). The optimal class distribution is seldom explored in bioinformatics.
Finally, individual features were ranked using the ReliefF feature ranking algorithm, and the performance of the classifier was investigated by varying the number of features from the 5 most discriminating features up to 40 (according to their rank), recording the calculated performance evaluation metrics obtained for RARF. The prediction performance increases with the increasing number of features (according to their ranks). This confirms the presence of large diversity among BLPs and the need for finding the optimal class distribution in order to achieve the best prediction performance. The superiority of the proposed framework compared to random sampling is also discussed.

2. Materials and methods


2.1. Dataset
We used the dataset of Kandaswamy et al. [6], which consists of 441 positive class sequences (bioluminescent proteins) having less than 40% sequence identity and 18,202 negative class sequences (non-bioluminescent proteins, NBLPs) having more than 40% sequence identity. Redundant sequences in the dataset may result in bias and overestimation of the model evaluation parameters. So we used CD-HIT [12] to reduce the redundancy by removing sequences having more than 40% sequence identity, which resulted in 13,446 negative sequences. The final dataset consisted of approximately a 1:30 positive to negative instance ratio. Data imbalance is intrinsically present in most protein family classification problems and affects the accuracy of predicting the members of a particular protein family. So the datasets need to be appropriately balanced to achieve the true performance of the classifiers.
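As an illustration, a minimal sketch of this redundancy-reduction step, assuming the cd-hit command-line tool is installed (file names here are placeholders, not from the original study):

```python
import subprocess

# Cluster the negative sequences at 40% identity and keep one representative per
# cluster; cd-hit requires word size 2 for thresholds in the 0.4-0.5 range.
subprocess.run(
    ["cd-hit", "-i", "negatives.fasta", "-o", "negatives_nr40.fasta",
     "-c", "0.4", "-n", "2"],
    check=True,
)
```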
2.2. Sequence based input features
The input vectors were created by extracting the following
three types of features from every protein sequence.
(i) Amino acid composition: We used the percentage composition of amino acid residues as one of the feature vectors. This feature was selected on the assumption that there are specific avoidances and preferences of certain amino acids in the formation of a protein family to perform a common functionality, which results in distinguishable frequency compositions (f_res).

$f_{res} = \frac{N_{res,i}}{N_{total\_res,i}} \times 100$

where
res stands for one of the 20 different amino acid residues,
f_res denotes the percentage frequency of the specific residue in the ith sequence,
N_res,i denotes the total count of amino acids of the specific type in the ith sequence, and
N_total_res,i denotes the total count of all residues in the ith sequence (i.e. the sequence length).
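A minimal sketch of this feature, with our own (hypothetical) helper names:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(sequence: str) -> list[float]:
    """Percentage frequency of each of the 20 amino acids in one sequence."""
    counts = Counter(sequence)
    length = len(sequence)
    return [100.0 * counts.get(aa, 0) / length for aa in AMINO_ACIDS]

# Example on a short hypothetical sequence
print(aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```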
(ii) Amino acid property group composition: The percentage frequency counts of amino acid property groups were used as the second component of the feature vector. The different amino acid property groups [13] selected for this study are given in Table 1. This is a refinement over the amino acid frequency composition, where the count of a specific property group is computed instead of the individual amino acid count.

$f_{pg} = \frac{N_{pg,i}}{N_{total\_res,i}} \times 100$

where
pg denotes one of the 11 different amino acid property groups,
f_pg denotes the percentage frequency of the specific amino acid property group in the ith sequence,
N_pg,i denotes the total count of the specific amino acid property group in the ith sequence, and
N_total_res,i denotes the total count of all residues in the ith sequence.
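A minimal sketch of the property group composition, using an illustrative subset of the groups from Table 1 (the full group definitions are given there):

```python
# One-letter codes for a few of the property groups of Table 1 (illustrative subset).
PROPERTY_GROUPS = {
    "tiny":     set("ACGST"),
    "aromatic": set("FHWY"),
    "charged":  set("DEHRK"),
    "acidic":   set("DE"),
}

def property_group_composition(sequence: str) -> list[float]:
    """Percentage frequency of each property group in one sequence."""
    length = len(sequence)
    return [100.0 * sum(aa in group for aa in sequence) / length
            for group in PROPERTY_GROUPS.values()]
```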


(iii) Physicochemical n-grams: Physicochemical properties of amino acid residues play an important role in determining a protein's function. There are many effective ways to incorporate the physicochemical properties of amino acid residues for representing protein sequences, which can be used to discriminate the proteins of interest from other proteins. For calculating physicochemical n-grams, we used a sliding window of length n (where n is an integer). If all the amino acid residues inside the sliding window share the same physicochemical group, then the frequency of that physicochemical group is incremented. If the amino acid residues share more than one physicochemical group, then corresponding counts are made for all those physicochemical groups. The physicochemical groups listed in Table 1 were retained for the calculation of physicochemical n-grams. For example, for the small group,

$\text{Physicochemical 2-gram(Small)} = \sum_{i=1}^{N-1} C(i, i+1)$

where
N denotes the length of the protein sequence,
i denotes the position of the amino acid residue along the protein sequence, and
C(i, i+1) = 1 if the condition aa_i ∈ S and aa_{i+1} ∈ S is satisfied, and C(i, i+1) = 0 otherwise. The set of small amino acids is S = {Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val}.
In a similar way, the physicochemical 2-grams for the other ten physicochemical property groups were calculated. This feature captures the most important position-related information. So our input vector is a judicious combination of amino acid frequency compositions, physicochemical properties and positional features extracted from every sequence in the dataset.
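A minimal sketch of the physicochemical 2-gram count for a single property group (helper names are ours; the groups come from Table 1):

```python
def physicochemical_2gram(sequence: str, group: set) -> int:
    """Count adjacent residue pairs in which both residues belong to the given group."""
    return sum(sequence[i] in group and sequence[i + 1] in group
               for i in range(len(sequence) - 1))

SMALL = set("ACDGNPSTV")  # Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr, Val
print(physicochemical_2gram("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", SMALL))
```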
2.3. Necessity of the balancing and diversification protocol
For any supervised learning algorithm, the presentation of a diverse set of labelled data from the entire input space belonging to the different classes is very important for proper learning of all the concepts and sub-concepts. Moreover, a dataset is said to be imbalanced when there is a large difference between the numbers of examples belonging to the different classes. Normally, every classifier tends to become a majority class classifier. The bioluminescent protein classification problem is one of the classic examples of the class imbalance problem. In the present classification problem, the minority examples are the bioluminescent proteins and the majority examples belong to the non-bioluminescent proteins.
Table 1
The property groups of amino acids that have been taken for the present analysis.

S.No.  Amino acid property group  Amino acids in the specific group
1.     Tiny group                 Ala, Cys, Gly, Ser, Thr
2.     Small group                Ala, Cys, Asp, Gly, Asn, Pro, Ser, Thr and Val
3.     Aliphatic group            Ile, Leu and Val
4.     Non-polar group            Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp and Tyr
5.     Aromatic group             Phe, His, Trp and Tyr
6.     Polar group                Asp, Glu, His, Lys, Asn, Gln, Arg, Ser and Thr
7.     Charged group              Asp, Glu, His, Arg, Lys
8.     Basic group                His, Lys and Arg
9.     Acidic group               Asp and Glu
10.    Hydrophobic group          Ala, Cys, Phe, Ile, Leu, Met, Val, Trp and Tyr
11.    Hydrophilic group          Asp, Glu, Lys, Asn, Gln and Arg

Most learning algorithms are designed to optimize accuracy as the evaluation metric during the process of learning the concepts and sub-concepts from the dataset. When accuracy is taken as the evaluation metric, the learning becomes strongly biased towards the majority class, correctly predicting most of the majority class instances compared to the minority class instances. So accuracy is not a true indicator of better performance, as it is a weighted average of the accuracies in predicting both the majority and minority classes. Often, when the data are imbalanced and the class of interest is the minority class, it is very important to gain better predictive accuracy and generalization ability for the minority class than for the majority class. When the imbalance ratio between the majority and minority class instances is high, the accuracy in predicting the minority class is low, as the learning algorithm has less opportunity to learn all the minority class sub-concepts compared to the majority class sub-concepts due to the overwhelming number of instances from the dominant class. This may also result in misclassification of some of the minority class instances into the majority class.
Apart from between-class imbalance, within-class imbalance in the training data may also result in lower generalization of the learned models. The presence of rare or less common cases results in within-class imbalance. If the common cases are present in larger numbers in the training data, then the learning algorithm will have less opportunity to learn the rare-case sub-concepts. Ideally, there should be an adequate representation of common as well as rare cases from both majority and minority sub-classes in the training data. Also, the class distribution of the training and testing samples should be similar, otherwise the model may not generalize well on the unseen test set examples. Here we propose a hybrid sampling method using K-Means clustering and SMOTE for creating a balanced and diversified training dataset. Diversification of the training set is important as it includes as many distinct training samples as possible to maximize the generalization ability.
One of the popular methods to handle class-imbalanced data is sampling of the dataset. Sampling can be done in a random manner, as in random downsampling and random oversampling, or in an intelligent manner by using SMOTE or its variants [14]. There is a good chance of losing important instances in random downsampling of the majority class, while in random oversampling there is a good chance that some instances of the minority class get overrepresented. Both of these situations result in incomplete learning. Random sampling reduces the data variation in the training set and consequently results in low generalization of the prediction model. Jo et al. [15] addressed the problem of small disjuncts (disjuncts can be defined as those regions in the input space that cover only a few training examples). These small disjuncts are difficult to learn, and ideally there should be an adequate representation of them in the training data.
2.3.1. K-Means clustering
This clustering technique aims to find homogeneous groups that occur naturally in a dataset. It is an unsupervised clustering method which, given a similarity measure (clustering criterion), tries to find hidden patterns in the dataset and groups together the more similar entries. We applied K-Means clustering separately on the positive and negative class instances. The objective function, which is to be minimized during each iteration of K-Means, is given as follows:

$SSE = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \lVert P_i^j - C_j \rVert^2$

where
C_j denotes the centroid of the jth cluster,
P_i^j denotes the ith pattern of the jth cluster,
n_j denotes the number of objects in the jth cluster,
K denotes the predetermined number of clusters, and
|| · || denotes the distance metric used.
The entire set of feature vectors was normalized before clustering, with Euclidean distance as the distance metric. To avoid getting stuck in local minima, we repeated the clustering 10 times with a new set of centroid positions for the initial clusters. To determine the optimal value of K, (i) we calculated the ratio of within-cluster variance to between-cluster variance (distortion ratio) for every value of K from 1 to 441 (the total number of bioluminescent proteins in the dataset) using the K-Means algorithm on the positive dataset, (ii) a graph was plotted of K versus distortion ratio (Fig. 1), and (iii) the value of K after which there is no significant decrease in the corresponding value of the distortion ratio was taken as its optimal value (375). We used the optimal K value obtained from the positive dataset as the initial value of K for the negative dataset (the majority class), due to the very high time and space complexity of finding its K_optimal as well as the need to deal with within-class imbalance. The experiments were carried out with K_optimal as well as with integral multiples of K_optimal (from 2 × K_optimal to 7 × K_optimal) for the negative dataset to find the best balancing factor. The rationale for using higher values of K for the negative dataset is that the K_optimal for the positive dataset need not be optimal for the relatively larger negative dataset. Moreover, clustering with the same K_optimal for the negative dataset causes tight clustering of the negative instances and may result in merging two or more distinct clusters or redistributing unique cluster elements to different clusters. Accordingly, higher values of K for the negative class were used to obtain a more relaxed clustering so as to include the vast diversity of the negative instances. For the given K_optimal of 375 for the positive dataset, the possible training datasets were created, such as Training Set I: 375 positive and 375 negative instances; Training Set II: 375 positive and 750 negative instances; Training Set III: 375 positive and 1125 negative instances; and so forth.
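A minimal sketch of the distortion-ratio curve used to choose K, assuming scikit-learn (variable and function names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion_ratio(X: np.ndarray, k: int) -> float:
    """Ratio of within-cluster variance to between-cluster variance for k clusters."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    within = km.inertia_  # sum of squared distances to the nearest centroid
    sizes = np.bincount(km.labels_, minlength=k)
    between = float((sizes * ((km.cluster_centers_ - X.mean(axis=0)) ** 2).sum(axis=1)).sum())
    return within / between

# X_pos: normalized feature matrix of the 441 positive (BLP) sequences
# curve = [distortion_ratio(X_pos, k) for k in range(2, 442)]
# The elbow of this curve (K = 375 in the present study) is taken as K_optimal.
```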
The main purpose of clustering is to select representative samples from both the positive and negative classes to achieve diversification of the training set, so as to minimize the within-class imbalance. One instance from each cluster is selected from both the positive class clusters and the negative class clusters for the training set, and the rest of the cluster members are retained in the testing set.
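A minimal sketch of selecting one representative per cluster (taking the member closest to each centroid is one reasonable interpretation of this step; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def cluster_representatives(X: np.ndarray, k: int) -> np.ndarray:
    """Return indices of one instance per cluster (the member closest to each centroid)."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
    return np.unique(rep_idx)

# train_pos = cluster_representatives(X_pos, 375)        # K_optimal for the positive class
# train_neg = cluster_representatives(X_neg, 5 * 375)    # e.g. 5 x K_optimal for the negatives
# All remaining (non-representative) instances form the holdout testing set.
```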
2.3.2. SMOTE
In SMOTE, the minority class is oversampled by inducing artificial instances. It is a nearest-neighbour based method. It randomly selects a minority class instance and its N nearest minority class neighbours (the default value of N is 5). The distance is calculated between the sample and one of the randomly chosen nearest neighbours in the feature space, and a synthetic instance is then created along the line segment between the minority sample and its selected nearest neighbour.
In cases where an unequal K is used (for example 2 × K, 3 × K, etc.), we used SMOTE to selectively oversample the positive class representative instances to equal the number of negative class instances. We experimented with different percentages of SMOTE sampling and examined the effect of balanced and imbalanced datasets on the prediction evaluation metrics by creating different datasets with different proportions of positive and negative instances. The properties of the different training and testing sets are presented in Table 2.
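A minimal sketch of the SMOTE balancing step, assuming the imbalanced-learn package (the oversampling percentages correspond to Table 2):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def smote_balance(X_train: np.ndarray, y_train: np.ndarray, k_neighbors: int = 5):
    """Oversample the minority (positive) class until both classes are equal in size."""
    sampler = SMOTE(sampling_strategy=1.0, k_neighbors=k_neighbors)
    return sampler.fit_resample(X_train, y_train)

# Example: 375 representative positives and 1875 representative negatives (set 5.1)
# become 1875 positives and 1875 negatives (set 5.5, i.e. 400% SMOTE).
# X_bal, y_bal = smote_balance(X_train, y_train)
```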
2.3.3. Boosted random forest
Boosting [16,17] combines many weak base learners linearly to construct a strong classifier with improved accuracy. It is an iterative procedure. During each iteration, the incorrectly classified instances from both the positive and negative classes are given more weight, so that the learning is concentrated on the hard, difficult-to-classify instances in the training set. It is a sequential ensemble method where the subsequent learners are evolved from the previous learners.
Random forest [18] is an ensemble learning method consisting of many individual decision trees. Classifier ensembles promote an optimal trade-off between diversity and accuracy. Ensemble classifiers usually outperform single classifiers and are robust to the presence of noise in the data and to overfitting of inputs [19]. Different base classifiers making errors in different parts of the hypothesis space give better accuracy when properly combined together.
The concept of bagging [20] is implemented in the random forest classification algorithm. In random forest, bootstrap samples from the training set with randomly selected feature subsets are evaluated at each node of the decision tree. The final decision is made by decision fusion of all the trees by majority voting.

Fig. 1. Plot of distortion ratio versus number of clusters for bioluminescent protein instances.

Random forests have been successfully applied to many classification and prediction tasks [21,22]. The major steps of random forest are summarized as follows: (1) a bagged sample is drawn from the training data; (2) a decision tree is grown without pruning on the bagged sample, where at each node a randomly selected subset of features from the full feature set is evaluated; (3) the decisions from all the individual trees are fused.
We have used random forest as the weak learner for the boosting algorithm. Recently, some authors have also successfully applied boosted random forests for classification and prediction [23,24]. Real AdaBoost is one of the popular modifications of the AdaBoost algorithm; the major steps are the same except that it involves the calculation of real-valued class probability estimates. We experimented with both discrete and real AdaBoost algorithms. The schematic representation of the proposed methodology is shown in Fig. 2.
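A minimal sketch of the boosted random forest (RARF-style) learner, assuming scikit-learn; the SAMME.R option corresponds to Real AdaBoost, and all hyperparameter values are illustrative rather than those of the original study:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Small random forests as the weak base learners inside Real AdaBoost.
# Depending on the scikit-learn version, the base learner is passed as
# `estimator` or `base_estimator`.
rarf = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=50,
    algorithm="SAMME.R",
)
# rarf.fit(X_bal, y_bal)
# y_pred = rarf.predict(X_test)
```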

Table 2
Properties of different training and testing sets.

Training set ID   No. of positive instances   No. of negative instances   Property
1.1               375                         375                         Balanced
2.1               375                         750                         Imbalanced
2.2               750                         750                         Balanced with 100% SMOTE
3.1               375                         1125                        Imbalanced
3.2               750                         1125                        Balanced with 100% SMOTE
3.3               1125                        1125                        Balanced with 200% SMOTE
4.1               375                         1500                        Imbalanced
4.2               750                         1500                        Imbalanced
4.3               1125                        1500                        Imbalanced
4.4               1500                        1500                        Balanced with 300% SMOTE
5.1               375                         1875                        Imbalanced
5.2               750                         1875                        Imbalanced
5.3               1125                        1875                        Imbalanced
5.4               1500                        1875                        Imbalanced
5.5               1875                        1875                        Balanced with 400% SMOTE
6.1               375                         2250                        Imbalanced
6.2               750                         2250                        Imbalanced
6.3               1125                        2250                        Imbalanced
6.4               1500                        2250                        Imbalanced
6.5               1875                        2250                        Imbalanced
6.6               2250                        2250                        Balanced with 500% SMOTE
7.1               375                         2625                        Imbalanced
7.2               750                         2625                        Imbalanced
7.3               1125                        2625                        Imbalanced
7.4               1500                        2625                        Imbalanced
7.5               1875                        2625                        Imbalanced
7.6               2250                        2625                        Imbalanced
7.7               2625                        2625                        Balanced with 600% SMOTE
2.4. Evaluation metrics

The performance of the machine learning methods is evaluated using threshold-dependent and threshold-independent parameters, which are calculated from the numbers of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP).

Sensitivity: expresses the percentage of correctly predicted BLPs.

$Sensitivity = \frac{TP}{TP + FN} \times 100$

Specificity: expresses the percentage of correctly predicted NBLPs.

$Specificity = \frac{TN}{TN + FP} \times 100$

Accuracy: expresses the percentage of correctly predicted BLPs and NBLPs.

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \times 100$

AUC: the receiver operating characteristic curve can be summarized by a single numerical quantity known as the area under the curve (AUC). An AUC value close to 1 is considered good.

g-means: the geometric mean of sensitivity and specificity, calculated as

$g\text{-}means = \sqrt{Sensitivity \times Specificity}$

Youden's index (Y): this parameter measures the model's ability to avoid failures and is calculated as

$Y = Sensitivity - (1 - Specificity)$
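A minimal sketch computing these metrics from a confusion matrix (helper names are ours):

```python
import math

def evaluation_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Metrics used above; sensitivity, specificity and accuracy as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    g_means = math.sqrt(sensitivity * specificity)
    youden = sensitivity - (1 - specificity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "g_means": g_means, "youden": youden}

# Example: TP=89, FN=11, TN=92, FP=8 -> sensitivity 0.89, specificity 0.92, Youden 0.81
```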

3. Results



Initially, we trained the following six different machine learning algorithms using the randomly balanced training set: support vector machine with sequential minimal optimization (SMO), K-nearest neighbour (IBK), random forest (RF), rotation forest (ROF), RARF (Real AdaBoost random forest) and ARF (AdaBoost random forest). The performance evaluation metrics on the randomly balanced training set (using tenfold stratified cross validation) and on the holdout testing set are presented in Table 3.
The overall accuracy of the tree-based algorithms is better than that of the SMO and IBK algorithms. All the algorithms performed relatively well on the positive samples. RARF gave sensitivity comparable to the other learning algorithms, with the highest overall accuracy, AUC, g-means and Youden's index on the training set using tenfold cross validation. The same trend is also observed on the testing set.
3.1. Effect of balanced training set on performance evaluation parameters
We selected the RARF algorithm for further analysis with varying balancing ratios, as it outperformed the other learning algorithms on the randomly balanced dataset. The performance evaluation metrics of RARF using the different training and testing sets are presented in Table 4.
It can be observed that RARF performed relatively better whenever the training set was fully balanced (training set IDs 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, and 7.7), and the sensitivity (accuracy on the positive class) increased with the increasing rate of SMOTE oversampling. RARF achieved higher sensitivity values on fully balanced training sets compared to the other, partially balanced training sets. An opposite trend of decreasing specificity values is observed with the increasing rate of SMOTE oversampling. Though the specificity values on the fully balanced training sets are the lowest compared to the other training sets, the overall accuracy of RARF increased with the rate of SMOTE oversampling. On all the fully balanced training sets, the highest accuracy values can be observed compared to the other training sets. Full balancing of training instances between positive and negative samples also has a positive effect on AUC values. Higher AUC values were observed in training sets having a lower imbalance ratio.


Fig. 2. Schematic representation of the methodology followed for this study.

Table 3
Performance evaluation metrics of different machine learning algorithms trained using the imbalanced training set.

Training set 1 (tenfold cross validation):
Algorithm   Sensitivity   Specificity   Accuracy   AUC     g-Means   Youden's index
SMO         87.5          70.7          79.1       0.791   78.6      0.582
IBK         88.8          55.7          72.3       0.723   70.3      0.445
ROF         81.1          79.7          80.4       0.886   80.3      0.608
RF          85.1          80.5          82.8       0.907   82.7      0.656
RARF        86.1          82.1          84.1       0.923   84.07     0.682
ARF         84.5          81.3          82.9       0.909   82.8      0.658

Testing set 1:
Algorithm   Sensitivity   Specificity   Accuracy   AUC     g-Means   Youden's index
SMO         98.5          59.9          60.1       0.792   76.8      0.584
IBK         100           52.1          52.3       0.761   72.1      0.521
ROF         93.9          70.2          70.4       0.932   81.1      0.641
RF          93.9          69.9          70.0       0.928   81.0      0.638
RARF        97.0          73.6          73.8       0.949   84.4      0.706
ARF         93.9          69.6          69.7       0.922   80.7      0.635
The g-means reflects the model's accuracy on both the positive and negative instances. Higher g-means values were observed in training sets with lower imbalance ratios. In training sets where the imbalance is more pronounced, lower accuracy for the positive class (sensitivity) was observed compared to the accuracy for the negative class (specificity). Observing the Youden's index values, imbalanced training sets had a lower fault avoidance rate compared to the fully balanced training sets. The learning of RARF on fully balanced training sets is nearly complete, as is evident from the performance evaluation metrics. Partially balanced and fully balanced training sets have comparable performance evaluation metrics for the RARF algorithm (training set IDs 4.3, 4.4, 5.3, 5.5, 6.5, 6.6, 7.6, 7.7). The sensitivity values on the holdout testing sets also increased with the rate of SMOTE oversampling of the representatives of the minority class samples in the training sets. The best trade-offs between the different performance evaluation metrics were observed for training set 5.5.
Although training set 7.7 gave 96.4% overall accuracy on tenfold cross validation, the overall accuracy on the testing set was only 92.5%. Based on g-means and AUC, which are robust to the imbalanced nature of the datasets, training set 5.5 can be considered the optimal training set for the prediction of bioluminescent proteins. Also, the accuracy in predicting the positive class, bioluminescent proteins (sensitivity), did not improve further after training set 5.5.
The training datasets were created by varying the balancing ratio of positive class to negative class instances with an increasing number of training instances. The proposed prediction model trained with training set 5.5 yielded the best performance on the testing set. This indicates that dataset 5.5 is optimally diversified with both the positive and negative class instances, providing the highest generalization ability. This is because the initial training datasets contained fully diversified positive class instances and partially diversified negative class instances; the input space of the negative class was much larger (more distinct sub-classes) than the input space of the positive class (fewer distinct sub-classes). Accordingly, the optimally diversified (both the positive and negative classes) training set 5.5 provides the learning algorithm with all sorts of distinct samples for learning the sequence features of both classes, giving the learning algorithm a chance to learn the entire input space and thereby achieving near perfect learning, while the balanced ratio of positive to negative class samples keeps the accuracy for the minority positive class samples optimal.
We also compared the performance of RARF using a random training set with the same ratio as in training set 5.5, with SMOTE oversampling of randomly selected positive samples and randomly undersampled majority class instances (Table 5). Using random sampling, we created ten random training and testing sets. The performance of RARF on the K-Means preprocessed training set under tenfold cross validation is comparable with that of the randomly SMOTE-oversampled training sets, but the sensitivity, g-means and AUC values of the latter are lower than those of the K-Means preprocessed training set, and the same trend is also observed on the holdout testing set.

Table 4
Performance evaluation metrics of RARF on the different training sets with different ratios of positive and negative samples (learning algorithm: RARF).

Tenfold cross validation on the training sets:
Training set   Sensitivity   Specificity   Accuracy   AUC     g-Means   Youden's index
1.1            86.1          82.1          84.1       0.923   84.07     0.682
2.1            74.1          94.1          87.5       0.918   90.5      0.682
2.2            91.1          90.0          90.9       0.968   90.5      0.811
3.1            61.9          97.4          88.5       0.919   77.6      0.593
3.2            86.8          95.1          91.8       0.971   90.8      0.819
3.3            94.2          93.3          93.3       0.983   93.7      0.875
4.1            53.6          98.3          89.3       0.911   72.5      0.519
4.2            83.5          96.7          92.3       0.971   89.8      0.802
4.3            91.8          94.9          93.6       0.982   93.3      0.867
4.4            95.6          93.7          94.7       0.988   94.6      0.893
5.1            46.9          98.8          90.2       0.919   68.0      0.457
5.2            80.8          97.9          93.0       0.973   88.9      0.787
5.3            89.7          96.6          94.0       0.983   93.0      0.863
5.4            94.2          95.5          94.9       0.989   94.8      0.897
5.5            96.4          94.2          95.3       0.991   95.2      0.906
6.1            40.3          99.3          90.9       0.919   63.2      0.396
6.2            76.9          98.1          92.8       0.973   86.8      0.851
6.3            88.0          97.1          94.0       0.985   92.4      0.851
6.4            93.1          96.0          94.9       0.990   94.5      0.891
6.5            96.4          95.7          96.0       0.991   96.0      0.921
6.6            97.2          95.0          96.1       0.993   96.0      0.922
7.1            36.5          99.7          91.8       0.918   60.3      0.362
7.2            75.2          98.6          93.4       0.974   86.1      0.738
7.3            86.4          97.9          94.4       0.986   91.9      0.843
7.4            92.0          96.9          95.2       0.991   94.4      0.889
7.5            94.6          96.3          95.6       0.992   95.4      0.909
7.6            96.6          95.7          96.1       0.994   96.1      0.923
7.7            97.6          95.1          96.4       0.928   96.3      0.927

Holdout testing sets:
Testing set    Sensitivity   Specificity   Accuracy   AUC     g-Means   Youden's index
1.1            97.0          73.6          73.8       0.949   84.4      0.706
2.1            89.4          89.7          89.7       0.961   89.5      0.791
2.2            90.0          83.9          84.0       0.962   87.3      0.739
3.1            75.8          95.5          95.4       0.964   85.0      0.713
3.2            87.9          91.5          91.5       0.964   88.7      0.794
3.3            89.4          88.2          88.2       0.965   88.7      0.776
4.1            74.2          97.4          97.3       0.968   85.18     0.732
4.2            75.8          97.4          94.4       0.970   84.2      0.732
4.3            84.8          92.3          92.2       0.968   88.4      0.771
4.4            86.4          90.4          90.4       0.969   88.37     0.768
5.1            63.6          98.5          98.3       0.970   79.14     0.621
5.2            77.3          96.4          94.4       0.969   86.3      0.737
5.3            81.8          94.5          94.4       0.969   87.9      0.763
5.4            86.4          92.9          92.9       0.972   89.5      0.793
5.5            89.4          91.8          91.7       0.971   90.5      0.812
6.1            53.0          99.2          99.0       0.969   72.5      0.522
6.2            69.7          97.4          97.2       0.970   82.3      0.671
6.3            80.3          95.6          95.5       0.971   87.6      0.759
6.4            83.3          94.5          94.5       0.973   88.7      0.778
6.5            86.4          93.3          93.2       0.971   89.7      0.797
6.6            86.4          92.5          92.4       0.969   89.3      0.789
7.1            31.3          99.4          99.1       0.970   71.5      0.307
7.2            71.2          97.8          97.6       0.972   83.4      0.690
7.3            75.8          96.6          96.4       0.971   85.5      0.724
7.4            80.3          95.5          95.5       0.974   87.5      0.758
7.5            83.3          94.5          94.4       0.972   88.7      0.800
7.6            86.4          93.6          93.6       0.970   89.9      0.800
7.7            86.4          92.5          92.5       0.970   89.3      0.789

Table 5
Performance evaluation metrics of RARF on the randomly sampled training sets with SMOTE (learning algorithm: RARF).

Tenfold cross validation on the training sets:
Random set   Sensitivity   Specificity   Accuracy   AUC     g-Means
1            96.8          95.1          95.9       0.993   95.9
2            96.4          95.4          95.8       0.993   95.9
3            96.2          94.7          95.4       0.993   95.4
4            96.0          94.5          95.2       0.992   95.2
5            96.8          94.2          95.4       0.992   95.5
6            96.4          95.0          95.6       0.993   95.7
7            96.1          95.1          95.5       0.992   95.6
8            96.2          94.0          95.0       0.992   95.1
9            96.3          94.6          95.4       0.992   95.4
10           96.9          95.0          95.9       0.992   95.9
Average      96.41         94.76         95.51      0.992   95.56

Holdout testing sets:
Random set   Sensitivity   Specificity   Accuracy   AUC     g-Means
1            80.3          95.3          95.2       0.935   87.4
2            60.6          95.3          95.1       0.925   75.9
3            66.7          95.9          95.8       0.928   79.9
4            69.7          95.7          95.5       0.922   81.6
5            69.7          95.6          95.5       0.919   81.6
6            71.2          95.2          95.1       0.959   82.3
7            74.2          95.1          95.0       0.924   84.0
8            75.8          96.1          95.7       0.929   81.8
9            69.7          96.1          95.7       0.929   81.8
10           66.7          95.5          95.4       0.927   79.8
Average      70.46         95.6          95.37      0.932   81.9

When the dataset is imbalanced, it is very trivial to get high accuracy by predicting most of the sequences as the majority class. But to have a true estimation of the model's generalization ability, parameters like sensitivity, specificity, accuracy, g-means and AUC should also be taken into consideration. To have a complete picture of classifier performance, the evaluation of learning algorithms should be based on appropriate evaluation metrics. The proposed approach tries to diversify the training instances, reduce the number of redundant instances from both the majority and minority classes, and select a training set with the optimal class distribution. Overall, full balancing gave perfect training and testing with optimal evaluation parameters for bioluminescent protein prediction.
Within-class imbalance is often ignored in resampling techniques. If the within-class complexity is high, then learning algorithms are hard to optimize. The strength of K-Means assisted SMOTE oversampling (of the minority class) and undersampling (of the majority class) depends on the unique characteristics of the dataset and the learning algorithm used.
Undersampling the majority class with K-Means has some advantages:
(i) It reduces the training time for the learning algorithms, as it eliminates large quantities of redundant majority instances which may also bias the learning.
(ii) While simple undersampling results in the loss of valuable information from the discarded instances, K-Means based undersampling preserves informative sequences from the entire input space and hence prevents information loss.
K-Means on the minority class samples results in:
(i) Representative sequences from the entire positive instance space, so that the learning algorithm has a chance to learn all the concepts and sub-concepts.
(ii) Elimination of redundant minority instances which may bias the learning.
Further, to obtain an optimal class distribution in the training set, SMOTE is used to oversample the representative instances to match the number of majority instances in the training set.
3.2. Comparison with other methods
We compared our method with four existing methods (Table 6). Fan et al. [8] modified the same dataset in terms of the proportion of positive and negative sequences to address the issue of huge imbalance, which may lead to misleading model evaluation parameters.
For evaluating the generalization ability of the developed model, only BLProt and BLPre used a separate blind testing set. All of the previous methods have higher specificity than sensitivity, which indicates that these models have more power in detecting non-bioluminescent proteins than bioluminescent proteins. Due to the imbalanced nature of the dataset, the trained models get biased towards the majority instances (non-bioluminescent proteins), which contributes towards higher overall accuracy values. The present method handles the class imbalance problem, giving balanced evaluation metrics for both the majority and minority class instances. Our method gives experimental evidence that selective SMOTE of representative feature vectors gives superior generalization ability to the learned models; specifically, larger gains in sensitivity can be observed.
3.3. Feature importance
We have used the ReliefF [25] feature ranking algorithm for ranking the protein sequence features according to their discriminating ability. Fig. 3 shows a heatmap representation of the various features along with their ranks in discriminating between the two groups.
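A minimal sketch of the ReliefF-based ranking, assuming the scikit-rebate package (any ReliefF implementation would serve; helper names are ours):

```python
import numpy as np
from skrebate import ReliefF

def rank_features(X: np.ndarray, y: np.ndarray, n_neighbors: int = 10) -> np.ndarray:
    """Return feature indices ordered from most to least discriminating."""
    relief = ReliefF(n_neighbors=n_neighbors)
    relief.fit(X, y)
    return np.argsort(relief.feature_importances_)[::-1]

# ranked = rank_features(X, y)
# top5 = ranked[:5]   # retrain RARF on the top-k ranked features, as in Table 7
```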
The importance of aromatic amino acids in bioluminescence has been stressed in some previous studies [26,27]. Previously, Huang [9] also noted the importance of hydrophobic amino acids through the implementation of a probabilistic method. The role of tryptophan in fluorescence has been stressed notably in [28,29].
Table 6
Performance evaluation metrics of the present method compared with the previous methods ("–" indicates a value not reported).

Tenfold cross validation:
Prediction model       Sensitivity   Specificity   Accuracy   AUC     g-Means   Youden's index
Fan et al. [8]         88.30         92.70         90.50      0.950   90.4      0.810
BLPre [7]              79.30         91.00         85.17      0.920   84.9      0.700
BLProt [6]             74.47         84.21         80.06      0.870   79.1      0.580
SCBM [9]               89.67         92.00         90.83      –       90.8      0.817
Proposed model (5.5)   96.40         94.20         95.30      0.991   95.2      0.906

Holdout testing set:
Prediction model       Sensitivity   Specificity   Accuracy   AUC     g-Means
Fan et al. [8]         –             –             –          –       –
BLPre [7]              –             –             90.71      –       –
BLProt [6]             –             –             80.06      –       –
SCBM [9]               –             –             –          –       –
Proposed model (5.5)   89.4          91.8          91.70      0.971   90.5
Fig. 3. Heatmap representation of features along with their ranking in discriminating the two groups.


Table 7
Performance evaluation metrics for RARF with varying number of features.

Tenfold cross validation:
No. of features   Sensitivity   Specificity   Accuracy   AUC
5                 89.5          86.8          88.2       0.953
10                93.7          90.4          92.1       0.978
15                95.4          92.1          93.1       0.985
20                96.2          92.3          94.2       0.987
25                96.3          93.7          95.0       0.990
30                96.5          94.5          95.0       0.991
35                96.7          94.1          95.4       0.991
40                96.4          94.3          95.4       0.991

Testing set:
No. of features   Sensitivity   Specificity   Accuracy   AUC
5                 71.2          84.8          84.7       0.865
10                80.3          89.0          88.9       0.939
15                86.4          89.5          89.5       0.957
20                86.4          90.4          90.4       0.965
25                84.8          91.5          91.4       0.972
30                83.3          91.8          91.8       0.971
35                87.9          91.7          91.7       0.966
40                87.9          91.5          91.5       0.970
The contribution of hydrophobic 2-grams and non-polar 2-grams in discrimination is mostly due to the presence of aromatic amino acids.
We have also investigated the performance of the classifier by varying the number of features, starting from the 5 most discriminating features up to 40 (according to the ReliefF feature ranking), and recorded the calculated performance evaluation metrics for RARF in Table 7.
This result indicates that most of the sequences can be predicted with only two top-ranking features, and the additional unique types of sequences are predicted with the help of the other, remaining higher-ranked features.
This variation in performance metrics indicates great diversity (poor sequence similarity) among the bioluminescent protein sequences, and consequently emphasizes the need for representation from each diverse group of bioluminescent protein sequences in the training set, which is essential for improving the generalization ability of the classifier. The lowest ranked features contribute to either higher sensitivity or higher specificity.
As the dimension of our feature vector was not large, it did not affect the performance evaluation metrics much; but for higher dimensional feature vectors, it may be useful to find the optimal dimension of the feature vector using a feature ranking (feature reduction) method.

4. Conclusion

Previously, the issue of imbalanced datasets and their effect on prediction performance in bioinformatics has been addressed by Dobson et al. [30] and Wei et al. [31]. Their studies pointed out the necessity of a balanced dataset for more accurate prediction performance. Class imbalance should be given proper importance, as it is almost ubiquitous in protein family classification problems. When there is a huge imbalance between the different classes, a classifier can achieve very high accuracy by simply predicting most of the test instances as majority class instances. Creating an appropriate training dataset is not a straightforward process for any learning algorithm, as different factors influence the classification accuracy, and identifying those factors as well as finding the best trade-off among them is a challenging task. Through experiments, we have studied the effect of different training sets with varying levels of imbalance on the learning of classifiers, and have analyzed the role of these factors in achieving the best classification accuracy. The proposed method effectively undersamples the majority class, balances the within-class imbalance and attains an optimal class distribution to obtain superior classification performance; we applied K-Means to achieve this task. The current work showed that the balanced training sets performed better than randomly created training sets.

The current research addresses the issues of classifier bias due to imbalanced datasets and incomplete learning by proposing a novel method for the creation of a balanced and diversified training dataset. Most of the common resampling methods do not take into account the within-class imbalance. In the current work, we addressed the issues of between-class imbalance and within-class imbalance simultaneously. Our method of sampling gives superior generalization ability to the learned models. We robustly tested our model against a larger blind testing set and the results are highly improved. The present method not only handles the imbalance between the majority and minority classes, but also gives a proper representation to the rare instances present in both the majority and minority classes. As the time complexity of the K-Means algorithm is quite high, the proposed preprocessing has some limitations for large datasets with high dimensional feature spaces. Divide and conquer approaches and other suitable techniques can be explored in future research to deal with this problem. We hope that the current framework can be successfully applied to other imbalanced datasets to achieve the true performance of a classifier.

Conflict of interest statement

The authors declare that they have no conflicts of interest.

References

[1] T. Wilson, J.W. Hastings, Bioluminescence, Annu. Rev. Cell Dev. Biol. 14 (1998) 197–230.
[2] R. DeSa, J.W. Hastings, The characterization of scintillons. Bioluminescent particles from the marine dinoflagellate, Gonyaulax polyedra, J. Gen. Physiol. 51 (1968) 105–122.
[3] M. Fogel, R.E. Schmitter, J.W. Hastings, On the physical identity of scintillons: bioluminescent particles in Gonyaulax polyedra, J. Cell Sci. 11 (1972) 305–317.
[4] E.G. Ruby, K.-H. Lee, The Vibrio fischeri–Euprymna scolopes light organ association: current ecological paradigms, Appl. Environ. Microbiol. 64 (1998) 805–812.
[5] K.L. Visick, M.J. McFall-Ngai, An exclusive contract: specificity in the Vibrio fischeri–Euprymna scolopes partnership, J. Bacteriol. 182 (2000) 1779–1787.
[6] K. Kandaswamy, G. Pugalenthi, M. Hazrati, K.-U. Kalies, T. Martinetz, BLProt: prediction of bioluminescent proteins based on support vector machine and ReliefF feature selection, BMC Bioinform. 12 (2011) 345.
[7] X. Zhao, J. Li, Y. Huang, Z. Ma, M. Yin, Prediction of bioluminescent proteins using auto covariance transformation of evolutional profiles, Int. J. Mol. Sci. 13 (2012) 3650–3660.
[8] G.-L. Fan, Q.-Z. Li, Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition, J. Theor. Biol. 334 (2013) 45–51.
[9] H.-L. Huang, Propensity scores for prediction and characterization of bioluminescent proteins from sequences, PLoS One 9 (2014) e97158.
[10] N. Japkowicz, Concept-learning in the presence of between-class and within-class imbalances, in: E. Stroulia, S. Matwin (Eds.), Advances in Artificial Intelligence, Springer, Berlin Heidelberg, 2001, pp. 67–77.
[11] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Int. Res. 16 (2002) 321–357.
[12] W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659.
[13] A. Nath, R. Chaube, K. Subbiah, An insight into the molecular basis for convergent evolution in fish antifreeze proteins, Comput. Biol. Med. 43 (2013) 817–821.
[14] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: Proceedings of the 2005 International Conference on Advances in Intelligent Computing - Volume Part I, Springer-Verlag, Hefei, China, 2005, pp. 878–887.
[15] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (2004) 40–49.
[16] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–156.
[17] R. Schapire, The boosting approach to machine learning: an overview, in: D. Denison, M. Hansen, C. Holmes, B. Mallick, B. Yu (Eds.), Nonlinear Estimation and Classification, Springer, New York, 2003, pp. 149–171.
[18] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[19] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (2006) 21–45.
[20] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[21] K.K. Kandaswamy, K.-C. Chou, T. Martinetz, S. Möller, P.N. Suganthan, S. Sridharan, G. Pugalenthi, AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties, J. Theor. Biol. 270 (2011) 56–62.
[22] A. Nath, R. Chaube, S. Karthikeyan, Discrimination of psychrophilic and mesophilic proteins using random forest algorithm, in: Proceedings of the 2012 International Conference on Biomedical Engineering and Biotechnology (iCBEB), 2012, pp. 179–182.
[23] J. Thongkam, X. Guandong, Z. Yanchun, AdaBoost algorithm with random forests for predicting breast cancer survivability, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), 2008, pp. 3062–3069.
[24] V. Saravanan, P.T. Lakshmi, SCLAP: an adaptive boosting method for predicting subchloroplast localization of plant proteins, OMICS: J. Integr. Biol. 17 (2013) 106–115.
[25] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Proceedings of the Ninth International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc., Aberdeen, Scotland, United Kingdom, 1992, pp. 249–256.
[26] W.A. Goddard, D. Brenner, S.E. Lyshevski, G.J. Iafrate, Handbook of Nanoscience, Engineering, and Technology, Taylor & Francis, 2002.
[27] L. Wang, J. Xie, A.A. Deniz, P.G. Schultz, Unnatural amino acid mutagenesis of green fluorescent protein, J. Org. Chem. 68 (2003) 174–176.
[28] R.W. Alston, L. Urbanikova, J. Sevcik, M. Lasagna, G.D. Reinhart, J.M. Scholtz, C.N. Pace, Contribution of single tryptophan residues to the fluorescence and stability of ribonuclease Sa, Biophys. J. 87 (2004) 4036–4047.
[29] C. Pigault, D. Gerard, Influence of the location of tryptophanyl residues in proteins on their photosensitivity, Photochem. Photobiol. 40 (1984) 291–296.
[30] R. Dobson, P. Munroe, M. Caulfield, M. Saqi, Predicting deleterious nsSNPs: an analysis of sequence and structural attributes, BMC Bioinform. 7 (2006) 217.
[31] Q. Wei, R.L. Dunbrack Jr., The role of balanced training and testing data sets for binary classifiers in bioinformatics, PLoS One 8 (2013) e67863.
