Article info

Article history:
Received 13 April 2015
Accepted 28 October 2015

Abstract
Bioluminescence plays an important role in nature; for example, it is used for intracellular chemical signalling in bacteria. It also serves as a useful reagent for various analytical research methods, ranging from cellular imaging to gene expression analysis. However, identification and annotation of bioluminescent proteins is a difficult task because they share poor sequence similarity among themselves. In this paper, we present a novel approach for within-class and between-class balancing, as well as diversification, of a training dataset by effectively combining the unsupervised K-Means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) in order to achieve the true performance of the prediction model. Further, we experimented with different balancing ratios of positive to negative data in the training dataset in order to probe for the optimal class distribution that produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near-complete learning with greater generalization on the blind test datasets. The obtained results strongly support the view that an optimal class distribution with a high degree of diversity is an essential factor for achieving near-perfect learning. Using random forests as the weak learners in boosting and training on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% on a tenfold cross-validation test, and an accuracy of 91.7%, sensitivity of 89.3% and specificity of 91.8% on a holdout test set. It is quite possible that the general framework discussed in the current work can be successfully applied to other biological datasets to deal with imbalance and incomplete learning problems effectively.
© 2015 Elsevier Ltd. All rights reserved.
Keywords:
Class imbalance
Training set diversity
Optimal class distribution
K-Means
SMOTE
1. Introduction
Two main phenomena, bioluminescence and biofluorescence, are responsible for the emission of visible light from living organisms. The mechanisms of the two processes are distinct: the former involves a chemical reaction, while the latter involves absorption of light from external sources and its re-emission after transformation. Bioluminescence is observed in both terrestrial and marine habitats. The chemical reaction responsible for bioluminescence generates very little heat and can be categorized into oxygen-dependent (luciferin-luciferase system) and oxygen-independent types (e.g., photoproteins). The colour of the emission is governed by the amino acid sequence and by accessory proteins such as yellow fluorescent proteins (YFP) and green fluorescent proteins (GFP) [1]. Diverse systems for bioluminescence exist in nature; for example, in dinoflagellates, specialized organelles known as scintillons [2,3] exhibit bioluminescence.
exhibit bioluminescence. Bioluminescence plays an important role in
n
bacterial intracellular chemical signalling and in symbiosis: a common example of which is shown by Epryme scolopes and Vibrio shcri
[4,5], in attracting for a mate and repelling the predators. The independent evolution of bioluminescence in different organisms has
been discussed in Hastings et al. [1]. In some organisms, the usefulness of bioluminescence is still unknown.
In silico prediction of bioluminescent proteins (BLPs) was first carried out by Kandaswamy et al. [6], who developed BLProt, an SVM-based method whose prediction model was trained using 544 amino acid physicochemical properties. The prediction of bioluminescent proteins was further improved by Zhao et al. (BLPre) [7] using evolutionary information in the form of PSSMs (Position Specific Scoring Matrices) obtained from PSI-BLAST. Fan et al. [8] used a balanced dataset (equal numbers of positive and negative samples for training) with average chemical shift and modified pseudo amino acid composition for the prediction of bioluminescent proteins. Recently, Huang [9] proposed a scoring card method (SCBM) for their prediction.
Imbalanced class ratios are often encountered in protein family classification problems. This causes over-representation of instances belonging to the majority class and under-representation of instances belonging to the minority class.

(i) Amino acid frequency composition: The percentage frequency of each amino acid residue was used as the first component of the feature vector:

$$ f_{res} = \frac{N_{res,i}}{N_{total\_res,i}} \times 100 $$

where
res stands for one of the 20 different amino acid residues,
f_res denotes the amino acid percentage frequency of the specific residue in the ith sequence,
N_res,i denotes the total count of amino acids of the specific type in the ith sequence, and
N_total_res,i denotes the total count of all residues in the ith sequence (i.e., the sequence length).
(ii) Amino acid property group composition: The percentage frequency counts of amino acid property groups were used as the second component of the feature vector. The different amino acid property groups [13] selected for this study are given in Table 1. This is a refinement over the amino acid frequency composition, where the count of a specific property group is computed instead of the individual amino acid count.
$$ f_{pg} = \frac{N_{pg,i}}{N_{total\_res,i}} \times 100 $$

where
pg denotes one of the 11 different amino acid property groups,
f_pg denotes the percentage frequency of the specific amino acid property group in the ith sequence,
N_pg,i denotes the total count of the specific amino acid property group in the ith sequence, and
N_total_res,i denotes the total count of all residues in the ith sequence.
$$ \frac{1}{N} \sum_{i=1}^{N} C(i,\, i+1) $$
Table 1
Amino acid property groups selected for this study.

1. Tiny group
2. Small group
3. Aliphatic group
4. Non-polar group
5. Aromatic group
6. Polar group
7. Charged group
8. Basic group
9. Acidic group
10. Hydrophobic group
11. Hydrophilic group
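To make the two composition features concrete, the following is a minimal Python sketch. The residue-to-group memberships below are illustrative assumptions only (the paper takes its group definitions from ref. [13], which is not reproduced in this excerpt), as are the function and variable names.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical residue memberships for the 11 groups of Table 1;
# the paper's actual definitions come from ref. [13] and may differ.
PROPERTY_GROUPS = {
    "tiny": "ACGST",       "small": "ACDGNPSTV",
    "aliphatic": "AILV",   "non-polar": "ACFGILMPVWY",
    "aromatic": "FHWY",    "polar": "CDEHKNQRSTY",
    "charged": "DEHKR",    "basic": "HKR",
    "acidic": "DE",        "hydrophobic": "ACFGHIKLMTVWY",
    "hydrophilic": "DEKNQR",
}

def feature_vector(seq):
    """20 residue frequencies f_res plus 11 group frequencies f_pg (in %)."""
    counts = Counter(seq)
    n_total = len(seq)  # N_total_res,i
    f_res = [100.0 * counts[aa] / n_total for aa in AMINO_ACIDS]
    f_pg = [100.0 * sum(counts[aa] for aa in grp) / n_total
            for grp in PROPERTY_GROUPS.values()]
    return f_res + f_pg
```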
$$ J = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\lVert P_i^{(j)} - C_j \right\rVert^2 $$

where C_j denotes the centroid of the jth cluster, P_i^{(j)} denotes the ith instance assigned to the jth cluster, and n_j denotes the number of instances in the jth cluster.
training set and the rest of the cluster members are retained in the
testing set.
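Only the tail of this description survives in the excerpt, but it indicates that K-Means clustering guides the split: each cluster contributes some members to the training set and the rest to the testing set. Below is a sketch of one plausible reading using scikit-learn's KMeans; the nearest-to-centroid selection rule and the train_fraction parameter are assumptions, not the paper's stated criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_guided_split(X, k, train_fraction=0.5, seed=0):
    """Split instances so that every K-Means cluster contributes to training.

    The nearest-to-centroid selection rule here is an assumption; the
    paper only states that some cluster members go to the training set
    and the rest to the testing set.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    train_idx, test_idx = [], []
    for j in range(k):
        members = np.where(km.labels_ == j)[0]
        # Rank cluster members by distance to their centroid.
        d = np.linalg.norm(X[members] - km.cluster_centers_[j], axis=1)
        order = members[np.argsort(d)]
        cut = max(1, int(train_fraction * len(order)))
        train_idx.extend(order[:cut])   # representatives -> training set
        test_idx.extend(order[cut:])    # remaining members -> testing set
    return np.array(train_idx), np.array(test_idx)
```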
2.3.2. SMOTE
In SMOTE, the minority class is oversampled by inducing artificial instances. It is a nearest-neighbour based method: it randomly selects a minority class instance and its N nearest minority class neighbours (the default value of N = 5). The distance between the sample and one randomly chosen nearest neighbour is calculated in the feature space, and a synthetic instance is then created along the line segment between the minority sample and its selected nearest neighbour.
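A minimal sketch of this generation step, following the description of Chawla et al. [11]; the function name and the NumPy/scikit-learn helpers are illustrative choices rather than the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples along segments to k-NN neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    # The first neighbour of each point is the point itself, so drop column 0.
    neighbour_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(minority))        # random minority sample
        j = rng.choice(neighbour_idx[i])       # one of its k nearest neighbours
        gap = rng.random()                     # random position on the segment
        synthetic[s] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic
```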
In cases where the classes were unequal (for example, when the negative instances were two or three times the positives), we used SMOTE to selectively oversample the positive class until its representative instances equalled the number of negative class instances. We experimented with different SMOTE sampling percentages and examined the effect of balanced and imbalanced datasets on the prediction evaluation metrics by creating datasets with different proportions of positive and negative instances. The properties of the different training and testing sets are presented in Table 2.
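As one way to reproduce such proportions, the imbalanced-learn library (an assumed tool; the paper does not name its implementation) exposes the oversampling amount through the sampling_strategy parameter of its SMOTE class. X_train and y_train below are hypothetical placeholders for the features and labels.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# sampling_strategy is the desired minority:majority ratio after resampling;
# e.g. with 375 positives and 2250 negatives, 0.5 brings the positives up to
# 1125 (training set 6.3 in Table 2), and 1.0 fully balances the classes.
for ratio in (0.5, 0.75, 1.0):
    smote = SMOTE(sampling_strategy=ratio, k_neighbors=5, random_state=0)
    X_res, y_res = smote.fit_resample(X_train, y_train)
    print(ratio, Counter(y_res))
```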
2.3.3. Boosted random forest
Boosting [16,17] combines many weak base learners linearly to construct a strong classifier with improved accuracy. It is an iterative procedure: during each iteration, the incorrectly classified instances from both the positive and negative classes are given more weight, so that learning concentrates on the hard-to-classify instances in the training set. It is a sequential ensemble method in which subsequent learners evolve from previous ones.
Random forest [18] is an ensemble learning method consisting of many individual decision trees. Classifier ensembles promote an optimal trade-off between diversity and accuracy. Ensemble classifiers usually outperform single classifiers, and they are robust to the presence of noise in the data and to overfitting of inputs [19]. Different base classifiers making errors in different parts of the hypothesis space give better accuracy when properly combined.
The concept of bagging [20] is implemented in the random forest classification algorithm. In a random forest, bootstrap samples drawn from the training set are used to grow the trees, and a randomly selected feature subset is evaluated at each node of each decision tree. The final decision is made by fusing the decisions of all the trees through majority voting.
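A sketch of this boosted random forest combination using scikit-learn (an assumed toolkit; the hyperparameters are illustrative, not the paper's). Recent scikit-learn versions take the base learner via the estimator argument; older versions use base_estimator.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# A small random forest acts as the weak base learner inside boosting.
base_rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
boosted_rf = AdaBoostClassifier(estimator=base_rf, n_estimators=20,
                                random_state=0)
# boosted_rf.fit(X_train, y_train)  # X_train, y_train as in the Table 2 sets
```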
Fig. 1. Plot of distortion ratio versus number of clusters for bioluminescent protein instances.
Table 2
Properties of the different training and testing sets.

Training set ID   Positive instances   Negative instances   Property
1.1               375                  375                  Balanced
2.1               375                  750                  Imbalanced
2.2               750                  750                  Balanced with 100% SMOTE
3.1               375                  1125                 Imbalanced
3.2               750                  1125                 Balanced with 100% SMOTE
3.3               1125                 1125                 Balanced with 200% SMOTE
4.1               375                  1500                 Imbalanced
4.2               750                  1500                 Imbalanced
4.3               1125                 1500                 Imbalanced
4.4               1500                 1500                 Balanced with 300% SMOTE
5.1               375                  1875                 Imbalanced
5.2               750                  1875                 Imbalanced
5.3               1125                 1875                 Imbalanced
5.4               1500                 1875                 Imbalanced
5.5               1875                 1875                 Balanced with 400% SMOTE
6.1               375                  2250                 Imbalanced
6.2               750                  2250                 Imbalanced
6.3               1125                 2250                 Imbalanced
6.4               1500                 2250                 Imbalanced
6.5               1875                 2250                 Imbalanced
6.6               2250                 2250                 Balanced with 500% SMOTE
7.1               375                  2625                 Imbalanced
7.2               750                  2625                 Imbalanced
7.3               1125                 2625                 Imbalanced
7.4               1500                 2625                 Imbalanced
7.5               1875                 2625                 Imbalanced
7.6               2250                 2625                 Imbalanced
7.7               2625                 2625                 Balanced with 600% SMOTE
$$ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100 $$

$$ \mathrm{Specificity} = \frac{TN}{TN + FP} \times 100 $$

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100 $$

3. Results
Random forests have been successfully applied to many classification and prediction tasks [21,22]. The major steps of random forest can be summarized as follows: (1) a bagged sample is drawn from the training data; (2) a decision tree is grown without pruning on the bagged sample, where at each node a randomly selected subset of features from the full feature set is evaluated; (3) the decisions from all the individual trees are fused.
We used random forests as the weak learners for the boosting algorithm. Recently, some authors have also successfully applied boosted random forests for classification and prediction [23,24]. Real AdaBoost is one of the popular modifications of the AdaBoost algorithm; its major steps are the same, except that it involves the calculation of real-valued class probability estimates. We experimented with both discrete and real AdaBoost algorithms. A schematic representation of the proposed methodology is shown in Fig. 2.
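In scikit-learn terms (again an assumed toolkit, not the paper's named implementation), the discrete/real pair corresponds to the SAMME and SAMME.R variants of AdaBoostClassifier, the latter using real-valued class probability estimates; SAMME.R is available only in scikit-learn versions that still provide it.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

base_rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
arf = AdaBoostClassifier(estimator=base_rf, algorithm="SAMME")     # discrete
rarf = AdaBoostClassifier(estimator=base_rf, algorithm="SAMME.R")  # real
```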
Initially, we trained the following six machine learning algorithms on the randomly balanced training set: support vector machines with sequential minimal optimization (SMO), K-nearest neighbour (IBK), random forest (RF), rotation forest (ROF), real AdaBoost with random forest (RARF) and AdaBoost with random forest (ARF). The performance evaluation metrics on the randomly balanced training set (using tenfold stratified cross validation) and on the holdout testing set are presented in Table 3.
The overall accuracy of the tree-based algorithms is better than that of the SMO and IBK algorithms. All the algorithms performed relatively well on the positive samples. RARF gave a sensitivity comparable to the other learning algorithms, with the highest overall accuracy, AUC, g-means and Youden's index on the training set using tenfold cross validation. The same trend is also observed on the testing set.
3.1. Effect of balanced training set on performance evaluation
parameters
We selected the RARF algorithm for further analysis with varying balancing ratios, as it outperformed the other learning algorithms on the randomly balanced dataset. The performance evaluation metrics of RARF using the different training and testing sets are presented in Table 4.
It can be observed that RARF performed relatively better whenever the training set was fully balanced (training set IDs 1.1, 2.2, 3.3, 4.4, 5.5, 6.6 and 7.7), and the sensitivity (accuracy on the positive class) increased with the increasing rate of SMOTE oversampling. RARF achieved higher sensitivity values on the fully balanced training sets than on the partially balanced training sets. An opposite trend of decreasing specificity values is observed with the increasing rate of SMOTE oversampling. Although the specificity values on the fully balanced training sets are the lowest compared to the other training sets, the overall accuracy of RARF increased with the rate of SMOTE oversampling. On all the fully balanced training sets, the highest accuracy values can be observed compared to the other training sets. Full balancing of training instances between positive and negative samples also has a positive effect on AUC values; higher AUC values were observed in training sets having a lower imbalance ratio. The g-means reflects
Table 3
Performance evaluation metrics of the different machine learning algorithms trained on the randomly balanced training set.

Training set 1 (tenfold cross validation):
Algorithm   Sens.   Spec.   Acc.   AUC     g-Means   Youden
SMO         87.5    70.7    79.1   0.791   78.6      0.582
IBK         88.8    55.7    72.3   0.723   70.3      0.445
ROF         81.1    79.7    80.4   0.886   80.3      0.608
RF          85.1    80.5    82.8   0.907   82.7      0.656
RARF        86.1    82.1    84.1   0.923   84.07     0.682
ARF         84.5    81.3    82.9   0.909   82.8      0.658

Testing set 1:
Algorithm   Sens.   Spec.   Acc.   AUC     g-Means   Youden
SMO         98.5    59.9    60.1   0.792   76.8      0.584
IBK         100     52.1    52.3   0.761   72.1      0.521
ROF         93.9    70.2    70.4   0.932   81.1      0.641
RF          93.9    69.9    70.0   0.928   81.0      0.638
RARF        97.0    73.6    73.8   0.949   84.4      0.706
ARF         93.9    69.6    69.7   0.922   80.7      0.635
Table 4
Performance evaluation metrics of RARF on the different training sets with different ratios of positive and negative samples (learning algorithm: RARF).

Training set (tenfold cross validation):
ID    Sens.   Spec.   Acc.   AUC     g-Means   Youden
1.1   86.1    82.1    84.1   0.923   84.07     0.682
2.1   74.1    94.1    87.5   0.918   90.5      0.682
2.2   91.1    90.0    90.9   0.968   90.5      0.811
3.1   61.9    97.4    88.5   0.919   77.6      0.593
3.2   86.8    95.1    91.8   0.971   90.8      0.819
3.3   94.2    93.3    93.3   0.983   93.7      0.875
4.1   53.6    98.3    89.3   0.911   72.5      0.519
4.2   83.5    96.7    92.3   0.971   89.8      0.802
4.3   91.8    94.9    93.6   0.982   93.3      0.867
4.4   95.6    93.7    94.7   0.988   94.6      0.893
5.1   46.9    98.8    90.2   0.919   68.0      0.457
5.2   80.8    97.9    93.0   0.973   88.9      0.787
5.3   89.7    96.6    94.0   0.983   93.0      0.863
5.4   94.2    95.5    94.9   0.989   94.8      0.897
5.5   96.4    94.2    95.3   0.991   95.2      0.906
6.1   40.3    99.3    90.9   0.919   63.2      0.396
6.2   76.9    98.1    92.8   0.973   86.8      0.851
6.3   88.0    97.1    94.0   0.985   92.4      0.851
6.4   93.1    96.0    94.9   0.990   94.5      0.891
6.5   96.4    95.7    96.0   0.991   96.0      0.921
6.6   97.2    95.0    96.1   0.993   96.0      0.922
7.1   36.5    99.7    91.8   0.918   60.3      0.362
7.2   75.2    98.6    93.4   0.974   86.1      0.738
7.3   86.4    97.9    94.4   0.986   91.9      0.843
7.4   92.0    96.9    95.2   0.991   94.4      0.889
7.5   94.6    96.3    95.6   0.992   95.4      0.909
7.6   96.6    95.7    96.1   0.994   96.1      0.923
7.7   97.6    95.1    96.4   0.928   96.3      0.927

Testing set:
ID    Sens.   Spec.   Acc.   AUC     g-Means   Youden
1.1   97.0    73.6    73.8   0.949   84.4      0.706
2.1   89.4    89.7    89.7   0.961   89.5      0.791
2.2   90.0    83.9    84.0   0.962   87.3      0.739
3.1   75.8    95.5    95.4   0.964   85.0      0.713
3.2   87.9    91.5    91.5   0.964   88.7      0.794
3.3   89.4    88.2    88.2   0.965   88.7      0.776
4.1   74.2    97.4    97.3   0.968   85.18     0.732
4.2   75.8    97.4    94.4   0.970   84.2      0.732
4.3   84.8    92.3    92.2   0.968   88.4      0.771
4.4   86.4    90.4    90.4   0.969   88.37     0.768
5.1   63.6    98.5    98.3   0.970   79.14     0.621
5.2   77.3    96.4    94.4   0.969   86.3      0.737
5.3   81.8    94.5    94.4   0.969   87.9      0.763
5.4   86.4    92.9    92.9   0.972   89.5      0.793
5.5   89.4    91.8    91.7   0.971   90.5      0.812
6.1   53.0    99.2    99.0   0.969   72.5      0.522
6.2   69.7    97.4    97.2   0.970   82.3      0.671
6.3   80.3    95.6    95.5   0.971   87.6      0.759
6.4   83.3    94.5    94.5   0.973   88.7      0.778
6.5   86.4    93.3    93.2   0.971   89.7      0.797
6.6   86.4    92.5    92.4   0.969   89.3      0.789
7.1   31.3    99.4    99.1   0.970   71.5      0.307
7.2   71.2    97.8    97.6   0.972   83.4      0.690
7.3   75.8    96.6    96.4   0.971   85.5      0.724
7.4   80.3    95.5    95.5   0.974   87.5      0.758
7.5   83.3    94.5    94.4   0.972   88.7      0.800
7.6   86.4    93.6    93.6   0.970   89.9      0.800
7.7   86.4    92.5    92.5   0.970   89.3      0.789
Table 5
Performance evaluation metrics of RARF on the randomly sampled training sets with SMOTE.

Training set (tenfold cross validation):
Dataset ID   Sens.   Spec.   Acc.    AUC     g-Means
1            96.8    95.1    95.9    0.993   95.9
2            96.4    95.4    95.9    0.993   95.8
3            96.2    94.7    95.4    0.993   95.4
4            96.0    94.5    95.2    0.992   95.2
5            96.8    94.2    95.5    0.992   95.4
6            96.4    95.0    95.7    0.993   95.6
7            96.1    95.1    95.6    0.992   95.5
8            96.2    94.0    95.1    0.992   95.0
9            96.3    94.6    95.4    0.992   95.4
10           96.9    95.0    95.9    0.992   95.9
Average      96.41   94.76   95.56   0.992   95.51

Testing set:
Dataset ID   Sens.   Spec.   Acc.    AUC     g-Means
1            80.3    95.3    95.2    0.935   87.4
2            60.6    95.3    95.1    0.925   75.9
3            66.7    95.9    95.8    0.928   79.9
4            69.7    95.7    95.5    0.922   81.6
5            69.7    95.6    95.5    0.919   81.6
6            71.2    95.2    95.1    0.959   82.3
7            74.2    95.1    95.0    0.924   84.0
8            75.8    96.1    95.7    0.929   81.8
9            69.7    96.1    95.7    0.929   81.8
10           66.7    95.5    95.4    0.927   79.8
Average      70.46   95.6    95.37   0.932   81.9
Table 6
Performance evaluation metrics of the present method compared with the previous methods ("–" denotes a value not available in this excerpt).

Training set (tenfold cross validation):
Prediction model       Sens.   Spec.   Acc.    AUC     g-Means   Youden
Fan et al. [8]         88.30   92.70   90.50   0.950   90.4      0.810
BLPre [7]              79.30   91.00   85.17   0.920   84.9      0.700
BLProt [6]             74.47   84.21   80.06   0.870   79.1      0.580
SCBM [9]               89.67   92.00   90.83   –       90.8      0.817
Proposed model (5.5)   96.40   94.20   95.30   0.991   95.2      0.906

Testing set:
Prediction model       Sens.   Spec.   Acc.    AUC     g-Means
Fan et al. [8]         –       –       –       –       –
BLPre [7]              –       –       90.71   –       –
BLProt [6]             –       –       80.06   –       –
SCBM [9]               –       –       –       –       –
Proposed model (5.5)   89.4    91.8    91.70   0.971   90.5
Fig. 3. Heatmap representation of features along with their ranking in discriminating the two groups.
Table 7
Performance evaluation metrics for RARF with varying numbers of features.

Training set (tenfold cross validation):
No. of features   Sens.   Spec.   Acc.   AUC
5                 89.5    86.8    88.2   0.953
10                93.7    90.4    92.1   0.978
15                95.4    92.1    93.1   0.985
20                96.2    92.3    94.2   0.987
25                96.3    93.7    95.0   0.990
30                96.5    94.5    95.0   0.991
35                96.7    94.1    95.4   0.991
40                96.4    94.3    95.4   0.991

Testing set:
No. of features   Sens.   Spec.   Acc.   AUC
5                 71.2    84.8    84.7   0.865
10                80.3    89.0    88.9   0.939
15                86.4    89.5    89.5   0.957
20                86.4    90.4    90.4   0.965
25                84.8    91.5    91.4   0.972
30                83.3    91.8    91.8   0.971
35                87.9    91.7    91.7   0.966
40                87.9    91.5    91.5   0.970
4. Conclusion
The issue of imbalanced datasets and their effect on prediction performance in bioinformatics has previously been addressed by Dobson et al. [30] and Wei et al. [31]. Their studies pointed out the necessity of a balanced dataset for more accurate prediction performance. Class imbalance should be given proper importance, as it is almost ubiquitous in protein family classification problems. When there is a huge imbalance between the classes, a classifier can achieve very high accuracy by simply predicting most of the test instances as majority class instances. Creating an appropriate training dataset is not a straightforward process for any learning algorithm, as different factors influence the classification accuracy, and identifying those factors as well as finding the best trade-off among them is a challenging task. Through experiments, we have studied the effect of training sets with varying levels of imbalance on the learning of classifiers, and we have analyzed the role of these factors in achieving the best classification accuracy. The proposed method effectively undersamples the majority class, balances the within-class imbalance and attains an optimal class distribution to obtain superior classification performance; we applied K-Means to achieve this task. The current work showed that appropriately balanced training sets performed better than randomly created training sets.
References
[1] T. Wilson, J.W. Hastings, Bioluminescence, Annu. Rev. Cell Dev. Biol. 14 (1998) 197–230.
[2] R. DeSa, J.W. Hastings, The characterization of scintillons. Bioluminescent particles from the marine dinoflagellate, Gonyaulax polyedra, J. Gen. Physiol. 51 (1968) 105–122.
[3] M. Fogel, R.E. Schmitter, J.W. Hastings, On the physical identity of scintillons: bioluminescent particles in Gonyaulax polyedra, J. Cell Sci. 11 (1972) 305–317.
[4] E.G. Ruby, K.-H. Lee, The Vibrio fischeri–Euprymna scolopes light organ association: current ecological paradigms, Appl. Environ. Microbiol. 64 (1998) 805–812.
[5] K.L. Visick, M.J. McFall-Ngai, An exclusive contract: specificity in the Vibrio fischeri–Euprymna scolopes partnership, J. Bacteriol. 182 (2000) 1779–1787.
[6] K. Kandaswamy, G. Pugalenthi, M. Hazrati, K.-U. Kalies, T. Martinetz, BLProt: prediction of bioluminescent proteins based on support vector machine and ReliefF feature selection, BMC Bioinform. 12 (2011) 345.
[7] X. Zhao, J. Li, Y. Huang, Z. Ma, M. Yin, Prediction of bioluminescent proteins using auto covariance transformation of evolutional profiles, Int. J. Mol. Sci. 13 (2012) 3650–3660.
[8] G.-L. Fan, Q.-Z. Li, Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition, J. Theor. Biol. 334 (2013) 45–51.
[9] H.-L. Huang, Propensity scores for prediction and characterization of bioluminescent proteins from sequences, PLoS One 9 (2014) e97158.
[10] N. Japkowicz, Concept-learning in the presence of between-class and within-class imbalances, in: E. Stroulia, S. Matwin (Eds.), Advances in Artificial Intelligence, Springer, Berlin Heidelberg, 2001, pp. 67–77.
[11] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[12] W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659.
[13] A. Nath, R. Chaube, K. Subbiah, An insight into the molecular basis for convergent evolution in fish antifreeze proteins, Comput. Biol. Med. 43 (2013) 817–821.
[14] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: Proceedings of the 2005 International Conference on Advances in Intelligent Computing, Volume Part I, Springer-Verlag, Hefei, China, 2005, pp. 878–887.
[15] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. 6 (2004) 40–49.
[16] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–156.
[17] R. Schapire, The boosting approach to machine learning: an overview, in: D. Denison, M. Hansen, C. Holmes, B. Mallick, B. Yu (Eds.), Nonlinear Estimation and Classification, Springer, New York, 2003, pp. 149–171.
[18] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[19] R. Polikar, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (2006) 21–45.
[20] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[21] K.K. Kandaswamy, K.-C. Chou, T. Martinetz, S. Möller, P.N. Suganthan, S. Sridharan, G. Pugalenthi, AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties, J. Theor. Biol. 270 (2011) 56–62.
[22] A. Nath, R. Chaube, S. Karthikeyan, Discrimination of psychrophilic and mesophilic proteins using random forest algorithm, in: Proceedings of the
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]