
A framework for cost-based feature selection

V. Bolón-Canedo, I. Porto-Díaz, N. Sánchez-Maroño, A. Alonso-Betanzos
Laboratory for Research and Development in Artificial Intelligence (LIDIA), Computer Science Department, University of A Coruña, 15071 A Coruña, Spain
Article info
Article history:
Received 19 July 2012
Received in revised form
15 November 2013
Accepted 21 January 2014
Available online 28 January 2014
Keywords:
Cost-based feature selection
Machine learning
Filter methods
Abstract
Over the last few years, the dimensionality of datasets involved in data mining applications has increased
dramatically. In this situation, feature selection becomes indispensable as it allows for dimensionality
reduction and relevance detection. The research proposed in this paper broadens the scope of feature
selection by taking into consideration not only the relevance of the features but also their associated
costs. A new general framework is proposed, which consists of adding a new term to the evaluation
function of a filter feature selection method so that the cost is taken into account. Although the proposed
methodology could be applied to any feature selection filter, in this paper the approach is applied to two
representative filter methods: Correlation-based Feature Selection (CFS) and Minimal-Redundancy-
Maximal-Relevance (mRMR), as an example of use. The behavior of the proposed framework is tested on
17 heterogeneous classification datasets, employing a Support Vector Machine (SVM) as a classifier. The
results of the experimental study show that the approach is sound and that it allows the user to reduce
the cost without compromising the classification error.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
The proliferation of high-dimensional data has become a trend in the last few years. Datasets with a dimensionality over the tens of thousands are constantly appearing in applications such as medical image and text retrieval or genetic data analysis. In fact, analyzing the dimensionality of the datasets posted in the UCI Machine Learning Repository [1] over the last decades, one can observe that in the 1980s the maximum dimensionality of the data was about 100; it increased to more than 1500 in the 1990s; and finally, in the 2000s, it further increased to about 3 million [2].
The high dimensionality of data has an important impact on learning algorithms, since their performance degrades when a number of irrelevant and redundant features are present. This phenomenon is known as the curse of dimensionality [3], because unnecessary features increase the size of the search space and make generalization more difficult. To overcome this major obstacle in machine learning, researchers usually employ dimensionality reduction techniques. In this manner, the set of features required to describe the problem is reduced, most of the time along with an improvement in the performance of the models.
Feature selection is arguably the best-known dimensionality reduction technique. It consists of detecting the relevant features and discarding the irrelevant ones. Its goal is to obtain a subset of features that properly describes the given problem with minimum degradation in performance [4], with the implicit benefits of improving data and model understanding and reducing the need for data storage. With this technique, the original features are maintained, contrary to what usually happens in other techniques such as feature extraction, where the generated dataset is represented by a newly generated set of features different from the original ones.
Feature selection methods can be divided into wrappers, filters and embedded methods [4]. While wrapper models involve optimizing a predictor as part of the selection process, filter models rely on the general characteristics of the training data to select features independently of any predictor. Embedded methods generally use machine learning models for classification, and an optimal subset or ranking of features is then built by the classifier algorithm. Wrappers and embedded methods tend to obtain better performance, but at the expense of being very time consuming and of carrying the risk of overfitting when the sample size is small. On the other hand, filters are faster and, therefore, more suitable for large datasets. They are also easier to implement and scale up better than wrapper and embedded methods. As a matter of fact, filters can be used as a pre-processing step before applying other, more complex feature selection methods. For all these reasons, filters are the focus of this work.
There is a broad suite of filter methods based on different metrics, but the most common approaches are to find either a subset of features that maximizes a given metric or an ordered ranking of the features based on that metric. Two of the
most popular filter metrics for classification problems are correlation and mutual information, although other common filter metrics include error probability, probabilistic distance, entropy or consistency [5].
There are some situations in which a user is interested not only in maximizing the merit of a subset of features, but also in reducing the costs that may be associated with them. For example, in medical diagnosis, symptoms observed with the naked eye are costless, but each diagnostic value extracted by a clinical test is associated with its own cost and risk. In other fields, such as image analysis, the computational expense of features refers to the time and space complexities of the feature acquisition process [6]. This is a critical issue, specifically in real-time applications, where the computational time required to deal with one feature or another is crucial, and also in the medical domain, where it is important to save economic costs and also to improve the comfort of patients by avoiding risky or unpleasant clinical tests (factors that can also be treated as costs).
The goal of this research is to obtain a trade-off between a filter metric and the cost associated with the selected features, in order to select relevant features with a low associated cost. A general framework to be applied together with the filter approach is introduced. In this manner, any filter metric can be modified to take into account the cost associated with the input features. In this paper, and for the sake of brevity, two implementations of this framework are presented as examples of use, choosing two representative and widely used filters: Correlation-based Feature Selection (CFS) and Minimal-Redundancy-Maximal-Relevance (mRMR). The results obtained with these two filters are promising, showing that the approach is sound.
The rest of the paper is organized as follows: Section 2 summarizes previous research on the subject; Section 3 describes the proposed method in detail; Sections 4 and 5 describe the experimental study performed and the results obtained, respectively; and finally, Section 6 presents the conclusions and future work.
2. Background
Feature selection has been an active and effective tool in numerous fields such as DNA microarray analysis [7,8], intrusion detection [9,10], medical diagnosis [11] and text categorization [12]. New feature selection methods are constantly appearing; however, the great majority of them focus only on removing irrelevant and redundant features, not on the cost of obtaining the input features.
The cost associated with a feature can be related to different concepts. For example, in medical diagnosis, a pattern consists of observable symptoms (such as age and sex) along with the results of some diagnostic tests. Contrary to observable symptoms, which have no cost, diagnostic tests have associated costs and risks. For example, an invasive exploratory surgery is much more expensive and risky than a blood test [13]. Another example of the risk of extracting a feature can be found in [14], where evaluating the merits of beef cattle as meat producers requires carrying out zoometry on living animals.
On the other hand, the cost can also be related to computational issues. In the medical imaging field, extracting a feature from a medical image can have a high computational cost. For example, in the texture analysis technique known as co-occurrence features [15], the computational cost for extracting each feature is not the same, which implies different computational times. In other cases, such as real-time applications, the space complexity is negligible, but the time complexity is very important [6].
As one may notice, features with an associated cost can be found in many real-life applications. However, this issue has not received much attention from machine learning researchers. As mentioned in Section 1, the purpose of this research is to propose a general framework for the problem of cost-based feature selection, trying to balance the correlation of the features with the class and their cost.
There have been similar attempts to balance the contribution of different terms in other areas. For instance, in classification, Friedman [16] added a regularization term to traditional Linear Discriminant Analysis (LDA). The left-hand term of the cost function evaluates the error and the right-hand term is the regularization one, weighted by a regularization parameter. This provides a framework in which, according to the value of this parameter, different regularized solutions can be obtained. Related to feature extraction, in [17] a criterion is proposed to select kernel parameters based on maximizing between-class scattering and minimizing within-class scattering. Applied to face recognition, Wright et al. [18] proposed a general classification framework to study feature extraction and robustness to occlusion via a sparse representation. Instead of measuring the correlation between a feature and the class, this method evaluates the representation error. However, our objective is completely different, as it is to provide a framework for feature selection in which features with an inherent cost can be dealt with.
Despite these previous attempts in classification and feature extraction, to the best knowledge of the authors, there are only a few attempts to deal with this issue in feature selection. In the early 1990s, Feddema et al. [6] developed methodologies for the automatic selection of image features to be used by a robot. For this selection process, they employed a weighted criterion that took into account the computational expense of features, i.e. the time and space complexities of the feature extraction process. Several years later, Yang and Honavar [13] proposed a genetic algorithm to perform feature subset selection in which the fitness function combined two criteria: the accuracy of the classification function realized by a neural network and the cost of performing the classification (defined by the cost of measuring the value of a particular feature needed for classification, the risk involved, etc.). A similar approach was presented in [19], where a genetic algorithm is used for feature selection and parameter optimization for a support vector machine. In this case, classification accuracy, the number of selected features and the feature cost were the three criteria used to design the fitness function. Another proposal can be found in [20], which presents a hybrid method for feature subset selection based on ant colony optimization and artificial neural networks. The heuristic that enables ants to select features is the inverse of the cost parameter.
The methods found in the literature that deal with costs associated with the features, described above, have the disadvantage of being computationally expensive, because they interact with a classifier, which prevents their use on large databases, a trending topic in the past few years [21]. However, the general framework proposed in this paper is applied together with the filter model, which is known to have a low computational cost and to be independent of any classifier. Being fast and having a good generalization ability, filters using this cost-based feature selection framework will be suitable for application to databases with a great number of input features, such as DNA microarray data.
In light of the above, the novelty of our paper lies in the fact that there is not much research on cost-based feature selection methods. As a matter of fact, no cost-aware methods can be found in the most popular machine learning and data mining tools. For instance, in Weka [22] we can only find some methods that address the problem of cost associated with the instances (not with the features), and they were incorporated in the latest release. RapidMiner [23] does in fact include some methods that take cost into account, but they are quite simple. One of them selects the attributes that have a cost
value that satisfies a given condition, and another one simply selects the k attributes with the lowest cost. Therefore, the general framework for cost-based feature selection proposed in this paper intends to cover this need.
3. Description of the method
In this section the proposed method is described. Our proposal intends to be a framework applicable to any filter. However, in this paper we have decided to implement our idea on two representative filters and carry out experiments to discover whether our approach is sound. Considering that filters may be subdivided into subset and ranker methods, one filter of each type has been selected. The filters chosen are CFS (Correlation-based Feature Selection), which is a subset filter, and mRMR (Minimal-Redundancy-Maximal-Relevance), which is a ranker filter. For the sake of clarity, we will explain the modifications we have implemented in both CFS and mRMR and only then generalize the approach.
3.1. Cost-based CFS
CFS (Correlation-based Feature Selection) is a multivariate subset filter algorithm. It uses a search algorithm combined with an evaluation function to estimate the merit of feature subsets. The implementation of CFS utilized in this work uses forward best first search [24] as its search algorithm. Best first search is an artificial intelligence search strategy that allows backtracking along the search path. It moves through the search space by making local changes to the current feature subset. If the explored path looks uninteresting, the algorithm can backtrack to a previous subset and continue the search from there on. As a stopping criterion, the search terminates if five consecutive fully expanded (all possible local changes considered) subsets show no improvement over the current best subset.
The evaluation function takes into account the usefulness of
individual features for predicting the class label as well as the level
of correlation among them. It is assumed that good feature subsets
contain features highly correlated with the class and uncorrelated
with each other. The evaluation function can be seen in the
following equation:
M_S = \frac{k\,\overline{r_{ci}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}}    (1)

where M_S is the merit of a feature subset S that contains k features, r_ci is the average correlation between the features of S and the class, and r_ii is the average intercorrelation between the features of S. In fact, this function is Pearson's correlation with all variables standardized. The numerator estimates how predictive of the class S is, and the denominator quantifies the redundancy among the features in S.
The modification of CFS we propose in our research consists of adding a term to the evaluation function to take into account the cost of the features, as can be seen in the following equation:

MC_S = \frac{k\,\overline{r_{ci}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}} - \lambda \frac{\sum_{i=1}^{k} C_i}{k}    (2)

where MC_S is the merit of the subset S affected by the cost of the features, C_i is the cost of the feature i, and λ is a parameter introduced to weight the influence of the cost in the evaluation function.
The parameter λ is a positive real number. If λ = 0, the cost is ignored and the method works as the regular CFS. If λ is between 0 and 1, the influence of the cost is smaller than that of the other term. If λ = 1, both terms have the same influence, and if λ > 1, the influence of the cost is greater than the influence of the other term.
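To make Eqs. (1) and (2) concrete, the following Python sketch computes the cost-penalized merit of a candidate subset. It is a minimal illustration under simplifying assumptions, not the authors' implementation: absolute Pearson correlation stands in for CFS's feature-class and feature-feature correlation measures, and the best first search that explores candidate subsets is omitted.

```python
import numpy as np

def cost_penalized_cfs_merit(X, y, subset, costs, lam=1.0):
    """Cost-penalized CFS-style merit of a feature subset, following Eq. (2).

    X: (n_samples, n_features) data matrix; y: numeric class labels;
    subset: list of feature indices; costs: (n_features,) feature costs;
    lam: the lambda parameter weighting the cost term.
    """
    k = len(subset)
    Xs = X[:, subset]
    # average feature-class correlation (r_ci), using |Pearson| as a stand-in
    r_ci = np.mean([abs(np.corrcoef(Xs[:, i], y)[0, 1]) for i in range(k)])
    # average feature-feature intercorrelation (r_ii)
    if k > 1:
        r_ii = np.mean([abs(np.corrcoef(Xs[:, i], Xs[:, j])[0, 1])
                        for i in range(k) for j in range(i + 1, k)])
    else:
        r_ii = 0.0
    merit = (k * r_ci) / np.sqrt(k + k * (k - 1) * r_ii)  # Eq. (1)
    return merit - lam * np.mean(costs[subset])           # Eq. (2)
```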
3.2. Cost-based mRMR
mRMR (Minimal-Redundancy-Maximal-Relevance) is a multivariate ranker filter algorithm. As mRMR is a ranker, the search algorithm is simpler than CFS's.
The evaluation function combines two constraints (as the name of the method indicates), maximal relevance and minimal redundancy. The former is denoted by the letter D; it corresponds to the mean value of all mutual information values between each feature x_i and the class c, and has the expression shown in the following equation:

D(S, c) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)    (3)

where S is a set of features and I(x_i; c) is the mutual information between the feature x_i and the class c. The expression of I(x; y) is shown in the following equation:

I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx\, dy    (4)
The constraint of minimal redundancy is denoted by the letter R, and has the expression shown in the following equation:

R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)    (5)
The evaluation function to be maximized combines the two constraints (3) and (5). It is called Minimal-Redundancy-Maximal-Relevance (mRMR) and has the expression shown in the following equation:

\Phi(D, R) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) - \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) = D(S, c) - R(S)    (6)
In practice, this is an incremental search method that selects on each iteration the feature that maximizes the evaluation function. Suppose we already have S_{m-1}, the feature set with m-1 features; the m-th selected feature will optimize the following condition:

\max_{x_j \in X \setminus S_{m-1}} \left[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right]    (7)
The modification of mRMR which we propose in this paper consists of adding a term to the condition to be maximized so as to take into account the cost of the feature to be selected, as can be seen in the following equation:

\max_{x_j \in X \setminus S_{m-1}} \left[ \left( I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right) - \lambda\, C_j \right]    (8)

where C_j is the cost of the feature j, and λ is a parameter introduced to weight the influence of the cost in the evaluation function, as explained in the previous subsection.
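As an example of use, the greedy selection of Eq. (8) could be sketched in Python as below. This is an illustrative sketch rather than the authors' code: it assumes the input features have already been discretized so that mutual information can be estimated with scikit-learn's mutual_info_score, and the function name and signature are hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cost_based_mrmr(X_disc, y, costs, n_select, lam=1.0):
    """Greedy cost-penalized mRMR ranking following Eq. (8).

    X_disc: (n_samples, n_features) array of discretized features;
    y: class labels; costs: (n_features,) feature costs; lam: cost weight.
    Returns the indices of the selected features, in selection order.
    """
    n_features = X_disc.shape[1]
    # relevance term: I(x_j; c) for every feature
    relevance = np.array([mutual_info_score(X_disc[:, j], y)
                          for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    for _ in range(n_select):
        best_j, best_score = None, -np.inf
        for j in remaining:
            # redundancy: mean I(x_j; x_i) over the already selected features
            redundancy = (np.mean([mutual_info_score(X_disc[:, j], X_disc[:, i])
                                   for i in selected]) if selected else 0.0)
            score = relevance[j] - redundancy - lam * costs[j]  # Eq. (8)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```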
3.3. Generalization
Ultimately, the general idea consists of adding a term to the evaluation function of the filter to take into account the cost of the features. Since, to the best knowledge of the authors, all filters use an evaluation function, this function can be modified to contemplate costs in the following manner. Let M_S be the merit of the set of k features S, that is, the value originally returned by the function:

M_S = EvF(S)    (9)

where EvF is the evaluation function. Let C_S be the average cost of S:

C_S = \frac{\sum_{i=1}^{k} C_i}{k}    (10)

where C_i is the cost of feature i. The evaluation function can then be modified to become

MC_S = M_S - \lambda\, C_S    (11)
where λ is a parameter introduced in order to weight the influence of the cost in the evaluation. Notice that when we use a ranker method that selects features one at a time, such as mRMR, the cardinality of S is one, and then C_S in (10) reduces to the cost of that single feature.
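In code, the generalization amounts to a thin wrapper around whatever evaluation function a filter already uses. The helper below is a hypothetical illustration of Eqs. (9)-(11); its name and arguments are not taken from any library.

```python
def cost_penalized_merit(merit, subset_costs, lam):
    """Eq. (11): MC_S = M_S - lambda * C_S, with C_S the average cost of Eq. (10).

    merit: value M_S returned by the filter's evaluation function EvF(S);
    subset_costs: costs of the features in S; lam: weight of the cost term.
    """
    c_s = sum(subset_costs) / len(subset_costs)  # average cost of S
    return merit - lam * c_s
```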
4. Experimental study
The experiment is performed over three blocks of datasets (Table 2). The datasets in the first and second blocks are available at the UCI Machine Learning Repository [1]. The datasets in the third block are DNA microarray datasets and are available at http://datam.i2r.a-star.edu.sg/datasets/krbd and http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. The main feature of the first block of datasets is that they have an intrinsic cost associated with the input features. For the second and third blocks, as these datasets do not have intrinsic associated costs, random costs for their input features have been generated. This decision was taken because, to the best knowledge of the authors, no publicly available datasets with cost exist other than the four in the first block. For each feature, the cost was generated as a random number between 0 and 1. For instance, Table 1 displays the costs for each feature of the Yeast dataset.
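For datasets without intrinsic costs, per-feature random costs in [0, 1] such as those of Table 1 could be generated as in the short sketch below; the fixed seed is only an assumption for reproducibility and will not reproduce the exact values reported in Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)              # assumed seed, for reproducibility only
n_features = 8                              # e.g. the Yeast dataset has 8 features
costs = rng.uniform(0.0, 1.0, n_features)   # one random cost per input feature
print(np.round(costs, 4))
```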
Overall, the chosen classification datasets are very heterogeneous. They present a variable number of classes, ranging from two to twenty-six. The numbers of samples and features range from single digits to tens of thousands. Notice that the datasets in the first and second blocks have a larger number of samples than features, whilst the datasets in the third block have a much larger number of features than samples, which poses a big challenge for feature selection researchers. This variety of datasets allows for a better understanding of the behavior of the proposed method.
The experiment consists of performing feature selection with both Cost CFS and Cost mRMR over the datasets. The goal of the experiment is to study the behavior of the methods under the influence of the parameter λ. The performance is evaluated in terms of both the total cost of the selected features and the classification error of an SVM classifier estimated under a 10-fold cross-validation. It is expected that the larger λ is, the lower the cost and the higher the error, because increasing λ gives more weight to the cost at the expense of the merit term. Moreover, a Kruskal-Wallis statistical test and a multiple comparison test (based on Tukey's honestly significant difference criterion) [25] have been run on the errors obtained. The results of these tests can help the user choose the value of the parameter λ.
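This evaluation protocol could be approximated with scikit-learn and SciPy as in the sketch below. It is an assumption-laden sketch, not the authors' experimental code: the SVM is used with default hyperparameters, select_features stands for either Cost CFS or Cost mRMR, and the way the per-fold errors are gathered for the Kruskal-Wallis test is an illustrative choice.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate_selection(X, y, selected, costs):
    """Return the 10-fold CV errors of an SVM on the selected features and their total cost."""
    fold_errors = 1.0 - cross_val_score(SVC(), X[:, selected], y, cv=10)
    return fold_errors, costs[selected].sum()

# errors_per_lambda maps each lambda value to its array of fold errors, obtained by
# running a cost-based filter (select_features, e.g. Cost CFS) for that lambda:
#   selected = select_features(X, y, costs, lam)
#   errors_per_lambda[lam], _ = evaluate_selection(X, y, selected, costs)
# The Kruskal-Wallis test then checks whether the error distributions differ:
#   H, p_value = kruskal(*errors_per_lambda.values())
```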
5. Experimental results
Figs. 1, 3 and 6 show the average cost and error for several values of λ. The solid line with × markers represents the error (referenced on the left Y axis) and the dashed line with ○ markers represents the cost (referenced on the right Y axis). Notice that when λ = 0 the cost has no influence on the behavior of the method and it behaves as the non-cost version.
Fig. 1 plots the error/cost of the four datasets with associated costs found at the UCI repository (see Table 2). The behavior expected when applying cost-based feature selection is that the higher the λ, the lower the cost and the higher the error. The results obtained for the first block of datasets show that the cost indeed behaves as expected (although the magnitude of the cost does not change much, because these datasets have few features and the set of selected ones is often very similar). The error, however, remains constant in most of the cases. This may happen because these datasets are quite simple and the same set of features is often chosen. The Kruskal-Wallis statistical test run on the results showed that the errors are not significantly different, except for the Pima dataset. This may be because this dataset has very few expensive features (which are often associated with a higher predictive power), as can be seen in Table 3. Therefore, removing them has a greater effect on the classification accuracy.
Table 1
Random costs of the features of the Yeast dataset.

Feature   Cost
1         0.5093
2         0.1090
3         0.5890
4         0.2183
5         0.8112
6         0.6391
7         0.2741
8         0.1762
Fig. 1. Error/cost plots of the first block of datasets for cost feature selection with CFS and mRMR. (a) Hepatitis CFS, (b) Liver CFS, (c) Pima CFS, (d) Thyroid CFS, (e) Hepatitis mRMR, (f) Liver mRMR, (g) Pima mRMR and (h) Thyroid mRMR.
Fig. 2 displays the results of the Kruskal-Wallis statistical test for the Pima dataset. The entries in the ANOVA (ANalysis Of VAriance) table (Fig. 2(a) and (c)) are the usual sums of squares (SS), degrees of freedom (df), mean square estimator (MS), chi-square statistic (Chi-sq) and the p-value that determines the significance of the chi-square statistic (Prob > Chi-sq). As can be seen, the p-value is 9 × 10⁻⁶ for Cost CFS and 2 × 10⁻⁴ for Cost mRMR, as displayed in Fig. 2(a) and (c). This indicates that some λ values yield errors significantly different from others. Fig. 2(b) and (d) shows which groups of errors are significantly different, information that can help the user decide which value of λ to use. When using Cost CFS, λ can be set to 0.5 (which means decreasing the cost) without significantly increasing the error, whilst when using Cost mRMR, a reduction in cost cannot be achieved without worsening the error measure. For Cost mRMR, when λ is 0 (and hence the cost is not taken into account), the second feature is selected, which has a high cost (see Table 3). However, when the method is forced to decrease the cost (by increasing the value of λ), this feature is no longer selected, which prevents the classifier from obtaining a high prediction accuracy. Something similar happens with Cost CFS, where the second feature is selected for λ values 0 and 0.5 and removed for the remaining values due to its high cost.
The error/cost graphs of the second block of datasets are displayed in Fig. 3. It can be seen how the cost decreases, as expected, and how, contrary to the first block, the error usually rises when λ increases. In the cases where the error rises monotonically (see Fig. 3(a) or (j), for example), there are significant error changes (p-values close to zero), so the user has to make a choice to find an appropriate trade-off between the cost and the error. On the other hand, there are other cases, such as the Sat dataset with Cost CFS (see Fig. 4) where, with λ = 2, the error is not significantly worse but a significant reduction in the cost (Fig. 5) is achieved.
Table 2
Description of the datasets.

Dataset        No. features   No. samples   No. classes
Hepatitis      19             155           2
Liver          6              345           2
Pima           8              768           2
Thyroid        20             3772          3
Letter         16             20,000        26
Magic04        10             19,020        2
Optdigits      64             5620          10
Pendigits      16             7494          10
Sat            36             4435          6
Segmentation   19             2310          7
Waveform       21             5000          3
Yeast          8              1033          10
Brain          12,625         21            2
CNS            7129           60            2
Colon          2000           62            2
DLBCL          4026           47            2
Leukemia       7129           72            2

Table 3
Costs of the features of Pima dataset (normalized to 1).

Feature   Cost
1         0.0100
2         0.7574
3         0.0100
4         0.0100
5         0.9900
6         0.0100
7         0.0100
8         0.0100

Fig. 2. Kruskal-Wallis statistical test results of Pima dataset. (a) ANOVA table (Cost CFS), (b) graph of multiple comparison (Cost CFS), (c) ANOVA table (Cost mRMR) and (d) graph of multiple comparison (Cost mRMR).

Finally, Fig. 6 presents the results for the third block of datasets, corresponding to the well-known DNA microarray domain, with many more features than samples. As expected, the cost decreases as λ increases, and since these datasets have a larger number of input attributes than those in the previous blocks, the cost shows larger variability (see, for instance, Fig. 6(h) and (j)). For instance, for the DLBCL dataset with Cost mRMR, we can choose λ = 10, as the errors are not significantly different (see Fig. 7) and the cost for λ = 10 is significantly lower than the cost for the first four values (0, 0.5, 0.75 and 1) (Fig. 8).
Notwithstanding, in some cases, and contrary to what was expected, the error remains almost constant (see, for instance, Fig. 6(c) or (f)). The reason why the error does not rise can be two-fold:

• On the one hand, it must be remembered that in this research the proposed framework is tested using filter feature selection methods. This approach has the benefit of being fast and computationally inexpensive, but the features selected according to a filter's particular criterion are not necessarily the most suitable ones for a given classifier to obtain the highest accuracy. Therefore, forcing a filter to select features according to a criterion other than correlation (or whichever metric each particular filter uses) may sometimes yield a selection that is better suited to minimizing the classification error. For example, in [26,5], a synthetic dataset called Monk3 is dealt with. Among others, this dataset contains three relevant features. However, some classifiers obtain a better classification accuracy when the filters selected only two of the relevant features than when they selected all three. This demonstrates that the behavior of some filters is somewhat unpredictable and not always the one expected.

• On the other hand, it has to be noted that DNA microarray datasets are a difficult challenge for feature selection methods, due to the enormous number of features they present. In fact, the filters evaluated in this research usually retain at most 2% of the features. Therefore, irregular results are to be expected with such a drastic reduction in the number of features.

Fig. 3. Error/cost plots of the second block of datasets for cost feature selection with CFS and mRMR. (a) Letter CFS, (b) Magic04 CFS, (c) Optdigits CFS, (d) Pendigits CFS, (e) Letter mRMR, (f) Magic04 mRMR, (g) Optdigits mRMR, (h) Pendigits mRMR, (i) Sat CFS, (j) Segment CFS, (k) Waveform CFS, (l) Yeast CFS, (m) Sat mRMR, (n) Segment mRMR, (o) Waveform mRMR and (p) Yeast mRMR.

Fig. 4. Kruskal-Wallis error statistical test of Sat dataset with Cost CFS. (a) ANOVA table and (b) graph of multiple comparison.
It is also worth reflecting on the behavior of each filter used to illustrate the framework proposed in this research. Studying the graphs in Fig. 6 in detail, one can see that, in general, mRMR achieves the lowest cost at the expense of a higher error than CFS. Therefore, it is the user who has to decide which filter to use depending on the degradation in the error he/she is willing to accept.

Fig. 5. Kruskal-Wallis cost statistical test results of Sat dataset with Cost CFS. (a) ANOVA table and (b) graph of multiple comparison.

Fig. 6. Error/cost plots of the third block of datasets for cost feature selection with CFS and mRMR. (a) Brain CFS, (b) CNS CFS, (c) Colon CFS, (d) DLBCL CFS, (e) Brain mRMR, (f) CNS mRMR, (g) Colon mRMR, (h) DLBCL mRMR, (i) Leukemia CFS and (j) Leukemia mRMR.

6. Conclusions and future work

In this paper, a new framework for cost-based feature selection is proposed. The objective is to solve problems where it is of interest not only to minimize the classification error, but also to reduce the costs that may be associated with the input features. This framework consists of adding a new term to the evaluation function of any filter feature selection method so that it is possible to reach a trade-off between a filter metric (e.g. correlation or mutual information) and the cost associated with the selected features. A new parameter, called λ, is introduced in order to adjust the influence of the cost on the evaluation function, allowing the user fine control of the process according to his or her needs.
In order to test the adequacy of the proposed framework, two well-known and representative filters are chosen: CFS (belonging to the subset feature selection methods) and mRMR (belonging to the ranker feature selection methods). Experiments are executed over a broad suite of different datasets. Results after performing classification with an SVM reveal that the approach is sound and allows the user to reduce the cost without significantly compromising the classification error, which can be very useful in fields such as medical diagnosis or real-time applications. As future work, we plan to test the framework on other filters and on real data. It would also be interesting to study other feature selection methods, such as embedded methods or wrappers.
Conflict of interest
None declared.
Acknowledgments
This work was supported by the Secretaría de Estado de Investigación of the Spanish Government under project TIN 2009-02402, and by the Consellería de Industria of the Xunta de Galicia through the research project CN2011/007, both of them partially supported by the European Union ERDF. V. Bolón-Canedo and I. Porto-Díaz acknowledge the support of Xunta de Galicia and Universidade da Coruña under their grant programs.
References
[1] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of
California, Irvine, School of Information and Computer Sciences, http://mlearn.
ics.uci.edu/MLRepository.html, Last access: April 2012.
[2] Z.A. Zhao, H. Liu, Spectral Feature Selection for Data Mining, Chapman & Hall/
CRC, London, UK, 2012.
[3] A. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 153–158.
[4] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh, Feature Extraction. Foundations and Applications, Springer, New York, USA, 2006.
[5] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, A review of feature selection methods on synthetic data, Knowl. Inf. Syst. 34 (3) (2013) 483–519.
[6] J.T. Feddema, C.S.G. Lee, O.R. Mitchell, Weighted selection of image features for resolved rate visual feedback control, IEEE Trans. Robot. Autom. 7 (1) (1991) 31–47.
[7] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, in: Proceedings of the 2003 IEEE Bioinformatics Conference, CSB 2003, IEEE, 2003, pp. 523–528.
[8] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, An ensemble of filters and classifiers for microarray data classification, Pattern Recognit. 45 (1) (2012) 531–539.
[9] S. Mukkamala, A.H. Sung, Feature selection for intrusion detection with neural networks and support vector machines, Transp. Res. Rec. J. Transp. Res. Board 1822 (1) (2003) 33–39.
[10] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, Feature selection and classification in multiple class datasets: an application to KDD Cup 99 dataset, Exp. Syst. Appl. 38 (5) (2011) 5947–5957.
[11] M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Exp. Syst. Appl. 36 (2) (2009) 3240–3247.
[12] G. Forman, An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res. 3 (2003) 1289–1305.
[13] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intell. Syst. Appl. 13 (2) (1998) 44–49.
[14] A. Bahamonde, G.F. Bayón, J. Díez, J.R. Quevedo, O. Luaces, J.J. Del Coz, J. Alonso, F. Goyache, Feature subset selection for learning preferences: a case study, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, pp. 49–56.
[15] R.M. Haralick, K. Shanmugam, I. Dinstein, Texture features for image classification, IEEE Trans. Syst. Man Cybern. 3 (1973) 610–621.
[16] J.H. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (405) (1989) 165–175.
Fig. 7. Kruskal-Wallis error statistical test of DLBCL dataset with Cost mRMR. (a) ANOVA table and (b) graph of multiple comparison.
Fig. 8. Kruskal-Wallis cost statistical test of DLBCL dataset with Cost mRMR. (a) ANOVA table and (b) graph of multiple comparison.
[17] D. You, O.C. Hamsici, A.M. Martinez, Kernel optimization in discriminant analysis, IEEE Trans. Pattern Anal. Mach. Intell. 33 (3) (2011) 631–638.
[18] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[19] C.L. Huang, C.J. Wang, A GA-based feature selection and parameters optimization for support vector machines, Exp. Syst. Appl. 31 (2) (2006) 231–240.
[20] R.K. Sivagaminathan, S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Exp. Syst. Appl. 33 (1) (2007) 49–60.
[21] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[23] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, T. Euler, YALE: rapid prototyping for complex data mining tasks, in: L. Ungar, M. Craven, D. Gunopulos, T. Eliassi-Rad (Eds.), KDD '06: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2006, pp. 935–940.
[24] E. Rich, K. Knight, Artificial Intelligence, McGraw-Hill, New York, 1991.
[25] Y. Hochberg, A.C. Tamhane, Multiple Comparison Procedures, John Wiley & Sons, New Jersey, USA, 1987.
[26] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1–2) (1997) 273–324.
Verónica Bolón-Canedo received her B.S. degree in Computer Science from the University of A Coruña, Spain, in 2008. She received her M.S. degree in 2010 and is currently a Ph.D. student in the Department of Computer Science at the same university. Her research interests include machine learning and feature selection.

Iago Porto-Díaz received his B.S. degree in Computer Science from the University of A Coruña, Spain, in 2008. He received his M.S. degree in 2010 and is currently a Ph.D. student in the Department of Computer Science at the same university. His research interests include machine learning and feature selection.

Noelia Sánchez-Maroño received the Ph.D. degree for her work in the area of functional and neural networks in 2005 at the University of A Coruña. She is currently teaching in the Department of Computer Science at the same university. Her current research areas include agent-based modeling, machine learning and feature selection.

Amparo Alonso-Betanzos received the Ph.D. degree for her work in the area of medical expert systems in 1988 at the University of Santiago de Compostela. Later, she was a postdoctoral fellow at the Medical College of Georgia, Augusta. She is currently a Full Professor in the Department of Computer Science, University of A Coruña. Her main current areas are intelligent systems, machine learning and feature selection.
