1 INTRODUCTION
K. Javed and H.A. Babri are with the Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan. E-mail: {kashif.javed, babri}@uet.edu.pk.
M. Saeed is with the Department of Computer Science, National University of Computer and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan. E-mail: mehreen.saeed@nu.edu.pk.
Manuscript received 24 Oct. 2009; revised 11 May 2010; accepted 14 Aug.
2010; published online 21 Dec. 2010.
Digital Object Identifier no. 10.1109/TKDE.2010.263.
The class-conditional density of each feature is estimated from the training data as

$$p(F_i = 1 \mid C_l) = \frac{\sum_{t=1}^{N_{C_l}} F_{it}}{N_{C_l}}, \qquad \forall i,\ 1 \le i \le M;\ \forall l,\ 1 \le l \le L,$$

where $N_{C_l}$ is the number of training instances belonging to class $C_l$ and $F_{it}$ is the value of feature $F_i$ in the $t$-th such instance.
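This estimate is simply a per-class column mean of the 0/1 data matrix. The following is a minimal sketch in Python/NumPy; the names `X` and `y` are our own illustrative choices, not from the paper:

```python
import numpy as np

def class_conditional_densities(X, y):
    """Estimate p(F_i = 1 | C_l) for every binary feature F_i and class C_l.

    X : (N, M) array of 0/1 feature values; y : (N,) array of class labels.
    Returns an (L, M) array whose (l, i) entry is the fraction of class-l
    instances in which feature i takes the value 1.
    """
    classes = np.unique(y)
    # The mean of a 0/1 column over the class-l rows is exactly
    # sum_t F_it / N_Cl, the estimate given above.
    return np.vstack([X[y == c].mean(axis=0) for c in classes])
```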
The mutual-information weight of a feature is

$$W(F_i) = MI(F_i; C) = \sum_{F_i} \sum_{C} p(F_i, C)\, \log_2 \frac{p(F_i \mid C)}{p(F_i)}.$$
For a binary class variable with priors $P_0 = p(C_0)$ and $1 - P_0 = p(C_1)$, and writing $d_i^{jl} = p(F_i = j \mid C_l)$, this expands to

$$\begin{aligned} MI(F_i; C) ={}& P_0\, d_i^{00} \log_2 \frac{d_i^{00}}{P_0 d_i^{00} + (1 - P_0)\, d_i^{01}} + (1 - P_0)\, d_i^{01} \log_2 \frac{d_i^{01}}{P_0 d_i^{00} + (1 - P_0)\, d_i^{01}} \\ &+ P_0\, d_i^{10} \log_2 \frac{d_i^{10}}{P_0 d_i^{10} + (1 - P_0)\, d_i^{11}} + (1 - P_0)\, d_i^{11} \log_2 \frac{d_i^{11}}{P_0 d_i^{10} + (1 - P_0)\, d_i^{11}}. \end{aligned} \tag{6}$$

Since $d_i^{00} = 1 - d_i^{10}$ and $d_i^{01} = 1 - d_i^{11}$, dropping the feature index and abbreviating $d^{10}$, $d^{11}$ as $d_{10}$, $d_{11}$, (6) can be rewritten as

$$\begin{aligned} MI(F_i; C) ={}& P_0 \big[ d_{10} \log_2 d_{10} + (1 - d_{10}) \log_2 (1 - d_{10}) \big] + (1 - P_0) \big[ d_{11} \log_2 d_{11} + (1 - d_{11}) \log_2 (1 - d_{11}) \big] \\ &- \big[ d_{11} - P_0 (d_{11} - d_{10}) \big] \log_2 \big[ d_{11} - P_0 (d_{11} - d_{10}) \big] \\ &- \big[ 1 - d_{11} + P_0 (d_{11} - d_{10}) \big] \log_2 \big[ 1 - d_{11} + P_0 (d_{11} - d_{10}) \big], \end{aligned}$$

which is $MI(F_i; C) = H(F_i) - H(F_i \mid C)$ with the marginal $p(F_i = 1) = d_{11} - P_0 (d_{11} - d_{10})$.
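To make the relationship concrete, here is a hedged sketch that evaluates the mutual-information weight of each feature from $d_{10}$, $d_{11}$, and $P_0$, alongside the diff-criterion $|d_{11} - d_{10}|$ that Fig. 1 compares it with. The function names and the epsilon guard on the logarithms are our own implementation details, not part of the paper:

```python
import numpy as np

def mi_weights(d10, d11, p0, eps=1e-12):
    """MI(F_i; C) for binary features of a two-class problem, given
    d10 = p(F_i=1|C_0), d11 = p(F_i=1|C_1) (arrays), and p0 = p(C_0)."""
    p1 = 1.0 - p0
    mi = 0.0
    for dj0, dj1 in [(1 - d10, 1 - d11), (d10, d11)]:  # F_i = 0, then F_i = 1
        pj = p0 * dj0 + p1 * dj1                       # marginal p(F_i = j)
        mi += p0 * dj0 * np.log2((dj0 + eps) / (pj + eps))
        mi += p1 * dj1 * np.log2((dj1 + eps) / (pj + eps))
    return mi

def diff_criterion(d10, d11):
    """Diff-criterion: absolute difference of class-conditional densities."""
    return np.abs(d11 - d10)
```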
Fig. 1. Relationship between diff-criterion and mutual information for balanced (left), partially unbalanced (middle), and highly unbalanced (right)
data.
4.1
TABLE 1
MBF Algorithm [4]
TABLE 2
BMM-MBF Algorithm [29]
TABLE 3
Summary of the Data Sets [1]
The number of classes for each data set is 2. The train, valid and test columns show the total number of instances in the corresponding data sets.
5 EXPERIMENTAL RESULTS
This section first measures the effectiveness of our class-dependent density-based feature elimination algorithm used
as a stand-alone method. Then, we evaluate CDFE as a
preprocessor to the FSS algorithms, described in Section 4, as
a part of our two-stage algorithm for selecting features from
high-dimensional binary data. Experiments are carried out
on three different real-life benchmark data sets using two
different classifiers. The three data sets are NOVA, GINA,
and HIVA, which were collected from the text-mining,
handwriting, and medicine domains, respectively, and were
introduced in the agnostic learning track of the Agnostic
Learning versus Prior Knowledge challenge organized by
the International Joint Conference on Neural Networks in
2007 [1]. The data sets are summarized in Table 3.
Designed for the text classification task, NOVA classifies
emails into two classes: politics and religion. The data are a
sparse binary representation of a vocabulary of 16,969 words
and hence consist of 16,969 features. The positive class is
28.5 percent of the total instances. Thus, NOVA is a partially
unbalanced data set.
HIVA is used for predicting which compounds are active against the HIV/AIDS infection. The data are represented by 1,617 sparse binary features, and the positive class comprises 3.5 percent of the instances. HIVA is, thus, an unbalanced data set.
The GINA data set is used for the handwritten digit recognition task, which consists of separating the two-digit even numbers from the two-digit odd numbers. With sparse continuous input variables, it is designed such that only the unit digit provides the information about the classes. The GINA features are integers quantized to 256 grayscale levels. We converted these 256 gray levels into two by substituting 1 for all values greater than 0, which is equivalent to converting a grayscale image to a binary image. Data sets with GINA-like feature values can be binarized with this strategy, which does not affect the class information they carry.
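As an illustration, this thresholding is a one-line operation; `X_gray` below is a hypothetical stand-in for the raw GINA feature matrix, not the actual data:

```python
import numpy as np

# Stand-in for the GINA matrix: integer features quantized to 0-255.
X_gray = np.random.randint(0, 256, size=(100, 970))

# Map the 256 gray levels to {0, 1}: any nonzero intensity becomes 1.
X_binary = (X_gray > 0).astype(np.uint8)
```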
5.1
Fig. 2. Comparison of weights assigned to the features for NOVA (left), HIVA (middle), and GINA (right).
[Fig. 3 plot panels: BER versus size of feature subsets, comparing CDFE against the baseline method without feature ranking using the kridge classifier; reference BER with all features: 0.26778 for HIVA (1,617 features) and 0.14044 for GINA (970 features).]
Fig. 3. Comparison of CDFE against a baseline method of selecting random features without feature ranking for NOVA (left), HIVA (middle), and
GINA (right) using kridge classifier.
Fig. 4. Comparison of the CDFE algorithm against MI-based ranking, MBF and BMM-MBF algorithms for NOVA (left), HIVA (middle) and GINA
(right) using kridge classifier.
TABLE 4
Comparison of Various Feature Selection Algorithms Using Kridge Classifier
F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.
TABLE 5
Stage-1: CDFE as a Preprocessor
F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.
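Since every table and figure reports BER, a brief sketch of how it is computed for a two-class problem (the average of the per-class error rates) may help; this is our own restatement, not the challenge organizers' scoring code:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER for a binary problem: the mean of the two per-class error rates."""
    errors = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return np.mean(errors)
```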
[Fig. 5 plot panels: BER versus size of feature subsets for the naive Bayes classifier, comparing CDFE + MBF (k = 1 or k = 2) against MBF alone; reference BER with all features: 0.07889 for NOVA (16,969 features), 0.28984 for HIVA (1,617 features), and 0.19888 for GINA (970 features).]
Fig. 5. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using
naive Bayes classifier.
For GINA, the first stage retains a subset of 450 features and hence, after the first stage, we are left with 46.4 percent of the original features.
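The two-stage composition itself is straightforward to express. The sketch below is ours: it assumes CDFE thresholds the diff-criterion in stage 1, and it only assumes an interface for the stage-2 selector (MBF or BMM-MBF); neither function is the authors' code:

```python
import numpy as np

def cdfe_select(X, y, threshold):
    """Stage 1: keep features whose diff-criterion exceeds a threshold."""
    d10 = X[y == 0].mean(axis=0)        # p(F_i = 1 | C_0)
    d11 = X[y == 1].mean(axis=0)        # p(F_i = 1 | C_1)
    return np.where(np.abs(d11 - d10) > threshold)[0]

def two_stage_select(X, y, threshold, second_stage_select):
    """Stage 2 runs only on the subset that survives stage 1.
    `second_stage_select` is assumed to return positional indices
    into the columns of the reduced matrix it is given."""
    kept = cdfe_select(X, y, threshold)
    refined = second_stage_select(X[:, kept], y)
    return kept[refined]
```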
[Fig. 6 plot panels: BER versus size of feature subsets for the kridge classifier, comparing CDFE + MBF (k = 1 or k = 2) against MBF alone; reference BER with all features: 0.070175 for NOVA (16,969 features), 0.26778 for HIVA (1,617 features), and 0.14044 for GINA (970 features).]
Fig. 6. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using
kridge classifier.
TABLE 6
Comparison of the Two-Stage (CDFE + MBF) Algorithm against the MBF Algorithm
F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.
Fig. 7. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA
(right) using naive Bayes classifier.
Fig. 8. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA
(right) using kridge classifier.
Fig. 7 compares the performance of the two-stage algorithm with that of the BMM-MBF algorithm using the naive Bayes classifier. For the NOVA data set, our two-stage algorithm leads to an optimum BER value of 2 percent with 2,048 features, while it selects a subset of 605 features with classification accuracy as good as that obtained with all the features. The HIVA plot indicates that the classification accuracy of BMM-MBF improves with the introduction of the CDFE stage: almost 8 percent of the original features yield the accuracy obtained with all the features. In the case of GINA, we find that BMM-MBF alone performs the classification task with 279 features with an accuracy equal to that attained with all the features. The addition of CDFE to BMM-MBF reduces this subset to 165 features.
Fig. 8 shows the results of the kridge classifier on the three data sets. The dimensionality of the NOVA subset selected in the first stage is reduced further to 780 by BMM-MBF without compromising the classification accuracy obtained with all the features. From the HIVA plot, we find that the smallest subset selected by BMM-MBF that performs the classification task with a BER value equal to that attained with all the features consists of 817 features. The size of this subset is reduced to 140 when CDFE and BMM-MBF are combined in two stages. When the experiment was run on the GINA data set, BMM-MBF selected 550 features while the two-stage algorithm selected 279 features.
5.3
6 CONCLUSIONS
This paper is devoted to feature selection in high-dimensional binary data sets. We proposed a ranking criterion, called the diff-criterion, to estimate the relevance of features using their density values over the classes. We showed that it is equivalent to the mutual information measure but is computationally simpler.
TABLE 7
Comparison of the Two-Stage (CDFE + BMM-MBF) Algorithm against the BMM-MBF Algorithm
TABLE 8
Comparison of CDFE Performance against Top 3 Winning Entries of the Agnostic Learning Track [1]
BMM is Bernoulli mixture model, PCA is principal component analysis, PSO is particle swarm optimization, and SVM is support vector machine.
ACKNOWLEDGMENTS
Kashif Javed was supported by a doctoral fellowship at the
University of Engineering and Technology, Lahore. The
authors would like to thank the anonymous reviewers for
their helpful comments.