
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

Kashif Javed, Haroon A. Babri, and Mehreen Saeed
Abstract: Data and knowledge management systems employ feature selection algorithms for removing irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection, feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure, while the rankings obtained from the two criteria approximate each other. CDFE uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. FSS algorithms, on the other hand, analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods in the form of a two-stage feature selection algorithm. When introduced as a preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification but also relieves them from heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three different real-life data sets (NOVA, HIVA, and GINA) of the Agnostic Learning versus Prior Knowledge challenge. As a stand-alone method, CDFE shows up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining a third position in case of GINA.

Index Terms: Feature ranking, binary data, feature subset selection, two-stage feature selection, classification.

1 INTRODUCTION

THE advancements in data and knowledge management systems have made data collection easier and faster. Raw
data are collected by researchers and scientists working in
diverse application domains such as engineering (robotics),
pattern recognition (face, speech), internet applications
(anomaly detection), and medical applications (diagnosis).
These data sets may consist of thousands of observations or
instances where each instance may be represented by tens or
hundreds of thousands of variables, also known as features.
The number of instances and the number of variables
determine the size and the dimension of a data set. Data sets
such as NOVA [1], a text classification data set consisting of 16,969 features and 19,466 instances, and DOROTHEA [2], a drug discovery data set consisting of 100,000 features and 1,950 instances, are not uncommon these days.
Intuitively, having more features implies more discriminative power in classification [3]. However, this is not always true in practice, because not all the features present in high-dimensional data sets help in class prediction.

. K. Javed and H.A. Babri are with the Department of Electrical Engineering,
University of Engineering and Technology, Lahore 54890, Pakistan.
E-mail: {kashif.javed, babri}@uet.edu.pk.
. M. Saeed is with the Department of Computer Science, National
University of Computer and Emerging Sciences, Block-B, Faisal Town,
Lahore, Pakistan. E-mail: mehreen.saeed@nu.edu.pk.
Manuscript received 24 Oct. 2009; revised 11 May 2010; accepted 14 Aug.
2010; published online 21 Dec. 2010.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2009-10-0734.
Digital Object Identifier no. 10.1109/TKDE.2010.263.

Many features might be irrelevant and possibly detrimental


to classification. Also, redundancy among the features is not
uncommon [4], [5]. The presence of irrelevant and redundant
features not only slows down the learning algorithm but also
confuses it by causing it to overfit the training data [4]. In other words, eliminating irrelevant and redundant features makes the classifier's design simpler and improves its prediction performance and computational efficiency [6], [7].
High-dimensional data sets are inherently sparse and
hence, can be transformed to lower dimensions without
losing too much information about the classes [8]. This
phenomenon, known as the empty space phenomenon [9], is responsible for the well-known curse of dimensionality, a term coined by Bellman [10] in 1961 to describe the problems faced in the analysis of high-dimensional data. He showed that estimating multivariate density functions to a given degree of accuracy requires a number of data samples that grows exponentially with the data dimension. While studying small sample size effects on classifier design, Raudys and Jain [11] observed a phenomenon related to the curse of dimensionality, termed the peaking phenomenon: for a given sample size, the accuracy of a classifier first increases with the number of features, approaches its optimal value, and then starts decreasing.
Problems faced by learning algorithms on high-dimensional data sets have been studied intensively. The algorithms that have been developed can be categorized into two broad groups. The algorithms that
select, from the original feature set, a subset of features, which


are highly effective in discriminating classes, are categorized
as feature selection (FS) methods. Relief [12] is a popular FS
method that filters out irrelevant features using the nearest
neighbor approach. Another well-known method is recursive
feature elimination support vector machine (RFE-SVM) [13]
which selects useful features while training the SVM
classifier. The FS algorithms are further discussed in Section 2.
On the other hand, algorithms that create a new set of features from the original ones, through some transformation or combination of the original features, are termed feature extraction (FE) methods. Among them, principal component analysis (PCA) and linear discriminant analysis (LDA) are two well-known linear algorithms that are widely used because of their simplicity and effectiveness [3].
Nonlinear FE algorithms include isomap [14] and locally
linear embedding (LLE) [15]. For a comparative review of FE
methods, interested readers are referred to [16]. In this paper,
our focus is on the problem of supervised feature selection
and we propose a solution that is suitable for binary data sets.
The remainder of the paper is organized as follows.
Section 2 describes the theory related to feature selection and
presents a literature survey of the existing methods. In
Section 3, we propose a new feature ranking (FR) algorithm,
termed as class-dependent density-based feature elimination
(CDFE). Section 4 discusses how to combine CDFE with other
feature selection algorithms in two stages. Experimental
results on three real-life data sets are discussed in Section 5.
The conclusions are drawn in Section 6.

2 FEATURE SELECTION

This section describes the theory related to the feature selection problem and surveys the various methods presented in the literature for its solution. Suppose we are given a labeled data set $\{x^t, C^t\}_{t=1}^{N}$ consisting of $N$ instances and $M$ features, such that $x^t \in \mathbb{R}^M$ and $C^t$ denotes the class variable of instance $t$. There can be $L$ classes. Each vector $x^t$ is, thus, an $M$-dimensional vector of features; hence, $x^t = \{F_1^t, F_2^t, \ldots, F_M^t\}$. We use $F$ to denote the set comprising all features of a data set, whereas $G$ denotes a feature subset. The feature selection problem is to find a subset $G$ of $m$ features from the set $F$ having $M$ features with the smallest classification error [17], or at least without a significant degradation in performance [6].
A straightforward solution to the feature selection problem is to explore all $\binom{M}{m}$ possible subsets of size $m$. However, this kind of search is computationally expensive for even moderate values of $M$ and $m$. Therefore, alternative search strategies have to be designed.
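To make the combinatorial cost concrete, a back-of-the-envelope illustration (the values $M = 1000$ and $m = 10$ are chosen for illustration only and do not come from the paper):

```latex
\binom{M}{m} = \frac{M!}{m!\,(M-m)!}, \qquad
\binom{1000}{10} = \frac{1000!}{10!\,990!} \approx 2.6 \times 10^{23}
```

so even a modest subset size over a moderately sized feature set rules out exhaustive enumeration.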
Generally speaking, the feature selection process may
consist of four basic steps, namely, subset generation, subset
evaluation, stopping criterion, and result validation [6]. In
the subset generation step, the feature space is searched
according to a search strategy for a candidate subset, which
is evaluated later. The search can begin with either an
empty set, which is then successively built up (forward
selection) or it starts with the entire feature set and then
features are successively eliminated (backward selection).
Different search strategies have been devised such as
complete search, sequential search, and random search

[18]. The newly generated subset is evaluated either with the help of the classifier's performance or with some criterion that does not involve classifier feedback. These two steps are
repeated until a stopping criterion is met.
Two well-known classes of feature selection algorithms
are feature ranking and feature subset selection (FSS) [7], [19].
Feature ranking methods typically assign weights to features
by assessing each feature individually according to some
criterion such as the degree of relevance to the class variable.
Correlation, information theoretic, and probabilistic-based
ranking criteria are discussed in [20]. Features are then sorted
according to their weights in descending order. A fixed
number of the top ranked features can comprise the optimal
subset or alternatively, a threshold value provided by the
user can be set on the ranking criterion to retain/discard
features. Thus, FR methods do not perform an explicit search for the smallest optimal set. They are highly
attractive for microarray analysis and text-categorization
domains because of their computational efficiency and
simplicity [7]. Kira and Rendell's Relief algorithm [12]
estimates the relevance of a feature using the values of the
features of its nearest neighbors. Hall [21] proposes a ranking
criterion that evaluates and ranks subsets of features rather
than assessing features individually. In [22], a comparison of
four feature ranking methods is given. Presenting the most relevant features to a classifier may not produce an optimal result, as the selected set may contain redundant features. In other words, the $m$ best individually ranked features may not be the best $m$ features in terms of classification accuracy. Yu and Liu [23] suggest analyzing the subset
obtained by the feature ranking methods for feature
redundancy in a separate stage.
Unlike feature ranking methods, feature subset selection
methods select subsets of features, which together have
good predictive power. Guyon and Elisseeff present theoretical examples in [7] to illustrate the superiority of FSS methods over FR methods. Feature subset
selection methods are divided into three broad categories:
filter, wrapper, and embedded methods [7], [19], [20]. A
filter acts as a preprocessing step to a learning algorithm and assesses feature subsets without the algorithm's involvement. Fleuret [24] proposes a filtering criterion
based on conditional mutual information (MI) for binary
data sets. A feature $F_i$ among the unselected features is selected if its mutual information with the class, $I(C; F_i \mid F_k)$, conditioned on every feature $F_k$ already in the selected subset of features, is the largest. This conditional mutual information
maximization (CMIM) criterion discards features similar to
the already selected ones as they do not carry additional
information about the class. In [25], Peng et al. propose the
minimal-redundancy-maximal-relevance (mRMR) criterion,
which adds a feature to the final subset if it maximizes the
difference between its mutual information with the class
and the sum of its mutual information with each of the
individual features already selected. Qu et al. [26] suggest a
new redundancy measure and a feature subset merit
measure based on mutual information concepts to quantify
the relevance and redundancy among features. The
proposed filter first finds a subset of highly relevant
features. Among these features, the most relevant feature

having the least redundancy with the already selected


features is added to the final subset.
Wrapper methods, motivated by Kohavi and John [17], use the performance of a predetermined learning algorithm to search for an optimal feature subset. They suggest using n-fold cross validation for evaluating feature subsets and find that the best-first search strategy outperforms the hill-climbing technique for forward selection. In practice,
wrappers are considered to be computationally more
expensive as compared to filters.
In the embedded approach, the feature selection process is
integrated into the training process of a given classifier. An example is the recursive feature elimination (RFE) algorithm [13], in which the support vector machine (SVM) is used as the classifier. Features are assigned weights that are estimated by the SVM classifier after it is trained on the data set. During each iteration, the feature(s) that decrease the class-separation margin the least are eliminated.
Another class of feature selection algorithms uses the
concepts of Bayesian networks [27]. The Markov blanket
(MB) of a target variable is a minimal set of variables conditioned on which all other variables are probabilistically independent of the target. The optimal feature subset for classification is the Markov blanket of the class variable.
One way of identifying the Markov blanket is through
learning the Bayesian network [28]. Another way is to
discover the Markov blanket directly from the data [4]. The
Markov blanket filtering (MBF) algorithm of Koller and
Sahami [4] calculates pairwise correlations between all the
features and assumes the K highest correlated features of a
feature to comprise its Markov blanket. During each iteration, expected cross entropy is used to estimate how well the candidate MB of each feature approximates its true MB, and the feature whose MB is best approximated is eliminated. For large values of K, MBF runs into computational and data fragmentation problems.
To address these problems, in [29], we propose a Bernoulli
mixture model-based Markov blanket filtering (BMM-MBF)
algorithm for binary data sets that estimates the expected
cross entropy measure via Bernoulli mixture models rather
than from the training data set.

3 CLASS-DEPENDENT DENSITY-BASED FEATURE ELIMINATION

In this section, we propose a new feature ranking algorithm, termed class-dependent density-based feature elimination, for binary data sets. Binary data sets are found in a wide variety of applications including document classification [30], binary image recognition [31], drug discovery [32], databases [33], and agriculture [34]. In many cases it may be possible to binarize nonbinary data sets (e.g., the binarization of the GINA data set; see Section 5). CDFE uses a measure, termed the diff-criterion, to estimate the relevance of features. The diff-criterion is a probabilistic measure and assigns weights to features by determining their density value in each class. Mathematically, we show that the computational cost of estimating the weights by the diff-criterion is lower than the cost of calculating weights by mutual information, while the feature rankings obtained by the two criteria are similar to each other. Instead of using
a user-provided threshold value, CDFE determines the final


subset with the help of a classifier.
Guyon et al. [35] proposed the Zfilter method to rank the features of a sparse-integer data set. The filter counts the nonzero values of a feature irrespective of the class variable and assigns that count as its weight. Features having a weight less than a given threshold value are then removed to obtain the final subset. In earlier work [36], [37], we proposed a similar density-based elimination strategy for high-dimensional binary data sets using the max-criterion (see Definition 3.2). In the following, we suggest a new and more effective density-based ranking criterion (the diff-criterion) and present a formal analysis of its working. The discussion that follows is for binary features and two-class classification problems unless stated otherwise.
Definition 3.1. The density of a binary feature for a given class is the fraction of the instances of that class in which the feature's value is 1.

The density of the $i$th feature, $F_i$, in the $l$th class, $C_l$, having $N_{C_l}$ instances, is calculated as

$$d_i(1|l) = \frac{\sum_{t=1}^{N_{C_l}} F_i^t}{N_{C_l}} = p(F_i = 1 \mid C_l), \qquad \forall i,\; 1 \le i \le M;\ \forall l,\; 1 \le l \le L. \qquad (1)$$

Remark 3.1. $0 \le d_i(1|l) \le 1$ follows directly from the definition. The extreme values occur when a feature's value is identical over all the instances of a given class.
Definition 3.2. The max-criterion [36] calculates the density value of a feature in each class and then scores it with the maximum density value over all the classes.

The weight of the $i$th feature, $F_i$, using the max-criterion is

$$W(F_i)_{max} = \max_{l}\, d_i(1|l). \qquad (2)$$

Remark 3.2. $0 \le W(F_i)_{max} \le 1$ follows directly. Irrelevant features will be assigned a value $W(F_i)_{max} = 0$.
Definition 3.3. The diff-criterion calculates the density value of a feature in each class and then scores it with the difference of the density values over the two classes $C_0$ and $C_1$.

The weight of the $i$th feature, $F_i$, using the diff-criterion is

$$W(F_i)_{diff} = |d_i(1|1) - d_i(1|0)| = |p(F_i = 1 \mid C_1) - p(F_i = 1 \mid C_0)|. \qquad (3)$$

Remark 3.3. $0 \le W(F_i)_{diff} \le 1$ follows directly. A feature having $W(F_i)_{diff} = 0$ is irrelevant, whereas $W(F_i)_{diff} = 1$ means $F_i$ is the most relevant feature of a data set. Features with lower weights are, thus, less relevant than features with higher values of $W_{diff}$.
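As an illustration of Definitions 3.1-3.3, the following minimal NumPy sketch (not the authors' code; the array names X and y are assumptions) computes the class-conditional densities of (1) and the max- and diff-criterion weights of (2) and (3) for a binary data matrix:

```python
import numpy as np

def class_densities(X, y):
    """d[i, l] = p(F_i = 1 | C_l): fraction of class-l instances in which feature i equals 1."""
    X = np.asarray(X)                 # shape (N, M), binary values in {0, 1}
    y = np.asarray(y)                 # shape (N,), class labels in {0, 1}
    d = np.empty((X.shape[1], 2))
    for l in (0, 1):
        d[:, l] = X[y == l].mean(axis=0)   # Definition 3.1, one density per feature
    return d

def max_criterion(d):
    """Eq. (2): score each feature with its maximum density over the classes."""
    return d.max(axis=1)

def diff_criterion(d):
    """Eq. (3): score each feature with |p(F_i = 1 | C_1) - p(F_i = 1 | C_0)|."""
    return np.abs(d[:, 1] - d[:, 0])

# toy usage on a 4-instance, 3-feature binary data set
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])
d = class_densities(X, y)
print(diff_criterion(d))               # [1.0, 0.0, 0.5]
print(np.argsort(-diff_criterion(d)))  # features sorted by decreasing relevance
```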
The class-dependent density-based feature elimination
strategy ranks the features using diff-criterion given by (3)
and sorts them according to decreasing relevance. In feature
selection algorithms such as [5], [23], [26], where relevance

and redundancy are analyzed separately in two steps, a


preliminary subset of relevant features is chosen in the first
step using a threshold value provided by the user. A high
threshold value provided by the user may result in a very
small subset of highly relevant features, whereas, with a low
value the subset may consist of too many features including
highly relevant features along with less relevant features. In
the former case, a lot of information about the class may be
lost, whereas, the subset, in the latter case, will still contain a
lot of information irrelevant to the class, thus, requiring a
computationally expensive second stage for selecting the best
features. Feature ranking algorithms such as Relief [12] and
others also suffer from the same problem with a userprovided threshold value. To address this problem, CDFE
uses a filtrapper approach [20] and defines nested sets of
features S1  S2      SNT in search of the optimal subset.
Here, NT denotes the number of threshold levels. A sequence
of increasing Wdiff values can be used as thresholds to
progressively eliminate more and more features of decreasing relevance in the nested subsets. Each feature subset, thus,
generated is evaluated with a classifier and the final subset is
chosen according to the application requirement. Either the
subset smallest in size having the same accuracy as attained
by the entire feature set, is selected or the one with best
classification accuracy is chosen.
The $W(F_i)_{diff}$ value of the $i$th feature, $F_i$, is determined by counting the number of 1s over the instances of the two classes. Therefore, the time complexity of the diff-criterion is $O(NM)$, where $N$ is the number of instances and $M$ is the number of features in the training set. Consequently, the time complexity of CDFE is $O(NM + N_T V)$, where $V$ is the computing time of the classifier used.
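The filtrapper search over nested subsets described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation; train_and_score stands for a hypothetical callback that returns a (cross-validated) BER for a given feature subset.

```python
import numpy as np

def cdfe_select(X, y, weights, thresholds, train_and_score):
    """Evaluate nested subsets S_1 >= S_2 >= ... defined by increasing thresholds on W_diff
    and return the subset with the lowest BER reported by the classifier callback."""
    best_ber, best_subset = np.inf, None
    for t in sorted(thresholds):                # increasing threshold levels
        subset = np.where(weights >= t)[0]      # keep features with W_diff >= t
        if subset.size == 0:
            break                               # nothing left to evaluate
        ber = train_and_score(X[:, subset], y)  # classifier-in-the-loop evaluation
        if ber < best_ber:
            best_ber, best_subset = ber, subset
    return best_ber, best_subset
```

Depending on the application requirement, the loop can instead return the smallest subset whose BER matches that of the full feature set, which is the other selection rule mentioned above.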

3.1 Rationale of the Diff-Criterion Measure


In the remainder of this section, a theoretical justification for the diff-criterion is provided. We express mutual information in terms of the diff-criterion and show that the diff-criterion is computationally more efficient.
Definition 3.4. Mutual information is a measure of the amount
of information that one variable contains about another
variable [38].
It is calculated by finding the relative entropy, or Kullback-Leibler distance, between the joint distribution $p(C, F_i)$ of two random variables $C$ and $F_i$ and their product distribution $p(C)p(F_i)$ [38]. Being consistent with the notation used in (2) and (3), the weight of the $i$th feature, $F_i$, using mutual information is

$$W(F_i)_{mi} = D_{KL}\big(p(C, F_i)\,\|\,p(C)p(F_i)\big) = -\sum_{C}\sum_{F_i} p(C, F_i)\,\log_2 \frac{p(C)\,p(F_i)}{p(C, F_i)}. \qquad (4)$$

Remark 3.4. Because of the properties of the Kullback-Leibler divergence, $W(F_i)_{mi} \ge 0$, with equality if and only if $C$ and $F_i$ are independent. A larger $W_{mi}$ value means a feature is more important.
Writing the mutual information given in (4) in terms of class-conditional probabilities,

$$W(F_i)_{mi} = -\sum_{C}\sum_{F_i} p(F_i \mid C)\,p(C)\,\log_2 \frac{p(F_i)}{p(F_i \mid C)}. \qquad (5)$$

Since $p(F_i = f) = p(F_i = f \mid C = 0)\,p(C = 0) + p(F_i = f \mid C = 1)\,p(C = 1)$ and $p(C = 0) + p(C = 1) = 1$, and using the notation $p(C = c) = P_c$ for the prior probability of the class variable and $d_i(f|c) = p(F_i = f \mid C = c)$ for the class-conditional probabilities of the $i$th feature, where $f, c \in \{0, 1\}$, (5) becomes

$$\begin{aligned}
W(F_i)_{mi} = {}& -P_0\, d_i(0|0)\, \log_2 \frac{P_0\big(d_i(0|0) - d_i(0|1)\big) + d_i(0|1)}{d_i(0|0)}
 - (1 - P_0)\, d_i(0|1)\, \log_2 \frac{P_0\big(d_i(0|0) - d_i(0|1)\big) + d_i(0|1)}{d_i(0|1)} \\
& - P_0\, d_i(1|0)\, \log_2 \frac{P_0\big(d_i(1|0) - d_i(1|1)\big) + d_i(1|1)}{d_i(1|0)}
 - (1 - P_0)\, d_i(1|1)\, \log_2 \frac{P_0\big(d_i(1|0) - d_i(1|1)\big) + d_i(1|1)}{d_i(1|1)}.
\end{aligned} \qquad (6)$$
Using $d_i(0|0) + d_i(1|0) = 1$ and $d_i(0|1) + d_i(1|1) = 1$ in (6), suppressing the index $i$, and rearranging the terms, we get

$$\begin{aligned}
W(F_i)_{mi} = {}& P_0\, \log_2\!\big[d_{10}^{\,d_{10}} (1 - d_{10})^{1 - d_{10}}\big] + (1 - P_0)\, \log_2\!\big[d_{11}^{\,d_{11}} (1 - d_{11})^{1 - d_{11}}\big] \\
& - \log_2\!\Big[\big(d_{11} - P_0 (d_{11} - d_{10})\big)^{\,d_{11} - P_0 (d_{11} - d_{10})} \big(1 - d_{11} + P_0 (d_{11} - d_{10})\big)^{\,1 - d_{11} + P_0 (d_{11} - d_{10})}\Big].
\end{aligned} \qquad (7)$$

Equation (7) indicates that the mutual information between a feature and the class variable depends on $P_0$, $d_{11}$, $d_{10}$, and the diff-criterion measure, $d_{11} - d_{10}$. The first term in (7) ranges over $[-P_0, 0]$, with its minimum at $d_{10} = 0.5$ and maxima at $d_{10} = 0, 1$. Similarly, the second term lies in the range $[-(1 - P_0), 0]$, with a minimum at $d_{11} = 0.5$ and maxima at $d_{11} = 0, 1$. The third term lies in the range $[0, 1]$ and depends on $P_0$ and $d_{11} - d_{10}$. The most significant contribution to the mutual information comes from the third term, which contains the diff-criterion measure. Fig. 1 shows the relationship between mutual information and the diff-criterion for a balanced data set ($P_0 = 0.5$), a partially unbalanced data set ($P_0 = 0.25$), and an unbalanced data set ($P_0 = 0.035$). Mutual information increases as a function of the diff-criterion. The change in mutual information due to different values of $d_{11}$ and $d_{10}$ but the same value of $d_{11} - d_{10}$ is relatively small, as evident from the standard deviation bars on the three plots in Fig. 1. It is also observed that the maximum value of mutual information, obtained with $d_{11} - d_{10} = 1$, decreases as $P_0$ decreases.
Remark 3.5. The mutual information of a feature, $F_i$, whose density remains the same over the two classes is 0.

Proof. In this case, the diff-criterion $d_{11} - d_{10}$ becomes 0, i.e., $W(F_i)_{diff} = 0$. Putting this value in (7) and using $\lim_{y \to 0,1} \log_2\!\big[y^y (1 - y)^{1-y}\big] = 0$, we get $W(F_i)_{mi} = 0$.
Theorem 3.1. Mutual information is upper bounded by the entropy of the class variable.
JAVED ET AL.: FEATURE SELECTION BASED ON CLASS-DEPENDENT DENSITIES FOR HIGH-DIMENSIONAL BINARY DATA

469

Fig. 1. Relationship between diff-criterion and mutual information for balanced (left), partially unbalanced (middle), and highly unbalanced (right)
data.

Proof. The most relevant feature has $d_{11} - d_{10} = 1$, i.e., $W(F_i)_{diff} = 1$. Putting this value in (7) and using $\lim_{y \to 0,1} \log_2\!\big[y^y (1 - y)^{1-y}\big] = 0$, we get

$$W(F_i)_{mi} = -P_0 \log_2 P_0 - P_1 \log_2 P_1 = -p(C = 0) \log_2 p(C = 0) - p(C = 1) \log_2 p(C = 1) = \mathrm{Entropy}(C).$$
Remark 3.6. The range of the diff-criterion measure is $[0, 1]$, whereas mutual information lies within $[0, \mathrm{Entropy}(C)]$.

In other words, features with higher $W_{diff}$ reduce the uncertainty of the class variable more than features with lower $W_{diff}$, and a feature whose $W_{diff}$ is 1 contains all the information required to predict the class variable.
Remark 3.7. The diff-criterion is a computationally less expensive measure than mutual information.

If we assume that it takes $t_1$ units of time to calculate a density term $p(F_i = 1 \mid C)$, a subtraction is performed in $t_2$, and an absolute-value operation takes $t_3$ units of time, then the computational cost of $W(F_i)_{diff}$ given in (3) is $2 t_1 + t_2 + t_3$. Further, if we assume that $p(C)$ and $p(F_i)$ take $t_4$, a $\log_2$ takes $t_5$, a division takes $t_6$, and a multiplication takes $t_7$ units of time, then the computational cost of $W(F_i)_{mi}$ given in (5) is $4 t_1 + 4 t_2 + 4 t_4 + 4 t_5 + 4 t_6 + 8 t_7$. Comparing the two costs, and keeping in mind that logarithm, multiplication, and division are expensive operations, we find that the diff-criterion is the less expensive measure.
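For a side-by-side check of the two criteria, the sketch below (an illustration using plain empirical probabilities, not the authors' code) computes the mutual-information weight of (4) from the same binary matrix used for the diff-criterion; ranking features by either weight should give broadly similar orderings, which is the behavior observed in Fig. 2.

```python
import numpy as np

def mi_weight(X, y, eps=1e-12):
    """W(F_i)_mi of Eq. (4): empirical mutual information between each binary feature and the class."""
    X = np.asarray(X); y = np.asarray(y)
    weights = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        mi = 0.0
        for c in (0, 1):
            for f in (0, 1):
                p_cf = np.mean((y == c) & (X[:, i] == f))   # joint p(C=c, F_i=f)
                p_c = np.mean(y == c)                       # prior p(C=c)
                p_f = np.mean(X[:, i] == f)                 # marginal p(F_i=f)
                if p_cf > 0:
                    mi += p_cf * np.log2(p_cf / (p_c * p_f + eps))
        weights[i] = mi
    return weights
```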

4 TWO-STAGE FEATURE SELECTION ALGORITHMS

Feature ranking algorithms, while selecting a final subset,
ignore redundancies among the features. Without any
search strategy, they choose features that are highly
relevant to the class variable. Due to their simplicity and
computational efficiency, they are highly popular in
application domains involving high-dimensional data. On
the other hand, feature subset selection algorithms take the
redundancies among features into consideration while
selecting features but are computationally expensive with
data having a very large number of features. In this section,
we suggest combining an FR algorithm using a filtrapper

approach [20] with an FSS algorithm to overcome these


limitations. The idea of designing dimensionality reduction
algorithms with more than one stage is not new [25], [39].
However, this kind of combination of FR and FSS
algorithms for high-dimensional binary data has not yet
been explored. The first stage of the two-stage algorithm is
based on a computationally cheap FR measure and selects a
preliminary subset with best classification accuracy. A
potentially large number of irrelevant and redundant
features are discarded in this phase. This makes the job of
an FSS algorithm relatively easy. In the second stage, a
higher performance computationally more expensive FSS
algorithm is employed to select the most useful features
from the reduced feature set produced in the first stage.
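A schematic of the two-stage wiring is given below. It is an outline only: rank_weights, cdfe_select, fss, and score are placeholders for the diff-criterion ranking, the CDFE filtrapper search, a second-stage FSS routine such as MBF, and a classifier-based evaluation, respectively.

```python
def two_stage_select(X, y, rank_weights, cdfe_select, fss, thresholds, score):
    """Stage 1: CDFE keeps a preliminary subset (best BER over nested subsets).
    Stage 2: a feature subset selection routine prunes the remaining redundancy."""
    w = rank_weights(X, y)                               # e.g., diff-criterion weights
    _, stage1 = cdfe_select(X, y, w, thresholds, score)  # indices surviving stage 1
    stage2_local = fss(X[:, stage1], y)                  # FSS run on the reduced matrix
    return stage1[stage2_local]                          # map back to original feature indices
```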

4.1 First Stage: Selection of the Preliminary Feature Subset

To evaluate its effectiveness as a preprocessor to FSS algorithms, CDFE is employed in the first stage of our two-stage algorithm. In this capacity, it provides them with a
reduced initial feature subset having good classification
accuracy as compared to the entire feature set. Besides the
irrelevant features, a large number of redundant features
are eliminated by CDFE during this stage. The subset thus generated is not only easier for the FSS algorithm in the second stage to manipulate but also improves its classification performance.
4.2 Second Stage: Selection of the Final Feature Subset

In this paper, we have tested two FSS algorithms in the
second stage: Koller and Sahami's Markov blanket filtering
algorithm [4], which is an approximation to the theoretically
optimal feature selection criterion, and our Bernoulli
mixture model-based Markov blanket filtering algorithm
[29], which makes MBF computationally more efficient. The
two algorithms are briefly described here.
4.2.1 Koller and Sahami's Markov Blanket Filtering Algorithm [4]
Koller and Sahami show that a feature $F_i$ can be safely eliminated from a set, without an increase in the divergence from the true class distribution, if its Markov blanket, $M_i$, can be identified. In practice, it is not possible to exactly pinpoint the true MB of $F_i$; hence, heuristics have to be applied. MBF is a backward elimination algorithm and is outlined in Table 1. For each feature $F_i$, a candidate set $M_i$ consisting of the $K$ features that have the highest
TABLE 1
MBF Algorithm [4]

correlation with $F_i$, is selected. The value of K should be as


large as possible to subsume all the information Fi contains
about the class and other features. Then, MBF estimates
how close Mi is to being the MB of Fi using the following
expected cross entropy measure:
X
P Mi f Mi ; Fi fi
G Fi jMi
f Mi ;fi

 DKL P CjM fM ; Fi fi jjP CjM f M :


8
The feature $F_i$ having the smallest value of $\Delta_G(F_i \mid M_i)$ is omitted. The output of this algorithm can also be a list of features sorted according to relevance to the class variable. Its time complexity is $O(rMKN2^K L)$, where $r$ is the number of features to eliminate and $L$ is the total number of classes.
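A minimal sketch of the expected cross entropy in (8), assuming binary class labels coded as 0/1 and plain empirical estimates (this is an illustration, not the authors' implementation):

```python
import numpy as np
from collections import defaultdict

def expected_cross_entropy(X, y, i, mb, eps=1e-12):
    """Delta_G(F_i | M_i) of Eq. (8): expected KL divergence between P(C | M_i, F_i)
    and P(C | M_i), estimated from binary training data.
    X: (N, M) binary matrix, y: (N,) labels in {0, 1}, i: feature index,
    mb: list of candidate Markov-blanket feature indices."""
    counts_with = defaultdict(lambda: np.zeros(2))     # key: (values of M_i, value of F_i) -> class counts
    counts_without = defaultdict(lambda: np.zeros(2))  # key: values of M_i only -> class counts
    for x, c in zip(np.asarray(X), np.asarray(y)):
        key_m = tuple(x[mb])
        counts_with[(key_m, int(x[i]))][int(c)] += 1
        counts_without[key_m][int(c)] += 1
    N = len(y)
    delta = 0.0
    for (key_m, fi), cnt in counts_with.items():
        p_joint = cnt.sum() / N                        # P(M_i = f_M, F_i = f_i)
        p_c_with = cnt / (cnt.sum() + eps)             # P(C | M_i = f_M, F_i = f_i)
        cw = counts_without[key_m]
        p_c_without = cw / (cw.sum() + eps)            # P(C | M_i = f_M)
        kl = np.sum(np.where(p_c_with > 0,
                             p_c_with * np.log2((p_c_with + eps) / (p_c_without + eps)),
                             0.0))
        delta += p_joint * kl                          # weight the KL term by the joint probability
    return delta
```

In MBF proper, this score is evaluated for every remaining feature against its K most correlated companions, and the feature with the smallest score is eliminated in each iteration.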

4.2.2 Bernoulli Mixture Model-Based Markov Blanket Filtering Algorithm [29]
Larger values of K in the MBF algorithm demand heavy
computations for calculating the expected cross entropy
measure given in (8) from the training data. This issue is
addressed by the BMM-MBF algorithm for binary data sets
by estimating the cross entropy measure from the Bernoulli
mixture model instead of the training set. A Bernoulli
mixture model can be seen as a tool for partitioning an M-dimensional hypercube, identifying regions of high data density at the corners of the hypercube. BMM-MBF is the
same as the MBF algorithm given in Table 1, except that
Step 2b is replaced by the steps given in Table 2.
BMM-MBF first determines the Bernoulli mixtures ($Q_1$ and $Q_0$) from the training data for the positive and negative classes ($C_1$ and $C_0$), respectively. The $q$th mixture component is specified by the prior $\pi_q$ and the probability vector $p_q \in [0, 1]^M$, $1 \le q \le Q = Q_1 + Q_0$. These two parameters can be determined by the expectation maximization (EM) algorithm [40]. Then, BMM-MBF thresholds the values of the probability vector to see which corner of the hypercube is represented by this mixture: a probability value greater than 0.5 is taken as 1, and 0 otherwise. This converts $p_q$ into a feature vector $x$ whose probability of occurrence can be estimated as

$$p(x \mid q) = \pi_q \prod_{i=1}^{M} p_{qi}^{\,x_i} (1 - p_{qi})^{1 - x_i}, \qquad (9)$$

where $p_{qi} \in [0, 1]$, $1 \le i \le M$, denotes the probability of success of the $i$th feature in the $q$th mixture. The feature vector having the highest probability of occurrence according to (9) in the mixture density is termed the main vector and is denoted by $v$:

$$v = \arg\max_{x \in X} p(x \mid q), \qquad (10)$$

where $X$ represents the set of all binary vectors in $\{0, 1\}^M$.


Once the main vectors are estimated, the steps given in
Table 2 are then followed. Here, the kth mixtures for the
positive and negative classes are denoted by q1k and q0k ,
respectively.
The BMM-MBF algorithm has a time complexity of
OrMKQ2 L. The cross entropy measure in MBF is
computed from N  K sized data and we need to look at
2K combinations of values. On the other hand, BMM-MBF
computes the cross entropy measure from Q  K sized
data, where we are only looking at Q main vectors in each
Bernoulli mixture, resulting in a dramatic reduction in time.
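As a sketch of (9) and (10), assuming the product-form Bernoulli likelihood given above, the main vector of a mixture component is just its probability vector thresholded at 0.5, and its probability of occurrence follows directly:

```python
import numpy as np

def main_vector(prior, p_q):
    """Threshold the component's probability vector at 0.5 to get the hypercube corner it
    represents (the main vector of Eq. (10)) and evaluate its probability under Eq. (9)."""
    p_q = np.asarray(p_q, dtype=float)
    v = (p_q > 0.5).astype(int)                          # main vector of this mixture component
    prob = prior * np.prod(p_q ** v * (1 - p_q) ** (1 - v))
    return v, prob

# toy usage: one mixture component over four binary features
v, p = main_vector(0.3, [0.9, 0.2, 0.6, 0.1])
print(v, p)   # [1 0 1 0], 0.3 * 0.9 * 0.8 * 0.6 * 0.9
```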

TABLE 2
BMM-MBF Algorithm [29]


TABLE 3
Summary of the Data Sets [1]

The number of classes for each data set is 2. The train, valid and test columns show the total number of instances in the corresponding data sets.

5 EXPERIMENTAL RESULTS

This section first measures the effectiveness of our class-dependent density-based feature elimination algorithm used
as a stand-alone method. Then, we evaluate CDFE as a
preprocessor to the FSS algorithms, described in Section 4, as
a part of our two-stage algorithm for selecting features from
high-dimensional binary data. Experiments are carried out
on three different real-life benchmark data sets using two
different classifiers. The three data sets are NOVA, GINA,
and HIVA, which were collected from the text-mining,
handwriting, and medicine domains, respectively, and were
introduced in the agnostic learning track of the Agnostic
Learning versus Prior Knowledge challenge organized by
the International Joint Conference on Neural Networks in
2007 [1]. The data sets are summarized in Table 3.
Designed for the text classification task, NOVA classifies
emails into two classes: politics and religion. The data are a
sparse binary representation of a vocabulary of 16,969 words and hence consist of 16,969 features. The positive class is
28.5 percent of the total instances. Thus, NOVA is a partially
unbalanced data set.
HIVA is used for predicting the compounds that are
active against the AIDS HIV infection. The data are
represented by 1,617 sparse binary features, and 3.5 percent of the instances belong to the positive class. HIVA is,
thus, an unbalanced data set.
The GINA data set is used for the handwritten digit
recognition task, which consists of separating the two-digit
even numbers from the two-digit odd numbers. With sparse
continuous input variables, it is designed such that only the
unit digit provides the information about the classes. The
GINA features are integers quantized to 256 grayscale
levels. We converted these 256 gray levels into 2 by
substituting 1 for the values greater than 0. This is
equivalent to converting a grayscale image to a binary
image. Data sets with GINA-like feature values can be
binarized with this strategy which does not affect

the sparsity of the data. The positive class is 49.2 percent


of the total instances. In other words, GINA is balanced
between the positive and negative classes.
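The binarization described for GINA amounts to a single thresholding step; a minimal sketch (the array name X_gray is an assumption standing in for the 256-level feature matrix):

```python
import numpy as np

X_gray = np.random.randint(0, 256, size=(5, 10))   # stand-in for GINA's grayscale features
X_bin = (X_gray > 0).astype(np.uint8)              # 1 wherever the original value exceeds 0
```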
The class labels of the test sets of these data sets are not
publicly available but one can make an online submission to
know the prediction accuracy on the test set. In our
experiments, the training and validation sets are combined
to train the Naive Bayes and kernel ridge regression
(kridge) classifiers. The software implementation of both classifiers given in the Challenge Learning Object Package (CLOP) [35] was used. The classification performance is evaluated by the balanced error rate (BER) over fivefold cross validation. BER is the average of the error rates of the positive and negative classes [20]. Given two classes, if $t_n$ denotes the number of negative instances that are correctly labeled by the classifier and $f_p$ refers to the number of negative instances that are incorrectly labeled, then the false positive rate is defined as $fpr = f_p / (t_n + f_p)$. Similarly, we can define the false negative rate as $fnr = f_n / (t_p + f_n)$, where $f_n$ is the number of positive instances that are incorrectly labeled by the classifier and $t_p$ denotes the number of positive instances that are correctly labeled. Thus, BER is given by

$$\mathrm{BER} = 0.5\,(fpr + fnr).$$
For data sets that are unbalanced in cardinality, BER gives a
better picture of the error than the simple error rate [20].
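The BER definition above translates directly into a small sketch (assuming binary labels with the positive class coded as 1):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER = 0.5 * (false positive rate + false negative rate)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_true == 0) & (y_pred == 1)); tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0)); tp = np.sum((y_true == 1) & (y_pred == 1))
    fpr = fp / (tn + fp)   # fraction of negative instances labeled positive
    fnr = fn / (tp + fn)   # fraction of positive instances labeled negative
    return 0.5 * (fpr + fnr)

print(balanced_error_rate([0, 0, 1, 1], [0, 1, 1, 0]))   # 0.5
```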

5.1 Class-Dependent Density-Based Feature Elimination as a Stand-Alone Feature Selection Algorithm
Experiments described in this section test the performance
of CDFE as a stand-alone feature selection algorithm.
Features are scored using the diff-criterion and are then
sorted in descending order according to their weights. Fig. 2
shows the weights (sorted in descending order) assigned to
the features by the max-criterion, diff-criterion, and mutual
information measures using (2), (3), and (4), respectively.
Although each measure assigns a different value to a

Fig. 2. Comparison of weights assigned to the features for NOVA (left), HIVA (middle), and GINA (right).

Fig. 3. Comparison of CDFE against a baseline method of selecting random features without feature ranking for NOVA (left), HIVA (middle), and
GINA (right) using kridge classifier.

feature, we are actually interested in looking at their patterns. The curve of the diff-criterion for the three data sets lies in the middle of the curves of the other two measures, while the curve of the max-criterion lies at the top of the three. It is evident from these patterns that the diff-criterion behaves more similarly to mutual information than the max-criterion does. For NOVA, the $W_{diff}$ values lie in the range [0, 0.231], with most of the features having values close to zero, as shown by its diff-criterion pattern. Thus, most NOVA features have poor discriminating power. The $W_{diff}$ values of HIVA range over [0, 0.272]. Compared to NOVA, a larger fraction of HIVA features have good class separation capability, as seen in Fig. 2. In case of GINA, the $W_{diff}$ values lie within the range [0, 0.471], with a fairly large fraction of features having good discriminating power.
To find the final subset, the space of $M$ features is searched with a filtrapper-like approach [20]. We define nested sets of features, progressively eliminating more and more features of decreasing relevance with the help of a sequence of increasing threshold values on $W_{diff}$. For a given threshold value, we discard a number of features and retain the remaining ones. The usefulness of every feature subset thus generated is tested using the classification accuracy of a classifier.
In Fig. 3, we look at the effectiveness of a feature ranking
method. The CDFE algorithm is compared against a
baseline method that generates nested feature subsets but selects features randomly from the unranked data set. From the plots, we observe that ranking the NOVA,
HIVA, and GINA features significantly improves the
classification accuracy. Besides this, a feature subset of
smaller size attains the BER value that is obtained with the
set containing all the features.

Next, we compare CDFE against three feature selection


algorithms: the mutual information-based ranking method, Koller and Sahami's MBF algorithm, and the BMM-MBF algorithm, using the kridge classifier. The MI-based ranking method assigns weights to features according to the mutual information measure given in (4) and sorts them in order of decreasing weight. The plots are given in Fig. 4, and the results are tabulated in Table 4. Among these algorithms, CDFE is the least expensive. Both MBF and BMM-MBF applied on the entire NOVA feature set become computationally infeasible, as each algorithm involves calculating a correlation matrix of size $M \times M$. For data sets with a large
number of features, such calculations render these algorithms
impractical. Due to this reason, we could not compare the
performance of CDFE against that of the MBF and BMM-MBF
algorithms for NOVA. However, when compared with the
MI-based ranking method, CDFE gives better results as
shown in Fig. 4. CDFE reduces the original dimensionality to
a set of 3,135 features (81.53 percent reduction) having a
classification accuracy as good as attained with the entire
feature set. On the other hand, the MI-based ranking method
selects a subset of 4,950 features (70.83 percent reduction). For
the HIVA data set, CDFE results in higher classification
accuracy as compared to the other three feature selection
algorithms. It generates a subset with about 8.6 percent of the
original features. Here, CDFE's performance is close to that of MBF and outperforms the MI-based ranking method and the
BMM-MBF algorithm. In case of GINA, the classification
accuracy patterns of CDFE, MBF, and MI-based ranking
method are similar. CDFE generates a subset whose dimensionality is 33 percent of the original feature set and comes
third in terms of feature reduction.

Fig. 4. Comparison of the CDFE algorithm against MI-based ranking, MBF and BMM-MBF algorithms for NOVA (left), HIVA (middle) and GINA
(right) using kridge classifier.


TABLE 4
Comparison of Various Feature Selection Algorithms Using Kridge Classifier

F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.

TABLE 5
Stage-1: CDFE as a Preprocessor

F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.
Fig. 5. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using naive Bayes classifier.

5.2 Two-Stage Feature Selection Algorithms


This section measures the performance of the two-stage
algorithm with CDFE used as a preprocessor to an FSS
algorithm (MBF or BMM-MBF) in the second stage. For this
purpose, we compare the performance of the two stages used in unison against that of the second-stage feature selection algorithm alone.
5.2.1 Stage-1: Class-Dependent Density-Based Feature Elimination
When CDFE is used as a preprocessor, we choose the feature subset resulting in the minimum BER value for the next stage. Table 5 summarizes the minimum BER results for the three data sets obtained from Fig. 4. The NOVA plot indicates that a BER of 6.38 percent is obtained when we eliminate features using a threshold of $W_{diff} < 0.00365$. The selected subset contains 29.2 percent of the original set of features. In case of HIVA, we find that optimal classification is obtained when features with $W_{diff}$ less than 0.04041 are discarded. The selected subset consists of 575 features, which is 35.56 percent of the total features. From the GINA plot, we observe that the feature subset obtained by discarding features with $W_{diff}$ less than 0.02399 results in an optimal BER value of 13.65 percent. This subset consists

of 450 features and hence, after the first stage, we are left
with 46.4 percent of the original features.

5.2.2 Results of the Two-Stage Algorithm with the MBF Algorithm in Second Stage

The working of MBF when combined with CDFE and when
used as a stand-alone method is compared here for NOVA,
HIVA, and GINA data sets using naive Bayes and kridge
classifiers. We experimented with MBF using different values
of K and found that it takes several hours on an Intel Core 2
Duo CPU of 1.83 GHz clock speed to terminate for values of K
greater than 15. The best value of K was used for each data set.
The plots are given in Figs. 5 and 6 and results are
summarized in Table 6. The large number of NOVA features
makes MBF computationally infeasible. This issue is addressed by introducing CDFE as a preprocessor to MBF. Besides this, we observe an improvement in MBF's performance, both in terms of classification accuracy and feature selection, when it is used in two stages.
In Fig. 5, the naive Bayes classifier is used to evaluate the performance of the two-stage algorithm for the three data sets. For NOVA, an error rate of 7.88 percent was attained when using all the features. This error rate reduces to 5 percent when the two-stage algorithm is used to select the best 2,816 features. We also find that only 512 (or 3 percent)

Fig. 6. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using kridge classifier.

TABLE 6
Comparison of the Two-Stage (CDFE + MBF) Algorithm against the MBF Algorithm

F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.

Fig. 7. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using naive Bayes classifier.

features selected by the two-stage algorithm result in a


classification accuracy as good as the one achieved by all the
features. From the HIVA plot, a shift in the optimum BER
point of MBF toward the left is evident when it is combined
with CDFE. MBF alone results in an optimum BER of
26.8 percent with 185 features while it attains an optimum
BER of 26.4 percent with 140 features in two stages. The
smallest subset which attains a BER value equal to that
attained with all the HIVA features, consists of 96 features
when MBF is used as a stand-alone method. It consists of
76 features when MBF is combined with CDFE. In case of
GINA, MBF alone performs the classification task with
128 features with an accuracy equal to that obtained with
the entire feature set. The size of this subset is reduced to
32 features when CDFE and MBF are combined in two stages.
Fig. 6 investigates the performance of the two-stage algorithm against that of MBF using the kridge classifier. When applied on NOVA, the smallest subset selected by the two-stage algorithm that attains a BER value equal to that obtained with all the features consists of 1,792 features. For HIVA,
MBF selects a subset of 64 features while CDFE and MBF in
two stages select 58 features to perform the classification task
without any degradation in the accuracy that is obtained with
all the features. The GINA results indicate that 165 features

selected by MBF result in a BER equal to that obtained with the entire feature set. On the other hand, the subset selected by the two-stage algorithm consists of 150 features.

5.2.3 Results of the Two-Stage Algorithm with the BMM-MBF Algorithm in Second Stage
In this section, the performance of the two-stage algorithm is discussed when CDFE is combined with the BMM-MBF algorithm. We experimented with the BMM-MBF algorithm using different values of K and found that, unlike Koller and Sahami's MBF algorithm, it remains computationally efficient even if it has to search for the Markov blanket of a
feature using values of K as large as 40. For each data set, we
use the optimal value of K. Like MBF, we evaluated the
performance of our two-stage algorithm against the classification accuracy of the entire feature set obtained by naive
Bayes and kridge classifiers for the NOVA data set. For
HIVA and GINA, the two-stage algorithm was compared
against the performance of the BMM-MBF algorithm. The
empirical results are shown in Figs. 7 and 8 and are
summarized in Table 7. We find that the performance of
the BMM-MBF algorithm, both in terms of feature reduction
and classification accuracy, is significantly improved with
the addition of the CDFE algorithm as a first stage.


Fig. 8. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using kridge classifier.

Fig. 7 compares the performance of the two-stage algorithm and that of the BMM-MBF algorithm using the naive Bayes classifier. For the NOVA data set, our two-stage algorithm leads to an optimum BER value of 2 percent with 2,048 features, while it selects a subset of 605 features with a classification accuracy as good as that obtained with all the features. The
HIVA plot indicates that classification accuracy of BMM-MBF
is improved with the introduction of the CDFE stage in such a
manner that almost 8 percent of the original features result in
an accuracy obtained with all the features. In case of GINA,
we find that BMM-MBF alone performs the classification task
with 279 features with an accuracy equal to that attained with
all the features. The addition of CDFE to BMM-MBF reduces
this subset to 165 features.
In Fig. 8, results of the kridge classifier when applied on the
three data sets are shown. Dimensionality of the NOVA
subset selected from the first stage is reduced further to 780 by
BMM-MBF without compromising the classification accuracy that is obtained with all the features. From the HIVA plot,
we find that the smallest subset selected by BMM-MBF to
perform the classification task with a BER value equal to that
attained by all the features, consists of 817 features. The size of
this subset is reduced to 140 when CDFE and BMM-MBF are
combined in two stages. When the experiment was run on the
GINA data set, BMM-MBF selected 550 features while the
two-stage algorithm selected 279 features.

5.3 Comparison of CDFE Performance against the Top 3 Winning Entries of the Agnostic Learning Track [1]
The organizers of the agnostic learning track of the Agnostic Learning versus Prior Knowledge challenge evaluated all the entrants on the basis of the BER on the test sets. We tested CDFE in both capacities, as a stand-alone method and as a part of the two-stage algorithm (i.e., a preprocessor to MBF or BMM-MBF), with the kridge classifier and the classification method given in [36] for NOVA, HIVA, and GINA. Table 8 gives a comparison of CDFE's performance against the top 3 winning entries of the agnostic learning track of the challenge.
In case of NOVA, both of our methods, the stand-alone CDFE algorithm and the two-stage algorithm, outperform the top 3 results. We also find that CDFE performs better in two stages (CDFE + MBF) than in the stand-alone case. For the HIVA data set, we observe that the BER value obtained by CDFE with the kridge classifier outperforms the top 3 results. When combined with MBF in two stages, CDFE results in a performance that is comparable to the three winning BER results. Results obtained on GINA indicate that the two-stage (CDFE + MBF) algorithm beats the second and third winning entries. As a stand-alone method, CDFE obtains the third position in the ranking of the top 3 entries of the challenge.
Feature selection algorithms may behave differently on
data sets from different application domains. The main
factors that affect the performance include the number of
features and samples and the balance of the classes of the
training data [20]. NOVA, HIVA, and GINA belong to
different application domains. The ratio of features to
samples is 0.103, 2.378, and 3.251 and the positive class is
28.5, 3.5, and 49.2 percent of the total samples, respectively.
Results in Table 8 indicate that CDFE, which is currently limited to the domain of two-class classification with binary-valued features, performs consistently better than the other feature selection algorithms used in the challenge.

6 CONCLUSIONS

This paper is devoted to feature selection in high-dimensional binary data sets. We proposed a ranking criterion, called the diff-criterion, to estimate the relevance of features using their density values over the classes.
TABLE 7
Comparison of the Two-Stage (CDFE + BMM-MBF) Algorithm against the BMM-MBF Algorithm

TABLE 8
Comparison of CDFE Performance against Top 3 Winning Entries of the Agnostic Learning Track [1]

BMM is Bernoulli mixture model, PCA is principal component analysis, PSO is particle swarm optimization, and SVM is support vector machine.

We showed that it is equivalent to the mutual information measure but computationally more efficient. Based on the diff-criterion, we proposed a supervised feature selection algorithm, termed class-dependent density-based feature elimination, to select a subset of useful binary features. CDFE uses a
classifier instead of a user-provided threshold value to select
the final subset. Our experiments on three real-life data sets
demonstrate that CDFE, in spite of its simplicity and
computational efficiency, either outperforms other well-known feature selection algorithms or is comparable to them in terms of classification and feature selection performance. We also found that CDFE can be effectively used as a preprocessing step for other feature selection algorithms to determine compact subsets of features without compromising the accuracy on a classification task. It thus provides them with a substantially smaller feature subset
having better class separability. Feature selection algorithms, such as MBF and BMM-MBF, involving square
matrices of size equal to the number of features, become
computationally intractable for high-dimensional data sets.
It was shown empirically that CDFE adequately relieves
them from this problem and significantly improves their
classification and feature selection performance.
Furthermore, we analyzed CDFE's performance by comparing it against the winning entries of the agnostic learning track of the Agnostic Learning versus Prior Knowledge challenge. Results indicate that CDFE outperforms
the best entries obtained on NOVA and HIVA data sets and
attains the third position on the GINA data set.

ACKNOWLEDGMENTS
Kashif Javed was supported by a doctoral fellowship at the University of Engineering and Technology, Lahore. The authors would like to thank the anonymous reviewers for their helpful comments.

REFERENCES
[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, "Agnostic Learning vs. Prior Knowledge Challenge," Proc. Int'l Joint Conf. Neural Networks (IJCNN), http://www.agnostic.inf.ethz.ch, 2007.
[2] Feature Selection Challenge by the Neural Information Processing Systems Conference (NIPS), http://www.nipsfsc.ecs.soton.ac.uk, 2003.
[3] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley, 2001.
[4] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[5] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," J. Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[6] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.
[7] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[8] L. Jimenez and D. Landgrebe, "Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Rev., vol. 28, no. 1, pp. 39-54, Feb. 1998.
[9] D. Scott and J. Thompson, "Probability Density Estimation in Higher Dimensions," Proc. 15th Symp. Interface, pp. 173-179, 1983.
[10] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton Univ. Press, 1961.
[11] S. Raudys and A. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, Mar. 1991.
[12] K. Kira and L.A. Rendell, "A Practical Approach to Feature Selection," Proc. Ninth Int'l Conf. Machine Learning, pp. 249-256, 1992.
[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[14] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[15] L.K. Saul and S.T. Roweis, "Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds," J. Machine Learning Research, vol. 4, pp. 119-155, 2003.
[16] L. van der Maaten, E. Postma, and H. van den Herik, "Dimensionality Reduction: A Comparative Review," Technical Report TiCC-TR 2009-005, Tilburg Univ., 2009.
[17] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, pp. 273-324, Dec. 1997.
[18] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.

[19] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[20] I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh, Feature Extraction: Foundations and Applications. Springer, 2006.
[21] M. Hall, "Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning," Proc. 17th Int'l Conf. Machine Learning, 2000.
[22] R. Ruiz and J.S. Aguilar-Ruiz, "Analysis of Feature Rankings for Classification," Proc. Int'l Symp. Intelligent Data Analysis (IDA), pp. 362-372, 2005.
[23] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, 2003.
[24] F. Fleuret, "Fast Binary Feature Selection with Conditional Mutual Information," J. Machine Learning Research, vol. 5, pp. 1531-1555, 2004.
[25] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[26] G. Qu, S. Hariri, and M. Yousaf, "A New Dependency and Correlation Analysis for Features," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1199-1207, Sept. 2005.
[27] J. Pearl, Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[28] A. Freno, "Selecting Features by Learning Markov Blankets," Proc. 11th Int'l Conf. KES 2007 and XVII Italian Workshop on Neural Networks, Knowledge-Based Intelligent Information and Eng. Systems: Part I (KES/WIRN), pp. 69-76, 2007.
[29] M. Saeed, "Bernoulli Mixture Models for Markov Blanket Filtering and Classification," J. Machine Learning Research, vol. 3, pp. 77-91, 2008.
[30] A. Juan and E. Vidal, "On the Use of Bernoulli Mixture Models for Text Classification," Pattern Recognition, vol. 35, pp. 2705-2710, 2002.
[31] A. Juan and E. Vidal, "Bernoulli Mixture Models for Binary Images," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), 2004.
[32] Annual KDD Cup 2001, http://www.sigkdd.org/kddcup/, 2001.
[33] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Databases (VLDB '94), 1994.
[34] J. Wilbur, J. Ghosh, C. Nakatsu, S. Brouder, and R. Doerge, "Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints," Biometrics, vol. 58, pp. 378-386, 2002.
[35] I. Guyon et al., CLOP, http://ymer.org/research/files/clop/clop.zip, 2011.
[36] M. Saeed, "Hybrid Learning Using Mixture Models and Artificial Neural Networks," Hands-on Pattern Recognition: Challenges in Data Representation, Model Selection, and Performance Prediction, http://www.clopinet.com/ChallengeBook.html, Microtome, 2008.
[37] M. Saeed and H. Babri, "Classifiers Based on Bernoulli Mixture Models for Text Mining and Handwriting Recognition," Proc. IEEE Int'l Joint Conf. Neural Networks, 2008.
[38] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[39] L. Jimenez and D.A. Landgrebe, "Projection Pursuit in High Dimensional Data Reduction: Initial Conditions, Feature Selection and the Assumption of Normality," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, 1995.
[40] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[41] R.W. Lutz, "Doubleboost," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[42] V. Nikulin, "Classification with Random Sets, Boosting and Distance-Based Clustering," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[43] V. Franc, "Modified Multi-Class SVM Formulation; Efficient LOO Computation," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[44] H.J. Escalante, "Particle Swarm Optimization for Neural Networks," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[45] J. Reunanen, "Cross-Indexing," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[46] I.C. ASML team, "Feature Selection with Redundancy Elimination Gradient Boosted Trees," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
Kashif Javed received the BSc and MSc
degrees in electrical engineering in 1999 and
2004, respectively, from the University of
Engineering and Technology (UET), Lahore,
Pakistan, where he is currently working toward
the PhD degree. He joined the Department of
Electrical Engineering at UET in 1999, where he
is currently an assistant professor. His research
interests include machine learning, pattern
recognition, and ad hoc network security.

Haroon A. Babri received the BSc degree in


electrical engineering from the University of
Engineering and Technology (UET), Lahore,
Pakistan, in 1981, and the MS and PhD degrees
in electrical engineering from the University of
Pennsylvania in 1991 and 1992, respectively. He
was with the Nanyang Technological University,
Singapore, from 1992 to 1998, with the Kuwait
University from 1998 to 2000, and with the
Lahore University of Management Sciences
(LUMS) from 2000 to 2004. He is currently a professor of electrical
engineering at UET. He has written two book chapters and has more
than 60 publications in machine learning, pattern recognition, neural
networks, and software reverse engineering.
Mehreen Saeed received the doctorate degree
from the Department of Engineering Mathematics, University of Bristol, United Kingdom,
in 1999. She is currently working as an assistant
professor in the Department of Computer
Science, FAST National University of Computer
and Emerging Sciences, Lahore Campus, Pakistan. Her main areas of interest include artificial
intelligence, machine learning and statistical
pattern recognition.

