Académique Documents
Professionnel Documents
Culture Documents
REVIEW
Gene expression
Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium and Bioinformatics and Evolutionary Genomics
group, Department of Molecular Genetics, Ghent University, B-9052 Ghent, Belgium and 2Department of Computer
Science and Artificial Intelligence, Computer Science Faculty, University of the Basque Country, Paseo Manuel de
Lardizabal 1, 20018 Donostia - San Sebastian, Spain
Received on April 17, 2007; revised on June 11, 2007; accepted on June 25, 2007
Advance Access publication August 24, 2007
Associate Editor: Jonathan Wren
ABSTRACT
Feature selection techniques have become an apparent need in
many bioinformatics applications. In addition to the large pool of
techniques that have already been developed in the machine
learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques.
In this article, we make the interested reader aware of the
possibilities of feature selection, providing a basic taxonomy of
feature selection techniques, and discussing their use, variety and
potential in a number of both common as well as upcoming
bioinformatics applications.
Contact: yvan.saeys@psb.ugent.be
Supplementary information: http://bioinformatics.psb.ugent.be/
supplementary_data/yvsae/fsreview
2
1
INTRODUCTION
FEATURE
SELECTION TECHNIQUES
The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
2507
Y.Saeys et al.
Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice
for a technique suited to the goals and resources of practitioners in the field
Model search
Advantages
Disadvantages
Examples
Filter
Univariate
Ignores feature dependencies
Ignores interaction with
the classifier
2
Euclidean distance
i-test
Information gain,
Gain ratio (Ben-Bassat, 1982)
Correlation-based feature
selection (CFS) (Hall, 1999)
Markov blanket filter (MBF)
(Koller and Sahami, 1996)
Fast correlation-based
feature selection (FCBF)
(Yu and Liu, 2004)
Computationally intensive
Classifier dependent selection
Higher risk of overfitting
than deterministic algorithms
Simulated annealing
Randomized hill climbing
(Skalak, 1994)
Genetic algorithms
(Holland, 1975)
Estimation of distribution
algorithms (Inza et al., 2000)
Decision trees
Weighted naive Bayes
(Duda et al., 2001)
Feature selection using
the weight vector of SVM
(Guyon et al., 2002;
Weston et al., 2003)
Fast
Scalable
Independent of the classifier
FS space
Classifier
Multivariate
Models feature dependencies
Independent of the classifier
Better computational complexity
than wrapper methods
Wrapper
Deterministic
Simple
Interacts with the classifier
Models feature dependencies
Less computationally
intensive than randomized methods
FS space
Hypothesis space
Classifier
Embedded
FS U hypothesis space
Classifier
Randomized
2508
3
3.1
APPLICATIONS IN BIOINFORMATICS
Feature selection for sequence analysis
Sequence analysis has a long-standing tradition in bioinformatics. In the context of feature selection, two types of problems
can be distinguished: content and signal analysis. Content
analysis focuses on the broad characteristics of a sequence, such
as tendency to code for proteins or fulfillment of a certain
biological function. Signal analysis on the other hand focuses
on the identification of important motifs in the sequence, such
as gene structural elements or regulatory elements.
Apart from the basic features that just represent the
nucleotide or amino acid at each position in a sequence,
many other features, such as higher order combinations of
these building blocks (e.g. k-mer patterns) can be derived, their
2509
Y.Saeys et al.
Table 2. Key references for each type of feature selection technique in the microarray domain
Filter methods
Univariate
Multivariate
Wrapper methods
Embedded methods
Random forest
(D az-Uriarte and
Alvarez de Andres, 2006;
Jiang et al., 2004)
Weight vector of
SVM (Guyon et al., 2002)
Parametric
Model-free
Bivariate
(B and Jonassen, 2002)
Sequential search
(Inza et al., 2004;
Xiong et al., 2001)
BSS/WSS
(Dudoit et al., 2002)
Genetic algorithms
(Jirapech-Umpai
and Aitken, 2005;
Li et al., 2001;
Ooi and Tan, 2003)
Estimation of distribu- Weights of logistic
tion algorithms
regression (Ma and
(Blanco et al., 2004)
Huang, 2005)
TNoM
(Ben-Dor et al., 2000)
3.2
2510
2511
Y.Saeys et al.
Table 3. . Key references for each type of feature selection technique in the domain of mass spectrometry
Univariate
Filter
Multivariate
Parametric
Model-free
Peak Probability
Contrast (Tibshirani et al., 2004)
Kolmogorov-Smirnov test (Yu et al., 2005)
Wrapper
Embedded
3.3
2512
4.1
4.2
5
5.1
2513
Y.Saeys et al.
WEKA
Fast Correlation Based Filter
Feature Selection Book
MLC
Spider
SVM and Kernel Methods
Matlab Toolbox
Java
Java
Ansi C
C
Matlab
Matlab
http://www.kyb.tuebingen.mpg.de/bs/people/spider
Canu et al. (2003)
http://asi.insa-rouen.fr/arakotom/toolbox/index
SAM
GALGO
PCP
GA-KNN
Rankgene
EDGE
GEPAS-Prophet
DEDS (Bioconductor)
RankProd (Bioconductor)
Limma (Bioconductor)
Multtest (Bioconductor)
Nudge (Bioconductor)
Qvalue (Bioconductor)
twilight (Bioconductor)
ComparativeMarkerSelection
(GenePattern)
R, Excel
R
C, C
C
C
R
Perl, C
R
R
R
R
R
R
R
JAVA, R
GA-KNN
R-SVM
CHOISS
MLR-tagging
WCLUSTAG
C, Perl
C
JAVA
2514
5.2
FS SOFTWARE PACKAGES
ACKNOWLEDGEMENTS
We would like to thank the anonymous reviewers for their
constructive comments, which significantly improved the
quality of this review. This work was supported by BOF
grant 01P10306 from Ghent University to Y.S., and the
SAIOTEK and ETORTEK programs of the Basque
Government and project TIN2005-03824 of the Spanish
Ministry of Education and Science to I.I. and P.L.
Conflict of Interest: none declared.
REFERENCES
Alon,U. et al. (1999) Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proc. Nat. Acad. Sci. USA, 96, 67456750.
Al-Shahib,A., et al. (2005) Feature selection and the class imbalance problem in
predicting protein function from sequence. Appl. Bioinformatics, 4, 195203.
Ambroise,C. and McLachlan,G. (2002) Selection bias in gene extraction on the
basis of microarray gene-expression data. Proc. Nat. Acad. Sci. USA, 99,
65626566.
Baldi,P. and Long,A. (2001) A Bayesian framework for the analysis of
microarray expression data: regularized t-test and statistical inferences of
gene changes. Bioinformatics, 17, 509516.
Ball,G., et al. (2002) An integrated approach utilizing artificial neural networks
and SELDI mass spectrometry for the classification of human tumours and
rapid identification of potential biomarkers. Bioinformatics, 18, 395404.
Ben-Bassat,M. (1982) Pattern recognition and reduction of dimensionality.
In Krishnaiah,P. and Kanal,L., (eds.) Handbook of Statistics II, Vol. 1.
North-Holland, Amsterdam. pp. 773791.
Ben-Dor,A., et al. (2000) Tissue classification with gene expression profiles.
J. Comput. Biol., 7, 559584.
Bhanot,G., et al. (2006) A robust meta classification strategy for cancer detection
from MS data. Proteomics, 6, 592604.
Blanco,R., et al. (2004) Gene selection for cancer classification using wrapper
approaches. Int. J. Pattern Recognit. Artif. Intell., 18, 13731390.
B,T. and Jonassen,I. (2002) New feature subset selection procedures
for
classification
of
expression
profiles.
Genome
Biol.,
3,
research0017.1research0017.11.
Braga-Neto,U. and Dougherty,E. (2004) Is cross-validation valid for
small-sample microarray classification? Bioinformatics, 20, 374380.
Breitling,R., et al. (2004) Rank products: a simple, yet powerful, new method to
detect differentially regulated genes in replicated microarray experiments.
FEBS Lett., 573, 8392.
Buturovic,L. (2005) PCP: a program for supervised classification of gene
expression profiles. Bioinformatics, 22, 245247.
Canu,S., et al. ( 2003) SVM and Kernel Methods Matlab Toolbox. In Perception
Syste`mes et Information. INSA de Rouen, Rouen, France.
Carlson,C., et al. (2004) Selecting a maximally informative set of single-nucleotide
polymorphisms for association analyses using linkage disequilibrium. Am. J.
Hum. Genet., 74, 106120.
Chuzhanova,N., et al. (1998) Feature selection for genetic sequence classification.
Bioinformatics, 14, 139143.
Cohen,A. and Hersch,W. (2005) A survey of current work in biomedical text
mining. Brief. Bioinformatics, 6, 5771.
Conilione,P. and Wang,D. (2005) A comparative study on feature selection for
E.coli promoter recognition. Int. J. Inf. Technol., 11, 5466.
Coombes,K., et al. (2007) Pre-processing mass spectrometry data.
In Dubitzky,M., et al. (eds.), Fundamentals of Data Mining in Genomics and
Proteomics. Kluwer, Boston, pp. 7999.
Daelemans,W., et al. (2003) Combined optimization of feature selection
and algorithm parameter interaction in machine learning of language.
2515
Y.Saeys et al.
2516
Smyth,G. (2004) Linear models and empirical Bayes methods for assessing
differential expression in microarray experiments. Stat. Appl. in Genet. and
Mol. Biol., 3, Article 3.
Somorjai,R., et al. (2003) Class prediction and discovery using gene microarray
and proteomics mass spectroscopy data: curses, caveats, cautions.
Bioinformatics, 19, 14841491.
Statnikov,A., et al. (2005) A comprhensive evaluation of multicategory
classification methods for microarray gene expression cancer diagnosis.
Bioinformatics, 21, 631643.
Storey,J. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B,
64, 479498.
Su,Y., et al. (2003) RankGene: identification of diagnostic genes based on
expression data. Bioinformatics, 19, 15871579.
Tadesse,M., et al. (2004) Identification of DNA regulatory motifs using Bayesian
variable selection. Bioinformatics, 20, 25532561.
Thomas,J., et al. (2001) An efficient and robust statistical modeling approach to
discover differentially expressed genes using genomic expression profiles.
Genome Res., 11, 12271236.
Tibshirani,R., et al. (2004) Sample classification from protein mass spectrometry,
by peak probability contrast. Bioinformatics, 20, 30343044.
Trevino,V. and Falciani,F. (2006) GALGO: an R package for multivariate
variable selection using genetic algorithms. Bioinformatics, 22, 11541156.
Troyanskaya,O., et al. (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18, 14541461.
Tusher,V., et al. (2001) Significance analysis of microarrays applied to ionizing
radiation response. In Proceedings of the National Academy of Sciences,
Vol. 98, pp. 51165121.
Varshavsky,R., et al. (2006) Novel unsupervised feature filtering of biological
data. Bioinformatics, 22, e507e513.
Wang,Y., et al. (2005) Gene selection from microarray data for cancer
classificationa machine learning approach. Comput. Biol. Chem., 29, 3746.
Wang,Y., et al. (2006) Tumor classification based on DNA copy number
aberrations determined using SNPS arrays. Oncol. Rep., 5, 10571059.
Weston,J., et al. (2003) Use of the zero-norm with linear models and kernel
methods. J. Mach. Learn. Res., 3, 14391461.
Witten,I. and Frank,E. (2005) Data Mining: Practical Machine Learning Tools
and Techniques, 2nd Ed. Morgan Kaufmann, San Francisco.
Wu,B., et al. (2003) Comparison of statistical methods for classification of
ovarian cancer using mass spectrometry data. Bioinformatics, 19, 16361643.
Xing,E.P. et al. (2001) Feature selection for high-dimensional genomic microarray data. In Proceedings of the Eighteenth International Conference on
Machine Learning, pp. 601608.
Xiong,M., et al. (2001) Biomarker identification by feature wrappers.
Genome Res., 11, 18781887.
Yang,Y., et al. (2005) Identifying differentially expressed genes from microarray
experiments via statistic synthesis. Bioinformatics, 21, 10841093.
Yeoh,E., et al. (2002) Classification, subtype discovery, and prediction of
outcome in pediatric lymphoblastic leukemia by gene expression profiling.
Cancer Cell, 1, 133143.
Yeung,K. and Bumgarner,R. (2003) Multiclass classification of microarray data
with repeated measurements: application to cancer. Genome Biol., 4, R83.
Yeung,K., et al. (2005) Bayesian model averaging: development of an improved
multi-class, gene selection and classification tool for microarray data.
Bioinformatics, 21, 23942402.
Yu,J. and Chen,X. (2005) Bayesian neural network approaches to ovarian cancer
identification from high-resolution mass spectrometry data. Bioinformatics,
21, i487i494.
Yu,J., et al. (2005) Ovarian cancer identification based on dimensionality reduction
for high-throughput mass spectrometry data. Bioinformatics, 21, 22002209.
Yu,L. and Liu,H. (2004) Efficient feature selection via analysis of relevance and
redundancy. J. Mach. Learn. Res., 5, 12051224.
Zavaljevsky,N., et al. (2002) Support vector machines with selective kernel scaling
for protein classification and identification of key amino acid positions.
Bioinformatics, 18, 689696.
Zhang,X., et al. (2006) Recursive SVM feature selection and sample classification
for mass-spectrometry and microarray data. BMC Bioinformatics, 7, 197.
2517