
INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY

VOLUME 3 ISSUE 1 JANUARY 2015 - ISSN: 2349 - 9303

A Classification of Cancer Diagnostics Based on Microarray Gene Expression Profiling
V.S. Gokkul¹
Department of Computer Science & Engineering, Bannari Amman Institute of Technology, Anna University
gokkul.vs@gmail.com

Dr. J. Vijay Franklin²
Department of Computer Science & Engineering, Bannari Amman Institute of Technology, Anna University
vijayfranklinj@bitsathy.ac.in
Abstract: Pattern Recognition (PR) plays an important role in the field of Bioinformatics. PR is concerned with processing raw measurement data by a computer to arrive at a prediction that can be used to formulate a decision. The problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment and manage high-dimensional microarray gene data for classification. PR is concerned with the development of systems that learn to solve a given problem from a set of instances, each instance represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously. The large amount of data generated by microarrays has stimulated the development of various computational methods that relate gene expression profiles to different biological processes. Microarray Gene Expression Profiling (MGEP) is important in Bioinformatics: it yields high-dimensional data used in clinical applications such as cancer diagnostics and drug design. In this work a new scheme is developed for the classification of unknown malignant tumors into known classes. The proposed classification scheme includes the transformation of the very high-dimensional microarray data into Mahalanobis space before classification. The suitability of the proposed classification scheme is demonstrated on 10 commonly available cancer gene datasets, comprising both binary and multiclass datasets. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
Index Terms: Pattern Recognition, Microarray Gene Expression Profiling, Mahalanobis, Classifier, Gene Selection

1 INTRODUCTION

The term cancer does not refer to one disease, but rather to many diseases that can occur in various regions of the body. Every type of cancer is characterized by uncontrolled cell growth. Cancer is the third most common disease and the second leading cause of death in the world, so the detection of cancer is a research topic of significant importance. Gene array techniques have been shown to provide insight into cancer study, and molecular profiling based on gene expression array technology holds the promise of precise cancer detection and classification. One of the most important problems in the treatment of cancer is the early detection of the disease. If the cancer is detected only in later stages, it has often compromised the function of one or more organ systems and spread throughout the body. Methods for the early detection of cancer are therefore of utmost importance and are an active area of current research. An important step in the diagnosis of cancer is the classification of unknown malignant tumors into known classes.


After the initial detection of a cancerous growth, valid diagnosis and staging of the disease are essential for the design of a treatment plan.
Microarray (gene chip) technology is a powerful tool for genomic analysis. It gives a global view of the genome in a single experiment. Data analysis is a vital part of a microarray experiment. Each microarray study comprises multiple microarrays, each giving tens of thousands of data points, so the volume of data grows exponentially; as microarrays grow larger, the analysis becomes highly challenging.
Image processing techniques are applied while scanning the chip. Image processing is any form of signal processing for which the input is an image, such as a photograph or video frame; the output may be an image or a set of characteristics or parameters related to the input image. After applying image processing techniques, the image scanned from the microarray gene chip is transformed into a data matrix.

This data matrix is then pre-processed and analyzed to extract the useful information required for the study.

Figure 1: Pattern Recognition solution for microarray data


The preprocessing of gene expression data prior to supervised or unsupervised classification may be needed for several reasons. Owing to technological limitations of microarrays, the expression values of some genes cannot be measured accurately, and missing data therefore arise. Classification methods generally do not have the capability to handle missing data; one exception is the classification tree.
Feature selection is another important issue in cancer classification. By removing features that are irrelevant to the cancers, the accuracy of prediction can usually be improved. With a subset of very few genes, the contribution of each gene becomes more prominent, and even the interactions among these genes can be revealed using other techniques such as genetic networks. We used a method called Recursive Feature Elimination (RFE) to iteratively select a series of nested gene subsets and found that some of the subsets could achieve nearly perfect classification with far fewer genes than the original set of genes, as illustrated in the sketch below.
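As an illustration of nested gene-subset selection, the sketch below uses recursive feature elimination with a linear SVM as the ranking model; the scikit-learn implementation, the synthetic data shapes and the subset size are assumptions for illustration, not the exact procedure used in this work.

```python
# Hedged sketch: recursive feature elimination (RFE) on microarray-style data.
# The linear-SVM ranking model and all shapes are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # 60 samples, 2000 gene expression values
y = rng.integers(0, 2, size=60)   # binary tumor labels

# Rank genes with a linear SVM, dropping 25% of the remaining genes per round,
# until a nested subset of 50 genes remains.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=50, step=0.25)
selector.fit(X, y)

selected_genes = np.where(selector.support_)[0]
print("genes retained:", selected_genes[:10], "...")
```

Smaller nested subsets can be obtained by re-running RFE with a lower n_features_to_select, mirroring the series of subsets described above.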
For the classification model, three well-known classifiers are considered: the decision tree learner C4.5, the simple Bayesian classifier naive Bayes, and the Support Vector Machine (SVM). The first two algorithms are used together with their corresponding feature selection methods; a brief comparison sketch follows.
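The sketch below shows how such classifiers could be compared on a gene-selected matrix; using scikit-learn's DecisionTreeClassifier as a stand-in for C4.5, together with GaussianNB and a linear SVC, is an assumption for illustration rather than the exact setup of this study.

```python
# Hedged sketch: comparing a decision tree (C4.5-like stand-in), naive Bayes
# and an SVM by cross-validation on synthetic gene-selected data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 100))    # 72 samples, 100 selected genes (illustrative)
y = rng.integers(0, 2, size=72)   # binary class labels

classifiers = {
    "decision tree (C4.5-like)": DecisionTreeClassifier(criterion="entropy"),
    "naive Bayes": GaussianNB(),
    "linear SVM": SVC(kernel="linear"),
}

for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean cross-validation accuracy {acc:.2f}")
```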

2 RELATED WORKS
Wang et al. (2005) demonstrate various gene selection methods for improving classification performance. Because microarray data contain a small number of samples and a large number of genes/features, it is very challenging to choose the features relevant to the various types of cancer. In the study, several Feature Selection (FS) algorithms, namely wrappers, filters and Correlation-based Feature Selection (CFS), are statistically analyzed to obtain useful information about the genes and to reduce the dimensionality.

Wrappers choose the best genes for building the classifier, while filters rank the genes for the specific problem. CFS can choose genes that are highly correlated with the cancers but uncorrelated with each other. The datasets used in that paper are the acute leukemia and lymphoma microarrays. The results show that the above-mentioned FS approaches have a similar impact on classification; for fast analysis of the data the filters and CFS are suggested, whereas for selecting very few genes and validating the results the wrapper approaches are preferable.
F. Chu and L. Wang (2005) used four dimension reduction methods for gene selection: principal component analysis (PCA), a class separability measure, the Fisher ratio and the t-test. An SVM is used for classification. Publicly available microarray gene expression datasets, i.e., SRBCT, lymphoma and leukemia, are used in the study. The accuracy is good and comparable to previously published results, but it is achieved with a smaller number of features.
Huilin Xiong and Xue-Wen Chen (2005) presented a new approach for classification based on optimizing the kernel function for classification performance. A more flexible kernel model is used to increase the class separability of the data. The datasets used in the study are the ALL-AML Leukemia, Breast-ER, Breast-LN, Colon, Lung and Prostate microarray data. For performance analysis, K-nearest-neighbor (KNN) and support vector machine (SVM) classifiers were used and compared with the optimized kernel model; the optimized kernel provides better accuracy.
L. Shen and E. C. Tan (2005) presented Penalized Logistic Regression (PLR) for the classification of cancer. PLR combined with a dimension reduction technique is used to improve classification performance. Support vector machines (SVM) and least squares regression are used for classification, and Recursive Feature Elimination (RFE) is used for iterative gene selection; RFE tries to find the gene subset most related to the cancers. Seven publicly available datasets, i.e., breast cancer, central nervous system (CNS), colon, acute leukemia, lung, ovarian and prostate microarray datasets, are used. A linear SVM is used for comparison with the regression methods. Performance can be improved by the combination of penalized logistic regression and PLS.
For cancer classification, Feng Chu and Lipo Wang (2006) proposed a novel radial basis function (RBF) neural network that uses very few genes. They applied their technique to publicly available datasets, including the lymphoma dataset and the small round blue cell tumors (SRBCT) dataset. A t-test scoring method is used for gene ranking to measure the discriminative ability of genes. Compared with the earlier nearest shrunken centroids approach, the RBF neural network uses fewer genes and also reduces gene redundancy in the classification of tumors from microarray data.

Xiyi Hang (2008) presented a new approach for cancer classification using microarray data, called Sparse Representation. Two datasets, Tumors and Brain Tumor, are used to check the classifier's performance. The experimental results of the sparse representation of gene expression data are compared with SVMs. The proposed sparse representation method is implemented in MATLAB R14, and the SVM results are obtained with GEMS (Gene Expression Model Selector), a tool with a graphical user interface for classification available at http://www.gemssystem.org/. The accuracies are compared with the SVM results, and no differences are found apart from a slight improvement; hence the classification performance of sparse representation is quite similar to that of SVM.

Osareh et al. (2010) developed an automated system for consistent cancer analysis based on gene microarray expression data. The system contains a number of well-known classifiers such as K-nearest neighbors, naive Bayes, neural networks, decision trees and support vector machines. The microarray datasets used contain both binary and multiclass problems. The best experimental results are obtained using the support vector machine classifier.

M. Rangasamy and S. Venketraman (2009) presented a new classification scheme based on statistical methods, called Efficient Statistical Model Based Classification. It uses very few genes from the microarray gene data for classification. A classical statistical technique is used for gene ranking, and two classifiers are used for prediction. The methodology was applied to three publicly available cancer datasets; the results were compared with earlier approaches and showed better prediction strength at lower computational cost.
A. Bharathi and A. M. Natarajan (2010) present a gene selection method based on Analysis of Variance (ANOVA), which is used to find the minimum number of genes from a microarray gene expression dataset. A support vector machine is used for classification, and the effectiveness of the classifier is checked on the lymphoma dataset. A two-step procedure is followed to evaluate the classifier: in the first step, important genes are selected using the ANOVA method and the top-ranked genes with the highest scores are retained; in the second step, gene combinations are formed and the classification strength of all gene combinations is checked using support vector machines, as sketched below.
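A rough illustration of such a two-step procedure is given below; it uses scikit-learn's ANOVA F-score (f_classif) for ranking and evaluates gene pairs with a linear SVM, which is an assumed simplification of the procedure summarized above rather than Bharathi and Natarajan's exact method.

```python
# Hedged sketch: ANOVA-based gene ranking followed by SVM evaluation of gene
# combinations (pairs only, for brevity). All details are illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 1000))   # 80 samples, 1000 genes
y = rng.integers(0, 3, size=80)   # multiclass tumor labels

# Step 1: rank genes by their ANOVA F-score and keep the top 20.
ranker = SelectKBest(score_func=f_classif, k=20).fit(X, y)
top_genes = np.argsort(ranker.scores_)[::-1][:20]

# Step 2: test gene combinations with an SVM and keep the best one.
best_acc, best_pair = 0.0, None
for i in range(len(top_genes)):
    for j in range(i + 1, len(top_genes)):
        pair = [top_genes[i], top_genes[j]]
        acc = cross_val_score(SVC(kernel="linear"), X[:, pair], y, cv=3).mean()
        if acc > best_acc:
            best_acc, best_pair = acc, pair
print("best gene pair:", best_pair, "accuracy:", round(best_acc, 2))
```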
For early analysis of cancer patients, Zainuddin and Pauline (2009) worked on the wavelet neural network (WNN) and enhanced the WNN using clustering algorithms. Different clustering algorithms are used, such as K-means (KM), Fuzzy C-means (FCM), symmetry-based K-means (SBKM), symmetry-based Fuzzy C-means (SBFCM) and modified point symmetry-based K-means (MPKM). The study uses four datasets: LEU, SRBCT, GLO and CNS. The performance of the proposed classifiers is compared with other classifiers; the accuracy of the proposed classifiers ranges from 86% to 100%.

3 PROPOSED SYSTEM
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The datasets used in this work have a large number of features m and a small number of samples n, so computing the covariance of such data directly is very computationally intensive. To overcome this issue, the covariance is handled through the idea of eigenfaces: the eigenvectors of the covariance matrix are computed by projecting the data into a feature space that spans the variation among the known classes, which resolves the memory problem caused by the large number of features. The projection of the data into feature space makes some mathematical assumptions.
The eigenvectors of the original covariance matrix are used to transform the data matrix to a low-dimensional space. In the case of large m and small n, the calculations are thus reduced from the number of features to the number of samples. PCA is one of the most popular dimensionality reduction techniques and is used for datasets that contain redundant information. In PCA no class information is used, whereas in the proposed scheme class information is available and is taken into consideration for the projection of the data. It is therefore not exactly PCA but a variation of it: although it reduces the dimension of the data, it preserves the class-relevant information. The term eigenfaces was originally used because the technique dealt with face images for recognition purposes; in this study the transformed genes are used for classification, so the term eigengenes is used instead of eigenfaces. It is difficult to compute the inverse of the covariance matrix because it is not a diagonal matrix. The Karhunen-Loeve (KL) theorem is used to make the data covariance approximately diagonal; the KL-transformation is mostly used for highly correlated datasets to make them uncorrelated. In the proposed scheme, the data are KL-transformed by multiplying each sample with the transformation matrix. The computations are greatly reduced because there is little correlation between the transformed features.
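A minimal sketch of this computation follows; it assumes the data are held in an n x m NumPy array and follows the standard eigenfaces-style derivation, so the exact normalization and component-selection rules are assumptions rather than the authors' precise implementation.

```python
# Hedged sketch: eigengenes when features (m) far outnumber samples (n).
# Eigendecompose the small n x n matrix A A^T and map its eigenvectors back
# to feature space, then KL-transform the samples by projection.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))            # n = 40 samples, m = 5000 genes

A = X - X.mean(axis=0)                     # center each gene
S_small = A @ A.T / (A.shape[0] - 1)       # n x n surrogate covariance
eigvals, V = np.linalg.eigh(S_small)       # cheap: n x n instead of m x m

# Keep components with non-negligible variance, largest first.
order = np.argsort(eigvals)[::-1]
keep = order[eigvals[order] > 1e-10]
eigvals, V = eigvals[keep], V[:, keep]

# Map back to feature space: columns of U are eigengenes of length m.
U = A.T @ V
U /= np.linalg.norm(U, axis=0)

# KL-transformation: project each centered sample onto the eigengenes.
Z = A @ U                                  # n x k matrix of transformed samples
print(Z.shape, eigvals[:3])
```

The cost of the eigendecomposition now depends on the number of samples rather than the number of genes, which is the memory and computation saving described above.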

In the transformation to Mahalanobis space, the samples of each class are transformed into a Mahalanobis space defined by the means and variances of that class. The Mahalanobis space is a space in which the sample variance along each dimension is unity. The transformation is achieved by dividing each KL-transformed coefficient of a vector by its corresponding standard deviation.
Since the eigengenes of the original covariance matrix are used for the transformation, the square root of the eigenvalue is used instead of the standard deviation, because the eigenvalues are proportional to the variances along the directions of the principal components (PCs). For the final classification, the Euclidean distance is calculated between the transformed samples and the class means. This is the strength of the proposed classification scheme: the complex classification problem reduces to a simple Euclidean classifier.
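The following self-contained sketch illustrates the scaling by the square roots of the eigenvalues and the nearest-class-mean Euclidean decision rule; treating the eigenvalues as class-independent scales is a simplification and an assumption about the exact per-class procedure.

```python
# Hedged sketch: Mahalanobis-space scaling of KL-transformed samples followed
# by nearest-class-mean classification with plain Euclidean distance.
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(40, 10))               # KL-transformed samples (stand-in)
eigvals = rng.uniform(0.5, 3.0, size=10)    # eigenvalues of retained components
y = rng.integers(0, 2, size=40)             # class labels

# Scale so the variance along each retained direction is (approximately) one.
Z_m = Z / np.sqrt(eigvals)

# Class means in the Mahalanobis space.
labels = np.unique(y)
centers = np.stack([Z_m[y == c].mean(axis=0) for c in labels])

# Assign each sample to the class whose mean is closest in Euclidean distance.
dists = np.linalg.norm(Z_m[:, None, :] - centers[None, :, :], axis=2)
pred = labels[dists.argmin(axis=1)]
print("resubstitution accuracy:", (pred == y).mean())
```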

4 PERFORMANCE ANALYSES
According to the experimental result, the proposed classification
schema should be plotted in the graph. Simulation results for the
performance indices are showed below. The performance of
classifiererror rates are computed for each dataset using
bootstrap method.In this method is used for the estimation of
error rate and uses the weighted average of the reconstitution
error, means the training set and error on the samples are not
used to train the classifier to leave out cross validation. The
Adeno, Colon and SRBCT the error rate is high, when compare
to other datasets when we going to take for classification. The
number of errors present in this datasets represents the
deficiency of the process. Although the classification boundary
is nonlinear as it incorporates different data spreads between
classes as well as gene types.
[Figure: classification error rate (Err) for each dataset]
5 CONCLUSION
The proposed classification scheme for both binary and multiclass cancer diagnosis involves the transformation of microarray data to the Mahalanobis space after performing the KL-transformation. The strength of the proposed scheme is that the final classification becomes a simple Euclidean classifier.
The classification boundary is nonetheless nonlinear, as it incorporates the different data spreads between classes as well as gene types. Gene selection is applied for better performance of the algorithm, and results on the other datasets also improve after gene selection is performed. In both gene selection and cancer classification, researchers seek to understand the main causes that lie behind the occurrence of a disease, and therefore a great deal of research has been devoted to the analysis of such data.
A number of issues were encountered during the analysis, so the accuracy and speed of the proposed technique were checked on different published datasets. The results show that the proposed technique is highly efficient in both performance and computation.

6 FUTURE WORKS
In future, this work can be extended to apply ensembling
techniques. Ensemble methods as sets of machine learning
techniques whose decisions are combined in some way to
improve the performance of the overall system. Two important
aspects to be focused on ensemble approaches. First aspect is
how to generate diverse base classier. In traditional, re-sampling
has been widely used to generating training datasets for base
classier learning. This method is much too random and due to
the small numbers of samples, the datasets may be greatly
similar. The second aspect is how to combine the base classiers.
An intelligent approach for constructing ensemble classiers is
proposed. The methods rst training the base classiers with
Particle Swarm Optimization (PSO) algorithm, and then select
the appropriate classiers to construct a high performance
classication committee with Estimation of Distribution
Algorithm (EDA).

REFERENCES
[1] A. Bharathi and A. M. Natarajan, "Cancer Classification of Bioinformatics Data Using ANOVA", International Journal of Computer Theory and Engineering, vol. 2, pp. 1793-8201, 2010.
[2] A. Osareh and B. Shadgar, "Microarray Data Analysis for Cancer Classification", 5th International Symposium on Health Informatics and Bioinformatics (HIBIT), vol. 9, pp. 459-468, 2010.
[3] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola and C. Ladd, "Gene Expression Correlates of Clinical Prostate Cancer Behavior", Cancer Cell, pp. 203-209, 2002.
[4] F. Chu and L. Wang, "Applications of Support Vector Machines to Cancer Classification with the Microarray Data", International Journal of Neural Systems, vol. 15, pp. 475-484, 2005.
[5] L. Shen and E. C. Tan, "Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, pp. 166-175, Apr.-June 2005.
[6] M. H. Asyali, D. Colak, O. Demirkaya and M. S. Inan, "Gene Expression Profile Classification", Current Bioinformatics, vol. 1, pp. 55-73, 2006.
[7] M. Rangasamy and S. Venketraman, "An Efficient Statistical Model Based Classification Algorithm for Classifying Cancer Gene Expression Data with Minimal Gene Subsets", International Journal of Cyber Society and Education, vol. 2, pp. 51-66, 2009.
[8] P. Qiu, Z. J. Wang and K. J. R. Liu, "Ensemble Dependence Model for Classification and Prediction of Cancer and Normal Gene Expression Data", Bioinformatics, vol. 21, pp. 3114-3121, 2005.
[9] F. Chu and L. Wang, "Applying RBF Neural Networks to Cancer Classification Based on Gene Expressions", International Joint Conference on Neural Networks, pp. 37-46, July 2006.
[10] H. Xiong and X.-W. Chen, "Optimized Kernel Machines for Cancer Classification Using Gene Expression Data", Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 1-7, 2005.
[11] S. Dudoit, J. Fridlyand and T. P. Speed, "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data", Journal of the American Statistical Association, vol. 97, pp. 77-87, 2002.
[12] U. I. Bajwa, I. A. Taj and Z. E. Bhatti, "A Comprehensive Comparative Analysis on Performance of Laplacianfaces and Eigenfaces for Face Recognition", The Imaging Science Journal, vol. 59, pp. 32-40, 2011.
[13] X. Hang, "Cancer Classification by Sparse Representation Using Microarray Gene Expression Data", IEEE International Conference on Bioinformatics and Biomedicine Workshops, pp. 174-177, 2008.
[14] Y. Wang, I. V. Tetko, M. A. Hall and E. Frank, "Gene Selection from Microarray Data for Cancer Classification: A Machine Learning Approach", Computational Biology and Chemistry, vol. 29, pp. 37-46, 2005.
