SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Technology

By
SOUMYADIP CHANDRA
(Y8103044)

DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2010
ABSTRACT
Hyperspectral data (HD) can provide far more spectral information than
multispectral data. However, it suffers from problems such as the curse of
dimensionality and data redundancy, and the data sets are very large.
Consequently, it is difficult to process these data sets and obtain satisfactory
classification results.
The objectives of this thesis are to find the best feature extraction (FE)
techniques and to improve the accuracy and time of HD classification using
parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest
neighbors (KNN)) and support vector machine (SVM) algorithms. To achieve
these objectives, experiments were performed with different FE techniques:
segmented principal component analysis (SPCA), kernel principal component
analysis (KPCA), orthogonal subspace projection (OSP) and projection pursuit
(PP). The DAIS-7915 hyperspectral sensor data set was used for the
investigations in this thesis.
Among the parametric and non-parametric classifiers, the GML classifier
gave the best results, with an overall kappa value (k-value) of 95.89%. This was
achieved using 300 training pixels (TP) per class and 45 bands of the SPCA
feature extracted data set.
The SVM algorithm with the quadratic programming (QP) optimizer gave the best
results amongst all optimizers and approaches. An overall k-value of 96.91% was
achieved using 300 TP per class and 20 bands of the SPCA feature extracted data
set. However, the supervised FE techniques such as KPCA and OSP failed to
improve the results obtained by SVM significantly.
The best results obtained for GML, KNN and SVM were compared by
one-tailed hypothesis testing. The SVM classifier was found to perform
significantly better than the GML classifier for a statistically large set of TP
(300). For statistically exact (100) and sufficient (200) sets of TP, the
performance of SVM on the SPCA extracted data set is not statistically better
than that of the GML classifier.
ACKNOWLEDGEMENTS
I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit, for
his involvement, motivation and encouragement throughout and beyond the thesis
work. His expert direction has inculcated in me qualities which I will treasure
throughout my life. His patient hearing, critical comments and approach to the
research problem made me do better every time. His valuable suggestions at all
stages of the thesis work helped me to overcome various shortcomings of my
work. I also express my sincere thanks for his effort in going through the
manuscript carefully and making it more readable. It has been a great learning
and life changing experience working with him.
I would like to express my sincere gratitude to Dr. Bharat Lohani for his
friendly nature, excellent guidance and teaching during my stay at IITK.
I would especially like to thank Sumanta Pasari for his valuable
comments on and corrections of the manuscript of my thesis.
I would like to thank all of my friends, especially Shalabh, Pankaj, Amar,
Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta, and all the other
GI people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very
joyous, pleasant and memorable one.
In closing, I express my cordial homage to my parents and my best friend
for their unwavering support and encouragement to complete my study at IITK.
SOUMYADIP CHANDRA
July 2010
CONTENTS
CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 - Introduction
1.7 Structure of thesis
3.3 Supervised classifier
4.1.2 PP
4.1.3 KPCA
4.1.4 OSP
5.2 Results for parametric and non-parametric classifiers
5.3.4 Class wise comparison of the best result of SVM
REFERENCES
APPENDIX A
LIST OF FIGURES
4.2 Projection pursuit feature extraction method
4.3 KPCA feature extraction method
4.4 OSP feature extraction method
4.5 Overview of classification procedure
4.6 Experimental scheme for Set-I experiments
4.7 The experimental scheme for advanced classifier (Set-II)
5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively
5.2 Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction
5.3 First six Segmented Principal Components (SPCs) (b) shows water body and salt lake
5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP
5.5 First six features obtained by using eight end-members
5.6 Two components of most interesting projections
5.7 Correlation images after applying various feature extraction techniques
5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands
5.9 Comparison of kappa values and classification times for GML classification method
5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP
5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP
5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP
5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP
5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP
5.15 Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class
5.16 Comparison of best k-value and classification time for original and feature extracted data set
5.17 Class wise accuracy comparison of OD and different feature extracted data for KNNC
5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer
5.19 Classification time comparison using 200 and 300 TP per class
5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer
5.21 Comparison of classification time different set of TPs with respect to number of bands for SVM_SMO classification algorithm
5.22 Overall kappa values observed for classification original and feature modified data sets using KPCA_SVM algorithm.
5.23 Comparison of classification accuracy of individual classes for different SVM algorithms
LIST OF ABBREVIATIONS
AC Advanced classifier
DAFE Discriminant analysis feature extraction
DAIS Digital airborne imaging spectrometer
DBFE Decision boundary feature extraction
FE Feature extraction
GML Gaussian maximum likelihood
HD Hyperspectral data
ICA Independent component analysis
KNN k-nearest neighbors
k-value Kappa value
KPCA Kernel principal component analysis
KPCA_SVM Support vector machine with Kernel principal component analysis
MS Multispectral data
NWFE Nonparametric weighted feature extraction
Ncri Critical value
OD Original data
OSP Orthogonal subspace projection
PCA Principal component analysis
PCT Principal component transform
PP Projection pursuit
rbf Radial basis function
SPCA Segmented principal component analysis
SV Support vectors
SVM Support vector machine
SVM_QP Support vector machine with quadratic programming optimizer
SVM_SMO Support vector machine with sequential minimal optimizer
TP Training pixels
Dedicated
to
my family & guide
CHAPTER 1
INTRODUCTION
Remote sensing technology has brought a new dimension to earth observation,
mapping and many other fields. In the early days of this technology,
multispectral sensors were used for capturing data. Multispectral sensors
capture data in a small number of bands with broad wavelength intervals.
Because they have few spectral bands, their spectral resolution is insufficient
to discriminate amongst many earth objects. If the spectral measurement is
instead performed using hundreds of narrow wavelength bands, many earth objects
can be characterized precisely. This is the key concept of hyperspectral imagery.
Compared with a multispectral (MS) data set, hyperspectral data (HD) carries
far more information, is voluminous and differs in its characteristics.
Extracting this large amount of information from HD therefore remains a
challenge, and cost effective, computationally efficient procedures are required
to classify HD. Data classification is the categorization of data for its most
effective and efficient use. The desired result of classification is a high
accuracy thematic map, and HD has the potential to deliver it.
This chapter introduces the concept of high dimensional space, HD and the
difficulties in classifying HD. The next part states the objectives of the
thesis, followed by an overview of the data set used. The software used is then
described, followed by the structure of the thesis.
n-dimensional spaces with large values of n are sometimes called high-dimensional
spaces (Werke, 1876). Many familiar geometric objects generalize to any number
of dimensions. For example, the two-dimensional triangle and the
three-dimensional tetrahedron can be seen as specific instances of the
n-dimensional simplex. Similarly, the circle and the sphere are particular forms
of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).
These images are then combined to form a three dimensional hyperspectral
cube. As the dimension of HD is very high, it can be treated as a high
dimensional space. HD shares the characteristics of high dimensional space,
which are described in the following section.
1.1.2 Characteristics of high dimensional space
High dimensional spaces, i.e. spaces with a dimensionality greater than three,
have properties that differ substantially from our normal sense of distance,
volume and shape. In particular, in a high-dimensional Euclidean space, volume
expands far more rapidly with increasing diameter than in lower-dimensional
spaces, so that, for example:
(i). Almost all of the volume within a high-dimensional hypersphere lies in a thin
shell near its outer "surface".
(ii). The volume within a high-dimensional hypersphere relative to a hypercube of
the same width tends to zero as dimensionality tends to infinity, and almost all
of the volume of the hypercube is concentrated in its "corners".
These characteristics have two immediate consequences for high dimensional
data. First, high dimensional space is mostly empty. As a consequence, high
dimensional data can be projected to a lower dimensional subspace without
losing significant information in terms of separability among the different
statistical classes (Jimenez and Landgrebe, 1995). Second, normally distributed
data tend to concentrate in the tails; similarly, uniformly distributed data are
more likely to be collected in the corners, making density estimation more
difficult. Local neighborhoods are almost empty, so the bandwidth of estimation
must be large, which causes detail to be lost in the density estimate
(Abhinav, 2009).
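The two geometric properties listed in this section can be checked numerically. The following is only an illustrative sketch: the function names and the 5% shell thickness are assumptions made for this example, and the formulas are the standard closed forms for the volume of a d-dimensional ball.

```python
import math

def sphere_to_cube_volume_fraction(d):
    """Volume of the d-ball inscribed in a d-cube, divided by the cube volume."""
    # V_ball(r) = pi^(d/2) * r^d / Gamma(d/2 + 1);  V_cube = (2r)^d
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

def outer_shell_fraction(d, eps):
    """Fraction of a d-ball's volume within relative distance eps of its surface."""
    # A ball of radius (1 - eps) r holds (1 - eps)^d of the full volume.
    return 1.0 - (1.0 - eps) ** d

# Illustrative dimensions only; 200 stands in for a hyperspectral band count.
for d in (2, 3, 10, 50, 200):
    print(d, sphere_to_cube_volume_fraction(d), outer_shell_fraction(d, 0.05))
```

As d grows, the first column of values shrinks towards zero (the volume moves into the corners of the cube) while the second approaches one (the volume moves into the outer shell).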
Volume fraction: The fraction of the volume of a hypersphere inscribed in a hypercube
1.2 What is classification?
Classification means putting data into groups according to their
characteristics. In spectral classification, areas of the image that have
similar spectral reflectance are put into the same group or class (Abhinav,
2009). Classification can also be seen as a means of compressing image data by
reducing the large range of digital numbers (DN) in several spectral bands to a
few classes in a single image. Classification reduces this large spectral space
to relatively few regions and necessarily results in a loss of numerical
information from the original image. Depending on the information available
about the imaged region, supervised or unsupervised classification methods are
used.
select the uncorrelated bands or make the bands uncorrelated, applying
feature reduction algorithms (Varshney and Arora, 2004).
4. Optimum number of features: It is critical to select the optimum
number of bands out of a large number of bands (e.g. 224 bands for an AVIRIS
image) for use in classification. To date, there is no suitable algorithm or
rule for selecting the optimal number of features.
5. Large data size and high processing time due to complexity of
classifier: A hyperspectral imaging system produces a large amount of data, so
a large memory and a powerful system are needed to store and handle the
data, which is generally very expensive.
1.4 Objectives
This thesis investigates the following two objectives pertaining to
classification with hyperspectral data:
Objective-1:
To evaluate various FE techniques for classification of hyperspectral data.
Objective-2:
To study the extent to which advanced classifiers can reduce problems related to
classification of hyperspectral data.
Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)
Figure 1.5: Google earth image of study area (Google earth, 2007)
CHAPTER 2
LITERATURE REVIEW
This chapter outlines the important research works and major achievements in
the field of high dimensional data analysis and data classification. It begins
with some of the FE techniques and classification approaches suggested by
various researchers for solving problems related to HD classification. The
results of relevant experiments with HD are included, in tabulated form, to
highlight the usefulness and reliability of these approaches. Some other issues
related to classification of HD are discussed at the end of the chapter.
2.1.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied to
multispectral data for feature reduction. It can also be used as a tool for image
enhancement and digital change detection (Lodwick, 1979). For dimension
reduction of HD, PCA outperforms those FE techniques which are based on class
statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited
and its ratio to the number of dimensions is low for HD, the class covariance
matrix cannot be estimated properly. To overcome these problems, Jia (1996)
proposed segmented principal component analysis (SPCA), which applies the PCT
separately to each block of highly correlated bands. This approach also reduces
the processing time by dividing the complete set of bands into several blocks of
highly correlated bands. Jensen and James (1999) reported that SPCA-based
compression generally outperforms PCA-based compression in terms of detection
and classification accuracy on decompressed HD. PCA works efficiently for highly
correlated data sets, whereas SPCA works efficiently for both highly correlated
and weakly correlated data sets (Jia, 1996).
Jia (1996) compared SPCA and PCA extracted features for target detection and
concluded that SPCA is a better FE technique than PCA. She also showed that both
feature extracted data sets are identical and that there is no loss of variance
in the middle stages, as long as no components are removed.
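The block-wise transform described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the thesis: the function name, the synthetic random data and the number of components kept per block are assumptions, and pixels are stored one per row here. The block sizes 32, 6 and 27 follow the three correlation blocks named in the caption of Figure 5.1.

```python
import numpy as np

def segmented_pca(X, block_sizes, n_keep):
    """Apply PCA separately to consecutive blocks of bands.

    X           : array of shape (n_pixels, n_bands)
    block_sizes : number of bands in each consecutive block
    n_keep      : number of principal components kept per block
    """
    assert sum(block_sizes) == X.shape[1]
    features, start = [], 0
    for size, keep in zip(block_sizes, n_keep):
        block = X[:, start:start + size]
        block = block - block.mean(axis=0)        # centre each band
        cov = np.cov(block, rowvar=False)         # (size, size) band covariance
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
        top = np.argsort(eigvals)[::-1][:keep]    # indices of the largest ones
        features.append(block @ eigvecs[:, top])  # project onto the leading PCs
        start += size
    return np.hstack(features)

# Synthetic stand-in for a 65-band image with 500 pixels (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 65))
spcs = segmented_pca(X, block_sizes=[32, 6, 27], n_keep=[3, 2, 3])
```

Each eigen-decomposition operates on a covariance matrix no larger than the biggest block, which is where the saving in processing time over a single PCA on all bands comes from.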
theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b)
introduced a projection index called the chi-square projection pursuit index. Posse
(1995a, 1995b) used a random search method to locate a plane with an optimal value
of the projection index and combined it with the structure removal of Friedman
(1987) to obtain a sequence of interesting 2-D projections. Each projection found
in this manner shows a structure that is less important (in terms of the
projection index) than the previous one. More recently, the PP technique has also
been used to obtain 1-D projections (Martinez, 2005). In this research work,
Posse's method is followed, which reduces an n-dimensional data set to
2-dimensional data.
feature space and then PCA is applied on the mapped vectors. KPCA is also a
powerful preprocessing step for classification algorithms (Mika et al., 1998).
Rosipal et al. (2001) proposed the application of the KPCA technique for feature
selection in a high-dimensional feature space where input variables were mapped
by a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing
part of the higher-order statistics. To obtain these higher-order statistics, a
large number of TP is required. This causes problems for KPCA, since KPCA
requires storing and manipulating the kernel matrix, whose size is the square of
the number of TP. To overcome this problem, a new iterative algorithm for KPCA,
the Kernel Hebbian Algorithm (KHA), was introduced (Scholkopf et al., 2005).
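The batch form of KPCA with a Gaussian kernel can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation and not the iterative KHA: the function names, the kernel width `sigma` and the random training pixels are hypothetical, and pixels are stored one per row. The m-by-m kernel matrix built here is the storage cost referred to in the text.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kpca_fit_transform(T, n_components, sigma):
    """Project training pixels T (m, n_bands) onto the leading kernel PCs."""
    m = T.shape[0]
    K = rbf_kernel(T, T, sigma)                    # (m, m): grows as m squared
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one     # centre in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[top], eigvecs[:, top]
    alpha = alpha / np.sqrt(lam)                   # unit-norm feature-space axes
    return Kc @ alpha                              # (m, n_components)

rng = np.random.default_rng(0)
T = rng.normal(size=(40, 5))                       # hypothetical training pixels
Z = kpca_fit_transform(T, n_components=3, sigma=2.0)
```

The normalisation step assumes the retained eigenvalues are strictly positive, which holds for the leading components of a Gaussian kernel on distinct points.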
and GML classifiers by Kuo and Landgrebe (2004). They concluded that NWFE is a
better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA,
DAFE, DBFE and NWFE feature extracted data sets on the GML classifier. He showed
that PCA is the best of these FE techniques for HD with the GML classifier. He
also suggested that FE techniques such as KPCA, OSP, SPCA and PP may improve the
classification results obtained with the GML classifier.
2.3.1 KNN
The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern
recognition. The technique can achieve high classification accuracy in problems
which have unknown and non-normal distributions. However, its major drawback is
that a large number of TP is required, resulting in high computational
complexity for classification (Hwang and Wen, 1998).
Pechenizkiy (2005) compared the performance of the KNN classifier on PCA
and random projection (RP) feature extracted data sets. He concluded that KNN
performs well on the PCA feature extracted data set. Zhu et al. (2007) showed
that KNN works better on the ICA feature extracted data set than on the original
data set (OD) (the OD was captured by a hyperspectral imaging system developed by
the ISL). The ICA-KNN method with a few wavelengths had the same performance as
the KNN classifier alone using information from all wavelengths.
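A brute-force version of the KNN rule makes the computational cost mentioned above concrete: every test pixel is compared with every training pixel. The sketch below is illustrative only; the function name and the toy data are assumptions, Euclidean distance is assumed, and class labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_classify(train_X, train_y, test_X, k):
    """Label each test pixel by majority vote among its k nearest training pixels."""
    # Squared Euclidean distance between every test and every training pixel.
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]        # indices of the k closest TP
    votes = train_y[nearest]                       # (n_test, k) class labels
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy two-class example (hypothetical values).
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])
test_X = np.array([[0.2, 0.3], [5.5, 5.2]])
pred = knn_classify(train_X, train_y, test_X, k=3)
```

The distance array has one entry per (test pixel, training pixel, band) triple, so both the number of TP and the number of bands drive the classification time, which is why reducing the bands by FE matters for this classifier.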
Some more non-parametric classifiers based on geometrical approaches to data
classification were found during the literature survey. These approaches consider
the data points to be located in Euclidean space and exploit the geometrical
patterns of the data points for classification. Such approaches are grouped into
a class of classifiers known as machine learning techniques. Support vector
machines (SVM) (Boser et al., 1992) and k-nearest neighbors (KNN) (Fix and
Hodges, 1951) are among the popular classifiers of this kind. They make no
assumptions regarding the data density function or the discriminating functions
and hence are purely non-parametric classifiers. However, these classifiers also
need to be trained using training data.
2.3.2 SVM
SVM is regarded as an advanced classifier. It belongs to a new generation of
classification techniques based on statistical learning theory, has its origins
in machine learning and was introduced by Boser, Guyon and Vapnik (1992). Vapnik
(1995, 1998) discussed SVM based classification in detail. SVM aims to improve
learning by minimizing the learning error through empirical risk minimization
(ERM) and by minimizing the upper bound on the overall expected classification
error through structural risk minimization (SRM). SVM uses the principle of
optimal separation of classes to find a separating hyperplane that separates the
classes of interest to the maximum extent by maximizing the margin between the
classes (Vapnik, 1992). This technique differs from the estimation of effective
decision boundaries used by Bayesian classifiers, as only the data vectors near
the decision boundary (known as support vectors) are required to find the
optimal hyperplane. A linear hyperplane may not be enough to classify the given
data set without error. In such cases, the data is transformed to a higher
dimensional space using a non-linear transformation that spreads the data apart
such that a linear separating hyperplane may be found. Kernel functions are used
to reduce the computational complexity that arises from the increased
dimensionality (Varshney and Arora, 2004).
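A toy example may help to show how a non-linear transformation can make a linear separator possible. The sketch below is an assumption-laden illustration, not taken from the thesis: the one-dimensional data, the map x -> (x, x^2) and the hand-chosen hyperplane are all hypothetical.

```python
import numpy as np

# One-dimensional data: class +1 lies on both sides of class -1.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.where(np.abs(x) > 1.0, 1, -1)

def separable_by_threshold(values, labels):
    """True if a single threshold on a 1-D feature splits the two classes."""
    pos, neg = values[labels == 1], values[labels == -1]
    return bool(pos.min() > neg.max() or neg.min() > pos.max())

# Explicit non-linear map into two dimensions: x -> (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the mapped space the hyperplane w . phi + b = 0 with w = (0, 1), b = -1
# (i.e. x^2 = 1) separates the classes.
w, b = np.array([0.0, 1.0]), -1.0
pred = np.sign(phi @ w + b).astype(int)
```

No threshold on x separates the classes, but a threshold on the second mapped coordinate does. A kernel function gives the same effect without ever forming the mapped vectors explicitly.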
The advantages of SVM (Varshney and Arora, 2004) lie in its high generalization
capability and its ability to adapt its learning characteristics by using kernel
functions. As a result, it can adequately classify data in a high-dimensional
feature space with a limited number of training samples and is not affected by
the Hughes phenomenon and other effects of dimensionality. The ability to
classify using even a limited number of training samples makes SVM a very
powerful classification tool for remotely sensed data. Thus, SVM has the
potential to produce accurate classifications from HD with a limited number of
training samples. SVMs are believed to be better learning machines than neural
networks, which tend to overfit classes, causing misclassification (Abhinav,
2009), as they rely on margin maximization rather than finding a decision
boundary directly from the training samples.
For a conventional SVM, an optimizer based on quadratic programming (QP) or
linear programming (LP) methods is used to solve the optimization problem. The
major disadvantage of the QP algorithm is that the kernel matrix must be stored
in memory. When the kernel matrix is large, it requires a huge amount of memory
that may not always be available. To overcome this, Bennett and Campbell (2000)
suggested an optimization method which sequentially updates the Lagrange
multipliers, called the kernel adatron (KA) algorithm. Another approach is the
decomposition method, which updates the Lagrange multipliers in parallel: it
updates many parameters in each iteration, unlike other methods that update one
parameter at a time (Varshney and Arora, 2004). The QP optimizer used here
updates the Lagrange multipliers on a fixed size working data set. The
decomposition method uses a QP or LP optimizer to solve the problem of a huge
data set by considering many small data sets rather than a single huge data set
(Varshney, 2001). The sequential minimal optimization (SMO) algorithm (Platt,
1999) is a special case of the decomposition method in which the size of the
working data set is fixed such that an analytical solution can be derived in
very few numerical operations. It does not use QP or LP optimization methods.
This method needs more iterations, but each iteration requires only a small
number of operations, which increases the optimization speed for very large
data sets.
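The idea of updating the Lagrange multipliers one at a time can be illustrated with a simplified kernel adatron loop. This is a sketch under assumptions, not the optimizer used in the thesis: it is a hard-margin variant without a bias term, and the learning rate `eta`, the number of epochs, the Gaussian kernel width and the synthetic two-cluster data are all hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_adatron(K, y, eta=0.1, epochs=100):
    """Sequentially update one Lagrange multiplier at a time (no bias term)."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            margin = y[i] * np.dot(alpha * y, K[:, i])
            # Gradient step on the dual, clipped so the multiplier stays >= 0.
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - margin))
    return alpha

# Two well separated synthetic clusters (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, size=(20, 2)),
               rng.normal(+2.0, 0.3, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)
K = rbf_kernel(X, X)
alpha = kernel_adatron(K, y)
pred = np.sign((alpha * y) @ K)            # decision values on the training set
```

Training points whose multiplier ends at zero play no part in the decision function; the remaining points act as the support vectors. Note that this sketch still holds the full kernel matrix in memory for simplicity.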
The speed of SVM classification decreases as the number of support vectors
(SV) increases. By using kernel mapping, different SVM algorithms have
successfully incorporated effective and flexible nonlinear models. For large
data sets, however, computing the nonlinear kernel matrix poses major
difficulties. To overcome these computational difficulties, some authors have
proposed low rank approximations to the full kernel matrix (Wiens, 92). As an
alternative, Lee and Mangasarian (2002) proposed the reduced support vector
machine (RSVM), which reduces the size of the kernel matrix. However, the problem
of selecting the number of support vectors (SV) remained. In 2009, Sundaram
proposed a method which reduces the number of SV through the application of
KPCA. This method differs from the other proposed methods in that the exact
choice of support vectors is not important as long as the vectors span a fixed
subspace.
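The dependence of classification speed on the number of SV follows from the form of the decision function, which needs one kernel evaluation per support vector for every pixel classified. The sketch below is illustrative only: the function name, the Gaussian kernel and the two hand-picked support vectors are assumptions.

```python
import numpy as np

def svm_decision(x, support_vectors, dual_coef, bias, sigma=1.0):
    """f(x) = sum_i (alpha_i * y_i) * k(s_i, x) + b, summed over the SV only."""
    sq = ((support_vectors - x) ** 2).sum(axis=1)
    k = np.exp(-sq / (2.0 * sigma ** 2))   # one kernel evaluation per SV
    return float(dual_coef @ k + bias)

# Two hypothetical support vectors with coefficients alpha_i * y_i.
sv = np.array([[-1.0, 0.0], [1.0, 0.0]])
coef = np.array([-1.0, 1.0])

f_right = svm_decision(np.array([2.0, 0.0]), sv, coef, bias=0.0)
f_left = svm_decision(np.array([-2.0, 0.0]), sv, coef, bias=0.0)
```

Halving the number of support vectors roughly halves the work per pixel, which is the motivation for methods that reduce the SV set.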
Benediktsson et al (2000) applied KPCA to the ROSIS-03 data set. They then
used a linear SVM on the feature extracted data set and showed that KPCA features
are more linearly separable than the features extracted by conventional PCA.
Shah et al (2003) compared SVM, GML and ANN classifiers for accuracy at full
dimensionality and using the DAFE and DBFE FE techniques on an AVIRIS data set.
They concluded that SVM gives higher accuracies than GML and ANN at full
dimensionality but poor accuracies for features extracted by DAFE and DBFE.
Abhinav (2009) compared SVM, GML and ANN on OD and on PCA, ICA, NWFE, DBFE and
DAFE feature extracted data sets. He concluded that SVM provides better results
for OD than GML. SVM works best with PCA and ICA feature extracted data sets,
whereas ANN works better with DBFE and NWFE feature extracted data sets.
The work done by various researchers with different hyperspectral data sets
using different classifiers and FE methods, and the results obtained, are
summarized in Table 2.1.
Table 2.1: Summary of literature review
| Author | Dataset used | Method used | Results obtained |
|---|---|---|---|
| Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier was used to compare classification accuracies obtained by DBFE and PCA FE. | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharyya feature selection methods. |
| Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics were studied with respect to the effects of dimensionality and the order of data statistics used on supervised classification techniques. | The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy was observed to increase with the use of higher order statistics. Lower order statistics were observed to be less affected by the Hughes phenomenon. |
| Benediktsson et al (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets were used for classification using linear SVM. | KPCA features are more linearly separable than features extracted by conventional PCA. |
| Shah et al. (2003) | AVIRIS | Compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques. | SVM was found to give higher accuracies than GML and ANN for full dimensionality, but poor accuracies were obtained for features extracted by DAFE and DBFE. |
| Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC mall, Washington, US) | NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor and GML classifiers. | NWFE was found to produce better classification accuracies than DAFE. |
| Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier was used to compare classification accuracies obtained by PCA and Random Projection FE. | PCA gave better results than Random Projection. |
| Zhu et al (2007) | Hyperspectral imaging system developed by ISL | ICA ranking methods were used to select the optimal wavelengths, then KNN was used. Then KNN alone was used. | The ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands. |
| Sundaram (2009) | The adult dataset, part of the UCI Machine Learning Repository | KPCA was applied to the support vectors, then the usual SVM algorithm was used. | Significantly reduced the processing time without affecting the classification accuracy. |
| Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques were used on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. | GML was the best among the techniques and performs best on the PCA extracted data set. |
| Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques were used on the OD and on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare the accuracy. | GML performed much worse than SVM on OD. SVM provides better accuracy than GML. SVM performs better on PCA and ICA extracted data sets. |
1. From Table 2.1, it can be concluded that FE techniques such as PCA, ICA,
DAFE, DBFE and NWFE perform well in improving classification accuracies when
used with GML. However, the features extracted by DBFE and DAFE failed to
improve the results obtained by SVM, implying a limitation of these
techniques for advanced classifiers. KNN works best with PCA and ICA feature
extracted data sets. In the surveyed literature, the effects of PP, SPCA,
KPCA and OSP extracted features on the classification accuracy obtained from
advanced classifiers such as SVM, the parametric classifier GML and the
nonparametric classifier KNN have not been examined.
2. Another important aspect missing from the literature is a comparison of
classification time for SVM classifiers, since SVM takes a long time to train
with a large number of TP. Many SVM approaches have been proposed to reduce
the classification time, but there is no conclusion on the best SVM algorithm
in terms of both classification accuracy and processing time.
3. Although KNN is an effective classification technique for HD, there is no
guideline on classification time or on the best FE techniques for the KNN
classifier. The effect of parameters such as the number of nearest neighbors,
the number of TP and the number of bands has also not been reported for KNN.
4. The literature survey further shows that there is no recommendation of the
best FE techniques for the different SVM algorithms, GML and KNN.
These missing aspects are investigated in this thesis, and guidelines for
choosing an efficient and less time consuming classification technique are
presented as the result of this research.
This chapter presented the FE and classification techniques for mitigating the
effects of dimensionality. These techniques resulted from different approaches
to dealing with the problem of high dimensionality and improving the performance
of advanced, parametric and nonparametric classifiers. The approaches were
applied to real life HD, and the comparative results reported in the literature
were compiled and presented here. In addition, the important aspects found
missing in the literature survey, which this thesis investigates, were
highlighted. The mathematical rationale and algorithms used to apply these
techniques are discussed in detail in the next chapter.
CHAPTER 3
MATHEMATICAL BACKGROUND
This chapter provides the detailed mathematical background of each of the
techniques used in this thesis. Starting with some basic concepts of kernels and
kernel space, it describes the unsupervised and supervised FE techniques,
followed by the classification and optimization rules for the supervised
classifiers. Finally, the scheme of statistical analysis used for comparing the
results of different classification techniques is discussed.
Notations followed in this chapter for matrices and vectors are given below:
$X$: A two-dimensional matrix whose columns represent the data points ($m$) and
whose rows represent the bands ($n$), i.e. $X = X[n, m]$.
$\Phi(z)$: Mapping of the input vector $z$ into kernel space, using some kernel function.
$\in$: Belongs to.
$R^n$: Set of n-dimensional real vectors.
$N$: Set of natural numbers.
$[\,\cdot\,]^T$: Transpose of a matrix.
$\forall$: For all.
• Feature space: The space spanned by the transformed data points, i.e. the points of
the original space after they have been mapped by some function.
A kernel is the dot product in a feature space $H$ reached via a map $\Phi$ from the input space,
such that $\Phi : X \to H$. A kernel can be defined as $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where
$x, x'$ and $\Phi(x), \Phi(x')$ are elements of the input space and the feature space respectively;
$k$ is called the kernel and $\Phi$ is called the feature map associated with $k$. $\Phi$ is also
referred to as the kernel function. The space containing these dot products is called
kernel space. The mapping from input space to feature space is nonlinear and
increases the distance between points in a data set. As a result, a
data set that is not linearly separable in input space can become linearly separable in
kernel space. A few definitions related to kernels are given below:
Positive definite matrix: A real $n \times n$ symmetric matrix $K$ satisfying $x_1^T K x_1 > 0$ for
equality in previous equation occurs only for $x_{11} = x_{21} = \dots = x_{n1} = 0$, then the matrix
gives rise to a strictly positive definite Gram matrix, called a strictly positive definite
kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
Table 3.1: Examples of common kernel functions (Modified after Varshney and
Arora, 2004)
All the above definitions are illustrated with the following simple
example.
Let $X = [x_1 \; x_2 \; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$ be a matrix in input space whose columns
($x_i$, $i = 1,2,3$) denote the data points and whose rows denote the dimensions of the data points.
Let this matrix be mapped into the feature space using the Gaussian kernel function, and
let $\langle x_i, x_j \rangle$ denote the inner product of the columns of $X$ computed with the Gaussian
kernel function.
Then the Gram matrix (kernel matrix) $K$ takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
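The Gram matrix of this example can be evaluated numerically. The sketch below is only illustrative: since Table 3.1 is not reproduced here, it assumes the Gaussian kernel has the common form $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ with $\sigma = 1$, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Assumed Gaussian kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Gram matrix whose (i, j) entry is kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

# Matrix X of the example: rows are dimensions, columns are data points.
X = [[1, 2, 1],
     [2, 1, 3],
     [1, 1, 3]]
columns = [list(col) for col in zip(*X)]  # x1=(1,2,1), x2=(2,1,1), x3=(1,3,3)

K = gram_matrix(columns, gaussian_kernel)
for row in K:
    print(["%.4f" % value for value in row])
```

Under these assumptions the matrix is symmetric with ones on the diagonal, because every point has zero distance to itself.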
3.2 Feature extraction techniques
(i) Elimination of dimensions with very low information content. Features with
low information content can be discarded as noise.
(ii) Removal of redundancy among the dimensions of the data space, i.e. the reduced
feature set should be spanned by orthogonal vectors.
3.2.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied in
multispectral data analysis and is a powerful tool for FE. For hyperspectral
image data, PCT outperforms those FE techniques which are based on class
statistics. The main advantage of using a PCT is that global statistics are used to
determine the transform functions. However, implementing PCT on a high dimensional
data set carries a high computational load. SPCA overcomes the problem of long
processing time by partitioning the complete data set into several highly correlated
subgroups (Jia, 1996).
The complete data set is first partitioned into K subgroups with respect to the
correlation of bands. From the correlation image of HD, it can be seen that blocks are
formed from highly correlated bands (Figure 3.2). These blocks are selected as the
subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2 and k
respectively (Figure 3.2a). PCT is then applied to each subgroup of data. After
applying PCT to each subgroup, significant features are selected using the variance
information of each component. The PCs which contain about 99% of the variance are
chosen for each block; the selected features can then be regrouped and transformed
again to compress the data further.
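The first layer of this procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the block boundaries are assumed to have been read off the correlation image beforehand, the data are synthetic, and the helper names `pca_block` and `segmented_pca` are hypothetical.

```python
import numpy as np

def pca_block(block, var_keep=0.99):
    """PCT of one subgroup. block has shape (bands_in_block, pixels)."""
    centered = block - block.mean(axis=1, keepdims=True)
    cov = np.atleast_2d(np.cov(centered))          # band-by-band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # largest variance first
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest number of PCs whose cumulative variance reaches var_keep.
    n_keep = min(int(np.searchsorted(ratio, var_keep)) + 1, len(eigvals))
    return eigvecs[:, :n_keep].T @ centered

def segmented_pca(X, blocks, var_keep=0.99):
    """Apply PCT to each block of bands and stack the retained PCs."""
    return np.vstack([pca_block(X[a:b], var_keep) for a, b in blocks])

# Synthetic stand-in for HD: 6 bands x 200 pixels, two groups of 3 correlated bands.
rng = np.random.default_rng(0)
base = rng.standard_normal((2, 200))
X = np.vstack([base[0] + 0.01 * rng.standard_normal((3, 200)),
               base[1] + 0.01 * rng.standard_normal((3, 200))])

features = segmented_pca(X, blocks=[(0, 3), (3, 6)])
print(features.shape)
```

The stacked output can be passed through `segmented_pca` again with new block boundaries to mimic the regrouping layer described above.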
Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27
bands respectively, corresponding to highly correlated bands have been
formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all the variance, as does the conventional PCT. No
information is lost whether the transformation is conducted on the
complete vector at once or a few sub-vectors are transformed separately (Jia, 1996).
When the new components obtained from each segmented PCT are gathered and
transformed again, the resulting data variance and covariance are identical to
those of the conventional PCT. The main effect is that the data compression rate is
lower in the middle stages compared to the no-segmentation case. However, the
difference in compression rate is relatively small if the segmented transformation is
developed on subgroups which have poor correlation with each other.
Figure 3.2a: Chart of multilayered segmented PCA
(i) A projection pursuit index that measures the degree of departure from normality.
(ii) A method for finding the projection that yields the highest value of the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal
value of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. The interesting projections are
found in decreasing order of the value of the PP index. This implies that each
projection found in this manner shows a structure that is less important (in terms of
the projection index) than the previous one. In the following discussion, the
chi-squared PP index is described first, followed by the structure finding procedure.
Finally, the structure removal procedure is illustrated.
rings (Figure 3.3). Inner boxes have the same radial width R/5 and all boxes have the
same angular width of 45°. R is chosen so that the boxes have approximately the
1
Where,
$\phi$: The standard bivariate normal density.
$c_k$: Probability evaluated over the kth region using the normal density function,
given by $c_k = \iint_{B_k} \phi \, dz_1 dz_2$.
$B_k$: Box in the projection plane.
$\lambda_j$: $\lambda_j = \pi j / 36$, $j = 0, \dots, 8$, is the angle by which the data are rotated in the plane
before being assigned to regions.
$\alpha, \beta$: Orthonormal p-dimensional vectors which span the projection plane (these
can be the first two PCs or two randomly chosen pixels of the OD set).
$P(\alpha, \beta)$: The plane spanned by the two orthonormal vectors $\alpha, \beta$.
$Z_i^{\alpha}, Z_i^{\beta}$: Sphered observations projected onto the vectors $\alpha$ and $\beta$
($Z_i^{\alpha} = Z_i^T \alpha$ and $Z_i^{\beta} = Z_i^T \beta$).
Figure 3.3: Layout of the regions for the chi-square projection index. Each box spans
an angle of 45°, the inner rings have radial width R/5 out to radius R, and each box
is labelled with weight 1/48 (Modified after Posse, 1995a).
possible direction onto 2-D planes. Posse (1990) proposed a random search for
locating the global maximum of the projection index. Combined with the
structure-removal procedure, this gives a sequence of interesting two-dimensional views of
decreasing importance. Starting with random planes, the algorithm tries to improve
the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and
$(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by
$$\left. \begin{aligned} a_1 &= \frac{\alpha^* + c v}{\|\alpha^* + c v\|}, & b_1 &= \frac{\beta^* - (a_1^T \beta^*)\, a_1}{\|\beta^* - (a_1^T \beta^*)\, a_1\|} \\ a_2 &= \frac{\alpha^* - c v}{\|\alpha^* - c v\|}, & b_2 &= \frac{\beta^* - (a_2^T \beta^*)\, a_2}{\|\beta^* - (a_2^T \beta^*)\, a_2\|} \end{aligned} \right\} \tag{3.2}$$
where $c$ is a scalar that determines the size of the neighborhood visited, and $v$ is a
unit p-vector uniformly distributed on the unit p-dimensional sphere. The idea is to
start a global search and then to concentrate on the region of the global maximum by
decreasing the value of $c$. After a specified number of steps, called half, without an
increase of the projection index, the value of $c$ is halved. When this value is small
enough, the optimization is stopped. Part of the search still remains global to avoid
being trapped in a spurious local optimum. The complete search for the best plane contains
m such random searches with different random starting planes. The goal of the PP
algorithm is to find the best projection plane.
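A single random search of this kind can be sketched as below. It is a hedged illustration only: the chi-square index itself is not reproduced in this section, so a hypothetical stand-in `toy_index` (squared excess kurtosis of the two projections) is used, the data are synthetic, and `max_iter` is an added safety cap that is not part of the described procedure.

```python
import numpy as np

def candidate_planes(alpha, beta, c, rng):
    """Two candidate planes around (alpha, beta), following the form of Eq. (3.2)."""
    v = rng.standard_normal(alpha.size)
    v /= np.linalg.norm(v)                    # uniform direction on the unit sphere
    planes = []
    for sign in (1.0, -1.0):
        a = alpha + sign * c * v
        a /= np.linalg.norm(a)
        b = beta - (a @ beta) * a             # make b orthogonal to a
        b /= np.linalg.norm(b)
        planes.append((a, b))
    return planes

def toy_index(za, zb):
    """Hypothetical stand-in for the projection index (not the chi-square index)."""
    def excess_kurtosis(z):
        z = (z - z.mean()) / z.std()
        return np.mean(z ** 4) - 3.0
    return excess_kurtosis(za) ** 2 + excess_kurtosis(zb) ** 2

def random_search(Z, index_fn, c=1.0, half=30, c_min=1e-3, max_iter=5000, seed=0):
    """One random search; Z holds one observation per row."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((Z.shape[1], 2)))
    alpha, beta = q[:, 0], q[:, 1]            # random orthonormal starting plane
    best = index_fn(Z @ alpha, Z @ beta)
    stalls = 0
    for _ in range(max_iter):
        if c <= c_min:
            break
        improved = False
        for a, b in candidate_planes(alpha, beta, c, rng):
            value = index_fn(Z @ a, Z @ b)
            if value > best:
                alpha, beta, best, improved = a, b, value, True
        stalls = 0 if improved else stalls + 1
        if stalls >= half:                    # no increase for `half` steps
            c, stalls = c / 2.0, 0
    return alpha, beta, best

# Synthetic data: three near-normal columns and one clearly non-normal column.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 4))
Z[:, 0] = rng.uniform(-1.7, 1.7, 200)
alpha, beta, best = random_search(Z, toy_index, half=10)
print(best)
```

Repeating `random_search` with different seeds and keeping the plane with the highest index corresponds to the m random starts mentioned above.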
2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and $\beta_0$ are orthonormal.
Consider this as the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2)
as the current best plane $(\alpha^*, \beta^*)$.
manner, it will give a sequence of projections providing informative views of the data.
The procedure repeatedly transforms the projected data to standard normal until
they stop becoming more normal, as measured by the projection pursuit index. One
starts with a $p \times p$ matrix whose first two rows are the vectors of
the projection obtained from PPEDA. The remaining rows have '1' on the diagonal
and '0' elsewhere. For example, if p = 4, then
$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{3.3}$$
$$T = U Z^T \tag{3.4}$$
where $T$ is a $p \times n$ matrix. With this transformation, the first two rows of $T$ hold,
for every observation, the projection onto the plane given by $(\alpha^*, \beta^*)$. Structure
removal is then performed by applying a transformation ($\Theta$) which transforms the first
two rows of $T$ to a standard normal and leaves the rest unchanged
(Martinez, 2004). This is where the structure is removed, making the data normal in
that projection (the first two rows). The transformation is defined as follows:
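$$\left. \begin{aligned} \Theta(T_1) &= \phi^{-1}\left[F(T_1)\right] \\ \Theta(T_2) &= \phi^{-1}\left[F(T_2)\right] \\ \Theta(T_i) &= T_i, \quad i = 3,4,\dots,p \end{aligned} \right\} \tag{3.5}$$
where $T_1$ and $T_2$ are the first two rows of the matrix $T$ and $F$ is a function defined in Eq. (3.7).
From Eq. (3.5), it is seen that only the first two rows of $T$ change. $T_1$ and $T_2$ can
be written as
$$\begin{aligned} T_1 &= \left(z_1^{\alpha^*}, z_2^{\alpha^*}, \dots, z_j^{\alpha^*}, \dots, z_n^{\alpha^*}\right) \\ T_2 &= \left(z_1^{\beta^*}, z_2^{\beta^*}, \dots, z_j^{\beta^*}, \dots, z_n^{\beta^*}\right) \end{aligned} \tag{3.6}$$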
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the jth observation projected onto the plane
spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as
follows:
$$\begin{aligned} \tilde{z}_j^{1(t)} &= z_j^{1(t)} \cos\gamma + z_j^{2(t)} \sin\gamma \\ \tilde{z}_j^{2(t)} &= z_j^{2(t)} \cos\gamma - z_j^{1(t)} \sin\gamma \end{aligned} \tag{3.7}$$
where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_j^{1(t)}$ represents the jth element of $T_1$ at the tth
iteration of the process. Applying the following transformation to the rotated points of
Eq. (3.7) replaces each rotated observation by its normal score in the
projection.
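$$\begin{aligned} z_j^{1(t+1)} &= \phi^{-1}\left\{ \frac{r\left(\tilde{z}_j^{1(t)}\right) - 0.5}{n} \right\} \\ z_j^{2(t+1)} &= \phi^{-1}\left\{ \frac{r\left(\tilde{z}_j^{2(t)}\right) - 0.5}{n} \right\} \end{aligned} \tag{3.8}$$
where $r\left(\tilde{z}_j^{1(t)}\right)$ represents the rank of the rotated value $\tilde{z}_j^{1(t)}$ among the $n$ observations.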
With this procedure, the projection index is reduced by making the data more
normal. During the first few iterations, the projection index should decrease rapidly
(Friedman, 1987). After approximate normality is obtained, the index might oscillate
with small changes. Usually, the process takes between 5 and 15 complete iterations to
remove the structure. Once the structure is removed using this process, the data are
transformed back using the following equation:
$$Z' = U^T \Theta\left(U Z^T\right) \tag{3.9}$$
From matrix theory (Strang, 1988), it is known that all directions that are
orthogonal to the structure (i.e., all rows of T other than the first two) have not been
changed, whereas the structure has been Gaussianized and then transformed back.
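One complete pass of the rotation and normal-score steps can be sketched with the Python standard library. This is an illustrative reading of Eqs. (3.7) and (3.8), assuming that $\phi^{-1}$ denotes the inverse standard normal distribution function and that the normal-score step is applied after each of the four rotations; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by the normal score of its rank, as in Eq. (3.8)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return scores

def gaussianize_once(z1, z2, angles=(0.0, math.pi / 4, math.pi / 8, 3 * math.pi / 8)):
    """One complete iteration: rotate by each angle, then take normal scores."""
    for gamma in angles:
        cos_g, sin_g = math.cos(gamma), math.sin(gamma)
        r1 = [a * cos_g + b * sin_g for a, b in zip(z1, z2)]   # Eq. (3.7), first row
        r2 = [b * cos_g - a * sin_g for a, b in zip(z1, z2)]   # Eq. (3.7), second row
        z1, z2 = normal_scores(r1), normal_scores(r2)
    return z1, z2

# Toy projected coordinates (first two rows of T) for four observations.
t1 = [0.1, 2.0, -1.0, 0.5]
t2 = [1.0, -0.3, 0.2, 0.7]
g1, g2 = gaussianize_once(t1, t2)
print(g1)
print(g2)
```

In practice this pass would be repeated until the projection index stops decreasing, which the text above places at roughly 5 to 15 iterations.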
The next section summarizes the steps of PP.
3.2.2.4 Steps of PP
1. Load the data and set the values of the parameters: number of best
projection planes (N), number of random starts (m), value of c
and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) (3.3.4.2) using
the Posse chi-square index.
4. Remove the structure (to reduce the effect of local optima) and find another
structure (3.3.4.3) until the projection pursuit index stops changing.
5. Continue the process until the best projection planes (orthogonal to each other)
are obtained.
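Thus
$$v = \frac{1}{m\lambda} \sum_{j=1}^{m} x_j x_j^T v = \frac{1}{m\lambda} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \tag{3.12}$$
since $(x x^T) v = (x \cdot v)\, x$.
In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions $v$ with
$\lambda \neq 0$ lie in the span of $x_1, \dots, x_m$.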
2. Find the eigenvalues $\lambda \ge 0$ and corresponding non-zero eigenvectors
$v \in H \setminus \{0\}$ of the covariance matrix $C$ from the equation
$$\lambda v = C v \tag{3.15}$$
3. As shown previously (for PCA), all solutions $v$ with $\lambda \neq 0$ lie in the span of
$\Phi(x_1), \dots, \Phi(x_m)$, i.e.,
$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.16}$$
Therefore,
$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.17}$$
$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) \Phi(x_i)^T \Phi(x_j) \tag{3.18}$$
$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) K(x_i, x_j) \tag{3.19}$$
matrix $K$, called the kernel matrix, whose ijth element is the inner-product
kernel $K(x_i, x_j)$. The vector $\alpha$ of length $m$, whose jth element is the coefficient
$\alpha_j$.
$$\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_j) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_i) \Phi(x_i)^T \Phi(x_j), \quad \forall\, k = 1,2,\dots,m \tag{3.20}$$
$$m\lambda K \alpha = K^2 \alpha \tag{3.21}$$
To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be
solved:
$$m\lambda \alpha = K \alpha \tag{3.22}$$
7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel
matrix $K$. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ be the eigenvalues of $K$ and $\beta_1, \beta_2, \dots, \beta_m$ be
the corresponding set of eigenvectors, with $\lambda_p$ being the last non-zero
eigenvalue.
Figure 3.4: (a) Input points before kernel PCA (b) Output after kernel PCA.
The three groups are distinguishable using the first component
only (Wikipedia, 2010).
$H$. Then
$$\langle \beta_n, \Phi(x) \rangle = \sum_{i=1}^{m} \beta_{n,i} \langle \Phi(x_i), \Phi(x) \rangle \tag{3.23}$$
9. In the above algorithm, it has been assumed that the data set is centered, but
it is certainly difficult to obtain the mean of the mapped data in feature space
$H$ (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in
Figure 3.5 provides the outline of the KPCA algorithm.
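The eigenvalue problem of Eq. (3.22) and the projection step can be sketched numerically as follows. This is an illustration under stated assumptions, not the thesis implementation: a Gaussian kernel with $\sigma = 1$ is assumed, the kernel matrix is centered with the widely used expression $K - 1_m K - K 1_m + 1_m K 1_m$ (where $1_m$ is the $m \times m$ matrix with all entries $1/m$) because the centering equation of this chapter is not reproduced here, and the data and function names are hypothetical.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian Gram matrix; X has one data point per row (assumed kernel form)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kernel_pca(K, n_components):
    """Project the training points onto the leading kernel principal components."""
    m = K.shape[0]
    one_m = np.full((m, m), 1.0 / m)
    # Assumed standard centering of the kernel matrix in feature space.
    Kc = K - one_m @ K - K @ one_m + one_m @ K @ one_m
    eigvals, eigvecs = np.linalg.eigh(Kc)         # eigenvalues of K equal m * lambda
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale coefficients so the feature-space eigenvectors have unit length.
    betas = eigvecs / np.sqrt(eigvals)
    return Kc @ betas                             # shape (m, n_components)

# Two synthetic clusters of 10 points each in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
projections = kernel_pca(rbf_gram(X), n_components=2)
print(projections[:, 0])
```

For well separated clusters such as these, the first component alone is usually enough to tell the groups apart, in the spirit of Figure 3.4.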
3.2.4.1 Automated target generation process algorithm (ATGP)
In hyperspectral image analysis a pixel may encompass many different
materials; such pixels are called mixed pixels, and they contain multiple spectral
signatures. Let a column vector $r_i$ represent the ith mixed pixel through the linear model
$$r_i = M \alpha_i + n_i \tag{3.25}$$
where $r_i$ is an $l \times 1$ column vector and $l$ is the
number of spectral bands. Each distinct material in the mixed pixel is called an
endmember. Assume that there are $p$ spectrally distinct endmembers in the ith
mixed pixel. $M$ is a matrix of dimension $l \times p$ made up of linearly independent
columns, denoted by $(m_1, m_2, \dots, m_j, \dots, m_p)$. The system is
considered over-determined ($l > p$), and $m_j$ denotes the spectral signature of
the jth endmember. $\alpha_i = (\alpha_1, \alpha_2, \dots, \alpha_j, \dots, \alpha_p)^T$ is a $p \times 1$ vector
whose jth element represents the fraction of the jth
signature present in the ith mixed pixel. $n_i$ is an $l \times 1$ column vector representing
white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where $I$ is the $l \times l$
identity matrix.
In Eq. (3.25), the $r_i$ are assumed to be a linear combination of $p$ endmembers with
the weight coefficients given by the fraction vector $\alpha_i$. The term $M \alpha_i$ can be
rewritten to separate the desired spectral signature from the undesired signatures,
i.e. to separate targets from background. In searching for a single
spectral signature this can be written as:
$$M \alpha = d \alpha_p + U \gamma \tag{3.26}$$
the remaining column vectors from $M$. These are the undesired spectral signatures or
background information. This is given by $U = (m_1, m_2, \dots, m_j, \dots, m_{p-1})$ with
(fractions) of $\alpha$
Suppose $P$ is an operator which eliminates the effects of $U$, the undesired
signatures. To do this, an orthogonal subspace operator has been developed
that projects $r$ onto a subspace orthogonal to the columns of $U$. This results in
a vector that only contains energy associated with the target $d$ and noise $n$. The
operator used is the $l \times l$ matrix
$$P = I - U (U^T U)^{-1} U^T \tag{3.27}$$
The operator $P$ maps $d$ into a space orthogonal to the space spanned by the
uninteresting signatures in $U$. Applying the operator $P$ to the mixed pixel $r$ from
Eq. (3.25) gives
$$P r = P d \alpha_p + P U \gamma + P n \tag{3.28}$$
The operator $X^T$ acting on $Pr$ will produce a scalar (Ientilucci, 2001). The SNR is
given by
$$\lambda = \frac{X^T P d\, \alpha_p^2\, d^T P^T X}{X^T P\, E\left[n n^T\right] P^T X} \tag{3.31}$$
$$\lambda = \left(\frac{\alpha_p^2}{\sigma^2}\right) \frac{X^T P d d^T P^T X}{X^T P P^T X} \tag{3.32}$$
where $E[\,\cdot\,]$ denotes the expected value. Maximization of this quotient is the
generalized eigenvector problem
$$P d d^T P^T X = \tilde{\lambda}\, P P^T X \tag{3.33}$$
where $\tilde{\lambda} = \lambda \left(\frac{\sigma^2}{\alpha_p^2}\right)$. The value of $X^T$ which maximizes $\lambda$ can be determined in general
using techniques outlined by Miller, Farison and Shin (1992) and the idempotent and
symmetric properties of the interference rejection operator. It turns out that the value
of $X^T$ which maximizes the SNR is
$$X^T = k d^T \tag{3.34}$$
where $k$ is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is
seen that the overall classification operator for a desired hyperspectral signature in
the presence of multiple undesired signatures and white noise is given by the $1 \times l$
vector
$$q^T = d^T P \tag{3.35}$$
This operator first nulls the interfering signatures and then uses a matched filter for
the desired signature to maximize the SNR. When the operator is applied to all of the
pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar which is a
measure of the presence of the signature of interest. The ultimate aim is to reduce the
$l$ images that make up the hyperspectral image cube to a single image where pixels
with high intensity indicate the presence of the desired signature.
This operator can be easily extended to seek out $k$ signatures of interest. The
vector operator simply becomes a $k \times l$ matrix operator which is given by,
When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral
scene, each $l \times 1$ pixel is reduced to a scalar for each desired signature, so the
$l$-dimensional hyperspectral image reduces to a single-band feature-extracted image in which
pixels with high intensity indicate the presence of that signature. Thus, for $k$
desired signatures, the hyperspectral image is reduced to a $k$-dimensional feature-extracted
image in which each band corresponds to one desired signature.
The above algorithm is illustrated with the following example.
Let us start with three vectors or classes, each six elements or bands long. The
vectors are in reflectance units and are given below.
$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \quad \text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \quad \text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$
Suppose the image consists of 100 pixels numbered from left to right, and let the 40th pixel
be
$$\text{pixel}_{40} = (0.08)\,\text{concrete} + (0.75)\,\text{tree} + (0.07)\,\text{water} + \text{noise} \tag{3.37}$$
Let us assume that the noise is zero. If all the pixel mixture fractions have been
defined, a particular class spectrum can be chosen for extraction from the image. Suppose
the concrete material has to be extracted throughout the image. The same procedure can
be followed to extract the water and tree materials.
Assume that $\text{pixel}_{40}$ is made up of some weighted linear combination of
endmembers:
$$\text{pixel}_{40} = M \alpha + \text{noise} \tag{3.38}$$
columns of $U$ using the operator $P$. In other words, $P$ maps $d$ into a space orthogonal
to the space spanned by the undesired signatures while simultaneously minimizing
the effects of $U$. If $P$ is applied to $U$, which contains tree and water, it is seen
that the effect of $U$ is minimized.
$$P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \tag{3.39}$$
Next, the operator $X^T$ which maximizes the signal-to-noise
ratio (SNR) needs to be found. The operator $X^T$ acting on $Pr_1$ will produce a scalar. As stated before, the
value of $X^T$ which maximizes the SNR is $X^T = k d^T$. This leads to the overall OSP
operator (Eq. (3.35)). In this way the matrix $Q$ in Eq. (3.36) can be formed. The
entire data vector can then be projected along the columns of $Q$ to form the OSP
feature-extracted image.
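The worked example can be checked numerically with a short sketch. It assumes that concrete is the desired signature $d$, that tree and water form the undesired matrix $U$, and that the noise is zero; the final division by $q^T d$ is an added normalisation (not part of Eq. (3.35)) so that the output can be read as a fraction. The function name is hypothetical.

```python
import numpy as np

concrete = np.array([0.26, 0.30, 0.31, 0.31, 0.31, 0.31])
tree = np.array([0.07, 0.07, 0.11, 0.54, 0.55, 0.54])
water = np.array([0.07, 0.13, 0.19, 0.25, 0.30, 0.34])

def osp_operator(d, U):
    """Return the projector P of Eq. (3.27) and the operator q^T = d^T P of Eq. (3.35)."""
    l = U.shape[0]
    P = np.eye(l) - U @ np.linalg.inv(U.T @ U) @ U.T
    return P, d @ P

d = concrete                                  # desired signature
U = np.column_stack([tree, water])            # undesired signatures
P, q = osp_operator(d, U)

# Mixed pixel with the fractions used in the example and zero noise (assumed).
pixel40 = 0.08 * concrete + 0.75 * tree + 0.07 * water

print(np.round(P @ U, 12))                    # undesired signatures are nulled
score = q @ pixel40                           # scalar output of the OSP operator
fraction = score / (q @ d)                    # added normalisation, see lead-in
print(fraction)
```

Stacking one such vector `q` per desired signature gives a matrix operator with one row for each signature of interest.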
For any class $k$, the posterior probability for a pixel vector $x$ is denoted by $p_k(c_k | x)$:
$$p_k(c_k | x) = \frac{P(x | c_k)\, P(c_k)}{\sum_{j=1}^{K} f_j(x)\, P(c_j)} \tag{3.41}$$
Therefore, the Bayes decision rule is:
$$x \in c_i \ \text{ if } \ p_i(c_i | x) = \max_{k}\, p_k(c_k | x) \tag{3.41a}$$
3.3.2 Gaussian maximum likelihood classification (GML):
The Gaussian maximum likelihood classifier assumes that the data points of each class follow a
Gaussian (normal) distribution and classifies an unknown pixel based on the variance and
covariance of the spectral response patterns. The classification is based on the probability density
function associated with the training data. Each pixel is assigned to the most likely class based on a
comparison of the posterior probabilities that it belongs to each of the signatures being considered.
Under the Gaussian assumption, the distribution of a category response pattern can be completely described
by the mean vector and the covariance matrix. With these parameters, the statistical probability of
a given pixel value being a member of a particular land cover class can be computed (Lillesand et
al., 2002). GML classification attains the minimum classification error under the assumption that
the spectral data of each class are normally distributed. It considers not only the cluster centre but
also its shape, size and orientation by calculating a statistical distance based on the mean values
and covariance matrix of the clusters. The discriminant function for the GML classification is:
$$g_k(x) = -\frac{1}{2} \left[ \ln \left| \hat{\Sigma}_k \right| + (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) \right] \tag{3.42}$$
and the final Bayesian decision rule is:
$$x \in c_j \ \text{ if } \ g_j(x) = \max_{k}\, g_k(x)$$
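The discriminant of Eq. (3.42) and the decision rule can be sketched as below. This is a minimal illustration on synthetic two-band data, assuming equal prior probabilities (Eq. (3.42) contains no prior term); the function names are hypothetical.

```python
import numpy as np

def fit_gml(X, y):
    """Estimate the mean vector and covariance matrix of each class.

    X has shape (samples, bands); y holds the integer class labels.
    """
    params = {}
    for label in np.unique(y):
        Xk = X[y == label]
        params[int(label)] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def discriminant(x, mean, cov):
    """g_k(x) of Eq. (3.42)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))

def classify(x, params):
    """Assign x to the class with the largest discriminant value."""
    return max(params, key=lambda k: discriminant(x, *params[k]))

# Synthetic training pixels: two classes with 50 pixels each in two bands.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gml(X, y)

print(classify(np.array([0.2, -0.1]), params))
print(classify(np.array([4.8, 5.3]), params))
```

With few training pixels per class and many bands the estimated covariance matrix can become singular, which is one practical reason for applying FE before GML.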
constant. An unlabelled vector, i.e. a test pixel, is classified by assigning the label which
is most frequent among the k training samples nearest to that test pixel.
Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified
either to the first class of squares or to the second class of triangles. If k
= 3, it is classified to the second class because there are 2 triangles and
only 1 square inside the inner circle. If k = 5, it is classified to the first class
(3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified
to the first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).
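The voting scheme of Figure 3.6 can be sketched with the standard library. The example assumes Euclidean distance and resolves ties simply by the order in which labels are first encountered, which is a simplification of the tie rule used in this work; the coordinates are made up so that k = 3 and k = 5 give different answers, as in the figure.

```python
import math
from collections import Counter

def knn_classify(x, train_pixels, train_labels, k):
    """Label x with the most frequent class among its k nearest training pixels."""
    distances = sorted(
        (math.dist(x, pixel), label)
        for pixel, label in zip(train_pixels, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training pixels around a test pixel at the origin.
train_pixels = [(0.5, 0.0), (0.0, 0.6), (0.7, 0.0), (0.0, -1.0), (-1.1, 0.0)]
train_labels = ["triangle", "triangle", "square", "square", "square"]
test_pixel = (0.0, 0.0)

print(knn_classify(test_pixel, train_pixels, train_labels, k=3))
print(knn_classify(test_pixel, train_pixels, train_labels, k=5))
```

The change of label between the two values of k shows why the number of nearest neighbors has to be tuned for each data set.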
where $x = (x_{11}, x_{12}, \dots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \dots, y_{in})$ and $D = \{ d_1, d_2, \dots, d_p \}$; $p$ is the number of TP
$$x \in c_j \ \text{ if minimum element of } D \text{ corresponding to } c_j \text{ is } \begin{cases} \left[ \dfrac{k}{2} \right] + 1, & k \text{ even} \\[2ex] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \tag{3.44}$$
In case of a tie, the test pixel is assigned to the class $c_j$ if its distance from the mean
where $k_i$ ($i = 1,2,\dots,p$) is a user-defined parameter which implies the number of
principle, which has been shown to be superior (Gunn et al., 1997) to the traditional
Empirical Risk Minimization (ERM) principle employed by conventional neural
networks. SRM minimizes an upper bound on the expected risk, as opposed to ERM,
which minimizes the error on the training data. SVMs were developed to solve the
classification problem, but they have since been extended to the domain of
regression problems (Vapnik et al., 1997).
SVM is basically a linear learning machine based on the principle of optimal
separation of classes. The aim is to find a hyperplane which linearly separates the
classes of interest. The linear separating hyperplane is placed between the classes in
such a way that it satisfies two conditions:
(i) All the data vectors that belong to the same class are placed on the same side of
the separating hyperplane.
(ii) The distance between the two closest data vectors in both classes is maximized (Vapnik, 1982).
The main aim of SVM is to define an optimum hyperplane between two classes
which maximizes the margin between them. For each class, the data vectors
forming the boundary of the class are called the support vectors (SV) and the
hyperplane is called the decision surface (Pal, 2002).
number of points which can be separated into two classes without error in all
possible $2^k$ ways (Varshney and Arora, 2004).
data vector with each sample belonging to either of the two classes labeled by
$y \in \{ -1, +1 \}$. These samples are said to be linearly separable if there exists a
hyperplane in m-dimensional space whose orientation is given by a vector $w$ and
whose location is determined by a scalar $b$, the offset of this hyperplane from the origin
(Figure 3.8). If such a hyperplane exists, then the given set of training data
points must satisfy the following inequalities:
$$w \cdot x_i + b \ge +1, \quad \forall\, i : y_i = +1 \tag{3.45}$$
$$w \cdot x_i + b \le -1, \quad \forall\, i : y_i = -1 \tag{3.46}$$
Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after
Gunn, 1998).
The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality:
$$y_i (w \cdot x_i + b) \ge 1 \tag{3.47}$$
Thus, the decision rule for the linearly separable case can be defined in the following
form:
$$x_i \in \operatorname{sign}(w \cdot x_i + b) \tag{3.48}$$
where sign(·) is the signum function, whose value is +1 for any argument greater than
or equal to zero and –1 otherwise. The signum function can thus
represent the two classes given by the labels +1 and –1.
The separating hyperplane (Figure 3.8) will be able to separate the two classes
optimally when its margin from both the classes is equal and maximum (Varshney,
2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation or margin for a
where $\|\cdot\|_2$ denotes the second norm, which is equivalent to the Euclidean length of
the vector for which it is computed, and $|\cdot|$ is the absolute value function. Let
$d$ be the value of the margin between the two separating planes. To maximize the
margin, express the value of $d$ as
$$d = \frac{w \cdot x + b + 1}{\|w\|_2} - \frac{w \cdot x + b - 1}{\|w\|_2} = \frac{2}{\|w\|_2} = \frac{2}{\sqrt{w^T w}} \tag{3.49a}$$
To obtain an optimal hyperplane, the margin value $d$, i.e. $\dfrac{2}{\|w\|_2}$, should be maximized
as shown:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3.51}$$
becomes
$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \tag{3.52}$$
$$\sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3.53}$$
$$\lambda_i \ge 0, \quad i = 1,2,\dots,n \tag{3.54}$$
where $n_s$ is the number of support vectors found. Substituted these in Eq. (3.51) to
The offset from the origin ($b_0$) is determined from the equation given below:
$$b_0 = -\frac{1}{2} \left[ w^0 \cdot x^0_{+1} + w^0 \cdot x^0_{-1} \right] \tag{3.56}$$
where $x^0_{+1}$ and $x^0_{-1}$ are support vectors of class labels +1 and -1 respectively. The
following decision rule (obtained from Eq. (3.48)) is then applied to classify the data
vectors into the two classes +1 and -1:
$$f(x) = \operatorname{sign}\left( \sum_{\text{support vectors}} y_i \lambda_i^0 (x_i \cdot x) + b_0 \right) \tag{3.57}$$
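A toy evaluation of the weight vector, the offset and the decision rule is sketched below. The two support vectors and their multipliers are hypothetical values chosen so that the margin constraints of Eqs. (3.45) and (3.46) hold with equality, and the offset is taken as minus one half of the bracketed sum, which is the form consistent with those constraints.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight_vector(svs, labels, lambdas):
    """w = sum_i lambda_i * y_i * x_i over the support vectors, Eq. (3.51)."""
    dim = len(svs[0])
    return [sum(l * y * x[d] for x, y, l in zip(svs, labels, lambdas))
            for d in range(dim)]

def offset(w, x_pos, x_neg):
    """b0 from one support vector of each class (sign convention assumed, see lead-in)."""
    return -0.5 * (dot(w, x_pos) + dot(w, x_neg))

def decide(x, svs, labels, lambdas, b0):
    """Decision rule of Eq. (3.57); sign(0) is taken as +1."""
    s = sum(y * l * dot(sv, x) for sv, y, l in zip(svs, labels, lambdas)) + b0
    return 1 if s >= 0 else -1

# Hypothetical two-point problem: one support vector per class.
svs = [(2.0, 2.0), (0.0, 0.0)]
labels = [+1, -1]
lambdas = [0.25, 0.25]             # satisfies sum_i lambda_i * y_i = 0

w = weight_vector(svs, labels, lambdas)
b0 = offset(w, svs[0], svs[1])
print(w, b0)
print(decide((3.0, 1.0), svs, labels, lambdas, b0))
print(decide((0.0, 1.0), svs, labels, lambdas, b0))
```

Only the support vectors enter the sum, which is why the classification time grows with their number.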
Generally, it may not be possible to separate the classes optimally by a linear
hyperplane, and a non-linear manifold in hyperspace is then required for
optimal separation among the classes. The data present in m-dimensional space can
be mapped into a higher dimensional space where they spread out and can be separated
by a linear hyperplane in that space, as shown in Figure 3.9.
Suppose the non-linear transformation function $\phi$ maps the data into a higher
(a) Input space (b) Feature space
Putting Eq. (3.60) into Eq. (3.59), the modified form of the dual optimization problem
becomes:
$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n_t} \lambda_i - \frac{1}{2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \tag{3.61}$$
$$\sum_{i=1}^{n_t} \lambda_i y_i = 0 \tag{3.62}$$
Some of the commonly used kernel functions for classification are presented in Table
3.2. Selection of suitable kernel function is essential for better classification of a
particular data set. The details on effects of different kernel functions on
classification accuracy are available in Varshney and Arora (2004).
Originally, SVMs were developed to perform binary classification. They have
since been extended to multiclass classification, where the number of classes is more than
two. Pal (2004) proposed two multiclass classification methods: one against the
rest, and pairwise classification. In the first, K binary
classifiers are created for a K-class problem, where each classifier is trained to
distinguish one class from the other K − 1 classes. The second approach
considers one pair of classes at a time and performs SVM-based binary classification
to assign each pixel to one of the two classes under consideration. A total of
$K(K-1)/2$ pairs of classes are possible for a K-class problem, and thus that many SVM
binary classifiers are to be created. A pixel is finally assigned to the class chosen
by the largest number of the $K(K-1)/2$ SVM classifiers (Varshney and
Arora, 2004).
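The pairwise voting step can be sketched as follows. The binary classifiers here are hypothetical stand-ins (nearest class centre on a one-dimensional feature) used in place of trained SVMs, so only the voting logic is illustrated.

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(x, classes, binary_classifiers):
    """Assign x to the class that wins the most pairwise decisions."""
    votes = Counter(binary_classifiers[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical class centres; each stub returns the class whose centre is closer.
centers = {"water": 0.1, "tree": 0.5, "concrete": 0.9}
classes = list(centers)

def make_stub(a, b):
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

binary_classifiers = {pair: make_stub(*pair) for pair in combinations(classes, 2)}

print(len(binary_classifiers))                      # K(K-1)/2 classifiers for K = 3
label = pairwise_vote(0.45, classes, binary_classifiers)
print(label)
```

For the one against the rest scheme, the dictionary would instead hold K classifiers, one per class.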
Figure 3.10 shows summary of the SVM classification algorithm.
3.3.4.4 SMO optimization for SVM
Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly
solve the SVM QP problem without any extra matrix storage and without using
numerical QP optimization steps at all. SMO decomposes the overall QP problem into
QP sub-problems, using Osuna’s theorem (Osuna, 1997) to ensure convergence.
Unlike the previous methods, SMO chooses to solve the smallest possible
optimization problem at every step. For the standard SVM QP problem, the smallest
possible optimization problem involves two Lagrange multipliers, because the
Lagrange multipliers must obey a linear equality constraint. At every step, SMO
chooses two Lagrange multipliers to jointly optimize, finds the optimal values for
these multipliers, and updates the SVM to reflect the new optimal values. The
advantage of SMO lies in the fact that solving for two Lagrange multipliers can be
done analytically. Thus, numerical QP optimization is avoided entirely. Even though
more optimization sub-problems are solved in the course of the algorithm, each sub-
problem is so fast that the overall QP problem is solved quickly. In addition, SMO
requires no extra matrix storage at all. Thus, very large SVM training problems can
fit inside the memory of an ordinary personal computer or workstation. Because no
matrix algorithms are used in SMO, it is less susceptible to numerical precision
problems. There are two components to SMO: an analytic method for solving for the
two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.
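The analytic step for one pair of multipliers can be sketched as below. The update formulas follow the commonly published description of SMO and are not taken from this chapter; the upper bound `c` is an assumed soft-margin parameter (infinity reproduces the constraint $\lambda_i \ge 0$), and the heuristic for choosing the pair is omitted.

```python
import math

def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c=math.inf):
    """Jointly optimize two multipliers analytically.

    e1 and e2 are the current prediction errors f(x_i) - y_i, and the k values
    are kernel evaluations between the two training points.
    """
    eta = k11 + k22 - 2.0 * k12            # curvature along the constraint line
    if eta <= 0:
        return a1, a2                      # degenerate case, skipped in this sketch
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)
    a2_new = a2 + y2 * (e1 - e2) / eta     # unconstrained optimum
    a2_new = min(max(a2_new, low), high)   # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)  # keep sum_i lambda_i * y_i unchanged
    return a1_new, a2_new

# Two-point problem with a linear kernel: x1 = (2, 2), y1 = +1; x2 = (0, 0), y2 = -1.
# Starting from zero multipliers the decision value is 0, so e_i = -y_i.
result = smo_pair_update(a1=0.0, a2=0.0, y1=+1, y2=-1,
                         e1=-1.0, e2=1.0, k11=8.0, k12=0.0, k22=0.0)
print(result)
```

Because each update is a closed-form expression, no numerical QP solver is needed inside the loop, which is the point made in the paragraph above.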
In this thesis, all the computations regarding the SMO optimization method have
been done with the Matlab built-in function “SVMSMOSET”.
3.3.4.4 KPCA-SVM
Nonlinear SVM is generally more accurate than linear SVM. However, it is slow, and
the time taken for classification increases linearly with the number of SVs. Reduced set
methods try to speed up SVM classification by reducing the
number of SVs (Burges and Scholkopf, 1996). This section presents the technique of
reducing the number of SVs using the KPCA algorithm (Sundaram, 2009). It should be
kept in mind that the space spanned by the original set of SVs is always equivalent
to the space spanned by the reduced set of SVs. This is the criterion for choosing the minimum
number of SVs to improve the classification time.
The solution of the optimization problem in Eq. (3.52) is obtained in terms of
Lagrange multipliers. SVs are extracted by solving Eq. (3.52). The algorithm for
this method is stated below.
1. First choose an appropriate kernel function. Then calculate the kernel matrix $K_{xx}$:
$$K_{xx}(i, j) = K(x_i, x_j) \tag{3.64}$$
where $i, j = 1,2,\dots,N$
2. Center the kernel matrix $K_{xx}$:
$$K_{xx}^{c} = H K_{xx} H \tag{3.65}$$
where $H = I - \frac{1}{N} I$, $I$ is the $N \times N$ identity matrix and $H$ is the centering matrix.
Sundaram (2009) used Eq. (3.65) to center the kernel matrix. However, according to
the literature, the kernel matrix should be centered using Eq. (3.24), which is the
standard procedure for centering a kernel matrix.
3. Perform kernel PCA by implementing an eigen value decomposition on the
   centered kernel matrix ($K_{xx}^{c}$):
   $K_{xx}^{c} = A \Lambda A^{T}$   (3.66)
4. Sort the eigen values and corresponding eigen vectors. Discard eigen values
smaller than a threshold. A value of $10^{-5}$ has been used in this thesis work.
This was done to prevent numerical problems in the later stages of the
algorithm.
5. Calculate the normalized principal directions:
   $V_k = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{N} a_{jk}\,\tilde{\Phi}(x_j)$   (3.67)
   where $\tilde{\Phi}(x_j) = \Phi(x_j) - \frac{1}{N}\sum_{i=1}^{N}\Phi(x_i)$.
   In matrix form this becomes:
   $V = K A \Lambda^{-1/2}$   (3.68)
   Select the first M principal directions, which together retain 99% of the total
   variance.
6. Calculate the new SVs by choosing the projections on the principal directions from
   a uniform distribution $U[-\sigma_k, +\sigma_k]$, where $\sigma_k = \frac{\sqrt{\lambda_k}}{N}$. In matrix form it
   becomes
   $V = V R$   (3.69)
   where $R = \frac{1}{N}\Lambda^{1/2} U$
   and $U$ is a matrix of points chosen from the uniform distribution $U[-1, +1]$.
7. Each column of $V$ corresponds to a new SV. Now project the images of the old SVs
   ($\Phi(x_i)$) along the directions of the new set of SVs (i.e. along the directions of the PCs):
   $\Phi(z_k) = \sum_{i=1}^{N} V_{ik}\,\Phi(x_i)$   (3.70)
   This ensures that both SVMs produce the same results for all the $z_k$, $k = 1, 2, \ldots, M$.
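Steps 2 to 4 of the algorithm above, together with the 99% variance selection of step 5, can be sketched as follows. This is only a minimal Python/NumPy illustration and not the implementation used in this thesis. The function name `centered_kernel_eig`, the random input points and the RBF width are assumptions made for the example; the centering matrix is taken as H = I - (1/N) 1 1^T.

```python
import numpy as np

def centered_kernel_eig(K, eig_tol=1e-5, var_keep=0.99):
    """Center a kernel matrix, decompose it and keep the leading directions."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix (step 2)
    Kc = H @ K @ H
    lam, A = np.linalg.eigh(Kc)               # eigh returns ascending eigen values (step 3)
    order = np.argsort(lam)[::-1]             # sort in descending order (step 4)
    lam, A = lam[order], A[:, order]
    keep = lam > eig_tol                      # discard eigen values below the threshold
    lam, A = lam[keep], A[:, keep]
    ratio = np.cumsum(lam) / np.sum(lam)      # cumulative share of retained variance
    m = min(int(np.searchsorted(ratio, var_keep)) + 1, lam.size)
    return Kc, lam[:m], A[:, :m]

# Hypothetical input: 20 points standing in for the original SVs
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-0.5 * d2)                         # RBF kernel matrix, width chosen arbitrarily
Kc, lam, A = centered_kernel_eig(K)
```

The returned eigen vectors `A` and eigen values `lam` are the quantities that enter Eq. (3.68); the sampling of new SVs in steps 6 and 7 is not covered by this sketch.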
Figure 3.11: Overview of KPCA_SVM algorithm
overall kappa (k) values obtained from different classification techniques were used
for the one-tail hypothesis testing (Congalton, 1991) for comparing any two
classification results. While the class-wise producer’s kappa ( kpa ) values were used to
   $Z_{12} = \dfrac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}$   (3.73)
Where, $\hat{k}_1$ and $\hat{k}_2$ are the kappa estimates obtained for the two classification
techniques under consideration and $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$ are the respective estimates of variances
for the kappa values observed. The z-statistic obtained is used for the one-tailed
hypothesis testing with the following null ($H_0$) and alternate ($H_1$) hypotheses:
   $H_0 : Z_{12} = k_1 - k_2 \le 0$
   $H_1 : Z_{12} = k_1 - k_2 > 0$   (3.74)
The null hypothesis chosen here is that, of the two classification results
$\hat{k}_1$ and $\hat{k}_2$, $\hat{k}_1$ is not significantly better than $\hat{k}_2$, which means that the first
classification technique is not significantly better than the second technique. The
alternate hypothesis states that the two classification results are
statistically different and that the result corresponding to $\hat{k}_1$ is statistically better
than that corresponding to $\hat{k}_2$, and thus it can be said that the first classification
of 95%, the null hypothesis can be rejected and it can be said with 95% confidence
that the two classification results are statistically different with the first one
performing better than the second one (Abhinav, 2009).
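The test described above can be illustrated with a short sketch. The kappa estimates and variances used here are hypothetical values chosen only for the example; the function names are likewise assumptions. The statistic follows Eq. (3.73) and the critical value of 1.65 corresponds to the one-tailed test at the 95% confidence level.

```python
import math

def kappa_z_statistic(k1, var1, k2, var2):
    """z-statistic of Eq. (3.73) for two kappa estimates and their variances."""
    return (k1 - k2) / math.sqrt(var1 + var2)

def first_is_significantly_better(k1, var1, k2, var2, z_critical=1.65):
    """Reject the null hypothesis when the z-statistic exceeds the critical value."""
    return kappa_z_statistic(k1, var1, k2, var2) > z_critical

# Hypothetical kappa estimates and variances (not taken from the experiments)
z_small_gap = kappa_z_statistic(0.95, 0.0001, 0.93, 0.0001)
z_large_gap = kappa_z_statistic(0.96, 0.0001, 0.93, 0.0001)
reject_small = first_is_significantly_better(0.95, 0.0001, 0.93, 0.0001)
reject_large = first_is_significantly_better(0.96, 0.0001, 0.93, 0.0001)
```

With equal variances, a difference of 0.02 in kappa is not enough to reject the null hypothesis in this example, whereas a difference of 0.03 is.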
[Figure: z-axis markers 0 and Zc = 1.65]
CHAPTER 4
EXPERIMENTAL DESIGN
This chapter will address the methodology followed for this thesis work.
Experiments were designed to investigate the best FE technique, classification
algorithm and best time saving strategy for HD. On the basis of conclusions from the
literature survey and recommendations for future work by Abhinav (2009), several
FE and classification algorithms have been tested which have potential for improving
classification accuracy and time for HD. The theoretical background of these
algorithms was presented in Chapter 3.
c) Kernel principal component analysis support vector machine
(KPCA_SVM) (Sundaram, 2009).
This chapter starts with the experimental details for the different FE and selection
techniques. It then explains the classification techniques, first for the parametric and
non-parametric classifiers and then for the advanced classifier.
4.1.1 SPCA
For SPCA, the complete data set is divided into subgroups on the basis of the
correlation between bands. PCA is then applied separately to each subgroup. After this
first transformation, features are selected from each subgroup using the variance
information (the first few PCs retaining 99% variance were selected). The selected
features are then regrouped and transformed again to compress the data further. The
flowchart of SPCA method is shown in Figure 4.1.
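The procedure described above can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the code used for the experiments: the subgroups are assumed to be contiguous blocks of bands whose sizes are already known from the correlation image, the 99% variance rule is assumed to be applied in the second transformation as well, and the function names and random input are hypothetical.

```python
import numpy as np

def pca_keep_variance(X, var_keep=0.99):
    """PCA scores of X (pixels x bands), keeping the first PCs up to var_keep."""
    Xc = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    lam, vec = np.linalg.eigh(cov)            # ascending eigen values
    order = np.argsort(lam)[::-1]
    lam, vec = lam[order], vec[:, order]
    ratio = np.cumsum(lam) / lam.sum()
    m = min(int(np.searchsorted(ratio, var_keep)) + 1, lam.size)
    return Xc @ vec[:, :m]

def segmented_pca(X, block_sizes, var_keep=0.99):
    """Apply PCA per block of bands, regroup the kept PCs and transform again."""
    assert sum(block_sizes) == X.shape[1]
    parts, start = [], 0
    for size in block_sizes:
        parts.append(pca_keep_variance(X[:, start:start + size], var_keep))
        start += size
    regrouped = np.hstack(parts)              # selected features from all subgroups
    return pca_keep_variance(regrouped, var_keep)

# Hypothetical data: 200 pixels, 65 bands split into blocks of 32, 6 and 27 bands
rng = np.random.default_rng(0)
pixels = rng.normal(size=(200, 65))
spcs = segmented_pca(pixels, block_sizes=(32, 6, 27))
```

Because the last step is an ordinary PCA, the resulting features are mutually uncorrelated.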
4.1.2 PP
For PP, Posse’s (1995a) algorithm was used in this research work, in which the OD
(n-dimensional) is projected onto a two-dimensional space. Thus the dimension of the PP
feature extracted data set is two. The chi-square projection pursuit index was chosen
here. The methodology adopted for PP method is shown in Figure 4.2.
4.1.3 KPCA
The number of PCs is equal to the number of TP used for FE. In this
experiment, up to 400 TP have been used for FE using the KPCA method. Hence,
the dimension of the KPCA feature extracted data set is up to 400. First, the TP are
mapped into feature space using different kernel functions (linear, polynomial and
Gaussian) in the form of a Gram matrix. Then the eigen values and eigen vectors of the
Gram matrix are calculated. Afterwards, the OD is mapped into kernel space using the
same kernel function (used for the TP) and projected along the directions of the eigen
vectors. Finally, the KPCA feature extracted data set is obtained. The outline of KPCA
method is shown in Figure 4.3.
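The procedure described above can be sketched as follows. This is a minimal Python/NumPy illustration, not the implementation used in this thesis. It assumes an RBF kernel with an arbitrary width `gamma`, uses the standard kernel PCA centering of the Gram matrix, and the function names and random data are hypothetical. The number of requested components must not exceed the number of positive eigen values.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel between every row of A and every row of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kpca_fit(train_pixels, gamma, n_components):
    """Eigen decomposition of the centered Gram matrix of the training pixels."""
    n = train_pixels.shape[0]
    K = rbf_kernel(train_pixels, train_pixels, gamma)
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones
    lam, vec = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:n_components]
    lam, vec = lam[order], vec[:, order]
    alphas = vec / np.sqrt(lam)               # unit-length directions in feature space
    return {"train": train_pixels, "gamma": gamma, "K": K, "alphas": alphas}

def kpca_transform(model, pixels):
    """Map pixels with the same kernel and project them on the eigen vectors."""
    K = model["K"]
    n = K.shape[0]
    Kt = rbf_kernel(pixels, model["train"], model["gamma"])
    ones_n = np.full((n, n), 1.0 / n)
    ones_t = np.full((pixels.shape[0], n), 1.0 / n)
    Ktc = Kt - ones_t @ K - Kt @ ones_n + ones_t @ K @ ones_n
    return Ktc @ model["alphas"]

# Hypothetical data: 30 training pixels and 100 image pixels with 5 bands each
rng = np.random.default_rng(0)
train = rng.normal(size=(30, 5))
image = rng.normal(size=(100, 5))
model = kpca_fit(train, gamma=0.1, n_components=4)
train_features = kpca_transform(model, train)
image_features = kpca_transform(model, image)
```

The Gram matrix has one row per training pixel, which is why the number of available components is bounded by the number of TP.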
Figure 4.3: KPCA feature extraction method
4.1.4 OSP
The dimensionality of the feature extracted data set depends upon the number of
classes present in the OD. OSP starts by finding the endmembers with the automated
target generation process (ATGP). The OD is then projected along the endmembers and
the feature extracted data set is obtained. The data set used for this thesis has eight
classes, so the number of endmembers is also eight. The dimension of the feature
extracted data set is equal to the number of endmembers. The brief description of
OSP method is shown in Figure 4.4.
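The two stages described above can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the implementation used in this thesis. ATGP is taken in its usual form (start from the pixel of largest norm, then repeatedly pick the pixel with the largest residual after projecting out the targets already found), and each OSP feature is assumed to be the projection of a pixel on one endmember after the other endmembers have been annihilated. Function names and the random data are hypothetical.

```python
import numpy as np

def orth_complement_projector(U):
    """P = I - U U^+, which annihilates every column of U (bands x targets)."""
    return np.eye(U.shape[0]) - U @ np.linalg.pinv(U)

def atgp(pixels, n_targets):
    """Return the indices of n_targets pixels chosen by successive projections."""
    idx = [int(np.argmax(np.sum(pixels**2, axis=1)))]
    for _ in range(1, n_targets):
        U = pixels[idx].T
        residual = pixels @ orth_complement_projector(U)   # P is symmetric
        idx.append(int(np.argmax(np.sum(residual**2, axis=1))))
    return idx

def osp_features(pixels, endmembers):
    """One feature per endmember d: d^T P_U r, with U the remaining endmembers."""
    feats = np.empty((pixels.shape[0], endmembers.shape[0]))
    for k in range(endmembers.shape[0]):
        d = endmembers[k]
        U = np.delete(endmembers, k, axis=0).T
        feats[:, k] = pixels @ (orth_complement_projector(U) @ d)
    return feats

# Hypothetical data: 100 pixels with 10 bands, eight endmembers as in the text
rng = np.random.default_rng(0)
pixels = rng.normal(size=(100, 10))
target_idx = atgp(pixels, n_targets=8)
features = osp_features(pixels, pixels[target_idx])
```

In this sketch a pixel that coincides with one endmember gives a zero response in the features of all other endmembers, which is the sense in which each feature is tied to one target.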
classification results. These numbers were chosen in order to consider the following
cases of training sample size.
Figure 4.5: Overview of classification procedure
Figure 4.6: Experimental scheme for Set-I experiments
For SVM_QP and SVM_SMO, quadratic programming optimization and
sequential minimal optimization methods were used respectively to solve the dual
optimization problem. The classification scheme for Set-II experiment is given in
Figure 4.7.
4.5 Parameters
Parameters also play an important role in HD classification, so choosing them is an
important task. All the parameters chosen for the different FE techniques
and classification algorithms are listed in Table 4.1.
FE techniques    Parameters
SPCA             Correlation matrix of the bands
PP               No. of random searches – 5
                 half – 15
                 Stopping value – 0.01
KPCA             Kernel function – rbf
OSP              No. of endmembers – 8

Classifiers      Parameters
GML              Confidence interval – 99%
KNN              Neighbors – 3, 4, 5, ..., 11 and 15
SVM              Kernel function – rbf
CHAPTER-5
RESULTS
This chapter provides observations for various experiments and interpretation of the
same. Starting with the visual interpretation of feature extracted data sets, the
chapter will discuss the result of GML classifier on feature-extracted data set. These
results are compared with the best result for GML as observed by Abhinav (2009).
Then it will discuss the effect of KNN classification algorithm on OD and feature
extracted data set followed by the discussion of the results of different SVM
algorithms.
Figure 5.1: Correlation image of the OD set consisting of three blocks having bands
32, 6 and 27 respectively.
The PP process finds two-dimensional structures sequentially, from the most
interesting to the less interesting. The two most interesting structures, in
decreasing order of interest, are given in Figure 5.2. The PP index after five
random searches was 0.3825 and the size of neighborhood (c) around the best
projection plane was 0.011. The total time taken to complete the whole process was about
11.30 hours. Table 5.1 presents the time required by each FE technique under
different constraints.
Table 5.1: The time taken for each FE technique
FE method               Time
SPCA                    6-8 seconds
KPCA with rbf kernel    1) 4 minutes for 25 TP
                        2) 5.5 minutes for 100 TP
                        3) 6.3 minutes for 200 TP
                        4) 8.5 minutes for 300 TP
                        5) 10 minutes for 400 TP
PP                      11.30 hours
(a) (b)   [axes in both panels: α* and β*]
Figure 5.2: Projection of the data points. (a) Most interesting projection direction
(b) Second most interesting projection direction.
The grayscale images of the feature extracted data obtained using various FE techniques are
provided in Figures 5.3 to 5.6, followed by the corresponding correlation images
shown in Figure 5.7.
(a) SPCA-1 (b) SPCA-2 (c) SPCA-3
Figure 5.3: First six Segmented Principal Components (SPCs) (b) shows water body
and salt lake
Figure 5.4: First six Kernel Principal Components (KPCs) obtained by using 400 TP
(a) OSP-1 (b) OSP-2 (c) OSP-3
Figure 5.5: First six features obtained by using eight end-members (b) shows
vineyards and wheat, (c) shows bare soil, (d) shows salt lake.
(a) PP -1 (b) PP -2
Figure 5.6: Two components of most interesting projections (a) shows salt lake.
(a) SPCA (b) KPCA   [color scale: 0.1 to 1]
(i) Since the extracted SPCs are ranked according to their eigen values, a higher
amount of information can easily be noticed in the first four SPCs. No
interesting structures could be visually identified beyond the 4th SPC. As SPCA uses
the local correlation of the bands rather than the global correlation (as PCA does), it
is able to decorrelate the involved bands more strongly than PCA. So a better
classification result is expected from the SPCs. It has also been visually observed that
SPCA-2 is associated with the water body and salt lake classes.
(ii) The first few features extracted by KPCA were visually inferior to those
obtained by SPCA (not revealing any class). Some of the features, like KPCA-1
and KPCA-2, show water body and salt lake prominently, but other classes are
also present there.
(iii) OSP is generally used to extract same number of features as the number of
classes present in the data set (in this case eight classes; hence eight features).
Although number of extracted features by OSP is low, it can identify some
structures prominently. For example, OSP-4 identifies salt lake, OSP-2
identifies vineyards and wheat and OSP-3 shows bare soil. From the algorithm
of OSP, it can be suggested that each band of OSP extracted data set is
associated with one of the predefined classes. Therefore, it can be said that
OSP is expected to perform well for classification.
(iv) The dimension of the PP extracted feature set is two. From the first
extracted feature, salt lake can be identified very clearly, but the second feature
contains no identifiable structures and has a hazy appearance.
(v) The quality improvement of the features extracted by different FE techniques can
be observed by comparing the correlation images of the OD (Figure 5.1) and the
feature extracted data (Figure 5.7). The correlation matrices of the SPCA
and PP extracted data sets are found to be perfectly diagonal, with diagonal values
equal to unity and all off-diagonal elements equal to zero. On the other hand, the
feature extracted data obtained using supervised FE techniques (OSP, KPCA) are
correlated. This is because the SPCA and PP algorithms extract only orthogonal features,
while the FE criterion is different for OSP, so highly correlated features are
observed for OSP. In the correlation image of the KPCA feature extracted data
set, it can be observed that the correlation is unity along the diagonal and decreases
with increasing distance from the diagonal of the correlation matrix,
except for bands 80 to 100, which are observed to be fully uncorrelated.
The performance of GMLC with feature modified data sets (SPCA, KPCA,
OSP, PP FE methods) was compared to the best result obtained by Abhinav (2009)
for the GML classifier to evaluate the improvement in classification due to these FE
techniques. It may be noted that he obtained the best results with PCA modified data
set. Figure 5.8 shows k-values obtained for different feature modified data sets.
Following observations can be listed from Figure 5.8:
(i) Considering the case with sufficient TP (100, 200, 300), the k-values
obtained for PCA, SPCA, and OSP extracted data sets were observed to be
higher than the PP and KPCA modified data sets.
(ii) For statistically insufficient TP (25), GML performs poorly for SPCA, PCA
and OSP modified data sets. When the number of bands increases beyond a
certain number, the k-value for the PCA and SPCA modified data sets
becomes negative for 25 TP per class. This is because at
least p+1 sample points are required to obtain a numerically well
conditioned inverse of a p × p covariance matrix. Due to this effect, GML fails when
more than 25 bands are used with 25 TP per class, since these TP are insufficient for
computing the inverse of the class covariance matrix.
(iii) An interesting phenomenon can be observed for k-values of KPCA modified
data set. The k-value increases for the first 35 bands. Then suddenly it falls
for 40 bands. From 45 bands onwards, it again starts to increase. The result
for KPCA modified data set is observed up to 65 bands (dimension of OD is
65).
(iv) The k-values obtained for SPCA and OSP seem to outperform those
obtained by PCA and KPCA.
(v) Performance of PP is found to be very poor due to very low number of
features (two features). Hence, PP was not considered any further for
classification.
(vi) For all FE techniques (except KPCA, OSP), the k-values increase
significantly with the number of bands up to a critical number of
bands (say, Ncri), after which no improvement could be observed in the k-values.
This is because the features extracted by these techniques are
arranged in decreasing order of eigen values. Useful information is therefore
stored in the first few features only, while the lower order features contain
less useful information and are very noisy. When these noisy bands
are added, the probability of misclassification increases. As a result, the
classification accuracy becomes stagnant.
(vii) Ncri is different for different sets of TP. When the number of TP increases, Ncri
increases. Because of the Hughes phenomenon, classification with a large number of
bands provides poor results unless the number of TP is large.
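The sample size requirement mentioned in observation (ii) can be illustrated numerically. The sketch below is only an illustration on random data, not on the DAIS data set; the function name and the choice of 30 bands are assumptions. A class covariance matrix estimated from n samples has rank at most n - 1, so with 25 TP a 30 × 30 covariance matrix is singular, while p + 1 = 31 generic samples give a full-rank matrix.

```python
import numpy as np

def covariance_rank(samples):
    """Numerical rank of the sample covariance matrix (samples x bands)."""
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    return int(np.linalg.matrix_rank(cov))

rng = np.random.default_rng(0)
p = 30                                                  # hypothetical number of bands
rank_25 = covariance_rank(rng.normal(size=(25, p)))     # 25 TP per class
rank_31 = covariance_rank(rng.normal(size=(p + 1, p)))  # p + 1 TP per class
```

Here `rank_25` stays below p, so the covariance matrix cannot be inverted, whereas `rank_31` reaches p.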
[Panels: PCA, SPCA, KPCA, OSP, PP]
Figure 5.8: Overall kappa value observed for GML classification on different feature
extracted data sets using selected different bands
To confirm these observations, statistical analysis was performed. The k-
values obtained for each FE technique are given in Table 5.1. The best results
obtained by GML classification on different feature extracted data set for three
training data sets (100, 200 and 300 TP) were selected for comparison with the best
GML result obtained with PCA extracted data set. The condition for selecting the
best classification result (best k-value) is the least number of bands used after which
no statistically significant improvement in k-value could be achieved. A comparison of
the best results between the PCA and other FE modified data sets and among the
various FE techniques is presented in Table 5.2 in terms of z-statistic values obtained
for one-tailed hypothesis testing at 5% significance level.
(vi) The best kappa accuracy for GML classifier is obtained by using SPCA
extracted data set with 300 TP. The kappa value is 0.9589 and the number of
bands used for classification is 45.
Table 5.2: Best kappa values and z-statistic (at 5% significance level) for GML
From Table 5.3 it is observed that the best results for the PCA, SPCA and KPCA
extracted data sets were obtained with 30-45 features at 300 TP, and for the OSP
extracted data set with 8 features at 300 TP. During the experiments, it was seen that
GMLC took around 55-70 seconds to process 30-45 bands with 300 TP per class for the
SPCA and PCA extracted data sets, and about 32 seconds for the OSP extracted data. OSP
provides results statistically similar to those of PCA and SPCA for 300 TP, but its
processing time is much lower than that of the other FE techniques. Therefore,
considering both accuracy and processing time, OSP
can be rated as the most effective FE technique for GMLC. For statistically
insufficient TP (25) and statistically sufficient TP (200), SPCA is rated as the best FE
technique. For 100 TP per class, the performance of PCA and SPCA for GMLC is the same.
From Figure 5.9, it can be observed that GMLC on OSP is faster than on any other
FE technique. PCA and SPCA take about the same time to provide the best k-values.
Table 5.3: Ranking of FE techniques and time required to obtain the best k-value
Figure 5.9: Comparison of kappa values and classification times for GML
classification method.
5.2.2 Class-wise comparison of result for GMLC
The class-wise accuracy for GMLC has been observed for different feature
extracted data set. From Figure 5.10, following can be observed:
[Panels: 25 Training Pixels, 100 Training Pixels, 200 Training Pixels, 300 Training Pixels]
WT : Water
SLT : Salt lake
HV : HydrophyticVeg
WHT : Wheat
VY : Vineyards
BS : Bare Soil
PL : Pasture Land
5.2.3 Classification results using KNN classifier (KNNC)
To understand the effect of FE techniques on KNN classifier, experiment was
performed with OD as well as feature extracted data. Same set of TP, as used in GML
classification, was chosen to compare classification accuracy. Observations from
figure no. 5.11 to 5.14 are as following:
(i) In case of KNN, poor performance is observed for statistically insufficient TP
set (i.e. 25TP). However, KNN on OD performs better than PCA, OSP, SPCA
extracted data set. The maximum k-value was obtained for 65 bands and three
neighbors. For the KPCA extracted data set, k-value was comparatively better
than OD when 50 bands were taken for all neighbors. PP was not taken into
accuracy analysis as due to very low dimensionality it would not be able to
provide good k-values.
(ii) For statistically exact TP (100 TP), the performance of KNN on OD is better
than on any feature extracted data set. Using more bands increases the
k-values for all feature extracted data sets except SPCA, for which increasing the
number of bands did not show any significant change. However, when the
number of neighbors is increased, changes are easily observed: beyond a critical
number of neighbors (say, Nnbd), the k-value starts decreasing, independently of the
number of bands. This may be due to the effect of noisy points present in the training
data set: a large number of neighbors increases the chance of using
noisy TP, and consequently the misclassification error adds up.
(iii) For 200 TP per class, the PCA, KPCA and OSP extracted data sets show no
improvement in result over OD, but an improvement was observed for the SPCA
extracted data set. However, it did not show a prior change in PCA and KPCA
extracted data set for KNNC with 100 and 200 TP set respectively. Effect of
neighborhood on accuracy can be viewed from Table 5.4. For all sets of TP, the
highest k-value is always achieved with the first few neighbors (Table 5.2).
(iv) For the large training data set (300 TP), k-values better than those of OD were
observed for the PCA and SPCA extracted data sets. After a
certain threshold neighborhood, the k-value decreases monotonically for the PCA,
OSP, and SPCA extracted data sets.
(v) KPCA extracted data set provides better result for high dimension since it is
more refined than PCA or SPCA extracted data set.
(vi) For all training data sets, except statistically insufficient, k-value for OSP
extracted data set varies a little (0.02 - 0.05) because of very low
dimensionality. If the number of extracted end members is large enough, result
could be further improved.
(vii) Another important aspect was observed for the feature extracted data sets. The
difference between the k-values (for all sets of TP) obtained using the minimum and
maximum number of bands is about 0.15 to 0.20. This could be because most
of the information is concentrated in the first few bands of the feature extracted data
set, so additional bands cannot provide enough useful information to change the
k-value significantly.
Table 5.4: Classification with KNNC on OD and feature extracted data set
25 Training Pixels
[Panels: Original, PCA, SPCA, KPCA, OSP, PP]
NNb*: number of nearest neighbors
Figure 5.11: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 25 TP
100 Training Pixels
[Panels: Original, PCA, SPCA, KPCA, OSP, PP]
NNb*: number of nearest neighbors
Figure 5.12: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 100 TP
200 Training Pixels
[Panels: Original, PCA, SPCA, KPCA, OSP, PP]
NNb*: number of nearest neighbors
Figure 5.13: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 200 TP
300 Training Pixels
[Panels: Original, PCA, SPCA, KPCA, OSP, PP]
NNb*: number of nearest neighbors
Figure 5.14: Overall accuracy observed for KNN classification of OD and feature extracted
data sets for 300 TP
The k-values for the classification of these data sets were analyzed to select the
best results for each data set. Similar approach as in the case of GML is also followed
here. The z-statistic values obtained for selected best k-values are shown in Table 5.5.
The following can be inferred from these results:
(i) Results obtained using the PCA and SPCA modified data sets were found to be
significantly better than those obtained using the OD for the large training data
size (300). However, SPCs and PCs were still found to perform worse than
OD for 100 TP. Statistically similar results were obtained for the OD and SPCA
modified data sets using a training data set of 200 TP. For the other feature
extracted data sets and for all sets of training data, OD provides statistically
significantly better results for KNN classification.
(ii) The best results were obtained with OD using 30 to 55 bands and three
neighbors. For 300 TP, statistically better results than OD were obtained using
SPCA (40 bands) and PCA (20 bands) modified data sets with three neighbors.
For 200 TP, SPCA modified data set (15 features and 3 neighbors) provides
statistically similar results to OD.
(iii) The SPCA extracted data sets were observed to perform statistically
significantly better than the PCA extracted data sets with smaller training data sets,
whereas the best results obtained with the 300 TP training data set using SPCs were
statistically similar to those obtained by PCs.
(iv) SPCs were also observed to be performing significantly better than KPCA and
OSP modified data sets for all training data sets. In addition, the best results
for PCA and OSP were found to be statistically poor for all training data size.
The time taken by the KNN classifier is highly affected by the number of TP.
This is because the distance between each test pixel and each TP needs to be
computed. Increasing the number of TP therefore extends the calculation time,
i.e. for n TP and m test pixels, the number of distances calculated is nm. However,
increasing the number of neighbors has significantly less effect on run time. It has been
observed that the times taken for classification with three and with 15 neighbors are
almost similar (the maximum difference is 60-120 seconds) (Figure 5.15). It was also
noticed that increasing the number of bands proportionally affects the calculation time
(Figure 5.15). From Figure 5.16, it can be observed that PCA takes the least time
compared to OD and SPCA extracted data to provide the best result. Considering the
time constraint and the k-value, PCA could be chosen as the best FE technique, followed
by SPCA, among the available techniques for KNN classification. Figure 5.15 shows
the comparison of time between 200 TP and 300 TP for the same number of bands and
neighbors. The rank of the FE techniques with respect to accuracy for KNNC for each set
of TP can be inferred from Table 5.6.
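The dependence of the run time on the number of TP can be seen from a plain implementation of the classifier. The sketch below is only an illustration in pure Python, assuming the Euclidean distance and majority voting; the function name, the toy pixels and the class labels are hypothetical. It counts the distances explicitly to show that n TP and m test pixels require nm distance computations, whatever the number of neighbors.

```python
from collections import Counter

def knn_classify(train_pixels, train_labels, test_pixels, k=3):
    """Label each test pixel by majority vote among its k nearest TP."""
    predictions = []
    n_distances = 0
    for t in test_pixels:
        dists = []
        for x, label in zip(train_pixels, train_labels):
            d2 = sum((a - b) ** 2 for a, b in zip(t, x))   # squared Euclidean distance
            dists.append((d2, label))
            n_distances += 1
        dists.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in dists[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions, n_distances

# Hypothetical two-band pixels for two classes
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["water", "water", "water", "soil", "soil", "soil"]
test = [(0.2, 0.1), (5.5, 5.2)]
predicted, n_distances = knn_classify(train, labels, test, k=3)
```

Changing k only changes how many of the sorted distances are used for voting, which is consistent with the small effect of the number of neighbors on run time reported above.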
From Table 5.6, it is further observed that for the statistically exact size of TP (i.e.
100), KNNC produced the best result with OD. For statistically sufficient TP (i.e. 200),
SPCA secured the first rank. However, for statistically large TP (i.e. 300), SPCA and PCA
both perform better. Therefore, it is concluded that among all the data sets, feature
modified and original, SPCA and PCA provide the best results for KNNC, which in
turn indicates that PCA is the best FE technique among all of these techniques for KNNC.
Table 5.6 Rank of FE techniques and time required to obtain best k-value (Rank 1
indicates the best)
(a) 300 TP (b) 200 TP
NNb*: number of nearest neighbors
5.2.4 Class wise comparison of results for KNNC
From Figure 5.17, following observations can be viewed for class wise accuracy
of KNNC
KNNC extracts water and salt lake classes with very high accuracy for both
feature modified data and OD. However, the built up area is classified very poorly
due to presence of large number of mixed pixels. For built up area some pixels are
classified into hydrophytic veg, wheat, pasture land classes for all data sets due to the
presence of large number of mixed pixels in built-up area class. Performance of OD,
KPCA and OSP modified data sets are lower than SPCA and PCA modified data set
to provide good classification accuracy for classification of hydrophytic veg class for
all sets of TP. For vineyards, a built-up area classes for all data sets and TP.
[Panels: 100 Training Pixels, 200 Training Pixels, 300 Training Pixels]
WT : Water
SLT : Salt lake
HV : HydrophyticVeg
WHT : Wheat
VY : Vineyards
BS : Bare Soil
PL : Pasture Land
BUA : Built-up Area
5.3 Experiment results for SVM based classifiers
In this section, results of different SVM algorithms have been described. First
it will describe the results of SVM_QP algorithm followed by SVM_SMO and
KPCA_SVM. The section also provides a comparison of classification time of different
SVM algorithms.
5.3.1 Experiment results for SVM_QP algorithm
Using the optimal set of parameter values (Table 4.5, recommended by
Abhinav, 2009) for SVM classifiers, classification was performed on feature modified
data sets. Results from these experiments are compared with the best result obtained
by Abhinav (2009) for SVM classifier. He noted that performance of SVM_QP was the
best for PCA extracted data set. The same training and input data sets were used as
for GML and KNN classifiers. The classification results obtained by SVM are
presented in Figure 5.18 from which the following observations can be made:
(i) The k-values are seen as improving with increase in training data size for all
input data sets types (PCA, SPCA, KPCA, OSP and PP modified data sets).
(ii) The best classification results were obtained by PCA and SPCA modified data
sets. For KPCA modified data set, when number of bands increases the k-
values also increase. It is possible that for very high dimension, KPCA
extracted data set can provide high k-value like SPCA or PCA extracted data
sets.
(iii) Increasing k-values were observed for PCs and SPCs, which stagnate after a
critical number of features is used and then start to decrease gradually. This
could be due to the same reason discussed for the GML classification algorithm in
section 5.1.
(iv) A similarity can be observed for the KPCA, PCA and SPCA modified data sets. For
statistically insufficient TP (25), the k-values suddenly drop to about zero for
classification using 50 bands. The reason is not clear. Probably, with
this number of bands and TP, SVM_QP was unable to find a proper decision
boundary.
(v) The best results for the KPCA and OSP extracted data sets are about the same for
each set of TP except for 25 TP.
SVM_QP
[Panels: (a) PCA, (b) SPCA, (c) KPCA, (d) PP, (e) OSP]
(i) PCA and SPCA were found to be giving statistically similar result for all set of
TP. On the other hand, PCA always provides statistically significantly better
result than KPCA and OSP modified data set for all set of TP for SVM_QP
classifier.
(ii) Classification with SPCA modified data set always performs statistically better
than KPCA modified data set for all sets of TP. However, OSP performs
statistically better than KPCA modified data set for 100 and 200 TP per class.
For large set of TP (300), OSP performs statistically similar with KPCA
modified data set.
(iii) Another observation is made from the Table 5.7 that the SPCA modified data
set always performs statistically better than OSP modified data set.
(iv) It can be concluded that PCs and SPCs have the better ability to improve k-
value than any other FE techniques. KPCA performs the worst among all the
FE techniques.
Table 5.7: The best kappa accuracy and z-statistic for SVM_QP on different feature
modified data set
TP    k1*     NB*   k2*     NB*   k3*     NB*   k4*     NB*   z-statistic values
100   0.9408  15    0.8703  55    0.9408  15    0.8874  8     36.30   0.00   28.70  -36.30  -7.79  28.70
200   0.9621  15    0.8901  65    0.9573  15    0.9050  8      7.89   0.53    6.26  -33.39  -7.26  30.40
300   0.9643  15    0.9090  60    0.9691  20    0.9069  8      6.07  -0.59    6.30   -7.40   1.06   7.65
NB* = no. of bands used to achieve the best k-value; ki* = k-value for ith FE technique
During above experiments, it was observed that time taken to train the SVM
based classifier is affected very much by the number of training samples used. This is
because a kernel matrix has to be computed for every pair of TP. There were very
little changes in training times with increase in number of bands.
Generally the total time taken to perform SVM based classification was
observed to be ranging from 23 to 102 seconds when bands were increased from 5 to
65 for 25 TP. The same range for 100 TP was observed as 82 to 273 seconds, and 522
to 615 seconds for 200 TP.
An important aspect has been observed for the classification time using 200 TP
with the SPCA modified data set (Figure 5.19). When the number of bands is increased
beyond a critical number (30 bands), the classification time decreases monotonically.
The same trend was observed for 300 TP per class. This could be due to the fact that,
with a large number of TP and a large number of bands, SVM_QP was unable to find
a sufficient number of support vectors required for classification. For SPCA or PCA
modified data sets, all bands except the first few contain a large amount
of noise. Due to the presence of this noise, the optimization problem might not be solved
properly for a large number of bands with a large set of TP for SPCA or PCA modified
data sets. That means that a sufficient number of SVs could not be found. When the
number of SVs is small, the classification time is also small and the k-values may also
decrease. This is supported by Figure 5.18 (a), (b), where it is observed that for the
SPCA and PCA modified data sets the k-values start to decrease after 25 bands.
Exceptionally high times, of the order of 2600 seconds, were observed when
the training data size was increased to 300 TP. Such high times are due
to the QP optimizer used. Varshney and Arora (2004) suggested a few better
optimizers which would give the same classification accuracies in shorter training
times. It is known that the same performance would be achieved irrespective of the
choice of optimizer in the case of SVM, as it makes use of statistical learning theory,
as pointed out by Varshney and Arora (2004).
Figure 5.19: Classification time comparison using 200 and 300 TP per class.
(i) The k-values improve with increasing training data size (except for 200 TP)
for all input data sets.
(ii) As with SVM_QP, a sudden decrease in k-value is observed with 25 TP for the
OD, SPCA, KPCA and OSP extracted data sets. For all data sets, this
happens at 50 features.
(iii) For all data sets (except the KPCA extracted data), the statistically
sufficient training data set (200 TP) is unable to provide a positive k-value.
This could be due to failure to solve the optimization problem for these data
sets using 200 TP. For the KPCA extracted data set, the first few bands
provide a very low k-value for 200 TP. From 20 bands onwards, the k-value
provided by the KPCA extracted data set for 200 TP is acceptable.
(iv) For the original and KPCA modified data sets, k-values increase up to a
critical number of features and then start to decrease. This is for the same
reason as reported for the GML classifier. For the OD and KPCA modified
data sets, k-values increase monotonically for 100 and 300 TP per class.
(v) For the PP modified data set, however, very low k-values are observed.
Therefore, all results for the PP extracted data set are excluded from the
comparison of results of the SVM_SMO classifier.
The k-values for the classification of these data sets were statistically analyzed
to select the best results for each data set. The approach was similar to the one
followed in previous cases. The z-statistic values are obtained to compare each data
set. The best k-values are shown in Table 5.8. The following can be inferred from
these results:
(i) The best results obtained using feature modified data sets were found to be
significantly better than those obtained using the OD set for large training
data size (300 TP). For the OSP modified data set the improvement is
marginal, but it can still be said to be significantly better than the OD set.
Performance of OD, SPCA and OSP modified data is very poor for 200 TP
training data, whereas performance of KPCA modified data is very high. SPCs
were found to perform statistically better than the OD set for 100 TP per
class and statistically similar to OD for 200 TP.
(ii) The best results with the OD were obtained using 50-60 bands, while
significantly better results than OD were obtained using SPCA modified
data sets with 15-30 features. For 300 TP, a result statistically similar to
OD is obtained using the OSP modified data set with eight bands.
(iii) KPCs were observed to perform significantly better than SPCA and OSP
modified data sets for 200 TP. For 100 and 300 TP, the best results
obtained with the SPCA modified data set are significantly better than those
of the OSP and KPCA modified data sets.
(iv) Classification with OSP is found to be significantly better than with KPCA
for 100 TP, while KPCA is observed to be statistically better than OSP
modified data for 200 and 300 TP. Thus it can be said that SPCA performs
better than OD and any other feature extracted data, and that the
performance of OSP is the worst for SVM_SMO based classification.
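The k-values compared in this section are kappa coefficients. As a reminder of how such a value is derived from a confusion matrix, a minimal sketch follows. The function name `kappa` and the 2x2 example counts are hypothetical and are not taken from the thesis results; the matrix is assumed to hold reference classes in rows and predicted classes in columns.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Assumed layout: rows are reference classes, columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: sum of products of row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical two-class example (counts invented for illustration only).
example = [[45, 5],
           [10, 40]]
print(round(kappa(example), 4))
```

The thesis reports k-values as percentages, so the value returned here would be multiplied by 100 to be comparable.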
Figure 5.20: Overall kappa values observed for classification of original and FE
modified data sets using SVM with SMO optimizer (SVM_SMO; panels shown: Original, SPCA)
Figure 5.21: Comparison of classification time for different sets of TP with respect to
number of bands for SVM_SMO classification algorithm.
(i) For OD and KPCA extracted data, unpredictable behavior of the KPCA_SVM
classifier is observed for all sets of TP and for different numbers of bands. The
maximum k-value for OD is obtained for 200 TP with 35 bands and for KPCA for
200 TP with 25 bands.
(ii) For the SPCA extracted data set, k-values drop to about zero after 20 bands for
each set of TP. The maximum k-value obtained by SPCA is better than that obtained
by the OD and KPCA extracted data sets. The maximum k-value for each set of TP is
obtained with five bands.
(iii) For the OSP extracted data set, the highest k-value is obtained for 200 TP. This
value is higher than the k-values obtained for 200 TP by the other feature
modified data sets. The reverse is seen for the OSP modified data set with
300 TP.
(iv) One important phenomenon is observed for the KPCA_SVM algorithm. For a large
set of TP (300), KPCA_SVM provides very low k-values. The best k-value is
obtained for all data sets using 200 TP per class.
[Figure: KPCA_SVM; panels: Original, SPCA, KPCA, OSP]
(ii) The best results with the OD were obtained with 50-60 bands, while
significantly better results than OD were obtained using SPCA modified data
sets with five to ten features for 100 and 200 TP per class. For the OSP modified
data set, a statistically better result than OD is obtained using 200 TP with eight
bands.
(iii) SPCs were observed to perform significantly better than OSP for 100 and
200 TP, while OSP performs statistically better than SPCs for 200 TP. KPCs
perform statistically better than OSP for 100 TP. However, the performance of
KPCs for 200 and 300 TP is statistically significantly lower than that of OSP.
(iv) SPCs always perform statistically better than KPCs, and OSP performs better
than SPCA only for 200 TP. It can be concluded that for 100, 200 and 300 TP,
KPCA_SVM performs best with the SPCA modified data set, the OSP modified data
set and the OD respectively. KPCA_SVM provides low k-values compared to the
SVM_QP or SVM_SMO algorithms.
Table 5.9: The best k-value and z-statistic for KPCA_SVM on original and different
feature modified data sets.
(i) The ability of all SVM classifiers to distinguish the salt lake class is about
the same.
(ii) The accuracy of separation of the wheat class by the SVM_QP and SVM_SMO
classifiers is about the same. However, KPCA_SVM separates all classes
other than salt lake with much lower accuracy than the other two classifiers.
(iii) SVM_SMO separates all other classes with slightly lower accuracy than
SVM_QP.
(iv) SVM_QP is the best classifier. It has the ability to separate all classes with
high k-value.
Figure 5.23: Comparison of classification accuracy of individual classes for
different SVM algorithms. WT – water, SLT – Salt Lake, HV – Hydrophobic veg,
WHT – wheat, VY – Vineyards, BS – Bare soil, PL – Pasture land, BUA – Built-up area
5.3.5 Comparison of results for different SVM algorithms
The overall best results obtained by different SVM algorithms were compared
statistically to find out the best SVM classification method in terms of classification
accuracy obtained. The same was done for the time scales observed for these in order
to compare the practical applicability of these methods.
required classification time using 300 TP is about two-thirds of that of SVM_QP.
Although SVM_SMO needs more bands than SVM_QP to obtain the best k-values
for different sets of TP, its processing time is much lower than that of SVM_QP.
(iii) KPCA_SVM is the poorest method compared with SVM_QP and SVM_SMO. The
highest k-value for KPCA_SVM is obtained by using the OSP modified data set with
200 TP. When the number of training pixels is large, the performance of
KPCA_SVM is lower.
From the above discussion, it can be concluded that SVM_QP is the best
classifier with respect to accuracy. Considering both the classification time and
accuracy, SVM_SMO can be considered as the effective SVM classifier. The best
accuracy is obtained by SVM_QP by using 300 TP with the first 20 bands of SPCA
modified data set. For SVM_SMO the best accuracy is obtained by using 300 TP with
the first 30 bands of SPCA modified data set.
The following observations are made from Table 5.11:
(i) GML performs statistically better than the KNN classifier for all sets of TP.
Also, the classification time of GMLC is negligible compared with that of KNNC.
(ii) GMLC performs statistically similarly to SVM_QP for 100 and 200 TP. For a
large set of TP (300), the performance of the SVM_QP classifier is statistically
significantly better than that of GMLC. However, the required classification time
is very high for the SVM classifier.
(iii) SVM_QP provides statistically better results than KNNC for all sets of TP. It
can therefore be concluded that SVM_QP is the best classifier on the basis of
classification accuracy. GML is ranked as the second best classifier.
(iv) It is also observed that all the classifiers obtain their best results using
the SPCA modified data set. It is therefore concluded that SPCA is the best
feature reduction technique among all techniques tested, for all classifiers.
(v) The processing time of GMLC is much lower than that of any other classifier.
GMLC provides a slightly lower k-value than SVM_QP for 300 TP. Considering both
classification time and accuracy, it can be concluded that GMLC is the best
classifier overall.
the classifiers. After these, the classes wheat, vineyards and bare soil showed
slightly lower accuracy values, which means these are a little more difficult to
separate. The lowest accuracies were observed for the pasture land, built-up area and
hydrophytic vegetation classes. These classes are very poorly separated and thus
complex decision boundaries would be required to separate them. For a large set of TP,
SVM_QP is able to achieve higher classification accuracies than the parametric and
non-parametric classifiers because the latter were not able to separate these poorly
separated classes as well.
Classified maps corresponding to the best results of different classifiers are
shown in Appendix A (Figure A.1).
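Class-wise accuracies of the kind discussed above can be read off a confusion matrix. The following is a minimal sketch, assuming rows hold reference classes and columns hold predicted classes; the function name `per_class_accuracy` and the counts are hypothetical, and only the class abbreviations (WT, PL, BUA) are borrowed from Figure 5.23.

```python
def per_class_accuracy(confusion, labels):
    """Return {label: (producer's accuracy, user's accuracy)} per class.

    Assumed layout: rows are reference classes, columns are predicted classes.
    """
    result = {}
    for i, label in enumerate(labels):
        correct = confusion[i][i]
        row_total = sum(confusion[i])                # reference pixels of this class
        col_total = sum(row[i] for row in confusion)  # pixels assigned to this class
        producers = correct / row_total if row_total else 0.0
        users = correct / col_total if col_total else 0.0
        result[label] = (producers, users)
    return result


# Hypothetical counts for three classes (invented for illustration only).
labels = ["WT", "PL", "BUA"]
confusion = [[90, 5, 5],
             [10, 60, 30],
             [5, 25, 70]]
for name, (prod, user) in per_class_accuracy(confusion, labels).items():
    print(name, round(prod, 3), round(user, 3))
```

In this invented example the off-diagonal counts between PL and BUA are large, which is the pattern one would expect for classes that are hard to separate.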
CHAPTER 6
SUMMARY OF RESULTS AND
CONCLUSIONS
Lastly, the best results for parametric, non-parametric and advanced classifiers
were compared to find out the best classifier for HD. All the comparisons were
performed by one-tailed hypothesis testing at the 5% significance level.
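A comparison of two kappa values by a one-tailed test at the 5% level can be sketched as below. This is an illustration under stated assumptions, not the thesis code: the z-statistic is assumed to be the commonly used difference of two independent kappa estimates divided by the square root of the sum of their variances, the critical value 1.645 is the standard normal one-tailed 5% point, and the kappa values and variances in the example are invented.

```python
import math

# One-tailed critical value of the standard normal at the 5% significance level.
Z_CRIT_ONE_TAILED_5PCT = 1.645


def kappa_z(k1, var1, k2, var2):
    """z-statistic for the difference of two independent kappa estimates."""
    return (k1 - k2) / math.sqrt(var1 + var2)


def significantly_better(k1, var1, k2, var2, z_crit=Z_CRIT_ONE_TAILED_5PCT):
    """True if classifier 1 is significantly better than classifier 2."""
    return kappa_z(k1, var1, k2, var2) > z_crit


# Hypothetical kappa values and variances (invented for illustration only).
z = kappa_z(0.95, 4e-5, 0.93, 4e-5)
print(round(z, 3), significantly_better(0.95, 4e-5, 0.93, 4e-5))
```

When the z-statistic does not exceed the critical value in either direction, the two results would be described as statistically similar, which is the wording used in the comparisons of this thesis.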
Classification experiments were performed using the four FE techniques,
namely, SPCA, KPCA, OSP and PP. From the statistical analysis of classification
results obtained using these feature modified data sets, it could be concluded that
among the four above mentioned FE techniques, SPCA modified data set provides the
best results. These results were also compared with the best classification results
obtained by Abhinav (2009) using different FE techniques. SPCA performs better
because it uses local statistics (of groups of bands) rather than global statistics.
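The idea of local statistics can be illustrated by the band grouping step that precedes PCA in a segmented approach. The sketch below is a simplification under stated assumptions and is not the procedure used in this thesis: contiguous bands are kept in one segment as long as the absolute Pearson correlation between neighbouring bands stays at or above a threshold. The function names, the threshold of 0.9 and the toy pixel values are all hypothetical. PCA would then be applied to each segment separately, which is not shown here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def segment_bands(bands, threshold=0.9):
    """Group contiguous band indices whose neighbours are highly correlated.

    bands: list of bands, each a list of pixel values in the same pixel order.
    """
    if not bands:
        return []
    segments = [[0]]
    for b in range(1, len(bands)):
        if abs(pearson(bands[b - 1], bands[b])) >= threshold:
            segments[-1].append(b)   # still inside the same correlated block
        else:
            segments.append([b])     # correlation dropped: start a new block
    return segments


# Toy data: four bands observed at four pixels (values invented).
toy_bands = [[1, 2, 3, 4], [2, 4, 6, 8.1], [4, 1, 3, 2], [8, 2, 6, 4.2]]
print(segment_bands(toy_bands))
```

Each resulting segment has its own covariance structure, so the principal components computed per segment reflect the statistics of that block of bands rather than of the whole spectrum.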
Analyzing the results of the different classifiers, it is observed that sometimes the
results obtained from the PCA modified data set compete with those obtained from the
SPCA modified data set. Generally, different classifiers provide the best results using
15 to 30 bands of SPCA or PCA modified data sets, which effectively reduces the
classification time. OSP and PP, due to their very low dimensionality, generally fail
to produce satisfactory results. However, the results obtained by using eight bands of
the OSP modified data set are reasonably good, though they are not always
statistically significantly better than those of SPCA or PCA modified data sets. There
is a possibility of improving the result by increasing the dimension of the OSP
modified data set by extracting a larger number of endmembers. For the KPCA modified
data set, it was observed that its performance is generally poor. However, it is
observed that KPCA can produce satisfactory results when the dimension is increased,
which also increases the classification time proportionally. Therefore, KPCA is not
considered an effective FE technique.
From the experiments performed with parametric classifier (GML), it was
observed that the performance of GML was significantly improved after applying FE
techniques. Comparing the obtained results with the best result obtained by Abhinav
(2009), SPCA was found to be working best among all available FE techniques, in
improving classification accuracy by GML.
Moving on to the non-parametric classifier, it is observed that the result of the KNN
classifier depends on the choice of the number of bands and neighbors. The best
results were selected for KNN with and without applying FE techniques, and it was
found that the result of KNN was enhanced by the PCA and SPCA techniques while the
supervised FE techniques like KPCA and OSP failed to do so.
The SVM algorithm was selected as the advanced classifier. It uses statistical learning
theory, which is expected to produce consistent and optimal results as compared to
the parametric and non-parametric classifiers. Different SVM algorithms (SVM_QP,
SVM_SMO and KPCA_SVM) were tested to reach this goal. For SVM based
classifiers, it was observed that the dimension of the data sets and the choice of
optimizer significantly affect the results. The best result of SVM_QP was achieved
with the SPCA feature extracted data set with 20 bands. It was also observed that the
classification result using the advanced classifier improved on the best
result obtained by Abhinav (2009). He obtained the best result using PCA modified
data sets. This result was further improved by using the SPCA modified data set. This
shows that by using selected FE techniques, the classification results of the advanced
classifier can be improved further. It was observed that supervised FE techniques like
KPCA and OSP could not improve the result of SVM, while the unsupervised FE technique
(SPCA) improved the result. On the other hand, the best results of
SVM_SMO and KPCA_SVM were obtained by using SPCA and OSP modified data
sets respectively. Comparing the best results of the different SVM algorithms, SVM_QP
is concluded to be the best SVM classifier.
On comparing the best results obtained by SVM classifiers with the best
results of parametric and non-parametric classifications, it was found that the
advanced classifier performs significantly better for both kinds of data set, original
or feature extracted. The reason for the better performance of this classifier is the
improvement in separating a few classes which show poor k-values when parametric
or non-parametric classifiers are used. This observation is expected because of the
difference in how the decision boundary is formed. The decision boundaries formed by
parametric or non-parametric classifiers are simpler, so they are unable to separate
the poor classes efficiently. The advanced classifier is able to form complex,
nonlinear decision boundaries, which helps it to separate the poor classes.
Compared to the parametric classifier, SVM required higher computation time and
memory. In spite of these difficulties, significant improvement was observed over
parametric and non-parametric classifiers by the advanced classifier. This
strongly suggests that SVM has the ability to reduce the difficulties of HD
classification.
6.2 Conclusions
Based on these results, the following conclusions are drawn:
(i). In this thesis work the high memory and computational time required by SVM
methods were reduced a little by using different optimizers and algorithms.
There is still scope to reduce the computation time for the SVM algorithm by
using the Lagrangian SVM algorithm (Mangasarian and Musicant, 2000). This
requires further testing. In addition, some optimization techniques like Kernel
Adatron (Bennett and Campbell, 200) and Successive Overrelaxation (SOR)
(Mangasarian and Musicant, 1998) should also be tested, which may reduce the
computation time significantly.
(ii). Moreover, for a large set of TP, the KPCA method takes much time. Lima and
Zen (2005) suggested a method called Sparse KPCA which may reduce the
computation time. This needs to be tested.
(iii). High computation time was required by KNN in this thesis work. This is
because a large number of computations is required to classify a single
pixel. For large data sets this cost grows considerably. To reduce it, a
hash-table approach could be applied, which would reduce the number of
computations.
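The per-pixel cost of KNN mentioned in (iii) can be illustrated with a brute-force classifier. The sketch below is not the hash-table approach suggested above; as a plainly different speed-up it uses partial distance search, the idea behind the Hwang and Wen (1998) entry in the reference list, where the distance accumulation for a training pixel is abandoned as soon as it exceeds the current k-th best distance. The function name, the toy training pixels and the class labels are hypothetical.

```python
import heapq
from collections import Counter


def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    Uses squared Euclidean distance with partial distance search: the sum over
    bands stops early once it reaches the current k-th smallest distance.
    """
    best = []  # max-heap via negated distances: (-squared_distance, index)
    for idx, sample in enumerate(train):
        limit = -best[0][0] if len(best) == k else float("inf")
        dist = 0.0
        for a, b in zip(sample, query):
            dist += (a - b) ** 2
            if dist >= limit:
                break  # cannot be among the k nearest; skip remaining bands
        else:
            if len(best) == k:
                heapq.heapreplace(best, (-dist, idx))
            else:
                heapq.heappush(best, (-dist, idx))
    votes = Counter(labels[idx] for _, idx in best)
    return votes.most_common(1)[0][0]


# Toy two-band training pixels (values and class names invented).
train = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["water", "water", "soil", "soil", "soil"]
print(knn_classify(train, labels, [0.2, 0.3], k=3))
```

Even with the early exit, every training pixel is still visited for every pixel to be classified, so the total work grows with the product of the number of training pixels and the number of classified pixels; index structures such as the hash table proposed above aim to avoid visiting most training pixels at all.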
REFERENCES
Barros, A. S. and Rutledge, D. N. (2005) ‘Segmented principal component
transform–principal component analysis’, Chemometrics and Intelligent Laboratory
Systems, Vol. 78, pp. 125-137.
Ben-Dor, E., Patkin K., Banin A. and Karnieli, A. (2002) ‘Mapping of several soil
properties using DAIS-7915 hyperspectral scanner data – a case study over clayey
soils in Israel,’ International Journal of Remote Sensing, Vol. 23, No. 6, pp. 1043-
1062.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992) ‘A training algorithm for optimal
margin classifiers’, Proceedings of the 5th Annual Workshop on Computational
Learning Theory, ACM New York, NY, USA, pp. 144-152.
Cha, G. H. (2005) ‘Kernel principal component analysis for content based image
retrieval’, PAKDD 2005, LNAI 3518, pp. 844 – 849, Springer-Verlag Berlin
Heidelberg.
Congalton, R. G. (1991) ‘A review of assessing the accuracy of classifications of
remotely sensed data,’ Remote Sensing of Environment, Elsevier Science (pub.),
Vol. 37, No. 1, pp. 35-46.
Hughes, G. (1968) ‘On the mean accuracy of statistical pattern recognizers,’ IEEE
Transactions on Information Theory, Vol. IT-14, No. 1, pp. 55-63.
Hwang, W. J. and Wen, K. W. (1998) ‘Fast KNN classification algorithm based on
partial distance search’, Electronics Letters, Vol. 34, No. 21.
Hwang, J., Lay, S., and Lippman, A. (1994), ‘Nonparametric multivariate density
estimation: A comparative study,’ IEEE Transactions Signal Processing, Vol.42, No.
10, pp. 2795-2810.
Jia, X. (1996) Classification techniques for hyperspectral remote sensing data, Ph. D.
Thesis, University of Canberra.
Jones, M. C., and Sibson, R. (1987) ‘What is projection pursuit?’, Journal of the
Royal Statistical Society, Ser. A, 150, 1-38.
Kim, K. I., Franz, M. O., and Schölkopf, B. (2005) ‘Iterative kernel principal
component analysis for image modeling’, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 27, No. 9.
Mercer, J. (1909) ‘Functions of positive and negative type, and their connection with
the theory of integral equations,’ Philosophical Transactions of the Royal Society of
London, Series A, Vol. 209, pp. 415-446.
Ping, X., Guo, G., and Chen, G. (2006) A fast document classification algorithm
based on improved KNN, IEEE Transaction.
Vapnik V. N. (1998) Statistical learning theory. John Wiley and Sons, NY.
Welling, M. ‘Kernel principal component analysis’, Department of Computer Science,
University of Toronto.
Zhu, B., Jiang, L., Jin, F., Qin, L., Vogel, A., and Tao, Y. (2007) ‘Walnut shell and
meat differentiation using fluorescence hyperspectral imagery with ICA-KNN optimal
wavelength selection’, Sensing and Instrumentation for Food Quality and Safety,
Vol. 1, pp. 123-131, DOI 10.1007/s11694-007-9015-z.
APPENDIX A
Figure A.1: Classified maps corresponding to the best results of different classifiers
(panels: GML, KNN, SVM_QP, SVM_SMO, KPCA_SVM, with legend)