Jan 05, 2011

© All Rights Reserved

Hyperspectral image classification and comparison of different feature extraction techniques.

SUPERVISED LEARNING WITH

HYPERSPECTRAL DATA

A Dissertation Submitted in Partial Fulfillment of the

Requirements for the Degree

of

Master of Technology

By

SOUMYADIP CHANDRA

(Y8103044)

DEPARTMENT OF CIVIL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY KANPUR

July 2010

ABSTRACT

Hyperspectral data (HD) can provide a much larger amount of spectral information than multispectral data. However, it suffers from problems such as the curse of dimensionality and data redundancy. The data sets are also very large. Consequently, it is difficult to process these data sets and obtain satisfactory classification results.

The objectives of this thesis are to find the best feature extraction (FE) techniques and to improve the accuracy and time of HD classification using parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest neighbor (KNN)) and support vector machine (SVM) algorithms. To achieve these objectives, experiments were performed with different FE techniques: segmented principal component analysis (SPCA), kernel principal component analysis (KPCA), orthogonal subspace projection (OSP) and projection pursuit (PP). A DAIS-7915 hyperspectral sensor data set was used for the investigations in this thesis.

Among the parametric and non-parametric classifiers, the GML classifier gave the best results, with an overall kappa value (k-value) of 95.89%. This was achieved using 300 training pixels (TP) per class and 45 bands of the SPCA feature extracted data set.

The SVM algorithm with the quadratic programming (QP) optimizer gave the best results amongst all optimizers and approaches. An overall k-value of 96.91% was achieved using 300 TP per class and 20 bands of the SPCA feature extracted data set. However, the supervised FE techniques like KPCA and OSP failed to improve the results obtained by SVM significantly.

The best results obtained for GML, KNN and SVM were compared by one-tailed hypothesis testing. The SVM classifier performed significantly better than the GML classifier for a statistically large set of TP (300). For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM on the SPCA extracted data set was not statistically better than that of the GML classifier.

ACKNOWLEDGEMENTS

I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit, for his involvement, motivation and encouragement throughout and beyond the thesis work. His expert direction has inculcated in me qualities which I will treasure throughout my life. His patient hearing, critical comments and approach to the research problem made me do better every time. His valuable suggestions at all stages of the thesis work helped me to overcome various shortcomings of my thesis work. I also express my sincere thanks for his effort in going through the manuscript carefully and making it more readable. It has been a great learning and life changing experience working with him.

I would like to express my sincere tribute to Dr. Bharat Lohani for his friendly nature, excellent guidance and teaching during my stay at IITK.

I would especially like to thank Sumanta Pasari for his valuable comments and corrections of the manuscript of my thesis.

I would like to thank all of my friends, especially Shalabh, Pankaj, Amar, Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all other GI people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous, pleasant and memorable one.

In closure, I express my cordial homage to my parents and my best friend for their unwavering support and encouragement to complete my study at IITK.

SOUMYADIP CHANDRA

July 2010

CONTENTS

CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 - Introduction
1.7 Structure of thesis
3.3 Supervised classifier
4.1.2 PP
4.1.3 KPCA
4.1.4 OSP
... classifier
5.2 Results for parametric and non-parametric classifiers
5.3.4 Class wise comparison of the best result of SVM
REFERENCES
APPENDIX A

LIST OF TABLES

2.1 Summary of literature review
3.1 Examples of common kernel functions
4.1 List of parameters
5.1 The time taken for each FE technique
5.2 The best kappa values and z-statistic (at 5% significance values) for GML
5.3 Ranking of FE techniques and time required to obtain the best k-value
5.4 Classification with KNNC on OD and feature extracted data set
5.5 The best k-values and z-statistic for KNNC
5.6 Rank of FE techniques and time required to obtain best k-value
5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set
5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set
5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets
5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms
5.11 Statistical comparison of different classifier’s results obtained for different data sets
5.12 Ranking of different classification algorithms depending on classification accuracy and time. (Rank: 1 indicates the best)

LIST OF FIGURES

1.1 Hyperspectral image cube
1.2 Fractional volume of a hypersphere inscribed in hypercube decrease as dimension increases
1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002)
1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area
1.5 Google earth image of study area
3.1 Overview of FE methods
3.2 Formation of blocks for SPCA
3.2a Chart of multilayered segmented PCA
3.3 Layout of the regions for the chi-square projection index
3.4 (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only
3.5 Outline of KPCA algorithm
3.6 KNN classification scheme
3.7 Outline of KNN algorithm
3.8 Linear separating hyperplane for linearly separable data
3.9 Non-linear mapping scheme
3.10 Brief description of SVM_QP algorithm
3.11 Overview of KPCA_SVM algorithm
3.12 Definitions and values used in applying one-tail hypothesis testing
4.1 SPCA feature extraction method
4.2 Projection pursuit feature extraction method
4.3 KPCA feature extraction method
4.4 OSP feature extraction method
4.5 Overview of classification procedure
4.6 Experimental scheme for Set-I experiments
4.7 The experimental scheme for advanced classifier (Set-II)
5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively
5.2 Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction
5.3 First six Segmented Principal Components (SPCs) (b) shows water body and salt lake
5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP
5.5 First six features obtained by using eight end-members
5.6 Two components of most interesting projections
5.7 Correlation images after applying various feature extraction techniques
5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands
5.9 Comparison of kappa values and classification times for GML classification method
5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP
5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP
5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP
5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP
5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP
5.15 Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class
5.16 Comparison of best k-value and classification time for original and feature extracted data set
5.17 Class wise accuracy comparison of OD and different feature extracted data for KNNC
5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer
5.19 Classification time comparison using 200 and 300 TP per class
5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer
5.21 Comparison of classification time for different sets of TP with respect to number of bands for SVM_SMO classification algorithm
5.22 Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm
5.23 Comparison of classification accuracy of individual classes for different SVM algorithms

LIST OF ABBREVIATIONS

AC Advanced classifier

DAFE Discriminant analysis feature extraction

DAIS Digital airborne imaging spectrometer

DBFE Decision boundary feature extraction

FE Feature extraction

GML Gaussian maximum likelihood

HD Hyperspectral data

ICA Independent component analysis

KNN k-nearest neighbors

k-value Kappa value

KPCA Kernel principal component analysis

KPCA_SVM Support vector machine with kernel principal component analysis

MS Multispectral data

NWFE Nonparametric weighted feature extraction

Ncri Critical value

OD Original data

OSP Orthogonal subspace projection

PCA Principal component analysis

PCT Principal component transform

PP Projection pursuit

rbf Radial basis function

SPCA Segmented principal component analysis

SV Support vectors

SVM Support vector machine

SVM_QP Support vector machine with quadratic programming optimizer

SVM_SMO Support vector machine with sequential minimal optimizer

TP Training pixels

Dedicated

to

my family & guide

CHAPTER 1

INTRODUCTION

Remote sensing technology has brought a new dimension to earth observation, mapping and many other fields. In its early days, multispectral sensors were used for capturing data. Multispectral sensors capture data in a small number of bands with broad wavelength intervals. With so few spectral bands, their spectral resolution is insufficient to discriminate amongst many earth objects. If, however, the spectral measurement is performed using hundreds of narrow wavelength bands, then several earth objects can be characterized precisely. This is the key concept of hyperspectral imagery.

Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger information content, is voluminous and differs in its characteristics. Extracting this large amount of information from HD therefore remains a challenge, and cost effective and computationally efficient procedures are required to classify HD. Data classification is the categorization of data for its most effective and efficient use. The desired result of classification is a high accuracy thematic map, and HD has the potential to provide one.

This chapter introduces the concept of high dimensional space, HD and the difficulties in classifying HD. It then states the objectives of the thesis, gives an overview of the data set used, describes the software used and ends with the structure of the thesis.

In mathematics, an n-dimensional space is a topological space whose dimension is n (where n is a fixed natural number). A typical example is n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions. n-dimensional spaces with large values of n are sometimes called high-dimensional spaces (Werke, 1876). Many familiar geometric objects can be generalized to any number of dimensions. For example, the two-dimensional triangle and the three-dimensional tetrahedron can be seen as specific instances of the n-dimensional simplex. In addition, the circle and the sphere are particular forms of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).

When spectral measurement is done using hundreds of narrow contiguous wavelength intervals, the captured image is called a hyperspectral image. A hyperspectral image is usually represented by a hyperspectral image cube (Figure 1.1). In this cube, the x and y axes specify the size of the image and the λ axis specifies the dimension, i.e. the number of bands. A hyperspectral sensor collects information as a set of images, one for each band. Each image represents a range of the electromagnetic spectrum.

These images are then combined to form a three dimensional hyperspectral cube. As the dimension of HD is very high, it is comparable with a high dimensional space. HD shares the characteristics of high dimensional space, which are described in the following section.
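
The cube layout described above can be illustrated with a short sketch. It is written in Python with NumPy purely for illustration (the thesis code itself was written in Matlab), the cube is random synthetic data, and the function names `cube_to_pixels` and `pixels_to_cube` are hypothetical helpers, not part of the thesis software. It shows how a cube with axes (x, y, λ) is usually flattened so that every pixel becomes one point in a space whose dimension equals the number of bands.

```python
import numpy as np

def cube_to_pixels(cube):
    """Flatten a (rows, cols, bands) cube into a (rows*cols, bands) matrix.

    Each row of the result is the spectrum of one pixel, i.e. one point
    in a space whose dimension is the number of bands.
    """
    rows, cols, bands = cube.shape
    return cube.reshape(rows * cols, bands)

def pixels_to_cube(pixels, rows, cols):
    """Inverse of cube_to_pixels: rebuild the (rows, cols, bands) cube."""
    return pixels.reshape(rows, cols, pixels.shape[1])

# Synthetic stand-in for a hyperspectral image: 8 x 8 pixels, 65 bands.
rng = np.random.default_rng(0)
cube = rng.random((8, 8, 65))
pixels = cube_to_pixels(cube)   # 64 points in a 65-dimensional space
```

Classifiers and FE techniques typically operate on the flattened pixel matrix, and the result is reshaped back to the image grid for display.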

1.1.2 Characteristics of high dimensional space

High dimensional spaces, i.e. spaces with a dimensionality greater than three, have properties that are substantially different from our normal sense of distance, volume and shape. In particular, in a high-dimensional Euclidean space, volume expands far more rapidly with increasing diameter than in lower-dimensional spaces, so that, for example:

(i). Almost all of the volume within a high-dimensional hypersphere lies in a thin

shell near its outer "surface"

(ii). The volume within a high-dimensional hypersphere relative to a hypercube of

the same width tends to zero as dimensionality tends to infinity, and almost all

of the volume of the hypercube is concentrated in its "corners".

The characteristics mentioned above have two important consequences for high dimensional data. The first is that high dimensional space is mostly empty. As a consequence, high dimensional data can be projected to a lower dimensional subspace without losing significant information in terms of separability among the different statistical classes (Jimenez and Landgrebe, 1995). The second consequence is that normally distributed data will have a tendency to concentrate in the tails; similarly, uniformly distributed data will be more likely to be collected in the corners, making density estimation more difficult. Local neighborhoods are almost empty, requiring the bandwidth of estimation to be large and producing the effect of losing detailed density estimation (Abhinav, 2009).

Figure 1.2: The fraction of the volume of a hypersphere inscribed in a hypercube decreases as dimension increases (modified after Jimenez and Landgrebe, 1995)
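
Both geometric properties can be checked numerically. The sketch below is an illustration in Python, not code from the thesis. It uses the standard formula for the volume of a d-dimensional ball of radius r, π^(d/2) r^d / Γ(d/2 + 1), divided by the volume (2r)^d of the enclosing hypercube, and the fact that the share of a ball's volume lying within a relative distance ε of its surface is 1 − (1 − ε)^d. The function names are hypothetical.

```python
import math

def sphere_to_cube_ratio(d):
    """Volume of a d-ball divided by the volume of the hypercube it is inscribed in.

    ratio = pi^(d/2) / (2^d * Gamma(d/2 + 1)); computed in log space so that
    large d does not overflow.
    """
    log_ratio = (d / 2) * math.log(math.pi) - d * math.log(2) - math.lgamma(d / 2 + 1)
    return math.exp(log_ratio)

def shell_fraction(d, eps):
    """Fraction of a d-ball's volume within relative distance eps of its surface."""
    return 1.0 - (1.0 - eps) ** d

for d in (2, 3, 10, 65):
    print(d, sphere_to_cube_ratio(d), shell_fraction(d, 0.05))
```

For d = 2 the ratio is π/4 and for d = 3 it is π/6. The ratio then falls rapidly towards zero, while the share of the volume in the outer 5% shell approaches one.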

Hyperspectral imaging collects and processes information from across the electromagnetic spectrum. Hyperspectral imagery can distinguish between many types of earth objects which may appear to be the same color to the human eye. Hyperspectral sensors look at objects using a vast portion of the electromagnetic spectrum. The whole process of hyperspectral imaging can be divided into three steps: preprocessing, radiance to reflectance transformation and data analysis (Varshney and Arora, 2004).

In particular, preprocessing is required to convert the raw radiance to sensor radiance. The preprocessing steps include operations such as spectral calibration, geometric correction, geo-coding and signal to noise adjustment. The radiometric and geometric accuracy of hyperspectral data differs significantly from one band to another (Varshney and Arora, 2004).

1.2 What is classification?

Classification means putting data into groups according to their characteristics. In the case of spectral classification, the areas of the image that have similar spectral reflectance are put into the same group or class (Abhinav, 2009). Classification is also seen as a means of compressing image data by reducing the large range of digital numbers (DN) in several spectral bands to a few classes in a single image. Classification reduces this large spectral space into relatively few regions and obviously results in a loss of numerical information from the original image. Depending on the availability of information about the imaged region, supervised or unsupervised classification methods are used.

Although HD can provide a more accurate thematic map than MS data, the classification of high dimensional data faces some difficulties, as listed below:

1. As the dimensionality of the data set increases with the number of bands, the number of training pixels (TP) required for training a specific classifier must increase as well to achieve the desired classification accuracy. It becomes very difficult and expensive to obtain a large number of TP for each sub class. This has been termed the “curse of dimensionality” by Bellman (1960), which leads to the concept of the “Hughes phenomenon” (Hughes, 1968).

2. Characteristics of high dimensional space: The characteristics of high dimensional space have been discussed above (Sec. 1.1.2). For those reasons, the algorithms that are used to classify multispectral data often fail for hyperspectral data.

3. Large number of highly correlated bands: Hyperspectral sensors use a large number of contiguous spectral bands, and some of these bands are highly correlated. Correlated bands do not give good classification results. An important task is therefore to select the uncorrelated bands, or to make the bands uncorrelated by applying feature reduction algorithms (Varshney and Arora, 2004).

4. Optimum number of features: It is critical to select the optimum number of bands out of a large number of bands (e.g. 224 bands for an AVIRIS image) for use in classification. To date there is no suitable algorithm or rule for selecting the optimal number of features.

5. Large data size and high processing time due to complexity of classifier: Hyperspectral imaging systems provide a large amount of data. A large memory and a powerful system are therefore necessary to store and handle the data, which is generally very expensive.

This thesis is an extension of the work done by Abhinav Garg (2009) in his M.Tech thesis. He showed that among the conventional classifiers (Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER), GML provides the best result. The performance of GML improved significantly after applying feature extraction (FE) techniques. Principal component analysis (PCA) was found to work best among all FE techniques (discriminant analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and independent component analysis (ICA)) in improving the classification accuracy of GML.

For the advanced classifiers, the result of SVM does not depend on the choice of parameters, whereas that of ANN does. He also showed that the result of SVM was improved by the PCA and ICA techniques, while supervised FE techniques like NWFE and DBFE failed to improve it significantly.

He identified some drawbacks of advanced classifiers like SVM and suggested some FE techniques which may improve the results for conventional classifiers (CC) as well as advanced classifiers (AC). For a large number of TP (e.g. 300 per class), however, SVM takes more processing time than for a small number of TP. The objectives of this thesis are to address these problems and to find the best FE technique for improving the classification results for HD. These objectives are described in the next section.

1.4 Objectives

This thesis has investigated the following two objectives pertaining to

classification with hyperspectral data:

Objective-1:

To evaluate various FE techniques for classification of hyperspectral data.

Objective-2:

To study the extent to which advanced classifiers can reduce problems related to the classification of hyperspectral data.

The study area for this research is located within an area known as 'La Mancha Alta', covering approximately 8000 sq. km to the south of Madrid, Spain (Fig. 1.3). The area is mainly used for the cultivation of wheat, barley and other crops such as vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer on 29th June, 2000, at 5 m resolution.

Data was collected over 79 wavebands ranging from 0.4 μm to 12.5 μm, with the exception of 1.1 μm to 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to 2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were observed between bands 41 and 72. All 72 bands were visually examined, and 7 bands (41, 42 and 68 to 72) were found to be unusable due to very severe striping and were removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels covering the area of interest was extracted (Abhinav, 2009).
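
The band bookkeeping above can be verified with a few lines of Python. This is only an illustrative check of the arithmetic, assuming that bands are numbered from 1 as in the text; the variable names are hypothetical and no actual DAIS data is read.

```python
# Bands 1..72 cover 0.4 um to 2.5 um and were kept for analysis.
first_72_bands = list(range(1, 73))

# Bands removed because of severe striping: 41, 42 and 68 to 72.
striped_bands = {41, 42} | set(range(68, 73))

# Bands retained for classification (1-based band numbers).
retained_bands = [b for b in first_72_bands if b not in striped_bands]

# Zero-based indices, e.g. for slicing the last axis of a (rows, cols, bands) array.
retained_idx = [b - 1 for b in retained_bands]

print(len(striped_bands), len(retained_bands))
```

Seven bands are removed from the 72, which leaves the 65 bands used in the rest of the thesis.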

The data set available for this research includes the 65 bands retained after pre-processing and the reference image, generated with the help of field data collected by local farmers, as described in Pal (2002). The area covered by the imagery was found to be divided into eight different land cover types, namely wheat, water body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built up area.

Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)

Figure 1.4: FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area (Pal, 2002).

Figure 1.5: Google earth image of study area (Google earth, 2007)

Processing HD requires a very powerful system because of the size of the data set and the complexity of the algorithms. The machine used for this thesis has a 2.16 GHz Intel processor with 2 GB RAM and runs the Windows 7 operating system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All results were obtained on the same machine so that the different algorithms can be compared.

The present thesis is organized into six chapters. Chapter 1 focuses on the characteristics of high dimensional space, the challenges of HD classification and the outline of the experiments of this thesis. It also discusses the study region, the data set and the software used. Chapter 2 presents a detailed description of HD classification and the previous research work related to this domain. Chapter 3 describes the detailed mathematical background of the different processes used in this work. Chapter 4 outlines the detailed methodology followed in this thesis. Chapter 5 presents the experiments conducted for this thesis, followed by their interpretation. Chapter 6 provides the conclusions of the present work and the scope for future work.

CHAPTER 2

LITERATURE REVIEW

This chapter outlines the important research works and major achievements in

the field of high dimensional data analysis and data classification. The chapter begins

with some of the FE techniques and classification approaches, for solving problems

related to HD classification as suggested by various researchers. The results of useful

experiments with the HD will also be included to highlight the usefulness and

reliability of these approaches. These results are presented in tabulated form. Some

other issues related to classification of HD are also discussed at the end of this

chapter.

Swain and Davis (1978) mentioned details of various separability measures for

multivariate normal class models. Various statistical classes are found to be

overlapping which causes error of misclassification as most of the classifiers use

decision boundary approach for classification. The idea was to obtain such a

separability measure which could give an overall estimate of range of classification

accuracies that can be achieved by using a sub-set of selected features so that the

sub-set of features corresponding to highest classification accuracy can be selected for

classification (Abhinav, 2009).

FE is the process of transforming the given data from a higher dimensional

space to a lower dimensional space while conserving the underlying information

(Fukunaga, 1990). The philosophy behind such transformation is to re-distribute the

underlying information spread in high dimensional space by containing it into

comparatively smaller number of dimensions without loss of significant amount of

useful information. FE techniques, in case of classification, try to enhance class

separability while reducing data dimensionality (Abhinav, 2009).

2.1.1 Segmented principal component analysis (SPCA)

The principal component transform (PCT) has been successfully applied to multispectral data for feature reduction. It can also be used as a tool for image enhancement and digital change detection (Lodwick, 1979). For dimension reduction of HD, PCA outperforms those FE techniques which are based on class statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited and its ratio to the number of dimensions is low for HD, the class covariance matrix cannot be estimated properly. To overcome these problems, Jia (1996) proposed segmented principal component analysis (SPCA), which applies the PCT separately to each block of highly correlated bands. This approach also reduces the processing time by dividing the complete set of bands into several blocks of highly correlated bands. Jensen and James (1999) reported that SPCA-based compression generally outperforms PCA-based compression in terms of detection and classification accuracy on decompressed HD. PCA works efficiently for highly correlated data sets, but SPCA works efficiently for both highly correlated and weakly correlated data sets (Jia, 1996).

Jia (1996) compared SPCA and PCA extracted features for target detection and concluded that SPCA is a better FE technique than PCA. She also showed that both feature extracted data sets are identical and that there is no loss of variance in the middle stages, as long as no components are removed.
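
A minimal sketch of the block-wise idea follows, in Python with NumPy for illustration only. It is not the thesis implementation: the block boundaries are passed in by hand (the sizes 32, 6 and 27 are taken from the caption of Figure 5.1), the data is synthetic, and the function names `pca_block` and `segmented_pca` and the `keep` parameter are hypothetical. Each block of adjacent bands is transformed by its own PCT, and the retained components of all blocks are concatenated.

```python
import numpy as np

def pca_block(X):
    """PCT of one block of bands: returns component scores sorted by variance."""
    Xc = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]           # largest variance first
    return Xc @ eigvecs[:, order]

def segmented_pca(X, block_sizes, keep=None):
    """Apply the PCT separately to consecutive blocks of bands.

    X           : (n_pixels, n_bands) matrix
    block_sizes : number of bands in each consecutive block
    keep        : components kept per block (None keeps all of them)
    """
    if sum(block_sizes) != X.shape[1]:
        raise ValueError("block sizes must add up to the number of bands")
    scores, start = [], 0
    for size in block_sizes:
        block_scores = pca_block(X[:, start:start + size])
        k = size if keep is None else min(keep, size)
        scores.append(block_scores[:, :k])
        start += size
    return np.hstack(scores)

# Synthetic stand-in for 65-band pixel data.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 65)) @ rng.standard_normal((65, 65))
Z_all = segmented_pca(X, [32, 6, 27])            # all components kept
Z_red = segmented_pca(X, [32, 6, 27], keep=5)    # 5 components per block
```

When all components are kept, each block transform is a rotation, so the total variance of `Z_all` equals that of `X`. This is consistent with the statement above that no variance is lost as long as no components are removed.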

Projection pursuit (PP) methods were originally posed and experimented with by Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman and Tukey (1974). They described PP as a way of searching for and exploring nonlinear structure in multi-dimensional data by examining many 2-D projections. Their goal was to find interesting views of a high dimensional data set. The next stages in the development of the technique were presented by Jones (1983) who, amongst other things, developed a projection index based on polynomial moments of the data. Huber (1985) presented several aspects of PP, including the design of projection indices. Friedman (1987) derived a transformed projection index. Hall (1989) developed an index using methods similar to Friedman's, and also developed theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b) introduced a projection index called the chi-square projection pursuit index. He used a random search method to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to get a sequence of interesting 2-D projections. Each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. More recently, the PP technique has also been used to obtain 1-D projections (Martinez, 2005). In this research work, Posse’s method is followed, which reduces the n-dimensional data set to 2-dimensional data.
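
The general shape of a random search for an interesting projection can be sketched as follows. This is a simplified illustration in Python, not Posse's method: it searches for a single 1-D direction rather than a 2-D plane, it omits structure removal, and it uses a simple moment-based index (squared skewness plus one quarter of the squared excess kurtosis, scaled by 1/12) as a stand-in for the chi-square index. All function names and parameters are hypothetical.

```python
import numpy as np

def sphere(X):
    """Centre and whiten the data so that every projection has unit variance.

    Assumes the sample covariance matrix has full rank.
    """
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs / np.sqrt(eigvals)

def moment_index(z):
    """Moment-based 'interestingness' of a 1-D projection.

    Close to zero for Gaussian data, larger for skewed or multimodal data.
    Illustrative stand-in only; it is not the chi-square index.
    """
    z = (z - z.mean()) / z.std()
    skewness = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0
    return (skewness ** 2 + 0.25 * excess_kurtosis ** 2) / 12.0

def random_search_pp(X, n_trials=2000, seed=0):
    """Try random unit directions and keep the one with the largest index."""
    rng = np.random.default_rng(seed)
    Z = sphere(X)
    best_dir, best_val = None, -np.inf
    for _ in range(n_trials):
        a = rng.standard_normal(Z.shape[1])
        a /= np.linalg.norm(a)                  # random direction on the unit sphere
        val = moment_index(Z @ a)
        if val > best_val:
            best_dir, best_val = a, val
    return best_dir, best_val
```

A pure random search of this kind only samples the index. Practical implementations usually refine the best candidates, for example by shrinking the search neighborhood around the current best direction.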

Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP) method, which simultaneously reduces the data dimensionality, suppresses undesired or interfering spectral signatures, and detects the presence of a spectral signature of interest. The concept is to project each pixel vector onto a subspace which is orthogonal to the undesired signatures. For OSP to be effective, the number of bands must not be less than the number of signatures. This is a major limitation for multispectral images. To overcome it, Ren and Chang (2000) presented the Generalized OSP (GOSP) method, which relaxes this constraint in such a manner that OSP can be extended to multispectral image processing in an unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci, 2001) and also for magnetic resonance image classification (Wang et al., 2001).
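
The projection step can be written compactly. The sketch below, in Python with NumPy, assumes the usual linear mixing notation: `U` holds the undesired signatures as columns, `d` is the signature of interest, and P = I − U U⁺ (with U⁺ the pseudo-inverse) projects onto the subspace orthogonal to the columns of `U`. The names and the synthetic signatures are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def osp_projector(U):
    """P = I - U U^+ : projector onto the orthogonal complement of span(U).

    U has shape (n_bands, n_undesired), one undesired signature per column.
    """
    n_bands = U.shape[0]
    return np.eye(n_bands) - U @ np.linalg.pinv(U)

def osp_detector(pixels, d, U):
    """Apply the operator d^T P to every pixel.

    pixels : (n_pixels, n_bands) matrix, one spectrum per row
    d      : (n_bands,) signature of interest
    Returns one score per pixel; undesired signatures contribute nothing.
    """
    P = osp_projector(U)
    return pixels @ (P @ d)      # P is symmetric, so r^T P d equals d^T P r

# Synthetic example: 65 bands, three undesired signatures.
rng = np.random.default_rng(3)
U = rng.random((65, 3))
d = rng.random(65)
pixel = 0.7 * d + U @ np.array([1.0, 2.0, 3.0])   # mixed pixel
score = osp_detector(pixel[None, :], d, U)[0]
```

The score of the mixed pixel depends only on the abundance of `d` (0.7 here), because the contribution of `U` is annihilated by the projector. The requirement that the number of bands is not less than the number of signatures is what keeps the projected signature of interest from vanishing.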

Linear PCA cannot always detect all the structure in a given data set. By using a suitable nonlinear feature extractor, more information can be extracted from the data set. Kernel principal component analysis (KPCA) can be used as a strong nonlinear FE method (Scholkopf and Smola, 2002), which maps the input vectors to a feature space and then applies PCA to the mapped vectors. KPCA is also a powerful preprocessing step for classification algorithms (Mika et al., 1998). Rosipal et al. (2001) proposed the application of the KPCA technique for feature selection in a high-dimensional feature space where the input variables were mapped by a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of the higher-order statistics. To obtain these higher-order statistics, a large number of TP is required. This causes problems for KPCA, since KPCA requires storing and manipulating the kernel matrix, whose size is the square of the number of TP. To overcome this problem, a new iterative algorithm for KPCA, the Kernel Hebbian Algorithm (KHA), was introduced by Scholkopf et al. (2005).

Parametric classifiers (Fukunaga, 1990) require some parameters to develop

the assumed density function model for the given data. These parameters are

computed with the help of a set of already classified or labeled data points called

training data. It is a subset of given data for which the class labels are known and is

chosen by sampling techniques (Abhinav, 2009). It is used to compute some class

statistics to obtain the assumed density function for each class. Such classes are

referred to as statistical classes (Richards and Jia, 2006) as these are dependent upon

the training data and may differ from the actual classes.

Maximum likelihood method is based on the assumption that the frequency

distribution of the class membership can be approximated by the multivariate normal

probability distribution (Mather, 1987). Gaussian Maximum Likelihood (GML) is one

of the most popular parametric classifiers that has been used conventionally for

purpose of classification of remotely sensed data (Landgrebe, 2003). The advantages

of GML classification method are that, it can obtain minimum classification error

under the assumption that the spectral data of each class is normally distributed and

it not only considers the class centre but also its shape, size and orientation by

calculating a statistical distance based on the mean values and covariance matrix of

the clusters (Lillesand et al., 2002).
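The GML decision rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `fit_gml` and `predict_gml`, the synthetic two-class data and the use of class frequencies as priors are all hypothetical choices, not taken from the cited works.

```python
import numpy as np

def fit_gml(X, y):
    """Estimate mean, covariance and prior of each class from labelled pixels (rows of X)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def predict_gml(X, params):
    """Assign each pixel to the class with the largest Gaussian log-discriminant."""
    classes = list(params)
    scores = []
    for c in classes:
        mean, cov, prior = params[c]
        diff = X - mean
        # Squared Mahalanobis distance: accounts for shape, size and orientation of the class.
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(np.log(prior) - 0.5 * logdet - 0.5 * maha)
    return np.array(classes)[np.argmax(scores, axis=0)]

# Synthetic training data (assumption): two well separated Gaussian classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
                     rng.normal([5, 5], 1.0, size=(200, 2))])
y_train = np.array([0] * 200 + [1] * 200)
gml = fit_gml(X_train, y_train)
pred = predict_gml(np.array([[0.0, 0.0], [5.0, 5.0]]), gml)
```

With hyperspectral data the covariance estimate may become singular when the number of TP per class is small relative to the number of bands, which is one reason FE is applied before GML.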

Lee and Landgrebe (1993) compared the result of GML classifier on PCA and

DBFE feature extracted data set and concluded that DBFE feature extracted data set

provides better accuracy than PCA feature extracted data set. NWFE and DAFE FE

techniques were compared for classification accuracy achieved by nearest neighbor

and GML classifiers by Kuo and Landgrebe (2004). They concluded that NWFE is

better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA,

DAFE, DBFE and NWFE feature extracted data set on GML classifier. He showed

that PCA is the best FE technique for HD among the other mentioned feature

extractor for GML classifier. He also suggested that some FE techniques like KPCA,

OSP, SPCA, PP may improve the classification result using GML classifier.

The non–parametric classifiers (Fukunaga, 1990) use some control

parameters, carefully chosen by the user, to estimate the best fitting function by

using an iterative or learning algorithm. They may or may not require any training

data for estimating the PDF. Parzen window (Parzen, 1962) and k–nearest neighbor

(KNN) (Cover and Hart, 1967) are two popular working classifiers under this

category. Edward (1972) gave brief descriptions of many non-parametric approaches

for estimation of data density functions.

2.3.1 KNN

KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern

recognition. The technique can achieve high classification accuracy in problems which

have unknown and non-normal distributions. However, it has a major drawback that

a large amount of TP is required in the classifiers resulting in high computational

complexity for classification (Hwang and Wen, 1998).

Pechenizkiy (2005) compared the performance of KNN classifier on the PCA

and random projection (RP) feature extracted data set. He concluded that KNN

performs well on PCA feature extracted data set. Zhu et. al. (2007) showed that the

KNN works better on the ICA feature extracted data set than the original data set

(OD) (OD was captured by Hyperspectral imaging system developed by the ISL). ICA-

KNN method with a few wavelengths had the same performance as the KNN

classifier alone using information from all wavelengths.

Some more non–parametric classifiers based on geometrical approaches of data

classification were found during literature survey. These approaches consider the

data points to be located in the Euclidean space and exploit the geometrical patterns

of the data points for classification. Such approaches are grouped into a new class of
classifiers known as machine learning techniques. Support Vector Machines (SVM)
(Boser et al., 1992), k-nearest neighborhood (KNN) (Fix and Hodges, 1951) are among

the popular classifiers of this kind. These do not make any assumptions regarding

data density function or the discriminating functions and hence are purely non–

parametric classifiers. However, these classifiers also need to be trained using the

training data.

2.3.2 SVM

SVM is considered an advanced classifier. It belongs to a new generation of
classification techniques based on Statistical Learning Theory, with origins in
Machine Learning, and was introduced by Boser, Guyon and Vapnik (1992). Vapnik (1995,

1998) discussed SVM based classification in detail. SVM tends to improve learning by

empirical risk minimization (ERM) to minimize learning error and to minimize the

upper bound on the overall expected classification error by structural risk

minimization (SRM). SVM makes use of principle of optimal separation of classes to

find a separating hyperplane that separates classes of interest to maximum extent by

maximizing the margin between the classes (Vapnik, 1992). This technique is

different from that of estimation of effective decision boundaries used by Bayesian

classifiers as only data vectors near to the decision boundary (also known as support

vectors) are required to find the optimal hyperplane. A linear hyperplane may not be

enough to classify the given data set without error. In such cases, data is transformed

to a higher dimensional space using a non–linear transformation that spreads the

data apart such that a linear separating hyperplane may be found. Kernel functions

are used to reduce the computational complexity that arises due to increased

dimensionality (Varshney and Arora, 2004).

Advantages of SVM (Varshney and Arora, 2004) lie in their high generalization
capability and their ability to adapt their learning characteristics by using kernel
functions, due to which they can adequately classify data in a high–dimensional
feature space with a limited number of training samples and are not affected by the
Hughes phenomenon and other effects of dimensionality. The ability to classify using
even a limited number of training samples makes SVM a very powerful classification
tool for remotely sensed data. Thus, SVM has the potential to produce accurate
classifications from HD with a limited number of training samples. SVMs are believed
to be better learning machines than neural networks, which tend to overfit classes
and cause misclassification (Abhinav, 2009), as they rely on margin maximization
rather than finding a decision boundary directly from the training samples.

For conventional SVM an optimizer is used based on quadratic programming

(QP) or linear programming (LP) methods to solve the optimization problem. The

major disadvantage of QP algorithm is the storage requirement of kernel matrix in

the memory. When the size of the kernel matrix is large enough, it requires huge

memory that may not be always available. To overcome this, Bennett and Campbell
(2000) suggested an optimization method that sequentially updates the Lagrange
multipliers, called the kernel adatron (KA) algorithm. Another approach is the
decomposition method, which updates the Lagrange multipliers in parallel: it updates
many parameters in each iteration, unlike other methods that update one parameter
at a time (Varshney and Arora, 2004). A QP optimizer is used here, which updates the
Lagrange multipliers on a fixed-size working data set. Decomposition

method uses QP or LP optimizer to solve the problem of huge data set by considering

many small data sets rather than a single huge data set (Varshney, 2001). The

sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of

decomposition method when the size of working data set is fixed such that an

analytical solution can be derived in very few numerical operations. This does not use
the QP or LP optimization methods. This method needs more iterations, but each
iteration requires only a small number of operations, which results in an increase in
optimization speed for very large data sets.

The speed of SVM classification decreases as the number of support vectors
(SV) increases. By using kernel mapping, different SVM algorithms have successfully
incorporated effective and flexible nonlinear models. For large data sets, however,
the calculation of the nonlinear kernel matrix causes major difficulties. To overcome
these computational difficulties, some authors have proposed low rank approximations
to the full kernel matrix (Wiens, 92). As an alternative, Lee and Mangasarian (2002)
proposed the reduced support vector machine (RSVM), which reduces the size of the
kernel matrix, but the problem of selecting the number of support vectors (SV)
remained. In 2009, Sundaram proposed a method which reduces the number of SV through
the application of KPCA. This method differs from other proposed methods in that the
exact choice of support vectors is not important as long as the vectors span a fixed
subspace.

Benediktsson et al (2000) applied KPCA on the ROSIS-03 data set. They then
used a linear SVM on the feature extracted data set and showed that KPCA features
are more linearly separable than the features extracted by conventional PCA. Shah et
al (2003) compared SVM, GML and ANN classifiers for accuracies at full
dimensionality and using DAFE and DBFE FE techniques on an AVIRIS data set, and
concluded that SVM gives higher accuracies than GML and ANN for full
dimensionality but poor accuracies for features extracted by DAFE and DBFE.
Abhinav (2009) compared SVM, GML and ANN with OD and PCA, ICA, NWFE,
DBFE, DAFE feature extracted data sets. He concluded that SVM provides better
results for OD than GML. SVM works best with PCA and ICA feature extracted data
sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.

The work done by various researchers with different hyperspectral data sets,
using different classifiers and FE methods, and the results obtained by them are
summarized in Table 2.1.

Table 2.1: Summary of literature review

| Author | Dataset used | Method used | Results obtained |
|---|---|---|---|
| Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier was used to compare classification accuracies obtained by DBFE and PCA FE. | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods. |
| Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics were studied with respect to the effects of dimensionality and the order of data statistics used on supervised classification techniques. | Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy was observed to increase with the use of higher order statistics. But lower order statistics were observed to be less affected by the Hughes phenomenon. |
| Benediktsson et al (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets were used for classification using linear SVM. | KPCA features are more linearly separable than features extracted by conventional PCA. |
| Shah et al. (2003) | AVIRIS | Compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques. | SVM was found to give higher accuracies than GML and ANN for full dimensionality, but poor accuracies were obtained for features extracted by DAFE and DBFE. |
| Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC mall, Washington, US) | NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor and GML classifiers. | NWFE was found to produce better classification accuracies than DAFE. |
| Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier was used to compare classification accuracies obtained by PCA and Random Projection FE. | PCA gave better results than Random Projection. |
| Zhu et al (2007) | Hyperspectral imaging system developed by ISL | ICA ranking methods were used to select the optimal wavelengths, then KNN was used. Then KNN alone was used. | ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands. |
| Sundaram (2009) | The adult dataset, part of the UCI Machine Learning Repository | KPCA was applied to the support vectors, then the usual SVM algorithm was used. | Significantly reduced the processing time without affecting the classification accuracy. |
| Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques were used on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets. | GML was the best among the techniques and performs best on the PCA extracted data set. |
| Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques were used on the OD and on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare the accuracy. | GML performed much worse than SVM on OD. SVM provides better accuracy than GML. SVM performs better on PCA and ICA extracted data sets. |

1. From Table 2.1, it can be concluded that FE techniques like PCA, ICA, DAFE,
DBFE and NWFE perform well in improving the classification accuracies when used
with GML. But the features extracted by DBFE and DAFE failed to improve the
results obtained by SVM, implying a limitation of these techniques for advanced
classifiers. KNN works best with PCA and ICA feature extracted data sets.
However, in the surveyed literature, the effects of PP, SPCA, KPCA and OSP
extracted features on the classification accuracy obtained from advanced
classifiers like SVM, the parametric classifier GML and the nonparametric
classifier KNN have not been reported.

2. Another important aspect found missing in the literature is the comparison of
classification time for SVM classifiers, since SVM takes a long time to train
with a large number of TP. Many SVM approaches have been proposed to reduce the
classification time, but there is no conclusion on the best SVM algorithm in
terms of classification accuracy and processing time.

3. Although KNN is an effective classification technique for HD, there is no
guideline on classification time or suggestion of the best FE techniques for the
KNN classifier. Also, the effect of different parameters, such as the number of
nearest neighbors, the number of TP and the number of bands, has not been
reported for KNN.

4. During the literature survey, it was further found that there is no suggestion
for the best FE techniques for different SVM algorithms, GML and KNN.

Such missing aspects will be investigated in this thesis work and the

guidelines to choose an efficient and less time consuming classification technique

shall be presented as the result of this research.

This chapter presented the FE and classification techniques for mitigating the
effects of dimensionality. These techniques are the result of different approaches
used to deal with the problem of high dimensionality and to improve the performance
of advanced, parametric and nonparametric classifiers. The approaches were applied
on real-life HD, and the comparative results reported in the literature were compiled
and presented here. In addition, the important aspects found missing in the
literature survey were highlighted, which this thesis work shall try to investigate.
The mathematical rationale and algorithms used to apply these techniques will be
discussed in detail in the next chapter.

CHAPTER 3

MATHEMATICAL BACKGROUND

This chapter provides the detailed mathematical background of each of the
techniques used in this thesis. Starting with some basic concepts of kernels and
kernel space, it describes the unsupervised and supervised FE techniques, followed
by the classification and optimization rules for supervised classifiers. Finally,
the scheme for statistical analysis used for comparing the results of different
classification techniques is discussed.

The notations followed in this chapter for matrices and vectors are given below:

$X$  A two dimensional matrix whose columns represent the data points (m) and whose rows represent the bands (n), where $X = X[n, m]$.

$x_i$  The ith data point (column of $X$), $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]^T$.

$\Phi(z)$  Mapping of the input vector z in kernel space, using some kernel function.

$\in$  Belongs to.

$\mathbb{R}^n$  Set of n-dimensional real vectors.

$\mathbb{N}$  Set of natural numbers.

$[\,\cdot\,]^T$  Denotes the transpose of a matrix.

$\forall$  For all.

Before defining kernel, let’s look at the following two definitions:

• Input space: The space where originally data points lie.

• Feature space: The space spanned by the transformed data points (from

original space) which were mapped by some functions.

Kernel is the dot product in feature space H via a map Φ from input space,
such that $\Phi : X \to H$. Kernel can be defined as
$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where $x, x'$ and $\Phi(x), \Phi(x')$
are the elements of input space and feature space respectively, k is called the
kernel and Φ is called the feature map associated with k. Φ can also be called the
kernel function. The space containing these dot products is called kernel space.
This is a nonlinear mapping from input space to feature space which increases the
internal distance between two points in a data set, so that a data set which is only
nonlinearly separable in input space can become linearly separable in kernel space.
A few definitions related to kernel are given below:

Gram matrix: For a kernel k and points $x_1, x_2, \ldots, x_n \in X$, the matrix
$K := (k(x_i, x_j))_{ij}$ is called the gram matrix of k with respect to
$x_1, x_2, \ldots, x_n$.

Positive definite matrix: A real n × n symmetric matrix K satisfying
$x_1^T K x_1 \ge 0$ for all column vectors
$x_1 = [x_{11}, x_{21}, \ldots, x_{n1}]^T \in \mathbb{R}^n$ is called positive
definite. If equality in the previous inequality occurs only for
$x_{11} = x_{21} = \cdots = x_{n1} = 0$, then the matrix is called strictly positive
definite. A kernel that gives rise to a strictly positive definite gram matrix is
called a strictly positive definite kernel.

Definitions of some commonly used kernel functions are shown in Table 3.1.

Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)

| Kernel function | $K(x, x_i)$ | Remarks |
|---|---|---|
| Linear | $x \cdot x_i$ | Decision boundary either linear or non linear |
| Polynomial with degree n | $(x \cdot x_i + 1)^n$ | User defined parameters: n is a positive integer |
| Radial basis function | $\exp\left(-\dfrac{\|x - x_i\|^2}{2\sigma^2}\right)$ | User defined parameters: σ is a user defined value |
| Sigmoid | $\tanh(k(x \cdot x_i) + \Theta)$ | User defined parameters: k and Θ are user defined parameters |
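The four kernel functions of Table 3.1 can be written directly as small NumPy functions. The sketch below is illustrative only: the function names, the default parameter values and the two sample vectors are assumptions chosen for the example.

```python
import numpy as np

def linear_kernel(x, xi):
    """Linear kernel: plain dot product."""
    return float(np.dot(x, xi))

def polynomial_kernel(x, xi, n=2):
    """Polynomial kernel of degree n (n is a positive integer)."""
    return float((np.dot(x, xi) + 1.0) ** n)

def rbf_kernel(x, xi, sigma=1.0):
    """Radial basis function (Gaussian) kernel with width sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def sigmoid_kernel(x, xi, k=1.0, theta=0.0):
    """Sigmoid kernel with user defined slope k and offset theta."""
    return float(np.tanh(k * np.dot(x, xi) + theta))

# Two sample vectors (assumption) to exercise the functions.
a = np.array([1.0, 2.0, 1.0])
b = np.array([2.0, 1.0, 3.0])
values = {
    "linear": linear_kernel(a, b),
    "polynomial": polynomial_kernel(a, b, n=2),
    "rbf": rbf_kernel(a, b, sigma=1.0),
    "sigmoid": sigmoid_kernel(a, b, k=0.1),
}
```

Each function returns a single kernel value; a full gram matrix is obtained by evaluating the chosen function for every pair of data points.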

All the above definitions are illustrated with the following simple example.

Let $X = [x_1 \; x_2 \; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$
be a matrix in input space whose columns ($x_i$, i = 1,2,3) denote the data points
and whose rows denote the dimensions of the data points.

Let this matrix be mapped into the feature space by using the Gaussian kernel
function, and let $\langle x_i, x_j \rangle$ denote the inner product of the columns
of the matrix X using the Gaussian kernel function.

Then the gram matrix (kernel matrix) K takes precisely the form,

$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$

The numerical value of the matrix K is,

$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$

A kernel whose gram matrix is positive definite is called a positive definite
kernel, and if it is strictly positive definite, then it is called a strictly
positive definite kernel.
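The numerical example can be checked with a short NumPy sketch. The helper name `gaussian_gram` and the kernel width σ = 1 are assumptions. Under that assumption, the printed entries 0.0498, 0.0821 and 0.6065 are reproduced when each row of the matrix X is treated as one data point; treating the columns as data points gives different off-diagonal values, so the orientation used below is inferred from the numbers rather than stated in the text.

```python
import numpy as np

def gaussian_gram(points, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)).
    Each row of `points` is treated as one data point."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 3.0]])

K = gaussian_gram(X)           # rows of X as data points (matches the printed values)
K_cols = gaussian_gram(X.T)    # columns of X as data points (different values)

# All eigenvalues positive: this gram matrix is strictly positive definite.
eigvals = np.linalg.eigvalsh(K)
```

The eigenvalue check is a direct numerical counterpart of the definition of a strictly positive definite matrix given earlier.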

3.2 Feature extraction techniques

Data points ($x \in X : \mathbb{R}^n$) belonging to an unknown probability
distribution in n-dimensional space can be represented by some coordinate system in
m-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an
optimal coordinate system such that, when the data points from the higher
dimensional space are projected onto it, a dimensionally compact representation of
these data points is obtained. There are two main conditions for obtaining an
optimal dimension reduction (Carreira-Perpinan, 1997):

(i) Elimination of dimensions with very low information content. Features with
low information content can be discarded as noise.

(ii) Removal of redundancy among the dimensions of the data space, i.e. the reduced
feature set should be spanned by orthogonal vectors.

research work (Figure 3.1). For the unsupervised approach, segmented principal

component analysis (SPCA), projection pursuit (PP) and for supervised FE technique,

kernel principal component analysis (KPCA) and orthogonal subspace projection

(OSP) are used. The next sub-sections will discuss the assumptions used by these FE

techniques in detail.

3.2.1 Segmented principal component analysis (SPCA)

The principal component transform (PCT) has been successfully applied in

multispectral data analysis. It is used as a powerful tool for FE. For hyperspectral

image data, PCT outperforms those FE techniques which are based on the class

statistics. The main advantage of using a PCT is that global statistics are used to

determine the transform functions. Implementation of PCT on high dimensional data

set requires high computational load. SPCA can overcome the problem of long

processing time by partitioning the complete data set into several highly correlated

subgroups (Jia, 1996).

The complete data set is first partitioned into K subgroups with respect to the
correlation of bands. From the correlation image of HD, it can be seen that blocks
are formed from highly correlated bands (Figure 3.2). These blocks are selected as
the subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2
and k respectively (Figure 3.2a). PCT is then applied to each subgroup of data.
After applying PCT on each subgroup, significant features are selected using the
variance information of each component: the PCs which contain about 99% of the
variance are chosen for each block. The selected features can then be regrouped and
transformed again to compress the data further.
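The two-stage procedure described above can be sketched as follows. This is a minimal sketch, not the implementation used in this work: the function names, the synthetic 12-band data and the hand-picked band groups are assumptions, and in practice the groups would be read from the correlation image.

```python
import numpy as np

def pca_block(block, var_keep=0.99):
    """PCT of one band subgroup; `block` is (pixels, bands).
    Keeps the leading PCs whose cumulative variance first reaches `var_keep`."""
    centered = block - block.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                 # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    n_keep = min(int(np.searchsorted(ratio, var_keep)) + 1, len(eigvals))
    return centered @ eigvecs[:, :n_keep]

def segmented_pca(data, groups, var_keep=0.99):
    """Apply PCT to each subgroup of bands, then regroup and transform again."""
    first_stage = [pca_block(data[:, g], var_keep) for g in groups]
    return pca_block(np.hstack(first_stage), var_keep)

# Synthetic data (assumption): 3 blocks of 4 highly correlated bands each.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
data = np.hstack([latent[:, [k]] + 0.01 * rng.normal(size=(500, 4)) for k in range(3)])
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]
features = segmented_pca(data, groups)
```

In this toy case every block collapses to a single PC in the first stage, so the second transform operates on 3 features instead of 12 bands, which is where the saving in computation comes from.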

Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27

bands respectively, corresponding to highly correlated bands have been

formed from the correlation image of HYDICE hyperspectral sensor data.

Segmented PCT retains all the variance as with the conventional PCT. There

is no information lost either in the case that the transformation is conducted on the

complete vector at once or a few sub vectors are transformed separately (Jia, 1996).

When the new components obtained from each segmented PCT are gathered and

transformed again, then the resulting data variance and covariance are identical to

those of the conventional PCT. The main effect is that, the data compression rate is

lower in the middle stages compared to the no segmentation case. However, it makes

a relatively small difference in compression rate, if segmented transformation is

developed on those subgroups which have poor correlation with each other.

Figure 3.2a: Chart of multilayered segmented PCA

Projection pursuit (PP) refers to a technique first described by Friedman and

Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by

means of selected low dimensional linear projections. To reach this goal, an objective

function is assigned, called projection index, to every projection characterizing the

structure present in the projection. Interesting projections are then automatically

picked up by optimizing the projection index numerically. The notion of interesting

projections has usually been defined as the ones exhibiting departure from normality

(normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).

Posse (1990) proposed an algorithm based on a random search and a chi-

squared projection index for finding the most interesting plane (two-dimensional

view). The optimization method was able to locate in general the global maximum of

the projection index over all two-dimensional projections (Posse, 1995). The chi-

squared index was efficient, being fast to compute and sensitive to departure from

normality in the core rather than in the tail of the distribution. In this investigation

only chi-squared (Posse, 1995a, 1995b) projection index has been used.

Projection pursuit exploratory data analysis (PPEDA) consists of following two parts:

(i) A projection pursuit index measures the degree of departure from normality.

(ii) A method for finding the projection that yields the highest value for the index.

Posse (1995a, 1995b) used a random search to locate a plane with an optimal

value of the projection index and combined it with the structure removal of Friedman

(1987) to get a sequence of interesting 2-D projections. The interesting projections are

found in decreasing order of the value of the PP index. This implies that each

projection found in this manner shows a structure that is less important (in terms of

the projection index) than the previous one. In the following discussion, first the chi-

squared PP index has been described followed by the structure finding procedure.

Finally, the structure removal procedure is illustrated.

Posse proposed a projection index based on the chi-square statistic. The plane is
first divided into 48 regions or boxes $B_k$, k = 1,2,...,48, that are distributed
in the form of rings (Figure 3.3). Inner boxes have the same radial width R/5 and
all boxes have the same angular width of 45°. R is chosen so that the boxes have
approximately the same weight: $R = (2 \log 6)^{1/2}$, which gives a radial width of
$\frac{1}{5}(2 \log 6)^{1/2}$. The outer boxes have weight 1/48 under normally
distributed data. This choice for the radial width provides regions with
approximately the same probability for the standard bivariate normal distribution
(Martinez, 2001). The projection index is given as:

$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\left( z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)} \right) - c_k \right]^2 \qquad (3.1)$$

Where,

$\phi$  The standard bivariate normal density.

$c_k$  Probability evaluated over the kth region using the normal density function, given by $c_k = \iint_{B_k} \phi \, dz_1 dz_2$.

$B_k$  Box in the projection plane.

$\lambda_j$  $\lambda_j = \pi j / 36$, j = 0,...,8, is the angle by which the data are rotated in the plane before being assigned to regions.

$\alpha, \beta$  Orthonormal p-dimensional vectors which span the projection plane (these can be the first two PCs or two randomly chosen pixels of the OD set).

$P(\alpha, \beta)$  A plane spanned by the two orthonormal vectors $\alpha, \beta$.

$z_i^{\alpha}, z_i^{\beta}$  Sphered observations projected onto the vectors $\alpha$ and $\beta$ ($z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$).

$\alpha(\lambda_j)$  $\alpha \cos \lambda_j - \beta \sin \lambda_j$

$\beta(\lambda_j)$  $\alpha \sin \lambda_j + \beta \cos \lambda_j$

$I_{B_k}$  The indicator function for region $B_k$.

$PI_{\chi^2}(\alpha, \beta)$  The chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.

However, it is sensitive to distributions that have a hole in the core, and it will also

yield projections that contain clusters. The chi-square projection pursuit index is fast

and easy to compute, making it appropriate for large sample sizes. Posse (1995a)

provides a formula to approximate the percentiles of the chi-square index.
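The chi-square projection index can be sketched in NumPy as shown below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the box probabilities $c_k$ are computed analytically from the radial distribution of the standard bivariate normal, the outer radius is taken as $R = (2 \log 6)^{1/2}$, and the demo data are only standardized per axis rather than fully sphered.

```python
import numpy as np

R = np.sqrt(2.0 * np.log(6.0))      # radius enclosing the inner rings
EDGES = np.linspace(0.0, R, 6)      # 5 inner rings of radial width R/5

def box_probabilities():
    """c_k for the 48 boxes under the standard bivariate normal, shape (6 rings, 8 sectors)."""
    surv = np.exp(-0.5 * np.append(EDGES, np.inf) ** 2)   # P(radius > r)
    ring_prob = surv[:-1] - surv[1:]                      # 5 inner rings + outer ring
    return np.repeat(ring_prob[:, None] / 8.0, 8, axis=1)

def chi2_index(Z, alpha, beta):
    """Chi-square projection index of data Z (n x p) on the plane spanned by alpha, beta."""
    c = box_probabilities()
    n = Z.shape[0]
    total = 0.0
    for j in range(9):
        lam = np.pi * j / 36.0
        a = alpha * np.cos(lam) - beta * np.sin(lam)      # rotated alpha
        b = alpha * np.sin(lam) + beta * np.cos(lam)      # rotated beta
        za, zb = Z @ a, Z @ b
        ring = np.minimum(np.searchsorted(EDGES, np.hypot(za, zb), side="right") - 1, 5)
        angle = np.mod(np.arctan2(zb, za), 2.0 * np.pi)
        sector = np.floor(angle / (np.pi / 4.0)).astype(int) % 8
        counts = np.zeros((6, 8))
        np.add.at(counts, (ring, sector), 1.0)            # observations per box
        total += np.sum((counts / n - c) ** 2 / c)
    return total / 9.0

rng = np.random.default_rng(0)
gauss = rng.normal(size=(2000, 2))
clustered = np.vstack([rng.normal(-2.0, 0.3, size=(1000, 2)),
                       rng.normal(2.0, 0.3, size=(1000, 2))])
clustered = (clustered - clustered.mean(axis=0)) / clustered.std(axis=0)
e1, e2 = np.eye(2)
pi_gauss = chi2_index(gauss, e1, e2)
pi_clustered = chi2_index(clustered, e1, e2)
```

For normally distributed data the index stays close to zero, while clustered data produce a much larger value, which is the behaviour the random search exploits.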

Figure 3.3: Layout of the regions for the chi-square projection index. The figure
labels the angular width (45°), the radial width of the inner rings (R/5), the
radius R and the weight 1/48 of each outer box. (Modified after Posse, 1995a)

For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be
optimized over all possible projections onto 2-D planes. Posse (1990) proposed a
random search for locating the global maximum of the projection index. Combined with
the structure-removal procedure, this gives a sequence of interesting
two-dimensional views of decreasing importance. Starting with random planes, the
algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by
considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood
of $(\alpha^*, \beta^*)$. These candidate planes are given by,

$$\begin{aligned}
a_1 &= \frac{\alpha^* + c v}{\|\alpha^* + c v\|}, & b_1 &= \frac{\beta^* - (a_1^T \beta^*) a_1}{\|\beta^* - (a_1^T \beta^*) a_1\|} \\
a_2 &= \frac{\alpha^* - c v}{\|\alpha^* - c v\|}, & b_2 &= \frac{\beta^* - (a_2^T \beta^*) a_2}{\|\beta^* - (a_2^T \beta^*) a_2\|}
\end{aligned} \qquad (3.2)$$

Where c is a scalar that determines the size of the neighborhood visited, and v is a
unit p-vector uniformly distributed on the unit p-dimensional sphere. The idea is to
start with a global search and then to concentrate on the region of the global
maximum by decreasing the value of c. After a specified number of steps, called
half, without an increase of the projection index, the value of c is halved. When
this value is small enough, the optimization is stopped. Part of the search still
remains global to avoid being trapped in a spurious local optimum. The complete
search for the best plane consists of m such random searches with different random
starting planes. The goal of the PP algorithm is to find the best projection plane.

1. Sphere the OD set; let Z be the matrix of the sphered data set.

2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and
$\beta_0$ are orthonormal. Consider this as the current best plane
$(\alpha^*, \beta^*)$.

3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting
plane.

4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).

5. Evaluate the projection index for the two candidate planes.

6. If a candidate plane has a higher value of the projection pursuit index than the
current best plane, choose it as the current best plane $(\alpha^*, \beta^*)$.

7. Repeat steps 4 to 6 while there are improvements in the projection pursuit index.

8. If the index does not improve for a certain number of steps (half), then
decrease the value of c by half.

9. Repeat steps 4 to 8 until c becomes some small number (say 0.01).
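The random search listed above can be sketched as follows. The sketch accepts any projection index as a callable; `kurtosis_index` is a hypothetical stand-in used only to keep the example short, and it is not the chi-square index. The `max_iter` safety cap, the default parameter values and the toy data are likewise assumptions.

```python
import numpy as np

def random_unit_vector(p, rng):
    """Unit p-vector uniformly distributed on the unit sphere."""
    v = rng.normal(size=p)
    return v / np.linalg.norm(v)

def candidate_plane(alpha, beta, step):
    """One candidate plane of Eq. (3.2): perturb alpha, then re-orthogonalise beta."""
    a = alpha + step
    a = a / np.linalg.norm(a)
    b = beta - (a @ beta) * a
    return a, b / np.linalg.norm(b)

def random_search(Z, index, c=1.0, half=30, c_min=0.01, max_iter=20000, seed=0):
    """Random search for a plane (alpha, beta) with a high projection index."""
    rng = np.random.default_rng(seed)
    p = Z.shape[1]
    alpha = random_unit_vector(p, rng)                                   # step 2
    alpha, beta = candidate_plane(alpha, random_unit_vector(p, rng), 0.0)
    best, stalled = index(Z, alpha, beta), 0                             # step 3
    for _ in range(max_iter):                  # safety cap (assumption)
        if c <= c_min:                         # step 9: stop when c is small
            break
        v = random_unit_vector(p, rng)
        cands = [candidate_plane(alpha, beta, s * c * v) for s in (1.0, -1.0)]  # step 4
        values = [index(Z, a, b) for a, b in cands]                      # step 5
        k = int(np.argmax(values))
        if values[k] > best:                                             # step 6
            (alpha, beta), best, stalled = cands[k], values[k], 0
        else:
            stalled += 1
        if stalled >= half:                                              # step 8
            c, stalled = c / 2.0, 0
    return alpha, beta, best

def kurtosis_index(Z, a, b):
    """Hypothetical stand-in index: departure of the fourth moments from the normal value 3."""
    return abs(np.mean((Z @ a) ** 4) - 3.0) + abs(np.mean((Z @ b) ** 4) - 3.0)

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 4))
Z[:, 0] = rng.choice([-1.0, 1.0], size=500)   # one strongly non-normal (bimodal) axis
alpha, beta, best = random_search(Z, kurtosis_index)
```

Both candidate planes are built from the same current best plane before either is accepted, which mirrors Eq. (3.2); running the function with m different seeds corresponds to the m random starts of the complete search.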

There may be more than one interesting projection, and there may be other

views that reveal insights about the hyperspectral data. To locate other views,

Friedman (1987) proposed a method called structure removal. In this approach, first

we perform the PP algorithm on the data set to obtain the structure which means the

optimal projection plane. The approach then removes the structure found at that

projection, and repeats the projection pursuit process to find a projection that yields

another maximum value of the projection pursuit index. By proceeding in this

manner, it will give a sequence of projections providing informative views of the data.

The procedure repeatedly transforms the projected data to standard normal until

they stop becoming more normal as measured by the projection pursuit index. One

starts with a p × p matrix $U^*$, where the first two rows of the matrix are the
vectors of the projection obtained from PPEDA. The rest of the rows have '1' on the
diagonal and '0' elsewhere. For example, if p = 4, then

$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$

The rows of $U^*$ are then made orthonormal. Let U be the orthonormal matrix
obtained from $U^*$. The next step in the structure removal process is to transform
the Z matrix using the following equation,

$$T = U Z^T \qquad (3.4)$$

Where T is a p × n matrix. With this transformation, the first two rows of T hold,
for every observation, the projection onto the plane given by $(\alpha^*, \beta^*)$.
Structure removal is then performed by applying a transformation $\Theta$ that
transforms the first two rows of T to a standard normal distribution and leaves the
rest unchanged (Martinez, 2004). This is where the structure is removed, making the
data normal in that projection (the first two rows). The transformation is defined
as follows,

$$\begin{aligned}
\Theta(T_1) &= \phi^{-1}[F(T_1)] \\
\Theta(T_2) &= \phi^{-1}[F(T_2)] \\
\Theta(T_i) &= T_i, \quad i = 3, 4, \ldots, p
\end{aligned} \qquad (3.5)$$

Where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution
function, $T_1$ and $T_2$ are the first two rows of the matrix T and F is a function
defined in Eq. (3.7). From Eq. (3.5), it is seen that only the first two rows of T
are changing. $T_1$ and $T_2$ can be written as,

$$\begin{aligned}
T_1 &= \left( z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*} \right) \\
T_2 &= \left( z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*} \right)
\end{aligned} \qquad (3.6)$$

32

* *

Where z αj and z βj are coordinates of the jth observation projected onto the plane

( )

spanned by α * , β * . Next, a rotation is defined about the origin through the angle as

follows

$$\begin{aligned} \tilde{z}_j^{1(t)} &= z_j^{1(t)} \cos\gamma + z_j^{2(t)} \sin\gamma \\ \tilde{z}_j^{2(t)} &= z_j^{2(t)} \cos\gamma - z_j^{1(t)} \sin\gamma \end{aligned} \tag{3.7}$$

where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$, $z_j^{1(t)}$ represents the jth element of $T_1$ at the tth iteration of the process, and the tilde marks the rotated value. The following transformation is then applied to the rotated points of Eq. (3.7); it replaces each rotated observation by its normal score in the projection.

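$$\begin{aligned} z_j^{1(t+1)} &= \phi^{-1}\left\{ \frac{r\left(\tilde{z}_j^{1(t)}\right) - 0.5}{n} \right\} \\ z_j^{2(t+1)} &= \phi^{-1}\left\{ \frac{r\left(\tilde{z}_j^{2(t)}\right) - 0.5}{n} \right\} \end{aligned} \tag{3.8}$$

where $\tilde{z}_j^{1(t)}$ is the rotated value from Eq. (3.7) and $r\left(\tilde{z}_j^{1(t)}\right)$ represents its rank among the n rotated values.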

With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed using this process, the data is transformed back using the following equation,

$$Z' = U^{T}\, \Theta\left( U Z^{T} \right) \tag{3.9}$$

From Matrix Theory (Strang, 1988), it is known that all directions that are

orthogonal to the structure (i.e., all rows of T other than the first two) have not been

changed, whereas the structure has been Gaussianized and then transformed back.
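The normal-score step of Eq. (3.8) can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `normal_scores`, the tie-breaking rule and the sample values are assumptions made for illustration and are not taken from the thesis implementation.

```python
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by phi^{-1}((rank - 0.5) / n), as in Eq. (3.8)."""
    n = len(values)
    # Rank 1 is the smallest value; ties are broken by position (an assumption).
    order = sorted(range(n), key=lambda j: values[j])
    ranks = [0] * n
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    inv_cdf = NormalDist().inv_cdf  # inverse of the standard normal cdf
    return [inv_cdf((r - 0.5) / n) for r in ranks]

# Hypothetical, strongly skewed projected coordinates.
skewed = [0.1, 0.2, 0.25, 0.3, 5.0, 40.0, 300.0]
scores = normal_scores(skewed)
print(scores)
```

Only the ranks of the observations survive the transformation, so any structure in the spacing of the projected values is discarded, which is the purpose of the structure removal step.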

The next section summarizes the steps of PP.

3.2.2.4 Steps of PP

1. Load the data and set the value of the parameters like number of best

projection plane (N), number of neighborhood for random starts (m), value of c

and half

2. Sphere the data and obtain the Z matrix.

3. Find each of the desired number of projection planes (structures) (3.3.4.2) using the Posse chi-square index.

4. Remove the structure (to reduce the effect of local optima) and find another structure (3.3.4.3) until the projection pursuit index stops changing.

5. Continue the process until the best projection planes (orthogonal to each other) are obtained.

Kernel principal component analysis (KPCA) means conducting the principal component transform (PCT) in feature space (kernel space). KPCA is applied to variables that are nonlinearly related to the input variables. In this section the KPCA algorithm is described by way of the PCA algorithm.

First, m TP ($x_i \in R^{n}$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the covariance matrix

$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^{T} \tag{3.10}$$

from which the eigenvalues $\lambda$ and eigenvectors v can be obtained.

$$\lambda v = C v \tag{3.11}$$

For PCA, first sort the eigenvalues in decreasing order and find the corresponding eigenvectors. Then project the test point onto the eigenvectors. The PCs are obtained in this manner. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) in Eq. (3.11) gives

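$$C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^{T} v = \lambda v$$

Thus

$$v = \frac{1}{m\lambda} \sum_{j=1}^{m} x_j x_j^{T} v = \frac{1}{m\lambda} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \tag{3.12}$$

since $(x x^{T}) v = (x \cdot v)\, x$.

In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions v with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.,

$$v = \sum_{i=1}^{m} \alpha_i x_i \tag{3.13}$$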

1. For KPCA, first transform the TPs to feature space (H) using a nonlinear mapping ($\Phi$). The data set ($\Phi(x_i)$, $i = 1, \ldots, m$) in feature space is assumed to be centered. The covariance matrix of this data set takes the following form

$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^{T} \tag{3.14}$$

2. Find the eigenvalues $\lambda \geq 0$ and corresponding nonzero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix C from the equation,

$$\lambda v = C v \tag{3.15}$$

3. As shown previously (for PCA), all solutions v ($\lambda \neq 0$) lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.,

$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.16}$$

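Therefore,

$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \tag{3.17}$$

4. Substituting Eq. (3.14) and Eq. (3.16) into Eq. (3.15) gives

$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) \Phi(x_i)^{T} \Phi(x_j) \tag{3.18}$$

5. Writing the inner product as the kernel $K(x_i, x_j) = \Phi(x_i)^{T} \Phi(x_j)$ in Eq. (3.18), the following equation is obtained.

$$m\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_i) K(x_i, x_j) \tag{3.19}$$

6. To express Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^{T}$ for all k = 1, ..., m. Define the m × m matrix K, called the kernel matrix, whose ijth element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length m, whose jth element is the coefficient $\alpha_j$.

$$\lambda \sum_{j=1}^{m} \alpha_j \Phi(x_k)^{T} \Phi(x_j) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_j \Phi(x_k)^{T} \Phi(x_i) \Phi(x_i)^{T} \Phi(x_j), \quad \forall\, k = 1, 2, \ldots, m \tag{3.20}$$

In matrix form,

$$m\lambda K \alpha = K^{2} \alpha \tag{3.21}$$

To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved,

$$m\lambda \alpha = K \alpha \tag{3.22}$$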

7. Solution of Eq. (3.22) provides the eigen values and eigen vectors of the kernel

matrix K. Let λ1 ≥ λ2 ≥ ........ ≥ λm be the eigen values of K and β1 , β2 ,......., βm be

the corresponding set of eigen vectors with λ p being the last non zero eigen

value.

Figure 3.4: (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).

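8. To extract the principal components, it is necessary to compute the projection onto the eigenvectors $\beta_n$ in H (n = 1, ..., p). Let x be a test point, with an image $\Phi(x)$ in H. Then

$$\langle \beta_n, \Phi(x) \rangle = \sum_{i=1}^{m} \beta_n^{i} \langle \Phi(x_i), \Phi(x) \rangle \tag{3.23}$$

where $\beta_n^{i}$ is the ith element of $\beta_n$.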

9. In the above algorithm, it has been assumed that the data set is centered, but it is difficult to obtain the mean of the mapped data in feature space H (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in H directly. Instead, the centering is carried into the equation for kernel PCA: the kernel matrix to be diagonalized is

$$\tilde{K}_{ij} = \left( K - 1_m K - K 1_m + 1_m K 1_m \right)_{ij}, \quad \text{where } (1_m)_{ij} := \frac{1}{m} \ \ \forall\, i, j \tag{3.24}$$

Figure 3.5 provides the outline of the KPCA algorithm.
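The centering of Eq. (3.24) can be checked with a short sketch written in plain Python. The function name `center_kernel_matrix` and the one-dimensional sample points are assumptions for illustration only. A linear kernel is used so that the result can be compared with explicitly mean-centered data.

```python
def center_kernel_matrix(K):
    """Apply Eq. (3.24): K - 1m K - K 1m + 1m K 1m, with every entry of 1m equal to 1/m."""
    m = len(K)
    row_mean = [sum(row) / m for row in K]                              # entries of K 1m
    col_mean = [sum(K[i][j] for i in range(m)) / m for j in range(m)]   # entries of 1m K
    total_mean = sum(row_mean) / m                                      # entries of 1m K 1m
    return [[K[i][j] - col_mean[j] - row_mean[i] + total_mean
             for j in range(m)] for i in range(m)]

# Hypothetical one-dimensional points with a linear kernel K_ij = x_i * x_j.
points = [1.0, 2.0, 6.0]
K = [[a * b for b in points] for a in points]
K_centered = center_kernel_matrix(K)

# For a linear kernel the result equals the kernel of the mean-centered points.
mean = sum(points) / len(points)
expected = [[(a - mean) * (b - mean) for b in points] for a in points]
print(K_centered)
```

The comparison with `expected` shows why the kernel matrix can be centered without ever computing the mean of the mapped data in feature space.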

The idea of orthogonal subspace projection (OSP) is to eliminate all unwanted or undesired spectral signatures (background) within a pixel, then use a matched filter to extract the desired spectral signature (endmember) present in that pixel.

3.2.4.1 Automated target generation process algorithm (ATGP)

In hyperspectral image analysis a pixel may encompass many different materials; such pixels are called mixed pixels, and each contains multiple spectral signatures. Let a column vector $r_i$ represent the ith mixed pixel by the linear model,

$$r_i = M \alpha_i + n_i \tag{3.25}$$

where $r_i$ is an l × 1 column vector and l is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember. Assume that there are p spectrally distinct endmembers in the ith mixed pixel. M is a matrix of dimension l × p made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$. The system is considered overdetermined (l > p), and $m_j$ denotes the spectral signature of the jth endmember. $\alpha_i$ is the p × 1 fraction vector $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^{T}$, where the jth element represents the fraction of the jth signature present in the ith mixed pixel. $n_i$ is an l × 1 column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^{2} I$, where I is an l × l identity matrix.

In Eq. (3.25), each $r_i$ is assumed to be a linear combination of p endmembers with the weight coefficients given by the fraction vector $\alpha_i$. The term $M\alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures; in other words, the target is separated from the background. When searching for a single spectral signature this can be written as:

$$M\alpha = d\alpha_p + U\gamma \tag{3.26}$$

where d is l × 1, the desired signature of interest, while $\alpha_p$ is 1 × 1, the fraction of the desired signature. The matrix U is composed of the remaining column vectors of M. These are the undesired spectral signatures or background information. It is given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$, with $\gamma$ a column vector containing the remaining p − 1 components (fractions) of $\alpha$.

Suppose P is an operator that eliminates the effects of U, the undesired signatures. To do this, an orthogonal subspace operator has been developed that projects r onto a subspace orthogonal to the columns of U. This results in a vector that contains only energy associated with the target d and the noise n. The operator used is the l × l matrix

$$P = I - U (U^{T} U)^{-1} U^{T} \tag{3.27}$$

The operator P maps d into a space orthogonal to the space spanned by the uninteresting signatures in U. Applying P to the mixed pixel r from Eq. (3.25) gives

$$P r = P d \alpha_p + P U \gamma + P n \tag{3.28}$$

Since P reduces the contribution of U to zero (PU = 0), this becomes

$$P r = P d \alpha_p + P n \tag{3.29}$$

The second step in deriving the pixel classification operator is to find the 1 × l operator $X^{T}$ that maximizes the signal-to-noise ratio (SNR). Operating on Eq. (3.28) gives

$$X^{T} P r = X^{T} P d \alpha_p + X^{T} P U \gamma + X^{T} P n \tag{3.30}$$

The operator $X^{T}$ acting on Pr produces a scalar (Ientilucci, 2001). The SNR is given by,

$$\lambda = \frac{X^{T} P d\, \alpha_p^{2}\, d^{T} P^{T} X}{X^{T} P\, E\left[ n n^{T} \right] P^{T} X} \tag{3.31}$$

$$\lambda = \left( \frac{\alpha_p^{2}}{\sigma^{2}} \right) \frac{X^{T} P d d^{T} P^{T} X}{X^{T} P P^{T} X} \tag{3.32}$$

where E[ ] denotes the expected value. Maximization of this quotient is the generalized eigenvector problem

$$P d d^{T} P^{T} X = \tilde{\lambda} P P^{T} X \tag{3.33}$$

where $\tilde{\lambda} = \lambda \left( \sigma^{2} / \alpha_p^{2} \right)$. The value of $X^{T}$ which maximizes $\lambda$ can be determined in general using techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^{T}$ which maximizes the SNR is

$$X^{T} = k d^{T} \tag{3.34}$$

where k is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the 1 × l vector

$$q^{T} = d^{T} P \tag{3.35}$$

This result first nulls the interfering signatures, and then uses a matched filter for

the desired signature to maximize the SNR. When the operator is applied to all of the

pixels in a hyperspectral scene, each l × 1 pixel is reduced to a scalar which is a

measure of the presence of the signature of interest. The ultimate aim is to reduce the

l images that make-up the hyperspectral image cube into a single image where pixels

with high intensity indicate the presence of the desired signature.

This operator can be easily extended to seek out k signatures of interest. The

vector operator simply becomes a k × l matrix operator which is given by,

When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each l × 1 pixel is reduced to a k × 1 vector. For a single desired signature, the l-dimensional hyperspectral image reduces to a one-dimensional feature-extracted image in which pixels with high intensity indicate the presence of the desired signature. Thus, for k desired signatures the hyperspectral image can be reduced to a k-dimensional feature-extracted image, in which each band corresponds to one desired signature.

The above algorithm is discussed with the following example:

Let us start with three vectors or classes, each six elements or bands long. The

vectors are in reflectance units and can be seen below.

$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \quad \text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \quad \text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$

Suppose the image consists of 100 pixels numbered from left to right. Let the 40th pixel be,

pixel40 = (0.08) concrete + (0.75) tree + (0.07) water + noise (3.37)

Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image. The same procedure can be followed to extract the tree and water materials.

Assume that pixel40 is made up of some weighted linear combination of endmembers.

pixel40 = Mα + noise (3.38)

Next, assign the desired signature as d and the undesired signatures as U. Let concrete be the vector d and let tree and water be the column vectors of the matrix U. The fractions of mixing are unknown, but it is known that pixel40 is made up of

d = [concrete] and U = [tree, water]

The next step is to find a projection operator P that, when operated on U, will reduce its contribution to zero.

To find concrete, d, pixel40 is projected onto a subspace that is orthogonal to the

columns of U using the operator P. In other words, P maps d into a space orthogonal

to the space spanned by the undesired signatures while simultaneously minimizing

the effects of U. If P is operated on U, which contains tree and water, then it is seen

that the effect of U is minimized.

$$PU = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \tag{3.39}$$

The operator $x^{T}$ that maximizes the signal-to-noise ratio (SNR) must now be found. The operator $x^{T}$ acting on Pr produces a scalar. As stated before, the value of $x^{T}$ which maximizes the SNR is $x^{T} = k d^{T}$. This leads to the overall OSP operator (Eq. (3.35)). In this way the matrix Q in Eq. (3.36) can be formed. Each pixel vector can now be projected along the columns of Q to form the OSP feature-extracted image.
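The worked example can be reproduced numerically with the following minimal sketch in plain Python. The helper names (`dot`, `osp_projector`, `apply`) are assumptions, the third component of the mixed pixel is taken to be water with the fractions 0.08, 0.75 and 0.07 of Eq. (3.37), the noise is set to zero, and the 2 × 2 inverse is written out by hand because U has exactly two columns here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def osp_projector(u1, u2):
    """Return P = I - U (U^T U)^{-1} U^T for U = [u1, u2] (two undesired signatures)."""
    l = len(u1)
    a, b, c = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    det = a * c - b * b  # determinant of the 2 x 2 matrix U^T U
    inv = [[c / det, -b / det], [-b / det, a / det]]
    cols = [u1, u2]
    return [[(1.0 if i == j else 0.0)
             - sum(cols[r][i] * inv[r][s] * cols[s][j] for r in range(2) for s in range(2))
             for j in range(l)] for i in range(l)]

def apply(P, v):
    return [dot(row, v) for row in P]

concrete = [0.26, 0.30, 0.31, 0.31, 0.31, 0.31]
tree = [0.07, 0.07, 0.11, 0.54, 0.55, 0.54]
water = [0.07, 0.13, 0.19, 0.25, 0.30, 0.34]

P = osp_projector(tree, water)
q = apply(P, concrete)  # P is symmetric, so this vector is (d^T P)^T of Eq. (3.35)

# Noise-free mixed pixel built from the assumed fractions.
pixel40 = [0.08 * c + 0.75 * t + 0.07 * w for c, t, w in zip(concrete, tree, water)]
score = dot(q, pixel40)                      # scalar OSP response for this pixel
estimated_fraction = score / dot(q, concrete)
print(estimated_fraction)
```

Because P removes tree and water, the response depends only on the concrete fraction; dividing by $d^{T} P d$ recovers that fraction in the noise-free case.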

This section describes the mathematical background of supervised classifiers.

First, it will describe the Bayesian decision rule followed by the decision rule for

Gaussian maximum likelihood classifier (GML). Afterwards it will describe the k-

nearest neighbor (KNN) and Support vector machine (SVM) classification rules.

In pattern recognition, patterns need to be classified. Many decision rules are available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on Bayes' theorem. Suppose there are K classes and let $f_k(x) = P(x|c_k)$ be the distribution function of the kth class, where $1 \leq k \leq K$, and $P(c_k)$ is the prior probability of the kth class such that $\sum_{k=1}^{K} P(c_k) = 1$. For any class k, the posterior probability for a pixel vector x is denoted by $p_k(c_k|x)$ and is given by

$$p_k(c_k|x) = \frac{P(x|c_k) P(c_k)}{\sum_{j=1}^{K} f_j(x) P(c_j)} \tag{3.41}$$

Therefore, the Bayes decision rule is:

$$x \in c_i \ \text{ if } \ p_i(c_i|x) = \max_{k} p_k(c_k|x) \tag{3.41a}$$

3.3.2 Gaussian maximum likelihood classification (GML):

Gaussian maximum likelihood classifier assumes that the distribution of the data points is

Gaussian (normally distributed) and classifies an unknown pixel based on the variance and

covariance of the spectral response patterns. This classification is based on probability density

function associated with training data. Pixels are assigned to the most likely class based on a

comparison of the posterior probability that it belongs to each of the signatures being considered.

Under this assumption, the distribution of a category response pattern can be completely described

by the mean vector and the covariance matrix. With these parameters, the statistical probability of

a given pixel value being a member of a particular land cover class can be computed (Lillesand et

al., 2002). GML classification can obtain minimum classification error under the assumption that

the spectral data of each class is normally distributed. It considers not only the cluster centre but

also its shape, size and orientation by calculating a statistical distance based on the mean values

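and covariance matrix of the clusters. The discriminant function for the GML classification is:

$$g_k(x) = -\frac{1}{2} \left[ \ln \left| \hat{\Sigma}_k \right| + (x - \hat{\mu}_k)^{T} \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) \right] \tag{3.42}$$

where $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are the mean vector and covariance matrix estimated for class k. The final Bayesian decision rule is:

$$x \in c_j \ \text{ if } \ g_j(x) = \max_{k} g_k(x)$$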
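A small sketch of the discriminant in Eq. (3.42) is given below for the two-band case, where the determinant and inverse of the covariance matrix can be written out directly. The class names, mean vectors and covariance matrices are hypothetical values chosen for illustration; they are not statistics from the thesis data.

```python
import math

def gml_discriminant(x, mean, cov):
    """g_k(x) = -1/2 [ln|Sigma| + (x - mu)^T Sigma^{-1} (x - mu)] for two bands."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Sigma^{-1} = (1 / det) * [[d, -b], [-c, a]]
    mahalanobis = (dx[0] * (d * dx[0] - b * dx[1]) + dx[1] * (-c * dx[0] + a * dx[1])) / det
    return -0.5 * (math.log(det) + mahalanobis)

def gml_classify(x, class_stats):
    """Assign x to the class with the largest discriminant value."""
    return max(class_stats, key=lambda name: gml_discriminant(x, *class_stats[name]))

# Hypothetical class statistics: (mean vector, covariance matrix).
class_stats = {
    "water": ([0.10, 0.05], [[0.01, 0.0], [0.0, 0.01]]),
    "vegetation": ([0.05, 0.50], [[0.01, 0.002], [0.002, 0.04]]),
}
label = gml_classify([0.09, 0.07], class_stats)
print(label)
```

For more than two bands the same rule applies, but a general matrix inverse and determinant routine would be needed in place of the hand-written 2 × 2 formulas.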

The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven effective in pattern recognition. However, its inherent limitations restrict its practical applications. One of its shortcomings is lazy learning, which makes traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951).

The k-nearest neighbor classifier is commonly based on the Euclidean distance

between a test pixel and the specified TP. The TP are vectors in a multidimensional

feature space, each with a class label. In the classification phase, k is a user-defined


constant. An unlabelled vector i.e. test pixel, is classified by assigning the label which

is most frequent among the k training samples nearest to that test pixel.

Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is classified to the first class (3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified to the first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).

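where $x = (x_{11}, x_{12}, \ldots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{ d_1, d_2, \ldots, d_p \}$ and p is the number of TP.

$$x \in c_j \ \text{ if the count of the } k \text{ minimum elements of } D \text{ corresponding to } c_j \text{ is } \begin{cases} \left[ \dfrac{k}{2} \right] + 1, & k \text{ even} \\[2mm] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \tag{3.44}$$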

In case of tie, the test pixel is assigned to the class c j if its distance from the mean

Where ki ,(i =1,2,....., p) is a user defined parameter which implies the number of

classification is given in Figure: 3.7
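The voting rule of the traditional KNN classifier can be sketched as follows. This is an illustrative sketch only: the function name and the sample points are assumptions, and ties are resolved in favour of the class whose neighbour is encountered first in distance order, which is simpler than the tie rule described above.

```python
import math
from collections import Counter

def knn_classify(test_pixel, training_pixels, labels, k):
    """Label a test pixel by majority vote among its k nearest training pixels."""
    # Euclidean distance from the test pixel to every training pixel (TP).
    distances = sorted(
        (math.dist(test_pixel, tp), label)
        for tp, label in zip(training_pixels, labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-band training pixels.
training_pixels = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
labels = ["square", "square", "square", "triangle", "triangle", "triangle"]

print(knn_classify((1.5, 1.5), training_pixels, labels, k=3))
print(knn_classify((5.2, 5.1), training_pixels, labels, k=3))
```

Every call sorts all training pixels by distance, which illustrates why the lazy learning of traditional KNN becomes time-consuming for large training sets.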

The foundations of Support Vector Machines (SVM) were developed by Vapnik (1995). The formulation embodies the Structural Risk Minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional

Empirical Risk Minimization (ERM) principle, employed by conventional neural

networks. SRM minimizes an upper bound on the expected risk, as opposed to ERM

that minimizes the error on the training data. SVMs were developed to solve the

classification problem, but recently they have been extended to the domain of

regression problems (Vapnik et al., 1997).

SVM is basically a linear learning machine based on the principle of optimal

separation of classes. The aim is to find a hyperplane which linearly separates the

class of interest. The linear separating hyperplane is placed between the classes in

such a way that it satisfies two conditions.

(i) All the data vector that belongs to the same class are placed to the same side of

separating hyperplane.

(ii) Distance between two closest data in both classes is maximized (Vapnik, 1982).

The main aim of SVM is to define an optimum hyperplane between two classes

which will maximize the boundary of two classes. For each class, the data vectors

forming the boundary of classes are called the support vectors (SV) and the

hyperplane is called decision surface (Pal, 2002).

The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known class and predicting the outcome for data points with unknown identity. Two induction principles are used for this. The first is ERM, whose aim is to reduce the training error, and the second is SRM, whose goal is to minimize the upper bound on the expected error on the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999). First, it does not depend on the unknown cumulative distribution function. Second, it can be minimized with respect to the parameter used in the decision rule.

The VC dimension is a measure of the capacity of a set of classification functions. The VC dimension, generally denoted by h, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways. For two-class classification, a VC dimension of k is the maximum number of points which can be separated into two classes without error in all possible $2^{k}$ ways (Varshney and Arora, 2004).

method (SVM_QP):

The SVM is first described for the simple linearly separable case of two classes that can be separated by a hyperplane; the formulation can then be extended to the multiclass classification problem and, through the kernel method for SVM, to the case where a hyperplane cannot separate the two classes.

Let there be n training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^{m}$, m is the dimension of the data vector, and each sample belongs to one of the two classes labeled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in m-dimensional space whose orientation is given by a vector w and whose location is determined by a scalar b, the offset of the hyperplane from the origin (Figure 3.8). If such a hyperplane exists, the given set of training data points must satisfy the following inequalities:

$$w \cdot x_i + b \geq +1, \quad \forall\, i : y_i = +1 \tag{3.45}$$

$$w \cdot x_i + b \leq -1, \quad \forall\, i : y_i = -1 \tag{3.46}$$

Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after

Gunn, 1998).

The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality

as:

yi (w.xi + b) ≥ 1 (3.47)

Thus, the decision rule for the linearly separable case can be defined in the following

form:

xi ∈ sign(w.xi + b) (3.48)

Where, sign(.) is the signum function whose value is +1 for any element greater than

or equal to zero, and –1 if it is less than zero. The signum function, thus, can easily

represent the two classes given by labels +1 and –1.
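The combined constraint of Eq. (3.47) and the decision rule of Eq. (3.48) can be checked with a few lines of Python. The hyperplane parameters and the four labeled samples below are hypothetical values chosen so that the samples are linearly separable.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(value):
    """Signum as defined in the text: +1 for values >= 0, otherwise -1."""
    return 1 if value >= 0 else -1

def decide(w, b, x):
    """Decision rule of Eq. (3.48)."""
    return sign(dot(w, x) + b)

def satisfies_constraints(w, b, samples):
    """Check y_i (w . x_i + b) >= 1 for every labeled sample, Eq. (3.47)."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in samples)

# Hypothetical hyperplane and labeled training samples.
w, b = [1.0, 1.0], -3.0
samples = [([1.0, 1.0], -1), ([0.0, 0.0], -1), ([2.0, 2.0], +1), ([3.0, 1.0], +1)]

print(satisfies_constraints(w, b, samples))
print(decide(w, b, [2.5, 2.5]), decide(w, b, [0.5, 0.5]))
```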

The separating hyperplane (Figure 3.8) will be able to separate the two classes

optimally when its margin from both the classes is equal and maximum (Varshney,

2004) i.e. the hyperplane should be located exactly in the middle of the two classes.

The distance D(x; w, b) is used to express the margin of separation, or margin, for a point x from the hyperplane defined by w and b:

$$D(x; w, b) = \frac{\left| w \cdot x + b \right|}{\left\| w \right\|_2} \tag{3.49}$$

where $\| \cdot \|_2$ denotes the second norm, which is equivalent to the Euclidean length of the vector for which it is computed, and $| \cdot |$ is the absolute value function. Let d be the value of the margin between the two separating planes. To maximize the margin, express the value of d as

$$d = \frac{w \cdot x + b + 1}{\| w \|_2} - \frac{w \cdot x + b - 1}{\| w \|_2} = \frac{2}{\| w \|_2} = \frac{2}{\sqrt{w^{T} w}} \tag{3.49a}$$

To obtain an optimal hyperplane the margin value (d) should be maximized, i.e. $2 / \| w \|_2$ should be maximized, which is equivalent to minimizing $\| w \|_2$. Thus, the objective function $\Phi(w)$ of finding the best separating hyperplane reduces to

$$\Phi(w) = \frac{1}{2} w^{T} w \tag{3.50}$$

A constrained optimization problem can be constructed for minimizing the objective

function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of

constrained optimization problem with a convex objective function of w and linear

constraints is called a primal problem and can be solved using standard Quadratic

Programming (QP) optimization techniques. The QP optimization technique can be

implemented by replacing the inequalities in a simpler form by transforming the

problem into a dual space representation using Lagrange multipliers ( λi )

as shown:

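$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3.51}$$

The dual optimization problem then becomes

$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \tag{3.52}$$

subject to

$$\sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3.53}$$

$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \tag{3.54}$$

where $\lambda_i$ are the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality condition (Taylor, 2000), some of the Lagrange multipliers will be zero. The data vectors whose multipliers are nonzero are called SVs. The result from an optimizer, also called an optimal solution, is a set of unique and independent multipliers $\lambda^{o} = (\lambda_1^{o}, \lambda_2^{o}, \ldots, \lambda_{n_s}^{o})$, where $n_s$ is the number of support vectors found. Substituting these in Eq. (3.51) gives the orientation of the optimal hyperplane,

$$w^{o} = \sum_{i=1}^{n} y_i \lambda_i^{o} x_i \tag{3.55}$$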

i =1

The offset from the origin ($b^{o}$) is determined from the equation given below,

$$b^{o} = -\frac{1}{2} \left[ w^{o} \cdot x_{+1}^{o} + w^{o} \cdot x_{-1}^{o} \right] \tag{3.56}$$

where $x_{+1}^{o}$ and $x_{-1}^{o}$ are support vectors of class labels +1 and −1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and −1:

$$f(x) = \operatorname{sign} \left( \sum_{\text{support vectors}} y_i \lambda_i^{o} (x_i \cdot x) + b^{o} \right) \tag{3.57}$$

$$x \in \operatorname{sign} \left( \sum_{\text{support vectors}} y_i \lambda_i^{o} (x_i \cdot x) + b^{o} \right) \tag{3.58}$$
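As a minimal numerical check of Eqs. (3.55) to (3.57), consider a hypothetical problem with one support vector per class. The multipliers below were chosen by hand so that the equality constraint of Eq. (3.53) holds; they are not the output of a QP solver. The offset is computed as minus one half of the sum of the two projections, which follows from requiring w·x + b = +1 and −1 at the two support vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical support vectors as (x_i, y_i, lambda_i) triples.
support_vectors = [([2.0, 2.0], +1, 1.0), ([1.0, 1.0], -1, 1.0)]

# Eq. (3.55): orientation of the hyperplane from the multipliers.
w = [sum(lam * y * x[d] for x, y, lam in support_vectors) for d in range(2)]

# Eq. (3.56): offset from one support vector of each class.
x_pos, x_neg = support_vectors[0][0], support_vectors[1][0]
b = -0.5 * (dot(w, x_pos) + dot(w, x_neg))

def classify(x):
    """Eq. (3.57): decision rule written with support vectors only."""
    value = sum(lam * y * dot(xi, x) for xi, y, lam in support_vectors) + b
    return 1 if value >= 0 else -1

print(w, b, classify([3.0, 3.0]), classify([0.0, 1.0]))
```

With these values both support vectors lie exactly on the margin planes, since w·x + b evaluates to +1 and −1 respectively.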

Generally, it may not be possible to separate the classes optimally by a linear

hyperplane and thus a non-linear manifold in hyperspace would be required for

optimal separation among the classes. The data present in m-dimensional space can

be mapped into a higher dimensional space where it spread out and can be separated

by a linear hyperplane in that dimensional space, shown in Figure 3.9.

Suppose the non-linear transformation function $\phi$ maps the data into a higher dimensional space, so that a data vector x is represented as $\phi(x)$ in the higher dimensional space. Thus, the dual optimization problem becomes

$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j\, \phi(x_i) \cdot \phi(x_j) \tag{3.59}$$

Kernel functions are used to substitute the value of the dot product of the transformed vectors according to Mercer's theorem (Mercer, 1909). Suppose there exists a kernel function K such that

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \tag{3.60}$$

Figure 3.9: (a) Input space (b) Feature space. Mapping of pixels from input space to feature space. $\phi(x_i)$ are pixels in feature space. Linearly non-separable pixels in input space become linearly separable in feature space (Cristianini, 2000).

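Putting Eq. (3.60) into Eq. (3.59), the modified form of the dual optimization problem becomes:

$$\max_{\lambda} L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j) \tag{3.61}$$

subject to

$$\sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3.62}$$

and the decision rule becomes

$$x \in \operatorname{sign} \left( \sum_{i=1}^{n_s} y_i \lambda_i^{o} K(x_i, x) + b^{o} \right) \tag{3.63}$$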

Some of the commonly used kernel functions for classification are presented in Table

3.2. Selection of suitable kernel function is essential for better classification of a

particular data set. The details on effects of different kernel functions on

classification accuracy are available in Varshney and Arora (2004).
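The relation in Eq. (3.60) can be illustrated with a sketch of some widely used kernels. The parameter names (`degree`, `coef0`, `gamma`) and the explicit feature map `phi` are illustrative assumptions and are not taken from Table 3.2. For two-band pixels, the homogeneous polynomial kernel of degree 2 equals the dot product of an explicit three-dimensional mapping, which can be verified directly.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def phi(x):
    """Explicit map for the degree-2 polynomial kernel with coef0 = 0 (two bands only)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 0.5]
kernel_value = polynomial_kernel(x, z, degree=2, coef0=0.0)
mapped_value = dot(phi(x), phi(z))
print(kernel_value, mapped_value, rbf_kernel(x, z))
```

The kernel value is obtained without forming the mapped vectors, which is what makes the substitution into the dual problem computationally attractive.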

Originally, SVMs were developed to perform binary classification. They have since been extended to multiclass classification, where the number of classes is more than two. Pal (2004) proposed two multiclass classification methods: one against the rest, and pairwise classification. In the first, K binary classifiers are created for a K class classification problem, each trained to distinguish one class from the remaining K − 1 classes. The second approach considers one pair of classes at a time and performs SVM based binary classification to assign each pixel to one of the two classes under consideration. A total of K(K − 1)/2 pairs of classes are possible for a K class problem and thus that many SVM binary classifiers are to be created. A pixel is finally assigned to the class chosen by the largest number of the K(K − 1)/2 SVM classifiers (Varshney and Arora, 2004).

Figure 3.10 shows summary of the SVM classification algorithm.
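The pairwise voting scheme can be sketched as follows. The binary classifiers here are hypothetical threshold functions on a single band, standing in for trained binary SVMs; only the voting logic is the point of the example.

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(pixel, classes, binary_classifiers):
    """Assign the class that wins the most of the K(K - 1)/2 pairwise decisions."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](pixel)] += 1
    return votes.most_common(1)[0][0]

classes = ["water", "soil", "vegetation"]

# Hypothetical stand-ins for trained binary SVMs, keyed by class pair.
binary_classifiers = {
    ("water", "soil"): lambda p: "water" if p < 0.2 else "soil",
    ("water", "vegetation"): lambda p: "water" if p < 0.3 else "vegetation",
    ("soil", "vegetation"): lambda p: "soil" if p < 0.5 else "vegetation",
}

print(len(binary_classifiers))  # K(K - 1)/2 classifiers for K = 3
print(pairwise_vote(0.1, classes, binary_classifiers))
print(pairwise_vote(0.7, classes, binary_classifiers))
```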


3.3.4.4 SMO optimization for SVM

Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly

solve the SVM QP problem without any extra matrix storage and without using

numerical QP optimization steps at all. SMO decomposes the overall QP problem into

QP sub-problems, using Osuna’s theorem (Osuna, 1997) to ensure convergence.

Unlike the previous methods, SMO chooses to solve the smallest possible

optimization problem at every step. For the standard SVM QP problem, the smallest

possible optimization problem involves two Lagrange multipliers, because the

Lagrange multipliers must obey a linear equality constraint. At every step, SMO

chooses two Lagrange multipliers to jointly optimize, finds the optimal values for

these multipliers, and updates the SVM to reflect the new optimal values. The

advantage of SMO lies in the fact that solving for two Lagrange multipliers can be

done analytically. Thus, numerical QP optimization is avoided entirely. Even though

more optimization sub-problems are solved in the course of the algorithm, each sub-

problem is so fast that the overall QP problem is solved quickly. In addition, SMO

requires no extra matrix storage at all. Thus, very large SVM training problems can

fit inside the memory of an ordinary personal computer or workstation. Because no

matrix algorithms are used in SMO, it is less susceptible to numerical precision

problems. There are two components to SMO: an analytic method for solving for the

two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.
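For reference, the analytic step for a pair of multipliers $(\lambda_1, \lambda_2)$ can be written out as in Platt's formulation of SMO. The notation is adapted to this chapter, and the box constraint C belongs to the soft-margin SVM, which is not introduced above, so it should be read as an assumption; $E_i = f(x_i) - y_i$ is the current prediction error on sample i.

```latex
\begin{aligned}
\eta &= K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2) \\
\lambda_2^{\text{new}} &= \lambda_2 + \frac{y_2 (E_1 - E_2)}{\eta} \\
\lambda_2^{\text{clipped}} &= \min\bigl(H, \max(L, \lambda_2^{\text{new}})\bigr) \\
\lambda_1^{\text{new}} &= \lambda_1 + y_1 y_2 \bigl(\lambda_2 - \lambda_2^{\text{clipped}}\bigr) \\[4pt]
y_1 \neq y_2 &: \quad L = \max(0, \lambda_2 - \lambda_1), \quad H = \min(C, C + \lambda_2 - \lambda_1) \\
y_1 = y_2 &: \quad L = \max(0, \lambda_1 + \lambda_2 - C), \quad H = \min(C, \lambda_1 + \lambda_2)
\end{aligned}
```

The update of $\lambda_1$ keeps $y_1 \lambda_1 + y_2 \lambda_2$ unchanged, so the linear equality constraint on the multipliers continues to hold after every step.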

In this thesis, all the computations regarding SMO optimization method have

been done with the Matlab in-built function “SVMSMOSET”

3.3.4.4 KPCA-SVM

Nonlinear SVM is more accurate than linear SVM. However, it is slow, and the time taken for classification increases linearly with the number of SVs. Reduced set methods try to speed up SVM classification by reducing the number of SVs (Burges and Scholkopf, 1996). This section presents the technique of reducing the number of SVs using the KPCA algorithm (Sundaram, 2009). It should be kept in mind that the space spanned by the original set of SVs must always be equivalent to the space spanned by the reduced set of SVs. This is the criterion for choosing the minimum number of SVs to improve the classification time.

The solution of the optimization problem Eq. (3.52) is obtained in terms of Lagrange multipliers, and the SVs are extracted by solving Eq. (3.52). The algorithm for this method is stated below.

1. First choose an appropriate kernel function. Then calculate the kernel matrix $K_{xx}$,

$$K_{xx}(i, j) = K(x_i, x_j), \quad i, j = 1, 2, \ldots, N \tag{3.64}$$

2. Center the kernel matrix $K_{xx}$,

$$K_{xx}^{c} = H K_{xx} H \tag{3.65}$$

where $H = I - \frac{1}{N} \mathbf{1}\mathbf{1}^{T}$ is the centering matrix, I is the N × N identity matrix and $\mathbf{1}$ is the N × 1 vector of ones.

Sundaram (2009) used Eq. (3.65) to center the kernel matrix. However, according to other literature, the kernel matrix should be centered using Eq. (3.24), which is the standard procedure for centering a kernel matrix.

3. Perform kernel PCA by carrying out an eigenvalue decomposition of the centered kernel matrix ($K_{xx}^{c}$).

$$K_{xx}^{c} = A \Lambda A^{T} \tag{3.66}$$

4. Sort the eigen values and corresponding eigen vectors. Discard eigen values

smaller than a threshold. A value of 10−5 has been used in this thesis work.

This was done to prevent numerical problems in the later stages of the

algorithm.

5. Calculate the normalized principal directions.

$$V_k = \frac{1}{\sqrt{\lambda_k}} \sum_{j=1}^{N} a_{jk} \tilde{\Phi}(x_j), \quad \text{where } \tilde{\Phi}(x_j) = \Phi(x_j) - \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \tag{3.67}$$

In matrix form this becomes:

$$V = K A \Lambda^{-\frac{1}{2}} \tag{3.68}$$

Select the first M principal directions, which together retain 99% of the total variance.

6. Calculate the new SVs by choosing the projections on the principal directions from a uniform distribution $U[-\sigma_k, +\sigma_k]$, where $\sigma_k = \sqrt{\lambda_k / N}$. In matrix form it becomes,

$$\tilde{V} = V R \tag{3.69}$$

where $R = \frac{1}{\sqrt{N}} \Lambda^{\frac{1}{2}} U$ and U is a matrix of points chosen from the uniform distribution U[−1, +1].

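7. Each column of the matrix $\tilde{V}$ of Eq. (3.69) corresponds to a new SV. Now project the images of the old SVs ($\Phi(x_i)$) along the direction of the new set of SVs (i.e. along the direction of the PCs).

$$\Phi(z_k) = \sum_{i=1}^{N} \tilde{V}_{ik} \Phi(x_i) \tag{3.70}$$

8. Find the pre-image $z_k$ of the vector obtained in the previous step ($\Phi(z_k)$) according to the formula given below (Scholkopf, 1996).

$$z_k = \frac{\sum_{i=1}^{N} \tilde{V}_{ik} \left( \frac{1}{2} \left( 1 - \tilde{V}_k^{T} K_{xx} \tilde{V}_k + 2 \tilde{V}_k^{T} k_{x_i} \right) \right) x_i}{\sum_{i=1}^{N} \tilde{V}_{ik} \left( \frac{1}{2} \left( 1 - \tilde{V}_k^{T} K_{xx} \tilde{V}_k + 2 \tilde{V}_k^{T} k_{x_i} \right) \right)} \tag{3.71}$$

9. Compute the new coefficients $\beta$ from

$$K_{zz} \beta = K_{zx} \alpha \tag{3.72}$$

This ensures that both SVMs produce the same results for all the $z_k$, k = 1, 2, ..., M. In this way the new set of SVs, $z_k$, k = 1, 2, ..., M, and the new coefficients $\beta$ corresponding to the new set of SVs are obtained. Figure 3.11 describes the outline of the above algorithm.

Figure 3.11: Overview of KPCA_SVM algorithm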

The classification results obtained using various classification techniques are expressed in a standard confusion matrix (Landgrebe, 2003) showing the class-wise user ($k_{ua}$), producer ($k_{pa}$) and overall (k) kappa measures (Congalton, 1991). The overall kappa (k) values obtained from different classification techniques were used for one-tailed hypothesis testing (Congalton, 1991) to compare any two classification results, while the class-wise producer's kappa ($k_{pa}$) values were used to compare the results for individual classes (Abhinav, 2009).

The z-statistic (Congalton, 1991) is computed using the kappa values obtained for any two classification techniques being compared:

$$Z_{12} = \frac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^{2} + \hat{\sigma}_2^{2}}} \tag{3.73}$$

where $\hat{k}_1$ and $\hat{k}_2$ are the kappa estimates obtained for the two classification techniques under consideration and $\hat{\sigma}_1^{2}$, $\hat{\sigma}_2^{2}$ are the respective estimates of the variances of the observed kappa values. The z-statistic obtained is used for one-tailed hypothesis testing with the following null ($H_0$) and alternate ($H_1$) hypotheses:

$$H_0 : k_1 - k_2 \leq 0, \qquad H_1 : k_1 - k_2 > 0 \tag{3.74}$$

The null hypothesis chosen here is that the out of the two classification results

obtained k̂1 and k̂2 , k̂1 is not significantly better than k̂2 which means that the first

classification technique is not significantly better than the second technique. While

the alternate hypothesis selected, it says that the two classification results are

statistically different and also the result corresponding to k̂1 is statistically better

than that corresponding to k̂2 and thus, it can be said that the first classification

The z-statistic obtained in Eq. (3.73) follows the standard normal distribution (Congalton, 1991). Thus, according to one-tailed hypothesis testing (Fig. 3.12), if the value of the $Z_{12}$ statistic is greater than the critical value (say, 1.65 for a confidence level of 95%), the null hypothesis can be rejected, and it can be said with 95% confidence that the two classification results are statistically different, with the first one performing better than the second (Abhinav, 2009).
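The test described above can be sketched in a few lines. The kappa values and variances below are hypothetical numbers chosen only to show the arithmetic; the function name is likewise an assumption.

```python
import math


def kappa_z_test(k1, var1, k2, var2, z_critical=1.65):
    """Return Z12 from Eq. (3.73) and whether H0 is rejected in the one-tailed test."""
    z12 = (k1 - k2) / math.sqrt(var1 + var2)
    return z12, z12 > z_critical


# Hypothetical kappa estimates and variances for two classification results.
z12, reject_h0 = kappa_z_test(k1=0.95, var1=0.00005, k2=0.93, var2=0.00005)
print(round(z12, 2), reject_h0)
```

With these illustrative inputs the statistic is about 2.0, which exceeds 1.65, so the first result would be judged significantly better at the 95% confidence level.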

Figure 3.12: One-tailed hypothesis test, showing the non-rejection region for $H_0$ and the critical value $Z_c = 1.65$ (Abhinav, 2009).

CHAPTER 4

EXPERIMENTAL DESIGN

This chapter will address the methodology followed for this thesis work.

Experiments were designed to investigate the best FE technique, classification

algorithm and best time saving strategy for HD. On the basis of conclusions from the

literature survey and recommendations for future work by Abhinav (2009), several

FE and classification algorithms have been tested which have potential for improving

classification accuracy and time for HD. The theoretical background of these

algorithms was presented in Chapter 3.

(1) Feature extraction algorithms

• Unsupervised feature extraction algorithms

a) Segmented principal component analysis (SPCA) (Jia, 1996).

b) Projection pursuit (PP) (Friedman and Tukey, 1974).

• Supervised feature extraction algorithms

a) Kernel principal component analysis (KPCA) (Scholkopf, 1995).

b) Orthogonal subspace projection (OSP) (Ientilucci, 2001).

(2) Classification algorithms

• Parametric classification approach

a) Gaussian maximum likelihood (GML) (Savage, 1976).

• Non-parametric classification approach

a) k nearest neighborhood (KNN) (Fix and Hodges, 1951).

• Advanced classification approach

a) Support vector machine (quadratic programming optimization method) (SVM_QP) (Vapnik, 1995).

b) Support vector machine (sequential minimal optimization method) (SVM_SMO) (Platt, 1999).

c) Kernel principal component analysis support vector machine (KPCA_SVM) (Sundaram, 2009).

This chapter starts with experimental details for different FE and selection

techniques. Then it explains the classification techniques for parametric and non-

parametric classifier followed by advanced classifier.

Two types of FE techniques, unsupervised and supervised, were used in this

experiment. SPCA, PP are unsupervised FE techniques and KPCA, OSP are

supervised FE techniques. The details of FE methods are given below.

4.1.1 SPCA

For SPCA, the complete data set is divided into subgroups on the basis of the correlation between bands, and PCA is then applied separately to each subgroup. After this first transformation, features are selected from each subgroup on the basis of variance (the first few PCs retaining 99% of the variance were selected). The selected features are then regrouped and transformed again to compress the data further. The flowchart of the SPCA method is shown in Figure 4.1.
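The two-stage procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the PCA is done on the covariance matrix of each block (the thesis may work with the correlation matrix instead), and the random pixels merely stand in for a 65-band image split into blocks of 32, 6 and 27 bands.

```python
import numpy as np


def pca_retain(pixels, var_fraction=0.99):
    """PCA on (n_pixels, n_bands) data; keep the leading PCs holding var_fraction of the variance."""
    centered = pixels - pixels.mean(axis=0)
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]                  # sort eigenvalues in descending order
    vals, vecs = np.clip(vals[order], 0.0, None), vecs[:, order]
    cumulative = np.cumsum(vals) / vals.sum()
    n_keep = min(int(np.searchsorted(cumulative, var_fraction)) + 1, len(vals))
    return centered @ vecs[:, :n_keep]


def spca(pixels, blocks, var_fraction=0.99):
    """Segmented PCA: PCA per band block, regroup the retained PCs, then PCA again."""
    first_stage = [pca_retain(pixels[:, block], var_fraction) for block in blocks]
    regrouped = np.hstack(first_stage)
    return pca_retain(regrouped, var_fraction)


# Hypothetical stand-in data: 500 pixels, 65 bands, three contiguous band blocks.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(500, 65))
blocks = [slice(0, 32), slice(32, 38), slice(38, 65)]
features = spca(pixels, blocks)
print(features.shape)
```

Because the last step is itself a PCA, the resulting features are mutually uncorrelated, which matches the diagonal correlation matrix reported for the SPCA extracted data.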

4.1.2 PP

For PP, Posse’s (1995a) algorithm was used in this research work, in which the OD (n-dimensional) is projected onto a two-dimensional space. Thus the dimension of the PP feature extracted data set is two. The chi-square projection pursuit index was chosen here. The methodology adopted for the PP method is shown in Figure 4.2.

4.1.3 KPCA

The number of PCs is equal to the number of TP used for FE. In this experiment, up to 400 TP were used for FE with the KPCA method; hence, the dimension of the KPCA feature extracted data set is up to 400. First, the TP are mapped into feature space using a kernel function (linear, polynomial or Gaussian), which yields the Gram matrix. Then the eigenvalues and eigenvectors of the Gram matrix are calculated. Afterwards, the OD is mapped into kernel space using the same kernel function (as used for the TP) and projected along the directions of the eigenvectors. Finally, the KPCA feature extracted data set is obtained. The outline of the KPCA method is shown in Figure 4.3.

Figure 4.3: KPCA feature extraction method
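The KPCA steps listed above (Gram matrix of the TP, eigen-decomposition, projection of the OD) can be sketched with NumPy as below. The function names, the kernel width `gamma` and the random data are assumptions for illustration. The sketch also centers the Gram matrix, as in standard KPCA, which leaves at most n - 1 non-zero eigenvalues for n TP, so it may return slightly fewer features than the number of TP.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian (rbf) kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kpca_fit(tp, gamma):
    """Eigen-decompose the centered Gram matrix of the training pixels."""
    n = tp.shape[0]
    k = rbf_kernel(tp, tp, gamma)
    one_n = np.full((n, n), 1.0 / n)
    k_centered = k - one_n @ k - k @ one_n + one_n @ k @ one_n
    vals, vecs = np.linalg.eigh(k_centered)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10 * vals[0]                    # drop numerically zero directions
    coeffs = vecs[:, keep] / np.sqrt(vals[keep])     # unit-norm directions in feature space
    return {"tp": tp, "gamma": gamma, "k": k, "coeffs": coeffs}


def kpca_transform(model, pixels):
    """Project pixels onto the kernel principal directions found from the TP."""
    tp, k = model["tp"], model["k"]
    n = tp.shape[0]
    k_x = rbf_kernel(pixels, tp, model["gamma"])
    one_n = np.full((n, n), 1.0 / n)
    one_x = np.full((pixels.shape[0], n), 1.0 / n)
    k_x_centered = k_x - one_x @ k - k_x @ one_n + one_x @ k @ one_n
    return k_x_centered @ model["coeffs"]


# Hypothetical stand-in data: 30 training pixels and 100 image pixels with 5 bands.
rng = np.random.default_rng(0)
tp = rng.normal(size=(30, 5))
od = rng.normal(size=(100, 5))
model = kpca_fit(tp, gamma=0.1)
od_features = kpca_transform(model, od)
tp_features = kpca_transform(model, tp)
print(od_features.shape)
```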


4.1.4 OSP

The dimensionality of the feature extracted data set depends on the number of classes present in the OD. OSP starts by finding the endmembers with the automated target generation process (ATGP). The OD is then projected along the endmembers to obtain the feature extracted data set. The data set used for this thesis has eight classes, so the number of endmembers is also eight, and the dimension of the feature extracted data set is equal to the number of endmembers. A brief description of the OSP method is shown in Figure 4.4.
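A rough sketch of the two stages named above. It assumes the commonly described form of ATGP (start from the pixel with the largest norm, then repeatedly pick the pixel with the largest residual after projecting out the targets found so far) and the classical OSP operator (project out all other endmembers, then correlate with the endmember of interest). Whether the thesis uses exactly these variants is an assumption; the function names and random pixels are illustrative.

```python
import numpy as np


def orthogonal_projector(u):
    """Projector onto the orthogonal complement of the columns of u (n_bands, k)."""
    return np.eye(u.shape[0]) - u @ np.linalg.pinv(u)


def atgp(pixels, n_targets):
    """Return the indices of n_targets pixels chosen by the ATGP-style search."""
    indices = [int(np.argmax((pixels ** 2).sum(axis=1)))]
    while len(indices) < n_targets:
        projector = orthogonal_projector(pixels[indices].T)
        residual = pixels @ projector                # projector is symmetric
        indices.append(int(np.argmax((residual ** 2).sum(axis=1))))
    return indices


def osp_features(pixels, endmembers):
    """One feature per endmember: suppress the other endmembers, then match the target."""
    columns = []
    for j in range(len(endmembers)):
        others = np.delete(endmembers, j, axis=0).T
        operator = orthogonal_projector(others) @ endmembers[j]
        columns.append(pixels @ operator)
    return np.column_stack(columns)


# Hypothetical stand-in data: 500 pixels with 20 bands, eight endmembers.
rng = np.random.default_rng(0)
pixels = rng.uniform(size=(500, 20))
target_indices = atgp(pixels, n_targets=8)
endmembers = pixels[target_indices]
features = osp_features(pixels, endmembers)
print(features.shape)
```

In this formulation each output band responds to one endmember and is zero for the other endmember pixels, which is consistent with the remark that all OSP bands should be used together for classification.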

This section provides the detailed methodology of the classification followed in this research work. The feature extracted data or the OD, the TP and the selected bands are given as input to the classifier. In this thesis work, the same set of TP was used to train the classifier for every data set. For example, when classification was performed using 200 TP per class on the SPCA modified data set, the same 200 TP were used for the OD. To vet the results obtained by Abhinav (2009), the same sets of TP are also used here. Those TP were obtained by the multinomial TP selection algorithm. The statistically sufficient sample size for training and testing was calculated at a confidence level of 99% and a desired precision of 4% using the formula suggested by Tortora (1976). Following this approach, a minimum of 99 TP per class has to be chosen to train a classifier.

Experiments were performed with the GML, KNN and advanced (SVM) classifiers. For each classifier, two types of experiments were performed: the first on the OD and the second on the feature extracted data sets. For each set of experiments, the classifier was trained with 25, 100, 200 and 300 TP per class. Using the same set of TP ensures that no discrepancy arises from different training data sets when different classification results are compared. These numbers were chosen in order to consider the following cases of training sample size.

a) Statistically insufficient training sample size (25 TP)

b) Statistically exact training sample size (100 TP)

c) Statistically sufficient training sample size (200 TP)

d) Very large training sample size (300 TP)

used to obtain the test accuracy of the classifiers in terms of a confusion matrix. Accuracy analysis of the resulting maps was performed by comparing the kappa values of the different algorithms through z-statistics, using a one-tailed hypothesis test at the 95% confidence level (Congalton, 1991).

For each classification technique, initially five bands of the OD or feature extracted data set (except the OSP and PP feature extracted data sets) were chosen. The number of bands was then incremented in steps of five up to the number of available bands (which may differ between feature extracted data sets). Classification was performed at each step to evaluate whether there was any improvement in accuracy. This was done for each set of TP.

The dimension of the OSP feature extracted data set is equal to the number of classes present in the OD, and each band of the OSP feature extracted data set contains information corresponding to one class. Therefore, all bands of the OSP feature extracted data set should be used together for classification; otherwise, classification errors may result. For all the experiments in this thesis work, the eight bands of the OSP feature extracted data set were used together.

The dimension of the PP feature extracted data set is two, so the maximum number of bands available is two. For all the experiments on the PP feature extracted data set, both bands were used together. The methodology of the classification procedure for this thesis work is shown in Figure 4.5.

Figure 4.5: Overview of classification procedure

and non-parametric classifier

The Set-I experimental setup was designed to investigate the results of the parametric (GML) and non-parametric (KNN) classifiers. Classification was performed with different parameter settings for KNN and GML.

For KNN, three neighboring pixels were chosen initially, and the neighborhood size was then increased by one at a time up to 11. After that, classification was performed only for a neighborhood size of 15. However, the improvement in accuracy beyond five neighboring pixels was negligible. The experiment was conducted to study the effect of the number of neighboring pixels on accuracy.
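The neighborhood sweep described above can be sketched as follows. The classifier is a plain majority-vote KNN on Euclidean distance, which is an assumption about the variant used; ties go to the lowest class label here. The data are two artificial, well separated clusters, not the thesis data.

```python
import numpy as np


def knn_classify(train_x, train_y, test_x, k):
    """Majority-vote k-nearest-neighbor classification with Euclidean distance."""
    # Squared distance between every test pixel and every training pixel.
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    n_classes = int(train_y.max()) + 1
    return np.array([np.bincount(train_y[row], minlength=n_classes).argmax()
                     for row in nearest])


# Hypothetical training data: 20 pixels per class in 3 bands, two distant clusters.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
                     rng.normal(10.0, 1.0, size=(20, 3))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])

# Neighborhood sizes used in the Set-I experiments: 3, 4, ..., 11 and then 15.
neighbor_sizes = list(range(3, 12)) + [15]
predictions = {k: knn_classify(train_x, train_y, test_x, k) for k in neighbor_sizes}
print(predictions[3], predictions[15])
```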

The best classification results of KNN and GML for the feature extracted data sets as well as the OD were observed independently, along with the parameters responsible for the best result. The experimental scheme is given in Figure 4.6.

Figure 4.6: Experimental scheme for Set-I experiments

classifier

The second set of experiments was designed with the advanced classifier, i.e. the SVM algorithms. Different optimization techniques and algorithms for SVM were chosen in order to compare the accuracy and the time taken to train the classifier. In this thesis work, SVM_QP, SVM_SMO and a further approach, KPCA_SVM, were used to compare classification accuracy and time. As mentioned before, all these algorithms were applied to the OD as well as to the feature extracted data sets.

algorithms, depending upon the accuracy and processing time

(ii) Inquiry of the best FE techniques for SVM classifier

problem using the quadratic programming (QP) optimization method. Then the KPCA algorithm with a Gaussian kernel was applied to the SVs, and the PCs were arranged in descending order of the eigenvalues of the kernel matrix. These PCs are the new set of SVs. In this research work, for all experiments related to KPCA_SVM, the number of new SVs chosen was about 70% of the number of original SVs (for details, see Section 3.2.3.4), because about 99% of the variance was contained in the first 70% of the PCs. Finally, the SVM decision rule was applied to the new set of SVs to obtain the classified map.

For SVM_QP and SVM_SMO, quadratic programming optimization and

sequential minimal optimization methods were used respectively to solve the dual

optimization problem. The classification scheme for Set-II experiment is given in

Figure 4.7.

4.5 Parameters

Parameters also play an important role in HD classification, so choosing them is an important task. All the parameters chosen for the different FE techniques and classification algorithms are listed in Table 4.1.

FE techniques    Parameters
SPCA             Correlation matrix of the bands
PP               No. of random searches – 5
                 half – 15
                 Stopping value – 0.01
KPCA             Kernel function – rbf
OSP              No. of endmembers – 8

Classifiers      Parameters
GML              Confidence interval – 99%
KNN              Neighbors – 3, 4, 5, ..., 11 and 15
SVM              Kernel function – rbf

CHAPTER-5

RESULTS

This chapter provides observations for various experiments and interpretation of the

same. Starting with the visual interpretation of feature extracted data sets, the

chapter will discuss the result of GML classifier on feature-extracted data set. These

results are compared with the best result for GML as observed by Abhinav (2009).

Then it will discuss the effect of KNN classification algorithm on OD and feature

extracted data set followed by the discussion of the results of different SVM

algorithms.

techniques

techniques can be visually inspected using grayscale views of the first few features.

The image form of the correlation matrix is also used for this purpose.

From the correlation image of the OD (Figure 5.1), it is clear that there are three highly correlated blocks of bands. The first block contains 32 bands, the second 6 bands and the last 27 bands. The average correlation values of the blocks are 0.931, 0.997 and 0.941 respectively. The OD was therefore segmented into these three blocks of bands. PCT was then applied to each block on the basis of its correlation matrix, which yielded the SPCA feature extracted data set. The total time taken to complete this process was about 8 seconds.
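The average correlation of a block of bands, as quoted above, could be computed along the following lines. The sketch assumes that the average is taken over the off-diagonal band pairs of the block; the thesis does not state this, and the six-band block below is artificial data built so that its bands are almost perfectly correlated.

```python
import numpy as np


def mean_block_correlation(pixels, start, stop):
    """Mean pairwise correlation of the bands start..stop-1 of (n_pixels, n_bands) data."""
    corr = np.corrcoef(pixels[:, start:stop], rowvar=False)
    upper = np.triu_indices_from(corr, k=1)     # each band pair once, diagonal excluded
    return float(corr[upper].mean())


# Hypothetical 6-band block: every band is a scaled copy of one signal plus small noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
block = signal * np.arange(1, 7) + 0.01 * rng.normal(size=(200, 6))
block_mean = mean_block_correlation(block, 0, 6)
print(round(block_mean, 3))
```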


Figure 5.1: Correlation image of the OD set consisting of three blocks having bands

32, 6 and 27 respectively.

In the PP process, two-dimensional structures are found sequentially, from the most interesting to the less interesting. The two most interesting structures, in decreasing order of interest, are shown in Figure 5.2. The PP index after five random searches was 0.3825, and the size of the neighborhood (c) around the best projection plane was 0.011. The total time taken to complete the whole process was about 11.30 hours. Table 5.1 presents the time required for each FE technique under different constraints.

Table 5.1: The time taken for each FE technique

FE methods              Time
SPCA                    6-8 seconds
KPCA with rbf kernel    1) 4 minutes for 25 TP
                        2) 5.5 minutes for 100 TP
                        3) 6.3 minutes for 200 TP
                        4) 8.5 minutes for 300 TP
                        5) 10 minutes for 400 TP
PP                      11.30 hours

Figure 5.2: Projection of the data points onto the plane with axes α* and β*. (a) Most interesting projection direction (b) Second most interesting projection direction.

The grayscale images of features extracted data using various FE techniques are

provided in Figures 5.3 to 5.6, followed by the corresponding correlation images

shown in Figure 5.7.


(a) SPCA-1 (b) SPCA-2 (c) SPCA-3

Figure 5.3: First six Segmented Principal Components (SPCs) (b) shows water body

and salt lake

Figure 5.4: First six Kernel Principal Components (KPCs) obtained by using 400 TP


(a) OSP-1 (b) OSP-2 (c) OSP-3

Figure 5.5: First six features obtained by using eight end-members (b) shows

vineyards and wheat, (c) shows bare soil, (d) shows salt lake.

(a) PP -1 (b) PP -2

Figure 5.6: Two components of most interesting projections (a) shows salt lake.


Figure 5.7: Correlation images of the feature extracted data sets; panels include (a) SPCA and (b) KPCA (colour scale from 0.0 to 1.0).

data sets (Figure 5.3 to 5.6) and their correlation images (Figure 5.7):

(i) Since the extracted SPCs are ranked according to their eigenvalues, most of the information is visible in the first four SPCs; no interesting structures could be identified visually beyond the 4th SPC. As SPCA uses the local correlation of the bands rather than the global correlation (as PCA does), it is able to decorrelate the bands involved more strongly than PCA. A better classification result is therefore expected from the SPCs. It was also observed visually that SPCA-2 is associated with the water body and salt lake classes.

(ii) The first few features extracted by KPCA were visually inferior to those obtained by SPCA (not revealing any single class). Some features, such as KPCA-1 and KPCA-2, show water body and salt lake prominently, but other classes are also present in them.

(iii) OSP is generally used to extract as many features as there are classes in the data set (here eight classes, hence eight features). Although the number of features extracted by OSP is low, some structures can be identified prominently. For example, OSP-4 identifies salt lake, OSP-2 identifies vineyards and wheat, and OSP-3 shows bare soil. The OSP algorithm suggests that each band of the OSP extracted data set is associated with one of the predefined classes. OSP is therefore expected to perform well for classification.

(iv) The dimension of the PP extracted data set is two. Salt lake can be identified very clearly from the first extracted feature, but the second feature contains no identifiable structures and has a hazy appearance.

(v) The improvement in the quality of the features extracted by the different FE techniques can be observed by comparing the correlation images of the OD (Figure 5.1) and of the feature extracted data (Figure 5.7). The correlation matrices of the SPCA and PP extracted data sets are found to be perfectly diagonal, with diagonal values equal to unity and all off-diagonal elements equal to zero. On the other hand, the data extracted by the supervised FE techniques (OSP, KPCA) are correlated. This is because the SPCA and PP algorithms extract only orthogonal features, while the FE criterion for OSP is different, so highly correlated features are observed for OSP. In the correlation image of the KPCA feature extracted data set, the correlation is unity along the diagonal and decreases with increasing distance from the diagonal, except for bands 80 to 100, which are observed to be fully uncorrelated.

classifiers

This section presents the results of the GML and KNN classifiers on the different data sets. The results for the GML classifier are described first, followed by those for KNN.

The performance of GMLC on the feature modified data sets (SPCA, KPCA, OSP and PP FE methods) was compared with the best result obtained by Abhinav (2009) for the GML classifier, in order to evaluate the improvement in classification due to these FE techniques. It may be noted that he obtained the best results with the PCA modified data set. Figure 5.8 shows the k-values obtained for the different feature modified data sets. The following observations can be made from Figure 5.8:

(i) With sufficient TP (100, 200, 300), the k-values obtained for the PCA, SPCA and OSP extracted data sets were higher than those for the PP and KPCA modified data sets.

(ii) For statistically insufficient TP (25), GML performs poorly on the SPCA, PCA and OSP modified data sets. Beyond a certain number of bands, the k-value for the PCA and SPCA modified data sets becomes negative for 25 TP per class. This is because at least p+1 sample points are required to obtain a numerically well-conditioned inverse of a p × p matrix. As a result, GML fails when more than 25 bands are used with 25 TP per class, since the TP are insufficient for computing the inverse of the class covariance matrix.

(iii) An interesting phenomenon can be observed in the k-values of the KPCA modified data set. The k-value increases over the first 35 bands, falls suddenly at 40 bands, and starts to increase again from 45 bands onwards. The results for the KPCA modified data set were observed up to 65 bands (the dimension of the OD is 65).

(iv) The k-values obtained for SPCA and OSP appear to be higher than those obtained for PCA and KPCA.

(v) The performance of PP is very poor owing to the very low number of features (two). Hence, PP was not considered any further for classification.

(vi) For all FE techniques (except KPCA and OSP), the k-values increase significantly with the number of bands up to a critical number of bands (say, Ncri), after which no improvement in the k-values could be observed. This is because the features extracted by these techniques are arranged in decreasing order of eigenvalues, so the useful information is stored in the first few features, while the lower-order features contain less useful information and are very noisy. When noisy bands are added, the probability of misclassification increases, and the classification accuracy therefore stagnates.

(vii) Ncri differs between sets of TP: as the number of TP increases, Ncri increases. Because of the Hughes phenomenon, classification with a large number of bands gives poor results unless the number of TP is large.
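Observation (ii) can be illustrated numerically: a sample covariance matrix estimated from n pixels has rank at most n - 1, so it cannot be inverted once the number of bands reaches the number of TP. The sketch below uses random stand-in data and a hypothetical helper name; it is not the thesis code.

```python
import numpy as np


def covariance_rank(n_tp, n_bands, seed=0):
    """Rank of the sample covariance matrix estimated from n_tp random training pixels."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=(n_tp, n_bands))
    cov = np.cov(samples, rowvar=False)          # (n_bands, n_bands)
    return int(np.linalg.matrix_rank(cov))


rank_20_bands = covariance_rank(n_tp=25, n_bands=20)   # enough TP: full rank expected
rank_30_bands = covariance_rank(n_tp=25, n_bands=30)   # too few TP: rank deficient
print(rank_20_bands, rank_30_bands)
```

With 25 TP the 20-band covariance matrix is invertible, while the 30-band one is singular, so the Gaussian likelihood cannot be evaluated without some form of regularization or band reduction.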


Figure 5.8: Overall kappa value observed for GML classification on different feature extracted data sets using different selected bands (panels: PCA, SPCA, KPCA, OSP, PP)

To confirm these observations, statistical analysis was performed. The k-

values obtained for each FE technique are given in Table 5.1. The best results

obtained by GML classification on different feature extracted data set for three

training data sets (100, 200 and 300 TP) were selected for comparison with the best

GML result obtained with PCA extracted data set. The condition for selecting the

best classification result (best k-value) is the least number of bands used after which

no statistically significant improvement in k-value could be achieved. A comparison of

the best results between the PCA and other FE modified data sets and among the

various FE techniques is presented in Table 5.2 in terms of z-statistic values obtained

for one-tailed hypothesis testing at 5% significance level.

(i) PCA and SPCA give statistically similar results for 100 and 300 TP per class, while SPCA gives a statistically significantly better result than PCA for 200 TP per class. SPCA is thus an improvement over PCA.

(ii) For OSP, a statistically better result could not be achieved with the statistically exact TP set (100 TP per class), but when the number of TP is increased (200 TP per class), it gives a statistically better result than PCA. For the large TP set (300), a result statistically similar to PCA is obtained.

(iii) For the 200 and 300 TP sets, SPCA and OSP give statistically similar results, but for the small TP set (100), SPCA gives a better result than OSP. Since the SPCA extracted data set is closer to orthonormal than the OSP extracted data set, it can be concluded that SPCA is a better FE technique than OSP for GML classification.

(iv) The PP extracted data set always gives a statistically much poorer result than OSP, because of the low dimensionality (two) of the PP extracted data set.

(v) KPCA always fails (for large or small TP sets) to give a statistically better result than PCA or OSP, and OSP is statistically better than PP for all sets of TP. SPCA, in turn, gives statistically better results than PCA or OSP. Therefore, it can be concluded that SPCA is a better FE technique than PCA and the other FE techniques (OSP, PP, KPCA).

(vi) The best kappa accuracy for the GML classifier is obtained using the SPCA extracted data set with 300 TP. The kappa value is 0.9589 and the number of bands used for classification is 45.

Table 5.2: Best kappa values and z-statistics (at 5% significance level) for GML

TP    Best k1  NB*  Best k2  NB  Best k3  NB  Best k4  NB  Best k5  NB  Z12    Z13    Z14    Z24    Z34     Z45
100   0.9362   20   0.9384   20  0.8489   35  0.9205   8   0.2220   2   -1.35  41.95  8.87   10.51  -41.95  222.81
200   0.9460   20   0.9579   30  0.8332   35  0.9505   8   0.2146   2   -3.97  53.47  -4.07  0.99   -53.47  304.51
300   0.9568   40   0.9589   45  0.8569   35  0.9572   8   0.2228   2   -1.45  50.65  -0.28  1.20   -50.65  290.75

NB* = number of bands used; $Z_{12} = \frac{\hat{k}_1 - \hat{k}_2}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}$

From Table 5.3 it is observed that the best results for the PCA, SPCA and KPCA extracted data sets were obtained with 30-45 features at 300 TP, and for the OSP extracted data set with 8 features at 300 TP. During the experiments, GMLC took around 55-70 seconds to process 30-45 bands with 300 TP per class for the SPCA and PCA extracted data sets, and about 32 seconds for the OSP extracted data. OSP gives results statistically similar to PCA and SPCA for 300 TP, but its processing time is much lower than that of the other FE techniques. Therefore, considering both accuracy and processing time, OSP can be rated as the most effective FE technique for GMLC. For statistically insufficient TP (25) and statistically sufficient TP (200), SPCA is rated as the best FE technique. For 100 TP per class, the performance of PCA and SPCA for GMLC is the same. From Figure 5.9, it can be observed that GMLC on OSP is faster than on any other FE technique, while PCA and SPCA take about the same time to provide the best k-values.

Table 5.3: Ranking of FE techniques and time required to obtain the best k-value

TP    k1*     Time (s)*  Rank  k2      Time (s)  Rank  k3      Time (s)  Rank  k4      Time (s)  Rank  k5      Time (s)  Rank
25    0.8409  53.6       1     0.8296  53.6      2     0.8215  59.7      2     0.2700  35.4      3     0.1960  -         4
100   0.9384  60.6       1     0.9362  60.6      1     0.8489  75.6      3     0.9205  39.2      2     0.2220  -         5
200   0.9579  65.2       1     0.9460  59.4      3     0.8332  74.4      4     0.9505  36.7      2     0.2146  -         5
300   0.9589  83.5       1     0.9568  72.3      1     0.8569  62.8      2     0.9572  39.8      1     0.2228  -         3

Time* = time (seconds) for obtaining the best k-value; ki* = k-value for the ith FE technique; Rank: 1 indicates the best

Figure 5.9: Comparison of kappa values and classification times for GML classification method.

5.2.2 Class-wise comparison of result for GMLC

The class-wise accuracy for GMLC has been observed for the different feature extracted data sets. From Figure 5.10, the following can be observed:

(i) For all sizes of TP set, GMLC can extract the salt lake class from all sets of feature extracted data with a very high k-value. The water class is also extracted with a very high k-value from all feature modified data sets (except PP). From the PP modified data set, only the salt lake class can be separated with a satisfactory k-value, because the first feature of the PP modified data set distinguishes salt lake very clearly (Figure 5.6).

(ii) Besides salt lake and water body, GMLC separates the hydrophytic class from the other classes with a very high k-value for all feature extracted data sets and all sets of TP. For 300 TP, GMLC separates the hydrophytic vegetation class with very high accuracy from the SPCA modified data set.

(iii) GMLC classifies vineyards and wheat with about the same k-value. Some vineyard pixels have been classified as wheat pixels owing to the presence of mixed pixels. The classification accuracies of the vineyards, bare soil, pasture land and built-up area classes are about the same for 200 and 300 TP for the SPCA, PCA and OSP modified data sets. They are low for the KPCA modified data set.

Figure 5.10: Accuracy of individual classes observed for GMLC on different feature extracted data sets with respect to different sets of TP (panels: 25, 100, 200 and 300 training pixels). Legend: WT: Water; SLT: Salt lake; HV: Hydrophytic Veg; WHT: Wheat; VY: Vineyards; BS: Bare Soil; PL: Pasture Land.

5.2.3 Classification results using KNN classifier (KNNC)

To understand the effect of FE techniques on the KNN classifier, experiments were performed with the OD as well as the feature extracted data. The same sets of TP as used in GML classification were chosen in order to compare classification accuracy. The observations from Figures 5.11 to 5.14 are as follows:

(i) For KNN, poor performance is observed for the statistically insufficient TP set (i.e. 25 TP). However, KNN on the OD performs better than on the PCA, OSP and SPCA extracted data sets. The maximum k-value was obtained with 65 bands and three neighbors. For the KPCA extracted data set, the k-value was comparatively better than for the OD when 50 bands were taken, for all numbers of neighbors. PP was not included in the accuracy analysis because, owing to its very low dimensionality, it would not be able to provide good k-values.

(ii) For the statistically exact TP set (100 TP), the performance of KNN on the OD is better than on any feature extracted data set. Increasing the number of bands increases the k-values for all feature extracted data sets except SPCA, for which it did not produce any significant change. However, when the number of neighbors was increased, changes were easily observed: beyond a critical number of neighbors (say, Nnbd), the k-value starts decreasing, independently of the number of bands. This may be due to noisy points in the training data set, since a larger number of neighbors increases the chance of using noisy TP, and misclassification errors consequently add up.

(iii) For 200 TP per class, no improvement over the OD is observed for the PCA, KPCA and OSP extracted data sets, but an improvement was observed for the SPCA extracted data set. However, the PCA and KPCA extracted data sets did not show any appreciable change for KNNC with the 100 and 200 TP sets respectively. The effect of the neighborhood size on accuracy can be seen in Table 5.4. For all sets of TP, the highest k-value is always achieved with the first few neighbors (Table 5.2).

(iv) For the large training data set (300 TP), k-values better than those for the OD were observed for the PCA and SPCA extracted data sets. Beyond a certain threshold neighborhood size, the k-value decreases monotonically for the PCA, OSP and SPCA extracted data sets.

(v) The KPCA extracted data set gives better results at high dimension, since it is more refined than the PCA or SPCA extracted data sets.

(vi) For all training data sets except the statistically insufficient one, the k-value for the OSP extracted data set varies only a little (0.02 - 0.05) because of its very low dimensionality. If the number of extracted endmembers were large enough, the result could be improved further.

(vii) Another important aspect was observed for the feature extracted data sets. The difference between the k-values (for all sets of TP) obtained using the minimum and the maximum number of bands is about 0.15 to 0.20. This could be because most of the information is concentrated in the first few bands of the feature extracted data sets, so additional bands cannot provide enough useful information to change the k-value significantly.

Table 5.4: Classification with KNNC on OD and feature extracted data set

Data sets   Bnd*  NN*   Bnd  NN   Bnd  NN
Original    55    3     35   3    30   3
PCA         35    5     45   5    20   3
SPCA        10    3     15   3    40   3
KPCA        35    3     45   3    30   6
OSP         8     3     8    3    8    3
PP          2     15    2    11   2    15

Bnd* = number of bands for which the best k-value was obtained
NN* = number of neighbors for which the best k-value was obtained


Figure 5.11: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb: number of nearest neighbors)

Figure 5.12: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb: number of nearest neighbors)

Figure 5.13: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb: number of nearest neighbors)

Figure 5.14: Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP (panels: Original, PCA, SPCA, KPCA, OSP, PP; NNb: number of nearest neighbors)

The k-values for the classification of these data sets were analyzed to select the

best results for each data set. Similar approach as in the case of GML is also followed

here. The z-statistic values obtained for selected best k-values are shown in Table 5.5.

The following can be inferred from these results:

(i) The results obtained using the PCA and SPCA modified data sets were significantly better than those obtained using the OD for the large training data size (300 TP). However, SPCs and PCs were still found to perform worse than the OD for 100 TP. Statistically similar results were obtained for the OD and the SPCA modified data set with a training data set of 200 TP. For the other feature extracted data sets and for all sets of training data, the OD gives statistically significantly better results for KNN classification.

(ii) The best results with the OD were obtained using 30 to 55 bands and three neighbors. For 300 TP, statistically better results than with the OD were obtained using the SPCA (40 bands) and PCA (20 bands) modified data sets with three neighbors. For 200 TP, the SPCA modified data set (15 features and 3 neighbors) gives results statistically similar to the OD.

(iii) The SPCA extracted data sets performed statistically significantly better than the PCA extracted data sets with the smaller training data sets, whereas the best results obtained with the 300 TP training data set using SPCs were statistically similar to those obtained using PCs.

(iv) SPCs were also observed to perform significantly better than the KPCA and OSP modified data sets for all training data sets. In addition, the best results for PCA and OSP were found to be statistically poor for all training data sizes.

TP    k1      NB*   k2      NB   k3      NB   k4      NB   k5      NB   Z12     Z13     Z14     Z23      Z34     Z45
100   0.8889  55    0.7773  35   0.8669  10   0.7715  35   0.8268  8    42.51   9.42    44.72   -34.98   37.24   -20.55
200   0.9037  35    0.7881  45   0.9040  15   0.8062  45   0.8514  8    48.98   0.15    41.31   -49.10   11.43   -17.72
300   0.9244  30    0.8141  30   0.9325  40   0.9320  20   0.8701  8    47.91   -4.58   -4.29   -52.68   0.29    30.95

* Number of bands used to obtain best k-value
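The pairwise z-statistics reported in tables of this kind can be sketched as follows. This is only an illustrative sketch and not the code used in this work: it assumes that each k-value is computed from a confusion matrix, that its variance is the usual large-sample (delta method) estimate, and that two classifications are compared with z = (k1 - k2) / sqrt(var1 + var2). The function names and the critical value 1.645 (one-tailed test at 5% significance) are assumptions made for the example.

```python
import math

def kappa_and_variance(cm):
    """Return (kappa, large-sample variance) for a square confusion matrix
    given as a list of rows (rows and columns are the two labelings)."""
    q = len(cm)
    n = sum(sum(r) for r in cm)
    row = [sum(cm[i]) for i in range(q)]
    col = [sum(cm[i][j] for i in range(q)) for j in range(q)]
    t1 = sum(cm[i][i] for i in range(q)) / n                 # observed agreement
    t2 = sum(row[i] * col[i] for i in range(q)) / n ** 2     # chance agreement
    t3 = sum(cm[i][i] * (row[i] + col[i]) for i in range(q)) / n ** 2
    t4 = sum(cm[i][j] * (row[j] + col[i]) ** 2
             for i in range(q) for j in range(q)) / n ** 3
    kappa = (t1 - t2) / (1 - t2)
    var = (t1 * (1 - t1) / (1 - t2) ** 2
           + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
           + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4) / n
    return kappa, var

def z_statistic(k1, var1, k2, var2):
    """z-statistic for the difference between two independent kappa values."""
    return (k1 - k2) / math.sqrt(var1 + var2)

def significantly_different(z, critical=1.645):
    """Assumed decision rule: one-tailed test at 5% significance."""
    return abs(z) > critical
```

Under this assumed rule, a value such as Z13 = 0.15 would be read as "statistically similar", while Z13 = -4.58 would indicate that the second data set of the pair performs significantly better.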


The time taken by the KNN classifier is strongly affected by the number of TP, because a distance has to be computed between each test pixel and each TP: for n TP and m test pixels, the number of distances calculated is nm. In contrast, increasing the number of neighbors has significantly less effect on run time. The times taken for classification with three and with 15 neighbors are almost the same (the maximum difference is 60-120 seconds) (Figure 5.15). It is also noticed that increasing the number of bands increases the calculation time proportionally (Figure 5.15). From Figure 5.16, it can be observed that PCA takes the least time, compared to OD and SPCA extracted data, to provide its best result. Considering the time constraint and the k-value, PCA could be chosen as the best FE technique, followed by SPCA, among the available techniques for KNN classification. Figure 5.15 shows the comparison of time between 200 TP and 300 TP for the same number of bands and neighbors. The rank of the FE techniques with respect to accuracy for KNNC for each set of TP can be inferred from Table 5.6.
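The dependence of the run time on n, m and the number of bands can be illustrated with a brute-force sketch. It is not the implementation used in the experiments; squared Euclidean distance, majority voting and the name `knn_classify` are assumptions made for illustration only.

```python
from collections import Counter

def knn_classify(train_x, train_y, test_x, k=3):
    """Brute-force KNN. Returns the predicted labels and the number of
    distances that were evaluated (n training pixels times m test pixels)."""
    labels = []
    n_dist = 0
    for x in test_x:
        dists = []
        for t, y in zip(train_x, train_y):
            # Cost of one distance grows linearly with the number of bands.
            d = sum((a - b) ** 2 for a, b in zip(x, t))
            dists.append((d, y))
            n_dist += 1
        # Only this step depends on k, which is why the number of
        # neighbors has little effect on the total run time.
        dists.sort(key=lambda p: p[0])
        votes = Counter(y for _, y in dists[:k])
        labels.append(votes.most_common(1)[0][0])
    return labels, n_dist
```

In this sketch the distance loop dominates the cost, so doubling the number of TP or the number of bands roughly doubles the work, whereas changing k only changes how many of the sorted distances are used for voting.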

From Table 5.6, it is further observed that for a statistically exact number of TP (i.e. 100), KNNC produced the best result with OD. For statistically sufficient TP (i.e. 200), SPCA secured the first rank. However, for statistically large TP (i.e. 300), SPCA and PCA both perform better. Therefore, it is concluded that, among all the data sets (feature modified and original), SPCA and PCA provide the best results for KNNC, which in turn indicates that PCA is the best FE technique among all of these techniques for KNNC.

Table 5.6: Rank of FE techniques and time required to obtain best k-value (Rank 1 indicates the best)

TP    k1      Time(s)*  Rank   k2      Time(s)  Rank   k3      Time(s)  Rank   k4      Time(s)  Rank   k5      Time(s)  Rank
100   0.8889  875.1     1      0.7773  722.9    4      0.8669  661.2    2      0.7715  789.6    5      0.8268  655.2    3
200   0.9037  1200.6    1      0.7881  1271.1   4      0.9040  1122.1   1      0.8062  1272.0   3      0.8514  1022.7   2
300   0.9244  1574.6    2      0.8141  1556.0   4      0.9325  1712.5   1      0.9320  1434.0   1      0.8701  1291.9   3

Time(s)*: required time in seconds

Figure 5.15: Time comparison for KNN classification. Time for different bands at different neighbors for (a) 300 TP (b) 200 TP training data per class (NNb*: number of nearest neighbors)

Figure 5.16: Best k-value and classification time for original and feature extracted data sets

5.2.4 Class wise comparison of results for KNNC

From Figure 5.17, the following observations can be made on the class wise accuracy of KNNC.

KNNC extracts the water and salt lake classes with very high accuracy for both feature modified data and OD. However, the built-up area is classified very poorly due to the presence of a large number of mixed pixels. Some built-up area pixels are classified into the hydrophytic veg, wheat and pasture land classes for all data sets, again because of the large number of mixed pixels in the built-up area class. The OD, KPCA and OSP modified data sets perform worse than the SPCA and PCA modified data sets in classifying the hydrophytic veg class for all sets of TP. For vineyards, a built-up area classes for all data sets and TP.

Figure 5.17: Class wise accuracy comparison of OD and different feature extracted data for KNNC (panels: 100 Training Pixels, 200 Training Pixels, 300 Training Pixels). WT: Water, SLT: Salt lake, HV: Hydrophobic Veg, WHT: Wheat, VY: Vineyards, BS: Bare Soil, PL: Pasture Land, BUA: Built-up Area

5.3 Experiment results for SVM based classifiers

In this section, the results of the different SVM algorithms are described. The results of the SVM_QP algorithm are described first, followed by SVM_SMO and KPCA_SVM. The section also provides a comparison of the classification time of the different SVM algorithms.

5.3.1 Experiment results for SVM_QP algorithm

Using the optimal set of parameter values for SVM classifiers (Table 4.5, recommended by Abhinav, 2009), classification was performed on the feature modified data sets. Results from these experiments are compared with the best result obtained by Abhinav (2009) for the SVM classifier. He noted that the performance of SVM_QP was the best for the PCA extracted data set. The same training and input data sets were used as for the GML and KNN classifiers. The classification results obtained by SVM are presented in Figure 5.18, from which the following observations can be made:

(i) The k-values improve with increasing training data size for all input data set types (PCA, SPCA, KPCA, OSP and PP modified data sets).

(ii) The best classification results were obtained with PCA and SPCA modified data sets. For the KPCA modified data set, the k-values increase as the number of bands increases. It is possible that, at very high dimension, the KPCA extracted data set could provide k-values as high as those of the SPCA or PCA extracted data sets.

(iii) For PCs and SPCs, the k-values increase and then stagnate after a critical number of features; after that they start to decrease gradually. This could be due to the same reason discussed for the GML classification algorithm in section 5.1.

(iv) A similarity can be observed among the KPCA, PCA and SPCA modified data sets: for statistically insufficient TP (25), the k-values suddenly drop to about zero for classification using 50 bands. The reason is not clear. Probably, with this number of bands and TP, SVM_QP was unable to find a proper decision boundary.

(v) The best results for the KPCA and OSP extracted data sets are about the same for each set of TP, except for 25 TP.

Figure 5.18: Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer (SVM_QP; panels: (a) PCA, (b) SPCA, (c) KPCA, (d) PP, (e) OSP)

The k-values for the classification of these data sets were statistically analyzed to select the best results for each data set. The approach was similar to that followed in the case of GML. The z-statistic values obtained for the best k-values are shown in Table 5.7. The following can be inferred from these results:

(i) PCA and SPCA were found to give statistically similar results for all sets of TP. On the other hand, PCA always provides statistically significantly better results than the KPCA and OSP modified data sets for all sets of TP with the SVM_QP classifier.

(ii) Classification with the SPCA modified data set always performs statistically better than with the KPCA modified data set for all sets of TP. However, OSP performs statistically better than the KPCA modified data set for 100 and 200 TP per class. For the large set of TP (300), OSP performs statistically similarly to the KPCA modified data set.

(iii) It is also observed from Table 5.7 that the SPCA modified data set always performs statistically better than the OSP modified data set.

(iv) It can be concluded that PCs and SPCs have a better ability to improve the k-value than any other FE technique. KPCA performs the worst among all the FE techniques.

Table 5.7: The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set

TP    k1*     NB*   k2      NB   k3      NB   k4      NB   Z12     Z13     Z14     Z23      Z24     Z34
100   0.9408  15    0.8703  55   0.9408  15   0.8874  8    36.30   0.00    28.70   -36.30   -7.79   28.70
200   0.9621  15    0.8901  65   0.9573  15   0.9050  8    7.89    0.53    6.26    -33.39   -7.26   30.40
300   0.9643  15    0.9090  60   0.9691  20   0.9069  8    6.07    -0.59   6.30    -7.40    1.06    7.65

NB* = no. of bands used to achieve the best k-value; ki* = k-value for ith FE technique

During above experiments, it was observed that time taken to train the SVM

based classifier is affected very much by the number of training samples used. This is

because a kernel matrix has to be computed for every pair of TP. There were very

little changes in training times with increase in number of bands.
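The role of the kernel matrix in the training time can be illustrated with a small sketch. It is not the code used in these experiments; the RBF kernel form exp(-gamma * ||x - z||^2), the value of `gamma` and the function names are assumptions chosen for the example.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value for two pixels; cost is linear in the number of bands."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram_matrix(samples, gamma=0.5):
    """Kernel (Gram) matrix over all pairs of training pixels.
    The number of entries grows with the square of the number of TP."""
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The matrix is symmetric, so each pair is evaluated once.
            K[i][j] = K[j][i] = rbf_kernel(samples[i], samples[j], gamma)
    return K
```

For n training pixels the matrix has n x n entries, so going from 100 to 300 TP per class increases the number of kernel evaluations roughly ninefold, while adding bands only lengthens each individual evaluation.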

Generally, the total time taken to perform SVM based classification was observed to range from 23 to 102 seconds when the bands were increased from 5 to 65 for 25 TP. The same range for 100 TP was observed as 82 to 273 seconds, and 522 to 615 seconds for 200 TP.

An important aspect has been observed for the classification time using 200 TP with the SPCA modified data set (Figure 5.19). When the bands are increased beyond a critical number of bands (30 bands), the classification time decreases monotonically. The same trend was observed for 300 TP per class. This could be due to the fact that, with a large number of TP and a large number of bands, SVM_QP was unable to find a sufficient number of support vectors (SV) required for classification. For SPCA or PCA modified data sets, all bands except the first few contain a large amount of noise. Due to the presence of noise, the optimization problem might not be solved properly for a large number of bands with a large set of TP for SPCA or PCA modified data sets. That means that a sufficient number of SV could not be found. When the number of SV is smaller, the classification time is also smaller and the k-values may also decrease. This is supported by Figure 5.18 (a), (b): for SPCA and PCA modified data sets the k-values start to decrease after 25 bands.

Exceptionally higher times, of the order of 2600 seconds, were observed when the training data size was increased to 300 TP. Such higher times were observed due to the QP optimizer used. Varshney and Arora (2004) suggested a few better optimizers which would give the same classification accuracies in shorter training times. It is known that the same performance would be achieved irrespective of the choice of optimizer in the case of SVM, as it makes use of statistical learning theory, as pointed out by Varshney and Arora (2004).

Figure 5.19: Classification time comparison using 200 and 300 TP per class.

The classification results obtained using SVM with SMO optimization

techniques are presented in Figure 5.20. The rbf kernel function is used for

classification of different data sets using SVM_SMO algorithm. The following

observations can be made on the basis of k-value presented in Figure 5.20:

(i) The k-values improve with increasing training data size (except for 200 TP) for all input data sets.

(ii) Like SVM_QP, a sudden decrease in k-value is observed with 25 TP for the OD, SPCA, KPCA and OSP extracted data sets. For all data sets, this happens for 50 features.

(iii) For all data sets (except KPCA extracted data), the statistically sufficient training data set (200 TP) is unable to provide a positive k-value. This could be due to failure in solving the optimization problem for these data sets using 200 TP. For the KPCA extracted data set, the first few bands provide very low k-values for 200 TP. From 20 bands onwards, the k-value provided by the KPCA extracted data set for 200 TP is acceptable.

(iv) Increasing k-values were observed for original and KPCA modified data sets, which stop increasing after a critical number of features. After that, they start to decrease, for the same reason as reported for the GML classifier. For the OD and KPCA modified data sets, k-values increase monotonically for 100 and 300 TP per class.

(v) For the PP modified data set, however, very low k-values are observed. Therefore, all the results for the PP extracted data set are ignored in the comparison of results of the SVM_SMO classifier.

The k-values for the classification of these data sets were statistically analyzed to select the best results for each data set. The approach was similar to the one followed in previous cases. The z-statistic values are obtained to compare each data set. The best k-values are shown in Table 5.8. The following can be inferred from these results:

(i) The best results obtained using feature modified data sets were found to be significantly better than those obtained using the OD set for the large training data size (300 TP). For the OSP modified data set the improvement is marginal, but it can still be said to be significantly better than the OD set. The performance of OD, SPCA and OSP modified data is very bad, but the performance of KPCA modified data is very high, for 200 TP training data. SPCs were found to perform statistically better than the OD set for 100 TP per class and statistically similar to OD for 200 TP.

(ii) The best results with the OD were obtained using 50-60 bands, while significantly better results than OD were obtained using SPCA modified data sets with 15-30 features. For 300 TP, a result statistically similar to OD is obtained using the OSP modified data set with eight bands.

(iii) KPCs were observed to perform significantly better than SPCA and OSP modified data sets for 200 TP. For 100 and 300 TP, the best results obtained with the SPCA modified data set are significantly better than those of the OSP and KPCA modified data sets.

(iv) Classification with OSP is found to be significantly better than with KPCA for 100 TP, while KPCA is observed to be statistically better than OSP modified data for 200 and 300 TP. Thus it can be said that SPCA performs better than OD and any other feature extracted data, and that the performance of OSP is the worst for SVM_SMO based classification.

Figure 5.20: Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer (SVM_SMO; surviving panel titles: Original, SPCA)

Table 5.8: The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set

TP    k1      NB*   k2      NB   k3      NB   k4      NB   Z12      Z13      Z14     Z23      Z24      Z34
100   0.8955  50    0.8626  40   0.9304  15   0.8739  8    15.00    -15.91   9.85    -33.90   -4.99    28.25
200   0.1694  5     0.8826  50   0.1694  5    0.0001  8    -336.2   0.00     12.90   475.46   630.00   -
300   0.8934  60    0.9013  50   0.9436  30   0.8999  8    -3.80    -26.98   1.65    -23.75   5.56     28.86

NB* = no. of bands used to obtain best k-value; ki* = k-value for ith FE technique

Figure 5.21: Comparison of classification time for different set of TP with respect to number of bands for SVM_SMO classification algorithm.

The total time taken to perform SVM_SMO based classification was observed to range from 55 to 90 seconds when the bands were increased from 5 to 65 for 25 TP; the same range was observed as 145-194 seconds for 100 TP, 350-409 seconds for 200 TP and 1184 to 1814 seconds for 300 TP (Figure 5.21). Unlike SVM_QP, the classification time increases when the number of bands increases. The time requirement for a large number of TP for SVM_SMO is observed to be significantly less than for the SVM classification method based on the QP optimizer. This is due to the SMO optimization method. The solution derived in the SMO method needs very few numerical operations. The method needs a larger number of iterations, but each iteration requires a small number of operations, thus resulting in an increase in optimization speed for very large data sets.
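The small number of operations per iteration can be illustrated with a sketch of a single SMO step, in which two Lagrange multipliers are updated analytically. This follows the standard textbook form of the SMO pair update and is not taken from the software used in this work; the function name, argument names and the regularization constant `C` are assumptions for illustration.

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, K11, K12, K22, C):
    """One analytic SMO step for the multipliers (a1, a2).
    y1, y2 are labels in {-1, +1}; E1, E2 are current prediction errors;
    K11, K12, K22 are kernel values for the two chosen training pixels."""
    # Box constraints 0 <= a <= C along the line y1*a1 + y2*a2 = const.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12          # curvature along the constraint line
    if eta <= 0.0 or low >= high:
        return a1, a2                    # no progress possible for this pair
    a2_new = a2 + y2 * (E1 - E2) / eta   # unconstrained optimum
    a2_new = min(high, max(low, a2_new)) # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new
```

Each step needs only a handful of arithmetic operations and three kernel values, in contrast to a general QP solver that works on the whole kernel matrix at once; the price is that many such steps are required before convergence.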

5.3.3 Experiment results for KPCA_SVM algorithm

The classification results obtained using the KPCA_SVM algorithm (QP optimizer) are presented in Figure 5.22. The RBF kernel function is used for classification of the different data sets using the KPCA_SVM algorithm. The following observations can be made on the basis of the k-values presented in Figure 5.22:

(i) For OD and KPCA extracted data, unpredictable behavior of the KPCA_SVM classifier is observed for all data sets, all sets of TP and different numbers of bands. The maximum k-value for OD is obtained for 200 TP with 35 bands, and for KPCA for 200 TP with 25 bands.

(ii) For the SPCA extracted data set, k-values drop to about zero after 20 bands for each set of TP. The maximum k-value obtained with SPCA is better than that obtained with OD and the KPCA extracted data set. The maximum k-value for each set of TP is obtained with five bands.

(iii) For the OSP extracted data set, the highest k-value is obtained for 200 TP. This value is higher than the k-values of the other feature modified data sets obtained for 200 TP. The reverse of this scenario is seen for the OSP modified data set with 300 TP.

(iv) One important phenomenon is observed for the KPCA_SVM algorithm. For the large set of TP (300), KPCA_SVM provides very low k-values. The best k-value is obtained for all data sets using 200 TP per class.

Figure 5.22: Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm (KPCA_SVM; panels: Original, SPCA, KPCA, OSP)

The k-values for the classification of these data sets were analyzed to select the best results for each data set. The approach was similar to that followed in previous cases. The following can be inferred from these results (Table 5.9):

(i) The best results obtained using feature modified data sets (except KPCA) were found to be significantly better than those obtained using the OD set for 200 TP. For 300 TP, OD provides statistically better results than the other feature modified data. The performance of OD, KPCA and OSP modified data is not good. However, the performance of SPCA modified data is very high for 100 TP. SPCA is observed to perform statistically better than the OD set for 100 TP per class.

(ii) The best results with the OD were obtained with 50-60 bands, while significantly better results than OD were obtained using SPCA modified data sets with five to ten features for 100 and 200 TP per class. For the OSP modified data set, a statistically better result than OD is obtained using 200 TP with eight bands.

(iii) SPCs were observed to perform significantly better than OSP for 100 and 300 TP, while OSP performs statistically better than SPCs for 200 TP. KPCs perform statistically better than OSP for 100 TP. However, the performance of KPCs for 200 and 300 TP is statistically significantly lower than that of OSP.

(iv) SPCs always perform statistically better than KPCs, and OSP performs better than SPCA only for 200 TP. It could be concluded that for 100, 200 and 300 TP, KPCA_SVM performs best with the SPCA modified data set, the OSP modified data set and OD respectively. KPCA_SVM provides low k-values compared to the SVM_QP or SVM_SMO algorithms.

Table 5.9: The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets.

TP    k1      NB*   k2      NB   k3      NB   k4      NB   Z12     Z13      Z14      Z34
100   0.7110  50    0.5150  25   0.7565  10   0.4192  8    62.93   -15.32   93.69    96.77
200   0.6736  45    0.6514  30   0.6976  5    0.7917  8    6.98    -6.64    -39.72   -21.83
300   0.7142  55    0.5109  45   0.5340  5    0.3488  8    61.15   5.10     203.10   104.90

NB* = No. of bands used to obtain best k-value

The ability of the SVM classifiers to separate different classes can be observed from Figure 5.23.

(i) The ability to distinguish the salt lake class is about the same for all SVM classifiers.

(ii) The accuracy of separation of the wheat class by the SVM_QP and SVM_SMO classifiers is about the same. However, the performance of KPCA_SVM in separating any other class (except salt lake) is very low compared to the other two classifiers.

(iii) SVM_SMO separates all other classes with slightly lower accuracy than SVM_QP.

(iv) SVM_QP is the best classifier. It has the ability to separate all classes with a high k-value.

Figure 5.23: Comparison of classification accuracy of individual classes for different SVM algorithms. WT – water, SLT – Salt Lake, HV – Hydrophobic veg, WHT – wheat, VY – Vineyards, BS – Bare soil, PL – Pasture land, BUA – Built-up area

5.3.5 Comparison of results for different SVM algorithms

The overall best results obtained by the different SVM algorithms were compared statistically to find out the best SVM classification method in terms of classification accuracy. The same was done for the time scales observed for these algorithms, in order to compare the practical applicability of these methods.

(i) From Table 5.10, it is observed that the SVM_QP method is statistically better than all other SVM algorithms for all sets of TP. The best results of SVM_QP are obtained for SPCA and PCA modified data sets for 100 and 200 TP per class. For 300 TP the best result is obtained for the SPCA modified data set. The classification time ranges from 148 seconds (for 100 TP) to 2596 seconds (300 TP). This time range is very high.

(ii) From Table 5.10, it is observed that the SVM_SMO algorithm is the second best SVM decision rule. The best k-values for SVM_SMO are obtained with 100, 200 and 300 TP using SPCA, KPCA and SPCA modified data sets respectively. The k-values obtained for 100 and 300 TP are a little lower than for SVM_QP, though the required classification time using 300 TP is about two thirds of that of SVM_QP. Although SVM_SMO needs more bands than SVM_QP to obtain the best k-values for different sets of TP, its processing time is much lower than that of SVM_QP.

(iii) KPCA_SVM is the poorest method compared with SVM_QP and SVM_SMO. The highest k-value for KPCA_SVM is obtained using the OSP modified data set with 200 TP. When the number of pixels is large, the performance of KPCA_SVM is lower.

From the above discussion, it can be concluded that SVM_QP is the best classifier with respect to accuracy. Considering both classification time and accuracy, SVM_SMO can be considered the most effective SVM classifier. The best accuracy is obtained by SVM_QP using 300 TP with the first 20 bands of the SPCA modified data set. For SVM_SMO the best accuracy is obtained using 300 TP with the first 30 bands of the SPCA modified data set.

Table 5.10: [...] classification time, and z-statistic for different SVM algorithms.

TP    k1      FEA*        Time(s)*  NB*   k2      FEA    Time(s)  NB   k3      FEA    Time(s)  NB   Z12     Z13      Z23
100   0.9408  PCA, SPCA   122.6     15    0.9304  SPCA   148.1    15   0.7565  SPCA   94.3     10   6.14    77.61    71.94
200   0.9621  PCA, SPCA   585.7     15    0.8836  KPCA   363.9    50   0.7927  OSP    262.3    8    45.16   77.51    36.4
300   0.9691  SPCA        2596.2    20    0.9446  SPCA   1694.8   30   0.7142  OD     1190.2   55   18.38   113.47   97.01

ki = best k-value for ith classifier; FEA* = Feature extraction algorithm; NB* = No. of bands used to obtain best k-value; Time(s)* = Required time to obtain best k-value, in seconds

classifiers

The best results obtained by the parametric (GML), non-parametric (KNN) and advanced (SVM) classifiers with different feature modified data sets have already been presented in Tables 5.2, 5.5 and 5.9. The best advanced classifier (SVM_QP) was chosen by statistically comparing all the advanced classifiers. A statistical comparison of the parametric, non-parametric and best advanced classifiers is carried out in order to identify the best classifier with respect to classification accuracy and time. The corresponding z-statistic is presented in Table 5.11.

The following are observed from Table 5.11:

(i) GML performs statistically better than the KNN classifier for all sets of TP. Also, the classification time of GMLC is negligible compared to that of KNNC.

(ii) GMLC performs statistically similarly to SVM_QP for 100 and 200 TP. For the large set of TP (300), the performance of the SVM_QP classifier is statistically significantly better than that of GMLC. However, the required classification time is very high for the SVM classifier.

(iii) SVM_QP provides statistically better results than KNNC for all sets of TP. From this it can be concluded that SVM_QP is the best classifier on the basis of classification accuracy. GML is ranked as the second best classifier.

(iv) It is also observed that the best results are obtained by all the classifiers using the SPCA modified data set. It is therefore concluded that SPCA is the best feature reduction technique among all the techniques for all classifiers.

(v) The processing time of GMLC is much lower than that of any other classifier. GMLC provides a slightly poorer k-value than SVM_QP for 300 TP. Considering both classification time and accuracy, it can be concluded that GMLC is the best classifier.

Table 5.11: [...] different data sets

TP    k1      FEA*   Time(s)*  NB*   k2      FEA    Time(s)  NB   k3      FEA         Time(s)  NB   Z12     Z13     Z23
100   0.9384  SPCA   60.6      20    0.8669  SPCA   661.2    10   0.9408  SPCA, PCA   122.6    15   36.82   -1.54   -38.06
200   0.9579  SPCA   64.7      30    0.9040  SPCA   1122.1   15   0.9573  SPCA, PCA   585.7    15   31.33   0.42    -30.98
300   0.9589  SPCA   82.6      45    0.9325  SPCA   1712.5   40   0.9691  SPCA        2596.2   20   16.00   -7.97   -25.37

ki = best k-value for ith classifier; FEA* = Feature extraction algorithm; NB* = No. of bands used to obtain best k-value; Time(s)* = Required time to obtain best k-value, in seconds

attributed to difference in their classification mechanisms. GML and KNN are capable of forming only simple decision boundaries, whereas SVM can form highly complex non-linear decision boundaries. In the given data, different kinds of class separability were observed for different classes. The water and salt classes were found to be easily separable from the rest of the classes. Classification accuracies of about 100% were observed for these classes with a very small number of features for all the classifiers. After these, the wheat, vineyards and bare-soil classes showed slightly lower accuracy values, which means that they are a little more difficult to separate. The lowest accuracies were observed for the pasture land, built-up area and hydrophytic vegetation classes. These classes are very poorly separated and thus complex decision boundaries would be required to separate them. For the large set of TP, SVM_QP is able to achieve higher classification accuracies than the parametric and non-parametric classifiers because the latter were not able to separate these poorly separable classes as well.

Classified maps corresponding to the best results of different classifiers are

shown in Appendix A (Figure A.1).

HD classification is a demanding task due to the characteristics and the large volume of the data. It is clear from the analysis that, depending on the availability of TP, the selection of FE techniques and classification algorithms is very important for the classification of HD. Another important aspect that should be kept in mind is that the classification and FE procedures are time-consuming. This thesis work has pointed out some important guidelines for the classification of HD (Table 5.12).

(i) When only statistically insufficient TP are available, it is suggested to apply the SVM_QP algorithm with the OSP FE technique. This will provide high classification accuracy in minimum time.

(ii) It is strongly recommended to apply GML on the SPCA modified data set to achieve very high accuracy in very little time for statistically exact and statistically sufficient training data sets.

(iii) For a statistically large training data set, high accuracy could be achieved by implementing SVM_QP on the SPCA modified data set. Nevertheless, this method will take a very large processing time. It is therefore strongly recommended to apply GML on the SPCA modified data set: although the achieved classification accuracy is a little lower than that of SVM_QP, the processing time is negligible compared to SVM_QP. SVM_SMO could also be used for a large set of TP on the SPCA modified data set.

(iv) Among all the popular FE techniques for HD, SPCA is the most effective FE technique, and it could be used to achieve high classification accuracy for HD with all classification techniques.

Table 5.12: [...] accuracy and time. (Rank 1 indicates the best)

      Parametric    Non-parametric   Advanced
TP    GML   FEA     KNN   FEA        SVM_QP   FEA          SVM_SMO   FEA     KPCA_SVM   FEA
25    2     SPCA    3     KPCA       1        SPCA, OSP    1         SPCA    4
100   1     SPCA    3     SPCA       1        PCA, SPCA    2         SPCA    4          SPCA
200   1     SPCA    2     SPCA       1        PCA, SPCA    3         KPCA    4          OSP
300   2     SPCA    4     SPCA       1        SPCA         3         SPCA    5          OD

Ranking depending on accuracy & time

      Parametric    Non-parametric   Advanced
TP    GML   FEA     KNN   FEA        SVM_QP   FEA          SVM_SMO   FEA     KPCA_SVM   FEA
25    2     SPCA    3     SPCA       1        OSP          1         SPCA    4          SPCA
100   1     SPCA    4     SPCA       2        PCA, SPCA    3         SPCA    5          SPCA
200   1     SPCA    4     SPCA       2        PCA, SPCA    3         KPCA    5          OSP
300   1     SPCA    4     SPCA       3        SPCA         2         SPCA    5          OD

CHAPTER 6

SUMMARY OF RESULTS AND CONCLUSIONS

This chapter mainly aims to summarize the conclusions corresponding to the main objectives as defined in the first chapter. It also suggests some areas and methods for further research in future.

6.1 Summary of results

This research work is the extension of the work done by Abhinav (2009). For

this research work, DAIS 7915 hyperspectral sensor data was used for testing

different FE techniques and classification algorithms. The best results obtained by

these experiments were compared with those obtained by Abhinav (2009). Based on

the conclusions from the literature survey and recommendations for future work by

Abhinav (2009), several FE (SPCA, KPCA, OSP, PP) and classification algorithms

(KNN, GML, SVM based classifiers) have been tested to achieve the objectives as

mentioned in section 1.4.

For the parametric classifier (GML), experiments were performed on the different feature extracted data sets mentioned above. The best result obtained from the experiments was compared with the best result obtained by Abhinav (2009) to observe the improvement. For the non-parametric classifier (KNN), the first experiment was performed with OD. The algorithm was then applied to the different feature modified data. The best results for OD and feature extracted data were compared to obtain the best result for the non-parametric classifier. For the advanced classifiers (SVM_QP, SVM_SMO and KPCA_SVM), experiments were performed on OD as well as on feature modified data sets. For SVM_QP, as for GML, the best result was also compared with the best result obtained by Abhinav (2009). The best results of the different SVM classifiers were examined to obtain the best SVM algorithm.

Lastly, the best results for the parametric, non-parametric and advanced classifiers were compared to find out the best classifier for HD. All the comparisons were performed by one-tailed hypothesis testing at the 5% significance level.

Classification experiments were performed using four FE techniques, namely SPCA,
KPCA, OSP and PP. Statistical analysis of the classification results obtained with
these feature modified data sets shows that, among the four techniques, the SPCA
modified data set provides the best results. These results were also compared with
the best classification results obtained by Abhinav (2009) using different FE
techniques. SPCA performs better because it uses local rather than global statistics.
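
The local statistics used by SPCA come from treating groups of bands separately. The sketch below shows only a possible segmentation step: contiguous bands are grouped as long as the correlation between adjacent bands stays above a threshold, and PCA would then be applied to each group on its own. The threshold rule, the function names and the toy data are assumptions for illustration and are not the segmentation procedure used in this thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def segment_bands(pixels, threshold=0.9):
    """Split band indices into contiguous segments.

    pixels: list of pixel spectra, each a list with one value per band.
    A new segment starts whenever the absolute correlation between a band
    and the previous band falls below `threshold` (a hypothetical rule).
    """
    n_bands = len(pixels[0])
    # Rearrange pixel-major data into one list of values per band.
    bands = [[p[b] for p in pixels] for b in range(n_bands)]
    segments = [[0]]
    for b in range(1, n_bands):
        if abs(pearson(bands[b - 1], bands[b])) >= threshold:
            segments[-1].append(b)
        else:
            segments.append([b])
    return segments


# Toy data: 4 pixels, 4 bands. Bands 0-1 and bands 2-3 are perfectly correlated.
toy_pixels = [[1, 2, 5, 10], [2, 4, 1, 2], [3, 6, 4, 8], [4, 8, 2, 4]]
print(segment_bands(toy_pixels))  # [[0, 1], [2, 3]]
```

Each resulting segment would be transformed by its own PCA, so the principal components reflect the covariance within that group of bands instead of the covariance of the full spectrum.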

Analysis of the results of the different classifiers shows that the results obtained
from the PCA modified data set sometimes compete with those obtained from the
SPCA modified data set. In general, the classifiers give their best results with 15 to
30 bands of the SPCA or PCA modified data sets, which effectively reduces the
classification time. OSP and PP produce data sets of very low dimensionality and
therefore fail to produce satisfactory results. However, the results obtained with eight
bands of the OSP modified data set are reasonably good, though they are not always
statistically significantly better than those of the SPCA or PCA modified data sets.
The result could possibly be improved by increasing the dimension of the OSP
modified data set, that is, by extracting a larger number of endmembers. For the
KPCA modified data set, performance was consistently poor. KPCA can produce
satisfactory results if the dimension is increased, but this also increases the
classification time proportionally. Therefore, KPCA is not considered an effective
FE technique.

From the experiments with the parametric classifier (GML), it was observed that the
performance of GML improved significantly after applying FE techniques.
Comparison with the best result obtained by Abhinav (2009) showed that SPCA works
best among all the available FE techniques in improving the classification accuracy
of GML.

Moving on to the non-parametric classifier, it is observed that the result of the KNN
classifier depends on the number of bands and the number of neighbors chosen. The
best results were selected for KNN with and without FE techniques, and it was found
that the result of KNN was enhanced by the PCA and SPCA techniques, while the
supervised FE techniques like KPCA and OSP failed to do so.
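
The dependence on the number of neighbors can be seen in a minimal brute-force KNN sketch. Euclidean distance, majority voting, the function name and the two-band toy pixels are assumptions for illustration; they do not reproduce the implementation or data used in this thesis.

```python
from collections import Counter
from math import dist


def knn_classify(train_pixels, train_labels, pixel, k=3):
    """Label `pixel` by majority vote among its k nearest training pixels."""
    # Sort training indices by Euclidean distance to the query pixel.
    order = sorted(range(len(train_pixels)),
                   key=lambda i: dist(train_pixels[i], pixel))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-band training pixels.
train_pixels = [(0.1, 0.1), (0.2, 0.1), (0.35, 0.3), (0.9, 0.8), (0.8, 0.9)]
train_labels = ["water", "water", "soil", "soil", "soil"]
query = (0.3, 0.3)

print(knn_classify(train_pixels, train_labels, query, k=1))  # soil
print(knn_classify(train_pixels, train_labels, query, k=3))  # water
```

The same pixel is labelled differently for k = 1 and k = 3, which is why the number of neighbors has to be tuned together with the number of bands.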

The SVM algorithm was selected as the advanced classifier. It is based on statistical
learning theory and is expected to produce consistent and optimal results compared
with the parametric and non-parametric classifiers. Different SVM algorithms
(SVM_QP, SVM_SMO and KPCA_SVM) were tested to this end. For the SVM based
classifiers, it was observed that the dimension of the data set and the choice of
optimizer significantly affect the results. The best result of SVM_QP was achieved
with the SPCA feature extracted data set with 20 bands. The classification result of
the advanced classifier also improved on the best result obtained by Abhinav (2009),
who obtained his best result using PCA modified data sets; using the SPCA modified
data set improved it further. This indicates that, with suitably selected FE techniques,
the classification results of the advanced classifier can be improved further. It was
observed that supervised FE techniques like KPCA and OSP could not improve the
result of SVM, while the unsupervised FE technique (SPCA) did. On the other hand,
the best results of SVM_SMO and KPCA_SVM were obtained using the SPCA and
OSP modified data sets respectively. Comparing the best results of the different SVM
algorithms, SVM_QP is concluded to be the best SVM classifier.

On comparing the best results of the SVM classifiers with the best results of the
parametric and non-parametric classifiers, it was found that the advanced classifier
performs significantly better for both the original and the feature extracted data sets.
The reason for the better performance of this classifier is the improved separation of
a few classes that show poor k-values when parametric or non-parametric classifiers
are used. This observation is expected because of the difference in how the decision
boundary is formed. The decision boundaries formed by the parametric and
non-parametric classifiers are simpler, so these classifiers are unable to separate the
poor classes efficiently. The advanced classifier can form complex, nonlinear
decision boundaries, which help it to separate the poor classes.
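
How a kernel SVM produces a nonlinear boundary can be sketched by evaluating its decision function, f(x) = sum_i alpha_i y_i K(x_i, x) + b, directly. The RBF kernel, the hand-picked support vectors and coefficients, and the function names below are assumptions for illustration; the values are not the output of any trained model from this thesis.

```python
from math import exp


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision_value(x, support_vectors, alphas, labels, bias, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    The sign of the returned value gives the predicted class.
    """
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias


# Hand-picked one-dimensional example: class +1 sits between two groups
# of class -1, so no single threshold (linear boundary) separates them.
support_vectors = [(-1.0,), (0.0,), (1.0,)]
labels = [-1, +1, -1]
alphas = [1.0, 2.0, 1.0]  # chosen so that sum(alpha_i * y_i) == 0
bias = 0.0

for x in [(-1.0,), (0.0,), (1.0,), (2.0,)]:
    value = svm_decision_value(x, support_vectors, alphas, labels, bias)
    print(x, "+1" if value > 0 else "-1")
```

The decision value is positive only near x = 0 and negative on both sides, a region that a single linear boundary in the input space cannot describe.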

Compared with the parametric classifier, SVM required more computation time and
memory. In spite of these difficulties, the advanced classifier showed a significant
improvement over the parametric and non-parametric classifiers. This strongly
suggests that SVM can reduce the difficulties of HD classification.

6.2 Conclusions

Based on these results, the following conclusions are drawn:

technique followed by PCA. In addition, orthogonal subspace projection (OSP) can be
considered an effective FE technique if its dimension can be increased.

2. Although the advanced classifiers need more processing time, they reduce the
problems associated with the classification of HD much better than the parametric
or non-parametric classifiers. For statistically exact and sufficient sets of TP, the
performance of SVM_QP is not statistically better than that of the parametric
classifier. For large sets of TP, SVM_QP produces statistically better results than
all other classifiers. In addition, the SPCA FE technique was found to increase the
accuracy significantly for all of the advanced, parametric and non-parametric
classifiers.

During the literature survey, some additional methods were found that are not
included in this thesis. These appear to offer scope for improving the accuracy and
computation time of the advanced classifiers presented in this thesis. The following
methods are recommended for future work:

(i). In this thesis the high memory and computation time required by SVM methods
were reduced a little by using different optimizers and algorithms. There is still
scope to reduce the computation time of the SVM algorithm by using the
Lagrangian SVM algorithm (Mangasarian and Musicant, 2000). This requires
further testing. In addition, optimization techniques such as Kernel Adatron
(Bennett and Campbell, 200) and Successive Overrelaxation (SOR) (Mangasarian
and Musicant, 1998) should also be tested, as they may reduce the computation
time significantly.

(ii). Moreover, for large sets of TP the KPCA method takes a long time. Lima and
Zen (2005) suggested a method called Sparse KPCA, which may reduce the
computation time. This needs to be tested.

(iii). KNN was found to require high computation time in this thesis, because a
large number of computations is required to classify a single pixel. For large
data sets this cost grows with the number of training pixels. A hash-table
approach could be applied to reduce it, since fewer computations would then be
needed per pixel.
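
Which hash-table approach is intended is not specified here. One simple possibility, sketched below purely as an assumption, is to hash training pixels into grid cells so that a query is compared only with the training pixels in its own cell; this makes the search approximate. The cell size, the fallback to a full search and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from math import dist, floor


def build_grid_index(training, cell_size):
    """Hash each (pixel, label) pair into a bucket keyed by its grid cell."""
    index = defaultdict(list)
    for pixel, label in training:
        key = tuple(floor(v / cell_size) for v in pixel)
        index[key].append((pixel, label))
    return index


def hashed_knn_classify(index, training, pixel, cell_size, k=3):
    """Approximate KNN that searches only the query's own bucket.

    Falls back to a brute-force search over all training pixels when the
    bucket holds fewer than k candidates.
    """
    key = tuple(floor(v / cell_size) for v in pixel)
    candidates = index.get(key, [])
    if len(candidates) < k:
        candidates = training
    nearest = sorted(candidates, key=lambda item: dist(item[0], pixel))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical two-band training pixels.
training = [((0.10, 0.10), "water"), ((0.15, 0.12), "water"), ((0.12, 0.18), "water"),
            ((0.80, 0.85), "soil"), ((0.85, 0.82), "soil"), ((0.88, 0.90), "soil")]
index = build_grid_index(training, cell_size=0.5)

print(hashed_knn_classify(index, training, (0.2, 0.2), cell_size=0.5))  # water
print(hashed_knn_classify(index, training, (0.9, 0.8), cell_size=0.5))  # soil
```

Because neighbors in adjacent cells are ignored, the result can differ from exact KNN near cell borders, so any saving in computation would need to be weighed against a possible loss of accuracy.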

REFERENCES

Barros, A. S. and Rutledge, D. N. (2005) ‘Segmented principal component
transform–principal component analysis,’ Chemometrics and Intelligent Laboratory
Systems, Vol. 78, pp. 125-137.

populations defined by probability distributions,’ Bulletin of Calcutta Mathematical

Society, Vol. 35, pp. 99-109.

Ben-Dor, E., Patkin K., Banin A. and Karnieli, A. (2002) ‘Mapping of several soil

properties using DAIS-7915 hyperspectral scanner data – a case study over clayey

soils in Israel,’ International Journal of Remote Sensing, Vol. 23, No. 6, pp. 1043-

1062.

mineral assemblages associated with gold mineralization in the Central Pilbara,

Western Australia,’ Economic Geology and the Bulletin of the Society of Economic

Geologists, Vol. 97, No. 4, pp. 819-826.

Boser, B. E., Guyon, I. M. and Vapnik, V. N. (1992) ‘A training algorithm for optimal
margin classifiers,’ Proceedings of the 5th Annual Workshop on Computational
Learning Theory, ACM New York, NY, USA, pp. 144-152.

Technical Report, Vol. 9, No. CS-96, Department of Computer Science, University of

Sheffield.

Cha, G. H. (2005) ‘Kernel principal component analysis for content based image

retrieval’, PAKDD 2005, LNAI 3518, pp. 844 – 849, Springer-Verlag Berlin

Heidelberg.

interference rejection approach to target detection and classification for hyperspectral

imagery,’ Opt. Eng., Vol. 37, pp. 735–743.

comprehensive study and analysis’, IEEE Transactions on Geoscience and

Remote Sensing, Vol. 43, No. 3.

machines and other kernel-based learning methods, Cambridge University Press,

Cambridge, UK.

Congalton, R. G. (1991) ‘A review of assessing the accuracy of classifications of
remotely sensed data,’ Remote Sensing of Environment, Elsevier Science (pub.),
Vol. 37, No. 1, pp. 35-46.

Transactions Information Theory, Vol. IT-13, No. 1, pp. 21–27.

procedure applied to AVIRIS data,’ IEEE Transactions on Geoscience and Remote

Sensing, Vol. 27, No. 5, pp. 620-628.

techniques’, IEEE Computer Society Press, Los Alamitos, CA

Englewood Cliffs, New Jersey.

classifier for the analysis of hyperspectral data,’ IEEE Transactions on Geoscience

and Remote Sensing, Vol. 42, No. 1, pp. 271-277.

statistical association, 82, 249-266.

(edt.), II edn., Academic Press, Inc., San Diego, USA.

M. Tech Thesis, Indian Institute of Technology, Kanpur.

dimensionality reduction: An orthogonal subspace projection,’ IEEE Transactions on

Geoscience and Remote Sensing, Vol. 32, pp. 779–785.

in hyperspectral image sequences, Ph.D. dissertation, Dept. Elect. Eng., Univ.

Maryland Baltimore County, Baltimore, MD.

Hughes, G. (1968) ‘On the mean accuracy of statistical pattern recognizers,’ IEEE

Transactions on Information Theory, Vol. IT-14, No. 1, pp. 55-63.

Hwang, W. J. and Wen, K.W. (1998) ‘Fast KNN classification algorithm based on

partial distance search’, Electronics Letters, Vol. 34, No. 21.

Hwang, J., Lay, S., and Lippman, A. (1994), ‘Nonparametric multivariate density

estimation: A comparative study,’ IEEE Transactions Signal Processing, Vol.42, No.

10, pp. 2795-2810.

analysis with projection pursuit’ IEEE Transactions on Geoscience and

Remote Sensing, Vol. 38, No. 6.

Jia, X. (1996) Classification techniques for hyperspectral remote sensing data, Ph. D.

Thesis, University of Canberra.

Jones, M. C., and Sibson, R. (1987) ‘What is projection pursuit?’, Journal of the

Royal Statistical Society, Ser. A, 150, 1-38.

dimensional space: Geometrical, statistical and asymptotic properties of multivariate

data,’ IEEE Transactions Systems, Man and Cybernetics - Part C: Applications and

Reviews, Vol. 28, No. 1, pp. 39-54.

Kim, K. I., Franz, F. O., and Scholkopf, B. (2005) ‘Iterative Kernel principal

component analysis for image modeling’, IEEE Transactions on Pattern Analysis and

Machine Intelligence, Vol. 27, No. 9.

classification of hyperspectral data’, MICAI 2008, LNAI 5317, pp. 360 – 370,

Springer-Verlag Berlin Heidelberg.

search for spatial network databases’. Proceedings of the 30th VLDB

Conference,Toronto, Canada, 2004.

theory’, Taiwan.

study,’ LARS Information Note, Vol. 21171.

Wesley, Menlo Park, California

analysis (KPCA) by utilizing two dimensional wavelet compression: applications to

spectroscopic imaging’, Wiley Inter Science.

Matlab, Chapman and Hall /CRC

Mercer, J. (1909) ‘Functions of positive and negative type, and their connection with
the theory of integral equations,’ Philosophical Transactions of the Royal Society of
London, Series A, Vol. 209, pp. 415-446.

Kaufmann Publishers Inc., San Mateo, CA.

comparative study, Ph. D. Thesis, University of Nottingham.

Classifier: kNN, Naïve Bayes and C4.5’. B. Kégl and G. Lapalme (Eds.): AI 2005,

LNAI 3501, pp. 268 – 279, 2005., Springer-Verlag Berlin Heidelberg

Ping, X., Guo, G., and Chen, G. (2006) A fast document classification algorithm

based on improved KNN, IEEE Transaction.

Computational and Graphical Statistics, Vol. 4, No. 2 (June, 1995), pp. 83- 100.

introduction, IV edn., Springer, Berlin.

based on independent component analysis,’ Proceedings of SPIE: Automatic Target

Recognition XII, SPIE-International Society for Optical Engineering, Vol. 4726, pp.

173-182.

PCA, Statistical Machine Learning, National ICT Australia.

recognition, regression, approximation, and operator inversion’, GMD Technical

Report: 1064.

Technical Report No. UCB/EECS-2009-94.

Vapnik V. N. (1998) Statistical learning theory. John Wiley and Sons, NY.

remotely sensed hyperspectral data, Springer, NY.

Journal of the American Statistical Association, Vol. 85, No. 411, pp. 664-675.

Welling, M. ‘Kernel principal component analysis’, Department of Computer Science,
University of Toronto.

Zhu, B., Jiang, L., Jin, F., Qin, L.,Vogel, A., and Tao, Y. (2007) ‘Walnut shell and

meat differentiation using fluorescence hyperspectral imagery with ICA-KNN optimal

wavelength selection’, Sens. & Instrumen. Food Qual. (2007) 1:123–131 DOI

10.1007/s11694-007-9015-z, Springer Science+Business Media, LLC 2007

APPENDIX A

Figure A.1: Classified maps corresponding to the best results of different classifiers
(panels: GML, KNN, SVM_QP, SVM_SMO and KPCA_SVM, with a common legend)
