
THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION

By
KYU-HWA JEONG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL


OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007

© 2007 Kyu-Hwa Jeong

I dedicate this to my parents and family

ACKNOWLEDGMENTS
First of all, I thank my advisor, Dr. Jose C. Principe, for his great inspiration,
support and guidance over my graduate studies. I was impressed by Dr. Principe's active
thinking and appreciated very much his supervision, which gave me many opportunities to
explore my research.
I am also grateful to all the members of my advisory committee: Dr. John G. Harris,
Dr. K. Clint Slatton and Dr. Murali Rao for their valuable time and interest in serving on
my supervisory committee, as well as their comments, which helped improve the quality of
this dissertation.
I also express my appreciation to all my CNEL colleagues, especially the ITL group
members Jianwu Xu, Puskal Pokharel, Seungju Han, Sudhir Rao, Antonio Paiva,
and Weifeng Liu, for their help, collaboration and valuable discussions during my PhD
study.
Finally, I express my great love for my wife, Inyoung and our two lovely sons, Hoyeon
(Luke) and Seungyeon (Justin). I thank Inyoung for her love, caring, and patience, which
made this study possible. Also I am grateful to my parents for their great support for my
life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT ....................................................................... 11

CHAPTER

1   INTRODUCTION ............................................................... 13
    1.1  Background ............................................................ 13
    1.2  Motivation ............................................................ 18

2   FUNDAMENTAL DISTORTION INVARIANT LINEAR CORRELATION FILTERS ................ 21
    2.1  Introduction .......................................................... 21
    2.2  Synthetic Discriminant Function (SDF) ................................. 21
    2.3  Minimum Average Correlation Energy (MACE) Filter ...................... 23
    2.4  Optimal Trade-off Synthetic Discriminant (OTSDF) Function ............. 25

3   KERNEL-BASED CORRELATION FILTERS ........................................... 27
    3.1  Brief review on Kernel Method ......................................... 27
         3.1.1  Introduction .................................................... 27
         3.1.2  Kernel Method ................................................... 27
    3.2  Kernel Synthetic Discriminant Function ................................ 30
    3.3  Application of the Kernel SDF to Face Recognition ..................... 31
         3.3.1  Problem Description ............................................. 31
         3.3.2  Simulation Results .............................................. 32

4   A RKHS PERSPECTIVE OF THE MACE FILTER ...................................... 36
    4.1  Introduction .......................................................... 36
    4.2  Reproducing Kernel Hilbert Space (RKHS) ............................... 36
    4.3  Interpretation of the MACE filter in the RKHS ......................... 39

5   NONLINEAR VERSION OF THE MACE IN A NEW RKHS:
    THE CORRENTROPY MACE (CMACE) FILTER ........................................ 41
    5.1  Correntropy Function .................................................. 41
         5.1.1  Definition ...................................................... 41
         5.1.2  Some Properties ................................................. 42
    5.2  The Correntropy MACE Filter ........................................... 46
    5.3  Implications of the CMACE Filter in the VRKHS ......................... 50
         5.3.1  Implication of Nonlinearity ..................................... 50
         5.3.2  Finite Dimensional Feature Space ................................ 51
         5.3.3  The kernel correlation filter vs. the CMACE filter:
                Prewhitening in Feature Space ................................... 51

6   THE CORRENTROPY MACE IMPLEMENTATION ........................................ 53
    6.1  The Output of the CMACE Filter ........................................ 53
    6.2  Centering of the CMACE in Feature Space ............................... 54
    6.3  The Fast CMACE Filter ................................................. 56
         6.3.1  The Fast Gauss Transform ........................................ 57
         6.3.2  The Fast Correntropy MACE Filter ................................ 57

7   APPLICATIONS OF THE CMACE TO IMAGE RECOGNITION ............................. 60
    7.1  Face Recognition ...................................................... 60
         7.1.1  Problem Description ............................................. 60
         7.1.2  Simulation Results .............................................. 60
    7.2  Synthetic Aperture Radar (SAR) Image Recognition ...................... 64
         7.2.1  Problem Description ............................................. 64
         7.2.2  Aspect Angle Distortion Case .................................... 65
         7.2.3  Depression Angle Distortion Case ................................ 71
         7.2.4  The Fast Correntropy MACE Results ............................... 75
         7.2.5  The effect of additive noise .................................... 75

8   DIMENSIONALITY REDUCTION WITH RANDOM PROJECTIONS ........................... 79
    8.1  Introduction .......................................................... 79
    8.2  Motivation ............................................................ 80
    8.3  PCA and SVD ........................................................... 81
    8.4  Random Projections .................................................... 81
         8.4.1  Random Matrices ................................................. 83
         8.4.2  Orthogonality and Similarity Properties ......................... 83
    8.5  Simulations ........................................................... 85

9   CONCLUSIONS AND FUTURE WORK ................................................ 93
    9.1  Conclusions ........................................................... 93
    9.2  Future Work ........................................................... 95

APPENDIX

A   CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIERS ......................... 98

B   THE PROOF OF PROPERTY 5 OF CORRENTROPY .................................... 100

C   THE PROOF OF A SHIFT-INVARIANT PROPERTY OF THE CMACE ...................... 101

D   COMPUTATIONAL COMPLEXITY OF THE MACE AND CMACE ............................ 102

E   THE CORRENTROPY-BASED ROBUST NONLINEAR BEAMFORMER ......................... 103
    E.1  Introduction ......................................................... 103
    E.2  Standard Beamforming Problem ......................................... 104
         E.2.1  Problem ........................................................ 104
         E.2.2  Minimum Variance Beamforming ................................... 105
         E.2.3  Kernel-based beamforming ....................................... 106
    E.3  Nonlinear Beamformer using Correntropy ............................... 107
    E.4  Simulations .......................................................... 109
    E.5  Conclusions .......................................................... 111

LIST OF REFERENCES ............................................................ 117

BIOGRAPHICAL SKETCH ........................................................... 123

LIST OF TABLES

6-1  Estimated computational complexity for training with N images and testing
     with one image. Matrix inversion and multiplication are considered ........... 59
7-1  Comparison of standard deviations of all the Monte-Carlo simulation outputs .. 63
7-2  Comparison of ROC areas with different kernel sizes .......................... 64
7-3  Case A: Comparison of ROC areas with different kernel sizes .................. 72
7-4  Comparison of computation time and error for one test image between the
     direct method (CMACE) and the FGT method (Fast CMACE) with p = 4 and kc = 4 .. 75
7-5  Comparison of computation time and error for one test image in the FGT
     method with a different number of orders and clusters ........................ 76
8-1  Comparison of the memory and computation time between the original CMACE
     and CMACE-RP ................................................................. 92

LIST OF FIGURES

1-1   Block diagram for pattern recognition ....................................... 14
1-2   Block diagram for image recognition process using correlation filter ........ 15
2-1   Example of the correlation output plane of the SDF .......................... 22
2-2   Example of the correlation output plane of the MACE ......................... 24
2-3   Example of the correlation output plane of the OTSDF ........................ 25
3-1   The example of kernel method ................................................ 28
3-2   Sample images ............................................................... 32
3-3   The output peak values when only 3 images are used for training ............. 33
3-4   The comparison of ROC curves with different number of training images ....... 34
3-5   The output values of noisy test input images with additive Gaussian noise
      when 25 images are used for training ........................................ 35
3-6   The ROC curves of noisy test input images with different SNRs when 10
      images are used for training ................................................ 35
5-1   Contours of CIM(X,0) in 2D sample space ..................................... 44
7-1   The averaged test output peak values ........................................ 61
7-2   The test output peak values with additive Gaussian noise .................... 61
7-3   The comparison of ROC curves with different SNRs ............................ 62
7-4   The comparison of standard deviation of 100 Monte-Carlo simulation outputs
      of each noisy false class test images ....................................... 64
7-5   Case A: Sample SAR images (64x64 pixels) of two vehicle types for a target
      chip (BTR60) and a confuser (T62) ........................................... 66
7-6   Case A: Peak output responses of testing images for a target chip and a
      confuser .................................................................... 67
7-7   Case A: ROC curves with different numbers of training images ................ 67
7-8   Case A: The MACE output plane vs. the CMACE output plane .................... 69
7-9   Sample images of BTR60 of size (64x64) pixels ............................... 70
7-10  Case A: Output planes with shifted true class input image ................... 70
7-11  The ROC comparison with different kernel sizes .............................. 72
7-12  Case B: Sample SAR images (64x64 pixels) of two vehicle types for a target
      chip (2S1) and a confuser (T62) ............................................. 73
7-13  Case B: Peak output responses of testing images for a target chip and a
      confuser .................................................................... 74
7-14  Case B: ROC curves ......................................................... 74
7-15  Comparison of ROC curves between the direct and the FGT method in case A .... 76
7-16  Sample SAR images (64x64 pixels) of BTR60 ................................... 77
7-17  ROC comparisons with noisy test images (SNR=7dB) in case A .................. 77
8-1   The comparison of ROC areas with different RP dimensionality ................ 85
8-2   The comparison of ROC areas with different RP dimensionality ................ 86
8-3   ROC comparison with different dimensionality reduction methods for MACE
      and CMACE ................................................................... 87
8-4   ROC comparison with PCA for MACE and CMACE .................................. 89
8-5   Sample images of size 16x16 after RP ........................................ 90
8-6   The cross correlation vs. cross correntropy ................................. 91
8-7   Correlation output planes vs. correntropy output planes after dimension
      reduction with random projection ............................................ 92
E-1   Comparisons of the beampattern for three beamformers in Gaussian noise
      with 10dB of SNR ........................................................... 113
E-2   Comparisons of BER for three beamformers in Gaussian noise with different
      SNRs ....................................................................... 114
E-3   Comparisons of BER for three beamformers with different characteristic
      exponent levels ............................................................ 115
E-4   Comparisons of the beampattern for three beamformers in non-Gaussian noise . 116

Abstract of Dissertation Presented to the Graduate School


of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
THE CORRENTROPY MACE FILTER FOR IMAGE RECOGNITION
By
Kyu-Hwa Jeong
August 2007
Chair: Jose C. Principe
Major: Electrical and Computer Engineering
The major goal of my research was to develop nonlinear methods of the family of
distortion invariant filters, specifically the minimum average correlation energy (MACE)
filter. The minimum average correlation energy (MACE) filter is a well known correlation
filter for pattern recognition. My research investigated a closed form solution of the
nonlinear version of the MACE filter using the recently introduced correntropy function.
Correntropy is a positive definite function that generalizes the concept of correlation
by utilizing higher order moments of the signal statistics. Because of its positive definite
nature, correntropy induces a new reproducing kernel Hilbert space (RKHS). Taking
advantage of the linear structure of the RKHS, it is possible to formulate and solve the
MACE filter equations in the RKHS induced by correntropy. Due to the nonlinear relation
between the feature space and the input space, the correntropy MACE (CMACE) can
potentially improve upon the MACE performance while preserving the shift-invariant
property (additional computation for all shifts will be required in the CMACE).
To alleviate the computation complexity of the solution, my research also presents
the fast CMACE using the Fast Gauss Transform (FGT). Both the MACE and CMACE
are basically memory-based algorithms and due to the high dimensionality of the image
data, the computational cost of the CMACE filter is one of the critical issues in practical
applications. Therefore, my research also used a dimensionality reduction method based
on random projections (RP), which has emerged as a powerful method for dimensionality
reduction in machine learning.
We applied the CMACE filter to face recognition using facial expression data and the
MSTAR public release Synthetic Aperture Radar (SAR) data set, and experimental results
show that the proposed CMACE filter indeed outperforms the traditional linear MACE
and the kernelized MACE in both generalization and rejection abilities. In addition,
simulation results in face recognition show that the CMACE filter with random projection
(CMACE-RP) also outperforms the traditional linear MACE with small degradation in
performance, but great savings in storage and computational complexity.


CHAPTER 1
INTRODUCTION
1.1 Background

The goal of pattern recognition is to detect and assign an observation into one of
multiple classes to be recognized. The observation can be a speech signal, an image or a
higher-dimensional object. In general there are two broad classes of classification problems:
open and closed sets. Most classification problems deal with closed sets, which means
that we have prior information about all of the classes and classify among those given classes.
In an open classification problem, we only have prior information about one specific class
and no prior information about the out-of-class data, which can be anything in the universe.
The object recognition problem addressed in this research is an open set problem; therefore,
the method that we use is based on one class versus the universe.
There are many applications of object recognition. In automatic target recognition,
the goal is to quickly and automatically detect and classify objects that may be present
within large amounts of data (typically imagery) with a minimum of human intervention,
for example vehicle vs. non-vehicle, tanks vs. trucks, or one type of tank vs. another.
Another emerging pattern recognition application is biometrics, such as face, iris and
fingerprint recognition for human identification and verification. Biometric technology is
rapidly being adopted in a wide variety of security applications such as computer and
physical access control, electronic commerce, homeland security, and defense.
Figure 1-1 shows the block diagram of the common pattern recognition process. In
the preprocessing block, denoising, normalization, edge detection, pose estimation, etc.
are performed for each application.
Feature extraction involves simplifying the amount of resources required to describe a
large set of data accurately. When performing analysis of complex data one of the major
problems stems from the number of variables involved. Analysis with a large number
of variables generally requires a large amount of memory and computation power, or a
classification algorithm which overfits the training sample and generalizes poorly to new
samples. Feature extraction is a general term for methods of constructing combinations of
the variables to get around these problems while still describing the data with sufficient
accuracy.
The goal of classification is to assign the features derived from the input pattern to
one of the classes. There are a variety of classifiers including statistical classifiers, artificial
neural networks, support vector machine (SVM), and so on.
Another important pattern recognition methodology is to use the training data
directly instead of extracting some features and performing classification based on those
features. While feature extraction works well in many pattern recognition applications, it
is not always easy for humans to identify what the good features may be.
Correlation filters have been applied successfully to automatic target detection
and recognition (ATR) [1] for SAR image [2],[3],[4] and biometric identification such as
face, iris and fingerprint recognition [5],[6], by virtue of their shift-invariant property[7],
which means that if the test image contains the reference object at a shifted location, the
correlation output is also shifted by exactly the same amount. Due to this property, there
is no need to conduct additional process of centering the input image prior to recognizing
it. Also, in some ATR applications, it is desirable not only to recognize various targets
but also to locate them with some degree of accuracy, and the location can easily be found
by searching for the peak of the correlation output. Another advantage of correlation filters
is that they are linear, and therefore the solution can be computed analytically.

Figure 1-1. Block diagram for pattern recognition.


Figure 1-2 depicts the simple block diagram for image recognition process using
correlation filters. Object recognition can be performed by cross-correlating an input
image with a synthesized template (filter) and the correlation output is searched for the
peak, which is used to determine whether the object of interest is present or not.
It is well known that matched filters are the optimal linear filters for signal detection
under linear channel and white noise conditions [8][9]. Matched spatial filters (MSF) are
optimal in the sense that they provide the maximum output signal to noise ratio (SNR)
for the detection of a known image in the presence of white noise, under the reasonable
assumption of Gaussian statistics [10]. However, the performance of the MSF is very
sensitive to even small changes in the reference image and the MSF cannot be used for
multiclass pattern recognition since it is only optimum for a single image. Therefore
distortion invariant composite filters have been proposed in various papers [1].
Distortion invariant composite filters are a generalization of matched spatial filtering
for the detection of a single object to the detection of a class of objects, usually in the
image domain. Typically the object class is represented by a set of training exemplars.
The exemplar images represent the image class through the entire range of distortions of
a single object. The goal is to design a single filter which will recognize an object class
through the entire range of distortion. Under the design criterion the filter is equally

Figure 1-2. Block diagram for image recognition process using correlation filter.


matched to the entire range of distortion, as opposed to a single viewpoint as in a matched
filter.
The most well known of such composite correlation filters belong to the synthetic
discriminant function (SDF) class [11] and its variations. One of the appeals of the SDF
class is that it can be computed analytically and effectively using frequency domain
techniques. In the conventional SDF approach, the filter is matched to a composite
template that is a linear combination of the training image vectors such that the cross
correlation output at the origin has the same value with all training images. The hope
is that this composite template will correlate equally well not only with the training
images but also with distorted versions of the training images, as well as with test images
in the same class. One of the problems with the original SDF filters is that because
only the origin of the correlation plane is constrained, it is quite possible that some
other value in the correlation plane is higher than this value at the origin even when the
input is centered at the origin. Since the processing of resulting correlation outputs is
based on detecting peaks, we can expect a high probability of false peaks and resulting
misclassifications in such situations.
The minimum variance SDF (MVSDF) filter has been proposed in [12], taking additive
input noise into consideration. The MVSDF minimizes the output variance due to
zero-mean input noise while satisfying the same linear constraints of the SDF. One of the
major difficulties in MVSDF is that the noise covariance is not known exactly; even when
known, an inversion is required and it may be computationally demanding.
Another correlation filter that is widely used is the minimum average correlation
energy (MACE) filter [13]. The MACE minimizes the average correlation energy of the
output over the training images to produce a sharp correlation peak subject to the same
linear constraints as the MVSDF and SDF filters. In practice, the MACE filter performs
better than the MVSDF with respect to out-of-class rejection. The MACE filter however,
has been shown to have poor generalization properties, that is, images in the recognition


class but not in the training exemplar set are not recognized well. The MACE filter is
generally known to be sensitive to distortions but readily able to suppress clutter. In
general, it was observed that filters that produce broader correlation peaks (such as the
early SDFs) offer better distortion tolerance. However, they may also provide poorer
discrimination between classes since these filters tend to correlate broadly with low
frequency information in which the classes may be difficult to separate.
By minimizing the average energy in the correlation plane, we hope to keep the side
lobes low while maintaining the origin values at prespecified levels. This is an indirect
method to reduce the false peak or side lobe problem. However, in their attempt to produce
delta-function type correlation outputs, MACE filters emphasize high frequencies and yield
low correlation outputs with images not in the training set.
Therefore, some advanced MACE approaches such as the Gaussian MACE (G-MACE)
[14], the minimum noise and correlation energy (MINACE) [15] and optimal trade-off
filters [16] have been proposed to combine the properties of various SDFs. In the
G-MACE, the correlation outputs are made to approximate Gaussian shaped functions.
This represents a direct method to control the shape of the correlation outputs. The
MINACE and G-MACE variations have improved generalization properties with a slight
degradation in the average output plane variance and sharpness of the central peak
respectively.
In most of the previous research on SDF-type filters, linear constraints are
imposed on the training images to yield a known value at specific locations in the
correlation plane. However, placing such constraints satisfies conditions only at isolated
points in the image space but does not explicitly control the filter's ability to generalize
over the entire domain of the training images.
A new correlation filter design based on relaxed constraints, called the maximum
average correlation height (MACH) filter, has been proposed in [17]. The MACH filter
adopts a statistical approach in which the training images are not treated as deterministic
representations of the object but as samples of a class whose characteristic parameters
should be used in encoding the filter.
The concept of relaxing the correlation constraints and utilizing the entire correlation
output for multi-class recognition was first explicitly addressed by the distance classifier
correlation filter (DCCF) [18].
1.2 Motivation

Most of the members of the distortion invariant filter family are linear filters, which
are optimal only when the underlying statistics are Gaussian. For the non-Gaussian
statistics case, we need to extract information beyond second-order statistics. This is the
fundamental motivation of this research.
A nonlinear version of correlation filters called the polynomial distance classifier
correlation filter (PDCCF) has been proposed in [19]. A nonlinear extension to the
MACE filter using neural network topology has also been proposed in [20]. Since the
MACE filter is equivalent to a cascade of a linear pre-processor followed by a linear
associative memory (LAM) [21], the LAM portion of the MACE can be replaced with a
nonlinear associative memory structure, specifically a feed-forward multi-layer perceptron
(MLP). It is well known that non-linear associative memory structures can outperform
their linear counterparts on the basis of generalization and dynamic range. However,
in general, they are more difficult to design as their parameters cannot be computed in
closed form. Results have also shown that it is not enough to simply train a MLP using
backpropagation. Careful analysis of the final solution is necessary to confirm reasonable
results. Experimental results in [22] showed better generalization and classification
performance than the linear MACE on the MSTAR ATR data set (at 80% probability of
detection, the probability of false alarm dropped from 4.37% with the MACE to 2.45% with
the nonlinear MACE).
Recently, kernel based learning algorithms have been applied to classification and
pattern recognition due to the fact that they easily produce nonlinear extensions to linear


classifiers and boost performance [23]. By transforming the data into a high-dimensional
reproducing kernel Hilbert space (RKHS) and constructing optimal linear algorithms in
that space, the kernel-based learning algorithms effectively perform optimal nonlinear
pattern recognition in the input space to achieve better separation, estimation, regression,
etc. The nonlinear versions of a number of signal processing techniques such as principal
component analysis (PCA) [24], Fisher discriminant analysis [25] and linear classifiers [25]
have already been defined in a kernel space. Also the kernel matched spatial filter (KMSF)
has been proposed for hyperspectral target detection in [26] and the kernel SDF has
been proposed for face recognition [27]. The kernel correlation filter (KCF), which is the
kernelized MACE filter after prewhitening, has been proposed in [28] for face recognition.
Similar to Fisher's idea in [20], in the KCF the prewhitening is performed in the input space
with linear methods, which may affect the overall performance. We will later present the
difference between the KCF and the proposed method, in which all the computations,
including prewhitening, are conducted in the feature space.
More recently, a new generalized correlation function, called correntropy has been
introduced by our group [29]. Correntropy is a positive definite function, which measures
a generalized similarity between random variables (or stochastic processes) and it involves
high-order statistics of input signals, therefore it could be a promising candidate for
machine learning and signal processing. Correntropy defines a new reproducing kernel
Hilbert space (RKHS), which has the same dimensionality as the one defined by the
covariance matrix in the input space and it simplifies the formulation of analytic solutions
in this finite dimensional RKHS. Applications to the matched filter [30] and to chaotic
time series prediction have been presented in the literature.
Based on the promising properties of correntropy and the MACE filter, the main goal
of this research is to exploit the generalized nonlinear MACE filter for image recognition.
As the first step, we applied the kernel method to the SDF and obtained the kernel SDF.
Application of the kernel SDF to face recognition has been presented in this research. As


the main part of the research, this dissertation establishes the mathematical foundations of
the correntropy MACE filter (called here the CMACE filter) and evaluates its performance
in face recognition and synthetic aperture radar (SAR) ATR applications.
The formulation exploits the linear structure of the RKHS induced by correntropy
to formulate the correntropy MACE filter in the same manner as the original MACE,
and solves the problem with virtually the same equations (e.g., without regularization)
in the RKHS. Due to the nonlinear relation between the input and this feature spaces,
the CMACE corresponds to a nonlinear filter in the input space. In addition, the
CMACE preserves the shift-invariant property of the linear MACE, however it requires an
additional computation for each input image shift. In order to reduce the computational
complexity of the CMACE, the fast CMACE filter based on the Fast Gauss Transform
(FGT) is also proposed.
We also introduce a dimensionality reduction method based on random projections
(RP) and apply RP to the CMACE in order to decrease the storage requirements and fit
more readily available computational resources, and we show that the RP method works
well with the CMACE filter for image recognition.
The CMACE formulation for image recognition can be viewed as one instance of the
general class of energy minimization problems. The same methodology, minimizing the
correntropy energy of the output, can also be applied to the beamforming problem, whose
conventional linear solution is obtained by minimizing the output power.
Appendix E presents the new application of the correntropy to the beamforming problem
in wireless communications with some preliminary results.


CHAPTER 2
FUNDAMENTAL DISTORTION INVARIANT LINEAR
CORRELATION FILTERS
2.1 Introduction

Distortion invariant composite filters are a generalization of matched spatial filtering
for the detection of a single object to the detection of a class of objects, and they are
widely used for image recognition. There are many variations of correlation filters. The
SDF and the MACE filter are the fundamental correlation-based distortion invariant
filters for object recognition, and most other correlation filters are based on them. In this
research, we present nonlinear extensions to the SDF and MACE. The formulations
of the SDF and MACE filter are briefly introduced in this chapter. We consider a
2-dimensional image as a d × 1 column vector obtained by lexicographically reordering the
image, where d is the number of pixels.
2.2 Synthetic Discriminant Function (SDF)

The SDF filter is matched to a composite image h, where h is a linear combination of
the training image vectors x_i,

h = \sum_{i=1}^{N} a_i x_i,     (2-1)

where N is the number of training images and the coefficients a_i are chosen to satisfy the
following constraints:

h^T x_j = u_j, \quad j = 1, 2, \ldots, N,     (2-2)

where T denotes the transpose and u_j is the desired cross correlation output peak value. In
vector form, we define the training image data matrix X as

X = [x_1, x_2, \ldots, x_N],     (2-3)

where the size of the matrix X is d × N. Then the SDF is the solution to the following
optimization problem:

\min_h \; h^T h, \quad \text{subject to} \quad X^T h = u.     (2-4)

Figure 2-1. Example of the correlation output plane of the SDF [31].
It is assumed that N < d, so the problem is a quadratic optimization subject to an
under-determined system of linear constraints. The optimal solution is

h = X (X^T X)^{-1} u.     (2-5)

Once h is determined, we apply an appropriate threshold to the output of the cross
correlation, which is the inner product of the test input image and the filter h, and decide
on the class of the test image.
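As an illustration of Eqs. 2-1 through 2-5, the short NumPy sketch below (not part of the original dissertation; the random training matrix and the 0.5 threshold are placeholder assumptions) builds the SDF template h = X(X^T X)^{-1}u and scores a test image by its inner product with h.

import numpy as np

def sdf_filter(X, u):
    """Synthetic discriminant function template (Eq. 2-5).

    X : d x N matrix whose columns are lexicographically reordered training images.
    u : length-N vector of desired correlation peak values (often all ones)."""
    return X @ np.linalg.solve(X.T @ X, u)      # h = X (X^T X)^{-1} u

# toy example: N = 3 random "images" of d = 64*64 pixels (placeholders)
rng = np.random.default_rng(0)
d, N = 64 * 64, 3
X = rng.standard_normal((d, N))
u = np.ones(N)
h = sdf_filter(X, u)

print(X.T @ h)                   # reproduces the constraints u = [1, 1, 1]
z = rng.standard_normal(d)       # a test image vector
peak = h @ z                     # correlation value at the origin
print(peak > 0.5)                # threshold decision (threshold is illustrative)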
Figure 2-1 shows the general shape of the correlation output plane of the SDF using
inverse synthetic aperture radar (ISAR) imagery [31]. As stated earlier, the SDF has
a broad output plane response, which means that the SDF generalizes well to true class
images but discriminates poorly between true class and out-of-class images.


2.3 Minimum Average Correlation Energy (MACE) Filter

Let x_i denote the ith image vector after reordering. The conventional MACE
filter is better formulated in the frequency domain. The discrete Fourier transform (DFT)
of the column vector x_i is denoted by X_i, and we define the training image data matrix X
as

X = [X_1, X_2, \ldots, X_N],     (2-6)

where the size of X is d × N and N is the number of training images. Let the vector h be
the filter in the space domain and represent by H its Fourier transform vector. We are
interested in the correlation between the input image and the filter. The correlation of the
ith image sequence x_i(n) with the filter sequence h(n) can be written as

g_i(n) = x_i(n) \otimes h(n),     (2-7)

where \otimes denotes correlation. By Parseval's theorem, the correlation energy of the ith
image can be written as a quadratic form

E_i = H^H D_i H,     (2-8)

where D_i is a diagonal matrix of size d × d whose diagonal elements are the magnitude
squared of the associated elements of X_i, that is, the power spectrum of x_i(n), and the
superscript H denotes the Hermitian transpose. The objective of the MACE filter is
to minimize the average correlation energy over the image class while simultaneously
satisfying an intensity constraint at the origin for each image. The value of the correlation
at the origin can be written as

g_i(0) = X_i^H H = c_i,     (2-9)

for the i = 1, 2, \ldots, N training images, where c_i is the user-specified output correlation
value at the origin for the ith image. Then the average energy over all training images is
expressed as

E_{avg} = H^H D H,     (2-10)

where

D = \frac{1}{N} \sum_{i=1}^{N} D_i.     (2-11)

Figure 2-2. Example of the correlation output plane of the MACE [31].

The MACE design problem is to minimize E_{avg} while satisfying the constraint X^H H =
c, where c = [c_1, c_2, \ldots, c_N] is an N-dimensional vector. This optimization problem can
be solved using Lagrange multipliers, and the solution is

H = D^{-1} X (X^H D^{-1} X)^{-1} c.     (2-12)

It is clear that the spatial filter h can be obtained from H by an inverse DFT. Once h is
determined, we apply an appropriate threshold to the output correlation plane and decide
whether the test image belongs to the class of the template or not.
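A minimal frequency-domain sketch of Eqs. 2-6 through 2-12 is given below. It is an illustration only (random 32x32 arrays stand in for training images, and none of the preprocessing used in the dissertation's experiments is included); the diagonal matrix D is stored as a vector containing the average power spectrum.

import numpy as np

def mace_filter(images, c):
    """MACE filter in the frequency domain (Eq. 2-12).

    images : list of N equally sized 2-D arrays (training images)
    c      : length-N vector of desired correlation values at the origin."""
    X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)   # d x N
    D = np.mean(np.abs(X) ** 2, axis=1)          # diagonal of D (average power spectrum)
    Dinv_X = X / D[:, None]                      # D^{-1} X
    A = X.conj().T @ Dinv_X                      # X^H D^{-1} X  (N x N)
    H = Dinv_X @ np.linalg.solve(A, c)           # D^{-1} X (X^H D^{-1} X)^{-1} c
    return H.reshape(images[0].shape)

# usage sketch with random 32x32 "images" (placeholders)
rng = np.random.default_rng(1)
train = [rng.standard_normal((32, 32)) for _ in range(5)]
H = mace_filter(train, np.ones(5))
test = train[0]
# full circular correlation plane computed in the frequency domain;
# the peak of this plane is searched to make the decision
plane = np.fft.ifft2(np.fft.fft2(test) * H.conj()).real
print(plane.max())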
Figure 2-2 shows the general shape of the correlation output plane of the MACE [31].
It shows a sharp peak at the origin, and as a result the ability to find the location
of the target and to discriminate between true class and out-of-class images is
improved. However, a sharp output plane causes worse distortion tolerance and poor
generalization.

Figure 2-3. Example of the correlation output plane of the OTSDF [31].
2.4 Optimal Trade-off Synthetic Discriminant (OTSDF) Function

The optimal trade-off filter (OTSDF) is a well known correlation filter designed to
overcome the poor generalization of the MACE when noisy input is presented. The OTSDF
trades off the MACE filter criterion against the MVSDF filter criterion.

The OTSDF filter in the frequency domain is given by

H = T^{-1} X (X^H T^{-1} X)^{-1} c,     (2-13)

where T = \alpha D + \sqrt{1 - \alpha^2}\, C with 0 \le \alpha \le 1, D is the diagonal
matrix of the MACE, and C is the diagonal matrix containing the input noise power
spectral density as its diagonal entries.
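Assuming the trade-off form T = alpha*D + sqrt(1 - alpha^2)*C reconstructed above, the following sketch (a hypothetical helper, not the author's code; white input noise is assumed when no power spectral density is supplied) shows that the OTSDF differs from the MACE solver only in the diagonal term.

import numpy as np

def otsdf_filter(images, c, alpha=0.9, noise_psd=None):
    """Optimal trade-off SDF (Eq. 2-13); alpha = 1 recovers the MACE.

    noise_psd : diagonal of C (input noise power spectral density);
                white noise (all ones) is assumed if not given."""
    X = np.stack([np.fft.fft2(img).ravel() for img in images], axis=1)
    D = np.mean(np.abs(X) ** 2, axis=1)
    C = np.ones_like(D) if noise_psd is None else noise_psd
    T = alpha * D + np.sqrt(1.0 - alpha ** 2) * C        # trade-off diagonal
    Tinv_X = X / T[:, None]
    H = Tinv_X @ np.linalg.solve(X.conj().T @ Tinv_X, c)
    return H.reshape(images[0].shape)

# usage: H = otsdf_filter(train, np.ones(len(train)), alpha=0.95)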


The correlation output response of the OTSDF is shown in Figure 2-3 [31]. Compared
to the MACE filter response, the output peak is not nearly as sharp, but it is still more
localized than in the SDF case.


CHAPTER 3
KERNEL-BASED CORRELATION FILTERS
3.1 Brief review on Kernel Method

3.1.1 Introduction

Kernel-based algorithms have recently been developed in the machine learning
community, where they were first used to solve binary classification problems with the
so-called Support Vector Machine (SVM) [32]. There is now an extensive literature on the
SVM [33],[34] and on the family of kernel-based algorithms [23].
A kernel-based algorithm is a nonlinear version of a linear algorithm where the data
has been previously (and most often nonlinearly) transformed to a higher dimensional
space in which we only need to be able to compute inner products (via a kernel function).
It is clear that many problems arising in signal processing are of a statistical nature
and require automatic data analysis methods. Furthermore, the algorithms used in
signal processing are usually linear, and their transformation for nonlinear processing is
sometimes unclear. Signal processing practitioners can benefit from a deeper understanding
of kernel methods, because they provide a different way of taking nonlinearities into account
without losing the original properties of the linear method. Another aspect is dealing
with the amount of available data in a space of a given dimensionality: one needs methods
that can use little data and avoid the curse of dimensionality.
Aronszajn [35] and Parzen [36] were among the first to employ positive definite
kernels in statistics. Later, based on statistical learning theory, the support vector machine
[70] and other kernel-based learning algorithms [63] such as kernel principal component
analysis [64], kernel Fisher discriminant analysis [58] and kernel independent component
analysis [4] have been introduced.
3.1.2 Kernel Method

Many algorithms for data analysis are based on the assumption that the data can
be represented as vectors in a finite dimensional vector space. These algorithms, such


Figure 3-1. The example of kernel method (left: Input space, right: feature space).
as linear discrimination, principal component analysis, or least squares regression, make
extensive use of this linear structure. Roughly speaking, kernels allow one to naturally
derive nonlinear versions of linear algorithms through an implicit nonlinear mapping. The
general idea is the following. Given a linear algorithm (i.e., an algorithm which works in
a vector space), one first maps the data living in a space X (the input space) to a vector
space H (the feature space) via a nonlinear mapping Φ(·) : X → H, and then runs the
algorithm on the vector representation Φ(x) of the data. In other words, one performs
nonlinear analysis of the data using a linear method. The purpose of the map Φ(·) is to
translate nonlinear structures of the data into linear ones in H.
Consider the following discrimination problem (see Figure 3-1), where the goal is
to separate two sets of points. In the input space the problem is nonlinear, but after
applying the transformation Φ(x₁, x₂) = (x₁², √2 x₁x₂, x₂²), which maps each vector to the
three monomials of degree 2 formed by its coordinates, the separation boundary becomes
linear. We have just transformed the data, and we hope that in the new representation
linear structures will emerge.
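The following small numerical check (an illustration added here, not taken from the dissertation) implements this degree-2 monomial map and verifies that the feature-space inner product equals the squared input-space inner product, i.e., the polynomial kernel of degree 2.

import numpy as np

def phi(x):
    """Degree-2 monomial map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# inner product in the 3-D feature space equals the degree-2 polynomial
# kernel (x . y)^2 evaluated directly in the 2-D input space
print(phi(x) @ phi(y))     # 2.25
print((x @ y) ** 2)        # 2.25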
Working directly with the data in the feature space may be difficult because the space
can be infinite dimensional, or because the transformation is only implicitly defined. The
basic idea of a kernel algorithm is to transform the data x from the input space to a high
dimensional feature space of vectors Φ(x), where the inner products can be computed using
a positive definite kernel function satisfying Mercer's condition [23],

\kappa(x, y) = \langle \Phi(x), \Phi(y) \rangle.     (3-1)

Mercer's Theorem: Suppose κ(t, s) is a continuous symmetric non-negative definite
function on a closed finite interval T × T. Denote by {λ_k, k = 1, 2, ...} the sequence
of non-negative eigenvalues of κ(t, s) and by {ψ_k(t), k = 1, 2, ...} the sequence of
corresponding normalized eigenfunctions; in other words, for all s, t ∈ T,

\int_T \kappa(t, s) \psi_k(t) \, dt = \lambda_k \psi_k(s),     (3-2)

\int_T \psi_k(t) \psi_j(t) \, dt = \delta_{k,j},     (3-3)

where δ_{k,j} is the Kronecker delta, equal to 1 if k = j and 0 if k ≠ j. Then

\kappa(t, s) = \sum_{k=0}^{\infty} \lambda_k \psi_k(t) \psi_k(s),     (3-4)

where the series above converges absolutely and uniformly on T × T [37].
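A quick numerical illustration of this expansion (not from the dissertation; the interval, sample size and kernel width are arbitrary choices) builds a Gaussian kernel Gram matrix on sampled points and reconstructs it from its eigenvalues and eigenvectors, the discrete analogues of λ_k and ψ_k.

import numpy as np

# sample the interval T = [0, 1] and build a Gaussian kernel Gram matrix
t = np.linspace(0.0, 1.0, 200)
sigma = 0.2
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * sigma ** 2))

# eigendecomposition of the symmetric kernel matrix: discrete analogue of
# kappa(t, s) = sum_k lambda_k psi_k(t) psi_k(s)
lam, psi = np.linalg.eigh(K)
K_rec = (psi * lam) @ psi.T

print(np.allclose(K, K_rec))       # True: the expansion reconstructs the kernel
print(np.sort(lam)[::-1][:5])      # eigenvalues decay quickly for a smooth kernel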


This simple and elegant idea allows us to obtain nonlinear versions of any linear
algorithm expressed in terms of inner products, without even knowing the exact mapping
function . A particularly interesting characteristic of the feature space is that it is a
reproducing kernel Hilbert space (RKHS), i.e., the span of functions {(, x) : x }
defines a unique functional Hilbert space [35]. The crucial property of these space is the
reproducing property of the kernel
f (x) =< (, x), f >, f F.

(35)

In particular, we can define our nonlinear mapping from the input space to RKHS as
(x) = (, x), then we have
< (x), (y) >=< (, x), (, y) >= (x, y),

29

(36)

and thus (x) = (, x) defines the Hilbert space associated with the kernel.
In this research, we use the Gaussian kernel, which is the most widely used Mercer
kernel,

\kappa_\sigma(x - y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right).     (3-7)

3.2 Kernel Synthetic Discriminant Function

Based on the kernel methodology, the previous optimization problem for the SDF can
be solved in an infinite dimensional kernel feature space by transforming each element of
the matrix of exemplars X to Φ(X_ij) and h to Φ(h) with a sample-by-sample mapping, thus
forming a higher dimensional matrix Φ(X) whose (i, j)th feature vector is Φ(X_ij). Let the
matrix of N training images be

X = [x_1, x_2, \ldots, x_N],     (3-8)

where x_i is the ith training image vector, given by

x_i = [x_i(1), \ldots, x_i(d)].     (3-9)

Then we can extend the SDF optimization problem to the nonlinear feature space as

\min \; \Phi^T(h)\Phi(h), \quad \text{subject to} \quad \Phi^T(X)\Phi(h) = u,     (3-10)

where the dimensions of the transformed Φ(X) and Φ(h) are ∞ × N and ∞ × 1,
respectively, for the Gaussian kernel. Then the solution in kernel space becomes

\Phi(h) = \Phi(X)\bigl(\Phi^T(X)\Phi(X)\bigr)^{-1} u.     (3-11)

We denote K_{XX} = Φ^T(X)Φ(X), which is an N × N full rank matrix whose (i, j)th element
is given by

(K_{XX})_{ij} = \sum_{k=1}^{d} \kappa(x_i(k), x_j(k)), \quad i, j = 1, 2, \ldots, N.     (3-12)

Although Φ(h) is an infinite dimensional vector, the output of this filter is an N × 1
vector, which can easily be computed using these kernels.

Let Z be the matrix of image vectors for testing, with L testing images. We denote
K_{ZX} = Φ^T(Z)Φ(X), which is an L × N matrix whose elements are given by

(K_{ZX})_{ij} = \sum_{k=1}^{d} \kappa(z_i(k), x_j(k)), \quad i = 1, 2, \ldots, L, \; j = 1, 2, \ldots, N.     (3-13)

Then the L × 1 output vector of the kernel SDF is given by

y = \Phi^T(Z)\Phi(h) = K_{ZX} K_{XX}^{-1} u.     (3-14)

We can compute K_{XX} off-line from the training data. Then K_{ZX} and y can be
computed on-line for a given test image. Given N training images and one test image,
the computational complexities are O(dN²) off-line and O(dN) + O(N³) on-line,
respectively. In general, N ≪ d, so the dominant part of the computational complexity,
the O(N³) matrix inversion in (3-14), is not a critical computational issue in the kernel SDF.
Also, the required memory for K_{XX}, which is O(N²), is much less than in the case where
the storage depends on the number of image pixels, d.

By applying an appropriate threshold to the output in (3-14), we can detect and
recognize the test data without explicitly generating the composite filter in the feature
space. In the sense of object recognition and classification, the proposed kernel SDF is
simpler than the kernel matched filter.
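The complete off-line/on-line computation of Eqs. 3-12 through 3-14 can be sketched as follows (illustrative code with random placeholder data; the kernel size sigma = 0.5 is an arbitrary assumption, not the 30%-50% rule used later in the experiments).

import numpy as np

def gauss(a, b, sigma):
    """Pixel-wise Gaussian kernel values (Eq. 3-7), summed over pixels (Eq. 3-12)."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2)).sum() / (np.sqrt(2 * np.pi) * sigma)

def kernel_sdf_output(X, Z, u, sigma=0.5):
    """Kernel SDF output y = K_ZX K_XX^{-1} u (Eq. 3-14).

    X : d x N training matrix, Z : d x L test matrix (columns are image vectors)."""
    N, L = X.shape[1], Z.shape[1]
    K_XX = np.array([[gauss(X[:, i], X[:, j], sigma) for j in range(N)] for i in range(N)])
    K_ZX = np.array([[gauss(Z[:, i], X[:, j], sigma) for j in range(N)] for i in range(L)])
    return K_ZX @ np.linalg.solve(K_XX, u)

# toy usage with random data (placeholders for the 64x64 face images)
rng = np.random.default_rng(2)
X = rng.standard_normal((64 * 64, 5))
u = np.ones(5)
y = kernel_sdf_output(X, X[:, :2], u)      # training images used as test inputs
print(y)                                   # reproduces the constrained value 1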
3.3 Application of the Kernel SDF to Face Recognition

3.3.1 Problem Description

In this section, we show the performance of the proposed kernel based SDF filter for
face image recognition. In the simulations, we used the facial expression database collected
at the Advanced Multimedia Processing Lab at the Electrical and Computer Engineering
Department of Carnegie Mellon University [38]. The database consists of 13 subjects,
whose facial images were captured with 75 varying expressions. The size of each image is
64×64. Sample images are depicted in Figure 3-2. In this research, we tested the proposed
kernel SDF method with the original database images as well as with noisy images.
Sample images with additive Gaussian noise with a 10 dB SNR are shown in Figure 3-2(c).
In order to evaluate the performance of the SDF and kernel SDF filter in this data set,
we examined 975 (13×75) correlation outputs. From these results and the ones reported
in [5], we picked and report the results for the two most difficult cases, which produced the
worst performance with the conventional SDF method. We test with all the images of
each person's data set, resulting in 75 outputs for each class. The simulation results have
been obtained by averaging (Monte-Carlo approach) over 100 different training sets
(each training set has been chosen randomly) to minimize the problem of performance
differences due to splitting the relatively small database into training and testing sets. For
this data set, it has been observed that a kernel size of around 30%-50% of the standard
deviation of the input data is appropriate.
3.3.2 Simulation Results

Figure 3-2. Sample images: (a) Person A, (b) Person B, (c) Person A with additive
Gaussian noise (SNR=10dB).

Figure 3-3. The output peak values when only 3 images are used for training (N=3),
(Top): SDF, (Bottom): Kernel SDF.
Figure 3-3 shows the average output peak values for image recognition when we use
only N = 3 images as training. The desired output peak value should be close to one
when the test image belongs to the training image class. Figure 3-3 (Top) shows that the
correlation output peak values of the conventional SDF in both true and false classes not
only overlap but are also close to one. As a result the system will have great difficulty to
differentiate these two individuals because they can be interpreted as belonging to the
same class. Figure 3-3 (Bottom) shows the output values of kernel SDF and we can see
that the two images can be recognized well even with a small number of training images.
Figure 3-4 shows the ROC curves with different number of training images (N ). In the
kernel SDF with N = 3, the probability of detection with zero false alarm rate is 1.
However, the conventional SDF needs at least 25 images for training in order to have the
same detection performance as the kernel SDF.


Figure 3-4. The comparison of ROC curves with different number of training images.
One of the major problems of the conventional SDF is that the performance can
be easily degraded by additive noise in the test image since SDF does not have any
special mechanism to consider input noise. Therefore, it has a poor rejecting ability for
a false class image. Figure 3-5 (Top) shows the noise effect on the conventional SDF.
When the false class test images are seriously distorted by additive Gaussian noise with a
very low SNR (-2dB), the correlation output peaks of some test images become greater
than 1, and hence wrong recognitions happen. The results in Figure 3-5 (Bottom) are obtained
by the kernel SDF. The kernel SDF shows a much better performance even in a very
low SNR environment. The comparison of ROC curves between the kernel SDF and the
conventional SDF in the case of noisy test input with different SNRs is shown in Figure
3-6. We can see that the kernel SDF outperforms the SDF and achieves a robust pattern
recognition performance even in a very noisy environment.


Figure 3-5. The output values of noisy test input images with additive Gaussian noise
when 25 images are used for training (N=25). (Top): SDF, circle-true class
with SNR=10dB, cross-false class with SNR=-2dB, diamond-false class with
no noise. (Bottom): Kernel SDF, circle-true class with SNR=10dB, cross-false
class with SNR=-2dB.
Figure 3-6. The ROC curves of noisy test input images with different SNRs when 10
images are used for training (N=10).


CHAPTER 4
A RKHS PERSPECTIVE OF THE MACE FILTER
4.1 Introduction

This section presents the interpretation of the MACE filter in the RKHS. The original
linear MACE filter was formulated in the frequency domain; however, the MACE filter
can also be understood through the theory of Hilbert space representations of random
functions proposed by Parzen [39]. Parzen analyzed the connection between RKHS and
second-order random (or stochastic) processes by using the isometric isomorphism that
exists between the Hilbert space spanned by the random variables of a stochastic process
and the RKHS determined by its covariance function.

Consider two Hilbert spaces H_1 and H_2 with inner products denoted ⟨f_1, f_2⟩_1 and
⟨g_1, g_2⟩_2, respectively. H_1 and H_2 are said to be isomorphic if there exists a one-to-one
and surjective mapping ψ from H_1 to H_2 satisfying

\psi(f_1 + f_2) = \psi(f_1) + \psi(f_2) \quad \text{and} \quad \psi(\alpha f) = \alpha \psi(f)     (4-1)

for all functions in H_1 and any real number α. The mapping ψ is called an isomorphism
between H_1 and H_2. The Hilbert spaces H_1 and H_2 are said to be isometric if there exists
a mapping ψ that preserves inner products,

\langle f_1, f_2 \rangle_1 = \langle \psi(f_1), \psi(f_2) \rangle_2,     (4-2)

for all functions in H_1. A mapping ψ satisfying both properties (4-1) and (4-2) is said to
be an isometric isomorphism, or congruence. The congruence maps both linear
combinations of functionals and limit points of H_1 into the corresponding linear
combinations of functionals and limit points in H_2.

Here, we first present the basic theory of the RKHS and then show the interpretation of
the MACE filter formulation in the RKHS.
4.2 Reproducing Kernel Hilbert Space (RKHS)

A reproducing kernel Hilbert space (RKHS) is a special Hilbert space associated
with a kernel that reproduces (via an inner product) each function in the space or,
equivalently, in which every point evaluation functional is bounded. Let H be a Hilbert
space of functions on some set E, with an inner product ⟨·,·⟩_H, and let κ(x, y) be a
complex-valued bivariate function on E × E. The function κ(x, y) is said to be positive
definite if, for any finite point set {x_1, x_2, ..., x_n} ⊂ E and any complex numbers
{α_1, α_2, ..., α_n} ⊂ C that are not all zero,

\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j^* \kappa(x_i, x_j) > 0.     (4-3)

Any positive definite bivariate function κ(x, y) is a reproducing kernel because of the
following fundamental theorem.

Moore-Aronszajn Theorem: Given any positive definite function κ(x, y), there
exists a uniquely determined (possibly finite dimensional) Hilbert space H consisting of
functions on E such that

(i) for every x ∈ E, κ(x, ·) ∈ H, and     (4-4)

(ii) for every x ∈ E and f ∈ H, f(x) = ⟨f, κ(x, ·)⟩_H.     (4-5)

Then H := H(κ) is said to be a reproducing kernel Hilbert space with reproducing kernel
κ. Properties (i) and (ii) are called the reproducing property of κ(x, y) in H(κ).
Parzen [39] analyzed the connection between RKHSs and orthonormal expansions for
second-order stochastic processes obtaining a general expression for the reproducing kernel
inner product in terms of the eigenvalues and eigenfunctions of a certain operator defined
on an appropriate Hilbert space. In addition, Parzen showed that there exists an isometric
isomorphism between the Hilbert space spanned by the random variables of a stochastic
process and the RKHS determined by its covariance function.
Given a zero mean second-order random vector {x_i : i ∈ I}, with I being an index set,
the covariance function is defined as

R(i, j) = E[x_i x_j].     (4-6)

It is well known that the covariance function R is non-negative definite; therefore it
determines a unique RKHS, H(R), according to the Moore-Aronszajn Theorem. By
Mercer's theorem [35],

R(i, j) = \sum_{k=0}^{\infty} \lambda_k \psi_k(i) \psi_k(j),     (4-7)

where {λ_k, k = 1, 2, ...} and {ψ_k(i), k = 1, 2, ...} are the sequences of non-negative
eigenvalues and corresponding normalized eigenfunctions of R(i, j), respectively.

H(R) has two important properties which make it a reproducing kernel Hilbert space.
First, let R(i, ·) be the function on I with value at j in I equal to R(i, j); then, by the
Mercer eigen-expansion (4-7) of the covariance function, we have

R(i, j) = \sum_{k=0}^{\infty} \lambda_k a_k \psi_k(j), \quad a_k = \psi_k(i).     (4-8)

Therefore, R(i, ·) ∈ H(R) for each i in I. Second, for every function f(·) ∈ H(R) of the
form f(i) = \sum_{k=0}^{\infty} \lambda_k a_k \psi_k(i) and every i in I,

\langle f, R(i, \cdot) \rangle = \sum_{k=0}^{\infty} \lambda_k a_k \psi_k(i) = f(i).     (4-9)

By the Moore-Aronszajn Theorem, H(R) is a reproducing kernel Hilbert space with R(i, j)
as the reproducing kernel. It follows that

\langle R(i, \cdot), R(j, \cdot) \rangle = \sum_{k=0}^{\infty} \lambda_k \psi_k(i) \psi_k(j) = R(i, j).     (4-10)

Thus H(R) is a representation of the random vector {x_i : i ∈ I} with covariance function
R(i, j).

One may define a congruence G from H(R) onto the linear space L_2(x_i, i ∈ I) such that

G(R(i, \cdot)) = x_i.     (4-11)

The congruence G can be explicitly represented as

G(f) = \sum_{k=0}^{\infty} a_k \xi_k,     (4-12)

where the ξ_k are orthogonal random variables belonging to L_2(ξ(i), i ∈ I) and f is any
element of H(R) of the form f(i) = \sum_{k=0}^{\infty} \lambda_k a_k \psi_k(i) for every i in I.
Summary: Let {x_i : i ∈ I} be a continuous random function defined on a closed finite
interval I. Then the following conclusions hold:
- The covariance kernel R possesses the expansion (4-7).
- There exists a Hilbert space L_2(ξ(i), i ∈ I) of sequences which is a representation of
  the random function.
- There exists a reproducing kernel Hilbert space H(R) of functions on I, which is a
  representation of the random function.
4.3 Interpretation of the MACE filter in the RKHS

The original MACE filter was derived in the frequency domain for simplicity; however,
it can also be derived in the space domain [40], and this helps us understand the RKHS
perspective of the MACE. Let us consider the case of one training image and construct the
following matrix:

U = \begin{bmatrix}
x(d)   & 0      & \cdots & 0      \\
x(d-1) & x(d)   & \cdots & 0      \\
\vdots & \vdots & \ddots & \vdots \\
x(1)   & x(2)   & \cdots & x(d)   \\
0      & x(1)   & \cdots & x(d-1) \\
\vdots & \vdots & \ddots & \vdots \\
0      & 0      & \cdots & x(1)
\end{bmatrix},     (4-13)

where the dimension of the matrix U is (2d−1) × d. Here we denote the ith column of the
matrix U by U_i. Then the column space of U is

L_2(U) = \left\{ \sum_{i=1}^{d} \alpha_i U_i \;\middle|\; \alpha_i \in \mathbb{R}, \; i = 1, \ldots, d \right\},     (4-14)

which is congruent to the RKHS induced by the correlation kernel


R(i, j) =< Ui , Uj >= UiT Uj , i, j = 1, , d,

(415)

where < , > represents the inner product operation. If all the columns in U are
linearly independent, R(i, j) is positive definite and the dimensionality of L2 (U) is d.
If U is singular, the dimensionality is smaller than d. However, in either case, all the
vectors in this space can be expressed as a linear combination of the column vectors. The
d
P
optimization problem of the MACE is to find a vector go =
hi Ui in L2 (U) space with
i=1

coordinates h = [h1 h2 hd ]T such that goT go is minimized subject to the constraint


that the dth component of go (which is the correlation at zero lag) is some constant.
Formulating the MACE filter from this RKHS viewpoint only provides a new perspective
but no additional advantage. However, as explained next, it will help us derive a nonlinear
extension to the MACE with a new similarity measure.
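To make the space-domain formulation above concrete, the following is a minimal numerical sketch (Python/NumPy; an illustration added here, not part of the original text) of the constrained minimization over the column space of U for a single 1-D training signal:

    import numpy as np

    # Sketch: space-domain MACE for one 1-D training signal x. Build the (2d-1) x d
    # matrix U whose i-th column is the reversed signal shifted down by i-1 samples,
    # then solve  min_h h^T (U^T U) h  subject to  x^T h = c, whose Lagrangian
    # solution is h = c R^{-1} x / (x^T R^{-1} x) with R = U^T U.
    def space_domain_mace(x, c=1.0):
        d = len(x)
        U = np.zeros((2 * d - 1, d))
        for i in range(d):                      # column i holds x reversed, starting at row i
            U[i:i + d, i] = x[::-1]
        R = U.T @ U                             # correlation kernel R(i, j) = U_i^T U_j
        Rinv_x = np.linalg.solve(R, x)
        h = c * Rinv_x / (x @ Rinv_x)           # constrained minimum-energy solution
        return h, U

    x = np.random.randn(32)
    h, U = space_domain_mace(x)
    print((U @ h)[31])                          # d-th component of g_o equals the constraint c = 1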


CHAPTER 5
NONLINEAR VERSION OF THE MACE IN A NEW RKHS :
THE CORRENTROPY MACE (CMACE) FILTER
5.1 Correntropy Function
5.1.1 Definition

Correlation is one of the fundamental operations of statistics, machine learning and


signal processing because it quantifies similarity. However, correlation only exploits second
order statistics of the random variables or random processes, which limits its optimality to
Gaussian distributed data. Correntropy was introduced in [29] as a generalized measure of
similarity. Its name stresses the connection to correlation, but also indicates the fact that
its mean value across time or dimensions is associated with entropy, more precisely to the
argument of the log in Renyi's quadratic entropy estimated with Parzen windows, which
is called the information potential. The information potential (IP) is the argument of Renyi's
quadratic entropy of a random variable X with PDF f_X(x),

    H_2(X) = -\log \int f_X^2(x) dx,                                               (5-1)

where IP(X) = \int f_X^2(x) dx.
    A nonparametric estimator of the information potential using the Parzen window from N
data samples is

    \hat{IP}(x) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_\sigma(x_i - x_j),      (5-2)

where \kappa_\sigma is the Gaussian kernel in (5-4) [41].


This relation to entropy shows that the correntropy contains information beyond
second order moments, and can therefore generalize correlation without requiring moment
expansions.
    Definition: Cross correntropy, or simply correntropy, is a generalized similarity
measure between two arbitrary vector random variables X and Y defined as

    V_\sigma(X, Y) = E[\kappa_\sigma(X - Y)],                                      (5-3)

where E is the mathematical expectation and \kappa_\sigma is the Gaussian kernel given by

    \kappa_\sigma(X - Y) = \frac{1}{\sqrt{2\pi}\sigma} \exp( -\frac{\|X - Y\|^2}{2\sigma^2} ),      (5-4)

where \sigma is the kernel size or bandwidth.


    In practice, given a finite number of data samples {(x_i, y_i)}_{i=1}^{d}, the cross correntropy is
estimated by

    \hat{V}_\sigma(X, Y) = \frac{1}{d} \sum_{i=1}^{d} \kappa_\sigma(x_i - y_i).    (5-5)
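As an illustration, a minimal sketch of the sample estimator (5-5) (Python/NumPy; added here for illustration only, the function names and kernel size are not from the original text):

    import numpy as np

    def gaussian_kernel(e, sigma):
        # kappa_sigma(e) = exp(-e^2 / (2 sigma^2)) / (sqrt(2 pi) sigma), cf. (5-4)
        return np.exp(-e**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def correntropy(x, y, sigma=1.0):
        # Sample estimator of cross correntropy, cf. (5-5)
        return np.mean(gaussian_kernel(x - y, sigma))

    x = np.random.randn(1000)
    y = x + 0.1 * np.random.randn(1000)                       # noisy copy of x
    print(correntropy(x, y, sigma=1.0))                       # close to kappa(0) for similar signals
    print(correntropy(x, np.random.randn(1000), sigma=1.0))   # smaller for unrelated signals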
5.1.2 Some Properties

Correntropy has very nice properties that make it useful for machine learning and
nonlinear signal processing. First and foremost, it is a positive function also defining a
RKHS, but unlike the RKHS defined by the covariance function of the random variable
(process) it contains higher order statistical information. This new function quantifies
the average angular separation in the kernel feature space between the dimensions of the
random variable (or between temporal lags of the random process). Therefore, correntropy
can be the metric for similarity measurements in feature space. Several properties of
correntropy and their proofs are presented in [29][42][43]. Here we present, without proofs,
only the properties that are relevant to this dissertation.
    Property 1: Correntropy is a similarity measure between X and Y incorporating
higher order moments of the random variable X - Y [29].
    Applying the Taylor series expansion of the Gaussian kernel, we can rewrite the
correntropy function in (5-3) as

    V_\sigma(X, Y) = \frac{1}{\sqrt{2\pi}\sigma} \sum_{k=0}^{\infty} \frac{(-1)^k}{(2\sigma^2)^k k!} E[(X - Y)^{2k}],      (5-6)

which contains all the even-order moments of the random variable X - Y. The kernel
size controls the emphasis of the higher order moments with respect to the second, since
the higher order terms of the expansion decay faster for larger \sigma. As \sigma increases, the
high-order moments decay and the second order moment tends to dominate. In fact, for
kernel sizes larger than 10 times the one chosen from density estimation considerations (e.g.
Silverman's rule [44]), correntropy starts to approach correlation. The kernel size has to be
chosen according to the application, but here this issue will not be further addressed and
Silverman's rule will be used by default.
    Property 2: Let {x_i, i \in T} be a random vector (process) with T being an index
set; the auto-correntropy function of the random vector (process), V(i, j) = E[\kappa_\sigma(x_i - x_j)], is a
symmetric and positive definite function; therefore it defines a new RKHS, called VRKHS
[29].
    Since \kappa_\sigma(x_i - x_j) is symmetric, it is obvious that V(i, j) is also symmetric. Also,
since \kappa_\sigma(x_i - x_j) is positive definite, for any set of n points {x_1, ..., x_n} and nonzero
real numbers {\alpha_1, ..., \alpha_n}, we have

    \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \kappa_\sigma(x_i - x_j) > 0.      (5-7)

It is true that for any strictly positive function g(\cdot, \cdot) of two random variables x and y,
E[g(x, y)] > 0. Thus we have E[ \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \kappa_\sigma(x_i - x_j) ] > 0, which equals

    \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j E[\kappa_\sigma(x_i - x_j)] = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j V(i, j) > 0.      (5-8)
Thus V (i, j) is both symmetric and positive definite. Now, the Moore-Aronszajn theorem
[35] proves that for every real symmetric positive definite function k, there exists a unique
RKHS with k as its reproducing kernel. Hence V (i, j) is a reproducing kernel.
As shown in property 1, VRKHS contains higher order statistical information, unlike
the RKHS defined by the covariance function of random processes.
    Property 3: Assume the samples {(x_i, y_i)}_{i=1}^{d} are drawn from the joint PDF
f_{X,Y}(x, y) and f_{X,Y,\sigma}(x, y) is its Parzen estimator with kernel size \sigma. The correntropy
estimator with kernel size \sigma' = \sqrt{2}\sigma is the integral of f_{X,Y,\sigma}(x, y) along the line x = y [42],

    V_{\sqrt{2}\sigma}(X, Y) = \int f_{X,Y,\sigma}(x, y)|_{x=y=u} \, du.           (5-9)

[Figure 5-1. Contours of CIM(X,0) in 2D sample space (kernel size is set to 1)]

    Property 4: Correntropy, as a sample estimator, induces a metric in the sample
space. Given two vectors X = [x_1, x_2, ..., x_N]^T and Y = [y_1, y_2, ..., y_N]^T in the sample
space, the function CIM(X, Y) = (\kappa_\sigma(0) - V(X, Y))^{1/2}, where \kappa_\sigma is the Gaussian kernel in
(5-4) with \kappa_\sigma(0) = 1/(\sqrt{2\pi}\sigma), defines a metric in the sample space and is named the
Correntropy Induced Metric (CIM) [42]. Therefore, correntropy can be the metric for
similarity measurement in feature space.
    Figure 5-1 shows the contours of the distance from X to the origin in a two dimensional
space. The interesting observation from the figure is as follows: when X is close to zero,
CIM behaves like an L2 norm, which is clear from the Taylor expansion in (5-6); further
out CIM behaves like an L1 norm; eventually, as X departs from the origin, the metric
saturates and becomes insensitive to distance (approaching an L0 norm). Larger differences
saturate, so the metric is less sensitive to large deviations, which makes it more robust.
This property inspired us to investigate the inherent robustness of CIM to outliers. The
kernel size controls this very interesting behavior of the metric across neighborhoods. A
small kernel size leads to a tight linear (Euclidean) region and to a large L0 region, while
a larger kernel size will enlarge the linear region. In this dissertation, we mathematically
prove that 1) when the kernel size goes to infinity, the CIM norm is equivalent to the
L2 norm and 2) when the kernel size goes to zero (from the positive side), the CIM is
equivalent to the L0 norm.
    Let us define E = X - Y = [e_1, e_2, ..., e_N]^T; then

    CIM(E) = [\kappa_\sigma(0) - \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(e_i)]^{1/2}
           = \{ \frac{1}{2\sqrt{2\pi} N \sigma^3} [ 2\sigma^2 \sum_{i=1}^{N} (1 - \exp(-e_i^2 / 2\sigma^2)) ] \}^{1/2}.      (5-10)

First, let us take a look at the following limit:

    \lim_{\sigma \to \infty} 2\sigma^2 (1 - \exp(-e_i^2 / 2\sigma^2))
        = \lim_{t \to 0^+} \frac{1 - \exp(-t e_i^2)}{t}          (t := 1/2\sigma^2)
        = \lim_{t \to 0^+} \frac{\exp(-t e_i^2) e_i^2}{1}        (L'Hospital)
        = e_i^2.                                                                   (5-11)

Therefore,

    \lim_{\sigma \to \infty} (2\sqrt{2\pi} N \sigma^3)^{1/2} CIM(E) = \|E\|_2.     (5-12)

Second, look at the following limit:

    \lim_{\sigma \to 0^+} (1 - \exp(-e_i^2 / 2\sigma^2)) = 0 if e_i = 0,  1 if e_i \neq 0.      (5-13)

Therefore,

    \lim_{\sigma \to 0^+} \sqrt{2\pi} \sigma N [CIM(E)]^2 = \|E\|_0.               (5-14)

(Given a vector X, the Lp norm of X is defined by \|X\|_p = ( \sum_{i=1}^{N} |x_i|^p )^{1/p}, where p is a
real number with p >= 1. The L0 norm of X is defined as \lim_{p \to 0} \|X\|_p^p, that is, the zero
norm of X is simply the number of non-zero elements of X. Despite its name, the zero norm is
not a true norm; in particular, it is not positive homogeneous.)
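The two limits above can be checked numerically; a minimal sketch (Python/NumPy, illustrative only):

    import numpy as np

    def cim(X, Y, sigma):
        # Correntropy induced metric, CIM(X, Y) = (kappa(0) - V_hat(X, Y))^(1/2), cf. Property 4
        k0 = 1.0 / (np.sqrt(2 * np.pi) * sigma)
        e = np.asarray(X) - np.asarray(Y)
        v = np.mean(np.exp(-e**2 / (2 * sigma**2))) * k0
        return np.sqrt(k0 - v)

    E = np.array([0.1, 0.0, 3.0, 0.0])          # error vector with one large component
    N = len(E)
    zero = np.zeros(N)

    sigma = 100.0                               # large kernel size: scaled CIM ~ L2 norm, cf. (5-12)
    print((2 * np.sqrt(2 * np.pi) * N * sigma**3)**0.5 * cim(E, zero, sigma), np.linalg.norm(E, 2))

    sigma = 1e-3                                # small kernel size: scaled CIM^2 ~ L0 norm, cf. (5-14)
    print(np.sqrt(2 * np.pi) * sigma * N * cim(E, zero, sigma)**2, np.count_nonzero(E))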

    Property 5: Given data samples {x_i}_{i=1}^{d}, the correntropy kernel creates another data
set {f(x_i)}_{i=1}^{d} preserving the similarity measure as

    V(i, j) = E[\kappa_\sigma(x_i - x_j)] = E[f(x_i) f(x_j)].                      (5-15)

The proof of property 5 is in Appendix B.
    According to property 5, there exists a scalar nonlinear mapping f which makes the
correntropy of x_i the correlation of f(x_i). Equation (5-15) allows the computation of the correlation
in feature space by the correntropy function in the input space [45][46].
5.2 The Correntropy MACE Filter

    According to the RKHS perspective of the MACE filter in chapter 4, we can extend it
immediately to the VRKHS (in this dissertation, we call the RKHS induced by correntropy
the VRKHS). Applying the correntropy concept to the MACE formulation of chapter 4,
the definition of the correlation in (4-15) shall be substituted by

    V(i, j) = \frac{1}{2d-1} \sum_{n=1}^{2d-1} \kappa_\sigma(U_{in} - U_{jn}),   i, j = 1, ..., d,      (5-16)

where U_{in} is the (i, n)th element in (4-13). This function is positive definite and thus
induces the VRKHS. According to Mercer's theorem [35], there is a basis {\psi_i, i =

1, ..., d} in this VRKHS such that

    \langle \psi_i, \psi_j \rangle = V(i, j),   i, j = 1, ..., d.                  (5-17)

Since it is a d dimensional Hilbert space, it is isomorphic to any d dimensional real vector
space equipped with the standard inner product structure. After an appropriate choice
of this isomorphism {\psi_i, i = 1, ..., d}, which is nonlinearly related to the input space,
a nonlinear extension of the MACE filter can be readily constructed on this VRKHS,
namely, finding a vector v_0 = \sum_{i=1}^{d} f_{h_i} \psi_i with f_h = [f_{h_1} ... f_{h_d}]^T as coordinates such
that v_0^T v_0 is minimized subject to the constraint that the dth component of v_0 is some
pre-specified constant.
    Let the ith image vector be x_i = [x_i(1) x_i(2) ... x_i(d)]^T and the filter be h =
[h(1) h(2) ... h(d)]^T, where T denotes transpose. From property 5, the CMACE filter can
be formulated in feature space by applying a nonlinear mapping function f onto the data
as well as the filter. We denote the transformed training image matrix and filter vector,
whose sizes are d x N and d x 1, respectively, by

    F_X = [f_{x_1}, f_{x_2}, ..., f_{x_N}],                                        (5-18)

    f_h = [f(h(1)) f(h(2)) ... f(h(d))]^T,                                         (5-19)

where f_{x_i} = [f(x_i(1)) f(x_i(2)) ... f(x_i(d))]^T for i = 1, ..., N. Given data samples, the
cross correntropy between the ith training image vector and the filter can be estimated as

    v_{oi}[m] = \frac{1}{d} \sum_{n=1}^{d} f(h(n)) f(x_i(n - m)),                  (5-20)

for all the lags m = -d+1, ..., d-1. Then the cross correntropy vector v_{oi} can be formed
including all the lags of v_{oi}[m], denoted by

    v_{oi} = S_i f_h,                                                              (5-21)

where S_i is the matrix of size (2d-1) x d given by

    S_i = \begin{bmatrix}
            f(x_i(d))   & 0           & \cdots & 0           \\
            f(x_i(d-1)) & f(x_i(d))   & \cdots & 0           \\
            \vdots      & \vdots      & \ddots & \vdots      \\
            f(x_i(1))   & f(x_i(2))   & \cdots & f(x_i(d))   \\
            0           & f(x_i(1))   & \cdots & f(x_i(d-1)) \\
            \vdots      & \vdots      & \ddots & \vdots      \\
            0           & 0           & \cdots & f(x_i(1))
          \end{bmatrix}.                                                           (5-22)

Since the scale factor 1/d has no influence on the solution, it will be ignored throughout
the dissertation. The correntropy energy of the ith image is given by

    E_i = v_{oi}^T v_{oi} = f_h^T S_i^T S_i f_h.                                   (5-23)

Denoting V_i = S_i^T S_i and using the definition of correntropy in (5-15), the d x d
correntropy matrix V_i is

    V_i = \begin{bmatrix}
            v_i(0)   & v_i(1)   & \cdots & v_i(d-1) \\
            v_i(1)   & v_i(0)   & \cdots & v_i(d-2) \\
            \vdots   & \vdots   & \ddots & \vdots   \\
            v_i(d-1) & \cdots   & v_i(1) & v_i(0)
          \end{bmatrix},                                                           (5-24)

where each element of the matrix is computed, without explicit knowledge of the
mapping function f, by

    v_i(l) = \sum_{n=1}^{d} \kappa_\sigma(x_i(n) - x_i(n+l)),                      (5-25)

for l = 0, ..., d-1. The average correntropy energy over all the training data can be
written as

    E_{av} = \frac{1}{N} \sum_{i=1}^{N} E_i = f_h^T V_X f_h,                       (5-26)

where

    V_X = \frac{1}{N} \sum_{i=1}^{N} V_i.                                          (5-27)
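For illustration, a small sketch (Python/NumPy, not from the dissertation) of how the average correntropy matrix V_X of (5-25) and (5-27) can be estimated from training images; the handling of the lag sum at the image boundary (truncating to the valid overlap) is an assumption made for the sketch:

    import numpy as np

    def gaussian_kernel(e, sigma):
        return np.exp(-e**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def correntropy_matrix(X, sigma):
        # Average correntropy matrix V_X of (5-27); each V_i is a symmetric Toeplitz matrix
        # built from the lag values v_i(l) of (5-25), summed over the valid overlap n = 1..d-l.
        d, N = X.shape
        idx = np.abs(np.arange(d)[:, None] - np.arange(d)[None, :])   # |row - col| -> lag index
        V = np.zeros((d, d))
        for i in range(N):
            x = X[:, i]
            lags = np.array([np.sum(gaussian_kernel(x[:d - l] - x[l:], sigma)) for l in range(d)])
            V += lags[idx]                                            # V_i, cf. (5-24)
        return V / N

    X = np.random.rand(64, 5)                                         # 5 toy images of 64 pixels
    V_X = correntropy_matrix(X, sigma=0.5)
    print(V_X.shape, np.allclose(V_X, V_X.T))                         # (64, 64) True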

Since our objective is to minimize the average correntropy energy in feature space, the
optimization problem is formulated as

    \min f_h^T V_X f_h   subject to   F_X^T f_h = c,                               (5-28)

where c is the desired vector for all the training images. The constraint in (5-28) means
that we specify the correntropy values between the training input and the filter as the
desired constant. Since the correntropy matrix V_X is positive definite, there exists an
analytic solution to the optimization problem using the method of Lagrange multipliers in
the new finite dimensional VRKHS. The CMACE filter in feature space then becomes

    f_h = V_X^{-1} F_X (F_X^T V_X^{-1} F_X)^{-1} c.                                (5-29)

    Unlike the KSDF in (3-11), which lives in the infinite dimensional RKHS created by the
conventional kernel method, the CMACE filter is defined in the finite dimensional VRKHS,
which has the same dimensionality as the input space, with size d x 1. In general,
the kernel method creates an infinite dimensional feature space, so the solution often needs
regularization to remain bounded. Therefore, the KSDF may need additional regularization
terms for better performance. In terms of computational complexity, the CMACE requires,
compared to the KSDF, an additional O(d^3) operation and O(d^2) storage for
V_X^{-1}. This calls for a fast version of the CMACE as well as a dimensionality
reduction method for practical applications.

5.3 Implications of the CMACE Filter in the VRKHS
5.3.1 Implication of Nonlinearity

    From (5-17) and (5-22), we can say that the RKHS induced by correntropy (VRKHS)
is a Hilbert space spanned by the basis {\psi_i}_{i=1}^{d} of size (2d-1) x 1 given by

    \psi_i = [0, ..., 0, f(x(d)), ..., f(x(1)), 0, ..., 0]^T,                      (5-30)

where f(\cdot) is a nonlinear scalar function and f(x(d)) is located in the ith element. It
is obvious that, unlike the RKHS induced by correlation, the VRKHS for the CMACE
filter is nonlinearly related to the original input space. This statement can be simply
exemplified by the CIM metric. Suppose a vector x = [x_1, 0, ..., 0]^T and the origin
of the input space y = [0, ..., 0]^T. Then the Euclidean distance in the input space is
given by ED(x, y) = x_1 and the distance in the VRKHS becomes CIM(x, y) = ( \frac{1}{N} (\kappa_\sigma(0)
- \kappa_\sigma(x_1)) )^{1/2}, where N is the dimension of the vectors. Now, scaling both vectors by \alpha,
we obtain new vectors \tilde{x} = \alpha x and \tilde{y} = \alpha y. The Euclidean distance between \tilde{x} and \tilde{y}
becomes ED(\tilde{x}, \tilde{y}) = \alpha ED(x, y), whereas CIM(\tilde{x}, \tilde{y}) = ( \frac{1}{N} (\kappa_\sigma(0) - \kappa_\sigma(\alpha x_1)) )^{1/2} \neq \alpha CIM(x, y).
    Also, as \alpha goes to infinity, the Euclidean distance in the input space linearly increases
too. However, the CIM distance saturates. This shows that distances in the VRKHS are
not linearly related to distances in the input space. This argument can also be observed
directly from the CIM contour in Figure 5-1.
In addition, since correntropy is different from correlation in the sense that it involves
high-order statistics of input signals, inner products in the RKHS induced by correntropy
are no longer equivalent to statistical inference on Gaussian processes. The transformation
from the input space to VRKHS is nonlinear and the inner product structure of VRKHS
provides the possibility of obtaining closed form optimal nonlinear filter solutions by
utilizing second and high-order statistics.


5.3.2 Finite Dimensional Feature Space

    Another important difference compared with existing machine learning methods based
on the conventional kernel method, which normally yields an infinite dimensional feature
space, is that the VRKHS has the same dimension as the input space. In the conventional
MACE, the template h has d degrees of freedom and all the image data are in the d
dimensional Euclidean space. As derived above, all the transformed images belong to a
different d dimensional vector space equipped with the inner product structure defined
by the correntropy. The goal of this new algorithm is to find a template f_h in this
VRKHS such that the cost function is minimized subject to the constraint. Therefore,
the number of degrees of freedom of this optimization problem is still d, so regularization, which
would be needed in traditional kernel methods, is not necessary here. Further work needs to be
done regarding this point, but we hypothesize that in our methodology, regularization
is automatically achieved by the kernel through the expected value operator (which
corresponds to a density matching step utilized to evaluate correntropy). The fixed
dimensionality also carries disadvantages because the user has no control of the VRKHS
dimensionality. Therefore, the quality of the nonlinear solution depends solely on the
nonlinear transformation between the input space and VRKHS. The theoretical advantage
of using this feature space is justified by the CIM metric, which is very suitable to
quantify similarity in feature spaces and should improve the robustness to outliers of the
conventional MACE.
5.3.3 The Kernel Correlation Filter vs. the CMACE Filter: Prewhitening in Feature Space

    One of the kernel methods applied to correlation filters is the kernel class-dependent
feature analysis (KCFA) [28]. The KCFA is a kernelized version of the linear MACE filter
obtained by applying the kernel trick after a prewhitening preprocessing step. The correlation
output of the MACE filter h for an input image vector z can be expressed as

    y_{mace} = \tilde{Z}^T \tilde{X} (\tilde{X}^T \tilde{X})^{-1} c,               (5-31)

where \tilde{Z} = D^{-1/2} Z and \tilde{X} = D^{-1/2} X indicate the pre-whitened versions of Z and X in
the frequency domain. Then (5-31) is equivalent to the linear SDF with prewhitened data, and
applying the kernel trick yields the KCFA as follows [27]:

    y_{KCF} = K_{ZX} K_{XX}^{-1} c,                                                (5-32)

where the (i, j)th elements of the matrices K_{XX} and K_{ZX} are computed by

    (K_{XX})_{ij} = \sum_{k=1}^{d} \kappa(\tilde{x}_{ki} - \tilde{x}_{kj}),   i, j = 1, 2, ..., N,      (5-33)

    (K_{ZX})_{ij} = \sum_{k=1}^{d} \kappa(\tilde{z}_{ki} - \tilde{x}_{kj}),   i = 1, 2, ..., L,  j = 1, 2, ..., N,      (5-34)

where N is the number of training images and L is the number of test input images.
    In the CMACE, if we denote \tilde{F}_X = V_X^{-1/2} F_X, we can decompose f_h as

    f_h = V_X^{-1/2} V_X^{-1/2} F_X (F_X^T V_X^{-1/2} V_X^{-1/2} F_X)^{-1} c
        = V_X^{-1/2} \tilde{F}_X (\tilde{F}_X^T \tilde{F}_X)^{-1} c.               (5-35)

The main difference between the CMACE and the KCFA is the prewhitening process. In the
KCFA, prewhitening is conducted in the input space using D; on the other hand, in the
CMACE, (5-35) implies that the image is implicitly whitened in the feature space by the
correntropy matrix V_X. In the space domain MACE filter, the autocorrelation matrix
can be used as a preprocessor for prewhitening. Since the CMACE filter uses the same
formulation in the feature space, we can also expect that the correntropy matrix can be
used for prewhitening. However, in practice, we cannot obtain the whitened data explicitly
since the mapping function is not explicitly known. In addition, the solution of the KCFA
is defined in an infinite dimensional feature space like the KSDF; therefore an additional
regularization term may be needed for better performance.


CHAPTER 6
THE CORRENTROPY MACE IMPLEMENTATION
6.1 The Output of the CMACE Filter

    Since the nonlinear mapping function f is not explicitly known, it is impossible to
directly use the CMACE filter f_h in the feature space. However, the correntropy output
can be obtained by the inner product between the transformed input image and the
CMACE filter in the VRKHS. In order to test this filter, let Z be the matrix of L vector
testing images and F_Z be the transformed matrix of Z; then the L x 1 output vector is
given by

    y = F_Z^T V_X^{-1} F_X (F_X^T V_X^{-1} F_X)^{-1} c.                            (6-1)

Here, we denote T_{ZX} = F_Z^T V_X^{-1} F_X and T_{XX} = F_X^T V_X^{-1} F_X. Then the output becomes

    y = T_{ZX} (T_{XX})^{-1} c,                                                    (6-2)

where T_{XX} is an N x N symmetric matrix and T_{ZX} is an L x N matrix whose (i, j)th
elements are expressed by

    (T_{XX})_{ij} = \sum_{l=1}^{d} \sum_{k=1}^{d} w_{lk} f(x_i(k)) f(x_j(l))
                 \approx \sum_{l=1}^{d} \sum_{k=1}^{d} w_{lk} \kappa_\sigma(x_i(k) - x_j(l)),   i, j = 1, ..., N,      (6-3)

    (T_{ZX})_{ij} = \sum_{l=1}^{d} \sum_{k=1}^{d} w_{lk} f(z_i(k)) f(x_j(l))
                 \approx \sum_{l=1}^{d} \sum_{k=1}^{d} w_{lk} \kappa_\sigma(z_i(k) - x_j(l)),   i = 1, ..., L,  j = 1, ..., N,      (6-4)

where w_{lk} is the (l, k)th element of V_X^{-1}.
    The final output expressions in (6-3) and (6-4) are obtained by approximating
f(x_i(k)) f(x_j(l)) and f(z_i(k)) f(x_j(l)) by \kappa_\sigma(x_i(k) - x_j(l)) and \kappa_\sigma(z_i(k) - x_j(l)),
respectively, which is similar to the kernel trick and holds on average because of property
5. Unfortunately, (6-3) and (6-4) involve weighted versions of these functionals, therefore the
error in the approximation requires further theoretical investigation.
The CMACE is formulated in the linear VRKHS but has a nonlinear behavior since
the VRKHS is nonlinearly related to the input space. However, the CMACE preserves
the shift-invariant property of the linear MACE. The proof of the shift-invariant property
is given in Appendix C. Although the output of the CMACE gives us only one value, it
is possible to construct the whole output plane by shifting the test input image and as a
result, the shift invariance property of the correlation filters can be utilized at the expense
of more computation. Applying an appropriate threshold to the output of (6-1), one can
detect and recognize the testing data without generating the composite filter in feature
space. As will be shown in the simulation results section, even with this approximation,
the CMACE outperforms the conventional MACE.
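For concreteness, a sketch of the test-output computation (6-2)-(6-4) (Python/NumPy; illustrative only; it assumes V_X has already been estimated, for example with the sketch following (5-27), and uses the kernel-trick approximation of property 5):

    import numpy as np

    def gaussian_kernel(e, sigma):
        return np.exp(-e**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def cmace_outputs(X, Z, V_X, sigma, c=None):
        # X: d x N training images, Z: d x L test images (columns), V_X: d x d correntropy matrix.
        d, N = X.shape
        L = Z.shape[1]
        c = np.ones(N) if c is None else c
        W = np.linalg.inv(V_X)                                # weights w_lk = (V_X^{-1})_lk

        def t_entry(a, b):                                    # sum_{l,k} w_lk kappa(a(k) - b(l)), cf. (6-3)/(6-4)
            return np.sum(W * gaussian_kernel(b[:, None] - a[None, :], sigma))

        TXX = np.array([[t_entry(X[:, i], X[:, j]) for j in range(N)] for i in range(N)])
        TZX = np.array([[t_entry(Z[:, i], X[:, j]) for j in range(N)] for i in range(L)])
        return TZX @ np.linalg.solve(TXX, c)                  # y = T_ZX (T_XX)^{-1} c, cf. (6-2)

    X = np.random.rand(64, 4)                                 # 4 toy "images" of 64 pixels
    # quick stand-in for V_X just to run the demo (not the lag-based estimate of (5-25))
    V = np.mean([gaussian_kernel(X[:, i][:, None] - X[:, i][None, :], 0.5) for i in range(4)], axis=0)
    print(np.round(cmace_outputs(X, X, V, sigma=0.5), 3))     # on the training set, y equals c = 1

When Z equals the training set, T_ZX = T_XX and the outputs reproduce the constraint vector c exactly, mirroring the perfect training-set responses reported in the simulations.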
6.2 Centering of the CMACE in Feature Space

With the Gaussian kernel, the correntropy value is always positive, which brings the
need to subtract the mean of the transformed data in feature space in order to suppress
the effect of the output DC bias. This centering of the correntropy should not be confused
with the spatial centering of the input images.
    Given d data samples {x(i)}_{i=1}^{d}, let us denote the mean of the transformed data in
feature space as E[f(x(i))] = m_f; then the centered correntropy, which can properly be
called the generalized covariance function, is given by

    V_c(i, j) = E[{f(x(i)) - m_f}{f(x(j)) - m_f}]
              = E[f(x(i)) f(x(j))] - m_f^2
              = V(i, j) - m_f^2.                                                   (6-5)

The square of the mean of the transformed data f(\cdot) coincides with the estimate of the
information potential of the original data, that is,

    m_f^2 = \frac{1}{d^2} \sum_{i=1}^{d} \sum_{j=1}^{d} \kappa_\sigma(x(i) - x(j)).      (6-6)

In order to show the validity of (6-6), let us consider the sample estimation of correntropy
(ignoring the scalar factor 1/d); then we have

    \sum_{i=1}^{d} f(x(i)) f(x(i+t)) = \sum_{i=1}^{d} \kappa_\sigma(x(i) - x(i+t)).      (6-7)

We arrange the double summation (6-6) as an array and sum along the diagonal direction,
which yields exactly the autocorrelation function of the transformed data at different lags;
thus it can be written in terms of the correntropy function of the input data at different lags as

    \frac{1}{d^2} \sum_{i=1}^{d} \sum_{j=1}^{d} f(x(i)) f(x(j))
        = \frac{1}{d^2} \{ \sum_{t=0}^{d-1} \sum_{i=1}^{d-t} f(x(i)) f(x(i+t)) + \sum_{t=1}^{d-1} \sum_{i=1+t}^{d} f(x(i)) f(x(i-t)) \}
        \approx \frac{1}{d^2} ( \sum_{t=0}^{d-1} \sum_{i=1}^{d-t} \kappa_\sigma(x(i) - x(i+t)) + \sum_{t=1}^{d-1} \sum_{i=1+t}^{d} \kappa_\sigma(x(i) - x(i-t)) )
        = \frac{1}{d^2} \sum_{i=1}^{d} \sum_{j=1}^{d} \kappa_\sigma(x(i) - x(j)).      (6-8)

As we see in (6-8), when the summation is far from the main diagonal, smaller and
smaller data sizes are involved, which leads to a poor approximation. Notice that this is
exactly the same problem encountered when the autocorrelation function is estimated from
windowed data. However, when d is large, the approximation improves. Therefore, in the CMACE
output equation (6-1), we can use the centered correntropy matrix V_{XC}, obtained by subtracting
the information potential from the correntropy matrix V_X as

    V_{XC} = V_X - m_{f,avg}^2 1_{d x d},                                          (6-9)

where m_{f,avg}^2 is the average estimated information potential over the N training images and
1_{d x d} is a d x d matrix with all entries equal to 1. Using the centered correntropy
matrix V_{XC}, a better rejection ability for out-of-class images is achieved, since the offset of
the output can be removed except for the center value in the case of training images.
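A short sketch of this centering step (Python/NumPy; illustrative only, assuming vectorized training images as columns of X):

    import numpy as np

    def gaussian_kernel(e, sigma):
        return np.exp(-e**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def information_potential(x, sigma):
        # m_f^2 estimate = (1/d^2) sum_i sum_j kappa(x(i) - x(j)), cf. (6-6)
        d = len(x)
        return np.sum(gaussian_kernel(x[:, None] - x[None, :], sigma)) / d**2

    def center_correntropy_matrix(V_X, X, sigma):
        # V_XC = V_X - m_{f,avg}^2 * 1_{dxd}, cf. (6-9), with m_{f,avg}^2 averaged over the N images
        d, N = X.shape
        m2 = np.mean([information_potential(X[:, i], sigma) for i in range(N)])
        return V_X - m2 * np.ones((d, d))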
6.3 The Fast CMACE Filter

In practice, the drawback of the proposed CMACE filter is its computation


complexity. In the MACE, the correlation output can be obtained by multiplication in
the frequency domain and the computation time can be drastically reduced by the FFT.
However, in the CMACE, the output is obtained by computing
the product of the two matrices in (6-3) and (6-4), which depend on the image size and
the number of training images. Each element involves a double summation of weighted
kernel functions. Therefore, each element of the matrix requires O(d^2) computations,
where d is the number of image pixels. When the number of training images is N, the
total computational complexity for one test output is O(d^2 N + N^2). A similar argument
shows that the computation needed for training is O(d^2 (N^2 + 1) + N^2). On the other
hand, the MACE only requires O(4(d(2N^2 + N + 2) + N^2) + N d log_2(d)) for training
and O(4d + d log_2(d)) for testing one input image. Table 6-1 shows the computational
complexity of the MACE and CMACE. More details about the required computation costs
are given in Appendix D. Constructing the whole output plane would significantly increase
the computational complexity of the CMACE. This quickly becomes too demanding
in practical settings. Therefore a method to simplify the computation is necessary for
practical implementations.
Here the Fast Gauss Transform (FGT) [47] is proposed to reduce the computation
time with a very small approximation error. The FGT is one of a class of very interesting
and important families of fast evaluation algorithms that have been developed over the
past decades to enable rapid calculation of weighted sums of Gaussian functions with
arbitrary accuracy. In nonparametric probability density estimation with Gaussian kernel,
the FGT can reduce the complexity of O(dM ) to O(d + M ) for M evaluations with d
sources.


6.3.1 The Fast Gauss Transform

    In many problems in mathematics and engineering, the function of interest can be
decomposed into sums of pairwise interactions among a set of sources. In particular, this
type of problem is found in nonparametric probability density estimation as

    G(z) = \sum_{j=1}^{d} q_j \kappa(z - x(j)),                                    (6-10)

where \kappa is a kernel function centered at the source points x(j) and the q_j are scalar weighting
coefficients. With the Gaussian kernel, (6-10) can be interpreted as a Gaussian potential
field due to sources of strengths q_j at the points x(j), evaluated at the target point z.
Suppose that we have M evaluation target points; then the computation of (6-10) requires
O(dM) calculations, which constrains the computation bandwidth for large data sets
d and M in real world applications. The Fast Gauss Transform (FGT) can reduce the
complexity to O(d + M) for (6-10). The FGT is one of a class of very interesting and
important families of fast evaluation algorithms that have been developed over the past
decades to enable rapid calculation of approximations at arbitrary accuracy. The basic
idea is to cluster the sources and target points using appropriate data structures and
the Hermite expansion, and then reduce the number of summations with a given level of
precision.
6.3.2 The Fast Correntropy MACE Filter

    The major part of the computation burden in the correntropy MACE filter is given by

    T = \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} e^{-(z(i) - x(j))^2 / 2\sigma^2}.     (6-11)

This is very similar to the density estimation problem, evaluated at d targets z(i) with
d given source samples x(j). However, the weighting factors w_{ij} in (6-11) depend
on both the target and the source, which is different from the original FGT applications, where
the weight vector is the same at every evaluation target point. In our case, the
weight vector w_i = [w_{i1}, ..., w_{id}]^T varies with the evaluation point z(i). We can say

that (6-11) is a more general expression than the original FGT formulation, and it can be
written as

    T = \sum_{i=1}^{d} G_i(z),                                                     (6-12)

where

    G_i(z) = \sum_{j=1}^{d} w_{ij} \kappa_\sigma(z(i) - x(j)).                     (6-13)

This means that clustering and the Hermite expansion would have to be performed at every
target z(i) with a different weight vector w_i, which causes extra computation for
clustering. However, since the sources are clustered in the FGT, if one expresses the
clustered sources about their center by the Hermite expansion, then there is no need to do
the clustering and the Hermite expansion at every evaluation. The only thing that is necessary
is to use a different weight vector at every evaluation point. This process does not require
additional complexity compared to the original FGT formulation, except that more storage
is required to keep the weight vectors. Using the Hermite expansion around an expansion
center s, the Gaussian centered at x(j) evaluated at z(i) can be obtained by

    \exp( -\frac{(z(i) - x(j))^2}{2\sigma^2} ) = \sum_{n=0}^{p-1} \frac{1}{n!} ( \frac{x(j) - s}{\sqrt{2}\sigma} )^n h_n( \frac{z(i) - s}{\sqrt{2}\sigma} ) + \epsilon(p),      (6-14)

where the Hermite function h_n(x) is defined by

    h_n(x) = (-1)^n \frac{d^n}{dx^n} ( \exp(-x^2) ).                               (6-15)

Also, in this research, we use a simple greedy algorithm for clustering [48], which computes
a data partition with a maximum radius at most twice the optimum. This clustering
method together with the Hermite expansion of order p requires O(pd). In the case of (6-3)
and (6-4), since the numbers of sources and targets are the same, they can be interchanged,
that is, the test image can be the source, so that the clustering and Hermite expansion can

Table 6-1. Estimated computational complexity for training with N images and testing
with one image. Matrix inversion and multiplication are considered
(in this simulation, d = 4096, N = 60, p = 4, k_c = 4).

              Training (off line)                              Testing (on line)
MACE          O(4(d(2N^2 + N + 2) + N^2) + N d log_2(d))       O(4d + d log_2(d))
CMACE         O(d^2 (N^2 + 1) + N^2)                           O(d^2 N + N^2)
Fast CMACE    O(N^2 p d(k_c + 1) + d^2 + N^2)                  O(p d(k_c + 1) N + N^2)

be done only one time per test. Thus T in (6-11) can be approximated by

    T \approx \sum_{i=1}^{d} \sum_{B} \sum_{n=0}^{p-1} \frac{1}{n!} h_n( \frac{x(i) - s_B}{\sqrt{2}\sigma} ) C_n(B),      (6-16)

where B represents a cluster with center s_B and C_n(B) is given by

    C_n(B) = \sum_{z(j), w_{ij} \in B} w_{ij} ( \frac{z(j) - s_B}{\sqrt{2}\sigma} )^n.      (6-17)

From (6-16), we can see that evaluating k_c expansions at all the evaluation points
costs O(p k_c d), so the total number of operations is O(p d(k_c + 1)) per computation of
each element in (6-3) and (6-4). The final aim is to obtain the output of the CMACE
filter with N training images and L test images. In order to compute the output of one
test image, the original direct method requires O(d^2 N(N + 1)) operations to obtain T_XX
and T_ZX, and the operation count reduces to O(p d(k_c + 1) N(N + 1)) by
applying this enhanced FGT. Typically p and k_c are around 4, while d and N are 4,096
and around 100, respectively, in our application, which results in a computational saving
of roughly 100 times. Additionally, clustering with the test image is performed only once
per test, which reduces the computation time even more. However, from Table 6-1,
we see that the computational complexity of the CMACE for testing still depends on
the number of training images, resulting in more computations than the MACE. More
work is necessary to reduce even further the computation time of the CMACE and its
memory storage requirements, but the proposed approach enables practical applications
with present day computers.
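As a small sanity check on the truncated expansion (6-14), the following sketch (Python/NumPy; illustrative only, not the dissertation's MATLAB implementation) builds h_n via the standard Hermite polynomial recurrence and compares the truncated series with the exact Gaussian:

    import numpy as np
    from math import factorial

    def hermite_functions(t, p):
        # h_n(t) = exp(-t^2) H_n(t) for n = 0..p-1, using the recurrence
        # H_0 = 1, H_1 = 2t, H_{n+1} = 2t H_n - 2n H_{n-1}; cf. (6-15)
        H = [1.0, 2.0 * t]
        for n in range(1, p):
            H.append(2.0 * t * H[n] - 2.0 * n * H[n - 1])
        return [np.exp(-t**2) * H[n] for n in range(p)]

    def gaussian_hermite_approx(z, x, s, sigma, p):
        # Truncated p-term expansion of exp(-(z-x)^2 / 2 sigma^2) about the center s, cf. (6-14)
        u = (z - s) / (np.sqrt(2) * sigma)
        v = (x - s) / (np.sqrt(2) * sigma)
        h = hermite_functions(u, p)
        return sum(v**n / factorial(n) * h[n] for n in range(p))

    z, x, s, sigma = 0.7, 0.4, 0.5, 0.3
    exact = np.exp(-(z - x)**2 / (2 * sigma**2))
    for p in (2, 4, 8):
        print(p, abs(gaussian_hermite_approx(z, x, s, sigma, p) - exact))  # error shrinks as p grows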


CHAPTER 7
APPLICATIONS OF THE CMACE TO
IMAGE RECOGNITION
7.1 Face Recognition
7.1.1 Problem Description

    In this section, we show the performance of the proposed correntropy MACE filter for
face image recognition. In the simulations, we used the same facial expression database
used in chapter 3. We used only 5 images to composite the template (filter) per person
(the MACE filter shows a reasonable recognition result with a small number of training
images in this database [5]). We picked and report the results of the two most difficult
subjects, who produced the worst performance with the conventional MACE method. We
test with all the images of each person's data set, resulting in 75 outputs for each class.
The simulation results have been obtained by averaging (Monte-Carlo approach) over 100
different training sets (each training set consists of 5 randomly chosen images) to minimize
the problem of performance differences due to splitting the relatively small database
into training and testing sets. The kernel size, \sigma, is chosen to be 10 for the correntropy
matrix during training and 30 for the test output. In this data set, it has been observed that
a kernel size around 30%-50% of the standard deviation of the input data is
appropriate. Moreover, we can control the performance by choosing a different kernel size
during training for prewhitening.
7.1.2 Simulation Results

Figure 7-1 shows the average test output peak values for image recognition. The
desired output peak value should be close to one when the test image belongs to the
training image class (true class) and otherwise it should be close to zero. Figure 7-1 (Top)
shows that the correlation output peak values of the conventional MACE for false classes
are close to zero, which means that the MACE has a good ability to reject the false class.
However, some outputs in the test image set, even in the true class, are not recognized as
the true class.

[Figure 7-1. The averaged test output peak values (100 Monte-Carlo simulations with N=5),
(Top): MACE, (Bottom): CMACE.]

[Figure 7-2. The test output peak values with additive Gaussian noise (N=5), (Top): MACE,
circle-true class with SNR=10dB, cross-false class with SNR=2dB, (Bottom): CMACE,
circle-true class with SNR=10dB, cross-false class with SNR=2dB.]

[Figure 7-3. The comparison of ROC curves with different SNRs.]

    Figure 7-1 (Bottom) shows the output values of the proposed correntropy
MACE, and we can see that the generalization and rejection performance are improved.
As a result, the two images can be recognized well even with a small number of training
images. One of the problems of the conventional MACE is that its performance can be easily
degraded by additive noise in the test image, since the MACE does not have any special
mechanism to account for input noise. Therefore, it has a poor rejection ability for a false
class image when noise is added to the false class. Figure 7-2 (Top) shows the noise effect
on the conventional MACE. When the class images are seriously distorted by additive
Gaussian noise (SNR = 2dB), the correlation output peaks of some test images from the false
class become greater than those of the true class; hence wrong recognition happens. The
results in Figure 7-2 (Bottom) are obtained by the proposed method. The correntropy
MACE shows a much better performance, especially for rejection, even in a very low SNR
environment. Figure 7-3 shows the comparison of ROC curves with different SNRs. In
the conventional MACE, we can see that the false alarm rate increases as the additive
noise power is increased. However, in the proposed method, the probability of detection

with zero false alarm rate is 1. The correntropy MACE shows much better recognition
performance than the conventional MACE.

Table 7-1. Comparison of standard deviations of all the Monte-Carlo simulation outputs
(100 x 75 outputs)
              True          False         True          False
              (No noise)    (No noise)    (SNR:0dB)     (SNR:0dB)
MACE          0.0498        0.0086        0.0527        0.0245
CMACE         0.0488        0.0051        0.0485        0.0038
    One advantage of the proposed method is that it is more robust than the
conventional MACE. That is, the variation of the test output peak value due to a different
training set is smaller than that of the MACE. Figure 7-4 shows the standard deviations
of 100 Monte-Carlo outputs per test input when the test inputs are noisy false class
images. Table 7-1 shows the comparison of the standard deviation of 7500 outputs (100
Monte-Carlo outputs for 75 inputs) for each class. From Table 7-1, we can see that the
variation of the correntropy MACE outputs due to different training sets is much less than
that of the conventional MACE, which tells us that our proposed nonlinear version of
the MACE outperforms the conventional MACE and achieves a robust performance for
distortion-tolerant pattern recognition.
    Table 7-2 shows the area under the ROC for different kernel sizes in the case of
no additive noise. In this simulation, kernel sizes in the range between 0.1 and 15
provide perfect ROC performance. The kernel size obtained by Silverman's rule of
thumb [44], which is given by \sigma_i = 1.06 \hat{\sigma}_i d^{-1/5}, where \hat{\sigma}_i is the standard deviation
of the ith training data and d is the number of samples, is 9.63, and it also results in the best
performance. As expected from the properties of correntropy, it is noticed that correntropy
approaches correlation for large kernel sizes (the ROC area of the MACE is about 0.96).

[Figure 7-4. The comparison of standard deviation of 100 Monte-Carlo simulation outputs
of each noisy false class test image.]
Table 7-2. Comparison of ROC areas with different kernel sizes
Kernel size   ROC area      Kernel size   ROC area
0.1           1             20            0.9901
0.5           1             50            0.9804
1.0           1             100           0.9810
9.6           1             200           0.9820
10.0          1             500           0.9796
15.0          1             1000          0.9518
7.2 Synthetic Aperture Radar (SAR) Image Recognition
7.2.1 Problem Description

    In this section, we show the performance of the proposed correntropy based nonlinear
MACE filter on the SAR image recognition problem in the MSTAR/IU public release
data set [49]. The MSTAR (Moving and Stationary Target Acquisition and Recognition)
data is a standard dataset in the SAR ATR community, allowing researchers to test and
compare their ATR algorithms. The database consists of X-band SAR images with 1 foot
by 1 foot resolution at 15, 17, 30 and 45 degree depression angles. The data was collected
by Sandia National Laboratory (SNL) using the STARLOS sensor. The original dataset
consists of different military vehicles, where the poses (aspect angles) of the vehicles
lie between 0 and 359 degrees, and the target image sizes are 128 x 128 pixels or more.
Since the MACE and the CMACE have a constraint at the origin of the output plane,
we centered all images and cropped the centered images to the size of 64 x 64 pixels (in
practice, with uncentered images, one needs to compute the whole output plane and search
for the peak). The selected area contains the target, its shadow and background clutter.
In this simulation we use the images whose aspect angles lie between 0 and 179 degrees. The
original SAR image is composed of magnitude and phase information, but only the
magnitude data is used in this simulation.
    This dissertation compares the recognition performance of the proposed CMACE filter
against the conventional MACE considering two distortion factors. The first distortion
case is due to a different aspect angle between training and testing, and the second case
is a different depression angle between test and training data. In the simulations, the
performance is measured by observing the test output peak value and creating the ROC
(Receiver Operating Characteristic) curve. The kernel size, \sigma, is chosen to be 0.1 for the
estimation of correntropy in the training images and 0.5 for the test output in (6-3) and (6-4).
The value of 0.1 for the kernel size corresponds to the standard deviation of the training
data, which is consistent with Silverman's rule. Experimentally it was verified that a
larger kernel size for testing provided better results.
7.2.2 Aspect Angle Distortion Case

    In the first simulation, we selected the BTR60 (armored personnel carrier) as a target
(true class) and the T62 (tank) as a confuser (false class). Both of them are taken at a 17
degree depression angle. The goal is to design a filter which will recognize the BTR60
with minimal confusion from the T62. Figure 7-5 (a) shows the training images, which are
used to compose the MACE and the CMACE filters. In order to evaluate the effect of
the aspect angle distortion, training images were selected at every 3 index numbers from a
total of 120 exemplar images for each vehicle (most index numbers have a 2 degree difference
and some have a 1 degree difference in aspect angle).

[Figure 7-5. Case A: Sample SAR images (64x64 pixels) of two vehicle types for a target
chip (BTR60) and a confuser (T62). (a) Training images (BTR60) of aspect angles 0, 35,
124, 159 degrees. (b) Test images from BTR60 of aspect angles 3, 53, 104, 137 degrees.
(c) Test images from the confuser (T62) of aspect angles 2, 41, 103, 137 degrees.]

That means that the total number of
training images used to construct a filter is 40 (N=40). Figure 7-5 (b) shows test images
for the recognition class and (c) represents confusion vehicle images. Testing is conducted
with all of 120 exemplar images for each vehicle. We are interested only in the center of
the output plane, since the images are already centered. The peak output responses over
all exemplars in the test set are shown in Figure 7-6. In the simulation, the constraint
value for the MACE as well as the CMACE filter is one for the training, therefore the
desired output peak value should be close to one when the test image belongs to the target
class and should be close to zero otherwise. Figure 7-6 (Top) shows the correlation output
peak value of the MACE and Figure 7-6 (Bottom) shows the output peak values of the
CMACE filter for both a target and a confuser.
[Figure 7-6. Case A: Peak output responses of testing images for a target chip (circle) and
a confuser (cross): (Top) MACE, (Bottom) CMACE.]

[Figure 7-7. Case A: ROC curves with different numbers of training images.]

    Figure 7-6 illustrates that the results are perfect for both the MACE and the CMACE
within the training images. However, in the MACE filter, most of the peak output values
on test images are less than 0.5. This shows that the MACE output generalizes poorly
for the images of the same class not used in training, which is one of the known drawbacks of
the conventional MACE. For the confuser test images, most of the output values are near
zero, but some are higher than those of the target images, creating false alarms. On the other
hand, for the CMACE, most of the peak output values of test images are above 0.5, which
means that the CMACE generalizes better than the MACE. Also, the rejection performance
for a confuser is better than that of the MACE. As a result, the recognition performance between
the two vehicles is improved by the CMACE, as best quantified in the ROC curves of Figure
7-7. From the ROC curves we can see that the detection ability of the proposed method
is much better than both the MACE and the KCFA. For the KCFA, prewhitened images
are obtained by multiplying by D^{-1/2} in the frequency domain and applying the kernel trick
to the prewhitened images to compute the output in (5-31). A Gaussian kernel with a kernel
size of 5 is used for the KCFA. From the ROC curves in Figure 7-7, we can also see that the
CMACE outperforms the nonlinear kernel correlation filter, in particular for high detection
probabilities.
    Figure 7-8 (a) shows the MACE filter output plane and (b) shows the CMACE filter
output plane, for a test image in the target class not present in the training set. Figure
7-8 (c) and (d) show the case of a confuser (false class) test input. In Figure 7-8 (a)
and (b), we can see that both the MACE and the CMACE produce a sharp peak in the
output plane. However, the peak value at the origin of the CMACE is higher (closer to
the desired value) than that of the MACE. Moreover, the CMACE has fewer sidelobes and
the values of the sidelobes around the origin are lower than those of the MACE. These points
tell us that the detection ability of the proposed method is better than that of the MACE. On
the other hand, for the confuser test input in Figure 7-8 (c) and (d), the output values
around the origin of the CMACE are lower than those of the MACE, which means that
the CMACE has better rejection ability than the MACE.
[Figure 7-8. Case A: The MACE output plane vs. the CMACE output plane. (a) True class
in the MACE. (b) True class in the CMACE. (c) False class in the MACE. (d) False class in
the CMACE.]

    In order to demonstrate the shift-invariant property of the CMACE, we apply the
images of Figure 7-9. The test image was cropped so that the object is shifted by 13 pixels
in both x and y pixel positions. Figure 7-10 shows the output planes of the MACE and
CMACE when the shifted image is used as the test input while all the training images
are centered. In Figure 7-10, the maximum peak value should happen at the position
of (77,77) in the output plane since the object is shifted by 13 pixels in both x and y
directions. In the CMACE output plane, the maximum peak happens at (77,77) and the
value is 0.9585. However, in the MACE, the maximum peak happens at (74,93) with
0.9442 and the value at the position of (77,77) is 0.93. In this test, the CMACE shows
better shift invariance property than the MACE.

[Figure 7-9. Sample images of BTR60 of size (64 x 64) pixels. (a) The cropped image of size
(64 x 64) pixels at the center of the original of size (128 x 128) pixels. (b) The cropped image
of size (64 x 64) pixels at (x - 13, y - 13) of the original of size (128 x 128) pixels.]

[Figure 7-10. Case A: Output planes with shifted true class input image. (a) The MACE
output plane (maximum 0.9442 at (74, 93)). (b) The CMACE output plane (maximum
0.9585 at (77, 77)).]

    The CMACE performance sensitivity to the kernel size is studied next. In order to
find an appropriate kernel size for the CMACE, the easiest step is to apply Silverman's
rule of thumb developed for the kernel density estimation problem, which is given by
\sigma_i = 1.06 \hat{\sigma}_i d^{-1/5}, where \hat{\sigma}_i is the standard deviation of the ith training data and d is the
number of samples [44]. A more principled alternative is to apply cross validation to find
the best kernel size. For cross validation, we use one image of the training set which is not
included in the filter design. Since we are considering images as 1-dimensional vectors, we have
N different training data sets. Therefore, we obtain one proper kernel size, \sigma, by averaging
the N different kernel sizes as \sigma = \frac{1}{N} \sum_{i=1}^{N} \sigma_i. In this simulation, when N = 60, the value
of the kernel size given by Silverman's rule is 0.0185 and the best one from cross validation
is 0.1. Figure 7-11 shows the ROC curves for the kernel size obtained by Silverman's rule
and the one obtained by cross validation. We see that the ROC performance from
Silverman's rule is very close to that of the optimal kernel size from cross validation. Also,
when we increase the kernel size to 10, the performance is similar to that of the MACE.
As expected from the properties of correntropy, it is noticed that correntropy approaches
correlation for large kernel sizes.
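A one-line version of this kernel-size selection (Python/NumPy; illustrative only):

    import numpy as np

    # Silverman's rule per training image, sigma_i = 1.06 * std(x_i) * d^(-1/5), averaged over N images
    def silverman_kernel_size(X):
        d, N = X.shape                                  # columns are vectorized training images
        return float(np.mean(1.06 * X.std(axis=0) * d**(-1.0 / 5.0)))

    X = np.random.rand(4096, 60)                        # 60 toy images of 64 x 64 = 4096 pixels
    print(silverman_kernel_size(X))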
    Table 7-3 shows the area under the ROC for different kernel sizes, and we conclude
that kernel sizes between 0.01 and 1 provide little change in detectability. This may be
surprising when contrasted with the problem of finding the optimal kernel size in density
estimation, but in correntropy the kernel size enters in the argument of an expected value
and plays a different role in the final solution, namely it controls the balance between the
effect of second order moments versus the higher order moments (see property 1).
[Figure 7-11. The ROC comparison with different kernel sizes.]

Table 7-3. Case A: Comparison of ROC areas with different kernel sizes
Kernel size   ROC area      Kernel size   ROC area
0.01          0.9623        0.6           0.9806
0.0185        0.9686        0.7           0.9771
0.05          0.9631        0.8           0.9754
0.1           0.9847        0.9           0.9749
0.2           0.9865        1.0           0.9602
0.3           0.9797        2.0           0.9397
0.4           0.9797        5.0           0.9256
0.5           0.9808        10.0          0.9033

7.2.3 Depression Angle Distortion Case

    In the second simulation, we selected the vehicle 2S1 (rocket launcher) as a
target and the T62 as a confuser. These two kinds of images look very similar in shape,
therefore they represent a difficult object recognition case, useful to test the performance
improvement of the proposed method. In order to show the effect of the depression angle
distortion, training data are selected from target images which were collected at 30 degree
depression angle and the MACE and CMACE are tested with data taken at 17 degree
depression angle.
Figure 7-12 depicts some sample images. As we can see in Figure 7-12 (a) and (b),
due to the big change in depression angle (13 degree of depression is considered a huge
distortion), test images have more shadows and the image size of the vehicles also change,
making detection more difficult. In this simulation, we use all the images (120 images

covering 180 degrees of pose) at a 30 degree depression angle for training, and also test with
all 120 exemplar images at a 17 degree depression angle.

[Figure 7-12. Case B: Sample SAR images (64x64 pixels) of two vehicle types for a target
chip (2S1) and a confuser (T62). (a) Training images (2S1) of aspect angles 0, 35, 124, 159
degrees. (b) Test images (2S1) of aspect angles 3, 53, 104, 137 degrees. (c) Test images from
the confuser (T62) of aspect angles 2, 41, 103, 137 degrees.]
    Figure 7-13 (Top) shows the correlation output peak values of the MACE and
(Bottom) shows the output peak values of the CMACE filter for target and confuser
test data. We see that the conventional MACE is very poor in this case, either under- or
overshooting the peak value of 1 for the target class, but the CMACE can improve the
recognition performance because of its better generalization. Figure 7-14 depicts the ROC
curve and summarizes the CMACE advantage over the MACE in this large depression
angle distortion case. More interestingly, the KCFA performance is closer to the linear
MACE, due to the same input space whitening which is unable to cope with the large
distortion.

[Figure 7-13. Case B: Peak output responses of testing images for a target chip (circle) and
a confuser (cross): (Top) MACE, (Bottom) CMACE.]

[Figure 7-14. Case B: ROC curves.]

Table 7-4. Comparison of computation time and error for one test image between the
direct method (CMACE) and the FGT method (Fast CMACE) with p = 4 and k_c = 4
                          Direct (sec)   FGT (sec)   Error
Train: K_XX               7622.8         68.31       9.9668e-06
Test (true): K_ZX         122.8          1.15        8.7575e-06
Test (true): output                                  2.8225e-03
Test (false): K_ZX        128.6          1.18        3.8844e-05
Test (false): output                                 8.4377e-03

7.2.4 The Fast Correntropy MACE Results

This section shows both the computation speed improvement and the effect on
accuracy of the fast CMACE filter in the aspect angle distortion case with N = 60
training images. Computation time was clocked with MATLAB version 7.0 on a 2.8GHz
Pentium 4 processor with 2GByte of RAM running Windows XP.
    Table 7-4 shows the comparison of computation time for (6-3) and (6-4) between
the direct implementation of the CMACE filter and the fast method with a Hermite
approximation order of p = 4 and k_c = 4 clusters. The computation time and absolute
errors for one test image were obtained by averaging over 120 test images. This simulation
shows that the FGT method is about 100 times faster than the direct method with a
reasonable error precision. Figure 7-15 presents the comparison in terms of ROC curves
of the MACE, the CMACE and the fast CMACE. From the ROC curves we can observe
that the approximation with p = 4 and k_c = 4 is very close to the original ROC. Table
7-5 shows the effect of different orders (p) and numbers of clusters (k_c) on the computation
time and accuracy for the fast CMACE filter. We conclude that the computation time
increases roughly proportionally to p and k_c, while the absolute error decreases.
7.2.5 The Effect of Additive Noise

This section presents the effect of additive noise on the recognition performance of
both the MACE and CMACE. For this simulation, we design the template with training
data which are selected at every 3 index numbers from a total of 120 exemplar images
for the BTR60 without noise and test with all 120 images distorted by additive noise for
each vehicle.

[Figure 7-15. Comparison of ROC curves between the direct and the FGT method in case A.]

Table 7-5. Comparison of computation time and error for one test image in the FGT
method with different numbers of orders and clusters
Order   Time (sec)   Error        Cluster   Time (sec)   Error
2       0.8116       1.48e-02     2         0.7181       5.61e-02
6       1.5140       8.23e-04     6         1.6693       3.87e-04
10      2.2119       8.58e-06     10        2.5595       4.71e-05
14      2.8533       4.16e-07     14        3.5660       6.93e-06
20      3.8097       1.25e-09     20        5.3067       1.14e-06

Figure 7-16 shows sample images of the original and noisy images with a signal
to noise ratio (SNR) of 7dB. Also in this simulation, we compare the CMACE with the
optimal trade-off filter (OTSDF), which is a well known correlation filter designed to overcome
the poor generalization of the MACE when input noise is present. The OTSDF filter is
given by

    H = T^{-1} X (X^H T^{-1} X)^{-1} c,                                            (7-1)

where T = \alpha D + \sqrt{1 - \alpha^2} C, with 0 \le \alpha \le 1, D is the diagonal matrix of the
MACE, and C is the diagonal matrix containing the input noise power spectral density as
its diagonal entries.
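For reference, a compact frequency-domain sketch of the OTSDF template (7-1) (Python/NumPy; illustrative only, assuming vectorized images as columns of X and white input noise so that C is the identity):

    import numpy as np

    def otsdf_filter(X_space, alpha, c=None, noise_psd=None):
        # X_space: d x N matrix of vectorized training images (space domain).
        d, N = X_space.shape
        c = np.ones(N) if c is None else c
        X = np.fft.fft(X_space, axis=0)                      # columns -> frequency domain
        D = np.mean(np.abs(X)**2, axis=1)                    # average power spectrum (MACE's D)
        C = np.ones(d) if noise_psd is None else noise_psd   # white noise assumption: C = I
        T = alpha * D + np.sqrt(1 - alpha**2) * C            # T = alpha D + sqrt(1 - alpha^2) C
        Tinv_X = X / T[:, None]                              # T^{-1} X (T is diagonal)
        H = Tinv_X @ np.linalg.solve(X.conj().T @ Tinv_X, c) # H = T^{-1} X (X^H T^{-1} X)^{-1} c
        return H                                             # alpha = 1 recovers the MACE filter

    X = np.random.rand(4096, 10)
    H = otsdf_filter(X, alpha=0.7)
    print(np.round(np.abs(X.conj().T @ H), 3))               # constraints X^H H = c are satisfied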

[Figure 7-16. Sample SAR images (64x64 pixels) of BTR60. (a) Original. (b) Noisy with
SNR=7dB.]

[Figure 7-17. ROC comparisons with noisy test images (SNR=7dB) in case A (N = 40).]

    Figure 7-17 shows the comparison of ROC curves of the MACE,
OTSDF and CMACE when white Gaussian noise with a signal to noise ratio (SNR) of
7dB is added to all test images. We see that the MACE performance is degraded
due to the additive input noise, and the OTSDF with \alpha = 0.7 shows almost the same
performance as the MACE without noise. However, the performance of the CMACE with
noisy test data is almost the same as in the no-noise case. Although the CMACE does not
explicitly take the noise into consideration like the OTSDF, the CMACE is robust to the
input noise. In practice, the additive noise information is unknown; therefore, the OTSDF is
impractical.

CHAPTER 8
DIMENSIONALITY REDUCTION WITH RANDOM PROJECTIONS
8.1 Introduction

    In many pattern recognition and image processing applications, the high dimensionality
of the observed data makes many otherwise efficient statistical algorithms impractical.
Therefore, a variety of data compression and dimensionality reduction methods have been
proposed to overcome the curse of dimensionality [50]. Dimensionality reduction provides
compression and coding necessary to avoid excessive memory usage and computation.
Principal Component Analysis (PCA) is the most widely known way of reducing
dimension and it is optimal in the mean square error sense. PCA determines the basis
vectors by finding the directions of maximum variance in the data and it minimizes
the error between the original data and the one reconstructed from its low dimensional
representation. PCA has been very popular in face recognition [51] and many pattern
recognition applications [52]. Finding the principal components is a well established
numerical procedure through eigen decomposition of the data covariance matrix, although
it is still expensive to compute. There are other less expensive methods [51] based
on recursive algorithms [53] for finding only a few eigenvectors and eigenvalues of a
large matrix, but the computational complexity is still a burden. Moreover, subspace
projections by PCA do not preserve discrimination [54], so there may be a loss of
performance. Variants of the Singular Value Decomposition (SVD) are considered for image
compression, utilizing the Karhunen-Loeve Transformation (KLT). Like PCA, the SVD method
is also expensive to compute.
    The Discrete Cosine Transform (DCT) [50] is a widely used method for image compression,
and it can also be used for dimensionality reduction of image data. The DCT is computationally
less burdensome than PCA and its performance approaches that of PCA. The DCT is optimal
for the human eye: the distortions introduced occur at the highest frequencies only, and the
human eye tends to neglect these as noise. The image is transformed to the DCT domain

and dimensionality reduction is done in the inverse transform by discarding the transform
coefficients corresponding to the highest frequencies.
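As an illustration of this reduction-by-truncation procedure, the following minimal sketch (Python with NumPy/SciPy, not part of the original experiments; the image size and the number of retained coefficients are arbitrary choices) keeps only the lowest-frequency DCT coefficients of an image:

import numpy as np
from scipy.fft import dctn, idctn

def dct_reduce(image, keep):
    # Keep only the keep x keep lowest-frequency DCT coefficients of a
    # 2-D image; the discarded high frequencies are the ones the eye
    # tends to neglect as noise.
    coeffs = dctn(image, norm='ortho')            # forward 2-D DCT
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = coeffs[:keep, :keep]     # low-frequency block
    return idctn(mask, norm='ortho'), coeffs[:keep, :keep]

img = np.random.rand(64, 64)                      # stand-in for a 64x64 image
approx, features = dct_reduce(img, keep=16)
print(features.shape)                             # (16, 16): a 256-dimensional feature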
8.2 Motivation

Both the conventional MACE and the CMACE are memory-based algorithms;
therefore, in practice, the drawback of this class of algorithms is both the storage
requirement and the high computational demand. The output of the CMACE filter is
obtained by computing the product of two matrices defined by the image size and the
number of training images, and each element of the matrices requires O(d²) computations,
where d is the number of image pixels. This quickly becomes too complex in practical
settings even for relatively small images. The fast correntropy MACE filter using the fast
Gauss transform (FGT) has been presented to increase the computational speed of the
CMACE filter, but the storage requirement is still high. When the number of training images is N, the
total computational complexity of one test output of the CMACE is O(d²N(N + 1)), and this
can be reduced to O(pcdN(N + 1)), where p is the order of the Hermite approximation and
c is the number of clusters utilized in the FGT (p, c ≪ d). In general, images have high
dimensionality and applications using large images need a large memory capacity. The
main goal of this chapter is to find a simple but powerful dimensionality reduction method
for image recognition with the CMACE filter.
Recently, random projection (RP) has emerged as an alternative dimensionality
reduction method in machine learning and image compression [55],[56],[57],[58],[59],[60],
due to its low computational complexity and good performance. Many experiments in the
literature show that RP is computationally simple while preserving similarity to a high
degree. In random projection, the original high dimensional data is projected onto a lower
dimensional subspace using a random matrix, with only a small distortion of the distances
between the points, so that similarity information is preserved. Even though the randomly
projected data includes the key information of the original data, we need to extract
that information properly. Since correntropy is able to extract higher order moments of
the data, it can be a promising tool for random projection applications.
In this chapter we present a dimensionality reduction pre-processor based on
random projections (RP) to decrease the storage and computational requirements to levels
that readily available computing resources can meet, and we show that the RP method
works well with the CMACE filter for image recognition.
8.3 PCA and SVD

Principal Component Analysis (PCA) is the best linear dimensionality reduction
technique in the mean-square error sense. Being based on the covariance matrix of the
random variables, it is a second-order method. In various fields, it is also known as the
singular value decomposition (SVD), the Karhunen-Loeve Transformation (KLT), the
empirical orthogonal function (EOF) method, and so on.
PCA seeks to reduce the dimensionality of the data by finding a few orthogonal linear
combinations of the original variables with the largest variance.
Let us suppose that we are given a data matrix X of size d × N, where
N is the number of vectors in d-dimensional space. The goal is to find a k-dimensional
subspace (k < d) such that the projection of X on that subspace minimizes the expected
squared error. Then the projection of the original data onto the lower k-dimensional
subspace can be obtained by

X_pca = P_pca X,    (8-1)

where P_pca is k × d and contains the k eigenvectors corresponding to the k largest
eigenvalues of the data covariance matrix.
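A minimal sketch of (8-1) is given below (Python/NumPy; the centering step and the small example sizes are our own choices, not taken from the dissertation's experiments):

import numpy as np

def pca_project(X, k):
    # Project the columns of a d x N data matrix X onto the k principal
    # directions with the largest variance (eq. 8-1).
    Xc = X - X.mean(axis=1, keepdims=True)     # remove the mean vector
    C = (Xc @ Xc.T) / X.shape[1]               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    P_pca = eigvecs[:, -k:].T                  # k x d projection matrix
    return P_pca @ Xc

X = np.random.rand(1024, 10)                   # e.g. ten 32x32 images as columns
X_pca = pca_project(X, k=5)
print(X_pca.shape)                             # (5, 10)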
8.4 Random Projections

Random projection (RP) is a simple yet powerful dimensionality reduction technique
that uses random projection matrices to project data into a low dimensional subspace.
In RP, the original high dimensional space is projected onto a low dimensional subspace
using a random matrix whose columns have unit length. In contrast to methods,
such as PCA, that use data-driven optimization criteria, RP does not use such criteria
and is therefore data independent. Moreover, RP is computationally simple and preserves
the structure of the data without introducing significant distortion. RP theory is far
from complete, so it has to be used with caution. The following lemma from Johnson and
Lindenstrauss (JL) provides theoretical support for RP.
JL lemma. For any 0 < ε < 1 and any integer N, let k be a positive integer such that

k ≥ 4 (ε²/2 − ε³/3)^{-1} ln N.    (8-2)

Then for any set V of N points in R^d, there is a map f : R^d → R^k such that for all
u, v ∈ V,

(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||².    (8-3)

Furthermore, this map can be found in polynomial time.


The JL lemma states that any set of N points in d-dimensional Euclidean space can be
mapped down onto a k = O(log N / ε²) dimensional subspace without distorting the
distance between any pair of points by more than a factor of (1 ± ε), for any 0 < ε < 1,
with probability O(1/N²). A proof of this lemma as well as tighter bounds on ε and k are
given in [61].
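To make the bound in (8-2) concrete, the short sketch below (an illustration only; the number of points and distortion values are arbitrary) evaluates the smallest k guaranteed by the lemma:

import numpy as np

def jl_dimension(n_points, eps):
    # Smallest k satisfying the Johnson-Lindenstrauss bound (8-2).
    return int(np.ceil(4.0 * np.log(n_points) / (eps**2 / 2.0 - eps**3 / 3.0)))

for eps in (0.1, 0.3, 0.5):
    print(eps, jl_dimension(150, eps))    # e.g. 150 images in the data set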
Let us suppose that we are given a data matrix X, whose size is d × N, where N is
the number of vectors in d-dimensional space. Then the projection of the original data
onto a lower k-dimensional subspace can be obtained by

X_rp = P X,    (8-4)

where P is k × d and is called the random projection matrix.


The complexity of RP is very low compared to other dimensionality reduction methods.
RP needs only O(kdN) operations for projecting a d × N data matrix into k dimensions, and the
computational complexity of constructing the random matrix, O(kd), is negligible when
compared with PCA, O(N d²) + O(d³) [62].
8.4.1 Random Matrices

The choice of the random matrix P is one of the issues of RP. There are some simple
methods satisfying the JL lemma in the literature [58],[63]. Here we present three such
methods.

The Gaussian ensemble: the entries p_ij of the k × d random matrix P are identically
and independently sampled from a normal distribution with zero mean and unit variance,

p_ij := (1/√k) r_ij,  where r_ij is i.i.d. N(0, 1).

The binary ensemble: the entries p_ij of the k × d random matrix P are identically and
independently sampled from a symmetric Bernoulli distribution,

p_ij := (1/√k) r_ij,  where r_ij is i.i.d. with P(r_ij = +1) = P(r_ij = −1) = 1/2.

The related ensemble: p_ij := (1/√k) r_ij, where

r_ij := √3 × { +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 }.

In most applications the Gaussian ensemble satisfies the JL lemma well. The other two
methods yield significant computational savings [64].
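The three ensembles are equally easy to generate. The sketch below (our own illustration; the option names and sizes are ours) builds a k × d projection matrix from each ensemble and applies (8-4):

import numpy as np

rng = np.random.default_rng(0)

def random_projection_matrix(k, d, ensemble="gaussian"):
    # Build a k x d RP matrix from one of the three ensembles above.
    if ensemble == "gaussian":
        R = rng.standard_normal((k, d))
    elif ensemble == "binary":
        R = rng.choice([-1.0, 1.0], size=(k, d))
    elif ensemble == "sparse":            # the 'related' ensemble
        R = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(k, d),
                                      p=[1/6, 2/3, 1/6])
    else:
        raise ValueError("unknown ensemble")
    return R / np.sqrt(k)                 # 1/sqrt(k) scaling preserves distances

X = rng.random((4096, 5))                 # five 64x64 images as columns
X_rp = random_projection_matrix(256, 4096, "gaussian") @ X   # eq. (8-4)
print(X_rp.shape)                         # (256, 5)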
8.4.2 Orthogonality and Similarity Properties

In fact, to preserve the similarity between the original data and the transformed data,
a projection matrix should be orthogonal. However, in a high enough dimensional space
it is possible to use a non-orthogonal random projection matrix, because the data
are sparse and, in a high dimensional space, there exists a much larger number of almost
orthogonal directions [65]. Thus, vectors with random directions in a high dimensional
space are linearly independent and might be sufficiently close to orthogonal to provide
an approximation of a basis.
The inner product of two vectors x and y that have been obtained by random
projection of the vectors u and v with the random matrix R can be expressed as

x^T y = u^T R^T R v.    (8-5)

The matrix R^T R can be decomposed into two terms,

R^T R = I + ε,    (8-6)

where ε_ij = r_i^T r_j for i ≠ j and ε_ii = 0 for all i.

If all the entries in ε were equal to zero, i.e., if the vectors r_i and r_j were orthogonal, the
matrix R^T R would be equal to I and the similarity between the original data and the
projected data would be preserved exactly in the random mapping. In practice the entries
in ε will be small but not equal to zero.
Here let us consider the case where the entries of the random matrix are identically
and independently sampled from a normal distribution with zero mean and unit variance,
and thereafter the length of all the r_i's is normalized. Then ε_ij is an
estimate of the correlation coefficient between two i.i.d. normally distributed random variables,
and if the dimensionality k of the reduced space is large, ε_ij is approximately
normally distributed with zero mean and its variance σ² can be approximated by

σ² ≈ 1/k.    (8-7)

That is, the distortion of the inner product produced by the random projection is zero
on average and its variance is at most the inverse of the dimensionality of the reduced
space. This result motivates the scaling factor 1/√k in the random projection
matrices above, which preserves the distances. (In applications where the distances themselves
are not of concern, we do not need to scale the projection matrix by 1/√k.) Moreover, the error
becomes much smaller when the data is sparse, which underlines the relevance of
random projection in compressive sampling for sparse signal recovery [66]. However, the
methodology for building the random projection matrix may affect the subsequent algorithms used for processing,
and this is an area that is much less studied.
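These properties are easy to verify numerically. The sketch below (illustrative sizes only) forms R^T R for a Gaussian random matrix with normalized columns and checks that the off-diagonal entries of ε have approximately zero mean and variance close to 1/k, as in (8-6) and (8-7):

import numpy as np

rng = np.random.default_rng(1)
d, k = 1000, 100

R = rng.standard_normal((k, d))
R /= np.linalg.norm(R, axis=0, keepdims=True)   # columns r_i of unit length

G = R.T @ R                                     # close to I + eps, eq. (8-6)
eps = G - np.eye(d)
off_diag = eps[~np.eye(d, dtype=bool)]

print(off_diag.mean())                          # approximately 0
print(off_diag.var(), 1.0 / k)                  # variance close to 1/k, eq. (8-7)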
8.5 Simulations

In this section, we show the performance of the CMACE filter with RP
dimensionality reduction on the face recognition example of Chapter 7. We project the
original data into a lower dimensional space with random projection and apply the
CMACE filter to the reduced dimensional data. In this simulation, we use the Gaussian
ensemble method to generate the random projection matrix. In order to assess the
performance of the CMACE after preprocessing with random projection, we compute the
area under the ROC curve.
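The RP preprocessing itself amounts to one matrix multiplication per image. A minimal sketch is shown below (illustrative only: random data stand in for the face images, and the CMACE filtering step is not reproduced here):

import numpy as np

rng = np.random.default_rng(2)

d, k = 64 * 64, 256                             # original and reduced dimensionality
P = rng.standard_normal((k, d)) / np.sqrt(k)    # Gaussian ensemble RP matrix

def preprocess(images):
    # Vectorize each 64x64 image and project it onto k dimensions before
    # building or testing the filter.
    X = images.reshape(len(images), -1).T       # d x N data matrix
    return P @ X                                # k x N reduced data

train = rng.random((5, 64, 64))                 # placeholder for N = 5 training images
print(preprocess(train).shape)                  # (256, 5)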
Figure 8-1 shows the ROC area values with different reduced dimensions by RP.
Figure 8-1. The comparison of ROC areas with different RP dimensionality (50 trials with
different training images and RP matrices). The marked point is at k = 144 with ROC area
0.9684; the curves show the maximum, minimum, and average ROC area versus the
dimensionality k after random projection.

Figure 8-2. The comparison of ROC areas with different RP dimensionality (50 trials with
different RP matrices, but fixed training images). The marked point is at k = 256 with ROC
area 0.9502; the curves show the average, maximum, and minimum ROC area versus the
dimensionality k after random projection.
Since RP is in fact a random function, we present results in terms of the mean, maximum
and minimum performance obtained in 50 Monte Carlo simulations. At every trial we
use randomly chosen training images (N = 5) and a different random projection
matrix. When the MACE is applied to the original data, the average ROC area is 0.96
(the best case is 0.99 and the worst case is 0.8868). In Figure 8-1 the performance of the
CMACE with reduced dimensionality k ≥ 144 is always better than that of the MACE
filter with the original data. The range of performance between the best and the worst cases is
due both to the effect of different training images and to different RP matrices.
In order to monitor only the effect of RP, we fixed the training images and ran 50
Monte Carlo simulations with different RP matrices. The resulting ROC areas are shown
in Figure 8-2. We can see that the variation of the performance due to different RP
matrices is substantially smaller and that the CMACE obtains consistent performance with RP
when the image size is above 16 × 16 (dimensionality k = 256).

The comparison among the four dimensionality reduction methods (subsampling,
pixel-averaging, bilinear interpolation and Gaussian (RP)) for images of size 16 × 16 (from
64 × 64) is shown in Figure 8-3.

Figure 8-3. ROC comparison with different dimensionality reduction methods for MACE
and CMACE (reduced image size is 16 × 16).
For the CMACE, the Gaussian method (RP) and the pixel-averaging method work very
well, with subsampling the worst, though still with robust performance. Subsampling is
the simplest technique, but it can also lose important detail information. In the MACE
case, the Gaussian method is the worst, with the pixel-averaging method still performing some
discrimination, but at a much reduced rate (compare with Figure 7-3). It is surprising
that local pixel averaging, the simplest method of dimensionality reduction, provides such
robust performance in this application for both the MACE and CMACE. It indicates
that coarse features are sufficient for discrimination up to a certain level of performance.
However, notice that pixel averaging loses with respect to CMACE-RP when the
operating point in the ROC is close to 100%, as can be expected (finer detail is needed to
discriminate between classes).


We have also applied PCA to the MACE and CMACE. There are different ways to
apply PCA to this task. One method for dimensionality reduction with PCA uses training
images from the whole data set and then projects all the images onto the subspace spanned by
the principal components. With this method, when we choose 10 images (5 from the true class
and 5 from the false class) and project all true and false class images onto this subspace, the
performance of both the MACE and CMACE is perfect. However, the training data
must be sufficient to find principal directions that cover the whole test data, and a large
computation is required. Moreover, in practice it is impossible to use out-of-class images
as a training set for a MACE filter, which is designed only with data from one class. In
this more realistic case, the test image class does not belong to the training set for PCA
and the discrimination performance will be very poor. Figure 8-4 shows the ROC
curves when only true class images are used for PCA. Even in this case, we had to use all
the true class images (75) to find 75 principal directions, project all the images, and then
choose the 5 projected images to compose the MACE and CMACE filters. For testing,
we also project the test image onto the subspace obtained from the training set. Since
the false class test images are not used to determine the PCA subspace, the projected
data of the false class are not guaranteed to preserve the information of the original images;
therefore, the rejection performance becomes very poor. The ROC area values for the
MACE and CMACE are 0.4015 and 0.7283, respectively.
We could not obtain reasonable results for the MACE with the RP method, as shown
in Figure 8-3. We explain this MACE behavior as a consequence of the Gaussian dimensionality
reduction procedure, although the argument partially applies to the other methods as well. Although RP
preserves similarity in the reduced projections, it changes the statistics of the original data
classes. After random projection with the Gaussian ensemble method, all the projected
images display statistics very close to white noise with similar variance. This result is
shown in Figure 8-5, where sample images of the two classes, of size 16 × 16 after applying RP
to all images, are depicted.
to all images are depicted. The first row shows the training image set, while the second

88

1
CMACE
MACE

0.9

Probability of Detection

0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0

0.2

0.4
0.6
Probability of False Alarm

0.8

Figure 8-4. ROC comparison with PCA for MACE and CMACE (reduced dimensionality
is k = 75).
row displays the in-class test set and the third row the out-of-class test images. We see
that the projected images in the true class and the false class, although slightly different in
detail, seem to have very similar statistics.
The MACE, which extracts only second order information, is unable to distinguish
between the projected image sets; the CMACE, however, succeeds in this task. In order
to explain the effectiveness of the correntropy function, we compare the correlation and
correntropy in the projected space. This result is shown in Figure 8-6. We consider the 2D
images as long 1D vectors. In Figure 8-6 (a) we show the autocorrelation of one original
image vector in the true class, (b) depicts the autocorrelation of one of the training images
after RP, which leads us to conclude that the projected image has been whitened (the only
peak occurs at zero lag), and (c) shows that the cross correlation between the reduced
training image vector and a test image vector in the false class after RP is practically the
same as the autocorrelation of the reduced training image vector after RP. Therefore the
covariance information of the images after RP is totally destroyed.

Figure 8-5. Sample images of size 16 × 16 after RP. (a) Training images. (b) True class
images. (c) False class images.

Since the conventional
MACE filter utilizes only second order information, it is unable to discriminate
between in-class and out-of-class images. However, in Figure 8-6 (e) and (f) we can see that the cross
correntropy between in-class and out-of-class images is still preserved after RP, because
correntropy is able to extract higher order information from the reduced
dimensional data. Therefore, the CMACE filter seems well suited to work with
images reduced by random projection, in this and other applications.
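The whitening of the second order statistics and the survival of information usable by correntropy can be checked directly on a projected vector. The sketch below (our own illustration; the Gaussian kernel size is an arbitrary choice) computes the two quantities compared in Figure 8-6:

import numpy as np

def autocorrelation(x):
    # Biased sample autocorrelation over the non-negative lags.
    n = len(x)
    return np.array([np.dot(x[:n - m], x[m:]) / n for m in range(n)])

def autocorrentropy(x, sigma=0.5):
    # Sample autocorrentropy with a Gaussian kernel of size sigma.
    n = len(x)
    return np.array([np.mean(np.exp(-(x[:n - m] - x[m:]) ** 2 / (2 * sigma ** 2)))
                     for m in range(n)])

rng = np.random.default_rng(3)
d, k = 4096, 256
u = rng.random(d)                               # stand-in for an image vector
P = rng.standard_normal((k, d)) / np.sqrt(k)    # Gaussian ensemble RP matrix
x = P @ u                                       # reduced vector

print(autocorrelation(x)[:5])                   # correlation of the projected vector
print(autocorrentropy(x)[:5])                   # correntropy of the projected vector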
We can also see the overall detection and recognition performance of the CMACE-RP
through a further analysis of the output plane. Figure 8-7 shows correlation output planes
for the MACE and correntropy output planes (CMACE) after dimensionality reduction
with k = 64 random projections. Figure 8-7 (a) shows the desirable correlation output
plane of the MACE filter given a true class test image; however, (b) shows the poor
rejecting ability for a false class test image. On the other hand, for the CMACE filter,
the true and false class output planes in Figure 8-7 (c) and (d) show the expected
responses even with such low dimensional images.

Figure 8-6. The cross correlation vs. cross correntropy. (a) Autocorrelation of one of the
original training image vectors. (b) Autocorrelation of one of the reduced training image
vectors after RP. (c) Cross correlation between one of the reduced training image vectors
and a test image vector in the false class after RP. (d) Autocorrentropy of one of the
original training image vectors. (e) Autocorrentropy of one of the reduced training image
vectors after RP. (f) Cross correntropy between one of the reduced training image vectors
and a test image vector in the false class after RP.
The initial idea to use a preprocessor based on random projections was to alleviate
the storage and computational complexity of the CMACE. Table 8-1 presents comparisons
between the original CMACE and the CMACE with RP. The dominant component for
storage is the correntropy matrix (V_X). In single precision (32 bit), 64 Mbytes are needed
to store V_X for 64 × 64 pixel images, but only 256 Kbytes with 16 × 16 pixel images after
RP. We need an additional 4 Mbytes to perform random projection with the Gaussian
ensemble method. In the binary ensemble case, no additional storage for RP is needed.
The table also presents the computational complexity of (6-1) for one test image, given
N = 5 training images and clocked with MATLAB version 7.0 on a 2.8GHz Pentium 4
processor with 2Gbytes of RAM.

Figure 8-7. Correlation output planes vs. correntropy output planes after dimension
reduction with random projection (reduced image size is 8 × 8). (a) With a true class test
image in the MACE. (b) With a false class test image in the MACE. (c) With a true class
test image in the CMACE. (d) With a false class test image in the CMACE.

Table 8-1. Comparison of the memory and computation time between the original
CMACE (image size of 64 × 64) and CMACE-RP (16 × 16, with Gaussian
ensemble method) for one test image with N = 5

                                CMACE (d = 4096)         CMACE-RP (k = 256)
  Memory (byte,                 O(4d^2) = 64 MB          O(4k(k + d)) = 4.2 MB
  single precision)
  Complexity                    O(d^2(N^2 + N + 1))      O(k^2(N^2 + N + 1) + kdN)
                                = O(5.2 x 10^8)          = O(7.3 x 10^6)
  Time (sec)                    58.584                   0.4297


CHAPTER 9
CONCLUSIONS AND FUTURE WORK
9.1 Conclusions

In this research, we have evaluated the correntropy based nonlinear MACE filter for
image recognition. We presented experimental results for face recognition using CMU's
facial expression data and for SAR image recognition using the MSTAR public release data.
Correntropy induces a new RKHS that has the same dimensionality as the input
space but is nonlinearly related to it. Therefore, it is different from the conventional
kernel methods, in both scope and detail. Here we illustrate that the optimal MACE
filter formulation can be directly solved in the VRKHS. The CMACE overcomes the
main shortcoming of the MACE, which is poor generalization. We believe this is due
to the utilization, in the matching, of higher order statistical information of the target
class. The CMACE also shows good rejection performance as well as robust results
with additive noise. This is due to the prewhitening effect in feature space and the new
metric created by correntropy that reduces outliers. Simulation results show that the
detection and recognition performance of the CMACE exhibits better distortion tolerance
than the MACE for several kinds of distortion (in face recognition, different facial expressions;
in SAR, aspect angle as well as depression angle). Also the CMACE outperforms the
nonlinear kernel correlation filter, which is the kernelized SDF with prewhitened data in
the input space, especially for the large distortion case. Moreover the CMACE preserves
the shift-invariant property well.
The sensitivity of the CMACE performance to the kernel size is experimentally
demonstrated to be small, but a full understanding of this parameter requires further
investigation. In addition, there is still an approximation in (6-3) and (6-4) to compute
the products of the projected data functionals by a kernel evaluation, which holds only
on average. For large images this approximation seems to be good, but its error needs to
be understood and quantified to obtain the best performance of the CMACE filter.


In practice, the drawback of the proposed CMACE filter is the required storage and
its computational complexity. Since one does not have direct access to the filter weights in
the VRKHS, the computations on the test set must take the training set data
into consideration, so the total computational complexity of one test output is O(d²N(N + 1)) and the
storage depends on the image dimension, O(d²). The MACE easily produces the entire
correlation output plane by FFT processing. The CMACE can also construct the whole
output plane by shifting the test input image; as a result, there is no need to center
all images provided that the input image is appropriately shifted. However, computing
the whole output plane is a big burden for the CMACE. For this reason, this research also
proposes the fast CMACE to save computation time by using the Fast Gauss Transform
(FGT) algorithm, which results in a computational saving of about 100-fold for 64 × 64
pixel images. With the fast Gauss transform, we were able to reduce the computation to
O(pcdN(N + 1)), where p, c ≪ d.
However, this still needs huge storage and is not very competitive with other methods
for object recognition. The random projection (RP) method may make the CMACE
useful for practical applications using standard computing hardware. RP is a preprocessor
that extracts features of the data, but unlike PCA it is very easy to compute, requiring
only O(kd) operations to build the projection matrix.
Reducing the data into features has the double effect of addressing both the storage and
computation requirements. For instance, instead of 64 Mbytes for 64 × 64 pixel images,
the storage for images reduced with RP to 16 × 16 pixels is 4.2 Mbytes (256 Kbytes in the
binary ensemble case). Computational speed improves by more than 100 times. The method of
random projections and its impact on subsequent pattern recognition algorithms is still
poorly understood. Here we verified that the MACE is incompatible with the Gaussian
method of random projections, since it destroys the second order statistics that make the
MACE work. The pixel-averaging method seems to preserve second order statistics to a
certain degree. However, the CMACE combined with RP is a better alternative, and it
is less sensitive to the method of data reduction. This can be understood if we remember
that the CMACE preserves higher order statistics of the data, unlike the MACE filter.
The performance of the CMACE-RP at 16 × 16 is better than that of the MACE at 64 × 64.
Further work is necessary to quantify extensively the performance of the CMACE-RP
versus other algorithms.
These tests with the CMACE and data reduction clearly showed a new application
domain for correntropy in signal processing. The conventional data reduction methods
average data locally or globally and tend to destroy the mean and variance, but apparently
they preserve some of the higher order information contained in the data that can still be
utilized by correntropy. Therefore, in applications where data reduction at the front-end
is a necessity, correntropy may still provide usable algorithms in cases where second order
methods fail. This argument is also very relevant in compressive sampling (CS), where
convex optimization needs to be utilized to minimize the l1 norm, since the l2 norm creates
a lot of artifacts in reconstruction. We think that the correntropy induced metric (called
CIM in [42]) can be a candidate to simplify the reconstruction in CS. We still have to
fully understand why correntropy is able to distinguish between images or signals
that have been heavily distorted, so that we can perhaps even propose new data reduction
procedures that preserve the discriminability of correntropy.
9.2 Future Work

The correntropy MACE filter was obtained by solving a constrained optimization
problem in the RKHS induced by correntropy, where the dimension of the RKHS is the
same as the input dimension. The data points in this new RKHS are nonlinearly related
to the original data; therefore, we can still find a closed form solution for the nonlinear
MACE filter that outperforms the linear MACE filter. However, several topics remain
to be investigated.
First, the proposed correntropy MACE filter has hard constraints on the center of
the output plane. As in the traditional SDF-type filters, linear constraints are
imposed on the training images to yield a known value at specific locations in the output
plane. However, placing such constraints satisfies conditions only at isolated points in
the image space and does not explicitly control the filter's ability to generalize over the
entire domain of the training images. Unlike the general classification problem, the goal
of this research is to find an appropriate template for a specific object that we want to
identify, without any information on out-of-class data. We have to suppress the response to
all images except the true target image. Therefore, we think that constraining
only one location is not the best solution. Finding new constraints that give good
generalization as well as rejection ability is one of the future topics. One idea is
to use randomly mixed images of the true target as out-of-class images. These
generated out-of-class images are totally different from the true class images but have the
same statistical information as the true class images. Therefore, we can expect that this
idea may help improve performance.
Second, the computation of the correntropy MACE output requires an approximation.
Unfortunately, (6-3) and (6-4) involve weighted versions of the functionals; therefore, the
error in the approximation should be addressed, and further investigation is required to find
a good approximation.
Finally, this research presented simulation results for applications to face
recognition and SAR image recognition. In addition to face recognition, the proposed
algorithm can be applied to biometric verification tasks such as iris and fingerprint recognition.
Also, in the SAR application, there is a three-class (BMP2, BTR70 and T72) object classification
task in the MSTAR/IU public release data set [67]. Most of the literature applies algorithms to
the three-class problem to compare performance. Therefore, in order to convince other
researchers, we need to evaluate our algorithm on the three-class problem as well.
Summarizing the future work:

Find new constraints that provide better generalization as well as rejection ability.

Study the approximation errors due to the weighted values and find a good
approximation.

Applications to biometric verification and a more detailed comparison on the three-class
SAR classification.


APPENDIX A
CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIERS
If y is a weighted sum of variables, y = a^T x, then dy/dx = a. The general quadratic
form for y in matrix notation is y = x^T A x, where A = {a_ij} is an N × N matrix of
weights. Assuming that x is a real vector, the vector of partial derivatives of y with
respect to each variable is dy/dx = (A + A^T)x. If A is symmetric, then dy/dx = 2Ax.
The method of Lagrange multipliers is useful for minimizing a quadratic function
subject to a set of linear constraints. Suppose that B = [b_1 b_2 ... b_M] is an N × M
matrix with vectors b_i of length N as its columns and c = [c_1 c_2 ... c_M]^T is a vector of M
constants. We want to find the vector x which minimizes the quadratic term y = x^T A x
while satisfying the linear equations B^T x = c. If A is positive semi-definite, then y is
convex and there is at least one solution. We form the cost function

J = x^T A x − 2λ_1 (b_1^T x − c_1) − 2λ_2 (b_2^T x − c_2) − ... − 2λ_M (b_M^T x − c_M),    (A-1)

where the scalar parameters λ_1, λ_2, ..., λ_M are known as the Lagrange multipliers.


Setting the gradient of J with respect to x to zero yields

2Ax − 2(λ_1 b_1 + λ_2 b_2 + ... + λ_M b_M) = 0.    (A-2)

Defining m = [λ_1 λ_2 ... λ_M]^T, (A-2) can be expressed as

Ax − Bm = 0,    (A-3)

or

x = A^{-1} B m.    (A-4)

Substituting (A-4) for x into the constraint B^T x = c yields

B^T A^{-1} B m = c.    (A-5)

The Lagrange multiplier vector m can be obtained as

m = (B^T A^{-1} B)^{-1} c.    (A-6)

Using (A-4) and (A-6) we obtain the following solution to the constrained optimization
problem:

x = A^{-1} B (B^T A^{-1} B)^{-1} c.    (A-7)
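As a quick numerical check of (A-7), the following sketch (with arbitrary illustrative values) solves a small constrained quadratic minimization and verifies that the constraints hold:

import numpy as np

rng = np.random.default_rng(4)
N, M = 6, 2
A = rng.standard_normal((N, N))
A = A @ A.T + N * np.eye(N)             # symmetric positive definite weights
B = rng.standard_normal((N, M))         # constraint vectors b_i as columns
c = np.array([1.0, 0.5])                # constraint values

Ainv = np.linalg.inv(A)
m = np.linalg.solve(B.T @ Ainv @ B, c)  # Lagrange multipliers, eq. (A-6)
x = Ainv @ B @ m                        # constrained minimizer, eq. (A-7)

print(B.T @ x)                          # equals c: constraints are satisfied
print(x.T @ A @ x)                      # minimized quadratic value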

APPENDIX B
THE PROOF OF PROPERTY 5 OF CORRENTROPY
Let p_ij(x, y) be the joint PDF of (x_i, x_j), expanded as

p_ij(x, y) = Σ_i η_i ψ_i(x) ψ_i(y),    (B-1)

where ψ_i(·) and η_i are the eigenfunctions and the eigenvalues of p_ij(x, y), respectively. Then

E[f(x)f(y)] = ∫∫ p_ij(x, y) f(x) f(y) dx dy
            = Σ_i η_i ∫ ψ_i(x) f(x) dx ∫ ψ_i(y) f(y) dy
            = Σ_i η_i β_i²,    (B-2)

where β_i = ∫ ψ_i(x) f(x) dx.

Now, let φ_i(·) and λ_i be the eigenfunctions and the eigenvalues of the kernel κ. Then

E[κ(x, y)] = ∫∫ p_ij(x, y) κ(x, y) dx dy
           = ∫∫ Σ_i η_i ψ_i(x) ψ_i(y) Σ_j λ_j φ_j(x) φ_j(y) dx dy
           = Σ_i Σ_j η_i λ_j α_ij,    (B-3)

where α_ij = ( ∫ ψ_i(x) φ_j(x) dx )². Comparing (5-15) with (B-1)-(B-3), we can construct f
such that β_i = sqrt( Σ_j λ_j α_ij ). Then there exists f(x) = Σ_i β_i ψ_i(x) satisfying (5-15).

APPENDIX C
THE PROOF OF A SHIFT-INVARIANT PROPERTY OF THE CMACE
A shift invariant system is one for which a shift or delay of the input sequence
causes a corresponding shift in the output sequence. All the components of the CMACE
output are determined by the kernel function, so by proving that the Gaussian kernel is
shift-invariant we can conclude that the CMACE is shift-invariant.
Let the output of the Gaussian kernel be

y(n) = κ(x_1(n) − x_2(n)) = exp( −(x_1(n) − x_2(n))² / (2σ²) ).    (C-1)

Start with a shift of the inputs, x_1s(n) = x_1(n − s_o) and x_2s(n) = x_2(n − s_o). The response
y_1(n) to the shifted inputs is

y_1(n) = κ(x_1s(n) − x_2s(n)) = exp( −(x_1(n − s_o) − x_2(n − s_o))² / (2σ²) ).    (C-2)

Now the shifted output, defined as y(n − s_o), becomes

y(n − s_o) = exp( −(x_1(n − s_o) − x_2(n − s_o))² / (2σ²) ).    (C-3)

Clearly y_1(n) = y(n − s_o); therefore, the Gaussian kernel, and hence the CMACE, is shift-invariant.
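The argument can also be checked numerically. In the short sketch below (our illustration, using circular shifts as a stand-in for the delay), shifting both input sequences simply shifts the Gaussian kernel output sequence:

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Gaussian kernel evaluated sample by sample, as in (C-1).
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
x1, x2 = rng.random(32), rng.random(32)
s0 = 3                                           # shift applied to both inputs

y = gaussian_kernel(x1, x2)
y_shifted = gaussian_kernel(np.roll(x1, s0), np.roll(x2, s0))

print(np.allclose(y_shifted, np.roll(y, s0)))    # True: y1(n) = y(n - s0)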

APPENDIX D
COMPUTATIONAL COMPLEXITY OF THE MACE AND CMACE
Here, only the computational complexity of the matrix inversions and multiplications
in both the MACE and CMACE is considered. Let us assume that all the elements of the
matrices are real.
In order to construct the MACE template in the frequency domain with a given
training image set, O(d) multiplications are needed for the inversion of the diagonal matrix
D of size d × d and O(N²) for the inversion of the Toeplitz matrix (X^H D^{-1} X) of size
N × N. The number of multiplications is O(N²) for D^{-1}X, O(dN² + d) for (X^H D^{-1} X)
and O(dN²) for D^{-1}X(X^H D^{-1}X)^{-1}. In addition, the FFT needs O((N + 1)d log2(d))
multiplications with N training images. In reality the elements of the matrices are
complex valued; therefore, the MACE requires a total of O(4(d(2N² + N + 2) + N²) +
N d log2(d)) multiplications to compose the template for the true class in the frequency
domain. For the test of one input image after building a template, the MACE requires
only O(4d + d log2(d)) multiplications.
The CMACE needs O(d²) and O(N²) multiplications for the inversion of the
Toeplitz matrix V_X of size d × d and of T_XX of size N × N, respectively, and O((Nd)²) to compute T_XX;
therefore, the total number of multiplications in off-line mode with the given training
image set is O(d²(N² + 1) + N²). For the testing of one image, O(N²) multiplications
for the output and O((Nd)²) operations for obtaining T_ZX are needed; therefore, the
total computational complexity of the CMACE for one test image requires O(d²N + N²)
multiplications.
The fast CMACE with the FGT reduces the computational complexity to O(N²pd(kc +
1) + d² + N²) for the training set and to O(pd(kc + 1)N + N²) for one test image.


APPENDIX E
THE CORRENTROPY-BASED ROBUST NONLINEAR BEAMFORMER
E.1 Introduction

Beamforming is often used with an array of radar antennas in order to transmit
or receive signals in different directions without having to mechanically steer the
array [68],[69], and has found numerous applications in radar, sonar, seismology, radio
astronomy, medical imaging, speech processing, and wireless communications. The
classical approach for beamforming is a natural extension of Fourier-based spectral
analysis to spatio-temporally sampled data, which is called the conventional Bartlett
beamformer [70]. This algorithm maximizes the energy of the beamforming output for
a given input signal. Because it is independent of the signal characteristics, but only
depends on a certain direction, its major difficulties are low spatial resolution and high
sidelobes. In an attempt to alleviate the limitations of the conventional beamformer, the
Capon beamformer is introduced [71],[72].
A Capon beamformer attempts to minimize the output energy contributed by
interference coming from other directions than from the look direction. Moreover, it
maintains a fixed constant gain in the look direction (normalized to one) in order not to
risk the loss of the signal containing the information. This Capon beamformer is sensitive
to the mismatch between the assumed and actual array steering vector, which occurs
often in practice. Recently a robust beamformer was proposed by extending the Capon
beamformer to the case of uncertain array steering vectors [73],[74].
From a statistical point of view, most of these techniques are based on linear models,
which make use of only the first and second order moment information (e.g. the mean and
the variance) of the data. Therefore, they are not an appropriate choice for non-Gaussian
distributed data, such as impulsive noise scenarios. In order to deal with more realistic
situations, further research into signal modeling has led to the realization that many
natural phenomena can be better represented by distributions of a more impulsive nature.

One type of distribution that exhibits heavier tails than the Gaussian is the class of stable
distributions introduced by Nikias and Shao [75]. Alpha-stable distributions have been
used to model diverse phenomena such as random fluctuations of gravitational fields,
economic market indexes [76], and radar clutter [77].
To overcome the limitation of the linear model in the non-Gaussian statistics case,
a nonlinear beamformer has been proposed in [78], but most nonlinear beamforming
methods require complicated weight vector computations. Recently, kernel based learning
algorithms have been heavily researched due to the fact that linear algorithms can be
easily extended to nonlinear versions through kernel methods [23]. Some kernel based
methods have been presented in [79],[80],[26] for beamforming and target detection
problems.
The correntropy MACE (CMACE) filter [46][81], which is the nonlinear version of the
correlation filter, has been shown to possess good generalization and rejecting performance
for image recognition applications.
In this appendix, we apply correntropy to the beamforming problem and exploit
the linear structure of the RKHS induced by correntropy to formulate the correntropy
beamformer. Because it involves higher order statistics through the nonlinear
relation between the input space and this feature space, the correntropy beamformer shows
better performance than the Capon and kernel methods and is robust to impulsive noise
scenarios.
E.2 Standard Beamforming Problem
E.2.1 Problem

Consider the standard beamforming model. Let a uniformly spaced linear array of
M sensors receive signals x_k generated by a narrow-band source s_k arriving from direction
θ. Using a complex envelope representation, the M × 1 vector of received signals at the kth
snapshot can be expressed as

x_k = a(θ) s_k + n_k,    (E-1)

where a(θ) ∈ C^{M×1} is the steering vector of the array toward direction θ,

a(θ) = [ 1   e^{j(2π/λ) d cos θ}   ...   e^{j(2π(M−1)/λ) d cos θ} ]^T,    (E-2)

and n_k is the M × 1 vector of additive white noise. The beamformer output is given by

y_k = w^H x_k = w^H a(θ) s_k + w^H n_k,    (E-3)

where w ∈ C^{M×1} is a vector of weights and H denotes the conjugate transpose. The goal
is to satisfy w^H a(θ) = 1 and minimize the effect of the noise (w^H n_k), in which case y_k
recovers s_k.
We also assume that each element of n_k follows a symmetric α-stable (SαS)
distribution described by the following characteristic function,

φ(ω) = exp( jδω − γ|ω|^α ),    (E-4)

where α is the characteristic exponent restricted to the values 0 < α ≤ 2, δ (−∞ < δ < ∞)
is the location parameter, and γ (γ > 0) is the dispersion of the distribution. The value
of α is related to the degree of impulsiveness of the distribution. Smaller values of α
correspond to heavier tailed distributions and hence to more impulsive behavior, while
as α increases, the tails are lighter and the behavior is less impulsive. The special case of
α = 2 corresponds to the Gaussian distribution (N(δ, 2γ)), while α = 1 corresponds to the
Cauchy distribution.
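For concreteness, a minimal sketch of the signal model (E-1)-(E-4) follows; the use of scipy.stats.levy_stable for the SαS samples, the scale value standing in for the dispersion, and all parameter values are our own illustrative assumptions:

import numpy as np
from scipy.stats import levy_stable

M, N, theta = 25, 1000, np.deg2rad(45)           # sensors, snapshots, look direction
d_over_lambda = 0.5                              # half-wavelength element spacing

# Steering vector of the uniform linear array, eq. (E-2)
a = np.exp(1j * 2 * np.pi * d_over_lambda * np.arange(M) * np.cos(theta))

rng = np.random.default_rng(6)
s = rng.choice([-1.0, 1.0], size=N)              # BPSK source symbols s_k

def sas_noise(alpha, scale, size, seed):
    # Complex symmetric alpha-stable noise; beta = 0 gives the symmetric case.
    re = levy_stable.rvs(alpha, 0.0, scale=scale, size=size, random_state=seed)
    im = levy_stable.rvs(alpha, 0.0, scale=scale, size=size, random_state=seed + 1)
    return re + 1j * im

X = np.outer(a, s) + sas_noise(1.5, 0.3, (M, N), seed=7)   # snapshots, eq. (E-1)
print(X.shape)                                   # (25, 1000)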
E.2.2 Minimum Variance Beamforming

Since the look-direction frequency response is fixed by the constraint, minimization
of the non-look-direction noise energy is the same as minimization of the total output
energy. The energy of the beamformer output y_k is minimized subject to the constraint of a
distortionless response in the direction of the desired signal:

min_w E[|y_k|²]   subject to   w^H a(θ) = 1.    (E-5)

The constraint w^H a(θ) = 1 prevents the gain in the direction of the signal from being
reduced. This is commonly referred to as Capon's method [71],[72]. Equation (E-5) has an
analytical solution given by

w_capon = R_x^{-1} a(θ) / ( a(θ)^H R_x^{-1} a(θ) ),    (E-6)

where R_x denotes the covariance matrix of the array output vector. In practical
applications, R_x is replaced by the sample covariance matrix R̂_x, where

R̂_x = (1/N) Σ_{k=1}^{N} x_k x_k^H,    (E-7)

with N denoting the number of snapshots. Substituting w_capon into equation (E-3), the
constrained least squares estimate of the look-direction output is

y_capon,k = w_capon^H x_k = a(θ)^H R_x^{-1} x_k / ( a(θ)^H R_x^{-1} a(θ) ).    (E-8)
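A compact sketch of the sample-covariance Capon beamformer (E-6)-(E-8) is given below (our own illustration; the synthetic data generation mirrors the BPSK setup of section E.4 but with Gaussian noise and arbitrary values):

import numpy as np

def capon_beamformer(X, a):
    # Sample-covariance Capon weights and output, eqs. (E-6)-(E-8).
    # X: M x N matrix of snapshots, a: length-M steering vector.
    M, N = X.shape
    R = (X @ X.conj().T) / N                     # sample covariance, eq. (E-7)
    Rinv_a = np.linalg.solve(R, a)
    w = Rinv_a / (a.conj() @ Rinv_a)             # Capon weights, eq. (E-6)
    return w, w.conj() @ X                       # outputs y_k, eq. (E-8)

rng = np.random.default_rng(9)
M, N, theta = 25, 1000, np.deg2rad(45)
a = np.exp(1j * np.pi * np.arange(M) * np.cos(theta))   # half-wavelength spacing
s = rng.choice([-1.0, 1.0], size=N)                      # BPSK source
X = np.outer(a, s) + 0.3 * (rng.standard_normal((M, N))
                            + 1j * rng.standard_normal((M, N)))
w, y = capon_beamformer(X, a)
print(np.abs(w.conj() @ a))            # unit gain in the look direction
print(np.mean(np.sign(y.real) == s))   # fraction of correctly detected symbols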

E.2.3 Kernel-based beamforming

The basic idea of kernel algorithms is to transform the data x_i from the input space
to a high dimensional feature space of vectors Φ(x_i), where the inner products can be
computed using a positive definite kernel function satisfying Mercer's conditions [23]:
κ(x_i, x_j) = <Φ(x_i), Φ(x_j)>. This simple and elegant idea allows us to obtain nonlinear
versions of any linear algorithm expressed in terms of inner products, without even
knowing the exact mapping Φ.
Using the constrained least-squares approach explained in the previous
section, it can easily be shown that the equivalent solution w_kernel in the feature space is
given by

w_kernel = R_Φ(x)^{-1} Φ(a(θ)) / ( Φ(a(θ))^H R_Φ(x)^{-1} Φ(a(θ)) ),    (E-9)

where R_Φ(x) is the correlation matrix in the feature space. The estimated correlation
matrix is given by

R̂_Φ(x) = (1/N) X_Φ X_Φ^H,    (E-10)

assuming the sample mean has already been removed from each sample (centered), where
X_Φ = [Φ(x_1), Φ(x_2), ..., Φ(x_N)] is a full rank matrix whose columns are the mapped
input reference data in the feature space. Its output is given by

y_kernel,k = w_kernel^H Φ(x_k) = Φ(a(θ))^H R_Φ(x)^{-1} Φ(x_k) / ( Φ(a(θ))^H R_Φ(x)^{-1} Φ(a(θ)) ).    (E-11)

Due to the high dimensionality of the feature space, equation (E-11) cannot be directly
implemented in the feature space. It needs to be converted in terms of the kernel functions
by the eigenvector decomposition procedure of kernel PCA [26]. The kernelized version
of the beamformer output is given by

y_kernel,k = K_a(θ)^H K^{-1} K_xk / ( K_a(θ)^H K^{-1} K_a(θ) ),    (E-12)

where

K_a(θ)^T = Φ(a(θ))^T X_Φ = [ κ(a(θ), x_1), κ(a(θ), x_2), ..., κ(a(θ), x_N) ],    (E-13)

K_xk^T = Φ(x_k)^T X_Φ = [ κ(x_k, x_1), κ(x_k, x_2), ..., κ(x_k, x_N) ],    (E-14)

and K = X_Φ^H X_Φ is the N × N Gram matrix whose entries are the dot products
κ(x_i, x_j) = <Φ(x_i), Φ(x_j)>.
E.3 Nonlinear Beamformer using Correntropy

The correntropy beamformer is formulated in the RKHS induced by correntropy, and
the solution is obtained by solving the constrained optimization problem of
minimizing the average correntropy output energy. We denote the transformed received data
matrix and filter vector, whose sizes are M × N and M × 1 respectively, by

F_X = [ f_x1, f_x2, ..., f_xN ],    (E-15)

f_w = [ f(w(1)) f(w(2)) ... f(w(M)) ]^H,    (E-16)

where

f_xk = [ f(x_k(1)) f(x_k(2)) ... f(x_k(M)) ]^H    (E-17)

for k = 1, 2, ..., N. Given data samples, the cross correntropy between the received signal
at the kth snapshot and the filter can be estimated as

v_ok[m] = (1/d) Σ_{n=1}^{d} f(w(n)) f(x_k(n − m)),    (E-18)

for all the lags m = −M + 1, ..., M − 1.

The correntropy energy of the kth received signal output is given by

E_k = v_ok^T v_ok = f_w^H V_xk f_w,    (E-19)

and the M × M correntropy matrix V_xk is

V_xk = [ v_k(0)       v_k(1)       ...   v_k(d − 1)
         v_k(1)       v_k(0)       ...   v_k(d − 2)
         ...          ...          ...   ...
         v_k(d − 1)   ...          v_k(1)   v_k(0) ],    (E-20)

where each element of the matrix is computed, without explicit knowledge of the
mapping function f, by

v_k(l) = Σ_{n=1}^{M} κ(x_k(n) − x_k(n + l)),    (E-21)

for l = 0, ..., M − 1.
The average correntropy energy over all the received data can be written as

E_av = (1/N) Σ_{k=1}^{N} E_k = f_w^T V_X f_w,    (E-22)

where V_X = (1/N) Σ_{k=1}^{N} V_xk. Since our objective is to minimize the average correntropy
energy in this linear feature space, we can formulate the optimization problem as

min_{f_w} f_w^H V_X f_w   subject to   f_w^H f_a(θ) = 1,    (E-23)

where f_a(θ), of size M × 1, is the transformed vector of the steering vector. Then the
solution in feature space becomes

f_w = V_X^{-1} f_a(θ) / ( f_a(θ)^H V_X^{-1} f_a(θ) ).    (E-24)

The output is then given by

y_correntropy,k = f_a(θ)^H V_X^{-1} f_xk / ( f_a(θ)^H V_X^{-1} f_a(θ) ) = T_ax / T_a,    (E-25)

where

T_a = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij f(a(j)) f(a(i)) = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij κ(a(j) − a(i)),    (E-26)

T_ax = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij f(x_k(j)) f(a(i)) = Σ_{i=1}^{M} Σ_{j=1}^{M} w_ij κ(x_k(j) − a(i)),    (E-27)

where w_ij is the (i, j)th element of V_X^{-1}, x_k(i) is the ith element of the received signal at
the kth snapshot, and a(i) is the ith element of the steering vector.

The final output expressions in (E-26) and (E-27) are obtained by approximating
f(a(j))f(a(i)) and f(x_k(j))f(a(i)) by κ(a(j) − a(i)) and κ(x_k(j) − a(i)), respectively,
which is similar to the kernel trick and holds on average because of property 5.
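To illustrate how the output (E-25) is evaluated through T_a and T_ax, a minimal sketch follows. It is our own simplified illustration: the data are kept real-valued, the Gaussian kernel size is arbitrary, and no attempt is made to reproduce the simulation conditions of section E.4.

import numpy as np
from scipy.linalg import toeplitz

def gaussian(u, sigma=1.0):
    return np.exp(-np.abs(u) ** 2 / (2 * sigma ** 2))

def correntropy_matrix(x, sigma=1.0):
    # Toeplitz correntropy matrix V_x of one snapshot x (eqs. E-20, E-21),
    # summing over the valid samples at each lag.
    M = len(x)
    v = np.array([np.sum(gaussian(x[:M - l] - x[l:], sigma)) for l in range(M)])
    return toeplitz(v)

def correntropy_beamformer_output(X, a, sigma=1.0):
    # Output (E-25) computed through T_a and T_ax (eqs. E-26, E-27).
    # X: M x N real snapshots, a: length-M real steering vector.
    M, N = X.shape
    V_X = sum(correntropy_matrix(X[:, k], sigma) for k in range(N)) / N
    W = np.linalg.inv(V_X)                       # w_ij = (V_X^{-1})_{ij}
    T_a = np.sum(W * gaussian(a[None, :] - a[:, None], sigma))
    y = np.empty(N)
    for k in range(N):
        T_ax = np.sum(W * gaussian(X[None, :, k] - a[:, None], sigma))
        y[k] = T_ax / T_a
    return y

rng = np.random.default_rng(10)
M, N = 25, 200
a = np.cos(np.pi * np.arange(M) * np.cos(np.deg2rad(45)))   # real-valued steering vector
X = np.outer(a, rng.choice([-1.0, 1.0], N)) + 0.3 * rng.standard_normal((M, N))
print(correntropy_beamformer_output(X, a)[:5])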
E.4 Simulations

In this simulation, we present comparison results among the Capon, kernel and
correntropy beamformers in wireless communications with multiple receiving antennas.
In all of the experiments, we assume a uniform linear array with M = 25 sensor elements
and half-wavelength array spacing. Note that as the number of elements increases, the side
lobes become smaller. Also, as the total width of the array increases, the central beam
becomes narrower. For the source scenario, we assume that narrow band signals arrive
from the far field and that the target of interest is located at angle θ = 45°. We use BPSK
(Binary Phase Shift Keying) signalling, which has unit power and is uncorrelated. In order to
make the results independent of the input and noise, we perform Monte Carlo simulations
with 100 different input and noise realizations.

In the first experiment, we investigate the effect of the number of snapshots
(N) in the spatially white Gaussian noise case. Figure E-1 shows the beampatterns of the
Capon, kernel and correntropy beamformers with N = 100 and 1000 for a
signal-to-noise ratio (SNR) of 10dB. The Capon beamformer has poor performance,
i.e., higher side-lobes, for small N, while the kernel method and the correntropy
beamformer show a good beampattern even for a small number of snapshots. It is well known that
one of the problems of the standard Capon beamformer is its poor performance with a small amount
of training data. In Figure E-2, we show the BER performance with N = 100 and
1000 for SNRs between 5 and 15dB. Figure E-2(a) shows
that for N = 100 the Capon beamformer exhibits a high BER floor, but the proposed
beamformer has a much better BER performance than the Capon and kernel beamformers.
For N = 1000 in Figure E-2(b), when the SNR is under 9dB
the Capon beamformer shows better BER performance than the other two methods, but as the
SNR increases, the BER of the correntropy beamformer becomes the best.
Next, we test the robustness of the Capon, kernel and correntropy beamformers to
impulsive noise with N = 1000. We select γ such that the SNR is 10dB for α-stable
noise with α = 2 and δ = 0 (Gaussian noise). Figure E-3 shows the BER performance of the
three beamformers at different α levels. The correntropy beamformer displays superior
performance for decreasing α, that is, for increasing strength of impulsiveness. From this
result, we can say that the proposed method is robust in terms of BER to impulsive
noise environments in wireless communications.
Figure E-4(a) and (b) show the beampatterns of the three beamformers at α = 1.5 and
α = 1.0, respectively. When α = 1.5 in Figure E-4(a), the beampattern of the Capon beamformer is similar
to that of the kernel method, and its sidelobe gain is higher than that of the correntropy beamformer by 2dB.
As α decreases, the gap in sidelobe gain between the Capon and correntropy beamformers increases,
as shown in Figure E-4(b).


One interesting result is that the BER performance of the kernel method is poor
in both the Gaussian and impulsive noise cases, even though it shows a nice beampattern.
The output values of the kernel method are far from the original transmitted signals
(±1); therefore, it results in poor BER performance. The kernel method used in this
dissertation has a constraint, but the solution of the optimization problem lives in
the infinite dimensional feature space; therefore, additional regularization to keep the output
close to the original signal may be needed. One important difference compared
with conventional kernel methods, which normally yield an infinite dimensional
feature space, is that the RKHS induced by correntropy (which we call the VRKHS) has the same
dimension as the input space. In the beamforming problem, the weight vector w has M
degrees of freedom and all the received data are in the M-dimensional Euclidean space.
As derived above, all the transformed data belong to a different M-dimensional vector
space equipped with the inner product structure defined by correntropy. The goal
of the proposed beamformer is to find a template f_w in this VRKHS such that the cost
function is minimized subject to the constraint. Therefore, the number of degrees of freedom of this
optimization problem is still M, so the regularization that would be needed in traditional
kernel methods is not necessary here. Further work needs to be done regarding this point.
E.5 Conclusions

In this research, we have presented a correntropy-based nonlinear beamformer and
compared it with the Capon beamformer, which is a widely used linear technique, and the
kernel-based beamformer, which is one of the nonlinear beamformers. Simulation
results for BPSK wireless communications show that the correntropy
beamformer significantly outperforms the Capon and kernel beamformers in terms of
sidelobe suppression in beam shaping and of a reduced bit error rate. Also, the correntropy
beamformer has a clear advantage over the Capon beamformer in those cases where only small
data sets are available for training and where non-Gaussian noise is present. Compared to
the kernel beamformer, the correntropy beamformer is computationally much simpler:
the kernel method needs to compute the inverse of an N × N
Gram matrix, whereas the correntropy beamformer needs the inverse of an M × M
correntropy matrix, where M ≪ N (in this simulation, M = 25 and N = 1000). In
addition, we hypothesize that in our methodology regularization is automatically achieved
by the kernel through the expected value operator (which corresponds to a density
matching step utilized to evaluate correntropy).

Figure E-1. Comparisons of the beampattern for three beamformers (Capon, kernel,
correntropy) in Gaussian noise with 10dB of SNR: (a) N = 100, (b) N = 1000.

Figure E-2. Comparisons of BER for three beamformers in Gaussian noise with different
SNRs: (a) N = 100, (b) N = 1000.

Figure E-3. Comparisons of BER for three beamformers with different characteristic
exponent (α) levels.

Figure E-4. Comparisons of the beampattern for three beamformers in non-Gaussian
noise: (a) α = 1.5, (b) α = 1.0.

LIST OF REFERENCES
[1] B. V. Kumar, Tutorial survey of composite filter designs for optical correlators,
Appl.Opt, vol. 31, pp. 47734801, 1992.
[2] A. Mahalanobis, A. Forman, M. Bower, R. Cherry, and N. Day, Multi-class SAR
ATR using shift invariant correlation filters, special issue of Pattern Recognition on
correlation filters and neural networks, vol. 27, pp. 619626, 1994.
[3] A. Mahalanobis, B. Vijaya Kumar, D. W. Carlson, and S. Sims, Performance
evaluation of distance classifier correlation filters, in Proc. SPIE, 1994, vol. 2238, pp.
213.
[4] R. Shenoy and D. Casasent, Correlation filters that generalize well, in Proc. SPIE,
March 1998, vol. 3386, pp. 100110.
[5] M. Savvides, B. V. Kumar, and P. Khosla, Face verification using correlation filters,
in Proc. Third IEEE Automatic Identification Advanced Technologies, Tarrytown, NY,
2002, pp. 5661.
[6] B. V. Kumar, M. Savvides, C. Xie, and K. Venkataramani, Biometric verification
with correlation filters, Applied Optics, vol. 43, no. 2, pp. 391402, Jan 2004.
[7] B. V. K. V. Kumar, A. Mahalanobis, and R. Juday, Correlation Pattern Recognition,
Cambridge University Press, 2005.
[8] G. Turin, An introduction to matched filters, IEEE Trans. Information Theory,
vol. 6, pp. 311329, 1960.
[9] S. M. Kay, Fundamentals of Statistical signal processing,Volume II Detection Theory,
Prentice-Hall, 1998.
[10] A. VanderLugt, Signal detection by complex spatial filtering, IEEE Trans.
Information Theory, , no. 10, pp. 139145, 1964.
[11] C. Hester and D. Casasent, Multivariant technique for multiclass pattern
recognition, Appl.Opt, vol. 19, pp. 1758-1761, 1980.
[12] B. V. Kumar, Minimum variance synthetic discriminant functions, J.Opt.Soc.Am.A,
vol. 3, no. 10, pp. 15791584, 1986.
[13] A. Mahalanobis, B. V. Kumar, and D. Casasent, Minimum average correlation
energy filters, Appl.Opt, vol. 26, no. 17, pp. 36333640, 1987.
[14] D. Casasent and G. Ravichandran, Advanced distortion-invariant minimum average
correlation energy MACE filters, Appl.Opt, vol. 31, no. 8, pp. 11091116, 1992.
[15] G. Ravichandran and D. a. Casasent, Minimum noise and correlation energy filters,
Appl.Opt, vol. 31, no. 11, pp. 18231833, 1992.


[16] P. Refregier and J. Figue, Optimal trade-off filter for pattern recognition and their
comparison with Wiener approach, Opt. Computer Process., vol. 1, pp. 3-10, 1991.
[17] A. Mahalanobis, B. V. Kumar, S. Song, S. Sims, and J. Epperson, Unconstrained
correlation filters, Appl.Opt, vol. 33, pp. 37513759, 1994.
[18] A. Mahalanobis, B. V. Kumar, and S. Sims, Distance-classifier correlation filters for
multiclass target recognition, Appl.Opt, vol. 35, no. 17, pp. 31273133, June 1996.
[19] M. Alkanha and B. V. Kumar, Polynomial distance classifier correlation filter for
pattern recognition, Appl.Opt, vol. 42, no. 23, pp. 46884708, Aug. 2003.
[20] J. Fisher and J. Principe, A nonlinear extension of the MACE filter, Neural
Networks, vol. 8, pp. 11311141, 1995.
[21] J. Fisher and J. Principe, Formulation of the mace filter as a linear associative
memory, in Proc. Int. Conf. on Neural Networks, 1994, vol. 5.
[22] J. Fisher and J. Principe, Recent advances to nonlinear MACE filters, Optical
Engineering, vol. 36, no. 10, pp. 26972709, Oct. 1998.
[23] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, 2002.
[24] B. Scholkopf, A. J. Smola, and K. Muller, Kernel principal component analysis,
Neural Computation, vol. 10, pp. 12991319, 1998.
[25] A. Ruiz and E. Lopez-de Teruel, Nonlinear kernel-based statistical pattern analysis,
IEEE Trans. on Neural Networks, vol. 12, pp. 1632, 2001.
[26] H. Kwon and N. M. Nasrabadi, Kernel matched signal detectors for hyperspectral
target detection, in Proc. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP),
2005, vol. 4, pp. 665-668.
[27] K. Jeong, P. Pokharel, J. Xu, S. Han, and J. Principe, Kernel synthetic discriminant
function for object recognition, in Proc. Int. Conf. Acoustics, Speech, Signal
Processing (ICASSP), France, May 2006, vol. 5, pp. 765-768.
[28] C. Xie, M. Savvides, and B. V. Kumar, Kernel correlation filter based redundant
class-dependence feature analysis KCFA on FRGC2.0 data, in Proc. 2nd Int.
Workshop Analysis Modeling of Faces Gesture (AMFG), Beijing, 2005.
[29] I. Santamara, P. Pokharel, and J. Principe, Generalized correlation function:
Definition,properties and application to blind equalization, IEEE Trans. Signal
Processing, vol. 54, no. 6, pp. 21872197, June 2006.
[30] P. Pokharel, R. Agrawal, and J. Principe, Correntropy based matched filtering, in
Proc. IEEE Int. Workshop on Machine Learning for signal Processing (MLSP), Sept.
2005, pp. 148155.


[31] J. W. Fisher, Nonlinear Extensions to the Minimum Average Correlation Energy
Filter, Ph.D. dissertation, University of Florida, Gainesville, FL, 1997.
[32] B. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin
classifiers, in Proc. 5th COLT, 1992, pp. 144152.
[33] V.Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 1995.
[34] N. Cristianini and J. S. Taylor, An Introduction to Support Vector Machines,
Cambridge University Press, 2000.
[35] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc., vol. 68, pp.
337404, 1950.
[36] E. Parzen, On the estimation of probability density function and mode, The Annals
of Mathematical Statistics, 1962.
[37] J. Mercer, Functions of positive and negative type, and their connection with the
theory of integral equations,, Philosophical Trans. of the Royal Society of London,
vol. 209, pp. 415446, 1909.
[38] http://www.amp.ece.cmu.edu : Advanced multimedia processing lab at electrical
and computer eng., CMU, .
[39] E. Parzen, Statistical methods on time series by hilbert space methods, Tech. Rep.
Technical Report No 23, Applied Mathematics and Statistics Laboratory, Stanford
University, 1959.
[40] S. Sudharsanan, A. Mahalanobis, and M. Sundareshan, Unified framework for the
synthesis of synthetic discriminant functions with reduced noise varianve and sharp
correlation struture, Optical Engineering, 1990.
[41] J. C. Principe, D. Xu, and J. Fisher, Information theoretic learning, in Unsupervised Adaptive Filtering, S. Haykin, Ed., pp. 265319. JOHN WILEY, 2000.
[42] W. Liu, P. P. Pokharel, and J. C. Principe, "Correntropy: properties and applications
in non-Gaussian signal processing," IEEE Trans. on Signal Processing, in press.
[43] W. Liu, P. Pokharel, and J. Principe, "Correntropy: a localized similarity measure,"
in Proc. 2006 IEEE World Congress on Computational Intelligence (WCCI), Canada,
July 2006, pp. 10018-10023.
[44] B. W. Silverman, Density Estimation for Statistics and Data Analysis, CRC Press,
1986.
[45] P. Pokharel, J. Xu, D. Erdogmus, and J. Principe, "A closed form solution for a
nonlinear Wiener filter," in Proc. Int. Conf. Acoustics, Speech, Signal Processing
(ICASSP), France, May 2006, vol. 3, pp. 720-723.
[46] K. Jeong and J. Principe, "The correntropy MACE filter for image recognition,"
in Proc. IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP),
Ireland, July 2006, pp. 9-14.
[47] L. Greengard and J. Strain, "The fast Gauss transform," SIAM J. Sci. Statist.
Comput., vol. 12, no. 1, pp. 79-94, Jan. 1991.
[48] T. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical
Computer Science, vol. 38, pp. 293-306, 1985.
[49] T. Ross, S. Worrell, V. Velten, J. Mossing, and M. Bryant, "Standard SAR ATR
evaluation experiments using the MSTAR public release data set," in Proc. SPIE,
April 1998, vol. 3370, pp. 566-573.
[50] R. Gonzalez and R. Woods, Digital Image Processing, Second Edition, Prentice Hall,
2002.
[51] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive
Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[52] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, John
Wiley and Sons, 2001.
[53] D. Erdogmus, Y. Rao, H. Peddaneni, A. Hegde, and J. Principe, "Recursive principal
components analysis using eigenvector matrix perturbation," EURASIP Journal on
Applied Signal Processing, no. 13, pp. 2034-2041, Mar. 2004.
[54] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition,
Academic Press Professional, 1990.
[55] S. Kaski, "Dimensionality reduction by random mapping: Fast similarity computation
for clustering," in Proc. Int. Joint Conf. on Neural Networks (IJCNN), 1998, pp.
413-418.
[56] D. Fradkin and D. Madigan, "Experiments with random projections for machine
learning," in Proc. Conference on Knowledge Discovery and Data Mining, 2003, pp.
517-522.
[57] E. Bingham and H. Mannila, "Random projection in dimensionality reduction:
applications to image and text data," in Proc. Conference on Knowledge Discovery
and Data Mining, 2001, pp. 245-250.
[58] D. Achlioptas, "Database-friendly random projections," in Symposium on Principles
of Database Systems (PODS), 2001, pp. 274-281.
[59] S. Dasgupta, "Experiments with random projection," in Proc. Conference on
Uncertainty in Artificial Intelligence, 2000.
[60] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information," IEEE Trans. on
Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
[61] S. Dasgupta and A. Gupta, "An elementary proof of the Johnson-Lindenstrauss
lemma," Tech. Rep. TR-99-006, International Computer Science Institute, Berkeley,
CA, 1999.
[62] G. H. Golub and C. F. Van Loan, Matrix Computations, North Oxford Academic,
Oxford, UK, 1983.
[63] E. J. Candes and T. Tao, "Near-optimal signal recovery from random projections:
Universal encoding strategies?," IEEE Trans. on Information Theory, vol. 52, no. 12,
pp. 5406-5425, Dec. 2006.
[64] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with
binary coins," Journal of Computer and System Sciences, vol. 66, pp. 671-687, 2003.
[65] R. Hecht-Nielsen, "Context vectors: general purpose approximate meaning
representations self-organized from raw data," IEEE Press, 1994.
[66] D. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, vol. 52, no.
4, pp. 1289-1306, 2006.
[67] D. Casasent and N. A., "Confuser rejection performance of EMACH filters for
MSTAR ATR," in Proc. SPIE, April 2006, vol. 6245, pp. 62450D-1-12.
[68] B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial
filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-22, April 1988.
[69] H. Krim and M. Viberg, "Two decades of array signal processing research: the
parametric approach," IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67-94,
July 1996.
[70] M. S. Bartlett, "Smoothing periodograms from time series with continuous spectra,"
Nature, vol. 161, no. 4096, pp. 686-687, May 1948.
[71] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of
the IEEE, vol. 57, no. 8, pp. 1408-1418, August 1969.
[72] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophysics, vol. 36, no. 4,
pp. 134-148, August 1971.
[73] P. Stoica, Z. Wang, and J. Li, "Robust Capon beamforming," IEEE Signal Processing
Letters, vol. 10, no. 6, pp. 172-175, June 2003.
[74] R. G. Lorenz and S. P. Boyd, "Robust minimum variance beamforming," IEEE
Trans. Signal Processing, vol. 53, no. 5, pp. 1684-1696, May 2005.
[75] M. Shao and C. L. Nikias, "Signal processing with fractional lower order moments:
Stable processes and their applications," Proceedings of the IEEE, vol. 81, no. 7, pp.
986-1010, July 1993.
[76] R. Adler, R. E. Feldman, and M. S. Taqqu, A Practical Guide to Heavy Tails:
Statistical Techniques and Applications, Boston, MA: Birkhauser, 1998.
[77] P. Tsakalides, R. Raspanti, and C. L. Nikias, "Angle/Doppler estimation in
heavy-tailed clutter backgrounds," IEEE Trans. Aerospace and Electronic Systems,
vol. 35, no. 2, pp. 419-436, April 1999.
[78] T. Lo, H. Leung, and J. Litva, "Nonlinear beamforming," Electronics Letters, vol. 27,
no. 4, pp. 350-352, February 1991.
[79] S. Chen, L. Hanzo, and A. Wolfgang, "Kernel-based nonlinear beamforming
construction using orthogonal forward selection with the Fisher ratio class separability
measure," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 478-481, May 2004.
[80] M. Martinez-Ramon, J. L. Rojo-Alvarez, and G. Camps-Valls, "Kernel antenna array
processing," IEEE Trans. Antennas and Propagation, vol. 55, no. 3, pp. 642-650,
March 2007.
[81] K. H. Jeong, W. Liu, S. Han, and J. C. Principe, "The correntropy MACE filter,"
submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI).
BIOGRAPHICAL SKETCH
Kyu-Hwa Jeong was born in June 1972 in Korea and received the M.S. degree
in electronics engineering from Yonsei University, Seoul, Korea, in 1997, where he
focused on adaptive filter theory and its applications to acoustic echo cancellation.
From 1997 to 2003, he was a senior research engineer in the Digital Media Research
Laboratory at LG Electronics, Seoul, Korea, where he was a member of the optical
storage group and participated mainly in CD/DVD recorder projects. Since 2003, he
has been pursuing the Ph.D. degree with the Computational NeuroEngineering Lab
in electrical and computer engineering at the University of Florida, Gainesville, FL.
His research interests are in the fields of signal processing and machine learning and
their applications to image pattern recognition.