Abstract
This paper describes an enhanced negative selection algorithm (NSA) called V-
detector. Several key characteristics make this method a state-of-the-art advance
in the decade-old NSA. First, individual-specific size (or matching threshold) of
the detectors is utilized to maximize the anomaly coverage at little extra cost.
Second, statistical estimation is integrated in the detector generation algorithm
so the target coverage can be achieved with given probability. Furthermore,
this algorithm is presented in a generic form based on the abstract concepts of
data points and matching threshold. Hence it can be extended from the current real-valued implementation to other problem spaces with different distance measures, data/detector representation schemes, etc. By using a one-shot process to generate the detector set, this algorithm is more efficient than strongly evolutionary approaches. It also includes the option to interpret the training data
as a whole so the boundary between the self and nonself areas can be detected
more distinctly. The discussion is focused on the features attributed to negative
selection algorithms instead of combination with other strategies.
keywords: negative selection algorithms, artificial immune systems, anomaly
detection, classification, computational intelligence, algorithm
1 Introduction
Negative selection algorithms are inspired by the natural immune system's self/nonself discrimination mechanism. They were designed by modeling the biological process in which T-cells mature in the thymus through being censored against self cells [14]. They are among the earliest models of artificial immune systems (AIS) [9, 15, 16].
In a negative selection algorithm, a collection of detectors, usually called the detector set, is generated so as not to match any self samples (training data). The detectors are subsequently used to check whether incoming new data items are normal (self) or not (nonself). It is typically regarded as an anomaly detection or a
one-class classification method because the training data are from normal cases
only. The common engineering goals in various negative selection algorithms
are: (1) to limit the number of the detectors needed to be generated; (2) to
make the detector set cover as many of the anomalies as possible (ideally all the
anomalies); (3) to generate the detector set efficiently. To a certain extent, all
the concerns are around the so-called detector coverage, namely the proportion
of the nonself space that is covered or recognized by the detector set. Two
intertwined questions arise at this point: First, how do we estimate the coverage?
It is desirable to know how effective the detector set is before using it to detect anomalies. Second, how do we achieve enough coverage with a relatively small number of detectors? Because the number of detectors is the main factor that
decides the performance of the algorithms, especially during detection phase, it
is desirable to use as few detectors as possible.
From both ends, V-detector [23, 24], as introduced in this paper, handles the issues with innovative and efficient techniques†. Furthermore, this method is potentially very useful when near-perfect coverage is not necessary and alternative estimation is inaccurate. The rest of this paper is organized as follows. In section 2, we briefly review previous research efforts to address the
issue of detector coverage and the related works that V-detector is based on. In
section 3, we will describe the algorithm in detail. Section 4 uses experimental
results to illustrate the properties of V-detector and to discuss the applicability
of this method. Lastly, the conclusion is summarized in section 5.
2 Related Works
2.1 Knowledge of Detection Coverage in Negative Selection Algorithms
Ever since the negative selection algorithm was first proposed, detector coverage has been a major concern. It is desirable but never trivial to determine the coverage quantitatively for a specific negative selection algorithm, or
to decide the necessary number and distribution of detectors for a given cover-
age. Statistical methods were used in several works. D’haeseleer et al [10] did
thorough probability analysis on the relation between the number of detectors
and the probability that a random anomaly can be detected. It used matching
probability or failure probability to decide or evaluate the number of detectors.
In some later works on negative selection algorithms, similar analysis was in-
cluded as the part of theoretical support [1, 11, 12]. Some other works focused
on different aspects, e.g. lower bound for the fault probability [31], but used
statistics from the same point of view.
† Implementation and more information about V-detector can be found at http://vdetector.zhouji.net.
For binary or finite-alphabet string representation, such analysis is relatively easy to carry out. When negative selection algorithms were extended to different data representations, the probability could no longer always be computed by straightforward combinatorics. Real-valued representation plays a unique role in many applications that cannot be represented effectively in binary form. For problems with a natural real-valued representation, it is easier to interpret the output, and maintaining affinity in the representation space usually results in a more stable algorithm. Coverage in such cases is relatively less explored because
the search space is continuous and hard to analyze by enumerative combina-
torics.
In real-valued negative selection algorithms, the detectors are usually repre-
sented as hyperspheres, hyper-rectangles, or hyperellipsoids. They are generated
by either the original generation-elimination method or various other methods.
Gonzalez et al [17] used a random process to generate and redistribute detectors
so the total overlap can be minimized generally. In another work by Dasgupta
et al [6], the detectors are represented as rules or in fact rectangular areas in
a multi-dimensional real space and generated by a genetic algorithm. A frame-
work of a multi-level learning algorithm (MILA)[8] consists mainly of negative
selections in real-valued representation using matching rule defined in lower di-
mensional sub-spaces. Another multi-level AIS compared binary and real-valued
representation [2]. Some algorithms involve resizing and redistribution of the
initial detectors [7].
In real-valued representation, Gonzalez et al [17] successfully used the Monte Carlo method to estimate the volume of the self region and then decided the number of detectors based on the estimate. That method is more sophisticated than the analysis in binary representation in the sense that (1) the proportion of possible self samples is evaluated in a probabilistic way instead of deterministically and
(2) the geometry and assumed distribution of the detectors have to be taken
into consideration to carry out the analysis.
Nevertheless, these seemingly different analyses tackled the problem with a similar approach, namely, to determine the number of detectors that is considered enough before the detector set is generated. They were oriented to general analysis of the relation between the number of detectors and the coverage; a given detector set is not in question. A drawback is that eventually we have
no direct knowledge of detector coverage of the actual detectors that are gener-
ated. Moreover, the link from volume of self or nonself region to the number of
detectors is not only an intuitive estimate, but also depends on the assumption
of detector distribution that may be very different from the actual detector set.
For binary representation, the distribution is usually assumed to be uniform,
which is unrealistic considering the specific application and the detector generation algorithm. For the real-valued case [17], the distribution is assumed to follow a simple geometric pattern so that the overlap can be estimated, though only roughly.
V-detector [24] approaches the issue from a different angle. It estimates the coverage of the actual set of detectors, which means (1) whatever information we can obtain reflects the actual coverage instead of an estimate based purely on the number of detectors; (2) no assumption is needed for the distribution of the detectors. As will be shown in the algorithm details described in section 3, it has the further advantage that the geometry of the detector doesn't matter, so there is no difficulty in using it with different representations of detectors.
While the original experiments were done in real-valued space, it is possible
to implement a similar mechanism in other representations, e.g. the popular
binary or other finite alphabet string representation. More generally, it can be used in any detection mechanism as long as an individual point can be verified to be recognized or not, even though there is no explicit algorithm to evaluate the rate of detection. The same methodology also applies to different algorithms in which a similar issue of proportion estimation exists.
Considering a collection of random points in the nonself region, V-detector al-
gorithm uses the percentage of the points that are covered by the detectors to
estimate the covered proportion of the entire nonself region. However, an es-
timate of the proportion value itself does not tell us how much or how likely
the estimate may be different from the real population proportion - in our case,
the actual detector coverage. To draw a more meaningful conclusion, we should
construct a confidence interval based on the central limit theorem [21, 13, 20]. In other words, although we do not get the same percentage every time if we repeat sampling of a fixed size, the distribution of these percentages is close to a normal distribution.
The central limit theorem justifies using a normal distribution as an approx-
imation for the distribution of x̄ when n is sufficiently large. There are two
apparent sources of error in using the normal distribution as an approximation
of the binomial distribution:
1. The normal distribution is always symmetrical; the binomial distribution
is symmetric only if the probability of one outcome, p, is 0.5.
2. The normal distribution is continuous; the binomial distribution is dis-
crete.
A rule of thumb taking into account both the problems of asymmetry and
discreteness is to use the normal distribution approximation only if np > 5,
n(1 − p) > 5 and n > 10.
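The rule of thumb above can be checked mechanically before a normal approximation is used. The following is a minimal sketch in Python; the helper name `normal_approx_ok` is hypothetical and is not part of the V-detector implementation.

```python
def normal_approx_ok(n: int, p: float) -> bool:
    """Check the rule of thumb: np > 5, n(1 - p) > 5 and n > 10."""
    return n * p > 5 and n * (1 - p) > 5 and n > 10


# With a large target proportion such as p = 0.99, a sample of 100 points
# is too small for the approximation, while 1000 points are enough.
print(normal_approx_ok(100, 0.99))
print(normal_approx_ok(1000, 0.99))
```

The example illustrates why a large p, which is the typical case for detector coverage, pushes the required sample size up.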
There are alternative distributions that can be used to deal with the asym-
metry problem, and there are mathematically strict corrections for discontinuity
too. Even though these are not the main concerns in the application in ques-
tion, the issue of asymmetry is in fact not negligible. Because the proportion of
covered nonself points, p, is the variable to be considered, it is very likely that
we need to consider a large p, for example, 90% or 99%. Fortunately, we can
circumvent the issue with proper strategy in our proposed algorithm considering
the fact that we care more about enough coverage than its exact value.
where q̂ = 1 − p̂ and n is the sample size. In the case of estimating detector
coverage, we are more interested in making a conclusion about the lower limit
of coverage, p > pmin , where pmin is the minimum coverage we can presume
with some certainty. So we can use a one-sided confidence interval
\[ p > \hat{p} - E, \tag{4} \]
where
\[ E = z_\alpha \sqrt{\frac{\hat{p}\hat{q}}{n}}. \tag{5} \]
To ensure the assumption that the binomial random variable is approximately normally distributed with the mean µ = np and standard deviation σ = √(npq), we should have np ≥ 5, nq ≥ 5.
In Equation (2), zα/2 is the z score for a confidence level of 1 − α/2 - the
positive standard z value that separates an area of α/2 in the right tail of
the standard normal distribution curve. For a standard normal distribution,
the probability of −zα/2 ≤ x ≤ zα/2 , P (−zα/2 ≤ x ≤ zα/2 ) = 1 − α. The
probability of x ≤ zα/2 , P (x ≤ zα/2 ) = 1 − α/2. Similarly, zα in Equation (5)
is where P (x ≤ zα ) = 1 − α.
2.4 Inspiration from Learning Theory
Statistical inference does not tell us the exact value of the detector coverage. As
we will see later, it tells the probability that the approximation of coverage is
good enough. This idea of “Probably Adequate” becomes more comprehensible
when we look into the similar concepts in machine learning. One of the major
models of computational learning theory is Probably Approximately Correct
learning (PAC learning) [5, 19]. In terms of PAC learning, successful learning
of an unknown target concept should entail obtaining, with high probability,
a hypothesis that is a good approximation of it. Accuracy, or how good the
approximation is, is described by ε: the hypothesis returned, h, should satisfy
error(h) ≤ ε. Confidence, or the chance we can correctly obtain the hypothesis
h, is described by σ: the probability of returning h is at least 1 − σ. h is parallel
to the target coverage.
Learning theory such as PAC learning could provide guidance for the development of negative selection algorithms. Although analyzing a specific negative selection algorithm, e.g. V-detector, may not be a straightforward task, it should be helpful to analyze whether the problem that is solved by a negative selection algorithm, or more particularly, by V-detector, is PAC learnable or not. As the first step leading to formalization of the problem,
regardless of the algorithm that may be used to solve it, we should clarify the
basic assumptions about the training data our algorithms take. Some of the
previous works in this area, especially those which were not based on binary
representation and did not assume all self features are present in training data,
are hard to compare with one another shoulder-to-shoulder due to the lack of
equivalent assumptions.
In the following analysis, we assume
• Both self and nonself points appear in some bounded n-dimensional real space. For simplicity, let us assume it is [0, 1]^n.
• A finite number of self samples are provided as input. They are randomly distributed over the self region.
• The training data is noise free, meaning all the self samples are real self points. This is not necessary in principle, but is used to simplify the discussion.
• To evaluate the detection performance, the testing data are a finite number of random points over the entire space in question described above. Each of those points can be verified to be self or nonself.
3 Algorithm
3.1 Coverage - Proportion - Probability
Definition 1 The detector coverage of a given detector set is defined as the
ratio of the volume of the nonself region that can be recognized by any detector
in the detector set to the volume of the entire nonself region.
Generally, it can be written as
\[ p = \frac{\int_{\vec{x} \in D} d\vec{x}}{\int_{\vec{x} \in S} d\vec{x}}, \]
where S is the set of nonself points and D is the set of nonself points that are
recognized by the detectors. In the case of 2-dimensional continuous space, it is
reduced to the ratio of the area covered to the area of the entire nonself region
\[ p = \frac{\iint_{(x,y) \in D} dx\,dy}{\iint_{(x,y) \in S} dx\,dy}. \]
[Figure: nonself region and uncovered nonself region]
In statistical terms, the points of the nonself region are our population. Generally speaking, the population size is infinite. Whether a randomly chosen point is covered by the detectors is a yes/no outcome, so the number of covered points in a sample has a binomial distribution. The detector coverage is the same as the proportion of the covered points, which equals the probability that a random point in the nonself region is a covered point. Assuming
all the points from the entire nonself region are equally likely to be chosen in
a random sampling process, the probability of a sample point being recognized
by the detectors is thus equal to p. For a sample of fixed size, the proportion of
covered points is
\[ \hat{p} = \frac{|\hat{D}|}{|\hat{S}|}, \]
where Ŝ is the sample; and D̂ is the set of sample points that are recognized by
the detectors. |Ŝ| is thus the sample size. p̂ is the sample statistic that is the
point estimate of the population proportion.
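The point estimate p̂ can be obtained by random sampling, as sketched below. The sketch rests on assumptions that are not fixed by the text: the space is the unit hypercube, detectors are hyperspheres given as (center, radius) pairs with Euclidean matching, and a predicate `is_self` is available to reject sample points that fall in the self region. All names are hypothetical.

```python
import random
from math import dist


def estimate_coverage(detectors, is_self, n_samples, dim=2, rng=None):
    """Return p_hat = |D_hat| / |S_hat| from n_samples random nonself points."""
    rng = rng or random.Random()
    covered = 0
    drawn = 0
    while drawn < n_samples:
        point = tuple(rng.random() for _ in range(dim))
        if is_self(point):
            continue  # only nonself points belong to the population
        drawn += 1
        # A point is covered if it lies within the radius of any detector.
        if any(dist(point, center) <= radius for center, radius in detectors):
            covered += 1
    return covered / n_samples


# Example: the self region is the strip x < 0.1; one detector in the middle.
detectors = [((0.5, 0.5), 0.3)]
print(estimate_coverage(detectors, lambda p: p[0] < 0.1, 1000))
```

Because the sample is random, repeated calls give slightly different values of p̂, which is exactly why a confidence statement is needed on top of the point estimate.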
our goal is to make a decision of adding more detectors or not. What makes
this paper’s method different from traditional statistical inference is that the
testing can be done as part of the detector generation algorithm. Although it
may be implemented as a relatively independent module, we still have to face
a dilemma: the detector coverage or the proportion to be estimated is actually
changing during the detector generation. So we need to design a process in
which the hypothesis testing happens only when we temporarily stop adding new
detectors. Otherwise, the testing will be meaningless. At the same time, we also
try to reuse the random samples we use in hypothesis testing as the candidate
detectors. This doubles the advantage of integrating hypothesis testing in V-
detector.
In the case of estimating coverage, the null hypothesis would be “The cover-
age of the nonself region by all the existing detectors is below percentage pmin .”
If we accept the null hypothesis, we would include more detectors. If the null
hypothesis is actually false, the cost of a Type II Error would be more unnecessary detectors. On the other hand, if we reject the null hypothesis by mistake, we would end up with lower coverage than required. The latter, the so-called Type I Error, is exactly our concern. The significance level α is the maximum acceptable probability that we may make a Type I Error, that is, end up with fewer detectors than needed. We need a fixed sample size to do the hypothesis testing. If the conclusion is that we need more detectors, we take all the uncovered sample points
to make new detectors. This largely saves the cost of the entire algorithm.
Figure 2 shows the diagram of the modified V-detector that uses hypothesis
testing to estimate the detector coverage.
To guarantee the assumption np ≥ 5 and nq ≡ n(1 − p) ≥ 5 is valid, we can
choose sample size by
n > max(5/p, 5/(1 − p)).
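The sample size condition can be turned into a small helper that returns the smallest admissible integer n. This is a minimal sketch; the name `min_sample_size` is hypothetical, and exact rational arithmetic is used only to avoid floating-point round-off when 5/p or 5/(1 − p) is a whole number.

```python
from fractions import Fraction
from math import floor


def min_sample_size(p) -> int:
    """Smallest integer n with n > max(5/p, 5/(1 - p))."""
    p = Fraction(str(p))                 # exact decimal value of p
    bound = max(5 / p, 5 / (1 - p))
    return floor(bound) + 1              # strictly greater than the bound


# The required sample grows quickly as the target coverage approaches 1.
print(min_sample_size(0.9))
print(min_sample_size(0.99))
```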
If x points are covered, then p̂ = x/n, where n is the sample size, and we have
\[ z = \frac{x}{\sqrt{npq}} - \sqrt{\frac{np}{q}}, \]
that is, z = (x − np)/√(npq).
During the procedure to test more points, x will either increase (when the
point is covered) or stay unchanged (when the point is uncovered). So does z.
Before the procedure finishes for all n points, if z based on the tested points is
larger than zα , it is enough to reject the null hypothesis and claim enough cov-
erage. At that point, the test can be stopped. Because the ultimate conclusion
from the procedure is either rejection or acceptance of the null hypothesis, not
the estimate of p and confidence interval, it is not necessary to finish trying to
get a “better” answer.
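The early-stopping test described above can be sketched as follows. This is an illustrative sketch rather than the V-detector implementation: the function name `coverage_test` is hypothetical, the covered/uncovered outcomes are assumed to be supplied as a list of booleans of the fixed sample size n, and p is the target coverage stated in the null hypothesis.

```python
from math import sqrt
from statistics import NormalDist


def coverage_test(outcomes, p_min, alpha):
    """Return (reject_null, points_tested) for a fixed sample of outcomes.

    outcomes: sequence of booleans of length n (True = point is covered).
    """
    n = len(outcomes)
    q = 1.0 - p_min
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    x = 0
    for tested, is_covered in enumerate(outcomes, start=1):
        if is_covered:
            x += 1
            # z only changes when x increases, so it is recomputed here.
            z = x / sqrt(n * p_min * q) - sqrt(n * p_min / q)
            if z > z_alpha:
                return True, tested      # enough coverage, stop early
    return False, n                      # null hypothesis is not rejected


# Example: n = 100, target coverage 90%, alpha = 0.1, every point covered.
print(coverage_test([True] * 100, 0.9, 0.1))
```

In the full algorithm, the uncovered points of a sample that fails the test would be kept as candidate detectors; that bookkeeping is left out of the sketch.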
If the assumption nq > 5 is in fact invalid because the real p is larger than the p we used, then the actual coverage is more than what we want to test. Our confidence in the coverage is not compromised in this case. If the assumption np > 5 is in fact invalid because p is so small, the hypothesis test will pass only when it could pass a test using the actual non-normal distribution. Because the probability curve skews to the left side (origin side), zα of such a distribution would be smaller than zα of the normal distribution. If z does not pass this skewed zα, it will not pass the normal distribution's zα either: z ≤ zα|p<5/n ≤ zα.
[Figure 2: Flowchart of the modified V-detector with hypothesis testing. Steps shown: choose p and α; N = 0, x = 0; sample a point; Self?; N = N + 1; Covered?; save the candidate; x = x + 1; compute z; z > zα?; N = n?]
[Figure 3: (a) Constant-sized detectors; (b) Variable-sized detectors]
detectors covering the non-self region. Figure 3(a) shows the case where the
detectors are of constant size. In this case, a large number of detectors are
needed to cover the large area of nonself space. The well-known issues of “holes”
are illustrated in black. In figure 3(b), using variable-sized detectors, the larger
area of non-self space can be covered by fewer detectors, and at the same time,
smaller detectors can cover the holes. Since the total number of detectors is
controlled by using the large detectors, it becomes more feasible to use smaller
detectors when necessary.
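One way to obtain variable-sized detectors is sketched below. The radius rule used here, the distance from the candidate to the nearest self sample minus the self radius, is an assumption for illustration and is not spelled out in the text above; Euclidean distance is also assumed, and the name `make_detector` is hypothetical.

```python
from math import dist


def make_detector(candidate, self_samples, self_radius):
    """Return (center, radius) for a candidate point, or None if it is self.

    Assumed rule: the detector grows until it touches the self radius
    around the nearest self sample.
    """
    nearest = min(dist(candidate, s) for s in self_samples)
    radius = nearest - self_radius
    if radius <= 0:
        return None   # the candidate falls within the self threshold
    return candidate, radius


# A candidate far from the self samples yields a large detector,
# while one close to them yields a small detector or none at all.
samples = [(0.0, 0.0), (0.1, 0.0)]
print(make_detector((0.6, 0.8), samples, 0.05))
print(make_detector((0.12, 0.0), samples, 0.05))
```

Under this rule large detectors cover the open nonself area and small ones fit into the gaps near the self region, which matches the behavior described for figure 3(b).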
Another advantage of this new method is that it facilitates the usage of the
above described statistical inference.
It can be further extended to variable matching rules or at least different
distance measures. That would be an easy way to realize detectors of different
geometric shapes pursued by many other works.
interpretation”.
[Figure: panels labeled "abnormal region" and "self radius"]
Naturally, each self sample point can be interpreted as evidence that its vicinity is self region. On the other hand, we can fairly assume that the self
samples can be drawn anywhere over the entire self region. There is no reason
to exclude the points that are close to the boundary between self and nonself
regions no matter what kind of matching rule or distance measure is used.
Fig. 6 illustrates the “boundary dilemma”, the scenario that the self samples
close to the boundary inevitably extend the actual self region due to the vari-
ability allowed by the algorithm. In this figure, the shaded area is the “real”
self region; the dots are the self samples and the circles are their generalization. If the self threshold is too small, the space between self samples cannot be represented. In other words, more samples are needed to train the system properly. On the other hand, if the self threshold is large, the false self region represented by the boundary samples may be too large to accept.
Figure 6: “Boundary Dilemma”
In the case that the over-covered area is too large compared with the real
nonself region, the error would be large. When the nonself region is a thin stripe between two self regions, it may not be representable at all. In those cases, the boundary dilemma becomes more serious.
The issue described above is tackled by an ingenious yet simple strategy using a negative selection algorithm to achieve the interpretation illustrated in fig. 5(c).
The above discussion concerns general interpretation of self samples. From
the view point of a negative selection algorithm, the difference in interpretation
is shown in fig. 7. Fig. 7(a) shows the coverage of a detector set using a
conservative interpretation; fig. 7(b) shows one using an extremely aggressive
interpretation. Similar to the conceptual discussion, it is possible to generalize
the self samples to a finite self region even if detection is extremely aggressive
outside the self region. This is illustrated in fig. 7(c).
[Figure: panels labeled "self sample" and "self radius"]
so that it is able to detect the boundary of self region. We call it boundary-aware
V-detector [22].
A fixed number of random points from the self region are used as the self
samples to generate the detector set. Another number of random points, in
which some are self, some nonself, are used to test the detection performance of
the detector set. Figure 9 shows examples of training data (self samples) and
test data: 9(a) is a self sample of 100 points; 9(b) is a self sample of 1000 points;
9(c) is 1000 test data including both self points and nonself points. It can be
predicted from this figure that the number of training data will have obvious
influence on the detection results. Figure 10 shows the detector-covered area
using these two different numbers of training points (boundary-aware algorithm,
hypothesis testing, 99% target coverage): 10(a) the area trained with 100 points;
10(b) the area trained with 1000 points. When other control parameters are
different, e.g. using point-wise algorithm, the covered area will not be the same
as in Figure 10, but the number of training points still plays an important role.
[Figure 9: (a) 100 points of self sample; (b) 1000 points of self sample]
The influence of the control parameters and the differences of strategies were
explored with more experiments. From the data side, the difference in results
may come from the number of sample points or the different shapes (including
their specific geometric parameters) of the self region. From the algorithm side,
the difference may come from: target coverage, significance level of hypothesis testing, methods of estimation (naïve estimate or hypothesis testing), self threshold, and V-detector strategy (point-wise or boundary-aware). The performance measures we want to compare include detection rate, false alarm rate, and the number of detectors. The significance level α is set to 0.1 in the results reported in this paper.
[Figure 10: (a) Trained with 100 points; (b) Trained with 1000 points]
Figure 11 compares some results of detection rate using naïve estimate and hypothesis testing for target coverages from 90% through 99%. The number of sample points is 1000. The boundary-aware algorithm was used. The self region in figure 11(a) is an ‘intersection’ shape, which is basically four separated regions. The one in figure 11(b) is a pentagram whose circumscribed circle has radius 1/3. The plot shows the mean of 100 repeated tests; the standard deviation is shown as error bars on the graph. Results obtained with naïve estimate and hypothesis testing are plotted together for comparison. Hypothesis testing has a small but consistent advantage over the naïve method.
[Figure 11: Detection rate versus target coverage for hypothesis testing and naïve estimate. (a) Intersection shape; (b) Pentagram shape]
Table 2 again highlights the difference between naïve estimate and hypothesis testing. The results were from the following setting: boundary-aware strategy, 1000 self sample points, target coverage 90%, and self threshold 0.05. The numbers are the mean of 100 repeated tests and the standard deviation σ is also tabulated with the corresponding variables. Results for two different shapes of self region (‘intersection’ and pentagram) are shown.
[Figure: boundary-aware versus point-wise strategy as a function of self threshold. (a) Detection rate; (b) False alarm rate; (c) Number of detectors]
problem space, detector representation, matching rule, etc. For example, Eu-
clidean distance, or 2-norm distance, widely used in real-valued representation
and in the earlier experiments can be generalized to Minkowski distance of order m, or Lm distance, for any arbitrary m. For a point (x1, x2, ..., xn) and a point (y1, y2, ..., yn) in n-dimensional space the 1-norm distance is Manhattan distance
\[ \sum_{i=1}^{n} |x_i - y_i|. \]
\[ \lim_{m \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^m \right)^{\frac{1}{m}} = \max(|x_i - y_i|,\ i = 1, 2, \cdots, n) \]
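The distance measures above can be collected in one function, sketched here in Python. The name `minkowski_distance` is hypothetical; the infinity norm is handled as a separate case equal to the limit shown above.

```python
from math import inf


def minkowski_distance(x, y, m):
    """L_m distance between points x and y; m = inf gives the infinity norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if m == inf:
        return max(diffs)                        # limit of L_m as m grows
    return sum(d ** m for d in diffs) ** (1.0 / m)


# Manhattan, Euclidean and infinity-norm distances between the same points.
x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(x, y, 1))
print(minkowski_distance(x, y, 2))
print(minkowski_distance(x, y, inf))
```

For a fixed pair of points the distance does not increase as m increases, so a detector of the same radius r matches a larger region under a higher-order norm.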
[Figure: Detection rate versus self threshold for self regions of different shapes: cross, ring, intersection, stripe, pentagram, and their inverted versions]
[Figure: Detection rate and false alarm rate for 1000 and 100 training points]
For different norms, the detector (or recognition region) will take different geometric shapes and have different covering areas. Fig. 15 illustrates the different shapes in 2-dimensional space. They are shown with the same radius. If we use radius r to indicate the size, r can be interpreted as the radius of the circle in the case of 2-norm distance. For Manhattan distance, the detector is a 45°-turned square whose edge is √2 r; for infinity norm, the detector has the shape of a square whose edge is 2r; for any norm between 2 and ∞, the shape is evidently between the radius r circle and the edge 2r square.
Tables 3 and 4 are the results obtained using different distance measures, for the “intersection” self region and the “5-circles” self region, respectively. There are two different implementations of Euclidean distance. One is the default setting of V-detector, in which the distance measure and matching process are actually implemented using the square of Euclidean distance for better speed. The other Euclidean distance is implemented as L2 distance in the general way. In terms of detection results, there seems to be little difference between different distance measures for these two examples, except that the Manhattan distance is slightly more aggressive in raising anomaly alarms. However, the running time of the algorithm is noticeably different with different distance measures. The ∞ norm distance is the fastest. For general Lm distance, the algorithm runs slower for higher m.
[Figure 15: Detector shapes under different distance measures; surviving panel labels: 1-norm distance (Manhattan), 2-norm distance (Euclidean)]
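The speed advantage of working with the square of the Euclidean distance comes from skipping the square root: since both sides are nonnegative, d ≤ r holds exactly when d² ≤ r². The following is a minimal sketch of such a matching test; the function name `matches_squared` is hypothetical and the actual V-detector code may differ.

```python
def matches_squared(point, center, radius):
    """Euclidean matching without a square root: compare d^2 with r^2."""
    squared = sum((a - b) ** 2 for a, b in zip(point, center))
    return squared <= radius * radius


# The point (0.3, 0.4) lies at distance 0.5 from the origin.
print(matches_squared((0.3, 0.4), (0.0, 0.0), 0.6))
print(matches_squared((0.3, 0.4), (0.0, 0.0), 0.4))
```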
Although NSAs have been widely used in various applications and have developed many variations, there is still some skepticism [28] or in some cases confusion
about whether and how they could be used [25]. For example, detector coverage
and detection rate are two terms that may lead to misunderstanding when we
discuss how well the detector set works. Failure to make clear distinction may
muddle otherwise clear analysis. Coverage is the proportion of nonself space
that is covered by detectors. For a given instance, we usually do not know the
actual value because the nonself space is the unknown we are seeking for. If we
Table 3: Effects of different distance measures: ‘Intersection’ shape
Distance measure                              Detection rate  SD    False alarm rate  SD
Euclidean (default efficient implementation)  96.36           1.49  9.77              1.58
Manhattan                                     97.05           0.83  10.53             2.12
Euclidean                                     96.23           1.57  9.59              1.38
3-norm                                        94.69           1.79  9.55              1.56
infinity norm                                 89.62           3.01  11                1.5
limitations.
Nevertheless, when used in suitable problems, V-detector can show its strength
compared with more time-tested methods. SVM (Support Vector Machine) is
a popular statistical learning algorithm and did very well in many experiments.
However, such good results do not guarantee that it can replace alternative
methods like V-detector under any conditions. As a simple example, let us con-
sider a scenario when V-detector is much easier to use than SVM. Two cases
were designed so that the self region is a disconnected region. (1) Fig. 16(a) is
a self region that is a circle partially cut by a cross, which we will call “intersec-
tion”. This is one of the synthetic data sets tested in earlier work [24]. (2) Fig.
16(b) is a self region made of five small circles. Both are over the unit square
2-dimensional search space.
Tables 5 and 6 show clearly that SVM does not work as well as the negative selection algorithm when the default kernel function is used as in previous experiments. That means we at least need to choose a proper kernel function to make SVM work. The correct choice depends on extra knowledge of the problem. V-detector obtained significantly better results without the need to refine the control parameters.
The choice of the kernel function is a known major limitation of SVM [4]. In SVM, when the decision function is not a linear function of the data, the data needs to be mapped to a higher dimensional space in which a linear separation can be done. The kernel function plays the key role in the mapping. The best choice is still a research issue even with prior knowledge. V-detector and other approaches that do not use a decision function have an obvious advantage in this aspect. The disconnected self region in the examples mentioned above is designed to make the possible mapping complicated and to make it hard for a simple kernel function to work. Nonlinear problems are very common in real-world applications, where V-detector can be used more easily. Furthermore, SVM also has difficulties with very large training datasets and discrete data [4]. Both are cases where V-detector shows its advantage.
Table 5: Results over Intersection self region
Method                Detection rate  False alarm rate
SVM ν = 0.05          77.67           6.25
V-detector rs = 0.05  99.82           11.44
SVM ν = 0.1           81.84           54.69
V-detector r = 0.1    96.58           9.69
5 Conclusions
A novel negative selection algorithm called V-detector was introduced. Its
unique features improve the chances that negative selection algorithms can be
applied successfully to a wider range of applications.
• A statistical approach is integrated to analyze the detector coverage in a
negative selection algorithm, which makes the algorithm more reliable. An
effective strategy was developed for its implementation.
• Variable-sized detectors maximize coverage with a limited number of
detectors.
• The boundary-aware algorithm interprets the training points as a collection
rather than independently. Thus, the boundary of the group of training
points can be detected.
• The simple generation process makes this method highly efficient.
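
The variable-sized detectors and the one-shot generation process summarized
above can be sketched roughly as follows. This is a simplified illustration
and not the exact algorithm of this paper: stopping after m consecutive
covered samples is only a crude stand-in for the statistical coverage
estimate, and the names generate_detectors, is_nonself, m, max_detectors, and
max_tries are hypothetical.

```python
import math
import random


def is_nonself(x, detectors):
    """Return True if point x is covered by at least one detector."""
    return any(math.dist(x, center) <= radius for center, radius in detectors)


def generate_detectors(self_points, r_s, m=100, max_detectors=1000,
                       max_tries=100000, seed=0):
    """Generate variable-radius detectors over the unit hypercube.

    Assumed stopping rule: stop once m consecutive random points are
    already covered by existing detectors, or when a limit is reached.
    """
    rng = random.Random(seed)
    dim = len(self_points[0])
    detectors = []          # list of (center, radius) pairs
    covered_streak = 0
    for _ in range(max_tries):
        if covered_streak >= m or len(detectors) >= max_detectors:
            break
        x = tuple(rng.random() for _ in range(dim))
        if is_nonself(x, detectors):
            covered_streak += 1
            continue
        # The radius grows until the detector touches the nearest self sample,
        # whose own radius is r_s; this is what makes the detectors variable-sized.
        radius = min(math.dist(x, s) for s in self_points) - r_s
        if radius > 0:      # otherwise x falls inside the self region
            detectors.append((x, radius))
            covered_streak = 0
    return detectors
```

A candidate point that is already covered only advances the counter, so each
candidate is examined once and no detector is revised after it is accepted.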
The detector generation process in V-detector makes it a perfect platform
for integrating hypothesis testing as a component. Furthermore, the testing
can be implemented partly as a byproduct of the generation process without
adding much extra computational cost.
Another advantage of this method is that it applies to any detector scheme
and detection mechanism, as long as it is verifiable whether a sample point is
covered or not. For example, extension to other representations will make this
method applicable to a much larger variety of applications.
Many issues in the performance of negative selection algorithms are related
to the properties of the training data. For the comparison and analysis of
negative selection algorithms to be more meaningful, it is important to develop
a framework concerning the fundamental assumptions and to categorize the
types of data to be processed.
References
[1] M. Ayara, J. Timmis, R. de Lemos, L. de Castro, and R. Duncan. Nega-
tive selection: How to generate detectors. In J. Timmis and P. J. Bentley,
editors, Proceedings of the 1st International Conference on Artificial Im-
mune Systems (ICARIS), volume 1, pages 89–98, University of Kent at
Canterbury, September 2002. University of Kent at Canterbury Printing
Unit.
[2] M. Bereta and T. Burczyński. Comparing binary and real-valued coding
in hybrid immune algorithm for feature selection and classification of ECG
signals. Eng. Appl. Artif. Intell., 20(5):571–585, August 2007.
[3] P. J. C. Branco, J. A. Dente, and R. V. Mendes. Using immunology
principle for fault detection. IEEE Transactions on Industrial Electron-
ics, 50(2):362–373, April 2003.
[4] C. J. C. Burges. A tutorial on support vector machines for pattern recog-
nition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
[5] F. Cucker and S. Smale. On the mathematical foundations of learning.
Bulletin (New Series) of the American Mathematical Society, 39(1):1–49,
October 2001.
[6] D. Dasgupta and F. Gonzalez. An immunity-based technique to character-
ize intrusion in computer networks. IEEE Transactions on Evolutionary
Computation, 6(3):1081–1088, June 2002.
[7] D. Dasgupta, K. KrishnaKumar, D. Wong, and M. Berry. Negative selec-
tion algorithm for aircraft fault detection. In Proceedings of Third Inter-
national Conference on Artificial Immune Systems (ICARIS 2004), pages
1–13, 2004.
[8] D. Dasgupta, S. Yu, and N. S. Majumdar. MILA - multilevel immune
learning algorithm. In Proceedings of the Genetic and Evolutionary Com-
putation Conference (GECCO 2003), LNCS 2723, pages 183–194, Chicago,
IL, July 12-16 2003. Springer.
[9] L. N. de Castro and J. Timmis. Artificial Immune Systems: A New
Computational Intelligence Approach. Springer, 2002.
[10] P. D’haeseleer, S. Forrest, and P. Helman. An immunological approach to
change detection: Algorithms, analysis, and implications. In Proceedings
of the 1996 IEEE Symposium on Computer Security and Privacy, pages
110–119, Washington, DC, USA, 1996. IEEE Computer Society.
[11] F. Esponda, E. S. Ackley, S. Forrest, and P. Helman. Online negative
databases. In G. N. et al, editor, Proceedings of Third International
Conference on Artificial Immune Systems (ICARIS 2004), pages 175–188,
September 2004.
[12] F. Esponda, S. Forrest, and P. Helman. A formal framework for positive
and negative detection schemes. IEEE Transactions on Systems, Man, and
Cybernetics, 34:357–373, February 2004.
[13] J. L. Fleiss. Statistical Methods for Rates and Proportions. John Wiley &
Sons, 1981.
[14] S. Forrest, A. Perelson, L. Allen, and R. Cherukuri. Self-nonself
discrimination in a computer. In Proceedings of the 1994 IEEE Symposium on
Research in Security and Privacy, pages 202–212, Los Alamitos, CA, 1994.
IEEE Computer Society Press.
[15] A. A. Freitas and J. Timmis. Revisiting the foundation of artificial immune
systems: A problem-oriented perspective. In Proceedings of Second Inter-
national Conference on Artificial Immune System (ICARIS 2003), pages
229–241, 2003.
[16] S. M. Garrett. How do we evaluate artificial immune systems? Evolutionary
Computation, 13(2):145–178, 2005.
[17] F. Gonzalez, D. Dasgupta, and L. F. Nino. A randomized real-value nega-
tive selection algorithm. In Proceedings of Second International Conference
on Artificial Immune System (ICARIS 2003), pages 261–272, September
2003.
[18] E. Hart. Not all balls are round: An investigation of alternative recognition-
region shapes. In ICARIS, pages 29–42, 2005.
[19] D. Haussler. Probably approximately correct learning. In National
Conference on Artificial Intelligence, pages 1101–1108,
citeseer.ist.psu.edu/haussler90probably.html, 1990.
[20] C. A. Hawkins and J. E. Weber. Statistical Analysis - Applications to
Business and Economics. Harper & Row, Publishers, New York, 1980.
[21] R. V. Hogg and E. A. Tanis. Probability and Statistical Inference. Prentice
Hall, 6th edition, 2001.
[22] Z. Ji. A boundary-aware negative selection algorithm. In Proceedings of
IASTED International Conference of Artificial Intelligence and Soft Com-
puting (ASC 2005), pages 379–384, Spain, September 2005.
[23] Z. Ji and D. Dasgupta. Real-valued negative selection algorithm with
variable-sized detectors. In LNCS 3102, Proceedings of GECCO, pages
287–298, 2004.
[24] Z. Ji and D. Dasgupta. Estimating the detector coverage in a negative
selection algorithm. In H.-G. Beyer et al., editors, GECCO 2005:
Proceedings of the 2005 conference on Genetic and evolutionary computation,
volume 1, pages 281–288, Washington DC, USA, 25-29 June 2005. ACM
Press.
[25] Z. Ji and D. Dasgupta. Applicability issues of the real-valued negative
selection algorithms. In Genetic and Evolutionary Computation Conference
(GECCO 2006), pages 111–118, Seattle, Washington, 8-12 July 2006.
[26] R. E. Sanchez-Yanez, E. V. Kurmyshev, and A. Fernandez. One-class
texture classifier in the CCR feature space. Pattern Recognition Letters,
24:1503–1511, 2003.