
Online Intrusion Alert Aggregation

with Generative Data Stream Modeling


Alexander Hofmann, Member, IEEE, and Bernhard Sick, Member, IEEE
Abstract: Alert aggregation is an important subtask of intrusion detection. The goal is to identify and to cluster different
alerts (produced by low-level intrusion detection systems, firewalls, etc.) belonging to a specific attack instance which has been
initiated by an attacker at a certain point in time. Thus, meta-alerts can be generated for the clusters that contain all the relevant
information, whereas the amount of data (i.e., alerts) can be reduced substantially. Meta-alerts may then be the basis for reporting to
security experts or for communication within a distributed intrusion detection system. We propose a novel technique for online alert
aggregation which is based on a dynamic, probabilistic model of the current attack situation. Basically, it can be regarded as a data
stream version of a maximum likelihood approach for the estimation of the model parameters. With three benchmark data sets, we
demonstrate that it is possible to achieve reduction rates of up to 99.96 percent while the number of missing meta-alerts is extremely
low. In addition, meta-alerts are generated with a delay of typically only a few seconds after observing the first alert belonging to a new
attack instance.
Index Terms: Intrusion detection, alert aggregation, generative modeling, data stream algorithm.

1 INTRODUCTION
INTRUSION detection systems (IDS) are, besides other
protective measures such as virtual private networks,
authentication mechanisms, or encryption techniques, very
important to guarantee information security. They help to
defend against the various threats to which networks and
hosts are exposed by detecting the actions of attackers or
attack tools in a network or host-based manner with misuse
or anomaly detection techniques [1].
At present, most IDS are quite reliable in detecting
suspicious actions by evaluating TCP/IP connections or log
files, for instance. Once an IDS finds a suspicious action, it
immediately creates an alert which contains information
about the source, target, and estimated type of the attack
(e.g., SQL injection, buffer overflow, or denial of service). As
the intrusive actions caused by a single attack instance
(which is the occurrence of an attack of a particular type that
has been launched by a specific attacker at a certain point in
time) are often spread over many network connections or
log file entries, a single attack instance often results in
hundreds or even thousands of alerts. IDS usually focus on
detecting attack types, but not on distinguishing between
different attack instances. In addition, even low rates of false
alerts could easily result in a high total number of false
alerts if thousands of network packets or log file entries are
inspected. As a consequence, the IDS creates many alerts at
a low level of abstraction. It is extremely difficult for a
human security expert to inspect this flood of alerts, and
decisions that follow from single alerts might be wrong
with a relatively high probability.
In our opinion, a perfect IDS should be situation-aware
[2] in the sense that at any point in time it should know
what is going on in its environment regarding attack
instances (of various types) and attackers. In this paper, we
make an important step toward this goal by introducing
and evaluating a new technique for alert aggregation. Alerts
may originate from low-level IDS such as those mentioned
above, from firewalls (FW), etc. Alerts that belong to one
attack instance must be clustered together and meta-alerts
must be generated for these clusters. The main goal is to
reduce the amount of alerts substantially without losing any
important information which is necessary to identify on-
going attack instances. We want to have no missing meta-
alerts, but in turn we accept false or redundant meta-alerts
to a certain degree.
This problem is not new, but current solutions are
typically based on a quite simple sorting of alerts, e.g.,
according to their source, destination, and attack type.
Under real conditions such as the presence of classification
errors of the low-level IDS (e.g., false alerts), uncertainty
with respect to the source of the attack due to spoofed
IP addresses, or wrongly adjusted time windows, for
instance, such an approach fails quite often.
Our approach has the following distinct properties:
. It is a generative modeling approach [3] using prob-
abilistic methods. Assuming that attack instances
can be regarded as random processes producing
alerts, we aim at modeling these processes using
approximative maximum likelihood parameter esti-
mation techniques. Thus, the beginning as well as
the completion of attack instances can be detected.
. It is a data stream approach, i.e., each observed alert is
processed only a few times [4]. Thus, it can be
applied online and under harsh timing constraints.
282 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 8, NO. 2, MARCH-APRIL 2011
. The authors are with the Computationally Intelligent Systems Group
(CIS), Department of Informatics and Mathematics, University of Passau,
Innstraße 33, 94032 Passau, Germany.
E-mail: {hofmann, sick}@fim.uni-passau.de.
Manuscript received 26 Sept. 2008; revised 1 May 2009; accepted 5 July 2009;
published online 11 Aug. 2009.
For information on obtaining reprints of this article, please send e-mail to:
tdsc@computer.org, and reference IEEECS Log Number TDSC-2008-09-0152.
Digital Object Identifier no. 10.1109/TDSC.2009.36.
1545-5971/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
The remainder of this paper is organized as follows: In
Section 2 some related work is presented. Section 3 describes
the proposed alert aggregation approach, and Section 4
provides experimental results for the alert aggregation
using various data sets. Finally, Section 5 summarizes the
major findings.
2 RELATED WORK
Most existing IDS are optimized to detect attacks with high
accuracy. However, they still have various disadvantages
that have been outlined in a number of publications and a lot
of work has been done to analyze IDS in order to direct
future research (cf. [5], for instance). Among other drawbacks,
one is the large number of alerts produced. Recent
research focuses on the correlation of alerts from (possibly
multiple) IDS. If not stated otherwise, all approaches
outlined in the following either are online algorithms
or, as we see it, can easily be extended to an online version.
Probably, the most comprehensive approach to alert
correlation is introduced in [6]. One step in the presented
correlation approach is attack thread reconstruction, which can
be seen as a kind of attack instance recognition. No
clustering algorithm is used, but a strict sorting of alerts
within a temporal window of fixed length according to the
source, destination, and attack classification (attack type). In
[7], a similar approach is used to eliminate duplicates, i.e.,
alerts that share the same quadruple of source and
destination address as well as source and destination port.
In addition, alerts are aggregated (online) into predefined
clusters (so-called situations) in order to provide a more
condensed view of the current attack situation. The
definition of such situations is also used in [8] to cluster
alerts. In [9], alert clustering is used to group alerts that
belong to the same attack occurrence. Even though called
clustering, there is no clustering algorithm in a classic sense.
The alerts from one (or possibly several) IDS are stored in a
relational database, and a similarity relation (which is based
on expert rules) is used to group similar alerts together.
Two alerts are defined to be similar, for instance, if both
occur within a fixed time window and their source and
target match exactly. As already mentioned, these ap-
proaches are likely to fail under real-life conditions with
imperfect classifiers (i.e., low-level IDS) with false alerts or
wrongly adjusted time windows.
Another approach to alert correlation is presented in [10].
A weighted, attribute-wise similarity operator is used to
decide whether to fuse two alerts or not. However, as
already stated in [11] and [12], this approach suffers from
the high number of parameters that need to be set. The
similarity operator presented in [13] has the same
disadvantage: there are many parameters that must be set by
the user, and there is little or no guidance for finding
good values. In [14], another clustering algorithm that
is based on attribute-wise similarity measures with user-
defined parameters is presented. However, a closer look at
the parameter setting reveals that the similarity measure, in
fact, degenerates to a strict sorting according to the source
and destination IP addresses and ports of the alerts. The
drawbacks that arise thereof are the same as those
mentioned above.
In [15], three different approaches are presented to fuse
alerts. The first, quite simple one groups alerts according
to their source IP address only. The other two approaches
are based on different supervised learning techniques.
Besides a basic least-squares error approach, multilayer
perceptrons, radial basis function networks, and decision
trees are used to decide whether to fuse a new alert with
an already existing meta-alert (called scenario) or not. Due
to the supervised nature, labeled training data need to be
generated which could be quite difficult in case of various
attack instances.
The same or quite similar techniques as described so far
are also applied in many other approaches to alert correla-
tion, especially in the field of intrusion scenario detection.
Prominent research in scenario detection is described in [16],
[17], [18], for example. More details can be found in [19].
In [20], an offline clustering solution based on the CURE
algorithm is presented. The solution is restricted to numer-
ical attributes. In addition, the number of clusters must be
set manually. This is problematic, as in fact it assumes that
the security expert has knowledge about the actual number
of ongoing attack instances. The alert clustering solution
described in [11] is more related to ours. A link-based
clustering approach is used to repeatedly fuse alerts into
more generalized ones. The intention is to discover the
reasons for the existence of the majority of alerts, the so-
called root causes, and to eliminate them subsequently. An
attack instance in our sense can also be seen as a kind of root
cause, but in [11] root causes are regarded as generally
persistent which does not hold for attack instances that
occur only within a limited time window. Furthermore, only
root causes that are responsible for a majority of alerts are of
interest and the attribute-oriented induction algorithm is
forced to find large clusters as the alert load can thus be
reduced at most. Attack instances that result in a small
number of alerts (such as PHF or FFB) are likely to be
ignored completely. The main difference to our approach is
that the algorithm can only be used in an offline setting and
is intended to analyze historical alert logs. In contrast, we
use an online approach to model the current attack situation.
The alert clustering approach described in [12] is based on
[11] but aims at reducing the false positive rate. The created
cluster structure is used as a filter to reduce the amount of
created alerts. Those alerts that are similar to already known
false positives are kept back, whereas alerts that are
considered to be legitimate (i.e., dissimilar to all known
false positives) are reported and not further aggregated. The
same ideabut based on a different offline clustering
algorithmis presented in [21].
A completely different clustering approach is presented
in [22]. There, the reconstruction error of an autoassociator
neural network (AA-NN) is used to distinguish different
types of alerts. Alerts that yield the same (or a similar)
reconstruction error are put into the same cluster. The
approach can be applied online, but an offline training phase
and training data are needed to train the AA-NN and also to
manually adjust intervals for the reconstruction error that
determine which alerts are clustered together. In addition, it
turned out that due to the dimensionality reduction by the
AA-NN, alerts of different types can have the same
reconstruction error which leads to erroneous clustering.
In our prior work, we applied the well-known c-means
clustering algorithm in order to identify attack instances
[23]. However, this algorithm also works in a purely offline
manner.
3 A NOVEL ONLINE ALERT AGGREGATION
TECHNIQUE
In this section, we describe our new alert aggregation
approach which is, at each point in time, based on a
probabilistic model of the current situation. To outline the
preconditions and objectives of alert aggregation, we start
with a short sketch of our intrusion framework. Then, we
briefly describe the generation of alerts and the alert format.
We continue with a new clustering algorithm for offline
alert aggregation which is basically a parameter estimation
technique for the probabilistic model. After that, we extend
this offline method to an algorithm for data stream clustering
which can be applied to online alert aggregation. Finally, we
make some remarks on the generation of meta-alerts.
3.1 Collaborating Intrusion Detection Agents
In our work, we focus on a system of structurally very
similar so-called intrusion detection (ID) agents. Through
self-organized collaboration, these ID agents form a
distributed intrusion detection system (DIDS).
Fig. 1 outlines the layered architecture of an ID agent:
The sensor layer provides the interface to the network and
the host on which the agent resides. Sensors acquire raw
data from both the network and the host, filter incoming
data, and extract interesting and potentially valuable (e.g.,
statistical) information which is needed to construct an
appropriate event. At the detection layer, different detectors,
e.g., classifiers trained with machine learning techniques
such as support vector machines (SVM) or conventional
rule-based systems such as Snort [24], assess these events
and search for known attack signatures (misuse detection)
and suspicious behavior (anomaly detection). In case of
attack suspicion, they create alerts which are then for-
warded to the alert processing layer. Alerts may also be
produced by FW or the like. At the alert processing layer,
the alert aggregation module has to combine alerts that are
assumed to belong to a specific attack instance. Thus, so-
called meta-alerts are generated. Meta-alerts are used or
enhanced in various ways, e.g., scenario detection or
decentralized alert correlation. An important task of the
reaction layer is reporting.
The overall architecture of the distributed intrusion
detection system and a framework for large-scale simula-
tions are described in [25], [26] in more detail.
In our layered ID agent architecture, each layer assesses,
filters, and/or aggregates information produced by a lower
layer. Thus, relevant information gets more and more
condensed and certain, and, therefore, also more valuable.
We aim at realizing each layer in a way such that the recall
of the applied techniques is very high, possibly at the cost of
a slightly poorer precision [27]. In other words, with the
alert aggregation module (on which we focus in this
paper) we want to have a minimal number of missing
meta-alerts (false negatives), and we accept some false meta-
alerts (false positives) and redundant meta-alerts in turn.
3.2 Alert Generation and Format
In this section, we make some comments on the information
contained in alerts, the objects that must be aggregated, and
on their format. As the concrete content and format depend
on a specific task and on certain realizations of the sensors
and detectors, some more details will be given in Section 4
together with the experimental conditions.
At the sensor layer, sensors determine the values of
attributes that are used as input for the detectors as well as
for the alert clustering module. Attributes in an event that
are independent of a particular attack instance can be used
for classification at the detection layer. Attributes that are
(or might be) dependent on the attack instance can be used in
an alert aggregation process to distinguish different attack
instances. A perfect partition into dependent and independent
attributes, however, cannot be made. Some are clearly
dependent (such as the source IP address, which can
identify the attacker), some are clearly independent (such
as the destination port, which is usually 80 in the case of web-
based attacks), and many are both (such as the destination
port, which can be a hint to the attacker's actual target
service as well as to an attack tool specifically designed to
target a particular service only). In addition to the attributes
produced by the sensors, alert aggregation is based on
additional attributes generated by the detectors. Examples
are the estimated type of the attack instance that led to the
generation of the alert (e.g., SQL injection, buffer overflow,
or denial of service), and the degree of uncertainty
associated with that estimate.
An alert $\mathbf{a}$ consists of a number of $D$ attributes. We
assume that $D_{\text{cat}}$ ($D_{\text{cat}} \le D$) of these attributes are categorical
and the other $D - D_{\text{cat}}$ are continuous (real valued). Without
loss of generality, we arrange these attributes such that

$$\mathbf{a} = (\underbrace{\bar{a}_1, \ldots, \bar{a}_{D_{\text{cat}}}}_{\text{categorical}}, \underbrace{a_{D_{\text{cat}}+1}, \ldots, a_D}_{\text{continuous}}). \quad (1)$$
An extension to other attribute types (e.g., integers) would
be straightforward.
Fig. 1. Architecture of an intrusion detection agent.
Examples for continuous or approximately continuous
attributes are durations or the amount of transmitted data.
Examples for categorical attributes are IP addresses, port
numbers, or TCP status flags. For the categorical attributes,
we apply a 1-of-$K_d$ coding scheme, where $K_d$ is the number
of different categories of attribute $d \in \{1, \ldots, D_{\text{cat}}\}$. The
value of such an attribute is represented by a vector

$$\bar{a}_d = (a_{d,1}, \ldots, a_{d,K_d}), \quad (2)$$

with $a_{d,k} = 1$ if $\bar{a}_d$ belongs to category $k$, and $a_{d,k} = 0$
otherwise. For continuous attributes, no dedicated coding
scheme is required. The values are simply denoted by a scalar
value $a_d \in \mathbb{R}$ for all $d \in \{D_{\text{cat}}+1, \ldots, D\}$.
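As a concrete illustration of this coding scheme, the following minimal Python sketch (the function name is ours) builds the 1-of-$K_d$ vector of (2) for a single categorical attribute:

```python
def one_hot(category, num_categories):
    """1-of-K coding, cf. (2): a binary vector with a single 1 at the
    position of the observed category (0-indexed)."""
    vec = [0] * num_categories
    vec[category] = 1
    return vec

print(one_hot(1, 3))  # [0, 1, 0]
```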
Typically, the number of different categories can be kept
low by defining appropriate equivalence classes. For
example, we do not define 65,535 different categories for
the TCP port numbers. Instead, using the IANA port
number list [28], we define three different categories: well-
known ports (0-1,023), registered ports (1,024-49,151), and
dynamic and/or private ports (49,152-65,535). IP addresses,
for instance, are subdivided according to the categories
defined in RFC 1597 and RFC 790 [29], [30]. That is, domain
knowledge must be used to define equivalence classes.
Thus, the number of categories can be reduced substantially
which makes the alert aggregation easier without a loss of
significant information. Quite similar schemes have already
successfully been applied (see, e.g., [9], [11], [23]).
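The port equivalence classes described above can be sketched as a small lookup function; this is an illustrative reconstruction based on the IANA ranges cited in the text, not the authors' implementation (the function name and category labels are ours):

```python
def port_category(port):
    """Map a TCP/UDP port to one of three IANA-based equivalence classes:
    well-known (0-1,023), registered (1,024-49,151),
    dynamic/private (49,152-65,535)."""
    if not 0 <= port <= 65535:
        raise ValueError("invalid port number")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"

print(port_category(80), port_category(8080), port_category(50000))
# well-known registered dynamic/private
```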
3.3 Offline Alert Aggregation
In this section, we introduce an offline algorithm for alert
aggregation which will be extended to a data stream
algorithm for online aggregation in Section 3.4.
Assume that a host with an ID agent is exposed to a
certain intrusion situation as sketched in Fig. 2: One or
several attackers launch several attack instances belonging
to various attack types. The attack instances each cause a
number of alerts with various attribute values. Only two of
the attributes are shown and the correspondence of alerts
and (true or estimated) attack instances is indicated by
different symbols. Fig. 2a shows a view of the "ideal
world" which an ID agent does not have. The agent only
has observations of the detectors (alerts) in the attribute
space without attack instance labels as outlined in Fig. 2b.
The task of the alert aggregation module is now to estimate
the assignment to instances by using the unlabeled
observations only and by analyzing the cluster structure
in the attribute space. That is, it has to reconstruct the attack
situation. Then, meta-alerts can be generated that are
basically an abstract description of the cluster of alerts
assumed to originate from one attack instance. Thus, the
amount of data is reduced substantially without losing
important information. Fig. 2c shows the result of a
reconstruction of the situation. There may be different
potentially problematic situations:
1. False alerts are not recognized as such and wrongly
assigned to clusters: This situation is acceptable as
long as the number of false alerts is comparably low.
2. True alerts are wrongly assigned to clusters: This
situation is not really problematic as long as the
majority of alerts belonging to that cluster is
correctly assigned. Then, no attack instance is
missed.
3. Clusters are wrongly split: This situation is un-
desired but clearly unproblematic as it leads to
redundant meta-alerts only. Only the data reduction
rate is lower, no attack instance is missed.
4. Several clusters are wrongly combined into one:
This situation is definitely problematic as attack
instances may be missed.
According to our objectives (cf. Section 3.1) we must try to
avoid the latter situation but we may accept the former
three situations to a certain degree.
How can the set of samples be clustered (i.e., aggre-
gated) to generate meta-alerts? Here, the answer to this
question is identical to the answer to the following: How
can an attack situation be modeled with a parameterized
probabilistic model and how can the parameters be
estimated from the observations?
First, we introduce a model for a single attack instance.
We regard an attack instance as a random process generat-
ing alerts that are distributed according to a certain multi-
variate probability distribution. The alert space is composed
of several attributes. We have categorical attributes which
Fig. 2. Example illustrating the alert aggregation task and possible problems (artificial attack situation). (a) Idealized world: In the idealized IDS, the
detectors do not make errors (no false and missing alerts) and the correct assignment of alerts to attack instances is known (indicated by different
symbols). (b) Actual observations: The alerts produced by a real detection layer. The task of the alert aggregation is to reconstruct the attack
situation by means of these observations only (including false alerts). (c) Reconstruction: The result of the aggregation (correspondence of alerts
and clusters/meta-alerts) together with four different types of problems that may arise.
we assume to be multinomially distributed. That is, for a
categorical attribute $\bar{a}_d$ with $d \in \{1, \ldots, D_{\text{cat}}\}$ (cf. (2)) we use

$$M(\bar{a}_d \mid \boldsymbol{\delta}_d) = \prod_{k=1}^{K_d} \delta_{d,k}^{a_{d,k}}, \quad (3)$$

with a parameter vector $\boldsymbol{\delta}_d = (\delta_{d,1}, \ldots, \delta_{d,K_d})$ and the restrictions
$\sum_{k=1}^{K_d} \delta_{d,k} = 1$ and $\delta_{d,k} \ge 0$ for all $k$. The continuous
attributes must be chosen such that a normal distribution
(Gaussian distribution) can be assumed. That is, for an
attribute $a_d$ with $d \in \{D_{\text{cat}}+1, \ldots, D\}$ we use

$$N(a_d \mid \mu_d, \sigma_d^2) = \frac{1}{\sqrt{2\pi}\,\sigma_d} \exp\left(-\frac{1}{2\sigma_d^2}\,(a_d - \mu_d)^2\right), \quad (4)$$

with mean $\mu_d \in \mathbb{R}$ and standard deviation $\sigma_d \in \mathbb{R}^{+}$.
Assuming further that the attributes are not correlated
(i.e., statistically independent of each other), the multivariate
hybrid distribution $H$ for one attack instance becomes

$$H(\mathbf{a} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}) = \prod_{d=1}^{D_{\text{cat}}} M(\bar{a}_d \mid \boldsymbol{\delta}_d) \cdot \prod_{d=D_{\text{cat}}+1}^{D} N(a_d \mid \mu_d, \sigma_d^2), \quad (5)$$

with corresponding parameter vectors $\boldsymbol{\mu}$, $\boldsymbol{\sigma}^2$, and $\boldsymbol{\delta}$. That
is, the joint distribution is a product of the univariate
distributions. The distribution underlying an overall attack
situation consisting of $J \in \mathbb{N}$ attack instances is then
assumed to be a mixture of $J$ hybrid distributions, which
are then called components of this mixture distribution:

$$S(\mathbf{a} \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}) = \sum_{j=1}^{J} \pi_j\, H(\mathbf{a} \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j), \quad (6)$$

with $\pi_j > 0$ for all $j \in \{1, \ldots, J\}$ and $\sum_{j=1}^{J} \pi_j = 1$. The
$\pi_j$ are called mixing coefficients.
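To make the model concrete, the following Python sketch evaluates densities in the spirit of (3)-(6) for a toy alert with one categorical and one continuous attribute. All function names, the toy parameter values, and the fixed mixing coefficients $\pi_j = 1/J$ are our illustrative choices, not the authors' code:

```python
import math

def multinomial(onehot, delta):
    """M(a_d | delta_d), cf. (3): with 1-of-K coding this simply
    picks out the probability of the observed category."""
    p = 1.0
    for a_k, d_k in zip(onehot, delta):
        p *= d_k ** a_k
    return p

def gaussian(x, mu, sigma2):
    """N(a_d | mu_d, sigma_d^2), cf. (4)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def hybrid(cat_attrs, cont_attrs, deltas, mus, sigma2s):
    """H(a | mu, sigma^2, delta), cf. (5): a product over all attributes."""
    p = 1.0
    for onehot, delta in zip(cat_attrs, deltas):
        p *= multinomial(onehot, delta)
    for x, mu, s2 in zip(cont_attrs, mus, sigma2s):
        p *= gaussian(x, mu, s2)
    return p

def mixture(cat_attrs, cont_attrs, components):
    """S(a | ...), cf. (6), with fixed mixing coefficients pi_j = 1/J."""
    J = len(components)
    return sum(hybrid(cat_attrs, cont_attrs, *comp) for comp in components) / J

# Toy alert: one categorical attribute (3 categories, category 0 observed)
# and one continuous attribute with value 1.2.
a_cat, a_cont = [[1, 0, 0]], [1.2]
comp = ([[0.7, 0.2, 0.1]], [1.0], [0.25])   # (deltas, mus, sigma2s)
print(round(hybrid(a_cat, a_cont, *comp), 4))  # -> 0.5156
```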
Reconstructing an attack situation from observed sam-
ples as sketched above means to estimate all the parameters
of the mixture distribution by means of those samples. The
standard approach for such a problem is a maximum
likelihood (ML) estimation [3].
That is, we are given a set of $N$ alerts $A = \{\mathbf{a}_1, \ldots, \mathbf{a}_N\}$
originating from the situation that must be reconstructed.
Assuming that these samples are independent and identically
distributed (i.i.d.) according to $S$, the so-called likelihood
function

$$L(A \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}) = \prod_{i=1}^{N} S(\mathbf{a}_i \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\pi}) \quad (7)$$
must be maximized with respect to all parameters. In the
case of mixture models, it is not possible to evolve a closed
formula for the optimization. Instead, an iterative method
called expectation maximization (EM) must be applied (cf. [3],
for instance).
The resulting EM procedure for our attack situation
model is shown in Algorithm 1. It iteratively maximizes
the likelihood with two alternating computation steps:
E (expectation) and M (maximization). The E step assigns
the alerts to components (resulting in a partition of the
set A with J clusters) and the M step optimizes the
parameters of the mixture model.
Some additional remarks must be made:
Initialization of model parameters. The aim of the
initialization is to find good initial values. Instead of using
a random initialization which results in higher runtimes
and sub-optimal solutions, we use a heuristic which we
have successfully applied to the training of radial basis
function neural networks [31]. This heuristic selects as
initial cluster centers a set of alerts with a large spread in
the attribute space.
Hard assignment of alerts to components. More general
EM algorithms make a gradual assignment of alerts to
components in the E step (cf. responsibilities in [3]). In
practical applications, a hard assignment reduces the
runtimes significantly at the cost of slightly worse solutions
in some situations. In our case, this is acceptable as we do
not want to find the optimal model parameters at the end,
but to generate the optimal set of meta-alerts.
Stopping criterion. An EM algorithm guarantees that the
set of parameters is improved in each step. In addition, due to
the hard assignment of alerts, there exists a limited number of
possible assignments. Thus, it can be guaranteed that the
algorithm converges. For the sake of simplicity, however, we
usually run the algorithm for a fixed number of iterations.
Fixed mixing coefficients. One of the main difficulties in
alert aggregation is the wide range of possible cluster sizes.
There are clusters that contain thousands of alerts, but there
are also clusters that consist of a few alerts only. For instance,
a Neptune attack instance may result in 200,000 alerts
whereas a PHF attack instance may consist of only five alerts
[32]. Thus, in contrast to a more general EM approach, it is
important to fix the mixing coefficients (here: $\pi_j = 1/J$).
Otherwise, if the mixing coefficients were estimated from
the observed samples, the EM algorithm would focus on the
optimization of the parameters of "heavy" components
while neglecting the "light" ones.
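Since Algorithm 1 itself is not reproduced in this excerpt, the following simplified Python sketch illustrates a hard-assignment EM loop with fixed mixing coefficients and a spread-based initialization. It is restricted to a single continuous attribute for brevity, and all names and numeric choices are our reconstruction, not the authors' code:

```python
import math

def gaussian(x, mu, sigma2):
    """Univariate normal density N(x | mu, sigma^2), cf. (4)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def spread_init(alerts, J):
    """Initialization heuristic: greedily pick alerts with a large spread."""
    centers = [min(alerts)]
    while len(centers) < J:
        centers.append(max(alerts, key=lambda a: min(abs(a - c) for c in centers)))
    return centers

def hard_em(alerts, J, iterations=20):
    """Hard-assignment EM for a 1D Gaussian mixture with fixed
    mixing coefficients pi_j = 1/J."""
    mus = spread_init(alerts, J)
    sigma2s = [1.0] * J
    clusters = [[] for _ in range(J)]
    for _ in range(iterations):
        # E step: hard-assign each alert to its most likely component
        clusters = [[] for _ in range(J)]
        for a in alerts:
            j = max(range(J), key=lambda k: gaussian(a, mus[k], sigma2s[k]))
            clusters[j].append(a)
        # M step: re-estimate component parameters from the assignment
        for j, c in enumerate(clusters):
            if c:
                mus[j] = sum(c) / len(c)
                sigma2s[j] = max(sum((x - mus[j]) ** 2 for x in c) / len(c), 1e-6)
    return mus, sigma2s, clusters

alerts = [0.1, -0.2, 0.05, 9.9, 10.2, 10.1]
mus, _, clusters = hard_em(alerts, J=2)
# the two estimated component means end up near 0 and 10
```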
One important task has not been mentioned so far: The
estimation of the number of components (clusters) J from
the set of observed samples A. Up to now, we assumed that
this number is given. In our case, the number of clusters shall
correspond to the number of attack instances in an ideal case.
To solve this problem, cluster validation measures can be used
to assess the quality of a clustering result. A clustering
algorithm is started several times with a varying number of
clusters, the quality for each clustering result is assessed
numerically, and finally the results are compared and the
number that optimizes this measure is chosen (so-called
relative criterion, cf. [33]). Examples for suitable measures
are the Dunn index [34], the Davies-Bouldin index [35], and
the CDbw index [36]. The idea that forms the basis of those
measures is that they search for a clustering result that
consists of clusters that are both, compact and separable.
Although the usability of such measures has been proven in
many (offline) application examples, they cannot be applied
in a data streaming case as they suffer from high runtimes,
e.g., $O(N^2)$ for $N$ alerts in the case of the Dunn index. We need a measure
that can be updated incrementally when a new alert arrives.
Therefore, we chose a density-based approach to estimate
separability and compactness.
As a final result of the EM algorithm, we not only know
the optimized parameter values of the model, but also the
final assignment of alerts to components. These clusters
define a partition of the sample set.
First, we define the compactness of a single cluster
associated with a component $j$ by

$$\Phi_j = \min_{\mathbf{a}_i \text{ assigned to } j} H(\mathbf{a}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j). \quad (8)$$

Then, the compactness of the overall partition is

$$\Phi = \min_{j \in \{1, \ldots, J\}} \Phi_j. \quad (9)$$

That means the compactness is determined by the alert
with the lowest likelihood. As a consequence, we get a bias
toward a higher number of components, which is desirable
in our application (high recall required, lower precision
accepted). Note that we can easily update this measure in
$O(1)$ if a new alert is assigned to a component.
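The constant-time update amounts to keeping a running minimum of member likelihoods; a minimal sketch (the function name is ours):

```python
def update_compactness(phi_j, likelihood_of_new_alert):
    """O(1) update of the compactness of cluster j, cf. (8): since the
    compactness is the minimum member likelihood, assigning a new
    alert only requires one comparison."""
    return min(phi_j, likelihood_of_new_alert)

phi_j = float("inf")
for likelihood in (0.8, 0.3, 0.5):
    phi_j = update_compactness(phi_j, likelihood)
print(phi_j)  # 0.3
```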
Second, we define the separability of a pair of clusters
associated with components $i$ and $j$ by

$$\Psi(i,j) = \frac{1}{2} \left( \frac{H(\mathbf{a}^{*}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j)}{H(\mathbf{a}^{*}_j \mid \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j^2, \boldsymbol{\delta}_j)} + \frac{H(\mathbf{a}^{*}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i^2, \boldsymbol{\delta}_i)}{H(\mathbf{a}^{*}_i \mid \boldsymbol{\mu}_i, \boldsymbol{\sigma}_i^2, \boldsymbol{\delta}_i)} \right), \quad (10)$$

where we use the alerts with the highest likelihoods,

$$\mathbf{a}^{*}_k = \operatorname*{arg\,max}_{\mathbf{a}_i \text{ assigned to } k} H(\mathbf{a}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\sigma}_k^2, \boldsymbol{\delta}_k). \quad (11)$$

Then, the separability of the overall partition is

$$\Psi = \min_{i,j \in \{1, \ldots, J\},\, i \neq j} \Psi(i,j). \quad (12)$$

An incremental version of this measure with $O(1)$ time
complexity is also straightforward.

Finally, the overall cluster validation measure for a
clustering result is

$$\Theta = \frac{\Phi}{\Psi}. \quad (13)$$
We search for a number J that maximizes this measure. In
Section 3.4, it will become clear that only a small number of
choices for J must actually be tested by our online
aggregation algorithm.
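The validation measure can be illustrated for a model with a single continuous attribute per alert. The sketch below is our reconstruction of (8)-(13), assuming the overall measure is the ratio of compactness to separability; it shows that well-separated clusters score higher than overlapping ones:

```python
import math

def gaussian(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def validation_measure(clusters, params):
    """Theta = Phi / Psi for a 1D Gaussian-only model,
    with params[j] = (mu_j, sigma2_j)."""
    def h(a, j):                          # H(a | component j)
        mu, s2 = params[j]
        return gaussian(a, mu, s2)
    # compactness (8), (9): the worst member likelihood over all clusters
    phi = min(min(h(a, j) for a in c) for j, c in enumerate(clusters))
    # representatives (11): the best member likelihood per cluster
    reps = [max(c, key=lambda a, j=j: h(a, j)) for j, c in enumerate(clusters)]
    # separability (10), (12): minimum over all pairs of components
    psi = min(
        0.5 * (h(reps[i], j) / h(reps[j], j) + h(reps[j], i) / h(reps[i], i))
        for i in range(len(clusters)) for j in range(len(clusters)) if i != j
    )
    return phi / psi

separated  = validation_measure([[0.0, 0.1], [1.0, 1.1]],
                                [(0.05, 0.01), (1.05, 0.01)])
overlapped = validation_measure([[0.0, 0.1], [0.2, 0.3]],
                                [(0.05, 0.01), (0.25, 0.01)])
# well-separated clusters yield a much higher measure than overlapping ones
```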
3.4 Data Stream Alert Aggregation
In this section, we describe how the offline approach is
extended to an online approach working for dynamic attack
situations.
Assume that in the environment observed by an ID agent
attackers initiate new attack instances that cause alerts for a
certain time interval until this attack instance is completed.
Thus, at any point in time the ID agent (which is assumed
to have a model of the current situation, cf. Fig. 3a) has
several tasks, cf. Fig. 3b:
1. Component adaption: Alerts associated with al-
ready recognized attack instances must be identi-
fied as such and assigned to already existing
clusters while adapting the respective component
parameters.
2. Component creation (novelty detection): The occur-
rence of new attack instances must be stated. New
components must be parameterized accordingly.
Fig. 3. Example illustrating the principle of online alert aggregation (artificial attack situation). (a) Existing model: Components have been created by
the alert aggregation module. These components are the basis for meta-alert generation. (b) Assignment problem: New observations must either be
assigned to an existing component which is then adapted or a new component must be created. Also, outdated components must be deleted.
(c) Adapted model: The new situation after a few steps. One component has been created, one component has been deleted, and the other
components have been adapted accordingly.
3. Component deletion (obsoleteness detection): The
completion of attack instances must be detected
and the respective components must be deleted
from the model.
That is, the ID agent must be situation-aware and try to keep its model of the current attack situation permanently up to date, see Fig. 3c.
Clearly, there is a trade-off between runtime (or reaction
time) and accuracy. For example, it is hardly possible to
decide upon the existence of a new attack instance when
only one observation is made. From the viewpoint of our
objectives (cf. Section 3.1), the tasks 1 and 2 are more time
critical than task 3.
From a probabilistic viewpoint we can state that our
overall random process is nonstationary in a certain sense
which can be regarded as being equivalent to changing the
mixing coefficients at certain points in time. A mixing
coefficient is either zero or the reciprocal of the number of
active components (for the time interval of the respective
attack instance). With appropriate novelty and obsoleteness
detection mechanisms, we aim at detecting these points in
time with both sufficient certainty and timeliness.
The offline aggregation algorithm provides the basic idea
for the online case, too. We have to consider, however, that
we want to have a data stream algorithm with short
runtimes. In the following, we do not distinguish between the following two views on solutions of the modeling problem: the probabilistic model with J components with parameters μ_j, σ²_j, and δ_j for j ∈ {1, ..., J} and mixing coefficients π_1, ..., π_J, and the partition of the overall set of alerts into a set of clusters C = {C_1, ..., C_J}. That is, we equate the terms partition and model as well as cluster and component, respectively. For online clustering, we require that each alert a has a time stamp ts(a) (time of observation) and that we always know the current point in time t.
Algorithm 2 describes the online alert aggregation. If a
new alert is observed we first have to decide whether a first
component has to be created. In this case, we initialize its
parameters with information taken from this alert. Random,
small values are added, for example, to prevent any
subsequent optimization steps from running into singularities of the respective likelihood function [3]. Otherwise,
we have to decide whether the alert has to be associated
with an existing component or not, i.e., whether we believe
that it belongs to an ongoing attack instance or not.
Provisionally, we assign the alert to the most likely
component (E step) and optimize the parameters of this
component (M step). For the reason of temporal efficiency,
we do not conduct a sequence of E and M steps for the
overall model. In some tests [19], it turned out that our main goal (not to miss any attack instances, see Section 3.1) can be achieved this way with substantially lower runtimes, but at the cost of some redundant meta-alerts (due to splits of clusters, see Section 3.3).
The assignment of the alert to an existing component is not accepted in every case, but only if the quality of the model increases or does not decrease too much, e.g., by not more than 15 percent (realized by means of a threshold θ). Otherwise,
we consider the possibility that a new attack instance may
have been initiated, do not modify the model, and
temporarily store the alert in a buffer until this belief is
corroborated by additional observations.
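The decision logic just described can be sketched as follows. This is a simplified, hypothetical illustration, not the paper's Algorithm 2: components are reduced to one-dimensional Gaussians over a single numeric alert attribute, and the model-quality criterion is replaced by the density of the new alert relative to the density at the component's mode.

```python
import math

class Component:
    """One mixture component (cluster / attack instance). For brevity, this
    hypothetical sketch models a single numeric alert attribute with a
    Gaussian; the paper's components also cover categorical attributes."""
    def __init__(self, x):
        self.mean, self.var, self.n = x, 1.0, 1

    def density(self, x):
        return math.exp(-0.5 * (x - self.mean) ** 2 / self.var) \
            / math.sqrt(2 * math.pi * self.var)

    def update(self, x):
        # Incremental mean/variance update (local M step for this component).
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.var = max(self.var + (d * (x - self.mean) - self.var) / self.n, 1e-6)

def process_alert(components, buffer, x, theta=0.15):
    """Create the first component, or provisionally assign x to the most
    likely component (E step) and adapt it (M step), or buffer x as evidence
    of a possibly new attack instance. The acceptance test (density relative
    to the component's mode) is a simplified stand-in for the paper's
    model-quality criterion."""
    if not components:
        components.append(Component(x))
        return "created"
    best = max(components, key=lambda c: c.density(x))
    if best.density(x) >= theta * best.density(best.mean):
        best.update(x)
        return "assigned"
    buffer.append(x)  # possible new attack instance; await corroboration
    return "buffered"
```

An alert close to an existing component adapts that component; an outlier is buffered instead of distorting the model.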
Novelty handling basically means that we empty the
buffer by assigning its content either to existing components
or to new components. This procedure is initiated either
when the temporal spread of the buffer content is too large
or when the content is no longer homogeneous in the sense
that we assume that another new attack instance may have
been initiated:
. Temporal spread: As the rate of incoming alerts
depends on the current attack situation, it changes
heavily over time ranging from thousands of alerts
288 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 8, NO. 2, MARCH-APRIL 2011
per minute to only a few alerts per hour. Thus, to keep
the response time short, we have to take into account
the temporal spread of the buffer content. The spread
is defined as the difference between the time stamps
of the oldest and the most recent alert in the buffer.
. Homogeneity: The goal is to ensure that only alerts
that are similar to each other are stored in the buffer.
Thus, it is possible that the novelty handling
conducts (for temporal performance reasons) only
a local optimization of the model (see below). We
regard the alerts in the buffer as being similar if they
all have the same most likely component.
That is, to initiate the novelty handling for a given model C and a buffer B, we use the function

    novelty(a) = true,  if  max_{b ∈ B} {ts(a) − ts(b)} ≥ Δ
                        or  argmax_{j ∈ {1, ..., |C|}} H(a | μ_j, σ²_j, δ_j)
                            ≠ argmax_{j ∈ {1, ..., |C|}} H(b | μ_j, σ²_j, δ_j)  for any b ∈ B,
    novelty(a) = false, otherwise,    (14)

with a user-defined threshold Δ to state novelty. Note that spread and homogeneity can both be checked with low temporal effort.
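A sketch of this trigger, assuming, for illustration only, that alerts are (time stamp, value) pairs and that components are equal-variance Gaussians, so that the most likely component is simply the one with the nearest mean:

```python
def novelty(component_means, buffer, alert, spread_limit=180.0):
    """Novelty trigger in the spirit of Eq. (14): fire when the temporal
    spread of the buffer (relative to the new alert) reaches spread_limit,
    or when the buffer is no longer homogeneous, i.e., some buffered alert
    has a different most likely component than the new alert."""
    def most_likely(value):
        # Equal-variance Gaussian assumption: highest density = nearest mean.
        return min(range(len(component_means)),
                   key=lambda j: abs(value - component_means[j]))

    if not buffer:
        return False
    t, v = alert
    if max(t - bt for bt, _ in buffer) >= spread_limit:
        return True  # temporal spread too large
    j = most_likely(v)
    return any(most_likely(bv) != j for _, bv in buffer)
```

Both checks touch each buffered alert once, so the trigger is cheap to evaluate per incoming alert.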
Algorithm 3 describes the novelty handling itself. Basically, to adapt the overall model, we run the offline aggregation algorithm several times with different possible component numbers to choose the optimal number. However, due to the homogeneity of the buffer, we may restrict the optimization to the alerts in the buffer and in one neighbor cluster on the one hand and to a relatively small user-defined maximum number of components on the other without violating our main goal [19]. The result of this local optimization is finally fused with the unmodified parts of the model.
In order to reduce the runtime of this algorithm further, we may reduce the number of alerts that have to be processed by means of an appropriate subsampling technique.
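One such subsampling technique is reservoir sampling, which draws a uniform sample of fixed size from a stream of unknown length in a single pass; the choice of reservoir sampling here is ours, the text does not prescribe a particular technique.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random subsample of (at most) k items from a stream of
    unknown length, in one pass (classic reservoir sampling)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep new item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Applied to the alert buffer, this bounds the input size of the local optimization regardless of the alert rate.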
To initiate the obsoleteness handling for a given model C and a component C_j, we use

    obsoleteness(C_j) = true,  if t ≥ max_{c ∈ C_j} {ts(c)} + λ,
    obsoleteness(C_j) = false, otherwise,    (15)

where λ is a convex combination of a multiple of the mean interarrival time between the alerts in the cluster and prior knowledge, which is faded out when the number of alerts in this cluster gets larger. That is, we use a kind of estimate of the arrival time of the next alert belonging to the corresponding attack instance. If obsoleteness is stated, the corresponding component is simply deleted from the model.
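A sketch of such a timeout estimate; the factor 3, the prior timeout of 600 s, and the fading schedule are illustrative assumptions, not the paper's settings, and timestamps is assumed to be the sorted list of alert time stamps in the cluster.

```python
def obsolete(now, timestamps, prior_timeout=600.0, fade=10):
    """Obsoleteness test in the spirit of Eq. (15): the component is deleted
    once no alert has arrived for an estimated timeout. The timeout is a
    convex combination of a multiple of the mean interarrival time and a
    prior value; the prior is faded out as the cluster grows."""
    n = len(timestamps)
    mean_gap = ((timestamps[-1] - timestamps[0]) / (n - 1)
                if n > 1 else prior_timeout)
    w = min(n / fade, 1.0)  # weight of the data-driven estimate
    timeout = w * 3.0 * mean_gap + (1.0 - w) * prior_timeout
    return now >= timestamps[-1] + timeout
```

For a young cluster the prior dominates, so a single early alert does not cause premature deletion; for a large cluster the observed interarrival times take over.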
Note that the mixing coefficients are always defined
implicitly by the number of currently existing components.
3.5 Meta-Alert Generation and Format
With the creation of a new component, an appropriate meta-
alert that represents the information about the component in
an abstract way is created. Every time a new alert is added
to a component, the corresponding meta-alert is updated
incrementally, too. That is, the meta-alert evolves with the
component. Meta-alerts may be the basis for a whole set of further tasks (cf. Fig. 1):
. Sequences of meta-alerts may be investigated further
in order to detect more complex attack scenarios
(e.g., by means of hidden Markov models).
. Meta-alerts may be exchanged with other ID agents
in order to detect distributed attacks such as one-to-
many attacks.
. Based on the information stored in the meta-alerts,
reports may be generated to inform a human security
expert about the ongoing attack situation.
Meta-alerts could be used at various points in time from
the initial creation until the deletion of the corresponding
component (or even later). For instance, reports could be
created immediately after the creation of the component
or (which could be preferable in some cases) a
sequence of updated reports could be created in regular
time intervals. Another example is the exchange of meta-
alerts between ID agents: Due to high communication costs,
meta-alerts could be exchanged based on the evaluation of
their interestingness [37].
According to the task for which meta-alerts are used, they
may contain different attributes. Examples for those attri-
butes are aggregated alert attributes (e.g., lists or intervals of
source addresses or targeted service ports, or a time interval
that marks the beginning and the end (if available) of the
attack instance), attributes extracted from the probabilistic
model (e.g., the distribution parameters or the number of
alerts assigned to the component), an aggregated alert
assessment provided by the detection layer (e.g., the attack
type classification or the classification confidence), and also
information about the current attack situation (e.g., the
number of recent attacks of the same or a similar type, links
to attacks originating from the same or a similar source).
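An incremental meta-alert update along these lines might look as follows; the attribute set (source addresses, targeted ports, time interval, alert count) is a hypothetical subset of the examples above.

```python
def update_meta_alert(meta, alert):
    """Fold one alert (a dict with hypothetical keys src, dport, ts) into its
    meta-alert incrementally; the meta-alert evolves with the component."""
    if meta is None:  # component was just created
        meta = {"sources": set(), "ports": set(),
                "first_seen": alert["ts"], "last_seen": alert["ts"],
                "count": 0}
    meta["sources"].add(alert["src"])   # aggregated source addresses
    meta["ports"].add(alert["dport"])   # targeted service ports
    meta["first_seen"] = min(meta["first_seen"], alert["ts"])
    meta["last_seen"] = max(meta["last_seen"], alert["ts"])
    meta["count"] += 1                  # number of alerts in the component
    return meta
```

Each update is O(1), so the meta-alert stays current without reprocessing the cluster.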
4 EXPERIMENTAL RESULTS
This section evaluates the new alert aggregation approach.
We use three different data sets to demonstrate the
feasibility of the proposed method: The first is the well-
known DARPA intrusion detection evaluation data set [32],
for the second we used real-life network traffic data
collected at our university campus network, and the third
contains firewall log messages from a commercial Internet
service provider. All experiments were conducted on a PC with a 2.20 GHz CPU and 2 GB of RAM.
4.1 Description of the Benchmark Data Sets
4.1.1 DARPA Data
For the DARPA evaluation [32], several weeks of training
and test data have been generated on a test bed that emulates
a small government site. The network architecture as well as
the generated network traffic has been designed to be similar
to that of an Air Force base. We used the TCP/IP network
dump as input data and analyzed all 104 TCP-based attack
instances (corresponding to more than 20 attack types) that
have been launched against the various target hosts. As
sketched in Section 3.2, sensors extract statistical information
from the network traffic data. At the detection layer, we
apply SVM to classify the sensor events. By applying a
varying threshold to the output of the classifier, a so-called
receiver operating characteristics (ROC) curve can be
created [38]. The ROC curve in Fig. 4 plots the true positive
rate (TPR, number of true positives divided by the sum of
true positives and false negatives) against the false positive
rate (FPR, number of false positives divided by the sum of
false positives and true negatives) for the trained SVM. Each
point of the curve corresponds to a specific threshold. Four
operating points (OP) are marked. OP 1 is the one with the
smallest overall error, but as we want to realize a high recall,
we also investigate three more operating points which
exhibit higher TPR at the cost of an increased FPR. We will
also investigate the aggregation under idealized conditions
where we assume to have a perfect detector layer with no
missing and no false alerts at all. As attributes for the alerts,
we use the source and destination IP address, the source and
destination port, the attack type, and the creation time
differences (based on the creation time stamps).
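The two rates can be computed per threshold as follows (a straightforward sketch: scores_labels pairs each detector output score with the ground-truth label, and each threshold yields one ROC operating point):

```python
def roc_points(scores_labels, thresholds):
    """Compute one (FPR, TPR) operating point per threshold from a list of
    (detector score, is_attack) pairs; an event is flagged as an alert if
    its score reaches the threshold. Assumes both classes occur."""
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in scores_labels if s >= th and y)
        fn = sum(1 for s, y in scores_labels if s < th and y)
        fp = sum(1 for s, y in scores_labels if s >= th and not y)
        tn = sum(1 for s, y in scores_labels if s < th and not y)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points
```

Sweeping the threshold from high to low moves the operating point toward higher TPR at the cost of higher FPR, exactly the trade-off behind OP 1 to OP 4.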
Table 1 shows the number of alerts produced for the
different OP and also for the idealized condition, i.e., a
perfect detection layer. In addition, the number of attack
instances for which at least one alert is generated by the
detector layer is also given. Note that we have 104 attack
instances in the data set altogether. For OP 1, there are three
attack instances for which not even a single alert is created,
i.e., these instances are already missed at the detection layer.
The three instances are missed because there are only a few
training samples of the corresponding attack types in the
data set which results in a bad generalization performance
of the SVM. Switching from OP 1 to OP 2, alerts are created
for one more instance. OP 3 and OP 4 do not yield any
further improvement.
We are aware of the various critiques of the DARPA benchmark data (e.g., published in [39], [40]) and the limitations that emerge from them. Despite all disadvantages,
this data set is still frequently usednot least because it is a
well-documented and publicly available data set that allows
the comparison of results of different researchers. In order
to achieve fair and realistic results, we carefully analyzed all
known deficiencies and omitted features that could bias the
detector classification performance. Besides, most of the
deficiencies of this data set have an impact on the detector
layer but only little influence on the alert aggregation
results. A detailed discussion can be found in [19] and [41].
4.1.2 Campus Network Data
To assess the performance of our approach in more detail,
we also conducted our own attack experiments. We launched
several brute force password guessing attacks against the
mail server (POP3) of our campus network and recorded
the network traffic. The attack instances differed in origin,
start time, duration, and password guessing rate. The attack
schedule was designed to reflect situations which we regard
as being difficult to recognize. In particular, we have
1. several concurrent attack instances (up to seven),
2. partially and completely overlapping attack in-
stances,
3. several instances within a short time interval,
4. different attack instances from similar sources,
5. different attack durations, and
6. an attacker that changes his IP address during the
attack.
In order to demonstrate that the proposed technique can
also be used with a conventional signature-based detector,
the captured traffic was analyzed by the open source IDS
Snort [24], which detected all 17 attack instances that have
been launched and produced 128,816 alerts (cf. Table 1).
The alert format equals the one used for the SVM detector,
i.e., the alerts exhibit the source and destination IP address,
Fig. 4. ROC curve for the SVM detector.
TABLE 1
Input of the Alert Aggregation Algorithm
the source and destination port, the attack type, and
creation time differences. Snort was configured to match
our network topology, and we turned off its rudimentary alert
aggregation features. In order to achieve a high recall, we
activated all available rule sets (the official rule sets as well as the community rules, both of which are available at the Snort web page [42]). Activating all rules leads to a false
alert rate of 0.33 percent. The FPR is based on the
assumption that all alerts that are not classified with the
attack type that we launched are false alerts. There is no
missing alert rate given in Table 1 for this data set for two reasons: First, it cannot be ruled out that the data set contains unknown attacks started by real attackers, and second, we do not know exactly how many alerts should be created by the attacks we launched.
4.1.3 Internet Service Provider Firewall Logs
The third data set used here differs from the previous ones
as we actually do not have a detector layer that performs a
classification and searches for known attacks. Instead, we
use log messages that were generated by a CISCO PIX
system, which is a well-known commercial FW and
network address translation (NAT) device. Typical PIX
log messages are DENY or ACCEPT messages, depending
on whether the incoming or outgoing network packets or
connections are forbidden or allowed according to the
firewall rule set. Here, the alerts consist of the source and
destination IP address, the source and destination port, the
creation time differences, and the PIX message type. Details
on the log message format can be found in [43]. The PIX
was installed in the network of an Internet service provider
within a real-life environment. During the six hours of
collection time, 4,898 log messages (we also use the term alerts in the following) were created. As the data were not
collected within a controlled simulation environment, we
do not have any information on whether attacks occurred or not, and specifications of TPR and FPR are not
possible. Nonetheless, this data set can be used to
demonstrate the broad applicability of the proposed online
alert aggregation technique.
4.2 Performance Measures
In order to assess the performance of the alert aggregation,
we evaluate the following measures:
Percentage of detected instances. We regard an attack instance as being detected if there is at least one meta-alert that predominantly contains alerts of that particular instance. The percentage of detected attack instances can thus be determined by dividing the number of instances that are detected by the total number of instances in the data set. The measure is computed with respect to the instances covered by the output of the detection layer, i.e., instances missed by the detectors are not considered.
Number of meta-alerts and reduction rate. The number of meta-alerts is further divided into the number of attack meta-alerts, which predominantly contain true alerts, and the number of nonattack meta-alerts, which predominantly contain false alerts. The reduction rate is 1 minus the number of created meta-alerts divided by the total number of alerts.
Average runtime (t_avg) and worst case runtime (t_worst). The average runtime is measured in milliseconds per alert. Assuming up to several hundred thousand alerts a day, t_avg should stay clearly below 100 ms per alert. The worst case runtime t_worst, which is measured in seconds, states how long it takes at most to execute the while loop of Algorithm 2, which may include the execution of Algorithms 3 and 1 (the latter several times).
Meta-alert creation delay (d). It is obvious that there is a certain delay until a meta-alert is created for a new attack instance. The meta-alert creation delay d measures the delay between the actual beginning of the instance (i.e., the creation time of the first alert) and the creation of the first meta-alert for that instance. We investigate how many seconds the algorithm needs to create 90 percent (d_90%), 95 percent (d_95%), and 100 percent (d_100%) of the meta-alerts.
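Two of these measures are easily stated in code (a direct transcription of the definitions above; delay_quantile returns the smallest delay within which the given fraction of attack instances received their first meta-alert):

```python
import math

def reduction_rate(num_meta_alerts, num_alerts):
    """Reduction rate: 1 minus the number of created meta-alerts divided
    by the total number of alerts."""
    return 1.0 - num_meta_alerts / num_alerts

def delay_quantile(delays, q):
    """Smallest delay d such that at least a fraction q of the attack
    instances had their first meta-alert created within d seconds
    (e.g., q = 0.9 yields d_90%)."""
    ordered = sorted(delays)
    k = math.ceil(q * len(ordered))
    return ordered[max(k, 1) - 1]
```

For example, 324 meta-alerts for roughly 1.6 million alerts yield a reduction rate of about 99.98 percent, matching the idealized DARPA case reported below.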
4.3 Results
In the following, the results for the alert aggregation are presented. For all experiments, the same parameter settings are used. We set the threshold θ, which decides whether to add a new alert to an existing component or not, to five percent, and the threshold Δ, which specifies the allowed temporal spread of the alert buffer, to 180 seconds. θ was set to such a low value in order to ensure that even a quite small degradation of the cluster quality, which could indicate a new attack instance, results in a new component. A small value of θ, of course, results in more components and, thus, in a lower reduction rate, but it also reduces the risk of missing attack instances. The threshold Δ, which is used in the novelty assessment function, controls the maximum time that new alerts are allowed to reside in the buffer B. In order to keep the response time short, we set it to 180 s, which we think is a reasonable value. For both parameters, there were large intervals in which parameter values could be chosen without deteriorating the results. A detailed analysis and discussion of the effects of different parameter settings can be found in [19].
4.3.1 DARPA Data
Results for the DARPA data set are given in Table 2. First of all, it must be stated that there is an operating point of the SVM at the detection layer (OP 1) where we do not miss any attack instances at all (at least none in addition to those already missed at the detection layer). The reduction rate of 99.87 percent is extremely high, and the detection delay is only 5.41 s in the worst case (d_100%). Average and worst case runtimes are very good, too.
All OP will now be analyzed in much more detail.
All attack instances for which the detector produces at
least a single alert are detected in the idealized case and with
OP 1 and OP 2. Choosing another OP, the rate of detected
instances drops to 98.04 percent (OP 3) and 99.02 percent
(OP 4). In OP 3, a FORMAT instance and a MULTIHOP
instance are missed. In OP 4, only the FORMAT instance
could not be detected. A further analysis identified the
following reasons:
. The main reason in the case of the FORMAT
instance is the small number of only four alerts.
Those alerts are created by the detector layer for all
OP, i.e., there is obviously no benefit from choosing
an OP with higher FPR. By increasing the FPR, the
true FORMAT alerts are erroneously merged with
false alerts into one cluster. Hence, as the false alerts
easily outnumber the four true FORMAT alerts
within this cluster, the FORMAT instance gets lost.
. For the MULTIHOP instance, for which we have
19 alerts, the situation is more complex. The instance
is only missed in OP 3 and not in OP 4. In OP 3, the
downside of a higher FPR outweighs the benefit of a
higher TPR: the MULTIHOP alerts are merged with
a large number of false alerts. Further increasing the
FPR (OP 4) leads to more false alerts as well, but, in
this case, also to a further split of clusters such that
the false alerts and the MULTIHOP alerts are placed
into separate clusters.
Next, we analyze the number of meta-alerts and the reduction rate. In the idealized case, 324 meta-alerts are
created. Compared to the about 1.6 million alerts, we get a
reduction rate of 99.98 percent, which is a reduction of
almost three orders of magnitude. Unfortunately, with
exception of the first of seven weeks, it was not possible to
achieve the ideal case with exactly one meta-alert for every
attack instance. Basically, there are four reasons:
Distinguishable steps of an attack type. Often, a split of
attack instances into more meta-alerts is caused by the
nature of the attacks themselves. Actually, many attack
types consist of different, clearly distinguishable steps. As
an example, the FTP-WRITE attack exhibits three such
steps: an FTP login on port 21, an FTP data transfer on
port 20, and a remote login on port 513. Thus, a split into
three related meta-alerts is quite natural. Subsequent tasks
at the alert processing layer are supposed to handle such
multistep attack scenarios (cf. Fig. 1).
Several independent attackers. In the DARPA data set,
some attack instances are labeled as a single attack instance
although they are in fact comprised of the actions of several
independent attackers. A typical example is a WAREZ-
CLIENT instance in week four: There, attackers download
illegally provided software from a compromised FTP server.
As there is no cooperation between the attackers as in the
case of a coordinated denial of service attack, for example,
there is an individual meta-alert created for every attacker.
Long attack duration. Attack instances with a long
duration are often split into several meta-alerts. Typical
examples are slow or hidden port scans or (distributed)
denial of service attacks which can last several hours.
Bidirectional communication. TCP/IP-based communi-
cation between two hosts results in packets transmitted in
both directions. If the detector layer produces alerts for both
directions (e.g., due to malicious packets), the source and
destination IP address are swapped, which in the end
results in two meta-alerts. This problem could be solved
with an appropriate preprocessing step.
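Such a preprocessing step could, for instance, order the two endpoints of a connection deterministically so that alerts for both directions map to the same flow key (a hypothetical sketch; lexicographic comparison of address strings is a simplification):

```python
def canonical_flow(src_ip, src_port, dst_ip, dst_port):
    """Order the two endpoints of a connection deterministically so that
    alerts for both directions of one TCP/IP connection share a flow key
    (lexicographic string comparison; a real system would compare parsed
    addresses)."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)
```

With such a key, the swapped source and destination addresses of reply packets no longer produce a second meta-alert.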
With an increasing FPR, the number of meta-alerts also
increases from 1,976 meta-alerts in OP 1 to 8,588 meta-alerts
in OP 4. Even though the number of meta-alerts increases
clearly, we still have a reduction rate of 99.46 percent in OP 4,
which would heavily reduce the (possibly manual) effort
of a human security expert that analyzes that information
flood. It is also interesting to see that, although the overall
number of meta-alerts increases, the number of attack meta-alerts remains roughly constant. The majority of
additionally created meta-alerts contains only false alerts.
With more meta-alerts, the runtime increases. Even in
OP 4, with an average runtime of 0.97 ms per alert, the
proposed technique is still very efficient. Fig. 5 shows the
component creation delays for the four operating points. The
figure depicts the percentage of attack instances for which a
meta-alert was created after a time not exceeding the value
specified at the x-axis. It can be seen that the creation delay is
within the range of a few seconds and, thus, meets the
requirements for an online application. Table 2 displays the
time after which for 90, 95, and 100 percent of the attack
instances meta-alerts were created. Note that these values
correspond to points on the curve in Fig. 5. Interestingly, in
the idealized case, the delay is much higher which can be
explained by the novelty detection mechanism (cf. (14)). In
TABLE 2
Results of the Online Alert Aggregation for Three Benchmark Data Sets
Fig. 5. Cumulative creation delays for meta-alerts.
the case of a more homogeneous alert stream, the novelty
handling is mainly influenced by the temporal spread
whereas in the case of a heterogeneous alert stream due to
false alerts, the novelty handling is started more often which
results in lower component creation delay times.
4.3.2 Campus Network Data
For the campus network data, for which the IDS Snort was
used to create alerts, quite similar results could be achieved
(see Table 2). All attack instances that have been launched
were correctly detected. For the 17 attack instances with
128,816 alerts, 52 meta-alerts were created, which is
equivalent to a reduction rate of 99.96 percent. Again, the
majority of meta-alerts is caused by false alerts. We have
20 attack meta-alerts and 32 nonattack meta-alerts. For
three of the 17 attack instances, a redundant meta-alert was
created. The reasons are similar to the ones described for
the DARPA data set. It is worth mentioning that for the
attack instance where the attacker changed his IP address
during the attack, only one meta-alert was created. The
results for the runtime as well as for the cluster creation
delay are similar to that of the DARPA data, too.
4.3.3 Internet Service Provider Firewall Logs
For the firewall log data, the proposed alert aggregation
could also be applied successfully. As Table 2 shows,
56 meta-alerts were created for the 4,989 alerts, which is a
reduction rate of 98.86 percent. As it is not possible to
specify a percentage of detected attack instances, we
analyzed the content of the 56 resulting meta-alerts: In
many cases, it is possible to find a particular reason for the
meta-alerts, e.g., we identified meta-alerts that subsume all
alerts that are caused by obviously forbidden accesses from
hosts within the DMZ to an external printer using the
Printer PDL Datastream Port (HP JetDirect, port 9100).
Another example is meta-alerts that subsume alerts for
illegal network time protocol (ntp, port 123) accesses from
hosts outside the DMZ. The average runtime of 1.53 ms per alert is slightly higher here, but, interestingly, the worst case runtime t_worst of 0.27 s is much smaller than for the previous data sets. This effect might be explained by the more heterogeneous alert stream in the case of the firewall log data, which results in an increased number of calls of Algorithm 3 due to the novelty handling: On the one hand, the more heterogeneous the alert stream is, the more often the (time-expensive) Algorithm 3 is called, which increases the overall runtime. On the other hand, due to the increased number of calls of Algorithm 3, every call is conducted on an alert buffer that contains only a few new alerts, which reduces the runtime of a single call of Algorithm 3 and, thus, the worst case runtime.
4.4 Conclusion
The experiments demonstrated the broad applicability of
the proposed online alert aggregation approach. We
analyzed three different data sets and showed that
machine-learning-based detectors, conventional signature-
based detectors, and even firewalls can be used as alert
generators. In all cases, the amount of data could be
reduced substantially. Although there are situations as described in Section 3.3 (especially clusters that are wrongly split), the instance detection rate is very high:
None or only very few attack instances were missed.
Runtime and component creation delay are well suited for
an online application.
5 SUMMARY AND OUTLOOK
We presented a novel technique for online alert aggregation
and generation of meta-alerts. We have shown that the
sheer amount of data that must be reported to a human
security expert or communicated within a distributed
intrusion detection system, for instance, can be reduced
significantly. The reduction rate with respect to the number
of alerts was up to 99.96 percent in our experiments. At the
same time, the number of missed attack instances is extremely low (even zero in some of our experiments), and the delay for the detection of attack instances is within the range of a few seconds only.
In the future, we will develop techniques for interest-
ingness-based communication strategies for a distributed
IDS. This IDS will be based on organic computing principles
[44]. In addition, we will investigate how human domain
knowledge can be used to improve the detection processes
further. We will also apply our techniques to benchmark
data that fuse information from heterogeneous sources (e.g.,
combining host and network-based detection).
ACKNOWLEDGMENTS
This work was partly supported by the German Research
Foundation (DFG) under grant number SI 674/3-2. The
authors would like to thank D. Fisch for his support in
preparing one of the data sets. The authors highly appreciate
the suggestions of the anonymous reviewers that helped
them to improve the quality of the article.
REFERENCES
[1] S. Axelsson, Intrusion Detection Systems: A Survey and
Taxonomy, Technical Report 99-15, Dept. of Computer Eng.,
Chalmers Univ. of Technology, 2000.
[2] M.R. Endsley, Theoretical Underpinnings of Situation Aware-
ness: A Critical Review, Situation Awareness Analysis and
Measurement, M.R. Endsley and D.J. Garland, eds., chapter 1,
pp. 3-32, Lawrence Erlbaum Assoc., 2000.
[3] C.M. Bishop, Pattern Recognition and Machine Learning. Springer,
2006.
[4] M.R. Henzinger, P. Raghavan, and S. Rajagopalan, Computing on
Data Streams. Am. Math. Soc., 1999.
[5] A. Allen, Intrusion Detection Systems: Perspective, Technical
Report DPRO-95367, Gartner, Inc., 2003.
[6] F. Valeur, G. Vigna, C. Kru gel, and R.A. Kemmerer, A
Comprehensive Approach to Intrusion Detection Alert Correla-
tion, IEEE Trans. Dependable and Secure Computing, vol. 1, no. 3,
pp. 146-169, July-Sept. 2004.
[7] H. Debar and A. Wespi, Aggregation and Correlation of
Intrusion-Detection Alerts, Recent Advances in Intrusion Detection,
W. Lee, L. Me, and A. Wespi, eds., pp. 85-103, Springer, 2001.
[8] D. Li, Z. Li, and J. Ma, Processing Intrusion Detection Alerts in
Large-Scale Network, Proc. Intl Symp. Electronic Commerce and
Security, pp. 545-548, 2008.
[9] F. Cuppens, Managing Alerts in a Multi-Intrusion Detection
Environment, Proc. 17th Ann. Computer Security Applications Conf.
(ACSAC 01), pp. 22-31, 2001.
[10] A. Valdes and K. Skinner, Probabilistic Alert Correlation, Recent
Advances in Intrusion Detection, W. Lee, L. Me, and A. Wespi, eds.
pp. 54-68, Springer, 2001.
[11] K. Julisch, Using Root Cause Analysis to Handle Intrusion
Detection Alarms, PhD dissertation, Universitat Dortmund, 2003.
HOFMANN AND SICK: ONLINE INTRUSION ALERT AGGREGATION WITH GENERATIVE DATA STREAM MODELING 293
[12] T. Pietraszek, Alert Classification to Reduce False Positives in
Intrusion Detection, PhD dissertation, Universitat Freiburg, 2006.
[13] F. Autrel and F. Cuppens, Using an Intrusion Detection Alert
Similarity Operator to Aggregate and Fuse Alerts, Proc. Fourth
Conf. Security and Network Architectures, pp. 312-322, 2005.
[14] G. Giacinto, R. Perdisci, and F. Roli, Alarm Clustering for
Intrusion Detection Systems in Computer Networks, Machine
Learning and Data Mining in Pattern Recognition, P. Perner and
A. Imiya, eds., pp. 184-193, Springer, 2005.
[15] O. Dain and R. Cunningham, Fusing a Heterogeneous Alert
Stream into Scenarios, Proc. 2001 ACM Workshop Data Mining for
Security Applications, pp. 1-13, 2001.
[16] P. Ning, Y. Cui, D.S. Reeves, and D. Xu, Techniques and Tools for
Analyzing Intrusion Alerts, ACM Trans. Information Systems
Security, vol. 7, no. 2, pp. 274-318, 2004.
[17] F. Cuppens and R. Ortalo, LAMBDA: A Language to Model a
Database for Detection of Attacks, Recent Advances in Intrusion
Detection, H. Debar, L. Mé, and S.F. Wu, eds., pp. 197-216, Springer,
2000.
[18] S.T. Eckmann, G. Vigna, and R.A. Kemmerer, STATL: An Attack
Language for State-Based Intrusion Detection, J. Computer
Security, vol. 10, nos. 1/2, pp. 71-103, 2002.
[19] A. Hofmann, Alarmaggregation und Interessantheitsbewertung
in einem dezentralisierten Angriffserkennungssystem, PhD
dissertation, Universität Passau, under review.
[20] M.S. Shin, H. Moon, K.H. Ryu, K. Kim, and J. Kim, Applying
Data Mining Techniques to Analyze Alert Data, Web Technologies
and Applications, X. Zhou, Y. Zhang, and M.E. Orlowska, eds.,
pp. 193-200, Springer, 2003.
[21] J. Song, H. Ohba, H. Takakura, Y. Okabe, K. Ohira, and Y. Kwon,
A Comprehensive Approach to Detect Unknown Attacks via
Intrusion Detection Alerts, Advances in Computer Science - ASIAN
2007, Computer and Network Security, I. Cervesato, ed., pp. 247-253,
Springer, 2008.
[22] R. Smith, N. Japkowicz, M. Dondo, and P. Mason, Using
Unsupervised Learning for Network Alert Correlation,
Advances in Artificial Intelligence, R. Goebel, J. Siekmann, and
W. Wahlster, eds., pp. 308-319, Springer, 2008.
[23] A. Hofmann, D. Fisch, and B. Sick, Identifying Attack Instances
by Alert Clustering, Proc. IEEE Three-Rivers Workshop Soft
Computing in Industrial Applications (SMCia '07), pp. 25-31, 2007.
[24] M. Roesch, Snort: Lightweight Intrusion Detection for Net-
works, Proc. 13th USENIX Conf. System Administration (LISA '99),
pp. 229-238, 1999.
[25] O. Buchtala, W. Grass, A. Hofmann, and B. Sick, A
Distributed Intrusion Detection Architecture with Organic
Behavior, Proc. First CRIS Int'l Workshop Critical Information
Infrastructures (CIIW '05), pp. 47-56, 2005.
[26] D. Fisch, A. Hofmann, V. Hornik, I. Dedinski, and B. Sick, A
Framework for Large-Scale Simulation of Collaborative Intrusion
Detection, Proc. IEEE Conf. Soft Computing in Industrial Applica-
tions (SMCia '08), pp. 125-130, 2008.
[27] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification,
second ed. Wiley Interscience, 2001.
[28] IANA, Port Numbers, http://www.iana.org/assignments/
port-numbers, May 2009.
[29] Y. Rekhter, B. Moskowitz, D. Karrenberg, and G. de Groot, RFC
1597: Address Allocation for Private Internets, http://www.
faqs.org/rfcs/rfc1597.html, Mar. 1994.
[30] J. Postel, RFC 790: Assigned Numbers, http://www.faqs.org/
rfcs/rfc790.html, Sept. 1981.
[31] O. Buchtala, A. Hofmann, and B. Sick, Fast and Efficient Training
of RBF Networks, Artificial Neural Networks and Neural Information
Processing - ICANN/ICONIP 2003, O. Kaynak, E. Alpaydin, E. Oja,
and L. Xu, eds., pp. 43-51, Springer, 2003.
[32] R.P. Lippmann, D.J. Fried, I. Graf, J.W. Haines, K.R. Kendall, D.
McClung, D. Weber, S.E. Webster, D. Wyschogrod, R.K.
Cunningham, and M.A. Zissman, Evaluating Intrusion Detec-
tion Systems: The 1998 DARPA Offline Intrusion Detection
Evaluation, Proc. DARPA Information Survivability Conf. and
Exposition (DISCEX), vol. 2, pp. 12-26, 2000.
[33] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, On Clustering
Validation Techniques, J. Intelligent Information Systems, vol. 17,
nos. 2/3, pp. 107-145, 2001.
[34] J.C. Dunn, Well Separated Clusters and Optimal Fuzzy Parti-
tions, J. Cybernetics, vol. 4, pp. 95-104, 1974.
[35] D.L. Davies and D.W. Bouldin, A Cluster Separation Measure,
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2,
pp. 224-227, Apr. 1979.
[36] M. Halkidi and M. Vazirgiannis, Clustering Validity Assessment
Using Multi Representatives, Proc. SETN Conf., vol. 2, pp. 237-
249, 2002.
[37] A. Hofmann, I. Dedinski, B. Sick, and H. de Meer, A Novelty-
Driven Approach to Intrusion Alert Correlation Based on
Distributed Hash Tables, Proc. 12th IEEE Symp. Computers and
Comm. (ISCC '07), pp. 71-78, 2007.
[38] F. Provost and T. Fawcett, Analysis and Visualization of
Classifier Performance: Comparison under Imprecise Class and
Cost Distributions, Proc. Third Int'l Conf. Knowledge Discovery and
Data Mining (KDD '97), pp. 43-48, 1997.
[39] J. McHugh, Testing Intrusion Detection Systems: A Critique of
the 1998 and 1999 DARPA Intrusion Detection System Evaluations
as Performed by Lincoln Laboratory, ACM Trans. Information and
System Security, vol. 3, no. 4, pp. 262-294, 2000.
[40] M.V. Mahoney and P.K. Chan, An Analysis of the 1999 DARPA/
Lincoln Laboratory Evaluation Data for Network Anomaly
Detection, Recent Advances in Intrusion Detection, G. Vigna,
E. Jonsson, and C. Krügel, eds., pp. 220-237, Springer, 2003.
[41] A. Hofmann, D. Fisch, and B. Sick, Improving Intrusion
Detection Training Data by Network Traffic Variation, Proc.
IEEE Three-Rivers Workshop Soft Computing in Industrial Applica-
tions, pp. 25-31, 2007.
[42] Sourcefire, Inc., http://www.snort.org/, 2009.
[43] CISCO Systems, Inc., Cisco PIX Firewall System Log Messages,
Version 6.3, http://www.cisco.com/en/US/docs/security/pix/
pix63/system/message/pixemsgs.html, 2009.
[44] Organic Computing, R.P. Würtz, ed., Springer, 2008.
Alexander Hofmann received the diploma
(master's degree) in computer science in 2002
from the University of Passau, Germany. He
worked toward the PhD degree as a research
assistant in the area of computer security as a
member of the research group Computationally
Intelligent Systems chaired by Bernhard Sick.
Currently, he is employed as a senior software
engineer with Capgemini sd&m AG. He has
published several articles in the areas of intru-
sion detection, data mining, neural networks, and peer-to-peer
technologies. He is a member of the IEEE and the ACM.
Bernhard Sick received the diploma (1992) and
PhD degrees (1999) in computer science from
the University of Passau, Germany. Currently,
he is an assistant professor (Privatdozent) at the
Faculty of Computer Science and Mathematics
of the University of Passau, Germany. There, he
is conducting research in the areas of machine
learning, organic computing, collaborative data
mining, intrusion detection, and biometrics. He
authored more than 70 peer-reviewed publica-
tions in these areas. He is a member of the IEEE and an associate editor
of the IEEE Transactions on Systems, Man, and Cybernetics, Part B. He
holds one patent and received several thesis, best paper, teaching, and
inventor awards.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
294 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 8, NO. 2, MARCH-APRIL 2011