10.1109 cc.2016.7559072

NETWORK CODING AND ALGORITHM
Intrusion Detection Algorithm Based on Density, Cluster Centers, and Nearest Neighbors
Xiujuan Wang1, Chenxi Zhang1, Kangfeng Zheng2
1 Computer Sciences, Beijing University of Technology, Beijing, China
2 Computer Science And Technology, Beijing University of Posts and Telecommunications, Beijing, China
Abstract: Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls. It can detect types of malicious network communication and computer usage that conventional firewalls cannot. Many intrusion detection methods are built on machine learning. Previous literature has shown that intrusion detection methods based on hybrid or ensemble learning outperform those based on a single learning technique. However, almost no studies focus on how more representative and concise features can be extracted for effective intrusion detection from massive and complicated data. In this paper, a new hybrid learning method is proposed on the basis of features such as density, cluster centers, and nearest neighbors (DCNN). In this algorithm, each sample point is represented by its local density and by the sum of its distances to the cluster centers and to its nearest neighbor. A k-NN classifier is adopted to classify the new feature vectors. Our experiments show that DCNN, which combines k-means clustering, clustering-based density, and the k-NN classifier, is effective in intrusion detection.
Keywords: intrusion detection; DCNN; density; cluster center; nearest neighbor
China Communications July 2016
I. INTRODUCTION
Information has become a valuable resource in modern society. However, the security of information systems is a critical problem because of the openness of networks. In network security, a complete security system is not attainable. Instead, it is more practical to establish a secure system that is easy to implement and, at the same time, to construct a corresponding security assistance system according to the security policies. Intrusion detection is such a security system; it serves as a supplement to firewalls, which defend the computer system against attacks [1].
Intrusion detection detects external intrusions and supervises unauthorized activities of
internal users by identifying and responding to
malicious network communication and computer usage behavior. Intrusion detection aims
to detect intrusions by studying the process
and characteristics of intrusion behavior, thereby enabling a real-time response to intrusion
events and the invasion process. Two basic
intrusion detection technologies exist, namely,
anomaly detection and misuse detection [2].
Currently, most of the relevant literature focuses on intrusion detection based on machine learning and evaluates improvements in detection performance with criteria such as accuracy, detection rate, and false alarm. Although numerous advanced detection methods have been proposed, only a few studies focus on how a small number of simple, highly informative feature values can be used to represent a large amount of data.
In this paper, a new characteristic value
representation method is proposed for intrusion detection. This new feature vector uses
the local density of each sample point in the
dataset, the distance from the sample point to
the cluster center, and the distance to the nearest neighbor. Thus, the method is known as
DCNN.
This paper is organized as follows: In the
second section, we summarize the related
technologies in intrusion detection. The proposed DCNN algorithm is detailed in the third
section. The experimental setup and the results
are provided in the fourth section. Finally, a
conclusion is given.
II. LITERATURE REVIEW

At present, intrusion detection technology is mainly based on rule matching, machine learning, and data mining, among others.

Devaraju S et al. used intrusion detection technology based on rule matching [3]. An association rule mining algorithm (ARMA) was used to detect the attack types in the KDD Cup 99 dataset.
Zhang Yi et al. used data mining technology [4]. Their paper described the use of association rules and their optimization algorithm. The design and implementation of an intrusion detection system were based on feature analysis and knowledge discovery of log files.
Intrusion detection based on rule matching fails to recognize unknown malicious behaviors, a problem that can be addressed with machine learning. Many studies have been conducted on this subject [5][6][7].
A.S. Eesa et al. used the cuttlefish algorithm to select features and then used a decision tree for classification [5].
E. de la Hoz et al. used a multi-objective approach to select features for the DARPA/NSL-KDD dataset and then used growing hierarchical self-organizing maps to detect intrusions [6].
Table I lists several papers related to intrusion detection and compares their intrusion detection technique, dataset, problem domain, evaluation method, and baseline classifier. Most studies used the KDD Cup 99 dataset. The most widely used evaluation measures are accuracy and detection rate. The k-NN classifier is the most widely used baseline classifier. Most approaches in the literature combine two techniques for intrusion detection: a clustering method is applied first, and a classifier is then used.

Table I Comparison among related work
Work                      Technique    Dataset         Problem domain      Evaluation method             Baseline
Devaraju S et al. [3]     ARMA         KDD-Cup99       Misuse detection    DR, FA                        ARMA
Tsai CF, Lin CY [7]       TANN         KDD-Cup99       Anomaly detection   DR, Accuracy, FP, FN          k-NN/SVM
Lin WC et al. [8]         CANN         KDD-Cup99       Anomaly detection   DR, Accuracy, FP, FN          k-NN/SVM
A.S. Eesa et al. [5]      DT           KDD-Cup99       Anomaly detection   DR, FP, Accuracy, ROC curve   N/A
E. de la Hoz et al. [6]   GHSOM        DARPA/NSL-KDD   Anomaly detection   DR, ROC curve                 NB, RF, DT
Wang FN et al. [9]        KPCA + RVM   KDD-Cup99       Anomaly detection   DR, FA, Accuracy              RVM

Abbreviations used in Table I:
ARMA: association rule mining algorithm
FA: false alarm
TANN: triangle area-based nearest neighbors approach to intrusion detection
DR: detection rate
FP: false positive
FN: false negative
SVM: support vector machine
CANN: an intrusion detection system based on combining cluster centers and nearest neighbors
DT: decision tree
ROC curve: receiver operating characteristic curve
GHSOM: growing hierarchical self-organizing maps
NB: naive Bayes
RF: random forest
KPCA: kernel principal component analysis
RVM: relevance vector machine
Tsai CF and Lin CY used intrusion detection technology based on hybrid machine learning [7]. First, k-means clustering was performed to obtain cluster centers. Then, the area of the triangle formed by two cluster centers and one data point from the given dataset was calculated, thereby forming a new feature signature of the data. Finally, the k-NN classifier was used to classify similar attacks on the basis of the new feature represented by the triangular area [7].
Lin WC, Ke SW, and Tsai CF also used intrusion detection technology based on hybrid machine learning [8]. They first used k-means clustering to obtain cluster centers and nearest neighbors, and then they obtained a new feature by calculating the distances between data points and cluster centers or nearest neighbors. Finally, the k-NN classifier and SVM were used for classification based on the new feature. They achieved a detection accuracy close to that of the k-NN classification on original features with low cost.
III. DCNN ALGORITHM
In this paper, we propose an intrusion detection algorithm based on the study of Lin WC et al. [8]. Clustering can be conducted based on measures such as distance and density. Which type of clustering is better depends on the characteristics of the data being addressed. In intrusion detection, the measure that performs best is uncertain. In other words, distance alone cannot fully represent the data. In addition to distance, density is a representative value of the data. Therefore, we suggest taking density into account as a new feature.
To illustrate the validity of density, the density distributions of the normal and abnormal classes of the KDD Cup 99 corpus are calculated, as shown in Figure 1. The definition of local density is given in Section 3.2.
Figure 1 indicates that the density distribution of each data type is different. The local density of normal data is mostly distributed at higher values of approximately 497-500, and the attack data tend to be distributed at lower densities. Hence, density can be used as an effective distinguishing feature.
Fig.1 Density distribution: (a) normal class, (b) PROBING class, (c) DOS class, (d) U2R class, (e) R2L class [plots omitted; axis label: Density]
3.1 DCNN framework
The framework of the DCNN-based intrusion detection algorithm is shown in Figure 2. Hybrid learning is applied to network data. First, clustering is used to obtain the distances and density of the network data, which form a new low-dimensional feature vector. A k-NN classifier is then built on the new feature vectors and outputs the class label of the data.
Fig.2 Frame of DCNN [diagram omitted: cluster centers Ci, nearest neighbors Ni, and local density ρi are extracted from the training dataset T and the testing dataset S; the distances d1 and d2 and the density form the new feature vector; k-NN labels unlabeled data as normal or attacks]
Clustering on training set T aims to obtain the cluster centers (Ci), the nearest neighbor of each sample point (Ni), and the local density of each sample point (ρi): Intrusion detection is a classification problem. Therefore, the number of categories in classification should be determined first. Distance is adopted to measure the similarity between an unlabeled sample point and each type of attack data. Hence, the number of clusters should be equal to the number of classification categories.
New feature vector formation: The sum of the distances between the sample point and all cluster centers (d1) and the distance between the sample point and its nearest neighbor in the same cluster (d2) are calculated. Then, the two distances are added to obtain the new value Di. The training dataset T=(x1,x2,...,xn) is replaced by a new two-dimensional feature vector T'=(D,ρ) formed by the distance Di and the local density ρi.
Training and testing: The above steps are repeated on the test dataset S, and a new two-dimensional feature vector S' is obtained. Then, the k-NN classifier is used for intrusion detection on the T' and S' datasets.
3.2 Determining Ci, Ni, ρi

In this paper, we use the k-means clustering algorithm to extract the cluster centers. Assume that the data fall into N classes in total, and let K=N. Then, a total of N cluster centers exist, namely, C1, C2, ..., CN.
Figure 3 shows an example that contains 15 data points to be clustered into 5 categories; each cluster has a cluster center, C1, C2, C3, C4, or C5. The green boxes represent the cluster centers.
The nearest neighbor of a data point is defined in terms of distance. The nearest neighbor points in this paper are determined as follows: For each data point A, the distance between A and every other point in its cluster is calculated. The point at the shortest distance is the nearest neighbor of A. The neighbor points of point A, that is, the other points in its cluster, are denoted Neigh(A). The nearest neighbor point NN(A) is defined as Formula (1).
$NN(A) = \arg\min_{X \in Neigh(A)} d(A, X)$    (1)
As shown in Figure 3, the distances between A and the other points in cluster 4 are calculated. dAE and dAB are obtained, with dAE being less than dAB. Thus, the nearest neighbor of A is point E.
The local density calculation method used in this paper is based on the work of Rodriguez A et al. [10]. Local density is defined as Formula (2).
$\rho_i = \sum_{j} \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise    (2)
dc is a truncation (cutoff) distance that is set manually, and dij is the distance from point i to point j. To apply the formula, a cutoff distance dc is chosen and the distance dij between every pair of points is calculated. Then, the local density ρi of each data point is obtained from dc and dij, as defined in Formula (2).
Figure 4 illustrates the process of calculating the local density. The dotted line represents the cutoff distance dc. To calculate the local density of point I, a circle with radius dc is drawn around I, and the number of points within this circle is counted. The density of the point I shown in the graph is equal to 5.
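The counting procedure described for Formula (2) and Figure 4 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `local_density`, the toy points, and the cutoff value are assumptions made here, and the sketch assumes that a point is not counted as its own neighbor.

```python
import numpy as np

def local_density(points, d_c):
    """Count, for every point, the other points closer than the cutoff d_c."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances d_ij, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # chi(d_ij - d_c) equals 1 exactly when d_ij < d_c.
    within = d < d_c
    # Assumption: exclude the point itself (d_ii = 0) from its own count.
    np.fill_diagonal(within, False)
    return within.sum(axis=1)

# Hypothetical toy data: a tight group of four points and one outlier.
pts = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
rho = local_density(pts, d_c=0.5)
print(rho)  # [3 3 3 3 0]
```

The sketch builds the full n x n distance matrix, which is only practical for small samples; for a dataset with more than one hundred thousand records the distances would have to be processed in chunks or with a spatial index.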
Fig.3 Extracting the cluster centers with the use of k-means and determining the nearest neighbor [diagram omitted: point A and cluster centers C1 to C5]
Fig.4 Calculating local density [diagram omitted: circle of radius dc around point I]
3.3 New feature vectors

Before obtaining the new feature data, two distances need to be calculated in addition to the local density. One is the distance between sample point A and the cluster centers Ci. Suppose that N cluster centers are obtained in the clustering process. Then, the distances d(A, Ci) between point A and the N cluster centers Ci need to be calculated. The sum of these N distances describes the data point comprehensively in terms of distance. We thus obtain a distance d1, which is defined as Formula (3).
$d_1 = \sum_{i=1}^{N} d(A, C_i)$    (3)
The other distance is the distance between the sample point and its nearest neighbor. The distance measure used in this paper is the Euclidean distance. Assuming that the original data are M-dimensional vectors, the distance d2 from point A (a1,a2,...,aM) to its nearest neighbor point B (b1,b2,...,bM) is defined as Formula (4).
$d_2 = \sqrt{\sum_{k=1}^{M} (a_k - b_k)^2}$    (4)
Figure 5 shows an example of calculating these two distances in the case of 5 classes. The black solid lines represent the distances between data point A and the 5 cluster centers. The red dashed line represents the distance between data point A and its nearest neighbor B. After these two distances are obtained, d1 and d2 are added together. A new distance Di is obtained for each sample point to act as the first new feature, as defined in Formula (5).
$D_i = d_1 + d_2$    (5)
Fig.5 Distance between point A and cluster centers, and distance between point A and nearest neighbor [diagram omitted]
With the local density ρi and the distance Di, a new two-dimensional feature vector (Di, ρi) is obtained to replace the original feature vector.
Finally, the two-dimensional feature vector is used to train the k-NN classifier.
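The distance feature of Sections 3.2 and 3.3 can be sketched as follows: cluster centers from k-means, the sum of distances to all centers (d1), the distance to the nearest neighbor in the same cluster (d2), and their sum D. This is an illustrative sketch under stated assumptions, not the authors' code. The function names `assign`, `kmeans`, and `distance_feature` are hypothetical, the initial centers are passed in explicitly so that the run is reproducible (the paper does not say how they were chosen), and a cluster with a single point is given d2 = 0.

```python
import numpy as np

def assign(X, centers):
    """Index of the closest center for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, init_centers, n_iter=20):
    """Plain Lloyd iterations; K is the number of initial centers."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        for k in range(len(centers)):
            if np.any(labels == k):  # keep the old center if the cluster is empty
                centers[k] = X[labels == k].mean(axis=0)
    return centers, assign(X, centers)

def distance_feature(X, centers, labels):
    """D = d1 + d2 for every row of X."""
    # d1: sum of Euclidean distances to all cluster centers.
    d1 = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).sum(axis=1)
    # d2: distance to the nearest other point in the same cluster.
    d2 = np.zeros(len(X))
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        if len(idx) < 2:
            continue  # assumption: a singleton cluster contributes d2 = 0
        pair = np.linalg.norm(X[idx][:, None, :] - X[idx][None, :, :], axis=2)
        np.fill_diagonal(pair, np.inf)  # a point is not its own neighbor
        d2[idx] = pair.min(axis=1)
    return d1 + d2

# Hypothetical toy data with two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 3.0]])
centers, labels = kmeans(X, init_centers=[[0.0, 0.0], [4.0, 0.0]])
D = distance_feature(X, centers, labels)
print(labels)  # [0 0 1 1]
```

Pairing each value of D with the local density of the same point would give the two-dimensional vector (D, ρ) described in the text.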
The training and testing sets are then combined to obtain the two-dimensional feature vectors for k-NN classifier testing.
IV. EXPERIMENTS
4.1 Experimental setup
4.1.1 Dataset
The training and testing datasets used in this paper are all from the KDD Cup 99 corpus [11]. The KDD Cup 99 corpus is the dataset used in the Knowledge Discovery and Data Mining (KDD) contest held in 1999. Although the data are old, they are widely recognized and used by researchers.
Each network connection in the KDD Cup 99 dataset is marked as normal or attack. The attacks are divided into 4 categories and 39 types. The four categories of attack are DOS, R2L, U2R, and PROBING. Table II describes the classification tags for the five types of data.

Table II Tags for data classification
Types of data   Normal   PROBING   DOS   U2R   R2L
Tags

The KDD Cup 99 dataset has 41 feature dimensions and one category label dimension, for a total of 42 dimensions. Similar to the work of Zhang et al. [11], 19 feature dimensions are selected. After the 19-dimensional data are extracted, the quantitative features need to be normalized. Afterwards, duplicate records are removed to obtain a single deduplicated dataset. Figure 6 shows the composition of the remaining data. The training dataset has 119845 records, and the training and testing datasets together have 177463 records.
Fig.6 Classification of data [bar chart omitted]. Training dataset: Normal 64.57%, PROBING 1.47%, DOS 33.09%, U2R 0.04%, R2L 0.83%. Training and testing datasets: Normal 66.65%, PROBING 2.41%, DOS 28.77%, U2R 0.14%, R2L 2.03%.
4.1.2 Evaluations
This paper uses three evaluation criteria to evaluate the performance of DCNN, i.e., accuracy, detection rate, and false alarm, as defined in Formulas (6), (7), and (8), respectively. Table III explains the variables used in the formulas.
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$    (6)
$Detection\ rate = \frac{TP}{TP + FN}$    (7)
$False\ alarm = \frac{FP}{FP + TN}$    (8)

Table III Measures of evaluation
Actual\predicted   Normal   Attacks
Normal             TN       FP
Attacks            FN       TP

True positives (TP): the number of malicious executables correctly classified as malicious
True negatives (TN): the number of benign programs correctly classified as benign
False positives (FP): the number of benign programs falsely classified as malicious
False negatives (FN): the number of malicious executables falsely classified as benign
4.2 Experimental results
4.2.1 Original data classification

Intrusion detection using the k-NN classifier is performed on the original 19-dimensional KDD Cup data. The results for the five types of data are shown in Table IV. K is set to 21. The total accuracy is 84.36%.
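As a small worked check of the evaluation criteria, the sketch below computes accuracy, detection rate, and false alarm from the TP, TN, FP, and FN counts of Section 4.1.2. The usual confusion-matrix definitions are assumed here, and the function names are hypothetical. The example uses the Normal row of Table IV: 68741 normal records are kept as normal (TN) and the remaining 1852 + 6124 + 593 + 79 are flagged as some attack class (FP).

```python
def accuracy(tp, tn, fp, fn):
    """Share of all records that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def detection_rate(tp, fn):
    """Share of attack records that are flagged as attacks."""
    return tp / (tp + fn)

def false_alarm(fp, tn):
    """Share of normal records that are flagged as attacks."""
    return fp / (fp + tn)

# Normal row of Table IV (k-NN on the original features).
tn = 68741
fp = 1852 + 6124 + 593 + 79
fa = false_alarm(fp, tn)
print(f"{100 * fa:.2f}%")  # 11.17%
```

The result agrees with the 11.17 value that appears among the False Alarm figures of Fig.7, and it is the complement of the 88.83% accuracy listed for the Normal row.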
4.2.2 CANN
The results for the five data types with CANN are shown in Table V. The K value of the baseline classifier k-NN is set to 21. The total accuracy is 89.79%, indicating that the performance is better than that of k-NN classification on the original features.
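All three experiments rely on a k-NN classifier with K = 21. A minimal sketch of majority-vote k-NN on two-dimensional feature vectors is given below, assuming Euclidean distance. The function name `knn_predict` and the toy feature values are hypothetical, and the toy call uses k=3 only because the example has six training rows.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=21):
    """Label each test row by majority vote among its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    # Euclidean distance from every test row to every training row.
    dist = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    # Indices of the k closest training rows for each test row.
    nearest = np.argsort(dist, axis=1)[:, :k]
    preds = []
    for row in train_y[nearest]:
        values, counts = np.unique(row, return_counts=True)
        preds.append(values[counts.argmax()])  # ties go to the smallest label
    return np.array(preds)

# Hypothetical two-dimensional feature vectors with their labels.
train_X = [[1.0, 9.0], [1.2, 8.5], [0.9, 9.5], [6.0, 1.0], [6.5, 1.5], [5.8, 0.5]]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]
pred = knn_predict(train_X, train_y, [[1.1, 9.1], [6.1, 1.2]], k=3)
print(pred)  # ['normal' 'attack']
```

Because the vote depends on raw Euclidean distance, features on very different scales would need to be normalized first, as is done for the quantitative features of the dataset.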

4.2.3 DCNN
The results for the five data types with DCNN are shown in Table VI. The K value of the baseline classifier k-NN is set to 21. The total accuracy is 96.74%, the best performance among the three methods.
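The reported total accuracy appears to correspond to the share of correctly classified records over all five classes. This reading is an assumption rather than a statement from the text, but it can be checked by taking the correctly classified count of each class from Table VI and dividing by the 119845 records of the training dataset:

```latex
\frac{76375 + 1711 + 36955 + 35 + 862}{119845}
  = \frac{115938}{119845}
  \approx 96.74\%
```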
4.2.4 Discussion
The performance of the three intrusion detection approaches on the three evaluation criteria is compared in Figure 7.
The above three tables and Figure 7 show that the k-NN classification process is simplified by the dimension reduction of the data features in the DCNN algorithm. Our experiments indicate that DCNN performs better than CANN and the k-NN classifier: it has the highest accuracy and detection rate and the lowest false alarm. Thus, the new features in DCNN describe the characteristics of network data well. The processing time of the DCNN algorithm is shorter than that of directly using the k-NN classifier for intrusion detection because of the dimensionality reduction. Tables IV, V, and VI indicate that U2R and R2L have low detection accuracy with all three methods and that DCNN improves the detection accuracy of these two types of attacks.
Figure 1 shows that the local density of each data type exhibits differences. The density of the normal data type is much higher than that of the attack types, and the performance of DCNN is better than that of CANN. Thus, local density is a valuable feature. The above findings indicate that our method is successful.

Table IV Result of k-NN classifier (K=21)
actual\predicted: Normal, PROBING, DOS, U2R, R2L; Accuracy
Normal: 68741, 1852, 6124, 593, 79; 88.83%
PROBING: 126, 1594, 21, 11; 90.77%
DOS: 8147, 1218, 30287; 76.38%
U2R: 19, 27; 51.92%
R2L: 241, 44, 254, 452; 45.38%

Table V Result of CANN classifier (K=21)
actual\predicted: Normal, PROBING, DOS, U2R, R2L; Accuracy
Normal: 69630, 240, 7079, 80, 360; 89.97%
PROBING: 94, 1595, 60; 90.83%
DOS: 3056, 793, 35549, 118, 136; 89.65%
U2R: 10, 33; 63.16%
R2L: 162, 31, 801; 80.42%

Table VI Result of DCNN classifier (K=21)
actual\predicted: Normal, PROBING, DOS, U2R, R2L; Accuracy
Normal: 76375, 896, 82, 11, 25; 98.69%
PROBING: 42, 1711; 97.44%
DOS: 1992, 502, 36955, 69, 134; 93.20%
U2R: 14, 35; 67.31%
R2L: 67, 15, 48, 862; 86.55%

Fig.7 Performance comparison [bar chart omitted]. Accuracy(%): KNN 84.36, CANN 89.79, DCNN 96.74. Detection Rate(%): KNN 78.91, CANN 91.96, DCNN 94.93. False Alarm(%): KNN 11.17, CANN 4.55, DCNN 2.69.
V. CONCLUSION
In this paper, we proposed a new hybrid machine learning-based intrusion detection method called DCNN, which effectively reduces the feature dimension of the original dataset to a simple and representative two-dimensional vector. It saves time and improves the accuracy in our experiment on the KDD Cup 99 dataset. Experimental results show that DCNN can successfully detect intrusions.
Additional work is needed in the future. For example, we used only the k-means clustering algorithm to obtain cluster centers. We can change the selection method of the initial cluster centers to improve the clustering accuracy. In addition, the density calculation is not very accurate; in the future, the density of a point can be computed within that point's own cluster instead. Finally, only the k-NN classifier is used as the baseline classifier in this paper. Wider comparisons can be conducted with other baseline classifiers for intrusion detection.
References
[1] Yao Lan, Wang Xinmei. Present situation and development trend of intrusion detection system. Telecommunications Science. 2002. (12):31-35.
[2] YAO Jun-lan. Intrusion detection technology and its development trend. Information Technology. 2006. (4):172-175.
[3] Devaraju S, Ramakrishnan S. Detection of Attacks for IDS using Association Rule Mining Algorithm. IETE JOURNAL OF RESEARCH. 2015. 61(6):624-633.
[4] ZHANG Yi, LIU Yan-heng et al. Intrusion detection system based on association rules. Journal of Jilin University. 2006. (2).
[5] A.S. Eesa, Z. Orman, A.M.A. Brifcani. A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. EXPERT SYSTEMS WITH APPLICATIONS. 2015. 42(5):2670-2679.
[6] E. de la Hoz, E. de la Hoz, A. Ortiz, J. Ortega, A. Martinez-Alvarez. Feature selection by multi-objective optimization: application to network anomaly detection by hierarchical self-organising maps. KNOWLEDGE-BASED SYSTEMS. 2014. 71(SI):322-338.
[7] Tsai CF, Lin CY. A triangle area based nearest neighbors approach to intrusion detection. PATTERN RECOGNITION. 2010. 43(1):222-229.
[8] Lin WC, Ke SW, Tsai CF. CANN: An intrusion detection system based on combining cluster centers and nearest neighbors. KNOWLEDGE-BASED SYSTEMS. 2015. 78:13-21.
[9] Wang FN, Wang SS. Solving the intrusion detection problem with KPCA-RVM. DESIGN, MANUFACTURING AND MECHATRONICS. 2016. 520-527.
[10] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. SCIENCE. 2014. 344(6191):1492-1496.
[11] X.-Q. Zhang, C.H. Gu, J.J. Lin. Intrusion detection system based on feature selection and support vector machine. International Conference on Communications and Networking in China. 2006. pp. 1-5.
Biographies
Xiujuan Wang received her PhD in Information and Signal Processing in July 2006 from the Beijing University of Posts and Telecommunications. She is currently a lecturer at the College of Computer Sciences, Beijing University of Technology. Her research interests include information and signal processing and network security. E-mail: xjwang@bjut.edu.cn
Chenxi Zhang is currently pursuing her Master's degree at the College of Computer Sciences, Beijing University of Technology. Her research interests include information and network security. She is the corresponding author. E-mail: 15110005031@163.com
Kangfeng Zheng received his PhD in Information and Signal Processing in July 2006 from the Beijing University of Posts and Telecommunications. He is currently an associate professor at the School of Computer Science and Technology, Beijing University of Posts and Telecommunications. His research interests include networking and system security, and network information processing. E-mail: kfzheng@bupt.edu.cn
