
International Journal of Computational Intelligence and Information Security, April 2013, Vol. 4 No. 4, ISSN: 1837-7823

A Novel Graph Based Clustering Approach for Network Intrusion Detection


D. P. Jeyepalan1, E. Kirubakaran2
1 Research Scholar, School of Computer Science, Engineering and Applications, Bharathidasan University, Tiruchirappalli, Tamilnadu, India.
2 Additional General Manager, SSTP (Systems), Bharat Heavy Electricals Ltd, Tiruchirappalli, India.

Abstract

Detecting the vulnerabilities in a network plays a vital role in the prevention of intrusions. This paper describes a cluster based mechanism for detecting vulnerabilities and, in turn, intrusions. The network is analyzed and a graph representing the entire network is constructed. This graph is passed to a clustering algorithm that clusters the nodes. Since the clustering process is essentially an elimination of edges, the number of clusters or the shape of the clusters need not be provided before processing. This process helps us sort out the outliers, which are the nodes most vulnerable to attack. Analysis shows that our process has an accuracy rate of 0.91375.

Keywords: Intrusion detection; clustering; graph based clustering

1. Introduction


Due to the increase in the volume of network related transactions, network related crimes have also shown a rapid increase. These crimes take the form of attacking a target system directly or stealing information during online transactions. In either form, a computer forms the base of the attack; this system is called the compromised node. Detecting these compromised nodes is a very important issue in intrusion detection. A compromised node has the ability to perform malicious activities such as sniffing packets, launching Denial of Service (DoS) attacks, transmitting viruses and worms and, much worse, converting other computers into compromised nodes. All other systems within the network become vulnerable to attack due to the presence of a compromised node. Hence it becomes mandatory to blacklist these nodes and either remove them from the network or monitor their activities for malicious behavior and restore the systems to their initial state.

The increased use of data mining techniques in intrusion detection has led to a growing number of specialized detection algorithms, including association rule mining, frequency scene rule mining, classification and clustering algorithms. The first three belong to the supervised learning category: they require training datasets describing all behaviors, and only after applying such training data can the system detect anomalies. Clustering algorithms, on the other hand, come under the unsupervised learning category; they do not depend on training data but instead use similarity grouping to recognize the odd one out.

The rest of this paper is organized as follows. Section 2 describes the related works and section 3 describes the overall system architecture and an outline of the complete functioning of the system. Section 4 describes the actual intrusion detection mechanism in detail, section 5 shows the obtained results and their analysis, and section 6 provides the conclusion.


2. Related Works

In general, anomaly detection focuses mainly on monitoring and recording user behavior, which helps distinguish unusual behavior from normal behavior. Any behavior that deviates from the normal is labeled as an anomaly or intrusion. Typical conventional anomaly detection studies [1, 2, 3] have used statistical approaches. Statistical methods have the strong point that the size of a profile for real-time intrusion detection can be minimized. However, statistical operators alone cannot provide the best results: false positives cannot be avoided, and statistical methods cannot handle infrequent but periodically occurring activities.

Leonid Portnoy [4] introduced a clustering algorithm to detect both known and new intrusion types without the need to label the training data, using a simple variant of single-linkage clustering to separate intrusion instances from normal instances. Though this algorithm removes the dependency on a predefined number of clusters, it requires a predefined clustering width W, which is not always easy to find. The assumption that "the normal instances constitute an overwhelmingly large portion (>98%)" is also too strong. In [5], Qiang Wang introduced Fuzzy-Connectedness Clustering (FCC) for intrusion detection, based on the concept of fuzzy connectedness introduced by Rosenfeld in 1979. FCC can detect known intrusion types and their variants, but it is difficult to find a generally accepted definition of the fuzzy affinity used by FCC.

3. System Architecture
The process of intrusion detection is performed as described in Figure 1.

Figure 1: Intrusion Detection Mechanism

The initial phase deals with creating a graph on which the subsequent processing is performed. Every system in the network is considered a node and every connection between systems is marked as an edge. A complete graph is created along with the weight details for later analysis. The graph is analyzed using the provided weight values and all related nodes are grouped together to form clusters [10]. After the formation of clusters, cluster analysis [6] is performed, in which every cluster is checked for outlying items, i.e. items farthest from the cluster centre. These are isolated and considered the vulnerable nodes. After this, the nodes are monitored, and if traffic anomalies are detected, the node is labeled as an intruder.
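The "clustering as edge elimination" idea described above can be sketched in a few lines: cut every edge whose weight exceeds some threshold, then read the surviving connected components off as clusters. The weight matrix and threshold below are made-up illustrations, not data from the paper.

```python
from collections import deque

import numpy as np

# Hypothetical symmetric weight matrix for a 6-node network (0 = no link).
W = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 2, 0, 0, 0],
    [1, 2, 0, 9, 0, 0],
    [0, 0, 9, 0, 1, 2],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 2, 1, 0],
], dtype=float)

def cluster_by_edge_cut(W, threshold):
    """Remove every edge heavier than `threshold`, then return the
    connected components of what remains (the clusters)."""
    n = len(W)
    keep = (W > 0) & (W <= threshold)   # edges that survive the cut
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:                     # breadth-first component sweep
            u = queue.popleft()
            comp.append(u)
            for v in np.flatnonzero(keep[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        clusters.append(sorted(comp))
    return clusters

print(cluster_by_edge_cut(W, threshold=3))   # the weight-9 edge is cut
```

Cutting the single heavy edge splits the network into two clusters, with no cluster count or shape supplied in advance, which is exactly the property the abstract claims for this approach.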


4. Clustering Based Intrusion Detection


The clustering based intrusion detection is performed in four phases: graph creation, cluster creation, cluster analysis and monitoring. The graph creation phase first marks all the nodes and edges. All systems in the considered network form the nodes of the graph, and the connections between these nodes form its edges. Since all systems have two-way connections, the edges of the graph represent two-way paths. The distances between the nodes form the weights of the graph.

Let G = (V, E) be a graph, where V and E are, respectively, its sets of nodes and edges, and let n be the number of nodes of G. Each edge is represented by a pair (i, j), where i and j are nodes from V. Consider A = [a_ij]_{n x n} to be the adjacency matrix of graph G. Each element of the adjacency matrix has a binary value representing the relationship between two nodes: a_ij = 1 if nodes i and j are adjacent, i.e. there is an edge linking node i to node j, and a_ij = 0 otherwise. This paper deals with weighted graphs. Let W = [w_ij]_{n x n} be the weight matrix for the edges of a weighted graph G. The element w_ij of the matrix W is defined as the weight of the edge linking node i to node j; if there is no edge between nodes i and j, then w_ij = 0. The degree of a node i, deg_i, in an unweighted or weighted graph, is the number of its adjacent nodes. It is given by

deg_i = Σ_{j=1}^{n} a_ij
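Concretely, the degree formula is just the row sum of the adjacency matrix; the 4-node matrix below is a made-up example for illustration.

```python
import numpy as np

# Adjacency matrix of a small undirected example graph (hypothetical).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

deg = A.sum(axis=1)   # deg_i = sum over j of a_ij
print(deg)            # node 2 touches three edges, node 3 only one
```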

A measure that evaluates the clustering tendency in graphs is the clustering coefficient. It is based on the analysis of three-node cycles around a node i. For unweighted graphs it is formulated as

c_i = 2 * T_i / (deg_i * (deg_i - 1)),  where  T_i = Σ_{j=1, j≠i}^{n-1} Σ_{k=j+1, k≠i}^{n} a_ij * a_jk * a_ik

Note that T_i corresponds to the number of triangles around node i, and the degree deg_i indicates the total number of neighbors of node i. The denominator, together with the factor of 2, normalizes by the maximum possible number of edges, deg_i (deg_i - 1) / 2, that could exist between the vertices within the neighborhood. This measure evaluates the tendency of the nearest neighbors of node i to be connected to each other.
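The clustering coefficient above can be computed directly from the adjacency matrix. The function and the small example graph below are an illustrative sketch, not the authors' implementation: nodes 0, 1 and 2 form a triangle and node 3 hangs off node 2.

```python
import numpy as np

def clustering_coefficient(A, i):
    """c_i = 2 * (#triangles through i) / (deg_i * (deg_i - 1))."""
    deg = A[i].sum()
    if deg < 2:
        return 0.0          # no pair of neighbors, coefficient undefined -> 0
    n = len(A)
    triangles = sum(
        A[i, j] * A[j, k] * A[i, k]
        for j in range(n) for k in range(j + 1, n)
        if j != i and k != i
    )
    return 2.0 * triangles / (deg * (deg - 1))

# Hypothetical example: a triangle {0, 1, 2} plus a pendant node 3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
print(clustering_coefficient(A, 2))   # 1 triangle, deg 3 -> 2/6
```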

Figure 2: Sub-graph Creation


After constructing the graph, clustering is performed. The clustering process divides the graph into several subgraphs. Clustering [7], [8] is performed by providing a threshold value θ, which is calculated using the formula

θ = min + (max - min) * CP

where min and max represent the minimum and maximum values of the weight matrix, respectively, and CP represents the cluster precision. An edge is cut from the graph if its weight is greater than the threshold θ, which results in the formation of subgraphs.

The cluster analysis phase detects the probable outliers within the subgraphs. The following definitions are used in the outlier detection. For any positive integer k, the k-distance of an object p, denoted k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:

- for at least k objects o' ∈ D \ {p}, it holds that d(p, o') ≤ d(p, o); and
- for at most k-1 objects o' ∈ D \ {p}, it holds that d(p, o') < d(p, o).

Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance:

N_{k-distance}(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }

These objects q are called the k-nearest neighbors of p. Geometrically, p is the centre of a circle of radius k-distance(p); all objects in this circle form the k-distance neighborhood of p, and p' denotes the centre of mass of the objects in this circle. The Local Deviation Rate is then defined as:

LDR_k(p) = dis(p, p') / |N_{k-distance}(p)|

where dis(p, p') is the distance between the object p and the centre of mass p' of its k-distance neighborhood. Given the k-distance neighborhood of p and the LDR, the Local Deviation Coefficient is defined as:

LDC_k(p) = Σ_{o ∈ N_{k-distance}(p)} LDR_k(o) / |N_{k-distance}(p)|

Intuitively, the LDC of p is the normalized sum of the LDRs over the k-distance neighborhood of p. The coefficient reflects the degree of dispersion of an object's neighborhood: a greater LDC value means a higher probability of the object being an outlier, while a low LDC value indicates that the density of the object's neighborhood is high, so the object is unlikely to be an outlier. All probable outliers are shortlisted in this phase.

After the completion of this phase comes the monitoring phase. All shortlisted nodes considered vulnerable to attacks are monitored for attacks or abnormal activities, and the traffic flowing to and from these nodes is observed. If any abnormalities are discovered, a cleanup is performed on the node to remove the vulnerabilities.
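The k-distance, LDR and LDC definitions above can be sketched as follows. The extracted formulas are ambiguous about whether p itself contributes to the LDC average; this sketch assumes it does (otherwise, in a symmetric example, an isolated point can score the same as its tightly packed neighbors). The five 2-D points are a made-up illustration.

```python
import numpy as np

def ldc_scores(X, k):
    """Local Deviation Coefficient sketch following the paper's definitions:
    LDR_k(p) = dist(p, centre of mass of N_k(p)) / |N_k(p)|, and LDC_k(p)
    averages LDR over the neighborhood (here including p itself, an
    assumption made because the extracted formula is ambiguous)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    ldr = np.empty(n)
    nbrs_of = []
    for p in range(n):
        d = D[p].copy()
        d[p] = np.inf
        kdist = np.sort(d)[k - 1]              # k-distance(p)
        nbrs = np.flatnonzero(d <= kdist)      # k-distance neighborhood
        centroid = X[nbrs].mean(axis=0)        # centre of mass p'
        ldr[p] = np.linalg.norm(X[p] - centroid) / len(nbrs)
        nbrs_of.append(nbrs)
    return np.array([
        ldr[np.append(nbrs, p)].mean() for p, nbrs in enumerate(nbrs_of)
    ])

# Four tightly packed points plus one distant point: the distant point
# should receive the highest deviation score.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
scores = ldc_scores(X, k=2)
print(int(np.argmax(scores)))   # -> 4
```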


Figure 3: Intrusion Detection mechanism

5. Result Analysis
The proposed process is evaluated with various datasets containing different numbers of data items, and the obtained values are recorded in a confusion matrix.
Table 1: Confusion Matrix

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN

where TP is True Positive, FP is False Positive, TN is True Negative and FN is False Negative. Two performance measures, sensitivity and specificity, are used to evaluate the results. Sensitivity is the accuracy on the positive instances (equivalent to the True Positive Rate, TPR):

Sensitivity = TP / (TP + FN)

Specificity is the accuracy on the negative instances (the complement of the False Positive Rate, i.e. 1 - FPR):

Specificity = TN / (TN + FP)
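Both measures reduce to simple ratios over the confusion-matrix counts; the counts below are hypothetical, chosen only to exercise the formulas.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = TP/(TP+FN) (the TPR); specificity = TN/(TN+FP) (1 - FPR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts for one evaluation run.
sens, spec = sensitivity_specificity(tp=90, fn=10, fp=5, tn=95)
print(sens, spec)   # 0.9 0.95
```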

Figure 4: A sample confusion matrix set with TPR and FPR

The simulation is conducted with the KDD-Cup 99 dataset. The process was paused at regular intervals to record the values of TP, FP, TN and FN, which form the basis for calculating the TPR and FPR. These readings are tabulated and the ROC curve [9] is plotted (Figure 5).

From Figure 5, we can see that during the initial stages, when the number of entries is minimal, the plotted points fall near (0, 0) and (0, 1). As the number of entries increases, the plotted points cluster towards the northwest corner, above the diagonal. This indicates that the process provides a high level of accuracy, approaching the ideal point (0, 1).

Figure 5: ROC Plot (TPR vs. FPR)

Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance, so we can use these measures to assess the relevance of the readings.

Figure 6: PR Curve (Precision vs. Recall)

Usually, precision and recall [9] scores are not discussed in isolation. Instead, either the values of one measure are compared at a fixed level of the other, or both are combined into a single measure, such as their harmonic mean, the F-measure:

F = 2 * (precision * recall) / (precision + recall)

This is also known as the F1 measure, because recall and precision are evenly weighted. It is a special case of the general F_β measure (for non-negative real values of β):

F_β = (1 + β^2) * (precision * recall) / (β^2 * precision + recall)

Two other commonly used F measures are the F2 measure, which weights recall higher than precision, and the F0.5 measure, which puts more emphasis on precision than recall.

The F-measure was derived by van Rijsbergen (1979) so that F_β "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure:

E = 1 - 1 / (α / P + (1 - α) / R)

Their relationship is F_β = 1 - E, where α = 1 / (1 + β^2).
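The whole F-measure family above collapses into one function of β; the precision and recall values below are hypothetical, chosen to show how the weighting shifts.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-measure: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.6                      # hypothetical precision / recall
print(round(f_beta(p, r), 4))        # F1   (harmonic mean)
print(round(f_beta(p, r, 2), 4))     # F2   (weights recall higher)
print(round(f_beta(p, r, 0.5), 4))   # F0.5 (weights precision higher)
```

Since precision exceeds recall in this example, F0.5 > F1 > F2, matching the stated weighting behavior.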

Figure 7: Precision, Recall and F-Measure Sample values

6. Conclusion and Discussions


Discovering attacks in a network plays an important role in network management. Attacks take place by exploiting the vulnerabilities of a network node, so faster detection of these vulnerabilities enables better network maintenance. Analysis shows that our proposed system provides faster and more accurate detection than the existing methodologies [1][2][3][4][5].


Figure 8: Total number of nodes present vs. number of nodes detected as vulnerable


Figure 8 shows the detection rate of our algorithm: 15% of the total nodes show abnormalities.

Figure 9: Number of nodes detected as vulnerable vs. actual number of nodes attacked

Figure 9 shows the number of nodes flagged as vulnerable for monitoring versus the actual number of nodes attacked. We can see that our algorithm has managed to detect most of the vulnerable nodes. Our system shows a detection percentage of 84.91729%, an F-Measure of 0.84833 and an average accuracy rate of 0.91375. Further, the proposed structure reduces the number of nodes that must be monitored, and hence the amount of processing required. Since neither the number nor the shape of the clusters needs to be defined in advance, any type of network can be used for the clustering process. The current process can be further fine-tuned by incorporating artificial intelligence into the system, which could help create an evolutionary system that learns new types of attacks and evolves over time.

7. References
[1] Harold S. Javitz and Alfonso Valdes, "The NIDES Statistical Component Description and Justification," Annual Report, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, March 1994.
[2] Phillip A. Porras and Peter G. Neumann, "EMERALD: Event Monitoring Enabling Responses to Anomalous Live Disturbances," 20th NISSC, October 1997.
[3] H. S. Javitz and A. Valdes, "The SRI IDES Statistical Anomaly Detector," IEEE Symposium on Research in Security and Privacy, May 1991.
[4] L. Portnoy, E. Eskin and S. Stolfo, "Intrusion Detection with Unlabeled Data Using Clustering," ACM CSS Workshop on Data Mining Applied to Security, pp. 5-8, ACM Press, Philadelphia, 2001.
[5] W. Qiang and M. Vasileios, "A Clustering Algorithm for Intrusion Detection," SPIE Conference on Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, Florida, vol. 5812, pp. 31-38, 2005.
[6] Joshua Oldmeadow, Siddarth Ravinutala and Christopher Leckie, "Adaptive Clustering for Network Intrusion Detection," PAKDD 2004, LNAI 3056, pp. 255-259, Springer-Verlag, Berlin Heidelberg, 2004.
[7] Xiong Jiajun, Li Qinghua and Tu Jing, "A Heuristic Clustering Algorithm for Intrusion Detection Based on Information Entropy," Wuhan University Journal of Natural Sciences, vol. 11, no. 2, pp. 355-359, 2006.
[8] Maria C. V. Nascimento and Andre C. P. L. F. Carvalho, "A Graph Clustering Algorithm Based on a Clustering Coefficient for Weighted Graphs," Journal of the Brazilian Computer Society, 17: 19-29, DOI 10.1007/s13173-010-0027, 2011.
[9] Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves," Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.
[10] Sang-Hyun Oh and Won-Suk Lee, "Anomaly Intrusion Detection Based on Dynamic Cluster Updating," PAKDD 2007, LNAI 4426, pp. 737-744, Springer-Verlag, Berlin Heidelberg, 2007.
