
Single Pass Anomaly Detection

M.Tech Project First Stage Report submitted in partial fulfillment of the requirements for the degree of Master of Technology

by Deepak Garg Roll No: 05329015

under the guidance of Prof. Om P. Damani

Kanwal Rekhi School of Information Technology Indian Institute of Technology, Bombay Mumbai

Acknowledgments
I am extremely thankful to Prof. Om P. Damani for his guidance. His constant encouragement has helped me a lot. I would also like to thank Nakul Aggarwal for his constant help in modifying the ADWICE source code.

Deepak Garg I. I. T. Bombay July 17th, 2006

Abstract

Anomaly detection in networks is the detection of deviations from what is considered normal behavior, whereas misuse detection matches traffic against descriptions of known attacks. Anomaly detection takes a learning approach to detecting failures and intrusions in a network, and is intended to capture novel attacks. ADWICE [1, 2] is an efficient anomaly detection algorithm, but since it uses a distance-based clustering mechanism it suffers from inefficient clustering. We propose some additional density-based statistical variables, and also propose changing the cluster shape to a box pattern, so as to improve efficiency.

Keywords: Clustering; K-means; ADWICE (Anomaly Detection With fast Incremental Clustering); BIRCH (Balanced Iterative Reducing and Clustering); DBSCAN (Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise); PHAD (Packet Header Anomaly Detection); IDS (Intrusion Detection System).

1 Introduction

A network anomaly is network behavior that deviates from normal network behavior. Anomalies occur due to causes such as system misconfiguration, implementation bugs, denial-of-service attacks, network overload, and file server failures. The main detection scheme of most commercial intrusion detection systems is misuse detection, where known bad behaviors (attacks) are encoded into signatures. In anomaly detection, by contrast, the normal behavior of users or of the protected system is modeled, often using machine learning or data mining techniques rather than given signatures. During detection, new data is matched against the normality model, and deviations are marked as anomalies. Since no knowledge of attacks is needed to train the normality model, anomaly detection may detect previously unknown attacks, which misuse detection systems cannot; the detection rate depends on the power of the anomaly detection algorithm.

The normality model for anomaly detection may be trained by a variety of techniques, such as clustering-based and statistics-based methods, and many approaches have been evaluated. I focus on clustering. Clustering is the method of grouping objects into meaningful subclasses so that members of the same cluster are quite similar, while members of different clusters are quite different from each other. Clustering methods can therefore be useful for classifying log data using distance and density functions and for detecting intrusions.

This report is organized as follows: Section 2 surveys different papers and reports, Section 3 presents the results on ADWICE-BOX with the grid index, Section 4 outlines future work, and the last section lists references.

2 Literature Survey

This section covers several learning algorithms and work related to the problem discussed above.

2.1 PHAD - Packet Header Anomaly Detection

The packet header anomaly detector (PHAD) [3] is a model trained on attack-free traffic to learn the normal range of values in each packet header field. To simplify implementation, all fields are required to be 1, 2, 3, or 4 bytes: a field larger than 4 bytes (for example, the 6-byte Ethernet address) is split into smaller fields (two 3-byte fields), and smaller fields (such as the 1-bit TCP flags) are grouped into 1-byte fields.

During training, we would ideally record every value of each field that occurs at least once. However, for 4-byte fields, which can take up to 2^32 values, this is impractical for two reasons: it requires excessive memory, and there is normally not enough training data to observe all possible values, resulting in a model that overfits the data. To solve these problems, we record a reduced set of values, either by hashing the field value modulo a constant H, or by clustering the values into C contiguous ranges. For each field, we record the number r of anomalies that occur during the training period, where an anomaly is simply any value not previously observed. For hashing, the maximum value of r is H. For clustering, an anomaly is any value outside all of the clusters; after observing such a value, a new cluster of size 1 is formed, and if the number of clusters exceeds C, the two closest clusters are combined into a single contiguous range. Also, for each field we record the number n of times the field was observed. For the Ethernet fields this is the same as the number of packets; for higher-level protocols (IP, TCP, UDP, ICMP), n is the number of packets of that type. Thus p = r/n is the estimated probability that a given field observation will be anomalous, at least during the training period.

During testing, we fix the model (n, r, and the list of observed values). When an anomaly occurs, we assign a field score of t/p, where p = r/n is the estimated probability of observing an anomaly and t is the time since the previous anomaly in the same field (either in training or earlier in testing). The idea is that events that occur rarely (large t and small p) should receive higher anomaly scores. Finally, we sum the scores of the anomalous fields (if there is more than one) to assign an anomaly score to the packet:

    Packet score = sum_{i in anomalous fields} t_i / p_i = sum_i t_i n_i / r_i    (1)

PHAD uses a hash function to reduce the model size, to conserve memory and, more importantly, to avoid overfitting the training data. Good performance is obtained because the fields that are important for intrusion detection have a small r, so hash collisions are rare for those fields. However, hashing is a poor way to generalize continuous values such as TTL or IP packet length when the training data is incomplete. A better representation is a set of clusters, or contiguous ranges. For instance, instead of listing all possible hashes of the IP packet length (0, 1, 2, ..., 999 for r = 1000), we list a set of ranges, such as 28-28, 60-1500, 65532-65535. Then if the number of clusters exceeds C, the two closest clusters are merged, where the distance between two clusters is the smallest difference between two of their elements.

Negative points of this algorithm:
- The choice of the C or H value: there is a trade-off between over-generalizing (small C or H) and overfitting the training data (large C or H).
- The weights of all fields are equal. The performance of this algorithm can be improved by assigning a weight to each field; for example, the Ethernet address field should get a higher weight than the port address field.
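To make the mechanism concrete, the following is a minimal sketch of the per-field training and scoring just described, using the clustering variant with C contiguous ranges. The class and method names (PhadField, train, score) are our own illustration, not the original PHAD implementation.

```python
class PhadField:
    """One packet-header field, modeled as at most C contiguous value ranges."""

    def __init__(self, C=32):
        self.C = C
        self.ranges = []       # sorted list of [lo, hi] value ranges
        self.n = 0             # number of observations during training
        self.r = 0             # number of anomalies (novel values) in training
        self.last_anomaly = 0  # time of the previous anomaly in this field

    def _covered(self, v):
        return any(lo <= v <= hi for lo, hi in self.ranges)

    def train(self, v, t_now):
        self.n += 1
        if not self._covered(v):
            self.r += 1
            self.last_anomaly = t_now
            self.ranges.append([v, v])      # new cluster of size 1
            self.ranges.sort()
            if len(self.ranges) > self.C:   # merge the two closest ranges
                gaps = [self.ranges[i + 1][0] - self.ranges[i][1]
                        for i in range(len(self.ranges) - 1)]
                i = gaps.index(min(gaps))
                self.ranges[i][1] = self.ranges[i + 1][1]
                del self.ranges[i + 1]

    def score(self, v, t_now):
        """Field score t/p = t*n/r if v is anomalous, else 0 (model is fixed)."""
        if self._covered(v) or self.r == 0:
            return 0.0
        t = t_now - self.last_anomaly       # time since the previous anomaly
        self.last_anomaly = t_now
        return t * self.n / self.r
```

A packet's anomaly score is then the sum of score(v, t) over its anomalous fields, matching equation (1).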

2.2 ADWICE - Anomaly Detection With fast Incremental Clustering

ADWICE [2] is a distance-based algorithm and requires numeric data; data is therefore assumed to be transformed into numeric format by pre-processing. The algorithm requires the following input parameters:
- M: the maximum total number of clusters.
- LS: the leaf size.
- T: the initial threshold.
- The threshold step used when updating T.

An ADWICE model consists of a number of clusters, a number of parameters, and a tree index in which the leaves contain the clusters. The idea of ADWICE is to store only condensed information (a cluster feature) instead of all the data points of a cluster. A cluster feature is a triple CF = (n, S, SS), where n is the number of data points in the cluster, S is the linear sum of the n data points, and SS is the square sum of all data points. From now on we represent clusters by their cluster features (CFs). The distance between a data point and a cluster is the Euclidean distance between the data point and the centroid of the cluster, while the distance between two clusters is the Euclidean distance between their centroids; two clusters are merged by adding their CFs. Each cluster of a leaf node must satisfy a threshold requirement (TR) with respect to the threshold value T to be allowed to absorb a new data point [2].

ADWICE originally used an index: a tree structure in which each non-leaf node contains one CF per child, summarizing all clusters contained in the child below. Unfortunately, this original index results in a suboptimal search in which the closest cluster is not always found. Although this does not decrease processing performance, accuracy suffers: if a cluster included in the normality model is not found and the test data is normal, the index error results in an erroneous false positive and degrades detection quality. Because of this unwanted property, a new grid-based index was developed that preserves the adaptability and good performance of ADWICE.

During learning/adaptation of the normality model there are three cases in which the nodes of the grid tree need to be updated [2]:
- If no cluster is close enough to absorb the data point v, v is inserted into the model as a new cluster. If there is no leaf subspace into which the new cluster fits, a new leaf is created. However, no additional updates of the tree are needed, since nodes higher up do not contain any summary of the data below.
- When the closest cluster absorbs v, its centroid is updated accordingly. This may cause the cluster to move in space, potentially outside its current subspace. In that case, the cluster is removed from its current leaf and inserted anew into the tree from the root, since the path all the way up to the root may have changed. If the cluster was the only one in the original leaf, the leaf itself is removed, to keep unused subspaces without leaf representations.
- If a cluster is removed/forgotten, the index is changed only if its leaf becomes empty, in which case the leaf of the removed cluster is also removed.

Positive points of this algorithm:
- The ADWICE algorithm is adaptive.
- ADWICE requires very little memory compared to other algorithms.

Negative points of this algorithm:
- The algorithm is order dependent, so the clustering is not unique; i.e., the clustering result depends on the order of the data points.
- It uses distance-based measures for all calculations, which are known to be less accurate when clusters with different densities and sizes exist.
- Some data points may be classified into wrong clusters because of the limitations of distance-based measurements.
- It takes input parameters.
- The clusters formed are spherical, which may lead to a large number of false positives.
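The cluster-feature bookkeeping underlying ADWICE can be sketched as follows, assuming numeric feature vectors and Euclidean distances as described above; the names are illustrative and not taken from the ADWICE source code [8].

```python
import numpy as np

class CF:
    """Cluster feature (n, S, SS): point count, linear sum, square sum."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1
        self.S = p.copy()        # linear sum of the points
        self.SS = float(p @ p)   # sum of squared norms of the points

    def centroid(self):
        return self.S / self.n

    def radius(self):
        # root mean squared distance of the points from the centroid
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.n - float(c @ c), 0.0)))

    def distance_to(self, point):
        # Euclidean distance from a data point to the cluster centroid
        return float(np.linalg.norm(np.asarray(point, dtype=float) - self.centroid()))

    def absorb(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.S += p
        self.SS += float(p @ p)

    def merge(self, other):
        # two clusters merge by adding their CFs component-wise
        self.n += other.n
        self.S += other.S
        self.SS += other.SS
```

The threshold requirement is then a check such as radius() <= T (or an equivalent closed-form test) before a point is finally absorbed into a cluster.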

2.3 Y-Means

K-means [4] is a typical clustering algorithm. It partitions a set of data into k clusters through the following steps [4]:
Step 1 (Initialization): Randomly choose k instances from the data set and make them the initial cluster centers of the clustering space.
Step 2 (Assignment): Assign each instance to its closest center.
Step 3 (Updating): Replace each center with the mean of its members.
Step 4 (Iteration): Repeat Steps 2 and 3 until there is no more updating.

K-means has two shortcomings when clustering large data sets: number-of-clusters dependency, and degeneracy.

Number-of-clusters dependency means that the value of k is critical to the clustering result. Degeneracy means that the clustering may end with some empty clusters. Y-means [4] is a clustering algorithm for intrusion detection. It is expected to automatically partition a data set into a reasonable number of clusters so as to classify the instances into normal clusters and abnormal clusters, and it overcomes the shortcomings of the K-means algorithm. The Y-means algorithm is similar to K-means; we describe it with the help of figure 1 [4].

Figure 1: Y-means (figure taken from the Y-means paper [4])

First, Y-means partitions the normalized data into k clusters, where k may take any value between 1 and n, the total number of instances. The next step is to find whether there are any empty clusters. If there are, new clusters are created to replace these empty clusters, and instances are re-assigned to the existing centers; this iteration continues until there is no empty cluster. Subsequently, the outliers of clusters are removed to form new clusters, in which instances are more similar to each other, and overlapping adjacent clusters are merged into a new cluster. In this way, the value of k is determined automatically by splitting or merging clusters. The last step is to label the clusters according to their populations: if the population ratio of a cluster is above a given threshold, all the instances in the cluster are classified as normal; otherwise, they are labeled abnormal.

Positive points of this algorithm:
- The Y-means algorithm is independent of a fixed number of clusters.
- It removes empty clusters from the model and rebuilds them.

Negative points of this algorithm:
- It uses distance-based measures for all calculations, which are known to be less accurate when clusters with different densities and sizes exist.
- Handling degeneracy requires much time.
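For reference, here is a compact sketch of the K-means core (steps 1-4 above) that Y-means builds on; Y-means then adds the empty-cluster replacement, outlier splitting, and cluster merging around this loop. This is an illustrative sketch, not code from [4].

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """X: (n, d) array of normalized instances. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # step 1
    for _ in range(max_iter):
        # step 2: assign each instance to its closest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: replace each center with the mean of its members
        new_centers = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j)
             else centers[j]        # empty cluster: the degeneracy Y-means repairs
             for j in range(k)])
        if np.allclose(new_centers, centers):                    # step 4: stable
            break
        centers = new_centers
    return centers, labels
```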

2.4 BIRCH - Balanced Iterative Reducing and Clustering

BIRCH [5] takes O(N) time to form the clusters from the input data set. It is divided into several phases and takes the following parameters:
- T: the threshold on the size of a cluster.
- B: the branching factor of the tree.
- P: the memory available to the process.
- L: the maximum number of clusters at each leaf node.

BIRCH maintains a height-balanced tree structure in which each node has at most B children; all clusters sit at the leaf nodes of the tree. Initially the tree is empty and, say, T = 0. As new data points keep coming, the algorithm traverses the tree to find the appropriate leaf node for each point and then looks for the best match among the clusters in that leaf. If the point fits into any of the clusters, it is inserted there; otherwise a new cluster is formed. Whether a data point fits is defined by a distance-based measure (Manhattan, Euclidean, etc.), and the cluster statistics are updated after insertion. If the formation of a new cluster increases the leaf's cluster count beyond L, the leaf is split into two leaves with a parent above them, and the clusters are assigned to the appropriate leaf nodes. Also, if at some point the memory cap P is reached, T is increased so that cluster sizes grow and more points can fit into each cluster, thereby reducing the cluster count and freeing up some memory.

Positive points of this algorithm:
- The time complexity is O(N).
- It is memory efficient.
- Classification of a new data point is easy.

Negative points of this algorithm:
- The algorithm is order dependent, so the clustering is not unique; i.e., the clustering result depends on the order of the data points.
- It uses distance-based measures for all calculations, which are known to be less accurate when clusters with different densities and sizes exist.
- Some data points may be classified into wrong clusters because of the limitations of distance-based measurements.
- It takes input parameters.
- The clusters formed are spherical, which may lead to a large number of false positives. We describe this problem further in the next subsection.
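The leaf-level insertion step described above can be sketched as follows, reusing the CF class from the sketch in section 2.2. The tree descent, the node splitting above the leaf, and the threshold increase on reaching the memory cap P are elided to keep the example short; all names are our own.

```python
def insert_into_leaf(leaf_clusters, point, T, L):
    """Insert `point` into a leaf's cluster list; return True if the leaf
    now holds more than L clusters and must be split into two leaves."""
    if leaf_clusters:
        closest = min(leaf_clusters, key=lambda cf: cf.distance_to(point))
        if closest.distance_to(point) <= T:   # fits: absorb and update statistics
            closest.absorb(point)
            return False
    leaf_clusters.append(CF(point))           # no fit: start a new cluster
    return len(leaf_clusters) > L
```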

2.5 DBSCAN - Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

DBSCAN [6] is an O(N log N)-time clustering algorithm. It is based on density as well as distance, instead of distance alone: it iterates once over all the data points, and the additional log(N) cost in each step makes it an O(N log N) algorithm. It takes two input parameters:
- Eps: the distance within which one looks for a point's neighbors.
- Minpts: the number of points that must lie within the Eps-neighborhood of a point for it to be a core point.

For each point, the algorithm first finds the Eps-neighborhood of that point (this step takes O(log N) time using efficient R* trees); if more than Minpts data points lie within this region, the point starts a cluster of its own and is assigned a new cluster ID. One might think that every point with more than Minpts data points in its neighborhood would form a separate cluster, but this is not the case: the paper also defines a merging mechanism, via density-reachability and density-connectedness, for clusters that should form a single cluster. In the step where the neighborhood is found and a cluster ID is assigned, one more loop runs over each point in the neighborhood of this point, checking whether those points also form clusters of their own; if so, the clusters are merged based on the definition of reachability. This step then recurses (each of the merged cluster's points checks its own neighborhood and hence its own cluster possibilities), and using the definition of connectedness, clusters keep being merged until no more clusters can be merged.

Positive points of this algorithm:
- It is a density-based clustering algorithm, which is more accurate than distance-based algorithms.
- The algorithm yields unique clustering results.
- It is able to detect clusters of any size and shape.

Negative points of this algorithm:
- It also takes input parameters.
- It is not capable of differentiating clusters with different sizes and different densities, since Eps is predefined and fixed all the time.
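A direct sketch of this expansion process is given below, with a naive linear-scan region query in place of the R*-tree index (the index is what brings each query down to O(log N) and the whole algorithm to O(N log N)); the function and variable names are our own.

```python
import numpy as np

def dbscan(X, eps, minpts):
    """X: (n, d) array. Returns labels: -1 for noise, 0..k-1 for cluster ids."""
    n = len(X)
    labels = np.full(n, -2)          # -2 = unvisited, -1 = noise
    cid = 0

    def neighbors(i):                # naive region query; R* trees make it O(log N)
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    for i in range(n):
        if labels[i] != -2:
            continue
        nb = neighbors(i)
        if len(nb) < minpts:
            labels[i] = -1           # not a core point (may become border later)
            continue
        labels[i] = cid
        seeds = list(nb)
        while seeds:                 # grow the cluster via density-reachability
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid      # former noise point becomes a border point
            if labels[j] != -2:
                continue
            labels[j] = cid
            nb_j = neighbors(j)
            if len(nb_j) >= minpts:  # j is core: its neighborhood joins the seeds
                seeds.extend(nb_j)
        cid += 1
    return labels
```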

3 Experiment On ADWICE

ADWICE [2] uses BIRCH as the clustering algorithm for learning the normal data and then classifying new data as anomalous or normal, and BIRCH suffers from a number of shortcomings. Here we try to reduce the number of false positives by modifying the threshold calculation, the cluster bounds, and the cluster structure, through changes to the cluster feature (CF).

The original BIRCH algorithm uses a constant threshold T, the same for every cluster, which is increased whenever the number of clusters reaches the maximum number of clusters, at which point the nearest clusters are merged according to T. Fixing the same threshold for all clusters is unfair to many of them. For example, consider a cluster with all its points near its center and with threshold T: this cluster can include some bad points that lie near the boundary. Hence, fixing the same threshold for all clusters is not appropriate; rather, the threshold should depend on cluster properties such as the distribution of the points and the density of the cluster.

BIRCH uses distance-based measures for clustering, under which all clusters have the same threshold size T: for a new point to be included in a cluster, its distance from the center of the cluster must be less than T. We therefore define the inclusion region as the spherical region of radius T around the center of the cluster. Currently the inclusion region is independent of the current density of the cluster and is the same for all clusters. But if a cluster is dense, its inclusion region should be small and should depend on the current radius of the cluster rather than on some predefined fixed threshold, while for a sparse cluster the inclusion region should be relatively large. So the inclusion of a new point in a cluster should depend on the density of the cluster, i.e., on the number of points in the cluster and its current radius. Mathematically, the measurements are based on two additional variables, t' and R', both explained below [7]. R' (an additional statistical variable that must be stored with each cluster feature set) is different for each cluster and depends on its current number of points and its current radius R(CF_i):

    R'(CF_i) = R(CF_i) (1 + c / fn(n, d))    (2)

where d is the dimension of the data points, n is the number of points inside the cluster, fn(n, d) is some function of n and d, and c is some constant. That is, R' is the current radius plus the current radius multiplied by some constant and divided by some function of n. The function fn can be log_d(n) or simply log(n). The threshold requirement should then be

    R(CF_i) <= R'(CF_i)    (3)

But using the above expression alone as the measure, clustering suffers when a cluster contains only one or very few points; hence we define t' as a threshold for handling these base cases (it can be kept fairly small). So the threshold requirement becomes

    R(CF_i) <= max(R'(CF_i), t')    (4)

Also, for large sparse clusters we want an upper bound on the radius of the cluster, so as to prevent explosion of individual clusters. So the threshold requirement in ADWICE-TRAD [7] is

    R(CF_i) <= min(max(R'(CF_i), t'), T)    (5)
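One possible reading of requirement (5) in code, assuming fn(n, d) = log(n), the CF class sketched in section 2.2, and a trial absorption to obtain the grown radius; the constants c, t' (t_small) and T are tuning parameters as in [7], and the exact evaluation order in ADWICE-TRAD may differ.

```python
import copy
import math

def can_absorb(cf, point, c=0.5, t_small=0.05, T=2.0):
    """Sketch of threshold requirement (5): absorb `point` only if the grown
    radius stays within the density-aware bound of the current cluster."""
    fn = max(math.log(cf.n + 1), 1.0)         # fn(n, d); log_d(n) is another choice
    r_prime = cf.radius() * (1 + c / fn)      # equation (2): density-aware bound
    trial = copy.deepcopy(cf)                 # grow a trial copy of the cluster
    trial.absorb(point)
    # equation (5): base case t' for tiny clusters, global cap T for sparse ones
    return trial.radius() <= min(max(r_prime, t_small), T)
```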

Figure 2: Actual vs Detected Anomaly

    FP Rate = FP / (FP + TN)        Detection Rate = TP / (TP + FN)

We have made another change in ADWICE-GRID. The ADWICE-GRID algorithm forms cluster patterns shaped like a circle in two dimensions, a sphere in three dimensions, and so on. We have changed these to patterns shaped like a rectangle in two dimensions, a box in three dimensions, and so on. When cluster patterns are circles or spheres, they tend to cover some anomalous region; we can remove part of this region by forming the cluster as a box. The following example is in two dimensions. In this example (figure 3), all the points are near each other except a few. When we form a cluster, some anomalous region is also covered by it, because the center of the cluster lies close to the points that are near each other while a few points lie far from this center, and the cluster includes all points on the basis of a single radius. So some anomalous region is also covered by this cluster. We can remove part of this anomalous region by forming the cluster as a box; hence the new name of this algorithm is ADWICE-BOX.

We also need to modify the cluster feature (CF). Previously, a CF stored three values: the number of points, the linear sum of the points, and the square sum of all the points. We modified the CF to additionally store the min and max values for each dimension; when a new point arrives, the min and max values in each dimension are compared and updated, and likewise when two clusters are merged. Briefly, the steps of the new ADWICE-BOX algorithm are:
- Calculate the new center of the cluster: Center of Cluster = (min + max) / 2.
- Calculate a radius vector instead of a single radius: Radius of Cluster = (max - min) / 2.
- When a new point arrives or two clusters merge, find the closest cluster on the basis of the radius vector instead of the single radius of the cluster.

Figure 3: Box Cluster
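A sketch of the modified cluster feature for ADWICE-BOX follows, keeping per-dimension min and max alongside (n, S, SS) and deriving the center and radius vector from the two formulas above. The contains test is our illustration of box-based membership; the names are not taken from our modified source (see appendix A for the actual changes).

```python
import numpy as np

class BoxCF:
    """Cluster feature extended with per-dimension bounds (axis-aligned box)."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.S, self.SS = 1, p.copy(), float(p @ p)
        self.min, self.max = p.copy(), p.copy()

    def center(self):
        return (self.min + self.max) / 2       # Center of Cluster = (min + max)/2

    def radius_vector(self):
        return (self.max - self.min) / 2       # Radius of Cluster = (max - min)/2

    def absorb(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.S += p
        self.SS += float(p @ p)
        self.min = np.minimum(self.min, p)     # update bounds dimension-wise
        self.max = np.maximum(self.max, p)

    def merge(self, other):
        self.n += other.n
        self.S += other.S
        self.SS += other.SS
        self.min = np.minimum(self.min, other.min)
        self.max = np.maximum(self.max, other.max)

    def contains(self, point, slack=0.0):
        """Membership test against the box instead of a single radius."""
        d = np.abs(np.asarray(point, dtype=float) - self.center())
        return bool(np.all(d <= self.radius_vector() + slack))
```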

3.1 Results

In this experiment we evaluate the detection quality of ADWICE-BOX with the grid index against the ADWICE-grid index [2], using KDDCUP99. Our algorithm needs only a normal data set for training. The KDD training data set of 972,781 session records was used to train our model. This model was evaluated on the KDD testing data set, consisting of 311,029 session records of normal data and many instances of different attack types. We performed this experiment with different maximum numbers of clusters: 5K, 8K, 12K, 15K, 18K, 20K and 30K. Our results are better than those of ADWICE with grid, but only by a small margin. Figure 4 compares the two algorithms with a maximum number of clusters of 12K. Refer to appendix A for the modified methods of the ADWICE code.

4 Future Work

We have modified ADWICE with grid into ADWICE-BOX with grid, and we will concentrate on the following points:
- Proper parameter settings (maximum number of clusters, threshold, etc.) are important to the efficiency of ADWICE-BOX. We will concentrate on more reasonable ways of increasing the threshold dynamically, and make the corresponding changes in the algorithm so that it becomes independent of the maximum number of clusters.

- We will also concentrate on learning from the input data set, so that the model does not depend on the order of the input data.
- We will reduce the running time complexity and the false positive rate as much as possible.

Figure 4: Results of ADWICE-BOX

References
[1] Kalle Burbeck and Simin Nadjm-Tehrani. ADWICE: Anomaly detection with real-time incremental clustering. In Choonsik Park and Seongtaek Chee, editors, Lecture Notes in Computer Science, pages 407-424. Springer Berlin / Heidelberg, January 2005.

[2] Kalle Burbeck and Simin Nadjm-Tehrani. Adaptive real-time anomaly detection with improved index and ability to forget. In ICDCSW '05: Proceedings of the Second International Workshop on Security in Distributed Computing Systems (SDCS), pages 195-202, Washington, DC, USA, 2005. IEEE Computer Society.

[3] Matthew V. Mahoney. Network traffic anomaly detection based on packet bytes. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 346-350, New York, NY, USA, 2003. ACM Press.

[4] Yu Guan, Nabil Belacel, and Ali A. Ghorbani. Y-means: a clustering method for intrusion detection. In Canadian Conference on Electrical and Computer Engineering (CCECE 2003), Canada, May 2003.

[5] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, New York, NY, USA, 1996. ACM Press.

[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226-231, Portland, Oregon, 1996. AAAI Press.

[7] Nakul Aggrawal. Improving the efficiency of network intrusion detection systems. Technical report, India, June 2006. BTech dissertation.

[8] Simin Nadjm-Tehrani. Source code of ADWICE. http://www.ida.liu.se/~snt/research/adwice/, 2005.


[4] Yu Guan, Nabil Belacel, and Ali A. Ghorbani. Y-means: a clustering method for intrusion detection. In ICDCSW 05: Proceedings of the Second International Workshop on Security in Distributed Computing Systems (SDCS) (ICDCSW05), Canada, may 2003. Canadian Conference on Electrical and Computer Engineering (CCECE-2003). [5] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: an ecient data clustering method for very large databases. In SIGMOD 96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 103114, New York, NY, USA, 1996. ACM Press. [6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226231, Portland, Oregon, 1996. AAAI Press. [7] Nakul Aggrawal. Improving the eciency of network intrusion detection systems. Technical report, India, jun 2006. BTech dissertation. [8] Simin Nadjm-Tehrani. Source code of adwice. http://www.ida.liu.se/~snt/research/adwice/, 2005.

