
Robust Data Model for Enhanced Anomaly Detection

R.Ravinder Reddy*, Dr.Y Ramadevi1, Dr.K.V.N Sunitha2


*, 1 Department of Computer Science and Engineering, Chaitanya Bharathi Institute of
Technology, Hyderabad 500075, Telangana State, India.
*, 1 { ravindra_rkk, yrd }@cbit.ac.in
2 B.V. Raju Institute of Technology for Women, Bachupally, Hyderabad 500090, Telangana
State, India. Email: k.v.n.sunitha@gmail.com

Abstract. As the volume of network usage increases, the proportion of threats
inexorably increases with it. Various approaches to anomaly detection are
currently in use, each with its own merits and demerits. Anomaly detection is
the process of analyzing user data and labeling each record as either normal or
anomalous; in practice, most records are normal. When machine learning
algorithms analyze such imbalanced datasets, their performance degrades sharply
and they cannot predict the class label accurately. In this paper we propose a
hybrid approach to address these problems by combining class balancing with
rough set theory (RST). This approach enhances the anomaly detection rate, and
empirical results show considerable performance improvements.
Keywords: Anomaly detection, imbalanced data, rough sets, classification, intrusion
detection.

1 Introduction

Recent research has focused mostly on anomaly-based network intrusion
detection, because it can detect unknown attacks as well as known ones. The
intrusion detection problem is most often treated as a classification problem
[1, 2], and classification-based anomaly detection approaches are popular for
detecting network anomalies in machine learning and data mining. Classification
is a supervised machine learning technique, and supervised anomaly detection
faces a major issue: anomalous instances are rare compared to normal instances
in the training data. This issue arises from imbalanced class distributions
[3], and the improper distribution of the training data often makes the
learning task more challenging. To address these challenges we propose a new
robust data model.
Robust data selection is a demanding step in anomaly detection analysis,
because anomaly-based intrusion detection systems have become increasingly
dependent on learning methods, especially classification schemes.
Classification assumes that the records are independently and identically
distributed, and a balanced probability distribution among the classes is
important. To make classification more accurate and effective, more robust
approaches are required. The selection and type of input data strongly affect
the anomaly detection rate: the performance of a classifier depends directly on
its input, and the nature of the input data is a key aspect of any
anomaly-based network intrusion detection technique [4]. Building an effective
anomaly model therefore requires robust data approaches, and in this regard we
combine different data boosting techniques. Data preprocessing has a
considerable impact on the accuracy and capability of anomaly detection.
The accuracy of anomaly detection depends on the quality of the training data,
the number of records used to train the model, and the distribution of those
records. Here we consider unbalanced data for anomaly detection; because
imbalance degrades classifier quality, we increase the number of
minority-class records to balance the dataset. Class balancing is very
important and has wide application, including image analysis and intrusion
detection. It improves the quality of the data given to the classifier and
therefore the accuracy of anomaly detection. However, class balancing also
increases the size of the dataset; to address this issue we use rough set
theory.
Rough set theory [5] is used to reduce the dimensionality of the feature
vector. Feature vector size is itself a problem for classification: many
features play no part in the classification process, and removing them requires
a feature selection technique. Here we use the rough set approach, which in the
feature selection process produces better results than traditional principal
component analysis (PCA): PCA needs considerable space and computational time
to compute the eigenvectors, which becomes difficult as the data size
increases.
The remainder of the paper is organized as follows: Section 2 briefly outlines
the related concepts, Section 3 discusses the details of the implementation,
the results are discussed in Section 4, and Section 5 concludes.

2 Related work

2.1 Intrusion Detection

Confidentiality, integrity and availability (CIA) are the main characteristics
of information security: they ensure that authenticated and authorized entities
are able to reliably access secure information. However, these principles can
be violated by users, sometimes intentionally and sometimes unknowingly, and
prevention tools cannot fully stop such activities. To protect the system we
therefore need another layer of protection, called the intrusion detection
system. An intrusion is an attempt, by outsiders or insiders, to access system
resources in an unauthorized way in order to modify or destroy them. The
intrusion detection system is thus the second wall of protection, whereas a
firewall only filters packets. Based on the behavior of intruders, intrusion
detection divides into two aspects:
1. Misuse detection
2. Anomaly detection

2.2 Imbalanced Data


Traditional learning models assume that the data classes are well distributed
and work on that assumption. Only a few datasets actually show this balance;
after years of research it was found that the class of interest often has very
few records, which affects system performance. Many real-world data domains are
class-imbalanced, with the main class of interest represented by only a few
tuples. This is known as the class imbalance problem [14, 15].
The class imbalance problem can be addressed mainly by these methods:
1. Oversampling
2. Under-sampling
3. Threshold moving
4. Ensemble techniques

Oversampling and under-sampling change the distribution of tuples in the
training set. Oversampling works by re-sampling the positive tuples so that the
resulting training set contains an equal number of positive and negative
tuples.
Anomaly detection is mainly concerned with abnormal user behavior in the
system: when user behavior deviates from the normal, we call it an anomaly. To
analyze this properly we need the user data in a suitable form, yet very few
anomaly records are usually available in the system, so we oversample these
records for the analysis of anomalous behavior. Once the dataset is balanced,
classifier performance improves.
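As a minimal sketch of the oversampling idea described above (the paper gives no code; the record shape and label names here are hypothetical), simple random oversampling duplicates minority-class tuples until both classes are the same size:

```python
import random

def random_oversample(data, minority_label, seed=None):
    """Balance a dataset by duplicating minority-class tuples at random.

    data: list of (record, label) pairs.
    Returns a shuffled list in which both classes have equal counts.
    """
    rng = random.Random(seed)
    minority = [p for p in data if p[1] == minority_label]
    majority = [p for p in data if p[1] != minority_label]
    # Re-sample minority tuples until the classes are the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)  # re-distribute the tuples across the set
    return balanced
```

This duplication-based variant is the simplest form of oversampling; Section 3 replaces it with SMOTE, which interpolates new synthetic samples instead of copying existing ones.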

2.3 Rough Set Theory

Rough set theory is an extension of conventional set theory [5, 6] that
supports approximations in decision making: a vague concept is approximated by
a pair of precise concepts, called its lower and upper approximations, derived
from a classification of the domain of interest into disjoint categories. The
feature selection property of rough set theory helps in finding reducts of the
oversampled dataset; in this way it not only reduces the size of the dataset
but also improves classifier performance.
Feature selection methods are used in the intrusion detection domain to
eliminate unimportant or irrelevant features. Feature selection reduces
computational complexity, removes information redundancy, increases the
accuracy of the detection algorithm, facilitates data understanding and
improves generalization.
The concepts of rough set theory are used to define the necessity of features.
The measures of necessity are calculated from the lower and upper approximation
functions and are employed as heuristics to guide the feature selection
process; these heuristic functions decide which attributes are relevant to the
target concept.
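The lower/upper approximation pair can be sketched in a few lines of Python (an illustrative toy, not the paper's implementation; `ind` is an assumed helper that maps each object to the key of its indiscernibility class):

```python
def rough_approximations(universe, ind, target):
    """Lower and upper approximations of a vague concept (the target set).

    ind maps each object to its indiscernibility (equivalence) class key;
    both approximations are unions of whole equivalence classes.
    """
    classes = {}
    for x in universe:
        classes.setdefault(ind(x), set()).add(x)
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= target:
            lower |= eq_class   # class certainly inside the concept
        if eq_class & target:
            upper |= eq_class   # class possibly inside the concept
    return lower, upper
```

Objects in the lower approximation certainly belong to the concept; the difference between upper and lower is the boundary region where membership is undecidable from the available attributes.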

2.4 Dataset

To evaluate any system we need a benchmark input against which to compare
results. For the evaluation of the intrusion detection system we used the
"HTTP dataset CSIC 2010" [7], which contains thousands of automatically
generated web requests and can be used to test web attack protection systems.
It was developed at the Information Security Institute of CSIC (Spanish
National Research Council).
The main motivation behind it is a current problem in web attack detection:
the lack of publicly available datasets for testing WAFs (Web Application
Firewalls). The DARPA dataset [8, 9] has been widely used for intrusion
detection, but it has been criticized by the IDS community [10]. Regarding web
traffic, the DARPA dataset is out of date and does not include many current
attacks, so it is not appropriate for web attack detection. Data privacy is
also a concern in the generation of publicly available datasets, and is
probably one of the reasons why most available HTTP datasets do not target real
web applications. For these reasons, we decided to use the HTTP dataset CSIC
2010.
The HTTP dataset CSIC 2010 contains traffic targeted at an eCommerce web
application in which users can buy items using a shopping cart and register by
providing some personal information. As it is a web application in Spanish,
the dataset contains some Latin characters. The dataset is generated
automatically and contains 36,000 normal requests and more than 25,000
anomalous requests. The HTTP requests are labeled as normal or anomalous, and
the dataset includes attacks such as SQL injection, buffer overflow,
information gathering, file disclosure, CRLF injection, XSS, server-side
include, parameter tampering and so on.

3 Methodologies
In this method we address two issues in anomaly detection: feature selection
and balancing the dataset, with the main focus on class balancing. Anomaly
datasets are class-imbalanced, and when such data is used to train machine
learning techniques such as classifiers, they do not perform as well as on
normally distributed data. Balancing the dataset [14, 15] using a data mining
technique improves the prediction rate. In this approach we increase the
rare-class data using an oversampling technique; the proposed approach
addresses both issues.

Algorithm: Hybrid data sampling

Input: HTTP CSIC dataset
Output: The anomaly prediction rate
1. Pre-process the dataset into the required format.
2. Apply rough set feature selection.
3. Prepare the new dataset with the obtained feature set.
4. Apply the data sampling approach to balance the class labels.
5. Re-distribute the data tuples.
6. Apply the SVM classifier to the refined dataset.

Balancing the dataset may increase its size, which consumes system resources;
to avoid this problem we apply the rough set approach to reduce the
dimensionality of the dataset. The rough set approach greatly reduces the data
size without affecting classifier accuracy.
In this approach we used the SMOTE algorithm to balance the dataset. SMOTE
(Synthetic Minority Oversampling Technique) is an oversampling technique that
generates synthetic samples of the minority class in order to balance the
dataset. The SMOTE algorithm [11, 12] takes the minority class instances and
oversamples them by generating synthetic examples along the lines joining each
instance to its k minority-class nearest neighbors. The value of k depends upon
the amount of oversampling to be done. The process begins by selecting some
point yi and determining its nearest neighbors yi1 to yik; random numbers r1 to
rk are then generated for the randomized interpolation between the selected
point and its nearest neighbors.

Synthetic samples of the minority class are generated as follows:

1. Take a minority-class feature vector and its nearest neighbor, and compute
the difference between them.
2. Multiply this difference by a random number between 0 and 1, and add it to
the feature vector under consideration.
3. This selects a random point along the line segment between the two specific
samples.
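The three steps above can be sketched as a small self-contained Python function (an illustration under assumed conventions: samples are numeric tuples and distance is Euclidean; this is not the paper's implementation):

```python
import math
import random

def smote(minority, k=3, n_new=10, seed=None):
    """Generate synthetic minority samples: take the difference to one of
    the k nearest neighbours, scale it by a random number in [0, 1), and
    add it back to the original sample."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        yi = rng.choice(minority)
        # k nearest neighbours of yi within the minority class (Euclidean)
        neighbours = sorted((p for p in minority if p is not yi),
                            key=lambda p: math.dist(yi, p))[:k]
        nn = rng.choice(neighbours)
        r = rng.random()
        synthetic.append(tuple(a + r * (b - a) for a, b in zip(yi, nn)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority samples, it always lies on the line segment between them, never outside the minority region's convex hull.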
Once the data sampling is completed we redistribute the records, since the
distribution of samples is also an important issue in the classification
process. To reduce the dimensionality of the data we used the rough set
approach with Johnson's reduct. Johnson's algorithm is a dynamic reduct [13]
computation algorithm. Reduct generation starts with an empty set RS;
iteratively, each conditional attribute in the discernibility matrix is
evaluated with a heuristic measure, the attribute with the highest heuristic
value is added to RS, and the clauses containing it are deleted from the
discernibility matrix. The algorithm ends when all clauses have been removed
from the discernibility matrix. Pseudocode for Johnson's reduct generation is
given below.

Algorithm: JohnsonReduct (Ca, fD)

Input: Ca, the set of conditional attributes;
fD, the discernibility function.
Output: RS, the minimal reduct set
1. RS ← Ø
2. While (the discernibility function fD is not empty)
3.   bestca = 0
4.   For each c ∈ Ca that appears in fD
5.     h = heuristic(c)
6.     If (h > bestca) then
7.       bestca = h; bestAttribute ← c
8.   RS ← RS ∪ {bestAttribute}
9.   fD ← removeClauses(fD, bestAttribute)
10. Return RS
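The pseudocode above maps onto a short Python sketch (an assumption-laden toy: the discernibility function is represented as a list of attribute sets, and `heuristic(c)` is taken to be the attribute's frequency over the remaining clauses, a common choice for Johnson's algorithm):

```python
def johnson_reduct(clauses):
    """Greedy Johnson reduct over a discernibility function.

    clauses: iterable of attribute sets; each set lists the attributes
    that discern one pair of objects.
    """
    remaining = [set(c) for c in clauses if c]
    reduct = set()
    while remaining:
        # heuristic(c): how many remaining clauses attribute c appears in
        counts = {}
        for clause in remaining:
            for attr in clause:
                counts[attr] = counts.get(attr, 0) + 1
        best = max(counts, key=counts.get)      # highest heuristic value
        reduct.add(best)                        # RS <- RS U {bestAttribute}
        remaining = [c for c in remaining if best not in c]  # removeClauses
    return reduct
```

Each greedy pick covers as many clauses as possible, so the loop terminates quickly, but the result is only an approximation of the minimal reduct, as the next paragraph notes.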

The reduct generated by Johnson's algorithm may not be optimal; research is
still ongoing into finding an optimal feature set for a given dataset. Here the
HTTP dataset CSIC 2010 is used; it contains the following conditional features,
with a decision attribute of normal or anomalous.

{index, method, url, protocol, userAgent, pragma, cacheControl, accept,
acceptEncoding, acceptCharset, acceptLanguage, host, connection, contentLength,
contentType, cookie, payload, label}

Applying rough set feature selection, the 17 conditional features are reduced
to the following 8 features:

{cookie, payload, index, url, contentLength, method, host, contentType}

Table 1 compares the computational time with and without the rough set model;
Fig 1 shows the large difference between the two.

Table 1: Time comparison for the approaches

Technique        Time taken to classify    Time taken to classify with reduct
SVM classifier   2593                      1782

Fig 1: Computational time comparison (bar chart of computational time for the
rough set approach vs. no feature selection)

4 Result Analysis
In this model we used the HTTP dataset CSIC 2010, which contains tens of
thousands of records labeled as normal and anomalous. We applied the SMOTE
algorithm to oversample the anomaly records relative to the normal records, and
then randomized the record order to distribute the oversampled records
throughout the dataset. Once we obtained the optimal feature vector we applied
the data to the SVM classifier and calculated the results, using the RBF kernel
for evaluation. It performs well on the balanced data and produces good results
compared with the other balancing methods.
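For reference, the RBF kernel the SVM uses to compare two feature vectors can be computed as below (a minimal sketch; the paper does not report its kernel parameters, so the gamma value here is an assumed placeholder):

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: K(x, y) = exp(-gamma * ||x - y||^2).

    Returns 1.0 for identical vectors and decays toward 0 as the
    Euclidean distance between x and y grows.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Because the kernel decays with distance, the SVM's decision depends mainly on training points near the input, which is why class balance in the neighborhood of anomalies matters for the detection rate.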
Table 2: Comparison of results on the unbalanced, balanced, and balanced +
rough set reduct datasets.

Measure      Unbalanced   Balanced   Balanced + Rough set reduct
Time         2593         2780       1782
Accuracy     99.44        99.85      99.80
Precision    0.994        0.998      0.998
FP Rate      0.005        0.002      0.002
Recall       0.994        0.998      0.998
F-Measure    0.994        0.998      0.998

Fig 2: Performance measures

The empirical results show that using balancing and the rough set reduct
improves classifier accuracy as well as reducing computational time. Table 2
shows that the false positive rate also decreases; the computed precision,
recall and F-measure show that the robust approach performs well compared to
the earlier methods. Fig 2 shows that the hybrid approach outperforms the
classifier trained on the unbalanced dataset.

5 Conclusion and future enhancements


Classifying anomalies accurately with the available dataset may not be possible
in all cases. Here we first reduce the computational time using rough set based
feature selection, and secondly we balance the classes by oversampling the data
so that the anomaly classes are predicted properly. This approach enhances the
performance of the system, and the experimental results show the performance
improvement obtained with the hybrid approach.

In the future, genetic and fuzzy algorithms may be used to obtain better
records using fitness functions and membership values. The feature vector size
may decrease further as optimal feature selection algorithms become available;
considerable research is still under way on finding optimal feature subsets.
With these techniques a better anomaly detection model may be achieved, and
other classification techniques may be applied to the well-distributed and
balanced data.
References

1. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S.,
Zhang, J.: Real time data mining-based intrusion detection. In: DARPA
Information Survivability Conference & Exposition II (DISCEX'01), Vol. 1,
pp. 89-100 (2001).
2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM
Computing Surveys 41(3), 15:1-15:58 (2009).
3. Joshi, M. V., Agarwal, R. C., Kumar, V.: Mining needle in a haystack:
classifying rare classes via two-phase rule induction. In: Proc. of the 7th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 293-298. ACM (2001).
4. Tan, P. N., Steinbach, M., Kumar, V.: Introduction to Data Mining.
Addison-Wesley (2005).
5. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer
Academic Publishers, Dordrecht (1991).
6. Pawlak, Z.: Rough sets and intelligent data analysis. Information Sciences
147, 1-12 (2002).
7. HTTP dataset CSIC 2010, http://iec.csic.es/dataset/
8. Lippmann, R. P., Fried, D. J., Graf, I., Haines, J. W., Kendall, K.,
McClung, D., Webber, D., Webster, S., Wyschograd, D., Cunningham, R.,
Zissman, M.: Evaluating intrusion detection systems: the 1998 DARPA off-line
intrusion detection evaluation. In: Proc. DARPA Information Survivability
Conference and Exposition (DISCEX'00), Hilton Head, South Carolina, January
25-27, pp. 12-26. IEEE Computer Society Press, Los Alamitos, CA (2000).
9. Lippmann, R., Haines, J. W., Fried, D. J., Korba, J., Das, K.: The 1999
DARPA off-line intrusion detection evaluation. In: Proc. Recent Advances in
Intrusion Detection (RAID 2000), Debar, H., Me, L., Wu, S. F. (Eds.),
pp. 162-182. Springer-Verlag, New York, NY (2000).
10. McHugh, J.: Testing intrusion detection systems: a critique of the 1998 and
1999 DARPA intrusion detection system evaluations as performed by Lincoln
Laboratory. ACM Transactions on Information and System Security (TISSEC) 3(4),
pp. 262-294 (2000).
11. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.: SMOTE:
synthetic minority over-sampling technique. Journal of Artificial Intelligence
Research 16, 321-357 (2002).
12. Ramentol, E., Caballero, Y.: SMOTE-RSB, journal article, 23 December 2009.
13. Bazan, J. G., Szczuka, M.: The rough set exploration system. Transactions
on Rough Sets III. Springer (2005).
14. Chawla, N. V.: Data mining for imbalanced datasets: an overview (book
chapter). Springer.
15. Chawla, N. V., Japkowicz, N.: Data mining for imbalanced datasets: an
overview. Special issue on learning from imbalanced datasets, 6(1),
pp. 853-857.
