
International Journal of Computational Intelligence and Information Security, June 2013 Vol. 4, No. 6 ISSN: 1837-7823

Evaluation of Rule based Machine Learning Algorithms on NSL-KDD Dataset

Tejvir Kaur1 and Sanmeet Kaur2

1 M.Tech Student, 2 Asst. Professor

Thapar University, Patiala-147001, India

Intrusion detection can be defined as the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource. There are many approaches to intrusion detection, such as anomaly based, signature based and machine learning based approaches. The machine learning approach can be very useful for developing intrusion detection systems. This paper compares various rule based machine learning algorithms, all of which fall under supervised learning. The NSL-KDD dataset [9] and the Waikato Environment for Knowledge Analysis (WEKA) [3] are used to evaluate the performance of these algorithms.

Keywords: Intrusion detection, NSL-KDD dataset, WEKA

1. Introduction
The purpose of network security is to protect the network from unauthorized access, destruction and disclosure. Many techniques have emerged in the field of network security that help to protect computer systems and computer networks. One of the techniques used for making the network secure and detecting intrusions is the Intrusion Detection System, a mechanism that detects unauthorized and malicious activity in computer systems. There are mainly two approaches to intrusion detection: signature detection and anomaly detection. Machine learning techniques have also been applied to intrusion detection in many ways. This paper presents the evaluation and results of applying rule based machine learning algorithms to the NSL-KDD dataset [12].

2. Review of Literature
An Intrusion Detection System (IDS) helps information systems to deal with attacks. An IDS gathers and analyzes information from various areas within a computer or a network to identify intrusions, which include attacks from outside the organization as well as attacks from within it. There are mainly two approaches to intrusion detection: signature based detection and anomaly based detection.

The signature based approach looks for the signatures of known attacks, which exploit weaknesses in system and application software [5]. It uses pattern matching techniques against a frequently updated database of attack signatures. Many attacks can be detected by this approach because they have clear and distinct signatures, but it detects only already known attacks, not new ones. An intrusion detection system that looks at network traffic and detects data that is incorrect, invalid or generally abnormal is called anomaly based. This method is useful for detecting unwanted traffic that is not specifically known [5]. Various methods, such as machine learning and statistical methods, can be used in the anomaly detection approach to distinguish anomalous behavior from normal behavior.

Machine learning algorithms can be used in intrusion detection problems to find interesting intrusion patterns in data [10]. This requires the data to be in labelled form. For this purpose the NSL-KDD dataset, which has the required characteristics, is used, and the rule based learning algorithms are applied to it.

K. Shafi et al. [11] propose a methodology to create a fully labelled network intrusion detection dataset which is suitable for machine learning algorithms. The dataset is created using real background traffic and simulated attacks, and is tested on supervised machine learning algorithms in WEKA. F. Gharibian and A. Ghorbani (2007) [2] compare supervised machine learning techniques. The algorithms used are Naive Bayes, Gaussian, Decision Tree and Random Forests. The ability of each technique to detect the attack categories in the KDD dataset is compared and, from the results, the proper technique for identifying each attack category is proposed. According to M. Panda and M. Patra (2007) [9], naive Bayes for anomaly based network intrusion detection produces better results in terms of false positive rate, cost and computational time on the KDD99 data sets than a back propagation neural network based approach. Their experiments are carried out in WEKA. J. Zhang et al. (2008) [13] focus on a framework that applies a data mining algorithm called random forests in misuse, anomaly and hybrid network based IDSs. In misuse detection, patterns of intrusions are built automatically by the random forests algorithm over training data. In anomaly detection, novel intrusions are detected by the outlier detection mechanism of the random forests algorithm.

G. Oreku and F. Mtenzi (2009) [8] present the use of data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior. According to them, a useful set of relevant system features can recognize anomalies and known intrusions. D. Zhao et al. (2010) [14] propose a hybrid IDS which combines network and host IDS with anomaly and misuse detection modes. Data mining programs are applied to learn rules that capture the behavior of intrusions and normal activities. K. Qazanfari et al. (2012) [10] propose an intrusion detection system which uses the Support Vector Machine (SVM) and Multi Layer Perceptron (MLP) machine learning algorithms to separate normal from abnormal behavior. S. M. Hussein et al. (2012) [4] discuss an anomaly detection engine based on the NaiveBayes, J48graft Decision Tree and Bayes Net algorithms in WEKA.

3. Methodology
This section presents the methodology used to carry out the work. The workflow is described in Figure 1. The main aim is to analyze the performance of the rule based classifiers present in WEKA. For this purpose the NSL-KDD dataset is used: each rule based classifier is run on the dataset and the outputs of the classifiers are compared with each other. The measures used to compare the results are accuracy, false alarm rate and the number of instances that are correctly and incorrectly classified. The rule based classifier giving the best performance is identified by analyzing the results against these metrics.



Figure 1: Design of Research

4. Experimentation

4.1 Description of dataset

The KDD data set is built from the data captured in the DARPA98 IDS evaluation program. DARPA98 is about 4 gigabytes and contains tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or an attack. Analysis has shown that there are two important issues in the data set which affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches. To solve these issues, a new dataset called NSL-KDD [7] has been proposed, which consists of selected records of the complete KDD data set [12]. The NSL-KDD data set has the following advantages over the original KDD data set. It does not include redundant records in the train set. There are no duplicate records in the proposed test sets. The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion [7].


For the purpose of experimentation, KDDTest-21.arff, a subset of the KDDTest+.arff file, is taken. It contains 11850 instances [7]. The attribute names and types are listed in Table 1.

Table 1: Attribute name and type in KDDTest-21.arff file

Attribute Name      Attribute Type  Attribute Name               Attribute Type
duration            real            is_guest_login               nominal
protocol_type       nominal         count                        real
service             nominal         srv_count                    real
flag                nominal         serror_rate                  real
src_bytes           real            srv_serror_rate              real
dst_bytes           real            rerror_rate                  real
land                nominal         srv_rerror_rate              real
wrong_fragment      real            same_srv_rate                real
urgent              real            diff_srv_rate                real
hot                 real            srv_diff_host_rate           real
num_failed_logins   real            dst_host_count               real
logged_in           nominal         dst_host_srv_count           real
num_compromised     real            dst_host_same_srv_rate       real
root_shell          real            dst_host_diff_srv_rate       real
su_attempted        real            dst_host_same_src_port_rate  real
num_root            real            dst_host_srv_diff_host_rate  real
num_file_creations  real            dst_host_serror_rate         real
num_shells          real            dst_host_srv_serror_rate     real
num_access_files    real            dst_host_rerror_rate         real
num_outbound_cmds   real            dst_host_srv_rerror_rate     real
is_host_login       nominal         class                        nominal
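As a rough illustration of how a name and type listing like Table 1 can be read from an ARFF header, the Python sketch below scans the `@attribute` declarations. The helper name `parse_arff_attributes` and the short sample header are assumptions made for this example; the sample is not an excerpt of the actual KDDTest-21.arff file, and its nominal value lists are illustrative only.

```python
def parse_arff_attributes(header_text):
    """Return (name, type) pairs from the '@attribute' lines of an ARFF header."""
    attributes = []
    for line in header_text.splitlines():
        line = line.strip()
        if not line.lower().startswith("@attribute"):
            continue
        # Split into: keyword, attribute name, type declaration.
        _, name, declared = line.split(None, 2)
        name = name.strip("'\"")
        # A brace-enclosed value list marks a nominal attribute.
        kind = "nominal" if declared.startswith("{") else declared.lower()
        attributes.append((name, kind))
    return attributes


# Hypothetical header written for this example only.
SAMPLE_HEADER = """@relation sample
@attribute duration real
@attribute protocol_type {tcp, udp, icmp}
@attribute class {normal, anomaly}
@data
"""

for name, kind in parse_arff_attributes(SAMPLE_HEADER):
    print(name, kind)
```

The sketch assumes attribute names without embedded spaces; a full ARFF reader would also need to handle quoted names and other attribute types.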

4.2 Classifiers used

The following are the machine learning algorithms or classifiers that are evaluated on the NSL-KDD dataset.
1) ConjunctiveRule: This classifier implements a single conjunctive rule learner that can predict numeric and nominal class labels.
2) DecisionTable: This classifier builds and uses a simple decision table majority classifier.
3) DTNB: This classifier builds and uses a decision table/naive Bayes hybrid classifier.
4) JRip: This classifier implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER).
5) OneR: It generates a set of rules that test one particular attribute, that is, it learns a one-level decision tree.
6) Part: This classifier generates a PART decision list.
7) NNge: It is a nearest neighbour like algorithm using non-nested generalized exemplars.
8) Ridor: This is an implementation of a Ripple-Down Rule learner.
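To make the idea behind OneR concrete, the following Python sketch builds one rule per value of each attribute (predict the most frequent class for that value) and keeps the attribute whose rules make the fewest training errors. It is a simplified illustration written for this text, not WEKA's implementation: it handles nominal attributes only, and the toy records and the function name `one_r` are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute index, value -> class rules, training errors)
    for the single attribute whose rules make the fewest errors."""
    best = None
    for attr in range(len(rows[0])):
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # One rule per value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical toy records: (protocol_type, flag)
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"), ("icmp", "S0"), ("tcp", "SF")]
labels = ["normal", "anomaly", "normal", "anomaly", "normal"]

attr, rules, errors = one_r(rows, labels)
print(attr, rules, errors)
```

On this toy data the second attribute separates the two classes without error, so it is the one selected. Numeric attributes would have to be discretized before a scheme like this could be applied to them.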

4.3 Test Options in WEKA

The result of applying the chosen classifier is tested according to the selected test option. The following are the four test modes provided in WEKA [3].
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. The classifier is built using the train set and evaluated using the test set.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field. The dataset is randomly divided into k subsamples; k-1 subsamples are used as training data and the remaining subsample as test data. This process is repeated k times, so that each subsample is used once for testing.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field [3].
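The cross-validation procedure described in option 3 can be sketched in Python as follows. This is a minimal illustration of the k-fold split only, not WEKA's code: the function name `k_fold_indices`, the fixed seed and the 20-instance example are assumptions, and the sketch does not stratify the folds by class.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)       # random division of the data
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsamples
    for i in range(k):
        test = folds[i]
        # The other k-1 subsamples together form the training data.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_instances=20, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each instance appears in exactly one test fold, so every instance is used once for testing and k-1 times for training.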

4.4 Results and Discussion

The measures used for evaluating the performance are as follows.
1) Correctly classified instances and incorrectly classified instances: the number of instances that are correctly and incorrectly classified.
2) Overall accuracy: the percentage of correctly classified instances.
3) False alarm rate: the proportion of examples which were classified as class x but belong to a different class, among all examples which are not of class x.
4) Class wise accuracy: the proportion of examples which were classified as class x, among all examples which truly have class x [3].

The results of evaluating the rule based classifiers are shown in Table 2 and Table 3. The test option used is 10 fold cross validation. The Part classifier has the highest overall accuracy (97.384%), correctly classifying 11540 of the 11850 instances. The false alarm rates of the Part and JRip classifiers are the lowest (1.7% and 1.8%). The ConjunctiveRule classifier has the highest false alarm rate and the lowest accuracy. So the overall performance of Part is the best and that of ConjunctiveRule is the worst. The performance of Part and JRip is almost identical. Table 3 shows class wise accuracy. All the classifiers show above 90% accuracy for the Anomaly class. For both the Normal class and the Anomaly class, Part gives the highest accuracy.
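The measures defined above can be computed directly from the actual and predicted class labels. The Python sketch below is an illustration under stated assumptions: the function name `evaluate` and the eight labelled instances are hypothetical and are not taken from the experiments.

```python
def evaluate(actual, predicted, target):
    """Compute overall accuracy, class wise accuracy and false alarm rate
    for the class `target`, following the definitions in the text."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    # Predictions for instances that truly are / are not of the target class.
    positives = [p for a, p in pairs if a == target]
    negatives = [p for a, p in pairs if a != target]
    return {
        "accuracy": correct / len(pairs),
        "class_accuracy": sum(p == target for p in positives) / len(positives),
        "false_alarm_rate": sum(p == target for p in negatives) / len(negatives),
    }


# Hypothetical labels for eight instances.
actual = ["normal", "normal", "anomaly", "anomaly",
          "anomaly", "anomaly", "normal", "anomaly"]
predicted = ["normal", "anomaly", "anomaly", "anomaly",
             "normal", "anomaly", "normal", "anomaly"]

metrics = evaluate(actual, predicted, target="normal")
print(metrics)
```

In this toy case six of the eight instances are classified correctly, two of the three Normal instances are recognized as Normal, and one of the five Anomaly instances is wrongly labelled Normal.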


Table 2: Comparison of various measures for rule based classifiers

Rule based Classifier  Correctly classified Instances  Incorrectly classified Instances  Overall Accuracy  False Alarm rate (class=Normal)
ConjunctiveRule        10258                           1592                              86.5654%          8.7%
DecisionTable          11268                           582                               95.0886%          3.5%
DTNB                   11314                           536                               95.4768%          3.4%
JRip                   11492                           358                               96.9789%          1.8%
OneR                   10867                           983                               91.7046%          4.3%
Part                   11540                           310                               97.384%           1.7%
NNge                   11401                           449                               96.211%           2%
Ridor                  11386                           464                               96.0844%          3.2%

Table 3: Class wise accuracy

Rule based classifier  Normal  Anomaly
ConjunctiveRule        65%     91.3%
DecisionTable          88.8%   96.5%
DTNB                   90.5%   96.6%
JRip                   91.7%   98.2%
OneR                   73.9%   95.7%
Part                   93.4%   98.3%
NNge                   88%     98%
Ridor                  92.8%   96.8%
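The figures in Table 2 and Table 3 can be cross-checked with a few lines of Python. The sketch below recomputes the overall accuracy from the instance counts and, assuming every instance is assigned to exactly one of the two classes, checks that the false alarm rate for class Normal equals 100% minus the class wise accuracy for Anomaly. The numbers are copied from the two tables; the variable names are chosen for this example.

```python
TOTAL = 11850  # instances in KDDTest-21.arff, as stated in Section 4.1

# classifier: (correct, incorrect, overall accuracy %,
#              false alarm rate % for class Normal, Anomaly accuracy %)
RESULTS = {
    "ConjunctiveRule": (10258, 1592, 86.5654, 8.7, 91.3),
    "DecisionTable":   (11268, 582, 95.0886, 3.5, 96.5),
    "DTNB":            (11314, 536, 95.4768, 3.4, 96.6),
    "JRip":            (11492, 358, 96.9789, 1.8, 98.2),
    "OneR":            (10867, 983, 91.7046, 4.3, 95.7),
    "Part":            (11540, 310, 97.384, 1.7, 98.3),
    "NNge":            (11401, 449, 96.211, 2.0, 98.0),
    "Ridor":           (11386, 464, 96.0844, 3.2, 96.8),
}

for name, (correct, incorrect, acc, far, anomaly_acc) in RESULTS.items():
    recomputed_acc = 100 * correct / TOTAL  # overall accuracy from counts
    implied_far = 100 - anomaly_acc         # Anomaly instances labelled Normal
    print(f"{name:16s} {recomputed_acc:8.4f} {acc:8.4f} {implied_far:5.1f} {far:5.1f}")
```

For every classifier the recomputed accuracy agrees with the reported value to the printed precision, and the implied false alarm rate matches the last column of Table 2.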

Figure 2: Classification of Instances by rule based classifiers (y-axis: No. of instances; series: Correctly classified Instances, Incorrectly classified Instances)



Figure 3: Overall Accuracy of various rule based classifiers (y-axis: Overall Accuracy)

Figure 4: False alarm rate of rule based classifiers (y-axis: False alarm rate)

Figure 5: Accuracy of rule based classifiers for class Normal and class Anomaly (y-axis: Class wise Accuracy; series: Normal, Anomaly)



5. Conclusion and Future Scope

Eight rule based classifiers are evaluated on the NSL-KDD dataset. The measures taken for comparison are overall accuracy, class wise accuracy and false alarm rate. The Part and JRip classifiers performed best among all the algorithms, and the performance of ConjunctiveRule is the worst. The rules generated by these classifiers can be incorporated into signature based intrusion detection systems to enhance their performance.

References

[1] Chandolikar N. S. and Nandavadekar V. D., (2012), Comparative analysis of two algorithms for intrusion attack classification using KDD CUP dataset, International Journal of Computer Science and Engineering (IJCSE), Vol. 1, Issue 1, pp. 81-88.
[2] Gharibian F. and Ghorbani A. A., (2007), Comparative study of supervised machine learning techniques for intrusion detection, In Communication Networks and Services Research, CNSR'07, Fifth Annual Conference on, pp. 350-358, IEEE.
[3] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H., (2009), The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1.
[4] Hussein S. M., Ali F., and Kasiran Z., (2012), Evaluation Effectiveness of Hybrid IDS Using Snort with Naive Bayes to Detect Attacks, IEEE.
[5] Information Assurance Tools Report: Intrusion Detection Systems, (2009), IATAC, Herndon, VA.
[6] Kurundkar G. D., Naik N. A. and Khamitkar S. D., (2012), Network Intrusion Detection using SNORT, International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 2, pp. 1288-1296.
[7] NSL-KDD data set for network-based intrusion detection systems, (2009), Available on: http://nsl.cs.unb.ca/NSL-KDD/.
[8] Oreku G. S. and Mtenzi F. J., (2009), Intrusion Detection Based on Data Mining, In Dependable, Autonomic and Secure Computing, DASC'09, Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 696-701, IEEE.
[9] Panda M. and Patra M. R., (2007), Network intrusion detection using naive Bayes, International Journal of Computer Science and Network Security, Vol. 7, No. 12.
[10] Qazanfari K., Mirpouryan S., and Gharaee H., (2012), A Novel Hybrid Anomaly Based Intrusion Detection Method, 6th International Symposium on Telecommunications (IST'2012).
[11] Shafi K., Abbass H. A., and Zhu W., A Methodology to Evaluate Supervised Learning Algorithms for Intrusion Detection.
[12] Tavallaee M., Bagheri E., Lu W., and Ghorbani A. A., (2009), A Detailed Analysis of the KDD CUP 99 Data Set, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications.
[13] Zhang J., Zulkernine M., and Haque A., (2008), Random-Forests-Based Network Intrusion Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 38, No. 5.
[14] Zhao D., Xu Q., Feng Z., (2010), Analysis and Design for Intrusion Detection System Based on Data Mining, Second International Workshop on Education Technology and Computer Science.