
ANALYSIS AND DETECTION OF BOTNETS
USING MACHINE LEARNING TECHNIQUES

A SYNOPSIS

Submitted by

G. KIRUBAVATHI

in partial fulfillment of the requirements for the degree

of

DOCTOR OF PHILOSOPHY

FACULTY OF SCIENCE AND HUMANITIES


ANNA UNIVERSITY
CHENNAI 600 025

MARCH 2016
1. INTRODUCTION

Botnets are a preeminent source of cyber crime and one of the greatest
threats to the Internet infrastructure. A botnet can be spread widely across
geographic regions, with infected hosts and botmasters operating in different
countries and locations. According to the PandaLabs annual report, botnets
pose a serious threat to the Internet and are responsible for malicious
activities ranging from distributed denial of service (DDoS) attacks to
spamming, phishing, information harvesting and identity theft. As reported by
BBC News (2012), botnets have started conscripting smart phones to send spam
and mine cryptocurrencies.

2. RATIONALE FOR THE STUDY

Botnets represent one of the most significant threats to cyber
security. They employ different techniques, topologies and communication
protocols at different stages of their lifecycle. They can also change their
methodology at any time, which has resulted in a highly dynamic threat
landscape that is not amenable to traditional security approaches. Data
mining techniques such as classification (Yin et al 2013) and association
rule mining, which use induction algorithms to explore data, discover hidden
patterns and build predictive models, have proved effective in tackling these
cyber security challenges. The volume of data on network and host activity is
so large that it is an ideal candidate for botnet detection using data mining
techniques. Botnet detectors are the primary tools of defense against
botnets, and the quality of a detector is determined by the techniques it
uses. The real challenges in the design of botnet detection techniques are to
achieve higher detection accuracy, a lower false positive rate and lower
processing overhead. It is therefore imperative to study the botnet detection
techniques currently available and to understand how they work and where
their limitations lie.
Techniques for detecting botnets using data mining fall into two
broad categories: host based and network based behavior analysis. This work
analyses the network features extracted from botnet traffic and applies data
mining techniques to classify the traffic as benign or botnet. We have
considered two major types of botnet: botnets affecting the world's most
widely used operating system, Windows, and botnets affecting the most popular
mobile environment, Android.

3. OBJECTIVES

i) To analyze a large collection of Windows based and Android
based botnets to understand their characteristics.
ii) To extract significant features based on the analysis.
iii) To derive botnet detection techniques that use the analyzed
features.
iv) To reduce the false positive rate and improve accuracy and
F-measure.
v) To detect botnets irrespective of their control structures.

4. SUMMARY OF RESEARCH WORK

The main focus of this research work is a detailed analysis of
Windows and Android based botnets and the design of efficient detection
mechanisms. A brief summary of the various research works carried out, along
with some of the experimental results, is given below.

4.1 Botnets: A Study and Analysis

This work focuses on the analysis of botnets to understand their
behavior and lifecycle mechanism, as shown in Figure 1, which will be
helpful for future studies on thwarting botnet communications.
According to Mori et al (2010), the extent of the damage caused by botnets
is becoming more critical day by day. A botnet allows the botmaster to
control zombies remotely and instruct them through commands sent over
the C&C channel. The C&C channel is a crucial component of a botnet, and
different botnets can organize their C&C channels in diverse ways.
Botnets can be centralized, decentralized or hybrid according to their
C&C channels and communication protocols (HTTP, P2P, IRC, IM, etc).

Figure 1 A typical botnet lifecycle

In a controlled environment, various categories of botnets such as
IRC, HTTP and P2P are analyzed to understand their control mechanisms,
propagation mechanisms and possible attacks. The analyzed botnets include
Rbot (IRC); Zeus, BlackEnergy, Festi, Spyeye and Chameleon (HTTP);
and Slapper, Phatbot, SpamThru, Kelihos and ZeroAccess (P2P). In addition,
a detailed survey of some of the existing botnet detection techniques,
their advantages and their limitations helped to understand the
challenges in the design of botnet detection techniques. Most current
botnet detection techniques are designed only for specific botnet C&C
communication protocols and structures. Consequently, when botnets
change their C&C architecture or protocols, or use encrypted
communications, these methods are no longer effective in detecting them.
Developing techniques that detect botnets regardless of the C&C
architecture, and in the presence of encrypted communications, is
therefore a challenging task.

4.2 HTTP botnet detection using Hidden Semi-Markov Model with SNMP
MIB variables

In this work, a Hidden semi-Markov Model (HsMM) is used to
distinguish normal traffic from HTTP botnet traffic. The model is based on
the observation that most HTTP botnet communications use TCP connections.
The time a system spends in the normal state need not be constant or
exponentially distributed (Livadas et al 2006). Hence we use a Hidden
semi-Markov Model, which allows arbitrary state durations, to model the
system behavior as shown in Figure 2.

Figure 2 HsMM detection model

The TCP connection-related MIB variables, namely tcpActiveOpens,
tcpPassiveOpens, tcpAttemptFails, tcpEstabResets, tcpCurrEstab, tcpInSegs,
tcpRetransSegs and tcpInErrs, are collected. The collected MIB variables are
passed through principal component analysis (PCA) to select the significant
SNMP MIB variables. After exhaustive experiments with normal traffic flows
and HTTP botnet communication flows, we found that the summation of the MIB
variables (SUM-MIB) at different time points provides interesting results,
and hence it is used for further analysis.
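
As a rough illustration of the SUM-MIB idea, the sketch below sums the
selected TCP MIB variables at each polling time point. The function name
`sum_mib` and the sample values are hypothetical, and the sketch assumes the
values are already per-interval readings; the synopsis does not state how the
counters are pre-processed before summation.

```python
from typing import Dict, List

# TCP-related MIB variables named in the text.
TCP_MIB_VARS = [
    "tcpActiveOpens", "tcpPassiveOpens", "tcpAttemptFails", "tcpEstabResets",
    "tcpCurrEstab", "tcpInSegs", "tcpRetransSegs", "tcpInErrs",
]


def sum_mib(samples: List[Dict[str, int]], variables: List[str]) -> List[int]:
    """Return one SUM-MIB value per time point (sum of the selected variables)."""
    return [sum(sample.get(name, 0) for name in variables) for sample in samples]


# Two hypothetical polling intervals; the values are made up for illustration.
samples = [
    {"tcpActiveOpens": 3, "tcpInSegs": 120, "tcpRetransSegs": 2},
    {"tcpActiveOpens": 9, "tcpInSegs": 410, "tcpRetransSegs": 15},
]
series = sum_mib(samples, TCP_MIB_VARS)
print(series)  # [125, 434]
```

The resulting series is the kind of one-dimensional sequence that could then
be turned into an observation sequence for the model.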

In the training phase, the SNMP MIB variables are first transformed into
an HsMM observation sequence using the forward-backward training algorithm.
Next, the HsMM is inferred from the observation sequence. In the testing
phase, the SNMP MIB variables are transformed into HsMM observation
sequences, and the HsMM is then used to compute the probability of each test
sequence. From this probability the Average Log Likelihood (ALL) is
determined, which decides whether the sequence is normal traffic or HTTP
botnet communication.
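
The decision step can be sketched as follows. For brevity the sketch scores
sequences with the forward algorithm of an ordinary discrete HMM, which
stands in for the HsMM (it has no explicit state-duration model). The model
parameters, the quantized observation symbols, the threshold and the
assumption that the model is trained on normal traffic are all hypothetical.

```python
import math
from typing import List


def hmm_log_likelihood(obs: List[int], start: List[float],
                       trans: List[List[float]], emit: List[List[float]]) -> float:
    """Log P(obs | model) via the scaled forward algorithm of a discrete HMM."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_l = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_l += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_l


def average_log_likelihood(obs, start, trans, emit) -> float:
    """ALL: log-likelihood divided by the sequence length."""
    return hmm_log_likelihood(obs, start, trans, emit) / len(obs)


def is_botnet(all_value: float, threshold: float) -> bool:
    """Flag a sequence whose ALL under the normal-traffic model is too low."""
    return all_value < threshold


# Hypothetical two-state model of normal traffic over three quantized levels.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
THRESHOLD = -1.0  # hypothetical; would be tuned on validation traces

normal_like = [0, 1, 0, 0, 1]
bursty = [2, 2, 2, 2, 2]
for name, seq in (("normal_like", normal_like), ("bursty", bursty)):
    score = average_log_likelihood(seq, start, trans, emit)
    print(name, round(score, 3), is_botnet(score, THRESHOLD))
```

Dividing by the sequence length makes scores comparable across test
sequences of different lengths, which is why an average rather than a raw
log-likelihood is thresholded.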

Experimental results:

A botnet setup that mirrors the behavior of existing real-world HTTP
botnets is created in the SSE lab network, as shown in Figure 3.

Figure 3 Botnet experimental setup

Using this botnet setup, several experiments are conducted with the
Spyeye, BlackEnergy, Zeus, Athena and Andromeda HTTP botnets to
validate the model. Normal traces are collected from the SSE network
during the systems' normal activities, which include web service and
FTP service. The MIB datasets used for the experiments are described
in Table 1, and Table 2 shows the detection accuracy and false positive
rate of the proposed model.

Table 1 Description of the datasets

Botnet MIB traces
Botnet          Trace size      Botnet          Trace size
Spyeye          1.25 GB         Athena          2.73 GB
Blackenergy     2.96 GB         Andromeda       4.59 GB
Zeus            2.57 GB

Normal MIB traces
FTP service     4.95 GB         Web service     4.27 GB

We have also compared the model with other botnet detection schemes:
that of Nogueira et al (2010), which uses a neural network to
classify licit and illicit traffic patterns, and that of Choi and Lee
(2012), which uses DNS traffic patterns to identify botnet traffic. Our
model provides better detection accuracy, as shown in Figure 4.
The proposed model is lightweight and works in real time since it uses
SNMP MIB variables collected from SNMP agents instead of analyzing the
network traffic flows.

Table 2 Performance of the proposed model

Datasets        False positive rate     Detection accuracy      Results
Web service     0%                      100%                    Normal
FTP service     0%                      100%                    Normal
Spyeye          1.67%                   98.14%                  Botnet
Blackenergy     1.58%                   98.72%                  Botnet
Zeus            1.75%                   98.02%                  Botnet
Athena          1.29%                   98.94%                  Botnet
Andromeda       1.47%                   98.62%                  Botnet

Figure 4 Accuracy comparisons with existing techniques (y-axis: Accuracy (%))

4.3 HTTP Botnet Detection using adaptive learning rate Multilayer
Feed Forward Neural Network

In this work, we consider the detection of HTTP botnets at the network
level. Most web botnet communications are based on TCP connections, and
hence the following relative and direct features of TCP connections are
extracted from the network traffic for the detection of HTTP botnets.

\[
\text{One-way connection ratio of TCP} =
\frac{\text{Number of one-way connection TCP packets}}{\text{Total number of TCP packets}} \times 100
\]

\[
\text{Ratio of incoming to outgoing TCP packets} =
\frac{\text{Number of incoming TCP packets}}{\text{Number of outgoing TCP packets}}
\]

\[
\text{Ratio of TCP packets} =
\frac{\text{Number of TCP packets}}{\text{Total number of packets}}
\]

SYN Flag Count: the number of TCP packets with the SYN flag set
FIN Flag Count: the number of TCP packets with the FIN flag set
PSH Flag Count: the number of TCP packets with the PSH flag set
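
The feature definitions above can be computed from simple packet counters,
as in the following sketch. The `PacketCounts` container, its field names
and the convention of returning 0.0 for an empty denominator are assumptions
made for illustration; they are not taken from the original work.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PacketCounts:
    """Hypothetical per-capture counters gathered from the network traffic."""
    total_packets: int
    tcp_packets: int
    one_way_tcp_packets: int
    incoming_tcp: int
    outgoing_tcp: int
    syn: int
    fin: int
    psh: int


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0


def tcp_features(c: PacketCounts) -> Dict[str, float]:
    """Compute the three ratio features and three flag counts listed in the text."""
    return {
        "one_way_ratio": 100.0 * safe_div(c.one_way_tcp_packets, c.tcp_packets),
        "in_out_ratio": safe_div(c.incoming_tcp, c.outgoing_tcp),
        "tcp_ratio": safe_div(c.tcp_packets, c.total_packets),
        "syn_count": float(c.syn),
        "fin_count": float(c.fin),
        "psh_count": float(c.psh),
    }


# Made-up counters for a single capture.
example = PacketCounts(total_packets=1000, tcp_packets=800, one_way_tcp_packets=200,
                       incoming_tcp=500, outgoing_tcp=300, syn=40, fin=35, psh=120)
features = tcp_features(example)
print(features)
```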
The extracted TCP features are normalized using min-max
normalization. The normalized features are then passed to the Multi-Layer
Feed Forward Neural Network training model, which uses the Bold Driver
Back-propagation learning algorithm. This learning algorithm has the
advantage of dynamically changing the learning rate parameter during the
weight updating process.
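
A minimal sketch of the two ingredients named above, min-max normalization
and a Bold Driver learning-rate rule, is given below. The growth and shrink
factors (1.05 and 0.5) are common textbook choices and are assumptions here,
as is the toy one-dimensional objective used in place of a neural network.

```python
from typing import List, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def bold_driver(lr: float, prev_error: float, new_error: float,
                grow: float = 1.05, shrink: float = 0.5) -> Tuple[float, bool]:
    """Return (new learning rate, whether the step is accepted).

    If the error decreased, accept the step and grow the rate slightly;
    otherwise reject the step and cut the rate sharply.
    """
    if new_error < prev_error:
        return lr * grow, True
    return lr * shrink, False


def error(w: float) -> float:
    """Toy objective standing in for the network's training error."""
    return (w - 3.0) ** 2


w, lr = 0.0, 0.1
err = error(w)
for _ in range(60):
    candidate = w - lr * 2.0 * (w - 3.0)  # gradient step on the toy objective
    cand_err = error(candidate)
    lr, accepted = bold_driver(lr, err, cand_err)
    if accepted:
        w, err = candidate, cand_err

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
print(round(w, 6))
```

In a real back-propagation loop the same rule would be applied once per
epoch, with the weights restored to their previous values whenever a step is
rejected.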

Experimental results:
A dataset comprising 48.6 GB of traffic flow traces, covering both
botnet and benign traffic, with TCP features extracted, is used as shown
in Table 3.

Table 3 Description of datasets

Botnet traffic
Botnet Family   Trace Size      Botnet Family   Trace Size
Zeus            5.36 GB         Sogou           18 MB
Spyeye          5.14 GB         Athena          3.91 GB
BlackEnergy     6.25 GB         Andromeda       2.64 GB

Normal traffic
Web service     8.52 GB         Remote service  2.69 GB
E-mail service  5.10 GB         FTP service     3.29 GB

Using this approach, the Spyeye, BlackEnergy, Sogou, Athena, Andromeda
and Zeus botnets are efficiently identified.
The performance of the proposed method is compared with that of the
C4.5 Decision Tree, Random Forest and Radial Basis Function (RBF)
network classifiers. The results show an improvement in detection
accuracy with the neural network when compared with the other
classification techniques, as shown in Table 4.
We have also compared the model with the botnet detection scheme
proposed by Gu et al (2008), which uses communication traffic to
cluster similar botnet patterns. Our system provides better detection
accuracy, as shown in Figure 5.

Table 4 Performance of the three classifiers with NN model

Methods         Precision   Recall   F-Measure   Accuracy (%)
Decision Tree   0.968       0.931    0.949       96.58
Random Forest   0.968       0.934    0.950       96.667
RBF             0.976       0.927    0.950       96.53
NN Model        0.984       0.973    0.976       98.67

Figure 5 Accuracy comparisons with existing techniques (y-axis: Accuracy (%))

The research works in 4.2 and 4.3 focus on HTTP based botnet
detection. Botmasters now change their Command and Control structures
dynamically to avoid detection. Hence, in the next work, we concentrate on
designing and developing efficient mechanisms that detect botnets
irrespective of their Command and Control structures.

4.4 Botnet detection based on mining of traffic flow characteristics

In this work, a method is proposed to detect botnets irrespective of
their structures, based on network traffic flow behavior analysis and machine
learning techniques. Many botnet characteristics were analyzed in a
controlled environment, and it was found that a bot in the network generates
a burst of small packets when actively searching for susceptible hosts and
exhibits a more uniform pattern when it continuously queries for updates or
instructions, resulting in many uniformly sized, small TCP/UDP packets.
Based on this behavior we have extracted four traffic flow features in
different time windows, as shown below:

Small_Packets Ps - No. of small packets sent and received in a flow
for the specified time interval
Packet_ratio Pr - Ratio of incoming to outgoing packets in a flow
for the specified time interval
Initial Packet_length Pl - Length of the first packet in a flow
Bot-response_packet ratio BRp - Ratio of Bot-Response packets
to total packets in a flow for the specified time interval

Once the significant features are identified, a bot can be detected
before it launches an attack. To accomplish this, individual flows are split
into multiple parts using time windows of WT seconds. The characteristics of
a given flow are observed by examining its traffic in a given time window.
Intuitively, smaller time windows may fail to capture unique traffic
characteristics that only become visible over a longer period of time,
whereas with longer time windows the detection system takes longer to make a
decision. Selecting the time window size is therefore a challenging task. In
this work, WT is fixed based on experimental analysis.

After feature extraction, flow vectors are formed so that the traffic
flows can be classified into botnet and normal flows by applying machine
learning techniques.
Let fj be a flow. Its flow vector is fj(ti) = (Ps, Pr, Pl, BRp), where Ps,
Pr, Pl and BRp are the features extracted from the flow fj during the time
period ti.
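
The construction of flow vectors per time window can be sketched as below.
Several details are assumptions because the synopsis does not specify them:
the 100-byte cutoff for a "small" packet, the `bot_response` flag (assumed to
be set by an upstream step), and the convention of reporting Pr as 0.0 when a
window has no outgoing packets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SMALL_PACKET_BYTES = 100  # hypothetical cutoff for a "small" packet


@dataclass
class Packet:
    ts: float            # capture time in seconds
    size: int            # packet length in bytes
    incoming: bool       # True if the packet is incoming, False if outgoing
    bot_response: bool   # assumed to be flagged upstream; rule not given in the text


def flow_vectors(packets: List[Packet], window_s: float) -> List[Tuple[int, float, int, float]]:
    """Return one (Ps, Pr, Pl, BRp) vector per non-empty time window of a flow."""
    if not packets:
        return []
    ordered = sorted(packets, key=lambda p: p.ts)
    first_len = ordered[0].size  # Pl is a property of the whole flow
    start = ordered[0].ts
    windows: Dict[int, List[Packet]] = {}
    for p in ordered:
        windows.setdefault(int((p.ts - start) // window_s), []).append(p)
    vectors = []
    for idx in sorted(windows):
        pk = windows[idx]
        ps = sum(1 for p in pk if p.size <= SMALL_PACKET_BYTES)
        n_in = sum(1 for p in pk if p.incoming)
        n_out = len(pk) - n_in
        pr = n_in / n_out if n_out else 0.0
        brp = sum(1 for p in pk if p.bot_response) / len(pk)
        vectors.append((ps, pr, first_len, brp))
    return vectors


# A made-up flow with three packets, split into 180 s windows.
flow = [
    Packet(0.0, 60, True, False),
    Packet(10.0, 60, False, True),
    Packet(200.0, 1500, True, False),
]
print(flow_vectors(flow, 180))
```

Each tuple is one training or test instance for the classifiers discussed
next, so a long flow contributes several instances, one per window.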

We have used three prominent classification techniques, namely the
boosted decision tree ensemble classifier, the Naive Bayesian (NB)
statistical classifier and the support vector machine (SVM) discriminative
classifier.

Experimental results:

Three different datasets comprising 10.09 GB, 17.102 GB and 10.05
GB of both botnet and benign traffic flow traces are used.

Botnet datasets are collected from diverse sources such as the ISOT
Botnet dataset from University of Victoria, the Conficker dataset from
CAIDA, a dataset from University of Georgia, four different datasets
from CVUT University, Citadel botnet and Alexa benign datasets from
Dalhousie University and three different IRC botnet datasets from
Centro University, Argentina.

Time windows of different sizes, ranging from 60 to 300 s in multiples
of 60, are used to evaluate the performance of the system.

To analyze the performance of the proposed method, various metrics
such as Accuracy, Precision, Recall and F-measure are computed for the
three classification schemes and are shown in Table 5. The Naive
Bayesian classifier achieves the highest accuracy, approximately 99%,
with a false positive rate of approximately 0.02%. The other
classifiers also achieve high detection accuracy and a low false
positive rate owing to the appropriate feature selection.

We observe the effects of varying the time window size on the detection
accuracy and false positive rate. Figures 6 and 7 show the effects of
the time window size on the detection rate and the false positive rate
for botnet detection, respectively.

We have also compared the model with other botnet detection schemes:
Livadas et al (2006), which uses classification algorithms to identify
IRC botnet traffic; Saad et al (2011), which uses five different
machine learning algorithms to identify P2P botnet traffic; Masud et al
(2008), which uses flow based detection that considers the correlation
between multiple log files to identify IRC botnets; Liao and Chang
(2010), which uses classification techniques to identify P2P botnets;
Wang et al (2011), which uses fuzzy pattern recognition techniques to
identify botnets; and Huang (2013), which uses network failure flows to
identify botnets. Our system provides better detection accuracy, as
shown in Figure 8.

Table 5 Performance of the three methods at WT = 180s

Datasets   Methods        Precision   Recall   F-Measure   Accuracy (%)   FPR
D1         AdaBoost+J48   0.998       0.960    0.950       96.80          0.06
           NB             0.997       0.991    0.992       99.20          0.02
           SVM            0.961       0.937    0.949       93.20          0.07
D2         AdaBoost+J48   0.987       0.956    0.971       96.20          0.08
           NB             0.996       0.992    0.993       99.46          0.02
           SVM            0.962       0.939    0.950       93.33          0.11
D3         AdaBoost+J48   0.989       0.960    0.974       96.66          0.07
           NB             0.997       0.994    0.995       99.60          0.01
           SVM            0.963       0.939    0.974       93.14          0.08
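
For reference, the metrics reported in Table 5 follow the standard
definitions over a binary confusion matrix, sketched below with botnet
treated as the positive class. The function name and the example counts are
illustrative and are not taken from the experiments.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary classification metrics (botnet = positive class)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # benign flows flagged as botnet
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "fpr": fpr,
    }


# Made-up confusion-matrix counts for illustration only.
metrics = detection_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```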

Figure 6 Effects of time window size on detection rate

Figure 7 Effects of time window size on false positive rate

Figure 8 Accuracy comparisons with existing techniques (y-axis: Accuracy (%))


The research works in 4.2 to 4.4 focus on Windows based botnet
detection. Smart phone usage has expanded at a very high rate, and Android
has surpassed other mobile platforms as the most popular while also
witnessing a dramatic increase in botnets targeting the platform. Hence, in
the next work, we concentrate on designing and developing an efficient
detection mechanism for Android based botnets.

4.5 Structural Analysis and Detection of Android Botnets Using
Machine Learning Techniques

In this work, an Android botnet analysis is carried out and a detection
mechanism is designed using machine learning algorithms. Unique patterns in
the combinations of requested permissions and used features that relate to
the malicious activities of botnets are identified using the Apriori
association rule mining algorithm, and the information gain method is used to
select the most significant patterns in order to improve detection. The
selected unique patterns are passed to the machine learning framework to
classify the applications as benign or botnet. The main contributions of this
work are:

Analysis of a large collection of datasets from diverse sources to
understand the important requested permissions and used features
related to botnet applications.

Using frequency analysis, unique patterns of requested permissions and
used features pertaining to the malicious activities of botnet
applications are identified, and significant patterns are selected
based on their support values and Information Gain. The selected
significant patterns are used for classification to identify botnet
applications effectively.

Experimental results:

The dataset comprises 9,756 Android botnet applications belonging
to different botnet families and 122,176 benign applications of various
categories.

We have selected three prominent classification algorithms, namely
Naive Bayesian (NB), a statistical classifier; Support Vector Machine
(SVM), a discriminative classifier; and Reduced Error Pruning Tree
(REPTree), a decision tree classifier.

The selected classifiers are trained using 70% of the botnet and
benign applications. The botnet samples are randomly selected from
each category. The trained models are tested with the rest of the
samples. The results of these experiments are summarized in Table 6.
The SVM classifier provides better detection accuracy than the other
classification algorithms.

The performance of the proposed work is compared with some existing
similar works: Peiravian and Zhu (2013), which uses requested
permissions and API calls; Yerima et al (2014), which uses requested
permissions and API calls; and Sanz et al (2013), which uses requested
permissions to identify malicious applications. Our model provides
better detection accuracy, as shown in Figure 9.
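
The 70%/30% train/test protocol with random selection from each category can
be sketched as a stratified split. The function name, the fixed random seed
and the placeholder sample identifiers are assumptions; the synopsis does not
describe how the split was implemented.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(samples: Sequence[str], categories: Sequence[Hashable],
                     train_fraction: float = 0.7, seed: int = 0
                     ) -> Tuple[List[Tuple[str, Hashable]], List[Tuple[str, Hashable]]]:
    """Randomly assign train_fraction of every category to the training set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_category: Dict[Hashable, List[str]] = defaultdict(list)
    for sample, category in zip(samples, categories):
        by_category[category].append(sample)
    train, test = [], []
    for category, group in by_category.items():
        group = group[:]          # copy so the caller's data is not reordered
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += [(s, category) for s in group[:cut]]
        test += [(s, category) for s in group[cut:]]
    return train, test


# Placeholder identifiers: 10 apps of one botnet family and 20 benign apps.
names = [f"bot_{i}" for i in range(10)] + [f"benign_{i}" for i in range(20)]
cats = ["family_a"] * 10 + ["benign"] * 20
train_set, test_set = stratified_split(names, cats)
print(len(train_set), len(test_set))  # 21 9
```

Splitting per category keeps small botnet families represented in both the
training and the test sets, which a single global shuffle would not
guarantee.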

Table 6 Performance of the three classifiers

Measures    Naive Bayes   SVM     REPTree
Precision   96.42         98.97   95.40
Recall      96.80         99.31   95.26
F-Measure   96.90         97.95   95.37
MCC         94.21         97.95   95.42
Kappa       94.14         97.80   94.67
Accuracy    96.53         99.06   95.03
FPR         4.68          1.05    5.26
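
Two of the measures in Table 6, the Matthews correlation coefficient (MCC)
and Cohen's Kappa, are less common than precision and recall. The sketch
below computes both from a binary confusion matrix on a 0 to 1 scale using
their standard definitions. The counts are made up for illustration.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient of a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's Kappa: agreement between prediction and truth beyond chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and true labels.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Made-up counts: 40 true positives, 10 false positives,
# 10 false negatives and 40 true negatives.
print(round(mcc(40, 10, 10, 40), 3))          # 0.6
print(round(cohen_kappa(40, 10, 10, 40), 3))  # 0.6
```

Both measures account for class imbalance, which matters here because the
benign applications far outnumber the botnet applications.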

Figure 9 Accuracy comparisons with existing techniques (y-axis: Accuracy (%))

5. CONCLUSION

The two major types of botnets existing today are those affecting
the most widely used operating system, Windows, and those affecting the most
popular mobile environment, Android. A thorough analysis of both types of
botnets has been carried out to arrive at the most significant features for
botnet detection. Through the various works carried out, we are able to
achieve the objectives mentioned in section 3. The first work is an analysis
of various botnets, which forms the foundation for the detection methods. The
second and third works focus on the detection of HTTP botnets. Detecting
botnets irrespective of the protocol and C&C structure is challenging, and
the fourth work achieves this objective. The final work concentrates on the
detection of Android botnets.

REFERENCES:

1. Annual Report PandaLabs - Press Panda Security 2013, Available from
http://press.pandasecurity.com/wpcontent/uploads/2010/05/AnnualReport-PandaLabs-2013.pdf.

2. Android Smart phones Used for Botnet, Researchers Say, July 5
2012, Available from http://www.bbc.co.uk/news/technology-18720565.

3. Choi H & Lee H 2012, Identifying botnets by capturing group
activities in DNS traffic, Computer Networks, vol. 56, no. 1, pp. 20-33.

4. Gu G, Perdisci R, Zhang J & Lee W 2008, BotMiner: Clustering
Analysis of Network Traffic for Protocol- and Structure-Independent
Botnet Detection, In USENIX Security Symposium, vol. 5, no. 2, pp.
139-154.

5. Huang CY 2013, Effective bot host detection based on network failure
models, Computer Networks, vol. 57, no. 2, pp. 514-25.

6. Liao WH & Chang CC 2010, Peer to peer botnet detection using data
mining scheme, In Proceedings of IEEE international conference on
internet technology and applications, pp. 1-4.

7. Livadas C, Walsh R, Lapsley D & Strayer WT 2006, Using machine
learning techniques to identify botnet traffic, In Proceedings of 31st
IEEE Conference on Local Computer Networks, pp. 967-974.

8. Lu W, Rammidi G & Ghorbani AA 2011, Clustering botnet
communication traffic based on n-gram feature selection, Computer
Communications, vol. 34, no. 3, pp. 502-514.

9. Masud MM, Al-Khateeb T, Khan L, Thuraisingham B & Hamlen KW
2008, Flow-based identification of botnet traffic by mining multiple
log files, In Proceedings of First IEEE International Conference
on Distributed Framework and Applications, pp. 200-206.

10. Mori T, Esquivel H, Akella A, Shimoda A & Goto S 2010,
Understanding large-scale spamming botnets from internet edge sites,
In Proceedings of the Conference on E-Mail and Anti-Spam (CEAS),
pp. 41-52.

11. Nogueira A, Salvador P & Blessa F 2010, A botnet detection system
based on neural networks, In Fifth IEEE International Conference on
Digital Telecommunications, pp. 57-62.

12. Peiravian N & Zhu X 2013, Machine learning for android malware
detection using permission and api calls, In IEEE 25th International
Conference on Tools with Artificial Intelligence (ICTAI), pp. 300-305.

13. Saad S, Traore I, Ghorbani A, Sayed B, Zhao D, Lu W & Hakimian P
2011, Detecting P2P botnets through network behavior analysis and
machine learning, In Ninth IEEE Annual International Conference
on Privacy, Security and Trust (PST), pp. 174-180.

14. Sanz B, Santos I, Laorden C, Ugarte-Pedrero X, Bringas PG & Álvarez
G 2013, Puma: Permission usage to detect malware in android, In
Springer International Joint Conference CISIS'12-ICEUTE'12-SOCO'12,
pp. 289-298.

15. Wang K, Huang CY, Lin SJ & Lin YD 2011, A fuzzy pattern-based
filtering algorithm for botnet detection, Computer Networks, vol. 55,
no. 15, pp. 3275-86.

16. Yerima SY, Sezer S & McWilliams G 2014, Analysis of Bayesian
classification-based approaches for Android malware
detection, Information Security, IET, vol. 8, no. 1, pp. 25-36.

17. Yin C, Yang L & Wang J 2013, Botnet Detection Based on Degree
Distributions of Node Using Data Mining Scheme, International
Journal of Future Generation Communication and Networking, vol. 6,
no. 6, pp. 81-90.

LIST OF PUBLICATIONS

1. Kirubavathi G & Anitha R 2016, Botnet detection via mining of
traffic flow characteristics, Computers & Electrical Engineering
Journal, vol. 50, pp. 91-101, Impact factor 0.836. Available in
Annexure I.
2. Kirubavathi G & Anitha R 2014, Botnets: A Study and Analysis,
Proceedings of the Springer international conference on Computational
Intelligence, Cyber Security and Computational Models, Springer
India, pp. 203-214.
3. Kirubavathi Venkatesh G, Srihari V, Veeramani R, Karthikeyan RM
& Anitha R 2013, HTTP botnet detection using Hidden Semi-Markov
Model with SNMP MIB variables, International Journal of
Electronic Security and Digital Forensics, vol. 5, nos. 3/4, pp. 188-200.
Available in Annexure II.
4. Kirubavathi Venkatesh G & Anitha R 2012, HTTP Botnet Detection
using adaptive learning rate Multilayer Feed Forward Neural
Network, Proceedings of Workshop in Information Security Theory
and Practice WISTP12, Royal Holloway, UK, pp. 38-48.
5. Kirubavathi G & Anitha R, Structural Analysis and Detection of
Android Botnets Using Machine Learning Techniques, International
Journal of Information Security, Springer. (Under Revision)
