Abstract—The importance of knowing what type of traffic is flowing through a network is paramount to its success. Traffic engineering, quality of service, identifying critical business applications, intrusion detection systems, as well as network management activities all require the base knowledge of what traffic is flowing over a network before any further steps can be taken. With Secure Socket Layer (SSL) traffic on the rise due to applications securing or concealing their traffic via encryption, the ability to determine what applications are running within a network is getting more and more difficult. Traditional methods of traffic classification through port numbers and deep packet inspection tools have been deemed inadequate despite their continued popular usage. The purpose of this work is to investigate if a machine learning approach can be used with flow features to identify SSL traffic in a given network trace. To this end, different machine learning methods, namely AdaBoost, C4.5, RIPPER, and Naive Bayesian techniques, are investigated without the use of port numbers, Internet Protocol addresses, or payload information.

I. INTRODUCTION

Correct classification of network traffic is a fundamental step for many pivotal services required by various stakeholders including Internet Service Providers (ISPs), governments, and system administrators. These services include traffic shaping, ensuring the uptime of networked mission-critical applications, workload modeling, managing bandwidth budgets, detecting bottlenecks, and balancing Quality of Service (QoS) [1], [2], [3]. Successful methods pursued in the past have relied on deep packet inspection by examining the contents of the payload or using port numbers to correctly identify the application behind a traffic stream. Unfortunately, they no longer hold as much weight as they once did due to encryption (rendering packet payloads non-transparent) and dynamic port allocation (enabling applications to connect on ports other than the ones assigned by the Internet Assigned Numbers Authority, IANA). Alternatively, applications may conceal packets by masquerading as a different protocol through tunneling or even employ encryption to protect their payloads.

Secure Socket Layer (SSL) is a fundamental security protocol belonging to the application layer in the Internet Transmission Control Protocol / Internet Protocol (TCP/IP) model but residing below the higher level application protocols and above TCP. It enables e-commerce transactions and other applications to communicate securely over a public network by encrypting the packet payload. Despite its good intentions, SSL also creates a black hole of encrypted network traffic, which may be used for illegitimate or non-network sanctioned purposes. Generally speaking, the disadvantages brought about by encrypting traffic create a security trade-off between a secure protocol and losing knowledge about a network [4].

Past investigations into traffic classification have shown promising results primarily in the classification of unencrypted traffic with more recent explorations into encrypted traffic. These methods rely on the statistical patterns left behind by the packet attributes or flows to determine which application is within a given stream. Nevertheless, few have incorporated the use of SSL in their training, and those that have conglomerated the traffic together, treating it as a single label with no regard to the underlying application [5], [6], [7]. As such, the objective of this research is to investigate a statistical flow-based approach for identifying SSL traffic in a given network trace based on machine learning techniques, namely AdaBoost, C4.5, RIPPER, and Naive Bayesian, without using IP addresses, port numbers, and payload information. The generated machine learning models are then tested against an unseen dataset to demonstrate the robustness of the machine learning techniques.

In the rest of this paper, Section II summarizes the related work. Data sets employed and the methodology followed are presented in Section III and Section IV, respectively. Section V discusses the results, and conclusions are drawn and future work is given in Section VI.

II. BACKGROUND

A. SSL Overview

The concept behind SSL is to provide secure communication over a public network through the use of multiple algorithms for cryptography, digests, and signatures. By supporting this type of dynamic authentication, SSL servers are able to adapt to any legal obligations surrounding the use of cryptography by choosing which algorithms to use during the handshake. SSL is designed to be application independent, lying between the transport layer (specifically over TCP) and the application layer of the TCP/IP protocol stack. It was originally designed by Netscape to secure e-commerce transactions over the HyperText Transfer Protocol (HTTP). However, the more prominent HyperText Transfer Protocol Secure (HTTPS) is not the only application that can run over SSL. As stated by Bernaille et al. [8], other application protocols are realizing the need to encrypt and conceal their data from packet sniffers. They are implementing Application Programming Interfaces (APIs) to use back-end SSL libraries.
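As a rough illustration of how an application hands its traffic to a back-end SSL library, the sketch below uses Python's standard `ssl` module, which in current versions negotiates TLS, the successor of SSL. It is not taken from this work: the function names `describe_client_context` and `open_secure_connection` are hypothetical, and the second function is defined but never called, so no network access is needed to run the example.

```python
import socket
import ssl


def describe_client_context():
    """Build a default client-side context and summarise what it would offer."""
    context = ssl.create_default_context()   # client defaults: certificate checks enabled
    ciphers = context.get_ciphers()           # one dict per enabled cipher suite
    return {
        "verifies_certificates": context.verify_mode == ssl.CERT_REQUIRED,
        "checks_hostname": context.check_hostname,
        "cipher_names": [c["name"] for c in ciphers],
    }


def open_secure_connection(host, port=443):
    """Hypothetical helper: wrap a plain TCP socket in the secure layer.

    The handshake, during which the algorithms are agreed upon, happens
    inside wrap_socket(). This function is not invoked in this sketch.
    """
    context = ssl.create_default_context()
    raw = socket.create_connection((host, port))           # transport layer (TCP)
    return context.wrap_socket(raw, server_hostname=host)  # secure layer on top of TCP


summary = describe_client_context()
print(summary["verifies_certificates"], len(summary["cipher_names"]))
```

The application code above the wrapped socket reads and writes as it would on a plain TCP socket, which is the application independence described in this subsection.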
Training size: 6000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  71.56%     62.28%   58.79%          8.12%         63.86%
FPR SSL             0.28       0.37     0.40            0.94          0.36
FPR Non-SSL         0.02       0.02     0.02            0.01          0.01
Recall SSL          0.99       0.96     0.98            0.99          0.99
Recall Non-SSL      0.72       0.63     0.60            0.06          0.64

Training size: 12000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  71.86%     67.94%   70.50%          7.80%         70.30%
FPR SSL             0.28       0.32     0.29            0.95          0.30
FPR Non-SSL         0.01       0.01     0.01            0.01          0.01
Recall SSL          0.99       0.99     0.99            1             0.99
Recall Non-SSL      0.72       0.68     0.71            0.05          0.70

Training size: 500000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  87.45%     95.69%   85.13%          89.26%        82.59%
FPR SSL             0.12       0.04     0.14            0.11          0.17
FPR Non-SSL         0.01       0.02     0.01            0.01          0.01
Recall SSL          0.99       0.99     0.99            0.99          0.99
Recall Non-SSL      0.88       0.95     0.86            0.89          0.83

TABLE IV
FEATURES CHOSEN BY ADABOOST FOR THE SSL VS NON-SSL EXPERIMENTS

Training size: 6000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  98.38%     91.89%   98.14%          70.33%        87.49%
FPR SSL             0.01       0        0.01            0.04          0.01
FPR SSL-Tunnel      0.01       0.11     0.01            0.30          0.12
Recall SSL          0.99       0.89     0.99            0.70          0.88
Recall SSL-Tunnel   0.99       1        0.99            0.96          0.99

Training size: 12000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
Correctly Id.       98.98%     93.31%   98.59%          70.11%        97.33%
FPR SSL             0          0.01     0.01            0.02          0.02
FPR SSL-Tunnel      0.01       0.06     0.01            0.35          0.01
Recall SSL          0.99       0.94     0.99            0.65          0.99
Recall SSL-Tunnel   1          0.99     0.99            0.98          0.98

TABLE VI
RESULTS FROM VARIOUS TRAINING SET SIZES USING THE DIFFERENT PERFORMANCE METRICS FOR THE NON-SSL VS SSL-TUNNEL RUN.

Training size: 6000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  94.74%     93.88%   93.51%          9.69%         91.73%
FPR SSL-Tunnel      0.05       0.06     0.06            0.91          0.08
FPR Non-SSL         0          0        0               0             0
Recall SSL-Tunnel   1          1        1               1             1
Recall Non-SSL      0.95       0.94     0.94            0.09          0.92

Training size: 12000
Metric              AdaBoost   C4.5     AdaBoost C4.5   Naive Bayes   RIPPER
DR                  89.64%     93.54%   91.27%          10.43%        99.41%
FPR SSL-Tunnel      0.10       0.07     0.09            0.90          0.01
FPR Non-SSL         0          0        0               0             0
Recall SSL-Tunnel   1          1        1               1             1
Recall Non-SSL      0.9        0.93     0.91            0.1           1
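
To make the per-class metrics in the result tables easier to read, the following sketch computes recall and false positive rate (FPR) for each class from a confusion matrix, using the standard definitions recall = TP / (TP + FN) and FPR = FP / (FP + TN). The counts are hypothetical and are not the data behind the tables. DR is left out because this excerpt does not define how it is computed.

```python
def per_class_metrics(confusion, labels):
    """Return recall and FPR per class.

    confusion[actual][predicted] holds the number of flows of class
    `actual` that the classifier labelled as `predicted`.
    """
    total = sum(confusion[a][p] for a in labels for p in labels)
    metrics = {}
    for c in labels:
        tp = confusion[c][c]                                  # class c, labelled c
        fn = sum(confusion[c][p] for p in labels if p != c)   # class c, labelled otherwise
        fp = sum(confusion[a][c] for a in labels if a != c)   # other class, labelled c
        tn = total - tp - fn - fp                             # everything else
        metrics[c] = {"recall": tp / (tp + fn), "fpr": fp / (fp + tn)}
    return metrics


# Hypothetical counts, chosen only for illustration.
confusion = {
    "SSL": {"SSL": 99, "Non-SSL": 1},
    "Non-SSL": {"SSL": 28, "Non-SSL": 72},
}
result = per_class_metrics(confusion, ["SSL", "Non-SSL"])
for label, values in result.items():
    print(label, round(values["recall"], 2), round(values["fpr"], 2))
```

With only two classes, the FPR of one class equals one minus the recall of the other. The table rows appear to follow this relationship up to rounding, for example an SSL FPR of 0.28 alongside a Non-SSL recall of 0.72.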

best performing model from the SSL vs Non-SSL experiments was chosen to be tested on the NIMS data set. The results showed that 95% of the flow records were correctly identified as either SSL or Non-SSL traffic with a 0.5% SSL FPR, clearly indicating the robustness of the ML algorithm across different datasets. The accuracy achieved despite the different network setup, background noise, and applications used demonstrates that the flow features chosen by AdaBoost to identify SSL applications are generalizable enough to recognize traffic on previously unseen datasets.

VI. CONCLUSION

In conclusion, the investigation into classifying applications encrypted by SSL achieved promising results using flow-based statistics and ML algorithms. This was accomplished without employing IP addresses, port numbers, or packet payloads. To this end, we have generated and captured data sets in a lab environment and benchmarked the performances of AdaBoost, C4.5, RIPPER, and Naive Bayesian learning algorithms to identify SSL traffic.

The generated dataset represented not only real network traffic without imposing restrictions on SSL encryption algorithms, but also ensured the ground truth during the labeling process, and included the use of tunnels to increase the overall entropy of the dataset. We have made the data set public to the research community to encourage further research in the area. Additionally, investigation on the sizes of the training set helped to further optimize based on FPRA. The comparison of the results from the multiple class runs against recent literature as well as a popular traffic analysis tool further solidifies the methodology taken.

For SSL vs Non-SSL, AdaBoost proved to have the best classification performance with a 96% classification accuracy and a 4% SSL FPR. In the case of the native SSL vs SSL-Tunnel, a modified version of AdaBoost using C4.5 decision trees instead of decision stumps performed the best with a 98% classification accuracy and 0.6% SSL FPR. Finally, the ML algorithm best able to distinguish the Non-SSL vs SSL-Tunnel class run was RIPPER, achieving a 99% classification accuracy and 0.5% SSL-Tunnel FPR. In general, it was found that AdaBoost maintained the highest overall performance across all the different class runs. While the training set sizes varied amongst all the runs, it can be seen that without adequate representation from an application instance, the classifiers will be unable to perform well during testing. Comparing these results to the Wireshark traffic analysis tool confirmed the shortcomings of payload inspection methods and the inability to rely on port numbers for classification.

There are several areas for future work, given that this work is only investigative in nature. To this end, exploring alternative ML algorithms and other data sets are the next natural directions. Furthermore, with the changes implemented between IPv4 and IPv6, it would be interesting to see how much (if at all) the results achieved in this work may be affected. Further investigation on parameter sensitivity and the effect it has on the ML algorithms may offer more explanation as to why the classifiers acted the way they did.
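
To make the notion of flow-based statistics that avoid IP addresses, port numbers, and payloads more concrete, here is a minimal sketch of deriving a few per-flow features from packet timestamps, sizes, and directions only. The feature names and the `flow_features` function are hypothetical illustrations. They are not the feature set or the tooling used in this work.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarise one flow from (timestamp_seconds, size_bytes, direction) tuples.

    direction is "fwd" or "bwd". No addresses, ports, or payload bytes
    are consulted, only timing, size, and direction.
    """
    packets = sorted(packets)                         # order by timestamp
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    fwd = [s for _, s, d in packets if d == "fwd"]
    bwd = [s for _, s, d in packets if d == "bwd"]
    return {
        "duration": times[-1] - times[0],
        "total_packets": len(packets),
        "fwd_packets": len(fwd),
        "bwd_packets": len(bwd),
        "mean_packet_size": mean(sizes),
        "std_packet_size": pstdev(sizes),
        "mean_inter_arrival": mean(gaps) if gaps else 0.0,
    }


# Hypothetical four-packet flow used only to exercise the function.
example = [(0.00, 120, "fwd"), (0.05, 1500, "bwd"), (0.06, 1500, "bwd"), (0.20, 80, "fwd")]
features = flow_features(example)
print(features["total_packets"], features["mean_packet_size"])
```

A vector of such values per flow, paired with a class label, is the kind of input a classifier such as AdaBoost or C4.5 could be trained on.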

ACKNOWLEDGMENT

This work was supported by NSERC. Our thanks to Dana Echtner, Jeff Allen, Krista Skodje, and TARA for providing us the lab environment to generate our data sets. This research was conducted at the Dalhousie NIMS Laboratory, http://www.cs.dal.ca/projectx.

REFERENCES

[1] R. Alshammari and A. N. Zincir-Heywood, “Machine learning based encrypted traffic classification: Identifying SSH and Skype,” IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 8–8, 2009.
[2] R. Alshammari and A. Zincir-Heywood, “Investigating two different approaches for encrypted traffic classification,” Proceedings - 6th Annual Conference on Privacy, Security and Trust, pp. 156–166, 2008.
[3] R. Alshammari and A. N. Zincir-Heywood, “Generalization of signatures for SSH encrypted traffic identification,” Proceedings - IEEE Symposium on Computational Intelligence in Cyber Security, pp. 174–174, 2009.
[4] A. Yamada, Y. Miyake, K. Takemori, A. Studer, and A. Perrig, “Intrusion detection for encrypted web accesses,” 21st International Conference on Advanced Information Networking and Applications Workshops/Symposia, vol. 2, pp. 569–576, 2007.
[5] A. Este, F. Gringoli, and L. Salgarelli, “Support vector machines for TCP traffic classification,” Computer Networks, vol. 53, no. 14, pp. 2476–2490, 2009.
[6] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Offline/realtime traffic classification using semi-supervised learning,” Performance Evaluation, vol. 64, no. 9-12, pp. 1194–1213, 2007.
[7] A. Callado, J. Kelner, D. Sadok, C. A. Kamienski, and S. Fernandes, “Better network traffic identification through the independent combination of techniques,” Journal of Network and Computer Applications, vol. 33, no. 4, pp. 433–446, 2010.
[8] L. Bernaille and R. Teixeira, “Early recognition of encrypted applications,” Passive and Active Network Measurement, vol. 4427, pp. 165–175, 2007.
[9] M. Trojnara, “Stunnel: SSL tunnel,” www.stunnel.org. Accessed June 2010.
[10] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, ISBN: 059600270X, O’Reilly & Associates, Inc., 2002.
[11] IETF, RFC 2246, http://www.ietf.org/rfc/rfc2246.txt. Accessed June 2010.
[12] D. M. Nicol and N. Schear, “Models of privacy preserving traffic tunneling,” Simulation, vol. 85, no. 9, pp. 589–607, 2009.
[13] N. Schear and D. M. Nicol, “Performance analysis of real traffic carried with encrypted cover flows,” Workshop on Principles of Advanced and Distributed Simulation, pp. 80–87, 2008.
[14] L. Qing and L. Yaping, “Analysis and comparison of several algorithms in SSL/TLS handshake protocol,” ITCS ’09: Proceedings of the 2009 International Conference on Information Technology and Computer Science, pp. 613–617, 2009.
[15] L. Zhao, R. Iyer, S. Makineni, and L. Bhuyan, “Anatomy and performance of SSL processing,” IEEE International Symposium on Performance Analysis of Systems and Software, pp. 197–206, 2005.
[16] F. Allard, R. Dubois, P. Gompel, and M. Morel, “Tunneling activities detection using machine learning techniques,” NATO Research and Technology Organization Symposium on Information Assurance and Cyber Defence, 2010.
[17] A. B. Mohd and D. S. bin Mohd Nor, “Towards a flow-based internet traffic classification for bandwidth optimization,” International Journal of Computer Science and Security, vol. 3, pp. 146–153, 2009.
[18] R. Yuan, Z. Li, X. Guan, and L. Xu, “An SVM-based machine learning method for accurate internet traffic classification,” Information Systems Frontiers, vol. 12, pp. 149–156, 2010.
[19] M. Soysal and E. G. Schmidt, “Machine learning algorithms for accurate flow-based network traffic classification: Evaluation and comparison,” Performance Evaluation, vol. 67, no. 6, pp. 451–467, 2010.
[20] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, “Detecting HTTP tunnels with statistical mechanisms,” IEEE International Conference on Communications, pp. 6162–6168, 2007.
[21] M. Dusi, M. Crotti, F. Gringoli, and L. Salgarelli, “Tunnel hunter: Detecting application-layer tunnels with statistical fingerprinting,” Computer Networks, vol. 53, no. 1, pp. 81–97, 2009.
[22] R. Alshammari, A. Zincir-Heywood, and A. Farrag, “Performance comparison of four rule sets: An example for encrypted traffic classification,” Privacy, Security, Trust and the Management of e-Business, pp. 21–28, 2009.
[23] R. Alshammari and A. N. Zincir-Heywood, “Investigating two different approaches for encrypted traffic classification,” Proceedings - IEEE Conference on Privacy, Security and Trust, pp. 156–166, 2008.
[24] A. Montigny-Leboeuf, “Flow attributes for use in traffic characterization,” Journal of CRC Technical Note, CRCTN-2005-003, Ottawa, ON, Canada, 2005.
[25] N. Williams, S. Z, and G. Armitage, “A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification,” Computer Communication Review, vol. 30, 2006.
[26] D. J. Barrett and R. E. Silverman, SSH, The Secure Shell: The Definitive Guide, ISBN: 978-0-596-00011-0, O’Reilly & Associates, Inc., 2001.
[27] C. V. Wright, F. Monrose, and G. M. Masson, “On inferring application protocol behaviors in encrypted network traffic,” Journal of Machine Learning Research, vol. 7, no. 12, pp. 2745–2769, 2006.
[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, 2009.
[29] A. Dupay, S. Sengupta, O. Wolfson, and Y. Yemini, “Netmate: A network management environment,” Network, IEEE, vol. 5, no. 2, pp. 35–40, 43, Mar. 1991.
[30] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Proceedings of the Second European Conference on Computational Learning Theory. London, UK: Springer-Verlag, 1995, pp. 23–37.
[31] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, ISBN: 0137903952, Pearson Education, 2003.
[32] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[33] W. W. Cohen, “Fast effective rule induction,” in Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 115–123.