In this paper, we propose a novel protocol format reverse engineering framework that automatically detects unknown protocols in captured binary data. Our framework contains five components: data capturing, frame location, frequency finding, association analysis, and protocol format inference. Data capturing intercepts binary data from wireless communication channels and converts it into a bit stream. Frame location then segments the bit stream by identifying the preamble. Next, frequency finding extracts the frequent sequences, and association analysis uses association rules to analyze the probabilistic relationships among them. Finally, protocol format inference extracts and identifies the potential unknown protocols.
Our main contributions are:
- We propose a protocol format reverse engineering framework that automatically analyzes the binary data captured from wireless communication channels.
- We design a new algorithm to identify unknown protocols in binary data by utilizing feature sequences and association rules.
- We carry out several experiments in real-world wireless environments to verify our framework's validity.
The rest of the paper is organized as follows. Section II is dedicated to the related work. Section III shows the system architecture. Our algorithms are introduced in Section IV. In Section V, we carry out experiments to evaluate our framework. We draw brief conclusions in Section VI.
Abstract: With the wide deployment of wireless networks, attackers may exploit Wi-Fi network vulnerabilities to transfer data secretly, or use covert communication channels to spread malicious code. Protocol format reverse engineering can be used to detect such attacks. However, previous work focuses on application-layer protocol analysis and can hardly work when the captured data is available only in binary form, because such data lacks semantics. In this paper, we propose a novel protocol format reverse engineering framework that uses the association rules of feature sequences to identify unknown protocols in captured binary data. We first convert the captured binary data into a bit stream and segment it into frames. An improved AC algorithm is then adopted to analyze the binary sequences, after which we extract the feature sequences and analyze their association rules to detect potential unknown protocols. The experimental results show that our framework can identify 100% of ARP packets and 98% of ICMP packets in captured binary data.
Index Terms: binary analysis; protocol formats; association rules; wireless network
I. INTRODUCTION
The wireless network has become one of the most popular ways to access the Internet. Because many application protocols on the Internet are proprietary and have no publicly released specifications, protocol format reverse engineering can help to detect these unknown protocols. It is widely used in security applications, for example in intrusion prevention and detection systems that perform deep packet inspection, and in penetration testing that generates network inputs to an application to detect potential vulnerabilities. It can also be applied to identify protocols and tunneling in monitored network traffic.
Traditional methods are mainly based on manual work, which is time-consuming and error-prone. For example, manually reverse engineering the Microsoft Server Message Block (SMB) protocol took 12 years in the open source SAMBA project. A closed protocol has a large number of fields to parse, and there may be complex relationships among them. Some researchers have proposed automatic protocol format reverse engineering techniques, such as Biprominer [1] and ProDecoder [2]. These systems mainly focus on application-layer protocols. For binary protocols they can hardly work, because semantic information is lacking; furthermore, it is difficult to distinguish identical binary sequences that come from different protocol messages.
978-0-7695-5022-0/13 $26.00 2013 IEEE
DOI 10.1109/TrustCom.2013.21
III. PRELIMINARIES
A. Design Goals
The basic goal of our framework is to identify the unknown
protocol accurately from captured binary data in wireless
environments.
First, we need to locate the frames in the bit stream in order to identify the protocol formats. Frames of the same protocol may not be equal in length, so frame location methods based on length are not applicable. Instead, the preamble is used to identify the start position of each frame.
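Preamble-based segmentation can be illustrated with a minimal Python sketch. It assumes the bit stream is held as a string of '0' and '1' characters and that the preamble is already known; the function name `split_frames` and the sample values are hypothetical and are not taken from the framework itself.

```python
def split_frames(bit_stream: str, preamble: str) -> list[str]:
    """Split a bit string into frames, each starting at one preamble occurrence."""
    starts = []
    pos = bit_stream.find(preamble)
    while pos != -1:
        starts.append(pos)
        # Resume the search after the current preamble so matches do not overlap.
        pos = bit_stream.find(preamble, pos + len(preamble))
    # Each frame runs from one preamble start to the next, or to the end of the stream.
    ends = starts[1:] + [len(bit_stream)]
    return [bit_stream[s:e] for s, e in zip(starts, ends)]


# Hypothetical stream: two leading noise bits, then two frames of different length.
frames = split_frames("00" + "1111" + "0110" + "1111" + "001", "1111")
print(frames)
```

Because the frames are delimited by the preamble rather than by a fixed size, frames of unequal length are handled naturally. A payload that happens to contain the preamble pattern would cause a spurious split, which this sketch does not attempt to resolve.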
Second, we focus on an efficient method of extracting feature sequences. Our framework scans the raw data bit by bit, so traditional multi-pattern matching algorithms are not suitable. For the sake of efficiency, we introduce an improved AC algorithm, which scans the input data bit by bit and records all frequent sequences.
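The idea of recording frequent sequences at bit granularity can be sketched as follows. This is deliberately not the improved AC algorithm: it is a simple sliding-window count, used here only as a stand-in to show what "frequent" means under a minimum support threshold. The fixed window width and the names `frequent_sequences`, `width`, and `supp_min` are assumptions made for this illustration.

```python
from collections import Counter


def frequent_sequences(frames: list[str], width: int, supp_min: float) -> dict[str, float]:
    """Return bit sequences of a given width whose support is at least supp_min.

    Support is the fraction of frames that contain the sequence at any bit offset.
    """
    counts = Counter()
    for frame in frames:
        # A set ensures each sequence is counted at most once per frame.
        windows = {frame[i:i + width] for i in range(len(frame) - width + 1)}
        counts.update(windows)
    total = len(frames)
    return {seq: n / total for seq, n in counts.items() if n / total >= supp_min}


# Hypothetical frames: only "1101" occurs in all three of them.
result = frequent_sequences(["110100", "110111", "001101"], width=4, supp_min=0.6)
print(result)
```

A real implementation would need to handle sequences of varying length and avoid materializing every window, which is the kind of cost an automaton-based scan is meant to reduce.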
Finally, the method based on association rules addresses the problem that feature sequences alone may not identify an unknown protocol accurately enough. Owing to the nature of the bit stream, there may be identical sequences which have
IV. OUR FRAMEWORK
We employ the improved AC algorithm in frame location and frequency finding, the approximate string matching algorithm [14]-[16] to compress the set of feature sequences, and the FP-Growth algorithm [17] in association analysis.
A. The Data Capturing
The inputs of our framework are signals intercepted from a wireless network. We convert these signals into binary data, which is treated as the bit stream. This bit stream contains many frames, and each frame has multiple feature sequences, or keywords, defined in the specification of the protocol. A keyword is essentially a binary sequence of arbitrary length. For example, the keywords used in the ARP protocol include 0x0806, 0x0001, 0x0800, etc.
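To make the notion of a keyword in a bit stream concrete, the following Python sketch converts bytes into a bit string and reports every bit offset at which a keyword occurs. The helper names `to_bit_stream` and `find_keyword` are hypothetical, and the sample bytes are a hand-built illustration that merely contains the ARP keywords listed above; they are not a captured frame.

```python
def to_bit_stream(data: bytes) -> str:
    """Render bytes as a string of '0'/'1' characters, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


def find_keyword(bit_stream: str, keyword: bytes) -> list[int]:
    """Return every bit offset at which the keyword occurs (overlaps allowed)."""
    pattern = to_bit_stream(keyword)
    offsets = []
    pos = bit_stream.find(pattern)
    while pos != -1:
        offsets.append(pos)
        pos = bit_stream.find(pattern, pos + 1)
    return offsets


# Hypothetical sample: six 0xff bytes followed by the keywords 0x0806, 0x0001, 0x0800.
bits = to_bit_stream(bytes.fromhex("ffffffffffff" "0806" "0001" "0800"))
print(find_keyword(bits, bytes.fromhex("0806")))
print(find_keyword(bits, bytes.fromhex("0800")))
```

Searching at bit offsets rather than byte offsets matters because a segmented frame is not guaranteed to be byte-aligned.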
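\[
\mathrm{sim}(A, B) = \frac{\mathrm{length}(x, y) - C_{|x||y|}}{\mathrm{length}(x, y)}
\]
Where
\[
\mathrm{length}(x, y) = \frac{\mathrm{length}(x) + \mathrm{length}(y)}{2}.
\]
The $C_{|x||y|}$ is computed as follows:
\[
C_{0,i} = i, \quad C_{j,0} = j
\]
\[
C_{i,j} = C_{i-1,j-1}, \quad \text{if } P_i = T_j
\]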
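The similarity measure can be sketched in Python as follows. The recurrence given in the text covers only the boundary values and the case of matching symbols; this sketch assumes that the remaining case is the standard edit-distance step, 1 + min(deletion, insertion, substitution), which is an assumption on our part. The function names are hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance between x and y, computed row by row."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1]  # matching symbols cost nothing
            else:
                # Assumed mismatch case: cheapest of deletion, insertion, substitution.
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(y)]


def similarity(x: str, y: str) -> float:
    """(mean length - edit distance) / mean length."""
    mean_length = (len(x) + len(y)) / 2
    if mean_length == 0:
        return 1.0  # two empty sequences are treated as identical
    return (mean_length - edit_distance(x, y)) / mean_length


print(similarity("1010", "1011"))
```

Two sequences would then be merged into one approximate set when their similarity reaches the chosen minimum similarity threshold.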
A. Evaluation Parameters
In the evaluation experiments, we define the following three sets:
1) True Positives: the set of type X frames in which each frame matches an association rule generated by our framework.
2) False Positives: the set of non-type-X frames in which each frame matches an association rule generated by our framework.
3) False Negatives: the set of type X frames in which each frame cannot match an association rule generated by our framework.
Next, the following two parameters are defined to quantitatively evaluate the effectiveness of our framework.
\[
\mathrm{precision} = \frac{|TruePositives|}{|TruePositives| + |FalsePositives|}
\]
\[
\mathrm{recall} = \frac{|TruePositives|}{|TruePositives| + |FalseNegatives|}
\]
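The two evaluation parameters can be computed directly from the sizes of the three sets. The following Python sketch uses a hypothetical helper `precision_recall`, and the counts in the example are made up for illustration; they are not experimental results.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from the sizes of the three evaluation sets."""
    # Guard against empty denominators, e.g. when no frame matches any rule.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up counts: 90 true positives, 10 false positives, 30 false negatives.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```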
B. Experiment Setup
In the frame location experiment, we use USRP to intercept Beacon frames sent by a single AP (Access Point). In the frequency finding and association analysis experiment, ARP and ICMP are chosen as the target protocols. In this experiment, we capture the data with Wireshark and convert it into binary form to simulate the data frames segmented in the frame location experiment. Our data set consists of 6000 ARP packets totaling 2.66 MB, 6000 ICMP packets totaling 3.54 MB, and 10000 non-ARP and non-ICMP packets totaling 10 MB. The average packet lengths are also small for both SMTP and SMB protocols because they mostly consist of command codes rather than payload data. We use 90% of the packet traces for training and the remaining 10% for measuring the precision and recall of our framework.
1) $Supp_{min}$ is used in the improved AC algorithm.
2) $Sim_{min}$ is used in the approximate string matching algorithm.
3) $Conf_{min}$ is used in the FP-Growth algorithm.
After a number of experiments with different parameter values, we set $Supp_{min} = 0.6$, $Sim_{min} = 0.8$, and $Conf_{min} = 0.9$.
\[
\mathrm{Conf}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}
\]
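The ratio of supports above can be worked through on a small example. The Python sketch below represents each frame as the set of feature sequences found in it; this representation, the helper names `support` and `confidence`, and the sample frames are assumptions made for illustration, and the sketch does not implement FP-Growth itself.

```python
def support(frames: list[set[str]], items: set[str]) -> float:
    """Fraction of frames that contain every feature sequence in items."""
    return sum(items <= frame for frame in frames) / len(frames)


def confidence(frames: list[set[str]], antecedent: set[str], consequent: set[str]) -> float:
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    supp_x = support(frames, antecedent)
    if supp_x == 0:
        return 0.0  # the rule never fires, so its confidence is taken as zero
    return support(frames, antecedent | consequent) / supp_x


# Hypothetical frames, each reduced to the feature sequences it contains.
frames = [
    {"0x0806", "0x0001"},
    {"0x0806", "0x0001"},
    {"0x0806"},
    {"0x0800"},
]
print(support(frames, {"0x0806"}))
print(confidence(frames, {"0x0806"}, {"0x0001"}))
```

In this example the rule has support 0.75 for its antecedent and 0.5 for both sides together, so its confidence is about 0.67 and it would be discarded under a minimum confidence of 0.9.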
preamble sequence will be in this approximate set. Therefore, according to Table I, we can determine the true preamble with prior knowledge and locate the frames.
Length (bits)   Sequences
112             0xffffffffffffffffffffffffe0b9
104             0xffffffffffffffffffffffff82
136             0xfffffffffffffffffffffffffff05cf0a0
144             0xfffffffffffffffffffffffffffff05cf0a0
128             0xffffffffffffffffffffffffc173c281
120             0xfffffffffffffffffffffff05cf0a0
C. Experiment Results
1) The Frame Location Experiment: The experimental results are shown in Table I and Fig. 2. Table I shows the approximate preamble sequences we found. We consider that the true
Length: 28, 24, 24, 48, 44, 20, 20, 20, 20, 24, 24
Frequency: 100%, 100%, 98.6%, 98%, 100%, 68.2%, 100%, 100%, 100%, 37.6%, 100%
No: A, B, C, D, E, F, G, H, I, J
Length: 28, 24, 24, 48, 44, 20, 20, 20, 20, 24, 24
Length: 28, 24, 24, 44, 20, 20, 20, 20, 24, 24
Frequency: 100%, 100%, 100%, 100%, 100%, 34.9%, 100%, 100%, 100%, 32.5%, 100%
Frequency: 100%, 30.2%, 30.2%, 78.2%, 52.9%, 100%, 100%, 100%, 25.2%, 100%

NO.      Association rule
Asso.A   0x00000000000 => 0x0018100
Asso.B   0x020000 => 0x0018100
Asso.C   0x040001 => 0x0018100
Asso.D   0xffffffffffff => 0x0018100
Asso.E   0x00000000000,020000,040001,ffffffffffff => 0x0018100
Fig. 7: The Precision Rate of Association rules in ARP broadcast packets
Fig. 8: The Precision Rate of Association rules in ARP unicast packets
NO.      Association rule
Asso.A   0x80042 => 0x0018100
Asso.B   0x10800 => 0x0018100
Asso.C   0x42000 => 0x0018100
Asso.D   0x08400 => 0x0018100
Asso.E   0x10800,42000,08400,80042 => 0x0018100
4) The ICMP Experiment: Table VII lists some of the association rules in ICMP. Using the same method as for ARP, we compute the precision and recall to evaluate the experiment.
REFERENCES
Association rule
0x616263646566 => 0x2c4c6c8cacc
0x131b232b333b43 => 0x616263646566
0x131b232b333b43 => 0x2c4c6c8cacc
0x131b232b333b43 => 0x616263646566,2c4c6c8cacc