Académique Documents
Professionnel Documents
Culture Documents
s
W
E
E E
Figure 4. 'At risk time distribution Ior single AVs
IV. MODELLING THE IMPACT OF ADDITIONAL AVS
An interesting characteristic oI the data shown in Table
IV is the asymptotic reduction in the proportion oI
'imperIect systems, (i.e. systems that Iail on one or more oI
the malware). A simple exponential model oI the decrease
with the number oI AVs (N) in the diverse system would
estimate the proportion as:
P(faulty , N) p
N
where p is the probability that a single AV Iails to detect all
the malware.
In practice this proved to be a very poor Iit to the
observed proportions. However an empirically derived
hyper-exponential model oI the Iorm:
P(faulty , N) (p
N
)
N/2
proved to be a remarkably good Iit to the observed data as
shown in the Figure 5 below.
Effect of increasing diversity by adding
AVs
1.E-07
1.E-06
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
1.E+00
0 2 4 6 8 10 12 14 16
Number of diverse AVs (N)
P
r
o
b
.
1
o
o
N
s
y
s
t
e
m
f
a
i
l
s
o
n
a
t
l
e
a
s
t
o
n
e
d
e
m
a
n
d
Observed Exponential Hyper-exponential
Figure 5. EIIect oI increasing diversity by adding AVs
The results show that the proportion oI imperIect systems
decreases Iar more rapidly than a simple exponential
distribution would indicate. For example, increasing N Irom
8 to 10 decreases the chance oI having a 1ooN system that
Iails by an order oI magnitude.
V. DISCUSSION
A. Confiaence on the Best Case Observea Failure Rate
Any claim on the best case (i.e. lowest) Iailure rate we
can hope Ior, using this dataset, will be in the region oI 10
-4
to 10
-5
. This is because we have a maximum oI 47,970
demands sent to any given AV (as we explained earlier).
This is the upper bound on the observed Iailure rate. To
make claims about lower Iailure rates (i.e. smaller than 10
-5
)
we need more data or, in other words, we must do more
testing. The theoretical background behind these intuitive
results is given in |15|
6
.
But the Iact that we are seeing only, Ior instance, 5 out oI
almost 10 million diIIerent combinations oI 1oo14 systems
Iail on this dataset is telling us something about the
improvements in detection rates Irom using diversity (cI.
Table IV), even iI we cannot claim a Iailure rate lower than
10
-5
. We may interpret the 0.999999 proportion oI 1oo14
systems with a perIect Iailure rate as a 99.9999 confiaence
we have in the claim that, for this aataset, 'the average
6
Section III oI |15| contains a discussion oI the limits on claims
we can make about reliability when statistical testing reveals no
Iailures. Both classical and Bayesian inIerence methods are
described.
16
failure rate of a 1oo14 aiverse AJ system constructea from
ranaomly selecting 14 AJs from a pool of 26 AJ proaucts,
will be no worse than ~ 4.7 * 10
-5
`. This may or may not
persuade a given system administrator to take on the
additional cost oI buying this extra security: this will clearly
depend on their particular requirements. But in some cases,
especially regulated industries, there may be a legal
requirement to show actually achieved Iailure rates with a
given level oI conIidence. It may not be Ieasible to
demonstrate those Iailure rates with the associated
conIidence Ior single AV products without extensive testing.
The claim above is limited by the level oI diversity we
can expect between various diverse conIigurations. We
'only have 26 AV products (even though this may represent
a large chunk oI the available commercial solutions
available). Hence, Ior instance, even though we have over
65,000 diIIerent combinations oI 1oo5 systems using 26
AVs, the level oI diversity that exists between these diIIerent
1oo5 systems may be limited in some cases. For example, a
1oo5 system which consists oI AV1, AV2, AV3, AV4 and
AV5, is not that much diIIerent Irom a 1oo5 system
consisting oI AV1, AV2, AV3, AV4 and AV6. Hence care
must be taken Irom generalising the level oI conIidence Irom
these large numbers without considerations oI these subtle
concepts on the sample space oI possible systems.
B. Limitations of the Stuay ana the Dataset
There are two major limitations oI the dataset we have
used which prevent us Irom making more generalised
conclusions on the possible beneIits oI diversity.
The Iirst is to do with the time in which the data has been
collected. As we mentioned beIore the malware samples in
our study were collected Irom the SGNET network in 2008,
hence these malware samples do not accurately reIlect the
current prevalence oI malware in 2011. However we should
stress that we are matching like with like: i.e. malware
samples and AVs at some snapshot in time, in this case 2008.
We plan to do Iurther work with malware which are
prevalent in 2011 and the detection capabilities oI 2011 AVs
when they inspect these malware. This will give us results on
the consistency, or not, oI the detection capabilities oI
diverse AVs in diIIerent snapshots in time.
The second limitation is due to the nature oI the data set:
we only have conIirmed malware. Hence we can measure
Iailure to detect genuine malware (false negatives), but we
could not measure cases where benign Iiles are incorrectly
identiIied as malware (false positives). False positive rate is
an important measure when evaluating the eIIectiveness oI
any detection system, including AV systems. OI course we
could have artiIicially created a large set oI non-malicious
Iiles which we could have sent to the AV systems and
obtained Ialse positive rates this way. But this can just as
easily be criticised due to the lack oI representativeness oI
the chosen non-malicious Iiles. Choosing representative test
loads and deIining what constitutes a Ialse positive is a
matter oI some debate in the AV community (see discussion
in |10| and |11| Ior more details). We plan to study the Ialse
positive aspects in Iuture work.
C. Discussion of the Hyper-Exponential Moael
There is no immediate explanation Ior the surprisingly
good Iit oI the 1ooN system detection capability to the
hyper-exponential model, but we suspect it is related to the
Central Limit Theorem where individual variations are
smoothed out to produce a normal distribution (Ior a sum oI
distributions), or a log-normal distribution (Ior the product oI
distributions). In our case, the hyper-exponential distribution
could be intermediate between these two extremes.
Furthermore, it might be that the asymptotic nature oI
exceedances and extreme values can be used to justiIy a
general hyper exponential or Gumbel distribution. This is an
area that requires more investigation.
II the hyper-exponential distribution is shown to be
generic, it would be a useIul means Ior predicting the
expected detection rates Ior a large 1ooN AV system based
on measurements made with simpler 1oo2 or 1oo3
conIigurations.
VI. RELATED WORK
A. Architectures that Utilise Diverse AntiJirus Proaucts
In this paper we have presented the potential gains in
detection capability that can be achieved by using diverse
AVs. Security proIessionals may be interested in the
architectures that enable the use oI diverse AV products.
The Iollowing publication |5| has provided an initial
implementation oI an architecture called Cloud-AV, which
utilises multiple diverse AntiVirus products. The Cloud-AV
architecture is based on the client-server paradigm. Each host
machine in a network runs a host service which monitors the
host and Iorwards suspicious Iiles to a centralised network
service. This centralised service uses a set oI diverse
AntiVirus products to examine the Iile, and based on the
adopted security policy makes a decision regarding
maliciousness oI the Iile. This decision is then Iorwarded to
the host. To improve perIormance, the host service adds the
decision to its local repository oI previously analysed Iiles.
Hence, subsequent encounters oI the same Iile by the host
will be decided locally. The implementation detailed in |5|
handles executable Iiles only. A study with a deployment oI
the Cloud-AV implementation in a university network over a
six month period is given in |5|. For the executable Iiles
observed in the study, the network overhead and the time
needed Ior an AntiVirus engine to make a decision are
relatively low. This is because the processes running on the
local host, during the observation period, could make a
decision on the maliciousness oI the Iile in more than 99 oI
the cases that they had to examine a Iile. The authors
acknowledge that the perIormance penalties might be
signiIicantly higher iI the types oI Iiles that are examined
increases as well as iI the number oI new Iiles that are
observed on the host is high (since the host will need to
Iorward the Iiles Ior examination to the network service
more oIten).
Another implementation |6| is a commercial solution Ior
e-mail scanning which utilises diverse AntiVirus engines.
17
B. Other Relatea Empirical Work with AJ Proaucts
Empirical analyses oI the beneIits oI diversity with
diverse AV products are scarce. CloudAV |5| and our
previous paper |1| are the only examples we know oI
published research.
Studies which perIorm analysis oI the detection
capabilities and rank various AV products are, on the other
hand, much more common. A recent publication |8| reports
results on ranking oI several AV products and also contains
an interesting analysis oI 'at risk time Ior single AV
products. Several other sites
7
also report rankings and
comparisons oI AV products, though one must be careIul to
look at the criteria used Ior comparison and what exactly was
the 'system under test to compare the results oI diIIerent
reports.
VII. CONCLUSIONS
The analysis proposed in this work is an assessment oI
the practical impacts oI the application oI diversity in a real
world scenario based on realistic data generated by a
distributed honeypot deployment. As shown in |11|, the
comprehensive perIormance evaluation oI AntiVirus engines
is an extremely challenging, iI not impossible, problem. This
work does not aim at providing a solution to this challenge,
but builds upon it to clearly deIine the limits oI validity oI its
measurements.
The perIormance analysis oI the signature-based
components showed a considerable variability in their ability
to correctly detect the samples considered in the dataset.
Also, despite the generally high detection rate oI the
detection engines, none oI them achieved 100 detection
rate. The detection Iailures were both due to the lack oI
knowledge oI a given malware at the time in which the
samples were Iirst detected, but also due to regressions in the
ability to detect previously known samples as a consequence,
possibly, oI the deletion oI some signatures.
The diverse perIormance oI the detectors justiIied the
exploitation oI diversity to improve the detection
perIormance. We calculated the Iailure rates oI all the
possible diverse systems in various 1-out-oI-n (with n
between 2 and 26) setups that can be obtained Irom the best
26 individual AVs in our dataset (we discarded the worst 6
AVs Irom our initial set oI 32 AV products, as they exhibited
extremely poor detection capability). The main results can be
summarised as Iollows:
x Considerable improvements in detection rates can be
gained Irom employing diverse AVs;
x No single AV product detected all the malware in
our study, but almost 25 oI all the diverse pairs,
and over 50 oI all triplets successIully detected all
the malware;
x In our dataset, no malware causes more than 14
diIIerent AVs to Iail on any given date. Hence we
get perIect detection rates, with this dataset, by using
15 diverse AVs;
7
http://av-comparatives.org/, http://av-test.org/,
http://www.virusbtn.com/index
x We observe a law oI diminishing returns in the
proportion oI systems that successIully detect all the
malware as we move Irom 1-out-oI-8 diverse
conIigurations to more diverse conIigurations;
x SigniIicant potential gains in reducing the 'at risk
time oI a system Irom employing diverse AVs:
even in cases where AVs Iail to detect a malware,
there is diversity in the time it takes diIIerent
vendors to successIully deIine a signature Ior the
malware and detect it;
x The analysis oI 'at risk time is a novel contribution
compared with traditional analysis oI beneIits oI
diversity Ior reliability: analysis and modelling oI
diversity has usually been in terms oI demands (i.e.
in space) and the time dimension was not usually
considered;
x An empirically derived hyper-exponential model
proved to be a remarkably good Iit to the proportion
oI systems in each diverse setup that had a zero
Iailure rate. We plan to do Iurther analysis with new
datasets and iI the hyper-exponential distribution is
shown to be generic, it would be a useIul means Ior
predicting the expected detection rates Ior a large
1ooN AV system based on measurements made with
simpler 1oo2 or 1oo3 conIigurations.
There are several provisions Ior Iurther work:
x As we stated in the introduction, there are many
diIIiculties with constructing meaningIul
benchmarks Ior the evaluation oI the detection
capability oI diIIerent AntiVirus products (see |11|
Ior a more elaborate discussion). Modern AntiVirus
products comprise a complex architecture oI
diIIerent types oI detection components, and achieve
higher detection capability by combining together
the output oI these diverse detection techniques.
Since some oI these detection techniques are also
based on analysing the behavioural characteristics oI
the inspected samples, it is very diIIicult to setup a
benchmark capable to Iully assess the detection
capability oI these complex components. In our
study we have concentrated on one speciIic part oI
these products, namely their signature-based
detection engine. Further studies are needed, with
other more up to date datasets, to test the detection
capabilities oI these products in Iull, including their
sensitivities to Ialse positives (whatever the
deIinition oI a Ialse positive may be Ior a given
setup);
x Studying the detection capability with diIIerent
categories oI malicious Iiles. In our study we have
concentrated on malicious executable Iiles only.
Further studies are needed to check the detection
capability Ior other types oI Iiles e.g. document Iiles,
media Iiles etc;
x Analysis oI the beneIits oI diversity Ior 'majority
voting setups, such as 2oo3 ('two-out-oI-three)
diverse conIigurations Ior example. Since our dataset
contained conIirmed malware samples, we thought
18
detection capability is the most appropriate aspect to
study: a system should prevent a malware Irom
executing iI at least one AV has detected it as being
malicious (hence our emphasis on 1-out-oI-n setups).
But in cases where Ialse positives become an issue,
using majority voting may help curtail the Ialse
positive rate;
x More extensive exploratory modelling and
modelling Ior prediction. The results with the hyper-
exponential distribution were a pleasant surprise, but
more extensive analysis with new datasets is
required to make more general conclusions.
Extensions oI current methods Ior modelling
diversity to incorporate the time dimension are also
needed.
ACKNOWLEDGMENT
This work has been supported by the European Union FP
6 via the "Resilience Ior Survivability in InIormation Society
Technologies" (ReSIST) Network oI Excellence (contract
IST-4-026764-NOE), FP 7 via the project FP7-ICT-216026-
WOMBAT, and a City University Strategic Development
Fund grant.
The initial research |1| on these results was done in
collaboration with Corrado Leita and Olivier Thonnard, who
are now with Symantec Research.
We would like to thank our colleagues at CSR Ior IruitIul
discussions oI the research presented, and especially Robert
Stroud who reviewed an earlier version oI this paper.
REFERENCES
|1| Gashi, I., et al. An Experimental Study oI Diversity with
OII-the-ShelI AntiVirus Engines. in Proceedings oI the
8th IEEE Int. Symp. on Network Computing and
Applications (NCA), p. 4-11, 2009.
|2| van der Meulen, M.J.P., et al. Protective Wrapping oI
OII-the-ShelI Components. in the 4th Int. ConI. on
COTS-Based SoItware Systems (ICCBSS). Bilbao,
Spain, p. 168-177, 2005.
|3| Strigini, L., Fault Tolerance Against Design Faults, in
Dependable Computing Systems: Paradigms,
PerIormance Issues, and Applications, H. Diab and A.
Zomaya, Editors, J. Wiley & Sons. p. 213-241, 2005.
|4| Reynolds, J., et al. The Design and Implementation oI
an Intrusion Tolerant System. in 32nd IEEE Int. ConI.
on Dependable Systems and Networks (DSN).
Washington, D.C., USA, p. 285-292, 2002.
|5| Oberheide, J., E. Cooke, and F. Jahanian. CloudAV: N-
Version Antivirus in the Network Cloud. in Proc. oI the
17th USENIX Security Symposium, p. 91106, 2008.
|6| GFi. GFiMailDeIence Suite, last checked 2011,
http://www.gIi.com/maildeIense/.
|7| VirusTotal. VirusTotal - A Service Ior Analysing
Suspicious Files, last checked 2011,
http://www.virustotal.com/sobre.html.
|8| Sukwong, O., H.S. Kim, and J.C. Hoe, Commercial
Antivirus SoItware EIIectiveness: An Empirical Study.
IEEE Computer, 2011. 44(3): p. 63-70.
|9| Leita, C. and M. Dacier. SGNET: Implementation
Insights. in IEEE/IFIP Network Operations and
Management Symposium (NOMS). Salvador da Bahia,
Brazil, p. 1075-1078, 2008.
|10| Leita, C., et al. Large Scale Malware Collection:
Lessons Learned. in Workshop on Sharing Field Data
and Experiment Measurements on Resilience oI
Distributed Computing Systems, 27th Int. Symp. on
Reliable Distributed Systems (SRDS). Napoli, Italy,
2008.
|11| Leita, C., SGNET: Automated Protocol Learning Ior the
Observation oI Malicious Threats, in Institute
EURECOM. 2008, University oI Nice: Nice, France.
|12| Bayer, U., TTAnalyze: A Tool Ior Analyzing Malware,
in InIormation Systems Institute. 2005, Technical
University oI Vienna: Vienna, Austria. p. 86.
|13| Bayer, U., et al., Dynamic Analysis oI Malicious Code.
Journal in Computer Virology, 2006. 2(1): p. 6777.
|14| F-Secure. Malware InIormation Pages: Allaple.A, last
checked 03 June 2011, http://www.I-secure.com/v-
descs/allaplea.shtml.
|15| Littlewood, B. and L. Strigini, Validation oI ultrahigh
dependability Ior soItware-based systems. Commun.
ACM, 1993. 36(11): p. 69-80.
19