
A DISTRIBUTED AND INTEROPERABLE ANTI-SPAM ARCHITECTURE USING PLUG-INS

GODWIN CARUANA
MALTA INFORMATION TECHNOLOGY AND TRAINING SERVICES LTD,
MALTA
godwin.v.caruana@gov.mt

Acknowledgement(s): This paper is heavily based on my dissertation, submitted in partial fulfilment of the requirements for the degree of Master of Science with the University of Liverpool, under the kind supervision of Paul Darbyshire.

Abstract

Electronic mail has become firmly embedded in our everyday lives. It has been estimated that during 2006, 25 billion legitimate emails were being sent on a daily basis (Ferris Research, 2007). The widely established underlying infrastructure, its widespread availability and its ease of use have all acted as catalysts for such widespread utilisation. Unfortunately, the same can be said about the proliferation of unsolicited bulk mail, or spam. It is estimated that, on average, 70% of email is in fact spam (MessageLabs, 2007).

From simply an annoying ‘characteristic’ of the electronic mail epoch, spam has evolved into a very expensive, resource- and time-consuming problem, reaching previously unthought-of proportions. To this extent, spam filtering has become critical. Whilst considerable attention is generally given to the design of email provisioning architectures, one cannot easily assert the same with respect to the design of the respective anti-spam setups. Many organisations design and deploy rigid, proprietary and frail anti-spam architectures as an afterthought. To complicate matters further, a number of anti-spam solutions lack out-of-the-box flexibility and scalability and thus become very expensive to maintain and operate in the longer term. Few implementations allow additional spam-detection / filtering schemes to be added programmatically in a simple fashion.

This paper discusses current literature as well as industry approaches towards spam handling. It also suggests a potential alternative: a distributed, heterogeneous and extensible architecture for fighting spam. An overview of the results achieved using a prototype implementation is also presented.

Keywords: Anti-Spam, Distributed Architectures, Heterogeneous, Extensible

1. Introduction

Although it may sound overly passionate, in (Harris, 2003) unsolicited bulk email, or spam, has been described as being analogous to sewage due to its continued contamination of legitimate email communication and its enabling infrastructure. The costs in terms of processing power, storage and the human effort required to deal with the spam problem are substantial and constantly on the increase worldwide. Figure 1 below, reproduced from (Ferris Research, 2005), portrays typical costs spam inflicts on major economies.

Figure 1: The cost of spam across countries, in millions of dollars. Reproduced from (Ferris Research, 2005)

The architectures, including hardware, software and networks, that are employed to perform anti-spam operations are commonly built using an ad-hoc approach. This regularly results in such architectures quickly becoming obsolete in terms of their effectiveness, or too costly to scale and keep up with the continuously changing spamming approaches.

A number of architectural spam detection and filtering approaches exist, including but not limited to server only, client only, or a combination of both. Many organisations, including Microsoft Corporation (Microsoft, 2006, p.8) for example, primarily utilise a centralized service approach, and mostly rely on the scalability practices that are available for their preferred technology stacks to increase spam-filtering capacity. Setting up and configuring additional processing nodes for such anti-spam setups can be complex and costly, and mostly requires exclusive processing power to be effective.

Production environments that host other types of services cannot easily be employed as part of the overall anti-spam architecture. The reason is that the installation of the additional anti-spam services may jeopardize current operations, given the specific and complex requirements these can exert.

Additionally, traditional anti-spam solutions tend to impose a number of limitations when it comes to the extensibility of spam detection techniques. This is primarily due to imposed product, environment and vendor dependencies. Such constraints, amongst others, lead to anti-spam setups that do not lend themselves very well to today's continuously evolving spam filtering capacity requirements, which mandate multi-platform support, scalability and extensibility out-of-the-box.

2. The spam problem

Spam has increased dramatically over the last couple of years. The greater awareness of the issue has stimulated and attracted considerable attention from the information technology (IT) audience in general, including the commercial as well as academic arenas. The anti-spam topic need not be considered and tackled solely from a technology perspective in isolation.

In fact, amongst a number of others, legislative, financial and authentication approaches (Allman, 2003), including supporting frameworks, are also actively under consideration and development. However, the most perceptible efforts to date remain those that try to provide a technology based solution.

Various works, involving numerous approaches, have been carried out, studied in depth and implemented to this extent. Nevertheless, most of the research work carried out so far seems to be concentrated on the detection algorithms and accompanying implementation techniques rather than the overall approach that constitutes anti-spam architectures in general.

At the core of the email enabling infrastructure is SMTP (Simple Mail Transfer Protocol). Thus, to alleviate the spam pandemic problem, perhaps one can start by considering improving SMTP itself. Without negating its inherent potential stemming from its sheer simplicity, initial SMTP (SMTP, 1982) based implementations were not originally intended to be very secure. Neither was it envisaged that SMTP would be capitalized upon as a mass unsolicited email-spreading vehicle. Changes to SMTP and closely related technologies such as DNS, the Domain Name System (DNS, 1987), have been considered from various perspectives, such as via the Reverse-Mail-eXchanger (RMX) approach (Krawetz, 2004, Sect. 1.4). The biggest concern with similar approaches is that, due to SMTP's overwhelming popularity, reach, and deep encapsulation within the core Internet email infrastructure, any significant changes to SMTP will have an overarching effect on the entire Internet email platform.

Less intrusive anti-spam approaches have also been exhaustively evaluated. The application of client, server, or a combination of both, ‘intelligence’, to try to identify and subsequently filter out spam, such as presented in ‘An e-mail client implementation with spam filtering and security mechanism’ (Shi-Jinn et al, 2005), is typical. This approach is extremely popular and is one on which the design of a considerable number of anti-spam architectures is based.

The primary advantage of this approach is that some of the processing power that is available on the participating clients can be capitalized upon. This is frequently at the expense of the complexity and diversity of the individual runtime environments, configurations and capabilities of the client machines / email clients themselves.

An interesting derivative, which employs an innovative peer-to-peer based approach, can be found in ‘P2P-based collaborative spam detection and filtering’ (Damiani et al, 2006). The authors present a scheme whereby end-users contribute, in a clustered fashion, towards the identification and confirmation of spam, via the exchange of spam reports between the participating anti-spam architecture servers and end-users.

The authors themselves consider that the biggest concern with this approach is the assumed willingness of end-users to participate in the spam identification / confirmation process (Damiani et al, 2006).

In a paper titled ‘An anti-spam scheme using pre-challenges’ (Roman, Zhou and Lopez, 2006), the authors propose another interesting method for spam control. A challenge-response approach is suggested, the primary scope of which is to ensure that the email is being sent by a human being.

This is accomplished by providing challenges for additional interaction before the actual email ‘transaction’ is performed. Again, however, it is questionable whether the extent of the inconvenience that this approach inherently introduces for legitimate email usage would be considered acceptable and endorsed by end-users.

The power of distributed, P2P and GRID computing has been applied to tackle various computationally complex and resource hungry problems. In its wider context, this approach towards computing relies on the capitalisation of underutilised, distributed, node processing power which can be utilised to perform complex and processing intensive tasks collaboratively. Some studies, such as described in ‘Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure’ (Liu et al, 2004), discuss collaborative and distributed anti-spam approaches, intended to improve the spam detection process in this particular case.

Nevertheless, one can safely state that overall, the distributed approach towards spam filtering, as well as the exploitation of underutilised or commodity processing power for spam filtering purposes, are areas which still have plenty of scope for further exploration. Additionally, in most approaches mentioned herewith and as highlighted earlier, there is no easily accessible open interface that allows for the straightforward construction and introduction of additional spam detection ‘filters’.

3. Spam filtering / detection techniques

The approaches and techniques towards the detection, and hence filtering, of spam are numerous. Rule based, text clustering (Sasaki and Shinnou, 2005) and Bayesian classification (Zhen, Xiangfei, Weiran, and Jun, 2006) and their derivatives, amongst a plethora of others, are quite popular nowadays.

In real world implementations, it is also very common to identify best practices that employ combinations of these techniques rather than adopting exclusively one, which increases the overall effectiveness. Complexity, accuracy, efficiency and processing power requirements vary according to the particular needs, and frequently lead to tradeoffs between these characteristics. The direct effect of this tends to become more amplified in environments where the spam processing functionality co-exists with other core, critical services which are processing intensive as well.

To this extent, various studies have been, and are continuously being, carried out to try to come up with a good compromise between accuracy, efficiency and performance. For example, in a paper titled ‘Fast statistical spam filter by approximate classifications’ (Li and Zhong, 2006), the authors present an innovative solution to the processing overhead typically associated with a very popular detection scheme, namely statistical Bayesian filtering. This is performed by applying approximation techniques based on “hash based lookup and lossy encoding” (Li and Zhong, 2006, p.1).

4. Industry practices

Due to its significant financial impact, it comes as no surprise that the amount of literature that is available from the information and communication technology industry in general with respect to the spam problem is exceptional. From ‘Spam: The Silent ROI Killer’ (Nucleus, 2003), the percentage loss of productivity for each employee

per year is illustrated at an average of 1.4%, subsequently equated to approximately $900 per employee per year (Nucleus, 2003).

If these figures alone are put in the context of a whole organization, it quickly becomes evident how expensive the spam problem has become, without even having to look at the cost incurred at a service (i.e. email) provisioning level. It also becomes obvious why organisations are placing more effort in trying to find alternative and more effective ways to tackle the plague of unsolicited messages. It is an accepted reality that spammers do not rely on any single approach towards spamming. In turn, the industry actively employs a number of architectures and approaches to try to mitigate the problems associated with mass spam proliferation.

As indicated earlier, there are three primary classes of architectures that are frequently employed in this respect, namely server based, client based and their combination. From an industry perspective, one can also consider adding the outsourcing of the spam-detection and filtering infrastructure(s) and associated services to third-party entities specializing in this area.

To date, at least locally, server based implementations seem to be the most proliferated approach, as also indicated by Figure 2, reproduced from a short report that looks into basic anti-spam architectures at a national level (Caruana and Naudi, 2007). This popularity is primarily due to their easily maintainable configurations.

Nevertheless, “with spam and e-mail borne virus attacks increasing mail volumes exponentially, server-based must be truly scalable” (Roaring Penguin, 2004, p.2) and flexible in order for organisations to realistically be in a position to fully capitalise on their potential. This is seldom easy, however, given that such anti-spam setups are typically part and parcel of, and tightly coupled / integrated with, email setups, commonly sharing the underlying hardware resources as well.

Figure 2: Anti-spam architectures employed at a national level. Reproduced from (Caruana and Naudi, 2007)

5. Research executed

From a system topology perspective, typical centralised anti-spam setups tend to exhibit a number of limitations with respect to straightforward scalability under heavy loads, alongside limited comparative redundancy.

To this extent, one of the underlying principles of the presented research is based upon the perspective that the application of distributed computing concepts for the provision of an anti-spam architecture should provide a number of advantages when compared to more traditional centralised setups (Bettati, 2007). A good number of anti-spam technology provisioning stacks at large today are predominantly proprietary in nature. This provides the rationale behind the consideration for the provision of an application-programming interface to abstract the spam-detection / filtering functionality from the rest of the software. Such an approach will greatly enhance the software’s flexibility with respect to its ability to keep abreast with spamming techniques.
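
As a rough illustration of what such an application-programming interface could look like, the sketch below separates the detection logic from its host through a single-method plug-in contract. The prototype's plug-ins are written in Java; Python is used here only for brevity, and every name (`SpamPlugin`, `is_spam`, `KeywordPlugin`, `AgentHost`) is hypothetical rather than taken from the prototype.

```python
from typing import List, Protocol


class SpamPlugin(Protocol):
    """Hypothetical single-method contract that every plug-in exposes."""

    def is_spam(self, message: str) -> bool:
        ...


class KeywordPlugin:
    """Toy detection scheme: flags messages containing any listed keyword."""

    def __init__(self, keywords: List[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def is_spam(self, message: str) -> bool:
        text = message.lower()
        return any(k in text for k in self.keywords)


class AgentHost:
    """Hypothetical agent that runs whatever plug-ins have been registered."""

    def __init__(self) -> None:
        self.plugins: List[SpamPlugin] = []

    def register(self, plugin: SpamPlugin) -> None:
        self.plugins.append(plugin)

    def check(self, message: str) -> bool:
        # Assumption: a message is treated as spam if any plug-in flags it.
        return any(p.is_spam(message) for p in self.plugins)
```

Under this assumed contract, a new detection scheme is added by writing one class and registering it; the host code does not change.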

Additionally, considering that the magnitude of spam is neither easily quantifiable nor predictable over time, the ability to easily capitalise upon and exploit any underutilised processing power within an organisation has been considered an added catalyst for a potential research opportunity.

6. Basic prototype design

In order to gauge the potential of such an approach, an anti-spam software architecture based on distributed computing concepts was prototyped. The overall design is based on a central RFC-821 compliant (RFC 821, 1982) SMTP software service controller, which accepts email and delegates mail checking to a number of participating nodes (agents) in the distributed environment for spam content analysis. The distributed anti-spam participating agents have a plug-in based architecture to enable third party developers to supplement the architecture with additional anti-spam detection techniques/approaches as the need arises. The architecture is represented pictorially in Figure 3.

7. Basic findings

The prototype was designed to gauge the effectiveness of the proposed architecture when compared to a specifically built anti-spam simulation laboratory environment that mimics the operation and process flow of a common server based anti-spam setup and exhibits the same overall behaviour characteristics. This environment reflected as closely as possible the throughput characteristics of the anti-spam setup within the company that employs the author. According to deVilliers (deVilliers, 2005), laboratory environment testing is considered a relevant evaluation approach.

To this extent, basic comparison results between the prototype and the simulation laboratory environment are presented herewith, and the potential behaviour and throughput characteristics of the prototype when executed in a production environment are inferred.

The simulation laboratory environment was prepared by first analyzing a number of key properties of the production environment, namely:

• Its email throughput during typical weekly / daily operation.
• The basic spam detection techniques adopted.
• Analysis of the basic underlying infrastructure.

The average email throughput of the production environment was established by identifying a 7 day sample period for which relevant audit data was extracted and processed from the respective logs. Subsequently, this data was averaged to generate a 24 hour based data-set and the resultant data analyzed in more depth.

This analysis was used to regulate the working parameters of the simulation laboratory environment and to reproduce as closely as possible the behaviour and characteristics of the live environment when presented with a similar set of processing requirements.

Simulation Laboratory vs. Prototype: basic overall performance comparison

The graph represented by Figure 4 aggregates 2 primary results, namely the basic performance of:

i. The simulation laboratory environment.
ii. Prototype with 5 anti-spam nodes.

The 5 node instance was selected because with this number of nodes and given the simulated load, the prototype was able to carry out spam checking without the need to adopt the pass-thru [1] approach for any email.

[Diagram components: RFC 821 compliant mail service (Anti Spam Co-ordinator / load distributor); Anti Spam Agent Hosts, each with plug-ins; Organisation Email Primary Service; end-user email clients.]
Figure 3: Anti-Spam architecture prototype

[Chart ‘Simulator vs Proto (5 Node - No Pass Thru's)’: Time / Email (0.00 to 1.80) against Mail / second (1 to 19); series Test and Proto 5 Node, each with a polynomial trend line; R² = 0.929 and R² = 0.9728.]
Figure 4: Comparison of prototype with 1 and 5 nodes and the simulation environment.

The primary characteristics of this graph indicate that the prototype’s performance is better because increasing load has less effect on the time required to process emails.

However, on closer inspection it can also be observed that whilst in practice the processing time required to process one email in both scenarios should be approximately the same or in favour of the simulation environment (given the prototype’s communication overheads), the graph presents a different picture with the prototype shown as being faster. Further analysis work revealed that this was primarily attributable to the differences in approach and complexity between the spam detection algorithms implemented in the simulation laboratory and prototype environments.

After careful consideration based on supplementary data gathering and analysis exercises, comprising additional input samples, it was identified that a better approach to evaluate the performance characteristics of both setups is by comparing the respective processing time deltas (the rate of throughput). This was calculated by using the simple formula $((n_n - n_1) / n_1) \times 100$, i.e. the percentage increase of the n-th measurement over the first. The graph presented in Figure 5 demonstrates that whilst the simulation laboratory architecture shows better throughput when load is relatively small (no. of mails/second < 29), there is a switch-over point, highlighted by ‘O’ (≈ 30 mails/second).

At this point the prototype, employing 5 nodes, starts exhibiting better throughput characteristics since the delta increase in time taken to process email is lower in the proposed architecture when compared to the simulation laboratory setup.

Thus at lower throughput requirements, the prototype suffers from the added email distribution overhead and baggage of the data transmission protocol implemented. This characteristic is reversed as soon as the throughput requirements increase to a point where the distribution overhead becomes less significant in comparison to the workload.

Mostly constituted of main memory based operations, the prototype is not hindered by slower disk I/O, although higher priority processes may still induce page faulting and affect overall throughput.

Conversely, the inability to cache and capitalise on permanent storage can be argued to be a primary disadvantage because the occurrence of an internal failure may result in email loss.

The same applies to the current implementation’s inability to prevent pass-thru’s without resorting to the introduction of additional nodes. This can be mitigated by queuing data to permanent storage and performing the distribution of email as a separate process.

[1] Pass-thru refers to emails that the specific configuration (server + number of participating nodes) was not able to process normally, i.e. check for spam.
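
The processing time delta described above, ((n_n - n_1) / n_1 * 100), can be sketched as follows. The function name and the sample values in the usage note are illustrative assumptions, not measurements from the test-bed.

```python
from typing import List


def percent_deltas(times: List[float]) -> List[float]:
    """Return each measurement's percentage increase over the first one."""
    if not times:
        return []
    first = times[0]
    if first == 0:
        raise ValueError("first measurement must be non-zero")
    # ((n_n - n_1) / n_1) * 100 for every measurement n_n in the series
    return [(t - first) / first * 100 for t in times]
```

For a hypothetical series of per-email times `[2.0, 3.0, 4.0]`, the deltas are 0%, 50% and 100%. Comparing the delta curves of two setups, rather than their absolute times, removes the offset caused by different detection algorithms.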

[Figure 5 plot: Time / email (s), axis 0 to 10, and Delta increase (% over 1st), axis -50 to 450, plotted against Email / s, 2 to 50]
Figure 5: Graph showing time delta contrast between the laboratory environment and prototype using 5 nodes

Figure 6: Legend for Figure 4 above

The ability to introduce nodes easily allows the presented prototype to scale on a need basis. Each node increases the prototype’s ability to handle more incoming email. The overall throughput of the prototype / architecture can be represented via:

$$\sum_{i=1}^{n} t_i = t(n_1) + t(n_2) + t(n_3) + \dots + t(n_n)$$

Where

1. i = minimum number of nodes
2. n = maximum number of nodes
3. t = throughput

Until the server-controller reaches saturation point, the average throughput of a single node is approximately 1 email every 0.2 seconds on the employed test-bed. Thus, one can infer the approximate number of nodes required for processing specific loads as shown below:

Number of emails / second    Number of nodes
5                            1
10                           2
25                           5
50                           10
250                          50

The current method employed for distributing work to participating nodes is very simple. The following is a pseudo code representation of the current allocation scheme.

If number of nodes in active list > 0 then
    Send work to active list node available at list element [0] (first one)
Else
    Send work directly to final destination without performing any checking
End if

The complexity of the current allocation scheme can be expressed as O(n) overall in Big-O notation (NIST, 2005). Although perhaps slightly too simplistic, this approach has the advantage of substantially reducing the computational overhead required for work distribution.

This is however at the expense of introducing a considerable number of pass-thru’s, i.e. emails that are forwarded to the final destination without being checked, when the working node count is small. This is due to the server-controller’s design being such that when the number of participating nodes is inadequate, it will skip checking email for spam via the agent-nodes and simply forward it to the final server.

Nevertheless, the server-controller itself will still exhibit a saturation point whereby it will not be able to cope with incoming mail, irrespective of the number of agent-nodes that are available. Traditional mitigation approaches can still be applied to this extent, such as load balancing a number of server-controllers via round robin DNS, amongst others. Groups of agent-nodes can then make use of the different server-controllers available.

The prototype’s ability to allow third-party developers to integrate additional spam detection routines via a simple API increases its flexibility considerably. Vendor independence for such additions also reduces the overall cost of the architecture.

The provisioning of an application-programming interface that allows third-party developers to provide extensions in a simple fashion brings considerable benefits. The prototype exposes a very simple mechanism in this respect that allows extensions to be developed using Java.

Developing plug-ins is relatively simple and basically involves the creation of the extension logic implemented as Java classes, which must expose one single method as described earlier and pictorially in the graphic below, where p1, p2 … pn are any number of plug-ins.

Figure 7: Plug-in interface

From a configuration perspective, it can be argued that the prototype’s simpler approach is an advantage when introducing additional processing nodes, although conversely, this can also be considered a limitation when it comes to configuration flexibility.

Heterogeneity proved to be a very important attribute. The ability to

capitalise on a wide variety of platforms gives the proposed prototype, especially from an agent-node perspective, the ability to be executed on environments ranging from traditional desktops and servers to other non-conventional devices, including appliances, which provide the Java Virtual Machine as part of their software base.

Additionally, many organisations perform technology refresh cycles on a periodic basis. This is done to upgrade hardware that cannot keep up with immediate processing requirements, frequently decommissioning hardware which could prove beneficial if employed to increase the organisation’s capability to handle spam. The ability to run on a wide range of different platforms becomes even more important when anti-spam solutions are required to capitalise on any unused or underutilised processing power that is available, similar to the approach adopted in P2P and GRID computing initiatives.

8. Conclusions & prospects for future work

This research provided a great learning opportunity as well as insight with respect to distributed anti-spam architectures. This study demonstrates that the application of distributed computing as a backbone to anti-spam architectures can provide a number of benefits when compared to traditional approaches.

Additionally, the provision of a simple API allows third party extensions to be developed with ease, providing the flexibility required to keep anti-spam architectures up-to-date. It was observed that the effects of the extra latency introduced for email transfer between the server-controller and agent-nodes, alongside the performance of the adopted load distribution scheme, must not be underestimated. It was also recognized that unless fast convergence / interaction schemes for agent-node participation, alongside efficient algorithms to keep the server-controller’s node state data-structures up-to-date, are employed, the overall effectiveness can be significantly reduced.

Nevertheless, the original rationale behind this research, i.e. that anti-spam architectures can benefit from the distributed computing model, as well as from the flexibility introduced to such architectures by providing a plug-in API for additional detection scheme development, has been demonstrated as successful overall.

Future work - Software engineering viewpoint

As shown, the presented architecture’s scalability characteristics can be described as fairly predictable from a participating nodes perspective. This hypothesis is based on empirical values as well as experience. Nevertheless, irrespective of the number of participating nodes, the server-controller will, at a specific stage, reach a saturation point and become a bottleneck itself unless appropriate measures are taken to mitigate this issue.

It would be very interesting to attempt to identify a formal relationship between the number of nodes, email processing time (based on an environment where there are no pass-thru’s), load distribution logic, server load and server saturation point.

Based on the deductions presented earlier, the following is a simplified formal model of the architecture presented:

$$K = T_A + (n \times Z)$$

Where (in the presented prototype’s case):

n = number of active nodes.

Z = server-controller’s logic code complexity; may be represented using McCabe (VanDoren, 2000) or Halstead (VanDoren, 1997) software/computational complexity metrics.

$T_A$ = time taken by individual nodes to process work, which should not be affected by the number of active nodes and may be considered as relatively ‘fixed’.

Therefore, if the hypothetical factor Z primarily reflects the code complexity for forwarding mails to nodes and keeping relevant node accounting information, its effect on the architecture can be studied, and this information can potentially be used to help identify the actual processing saturation point of the server-controller.

Elaborating, $K = T_A + (n \times Z)$, where K is the total email processing time from initial source to final destination. Given that, from the evaluation carried out, K is roughly the same for 1 node as well as for 6 nodes, it can be argued that for small values of n, the product $n \times Z$ approximates 0 and thus the relationship above can be approximated to:

$$K \approx T_A$$

Consequently, it would be interesting future work to try to identify when (and why) K becomes greater than $T_A$ ($K > T_A$), which denotes that the server-controller is reaching a saturation point and becoming a bottleneck.

Future work - Prototype viewpoint

The current prototype implementation exhibits a number of limitations with respect to email handling capabilities, being specifically limited to text only email data. The capability to accept and process multi-part messages, images and other rich content as part of the spam inspection data should be actively considered to make the solution effective in a real-world environment.

This is very important given that a considerable mass of modern day spam employs images for its presentation (Kelly, 2006, p.1).

Figure 8: Image Spam on the increase - Reproduced from (Kelly, 2006, p.1)

While these additional features will undeniably have an effect on the overall throughput of the individual server and nodes, and thus the time taken to process ‘work’, the overall scalability traits and capabilities of the proposed architecture should remain largely the same, although any individual saturation points will be reached earlier.

Rather than using a point-to-point approach for server and node(s) convergence, a broadcast based approach will provide a more effective solution, as well as reduce overall network traffic during control-data exchange.
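
The simplified model K = TA + (n * Z) from the software engineering viewpoint above can be explored numerically. The sketch below is only an aid for reasoning about the model: the function names are hypothetical, and any value chosen for Z is an assumption, since the paper leaves Z unquantified.

```python
def total_time(t_a: float, n: int, z: float) -> float:
    """K = T_A + (n * Z): end-to-end processing time per email."""
    return t_a + n * z


def controller_share(t_a: float, n: int, z: float) -> float:
    """Fraction of K attributable to the server-controller term n * Z."""
    k = total_time(t_a, n, z)
    return (n * z) / k if k else 0.0
```

With Z close to 0, `total_time` reduces to T_A, which matches the approximation K ≈ T_A for small n. A growing `controller_share` would be one way to observe the server-controller approaching its saturation point.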

The current work allocation scheme is rather simplistic. A better approach would be to collect and capitalise on runtime information by scrutinizing individual node processing load and performance capabilities, as well as to consider any underlying inter-connecting network latencies in the email allocation logic.

Another interesting improvement could be to consider having separate allocation and monitoring logic which is detached from the server-controller and agent-nodes, as described in (Artigas and Ferdman, 2000, p.4).

Figure 9: Separation of allocation and monitoring from the server-controller. Reproduced from (Artigas and Ferdman, 2000, p.4)

Furthermore, rather than employing a proprietary protocol for server-node control communications, the study and subsequent application of industry standard approaches such as SNMP (SNMP, 1990), if applicable, would be more appropriate. Adequate message data compression as well as caching can provide additional effective mechanisms for substantially increased processing throughput. Additionally, introducing the ability to queue messages, both from a server-controller as well as an agent-node perspective, will inherently increase the resiliency and throughput of the prototype.

Security is a topic of concern when it comes to heterogeneous and distributed computing approaches, and a complex subject area in its own right. To ensure confidentiality, the application of relevant security measures, including but not limited to encryption of data as it flows from server to node, is an additional aspect that has to be considered for any future work. Any resulting throughput degradation must be taken into consideration.

Finally, a study of whether and how such a distributed approach would behave if the underlying reach is extended to the Internet rather than a typical LAN should prove very interesting. This implies having the ability to have server-controller(s) and agent-nodes that are actually dispersed across a distributed setup from a geographic perspective, increasing the scalability as well as the processing power available considerably, if effective.

In this way, the aggregation of a number of independent entities can collectively institutionalize a formal global virtual organisation for a concerted effort to fight spam.
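The load-aware allocation and message queueing improvements suggested in this section can be illustrated with a small sketch. It is not the prototype's implementation: the `AgentNode` class, the `capacity` and `latency_ms` fields, the cost formula and the latency weighting are all hypothetical choices made only to show how runtime information might drive the allocation decision.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class AgentNode:
    """Hypothetical view of an agent-node as seen by the server-controller."""
    name: str
    capacity: float    # relative processing capability (assumed unit)
    latency_ms: float  # measured network latency to the node
    pending: queue.Queue = field(default_factory=queue.Queue)

    def cost(self, latency_weight: float = 0.01) -> float:
        # Queued work (plus the new message) scaled by capability,
        # with a penalty for network latency. The weighting is an assumption.
        backlog = (self.pending.qsize() + 1) / self.capacity
        return backlog + latency_weight * self.latency_ms


def allocate(message_id: str, nodes: list) -> AgentNode:
    """Queue the message on the node with the lowest estimated cost."""
    best = min(nodes, key=lambda node: node.cost())
    best.pending.put(message_id)
    return best


# Illustrative nodes with made-up capabilities and latencies.
nodes = [
    AgentNode("fast", capacity=2.0, latency_ms=1.0),
    AgentNode("slow", capacity=1.0, latency_ms=2.0),
    AgentNode("remote", capacity=2.0, latency_ms=80.0),
]
for i in range(6):
    allocate(f"msg-{i}", nodes)

counts = {node.name: node.pending.qsize() for node in nodes}
print(counts)
```

Because each node keeps its own queue, messages allocated while a node is busy are retained rather than rejected, which is the resiliency benefit the queueing suggestion points to. A real allocator would refresh load and latency figures at runtime instead of relying on static values.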
REFERENCES CITED

Allman, Eric (2003). ‘Spam, Spam, Spam, Spam, Spam, the FTC, and Spam’. Queue 1, 6 (Sep. 2003), 62-69. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/945131.945157 (Accessed: 10/03/2007).

AmAvis (2007). A Mail Virus Scanner. [Online] Available at http://www.amavis.org/ (Accessed: 31/05/2007).

Androutsopoulos, I.; Koutsias, J.; Chandrinos, K.V. and Spyropoulos, C.D. (2000). ‘An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages’. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Athens, Greece, July 24 - 28, 2000). SIGIR '00. ACM Press, New York, NY, 160-167. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/345508.345569. (Accessed: 31/03/2007).

Artigas, Pedro and Ferdman, Michael (2000). Centralised vs. Distributed. Allocation in Distributed System. [Online] Available at http://www.cs.cmu.edu/~artigas/classproj/osproj.ps (Accessed: 04/06/2007)

Barracuda Networks (2004). An Overview of Spam Blocking Technique. [Online] Available at http://www.adantasys.com/down/Barracuda/SpamTechniquesFINAL.pdf. (Accessed: 26/03/2007)

Baskerville, Richard L. (1999). Investigating Information Systems With Action Research. Computer Information Systems Department - Georgia State University. [Online] Available at http://www.cis.gsu.edu/~rbaskerv/CAIS_2_19/CAIS_2_19.html (Accessed: 21/06/2007).

Bettati, Riccardo (2007). Introduction to Distributed System. Department of Computer Science Texas A&M University. [Online] Available at http://faculty.cs.tamu.edu/bettati/Courses/662/Generic/Slides/HTTP/Intro/tsld002.htm (Accessed: 23/06/2007)

Blanco, Elena (2007). Open source and the postmaster. OSS Watch. [Online] Available at http://www.oss-watch.ac.uk/resources/postmaster.xml (Accessed: 10/07/2007)

Caruana, Godwin (2007). Anti-Spam Architectures – Online Survey. Based on SurveyMonkey Online Web Surveys. [Online] Available at http://www.surveymonkey.com/s.asp?u=911173384951. (Accessed: 03/03/2007).

Caruana, Godwin and Naudi, Rodney (2007). ‘An overall assessment of current, local anti-spam implementations’. Article awaiting publication in the MITTS Ltd journal.

Christie, Alan M. (1999). Simulation: An Enabling Technology in Software Engineering. Software Engineering Institute – Carnegie Mellon. [Online] Available at http://www.sei.cmu.edu/publications/articles/christie-apr1999/christie-apr1999.html. (Accessed: 16/07/2006)

Clam-AV (2007). Clam Anti-Virus. [Online] Available at http://www.clamav.net/ (Accessed: 31/05/2007).

Damiani, E.; De Capitani di Vimercati, S.; Paraboschi, S. and Samarati, P. (2006) ‘P2P-based collaborative spam detection and filtering’. Peer-to-Peer Computing, 2004. IEEE. [Online] Available at DOI = 10.1109/PTP.2004.1334945 (Accessed: 14/01/2006)
de Villiers, M. R. (2005). ‘Three approaches as pillars for interpretive information systems research: development research, action research and grounded theory’. In Proceedings of the 2005 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on IT Research in Developing Countries (White River, South Africa, September 20 - 22, 2005). ACM International Conference Proceeding Series, vol. 150. South African Institute for Computer Scientists and Information Technologists, 142-151. The ACM Digital Library. [Online] Available at www.acm.org (No DOI) (Accessed: 22/06/2007)

DNS (1987). RFC 1034 - Domain names - concepts and facilities. [Online] Available at http://www.faqs.org/rfcs/rfc1034.html. (Accessed: 19/03/2007).

Eeles, Peter (2006). What is software architecture? IBM Developerworks. [Online] Available at http://www.ibm.com/developerworks/rational/library/feb06/eeles/ (Accessed: 24/05/2007)

Exchange (2007). Microsoft Exchange. Microsoft Corporation. [Online] Available at http://www.microsoft.com/exchange/default.mspx (Accessed: 03/03/2007)

Ferris Research (2005). The Global Economic Impact of Spam, 2005. Report #409. Ferris Research Analyzer Information Service. [Online] Available at http://www.ferris.com/?file_id=2004/05/611_409SpamCosts.pdf (Accessed: 16/08/2007)

Ferris Research (2007). Industry Statistics. [Online] Available at http://www.ferris.com/research-library/industry-statistics/ (Accessed: 16/08/2007)

Gentoo (2007a). Gentoo Linux. [Online] Available at http://www.gentoo.org/ (Accessed: 29/053/2007)

Gentoo (2007b). HOWTO Spam Filtering with Gentoo, Postfix, Amavis. [Online] Available at http://gentoo-wiki.com/HOWTO_Spam_Filtering_with_Gentoo,_Postfix,_Amavis (Accessed: 31/05/2007).

Goodman, Joshua; Cormack, Gordon V. and Heckerman, David (2007). ‘Spam and the ongoing battle for the inbox’. Commun. ACM 50, 2 (Feb. 2007), 24-33. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/1216016.1216017 (Accessed: 10/03/2007).

Guttam, Michael (2006). Modeling notation progressing from UML to MDA to SOA. Software magazine [Online] Available at http://www.softwaremag.com/L.cfm?Doc=950-5/2006. (Accessed: 04/03/20076).

Harris, David (2003). Drowning in Sewage: SPAM, the curse of the new millennium: an overview and white paper. SpamHelp. [Online] Available at http://www.spamhelp.org/articles/Drowning-in-sewage.pdf. (Accessed: 29/01/2007).

Hasan, Helen (2004). Information Systems Development As A Research Method. Australasian Journal of Information Systems. [Online] Available at http://dl.acs.org.au/index.php/ajis/article/download/142/122 (Accessed: 22/06/2007)

Hershkop, S. and Stolfo, S. J. (2004). Identifying spam without peeking at the contents. Crossroads 11, 2 (Dec. 2004), 3-3. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/1144403.1144406 (Accessed: 14/02/2007)

JDK (2007). Java™ Platform, Standard Edition 6 Development Kit. [Online] Available at http://java.sun.com/javase/6/webnotes/README.html. (Accessed: 31/03/2007).

Kelly, Nick (2006). Image Spam: The New Email Scourge. McAfee. [Online] Available at http://www.mcafee.com/us/local_content/white_papers/threat_center/wp_imagespam_f.pdf. (Accessed: 12/07/2007)

Kolcz, A., Bond, M., and Sargent, J. (2006). ‘The challenges of service-side personalized spam filtering: scalability and beyond’. In Proceedings of the 1st International Conference on Scalable Information Systems (Hong Kong, May 30 - June 01, 2006). InfoScale '06, vol. 152. ACM Press, New York, NY, 21. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/1146847.1146868 (Accessed: 03/04/2007).

Krawetz, Dr. Neal. (2004). Anti-Spam Solutions and Security. Security Focus. [Online] Available at http://www.securityfocus.com/infocus/1763 (Accessed: 13/04/2007)

Li, Kang and Zhong, Zhenyu (2006). ‘Fast statistical spam filter by approximate classifications’. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (Saint Malo, France, June 26 - 30, 2006). SIGMETRICS '06/Performance '06. ACM Press, New York, NY, 347-358. The ACM Digital Library. [Online] Available at DOI= http://doi.acm.org/10.1145/1140277.1140317 (Accessed: 03/04/2007)

Liu, Peng; Shi, Yao; Li, Li-San; Lau, Francis C. M and Wang, Cho-Li (2004). Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure. [Online] Available at http://www.anti-spamgrid.org/files/A%20Spam%20Filtering%20System%20Based%20on%20Dynamically%20Organized%20Grid.pdf (Accessed: 17/03/2007)

MessageLabs (2007). Spam Spikes – The Battering Ram of Spam. Message Labs Intelligence. [Online] Available at http://www.messagelabs.co.uk/mlireport/2007%2005%20May%20MLI_final.pdf. (Accessed: 16/08/2007)

Microsoft (2006). Messaging Hygiene at Microsoft: How Microsoft IT Defends Against Spam, Viruses, and E-Mail Attacks. Microsoft Corporation. [Online] Available at http://download.microsoft.com/download/6/7/3/673cd069-ad45-4284-9ebb-7a83cc61fcf7/MessagingHygieneTWP.doc. (Accessed: 27/01/2007)

MITTS (2007). Malta Information Technology and Training Services Ltd. [Online] Available at http://mitts.gov.mt/Default.aspx?partid=43&id=30 (Accessed: 31/03/2007)

Netsense (2004). Spam Solutions White Paper. NetSense. [Online] Available at http://www.netsense.info/Spam%20Solutions%20White%20Paper.pdf (Accessed: 27/01/2007)

Niglas, Katrin (2000). Combining quantitative and qualitative approaches. University of Leeds. [Online] Available at http://www.leeds.ac.uk/educol/documents/00001544.htm (Accessed: 04/07/2007)

NIST (2005). Big-O Notation. The National Institute of Standards and Technology. [Online] Available at http://www.nist.gov/dads/HTML/bigOnotation.html (Accessed: 09/07/2007)

Nucleus (2003). Spam: The Silent ROI Killer. Nucleus Research Inc. [Online] Available at http://www.nucleusresearch.com/research/d59.pdf (Accessed: 25/03/2006)

Nunamaker, J.F., Jr.; Chen, M. (1990). ‘Systems development in information systems research’. Conference on System Sciences, 1990. Proceedings of the Twenty-Third Annual Hawaii International. Volume iii, 2-5 Jan. 1990 Page(s):631 - 640 vol.3. IEEE. [Online] Available at DOI = 10.1109/HICSS.1990.205401 (Accessed: 20/03/2006)

OMG (2007a). OMG Model Driven Architecture. [Online] Available at http://www.omg.org/mda/. (Accessed: 27/01/2007)

OMG (2007b). Unified Modeling Language. [Online] Available at http://www.uml.org/. (Accessed: 27/01/2007)

OMG (2007c). Executive Overview – Model Driven Architecture. [Online] Available at http://www.omg.org/mda/executive_overview.htm. (Accessed: 27/01/2007)

RBLCheck (2007). RBLCheck. [Online] Available at http://rblcheck.sourceforge.net/. (Accessed: 27/02/2007)

Postfix (2007). Postfix. [Online] Available at http://www.postfix.org/ (Accessed: 10/07/2007)

Roaring Penguin (2004). Demystifying the Anti-Spam Buzz: Features vs. Fluff in the Search for an Enterprise Anti-Spam Solution. Roaring Penguin Software Inc. [Online] Available at http://www.roaringpenguin.com/files/images/resources_files/SpamHype_08-04.pdf (Accessed: 26/03/2007)

Roman, Rodrigo; Zhou, Jianying and Lopez, Javier (2006). ‘An anti-spam scheme using pre-challenges’. ScienceDirect. [Online] Available at DOI= 10.1016/j.comcom.2005.10.037. (Accessed: 19/03/2007)

Sasaki, M.; Shinnou, H. (2005). ‘Spam detection using text clustering’. Cyberworlds, 2005. International Conference. 23-25 Nov. 2005 Page(s):4 pp. IEEE Conference Proceeding. [Online] Available at DOI= 10.1109/CW.2005.83. (Accessed: 03/04/2007)

Sendmail (2007). Sendmail. [Online] Available at http://www.sendmail.org/. (Accessed: 27/02/2007)

Shi-Jinn, S. and Chao-Yi, W. (2005). ‘An e-mail client implementation with spam filtering and security mechanisms’. Web Services, 2005. ICWS 2005. Proceedings. 2005 IEEE International Conference on. IEEE. [Online] Available at DOI=10.1109/ICWS.2005.24 (Accessed: 19/03/2007).

Sjoberg, D. I., Dyba, T., and Jorgensen, M. (2007). ‘The Future of Empirical Methods in Software Engineering Research’. In 2007 Future of Software Engineering (May 23 - 25, 2007). International Conference on Software Engineering. IEEE Computer Society, Washington, DC, 358-378. The ACM Digital Library. [Online] Available at DOI= http://dx.doi.org/10.1109/FOSE.2007.30 (Accessed: 22/06/2007)

SMTP (1982). RFC 821 - Simple Mail Transfer Protocol. [Online] Available at http://www.faqs.org/rfcs/rfc821.html. (Accessed: 19/03/2007)

SMTP-SOURCE (2007). smtp-source - multi-threaded SMTP/LMTP test generator. [Online] Available at http://www.postfix.org/smtp-source.1.html (Accessed: 01/06/2007)

SNMP (1990). RFC 1157 - Simple Network Management Protocol (SNMP). [Online] Available at http://www.faqs.org/rfcs/rfc1157.html (Accessed: 19/07/2007)

Sommerville, Ian (2004). ‘Software Engineering. Seventh Edition’. Pearson. Addison Wesley – Boston. ISBN 0-321-21026-3

SpamAssassin (2007). The Apache SpamAssassin Project. [Online] Available at http://spamassassin.apache.org/. (Accessed: 27/02/2007)

SubethaSMTP (2007). SubethaSMTP Mail Server. [Online] Available at http://subethasmtp.tigris.org/. (Accessed: 18/01/2006).

TOGAF (2007). TOGAF™ Enterprise Edition. The Open Group. [Online] Available at http://www.opengroup.org/togaf/ (Accessed: 03/03/2006)

TumbleWeed (2003). Architectural Comparison of Enterprise Anti-spam Solutions. Intelligent Enterprise Research Library. [Online] Available at http://www.tumbleweed.com/pdfs/2394-TMWD_Enterprise_Anti-spam_WP_07_17_03.pdf (Accessed: 24/03/2007)

Ubuntu (2007). Ubuntu Linux. [Online] Available at http://www.ubuntu.com/ (Accessed: 09/07/2007)

VanDoren, Edmond (1997). Halstead Complexity Measures. Carnegie Mellon Software Engineering Institute. [Online] Available at http://www.sei.cmu.edu/str/descriptions/halstead.html#1227444 (Accessed: 03/07/2007)

VanDoren, Edmond (2000). Cyclomatic Complexity. Carnegie Mellon Software Engineering Institute. [Online] Available at http://www.sei.cmu.edu/str/descriptions/cyclomatic_body.html (Accessed: 03/07/2007)

Zachman (2007). Zachman Framework. The Zachman Institute for Framework Advancement. [Online] Available at http://www.zifa.com (Accessed: 03/03/2006)

Zhen Yang; Xiangfei Nie; Weiran Xu; Jun Guo (2006). ‘An Approach to Spam Detection by Naive Bayes Ensemble Based on Decision Induction’. Intelligent Systems Design and Applications, 2006. ISDA '06. Sixth International Conference. Volume 2, Oct. 2006 Page(s):861 – 866. IEEE Conference Proceeding. [Online] Available at DOI= 10.1109/ISDA.2006.253725. (Accessed: 03/04/2007)
