TECHNOLOGY
PRESENTED BY :
Final-year
Department of Computer Science and Engineering
phugal.capture@gmail.com [1]
Dinram.kannan@gmail.com[2]
9843980614 [1]
9790118512 [2]
AUTOMATED SPAM FILTERING: A FUZZY
SIMILARITY APPROACH
Final-year
Department of Computer Science and Engineering
Kongunadu College of Engineering and Technology
Abstract
E-mail spam has become an epidemic problem that can negatively affect the usability of electronic mail as a means of communication. Besides wasting users’ time and effort in scanning and deleting the massive amount of junk e-mail received, it consumes network bandwidth and storage space, slows down e-mail servers, and provides a medium for distributing harmful and/or offensive content. Several machine learning approaches have been applied to this problem. In this paper, we explore a new approach based on fuzzy similarity that can automatically classify e-mail messages as spam or legitimate. We study its performance for various conjunction and disjunction operators on several datasets.

1. Introduction

With the increasing popularity of electronic mail, several people and companies have found it an easy way to distribute a massive amount of unsolicited messages to a tremendous number of users at very low cost. These unwanted bulk messages, or junk e-mails, are called spam messages. E-mail spam has continued to increase at a very fast rate over the last couple of years. It has become a major threat to business users, network administrators and even ordinary users.

A study in 2007 reported that spam would continue to rise, reaching a plateau at around 92% of e-mail traffic. There is also a prediction that by the year 2015 spam will exceed 95% of all e-mail traffic. Spam can be very costly to e-mail recipients: it reduces their productivity by wasting their time and causing the annoyance of dealing with a large amount of spam. As a result, spam has become an area of growing concern, attracting the attention of many security researchers and practitioners. Front-end filtering has been the most common and easiest way to reject or quarantine spam messages as early as possible at the receiving server. In this paper, we investigate an alternative automated spam filtering technique based on fuzzy similarity.

2. Spam-Filtering Methods

Spam-filtering methods are classified into two broad categories: non-machine-learning based and machine-learning based.

Non-machine-learning methods base their e-mail classification on a predefined list of known spammers and/or a list of keywords, whereas machine learning methods take the content of the message into consideration and adapt their decisions accordingly.
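As a rough illustration of the non-machine-learning category, the sketch below checks a message against a fixed list of known spammers and a fixed list of keywords. The addresses, keywords, and the function name `is_spam_static` are hypothetical and chosen only for this example; they do not come from any system discussed in this paper.

```python
import re

# Hypothetical static lists; these values are illustrative assumptions only.
KNOWN_SPAMMERS = {"offers@bulk-mail.example"}
SPAM_KEYWORDS = {"lottery", "winner", "prize"}


def is_spam_static(sender: str, body: str) -> bool:
    """Flag a message if its sender is listed or its body has a listed keyword."""
    if sender.lower() in KNOWN_SPAMMERS:
        return True
    # Split the body into lowercase alphabetic words, ignoring punctuation.
    words = set(re.findall(r"[a-z]+", body.lower()))
    return not words.isdisjoint(SPAM_KEYWORDS)
```

Because the lists are fixed, such a filter cannot adapt to new senders or new wording, which is the limitation that the machine-learning category is meant to address.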
3. Existing Systems

Some of the most widely applied techniques in spam filtering are the following.

Bayesian Approach

In this technique, each message is described by a set of attributes (e.g. words or phrases). Probabilities are assigned to each attribute based on its number of occurrences in the training corpus. These probabilities are then used to classify a message into the most probable category by applying Bayes' theorem.

Rule-Based Filtering

Memory-Based (Instance-Based) Approach

Sakkis et al. proposed a memory-based approach to anti-spam filtering for mailing lists. In this approach, each message in the training examples is converted into a vector representing the values of different attributes of the message. These vectors are stored in a memory structure and are used directly to classify e-mail messages. This method uses a variant of the simple k-nearest-neighbor (k-NN) method, in which classification is usually performed by assigning to each unseen instance the majority class of its k closest training instances.
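The memory-based idea can be sketched with a plain k-NN classifier. This is a minimal sketch of the simple k-NN method, not the variant proposed by Sakkis et al.; the word-count attributes, the Euclidean distance, the toy training messages, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy training corpus (hypothetical messages, for illustration only).
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report attached", "legitimate"),
    ("lunch on monday", "legitimate"),
]

# The attributes are the distinct words seen in the training corpus.
VOCABULARY = sorted({word for text, _ in TRAINING for word in text.split()})


def to_vector(text):
    """Convert a message into a vector of word counts over VOCABULARY."""
    words = text.lower().split()
    return [words.count(term) for term in VOCABULARY]


# The stored "memory": one (vector, label) pair per training message.
MEMORY = [(to_vector(text), label) for text, label in TRAINING]


def knn_classify(message, k=3):
    """Return the majority class among the k closest stored vectors."""
    query = to_vector(message)
    ranked = sorted(MEMORY, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Calling `knn_classify("free money prize")` compares the message with every stored vector, so the cost of classification grows with the size of the memory structure.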
5.3. Classification

Having constructed a knowledge base for the token-category membership, spam filtering is now based on calculating the fuzzy similarity measure between the received message and each category. The message is then classified by comparing its fuzzy similarity measures. In order to calculate fuzzy similarity, we must first determine the membership degree of each token to the message. One way to do that is by first determining the frequency of each token in the message. The membership degree is then defined as the ratio of the number of occurrences of a token in the ...

The fuzzy similarity measure (SM) is computed using the fuzzy conjunction operator and the fuzzy disjunction operator. If a false positive has the same weight as a false negative, we can decide the message category to be the one with the highest similarity measure.

However, if a false positive is more serious than a false negative, we should choose a threshold value greater than 1 and decide based on the ratio of the similarity measures: if the ratio of the spam similarity measure to the legitimate similarity measure is greater than the threshold value, then the message is classified as spam.
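The decision rule above can be sketched as follows. Several details are assumptions made only for this illustration and are not confirmed by the text: the membership degree is taken as a token's count divided by the highest token count in the message, the conjunction and disjunction operators default to min and max, and SM is taken as the sum of conjunctions divided by the sum of disjunctions over all tokens. The knowledge-base values and all names are hypothetical.

```python
from collections import Counter

# Hypothetical token-category membership degrees (the "knowledge base").
SPAM_MU = {"free": 0.9, "money": 0.8, "meeting": 0.1}
LEGIT_MU = {"free": 0.1, "money": 0.2, "meeting": 0.9}


def message_membership(tokens):
    """ASSUMPTION: membership = token count / highest token count in the message."""
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {token: count / top for token, count in counts.items()}


def fuzzy_similarity(msg_mu, cat_mu, conj=min, disj=max):
    """ASSUMPTION: SM = sum of conjunctions / sum of disjunctions over all tokens."""
    tokens = set(msg_mu) | set(cat_mu)
    num = sum(conj(msg_mu.get(t, 0.0), cat_mu.get(t, 0.0)) for t in tokens)
    den = sum(disj(msg_mu.get(t, 0.0), cat_mu.get(t, 0.0)) for t in tokens)
    return num / den if den else 0.0


def classify(tokens, threshold=1.0):
    """Label a message spam when SM_spam / SM_legit exceeds the threshold."""
    msg_mu = message_membership(tokens)
    sm_spam = fuzzy_similarity(msg_mu, SPAM_MU)
    sm_legit = fuzzy_similarity(msg_mu, LEGIT_MU)
    # Equivalent to sm_spam / sm_legit > threshold, without dividing by zero.
    return "spam" if sm_spam > threshold * sm_legit else "legitimate"
```

With `threshold=1.0` the rule reduces to picking the category with the higher similarity measure; raising the threshold above 1 makes the filter more reluctant to label a message as spam, which lowers the risk of false positives. Other operator pairs can be tried by passing different `conj` and `disj` functions to `fuzzy_similarity`.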
6. Conclusion