
KONGUNADU COLLEGE OF ENGINEERING AND TECHNOLOGY

AUTOMATED SPAM FILTERING: A FUZZY SIMILARITY APPROACH

PRESENTED BY :

S.Phugalandran [1] M.Dineshkannan [2]

Final-year
Department of Computer Science and Engineering

phugal.capture@gmail.com [1]
Dinram.kannan@gmail.com[2]

9843980614 [1]
9790118512 [2]

Abstract

E-mail spam has become an epidemic problem that can negatively affect the usability of electronic mail as a means of communication. Besides wasting users' time and effort in scanning and deleting the massive amount of junk e-mail received, it consumes network bandwidth and storage space, slows down e-mail servers, and provides a medium for distributing harmful and/or offensive content. Several machine learning approaches have been applied to this problem. In this paper, we explore a new approach based on fuzzy similarity that can automatically classify e-mail messages as spam or legitimate. We study its performance for various conjunction and disjunction operators on several datasets.

1. Introduction

With the increasing popularity of electronic mail, many people and companies have found it an easy way to distribute massive amounts of unsolicited messages to a tremendous number of users at very low cost. These unwanted bulk messages, or junk e-mails, are called spam messages. E-mail spam has continued to increase at a very fast rate over the last couple of years. It has become a major threat to business users, network administrators and even ordinary users.

A study in 2007 reported that spam will continue to rise, reaching a plateau at around 92% of e-mail traffic. Another prediction is that by 2015 spam will exceed 95% of all e-mail traffic. Spam can be very costly to e-mail recipients: it reduces their productivity by wasting their time and by the annoyance of dealing with a large amount of spam. As a result, spam has become an area of growing concern, attracting the attention of many security researchers and practitioners. Front-end filtering has been the most common and easiest way to reject or quarantine spam messages as early as possible at the receiving server. In this paper, we investigate an alternative automated spam filtering technique based on fuzzy similarity.

2. Spam-Filtering Methods

Spam-filtering methods are classified into two broad categories: non-machine-learning based and machine-learning based.

Non-machine-learning methods base their e-mail classification on a predefined list of known spammers and/or a list of keywords, whereas machine learning methods take the content of the message into consideration and adapt their decisions accordingly.
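To make the non-machine-learning category concrete, the following is a minimal sketch of a static list-based filter. The sender address, the keyword list and the function name rule_filter are hypothetical and chosen only for illustration; they do not come from any particular product.

```python
# Hypothetical static lists; a real deployment would load these from configuration.
BLACKLISTED_SENDERS = {"offers@bulk-mailer.example"}
SPAM_KEYWORDS = {"lottery", "winner", "prize"}


def rule_filter(sender: str, body: str) -> bool:
    """Return True if the message is flagged as spam by the static lists."""
    # Rule 1: the sender is on the list of known spammers.
    if sender.lower() in BLACKLISTED_SENDERS:
        return True
    # Rule 2: the body contains at least one listed keyword.
    words = set(body.lower().split())
    return not words.isdisjoint(SPAM_KEYWORDS)
```

Because both lists are fixed in advance, a filter of this kind cannot adapt when spammers change addresses or wording, which is the limitation that content-based machine learning methods try to address.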
3. Existing Systems

Some of the most widely applied techniques in spam filtering are:

• Bayesian Approach

In this technique, each message is described by a set of attributes (e.g. words or phrases). Probabilities are assigned to each attribute based on its number of occurrences in the training corpus. These probabilities are then used to classify a message into the most probable category by applying Bayes' theorem.

• Rule-Based Filtering

Spam usually follows certain patterns. Rule-based filters examine messages for those patterns, following specific rules, in order to identify spam mails. The Ripper algorithm is a typical rule-based classifier. This kind of filter often scales relatively poorly with the sample size.

• Support Vector Machines (SVM)

Drucker et al. used a support vector machine for e-mail classification based on the content. Given the input as a binary feature vector, the idea behind SVM is to find a hyperplane that best separates the data points into two classes with maximum margin between them. SVM is very well suited for text categorization.

• Memory-Based (Instance-Based) Approach

Sakkis et al. proposed a memory-based approach to anti-spam filtering for mailing lists. In this approach, each message in the training examples is converted into a vector representing the values of different attributes of the message. These vectors are stored in a memory structure and are used directly to classify e-mail messages. This method uses a variant of the simple k-nearest-neighbor (k-NN) method, in which classification is usually performed by assigning to each unseen instance the majority class of its k closest training instances.

• Sparse Binary Polynomial Hash (SBPH)

A generalization of the Bayesian approach that can match mutating phrases as well as single words. By using SBPH, a large number of features can be generated automatically from an incoming text, and a weight is then assigned to each feature according to its probability of indicating spam or not.

• Rough Set Theory

E-mail can be classified into three categories: spam, non-spam and suspicious. Experimental results showed that rough-set-based filters can reduce false-positive classifications.
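As an illustration of the Bayesian approach in the list above, here is a minimal naive Bayes sketch. It assumes messages are already tokenized, that the two labels are "spam" and "legit", and that add-one smoothing is used; these choices and the function names train_nb and classify_nb are assumptions made for the example, not details given in this paper.

```python
import math
from collections import Counter


def train_nb(messages):
    """messages: list of (tokens, label) pairs, label in {"spam", "legit"}."""
    token_counts = {"spam": Counter(), "legit": Counter()}
    class_counts = Counter()
    for tokens, label in messages:
        token_counts[label].update(tokens)  # occurrences of each attribute per class
        class_counts[label] += 1
    vocab = set(token_counts["spam"]) | set(token_counts["legit"])
    return token_counts, class_counts, vocab


def classify_nb(tokens, model):
    """Return the most probable category for a tokenized message."""
    token_counts, class_counts, vocab = model
    total_msgs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_msgs)
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing; an assumption of this sketch.
            p = (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.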
4. Proposed System

Although a variety of machine learning techniques have been applied to spam filtering, fuzzy logic has not received much attention. A trainable fuzzy-logic-based e-mail classification system was presented that learns the most effective fuzzy rules during the training phase and then applies the fuzzy control model to classify unseen messages. Inspired by the success of fuzzy similarity in text classification and document retrieval, the proposed system applies it to filtering spam based on the textual content of e-mail messages.

5. Fuzzy similarity approach

Fuzzy logic deals with fuzzy sets, which allow partial membership in a set. Membership in these vaguely defined sets is represented by a degree of relevance. This provides flexibility in dealing with uncertainty in systems such as spam filtering. Using a fuzzy similarity approach, a classification model is built from a set of pre-classified e-mail instances.

5.1. Pre-processing

Before the e-mail messages in the given corpus are used for training and classification, some pre-processing is needed to obtain the best results. First, all HTML tags are stripped off. Then all stop words, i.e. words that appear frequently but have low discriminating power, are removed from each message. The message is then tokenized into a set of strings separated by delimiters, e.g. white space. These tokens (or terms) can represent words, phrases or any keyword patterns. All mixed-case tokens are converted to lowercase. The resulting tokens are stemmed to their roots (i.e. each token is replaced with its base form) to avoid treating different forms of the same word as different attributes, thus reducing the size of the attribute set. In addition, a token that appears only a few times in either category (e.g. fewer than three times) is removed. Hence, only significant features with high information gain are kept. The tokens from all messages are then combined into one vector T = <t1, t2, …, tN>, where N is the total number of tokens. The number of occurrences of each token in each category is also determined.

5.2. Training

During the training phase, a model is built based on the characteristics of each category in a pre-classified set of e-mail messages. The training dataset should be selected so that it varies in content and subject.

Each sample message is labeled with a specific category. We first perform pre-processing to extract tokens and determine the number of occurrences of each token in each category. From these data, we define a fuzzy token-category relation that maps each element of T × C, where C is the set of categories, to a membership value between 0 and 1. This implies that if a token occurs only in one category, then its membership to this category will be one, and to the other category will be zero.
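The pre-processing and training steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is a tiny hypothetical one, stemming is omitted to keep the example free of third-party libraries, the rare-token rule is read as "drop a token whose count is below min_count in every category", and the membership of a token in a category is assumed to be its count in that category divided by its total count over all categories (which gives 1 and 0 for a token seen in only one category, as the text requires).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "in", "for", "you"}


def preprocess(text):
    """Strip HTML tags, lowercase, tokenize, and drop stop words (no stemming)."""
    text = re.sub(r"<[^>]+>", " ", text)             # remove HTML tags
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # lowercase and tokenize
    return [t for t in tokens if t not in STOP_WORDS]


def train(labeled_messages, min_count=1):
    """Build a fuzzy token-category relation {token: {category: membership}}.

    labeled_messages: iterable of (raw_text, category) pairs.
    """
    counts = {}
    categories = set()
    for text, category in labeled_messages:
        categories.add(category)
        for tok in preprocess(text):
            counts.setdefault(tok, Counter())[category] += 1

    relation = {}
    for tok, per_cat in counts.items():
        # Assumed reading of the rare-token rule: drop tokens rare in every category.
        if max(per_cat.values()) < min_count:
            continue
        total = sum(per_cat.values())
        # Assumed membership: share of the token's occurrences in each category.
        relation[tok] = {c: per_cat[c] / total for c in categories}
    return relation
```

For example, with two spam messages containing "cash" twice in total and one legitimate message containing it once, the sketch assigns "cash" a membership of 2/3 to spam and 1/3 to legitimate.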
Typically, the membership values will range between zero and one. The training process is illustrated in Figure.

5.3. Classification

Having constructed a knowledge base of token-category memberships, spam filtering is now based on calculating the fuzzy similarity measure between the received message and each category. The message is then classified by comparing its fuzzy similarity measures. To calculate fuzzy similarity, we must first determine the membership degree of each token to the message. One way to do that is to first determine the frequency of each token in the message. The membership degree is then defined as the ratio of the number of occurrences of a token in the message to the maximum number of occurrences of the token in all categories. Thus, the token with the maximum number of occurrences will be assigned a value of 1, and all other tokens will be assigned proportional values.

The fuzzy similarity measure (SM) is computed using a fuzzy conjunction operator and a fuzzy disjunction operator. If a false positive has the same weight as a false negative, we can take the message category to be the one with the highest similarity measure.

However, if a false positive is more serious than a false negative, we should choose a threshold value greater than 1 and decide based on the ratio of the similarity measures: if the ratio of the spam similarity measure to the legitimate similarity measure is greater than the threshold, the message is classified as spam.
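The classification step can be sketched as below. Several points are assumptions, since the formula itself is not reproduced in this text: the similarity measure is taken to be the sum of conjunctions divided by the sum of disjunctions over all tokens in the knowledge base, min and max are used as the default conjunction and disjunction operators, the message membership is normalized by the largest token count in the message (one reading of the description above), and the category names "spam" and "legit" and all function names are hypothetical.

```python
from collections import Counter


def message_membership(tokens):
    """Membership degree of each token to the message, scaled so the most
    frequent token gets 1 (assumed normalization)."""
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {tok: n / top for tok, n in counts.items()}


def fuzzy_similarity(tokens, relation, category, conj=min, disj=max):
    """Assumed form: sum of conjunctions over sum of disjunctions.

    relation: {token: {category: membership}} built during training.
    """
    mu_msg = message_membership(tokens)
    num = den = 0.0
    for tok, per_cat in relation.items():
        mu_r = per_cat.get(category, 0.0)  # token-category membership
        mu_m = mu_msg.get(tok, 0.0)        # token-message membership
        num += conj(mu_r, mu_m)
        den += disj(mu_r, mu_m)
    return num / den if den else 0.0


def classify(tokens, relation, threshold=1.0):
    """Label a message "spam" if SM(spam) / SM(legit) exceeds the threshold."""
    sm_spam = fuzzy_similarity(tokens, relation, "spam")
    sm_legit = fuzzy_similarity(tokens, relation, "legit")
    if sm_legit == 0.0:
        return "spam" if sm_spam > 0.0 else "legit"
    return "spam" if sm_spam / sm_legit > threshold else "legit"
```

With threshold=1.0 the rule reduces to choosing the category with the higher similarity; raising the threshold makes the filter more reluctant to label a message as spam, which suits the case where false positives are costlier. Other operator pairs, for instance the algebraic product with the probabilistic sum, can be passed through conj and disj.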

6. Conclusion

In this paper, we presented a fuzzy-similarity-based spam filtering method. This method considers the content of the message to predict its category rather than relying on a fixed, pre-specified set of keywords. Thus, it can adapt to spammer tactics and dynamically build its knowledge base.
