
Uploaded by Alok Nandan Jha


Submitted To:

BY:

(B-11)

INDEX

CONTENT

PROBLEM STATEMENT

INTRODUCTION

LITERATURE SURVEY

METHODOLOGY PROPOSED

EXPERIMENT AND RESULT

CONCLUSION

FUTURE WORK

REFERENCES


ACKNOWLEDGEMENT

A journey is easier when you travel together. Interdependence is certainly more valuable than independence. This report is the result of intensive work and observation, during which we were accompanied and supported by many people. It is a pleasure to now have the opportunity to express our gratitude to all of them.

We thank Arti Gupta Ma'am, our IRDM project mentor, for providing the initial stimulating ideas that started the project and gave us insight into it. Her support and constant guidance were a continual source of inspiration towards the completion of this report.

In a nutshell, this project would have been stuck in the wilderness without her assistance and the stimulating discussions that drove the work.


PROBLEM STATEMENT

We implement Paul Graham's naive Bayesian spam filter algorithm in C#. It is suitable for incorporation into an ASP.NET blog, forum, or email application.

The Achilles heel of the spammers is their message. They can circumvent any other barrier we set up; they have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that. In fact, we have found that we can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words.


INTRODUCTION

Earlier, a simple silent human-detection script ran behind the scenes to ensure that a real person was sitting at a real keyboard and typing blog entries in by hand. Now a new breed of spam is showing up on Blogabond, and it is getting worse every day. Modern email clients all use Bayesian spam filtering, so that is what we are going to implement. Content-based filters are the way to stop spam.

Using a slightly tweaked Bayesian filter, we now miss fewer than 5 per 1000 spams, with 0 false positives. A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.


LITERATURE SURVEY

SOURCE-

http://www.paulgraham.com/spam.html

This source is the originator of the naive Bayesian spam filter. According to the paper, the statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to write software that recognizes individual properties of spam: one look at a spam and you think, the gall of these guys to send me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. Filtering out that stuff takes a big bite out of incoming spam. But the paper goes further and discusses AI techniques to automate this process, i.e. to train the system to filter spam automatically from a training set.

SOURCE-

The paper first discusses the problem of spam, which has been seriously troubling the Internet community during the last few years and has now reached an alarming scale. Observations made at CERN (the European Organization for Nuclear Research, located in Geneva, Switzerland) show that spam can constitute up to 75% of daily SMTP traffic. The paper then presents a naive Bayesian classifier based on a bag-of-words representation of an email, a widely used approach to stopping this unwanted flood, as it combines good performance with simplicity of the training and classification processes. However, facing the constantly changing patterns of spam, it is necessary to ensure online adaptability of the classifier.


SOURCE-

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

The article discusses Bayes' theorem and how it can be used for spam filtering. Bayes' theorem is used several times in the context of spam:

a first time, to compute the probability that the message is spam, knowing that a given word appears in this message;

a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them).

The formula used by the software for the first step is derived from Bayes' theorem:

    Pr(S|W) = Pr(W|S) Pr(S) / (Pr(W|S) Pr(S) + Pr(W|H) Pr(H))

where:

Pr(S|W) is the probability that a message is spam, knowing that the word W is in it;
Pr(S) is the overall probability that any given message is spam;
Pr(W|S) is the probability that the word W appears in spam messages;
Pr(H) is the overall probability that any given message is not spam (is "ham");
Pr(W|H) is the probability that the word W appears in ham messages.

Most Bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%: Pr(S) = Pr(H) = 0.5. The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption allows the general formula to be simplified to:

    Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))


Combining individual probabilities

Bayesian spam filtering software makes the "naive" assumption that the words present in the message are independent events. That is wrong in natural languages like English, where the probability of finding an adjective, for example, is affected by the probability of having a noun. With that assumption, one can derive another formula from Bayes' theorem:

    p = (p1 p2 ... pN) / (p1 p2 ... pN + (1 - p1)(1 - p2) ... (1 - pN))

where:

p is the probability that the suspect message is spam;
p1 is the probability p(S|W1) that it is spam knowing it contains a first word;
p2 is the probability p(S|W2) that it is spam knowing it contains a second word;
pN is the probability p(S|WN) that it is spam knowing it contains an Nth word.
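The combination formula above can be sketched in a few lines of Python (an illustration only; the function name is ours, not taken from the report's C# implementation):

```python
from functools import reduce

def combine_probabilities(probs):
    """Combine per-word spam probabilities p1..pN with the
    naive Bayes formula p = prod(p) / (prod(p) + prod(1 - p))."""
    prod_spam = reduce(lambda acc, p: acc * p, probs, 1.0)
    prod_ham = reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)
    return prod_spam / (prod_spam + prod_ham)
```

A single neutral word (p = 0.5) leaves the result at 0.5, while a few strongly spam-indicative words push the combined probability rapidly towards 1.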

SOURCE-

http://www.ibm.com/developerworks/linux/library/l-spamf.html

In this article, the author describes ways that computer code can help eliminate unsolicited commercial e-mail, viruses, Trojans, and worms, as well as frauds perpetrated electronically and other undesired and troublesome e-mail. The problem with spam is that it tends to swamp desirable e-mail. The author discusses various spam filtering methods.

1. Basic structured text filters

The e-mail client we use can sort incoming e-mail based on simple strings found in specific header fields, the header in general, and/or in the body. Its capability is very simple and does not even include regular expression matching. Almost all e-mail clients have this much filtering capability.

These few simple filters correctly catch about 80% of the spam. Unfortunately, they also have a relatively high false positive rate -- enough that one needs to manually examine some of the spam folders from time to time.

2. Whitelist/verification filters

A fairly aggressive technique for spam filtering is what we would call the "whitelist plus verification" approach.

A whitelist filter passes mail on to the inbox only from explicitly approved senders. Other messages generate a special challenge response to the sender. The whitelist filter's response contains some kind of unique code that identifies the original message, such as a hash or sequential ID. This challenge message contains instructions for the sender to reply in order to be added to the whitelist (the response message must contain the code generated by the whitelist filter).

3. Rule-based rankings

Here a large number of patterns -- mostly regular expressions -- are evaluated against a candidate message. Some matched patterns add to a message's score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate. The rules, however, need to be updated as the products and scams advanced by spammers evolve.
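A rule-based ranking of this kind can be sketched as follows (the patterns, score deltas, and threshold are invented for illustration; production systems ship hundreds of tuned rules):

```python
import re

# Hypothetical rules: (compiled pattern, score delta). Positive deltas
# count towards spam, negative deltas towards legitimate mail.
RULES = [
    (re.compile(r"dear friend", re.I), 2.5),
    (re.compile(r"!{3,}"), 1.0),
    (re.compile(r"free money", re.I), 3.0),
    (re.compile(r"meeting agenda", re.I), -2.0),
]
THRESHOLD = 3.0  # assumed cutoff

def score_message(text):
    # Sum the deltas of every rule whose pattern matches the message.
    return sum(delta for pattern, delta in RULES if pattern.search(text))

def is_spam(text):
    return score_message(text) > THRESHOLD
```

The weakness noted above shows up directly: any new spammer phrasing requires a new hand-written rule.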

Paul Graham wrote a provocative essay in August 2002. The general idea is that some words occur more frequently in known spam, and other words occur more frequently in legitimate messages. Using well-known mathematics, it is possible to generate a "spam-indicative probability" for each word. This approach has several advantages:

1. It can generate a filter automatically from corpora of categorized messages rather than requiring human effort in rule development.

2. It can be customized to individual users' characteristic spam and legitimate messages.

3. It can be implemented in a very small number of lines of code.

4. It works surprisingly well.


METHODOLOGY PROPOSED

We started with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. We scan the entire text, including headers and embedded HTML and JavaScript, of each message in each corpus. We currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) We ignore tokens that are all digits, and we also ignore HTML comments, not even considering them as token separators. We count the number of times each token (ignoring case, currently) occurs in each corpus.
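These tokenization rules can be sketched in Python (a simplification: it folds case and drops all-digit tokens, but does not implement the HTML-comment rule):

```python
import re

# Tokens are runs of alphanumerics, dashes, apostrophes, and dollar
# signs; everything else is a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    # Fold case and ignore tokens that are all digits.
    return [t.lower() for t in TOKEN_RE.findall(text) if not t.isdigit()]

def count_tokens(messages):
    # Map each token to its number of occurrences across a corpus.
    counts = {}
    for message in messages:
        for token in tokenize(message):
            counts[token] = counts.get(token, 0) + 1
    return counts
```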

At this stage we end up with two large hash tables, one for each corpus, mapping tokens to numbers of occurrences.

Next we create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which we calculate as follows:

    (let ((g (* 2 (or (gethash word good) 0)))
          (b (or (gethash word bad) 0)))
      (unless (< (+ g b) 5)
        (max .01
             (min .99 (float (/ (min 1 (/ b nbad))
                                (+ (min 1 (/ g ngood))
                                   (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables we created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively. We want to bias the probabilities slightly to avoid false positives, and we've found that a good way to do it is to double all the counts in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. We only consider words that occur at least five times in total.

Then there is the question of what probability to assign to words that occur in one corpus but not the other. Again we chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.

We considered each corpus to be a single long stream of text for purposes of counting occurrences. We use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against spam.
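The Lisp snippet above translates directly into Python (a sketch; good and bad are the occurrence hash tables and ngood and nbad the message counts, as described above):

```python
def word_probability(word, good, bad, ngood, nbad):
    """Probability that an email containing `word` is spam.
    Counts in `good` are doubled to bias against false positives;
    tokens seen fewer than five times in total are skipped."""
    g = 2 * good.get(word, 0)
    b = bad.get(word, 0)
    if g + b < 5:
        return None  # too rare to be informative
    return max(0.01,
               min(0.99,
                   min(1.0, b / nbad) /
                   (min(1.0, g / ngood) + min(1.0, b / nbad))))
```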


When new mail arrives, it is scanned into tokens, and the fifteen most interesting tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. From this list of fifteen individual probabilities, we calculate the combined probability.

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. We again use .4, the number proposed: if you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
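Putting the pieces together, classification can be sketched as below (a simplified illustration; unknown words get the .4 prior mentioned above, and 0.9 is an assumed decision threshold):

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")
UNKNOWN_PROB = 0.4  # prior for never-seen words

def spam_score(text, word_probs, n_interesting=15):
    """Combine the n most interesting tokens, i.e. those whose spam
    probability lies farthest from the neutral 0.5."""
    tokens = {t.lower() for t in TOKEN_RE.findall(text) if not t.isdigit()}
    interesting = sorted(tokens,
                         key=lambda t: abs(word_probs.get(t, UNKNOWN_PROB) - 0.5),
                         reverse=True)[:n_interesting]
    prod_spam = prod_ham = 1.0
    for t in interesting:
        p = word_probs.get(t, UNKNOWN_PROB)
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

def is_spam(text, word_probs, threshold=0.9):
    return spam_score(text, word_probs) > threshold
```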


EXPERIMENT AND RESULT

INPUT PARAMETERS:

ALGO A:

ALGO B:


SCREEN SHOT:

PARAMETERS (comparison of the three approaches):

FALSE POSITIVES: 12 out of 100 | 5 out of 100 | 1/2 out of 100
IMAGE SPAM: solved using OCR
USAGE: used as the basic block of every spam filter | used at the user end of the spam filter | widely used at the developer end of the spam filter


CONCLUSION

The advantage of Bayesian spam filtering is that it can be trained on a per-user basis. The spam that a user receives is often related to that user's online activities. The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to that of pre-defined rules. It can also perform particularly well in avoiding false positives, where legitimate email is incorrectly classified as spam.

However, there are a few disadvantages too: Bayesian spam filtering is susceptible to Bayesian poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include the insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score and making it more likely to slip past a Bayesian spam filter. Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked.

A probably more efficient counter-measure has been proposed by Google and is used by its Gmail email system: performing OCR (Optical Character Recognition) on every mid-to-large-size image and analyzing the text inside.

Even though Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. There is recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and decide on behavioral responses.


FUTURE WORK

We could develop a filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in our current database, the word "offers" has a probability of .96. If we based the probabilities on word pairs, we would end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.
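Scoring on word pairs could start from a simple bigram extractor (a sketch; the pair probabilities shown are the hypothetical figures from the example above):

```python
def bigrams(tokens):
    # Pair each token with its successor: ["a","b","c"] -> ["a b", "b c"]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Hypothetical pair probabilities from the example above
pair_probs = {"special offers": 0.99,
              "valuable offers": 0.99,
              "approach offers": 0.1}
```

The pairs produced by bigrams() would then be looked up in pair_probs just as single tokens are looked up today.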

Recognizing nonspam features may be more important than recognizing spam features. False positives are such a worry that they demand extraordinary measures. We will probably, in future versions, add a second level of testing designed specifically to avoid false positives. If a mail triggers this second level of filters, it will be accepted even if its spam probability is above the threshold.

We might also focus extra attention on specific parts of the email. For example, about 95% of current spam includes the url of a site the spammer wants you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.) The url is in such cases practically enough by itself to determine whether the email is spam.

We could also maintain a list of urls used by spammers. A way to create such a list is to test dubious urls by sending out a crawler to look at the site before the user looks at the email mentioning it. We could use a Bayesian filter to rate the site just as one would an email, and whatever was found on the site could be included in calculating the probability of the email being spam. A url that led to a redirect would of course be especially suspicious.


REFERENCES

[1] P. Graham. (2002, August). A Plan for Spam [Online]. Available: www.paulgraham.com/spam.html

[2] Academy of Science, Engineering and Technology, 7, 2005.

[3] Spam Filtering Techniques [Online]. Available: www.ibm.com/developerworks/linux/library/l-spamf.html

[4] Bayesian spam filtering [Online]. Available: http://en.wikipedia.org/wiki/Bayesian_spam_filtering

