Vous êtes sur la page 1sur 9

Expert Systems with Applications 37 (2010) 7913–7921

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Intelligent phishing detection system for e-banking using fuzzy data mining
Maher Aburrous a,*, M.A. Hossain a, Keshav Dahal a, Fadi Thabtah b
a
Department of Computing University of Bradford, Bradford, UK
b
MIS Department Philadelphia University Amman, Jordan

a r t i c l e i n f o a b s t r a c t

Keywords: Detecting and identifying any phishing websites in real-time, particularly for e-banking, is really a com-
Phishing plex and dynamic problem involving many factors and criteria. Because of the subjective considerations
Fuzzy logic and the ambiguities involved in the detection, fuzzy data mining techniques can be an effective tool in
Data mining assessing and identifying phishing websites for e-banking since it offers a more natural way of dealing
Classification
with quality factors rather than exact values. In this paper, we present novel approach to overcome
Association
Apriori
the ‘fuzziness’ in the e-banking phishing website assessment and propose an intelligent resilient and
E-banking risk assessment effective model for detecting e-banking phishing websites. The proposed model is based on fuzzy logic
combined with data mining algorithms to characterize the e-banking phishing website factors and to
investigate its techniques by classifying the phishing types and defining six e-banking phishing website
attack criteria’s with a layer structure. Our experimental results showed the significance and importance
of the e-banking phishing website criteria (URL & Domain Identity) represented by layer one and the var-
ious influence of the phishing characteristic on the final e-banking phishing website rate.
Ó 2010 Elsevier Ltd. All rights reserved.

1. Introduction lem with each other for which there is no known single silver
bullet to entirely solve it. The motivation behind this study is to
E-banking phishing websites are forged websites that are cre- create a resilient and effective method that uses fuzzy data mining
ated by malicious people to mimic real e-banking websites. Most algorithms and tools to detect e-banking phishing websites in an
of these kinds of web pages have high visual similarities to scam automated manner.
their victims. Some of these web pages look exactly like the real DM approaches such as neural networks, rule induction, and
ones. Unwary Internet users may be easily deceived by this kind decision trees can be a useful addition to the fuzzy logic model.
of scam. Victims of e-banking phishing websites may expose their It can deliver answers to business questions that traditionally were
bank account, password, credit card number, or other important too time-consuming to resolve such as, ‘‘Which are most important
information to the phishing web page owners. The impact is the e-banking phishing website characteristic indicators and why?” by
breach of information security through the compromise of confi- analyzing massive databases and historical data for training
dential data, and the victims may finally suffer losses of money purposes.
or other kinds. Phishing is a relatively new Internet crime in com- The paper is organized as follows: Section 2 presents the litera-
parison with other forms, e.g., virus and hacking. More and more ture review and related work, Section 3 shows the theory and
phishing web pages have been found in recent years in an acceler- methodology of the proposed fuzzy based data mining approach
ative way (Fu, Wenyin, & Deng, 2006). The word phishing from the for the phishing website risk assessment model. Section 4 intro-
phrase ‘‘website phishing” is a variation on the word ‘‘fishing”. The duces the system design and implementation with the overall fuz-
idea is that bait is thrown out with the hopes that a user will grab it zy data mining inference rules. Section 5 reveals the experiments
and bite into it just like the fish. In most cases, bait is either an e- and results of the fuzzy data mining e-banking phishing website
mail or an instant messaging site, which will take the user to hos- risk assessment model and then conclusions and future work are
tile phishing websites (James, 2006). given in Section 6.
E-banking phishing website is a very complex issue to under-
stand and to analyze, since it is joining technical and social prob- 2. Literature review and related work

2.1. Literature review


* Corresponding author.
E-mail addresses: mrmaburr@bradford.ac.uk (M. Aburrous), m.a.hossain1@
bradford.ac.uk (M.A. Hossain), k.p.dahal@bradford.ac.uk (K. Dahal), ffayez@ Phishing website is a recent problem, nevertheless due to its
philadelphia.edu.jo (F. Thabtah). huge impact on the financial and on-line retailing sectors and since

0957-4174/$ - see front matter Ó 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.04.044
7914 M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921

preventing such attacks is an important step towards defending automatic way rather than involving too many human efforts.
against e-banking phishing website attacks, there are several Their method first decomposes the web pages (in HTML) into sali-
promising defending approaches to this problem reported earlier. ent (visually distinguishable) block regions. The visual similarity
In this section, we briefly survey existing anti-phishing solutions between two web pages is then evaluated in three metrics: block
and list of the related works. One approach is to stop phishing at level similarity, layout similarity, and overall style similarity,
the e-mail level (Adida, Hohenberge, & Rivest, 2005), since most which are based on the matching of the salient block regions (Fu
current phishing attacks use broadcast e-mail (spam) to lure vic- et al., 2006).
tims to a phishing website (Wu, Miller, & Garfinkel, 2006a). An-
other approach is to use security toolbars. The phishing filter in 2.2. Main characteristics of e-banking phishing websites
IE7 (Sharif, 2006) is a toolbar approach with more features such
as blocking the user’s activity with a detected phishing site. A third Evolving with the anti-phishing techniques, various phishing
approach is to visually differentiate the phishing sites from the techniques and more complicated and hard-to-detect methods
spoofed legitimate sites. Dynamic Security Skins (Dhamija & Tygar, are used by phishers. The most straightforward way for a phisher
2005) proposes to use a randomly generated visual hash to cus- to defraud people is to make the phishing web pages similar to
tomize the browser window or web form elements to indicate their targets. Actually, there are many characteristics and factors
the successfully authenticated sites. A fourth approach is two-fac- that can distinguish the original legitimate website from the forged
tor authentication, which ensures that the user not only knows a e-banking phishing website like Spelling errors, Long URL address
secret but also presents a security token (FDIC, 2004). However, and Abnormal DNS record. The full list is shown in Table 1 which
this approach is a server-side solution. Phishing can still happen will be used later on our analysis and methodology study.
at sites that do not support two-factor authentication. Sensitive
information that is not related to a specific site, e.g., credit card
2.3. Why using fuzzy logic and data mining?
information and SSN (Social Security Number), cannot be protected
by this approach either (Wu, Miller, & Little, 2006b).
FL has been used for decades in the engineering sciences to
Many industrial anti-phishing products use toolbars in web
embed expert input into computer models for a broad range of
browsers, but some researchers have shown that security tool bars
do not effectively prevent phishing attacks. Bridges and Vaughn Table 1
(2001), Dhamija and Tygar (2005) proposed a scheme that utilises Components and layers of e-banking phishing website criteria.
a cryptographic identity-verification method that lets remote web
Criteria N Component Layer No.
servers prove their identities. However, this proposal requires
changes to the entire web infrastructure (both servers and clients), URL & Domain 1 Using the IP address Layer 1
Identity 2 Abnormal request URL
so it can succeed only if the entire industry supports it. In Liu, (Weight = 0.3) 3 Abnormal URL of Anchor
Deng, Huang, & Fu (2006), the authors proposed a tool to model 4 Abnormal DNS record Sub
and describe phishing by visualizing and quantifying a given site’s weight = 0.3
threat, but this method still would not provide an anti-phishing 5 Abnormal URL
solution. Another approach is to employ certification, e.g., Micro- Security & 1 Using SSL certificate
soft spam privacy (Microsoft, 2004; Microsoft Corp., 2005; Olsen, Encryption
2 Certification authority Layer 2
2004; Perez, 2003; WholeSecurity Web Caller, 2005). A recent
(Weight = 0.2) 3 Abnormal Cookie
and particularly promising solution was proposed in Herzberg 4 Distinguished Names
and Gbara (2004), which combines the technique of standard cer- Certificate (DN)
tificates with a visual indication of correct certification; a site- Source Code & 1 Redirect pages
dependent logo indicating that the certificate was valid would be Java script
displayed in a trusted credentials area of the browser. A variant 2 Straddling attack
of web credential is to use a database or list published by a trusted (Weight = 0.2) 3 Pharming attack Sub
weight = 0.4
party, where known phishing websites are blacklisted. For exam- 4 Using onMouseOver to hide
ple, Netcraft anti-phishing toolbar (Netcraft, 2004) prevents phish- the Link
ing attacks by utilising a centralized blacklist of current phishing 5 Server Form Handler (SFH)
URLs. Other examples include websense, McAfee’s anti-phishing Page Style & 1 Spelling errors
filter, Netcraft anti-phishing system, Cloudmark SafetyBar, and Contents
Microsoft Phishing Filter (Pan & Ding, 2006). The weaknesses of 2 Copying website
(Weight = 0.1) 3 Using forms with ‘‘Submit” Layer 3
this approach are its poor scalability and its timeliness. Note that
button
phishing sites are cheap and easy to build and their average life- 4 Using Pop-Ups windows
time is only a few days. APWG provides a solution directory at 5 Disabling Right-Click
Anti-Phishing Working Group (2007) which contains most of the Web Address Bar 1 Long URL address
major anti-phishing companies in the world. However, an auto- 2 Replacing similar characters
matic anti-phishing method is seldom reported. The typical tech- for URL
nologies of anti-phishing from the user interface aspect are done (Weight = 0.1) 3 Adding a prefix or suffix
4 Using the @ Symbol to Sub
by Dhamija and Tygar (2005) and Wu et al., 2006b. They proposed Confuse weight = 0.3
methods that need web page creators to follow certain rules to cre- 5 Using Hexadecimal Character
ate web pages, either by adding dynamic skin to web pages or add- Codes
ing sensitive information location attributes to HTML code. Social Human 1 Much emphasis on security
However, it is difficult to convince all web page creators to follow Factor and response
the rules (Fu et al., 2006). In Fu et al. (2006), Liu, Huang, Liu, Zhang, (Weight = 0.1) 2 Public generic salutation
3 Buying Time to Access
& Deng, 2005; Liu et al., 2006), the DOM-based (Wood, 2005) visual
Accounts
similarity of web pages is oriented, and the concept of visual ap-
Total 1
proach to phishing detection was first introduced. Through this ap-
weight
proach, a phishing web page can be detected and reported in an
M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921 7915

applications. It offers a promising alternative for measuring opera-


tional risks (Shah, 2003). The FL approach provides more informa-
tion to help risk managers effectively manage assessing and
ranking e-banking phishing website risks than the current qualita-
tive approaches as the risks are quantified based on a combination
of historical data and expert input. The advantage of the fuzzy ap-
proach is that it enables processing of vaguely defined variables,
and variables whose relationships cannot be defined by mathemat-
ical relationships. FL can incorporate expert human judgment to
define those variable and their relationships.
DM is the process of searching through large amounts of data
and picking out relevant information. It has been described as Fig. 1. Input variable for Long URL address component.
‘‘the nontrivial extraction of implicit, previously unknown, and
potentially useful information from large data sets (Kantardzic &
Mehmed, 2003; Peter & Varian, 2003). It is a powerful new technol- experience. On that matter and instead of employing an expert sys-
ogy with great potential to help researchers focus on the most tem, we utilised data mining classification and association rule ap-
important information in their data archive. Data mining tools pre- proaches in our new e-banking phishing website risk assessment
dict future trends and behaviors, allowing businesses to make pro- model as shown in Fig. 2 to automatically find significant patterns
active, knowledge-driven decisions (Fayyad, 1998). of phishing characteristic or factors in the e-banking phishing web-
site archive data. Particularly, we used a number of different exist-
3. The proposed fuzzy based data mining approach ing data mining classification techniques implemented within
WEKA (WEKA – University of Waikato, New Zealand, EN, 2006)
3.1. Fuzzy data mining algorithms and techniques and CBA packages (Liu, Hsu, & Ma, 1998). JRip (Witten & Frank,
2005) WEKA’s implementation of RIPPER, PART (Witten & Frank,
The approach described here is to apply fuzzy logic and data 2005), Prism (Cendrowska, 1987) and C4.5 (Quinlan, 1996) algo-
mining algorithms to assess e-banking phishing website risk on rithms are selected to learn the relationships of the selected differ-
the 27 characteristics and factors which stamp the forged website. ent phishing features. We have chosen these algorithms since the
The essential advantage offered by fuzzy logic techniques is the use learnt classifiers are easily understood by human (Ciesielski and
of linguistic variables to represent key phishing characteristic indi- Lalani, 2003). While for the association finding, we have used the
cators and relating e-banking phishing website probability. apriori (Agrawal & Srikant, 1994) and predictive apriori algorithm
(Scheffer, 2001) using WEKA.
We used two web access archives: one from APWG archive
3.1.1. Fuzzification
(Anti-Phishing Working Group, 2007) and one from phishtank ar-
In this step, linguistic descriptors such as high, low, medium, for
chive (http://www.phishtank.com/phish_archive.php, (2008)). We
example, are assigned to a range of values for each key phishing
managed to extract six different feature sets from the e-banking
characteristic indicator. Valid ranges of the inputs are considered
phishing website archives and then derived many important rules
and divided into classes, or fuzzy sets. For example, length of
which helped us very much in fuzzy rule phase.
URL address can range from ‘low’ to ‘high’ with other values in be-
tween. We cannot specify clear boundaries between classes. The
3.1.3. Aggregation of the rule outputs
degree of belongingness of the values of the variables to any se-
This is the process of unifying the outputs of all discovered
lected class is called the degree of membership; membership func-
rules. Combining the membership functions of all the rules conse-
tion is designed for each phishing characteristic indicator, which is
quents previously scaled into single fuzzy sets (output).
a curve that defines how each point in the input space is mapped to
a membership value between [0, 1]. Linguistic values are assigned
3.1.4. Defuzzification
for each phishing indicator as low, moderate, and high while for e-
This is the process of transforming a fuzzy output of a fuzzy
banking phishing website risk rate as Very legitimate, Legitimate,
inference system into a crisp output. Fuzziness helps to evaluate
Suspicious, Phishy, and Very phishy (triangular and trapezoidal
the rules, but the final output has to be a crisp number. The input
membership function). For each input, their values range from 0
for the defuzzification process is the aggregate output fuzzy set
to 10 while for output, range from 0 to 100.
An example of the linguistic descriptors used to represent one
of the key phishing characteristic indicators (URL Address Long)
and a plot of the fuzzy membership functions are shown in
Fig. 1. The fuzzy representation more closely matches human cog-
nition, thereby facilitating expert input and more reliably repre-
senting experts’ understanding of underlying dynamics (Bridges
& Vaughn, 2001).
The same approach is used to calibrate the other 26 key phish-
ing characteristic indicators.

3.1.2. Rule generation using classification algorithms


Having specified the risk of e-banking phishing website and its
key phishing characteristic indicators, the next step is to specify
how the e-banking phishing website probability varies. Experts
provide fuzzy rules in the form of if. . .then statements that relate
e-banking phishing website probability to various levels of key
phishing characteristic indicators based on their knowledge and Fig. 2. E-banking phishing website risk assessment model.
7916 M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921

and the output is a number. This step was done using Centroid substantial fraction of users and suggest that alternative ap-
technique (Han & Karypis, 2000) since it is a commonly used proaches are needed (Dhamija, Tygar, & Hearst, 2006).
method. The output is e-banking phishing website risk rate
and is defined in fuzzy sets like ‘very phishy’ to ‘Very legitimate’. 3.3. Mining e-banking phishing websites challenges
The fuzzy output set is then defuzzified to arrive at a scalar
value. There are a number of challenges posed by doing post hoc clas-
sification of e-banking phishing websites. Most of these challenges
only apply to the e-banking phishing websites data and materialize
3.2. Data sets and experimental results
as a form of information, which has the net effect of increasing the
false negative rate. The age of the data set is the most significant
Two publicly available data sets were used to test our imple-
problem, which is particularly relevant with the phishing corpus.
mentation: the ‘‘phishtank” from the phishtank.com (http://
E-banking phishing websites are short lived, often lasting only in
www.phishtank.com/phish_archive.php, 2008) which is consid-
the order of 48 h. Some of our features can therefore not be ex-
ered one of the primary phishing-report collators both the 2007
tracted from older websites, making our tests difficult. The average
and 2008 collections, for a total of approximately 606 e-banking
phishing site stays live for approximately 2.25 days (Putting an end
phishing websites. The PhishTank database records the URL for
to account-hijacking identity theft, 2004). Furthermore, the pro-
the suspected website that has been reported, the time of that re-
cess of transforming the original e-banking phishing website ar-
port, and sometimes further detail such as the screenshots of the
chives into record feature data sets is not without error. It
website and is publicly available. The Anti-Phishing Working
requires the use of heuristics at several steps. Thus, high accuracy
Group (APWG) that maintains a ‘‘Phishing Archive” describing
from the data mining algorithms cannot be expected. However, the
phishing attacks dating back to September 2007 (Anti-Phishing
evidence supporting the golden nuggets comes from a number dif-
Working Group, 2007). We performed a cognitive walkthrough
ferent algorithms and feature sets, and we believe it is compelling
on the approximately 1006 sample attacks within this archive.
(Fette, Sadeh, & Tomasic, 2006).
We used a series of short scripts to programmatically extract the
above features and store these in an excel sheet for quick reference
3.4. Utilisation of different DM classification algorithms
as shown in Fig. 3.
Our goal is to gather information about the strategies that are
The practical part of this study utilises five different common
used by attackers and to formulate hypotheses about classifying
DM algorithms (C4.5, Ripper, Part, Prism, and CBA). Our choice of
and categorizing of all different e-banking phishing attacks tech-
these methods is based on the different strategies they use in
niques. By thoroughly investigating these phishing attacks, we
learning rules from data sets (Misch, 2006). The C4.5 algorithm
have created a data set containing information regarding what dif-
(Quinlan, 1996) employs divide and conquer approach, and the
ferent techniques have been used and how the usage of these tech-
RIPPER algorithm uses separate and conquer approach. The choice
niques has changed over time. By investigating these information,
of PART algorithm is based on the fact that it combines both ap-
we have found some interesting techniques depending on the main
proaches to generate a set of rules. It adapts separate and conquer
perception that phishers know that most users do not know how to
to generate a set of rules and uses divide and conquer to build par-
check the security and often assume that sites requesting sensitive
tial decision trees. The way PART builds and prunes partial decision
information are secure which makes very difficult for them to see
tree is similar to the C4.5 implementation with a difference which
the difference between authentic security and mimicked security
can be explained as follows: C4.5 generates one decision tree and
features (Persson, 2007). We also found that some visual deception
uses pruning techniques to simplify it; each path from the root
attacks can fool even the most sophisticated users. These results
node to one of the leaves in the tree represents a rule. On the other
illustrate that standard security indicators are not effective for a
hand, PART avoids the simplification process by building up partial
decision trees and choosing only one path in each one of them to
derive a rule. Once the rule is generated, all instances associated
with it and the partial tree will be discarded. PRISM is a classifica-
tion rule which can only deal with nominal attributes and does not
do any pruning. It implements a top–down (general to specific)
sequential-covering algorithm that employs a simple accuracy-
based metric to pick an appropriate rule antecedent during rule
construction. Finally, CBA algorithm employs association rule min-
ing (Liu et al., 1998) to learn the classifier and then adds a pruning
and prediction steps. This results in a classification approach
named associative classification (Fadi, Peter, & Peng (2005); Thab-
tah, Cowling, & Peng, 2004).
We recorded the prediction accuracy of the considered classifi-
cation approaches we used in this study in Tables 2 and 3. The
overall summary output can be interpreted as: Web Address Bar
and URL Domain Identity are the major important criteria for iden-
tifying and detecting e-banking phishing website. Such as if one or
both of them are genuine, then most likely the website is legiti-
mate, and on the other hand if it is Fraud, then the website is most
likely phishing. The classification rules did not just showed the sig-
nificance roll of the Web Address Bar criteria and URL Domain
Identity criteria but showed also the magnitude value of some
other e-banking phishing website criteria like Security & Encryp-
tion criteria comparing to the others. We have used 10-fold-
Fig. 3. Excel sheet of the e-banking phishing main extracted features. cross-validation as a testing mode which evaluating the derived
M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921 7917

Table 2 Using the IP Address


Results from WEKA classifier using four methods applied to websites archive to Abnormal Request URL
classify phishing. Abnormal URL of Anchor URL & Domain Identity
Abnormal DNS Record
C4.5 decision P.A.R.T. JRip PRISM
Abnormal URL
tree R.I.P.P.E.R.
Anomalous SSL Certificate
Test mode 10-fold cross-validation
Conflicting Certification Authority
Attributes URL Domain Identity Source Security & Encryption Abnormal Cookie Security & Encryption
Code & Java Web Address Page Style & Contents Inconsistent Distinguished Names (DN)
Bar Social Human Factor
Redirect Pages Layer Two
Class
Straddling Attack
Number of rules 57 38 14 155 Pharming Attack Source Code Java Script
Correctly 848 869 818 855 Using onMouseOver to Hide the Link
classified (84.294%) (86.381%) (81.312%) (84.990%) Server Form Handler (SFH)
Incorrectly 158 137 188 141 Website Phishing
Rate
classified (15.705%) (13.618%) (18.687%) (14.016%)
Spelling Errors
Number of 1006 1006 1006 1006
Copying Website
instances
Using Forms with “Submit ” Button Page Style & Contents
Using Pop-Ups Windows
Disabling Right-Click

Long URL Address


Replacing Similar Characters for URL
Adding Prefix or Suffix Web Address Bar Layer Three
Table 3 Using the @ Symbol to Confuse
Results from CBA classifier using association rule mining applied to websites archive Using Hexadecimal Character Codes
to classify phishing.
Emphasis on Security and Response
Mine: single sup Mine: multi sup Public Generic Salutation Social Human Factor

Num of test case 1006 1006 Buying Time to Access Accounts

Correct prediction 758 713


Error rate 24.652% 29.125% Fig. 4. Structure of the fuzzy data mining inference overall system to evaluate e-
MinSup 20.000% 10.000% banking phishing website risk rate.
MinConf 100.000% 100.000%
RuleLimit 80000 80000
LevelLimit 6 6 4.1. Overall fuzzy data mining inference rules
Number of rules 22 15

4.1.1. The Rule base1 for layer 1


The rule base has five input parameters and one output and
contains all the ‘‘IF-THEN” rules of the system. For each entry of
classifiers. Cross-validation is a well-known testing method in DM the rule base, each component is assumed to be one of three values
and machine learning communities. and each criterion has five components. Therefore, the Rule base 1
contains (35) = 243 entries. The output of Rule base 1 is one of the
e-banking phishing website rate fuzzy sets (Genuine, Doubtful or
4. System design
Fraud) representing URL & Domain Identity criteria phishing risk
rate. A sample of the structure and the entries of the Rule base 1
In this paper, e-banking phishing website detection rate is per-
for layer 1 are shown in Table 4. The system structure for URL &
formed based on six criteria: URL & Domain Identity, Security &
Domain Identity criteria is the joining of its five components (Using
Encryption, Source Code & Java script, Page Style & Contents,
the IP Address, Abnormal Request URL, Abnormal URL of Anchor,
Web Address Bar, and Social Human Factor as shown in Table
Abnormal DNS record and Abnormal URL), which produces the
1. This table also shows that there are different numbers of com-
URL & Domain Identity criteria (layer 1).
ponents for each criterion, five components for URL & Domain
Identity, Source Code & Java script, Page Style & Contents, and
Web Address Bar, respectively. Four components for Security & 4.1.2. The rule base for layer 2
Encryption, and three components for Social Human Factor. In layer 2, there are two inputs, which are (Security & Encryp-
Therefore, there are 27 components in total. There are three lay- tion and Source Code & Java script) and one output. The system
ers on this e-banking phishing website fuzzy data mining model structure for Security & Encryption criteria is the joining of its four
as shown in Fig. 4. The first layer contains only URL & Domain components (Using SSL certificate, Certification authority, Abnor-
Identity criteria with a weight equal to 0.3 for its importance; mal Cookie, and Distinguished Names Certificate (DN)) using Rule
the second layer contains Security & Encryption criteria and base 1, which produces Security & Encryption criteria. The system
Source Code & Java script criteria with a weight equal to 0.2 each; structure for Source Code & Java script criteria is the joining of its
the third layer contains Page Style & Contents criteria, Web Ad- five components (Redirect pages, Straddling attack, Pharming At-
dress Bar criteria, and Social Human Factor criteria with a weight tack, Using onMouseOver to hide the Link, and Server Form Han-
equal to 0.1 each. The six criteria have been classified and prior- dler (SFH)) using Rule base 1, which produces Source Code & Java
itized through mining the e-banking phishing website archive script criteria. The structure and the entries of the rule base for
database using the classification and association algorithms men- layer 2 are illustrated in Table 5. The system structure for layer 2
tioned earlier. is the combination of two e-banking phishing website criteria
E-banking phishing website rating = 0.3*URL & Domain Identity (Security & Encryption and Source Code & Java script), which pro-
crisp [First layer] + ((0.2*Security & Encryption crisp)+(0.2*Source duces Rule base 2. The rule base contains (32) = 9 entries, and the
Code & Java script crisp)) [Second layer] + ((0.1*Page Style & Con- output of Rule base 2 is one of the e-banking phishing website rate
tents crisp) + (0.1*Web Address Bar crisp) + (0.1*Social Human Fac- fuzzy sets (Legal, Uncertain or Fake) representing layer 2 criteria
tor crisp)) [Third layer]. phishing risk rate.
7918 M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921

Table 4
Sample of the Rule base 1 structure and entries for URL & Domain Identity criteria.

Rule (Component 1) Using (Component 2) (Component 3) (Component 4) (Component 5) URL & Domain identity criteria
# the IP address Abnormal Req. URL Abnormal URL anchor Abnormal DNS record Abnormal URL phishing risk (layer 1)
1 Low Low Low Low Low Genuine
2 Low Low Low Low Mod. Genuine
3 Low Low Mod. Mod. Mod. Doubtful
4 Low Low Low Mod. high Doubtful
5 Low Low Mod. Mod. high Fraud
6 Mod. Mod. Mod. Low high Fraud
7 Mod. Low high Mod. high Fraud
8 high Mod. Low Mod. Low Doubtful
9 Low Mod. Low Low Mod. Fraud
10 high Mod. high high Low Fraud

4.1.3. The rule base for layer 3 Table 6


In layer 3, there are three inputs, which are the Page Style & The Rule base 3 structure and entries for layer 3.

Contents, Web Address Bar, and Social Human Factor which is Rule Page Style & Web Social Human Phishing risk
the output from layer 3, and one output. The system structure Contents Address Bar Factor (layer 3)
for Page Style & Contents criteria is the joining of its five compo- 1 Genuine Genuine Doubtful Legal
nents (Spelling errors, Copying website, Using forms with ‘‘Sub- 2 Genuine Doubtful Fraud Uncertain
mit” button, Using Pop-Ups windows, and Disabling Right-Click) 3 Genuine Fraud Doubtful Uncertain
4 Doubtful Doubtful Genuine Uncertain
using Rule base 1, which produces Page Style & Contents criteria.
5 Doubtful Doubtful Doubtful Uncertain
The system structure for Web Address Bar criteria is the joining of 6 Doubtful Fraud Doubtful Fake
its five components (Long URL address, Replacing similar charac- 7 Doubtful Genuine Genuine Legal
ters for URL, Adding a prefix or suffix, Using the @ Symbol to Con- 8 Fraud Doubtful Doubtful Uncertain
fuse, and Using Hexadecimal Character Codes) using Rule base 1, 9 Fraud Fraud Fraud Fake

which produces Web Address Bar criteria. The system structure


for Social Human Factor criteria is the joining of its three compo-
nents (much emphasis on security and response, Public generic
plots of this structure are shown in Fig. 5 using MATLAB. The rule
salutation, and Buying Time to Access Accounts) using Rule base
base contains (33) = 27 entries, and the output of final e-banking
1, which produces Social Human Factor criteria.
phishing website rule base is one of the final output fuzzy sets
A sample of the structure and the entries of the rule base for
(Very legitimate, Legitimate, Suspicious, Phishy or Very phishy)
layer 3 are shown in Table 6. The system structure for layer 3
representing final e-banking phishing website rate.
is the combination of Page Style & Contents, Web Address Bar,
and Social Human Factor, which produces Rule base 3. The rule
base contains (33) = 27 entries, and the output of Rule base 3 is
Table 7
one of the e-banking phishing website rate fuzzy sets (Legal,
The e-banking phishing website rate rule base structure and entries for final phishing
Uncertain or Fake) representing layer 3 criteria phishing risk rate.
rate.
Rule URL & Domain Layer 2 Layer 3 Final e-banking phishing
Identity website rate
4.1.4. The rule base for final e-banking phishing website rate 1 Genuine Legal Legal Very legitimate
In the e-banking phishing website rule base last phase, there 2 Genuine Legal Uncertain Legitimate
are three inputs, which are layer 1, layer 2, and layer 3, and 3 Genuine Legal Fake Suspicious
4 Genuine Uncertain Legal Suspicious
one output which is the rate of the e-banking phishing website. 5 Genuine Uncertain Uncertain Suspicious
The structure and the entries of the rule base for e-banking phish- 6 Genuine Uncertain Fake Phishy
ing website rate are shown in Table 7. The system structure for is 7 Genuine Fake Legal Suspicious
the combination of layer 1, layer 2, and layer 3, which produces 8 Genuine Fake Uncertain Suspicious
9 Genuine Fake Fake Phishy
final e-banking phishing website rule base. The three-dimensional
10 Doubtful Legal Legal Legitimate
11 Doubtful Legal Uncertain Suspicious
12 Doubtful Legal Fake Suspicious
13 Doubtful Uncertain Legal Suspicious
Table 5 14 Doubtful Uncertain Uncertain Suspicious
The Rule base 2 structure and entries for layer 2. 15 Doubtful Uncertain Fake Phishy
16 Doubtful Fake Legal Phishy
Rule Security & Source Code & Java Phishing risk (layer 17 Doubtful Fake Uncertain Phishy
Encryption script 2) 18 Doubtful Fake Fake Very phishy
1 Genuine Genuine Legal 19 Fraud Legal Legal Suspicious
2 Genuine Doubtful Legal 20 Fraud Legal Uncertain Suspicious
3 Genuine Fraud Uncertain 21 Fraud Legal Fake Phishy
4 Doubtful Genuine Uncertain 22 Fraud Uncertain Legal Suspicious
5 Doubtful Doubtful Uncertain 23 Fraud Uncertain Uncertain Phishy
6 Doubtful Fraud Uncertain 24 Fraud Uncertain Fake Phishy
7 Fraud Genuine Uncertain 25 Fraud Fake Legal Phishy
8 Fraud Doubtful Fake 26 Fraud Fake Uncertain Very phishy
9 Fraud Fraud Fake 27 Fraud Fake Fake Very phishy
M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921 7919

2002) to find the center of gravity (COG). Centroid defuzzification


technique shown in Eq. (1) can be expressed as where x* is the
defuzzified output, li(x) is the aggregated membership function
and x is the output variable.
R
l ðxÞxdx
x ¼ R i ð1Þ
li ðxÞdx
The proposed intelligent e-banking phishing website detection
system has been implemented in MATLAB 6.5. The results of some
input combinations are listed in Tables 8–10.
The final e-banking phishing website risk rating will be bal-
anced (54%) representing a [suspicious website], when the layer
1 (URL & Domain Identity) of the e-banking phishing website risk
criteria has 10 input values which indicate High phishing indicator,
and all other layers have the value of zero inputs as shown in Table
8. Same result can be made when all e-banking phishing website
risk criteria’s representing by the three layers have middle (5) in-
put values which indicate Mod. phishing indicator. These results
show the significance and importance of the e-banking phishing
website criteria (URL & Domain Identity) represented by layer 1
Fig. 5. Three-dimensional plots for final phishing rate. especially when compared to the other criteria’s and layers. Table
9 shows that when the layer 1 and layer 2 of the e-banking phish-
ing website risk criteria has middle (5) input values which indicate
5. Experiments and results Mod. phishing indicator and other third layer has the value of 10
input values which indicate High phishing indicator, the final e-
Clipping method (Ho, Ling, & Reiss, 2006) is used in aggregating banking phishing website risk rating will be reasonably high
the consequences, and the aggregated surface of the rule evalua- (72%) representing a [phishy website], which means that there is
tion is defuzzified using Mamdani method (Liu, Chen, & Wu, a Good guarantee that the website is forged phishy website. This

Table 8
Five highest (10) for layer 1 and all others lowest (0).

Component Layer 1 URL & Domain Layer 2 Layer 3 % e-banking phishing


Identity website rating
Security & Source Code & Java Page Style & Web Address Social Human
Encryption script Contents Bar Factor
1 10 0 0 0 0 0 54
2 10 0 0 0 0 0
3 10 0 0 0 0 0
4 10 0 0 0 0
5 10 0 0 0

Table 9
Five middle (5) inputs for layer 1 and layer 2 and highest (10) inputs for layer 3.

Component Layer 1 URL & Domain Layer 2 Layer 3 % e-banking phishing


Identity website rating
Security & Source Code & Java Page Style & Web Address Social Human
Encryption script Contents Bar Factor
1 5 5 5 10 10 10 72
2 5 5 5 10 10 10
3 5 5 5 10 10 10
4 5 5 5 10 10
5 5 5 10 10

Table 10
Five middle (5) inputs for layer 1 and all others lowest (0) inputs.

Component Layer 1 URL & Domain Layer 2 Layer 3 % e-banking phishing


Identity website rating
Security & Source Code & Java Page Style & Web Address Social Human
Encryption script Contents Bar Factor
1 5 0 0 0 0 0 39
2 5 0 0 0 0 0
3 5 0 0 0 0 0
4 5 0 0 0 0
5 5 0 0 0
7920 M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921

result clearly shows that even if some of the e-banking phishing Bridges, S. M., & Vaughn, R. B. (2001). Fuzzy data mining and genetic algorithms
applied to intrusion detection. Department of Computer Science Mississippi State
website characteristics or layers are not very clear or not definite,
University, White Paper.
the website can still be phishy and forged, and users should be Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules.
aware when dealing with it especially when other phishing charac- International Journal of Man–Machine Studies, 27(4), 349–370.
teristics or layers are obvious and clear. Ciesielski Vic, & Lalani Anand (in press). Data mining of web access logs from an
academic web site. In Proceedings of the third international conference on hybrid
Table 10 shows that when the layer 1 of the e-banking phishing intelligent systems (HIS’03): Design and Application of Hybrid Intelligent Systems
website risk criteria (URL & Domain Identity) has middle (5) input (pp. 1034–1043). IOS Press.
values which indicate Mod. phishing indicator and all other Layers Dhamija, R., & Tygar, J. D. (2005). The battle against phishing: Dynamic security
skins. In Proc. Symp. Usable Privacy and Security.
have the value of zero input values which indicate Low phishing Dhamija Rachna, Tygar, J. D., & Hearst Marti (2006). Why phishing works. Harvard
indicator, the final e-banking phishing website risk rating will be University, White Paper.
reasonably low (39%) representing a [legitimate website], which Fadi, T., Peter, C. & Peng, Y. (2005). MCAR: Multi-class classification based on
association rule. In IEEE international conference on computer systems and
means that there is Good guarantee that the website is legitimate applications (pp. 127–133).
website. This result clearly shows that even if some of the e-bank- Fayyad, U. M. (1998). Mining databases: Towards algorithms for knowledge
ing phishing website characteristics or layers are noticed or ob- discovery. Data Engineering Bulletin, 21(1), 39–48.
FDIC (2004). Putting an end to account – hijacking identity theft. Available from
served that does not mean at all that the website is phishy or http://www.fdic.gov/consumers/consumer/idtheftstudy/identity_theft.pdf.
forged, but it can be safe and secured especially when other phish- Fette Ian, Sadeh Norman, & Tomasic Anthony (2006). Learning to detect phishing e-
ing characteristics or layers are not noticeable, visible or detect- mails. Institute for Software Research International. CMU-ISRI-06–112,
Fu, A. Y., Wenyin, L., & Deng, X. (2006). Detecting phishing web pages with visual
able. The results also indicate that the worst e-banking phishing
similarity assessment based on earth mover’s distance (EMD). IEEE Transactions
website rate (all three layers have 10 input value) equals 83.7% on Dependable and Secure Computing, 3(4).
representing [Very phishy Website], and the best e-banking phish- Han, E., & Karypis, G. (2000). Centroid-based document classification: Analysis and
ing website rate (all three layers have 0 input value) is 16.4% experimental results. Principles of Data Mining and Knowledge Discovery,
424–431.
representing [Very legitimate website] rather than a full range, Herzberg, A., & Gbara, A. (2004). Protecting naive web users. Draft of July, 18.
i.e. 0–100, because of the fuzzification process. Ho, C. Y., Ling, B. W., & Reiss, J. D. (2006). Fuzzy impulsive control of high-order
interpolative low-pass sigma–delta modulators. IEEE Transactions on Circuits
and Systems–I: Regular Papers, 53(10).
Available from http://www.phishtank.com/phish_archive.php (2008).
6. Conclusion and future work James, L. (2006). Phishing exposed. Tech target article sponsored by: Sunbelt
software. Available from searchexchange.com.
The fuzzy data mining e-banking phishing website model Kantardzic, & Mehmed (2003). Data mining: Concepts, models, methods, and
algorithms. John Wiley & Sons. ISBN 0471228524. OCLC 50055336.
showed significance and importance of the e-banking phishing Liu, M., Chen, D., & Wu, C. (2002). The continuity of Mamdani method. International
website criteria (URL & Domain Identity) represented by layer 1. Conference on Machine Learning and Cybernetics, 3, 1680–1682.
It also showed that even if some of the e-banking phishing website Liu Bing, Hsu Wynne, & Ma Yiming (1998). Integrating classification and
association rule mining. In Proceedings of the fourth international conference on
characteristics or layers are not very clear or not definite, the web- knowledge discovery and data mining (KDD-98, Plenary Presentation). New York,
site can still be phishy especially when other phishing characteris- USA.
tics or layers are obvious and clear. On the other hand, even if some Liu, W., Huang, G., Liu, X., Zhang, M., & Deng, X. (2005). Phishing Web Page
Detection. In Proc. eighth int’l conf. documents analysis and recognition (pp. 560–
of the e-banking phishing website characteristics or layers are no-
564).
ticed or observed, it does not mean at all that the website is phishy, Liu, W., Deng, X., Huang, G., & Fu, A. Y. (2006). An anti-phishing strategy based on
but it can be safe and secured especially when other phishing char- visual similarity assessment, published by the IEEE computer society 1089–
acteristics or layers are not noticeable, visible, or detectable. 7801/06 IEEE. Internet Computing IEEE.
Microsoft (2004). Available from http://www.microsoft.com/mscorp/twc/privacy/
Our first goal was to determine whether we could find any gold- spam.
en nuggets in the e-banking phishing website archive data using Microsoft Corp. (2005). Microsoft phishing filter: A new approach to building trust
classification algorithms. In this, major rules discovered were in- in e-commerce content. White Paper.
Misch Sebastian, (2006). Content negotiatio in internet mail. Diploma thesis.
serted into the fuzzy rule engine to help giving exact phishing rate University of Applied Sciences Cologne, Mat. No.: 7042524.
output. A major issue in using data mining algorithms is the prep- Netcraft (2004). Available from http://toolbar.netcraft.com.
aration of the feature sets to be used. Finding the ‘‘right” feature set Olsen, S. (2004). AOL tests caller ID for e-mail. CNET News.com.
Pan, Y., & Ding, X. (2006). Anomaly based web phishing page detection. In
is a difficult problem and requires some intuition regarding the Proceedings of the 22nd annual computer security applications conference
goal of data mining exercise. We are not convinced that we have (ACSAC’06). Computer Society.
used the best feature sets and we think that there is more work Perez, J. C. (2003). Yahoo airs antispam initiative. Available from ComputerWeekly.
com.
to be done in this area. Moreover, there are number of emerging Persson Anders (2007). Exploring phishing attacks and countermeasures. Master
technologies that could greatly assist phishing classification that Thesis in Computer Science, Thesis No: MCS-2007:18.
we have not considered. However, we believe that using features Peter, Lyman, & Varian, Hal R. (2003). How much information. Available from http://
www.sims.berkeley.edu/how-much-info-2003. Retrieved on 2008-12-17.
such as those presented here can significantly help with detecting
Putting an end to account-hijacking identity theft, FDIC, Tech. Rep. (2004). [Online].
this class of e-banking phishing websites. The classification Available from http://www.fdic.gov/consumers/consumer/idtheftstudy/
approaches are promising. The training and classification experi- identitytheft.pdf.
ments have proven that it is possible to improve the categorization Quinlan, J. R. (1996). Improved use of continuous attributes in c4.5. Journal of
Artificial Intelligence Research, 4, 77–90.
process. The progress of detecting e-banking phishing websites is Scheffer, T. (2001). Finding association rules that trade support optimally against
really very interesting with a never ending possibility of algo- confidence. In Proc. of the 5th European conf. on principles and practice of
rithms variations when it is combined with each other. knowledge discovery in databases (PKDD’01) (pp. 424–435). Springer-Verlag:
Freiburg, Germany.
Shah, S. (2003). Measuring operational risks using fuzzy logic modeling. Article, Towers
Perrin.
References Sharif, T. (2006). Phishing filter in IE7. Available from http://blogs.msdn.com/ie/
archive/2005/09/09/463204.aspx.
Adida, B., Hohenberger, S., & Rivest, R. (2005). Lightweight encryption for e-mail. In Thabtah, F., Cowling, P., & Peng, Y. (2004). A new multi-class, multi-label associative
USENIX steps to reducing unwanted traffic on the internet workshop (SRUTI). classification approach . In The 4th international conference on data mining
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in (IVFM’04). Brighton, UK.
large databases. In Proc. international conference on very large databases (pp. WEKA – University of Waikato, New Zealand, EN (2006). Weka – Data Mining with
478–499). Morgan Kaufmann, Los Altos, CA: Santiage, Chile. Open Source Machine Learning Software in Java. Available from http://
Anti-Phishing Working Group (2007). Phishing Activity Trends Report. Available www.cs.waikato.ac.nz/ml/weka (2006/01/31).
from http://antiphishing.org/reports/apwg_report_sep2007_final.pdf. WholeSecurity Web Caller-ID (2005). Available from www.wholesecurity.com.
M. Aburrous et al. / Expert Systems with Applications 37 (2010) 7913–7921 7921

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and Wu, M., Miller, R. C., & Garfinkel, S. L. (2006a). Do Security Toolbars Actually Prevent
techniques (3rd ed.). San Francisco, CA: Morgan Kaufmann. Phishing Attacks?. CHI
Wood, L. (2005). Document object model level 1 specification. Available from http:// Wu, M., Miller, R. C., & Little, G. (2006b). Web wallet: Preventing phishing attacks by
www.w3.org. revealing user intentions. MIT Computer Science and Artificial Intelligence Lab.

Vous aimerez peut-être aussi