Plagiarized WebPage Detection by Measuring The Dissimilarities Using SIFT

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 5, MAY 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

B. Srinivas, M. V. Pratyusha, S. Govinda Rao and K. V. Subba Raju

Abstract: Phishing is a very common network attack in which the attacker creates a replica of an existing web page to fool users. In this paper, we propose an anti-phishing solution that combines an image-based visual similarity approach with digital signatures to detect plagiarized web pages. Our detection mechanism uses two algorithms: the Scale Invariant Feature Transform (SIFT) algorithm, which generates an image signature by extracting stable key points from a screenshot of the web page, and the MD5 hash algorithm, which generates a digital signature from the content of the web page. When a legitimate web page is registered with our system, both algorithms are applied to that page and the resulting signatures are stored in the database of our trained system. When a suspected web page is encountered, the same algorithms generate both signatures for the suspected page, and these are verified against the signatures of the corresponding legitimate web page in our database. Our results indicate that the proposed system detects mimicked web pages with few false positives.

Index Terms: Plagiarized web pages, MD5, Phishing, SIFT.
1. INTRODUCTION
Plagiarized web pages are forged web pages that are generally used for phishing attacks. They are created by malicious persons to imitate the web pages of real web sites in order to cheat their victims. A phishing attack is a criminal activity in which the attacker lures end users to a fake website in order to steal personal information such as usernames, passwords and credit card details. For a successful phishing attack, the phisher first sends the URL of the fake web site to a large number of users at random through e-mails or instant messages. Unsuspecting victims who click on the link are directed to the fake website, where their personal information can easily be stolen. Nowadays setting up a fake website is much easier, because phishing kits can create a phishing site in a very short time. A successful phishing attack not only leaks personal information but can also seriously damage an enterprise's brand reputation, since users expect the enterprise to protect them from such attacks. However, detecting plagiarized web pages is a daunting challenge. One of the most common detection techniques relies on internal clues embedded in the text or on the visual characteristics (text, images, layout, etc.) of the web page. Most plagiarized web pages have high textual and visual similarity with the original web pages.
We propose an approach that recognizes both textual and image plagiarism. Initially, users or system administrators register with the system the genuine web pages that need to be protected from plagiarism. At registration time, the server generates a hash value for every web page from its content using the MD5 algorithm. In addition, a screenshot is taken of every web page, and for every screenshot the system generates key points using the SIFT algorithm; the screenshots are stored in the database along with the digital signature. Whenever a suspected web page appears, the trained system generates the key-point features and the digital signature of that page. Finally, detection is performed by comparing both the image signature and the digital signature of the current web page against the signatures of the legitimate web page stored in the database of the trained system.
2. LITERATURE SURVEY
Phishing is one of the most common online fraudulent activities. It is not new to Internet users, yet many people are still tricked by plagiarized web pages that perform phishing attacks. To counter the phishing threat, many anti-phishing solutions have been developed by both researchers and enterprises. In this section we briefly review previous anti-phishing work, grouping it into five categories.
B. Srinivas is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. M. V. Pratyusha is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh. S. Govinda Rao is with the Department of Computer Science and Engineering, GR Institute of Engineering and Technology, Hyderabad, Andhra Pradesh. K. V. Subba Raju is with the Department of Computer Science and Engineering, MVGR College of Engineering, Vizianagaram, Andhra Pradesh.
2.1 Email level approach
The email-level approach includes both filtering and path analysis techniques to identify phishing emails before they are delivered to users. Phishing emails are generally treated as spam, and the most effective way to reduce these attacks is spam filtering. A large number of phishing emails can be blocked by continuously trained filters, but the success of this technique depends on the availability and training of the rule-based filters. With this technique the user rarely receives phishing mails; however, if a phishing email bypasses the filter, the user might believe that the received mail is legitimate. Besides filtering, other popular anti-spam solutions try to stop email scams by analyzing email contents. Current path-based mechanisms, such as Sender ID [2] by Microsoft and DomainKeys [3] by Yahoo, work by looking up mail sources in DNS tables. These mechanisms can verify whether a received email is authentic, but unfortunately they are not yet widely used by Internet users.
2.2 Content based approach
This is a heuristic approach that determines whether the current web page is a phishing page or a legitimate page. Zhang et al. [4] designed and evaluated CANTINA, which uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to detect whether a web page is plagiarized for a phishing attack. CANTINA takes the five words with the highest TF-IDF weights on a given web page as a lexical signature and submits that signature to Google. If the URL of the site in question appears within the top results, the web page is classified as legitimate; otherwise it is classified as a phishing page. However, this method is effective only if the lexical signature is representative and works well as a search-engine query, and it also depends on the reliability of the search engine.
2.3 Browser integrated tools and plug-ins
Most web browsers have anti-phishing solutions [5,6] that block phishing sites by maintaining blacklists and whitelists. For example, Firefox uses lists from Google [8] and Stopbadware.org [9] to block malicious web pages. Blacklists and whitelists are the most straightforward client-side solution to phishing attacks: a whitelist contains URLs of known legitimate sites, while a blacklist contains those of known phishing sites. Many anti-phishing technologies rely on a combination of both. For example, Firefox add-ons such as PhishTank SiteChecker [10], Firephish [11] and CallingID LinkAdvisor [12] remind their users whether they are surfing safely. Blacklists and whitelists need frequent updates, because they cannot include new phishing sites in a timely manner, and they may also raise false positives on legitimate sites.
Beyond these lists, anti-phishing solutions also use heuristics to judge whether a page has phishing characteristics; for example, SpoofGuard [13] checks the host name, checks URLs against spoofing, and checks previously seen images.
2.4 Relevant domain name suggestion
This technique suggests the relevant domain name to users while they are accessing the Web. For example, SpoofStick [14] prominently displays the most relevant domain information. This toolbar can help users identify the actual website if they are visiting a rogue page whose domain name is similar to that of a legitimate site. However, this method cannot directly judge whether a suspicious page is legitimate or phishing.
2.5 Visual similarity based approach
Fu et al. [15] proposed using visual similarity to detect phishing. They treat the whole web page as an image and convert it into a low-resolution image. The colors and coordinates of the converted images are stored in the database as signatures. To decide whether a new web page is a phishing page, the page is likewise converted into a low-resolution image and matched against the signatures in the database; the similarity distance is calculated with the Earth Mover's Distance (EMD) algorithm. Angelo et al. [16] also proposed a phishing detection mechanism based on visual assessment; they use the DOM structure of a web page to obtain visual characteristics for detecting phishing pages. These include block similarity, style similarity and layout similarity. Eric et al. [17] also proposed visual-similarity-based phishing detection by generating signatures. The signature is extracted from the text, the images and the overall appearance of the web page. When a new web page is to be checked, the signature extracted from it is compared with the stored signature of the legitimate page, and if the similarity between the two pages is high, the system warns that the new page is a phishing page.
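For intuition about the Earth Mover's Distance mentioned above, the following minimal Python sketch computes EMD for the one-dimensional special case of two histograms over the same bins, where moving one unit of mass to an adjacent bin costs 1. It is only an illustration under stated assumptions: the signatures in [15] combine colors and coordinates and need a general transportation solver, and the function name `emd_1d` and the toy histograms are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms on the same bins.

    Both histograms are normalized to unit mass first. In one dimension,
    with unit cost between adjacent bins, EMD equals the sum of absolute
    differences between the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    p_norm = [v / total_p for v in p]
    q_norm = [v / total_q for v in q]
    return sum(abs(a - b) for a, b in zip(accumulate(p_norm), accumulate(q_norm)))


# Toy color histograms (hypothetical values, not from the paper).
legitimate = [4, 3, 2, 1]
rescaled_copy = [8, 6, 4, 2]   # same shape, different total mass
different = [1, 2, 3, 4]       # mass shifted towards the last bins

print(emd_1d(legitimate, rescaled_copy))  # identical shapes give distance 0
print(emd_1d(legitimate, different))      # shifted mass gives a positive distance
```

A small distance indicates visually similar signatures, which in the approach of [15] is the cue that a page may be imitating a protected one.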
3. PROPOSED SYSTEM
Our work is a heuristic approach that determines whether a web page is legitimate. It differs from previous mechanisms in that we combine a visual-similarity-based approach with the digital signature concept to produce accurate results when detecting phishing attacks. We employ two algorithms. The first is the MD5 hashing algorithm [21,22], which generates the signature of the entire web page from its content. These generated signatures are unique for every web page. The second is the Scale Invariant Feature Transform (SIFT) algorithm [18,20]. A screenshot is taken of every web page and treated as an image, and every image is passed through the SIFT algorithm to detect robust key-point features. The SIFT algorithm thus implements the visual similarity approach in our system. Finally, the key points and the digital signatures are stored in the database of the trained system. When we need to decide whether the current web page is legitimate, its signature and key points are generated and compared with the key points and signature stored in the database of the trained system. If the signatures are similar, the current web page is considered a legitimate web page; otherwise it is considered a plagiarized web page.
4. METHODOLOGY
We employed the following steps in order to detect the plagiarized web pages.
1. Register the websites that need to be protected against phishing attacks.
2. For every web page:
2.1 A digital signature is generated from the textual pattern of the web page using the MD5 hashing algorithm.
2.2 A screenshot of the web page is taken and the SIFT algorithm is applied to the screenshot to generate feature key points.
3. Store the signature and the trained image screenshot of every web page registered in step 1 in the database.
4. When a new web page needs to be verified, perform step 2 on it.
5. The newly generated signature and key points are compared with the corresponding signature and clustered key points from the database.
6. If the verified web page is similar to the web page stored in the database, the current web page is considered a legitimate web page; if they differ, it is considered a plagiarized web page.
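The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class `SignatureDatabase`, the representation of key points as hashable tuples, the 0.9 match threshold, and the rule that both signatures must agree are all assumptions made here. Only `hashlib.md5` is a real library call.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    digital_signature: str      # MD5 hex digest of the page text (step 2.1)
    image_signature: frozenset  # stand-in for SIFT key points (step 2.2)


class SignatureDatabase:
    """Hypothetical store for the signatures of registered pages (steps 1 and 3)."""

    def __init__(self, min_match_fraction=0.9):
        # Assumed threshold: fraction of stored key points that must be found again.
        self.min_match_fraction = min_match_fraction
        self._records = {}

    @staticmethod
    def _digital_signature(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def register(self, url, text, keypoints):
        self._records[url] = PageRecord(self._digital_signature(text), frozenset(keypoints))

    def verify(self, url, text, keypoints):
        """Steps 4 to 6: regenerate both signatures and compare with the stored ones."""
        record = self._records[url]  # raises KeyError for unregistered pages
        digital_ok = self._digital_signature(text) == record.digital_signature
        matched = len(record.image_signature & set(keypoints))
        fraction = matched / len(record.image_signature) if record.image_signature else 0.0
        image_ok = fraction >= self.min_match_fraction
        # Assumed combination rule: both signatures must agree.
        return "legitimate" if digital_ok and image_ok else "plagiarized"


db = SignatureDatabase()
page_text = "Welcome to the bank login page"
page_keypoints = [(10, 20), (30, 40), (50, 60)]  # toy key points
db.register("http://canarabank.com", page_text, page_keypoints)

print(db.verify("http://canarabank.com", page_text, page_keypoints))
print(db.verify("http://canarabank.com", page_text + " (altered)", page_keypoints))
```

In a real system the key-point comparison would use descriptor distances rather than exact set intersection; the set is used here only to keep the control flow visible.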
Figure 1: Training the legitimate web pages. (Blocks shown: Legitimate Web Page; Extract the screenshot; Extract the text; Generate the signature; Database, with the example entry Protected List http://canarabank.com, Image Signature ABFJGUASDA, Digital Signature sadhajdha.)

4.1 Scale Invariant Feature Transform Algorithm
Scale Invariant Feature Transform is an algorithm published by David Lowe in 1999. It extracts invariant features from images that can be used to perform reliable matching between different views of an image. It also provides accurate recognition of suspected web pages by matching individual features to a database of features from known web pages. The main purpose of this algorithm is to derive and describe key points that are robust to scale, rotation and change in illumination. We analyzed the algorithm in six steps.
1. Scale-space extrema detection: To create a scale space, the image (in our case the web page screenshot) is progressively blurred. The original image is then resized to half its size and blurred again, and this process is repeated until the necessary octaves and blur levels (scales) are generated. The progressively blurred images are generated using the Gaussian blur operator. Mathematically, blurring is the convolution of the Gaussian operator with the image:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$
where L is the blurred image, G is the Gaussian blur operator, I is the image, x, y are the location coordinates, $\sigma$ is the scale parameter, and * is the convolution operator in x and y, which applies the Gaussian blur G to the image I. The Gaussian blur operator is calculated as
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2} + y^{2})/(2\sigma^{2})}$$
2. LoG approximation: For the Laplacian of Gaussian (LoG) we take the image (web page screenshot), blur it a little, and then calculate its second-order derivatives (the Laplacian). The LoG locates edges and corners in the image, which are good for finding key points. Because second-order derivatives are sensitive to noise and computationally intensive, we instead calculate the difference between two consecutive scales, i.e., the Difference of Gaussians (DoG). All consecutive pairs of scales are processed in the same way.
3. Finding key points: Finding the key points consists of two parts.
a) Locate maxima/minima in the DoG images: the trained system iterates through each pixel and checks all its neighbors, both in the current image and in the images directly above and below it. The points found this way are the approximate maxima and minima.
b) Find subpixel maxima/minima: using the available pixel data, subpixel values are generated by a Taylor expansion of the image around the approximate key point, which is represented mathematically as
$$D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}} \mathbf{x} + \frac{1}{2} \mathbf{x}^{T} \frac{\partial^{2} D}{\partial \mathbf{x}^{2}} \mathbf{x}$$
Solving the above equation yields the subpixel key-point locations, which increases the chances of matching and the robustness of the algorithm.
4. Get rid of bad key points: The previous step produces many key points, but some of them lie on an edge or do not have enough contrast. In both cases they are not useful as features. The Harris corner detector technique is used in this algorithm to remove edge features, and the intensities of the key points are calculated to verify the contrast. To calculate the intensity we again use the Taylor expansion to obtain the intensity value at the subpixel location. In general, the image around a key point can be:
- A flat region: both gradients will be small.
- An edge: the gradient perpendicular to the edge will be big and the other will be small.
- A corner region: both gradients will be big.
Unnecessary key points can also be eliminated by applying thresholds to the maxima/minima values. The SIFT detector parameters are Threshold, which sets the threshold value for every image, and EdgeThreshold, which is used to measure the edges detected for every input.
5. Orientation assignment:
We now have legitimate key points that have been tested as stable, and we already know the scale at which each key point was detected, which gives scale invariance. The gradient magnitude and orientation are calculated with the formulae below:
$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}}$$
$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$$
where m is the gradient magnitude and $\theta$ is the orientation. The magnitude and orientation are calculated with these formulae for all pixels around the key point. The most prominent orientation(s) are determined and assigned to the key point. The size of the orientation collection region around the key point depends on its scale. After the orientations are calculated, a histogram is generated in which the 360 degrees of orientation are divided into 36 bins of 10 degrees each. For example, if the gradient direction at a certain point in the orientation collection region is 17.589 degrees, it goes into the 10-19 degree bin.
6. Generate SIFT features: A unique fingerprint is now generated for every key point. The orientation histograms summarize the contents over 4x4 regions with 8 orientation bins each, which yields a 128-element feature for each key point. Correspondence between feature points is determined by the ratio of the descriptor-vector distance to the closest neighbor to the distance to the second closest neighbor.
After this algorithm is applied, the legitimate web pages are trained: robust key-point features are produced for every web page screenshot and stored, with the scale-invariant images, in the database of the trained system. With this algorithm, plagiarized web pages with visually similar images can be detected on the basis of their unique scale-invariant key-point features. When a suspected page appears, whether it is genuine can be decided from how well its key points match those of the trained database images.

Figure 2: Testing the suspected web pages. (Blocks shown: Suspected Web Page; Extract the screenshot; Extract the text; Compare the features; Database, with the example entry Protected List http://canarabank.com, Image Signature ABFJGUASDA, Digital Signature sadhajdha.)

4.2 Message Digest 5 (MD5) Algorithm
MD5 is a widely used hash algorithm that takes a message of arbitrary length as input and produces a 128-bit output. Message processing in MD5 involves the following steps.
(1) Padding: The message is padded with a single 1 bit followed by 0 bits so that the padded length plus 64 is divisible by 512. Padding is always performed, even if the length of the message is already congruent to 448 modulo 512.
(2) Appending length: A 64-bit binary representation of the original message length is appended to the result of step 1. The resulting message has a length that is an exact multiple of 512 bits; equivalently, its length is an exact multiple of 16 (32-bit) words. Let S[0 ... N-1] denote the words of the resulting message, where N is a multiple of 16.
(3) Initialize MD buffer: A four-word buffer (A, B, C, D) is used to compute the message digest, where each of A, B, C, D is a 32-bit register. These registers are initialized with the following hexadecimal values, low-order bytes first:
Word A: 01 23 45 67
Word B: 89 ab cd ef
Word C: fe dc ba 98
Word D: 76 54 32 10
(4) Process message in 16-word blocks: Four auxiliary functions are defined, each taking three 32-bit words as input and producing one 32-bit word as output:
F(X, Y, Z) = XY v not(X) Z
G(X, Y, Z) = XZ v Y not(Z)
H(X, Y, Z) = X xor Y xor Z
I(X, Y, Z) = Y xor (X v not(Z))
If the bits of X, Y and Z are independent and unbiased, then each bit of F(X,Y,Z), G(X,Y,Z), H(X,Y,Z) and I(X,Y,Z) will be independent and unbiased.
(5) Output: The message digest produced as output is A, B, C, D, beginning with the low-order byte of A and ending with the high-order byte of D.
The main purpose of this algorithm in our system is to check the integrity of the web page. When a web page needs to be verified as legitimate or fake, MD5 generates the signature of that web page, which is compared with the corresponding signature stored in the database. If both hashes are identical, the web page is considered original; otherwise its integrity is considered compromised.
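The five MD5 steps above can be traced in a short pure-Python sketch. It is meant for illustration only, following the description in this section; the per-round shift amounts and sine-derived constants are not listed in the text and are taken from the standard MD5 definition. The helper names (`pad`, `md5_hex`, `rotl`) are chosen here, and the result is compared against Python's `hashlib.md5`.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits


# Step (4): the four auxiliary functions on 32-bit words.
def F(x, y, z): return (x & y) | (~x & MASK & z)
def G(x, y, z): return (x & z) | (y & ~z & MASK)
def H(x, y, z): return x ^ y ^ z
def I(x, y, z): return y ^ (x | (~z & MASK))


# Standard per-operation shift amounts and constants floor(2^32 * |sin(i + 1)|).
SHIFTS = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
SINES = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]


def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def pad(message):
    """Steps (1) and (2): append a 1 bit, 0 bits, then the 64-bit original length."""
    bit_length = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + struct.pack("<Q", bit_length)


def md5_hex(message):
    # Step (3): registers A, B, C, D, given low-order byte first.
    a, b, c, d = struct.unpack("<4I", bytes.fromhex("0123456789abcdeffedcba9876543210"))
    data = pad(message)
    for offset in range(0, len(data), 64):          # one 16-word block at a time
        words = struct.unpack("<16I", data[offset:offset + 64])
        aa, bb, cc, dd = a, b, c, d
        for i in range(64):
            if i < 16:
                f, k = F(bb, cc, dd), i
            elif i < 32:
                f, k = G(bb, cc, dd), (5 * i + 1) % 16
            elif i < 48:
                f, k = H(bb, cc, dd), (3 * i + 5) % 16
            else:
                f, k = I(bb, cc, dd), (7 * i) % 16
            rotated = rotl((aa + f + SINES[i] + words[k]) & MASK, SHIFTS[i])
            aa, bb, cc, dd = dd, (bb + rotated) & MASK, bb, cc
        a, b, c, d = (a + aa) & MASK, (b + bb) & MASK, (c + cc) & MASK, (d + dd) & MASK
    # Step (5): output A, B, C, D, low-order byte of A first.
    return struct.pack("<4I", a, b, c, d).hex()


page_text = b"<html><body>Login</body></html>"  # toy page content
print(md5_hex(page_text))
print(md5_hex(page_text) == hashlib.md5(page_text).hexdigest())
```

Changing a single character of `page_text` produces a different digest, which is the property the integrity check in this section relies on. In practice one would call `hashlib.md5` directly rather than reimplement the algorithm.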
Table 1 specifies the initial points generated by our trained system for each domain; from these initial points, stable and robust key points are extracted at different scales from different octaves.
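The scale-space construction of Section 4.1 (steps 1 and 2) can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: the image is a small list of lists of floats, the kernel is truncated at three standard deviations, border pixels are replicated, and only one octave is built. The function names are hypothetical; the values 2.015874 and 3 are the Sigma0 and number-of-levels parameters reported in Section 5.1.

```python
import math


def gaussian_kernel(sigma):
    """Sampled G(x, y, sigma), truncated at 3*sigma and normalized to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
               for x in range(-radius, radius + 1)]
              for y in range(-radius, radius + 1)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def blur(image, sigma):
    """L = G * I, with border pixels replicated outside the image."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), width - 1)
                    acc += kernel[dy + radius][dx + radius] * image[yy][xx]
            out[y][x] = acc
    return out


def difference_of_gaussians(image, sigma0=2.015874, levels=3):
    """Blur at sigma0 * k^i with k = 2^(1/levels) and subtract consecutive scales."""
    k = 2 ** (1 / levels)
    blurred = [blur(image, sigma0 * k ** i) for i in range(levels + 1)]
    return [[[b - a for a, b in zip(row_lo, row_hi)] for row_lo, row_hi in zip(lo, hi)]
            for lo, hi in zip(blurred, blurred[1:])]


def downsample(image):
    """Halve the image size to start the next octave."""
    return [row[::2] for row in image[::2]]


# Toy 9x9 "screenshot" with one bright pixel in the centre.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
dogs = difference_of_gaussians(image)
print(len(dogs), "DoG images; centre response of the first:", dogs[0][4][4])
print("next octave size:", len(downsample(image)), "x", len(downsample(image)[0]))
```

The centre response is negative because the wider blur spreads the bright pixel more than the narrower one; extrema of such responses across neighbouring pixels and scales are the candidate key points of step 3.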
5. RESULTS AND ANALYSIS

In our system we maintain two datasets: a trained dataset and a test dataset. The trained dataset holds the database of legitimate web pages, and the test dataset holds the suspected web pages. To analyze the performance of the proposed system, we collected 500 web pages using various keywords (banking, mail, social networking) together with 12 web pages that are phishing pages. Combining these gave 512 web pages; we trained on all of them and tested with legitimate as well as plagiarized pages.
5.1 Implementing SIFT algorithm
We use SIFT to measure the dissimilarities between web page screenshots. The SIFT scale-space parameters used to compute the scale space are SigmaN, Sigma0, O (number of octaves), S (number of levels), Omin (first octave), Smin and Smax. To generate key points, maxima/minima and threshold values we use the SIFT detector parameters: Threshold, which sets the threshold value for every image, and EdgeThreshold, which is used to measure the edges detected for every input. For orientation assignment and histogram generation there are the SIFT descriptor parameters NBP (number of spatial bins) and NBO (number of orientation bins). Training is done with the default parameters: SigmaN is 0.50000 and Sigma0 is 2.015874, the number of octaves per iteration is 6, and the number of levels in each octave is 3. The numbers of spatial and orientation bins are 4 and 8, respectively.

TABLE-1: EXTRACTING STABLE KEY POINTS FROM THE INITIAL POINTS AT DIFFERENT OCTAVES
| Domains | Initial Points | Round1 | Round1 | Round2 |
|---|---|---|---|---|
| Canara Bank | 1140 | 416 | 122 | 72 |
| Icici.com | 2039 | 864 | 234 | 46 |
| Yahoo.com | 1907 | 700 | 187 | 48 |
| Irctc.co.in | 1138 | 267 | 111 | 3 |
| Gmail.com | 289 | 320 | 87 | 52 |

Figure-3: Stable key points that are generated for a trained web page

TABLE-2: EXTRACTING THE SIMILARITIES FROM THE TRAINED DATA AND TEST DATA OF A PARTICULAR DOMAIN AT DIFFERENT LEVELS.
| Domain Name | Level 1 | Level 2 |
|---|---|---|
| Canara Bank | 72/68 | 72/71 |
| Icici.com | 46/42 | 46/46 |
| Yahoo.com | 48/41 | 48/46 |
| Irctc.co.in | 3/2 | 3/3 |
| Gmail.com | 52/47 | 52/49 |
In Table 2, the test data is evaluated against the trained data at different levels in order to check similarities. For example, 72 key points are generated for the Canara Bank web page screenshot and stored in the trained data. When a suspected Canara Bank web page needs to be verified, the key points of the test data are evaluated against the trained key points of Canara Bank at different levels. In Table 2, 68 of the 72 key points are similar at the first level and 71 of the 72 key points are similar at the next level. This indicates that the suspected web page is a legitimate web page.

TABLE-3: ALL THE 512 WEB PAGES FROM TRAINED DATA ARE COMPARED WITH TEST DATA IN ORDER TO DETECT PLAGIARISM.
| S.No | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| 1 | 34/4 | 34/6 | 34/32 |
| 2 | 49/2 | 49/5 | 49/48 |
| 3 | 25/5 | 25/3 | 25/25 |
| 4 | 118/14 | 118/42 | 118/116 |
| 5 | 89/8 | 89/15 | 89/88 |
| 6 | 172/24 | 172/28 | 172/172 |
| 7 | 25/3 | 25/5 | 25/25 |
In Table 3, all 512 web pages in the trained data are tested against the test data at different stages. For example, in the first entry (S.No 1), 34 trained web pages are verified against the test data: in the first stage only 4 web pages are similar, in the next stage 6 of the 34 web pages are similar, and in the last stage 32 of the 34 web pages are similar. This indicates that the remaining 2 web pages are detected as plagiarized pages.
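The key-point comparison behind Tables 2 and 3 can be sketched as follows, using the closest-to-second-closest distance ratio described in step 6 of Section 4.1. This is a hedged illustration, not the authors' code: the descriptors are toy 4-element vectors rather than 128-element SIFT descriptors, the ratio 0.8 is the value suggested in Lowe's paper [20], and the 0.9 acceptance fraction and the function names are assumptions made here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_matches(trained, suspect, ratio=0.8):
    """Count trained descriptors whose nearest suspect descriptor passes the ratio test."""
    matched = 0
    for descriptor in trained:
        distances = sorted(euclidean(descriptor, other) for other in suspect)
        # Accept only if the closest neighbour is clearly better than the second closest.
        if len(distances) >= 2 and distances[0] < ratio * distances[1]:
            matched += 1
    return matched


def is_legitimate(trained, suspect, min_fraction=0.9, ratio=0.8):
    """Assumed decision rule: enough of the trained key points must be matched."""
    if not trained:
        return False
    return count_matches(trained, suspect, ratio) / len(trained) >= min_fraction


# Toy descriptors (hypothetical values).
trained = [(0, 0, 0, 1), (0, 1, 0, 0), (1, 0, 0, 0)]
near_copy = [(0, 0, 0.1, 1), (0, 1, 0, 0.1), (1, 0.1, 0, 0)]
unrelated = [(5, 5, 5, 5), (5, 5, 5, 6)]

print(count_matches(trained, near_copy), "of", len(trained), "key points matched")
print(is_legitimate(trained, near_copy), is_legitimate(trained, unrelated))
```

With the assumed 0.9 fraction, a Table 2 entry such as 72/68 (about 94 percent matched) would be accepted as similar, while the early-stage entries of Table 3 would not.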
Figure-4: Key points match. In this scenario we compared the key points of a suspected web page with the legitimate image stored in the database of the trained system. Figure 4 shows that the key points match, so the suspected web page is judged to be a legitimate web page.
5.2 Implementing MD5 algorithm
We use MD5 to generate the signature of the web page; it validates the web page based on its content. No two legitimate web pages have the same signature. We establish the legitimacy of a web page by comparing the signature of the tested page with the corresponding signature in the trained data. If both signatures are identical, the suspected page is considered a legitimate web page; if they differ, our trained system reports that the suspected page mimics the legitimate web page.

TABLE-4: VERIFYING SIGNATURES OF THE TEST DATA AGAINST THE TRAINED DATA OF ALL THE 512 WEB PAGES
| S.No | Signature Change | Detected |
|---|---|---|
| 1 | 153/2 | 1 |
| 2 | 89/3 | 2 |
| 3 | 12/2 | 2 |
| 4 | 87/0 | 0 |
| 5 | 40/1 | 1 |
| 6 | 89/1 | 1 |
| 7 | 42/0 | 0 |

The first row of Table 4 specifies that out of 153 signatures, 2 differ from their legitimate signatures, and of these two our trained system detected that plagiarism was attempted in one web page, whose signature differs from its corresponding legitimate signature.
5.3 Combining both SIFT and MD5
Our aim is to combine the image signatures and the digital signature in order to produce efficient and accurate results from our trained system. By combining the image signature and the digital signature we achieved the following results.

TABLE-5: COMBINING BOTH THE SIGNATURES TO GENERATE FINAL OUTPUT.
| S.No | Image signature | Digital signature | Unidentified |
|---|---|---|---|
| 1 | 34/32 | 153/2 | 1 |
| 2 | 49/48 | 89/3 | 0 |
| 3 | 25/25 | 12/2 | 0 |
| 4 | 118/116 | 87/0 | 1 |
| 5 | 89/88 | 40/1 | 0 |
| 6 | 172/172 | 89/1 | 0 |
| 7 | 25/25 | 42/0 | 0 |

We achieved accurate results by combining both signatures; after large-scale testing, the system failed to detect two pages among all.

6. CONCLUSION AND FUTURE WORK

In this paper we presented an approach to detect plagiarized web pages by comparing visual similarities along with the digital signature between a suspicious web page and the potential legitimate target page. Because our proposed system is a purely server-side trained system, there is no burden on the client to judge whether a received web page is legitimate. We chose a visual-similarity-based approach because victims are typically convinced that they are visiting a legitimate page by the look and feel of a web site. However, a visual-similarity-based approach alone might not yield efficient and accurate results, so we added a digital-signature-based approach to the visual-similarity-based anti-phishing solution. We performed an experimental evaluation of our comparison technique to assess its accuracy and effectiveness in detecting plagiarized web pages. We used a dataset containing 12 real plagiarized phishing pages with their corresponding legitimate target pages in our trained system. The results are satisfactory, and only two false positives were raised.

References
[1] APWG. http://www.anti-phishing.org/.
[2] Microsoft. Sender ID Home Page. http://www.microsoft.com/mscorp/safety/technologies/senderid/default.mspx
[3] Yahoo. AntiSpam Resource Center. http://antispam.yahoo.com/domainkeys.
[4] Y. Zhang, J. I. Hong, L. F. Cranor, CANTINA: A Content-based approach to detecting phishing web sites, in: the international World Wide Web conference, WWW 2007, ACM Press, Banff, Alberta, Canada, 2007, pp. 639-648.
[5] Microsoft Corporation. PhishingFilter: Help protect yourself from online scams. http://www.microsoft.com/protect/products/yourself/phishingfilter.mspx
[6] Mozilla Project. Firefox phishing and malware protection. http://www.mozilla.com/en-US/firefox/phishing-protection/.
[7] R. Dhamija and J. D. Tygar, The Battle Against Phishing: Dynamic Security Skins, Proc. Symp. Usable Privacy and Security, 2005.
[8] Google, Inc. Google safe browsing for Firefox. http://www.google.com/tools/firefox/safebrowsing/.
[9] stopbadware.org. Badware website clearinghouse. http://stopbadware.org/home/clearinghouse
[10] Firefox add-on PhishTank SiteChecker, https://addons.mozilla.org/en-us/firefox/addon/phishtanksitechecker/
[11] Firefox add-on for anti-phishing Firephish, https://addons.mozilla.org/en-US/firefox/addon/firephish-antiphishing-extens/
[12] CallingID LinkAdvisor, www.callingid.com
[13] SpoofGuard is a tool to help prevent a form of malicious attack, http://crypto.stanford.edu/SpoofGuard/
[14] SpoofStick, a simple browser extension for IE and Firefox, http://www.spoofstick.com/
[15] Y. F. Anthony, W. Liu, X. Deng. Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD). IEEE Transactions on Dependable and Secure Computing, October 2006, Volume 3 (4), 301-311.
[16] Angelo P. E. Rosiello, Engin Kirda, Christopher Kruegel and Fabrizio Ferrandi. A Layout-Similarity Approach for Detecting Phishing Pages.
[17] Eric Medvet, Engin Kirda, and Christopher Kruegel. Visual-Similarity-Based Phishing Detection.
[18] Yu Meng, Dr. Bernard Tiddeman, Implementing the Scale Invariant Feature Transform (SIFT) Method.
[19] T. C. Hoad and J. Zobel, Methods for Identifying Versioned and Plagiarized Documents, J. Am. Soc. Information Science and Technology, vol. 54, no. 3, pp. 203-215, 2003.
[20] David G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints.
[21] Janaka Deepakumara, Howard M. Heys and R. Venkatesan, FPGA Implementation of MD5 Hash Algorithm.
[22] Anti-Phishing Group of the City University of Hong Kong, http://antiphishing.cs.cityu.edu.hk, 2005.
[23] W. Liu, X. Deng, G. Huang, and A. Y. Fu, An Anti-Phishing Strategy Based on Visual Similarity Assessment, IEEE Internet Computing, vol. 10, no. 2, pp. 58-65, 2006.
[24] APWG. Phishing Activity Trend. http://www.antiphishing.org/reports/apwg_report_march_2007.pdf
[25] W. Liu, G. Huang, X. Liu, M. Zhang, and X. Deng, Detection of Phishing Web Pages Based on Visual Similarity, Proc. 14th Int'l World Wide Web Conf., pp. 1060-1061, 2005.
[26] T. Nanno, S. Saito, and M. Okumura, Structuring Web Pages Based on Repetition of Elements, Proc. Seventh Int'l Conf. Document Analysis and Recognition, 2003.
[27] A. Emigh, "Online identity theft: Phishing technology, chokepoints and countermeasures," Radix Labs, Tech. Rep., 2005, retrieved from Anti-Phishing Working Group: http://www.antiphishing.org/resources.html.
[28] W. Liu, G. Huang, X. Liu, M. Zhang, and X. Deng, Detection of Phishing Web Pages Based on Visual Similarity, Proc. 14th Int'l World Wide Web Conf., pp. 1060-1061, 2005.
[29] PhishGuard.com. Protect Against Internet Phishing Scams. http://www.phishguard.com/.
B. Srinivas received an M.Tech in Computer Science and Engineering in 2008 from Acharya Nagarjuna University. He has two and a half years of industry experience and four years of teaching experience. He is currently employed as an Assistant Professor in the CSE department, MVGR College of Engineering.
M. V. Pratyusha received a B.Tech in Computer Science and Engineering.
S. Govinda Rao received an M.Tech in Computer Science and Systems Engineering from Andhra University. He is currently working as an Associate Professor at Gokaraju Rangaraju Institute of Engineering and Technology. He has 7 years of teaching experience.
K. V. Subba Raju received a B.Tech in Computer Science and Engineering and an M.Tech in Software Engineering. He is currently working as an Assistant Professor at MVGR College of Engineering. He has 5 years of teaching experience and fifteen years of technical experience.
