Automatically generated content is ubiquitous on the web: dynamic sites built using the three-
tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web
authoring software), as are less legitimate spamdexing attempts (e.g. link farms, faked
directories. . . ).
Pages built using the same generating method (template or script) share a common “look
and feel” that is not easily detected by common text classification methods, but is more closely
related to stylometry.
In this work we study and compare several html style similarity measures based on both textual
and extra-textual features in html source code. We also propose a flexible algorithm to cluster a
large collection of documents according to these measures. Since the algorithm we propose is based
on locality sensitive hashing (lsh), we briefly recall this technique.
We describe how to use the html style similarity clusters to pinpoint dubious pages and enhance
the quality of spam classifiers, and give an evaluation of our algorithm on the WEBSPAM-UK2006
dataset.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content
Analysis and Indexing
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Clustering, Document Similarity, Search Engine Spam, Stylometry, Template Identification
1. INTRODUCTION
Automatically generated content is nowadays ubiquitous on the web, especially
with the advent of professional web sites and popular three-tier architectures such
as “LAMP” (Linux Apache Mysql Php). Generating pages with such an
architecture involves:
—a scripting component;
—a page template (the “skeleton” of the site pages);
—content (e.g. product catalog, articles repository. . . ), usually stored in databases.
When summoned, the scripting component combines the page template with
information from the database to generate an html page that is indistinguishable
from a static html page from a robot crawler's point of view (should robots have
a point of view).
For instance, a well known strategy to mislead search engine ranking algorithms
consists of generating a maze of fake web pages called a link farm.
Apart from the common practice of dynamic web sites, the ability to automatically
generate a large number of web pages is also appealing to web spammers. Indeed,
[Fetterly et al. 2004] points out that “the only way to effectively create a very large
number of spam pages is to generate them automatically”.
When those pages are all hosted under a few domains, detecting those
domains can be a sufficient countermeasure for a search engine, but this is not an
option when the link farm spans hundreds or thousands of different hosts — for
instance using word-stuffed new domain names, or buying expired ones [Gyöngyi
and Garcia-Molina 2005].
One would like to detect all pages generated using the same method
once a spam page is found in a particular search engine response list. A direct
application of such a process is to enhance the efficiency of search engine blacklists
by “spreading” detected spam information to find affiliate domains, following the
philosophy of [Gyöngyi et al. 2004].
The first approach to the problem is to find useful features to detect spam, while
the other two are more related to semi-supervised learning: the second approach
enhances the recall of a spam classifier while the third enhances its
precision. Both of these rely on a good definition of similarity between
web pages. This similarity can be based on hyperlinks [Benczur et al. 2006; Castillo
et al. 2007] or on content. In this paper we focus on non-textual content similarity.
1.2.1 Detecting Spam By Similarity. Semantic text similarity detection usually
involves word-based features, such as in e-mail Bayesian filtering. From these fea-
tures a statistical model tries to detect texts that are similar to known examples of
spam. Such word-based features tend to model the text topic: for example “exiled
presidents” and “energizing sex drugs” are recurrent topics in e-mail spam. In the
case of the web, sites about “free ringtones” or “naked girls” are often full of spam
but, as described in WEBSPAM-UK2006 guidelines for web spam [Castillo et al. 2006],
the topic cannot be the only criterion to decide if a web site is using spamming
techniques.
Some spam techniques like honey pots are based on text plagiarism: the honey
pot technique consists of mirroring a reputable web site to introduce sneaky links
in its html code. This strong syntactic similarity is detected by semi-duplicate
detection algorithms [Broder et al. 1997; Heintze 1996; Charikar 2002]. Another
form of plagiarism consists of building fake content by stitching together parts of
ACM Journal Name, Vol. V, No. N, Month 20YY.
Tracking Web Spam with HTML Style Similarities · 3
text collected from several other web sites. To counter this form of spam, [Fetterly
et al. 2005] proposed to use a phrase-level syntactic clustering.
The use of word-based features is not always relevant: even if spam pages
often share the same generation method, they rarely share the same vocabulary
[Westbrook and Greene 2002]. Furthermore, automatically generated link farm
pages tend to use large dictionaries in order to span many different possible
requests [Gyöngyi and Garcia-Molina 2005] — hence, common text filtering
methods fail to catch many positive instances of this kind of web spam.
To detect similarity based on the page generation method, one needs features
more closely related to the internal structure of html documents. For instance,
[Meyer Zu Eissen and Stein 2004] proposed to use html-specific features along with
text and word statistics to build a classifier for web document genre.
1.2.2 Stylometry and html. What best describes the relationship between
pages generated using the same template or method is style rather than topic.
This relates our problem to the field of stylometry.
Up to now, stylometry has mostly been associated with authorship identification,
dealing with problems such as attributing plays to the right Shakespeare, or
detecting computer software plagiarism [Gray et al. 1997]. Usual metrics in stylometry
are mainly based on word counts [McEnery and Oakes 2000], but sometimes also on
non-alphabetic features such as punctuation. In the area of web spam detection,
[Westbrook and Greene 2002; Ntoulas et al. 2006] and [Lavergne 2006] propose
to use lexicometric features to classify the part of web spam that does not follow
regular language metrics.
Fig. 1. The style-based spam detection framework: all computing steps are described in section 3.
(Pipeline stages shown: html files, 3.1 Preprocessing, 3.3 Clustering, enhanced tagging.)
larity, these parts may be sequences of letters (n-grams), words, sequences of words,
sentences or paragraphs. Parts may overlap or not.
The most important ingredients for the quality of the comparison are the
preprocessing step and the granularity of the parts, but for a given preprocessing and
granularity there are many flavors of similarity measure [Van Rijsbergen 1979; Zobel
and Moffat 1998]. We present here the most common ones.
When frequency is not important, like in plagiarism detection, the documents are
represented by sets of parts. The most commonly used set-based similarity measure
is the Jaccard index: for two sets of parts D1, D2:

    Jaccard(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|.
Variants may be used for the normalizing factor, such as in the inclusion coefficient:

    Inc(D1, D2) = |D1 ∩ D2| / min{|D1|, |D2|}.
When the frequencies or weights of parts in documents matter, it is more
convenient to represent documents as high-dimension sparse vectors. The most
common vectorial similarity is the Cosine index:

    Cosine(d1, d2) = (d1 · d2) / (||d1|| · ||d2||).
The one-to-one computation of similarities is suitable for fine comparison within a
small set of documents, but the quadratic explosion induced by such a brute-force
approach is unacceptable at the scale of a web search engine. To circumvent this
explosion, smarter methods are required. These methods are described in the
next sections.
Fig. 2. A rough comparison of real similarity and its 256-bit lsh estimation for 400 similarity
pairs picked in a set of 1000 html files. (Scatter plot: estimated similarity vs. full text similarity;
series: Jaccard vs MinHashing, Cosine vs Charikar.)
Fig. 3. To detect that two pages were generated by the same script, one must consider both
visual and hidden features of html. For example, some scripts always jump one line after
writing height=”2”><font.
3.1 Preprocessing
3.1.1 Splitting Content and Noise. The usual procedure to compare two documents
is to first remove everything that does not reflect their content. In the case
of html documents, this preprocessing step may include the removal of tags, extra
spaces and stop words. It may also include normalization of words by capitalization
or stemming.
In order to reflect similarity based on form (and more specifically on templates)
rather than content, we propose to apply the opposite strategy: keep only the
“noisy” parts of html documents by removing any informational content, and
then compute document similarity on the filtered version.
3.1.2 A Collection of Parsers. We propose below several document preprocessors
that remove content from html documents.
—html noise preprocessor (denoted hss) filters all alphanumeric characters from
the original document, and keeps everything else. We expect to model the
“style” of html documents through these usually neglected features of html
texts, like extra spaces, line feeds or tags [Urvoy et al. 2006].
8 · Tanguy Urvoy et al.
—html noise var spaces (denoted hss-varsp) is a variant of the former, where
runs of blank spaces are squeezed into a single one. This smooths
differences that may arise between documents using the same template but
a varying number of words in their content parts. For very large texts, a
large part of the html noise output would otherwise consist of the spaces
used to separate words, which would water down the impact of the true template
part in the preprocessed document.
—html tags (denoted tags) applies a straightforward method to extract the formatting
skeleton of the page: it filters everything but html tags and attributes from
the original document. Content between tags is ignored, as are comments and
javascript.
—html tags and noise (denoted tags-noise) mixes the approaches of html noise and
html tags. html tags and attributes are kept in the output, as well as any
non-alphanumeric characters between tags and in javascript chunks.
Two more preprocessors are used in order to compare those strategies with more
standard filters:
WORDS (words)
Mozilla org Home of the Mozilla Project Skip to main content Mozilla About ...
FULL (full)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<title>Mozilla.org - Home of the Mozilla Project</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="web browser, mozilla, firefox, camino, thunderbird, ...">
NOISE (hss)
<>
<>. - </>
< -="-" ="/; =-">
< ="" =" , , , , , , , ">
VARSP (hss-varsp)
<>
<>.- </>
<-="-" ="/; =-">
<="" =", , , , , , , ">
TAGS (tags)
<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd ><html lang >...
...<head><title></title><meta http equiv content ><meta name content><link rel type href media >...
TAGS-NOISE (tags-noise)
<head>
<title>. - </title>
<meta http equiv content >
<meta name content >
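Under the definitions above, the content-removing preprocessors can be approximated with a few regular expressions. This is a rough sketch only: the real parsers also handle comments, javascript chunks and attribute values, which this version ignores.

```python
import re

def hss(html):
    # html noise: drop every alphanumeric character, keep everything else
    return re.sub(r'[0-9A-Za-z]', '', html)

def hss_varsp(html):
    # variant: squeeze runs of blank spaces into a single one
    return re.sub(r'  +', ' ', hss(html))

def tags(html):
    # keep only the tag markup, drop the text between tags
    # (crude: the real tags parser also strips attribute values and comments)
    return ''.join(re.findall(r'<[^>]*>', html))
```

For example, `hss("<a href='x'>hi</a>")` keeps only the punctuation skeleton of the markup.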
3.2 Fingerprinting
As said earlier, the one-to-one similarity measure does not fit large scale clustering.
Using lsh fingerprints on the preprocessing output is required to address the
scalability issue. Some technical details can greatly affect the quality of the
similarity estimation.
3.2.1 Fingerprinting Implementation Details. The parts of text we consider are
overlapping sequences of 32 characters, which we hash into 64-bit integers. This
large size avoids false positives and gives, as in [Fetterly et al. 2005], a sentence-
level description of documents. To hash sub-sequences, we use a variant of the
Jenkins hash function [Jenkins 1997], which achieves a better collision rate than
Rabin fingerprints.
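This shingling step can be sketched as follows, with hashlib's blake2b standing in for the Jenkins-variant hash (an assumption for illustration; only the 64-bit output size matters here):

```python
import hashlib

def shingles64(text, width=32):
    # hash every overlapping width-character window into a 64-bit integer
    parts = set()
    for i in range(len(text) - width + 1):
        digest = hashlib.blake2b(text[i:i + width].encode(), digest_size=8)
        parts.add(int.from_bytes(digest.digest(), 'big'))
    return parts
```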
For minhashing we pre-compute m = 64 permutations σi : [2^64] → [2^64] and
compare the permuted values:

    x ≺i y ⇔ σi(x) < σi(y).

To compute these permutations, we use a sub-family of permutations of the form
σi = σi1 ∘ σi2, where σi1 is a bit shuffle and σi2(x) is an exclusive-or mask. To
reduce the fingerprint size to 256 bits, we keep only four bits per key.
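A minimal sketch of this minhashing scheme follows. The concrete bit shuffles and masks, and the choice of keeping the four low bits of each key, are illustrative assumptions:

```python
import random

M, BITS = 64, 64
rng = random.Random(42)
# each sigma_i = a bit shuffle followed by an exclusive-or mask
shuffles = [rng.sample(range(BITS), BITS) for _ in range(M)]
masks = [rng.getrandbits(BITS) for _ in range(M)]

def sigma(i, x):
    y = 0
    for dst, src in enumerate(shuffles[i]):  # bit shuffle
        y |= ((x >> src) & 1) << dst
    return y ^ masks[i]                      # xor mask

def minhash_fingerprint(hashed_parts):
    # 64 min-hash keys, 4 low bits kept per key -> 256-bit fingerprint
    fp = 0
    for i in range(M):
        smallest = min(sigma(i, x) for x in hashed_parts)
        fp = (fp << 4) | (smallest & 0xF)
    return fp

def estimate_similarity(fp1, fp2):
    # fraction of matching 4-bit keys approximates the Jaccard index
    same = sum(((fp1 >> (4 * i)) & 0xF) == ((fp2 >> (4 * i)) & 0xF)
               for i in range(M))
    return same / M
```

Truncating each key to 4 bits introduces occasional spurious matches, which is the price paid for the compact 256-bit fingerprint.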
For Charikar algorithm, instead of pre-computing the random normal distribution
Fig. 4. Correlogram between the similarity measures on a sample of 10^5 pairs of html documents
(rows/columns: B-words, C-words, B-full, C-full, B-hss, C-hss, B-tags, C-tags, B-tags-noise,
C-tags-noise, B-hss-varsp, C-hss-varsp). Most of these sampled pairs were dissimilar. Darker
colors imply higher correlation.
Fig. 5. The quasi-transitivity of the estimated similarity relation is well illustrated by this complete
similarity graph (built from 3000 html files). With a transitive relation, each connected component
would be a clique. For “bunch of grapes” components like (1) and (2), there is a low
probability for connecting edges to be sampled. An edge density measure may be employed to
detect “worm” components like (3).
(Figure: LSH fingerprints → similarity edges → url clusters.)
mutation and we check locally for similarity pairs in this sorted array (see Algorithm
1). A similar permutation technique is used by [Bawa et al. 2005] on B-trees
to search for nearest neighbors. As described in Figure 6, we use two parallel sorts
(by prefix and by suffix). With these two sorts we get a good compromise to cluster 3M
fingerprints with a relatively narrow (winsize = 4) sliding window.
Depending on the architecture, the sliding window edge detection processes may
also be serialized. In the serialized algorithm, most redundant similar fingerprints may
be removed from the stream to reduce the cost of the second sort.
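One sort pass of this scheme can be sketched as follows; winsize and the Hamming-distance threshold are the tunable knobs, and the parallel suffix-ordered sort is omitted for brevity:

```python
def hamming(a, b):
    # number of differing bits between two fingerprints
    return bin(a ^ b).count("1")

def candidate_pairs(fingerprints, winsize=4, max_dist=16):
    # sort the fingerprints, then compare each one only to its winsize
    # successors in sorted order (the "prefix" sort; the scheme above
    # also runs a second, suffix-ordered sort to catch more pairs)
    order = sorted(range(len(fingerprints)), key=lambda i: fingerprints[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + 1 + winsize]:
            if hamming(fingerprints[i], fingerprints[j]) <= max_dist:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

Sorting costs O(n log n) and the window scan O(n · winsize), which avoids the quadratic all-pairs comparison.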
Fig. 7. spam/normal labels in an html style similarity graph. This graph was built with the B-
hss-varsp similarity on a sample of 10^5 html files from UK-2006. Each node represents a host
and each edge represents a high similarity between pages in different hosts. Black nodes indicate
spam labels and white nodes indicate normal labels; others are unknown. The purity of clusters is
better than in Figure 8. The highly connected clusters are symptomatic of link farms or mirrors.
4. AIRWEB-2006 EXPERIMENTS
In this section we discuss the results of an experiment done in 2006 on our own
dataset. These results were presented at the AirWeb 2006 workshop [Urvoy et al.
2006]. This experiment is based on B-hss fingerprints.
4.1 Dataset
We used a corpus of five million html pages crawled from the web. This corpus
was built by combining three crawl strategies:
—a deep internal crawl of 3 million documents from 1300 hosts of the dmoz directory;
—a flat crawl of 1 million documents from the Orange French search engine blacklist
(with much adult content);
—a deep breadth-first crawl from 10 non-adult spam urls (chosen in the blacklist)
and 10 trustful urls (mostly from French universities).
Fig. 8. spam/normal labels in a words similarity graph. This graph was built with the B-words
similarity on a sample of 10^5 html files from WEBSPAM-UK2006. Each node represents a host and
each edge represents a high similarity between pages in different hosts. Black nodes indicate
spam labels and white nodes indicate normal labels; others are unknown. The small pure clusters
indicate “spam-full” and “spam-free” topics. Most hosts are in the biggest cluster: spam
independent topics.
At the end of the process, we estimated roughly that 2/5 of the collected documents
were spam.
After fingerprinting, the initial volume of 130 GB of data to analyze was reduced
to 390 MB, so the next step could easily be done in memory.
Fig. 9. By sorting all html documents by decreasing similarity with one reference page (here from
franao.com) we get a curve with different gaps. The third gap (around 180,000) marks the end
of the franao web site. (Plot: HS-similarity to www.franao.com vs. row × 1000, by decreasing
HS-similarity; labeled regions: franao.com template 1 (large), franao.com template 1 (small),
franao.com template 2, other sites; gaps (1), (2), (3).)
1 (Figure 10);
—around 20,000 there is a smooth gap (1) between long list and short list pages,
but up to 95,000, the template is the same;
—around 95,000, there is a strong gap (2) which marks a new template (Figure 11):
up to 180,000, all pages are internal franao links built according to template 2;
—around 180,000 there is a strong gap (3) between franao pages and other web
sites' pages.
pages generated using the PhpBB open source project. Cluster #3 is also
interesting: it is populated by Apache default directory listings;
(2) Link farm clusters are also a special case of template clusters. They contain
numerous computer generated pages, based on the same template and containing
a lot of hyperlinks to each other;
(3) Mirror clusters contain sets of near-duplicate pages hosted on different
servers. Generally only minor changes were applied to the copies, like adding a link
back to the server hosting the mirror;
(4) Copy/Paste clusters contain pages that are not part of mirrors, but do share
the same content: either a text (e.g. license, porn site legal warning. . . ), a
frameset scheme, or the same javascript code (often with little actual content).
The first two cluster classes are the most interesting benefit of B-hss
clustering. They allow an easy classification of web pages by categorizing only a few of
them.
5. UK-2006 EXPERIMENTS
5.1 Dataset
The WEBSPAM-UK2006 reference dataset is a collection based on a crawl of the .uk
domain done by the University of Roma “La Sapienza”, with a large number of hosts
labeled by a team of volunteers. The whole labeling process is described by [Castillo
et al. 2006] in their article “A Reference Collection for Web Spam”.
The full dataset contains 77 million web pages spread over 11400 hosts. There
are about 5400 labeled hosts. Because the assessors spent on average five minutes
per host, it makes sense to summarize the dataset by taking only the first 400
reachable pages of each host. We worked on the summarized dataset, which contains
3.3 million web pages stored in 8 volumes of 1.7 GB each.
Fig. 12. Distribution of cluster sizes (number of hosts) and number of clusters for each algorithm
and preprocessing (B and C variants of words, full, hss, hss-varsp, tags, tags-noise; cluster-size
buckets: 1, 2-3, 4-7, 8-15, 16+).
of spam information through the clusters. We then evaluate the precision improvement
ability of our method, and we conclude our experiments with a side remark
about mirrors.
Table II. Purity of the clusters depending on the algorithm, preprocessing and cluster size.

                           Alg. B                  Alg. C
Preproc.     Size    mean purity  stddev     mean purity  stddev
words        2-3     0.9939     0.077001     0.9764     0.14974
             4-7     0.98411    0.11016      0.97556    0.13342
             8-15    0.96433    0.15218      0.98735    0.084978
             16+     0.98403    0.10645      0.98996    0.1316
full         2-3     0.9956     0.065797     0.99459    0.073127
             4-7     0.99418    0.071929     0.99455    0.069851
             8-15    0.99       0.084263     0.98612    0.10964
             16+     0.9945     0.06451      0.99597    0.056474
hss          2-3     0.99819    0.041969     0.99691    0.055119
             4-7     0.99748    0.046309     0.99669    0.053076
             8-15    0.99567    0.059022     0.99556    0.058094
             16+     0.99591    0.057085     0.99312    0.071062
hss-varsp    2-3     0.99762    0.048259     0.99797    0.044552
             4-7     0.9979     0.041595     0.9973     0.047808
             8-15    0.98827    0.093353     0.99586    0.056173
             16+     0.99501    0.062869     0.99318    0.070702
tags         2-3     0.99807    0.043069     0.99838    0.03961
             4-7     0.99817    0.039301     0.99797    0.042088
             8-15    0.99792    0.039127     0.99358    0.065956
             16+     0.99314    0.097974     0.98936    0.087702
tags-noise   2-3     0.9988     0.033991     0.99749    0.049578
             4-7     0.998      0.042234     0.99695    0.051297
             8-15    0.99797    0.038512     0.99653    0.051339
             16+     0.9954     0.061504     0.99335    0.070447
spam) is important. These are much more dangerous than false negatives (spam
identified as normal) in this context. A high level of precision is mandatory.
To estimate the precision gain, we use each clustering to propagate spam
information. As explained in section 3.4.2, if a cluster has more spam urls than normal
ones and a purity greater than or equal to 0.9, we label as spam each host corresponding
to the urls not already tagged. Once this step is done for each cluster, we
label the remaining untagged hosts as normal.
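This propagation rule can be sketched as follows; the host and label representations are illustrative, and section 3.4.2 remains the authoritative description:

```python
def spread_spam_labels(clusters, known):
    # clusters: iterable of lists of hosts; known: host -> 'spam' | 'normal'
    labels = dict(known)
    for cluster in clusters:
        tagged = [known[h] for h in cluster if h in known]
        if not tagged:
            continue
        spam = tagged.count('spam')
        # majority of spam labels AND purity >= 0.9 taints the whole cluster
        if spam > len(tagged) - spam and spam / len(tagged) >= 0.9:
            for h in cluster:
                labels.setdefault(h, 'spam')
    # hosts still untagged after every cluster is processed default to normal
    for cluster in clusters:
        for h in cluster:
            labels.setdefault(h, 'normal')
    return labels
```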
This evaluation is a bit pessimistic, since a real classifier would give better results
on the hosts that are not covered by clusters.
Table II shows that for the different clusterings, the mean purity of the clusters
is very high. This property suggests a very good precision. This hypothesis is
confirmed by Figure 13, which shows the precision according to the percentage of
tagged labels.
Precision reaches very high levels even when the proportion of known labels is
very low. There is, however, a major exception concerning the words preprocessing.
This preprocessing strips all of the text structure to leave only content and is, for
example, very bad at distinguishing a trustful site from a spam site that plagiarizes it.
The highest precision levels are reached by the preprocessors dealing with html noise.
We also note that, even if their level stands high, the Charikar-based variants are
slightly below their Broder-based counterparts.
Fig. 13. Precision according to the percentage of already tagged hosts, for each algorithm/
preprocessing pair (B and C variants of words, full, hss, hss-varsp, tags, tag-noise).
Fig. 14. Recall according to the percentage of already tagged hosts, for each algorithm/
preprocessing pair (B and C variants of words, full, hss, hss-varsp, tags, tag-noise).
Figure 14 shows the recall according to the rate of known tags. Results are relatively
low. This low score can be explained, on one hand, by the high similarity threshold
used, which maximizes precision at the expense of recall, and on the other hand by
the clustering itself, which makes some urls (and hosts) unreachable. In other words,
some clusters may not contain any spam or normal labels; urls in such clusters
are impossible to label by tag spreading.
Fig. 15. F-measure according to the percentage of already tagged hosts, for each algorithm/
preprocessing pair (B and C variants of words, full, hss, hss-varsp, tags, tag-noise).
Methods based on words (i.e. words and full) have a recall that is significantly
lower than that of the noise-based methods, whatever the rate of known labels in
input. This observation also holds for the F1 measure:

    F1 = (2 · precision · recall) / (precision + recall)
The F1 measure is a good way to assess the global quality of the various
methods. Figure 15 shows the F1 measure according to the percentage of tagged hosts.
For the html noise-based methods, we note that the Charikar-based methods are
generally slightly better for the highest percentages of known input, and the
corresponding Broder-based methods for the lower ones.
We also note, according to Figure 15, a global performance increase with the rate
of known labels.
An external, imperfect spam classifier can thus be useful to increase the base of
spam/normal information, and so to use the clustering amplification explained
in section 3.4.
According to these results, the hss-varsp preprocessor combined with the Broder
fingerprinting technique gives globally the best results. Nevertheless, depending on
the rate of tagged hosts, the tag-noise preprocessor gives better results: with Broder
fingerprints at lower rates, and with Charikar ones at higher rates.
5.2.3 Precision Improvement Ability. Collections of labels are subject to errors
(human misjudgment, classifier imprecision, . . . ). To use these labels efficiently,
tag spreading should be fault-tolerant.
The high purity threshold mandatory to label a whole cluster offers a rather good
Fig. 16. Evolution of precision when injecting errors, for the tag-noise preprocessed clusterings
(precision vs. % of false tags, for B-tag-noise and C-tag-noise).
—a host name and its variant with the www prefix, like www.cwrightandco.co.uk
and cwrightandco.co.uk;
—different variants of the same name, like library.cardiff.ac.uk and library.cf.ac.uk;
—fully different host names, like www.yorkshire-evening-post.co.uk and thisisleeds.co.uk.
While detection of the first type of mirrors is trivial and reliable, handling the
two others requires semi-duplicate detection techniques.
6. CONCLUSION
We proposed and studied several similarity measures to compare web pages according
to their html “style”. We implemented a computationally efficient algorithm to
cluster html documents based on these similarities. We proposed a spam detection
framework that we evaluated with a detailed experiment on the WEBSPAM-UK2006
tagged dataset.
This experiment showed the efficiency of our framework in enhancing the quality
of spam classifiers by spreading and consolidating their predictions according to
cluster consistency. It also showed that noise-based similarities are significantly
more efficient than text content and full content based similarities for diffusing spam
information. Among the noise-based similarities, it showed slightly better results
for the B-hss-varsp and tag-noise similarities.
The html style similarities find several uses in a search engine back-office process:
a direct application is to enhance the efficiency of search engine blacklist
management by “spreading” detected spam information; a second one is to help
detect web site boundaries by detecting html templates; a third one is to point out
large clusters of similar pages spanning several domains, which is often a good hint
of either site mirroring or automatic spam page generation.
For practical use in a search engine, the most interesting way to fingerprint html
documents is probably to combine a word-level content-based Charikar fingerprint
for topic detection with a sentence-level noise-based Broder fingerprint for template
detection.
REFERENCES
Bawa, M., Condie, T., and Ganesan, P. 2005. LSH forest: self-tuning indexes for similarity
search. In WWW. 651–660.
Benczur, A. A., Csalogany, K., and Sarlos, T. 2006. Link-based similarity search to fight web
spam. In International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
Boullé, M. 2006. MODL: A Bayes optimal discretization method for continuous attributes.
Machine Learning 65, 1, 131–165.
Broder, A. 1997. On the resemblance and containment of documents. In SEQUENCES ’97:
Proceedings of the Compression and Complexity of Sequences 1997. IEEE Computer Society,
Washington, DC, USA, 21.
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering
of the web. In Selected papers from the sixth international conference on World Wide Web.
Elsevier Science Publishers Ltd., Essex, UK, 1157–1166.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., and Vigna, S. 2006. A
reference collection for web spam. SIGIR Forum 40, 2 (December).
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007.
Know your neighbors: Web spam detection using the web topology. Unpublished,
http://research.yahoo.com/publications/3/Search.
Charikar, M. S. 2002. Similarity estimation techniques from rounding algorithms. In STOC
’02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM
Press, New York, NY, USA, 380–388.
Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: using sta-
tistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International
Workshop on the Web and Databases. ACM Press, New York, NY, USA, 1–6.
Fetterly, D., Manasse, M., and Najork, M. 2005. Detecting phrase-level duplication on the
world wide web. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval. ACM Press, New York, NY,
USA, 170–177.
Gray, A., Sallis, P., and MacDonell, S. 1997. Software forensics: Extending authorship analy-
sis techniques to computer programs. In 3rd Biannual Conference of International Association
of Forensic Linguists (IAFL ’97). 1–8.
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In First International Work-
shop on Adversarial Information Retrieval on the Web (AIRWeb).
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. 2004. Combating Web spam with
TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases
(VLDB). Morgan Kaufmann, Toronto, Canada, 576–587.
Heintze, N. 1996. Scalable document fingerprinting. In 1996 USENIX Workshop on Electronic
Commerce.
Henzinger, M. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In
SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research
and development in information retrieval. ACM Press, New York, NY, USA, 284–291.
Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of
dimensionality. In STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory
of computing. ACM Press, New York, NY, USA, 604–613.
Jenkins, B. 1997. Web site: burtleburtle.net/bob/hash/doobs.html.
Lavergne, T. 2006. Unnatural language detection. In Proceedings of RJCRI’06 : Young Scien-
tists’ conference on Information Retrieval.
McEnery, T. and Oakes, M. 2000. Authorship identification and computational stylometry. In
Handbook of Natural Language Processing. Marcel Dekker Inc.
Meyer Zu Eissen, S. and Stein, B. 2004. Genre classification of web pages. In Proceedings
of KI-04, 27th German Conference on Artificial Intelligence, S. Biundo, T. Frühwirth, and
G. Palm, Eds. Ulm, DE. Published in LNCS 3238.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. 2006. Detecting spam web pages
through content analysis. In International World Wide Web Conference (WWW).
Urvoy, T., Lavergne, T., and Filoche, P. 2006. Tracking web spam with hidden style similarity.
In International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
Van Rijsbergen, C. J. 1979. Information Retrieval, 2nd edition. Dept. of Computer Science,
University of Glasgow, Glasgow, Scotland, UK.
Westbrook, A. and Greene, R. 2002. Using semantic analysis to classify search engine spam.
Tech. rep., Stanford University.
Zobel, J. and Moffat, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1, 18–34.
Contents
1 Introduction 1
1.1 Spamdexing and Generated Content . . . . . . . . . . . . . . . . . . 1
1.2 Detecting Spam Web Pages . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Detecting Spam By Similarity . . . . . . . . . . . . . . . . . . 2
1.2.2 Stylometry and html . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of This Paper . . . . . . . . . . . . . . . . . . . . . . . . . 3
4 AirWeb-2006 Experiments 14
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 One-to-All Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Global Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Conclusion of this experiment . . . . . . . . . . . . . . . . . . . . . . 18
5 UK-2006 Experiments 18
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Evaluation of Similarity Clustering . . . . . . . . . . . . . . . . . . . 18
5.2.1 Clustering evaluation . . . . . . . . . . . . . . . . . . . . . . . 19
5.2.2 Evaluation of Tags Spreading . . . . . . . . . . . . . . . . . . 19
5.2.3 Precision Improvement Ability . . . . . . . . . . . . . . . . . 22
5.2.4 A side note about mirrors . . . . . . . . . . . . . . . . . . . . 23
5.3 Web Spam 2007 Challenge . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Conclusion 24