
Tracking Web Spam with HTML Style Similarities

TANGUY URVOY and EMMANUEL CHAUVEAU


PASCAL FILOCHE and THOMAS LAVERGNE1
France Telecom R&D

Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as are less legitimate spamdexing attempts (e.g. link farms, faked directories...).
Pages built using the same generating method (template or script) share a common “look and feel” that is not easily detected by common text classification methods, but is more related to stylometry.
In this work we study and compare several html style similarity measures based on both textual and extra-textual features of html source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since this algorithm is based on locality sensitive hashing (lsh), we also briefly review this technique.
We describe how to use the html style similarity clusters to pinpoint dubious pages and enhance
the quality of spam classifiers, and give an evaluation of our algorithm on the WEBSPAM-UK2006
dataset.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content
Analysis and Indexing
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Clustering, Document Similarity, Search Engine Spam, Sty-
lometry, Templates Identification

1. INTRODUCTION
Automatically generated content is nowadays ubiquitous on the web, especially with the advent of professional web sites and popular three-tier architectures such as “LAMP” (Linux, Apache, MySQL, PHP). Generating pages with such an architecture involves:

—a scripting component;
—a page template (“skeleton” of the site pages);
—content (e.g. product catalog, articles repository. . . ), usually stored in databases.
When summoned, the scripting component combines the page template with information from the database to generate an html page that is, from a crawler's point of view, indistinguishable from a static html page.

1.1 Spamdexing and Generated Content


By analogy with e-mail spam, the word spamdexing designates the techniques used to raise a web site to a higher-than-deserved rank in search engine response lists.

1 Thomas Lavergne is also with ENST Paris.


ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1–26.
2 · Tanguy Urvoy et al.

For instance, a well known strategy to mislead search engine ranking algorithms consists of generating a maze of fake web pages called a link farm.
Apart from legitimate dynamic web site practice, the ability to automatically generate a large number of web pages is also appealing to web spammers. Indeed, [Fetterly et al. 2004] points out that “the only way to effectively create a very large number of spam pages is to generate them automatically”.
When those pages are all hosted under a few domains, detecting those domains can be a sufficient countermeasure for a search engine, but this is not an option when the link farm spans hundreds or thousands of different hosts, for instance when using word-stuffed new domain names or buying expired ones [Gyöngyi and Garcia-Molina 2005].
Once a spam page is detected in a search engine response list, one would like to be able to detect all pages generated using the same method. A direct application of such a process is to enhance the efficiency of search engine blacklists by “spreading” detected spam information to find affiliated domains, following the philosophy of [Gyöngyi et al. 2004].

1.2 Detecting Spam Web Pages


We see the problem of spam detection in a search engine back-office process as threefold:

—pinpoint dubious sets of pages in a large uncategorized corpus;
—detect new instances of already encountered spam (through editorial review or automatic classification methods);
—enhance the reliability of spam diagnostics.

The first approach to the problem is to find useful features to detect spam, while the other two are more related to semi-supervised learning: the second approach enhances the recall of a spam classifier while the third enhances its precision. Both rely on a good definition of similarity between web pages. This similarity can be based on hyperlinks [Benczur et al. 2006; Castillo et al. 2007] or on content. In this paper we focus on non-textual content similarity.
1.2.1 Detecting Spam By Similarity. Semantic text similarity detection usually involves word-based features, as in Bayesian e-mail filtering. From these features a statistical model tries to detect texts that are similar to known examples of spam. Such word-based features tend to model the text topic: for example, “exiled presidents” and “energizing sex drugs” are recurrent topics in e-mail spam. On the web, sites about “free ringtones” or “naked girls” are often full of spam but, as described in the WEBSPAM-UK2006 guidelines for web spam [Castillo et al. 2006], the topic cannot be the only criterion to decide whether a web site is using spamming techniques.
Some spam techniques like honey pots are based on text plagiarism: the honey pot technique consists of mirroring a reputable web site to introduce sneaky links into its html code. This strong syntactic similarity is detected by semi-duplicate
detection algorithms [Broder et al. 1997; Heintze 1996; Charikar 2002]. Another
form of plagiarism consists of building fake content by stitching together parts of

text collected from several other web sites. To counter this form of spam, [Fetterly et al. 2005] proposed a phrase-level syntactic clustering.
The use of word-based features is not always relevant: even if spam pages often share the same generation method, they rarely share the same vocabulary [Westbrook and Greene 2002]. Furthermore, automatically generated link farm pages tend to use large dictionaries in order to span many different possible requests [Gyöngyi and Garcia-Molina 2005]; hence, common text filtering methods fail to catch many positive instances of this kind of web spam.
To detect similarity based on the page generation method, one needs features more closely related to the internal structure of html documents. For instance, [Meyer Zu Eissen and Stein 2004] proposed to use html-specific features along with text and word statistics to build a classifier for web document genres.
1.2.2 Stylometry and html. In fact, what best describes the relationship between pages generated using the same template or method lies on stylistic rather than topical grounds. This relates our problem to the field of stylometry.
Up to now, stylometry has mostly been associated with authorship identification, dealing with problems such as attributing plays to the right Shakespeare or detecting computer software plagiarism [Gray et al. 1997]. Usual metrics in stylometry
are mainly based on word counts [McEnery and Oakes 2000], but also sometimes on
non-alphabetic features such as punctuation. In the area of web spam detection,
[Westbrook and Greene 2002; Ntoulas et al. 2006] and [Lavergne 2006] propose
to use lexicometric features to classify the part of web spam that does not follow
regular language metrics.

1.3 Overview of This Paper


This paper presents an extended version of our AIRWeb 2006 work [Urvoy et al. 2006]. The main principle is to cluster web pages generated by similar tools, scripts or html templates in order to enhance the quality of web spam detection. We first review syntactic similarity measures and large scale, fingerprint-based clustering in section 2. We then detail the specifics of our spam detection framework in section 3 (an overview of this framework is given in Figure 1). The experimental results of [Urvoy et al. 2006] were based on our own (mostly French) dataset; they are described in section 4. The WEBSPAM-UK2006 tagged dataset [Castillo et al. 2006] allowed us to perform more advanced experiments, which are described in section 5.

2. BACKGROUND ON SIMILARITY AND CLUSTERING

Since our approach to spam detection is based on similarity, we first review the standard text similarity measures.

2.1 Similarity Measures


The first step before comparing documents is to extract their (interesting) content: this is what we call preprocessing. The second step is to transform this content into a model suitable for comparison (except for string-edit distances like the Levenshtein distance and its derivatives, where this intermediate model is not mandatory).
Usually the documents are split up into parts. Depending on the expected granularity, these parts may be sequences of letters (n-grams), words, sequences of words, sentences or paragraphs. Parts may or may not overlap.

[Figure: pipeline from html files through preprocessing (3.1), lsh fingerprinting (3.2) and clustering (3.3); the resulting clusters are combined with spam tags by tags spreading (3.4) to produce an enhanced tagging.]

Fig. 1. The style-based spam detection framework: all computing steps are described in section 3.

The most important ingredients for the quality of the comparison are the preprocessing step and the granularity of the parts, but for a given preprocessing and granularity there are many flavors of similarity measure [Van Rijsbergen 1979; Zobel and Moffat 1998]. We present the most common ones here.
When frequency is not important, like in plagiarism detection, the documents are
represented by sets of parts. The most commonly used set-based similarity measure
is the Jaccard index: for two sets of parts D1, D2,

    Jaccard(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|.

Variants may be used for the normalizing factor, such as in the inclusion coefficient:

    Inc(D1, D2) = |D1 ∩ D2| / min{|D1|, |D2|}.
When frequencies or weights of parts in documents are important, it is more convenient to represent documents as high-dimensional sparse vectors d1, d2. The most common vectorial similarity is the Cosine index:

    Cosine(d1, d2) = (d1 · d2) / (||d1|| · ||d2||).
Computing similarities one-to-one is adequate for fine comparison within a small set of documents, but the quadratic explosion induced by such a brute-force approach is unacceptable at the scale of a web search engine. To circumvent this explosion, cleverer methods are required. These methods are described in the next sections.

2.2 lsh Fingerprints


A locality sensitive hash (lsh) function [Indyk and Motwani 1998] is a hashing
function where the probability of collision is high for similar documents and low for
non-similar ones. More formally, if D is a set of documents and sim : D ×D → [0, 1]
is a given similarity measure, a set of functions F ⊆ ND is an lsh scheme for a
similarity function sim if, for any pair d1, d2 ∈ D,

    P_{h∈F}[h(d1) = h(d2)] = sim(d1, d2).

We can build an estimator of the similarity by gluing together independent lsh functions. Let simH(x, y) be the number of equal dimensions between two vectors x, y, i.e. simH(x, y) = |{i | xi = yi}|, let h1, ..., hm be m independent lsh functions, and let H be the mapping defined by H(d) = (h1(d), ..., hm(d)); then

    sim(d1, d2) ≃ simH(H(d1), H(d2)) / m.

To avoid confusion, we call the reduced vector H(d) the lsh-fingerprint, and the hash values hi(d), 1 ≤ i ≤ m, the lsh-keys. We present here two standard algorithms for building lsh-fingerprints and their application to large scale similarity clustering.
2.2.1 Broder MinHashing. To our knowledge, the oldest references about min-
hashing (also called minsampling) are [Heintze 1996; Broder 1997] and [Broder et al.
1997].
As described in section 2.1, each preprocessed document is split up into parts.
Let us call P the set of all possible parts. The main principle of minsampling over
P is to fix a linear ordering on P at random (call it ≺) and to represent each document D ⊆ P by its lowest element according to ≺: h≺(D) = min≺(D).
If ≺ is chosen at random, then for any pair of documents D1, D2 ⊆ P, we have the lsh property:

    P≺[h≺(D1) = h≺(D2)] = Jaccard(D1, D2).

If we consider m independent linear orderings ≺i and define H by H(D) = (min≺1(D), ..., min≺m(D)), then

    Jaccard(D1, D2) ≃ simH(H(D1), H(D2)) / m.
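The minhash estimator can be sketched as follows (a simplified illustration: the m random orderings are simulated with seeded hashes, rather than the bit-shuffle and exclusive-or permutations used in section 3.2.1, and the toy 3-grams are ours):

```python
import hashlib

def key(seed, part):
    """Rank of `part` under one pseudo-random linear ordering (a seeded hash)."""
    h = hashlib.blake2b(part.encode(), digest_size=8, salt=seed.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def minhash(parts_set, m=64):
    """lsh-fingerprint: the minimal part of the document under each ordering."""
    return [min(key(i, p) for p in parts_set) for i in range(m)]

def estimate_jaccard(fp1, fp2):
    """sim_H(H(D1), H(D2)) / m: the fraction of matching lsh-keys."""
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)

d1 = {"<ht", "tml", "><b", "ody"}
d2 = {"<ht", "tml", "><h", "ead"}
est = estimate_jaccard(minhash(d1), minhash(d2))  # true Jaccard is 2/6
```

With m = 64 orderings the estimate fluctuates around the true value 1/3; increasing m tightens it, at the cost of a longer fingerprint.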
2.2.2 Charikar Fingerprints. [Charikar 2002] introduces another approach to locality sensitive hashing. This method partitions the vector space according to random hyperplanes and estimates the Cosine similarity.
Let r be the characteristic vector of a random hyperplane (each coordinate drawn from a centered normal distribution). Given a weighted vector di representing a document, we define the hashing function hr by:

    hr(di) = 1 if r · di ≥ 0, and 0 otherwise.

[Figure: estimated similarity vs. full text similarity for the two estimators, Jaccard vs MinHashing and Cosine vs Charikar.]

Fig. 2. A rough comparison of the real similarity and its 256 bit lsh estimation for 400 similarity pairs picked from a set of 1000 html files.

This implies the following lsh property:

    Pr[hr(d1) = hr(d2)] = 1 − θ(d1, d2)/π,

where θ(d1, d2) is the angle between the two vectors. This value is mapped to the cosine similarity by

    Cosine(d1, d2) = cos(π · (1 − Pr)).
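The random hyperplane scheme and its cosine mapping can be sketched as follows (an illustrative simplification: hyperplane coordinates are drawn lazily from seeded generators, rather than pre-computed as the paper does in section 3.2.1):

```python
import hashlib
import math
import random

def coord(i, part):
    """Coordinate of random hyperplane r_i along dimension `part` (reproducible)."""
    seed = int.from_bytes(hashlib.blake2b(f"{i}:{part}".encode(),
                                          digest_size=8).digest(), "big")
    return random.Random(seed).gauss(0.0, 1.0)

def fingerprint(weights, m=256):
    """m sign bits: h_ri(d) = 1 iff r_i . d >= 0, for a sparse vector `weights`."""
    return [1 if sum(w * coord(i, p) for p, w in weights.items()) >= 0 else 0
            for i in range(m)]

def estimate_cosine(fp1, fp2):
    """Collision rate P mapped back to a cosine: cos(pi * (1 - P))."""
    p = sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)
    return math.cos(math.pi * (1.0 - p))

same = estimate_cosine(fingerprint({"<a>": 2.0}), fingerprint({"<a>": 2.0}))  # 1.0
```

Identical documents collide on every bit, so P = 1 and the estimated cosine is exactly 1.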

2.3 Clustering with lsh Fingerprints


When working on a large volume of data, one would like to group together the documents which are similar enough according to the chosen similarity: we want to compute a mapping C associating to each document d a class representative C(d), with C(d) = C(d′) if and only if sim(d, d′) is higher than a given threshold.
The first benefit of using fingerprints for this task is to reduce the size of the document representatives, making it possible to perform all computation in memory. As illustrated by Figure 2, this reduction by sampling comes at the cost of a small loss of quality in the similarity estimation.
Another important benefit of fingerprints is to give a low-dimensional representation of documents. It becomes possible to compare only the documents that match on at least some dimensions. This is a way to build the sparse similarity matrix with some control over the quadratic explosion induced by the biggest clusters. Our clustering algorithm, detailed in section 3.3, exploits this property.

3. THE HSS FRAMEWORK


As illustrated by Figure 3, by keeping usually neglected features of html like extra spaces, line feeds or tags, we are able to model the “style” of html generating tools. We propose specific document preprocessors that exclude as much of the textual content as possible and keep the html “noise” instead. These preprocessors are

Fig. 3. To detect that two pages were generated by the same script, one must consider both visible and hidden features of html. For example, some scripts always insert a line break after writing height=”2”><font.

described in section 3.1.


The html noise is split up into wide overlapping sequences of characters from which lsh fingerprints are computed. In section 3.3 we propose a clustering algorithm which allows a flexible trade-off between brute-force one-to-one comparison and fast approximate clustering. In section 3.4 we propose different ways of using this clustering to detect web spam.

3.1 Preprocessing
3.1.1 Splitting Content and Noise. The usual procedure to compare two documents is to first remove everything that does not reflect their content. In the case of html documents, this preprocessing step may include the removal of tags, extra spaces and stop words. It may also include normalization of words by capitalization or stemming.
In order to reflect similarity based on form (and more specifically on templates) rather than content, we propose to apply the opposite strategy: keep only the “noisy” parts of html documents by removing any informational content, and then compute document similarity on the filtered version.
3.1.2 A Collection of Parsers. We propose below several document preprocessors to achieve this goal of removing content from html documents.

—html noise preprocessor (denoted hss) filters all alphanumeric characters from the original document and keeps everything else. We expect to be able to model the “style” of html documents through those usually neglected features of html text, such as extra spaces, line feeds or tags [Urvoy et al. 2006].

—html noise var spaces (denoted hss-varsp) is a variant of the former, where runs of blank spaces are squeezed into a single one. This is intended to smooth out differences that may arise in the output of documents using the same template but a varying number of words in their content parts. For a very long text, a large part of the html noise output would otherwise consist of the spaces used to separate words, which would dilute the impact of the true template part in the preprocessed document.

—html tags (denoted tags) applies a straightforward method to extract the formatting skeleton from the page: it filters out everything but html tags and attributes from the original document. Content between tags is ignored, as are comments and javascript.

—html tags and noise (denoted tags-noise) mixes the approaches of html noise and html tags: html tags and attributes are kept in the output, as well as any non-alphanumeric characters between tags and in javascript chunks.

Two more preprocessors are used in order to compare those strategies with more standard filters:

—html to words (denoted words) outputs every alphanumeric character outside of tags;
—full (denoted full) outputs the initial document unmodified.

An experimental comparison of these preprocessing procedures with respect to the task of categorizing web spam pages is presented in section 5.
The following example shows the beginning of the output of the different parsers for the url http://www.mozilla.org/:

WORDS (words)
Mozilla org Home of the Mozilla Project Skip to main content Mozilla About ...

FULL (full)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">

<head>
<title>Mozilla.org - Home of the Mozilla Project</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="keywords" content="web browser, mozilla, firefox, camino, thunderbird, ...">

<link rel="stylesheet" type="text/css" href="css/print.css" media="print">


<link rel="stylesheet" type="text/css" href="css/base/content.css" media="all">
ACM Journal Name, Vol. V, No. N, Month 20YY.
Tracking Web Spam with HTML Style Similarities · 9

HSS NOISE (hss)


<! "-//// .//" "://..///.">
< ="">

<>
<>. - </>
< -="-" ="/; =-">
< ="" =" , , , , , , , ">

< ="" ="/" ="/." ="">

HSS NOISE VAR SPACE (hss-varsp)


<!"-////.//" "://..///.">
<="">

<>
<>.- </>
<-="-" ="/; =-">
<="" =", , , , , , , ">

<="" ="/" ="/." ="">

TAGS (tags)
<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd ><html lang >...
...<head><title></title><meta http equiv content ><meta name content><link rel type href media >...

TAGS AND NOISE (tags-noise)


<DOCTYPE HTML PUBLIC W3C DTD HTML 4 01 EN http www w3 org TR html4 strict dtd >
<html lang >

<head>
<title>. - </title>
<meta http equiv content >
<meta name content >

<link rel type href media >
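The character-level filters above can be approximated with simple regular expressions (our own sketch; the paper's actual tags and tags-noise parsers require an html tokenizer, which we omit here):

```python
import re

def hss(html):
    """html noise: drop every alphanumeric character, keep everything else."""
    return re.sub(r"[0-9A-Za-z]+", "", html)

def hss_varsp(html):
    """html noise var spaces: same, with runs of blanks squeezed to one."""
    return re.sub(r"[ \t]+", " ", hss(html))

def words(html):
    """Standard filter for comparison: keep only the text outside of tags."""
    no_tags = re.sub(r"<[^>]*>", " ", html)
    return " ".join(re.findall(r"[0-9A-Za-z]+", no_tags))

# On the second line of the full output above, this reproduces the hss output:
assert hss('<html lang="en">') == '< ="">'
```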

3.2 Fingerprinting
As said earlier, one-to-one similarity computation does not scale to large clustering tasks. Using lsh fingerprints on the preprocessing output addresses this scalability issue, but some technical details may greatly affect the quality of the similarity estimation.
3.2.1 Fingerprinting Implementation Details. The parts of text we consider are overlapping sequences 32 characters wide, which we hash into 64 bit integers. This large part size avoids false positives and gives, as in [Fetterly et al. 2005], a sentence-level description of documents. To hash the sub-sequences, we use a variant of the Jenkins hash function [Jenkins 1997], which achieves a better collision rate than Rabin fingerprints.
For minhashing we pre-compute m = 64 permutations σi : [2^64] → [2^64] and compare the permuted values:

    x ≺i y ⇔ σi(x) < σi(y).
To compute these permutations, we use a sub-family of permutations of the form σi = σi1 ◦ σi2, where σi1 is a bit shuffle and σi2 is an exclusive-or mask. To reduce the fingerprint size to 256 bits, we keep only four bits per key.
For the Charikar algorithm, instead of pre-computing the random normally distributed characteristic vectors r of the hyperplanes, we use a linear shuffle and a pre-computed inverse normal function: let P be a big prime and let inorm : [n] → Z be a discretization of the inverse normal law; then

    ri := inorm(iP mod n).

From 256 random primes we get a 256 bit fingerprint. This implementation differs slightly from the one used by [Henzinger 2006], where the ri values are restricted to {−1, +1}.

[Figure: correlogram matrix over the twelve similarity measures B-words, C-words, B-full, C-full, B-hss, C-hss, B-tags, C-tags, B-tags-noise, C-tags-noise, B-hss-varsp and C-hss-varsp.]

Fig. 4. Correlogram between the similarity measures on a sample of 10^5 pairs of html documents. Most of these sampled pairs were dissimilar. Darker colors imply higher correlation.
3.2.2 Combination of Preprocessing and Fingerprinting. As explained in sec-
tion 2.2, Broder minhashing gives an estimation of Jaccard index [Broder et al.
1997] and Charikar fingerprinting gives an estimation of Cosine similarity [Charikar
2002]. The former similarity measure is only based on intersection while the latter
also includes frequency information.
To know which ones are the most relevant for html “style” similarity, we combine the six preprocessors defined above with these two algorithms to get twelve kinds of lsh fingerprints. We write “B-words” for Broder fingerprints using the words preprocessor, “C-full” for Charikar fingerprints combined with the full preprocessor, and so on for the other preprocessors.
A rough estimation of the correlation between these twelve similarities is given in Figure 4. As expected, minhashing similarities and Charikar similarities are strongly dependent for a given preprocessing. As also expected, all variants of html preprocessing except words are dependent. A more interesting result is the correlation between full and noise similarities. This can be explained by the low visible text ratio in html (around 18% for the WEBSPAM-UK2006 dataset). We show in section 5 that noise similarities give a significantly better recall than full html for web spam detection.

[Figure: complete similarity graph over 3000 html files, with components labeled (1), (2) and (3).]

Fig. 5. The quasi-transitivity of the estimated similarity relation is well illustrated by this complete similarity graph (computed from 3000 html files). With a transitive relation, each connected component would be a clique. For “bunch of grapes” components like (1) and (2), there is a low probability for the connecting edges to be sampled. An edge density measure may be employed to detect “worm” components like (3).

3.3 Multi-Sort Sliding Window Clustering


We use our own clustering algorithm. It does not build the entire similarity graph; instead it uses permuted lexical sorts to find potentially similar pairs. As in [Broder et al. 1997], the similarity clusters are the connected components of the sampled graph.
By thresholding the full similarity matrix, we obtain a non-oriented similarity graph:

    G = {(d1, d2) ∈ D × D | sim(d1, d2) > threshold}.

This similarity graph is characterized by its quasi-transitivity property: if xGy and yGz, then there is a high probability that xGz. In other words, the connected components of the graph are almost cliques. This quasi-transitivity is helpful: by sampling only the most similar edges, we both accelerate the clustering process and reduce the probability of false positives in the lsh similarity estimation (cf. Figure 5).
To sample the edges, we sort the fingerprints lexically according to a given bit permutation and check locally for similar pairs in the sorted array (see Algorithm 1). A similar permutation technique is used by [Bawa et al. 2005] on B-trees to search for nearest neighbors. As described in Figure 6, we use two parallel sorts (prefix and suffix). With these two sorts we get a good compromise to cluster 3M fingerprints with a relatively narrow (winsize = 4) sliding window.

[Figure: clustering pipeline from lsh fingerprints through a prefix sort and a suffix sort, each followed by sliding window edge detection; the resulting similarity edges feed a connected components computation that outputs url clusters.]

Fig. 6. Our clustering process.
Depending on the architecture, the sliding window edge detection processes may also be serialized. In the serialized version, most redundant similar fingerprints may be removed from the stream to reduce the cost of the second sort.

Algorithm 1 Multi Sort Sliding Window Edge Detection

Require: winsize > 0, threshold ≤ m, T[nb_doc] is the fingerprint array.
initialize similarity graph;
for all chosen bit permutations σ do
  sort T lexically according to σ
  for i := winsize − 1 to nb_doc − 1 do
    for j := 1 to winsize − 1 do
      if Sim(T[i − j].fp, T[i].fp) > threshold then
        add edge (T[i − j].id, T[i].id) to similarity graph;
      end if
    end for
  end for
end for
compute connected components of the graph;
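Algorithm 1 can be sketched in Python as follows (our own simplification: the bit permutations are replaced by rotations of the key vector, fingerprints are plain tuples of lsh-keys, and connected components are computed with union-find):

```python
def sim_h(fp1, fp2):
    """Number of matching lsh-keys between two fingerprints."""
    return sum(a == b for a, b in zip(fp1, fp2))

def sliding_window_edges(docs, winsize=4, threshold=32, n_perms=2):
    """Sort (id, fingerprint) pairs under several key rotations;
    compare each fingerprint only with its neighbors in the window."""
    edges = set()
    m = len(docs[0][1])
    for p in range(n_perms):
        cut = p * m // n_perms
        order = sorted(docs, key=lambda d: d[1][cut:] + d[1][:cut])
        for i in range(1, len(order)):
            for j in range(1, min(winsize, i + 1)):
                if sim_h(order[i - j][1], order[i][1]) > threshold:
                    edges.add(tuple(sorted((order[i - j][0], order[i][0]))))
    return edges

def connected_components(ids, edges):
    """Union-find: map each document id to its cluster representative."""
    parent = {d: d for d in ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return {d: find(d) for d in ids}

docs = [("a", tuple([1] * 64)), ("b", tuple([1] * 64)), ("c", tuple(range(64)))]
comps = connected_components("abc", sliding_window_edges(docs))
```

In this toy run, the two identical fingerprints "a" and "b" end up adjacent in every sort and exceed the threshold, so they fall in the same component, while "c" stays alone.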

3.4 Tracking Web Spam With Similarity Clusters


The different clusterings obtained are interesting both to detect new spam and to enhance the quality of a spam diagnostic obtained from other sources (through editorial review or automatic classification methods). We call the former technique features extraction and the latter tags spreading. The experiments described in section 4 are mostly oriented toward features extraction, while the WEBSPAM-UK2006-based experiments of section 5 also include tags spreading.
3.4.1 Features Extraction. From a given clustering we extract several simple
features. These features are only derived from clusters and straightforward url
properties:
—number of urls, hosts and domains by cluster;
—number of clusters by host or by domain.
If we also consider the underlying graph or fingerprints more features can be
extracted:
—edge density by cluster;
—inter-host edge density by cluster;
—mean similarity and standard deviation.
For example, in section 4, we combine mean similarity and domain counts in B-
hss clusters to detect potential link-farms. All these features are simple and relevant
for web spam detection but the most efficient way to use similarity clusterings is to
smooth external features.
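The first group of features above can be sketched as follows (our own illustration; the mapping `cluster_of` and the naive two-label registered-domain heuristic are assumptions, not the paper's method):

```python
from collections import defaultdict
from urllib.parse import urlsplit

def cluster_features(cluster_of):
    """Per-cluster counts of urls, hosts and domains.
    `cluster_of` maps each url to its cluster representative."""
    urls = defaultdict(int)
    hosts = defaultdict(set)
    domains = defaultdict(set)
    for url, c in cluster_of.items():
        host = urlsplit(url).hostname or ""
        urls[c] += 1
        hosts[c].add(host)
        # naive registered-domain heuristic: the last two labels of the host name
        domains[c].add(".".join(host.split(".")[-2:]))
    return {c: (urls[c], len(hosts[c]), len(domains[c])) for c in urls}

clusters = {"http://a.example.com/1": 0,
            "http://b.example.com/2": 0,
            "http://spam.net/x": 1}
feats = cluster_features(clusters)  # cluster 0: 2 urls, 2 hosts, 1 domain
```

Inverting the same mapping gives the dual features (number of clusters by host or by domain).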
3.4.2 Tags Spreading. As described in Figure 1, we consolidate a set of tags by spreading them into consistent clusters. To evaluate the consistency of clusters with regard to the tags, we measure the prevalence of a label (spam or normal) in a given cluster. This measure, called purity, is defined by:

    purity = ((spam − normal) / (spam + normal))^2.
We observe in Figure 7 that most host clusters contain only one kind of label. In such a case, a high purity value means that the major label is really prevalent over the other: the label can be propagated to the whole cluster. Conversely, a low purity value indicates a balanced amount of each label: we cannot rely on such a cluster to propagate one label or the other. The big central cluster of Figure 8 is a perfect example of such indecision.
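The purity measure and the spreading step it gates can be sketched as follows (our own illustration; the 0.8 purity threshold is an arbitrary choice, not a value from the paper):

```python
def purity(spam, normal):
    """((spam - normal) / (spam + normal))^2 : 1.0 when one label dominates."""
    if spam + normal == 0:
        return 0.0
    return ((spam - normal) / (spam + normal)) ** 2

def spread_tags(clusters, min_purity=0.8):
    """Propagate the majority label to the untagged members of pure clusters.
    `clusters` maps a cluster id to a dict {url: "spam" | "normal" | None}."""
    out = {}
    for members in clusters.values():
        spam = sum(1 for t in members.values() if t == "spam")
        normal = sum(1 for t in members.values() if t == "normal")
        major = "spam" if spam > normal else "normal"
        pure = spam + normal > 0 and purity(spam, normal) >= min_purity
        for url, tag in members.items():
            out[url] = tag if tag is not None else (major if pure else None)
    return out

tags = spread_tags({0: {"u1": "spam", "u2": "spam", "u3": None}})
# the cluster has purity 1.0, so u3 inherits the "spam" label
```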
A straightforward approach is to propagate a reliable human evaluation. As we will see in the WEBSPAM-UK2006-based experiments, this approach is efficient for big clusters (most of the time these are link farms), but since the majority of urls are in small or medium clusters, the recall remains too low.
A more efficient way to use the small and medium clusters is to train a classifier on features that are available for all hosts or web pages. There are many content-based and link-based features relevant to diagnosing whether an url or a host is spam; many such features were proposed for the Web Spam Challenge 2007 and are well described in [Castillo et al. 2007]. Even if the classifier precision is low, the resulting diagnostic is enhanced by the cluster purity. A similar smoothing method is employed for link-based clusters in [Castillo et al. 2007].
In our first experiment, described in the next section, we mostly considered cluster-based features extraction. We experiment with tags spreading in section 5.

Fig. 7. spam/normal labels in an html style similarity graph. This graph was built with the B-hss-varsp similarity on a sample of 10^5 html files from WEBSPAM-UK2006. Each node represents a host and each edge represents a high similarity between pages in different hosts. Black nodes indicate spam labels and white nodes indicate normal labels; the others are unknown. The purity of the clusters is better than in Figure 8. The highly connected clusters are symptomatic of link farms or mirrors.

4. AIRWEB-2006 EXPERIMENTS
In this section we discuss the results of an experiment done in 2006 on our own dataset. These results were presented at the AirWeb 2006 Workshop [Urvoy et al. 2006]. This experiment is based on B-hss fingerprints.

4.1 Dataset
We used a corpus of five million html pages crawled from the web. This corpus was built by combining three crawl strategies:

—a deep internal crawl of 3 million documents from 1300 hosts of the dmoz directory;
—a flat crawl of 1 million documents from the Orange French search engine blacklist (with much adult content);
—a deep breadth-first crawl from 10 non-adult spam urls (chosen in the blacklist) and 10 trustful urls (mostly from French universities).

Fig. 8. spam/normal labels in a words similarity graph. This graph was built with the B-words similarity on a sample of 10^5 html files from WEBSPAM-UK2006. Each node represents a host and each edge represents a high similarity between pages in different hosts. Black nodes indicate spam labels and white nodes indicate normal labels; the others are unknown. The small pure clusters indicate “spam-full” and “spam-free” topics. Most hosts are in the biggest cluster: spam-independent topics.

At the end of the process, we roughly estimated that 2/5 of the collected documents were spam.
After fingerprinting, the initial volume of 130 GB of data to analyze was reduced to 390 MB, so the next steps could easily be performed in memory.

4.2 One-to-All Similarity


The comparison between one html page and all other pages of our test corpus is a way to estimate the quality of the B-hss similarity. If the reference page comes from a known web site, we can judge the quality of the B-hss similarity as a web site detector. In the example considered here, franao.com (a web directory with many links and many internal pages), a threshold of 20/128 gives a franao web site detector which is not based on urls. On our corpus, this detector is 100% correct according to url prefixes.
By sorting all html documents by decreasing similarity to one randomly chosen
page of franao web site, we get a decreasing curve with different gaps (Figure 9).
These gaps are interesting to consider in detail:
Fig. 9. By sorting all html documents by decreasing similarity with one reference page (here from
franao.com) we get a curve with different gaps. The third gap (around 180,000) marks the end
of the franao web site.

Fig. 10. franao.com template 1 (external links)

—the first 20,000 html pages are long lists of external links, all built from template
1 (Figure 10);
—around 20,000 there is a smooth gap (1) between long-list and short-list pages,
but up to 95,000 the template is the same;
—around 95,000 there is a strong gap (2) which marks a new template (Figure 11):
up to 180,000, all pages are internal franao links built according to template 2;
—around 180,000 there is a strong gap (3) between franao pages and pages from
other web sites.

4.3 Global Clustering


To cluster the whole corpus, we raised the threshold to ensure a low rate of false
positives. We chose the following parameters:

Fig. 11. franao.com template 2 (internal links)

Table I. Clusters with highest mean similarity and domain count

Urls     Dom.  Mean sim.  Prototypical member (centroïd)               Class
268      231   1          www.9eleven.com/index.html                   Copy/Paste
93148    313   0.58       www.les7laux.com/hiver/forum/phpBB2/...      Template (Forums)
3495     255   0.33       www.orpha.net/static/index.html              Template (Apache)
966      174   0.40       www.asliguruney.com/result.php?Keywo...      Link farm
122      91    0.74       anus.fistingfisting.com/index.htm            Copy/Paste
1148     173   0.38       www.basketmag.com/result.php?Keywords...     Link farm
19834    164   0.40       www.series-tele.fr/index.html?mo=s t...      Template
122      55    0.91       www.ie.gnu.org/philosophy/index.html         Mirror
139      101   0.44       www.reha-care.net/home buying.htm?r=p        Link farm
218      195   0.21       chat.porno-star.it/index.html                Copy/Paste
177      60    0.67       www.ie.gnu.org/home.html                     Mirror
2288     44    0.90       www.cash4you.com/insuranceproviders/...      Link farm
626900   70    0.52       animalworld.petparty.com/automotive...       Link farm
168      96    0.32       www.google.ca/intl/en/index.html             Mirror
214      61    0.50       shortcuts.00go.com/shorcuts.html             Link farm
42314    112   0.26       forums.cosplay.com/index.html                Template
121      63    0.41       collection.galerie-yemaya.com/index.html     Copy/Paste
555      34    0.68       allmacintosh.digsys.bg/audiomac ra...        Template
114      77    0.29       www.gfx-revolution.com/search/webarc...      Link farm
286      60    0.35       gnu.typhon.net/home.sv.html                  Mirror

—size of text parts: 32;
—fingerprint size (in bytes): m = 128;
—similarity threshold: t = 35/128.
With a similarity score of at least 35/128, the number of misclassified urls seems
negligible, but some clusters are split into smaller ones.
We obtained 43,000 clusters with at least 2 elements. Table I shows the first
20 clusters sorted by mean similarity × domain count (for the sake of readability,
some mirror clusters have been removed from the list).
In order to evaluate the quality of the clustering, the first 50 clusters, as well as
50 other randomly chosen ones, were manually checked, showing no misclassified
urls.
Most of the resulting clusters belong to one of these classes:
(1) Template clusters group html pages from dynamic web sites using the same
skeleton. Cluster #2 of Table I is a perfect example: it groups all forum
pages generated using the PhpBB open source project. Cluster #3 is also
interesting: it is populated by Apache default directory listings;
(2) Link farm clusters are a special case of Template clusters. They contain numerous
computer-generated pages, based on the same template and containing many
hyper-links to each other;
(3) Mirror clusters contain sets of near-duplicate pages hosted on different
servers. Generally only minor changes are applied to the copies, such as adding a
link back to the server hosting the mirror;
(4) Copy/Paste clusters contain pages that are not part of mirrors, but do share
the same content: either a text (e.g. license, porn site legal warning. . . ), a
frameset scheme, or the same javascript code (often with little actual content).
The first two cluster classes are the most interesting benefit of B-hss clustering:
they allow an easy classification of web pages by manually categorizing only a few
of them.

4.4 Conclusion of this experiment


This experiment confirms the usefulness of stylometry for detecting similarity be-
tween hosts, and especially for spam detection. The main problem we had here was
the lack of validation for the recall of the method. The WEBSPAM-UK2006 tagged
dataset allows us to go further in this direction and to validate our results.

5. UK-2006 EXPERIMENTS
5.1 Dataset
The WEBSPAM-UK2006 reference dataset is a collection based on a crawl of the .uk
domain done by the University of Roma "La Sapienza", with a large number of hosts
labeled by a team of volunteers. The whole labeling process is described by [Castillo
et al. 2006] in their article "A Reference Collection for Web Spam".
The full dataset contains 77 million web pages spread over 11,400 hosts. There
are about 5,400 labeled hosts. Because the assessors spent on average five minutes
per host, it makes sense to summarize the dataset by taking only the first 400
reachable pages of each host. We worked on the summarized dataset, which contains
3.3 million web pages stored in 8 volumes of 1.7 GB each.

5.2 Evaluation of Similarity Clustering


To evaluate the ability of our smoothing framework to enhance the efficiency of a
classifier, we want to measure both its ability to:
(1) spread spam information (recall improvement);
(2) consolidate spam information (precision improvement).
The ability to spread information is related to the size distribution of the clusters,
and the ability to consolidate is related to the consistency of the clusters with regard
to labels. To avoid the side effects of choosing a specific classifier, our experiments
are only based on WEBSPAM-UK2006 human labels.
We first compute some statistics on our twelve clusterings (size distribution
and purity). We then study, by cross-validation, the propagation of spam
information through the clusters. We finally evaluate the precision improvement
ability of our method, and conclude our experimentation with a side remark
about mirrors.

Fig. 12. Distribution of cluster size (number of hosts) and number of clusters for each pre-
processing.

5.2.1 Clustering evaluation. To be efficient for information spreading, a cluster-
ing should contain big and consistent clusters. Big clusters diffuse spam informa-
tion to more urls and increase the spam recall.
The distribution of cluster sizes, as well as the total number of clusters built by
the different clusterings, is summarized in Figure 12. The first observation is that
standard filters like full or words tend to build smaller clusters on average, and
consequently to reduce the spam recall.
We also note that, for a given preprocessing step, the Charikar fingerprint pro-
duces slightly more clusters than the Broder one.
While Figure 12 gives a good representation of the cluster size variation across the
various clusterings, it shows nothing about their consistency with regard to spam
labels. Table II gives purity measures for different classes of cluster size (host count).
It shows that all clusterings build very consistent clusters, with a small loss for
the words preprocessor.
These results on cluster purity suggest that similarity clustering methods have
good precision when tagging new spam.
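The purity of a cluster, as reported in Table II, can be computed as the fraction of its labeled hosts carrying the dominant label. A minimal sketch (the exact definition used by the authors may differ slightly):

```python
from collections import Counter

def cluster_purity(labels):
    """Purity of a cluster: fraction of the dominant label among its labeled hosts.

    labels: list of "spam"/"normal" tags for the labeled hosts of one cluster.
    """
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())
```

For example, a cluster with 9 spam hosts and 1 normal host has purity 0.9, the threshold later used for tags spreading.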

5.2.2 Evaluation of Tags Spreading. In order to evaluate the diffusion of labels
in the various clusterings, we restrict the WEBSPAM-UK2006 corpus to tagged hosts.
Each experiment is made by giving the system a fraction of these labels and using
the remaining ones for evaluation, each time using cross-validation.
When trying to detect spam, minimizing false positives (normal identified as
spam) is important: in this context, they are a lot more dangerous than false
negatives (spam identified as normal). A high level of precision is mandatory.

Table II. Purity of the clusters depending on the algorithm, preprocessing and cluster size.

Prep.       Size   B mean purity  B stddev   C mean purity  C stddev
words       2-3    0.9939         0.077001   0.9764         0.14974
words       4-7    0.98411        0.11016    0.97556        0.13342
words       8-15   0.96433        0.15218    0.98735        0.084978
words       16+    0.98403        0.10645    0.98996        0.1316
full        2-3    0.9956         0.065797   0.99459        0.073127
full        4-7    0.99418        0.071929   0.99455        0.069851
full        8-15   0.99           0.084263   0.98612        0.10964
full        16+    0.9945         0.06451    0.99597        0.056474
hss         2-3    0.99819        0.041969   0.99691        0.055119
hss         4-7    0.99748        0.046309   0.99669        0.053076
hss         8-15   0.99567        0.059022   0.99556        0.058094
hss         16+    0.99591        0.057085   0.99312        0.071062
hss-varsp   2-3    0.99762        0.048259   0.99797        0.044552
hss-varsp   4-7    0.9979         0.041595   0.9973         0.047808
hss-varsp   8-15   0.98827        0.093353   0.99586        0.056173
hss-varsp   16+    0.99501        0.062869   0.99318        0.070702
tags        2-3    0.99807        0.043069   0.99838        0.03961
tags        4-7    0.99817        0.039301   0.99797        0.042088
tags        8-15   0.99792        0.039127   0.99358        0.065956
tags        16+    0.99314        0.097974   0.98936        0.087702
tags-noise  2-3    0.9988         0.033991   0.99749        0.049578
tags-noise  4-7    0.998          0.042234   0.99695        0.051297
tags-noise  8-15   0.99797        0.038512   0.99653        0.051339
tags-noise  16+    0.9954         0.061504   0.99335        0.070447
To estimate the precision gain, we use each clustering to propagate spam infor-
mation. As explained in section 3.4.2, if a cluster has more spam urls than normal
ones and has a purity greater than or equal to 0.9, we label as spam each host
corresponding to the urls not already tagged. Once this step is done for each cluster,
we label the remaining untagged hosts as normal.
This evaluation is a bit pessimistic, since a real classifier would give better results
on the hosts that are not covered by clusters.
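The spreading rule described above can be sketched as follows (a minimal illustration: the host/url bookkeeping is simplified and the names are hypothetical):

```python
from collections import Counter

def spread_tags(clusters, labels, purity_threshold=0.9):
    """Propagate spam labels through similarity clusters.

    clusters: list of clusters, each a list of host names.
    labels:   dict host -> "spam" | "normal" (known labels only).
    Returns a label dict covering every host appearing in the clusters.
    """
    out = dict(labels)
    for hosts in clusters:
        tagged = [labels[h] for h in hosts if h in labels]
        if not tagged:
            continue
        counts = Counter(tagged)
        purity = max(counts.values()) / len(tagged)
        # Majority-spam clusters with high purity contaminate their untagged hosts.
        if counts["spam"] > counts["normal"] and purity >= purity_threshold:
            for h in hosts:
                out.setdefault(h, "spam")
    # Hosts never reached by a spam cluster default to "normal".
    for hosts in clusters:
        for h in hosts:
            out.setdefault(h, "normal")
    return out
```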
Table II shows that, for the different clusterings, the mean purity of the clusters
is really high. This property suggests a very good precision. This hypothesis is
confirmed by Figure 13, which shows the precision according to the percentage of
tagged labels.
Precision reaches very high levels even if the portion of known labels is very low.
However, there is a major exception concerning the words preprocessing. This pre-
processing strips all of the text structure to leave only the content and is, for example,
very bad at distinguishing a trusted site from a spam site which plagiarizes it.
The highest precision levels are reached by preprocessors dealing with html noise.
We also note that, even if their level stands high, Charikar-based variants are
slightly below the corresponding Broder-based ones.

Fig. 13. Precision of clustering methods according to pre-labeled hosts rate.

Fig. 14. Recall of clustering methods according to pre-labeled hosts rate.

Figure 14 shows the recall according to the known tags rate. Results are relatively
low. This low score can be explained, on one hand, by the high similarity threshold
used, which maximizes precision at the detriment of recall, and on the other hand, by
the clustering itself, which makes some urls (and hosts) unreachable. In other words,
some clusters may contain neither spam nor normal labels; urls in such clusters
are impossible to label by tags spreading.

Fig. 15. F1 measure of clustering methods in function of pre-labeled hosts rate.

Methods based on words, i.e. words and full, have a recall that is significantly
lower than noise-based methods, whatever the rate of known labels in input.
This observation also holds for the F1 measure:

F1 = (2 · precision · recall) / (precision + recall)

The F1 measure is a good way to appreciate the global quality of the various
methods. Figure 15 shows the F1 measure according to the percentage of tagged hosts.
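The formula translates directly into code (standard definition, no assumptions beyond the formula itself):

```python
def f1_score(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# A high-precision, low-recall setting like the clusterings above,
# e.g. precision 0.9 and recall 0.3, yields an F1 around 0.45.
```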

For html noise-based methods, we note that Charikar-based methods are gen-
erally slightly better at the highest percentages of known input, while the
corresponding Broder-based methods are better at lower ones.
We also note, according to Figure 15, a global performance increase with the rate
of known labels.
Using an external, even imperfect, spam classifier can thus be useful to enlarge
the base of spam/normal information, and so exploit the clustering amplification
explained in section 3.4.
According to these results, the hss-varsp preprocessor combined with the Broder
fingerprinting technique gives the best overall results. Nevertheless, depending on
the rate of tagged hosts, the tag-noise preprocessor gives better results at lower
rates when combined with Broder fingerprints, and at higher rates with Charikar
ones.
5.2.3 Precision Improvement Ability. Collections of labels are subject to errors
(human misjudgment, classifier imprecision. . . ). To use these labels efficiently, tags
spreading should be fault-tolerant.
The high purity threshold required to label a whole cluster offers a rather good
tolerance to errors. In order to estimate the tolerance level, we introduced a known
proportion of errors. Figure 16 shows the evolution of precision according to the
rate of injected errors for the tag-noise preprocessor.

Fig. 16. Evolution of precision when injecting errors for tag-noise preprocessed clusterings.
If the rate of injected errors is less than five percent, the final precision is only
slightly lowered. This "precision boosting" property is really interesting for combining
different spam-detection methods.
5.2.4 A side note about mirrors. The full preprocessing also makes it possible
to detect mirrors by looking at web sites with very high similarities.
This is done in two steps. First, we cluster pages using B-full similarity with a
high similarity threshold of 60 out of 64. Then, we cluster the hosts using Jaccard
similarity on the sets of page clusters they span.
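The second step can be sketched as follows, with each host represented by the set of B-full page-cluster ids it spans; the 0.5 pairing threshold is an illustrative choice, not taken from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of cluster ids."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mirror_candidates(host_clusters, threshold=0.5):
    """Pairs of hosts whose page-cluster sets overlap strongly: mirror candidates.

    host_clusters: dict host -> set of page-cluster ids the host spans.
    """
    hosts = sorted(host_clusters)
    return [(h1, h2)
            for i, h1 in enumerate(hosts)
            for h2 in hosts[i + 1:]
            if jaccard(host_clusters[h1], host_clusters[h2]) >= threshold]
```

Two copies of the same site share almost all of their page clusters, so their Jaccard similarity is close to 1, while unrelated hosts share almost none.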
Applying this procedure to the WEBSPAM-UK2006 corpus, we detect that around
7% of the hosts are mirror web sites, which fall into three classes:

—a host name and its variant with the www prefix, like www.cwrightandco.co.uk
and cwrightandco.co.uk;
—different variants of the same name, like library.cardiff.ac.uk and library.cf.ac.uk;
—fully different host names, like www.yorkshire-evening-post.co.uk and thisisleeds.co.uk.
While detection of the first type of mirrors is trivial and reliable, handling the
two others requires near-duplicate detection techniques.

5.3 Web Spam 2007 Challenge


The WEBSPAM-UK2006 reference collection is also the reference data for the Web
Spam Challenge 2007: http://webspam.lip6.fr. The goal of this challenge is to
evaluate different spam detection algorithms based on partial labeling. For this
challenge, we used the framework described in section 3 combined with a selective
Bayesian classifier called MODL [Boullé 2006].

6. CONCLUSION
We proposed and studied several similarity measures to compare web pages accord-
ing to their html "style". We implemented a computationally efficient algorithm to
cluster html documents based on these similarities. We proposed a spam detection
framework that we evaluated through a detailed experiment on the WEBSPAM-UK2006
tagged dataset.
This experiment showed the efficiency of our framework at enhancing the quality
of spam classifiers by spreading and consolidating their predictions according to
cluster consistency. It also showed that noise-based similarities are significantly
more efficient than text-content and full-content based similarities for diffusing spam
information. Among the noise-based similarities, B-hss-varsp and tag-noise gave
slightly better results.
The html style similarities find several uses in a search engine back-office pro-
cess: a direct application is to enhance the efficiency of search engine blacklist
management by "spreading" detected spam information; a second one is to help
detect web site boundaries by detecting html templates; a third one is to point
out large clusters of similar pages spanning several domains, which is often a good
hint of either site mirroring or automatic spam page generation.
For practical use in a search engine, the most interesting way to fingerprint html
documents is probably to combine a word-level content-based Charikar fingerprint
for topic detection and a sentence-level noise-based Broder fingerprint for template
detection.
REFERENCES
Bawa, M., Condie, T., and Ganesan, P. 2005. Lsh forest: self-tuning indexes for similarity
search. In WWW. 651–660.
Benczur, A. A., Csalogany, K., and Sarlos, T. 2006. Link-based similarity search to fight web
spam. In International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
Boullé, M. 2006. MODL: A Bayes optimal discretization method for continuous attributes. Ma-
chine Learning 65, 1, 131–165.
Broder, A. 1997. On the resemblance and containment of documents. In SEQUENCES ’97:
Proceedings of the Compression and Complexity of Sequences 1997. IEEE Computer Society,
Washington, DC, USA, 21.
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering
of the web. In Selected papers from the sixth international conference on World Wide Web.
Elsevier Science Publishers Ltd., Essex, UK, 1157–1166.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., and Vigna, S. 2006. A
reference collection for web spam. SIGIR Forum 40, 2 (December).
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007.
Know your neighbors: Web spam detection using the web topology. Unpublished,
http://research.yahoo.com/publications/3/Search.
Charikar, M. S. 2002. Similarity estimation techniques from rounding algorithms. In STOC
'02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM
Press, New York, NY, USA, 380–388.
Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: using sta-
tistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International
Workshop on the Web and Databases. ACM Press, New York, NY, USA, 1–6.

Fetterly, D., Manasse, M., and Najork, M. 2005. Detecting phrase-level duplication on the
world wide web. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR
conference on Research and development in information retrieval. ACM Press, New York, NY,
USA, 170–177.
Gray, A., Sallis, P., and MacDonell, S. 1997. Software forensics: Extending authorship analy-
sis techniques to computer programs. In 3rd Biannual Conference of International Association
of Forensic Linguists (IAFL ’97). 1–8.
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In First International Work-
shop on Adversarial Information Retrieval on the Web (AIRWeb).
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. 2004. Combating Web spam with
TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases
(VLDB). Morgan Kaufmann, Toronto, Canada, 576–587.
Heintze, N. 1996. Scalable document fingerprinting. In 1996 USENIX Workshop on Electronic
Commerce.
Henzinger, M. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In
SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research
and development in information retrieval. ACM Press, New York, NY, USA, 284–291.
Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of
dimensionality. In STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory
of computing. ACM Press, New York, NY, USA, 604–613.
Jenkins, B. 1997. burtleburtle.net/bob/hash/doobs.html. web site.
Lavergne, T. 2006. Unnatural language detection. In Proceedings of RJCRI’06 : Young Scien-
tists’ conference on Information Retrieval.
McEnery, T. and Oakes, M. 2000. Authorship identification and computational stylometry. In
Handbook of Natural Language Processing. Marcel Dekker Inc.
Meyer Zu Eissen, S. and Stein, B. 2004. Genre classification of web pages. In Proceedings
of KI-04, 27th German Conference on Artificial Intelligence, S. Biundo, T. Frühwirth, and
G. Palm, Eds. Ulm, DE. Published in LNCS 3238.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. 2006. Detecting spam web pages
through content analysis. In International World Wide Web Conference (WWW).
Urvoy, T., Lavergne, T., and Filoche, P. 2006. Tracking web spam with hidden style similarity.
In International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
Van Rijsbergen, C. J. 1979. Information Retrieval, 2nd edition. Dept. of Computer Science,
University of Glasgow, Glasgow, Scotland, UK.
Westbrook, A. and Greene, R. 2002. Using semantic analysis to classify search engine spam.
Tech. rep., Stanford University.
Zobel, J. and Moffat, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1, 18–34.

Contents

1 Introduction 1
1.1 Spamdexing and Generated Content . . . . . . . . . . . . . . . . . . 1
1.2 Detecting Spam Web Pages . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Detecting Spam By Similarity . . . . . . . . . . . . . . . . . . 2
1.2.2 Stylometry and html . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of This Paper . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Recalls on Similarity and Clustering 3


2.1 Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 lsh Fingerprints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Broder MinHashing . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Charikar Fingerprints . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Clustering with lsh Fingerprints . . . . . . . . . . . . . . . . . . . . 6

3 The HSS Framework 6


3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.1 Splitting Content and Noise . . . . . . . . . . . . . . . . . . . 7
3.1.2 A Collection of Parsers . . . . . . . . . . . . . . . . . . . . . 7
3.2 Fingerprinting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.1 Fingerprinting Implementation Details . . . . . . . . . . . . . 9
3.2.2 Combination of Preprocessing and Fingerprinting . . . . . . . 10
3.3 Multi-Sort Sliding Window Clustering . . . . . . . . . . . . . . . . . 11
3.4 Tracking Web Spam With Similarity Clusters . . . . . . . . . . . . . 12
3.4.1 Features Extraction . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.2 Tags Spreading . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 AirWeb-2006 Experiments 14
4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 One-to-All Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Global Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Conclusion of this experiment . . . . . . . . . . . . . . . . . . . . . . 18

5 UK-2006 Experiments 18
5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Evaluation of Similarity Clustering . . . . . . . . . . . . . . . . . . . 18
5.2.1 Clustering evaluation . . . . . . . . . . . . . . . . . . . . . . . 19
5.2.2 Evaluation of Tags Spreading . . . . . . . . . . . . . . . . . . 19
5.2.3 Precision Improvement Ability . . . . . . . . . . . . . . . . . 22
5.2.4 A side note about mirrors . . . . . . . . . . . . . . . . . . . . 23
5.3 Web Spam 2007 Challenge . . . . . . . . . . . . . . . . . . . . . . . . 23

6 Conclusion 24
