processing and query-time costs low. One option is to cluster the Web page repository into a small number of clusters in the hope of achieving a representative basis. Taher Haveliwala chose instead to use the freely available, hand-constructed Open Directory as a source of topics. To keep the basis set small, he used only the 16 top-level categories. The result gives 17 different PageRank vectors for each page: the full PageRank, and 16 biased PageRanks corresponding to the top-level categories. These values are stored in the index. The second step is to determine the topics most closely associated with the query, and to use the appropriate topic-sensitive PageRank vectors for ranking the documents satisfying the query. This can be obtained by a syntactic and semantic analysis of the query terms, which is easy to compute.

Where In(i) represents the set of inlinks of page i, Out(j) the set of outlinks of page j, and Hi and Ai represent the hub and authority weights of page i. An extension of the HITS algorithm has also been proposed by a group at Microsoft [6]. Ji-Rong Wen et al. propose a novel block-based HITS algorithm to solve the noisy-links and topic-drifting problems of the classical HITS algorithm. The basic idea is to segment each Web page into multiple semantic blocks using a vision-based page segmentation algorithm. The main step of the HITS algorithm can then be performed at the block level instead of the page level. For each page, the block with the highest block rank (BRMax) is selected and combined with the hub weight of the page as follows:

Rank(d) = \alpha \cdot rank_{PR}(d) + \beta \cdot rank_{PR\_BRMax}(d)
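As a minimal sketch (not the authors' or Microsoft's implementation), the classical hub/authority propagation that the In(i)/Out(j) notation above describes can be written as follows. The graph representation (a dict mapping each page to the set of pages it links to) is an assumption made for illustration:

```python
# Sketch of the classical HITS update: A_i sums the hub weights of In(i),
# H_i sums the authority weights of Out(i), with L2 normalization each round.
def hits(graph, iterations=50):
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    inlinks = {p: set() for p in pages}          # In(i): pages linking to i
    for p, targets in graph.items():
        for q in targets:
            inlinks[q].add(p)
    for _ in range(iterations):
        # A_i = sum of hub weights of the pages in In(i)
        auth = {p: sum(hub[q] for q in inlinks[p]) for p in pages}
        # H_i = sum of authority weights of the pages in Out(i)
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # normalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

On a toy graph where pages a and b both link to c, the iteration converges with c as the top authority and a, b as equal hubs, matching the intuition behind the hub/authority weights.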
3. Our work

The idea underlying our work is that the popularity of a page depends on the scores of the pages that point to it with respect to the query terms. We propose a function that combines a link rank and a document rank in relation to the query terms. We describe this function later; first, we start with the text processing.

3.1. Text Processing

Our system processes documents by first removing HTML tags and punctuation, and then excluding high-frequency terms using a stop-word list. After punctuation and stop-word removal, the system replaces each word by its representative class (root) using the Porter stemming algorithm [11]. To represent documents and queries, we use the vector model [9]. This choice is justified by its success in the Web community and the satisfactory results that it generates. In the following sections, we define the retrieval methodology that we follow and detail the scoring function that we use. We also present the experiments and the results of our evaluation on a TREC collection.

3.2. Retrieval methodology

Our experiments are conducted according to the following steps:

Step 1. Initial Retrieval: An initial list of ranked Web pages that contain at least one query term is obtained. The document rank obtained by using the TF-IDF scores in this step is called DR (Document Rank). It is based on the content of the page only.

Step 2. Page Partition: A partition method is applied to split the retrieved pages into sets. Each set is composed of the pages that contain exactly k terms of the query (k = 1...n), where n is the number of query terms.

Step 3. Final Retrieval: The final list of ranked Web pages is built from the pages that contain all the query terms. For each page of the final list, we calculate the link rank by applying the function proposed in the next section. After obtaining the link rank, a combined rank is used, in which the rank of each Web page Pi is determined by:

rank(P_i, Q) = \alpha \cdot rank_{DR}(P_i, Q) + (1 - \alpha) \cdot rank_{LR}(P_i, Q) \quad (1)

where α is a parameter that can be set between 0 and 1. It allows us to see the impact of our link rank function on the ranking of the query results. Rank_{LR}(P_i, Q) is a link rank based on the propagation of scores over links according to the query terms, and Rank_{DR}(P_i, Q) is the document rank of the page P_i, which is based on the content of the page only. We explain these functions in more detail in the following sections.

3.3. Scoring Function Details

The primary innovation in our work arises from the use of a ranking function that depends on both the content and the popularity of a page according to the query terms. This dependence allows a better fit between the results found by the traditional IR model and the user's needs. Our ranking function is based on two measures. The first is traditional and widely used in current systems: the cosine measure. It computes the cosine of the angle between the query and document vectors, and is defined as follows:

Rank_{DR}(P_i, Q) = \frac{P_i \cdot Q}{\|P_i\|_2 \, \|Q\|_2} = \frac{\sum_{t_i \in P_i \cap Q} w_{t_i, P_i} \cdot w_{t_i, Q}}{\sqrt{\sum_{t_i \in P_i} w_{t_i, P_i}^2} \cdot \sqrt{\sum_{t_i \in Q} w_{t_i, Q}^2}} \quad (2)

where the w_{ij} and w_{Qj}, j = 1 to t (the total number of terms in the entire collection), are the term weights assigned to the different terms of the page P_i and the query Q respectively. The best-known term-weighting schemes use weights given by

w_{ij} = f_{ij} \times \log \frac{N}{DF_j}

where f_{ij} is the normalized frequency of term k_j in page P_i and DF_j is the Document Frequency (DF), which measures the number of documents of the entire collection in which the term k_j appears. f_{ij} is given by:

f_{ij} = 0.5 + 0.5 \times \frac{tf_{ij}}{\max tf_{ij}}

where tf_{ij} is the Term Frequency (TF), which measures the number of times the term k_j appears in page P_i.

The second measure is a structural measure that takes link-structure information into account. In order to understand our function, we start with the following assumption: "a page is well known for a term t of a query if it has incoming links from pages which contain the term t". This measure can be computed as follows. Let Q be a query containing n terms, and let P_j be a page found by a traditional IR system. Let T(Q) and T(P_j) be the set of query terms and the set of terms of P_j respectively. We denote by In(P_j) the set of pages that point to the page P_j (incoming links) and by |In(P_j)| the number of incoming links of the page P_j. Our link rank is defined as follows:

Rank_{LR}(P_j, Q) = \sum_{k=1}^{n} C_n^k \cdot \frac{k}{n} \cdot \beta \cdot \frac{\sum_{P_i \in In(P_j) \,\wedge\, |T(P_i) \cap T(Q)| = k} Rank_{DR}(P_i, Q)}{|In(P_j)|} \quad (3)
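As a minimal sketch (not the authors' implementation), Eq. (3) combined with Eq. (1) can be computed as follows, taking C_n^k as the binomial coefficient and β = 1/2^(n−1) so that the coefficients sum to 1. The dictionaries `inlink_scores` and `inlink_terms` are assumed inputs: they map each page of In(P_j) to its Rank_DR score and to its term set respectively:

```python
from math import comb

def link_rank(inlink_scores, inlink_terms, query_terms):
    """Sketch of the query-dependent link rank of Eq. (3).

    inlink_scores: dict mapping each page of In(Pj) to its Rank_DR score.
    inlink_terms:  dict mapping each page of In(Pj) to its set of terms.
    query_terms:   set containing the n query terms.
    """
    n = len(query_terms)
    m = len(inlink_scores)               # |In(Pj)|
    if n == 0 or m == 0:
        return 0.0
    beta = 1.0 / 2 ** (n - 1)            # normalization: coefficients sum to 1
    total = 0.0
    for k in range(1, n + 1):
        # sum of Rank_DR over inlinks sharing exactly k query terms
        s = sum(score for page, score in inlink_scores.items()
                if len(inlink_terms[page] & query_terms) == k)
        total += comb(n, k) * (k / n) * beta * s / m
    return total

def combined_rank(rank_dr, rank_lr, alpha=0.15):
    # Eq. (1): combined document/link rank
    return alpha * rank_dr + (1 - alpha) * rank_lr
```

For a single-term query (n = 1), β = 1 and the link rank reduces to the mean Rank_DR of the inlinking pages that contain the term, which matches the well-known-page assumption stated above.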
where C_n^k is the number of sets that contain exactly k terms of the query. It is given by the formula:

C_n^k = \frac{n!}{k! \, (n-k)!}

β is a parameter between 0 and 1 that satisfies the following condition:

\sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!} \cdot \frac{k}{n} \cdot \beta = 1 \;\equiv\; \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!} \cdot \beta = 1

We have:

\sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!} = 2^{\,n-1} \;\Rightarrow\; \beta = \frac{1}{2^{\,n-1}}

Substituting β into (3) gives:

Rank_{LR}(P_j, Q) = \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)! \cdot 2^{\,n-1}} \cdot \frac{\sum_{P_i \in In(P_j) \,\wedge\, |T(P_i) \cap T(Q)| = k} Rank_{DR}(P_i, Q)}{|In(P_j)|} \quad (4)

4. Experiments

In this section, we start by describing the test collection that we used in our experiments; we then detail the tests carried out. We ran 50 queries and compared three ranking functions.

4.1 The WT10g collection

In our experiments, we chose the WT10g collection as our test collection. This collection contains 1,692,096 documents (Web pages), with 1,532,012 pages having incoming links and 1,295,841 pages having outgoing links. Due to our limited storage resources, we had to restrict our experiments to a relatively small subset of the collection.

We present an experimental evaluation of the algorithm that we propose, as well as of some existing algorithms, and we study the rankings they produce. In our experiments, the precision over the 11 standard recall levels (0%, 10%, ..., 100%) is the main evaluation metric, and we also evaluate the precision at 5 and 10 retrieved documents (P@5 and P@10).

Figure 1 shows the experimental results of information retrieval using the different ranking methods. The first one, which is based on the content of the page only and is represented by the blue line, is the baseline algorithm. The red line shows the results of the INDEGREE algorithm, a simple heuristic that can be viewed as the predecessor of all link-analysis ranking algorithms: it ranks the pages according to their popularity, where the popularity of a page is measured by the number of pages that link to it. The last one shows the results of combining link rank and document rank with α set to its optimal value for this method (α = 0.15).

Figure 1. Comparison of the three ranking functions in terms of precision over the 11 standard recall levels.

The dependency between the precision at the 0%, 10% and 20% recall levels and α is illustrated in Figure 2, in which all the curves converge to the baseline when α = 1.

Figure 2. Precision at the 0%, 10% and 20% recall levels as a function of the combining parameter α.

As can be seen from Figure 1, the INDEGREE algorithm gives the worst results. With this algorithm, a page has the same score (or popularity value) for all the queries submitted to the system, because it takes the popularity of a page into account independently of the query terms. Combining link rank and document rank is strongly better than the baseline, and it is the best among all the methods. The performance of our method increases significantly as α decreases; that is, the more importance we give to the link rank, the better the results. The best value of α is 0.15, which yields more relevant documents at the top of the ranked list.
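The evaluation measures used above can be sketched as follows (a simplified illustration, not the TREC evaluation tool itself); `ranked` is an assumed ranked list of document ids and `relevant` the set of relevant ids for a query:

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def eleven_point_precision(ranked, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0 .. 1.0."""
    precisions, recalls = [], []
    hits = 0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
        precisions.append(hits / i)              # precision after rank i
        recalls.append(hits / len(relevant))     # recall after rank i
    points = []
    for level in [i / 10 for i in range(11)]:
        # interpolated: max precision at any recall >= level (0 if unreachable)
        ps = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points
```

Averaging `eleven_point_precision` over the 50 queries gives the recall-precision curves plotted in Figures 1 and 2, and `precision_at_k` with k = 5 and k = 10 gives the P@5 and P@10 values reported in Table 1.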
Table 1. P@5 and P@10 comparison

Precision | INDEGREE | Baseline (DR) | Link Rank (LR) | 0.15*DR + 0.85*LR
P@5       | 0.077    | 0.216         | 0.265          | 0.306
P@10      | 0.072    | 0.163         | 0.192          | 0.208

6. References

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 1998, 107-117.

[2] J. Kleinberg, "Authoritative sources in a hyperlinked environment", In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, 604-632.