
Precision Evaluation of an Information Retrieval System using Link Score Propagation

Idir Chibane and Bich-Liên Doan
Supélec, Plateau de Moulon, 3 rue Joliot Curie, 91192 Gif/Yvette, France
Idir.Chibane@supelec.fr, Bich-Lien.Doan@supelec.fr

Abstract

Web search engines have become indispensable in our daily life, helping us find the information we need. Several search tools, for instance Google, use links to select the documents matching a query. In this paper, we propose a new ranking function that combines content rank and link rank, based on the propagation of scores over links. This function propagates scores from source pages to destination pages in relation with the keywords of a query. We assessed our ranking function with experiments on the TREC-9 collection. We conclude that propagating link scores according to query terms provides a significant improvement for information retrieval.

1. Introduction

The explosive growth of the Web has led to a surge of research activity in the area of information retrieval (IR) on the World Wide Web. Ranking has always been an important component of any information retrieval system (IRS); in the case of Web search, its importance becomes critical. Due to the size of the Web (Google counted more than 4.28 billion Web pages in January 2005 [11]), it is imperative to have a ranking function that captures the user's needs. To this end, the Web offers a rich context of information which is expressed through links. In this paper we investigate, theoretically and experimentally, the application of link analysis to ranking on the Web.

In recent years, several information retrieval methods using information about the link structure have been developed and shown to provide significant enhancement to the performance of Web search in practice. Most systems based on link structure information combine content with a popularity measure of the page to rank query results. Google's PageRank [1] and Kleinberg's HITS [2] are two fundamental algorithms employing the link structure among Web pages, and a number of extensions of these two algorithms have been proposed, such as [3][4][5][6][7][8]. All these link analysis algorithms are based on two assumptions: (1) if there is a link from page A to page B, then we may assume that page A endorses and recommends the content of page B; (2) pages that are co-cited by a certain page are likely to share the same topic and thus to help retrieval.

The rest of this paper is organized as follows. In Section 2, we review recent work on link analysis: we survey the literature on link analysis ranking algorithms and present some extensions of these algorithms. In Section 3, we present our information retrieval model with the new ranking function. In Section 4, we show the experimental results of the proposed algorithm on multiple queries, including a comparative study of different algorithms. In Section 5, we summarize our main contributions and discuss possible new applications of the proposed method.

2. Related Works

Various studies have suggested taking the links between documents into account in order to increase the quality of information retrieval. Google's PageRank [1] and Kleinberg's HITS [2] are the basic algorithms using link structure information. Generally, these systems work in two steps: in the first stage, a traditional search engine returns a list of pages in response to the user query; in the second stage, the system takes the links into account to rank the resulting documents. In this section we describe some of the previous link analysis ranking algorithms.

2.1 PageRank

PageRank was invented by Lawrence Page and Sergey Brin [1] and is used in the Google search engine. The principle of the algorithm is to evaluate the importance of each page according to the links pointing to it, following this assumption: "a page is important when it is much cited or cited by an important page".
PageRank precomputes a rank vector that provides a priori "importance" estimates for all the pages on the Web. This vector is computed once, offline, and is independent of the search query. At query time, these importance scores are used in conjunction with query-specific IR scores to rank the query results. PageRank simulates a user navigating randomly on the Web who jumps to a random page with probability (1 - d) or follows a random hyperlink on the current page with probability d. This process can be modelled with a Markov chain, from which the stationary probability of being on each page can be computed.

Let C(A) be the number of outgoing links of page A and suppose that A is pointed to by pages T1, T2, ..., Tn. The PageRank PR(A) of page A is given by the following formula:

    PR(A) = (1 - d) + d · Σ_{i=1}^{n} PR(Ti) / C(Ti)

where d is a damping factor which can be set between 0 and 1, and is usually set to 0.85. Notice that the rank (weight) of each pointing page is normalized by the number of links on that page.

Intuitively, this formula means that the PR of a page A depends both on the quality and on the number of the pages pointing to A. A page which has a link from a very relevant page (a page with a good reputation, for example the Yahoo! home page) is considered relevant, and pages with in-links from many less relevant pages are also considered relevant.
PageRank can be computed with an iterative algorithm. Initially, all pages are equiprobable: their PR value is equal to 1/N, where N is the total number of documents in the collection. A cycle of iterations then propagates the probabilities over the pages. The algorithm theoretically stops when a new iteration no longer modifies the PR values of the pages; in practice, convergence is obtained within tens of iterations. A system using PR has a basic characteristic which distinguishes it from other systems: the most complex and longest computations are made offline, so the system only has very simple computations to execute in order to build the query results.
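As an illustration, here is a minimal sketch of this offline iteration in Python; the adjacency representation, the convergence threshold and the iteration cap are our own assumptions, not details given above.

    def pagerank(out_links, d=0.85, tol=1e-8, max_iter=100):
        """Iterative PageRank: every page starts at 1/N, then scores
        are propagated along links until the values stop changing."""
        pages = list(out_links)
        pr = {p: 1.0 / len(pages) for p in pages}   # equiprobable start
        in_links = {p: [] for p in pages}           # invert the graph once
        for p, outs in out_links.items():
            for q in outs:
                in_links[q].append(p)
        for _ in range(max_iter):
            # PR(A) = (1 - d) + d * sum over in-links of PR(Ti) / C(Ti)
            new_pr = {p: (1 - d) + d * sum(pr[t] / len(out_links[t])
                                           for t in in_links[p])
                      for p in pages}
            if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
                return new_pr
            pr = new_pr
        return pr

    # toy graph: A -> B, A -> C, B -> C, C -> A
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))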
The PR computations are long and require crawling the entire Web. Moreover, the results obtained by Google show that the algorithm computing the PR value of a page is not completely relevant: the query results sometimes have no relationship with the search carried out, because the search engine takes into account neither semantics, nor context, nor the user profile. Hence the idea of computing a personalized PR. In recent years, research has led to three radically different solutions [3]: Modular PageRank, BlockRank and Topic-Sensitive PageRank. These approaches all approximate PR, but they differ substantially in their computational requirements and in the granularity of personalization achieved.

2.1.1. Modular PageRank. The Modular PageRank approach was proposed by Jeh and Widom [4] according to the following principle: "it is possible to compute a good approximation of the PR of all Web pages by carving up the entire Web starting from an initial set of pages, called the Hub set". The Hub set can be composed of pages whose PR is highest (pages that are very important), pages mentioned in Yahoo! or the ODP (Open Directory Project), or pages important for a particular company (like Microsoft). From each p ∈ Hub set, a zone of pages is built; a zone is made up of all the pages that can be reached starting from page p. Jeh and Widom represented PPVs (personalized PageRank vectors) as a linear combination of |Hub set| hub vectors, one for each p ∈ Hub set. Any PPV based on hub pages can be constructed quickly from the set of precomputed hub vectors, but computing and storing all hub vectors is impractical. To compute a large number of hub vectors efficiently, they further decompose them into partial vectors and the hubs skeleton, components from which hub vectors can be constructed quickly at query time. The best results are obtained when the Hub set is composed of pages whose PR is high.
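The query-time step can be sketched as follows; the dictionary-of-vectors layout and the preference weights are our illustration, not Jeh and Widom's actual data structures, and we show the direct combination rather than the partial-vectors decomposition.

    def personalized_pagerank(preferences, hub_vectors):
        """Build a personalized PageRank vector (PPV) as a linear
        combination of precomputed hub vectors, one per hub page.
        `preferences` maps hub pages to weights that sum to 1."""
        ppv = {}
        for hub, weight in preferences.items():
            for page, score in hub_vectors[hub].items():
                ppv[page] = ppv.get(page, 0.0) + weight * score
        return ppv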
2.1.2. BlockRank. This algorithm starts from an experimental observation: "the links between Web pages are not distributed uniformly". Indeed, there are groups of strongly inter-connected pages, like those constituting a domain, a Web site or a directory, and such a group of pages is connected to the other pages by a low number of links. Experiments showed that about 79% of links stay within the same host and about 84% of links stay within the same domain [8]. Considering that the Web is a nested structure, the Web graph can be partitioned into blocks according to the different levels of the Web structure, such as the page level, directory level, host level and domain level. Furthermore, the hyperlinks at the block level can be divided into two types, intra-hyperlinks and inter-hyperlinks, where an inter-hyperlink links two Web pages in different blocks while an intra-hyperlink links two Web pages in the same block.

The idea of the BlockRank algorithm is, in a first step, to compute a "local" PR on a single block; with a reduced number of pages (only the pages that constitute the block), the PR algorithm converges quickly. In a second step, to obtain an excellent approximation of the full PR, the importance of each block is computed from a matrix of blocks rather than from the entire Web. This value is called the BlockRank (BR) [5]. The BR algorithm is defined as follows (see the sketch after this list):
- The Web is cut into blocks according to the chosen type of block (directory, site or domain).
- The local PR of each page within its block is computed.
- The relative importance of each block (noted BR) is also computed.
- The local PR of each page and the BR of its block are combined to estimate the full PR of the page.
This algorithm decreases the PR computing time (up to 20 times less). It also makes it possible to determine a good approximation of the PR of any new page.
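The final combination step can be sketched as follows; the data layout is our assumption, and in Kamvar et al. [5] the combined value serves as the starting vector for a few final iterations of full PageRank rather than as the end result.

    def blockrank_estimate(local_pr, block_rank, block_of):
        """Estimate the full PR of each page by weighting its local
        (within-block) PR by the importance (BR) of its block."""
        return {page: local_pr[page] * block_rank[block_of[page]]
                for page in local_pr}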
2.1.3. Topic-Sensitive PageRank. This algorithm, proposed by Taher Haveliwala [7], builds on the traditional PageRank. Its starting point is that the traditional PR ranks pages according to their importance as if all contents were alike: it does not distinguish a page about the animal jaguar from a page about the Jaguar car. The solution suggested in Topic-Sensitive PageRank is to compute a set of importance scores for each page, one per topic. The computation proceeds as follows. The first step is to generate a set of biased PageRank vectors using a set of basis topics. This step is performed once, offline, during the pre-processing of the Web crawl. There are many possible sources for the basis set of topics; however, using a small basis set is important for keeping the pre-processing and query-time costs low. One option is to cluster the Web page repository into a small number of clusters in the hope of achieving a representative basis. Haveliwala chose instead to use the freely available, hand-constructed Open Directory as a source of topics and, to keep the basis set small, made use of its 16 top-level categories. The result is 17 different PageRank vectors for each page: the full PageRank, and 16 biased PageRanks corresponding to the top-level categories. These values are stored in the index. The second step, at query time, is to determine the topics most closely associated with the query and to use the appropriate topic-sensitive PageRank vectors for ranking the documents satisfying the query. This association can be obtained by a syntactic and semantic analysis of the query terms, which is easy to compute.
2.2 Kleinberg's HITS procedure

Kleinberg distinguishes between two different notions of relevance: an authority is a page that is relevant in itself; a hub is a page that is relevant because it contains links to many related authorities. There is what Kleinberg calls a natural equilibrium between authorities and hubs: "Hubs and authorities exhibit what could be called a mutually reinforcing relationship: a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs." [2].

In [2], Kleinberg introduced a procedure for identifying Web pages that are good hubs or good authorities in response to a given query. To identify them, the procedure exploits the graph structure of the Web: each Web page is a node, and a link from page A to page B is represented by a directed edge from node A to node B. Given a query, the procedure first constructs a focused subgraph G, and then computes hub and authority scores for each node of G (say N nodes in total). In order to quantify the quality of a page as a hub and as an authority, Kleinberg associated every page with a hub weight and an authority weight. According to the mutually reinforcing relationship between hubs and authorities, he defined the hub weight as the sum of the authority weights of the nodes the hub points to, and the authority weight as the sum of the hub weights of the nodes that point to this authority. Let A denote the N-dimensional vector of authority weights, where Ai is the authority weight of node i (or page i), and let H denote the N-dimensional vector of hub weights, where Hj is the hub weight of node j. We have:

    Ai = Σ_{j ∈ In(i)} Hj    and    Hj = Σ_{i ∈ Out(j)} Ai

where In(i) is the set of in-links of page i and Out(j) the set of out-links of page j.
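A compact sketch of this mutual-reinforcement iteration, assuming the focused subgraph is given as in/out adjacency dictionaries and normalizing the weights after each step (as Kleinberg does to make the iteration converge):

    def hits(in_links, out_links, iterations=50):
        """Alternate the two update rules above: authorities collect
        the hub weights of their in-links, hubs collect the authority
        weights of their out-links; each vector is L2-normalized."""
        nodes = list(out_links)
        auth = {p: 1.0 for p in nodes}
        hub = {p: 1.0 for p in nodes}
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q in in_links.get(p, [])) for p in nodes}
            norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            auth = {p: v / norm for p, v in auth.items()}
            hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in nodes}
            norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            hub = {p: v / norm for p, v in hub.items()}
        return auth, hub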
An extension of the HITS algorithm was also proposed by a group at Microsoft Research [6]. Ji-Rong Wen et al. propose a block-based HITS algorithm to solve the noisy-links and topic-drifting problems of the classical HITS algorithm. The basic idea is to segment each Web page into multiple semantic blocks using a vision-based page segmentation algorithm; the main step of the HITS algorithm can then be performed at the block level instead of the page level. For each page, the block with the highest block rank (BRMax) is selected, and its rank is combined with that of the page as follows:

    Rank(d) = α · rank_PR(d) + β · rank_PR_BRMax(d)
3. Our work

The idea underlying our work is that the popularity of a page depends on the scores, with respect to the query terms, of the pages that point to it. We propose a function that combines link rank and document rank in relation with the query terms; we describe this function below. First, we start with the text processing.

3.1. Text Processing

Our system processes documents by first removing HTML tags and punctuation, and then excluding high-frequency terms using a stop-word list. After punctuation and stop-word removal, the system replaces each word by its representative class (root) using the Porter stemming algorithm [10]. To represent documents and queries, we used the vector model [9]. This choice is justified by its success in the Web community and by the satisfactory results that it produces. In the following sections, we define the retrieval methodology that we follow and detail the scoring function that we used. We also present the experiments and the results of our evaluation on the TREC collection.
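The text-processing pipeline described above can be sketched as follows; the regular expressions and the tiny stop-word list are illustrative assumptions, and we rely on NLTK's implementation of the Porter stemmer [10].

    import re
    from nltk.stem import PorterStemmer

    STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # illustrative subset
    stemmer = PorterStemmer()

    def preprocess(html):
        """Strip HTML tags and punctuation, drop stop words,
        then reduce every remaining word to its root."""
        text = re.sub(r"<[^>]+>", " ", html)         # remove HTML tags
        words = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
        return [stemmer.stem(w) for w in words if w not in STOP_WORDS]

    print(preprocess("<p>Ranking of the retrieved pages</p>"))
    # -> ['rank', 'retriev', 'page']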
3.2. Retrieval methodology

Our experiments are conducted according to the following steps:

Step 1. Initial Retrieval: An initial list of ranked Web pages that contain at least one query term is obtained. The document rank obtained in this step by using TF-IDF scores is called DR (Document Rank); it is based on the content only of the page.

Step 2. Page Partition: A partition method is applied to split the retrieved pages into sets. Each set is composed of the pages that contain exactly k terms of the query (k = 1, ..., n), where n is the number of query terms.

Step 3. Final Retrieval: The final list of ranked Web pages is built from the pages that contain all the query terms. For each page of the final list, we calculate the link rank by applying the function proposed in the next section. After obtaining the link rank, a combined rank is used, in which the rank of each Web page Pi is determined by:

    Rank(Pi, Q) = α · Rank_DR(Pi, Q) + (1 - α) · Rank_LR(Pi, Q)    (1)

where α is a parameter which can be set between 0 and 1; it allows us to observe the impact of our link rank function on the ranking of query results. Rank_LR(Pi, Q) is a link rank based on the propagation of scores over links according to the query terms, and Rank_DR(Pi, Q) is the document rank of the page Pi, based on the content only of the page. We detail both functions in the following section.

3.3. Scoring Function Details

The primary innovation of our work is a ranking function which depends on both the content and the popularity of a page according to the query terms. This dependence allows a better fit between the results returned by a traditional IR model and the user's needs. Our ranking function is based on two measures. The first one is traditional and widely used in current systems: the cosine measure, which computes the cosine of the angle between the query vector and the document vector. It is defined as follows:

    Rank_DR(Pi, Q) = (Pi · Q) / (||Pi|| · ||Q||)
                   = Σ_{ti ∈ Pi ∩ Q} w_{ti,Pi} · w_{ti,Q} / ( sqrt(Σ_{ti ∈ Pi} w²_{ti,Pi}) · sqrt(Σ_{ti ∈ Q} w²_{ti,Q}) )    (2)

where the w_{ti,Pi} and w_{ti,Q} are the term weights assigned to the terms of the page Pi and of the query Q respectively. The best-known term-weighting schemes use weights given by

    w_ij = f_ij × log(N / DF_j)

where f_ij is the normalized frequency of term k_j in page Pi and DF_j is the Document Frequency (DF), which measures the number of documents of the entire collection in which the term k_j appears. f_ij is given by:

    f_ij = 0.5 + 0.5 × tf_ij / max_k tf_ik

where tf_ij is the Term Frequency (TF), i.e. the number of times the term k_j appears in page Pi, and the maximum is taken over all the terms of page Pi.
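A sketch of these two formulas in Python (the dictionary representation of the weighted vectors is our assumption):

    import math

    def tf_idf_weights(term_counts, doc_freq, n_docs):
        """w_ij = f_ij * log(N / DF_j), with the normalized frequency
        f_ij = 0.5 + 0.5 * tf_ij / max_k tf_ik."""
        max_tf = max(term_counts.values())
        return {t: (0.5 + 0.5 * tf / max_tf) * math.log(n_docs / doc_freq[t])
                for t, tf in term_counts.items()}

    def rank_dr(w_page, w_query):
        """Rank_DR(Pi, Q): cosine of the angle between the two vectors."""
        num = sum(w_page[t] * w_query[t] for t in set(w_page) & set(w_query))
        den = (math.sqrt(sum(v * v for v in w_page.values()))
               * math.sqrt(sum(v * v for v in w_query.values())))
        return num / den if den else 0.0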
The second measure is a structural one that takes the link structure information into account. To understand our function, start from the following assumption: "a page is well known for a term t of a query if it has incoming links from pages which contain the term t". The measure is computed as follows. Let Q be a query containing n terms and let Pj be a page retrieved by a traditional IR system. Let T(Q) and T(Pj) be the sets of terms of Q and of Pj respectively. We denote by In(Pj) the set of pages that point to the page Pj (incoming links) and by |In(Pj)| the number of incoming links of the page Pj. Our link rank is defined as follows:

    Rank_LR(Pj, Q) = Σ_{k=1}^{n} C_n^k · (k/n) · β · ( Σ_{Pi ∈ In(Pj), |T(Pi) ∩ T(Q)| = k} Rank_DR(Pi, Q) ) / |In(Pj)|    (3)

where C_n^k is the number of sets of pages that contain exactly k terms of the query. It is given by this formula:

    C_n^k = n! / (k! · (n - k)!)

β is a parameter between 0 and 1 which satisfies the following normalization condition:

    Σ_{k=1}^{n} [n! / (k! · (n - k)!)] · (k/n) · β = 1,  i.e.  Σ_{k=1}^{n} [(n - 1)! / ((k - 1)! · (n - k)!)] · β = 1.

Since Σ_{k=1}^{n} (n - 1)! / ((k - 1)! · (n - k)!) = 2^{n-1}, we get β = 1 / 2^{n-1}.

After transformation of function (3), the final function is:

    Rank_LR(Pj, Q) = Σ_{k=1}^{n} [(n - 1)! / ((k - 1)! · (n - k)!)] · (1 / 2^{n-1}) · ( Σ_{Pi ∈ In(Pj), |T(Pi) ∩ T(Q)| = k} Rank_DR(Pi, Q) ) / |In(Pj)|    (4)
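A direct sketch of equations (4) and (1), assuming precomputed DR scores and a set of terms per page; the container choices are ours:

    from math import factorial

    def rank_lr(page, query_terms, in_links, page_terms, dr):
        """Equation (4): propagate the DR scores of the pages pointing
        to `page`, grouping them by the number k of query terms they
        contain, each group weighted by (n-1)!/((k-1)!(n-k)!)/2^(n-1)."""
        n = len(query_terms)
        preds = in_links.get(page, [])
        if not preds:
            return 0.0
        total = 0.0
        for k in range(1, n + 1):
            coeff = factorial(n - 1) / (factorial(k - 1) * factorial(n - k))
            group = sum(dr[p] for p in preds
                        if len(page_terms[p] & query_terms) == k)
            total += coeff * group / (2 ** (n - 1) * len(preds))
        return total

    def combined_rank(page, query_terms, in_links, page_terms, dr, alpha=0.15):
        """Equation (1): mix document rank and link rank."""
        return (alpha * dr[page]
                + (1 - alpha) * rank_lr(page, query_terms, in_links, page_terms, dr))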

4. Experiments on TREC collection

In this section, we start by describing the test collection used in our experiments, and then we detail the tests carried out. We ran 50 queries and compared three ranking functions.

4.1 The WT10g collection

For our experiments, we chose the WT10g collection as test collection. This collection contains 1,692,096 documents (Web pages), of which 1,532,012 pages have incoming links and 1,295,841 pages have outgoing links. Due to our limited storage resources, we had to restrict our experiments to a relatively small subset of the TREC collection: we downloaded the pages of 490 Web sites appearing in the collection, selecting the sites that contain at least two relevant pages for one of the queries run on our system. The number of downloaded pages is 546,423, of which 477,064 pages have incoming links and 410,378 pages have outgoing links.


4.2. Experimental Setup and Results

In this section we present an experimental evaluation of the algorithm that we propose, as well as of some existing algorithms, and we study the rankings they produce. In our experiments, the precision over the 11 standard recall levels (0%, 10%, ..., 100%) is the main evaluation metric, and we also evaluate the precision at 5 and 10 retrieved documents (P@5 and P@10).

Figure 1 shows the experimental results obtained with the different ranking methods. The first one, based on the content only of the page and drawn as the blue line, is the baseline algorithm. The red line shows the results of the INDEGREE algorithm, a simple heuristic that can be viewed as the predecessor of all link analysis ranking algorithms: it ranks the pages according to their popularity, where the popularity of a page is measured by the number of pages that link to it. The last curve shows the results of combining link rank and document rank, with α set to its optimal value for this method (α = 0.15).

[Figure 1. Comparison of the three ranking functions ("Combining link and document rank", "Content-Only (Baseline)", "INDEGREE"): precision over the 11 standard recall levels.]

The dependency between the precision at the 0%, 10% and 20% recall levels and α is illustrated in Figure 2, in which all the curves converge to the baseline when α = 1.

[Figure 2. Precision at the 0%, 10% and 20% recall levels for different values of the combining parameter α on WT10g.]

As can be seen from Figure 1, the INDEGREE algorithm performs worst. With this algorithm, a page has the same score (or popularity value) for all the queries submitted to the system, because the popularity of a page is taken into account independently of the query terms; hence the poor results. Combining link rank and document rank is strongly better than the baseline and is the best among all the methods. The performance of our method increases significantly when α decreases: the more importance we give to the link rank, the better the results.
The best value of α is 0.15, which puts more relevant documents at the top of the ranked list.

Table 1. P@5 and P@10 comparison

    Precision   InDegree   Baseline DR   Link Rank LR   0.15*DR + 0.85*LR
    P@5         0.077      0.216         0.265          0.306
    P@10        0.072      0.163         0.192          0.208

We also compare the different algorithms on the average precision at 5 and 10 retrieved documents (P@5, P@10). The performance of our algorithm is again better than that of the baseline algorithm, which means that there are more relevant documents at the top of the ranked list. From Table 1, we can see that both the link rank alone and the combination of link rank and document rank perform better than the baseline, on average precision at both P@5 and P@10. For example, the P@5 and P@10 of our algorithm are 0.306 and 0.208 respectively, which represents improvements of 42% and 28% over the baseline algorithm (0.216 and 0.163 respectively). The results in this table also show that the InDegree algorithm remains the worst one.

5. Conclusion

Several algorithms based on the link analysis approach have been developed but, until now, many experiments have shown no significant gain compared to methods based on the content only of the page. In this paper, we introduced an approach that combines content rank and link rank based on the propagation of scores over links according to the query terms. During the computation, the proposed algorithm propagates a portion of the rank scores of the source Web pages to the destination Web pages in accordance with the query terms. We performed experimental evaluations of our algorithm using the TREC-9 IR test collection and found that it significantly outperforms content-only retrieval: propagating link scores according to query terms provides a significant improvement over the baseline method based on content only.

More studies and experiments will be conducted; e.g., we will use weighted inter-host and intra-host link score propagation. We also plan to test this framework at the semantic block level to see the structural effects of blocks on the ranking of query results. Finally, new measures representing additional semantic information may be explored.

6. References

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 1998, 107-117.

[2] J. Kleinberg, "Authoritative sources in a hyperlinked environment", in Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998, 604-632.

[3] T. Haveliwala, S. Kamvar and G. Jeh, "An Analytical Comparison of Approaches to Personalizing PageRank", technical report, Stanford University, 2003.

[4] G. Jeh and J. Widom, "Scaling personalized web search", in Proceedings of the Twelfth International World Wide Web Conference, 2003.

[5] S. Kamvar, T. Haveliwala, C. Manning and G. Golub, "Exploiting the Block Structure of the Web for Computing PageRank", technical report, Stanford University, 2003.

[6] D. Cai, S. Yu, J.-R. Wen and W.-Y. Ma, "Block-based Web Search", in SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, 2004, 456-463.

[7] T. H. Haveliwala, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search", IEEE Transactions on Knowledge and Data Engineering, 2003.

[8] X.M. Jiang, G.R. Xue, W.G. Song, H.J. Zeng, Z. Chen and W.Y. Ma, "Exploiting PageRank at Different Block Level", in Proceedings of the International Conference on Web Information Systems Engineering, 2004.

[9] G. Salton, C.S. Yang and C.T. Yu, "A theory of term importance in automatic text analysis", Journal of the American Society for Information Science, 1975, 33-44.

[10] M.F. Porter, "An algorithm for suffix stripping", Program, 14(3), 1980, 130-137.

[11] http://seattlepi.nwsource.com/business/160997_google-18.html
