
Relevance Propagation Model for Large Hypertext Document Collections

Idir Chibane & Bich-Liên Doan


SUPELEC, Computer Science dpt
Plateau de Moulon, 3 rue Joliot Curie,
91 192 Gif/Yvette, France
{Idir.Chibane, Bich-Lien.Doan}@supelec.fr

Abstract

Web search engines have become indispensable in our daily life, helping us find the information we need.
Several search tools, for instance Google, use links to select the documents matching a query. In this paper,
we propose a new ranking function that combines content and link rank, based on the propagation of scores
over links: scores are propagated from source pages to destination pages according to the query terms. We
assessed our ranking function with experiments over two test collections, WT10g and .GOV. We conclude that
propagating link scores according to query terms provides a significant improvement for information retrieval.

Introduction
A major focus of information retrieval (IR) research is on developing strategies for identifying
documents that are “relevant” to a given query. In traditional IR, the evidence of relevance is
thought to reside within the text content of documents. Consequently, the fundamental strategy
of traditional IR is to rank documents according to their estimated degree of relevance based on
measures such as term similarity or term occurrence probability. In the Web setting, however,
information can reside outside the textual content of documents. For example, links between
pages can be used to augment the term-based estimation of document relevance. Furthermore,
hyperlinks, being among the most important sources of evidence in Web documents, have been the
subject of much research exploring retrieval strategies based on link analysis.
The explosive growth of the Web has led to a surge of research activity in the area of IR on the
World Wide Web. Ranking has always been an important component of any information retrieval
system (IRS); in the case of Web search, its importance becomes critical. Given the size of the
Web (Google counted more than 8.16 billion Web pages in August 2005), it is imperative to have
a ranking function that captures user needs. To this end, the Web offers a rich context of
information expressed through links. In recent years, several information retrieval methods
using information about the link structure have been developed. Most systems based on link
structure information combine content with a popularity measure of the page to rank query
results. Google's PageRank (Brin et al., 1998) and Kleinberg's HITS (Kleinberg, 1999) are two
fundamental algorithms employing the link structure of the Web. A number of extensions of these
two algorithms have also been proposed, such as (Lempel et al., 2000) (Haveliwala, 2002)
(Kamvar et al., 2003) (Jeh et al., 2003) (Deng et al., 2004) and (Xue-Mei et al., 2004). All
these link analysis algorithms are based on two assumptions: (1) if there is a link from page A
to page B, then we may assume that page A endorses and recommends the content of page B; (2)
pages that are co-cited by a certain page are likely to share the same topic, which can also
help retrieval. The power of hyperlink analysis comes from the fact that it uses the content of
other pages to rank the current page. Ideally, these pages were created by authors independent
of the author of the original page, thus adding an unbiased factor to the ranking.
The study of existing systems led us to conclude that most ranking functions using link
structure do not depend on the query terms; as a result, the precision of the returned results
may decrease significantly. In this paper we investigate, theoretically and experimentally, the
application of link analysis to ranking pages on the Web. The rest of this paper is organized as
follows. In Section 2, we review recent work on link analysis ranking algorithms, together with
some extensions of these algorithms. In Section 3, we present our information retrieval model
with the new ranking function. In Section 4, we show experimental results on multiple queries
using the proposed algorithm, including a comparative study of different algorithms. In Section
5, we summarize our main contributions and discuss possible new applications of our proposed
method.

Previous Work
Unlike traditional IR, the Web contains both content and link structures, which have provided
many new dimensions for exploring better IR techniques. In the early days, Web content and
structure were analyzed independently. Typical approaches such as (Hawking, 2000) (Craswell et
al., 2003) (Craswell et al., 2004) use the TF-IDF (Salton et al., 1975) of the query terms in a
page to compute a relevance score, and use hyperlinks to compute a query-independent importance
score (e.g. PageRank (Brin et al., 1998)). These two scores are then combined to rank the
retrieved documents.
In recent years, new methodologies that explore the inter-relationship between content and link
structures have been introduced. (Qin et al., 2005) divides these methods into two categories:
the first enhances link analysis with the help of content information (Kleinberg, 1999) (Lempel
et al., 2000) (Haveliwala, 2002) (Amento et al., 2000) (Chakrabarti, 2001) (Chakrabarti et al.,
2001) (Ingongngam et al., 2003); the second is relevance propagation, which propagates content
information with the help of the Web structure (Mcbryan, 1994) (Song et al., 2004) (Shakery et
al., 2003).
HITS (Kleinberg, 1999) is representative of the first category. The HITS algorithm first builds
a query-specific subgraph, and then computes authority and hub scores on this subgraph to rank
the documents. Kleinberg distinguishes between two notions of relevance: an authority is a page
that is relevant in itself, and a hub is a page that is relevant because it contains links to
many related authorities. To identify good hubs and authorities, Kleinberg's procedure exploits
the graph structure of the Web. Given a query, the procedure first constructs a focused
subgraph G, and then computes hub and authority scores for each node of G. To quantify the
quality of a page as a hub and as an authority, Kleinberg associates every page with a hub
weight and an authority weight. Following the mutually reinforcing relationship between hubs
and authorities, he defines the hub weight as the sum of the authority weights of the nodes the
hub points to, and the authority weight as the sum of the hub weights of the nodes that point to
the authority. Let A denote the n-dimensional vector of authority weights, where Ai is the
authority weight of page pi, and let H denote the n-dimensional vector of hub weights, where Hi
is the hub weight of page pi. The authority and hub weights are computed by the following
formula:

$$A_i = \sum_{p_j \in In(p_i)} H_j \qquad \text{and} \qquad H_i = \sum_{p_j \in Out(p_i)} A_j$$

where In(pi) is the set of pages that link to page pi (its in-links), Out(pi) is the set of
pages that pi links to (its out-links), and Hi and Ai are the hub and authority weights of page
pi. Generally speaking, the methods of this category conduct link analysis on a subgraph which
is sampled from the whole Web graph by considering the content of the Web pages.
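
The iteration below is a minimal sketch of this mutual reinforcement, assuming the focused subgraph G is given as adjacency lists; the per-iteration normalization, which Kleinberg applies to keep the weights bounded, and all identifiers are illustrative choices, not the paper's implementation.

```python
# HITS power iteration sketch: out_links maps each page to the pages it links to.
def hits(out_links, iterations=50):
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    in_links = {p: [] for p in pages}
    for p, outs in out_links.items():
        for q in outs:
            in_links[q].append(p)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority weight: sum of hub weights of the pages that point to p
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # hub weight: sum of authority weights of the pages p points to
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        # normalize so the weights stay bounded across iterations
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub
```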
In the second category, many relevance propagation methods have been proposed to refine the
content of Web pages by propagating content-based attributes through the Web structure. For
example, (Mcbryan, 1994) and (Brin et al., 1998) propagate anchor text from one page to another
to expand the feature set of Web pages. (Shakery et al., 2003) propagates the relevance score of
a page to other pages through the hyperlinks between them. (Song et al., 2004) propagates query
term frequency from child pages to parent pages in the sitemap tree. They first construct a
sitemap for each website based on URL analysis, and then propagate query term frequency along
the parent-child relationships in the sitemap tree as follows:

$$f'_t(p) = (1+\alpha)\, f_t(p) + \frac{1-\alpha}{|Fils(p)|} \sum_{q \in Fils(p)} f_t(q)$$

where f't(p) is the occurrence frequency of term t in page p after propagation, ft(p) is the
original occurrence frequency of term t in page p, and Fils(p) is the set of child pages of p in
the sitemap tree. (Qin et al., 2005) proposes a generic relevance propagation framework from
which many existing propagation models can be derived:

$$f_t^{k+1}(p) = \alpha \cdot f_t^0(p) + (1-\alpha) \sum_{q \in Child(p)} \frac{1}{|Child(p)|}\, f_t^k(q)$$
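
A short sketch of this generic framework, under the assumption that the graph is given as a dictionary of child lists; the function and variable names are ours, not from (Qin et al., 2005).

```python
# Iterative term-frequency propagation: each round mixes a page's original
# frequency with the average frequency of its child pages.
def propagate_tf(f0, children, alpha=0.5, iterations=3):
    """f0: dict page -> original term frequency; children: dict page -> child pages."""
    f = dict(f0)
    for _ in range(iterations):
        nxt = {}
        for p in f0:
            kids = children.get(p, [])
            avg = sum(f.get(q, 0.0) for q in kids) / len(kids) if kids else 0.0
            nxt[p] = alpha * f0[p] + (1 - alpha) * avg
        f = nxt
    return f
```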

Relevance Propagation Model

Intuitively, the content similarity of a page to a query may not, on its own, be sufficient for
selecting a key resource, and link information (the neighborhood of the page) can be useful in
finding key resources. A good resource is a page whose content is related to the query topic
and which has links from other resources that are also related to the query topic. Two factors
are therefore important in selecting good resources: the content of the page and the relevance
of the pages that link to it. The idea underlying our work is that the popularity of a page
depends on the scores of the pages that point to it, taken with respect to the query terms.
Motivated by these observations, we propose a function that combines link rank and document
rank with respect to the query terms. Specifically, we define the “hyper-relevance” score of
each page as a function of two variables: the content similarity of the page to the query, and
a weighted sum of the scores of the pages that point to this page. Formally, the relevance
propagation function can be written as:

$$Rank(P, Q) = \varphi\big(Rank_{DR}(P, Q),\, Rank_{LR}(P, Q)\big)$$


where Rank(P, Q) is the hyper-relevance score of page P, RankDR(P, Q) is the content similarity
between page P and query Q, and RankLR(P, Q) is the link rank based on propagating scores over
links according to the query terms, given by:

$$Rank_{LR}(P, Q) = \sum_{P_i \to P} \frac{w_{link}(P_i, P, Q) \cdot Rank_{DR}(P_i, Q)}{E_P}$$

where wlink(Pi, P, Q) is the weight assigned to the link between page Pi and page P according to
the query terms, and EP is the number of pages that point to page P. In principle, the choice of
the function φ could be arbitrary. An interesting choice is a linear combination of the two
variables, shown below:

$$Rank(P, Q) = \alpha \cdot Rank_{DR}(P, Q) + (1-\alpha) \cdot Rank_{LR}(P, Q)$$
where α is a parameter set between 0 and 1; it allows us to measure the impact of our link rank
function on the ranking of query results.
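
As a minimal sketch, the combination is a one-liner; rank_dr and rank_lr are assumed to be the precomputed content and link scores of a page for a query. With alpha = 1 the function reduces to the content-only baseline.

```python
# Convex combination of content score and link score for one page and one query.
def hyper_relevance(rank_dr: float, rank_lr: float, alpha: float) -> float:
    return alpha * rank_dr + (1 - alpha) * rank_lr
```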
Text Processing
Our system processes documents by first removing HTML tags and punctuation, and then excluding
high-frequency terms using a stop-word list. After punctuation and stop-word removal, the system
replaces each word by its representative class (root) using the Porter stemming algorithm
(Porter, 1980). To represent documents and queries, we used the vector space model (Salton et
al., 1975). This choice is justified by its wide adoption in the Web community and the
satisfactory results it yields.
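
A minimal sketch of this pipeline, assuming NLTK's implementation of the Porter stemmer; the tag-stripping regular expression and the abbreviated stop-word list are illustrative stand-ins for the real resources.

```python
import re

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # excerpt only, for illustration

def preprocess(html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # drop punctuation, lowercase
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
```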
Scoring Function Details
The primary innovation in our work arises from the use of a ranking function that depends on
both the content and the neighbourhood of a page according to the query terms. This dependence
brings the results returned by a traditional IR model closer to user needs. Our ranking
function is based on two measures. The first is based on content only, namely the Okapi BM25
measure, which has the form:

$$\sum_{T \in Q} w \cdot \frac{(k_1+1)\, tf}{K + tf} \cdot \frac{(k_3+1)\, qtf}{k_3 + qtf}$$

where Q is a query containing key terms T, tf is the frequency of occurrence of the term within
a specific document, qtf is the frequency of the term within the topic from which Q was derived,
and w is the Robertson/Sparck Jones weight of T in Q, calculated as:

$$w = \log \frac{(r+0.5)\,/\,(R-r+0.5)}{(n-r+0.5)\,/\,(N-n-R+r+0.5)}$$
where N is the number of documents in the collection, n is the number of documents containing
the term, R is the number of documents relevant to a specific topic, and r is the number of
relevant documents containing the term. In our experiments, R and r are set to zero. K is
calculated as:

$$K = k_1 \left((1-b) + b \cdot \frac{dl}{avdl}\right)$$


where dl and avdl denote the document length and the average document length. In our
experiments, we set k1 = 4.2, k3 = 1000 and b = 0.8 to achieve the best baseline (we took the
content-only score as our baseline).
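
The following sketch implements the BM25 score above with R = r = 0, in which case the Robertson/Sparck Jones weight reduces to log((N - n + 0.5) / (n + 0.5)); the data structures are assumptions for illustration, and the parameter values follow the paper.

```python
import math

def bm25(query_tf, doc_tf, n_docs_with_term, N, dl, avdl,
         k1=4.2, k3=1000.0, b=0.8):
    """query_tf/doc_tf: dict term -> frequency; n_docs_with_term: dict term -> n."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        n = n_docs_with_term[t]
        # RSJ weight with R = r = 0
        w = math.log((N - n + 0.5) / (n + 0.5))
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```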

The second is a structural measure that takes link structure information into account. To
understand our function, we start from the following assumption: “We considered that a page is
well-known for a term t of a query if it contains incoming links from pages which have the term
t” (Doan et al., 2005). The main idea of our structural measure is to weight links according to
the number of query terms contained in the source page for an incoming link, and in the
destination page for an outgoing link. In the following sections, we consider only incoming
links. Our assumption stipulates that the weight of a link emitted by a page containing n query
terms is twice as significant as the weight of a link from a page containing only n-1 query
terms. This measure can be computed as follows.
Let Q be a query containing nbtQ terms, and let P be a page retrieved by a traditional
information retrieval system. Let nbtQ(P) be the number of query terms that P contains. We
denote by EP the number of pages that point to page P (incoming links).

$$Rank_{LR}(P, Q) = \sum_{P_i \to P} \frac{w_{link}(P_i, P, Q) \cdot Rank_{DR}(P_i, Q)}{E_P}$$

where wlink(Pi, P, Q) is the weight of the link between pages Pi and P. The more query terms
page Pi contains, the larger the link weight between the two pages. This weight is defined as
follows:

$$w_{link}(P_i, P, Q) = \frac{2^{\,nbt_Q(P_i)}}{2^{\,nbt_Q}} \cdot \beta = \frac{\beta}{2^{\,nbt_Q - nbt_Q(P_i)}}$$
β is a parameter between 0 and 1 chosen so that the link weights form a probability distribution
over the typical link types, a link's type being given by the number of query terms contained in
its source page:
type 1: links emitted by a page containing one query term;
type 2: links emitted by a page containing two query terms;
...
type n: links emitted by a page containing n query terms.
This condition reads:

$$\sum_{k=1}^{nbt_Q} \frac{\beta}{2^{\,nbt_Q-k}} = 1 \;\Rightarrow\; \beta \cdot \sum_{k=1}^{nbt_Q} \frac{1}{2^{\,nbt_Q-k}} = 1 \;\Rightarrow\; \beta \cdot \left(\frac{1}{2^{\,nbt_Q-1}} + \frac{1}{2^{\,nbt_Q-2}} + \dots + \frac{1}{2} + 1\right) = 1$$

$$\frac{1}{2^{\,nbt_Q-1}} + \frac{1}{2^{\,nbt_Q-2}} + \dots + \frac{1}{2} + 1 = \frac{1 - \left(\frac{1}{2}\right)^{nbt_Q}}{1 - \frac{1}{2}} = 2\left(1 - \left(\frac{1}{2}\right)^{nbt_Q}\right) \;\Rightarrow\; \beta = \frac{1}{2\left(1 - \left(\frac{1}{2}\right)^{nbt_Q}\right)}$$
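
As a quick check on this derivation, take a three-term query ($nbt_Q = 3$):

$$\beta = \frac{1}{2\left(1 - (1/2)^3\right)} = \frac{1}{2 \cdot \frac{7}{8}} = \frac{4}{7}$$

so links emitted by pages containing 1, 2 and 3 query terms receive weights $\beta/4 = 1/7$, $\beta/2 = 2/7$ and $\beta = 4/7$ respectively: each link type weighs twice the previous one, and the weights sum to 1 as required.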

The more terms the query contains, the smaller the value of β. Replacing β by its value in the
neighbour function, we obtain the following function, which we use in our experiments:

$$Rank_{LR}(P, Q) = \sum_{P_i \to P} \frac{2^{\,nbt_Q(P_i)}}{2^{\,nbt_Q+1} \left(1 - \left(\frac{1}{2}\right)^{nbt_Q}\right) E_P} \cdot Rank_{DR}(P_i, Q)$$
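
A sketch of this final function on a toy in-link graph; the dictionary-based representation and helper arguments are assumptions for illustration, not the paper's implementation.

```python
# Link rank of page p: weighted sum of the content scores of the pages pointing
# to p, where each source page's weight grows with its number of query terms.
def rank_lr(p, in_links, nbt, rank_dr, nbt_q):
    """in_links: dict page -> pages pointing to it;
    nbt: dict page -> number of query terms the page contains;
    rank_dr: dict page -> content score for the query; nbt_q: terms in the query."""
    sources = in_links.get(p, [])
    if not sources:
        return 0.0
    e_p = len(sources)  # number of pages pointing to p
    denom = 2 ** (nbt_q + 1) * (1 - 0.5 ** nbt_q) * e_p
    return sum(2 ** nbt[pi] * rank_dr[pi] for pi in sources) / denom
```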

Empirical evaluations
In this section, experiments were conducted to evaluate the performance and efficiency of our
model. We first introduce the experimental settings and some implementation issues, and then
present experimental results and discussions.
Experimental Setting
To avoid corpus bias, two different data collections were used in our experiments. One is the
WT10g corpus, crawled from the Web in early 2000; it has been used as the data collection of the
Web Track since TREC 2000. It contains 1,692,096 pages, of which 1,295,841 have incoming links
and 1,532,012 have outgoing links. The other data set is the “.GOV” corpus, crawled from the
.gov domain in early 2002; it has been used as the data collection of the Web Track since TREC
2002. It contains 1,053,110 pages with 11,164,829 hyperlinks. According to Soboroff (Soboroff,
2002), neither WT10g nor .GOV differs greatly from the real structure of the Web. The WT10g
collection is, however, smaller in size than the .GOV collection (10 GB versus 18 GB). The
following table shows the characteristics of the two collections:

                                             WT10g               .GOV
Number of documents                          1,692,096           1,247,753
Number of documents with incoming links      1,295,841 (76.5%)   1,067,339 (85.5%)
Number of documents with outgoing links      1,532,012 (90.5%)   1,146,213 (91.9%)
Average number of incoming links per page    5.26                10.4
Average number of outgoing links per page    6.22                9.69
Number of queries                            50                  50
Average number of relevant pages per query   52.54               31.48

Table 1: Characteristics of the WT10g and .GOV test collections

Experimental Setup and Results


In this section we present an experimental evaluation of the algorithm we propose, as well as
of some other existing algorithms, and we study the rankings they produce. In our experiments,
the precision over the 11 standard recall levels (0%, 10%, ..., 100%) is the main evaluation
metric, and we also evaluate the precision at 5 and 10 retrieved documents (P@5 and P@10). For
the experiments on the “WT10g” and “.GOV” corpora, we used the topic distillation tasks of the
TREC 2000 and TREC 2003 Web Tracks respectively as our query sets (50 queries for each
collection). For simplicity, we denote these two query sets TOPIC2000 and TOPIC2003. The ground
truths of these tasks are provided by the TREC committee. We compare several categories of
algorithms: content-only, popularity with the PageRank algorithm, and our algorithm combining
link and document rank. The dependency between precision at 10 retrieved documents and α on
both the WT10g and .GOV collections is illustrated in Figure 2, in which all the curves
converge to the baseline when α = 1.

[Figure: precision-recall curves over the 11 standard recall levels for Contents-Only,
Combining Link and Document Rank (α = 0.15 for WT10g, α = 0.25 for .GOV) and PageRank, on the
WT10g collection (top) and the .GOV collection (bottom).]

Figure 1: Average precision on the 11 standard recall levels of three ranking functions (contents-only,
PageRank and combining link and document rank algorithms) carried out on WT10g and .GOV
collections
As can be seen from Figure 1, the PageRank algorithm performs worst on both the WT10g and .GOV
collections. With PageRank, a page has the same score (popularity value) for every query
submitted to the system, because the popularity of a page is taken into account independently
of the query terms; the results suffer accordingly. Combining link and document rank is
markedly better than the baseline and is the best among all the methods. The performance of our
method increases significantly as α decreases: the more importance we give to the link rank,
the better the results. The best value of α for placing more relevant documents at the top of
the ranked list is 0.15 for the WT10g collection and 0.25 for the .GOV collection. However, if
we do not take the textual content of pages into account in the ranking function, retrieval
performance decreases. This result shows the importance of page content in computing the
relevance of a document to a given query.

[Figure: P@10 as a function of the parameter α (from 0.05 to 1) for the combined function
α*RankDR(p,q)+(1-α)*RankLR(p,q) versus the Contents-Only baseline, on the WT10g collection
(top) and the .GOV collection (bottom).]

Figure 2: Precision at 10 retrieved documents as a function of the parameter α for the combined
ranking function on both the WT10g and .GOV collections

Collection   Algorithm                               P@5      P@10     P@20

WT10g        Baseline (Contents-Only)                14.89%   15.11%   26.81%
             0.15*RankDR(p,q)+0.85*RankLR(p,q)       17.87%   15.74%   29.15%
             PageRank                                 2.12%    4.89%   16.17%

.GOV         Baseline (Contents-Only)                11.20%    9.80%   16.40%
             0.25*RankDR(p,q)+0.75*RankLR(p,q)       14.00%   11.40%   17.40%
             PageRank                                 2.00%    1.60%    3.80%
Table 2: P@5, P@10 and P@20 comparison of three ranking functions (contents-only,
PageRank and combining link and document rank algorithms) carried out on WT10g and .GOV
collections
We also compare the different algorithms on the average precision at 5, 10 and 20 retrieved
documents (P@5, P@10 and P@20). The performance of our algorithm remains better than the
baseline algorithm on both the WT10g and .GOV collections, meaning that there are more relevant
documents at the top of the ranked list. From Table 2, we can see that combining link rank with
document rank performs better than the baseline on P@5, P@10 and P@20 for both collections. For
example, it achieved 20% and 25% improvements over the baseline algorithm on precision at five
retrieved documents (P@5) on the WT10g and .GOV collections respectively.

Efficiency Evaluation
In the previous section, we investigate the effectiveness of the relevance propagation models.
However, for real-world applications, efficiency is another important factor besides
effectiveness. In this regard, we evaluate the efficiency of our model in this section to see their
potential of being used in search engines. Roughly speaking, typical architecture of a search
engine has three components (Baeza-Yates and Ribeiro-Neto,1999) crawler, indexer, and
searcher. If we want to integrate relevance propagation technologies into search engine, we
should consider these three components. Clearly, we could only embed relevance propagation
into the second or third component. Since the search engine indexes the Web offline, and
implement the search operation online, we will discuss the efficiency of relevance propagation
for the online case and offline case respectively.
Online Complexity
According to the algorithm description, the relevance propagation model that we propose
involves two kinds of computation. The first is retrieving the relevant pages and ranking them
with the relevance weighting function (Okapi BM25); this is also needed by existing search
engines. The second is the additional computation introduced by relevance propagation, which is
the major concern when integrating these models into search engines. We therefore focus on the
analysis of this additional computation in this section. From the model formulation and the
implementation issues, we can derive the following estimates of the online complexity of the
relevance propagation models. Note that the time complexity estimated here is for one query.
Our model propagates the relevance score of a page along its in-links or out-links in the
subgraph induced by the relevant set, which contains all the pages that have at least one query
term. Since the source and destination pages of a hyperlink must both be in the relevant set,
the average numbers of in-links and out-links per page are equal; we denote this number by
avlink. If we further use tcp to denote the time complexity of propagating an entity from one
page to another along a hyperlink, the complexity of our model is avlink * tcp * rs, where rs
is the average number of pages in the relevant set over all topics.

         Average avlink   Average rs   Time complexity

WT10g    5.26             1532         8052*tcp
.GOV     10.4             4261         44314*tcp

Table 3: Time complexity on the WT10g and .GOV collections with TOPIC2000 and TOPIC2003
respectively.
An offline implementation would be much preferred for applying our model in real-world
applications. However, the score propagation model that we propose cannot be built offline,
because the scores do not exist in the offline indices; they depend on the online relevance
ranking algorithm.

Conclusion
Several algorithms based on link analysis approach were developed. But, until now many
experiments showed that there is no significant profit compared to the methods based on

Conference RIAO2007, Pittsburgh PA, U.S.A. May 30-June 1, 2007 - Copyright C.I.D. Paris, France
content-only of the page. In this paper, we introduce an approach for combining content and
link rank based on propagation scores over links according to query terms. During the
computation, the algorithm that we proposed propagates a portion of rank scores of the source
web pages to the destination web pages in accordance with query terms. We performed
experimental evaluations of our algorithm using IR test collection of TREC 9 and .GOV. We
found that this algorithm outperforms a content-only retrieval. The study concluded from our
experiments shows that propagation link scores according to query terms provide a certain
improvement. It is still better than the baseline method based on content-only. More study and
experiments will be conduced, e.g., we will use the weighted inter-host and intra-host link score
propagation. We also plan to test this framework at the semantic blocks level to see the
structural effects of blocks on ranking query results. Finally, new measure representing
additional semantic information may be explored.

References
Amento, B. et al, 2000. Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Pages.
In Proc. ACM SIGIR 2000, pp. 296--303.
Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval, Addison Wesley, 1999.
Brin, S. and Page, L., 1998. The anatomy of a large-scale hypertextual Web search engine. In
Proceedings of WWW7.
Craswell, N. and Hawking, D., 2003. Overview of the TREC 2003 Web Track, in the 12th TREC.
Craswell, N. and Hawking, D., 2004. Overview of the TREC 2004 Web Track, in the 13th TREC.
Chakrabarti, S., 2001. Integrating the Page Object Model with hyperlinks for enhanced topic distillation
and information extraction, In the 10th WWW.
Chakrabarti, S. et al, 2001. Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks, In
Proceedings of the 24th ACM SIGIR, pp. 208-216.
Deng C., Shipeng Y., Ji-Rong W., Wei-Ying M., 2004. Block-based Web Search, Microsoft Research
Asia.
Crestani F. and Lee P.L., 1999. Searching the web by constrained spreading activation. In
Proceedings of the IEEE Forum on Research and Technology Advances in Digital Libraries
(ADL '99), pp. 163-170.
Doan B. and Chibane I., 2005. Expérimentations sur un modèle de recherche d’information utilisant les
liens hypertextes des pages Web, Revue des Nouvelles Technologies de l'Information (RNTI-E-3),
numéro spécial Extraction et Gestion des Connaissances (EGC'2005), Vol. 1:245-256,
Cépaduès-Editions, pp.257-262.
Hawking D., 2000. Overview of the TREC-9 Web Track, in the 9th TREC.
Haveliwala T.H., 2002. Topic-Sensitive Pagerank: A Context-Sensitive Ranking Algorithm for Web
Search. In Proceedings of the eleventh international conference on World Wide Web, pp. 517-526,
ACM Press.
Ingongngam P. and Rungsawang A., 2003. Report on the TREC 2003 Experiments Using Web
Topic-Centric Link Analysis, In the 12th TREC.
Jeh G and Widom J., 2003. Scaling personalized web search. In Proceedings of the Twelfth
International World Wide Web Conference.
Kamvar S. et al., 2003. Exploiting the Block Structure of the Web for Computing PageRank, Stanford
University Technical Report.
Kleinberg J., 1999. Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol. 46,
No. 5, p. 604-622.
Lempel R. and Moran S., 2000. The stochastic approach for link-structure analysis (SALSA) and the
TKC effect, In Proceeding of 9th International World Wide Web Conference.
Mcbryan O., 1994. GENVL and WWW: Tools for Taming the Web, In Proceedings of the 1st WWW.
Porter M.F., 1980. An Algorithm for Suffix Stripping, Program, Vol. 14(3), p. 130-137.
Qin T. et al, 2005. A Study of Relevance Propagation for Web Search, the 28th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval.
Salton G. et al, 1975. A theory of term importance in automatic text analysis, Journal of the
American Society for Information Science.
Shakery A. and Zhai C.X., 2003. Relevance Propagation for Topic Distillation: UIUC TREC 2003
Web Track Experiments, in the 12th TREC.
Song R. et al, 2004. Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004, in the
13th TREC.

Soboroff I., 2002. Do TREC Web Collections Look Like the Web?, SIGIR Forum Fall, Volume 36,
Number 2.
Xue-Mei J. et al., 2004. Exploiting PageRank at Different Block Level - International Conference on
Web Information Systems Engineering.
