Académique Documents
Professionnel Documents
Culture Documents
Retrieval
Jian-Yun Nie
University of Montreal
Canada
1
Outline
The problem of IR
Goal = find documents relevant to an information need from a large document set
Info.
need
Query
Document
collection
Retrieval
IR
syste
m
Answer list
Example
Googl
e
Web
IR problem
Possible approaches
1. String matching (linear search in
documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement
6
Indexing-based IR
Document
Query
indexing
indexing
(Query analysis)
Representation
(keywords)
evaluation
Representation
Query (keywords)
Main problems in IR
System evaluation
Document indexing
Coverage
(Recall)
String
Word
Phrase
Accuracy
(Precision)
Concept
informativity
Max.
Min.
123
Rank
10
tf*idf weighting
schema
tf = term frequency
The higher the tf, the higher the importance (weight) for
the doc.
df = document frequency
11
tf(t,
tf(t,
tf(t,
tf(t,
D)=freq(t,D)
idf(t) = log(N/n)
D)=log[freq(t,D)]
n = #docs containing t
D)=log[freq(t,D)]+1
N = #docs in corpus
D)=freq(t,d)/Max[f(t,d)]
12
Document Length
Normalization
pivoted (t , D) =
weight (t , D)
slope
1+
normalized _ weight (t , D)
(1 slope) povot
Probability
of relevance
slope
pivot
Probability of retrieval
Doc. length
13
Stopwords / Stoplist
Prepositions
Articles
Pronouns
Some adverbs and adjectives
Some frequent words (e.g. document)
14
Stemming
Reason:
Stemming:
comput
15
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix
stripping, Program, 14(3) :130-137)
Step 4:
Step 3:
SSES -> SS
(*v*) ING ->
(m>1) AL ->
(m>1) ANCE ->
Step 5:
(m>1) E ->
16
Lemmatization
17
Result of indexing
Inverted file:
comput {(D1,0.2), (D2,0.1), }
Inverted file is used during retrieval for higher efficiency.
18
Retrieval
Retrieval model
Implementation
19
Cases
1-word query:
The documents to be retrieved are those that
include the word
- Retrieve the inverted list for the word
- Sort in decreasing order of the weight of the
word
Multi-word query?
Combining several lists
- How to interpret the weight?
(IR model)
-
20
IR models
21
Boolean model
e.g.
Problems:
22
Extensions to Boolean
model
(for document ordering)
Interpretation:
A possible Evaluation:
R(D, ti) = ti(D);
R(D, Q1 Q2) = min(R(D, Q1), R(D, Q2));
R(D, Q1 Q2) = max(R(D, Q1), R(D, Q2));
R(D, Q1) = 1 - R(D, Q1).
23
Query
Q=
bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
24
Matrix representation
Document
space
t1 t 2
t3
tn
D1
a1n
D2
a2n
D3
a3n
amn
Qb1 b2
b3
Term
vector
space
bn
25
Sim( D, Q ) = (ai * bi )
(a * b )
i
Sim( D, Q ) =
Sim( D, Q ) =
ai * bi
2
t2
2 (ai * bi )
i
ai + bi
2
Jaccard
Dice
t1
(a * b )
Sim( D, Q ) =
a + b (a * b )
i
26
Implementation (space)
Stored as:
D1 {(t1, a1), (t2,a2), }
t1 {(D1,a1), }
27
Implementation (time)
O(|Q|*n)
28
Other similarities
Cosine:
Sim( D, Q) =
(a * b )
i
a *b
2
=
i
ai
a
j
bi
2
j
2
j
2
2
use
and
a
b
j j
j j to normalize the
weights after indexing
Dot product
(Similar operations do not apply to Dice
and Jaccard)
29
Probabilistic model
P(t
= xi | R)
( t i = xi )D
= P(ti = 1 | R) xi P(ti = 0 | R) (1 xi ) = pi i (1 pi ) (1 xi )
x
ti
ti
P( D | NR ) = P(ti = 1 | NR ) xi P(ti = 0 | NR ) (1 xi ) = qi i (1 qi ) (1 xi )
x
ti
ti
30
(1 xi )
i
p
(
1
p
)
i
i
x
P( D | R)
Odd ( D) = log
= log
P ( D | NR )
ti
(1 xi )
i
q
(
1
q
)
i
i
x
ti
pi (1 qi )
1 pi
= xi log
+ log
qi (1 pi ) ti
1 qi
ti
pi (1 qi )
xi log
qi (1 pi )
ti
31
pi =
Ri
qi =
N Ri
ri
ni-ri
Rel.
doc.
with ti
Irrel.doc Doc.
.
with ti
with ti
Ri-ri
N-Ri
n+ri
Rel.
doc.
without
ti
Ri
ni
N-ni
Doc.
Irrel.doc without
ti
.
without
ti
N-Ri
N
Rel. doc Irrel.doc Samples
32
.
Odd ( D ) = xi
ti
33
BM25
(k1 + 1)tf (k3 + 1)qtf
avdl dl
Score( D, Q) = w
+ k2 | Q |
K + tf
k3 + qtf
avdl + dl
tQ
K = k1 ((1 b) + b
dl
)
avdl dl
(Classic) Presentation of
results
35
System evaluation
relevant
retrieved 36
General form of
precision/recall
Precision
1.0
Recall
1.0
An illustration of P/R
calculation
List Rel?
Doc Y
1
Doc
2
Precision
1.0 -
Doc Y
3
Doc Y
4
Doc
* (0.2, 1.0)
0.8 -
* (0.6, 0.75)
* (0.4, 0.67)
0.6 0.4 -
* (0.6, 0.6)
* (0.2, 0.5)
0.2 0.0
|
0.2
|
0.4
|
0.6
|
0.8
|
1.0
38
Recall
1
1
j
n Qi | Ri | D j Ri rij
n = # test queries
E.g. Rank:
1
4
1st rel. doc.
5
8
2nd rel. doc.
10
3rd rel. doc.
1 1 1 2 3
1 1 2
MAP = [ ( + + ) + ( + )]
2 3 1 5 10 2 4 8
39
F-measure = 2 P * R / (P + R)
Average precision = average at 11 points of recall
Precision at n document (often used for Web IR)
Expected search length (no. irrelevant documents to read
before obtaining n relevant doc.)
40
Test corpus
An evaluation example
(SMART)
Run number:
1
2
Num_queries:
52
52
Total number of documents over all queries
Retrieved:
780
780
Relevant:
796
796
Rel_ret:
246
229
Recall - Precision Averages:
at 0.00
0.7695
0.7894
at 0.10
0.6618
0.6449
at 0.20
0.5019
0.5090
at 0.30
0.3745
0.3702
at 0.40
0.2249
0.3070
at 0.50
0.1797
0.2104
at 0.60
0.1143
0.1654
at 0.70
0.0891
0.1144
at 0.80
0.0891
0.1096
at 0.90
0.0699
0.0904
at 1.00
0.0699
0.0904
Average precision
11-pt Avg:
% Change:
Recall:
Exact:
at 5 docs:
at 10 docs:
at 15 docs:
at 30 docs:
Precision:
Exact:
0.2936
At 5 docs:
At 10 docs:
At 15 docs:
At 30 docs:
0.4166
0.2726
0.3572
0.4166
0.4166
0.3154
0.4308
0.3538
0.3154
0.1577
42
0.4192
0.3327
0.2936
0.1468
TREC evaluation
methodology
44
Tracks (tasks)
45
NTCIR:
46
Impact of TREC
Some techniques to
improve IR effectiveness
48
Effect of RF
2nd retrieval
* * *
* *
**
**
1st retrieval
x x
* * *
x
x
*Q* R* Q * NR x
new
*
x *
x x
*
*
x
49
Modified relevance
feedback
50
Query expansion
Q = information retrieval
Q = (information + data + knowledge + )
(retrieval + search + seeking + )
Corpus analysis:
Ambiguity:
table = data structure, furniture?
Lack of precision:
operating, system less precise than operating_system
Suggested solution
53
Theory
Bayesian networks
P(Q|D)
D1
D2
t1
c1
Inference
c2
D3
t2
t3
t4
c3
c4
cl
Dm
tn
revision
Language models
54
Logical models
55
Related applications:
Information filtering
ignore
User profile
56
IR for (semi-)structured
documents
Hierarchical indexing
INEX experiments
57
PageRank in Google
I1
I2
PR( I i )
PR( A) = (1 d ) + d
C(Ii )
i
58
IR on the Web
59
Final remarks on IR
Why is IR difficult
Vocabularies mismatching
61