retrieval
Modelling
Introduction
IR systems usually adopt index terms to
process queries
Index term:
Indexing results in records like
Doc12: napoleon, france, revolution, emperor
or (weighted terms)
Doc12: napoleon-8, france-6, revolution-4, emperor-7
Instead, the information is inverted:
napoleon : doc12, doc56, doc87, doc99
or (weighted)
napoleon : doc12-8, doc56-3, doc87-5, doc99-2
The inverted file is organised so that a
given index term can be found quickly.
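The inversion step above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the names `forward_index` and `invert` are hypothetical, Doc12 is taken from the example above, and the terms listed for doc56 (other than napoleon with weight 3) are invented for the demonstration.

```python
from collections import defaultdict

# Hypothetical forward index: document id -> {index term: weight}.
# doc12 follows the slide; doc56's "waterloo" entry is made up for illustration.
forward_index = {
    "doc12": {"napoleon": 8, "france": 6, "revolution": 4, "emperor": 7},
    "doc56": {"napoleon": 3, "waterloo": 5},
}

def invert(forward):
    """Turn doc id -> {term: weight} into term -> {doc id: weight}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for term, weight in terms.items():
            inverted[term][doc_id] = weight
    return dict(inverted)

inverted_index = invert(forward_index)

# Looking up a term is now a single dictionary access.
print(inverted_index["napoleon"])  # {'doc12': 8, 'doc56': 3}
```

A hash map is only one way to make the term lookup fast; a sorted term list or a B-tree would serve the same purpose.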
To find documents satisfying the
query
[Figure: the docs are represented by index terms and the information need by a query; doc and query are matched, and the matches are ranked.]
IR Models
User Task
  Retrieval: Adhoc, Filtering
    Classic Models: boolean, vector, probabilistic
      Set Theoretic: Fuzzy, Extended Boolean
      Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
      Probabilistic: Inference Network, Belief Network
    Structured Models: Non-Overlapping Lists, Proximal Nodes
  Browsing
    Browsing: Flat, Structure Guided, Hypertext
The IR model, the logical view of the docs, and the retrieval
task are distinct aspects of the system
Ad hoc retrieval:
[Figure: queries Q1, Q2, Q3, Q4, Q5 submitted against a collection of fixed size.]
Filtering:
[Figure: a documents stream is compared against the User 1 profile and the User 2 profile, yielding the docs for User 1 and the docs filtered for User 2.]
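$q = k_a \wedge (k_b \vee \neg k_c)$
$\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
[Figure: Venn diagram of Ka, Kb, Kc with the regions (1,0,0), (1,1,0), (1,1,1) marked.]
$sim(q,d_j) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$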
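As a rough illustration of the Boolean matching rule, the sketch below enumerates the conjunctive components of a query over three terms and tests documents against them. It assumes the query is ka AND (kb OR NOT kc), which is the reading consistent with the three components (1,0,0), (1,1,0), (1,1,1); the function names `query` and `sim` are hypothetical, and documents are assumed to be binary vectors over (ka, kb, kc) only.

```python
from itertools import product

def query(ka, kb, kc):
    """Assumed query: ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

# The disjunctive normal form is the set of binary vectors (ka, kb, kc)
# for which the query evaluates to true; each one is a conjunctive component.
q_dnf = {bits for bits in product((0, 1), repeat=3) if query(*bits)}

def sim(doc_vector):
    """Return 1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in q_dnf else 0

print(sorted(q_dnf))     # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
print(sim((1, 1, 0)))    # 1
print(sim((1, 0, 1)))    # 0
```

The result is strictly 0 or 1, so the model has no notion of partial match or ranking.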
Example:
Terms: K1, ..., K8.
Documents:
Define:
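$sim(q,d_j) = \cos\theta = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_i w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$
where $\theta$ is the angle between the vectors $\vec{d}_j$ and $\vec{q}$.
Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$,
$0 \le sim(q,d_j) \le 1$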
A document is retrieved even if it matches the
query terms only partially
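The cosine formula and the partial-matching behaviour can be sketched as follows. This is an illustrative sketch only: `cosine_sim` is a hypothetical helper, the zero-vector guard is an added assumption, and all weights are invented numbers over three index terms.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector is treated as matching nothing
    return dot / (norm_d * norm_q)

# Hypothetical weights; the query uses the first two terms only.
query = [0.5, 0.8, 0.0]
full_match = [0.4, 0.9, 0.0]     # contains both query terms
partial_match = [0.7, 0.0, 0.3]  # contains only the first query term
no_match = [0.0, 0.0, 0.6]       # contains neither query term

for name, doc in [("full", full_match), ("partial", partial_match), ("none", no_match)]:
    print(name, round(cosine_sim(doc, query), 2))
```

The partially matching document still receives a non-zero score, lower than the full match, which is what makes ranking possible.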
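Let N be the total number of documents in the collection, ni the number of documents that contain the term ki, and freq(i,j) the raw frequency of ki in document dj.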
Example:
A collection includes 10,000 documents
The term A appears 20 times in a particular
document
The maximum frequency of any term in this
document is 50
The term A appears in 2,000 of the collection
documents.
f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
wij = f(i,j) * log2(N/ni) = 0.4 * 2.32 = 0.93
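The arithmetic of this example can be checked with a short script. The function name `tf_idf` and its parameter names are hypothetical; the logarithm is taken in base 2, since that is the base for which the logarithm of 5 comes out as 2.32.

```python
import math

def tf_idf(freq, max_freq, n_docs, docs_with_term):
    """Normalised term frequency times inverse document frequency (base-2 log)."""
    tf = freq / max_freq                       # f(i,j) = freq(i,j) / max freq in the doc
    idf = math.log2(n_docs / docs_with_term)   # idf(i) = log2(N / ni)
    return tf * idf

w = tf_idf(freq=20, max_freq=50, n_docs=10_000, docs_with_term=2_000)
print(round(w, 2))  # 0.93
```

A term that occurs in every document gets idf 0 under this definition, so it contributes nothing to the weight.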
Advantages:
term-weighting improves quality of the answer set
partial matching allows retrieval of docs that
approximate the query conditions
cosine ranking formula sorts documents according to
degree of similarity to the query
Disadvantages:
[Figure: documents d1 to d7 positioned relative to the index terms k1, k2, k3.]

            d1    d2    d3    d4    d5    d6    d7
k1          1     1     0     1     1     1     0
k2          0     0     1     0     1     1     1
k3          1     0     1     0     1     0     0
q · dj      2     1     2     1     3     2     1
|dj|        1.41  1     1.41  1     1.73  1.41  1
Sim(dj,q)   0.82  0.58  0.82  0.58  1     0.82  0.58

|q| = 1.73
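The figures in this example can be reproduced with the sketch below. The query vector is not shown explicitly; q = (1, 1, 1) is inferred here from |q| = 1.73 and the q · dj values, so treat it as an assumption. The variable names are hypothetical.

```python
import math

# Binary term weights (k1, k2, k3) per document, read column by column from the example.
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
q = (1, 1, 1)  # assumption: inferred from |q| = 1.73 and the q . dj values

def norm(v):
    return math.sqrt(sum(w * w for w in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Cosine similarity of every document to the query, rounded as in the example.
sims = {name: round(dot(d, q) / (norm(d) * norm(q)), 2) for name, d in docs.items()}

# Rank documents by decreasing similarity (ties keep their original order).
ranking = sorted(sims, key=sims.get, reverse=True)

print(sims)
print(ranking)
```

Document d5 contains all three terms and ranks first with similarity 1; documents with a single term tie at 0.58.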
[Figure: documents d1 to d7 positioned relative to the index terms k1, k2, k3.]

            d1    d2    d3    d4    d5    d6    d7
k1          1     1     0     1     1     1     0
k2          0     0     1     0     1     1     1
k3          1     0     1     0     1     0     0
q · dj      4     1     5     1     6     3     2
[Figure: documents d1 to d7 positioned relative to the index terms k1, k2, k3.]

            d1    d2    d3    d4    d5    d6    d7
k1          2     1     0     2     1     1     0
k2          0     0     1     0     2     2     5
k3          1     0     3     0     4     0     0
q · dj      5     1     11    2     17    5     10
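The q · dj values of the two preceding examples are consistent with a weighted query q = (1, 2, 3) over (k1, k2, k3). That query vector is not stated in the text; it is inferred from the numbers, so the sketch below should be read as a consistency check under that assumption rather than as the original setup.

```python
def inner_product(d, q):
    """Unnormalised similarity: sum of products of document and query term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

q = (1, 2, 3)  # assumption: inferred from the q . dj values, not given in the text

# Term weights (k1, k2, k3) for d1..d7: binary documents, then weighted documents.
binary_docs = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (1, 1, 0), (0, 1, 0)]
weighted_docs = [(2, 0, 1), (1, 0, 0), (0, 1, 3), (2, 0, 0), (1, 2, 4), (1, 2, 0), (0, 5, 0)]

binary_scores = [inner_product(d, q) for d in binary_docs]
weighted_scores = [inner_product(d, q) for d in weighted_docs]

print(binary_scores)    # [4, 1, 5, 1, 6, 3, 2]
print(weighted_scores)  # [5, 1, 11, 2, 17, 5, 10]
```

With weighted documents, d7 jumps from a score of 2 to 10 because its single term k2 carries weight 5, which shows how raw inner products favour heavily weighted documents when no length normalisation is applied.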