
11-741:

Information Retrieval

Retrieval Models:
Boolean and Vector Space

Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu
Lecture Outline

Introduction to retrieval models


Exact-match retrieval
Unranked Boolean retrieval model
Ranked Boolean retrieval model
Best-match retrieval
Vector space retrieval model
Term weights
Boolean query operators (p-norm)

© 2003, Jamie Callan


What is a Retrieval Model?

A model is an abstract representation of a process or object


Used to study properties, draw conclusions, make predictions
The quality of the conclusions depends upon how closely the
model represents reality
A retrieval model describes the human and computational
processes involved in ad-hoc retrieval
Example: A model of human information seeking behavior
Example: A model of how documents are ranked computationally
Components: Users, information needs, queries, documents,
relevance assessments, ...
Retrieval models define relevance, explicitly or implicitly



Basic IR Processes

Information Need Document

Representation Representation

Query Indexed Objects

Comparison

Evaluation/Feedback Retrieved Objects



Overview of Major Retrieval Models

Boolean
Vector space
Basic vector space
Extended Boolean model
Probabilistic models
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever)
Page rank (Google)



Types of Retrieval Models:
Exact Match vs. Best Match Retrieval

Exact Match
Query specifies precise retrieval criteria
Every document either matches or fails to match query
Result is a set of documents
Usually in no particular order
Often in reverse-chronological order
Best Match
Query describes retrieval criteria for desired documents
Every document matches a query to some degree
Result is a ranked list of documents, best first



Overview of Major Retrieval Models

Boolean Exact Match


Vector space Best Match
Basic vector space
Extended Boolean model
Latent Semantic Indexing (LSI)
Probabilistic models Best Match
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever) Best Match
Page rank (Google) Exact Match



Exact Match vs. Best Match Retrieval

Best-match models are usually more accurate / effective


Good documents appear at the top of the rankings
Good documents often don't exactly match the query
Query was too strict
Multimedia AND NOT (Apple OR IBM OR Microsoft)
Document didn't match user expectations
"multi database retrieval" vs. "multi db retrieval"
Exact match still prevalent in some markets
Large installed base that doesn't want to migrate to new software
Efficiency: Exact match is very fast, good for high volume tasks
Sufficient for some tasks
Web advanced search



Retrieval Model #1:
Unranked Boolean

Unranked Boolean is most common Exact Match model


Model
Retrieve documents iff they satisfy a Boolean expression
Query specifies precise relevance criteria
Documents returned in no particular order
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Near/1, Sentence/1, Paragraph/1, ...
String matching operators: Wildcard
Field operators: Date, Author, Title
Note: Unranked Boolean model not the same as Boolean queries
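
The model above maps naturally onto set operations over an inverted index. The following is a minimal sketch, not a description of any particular system: the toy documents, the `build_index` helper, and the `postings` helper are all made up for illustration.

```python
# Toy collection: document id -> text (made-up data for illustration).
docs = {
    1: "distributed information retrieval",
    2: "database selection for distributed search",
    3: "information theory",
}

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Return the set of documents containing the term (empty set if unseen)."""
    return index.get(term, set())

# AND is set intersection, OR is set union, AND NOT is set difference.
and_result = postings("distributed") & postings("information")
or_result = postings("database") | postings("information")
and_not_result = postings("information") - postings("retrieval")

print(sorted(and_result))      # [1]
print(sorted(or_result))       # [1, 2, 3]
print(sorted(and_not_result))  # [3]
```

The result of each query is a plain set, which is why the documents come back in no particular order. An unconstrained NOT would require the complement against the whole collection, which hints at why it is expensive.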



Unranked Boolean:
Example

A typical Boolean query


(distributed NEAR/1 information NEAR/1 retrieval) OR
((collection OR database OR resource) NEAR/1 selection) OR
(multi NEAR/1 database NEAR/1 search) OR
((FIELD author callan) AND CORI) OR
((FIELD author hawking) AND NOT stephen)

Studies show that people are not good at creating Boolean queries
People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)



Unranked Boolean:
WESTLAW

Large commercial system


Serves legal and professional markets
Legal: Court cases, statutes, regulations, ...
News: Newspapers, magazines, journals, ...
Financial: Stock quotes, SEC materials, financial analyses
Total collection size: 5-7 Terabytes
700,000 users
In operation since 1974
Best-match and free text queries added in 1992



Unranked Boolean:
WESTLAW

Boolean operators
Proximity operators
Phrases: "Carnegie Mellon"
Word proximity: Language /3 Technologies
Same sentence (/s) or paragraph (/p): Microsoft /s anti trust
Restrictions: Date (After 1996 & Before 1999)
Query expansion:
Wildcard and truncation: Al*n Lev!
Automatic expansion of plurals and possessives
Document structure (fields): Title (gGlOSS)
Citations: Cites (Callan) & Date (After 1996)



Unranked Boolean:
WESTLAW

Queries are typically developed incrementally


Implicit relevance feedback
V1: machine AND learning
V2: (machine AND learning) OR (neural AND networks) OR
(decision AND tree)
V3: (machine AND learning) OR (neural AND networks) OR
(decision AND tree) AND C4.5 OR Ripper OR EG OR EM
Queries are complex
Proximity operators used often
NOT is rare
Queries are long (9-10 words, on average)



Unranked Boolean:
WESTLAW

Two stopword lists: 1 for indexing, 50 for queries


Legal thesaurus: To help people expand queries
Document ordering: Determined by user (often date)
Query term highlighting: Help people see how document matched
Performance:
Response time controlled by varying amount of parallelism
Average response time set to 7 seconds
Even for collections larger than 100 GB



Unranked Boolean:
Summary

Advantages:
Very efficient
Predictable, easy to explain
Structured queries
Works well when searchers know exactly what is wanted
Disadvantages:
Most people find it difficult to create good Boolean queries
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
Documents that are close are not retrieved



Term Weights:
A Brief Introduction

The words of a text are not equally indicative of its meaning


Most scientists think that butterflies use the position of the sun
in the sky as a kind of compass that allows them to determine
which way is north. Scientists think that butterflies may use other
cues, such as the earth's magnetic field, but we have a lot to learn
about monarchs' sense of direction.
Important: Butterflies, monarchs, scientists, direction, compass
Unimportant: Most, think, kind, sky, determine, cues, learn
Term weights reflect the (estimated) importance of each term
There are many variations on how they are calculated
Historically it was tf weights
The standard approach for many IR systems is tf.idf weights



Retrieval Model #2:
Ranked Boolean

Ranked Boolean is another common Exact Match retrieval model


Model
Retrieve documents iff they satisfy a Boolean expression
Query specifies precise relevance criteria
Documents returned ranked by frequency of query terms
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Proximity
String matching operators: Wildcard
Field operators: Date, Author, Title



Exact Match Retrieval:
Ranked Boolean

How document scores are calculated:


Term weight: How often term i occurs in document j ($tf_{i,j}$)
Alternative: $tf_{i,j} \cdot idf_i$ weight
AND weight: Minimum of argument weights
$\mathrm{AND}(t_{1,j}, t_{2,j}, \ldots, t_{n,j}) = \min(t_{1,j}, t_{2,j}, \ldots, t_{n,j})$
OR weight: Maximum of argument weights
$\mathrm{OR}(t_{1,j}, t_{2,j}, \ldots, t_{n,j}) = \max(t_{1,j}, t_{2,j}, \ldots, t_{n,j})$
Alternative: Sum of argument weights
$\mathrm{OR}(t_{1,j}, t_{2,j}, \ldots, t_{n,j}) = \sum_{i} t_{i,j}$
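
The scoring rules above can be sketched as a small recursive evaluator. This is an illustrative sketch only: the tuple-based query tree, the `score` function, and the `"SUM"` operator name (standing in for the sum-based OR variant) are assumptions, not part of the lecture.

```python
def score(node, tf):
    """Score a query tree against one document's term frequencies.

    node is either a term (str) or a tuple (operator, child, child, ...).
    tf maps each term in the document to its frequency.
    """
    if isinstance(node, str):
        return tf.get(node, 0)       # term weight: raw tf, 0 if absent
    op, *children = node
    scores = [score(child, tf) for child in children]
    if op == "AND":
        return min(scores)           # AND: minimum of argument weights
    if op == "OR":
        return max(scores)           # OR: maximum of argument weights
    if op == "SUM":
        return sum(scores)           # OR variant: sum of argument weights
    raise ValueError(f"unknown operator: {op}")

# Hypothetical document and query.
doc_tf = {"machine": 3, "learning": 2, "neural": 1}
query = ("OR", ("AND", "machine", "learning"), ("AND", "neural", "networks"))

print(score(query, doc_tf))  # 2
```

With raw tf weights, a score of 0 means the document fails the Boolean expression, so the model stays exact-match: only documents with a positive score are returned, ordered by score.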



Exact Match Retrieval:
Ranked Boolean

Advantages:
All of the advantages of the unranked Boolean model
Very efficient, predictable, easy to explain, structured queries,
works well when searchers know exactly what is wanted
Result set is ordered by how redundantly a document satisfies a query
Usually enables a person to find relevant documents more quickly
Other term weighting methods can be used, too
Example: tf, tf.idf, ...



Exact Match Retrieval:
Ranked Boolean

Disadvantages:
It's still an Exact-Match model
Good Boolean queries hard to create
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
The returned documents match expectations...
...so it is easy to forget that many relevant documents are missed
Documents that are close are not retrieved



Are Boolean Retrieval Models Still Relevant?

Many people prefer Boolean


Professional searchers (e.g., librarians, paralegals)
Some Web surfers (e.g., Advanced Search feature)
About 80% of WESTLAW & Dialog searches are Boolean
What do they like? Control, predictability, understandability
Boolean and free-text queries find different documents
Solution: Retrieval models that support free-text and Boolean queries
Recall that almost any retrieval model can be Exact Match
Extended Boolean (vector space) retrieval model
Bayesian inference networks



Retrieval Model #3:
Vector Space Model

Retrieval Model:
Any text object can be represented by a term vector
Examples: Documents, queries, sentences, ...
Similarity is determined by distance in a vector space
Example: The cosine of the angle between the vectors
The SMART system:
Developed at Cornell University, 1960-1999
Still used widely



How Different Retrieval Models
View Ad-hoc Retrieval

Boolean
Query: A set of first-order logic (FOL) conditions a document must satisfy
Retrieval: Deductive inference

Vector space
Query: A short document
Retrieval: Finding similar objects



Vector Space Retrieval Model:
Introduction

How are documents represented in the binary vector model?


        Term1  Term2  Term3  Term4  Termn
Doc1      1      0      0      1      1
Doc2      0      1      1      0      0
Doc3      1      0      1      0      0
  :       :      :      :      :      :
A document is represented as a vector of binary values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector
Query 0 0 1 0 1
Linear algebra is used to determine which vectors are similar
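
A minimal sketch of how such binary vectors might be built, assuming whitespace tokenization and a vocabulary taken from the collection itself. The sample texts and the `binary_vector` helper are made up for illustration.

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector with one dimension per vocabulary term."""
    terms = set(text.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]

# Hypothetical three-document collection.
docs = [
    "monarch butterflies migrate",
    "butterflies use the sun",
    "magnetic compass",
]

# One dimension per distinct term in the corpus, in a fixed (sorted) order.
vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
doc_vectors = [binary_vector(doc, vocabulary) for doc in docs]

# An unstructured query is represented in the same space.
query_vector = binary_vector("sun compass", vocabulary)

print(vocabulary)
for vector in doc_vectors:
    print(vector)
print(query_vector)
```

Because documents and queries share one vocabulary ordering, any two vectors can be compared dimension by dimension.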



Vector Space Representation:
Linear Algebra Review

Formally, a vector space is defined by a set of linearly independent basis vectors.
Basis vectors:
correspond to the dimensions or directions in the vector space;
determine what can be described in the vector space; and
must be orthogonal, or linearly independent, i.e. a value along
one dimension implies nothing about a value along another.

[Figure: basis vectors for 2 dimensions and basis vectors for 3 dimensions]
Vector Space Representation

What should be the basis vectors for information retrieval?


Basic concepts:
Difficult to determine (Philosophy? Cognitive science?)
Orthogonal (by definition)
A relatively static vector space (there are no new ideas)
Terms (Words, Word stems):
Easy to determine
Not really orthogonal (orthogonal enough?)
A constantly growing vector space
new vocabulary creates new dimensions



Vector Space Representation:
Vector Coefficients

The coefficients (vector elements, term weights) represent term presence, importance, or representativeness
The vector space model does not specify how to set term weights
Some common elements:
Document term weight: Importance of the term in this document
Collection term weight: Importance of the term in this collection
Length normalization: Compensate for documents of different lengths
Naming convention for term weight functions: DCL.DCL
First triple is document term weights, second triple is query term weights
n=none (no weighting on that factor)
Example: lnc.ltc



Vector Space Representation:
Term Weights

There have been many vector-space term weight functions


lnc.ltc: A very popular and commonly-used choice
l: log (tf) + 1
n: No weight / normalization (i.e., 1.0)
t: log (N / df)
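c: cosine normalization, i.e. divide by $\sqrt{\sum_i w_i^2}$
For example:
$$\frac{\sum_i d_i \, q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}} = \frac{\sum \left(\log(tf) + 1\right) \left(\log(qtf) + 1\right) \log\frac{N}{df}}{\sqrt{\sum \left(\log(tf) + 1\right)^2}\,\sqrt{\sum \left(\left(\log(qtf) + 1\right) \log\frac{N}{df}\right)^2}}$$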
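
The lnc.ltc weighting can be sketched in a few lines. This is a hedged illustration, not the SMART implementation: the function names, the dictionary-based vectors, the use of natural logarithms (the slide does not fix a base), and the sample statistics are all assumptions.

```python
import math

def lnc(doc_tf):
    """Document weights: l = log(tf) + 1, n = no idf, c = cosine normalization."""
    weights = {t: math.log(tf) + 1 for t, tf in doc_tf.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def ltc(query_tf, df, n_docs):
    """Query weights: l = log(qtf) + 1, t = log(N / df), c = cosine normalization."""
    weights = {
        t: (math.log(qtf) + 1) * math.log(n_docs / df[t])
        for t, qtf in query_tf.items()
        if qtf > 0 and df.get(t, 0) > 0
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # If every query term occurs in every document, all idf values are 0.
    return {t: w / norm for t, w in weights.items()} if norm > 0 else weights

def similarity(doc_weights, query_weights):
    """Inner product of the two normalized vectors, i.e. their cosine."""
    return sum(doc_weights.get(t, 0.0) * w for t, w in query_weights.items())

# Hypothetical collection statistics and term counts.
n_docs = 1000
df = {"butterflies": 10, "compass": 100, "sun": 200}
doc_weights = lnc({"butterflies": 3, "compass": 1, "sun": 2})
query_weights = ltc({"butterflies": 1, "compass": 1}, df, n_docs)

print(similarity(doc_weights, query_weights))
```

Since both vectors are length-normalized before the inner product, the resulting score is the cosine of the angle between them.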



Term Weights:
A Brief Introduction

Inverse document frequency (idf):


Terms that occur in many documents in the collection are less useful
for discriminating among documents
Document frequency (df): number of documents containing the term
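idf often calculated as: $I = \log\left(\frac{N}{df}\right) + 1$
idf is $-\log p$, where $p = df / N$ is the probability that a document contains the term; this is the term's information content (loosely, its entropy)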
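
A small sketch of the idf formula above, assuming natural logarithms (the slide does not specify a base) and made-up document frequencies for a hypothetical collection of 1,000 documents.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: log(N / df) + 1."""
    return math.log(n_docs / df) + 1

n_docs = 1000  # hypothetical collection size

# Made-up document frequencies: a term in every document, then rarer terms.
for term, df in [("the", 1000), ("compass", 100), ("monarchs", 10)]:
    print(term, round(idf(df, n_docs), 2))
```

A term that occurs in every document gets the minimum weight of 1.0, and the weight grows as the term becomes rarer.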



Vector Space Representation:
Term Weights

Cosine normalization favors short documents


Pivoting helps, but shifts the bias towards long documents
$\text{pivoted normalization} = (1.0 - \text{slope}) + \text{slope} \times \frac{\text{old normalization}}{\text{average old normalization}}$
Empirically, cosine normalization approximates # unique terms
lnu and Lnu document weighting:
u: pivoted unique normalization
$\frac{1}{0.8 + 0.2 \times \frac{\text{Num Unique Terms}}{\text{Avg Num Unique Terms}}}$
L: compensate for higher tfs in long documents
$\frac{\log(tf) + 1}{\log(\text{avg tf in doc}) + 1}$
(Singhal, 1996; Singhal, 1998)
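
The Lnu document weighting can be sketched as follows. This is an illustration under stated assumptions: the function name `lnu_weights`, the dictionary input, natural logarithms, and the default slope of 0.2 (matching the 0.8 / 0.2 constants on the slide) are choices made here, not details confirmed by the cited papers.

```python
import math

def lnu_weights(doc_tf, avg_unique_terms, slope=0.2):
    """Lnu weights for one document.

    L: (log(tf) + 1) / (log(average tf in doc) + 1)
    n: no idf
    u: 1 / ((1 - slope) + slope * num_unique_terms / avg_unique_terms)
    """
    num_unique = len(doc_tf)
    avg_tf = sum(doc_tf.values()) / num_unique
    pivoted_norm = (1.0 - slope) + slope * num_unique / avg_unique_terms
    return {
        term: ((math.log(tf) + 1) / (math.log(avg_tf) + 1)) / pivoted_norm
        for term, tf in doc_tf.items()
    }

# A document with exactly the average number of unique terms is not penalized.
average_doc = lnu_weights({"a": 1, "b": 1}, avg_unique_terms=2)

# A document with twice the average number of unique terms is scaled down.
long_doc = lnu_weights({"a": 1, "b": 1, "c": 1, "d": 1}, avg_unique_terms=2)

print(average_doc["a"], long_doc["a"])
```

The slope controls how strongly document length affects the weights: with a slope of 0 the normalization disappears, and with a slope of 1 it is fully proportional to the number of unique terms.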
Vector Space Representation:
Term Weights

dnb.dtn: A more recent attempt to handle varied document lengths


d: 1 + ln (1 + ln (tf)), (0 if tf = 0)
t: log ((N+1)/df)
b: 1 / (0.8 + 0.2 * doclen / avg_doclen)
What next?

(Singhal, et al, 1999)
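
The three dnb.dtn components can be combined into a score as sketched below. Reading "dnb.dtn" through the DCL.DCL naming convention (document: d, no idf, b; query: d, t, no normalization) is an interpretation, and the function names and sample numbers are hypothetical.

```python
import math

def d_weight(tf):
    """d: 1 + ln(1 + ln(tf)), or 0 if tf = 0."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def t_weight(df, n_docs):
    """t: log((N + 1) / df)."""
    return math.log((n_docs + 1) / df)

def b_norm(doc_len, avg_doc_len):
    """b: 1 / (0.8 + 0.2 * doclen / avg_doclen)."""
    return 1 / (0.8 + 0.2 * doc_len / avg_doc_len)

def dnb_dtn_score(doc_tf, doc_len, avg_doc_len, query_tf, df, n_docs):
    """Sum over shared terms of (document dnb weight) * (query dtn weight)."""
    total = 0.0
    for term, qtf in query_tf.items():
        if term in doc_tf and df.get(term, 0) > 0:
            doc_w = d_weight(doc_tf[term]) * b_norm(doc_len, avg_doc_len)
            query_w = d_weight(qtf) * t_weight(df[term], n_docs)
            total += doc_w * query_w
    return total

# Hypothetical statistics: the same term counts in a short and a long document.
df = {"monarchs": 10}
short_score = dnb_dtn_score({"monarchs": 2}, 50, 100, {"monarchs": 1}, df, 1000)
long_score = dnb_dtn_score({"monarchs": 2}, 400, 100, {"monarchs": 1}, df, 1000)

print(short_score, long_score)
```

With equal term frequencies, the shorter document scores higher because the b factor shrinks as document length grows relative to the average.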


Vector Space Retrieval Model

How are documents represented using a tf.idf representation?


TermID    T1     T2     T3     T4     Tn
Doc 1    0.33   0.00   0.00   0.47   0.53
Doc 2    0.00   0.64   0.53   0.00   0.00
Doc 3    0.43   0.00   0.25   0.00   0.00
  :        :      :      :      :      :
A document is represented as a vector of tf.idf values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector
Q 0.00 0.00 1.00 0.00 2.00
Linear algebra is used to determine which vectors are similar



Vector Space Similarity

[Figure: Doc1, Doc2, and the Query drawn as vectors in a term space with axes such as Term2 and Term3]
Similarity is inversely related to the angle between the vectors.
Doc2 is the most similar to the Query.
Rank the documents by their similarity to the Query.



Vector Space Similarity:
Common Measures

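Sim(X,Y) for binary term vectors and for weighted term vectors

Inner product (# nonzero dimensions)
Binary: $|X \cap Y|$
Weighted: $\sum_i x_i y_i$

Dice coefficient (length-normalized inner product)
Binary: $\frac{2\,|X \cap Y|}{|X| + |Y|}$
Weighted: $\frac{2 \sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$

Cosine coefficient (like Dice, but lower penalty with diff # features)
Binary: $\frac{|X \cap Y|}{\sqrt{|X|}\,\sqrt{|Y|}}$
Weighted: $\frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$

Jaccard coefficient (like Dice, but penalizes low overlap cases)
Binary: $\frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$
Weighted: $\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$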
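
The four measures can be written directly from their weighted-vector forms; for binary vectors the same code reduces to the set-based forms. This is a minimal sketch with hypothetical function names, using list-based vectors, and the sample vectors are the Doc1 and Query rows from the binary example earlier in the lecture.

```python
import math

def inner(x, y):
    """Inner product: sum of x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    """Dice coefficient: 2 * inner / (sum x_i^2 + sum y_i^2)."""
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    """Cosine coefficient: inner / (||x|| * ||y||)."""
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def jaccard(x, y):
    """Jaccard coefficient: inner / (sum x_i^2 + sum y_i^2 - inner)."""
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

doc1 = [1, 0, 0, 1, 1]
query = [0, 0, 1, 0, 1]

for measure in (inner, dice, cosine, jaccard):
    print(measure.__name__, round(measure(doc1, query), 3))
```

For these two vectors the overlap is one term out of three and two nonzero dimensions, so Jaccard gives the lowest normalized score and cosine the highest, in line with the annotations in the table.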



Vector Space Similarity:
Example

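Query term weights: (0.0, 0.2, 0.0)
Doc 1 term weights: (0.3, 0.1, 0.4)
Doc 2 term weights: (0.8, 0.5, 0.6)

$$Sim(D_1, Q) = \frac{(0.0 \times 0.3) + (0.2 \times 0.1) + (0.0 \times 0.4)}{\sqrt{0.0^2 + 0.2^2 + 0.0^2} \times \sqrt{0.3^2 + 0.1^2 + 0.4^2}} = \frac{0.02}{0.10} = 0.20$$

$$Sim(D_2, Q) = \frac{(0.0 \times 0.8) + (0.2 \times 0.5) + (0.0 \times 0.6)}{\sqrt{0.0^2 + 0.2^2 + 0.0^2} \times \sqrt{0.8^2 + 0.5^2 + 0.6^2}} = \frac{0.10}{0.22} = 0.45$$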
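
The arithmetic of this example can be checked with a short script. The `cosine` helper is a hypothetical name; the term weights are the ones given in the example.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

query = [0.0, 0.2, 0.0]
doc1 = [0.3, 0.1, 0.4]
doc2 = [0.8, 0.5, 0.6]

sim_d1 = cosine(doc1, query)
sim_d2 = cosine(doc2, query)

print(round(sim_d1, 2))  # 0.2
print(round(sim_d2, 2))  # 0.45

# Rank documents by decreasing similarity to the query.
ranking = sorted([("Doc 1", sim_d1), ("Doc 2", sim_d2)], key=lambda pair: -pair[1])
print([name for name, _ in ranking])  # ['Doc 2', 'Doc 1']
```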



Vector Space Similarity:
The Extended Boolean (p-norm) Model

Some similarity measures mimic Boolean AND, OR, NOT


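$$Sim_{OR}(Q, D) = \left( \frac{\sum_i q_i^p \, d_i^p}{\sum_i q_i^p} \right)^{\frac{1}{p}}$$

$$Sim_{AND}(Q, D) = 1 - \left( \frac{\sum_i q_i^p \, (1 - d_i)^p}{\sum_i q_i^p} \right)^{\frac{1}{p}}$$

$$Sim_{NOT}(Q, D) = 1 - Sim(Q, D)$$

p indicates the degree of strictness.
1 is the least strict.
$\infty$ is the most strict, i.e. the Boolean case.
2 tends to be a good choice.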
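
A small sketch of the p-norm operators makes the role of p concrete. It assumes query and document weights in the range [0, 1]; the function names and the sample vectors are hypothetical.

```python
def sim_or(q, d, p):
    """p-norm OR: (sum q_i^p * d_i^p / sum q_i^p) ** (1/p)."""
    numerator = sum((qi ** p) * (di ** p) for qi, di in zip(q, d))
    denominator = sum(qi ** p for qi in q)
    return (numerator / denominator) ** (1 / p)

def sim_and(q, d, p):
    """p-norm AND: 1 - (sum q_i^p * (1 - d_i)^p / sum q_i^p) ** (1/p)."""
    numerator = sum((qi ** p) * ((1 - di) ** p) for qi, di in zip(q, d))
    denominator = sum(qi ** p for qi in q)
    return 1 - (numerator / denominator) ** (1 / p)

def sim_not(sim):
    """p-norm NOT: 1 - Sim(Q, D)."""
    return 1 - sim

# Two equally weighted query terms; the document contains only the first one.
q = [1.0, 1.0]
d = [1.0, 0.0]

for p in (1, 2, 50):
    print(p, round(sim_or(q, d, p), 3), round(sim_and(q, d, p), 3))
```

At p = 1 both operators return the same average (0.5 here), so AND and OR are indistinguishable. As p grows, OR approaches the maximum of the weights and AND approaches the minimum, which is the strict Boolean behavior.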
Vector Space Similarity:
The Extended Boolean (p-norm) Model

Characteristics:
Full Boolean queries
(White AND House) OR
(Bill AND Clinton AND (NOT Hillary))
Ranked retrieval
Handles tf.idf weighting
Effective
Boolean operators are not used often with the vector space model
It is not clear why



Vector Space Retrieval Model:
Summary

Standard vector space


each dimension corresponds to a term in the vocabulary
vector elements are real-valued, reflecting term importance
any vector (document, query, ...) can be compared to any other
cosine correlation is the similarity metric used most often
Ranked Boolean
vector elements are binary
Extended Boolean
multiple nested similarity measures



Vector Space Retrieval Model:
Disadvantages

Assumed independence relationship among terms


Lack of justification for some vector operations
e.g. choice of similarity function
e.g., choice of term weights
Barely a retrieval model
Doesn't explicitly model relevance, a person's information need,
language models, etc.
Assumes a query and a document can be treated the same
Lack of a cognitive (or other) justification



Vector Space Retrieval Model:
Advantages

Simplicity: Easy to implement


Effectiveness: It works very well
Ability to incorporate any kind of term weights
Can measure similarities between almost anything:
documents and queries, documents and documents, queries and
queries, sentences and sentences, etc.
Used in a variety of IR tasks:
Retrieval, classification, summarization, SDI, visualization, ...

The vector space model is the most popular retrieval model (today)



For Additional Information

G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill. 1968.


G. Salton. The SMART Retrieval System - Experiments in Automatic Document Processing.
Prentice-Hall. 1971.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill. 1983.
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer. Addison-Wesley. 1989.
A. Singhal. AT&T at TREC-6. In The Sixth Text Retrieval Conference (TREC-6). National
Institute of Standards and Technology special publication 500-240.
A. Singhal, J. Choi, D. Hindle, D.D. Lewis, and F. Pereira. AT&T at TREC-7. In The Seventh
Text Retrieval Conference (TREC-7). National Institute of Standards and Technology special
publication 500-242.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. SIGIR-96.
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman. Indexing by latent
semantic analysis. Journal of the American Society for Information Science, 41, pp. 391-407. 1990.
E. A. Fox. Extending the Boolean and Vector Space models of Information Retrieval with
P-Norm queries and multiple concept types. PhD thesis, Cornell University. 1983.

