Information Retrieval
Retrieval Models:
Boolean and Vector Space
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu
Lecture Outline
Representation
Comparison
Boolean
Vector space
Basic vector space
Extended Boolean model
Probabilistic models
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever)
PageRank (Google)
Exact Match
Query specifies precise retrieval criteria
Every document either matches or fails to match query
Result is a set of documents
Usually in no particular order
Often in reverse-chronological order
Best Match
Query describes retrieval criteria for desired documents
Every document matches a query to some degree
Result is a ranked list of documents, best first
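The contrast between the two families can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's implementation: the toy documents, the `build_index` helper, and the choice to rank by the number of matching query terms are all assumptions made for the example.

```python
from collections import defaultdict

# Toy collection; the document ids and text are made up for illustration.
DOCS = {
    1: "boolean retrieval returns a set of documents",
    2: "vector space retrieval returns a ranked list",
    3: "ranked retrieval with boolean operators",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def exact_match_and(index, terms):
    """Exact match: return the set of documents containing every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def best_match(index, terms):
    """Best match: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for t in terms:
        for doc_id in index.get(t, set()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

INDEX = build_index(DOCS)
print(exact_match_and(INDEX, ["boolean", "ranked"]))  # {3}
print(best_match(INDEX, ["boolean", "ranked"]))       # [3, 1, 2]
```

The exact-match query returns an unordered set and drops documents 1 and 2 entirely, while the best-match query returns every partially matching document, best first.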
Studies show that people are not good at creating Boolean queries
People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)
Boolean operators
Proximity operators
Phrases: Carnegie Mellon
Word proximity: Language /3 Technologies
Same sentence (/s) or paragraph (/p): Microsoft /s anti trust
Restrictions: Date (After 1996 & Before 1999)
Query expansion:
Wildcard and truncation: Al*n Lev!
Automatic expansion of plurals and possessives
Document structure (fields): Title (gGlOSS)
Citations: Cites (Callan) & Date (After 1996)
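As a rough sketch of how a word proximity operator such as `Language /3 Technologies` might be evaluated, the Python below checks whether two terms occur within `n` word positions of each other. The exact semantics of `/n` differ between systems; here it is assumed to mean "in either order, positions differ by at most n", and the `within` and `positions` helpers are hypothetical names.

```python
def positions(tokens, term):
    """Return every index at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within(text, left, right, n):
    """True if `left` and `right` occur at most n positions apart (either order).

    Assumption: whitespace tokenization and case-insensitive matching.
    """
    tokens = text.lower().split()
    left_pos = positions(tokens, left.lower())
    right_pos = positions(tokens, right.lower())
    return any(abs(i - j) <= n for i in left_pos for j in right_pos)

text = "the Language and Information Technologies program"
print(within(text, "Language", "Technologies", 3))  # True
print(within(text, "Language", "Technologies", 2))  # False
```

A production system would read term positions from a positional index instead of rescanning the text, but the matching condition has the same shape.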
Advantages:
Very efficient
Predictable, easy to explain
Structured queries
Works well when searchers know exactly what is wanted
Disadvantages:
Most people find it difficult to create good Boolean queries
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
Documents that are close are not retrieved
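The precision and recall trade-off can be made concrete with a small Python sketch. The relevant set and the two result sets below are invented for illustration; `precision_recall` is a hypothetical helper using the usual set-based definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}                    # assumed relevance judgments
strict_result = {1}                        # a query that is too strict
loose_result = {1, 2, 3, 5, 6, 7, 8, 9}    # a query that is too loose

print(precision_recall(strict_result, relevant))  # (1.0, 0.25)
print(precision_recall(loose_result, relevant))   # (0.375, 0.75)
```

Tightening the query raises precision at the cost of recall, and loosening it does the reverse, which is the inverse relationship the slide describes.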
Advantages:
All of the advantages of the unranked Boolean model
Very efficient, predictable, easy to explain, structured queries,
works well when searchers know exactly what is wanted
Result set is ordered by how redundantly a document satisfies a query
Usually enables a person to find relevant documents more quickly
Other term weighting methods can be used, too
Examples: tf, tf.idf, ...
Disadvantages:
It's still an Exact-Match model
Good Boolean queries hard to create
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
The returned documents match expectations,
so it is easy to forget that many relevant documents are missed
Documents that are close are not retrieved
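One way to order a Boolean result set by how redundantly each document satisfies the query is sketched below in Python. The scoring rule is an assumption made for illustration, not something the slides specify: a term scores its tf in the document, AND takes the minimum of its operands, and OR takes the maximum. The query encoding as nested tuples is also hypothetical.

```python
from collections import Counter

def score(query, tf):
    """Score one document for a nested query; 0 means it does not match.

    query: a term (str) or a tuple ("AND" | "OR", subquery, ...).
    tf:    mapping from term to its frequency in the document.
    """
    if isinstance(query, str):
        return tf.get(query, 0)
    op, *args = query
    scores = [score(a, tf) for a in args]
    if op == "AND":
        return min(scores)   # assumption: AND is as strong as its weakest operand
    if op == "OR":
        return max(scores)   # assumption: OR is as strong as its best operand
    raise ValueError(f"unknown operator: {op}")

def ranked_boolean(query, docs):
    """Return (doc_id, score) pairs for matching documents, best first."""
    results = []
    for doc_id, text in docs.items():
        s = score(query, Counter(text.lower().split()))
        if s > 0:            # still exact match: non-matching documents are dropped
            results.append((doc_id, s))
    return sorted(results, key=lambda r: (-r[1], r[0]))

docs = {"a": "white house white house", "b": "white house", "c": "white paper"}
print(ranked_boolean(("AND", "white", "house"), docs))  # [('a', 2), ('b', 1)]
```

Document `c` is close to the query but fails the AND, so it is not retrieved at all, which illustrates the last disadvantage listed.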
Retrieval Model:
Any text object can be represented by a term vector
Examples: Documents, queries, sentences, ...
Similarity is determined by distance in a vector space
Example: The cosine of the angle between the vectors
The SMART system:
Developed at Cornell University, 1960-1999
Still used widely
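A minimal sketch of the vector space idea in Python: each text object becomes a term vector, and similarity is the cosine of the angle between two vectors. Raw term counts are used as weights purely as a simplifying assumption; `term_vector` and `cosine` are hypothetical helper names, not functions from SMART.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text object as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0           # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

query = term_vector("vector space model")
doc1 = term_vector("boolean retrieval model")
doc2 = term_vector("the vector space retrieval model")

# Rank documents by similarity to the query, best first.
ranking = sorted(
    [("doc1", cosine(query, doc1)), ("doc2", cosine(query, doc2))],
    key=lambda pair: -pair[1],
)
print(ranking)
```

Because the query is treated as just another short document, the same `cosine` function compares queries to documents, documents to documents, or sentences to sentences.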
Boolean
Query: A set of first-order logic (FOL) conditions a document must satisfy
Retrieval: Deductive inference
Vector space
Query: A short document
Retrieval: Finding similar objects
Similarity is inversely related to the angle between the vectors.
[Figure: Doc1, Doc2, and Query drawn as vectors against term axes (Term2 labeled)]
Doc2 is the most similar to the Query.
\[ Sim_{AND}(Q, D) = 1 - \left( \frac{\sum_i q_i^p \, (1 - d_i)^p}{\sum_i q_i^p} \right)^{1/p} \]
\[ Sim_{NOT}(Q, D) = 1 - Sim(Q, D) \]
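The extended Boolean (p-norm) similarities can be sketched in Python as follows. Here `q` and `d` are assumed to be parallel lists of query-term and document-term weights in [0, 1], and `p >= 1` controls how strictly the operators behave. The `sim_or` function uses the standard p-norm OR form, which is included as an assumption because it does not appear in the formulas on this slide.

```python
def sim_and(q, d, p):
    """p-norm AND: one minus the weighted distance from the ideal point (1, ..., 1)."""
    num = sum((qi ** p) * ((1 - di) ** p) for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return 1 - (num / den) ** (1.0 / p)

def sim_or(q, d, p):
    """p-norm OR (assumed standard form): weighted distance from the point (0, ..., 0)."""
    num = sum((qi ** p) * (di ** p) for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return (num / den) ** (1.0 / p)

def sim_not(sim):
    """NOT: complement of the operand's similarity."""
    return 1 - sim

q = [1.0, 1.0]   # both query terms equally important
d = [1.0, 0.0]   # the document contains only the first term

for p in (1, 2, 10):
    print(p, round(sim_and(q, d, p), 4), round(sim_or(q, d, p), 4))
```

With `p = 1` AND and OR give the same score (0.5 here), behaving like a simple average. As `p` grows, AND moves toward the minimum of the term weights and OR toward the maximum, which is how the model interpolates between vector-space-style and strict Boolean matching.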
Characteristics:
Full Boolean queries
(White AND House) OR
(Bill AND Clinton AND (NOT Hillary))
Ranked retrieval
Handles tf.idf weighting
Effective
Boolean operators are not used often with the vector space model
It is not clear why
The vector space model is the most popular retrieval model (today)