Jim Martin
Lecture 3
9/3/2008
Today 9/5
Review
Dictionary contents
Advanced query handling
Phrases
Wildcards
Spelling correction
First programming assignment
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
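A minimal sketch of how a positional index answers a two-word phrase such as “to be”: a doc qualifies only if some position of the second term is exactly one past a position of the first. The positions for "be" are copied from the slide (the elided tail of doc 5 is left out); the positions for "to" and the function name `phrase_docs` are made up for illustration.

```python
def phrase_docs(postings_a, postings_b):
    """Return doc IDs where term B occurs immediately after term A."""
    hits = []
    # Only docs that contain both terms can contain the phrase.
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        positions_b = set(postings_b[doc])
        if any(p + 1 in positions_b for p in postings_a[doc]):
            hits.append(doc)
    return hits

# "be" positions from the slide; the "…" in doc 5 is not filled in.
be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
      4: [17, 191, 291, 430, 434], 5: [363, 367]}
# Hypothetical "to" positions, invented only for this example.
to = {1: [4, 50], 2: [1, 148], 4: [16, 190, 429, 433], 5: [100]}

print(phrase_docs(to, be))  # [2, 4]
```

A longer phrase like “to be or not to be” would chain the same adjacency test across each consecutive pair of terms.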
Cuil
Yahoo! BOSS
Make it better
Two flavors
Word-based
Caribb*
Phrasal
“Pirates * Caribbean”
General approach
Generate a set of new queries from the original
Operation on the dictionary
Run those queries in a not stupid way
Single instance of a *
* means a string of length 0 or more
This is not Kleene *.
mon*: find all docs containing any word beginning “mon”.
Index your lexicon on prefixes
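One way to picture a prefix lookup, assuming the lexicon is simply kept as a sorted list (a real system would more likely use a B-tree or trie): binary-search to the first term that is not less than the prefix, then walk forward while terms still start with it. The toy lexicon and the name `prefix_terms` are assumptions for illustration.

```python
from bisect import bisect_left

def prefix_terms(sorted_lexicon, prefix):
    """Return every term in a sorted list that starts with `prefix`."""
    start = bisect_left(sorted_lexicon, prefix)
    end = start
    # All terms sharing a prefix sit next to each other in sorted order.
    while end < len(sorted_lexicon) and sorted_lexicon[end].startswith(prefix):
        end += 1
    return sorted_lexicon[start:end]

# Hypothetical toy lexicon, already sorted.
lexicon = ["moan", "mon", "monday", "money", "month", "moon", "salmon"]
print(prefix_terms(lexicon, "mon"))  # ['mon', 'monday', 'money', 'month']
```

Each term returned is then looked up in the ordinary inverted index, and the postings are unioned.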
*mon: find words ending in “mon”
Maintain a backwards index
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent?
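One possible answer to the exercise, sketched under the assumption that both indexes are plain sorted lists: look up pro* in the forward index, look up *cent as a prefix of the reversed terms in the backwards index, and intersect the two sets. The length check rules out terms where the prefix and suffix would have to overlap. The lexicon entries and function names are hypothetical.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """Return every term in a sorted list that starts with `prefix`."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

def single_star_terms(lexicon, query):
    """Enumerate terms matching a query with exactly one *, e.g. pro*cent."""
    prefix, suffix = query.split("*")
    forward = sorted(lexicon)                          # index on prefixes
    backward = sorted(term[::-1] for term in lexicon)  # backwards index
    starts = set(prefix_terms(forward, prefix))
    ends = {rev[::-1] for rev in prefix_terms(backward, suffix[::-1])}
    # The prefix and suffix must not overlap inside the matched term.
    min_len = len(prefix) + len(suffix)
    return sorted(t for t in starts & ends if len(t) >= min_len)

# Hypothetical toy lexicon; some entries are made-up strings.
lexicon = ["cent", "percent", "proceed", "procent", "prominent", "prorecent"]
print(single_star_terms(lexicon, "pro*cent"))  # ['procent', 'prorecent']
```

The intersection can be expensive when either side is large, which is one motivation for the rotation trick used with arbitrary wildcards.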
11/02/21 CSCI 7000 - IR 18
Arbitrary Wildcards
Query = hel*o
Rotate
Lookup o$hel*
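The rotation step can be sketched as follows, assuming the rotated terms are stored in a sorted list of (rotation, term) pairs. Every rotation of term + "$" is indexed. The query hel*o is rotated so the * lands at the end, giving the prefix o$hel, and a prefix lookup over the rotations recovers the matching terms. The function names and toy lexicon are assumptions for illustration.

```python
from bisect import bisect_left

def build_permuterm(lexicon):
    """Index every rotation of term + '$', each pointing back to its term."""
    entries = []
    for term in lexicon:
        marked = term + "$"
        for i in range(len(marked)):
            entries.append((marked[i:] + marked[:i], term))
    entries.sort()
    return entries

def wildcard_lookup(entries, query):
    """Answer a single-* query: hel*o is rotated to the prefix o$hel."""
    head, tail = query.split("*")
    key = tail + "$" + head
    # (key,) sorts before any (key..., term) pair, so this finds the first candidate.
    i = bisect_left(entries, (key,))
    matches = set()
    while i < len(entries) and entries[i][0].startswith(key):
        matches.add(entries[i][1])
        i += 1
    return sorted(matches)

# Hypothetical toy lexicon.
index = build_permuterm(["halo", "hello", "helo", "help", "hero"])
print(wildcard_lookup(index, "hel*o"))  # ['hello', 'helo']
```

The cost of this convenience is index size: a term of length n contributes n + 1 rotations.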
Heathrow”
We’d like to respond
Spell-correction is computationally expensive
Avoid running routinely on every query?
Run only on queries that matched few docs
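A rough sketch of that gating idea: answer the query first and only pay for correction when the result set is small. The threshold `min_hits`, the toy index, and the function name are assumptions. The standard library's `difflib.get_close_matches` stands in for a real corrector here; it ranks by a sequence-similarity ratio rather than edit distance.

```python
import difflib

def search_with_suggestion(index, term, min_hits=2):
    """Return (hits, suggestion); suggestion is None unless few docs matched."""
    hits = index.get(term, set())
    if len(hits) >= min_hits:
        return hits, None  # enough results: skip the expensive step
    # Few or no hits: look for a close vocabulary term other than the query itself.
    vocabulary = [t for t in index if t != term]
    close = difflib.get_close_matches(term, vocabulary, n=1)
    return hits, (close[0] if close else None)

# Hypothetical term -> doc ID index.
index = {"flew": {1, 2, 3}, "from": {1, 2, 4}, "heathrow": {1, 4, 9}}
print(search_with_suggestion(index, "flew"))      # ({1, 2, 3}, None)
print(search_with_suggestion(index, "heathrwo"))  # (set(), 'heathrow')
```

Returning the suggestion alongside the original hits leaves it to the caller whether to rerun the query or just show a "did you mean" prompt.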
On to Chapter 4
Real indexing