
CSCI 7000

Modern Information Retrieval

Jim Martin

Lecture 3
9/3/2008
Today 9/5

 Review
 Dictionary contents
Advanced query handling
 Phrases
 Wildcards
 Spelling correction
 First programming assignment



Index: The Dictionary file and a Postings file

The sorted (term, docID, term frequency) pairs from the two documents are merged
into a dictionary file and a postings file:

Term        #Docs   Coll. freq   Postings (docID:freq)
ambitious     1         1        2:1
be            1         1        2:1
brutus        2         2        1:1  2:1
capitol       1         1        1:1
caesar        2         3        1:1  2:2
did           1         1        1:1
enact         1         1        1:1
hath          1         1        2:1
I             1         2        1:2
i'            1         1        1:1
it            1         1        2:1
julius        1         1        1:1
killed        1         2        1:2
let           1         1        2:1
me            1         1        1:1
noble         1         1        2:1
so            1         1        2:1
the           2         2        1:1  2:1
told          1         1        2:1
you           1         1        2:1
was           2         2        1:1  2:1
with          1         1        2:1

The dictionary file stores each term with its document count and collection
frequency; the postings file stores the per-document entries the dictionary
points to.
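
A minimal Python sketch (illustrative names, not the lecture's file format) of how sorted (term, docID) pairs are collapsed into the dictionary and postings shown above:

from collections import defaultdict

def build_index(term_doc_pairs):
    # dictionary[term] -> (number of docs containing term, collection frequency)
    # postings[term]   -> list of (docID, frequency of term in that doc)
    postings = defaultdict(list)
    for term, doc_id in term_doc_pairs:          # pairs assumed sorted by term, then doc
        plist = postings[term]
        if plist and plist[-1][0] == doc_id:     # another occurrence in the same doc
            plist[-1] = (doc_id, plist[-1][1] + 1)
        else:
            plist.append((doc_id, 1))
    dictionary = {t: (len(pl), sum(tf for _, tf in pl)) for t, pl in postings.items()}
    return dictionary, postings

# build_index([("caesar", 1), ("caesar", 2), ("caesar", 2)])
# -> dictionary["caesar"] == (2, 3), postings["caesar"] == [(1, 1), (2, 2)]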



Review: Dictionary

 What goes into creating the dictionary?


 Tokenization
 Case folding
 Stemming
 Stop-listing
 Dealing with numbers (and number-like entities)
 Complex morphology



Phrasal queries
 Want to handle queries such as
 “Colorado Buffaloes” – as a phrase
 This concept is popular with users; about 10% of
web queries are phrasal queries
 Postings that consist of document lists alone are
not sufficient to handle phrasal queries
 Two general approaches
 Biword indexing
 Positional indexing



Solution 1: Biword Indexing

 Index every consecutive pair of terms in
the text as a phrase
 For example the text “Friends, Romans,
Countrymen” would generate the biwords
 friends romans
 romans countrymen
 Each of these biwords is now a dictionary
term
 Two-word phrase query-processing is now
free
(Not really.)
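
A small sketch of biword generation (the helper name is mine):

def biwords(tokens):
    # index every consecutive pair of tokens as a single dictionary term
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# biwords(["friends", "romans", "countrymen"])
# -> ["friends romans", "romans countrymen"]
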
Longer Phrasal Queries

 Longer phrases can be broken into the
Boolean AND queries on the component
biwords
 “Colorado Buffaloes at Arizona”
 (Colorado Buffaloes) AND (Buffaloes at)
AND (at Arizona)

Susceptible to Type 1 errors (false positives)

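A sketch of the decomposition, reusing the biwords helper above and assuming biword_postings maps each biword to a set of docIDs (names are illustrative):

def biword_phrase_query(phrase, biword_postings):
    # AND together the postings of the component biwords; surviving docs
    # may still be false positives (they need not contain the full phrase)
    terms = phrase.lower().split()
    result = None
    for bw in biwords(terms):
        docs = biword_postings.get(bw, set())
        result = docs if result is None else result & docs
    return result or set()

# biword_phrase_query("Colorado Buffaloes at Arizona", biword_postings)
# intersects the postings of "colorado buffaloes", "buffaloes at", "at arizona"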


Solution 2: Positional Indexing

 Change our posting content


 Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>



Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
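
The same example as a Python structure (purely illustrative); the phrase-matching sketch two slides below assumes this shape:

# positional_index[term] = {docID: [position1, position2, ...]}
positional_index = {
    "be": {1: [7, 18, 33, 72, 86, 231],
           2: [3, 149],
           4: [17, 191, 291, 430, 434],
           5: [363, 367]},
}
# the document frequency of "be" is then len(positional_index["be"])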



Processing a phrase query
 Extract inverted index entries for each distinct
term: to, be, or, not.
 Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
 to:
 2:1,17,74,222,551; 4:8,16,190,429,433;
7:13,23,191; ...
 be:
 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
 Same general method for proximity searches

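A sketch of the positional merge, assuming the positional_index mapping shown earlier; it requires the query terms to appear at consecutive positions, and a proximity search would simply relax that distance test:

def positional_phrase_query(phrase, positional_index):
    terms = phrase.lower().split()
    if not terms:
        return set()
    postings = [positional_index.get(t, {}) for t in terms]

    # candidate documents must contain every query term
    candidates = set(postings[0])
    for plist in postings[1:]:
        candidates &= set(plist)

    matches = set()
    for doc in candidates:
        position_sets = [set(plist[doc]) for plist in postings]
        # term i must appear at position p + i for some start position p
        if any(all(p + i in position_sets[i] for i in range(1, len(terms)))
               for p in position_sets[0]):
            matches.add(doc)
    return matches

# positional_phrase_query("to be or not to be", positional_index)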


Positional index size

 As we’ll see you can compress position
values/offsets
 But a positional index still expands the
postings storage substantially
 Nevertheless, it is now the standard
approach because of the power and
usefulness of phrase and proximity queries
… whether used explicitly or implicitly in a
ranked retrieval system.



Rules of thumb

 Positional index size 35–50% of volume of
original text
 Caveat: all of this holds for “English-like”
languages



Combination Techniques

 Biwords are faster.


 And they cover a large percentage of very
frequent (implied) phrasal queries
 Britney Spears
 So it can be effective to combine positional
indexes with biword indexes for frequent
bigrams



Web

 Cuil
 Yahoo! BOSS



Programming Assignment: Part 1

 Download and install Lucene


 How does Lucene handle (by default)
 Case, stemming, and phrasal queries
 Download and index a collection that I will
point you at
 How big is the resulting index?
 Terms and size of index
 Return the Top N document IDs (hits) from
a set of queries I’ll provide.



Programming Assignment: Part 2

 Make it better



Wild Card Queries

 Two flavors
 Word-based
 Caribb*
 Phrasal
 “Pirates * Caribbean”
 General approach
 Generate a set of new queries from the
original
 Operation on the dictionary
 Run those queries in a not stupid way



Simple Single Wild-card queries: *

 Single instance of a *
* means a string of length 0 or more
 This is not Kleene *.
 mon*: find all docs containing any word
beginning “mon”.
 Index your lexicon on prefixes
 *mon: find words ending in “mon”
 Maintain a backwards index
Exercise: from this, how can we enumerate all terms
meeting the wild-card query pro*cent ?
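
A minimal sketch, assuming the lexicon fits in memory as sorted lists (a real system would use B-tree-style dictionaries); the terms here are illustrative:

import bisect

def prefix_matches(sorted_terms, prefix):
    # all terms in a sorted lexicon that begin with the given prefix
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

lexicon = sorted(["monday", "money", "moon", "percent", "prudent", "salmon"])
reversed_lexicon = sorted(t[::-1] for t in lexicon)      # the backwards index

mon_star = prefix_matches(lexicon, "mon")                               # mon*
star_mon = {t[::-1] for t in prefix_matches(reversed_lexicon, "nom")}   # *mon
# pro*cent: every term that begins with "pro" AND ends with "cent"
pro_cent = set(prefix_matches(lexicon, "pro")) & \
           {t[::-1] for t in prefix_matches(reversed_lexicon, "tnec")}
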
Arbitrary Wildcards

 How can we handle multiple *’s in the
middle of a query term?
 The solution: transform every wild-card
query so that the *’s occur at the end
 This gives rise to the Permuterm Index.



Permuterm Index

 For term hello index under:
hello$, ello$h, llo$he, lo$hel, o$hell
where $ is a special symbol.
 Example

Query = hel*o

Rotate

Lookup o$hel*
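
A sketch of the rotations and the query rotation ($ marks the end of the term, as above; helper names are mine):

def permuterm_rotations(term):
    # every rotation of term + '$' becomes a key in the permuterm index
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(wildcard_query):
    # move the single * to the end: hel*o -> o$hel*
    prefix, suffix = wildcard_query.split("*")
    return suffix + "$" + prefix + "*"

# permuterm_rotations("hello") includes hello$, ello$h, llo$he, lo$hel, o$hell
# rotate_query("hel*o") == "o$hel*"; answer it with a prefix lookup on "o$hel"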



Permuterm query processing

 Rotate query wild-card to the right


 Now use indexed lookup as before.
 Permuterm problem: ≈ quadruples lexicon
size (an empirical observation for English).



Spelling Correction

 Two primary uses


 Correcting document(s) being indexed
 Retrieve matching documents when query
contains a spelling error
 Two main flavors:
 Isolated word
 Check each word on its own for misspelling
 Will not catch typos resulting in correctly spelled words
e.g., from → form
 Context-sensitive
 Look at surrounding words, e.g., I flew form
Heathrow to Narita.
Document correction

 Primarily for OCR’ed documents


 Correction algorithms tuned for this
 Goal: the index (dictionary) contains fewer
OCR-induced misspellings
 Can use domain-specific knowledge
 E.g., OCR can confuse O and D more often
than it would confuse O and I (adjacent on
the QWERTY keyboard, so more likely
interchanged in typing).



Query correction

 Our principal focus here


 E.g., the query Alanis Morisett
 We can either
 Retrieve using that spelling
 Retrieve documents indexed by the correct
spelling, OR
 Return several suggested alternative
queries with the correct spelling
 Did you mean … ?



Isolated word correction

 Fundamental premise – there is a lexicon
from which the correct spellings come
 Two basic choices for this
 A standard lexicon such as
 Webster’s English Dictionary
 An “industry-specific” lexicon – hand-maintained
 The lexicon of the indexed corpus
 E.g., all words on the web
 All names, acronyms etc.
 (Including the mis-spellings)



Isolated word correction

 Given a lexicon and a character sequence
Q, return the words in the lexicon closest
to Q
 What’s “closest”?
 We’ll study several alternatives
 Edit distance
 Weighted edit distance
 Character n-gram overlap



Edit distance

 Given two strings S1 and S2, the minimum
number of basic operations to convert one to
the other
 Basic operations are typically character-level
 Insert
 Delete
 Replace
 E.g., the edit distance from cat to dog is 3.
 Generally found by dynamic programming.

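A minimal dynamic-programming sketch; sub_cost is a hook for the weighted variant on the next slide (unit costs by default):

def edit_distance(s1, s2, sub_cost=lambda a, b: 1):
    # d[i][j] = cost of turning s1[:i] into s2[:j]
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                   # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                                   # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + rep)      # replace (or keep)
    return d[m][n]

# edit_distance("cat", "dog") == 3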


Weighted edit distance

 As above, but the weight of an operation
depends on the character(s) involved
 Meant to capture keyboard errors, e.g. m
more likely to be mis-typed as n than as q
 Therefore, replacing m by n is a smaller
edit distance than by q
 (Same ideas usable for OCR, but with
different weights)
 Require weight matrix as input
 Modify dynamic programming to handle
weights (Viterbi)
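
A toy substitution-cost function (weights invented for illustration) that plugs into the edit_distance sketch above:

# neighbouring keys on the bottom QWERTY row; confusing them is cheaper
QWERTY_NEIGHBORS = {"m": "n", "n": "bm", "b": "vn", "v": "cb",
                    "c": "xv", "x": "zc", "z": "x"}

def keyboard_sub_cost(a, b):
    near = b in QWERTY_NEIGHBORS.get(a, "") or a in QWERTY_NEIGHBORS.get(b, "")
    return 0.5 if near else 1.0

# edit_distance("form", "forn", sub_cost=keyboard_sub_cost) == 0.5
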
Using edit distances

 Given query, first enumerate all dictionary
terms within a preset (weighted) edit
distance
 Then look up enumerated dictionary terms
in the term-document inverted index



Edit distance to all dictionary terms?

 Given a (misspelled) query – do we
compute its edit distance to every
dictionary term?
 Expensive and slow
 How do we cut the set of candidate
dictionary terms?
 Here we can use n-gram overlap for this

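One way to cut the candidate set: keep only terms whose character-bigram overlap with the query clears a (here arbitrary) Jaccard threshold, and run edit distance on those alone:

def char_ngrams(word, n=2):
    # e.g. bigrams of "lord" -> {"lo", "or", "rd"}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def candidate_terms(query, lexicon, threshold=0.3):
    q = char_ngrams(query)
    keep = []
    for term in lexicon:
        t = char_ngrams(term)
        if len(q & t) / len(q | t) >= threshold:
            keep.append(term)
    return keep

# candidate_terms("bord", ["board", "border", "lord", "morbid"])
# -> ["board", "border", "lord"]   (morbid's overlap is too small)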


Context-sensitive spell correction

 Text: I flew from Heathrow to Narita.


 Consider the phrase query “flew form
Heathrow”
 We’d like to respond

Did you mean “flew from Heathrow”?
because no docs matched the query phrase.



Context-sensitive correction
 Need surrounding context to catch this.
 NLP too heavyweight for this.
 First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
 Now try all possible resulting phrases with one
word “fixed” at a time
 flew from heathrow
 fled form heathrow
 flea form heathrow
 etc.
 Suggest the alternative that has lots of hits?

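A sketch of the enumeration, assuming a close_terms(word) helper (e.g. the edit-distance candidates from earlier slides) and a hit_count(phrase) that runs the phrase query; both names are hypothetical:

def one_word_fixes(query, close_terms):
    # variants of the query with exactly one word replaced by a nearby term
    words = query.split()
    variants = []
    for i, w in enumerate(words):
        for alt in close_terms(w):
            if alt != w:
                variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants

# e.g. one_word_fixes("flew form heathrow", close_terms) would include
# "flew from heathrow"; pick max(variants, key=hit_count) to suggest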


Exercise

Suppose that for “flew form Heathrow”
we have 7 alternatives for flew, 19 for
form and 3 for heathrow.
How many “corrected” phrases will we
enumerate in this scheme?



General issue in spell correction

 Will enumerate multiple alternatives for
“Did you mean”
 Need to figure out which one (or small
number) to present to the user
 Use heuristics
 The alternative hitting most docs
 Query log analysis + tweaking
 For especially popular, topical queries
 Language modeling



Computational cost

 Spell-correction is computationally
expensive
 Avoid running routinely on every query?
 Run only on queries that matched few docs



Next Time

 On to Chapter 4
 Real indexing

