Alessandro Agostini
agostini@science.unitn.it
http://www.dit.unitn.it/agostini/
Alessandro Agostini
Università di Trento
ITC-IRST Trento
Thanks to:
S. Chakrabarti
R. Baeza-Yates
B. Ribeiro-Neto
B. D. Davison
© Alessandro Agostini, A Graduate Course in Web Information Retrieval, ICT International Graduate School, Trento, Spring 2006
IR Process
I think in that figure all the lines should have been solid!
Towards Matching...
Documents: texts
Towards Indexing...
Index term:
a keyword or group of selected words;
any word (full text, more general);
(usually) a noun - nouns have meaning;
represents (a part of) the document semantics.
But,
Document Pre-Processing
What to Index?
Text
Data / Text level
All text?
Phrases?
Punctuation?
Hyphenation (the "-" mark in compound words)?
Word roots (stems)?
Metatext
Metadata level
Categories / Concepts (LSI)
Structure
Keywords (at metadata level - labels)
Titles, authors, publication date,...
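The text-level questions above (punctuation, hyphenation) come down to concrete tokenization choices. A minimal sketch, assuming lowercased ASCII text and a hypothetical `tokenize` helper that is not part of the lecture, shows how one switch changes what ends up in the index:

```python
import re

def tokenize(text: str, split_hyphens: bool = False) -> list[str]:
    """Hypothetical helper: lowercase, drop punctuation, optionally split compounds."""
    text = text.lower()
    if split_hyphens:
        # Treat the hyphen as a word separator: "on-line" -> "on", "line".
        text = text.replace("-", " ")
    # Runs of letters/digits, optionally joined by inner hyphens; punctuation is skipped.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)

kept = tokenize("State-of-the-art retrieval, on-line!")
split = tokenize("State-of-the-art retrieval, on-line!", split_hyphens=True)
```

Keeping compounds whole yields fewer, more specific terms; splitting them yields more general terms (several of them stopwords), so the choice affects both precision and recall.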
Document Analysis
Text Operations
Stopwords
Problems:
Queries containing only stopwords are ruled out
Polysemous words that are stopwords in one sense but not in others
E.g.: "can" as a verb vs. "can" as a noun.
Elimination of stopwords might reduce recall:
EXAMPLE:
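As a hedged illustration of the problems listed above, here is a minimal sketch. It assumes a small hand-picked stopword list and whitespace tokenization; both are hypothetical and not taken from the lecture.

```python
# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "and", "be", "can", "not", "of", "or", "the", "to"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

only_stopwords = index_terms("to be or not to be")  # nothing survives
noun_can = index_terms("a can of beans")            # "can" the noun is lost too
```

A query made only of stopwords leaves no terms to match, and the noun "can" is discarded together with the verb, so documents about cans can no longer be retrieved through that term.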
Stopwords (cont)
Stemming
Stemming (cont)
Techniques
morphological analysis (e.g., Porter's algorithm)
dictionary lookup (e.g., WordNet)
Stemming may increase recall, but at the price of precision
Problem cases: abbreviations, polysemy, and names coined in the technical and commercial sectors
E.g.: stemming "ides" to "IDE", "SOCKS" to "sock", or "gated" to "gate" may be bad!
Stemming reduces the size of the vocabulary
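The recall/precision trade-off can be seen in a toy suffix stripper. This is a minimal sketch with a few rules loosely modelled on the first steps of Porter's algorithm, not a full or faithful implementation. The function name `naive_stem` and the case folding are assumptions made for illustration.

```python
import re

def naive_stem(word: str) -> str:
    """Toy stemmer: lowercase, then strip a few common suffixes."""
    w = word.lower()
    # Plural endings (loosely after Porter's step 1a).
    if w.endswith("sses"):
        w = w[:-2]
    elif w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Past tense and gerund endings (loosely after Porter's step 1b).
    for suffix in ("ed", "ing"):
        stem = w[: -len(suffix)]
        if w.endswith(suffix) and re.search(r"[aeiou]", stem):
            w = stem
            if w.endswith(("at", "bl", "iz")):
                w += "e"  # restore a final "e": "gat" -> "gate"
            break
    return w

good = {naive_stem(w) for w in ("connected", "connecting", "connections")}
bad = [(naive_stem(a), naive_stem(b))
       for a, b in (("ides", "IDE"), ("SOCKS", "sock"), ("gated", "gate"))]
```

In `good`, the variants collapse to one term, which shrinks the vocabulary and helps recall. In `bad`, each pair of unrelated words also collapses to one term, which is the precision cost named above.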
Term Extraction
Tokenization
Example
Thesauri
DEF (Thesaurus):
Query Operations
Query Processing
Querying by phrases decreases recall: the index terms might not occur in exactly that phrase.
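A minimal sketch of why phrase queries lower recall, assuming whitespace tokenization and two made-up documents (the helper names are hypothetical):

```python
def tokens(text):
    return text.lower().split()

def matches_all_terms(query, doc):
    """True if every query term occurs somewhere in the document."""
    return set(tokens(query)) <= set(tokens(doc))

def matches_phrase(query, doc):
    """True only if the query terms occur adjacently and in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = ["information retrieval on the web", "retrieval of web information"]
by_terms = [matches_all_terms("information retrieval", d) for d in docs]
by_phrase = [matches_phrase("information retrieval", d) for d in docs]
```

Both documents contain both terms, but only the first contains them as the exact phrase, so the phrase query retrieves a subset of what the term query retrieves.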
Query Revision
Feedback
Query Re-formulation
Relevance Feedback
The most popular method of query reformulation.
The user marks the relevant documents among the top-ranked results (top 10-20).
Main idea:
select index terms from the documents marked relevant;
increase the weight of these terms when rewriting the query.
The new query (hopefully) moves towards the relevant docs (and away from the non-relevant ones).
We'll see more on feedback in Lectures 5 and 9.
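The main idea can be sketched with a Rocchio-style update. This is an assumption: the slide does not name a specific formula. Queries and documents are represented as sparse term-weight dictionaries, and the function name and the coefficients `alpha`, `beta`, `gamma` are illustrative choices, not values from the lecture.

```python
from collections import defaultdict

def reformulate(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update on sparse term-weight vectors (dicts)."""
    new_query = defaultdict(float)
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Move towards the centroid of relevant docs, away from non-relevant ones.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Drop terms whose weight is not positive.
    return {t: w for t, w in new_query.items() if w > 0}

new_q = reformulate(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 2.0}],
    nonrelevant=[{"jaguar": 1.0, "cat": 2.0}],
)
```

Here the rewritten query gains the term "car" from the document marked relevant, while "cat", which appears only in the non-relevant document, ends up with a negative weight and is dropped.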
Summary
Text operations
Document pre-processing
normalization
term / structure extraction
Term categorization (thesauri)
Query operations
Query formulation
Query processing (re-formulation)
Relevance feedback
Next Lecture
Indexing
Searching