Contact & Credits

Alessandro Agostini

Web Information Retrieval

agostini@science.unitn.it

A Graduate Course Spring 2006

http://www.dit.unitn.it/agostini/

ALESSANDRO AGOSTINI
Università di Trento
ITC-IRST Trento

Thanks to:

S. Chakrabarti
R. Baeza-Yates
B. Ribeiro-Neto
B. D. Davison

LECTURE 3 - TEXT / QUERY OPERATIONS

© Alessandro Agostini, A Graduate Course in Web Information Retrieval, ICT International Graduate School, Trento, Spring 2006

Your questions from Lecture 2


IR Process

[L1q4] IR Process (see figure): dotted arrows?

[Baeza-Yates & Ribeiro-Neto, 1999]

"I think in that figure all the lines should have been solid!"
[R. Baeza-Yates, 15 May, personal communication]



Towards Matching...

...on the left-side of the Box are query operations...
...on the right-side are text operations.
Indexing - Searching - Matching - Ranking.

Documents: texts!

Before the retrieval process can even be initiated:
Define the document repository (DB):
  textual;
  non-textual / multimedia.
Specify (by the DB Manager Module):
  documents (texts) to be accessed;
  text operations;
  text model (structure, ...);
  elements to index (keywords).

Towards Indexing...

IR models need index terms to process documents.
Index term:
  a keyword or group of selected words;
  any word (full text, more general);
  (usually) a noun - nouns have meaning;
  represents (a part of) the document semantics.
But, how are index terms defined?

Logical View of Docs Extended

Text operations [Baeza-Yates & Ribeiro-Neto, 1999]:
  transform the original documents;
  a document is Text + Structure + Metadata.


Outline for Today

Text operations
  Document pre-processing
    normalization
    term / structure extraction
  Term categorization (thesauri)
Query operations
  Query formulation
  Query processing (re-formulation)
  Relevance feedback


Document Pre-Processing

Indexing requires documents to be prepared.
To build an index (see Lecture 4), we first need:
  Break-up of the document corpus (DB) into individual documents
    guarantees separate access for retrieval.
  Document analysis and purification of noise:
    lexical analysis
    stopword list
    stemming
  Term extraction.
  Tokenization.

What to Index?

Text (data / text level)
  All text?
  Phrases?
  Punctuation?
  Hyphenation (the mark "-" in compound words)?
  Word roots (stems)?
Metatext (metadata level)
  Categories / Concepts (LSI)
  Structure
  Keywords (at metadata level - labels)
  Titles, authors, publication date, ...

Document Analysis

(optional) Examine document structure / metadata.
Decide what parts (or information) to index.
  Search engines (usually) ignore comments, metatags, image maps, tiny invisible text.
Several formats and contents to deal with.
  Convert to a standard format for normalization purposes.
  E.g.: PostScript to ASCII, HTML to plain text.

Text Operations

Basic operations (improve quality of indexing):
  lexical analysis: define the words in the document
    - digits, punctuation, letter case, ...
  stopwords: filter out low-discrimination words
    - frequent words (articles, conjunctions, ...)
  stemming: remove prefixes, suffixes
  term selection: decide what terms / groups of terms to index
    - usually nouns are good candidates
  text compression: write the document in fewer bits
    - Goal: improve efficiency of retrieval (storage / communication costs, faster searching)
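
A minimal Python sketch of the first two operations listed above (lexical analysis and stopword filtering). The stopword list and the function names are hypothetical, chosen only for illustration; real systems use much longer lists and more careful rules for digits and punctuation.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "or"}


def lexical_analysis(text):
    """Define the words of a document: lowercase letter runs only,
    so digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Filter out low-discrimination (very frequent) words."""
    return [t for t in tokens if t not in stopwords]


def index_terms(text):
    """Candidate index terms after the two operations above."""
    return remove_stopwords(lexical_analysis(text))


print(index_terms("The indexing of 3 documents, and the ranking of results."))
# ['indexing', 'documents', 'ranking', 'results']
```

Dropping digits and folding case are design choices of this sketch; either may be wrong for corpora where numbers or capitalization carry meaning.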

Stopwords

Function words and connectives (non-content terms).
Appear in a large number of documents and are of little use in pinpointing documents.
Indexing stopwords:
  Stopwords not indexed
    to reduce index space and improve performance.
  Replace stopwords with a placeholder (to remember the offset).

Stopwords (cont)

Problems:
  Queries containing only stopwords are ruled out.
  Polysemous words that are stopwords in one sense but not in others.
    E.g.: "can" as a verb vs. "can" as a noun.
  Elimination of stopwords might reduce recall.
    EXAMPLE: Consider a user looking for "to be or not to be":
    elimination of stopwords produces "be"!
Nowadays there is controversy on improving IR performance by using stopwords.

Stemming

Conflating words to help match a query term with a morphological variant in the corpus.
Remove inflections that convey parts of speech, tense and number.
  E.g. connect: connecting, connection, connections.
  E.g.: university and universal both stem to universe.
Nowadays there is controversy on improving IR performance by using stemming.

Stemming (cont)

Techniques:
  morphological analysis (e.g., Porter's algorithm)
  dictionary lookup (e.g., WordNet)
Stemming may increase recall, but at the price of precision.
  Abbreviations, polysemy, and names coined in the technical and commercial sectors.
  E.g.: stemming "ides" to "IDE", "SOCKS" to "sock", "gated" to "gate" may be bad!
Reduces the vocabulary.
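
To make the recall / precision trade-off of stemming concrete, here is a deliberately naive suffix stripper in Python. It is not Porter's algorithm: the suffix list, the minimum stem length and the function name are assumptions made for this sketch only.

```python
# Hypothetical suffix list, longest first. This is NOT Porter's algorithm.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def naive_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix,
    keeping at least min_stem characters of stem."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


# Conflation helps recall: all variants map to the same stem.
for w in ["connect", "connecting", "connection", "connections"]:
    print(f"{w}: {naive_stem(w)}")

# The same rules hurt precision: names and abbreviations collapse
# onto ordinary words.
for w in ["SOCKS", "ides"]:
    print(f"{w}: {naive_stem(w)}")
```

The four "connect" variants all reduce to "connect", while "SOCKS" becomes "sock" and "ides" becomes "ide", which is the kind of unwanted conflation the slide warns about.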

Term Extraction

Now decide which words / phrases best represent the semantic contents.
  automatic / manual (by specialists) selection
  best words are nouns or noun groups (e.g. computer science)
  automatic extraction (eliminate verbs, adjectives, ...)
Nowadays there is controversy on improving IR performance by using term selection.
Most search engines use full text-based indexing.

Tokenization

Applies to all documents in the corpus (DB).
(optional) Documents are pre-processed.
DEF (Token): nonempty sequence of characters excluding spaces and punctuation;
  represented by a suitable integer (token index, tid).
Documents (document index denoted by did) are transformed into ordered sequences of integers.
Useful to build a more efficient (search) index.
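
The token / tid / did definitions above can be sketched in a few lines of Python. The numbering scheme (tids assigned in first-seen order, starting at 1), the case folding and the function names are assumptions of this sketch, not something the slides prescribe.

```python
import re

# Two short sample documents, keyed by did.
DOCS = {
    1: "My care is loss of care with old care done",
    2: "Your care is gain of care with new care won",
}


def tokenize(text):
    """Tokens: nonempty runs of word characters (no spaces, no punctuation).
    Lowercasing is an extra assumption of this sketch."""
    return re.findall(r"\w+", text.lower())


def encode_corpus(docs):
    """Return (vocab, encoded): vocab maps token to tid,
    encoded maps did to the ordered list of tids of that document."""
    vocab = {}
    encoded = {}
    for did, text in docs.items():
        # setdefault returns the existing tid, or registers the next free one.
        encoded[did] = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokenize(text)]
    return vocab, encoded


vocab, encoded = encode_corpus(DOCS)
print(encoded[1])  # [1, 2, 3, 4, 5, 2, 6, 7, 2, 8]
print(encoded[2])  # [9, 2, 3, 10, 5, 2, 6, 11, 2, 12]
```

Repeated tokens such as "care" reuse the same tid (2 here), which is what makes the integer sequences compact and easy to index.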

Example

Suppose the corpus DB is the set {d1, d2}, where:
  d1: My care is loss of care with old care done
  d2: Your care is gain of care with new care won
The tokenized documents are:
  d1: ⟨t1, t2, t3, t4, t5, t2, t6, t7, t2, t8⟩
  d2: ⟨t9, t2, t3, t10, t5, t2, t6, t11, t2, t12⟩
REMARK: The [document, terms] matrix above is inverted to build the inverted index, as we'll see.

Thesauri

DEF (Thesaurus):
  list of terms in a given domain;
  for each term, a list of related terms.
Purposes of using thesauri in IR [Foskett, 1997]:
  define a controlled vocabulary for indexing:
    indexing concepts rather than terms;
    indices with clear semantics;
    reduction of noise;
    good indexing for specialized corpora, still to be tested on general domains
  assist user query formulation / query expansion
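
As an illustration of the last purpose above (assisting query expansion), a thesaurus can be modelled as a mapping from each term to its related terms. The toy entries and the function name below are hypothetical; a real thesaurus would be domain-specific and curated.

```python
# Hypothetical toy thesaurus: term -> list of related terms in the domain.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "retrieval": ["search"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the original query terms followed by their related terms,
    without duplicates and preserving order."""
    expanded = list(dict.fromkeys(terms))  # drop repeats, keep order
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(expand_query(["car", "retrieval"]))
# ['car', 'retrieval', 'automobile', 'vehicle', 'search']
```

Terms missing from the thesaurus are simply left unexpanded, which is one reason coverage matters more in general domains than in specialized ones.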

Thesauri and Metadata

Both help in selecting appropriate index terms.
Metadata:
  Descriptive: author, date, source, length, ...
  Semantic: associated with contents (e.g. ISBN)
    use of ontologies to standardize semantic terms
  Structural: related to text structure (e.g. sections, ...)
Assigned paper: [?]


Query Operations

Difficult (for the user) to formulate good queries.
  What is the best question?
Web search engine users spend time reformulating their queries.
Distinguish between first formulation and reformulations - query processing.
Query reformulation is defined in two steps:
  query expansion (adding new terms)
  term reweighting (in the expanded query)

Query Processing

Three steps of processing:
  (User): Formulation of the query / request - Q
    represents the information need;
    difficult to do for the majority of users;
    user-friendly querying interfaces might help.
  (System): Transform Q into query q.
  (System): Search the index to match q with some dj.
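
The two system steps above can be sketched as follows. The toy corpus, the function names, and the choice of conjunctive (AND) matching are all assumptions of this sketch; the slide only says that q is matched against the index.

```python
import re

# Hypothetical toy corpus: document id -> text.
DOCS = {
    "d1": "web information retrieval course",
    "d2": "database systems course",
    "d3": "information retrieval on the web",
}


def to_terms(text):
    """Normalize free text into lowercase terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = {}
    for did, text in docs.items():
        for term in to_terms(text):
            index.setdefault(term, set()).add(did)
    return index


def process_query(request, index):
    """Transform the user request Q into a term query q,
    then return the documents that contain every term of q."""
    q = to_terms(request)
    if not q:
        return set()
    result = set(index.get(q[0], set()))
    for term in q[1:]:
        result.intersection_update(index.get(term, set()))
    return result


index = build_index(DOCS)
print(sorted(process_query("Web retrieval?", index)))  # ['d1', 'd3']
print(sorted(process_query("course", index)))          # ['d1', 'd2']
```

This returns an unranked set; ranking the matches is a separate step that depends on the IR model.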

Term / Phrases Queries

Most common in search engines.
Terms allow for ranking, not just retrieval.
Querying by phrases improves precision - increases the likelihood that retrieved docs are relevant.
Querying by phrases decreases recall - index terms might not be in exactly that phrase.
Search engines might implement implicit connectives (e.g.: Google uses AND).

Query Revision

Two interpretations of the user's feedback:
  Evaluation of the IR system / search engine:
    retrieved-and-selected documents are okay;
    other documents are not relevant.
  Request for query revision:
    new query formulation from the user;
    reformulation using the feedback.
IMPORTANT: The choice of feedback method depends on the IR model - weights definition.

Feedback

Ranked documents might be revised by:
  (user) Explicit formulation of a revised / new request.
  (system) A more focused query.
  Providing feedback "more like this":
    Explicit feedback - by Yes / No feedback
    Implicit feedback - by downloading / a click
    Automatic feedback - by clustering
Query revision / reformulation is common.
Assigned paper: [?]

Query Re-formulation

Query reformulation - two steps:
  query expansion (adding new terms)
  term reweighting (in the expanded query)
Approaches grouped according to feedback:
  User-feedback based - relevance feedback.
  Automatic local analysis - use of retrieved docs.
  Automatic global analysis - use of the corpus.
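
A rough Python sketch of the two reformulation steps (expansion, then reweighting), in the spirit of automatic local analysis: new terms are taken from the documents retrieved for the original query. Picking the most frequent terms and the weights 1.0 and 0.5 are simplifying assumptions of this sketch; actual local analysis methods typically rely on term correlations or clustering.

```python
from collections import Counter


def reformulate(query_terms, retrieved_docs, n_new=2, new_weight=0.5):
    """Expand and reweight a query.

    query_terms: list of terms in the original query.
    retrieved_docs: list of token lists (the documents retrieved so far).
    Returns a dict term -> weight: original terms keep weight 1.0,
    the n_new most frequent other terms get new_weight.
    """
    weights = {t: 1.0 for t in query_terms}
    counts = Counter(
        tok for doc in retrieved_docs for tok in doc if tok not in weights
    )
    for term, _count in counts.most_common(n_new):
        weights[term] = new_weight
    return weights


retrieved = [
    ["jaguar", "car", "speed"],
    ["jaguar", "car", "engine"],
    ["jaguar", "speed", "car"],
]
print(reformulate(["jaguar"], retrieved))
# {'jaguar': 1.0, 'car': 0.5, 'speed': 0.5}
```

Giving added terms a lower weight than the user's own terms keeps the expanded query anchored to the original request.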

Relevance Feedback

Most popular method for query reformulation.
User marks the relevant documents among the ranked ones (top 10-20).
Main idea:
  select index terms within the selected documents;
  enhance the weight of these terms when rewriting the query.
The new query moves (hopefully) towards more relevant docs (and away from non-relevant ones).
We'll see more on feedback in Lectures 5 and 9.
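
One well-known way to realize the main idea above in the vector model is a Rocchio-style update. The slide does not name a specific formula, so treat this as an illustrative choice; the parameter values (alpha, beta, gamma) and the function name are assumptions.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query revision.

    query: dict term -> weight.
    relevant, non_relevant: lists of dicts term -> weight (marked documents).
    Returns the revised query as a dict term -> weight.
    """
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    # Move towards the centroid of the relevant documents...
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    # ...and away from the centroid of the non-relevant ones.
    for doc in non_relevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(non_relevant)
    # Terms whose weight is not positive are dropped.
    return {t: w for t, w in new_q.items() if w > 0}


revised = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 1.0}],
    non_relevant=[{"jaguar": 1.0, "cat": 1.0}],
)
print({t: round(w, 2) for t, w in revised.items()})
# {'jaguar': 1.6, 'car': 0.75}
```

Here "car" enters the query because it occurs in a document marked relevant, while "cat" ends with a negative weight and is dropped.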


Summary

Text operations
  Document pre-processing
    normalization
    term / structure extraction
  Term categorization (thesauri)
Query operations
  Query formulation
  Query processing (re-formulation)
  Relevance feedback

Next Lecture

Indexing
Searching