Contact & Credits: A Graduate Course - Spring 2006

Contact & Credits
Alessandro Agostini
agostini@science.unitn.it
http://www.dit.unitn.it/agostini/
Thanks to:
S. Chakrabarti
R. Baeza-Yates
B. Ribeiro-Neto
B. D. Davison

Web Information Retrieval
A Graduate Course, Spring 2006
ALESSANDRO AGOSTINI
Università di Trento
ITC-IRST Trento
LECTURE 3 - TEXT / QUERY OPERATIONS
© Alessandro Agostini, A Graduate Course in Web Information Retrieval, ICT International Graduate School, Trento, Spring 2006

Your questions from Lecture 2
IR Process
[L1q4] IR Process (see figure): what about the dotted arrows?
[Baeza-Yates & Ribeiro-Neto, 1999]
I think in that figure all the lines should have been solid!
[R. Baeza-Yates, 15 May, personal communication]

Towards Matching...
...on the left-hand side of the box are query operations...
...on the right-hand side are text operations.
Indexing - Searching - Matching - Ranking.

Documents: texts!
Before the retrieval process can even be initiated:
Define the document repository (DB):
- textual;
- non-textual / multimedia.
Specify (by the DB Manager Module):
- documents (texts) to be accessed;
- text operations;
- text model (structure, ...);
- elements to index (keywords).

Towards Indexing...
IR models need index terms to process documents.
Index term:
- a keyword or a group of selected words;
- any word (full text, more general);
- (usually) a noun, since nouns carry meaning;
- represents (a part of) the document semantics.
How are index terms defined?

Logical View of Docs Extended
Text operations [Baeza-Yates & Ribeiro-Neto, 1999]:
transform the original documents.
But a document is Text + Structure + Metadata.

Outline for Today

Text operations
Document pre-processing
normalization
term / structure extraction
Term categorization (thesauri)
Query operations
Query formulation
Query processing (re-formulation)
Relevance feedback
Document Pre-Processing
Indexing requires documents to be prepared.
To build an index (see Lecture 4), we first need to:
- break up the document corpus (DB) into individual documents,
  to guarantee separate access for retrieval;
- analyse the documents and purify them of noise:
  lexical analysis, stopword list, stemming;
- extract terms;
- tokenize.

What to Index?
Text
Data / Text level
All text?
Phrases?
Punctuation?
Hyphenation (the "-" mark in compound words)?
Word roots (stems)?
Metatext
Metadata level
Categories / Concepts (LSI)
Structure
Keywords (at metadata level - labels)
Titles, authors, publication date, ...

Document Analysis
(optional) Examine the document structure / metadata.
Decide what parts (or information) to index:
- search engines (usually) ignore comments, metatags, image maps, tiny invisible text.
Several formats and contents to deal with:
- convert to a standard format for normalization purposes,
  e.g. PostScript to ASCII, HTML to plain text.

Text Operations
Basic operations (improve quality of indexing):
- lexical analysis: define the words in the document
  (digits, punctuation, letter case, ...);
- stopwords: filter out low-discrimination words,
  i.e. frequent words (articles, conjunctions, ...);
- stemming: remove affixes (prefixes, suffixes);
- term selection: decide which terms / groups of terms to index
  (nouns are usually good candidates);
- text compression: write the document in fewer bits.
Goal: improve the efficiency of retrieval (storage / communication costs, faster searching).

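A minimal sketch of the lexical analysis step listed above, in Python. The function name and its two options are illustrative assumptions, not part of the lecture, and the sketch only handles ASCII letters and digits.

```python
import re

def lexical_analysis(text, keep_digits=False, split_hyphens=True):
    """Turn raw text into a list of lowercase word tokens (ASCII only)."""
    text = text.lower()                         # fold letter case
    chars = "a-z0-9" if keep_digits else "a-z"  # digits are kept only on request
    word = f"[{chars}]+"
    # Either break compound words at hyphens or keep them as one token.
    pattern = word if split_hyphens else f"{word}(?:-{word})*"
    return re.findall(pattern, text)            # punctuation is dropped

sample = "State-of-the-art IR, since 1999!"
print(lexical_analysis(sample))
print(lexical_analysis(sample, keep_digits=True, split_hyphens=False))
```

The two calls show how the decisions on digits and hyphenation change the set of words that later steps (stopwords, stemming) will see.
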
Stopwords
Function words and connectives (non-content terms).
They appear in a large number of documents and are of little use in pinpointing documents.
Indexing stopwords:
- stopwords are not indexed,
  to reduce index space and improve performance;
- stopwords are replaced with a placeholder (to remember the offset).

Stopwords (cont)
Problems:
- Queries containing only stopwords are ruled out.
- Polysemous words that are stopwords in one sense but not in others.
  E.g.: "can" as a verb vs. "can" as a noun.
- Elimination of stopwords might reduce recall.
  EXAMPLE: consider a user looking for "to be or not to be":
  elimination of stopwords produces "be"!
There is still controversy on whether using stopwords improves IR performance.

Stemming
Conflating words to help match a query term with a morphological variant in the corpus.
Remove inflections that convey parts of speech, tense and number.
E.g. connect: connecting, connection, connections.
Reduces the vocabulary.
Techniques:
- morphological analysis (e.g., Porter's algorithm);
- dictionary lookup (e.g., WordNet).

Stemming (cont)
Stemming may increase recall but at the price of precision:
- abbreviations, polysemy and names coined in the technical and commercial sectors.
  E.g.: stemming "ides" to "IDE", "SOCKS" to "sock", "gated" to "gate" may be bad!
- E.g.: "university" and "universal" both stem to "universe".
There is still controversy on whether using stemming improves IR performance.

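The stopword and stemming slides can be illustrated with a small Python sketch. The stop list, the placeholder symbol and the suffix list are hypothetical choices made for this example; in particular, `naive_stem` is a crude suffix stripper and is not Porter's algorithm.

```python
# Hypothetical stop list; "be" is deliberately not in it.
STOPWORDS = {"a", "an", "and", "in", "not", "of", "or", "the", "to"}
PLACEHOLDER = "*"

def remove_stopwords(tokens, keep_offsets=False):
    """Drop stopwords, or replace them with a placeholder to keep word offsets."""
    if keep_offsets:
        return [PLACEHOLDER if t in STOPWORDS else t for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

# Suffixes are tried in this order, so longer ones must come first.
SUFFIXES = ("ions", "ion", "ing", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, leaving at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

query = "to be or not to be".split()
print(remove_stopwords(query))                     # only the two "be" survive
print(remove_stopwords(query, keep_offsets=True))  # offsets are preserved
print({naive_stem(w) for w in ["connect", "connecting", "connection", "connections"]})
print(naive_stem("socks"))                         # conflates with "sock"
```

The last call shows the precision risk mentioned above: once case is folded, the abbreviation SOCKS becomes indistinguishable from the stem of an ordinary noun.
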
Term Extraction
Now decide which words / phrases best represent the semantic contents:
- automatic / manual (by specialists) selection;
- the best words are nouns or noun groups (e.g. computer science);
- automatic extraction (eliminate verbs, adjectives, ...).
There is still controversy on whether using term selection improves IR performance.
Most search engines use full-text-based indexing.

Tokenization
Applies to all documents in the corpus (DB).
(optional) Documents are pre-processed.
DEF (Token): nonempty sequence of characters excluding spaces and punctuation;
represented by a suitable integer (token index, tid).
Documents (document index denoted by did) are transformed into ordered sequences of integers.
Useful to build a more efficient (search) index.

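A minimal sketch of tokenization as defined above, in Python, run on the two sentences of the lecture's Example slide. The function names, the lowercasing step and the choice to number tids from 1 in order of first appearance are assumptions of this sketch; the final `invert` function only hints at the index construction deferred to Lecture 4.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokens: nonempty runs of characters without spaces or punctuation."""
    return re.findall(r"[^\W_]+", text.lower())

def assign_tids(corpus):
    """Map each did to the ordered sequence of tids of its tokens."""
    vocab, docs = {}, {}
    for did, text in corpus.items():
        tids = []
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # next unused tid
            tids.append(vocab[token])
        docs[did] = tids
    return vocab, docs

def invert(docs):
    """Turn did -> [tid, ...] into tid -> {did: [positions]}."""
    index = defaultdict(dict)
    for did, tids in docs.items():
        for pos, tid in enumerate(tids):
            index[tid].setdefault(did, []).append(pos)
    return dict(index)

CORPUS = {
    "d1": "My care is loss of care with old care done",
    "d2": "Your care is gain of care with new care won",
}
vocab, docs = assign_tids(CORPUS)
index = invert(docs)
print(docs["d1"])
print(docs["d2"])
print(index[vocab["care"]])
```

With this numbering the two printed sequences coincide with the tokenized documents of the Example slide, and the entry for "care" lists its word positions (counted from 0) in each document.
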
Example
Suppose the corpus DB is the set {d1, d2}, where:
d1: My care is loss of care with old care done
d2: Your care is gain of care with new care won
The tokenized documents are:
d1: <t1, t2, t3, t4, t5, t2, t6, t7, t2, t8>
d2: <t9, t2, t3, t10, t5, t2, t6, t11, t2, t12>
REMARK: the matrix [document, terms] above is inverted to build the inverted index, as we'll see.

Thesauri
DEF (Thesaurus):
- a list of terms in a given domain;
- for each term, a list of related terms.
Purposes of using thesauri in IR [Foskett, 1997]:
- define a controlled vocabulary for indexing:
  indexing concepts rather than terms;
  indices with clear semantics;
  reduction of noise;
  good indexing for specialized corpora, still to be tested on general domains;
- assist user query formulation / query expansion.

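The two purposes above can be sketched in Python with a toy thesaurus. The entries and function names are invented for illustration; a real thesaurus would be domain-specific and far richer.

```python
# Hypothetical thesaurus: preferred term -> related terms.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

# Controlled vocabulary: every related term points back to its preferred term.
PREFERRED = {term: term for term in THESAURUS}
for term, related in THESAURUS.items():
    for r in related:
        PREFERRED[r] = term

def concept_of(term):
    """Index a concept rather than a surface term (unknown terms are kept)."""
    return PREFERRED.get(term, term)

def expand_query(terms):
    """Query expansion: add the related terms of each query term, once."""
    expanded = []
    for term in terms:
        for candidate in [term, *THESAURUS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(concept_of("movie"))
print(expand_query(["car", "rental"]))
```

`concept_of` is what an indexer would call, while `expand_query` is what a query interface would call; both rely on the same list of related terms.
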
Space for Your Notes & Quests...

Thesauri and Metadata
Both help in selecting appropriate index terms.
Metadata:
- Descriptive: author, date, source, length, ...
- Semantic: associated with the contents (e.g. ISBN);
  use of ontologies to standardize semantic terms.
- Structural: related to the text structure (e.g. sections, ...).
Assigned paper: [?]

Query Operations
It is difficult for the user to formulate good queries.
What is the best question?
Web search engine users spend time reformulating their queries.
Distinguish between first formulation and reformulations - query processing.
Query reformulation is defined in two steps:
- query expansion (adding new terms);
- term reweighting (in the expanded query).

Query Processing
Three steps of processing:
- (User): formulation of the query / request - Q:
  represents the information need;
  difficult to do for the majority of users;
  user-friendly querying interfaces might help.
- (System): transform Q into a query q.
- (System): search the index to match q with some dj.

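The three processing steps can be sketched as follows, in Python. The documents, the stop list and the scoring rule (number of matching query terms) are assumptions made for this example and are not prescribed by the lecture.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "and", "at", "for", "in", "is", "of", "on", "the", "what"}

def text_operations(text):
    """Transform a request Q (or a document) into a list of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """term -> set of document identifiers containing it."""
    index = defaultdict(set)
    for did, text in docs.items():
        for term in text_operations(text):
            index[term].add(did)
    return index

def search(q, index):
    """Match q against the index; rank documents by number of matching terms."""
    scores = defaultdict(int)
    for term in set(q):
        for did in index.get(term, ()):
            scores[did] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

DOCS = {
    "d1": "Stemming and stopword removal for web retrieval",
    "d2": "Relevance feedback in web search engines",
    "d3": "Cooking pasta at home",
}
Q = "What is relevance feedback on the Web?"  # step 1: the user's request
q = text_operations(Q)                        # step 2: the system's query
results = search(q, build_index(DOCS))        # step 3: match q with some dj
print(q)
print(results)
```

The same text operations are applied to the request and to the documents, which is what makes the match on the index possible.
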
Term / Phrases Queries
Most common in search engines.
Terms allow for ranking, not just retrieval.
Querying by phrases improves precision - it increases the likelihood that retrieved docs are relevant.
Querying by phrases decreases recall - index terms might not occur in exactly that phrase.
Search engines might implement implicit connectives (e.g.: Google uses AND).

Query Revision
Ranked documents might be revised by:
- (user) explicit formulation of a revised / new request;
- (system) a more focused query.
Providing feedback ("more like this"):
- explicit feedback - by Yes / No feedback;
- implicit feedback - by downloading / a click;
- automatic feedback - by clustering.
Query revision / reformulation is common.

Feedback
Two interpretations of the user's feedback:
- Evaluation of the IR system / search engine:
  retrieved-and-selected documents are okay;
  other documents are not relevant.
- Request of query revision:
  new query formulation from the user;
  reformulation using the feedback.
IMPORTANT: the choice of feedback method depends on the IR model - weights definition.
Assigned paper: [?]

Query Re-formulation
Query reformulation - two steps:
- query expansion (adding new terms);
- term reweighting (in the expanded query).
Approaches grouped according to feedback:
- user-feedback based - relevance feedback;
- automatic local analysis - use of the retrieved docs;
- automatic global analysis - use of the corpus.

Space for Your Notes & Quests...

Relevance Feedback
The most popular method for query reformulation.
The user marks the relevant documents among the top-ranked ones (top 10-20).
Main idea:
- select index terms within the selected documents;
- enhance the weight of these terms when rewriting the query.
The new query (hopefully) moves towards more relevant docs (and away from non-relevant ones).
We'll see more on feedback in Lectures 5 and 9.

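One well-known way to realise the main idea above in a vector model is a Rocchio-style update, sketched here in Python. The slide does not commit to a formula, so the update rule, the parameter values (alpha, beta, gamma) and the use of raw term frequencies as weights are all assumptions of this sketch.

```python
from collections import Counter

def centroid(docs):
    """Average term-frequency vector of a list of documents (lists of terms)."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return {t: c / len(docs) for t, c in total.items()} if docs else {}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant documents and away from non-relevant ones."""
    pos, neg = centroid(relevant), centroid(non_relevant)
    new_query = {}
    for term in set(query) | set(pos) | set(neg):
        weight = (alpha * query.get(term, 0.0)
                  + beta * pos.get(term, 0.0)
                  - gamma * neg.get(term, 0.0))
        if weight > 0:  # terms pushed to a non-positive weight are dropped
            new_query[term] = weight
    return new_query

# Hypothetical feedback: the user marked two documents as relevant.
query = {"jaguar": 1.0}
relevant = [["jaguar", "cat", "jungle"], ["jaguar", "cat"]]
non_relevant = [["jaguar", "car"]]
new_q = rocchio(query, relevant, non_relevant)
print(sorted(new_q.items()))
```

Terms from the marked documents enter the query with a positive weight, while a term that only occurs in unmarked documents ends up with a negative weight and is left out.
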
Summary
Text operations
Document pre-processing
normalization
term / structure extraction
Term categorization (thesauri)
Query operations
Query formulation
Query processing (re-formulation)
Relevance feedback

Next Lecture
Indexing
Searching
