
Introduction to Information Retrieval

Index Construction

Index construction
How do we construct an index?
What strategies can we use with limited main
memory?

19-Jul-17 CS F469 2

Indexing
Indexing is a technique borrowed from databases
An index is a data structure that supports efficient
lookups in a large data set
E.g., hash indexes, R-trees, B-trees, etc.


Forward index
What is an inverted index? First, look at the forward index.

Documents     Words
Document 1    Hat, dog, the, cow, is, now
Document 2    Cow, run, away, morning, in, tree
Document 3    What, family, at, some, is, take

Querying the forward index would require sequential iteration through each
document and each of its words to verify a matching document.
Too much time, memory and resources required!


What is an inverted index?

[Figure: an inverted index. Labels: postings list; one posting.]

As opposed to the forward index, store the list of documents for each word.
Directly access the set of documents containing the word.
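
As a rough illustration of the difference, the following sketch turns the forward index from the earlier slide into an inverted index. The function name `invert` and the lowercasing of words are assumptions made for this example only.

```python
from collections import defaultdict

# Forward index: document ID -> words (lowercased for the example).
forward_index = {
    1: ["hat", "dog", "the", "cow", "is", "now"],
    2: ["cow", "run", "away", "morning", "in", "tree"],
    3: ["what", "family", "at", "some", "is", "take"],
}

def invert(forward):
    """Build word -> sorted list of document IDs (one postings list per word)."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word in set(forward[doc_id]):   # at most one posting per (word, doc)
            inverted[word].append(doc_id)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index["cow"])  # [1, 2]
print(inverted_index["is"])   # [1, 3]
```

A lookup is now a single dictionary access instead of a scan over every document.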


Index construction

Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Key step

After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step.
We have 100M items to sort.

Parsed order           Sorted by term
Term       Doc #       Term       Doc #
I          1           ambitious  2
did        1           be         2
enact      1           brutus     1
julius     1           brutus     2
caesar     1           caesar     1
I          1           caesar     2
was        1           caesar     2
killed     1           capitol    1
i'         1           did        1
the        1           enact      1
capitol    1           hath       2
brutus     1           I          1
killed     1           I          1
me         1           i'         1
so         2           it         2
let        2           julius     1
it         2           killed     1
be         2           killed     1
with       2           let        2
caesar     2           me         1
the        2           noble      2
noble      2           so         2
brutus     2           the        1
hath       2           the        2
told       2           told       2
you        2           was        1
caesar     2           was        2
was        2           with       2
ambitious  2           you        2

Scaling index construction


In-memory index construction does not scale.
How can we construct an index for very large
collections?
Taking into account the hardware constraints we just
learned about . . .
Memory, disk, speed, etc.


Sort-based index construction


As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit
compression tricks (you can, but it is much more complex).
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (termID 4 bytes
+ docID 4 bytes + freq 4 bytes), the postings demand a lot of
space for large collections.
T = 100,000,000 postings in the case of RCV1.
Thus: we need to store intermediate results on disk.
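
The space claim can be checked with a quick calculation. This is only the arithmetic implied by the figures on the slide (12 bytes per entry, T = 100,000,000); the variable names are chosen for the example.

```python
BYTES_PER_POSTING = 4 + 4 + 4      # termID + docID + freq, 4 bytes each
T = 100_000_000                    # postings entries for RCV1, per the slide

total_bytes = T * BYTES_PER_POSTING
print(total_bytes)                 # 1200000000
print(total_bytes / 10**9, "GB")   # 1.2 GB
```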


Use the same algorithm for disk?


Can we use the same index construction algorithm
for larger collections, but by using disk instead of
memory?
No: sorting T = 100,000,000 records on disk is too
slow because it requires too many disk seeks.
We need an external sorting algorithm.


Bottleneck
Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc
within each term)
Doing this with random disk seeks would be too slow:
we must sort T = 100M records.


BSBI: Blocked sort-based Indexing


(Sorting with fewer disk seeks)
12-byte (4+4+4) records (termID, doc, freq).
These are generated as we parse docs.
Must now sort 100M such 12-byte records by term.
Define a block ~ 10M such records.
Can fit comfortably into memory for in-place sorting (e.g.,
quicksort).
With 100M records in total, we will have 10 such blocks to start with.
Basic idea of algorithm:
Accumulate postings for each block, sort, write to disk.
Then merge the blocks into one long sorted order.
The term -> termID mapping (= dictionary) must already be
available, built from a first pass.

Blocked sort based indexing


Use termID instead of term.
Main memory is insufficient to collect all termID-docID
pairs, so we need an external sorting algorithm that uses
disk.
Segment the collection into parts of equal size.
Sort and group the termID-docID pairs of each part in
memory.
Store the intermediate results on disk.
Merge all intermediate results into the final index.
Running time: O(T log T)
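
The steps above can be sketched as follows. This is a minimal illustration, not the reference algorithm: the names `write_run`, `read_run` and `bsbi` are hypothetical, temporary files stand in for the on-disk blocks, and the final index is kept in a Python dict only to make the result easy to inspect.

```python
import heapq
import pickle
import tempfile

def write_run(pairs):
    """Sort one block of (termID, docID) pairs and write it to a temp file."""
    run = tempfile.TemporaryFile()
    for pair in sorted(pairs):
        pickle.dump(pair, run)
    run.seek(0)
    return run

def read_run(run):
    """Lazily read pairs back from a sorted run, one at a time."""
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return

def bsbi(pairs, block_size):
    """Blocked sort-based indexing over a stream of (termID, docID) pairs."""
    runs, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:      # block is full: sort and spill to disk
            runs.append(write_run(block))
            block = []
    if block:
        runs.append(write_run(block))

    index = {}
    # heapq.merge streams the sorted runs without loading them fully.
    for term_id, doc_id in heapq.merge(*(read_run(r) for r in runs)):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    for r in runs:
        r.close()
    return index

pairs = [(3, 1), (1, 1), (2, 1), (1, 2), (3, 2), (2, 3), (1, 3)]
print(bsbi(pairs, block_size=3))  # {1: [1, 2, 3], 2: [1, 3], 3: [1, 2]}
```

With `block_size=3` the seven pairs are split into three sorted runs, which are then merged in one pass.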

Block Inversion
Inversion involves two steps:
1. We sort the termID-docID pairs.
2. We collect all termID-docID pairs with the same
termID into a postings list, where a posting is
simply a docID.
This results in an inverted index for the block we have just
read.
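
The two inversion steps map directly onto a sort followed by a group-by. The sketch below assumes a block is a list of (termID, docID) tuples; `invert_block` is a name made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def invert_block(pairs):
    """Step 1: sort the pairs. Step 2: collect docIDs per termID."""
    pairs = sorted(set(pairs))            # drop duplicates, sort by (termID, docID)
    return {term_id: [doc_id for _, doc_id in group]
            for term_id, group in groupby(pairs, key=itemgetter(0))}

block = [(2, 1), (1, 1), (2, 2), (1, 2), (2, 1)]
print(invert_block(block))  # {1: [1, 2], 2: [1, 2]}
```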


Postings lists to be merged (on disk)

Block 1:
brutus: d1, 3; d3, 2
caesar: d1, 2; d2, 1; d4, 4
noble: d5, 2
with: d1, 2; d3, 1; d5, 2

Block 2:
brutus: d6, 1; d8, 3
caesar: d6, 4
julius: d10, 1
killed: d6, 4; d7, 3

Merged postings lists

brutus: d1, 3; d3, 2; d6, 1; d8, 3
caesar: d1, 2; d2, 1; d4, 4; d6, 4
julius: d10, 1
killed: d6, 4; d7, 3
noble: d5, 2
with: d1, 2; d3, 1; d5, 2


Sorting 10 blocks of 10M records


First, read each block, sort it in main memory, write it back to disk:
Quicksort takes 2N ln N expected steps.
In our case, 2 x (10M ln 10M) steps.
Exercise: estimate the total time to read each block from
disk and quicksort it.
10 times this estimate gives us 10 sorted runs of
10M records each on disk. Now we need to merge them all!
Done straightforwardly, the merge needs 2 copies of the data
on disk (one for the lists to be merged, one for the
merged output).
But we can optimize this.
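
As a starting point for the exercise, the expected step count from the 2N ln N formula can be evaluated directly. Converting steps into seconds depends on hardware figures that are not given here, so the sketch stops at the step count.

```python
import math

N = 10_000_000                           # records per block
steps_per_block = 2 * N * math.log(N)    # expected quicksort steps: 2N ln N

print(f"{steps_per_block:.3g}")          # 3.22e+08 steps for one block
print(f"{10 * steps_per_block:.3g}")     # 3.22e+09 steps for all 10 blocks
```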

How to merge the sorted runs?


External mergesort, one pass: use a 9-element priority queue, repeatedly
deleting its smallest element and adding to it from the buffer to which the
smallest belonged.

One example of external sorting is the external mergesort algorithm. For example, for
sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data in main memory and sort by some conventional method, like
quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to
be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and
allocate the remaining 10 MB for an output buffer. (In practice, it might provide better
performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is
full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the
next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is
available.
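
The priority-queue merge in step 5 can be sketched in a few lines. This is a simplified illustration under stated assumptions: in-memory lists stand in for the sorted chunks on disk, three runs are used instead of nine, and `buffered`, `k_way_merge` and `buffer_size` are hypothetical names.

```python
import heapq

_END = object()  # sentinel marking an exhausted run

def buffered(run, buffer_size):
    """Yield the items of a sorted run, fetching buffer_size items at a time."""
    for start in range(0, len(run), buffer_size):
        # Each slice stands in for one refill of an input buffer from disk.
        yield from run[start:start + buffer_size]

def k_way_merge(runs, buffer_size=2):
    """Merge sorted runs with a priority queue holding one item per run."""
    readers = [buffered(run, buffer_size) for run in runs]
    heap = []
    for i, reader in enumerate(readers):
        first = next(reader, _END)
        if first is not _END:
            heap.append((first, i))
    heapq.heapify(heap)

    merged = []
    while heap:
        smallest, i = heapq.heappop(heap)      # delete the smallest element
        merged.append(smallest)
        refill = next(readers[i], _END)        # take the next item from its run
        if refill is not _END:
            heapq.heappush(heap, (refill, i))
    return merged

runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(k_way_merge(runs))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The queue never holds more than one element per run, which is why a 9-way merge needs only a 9-element priority queue.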

Remaining problem with sort-based algorithm
Our assumption was: we can keep the dictionary in
memory.
We need the dictionary (which grows dynamically) in
order to implement a term to termID mapping.
Actually, we could work with (term, docID) postings
instead of (termID, docID) postings.


SPIMI:
Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each
block: no need to maintain a term-termID mapping
across blocks.
In other words, sub-dictionaries are generated on the
fly.
Key idea 2: Don't sort. Accumulate postings in
postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.
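
A minimal sketch of the two key ideas follows. It assumes the input is a stream of (term, docID) pairs in increasing docID order and uses a posting count as a stand-in for "memory is full"; `spimi_invert`, `merge_blocks` and `max_postings` are hypothetical names. Sorting the terms of a block just before it is emitted is an assumption made here to keep the later merge simple; the postings themselves are never sorted.

```python
def spimi_invert(token_stream, max_postings):
    """Yield one block index (sorted list of (term, postings)) per full block."""
    dictionary, count = {}, 0                 # fresh sub-dictionary per block
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # term added on the fly
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # append directly, no sorting
            count += 1
        if count >= max_postings:             # block full: emit and start over
            yield sorted(dictionary.items())
            dictionary, count = {}, 0
    if dictionary:
        yield sorted(dictionary.items())

def merge_blocks(blocks):
    """Merge the per-block indexes into one index."""
    index = {}
    for block in blocks:
        for term, postings in block:
            merged = index.setdefault(term, [])
            for doc_id in postings:
                if not merged or merged[-1] != doc_id:
                    merged.append(doc_id)
    return index

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 1),
          ("caesar", 2), ("noble", 2), ("brutus", 2)]
print(merge_blocks(spimi_invert(tokens, max_postings=3)))
# {'brutus': [1, 2], 'caesar': [1, 2], 'noble': [2]}
```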

SPIMI-Invert

Dictionary terms are generated on the fly!

Merging of blocks is analogous to BSBI.

Merge algorithm


BSBI vs. SPIMI

[Figure: BSBI. Labels: Main (memory) holding the Dictionary and a Block; Disk holding Block 1 to Block 5 and the Inverted Index; Pass 1, Pass 2; Phase: Merge.]

BSBI vs. SPIMI

[Figure: SPIMI. Labels: Main (memory) holding a Sub-dictionary and a Block; Disk holding Block 1 to Block 3, each with its own Sub-dictionary, and the Inverted Index; Single Pass; Phase: Merge.]

Difference between BSBI and SPIMI

    SPIMI                                 BSBI
1.  Adds postings directly to the         Collects term-docID pairs, sorts
    postings list                         them, and then creates the
                                          postings list
2.  Faster than BSBI because no           Slower than SPIMI
    sorting is necessary
3.  Saves memory because no termID        Requires storing termIDs, so it
    needs to be stored                    needs more space
4.  Time complexity O(T)                  Time complexity O(T log T)


Distributed indexing
For web-scale indexing, we must use a distributed computing cluster.
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?


Google data centers


Google data centers mainly contain commodity
machines.
Data centers are distributed around the world.
Estimate: a total of 1 million servers, 3 million
processors/cores (Gartner 2007)
Estimate: Google installs 100,000 servers each
quarter.
Based on expenditures of 200-250 million dollars per year
This would be 10% of the computing capacity of the
world!?!

Distributed indexing
Maintain a master machine directing the indexing job;
the master is considered safe.
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine
from a pool.


Parallel tasks
We will use two sets of parallel tasks
Parsers
Inverters
Break the input document collection into splits
Each split is a subset of documents (corresponding to
blocks in BSBI/SPIMI)


Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term,
doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms' first letters
(e.g., a-f, g-p, q-z); here j = 3.
Now to complete the index inversion


Inverters
An inverter collects all (term,doc) pairs (= postings)
for one term-partition.
Sorts and writes to postings lists
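
The parser and inverter roles can be simulated on a single machine. The sketch below is only an illustration: `partition_of`, `parse` and `invert` are hypothetical names, whitespace tokenization is an assumption, terms are assumed to start with a letter, and plain lists stand in for the segment files that parsers would write.

```python
from collections import defaultdict

PARTITIONS = ["a-f", "g-p", "q-z"]   # j = 3 term ranges, as on the slide

def partition_of(term):
    """Pick the term-range partition from the term's first letter."""
    first = term[0].lower()
    if first <= "f":
        return "a-f"
    if first <= "p":
        return "g-p"
    return "q-z"

def parse(split):
    """Parser: read one split and emit (term, docID) pairs per partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Inverter: sort the pairs of one partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

splits = [[(1, "caesar was killed")], [(2, "noble brutus told caesar")]]
segment_files = [parse(split) for split in splits]   # one parser per split

index = {}
for part in PARTITIONS:                               # one inverter per range
    pairs = [p for segments in segment_files for p in segments[part]]
    index[part] = invert(pairs)

print(index["a-f"])  # {'brutus': [2], 'caesar': [1, 2]}
print(index["q-z"])  # {'told': [2], 'was': [1]}
```

In a real cluster the two loops would run on different machines assigned by the master; here they run sequentially to show the data flow.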


Data flow
[Figure: Data flow. The master assigns splits to parsers (map phase). Each parser writes segment files partitioned into a-f, g-p and q-z. The master assigns each partition to an inverter (reduce phase: inverter a-f, inverter g-p, inverter q-z), which writes the postings.]

MapReduce
The index construction algorithm we just described is
an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing
without having to write code for the distribution
part.
They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.

Dynamic indexing
Up to now, we have assumed that collections are
static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have
to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary


Simplest approach
Maintain big main index
New docs go into small auxiliary index
Search across both, merge results
Deletions
Invalidation bit-vector for deleted docs
Filter docs output on a search result by this invalidation
bit-vector
Periodically, re-index into one main index
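
The scheme above can be sketched as a small class. This is an assumption-laden illustration: `DynamicIndex` and `merge_threshold` are made-up names, a Python set stands in for the invalidation bit-vector, docIDs are assumed never to be reused, and the "periodic" re-index is triggered by a simple document count.

```python
class DynamicIndex:
    def __init__(self, merge_threshold=3):
        self.main = {}              # big main index: term -> docIDs
        self.aux = {}               # small auxiliary index for new docs
        self.deleted = set()        # stands in for the invalidation bit-vector
        self.aux_docs = 0
        self.merge_threshold = merge_threshold

    def add(self, doc_id, terms):
        """New docs go into the auxiliary index."""
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def delete(self, doc_id):
        """Deletion only marks the doc as invalid."""
        self.deleted.add(doc_id)

    def search(self, term):
        """Search both indexes, merge results, filter deleted docs."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

    def merge(self):
        """Re-index into one main index, dropping deleted docs for good."""
        rebuilt = {}
        for term in set(self.main) | set(self.aux):
            docs = set(self.main.get(term, [])) | set(self.aux.get(term, []))
            docs -= self.deleted
            if docs:
                rebuilt[term] = sorted(docs)
        self.main, self.aux, self.aux_docs = rebuilt, {}, 0
        self.deleted.clear()

idx = DynamicIndex(merge_threshold=3)
idx.add(1, ["caesar", "brutus"])
idx.add(2, ["caesar"])
print(idx.search("caesar"))   # [1, 2]
idx.delete(1)
print(idx.search("caesar"))   # [2]
idx.add(3, ["brutus"])        # third doc triggers the merge
print(idx.main)
```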


Issues with main and auxiliary indexes


Problem of frequent merges: you touch stuff a lot
Poor performance during merge
Actually:
Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
Merge is the same as a simple append.
But then we would need a lot of files, which is inefficient for the OS.
Assumption for the rest of the lecture: The index is one big
file.
In reality: Use a scheme somewhere in between (e.g., split
very large postings lists, collect postings lists of length 1 in one
file etc.)

Dynamic/Positional indexing at search engines


All the large search engines now do dynamic
indexing
Their indices have frequent incremental changes
News items, blogs, new topical web pages (e.g., Sarah Palin, ...)
But (sometimes/typically) they also periodically
reconstruct the index from scratch
Query processing is then switched to the new index, and
the old index is then deleted
Positional indexes
Same sort of sorting problem, just larger

END
