
Introduction to Information Retrieval

Index Construction

Index construction
How do we construct an index?
What strategies can we use with limited main memory?

19-Jul-17 CS F469 2

Indexing is a technique borrowed from databases
An index is a data structure that supports efficient
lookups in a large data set
E.g., hash indexes, R-trees, B-trees, etc.


Forward index
What is an inverted index? First, look at the forward index!

Documents     Words
Document 1    hat, dog, the, cow, is, now
Document 2    cow, run, away, morning, in, tree
Document 3    what, family, at, some, is, take

Querying the forward index would require sequential iteration through each document and each word to verify a matching document.
Too much time, memory, and resources required!


What is an inverted index?

In contrast to the forward index, store the list of documents per word.
Directly access the set of documents containing the word.
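The contrast can be sketched in a few lines of Python; `build_inverted_index` is a hypothetical helper (not from the slides), and whitespace splitting stands in for real tokenization:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "hat dog the cow is now",
    2: "cow run away morning in tree",
}
index = build_inverted_index(docs)
# A query jumps straight to the matching documents instead of
# scanning every word of every document:
index["cow"]  # {1, 2}
```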


Index construction

Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Parsing yields (term, doc #) pairs in order of occurrence:
(i, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (i, 1), (was, 1), (killed, 1), (i', 1), (the, 1), (capitol, 1), (brutus, 1), (killed, 1), (me, 1), (so, 2), (let, 2), (it, 2), (be, 2), (with, 2), (caesar, 2), (the, 2), (noble, 2), (brutus, 2), (hath, 2), (told, 2), (you, 2), (caesar, 2), (was, 2), (ambitious, 2)

Key step

After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step. We have 100M items to sort.

Sorted by term (then by doc # within each term), the pairs become:
(ambitious, 2), (be, 2), (brutus, 1), (brutus, 2), (capitol, 1), (caesar, 1), (caesar, 2), (caesar, 2), (did, 1), (enact, 1), (hath, 2), (i, 1), (i, 1), (i', 1), (it, 2), (julius, 1), (killed, 1), (killed, 1), (let, 2), (me, 1), (noble, 2), (so, 2), (the, 1), (the, 2), (told, 2), (was, 1), (was, 2), (with, 2), (you, 2)

Scaling index construction

In-memory index construction does not scale.
How can we construct an index for very large collections?
Taking into account the hardware constraints we just learned about:
memory, disk, speed, etc.


Sort-based index construction

As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (termID 4 bytes + docID 4 bytes + freq 4 bytes), this demands a lot of space for large collections.
T = 100,000,000 in the case of RCV1.
Thus: We need to store intermediate results on disk.


Use the same algorithm for disk?

Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
No: sorting T = 100,000,000 records on disk is too slow: too many disk seeks.
We need an external sorting algorithm.


Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc
within each term)
Doing this with random disk seeks would be too slow: we must sort T = 100M records.


BSBI: Blocked sort-based Indexing

(Sorting with fewer disk seeks)
12-byte (4+4+4) records (termID, docID, freq).
These are generated as we parse docs.
Must now sort 100M such 12-byte records by termID.
Define a block as ~10M such records:
each block fits comfortably into memory for in-place sorting (e.g., quicksort).
With 100M records in total, we will have 10 such blocks to start with.
Basic idea of the algorithm:
accumulate postings for each block, sort, and write to disk;
then merge the blocks into one long sorted order.
The term -> termID mapping (= dictionary) must already be available, built from a first pass.
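The basic idea can be sketched as follows: toy Python, with tiny blocks standing in for the 10M-record blocks and in-memory lists standing in for on-disk runs (`bsbi` is a hypothetical name):

```python
import heapq
import itertools

def bsbi(pairs, block_size):
    """Blocked sort-based indexing over (termID, docID) pairs."""
    runs = []
    stream = iter(pairs)
    while True:
        block = list(itertools.islice(stream, block_size))  # read one block
        if not block:
            break
        block.sort()        # in-memory sort of the block (e.g., quicksort)
        runs.append(block)  # stands in for writing the sorted run to disk
    return list(heapq.merge(*runs))  # merge runs into one sorted order

pairs = [(3, 1), (1, 2), (2, 1), (1, 1), (3, 2), (2, 2)]
bsbi(pairs, block_size=2)  # globally sorted by termID, then docID
```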

Blocked sort based indexing

Use termID instead of term.
Main memory is insufficient to collect all termID-docID pairs, so we need an external sorting algorithm that uses disk.
Segment the collection into parts of equal size.
Sort and group the termID-docID pairs of each part in memory.
Store the intermediate results onto disk.
Merge all intermediate results into the final index.
Running time: O(T log T)

Block Inversion
Inversion involves two steps:
1. We sort the termID-docID pairs.
2. We collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID.
This results in an inverted index for the block we have just read.
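The two steps map directly onto a sort followed by a group-by; a small Python sketch over toy data (`invert_block` is a hypothetical name):

```python
from itertools import groupby
from operator import itemgetter

def invert_block(pairs):
    """Invert one block of (termID, docID) pairs."""
    pairs = sorted(pairs)                    # step 1: sort the pairs
    return {term: [doc for _, doc in group]  # step 2: group into postings
            for term, group in groupby(pairs, key=itemgetter(0))}

invert_block([(2, 4), (1, 3), (2, 1)])  # {1: [3], 2: [1, 4]}
```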


Postings lists to be merged:

Block 1:                       Block 2:
brutus: d1,3; d3,2             brutus: d6,1; d8,3
caesar: d1,2; d2,1; d4,4       caesar: d6,4
noble:  d5,2                   julius: d10,1
with:   d1,2; d3,1; d5,2       killed: d6,4; d7,3

Merged postings lists:
brutus: d1,3; d3,2; d6,1; d8,3
caesar: d1,2; d2,1; d4,4; d6,4
julius: d10,1
killed: d6,4; d7,3
noble:  d5,2
with:   d1,2; d3,1; d5,2
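Because each term's postings are already sorted within a block, and the second block covers later docIDs, merging amounts to concatenating per-term lists. A sketch with postings as (docID, freq) pairs (`merge_postings` is a hypothetical helper; the docID-ordering assumption is stated in the docstring):

```python
def merge_postings(block1, block2):
    """Merge per-term postings lists from two blocks.
    Assumes block2's docIDs all follow block1's, so lists concatenate."""
    merged = {term: list(postings) for term, postings in block1.items()}
    for term, postings in block2.items():
        merged.setdefault(term, []).extend(postings)
    return merged

block1 = {"brutus": [(1, 3), (3, 2)], "caesar": [(1, 2), (2, 1), (4, 4)],
          "noble": [(5, 2)], "with": [(1, 2), (3, 1), (5, 2)]}
block2 = {"brutus": [(6, 1), (8, 3)], "caesar": [(6, 4)],
          "julius": [(10, 1)], "killed": [(6, 4), (7, 3)]}
merge_postings(block1, block2)["brutus"]  # [(1, 3), (3, 2), (6, 1), (8, 3)]
```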



Sorting 10 blocks of 10M records

First, read each block, sort in main, write back to disk:
Quicksort takes 2N ln N expected steps
In our case 2 x (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and quicksort it.
10 times this estimate gives us 10 sorted runs of
10M records each on disk. Now, need to merge all!
Done straightforwardly, merge needs 2 copies of data
on disk (one for the lists to be merged, one for the
merged output)
But we can optimize this

How to merge the sorted runs?

External mergesort, one pass: use a 9-element priority queue, repeatedly deleting its smallest element and adding to it from the buffer to which the smallest belonged.

One example of external sorting is the external mergesort algorithm. For example, for
sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data into main memory and sort by some conventional method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to
be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and
allocate the remaining 10 MB for an output buffer. (In practice, it might provide better
performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.
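The priority-queue merge of step 5 can be sketched with in-memory lists standing in for the 10 MB input buffers (`kway_merge` is a hypothetical name):

```python
import heapq

def kway_merge(runs):
    """Merge sorted runs, repeatedly deleting the queue's smallest element
    and refilling from the run that element came from."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)  # smallest across all runs
        out.append(value)
        if pos + 1 < len(runs[run_idx]):           # refill from the same run
            heapq.heappush(heap, (runs[run_idx][pos + 1], run_idx, pos + 1))
    return out

kway_merge([[1, 4, 7], [2, 5], [3, 6, 8]])  # [1, 2, 3, 4, 5, 6, 7, 8]
```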

Remaining problem with sort-based

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
Actually, we could work with (term, docID) postings instead of (termID, docID) postings,
but then the intermediate files become very large.


Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block: there is no need to maintain a term-termID mapping across blocks.
In other words, sub-dictionaries are generated on the fly.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.
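A minimal SPIMI-style sketch of one block (`spimi_invert` is a hypothetical name; terms, not termIDs, key the postings):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Accumulate postings as tokens arrive: no term-termID mapping,
    no sorting of individual (term, docID) pairs."""
    dictionary = defaultdict(list)  # per-block dictionary, built on the fly
    for term, doc_id in token_stream:
        dictionary[term].append(doc_id)  # append each posting as it occurs
    # terms are sorted only once, when the block is written out
    return {term: dictionary[term] for term in sorted(dictionary)}

spimi_invert([("brutus", 1), ("caesar", 1), ("brutus", 2)])
# {'brutus': [1, 2], 'caesar': [1]}
```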


Dictionary terms are generated on the fly!
Merging of blocks is analogous to BSBI.

Merge algorithm



[Figure: BSBI merge. Sorted blocks on disk are merged in a second pass into the final index; the dictionary is held in main memory.]


[Figure: SPIMI merge. Each block carries its own sub-dictionary and inverted index; the blocks are merged in a single pass.]

Difference between BSBI and SPIMI

SPIMI:
1. Adds postings directly to the postings list.
2. Faster than BSBI because no sorting is necessary.
3. Saves memory because no termID needs to be stored.
4. Time complexity O(T).

BSBI:
1. Collects term-docID pairs, sorts them, and then creates the postings lists.
2. Slower than SPIMI.
3. Requires storing termIDs, so it needs more space.
4. Time complexity O(T log T).


Distributed indexing
For web-scale indexing
must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?


Google data centers

Google data centers mainly contain commodity machines.
Data centers are distributed around the world.
Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
Estimate: Google installs 100,000 servers each quarter.
Based on expenditures of 200-250 million dollars per year.
This would be 10% of the computing capacity of the world!

Distributed indexing
Maintain a master machine directing the indexing job; it is considered "safe".
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine
from a pool.


Parallel tasks
We will use two sets of parallel tasks: parsers and inverters.
Break the input document collection into splits.
Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).


Parsers
Master assigns a split to an idle parser machine.
Parser reads a document at a time and emits (term, doc) pairs.
Parser writes pairs into j partitions
Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
Now to complete the index inversion


Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
Sorts and writes to postings lists
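A toy end-to-end sketch of the two task types, with the 3-way first-letter partition from the example (`parse`, `partition_of`, and `invert` are hypothetical names):

```python
from itertools import groupby
from operator import itemgetter

def parse(split):
    """Parser task: emit (term, docID) pairs, one document at a time."""
    for doc_id, text in split:
        for term in text.lower().split():
            yield term, doc_id

def partition_of(term):
    """Assign a term to one of j = 3 partitions: a-f, g-p, q-z."""
    first = term[0]
    return 0 if first <= "f" else 1 if first <= "p" else 2

def invert(pairs):
    """Inverter task: sort one term-partition's pairs into postings lists."""
    return {term: [doc for _, doc in group]
            for term, group in groupby(sorted(pairs), key=itemgetter(0))}

split = [(1, "brutus killed caesar"), (2, "caesar was ambitious")]
pairs = list(parse(split))
partition0 = [p for p in pairs if partition_of(p[0]) == 0]  # terms a-f
invert(partition0)  # {'ambitious': [2], 'brutus': [1], 'caesar': [1, 2]}
```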


Data flow

[Figure: the master assigns splits to parsers and term-partitions (a-f, g-p, q-z) to inverters. Each parser writes its (term, doc) pairs into segment files, one per partition; each inverter reads one partition from all segment files. Map phase: parsers; reduce phase: inverters.]

The index construction algorithm we just described is
an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.

Dynamic indexing
Up to now, we have assumed that collections are static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have
to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary


Simplest approach
Maintain big main index
New docs go into small auxiliary index
Search across both, merge results
Invalidation bit-vector for deleted docs:
filter docs returned by a search using this invalidation bit-vector.
Periodically, re-index into one main index
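Query time under this scheme can be sketched as follows (hypothetical names; a set stands in for the invalidation bit-vector):

```python
def search(term, main_index, aux_index, invalidated):
    """Look the term up in both indexes, then drop deleted documents."""
    hits = main_index.get(term, []) + aux_index.get(term, [])
    return [doc for doc in hits if doc not in invalidated]

main_index = {"cow": [1, 2, 5]}   # big main index
aux_index = {"cow": [9]}          # small index of newly added docs
search("cow", main_index, aux_index, invalidated={2})  # [1, 5, 9]
```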


Issues with main and auxiliary indexes

Problem of frequent merges: you touch stuff a lot.
Poor performance during merge
Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
Merge is the same as a simple append.
But then we would need a lot of files inefficient for O/S.
Assumption for the rest of the lecture: the index is one big file.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).

Dynamic/Positional indexing at search engines

All the large search engines now do dynamic indexing.
Their indices have frequent incremental changes:
news items, blogs, new topical web pages (e.g., Sarah Palin).
But (sometimes/typically) they also periodically
reconstruct the index from scratch
Query processing is then switched to the new index, and
the old index is then deleted
Positional indexes:
the same sort of sorting problem, just larger.

