Agenda
• Who am I?
• Motivation
• New flex APIs
• Codecs
• Wrap up
Who am I?
• Committer, PMC member Lucene/Solr
• Co-author of Lucene in Action, 2nd edition
– LUCENEREV40 promo code!
• Blog: http://chbits.blogspot.com
• Emacs, Python lover
• Sponsored by IBM
Your ideas will go further if you don’t insist on going with them.
Motivation
• Lucene is showing its age
– vInt is costly
• Lucene is hard to change at low levels
– Index format is too rigid
• Yet, lots of innovation in the IR world...
– New compression formats, data structures,
scoring models, etc.
• IR researchers use other search engines
– Terrier, Lemur/Indri, MG4J, etc.
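Inverted Index 101
[Diagram: the inverted index as nested sorted maps, Field -> Term -> Doc ID -> Positions. Example fields: body, title. Example terms: bay, pod, sweet. Example postings entries: 11 and 22, each shown with a payload.]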
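The nested-map picture above can be sketched in plain Java. This is a toy model only: the class and method names (ToyInvertedIndex, add, postings) are made up for illustration, whitespace tokenization is an assumption, and payloads are left out. It does not reflect how Lucene stores postings on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy inverted index: field -> term -> docID -> positions.
// Hypothetical names; not Lucene's API or index format.
final class ToyInvertedIndex {
    final SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> fields =
        new TreeMap<>();

    // Splits text on whitespace and records the position of every token.
    void add(int docID, String field, String text) {
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docID, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    // Returns docID -> positions for one term, or an empty map if absent.
    SortedMap<Integer, List<Integer>> postings(String field, String term) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null) {
            return new TreeMap<>();
        }
        SortedMap<Integer, List<Integer>> docs = terms.get(term);
        return docs == null ? new TreeMap<>() : docs;
    }
}
```

Adding doc 11 with body "sweet pod bay" and doc 22 with body "pod pod", then calling postings("body", "pod"), yields {11=[1], 22=[0, 1]}: doc IDs in sorted order, each with the positions where the term occurs.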
Flex overview
• 4.0 (trunk) only!
• New low-level postings enum API
• Pluggable, per-segment codec has full
control over reading/writing postings
– Building blocks make it easy to create your own
– Some neat codecs!
• Performance gains
– Much less RAM used
– Faster queries, filters
[Diagram: Content -> Indexing and Users -> Searching both sit on top of the Flex APIs, which sit on the Codec, which sits on Disk.]
4D enum API
• Fields, FieldsEnum
– field
• Terms, TermsEnum
– term, docFreq, ord
• DocsEnum
– docID, freq
• DocsAndPositionsEnum
– docID, freq, position, payload
• All enums allow custom attrs
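As a rough illustration of the four levels (fields, terms, docs, positions), the sketch below walks a toy nested-map structure and reports the same values the bullets list: field, term, ord, docFreq, docID, freq and positions. FourLevelWalk and its sample data are hypothetical stand-ins, not the Lucene enum classes, and payloads and custom attributes are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Walks toy postings in four nested levels. Hypothetical names, not Lucene classes.
final class FourLevelWalk {
    static List<String> walk(
            SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> fields) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> f
                : fields.entrySet()) {                          // level 1: field
            long ord = 0;                                       // term ordinal within the field
            for (Map.Entry<String, SortedMap<Integer, List<Integer>>> t
                    : f.getValue().entrySet()) {                // level 2: term
                int docFreq = t.getValue().size();              // number of docs with the term
                out.add("field=" + f.getKey() + " term=" + t.getKey()
                        + " ord=" + ord + " docFreq=" + docFreq);
                ord++;
                for (Map.Entry<Integer, List<Integer>> d
                        : t.getValue().entrySet()) {            // level 3: doc
                    int freq = d.getValue().size();             // occurrences in this doc
                    out.add("  docID=" + d.getKey() + " freq=" + freq
                            + " positions=" + d.getValue());    // level 4: positions
                }
            }
        }
        return out;
    }

    // Small made-up data set: one field, two terms.
    static SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> sample() {
        SortedMap<Integer, List<Integer>> pod = new TreeMap<>();
        pod.put(11, Arrays.asList(1));
        pod.put(22, Arrays.asList(0, 1));
        SortedMap<Integer, List<Integer>> bay = new TreeMap<>();
        bay.put(11, Arrays.asList(2));
        SortedMap<String, SortedMap<Integer, List<Integer>>> body = new TreeMap<>();
        body.put("pod", pod);
        body.put("bay", bay);
        SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> fields =
            new TreeMap<>();
        fields.put("body", body);
        return fields;
    }
}
```

The point of the layering is that a caller who only needs docFreq can stop at the term level, while a phrase-style consumer descends all the way to positions.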
What’s really in a codec?
• Codec provides read/write for one segment
– Unique name (String)
– FieldsConsumer (for writing)
– FieldsProducer (for reading): the 4D enum API + close
• CodecProvider creates Codec instance
– Passed to IndexWriter/Reader
• You can override merging
• Reusable building blocks
– Terms dict + index, Postings
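To make the codec idea concrete, here is a heavily simplified sketch: a named object that owns both the write side and the read side, looked up through a provider. ToyCodec, FixedIntCodec, DeltaVIntCodec and ToyCodecProvider are hypothetical names, and the "segment" is reduced to one term's sorted, non-negative doc ID list. The real FieldsConsumer/FieldsProducer interfaces are much richer than this.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature codec: a unique name plus a write side and a read side.
interface ToyCodec {
    String name();
    byte[] write(int[] sortedDocIDs);   // stands in for the writing side
    int[] read(byte[] bytes);           // stands in for the reading side
}

// Stores each doc ID as a fixed 4-byte int.
final class FixedIntCodec implements ToyCodec {
    public String name() { return "FixedInt"; }
    public byte[] write(int[] docIDs) {
        ByteBuffer buf = ByteBuffer.allocate(4 * docIDs.length);
        for (int d : docIDs) buf.putInt(d);
        return buf.array();
    }
    public int[] read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] docIDs = new int[bytes.length / 4];
        for (int i = 0; i < docIDs.length; i++) docIDs[i] = buf.getInt();
        return docIDs;
    }
}

// Stores gaps between doc IDs, 7 bits per byte; the high bit means "more bytes follow".
final class DeltaVIntCodec implements ToyCodec {
    public String name() { return "DeltaVInt"; }
    public byte[] write(int[] docIDs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int d : docIDs) {
            int delta = d - prev;
            prev = d;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }
    public int[] read(byte[] bytes) {
        int[] docIDs = new int[bytes.length];   // upper bound: one value per byte
        int count = 0, prev = 0, i = 0;
        while (i < bytes.length) {
            int delta = 0, shift = 0;
            byte b;
            do {
                b = bytes[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;                      // undo the delta coding
            docIDs[count++] = prev;
        }
        return Arrays.copyOf(docIDs, count);
    }
}

// Hypothetical provider: resolves a codec by its unique name.
final class ToyCodecProvider {
    private final Map<String, ToyCodec> codecs = new HashMap<>();
    void register(ToyCodec codec) { codecs.put(codec.name(), codec); }
    ToyCodec lookup(String name) {
        ToyCodec codec = codecs.get(name);
        if (codec == null) throw new IllegalArgumentException("unknown codec: " + name);
        return codec;
    }
}
```

For the doc IDs 11, 22, 300, FixedInt writes 12 bytes and DeltaVInt writes 4, yet callers see the same int[] from either codec. That separation between the API and the bytes on disk is the property the slide is describing.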
SimpleText codec
• All postings stored in _X.pst text file
• Read / write
• Not performant
– Do not use in production!
• Fully functional
– Passes all Lucene/Solr unit tests (slowly...)
• Useful/fun for debugging
• http://s.apache.org/eh
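The appeal of a plain-text postings file for debugging can be shown with a small round-trip sketch. The line layout below (term / doc / freq / END) and the ToyTextPostings class are assumptions made for illustration; they are not claimed to match the actual _X.pst format, and the reader assumes well-formed input.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy text serialization of term -> docID -> freq. Hypothetical layout,
// not the real SimpleText file format.
final class ToyTextPostings {
    // Writes one indented, human-readable line per term, doc and freq.
    static String write(SortedMap<String, SortedMap<Integer, Integer>> terms) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, SortedMap<Integer, Integer>> t : terms.entrySet()) {
            sb.append("term ").append(t.getKey()).append('\n');
            for (Map.Entry<Integer, Integer> d : t.getValue().entrySet()) {
                sb.append("  doc ").append(d.getKey()).append('\n');
                sb.append("    freq ").append(d.getValue()).append('\n');
            }
        }
        return sb.append("END\n").toString();
    }

    // Parses the text back; assumes well-formed input produced by write().
    static SortedMap<String, SortedMap<Integer, Integer>> read(String text) {
        SortedMap<String, SortedMap<Integer, Integer>> terms = new TreeMap<>();
        SortedMap<Integer, Integer> docs = null;
        int docID = -1;
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.equals("END")) {
                break;
            } else if (line.startsWith("term ")) {
                docs = new TreeMap<>();
                terms.put(line.substring(5), docs);
            } else if (line.startsWith("doc ")) {
                docID = Integer.parseInt(line.substring(4));
            } else if (line.startsWith("freq ")) {
                docs.put(docID, Integer.parseInt(line.substring(5)));
            }
        }
        return terms;
    }

    // Made-up sample: "pod" occurs once in doc 11 and twice in doc 22.
    static SortedMap<String, SortedMap<Integer, Integer>> sample() {
        SortedMap<Integer, Integer> pod = new TreeMap<>();
        pod.put(11, 1);
        pod.put(22, 2);
        SortedMap<String, SortedMap<Integer, Integer>> terms = new TreeMap<>();
        terms.put("pod", pod);
        return terms;
    }
}
```

A file like this can be opened in any editor or diffed between two index builds, which is the debugging benefit. Parsing text on every read is also why such a format is slow.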
Some ideas to try
• In-memory postings
– Maybe only terms dict, select postings, etc.
• Variable-gap terms index
– Add indexed term if docFreq > N
– Good for noisy terms (e.g., OCR)
• DFA/trie/FST as terms dict/index
• Finer omitTFAP (OmitTF, OmitP, per-term)
• Block-encoding for terms dict sections
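One possible reading of the variable-gap idea is sketched below: walk the sorted terms and put a term into the in-memory index when its docFreq exceeds a threshold N, with a maximum gap as a fallback so lookups never scan too far. The fallback rule, the class name VariableGapIndexSketch and the parameter names are assumptions for illustration, not a description of any shipped implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical selection policy for a variable-gap terms index.
final class VariableGapIndexSketch {
    // docFreqs[ord] is the docFreq of the term with that ordinal (terms are sorted).
    // Returns the ordinals of the terms chosen for the in-memory index.
    static List<Integer> selectIndexedTerms(int[] docFreqs, int minDocFreq, int maxGap) {
        List<Integer> indexed = new ArrayList<>();
        int sinceLast = 0;   // terms seen since the last indexed term
        for (int ord = 0; ord < docFreqs.length; ord++) {
            sinceLast++;
            boolean frequent = docFreqs[ord] > minDocFreq;   // "docFreq > N" from the slide
            boolean gapTooLong = sinceLast >= maxGap;        // assumed fallback rule
            if (ord == 0 || frequent || gapTooLong) {
                indexed.add(ord);
                sinceLast = 0;
            }
        }
        return indexed;
    }
}
```

With docFreqs {1, 1, 1, 500, 1, 1, 1, 1, 1}, N = 100 and a maximum gap of 4, the selected ordinals are [0, 3, 7]. The rare terms (the OCR-noise case) mostly stay out of the index, while the frequent term at ordinal 3 gets a direct entry.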
Questions?
Backup
Composite vs atomic readers
• Lucene has aggressively moved to “per
segment” search, starting in 2.9
• Flex furthers this!
• Best to work directly with sub-readers
– Use direct flex APIs, e.g. reader.fields(), for this
• If you must operate on composite reader...
– Use MultiFields.getFields(reader), or
– SlowMultiReaderWrapper.wrap
– Beware performance hit!
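The performance hit comes from having to merge per-segment views on the fly. The sketch below shows the kind of work a composite terms view implies: one cursor per segment kept in a priority queue, with every step paying a queue operation. CompositeTermsSketch is a hypothetical illustration using plain sorted sets of strings; it is not how MultiFields or SlowMultiReaderWrapper is implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical sketch of merging per-segment sorted terms into one composite view.
final class CompositeTermsSketch {
    // One cursor per segment, ordered by the term it currently points at.
    private static final class Cursor implements Comparable<Cursor> {
        private final Iterator<String> it;
        private String current;

        Cursor(Iterator<String> it) {
            this.it = it;
            this.current = it.next();   // caller guarantees a non-empty segment
        }

        boolean advance() {
            if (!it.hasNext()) {
                return false;
            }
            current = it.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return current.compareTo(other.current);
        }
    }

    // Returns all terms across segments, sorted and de-duplicated.
    static List<String> mergedTerms(List<TreeSet<String>> segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (TreeSet<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();              // smallest current term
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(top.current)) {
                merged.add(top.current);            // skip terms already emitted
            }
            if (top.advance()) {
                queue.add(top);                     // re-insert at its next term
            }
        }
        return merged;
    }
}
```

Working segment by segment avoids this merge entirely: each sub-reader's terms are already sorted, so a per-segment loop does no queue work at all.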
Code: visit docs containing a term
// Look up the terms of the "body" field on this reader
Fields fields = reader.fields();
Terms terms = fields.terms("body");
TermsEnum iter = terms.iterator();
// Seek to the exact term "pod"
if (iter.seek(new BytesRef("pod")) == SeekStatus.FOUND) {
  // Visit every doc that contains the term
  DocsEnum docs = iter.docs(null);
  int docID;
  while ((docID = docs.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
    ...
  }
}
Explore more about Flexible Indexing at
www.lucidimagination.com