
Grouping & Joining

Martijn van Groningen martijn.vangroningen@searchworkings.com Lucene Committer & PMC Member

Thursday, May 17, 2012

Grouping & Joining: Overview

- Background
- Joining
- Result grouping
- Conclusion

Searchworkings.org - The online search community



Background: Lucene's model

- Lucene is document based.
- Lucene doesn't store information about relations between documents.
- Data often holds relations.
- Goal: good free-text search over relational data.


Background: Example

- Product
  - Name
  - Description
- Product-item
  - Color
  - Size
  - Price

- Goal: show the most applicable product based on product-item criteria.



Background: Common Lucene solutions

- Compound documents.
  - May result in documents with many fields.
- Subsequent searches.
  - May cause a lot of network overhead.

- Non-Lucene-based approach:
  - If free-text search isn't very important, use a relational database.


Background: Example domain

- Compound Product & Product-items document.
  - Each product-item has its own field prefix.
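
The compound-document approach can be pictured with a small, Lucene-free sketch. It flattens one product and its product-items into a single field map. The `item<N>_` prefix scheme and all field names are assumptions made for illustration; the slides do not show the actual naming.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompoundDocumentSketch {

    // Flattens a product and its product-items into one field map.
    // The "item<N>_" prefix is a hypothetical naming scheme.
    static Map<String, String> toCompoundDocument(
            String name, String description, List<Map<String, String>> items) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", name);
        fields.put("description", description);
        for (int i = 0; i < items.size(); i++) {
            String prefix = "item" + i + "_";
            for (Map.Entry<String, String> field : items.get(i).entrySet()) {
                fields.put(prefix + field.getKey(), field.getValue());
            }
        }
        return fields;
    }

    // Builds the fields of one product-item (hypothetical helper).
    static Map<String, String> item(String color, String size, String price) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("color", color);
        fields.put("size", size);
        fields.put("price", price);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> doc = toCompoundDocument(
                "Shirt", "Cotton shirt",
                Arrays.asList(item("red", "M", "20"), item("blue", "L", "22")));
        // Two product fields plus three fields per product-item.
        System.out.println(doc.size() + " fields: " + doc.keySet());
    }
}
```

The field count grows with every product-item, which is the "many fields" drawback of compound documents.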


Background: Different solutions

- Lucene offers solutions for 'relational'-like search.
  - Parent-child
- Grouping & joining aren't naturally supported.
  - All of these solutions increase search time.
- In some scenarios grouping and joining aren't the right solution.


Joining
Modelling relations


Joining: Introduction

- Support for parent-child-like search since Lucene 3.4.
  - Not a SQL join.
- The parent and each child are stored as separate documents.
- Two types:
  - Index time join
  - Query time join


Joining: Index time join

- Two block join queries:
  - ToParentBlockJoinQuery
  - ToChildBlockJoinQuery
- One Lucene collector:
  - ToParentBlockJoinCollector
- Index time join requires block indexing.


Joining: Block indexing

- Atomically adding documents.
  - A block of documents.
- Each document in the block gets a sequentially assigned Lucene document ID.
- IndexWriter#addDocuments(docs);


Joining: Block indexing

- The index doesn't record blocks.
  - Segment merging doesn't re-order documents in a segment.
- The application is responsible for identifying block documents.
  - Marking the last document in a block.
- Adding a document to a block requires reindexing the whole block.
  - Removing a document from a block doesn't require reindexing the block.


Joining: Example domain

- The parent is the last document in a block.


Joining: Block indexing


Marking parent documents


Joining: Block indexing

Add block


Joining: ToParentBlockJoinQuery

- The parent filter marks the parent documents.
- The child query is executed in the parent space.
- ToChildBlockJoinQuery works in the opposite direction.
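
The mechanics of a block join can be sketched without Lucene. In this minimal model, which is an illustration and not the Lucene implementation, a list position stands in for the document ID, each block is appended as children followed by their parent, and a `BitSet` plays the role of the parent filter. A matching child is mapped to its parent by finding the next set bit at or after the child's ID. The class name, the methods and the document contents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class BlockJoinSketch {
    // Position in this list stands in for the Lucene document ID.
    private final List<String> docs = new ArrayList<>();
    // Stands in for the parent filter: one bit per parent document.
    private final BitSet parentBits = new BitSet();

    // Appends the children and then the parent as one contiguous block.
    void addBlock(List<String> children, String parent) {
        docs.addAll(children);
        docs.add(parent);
        parentBits.set(docs.size() - 1); // the parent is the last document
    }

    // Naive child query: IDs of non-parent documents containing the term.
    List<Integer> matchChildren(String term) {
        List<Integer> matches = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!parentBits.get(docId) && docs.get(docId).contains(term)) {
                matches.add(docId);
            }
        }
        return matches;
    }

    // Maps child hits into the parent space via the next parent bit.
    SortedSet<Integer> toParents(List<Integer> childIds) {
        SortedSet<Integer> parents = new TreeSet<>();
        for (int childId : childIds) {
            int parentId = parentBits.nextSetBit(childId);
            if (parentId >= 0) {
                parents.add(parentId);
            }
        }
        return parents;
    }

    String doc(int docId) {
        return docs.get(docId);
    }

    public static void main(String[] args) {
        BlockJoinSketch index = new BlockJoinSketch();
        index.addBlock(Arrays.asList("color:red size:M", "color:blue size:L"), "product:shirt");
        index.addBlock(Arrays.asList("color:red size:S"), "product:hat");
        for (int parentId : index.toParents(index.matchChildren("color:blue"))) {
            System.out.println(parentId + " -> " + index.doc(parentId));
        }
    }
}
```

The lookup only works because every block is contiguous and ends with its parent, which is why index time join depends on block indexing.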



Joining: Query time joining

- Query time joining is executed in two phases and is field based:
  - fromField
  - toField
- Doesn't require block indexing.


Joining: Query time joining

- The first phase collects all terms in the fromField of the documents that match the original query.
  - Currently doesn't take the score of the original query into account.
- The second phase returns the documents whose toField matches one of the terms collected in the first phase.
- Two different implementations:
  - JoinUtil - Lucene (≥ 3.6)
  - Join query parser - Solr (trunk)
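
The two phases can be illustrated with plain Java collections. This is a sketch of the idea only, not the `JoinUtil` API: documents are modelled as field maps, the original query as a `Predicate`, and the field names `product_id` and `id` are assumptions chosen to fit the product example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class QueryTimeJoinSketch {

    // Phase 1: collect the fromField values of all documents matching the query.
    static Set<String> collectFromTerms(List<Map<String, String>> docs,
            Predicate<Map<String, String>> fromQuery, String fromField) {
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(fromField);
            if (value != null && fromQuery.test(doc)) {
                terms.add(value);
            }
        }
        return terms;
    }

    // Phase 2: return the documents whose toField holds a collected term.
    static List<Map<String, String>> matchToField(List<Map<String, String>> docs,
            Set<String> terms, String toField) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(toField);
            if (value != null && terms.contains(value)) {
                result.add(doc);
            }
        }
        return result;
    }

    // Builds a document from alternating field names and values.
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            fields.put(keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("id", "1", "name", "shirt"),
                doc("id", "2", "name", "hat"),
                doc("product_id", "1", "color", "red"),
                doc("product_id", "1", "color", "blue"),
                doc("product_id", "2", "color", "blue"));
        Set<String> terms = collectFromTerms(
                docs, d -> "red".equals(d.get("color")), "product_id");
        System.out.println(matchToField(docs, terms, "id"));
    }
}
```

Because only terms are carried from the first phase to the second, the score of the original query is lost, which matches the limitation noted above.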

Joining: Query time joining - Indexing

Refers to the product id.



Joining: Query time joining - Indexing


Joining: Query time joining

- The result will contain one product.
- It is possible to join over two indices.


Joining: Final thoughts

- The joining module has good solutions for modelling parent-child relations.
- Use block join if you care about scoring.
  - Frequent updates can be problematic.
- Use query time join for parent-child filtering.
  - Query time join is slower than index time join.
- Mostly a Lucene-only feature.
  - All code is annotated as experimental.



Result grouping
Previously known as Field Collapsing.


Result grouping: Introduction

- Group matching documents that share a common property.
- A search hit represents a group.
  - Facet counts & total hit count represent groups.
- Per group, collect information:
  - Most relevant document.
  - Top three documents.
  - Aggregated counts.

Result grouping: Usages

- Group documents by a shared property.
  - Product-item by product id (parent-child).
- Collapse similar-looking documents.
  - E.g. all results from the Wikipedia domains.
- Remove duplicates from the search result.
  - Based on a field that contains a hash.


Result grouping: Example domain

- Each Product-item is a document, but includes the product data.


Result grouping: Implementation

- Result grouping is implemented with Lucene collectors.
  - A module in trunk and a contrib in 3.x versions.
- Two-pass result grouping.
  - Grouping by indexed field, function or doc values.
- Single-pass result grouping.
  - Requires block indexing.


Result grouping: Two-pass implementation

- The first pass collects the top N groups.
  - Per group: group value + sort value.
- The second pass collects data for each top group.
  - The top N documents per group.
  - Possibly other aggregated information.
- The second pass search ignores all documents outside the top N groups.
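
The two passes can be sketched over an in-memory list of hits. This is a simplified model, not the Lucene grouping collectors: it assumes groups are ranked by their best document score and it sorts whole buckets instead of using bounded priority queues. The `Hit` class and all values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoPassGroupingSketch {

    static final class Hit {
        final String group;
        final String id;
        final double score;

        Hit(String group, String id, double score) {
            this.group = group;
            this.id = id;
            this.score = score;
        }
    }

    // First pass: track the best score per group, keep the top N group values.
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Double> best = new HashMap<>();
        for (Hit hit : hits) {
            best.merge(hit.group, hit.score, Math::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    // Second pass: collect the top documents for each top group only.
    static Map<String, List<Hit>> secondPass(
            List<Hit> hits, List<String> topGroups, int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String group : topGroups) {
            result.put(group, new ArrayList<>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = result.get(hit.group);
            if (bucket != null) { // hits outside the top groups are ignored
                bucket.add(hit);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Double.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("p1", "a", 0.9), new Hit("p2", "b", 0.8),
                new Hit("p1", "c", 0.5), new Hit("p3", "d", 0.1),
                new Hit("p2", "e", 0.7));
        List<String> topGroups = firstPass(hits, 2);
        for (Map.Entry<String, List<Hit>> e : secondPass(hits, topGroups, 1).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().get(0).id);
        }
    }
}
```

The second pass visits the matching documents again, which is where the extra query time of two-pass grouping comes from.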


Result grouping: Indexing


Result grouping: Searching


Result grouping: Made easier

- GroupingSearch
- Solr
  - http://myhost/solr/select?q=shirt&group=true&group.field=product_id
  - Many more options:
    - http://wiki.apache.org/solr/FieldCollapsing



Result grouping: Parent-child result

- TopGroups - equivalent to TopDocs.
  - Hit count
  - Group count
  - Groups
    - Top documents
- Facet and total counts can represent groups instead of documents.
  - But this requires more query time.


Conclusion
Compare...


Conclusion: Compare the parent-child solutions

- Result grouping
  - Pro: distributed support & parent-child relation as hit.
  - Con: parent data duplication.
  - Con: impact on query time.
- Joining
  - Pro: fast & no data duplication.
  - Con: index time join is not optimal for updates.
  - Con: query time join is limited.

Conclusion: Compare the parent-child solutions

- Compound documents
  - Pro: fast and works out of the box with all features.
  - Con: not flexible when it comes to updates.
  - Con: document granularity is set in stone.


Any questions?


Extra slides
We have time left!


Conclusion: Future work

- Higher-level parent-child API.
  - Needs to cover search & indexing.
- Joining
  - Distributed support.
  - Represent a hit as a parent-child relation in the search result.
- Result grouping
  - Aggregated grouped information like: sum, avg, min, max etc.

Joining: ToParentBlockJoinCollector

- TopGroups contains one group for each of the top N parent documents.
  - Each group contains a parent and its child documents.

Result grouping: Groups & facet counts

- Faceting and result grouping are different features.
  - But they are often used together!
- Facet counts can be based on:
  - Found documents.
  - Found groups.
  - Combination of facet value and group.
- All options are supported in Solr.
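
The three counting modes can give different numbers for the same result set. The sketch below is one plausible reading of them, stated as an assumption: "found groups" counts the facet value of the most relevant document of each group, and "combination of facet value and group" counts the distinct groups seen per facet value. It is plain Java, not Solr or Lucene code, and the `Doc` class and its data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class GroupFacetSketch {

    static final class Doc {
        final String group;
        final String facet;
        final double score;

        Doc(String group, String facet, double score) {
            this.group = group;
            this.facet = facet;
            this.score = score;
        }
    }

    // Mode 1: every found document contributes to its facet value.
    static Map<String, Integer> countByDocuments(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.facet, 1, Integer::sum);
        }
        return counts;
    }

    // Mode 2 (assumed reading): only the best document of each group counts.
    static Map<String, Integer> countByGroupHeads(List<Doc> docs) {
        Map<String, Doc> heads = new HashMap<>();
        for (Doc doc : docs) {
            Doc current = heads.get(doc.group);
            if (current == null || doc.score > current.score) {
                heads.put(doc.group, doc);
            }
        }
        return countByDocuments(new ArrayList<>(heads.values()));
    }

    // Mode 3 (assumed reading): distinct groups per facet value.
    static Map<String, Integer> countByFacetAndGroup(List<Doc> docs) {
        Map<String, Set<String>> groupsPerFacet = new TreeMap<>();
        for (Doc doc : docs) {
            groupsPerFacet.computeIfAbsent(doc.facet, k -> new HashSet<>()).add(doc.group);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : groupsPerFacet.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("p1", "red", 0.9), new Doc("p1", "red", 0.5),
                new Doc("p1", "blue", 0.4), new Doc("p2", "blue", 0.8));
        System.out.println("documents:     " + countByDocuments(docs));
        System.out.println("group heads:   " + countByGroupHeads(docs));
        System.out.println("facet + group: " + countByFacetAndGroup(docs));
    }
}
```

With this data the three modes report different counts for the same hits, so an application needs to decide which one the user interface should show.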



Result grouping: Doc values

- Doc values / Column Stride values
- Prevent the creation of expensive data structures in FieldCache.
- The inverted index is meant for free-text search.
- All grouping collectors have doc-values-based implementations!
