Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel
University of Munich

Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Feature Based Similarity

Simple Similarity Queries


Specify a query object and
  find similar objects: range query
  find the k most similar objects: nearest neighbor query
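
The two query types can be illustrated with a brute-force sketch. It assumes feature vectors are tuples of floats compared by Euclidean distance; the function names `range_query` and `knn_query` are chosen here for illustration and do not come from the slides.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points whose Euclidean distance to query is at most eps."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions scan every point, so they only show the semantics of the queries, not an efficient access method.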

Join Applications: Catalogue Matching


Catalogue matching
E.g. astronomical catalogues

[Figure: data sets R and S]

Join Applications: Clustering


Clustering (e.g. DBSCAN)

Similarity self-join
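
As a reference point, the similarity self-join can be written as a nested loop over all pairs. This is a baseline sketch that assumes Euclidean distance and a join radius `eps`; it is not the algorithm presented in this talk.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j) with i < j whose points are within eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

sample = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
pairs = similarity_self_join(sample, 1.0)
```

The quadratic number of distance computations in this baseline is what a join algorithm for massive data has to avoid.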

Grid partitioning
General idea: grid approximation where the grid line distance = ε
Similar idea in the ε-kdB-tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Disadvantage of any grid approach: the number of neighboring grid cells is 3^d - 1 (d = number of dimensions)
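
A short sketch makes the growth of the neighborhood concrete. It assumes a grid anchored at the origin with cell width `eps`; `grid_cell` and `neighbor_cells` are illustrative helper names.

```python
import math
from itertools import product

def grid_cell(point, eps):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / eps) for x in point)

def neighbor_cells(cell):
    """Enumerate all cells adjacent to cell, excluding the cell itself."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [
        tuple(c + o for c, o in zip(cell, off))
        for off in offsets
        if any(off)  # skip the all-zero offset, which is the cell itself
    ]

count_2d = len(neighbor_cells((0, 0)))
count_8d = len(neighbor_cells((0,) * 8))
```

In 2 dimensions a cell has 8 neighbors, in 8 dimensions already 6560, which is why visiting all neighboring cells does not scale with the dimensionality.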

Scalability of the ε-kdB-tree

Assumption: two adjacent ε-stripes fit in main memory
Unrealistic for large data sets which are clustered, skewed and high-dimensional

Epsilon Grid Order

ε-Grid-Order Is a Total Strict Order

Strict order:
  Irreflexivity
  Transitivity
  Asymmetry

ε-grid-order can be used in any sorting algorithm
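
The following sketch plugs the order into a standard sort. It assumes the order compares points by the lexicographic order of their grid cells (cell width `eps`, grid anchored at the origin); in this sketch two points in the same cell compare as equal. The names `ego_less` and `ego_sort` are illustrative.

```python
import math
from functools import cmp_to_key

def cell(point, eps):
    """Integer grid cell coordinates of a point."""
    return tuple(math.floor(x / eps) for x in point)

def ego_less(p, q, eps):
    """True if p precedes q: the first differing cell coordinate is smaller."""
    for a, b in zip(cell(p, eps), cell(q, eps)):
        if a != b:
            return a < b
    return False  # same cell: neither point precedes the other

def ego_sort(points, eps):
    """Sort points with a comparison function built from ego_less."""
    def compare(p, q):
        if ego_less(p, q, eps):
            return -1
        if ego_less(q, p, eps):
            return 1
        return 0
    return sorted(points, key=cmp_to_key(compare))

data = [(0.9, 0.1), (0.1, 0.7), (0.1, 0.2), (0.6, 0.0)]
ordered = ego_sort(data, 0.5)
```

Because only a comparison function is needed, the same predicate could drive an external merge sort for files that do not fit in main memory.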

ε-Interval
Coarse approximation of join mates: used for I/O processing
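
One way to read the interval is sketched below. It assumes that the points are sorted by the lexicographic order of their grid cells and that the interval of a point p runs from the cell of p shifted by -1 in every dimension to the cell of p shifted by +1 in every dimension; `eps_interval` is an illustrative name.

```python
import math
from bisect import bisect_left, bisect_right

def cell(point, eps):
    """Integer grid cell coordinates of a point."""
    return tuple(math.floor(x / eps) for x in point)

def eps_interval(sorted_points, p, eps):
    """Return the run of sorted_points that may contain join mates of p."""
    keys = [cell(q, eps) for q in sorted_points]
    c = cell(p, eps)
    lower = tuple(x - 1 for x in c)  # cell of p shifted by -eps per dimension
    upper = tuple(x + 1 for x in c)  # cell of p shifted by +eps per dimension
    return sorted_points[bisect_left(keys, lower):bisect_right(keys, upper)]

pts = [(0.5, 0.5), (1.5, 1.5), (2.5, -3.5), (5.5, 5.5)]  # already in grid order
candidates = eps_interval(pts, (1.5, 1.5), 1.0)
```

The point (2.5, -3.5) falls inside the interval although it is far from (1.5, 1.5). The interval is therefore a coarse filter that can rule out whole stretches of the sorted file, while the exact distance test is still needed afterwards.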

I/O Processing for the Self Join


Decompose the sorted file into I/O units
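
A simplified sketch of this decomposition follows. It assumes a fixed number of points per I/O unit and a file sorted by the lexicographic order of grid cells; how units are actually sized and scheduled is not stated on this slide, so `io_units` and `unit_pairs_to_join` are hypothetical helpers.

```python
import math

def cell(point, eps):
    """Integer grid cell coordinates of a point."""
    return tuple(math.floor(x / eps) for x in point)

def io_units(sorted_points, unit_size):
    """Cut the sorted file into consecutive units of unit_size points."""
    return [
        sorted_points[i:i + unit_size]
        for i in range(0, len(sorted_points), unit_size)
    ]

def unit_pairs_to_join(units, eps):
    """Index pairs (i, j), i <= j, of units that may contain join mates."""
    pairs = []
    for i, unit in enumerate(units):
        # Upper end of the eps-interval of the last point in this unit.
        upper = tuple(c + 1 for c in cell(unit[-1], eps))
        for j in range(i, len(units)):
            if cell(units[j][0], eps) > upper:
                break  # every later unit starts even further along the order
            pairs.append((i, j))
    return pairs

sorted_file = [(0.5, 0.5), (0.5, 1.5), (1.5, 0.5),
               (3.5, 0.5), (3.5, 1.5), (9.5, 9.5)]
units = io_units(sorted_file, 2)
schedule = unit_pairs_to_join(units, 1.0)
```

Only the listed unit pairs have to be in memory together. In the example the first and the last unit are never paired.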

Epsilon Grid Order

CPU Processing
I/O units are further decomposed before joining
Simple divide-and-conquer: no further sorting
Decomposition: maximize active dimensions
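
The divide-and-conquer idea can be sketched as a recursive halving of an already sorted unit. This is a simplified illustration: it prunes a pair of runs only by comparing grid cells and does not model the active-dimension criterion named above. The function names and the `min_len` threshold are assumptions made for this example.

```python
import math

def cell(point, eps):
    """Integer grid cell coordinates of a point."""
    return tuple(math.floor(x / eps) for x in point)

def join_runs(a, b, eps, min_len=4):
    """Join two runs of a grid-ordered list, where run a precedes run b."""
    if not a or not b:
        return []
    if cell(b[0], eps) > tuple(c + 1 for c in cell(a[-1], eps)):
        return []  # b starts beyond the eps-interval of every point in a
    if len(a) <= min_len and len(b) <= min_len:
        return [(p, q) for p in a for q in b if math.dist(p, q) <= eps]
    a1, a2 = a[:len(a) // 2], a[len(a) // 2:]
    b1, b2 = b[:len(b) // 2], b[len(b) // 2:]
    return (join_runs(a1, b1, eps, min_len) + join_runs(a1, b2, eps, min_len)
            + join_runs(a2, b1, eps, min_len) + join_runs(a2, b2, eps, min_len))

def self_join(run, eps, min_len=4):
    """Self-join of one grid-ordered run by recursive halving."""
    if len(run) <= min_len:
        return [(p, q) for i, p in enumerate(run) for q in run[i + 1:]
                if math.dist(p, q) <= eps]
    left, right = run[:len(run) // 2], run[len(run) // 2:]
    return (self_join(left, eps, min_len) + self_join(right, eps, min_len)
            + join_runs(left, right, eps, min_len))

eps = 0.1
raw = [((i * 37 % 101) / 101, (i * 53 % 103) / 103) for i in range(150)]
run = sorted(raw, key=lambda p: cell(p, eps))  # sorted once, never re-sorted
result = self_join(run, eps)
```

Halving preserves the grid order, so no sub-run has to be sorted again, and a pair of runs is discarded as soon as its cells are too far apart.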

CPU Processing
Point distance computations: order of dimensions
  Neighboring inactive dimensions
  Unspecified dimensions
  Active dimension
  Aligned inactive dimensions
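
Why the order of dimensions matters can be seen in a distance test with early termination. The sketch assumes squared Euclidean distance and takes the dimension order as a caller-supplied parameter `dim_order`; how that order is derived from the four categories above is not reproduced here.

```python
def within_eps(p, q, eps, dim_order=None):
    """True if the Euclidean distance of p and q is at most eps.

    Dimensions are visited in dim_order (default: 0, 1, 2, ...), and the
    loop stops as soon as the partial sum of squares exceeds eps squared.
    """
    limit = eps * eps
    total = 0.0
    order = dim_order if dim_order is not None else range(len(p))
    for d in order:
        diff = p[d] - q[d]
        total += diff * diff
        if total > limit:
            return False  # remaining dimensions cannot reduce the sum
    return True

far = within_eps((0.0, 0.0, 0.0), (0.1, 0.1, 5.0), 1.0, dim_order=[2, 0, 1])
near = within_eps((0.0, 0.0), (0.6, 0.6), 1.0)
```

In the first call the dimension with the large difference is visited first, so the test is decided after a single term. Visiting the dimensions most likely to differ first keeps the average number of evaluated terms low.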

Experimental Results
8-dimensional uniformly distributed vectors

Experimental Results (2)


16-d feature vectors from CAD application

Conclusions
Summary
  High potential for performance gains of the similarity join by page capacity optimization
  Necessary to optimize I/O and CPU separately

Future research potential


  Similarity join for metric index structures
  Approximate similarity join
  Parallel similarity join algorithms