Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data
Similarity self-join
Grid partitioning
General idea: grid approximation in which the distance between grid lines is ε
Similar idea in the ε-kdB-tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
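The grid idea can be illustrated with a minimal sketch. It assumes points are tuples of floats and that the grid is anchored at the origin; the helper names `grid_cell`, `ego_sorted` and `may_join` are hypothetical and not taken from the slides. Because the cell width is ε, two points within distance ε can differ by at most one cell per dimension, which gives a cheap necessary condition for a join pair.

```python
import math

def grid_cell(point, eps):
    """Map a point to the integer coordinates of its grid cell (cell width eps)."""
    return tuple(math.floor(x / eps) for x in point)

def ego_sorted(points, eps):
    """Sort points lexicographically by grid cell (one way to realize a grid order)."""
    return sorted(points, key=lambda p: grid_cell(p, eps))

def may_join(p, q, eps):
    """Necessary (not sufficient) condition for dist(p, q) <= eps:
    the grid cells differ by at most 1 in every dimension."""
    cp, cq = grid_cell(p, eps), grid_cell(q, eps)
    return all(abs(a - b) <= 1 for a, b in zip(cp, cq))
```

Points that fail `may_join` can be skipped without computing a distance; points that pass still need an exact distance check.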
Assumption: two adjacent ε-stripes fit in main memory.
Unrealistic for large data sets which are ... clustered, skewed and high-dimensional
ε-Interval
Coarse approximation of join mates: Used for I/O processing
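A rough sketch of how such an interval could select candidates, under the assumption that the data is sorted lexicographically by grid cell (cell width ε) and that the ε-interval of a point spans its cell shifted by (-1, ..., -1) up to its cell shifted by (+1, ..., +1) in that order. The names `build_index` and `eps_interval_candidates` are hypothetical. Every join mate of the query point falls inside the returned slice, but the slice may also contain points that are not join mates, which is why it is only a coarse approximation.

```python
import bisect
import math

def grid_cell(point, eps):
    """Integer grid-cell coordinates of a point for cell width eps."""
    return tuple(math.floor(x / eps) for x in point)

def build_index(points, eps):
    """Return (points sorted by grid cell, the matching sorted list of cells)."""
    pairs = sorted((grid_cell(p, eps), p) for p in points)
    return [p for _, p in pairs], [c for c, _ in pairs]

def eps_interval_candidates(sorted_points, sorted_cells, p, eps):
    """Slice of the sorted data whose cells lie in the assumed eps-interval of p."""
    cell = grid_cell(p, eps)
    lower = tuple(c - 1 for c in cell)   # cell shifted by (-1, ..., -1)
    upper = tuple(c + 1 for c in cell)   # cell shifted by (+1, ..., +1)
    start = bisect.bisect_left(sorted_cells, lower)
    stop = bisect.bisect_right(sorted_cells, upper)
    return sorted_points[start:stop]
```

Because the interval is a contiguous range of the sorted data, it maps naturally onto a contiguous range of pages, which is what makes it usable for scheduling I/O.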
CPU Processing
I/O units are further decomposed before joining
Simple divide-and-conquer: No further sorting
Decomposition: maximize active dimensions
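The divide-and-conquer step might look roughly like the sketch below. It is a simplification under stated assumptions: both inputs are already sorted lexicographically by grid cell, the longer sequence is simply halved (the slide's heuristic of maximizing active dimensions is not implemented), and the names `join_sequences`, `within_eps` and `min_len` are hypothetical. Halving a sorted sequence yields sorted halves, so no further sorting is needed.

```python
import math

def grid_cell(point, eps):
    """Integer grid-cell coordinates of a point for cell width eps."""
    return tuple(math.floor(x / eps) for x in point)

def within_eps(p, q, eps):
    """Exact test: squared Euclidean distance against eps squared."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps

def _shift(cell):
    """Cell shifted by (1, ..., 1)."""
    return tuple(c + 1 for c in cell)

def join_sequences(a, b, eps, out, min_len=4):
    """Join two grid-ordered sequences; append every pair within eps to out."""
    if not a or not b:
        return
    # Prune: if one sequence starts after the other's last cell shifted by
    # (1, ..., 1), no pair can differ by at most one cell per dimension.
    if grid_cell(b[0], eps) > _shift(grid_cell(a[-1], eps)):
        return
    if grid_cell(a[0], eps) > _shift(grid_cell(b[-1], eps)):
        return
    if len(a) <= min_len and len(b) <= min_len:
        # Small enough: plain nested loop.
        out.extend((p, q) for p in a for q in b if within_eps(p, q, eps))
        return
    # Halve the longer sequence and recurse on both halves.
    if len(a) >= len(b):
        mid = len(a) // 2
        join_sequences(a[:mid], b, eps, out, min_len)
        join_sequences(a[mid:], b, eps, out, min_len)
    else:
        mid = len(b) // 2
        join_sequences(a, b[:mid], eps, out, min_len)
        join_sequences(a, b[mid:], eps, out, min_len)
```

Calling `join_sequences(s, s, eps, out)` on one sorted sequence `s` produces a self-join as ordered pairs, including each point paired with itself.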
CPU Processing
Point distance computations: Order of dimensions
Dimension classes: neighboring inactive dimensions, unspecified dimensions, active dimension, aligned inactive dimensions
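Why the order of dimensions matters can be seen in a small sketch of a distance test that stops as soon as the partial sum exceeds ε². The function name `within_eps_ordered` and the `order` parameter are hypothetical, and the sketch does not assume any particular ranking of the dimension classes listed above; it only shows that visiting dimensions with large expected differences first lets the test abort sooner.

```python
def within_eps_ordered(p, q, eps, order):
    """Test dist(p, q) <= eps, visiting dimensions in the given order.

    Returns (result, number of dimensions evaluated).
    """
    limit = eps * eps
    total = 0.0
    evaluated = 0
    for d in order:
        diff = p[d] - q[d]
        total += diff * diff
        evaluated += 1
        if total > limit:
            # Partial sum already exceeds eps^2: the pair cannot match.
            return False, evaluated
    return True, evaluated
```

For a pair that differs strongly in only one dimension, placing that dimension first rejects the pair after a single term, while placing it last requires evaluating every dimension.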
Experimental Results
8-dimensional uniformly distributed vectors
Conclusions
Summary
High potential for performance gains of the similarity join by page capacity optimization
Necessary to separately optimize I/O and CPU