
The Twin Grid File: A Nearly Space Optimal Index Structure


Hans-Werner Six
FernUniversität Hagen, Postfach 940, D-5800 Hagen

Andreas Hutflesz
Universität Karlsruhe, Postfach 6980, D-7500 Karlsruhe

Peter Widmayer
Universität Karlsruhe, Postfach 6980, D-7500 Karlsruhe

Abstract
Index structures for points, supporting spatial searching, usually suffer from an undesirably low storage space utilization. We show how a dynamically changing set of points can be distributed among two grid files in such a way that storage space utilization is close to optimal. Insertions and deletions trigger the redistribution of points among the two grid files. The number of bucket accesses for redistributing points and the storage space utilization vary with the selected amount of redistribution effort. Typical range queries - the most important spatial search operations - can be answered at least as fast as in the standard grid file.

Introduction
In geographic, engineering, CAD, VLSI and other non-standard database applications, storage schemes are needed that allow for the efficient manipulation of large sets of geometric objects. In particular, proximity queries should be handled efficiently. Index structures for secondary storage have been proposed supporting insertions, deletions, and range queries in a set of multidimensional points (Robinson 1981, Nievergelt et al. 1984, Krishnamurthy et al. 1985, Kriegel et al. 1986, Otoo 1986, Freeston 1987, Hutflesz et al. 1988). The grid file (Nievergelt et al. 1984) has proven to be useful when applied to geometric data (Hinrichs 1985, Six et al. 1988), because it adapts gracefully to the distribution of points in the data space, even if there are clusters (Hinrichs 1985). However, the grid file suffers from an undesirably low average storage space utilization of only roughly 69%, like many other schemes based on recursive halving (Lomet 1987). Especially if storage space is costly or the set of data points is relatively static, i.e., few insertions and deletions occur after the database has been set up, one might wish to increase space utilization, at the expense of some restructuring, e.g. performed upon insertions or deletions. In contrast, it is usually intolerable to increase the time for answering range queries.

We propose a fully adaptive access scheme, based on the grid file (Nievergelt et al. 1984), that exhibits the following properties:

• storage space utilization is high; experiments show it to be around 90% on the average for two-dimensional points;

• range queries can be answered with at least the same efficiency as in the grid file, for average ranges;

• the number of accesses to secondary storage per insertion or deletion, including restructuring operations, increases by a small factor as compared with the grid file; experiments show the factor to lie around 2 in a realistic setting.

* This work was partially supported by grants Si 374/1 and Wi 810/2 from the Deutsche Forschungsgemeinschaft.

Our scheme, the twin grid file, achieves this performance by using two grid files, both spanning the data space, and by appropriately partitioning the point set between these two grid files. It is similar to recursive linear hashing (Ramamohanarao et al. 1984) or to extendible hashing with overflow (Tamminen 1982) in that primarily one grid file is used to store data points, while the other grid file holds quasi-overflow points. It is different from the above in that a true overflow does not occur, but is instead created artificially for better space utilization. Also, points are quite regularly transferred back and forth between the grid files, as space utilization dictates.

In the next section, we sketch the basic idea of the mechanism of the twin grid file. A precise description of the structure and its operations follows in Section 3. We have implemented the twin grid file and compared its performance with the standard grid file; the implementation is described and the performance evaluation is presented in Section 4. Finally, we draw a conclusion in Section 5.

The basic idea

Consider a set of points in the plane, to be stored in a grid file, where each data bucket can hold three points. Let us focus on a series of insertions into the initially empty grid file. To recall the grid file partitioning strategy and compare it with the twin grid file mechanism, consult Figures 2.1 (a) through (f).

Figure 2.1: A standard grid file (legend: bucket region boundary; point in the grid file)

Figure 2.1 (a) shows three points in the data bucket corresponding to the entire data space. As point 4 is inserted (Fig. 2.1 (b)), the bucket region is split into two, corresponding to two buckets, the minimum number of buckets needed to store four points. After point 5 has been inserted (Fig. 2.1 (c)), we have three buckets, clearly more than the required minimum. Here, the twin grid file idea comes in.
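The bucket counting in this example can be checked with a minimal sketch. The function names `min_buckets` and `utilization` are illustrative assumptions, not taken from the paper; only the bucket capacity of three and the point counts come from the text.

```python
import math

def min_buckets(num_points: int, capacity: int) -> int:
    """Smallest number of buckets that can hold num_points points."""
    return math.ceil(num_points / capacity)

def utilization(num_points: int, num_buckets: int, capacity: int) -> float:
    """Fraction of the allocated bucket space that is actually occupied."""
    return num_points / (num_buckets * capacity)

# Five points with capacity three: the grid file of Fig. 2.1 (c) uses
# three buckets, although two would suffice.
print(min_buckets(5, 3))               # 2
print(round(utilization(5, 3, 3), 2))  # 0.56
print(round(utilization(5, 2, 3), 2))  # 0.83
```

The gap between the two utilization values is what the twin grid file tries to close.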

Figures 2.2 (a) through (f) show the insertion of the points under consideration into the twin grid file; it illustrates the basic twin grid file idea and merits careful study.

Figure 2.2: A twin grid file (legend: primary bucket region boundary; point in the primary grid file; secondary and (primary) bucket region boundary; point in the secondary grid file)

In the twin grid file method, we artificially create an overflow, i.e., we do not store point 5 in the grid file as usual, but instead store it in an overflow bucket. This action alone does not decrease the number of buckets used; however, it gives us the potential to share the overflow bucket among several of the primary buckets. This potential can be exploited immediately by transferring point 4 from its primary bucket to the overflow bucket. Thereby its primary bucket becomes empty and need not be stored explicitly any more. Hence we are able to store the five points in two buckets. For an illustration, see Figures 2.2 (a) through (c). As long as there is only one overflow bucket, no access structure for overflow buckets is needed.

Later on, as the number of overflow buckets increases, we want to perform the operations insert, delete, and range query efficiently for the data stored in the overflow buckets as well. We therefore organize the overflow points in a grid file, the secondary grid file. The twin grid file then consists of two parts, the primary grid file and the secondary grid file. Points will be distributed between the two in such a way that the number of used buckets is small.

If a point is to be inserted and falls into the bucket region of a full or an empty primary bucket, the twin grid file examines whether a secondary bucket is available for the insertion point. If this is the case, the point is inserted into this secondary bucket, and hence the use of an extra bucket can be avoided. Otherwise, it is checked whether the number of used buckets can be kept small by means of an extra secondary bucket. To this end, it is examined whether other primary buckets can be emptied by shifting their points into the secondary bucket in question (this holds for point 4 in Fig. 2.2 (c)).

Shifting points from primary to secondary buckets is by far not the only attempt at space optimization in the twin grid file. Whenever it pays off, points are shifted back and forth between the two grid files, triggered by an insertion or a deletion. In Figure 2.2, points 5 and 6 are inserted into the secondary grid file, because in the primary grid file an extra bucket would be needed to store them. Point 7 falls into a region with a full bucket, in the primary as well as in the secondary grid file. Therefore, the corresponding primary bucket is split; no other optimization can be performed. Point 8, again, causes a primary bucket to be split. If nothing else is done, three primary buckets and one secondary bucket are used to store eight points. However, the twin grid file does better. Since points should preferably be stored in the primary grid file (unless bucket savings are possible), points 5 and 6 are transferred from the secondary to the primary buckets. Still, no bucket has been saved. But now, point 1 can be transferred into the secondary bucket, thereby emptying and saving one bucket. Hence, the twin grid file arrives at storing these eight points in three buckets, whereas the standard grid file needs four buckets. Also note the difference in storage potential of the grid file and the twin grid file at this stage: any next point can be inserted into the twin grid file without the need for an extra bucket, whereas in the grid file an extra bucket may be necessary. These are just examples of the local restructuring operations performed by the twin grid file.

Our proposed set of operations, including the preconditions of their application, is described in the next section. Note that all restructuring operations are initiated by insertions or deletions of points. Range queries need to be carried out on both parts of the twin grid file, but remain unaffected by the restructurings.
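That a range query has to consult both parts can be illustrated with a small sketch. It is a simplification under stated assumptions: each grid file is modeled as a flat list of (region, points) buckets rather than through a grid directory, regions are half-open rectangles, and all names (`range_query`, `twin_range_query`) are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi), half-open
Bucket = Tuple[Rect, List[Point]]         # bucket region and the points stored in it

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains(r: Rect, p: Point) -> bool:
    return r[0] <= p[0] < r[2] and r[1] <= p[1] < r[3]

def range_query(buckets: List[Bucket], query: Rect):
    """Return (matching points, number of buckets read) for one grid file."""
    hits, accesses = [], 0
    for region, points in buckets:
        # Empty buckets are not stored explicitly, so they cost no access.
        if points and intersects(region, query):
            accesses += 1
            hits.extend(p for p in points if contains(query, p))
    return hits, accesses

def twin_range_query(primary: List[Bucket], secondary: List[Bucket], query: Rect):
    """Answer the query on both parts and combine results and access counts."""
    hits_p, acc_p = range_query(primary, query)
    hits_s, acc_s = range_query(secondary, query)
    return hits_p + hits_s, acc_p + acc_s

primary = [((0.0, 0.0, 0.5, 1.0), [(0.1, 0.1), (0.2, 0.8), (0.4, 0.5)]),
           ((0.5, 0.0, 1.0, 1.0), [])]
secondary = [((0.0, 0.0, 1.0, 1.0), [(0.6, 0.2), (0.9, 0.9)])]
print(twin_range_query(primary, secondary, (0.3, 0.0, 0.7, 0.6)))
# ([(0.4, 0.5), (0.6, 0.2)], 2)
```

No restructuring state is touched by the query, which matches the remark that range queries are unaffected by the restructurings.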

The twin grid file mechanism

The twin grid file T consists of two parts, the primary grid file P and the secondary grid file S, both spanning the entire data space. Each grid file partitions the data space into regions associated with data buckets. Dummy buckets, i.e., buckets with empty bucket region, are not stored explicitly; instead, they are represented by a special directory entry.

In order to achieve a high space utilization, we aim at locally minimizing the number of buckets used to store a given set of points. For the purpose of explaining the twin grid file mechanism, let us restrict our attention to two-dimensional points. Locally, i.e., for a (generally small) connected subregion R of the data space, we pursue the following objectives:

1. Minimize the number of buckets used to store the points in region R;
2. Minimize the number of secondary buckets used to store the points in R;
3. Minimize the number of points in R stored in secondary buckets.

These objectives are ordered in decreasing importance. Objective 1 directly translates the overall aim to the considered region R, while Objectives 2 and 3 aim at keeping secondary regions large and secondary buckets rather empty, in order to support future optimization efforts. Clearly, Objective 2 should only be pursued after Objective 1 has been satisfied, and similarly for Objective 3 after 1 and 2.
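The strict ordering of the three objectives amounts to a lexicographic comparison. The following sketch expresses it as a tuple-valued cost; the `Layout` type and the candidate values are hypothetical and serve only to show how ties on a more important objective are broken by the next one.

```python
from typing import Iterable, NamedTuple, Tuple

class Layout(NamedTuple):
    """Hypothetical summary of how the points in region R are stored."""
    primary_buckets: int
    secondary_buckets: int
    secondary_points: int

def cost(layout: Layout) -> Tuple[int, int, int]:
    """Objectives 1, 2, 3 as a tuple; Python compares tuples lexicographically."""
    total = layout.primary_buckets + layout.secondary_buckets
    return (total, layout.secondary_buckets, layout.secondary_points)

def best(layouts: Iterable[Layout]) -> Layout:
    return min(layouts, key=cost)

candidates = [Layout(3, 1, 2), Layout(2, 1, 3), Layout(2, 1, 2), Layout(1, 2, 1)]
print(best(candidates))
# Layout(primary_buckets=2, secondary_buckets=1, secondary_points=2)
```

Note that `Layout(1, 2, 1)` loses although it stores the fewest points in secondary buckets: Objective 3 only matters among layouts that already tie on Objectives 1 and 2.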

T is restructured to meet these objectives by repeatedly transferring a point from P to S, a shift down operation, or from S to P, a shift up operation, and carrying out the necessary adjustments within P and S, as prescribed by the standard grid file mechanism. After a sequence of shift operations, a bucket in P or S can be saved in one of two ways. Either it has become empty, i.e., it was not empty before the sequence of shift operations and is now empty, or two buckets can be merged, according to the grid file mechanism, where the set of points in the two affected buckets fits into one bucket. We call this a saving in P or in S, respectively. An additional bucket in P or S may be needed, if a sequence of shifts necessitates a bucket split, or an empty bucket becomes non-empty; we call this a loss in P or S, respectively. Any sequence of shift operations that leaves the number of buckets in P or S unchanged is called neutral in P or S, respectively.

A restructuring operation for T consists of the combined and repeated application of all of the following simple rules.

Rule 1a: If a saving in P or in S is possible without any shift, then perform the corresponding merge operation (see Fig. 3.1 (a)).

This rule directly aims at reducing the total number of buckets used (Objective 1), as do the following Rules 1b through 1d.

Rule 1b: If there exists a sequence of shift down operations leading to one saving in P, neutral in S, then perform these shifts (see Fig. 3.1 (b)).

Rule 1c: If there exists a sequence of shift down operations leading to two savings in P and one loss in S, then perform these shifts (see Fig. 3.1 (c)).

Figure 3.1 (a): before and after application of Rule 1a
Figure 3.1 (b): before and after application of Rule 1b
Figure 3.1 (c): before and after application of Rule 1c

Note that Rule 1c is essential for achieving any savings at all when inserting points into the initially empty twin grid file. Rule 1b affects one bucket becoming empty or two buckets being merged after one sequence of shift down operations; Rule 1c takes a broader view in considering up to four buckets in P. When more and more buckets are considered in this manner, we get the following slightly more general rule that covers Rules 1a through 1c as special cases.

Rule 1d: If there exists a sequence of shift down operations leading to i savings in P and fewer than i losses in S, for some integer i > 0, then perform these shifts.

Objective 2 is pursued by means of Rule 2, as follows.

Rule 2: If there exists a sequence of shift up operations leading to one saving in S and one loss in P, then perform these shifts (see Fig. 3.2).

This rule can also be applied in a more general way, with i instead of one saving in S and loss in P. However, one has to keep in mind that taking a broader view on the optimization region R entails more accesses on secondary storage for testing the preconditions for the application of the rules, without necessarily resulting in a much higher storage utilization.

Finally, Objective 3 is pursued by Rule 3.

Rule 3: If there exists a shift up operation neutral in P and S, then perform it (see Fig. 3.3).

Figure 3.2: before and after application of Rule 2
Figure 3.3: before and after application of Rule 3

As a consequence of the application of Rule 3, all points stored in S lie in bucket regions of full or empty buckets in P.
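The preconditions of the rules can be summarized in a small decision function. This is a sketch under simplifying assumptions: the effect of a candidate sequence of shifts is reduced to the net change in the number of buckets in P and in S, the distinction between a single shift and a sequence of shifts is ignored, and the function name `applicable_rule` is hypothetical.

```python
from typing import Optional

def applicable_rule(direction: Optional[str], delta_p: int, delta_s: int) -> Optional[str]:
    """Name the rule whose precondition a candidate restructuring satisfies.

    direction: None (no shift), "down" (P to S) or "up" (S to P).
    delta_p, delta_s: net change in the number of buckets in P and in S
    (negative values are savings, positive values are losses).
    """
    if direction is None:
        # Rule 1a: a merge in P or S is possible without any shift.
        return "1a" if (delta_p < 0 or delta_s < 0) else None
    if direction == "down":
        savings_p, losses_s = -delta_p, delta_s
        if savings_p > 0 and losses_s < savings_p:   # precondition of Rule 1d
            if (savings_p, losses_s) == (1, 0):
                return "1b"
            if (savings_p, losses_s) == (2, 1):
                return "1c"
            return "1d"
        return None
    if direction == "up":
        if delta_s == -1 and delta_p == 1:
            return "2"
        if delta_s == 0 and delta_p == 0:
            return "3"
        return None
    raise ValueError("direction must be None, 'down' or 'up'")

print(applicable_rule("down", -2, 1))  # 1c
print(applicable_rule("down", -1, 1))  # None
print(applicable_rule("up", 1, -1))    # 2
```

Rules 1b and 1c are reported by name, but both satisfy the same inequality as Rule 1d, which is the sense in which the general rule covers them.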

Even though all of the above rules are simple and their effect straightforward, their combined and repeated application in general may rearrange the association of points with buckets quite substantially (recall Figures 2.2 (e) and (f), where Rules 3 and 1b have been applied).

All restructuring operations are driven by insertions and deletions; range queries do not initiate restructuring operations. In an insertion, the point to be inserted into T is first inserted into P; then, a restructuring operation takes place. Similarly, a point to be deleted from T is deleted from P or S, depending on where it is being stored, and then a restructuring operation follows. Note that an insertion or deletion within P or S may already lead to some reorganization within P or S, as prescribed by the standard grid file mechanism.

In order to keep the restructuring operations reasonably efficient, the set of points to which shift operations may be applied should be limited. On the other hand, this set should not be too small, in order to have an effect on space utilization. In our experiments we restrict the set of shiftable points to lie in a subregion of the data space that is relatively small and simple to find, as follows. At the insertion or deletion of a point, consider the two buckets bp in P and bs in S in whose regions the point lies, together with all buckets in P whose regions intersect the region of bs and all buckets in S whose regions intersect the region of bp. We restrict all restructuring operations to the combined set of these buckets; experimental results for this specific interpretation of the locality of the space minimization efforts are presented in the next section.
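The selection of this local set of buckets can be sketched as follows. Bucket regions are modeled as half-open rectangles held in plain lists, which stands in for the grid directory; `locate` and `restructuring_set` are hypothetical helper names, and the example partition is invented for illustration.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi), half-open

def intersects(a: Rect, b: Rect) -> bool:
    """True if the two regions share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def locate(regions: List[Rect], x: float, y: float) -> Rect:
    """Bucket region that contains the point (x, y)."""
    return next(r for r in regions if r[0] <= x < r[2] and r[1] <= y < r[3])

def restructuring_set(primary: List[Rect], secondary: List[Rect], x: float, y: float):
    """Regions of the buckets that restructuring may touch for the point (x, y)."""
    bp = locate(primary, x, y)
    bs = locate(secondary, x, y)
    from_p = [r for r in primary if intersects(r, bs)]    # includes bp itself
    from_s = [r for r in secondary if intersects(r, bp)]  # includes bs itself
    return from_p, from_s

# Hypothetical partition: P splits the unit square into quadrants, S into halves.
primary = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.5, 0.5, 1.0),
           (0.5, 0.0, 1.0, 0.5), (0.5, 0.5, 1.0, 1.0)]
secondary = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
from_p, from_s = restructuring_set(primary, secondary, 0.2, 0.3)
print(from_p)  # [(0.0, 0.0, 0.5, 0.5), (0.0, 0.5, 0.5, 1.0)]
print(from_s)  # [(0.0, 0.0, 0.5, 1.0)]
```

Because bp and bs both contain the point, they intersect each other and are therefore always part of the returned sets.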


Implementation and performance

We implemented the standard and the twin grid file on an IBM-AT in Modula-2. Our implementation of the twin grid file for two-dimensional points is based on two separate standard grid files. In addition to the bucket addresses, the directory contains the number of points stored in each bucket. This information greatly reduces the number of bucket accesses in restructuring the twin grid file, without noticeably affecting its space utilization.

Let q be the point to be inserted. Recall that bp denotes the bucket of the primary grid file P covering point q, bs the corresponding bucket of the secondary grid file S. Restructuring operations take place in a suitably defined subregion R. To keep R simple, we make sure that any bucket region in P is completely contained in a bucket region in S. To this end, we allow a split in S only if it does not partition a bucket region in P, i.e., the split in S only uses bucket boundaries existing in P. The set B of buckets for restructuring consists of all buckets of P whose regions are completely contained in the region of bs, including dummy buckets. Furthermore, in P only pairs of buckets with both regions lying in one bucket region of S are considered for merge operations.

The effect of the application of our algorithm for a sequence of 100 insertions, as compared with the standard grid file, is shown graphically in Figure 4.1. The standard grid file needs 48 buckets to store the 100 points, whereas the twin grid file uses only 37 buckets.

Figure 4.1: 100 points in a twin grid file (a) according to our implementation, and in a standard grid file (b), where each bucket can hold 3 points.

Now we describe the procedure to insert point q. We assume that the bucket capacities of P and S are equal. To decide which restructuring operations are to be carried out, we compute the effect of actions (like split or merge) in main memory; these actions are called tentative actions in the following description of the algorithm.

Insert q:
  If bp is not full and not empty then store q in bp
  else if bs is not full and not empty then store q in bs
  else {an additional bucket in P (Objective 2) is needed}
    if bp is empty then store q in bp, shift up all points of bs lying in the region of bp (Rule 3)
    else {bp is full} reinsert q.
  Restructure.

Reinsert q:
  Let bp be the (full) bucket covering q.
  Split bp into bp' and bp'' with the grid file mechanism, shift up as many points of bs as possible lying in the regions of bp' and bp''.
  W.l.o.g., let bp' be the bucket covering q.
  If bp' is not full then store q in bp'
  else if bs is not full then store q in bs
  else {bp'' must be empty, no loss in P} reinsert q.
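The case distinction at the start of the insertion procedure depends only on how full bp and bs are, which is exactly the information kept in the directory. A minimal sketch of that decision, with a hypothetical function name and with the actions returned as plain strings rather than carried out:

```python
def insert_action(bp_count: int, bs_count: int, capacity: int) -> str:
    """Decide where a new point goes, given the fill counts of bp and bs."""
    def partly_filled(count: int) -> bool:
        return 0 < count < capacity

    if partly_filled(bp_count):
        return "store in bp"
    if partly_filled(bs_count):
        return "store in bs"
    # From here on bp is either empty or full, and so is bs.
    if bp_count == 0:
        return "store in bp and shift up points of bs lying in bp's region (Rule 3)"
    return "reinsert (split bp)"

print(insert_action(2, 0, 3))  # store in bp
print(insert_action(3, 1, 3))  # store in bs
print(insert_action(3, 3, 3))  # reinsert (split bp)
```

In every case the procedure continues with a restructuring step, which this sketch leaves out.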

Restructure:
  If there is a merge partner bs' that can be merged with bs then merge bs and bs' into bs (Rule 1a), and update B according to bs.
  If bs is not empty then
    if there is a bucket bp of B whose region covers all points of bs then
      if bp is empty then shift up all points of bs into bp (Rule 2)
      else {bp is full} tentatively split bp into bp' and bp''. If all points of bs can be shifted up into bp' and bp'' then split bp into bp' and bp'', shift up all points of bs into bp' and bp'' (Rule 2).
  If there is a merge partner bs' that can be merged with bs then merge bs and bs' into bs (Rule 1a), and update B according to bs.
  {Let B' be a subset of the set B of buckets. Let sB' be the maximum number of bucket savings that can be achieved in P by shifting down points from buckets in B' without causing a split of bs. Let D be such a set of points for shifting down. Let M be a smallest subset of B for which sM is maximal. Let M0 be a smallest subset of B for which sM0 is maximal under the restriction that no points may be shifted down.}
  Compute M, sM, D, M0 and sM0 for bs.
  If bs is empty and sM <= sM0 + 1 then realize sM0 savings in P by merging buckets according to M0
  else realize sM savings in P, and one loss in S if bs is empty, by shifting down all points in D and merging buckets according to M.
  Tentatively split bs into bs' and bs''. Compute M', sM', D' for bs', and compute M'', sM'', D'' for bs''.
  If sM' + sM'' > 1 then split bs into bs' and bs'', realizing one loss in S; realize sM' + sM'' savings in P by shifting down all points in D' and D'' and merging buckets according to M' and M''.

Note that our implementation just represents one specific way of determining the local restructuring region R, and of applying the restructuring rules. We have evaluated the performance of the twin grid file in our implementation and compared it with the standard grid file.
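The two numeric comparisons in the restructuring procedure can be isolated in a short sketch. It assumes the conditions read "sM <= sM0 + 1" and "sM' + sM'' > 1"; computing the savings themselves, which requires the bucket contents, is left out, and both function names are hypothetical.

```python
from typing import Tuple

def restructure_choice(bs_is_empty: bool, s_m: int, s_m0: int) -> Tuple[str, int, int]:
    """Return (action, savings in P, losses in S) for the secondary bucket bs.

    s_m:  maximum savings in P when points may be shifted down into bs.
    s_m0: maximum savings in P by merging alone, without any shift.
    """
    if bs_is_empty and s_m <= s_m0 + 1:
        # Shifting down would turn the empty bs into a stored bucket (one loss
        # in S) and gain at most one extra saving, so merging alone is as good.
        return ("merge only", s_m0, 0)
    return ("shift down and merge", s_m, 1 if bs_is_empty else 0)

def split_pays_off(s_first: int, s_second: int) -> bool:
    """Splitting bs costs one bucket in S, so it needs more than one saving in P."""
    return s_first + s_second > 1

print(restructure_choice(True, 2, 1))   # ('merge only', 1, 0)
print(restructure_choice(True, 3, 1))   # ('shift down and merge', 3, 1)
print(restructure_choice(False, 1, 0))  # ('shift down and merge', 1, 0)
print(split_pays_off(1, 1))             # True
```

In each branch the net effect (savings in P minus losses in S) is at least as large as in the alternative, which is consistent with Objective 1.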

[Table 4.1: accesses during ...; bucket accesses to answer range queries with area (in % of the area of the data space)]

Table 4.1 shows the results of our performance evaluation. We have inserted 40000 pseudo random two-dimensional points (from a uniform distribution) into the initially empty twin grid file T, in four sequences of 10000 points each. Each bucket holds at most 20 points, a realistic number for many applications. For comparison, we have inserted the same sequences of points into a standard grid file G. After each sequence of insertions, we have measured the actual storage utilization. In addition, we have computed the average storage utilization over each sequence of insertions, as well as the number of read and write accesses to secondary storage. Furthermore, after each sequence of insertions we have carried out 300 range queries with square ranges of three different sizes at pseudo random positions, and we have measured the average number of read accesses to secondary storage to answer these range queries.

The storage space in the twin grid file T is utilized at roughly 90%, which is certainly close to optimal. This exceeds the storage space utilization of the standard grid file G by 20 percentage points, i.e., G needs 29% more buckets than T.
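The step from "20 percentage points" to "29% more buckets" follows from the fact that the number of buckets is inversely proportional to the utilization. A short check, assuming utilizations of exactly 90% for T and 70% for G (the text gives both only as rounded figures) and using a hypothetical helper name:

```python
def buckets_needed(num_points: int, capacity: int, utilization: float) -> float:
    """Average number of buckets implied by a given storage utilization."""
    return num_points / (capacity * utilization)

twin = buckets_needed(40000, 20, 0.90)      # twin grid file T
standard = buckets_needed(40000, 20, 0.70)  # standard grid file G

print(round(twin))                      # 2222
print(round(standard))                  # 2857
print(round(standard / twin - 1, 2))    # 0.29
```

The ratio 0.90 / 0.70 is independent of the number of points and of the bucket capacity.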

To achieve the high space utilization, the number of bucket accesses during insertions in T is higher than in G, roughly by a factor of 1.7. Note, however, that for simplicity we have kept the entire directory in main memory. Since any insertion or deletion in T leads to an update of the number of stored points in the directory, the affected directory page in T needs to be written to the secondary storage after each insertion or deletion. Therefore, the number of directory accesses in T may exceed those in G by a factor of 4 at most. Experiments show that roughly twice as many external storage accesses are needed in T versus G. This is quite a low penalty
in cases where storage space is scarce or the data are changed infrequently. For higher bucket capacity, the insertion cost of points is lower. Our experiments show that doubling the bucket capacity from 20 to 40 points reduces the number of bucket accesses by more than 23%, while space utilization remains at 90%. It is essential, however, that range queries do not lose their efficiency in T versus G. This is in fact the case for query ranges of at least a small minimum size (a query range intersecting roughly 15 bucket regions, in our setting). As expected, for larger query ranges T performs even better than G, due to the better space utilization. Other operations than insertion, deletion, and range query are of minor importance in most applications. However, it should be clear that they may be carried out efficiently as well. Let us illustrate this briefly for exact match queries: they cost at most twice as many block accesses as in the standard grid file. Since in our experiments it turns out that 14% of all points are stored in the secondary grid file, successful exact match queries in the twin grid file in our setting cost 1.14 times as many bucket accesses as in the standard grid file, on the average. For unsuccessful exact match queries, the average bucket access factor is 1.56, because non-full and non-empty primary bucket regions cover 44% of the data space, and for query points falling into these bucket regions, no secondary bucket access is necessary. Taking the directory into account, we get 2.28 accesses in a successful and 3.12 accesses in an unsuccessful exact match.
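The exact match figures can be reproduced with a few lines of arithmetic. The sketch assumes that a query reads the primary bucket first and the secondary bucket only when needed, and that the figures including the directory count one directory access per bucket access; the second assumption is inferred from the reported numbers (2.28 = 2 x 1.14) rather than stated in the text.

```python
def successful_cost(frac_points_in_secondary: float) -> float:
    """Expected bucket accesses: one primary read, plus a secondary read
    for the fraction of points that are stored in S."""
    return 1 + frac_points_in_secondary

def unsuccessful_cost(frac_space_partly_filled: float) -> float:
    """Expected bucket accesses: the secondary read is skipped only where the
    primary bucket is neither full nor empty."""
    return 1 + (1 - frac_space_partly_filled)

def with_directory(bucket_accesses: float) -> float:
    """Assumption: every bucket access is preceded by one directory access."""
    return 2 * bucket_accesses

print(round(successful_cost(0.14), 2))                    # 1.14
print(round(unsuccessful_cost(0.44), 2))                  # 1.56
print(round(with_directory(successful_cost(0.14)), 2))    # 2.28
print(round(with_directory(unsuccessful_cost(0.44)), 2))  # 3.12
```

Both bucket access factors stay below the worst case of 2 mentioned above, which corresponds to reading a primary and a secondary bucket for every query.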

Conclusion

We have proposed an index structure for geometric databases, where the efficiency of spatial proximity queries as well as a high storage space utilization are the major concerns. Our structure, the twin grid file, is based on the grid file. By means of distributing the set of points to be stored among two grid files, both covering the entire data space, we achieve a space utilization close to optimal, while preserving the efficiency of range queries. We have implemented a twin grid file, and compared its performance with a standard grid file.

Even though our implementation has used few rules for redistributing points among the two grid files, and these rules have been applied in very limited combinations only, the twin grid file outperforms the standard grid file significantly.

Further optimization can be built into our algorithm easily, thereby enhancing storage space utilization even more, at the expense of more bucket accesses to redistribute points. Figure 5.1 shows that this can even be achieved locally, for an example with three points per bucket.

Figure 5.1 (panel labels: result of our implementation; one bucket less)

The principle of building twin structures can be applied to other index structures as well. One-dimensional index structures, like e.g. extendible hashing (Fagin et al. 1979) and digital B-trees (Lomet 1981), are also suitable for twin structures. However, we expect the beneficial effect to increase with higher dimension. Attractive alternatives to the grid file, like e.g. the BANG file (Freeston 1987), are among the most promising twin structure candidates.

Acknowledgement
We wish to thank Norbert Fels for discussions and valuable suggestions, and Gabriele Reich and Brunhilde Beck for typesetting the paper and styling the layout.

References
R. Fagin, J. Nievergelt, N. Pippenger, H.R. Strong: Extendible Hashing: A Fast Access Method for Dynamic Files, ACM Transactions on Database Systems, Vol. 4, 3, 1979, 315-344.
M. Freeston: The BANG file: a new kind of grid file, Proc. ACM SIGMOD-87 International Conference on Management of Data, 1987, 260-269.
K.H. Hinrichs: The grid file system: implementation and case studies of applications, Doctoral Thesis No. 7734, ETH Zürich, 1985.
A. Hutflesz, H.-W. Six, P. Widmayer: Globally Order Preserving Multidimensional Linear Hashing, IEEE Fourth International Conference on Data Engineering, 1988.
H.-P. Kriegel, B. Seeger: Multidimensional Order Preserving Linear Hashing with Partial Expansions, Proc. International Conference on Database Theory, 1986, 203-220.
R. Krishnamurthy, K.-Y. Whang: Multilevel Grid Files, IBM Research Report, Yorktown Heights, 1985.
D.B. Lomet: Digital B-Trees, Proc. International Conference on Very Large Data Bases, IEEE, 1981, 333-344.
D.B. Lomet: Partial Expansions for File Organizations with an Index, ACM Transactions on Database Systems, Vol. 12, 1, 1987, 65-84.
J. Nievergelt, H. Hinterberger, K.C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure, ACM Transactions on Database Systems, Vol. 9, 1, 1984, 38-71.
E.J. Otoo: Balanced Multidimensional Extendible Hash Tree, Proc. 5th ACM SIGACT/SIGMOD Symposium on Principles of Database Systems, 1986, 100-113.
K. Ramamohanarao, R. Sacks-Davis: Recursive Linear Hashing, ACM Transactions on Database Systems, Vol. 9, 3, 1984, 369-391.
J.T. Robinson: The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes, Proc. ACM SIGMOD International Conference on Management of Data, 1981, 10-18.
H.-W. Six, P. Widmayer: Spatial Searching in Geometric Databases, IEEE Fourth International Conference on Data Engineering, 1988.
M. Tamminen: Extendible Hashing with Overflow, Information Processing Letters, Vol. 15, 5, 1982, 227-232.