
Modern B-tree techniques

Goetz Graefe, Harumi Kuno


Hewlett-Packard Laboratories
{goetz.graefe, harumi.kuno}@hp.com

I. INTRODUCTION

Less than 10 years after Bayer and McCreight introduced B-trees in 1970, and now more than a quarter century ago, Comer in 1979 called B-tree indexes ubiquitous. Gray and Reuter asserted in 1993 that B-trees are by far the most important access path structure in database and file systems. B-trees in various forms and variants are used in databases, key-value stores, information retrieval, and file systems. It could be said that the world's information is at our fingertips because of B-trees.

Many students, researchers, and professionals know the basic facts about B-tree indexes. Basic knowledge includes their organization in nodes including one root and many leaves, the uniform distance between root and leaves, their logarithmic height and logarithmic search effort, and their efficiency during insertions and deletions. This tutorial briefly reviews the basics but assumes that the audience is interested in more detailed information about modern B-tree techniques.

Not all relevant topics can be covered in a short time. The selection of topics is focused on current opportunities for B-tree indexes in novel contexts. Thus, the focus is on duplicate key values (including bitmaps, column storage, and compression), updates (including alternative B-tree structures, load utilities, and update execution plans), and the effects of novel hardware (including very large memory, flash storage, and memory hierarchies).

II. DUPLICATES

Duplicate key values in secondary indexes occur in many data collections. Multiple representation choices exist that differ in the resulting data size. Even bitmaps are possible; in other words, B-trees are a sensible basic data structure for bitmap indexes. Moreover, B-trees enable column stores with compression and efficient navigation to specific row identifiers and their column values.

A. Traditional representations

The rows in Figure 1 show alternative representations of the same information: (a) shows individual records repeating the duplicate key value for each distinct record identifier associated with the key value, which is a simple scheme that requires the most space. (b) shows a list of record identifiers with each unique key value, and (c) shows a combination of these two techniques suitable for breaking up extremely long lists, e.g., those spanning multiple pages. (d) and (e) show two simple compression schemes based on truncation of shared prefixes and on run-length encoding. For example, (9) 2 indicates that this entry is equal to the preceding one in its first 9 letters, i.e., "Smith, 471", and 4711(2) indicates a contiguous series with 2 entries starting with 4711. (f) shows a bitmap as might be used in a bitmap index. Bitmaps themselves are often compressed using some variant of run-length encoding.
(a) Smith, 4711; Smith, 4712; Smith, 4723
(b) Smith, 4711, 4712, 4723
(c) Smith, 4711, 4712; Smith, 4723
(d) Smith, 4711; (9) 2; (8) 23
(e) Smith, 4711(2), 4723
(f) Smith, 4708, 0001100000000001

Figure 1. Representation choices for duplicates.
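To make the compressed representations concrete, here is a small sketch (not from the tutorial; plain Python written for this summary) that reproduces rows (d), (e), and (f) of Figure 1 from the uncompressed entries; all function names are illustrative.

```python
# Toy reconstruction of rows (d), (e), and (f) of Figure 1.

def prefix_truncate(entries):
    """Row (d): replace the prefix shared with the preceding entry by (length)."""
    out, prev = [], ""
    for e in entries:
        shared = 0
        while shared < min(len(prev), len(e)) and prev[shared] == e[shared]:
            shared += 1
        out.append(e if not prev else f"({shared}) {e[shared:]}")
        prev = e
    return out

def run_length(key, rids):
    """Row (e): collapse consecutive record identifiers into start(count)."""
    runs, start, count = [], rids[0], 1
    for r in rids[1:]:
        if r == start + count:
            count += 1
        else:
            runs.append(f"{start}({count})" if count > 1 else str(start))
            start, count = r, 1
    runs.append(f"{start}({count})" if count > 1 else str(start))
    return f"{key}, " + ", ".join(runs)

def bitmap(key, base, rids, width=16):
    """Row (f): one bit per candidate record identifier, anchored at a base."""
    bits = ["0"] * width
    for r in rids:
        bits[r - base] = "1"
    return f"{key}, {base}, {''.join(bits)}"

# Spaces omitted in the entries so the prefix lengths match Figure 1.
print(prefix_truncate(["Smith,4711", "Smith,4712", "Smith,4723"]))
# -> ['Smith,4711', '(9) 2', '(8) 23']
print(run_length("Smith", [4711, 4712, 4723]))    # -> Smith, 4711(2), 4723
print(bitmap("Smith", 4708, [4711, 4712, 4723]))  # -> Smith, 4708, 0001100000000001
```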

These representations show only the simplest forms of compression, mostly by truncation of some sort rather than encoding. Traditional compression encodes frequent values, frequent sub-strings, etc.; practically all traditional compression techniques can also be applied to database pages and in particular to B-tree indexes.

B. Key range locking and duplicate key values

If entries in a non-clustered index are not unique, multiple row identifiers may be associated with each value of the search key. Even thousands of record identifiers per key value are possible due to a single frequent key value or due to attributes with few distinct values. In non-unique indexes, key value locking may lock each value (including its entire cluster of row identifiers) or it may lock each unique pair of value and row identifier. The former saves lock requests in search queries, while the latter may permit higher concurrency during updates. A hierarchical design might permit locks both for unique values of the user-defined index key and for individual entries made unique by including the record identifier. Depending on the details of the design, it may not be required to lock individual row identifiers if those are already locked in the table to which the non-clustered index belongs.

The choice whether to lock unique values (and their clusters of row identifiers) or each unique pair of value and row identifier is independent of the choice of representation on disk or in memory. For example, a system may store each key value only once in order to save disk space yet lock individual pairs of key value and row identifier. Conversely, another system may store pairs of key value and record identifier yet lock all such pairs with the same value with a single lock. Representation and granularity of locking are independent of each other, whether any form of compression is applied or not.
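As an illustration of the two locking granularities and the hierarchical variant, the following sketch uses a toy lock manager; the class, lock modes, and method names are hypothetical and omit conflict handling entirely.

```python
# Minimal sketch of lock granularity choices for non-unique indexes.
# LockManager, acquire(), and the lock modes are hypothetical names.

class LockManager:
    def __init__(self):
        self.locks = {}                      # resource -> set of (txn, mode)

    def acquire(self, txn, resource, mode):
        # Extremely simplified: no conflict detection, queuing, or deadlocks.
        self.locks.setdefault(resource, set()).add((txn, mode))

def lock_key_value(lm, txn, index, key):
    """Coarse: one lock covers the key value and its whole cluster of RIDs."""
    lm.acquire(txn, (index, key), "S")

def lock_key_rid_pair(lm, txn, index, key, rid):
    """Fine: one lock per unique (key value, row identifier) pair."""
    lm.acquire(txn, (index, key, rid), "X")

def lock_hierarchical(lm, txn, index, key, rid):
    """Hierarchical: intention lock on the key value, then lock the entry."""
    lm.acquire(txn, (index, key), "IX")
    lm.acquire(txn, (index, key, rid), "X")

lm = LockManager()
lock_key_value(lm, "T1", "idx_name", "Smith")           # query: one request
lock_hierarchical(lm, "T2", "idx_name", "Smith", 4712)  # update: higher concurrency
```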

C. Bitmap B-trees

The term bitmap index is commonly used, but it is quite ambiguous without explanation of the index structure. Bitmaps can be used in B-trees just as well as in hash indexes and other forms of indexes. As seen in Figure 1, bitmaps are one of many representation techniques for a set of integers. Wherever a set of integers is associated with each index key, the index can be a bitmap index. In the following, however, a non-unique non-clustered B-tree index is assumed.

Bitmaps in database indexes are a fairly old idea that gained importance with the rise of relational data warehousing. The only requirement is that there is a one-to-one mapping between information associated with index keys and integers, i.e., the positions of bits in a bitmap. For example, record identifiers consisting of device number, page number, and slot number can be interpreted as a single large integer and thus can be encoded in bitmaps and bitmap indexes.

Without compression, bitmap indexes are space-efficient only if there are very few distinct key values in the index. With effective compression, the size of bitmap indexes is about equal to that of traditional indexes with lists of references broken into segments, as shown in Figure 1. For example, with WAH compression, each reference requires at most one run of 0 sections plus a bitmap of 31 bits. A traditional representation with record identifiers might also require 64 bits per reference. Thus, bitmap indexes are useful for both sparse and dense bitmaps, i.e., for both low- and high-cardinality attributes.

Bitmaps are used primarily for read-only or read-mostly data, not for update-intensive databases and indexes. This is due to the perceived difficulty of updating compressed bitmaps, e.g., insertion of a new value in run-length encoding schemes such as WAH. On the other hand, lists of record identifiers compressed using numeric differences are very similar to the counters in run-length encoding. Update costs should be very similar in these two compressed storage formats.

D. Column storage B-trees

Column storage seems to offer two benefits over traditional B-tree indexes, namely faster scans due to shorter records and better compression due to a uniform data type and domain. Traditional database B-trees are unsuitable for column storage. However, very moderate modifications or generalizations of traditional B-tree formats can achieve information density comparable to column stores, including elimination of overheads as well as active compression with methods such as run-length encoding, dictionary encoding, etc. The advantage of B-trees for column storage over alternative storage structures is that traditional algorithms can be reused, e.g., for assembling rows with multiple columns, bulk insertion and deletion, logging and recovery, consistency checking, etc.

The essence of the technique is that in each B-tree page, the page header stores the lowest tag value among all B-tree entries on that page, and the actual tag value for each individual B-tree entry is calculated by adding this value and the slot number of the entry within the page. There is no need to store the tag value in the individual B-tree entries; only a single tag value is required per page. If a page contains tens, hundreds, or even thousands of B-tree entries, the overhead for storing the minimal tag value is practically zero for each individual record. If the size of the row identifier is 4 or 8 bytes and the size of a B-tree node is 8 KB, the per-page row identifier imposes an overhead of 0.1 % or less.

If all the records in a page have consecutive tag values, this method not only solves the storage problem but also reduces a search for a particular key value in the index to a little bit of arithmetic followed by a direct access to the desired B-tree entry. Thus, the access performance in leaf pages of these B-trees can be even better than that achieved with interpolation search or in hash indexes.
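The tag arithmetic can be sketched in a few lines (an illustrative toy, not an excerpt of any implementation): only the lowest tag on a page is stored, and the tag of each entry follows from its slot number, so a lookup becomes arithmetic plus a direct slot access.

```python
# Minimal sketch of tag (row identifier) arithmetic in a column-storage B-tree leaf.
# The page stores column values in slot order; only the lowest tag is kept.

class ColumnLeafPage:
    def __init__(self, lowest_tag, values):
        self.lowest_tag = lowest_tag   # single tag stored in the page header
        self.values = values           # column values in consecutive slots

    def tag_of_slot(self, slot):
        # actual tag = lowest tag on the page + slot number
        return self.lowest_tag + slot

    def lookup(self, tag):
        # Assuming consecutive tags on this page: a little arithmetic,
        # then direct access to the desired entry (no binary search).
        slot = tag - self.lowest_tag
        if 0 <= slot < len(self.values):
            return self.values[slot]
        return None                    # tag belongs to another page

page = ColumnLeafPage(lowest_tag=4711, values=["Smith", "Smith", "Jones", "Brown"])
assert page.lookup(4713) == "Jones"    # tag 4713 -> slot 2
```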
III. UPDATES

The default algorithms for insertion and deletion of a single key value are well known. In addition, many optimizations have been designed, in particular for large updates, also known as bulk updates. Most of them require a more or less drastic change in data structures in order to enable the efficient maintenance algorithms.

A. Row-by-row versus index-by-index

Various strategies for efficient index maintenance have been designed. These include sorting changes prior to modifying records and pages in B-trees, splitting each modification of an existing record into a deletion and an insertion, and a choice between row-by-row and index-by-index updates. The row-by-row technique is the traditional algorithm. When multiple rows change in a table with multiple indexes, this strategy computes the delta rows (that include old and new values) and then applies them one by one. For each delta row, all indexes are updated before the next delta row is processed.

The index-by-index maintenance strategy applies all changes to the clustered index first. The delta rows may be sorted on the same columns as the clustered index. The desired effect is similar to sorting a set of search keys during index-to-index navigation. Sorting the delta rows separately for each index is possible only in index-by-index maintenance, of course.

This strategy is beneficial if there are more changes than pages in each index. Sorting the changes for each index ensures that each index page needs to be read and written at most once. Prefetch of individual pages or large read-ahead can be used just as in read-only queries; in addition, updates benefit from write-behind. If read-ahead and write-behind transfer large sequences of individual pages as a single operation, this strategy is beneficial if the number of changes exceeds the number of such transfers during an index scan.
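The difference between the two strategies can be sketched as follows (a schematic Python illustration; the Index class and apply_change method are stand-ins for real page-level B-tree maintenance).

```python
# Minimal sketch contrasting row-by-row and index-by-index maintenance.

class Index:
    def __init__(self, name, key_column):
        self.name, self.key_column = name, key_column
        self.applied = []                     # order in which changes arrive

    def sort_key(self, delta):
        return delta[self.key_column]

    def apply_change(self, delta):
        self.applied.append(delta)            # stand-in for a page-level update

def row_by_row(delta_rows, indexes):
    # Traditional: for each delta row, update every index before the next row.
    for delta in delta_rows:
        for index in indexes:
            index.apply_change(delta)

def index_by_index(delta_rows, clustered, secondaries):
    # Apply all changes to the clustered index first, in its sort order.
    for delta in sorted(delta_rows, key=clustered.sort_key):
        clustered.apply_change(delta)
    # Then maintain each secondary index separately, with the delta rows
    # sorted on that index's key, so each index page is touched at most once.
    for index in secondaries:
        for delta in sorted(delta_rows, key=index.sort_key):
            index.apply_change(delta)

deltas = [{"id": 3, "name": "Smith"}, {"id": 1, "name": "Jones"}]
row_by_row(deltas, [Index("pk", "id"), Index("idx_name", "name")])
index_by_index(deltas, Index("pk", "id"), [Index("idx_name", "name")])
```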


The basic ideas of index-by-index maintenance and update plans also apply to foreign key constraints, materialized views and their indexes, and even triggers. Specifically, foreign key constraints may be verified using branches in an update plan quite similar to maintenance of an individual non-clustered index, and cascading of information for foreign key constraints and materialized views can be achieved in additional branches. Plan creation is a challenge in its own right.

B. Buffered B-trees

Techniques optimized for efficient bulk insertions into B-trees can be divided into two groups. Both groups rely on some form of buffering to delay B-tree maintenance and to gain some economy of scale. The first group focuses on the structure of B-trees and buffers insertions in interior nodes. Thus, B-tree nodes are very large, are limited to a small fan-out, or require additional storage on the side. The second group exploits B-trees without modifications to their structure, either by employing multiple B-trees or by creating partitions within a single B-tree by means of an artificial leading key field. In all cases, pages or partitions with active insertions are retained in the buffer pool.
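For the first group, the following toy sketch (illustrative names and a two-level structure only; real designs buffer within large multi-level nodes and cascade flushes downward) shows an interior node that accumulates insertions in a buffer and flushes them to its children in batches.

```python
# Schematic sketch of buffering insertions in an interior B-tree node.

import bisect

class BufferedInteriorNode:
    def __init__(self, separator_keys, children, buffer_capacity=4):
        self.separator_keys = separator_keys    # e.g. [10, 20] -> 3 children
        self.children = children                # leaves: plain sorted lists
        self.buffer = []                        # pending (key, payload) pairs
        self.buffer_capacity = buffer_capacity

    def insert(self, key, payload):
        # New entries land in the node's buffer instead of the leaves, so one
        # leaf access later covers many accumulated insertions.
        self.buffer.append((key, payload))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        for key, payload in self.buffer:
            child = self.children[bisect.bisect_right(self.separator_keys, key)]
            bisect.insort(child, (key, payload))
        self.buffer.clear()

node = BufferedInteriorNode([10, 20], [[], [], []])
for k in [7, 15, 23, 12]:
    node.insert(k, f"row {k}")    # the fourth insert triggers the flush
```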

C. Partitioned B-trees

The essence of partitioned B-trees is to maintain partitions within a single B-tree, by means of an artificial leading key field, and to reorganize and optimize such a B-tree online using, effectively, the merge step well known from external merge sort. This key field probably should be an integer of 2 or 4 bytes. By default, the same single value appears in all records in a B-tree, and most of the techniques for partitioned B-trees rely on exploiting multiple alternative values, temporarily in most cases and permanently for a few techniques.

If a table or view in a relational database has multiple indexes, each index has its own artificial leading key field. The values in these fields are not coordinated or propagated among the indexes. In other words, each artificial leading key field is internal to a single B-tree, such that each B-tree can be reorganized and optimized independently of all others. If a table or index is horizontally partitioned and represented in multiple B-trees, the artificial leading key field should be defined separately for each partition.

Figure 2. Traditional vs. partitioned B-tree indexes.

Figure 2 illustrates how the artificial leading key field divides the records in a B-tree into partitions. Within each partition, the records are sorted, indexed, and searchable by the user-defined key just as in a standard B-tree. In this example, partition 1 might be the main partition whereas partitions 2-4 contain recent insertions, appended to the B-tree as new partitions after in-memory sorting.

The last partition might remain in the buffer pool, where it can absorb random insertions very efficiently. When its size exceeds the available buffer pool, a new partition is started and the prior one is written from the buffer pool to disk, either by an explicit request or on demand during standard page replacement in the buffer pool.
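A minimal sketch of the idea, with a sorted list standing in for the B-tree and illustrative function names: the artificial leading key field clusters recent insertions into new partitions, and reorganization merges them back into the main partition.

```python
# Toy partitioned B-tree: the artificial leading key field is the first
# component of every entry; a sorted list stands in for the B-tree itself.

import bisect

btree = []                                # entries: (partition_no, user_key)

def insert(partition_no, user_key):
    # Recent insertions are appended under a new partition number, so they
    # stay clustered like a run produced by in-memory sorting.
    bisect.insort(btree, (partition_no, user_key))

def search(user_key):
    # A query must probe every partition; in a real B-tree this is one
    # root-to-leaf descent per distinct partition number.
    return [entry for entry in btree if entry[1] == user_key]

def reorganize():
    # Effectively the merge step of an external merge sort: all partitions
    # are merged into the single main partition 0.
    btree[:] = sorted((0, key) for _, key in btree)

insert(0, "Jones"); insert(1, "Smith"); insert(2, "Adams")
print(search("Smith"))    # [(1, 'Smith')] before reorganization
reorganize()              # afterwards every record carries partition number 0
```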

D. Load and purge

Bulk deletion, also known as purging, roll-out, or information de-staging, can employ some of the techniques invented for bulk insertion. For example, one technique for deletion simply inserts anti-matter records with the fastest bulk insertion technique, leaving it to queries or a subsequent reorganization to erase records and reclaim storage space. In partitioned B-trees, the reorganization should happen prior to the actual deletion. In the first and preparatory step, records to be deleted are moved from the main source partition to a dedicated victim partition. In the second and final step, this dedicated partition is deleted very efficiently, mostly by simple deallocation of leaf nodes and appropriate repair of internal B-tree nodes.

For some example bandwidth calculations, consider bulk insertion into a table with a clustered index and three non-clustered indexes, all stored on a single disk supporting 200 read-write operations per second (100 read-write pairs) and 100 MB/s read-write bandwidth (assuming large units of I/O and thus negligible access latency). In this example calculation, record sizes are 1 KB in the clustered index and 0.02 KB in non-clustered indexes, including overheads for page and record headers, free space, etc. For simplicity, let us assume a warm buffer pool such that only leaf pages require I/O.

The baseline plan relies on random insertions into 4 indexes, each requiring one read and one write operation. 8 I/Os per inserted row enable 25 row insertions per second into this table. The sustained insertion bandwidth is 25 KB/s = 0.025 MB/s per disk drive. It turns out that a plan relying on index removal and re-creation after insertion permits about 0.236 MB/s sustained insertion bandwidth. While this might seem poor compared to 100 MB/s, it is about ten times faster than random insertion, so it is not surprising that vendors have recommended this scheme. For a B-tree implementation that buffers insertions at each interior tree node, reasonable assumptions lead to 0.250 MB/s sustained insertion bandwidth. In addition to the bandwidth improvement, this technique retains and maintains the original B-tree indexes and permits query processing throughout. Partitioned B-trees permit about 10 MB/s sustained insertion bandwidth. Query processing remains possible throughout initial capture of new information and during B-tree reorganization. The tutorial provides details of these calculations.
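The baseline figure can be checked with simple arithmetic; the sketch below recomputes only the random-insertion case, since the other numbers depend on assumptions detailed in the tutorial.

```python
# Back-of-the-envelope check of the baseline (random insertion) numbers.

ops_per_second = 200          # read-write operations per second per disk
indexes = 4                   # one clustered index + three non-clustered
ios_per_row = 2 * indexes     # one leaf read and one leaf write per index

rows_per_second = ops_per_second / ios_per_row            # 25 rows/s
clustered_record_kb = 1.0                                  # KB per row
bandwidth_kb_s = rows_per_second * clustered_record_kb     # 25 KB/s

print(f"{rows_per_second:.0f} rows/s, {bandwidth_kb_s / 1000:.3f} MB/s")
# -> 25 rows/s, 0.025 MB/s  (versus ~10 MB/s with partitioned B-trees)
```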

IV. HARDWARE

A. Large memory

In traditional system architectures, the purpose of the buffer pool is to enable data access significantly more efficient than disk access. In a memory-rich environment, the purpose of a buffer pool must be to achieve in-memory performance in most cases but to enable data growth beyond the memory size if required. Thus, the focus is on speeding in-memory navigation even at some additional expense in the rare cases when pages are loaded into or evicted from the buffer pool.

Achieving in-memory performance requires in-memory pointers, thus avoiding look-up in the hash table mapping disk addresses (page identifiers) to buffer frames (or their descriptors). For a B-tree index, this requires that child pointers be memory addresses while the appropriate B-tree nodes are in the buffer pool. Moreover, in order to enable the appropriate reference counting, child nodes must refer to the immediate parent node. These additional pointers exist in the buffer pool only; they are not part of the on-disk format.

Not only the pointers but also the keys can be optimized for efficiency, possibly using another parallel array. For example, insertion of a few artificial ghost keys may speed up insertions by reducing the amount of shifting, and searches by enabling effective interpolation search instead of binary search. If both key values and child pointers are copied from a data page in the buffer pool into new data structures optimized for in-memory B-tree navigation, most of the original data page is no longer required. If the page is evicted from the buffer pool, and if it has been updated since it was last saved on permanent storage, its contents can be reconstructed from the key array and the pointer array. Any differences between the prior on-disk page and the newly reconstructed page are due to the update, which of course is reflected in the recovery log, or are equivalent to a page reorganization, which does not need to be logged.

B. Flash storage

Flash storage is usually packaged to emulate disk drives. The main difference to traditional disks is the fast access latency. Further differences include wear limitations and the need for wear leveling. While access latencies differ by orders of magnitude from disks, transfer bandwidth does not. Thus, the optimal page size for B-tree indexes is much smaller on flash storage than on disks. Optimal page sizes were derived in even the earliest B-tree papers; applied to flash storage and disks, optimal page sizes differ by more than an order of magnitude.

If RAM, flash storage, and disk form a memory hierarchy, the optimal data structures form their own hierarchy. Specifically, a large disk page contains many small pages. When a large page is read from disk, it remains on flash storage, and small pages are loaded from flash storage into the buffer pool in RAM. These pages can be organized as B-trees, including splitting a full large page into two half-full large pages. Moreover, the records in a B-tree might represent hierarchical components of complex objects, e.g., customers, orders, invoices, and detail items in orders and invoices. By appropriate master-detail clustering, buffer pool requirements in RAM and on flash as well as access and transfer effort can be reduced. This is true in particular if application programs are programmed and process data in terms of complex objects rather than normalized records.
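The page-size argument can be sketched with the classic node-utility calculation; the device parameters and the utility formula below are assumptions chosen for illustration, not figures from the tutorial.

```python
# Rough sketch of the classic B-tree node-size argument: pick the page size
# that resolves the most key comparisons (log2 of the entries per node) per
# unit of I/O time. Parameters are assumed for illustration.

from math import log2

def best_page_size(latency_s, bandwidth_mb_s, record_bytes=20):
    candidates = [2 ** k for k in range(9, 21)]          # 512 B .. 1 MB
    def utility(page_bytes):
        entries = page_bytes / record_bytes
        io_time = latency_s + page_bytes / (bandwidth_mb_s * 1e6)
        return log2(entries) / io_time
    return max(candidates, key=utility)

disk  = best_page_size(latency_s=0.005,  bandwidth_mb_s=100)   # ~5 ms access
flash = best_page_size(latency_s=0.0001, bandwidth_mb_s=100)   # ~0.1 ms access
print(disk, flash)   # the optimal disk page is far larger than the flash page
```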

C. Persistent byte-addressable storage

Software techniques for B-trees in persistent byte-addressable storage hardware, e.g., phase-change memory or memristor storage, are still in the research stage. One possible research direction focuses on online fault detection and recovery for individual pages in case only a small amount of hardware wears out and becomes unreliable. Another possible research direction might simplify the traditional do, redo, undo paradigm for transactional updates. Instead, write-ahead logging might be refined to log changes before their first application to the data structure. Thus, a do action is no longer required, because its function can be realized by invoking redo of the log record. Of course, undo actions remain obligatory. Idempotent redo and undo actions are not required if logged undo actions (compensation) and a page LSN (log sequence number) can be used to ensure that each action is applied exactly once; the equivalent techniques for persistent byte-addressable storage are an open research question. Relevant techniques and the state of the art will be discussed in the tutorial, if time permits.

V. SUMMARY

In summary, the core design of B-trees has remained unchanged in 40 years: balanced trees, pages or other units of I/O as nodes, efficient root-to-leaf search, splitting and merging nodes, etc. On the other hand, an enormous amount of research and development has improved every aspect of B-trees, including data contents such as multi-dimensional data, access algorithms such as multi-dimensional queries, data organization within each node such as compression and cache optimization, concurrency control such as separation of latching and locking, recovery such as multi-level recovery, etc.

Gray and Reuter believed in 1993 that B-trees are by far the most important access path structure in database and file systems. It seems that this statement remains true today. B-tree indexes are likely to gain new importance in relational databases due to the advent of flash storage. Fast access latencies permit many more random I/O operations than traditional disk storage, thus shifting the break-even point between a full-bandwidth scan and a B-tree index search, even if the scan has the benefit of columnar database storage. We hope that this tutorial of B-tree techniques will stimulate research and development of modern B-tree indexing techniques for future data management systems.

VI. REFERENCES

A survey paper with references will be made available during the tutorial.

