Btree  

Type  tree  
Invented  1971  
Invented by  Rudolf Bayer, Edward M. McCreight  

In computer science, a Btree is a selfbalancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The Btree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123). Unlike selfbalancing binary search trees, the Btree is optimized for systems that read and write large blocks of data. Btrees are a good example of a data structure for external memory. It is commonly used in databases and filesystems.
In Btrees, internal (nonleaf) nodes can have a variable number of child nodes within some predefined range. When data is inserted or removed from a node, its number of child nodes changes. In order to maintain the predefined range, internal nodes may be joined or split. Because a range of child nodes is permitted, Btrees do not need rebalancing as frequently as other selfbalancing search trees, but may waste some space, since nodes are not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular implementation. For example, in a 23 Btree (often simply referred to as a 23 tree), each internal node may have only 2 or 3 child nodes.
Each internal node of a Btree contains a number of keys. The keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a_{1} and a_{2}. All values in the leftmost subtree will be less than a_{1}, all values in the middle subtree will be between a_{1} and a_{2}, and all values in the rightmost subtree will be greater than a_{2}.
Usually, the number of keys is chosen to vary between and , where is the minimum number of keys, and is the minimum degree or branching factor of the tree. In practice, the keys take up the most space in a node. The factor of 2 will guarantee that nodes can be split or combined. If an internal node has keys, then adding a key to that node can be accomplished by splitting the hypothetical key node into two key nodes and moving the key that would have been in the middle to the parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its neighbor each have keys, then a key may be deleted from the internal node by combining it with its neighbor. Deleting the key would make the internal node have keys; joining the neighbor would add keys plus one more key brought down from the neighbor's parent. The result is an entirely full node of keys.
The number of branches (or child nodes) from a node will be one more than the number of keys stored in the node. In a 23 Btree, the internal nodes will store either one key (with two child nodes) or two keys (with three child nodes). A Btree is sometimes described with the parameters — or simply with the highest branching order, .
A Btree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being one more node farther away from the root.
Btrees have substantial advantages over alternative implementations when the time to access the data of a node greatly exceeds the time spent processing that data, because then the cost of accessing the node may be amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The maximum number of child nodes depends on the information that must be stored for each child node and the size of a full disk block or an analogous size in secondary storage. While 23 Btrees are easier to explain, practical Btrees using secondary storage need a large number of child nodes to improve performance.
The term Btree may refer to a specific design or it may refer to a general class of designs. In the narrow sense, a Btree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class includes variations such as the B+ tree and the B^{*} tree.
Rudolf Bayer and Ed McCreight invented the Btree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:
The origin of "Btree" has never been explained by the authors. As we shall see, "balanced," "broad," or "bushy" might apply. Others suggest that the "B" stands for Boeing. Because of his contributions, however, it seems appropriate to think of Btrees as "Bayer"trees. (Comer 1979, p. 123 footnote 1)
Donald Knuth speculates on the etymology of Btrees in his May, 1980 lecture on the topic "CS144C classroom lecture about disk storage and Btrees", suggesting the "B" may have originated from Boeing or from Bayer's name.^{[3]}
Ed McCreight answered a question on Btree's name in 2013:
Bayer and I were in a lunchtime where we get to think [of] a name. And ... B is, you know ... We were working for Boeing at the time, we couldn't use the name without talking to lawyers. So, there is a B. [The Btree] has to do with balance, another B. Bayer was the senior author, who [was] several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in Btrees means, the better you understand Btrees."^{[4]}
Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation. A binary search of a sorted table with N records, for example, can be done in roughly ⌈ log_{2} N ⌉ comparisons. If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: ⌈ log_{2} (1,000,000) ⌉ = 20.
Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the Seagate ST3500320NS, the tracktotrack seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.^{[5]} For simplicity, assume reading from disk takes about 10 milliseconds.
Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.
The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay. With 100 records per block, the last 6 or so comparisons don't need to do any disk reads—the comparisons are all within the last disk block read.
To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.
A significant improvement can be made with an index. In the example above, initial disk reads narrowed the search range by a factor of two. That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index). This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly. Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database—at a cost of one more disk read. The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.
The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an auxaux index that would need only 100 entries and would fit in one disk block.
Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the auxaux index identifies the relevant block in auxindex. Reading and searching that auxindex block identifies the relevant block in the main database. Instead of 150 milliseconds, we need only 30 milliseconds to get the record.
The auxiliary indices have turned the search problem from a binary search requiring roughly log_{2} N disk reads to one requiring only log_{b} N disk reads where b is the blocking factor (the number of entries per block: b = 100 entries per block in our example; log_{100} 1,000,000 = 3 reads).
In practice, if the main database is being frequently searched, the auxaux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.
If the database does not change, then compiling the index is simple to do, and the index need never be changed. If there are changes, then managing the database and its index becomes more complicated.
Deleting records from a database is relatively easy. The index can stay the same, and the record can just be marked as deleted. The database remains in sorted order. If there are a large number of deletions, then searching and storage become less efficient.
Insertions can be very slow in a sorted sequential file because room for the inserted record must be made. Inserting a record before the first record requires shifting all of the records down one. Such an operation is just too expensive to be practical. One solution is to leave some spaces. Instead of densely packing all the records in a block, the block can have some free space to allow for subsequent insertions. Those spaces would be marked as if they were "deleted" records.
Both insertions and deletions are fast as long as space is available on a block. If an insertion won't fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted. The hope is that enough space is available nearby, such that a lot of blocks do not need to be reorganized. Alternatively, some outofsequence disk blocks may be used.
The Btree uses all of the ideas described above. In particular, a Btree:
In addition, a Btree minimizes waste by making sure the interior nodes are at least half full. A Btree can handle an arbitrary number of insertions and deletions.
The literature on Btrees is not uniform in its terminology (Folk & Zoellick 1992, p. 362).
Bayer & McCreight (1972), Comer (1979), and others define the order of Btree as the minimum number of keys in a nonroot node. Folk & Zoellick (1992) points out that terminology is ambiguous because the maximum number of keys is not clear. An order 3 Btree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998, p. 483) avoids the problem by defining the order to be maximum number of children (which is one more than the maximum number of keys).
The term leaf is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363). There are many possible implementation choices. In some designs, the leaves may hold the entire data record; in other designs, the leaves may only hold pointers to the data record. Those choices are not fundamental to the idea of a Btree.^{[6]}
There are also unfortunate choices like using the variable k to represent the number of children when k could be confused with the number of keys.
For simplicity, most authors assume there are a fixed number of keys that fit in a node. The basic assumption is the key size is fixed and the node size is fixed. In practice, variable length keys may be employed (Folk & Zoellick 1992, p. 379).
According to Knuth's definition, a Btree of order m is a tree which satisfies the following properties:
Each internal node’s keys act as separation values which divide its subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a_{1} and a_{2}. All values in the leftmost subtree will be less than a_{1}, all values in the middle subtree will be between a_{1} and a_{2}, and all values in the rightmost subtree will be greater than a_{2}.
A Btree of depth n+1 can hold about U times as many items as a Btree of depth n, but the cost of search, insert, and delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than the number of elements.
Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal nodes. Btrees keep values in every node in the tree, and may use the same structure for all nodes. However, since leaf nodes never have children, the Btrees benefit from improved performance if they use a specialized structure.
Let h be the height of the classic Btree. Let n > 0 be the number of entries in the tree.^{[7]} Let m be the maximum number of children a node can have. Each node can have at most m−1 keys.
It can be shown (by induction for example) that a Btree of height h with all its nodes completely filled has n= m^{h+1}−1 entries. Hence, the best case height of a Btree is:
Let be the minimum number of children an internal (nonroot) node can have. For an ordinary Btree,
Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give the worst case height of a Btree (where the root node is considered to have height 0) as
Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to bottom. At each level, the search reduces its field of view to the child pointer (subtree) whose range includes the search value. A subtree's range is defined by the values, or keys, contained in its parent node. These limiting values are also known as separation values.
Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of interest.
All insertions start at a leaf node. To insert a new element, search the tree to find the leaf node where the new element should be added. Insert the new element into that node with the following steps:
If the splitting goes all the way up to the root, it creates a new root with a single separator value and two children, which is why the lower bound on the size of internal nodes does not apply to the root. The maximum number of elements per node is U−1. When a node is split, one element moves to the parent, but one element is added. So, it must be possible to divide the maximum number U−1 of elements into two legal nodes. If this number is odd, then U=2L and one of the new nodes contains (U−2)/2 = L−1 elements, and hence is a legal node, and the other contains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2 elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node.
An improved algorithm ^{[citation needed]} supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some textbooks impose this requirement in defining Btrees.
There are two popular strategies for deletion from a Btree.
The algorithm below uses the former strategy.
There are two special cases to consider when deleting an element:
The procedures for these cases are in order below.
Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one can be the new separator for the two subtrees. Algorithmically described below:
Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the minimum number of nodes. That redistribution operation is called a rotation. If no sibling can spare an element, then the deficient node must be merged with a sibling. The merge causes the parent to lose a separator element, so the parent may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root. Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a problem. The algorithm to rebalance the tree is as follows:^{[citation needed]}
While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult to maintain as a database grows, resulting in more random I/O and performance challenges.^{[8]}
In applications, it is frequently useful to build a Btree to represent a large existing collection of data and then update it incrementally using standard Btree operations. In this case, the most efficient way to construct the initial Btree is not to insert every element in the initial collection successively, but instead to construct the initial set of leaf nodes directly from the input, then build the internal nodes from these. This approach to Btree construction is called bulkloading. Initially, every leaf but the last one has one extra element, which will be used to build the internal nodes.^{[citation needed]}
For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:





We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again, each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2 values (3 child pointers). Then the next level up of internal nodes would be:

 





This process is continued until we reach a level with only one node and it is not overfilled. In the example only the root level remains:
 

 





In addition to its use in databases, the Btree (or § Variants) is also used in filesystems to allow quick random access to an arbitrary block in a particular file. The basic problem is turning the file block address into a disk block (or perhaps to a cylinderheadsector) address.
Some operating systems require the user to allocate the maximum size of the file when the file is created. The file can then be allocated as contiguous disk blocks. In that case, to convert the file block address into a disk block address, the operating system simply adds the file block address to the address of the first disk block constituting the file. The scheme is simple, but the file cannot exceed its created size.
Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical blocks to physical blocks is more involved.
MSDOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each disk block,^{[note 1]} and that entry identifies whether its block is used by a file and if so, which block (if any) is the next disk block of the same file. So, the allocation of each file is represented as a linked list in the table. In order to find the disk address of file block , the operating system (or disk utility) must sequentially follow the file's linked list in the FAT. Worse, to find a free disk block, it must sequentially scan the FAT. For MSDOS, that was not a huge penalty because the disks and files were small and the FAT had few entries and relatively short file chains. In the FAT12 filesystem (used on floppy disks and early hard disks), there were no more than 4,080 ^{[note 2]} entries, and the FAT would usually be resident in memory. As disks got bigger, the FAT architecture began to confront penalties. On a large disk using FAT, it may be necessary to perform disk reads to learn the disk location of a file block to be read or written.
TOPS20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a Btree^{[citation needed]}. A disk block was 512 36bit words. If the file fit in a 512 (2^{9}) word block, then the file directory would point to that physical disk block. If the file fit in 2^{18} words, then the directory would point to an aux index; the 512 words of that index would either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in 2^{27} words, then the directory would point to a block holding an auxaux index; each entry would either be NULL or point to an aux index. Consequently, the physical disk block for a 2^{27} word file could be located in two disk reads and read on the third.
Apple's filesystem HFS+, Microsoft's NTFS,^{[9]} AIX (jfs2) and some Linux filesystems, such as btrfs and Ext4, use Btrees.
B^{*}trees are used in the HFS and Reiser4 file systems.
Lehman and Yao^{[10]} showed that all the read locks could be avoided (and thus concurrent access greatly improved) by linking the tree blocks at each level together with a "next" pointer. This results in a tree structure where both insertion and search operations descend from the root to the leaf. Write locks are only required as a tree block is modified. This maximizes access concurrency by multiple users, an important consideration for databases and/or other Btreebased ISAM storage methods. The cost associated with this improvement is that empty pages cannot be removed from the btree during normal operations. (However, see ^{[11]} for various strategies to implement node merging, and source code at.^{[12]})
United States Patent 5283894, granted in 1994, appears to show a way to use a 'Meta Access Method' ^{[13]} to allow concurrent B+ tree access and modification without locks. The technique accesses the tree 'upwards' for both searches and updates by means of additional inmemory indexes that point at the blocks in each level in the block cache. No reorganization for deletes is needed and there are no 'next' pointers in each block as in Lehman and Yao.
((cite web))
: Unknown parameter deadurl=
ignored (urlstatus=
suggested) (help)
((citation))
: Invalid ref=harv
(help)((citation))
: Invalid ref=harv
(help).((citation))
: Invalid ref=harv
(help)((citation))
: volume=
has extra text (help). Section 6.2.4: Multiway Trees, pp. 481–491. Also, pp. 476–477 of section 6.2.3 (Balanced Trees) discusses 23 trees.((citation))
: CS1 maint: location missing publisher (link).Search trees (dynamic sets/associative arrays)  

Heaps  
Tries  
Spatial data partitioning trees  
Other trees 
Types  

Abstract  
Arrays  
Linked  
Trees  
Graphs  