k-d tree

Type  Multidimensional BST  
Invented  1975  
Invented by  Jon Louis Bentley  

In computer science, a k-d tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. K-dimensional is that which concerns exactly k orthogonal axes or a space of any number of dimensions.^{[1]} k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (for example, range searches and nearest neighbour searches).
k-d trees are a special case of binary space partitioning trees.
The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node and points to the right of the hyperplane are represented by the right subtree. The hyperplane direction is chosen in the following way: every node in the tree is associated with one of the k dimensions, with the hyperplane perpendicular to that dimension's axis. So, for example, if for a particular split the "x" axis is chosen, all points in the subtree with a smaller "x" value than the node will appear in the left subtree and all points with a larger "x" value will be in the right subtree. In such a case, the hyperplane would be set by the x value of the point, and its normal would be the unit x-axis.^{[2]}
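Since there are many possible ways to choose axis-aligned splitting planes, there are many different ways to construct k-d trees. The canonical method of k-d tree construction has the following constraints:^{[3]} as one moves down the tree, one cycles through the axes used to select the splitting planes; and points are inserted by selecting the median of the points being put into the subtree, with respect to their coordinates on the axis being used to create the splitting plane.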
This method leads to a balanced k-d tree, in which each leaf node is approximately the same distance from the root. However, balanced trees are not necessarily optimal for all applications.
Note that it is not required to select the median point. In the case where median points are not selected, there is no guarantee that the tree will be balanced. To avoid coding a complex median-finding algorithm^{[4]}^{[5]} or using an $O(n \log n)$ sort such as heapsort or mergesort to sort all n points, a popular practice is to sort a fixed number of randomly selected points, and use the median of those points to serve as the splitting plane. In practice, this technique often results in nicely balanced trees.
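The sampling shortcut described above might look like the following minimal sketch. The function name `sample_median_pivot` and its `sample_size` and `rng` parameters are hypothetical, chosen only for illustration; points are assumed to be tuples of numbers.

```python
import random

def sample_median_pivot(points, axis, sample_size=5, rng=random):
    """Pick a splitting point as the median of a small random sample.

    Hypothetical helper: sorts at most sample_size points instead of all n.
    """
    sample = rng.sample(points, min(sample_size, len(points)))
    sample.sort(key=lambda p: p[axis])
    return sample[len(sample) // 2]

points = [(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)]

# With a seeded generator the choice is repeatable; the pivot is always
# one of the input points, though not necessarily the true median.
pivot = sample_median_pivot(points, axis=0, sample_size=3, rng=random.Random(0))
```

When `sample_size` is at least the number of points, the helper degenerates to an exact (upper) median, which is one way to sanity-check it.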
Given a list of n points, the following algorithm uses a median-finding sort to construct a balanced k-d tree containing those points.

    function kdtree (list of points pointList, int depth)
    {
        // An empty list yields an empty subtree
        if pointList is empty
            return nil;

        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;

        // Sort point list and choose median as pivot element
        select median by axis from pointList;

        // Create node and construct subtree
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
It is common that points "after" the median include only the ones that are strictly greater than the median in the current dimension. For points that lie on the median in the current dimension, it is possible to define a function that compares them in all dimensions. In some cases, it is acceptable to let points equal to the median lie on one side of the median, for example, by splitting the points into a "less than" subset and a "greater than or equal to" subset.
This algorithm creates the invariant that for any node, all the nodes in the left subtree are on one side of a splitting plane, and all the nodes in the right subtree are on the other side. Points that lie on the splitting plane may appear on either side. The splitting plane of a node goes through the point associated with that node (referred to in the code as node.location).
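As an illustration, the construction algorithm can be sketched in Python roughly as follows. This is one possible rendering under stated assumptions, not a reference implementation: points are tuples of equal length, the median is found with a full sort at every level, and the names `Node`, `left_child` and `right_child` are adapted from the `node.location` naming used above.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, ...]

class Node(NamedTuple):
    location: Point
    left_child: Optional["Node"]
    right_child: Optional["Node"]

def kdtree(point_list: Sequence[Point], depth: int = 0) -> Optional[Node]:
    if not point_list:
        return None  # an empty list yields an empty subtree

    k = len(point_list[0])   # assumes all points share one dimension
    axis = depth % k         # cycle through the axes by depth

    # Sort along the current axis and take the (upper) median as pivot.
    ordered = sorted(point_list, key=lambda p: p[axis])
    median = len(ordered) // 2

    return Node(
        location=ordered[median],
        left_child=kdtree(ordered[:median], depth + 1),
        right_child=kdtree(ordered[median + 1:], depth + 1),
    )

tree = kdtree([(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)])
```

Because the split is by position in the sorted list, points that tie with the median on the current axis may end up on either side, which matches the invariant stated above.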
Alternative algorithms for building a balanced k-d tree presort the data prior to building the tree. Then, they maintain the order of the presort during tree construction and hence eliminate the costly step of finding the median at each level of subdivision. Two such algorithms build a balanced k-d tree to sort triangles in order to improve the execution time of ray tracing for three-dimensional computer graphics. These algorithms presort n triangles prior to building the k-d tree, then build the tree in $O(n \log n)$ time in the best case.^{[6]}^{[7]} An algorithm that builds a balanced k-d tree to sort points has a worst-case complexity of $O(kn \log n)$.^{[8]}^{[9]} This algorithm presorts n points in each of k dimensions using an $O(n \log n)$ sort such as Heapsort or Mergesort prior to building the tree. It then maintains the order of these k presorts during tree construction and thereby avoids finding the median at each level of subdivision.
One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once the traversal reaches the leaf node under which the new point should be located, add the new point as either the left or right child of that node, again depending on which side of the node's splitting plane contains the new point.
Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be rebalanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
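A minimal sketch of this insertion procedure is shown below. It assumes a mutable `Node` class (a hypothetical name) and the convention that points equal to the node on the current axis go to the right subtree; no rebalancing is attempted.

```python
class Node:
    def __init__(self, location):
        self.location = location
        self.left_child = None
        self.right_child = None

def insert(root, point, depth=0):
    """Insert point below root and return the (possibly new) subtree root."""
    if root is None:
        return Node(point)
    axis = depth % len(point)
    if point[axis] < root.location[axis]:
        root.left_child = insert(root.left_child, point, depth + 1)
    else:
        # Assumed convention: ties on the current axis go right.
        root.right_child = insert(root.right_child, point, depth + 1)
    return root

root = None
for p in [(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)]:
    root = insert(root, p)
```

The resulting shape depends entirely on insertion order: feeding the same function points already sorted along one axis would produce long, one-sided chains, which is the imbalance the paragraph above warns about.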
To remove a point from an existing k-d tree without breaking the invariant, the easiest way is to form the set of all nodes and leaves from the children of the target node, and recreate that part of the tree.
Another approach is to find a replacement for the point removed.^{[10]} First, find the node R that contains the point to be removed. For the base case where R is a leaf node, no replacement is required. For the general case, find a replacement point, say p, from the subtree rooted at R. Replace the point stored at R with p. Then, recursively remove p.
For finding a replacement point, if R discriminates on x (say) and R has a right child, find the point with the minimum x value from the subtree rooted at the right child. Otherwise, find the point with the maximum x value from the subtree rooted at the left child.
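The replacement strategy might be sketched as follows. This is an illustrative sketch only: `Node`, `find_extreme` and `remove` are hypothetical names, and the code assumes that no two stored points share a coordinate on the same axis, so that a point can always be located by a single root-to-leaf descent.

```python
class Node:
    def __init__(self, location, left_child=None, right_child=None):
        self.location = location
        self.left_child = left_child
        self.right_child = right_child

def find_extreme(node, dim, depth, pick):
    """Point with the smallest (pick=min) or largest (pick=max)
    coordinate along dim in the subtree rooted at node."""
    if node is None:
        return None
    axis = depth % len(node.location)
    if axis == dim:
        # This node splits on dim, so only one side can beat it.
        children = [node.left_child if pick is min else node.right_child]
    else:
        children = [node.left_child, node.right_child]
    candidates = [node.location]
    for child in children:
        found = find_extreme(child, dim, depth + 1, pick)
        if found is not None:
            candidates.append(found)
    return pick(candidates, key=lambda p: p[dim])

def remove(node, point, depth=0):
    """Remove point from the subtree and return the new subtree root."""
    if node is None:
        return None
    axis = depth % len(point)
    if node.location == point:
        if node.right_child is not None:
            # Replace with the minimum along this node's axis from the right.
            p = find_extreme(node.right_child, axis, depth + 1, min)
            node.location = p
            node.right_child = remove(node.right_child, p, depth + 1)
        elif node.left_child is not None:
            # Otherwise replace with the maximum from the left subtree.
            p = find_extreme(node.left_child, axis, depth + 1, max)
            node.location = p
            node.left_child = remove(node.left_child, p, depth + 1)
        else:
            return None  # leaf: no replacement required
        return node
    if point[axis] < node.location[axis]:
        node.left_child = remove(node.left_child, point, depth + 1)
    else:
        node.right_child = remove(node.right_child, point, depth + 1)
    return node

root = Node((7, 2),
            Node((5, 4), Node((2, 3)), Node((4, 7))),
            Node((9, 6), Node((8, 1))))
root = remove(root, (7, 2))
```

In this example the root splits on x and has a right child, so the point with the smallest x in the right subtree, (8, 1), takes its place and is then removed from its old position.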
Balancing a k-d tree requires care because k-d trees are sorted in multiple dimensions, so the tree-rotation technique cannot be used to balance them as this may break the invariant.
Several variants of balanced k-d trees exist. They include divided k-d tree, pseudo k-d tree, K-D-B-tree, hB-tree and Bkd-tree. Many of these variants are adaptive k-d trees.
The nearest neighbour search (NN) algorithm aims to find the point in the tree that is nearest to a given input point. This search can be done efficiently by using the tree properties to quickly eliminate large portions of the search space.
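Searching for a nearest neighbour in a k-d tree proceeds as follows. Starting at the root, the algorithm moves down the tree recursively, in the same way as it would if the search point were being inserted, and takes the point at the node it reaches as the "current best". It then unwinds the recursion. At each node, it updates the current best if the node's point is closer, and checks whether the splitting hyperplane is closer to the search point than the current best. If it is, closer points may lie on the other side of the plane, so the other branch is searched as well; otherwise that branch is eliminated. When this process finishes at the root, the search is complete.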
Generally the algorithm uses squared distances for comparison to avoid computing square roots. Additionally, it can save computation by holding the squared current best distance in a variable for comparison.
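A minimal sketch of such a search, using squared distances throughout, is given below. It is an illustration under assumptions rather than a definitive implementation: `Node`, `build`, `sq_dist` and `nearest` are hypothetical names, the tree is built with a simple median split, and each node is examined on the way down rather than while unwinding, which yields the same result.

```python
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    location: Tuple[float, ...]
    left_child: Optional["Node"]
    right_child: Optional["Node"]

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid],
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, target, depth=0, best=None):
    """Return (squared distance, point) for the closest stored point."""
    if node is None:
        return best
    d = sq_dist(node.location, target)
    if best is None or d < best[0]:
        best = (d, node.location)
    axis = depth % len(target)
    diff = target[axis] - node.location[axis]
    if diff < 0:
        near, far = node.left_child, node.right_child
    else:
        near, far = node.right_child, node.left_child
    best = nearest(near, target, depth + 1, best)
    # Search the far side only if the splitting plane is closer
    # than the current best (compared in squared units).
    if diff * diff < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

points = [(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)]
tree = build(points)
result = nearest(tree, (9, 2))
```

For the query (9, 2), the whole left half of the tree is skipped, because the root's splitting plane x = 7 is farther away than the best point already found.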
The algorithm can be extended in several ways by simple modifications. It can provide the k nearest neighbours to a point by maintaining k current bests instead of just one. A branch is only eliminated when k points have been found and the branch cannot have points closer than any of the k current bests.
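The extension to several nearest neighbours could be sketched with a bounded max-heap, as below. The names are hypothetical, and the parameter is called `count` (assumed to be at least 1) to avoid confusion with the dimensionality k. Python's `heapq` is a min-heap, so distances are stored negated.

```python
import heapq
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    location: Tuple[float, ...]
    left_child: Optional["Node"]
    right_child: Optional["Node"]

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid],
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _search(node, target, count, depth, heap):
    if node is None:
        return
    d = sq_dist(node.location, target)
    if len(heap) < count:
        heapq.heappush(heap, (-d, node.location))
    elif d < -heap[0][0]:
        # Closer than the worst of the current bests: swap it in.
        heapq.heapreplace(heap, (-d, node.location))
    axis = depth % len(target)
    diff = target[axis] - node.location[axis]
    if diff < 0:
        near, far = node.left_child, node.right_child
    else:
        near, far = node.right_child, node.left_child
    _search(near, target, count, depth + 1, heap)
    # Prune only once count points are held and the far side
    # cannot contain anything closer than the worst of them.
    if len(heap) < count or diff * diff < -heap[0][0]:
        _search(far, target, count, depth + 1, heap)

def nearest_points(tree, target, count):
    """Return up to count stored points, closest first."""
    heap = []
    _search(tree, target, count, 0, heap)
    return [p for _, p in sorted((-neg, p) for neg, p in heap)]

tree = build([(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)])
two = nearest_points(tree, (9, 2), 2)
three = nearest_points(tree, (9, 2), 3)
```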
It can also be converted to an approximation algorithm to run faster. For example, approximate nearest neighbour searching can be achieved by simply setting an upper bound on the number of points to examine in the tree, or by interrupting the search process based upon a real-time clock (which may be more appropriate in hardware implementations). Nearest neighbour for points that are in the tree already can be achieved by not updating the refinement for nodes that give zero distance as the result; this has the downside of discarding points that are not unique, but are co-located with the original search point.
Approximate nearest neighbour is useful in real-time applications such as robotics due to the significant speed increase gained by not searching for the best point exhaustively. One of its implementations is best-bin-first search.
A range search searches for ranges of parameters. For example, if a tree is storing values corresponding to income and age, then a range search might be something like looking for all members of the tree which have an age between 20 and 50 years and an income between 50,000 and 80,000. Since k-d trees divide the range of a domain in half at each level of the tree, they are useful for performing range searches.
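The age and income example could be sketched as an orthogonal range query like the one below. The sample data and the names `build` and `range_search` are made up for illustration; bounds are treated as inclusive, and both subtrees are visited when a point lies exactly on a bound.

```python
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    location: Tuple[float, ...]
    left_child: Optional["Node"]
    right_child: Optional["Node"]

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid],
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))

def range_search(node, lo, hi, depth=0, found=None):
    """Collect every point p with lo[i] <= p[i] <= hi[i] on all axes."""
    if found is None:
        found = []
    if node is None:
        return found
    p = node.location
    if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
        found.append(p)
    axis = depth % len(p)
    # Descend only into the sides that can overlap the query box.
    if lo[axis] <= p[axis]:
        range_search(node.left_child, lo, hi, depth + 1, found)
    if hi[axis] >= p[axis]:
        range_search(node.right_child, lo, hi, depth + 1, found)
    return found

# Hypothetical (age, income) records.
people = [(25, 40000), (35, 60000), (45, 75000),
          (55, 90000), (30, 85000), (22, 55000)]
tree = build(people)
matches = sorted(range_search(tree, lo=(20, 50000), hi=(50, 80000)))
```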
Analyses of binary search trees have found that the worst-case time for range search in a k-dimensional k-d tree containing n nodes is given by the following equation.^{[11]}

$$t_{\text{worst}} = O\left(k \cdot n^{1 - \frac{1}{k}}\right)$$
Finding the nearest point is an $O(\log n)$ operation on average, in the case of randomly distributed points, although analysis in general is tricky.^{[12]}
In high-dimensional spaces, the curse of dimensionality causes the algorithm to need to visit many more branches than in lower-dimensional spaces. In particular, when the number of points is only slightly higher than the number of dimensions, the algorithm is only slightly better than a linear search of all of the points. As a general rule, if the dimensionality is k, the number of points in the data, n, should satisfy $n \gg 2^k$. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search,^{[13]} and, if a good-enough fast answer is required, approximate nearest-neighbour methods should be used instead.
Additionally, even in low-dimensional space, if the average pairwise distance between the k nearest neighbors of the query point is significantly less than the average distance between the query point and each of the k nearest neighbors, the performance of nearest neighbor search degrades towards linear, since the distances from the query point to each nearest neighbor are of similar magnitude. (In the worst case, consider a cloud of points distributed on the surface of a sphere centered at the origin. Every point is equidistant from the origin, so a search for the nearest neighbor from the origin would have to iterate through all points on the surface of the sphere to identify the nearest neighbor, which in this case is not even unique.)
To mitigate the potentially significant performance degradation of a k-d tree search in the worst case, a maximum distance parameter can be provided to the tree search algorithm, and the recursive search can be pruned whenever the closest point in a given branch of the tree cannot be closer than this maximum distance. This may result in a nearest neighbor search failing to return a nearest neighbor, which means no points are within this maximum distance from the query point.
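One way to sketch the maximum distance parameter is to seed the search with the squared maximum distance as the initial "best", so that branches farther away are pruned from the start. The names `nearest_within` and `max_dist` are hypothetical, and the function returns `None` as the point when nothing lies within range.

```python
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    location: Tuple[float, ...]
    left_child: Optional["Node"]
    right_child: Optional["Node"]

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid],
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_within(node, target, max_dist, depth=0, best=None):
    """Return (squared bound, point or None) for the closest point
    no farther than max_dist from target."""
    if best is None:
        best = (max_dist * max_dist, None)  # nothing found yet
    if node is None:
        return best
    d = sq_dist(node.location, target)
    if d <= best[0]:
        best = (d, node.location)
    axis = depth % len(target)
    diff = target[axis] - node.location[axis]
    if diff < 0:
        near, far = node.left_child, node.right_child
    else:
        near, far = node.right_child, node.left_child
    best = nearest_within(near, target, max_dist, depth + 1, best)
    # The far side is pruned when the plane lies beyond the bound.
    if diff * diff <= best[0]:
        best = nearest_within(far, target, max_dist, depth + 1, best)
    return best

tree = build([(7, 2), (5, 4), (9, 6), (4, 7), (8, 1), (2, 3)])
hit = nearest_within(tree, (9, 2), max_dist=2)[1]
miss = nearest_within(tree, (9, 2), max_dist=1)[1]
```

With a radius of 1 around (9, 2) no stored point qualifies, so the search reports no neighbour instead of falling back to a distant one.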
Instead of points, a k-d tree can also contain rectangles or hyperrectangles.^{[14]}^{[15]} Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along x_{high}, we check the x_{low} coordinate of the search rectangle. If the median is less than the x_{low} coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle and so can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.
It is also possible to define a k-d tree with points stored solely in leaves.^{[3]} This form of k-d tree allows a variety of split mechanics other than the standard median split. The midpoint splitting rule^{[16]} splits at the middle of the longest axis of the space being searched, regardless of the distribution of points. This guarantees that the aspect ratio will be at most 2:1, but the depth is dependent on the distribution of points. A variation, called sliding-midpoint, only splits at the middle if there are points on both sides of the split. Otherwise, it splits at the point nearest to the middle. Maneewongvatana and Mount show that this offers "good enough" performance on common data sets.
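A single sliding-midpoint split could be sketched roughly as follows. This is an interpretation of the rule as described above, not the authors' code: the function name, the bounding box arguments `lo` and `hi`, and the tie-handling choices are assumptions, and the input is assumed to contain at least two distinct coordinates along the chosen axis.

```python
def sliding_midpoint_split(points, lo, hi):
    """Split points inside the box [lo, hi]; return (axis, cut, left, right)."""
    k = len(lo)
    # Split across the longest side of the current cell.
    axis = max(range(k), key=lambda i: hi[i] - lo[i])
    cut = (lo[axis] + hi[axis]) / 2
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]
    if not left:
        # Everything lies above the middle: slide the cut up to the
        # nearest point, which then forms the left side.
        cut = min(p[axis] for p in points)
        left = [p for p in points if p[axis] <= cut]
        right = [p for p in points if p[axis] > cut]
    elif not right:
        # Everything lies below the middle: slide the cut down instead.
        cut = max(p[axis] for p in points)
        left = [p for p in points if p[axis] < cut]
        right = [p for p in points if p[axis] >= cut]
    return axis, cut, left, right

points = [(1, 1), (2, 5), (3, 9)]
plain = sliding_midpoint_split(points, lo=(0, 0), hi=(4, 10))
slid = sliding_midpoint_split(points, lo=(0, 0), hi=(4, 20))
```

In the second call the middle of the cell (y = 10) has no points above it, so the cut slides down to y = 9 and neither side is left empty.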
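Using sliding-midpoint, an approximate nearest neighbour query can be answered in $O\left(\tfrac{1}{\epsilon^{d}} \log n\right)$. Approximate range counting can be answered in $O\left(\log n + \left(\tfrac{1}{\epsilon}\right)^{d}\right)$ with this method.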