11.1 Binary search trees

A binary search tree (BST) is a binary tree that conforms to the following condition, known as the binary search tree property:

All nodes stored in the left subtree of a node whose key value is KK have key values less than or equal to KK. All nodes stored in the right subtree of a node whose key value is KK have key values greater than KK.

Figure 11.1 below shows two BSTs for a collection of values. One consequence of the binary search tree property is that if the BST nodes are printed using an inorder traversal, then the resulting enumeration will be in sorted order from lowest to highest.

Figure 11.1: Two Binary Search Trees for a collection of values. Tree (a) results if values are inserted in the order 37, 24, 42, 120, 32, 7. Tree (b) results if the same values are inserted in the order 7, 37, 42, 32, 120, 24..

The only thing that differentiates a BST from a normal binary tree is the BST property. This property is an invariant, as as explained in Section 2.2, and it is a condition that the BST always must satisfy.

Invariants are not stored in the datatype, so the actual declaration for a BST is exactly the same as a normal binary tree:

datatype BST:
    root: BinaryNode = null
    size: Int = 0

The BST property is therefore not explicit in the datatype declaration, but instead it is something that all operations must satisfy. Whenever we implement a new operation, we have to make sure that we never break the property.

Searching in a BST

Because of the BST property, we don’t have to search the whole tree if we want to find an element. Instead, we can start at the root node and compare its value with the value we are searching for. If the value is smaller than the root, we know that we can disregard everything in the right subtree. (And conversely, if the value is larger, we can disregard the left subtree.)

Now, assuming that the value was smaller than the root, we can go to the left child and compare again. If the value now is larger than the child, we can continue into the child’s right subtree. We continue this until we have found the value in a node, or until we reach an empty child. If we reach an empty child we know that the value is not in the tree. Here is a possible implementation of a method that returns true if the element is in the tree:

datatype BST:
    ...
    contains(elem):
        node = root
        while node is not null:
            if elem == node.elem:
                return true
            else if elem < node.elem:
                node = node.left
            else if elem > node.elem:
                node = node.right
        return false

Inserting into a BST

When we want to insert an element we first have to search for it, similar to what we did above. If it already exists we don’t have to do anything, and if it isn’t in the tree we create a new node and attach it to the right place. But how do we know where to add the new node? The variable node becomes null if the element isn’t found, so we have nothing to attach the node to – instead we want to attach the new node to the previous tree node that we looked at. The solution is to add another temporary variable, parent, which points to the parent of node – i.e., its value in the previous iteration.

Now, after we have completed the search, and the element wasn’t found, we can simply create a new node and attach it as a child to parent. Depending on if the element is smaller or larger than the parent element, we make it a left or right child. Alternatively, if the tree is empty then we have to attach the new node directly to the root.

datatype BST:
    ...
    add(elem):
        // Search for the parent of the new node.
        parent = null
        node = root
        while node is not null:
            parent = node
            if elem == node.elem
                return  // The element is already in the BST, so we do nothing.
            else if elem < node.elem:
                node = node.left
            else if elem > node.elem:
                node = node.right

        // Create a new node and attach it to the parent.
        node = new BinaryNode(elem)
        if parent is null:
            root = node  // The tree is empty, so we update the root.
        else if elem < parent.elem:
            parent.left = node
        else if elem > parent.elem:
            parent.right = node
        size = size + 1

We have to decide what to do when the node that we want to insert has a key value equal to the key of some node already in the tree. If during insert we find a node that duplicates the key value to be inserted, then we have two options. If the application does not allow nodes with equal keys, then this insertion should be treated as an error (or ignored). If duplicate keys are allowed, our convention will be to insert the duplicate in the left subtree.

The shape of a BST depends on the order in which elements are inserted. A new element is added to the BST as a new leaf node, potentially increasing the depth of the tree. Figure 11.1 illustrates two BSTs for a collection of values. It is possible for the BST containing nn nodes to be a chain of nodes with height nn. This would happen if, for example, all elements were inserted in sorted order. In general, it is preferable for a BST to be as shallow as possible. This keeps the average cost of a BST operation low.

11.1.1 Recursive search and insert

Bot searching and insertion can be implemented as recursive function too, and it is instructive to see how to to that. And since we showed the set operations contains and add above, we will show how to implement recursive versions of the map operations get and put.

Note that this means that our binary tree nodes will have the instance variables key and value, instead of the single elem that we used earlier.

Getting the value for a key

First we will look at the get operation for finding the record that matches a given key. To accomplish this we need a recursive helper function which takes a node as argument, and we start by calling this function with the root:

datatype BSTMap implements Map:
    ...
    get(key):
        return getHelper(root, key)

The function getHelper performs the same iteration as we did earlier with contains, but implicitly as recursive calls. When the key is found it returns the value of the node, and if the key doesn’t exist it returns null:

    getHelper(node, key):
        if node is null:
            return null
        else if key < node.key:
            return getHelper(node.left, key)
        else if key > node.key:
            return getHelper(node.right, key)
        else if key == node.key:
            return node.value

Here is an interactive explanation of searching in a BST.

Here is an exercise on the BST search algorithm.

Setting the value for a key

Now we look at how to set the value of a key. Yet again we need a recursive helper function which we call with the root of the tree. This function returns a pointer to the updated tree, and we have to make sure to update the root to this updated tree.

datatype BSTMap implements Map:
    ...
    put(key, value):
        root = putHelper(root, key, value)

We do the same as for getHelper as for putHelper above, but we have to update the value of each child pointer in the path to be the new updated tree.

    putHelper(node, key, value):
        if node is null:
            size = size + 1
            return new BSTNode(key, value)
        else if key < node.key:
            node.left = putHelper(node.left, key, value)
        else if key > node.key:
            node.right = putHelper(node.right, key, value)
        else if key == node.key:
            node.value = value
        return node

Here is an interactive explanation of recursive insertion.

Note that, except for the last node in the path, putHelper will not actually change the child pointer for any of the nodes that are visited. In that sense, many of the assignments seem redundant. However, the cost of these additional assignments is worth paying to keep the insertion process simple. The alternative is to check if a given assignment is necessary, which is probably more expensive than the assignment!

Here is an exercise on BST insertion.

11.1.2 Removing from a BST

Removing a node from a BST is a bit trickier than inserting a node, but it is not complicated if all of the possible cases are considered individually. Before tackling the general node removal process, we need a useful companion method, largestNode, which returns a pointer to the node containing the maximum value in a subtree.

function largestNode(node):
    while node.right is not null:
        node = node.right
    return node

This time we will show a recursive implementation of remove. For this we need a recursive helper function which we initially call with the root of the tree.

datatype BSTMap:
    ...
    remove(key):
        root = removeHelper(root, key)

Now we are ready for the removeHelper method. Removing a node with given key value RR from the BST requires that we first find RR and then remove it from the tree. So, the first part of the remove operation is a search to find RR. Once RR is found, there are several possibilities. If RR has no children, then RR’s parent has its pointer set to null. If RR has one child, then RR’s parent has its pointer set to RR’s child. The problem comes if RR has two children. One simple approach, though expensive, is to set RR’s parent to point to one of RR’s subtrees, and then reinsert the remaining subtree’s nodes one at a time. A better alternative is to find a value in one of the subtrees that can replace the value in RR.

Thus, the question becomes: Which value can substitute for the one being removed? It cannot be any arbitrary value, because we must preserve the BST property without making major changes to the structure of the tree. Which value is most like the one being removed? The answer is the least key value greater than the one being removed, or else the greatest key value less than (or equal to) the one being removed. If either of these values replace the one being removed, then the BST property is maintained.

Here is an interactive explanation of BST deletion.

When duplicate node values do not appear in the tree, it makes no difference whether the replacement is the greatest value from the left subtree or the least value from the right subtree. If duplicates are stored in the left subtree, then we must select the replacement from the left subtree. To see why, call the least value in the right subtree LL. If multiple nodes in the right subtree have value LL, selecting LL as the replacement value for the root of the subtree will result in a tree with equal values to the right of the node now containing LL. Selecting the greatest value from the left subtree does not have a similar problem, because it does not violate the Binary Search Tree Property if equal values appear in the left subtree.

Note: Alternatively, we can decide to store duplicate values in the right subtree instead of the left. Then we must replace a deleted node with the least value from its right subtree.

The code for removal is shown here. Note that the helper function returns the updated subtree, and we have to make sure to update the child pointers to this updated tree.

    removeHelper(node, key):
        if node is null:
            return null
        else if key < node.key:
            node.left = removeHelper(node.left, key)
            return node
        else if key > node.key:
            node.right = removeHelper(node.right, key)
            return node
        else if key == node.key:
            if node.left is null:
                size = size - 1
                return node.right
            else if node.right is null:
                size = size - 1
                return node.left
            else:
                predecessor = largestNode(node.left)
                node.key = predecessor.key
                node.value = predecessor.value
                node.left = removeHelper(node.left, predecessor.key)
                return node
datatype BSTMap:
    ...
    remove(key):
        parent = null
        node = root
        while node is not null:
            parent = node
            if key < node.key:
                node = node.left
            else if key > node.key:
                node = node.right
            else: // key == node.key
                break  // we found the key so break out of the loop
        if node is null:
            return  // the key is not in the tree, so do nothing

        // node to be deleted has at most one child
        if node.left is null or node.right is null:
            // newNode will replace the node to be deleted
            if node.left is null:
                newNode = node.right
            else: // node.right is null
                newNode = node.left

            if parent is null:
                root = newNode  // the node to be deleted is the root, so update the root
            else if node == parent.left:
                parent.left = newNode  // node is a left child, so make newNode a left child
            else: // node == parent.right
                parent.right = newNode  // node is a right child, so make newNode a right child

            node = None

        // node to be deleted has two children
        else:
            predecessorParent = null
            predecessor = node.left
            while predecessor.right is not null:
                predecessorParent = predecessor
                predecessor = predecessor.right

            node.key = predecessor.key
            node.value = predecessor.value

            if predecessorParent is null:
                node.left = predecessor.left
            else:
                predecessorParent.right = predecessor.left

Here is an exercise on BST deletion.

11.1.3 Analysis

The cost for contains, get, add, and put is the depth of the node found or inserted. The cost for remove is the depth of the node being removed, or in the case when this node has two children, the depth of the node with smallest value in its right subtree. Thus, in the worst case, the cost for any one of these operations is the depth of the deepest node in the tree. This is why it is desirable to keep BSTs balanced, that is, with least possible height. If a binary tree is balanced, then the height for a tree of nn nodes is approximately logn\log n. However, if the tree is completely unbalanced, for example in the shape of a linked list, then the height for a tree with nn nodes can be as great as nn. Thus, a balanced BST will in the average case have operations costing O(logn)O(\log n), while a badly unbalanced BST can have operations in the worst case costing O(n)O(n). Consider the situation where we construct a BST of nn nodes by inserting records one at a time. If we are fortunate to have them arrive in an order that results in a balanced tree (a “random” order is likely to be good enough for this purpose), then each insertion will cost on average O(logn)O(\log n), for a total cost of O(nlogn)O(n \log n). However, if the records are inserted in order of increasing value, then the resulting tree will be a chain of height nn. The cost of insertion in this case will be i=1niO(n2)\sum_{i=1}^{n} i \in O(n^2).

Traversing a BST costs O(n)O(n) regardless of the shape of the tree. Each node is visited exactly once, and each child pointer is followed exactly once. For example, if we want to print the nodes in ascending order we can perform an inorder traversal (Section 8.4), which then will take time linear in the size of the tree.

While the BST is simple to implement and efficient when the tree is balanced, the possibility of its being unbalanced is a serious liability. There are techniques for organising a BST to guarantee good performance. Two examples are the AVL tree (see Section 11.3) and the splay tree (see Section 11.4). There also exist other types of search trees that are guaranteed to remain balanced, such as the red-black tree or the 2-3 Tree.

11.1.4 Guided information flow

When writing a recursive method to solve a problem that requires traversing a binary tree, we want to make sure that we are visiting the required nodes (no more and no less).

So far, we have seen several tree traversals that visited every node of the tree. We also saw the BST search, insert, and remove routines, that each go down a single path of the tree. Guided traversal refers to a problem that does not require visiting every node in the tree, though it typically requires looking at more than one path through the tree. This means that the recursive function is making some decision at each node that sometimes lets it avoid visiting one or both of its children. The decision is typically based on the value of the current node. Many problems that require information flow on binary search trees are “guided” in this way.

Example: Minimum value in a tree

An extreme example is finding the minimum value in a BST. A bad solution to this problem would visit every node of the tree. However, we can take advantage of the BST property to avoid visiting most nods in the tree. You know that the values greater than the root are always in the right subtree, and those values less than the root are in the left subtree. Thus, at each node we need only visit the left subtree until we reach a leaf node.

Here is a problem that typically needs to visit more than just a single path, but not all of the nodes.