DSABook – Implementing binary trees

8.3 Implementing binary trees

In this section we examine one way to implement binary tree nodes. By definition, all binary tree nodes have two children, though one or both children can be empty. Binary tree nodes typically contain a value field, with the type of the field depending on the application. The most common node implementation includes a value field and pointers to the two children. Figure 8.4 is an illustration of how the tree from Figure 8.1 looks like, where the child pointers are shown explicitly.

Figure 8.4: Illustration of a typical pointer-based binary tree implementation, where each node stores two child pointers and a value.

Here is a simple implementation for binary tree nodes, which can store one single element in each node. Every BinaryNode object also has two pointers, one to its left child and another to its right child.

datatype BinaryNode of T:
    elem: T            // Element for this node.
    left: BinaryNode   // Pointer to left child.
    right: BinaryNode  // Pointer to right child.

    isLeaf() -> bool:
        // Return true if a leaf node, false otherwise.
        return this.left is null and this.right is null

We define a leaf to be a node with no children – i.e., where both childen pointers point to nothing.

Some programmers find it convenient to add a pointer to the node’s parent, allowing easy upward movement in the tree. Using a parent pointer is somewhat analogous to adding a link to the previous node in a doubly linked list. In practice, the parent pointer is almost always unnecessary and adds to the space overhead for the tree implementation. It is not just a problem that parent pointers take space. More importantly, many uses of the parent pointer are driven by improper understanding of recursion and so indicate poor programming. If you are inclined toward using a parent pointer, consider if there is a more efficient implementation possible.

Binary trees

Our final datatype for binary trees is in fact very similar to the linked lists that we introduced in Section 6.3 – we need a reference to the root node and the total size of the tree:

datatype BinaryTree:
    root: BinaryNode = null
    size: Int = 0

8.3.1 Differentiating between internal nodes and leaves

An important decision in the design of a pointer-based node implementation is whether the same class definition will be used for leaves and internal nodes. Using the same class for both will simplify the implementation, but might be an inefficient use of space. Some applications require data values only for the leaves. Other applications require one type of value for the leaves and another for the internal nodes. Examples include the Huffman coding tree (see Section 9.4), the binary trie, the PR Quadtree, and the expression tree illustrated by Figure 8.5 below. By definition, only internal nodes have non-empty children. If we use the same node implementation for both internal and leaf nodes, then both must store the child pointers. But it seems wasteful to store child pointers in the leaf nodes. Thus, there are many reasons why it can save space to have separate implementations for internal and leaf nodes.

Figure 8.5: An example of an expression tree for $4x(2x + a) - c$ .

As an example of a tree that stores different information at the leaf and internal nodes, consider the expression tree illustrated by Figure 8.5. The expression tree represents an algebraic expression composed of binary operators such as addition, subtraction, multiplication, and division. Internal nodes store operators, while the leaves store operands. The tree of the figure represents the expression $4x(2x + a) - c$ . The storage requirements for a leaf in an expression tree are quite different from those of an internal node. Internal nodes store one of a small set of operators, so internal nodes could store a small code identifying the operator such as a single byte for the operator’s character symbol. In contrast, leaves store variable names or numbers, which is considerably larger in order to handle the wider range of possible values. At the same time, leaf nodes need not store child pointers.

Object-oriented languages allow us to differentiate leaf from internal nodes through the use of a class hierarchy. A base class provides a general definition for an object, and a subclass modifies the base class to add more detail. We will not discuss further how to implement different kind of tree nodes more in this book, but will just assume that all nodes are of the same class.

8.3.2 Space requirements

In this subsection we present techniques for calculating the amount of overhead required by a binary tree, based on its node implementation.

Recall that overhead is the amount of space necessary to maintain the data structure. In other words, it is any space not used to store data records. The amount of overhead depends on several factors including which nodes store data values (all nodes, or just the leaves), whether the leaves store child pointers, and whether the tree is a full binary tree.

In a simple pointer-based implementation for binary tree nodes, every node has two pointers to its children (even when the children are NULL). This implementation requires total space amounting to $n(2P + D)$ for a tree of $n$ nodes. Here, $P$ stands for the amount of space required by a pointer, and $D$ stands for the amount of space required by a data value. The total overhead space will be $2Pn$ for the entire tree. Thus, the overhead fraction will be $2P/(2P + D)$ . The actual value for this expression depends on the relative size of pointers versus data fields. If we arbitrarily assume that $P = D$ , then a binary tree has about two thirds of its total space taken up in overhead. Worse yet, the Full Binary Tree Theorem tells us that about half of the pointers are “wasted” NULL values that serve only to indicate tree structure, but which do not provide access to new data.

In many languages (such as Java or Javascript), the most typical implementation is not to store any actual data in a node, but rather a pointer to the data record. In this case, each node will typically store three pointers, all of which are overhead, resulting in an overhead fraction of $3P/(3P + D)$ .

If only leaves store data values, then the fraction of total space devoted to overhead depends on whether the tree is full. If the tree is not full, then conceivably there might only be one leaf node at the end of a series of internal nodes. Thus, the overhead can be an arbitrarily high percentage for non-full binary trees. The overhead fraction drops as the tree becomes closer to full, being lowest when the tree is truly full. In this case, about one half of the nodes are internal.

Great savings can be had by eliminating the pointers from leaf nodes in full binary trees. Again assume the tree stores a pointer to the data field. Because about half of the nodes are leaves and half internal nodes, and because only internal nodes now have child pointers, the overhead fraction in this case will be approximately

$\frac{\frac{n}{2} (2P)}{\frac{n}{2} (2P) + Dn} = \frac{P}{P + D}$

If $P = D$ , the overhead drops to about one half of the total space. However, if only leaf nodes store useful information, the overhead fraction for this implementation is actually three quarters of the total space, because half of the “data” space is unused.

If a full binary tree needs to store data only at the leaf nodes, a better implementation would have the internal nodes store two pointers and no data field while the leaf nodes store only a pointer to the data field. This implementation requires

$\frac{n}{2}2P + \frac{n}{2}(P+D)$

units of space. If $P = D$ , then the overhead is $3P/(3P + D) = 3/4$ . It might seem counter-intuitive that the overhead ratio has gone up while the total amount of space has gone down. The reason is because we have changed our definition of “data” to refer only to what is stored in the leaf nodes, so while the overhead fraction is higher, it is from a total storage requirement that is lower.

There is one serious flaw with this analysis. When using separate implementations for internal and leaf nodes, there must be a way to distinguish between the node types. When separate node types are implemented via Java subclasses, the runtime environment stores information with each object allowing it to determine, for example, the correct subclass to use when the isLeaf virtual function is called. Thus, each node requires additional space. Only one bit is truly necessary to distinguish the two possibilities. In rare applications where space is a critical resource, implementors can often find a spare bit within the node’s value field in which to store the node type indicator. An alternative is to use a spare bit within a node pointer to indicate node type. For example, this is often possible when the compiler requires that structures and objects start on word boundaries, leaving the last bit of a pointer value always zero. Thus, this bit can be used to store the node-type flag and is reset to zero before the pointer is dereferenced. Another alternative when the leaf value field is smaller than a pointer is to replace the pointer to a leaf with that leaf’s value. When space is limited, such techniques can make the difference between success and failure. In any other situation, such “bit packing” tricks should be avoided because they are difficult to debug and understand at best, and are often machine dependent at worst.

Here is an exercise for the space requirements of binary trees.