8.3 Implementing binary trees
In this section we examine one way to implement binary tree nodes. By definition, all binary tree nodes have two children, though one or both children can be empty. Binary tree nodes typically contain a value field, with the type of the field depending on the application. The most common node implementation includes a value field and pointers to the two children. Figure 8.4 is an illustration of how the tree from Figure 8.1 looks like, where the child pointers are shown explicitly.
Figure 8.4: Illustration of a typical pointer-based binary tree implementation, where each node stores two child pointers and a value.
Here is a simple implementation for binary tree nodes, which can
store one single element in each node. Every BinaryNode
object also has two pointers, one to its left child and another to its
right child.
datatype BinaryNode of T:
// Element for this node.
elem: T // Pointer to left child.
left: BinaryNode // Pointer to right child.
right: BinaryNode
-> bool:
isLeaf() // Return true if a leaf node, false otherwise.
return this.left is null and this.right is null
We define a leaf to be a node with no children – i.e., where both childen pointers point to nothing.
Some programmers find it convenient to add a pointer to the node’s parent, allowing easy upward movement in the tree. Using a parent pointer is somewhat analogous to adding a link to the previous node in a doubly linked list. In practice, the parent pointer is almost always unnecessary and adds to the space overhead for the tree implementation. It is not just a problem that parent pointers take space. More importantly, many uses of the parent pointer are driven by improper understanding of recursion and so indicate poor programming. If you are inclined toward using a parent pointer, consider if there is a more efficient implementation possible.
Binary trees
Our final datatype for binary trees is in fact very similar to the linked lists that we introduced in Section 6.3 – we need a reference to the root node and the total size of the tree:
datatype BinaryTree:
= null
root: BinaryNode size: Int = 0
8.3.1 Differentiating between internal nodes and leaves
An important decision in the design of a pointer-based node implementation is whether the same class definition will be used for leaves and internal nodes. Using the same class for both will simplify the implementation, but might be an inefficient use of space. Some applications require data values only for the leaves. Other applications require one type of value for the leaves and another for the internal nodes. Examples include the Huffman coding tree (see Section 9.4), the binary trie, the PR Quadtree, and the expression tree illustrated by Figure 8.5 below. By definition, only internal nodes have non-empty children. If we use the same node implementation for both internal and leaf nodes, then both must store the child pointers. But it seems wasteful to store child pointers in the leaf nodes. Thus, there are many reasons why it can save space to have separate implementations for internal and leaf nodes.
Figure 8.5: An example of an expression tree for .
As an example of a tree that stores different information at the leaf and internal nodes, consider the expression tree illustrated by Figure 8.5. The expression tree represents an algebraic expression composed of binary operators such as addition, subtraction, multiplication, and division. Internal nodes store operators, while the leaves store operands. The tree of the figure represents the expression . The storage requirements for a leaf in an expression tree are quite different from those of an internal node. Internal nodes store one of a small set of operators, so internal nodes could store a small code identifying the operator such as a single byte for the operator’s character symbol. In contrast, leaves store variable names or numbers, which is considerably larger in order to handle the wider range of possible values. At the same time, leaf nodes need not store child pointers.
Object-oriented languages allow us to differentiate leaf from internal nodes through the use of a class hierarchy. A base class provides a general definition for an object, and a subclass modifies the base class to add more detail. We will not discuss further how to implement different kind of tree nodes more in this book, but will just assume that all nodes are of the same class.
8.3.2 Space requirements
In this subsection we present techniques for calculating the amount of overhead required by a binary tree, based on its node implementation.
Recall that overhead is the amount of space necessary to maintain the data structure. In other words, it is any space not used to store data records. The amount of overhead depends on several factors including which nodes store data values (all nodes, or just the leaves), whether the leaves store child pointers, and whether the tree is a full binary tree.
In a simple pointer-based implementation for binary tree nodes, every node has two pointers to its children (even when the children are NULL). This implementation requires total space amounting to for a tree of nodes. Here, stands for the amount of space required by a pointer, and stands for the amount of space required by a data value. The total overhead space will be for the entire tree. Thus, the overhead fraction will be . The actual value for this expression depends on the relative size of pointers versus data fields. If we arbitrarily assume that , then a binary tree has about two thirds of its total space taken up in overhead. Worse yet, the Full Binary Tree Theorem tells us that about half of the pointers are “wasted” NULL values that serve only to indicate tree structure, but which do not provide access to new data.
In many languages (such as Java or Javascript), the most typical implementation is not to store any actual data in a node, but rather a pointer to the data record. In this case, each node will typically store three pointers, all of which are overhead, resulting in an overhead fraction of .
If only leaves store data values, then the fraction of total space devoted to overhead depends on whether the tree is full. If the tree is not full, then conceivably there might only be one leaf node at the end of a series of internal nodes. Thus, the overhead can be an arbitrarily high percentage for non-full binary trees. The overhead fraction drops as the tree becomes closer to full, being lowest when the tree is truly full. In this case, about one half of the nodes are internal.
Great savings can be had by eliminating the pointers from leaf nodes in full binary trees. Again assume the tree stores a pointer to the data field. Because about half of the nodes are leaves and half internal nodes, and because only internal nodes now have child pointers, the overhead fraction in this case will be approximately
If , the overhead drops to about one half of the total space. However, if only leaf nodes store useful information, the overhead fraction for this implementation is actually three quarters of the total space, because half of the “data” space is unused.
If a full binary tree needs to store data only at the leaf nodes, a better implementation would have the internal nodes store two pointers and no data field while the leaf nodes store only a pointer to the data field. This implementation requires
units of space. If , then the overhead is . It might seem counter-intuitive that the overhead ratio has gone up while the total amount of space has gone down. The reason is because we have changed our definition of “data” to refer only to what is stored in the leaf nodes, so while the overhead fraction is higher, it is from a total storage requirement that is lower.
There is one serious flaw with this analysis. When using separate
implementations for internal and leaf nodes, there must be a way to
distinguish between the node types. When separate node types are
implemented via Java subclasses, the runtime environment stores
information with each object allowing it to determine, for example, the
correct subclass to use when the isLeaf
virtual function is
called. Thus, each node requires additional space. Only one bit is truly
necessary to distinguish the two possibilities. In rare applications
where space is a critical resource, implementors can often find a spare
bit within the node’s value field in which to store the node type
indicator. An alternative is to use a spare bit within a node pointer to
indicate node type. For example, this is often possible when the
compiler requires that structures and objects start on word boundaries,
leaving the last bit of a pointer value always zero. Thus, this bit can
be used to store the node-type flag and is reset to zero before the
pointer is dereferenced. Another alternative when the leaf value field
is smaller than a pointer is to replace the pointer to a leaf with that
leaf’s value. When space is limited, such techniques can make the
difference between success and failure. In any other situation, such
“bit packing” tricks should be avoided because they are difficult to
debug and understand at best, and are often machine dependent at
worst.
Here is an exercise for the space requirements of binary trees.