DSABook – General trees

8.6 General trees

Many organisations are hierarchical in nature, such as the military and most businesses. Consider a company with a president and some number of vice presidents who report to the president. Each vice president has some number of direct subordinates, and so on. If we wanted to model this company with a data structure, it would be natural to think of the president in the root node of a tree, the vice presidents at level 1, and their subordinates at lower levels in the tree as we go down the organisational hierarchy.

Because the number of vice presidents is likely to be more than two, this company’s organisation cannot easily be represented by a binary tree. We need instead to use a tree whose nodes have an arbitrary number of children. Unfortunately, when we permit trees to have nodes with an arbitrary number of children, they become much harder to implement than binary trees. We consider such trees in this chapter. To distinguish them from binary trees, we use the term general tree.

In this section we will examine general tree terminology and define a basic class for general trees.

8.6.1 Definitions and terminology

A tree $\mathbf{T}$ is a finite set of one or more nodes such that there is one designated node $R$ , called the root of $\mathbf{T}$ . If the set $(\mathbf{T} -\{R\})$ is not empty, these nodes are partitioned into $n > 0$ disjoint sets $\mathbf{T}_0$ , $\mathbf{T}_1$ , ..., $\mathbf{T}_{n-1}$ , each of which is a tree, and whose roots $R_1, R_2, ..., R_n$ , respectively, are children of $R$ . The subsets $\mathbf{T}_i (0 \leq i < n)$ are said to be subtrees of $\mathbf{T}$ . These subtrees are ordered in that $\mathbf{T}_i$ is said to come before $\mathbf{T}_j$ if $i < j$ . By convention, the subtrees are arranged from left to right with subtree $\mathbf{T}_0$ called the leftmost child of $R$ . A node’s out degree is the number of children for that node. A forest is a collection of one or more trees. Figure 8.6 presents further tree notation generalised from the notation for binary trees.

Figure 8.6: Notation for general trees. Node P is the parent of nodes V, S1, and S2. Thus, V, S1, and S2 are children of P. Nodes R and P are ancestors of V. Nodes V, S1, and S2 are called siblings. The oval surrounds the subtree having V as its root. — Figure 8.6: Notation for general trees. Node $P$ is the parent of nodes $V$ , $S1$ , and $S2$ . Thus, $V$ , $S1$ , and $S2$ are children of $P$ . Nodes $R$ and $P$ are ancestors of $V$ . Nodes $V$ , $S1$ , and $S2$ are called siblings. The oval surrounds the subtree having $V$ as its root.

Each node in a tree has precisely one parent, except for the root, which has no parent. From this observation, it immediately follows that a tree with $n$ nodes must have $n-1$ edges because each node, aside from the root, has one edge connecting that node to its parent.

8.6.2 ADT for general trees

Before discussing general tree implementations, we should first make precise what operations such implementations must support. Any implementation must be able to initialise a tree. Given a tree, we need access to the root of that tree. There must be some way to access the children of a node. In the case of binary tree nodes, this was done by providing instance variables for the left and right child pointers. Unfortunately, because we do not know in advance how many children a given node will have in the general tree, we cannot give explicit functions to access each child. An alternative must be found that works for an unknown number of children.

One choice would be to provide a function that takes as its parameter the index for the desired child. That combined with a function that returns the number of children for a given node would support the ability to access any node or process all children of a node. Unfortunately, this view of access tends to bias the choice for node implementations in favour of an array-based approach, because these functions favour random access to a list of children.

An alternative is to provide access to a List of the children pointers. This list can be an array-based list or a linked list or even a dynamic function generating the children on demand. The only thing we will assume is that it follows the List ADT, as described in Section 6.10.

// General tree nodes
datatype GTNode of T:
    elem: T                   // This is the value, just as for binary nodes
    children: List of GTNode  // All children of the node

8.6.3 Traversing a general tree

There are three traditional tree traversals for binary trees: preorder, postorder, and inorder (see Section 8.4). For general trees, preorder and postorder traversals are defined with meanings similar to their binary tree counterparts. Preorder traversal of a general tree first visits the root of the tree, then performs a preorder traversal of each subtree from left to right. A postorder traversal of a general tree performs a postorder traversal of the root’s subtrees from left to right, then visits the root. Inorder traversal does not have a natural definition for the general tree, because there is no particular number of children for an internal node. An arbitrary definition – such as visit the leftmost subtree in inorder, then the root, then visit the remaining subtrees in inorder – can be invented. However, inorder traversals are generally not useful with general trees.

Visualisation of preorder traversal.

To perform a preorder traversal, it is necessary to visit each of the children for a given node (say $R$ ) from left to right. This is accomplished by starting at R’s leftmost child (call it $T$ ). From $T$ , we can move to $T$ ’s right sibling, and then to that node’s right sibling, and so on.

Visualisation of postorder traversal.

Using the General Tree class shown above, here are implementations to process the nodes of a general tree in preorder and in postorder. The code is very simple, but this is because we defer all the complexity to the underlying List implementation of the children.

function preorder(node):
    process(node)
    for each child in node.children:
        preorder(child)

function postorder(node):
    for each child in node.children:
        postorder(child)
    process(node)

8.6.4 Sequential tree representations

Next we consider a fundamentally different approach to implementing trees. The goal is to store a series of node values with the minimum information needed to reconstruct the tree structure.

This approach, known as a sequential tree representation, has the advantage of saving space because no pointers are stored. It has the disadvantage that accessing any node in the tree requires sequentially processing all nodes that appear before it in the node list. In other words, node access must start at the beginning of the node list, processing nodes sequentially in whatever order they are stored until the desired node is reached. Thus, one primary virtue of the other implementations discussed in this section is lost: efficient access (typically $O(\log n)$ time) to arbitrary nodes in the tree. Sequential tree implementations are ideal for archiving trees on disk for later use because they save space, and the tree structure can be reconstructed as needed for later processing.

Sequential tree implementations can be used to serialise a tree structure. Serialisation is the process of storing an object as a series of bytes, typically so that the data structure can be transmitted between computers. This capability is important when using data structures in a distributed processing environment.

A sequential tree implementation typically stores the node values as they would be enumerated by a preorder traversal, along with sufficient information to describe the tree’s shape. If the tree has restricted form, for example if it is a full binary tree, then less information about structure typically needs to be stored. A general tree, because it has the most flexible shape, tends to require the most additional shape information. There are many possible sequential tree implementation schemes. We will begin by describing methods appropriate to binary trees, then generalise to an implementation appropriate to a general tree structure.

Because every node of a binary tree is either a leaf or has two (possibly empty) children, we can take advantage of this fact to implicitly represent the tree’s structure. The most straightforward sequential tree implementation lists every node value as it would be enumerated by a preorder traversal. Unfortunately, the node values alone do not provide enough information to recover the shape of the tree. In particular, as we read the series of node values, we do not know when a leaf node has been reached. However, we can treat all non-empty nodes as internal nodes with two (possibly empty) children. Only NULL values will be interpreted as leaf nodes, and these can be listed explicitly. Such an augmented node list provides enough information to recover the tree structure.

Figure 8.7: Sample binary tree for sequential tree implementation examples..

Reconstructing the original tree in Figure 8.7 from its sequential representation.

Practice exercise for sequential tree representation.

Alternative sequential representation

To illustrate the difficulty involved in using the sequential tree representation for processing, consider searching for the right child of the root node. We must first move sequentially through the node list of the left subtree. Only at this point do we reach the value of the root’s right child. Clearly the sequential representation is space efficient, but not time efficient for descending through the tree along some arbitrary path.

Assume that each node value takes a constant amount of space. An example would be if the node value is a positive integer and null is indicated by the value zero. From the Full Binary Tree Theorem, we know that the size of the node list will be about twice the number of nodes (i.e., the overhead fraction is 1/2). The extra space is required by the null pointers. We should be able to store the node list more compactly. However, any sequential implementation must recognise when a leaf node has been reached, that is, a leaf node indicates the end of a subtree. One way to do this is to explicitly list with each node whether it is an internal node or a leaf. If a node $X$ is an internal node, then we know that its two children (which may be subtrees) immediately follow $X$ in the node list. If $X$ is a leaf node, then the next node in the list is the right child of some ancestor of $X$ , not the right child of $X$ . In particular, the next node will be the child of $X$ ‘s most recent ancestor that has not yet seen its right child. However, this assumes that each internal node does in fact have two children, in other words, that the tree is full. Empty children must be indicated in the node list explicitly. Assume that internal nodes are marked with a prime (’) and that leaf nodes show no mark. Empty children of internal nodes are indicated by “/”, but the (empty) children of leaf nodes are not represented at all. Note that a full binary tree stores no null values with this implementation, and so requires less overhead.

Reconstructing the original tree in Figure 8.7 from the alternative sequential representation.

Storing $n$ extra bits can be a considerable savings over storing $n$ null values. In the example above, each node was shown with a mark if it is internal, or no mark if it is a leaf. This requires that each node value has space to store the mark bit. This might be true if, for example, the node value were stored as a 4-byte integer but the range of the values sored was small enough so that not all bits are used. An example would be if all node values must be positive. Then the high-order (sign) bit of the integer value could be used as the mark bit.

Practice exercise for the alternative sequential representation.

Bit vector representation

Another approach is to store a separate bit vector to represent the status of each node. In this case, each node of the tree corresponds to one bit in the bit vector. A value of “1” could indicate an internal node, and “0” could indicate a leaf node.

Reconstructing the shape of the tree in Figure 8.7 from its bit vector representation.

Practice exercise for the bit vector representation.

Serialising general trees

Storing general trees by means of a sequential implementation requires that more explicit structural information be included with the node list. Not only must the general tree implementation indicate whether a node is leaf or internal, it must also indicate how many children the node has. Alternatively, the implementation can indicate when a node’s child list has come to an end. The next example dispenses with marks for internal or leaf nodes. Instead it includes a special mark (we will use the “)” symbol) to indicate the end of a child list. All leaf nodes are followed by a “)” symbol because they have no children. A leaf node that is also the last child for its parent would indicate this by two or more successive “)” symbols.

Reconstructing a general from its sequential representation.

Note that this representation for serialising general trees cannot be used for binary trees. This is because a binary tree is not merely a restricted form of general tree with at most two children. Every binary tree node has a left and a right child, though either or both might be empty. So this representation cannot let us distinguish whether node $D$ in Figure 8.7 is the left or right child of node $B$ .

Practice exercise for sequential representation of general trees.