12.5 Open addressing
We now turn to the other commonly used form of hashing: open addressing (also called closed hashing).
Compared to separate chaining (Section 12.3), we now store all elements directly in the hash table. Each element has a home position: the slot computed by the hash function. If an element is to be inserted and another record already occupies its home position, then it will be stored at some other slot in the table. It is the business of the collision resolution policy to determine which slot that will be. Naturally, the same policy must be followed during search as during insertion, so that any element not found in its home position can be recovered by repeating the collision resolution process.
We will still denote the size of the internal array as M and the total number of elements as N, and the load factor is still calculated in the same way, as N/M. Note that since all elements are stored in the array, N can never be larger than M, so the load factor can never exceed 1.
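As a concrete illustration (the numbers here are made up), the load factor is simply the number of stored elements divided by the size of the internal array:

```python
M = 8   # size of the internal array (hypothetical)
N = 5   # number of stored elements (hypothetical)

load_factor = N / M
print(load_factor)  # 0.625; with open addressing it can never exceed 1
```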
Implementation overview
The internal table in separate chaining consists of pointers to a linked list (or some other kind of set data structure). Open addressing, in contrast, stores the elements themselves directly in the table:
datatype OpenAddressingHashTable of Elem:
    table: Array of Elem
To implement a method on an open addressing hash table we first have to find the correct table index for the element in question, and then we can apply the method on the slot.
datatype OpenAddressingHashTable:
    ...
    method(elem, ...extra arguments...):
        i = hashAndProbe(elem)
        apply method to table[i]
Hashing and probing
The goal of collision resolution is to find a free slot in the hash table when the “home position” for the record is already occupied. We can view any collision resolution method as generating a sequence of hash table slots that can potentially hold the record. The first slot in the sequence will be the home position for the key. If the home position is occupied, then the collision resolution policy goes to the next slot in the sequence. If this is occupied as well, then another slot must be found, and so on. This sequence of slots is known as the probe sequence, and it is generated by some probe function that we will call p.
The simplest probe function is linear probing, which tries all consecutive positions starting from the home position. Once the bottom of the table is reached, the probe sequence wraps around to the beginning of the table (since the last step is to mod the result to the table size). Linear probing has the virtue that all slots in the table will be candidates for inserting a new record before the probe sequence returns to the home position.
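To make the wrap-around concrete, here is a small Python sketch (the table size and home position are made-up numbers) that generates the linear probe sequence for one element:

```python
M = 10      # hypothetical table size
home = 7    # hypothetical home position of some element

# Linear probing tries home, home+1, home+2, ... and wraps around using mod.
probe_sequence = [(home + offset) % M for offset in range(M)]
print(probe_sequence)  # [7, 8, 9, 0, 1, 2, 3, 4, 5, 6]
```

Note that every slot appears exactly once before the sequence would return to the home position, which is exactly the virtue of linear probing mentioned above.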
In the following code we iterate through all possible offsets 0, 1, 2, …, up to the table size.
datatype OpenAddressingHashTable:
    ...
    hashAndProbe(elem):
        home = hashIndex(elem)
        for offset in 0 .. table.size-1:
            i = (home + offset) % table.size
            if table[i] == elem or table[i] is null:
                return i
        throw error "Hash table full"
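The pseudocode above can be translated into Python roughly as follows. This is only a sketch: the class and method names are my own, and Python's built-in hash serves as a stand-in for the book's hashIndex function.

```python
class OpenAddressingHashTable:
    """Sketch of linear probing; hash() stands in for hashIndex."""

    def __init__(self, size):
        self.table = [None] * size

    def hash_and_probe(self, elem):
        home = hash(elem) % len(self.table)
        for offset in range(len(self.table)):
            i = (home + offset) % len(self.table)
            # Stop at the element itself, or at the first free slot.
            if self.table[i] == elem or self.table[i] is None:
                return i
        raise RuntimeError("Hash table full")
```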
Linear probing has some drawbacks, and there are several alternative strategies which we will discuss later in Section 12.8.
Finding and inserting
Searching for an element in an open addressing hash table is straightforward: hashAndProbe returns a slot, and all we have to do is to check whether that slot contains the element we are looking for.
datatype OpenAddressingHashTable:
    ...
    contains(elem):
        i = hashAndProbe(elem)
        return table[i] == elem
To add an element we first find the correct slot, and then we set the value in that slot:
datatype OpenAddressingHashTable:
    ...
    add(elem):
        i = hashAndProbe(elem)
        table[i] = elem
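Under the same assumptions as before (a Python sketch with built-in hash instead of hashIndex), contains and add become one-liners on top of the probing method:

```python
class OpenAddressingHashTable:
    """Sketch: contains/add via linear probing (no deletion yet)."""

    def __init__(self, size):
        self.table = [None] * size

    def hash_and_probe(self, elem):
        home = hash(elem) % len(self.table)
        for offset in range(len(self.table)):
            i = (home + offset) % len(self.table)
            if self.table[i] == elem or self.table[i] is None:
                return i
        raise RuntimeError("Hash table full")

    def contains(self, elem):
        return self.table[self.hash_and_probe(elem)] == elem

    def add(self, elem):
        self.table[self.hash_and_probe(elem)] = elem

t = OpenAddressingHashTable(8)
t.add("cat")
t.add("dog")
print(t.contains("cat"), t.contains("fish"))  # True False
```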
Collisions and clusters
When the table gets more and more populated, it starts building clusters: sequences of neighbouring non-empty slots in the table. The efficiency of the basic operations depends crucially on the size of the largest cluster. Assume we want to search for an element that is not in the table, but happens to have its home position at the very first slot of a large cluster. Then we have to inspect every slot in the cluster before we can be certain that the element is not there.
Note that a cluster can consist of elements that all have different hash codes and different home positions – as long as the neighbouring slots are occupied they all belong to the same cluster. Therefore it’s not only important that the hash function gives different hash codes for different elements – ideally the hash codes should be spread apart as much as possible, even if the elements themselves are very close to each other.
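We can see this effect in a small simulation. Here I use integer keys and the (deliberately clustering-prone) hash function key mod M, so that home positions are easy to predict:

```python
M = 8
table = [None] * M

def insert(key):
    home = key % M                  # hash function: key mod M
    for offset in range(M):
        i = (home + offset) % M
        if table[i] is None:
            table[i] = key
            return

# 3, 11 and 19 all have home position 3; 4 has home position 4.
for key in [3, 11, 19, 4]:
    insert(key)

# All four keys end up in one cluster, even though 4 has a
# different hash code and home position than the others.
print(table)  # [None, None, None, 3, 11, 19, 4, None]
```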
It is possible to reduce clustering by using different probing strategies than simple linear probing; see Section 12.8 for more details.
Deleting
The difficult part when implementing deletion is how to handle clusters.
Assume for example that we have a cluster of five elements [A, B, C, D, E], and we want to remove C. If we simply clear the slot we will get the two clusters [A, B] and [D, E]. Let us also assume that the home position of D is at the beginning of the original large cluster, i.e., A and D have the same home position.
Now, what happens if we try to search for D in the table after we have removed C? The search starts at the beginning of the first cluster and inspects the slots of A and B, but then it encounters an empty slot and stops. So it will report that D is not in the table, even though it was never deleted.
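The failing search can be reproduced in a few lines of Python (a sketch with integer keys and the hash function key mod M; the keys 2, 10, 18, 26 and 34 all share home position 2 and play the roles of A to E):

```python
M = 8
table = [None] * M

def probe(key):
    home = key % M
    for offset in range(M):
        i = (home + offset) % M
        if table[i] == key or table[i] is None:
            return i

def contains(key):
    return table[probe(key)] == key

for key in [2, 10, 18, 26, 34]:   # the cluster [A, B, C, D, E]
    table[probe(key)] = key

table[probe(18)] = None           # naively clear C's slot
print(contains(26))               # False, although D is still stored!
```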
Therefore we cannot just clear slots that we want to delete. The simplest solution is instead to mark the slot in some way. The most common way is to use a special value DELETED which is not used anywhere else, and which becomes the new value of the slot. This special value is sometimes called a tombstone.
datatype OpenAddressingHashTable:
    ...
    delete(elem):
        i = hashAndProbe(elem)
        table[i] = DELETED
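Here is a Python version of the tombstone approach (the same sketch as before, with a unique sentinel object as DELETED). The key point is that the probe loop treats a tombstone as occupied, so searching probes past it instead of stopping:

```python
DELETED = object()   # tombstone: a value not used anywhere else
M = 8
table = [None] * M

def probe(key):
    home = key % M
    for offset in range(M):
        i = (home + offset) % M
        # A tombstone is neither the key nor a free slot, so we probe past it.
        if table[i] == key or table[i] is None:
            return i

def contains(key):
    return table[probe(key)] == key

def delete(key):
    i = probe(key)
    if table[i] == key:
        table[i] = DELETED

for key in [2, 10, 18, 26, 34]:   # the cluster [A, B, C, D, E]
    table[probe(key)] = key

delete(18)           # remove C, leaving a tombstone behind
print(contains(26))  # True: the tombstone keeps the cluster connected
```

This sketch still ignores some subtleties, for instance letting add reuse tombstone slots; those are among the details deferred to Section 12.7.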
Things are not quite as simple as they look here: there are some details that are very important to get right. We will discuss these details later in Section 12.7.