12 Hash tables

Hashing is a method for storing and retrieving records from a database. It lets you insert, delete, and search for records based on a search key value. When properly implemented, these operations can be performed in constant time. In fact, a properly tuned hash system typically looks at only one or two records for each search, insert, or delete operation. This is better than the O(logn)O(\log n) cost required to do a binary search on a sorted array of nn records, or the O(logn)O(\log n) cost required to do an operation on a self-balancing binary search tree. However, even though hashing is based on a very simple idea, it is surprisingly difficult to implement properly. Designers need to pay careful attention to all of the details involved with implementing a hash system.

A hash system stores records in an array called a hash table, which we will call 𝐻𝑇\textit{HT} below. Hashing works by performing a computation on a search key kk in a way that is intended to identify the position in 𝐻𝑇\textit{HT} that contains the record with key kk. The function that does this calculation is called the hash function, and will be denoted by 𝐑\mathbf{h}. Since hashing schemes place records in the table in whatever order satisfies the needs of the address calculation, records are not ordered by value. A position in the hash table is also known as a slot. The number of slots in hash table 𝐻𝑇\textit{HT} will be denoted by the variable mm with slots numbered from 0 to mβˆ’1m-1.

The goal for a hashing system is to arrange things such that, for any key value kk and some hash function 𝐑\mathbf{h}, i=𝐑(k)i = \mathbf{h}(k) is a slot in the table such that 0≀i<m0 \leq i < m, and we have the key of the record stored at 𝐻𝑇[i]\textit{HT}[i] equal to kk.

Since the records are not ordered by value, hashing is not a good method for answering range searches. In other words, we cannot easily find all records (if any) whose key values fall within a certain range. Nor can we easily find the record with the minimum or maximum key value, or visit the records in key order. Hashing is most appropriate for answering the question, β€œWhat record, if any, has key value kk?” For applications where all search is done by exact-match queries, hashing is a very good search method because it is extremely efficient when implemented correctly. As we will see in this chapter, however, there are many approaches to hashing and it is easy to devise an inefficient implementation. Hashing is suitable for both in-memory and disk-based searching and is one of the two most widely used methods for organising large databases stored on disk (the other is the B-tree).

Simple hashing example

As a simple (though unrealistic) example of hashing, consider storing nn records, each with a unique key value in the range 0 to nβˆ’1n-1. A record with key kk can be stored in 𝐻𝑇[k]\textit{HT}[k], and so the hash function is 𝐑(k)=k\mathbf{h}(k) = k. To find the record with key value kk, look in 𝐻𝑇[k]\textit{HT}[k].

In most applications, there are many more values in the key range than there are slots in the hash table. For a more realistic example, suppose the key can take any value in the range 0 to 65,535 (i.e., the key is a two-byte unsigned integer), and that we expect to store approximately 1000 records at any given time. It is impractical in this situation to use a hash table with 65,536 slots, because then the vast majority of the slots would be left empty. Instead, we must devise a hash function that allows us to store the records in a much smaller table. Because the key range is larger than the size of the table, at least some of the slots must be mapped to from multiple key values. Given a hash function h and two keys k1k_1 and k2k_2, if 𝐑(k1)=β=𝐑(k2)\mathbf{h}(k_1) = \beta = \mathbf{h}(k_2) where β\beta is a slot in the table, then we say that k1k_1 and k2k_2 have a collision at slot β\beta under hash function h. Finding a record with key value kk in a database organised by hashing follows a two-step procedure:

  1. Compute the table location 𝐑(k)\mathbf{h}(k).
  2. Starting with slot 𝐑(k)\mathbf{h}(k), locate the record containing key kk using (if necessary) a collision resolution policy.