1.1 Selecting a Data Structure
It should go without saying that people write programs to solve problems. However, sometimes programmers forget this. So it is crucial to keep this truism in mind when selecting a data structure to solve a particular problem. Only by first analyzing the problem to determine the performance goals that must be achieved can there be any hope of selecting the right data structure for the job. Poor program designers ignore this analysis step and apply a data structure that they are familiar with but which is inappropriate to the problem. The result is typically a slow program. Conversely, there is no sense in adopting a complex representation to “improve” a program that can meet its performance goals when implemented using a simpler design.
When selecting a data structure to solve a problem, you should follow these steps.
- Analyze your problem to determine the basic operations that must be supported. Examples of basic operations include inserting a data item into the data structure, deleting a data item from the data structure, and finding a specified data item.
- Quantify the resource constraints for each operation.
- Select the data structure that best meets these requirements.
This three-step approach to selecting a data structure operationalizes a data-centered view of the design process. The first concern is for the data and the operations to be performed on them, the next concern is the representation of those data, and the final concern is the implementation of that representation.
Resource constraints on certain key operations, such as search, inserting data records, and deleting data records, normally drive the data structure selection process. Many issues relating to the relative importance of these operations are addressed by the following three questions, which you should ask yourself whenever you must choose a data structure.
- Are all data items inserted into the data structure at the beginning, or are insertions interspersed with other operations? Static applications (where the data are loaded at the beginning and never change) typically get by with simpler data structures to get an efficient implementation, while dynamic applications often require something more complicated.
- Can data items be deleted? If so, this will probably make the implementation more complicated.
- Are all data items processed in some well-defined order, or is searching for specific data items allowed? “Random access” search generally requires more complex data structures.
Each data structure has associated costs and benefits. In practice, it is hardly ever true that one data structure is better than another for use in all situations. If one data structure or algorithm is superior to another in all respects, the inferior one will usually have long been forgotten. For nearly every data structure and algorithm presented in this book, you will see examples of where it is the best choice. Some of the examples might surprise you.
A data structure requires a certain amount of space for each data item it stores, a certain amount of time to perform a single basic operation, and a certain amount of programming effort. Each problem has constraints on available space and time. Each solution to a problem makes use of the basic operations in some relative proportion, and the data structure selection process must account for this. Only after a careful analysis of your problem’s characteristics can you determine the best data structure for the task.
Example: Bank
A bank must support many types of transactions with its customers, but we will examine a simple model where customers wish to open accounts, close accounts, and add money or withdraw money from accounts. We can consider this problem at two distinct levels: (1) the requirements for the physical infrastructure and workflow process that the bank uses in its interactions with its customers, and (2) the requirements for the database system that manages the accounts.
The typical customer opens and closes accounts far less often than accessing the account. Customers are willing to spend many minutes during the process of opening or closing the account, but are typically not willing to wait more than a brief time for individual account transactions such as a deposit or withdrawal. These observations can be considered as informal specifications for the time constraints on the problem.
It is common practice for banks to provide two tiers of service. Human tellers or automated teller machines (ATMs) support customer access to account balances and updates such as deposits and withdrawals. Special service representatives are typically provided (during restricted hours) to handle opening and closing accounts. Teller and ATM transactions are expected to take little time. Opening or closing an account can take much longer (perhaps up to an hour from the customer’s perspective).
From a database perspective, we see that ATM transactions do not modify the database significantly. For simplicity, assume that if money is added or removed, this transaction simply changes the value stored in an account record. Adding a new account to the database is allowed to take several minutes. Deleting an account need have no time constraint, because from the customer’s point of view all that matters is that all the money be returned (equivalent to a withdrawal). From the bank’s point of view, the account record might be removed from the database system after business hours, or at the end of the monthly account cycle.
When considering the choice of data structure to use in the database system that manages customer accounts, we see that a data structure that has little concern for the cost of deletion, but is highly efficient for search and moderately efficient for insertion, should meet the resource constraints imposed by this problem. Records are accessible by unique account number (sometimes called an exact-match query). One data structure that meets these requirements is the hash table. Hash tables allow for extremely fast exact-match search. A record can be modified quickly when the modification does not affect its space requirements. Hash tables also support efficient insertion of new records. While deletions can also be supported efficiently, too many deletions lead to some degradation in performance for the remaining operations. However, the hash table can be reorganized periodically to restore the system to peak efficiency. Such reorganization can occur offline so as not to affect ATM transactions.
Example: Databases
A company is developing a database system containing information about cities and towns in Europe. There are many thousands of cities and towns, and the database program should allow users to find information about a particular place by name (another example of an exact-match query). Users should also be able to find all places that match a particular value or range of values for attributes such as location or population size. This is known as a range query.
A reasonable database system must answer queries quickly enough to satisfy the patience of a typical user. For an exact-match query, a fraction of a second is satisfactory. If the database is meant to support range queries that can return many cities that match the query specification, the user might tolerate the entire operation to take a little longer, perhaps a couple of seconds. To meet this requirement, it will be necessary to support operations that process range queries efficiently by processing all cities in the range as a batch, rather than as a series of operations on individual cities.
The hash table suggested in the previous example is inappropriate for implementing our city database, because it cannot perform efficient range queries. The B-tree supports large databases, insertion and deletion of data records, and range queries. However, a simple linear index would be more appropriate if the database is created once, and then never changed, such as an atlas accessed from a website.
1.1.1 Introduction Summary Questions
A tool for measuring the efficiency of an algorithm or problem is:
- A profiler works on a program, not an algorithm.
- The system clock works on a program, not an algorithm.
- Empirical analysis works on a program, not an algorithm.
- Algorithm analysis estimates the cost of an algorithm or problem.
Which of these is NOT a definition for efficiency in a computer program?
- One definition for an efficient program is that it runs within the required resource constraints.
- One definition for an efficient program is that it requires fewer resources than known alternatives.
- Many efficient programs require more than linear time. Sometimes linear time is not efficient for a given problem.
An exact-match query is:
- Returning a record that falls within a range is called a range query.
- There are many database queries that are not exact-match queries.
- Whether a search is efficient or not has nothing to do with whether it is an exact-match query or not.
- A query to find the record that matches a unique identifier is called an exact-match query.
Which is NOT a topic that a this course book focuses on?
- A key topic for any DSA course is the basic toolkit of data structures and algorithms.
- A key concept for any DSA course is the concept of tradeoffs for costs versus benefits.
- A key skill for any DSA course is ability to measure the effectiveness of a data structure or algorithm.
- Designing and maintaining large programs is a focus for Software Engineering.
As computers have become more powerful:
- Good solutions for large, complex computational problems usually require different approaches than what we do in everyday life.
- As problems that we try to solve grow more complicated, their solutions tend to grow more complicated.
- Processor speed cannot grow as fast as the cost of a slower algorithm applied to a bigger problem.
- As our computers have become more powerful, our history has been to apply them to more complex problems.
A range query is:
- A query to find the record that matches a unique identifier is called an exact-match query.
- There are many database queries that are not range queries.
- Whether a search is efficient or not has nothing to do with whether it is a range query or not.
- Returning a record that falls within a range is called a range query.
Which of these is more a concern for Software Engineering than for a data structures course?
- Designing efficient programs is a focus for data structures and algorithms courses.
- Designing and maintaining large programs is a focus for Software Engineering.
Which of these is NOT one of the three standard steps to follow when selecting a data structure to solve a problem?
- Knowing the basic operations required to solve your problem is the first step to selecting a suitable data structure.
- Knowing the resource constraints for your problem’s basic operations is the second step to selecting a suitable data structure.
- Once you know the basic operations and their resource constraints, then you can select a data structure that matches.
- Many problems do not require that you run simulations to determine the expected times for alternative solutions.