CS 334: Indexing

Submit electronically your answers to the following.

When discussing a sparse index, one needs to think about the level of sparsity. One extreme version of a sparse index is so sparse that contains merely a single entry (or perhaps none at all). Another extreme version is simply a dense index. What level of sparsity is optimal? Be precise.

Think back to the plain old simple binary search trees (BSTs) as you saw then in your Data Structures course: these were just binary search trees where each node contained a key value, and two pointers. Answer each of the following questions with a sentence or two of explanation.
1. Are BSTs an ordered index (as your DB textbook defines an ordered index)?
2. Are BSTs, as you saw them, sparse or dense?
3. In a DB context, one also makes the distinction between primary and secondary indices. Explain why this distinction doesn't fit well into BSTs as they are typically presented in a Data Structures class.

Index-sequential files, also known as ISAM, are wonderful for doing lookups on large databases that don't change much. Suppose that we build an ISAM on a sorted file where each page in the index has room for 20 search keys and pointers. Suppose that the sorted file itself is sequential on disk, that every page in that file is completely filled, and that records within this file have a fixed width.

If the sorted file itself has 30,000 pages in it, how many levels will the ISAM have (not including the sorted file itself)? Give me a number, but also give me the short mathematical formula that enables you to determine that answer.
Based on your answer above, how many page accesses are needed to obtain a record with a specific search key?

As a continuation of the previous question, suppose that I proposed ISAM as a new topic to be discussed in a Data Structures course. In Data Structures, it is always assumed that all data fits in memory. So instead of a sorted file of records, let's instead assume that I have a sorted array of records. I then create an ISAM on that arary where each node in the ISAM has 20 search keys and pointers.

Since everything is now in memory, there's no need to discuss pages or blocks. (I carefully chose the word "node" above instead of "page.") Suppose that the lowest level of my in-memory ISAM contains 30,000 nodes, to be consistent with the previous example. How many nodes in the ISAM must be accessed in order to reach an actual record in the array? Give me a number, but also give me the short mathematical formula that enables you to determine that answer.
In general, if I have an in-memory ISAM with n nodes at the bottom level and k records per page, what would be my big-O estimate of the amount of work to find a particular record in the array based on its search key?
The previous question hopefully pointed out a gaping hole in the ISAM approach we've discussed: it never came up at all what algorithm we use to find a particular search key within an ISAM node. Why? The technique you choose to use is critical to your answer to the previous question.
Here's a related followup question: in this course, we seem to be emphasizing trees with wide fan-out (many children). In Data Structures, one typically emphasizes trees with only 2 children (though there is a balanced tree algorithm with 3 children.) Why the dramatic distinction? If so many children work well in databases, why not do it for in-memory algorithms also?