Search Engine via Hashing

Overview

This is an individual assignment. You may chat with others in the class and trade ideas, but the work that you submit should be your own.

This assignment is a repeat of the previous assignment, but done by using a hash table instead of a binary search tree. Hash tables are typically much, much faster than binary search trees, but you can't efficiently look up values in a range. So for this assignment, we won't be reimplementing the prefix portion of the search. You'll then do a comparison between the two approaches.

Because this assignment is a rework of the previous assignment, you are free and welcome to start with the code that you submitted for the last one if you wish. Because this assignment is an individual one, you should clearly indicate in program comments which portions of your code have been lifted / modified from the previous pair assignment, and which portions you've added yourself.

If you had trouble with the previous assignment, don't panic. There's no need to use the previous code at all if you don't wish to. You can read on for more details.

Part 1: The Map

Create a class called HashMapForStrings that will be used to map Strings to Strings. (The intention is to use it to map movie titles to movie plots.) Implement this map as a hash table with chaining. For this assignment, you should not the built-in Java classes that do this task such as TreeMap, HashMap, TreeSet, or HashSet. You should also build your linked list chains from scratch: you should not use arrays or the built-in LinkedList or ArrayList classes for managing your chains. Your map should have the following methods:

Make sure to test your map appropriately. A main method would be a nice way to do this.

Part 2: Filling up the map with title and plot data from IMDB, and running queries

In a similar fashion to the previous assignment, create a class called HashSearchEngine with a main that creates a HashMapForStrings object, then loads up a file of movie data into it. Specifically, the movie titles should be the keys, and the plot summaries should be the values. Note that this data is messy in a few ways: this is typical for data that you get from other sources. You'll need to think about what kinds of processing you may want to do on the data to make the titles usable.

Since you'll be using more than one input file (the big one, and your small testing one), don't hardcode the name of your file into your program. Instead, add it as command-line parameter. When you run your program, do so like this:

java -Xmx1g HashSearchEngine plot.lists
or
java -Xmx1g HashSearchEngine plotsmall.lists

Use the array of args in main to grab the filename and open it appropriately.

The "-Xmx1g" tells Java to allow a maximum about of memory of 1 gigabyte, which you'll need for loading up this much data. Without doing so, the default maximum amount of memory that Java uses is something like 128 MB or 256 MB (depending on the version of Java).

Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and printing out a status update every 1000 movies.

After loading up the data, you should enter a loop that prompts the user to enter in a movie title. For example, if I enter in Star Wars (1977), I should see the plot summary for that movie. Happily, if you've completed part 1, you've got a method that helps you out directly in doing this. After displaying the plot summaries, prompt the user again for another prefix. Do this until the user types ####, in which case the program should end. I've checked that there are no movies with that title.

Part 3: Evaluation of how well it works

Perform a comparison of how hashing scales with the size of the index array, and also with binary search trees, regarding the time it takes to do a lookup. I want you to take some time to do this carefully and prepare a detailed writeup on this: this is the assignment that is being used to enable this course to qualify as a "Quantitative Reasoning Encounter."

In order to do this, I want you to compare the time it takes to do build the map, and how long it takes to do a lookup, under the following scenarios:

  1. A variety of sizes for the link array in your hash table. Try a variety of array sizes, and plot the results that you see. Large array sizes likely mean a lot of wasted memory. For each of your array sizes, you should also plot how much memory that you believe your program is using. Getting Java to tell you this information directly is complicated and unreliable; a much better way to go is to keep track in your software whenever you allocate memory, and increment a counter that you keep yourself with a rough estimate of how much memory you think you're using.
  2. Also compare the timing with your binary search tree from the previous assignment. If you had trouble making that work, feel free to use the built-in Java TreeMap in order to get timing measurements instead.

Make sure to discuss and interpret the results that you see. Explain how it fits in with the theoretical expectations.

One challenge that you'll face is that you can't get accurate timing measurements by seeing how long it takes to look up a single movie. There are two reasons for this: a single movie might not be representative of the dataset as a whole, and it happens so quickly that the timing measurements are not especially accurate or relevant. A much better way to do it is to do a large number of lookups, and consider either the total time or average time for each within. To do a large number of lookups, you'll need to write code that takes a large number of movie titles and "looks them up" in your map. As with the previous assignment, the method System.currentTimeMillis() is very useful here.

In addition to submitting the code that you write, you should turn in your plots and your explanations in something that resembles a lab report. It should be a coherent document that explains what you did, what results you find, and what conclusions you draw. This should also be submitted electronically: formats such as pdf or standard word processing document formats are fine.

What to turn in

Part 1: Turn in your TreeMapForStrings class, which should include a main method for testing it with some small data.

Part 2: Turn in your HashSearchEngine class, which should be successfully working for querying individual movies. include a main method for testing it with some small data.

Part 3: Turn in your lab report and additional code that you have written in order to generate your experimental results.

Have fun, and good luck!