CS 201: Search Engine

Search Engine

Overview

This is a team assignment.

For this assignment, we'll make a rudimentary search engine. Specifically, we'll load up all of the titles and plots from the Internet Movie Database (IMDB) into a map implemented as a binary search tree, and then allow the user to request titles and plots for all movies that start with a particular prefix.

Part 1: The Map

Create a class called TreeMapForStrings that will be used to map Strings to Strings. (The intention is to use it to map movie titles to movie plots.) Implement this map as a binary search tree. For this assignment, you should not the built-in Java classes that do this task such as TreeMap, HashMap, TreeSet, or HashSet. Your code should have the following methods:

void put(String key, String value): add a key and a value to your tree. If the key is already present in the tree, replace the old value with the new one.
String get(String key): get the value associated with a particular key. get should return null if the word is not in the tree.

Make sure to test your map appropriately. A main method would be a nice way to do this.

Part 2: More Map

Add this method to your TreeMapforStrings class:

ArrayList<String> getKeysForPrefix(String prefix): return a list of all keys that start with the prefix. You should do this efficiently: specifically, you should not traverse the entire tree in order to find these keys.

Again, make sure to submit code to test your map appropriately.

Part 3: Filling up the map with title and plot data from IMDB, and running queries

About IMDB data

IMDB makes plain text files of all of its data available for download. I've grabbed a file called plot.list that contains titles and plot summaries for all of the movies in the database, and put it on a department shared directory in /Accounts/courses/cs201/dmusican, which is accessible from the lab machines. While plot.list is the original file from IMDB, it has one major problem for this assignment: it is sorted by movie title, which works poorly for creating a binary search tree from scratch. Therefore, you should instead use plotunsorted.lst, which I have created by randomly ordering the movies. You will need to loop over this file and extract, for each movie, the title and plot summary. Sometimes more than one plot summary is provided for a single movie. For our purposes, just concatenate multiple plot summaries for the same title together to make a single large (and redundant) plot summary.

The file plotunsorted.list is large. Specifically, it's around 200 megabytes. Do not copy this file to your own home directory or a subdirectory. If 35 students each copy a 200 MB file to their own network storage directory, that's a total of 7 gigabtes of network storage that we're wasting merely for everyone to have their own copy. Moreover, you'll be fighting with all the other students in the class to read the file over the network every time you read it in, since your home directory is on a networked disk drive.

Do not direct your software to read this file directly from /Accounts/courses/cs201/dmusican. If 35 students in the class repeatedly try to read a 200 MB file at the same time, we may clog the network and bring it to a halt. This is the same problem we face if you put the file in your home directory (in addition to storage space issues).

So what should you do? If you're working on a lab machine, make a copy of the file from /Accounts/courses/cs201/dmusican once to the directory /tmp. /tmp is a local directory on the computer that you're using. The file might get deleted on a reboot or a logout, but you can always recopy the file over if it vanishes. If you're using your own computer, you can put the file anywhere you want on your own computer. Make sure, however, that you are storing it locally and not in some network location.

(One minor source of irritation you may run into is that if someone who used the computer before you has already copied this file onto /tmp, you may not have permission to read the file. If that happens, just copy it again with a different file name. Alternatively, reboot the computer, and it will hopefully disappear.)

The other thing that you should do is make a small version of this file for use in debugging. Reading in the entire plotunsorted.list file takes a long time, and if you read from that file every time you debug your program, you'll add many hours to your coding and debugging waiting for it every time. This is a key general strategy that is really useful when working with big data; make a small version of the data to use for testing.

Finally, according to IMDB licensing, I should add:

Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.

Loading the IMDB data into your tree

Create class called SearchEngine with a main that creates a TreeMapForStrings object, then loads up a file of movie data into it. Specifically, the movie titles should be the keys, and the plot summaries should be the values. Note that this data is messy in a few ways: this is typical for data that you get from other sources. You'll need to think about what kinds of processing you may want to do on the data to make the titles usable.

Since you'll be using more than one input file (the big one, and your small testing one), don't hardcode the name of your file into your program. Instead, add it as command-line parameter. When you run your program, do so like this:

java -Xmx1g SearchEngine plot.lists

java -Xmx1g SearchEngine plotsmall.lists

Use the array of args in main to grab the filename and open it appropriately.

The "-Xmx1g" tells Java to allow a maximum about of memory of 1 gigabyte, which you'll need for loading up this much data. Without doing so, the default maximum amount of memory that Java uses is something like 128 MB or 256 MB (depending on the version of Java).

Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and printing out a status update every 1000 movies.

After loading up the data, you should enter a loop that prompts the user to enter in a movie title prefix. For example, if I enter in Flash, I should see the titles and plot summaries for all movies whose titles start with the characters Flash. Happily, if you've completed part 1, you've got a method that helps you out directly in doing this. After displaying the titles and plot summaries, prompt the user again for another prefix. Do this until the user types ####, in which case the program should end. I've checked that no movie titles start with that prefix.

Timing issues

Finally, in a readme.txt file that you submit along with your code, answer the following question:

Loading up the entire movie database is slow. Is that because reading the file itself is slow, or is because adding to the tree is slow? Experimentally determine which is the bottleneck, and explain your results. You may find the method System.currentTimeMillis() very useful here.

What to turn in

Part 1: Turn in your TreeMapForStrings class, which should include a main method for testing it with some small data.

Part 2: Turn in your updated TreeMapForStrings class, which should include a main method for testing it with some small data.

Part 3: Resubmit Part 2, as well as your SearchEngine class and your readme.txt.

Have fun, and good luck!