Search Engine via Hashing

Table of Contents

This is a pair programming assignment. If you are working in a pair, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Set a timer to swap every 15 minutes. You can choose your favorite from this online timer page. Make sure your sound volume is audible, but not so loud to disturb the people around you.

We will use anonymous grading on Moodle, which means that the grader won't see your name until after the grading is done. This is an easy way to help add an extra element of fairness to the grading. Therefore, make sure your name doesn't appear on your actual submission. When you submit via Moodle, it will know you are. Thanks!

Overview

For this assignment, we'll make a rudimentary search engine. Specifically, we'll load up a dataset of performers and biographies from Internet Movie Database (IMDb) into a map implemented as a hash table, and then allow the user to request info as desired.

Part 1: Create a Map

Create a class called HashMap201 that will be used to map Strings to Strings. (The intention is to use it to map performers to biographies.) Implement this map as a hash table with chaining via linked lists. For this assignment, you should not the built-in Java classes that do this task such as TreeMap, HashMap, TreeSet, or HashSet. That said, you should use the built-in class LinkedList, for your chains. Your map should have the following methods:

  • void put(String key, String value)

    add a key and a value to your map. If the key is already present in the map, replace the old value with the new one.

  • String getValue(String key)

    get the value associated with a particular key. Return null if the key is not in the map.

Make sure to test your map appropriately. A main method would be a nice way to do this.

You should not be searching the web or other sources for similar code to this. You are welcome to look at the textbook.

Optional extra challenge

If you would like an optional additional challenge, make your class generic instead of String specific. In that case, your HashMap201<K,V> class methods would look like:

  • void put(K key, V value)
  • V getValue(K key)

Using the built-in LinkedList classes

To use the built-in LinkedList class, review the various methods that it offers. One source of trickiness is that for each entry, you need to store both a key and a value. You'll have to think through how you want to do that. You'll also need to address how to efficiently search through the linked list and be able to change one of the items if you need to without having to find it all over again. For this purpose, the so-called for-each loop can be quite useful. Here is a sample as to how you might use it:

// Assume I've created a class called Entry which contains a key and a value
LinkedList<Entry> mylist = new ListList<Entry>();
// ... add a bunch of items to it ...
for (Entry e : mylist) {
   System.out.println(e.key);
   System.out.println(e.value);
   e.value = someNewValue;
}

The for-each loop works on any class that implements the Iterable interface. Java Interlude 5 in your textbook has lots more information on how iterators work, which forms the basis for what the for-each loop does.

Part 2: Fill up the Map with data from IMDb, and run queries

About IMDb data

IMDb makes plain text files of all of its data available for download. We're going to use a file called bio_unsorted.list that contains all of the performers and biographies in the database. Because the file is so large (nearly 600 MB uncompressed), I've had to place them in a shared CS department directory with enough disk space (that location is called /Accounts/courses/). That directory isn't accessible via the web, so you'll find directions below on how to obtain the file you need. But first: do not copy this file to any of your network-shared Carleton directories. If 35 students each copy and uncompress a (600 MB) file to their own network storage directory, that's a total of 21 gigabytes of network storage that we're wasting merely for everyone to have their own copy. Moreover, you'll be fighting for bandwidth with all the other students in the class to read the file over the network every time you read it in.

Also, in case you were thinking about being clever about saving disk space (which is often awesome), do not direct the program you write to read the file directly from the /Accounts/courses location without copying it first. If 35 students in the class try to read a very large file from the same network storage location at the same time, repeatedly while debugging code, we may clog the network and bring it to a halt. This is the same problem we face if you put the file in your home directory (in addition to storage space issues).

So what should you do? If you're working on a lab machine, make a copy of the file to the directory /tmp. /tmp is a local directory on the computer that you're using. The file might get deleted on a reboot or a logout, but you can always recopy the file over if it vanishes. If you're using your own computer, you can put the file anywhere you want on your own computer. Make sure, however, that you are storing it locally and not in some network location.

(One minor source of irritation you may run into is that if someone who used the computer before you has already copied this file onto /tmp, you may not have permission to read the file. If that happens, just copy it again with a different file name. Alternatively, reboot the computer, and it will hopefully disappear.)

There are a variety of ways of copying the file, but here is one that will work under Mac or Linux: use scp, which is a command-line program for doing a "secure copy." At a command-prompt, you can learn more about how it works by typing:

man scp

Specifically, to grab the file you need, first use the cd command to change to the directory you want to put the file in. For example, if it's a lab machine, type

cd /tmp

Then issue the following command, which securely copies the file from the remote location:

scp username@mirage.mathcs.carleton.edu:/Accounts/courses/cs201/dmusicant/bio_unsorted.list.gz .

where username is your username. Don't miss the dot at the very end; that means copy the file into your current directory. mirage is the name of a CS department server that you're copying the file from. You'll need to enter your Carleton password once you issue the above command, and the file should start coming down. Once you've got it, you'll need to uncompress it. That file is compressed with the gz compression format. To uncompress it, on a Mac or Linux issue the following command. (Check again first: are you in the directory /tmp?)

gunzip bio_unsorted.list.gz

Windows computers frustratingly don't natively have scp or a similar program installed. If you are using a new enough version of Windows 10, you can install the Windows Subsystem for Linux. If you can get this installed, then the above instructions for Mac or Linux should also work for Windows. Alternatively, for any version of Windows you can install WinSCP, though you'll have to figure out how to configure it; I haven't played with it. Likewise, to unzip a gz file natively under Windows, you'll likely need to install either Winzip or 7-zip.

Make a smaller version

For testing your program, you should make a small version of this file where you cut out most of the movies. Reading in the entire bio_unsorted.list file takes a long time, and if you read from that file every time you debug your program, you'll add many hours to your coding and debugging waiting for it every time. This is a key general strategy that is really useful when working with big data; make a small version of the data to use for testing. Opening up a file this big in a text editor would likely be exceedingly slow and would use a huge amount of memory. Instead, try to use the UNIX command head which works on both Mac and Linux (and the Windows Subsystem for Linux). Read the man page on head (type man head at a prompt) to see how it works. head normally dumps the output to the terminal window; instead, you'll need to redirect the output to a file. If you don't know how to do that, look up "redirecting the output" in this awesome UNIX tutorial.

(If you're using a Windows computer and unable to make WSL work, the current cleanest way to do this is to use some PowerShell commands. If you plan to continue to work extensively with Windows, it would be a worthwhile experience to learn how to use PowerShell. Alternatively, you could do this piece on a department Mac and transfer the file to your computer.)

Load the IMDb data into your map

Create a class called HashSearchEngine with a main that creates a HashMap201 object, then loads up a file of biography data into it. Specifically, the performer names should be the keys, and the biographies should be the values. Note that this data is messy in a few ways: this is typical for data that you get from other sources. You'll need to think about what kinds of processing you may want to do on the data to make the titles usable.

When you use Scanner to read the file, you'll need to add a special addition to it. The input file contains a lot of unusual characters, such as diacritical marks. The file is encoded in a format called "ISO-8859-1", which is different from the default format Java expects to find (which is called "UTF-8"). When you open up the file, you need to do it using a second parameter to Scanner like so:

Scanner inp = new Scanner(new File("filename.txt"), "ISO-8859-1");

Since you'll be using more than one input file (the big one, and your small testing one), don't hardcode the name of your file into your program like "filename.txt" above. Instead, add it as command-line parameter; use the array args in main to grab the filename and open it appropriately.

  • If you are using your own computer and you have a limited amount of memory, you might run out of memory while loading up the hash table. By default, Java only allows you to use around 1/4 of your computer's memory before running out. If you get an out of memory error when trying to run your code, you can change Java's default in jGRASP by going to the Settings, Compiler Settings, Workspace screen, going to the "Run" row in the second column (the one labeled "FLAGS2 or ARGS2"), clicking the square next to it so that you can change the value, and put in -Xmx1g. This tells Java to use 1 GB of memory, even if this is more than 1/4 of the memory that your computer has.

Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and print out a status update every 1000 movies.

After loading up the data, you should enter a loop that prompts the user to enter in a performer name. For example, if I enter in Recha, David, I should see his biography. After displaying the biography, prompt the user again for another performer. Do this until the user types ####, in which case the program should end. I've checked that no performer names start with that prefix.

Evaluation of how well it works

Finally, you should perform some experiments to see how the amount of time it takes to run varies with the size of the main hash array. Try a variety of array sizes, and submit a plot made with the tool of your choice showing how runtime changes with respect to array size.

  • To measure runtime, the method System.nanoTime() is really useful. You can measure the time on the clock before and after doing something, and then calculate the difference.

Submit a readme.txt file that contains a brief summary of your findings, and a PDF with your plot.

What to turn in

Part 1: Turn in your HashMap201 class, which should include a main method for testing with some small data.

Part 2: Turn in your HashSearchEngine class, as well as your plot showing performance for different array sizes.

Acknowledgements

According to IMDb licensing, I should add:

Information courtesy of IMDb (http://www.imdb.com/). Used with permission.

Author: Dave Musicant

Validate