Search Engine via Hashing
Table of Contents
This is a pair programming assignment. If you are working in a pair, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Set a timer to swap every 15 minutes. You can choose your favorite from this online timer page. Make sure your sound volume is audible, but not so loud to disturb the people around you.
We will use anonymous grading on Moodle, which means that the grader won't see your name until after the grading is done. This is an easy way to help add an extra element of fairness to the grading. Therefore, make sure your name doesn't appear on your actual submission. When you submit via Moodle, it will know you are. Thanks!
Overview
For this assignment, we'll make a rudimentary search engine. Specifically, we'll load up a dataset of performers and biographies from Internet Movie Database (IMDb) into a map implemented as a hash table, and then allow the user to request info as desired.
Part 1: Create a Map
Create a class called HashMap201
that will be used to map Strings to
Strings. (The intention is to use it to map performers to biographies.)
Implement this map as a hash table with chaining via linked lists. For this
assignment, you should not the built-in Java classes that do this task such as
TreeMap
, HashMap
, TreeSet
, or HashSet
. That said, you should use the
built-in class LinkedList
, for your chains. Your map should have the following
methods:
void put(String key, String value)
add a key and a value to your map. If the key is already present in the map, replace the old value with the new one.
String getValue(String key)
get the value associated with a particular key. Return
null
if the key is not in the map.
Make sure to test your map appropriately. A main method would be a nice way to do this.
You should not be searching the web or other sources for similar code to this. You are welcome to look at the textbook.
Optional extra challenge
If you would like an optional additional challenge, make your class generic
instead of String specific. In that case, your HashMap201<K,V>
class methods
would look like:
void put(K key, V value)
V getValue(K key)
Using the built-in LinkedList
classes
To use the built-in LinkedList
class, review the various methods that it
offers. One source of trickiness is that for each entry, you need to store both
a key and a value. You'll have to think through how you want to do that. You'll
also need to address how to efficiently search through the linked list and be
able to change one of the items if you need to without having to find it all
over again. For this purpose, the so-called for-each loop can be quite useful. Here
is a sample as to how you might use it:
// Assume I've created a class called Entry which contains a key and a value LinkedList<Entry> mylist = new ListList<Entry>(); // ... add a bunch of items to it ... for (Entry e : mylist) { System.out.println(e.key); System.out.println(e.value); e.value = someNewValue; }
The for-each loop works on any class that implements the Iterable
interface. Java Interlude 5 in your textbook has lots more information on how
iterators work, which forms the basis for what the for-each loop does.
Part 2: Fill up the Map with data from IMDb, and run queries
About IMDb data
IMDb makes plain text files of all of its data available for download. We're
going to use a file called bio_unsorted.list
that contains all of the
performers and biographies in the database. Because the file is so large (nearly
600 MB uncompressed), I've had to place them in a shared CS department directory
with enough disk space (that location is called /Accounts/courses/
). That
directory isn't accessible via the web, so you'll find directions below on how
to obtain the file you need. But first: do not copy this file to any of your
network-shared Carleton directories. If 35 students each copy and uncompress a
(600 MB) file to their own network storage directory, that's a total of 21
gigabytes of network storage that we're wasting merely for everyone to have
their own copy. Moreover, you'll be fighting for bandwidth with all the other
students in the class to read the file over the network every time you read it
in.
Also, in case you were thinking about being clever about saving disk space
(which is often awesome), do not direct the program you write to read the file
directly from the /Accounts/courses
location without copying it first. If 35
students in the class try to read a very large file from the same network
storage location at the same time, repeatedly while debugging code, we may clog
the network and bring it to a halt. This is the same problem we face if you put
the file in your home directory (in addition to storage space issues).
So what should you do? If you're working on a lab machine, make a copy of the
file to the directory /tmp
. /tmp
is a local directory on the computer that
you're using. The file might get deleted on a reboot or a logout, but you can
always recopy the file over if it vanishes. If you're using your own computer,
you can put the file anywhere you want on your own computer. Make sure, however,
that you are storing it locally and not in some network location.
(One minor source of irritation you may run into is that if someone who used the
computer before you has already copied this file onto /tmp
, you may not have
permission to read the file. If that happens, just copy it again with a
different file name. Alternatively, reboot the computer, and it will hopefully
disappear.)
There are a variety of ways of copying the file, but here is one that will work
under Mac or Linux: use scp
, which is a command-line program for doing a
"secure copy." At a command-prompt, you can learn more about how it works by
typing:
man scp
Specifically, to grab the file you need, first use the cd
command to change to
the directory you want to put the file in. For example, if it's a lab machine,
type
cd /tmp
Then issue the following command, which securely copies the file from the remote location:
scp username@mirage.mathcs.carleton.edu:/Accounts/courses/cs201/dmusicant/bio_unsorted.list.gz .
where username
is your username. Don't miss the dot at the very end; that
means copy the file into your current directory. mirage
is the name of a CS
department server that you're copying the file from. You'll need to enter your
Carleton password once you issue the above command, and the file should start
coming down. Once you've got it, you'll need to uncompress it. That file is
compressed with the gz
compression format. To uncompress it, on a Mac or Linux
issue the following command. (Check again first: are you in the directory /tmp
?)
gunzip bio_unsorted.list.gz
Windows computers frustratingly don't natively have scp or a similar program
installed. If you are using a new enough version of Windows 10, you can install
the Windows Subsystem for Linux. If you can get this installed, then the above
instructions for Mac or Linux should also work for Windows. Alternatively, for
any version of Windows you can install WinSCP, though you'll have to figure out
how to configure it; I haven't played with it. Likewise, to unzip a gz
file
natively under Windows, you'll likely need to install either Winzip or 7-zip.
Make a smaller version
For testing your program, you should make a small version of this file where you
cut out most of the movies. Reading in the entire bio_unsorted.list
file takes
a long time, and if you read from that file every time you debug your program,
you'll add many hours to your coding and debugging waiting for it every time.
This is a key general strategy that is really useful when working with big data;
make a small version of the data to use for testing. Opening up a file this big
in a text editor would likely be exceedingly slow and would use a huge amount of
memory. Instead, try to use the UNIX command head
which works on both Mac and
Linux (and the Windows Subsystem for Linux). Read the man page on head
(type
man head
at a prompt) to see how it works. head
normally dumps the output to
the terminal window; instead, you'll need to redirect the output to a file. If
you don't know how to do that, look up "redirecting the output" in this awesome
UNIX tutorial.
(If you're using a Windows computer and unable to make WSL work, the current cleanest way to do this is to use some PowerShell commands. If you plan to continue to work extensively with Windows, it would be a worthwhile experience to learn how to use PowerShell. Alternatively, you could do this piece on a department Mac and transfer the file to your computer.)
Load the IMDb data into your map
Create a class called HashSearchEngine
with a main that creates a
HashMap201
object, then loads up a file of biography data into it.
Specifically, the performer names should be the keys, and the biographies
should be the values. Note that this data is messy in a few ways: this is
typical for data that you get from other sources. You'll need to think about
what kinds of processing you may want to do on the data to make the titles
usable.
When you use Scanner
to read the file, you'll need to add a special addition
to it. The input file contains a lot of unusual characters, such as diacritical
marks. The file is encoded in a format called "ISO-8859-1", which is different
from the default format Java expects to find (which is called "UTF-8"). When
you open up the file, you need to do it using a second parameter to Scanner
like so:
Scanner inp = new Scanner(new File("filename.txt"), "ISO-8859-1");
Since you'll be using more than one input file (the big one, and your small
testing one), don't hardcode the name of your file into your program like
"filename.txt" above. Instead, add it as command-line parameter; use the array
args
in main
to grab the filename and open it appropriately.
- If you are using your own computer and you have a limited amount of memory,
you might run out of memory while loading up the hash table. By default, Java
only allows you to use around 1/4 of your computer's memory before running
out. If you get an out of memory error when trying to run your code, you can
change Java's default in jGRASP by going to the Settings, Compiler Settings,
Workspace screen, going to the "Run" row in the second column (the one
labeled "FLAGS2 or ARGS2"), clicking the square next to it so that you can
change the value, and put in
-Xmx1g
. This tells Java to use 1 GB of memory, even if this is more than 1/4 of the memory that your computer has.
Loading up the data can take a while on the big file. I found it convenient to keep a running counter of how many movies I had loaded, and print out a status update every 1000 movies.
After loading up the data, you should enter a loop that prompts the user to
enter in a performer name. For example, if I enter in Recha, David
, I should
see his biography. After displaying the biography, prompt the user again for
another performer. Do this until the user types ####
, in which case the
program should end. I've checked that no performer names start with that
prefix.
Evaluation of how well it works
Finally, you should perform some experiments to see how the amount of time it takes to run varies with the size of the main hash array. Try a variety of array sizes, and submit a plot made with the tool of your choice showing how runtime changes with respect to array size.
- To measure runtime, the method
System.nanoTime()
is really useful. You can measure the time on the clock before and after doing something, and then calculate the difference.
Submit a readme.txt
file that contains a brief summary of your findings,
and a PDF with your plot.
What to turn in
Part 1: Turn in your HashMap201
class, which should include a main
method for testing with some small data.
Part 2: Turn in your HashSearchEngine
class, as well as your plot showing
performance for different array sizes.
Acknowledgements
According to IMDb licensing, I should add:
Information courtesy of IMDb (http://www.imdb.com/). Used with permission.