In this assignment, we will implement the part of a web search
engine that ranks pages based on how well they match a search
query. The best match is the web page with the highest total frequency
count for the words in a particular query. Your main class for this
assignment should be called ProcessQueries, and will be
used as follows:
java ProcessQueries urlListFile ignoreFile
The urlListFile should contain a list of URLs, one per
line. These URLs should correspond to files that are on our file
system. An example urlListFile might contain:
/Accounts/courses/cs127/dmusican/dept-web-site/contact.html
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html
/Accounts/courses/cs127/dmusican/dept-web-site/guideToTheCSMajor.html
/Accounts/courses/cs127/dmusican/dept-web-site/index2.html
/Accounts/courses/cs127/dmusican/dept-web-site/stufflist.html
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html
/Accounts/courses/cs127/dmusican/dept-web-site/websearch.html
The ignoreFile should contain a list of words that you
would like to ignore when counting word frequencies (just as in the
last assignment).
Create a class called URLContent that will store two
things: a String containing a URL, and a
WordFrequencyTree that will contain the frequency counts
for that URL. If you had trouble getting your last assignment working,
you can use a HashMap instead; see me immediately so that
I can help get you started.
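One possible shape for this class is sketched below. It assumes your WordFrequencyTree from the last assignment has some method for looking up a single word's count; the name getCount used here is only a placeholder, so substitute whatever your own class actually provides.

// Sketch only: pairs a URL with the word-frequency structure built from its contents.
// WordFrequencyTree comes from the previous assignment; the getCount method name
// below is an assumption -- use whatever lookup method your tree really has.
public class URLContent {
    private String url;
    private WordFrequencyTree frequencies;

    public URLContent(String url, WordFrequencyTree frequencies) {
        this.url = url;
        this.frequencies = frequencies;
    }

    public String getUrl() {
        return url;
    }

    // How many times word appears on this page (0 if it does not appear).
    public int getCount(String word) {
        return frequencies.getCount(word);
    }
}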
When your program starts, it should process all the URLs in the
list by creating a URLContent object for each, and build
a WordFrequencyTree for each. (Keep all of these
URLContent objects in an ArrayList.) Once
you have finished processing all of these URLs, your program should
enter a loop as shown below. The program should prompt the user to
enter a search query (or -1 to quit), and then list all URLs that
match the query in order of the best match first and the worst match
last. Include each result URL's priority in parentheses after each
result. URLs of web pages that do not contain any of the words in the
query should not appear in the result list. The priority of a page is
defined as the number of times a word in the query appears on that
page. Here is a sample run (not necessarily correct):
Enter a query or -1 to quit.
Search for: computer science
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = x)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = y)
Search for: mtie
Relevant pages:
/Accounts/courses/cs127/dmusican/dept-web-site/GuidetoMath.html (priority = x)
/Accounts/courses/cs127/dmusican/dept-web-site/summaries.html (priority = y)
/Accounts/courses/cs127/dmusican/dept-web-site/contact.html (priority = z)
Search for: -1
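One possible way to organize the startup step (building a URLContent object for every URL in the list) is sketched below as a helper method inside ProcessQueries. The buildTree call is a placeholder for your own code from the last assignment that constructs a WordFrequencyTree from a file while skipping the ignore words; it is not a method this assignment provides.

// These imports would sit at the top of ProcessQueries.java:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Scanner;

// Sketch of a ProcessQueries helper: read the ignore words and the URL list,
// then build one URLContent per URL.
public static ArrayList<URLContent> loadPages(String urlListFile, String ignoreFile)
        throws FileNotFoundException {
    // Words to skip when counting frequencies, as in the last assignment.
    HashSet<String> ignoreWords = new HashSet<>();
    Scanner ignoreScanner = new Scanner(new File(ignoreFile));
    while (ignoreScanner.hasNext()) {
        ignoreWords.add(ignoreScanner.next().toLowerCase());
    }
    ignoreScanner.close();

    ArrayList<URLContent> pages = new ArrayList<>();
    Scanner urlScanner = new Scanner(new File(urlListFile));
    while (urlScanner.hasNextLine()) {
        String url = urlScanner.nextLine().trim();
        if (url.isEmpty()) {
            continue;
        }
        // buildTree is hypothetical: your own code that counts word frequencies
        // for the file at this URL, ignoring the words in ignoreWords.
        WordFrequencyTree tree = buildTree(url, ignoreWords);
        pages.add(new URLContent(url, tree));
    }
    urlScanner.close();
    return pages;
}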
To process a query, create a HeapPriorityQueue. For
each URL, determine its priority, then add that URL to the priority
queue. After you have processed all URLs, use the priority queue to
print out the URLs in order with the priorities next to them.
To get started, create a class called
HeapPriorityQueue. This should be a priority queue,
implemented as a heap, with the usual methods for inserting and
removing items. Test out your priority queue on some sample
data. You will be using your priority queue to store Strings. There is
priority queue code in your textbook which you can use as a reference
(make sure to cite it if you do), but you will undoubtedly learn more
and have more fun if you try to build it yourself.
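Here is a rough sketch of one way such a class could be organized: a max-heap of (String, priority) entries kept in an ArrayList, where the children of index i live at indices 2i+1 and 2i+2. The entry layout and method names are only suggestions, not a required interface.

import java.util.ArrayList;

// Sketch: a max-heap priority queue of Strings keyed by integer priority.
// Index 0 of the ArrayList is the root.
public class HeapPriorityQueue {

    // One heap entry: an item (here, a URL) and its priority.
    private static class Entry {
        String item;
        int priority;
        Entry(String item, int priority) {
            this.item = item;
            this.priority = priority;
        }
    }

    private ArrayList<Entry> heap = new ArrayList<>();

    public boolean isEmpty() {
        return heap.isEmpty();
    }

    // Add an item, then bubble it up until the heap property is restored.
    public void insert(String item, int priority) {
        heap.add(new Entry(item, priority));
        int child = heap.size() - 1;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (heap.get(child).priority <= heap.get(parent).priority) {
                break;
            }
            swap(child, parent);
            child = parent;
        }
    }

    // Priority of the current maximum; call this before removeMax when you
    // need to print the priority, and only when the queue is not empty.
    public int maxPriority() {
        return heap.get(0).priority;
    }

    // Remove and return the highest-priority item by moving the last entry
    // to the root and sifting it down.
    public String removeMax() {
        Entry max = heap.get(0);
        Entry last = heap.remove(heap.size() - 1);
        if (!heap.isEmpty()) {
            heap.set(0, last);
            int parent = 0;
            while (true) {
                int left = 2 * parent + 1;
                int right = 2 * parent + 2;
                int largest = parent;
                if (left < heap.size() && heap.get(left).priority > heap.get(largest).priority) {
                    largest = left;
                }
                if (right < heap.size() && heap.get(right).priority > heap.get(largest).priority) {
                    largest = right;
                }
                if (largest == parent) {
                    break;
                }
                swap(parent, largest);
                parent = largest;
            }
        }
        return max.item;
    }

    private void swap(int i, int j) {
        Entry temp = heap.get(i);
        heap.set(i, heap.get(j));
        heap.set(j, temp);
    }
}

With an interface like this, processing a query amounts to inserting each matching URL with its priority and then repeatedly peeking at maxPriority and calling removeMax until the queue is empty.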
Next, build the class ProcessQueries. (This is where
your main program will lie.) In the main method, use Scanner to read
in both the urlListFile and the ignoreFile
in a similar manner to the last assignment. For each URL in the
urlListFile, build a URLContent object and add it to your
ArrayList.
Finally, actually implement the query itself. Determine the priority for each URL, and insert it into a priority queue. Then print out the matching URLs in order of best to worst. Your program should handle multiple word queries. The priority for a given URL should be determined as the sum of the number of matches for each word in the query.
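Putting the pieces together, the query loop might look roughly like the fragment below, assuming the loadPages helper and the HeapPriorityQueue and URLContent sketches shown earlier. The prompt strings follow the sample run; everything else is adjustable to fit your own design.

import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

// Sketch of the main method of ProcessQueries.
public static void main(String[] args) throws FileNotFoundException {
    ArrayList<URLContent> pages = loadPages(args[0], args[1]);
    Scanner input = new Scanner(System.in);
    System.out.println("Enter a query or -1 to quit.");
    while (true) {
        System.out.print("Search for: ");
        String query = input.nextLine().trim();
        if (query.equals("-1")) {
            break;
        }
        HeapPriorityQueue results = new HeapPriorityQueue();
        for (URLContent page : pages) {
            // Priority = sum, over all words in the query, of that word's
            // count on this page.
            int priority = 0;
            for (String word : query.toLowerCase().split("\\s+")) {
                priority += page.getCount(word);
            }
            // Pages containing none of the query words are left out entirely.
            if (priority > 0) {
                results.insert(page.getUrl(), priority);
            }
        }
        System.out.println("Relevant pages:");
        while (!results.isEmpty()) {
            int priority = results.maxPriority();  // peek before removing
            String url = results.removeMax();
            System.out.println(url + " (priority = " + priority + ")");
        }
    }
    input.close();
}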
Have fun, and good luck! Remember to start early and make incremental progress.
Many thanks to Tia Newhall and Lisa Meeden at Swarthmore College.