Due 11:59PM Thursday 4/7/05. You may work in pairs or individually. Submit your code via HSP .
For your previous assignment, you wrote a program to collect n-gram data. To put n-grams to use in realistic applications, however, you need more. In particular, you need a way to assign probabilities to n-grams that do not appear in the training corpus. Doing this is called "smoothing"; see section 6.3 of your textbook for details.
For this assignment, you will enhance your n-gram collection with smoothing, and apply the result to the task of assigning probabilities to sentences. Your program should take two command-line arguments. The first argument will be the name of a training corpus file (the file from which your program collects n-gram data), and the second should be a file consisting of one sentence per line. Your program should compute a smoothed probability for each sentence, and print the sentences to standard output, sorted in decreasing order of probability.
Use either Witten-Bell or Good-Turing smoothing for this program.
[Note for future years: Witten-Bell as described in the textbook is not applicable unless you already know the full legal vocabulary. So you'll need to know the words in the test data that are not found in the training data, which is pretty awkward. Furthermore, Witten-Bell is not especially well described in the book. Good-Turing would be a better choice in class and on the assignment.]
For this and many other programs, it can help a lot to spend some time
early thinking carefully about the functions you will need. As a simple example,
you will need a function/method/subroutine that takes a sentence as a parameter,
and returns a probability. This suggests that a Java method like
public double getSentenceProbability( String s )
or a C++ function
like double getSentenceProbability( const string& s )
would be appropriate.
These functions will in turn need to call some sort of function or functions
to get the smoothed probability of each n-gram in the sentence (not to mention
the 1-gram, 2-gram, etc. at the beginning of the sentence). Designing a clean
functional interface for this task will help you organize your code
in a way that makes the program easier to write in the first place, and
easier to debug and maintain later.
Have fun, start early, and keep in touch.