N-grams with smoothing

Due 11:59PM Thursday 4/7/05. You may work in pairs or individually. Submit your code via HSP .

What to do

For your previous assignment, you wrote a program to collect n-gram data. To put n-grams to use in realistic applications, however, you need more. In particular, you need a way to assign probabilities to n-grams that do not appear in the training corpus. Doing this is called "smoothing"; see section 6.3 of your textbook for details.

For this assignment, you will enhance your n-gram collection with smoothing, and apply the result to the task of assigning probabilities to sentences. Your program should take two command-line arguments. The first argument will be the name of a training corpus file (the file from which your program collects n-gram data), and the second should be a file consisting of one sentence per line. Your program should compute a smoothed probability for each sentence, and print the sentences to standard output, sorted in decreasing order of probability.

Use either Witten-Bell or Good-Turing smoothing for this program.

[Note for future years: Witten-Bell as described in the textbook is not applicable unless you already know the full legal vocabulary. So you'll need to know the words in the test data that are not found in the training data, which is pretty awkward. Furthermore, Witten-Bell is not especially well described in the book. Good-Turing would be a better choice in class and on the assignment.]

Suggestions

For this and many other programs, it can help a lot to spend some time early thinking carefully about the functions you will need. As a simple example, you will need a function/method/subroutine that takes a sentence as a parameter, and returns a probability. This suggests that a Java method like public double getSentenceProbability( String s ) or a C++ function like double getSentenceProbability( const string& s ) would be appropriate. These functions will in turn need to call some sort of function or functions to get the smoothed probability of each n-gram in the sentence (not to mention the 1-gram, 2-gram, etc. at the beginning of the sentence). Designing a clean functional interface for this task will help you organize your code in a way that makes the program easier to write in the first place, and easier to debug and maintain later.

Have fun, start early, and keep in touch.





Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu