N-Grams

Due Wednesday 4/11/01, midnight. Submit your code via HSP. Be prepared to do a brief (five-minute) demo of your program in class Wednesday, including an explanation of your approach to the problem and any features of your solution that strike you as interesting.

What to do

For this assignment, try to duplicate the experiment described on pages 202-204 of Jurafsky and Martin. In particular, you should collect unigram and bigram (and possibly trigram and quadrigram) counts from a corpus, and generate random sentences based on that corpus. Make your program capable of using either unsmoothed or smoothed n-grams (for smoothing, implement either Witten-Bell or Good-Turing).
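To make the pipeline concrete, here is a minimal sketch of unsmoothed bigram counting and random generation. The sentence markers (<s>, </s>), the toy corpus, and all function names are my own assumptions, not requirements of the assignment.

```python
# Sketch: collect unigram/bigram counts, then generate by walking the
# conditional bigram distribution P(w | previous word).
import random
from collections import defaultdict

def count_bigrams(tokens):
    """Count unigrams and bigrams from a token list bracketed by <s>/</s>."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for prev, word in zip(tokens, tokens[1:]):
        unigrams[prev] += 1
        bigrams[(prev, word)] += 1
    unigrams[tokens[-1]] += 1          # last token has no successor
    return unigrams, bigrams

def generate(unigrams, bigrams, start="<s>", end="</s>", max_len=30):
    """Random-walk the bigram model from <s> until </s> or max_len words."""
    word, out = start, []
    for _ in range(max_len):
        # Candidate successors of `word` and their probabilities c(w1,w2)/c(w1).
        cands = [(w2, c / unigrams[word])
                 for (w1, w2), c in bigrams.items() if w1 == word]
        if not cands:
            break
        words, probs = zip(*cands)
        word = random.choices(words, weights=probs)[0]
        if word == end:
            break
        out.append(word)
    return " ".join(out)

tokens = "<s> the cat sat . </s> <s> the dog sat . </s>".split()
uni, bi = count_bigrams(tokens)
print(generate(uni, bi))
```

Trigram and quadrigram models follow the same pattern with longer history tuples as keys.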

Try your code on a couple of corpora with distinctive styles (e.g. Shakespeare, Hemingway, Jane Austen, James Joyce, the Old Testament, Dr. Seuss, the Washington Post, etc.). Project Gutenberg provides a great source of public-domain literary corpora.

Advice

Don't forget to include punctuation marks as word types.
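One simple way to keep punctuation marks as separate word types is to split them off with a regular expression. The exact pattern below is an assumption; adapt the character classes to whatever appears in your corpus.

```python
# Sketch: tokenize so that each punctuation mark becomes its own token.
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z']+|[.,!?;:\"]", text.lower())

print(tokenize('"Stop," he said.'))
# → ['"', 'stop', ',', '"', 'he', 'said', '.']
```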

Beware the temptation to write all your code before testing any piece of it. Don't start generating random sentences before you are reliably computing probabilities. It's good to plan for generality, but unigrams and bigrams are more important than trigrams and quadrigrams, so it would be appropriate to focus your initial development on bigrams.
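As a sanity target for the probability-computation stage, here is a hedged sketch of Witten-Bell smoothed bigram probabilities, following the formulation in Jurafsky and Martin. The variable names (T for the number of distinct types seen after a history, Z for the number of unseen types) and the closure-based design are my own choices.

```python
# Sketch: Witten-Bell smoothing for bigrams.
#   seen bigram:   P(w|h) = c(h,w) / (c(h) + T(h))
#   unseen bigram: P(w|h) = T(h) / (Z(h) * (c(h) + T(h)))
from collections import defaultdict

def witten_bell(tokens):
    """Return a function prob(word, history) with Witten-Bell smoothing."""
    vocab = set(tokens)
    c_hist = defaultdict(int)      # c(h): how often history h occurs
    c_bi = defaultdict(int)        # c(h, w): bigram counts
    followers = defaultdict(set)   # distinct types seen after each h
    for h, w in zip(tokens, tokens[1:]):
        c_hist[h] += 1
        c_bi[(h, w)] += 1
        followers[h].add(w)

    def prob(word, history):
        T = len(followers[history])      # T(h): seen-type count after h
        Z = len(vocab) - T               # Z(h): unseen-type count after h
        denom = c_hist[history] + T
        if c_bi[(history, word)] > 0:
            return c_bi[(history, word)] / denom
        return T / (Z * denom) if Z else 0.0

    return prob

p = witten_bell("a b a b a c".split())
```

A quick check that should hold for any history: the probabilities over the whole vocabulary sum to 1 (e.g. for history "a" above, 2/5 + 1/5 + 2/5).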

Test your n-gram counting on very small corpora of your own devising and compare against your manual computations.
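For example, with a five-word corpus every count is easy to verify on paper. The corpus and the use of Counter below are placeholders for your own test corpus and counting routine.

```python
# Sketch: a hand-checkable test corpus for n-gram counting.
from collections import Counter

tokens = "the cat saw the dog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Hand computation: "the" appears twice; ("the", "cat") appears once.
assert unigrams["the"] == 2
assert bigrams[("the", "cat")] == 1
assert bigrams[("saw", "the")] == 1
# Conditional probability P(cat | the) = c(the, cat) / c(the) = 1/2.
assert bigrams[("the", "cat")] / unigrams["the"] == 0.5
print("all hand checks pass")
```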

Have fun, start early, and keep in touch.



Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu