N-Grams
Due Wednesday 4/11/01, midnight. Submit your code via HSP.
Be prepared to do a brief (five-minute) demo of your program
in class Wednesday, including an explanation of your approach to the problem
and any features of your solution that strike you as interesting.
What to do
For this assignment, try to duplicate the experiment described
on pages 202-204 of Jurafsky and Martin. In particular, you should
collect unigram and bigram (and possibly trigram and quadrigram) information
from a corpus, and generate random sentences based on that corpus.
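The assignment doesn't prescribe a language or design, so here is one minimal Python sketch of the bigram case: count how often each word follows each other word, then generate text by repeatedly sampling a successor in proportion to its count. The function names (`bigram_counts`, `generate`) and the period-as-sentence-end convention are illustrative choices, not requirements.

```python
import random
from collections import defaultdict

def bigram_counts(tokens):
    """Count successors: counts[w1][w2] = number of times w2 follows w1."""
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return counts

def generate(counts, start, max_words=20):
    """Walk the bigram table, sampling each next word weighted by its count."""
    word, sentence = start, [start]
    for _ in range(max_words):
        successors = counts.get(word)
        if not successors:
            break
        words = list(successors)
        weights = [successors[w] for w in words]
        word = random.choices(words, weights=weights)[0]
        sentence.append(word)
        if word == ".":          # treat the period token as end of sentence
            break
    return " ".join(sentence)
```

Unigram generation is the degenerate case (sample from overall word frequencies); trigrams and quadrigrams follow the same pattern with longer context keys.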
Make your program capable of using either unsmoothed or smoothed n-grams
(for smoothed, you should implement either Witten-Bell or Good-Turing
smoothing).
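As a sketch of one of the two smoothing options, Witten-Bell reserves probability mass for unseen bigrams based on how many distinct word types have been observed after each context. Under the usual formulation: a seen bigram gets c(w1 w2) / (c(w1) + T(w1)), and each unseen continuation gets T(w1) / (Z(w1) · (c(w1) + T(w1))), where T(w1) is the number of distinct types seen after w1 and Z(w1) = V − T(w1) is the number never seen after it. The function name and dict-of-dicts count layout are assumptions for illustration.

```python
def witten_bell_bigram(counts, vocab_size):
    """Return p(w2, w1), the Witten-Bell smoothed bigram probability.

    counts: dict mapping w1 -> {w2: count of bigram (w1, w2)}
    vocab_size: V, the number of word types in the vocabulary
    """
    def p(w2, w1):
        successors = counts.get(w1, {})
        c_w1 = sum(successors.values())   # total bigram tokens with context w1
        t = len(successors)               # T(w1): distinct types seen after w1
        z = vocab_size - t                # Z(w1): types never seen after w1
        if w2 in successors:
            return successors[w2] / (c_w1 + t)
        if z == 0 or c_w1 + t == 0:
            return 0.0
        return t / (z * (c_w1 + t))
    return p
```

A quick sanity check: the probabilities over the whole vocabulary for a fixed context should sum to 1.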
Try your code on a couple of corpora with distinctive styles (e.g. Shakespeare,
Hemingway, Jane Austen, James Joyce, the Old Testament, Dr. Seuss, the
Washington Post, etc.).
Project Gutenberg
provides a great source of public domain literary corpora.
Advice
Don't forget to include punctuation marks as word types.
Beware the temptation to write all your code before testing any piece of it.
Don't start generating random sentences before you are reliably computing
probabilities. It's good to plan for generality, but unigrams and bigrams
are more important than trigrams and quadrigrams, so it would be appropriate
to focus your initial development on bigrams.
Test your n-gram counting on very small corpora of your own devising
and compare against your manual computations.
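For instance, a toy corpus small enough to count by hand might look like the following (the variable names here are just one way to set it up): in "a b a b a .", the context "a" appears three times, and "b" follows it twice, so the unsmoothed estimate of P(b | a) should be 2/3.

```python
from collections import Counter

tokens = "a b a b a .".split()
bigrams = Counter(zip(tokens, tokens[1:]))
contexts = Counter(tokens[:-1])   # every token that has a successor

# By hand: "a" is a context 3 times; "b" follows it twice.
p_b_given_a = bigrams[("a", "b")] / contexts["a"]
assert bigrams[("a", "b")] == 2
assert abs(p_b_given_a - 2 / 3) < 1e-9
```

If your counting code disagrees with a check like this, fix that before touching generation or smoothing.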
Have fun, start early, and keep in touch.
Jeff Ondich,
Department of Mathematics and Computer Science,
Carleton College, Northfield, MN
55057,
(507) 646-4364,
jondich@carleton.edu