N-grams

Due 5:00 Friday 4/1/05--no fooling. You may work in pairs or individually. Submit your code via HSP .

For this and later assignments, you should be prepared to do a brief (five to ten minutes) demo of your program in class on the due date. I will pick two or three groups to do these presentations, and we will then discuss as a class the kinds of problems you faced, what sorts of solutions you arrived at, what kinds of questions you still have, how your code might be improved, etc.

What to do

Write a program that collects the complete list of n-grams from an input corpus, along with the occurrence count for each bigram. Your program should be invokable from a Linux command line, and should allow "n" to be specified at the command line. The program should read the corpus from standard input and print the list of n-grams, sorted in decreasing order by occurrence count, to standard output. The language you use is up to you.

Try your code on a couple corpora with distinctive styles (e.g. Shakespeare, Hemingway, Jane Austen, James Joyce, the Old Testament, Dr. Seuss, the Washington Post, etc.). Project Gutenberg provides a great source of public domain literary corpora. The Library of Congress Thomas web site has lots of neato government documents. I'm sure you can dig up song lyric archives, movie scripts, Watergate hearing transcripts, or whatever. There are many big text files out there.

Design considerations and advice

Think about how to define a word type. Should punctuation be ignored? Should a word with trailing punctuation (e.g. "ignored?") be a word type? Should each punctuation mark be its own word type?...

Test your code on very small corpora of your own devising and compare against your manual computations.

Many applications of n-grams depend upon computing the probability that the next word is W, given that the previous n-1 words were W[1],...,W[n-1]. You should try to design your data structure in such a way that this probability computation is fast and easy to perform, because I will be asking you to use your n-gram code for just this purpose at a later time.

Have fun, start early, and keep in touch.





Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu