CS 357: Natural Language Processing

Counting n-grams

For this assignment, you will write a program in the programming language of your choice to collect the complete list of n-grams from an input corpus, along with the occurrence count for each n-gram.

More details

Your program should be invokable from the Linux command line via the syntax:

(C/C++) ngramcollector n
(Java) java NGramCollector n
(Perl) perl ngramcollector.pl n
(Python) python ngramcollector.py n

where n is a positive integer representing the "n" of the n-grams whose counts you are collecting. Your program should take input from standard input and print its output to standard output. For example, if you run the command "java NGramCollector 3 < ulysses.txt > ulyssestrgrams.txt", your program will dump the trigrams found in ulysses.txt into ulyssestrigrams.txt.
Each line of your output will consist of an integer followed by a comma, followed by the text of an n-gram. Your output should be sorted in decreasing order by frequency. For example:

30,in the
27,the cat
25,a big
22,the mouse
...
For the purposes of this assignment, define a "word" to be any maximal contiguous block of Latin letters. This definition will have some negative implications. For example, "don't" will be treated as the two words "don" and "t". Nonetheless, it will simplify your work. If you write your code in a modular way (for example, by writing some sort of "getWord" function/method), you should be able to change the definition of word later without changing the rest of your code.

Suggestions

Test your code on small corpora of your own devising and compare the output against your manual computations. Then test on some big text files (see Project Gutenberg for a large collection of handy text files) to get a feel for your program's running time, and to seek out bugs not triggered by the smaller test file.

Many applications of n-grams depend upon computing the probability that the next word is W, given that the previous n - 1 words were W[1],...,W[n-1]. You should try to design your data structure in such a way that this probability computation is fast and easy to perform, because I will be asking you to use your n-gram code for just this purpose at a later time.

It is always a good idea to think carefully about your class structure and your data structures before sitting down to code. In this case, how you go about storing an n-gram so as to quickly find its current count is essential. Similarly, good class and method design will ease your coding job. We will discuss both of these issues in class.

Have fun, start early, and keep in touch.

Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu