CS 117 Assignment: Counting Words

Assigned Wednesday, 10/26/05
Due 11:59PM Tuesday, 11/1/05

You may work with a partner for this assignment. In fact, I encourage you to do so, since I think you will benefit from discussing the project with somebody else. Submit your source code via HSP.

The problem

Last time I asked you to write a program to manipulate words from a text file, you didn't know about arrays or ArrayLists. This left you in the position of doing only computations that could be completed while storing only one word at a time. Computing the average word length, for example, can be done by keeping a running total of the lengths of the words seen so far: get a word, add its length to the total, repeat.

As it happens, the most interesting computations involving words can't be done by storing only one word at a time. A simple example is the preparation of a list of all the words in a document, along with the number of times each word appears in the document. You might argue that counting words isn't especially interesting, but in fact word counts have many remarkable applications. Word counts have been used, for example, to identify the authors of those Federalist Papers whose authors were previously unknown (Frederick Mosteller and David L. Wallace. Inference in an Authorship Problem. A Comparative Study of Discrimination Methods Applied to the Federalist Papers. Journal of the American Statistical Association 58 (1963): 275-309). For another, most automatic speech recognition (computer dictation) systems use counts of words, word pairs, and word triples to help figure out what sentence you have spoken into a microphone.

For this assignment, you will write a program that takes a file as input and produces a list of the words found in the file along with the number of times each word appears in the file. Your list of words and their counts will be displayed twice--once sorted by count, largest count first, and once sorted alphabetically. Your program's output will look something like this:

==== Alphabetical =====
aardvark 2
and 27
are 14
as 8
asp 1
...


==== Sorted by count ====
the 45
it 28
and 27
...

Constraints and suggestions

Take a look at WordCountTester.java. The main method in WordCountTester instantiates a Scanner and a WordCounter, and then calls a few methods on its way to displaying the required lists.

Your job is to create the WordCounter class and all the public methods required by WordCountTester. I will leave the organization of the WordCounter class to you, but you should make sure that whatever you write, it works with WordCountTester without changing any of the WordCountTester code.

How would I approach this program? Like so:

  1. First, I would create a WordCounter.java with stubs for all the methods WordCountTester expects WordCounter to have. Once you have created these stubs correctly, you should be able to compile WordCounter.java and WordCountTester.java and run the resulting program. Since your methods will be stubs, the program won't really do anything yet, but you'll at least have a shell that compiles and runs.

    What methods will you need?

  2. Second, I would decide on instance variables for WordCounter. Here, I have a little suggestion. I would create a class called WordWithCount or some such thing. This class would have two public instance variables: a String called word and an int called count. The whole point of the WordWithCount class would to be to represent an ordered pair of a word and its count--like ("the",47) or ("aardvark",2). This class wouldn't need any methods--maybe a constructor, but nothing else.

    Once I had a WordWithCount class, I would give WordCounter exactly one instance variable: an ArrayList<WordWithCount>. That is, the data contained by our WordCounter instance would be a list of (word,count) pairs.

  3. Next, I would write a very simple version of countWord. Have it instantiate a new WordWithCount object containing the specified word, and add this object to the ArrayList of words. This won't solve the problem of increasing the word count when you encounter a word for the second or third or fourth time, but it will be a start.

    Then write a simple version of printReportAlphabetized to just print out the contents of the ArrayList in whatever order it happens to be in.

    Neither of these methods will be in their final forms at this point, but if you do as I have suggested, you will now have a running program that reads words from a file, adds them to a list, and then prints out the list. This isn't the end goal, but it's an important intermediate goal, and not terribly difficult to achieve.

  4. I think I would next write a private method called sortAlphabetical that would alphabetize the ArrayList of (word,count) pairs. Then, adding a single sortAlphabetical() call to the beginning of printReportAlphabetized, I would have a program that prints out the words from the file in alphabetical order. Again, not the final goal (the counts will all be 1 and repeated words will appear multiple times), but it's inching closer to the end.

  5. I would continue to ignore printReportByCount, and now return to repair countWord. The thing to do here is to scan through the ArrayList, looking for the word being counted. If you find it, then you can increase its count. Otherwise, instantiate a new WordWithCount as you did a couple steps ago and add it to the list.

    Success? Now you have a program that prints out an alphabetized list of the words in the file, with counts and no word repetitions. Just like you want.

  6. All that's left is to produce the list sorted by count. Again, I would add a private method to do the sorting (sortByCount()), call it at the start of printReportByCount, and then use the same code to print the ArrayList as I wrote for printReportAlphabetical.

  7. And that, my friends, is that. Not easy, not a short process. But if you do it in small steps like this, with a functioning (though not complete) program at the end of each step, you'll make your way to a completed program in an orderly way. This approach takes a certain amount of discipline, but it pays back in lots of ways, both technical (better design) and emotional (less chaos and panic).

In case you want it...

Here's my version of WordWithCount.java, which you may use or not use, as you see fit.

Good luck

Once you have this program, it's fun to count words in all sorts of stuff. Most common word in Hamlet? Number of times "Elizabeth" appears in Pride and Prejudice? No problem.

Start early, have fun, and keep in touch.