Type: Group Assignment

Due: by 9:30 PM on Tuesday, November 6, 2018

Goals

The goal of this assignment is to give you practice with implementing and navigating binary search trees. You will also see the idea of a “dictionary” or “map” put to use in a text analysis application.

Setup and Instructions

This is a partner project, and so should be worked on with the partner I assigned to you (if you have a partner). Remember that pair programming requires you to be working at the same computer at the same time!

Download hw8.zip and unzip it into the directory that you’ve created for your work on this assignment. There is a Java class called WordCloudMaker.java (originally written by Sherri Goings, with a growing list of modifications by others). There are also a few sample documents and a stopwords file called StopWords.txt.

Building a Word Cloud

You may have seen word clouds like the one below, which was generated from the text from “Anne of Green Gables” by Lucy Maud Montgomery. A word cloud takes the most frequent words in a document and displays them in a jumbled arrangement. The key feature of a word cloud is that the size of each word is proportional to the number of times that word appears in the text. (Note that very common words, generally called “stopwords,” are excluded from the word cloud. Otherwise, our clouds would be dominated by “the”, “and”, “a”, etc.)

In this assignment, you will implement a binary search tree to count the number of times each word appears in a given text, allowing you to make your own word clouds!

Word cloud of Anne of Green Gables

Overall Program Structure

Your program will come in three main pieces:

  1. A class called WordCountMap consisting of a binary search tree that maintains a record of words and their counts. (WordCountMap will make use of two very small classes called Node and WordCount described below.)

  2. A class called WordCounter containing your main program and any support methods it needs. The main program will be in charge of opening a text file, counting all of the words it contains, and producing one of three types of output.

  3. A class WordCloudMaker.java (provided to you) that will transform a list of word counts into HTML that shows a simple word cloud.

WordCountMap

First you’ll build a class called WordCountMap. Each object of this class will represent a binary search tree where each node of the tree contains a word and the count of the number of times that word appears. The tree should be organized according to the alphabetical ordering of the words. (What instance variables might you want for a binary search tree?)

Specifically, this class will need to make use of two other classes. The first is a Node class which should look something like the following:

private class Node
{
    private String word;
    private int count;
    private Node left;
    private Node right;
}

The exact details of the Node class are up to you, since it’s a class that will be invisible to users of WordCountMap. You may add other (private) methods to this class, but it should be a nested class within your WordCountMap class.

The other class you will need is a WordCount class. This small class will allow you to create an object that contains a word and its associated count. Please make this class in a separate file WordCount.java. This is a relatively simple class that should look something like the following (plus whatever methods/constructors you choose to add):

public class WordCount implements Comparable<WordCount>
{
    public String word;
    public int count;
}

You must use these exact instance variables and make them public, as the WordCloudMaker class will depend on accessing them directly. Since this class implements the Comparable interface, you’ll have to implement a compareTo method. In particular, you should use the instance variable count to do this comparison. If you aren’t sure what this method should do, take a closer look at the Javadoc description of it in the Comparable interface.

Your WordCountMap class will need to implement the following three methods:

/**
 * If the specified word is already in this WordCountMap, then its
 * count is increased by one. Otherwise, the word is added to this map
 * with a count of 1.
 */
public void incrementCount(String word)
{
    ...
}

 /**
 * Returns an array list of WordCount objects, one per word stored in this
 * WordCountMap, sorted in decreasing order by count.
 */
public ArrayList<WordCount> getWordCountsByCount()
{
    ...
}

 /**
 * Returns a list of WordCount objects, one per word stored in this
 * WordCountMap, sorted alphabetically by word.
 */
public ArrayList<WordCount> getWordCountsByWord()
{
    ...
}

You may (and probably should!) create other methods to help you implement these methods in your class. Make sure that you implement these methods exactly as described, as the grader and I will be building some testing scripts that use these methods. Lastly, note that ArrayList should be the built-in Java implementation of a List that uses an array to store the data.

WordCounter

This class will contain your main program and any support methods it needs. The main program will be in charge of opening a text file, counting all of the words it contains, and producing one of three types of output. Specifically, your main method here will parse the command-line arguments to support the following three ways of running the program:

a. You should be able to run the program using the following commands:

java WordCounter alphabetical [textFileName]

This will print out a list of words and their occurrence counts, one word per line, each line consisting of a word, a colon, and the word’s count. This list will be sorted alphabetically by word. For example, the output might look like the following:

$ java WordCounter alphabetical FederalistPapers.txt

abandon: 5
abandoned: 3
abandoning: 1
abate: 1
abatements: 1
abbe: 4
abetted: 2
abhorrence: 1
abide: 1
...

b. You should be able to run the program using the following commands:

java WordCounter frequency [textFileName]

This is the same as the the alphabetical case, except the words will be sorted in decreasing order by frequency (or count). For example, the output might look like the following:

$ java WordCounter frequency FederalistPapers.txt

states: 865
government: 830
state: 792
its: 658
power: 620
people: 615
no: 601
those: 481
constitution: 468
...

c. You should be able to run the program using the following commands:

java WordCounter cloud [textFileName] [numberOfWordsToInclude]

This command will write HTML to a file containing a word cloud based on the code I’ve given you (read through the code and comments in WordCloudMaker.java to figure out what methods you should call to do this). A typical invocation of the cloud generator would be:

$ java WordCounter cloud FederalistPapers.txt 40

This would generate the word cloud based on the 40 most common non-stopwords in FederalistPapers.txt. (If FederalistPapers.txt contains fewer than 40 non-stopwords, then the cloud will just use all the words.)

In your code, you will need to create a reasonable filename to contain the HTML. Make sure that it has the extension .html at the end of it. Ideally, you would take the name of the text file being passed in and use that to name the HTML file, though I will not take off points for how you create the filename. So, for instance, the command:

$ java WordCounter cloud FederalistPapers.txt 40

would create a file called FederalistPapers.html, which, when opened in an internet browser, might look like the following:

Word cloud of the Federalist Papers

Other Notes, Requirements and Suggestions

  • Make sure that WordCountMap implements a binary search tree based on the alphabetical ordering of the words. You should not use other data structures included with Java to do this. You must implement the tree structure directly yourself.

  • I’ve provided you a file called StopWords.txt that contains a list of common English stop words. You should write your code so that any word which appears in this list is not included in the results that you produce. How exactly you choose to do this is up to you, but you should think about the efficiency of your code for this portion.

  • At least one of your getWordCountsBy methods in your WordCountMap should use a tree traversal to get the list of words and counts in the correct order. For the other method, you can use built in Java methods if you’d like (such as Collections.sort()).

  • We talked briefly about command-line arguments at the beginning of the term, though we did not dive into them in great depth. If you don’t remember this well, look back at the code I gave you for HW1 (LineReader.java) or in section A.10 of your book for more information—or ask questions about it!

  • I do not care if you treat “alacrity” or “Alacrity” as the same word or different words for the purpose of this assignment. However, note that StopWords.txt only contains words in lower case. As such, when determining whether a word is a stopword, case should be ignored.

  • When reading from a text file, you should strip punctuations from words. One way to do this is to use the String method replaceAll(). In particular, this method can use something called a regular expression to identify and then remove certain patterns of characters. If you want to find and remove any character in a String s that is NOT a letter, you could use the following line of code:

s = s.replaceAll("[^a-zA-Z]","");
  • You do NOT need to make any changes to WordCloudMaker.java. But if you want to make changes, go ahead! The same applies to StopWords.txt.

  • Feel free to try your word cloud on other documents! Many can be found at Project Gutenberg.

  • Keep modularity and encapsulation in mind as you design your code. Make sure your comments indicate what the instance variables you choose represent.

Submission and Grading

Submit your assignment by placing the following files into an assignment8 directory on the courses drive. Note that only one partner from each pair needs to submit the assignment.

  • WordCountMap.java
  • WordCounter.java
  • WordCloudMaker.java
  • WordCount.java
  • Any other files used by your code.

This assignment is worth 24 points:

  • 18 points for the assignment requirements.
  • 6 points for comments, style and design.

Extra Challenge?

There are a number of things you can do with this assignment if you are looking for an extra challenge. Here are a few suggestions.

  • Implement a self balancing tree (although if you do this, please do submit your code for your regular binary search tree as well). Here’s your chance to practice AVL Trees, or to try out one of the other self-balancing trees discussed in the book!
  • Add additional functionality to your WordCountMap class. Maybe a decremenCount() method. Maybe add a different kind of tree traversal.
  • Play around with the WordCloudMaker.java code to change the look of the cloud you create. How might we adjust position or font based off of the data being shown? Are there other uses for color that we might want?

Start early, ask lots of questions, and have fun! Your instructor, the lab assistants, and the prefect are all here to help you succeed!