Decision Trees

Write a program in Python to implement the ID3 decision tree algorithm. You should read in a tab delimited dataset, and output to the screen your decision tree and the training set accuracy in some readable format.

Here are two sample datasets you can try: tennis.txt and titanic2.txt.

(Sources: tennis.txt is from our textbook, titanic2.txt is from Vanderbilt University Department of Biostatistics)

When you run your program, it should take a command-line parameter that contains the name of the file containing the data. For example:

python decisiontree.py tennis.txt

The first line of the file will contain the names of the fields. The last column is the classification attribute, and will always contain the values yes or no. All files are tab delimited.
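For instance, a reader for this format might look like the following sketch (`read_dataset` is a hypothetical helper name, not a required interface):

```python
import sys

def read_dataset(path):
    """Read a tab-delimited file whose first row names the fields.

    Returns (header, rows), where header is a list of field names and
    rows is a list of lists of string values.
    """
    with open(path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    return rows[0], rows[1:]

if __name__ == "__main__":
    header, data = read_dataset(sys.argv[1])
    print(header)
```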

For output, you can choose how to draw the tree so long as it is clear what the tree is. One really neat way to go would be to use Graphviz, which is a mini programming language for drawing graphs. Graphviz is installed on all of our department systems. Alternatively, you can use text. You might find it easier if you turn the decision tree on its side, and use indentation to show levels of the tree as it grows from the left. For example:


outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = TRUE: no
|  windy = FALSE: yes

You don't need to make your tree output look exactly like above: feel free to print out something similarly readable if you think it is easier to code.
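If you go the text route, one way to sketch the sideways printer is a recursive function over a nested structure. The `(attribute, branches)` tuple representation below is just one possible choice for storing the tree:

```python
def print_tree(node, depth=0):
    """Print a decision tree sideways, one level of indentation per depth.

    A node is either a class label (str) for a leaf, or a tuple
    (attribute, branches) where branches maps each attribute value
    to a subtree.
    """
    prefix = "|  " * depth
    if isinstance(node, str):            # degenerate tree: a single leaf
        print(prefix + node)
        return
    attribute, branches = node
    for value, subtree in branches.items():
        if isinstance(subtree, str):     # leaf: print class on the same line
            print(f"{prefix}{attribute} = {value}: {subtree}")
        else:
            print(f"{prefix}{attribute} = {value}")
            print_tree(subtree, depth + 1)
```

Calling `print_tree(("outlook", {"sunny": ("humidity", {"high": "no", "normal": "yes"}), "overcast": "yes"}))` reproduces the first few lines of the example above.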

You may find Python dictionaries especially useful here, as they will give you a quick and easy way to help manage counting the number of times you see a particular attribute. But beware: read the FAQs below.

Here are some FAQs that I've gotten in the past regarding this assignment, and some I might get if I don't answer them now.

Is it possible that some value, like "normal," could appear in more than one column?
Yes. In the tennis dataset, for example, in addition to the column "humidity", we might have had another column called "skycolor" which could have values "normal," "weird," and "bizarre."
Could "yes" and "no" appear as possible values in columns other than the classification column?
Yes. In the tennis dataset, for example, the "windy" column could just as easily have used the values "yes" and "no" instead of TRUE and FALSE.

In other words, don't use just the value as a key to a single dictionary that encompasses all attributes.
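One way to follow that advice is to key a counting dictionary on (attribute, value) pairs rather than on bare values, so "normal" under "humidity" cannot collide with "normal" under some other column. The names here are illustrative:

```python
from collections import defaultdict

# Count occurrences of each (attribute, value) pair, not just the value,
# so the same value appearing in two columns is counted separately.
header = ["outlook", "humidity", "skycolor", "play"]
rows = [["sunny", "normal", "normal", "yes"],
        ["rainy", "high", "weird", "no"]]

counts = defaultdict(int)
for row in rows:
    for attr, value in zip(header, row):
        counts[(attr, value)] += 1
```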
How do I use dictionaries in Python?
Check out Jeff's awesome examples.

Bonus work

You'll receive 18 out of 20 points if you do all of the above correctly based on the ID3 algorithm for decision trees (i.e., using information gain) that we discussed in class. For the 19th point (which I'm considering optional), print out a second decision tree obtained by performing chi-squared pruning on the first decision tree. Finally, for the optional 20th point, run your exercise again with leave-one-out cross-validation, and measure the test set accuracy of both trees. Did pruning help for either or both of these datasets?
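The information-gain computation at the heart of ID3 can be sketched in a few lines (function names are illustrative, not a required interface):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, col):
    """Reduction in entropy from splitting the rows on column index col."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[col], []).append(label)
    remainder = sum(len(sub) / total * entropy(sub)
                    for sub in subsets.values())
    return entropy(labels) - remainder
```

ID3 then splits on whichever attribute maximizes this gain, recursing on each branch's subset of rows.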

One irritating challenge you may find when using leave-one-out cross-validation is that a particular value may appear in only a single data point. That value is then seemingly absent from the training set when you go to build your decision tree with that point left out for cross-validation purposes. You'll need to handle this appropriately; when building the decision tree, your program will need to know about all potential values in the dataset, even if they don't appear in the training set for a given fold.
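One way to handle this is to collect each attribute's full set of values from the whole dataset once, before splitting into folds, and pass that mapping to the tree builder. The helper name below is just a suggestion:

```python
def attribute_domains(header, rows):
    """Map each attribute name to the set of every value it takes anywhere
    in the full dataset, so a fold's training set still knows about values
    that occur only in the held-out point."""
    domains = {name: set() for name in header}
    for row in rows:
        for name, value in zip(header, row):
            domains[name].add(value)
    return domains
```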