CS 322: Natural Language Processing

Part-of-speech tagging

As usual, hand in your code, data, and report to your "hand-in" folder, using HSP or not as you prefer.

For this assignment, you will implement and test a Hidden Markov Model-based part-of-speech tagger, as discussed in class and in Section 5.5 of the textbook. Follow this outline:

  1. Familiarize yourself with the Penn Treebank Project's part-of-speech tag set. See p. 131 of your textbook. If you want a more detailed discussion of this tag set, visit the Penn Treebank Project site and grab the detailed description of the tag set.
  2. Tag the fragment of Dr. Seuss's Green Eggs and Ham I gave you in class, and email your tagging to the rest of the class at cs322-00-f08@lists.carleton.edu. If you think somebody else's tagging is mistaken, we can discuss it via this mailing list or in class. Please complete this portion of the assignment by class time on Wednesday, October 29.
  3. Using the words, tags, and sentences that appear in our tagging of Green Eggs and Ham, train a Hidden Markov Model using 80% of the sentences. You may do this training in any language you find convenient, including a spreadsheet if you find that approach handy. We'll spend this week in class preparing you to do this step.
  4. Run your HMM as a part-of-speech tagger on the remaining 20% of the sentences.
  5. Put your report in a readme.txt file. Your report should include:
    • A description of how you initialized your HMM.
    • A brief description of the code you used to build your part-of-speech tagger and how to use it.
    • A list of your test sentences, showing their correct taggings and the taggings produced by your tagger.
    • The accuracy of your tagger: the number of words your tagger tagged correctly divided by the total number of words in your test set.
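
If it helps to see steps 3 and 4 and the accuracy computation in one place, here is a minimal Python sketch. Treat it as one possible approach, not the required one. It assumes a bigram HMM trained by counting, add-one smoothing (my choice here, not something the assignment requires), and Viterbi decoding in log space. The names train_hmm, viterbi, accuracy, and START are all made up for illustration, and the five toy sentences are invented; they are not our class tagging of Green Eggs and Ham.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sentence marker (an assumption of this sketch)

def train_hmm(tagged_sentences, alpha=1.0):
    """Count tag bigrams and (tag, word) pairs; return add-alpha smoothed log-probs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        # One extra vocabulary slot is reserved for words never seen in training.
        return math.log((emit[tag][word] + alpha) / (total + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit

def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for words under the trained model."""
    if not words:
        return []
    # Each column maps tag -> (best log-prob of a path ending in tag, backpointer).
    best = [{t: (log_trans(START, t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)
    # Start from the best final tag and follow backpointers to the front.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

def accuracy(test_sentences, tags, log_trans, log_emit):
    """Correctly tagged words divided by total words in the test sentences."""
    correct = total = 0
    for sentence in test_sentences:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        predicted = viterbi(words, tags, log_trans, log_emit)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total

# Invented toy data, for illustration only.
data = [
    [("I", "PRP"), ("like", "VBP"), ("ham", "NN")],
    [("I", "PRP"), ("like", "VBP"), ("eggs", "NNS")],
    [("you", "PRP"), ("like", "VBP"), ("ham", "NN")],
    [("I", "PRP"), ("eat", "VBP"), ("eggs", "NNS")],
    [("you", "PRP"), ("eat", "VBP"), ("ham", "NN")],
]
split = int(0.8 * len(data))           # 80% of the sentences for training
train, test = data[:split], data[split:]
model = train_hmm(train)
print(viterbi(["you", "eat", "ham"], *model))
print(accuracy(test, *model))
```

This sketch splits the sentences in their original order; you may prefer to shuffle them first. Smoothing matters more than you might expect with a corpus this small, since any word or tag bigram that appears only in the test sentences would otherwise get probability zero.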

You might find this spreadsheet HMM useful, or maybe not.