Identifying authors
Due midnight Monday 9/30/02. You may work in groups or individually.
What to do
Select and download corpora written by three authors with distinctive styles.
Also select, more or less randomly, at least five paragraphs apiece from other works written by these same
authors, and set those paragraphs aside.
Now, devise and implement a scheme for using n-gram data from the three corpora to guess
the author of a given paragraph. Feed your fifteen (or more) paragraphs into your guessing code
and record the results. If your code computes some sort of measurement of similarity between
the paragraph and each of the corpora, report the similarity measures as well.
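One possible shape for such a scheme, offered only as a rough sketch and not as the expected solution: count word bigrams in each corpus and in the paragraph, then pick the author whose corpus has the highest cosine similarity to the paragraph. The function names (ngram_counts, cosine_similarity, guess_author), the choice of word bigrams, and the choice of cosine similarity are all assumptions made for illustration; character n-grams, smoothed probabilities, or any other measure would serve as well.

```python
from collections import Counter
import math
import re


def ngram_counts(text, n=2):
    """Count word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_author(paragraph, corpora, n=2):
    """Return (best_author, scores); corpora maps author name -> corpus text."""
    para_counts = ngram_counts(paragraph, n)
    scores = {
        author: cosine_similarity(para_counts, ngram_counts(text, n))
        for author, text in corpora.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy placeholder strings, not real corpora.
    toy = {
        "A": "the cat sat on the mat the cat sat",
        "B": "a dog ran in the park a dog ran",
    }
    print(guess_author("the cat sat", toy))
```

This version recounts each corpus on every call, which is fine for fifteen paragraphs; with large corpora you would probably compute the corpus counts once and reuse them. The scores dictionary is the kind of similarity measure worth reporting alongside each guess.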
What to hand in
- On paper: A brief explanation of your strategy.
- On paper: A list of the corpora used, the sources of the extra paragraphs, and
a brief explanation of how you chose the extra paragraphs.
- On paper: A table showing the results of your author-guessing, including similarity measures, if any.
Summarize the results by giving the percentage of correct guesses. If you want to break it down
further (percentage correct for Dr. Seuss, percentage correct for Jane Austen, etc.), go right
ahead.
- Via HSP: Your code.
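For the results summary asked for above, the percentages can be tallied with a few lines of code. The sketch below assumes you have recorded each test as an (actual author, guessed author) pair; the function name summarize and that record format are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def summarize(results):
    """results: list of (actual_author, guessed_author) pairs.

    Returns (overall_percent_correct, {author: percent_correct}).
    """
    if not results:
        raise ValueError("no results to summarize")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for actual, guessed in results:
        totals[actual] += 1
        if actual == guessed:
            correct[actual] += 1
    overall = 100.0 * sum(correct.values()) / len(results)
    per_author = {a: 100.0 * correct[a] / totals[a] for a in totals}
    return overall, per_author
```

Passing in the fifteen recorded pairs gives the overall percentage plus the optional per-author breakdown in one call.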
Have fun, start early, and keep in touch.
Jeff Ondich,
Department of Mathematics and Computer Science,
Carleton College, Northfield, MN
55057,
(507) 646-4364,
jondich@carleton.edu