CS 204: Software Design

An e-mail file summarizer

One of the most widely used mechanisms for storing e-mail is the mbox family of file formats. For this assignment, you are going to write a program that produces various forms of summary information based on the contents of one or more mbox files, using the mbox variant used by the qmail system.

Step 1: Get to know message headers and the mbox format

First, open up your e-mail client and look at the full text, including headers, of some message. In Thunderbird, you do this by selecting the message and then picking Message Source from the View menu. In Zimbra, the only way I've figured out how to do this is to double-click on the message, and then right-click on the message Fragment, choosing Show Original from the popup menu. All clients allow you to see the message headers, but you'll have to figure this out on whatever client you use.

It's interesting to do this same thing with a variety of messages, including some with non-ASCII characters (e.g. the word resumé) and some with attachments.

Once you've taken a look at some e-mail headers, read up on the qmail mbox format.

Step 2: Collect test data

Unless you control a server that uses qmail, you probably don't have ready access to a suitable mbox file on which to test your program. Therefore, we will use the power of our little community to generate a reasonably sized mbox file on which you can all test your programs.

Start by sending yourself an e-mail message. Then reply, reply to the reply, etc. until you have a small thread of 4 or 5 e-mail messages. Next, use your e-mail client's "view source" or "view headers" feature to extract the full e-mail messages and copy them into a text file. Finally, insert suitable "From " lines at the top of each message's headers, leaving you with a small mbox file consisting of all the messages in your fictional thread. If you want to modify the e-mail addresses of the senders and recipients, go right ahead.
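Since every message in the file begins with a "From " line, one way to think about reading an mbox file is as a split on those lines. Here's a minimal sketch, not a required design. The function name split_mbox and the sample addresses are made up for illustration, and the sketch assumes that body lines beginning with "From " have already been quoted as ">From ", as the qmail mbox description specifies.

```python
def split_mbox(text):
    """Split mbox text into messages; each message is a list of lines."""
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            # A "From " line (note the trailing space) starts a new message.
            current = [line]
            messages.append(current)
        elif current is not None:
            # Header and body lines belong to the most recent message.
            current.append(line)
    return messages


# Hypothetical two-message thread, in the shape Step 2 asks you to build.
sample = (
    "From alice@example.com Sat Jan 12 10:00:00 2013\n"
    "From: alice@example.com\n"
    "Subject: emus gone wild\n"
    "\n"
    "Have you seen this?\n"
    "\n"
    "From bob@example.com Sat Jan 12 10:05:00 2013\n"
    "From: bob@example.com\n"
    "Subject: Re: emus gone wild\n"
    "\n"
    ">From what I hear, yes.\n"
    "\n"
)

messages = split_mbox(sample)
print(len(messages))  # prints 2
```

Note that the "From: " header line and the quoted ">From " body line do not start a new message, because neither begins with "From" followed immediately by a space.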

Once you have your messages ready to go, send them to me at jondich@carleton.edu. Please do this by 5:00PM Sunday, Jan 13, so I can compile them all into a single mbox test file and send it to you before class Monday.

Warning: It's OK to make your messages funny, but please, let's keep them clean and friendly.

Step 3: Write your summarization program

For this assignment, you will write a program to provide several different kinds of summaries of mbox files.

Your program should be invokable from the Linux command line, like so:

python mboxsummarizer.py action [inputfile]

Here, the mandatory "action" argument can be one of three possibilities: "--thread", "--date", or "--sender". The "inputfile" argument is optional (as indicated by the square brackets). If inputfile is specified, then your program should take input from the named file. If inputfile is not specified, then your program should take input from standard input. Your program should send output to standard output. This combination of input and output behaviors supports commands like the following:

cat mbox* | python mboxsummarizer.py --thread > threadsummary.txt

Assuming there are a bunch of files whose names start with "mbox", this command will concatenate those files together and store a thread summary of the combined mbox files in threadsummary.txt.
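The file-or-standard-input behavior can be handled in one small function. This is only a sketch under a few assumptions: the name open_input is hypothetical, the argument list is laid out like sys.argv (script name, action, optional inputfile), and the choice to read as UTF-8 with undecodable bytes replaced is mine rather than something the assignment requires.

```python
import sys


def open_input(argv):
    """Return a text stream: the named file if one was given, else stdin."""
    if len(argv) > 2:
        # argv[2] is the optional inputfile argument.
        return open(argv[2], "r", encoding="utf-8", errors="replace")
    return sys.stdin
```

With this in place, the rest of the program can read from the returned stream without caring whether the data came from a file or from a pipe.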

Here's how your program should behave for each of the possible "action" values:

--thread: Your program should print out a list of all the subject lines in decreasing order of the number of messages in the corresponding thread. For example, if your inbox has 12 messages titled "the big meeting" or "Re: the big meeting" or "Fwd: Re: the big meeting", etc., your output might look like this:

the big meeting 12
check out this video! 9
emus gone wild 3
I'll be home early 1

Note that each output line should begin with the original subject line of the thread in question (no Re:'s or Fwd:'s), followed by a tab character, and finally the number of messages in the thread.
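To make the thread grouping concrete, here is one possible approach: strip leading "Re:" and "Fwd:" prefixes from each subject, then count. The names thread_subject and thread_summary are hypothetical, and treating "Fw:" and differences in capitalization as equivalent prefixes is an assumption on my part; you may decide to handle other reply markers too.

```python
import re
from collections import Counter

# One or more leading "Re:", "Fwd:" or "Fw:" prefixes, in any letter case.
PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def thread_subject(subject):
    """Return the subject with reply/forward prefixes removed."""
    return PREFIX.sub("", subject).strip()


def thread_summary(subjects):
    """Return 'subject<TAB>count' lines, largest thread first."""
    counts = Counter(thread_subject(s) for s in subjects)
    return [f"{subject}\t{count}" for subject, count in counts.most_common()]


subjects = [
    "the big meeting",
    "Re: the big meeting",
    "emus gone wild",
    "Fwd: Re: the big meeting",
]
for line in thread_summary(subjects):
    print(line)
```

Counter.most_common() returns entries in decreasing order of count, which matches the ordering this action calls for.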

--sender: Your program should print out a list of all the distinct sending e-mail addresses, in decreasing order of the number of messages sent by that sender. As with --thread, the output lines should be e-mail address followed by tab followed by count.
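For the sender summary, the standard library's email.utils.parseaddr can pull the address out of a From: header value such as "Alice Example <alice@example.com>". A rough sketch follows; the function name sender_summary is hypothetical, and lowercasing addresses so that differently capitalized copies count as one sender is an assumption, not a requirement.

```python
from collections import Counter
from email.utils import parseaddr


def sender_summary(from_headers):
    """Return 'address<TAB>count' lines, most frequent sender first."""
    counts = Counter()
    for value in from_headers:
        _name, address = parseaddr(value)
        if address:
            # Assumption: addresses are compared case-insensitively.
            counts[address.lower()] += 1
    return [f"{address}\t{count}" for address, count in counts.most_common()]


headers = [
    "Alice Example <alice@example.com>",
    "bob@example.com",
    "ALICE@example.com",
]
for line in sender_summary(headers):
    print(line)
```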

--date: Your program should print out a list of all the message dates in date order, along with the number of messages per day (date, tab, count).
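Date: headers can be parsed with email.utils.parsedate_to_datetime from the standard library. The sketch below is one way to do the per-day count, with several assumptions labeled: the name date_summary is hypothetical, dates are printed in YYYY-MM-DD form (the assignment doesn't specify a format), each message is counted on the calendar day in the sender's own time zone, and headers that fail to parse are silently skipped.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def date_summary(date_headers):
    """Return 'date<TAB>count' lines in chronological order."""
    counts = Counter()
    for value in date_headers:
        try:
            day = parsedate_to_datetime(value).date()
        except (TypeError, ValueError):
            # Assumption: unparseable Date: headers are ignored.
            continue
        counts[day] += 1
    # Sorting the (date, count) pairs orders them by date.
    return [f"{day.isoformat()}\t{count}" for day, count in sorted(counts.items())]


dates = [
    "Sun, 13 Jan 2013 09:00:00 -0600",
    "Sat, 12 Jan 2013 10:00:00 -0600",
    "Sat, 12 Jan 2013 23:30:00 -0600",
]
for line in date_summary(dates):
    print(line)
```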

unknown action: If your program is called with an action argument other than one of the three named above, it should print out a usage statement to help the user remember how the command is called.

no action: If your program is called with no command-line arguments (just "python mboxsummarizer.py"), it should again print out the usage statement.
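Both the "unknown action" and "no action" cases can be caught by a single check before any input is read. Here's a small sketch; the names ACTIONS, USAGE, parse_action, and check_arguments are hypothetical, the wording of the usage message is just an example, and returning a nonzero status for bad arguments is my choice rather than part of the assignment.

```python
ACTIONS = ("--thread", "--date", "--sender")
USAGE = "Usage: python mboxsummarizer.py (--thread | --date | --sender) [inputfile]"


def parse_action(argv):
    """Return the action in argv[1], or None if it is missing or unknown."""
    if len(argv) < 2 or argv[1] not in ACTIONS:
        return None
    return argv[1]


def check_arguments(argv):
    """Print the usage statement for bad arguments; return an exit status."""
    if parse_action(argv) is None:
        print(USAGE)
        return 1
    return 0
```

A program could call check_arguments(sys.argv) first and exit with the returned status if it is nonzero.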

Step 4: Deliver the program

Write a brief readme.txt file describing your program's status (what works and what doesn't).

Put all your source files (maybe just one, maybe more) plus the readme into a folder called mboxsummarizer, and submit it using the Collab/Courses system.

Questions? Let me know.