CS 204: Software Design

Analyzing on-line dictionary logs

Hand in by 9:50 AM Monday, Sep 27

The Ultralingua on-line dictionary is free to use for about half a dozen word look-ups per day. Its underlying services (conjugation of verbs, number translation, and several kinds of word look-up) also support the on-line services of several other companies. The current incarnation of this collection of tools has been in continuous service for about two years, and has generated about five and a half million log entries in that time. That's about 5 log entries per minute--hardly Google, but still quite a bit of data.

Each log entry looks something like this:

('2010-09-15 14:24:50', 'Ulod', 'Ultralingua', 'Onlinedictionary', '1453080969', 'english', 'french', 'define', None, 'crazy')
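Since the sample entry above has the form of a Python tuple literal, one way to read it is with ast.literal_eval. This is only a sketch: it assumes each entry sits on its own line of the log file in exactly this form, and parse_entry is a made-up helper name, not part of the assignment.

```python
import ast

def parse_entry(line):
    """Parse one log line into a tuple; return None if the line is malformed.

    Assumption: each line of the log is a Python tuple literal like the
    sample entry shown in the assignment.
    """
    try:
        entry = ast.literal_eval(line.strip())
    except (ValueError, SyntaxError):
        return None
    # Reject well-formed literals that are not tuples (e.g. a bare number).
    return entry if isinstance(entry, tuple) else None

sample = ("('2010-09-15 14:24:50', 'Ulod', 'Ultralingua', 'Onlinedictionary', "
          "'1453080969', 'english', 'french', 'define', None, 'crazy')")
entry = parse_entry(sample)
print(len(entry))   # number of fields in the sample entry
print(entry[0])     # first field of the sample entry
```

ast.literal_eval only evaluates literals, so it is safer than eval on data you did not write yourself. It may be too slow for the full file; timing it on a small sample first is a good idea.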

In order, the fields of the log entry are:

Here is the complete log file. Note that it is nearly 90MB zipped and over 850MB unzipped, so make sure you have enough space on your computer to handle it. Also, the system is set up so that you can only download this data file from a Carleton IP address.

    The goal

    For this project, you will write a command-line program that filters the log file in various ways to produce useful reports on aspects of the data. The command-line syntax of your program will be:

    python loganalyzer.py [options] [logfile]

    Your program will print all its output to standard output (via print or sys.stdout.write). If the logfile command-line argument is present, then your program should take input from the specified logfile. Otherwise, your program should take input from sys.stdin.
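The input rule above (named logfile if present, otherwise sys.stdin) can be sketched with argparse, as below. The function names build_parser, open_log, and main are assumptions for illustration, and the report options are deliberately left out because they are not shown here. The loop body just echoes its input as a placeholder for real filtering.

```python
import argparse
import sys

def build_parser():
    """Build a parser with an optional positional logfile argument."""
    parser = argparse.ArgumentParser(prog='loganalyzer.py')
    # The required report options would be registered here; this sketch
    # omits them on purpose.
    parser.add_argument('logfile', nargs='?', default=None,
                        help='log file to read (default: standard input)')
    return parser

def open_log(path):
    """Return the named file opened for reading, or sys.stdin if path is None."""
    if path is None:
        return sys.stdin
    return open(path, 'r')

def main(argv=None):
    """Read the chosen input line by line and write results to standard output."""
    args = build_parser().parse_args(argv)
    stream = open_log(args.logfile)
    try:
        for line in stream:
            sys.stdout.write(line)  # placeholder: echo each line unchanged
    finally:
        # Never close sys.stdin; only close files this function opened.
        if stream is not sys.stdin:
            stream.close()
```

Calling main() at the bottom of loganalyzer.py would then make both "python loganalyzer.py somefile" and "python loganalyzer.py < somefile" read the same data. Reading line by line, rather than loading the whole file, keeps memory use small on an 850MB input.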

    The required command-line options are:

    A good job on these required elements will be worth a B for this assignment. To move into the A range, you will need to implement at least one non-trivial additional feature. Some possibilities include:

    What to hand in

    Hand in via the Courses folder (Courses/f10/cs/cs204-00-f10/Student Work/youraccount/hand-in/) a folder called "loganalyzer". In this folder, include:

    Things to keep in mind

    Use subsets of the data to test your program. It will be too slow to do every little test on the full data set.
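One simple way to make a test subset is to copy the first few thousand lines of the log into a smaller file. The sketch below assumes one entry per line; take_lines and write_subset are hypothetical helper names, and the file names in the usage note are placeholders.

```python
import itertools

def take_lines(lines, max_lines):
    """Yield at most max_lines items from any iterable of lines."""
    return itertools.islice(lines, max_lines)

def write_subset(src_path, dst_path, max_lines=10000):
    """Copy at most max_lines lines from src_path to dst_path; return the count.

    Files are opened in binary mode so that lines are copied byte for byte,
    whatever text encoding the log happens to use.
    """
    count = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in take_lines(src, max_lines):
            dst.write(line)
            count += 1
    return count
```

For example, write_subset('fulllog.txt', 'sample.txt', 5000) would produce a file you can pass to loganalyzer.py as the logfile argument. Keep in mind that the first lines of a log may not be representative of the whole thing, so it is worth running on the full data set once your program seems to work.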

    Don't keep lots of 0.8GB files lying around.

    Please don't share this data widely, and please delete it when you're all done.