CS 257: Software Design

A Server Log Summarizer

When a browser requests a file from a web server, the server typically records the request in a log. For example, consider the following snippet from one of my own server log files (the web server in question is Apache). Here, we see that somebody is looking up the French word "repoir" (to repolish) in our French dictionary of definitions. Each line shows the requester's IP address, the date and time of the request, the specific file or service requested by the browser, the response code sent back to the browser by the server (200 = OK, 302 = "moved temporarily", etc.), and the number of bytes transferred. The two hyphens after the IP address are placeholders for fields the server did not record.

When this person went to http://ultralingua.com/onlinedictionary/ and asked for the French definition of "repoir", the browser actually made several requests: first for the main page, then for the stylesheets referenced by that page, then for the images that appear on the page, and finally for the "favicon.ico" file that, if present, would be shown in the browser's address bar and next to bookmarks for this page.

[Note that I have changed the IP address to hide the identity of the requester.]

123.123.123.123 - - [02/Feb/2007:22:42:26 -0600] "GET /onlinedictionary/?action=define&searchtype=stemming&text=repoir&service=french2french HTTP/1.1" 200 11825
123.123.123.123 - - [02/Feb/2007:22:42:26 -0600] "GET /onlinedictionary/style.css HTTP/1.1" 200 10681
123.123.123.123 - - [02/Feb/2007:22:42:26 -0600] "GET /onlinedictionary/mainStyles.css HTTP/1.1" 200 697
123.123.123.123 - - [02/Feb/2007:22:42:27 -0600] "GET /onlinedictionary/pictures/ulnet.jpg HTTP/1.1" 200 31466
123.123.123.123 - - [02/Feb/2007:22:42:27 -0600] "GET /onlinedictionary/pictures/rotatingAd.gif HTTP/1.1" 200 24129
123.123.123.123 - - [02/Feb/2007:22:42:27 -0600] "GET /favicon.ico HTTP/1.1" 302 305

A similar sample shows some other requests, but this time the IP address belongs to one of the many instances of Googlebot, the program that traverses the web pages of the world to make most of Google's services possible.

64.233.173.82 - - [03/Feb/2007:14:08:02 -0600] "GET /onlinedictionary/shared/portal/ HTTP/1.1" 200 45651
64.233.173.82 - - [03/Feb/2007:14:08:02 -0600] "GET /includes/en/menu_support/ULmenu1.js HTTP/1.1" 200 21009
64.233.173.82 - - [03/Feb/2007:14:08:02 -0600] "GET /onlinedictionary/shared/subscribe.html HTTP/1.1" 200 15457
64.233.173.82 - - [03/Feb/2007:14:08:02 -0600] "GET /onlinedictionary/authentication/login.html HTTP/1.1" 200 12349
...[lots more lines with the same IP address]...
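Both samples share the same layout, so one way to pull a line apart is a single regular expression. The following Python sketch is only an illustration, not a required approach: the names LOG_LINE and parse_line are my own inventions, and the pattern assumes every line looks like the samples above. Lines that don't match simply come back as None.

```python
import re

# Pattern for the layout shown in the samples above. This is an assumption:
# lines in any other format will not match and parse_line returns None for them.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                        # requester IP, then the two "-" fields
    r'\[(?P<date>[^:\]]+):(?P<time>[^\]]+)\] '     # [02/Feb/2007:22:42:26 -0600]
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '  # "GET /path HTTP/1.1"
    r'(?P<status>\d{3}) (?P<size>\d+|-)'           # response code, then byte count (or "-")
)

def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

Every value in the returned dictionary is a string, so convert "status" or "size" with int() if you need numbers.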

For this assignment, you will write a program to provide three different kinds of summaries of a log file formatted as in the samples above.

Details

Your program should be invokable from the Linux command line, like so:

Java: java LogSummarizer action [inputfile]
Python: python logsummarizer.py action [inputfile]
Perl: perl logsummarizer.pl action [inputfile]
etc.

Here, the mandatory "action" argument can be one of three possibilities: "--file", "--ip", or "--date". The "inputfile" argument is optional (as indicated by the square brackets). If inputfile is specified, then your program should take input from the named file. If inputfile is not specified, then your program should take input from standard input. Your program should send output to standard output. This combination of input and output behaviors supports commands like the following:

cat log* | java LogSummarizer --ip > ipsummary.txt

Assuming there are a bunch of files whose names start with "log", this command will concatenate those files together and store an IP summary of the combined log files in ipsummary.txt.
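To make the input rules concrete, here is a rough Python sketch of the argument handling. The function name open_input and the usage message are hypothetical, and the handout does not say what to do when the arguments are wrong, so exiting with a message is just one reasonable assumption.

```python
import sys

ACTIONS = ("--file", "--ip", "--date")

def open_input(argv):
    """Return (action, stream) for an argument list shaped like sys.argv."""
    if len(argv) not in (2, 3) or argv[1] not in ACTIONS:
        # Behavior on bad usage is not specified; exiting with a message is an assumption.
        sys.exit("usage: logsummarizer.py (--file | --ip | --date) [inputfile]")
    action = argv[1]
    # Read the named file if one was given, otherwise fall back to standard input.
    stream = open(argv[2]) if len(argv) == 3 else sys.stdin
    return action, stream
```

A main program might call open_input(sys.argv) and then loop over the returned stream line by line, which works the same way whether the lines come from a file or from a pipe.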

Here's how your program should behave for each of the possible "action" values:

--file: Your program should print out a list of all the requested files, in decreasing order of the number of times they were requested. Each line of your output should consist of an integer (the number of requests for the file) followed by a tab character, followed by the name of the file. Note that if a request provides arguments in the URL (e.g. the "?action=define&searchtype=stemming&text=repoir&service=french2french" in the first request shown above), you should strip those arguments out before counting the file.

To clarify what constitutes the "requested file", consider the sample log entries shown above. The files being requested are:

/onlinedictionary/
/onlinedictionary/style.css
/onlinedictionary/mainStyles.css
/onlinedictionary/pictures/ulnet.jpg
etc.

Note first of all that the "requested file" includes the full path information provided in the log entry. That is, the second sample line is requesting "/onlinedictionary/style.css", not just "style.css". Second, note that in the first line, "/onlinedictionary/" is actually a directory. Most web servers will be configured to interpret the request for a directory to mean a request for the "index.html" (or "index.php" or "index.htm" or...) file contained in the requested directory. For your purposes, however, "/onlinedictionary/" is the requested file, and you don't need to worry about whether the resulting data came from "/onlinedictionary/index.html" or some other source.
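As an illustration of the --file rules, the Python sketch below strips the URL arguments and then counts. The names requested_file and summarize_counts are hypothetical, and it assumes you have already extracted the URL from each log line. The handout does not say how to order files with equal counts, so the sketch leaves ties in whatever order Counter produces.

```python
from collections import Counter

def requested_file(url):
    """Strip any "?arguments" from a logged URL, keeping the full path."""
    return url.split("?", 1)[0]

def summarize_counts(keys):
    """Return "count<TAB>key" lines, most frequently seen key first."""
    counts = Counter(keys)
    # most_common() orders by decreasing count; tie order is not specified by the assignment.
    return [f"{count}\t{key}" for key, count in counts.most_common()]
```

For example, requested_file applied to the URL in the first sample line gives "/onlinedictionary/", which is then counted like any other file.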

--ip: Your program should print out a list of all the IP addresses from which requests came, in decreasing order of the number of requests made from each address. Each line of output should consist of an integer, a tab character, and the IP address.
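Since the IP address is the first thing on each line, the --ip summary can work straight from the raw lines. Here is one possible Python sketch. The name ip_summary is made up, skipping blank lines is an assumption, and breaking ties by the text of the address is an arbitrary choice that the assignment does not require.

```python
from collections import Counter

def ip_summary(lines):
    """Return "count<TAB>ip" lines for an iterable of raw log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)   # the IP address is the first whitespace-delimited field
        if fields:                     # skip blank lines (an assumption about how to treat them)
            counts[fields[0]] += 1
    # Sort by decreasing count; ties are broken by address text, an arbitrary choice.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count}\t{ip}" for ip, count in ordered]
```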

--date: Your program should print out a list of all the dates on which requests were made, in date order, along with the number of requests per day. Each line of output should consist of an integer (the number of requests made on the date), a tab character, and the date.
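One thing to watch for with --date: sorting the logged date text alphabetically does not give date order ("31/Jan/2007" would land after "03/Feb/2007"). The Python sketch below converts each date before sorting. The names MONTHS, to_date, and date_summary are hypothetical, it assumes English month abbreviations as in the samples, and printing the date exactly as it appears in the log is my assumption, since the output date format is not pinned down above.

```python
from collections import Counter
from datetime import date

# Month abbreviations as they appear in the sample log lines.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def to_date(text):
    """Convert a logged date such as "02/Feb/2007" to a datetime.date."""
    day, month, year = text.split("/")
    return date(int(year), MONTHS[month], int(day))

def date_summary(date_strings):
    """Return "count<TAB>date" lines in chronological order."""
    counts = Counter(date_strings)
    ordered = sorted(counts, key=to_date)   # sort by real calendar date, not by text
    return [f"{counts[d]}\t{d}" for d in ordered]
```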

Questions? Let me know.