CS 204: Software Design

A web server log summarizer

When a browser requests a file from a web server, the web server typically keeps a log of the request. For example, consider the following snippet from one of my own server log files (the web server in question is Apache). Here, we see that somebody is looking up the French word "judicieux" (judicious, wise) in a French-English bilingual dictionary. Each line shows the requester's IP address, the date and time of the request, the specific file or service requested by the browser, the response code sent back to the browser by the server (200 = OK, 302 = "moved temporarily", 404 = "not found", etc.), the number of bytes transferred, the URL of the "referrer" (if any), and the "user agent" string. When this person went to http://ultralingua.com/onlinedictionary/ and asked for the English translation of the French word "judicieux", the browser actually made several requests: first for the main page, then for a JavaScript file called ULOD.js, a CSS stylesheet called ul.css, an image file called resultsclosebutton.png, etc.

[Note that I have changed the IP address to hide the identity of the requester.]
123.456.78.90 - - [25/Mar/2012:06:15:32 -0400] "GET /onlinedictionary/index.html?action=define&text=judicieux&service=&searchtype=stemmed&service=french2english HTTP/1.1" 200 22632 "-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.0.19) Gecko/2010031218 Firefox/3.0.19"
123.456.78.90 - - [25/Mar/2012:06:15:32 -0400] "GET /onlinedictionary/js/ULOD.js HTTP/1.1" 200 8005 "http://www.ultralingua.com/onlinedictionary/index.html?action=define&text=judicieux&service=&searchtype=stemmed&service=french2english" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.0.19) Gecko/2010031218 Firefox/3.0.19"
123.456.78.90 - - [25/Mar/2012:06:15:38 -0400] "GET /styles/ul.css HTTP/1.1" 302 557 "http://www.ultralingua.com/onlinedictionary/ulod.py?action=define&text=judicieux&srclang=french&dstlang=english&searchtype=stemming&clang=english&nlang=english&casesensitive=&ignoreaccents=&wholewords=&searchdefs=" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.0.19) Gecko/2010031218 Firefox/3.0.19"
123.456.78.90 - - [25/Mar/2012:06:15:38 -0400] "GET /onlinedictionary/images/resultsclosebutton.png HTTP/1.1" 200 4237 "http://www.ultralingua.com/onlinedictionary/ulod.py?action=define&text=judicieux&srclang=french&dstlang=english&searchtype=stemming&clang=english&nlang=english&casesensitive=&ignoreaccents=&wholewords=&searchdefs=" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.0.19) Gecko/2010031218 Firefox/3.0.19"
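One way to pull the fields described above out of a log line is a regular expression with named groups. The sketch below is only an illustration, not a required approach: the pattern LOG_LINE, the group names, and the function parse_line are my own hypothetical names, and the pattern assumes every line looks like the samples shown here. Real log files can contain malformed lines, so the function returns None rather than failing when a line does not match.

```python
import re

# Assumed pattern for lines shaped like the samples above; the group names
# are hypothetical and chosen only for readability.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# One of the Googlebot lines from this page, used as a quick check.
sample = ('66.249.72.45 - - [25/Mar/2012:02:24:43 -0400] "GET /press/ HTTP/1.1" '
          '301 589 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')
fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["target"])
```

Note that every field comes back as a string, including the status code and the byte count.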

A similar sample shows some other requests, but this time, the IP address belongs to one of the many instances of Googlebot--the programs that traverse the web pages of the world to make most of Google's services possible.

66.249.72.45 - - [25/Mar/2012:02:24:43 -0400] "GET /press/ HTTP/1.1" 301 589 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.45 - - [25/Mar/2012:02:24:44 -0400] "GET /press HTTP/1.1" 200 39504 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.4 - - [25/Mar/2012:02:26:04 -0400] "GET /press/newsletters/july-newsletter-2009 HTTP/1.1" 200 12275 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.72.80 - - [25/Mar/2012:02:27:27 -0400] "GET /pt-br/products/german-italian-dictionary.html HTTP/1.1" 200 42595 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

...[lots more lines with similar IP addresses]...

For this assignment, you will write a program to provide several kinds of summaries of a log file formatted as in the samples above.

Details

Your program should be invokable from a Unix command line, like so:

python logsummarizer.py action [inputfile]

Here, the mandatory "action" argument can be one of three possibilities: "--files", "--ip", and "--404". The "inputfile" argument is optional (as indicated by the square brackets). If inputfile is specified, then your program should take input from the named file. If inputfile is not specified, then your program should take input from standard input. Your program should send output to standard output. This combination of input and output behaviors supports commands like the following:

cat log* | python logsummarizer.py --ip > summary.txt

Assuming there are a bunch of files whose names start with "log", this command will concatenate those files together and store a summary of the combined log files in summary.txt.
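The input rule described above (a named file if one is given, standard input otherwise) might be handled along these lines. This is a minimal sketch under my own assumptions: the names ACTIONS, parse_args, and open_input are hypothetical, and the usage message and the choice to decode with errors="replace" are illustrative rather than required.

```python
import sys

ACTIONS = ("--files", "--ip", "--404")


def parse_args(argv):
    """Return (action, path); path is None when input comes from stdin."""
    if len(argv) not in (2, 3) or argv[1] not in ACTIONS:
        raise SystemExit("usage: python logsummarizer.py action [inputfile]")
    path = argv[2] if len(argv) == 3 else None
    return argv[1], path


def open_input(path):
    """Open the named file, or fall back to standard input."""
    if path is None:
        return sys.stdin
    # Assumption: undecodable bytes in a log are replaced, not fatal.
    return open(path, encoding="utf-8", errors="replace")


# Typical use inside the program:
#     action, path = parse_args(sys.argv)
#     for line in open_input(path):
#         ...
```

Because the rest of the program only ever sees an iterable of lines, it does not need to know whether those lines came from a file or from a pipe.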

Here's how your program should behave for each of the possible "action" values:

--files:

Your program should print out a list of all the requested files, in decreasing order of the number of times they were requested. Each line of your output should consist of an integer (the number of requests for the file) followed by a tab character, followed by the name of the file. Note that if a request provides arguments in the URL (e.g. the "?action=define&text=judicieux&service=&searchtype=stemmed&service=french2english" in the first request shown above), you should strip those arguments out before counting the file.

To clarify what constitutes the "requested file", consider the sample log entries shown above. The files being requested are:

/onlinedictionary/index.html
/onlinedictionary/js/ULOD.js
/styles/ul.css
/onlinedictionary/images/resultsclosebutton.png
etc.

Note first of all that the "requested file" includes the full path information provided in the log entry. That is, the second sample line is requesting "/onlinedictionary/js/ULOD.js", not just "ULOD.js". Second, note that a request can name a directory rather than a file, as in the "/press/" request in the Googlebot sample. Most web servers will be configured to interpret the request for a directory to mean a request for the "index.html" (or "index.php" or "index.htm" or...) file contained in the requested directory. For your purposes, however, "/press/" is the requested file, and you don't need to worry about whether the resulting data came from "/press/index.html" or some other source.
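The stripping, counting, and output format described for "--files" can be sketched with collections.Counter from the standard library. The function names requested_file and format_counts are hypothetical, and the query string in the sample data is shortened for illustration; this shows one possible shape, not the expected solution.

```python
from collections import Counter


def requested_file(target):
    """Drop any "?arguments" from a request target."""
    return target.split("?", 1)[0]


def format_counts(counts):
    """Return "count<TAB>name" lines, most frequent first."""
    return ["%d\t%s" % (n, name) for name, n in counts.most_common()]


# Request targets with a shortened, illustrative query string.
targets = [
    "/onlinedictionary/index.html?action=define&text=judicieux",
    "/onlinedictionary/js/ULOD.js",
    "/onlinedictionary/index.html",
]
counts = Counter(requested_file(t) for t in targets)
for line in format_counts(counts):
    print(line)
```

Here the two requests for /onlinedictionary/index.html are counted together because the arguments are removed before counting.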

--ip: Your program should print out a list of all the IP addresses from which requests came, in decreasing order of the number of requests made from each address. Each line of output should consist of an integer, a tab character, and the IP address.

--404: Your program should print out a list of all the files that were requested but returned a 404 (file not found) error code, along with the number of times each such file was requested. This list should be printed as in the "--files" description above.
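The "--404" summary is the same kind of count restricted to requests whose response code is 404. The sketch below assumes the log lines have already been reduced to (target, status) pairs with the status as a string; that shape, the function name count_not_found, and the sample requests are all made up for illustration and do not come from the real log file.

```python
from collections import Counter


def count_not_found(requests):
    """Count requested files among (target, status) pairs with status 404."""
    counts = Counter()
    for target, status in requests:
        if status == "404":
            # Strip URL arguments before counting, as for --files.
            counts[target.split("?", 1)[0]] += 1
    return counts


# Made-up requests for illustration only; these are not from the real log.
requests = [
    ("/missing.html", "404"),
    ("/press/", "301"),
    ("/missing.html?x=1", "404"),
    ("/old/page.html", "404"),
]
for name, n in count_not_found(requests).most_common():
    print("%d\t%s" % (n, name))
```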

I have made one day's log file available to you from within the Carleton network. Please destroy your copy of this file once you are done with this assignment. Log files sometimes can be exploited to discover individuals' browsing behavior. There's unlikely to be anything of particular interest in this sample log file, but it's a good general rule to protect this sort of data to prevent privacy violations and other nasty behavior.

The comment at the top of your Python source code should include the authors' names, the date, and a brief summary of the nature of the program. Also, please include one line in your comment that says either "You may use this program for an in-class code review demonstration" or "No, please do not use this program for an in-class code review demonstration".

Questions? Let me know.