CS257 Assignment: Automated Testing

Due 11:10 AM Wednesday, 4/20/05. Submit via HSP.

You may work with a partner for this assignment, if you wish.

The problem

The effectiveness of modern search engines depends in part on analysis of the link structure of the Internet. For example, if lots of web pages containing the word "baseball" point to "http://www.mlb.com/", it is reasonable to conclude that mlb.com is in some way authoritative on the subject of baseball, and thus that its web site should rank high in "baseball" search results. Because links are so important, it is useful to have tools that extract and analyze them. For this assignment and the next, you will be working on a simple web link analysis tool, written in Perl.

The input for your tool will be the name of a directory. In that directory and its sub-directories will be ".htm" and/or ".html" files, possibly intermixed with other files. Your program will traverse the directory and its sub-directories, and print out a directed graph representing the link structure of the HTML files.
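When you build test data, it helps to know exactly which files in a test directory the tool ought to consider. Here is a minimal sketch in Python (my choice for illustration only; your testing system can be in any language) that lists the ".htm" and ".html" files under a directory. The function name find_html_files is hypothetical, and the decision to match only lowercase extensions is an assumption, not part of the specification.

```python
import os


def find_html_files(root):
    """Return sorted paths, relative to root, of .htm/.html files under root.

    Assumption: only lowercase ".htm" and ".html" count. Whether "PAGE.HTML"
    should be treated as a web file is not stated in the assignment, which
    makes it a good candidate for a test case.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".htm", ".html")):
                full = os.path.join(dirpath, name)
                # Use forward slashes so the result looks the same everywhere.
                found.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(found)
```

A listing like this can double as a checklist: every file it reports should be accounted for in the expected output for that test directory.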

How should you print out a graph? Fortunately, other people have thought about this question, so there are many graph-description file formats in the world. One such format is the DOT file format used by the Graphviz graph visualization software. Here, for example, is a DOT representation of a four-node directed graph with red nodes and default-colored (black) edges.


digraph "Example"
{
    "A" [ color = red ]
    "B" [ color = red ]
    "C" [ color = red ]
    "D" [ color = red ]

    "A" -> "B" [ ]
    "A" -> "C" [ ]
    "B" -> "D" [ ]
    "C" -> "B" [ ]
}

You might want to download the Graphviz tools (notably the "dotty" program), save the above example in a .dot file, and use dotty to take a look at the graph.

Your program will print the directed graph of links between web files as a DOT file. Feel free to use whatever features of DOT you consider appropriate, but keep in mind that you will need to be able to test the resulting output.
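One thing to consider is that two DOT files can describe the same graph while listing their edges in a different order, so a line-by-line comparison may report a failure that is not really one. As a rough sketch, assuming the output uses double-quoted node names and one edge per statement as in the example above, the edges could be pulled into a set and compared that way. The function name dot_edges is hypothetical, and this is not a full DOT parser.

```python
import re

# Matches statements of the form  "A" -> "B"  with double-quoted names.
# Assumption: one edge per statement; chains such as "A" -> "B" -> "C"
# and unquoted identifiers are not handled.
EDGE_PATTERN = re.compile(r'"([^"]*)"\s*->\s*"([^"]*)"')


def dot_edges(dot_text):
    """Return the set of (source, target) pairs found in dot_text."""
    return set(EDGE_PATTERN.findall(dot_text))


def same_edges(expected_dot, actual_dot):
    """True if both DOT texts contain exactly the same set of edges."""
    return dot_edges(expected_dot) == dot_edges(actual_dot)
```

Note that this looks only at edges. A page with no links in or out would appear as a node with no edges, so a thorough comparison would need to check node statements as well.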

One last detail: you may ignore any links in your web pages that are not relative. For example, you should handle href="../../index.html", but ignore href="http://something.somewhere.whatever.com/".
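To decide which links belong in your expected output, you need a working definition of "relative." The sketch below, in Python, is one possible reading: a link is relative if it has neither a scheme (such as "http:") nor a host. The function names are hypothetical, and the treatment of edge cases such as href="#top" or href="//host/page.html" is my assumption; the assignment does not say how they should be handled, so they are worth including in your test data.

```python
import posixpath
from urllib.parse import urlsplit


def is_relative_link(href):
    """True if href has no scheme and no host part.

    Assumption: "//host/page.html" (host but no scheme) is treated as
    non-relative, and a fragment-only link such as "#top" as relative.
    """
    parts = urlsplit(href)
    return parts.scheme == "" and parts.netloc == ""


def resolve_link(page_path, href):
    """Resolve a relative href against the page that contains it.

    Both page_path and the result use forward slashes and are relative
    to the top of the test directory. Any "#fragment" or "?query" part
    of href is dropped.
    """
    target = urlsplit(href).path
    base_dir = posixpath.dirname(page_path)
    return posixpath.normpath(posixpath.join(base_dir, target))
```

For example, a page at "a/b/page.html" containing href="../../index.html" resolves to "index.html" at the top of the directory.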

Part 1: Test Plan

Your job is not to write a program to solve this problem. Instead, I want you to develop test data and an automated system for performing your tests. You may assume that the program you are testing writes to standard output.

I will leave it to you to decide how your testing system should operate. However, I am looking especially for thorough testing, useful test reports, and ease of use. Hand in a directory containing test data, code for any programs you may write to help you do your testing, and a readme file describing your test plan. You are not required to write testing programs; you might, for example, figure out a way to use the Unix diff command to good effect instead. But make sure you articulate very clearly how you will go about testing the program, and how you will be able to tell whether it passes your tests.
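If you do choose to write a testing program, one simple shape for it is: run the program under test, capture its standard output, compare that against an expected result, and report any differences. Here is a minimal sketch of that idea in Python, using a diff-style report. The function name run_case is hypothetical, and this is only one of many reasonable designs.

```python
import difflib
import subprocess


def run_case(command, expected_text):
    """Run command, compare its standard output with expected_text.

    Returns (passed, report). The comparison is line by line, so a
    missing final newline is not treated as a difference. The report is
    a unified diff, empty when the test passes.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    diff = list(difflib.unified_diff(
        expected_text.splitlines(),
        result.stdout.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    ))
    return (not diff, "\n".join(diff))
```

A call might look like run_case(["perl", "linkgraph.pl", "testdata/case1"], expected), where the script name and directory are placeholders I have made up. Because a text diff is sensitive to the order of lines, you will want to think about whether the expected output should be compared exactly or in some normalized form.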

Have fun.