Assignment 8 - Caching Web Proxy
Due: Wednesday, May 28, at 10:00pm
This assignment is an adaptation of Aaron Bauer’s adaptation of a lab developed for Carnegie Mellon University’s 15-213 (Introduction to Computer Systems) course, the course for which a fun (but not free) textbook was written.
Starter code: a8.tar
Upload solutions via Gradescope as: proxy.c
Goals
This assignment is designed to help you with the following:
- learning about web proxies
- deepening your knowledge of how messages are sent using the HyperText Transfer Protocol (HTTP)
- learning how caching can improve performance
Collaboration policy
For this assignment, you may work alone or with a partner, but you must type up all of the answers yourself. (It is therefore unexpected for two submissions to be completely identical.)
You may also discuss the assignment at a high level with other students.
You should list any student with whom you discussed the assignment, and the manner of discussion (high level, partner, etc.) in comments at the top of your .c source file(s).
If you work alone, you should say so instead.
Assessment
To demonstrate proficiency, your submission needs to:
- pass the autograder tests for basic proxy operation, as defined in Part 1
To demonstrate mastery, your submission needs to:
- pass all autograder tests, including those for a working cache, as defined in Part 2
- be somewhat well-styled
- include your name and collaboration statement at the top of your proxy.c file
Background: web proxies
A web proxy is a program that acts as a middleman between a web browser and an end server. Instead of contacting the end server directly to get to a web page, the browser contacts the proxy, which forwards the request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the browser.
Proxies are useful for many purposes. Sometimes proxies are used in firewalls, so that browsers behind a firewall can only contact a server beyond the firewall via the proxy. Proxies can also act as anonymizers: by stripping requests of all identifying information, a proxy can make the browser anonymous to web servers. Proxies can even be used to cache web objects by storing local copies of objects from servers and then responding to future requests by reading them out of a cache rather than by communicating again with remote servers.
Assignment overview
In this assignment, you will write a simple HTTP proxy that caches web objects. For Part 1, you will set up the proxy to accept incoming connections, read and parse requests, forward requests to web servers, read the servers’ responses, and forward those responses to the corresponding clients. This first part will involve learning about basic HTTP operation and how to use sockets to write programs that communicate over network connections.
In Part 2, you will add caching to your proxy using a simple main memory cache of recently accessed web content.
To demonstrate Proficiency, you will need to complete Part 1. To demonstrate Mastery, you will additionally need to complete Part 2.
Getting the starter package
You can download the files you need for this assignment using:
wget https://www.cs.carleton.edu/faculty/tamert/courses/cs208-s25/resources/assignments/a8.tar
Then, run this command:
tar xvf a8.tar
This should generate a directory called a8. The README file describes the various files.
The file proxy.c contains starter code and comments for Parts 1 and 2, as described below. This starter code is simply guidance, not a requirement—you are free to modify and/or ignore it.
Part 1: Implementing a sequential web proxy
Your first step is to implement a basic sequential proxy that handles HTTP/1.0 GET requests. Other request types, such as POST, are not needed for this assignment. You will need to implement the function handle_request, provided in the starter code, to complete this part. (You may find it helpful to include helper functions, especially if you are aiming for well-styled code.)
When started, your proxy should listen for incoming connections on a port whose number will be specified on the command line. Once a connection is established, your proxy should:
- Read the entirety of the request from the client and parse the request.
- Determine whether the client has sent a valid HTTP request.
- If so, establish a connection to the appropriate web server and request the object the client specified.
- Finally, read the server’s response and forward it to the client.
The starter code in main takes care of listening on the port passed in on the command line. When it receives a connection, it creates a new socket and passes the associated file descriptor to handle_request. (This part has already been written for you.)
Your code in handle_request should use the RIO functions to read the request from the client. For example, the Tiny server’s doit function uses the code below to initialize buffered reading on the socket file descriptor fd and read the first line from the client:
rio_t rio;
/* Read request line and headers */
Rio_readinitb(&rio, fd);
if (!Rio_readlineb(&rio, buf, MAXLINE))
    return;
Note that RIO has functions for buffered reading (Rio_readlineb, Rio_readnb) and functions for unbuffered reading and writing (Rio_readn, Rio_writen). You should not interleave buffered and unbuffered calls on the same file descriptor, as this will cause problems.
Once you have read and parsed the first line of the request from the client, you will need to open a socket to the server (Step 3 above). Use the provided Open_clientfd function for this. It takes two string arguments, the hostname and the port, and returns a file descriptor.
HTTP/1.0 GET requests
When an end user enters a URL such as http://www.carleton.edu/index.html into the address bar of a web browser, the browser sends an HTTP request to the proxy that begins with a line that might resemble the following:
GET http://www.carleton.edu/index.html HTTP/1.1
In that case, the proxy should parse the request into the following fields:
- the hostname: www.carleton.edu
- the path or query and everything following it: /index.html
That way, the proxy can determine that it should open a connection to www.carleton.edu and send an HTTP request of its own starting with a line of the form:
GET /index.html HTTP/1.0
Note that all lines in an HTTP request end with a carriage return (\r) followed by a newline (\n). Also important is that every HTTP request is terminated by a single empty line: "\r\n". This means that a message like
GET http://www.carleton.edu/index.html HTTP/1.1
Accept: */*
is actually encoded as a single string: "GET http://www.carleton.edu/index.html HTTP/1.1\r\nAccept: */*\r\n\r\n".
You should notice in the above example that the web browser’s request line ends with HTTP/1.1, while the proxy’s request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 (or HTTP/2 or even HTTP/3) requests, but your proxy should just forward them as HTTP/1.0 requests.
If a browser sends any request headers (any non-empty lines after the first, like "Accept: */*\r\n" above), your proxy should forward them unchanged.
Remember that not all content on the web is ASCII text. Much of the content on the web is binary data, such as images and video. Ensure that you account for binary data when selecting and using functions for network I/O.
Port numbers
There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening port.
The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of the form http://www.carleton.edu:8080/index.html, in which case your proxy should connect to the host www.carleton.edu on port 8080 instead of the default HTTP port, which is port 80. The provided autograder always supplies a port number with its requests.
The listening port is the port on which your proxy should listen for incoming connections. Your proxy should accept a command-line argument specifying the listening port number for your proxy. For example, with the following command, your proxy should listen for connections on port 55057:
./proxy 55057
You may select any non-privileged listening port (greater than 1024 and less than 65536) as long as it is not used by other processes. Since each proxy must use a unique listening port and many people may be simultaneously working on mantis, the script port-for-user.pl is provided to help you pick your own personal port number. Use it to generate a port number based on your user ID:
./port-for-user.pl tamert
tamert: 2718
The port p returned by port-for-user.pl is always an even number. So, if you need an additional port number, say for the Tiny server, you can safely use ports p and p+1.
Please don’t pick your own random port. If you do, you run the risk of interfering with another user, and preventing them from completing their assignment.
String parsing
This part is essentially a string parsing problem—given a URL, you need to extract the hostname, the path, and the port. To help you devise your own approach to this, here are some of the tools available in C for string parsing (some of which you’ve seen before!):
- sscanf is a good tool for extracting parts of a string when its structure is known. For example, line 60 of tiny/tiny.c uses sscanf to separate the request line, consisting of the request method, the URL, and the HTTP version, into separate variables:
sscanf(buf, "%s %s %s", method, url, version);
This line will take the string in buf and copy the characters before the first space into method, copy the characters after the first space and before the second space into url, and copy the rest of buf into version. For example, GET www.carleton.edu/index.html HTTP/1.1 would be split into GET, www.carleton.edu/index.html, and HTTP/1.1. Note that sscanf is also useful when dealing with a known prefix, such as http://. For example:
sscanf(url, "http://%s", trimmed_url);
This will extract the part of the string url after the http:// and copy it to trimmed_url. In this case, nothing will be copied to trimmed_url if url does not start with exactly http://. Recall that sscanf returns the number of items that were successfully matched.
- char *strchr(char *cs, char c) returns a pointer to the first occurrence of character c in string cs, or NULL if c is not present. For example, if url is "www.carleton.edu/index.html", then
char *temp = strchr(url, '/');
strncpy(filename, temp, 100);
would set temp to point to the '/' character within url, and then copy from that point in url to the end (up to 100 characters) into filename (after which filename would contain "/index.html"). If you then did *temp = '\0'; to insert a null terminator, then url would become just "www.carleton.edu".
- char *strstr(char *cs, char *ct) returns a pointer to the first occurrence of string ct in string cs, or NULL if ct is not present.
- char *strtok(char *s, char *ct) searches s for tokens delimited by characters from ct (remember Assignment 3?). A sequence of calls of strtok(s, ct) splits s into tokens, each delimited by a character from ct. The first call in a sequence has a non-NULL s. It finds the first token in s consisting of characters not in ct; it terminates that token by overwriting the next character of s with '\0' and returns a pointer to the token. Each subsequent call, indicated by a NULL value of s, returns the next such token, searching from just past the end of the previous one. strtok returns NULL when no further token is found. The string ct may be different on each call.
- Not parsing per se, but sprintf can be a nice way to assemble a string from multiple pieces of data in C. It works exactly like printf, except that the result is copied to another string rather than printed. See the documentation here.
Part 2: Caching web objects
For the second part of this assignment, you need to add a cache to your proxy that stores recently used web objects in memory. Here a web object just means a file sent by a web server. HTTP actually defines a fairly complex model by which web servers can give instructions as to how the objects they serve should be cached, and clients can specify how caches should be used on their behalf. However, your proxy will adopt a simplified approach.
When your proxy receives a web object from a server, it should cache it in memory as it transmits the object to the client. If another client requests the same object from the same server, your proxy need not reconnect to the server; it can simply resend the cached object.
The starter code in proxy.c provides struct definitions and a set of cache functions you can use to get started. Under that design, when the proxy receives a request, it should look up the URL in the cache, and if an entry is found, the proxy should send the associated item to the client. If an entry isn’t found, when the proxy receives a response from the server, it should buffer that response in memory and then insert an entry into the cache with the request’s URL as the url and the contents of the buffer as the item.
Testing
Autograder
Your handout materials include an autograder, called driver.sh, that assigns scores for Basic Correctness and Cache. To run the autograder, run
./driver.sh
from the a8 directory. You must run this script on a Linux machine.
Tiny web server
Your handout directory contains the source code for the CS:APP Tiny web server in the tiny subdirectory. While not as powerful as thttpd (a small, open-source web server written in C), the Tiny web server will be easy for you to modify as you see fit. It’s also a reasonable starting point for your proxy code. And, it’s the server that the autograder uses to fetch pages.
A general pattern for testing might be:
- Start the Tiny server in one terminal.
- Start your proxy in another terminal.
- In a third terminal, use curl (see below) to send a request to the Tiny server via your proxy.
In VS Code, the little + in the upper-right of the terminal section will open another terminal.
curl
You can use curl to generate HTTP requests to any server, including your own proxy. It is an extremely useful debugging tool. For example, if your proxy and Tiny are both running on the local machine, with your proxy listening on port 2718 and Tiny listening on port 2719, then you can request a page from Tiny via your proxy using the following curl command:
linux> curl -v --proxy http://localhost:2718 http://localhost:2719/home.html
* About to connect() to proxy localhost port 2718 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 2718 (#0)
> GET http://localhost:2719/home.html HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu)...
> Host: localhost:2719
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: Tiny Web Server
< Content-length: 120
< Content-type: text/html
<
<html>
<head><title>test</title></head>
<body>
<img align="middle" src="godzilla.gif">
Dave O'Hallaron
</body>
</html>
* Closing connection #0