Assignment 8 - Caching Web Proxy
Due: Wednesday, May 28, at 10:00pm
This assignment is an adaptation of Aaron Bauer’s adaptation of a lab developed for Carnegie Mellon University’s 15-213 (Introduction to Computer Systems) course, the course for which a fun (but not free) textbook was written.
Starter code: a8.tar
Upload solutions via Gradescope as: proxy.c
Goals
This assignment is designed to help you with the following:
- learning about web proxies
- deepening your knowledge of how messages are sent using the HyperText Transfer Protocol (HTTP)
- learning how caching can improve performance
Collaboration policy
For this assignment, you may work alone or with a partner, but you must type up all of the answers yourself. (It is therefore unexpected for two submissions to be completely identical.)
You may also discuss the assignment at a high level with other students.
You should list any student with whom you discussed the assignment, and the manner of discussion (high level, partner, etc.) in comments at the top of your .c source file(s).
If you work alone, you should say so instead.
Assessment
To demonstrate proficiency, your submission needs to:
- pass the autograder tests for basic proxy operation, as defined in Part 1
To demonstrate mastery, your submission needs to:
- pass all autograder tests, including those for a working cache, as defined in Part 2
- be somewhat well-styled
- include your name and collaboration statement at the top of your proxy.c file
Background: web proxies
A web proxy is a program that acts as a middleman between a web browser and an end server. Instead of contacting the end server directly to get to a web page, the browser contacts the proxy, which forwards the request on to the end server. When the end server replies to the proxy, the proxy sends the reply on to the browser.
Proxies are useful for many purposes. Sometimes proxies are used in firewalls, so that browsers behind a firewall can only contact a server beyond the firewall via the proxy. Proxies can also act as anonymizers: by stripping requests of all identifying information, a proxy can make the browser anonymous to web servers. Proxies can even be used to cache web objects by storing local copies of objects from servers and then responding to future requests by reading them out of a cache rather than by communicating again with remote servers.
Assignment overview
In this assignment, you will write a simple HTTP proxy that caches web objects. For Part 1, you will set up the proxy to accept incoming connections, read and parse requests, forward requests to web servers, read the servers’ responses, and forward those responses to the corresponding clients. This first part will involve learning about basic HTTP operation and how to use sockets to write programs that communicate over network connections.
In Part 2, you will add caching to your proxy using a simple main memory cache of recently accessed web content.
To demonstrate Proficiency, you will need to complete Part 1. To demonstrate Mastery, you will additionally need to complete Part 2.
Getting the starter package
You can download the files you need for this assignment using:
wget https://www.cs.carleton.edu/faculty/tamert/courses/cs208-s25/resources/assignments/a8.tar
Then, run this command:
tar xvf a8.tar
This should generate a directory called a8. The README file describes the various files.
The file proxy.c contains starter code and comments for Parts 1 and 2, as described below. This starter code is simply guidance, not a requirement—you are free to modify and/or ignore it.
Part 1: Implementing a sequential web proxy
Your first step is to implement a basic sequential proxy that handles HTTP/1.0 GET requests. Other request types, such as POST, are not needed for this assignment. You will need to implement the function handle_request, provided in the starter code, to complete this part. (You may find it helpful to include helper functions, especially if you are aiming for well-styled code.)
When started, your proxy should listen for incoming connections on a port whose number will be specified on the command line. Once a connection is established, your proxy should:
- Read the entirety of the request from the client and parse the request.
- Determine whether the client has sent a valid HTTP request.
- If so, establish a connection to the appropriate web server and request the object the client specified.
- Finally, read the server’s response and forward it to the client.
The starter code in main takes care of listening on the port passed in on the command line. When it receives a connection, it creates a new socket and passes the associated file descriptor to handle_request. (This part has already been written for you.)
Your code in handle_request should use the RIO functions to read the request from the client. For example, the Tiny server’s doit function uses the code below to initialize buffered reading on the socket file descriptor fd and read the first line from the client:
rio_t rio;
/* Read request line and headers */
Rio_readinitb(&rio, fd);
if (!Rio_readlineb(&rio, buf, MAXLINE))
    return;
Note that RIO has functions for buffered reading (Rio_readlineb, Rio_readnb) and functions for unbuffered reading and writing (Rio_readn, Rio_writen). You should not interleave buffered and unbuffered calls on the same file descriptor, as this will cause problems.
Once you have read and parsed the first line of the request from the client, you will need to open a socket to the server (Step 3 above). Use the provided Open_clientfd function for this. It takes two string arguments, the hostname and the port, and returns a file descriptor.
HTTP/1.0 GET requests
When an end user enters a URL such as http://www.carleton.edu/index.html into the address bar of a web browser, the browser sends an HTTP request to the proxy that begins with a line that might resemble the following:
GET http://www.carleton.edu/index.html HTTP/1.1
In that case, the proxy should parse the request into the following fields:
- the hostname: www.carleton.edu
- the path or query and everything following it: /index.html
That way, the proxy can determine that it should open a connection to www.carleton.edu and send an HTTP request of its own starting with a line of the form:
GET /index.html HTTP/1.0
Note that all lines in an HTTP request end with a carriage return (\r) followed by a newline (\n). Also important is that every HTTP request is terminated by a single empty line: "\r\n". This means that a message like
GET http://www.carleton.edu/index.html HTTP/1.1
Accept: */*
is actually encoded as a single string: "GET http://www.carleton.edu/index.html HTTP/1.1\r\nAccept: */*\r\n\r\n".
You should notice in the above example that the web browser’s request line ends with HTTP/1.1, while the proxy’s request line ends with HTTP/1.0. Modern web browsers will generate HTTP/1.1 (or HTTP/2 or even HTTP/3) requests, but your proxy should just forward them as HTTP/1.0 requests.
If a browser sends any request headers (any non-empty lines after the first, like "Accept: */*\r\n" above), your proxy should forward them unchanged.
Remember that not all content on the web is ASCII text. Much of the content on the web is binary data, such as images and video. Ensure that you account for binary data when selecting and using functions for network I/O.
Port numbers
There are two significant classes of port numbers for this lab: HTTP request ports and your proxy’s listening port.
The HTTP request port is an optional field in the URL of an HTTP request. That is, the URL may be of the form http://www.carleton.edu:8080/index.html, in which case your proxy should connect to the host www.carleton.edu on port 8080 instead of the default HTTP port, which is port 80. The provided autograder always supplies a port number with its requests.
The listening port is the port on which your proxy should listen for incoming connections. Your proxy should accept a command-line argument specifying the listening port number for your proxy. For example, with the following command, your proxy should listen for connections on port 55057:
./proxy 55057
You may select any non-privileged listening port (greater than 1024 and less than 65536) as long as it is not used by other processes. Since each proxy must use a unique listening port and many people may be simultaneously working on mantis, the script port-for-user.pl is provided to help you pick your own personal port number. Use it to generate a port number based on your user ID:
./port-for-user.pl tamert
tamert: 2718
The port p returned by port-for-user.pl is always an even number. So, if you need an additional port number, say for the Tiny server, you can safely use ports p and p+1.
Please don’t pick your own random port. If you do, you run the risk of interfering with another user, and preventing them from completing their assignment.
String parsing
This part is essentially a string parsing problem—given a URL, you need to extract the hostname, the path, and the port. To help you devise your own approach to this, here are some of the tools available in C for string parsing (some of which you’ve seen before!):
- sscanf is a good tool for extracting parts of a string when its structure is known. For example, line 60 of tiny/tiny.c uses sscanf to separate the request line, consisting of the request method, the URL, and the HTTP version, into separate variables:
sscanf(buf, "%s %s %s", method, url, version);
This line will take the string in buf and copy the characters before the first space into method, copy the characters after the first space and before the second space into url, and copy the rest of buf into version. For example, GET www.carleton.edu/index.html HTTP/1.1 would be split into GET, www.carleton.edu/index.html, and HTTP/1.1. Note that sscanf is also useful when dealing with a known prefix, such as http://. For example:
sscanf(url, "http://%s", trimmed_url);
This will extract the part of the string url after the http:// and copy it to trimmed_url. In this case, nothing will be copied to trimmed_url if url does not start with exactly http://. Recall that sscanf returns the number of items that were successfully matched.
- char *strchr(char *cs, char c) returns a pointer to the first occurrence of character c in string cs, or NULL if c is not present. For example, if url is "www.carleton.edu/index.html", then
char *temp = strchr(url, '/');
strncpy(filename, temp, 100);
would set temp to point to the '/' character within url, and then copy from that point in url to the end (up to 100 characters) into filename (after which filename would contain "/index.html"). If you then did *temp = '\0'; to insert a null terminator, then url would become just "www.carleton.edu".
- char *strstr(char *cs, char *ct) returns a pointer to the first occurrence of string ct in string cs, or NULL if ct is not present.
- char *strtok(char *s, char *ct) searches s for tokens delimited by characters from ct (remember Assignment 3?). A sequence of calls of strtok(s, ct) splits s into tokens, each delimited by a character from ct. The first call in a sequence has a non-NULL s. It finds the first token in s consisting of characters not in ct; it terminates that token by overwriting the next character of s with '\0' and returns a pointer to the token. Each subsequent call, indicated by a NULL value of s, returns the next such token, searching from just past the end of the previous one. strtok returns NULL when no further token is found. The string ct may be different on each call.
- Not parsing per se, but sprintf can be a nice way to assemble a string from multiple pieces of data in C. It works exactly like printf, except that the result is copied to another string rather than printed. See the documentation here.
Part 2: Caching web objects
For the second part of this assignment, you need to add a cache to your proxy that stores recently used web objects in memory. Here a web object just means a file sent by a web server. HTTP actually defines a fairly complex model by which web servers can give instructions as to how the objects they serve should be cached, and clients can specify how caches should be used on their behalf. However, your proxy will adopt a simplified approach.
When your proxy receives a web object from a server, it should cache it in memory as it transmits the object to the client. If another client requests the same object from the same server, your proxy need not reconnect to the server; it can simply resend the cached object.
The starter code in proxy.c provides struct definitions and a set of cache functions you can use to get started. Under that design, when the proxy receives a request, it should look up the URL in the cache, and if an entry is found, the proxy should send the associated item to the client. If an entry isn’t found, when the proxy receives a response from the server, it should buffer that response in memory and then insert an entry into the cache with the request’s URL as the url and the contents of the buffer as the item.
Testing
Autograder
Your handout materials include an autograder, called driver.sh, that assigns scores for Basic Correctness and Cache. To run the autograder, run
./driver.sh
from the a8 directory. You must run this script on a Linux machine.
Tiny web server
Your handout directory contains the source code for the CS:APP Tiny web server in the tiny subdirectory. While not as powerful as thttpd (a small, open-source web server written in C), the Tiny web server will be easy for you to modify as you see fit. It’s also a reasonable starting point for your proxy code. And, it’s the server that the autograder uses to fetch pages.
A general pattern for testing might be:
- Start the Tiny server in one terminal.
- Start your proxy in another terminal.
- In a third terminal, use curl (see below) to send a request to the Tiny server via your proxy.
In VS Code, the little + in the upper-right of the terminal section will open another terminal.
curl
You can use curl to generate HTTP requests to any server, including your own proxy. It is an extremely useful debugging tool. For example, if your proxy and Tiny are both running on the local machine, with your proxy listening on port 2718 and Tiny listening on port 2719, then you can request a page from Tiny via your proxy using the following curl command:
linux> curl -v --proxy http://localhost:2718 http://localhost:2719/home.html
* About to connect() to proxy localhost port 2718 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 2718 (#0)
> GET http://localhost:2719/home.html HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu)...
> Host: localhost:2719
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: Tiny Web Server
< Content-length: 120
< Content-type: text/html
<
<html>
<head><title>test</title></head>
<body>
<img align="middle" src="godzilla.gif">
Dave O'Hallaron
</body>
</html>
* Closing connection #0