Parser

Table of Contents

For this part of the assignment, you'll implement a parser for Scheme, in C. (You must implement the parser from scratch - don't use Yacc, Bison, ANTLR, or any similar tool, even if you know what it is and how to use it.)

This assignment is to be done individually. You can talk to other people in the class, me (Dave), and any of the course staff (graders, lab assistants, teaching assistants, prefects) for ideas and to gain assistance. You can help each other debug programs, if you wish. The code that you write should be your own, however, and you shouldn't directly share your code with others. See the course syllabus for more details or just ask me if I can clarify.

1 Get started

1.1 Use GitHub classroom to create a repository for your assignment

The first thing you'll need to do is to create a repository for your project using GitHub classroom. Visit this GitHub classroom link (which I've placed in Moodle). Log into GitHub if need to. GitHub should then hopefully respond with a screen that says "You're ready to go!" and will supply you with a link for the repository that you are creating. Click on that link, and you should be taken to the repository.

1.2 Clone your repository into Repl.it

Since this is an individual assignment, you'll again use the my-work repl in Repl.it that you created. In Repl.it, start up your my-work repl, and then make sure that you are in your home directory in the terminal window. You can always navigate there by typing

cd ~

in the terminal window. To further confirm, type pwd (which is short for "print working directory.") You should see /home/runner. If you see something different, you're not in the home directory, so try again or ask for help.

In a different browser tab (leave the one with Repl.it open), visit our our class GitHub organization page online. Once you've landed there, you should see the GitHub repository you created in the GitHub Classroom step above. It should be named parser-username (where username is your username). Contact us for help if the repository isn't there. Click on that repository title. This should take you to the home page for that repository. There is a green box on the page that says "Clone or download." Click that box, and make sure it says "Clone with HTTPS." Highlight the https link shown, and copy it to the clipboard. The link that you're copying should look something like

https://github.com/carleton251-term/parser-username.git

… though your username and term will be different.

Then navigate back to the Repl.it tab with the repl that you created above, and click on the terminal window within your new repl. Your next step is to clone the repository from GitHub. To do that, type git clone, then a space, then paste in the URL from the github page that you copied from above. (To paste, you might need to right-click in the terminal window and choose paste rather than using a keyboard shortcut.)

For example, I would type the following, though your username and term will be different:

git clone https://github.com/carleton251-term/parser-username.git

You'll be prompted to enter your GitHub username and password.

If all goes well, you should receive a directory, titled parser-username. If you type ls at the prompt, you should be able to see it. It will also appear in your file browser window with Repl.it on the left. Then navigate into that directory by typing in:

cd parser-username

… and you should be all set to work.

2 Assignment details

You should create a parser.c file that implements the functions specified in parser.h. Particularly, you will need to write a function parse that uses the token list from the last assignment to construct a list of parse trees.

For example, here is a syntactically correct Scheme program:

(define x (quote a))
(define y x)
(car y)

Any Scheme program consists of a list of S-expressions. An S-expression always has parentheses around it, unless it is an atom. Note that a Scheme program itself, then, is not a single S-expression; it is a list of S-expressions.

Your parse function should therefore return a list, using the linked list structure that we have been using, of parse trees. A parse tree, in turn, uses the linked list structure again to represent a tree. For example, the expression (define x (quote a)) would become the following parse tree, where each box is a Value:

parse.png

The above parse tree structure would be the first item in a 3-item linked list that represents the above Scheme program.

Do not include parentheses in your parse tree. Their purpose is to tell you the structure of the tree, so once you have an actual tree you don't need them anymore - they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)

Parsing languages can be pretty complicated. The good news is that parsing Scheme is really easy, at least in a relative sense. Since your Scheme code IS a tree, you just need to be able to read it as such and turn it into a tree in memory. You can do it with a stack of tokens:

  • Initialize an empty stack.
  • While there are more tokens:
    • Get the next token.
    • If the token is anything other than a close paren, push it onto the stack.
    • If the token is a close paren, start popping items from your stack until you pop off an open paren (and form a list of them as you go). Push that list back on the stack.

So you'll need a stack… your linked list is a fine candidate for it.

You'll also need to write a printTree function. The output format for a parse tree of valid Scheme code is very simple: it looks exactly like Scheme code! To output an internal node in the parse tree, output ( followed by the outputs of the children of that internal node from left to right followed by ). To output a leaf node (a non-parenthetical token), just output its value as you did in the tokenizer, sans type.

3 Syntax errors

As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. Here are the different types of syntax errors you should handle:

  1. If the input code ever has too many close parentheses (in other words, if you ever encounter a close paren that doesn't match a previous open paren), print Syntax error: too many close parentheses.
  2. If the parse function returns an incomplete parse tree and the end of the input is reached and there are too few close parentheses, print Syntax error: not enough close parentheses.

If you ever encounter a syntax error, your program should exit after printing the error - don't do any more parsing. This is why we wrote the function texit. Before exiting, you should output the text "Syntax error". Also feel free to print Scheme code around the error to be helpful if you can, though this optional.

Most production parsers continue to try to parse after an error to provide additional error messages. Your efforts here may help you to gain some sympathy as to why all error messages after the first one are questionable!

4 Sample executions

$ cat test-in-01.scm
(if 2 3)
(+ ))

$ cat test-in-02.scm
(define map
  (lambda (f L)
    (if (null? L)
        L
        (cons (f (car L))
              (map f (cdr L))))))

$ cat test-in-03.scm
1 2 (3)

$ ./interpreter < test-in-01.scm
Syntax error: too many close parentheses

$ ./interpreter < test-in-02.scm
(define map (lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L))))))

$ ./interpreter < test-in-03.scm
1 2 (3)

5 Formatting your output

You may find that it is easier to produce output similar to the above but with extraneous white space. For me, I had to hack some extra code to make sure that the last item of a list didn't have a space between it and the closing paren that followed it. You may have extraneous whitespace in your output if you wish (see capstone work).

6 Sample code

Here is a rough approximation of part of my parse function. My addToParseTree function takes in a pre-existing tree, a token to add to it, and a pointer to an integer depth. depth is updated to represent the number of unclosed open parentheses in the parse tree. The parse tree is complete if and only if depth is 0 when the parse function returns.

Value *parse(Value *tokens) {

    Value *tree = makeNull();
    int depth = 0;

    Value *current = tokens;
    assert(current != NULL && "Error (parse): null pointer");
    while (current->type != NULL_TYPE) {
        Value *token = car(current);
        tree = addToParseTree(tree, &depth, token);
        current = cdr(current);
    }
    if (depth != 0) {
        syntaxError();
    }
}

7 Capstone work

Work in this section is 100% optional, and not worth any points. Nonetheless, if you're looking for an extra challenge, these are fun additional exercises to try.

  • Format your output so that it resembles DrRacket output as closely as possible. Specifically, don't have an extraneous space after the last item in a list before the closing paren.
  • Correctly handle square brackets. Some dialects of Scheme allow you to use square brackets as an alternative to parentheses anywhere you like, so long as they match accordingly. Here is an example:

    (define [lambda (x) [+ x 1]])
    

    Once your parse tree is constructed, you should retain no memory of whether parentheses or square brackets were used, and your output should only show parentheses (just as DrRacket does). However, you'll need to add some extra details to the parse to make sure the input is correct. For example, even though you'll eventually transform (a [b c] d) to (a (b c) d), the input (a [b c) d] should result in a syntax error.

  • Correctly parse dotted pairs, e.g. (a . b).
  • Correctly parse a single quote, i.e '. Remember that 'a is really just an abbreviation for (quote a).

8 The files that you start with

After you clone your repository, you should be able to see that you get the following starting files:

  • linkedlist.h: no changes from last time
  • talloc.h: no changes from last time
  • value.h: some added types, to facilitate capstone questions
  • tokenizer.h: no changes from last time
  • parser.h: the header file for the functions you'll need to write for this assignment. You'll also create a number of helper functions for that, but those don't need to appear in the header since they will be "private".
  • main.c: Short man function that drives the whole program.
  • Makefile: contains instructions for the command make, which will compile and test your code
  • tests/: a directory of Scheme programs as test inputs, and expected outputs
  • capstone/: further tests to go with the capstone exercises.
  • tester.py: a script to handle automated testing.

At this point, you have a choice regarding how to proceed for linkedlist.c, talloc.c, and tokenizer.c. If you want to continue to build your own interpreter that is truly yours, you can copy in these files from the last project, and move forward. Alternatively, if you would prefer to use my versions instead, you can do so. In the lib directory you'll see linkedlist.o, talloc.o, and tokenizer.o files. These are compiled binary versions of my own matching .c files. If you would prefer to use these in your project instead of your own, you can replace them in the Makefile. Specifically, in the Makefile, there are instructions on how to comment out a line starting with SRCS, and uncommenting a replacement line. I provide these binaries so that people who have trouble with earlier parts of the project don't get slowed down, but I heavily encourage you to use your own code if it works for two reasons:

  • You'll feel a much stronger sense of ownership for the project if you do.
  • You won't be able to trace bugs back to my source code if you need to. You'll sometimes have to make do with cryptic errors from my code that say "linked list insertion error" or something like that. My code works as specified, but it will produce bad and sometimes difficult-to-understand output if it gets bad input.

Note that compiled binaries are operating system specific (sometimes version specific). These .o files were compiled to work on (some versions of) Linux. They work on Repl.it, which is a Linux virtual machine. They definitely won't work on Macs. They might work under WSL in Windows. I'm not sure.

The instructions for compiling and testing your code (including with valgrind) are exactly the same as the last assignment, so you should look there for details. The one difference is that I now have two sets of tests: one for the main part of the assignment, and one for the capstone. If you want to test the capstone portion, you can run those tests with the command make capstone. This will run the capstone tests, with and without valgrind.

9 What to submit

Add all of your files to your Git repository and push to GitHub. (It's up to you if you want to submit your interpreter executable or not.) Your submitted files should be exactly what you checked out, but also with a file named parser.c (and possibly linkedlist.c, talloc.c, and tokenizer.c if you are using your own.) Your Makefile may have changed if you modified it to work with my binaries.

The graders and I need to know which commit is the one that we should use for grading. This is handled well in Git with tags. You did this back in the original clab. If you've forgotten how to tag in Git, look at the end of part 1 of the C lab. Name your tag parser-done.


This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant and Laura Effinger-Dean.