CS 251: Parser

Design and implement a parser for Racket, in C. You must implement the parser from scratch - don't use Yacc or Bison or any similar tool, even if you know what it is and how to use it - though you can use any approach to your code that you'd like, including approaches that are inspired by Yacc/Bison/etc.

Parsing

You will need to write a function parse that uses the token list to construct a parse tree. You can use the linked list structure you created in the first part of the project to represent the parse tree; for example, the expression (define x (quote a)) would become the parse tree:

This parse tree is simplified slightly, in that within a ConsCell, car and cdr are always pointers. I've compressed the picture a little bit here for purposes of brevity. You'll also note that I'm using a Value as the "root" of the parse tree; you could also use a ConsCell with its cdr.consValue set to NULL.

I recommend that you do not include parentheses in your parse tree. Their purpose is to tell you the structure of the tree, so once you have an actual tree you don't need them anymore - they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)

Parsing languages can be pretty complicated. The good news is that parsing Scheme is really easy, at least in a relative sense. Since your Scheme code IS a tree, you just need to be able to read it as such and turn it into a tree in memory. You can do it with a stack of tokens:

Initialize an empty stack.
While there are more tokens:
- Get the next token.
- If the token is anything other than a close paren, push it onto the stack.
- If the token is a close paren, start popping items from your stack until you pop off an open paren (and form a list of them as you go). Push that list back on the stack.

So you'll need a stack... your linked list is a fine candidate for it.

Main function

Put your main function that contains test code for your parser into a separate C file. You'll want to put an interface for your parser into a header file (either common.h or a new file), and put your implementation into another C file. These will be part of your final project; the parsing tester code will not be included (at least not wholly).

Here is the spec for your tester code. It shall read Scheme code from standard input and write to standard output a parse tree of that Racket code, or the phrase syntax error. You should parse each line of input separately: read in a line, then print the parse tree for that line (except for multi-line expressions - see below). Your parser should do the following repeatedly: (1) tokenize a line; (2) parse the list of tokens; (3) print the parse tree.

The output format for a parse tree of valid Scheme code is very simple: it looks exactly like Scheme code! To output an internal node in the parse tree, output ( followed by the outputs of the children of that internal node from left to right followed by ). To output a leaf node (a non-parenthetical token), just output its value as you did in the tokenizer, sans type.

Multi-line inputs

If the input code has too few close parentheses, you should give the code a chance to continue. What I mean by this is that a valid Racket expression can cross line breaks: your Racket code often starts a function definition on one line and continues it on the next. If this happens, you need to take the following actions:

Save the results of the parse as an incomplete parse tree.
Continue the parse with the new tokens from the next line and the incomplete parse tree.

I suggest first getting everything working for the single-line case. From there, you should be able to figure out how to represent an "incomplete parse tree" (which is not really a tree at all).

Conversely, you should be able to handle blank lines or lines with just a comment. For these lines, lex will return an empty token list.

Syntax errors

As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. There are four different types of syntax errors you should handle:

If the input code is untokenizable, print syntax error.
If the input code ever has too many close parentheses (in other words, if you ever encounter a close paren that doesn't match a previous open paren), print syntax error.
If a line contains more than one S-expression, print syntax error. An S-expression is either a single atom or a list of S-expressions:
```
<S>    ::= <atom> | ( <S>* )
<atom> ::= <number> | <identifier> | ...
```
If the parse function returns an incomplete parse tree and the end of the input is reached and there are too few close parentheses, print syntax error.

If you ever encounter a syntax error, you may exit your program after printing "syntax error" - don't read any more input. Print the parse tree of any successfully parsed line as you go, so that you'll get the first 10 parsed lines followed by syntax error if there's a problem on line 11.

Sample output

$ cat test1
(if 2 3)
(+ ))

$ cat test2
(define map
  (lambda (f L)
    (if (null? L)
        L
        (cons (f (car L))
              (map f (cdr L))))))

$ cat test1 | ./parser
(if 2 3)
syntax error

$ cat test2 | ./parser
(define map (lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L)))))

Sample code

Here is a rough approximation of my main function. My addToParseTree function takes in a pre-existing tree, a token to add to it, and a pointer to an integer depth. depth is updated to represent the number of unclosed open parentheses in the parse tree. The parse tree is complete if (and only if) depth == 0 when the parse function returns.


int main() {
    Value *tokens = tokenize();
    Value *tree = NULL;
    int depth = 0;

    while (tokens != NULL) {
        Value *token;
        popFromFront(&tokens,&token);
        tree = addToParseTree(tree,,token,&depth);
        if (depth == 0) {
            printTree(tree);
            printf("\n");
            freeTree(tree);
            tree = NULL;
        }
    }

    if (depth != 0) {
        syntaxError(); // error case 4
    }
}

Memory errors

Your code should have no memory leaks or memory errors when run on any input (correct or incorrect) using valgrind. We will be checking this during grading.

Turning in your work

Commit and push all of your work to your git repository, and tag it as parser1 (where you increment the number if you resubmit). You may put the source code wherever you want, but you must add a top-level directory called parser and put the following in that directory:

A Makefile. When I'm in that directory and I execute the command make, I should get an executable file called parser in that directory.
A collection of at least 8 pairs of test files. (More is better.) Each pair of files should be named parser-test.input.XX and parser-test.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:
```
cat parser-test.input.XX | ./parser | diff parser-test.output.XX -
```
A file credits.txt listing the members of your team, anyone other than me who helped your team with the assignment, and any websites, books or other sources you used for reference. When in doubt, give credit!

Good luck, and have fun!

This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant and Laura Effinger-Dean.