COS 371: Programming Languages
Spring 2022
proj3: Tokenizer
Due: Wednesday, 04/06 at 22:00
1. Goals
To implement a tokenizer for Scheme in C.
2. Introduction
Now that you have a working (albeit simplistic) garbage collector, you are ready to commence work on the three main phases of interpretation. The first phase is tokenization. (You must implement the tokenizer from scratch: don't use Lex or any similar tool, even if you happen to know what it is and how to use it; similarly, don't use any regular expression libraries.)
3. Tasks
- In value.h, add the following to your valueType enumeration: OPEN_TYPE, CLOSE_TYPE, BOOL_TYPE, and SYMBOL_TYPE.
- Create a header file tokenizer.h with an #include guard (like all the header files from previous checkpoints) and with the following two function prototypes:

      Value *tokenize();
      void displayTokens(Value *list);

  The tokenize function should read stdin (the standard input stream in C) in its entirety and return a linked list consisting of all tokens found. The displayTokens function takes a linked list of tokens as input and displays those tokens, one per line, with each token's type.
- Implement tokenize and displayTokens in tokenizer.c. You may want to write a number of additional helper functions, but the helper functions are not part of the spec (and should not be usable/visible outside of tokenizer.c), so they should not appear in the header file.
- Create main_tokenize.c that, after including appropriate header files, simply does this:

      int main(void) {
          Value *list = tokenize();
          displayTokens(list);
          tfree();
          return 0;
      }

- Modify your Makefile to compile your new tokenizer code. When I'm in your directory and I execute the command make tokenizer, I should get an executable file called tokenizer in that directory. (There is a tutorial about Makefiles linked from the website.)
- To run your code, write some Scheme code in a file named input.scm, say. Then run

      ./tokenizer < input.scm

  The < tells the shell to feed the file input.scm to tokenizer as stdin.
- Create a collection of at least 4 pairs of test files. (More is better! The sooner you get your tokenizer totally right, the sooner you will get the rest of your project right.) Each pair of files should be named test.tokenizer.input.XX and test.tokenizer.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:

      ./tokenizer < test.tokenizer.input.XX | diff - test.tokenizer.output.XX
(Your code may be tested against my test files or other teams' test files.)
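As a concrete sketch, the header file from the task list above might look like this (the guard macro name TOKENIZER_H is a convention, not something mandated by the spec; only the two prototypes are required):

```c
/* tokenizer.h -- sketch of the required header.
   Only the two prototypes below are part of the spec. */
#ifndef TOKENIZER_H
#define TOKENIZER_H

#include "value.h"

/* Read all of stdin and return a linked list of tokens. */
Value *tokenize();

/* Print each token in list on its own line, as token:type. */
void displayTokens(Value *list);

#endif
```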
4. Specification
Here is the spec for displayTokens(tokenize()): read Scheme code from standard input and write to standard output a list of tokens accompanied by the type of each token. You must include the following token types: boolean, integer, float, string, symbol, open, close (the last two are for ( and )). The output format is, per line, token:type. You should strip out comments as you tokenize, so that anything after a ; on a line is ignored. (You'll also have to think about how to handle whitespace: sometimes it matters! sometimes it doesn't!)
Scheme has several other token types which you do not need to handle: character, #( (vector open), ' (quote), ` (quasiquote), , (unquote), ,@ (unquote-splicing), and . (the lone dot).
For example, here's a sample run of my code:

    $ cat test.tokenizer.input.01
    (+ x2 ( + ( quote x ) "foo;; still a string" 23) ;; now it's a comment!
    $ ./tokenizer < test.tokenizer.input.01
    (:open
    +:symbol
    x2:symbol
    (:open
    +:symbol
    (:open
    quote:symbol
    x:symbol
    ):close
    "foo;; still a string":string
    23:integer
    ):close
Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Syntax error: untokenizeable", or you can provide more detailed information in the error message depending on what the problem is. After encountering an error, your program should exit after printing the error: don't read any more input. This is why we wrote the function texit. Also feel free to print Scheme code around the error to be helpful if you can, though this is optional.
5. Syntax details
While Scheme is relatively simple as languages go, it is still rather complex. In the interest of sanity, we will restrict the spec somewhat (adapted from Dybvig):
- Numbers.

      <number> -> <sign> <ureal> | <ureal>
      <sign> -> + | -
      <ureal> -> <uinteger> | <udecimal>
      <uinteger> -> <digit>+
      <udecimal> -> . <digit>+ | <digit>+ . <digit>*
      <digit> -> 0 | 1 | ... | 9

  Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions atoi and atof in the standard library.
- Symbols. Our symbols (identifiers) are case-sensitive. While + and - can be at the start of numbers, by themselves they are single-character symbols. Handle them.

      <identifier> -> <initial> <subsequent>* | + | -
      <initial> -> <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
      <subsequent> -> <initial> | <digit> | . | + | -
      <letter> -> a | b | ... | z | A | B | ... | Z
      <digit> -> 0 | 1 | ... | 9

  This may be inconsistent with the behaviour of Scheme. If you wish to instead implement the entire Scheme grammar, that can be a possible extension at the end of the project.
- Strings. Handle escaped characters including (at least) the following: \n (newline), \t (tab), \\ (backslash), \' (single quote), and \" (double quote). You do not need to handle multiline strings; any newlines in a string must be explicitly written as \n.
- Lists. You must handle ( and ) to enclose lists. Some dialects of Scheme allow using [] and {} as list enclosers; you need not.
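As a sanity check on the number grammar above, here is a sketch of a standalone validator (an illustration only; isValidNumber is a made-up helper, and your tokenizer will likely recognize numbers incrementally rather than on complete strings):

```c
#include <ctype.h>

/* Check a complete string against the <number> grammar:
   optional sign, then <uinteger> or <udecimal>. */
int isValidNumber(const char *s) {
    if (*s == '+' || *s == '-') {
        s++;                        /* optional <sign> */
    }
    int digitsBefore = 0;
    while (isdigit((unsigned char)*s)) { s++; digitsBefore++; }
    if (*s == '.') {
        s++;
        int digitsAfter = 0;
        while (isdigit((unsigned char)*s)) { s++; digitsAfter++; }
        /* <udecimal> is . <digit>+  or  <digit>+ . <digit>* */
        if (digitsBefore == 0 && digitsAfter == 0) return 0;
    } else if (digitsBefore == 0) {
        return 0;                   /* no digits at all */
    }
    return *s == '\0';              /* nothing may follow */
}
```

Note that under this grammar -1234a and 3.1.4 are rejected, matching the discussion of DrRacket's behaviour below.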
You may also find Dybvig §1.1 helpful. The dialect described there may not be completely consistent with the above, but it's readable in English. What I have provided above is considered to be the correct specification for this assignment.
In general, you may try to match the behaviour of the R5RS dialect in DrRacket whenever you're in doubt. However, there are some exceptions. For example, DrRacket thinks -1234a and 3.1.4 are valid identifiers. But if you carefully check them against the simplified spec given above (or the full R5RS spec), you will find that these are not valid identifiers at all. I want you to follow the spec I gave you as closely as possible, while using DrRacket as a guide only.
Don't worry about block and datum comments or vectors, or for that matter any wacky syntax, unless you want to push this really far. Start from the basics and build up. By the way, if it helps, my code generates no tokens given input like "error (an unterminated string).
When testing your code, we intend to test on reasonable programs that follow well-understood and documented Scheme conventions. We may try some corner cases that will attempt to see if you have coded carefully, but we do not intend to find obscure details of the Scheme language standard to see if you have carefully deciphered every bit of it.
6. Notes
There are many different ways of reading input. I found it most useful to use the functions fgetc and occasionally ungetc. Look those up. They're more useful than fscanf.
The heart of your code will be your tokenize function, which reads the input and returns a list of Values. Here's a rough sketch of what that function might look like:

    Value *tokenize() {
        int charRead;            // int, not char, so EOF is detected reliably
        Value *list = makeNull();
        charRead = fgetc(stdin);
        while (charRead != EOF) {
            if (charRead == ...) {
                ...
            } else if (charRead == ...) {
                ...
            } else {
                ...
            }
            charRead = fgetc(stdin);
        }
        return reverse(list);    // see below
    }
The most natural way to handle the input involves treating the linked list of tokens that you are creating as a stack, rather than a queue. (You have cons, not add-to-the-end.) You may therefore find reverse helpful. Sure, it's inefficient, and you are more than welcome to improve on this now or later, but make it work first, and make it fast later (if at all).
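The cons-then-reverse idea looks like this, sketched with a minimal stand-in node type (your real version operates on Value cells from value.h and uses your garbage-collected cons instead of malloc):

```c
#include <stdlib.h>

/* Minimal stand-in for a cons cell, for illustration only. */
typedef struct Node {
    int item;
    struct Node *next;
} Node;

/* Build a reversed copy by consing each item onto the front of a
   new list -- exactly the trick reverse plays on your token list. */
Node *reverseList(Node *list) {
    Node *reversed = NULL;
    while (list != NULL) {
        Node *cell = malloc(sizeof(Node));
        cell->item = list->item;
        cell->next = reversed;      /* cons onto the front */
        reversed = cell;
        list = list->next;
    }
    return reversed;
}
```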
Here are some command-line tricks that you may want to look up or employ:
- You can run cat | ./tokenizer and type input without creating a file. Press Ctrl-D to send EOF.
- more or less: pagers to view the output of a shell command before it all scrolls off the screen.
- If you have errors that are generated by a program, the normal pipe through a pager won't help: only stdout, not stderr, is redirected through more if you run cat input.scm | ./tokenizer | more. You can instead do cat input.scm | ./tokenizer 2>&1 | more to get both stdout and stderr to go through more.
- Instead of piping output through a pager, you can save it to a file: cat input.scm | ./tokenizer > output.
- To show the output on the screen and still save it to the file, you can pipe it through tee: cat input.scm | ./tokenizer | tee output.
7. Submission
Follow the instructions from the previous assignment. In particular:
- Update credits.txt if applicable.
- Push all your code into your GitHub repository, including all created/modified files described above: value.h, tokenizer.h, tokenizer.c, main_tokenize.c, Makefile, and test.tokenizer.X.Y for X={input,output} and Y={01,02,03,04} (or more).
- When you are sure you are done, tag your commit with proj3 and push the tag.
This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Laura Effinger-Dean, and Jed Yang.
Start early, have fun, and discuss questions on Moodle.