Tokenizer

For this part of the assignment, you'll build a tokenizer for Scheme, in C. (You must implement the tokenizer from scratch – don't use Lex or any similar tool, even if you happen to know what it is and how to use it.)

This is an interpreter project team assignment, which means that if you are working on a team with someone else, you and your partner should do your best to even the workload between you. Strict pair programming is not required, but you should make sure that both team members are contributing to the project. I do recommend that you try to do pair programming if it is convenient to do so.

1 Get started

1.1 Use GitHub classroom to create a repository for your assignment

The first thing you'll need to do is to create a repository for your project using GitHub classroom. If you are working on a team, this repository will be shared, so only one of you should do this. To save a round of communication, I'm going to designate that if you are working in a pair, the member of your team whose Carleton official email address comes second alphabetically should do this next step of setting up the GitHub repository. (This is the opposite of the previous pair assignment; I'm trying to give everyone an equal chance to participate.) If you are working alone, then you should do this step.

If you are the designated person above, visit this GitHub classroom link (which I've placed in Moodle). Log into GitHub if your browser prompts you to. You should be taken to a screen that invites you to either "Join an existing team" or to "Create a new team". Pick the team that you created for the last team assignment. GitHub should then respond with a screen that says "You are ready to go!" and supplies you with a link for the repository that you are creating. Click on that link, and you should be taken to the repository. If you've gotten this far, you have successfully created the repository, and you, the logged-in user, have access to it. Contact your partner if you have one and tell them that this step worked. (If you are working alone on the project, you can skip the next step.)

The other one of you, after the above is complete, should visit the same GitHub classroom link in the first sentence in the previous paragraph. You'll see the screen asking you to "Join an existing team" or to "Create a new team." You should be able to see the team that your partner created above; click the "Join" button. This should then take you to the "You are ready to go!" screen, and you can hopefully find a link to the repository. Click on it. You now both have access to the same GitHub repository.

1.2 Select the correct repl in Repl.it (only if you're in a pair)

You should use the same repl in Repl.it that you used for the last assignment.

Again, for the person designated above as the starting person: you should now clone the repository from GitHub into Repl.it. If you're in a pair, you can in principle both be watching in multiplayer mode while you do this.

In Repl.it, make sure that you are in your home directory in the terminal window. You can always navigate there by typing

cd ~

in the terminal window. To further confirm, type pwd (short for "print working directory"). You should see /home/runner. If you see something different, you're not in the home directory, so try again or ask for help.

In a different browser tab (leave the one with Repl.it open), visit our class GitHub organization page online. Once you've landed there, you should see the GitHub repository you created in the GitHub Classroom step above. It should be named tokenizer-username (where username is either your own username, or a combination of the two of you if you're working in a pair). Contact us for help if the repository isn't there. Click on that repository title. This should take you to the home page for that repository. There is a green box on the page that says "Clone or download." Click that box, and make sure it says "Clone with HTTPS." Highlight the https link shown, and copy it to the clipboard. The link that you're copying should look something like

https://github.com/carleton251-term/tokenizer-username.git

… though your username and term will be different.

Then navigate back to the Repl.it tab with the repl that you created above, and click on the terminal window within your new repl. Your next step is to clone the repository from GitHub. To do that, type git clone, then a space, then paste in the URL from the github page that you copied from above. (To paste, you might need to right-click in the terminal window and choose paste rather than using a keyboard shortcut.)

For example, I would type the following, though your username and term will be different:

git clone https://github.com/carleton251-term/tokenizer-username.git

You'll be prompted to enter your GitHub username and password.

If all goes well, you should now have a directory named tokenizer-username. If you type ls at the prompt, you should be able to see it. It will also appear in the Repl.it file browser on the left. Then navigate into that directory by typing

cd tokenizer-username

… and you should be all set to work.

2 Assignment details

You should fill in tokenizer.c to implement the functions specified in tokenizer.h. More generally, your tokenizer should read Scheme code from a file and return a linked list of tokens accompanied by the type of each token. You must include the following token types:

boolean, integer, float, string, symbol, open, close

(the last two are for ( and )). The output format is, per line, token:type.

On comments: You should strip out comments as you tokenize, so that anything after a ; on a line is ignored. In detecting whether you've found the end of a line, be careful if you have created your test file with an editor on Microsoft Windows. Linux and Mac use an ASCII \n at the end of a line, but Windows uses \r\n. We will be running our tests on Linux, so you don't need to worry about this unless you are creating your own test files on Windows. If you are, you need to either create your test files on Linux, or make sure that your tokenizer code can also handle \r\n at the end of a line.
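
For instance, here's a minimal sketch of comment skipping (skipComment is a hypothetical helper name of my own, not something from tokenizer.h). Note that since \r comes immediately before \n in Windows line endings, skipping everything up to and including the \n handles both conventions:

#include <stdio.h>

// Hypothetical helper: after seeing ';', throw away the rest of the line.
// Skipping up to and including '\n' consumes any preceding '\r' too,
// so both Unix (\n) and Windows (\r\n) line endings work.
void skipComment() {
    int c = fgetc(stdin);
    while (c != EOF && c != '\n') {
        c = fgetc(stdin);
    }
}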

For example, here's the input file t04.scm that I supplied:

(+ x2 ( + ( quote x ) "foo;; still a string" 323) ;; now it's a comment!

On the above, your program should output:

(:open
+:symbol
x2:symbol
(:open
+:symbol
(:open
quote:symbol
x:symbol
):close
"foo;; still a string":string
323:integer
):close

Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Error: untokenizeable", or you can provide more detailed information in the error message depending on what the problem is. Whatever you print, it should start with the text "Error" so that it will pass the automated tests. (Upper or lower case doesn't matter.) After encountering an error, your program should exit after printing the message - don't read any more input. This is why we wrote the function texit. Also feel free to follow your error message by printing the Scheme code around the error to be helpful if you can, though this is optional.
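
As a sketch, the error path might look like the following. (This assumes texit takes an exit status, as in the previous assignment's talloc.h, and the helper name is mine, not a required function.)

#include <stdio.h>
#include "talloc.h"

// Hypothetical helper: report an error and stop reading input.
// texit frees all talloc'd memory before exiting, so nothing leaks.
void tokenizeError(char *message) {
    printf("Error: %s\n", message);
    texit(1);
}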

3 Syntax details

Scheme is a complex language. In the interest of sanity, here is a simplified syntax for numbers that I expect your tokenizer to handle (adapted from Dybvig's book):

<number>   ->  <sign> <ureal> | <ureal>
<sign>     ->  + | -
<ureal>    ->  <uinteger> | <udecimal>
<uinteger> ->  <digit>+
<udecimal> ->  . <digit>+ | <digit>+ . <digit>*
<digit>    ->  0 | 1 | ... | 9

Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions strtol and strtod in the standard library.
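
For example, here's a small self-contained sketch of those two functions (the string contents are just illustrative):

#include <stdio.h>
#include <stdlib.h>

int main() {
    char *end;

    // strtol parses a long in the given base; end points at the first
    // unparsed character, so *end == '\0' means the whole string was
    // a valid integer.
    long i = strtol("323", &end, 10);
    printf("%ld\n", i);

    // strtod does the same for doubles.
    double d = strtod("3.14", &end);
    printf("%f\n", d);

    return 0;
}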

Similarly, you should recognize symbols (identifiers) with the following syntax (again adapted from Dybvig's book):

<identifier> ->  <initial> <subsequent>* | + | -
<initial>    ->  <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
<subsequent> ->  <initial> | <digit> | . | + | -
<letter>     ->  a | b | ... | z | A | B | ... | Z
<digit>      ->  0 | 1 | ... | 9 

This is a little inconsistent with the actual behavior of Scheme, but it simplifies things a bit (at the cost of some minor incompatibilities).

Symbols are case-sensitive.
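
If it helps, the character classes above translate naturally into helper functions like these (a sketch; the names are mine, not required by tokenizer.h):

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

// Sketch of the <initial> and <subsequent> classes from the BNF above.
// Remember that "+" and "-" on their own are also valid identifiers.
bool isInitial(int c) {
    return isalpha(c) || strchr("!$%&*/:<=>?~_^", c) != NULL;
}

bool isSubsequent(int c) {
    return isInitial(c) || isdigit(c) || c == '.' || c == '+' || c == '-';
}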

You may also find the syntax section of Dybvig's book to be helpful. The dialect described there may not be completely consistent with the above, but it's readable in English. The BNF that I have provided above is the correct specification for this assignment. When in doubt, try to match the behavior of Guile.

When testing your code, we intend to test on reasonable programs that follow well-understood and documented Scheme conventions. We may try some corner cases to see if you have coded carefully, but we do not intend to dig up obscure details of the Scheme language standard to see whether you have deciphered every bit of it.

Scheme provides many different ways of enclosing lists (parentheses, square brackets, etc.). We will only test code that uses parentheses; none of our test code will use square brackets, so you can write your tokenizer and later parts of the project to work only on parentheses. That said, if you wish your project to also work on square brackets, it isn't a hard extension. If you think you want to go that way, you'll need to make sure that you tokenize parentheses and brackets differently.

4 Some additional hints

It is theoretically possible to code this assignment so that it produces precisely the right output without actually storing the data in a list. Don't do that; it might make this portion of the project easier, but it will leave you in a useless position for the next portion.

There are many different ways of reading input files. I found it most useful to use the functions fgetc and occasionally ungetc. Look those up.
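
For instance, ungetc lets you look at the next character and then put it back, which is handy when deciding whether a + or - starts a number or stands alone as a symbol. A sketch (peek is a hypothetical helper name):

#include <stdio.h>

// Hypothetical helper: look at the next character without consuming it.
int peek() {
    int c = fgetc(stdin);
    if (c != EOF) {
        ungetc(c, stdin);   // push it back; the next fgetc returns it again
    }
    return c;
}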

The heart of your code will be your tokenize function, which reads the file and returns a list of Values. Here's a rough sketch of what that function might look like:

Value *tokenize() {
    int charRead;    // must be an int, not a char: fgetc returns an int so
                     // that EOF can be distinguished from real characters
    Value *list = makeNull();
    charRead = fgetc(stdin);

    while (charRead != EOF) {

        if (charRead == .....) {
            ...
        } else if (charRead == ......) {
            ...
        } else {
            ...
        }
        charRead = fgetc(stdin);
    }

    Value *revList = reverse(list);
    return revList;
}

We will guarantee that no token will be longer than 300 characters. That includes strings. Your code doesn't need to handle arbitrarily long tokens.

We will guarantee that no line will be longer than 300 characters. Your code doesn't need to handle arbitrarily long lines.
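
Those guarantees mean a fixed-size buffer is plenty for accumulating one token's characters. A sketch, with a deliberately simplified delimiter set:

#include <stdio.h>

// Sketch: collect one token's characters into a fixed buffer.
// 300 is the guaranteed maximum token length; +1 leaves room for '\0'.
// Real code needs a fuller delimiter set (';', '"', '\r', tabs, ...).
void readToken(char buf[301]) {
    int len = 0;
    int c = fgetc(stdin);
    while (c != EOF && c != ' ' && c != '\n' &&
           c != '(' && c != ')' && len < 300) {
        buf[len] = (char)c;
        len++;
        c = fgetc(stdin);
    }
    if (c != EOF) {
        ungetc(c, stdin);   // the delimiter belongs to the next token
    }
    buf[len] = '\0';
}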

FAQ: Should your tokenizer catch errors with mismatched parentheses, e.g. )(a b ))? No. That's the parser's job.

5 Capstone work

Work in this section is 100% optional, and not worth any points. Nonetheless, if you're looking for an extra challenge, these are fun additional exercises to try.

Here are some additional details of the Scheme language that you can integrate if you wish; they are unnecessary for the core part of the project going forward, though they may play into later capstone tasks:

  • Square brackets: [ ]. Add openbracket and closebracket token types, and tokenize brackets as well.
  • Dot: . (a single period). A dot by itself is useful for handling dotted pairs (which may be a later optional addition). A dot token should be identified if and only if the dot has whitespace on both sides of it; a dot at the beginning or end of the file cannot be a dot token.
  • Single quote: '. Add a singlequote token type, and handle appropriately.

6 The files you start with

After you clone your repository, you should be able to see that you get the following starting files:

  • value.h: essentially the same header we have been using for a Value, with some new types added
  • linkedlist.h: the same header file from the last assignment
  • talloc.h: the same header file from the last assignment
  • tokenizer.h: the header file for the functions you'll need to write for this assignment. You'll also create a number of helper functions for that, but those don't need to appear in the header since they will be "private".
  • main.c: Short main function that drives the whole program.
  • Makefile: contains instructions for the command make, which will compile and test your code
  • tests/: a directory of Scheme programs as test inputs, and expected outputs
  • tester.py: a script to handle automated testing.

At this point, you have a choice regarding how to proceed for linkedlist.c and talloc.c. If you want to continue to build an interpreter that is truly yours, you can copy in these files from the last project and move forward. Alternatively, if you would prefer to use my versions instead, you can do so. In the lib directory you'll see linkedlist.o and talloc.o files. These are compiled binary versions of my linkedlist.c and talloc.c. If you would prefer to use one or both of these in your project instead of your own, you can swap them in via the Makefile. Specifically, the Makefile contains instructions on how to comment out a line starting with SRCS and uncomment a replacement line. I provide these binaries so that people who had trouble with earlier parts of the project don't get slowed down, but I strongly encourage you to use your own code, if it works, for two reasons:

  • You'll feel a much stronger sense of ownership for the project if you do.
  • If you use my binaries, you won't be able to trace bugs into the linked list / talloc source code when you need to. You'll sometimes have to make do with cryptic errors from my code, such as "linked list insertion error." My code works as specified, but it will produce bad and sometimes difficult-to-understand output if it gets bad input.

Note that compiled binaries are operating system specific (sometimes version specific). These .o files were compiled to work on (some versions of) Linux. They work on Repl.it, which is a Linux virtual machine. They definitely won't work on Macs. They might work under WSL in Windows. I'm not sure.

To compile your code, issue the command "make" at the command prompt. (This will give an error if you haven't yet created the .c files you need, or if you haven't changed the Makefile to refer to the necessary .o files.)

To run your code, first create a file with some of your own Scheme code in it, and give it a filename (e.g., myprog.scm). Then issue this command at a prompt:

./interpreter < myprog.scm

Your tokenizer code should read data from stdin; the above command then redirects the text of the myprog.scm file into your program.

To run your code through valgrind, you can similarly type

valgrind ./interpreter < myprog.scm

One challenge you may find is that when your output gets long, it gets hard to read before it scrolls off the top of your terminal window. Piping your output through the UNIX less program helps. It will pause the output at the bottom of the screen, waiting for you to press the space bar to continue, or q to quit. You can do this with or without valgrind:

./interpreter < myprog.scm | less
valgrind ./interpreter < myprog.scm | less

Piping the results through less as shown above works, but there's an additional problem to solve. The output from your program, as well as from valgrind, comes from two different streams: one is known as "standard out," and the other is known as "standard error." By default, when you use the vertical bar (which means "pipe"), only standard out is piped into less. This means that if you have a lot of error messages from your program, or info from valgrind (which tends to go through standard error), it will still scroll off the screen. To fix this, you want to first redirect the standard error stream to the standard out stream; then you want to pipe all of that output to less. To do so, with and without valgrind, you use the following arcane syntax:

./interpreter < myprog.scm 2>&1 | less
valgrind ./interpreter < myprog.scm 2>&1 | less

The shell program that we are using (known as bash) refers to standard error as stream number 2, and standard out as stream number 1. The above syntax says to redirect stream 2 into stream 1. If you'd like to learn more about all of this, I'd encourage you to look at the documentation for bash; type man bash at a command prompt.

7 Testing your code

I have created a series of test files for you, which will be run automatically on GitHub. To run those tests locally, type make test at the prompt. This will compare the output of your tokenizer with the correct output, and also run it through valgrind. My test script does some cleaning of the output, eliminating extra whitespace, additional decimal places on doubles, and so on, to give your code the best shot at matching mine. If a test still fails because of some trivial issue (spacing or the like), you'll need to clean that up in your code to make the tests pass.

8 What to submit

Add all of your files to your Git repository and push to GitHub. (It's up to you whether you want to submit your interpreter executable.) Your submitted files should be exactly what you checked out, plus a file named tokenizer.c (and possibly linkedlist.c and talloc.c if you are using your own). Your Makefile may have changed if you modified it to work with my binaries.

The graders and I need to know which commit is the one that we should use for grading. This is handled well in Git with tags. You did this back in the original clab. If you've forgotten how to tag in Git, look at the end of part 1 of the C lab. Name your tag tokenizer-done.

Have fun tokenizing!


This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant and Laura Effinger-Dean.