COS 371: Programming Languages
Spring 2022
proj3: Tokenizer
Due: Wednesday, 04/06 at 22:00
1. Goals
To implement a tokenizer for Scheme in C.
2. Introduction
Now that you have a working (albeit simplistic) garbage collector, you are ready to commence work on the three main phases of interpretation. The first phase is tokenization. (You must implement the tokenizer from scratch: don't use Lex or any similar tool, even if you happen to know what it is and how to use it; similarly, don't use any regular expression libraries.)
3. Tasks
- In value.h, add the following to your valueType enumeration: OPEN_TYPE, CLOSE_TYPE, BOOL_TYPE, and SYMBOL_TYPE.
- Create a header file tokenizer.h with an #include guard (like all the header files from previous checkpoints) and with the following two function prototypes:

      Value *tokenize();
      void displayTokens(Value *list);

  The tokenize function should read stdin (the standard input stream in C) in its entirety and return a linked list consisting of all tokens found. The displayTokens function takes a linked list of tokens as input and displays those tokens, one per line, with each token's type.
- Implement tokenize and displayTokens in tokenizer.c. You may want to write a number of additional helper functions, but the helper functions are not part of the spec (and should not be usable/visible outside of tokenizer.c), so they should not appear in the header file.
- Create main_tokenize.c that, after including appropriate header files, simply does this:

      int main(void) {
          Value *list = tokenize();
          displayTokens(list);
          tfree();
          return 0;
      }

- Modify your Makefile to compile your new tokenizer code. When I'm in your directory and I execute the command make tokenizer, I should get an executable file called tokenizer in that directory. (There is a tutorial about Makefiles linked from the website.)
- To run your code, write some Scheme code in a file named input.scm, say. Then run

      ./tokenizer < input.scm

  The < tells the shell to feed the file input.scm to tokenizer as stdin.
- Create a collection of at least 4 pairs of test files. (More is better! The sooner you get your tokenizer totally right, the sooner you will get the rest of your project right.) Each pair of files should be named test.tokenizer.input.XX and test.tokenizer.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:

      ./tokenizer < test.tokenizer.input.XX | diff - test.tokenizer.output.XX
(Your code may be tested against my test files or other teams' test files.)
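As a concrete sketch, the header file from the task list above might look like this (the guard macro name TOKENIZER_H is a convention, not something mandated by the spec; only the two prototypes are required):

```c
/* tokenizer.h -- sketch of the required header.
   Only the two prototypes below are part of the spec. */
#ifndef TOKENIZER_H
#define TOKENIZER_H

#include "value.h"

/* Read all of stdin and return a linked list of tokens. */
Value *tokenize();

/* Print each token in list on its own line, as token:type. */
void displayTokens(Value *list);

#endif
```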
4. Specification
Here is the spec for displayTokens(tokenize()): read Scheme code from standard input and write to standard output a list of tokens accompanied by the type of each token. You must include the following token types: boolean, integer, float, string, symbol, open, close (the last two are for ( and )). The output format is, per line, token:type. You should strip out comments as you tokenize, so that anything after a ; on a line is ignored. (You'll also have to think about how to handle whitespace: sometimes it matters! sometimes it doesn't!)
Scheme has several other token types which you do not need to handle: character, #( (vector open), ' (quote), ` (quasiquote), , (unquote), ,@ (unquote-splicing), and . (the lone dot).
For example, here's a sample run of my code:

    $ cat test.tokenizer.input.01
    (+ x2 ( + ( quote x ) "foo;; still a string" 23) ;; now it's a comment!
    $ ./tokenizer < test.tokenizer.input.01
    (:open
    +:symbol
    x2:symbol
    (:open
    +:symbol
    (:open
    quote:symbol
    x:symbol
    ):close
    "foo;; still a string":string
    23:integer
    ):close
Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Syntax error: untokenizeable", or you can provide more detailed information in the error message depending on what the problem is. After encountering an error, your program should exit after printing the error: don't read any more input. This is why we wrote the function texit. Also feel free to print Scheme code around the error to be helpful if you can, though this is optional.
5. Syntax details
While Scheme is relatively simple as languages go, it is still rather complex. In the interest of sanity, we will restrict the spec somewhat (adapted from Dybvig):
- Numbers.

      <number> -> <sign> <ureal> | <ureal>
      <sign> -> + | -
      <ureal> -> <uinteger> | <udecimal>
      <uinteger> -> <digit>+
      <udecimal> -> . <digit>+ | <digit>+ . <digit>*
      <digit> -> 0 | 1 | ... | 9

  Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions atoi and atof in the standard library.
- Symbols. Our symbols (identifiers) are case-sensitive. While + and - can be at the start of numbers, by themselves they are single-character symbols. Handle them.

      <identifier> -> <initial> <subsequent>* | + | -
      <initial> -> <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
      <subsequent> -> <initial> | <digit> | . | + | -
      <letter> -> a | b | ... | z | A | B | ... | Z
      <digit> -> 0 | 1 | ... | 9

  This may be inconsistent with the behaviour of Scheme. If you wish to instead implement the entire Scheme grammar, that can be a possible extension at the end of the project.
- Strings. Handle escaped characters including (at least) the following: \n (newline), \t (tab), \\ (backslash), \' (single quote), and \" (double quote). You do not need to handle multiline strings; any newlines in a string must be explicitly written as \n.
- Lists. You must handle ( and ) to enclose lists. Some dialects of Scheme allow using [] and {} as list enclosers; you need not.
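As a sanity check on the number grammar above, here is a sketch of a standalone validator (an illustration only; isValidNumber is a made-up helper, and your tokenizer will likely recognize numbers incrementally rather than on complete strings):

```c
#include <ctype.h>

/* Check a complete string against the <number> grammar:
   optional sign, then <uinteger> or <udecimal>. */
int isValidNumber(const char *s) {
    if (*s == '+' || *s == '-') {
        s++;                        /* optional <sign> */
    }
    int digitsBefore = 0;
    while (isdigit((unsigned char)*s)) { s++; digitsBefore++; }
    if (*s == '.') {
        s++;
        int digitsAfter = 0;
        while (isdigit((unsigned char)*s)) { s++; digitsAfter++; }
        /* <udecimal> is . <digit>+  or  <digit>+ . <digit>* */
        if (digitsBefore == 0 && digitsAfter == 0) return 0;
    } else if (digitsBefore == 0) {
        return 0;                   /* no digits at all */
    }
    return *s == '\0';              /* nothing may follow */
}
```

Note that under this grammar -1234a and 3.1.4 are rejected, matching the discussion of DrRacket's behaviour below.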
You may also find Dybvig §1.1 helpful. The dialect described there may not be completely consistent with the above, but it's readable in English. What I have provided above is considered to be the correct specification for this assignment.
In general, you may try to match the behaviour of the R5RS dialect in DrRacket whenever you're in doubt. However, there are some exceptions. For example, DrRacket thinks -1234a and 3.1.4 are valid identifiers. But if you carefully check them against the simplified spec given above (or the full R5RS spec), you will find that these are not valid identifiers at all. I want you to follow the spec I gave you as closely as possible, while using DrRacket as a guide only.
Don't worry about block and datum comments or vectors, or for that matter any wacky syntax, unless you want to push this really far. Start from the basics and build up. By the way, if it helps, my code generates no tokens given input like "error (an unterminated string).
When testing your code, we intend to test on reasonable programs that follow well-understood and documented Scheme conventions. We may try some corner cases that will attempt to see if you have coded carefully, but we do not intend to find obscure details of the Scheme language standard to see if you have carefully deciphered every bit of it.
6. Notes
There are many different ways of reading input. I found it most useful to use the functions fgetc and occasionally ungetc. Look those up. They're more useful than fscanf.
The heart of your code will be your tokenize function, which reads the input and returns a list of Values. Here's a rough sketch of what that function might look like:

    Value *tokenize() {
        int charRead;            // int, not char, so EOF is detected reliably
        Value *list = makeNull();
        charRead = fgetc(stdin);
        while (charRead != EOF) {
            if (charRead == ...) {
                ...
            } else if (charRead == ...) {
                ...
            } else {
                ...
            }
            charRead = fgetc(stdin);
        }
        return reverse(list);    // see below
    }
The most natural way to handle the input involves treating the linked list of tokens that you are creating as a stack, rather than a queue. (You have cons, not add-to-the-end.) You may therefore find reverse helpful. Sure, it's inefficient, and you are more than welcome to improve on this now or later, but make it work first, and make it fast later (if at all).
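The cons-then-reverse idea looks like this, sketched with a minimal stand-in node type (your real version operates on Value cells from value.h and uses your garbage-collected cons instead of malloc):

```c
#include <stdlib.h>

/* Minimal stand-in for a cons cell, for illustration only. */
typedef struct Node {
    int item;
    struct Node *next;
} Node;

/* Build a reversed copy by consing each item onto the front of a
   new list -- exactly the trick reverse plays on your token list. */
Node *reverseList(Node *list) {
    Node *reversed = NULL;
    while (list != NULL) {
        Node *cell = malloc(sizeof(Node));
        cell->item = list->item;
        cell->next = reversed;      /* cons onto the front */
        reversed = cell;
        list = list->next;
    }
    return reversed;
}
```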
Here are some command-line tricks that you may want to look up or employ:
- You can run cat | ./tokenizer and type input without creating a file. Press Ctrl-D to send EOF.
- more or less: pagers to view the output of a shell command before it all scrolls off the screen.
- If you have errors that are generated by a program, the normal pipe through a pager won't help: only stdout, not stderr, is redirected through more if you run cat input.scm | ./tokenizer | more. You can instead do cat input.scm | ./tokenizer 2>&1 | more to get both stdout and stderr to go through more.
- Instead of piping output through a pager, you can save it to a file: cat input.scm | ./tokenizer > output.
- To show the output on the screen and still save it to the file, you can pipe it through tee: cat input.scm | ./tokenizer | tee output.
7. Submission
Follow the instructions from the previous assignment. In particular:
- Update credits.txt if applicable.
- Push all your code into your GitHub repository, including all created/modified files described above: value.h, tokenizer.h, tokenizer.c, main_tokenize.c, Makefile, and test.tokenizer.X.Y for X={input,output} and Y={01,02,03,04} (or more).
- When you are sure you are done, tag your commit with proj3 and push the tag.
This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Laura Effinger-Dean, and Jed Yang.
Start early, have fun, and discuss questions on Moodle.