Morphological Parsing

Due Wednesday 4/4/01, by class time. Submit your code via HSP . Be prepared to do a brief (five minutes) demo of your program in class, including explanation of your approach to the problem and any features of your solution that strike you as interesting.

I encourage you to work with a partner or two on this and all non-exam assignments for this class.

The Input

Your program will take three text files as input.

  1. A file consisting of nouns whose plural forms are irregular. Each line of this file will contain a singular noun followed by a comma, followed by the plural of the noun.

  2. A file consisting of one noun per line. The nouns may be singular or plural, regular or irregular. For effective testing of your program, you'll want some of each.

  3. A file consisting of one lexical representation of a noun on each line. For example, one such line might be "goose +N +PL".

The Output

Your program should take the names of the three input files as command-line arguments (for your users' convenience, please provide a correct usage statement if the program is executed with other than three arguments). The program should then deliver to standard output (cout) first a list of the lexical representations of the lines in file number 2, followed by the surface representations (i.e. the correct spellings) of the lines in file number 3.

For example, if file 2 is


	ox
	oxen
	ax
	axes
	socks
	sock

and file 3 is


	goose +N +PL
	moose +N +PL
	excuse +N +SG
	truce +N +PL
	truss +N +PL

then the output should be something like


	ox       ->     ox +N +SG
	oxen     ->     ox +N +PL
	ax       ->     ax +N +SG
	axes     ->     ax +N +PL
	socks    ->     sock +N +PL
	sock     ->     sock +N +SG
	
	goose +N +PL    ->    geese
	moose +N +PL    ->    moose
	excuse +N +SG   ->    excuse
	truce +N +PL    ->    truces
	truss +N +PL    ->    trusses

Advice

Try to develop a single finite state transducer that can be used in both directions. It's good to think of FSTs as two-way translators (among other things--see page 72 of Jurafsky and Martin). Also, I would develop a small FST class interface to isolate the translation code inside a small number of well-chosen functions.

The non-deterministic FSA search code on page 44 might be handy (in modified form, to be sure).

Work together. Need help finding a compatible work partner? Let me know.

Need an example to help you handle command-line arguments? Try slowmapwordcounter.cpp (developed to illustrate the C++ Standard Template Library "map" template, but it includes a small amount of command-line handling code).

Ignore my advice if it seems appropriate to do so.

Have fun, start early, and keep in touch.



Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu