CS 357: Natural Language Processing

An Earley-based parser

Due 11:59PM Wednesday 4/16. You may work in pairs or individually. Submit your code via HSP .

The goal

Implement the Earley algorithm. Your implementation should accept two text files and a string as input. The text files will contain a grammar and a lexicon, respectively. The string will be a sentence to be parsed. For each valid interpretation of the sentence in the given grammar, your program will print out a parse tree, and the sentence tagged by part of speech (e.g. "They [n] study [vt] fish [n] in [prep] cans [n]").

Use the handout I gave you in class as the basis for your algorithm. Note that the handout tells you how to build the parse chart, but does not tell you how to extract parse trees from the chart--that job will be for you.

File formats

The grammar file will be a text file each of whose lines is either a comment (starting with a # sign), a blank line, or a rule. For example, the following would be a simple grammar file:

# My little grammar
S : NP VP

# Noun phrases
NP :  det  n
 NP : n

# Verb phrases
VP : v
VP: v NP

I have included some quirkly formatting here (inconsistent spacing before the colon, extra spaces in some places). This is to emphasize that whitespace acts as a delimiter between symbols, but should be ignored otherwise. Also, non-terminal constituents are all caps and lexical constituents are all lower case.

The lexicon file should have the same format, like so:

# My li'l lexicon
v : run
v : study
n : fish
n:cans
 n  :  arrow
[etc.]

If you are careful in your design, you should be able to write a common set of functions to read both the lexicon file and the grammar file.

Output

Printing out the parse tree is tricky in text, but there are ways to do it. Here, for example, is one approach:

S
    NP
        n - I
    VP
        vt - like
        NP
            n - candy

Go ahead and be creative. Sometimes a clever use of brackets or line spaces can greatly enhance the readability of your output.

Have fun, start early, and keep in touch.




Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu