THE GRYPHON MANIFESTO

The Definitive Guide to the Inner Workings of the Greatest Comps Group
Project Ever

by Eric Lantz, March 2, 2004


I Overview

Gryphon is a dialogue system (see note 1) designed to allow a user to 
get information about available courses through a natural language input 
interface and a bimodal speech-text output.  It incorporates several open 
source technologies as leverage to accomplish this daunting task.  These
technologies provided a framework through which a non-linear, multi-stage
system was created to perform the above task.  This document provides an 
overview of the system components from a technical perspective, with 
particular attention paid to the transformational aspects of each module 
from input to output.  It is designed to portray the system in a manner 
that is as straightforward as possible.  I hope it succeeds in that regard.


II Dialogue Systems - Methods For Domain Restricted Conversation

Several types of dialogue systems have been developed and deployed in 
various capacities.  Most of these systems do not attempt the daunting task
of universal understanding of human conversation, but rather focus on a 
particular restricted domain in which the user (a human) has a particular
task to complete.  This goal-oriented structure places restrictions on both
participants, a condition designers of dialogue systems can use to their 
advantage.  Despite the amazing diversity of human speech, the vernacular
used in stranger-to-stranger transactions - arranging a train ticket, for
example - is quite limited.  In this regard, there are three primary (though
somewhat overlapping) levels of complexity that dialogue systems utilize:

State-based systems: This level is very similar to the "Press 1 for customer
comments, press 2 for product information, ..." menus that have become
popular in automated telephone systems, except that the input is via speech
rather than a telephone keypad.  The system restricts the options that the
user has and accepts only one of the options for that state, directing the
"conversation" to the next predetermined list of options.  This architecture
is the easiest to design and the least flexible from the user's perspective.

Frame-based systems: Here the system has a list of information slots that it
is trying to fill in order to move on in the conversation.  However, unlike
state-based systems, these slots can be filled in any order.  For example, if
a user of a plane ticket system says she wants a flight from Boston to 
Toronto on Thursday, the system would ask what day she wants to return, 
since it needs to know both dates to plan a flight.  Frame-based systems tend
to be more user-led than state-based systems, because the user can decide 
the order in which information is provided.

Agent-based systems:  These are the most complicated systems, in which the
system keeps a complex internal state, often trying to determine the goal of
the user among several possible options without being explicitly told.  For 
example, if the user asks if movie theater A is showing movie B this 
afternoon, the system may say no, but movie theaters C and D nearby are 
showing the movie.  The user is allowed to introduce new topics at any time.
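
To make the contrast between the first two levels concrete, here is a
minimal sketch in Python.  It is only an illustration: the menu, the slot
names, and the function names are all hypothetical and are not taken from
Gryphon or from any real system.

```python
# Hypothetical illustration; none of these names come from Gryphon.

# State-based: each state accepts only its own listed options.
MENU = {
    "start": {"comments": "comments", "products": "products"},
    "comments": {},
    "products": {},
}


def state_step(state, word):
    """Return the next state, or stay put if the word is not an option."""
    return MENU[state].get(word, state)


# Frame-based: slots may be filled in any order; ask for what is missing.
REQUIRED_SLOTS = ("origin", "destination", "depart_day", "return_day")


def next_prompt(frame):
    """Return the name of the first unfilled slot, or None if complete."""
    for slot in REQUIRED_SLOTS:
        if slot not in frame:
            return slot
    return None


# The Boston-to-Toronto request from the text: only the return day is missing.
frame = {"origin": "Boston", "destination": "Toronto", "depart_day": "Thursday"}
```

With the frame above, next_prompt(frame) names the return day as the one
thing still to ask for, no matter which order the other slots arrived in.
The state-based menu, by contrast, ignores any word that is not on the
current list.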

The Gryphon dialogue system is in most respects a frame-based system, 
although the particular domain for which it was designed causes the 
specific architecture to differ from the general description above.  
Gryphon is entirely user-led - the user is not given a list of things to 
choose from.  It has the aspect of the agent-based systems of allowing the 
user to change the topic at any time.  However, it does not keep much 
internal state.  The domain of course information does not lend itself to
defined frames with a minimum set of information required to make a query.
Gryphon attempts to answer the user's query with whatever information it was
able to understand, and leaves it to the user to narrow the request
further.
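
The idea of answering with whatever was understood can be sketched as a
query builder that accepts any subset of constraints.  This is a rough
illustration under stated assumptions, not Gryphon's actual code: the table
name "courses", the column names, and the function name are all
hypothetical.

```python
def build_query(constraints):
    """Build a parameterized SELECT from whichever fields were understood.

    `constraints` maps hypothetical column names to values.  An empty
    mapping yields an unrestricted query, leaving it to the user to
    narrow the request.
    """
    allowed = ("department", "professor")  # hypothetical column whitelist
    clauses, params = [], []
    for column in allowed:
        if column in constraints:
            # %s is the placeholder style used by common MySQL drivers.
            clauses.append(f"{column} = %s")
            params.append(constraints[column])
    sql = "SELECT * FROM courses"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

No field is required here.  Each understood constraint adds one more
condition, which is the opposite of a frame that must be complete before a
query can be made.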


III System Components
    * Galaxy Communicator - Framework for server communication (DARPA)
    * Sphinx - Speech recognition (CMU)
    * Phoenix - Semantic Parser (CU)
    * Festival - Text to Speech (U of Edinburgh)
    * Dialogue Manager - Controls Answering of Query
    * Database Server - Interfaces with MySQL database
    * PHP Server - Handles input through web browser

A Open Source Software

1 Galaxy Communicator

Galaxy Communicator <http://communicator.sourceforge.net> is an architecture
for designing dialogue systems.  It was developed by MIT and the MITRE 
Corporation through a grant from the US Defense Advanced Research Projects
Agency (DARPA).  It provides a specification framework allowing compliant
servers to communicate.  It consists of several independent servers (which
need not be running on the same machine) sending messages to one another
under the direction of a central Hub.  None of the servers connect directly 
to one another; the only connections are through the Hub.  In general, the
setup looks like this:

                        Speech Recognition
                               |
                               | /--Database
                    Parser----HUB
                               | \--Text to Speech
                               |
                        Dialogue Management

There is no limit to the number of servers that can be connected to the Hub.
The Hub passes messages from one server to another according to a list of
rules it is given.  The rules are based on the headers of the messages, and
tell the Hub to create a new message and send it to a particular function 
that another server can perform, carrying along certain information from the 
original message.  This design lets all servers listen for messages
constantly, so the operations of one server can be interrupted by another
should the occasion arise.
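
The routing behavior described above can be sketched as a toy, in-process
hub.  This is not the Galaxy Communicator API: the class, the rule format,
the message headers, and the stand-in servers are all invented for
illustration, and a real Hub talks to servers that may be running on other
machines.

```python
class Hub:
    """Toy hub: routes messages between servers according to rules."""

    def __init__(self):
        self.servers = {}  # server name -> {operation name: callable}
        self.rules = []    # (header, server, operation, keys to carry along)

    def register(self, name, operations):
        self.servers[name] = operations

    def add_rule(self, header, server, operation, carry):
        self.rules.append((header, server, operation, carry))

    def dispatch(self, message):
        """Route a message by its header until no rule matches."""
        for header, server, operation, carry in self.rules:
            if message.get("header") == header:
                # Build a new message carrying only the listed information.
                outgoing = {key: message[key] for key in carry if key in message}
                reply = self.servers[server][operation](outgoing)
                return self.dispatch(reply)
        return message  # no rule matched: routing stops here


def recognize(message):
    # Stand-in for speech recognition: pretend the audio was decoded.
    return {"header": "recognized", "text": "bio courses"}


def parse(message):
    # Stand-in for the parser: always returns the same parse.
    return {"header": "parsed", "text": message["text"],
            "parse": "Courses:[Department].Biol"}


hub = Hub()
hub.register("recognizer", {"recognize": recognize})
hub.register("parser", {"parse": parse})
hub.add_rule("audio", "recognizer", "recognize", carry=["samples"])
hub.add_rule("recognized", "parser", "parse", carry=["text"])
result = hub.dispatch({"header": "audio", "samples": b""})
```

Note that the recognizer and the parser never call each other.  Each
returns a message to the hub, and the rules alone decide where it goes
next.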

2 Sphinx

Sphinx 2 <http://www.speech.cs.cmu.edu/sphinx/> is a real-time,
large-vocabulary, speaker-independent speech recognition program developed at
Carnegie Mellon University and made publicly available in 2000.  Its 
recognition is based on phoneme-level acoustic models and hidden Markov
models of speech.  It is capable of determining the phoneme set for a given 
vocabulary list based on the model it is using.  For words that are exceptions
to its determined pronunciation, it allows specification of a hand dictionary
where pronunciation is described using a set of phonetic characters.  
Additionally, common phrases can be added to the vocabulary to increase the
chance that the system will recognize them as a group.

Included in Sphinx is a wrapper that allows it to act as a Communicator
server.  This wrapper was not as developed as the rest of the system, and it
took some time to incorporate into Gryphon.  Speech recognition remains the
most difficult part of the system to predict, and recognition errors are not
uncommon, even with the limited vocabulary allowed to the system.  However,
recognizing human speech is a very difficult problem, and Sphinx is probably
the most sophisticated (and computationally intensive) component of Gryphon.
We are very thankful to CMU for making it available for us to incorporate
into our system.

3 Phoenix

The Phoenix Semantic Frame Parser <http://communicator.colorado.edu> was 
developed by the University of Colorado at Boulder.  Rather than attempting
to parse input into parts of speech, Phoenix attempts to organize words in
the utterance into semantic groups.  Essentially, the goal is to pick out the
parts of the utterance that tell us what the user wants and ignore the rest. 
Phoenix frames (see note 2) are domains of related information.  Gryphon uses
two frames, Courses and Respond.  Courses contains information related to
specifying course information, while Respond includes other important spoken
data.  Frames are populated with nets, which are context-free grammars whose
terminals are the spoken words.  An example is appropriate here, based on a
simplified version of the grammar used by Gryphon:

Frame: Courses
Nets:
    [Department]      <- Net names are written in brackets
    [Professor]

and elsewhere in the file:

[Department]
    ([_Biol])         <- subnets
    ([_Math])         <- the underscore indicates everything after this is
                         removed.  more on that below.
;
[_Biol]
    (biology)        <- strings in parentheses without brackets are terminals
    (bio)
;
[Professor]
    ([First_name]* [Last_name])    <- "*" indicates that field is optional 
;
[First_name]
    (Joan)
;
[Last_name]
    (Edwards)
;

Let's try this very limited grammar out on a couple sentences.

"Let me see biology courses" produces

  Courses:[Department].Biol

"Let me see bio courses" has the same output.  We have taken two equivalent
ways of referring to the same department and reduced them to the same output.
This funneling is very useful for moving toward querying a database.  Let's
try something else.

"What does Joan Edwards teach?" produces

  Courses:[Professor].[First_name].Joan [Last_name].Edwards

Phoenix will match as many nets to the utterance as it can.  Taking 
advantage of the optional fields, we can ask

"What biology courses does professor Edwards teach?" to get

  Courses:[Department].Biol
  Courses:[Professor].[Last_name].Edwards

So while the Phoenix program was already written and integrated quite easily
with Galaxy, there was much work to do in order to tailor it to the course
domain.  Gryphon uses an extensive hierarchy of these data files to extract
enough useful information from each utterance for the later components to
determine the goal of the utterance and execute it.  This required extensive
examination of the types of queries the system could expect to receive.  The
parsing performed by Phoenix is an important step in systematically reducing
the complex arrangement of human speech to a form that can be dealt with
later in the system.
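
For readers who want to experiment, the funneling effect of the simplified
grammar above can be imitated with a few lines of Python.  This is a
hand-coded word lookup that reproduces the three example outputs.  It is
not how Phoenix works internally and it does not read Phoenix grammar
files; the table and function names are my own.

```python
import re

# Terminals from the simplified grammar, mapped to their output names.
DEPARTMENTS = {"biology": "Biol", "bio": "Biol"}
FIRST_NAMES = {"joan": "Joan"}
LAST_NAMES = {"edwards": "Edwards"}


def parse(utterance):
    """Return one output line per net matched, in order of appearance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    results = []
    i = 0
    while i < len(words):
        word = words[i]
        following = words[i + 1] if i + 1 < len(words) else None
        if word in DEPARTMENTS:
            results.append(f"Courses:[Department].{DEPARTMENTS[word]}")
        elif word in FIRST_NAMES and following in LAST_NAMES:
            # Optional first name followed by the required last name.
            results.append(
                f"Courses:[Professor].[First_name].{FIRST_NAMES[word]} "
                f"[Last_name].{LAST_NAMES[following]}"
            )
            i += 1  # the last name has been consumed as well
        elif word in LAST_NAMES:
            results.append(f"Courses:[Professor].[Last_name].{LAST_NAMES[word]}")
        # Words that match no net are simply ignored.
        i += 1
    return results
```

Calling parse("Let me see bio courses") and parse("Let me see biology
courses") gives the same single line, which is the funneling described
above.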

4 Festival 

Festival Speech Synthesis System <http://www.cstr.ed.ac.uk/projects/festival> 
was developed by the Centre for Speech Technology Research at the University 
of Edinburgh, UK.  It is configurable for different voices and for British 
and American English.  When determining how to pronounce a word, it uses a 
hierarchy of lexicons, phonemes, and letter-to-sound rules.  It even controls
intonation and duration of syllables in order to sound more natural.  As with
Sphinx, you can create a lexicon of special-case words that have unusual
pronunciations.  Connecting Festival to the Galaxy architecture also
presented a bit of a problem, and its status was somewhat demoted when the
bimodal output was chosen, but it is still necessary for the speech-in,
speech-out goal of this project.
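
The idea of a pronunciation hierarchy can be sketched as a chain of
fallbacks.  This is only an illustration and does not use Festival itself.
The lookup order shown (special-case lexicon, then main lexicon, then
rules) is my assumption about how such a hierarchy is usually arranged, and
every entry and symbol below is a made-up placeholder rather than real
lexicon data.

```python
# All entries below are made-up placeholders, not real Festival lexicon data.
CUSTOM_LEXICON = {"gryphon": ["g", "r", "ih", "f", "ax", "n"]}
MAIN_LEXICON = {"course": ["k", "ao", "r", "s"]}


def letter_to_sound(word):
    """Crude stand-in for letter-to-sound rules: one symbol per letter."""
    return list(word)


def pronounce(word):
    """Try the special-case lexicon, then the main lexicon, then the rules."""
    word = word.lower()
    for lexicon in (CUSTOM_LEXICON, MAIN_LEXICON):
        if word in lexicon:
            return lexicon[word]
    return letter_to_sound(word)
```

A word with an unusual pronunciation only needs an entry in the first
table.  Everything else falls through to the later stages.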


B Group-Written Components


Notes:
1 - The word "dialog" is often used as a variant of "dialogue".  Their 
    meanings are the same; they differ only in spelling.  As is the
    standard for this field, we use the "dialogue" spelling.
2 - "Frames" is a highly overloaded word in the context of Gryphon's 
    components.  The messages that Galaxy Communicator sends are (in its
    documentation) also called frames.  I have changed the wording in this
    document to avoid confusion.