THE GRYPHON MANIFESTO
The Definitive Guide to the Inner Workings of the Greatest Comps Group Project Ever
by Eric Lantz, March 2, 2004

I Overview

Gryphon is a dialogue system (see note 1) designed to let a user get information about available courses through a natural language input interface and a bimodal speech-text output. It builds on several open source technologies to accomplish this daunting task. These technologies provided the framework within which a non-linear, multi-stage system was built. This document provides an overview of the system components from a technical perspective, with particular attention paid to how each module transforms its input into its output. It is designed to portray the system in a manner that is as straightforward as possible. I hope it succeeds in that regard.

II Dialogue Systems - Methods For Domain Restricted Conversation

Several types of dialogue systems have been developed and deployed in various capacities. Most of these systems do not attempt the daunting task of universal understanding of human conversation, but rather focus on a particular restricted domain in which the user (a human) has a particular task to complete. This goal-oriented structure places restrictions on both participants, a condition designers of dialogue systems can use to their advantage. Despite the amazing diversity of human speech, the vernacular used in stranger-to-stranger transactions - arranging a train ticket, for example - is quite limited. In this regard, there are three primary (though somewhat overlapping) levels of complexity that dialogue systems use:

State-based systems: This level is very similar to the "Press 1 for customer comments, press 2 for product information, ..." menus that have become popular in automated telephone systems, except that the input is via speech rather than a telephone keypad.
The system restricts the options that the user has, and only accepts one of the options for that state, directing the "conversation" to the next predetermined list of options. This architecture is the easiest to design, and the least flexible from the user's perspective.

Frame-based systems: Here the system has a list of information slots that it is trying to fill in order to move on in the conversation. However, unlike state-based systems, these slots can be filled in any order. For example, if a user of a plane ticket system says she wants a flight from Boston to Toronto on Thursday, the system would ask when she wants to return, since it needs to know both dates to plan the trip. Frame-based systems tend to be more user-led than state-based systems, because the user can decide the order in which information is provided.

Agent-based systems: These are the most complicated systems, in which the system keeps a complex internal state, often trying to determine the goal of the user among several possible options without being explicitly told. For example, if the user asks whether movie theater A is showing movie B this afternoon, the system may say no, but add that nearby theaters C and D are showing the movie. The user is allowed to introduce new topics at any time.

The Gryphon dialogue system is in most respects a frame-based system, although the particular domain for which it was designed causes the specific architecture to differ from the general description above. Gryphon is entirely user-led - the user is not given a list of things to choose from. Like agent-based systems, it allows the user to change the topic at any time. However, it does not keep much internal state. The domain of course information does not allow for the formation of defined frames of minimum information required to make a query.
Gryphon attempts to answer the user's query with whatever information it was able to understand, and leaves it to the user to further narrow the information requested.

III System Components

* Galaxy Communicator - Framework for server communication (DARPA)
* Sphinx - Speech recognition (CMU)
* Phoenix - Semantic Parser (CU)
* Festival - Text to Speech (U of Edinburgh)
* Dialogue Manager - Controls Answering of Query
* Database Server - Interfaces with MySQL database
* PHP Server - Handles input through web browser

A Open Source Software

1 Galaxy Communicator

Galaxy Communicator is an architecture for designing dialogue systems. It was developed by MIT and the MITRE corporation through a grant by the US Defense Advanced Research Projects Agency (DARPA). It provides a specification framework allowing compliant servers to communicate. It consists of several independent servers (which need not be running on the same machine) sending messages to one another under the direction of a central Hub. None of the servers connect directly to one another; the only connections exist through the Hub. In general, the setup looks like this:

         Speech Recognition
              |
              |  /--Database
   Parser----HUB
              |  \--Text to Speech
              |
         Dialogue Management

There is no limit to the number of servers that can be connected to the Hub. The Hub passes messages from one server to another according to a list of rules it is given. The rules are based on the headers of the messages, and tell the Hub to create a new message and send it to a particular function that another server can perform, carrying along certain information from the original message. This design allows all servers to be constantly listening for messages, so the operations of one server can be interrupted by another should the occasion arise.

2 Sphinx

Sphinx 2 is a real-time, large vocabulary, speaker independent speech recognition program developed at Carnegie Mellon University and made publicly available in 2000.
Its recognition is based on phoneme-level acoustic models and hidden Markov models of speech. It is capable of determining the phoneme set for a given vocabulary list based on the model it is using. For words that are exceptions to its determined pronunciation, it allows specification of a hand dictionary where pronunciation is described using a set of phonetic characters. Additionally, common phrases can be added to the vocabulary to increase the chance that the system will recognize them as a group.

Included in Sphinx is a wrapper that allows it to act as a Communicator server. This component was not as developed as the rest of the system, and took some time to incorporate into Gryphon. It remains the most difficult part of the system to predict, and recognition errors are not uncommon, even with the limited vocabulary allowed to the system. However, recognizing human speech is a very difficult problem, and Sphinx is probably the most sophisticated (and computationally intensive) component of Gryphon. We are very thankful to CMU for making it available for us to incorporate into our system.

3 Phoenix

The Phoenix Semantic Frame Parser was developed by the University of Colorado at Boulder. Rather than attempting to parse input into parts of speech, Phoenix attempts to organize words in the utterance into semantic groups. Essentially, the goal is to pick out the parts of the utterance that tell us what the user wants and ignore the rest.

Phoenix frames are domains of related information. Gryphon uses two frames - Courses and Respond. Courses contains information related to specifying course information, while Respond includes other important spoken data. Frames are populated with nets, which are context-free grammars ending with the spoken word.
An example is appropriate here, based on a simplified version of the grammar used by Gryphon:

    Frame: Courses
    Nets:
        [Department]     <- Net definitions are defined in brackets
        [Professor]

and elsewhere in the file:

    [Department]
        ([_Biol])        <- subnets
        ([_Math])        <- the underscore indicates everything after this is removed. more on that below.
    ;
    [_Biol]
        (biology)        <- strings in parentheses without brackets are terminals
        (bio)
    ;
    [Professor]
        ([First_name]* [Last_name])    <- "*" indicates that field is optional
    ;
    [First_name]
        (Joan)
    ;
    [Last_name]
        (Edwards)
    ;

Let's try this very limited grammar out on a couple of sentences.

"Let me see biology courses" produces

    Courses:[Department].Biol

"Let me see bio courses" has the same output. We have taken two equivalent ways of referring to the same department, and reduced them to the same output. This funneling is very useful for moving toward querying a database.

Let's try something else. "What does Joan Edwards teach?" produces

    Courses:[Professor].[First_name].Joan [Last_name].Edwards

Phoenix will match as many nets to the utterance as it can. Taking advantage of the optional fields, we can ask "What biology courses does professor Edwards teach?" to get

    Courses:[Department].Biol
    Courses:[Professor].[Last_name].Edwards

So while the Phoenix program was already written and integrated quite easily with Galaxy, there was much work to do in order to tailor it to the course domain. Gryphon uses an extensive hierarchy of these grammar files to extract enough useful information from each utterance for the later components to determine the goal of the utterance and execute it. This required extensive examination of the types of queries the system could expect to receive. The parsing performed by Phoenix is an important step in systematically reducing the complex arrangement of human speech to something the later stages of the system can deal with.
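The funneling described above can be imitated in a few lines of Python. This is only a toy sketch of the idea, not Phoenix's grammar format or matching algorithm: the tables DEPARTMENTS, FIRST_NAMES and LAST_NAMES and the function parse are hypothetical names invented for this illustration, and the lookup is a plain word-by-word scan that assumes the tiny grammar from the example.

```python
import re

# Hypothetical lookup tables standing in for the simplified grammar.
# They are NOT Phoenix data structures; names are invented for illustration.
DEPARTMENTS = {"biology": "Biol", "bio": "Biol"}  # both terminals funnel to Biol
FIRST_NAMES = {"joan": "Joan"}
LAST_NAMES = {"edwards": "Edwards"}


def parse(utterance):
    """Return the Courses-frame slots found in the utterance, in order.

    Words that match no table are ignored, mimicking the way the parser
    keeps the parts that say what the user wants and drops the rest.
    """
    words = re.findall(r"[a-z]+", utterance.lower())
    slots = []
    i = 0
    while i < len(words):
        word = words[i]
        next_word = words[i + 1] if i + 1 < len(words) else None
        if word in DEPARTMENTS:
            slots.append(f"Courses:[Department].{DEPARTMENTS[word]}")
        elif word in FIRST_NAMES and next_word in LAST_NAMES:
            # Optional first name immediately followed by the required last name.
            slots.append(
                f"Courses:[Professor].[First_name].{FIRST_NAMES[word]} "
                f"[Last_name].{LAST_NAMES[next_word]}"
            )
            i += 1  # the last name has been consumed as well
        elif word in LAST_NAMES:
            # Last name on its own: the first name was left out.
            slots.append(f"Courses:[Professor].[Last_name].{LAST_NAMES[word]}")
        i += 1
    return slots


if __name__ == "__main__":
    for sentence in (
        "Let me see biology courses",
        "Let me see bio courses",
        "What does Joan Edwards teach?",
        "What biology courses does professor Edwards teach?",
    ):
        print(sentence, "->", parse(sentence))
```

A real grammar is hierarchical and far larger, so a flat table like this would not scale, but it shows why reducing "biology" and "bio" to one value makes the later database query simpler.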
4 Festival

The Festival Speech Synthesis System was developed by the Center for Speech Technology Research at the University of Edinburgh, UK. It is configurable for different voices and for British and American English. When determining how to pronounce a word, it uses a hierarchy of lexicons, phonemes, and letter-to-sound rules. It even controls intonation and duration of syllables in order to sound more natural. As with Sphinx, you can create a lexicon of special case words which have unusual pronunciation. Festival also presented a bit of a problem when we attempted to connect it to the Galaxy architecture, and its status was somewhat demoted when the bimodal output was chosen, but it is still necessary for the speech-in, speech-out goal of this project.

B Group-Written Components

Notes:

1 - The word "dialog" is often used as a variant of "dialogue". Their meanings are the same; they differ only in spelling. As is the standard for this field, we choose to use the "dialogue" spelling.

2 - "Frames" is a highly overloaded word in the context of Gryphon's components. The messages that Galaxy Communicator sends are (in its documentation) also called frames. I have changed the wording in this document to avoid confusion.