Principal Investigator Larry Rudolph
Co-investigator Boris Katz
Project Website http://projects.csail.mit.edu/nrcc/nla.php
We propose a new, flexible architecture for connecting users of mobile devices to the active resources and services that surround them. The key idea is to leverage language and language-motivated representations as a fundamental organizing principle for interpreting, routing and acting upon user inputs. Our research will lead to the development of a mobile UI architecture that builds upon the strengths of mobile devices and natural user interaction as specified by the Mobile Ecosystem 2012 vision.
A number of interaction technologies are reaching maturity and can now serve as effective I/O channels for lightweight, handheld mobile devices. Similarly, there are many types of external systems and servers with which handheld devices may interact. On both fronts, the user interface and the server interface, we are constrained by the operating characteristics of current practice: user interfaces are typically limited to menu selection and simple text entry, while server interfaces expose operations at too low a level. The solution, we believe, is to raise both the user and server interfaces to a higher level of abstraction, one that can be captured by the natural language terms, phrases and sentences used to describe the handheld device, the available external systems, and the actions and objects they may process.
The following scenario illustrates the sorts of interactions we hope to support through this new architecture. User input may involve keyboard entry, speech input, cursor-based pointing, and menu selection. Device output may involve typed or spoken language, and displayed graphics and imagery. Both internal and external servers utilized during the interactions may include active agents, application programs, handset functions, other mobile devices, and static and dynamic information sets.
Scenario: Scheduling a meeting.

user: "Which conference rooms are free next Tuesday from 1 to 3?"
device: "Rooms G-319 and G-432." (shows a map)
user: "I'd like to reserve this room." (points to a location on the map)
device: "One second... It's reserved."
user: "Then, send this message to everyone in our group."
device: "OK. Ready to take the message."
user: "Listen everyone, we need to get moving on some things, so I'm scheduling a meeting next Tuesday at 1 in room G-319."
device: "OK. Message sent."
device: "Would you like me to save the entire sequence of steps we just took?"
user: "Yes."
device: "What would you like to call the sequence of steps?"
user: "Scheduling a group meeting."
device: "OK. Saved."
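To make the final exchange concrete, one plausible way the device might represent the saved sequence of steps is sketched below; the Step/Macro structure and every field and value name are our own illustrative assumptions, not a committed design.

    from dataclasses import dataclass

    @dataclass
    class Step:
        action: str      # e.g. "query_rooms", "reserve_room", "send_message"
        arguments: dict  # argument values captured during the original interaction

    @dataclass
    class Macro:
        name: str
        steps: list

    # The scenario above, recorded under the user's chosen name:
    meeting_macro = Macro(
        name="Scheduling a group meeting",
        steps=[
            Step("query_rooms", {"day": "Tuesday", "from": "13:00", "to": "15:00"}),
            Step("reserve_room", {"room": "G-319"}),
            Step("send_message", {"recipients": "group", "body": "Listen everyone..."}),
        ],
    )

On replay, such a macro would presumably prompt the user for the arguments that should vary, such as the day and the room.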
In the proposed architecture, three functionally distinct layers interact with one another by way of abstraction barriers. The top layer comprises a Universal User Interface that accepts multiple modalities of input and provides multiple means of presentation to the user. This layer provides a platform for the insertion of specific user interface capabilities that enable users to ask for information, issue commands, supply information and configure the digital environment, using typed or spoken natural language, menu-based interactions, and gestures via pointer or touch screen input. The platform will also support the insertion of output capabilities for presenting responses and clarification dialogs through screen displays that contain graphical and written elements, plus generated speech.
The role of the top layer is both to process the distinct modalities correctly and to provide a single, modality-independent representation for relaying components of information to and from the other layers of the architecture. We propose that a language-based scheme serve as the core of this modality-independent representation, specifically in the form of nested constituent-relation-constituent expressions composed of natural language terms, an approach with which we have had considerable success to date in such organizational tasks.
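As an illustration of such nested constituent-relation-constituent expressions, consider the minimal sketch below; the encoding and the example decomposition are our own assumptions about one plausible form, not the system's actual representation.

    from dataclasses import dataclass
    from typing import Union

    # A constituent is either a natural language term or a nested expression.
    Constituent = Union[str, "Ternary"]

    @dataclass(frozen=True)
    class Ternary:
        left: Constituent   # constituent
        relation: str       # relation, itself a natural language term
        right: Constituent  # constituent, possibly a nested expression

    # One possible encoding of "I'd like to reserve this room",
    # where the room was identified by a pointing gesture:
    request = Ternary(
        left="user",
        relation="reserve",
        right=Ternary("room", "selected-by", "pointing gesture"),
    )

Because the leaves are natural language terms, the same expression can carry a spoken request, a typed request, or a gesture-resolved request through the lower layers in a uniform way.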
The middle layer of the architecture serves as a Dispatcher and Coordinator, connecting user interactions to interactions with active resources. This layer is a stateful component tied to an individual user. The middle layer receives inputs passed along by the user interface layer and draws upon the current context, and upon follow-up queries to the user, to disambiguate these inputs and supplement them with missing information. This layer also matches user questions and commands to the components of information and actions available through associated active resources, in some cases by decomposing the inputs into subquestions and subcommands. The middle layer passes targeted queries and commands to specific active resources by way of the bottom layer.

The bottom layer handles the protocols and methods of communication in relaying specific queries and commands to available active resources, and it standardizes the responses returned by those resources and services.
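A minimal sketch of how the middle layer's control flow might look, with the bottom layer abstracted as a transport object, is shown below; all class, method, and attribute names are our own assumptions introduced for illustration.

    class Dispatcher:
        """Per-user, stateful middle layer: disambiguates inputs, matches
        them to active resources, and relays them via the bottom layer."""

        def __init__(self, context, catalog, transport):
            self.context = context      # current per-user context and history
            self.catalog = catalog      # descriptions of available active resources
            self.transport = transport  # bottom layer: protocol-specific relay

        def handle(self, expression):
            # Fill in missing information from context; this step may also
            # trigger a clarification query back through the interface layer.
            expression = self.context.disambiguate(expression)
            results = []
            # Decompose into subquestions/subcommands where necessary.
            for part in self.decompose(expression):
                resource = self.catalog.match(part)  # pick a capable resource
                results.append(self.transport.send(resource, part))
            return results

        def decompose(self, expression):
            # Trivial default: treat the whole expression as a single unit.
            return [expression]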
Supporting Technology
The proposed architecture will be demonstrated using a synthesis of technologies that have been developed by the participating groups and their collaborators. The following paragraphs describe these technologies.
CSAIL's OKRA system focuses on a universal user interface that enables users to interact with the system through any one, or a mixture, of specific user interfaces and devices. It supports keyboard, sketching, speech, gestures, and other types of input. The key idea is to exploit expectations about what the user can and might say in order to configure and control each of the particular user interfaces. The results of a user's actions are captured for processing, and when those actions are ambiguous, the alternatives are retained.
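The sketch below illustrates the idea of capturing an ambiguous user action together with its ranked alternatives; the classes and field names are our own assumptions for illustration, not OKRA's actual API.

    from dataclasses import dataclass

    @dataclass
    class Interpretation:
        content: str   # one candidate reading of the user's action
        score: float   # recognizer confidence

    @dataclass
    class UserAction:
        modality: str           # "speech", "keyboard", "sketch", "gesture", ...
        interpretations: list   # ranked; all alternatives are retained

    # An ambiguous spoken request, captured with its alternatives so that
    # downstream layers (or a clarification dialog) can resolve it:
    action = UserAction(
        modality="speech",
        interpretations=[
            Interpretation("reserve room G-319", 0.72),
            Interpretation("reserve room G-390", 0.21),
        ],
    )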
CSAIL's GALAXY system is an architecture for developing spoken language systems. GALAXY utilizes a client/server approach in which users communicate with the system from lightweight clients while specialized servers handle computationally heavy tasks. GALAXY has been selected as the reference architecture for the DARPA Communicator Program.
CSAIL's START system (http://start.csail.mit.edu) specializes in language-driven information access. START accepts questions in everyday language and provides "just the right information" drawn from distributed resources. The returned information may be text or multimedia or, in the general case, START may execute arbitrary procedures as a result of its question processing. START has been available on the World Wide Web since 1993 and has to date processed millions of questions from hundreds of thousands of users around the world.
CSAIL's Omnibase system (http://groups.csail.mit.edu/infolab/publications/) provides START with a uniform interface to Web resources, databases and local files for purposes of question answering. Omnibase uses a language-motivated "object-property-value" model to describe the contents of resources and, in doing so, abstracts away many of the details of data access, location and variant terminology. Omnibase has supported START in this role for several years and currently handles a substantial portion of the user questions that START receives over the Web.
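To make the object-property-value model concrete, the following sketch shows how such a request might be evaluated against a wrapped data source; the wrapper class, the get function, and the "imdb-movie" source name are illustrative assumptions in the spirit of the model, not Omnibase's actual interface.

    class SourceWrapper:
        """Hides a resource's access protocol, location, and variant
        terminology behind a uniform fetch(object, property) call."""
        def __init__(self, table):
            self.table = table
        def fetch(self, obj, prop):
            return self.table[obj][prop]

    SOURCES = {
        "imdb-movie": SourceWrapper({
            "Gone with the Wind": {"director": "Victor Fleming"},
        }),
    }

    def get(source, obj, prop):
        """Evaluate an object-property-value request against a named source."""
        return SOURCES[source].fetch(obj, prop)

    # "Who directed Gone with the Wind?" then reduces to:
    print(get("imdb-movie", "Gone with the Wind", "director"))

Under this model, adding a new resource amounts to supplying a new wrapper; the question-answering machinery above it remains unchanged.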