Entry Date: August 10, 2007

Simone: Spoken Interaction for Mobile Networked Ecosystems

Principal Investigator James Glass

Co-investigator Stephanie Seneff


Simone (Spoken Interaction for Mobile Networked Ecosystems) -- The cellular phone is rapidly transforming from a mobile telecommunication device into a multi-faceted information manager that supports both spoken communication among people and the manipulation of an increasingly diverse set of data types stored locally and remotely. Although the essence of the device has been to support spoken human-human communication, speech and language technology will also prove to be a vital enabler of graceful human-device interaction and of effective audio-visual content creation and access.

The need for better human-device interaction is clear. As personal devices continue to shrink in size while expanding their capabilities, the conventional GUI model will become increasingly cumbersome to use. A voice-based interface will work seamlessly with small devices and will allow users to invoke local applications or access remote information more easily.

Information devices of the future will operate as audio-visual recorders for a variety of personal and business uses. To make these recordings more useful, the data could be annotated with additional information that allows them to be indexed, searched, summarized, or even translated. Some annotations could be meta-level descriptions, such as the sequence of events that occurred during the recording (e.g., who was speaking, when, and where, and what was the structure of the event). Other annotations could involve more detailed transcriptions of what was said by each speaker. The sketch below illustrates one way such annotations might be organized.
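As a purely illustrative sketch, and not a design taken from this proposal, these annotations might be organized as a layered record: meta-level descriptors (speaker, time, location, event type) with optional transcripts attached to individual speaker turns. All names below (SpeakerTurn, Recording, and so on) are hypothetical.

    # Hypothetical annotation schema; names and fields are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SpeakerTurn:
        """One meta-level annotation: who spoke and when (times in seconds)."""
        speaker_id: str                    # e.g., "spk1"; identity may be resolved later
        start: float
        end: float
        transcript: Optional[str] = None   # filled in if/when a transcription is available

    @dataclass
    class Recording:
        """An audio-visual recording plus the annotations that make it searchable."""
        uri: str                           # where the media is stored, locally or remotely
        location: Optional[str] = None     # e.g., a venue name, if known
        event_type: Optional[str] = None   # e.g., "meeting", "lecture"
        turns: List[SpeakerTurn] = field(default_factory=list)

        def search(self, keyword: str) -> List[SpeakerTurn]:
            """Return the speaker turns whose transcripts mention the keyword."""
            return [t for t in self.turns
                    if t.transcript and keyword.lower() in t.transcript.lower()]

    # Example: annotate a recording, then retrieve turns by keyword.
    rec = Recording(uri="file:///recordings/meeting.wav", event_type="meeting")
    rec.turns.append(SpeakerTurn("spk1", 0.0, 4.2, transcript="Let's review the agenda."))
    hits = rec.search("agenda")            # -> [the matching turn]

Keeping the meta-level descriptors separate from the transcripts means a recording can be indexed and searched by event structure even before any transcription exists.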

This project proposes to develop spoken language and multimodal technology in support of a new generation of mobile user interfaces and content processing for mobile information devices. We propose to perform research, in collaboration with other Nokia-MIT researchers, to enable natural spoken interaction for controlling device applications, and to develop methods for annotating content that users have recorded with their mobile information devices. Since we believe the technology should ultimately operate in the user's language of preference, we propose to focus on English- and Mandarin-based processing to demonstrate the viability of multilingual interaction.

In this proposal we identify two areas where spoken language technology can potentially benefit Nokia: 1) simplifying the current user interface, especially by providing assistance to the end user, and 2) annotating and retrieving audio-visual content. We propose to explore these topics initially in English and then to develop Mandarin capabilities in both areas.