Entry Date: September 25, 2008

Multimodal Processing and Interaction

Principal Investigator: James Glass

Co-investigators: Victor Zue, Stephanie Seneff


Although speech and language are extremely natural and efficient means of communication, a simple gesture can often convey important information more easily. For example, when a display is part of the interface, gestures can complement spoken input to produce a more natural and robust interaction. We are exploring a variety of multimodal scenarios that combine speech and gesture in this way. Our web-based prototypes run inside a conventional browser and transmit audio along with pen or mouse gesture information to servers running at MIT; they are therefore available anywhere a user has an Internet connection.
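
As an illustration of how such a server might treat the two incoming streams (this is a generic sketch, not the prototype code; all names and structures below are hypothetical), timestamped audio frames and gesture events can be merged into a single time-ordered stream so that an utterance can be interpreted together with the gesture that accompanied it:

```python
import heapq
from dataclasses import dataclass, field
from typing import List

# Hypothetical event types for a multimodal server-side stream.
@dataclass(order=True)
class Event:
    timestamp: float                         # seconds since session start
    kind: str = field(compare=False)         # "audio" or "gesture"
    payload: object = field(compare=False)   # audio frame bytes or (x, y) point

def merge_streams(audio: List[Event], gestures: List[Event]) -> List[Event]:
    """Merge independently received audio and gesture events into one
    time-ordered stream, so a dialogue manager can resolve an utterance
    such as "what about this one?" against the accompanying gesture."""
    return list(heapq.merge(audio, gestures))

# Example: speech arrives as 10 ms frames; a mouse click lands mid-utterance.
audio = [Event(t / 100.0, "audio", b"...") for t in range(5)]
gestures = [Event(0.025, "gesture", (312, 148))]
for ev in merge_streams(audio, gestures):
    print(f"{ev.timestamp:6.3f}s  {ev.kind}")
```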

In addition to conveying linguistic information, the speech signal contains cues about a talker's identity, location, and emotional (paralinguistic) state. Information about these attributes can also be found in the visual channel, as well as in other untethered modalities. By combining conventional audio-based processing with these additional modalities, performance can often be improved, sometimes dramatically, especially in challenging acoustic environments.
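
One common way to realize such a combination is score-level (late) fusion, in which per-class scores from each modality are weighted and summed, with the audio weight reduced as acoustic conditions degrade. The sketch below illustrates that general idea only; the SNR-based weighting rule and all names are assumptions, not the group's actual method.

```python
import numpy as np

def fuse_scores(audio_scores: np.ndarray,
                visual_scores: np.ndarray,
                audio_snr_db: float,
                snr_ref_db: float = 20.0) -> np.ndarray:
    """Score-level fusion of per-class scores from an audio classifier
    and a visual classifier.  The audio stream is trusted in proportion
    to its estimated SNR (an illustrative rule, not a calibrated one)."""
    w_audio = float(np.clip(audio_snr_db / snr_ref_db, 0.0, 1.0))
    return w_audio * audio_scores + (1.0 - w_audio) * visual_scores

# Example: identify one of three enrolled speakers in a noisy room (5 dB SNR).
audio_ll = np.array([-4.1, -2.8, -3.9])    # audio log-likelihoods
visual_ll = np.array([-1.2, -3.5, -2.0])   # visual (face) log-likelihoods
fused = fuse_scores(audio_ll, visual_ll, audio_snr_db=5.0)
print("identified speaker:", int(np.argmax(fused)))
```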

We are performing research in a number of areas related to person identification and speech recognition. The goal is to obtain robust performance in a variety of environments without unduly encumbering the talker.

(*) Audio-visual person identification
(*) Audio-ultrasonic speech recognition
(*) Audio-visual speech recognition