Entry Date:
January 25, 2017

Linguistic Event Extraction and Integration (LEXI): A New Approach to Speech Analysis

Principal Investigator Stefanie Hufnagel

Co-investigator Jeung-Yoon (Elizabeth) Choi

Project Start Date September 2016

Project End Date August 2018


This exploratory project develops a new system for speech signal analysis that can be used to improve automatic speech recognition (ASR) systems and to provide a testable model of human speech perception. The system is based on finding important events in the speech signal, i.e., 'acoustic edges' where the signal changes suddenly because the mouth closes or opens during the formation of a consonant (like /p/ or /s/) or a vowel (like /a/ or /u/). These abrupt changes, called Landmarks, are especially useful because they (and the parts of the signal near them) are richly informative about the speaker's intended words and their sounds. Focusing on these events yields greater computational efficiency, because the system identifies the linguistically relevant information in the speech signal rather than measuring every part of the signal. This focus on individual cues to speech sounds also means that the system can deal with non-typical speech produced by children, older people, speakers with foreign accents, or those with clinical speech disabilities. Existing systems work well for typical speakers by using statistical analysis of large samples of typical speech, but they leave many people underserved. The Landmark-based system will therefore bring the benefits of ASR to speakers who are not well served by current recognition systems, making it possible for more people to use cell phones, tablets and laptops. It will also provide a tool for testing whether human speech recognition depends on finding the individual cues to the sounds of words, even when those cues are very different in different contexts, and so can lead to the development of a new model of human speech perception.

The system works by extracting speech-related measurements from the signal, such as fundamental frequency, formant frequencies, spectral band energies and their derivatives, and interpreting these measures as acoustic cues for distinctive features. Innovative aspects of the system include the use of Landmarks, which are the most robust of the acoustic feature cues and are related to articulatory manner features. Once the Landmark acoustic cues are found, other acoustic cues related to place and voicing features, and to prosodic structure, can also be found. The distinctive features and prosodic structure are the first abstract linguistic units that can be extracted from the continuous physical signal, and this information is used to identify words and to construct a representation of the entire utterance. To develop and evaluate the performance of this innovative system, speech databases consisting of isolated vowel-consonant-vowel sequences, read continuous speech, read radio-style speech, and spontaneous speech will be hand-labeled with Landmarks and other acoustic cues. Results of this basic speech research project will support the development of new approaches to ASR, will provide a testable computational model of human speech perception, and will produce material suitable for development of a tutorial to train students in engineering, linguistics and cognitive science to label acoustic feature cues.
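
To illustrate how a spectral band energy and its derivative can expose an 'acoustic edge', the following Python sketch computes the short-time energy of one frequency band and marks the frames where that energy rises or falls abruptly. It is a minimal illustration only, not the LEXI implementation: the function names, the single 800-8000 Hz band, the 10 ms frame step, the 5-frame difference, and the 9 dB threshold are all assumptions chosen for the example, and the input is a synthetic tone burst rather than real speech.

```python
import numpy as np

def band_energy_db(signal, sr, band=(800.0, 8000.0), frame_ms=10.0, win_ms=20.0):
    """Short-time energy in dB within `band`, one value per frame (illustrative)."""
    hop = int(sr * frame_ms / 1000)
    win = int(sr * win_ms / 1000)
    window = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    in_band = (freqs >= band[0]) & (freqs < band[1])
    n_frames = 1 + (len(signal) - win) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + win] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Small floor avoids log(0) on silent frames.
        energies[i] = 10.0 * np.log10(power[in_band].sum() + 1e-10)
    return energies

def find_landmarks(energy_db, step=5, threshold_db=9.0):
    """Return (frame_index, sign) for each abrupt rise ('+') or fall ('-').

    The energy difference across `step` frames stands in for the derivative.
    Each run of consecutive frames beyond the threshold yields one landmark,
    placed at the middle of the run (assumed rule, for illustration).
    """
    diff = energy_db[step:] - energy_db[:-step]
    sign = np.where(diff >= threshold_db, 1, np.where(diff <= -threshold_db, -1, 0))
    landmarks = []
    start = 0
    for i in range(1, len(sign) + 1):
        if i == len(sign) or sign[i] != sign[start]:
            if sign[start] != 0:
                centre = (start + i - 1) // 2 + step // 2
                landmarks.append((int(centre), "+" if sign[start] > 0 else "-"))
            start = i
    return landmarks

# Synthetic test signal: faint noise, with a 2 kHz tone between 0.2 s and 0.4 s.
sr = 16000
t = np.arange(int(0.6 * sr)) / sr
rng = np.random.default_rng(0)
signal = 1e-3 * rng.standard_normal(len(t))
burst = (t >= 0.2) & (t < 0.4)
signal[burst] += np.sin(2 * np.pi * 2000.0 * t[burst])

energy = band_energy_db(signal, sr)
landmarks = find_landmarks(energy)
# Convert frame indices to approximate times (frame centre, 10 ms hop, 20 ms window).
landmark_times = [(0.010 * idx + 0.010, s) for idx, s in landmarks]
print(landmark_times)
```

On this synthetic input the sketch should report one rising edge near the tone onset and one falling edge near its offset. A practical detector would need several bands, smoothing, and thresholds tuned against hand-labeled data of the kind described above.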