Entry Date:
September 22, 2004

Modeling Linguistic Features in Continuous Speech Recognition

Principal Investigator Stephanie Seneff

Co-investigator Victor Zue


In recent research, we have developed a two-stage framework that incorporates a set of linguistically motivated features into current probabilistic speech recognition systems. We have demonstrated significant improvement in speech recognition accuracy on an isolated-word database. Our current focus is on generalizing the original two-stage framework to recognize continuous speech.

The idea of using sub-word linguistic features such as manner and place of articulation in automatic speech recognition (ASR) dates back to some of the earliest research efforts in this field. Our research is inspired in particular by a series of quantitative analyses of the constraining power of manner-based broad classes for lexical access. These researchers discovered that the number of lexical candidates can be reduced significantly if the manner classes of the phonemes can first be established. Based on these findings, they proposed a two-stage speech recognition framework, in which the first stage segments and classifies the signal into manner-based “broad classes”. Lexical retrieval based on this broad-class representation results in a small “cohort” of possible word candidates. In the second stage, more detailed acoustic-phonetic analysis, possibly coupled with analysis-by-synthesis techniques and higher-level knowledge sources such as syntactic and semantic constraints, is applied to the small cohort to achieve efficient and accurate recognition.
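As an illustration of the first-stage lexical filtering, the following is a minimal sketch assuming a toy manner-class inventory and phonemic lexicon; the class labels, dictionary entries, and function names are hypothetical stand-ins, not the actual units or lexica used in this work.

```python
# Minimal sketch of broad-class lexical filtering (illustrative only).
# The manner classes, phoneme inventory, and lexicon below are hypothetical
# toy examples, not the actual units used in this research.

MANNER_CLASS = {
    "aa": "V", "ae": "V", "ao": "V", "eh": "V", "iy": "V",       # vowels
    "b": "S", "d": "S", "g": "S", "p": "S", "t": "S", "k": "S",  # stops
    "f": "F", "s": "F", "sh": "F", "v": "F", "z": "F",           # fricatives
    "m": "N", "n": "N", "ng": "N",                               # nasals
    "l": "G", "r": "G", "w": "G", "y": "G",                      # liquids/glides
}

LEXICON = {
    "boston":  ["b", "ao", "s", "t", "n"],
    "austin":  ["ao", "s", "t", "n"],
    "dallas":  ["d", "ae", "l", "s"],
    "seattle": ["s", "iy", "ae", "t", "l"],
}

def broad_class_pattern(phonemes):
    """Map a phoneme sequence to its manner-based broad-class pattern."""
    return tuple(MANNER_CLASS[p] for p in phonemes)

def cohort(observed_pattern):
    """First-stage lexical retrieval: keep only the words whose broad-class
    pattern matches the pattern hypothesized from the signal."""
    return [word for word, phones in LEXICON.items()
            if broad_class_pattern(phones) == observed_pattern]

# A first stage that hypothesized stop-vowel-fricative-stop-nasal would
# restrict the second stage to this one-word cohort:
print(cohort(("S", "V", "F", "S", "N")))  # ['boston']
```

Even in this toy setting, a five-class pattern cuts the candidate set from the full lexicon to a single word, which is the constraining effect the quantitative analyses quantified at scale.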

We report our experiments on the Jupiter weather information domain and the Mercury air travel planning domain. The first-stage recognizer generates a 50-best list for each utterance. A vocabulary of the 100 words that the first-stage recognizer is most likely to miss is used to augment the N-best lexicon. On average, this translates to roughly a factor-of-10 reduction in vocabulary size for the second-stage recognizer. As a baseline, we use a state-of-the-art phone-based SUMMIT recognizer in the second stage for both domains. In Table 16 we report results on both the full test set and a clean subset, which is free of artifacts and out-of-vocabulary words.
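To make the vocabulary-reduction step concrete, here is a minimal sketch of how the second-stage vocabulary could be assembled from the first-stage output; the function names and sample data are hypothetical, and the actual systems operate on SUMMIT's internal lexicon representations rather than plain word lists.

```python
# Minimal sketch of building the reduced second-stage vocabulary
# (illustrative only; names and data are hypothetical).

def second_stage_vocab(nbest_hypotheses, likely_missed_words):
    """Union of all words appearing in the utterance's 50-best list with
    the fixed 100-word augmentation list of words the first-stage
    recognizer is most likely to miss."""
    vocab = set(likely_missed_words)
    for hypothesis in nbest_hypotheses:  # each hypothesis is a word list
        vocab.update(hypothesis)
    return vocab

# Example: a few hypotheses from one utterance's 50-best list, plus a
# hypothetical sample of the augmentation list.
nbest = [
    ["what", "is", "the", "weather", "in", "boston"],
    ["what", "is", "the", "weather", "in", "austin"],
    # ... remaining hypotheses from the 50-best list ...
]
augmentation = ["humidity", "forecast", "barometric"]
print(sorted(second_stage_vocab(nbest, augmentation)))
```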

The performance of the first-stage recognizer can probably be improved further by utilizing phonological rules based on the linguistic units used in this research, and by introducing constraints from higher levels of the linguistic hierarchy. In its present form, our system uses a reduced vocabulary while maintaining the original language model. It would be interesting to consider modifying the second-stage language model on the basis of the first-stage cohort, for example, through the simple technique of re-normalizing the probability model once rejected words have been pruned. Alternative techniques similar to boosting may also be effective.
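As a sketch of the re-normalization idea, the following restricts a unigram language model to the first-stage cohort and rescales the surviving probabilities to sum to one; a unigram model is assumed purely to keep the arithmetic explicit, whereas the actual second-stage recognizer uses richer n-gram models.

```python
# Minimal sketch of language-model re-normalization after pruning
# (illustrative only; a unigram model stands in for the real n-gram LM).

def renormalize(unigram_probs, cohort):
    """Restrict a unigram distribution to the first-stage cohort and
    rescale so the surviving probabilities sum to one:
        P'(w) = P(w) / sum over v in cohort of P(v),  for w in cohort."""
    surviving = {w: p for w, p in unigram_probs.items() if w in cohort}
    total = sum(surviving.values())
    return {w: p / total for w, p in surviving.items()}

# Hypothetical full-vocabulary unigram probabilities and a two-word cohort.
full_lm = {"boston": 0.4, "austin": 0.3, "dallas": 0.2, "seattle": 0.1}
cohort = {"boston", "austin"}  # words surviving the first stage
print(renormalize(full_lm, cohort))  # boston: ~0.571, austin: ~0.429
```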