Entry Date:
May 22, 2001

Acoustic Modeling Improvements in SUMMIT (Speech Recognition)


SUMMIT is a speech recognition system developed by the Spoken Language Systems (SLS) group. This speech recognizer uses a segment-based approach for modeling acoustic-phonetic events and utilizes finite-state transducer (FST) technology to efficiently represent all aspects of the speech hierarchy including the phonological rules, the lexicon, and the probabilistic language model.

The SUMMIT segment-based speech recognition system is capable of handling two rather different types of acoustic models: segment models and boundary models. Segment models are intended to model hypothesized phonetic segments in an acoustic-phonetic graph, and can be context-independent or context-dependent. The observation vector for these models is of fixed dimensionality and is typically derived from spectral vectors spanning the segment; thus we extract a single feature vector and compute a single likelihood for a phone regardless of its duration. In contrast, boundary models are intended to model transitions between phonetic units. The observation vector for these models is also of fixed dimensionality and is centered at hypothesized phonetic boundary locations, or landmarks. Since some hypothesized boundaries will in fact be internal to a phone, both internal and transition boundary models are utilized. Typically, the internal models are context-independent, and the transition models represent diphones.
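To make the two observation spaces concrete, the following sketch computes a fixed-dimensional segment feature vector by averaging frame-level spectral coefficients (e.g., MFCCs) over sub-regions of the segment, and a fixed-dimensional boundary feature vector by averaging frames in telescoping windows around a landmark. The 30-40-30 split, the log-duration term, and the window sizes are illustrative assumptions, not the exact SUMMIT configuration.

    import numpy as np

    def segment_features(frames, proportions=(0.3, 0.4, 0.3)):
        # frames: (n_frames x n_coeffs) array of spectral vectors spanning
        # one hypothesized segment.  Average each sub-region and append log
        # duration, yielding the same dimensionality for any segment length.
        n = len(frames)
        bounds = np.cumsum((0.0,) + proportions) * n
        pieces = [frames[int(a):max(int(a) + 1, int(b))].mean(axis=0)
                  for a, b in zip(bounds[:-1], bounds[1:])]
        pieces.append(np.array([np.log(n)]))
        return np.concatenate(pieces)

    def boundary_features(frames, landmark, windows=((-8, -4), (-4, -2), (-2, 0),
                                                     (0, 2), (2, 4), (4, 8))):
        # frames: spectral vectors for the whole utterance; landmark: frame
        # index of a hypothesized phonetic boundary.  Average the frames in
        # telescoping windows on either side of the landmark.
        n = len(frames)
        pieces = []
        for lo, hi in windows:
            a = min(max(landmark + lo, 0), n - 1)
            b = min(max(landmark + hi, a + 1), n)
            pieces.append(frames[a:b].mean(axis=0))
        return np.concatenate(pieces)

Either way, a variable-length stretch of frames is reduced to a single vector, so one likelihood is computed per segment or per landmark rather than per frame.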

Because the observation spaces of segment and boundary models differ significantly, they tend to contribute complementary information to the recognition search and to the ranking of hypotheses. The baseline JUPITER system uses boundary models only; this work investigated the addition of triphone segment models and their effect on accuracy and speed. In the early development of the JUPITER system, context-independent segment models were used. As more data became available, context-dependent (diphone) boundary models were added, and the log-probability scores of the boundary and segment models were linearly combined for every phone. Ideally, this combination should have been more accurate than either set of models separately. However, the boundary models, with their higher degree of context dependency, benefited more from the increased training data than the segment models, so as more data became available the context-independent segment models actually began to degrade performance relative to boundary models alone.
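The per-phone combination described above amounts to a weighted sum of the two log-probability scores. A minimal sketch, in which the interpolation weights are assumptions that would be tuned on development data:

    def combined_log_score(log_p_boundary, log_p_segment,
                           w_boundary=1.0, w_segment=0.5):
        # Linear combination of the boundary- and segment-model log
        # probabilities for one hypothesized phone; the weights here are
        # illustrative and would be tuned empirically.
        return w_boundary * log_p_boundary + w_segment * log_p_segment

Because the scores are log probabilities, this linear combination corresponds to a weighted geometric mean of the underlying model likelihoods.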

In these experiments, the boundary models comprised 61 context-independent internal models and 715 diphone transition models. For segment models, we used 935 triphones, 1190 diphones, and 61 monophones; the selection of triphones and diphones was based on their number of occurrences in the training data, using a count threshold of 250. Examining the pronunciation network, we find that 71% of all arcs have triphone labels, the rest having diphone or monophone labels.
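A sketch of this selection step, together with a plausible backoff from triphone to diphone to monophone labels for contexts seen too rarely (the backoff order and the function names are assumptions, consistent with the mix of labels in the pronunciation network):

    def select_units(triphone_counts, diphone_counts, threshold=250):
        # Keep a context-dependent unit only if it occurs at least
        # `threshold` times in the training data.
        triphones = {t for t, c in triphone_counts.items() if c >= threshold}
        diphones = {d for d, c in diphone_counts.items() if c >= threshold}
        return triphones, diphones

    def segment_label(left, phone, right, triphones, diphones):
        # Back off: triphone -> right-context diphone -> monophone.
        if (left, phone, right) in triphones:
            return (left, phone, right)
        if (phone, right) in diphones:
            return (phone, right)
        return phone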

SUMMIT now makes use of finite-state transducers (FSTs) to represent context, phonological, lexical, and language model constraints. In particular, the first (Viterbi) pass of the recognition search makes use of a single FST composed of all these components. Such a formulation makes the correct application of cross-word context-dependent acoustic models very straightforward.
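The composition itself is standard weighted FST composition: output labels of one transducer are matched against input labels of the next, and arc weights (negative log probabilities) add along matched paths. The toy sketch below composes a phonological-rule transducer with a lexical pronunciation; the FST representation, phone labels, and weights are illustrative, not the actual SUMMIT data structures or a real FST toolkit API.

    from collections import deque

    class FST:
        # A toy epsilon-free weighted FST.
        # arcs[state] = list of (ilabel, olabel, weight, next_state).
        def __init__(self, start, finals, arcs):
            self.start, self.finals, self.arcs = start, finals, arcs

    def compose(a, b):
        # Match a's output labels against b's input labels; weights
        # (negative log probabilities) add along matched arcs.
        start = (a.start, b.start)
        arcs, finals = {}, set()
        queue, seen = deque([start]), {start}
        while queue:
            q = queue.popleft()
            s1, s2 = q
            arcs[q] = []
            if s1 in a.finals and s2 in b.finals:
                finals.add(q)
            for il, m1, w1, n1 in a.arcs.get(s1, []):
                for m2, ol, w2, n2 in b.arcs.get(s2, []):
                    if m1 == m2:
                        nxt = (n1, n2)
                        arcs[q].append((il, ol, w1 + w2, nxt))
                        if nxt not in seen:
                            seen.add(nxt)
                            queue.append(nxt)
        return FST(start, finals, arcs)

    # P: a phonological rule mapping surface phones to lexical phones
    # (a lexical /t/ may surface as a flap "dx"; weights are -log probs).
    P = FST(0, {0}, {0: [("t", "t", 0.36, 0),
                         ("dx", "t", 1.20, 0),
                         ("ae", "ae", 0.0, 0)]})

    # L: an acceptor for the lexical pronunciation /ae t/ (the word "at").
    L = FST(0, {2}, {0: [("ae", "ae", 0.0, 1)],
                     1: [("t", "t", 0.0, 2)]})

    PL = compose(P, L)
    # PL accepts the surface sequences "ae t" and "ae dx", transducing both
    # to the lexical phones "ae t" with the corresponding rule weights.

Composing the context, phonological, lexicon, and language model transducers into one network in the same fashion lets the Viterbi search consult a single FST, which is what makes cross-word context-dependent modeling straightforward to apply correctly.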