Entry Date: January 26, 2016

Invariant Representation Learning for Speech Recognition

Principal Investigator: Tomaso Poggio

Co-investigator: Lorenzo Andrea Rosasco


The recognition of sound categories and speech in the human brain is remarkably robust to signal variations that preserve identity. Apart from any contextual inference imposed through complex language models, the lower-level neural representation of speech sounds might also be important for recognition. The idea of hierarchical representations that handle invariance through successive feedforward maps, prominent in biologically plausible computational models of vision, is a starting point for developing computational models and forming hypotheses about representations in the auditory cortex.

An invariant representation of speech, in both biological and artificial systems, is crucial for improving the robustness of the acoustic-to-phonetic mapping, decreasing the sample complexity (i.e., the number of labeled examples) and enhancing the generalization performance of learning in the presence of distribution mismatch due to speech variability. In human brains, learning to associate sounds or words is the result of a few directed examples and of the unsupervised observation of auditory objects and their transformations (across speakers, modes of speech, etc.). A key element of this might be the unsupervised learning of effective data representations, i.e., the mapping of the sensory data into a feature space that is resilient to (lexical or sub-lexical) identity-preserving transformations, such as changes in voices, speakers and acoustic environments.

The goal of the project is to provide a theoretical and computational framework for speech representation learning, while formulating plausible hypotheses about learning and processing mechanisms in the human auditory cortex. The following research directions span both machine learning and neuroscience:

Algorithms for invariant representation learning in machines: An appropriate representation of the data (encoding, feature map, embedding) aims to facilitate a statistical learning problem. Data-adaptive, as opposed to deterministic, representations can be learned, in a supervised or unsupervised way, by imposing criteria on the representation map, for example the preservation of distances or the accuracy of reconstruction. We are interested in representations that are invariant to class-preserving transformations and selective for class-specific properties, so that multiple categories can be separated with reduced sample complexity.
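
As a concrete, non-authoritative illustration of this direction, the following is a minimal sketch in Python/NumPy of an invariance-by-pooling signature built from templates and their transformations; the function names, parameters, and the choice of circular shift as the transformation group are illustrative assumptions, not a specification of the project's algorithms.

    import numpy as np

    def invariant_signature(x, templates, transforms, n_bins=20):
        """Sketch of an invariant signature: project x onto the orbit of each
        stored template under a set of transformations, then pool over the
        orbit. Pooling (here, a histogram) makes each component approximately
        invariant to the transformations, while the set of templates keeps
        the signature selective."""
        signature = []
        for t in templates:
            projections = np.array([np.dot(x, g(t)) for g in transforms])
            hist, _ = np.histogram(projections, bins=n_bins, range=(-1.0, 1.0))
            signature.append(hist / len(projections))
        return np.concatenate(signature)

    # Toy usage: unit-norm signals of length 64; circular shift stands in for
    # an identity-preserving transformation, and the shifts below form a group.
    rng = np.random.default_rng(0)
    templates = [rng.standard_normal(64) for _ in range(5)]
    templates = [t / np.linalg.norm(t) for t in templates]
    transforms = [lambda t, s=s: np.roll(t, s) for s in range(0, 64, 8)]

    x = rng.standard_normal(64)
    x /= np.linalg.norm(x)
    print(np.allclose(invariant_signature(x, templates, transforms),
                      invariant_signature(np.roll(x, 24), templates, transforms)))
    # True: the shifted input yields the same signature.

Note that invariance in this sketch is exact only because the sampled transformations form a group; for the speech transformations of interest it would hold approximately.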

Invariant speech representations in brains: We form hypotheses about the feedforward processing in the human auditory cortex, using the visual cortex as a paradigm (e.g., Hubel-Wiesel cells, hierarchical models, invariance). We are interested in the invariance and selectivity properties of auditory representations, the form of spectro-temporal receptive fields (STRFs) in hierarchical organizations, the acoustic-to-phonetic mapping of speech sounds in the human brain, the types of associations for learning sound categories, and the levels and parts of the representation of “auditory objects.”
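
To make the visual-cortex analogy concrete, here is a hedged sketch of one feedforward stage applied to a spectrogram: a small bank of model STRFs (2D Gabor functions over frequency and time, the auditory counterpart of Hubel-Wiesel simple-cell receptive fields), followed by rectification and local max pooling in the spirit of complex cells. The Gabor form, parameter values and pooling sizes are illustrative assumptions, not claims about cortical STRFs.

    import numpy as np
    from scipy.signal import correlate2d

    def gabor_strf(n_freq=12, n_time=12, orientation=0.3, cycles=2.0):
        """A model STRF: a 2D Gabor over (frequency, time)."""
        f = np.linspace(-1.0, 1.0, n_freq)[:, None]
        t = np.linspace(-1.0, 1.0, n_time)[None, :]
        u = f * np.cos(orientation) + t * np.sin(orientation)
        return np.exp(-(f**2 + t**2) / 0.5) * np.cos(2 * np.pi * cycles * u)

    def simple_complex_stage(spectrogram, strfs, pool=4):
        """One stage of the hierarchy: linear STRF filtering with
        rectification ("simple cells"), then local max pooling over
        time-frequency blocks ("complex cells"), which buys tolerance to
        small shifts in time and frequency."""
        maps = []
        for strf in strfs:
            r = np.maximum(correlate2d(spectrogram, strf, mode="valid"), 0.0)
            # Max pool over non-overlapping pool x pool blocks.
            nf, nt = (r.shape[0] // pool) * pool, (r.shape[1] // pool) * pool
            r = r[:nf, :nt].reshape(nf // pool, pool, nt // pool, pool).max(axis=(1, 3))
            maps.append(r)
        return np.stack(maps)

    # Toy usage on a random stand-in for a spectrogram (freq bins x time frames).
    rng = np.random.default_rng(1)
    spec = rng.random((48, 96))
    strfs = [gabor_strf(orientation=th) for th in (0.0, 0.5, 1.0)]
    print(simple_complex_stage(spec, strfs).shape)  # (3, 9, 21)

Stacking such stages yields the successive feedforward maps referred to above; deeper stages would pool over larger ranges and more complex transformations.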

Speech recognition from a few labeled examples: A case study involves systems for speech recognition (of words or phonemes) that require fewer resources to achieve high accuracy in classification tasks. We consider several aspects of the underlying speech representations, such as signal scales, input-domain representations, network structures (shallow vs. multilayer/hierarchical), and ways of sampling templates and transformations (explicit, or learned discriminatively or in an unsupervised way).
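
The sketch below illustrates the reduced-sample-complexity case study under a strong assumption: the inputs are features produced by some invariant representation (e.g., the signatures sketched earlier), so that examples of the same word cluster tightly. A nearest-class-mean rule then classifies from as few as two labeled examples per word; the data and names here are synthetic and hypothetical.

    import numpy as np

    def nearest_class_mean(train_feats, train_labels, test_feats):
        """Classify each test feature vector by the nearest class mean. With
        nuisance variability (speaker, rate, channel) removed by the
        representation, very few labeled examples per class can suffice."""
        classes = sorted(set(train_labels))
        means = np.stack([
            np.mean([f for f, y in zip(train_feats, train_labels) if y == c], axis=0)
            for c in classes
        ])
        dists = np.linalg.norm(test_feats[:, None, :] - means[None, :, :], axis=2)
        return [classes[i] for i in np.argmin(dists, axis=1)]

    # Toy usage: 3 "words", 2 labeled examples each, in a 10-d invariant
    # feature space simulated as tight clusters around random centers.
    rng = np.random.default_rng(2)
    centers = 3.0 * rng.standard_normal((3, 10))
    train_feats = np.concatenate([c + 0.1 * rng.standard_normal((2, 10)) for c in centers])
    train_labels = [0, 0, 1, 1, 2, 2]
    test_feats = centers + 0.1 * rng.standard_normal((3, 10))
    print(nearest_class_mean(train_feats, train_labels, test_feats))  # [0, 1, 2]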