Entry Date: September 13, 2016

Analysis by Synthesis Revisited: Visual Scene Understanding by Integrating Probabilistic Programs and Deep Learning

Principal Investigator: Joshua Tenenbaum


Everyday tasks like driving a car, preparing a meal, or watching a lecture depend on the brain’s ability to compute mappings from raw sensory inputs to representations of the world. Despite the high dimensionality of the sensory data, humans easily compute structured representations such as objects moving in three-dimensional space and agents with multi-scale plans and goals. How are these mappings computed? How are they computed so well and so quickly, in real time? And how can we capture these abilities in machines?

Representation learning provides a task-oriented approach to obtaining abstract features from raw data, making such methods well suited to tasks such as object recognition. However, agents acting in 3D environments need to represent the world beyond statistical, vectorized feature representations. Their internal representations need to be lifted to an explicit 3D space with physical interactions, and they should be able to extrapolate to radically new viewpoints and other non-linear transformations. In this proposal, we aim to close this representational gap by integrating powerful deep learning techniques (recognition networks) with rich probabilistic generative models (inverse graphics) of generic object classes. This paradigm can be termed ‘analysis-by-synthesis’: models or theories about the world are learned and used to interpret perceptual observations.
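To make the analysis-by-synthesis loop concrete, the following is a minimal sketch, not the proposed system: a toy renderer stands in for a graphics engine, a brightest-pixel heuristic stands in for a trained recognition network, and Metropolis-Hastings refines the bottom-up proposal top-down over the scene latents. The names (render, recognition_net, infer) and the two-dimensional latent space are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(z):
    # Toy stand-in for a graphics engine: maps latent scene parameters
    # (here just a 2D blob position) to a 16x16 "image".
    xs, ys = np.meshgrid(np.arange(16), np.arange(16))
    return np.exp(-((xs - z[0]) ** 2 + (ys - z[1]) ** 2) / 8.0)

def log_likelihood(image, z, sigma=0.1):
    # Gaussian pixel-noise likelihood p(image | z), up to a constant.
    return -np.sum((image - render(z)) ** 2) / (2 * sigma ** 2)

def recognition_net(image):
    # Stand-in for a learned recognition network: propose latents
    # bottom-up by reading off the brightest pixel.
    y, x = np.unravel_index(np.argmax(image), image.shape)
    return np.array([x, y], dtype=float)

def infer(image, steps=500, proposal_scale=0.5):
    # Top-down refinement: Metropolis-Hastings over scene latents,
    # initialized from the bottom-up proposal (flat prior assumed).
    z = recognition_net(image)
    ll = log_likelihood(image, z)
    for _ in range(steps):
        z_new = z + proposal_scale * rng.normal(size=z.shape)
        ll_new = log_likelihood(image, z_new)
        if np.log(rng.uniform()) < ll_new - ll:  # accept/reject step
            z, ll = z_new, ll_new
    return z

observed = render(np.array([5.0, 9.0])) + 0.05 * rng.normal(size=(16, 16))
print(infer(observed))  # recovers latents near (5, 9)
```

Even in this toy, the role of the recognition network is visible: initializing the sampler near good latents shortens the top-down search that a generative model alone would have to perform.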

The availability of ImageNet and other large datasets facilitated rapid innovation in deep learning. Similarly, the Arcade Learning Environment (ALE) led to considerable progress in deep reinforcement learning. To train our models, we need not only the computational tools required to scale up our approach but also rich 3D learning environments. We plan to develop and release a 3D game engine built on the Unreal Engine. It will serve as a centralized test bed for evaluating our models on a variety of tasks, including inferring 3D generative object models from 2D viewpoints, observing a sequence of frames and predicting ‘what happens next’, and learning to play a 3D maze game using the newly acquired 3D representations. As computer graphics approach photorealistic scene rendering, models trained in such environments will stand a solid chance of transferring to the real world with only minimal fine-tuning at test time.
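To suggest the shape of the agent-facing interface such a test bed might expose, here is a hedged sketch written against a toy grid maze rather than the Unreal Engine; the class and method names (ToyMazeEnv, reset, step) are hypothetical, not a released API.

```python
import numpy as np

class ToyMazeEnv:
    """Toy stand-in for the planned 3D test bed (hypothetical API).

    Exposes the agent loop we have in mind: reset() -> observation,
    step(action) -> (observation, reward, done). A real environment
    would return rendered RGB frames; here the observation is a 3x3
    occupancy patch around the agent.
    """
    LAYOUT = np.array([
        [1, 1, 1, 1, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 1, 0, 1],
        [1, 0, 0, 2, 1],   # 1 = wall, 0 = open, 2 = goal cell
        [1, 1, 1, 1, 1],
    ])
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.pos = (1, 1)
        return self._observe()

    def step(self, action):
        r, c = self.pos
        dr, dc = self.MOVES[action]
        if self.LAYOUT[r + dr, c + dc] != 1:   # walls block movement
            self.pos = (r + dr, c + dc)
        done = self.LAYOUT[self.pos] == 2
        return self._observe(), (1.0 if done else -0.01), done

    def _observe(self):
        r, c = self.pos
        return self.LAYOUT[r - 1:r + 2, c - 1:c + 2]  # local 3x3 patch

env = ToyMazeEnv()
obs = env.reset()
for a in ["right", "right", "down", "down"]:
    obs, reward, done = env.step(a)
print(reward, done)  # agent reaches the goal: 1.0 True
```

The viewpoint-prediction task would be posed against the same interface: given frames and camera poses along one trajectory, predict the frame the engine would render from a held-out pose.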