Invariant Representations for Action Recognition in the Human Visual System

Principal Investigator Tomaso Poggio

Project Website http://cbmm.mit.edu/research/projects-thrust/theories-intelligence/invariant-re…

Recognizing actions from dynamic visual input is an important component of social and scene understanding. The majority of visual neuroscience studies, however, focus on recognizing static objects in artificial settings. Here we apply magnetoencephalography (MEG) decoding and a computational model to a novel data set of natural movies to provide a mechanistic explanation of invariant action recognition in human visual cortex.

The human brain can rapidly parse a constant stream of visual input. The majority of visual neuroscience studies, however, focus on responses to static, still-frame images. Here we use magnetoencephalography (MEG) decoding and a computational model to study invariant action recognition in videos. We created a well-controlled, naturalistic dataset to study action recognition across different views and actors. We find that actions, like objects, can be read out from MEG data in under 200 ms (after the subject has viewed only 5 frames of video). Actions can also be decoded across actor and viewpoint, showing that this early representation is invariant. Finally, we developed an extension of the HMAX model to recognize actions. The model, inspired by Hubel and Wiesel’s findings of simple and complex cells in primary visual cortex as well as a recent computational theory of feedforward invariant systems, is traditionally used to perform size- and position-invariant object recognition in images. We show that instantiations of this model class can also perform recognition in natural videos that is robust to non-affine transformations. Specifically, view-invariant action recognition and action-invariant actor identification can be achieved in the model by pooling across views or actions, in the same manner and at the same model layer as pooling across affine transformations (size and position) in traditional HMAX. Together, these results provide a temporal map of the first few hundred milliseconds of human action recognition as well as a mechanistic explanation of the computations underlying invariant visual recognition.
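
The pooling step described above can be illustrated with a toy sketch. This is not the project's model or code. It only shows the general simple-cell/complex-cell idea: take dot products between the input and every stored view of a template, then take the maximum over those views. Everything in it is an assumption made for illustration: the function names (`shift`, `orbit`, `signature`) are hypothetical, random vectors stand in for video features, and a cyclic shift stands in for a change of viewpoint. Real viewpoint changes are non-affine and would not yield the exact invariance this toy case does.

```python
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def shift(vec, k):
    """Cyclically shift a list by k positions (toy stand-in for a view change)."""
    k %= len(vec)
    return vec[-k:] + vec[:-k]


def orbit(template):
    """All shifted versions of a template: the stored 'views' of that template."""
    return [shift(template, k) for k in range(len(template))]


def signature(x, template_orbits):
    """Simple-cell step: dot product of x with every stored view.
    Complex-cell step: max-pool those responses across views, per template."""
    return [max(dot(x, view) for view in views) for views in template_orbits]


rng = random.Random(0)
dim, n_templates = 8, 4

# Hypothetical templates; in a real model these would be learned or sampled
# from training videos rather than drawn at random.
templates = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_templates)]
template_orbits = [orbit(t) for t in templates]

x = [rng.gauss(0, 1) for _ in range(dim)]
sig_original = signature(x, template_orbits)
sig_shifted = signature(shift(x, 3), template_orbits)

# Pooling over the full set of views makes the signature unchanged by the shift
# (up to floating-point rounding).
print(all(abs(a - b) < 1e-9 for a, b in zip(sig_original, sig_shifted)))
```

In this sketch the pooled signature has one entry per template, and it stays the same when the input is shifted because the set of stored views already contains every shift. Pooling over stored views of actions or actors, as the text describes, follows the same pattern but would give only approximate invariance.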