Computer Vision

Inspired by strategies from human vision and cognition, we build deep learning models for object, place, and event recognition. To this end, we are building a core of visual knowledge (e.g., Places, a large-scale dataset of 10 million annotated images; Moments in Time, a large-scale dataset of 1 million annotated short videos) that can be used to train artificial systems for visual and auditory event understanding and common-sense tasks. Such tasks include identifying where the agent is (i.e., the place), what objects are within reach, what surprising events might occur, which types of actions people are performing, and what may happen next.