Principal Investigator: Boris Katz
Co-Investigator: Nicholas Roy
Project Website: http://toyota.csail.mit.edu/node/28
Safe and controllable autonomous robots must understand and reason about the behavior of the people around them and be able to interact with those people. For example, an autonomous car must understand the intent of other drivers and the subtle cues that signal their mental state. While much attention has been paid to basic enabling technologies such as navigation, sensing, and motion planning, the social interaction between robots and humans has been largely overlooked. As robots grow more capable, this problem grows in magnitude, because more capable robots are harder to control manually and harder to understand.
We are addressing this problem by combining language understanding, perception, and planning into a single coherent approach. Language provides the interface that humans use to request actions, explain why an action was taken, inquire about the future, and more. Combined with perception, language can focus a machine's attention on a particular part of the environment, while the environment in turn can be used to disambiguate language and to ground new linguistic concepts. Planning provides the final piece of the puzzle: it not only allows machines to follow commands, but also allows them to reason about the intent of other agents by assuming that those agents are running similar but inaccessible planners, observable only indirectly through their actions and statements. In this way, a joint language, vision, and planning approach enables machines to understand the physical and social world around them and to communicate that understanding in natural language.
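The last idea, inferring intent by treating other agents as rational planners whose goals are hidden, is often formalized as Bayesian inverse planning. The sketch below is a minimal, self-contained illustration of that formulation, not this project's actual system: it assumes a one-dimensional world, a small set of candidate goals, and a Boltzmann-rational model of how an agent chooses actions, and it updates a posterior over goals from observed behavior.

```python
# Minimal sketch of Bayesian inverse planning (goal inference).
# Assumes a 1-D corridor world and a Boltzmann-rational agent model;
# an illustration of the general idea, not the project's planner.

import math

ACTIONS = {"left": -1, "right": +1, "stay": 0}

def q_value(state, action, goal):
    """Crude value: negative distance to the goal after taking the action."""
    next_state = state + ACTIONS[action]
    return -abs(goal - next_state)

def action_likelihood(state, action, goal, beta=2.0):
    """Boltzmann-rational observer model: P(action | state, goal)."""
    scores = {a: math.exp(beta * q_value(state, a, goal)) for a in ACTIONS}
    return scores[action] / sum(scores.values())

def infer_goal(trajectory, goals):
    """Posterior over candidate goals given observed (state, action) pairs."""
    posterior = {g: 1.0 / len(goals) for g in goals}  # uniform prior
    for state, action in trajectory:
        for g in goals:
            posterior[g] *= action_likelihood(state, action, g)
        norm = sum(posterior.values())
        posterior = {g: p / norm for g, p in posterior.items()}
    return posterior

# An agent at cell 3 repeatedly moves right; candidate goals are cells 0 and 7.
observed = [(3, "right"), (4, "right"), (5, "right")]
print(infer_goal(observed, goals=[0, 7]))  # mass concentrates on goal 7
```

Because each observed action reweights the posterior, the inferred goal sharpens as more behavior is seen, much as a driver's repeated drift toward a lane boundary gradually signals an intended turn.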