Vincent Sitzmann

Assistant Professor of Electrical Engineering and Computer Science

Next Steps for AI: Creating 3D Understanding from 2D Images

By: Eric Brown

Generative AI models such as DALL-E have gained widespread adoption by producing photorealistic images and videos in a variety of styles from text prompts. Yet neural networks continue to struggle to interpret images. Vision-oriented AI models are well established in applications such as identifying defects in manufacturing and recognizing faces for security, but they often fail to understand the 3D scenes depicted by 2D images. Self-driving cars, for example, still cannot safely navigate many real-world driving scenarios.

What vision AI is missing is something that comes naturally to people: quickly grasping what is happening in the field of view and guessing what will happen next. Even a gifted artist cannot dash off quality artwork in a few seconds based on word prompts. Yet, the same artist can usually drive through chaotic rush hour traffic without an accident – and without giving the task much thought.

The good news is that interpretive vision AI appears to be primed for a breakthrough. “In the next few years, we should have much more powerful vision AI models that can reconstruct a navigable 3D view from a 2D image,” says Vincent Sitzmann, an assistant professor at MIT CSAIL and leader of the Scene Representation Group.

Sitzmann focuses on the problem of “neural scene representations,” with the goal of “building an AI that perceives the world the way humans do.” His research has contributed to major recent breakthroughs that are poised to accelerate computer vision applications such as autonomous navigation for robots and vehicles.

“We still know precious little about how the brain understands the world from vision,” says Sitzmann. “Yet, we know a lot about the data humans observe to learn about the world. These insights have informed my research to train neural networks that learn to reconstruct 3D scenes just by, essentially, watching videos. This is the foundation for AI that can learn to perceive and understand the world as well as people can.”

Reconstructing 3D from 2D
Sitzmann has been inspired by, and contributed to, three major breakthroughs in vision AI in recent years. The first is differentiable rendering, which enables neural networks to reconstruct 3D scenes from 2D images and video. Because large datasets of 3D scenes for training neural networks are scarce, differentiable rendering is a game changer in applications such as autonomous navigation.

“With differentiable rendering, you can train a neural network by supplying an image of a scene and asking it to reconstruct the 3D scene that generated that image,” says Sitzmann. “Then you can give the neural network an image of the same scene from a different angle and check whether its reconstruction is correct.”
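To make that training signal concrete, here is a minimal sketch in PyTorch. The toy orthographic voxel renderer, the synthetic cube scene, and all names below are illustrative assumptions rather than Sitzmann's actual method; the point is only that because every rendering operation is differentiable, the pixel error between a rendering and a held-out view can be backpropagated directly into the 3D representation.

```python
import torch

def render(density_logits, view_axis):
    # Toy orthographic renderer: alpha-composite a voxel occupancy grid
    # along one axis. Every operation is differentiable w.r.t. the grid.
    d = torch.sigmoid(density_logits).movedim(view_axis, 0)  # (D, H, W) in [0, 1]
    trans = torch.cumprod(1.0 - d, dim=0)                    # light surviving each slice
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]], dim=0)
    return (d * trans).sum(dim=0)                            # (H, W) rendered image

# Synthetic ground truth: a solid cube in the middle of a 16^3 grid.
gt = torch.full((16, 16, 16), -8.0)
gt[5:11, 5:11, 5:11] = 8.0
target_views = [render(gt, ax).detach() for ax in range(3)]

# Recover an unknown scene so that its renderings match the observed views.
scene = torch.zeros(16, 16, 16, requires_grad=True)
opt = torch.optim.Adam([scene], lr=0.1)
for step in range(300):
    loss = sum(((render(scene, ax) - tv) ** 2).mean()
               for ax, tv in enumerate(target_views))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"multi-view reconstruction loss: {loss.item():.6f}")
```

Real systems replace the voxel grid with neural scene representations and this toy renderer with physically based volume rendering, but the supervise-through-rendering loop is the same.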

A second breakthrough, which builds on differentiable rendering and originated in Sitzmann’s research group at MIT, enables a neural network to reason about uncertainty when reconstructing 3D from a 2D image. A project led by Ayush Tewari, a postdoc in Sitzmann’s group, developed a neural network model that can guess what an object looks like from the reverse side based on a single image of the visible side. The project, described in a paper titled “Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision,” was a collaboration with Tianwei Yin, George Cazenavette, Semon Rezchikov, Fredo Durand, Joshua Tenenbaum and William Freeman.

“If I show you this mug, you can easily imagine what it might look like from the back,” says Sitzmann. “By looking at a lot of mugs, you have developed a belief about what a mug looks like, so if you turn it around, it usually matches your prediction. Yet you also know there are many possible options and that you might be surprised. People have developed concepts of likelihood and uncertainty that neural networks struggle with.”

This failure to model uncertainty has led to blurry, implausible reconstructions even for simple objects, and to complete failure when extrapolating the hidden parts of more complex scenes, such as rooms. In contrast, given a 2D image of a room, the model proposed by Sitzmann’s group will not only create a 3D reconstruction of the visible scene but also extrapolate the parts of the room that are not shown. The AI can even generate predictions of what one might see if a door to another room were opened.

“Like a human, the neural network does not know what the next room looks like, but it can now provide a distribution of possibilities,” says Sitzmann. “In ‘Diffusion with Forward Models,’ we showed that it is possible to train neural networks to sample many different possible 3D reconstructions given a single image.”
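The flavor of this can be shown with a toy. The sketch below (PyTorch) is ordinary conditional diffusion on 2D points, not the forward-model training the paper actually introduces: a point's first coordinate plays the role of the visible side of an object, its second coordinate the hidden side, and the hidden side is deliberately bimodal. After training, repeatedly sampling with the same observation yields different, individually plausible completions. All architecture choices and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

T = 100                                   # diffusion steps
betas = torch.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)

def sample_data(n):
    # x0 = (visible, hidden); the hidden coordinate is bimodal given the
    # visible one, so a single observation admits two plausible completions.
    visible = torch.rand(n, 1)
    sign = torch.randint(0, 2, (n, 1)).float() * 2 - 1
    hidden = sign * (1.0 + 0.1 * torch.randn(n, 1))
    return torch.cat([visible, hidden], dim=1), visible

# Denoiser eps_theta(x_t, t, y): predicts the noise given observation y.
net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(3000):
    x0, y = sample_data(256)
    t = torch.randint(0, T, (256,))
    eps = torch.randn_like(x0)
    xt = abar[t].sqrt()[:, None] * x0 + (1 - abar[t]).sqrt()[:, None] * eps
    loss = ((net(torch.cat([xt, t[:, None] / T, y], dim=1)) - eps) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Ancestral sampling: one observation y, many different completions.
y = torch.full((8, 1), 0.3)
x = torch.randn(8, 2)
for t in reversed(range(T)):
    eps_hat = net(torch.cat([x, torch.full((8, 1), t / T), y], dim=1))
    x = (x - betas[t] / (1 - abar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t > 0:
        x = x + betas[t].sqrt() * torch.randn_like(x)
print(x[:, 1])   # hidden coordinate: samples should land near both +1 and -1
```

In the paper's actual setting, the sampled variable is an entire 3D scene and the observation is an image of it, with a differentiable renderer linking the two.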

A third achievement, still a work in progress, is the ability to train neural network models “truly unsupervised just from video,” says Sitzmann. The project shows promise of achieving a long-standing goal of computer vision research: supplying a neural network with video captured by a camera moving through space and having it simultaneously reconstruct the 3D geometry of the scene and the trajectory of the camera.

Researchers have long attacked this problem with a technique called Structure from Motion (SfM), which estimates 3D structure from 2D images. In robotics, SfM underlies visual Simultaneous Localization and Mapping (vSLAM or SLAM), which calculates the position and orientation of a moving camera while simultaneously mapping the scene around it.

“The problem with SfM and SLAM is that they require handcrafted heuristic algorithms, which are time-consuming to build and often inaccurate,” says Sitzmann. “Our human algorithms for this are much faster and more robust. But we are now making progress toward algorithms that allow us to train neural networks to look at videos and learn to reconstruct the 3D geometry of the scene while simultaneously reconstructing their own motion and the motion of the camera.”
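The classic objective behind SfM and SLAM is easy to state: jointly find 3D points and camera poses that minimize reprojection error. The sketch below makes that concrete on synthetic data in PyTorch, under strong simplifying assumptions (known correspondences, identity rotations, unknown translations only; real pipelines must also estimate rotations and correspondences). It is this hand-built estimation machinery that the learned approach aims to replace.

```python
import torch

def project(points, cam_t):
    # Pinhole projection into a camera at position cam_t (identity rotation,
    # unit focal length): (X, Y, Z) -> ((X - tx) / (Z - tz), (Y - ty) / (Z - tz)).
    p = points - cam_t
    return p[:, :2] / p[:, 2:3]

# Synthetic ground truth: 50 points in front of 4 cameras looking down +Z.
torch.manual_seed(0)
gt_pts = torch.rand(50, 3) * 2 + torch.tensor([0.0, 0.0, 4.0])
gt_cams = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0], [1.0, 1.0, 0.5]])
obs = torch.stack([project(gt_pts, c) for c in gt_cams])   # (4, 50, 2) pixel tracks

# Unknowns: the 3D points and every camera position except the first,
# which is pinned to the origin to remove part of the gauge freedom.
pts = (torch.randn(50, 3) * 0.1 + torch.tensor([1.0, 1.0, 4.0])).requires_grad_()
cams = torch.zeros(3, 3, requires_grad=True)
opt = torch.optim.Adam([pts, cams], lr=0.02)
for step in range(2000):
    all_cams = torch.cat([torch.zeros(1, 3), cams])
    pred = torch.stack([project(pts, c) for c in all_cams])
    loss = ((pred - obs) ** 2).mean()                      # reprojection error
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final reprojection error: {loss.item():.6f}")
```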

The Path to Self-learning, Semantic AI
Depth-sensing imaging systems such as lidar can help with vision AI applications such as autonomous navigation, but they do not solve the fundamental problem of understanding a complex 3D scene, says Sitzmann. “Depth sensing is very useful, as it gives the neural network not only the color of each pixel but also its distance from the camera. Yet the problem is much bigger than just reconstructing 3D geometry. It’s about semantically understanding all aspects of the 3D world.”

One of the first steps is to understand physics. Human toddlers do this by observing the effects of their own actions when poking, grasping, throwing, or even biting objects, and from this they learn to predict the results of actions. “When you look at this mug, you not only understand its geometry but realize that it’s an object,” says Sitzmann. “You might guess that it would move if you poked it, and estimate how easy it would be to lift.”

Understanding physics is vital for AI, especially for autonomous robots. A robot, for example, might need to know how to turn a doorknob or open a drawer.

“One of our research goals is to build AI that can take an image or video and reconstruct an underlying representation of the scene that encodes its physical properties,” says Sitzmann. “One hypothesis of how our brain learns how the world works is that it constantly predicts what will happen next based on our actions. To achieve a similar process in neural networks, we need to teach them how to reconstruct what happened, conditioned on an action. The first step is creating a model that can watch a video and reproduce the 3D motion of all the objects in it. Then, by teaching it to deal with the inherent uncertainty of the reconstruction problem, we could build models that not only reconstruct these motions but predict them. That will give rise to a degree of physical understanding.”
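A minimal sketch of that action-conditioned prediction idea, assuming a deliberately tiny stand-in for the real problem: instead of a 3D scene, the "world" here is a 1D point mass, and a small network learns f(state, action) → next state from interaction data. None of this is Sitzmann's actual model; it only shows the structure of predicting what happens next, conditioned on an action.

```python
import torch
import torch.nn as nn

def true_physics(state, action, dt=0.1):
    # Ground-truth "world": a point mass; the action applies a force.
    pos, vel = state[:, :1], state[:, 1:]
    vel = vel + action * dt
    return torch.cat([pos + vel * dt, vel], dim=1)

# World model f(state, action) -> next state, trained on interaction data.
model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(3000):
    state = torch.randn(128, 2)          # (position, velocity)
    action = torch.randn(128, 1)         # force chosen by the agent
    pred = model(torch.cat([state, action], dim=1))
    loss = ((pred - true_physics(state, action)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained model can now answer "what happens if I push this?"
state = torch.tensor([[0.0, 0.0]])
print(model(torch.cat([state, torch.tensor([[1.0]])], dim=1)))  # with a push
print(model(torch.cat([state, torch.tensor([[0.0]])], dim=1)))  # without one
```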

Currently, neural networks are taught about physics primarily through passive observation, such as watching vast amounts of video on the Internet. “An AI can learn a lot simply from watching video, which is cheaply available,” says Sitzmann. “It can learn how certain objects move, or that cars usually go forward rather than sideways. The AI may understand that a mug is unlikely to fall through the tabletop, or that it will likely move only if someone touches it.”

Yet to skillfully interact with the world, an AI needs to connect these observed rules to its own actions. “We need to move to active observation: embodied agents that interact with the world and collect relatively small amounts of data that capture physical principles,” says Sitzmann. “The amount of data needs to be small because collecting it is expensive and time-consuming. The goal is then to build models that can leverage this relatively small amount of active data to connect the models they have learned from passive data to their own actions.”

Training Truly General Models of Perception
With computing advances, neural networks can now perform astounding feats, leading to breakthroughs such as generative AI. Yet some question whether the current AI tools and approaches are sufficient to create a fully autonomous AI agent with a sense of self that can safely interact with a complex world.

“Current models still regularly fail to produce sensible outputs, especially in situations that are too different from those they encountered during training,” says Sitzmann. “Instead, we want these models to be as general as humans, who are capable of learning the rules of the world from their observations and can subsequently reason about even unfamiliar situations by applying those rules.”

Some have suggested that the solution to these problems lies with the old idea of linking a symbolic agent with neural networks. Others are looking for clues in neuroscience.

Although Sitzmann seeks inspiration from neuroscience, he notes that very few neuroscience discoveries have proven useful for AI. One exception is Hubel and Wiesel’s research on the visual cortex of cats in the 1960s, which played a role in the development of convolutional neural networks. He also questions whether symbolic methods will be applicable.

“Humans tend to perceive the world in terms of symbols, and it is very compelling to try to build models that operate that way,” says Sitzmann. “Yet I suspect that the models that give rise to the symbols we think about are not themselves symbolic; rather, they learn abstractions that behave like symbols. Today’s vision AI incorporates very few symbolic ideas, and I doubt that will change soon. I do not think we have run into the limits of what connectionist models can do.”

For now, hand-engineered 3D inductive biases provide shortcuts and certain generalization guarantees for neural network models. These include differentiable rendering and the related techniques used by Sitzmann.

“The question is whether it is possible to build models that uncover the rules of the world without hand engineering or memorizing everything they’ve seen,” says Sitzmann. “The latest models, including large language models and video generative models, are intriguing because they predict the next token in a sequence to guess what happens next. This is a very productive way to think about self-supervised learning in vision. Yet their plausible predictions require massive datasets, and they still fail to exactly back out the rules underlying the data. Perhaps the solution is a new kind of model altogether. This would be a data-driven model like current models, but with the goal of accurately and precisely identifying the processes that generate the data, without requiring human-designed shortcuts. This should be an exciting research direction over the next decade.”
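For reference, the next-token objective he describes is simple to state in code. The character-level toy below (PyTorch; all sizes and names are illustrative assumptions, and it bears no resemblance to a production language or video model) trains a model purely to classify which token follows a window of tokens. No labels beyond the data itself are needed, which is what makes the objective self-supervised.

```python
import torch
import torch.nn as nn

# Toy corpus; each character is a "token".
text = "the mug sits on the table. " * 8
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in text])

ctx, emb = 8, 16                          # context length, embedding size
model = nn.Sequential(nn.Embedding(len(vocab), emb), nn.Flatten(),
                      nn.Linear(ctx * emb, 64), nn.ReLU(),
                      nn.Linear(64, len(vocab)))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    i = torch.randint(0, len(data) - ctx - 1, (32,))
    x = torch.stack([data[j:j + ctx] for j in i])   # context windows
    y = data[i + ctx]                               # the token that follows
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"next-token loss: {loss.item():.3f}")
```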