Sitzmann Feature

Vincent Sitzmann, Assistant Professor, MIT CSAIL
[MUSIC PLAYING]
VINCENT SITZMANN: Hi, my name is Vincent Sitzmann. I'm an assistant professor at MIT, and my research is, broadly, on computer vision. Specifically, I'm interested in building AI that perceives the world the way humans perceive the world.
So, for instance, in the room that you're sitting in, when you move your head, you don't only see colors. You perceive objects. You perceive the geometry of the room. If you looked at that mug over there, you might anticipate how heavy it would be if you lifted it up. And all of that, you constantly infer just from vision.
And you learn that not by having someone tell you all these things, but instead by interacting with the world, by navigating through the world, by seeing the world. And I'm trying to build AI that can learn to perceive the world in the same way and that can understand the world in that way.
We, unfortunately, know precious little about how the brain actually implements all of these algorithms that allow you to understand the world around you from vision. I would go as far as to say that on the level of the algorithms that neurons implement when they talk to each other, the ones that allow your brain to learn all of these things, we know extremely little. However, we do know a lot, on a high level, about what kinds of skills humans have. And we know what kind of data a human observes in their lifetime, so we know how they learn these skills: what data they have available to learn them from.
And those insights have contributed a lot to research over the years. Let me give you an example. We know that humans are capable of estimating the geometry of rooms, or scenes, or objects from just a single image. And from a single image alone, if you just look at the math, there's actually no way of figuring out the geometry of the scene.
So we know it is something that you have learned over your lifetime. And that, for instance, has informed some of my research, where we build neural networks that learn to reconstruct scenes and objects from just a single image. I would say that in the past decade, there have been three major breakthroughs that have enabled my research, and that my research has also contributed to, which I think will, in the next few years, lead to vision models that are much more powerful than what we've seen over the past decade.
Specifically, I would say these three breakthroughs are differentiable rendering, which I'm going to talk about in a moment; the ability to reason about uncertainty in 3D reconstruction; and a third, which is still in progress, which is being able to train these models truly unsupervised, just from video. So let's take this apart.
The first term, differentiable rendering, sounds very fancy, but it's actually a very simple concept. The concept is basically that we know, mathematically, given a 3D scene, how that 3D scene generates the images you observe of it. So as you move a camera through space, we can mathematically express what images the camera sees, as long as we know the geometry and the appearance of the 3D scene.
And for a long time, there was this question: how do we train neural networks to reconstruct 3D scenes just from images? Why is that important? It's important because we don't have ground-truth 3D scenes. We really only have images. We have lots of videos, but we don't have any large data set of 3D scenes. We just have lots of videos.
So we want to train neural networks to reconstruct 3D scenes just from video. But there was a key piece missing, which is: how can we connect images to the 3D scenes that generate them, and how can we imbue neural networks with that knowledge? That is what differentiable rendering is about. Differentiable rendering basically gives a neural network a way of expressing how a 3D model that the network has generated produces an image.
And then what you can do is train neural networks by saying, OK, I give you an image of a scene. Please reconstruct the 3D scene that generated that image. And then, afterwards, I will give you a different image of that scene, and the neural network can check whether it was correct or not. So your AI can basically learn to reconstruct 3D scenes even though it is only ever looking at images of these 3D scenes.
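To make that training loop concrete, here is a minimal sketch, with hypothetical SceneEncoder and DifferentiableRenderer stand-ins rather than any specific published model: one view is encoded into a scene representation, the renderer re-renders it from a second camera pose, and the result is compared against the held-out view so gradients flow through the renderer into the encoder.

```python
# Minimal sketch of training with a differentiable renderer (toy stand-ins only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneEncoder(nn.Module):
    """Toy stand-in: maps an input image to a latent 3D scene code."""
    def __init__(self, code_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )
    def forward(self, image):
        return self.net(image)

class DifferentiableRenderer(nn.Module):
    """Toy stand-in: renders the scene code from a given camera pose.
    A real system would ray-march or rasterize an explicit 3D representation."""
    def __init__(self, code_dim=256, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.decode = nn.Linear(code_dim + 12, 3 * image_size * image_size)
    def forward(self, scene_code, camera_pose):
        x = torch.cat([scene_code, camera_pose], dim=-1)   # pose: flattened 3x4 extrinsics
        img = self.decode(x).view(-1, 3, self.image_size, self.image_size)
        return torch.sigmoid(img)

encoder, renderer = SceneEncoder(), DifferentiableRenderer()
opt = torch.optim.Adam(list(encoder.parameters()) + list(renderer.parameters()), lr=1e-4)

# One training step on a dummy batch: two views of the same scene plus the
# camera pose of the second view. Real data would come from posed video frames.
view_a, view_b, pose_b = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64), torch.rand(8, 12)

scene_code = encoder(view_a)                  # infer a 3D scene from one image
predicted_b = renderer(scene_code, pose_b)    # re-render it from the other viewpoint
loss = F.mse_loss(predicted_b, view_b)        # compare to the held-out view
opt.zero_grad(); loss.backward(); opt.step()  # gradients flow through the renderer
```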
[MUSIC PLAYING]
[UPBEAT MUSIC]
VINCENT SITZMANN: And there's another really interesting question to think about. If I show you this mug from the front here, right from this angle, then without a doubt, you could imagine what it might look like from the other side. And how have you learned that? You have learned that by looking at lots of mugs in your lifetime, right? So you have a belief over what the backside of this mug looks like. And in fact, if I asked you, and you were a gifted artist, you could maybe even draw, almost photorealistically, what the mug might look like.
But fundamentally, you don't know what it looks like. There could be many options of what it might look like from the other side. Nevertheless, when I turn it around, you're not surprised. This matches your prediction of what the mug looks like from the other angle.
OK, so what does that tell us? It tells us that there is a notion of uncertainty. If I show you this side of the mug, you have a belief over many possible options of what the other side looks like. And in the past, it was not possible to build neural networks that could represent this infinite number of possible predictions of what an object might look like.
Or another example: imagine you have a neural network that you give an image of a room, and you want it to predict what the next room looks like after you go through a door. That would be important for robotics, for instance. It's the same thing. Until you open the door, you can only imagine: oh, maybe it's a bathroom, maybe it's a kitchen. But you don't know exactly.
So you have a distribution over possibilities. And in the paper you mentioned, Diffusion with Forward Models, we showed that it is possible to train neural networks, again just from images, that can learn to make many such predictions. They can basically sample many different possible 3D reconstructions given just a single image.
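As a rough illustration of sampling several distinct reconstructions from a single image, here is a minimal sketch of a conditional denoising diffusion sampler. The Denoiser below is a toy stand-in, not the architecture from the paper; only the overall structure, starting from noise and repeatedly denoising while conditioning on the input image, reflects the idea being described.

```python
# Minimal sketch: sample several plausible 3D scene codes from one image (toy model).
import torch
import torch.nn as nn

T = 100                                  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)    # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise added to a latent scene code, conditioned on image features."""
    def __init__(self, scene_dim=128, image_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(scene_dim + image_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, scene_dim),
        )
    def forward(self, noisy_scene, image_feat, t):
        t_emb = t.float().view(-1, 1) / T
        return self.net(torch.cat([noisy_scene, image_feat, t_emb], dim=-1))

@torch.no_grad()
def sample_reconstructions(denoiser, image_feat, num_samples=4, scene_dim=128):
    """DDPM-style reverse process: each sample is one plausible scene, all for the same image."""
    image_feat = image_feat.expand(num_samples, -1)
    x = torch.randn(num_samples, scene_dim)          # start every sample from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, image_feat, torch.full((num_samples,), t))
        alpha, alpha_bar = alphas[t], alpha_bars[t]
        mean = (x - (1 - alpha) / (1 - alpha_bar).sqrt() * eps) / alpha.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x                                         # num_samples different hypotheses

denoiser = Denoiser()
image_feat = torch.randn(1, 512)                     # features of the single input view
scenes = sample_reconstructions(denoiser, image_feat)
print(scenes.shape)   # torch.Size([4, 128]): four distinct reconstructions of the same scene
```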
So that is another really cool thing. Another thing that I'm very excited about is a very fundamental problem in vision that has been attempted for a long time: if I give you a video captured with a camera that is moving through space, reconstruct, at the same time, the 3D geometry of the scene as well as the trajectory of the camera.
This problem has been investigated since the '70s, and it's known as structure from motion, or SLAM. In the past, we have built handcrafted, heuristic algorithms engineered by humans that attempt this problem. And these algorithms kind of work, but they don't work all the time. They don't work as robustly as the algorithms that you yourself are running.
And we are making progress towards algorithms that allow us to train neural networks that look at videos and learn to reconstruct the 3D geometry of the scene at the same time as their own motion, the motion of the camera through the scene. And that, I think, will be very important going forward for robotics, for autonomous navigation, for all of these kinds of things.
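To make that concrete, here is a minimal sketch in the spirit of self-supervised structure-from-motion pipelines (not the speaker's specific work): one toy network predicts per-pixel depth, another predicts the camera motion between two adjacent frames, and a photometric reprojection loss checks both predictions at once. The intrinsics, the small-angle rotation, and the tiny networks are all simplifying assumptions.

```python
# Minimal sketch: jointly learn depth and camera motion from video via reprojection.
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W = 64, 64
K = torch.tensor([[50.0, 0, W / 2], [0, 50.0, H / 2], [0, 0, 1]])  # assumed intrinsics

class DepthNet(nn.Module):
    """Toy stand-in: predicts a positive depth value per pixel."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())
    def forward(self, img):
        return self.net(img) + 0.1

class PoseNet(nn.Module):
    """Toy stand-in: predicts a small translation + rotation between two frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(6 * H * W, 6))
    def forward(self, img_a, img_b):
        return 0.01 * self.net(torch.cat([img_a, img_b], dim=1))

def warp(img_b, depth_a, pose):
    """Back-project frame A's pixels using predicted depth, move them by the predicted
    camera motion, and sample frame B at the reprojected locations."""
    B = img_b.shape[0]
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
    rays = torch.linalg.inv(K) @ pix
    pts = rays.unsqueeze(0) * depth_a.view(B, 1, -1)           # 3D points in frame A
    t, r = pose[:, :3], pose[:, 3:]
    zero = torch.zeros_like(r[:, 0])
    rot = torch.eye(3).expand(B, 3, 3) + torch.stack(          # small-angle rotation
        [zero, -r[:, 2], r[:, 1], r[:, 2], zero, -r[:, 0], -r[:, 1], r[:, 0], zero],
        dim=1).view(B, 3, 3)
    proj = K.unsqueeze(0) @ (rot @ pts + t.unsqueeze(-1))
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-3)
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1, uv[:, 1] / (H - 1) * 2 - 1], -1)
    return F.grid_sample(img_b, grid.view(B, H, W, 2), align_corners=True)

depth_net, pose_net = DepthNet(), PoseNet()
frame_a, frame_b = torch.rand(2, 3, H, W), torch.rand(2, 3, H, W)   # adjacent video frames
depth = depth_net(frame_a)
pose = pose_net(frame_a, frame_b)
loss = F.l1_loss(warp(frame_b, depth, pose), frame_a)   # photometric consistency
loss.backward()                                          # trains depth and ego-motion together
```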
So these kinds of depth-sensing modalities are very useful. They basically allow you to give the neural network that has to reason about the scene not only the colors that you capture with the image, but also the distance of every pixel to the camera. And that can be very useful. It does make the problem slightly easier, but the fundamental problem that we're trying to solve is much bigger than just reconstructing the 3D geometry of the scene, which is what LiDAR might help with. Rather, it is really about understanding all aspects of the scene.
So for instance, when you look at this mug, not only do you understand the geometry of the mug, but also you understand that it's an object, which means that if you poked it, then it would move. Or you understand, maybe, for the laptop behind me, that you could close it. So you understand its affordances.
And these kinds of things you really learn from interacting with the world and from navigating through the world. These are things that you need intelligence to understand. And so LiDAR and other kinds of additional sensors make the problem easier, because they give your AI additional information that it can use as a starting point, but they don't fundamentally solve the problem of going to this next level of semantically understanding the 3D world.
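As a small, purely illustrative sketch of the point about depth as an extra input: the same kind of network can consume color alone or color plus a per-pixel depth map from a sensor such as LiDAR, but either way it only gets geometry for free, not semantics or affordances.

```python
# Minimal sketch: feeding RGB, or RGB plus per-pixel depth, into a toy network.
import torch
import torch.nn as nn

class SceneNet(nn.Module):
    def __init__(self, use_depth: bool):
        super().__init__()
        in_channels = 4 if use_depth else 3     # RGB, or RGB + depth channel
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
    def forward(self, rgb, depth=None):
        x = rgb if depth is None else torch.cat([rgb, depth], dim=1)
        return self.net(x)

rgb = torch.rand(1, 3, 64, 64)       # camera image
depth = torch.rand(1, 1, 64, 64)     # distance of every pixel to the camera
features = SceneNet(use_depth=True)(rgb, depth)
print(features.shape)                # torch.Size([1, 64, 64, 64])
```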
VINCENT SITZMANN: So one of our research goals is indeed to build artificial intelligence that can go from an image, or maybe a short video, to a reconstruction of the underlying scene that encodes the physical properties of the scene as well. For instance, for a robot that wants to interact with the scene, it's really important to anticipate how heavy an object will be before picking it up, or to understand how a drawer works in order to interact with it. So one of the big goals is indeed to build AI that can learn about the physics of the world. This has been a very long-standing problem, and we haven't solved it yet. But I think, right now, there is critical progress happening that will make it possible to make significant progress on this problem in the near future, in the next few years.
And what does that progress look like? I think one key missing piece is to be able, given video, to reconstruct the 3D motion of all the things in the scene. And why is that critical? Think about how you learn about physics. The way you learn about physics is, maybe, as a baby you play with objects: maybe you poke them, maybe you throw them, maybe you bite them. And what happens is that you observe the effect of your own action. You take an action, and then you observe the outcome of that action. So what your brain is probably doing is constantly trying to predict what will happen next, conditioned on the action that you're taking. Conditioned on me poking this mug, what will happen next? And that is probably the signal that we can use to capture and to understand physics.
The problem is that we first need a way of reconstructing what actually happened, conditioned on the action. Before I can learn how objects move in 3D when I poke them, I first need to be able to observe how objects move in 3D. And this is a very hard problem that is in fact still unsolved today. There is no neural network or algorithm right now that you can give a video and it returns to you a 3D reconstruction of all the objects in motion. But with the kinds of breakthroughs we were just discussing, these methods that allow you to do 3D reconstruction with neural networks, that allow you to figure out how the camera is moving through the scene, that allow you to deal with the inherent uncertainty of this reconstruction problem, and with some other things that we are working on, we will be able to solve that exact problem. So we will be able to build models that can go from videos to the 3D motion of all the particles, and then we will be able to build models that not only reconstruct these motions but actually predict them. And that then gives rise to a physical understanding of the world.
So there are two kinds of signals that we need to learn about physics; one of them is not strictly necessary, but it makes the problem a lot easier. There are passive observations you can capture of the world, and these are very cheap, and there are active observations you can capture. What I mean by that is that one of them is embodied, the active observations, and the other one is not. Examples of passive observations are just videos you can find on the internet.
So of course, we have billions of hours of video that was captured in our world, and the way to think about that video is that all of it is a measurement of the physical processes of our world. And you can learn a lot from just watching all that video. For instance, you can learn in which ways certain objects can move. You can learn that cars usually go forward and that, generally, they don't go sideways. You know that this mug is unlikely to fall through the tabletop anytime soon. And you can even learn things like: this mug will only move if someone touches it. So you can learn a lot about the rules of the world just from passively observing it.
But then there is this critical piece: if you really want to connect that to your own actions, if you really want to connect it to what happens when I interact with the world, then you need embodiment. Then you need an agent that can interact with the world. So the current idea is basically that we will build AI models that learn a lot from this passive data, which is cheaply available, and then we will unleash agents that can interact with the world and collect relatively small amounts of data, because that is very expensive and takes a long time. But with these small amounts of data, they can then go this one extra step of connecting the models they have learned to their own actions and really figure it out.
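A minimal sketch of the action-conditioned prediction idea described above, with purely illustrative module names: a toy world model that predicts the next observation given the current one and the agent's action, trained on (observation, action, next observation) triples collected through interaction.

```python
# Minimal sketch: learn to predict what happens next, conditioned on the agent's action.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WorldModel(nn.Module):
    def __init__(self, obs_dim=64, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )
    def forward(self, obs, action):
        # Predict the next state of the world, conditioned on what the agent does.
        return self.net(torch.cat([obs, action], dim=-1))

model = WorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy interaction data; in practice this would come from an embodied agent
# poking, pushing, and moving through the world.
obs      = torch.randn(32, 64)   # encoded current observation
action   = torch.randn(32, 4)    # e.g. a poke direction and force
next_obs = torch.randn(32, 64)   # what the world actually looked like afterwards

pred = model(obs, action)
loss = F.mse_loss(pred, next_obs)            # "was my prediction of the outcome right?"
opt.zero_grad(); loss.backward(); opt.step()
```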
[MUSIC PLAYING]
VINCENT SITZMANN: So first of all, I think the core applications for these computer vision models, these 3D-understanding models, are indeed robotics, autonomous navigation, and computer graphics. These are the three core applications.
And I do think that there are lots of problems that are shared across these domains. For instance, the problem of 3D reconstruction is certainly shared among all of them. In computer graphics, when you make a movie or a video game, you would really rather be able to see an object in the real world, snap your fingers, and have a 3D model of it that you can put in your video game.
So that is a problem where 3D reconstruction is very present. In robotics, of course, if the robot wants to interact with the 3D world, it had better have a 3D model of that world. And in autonomous navigation, the same is true: if you want to navigate your car, you had best have a 3D model of your environment.
And so for these kinds of low-level things, certainly, these models will carry over. Another thing that carries over is this kind of semantic understanding of the world. For instance, in computer graphics today, the way we edit 3D content is very low-level and very unintuitive.
Recently, in 2D image editing, we have seen the rise of AI in the sense that you can select an area in an image and ask, OK, please make this person smile, and the AI will just do it. You don't have to go in yourself and push the pixels around, which is how it was done 20 years ago.
And the same will certainly happen for 3D. But for that, we need models that look at a 3D object and have a semantic understanding of its parts, such that you can say, OK, for this mug, please delete the handle, and the AI can just make it so. And again, this is a capability that carries over to robotics and to self-driving cars, where you also need a semantic understanding of the world.
So yes, there will be lots of models built that will be shared. On the flip side, there are, of course, lots of applications and UIs that differ across these domains. But I think there will be foundational models that are shared across these different applications.
These developments in generative AI, both in text and in text-to-image, for instance, are intimately related to vision.
And I think one very strong connection is actually this aspect of uncertainty that we talked about earlier, because the way to model uncertainty, oftentimes, is exactly as a generative model. You say: conditioned on this image, give me a model that can produce all possible reconstructions. And that is exactly a generative model.
And in fact, there have been insights that have carried over here. For instance, the model that we were talking about before is a diffusion model, a diffusion model that allows you to sample 3D scenes directly. And that certainly built on top of the advances of diffusion models in image generation. So that carries over.
Another good way to think about this is that large language models are basically models that always predict the next token in a sequence, right? They are models that predict what happens next in a sequence. And it is very productive to think about self-supervised learning in vision in a similar way.
We said earlier that maybe your brain learns by making predictions about what happens next. So it is similarly a model that looks at the current state of the world and then asks, OK, what will happen next? And so this is an example of a sequence model. And the insights that we have from language modeling are already carrying over to vision. So this is all very intimately connected.
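As a rough illustration of that analogy, here is a minimal, purely illustrative sketch that applies the same next-element prediction objective used in language modeling to a sequence of frame embeddings: a causally masked transformer predicts the embedding of frame t+1 from frames up to t.

```python
# Minimal sketch: next-frame prediction as sequence modeling (toy setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, dim = 16, 128
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(dim, dim)

frames = torch.randn(8, seq_len, dim)   # embeddings of consecutive video frames
# Causal mask so each position only attends to the past, like a language model.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = backbone(frames, mask=causal_mask)
pred_next = head(hidden[:, :-1])        # prediction for frame t+1 from frames <= t
target = frames[:, 1:]
loss = F.mse_loss(pred_next, target)    # next-frame prediction, analogous to next-token prediction
loss.backward()
```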
[MUSIC PLAYING]