2024 MIT Digital Technology & Strategy Conference: Generative Models as a Data Source for AI Systems

Conference Video | Duration: 44:17
September 17, 2024
  • Video details

    Phillip Isola
    Associate Professor, Department of Electrical Engineering and Computer Science

    Generative models can now produce realistic and diverse synthetic data in many domains. This makes them a viable choice as a data source for training downstream AI systems. Unlike real data, synthetic data can be steered and optimized via interventions in the generative process. I will share my view on how this makes synthetic data act like data++, data with additional capabilities. I will discuss the advantages and disadvantages of this setting, and show several applications toward problems in computer vision and robotics. 

  • Interactive transcript

    GRAHAM RONG: Now, our next speaker, the keynote speaker, will give a talk on a very specific area of computer science, computer vision and data, on what is called synthetic data. So now, let me introduce our second keynote speaker, Professor Phillip Isola, who is an associate professor in MIT EECS and a principal investigator in MIT CSAIL.

    His work focuses on why we represent the world the way we do and how we can replicate those abilities in machines. He has a recent book, published just a few months ago, called Foundations of Computer Vision, which covers topics not standard in textbooks, including transformers, diffusion models, statistical image models, issues of fairness and ethics, and the research process. Now, let's welcome Phillip.

    [APPLAUSE]

    GRAHAM RONG: When they ask questions, before you answer, just repeat the question.

    PHILLIP ISOLA: Repeat, OK. Hi, everyone. Yeah, thank you for the intro and for having me here and happy to share some of my work and some of the trends that I'm seeing going on in the field right now in generative AI. OK, let me see if I know where to point. OK, good.

    So I'm going to talk about generative models as a data source for AI systems. And this is a topic that sometimes goes under the name synthetic data. And a lot of companies and a lot of applications are now building their systems not on top of real data, but on top of data made by models.

    And these are two examples of synthetic data. This is a generative model of images of cats. It's a model called a GAN. And this is a generative model, which is called a NeRF, a Neural Radiance Field. And I'll mention some applications of those as well as some of the even newer models. OK, let's see. There we go. Finding the Slide Advance button is always the first step.

    So a lot of work in AI proceeds as shown on this slide here: you take a data set as fixed. In academia, this is what we used to do all the time. And it's still what most people do.

    They take data as fixed. The rules of the game are you have a data set, and you want to model that data. You want to have an algorithm that analyzes that data.

    And the entire action is in the learning algorithm. And then the learning algorithm fits a function or a model to the data and gets some intelligence out of it. But you don't have to do things that way. That was just the standard way we have done things in the past.

    Instead, what I've been more interested in recently is, can we just freeze the learning algorithm-- the rules of the game become that the learning algorithm is frozen-- and, instead, change the data that we feed to that algorithm? And if we feed it better data, perhaps we'll get a different result or a better result. So we're changing the focus from the algorithms to the data sets.

    And one motivation for this is that we've all seen that data is the big thing that has driven a lot of the recent progress in AI. The algorithms haven't changed that much, but the data sets have scaled up incredibly. And the compute is the other part that's changed, but I'm not going to touch on that in this talk. So the hardware and the GPUs and all of the computational power has also been a big driver.

    But we know that data is the fuel of machine learning. It's data-driven intelligence. You get a lot of data, and that data tells you how to make intelligent machines. So we should really study the data, and we should really figure out ways of making better data. And generative models are going to be one way of doing that.

    So here's the perspective that I like to take on this problem. So generative AI is often thought of as just a fun tool for making creative stories or images, but I think it's actually much more fundamental and powerful than that. So normally, we start with data. It goes into a learning algorithm. We get intelligence out.

    But instead, we're going to add a generative model to the first part of the pipeline. We're going to take data, fit a generative model to that data. This is going to be like an image generation algorithm or a text generation algorithm. So think ChatGPT or DALL-E, these types of things.

    And what we're going to get out is not just more photos or more text. We're going to get out something which is actually fundamentally different. It's like the original data. But in some ways, it can be better. It's data with more potential. I like calling this data plus plus. And I'll tell you the ways in which this can be an augmentation or an improvement over the original data.

    And it's a little bit weird. I mean, some of you should be kind of skeptical. You have real data, and then you make this approximation to it with a model. And it's actually better? That doesn't make any sense, right?

    We think of models as making artifacts and hallucinations and not being factual. So it could be data minus minus after modeling. But I'm going to talk about some ways in which it can actually be data plus plus.

    And I think that one perspective on this is that this is just what's happening in the world. So whether or not this is a good idea-- and I'll try to argue that in some ways it can be a good idea. But whether or not it's a good idea, this is the paradigm that has emerged. Sorry.

    We have companies that will train these large generative models, like language models or image generative models. And users will take those models and do something with them. So rather than the user directly gathering training data for their algorithm and then training their models on that data, they will interface with the data via a massive foundation model, a massive generative model created in industry. So this is just the reality: for most people, interacting with data will happen via an interface, through another model created by a large organization or a company. It could be an open source model as well.

    So we have to, I think, understand this paradigm of learning from model data that's mediated by a generative model. So we had an era previously, which was this era of big data. Big data drove the advances in AI, in deep learning, and machine learning.

    Now, we have these big models, and these are more and more becoming the interface. And just for example, in my field of computer vision, it used to be that whenever we'd start a new project I would tell my students-- or it could be in a company, you tell your employees-- first you want to collect a million labeled photos for your application area. And for research, we would often use this data set called ImageNet.

    But that's not what we do anymore. Instead, the starting point that my students go to is called Stable Diffusion. We don't download ImageNet and train our algorithms on that. We now go to a model called Stable Diffusion, which is a generative AI system that is our interface to data. So our interface to data is no longer a data set. It's now a model.

    And I think that this trend might continue. And there's also a lot of talk about the social consequences and the concerns about the internet no longer being like a data set. The internet is just going to be a big repository of synthetic data. And most of the dialogue about this right now has been critiques of that setting, where the internet is just polluted by fake data. So I want to give a little more optimistic take, that there are some benefits as well.

    So here are going to be the benefits. At the end, I'll mention some of the limitations. It is true. These critiques are also valid. But let's look at the positive side first.

    So generative models, what do they do? One thing they do is they take a finite data set, a small, in some sense, data set. And they create a system that can produce an infinity of data. So the way they do that is you start with-- I'm going to talk mostly about examples in computer vision.

    They start with a set of photos. And these are like discrete points in some data space. So the set of photos here is just going to be dots in this distribution. And I could sample photos from that set, and I'll get real photos.

    And if I train a generative model, what I'm actually doing is fitting a continuous density, a continuous function, to that discrete set. And that continuous function fills in all the gaps. So you can interpolate and potentially even extrapolate a little bit outside of the distribution of the training data and produce samples which look photorealistic. Because that's the point we're at now.

    But you can produce infinite samples. The diversity of that set won't be infinite. It will be bounded by the training data. But you can interpolate and even extrapolate a little bit beyond the actual training data. So you get more data out. That's one simple sense in which the output of a generative model can be like data plus plus.
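
    To make that "finite points in, continuous density out" picture concrete, here is a minimal sketch using an off-the-shelf density estimator from scikit-learn. This is an editorial illustration, not anything from the talk; the toy 2D points stand in for the photos.

```python
# Fit a continuous density to a finite set of points, then sample far more
# points than we started with -- the "finite data in, infinite data out" idea.
import numpy as np
from sklearn.neighbors import KernelDensity

real_points = np.random.randn(500, 2)                     # stand-in for the finite set of photos
density = KernelDensity(bandwidth=0.3).fit(real_points)   # a continuous model of the data

synthetic_points = density.sample(10_000)                 # sample as many new points as we like
# The samples interpolate between (and slightly beyond) the original points,
# just as an image generator fills in the gaps between its training photos.
```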

    So generative models turn finite data into continuous data, into infinite data in some sense. And that's one starting point for understanding what they can do. But we'll see. Is that actually going to be a good thing? Is that useful for anything?

    So I want to tell you a little bit about how these generative models work to give you the next benefit that I see in them. So most of the generative AI systems we have now for making images follow this pipeline. You start with a set of what we call latent variables. So sometimes people call these noise, but I don't really think noise is the right way to think of them. Instead, think of them as control knobs: when you spin those control knobs, you'll get a different image. You'll get a random image.

    So you have a set of control knobs labeled as z in this diagram here. And you put those inputs-- you set those control knobs to whatever setting you want. And you put them into a neural network, G, that outputs an image. And if I spin my control knobs to a different setting, I'll get a different image.

    So these controls can be visualized as follows. So this is a real result. This isn't just a cartoon. If I have a set of control variables, maybe I have hundreds of these knobs. And I'm just going to look at two of them now. I'm going to look at two of them on these two axes on the right.

    So if I take this setting of the knobs, it makes this bird. And if I turn one knob individually left and right, I will get the bird rotating. And if I turn another of these control knobs left and right, I'll get the background color changing.

    So these control knobs act as kind of interpretable controls that create this little continuous manifold of natural images. So this is something you can't do with real data very easily. I can't just find photos of this bird from all these different angles with different background colors. But with a generative model, I have these independent latent variables that can be tuned to create a visual like this.
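
    Here is a minimal sketch, in PyTorch, of what sampling and "turning one knob" look like in code. The generator is an untrained stand-in and the chosen latent direction is hypothetical; with a real pretrained GAN the pattern is the same, images = G(z).

```python
# Sampling from a generator and moving along one latent direction.
import torch
import torch.nn as nn

latent_dim = 128

# Toy generator: maps a latent vector z to a 3x64x64 image (untrained stand-in).
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 3 * 64 * 64), nn.Tanh(),
)

# Spin all the knobs at random: a random z gives a random image.
z = torch.randn(1, latent_dim)
image = G(z).view(1, 3, 64, 64)

# Turn one "knob": move along a single latent direction and watch the output
# change smoothly (e.g., rotating the bird, changing the background color).
direction = torch.zeros(latent_dim)
direction[0] = 1.0   # hypothetical direction; in practice it is discovered or learned
for alpha in [-3.0, -1.5, 0.0, 1.5, 3.0]:
    edited = G(z + alpha * direction).view(1, 3, 64, 64)
    print(alpha, edited.shape)
```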

    So the other super powerful control knob is that we can use text now to control these models. So we don't just have to use these kind of so-called latent variables, which are like continuous knobs you're turning. You can also just describe what you want. And I'm sure most of you have seen things like this where you can do text to image, text to video.

    Here is a generative model, which is taking the text input, a photo of a group of robots building this data center. And this is the output. These are the things that would have been mind blowing a few years ago.

    But the world changes and updates very quickly. And now it's like, OK yeah, we've seen this. But just remember, a few years ago, this would have looked like magic. Nobody that I know saw this coming.

    So we can control our images via text. That's a new capability that generative models allow that was not really easily possible if I just had a traditional data set of photos. So this is the next benefit I see: generative models take kind of small, unstructured data as input, and they output bigger, continuous data plus controls, with these control knobs. So it's like the data plus the controls.

    And that's actually the most powerful part of it in my opinion. Because we can intervene on the data via those controls to steer it and change it into the type of data we want. We can change the properties of the data to be maybe less toxic or censor content that we don't like. Or we can make it more beautiful. We can do a lot of interesting things to it. I'll show some of these applications.

    But the big idea of these control knobs is they allow what we can call counterfactual reasoning. So counterfactual reasoning is reasoning of the form, what would it look like if? So what would it look like if the lighting changed? What would it look like if the camera angle changed? What would it look like if the pose of that cat changed?

    And so these control knobs allow you to ask that question and visualize or sometimes we would say hallucinate these counterfactuals. But this is a positive version of hallucination. It's imagination more than it is making up things that are unfactual in a negative way.

    So how do we do that with a photo that we want to imagine this counterfactual for? What we can do is we can take our photo of this cat, and we can do this data to data plus plus conversion process where we find the setting of controls. So find the setting of latent variables, z, that, when put through a generative model, will replicate that photo of the cat.

    So we call that encoding the cat into the control variable space of the generative model. And then we can decode that cat back into a photo via our generative model. Now, we have converted the static real photo of the cat into this generative version of the cat, which is coupled with its latent variables. And if we change those latent variables, those control knobs, then we get the cat changing its pose.

    Maybe we can change the lighting conditions. We get this living object. So we've converted the static real data into this more alive, controllable, synthetic data. And that's what I see as the kind of most qualitatively important new power of synthetic data that you don't have in real data.
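
    A minimal sketch of this encode-edit-decode loop, assuming a differentiable pretrained generator (replaced here by an untrained stand-in): optimize the latent z so the generator reproduces the photo, then nudge z to get counterfactual versions.

```python
# "Encode" a real photo into a generator's latent space by fitting z, then edit z.
import torch
import torch.nn as nn

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())   # stand-in generator

photo = torch.rand(1, 3, 64, 64) * 2 - 1              # the real photo (placeholder pixels)
z = torch.randn(1, latent_dim, requires_grad=True)    # the control knobs we will fit
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):                               # inversion: find z that reproduces the photo
    recon = G(z).view(1, 3, 64, 64)
    loss = ((recon - photo) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Now the photo "lives" in latent space: nudging z produces counterfactual versions.
direction = torch.randn(1, latent_dim)                # e.g., a learned pose or lighting direction
counterfactual = G(z.detach() + 0.5 * direction).view(1, 3, 64, 64)
```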

    So there's a bunch of papers. These ones are from a few years ago, but this is very much an ongoing and hot research area right now on what you can do with counterfactual interventions into synthetic data to improve your data processing pipeline.

    So I'm going to talk about a few things now that you can do, a few applications. So we have this new type of data. I've argued that it is better because it is continuous, controllable. And what can you actually do with that? How can that make applications better?

    So here's one paper that we have from a few years ago where we tried to improve a kind of crummy classifier. So we had this classifier, which I've labeled C. This is just a system that looks at photos and decides is it a cat? Is it a dog? Is it an airplane? And what we did is we tried to get a more robust classifier by using these counterfactual visualizations.

    So we have a noisy, imperfect classifier C. It might say that's a dog. But we can then do what I just showed on the previous slides and convert this input photo into these counterfactual variations on the input photo. And this, now, is saying, well, the cat could have been in a different pose. The lighting could have been different. Here's all the different ways that the image could have been taken without really fundamentally changing the content of the image.

    And now, we put all of those into our classifier and ensemble the results. And as long as one of these counterfactual visualizations is clearer or easier for the classifier to understand, then the ensemble, the set of all the predictions averaged together, will get a more accurate result. So this is a way of making a more accurate and robust classifier by just revisualizing the data via generative modeling. So this improves the accuracy a little bit and the robustness of the system. And this is work that we published a few years ago that Lucy Chai, up at the right, led.
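
    A minimal sketch of that ensembling idea. The classifier and the view generator are placeholders; a real pipeline would use a trained classifier and generative-model-based re-renderings of the input.

```python
# Classify several counterfactual views of the same image and average the predictions.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))   # stand-in classifier C

def generate_views(image, k=8):
    # Placeholder: a real pipeline would invert the image into latent space and
    # re-render it under different poses/lighting; here we just jitter the pixels.
    return [image + 0.05 * torch.randn_like(image) for _ in range(k)]

image = torch.rand(1, 3, 64, 64)
views = [image] + generate_views(image)
probs = torch.stack([classifier(v).softmax(dim=-1) for v in views]).mean(dim=0)
prediction = probs.argmax(dim=-1)   # ensemble prediction over original + counterfactual views
```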

    So what we showed in that project is that generative augmentation, so counterfactual augmentation of what could this cat look like under different poses and lighting conditions, can help. But it's at what we call test time. So we take our system. And when we're actually going to deploy it and use it, we're going to do this augmentation. But couldn't you do these same augmentations when you're actually training the underlying classifier or the underlying computer vision system?

    Now, we would call it pretraining. And in the era that we worked on that, which was the GAN era of generative models, it didn't really work. Those models just were not realistic enough. They had too many artifacts. The hallucinations were too unrealistic. So the cat looked pretty good, but it didn't work on harder scenes.

    But time goes on. These models just get bigger and better. Now, we're in the era of what are called diffusion models. And now, train time augmentation is starting to work. And I'll show a few examples of that on the next slides.

    So here's the basic paradigm which we started a few years ago, which is using generative models as a source of training data. And this is work led by Ali Jahanian and the other authors shown here. And we're just going to contrast between training a computer vision system on a data set of real images, like the ImageNet data set, versus training a computer vision system on a data set of synthetic images that are made by a generative model.

    And this cartoon is just meant to say the generative model is like this data-generating engine. It can produce an infinity of data, with underlying control knobs indicated in that circle with the z. The technical details I'm not going to get too much into, but think of it like this engine that can have this potentially endless creativity.

    So we're going to combine our synthetic data generation with a machine learning algorithm called contrastive learning. So here's how this algorithm works. This algorithm takes data. This is like the learner on that first slide. It takes data, and it tries to create a computer vision representation from that data.

    And the way it works is you take two crops of a photo, and you say these two crops of that tiger probably represent the same thing because they came from the same photo. So we will try to learn a representation in which two different crops of the same photo map to the same representation vector. We're trying to learn what they call a vector representation of an image. And it's done in that way.

    So two different views of the same object should map to the same representational vector. And normally, you do that by just taking a data point and then taking another data point, which is a sample from the same image, a crop from the same image, patch from the same image. And in contrastive learning, you say that a different image entirely that is not a crop from the tiger photo should map to a different representational vector.

    So we can combine these two things together, and we can say now we have our latent variables, which are a way of getting two different views of the same thing. So rather than just cropping patches from photos, which seems like a very crude way of finding two different views of the same thing, we can use a generative model to create two different views of the same thing. And that's what's shown here.
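
    Here is a minimal sketch of a contrastive (InfoNCE-style) loss in which the two views of each example come from a toy generator that shares the same latent "content" code, rather than from random crops. Every network here is an untrained stand-in.

```python
# Contrastive learning where the positive pair is two generative views of the same content.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # representation network

W = torch.randn(16, 3 * 64 * 64)                                     # fixed toy generator weights
G = lambda z: torch.tanh(z @ W).view(-1, 3, 64, 64)

def generate_pair(z):
    # Two samples that share the same underlying content (same z) but differ in
    # pose/lighting; simulated here by small latent perturbations.
    return G(z + 0.1 * torch.randn_like(z)), G(z + 0.1 * torch.randn_like(z))

def info_nce(a, b, temperature=0.1):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature        # similarity of every view-1 to every view-2
    targets = torch.arange(a.size(0))       # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

z = torch.randn(32, 16)                     # a batch of latent "content" codes
view1, view2 = generate_pair(z)
loss = info_nce(encoder(view1), encoder(view2))
```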

    So we take our generative model. That produces two different views of the same thing. But now, they're not just different crops. They're actually the cat potentially with different lighting, or pose, or semantic and physical changes like this.

    So one of the most popular contrastive learning algorithms for learning from real data is called SimCLR. That's shown on the left here. And that just takes crops. It also might drop color channels and do other types of simple augmentations. But with a generative model, we can do these more physical and semantic augmentations, like changing the dog's expression or the angle at which the photo is taken.

    And this paper just kind of set out that basic paradigm. And at the time, I do have to admit that the computer vision system trained on the real data was actually still doing a little better. But the synthetic data was doing almost as well, despite the fact that this was using the GAN-era models, which had all these artifacts. It wasn't quite the same dog. So there were problems with that. But there was some hope.

    So we've continued on that. And now, we just did the same thing recently with the latest generative models, diffusion models rather than GANs. And now, it actually is starting to get to the level of actually outperforming real data in some interesting ways. So I'm going to talk about this project called StableRep. It's going to be basically the same idea, except we just updated with the latest generative models.

    So here's the experiment that I'm going to describe. We're going to train a contrastive representation, a vision system, on either n real photos or n synthetic photos. So we're going to sample n real photos or generate n synthetic photos. And then we're going to train our system on that. So it's kind of an equal setting. On a per sample basis, which is better, a synthetic image or a real photo?

    So here's the first result. So on the y-axis is performance of our computer vision system measured on a standard benchmark. It's measured on the ImageNet benchmark. This is how good that computer vision system is at classifying objects and cats and dogs.

    And on the x-axis are a few different representation learning algorithms just to show that this approach works with different methods. And the green dots are if I train on n real photos. And the orange dots are if I train on n synthetic photos.

    And what you can see is that training on n synthetic photos is better in general. So on a per sample basis, one fake synthetic image is more valuable to your downstream system than one real image. So we're kind of in the plus plus regime. We're not in the minus minus regime in this project.

    And so we want to understand why that is. This is fake data. Shouldn't it be worse? But one key thing is that we actually intervened on one of those control knobs to make the data more useful to us. And what this plot is showing, on the x-axis, is how we set the control knob.

    So I'm not going to tell you the technical details of this control knob. It's a knob that changes the distribution that we're generating. It kind of trades off between diversity and realism. And we can tune that knob back and forth. We can set it to be low or set it to be high.

    And what this graph is showing-- the y-axis is accuracy-- is that, by tuning that knob, for some settings the synthetic data is worse than real data, and for other settings the synthetic data is better than real data.

    So the synthetic data is just data with more opportunity. And you can use that opportunity to make it more useful than real data or less useful than real data. But you have this control knob that can intervene to make the distribution different, and that's where the power comes in.

    So we had to tune that variable. And if you don't tune that variable, then, yes, you do get this data minus minus. You get synthetic data that isn't as good as the real data. But you can intervene and make it better. That's the trick.
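
    As one concrete, hedged example of such a knob (assuming the Hugging Face diffusers library and an example Stable Diffusion checkpoint, neither of which is specified in the talk): the classifier-free guidance scale is a knob of exactly this kind, trading off realism against diversity when generating synthetic training images.

```python
# Sweep the guidance-scale knob while generating synthetic images from a caption.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example checkpoint; swap in any text-to-image model
    torch_dtype=torch.float16,
).to("cuda")

caption = "a golden retriever catching a frisbee in a park"
for guidance_scale in [2.0, 4.0, 8.0]:            # low = more diverse, high = more prompt-faithful
    images = pipe(caption, guidance_scale=guidance_scale,
                  num_images_per_prompt=4).images
    for i, img in enumerate(images):
        # In a StableRep-style pipeline these become the synthetic training set.
        img.save(f"synthetic_g{guidance_scale}_{i}.png")
```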

    So far, what we showed in these projects is that synthetic data on a per sample basis can be more useful than real data. But there is a caveat, which is that these generative models are themselves trained on massive, massive real data. And the rules of the game so far are that we just are given a generative model already trained. That cost has been paid up already. We don't have to worry about that. What should we do?

    But what if we actually don't have a trained generative model? Now, we have to make the choice-- should we train a generative model, or just directly use our real data for the downstream task? Can a model of a data set x actually be more effective than directly using x itself, if we were making that choice?

    This is a harder question. This is still somewhat of an open question. But I think we're also now seeing some evidence that the answer can be yes to that question. So here's a follow up project to the StableRep project.

    So in the StableRep project, we compared these two paradigms. On the top, we have learning from data. You have a data set of image-text pairs: the text data is real captions written by humans, and the photo data is real photos downloaded from the internet. And you put those together, and you get one of these good computer vision systems.

    In the StableRep project, we actually used a text-to-image generative model. So we started with real captions, and we generated fake photos. And we got, on a per sample basis, better performance. But if we actually take into account the cost of having trained the original generative model, it was worse performance. So that was the caveat I mentioned. It's a little bit of a detail, but it is important.

    But in this project, we said, what if we replace the real captions that generated the fake images with fake captions generated by ChatGPT or a language model? So now, the entire process is synthetic. The language comes from a generative model, and the images come from a generative model. And putting those two things together, now we actually do see we're outperforming the state-of-the-art systems that are trained directly on real data.

    So just by a little bit here-- but I think this is the direction things are going. And if you talk to people at OpenAI, these big companies-- maybe some of you have done this at your companies-- I think synthetic data is part of a lot of the actual pipelines in production right now. For example, the DALL-E 3 system is a little bit old now, but it's a text-to-image generation system that was trained on synthetic captions. So it's not so different from this idea here.

    So let me show one reason why this may have happened. So we're looking at three different systems. I'm going to show three different ways of defining classes for training a computer vision system. So the most standard way to do it 10 years ago was to label photos with the object category.

    So in this case, all of these golden retrievers would be labeled golden retriever. And you would train your system to say these are golden retrievers. But if I have a text to image generative model, now I can create classes that are at a much more fine granularity. Because I'm not saying all golden retrievers are the same. I'm going to say all photos that have the same caption, that are generated by the same caption, are the same.

    So this is, again, something that's not easy to do with real data, because we don't have two different photos with the same text caption. But with a generative model, which goes from text caption to photo, it is easy to make infinite photos that all have the same text caption. I just run the generative model over and over again on those text captions.

    So because the generative model has these underlying text controls, it allows you to redefine the class granularity in this fashion. And that's actually what we did in that project. I didn't go through all the details, but that's actually what we did. And that's what allows us to get this other granularity that's finer grained than just using labels to define the classes.
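
    A minimal sketch of that re-definition of class granularity; the captions and the generate function are placeholders, with the generative model replaced by a stub.

```python
# Every caption gets its own class id; all images generated from it share the label.
captions = [
    "a golden retriever puppy asleep on a couch",
    "a golden retriever running on a beach at sunset",
    "a tabby cat sitting in a cardboard box",
]

def generate(caption, n):
    # Stand-in for a text-to-image model; returns n image placeholders.
    return [f"<image of: {caption} #{i}>" for i in range(n)]

dataset = []
for class_id, caption in enumerate(captions):
    for image in generate(caption, n=4):
        dataset.append((image, class_id))   # same caption -> same (fine-grained) class

# Coarser granularity would map both retriever captions to one "golden retriever" label;
# finer granularity (SimCLR-style) would give every single image its own class.
```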

    And in the SimCLR and the self-supervised stuff that I mentioned, which just uses two different crops from the same image, you're essentially defining the classes, your visual concepts, so that every image is its own class. And it's too fine grained.

    So this intermediate level of class granularity is the one that actually works best. And the numbers I'm showing here are just kind of an apples-to-apples comparison. If you define the classes in these three different ways and you run everything else the same, then defining them so that two images with the same caption are the same class-- that granularity-- does the best. And that's a granularity that can really only easily be achieved with generative models.

    So moving right along, generative models can be a good data source for training. They can be useful for doing counterfactual reasoning at test time. These are a few different things it can do. Here's one more, which is you can take a generative model of your data. And you can intervene on it toward human preferences.

    This is probably the most popular way of intervening on generative models. It's what led to ChatGPT. It was the big thing that ChatGPT did. It's related to this idea called reinforcement learning from human feedback.

    But let me show you what you can do with images intervening in this way. So what you can do is you can take a generative model of photos that has these underlying control knobs, and you can now take a model of human preferences. So we have a human that looks at these images and makes some judgment of whether or not they're beautiful or whether or not they've fairly depicted reality, whether or not they have demographic parity and how they're representing different occupations and genders and demographic properties like this.

    There's a lot of questions you could ask a human that you might want to tune toward. You might want to get rid of violent images. So we can ask a human.

    And in the project that I'm showing here, we did this a few years ago. And we actually were interested in making images that are more or less memorable. So it's kind of a weird question, but it's fun.

    So this photo is actually super memorable. You're all going to remember this after the talk. Because in our experiments from a long time ago, we found that this was the most memorable photo on the whole internet that we could find-- not the whole internet, but a section. So you're going to remember that photo, and a human will look at this and say it's memorable. We create a model of what a human would say, and then we'll just tune the knobs of an image generator to make an image that's super memorable.

    So we'll start with this photo here, and we're going to now tune the knobs in the direction that makes it more memorable according to our model of memorability. And what happens is the dog face becomes super zoomed in and kind of cute. And the eyes are bright. And if you tune it toward the knob that will make the image very forgettable, it has artifacts. The dog recedes into the background. It's kind of hard to make it out.

    So this is tuning. With generative data, you can actually change it, intervene toward human preferences or make it more memorable, make it less toxic, make it whatever you want. And this is the general paradigm.
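
    A minimal sketch of this kind of steering, assuming a differentiable generator and a learned preference scorer (both replaced by untrained stand-ins here): do gradient ascent on the latent knobs to maximize the predicted preference.

```python
# Steer a generator toward a preference model by gradient ascent on the latent z.
import torch
import torch.nn as nn

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())        # stand-in generator
memorability = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))   # stand-in preference model

z = torch.randn(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(100):
    image = G(z).view(1, 3, 64, 64)
    score = memorability(image).mean()
    loss = -score                 # maximize the predicted preference (flip the sign to minimize it)
    opt.zero_grad()
    loss.backward()
    opt.step()

steered_image = G(z.detach()).view(1, 3, 64, 64)   # the "more memorable" sample
```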

    There's a lot, a lot of work that does this in language models. The most famous example is ChatGPT. The difference between ChatGPT and the previous models was that they tuned it toward human preferences. And suddenly, it talked to you in a polite, friendly way. And it just was a lot more fun.

    So generative models provide data you can optimize. You can optimize over the generative process because you have continuous control variables. You can actually do gradient descent and backpropagation through those controls. That's the technical term. So generative models can provide better data in the sense of it being bigger and continuous, in the sense of it having control knobs, in the sense of it having optimizable controls that can be tuned toward human preference. In the last section, I want to talk about some applications to robotics.

    I'm going to switch to an entirely different type of generative model, which is called a neural radiance field. And the way these work is you take a set of photos of a room. This is a set of photos of one of the offices in Stata on the campus at MIT.

    And from a set of photos, just about 60 photos of this room, you get a 3D model that you can navigate through. And it looks photorealistic. So it's like a generative model in that you take photos and you create something that can sample more photos of that same place, but kind of fill in the gaps, interpolate between all of the missing images.

    So here's the generative model picture of a neural radiance field. And the way you can think of it here is it's also a model with control variables. But now, the control variables are camera controls. So you can input into this generative model. Where is the camera going to be? At what point do I put my camera? And what would I see if I put my camera there?

    So I'm going to put in the angle of the camera and the location of the camera and get out a picture of what the camera would have seen in that location. So as I tune and move around in this latent variable space, the control knobs, I end up rotating the camera to see what the image would look like from different angles. So this is another generative model that has taken the raw photographic data and added these control variables.
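
    For intuition, here is a heavily simplified sketch of the underlying idea: the scene is a learned function from 3D position and viewing direction to color and density, and the camera pose decides which rays get rendered. The field here is an untrained toy network, not a real NeRF implementation.

```python
# Toy "radiance field" and volume rendering of a single ray for a chosen camera pose.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))  # (xyz, dir) -> (rgb, sigma)

def render_ray(origin, direction, n_samples=32, near=0.5, far=3.0):
    t = torch.linspace(near, far, n_samples)                  # depths along the ray
    points = origin + t[:, None] * direction                  # 3D sample locations
    dirs = direction.expand(n_samples, 3)
    rgb_sigma = field(torch.cat([points, dirs], dim=-1))
    rgb, sigma = torch.sigmoid(rgb_sigma[:, :3]), torch.relu(rgb_sigma[:, 3])
    delta = (far - near) / n_samples
    alpha = 1 - torch.exp(-sigma * delta)                     # opacity per sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    return (trans[:, None] * alpha[:, None] * rgb).sum(dim=0) # composited pixel color

# "Turning the camera knob": pick a camera position and look direction, then render
# a pixel (a full image is just a grid of such rays).
camera_origin = torch.tensor([0.0, 0.0, -2.0])
ray_direction = torch.tensor([0.0, 0.0, 1.0])
pixel = render_ray(camera_origin, ray_direction)
```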

    And these control variables are actually quite powerful. You can not only change the location of the camera, but you can also change the camera's optics. So if any of you are photographers or work on imaging or even just hobbyists, you'll know that there's a lot of complicated optics within the lens system of a camera. And the model that we've fit to these images can synthesize what it would look like if I changed those lenses.

    So I can change those lenses to create what's called an orthographic projection of the scene. So an orthographic projection is equivalent to if I zoom in to infinity. If I take a telephoto lens and I zoom into infinity, I get an orthographic projection of the world. So it's not easy to achieve with actual real optics. But with synthetic optics, you just change these variables around. It's just an equation. It's really simple.

    So here's what a photo of some mugs looks like from top down with a regular camera. But with an orthographic camera, like a telephoto lens zoomed into infinity, it would look like that. And the model can hallucinate this. So the model can take my photos and revisualize them as if they were taken with this orthographic camera that can't actually exist in reality except in this ideal case of zooming into infinity.
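
    A tiny sketch of the geometric difference being exploited: a perspective camera divides by depth, an orthographic one does not, so equal 3D offsets stay equal in the orthographic image.

```python
# Perspective vs. orthographic projection of two points with the same lateral offset.
import torch

points_3d = torch.tensor([[0.2, 0.1, 1.0],    # a point near the camera
                          [0.2, 0.1, 2.0]])   # the same offset, but farther away

def perspective(p, focal=1.0):
    return focal * p[:, :2] / p[:, 2:3]        # divide by depth: distant points shrink

def orthographic(p):
    return p[:, :2]                            # depth is ignored: equal offsets stay equal

print(perspective(points_3d))    # the two projections differ
print(orthographic(points_3d))   # identical 2D coordinates regardless of depth
```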

    Why is that a good thing? Well, that's a good thing because, in robotics, orthographic projection makes things much easier. If we have a robot trying to manipulate some blocks on a table, perspective effects-- where the mug shapes and the block shapes get all distorted due to perspective-- make it very hard for the robot to reason about where things are.

    But orthographic projection is much simpler to work with. The math is just much simpler. For example, if I want to do operations on the imagery, you can often just do it in 2D, in the 2D plane. You don't have to reason about 3D. It can all be kind of done in 2D.

    So orthographic projection makes certain vision algorithms, especially for robotic applications, much easier. People like to use these close-to-orthographic cameras with robots, but there are physical limitations on what you can do. But with synthetic data, you can hallucinate what it would look like if it were orthographic. And that works quite well.

    So this was just a project-- I'm only going to show you the result here-- where the robot could pick up this floss and put it into these different containers. And it improved, despite all the transparency and reflections, because the orthographic synthetic data was kind of revisualizing the scene as if it were orthographic. And that made it an easier problem to solve.

    So here's just the perspective projection of that scene from the robot's point of view and the orthographic. And what you should see is that the orthographic shapes don't change as much. They undergo rigid transformation as opposed to distorting.

    So in the last few minutes, I'm just going to tell you the basic recipe I've presented and a few limitations. So the basic recipe that I think is quite powerful is, first, take your data, fit a model, a generative model, to your data. Second, sample more and better data from your model to get x prime. It's like data plus plus. And third, use x prime for your task rather than the original data.

    This recipe shows up in a lot of pipelines right now. And one way to think about it is, well, you're taking data, and you're making better data. So if you're going to send that data to a data processing system, then it's kind of safe. Anything you can do with the original data, you can do with this new sampled data, if your new sampled data is truly better than your original data.
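
    As a schematic only, here is the three-step recipe as a tiny pipeline; every function is a placeholder for the concrete models discussed in the talk.

```python
# Fit a model to x, sample and steer x' from it, then use x' for the task.
def fit_generative_model(real_data):
    """Step 1: fit a generative model G to the data x (stand-in)."""
    return lambda knobs: [f"sample({k})" for k in knobs]

def sample_better_data(G, knobs):
    """Step 2: sample x' from G, intervening on the knobs to steer it."""
    return G(knobs)

def solve_task(data):
    """Step 3: use x' for the downstream task instead of the original x."""
    return f"model trained on {len(data)} samples"

x = ["photo_1.jpg", "photo_2.jpg"]
G = fit_generative_model(x)
x_prime = sample_better_data(G, knobs=["pose=frontal", "lighting=soft"])
print(solve_task(x_prime))
```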

    But it's not always going to be the case that your generative data is actually better than your original data. We all know that, if you have a bad generative model, then it's going to have artifacts. And if you have applications where factuality really matters-- where you need to know who is the president today, and the model says the wrong person, and you're going to write some journalism article about it-- that's just a problem. So hallucination and factuality can matter sometimes. So we've just got a few minutes left.

    And another thing that's become kind of a popular and interesting critique is that, if you train models recursively on their own samples, you can potentially get drift and collapse of these models. So this is all to say that generative models can produce good data or bad data, but we have some tools for making them produce good data. And if you use those tools, then potentially you can get a benefit.

    So let's improve AI on the data side. Let's make generative models that produce better data than we started with, data plus plus. And I'll end there and have a few minutes for questions.

    [APPLAUSE]

    OK, great. So I think I can read the questions and go from there. So the first question is: the idea is great for images that are similar to the ones that are available on the internet, but how would this perform on specialized domains like X-rays, MRI, thermal images?

    Yeah, so you need to have a lot of training data for your generative model. And we have a lot of training data for random internet photos, but we don't necessarily have that for X-rays and MRIs. So if I used an off-the-shelf generative model as a data source for MRIs, well, that wouldn't be appropriate, because the off-the-shelf models just don't know what MRIs look like.

    So what I would recommend here is we need to make good generative models, good foundation models for X-rays and MRIs. And of course, there's legal and privacy and ethical constraints on this. But I imagine this could also be done. Hospitals do have a lot of data, and potentially we could make big data sets for that purpose.

    OK, next question, synthetic data bears the bias of the models that generated it. What can you do to control or prevent these biases? Won't these biases propagate to the end result? Yeah. This is a really important point.

    So data sets have bias. And we should always be careful and aware of that. And generative models are no exception. So a generative model is a system that produces data, and it might produce biased data. But it's not really fundamentally different in that sense from other data sets, which also are systems that can be used to sample data and potentially showcase biases that we don't want.

    But I think there's a nice opportunity here, which is that, because we have these knobs and we can intervene and change our synthetic data, we can potentially debias it. And there's actually a lot of interesting algorithms that do study this. They do try to take a generative model and remove social biases that we might not want or other types of biases.

    Let's say that the generative model has made photos, and it shows images of doctors. But we potentially want it to show occupations at demographic parity. That might be one application that we care about for an advertisement. We want to show that all different genders and races can be represented as doctors. Well, with real data, we might not actually have that distribution out there because historically there have been biases.

    But with synthetic data, we can manipulate those knobs and create a world which is different than reality. And that could be beneficial, or it could be something you don't want. So it's a trade off. Yeah.

    So what real life applications in sustainability and climate and life science do I anticipate in this line of research on synthetic data? Yeah, I think that's really interesting. I'm not really involved in that side of things too much. But I do have a collaborator, Sara Beery, who's another professor at MIT.

    We've been working together on synthetic data for trying to detect rare bird types. So if I have some unusual species of bird that I want to identify in the forest, I don't have enough real data to train on. But I could potentially use a generative model to create a lot of hallucinated images of this bird and then create a better detector for that rare bird type. So that's actually an application we're even working on now.

    GRAHAM RONG: Can I ask one more question?

    PHILLIP ISOLA: Yeah.

    GRAHAM RONG: It looks like Phillip just demonstrated or showcased how powerful the generative model is as a data source for AI systems. Can you also say some words about other applications of generative models?

    PHILLIP ISOLA: Yeah. So generative models can be a great data source for any data consuming algorithm, but maybe one other big class of data consuming algorithm is systems that are kind of doing model based control. So in robotics, we often will want to have a model of the world that we can reason over to decide what action will be optimal to take.

    Imagine you're controlling a car, and you want to have a model of where the other cars are on the street to know when you're going to turn your wheel. You have to have a model of the velocity of the car and a model of the friction on the ground. And there is this kind of up and coming class of generative models, which are called world models, which actually try to not only model text or 2D images, but model the entire physical dynamics of the world around you.

    And this kind of generative model is going to find a lot of application in robotics and control, where you actually need to know the physics. And then all of these ideas I'm talking about, like counterfactual reasoning, show up there as well. You say, what would happen if I steered my wheel really quickly to the left?

    And then you run your generative model. Traditionally, you run your classical physics simulator. But in the future, you'll run your generative physics engine. And it will predict what would happen. And then you decide, should I actually take that action? If I'm going to crash, I would not take that action if that's my prediction. And if I'm not going to crash, maybe I would. OK, thank you.

    [APPLAUSE]
