
Updating the State of the Art

[MUSIC PLAYING]
MARZYEH GHASSEMI: I'm Marzyeh Ghassemi. I'm an assistant professor in EECS and IMES here at MIT. And I did my PhD here. I went away and was a professor at the University of Toronto for a couple of years. And they invited me to come back, so I did.
I did a master's at Oxford. And when I was there, I was working on some wet lab research, and then also a little bit of prediction of whether people would develop asthma or not. And it was really exciting, because we found that there were a lot of environmental factors that modeled this ultimate outcome well. And then we got access to some ICU data from a hospital here in Boston, from the MIMIC data set, which was created at MIT.
And I remember, at the end of my master's, when we were trying to model all of these really difficult physiological concepts, like whether somebody will need an intervention, the person I was being advised by said, you know, you seem to be really interested in this. Maybe you should think about going to MIT for your PhD. And I did.
And the big thing that I worked on during my PhD was trying to understand whether we can use some of these very high capacity neural network models to understand when human physiology is going to break down or deteriorate, because usually in the hospital, that means you need a specific kind of intervention. Maybe you need a vasopressor or a ventilator. And maybe you need more experienced staff to be on hand.
And so we would like to know that ahead of time, so that you can have people there who know how to do specific interventions. And so this kind of research was very exciting. And at the end of my PhD, I was asked by one of my committee members, you know, you have these really interesting papers about predicting the need for different kinds of clinical interventions. Have you ever looked at how these different models perform across different kinds of patients?
And, for me, it was really eye-opening, because up until that point, I did what most people in machine learning do. I reported my performance in aggregate. We do this well on all of the students that we see, all of the patients that we see, all of the objects that we classify, all of the images that we try to look at, right? This is a general machine learning thing.
And when we looked at the stratification of how well the state of the art models were doing on different kinds of patients, they were doing significantly worse on minority and minoritized patients. And that's really problematic, because we already know that, in health, there are social biases that may play out and give people less access, or maybe have them experience worse or poorer care. That's really unfortunate.
But if that bleeds into our training of deep neural models, and then a recommendation is given with this sort of sheen of objectivity, it could propagate these biases even further. And so it's a really exciting space to be in, because doctors are overburdened. There's a lot of information. There's a lot we don't know in medicine that machine learning models could really help us with.
They're these superhuman pattern recognizers. But it's also disturbing sometimes, because we don't want them to follow all the humans all the time. Sometimes we really want them to do something better than we do. And in those cases, we have to know when they should follow the data that we have, versus maybe ignore some bad examples that we're giving.
[MUSIC PLAYING]
[MUSIC PLAYING]
MARZYEH GHASSEMI: One really good example of when you want to modify a machine learning protocol is a recent paper we did called "Medical Dead Ends." The idea here is that in a standard kind of machine learning called reinforcement learning, which is really, really powerful for learning games or self-driving cars, what you do is you have many, many example demonstrations of something you want a model to learn: many chess games, many Go games, many poker games, or many examples of people driving cars.
And then you say: you, as a model, I want you to learn how to do this, because you have a lot of expert, or maybe non-expert, demonstrations of how to do it well. The problem in medicine is we're in a really different setting. When you train a self-driving car, you can then let that model go off into a parking lot and try to drive the car for a little bit. Or if we're talking about two chess models, they can just play each other as many times as they want. And it doesn't matter if they do really poorly initially or lose a lot of games.
But if we train the model using expert demonstrations of how to treat patients in the hospital, and then it went off and tried different things and killed some patients along this learning process, we would not be very happy. So this is a very different situation. We're in an offline setting where we can't really support the machine trying to learn online in practice.
And so we have a different way of thinking about the kinds of recommendations that a model can make in those settings. And we call this "Medical Dead Ends." What that means is, instead of trying to figure out what kind of recommendation a model should give to a doctor at a specific point in time for different kinds of care, we learn what actions seem to lead to bad outcomes in past iterations of clinical care, like ones for this specific patient.
And by flipping the paradigm, and saying we have two different neural models looking at the risk of certain treatments and the reward of the same treatments in the different patients we've observed over time, we're able to give a good prediction to a doctor as they're looking at this sort of sequence of operations in real time for a patient, of which treatments maybe they should avoid because they seem to carry more risk than reward.
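To make that concrete, here is a minimal sketch of the two-network idea just described: one network scoring the risk that a treatment moves a patient toward a bad outcome, and one scoring its chance of a good outcome, with treatments flagged for avoidance when risk is high and reward is low. The class and function names (`TreatmentValueNet`, `treatments_to_avoid`), the network sizes, and the thresholds are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the flipped paradigm described above; names, sizes, and
# thresholds are illustrative, not the paper's actual implementation.
import torch
import torch.nn as nn

STATE_DIM, N_TREATMENTS = 32, 8  # hypothetical patient-state size and action set


class TreatmentValueNet(nn.Module):
    """Maps a patient state to a per-treatment score in [0, 1]."""

    def __init__(self, state_dim: int, n_treatments: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_treatments), nn.Sigmoid(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


# In practice both networks would be fit offline from recorded ICU trajectories.
risk_net = TreatmentValueNet(STATE_DIM, N_TREATMENTS)
reward_net = TreatmentValueNet(STATE_DIM, N_TREATMENTS)


def treatments_to_avoid(state, risk_threshold=0.8, reward_threshold=0.2):
    """Flag treatments whose estimated risk is high and estimated reward is low."""
    with torch.no_grad():
        risk, reward = risk_net(state), reward_net(state)
    flagged = (risk > risk_threshold) & (reward < reward_threshold)
    return flagged.nonzero(as_tuple=True)[-1]  # indices of treatments to steer away from


patient_state = torch.randn(1, STATE_DIM)  # placeholder for real patient features
print(treatments_to_avoid(patient_state))
```

The key design point is that the model never proposes a treatment; it only narrows the space by warning against options whose observed outcomes in past care were bad, which matches the offline constraint described above.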
[MUSIC PLAYING]
[MUSIC PLAYING]
MARZYEH GHASSEMI: The reason we were interested in focusing on injecting fairness into a machine learning model is this experience that I had at the end of my PhD, where we saw that many of these different models have biases. The problem with thinking about bias in machine learning models in a health care setting, but then also in other settings as well, where we acquire data from a natural environment, is that usually you can't really fix the underlying data bias.
We just have some groups that we're going to sample less from, potentially because there are fewer people from that population. There are always going to be minority populations in any data set. And so we don't want to be in a situation where we say, just because this kind of data has a minority population, we know we're not going to work as well on that kind of data. And so there's a really popular kind of learning that we do for images, called deep metric learning.
And what it does is take a bunch of different kinds of images and say, I don't really know what sort of underlying representation I should have to say that these two images are similar. I probably don't want to just look pixel by pixel and subtract them. Let's make a deep neural network learn a representation in which two pictures of dogs embed to a very similar space, and two pictures of cats embed to another space that's also very similar.
And so this kind of learning, this metric learning, is really cool for images, because it means that we can understand how any two images might relate to one another, so that at retrieval time, if I see a brand new image I've never seen before, I can immediately see where other images like it would have been mapped, and then give it a label that makes sense, like dog or cat. The problem is that if you use standard deep metric learning techniques that are still state of the art, and you have a minority class, meaning we don't have many labradoodles in the data set, that part of the embedding space doesn't get learned very well.
And so what that means in a very human setting is if you have significantly fewer examples of Black faces than white faces, we're going to learn a less rich representation of Black faces. And so we'll have an embedding space that embeds most Black faces to the exact same place, even though they're not the same person, just because there are so few examples of Black faces.
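As an illustration of the metric learning idea just described, here is a small sketch using a triplet loss, where two images of the same class are pulled together in the embedding space and an image of a different class is pushed away. The `EmbeddingNet` architecture, margin, and image sizes are placeholder assumptions, not the specific models discussed here.

```python
# A small sketch of deep metric learning with a triplet loss; the architecture
# and hyperparameters are placeholders.
import torch
import torch.nn as nn


class EmbeddingNet(nn.Module):
    """Maps an image to a point in a learned embedding space."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances are comparable across the space
        return nn.functional.normalize(self.backbone(x), dim=-1)


model = EmbeddingNet()
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# Random tensors stand in for real image batches: anchor and positive share a
# class (two dogs), negative comes from a different class (a cat).
anchor, positive, negative = (torch.randn(4, 3, 64, 64) for _ in range(3))
loss = triplet_loss(model(anchor), model(positive), model(negative))
loss.backward()

# At retrieval time, a new image is embedded and labeled by its nearest neighbors.
```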
We didn't want to have that happen in general. And if you just train a vanilla model that's trying to do retrieval, it will, by default. And so we're trying to de-correlate these attributes and give some independence, so that you're learning a representation that forces richness for different kinds of people. And that's really important in these kinds of applications, because often we think just throw data at the problem and learn whatever you can learn.
But here, in the PARADE paper, the goal is to say we know something extra about this data. We know that, for example, there are fewer examples of dark skin. And we want to make sure that that's input we give to the model, so that it can learn to partially de-correlate the attribute and then have a richer representation. And that means that when we try to retrieve a new face or classify an unseen face, we'll have better performance for minority groups.
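One simplified way to picture that de-correlation idea, not the paper's exact objective, is to add a penalty to the metric learning loss that discourages the embedding dimensions from tracking a sensitive attribute such as skin tone, so the representation cannot collapse a minority group into one region of the space. The helper name and the weighting below are hypothetical.

```python
# A simplified stand-in for the de-correlation idea, not the paper's exact objective.
import torch


def attribute_correlation_penalty(embeddings: torch.Tensor, sensitive: torch.Tensor) -> torch.Tensor:
    """Approximate mean squared correlation between each embedding dimension
    and a binary, float-coded sensitive attribute."""
    e = embeddings - embeddings.mean(dim=0, keepdim=True)  # center embeddings
    a = (sensitive - sensitive.mean()).unsqueeze(1)        # center attribute
    cov = (e * a).mean(dim=0)                              # per-dimension covariance
    corr = cov / (e.std(dim=0) * a.std() + 1e-8)           # approximate correlation
    return corr.pow(2).mean()


# Hypothetical usage alongside the metric-learning loss:
z = torch.randn(32, 128)                # batch of face embeddings
s = torch.randint(0, 2, (32,)).float()  # stand-in sensitive attribute labels
print(attribute_correlation_penalty(z, s))
# total_loss = metric_loss + 0.1 * attribute_correlation_penalty(z, s)
```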
[MUSIC PLAYING]
[MUSIC PLAYING]
MARZYEH GHASSEMI: One thing that has become a large issue in machine learning and health is the idea of trade-offs in performance, and specifically the trade-offs between utility, privacy, and fairness. And so differential privacy is a state of the art technique that's used in other spaces, in imaging and in text, to ensure that no data point that's too unique pulls a classification around too much.
And what I mean by that is if I'm thinking about large tech companies having lots of text data, right now the state of the art is to use differentially private predictions, so that if I always type duck duck stone instead of duck duck go, it doesn't recommend to me the next time I type duck that maybe the next word is stone. And that's a funny example, if it's duck and stone.
But if it's a sensitive condition or word, maybe for health about abortion or HIV status, we wouldn't want personal predictions that are very unique to affect our other text examples. And so in this setting, differential privacy makes a lot of sense. But we wrote a paper about the interaction between differential privacy and fairness and utility. And what we found is that if you add differential privacy to standard prediction tasks that we do in the hospital, things like predicting who will die or who might need a different intervention, we do really, really poorly.
And this kind of utility drop would make the model not usable in a medical setting. So doctors would not be able to use a model that performs at this rate when you add the differential privacy. It probably also doesn't make sense in a medical setting to have this really intense differential privacy construct, because doctors and nurses who are using the outputs of this model already have access to a lot of your health information.
We also found, worryingly, that the patients who seem to be too unique, who have more of their data redacted and less of an influence on the final performance, are minority patients. And so specifically in the paper we found it was Black patients that were having most of their influence in the classification problem removed. And that's not something that we would want to see in practice.
And so this is another example of how you can't just throw state of the art machinery at a health problem blindly. You can't naively try to apply these really large levers that we have. There are going to be trade-offs in any system.
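For reference, differentially private training of neural networks typically relies on a mechanism like DP-SGD: clip each example's gradient so no single patient can pull the model too far, then add Gaussian noise before the update, which is exactly the influence-limiting behavior discussed above. Below is a minimal from-scratch sketch of that mechanism; the model, loss, and hyperparameters are illustrative, and a real system would also track the cumulative privacy budget (epsilon).

```python
# A from-scratch sketch of the DP-SGD mechanism; everything here is illustrative.
import torch
import torch.nn as nn

model = nn.Linear(20, 1)  # stand-in for a mortality/intervention predictor
loss_fn = nn.BCEWithLogitsLoss()
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.05


def dp_sgd_step(x_batch: torch.Tensor, y_batch: torch.Tensor) -> None:
    """One DP-SGD update: per-example gradient clipping plus Gaussian noise."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(x_batch, y_batch):  # per-example gradients
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (norm.item() + 1e-12))  # bound each example's influence
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / len(x_batch)  # noisy averaged update


x = torch.randn(16, 20)
y = torch.randint(0, 2, (16, 1)).float()
dp_sgd_step(x, y)
```

The clipping step is what removes the influence of very unique records, which is the same mechanism that, in the paper's experiments, disproportionately removed the influence of minority patients.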
[MUSIC PLAYING]
[MUSIC PLAYING]
MARZYEH GHASSEMI: I have one paper called "Do as AI Say," which is a really fun paper. The goal of this paper is to ask, what happens when we show radiologists incorrect advice, and we say it's from a doctor, or we say it's from an AI? And it's a really interesting paper. When you take experienced radiologists, so people who do chest X-ray review for a living, and you give them incorrect advice some of the time and correct advice some of the time, but you say it's from a human, they rate it as higher quality than if you give them the same mixture of good and bad advice and say it's from an AI.
But even though they say that they think it's lower quality advice, they're fooled by it just as often. So they're just as susceptible to this poor advice. And they often end up following it and misdiagnosing somebody, because they've anchored to this bad advice. That's another thing you have to be really careful about in this space, because even if you had a perfect model, perfect for whatever sense of training we have, it'll be wrong occasionally, right? Even a really, really high capacity model is going to occasionally be wrong.
And we don't want to deploy something that is so convincing that even when it's wrong, it convinces you to listen to it. And that's this middle space that we have now in my group, where we're looking at let's train the best model we can, let's audit it for fairness, privacy and robustness, but now I have a model that's been deployed. It's going to be wrong sometimes.
How do I give the advice to somebody such that when it's right, you want to listen to it, but when it's wrong, it's not so convincing that it fools you? That's a really difficult challenge. So we're trying to engage with different experts in human-computer interaction, and make our model objective not "how do I get the best performance for this model?"
but "how do I make a model that leads to the best performance in the human that's going to use it?" And that's a much harder problem in general. So what we did for the medical checklist is, let's say that you're training a deep neural network. It has access to your entire electronic health record. It trains this really high capacity model, and now you have some predictions.
Well, how are you going to integrate that in hundreds of hospitals? It's really hard to do because hospitals have different backend information systems. How are you going to display the advice in a good way? What kind of predictions do you want? And so one way you could think about doing things is let's still train a machine learning model that's going to have this intelligent prediction. But let's get a result that we could deploy really easily, a checklist.
And checklists are everywhere in medicine. We use them all the time, right? So we have medical checklists that are for everything from diagnoses to readiness for surgery. And the reason that they're really popular in medicine is they're easy to use, easy to verify, easy to deploy. You could just print them on a piece of paper.
But the way that we create checklists now in medicine is you assemble a team of experts. They all get together, and then they have to come to a consensus about what they want in the checklist. And that takes a lot of time and it's very challenging to do. But also there's been some work recently finding that even once you make this expert-driven checklist, it can come out a decade later, two decades later, that it's really biased against certain subgroups, and it just doesn't work well for Black Americans or for women, because it wasn't taken into account that there are these other conditions or other social issues that contribute to having a specific item checked or not.
So what we did for this paper is say, well, why don't we try to make an optimally predictive checklist, because that's the end result that somebody might want to use. And so we showed that if you use our mixed integer programming approach and just optimize for the best checklist to predict something important, like whether somebody needs a specific treatment in the intensive care unit, you do really well. And it's state of the art performance that hospitals have told us they would be interested in trialing for an actual intervention, because instead of saying a machine learning model has a specific number it predicted, we have a result that says if these three items are checked, then you need to have a family consult, because this treatment might not work for your family member.
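As a rough sketch, and with notation of my own rather than the paper's exact formulation, an optimal predictive checklist can be posed as a mixed integer program: choose which binary items to include and a threshold M, predict that the patient needs the treatment when at least M included items are checked, and minimize the number of training errors.

```latex
\begin{aligned}
\min_{\lambda,\,M,\,z}\quad & \sum_i z_i && \text{(misclassified patients)}\\
\text{s.t.}\quad & \sum_j \lambda_j x_{ij} \;\ge\; M - B\,z_i && \text{for patients with } y_i = 1,\\
& \sum_j \lambda_j x_{ij} \;\le\; M - 1 + B\,z_i && \text{for patients with } y_i = 0,\\
& \sum_j \lambda_j \;\le\; N_{\max},\qquad \lambda_j, z_i \in \{0,1\},\quad M \in \{1,\dots,N_{\max}\}.
\end{aligned}
```

Here x_ij in {0,1} says whether item j is true for patient i, the binary variable lambda_j selects item j for the checklist, z_i marks a training error, B is a big-M constant (for example, the number of candidate items), and N_max caps the checklist length.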
The interesting thing about this paper, too, is we looked at whether you could induce fairness in these checklists. And we found some really strange things. There have been all these papers that have found that if you include race in a checklist, it can either over or underestimate risk for different kinds of conditions, all over the spectrum of medicine. We found that there are proxy variables in health care data. Even though we didn't include race or gender directly, our model was performing really poorly across groups.
The maximum gap we were getting between white men and Black women was really high, even though the checklist didn't use gender or ethnicity. And it's because there are these proxies, like height and weight. Men generally have higher heights and weights. That's a proxy for gender.
There are also proxies like insurance type, which in the United States is a proxy for ethnicity. And so using these kinds of variables, you might actually have a checklist that looks very fair. It's not using ethnicity. It's not using gender. But it's still going to have differential performance.
And so we have constraints that we can add to this checklist optimization. So we say, give me the best checklist to predict this medical condition, but you can have no more than this amount of gap between any two subgroups of patients in the data.
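One way such a gap constraint could be written, again as an illustration rather than the paper's exact constraint, is to bound the difference in error rates the program is allowed to accept between any two subgroups (say, white men and Black women), using the same error variables z_i as in the sketch above.

```latex
\left|\,\frac{1}{|G_g|}\sum_{i \in G_g} z_i \;-\; \frac{1}{|G_{g'}|}\sum_{i \in G_{g'}} z_i\,\right| \;\le\; \epsilon
\qquad \text{for all subgroup pairs } g \ne g',
```

where G_g is the set of training patients in subgroup g and epsilon is the maximum tolerated gap; the absolute value splits into two linear inequalities, so the problem stays a mixed integer program, and the same idea could be applied to false negative rates only.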
And I think that's really exciting, because it means that we could probably get around a lot of the existing issues that we have with checklist creation, or even help doctors verify checklists that they want to create right now with expert panels.
[MUSIC PLAYING]