
Using Computational Tools for Molecule Discovery

-
Video details
Assistant Professor of Chemical Engineering
-
Interactive transcript
[MUSIC PLAYING]
CONNOR COLEY: I'm Connor Coley. I'm a new assistant professor in the Department of Chemical Engineering. And the focus of my research is to bring new computational tools to accelerate the discovery of new molecules and materials. One of the projects that we're working on at the moment is applying Bayesian optimization to the selection of optimal molecules. So trying to discover a new functional molecule is a very challenging process partially because there are so many molecules to choose from. So there are an estimated 10 to the 20 or 10 to the 60 small molecules that are biologically relevant depending on who you ask. And of those we've only really synthesized and tested a very small fraction, maybe 10 to the ninth.
And so this is a very complex search process for which we need some tools to help us identify most efficiently what the best molecules are for a given task. And so the group has been working on computational techniques to learn to correlate molecular structures with their function to speed up this process. You can imagine that if we have a way of simulating the properties of a molecule computationally or a way of running an experiment to test the properties, that can be quite expensive, especially if we're thinking about testing hundreds, thousands, or even millions of different molecular structures.
One of the techniques that we've been applying recently is model guided optimization where we train machine learning models to essentially replace these expensive evaluations. And so if we train a machine learning model to recapitulate what an experimental assay would find out or what an expensive simulation would find out, we can use that to screen a large number of molecules in a much more efficient amount of time.
And by going through iterative cycles of testing molecules, training machine learning models, and then predicting which molecules should be the most performant, we've managed to speed up the identification of high-performing molecules by orders of magnitude. So the applications we're interested in sort of span many different disciplines, from drug discovery to materials science. And what we're trying to do is essentially figure out which molecules are going to have the best biological activities, the best physical properties, maybe bioavailability. There are so many different aspects of a molecule's function that influence whether it'll be a good drug candidate, for example.
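As a rough sketch of that iterative loop (test, retrain, predict), here is a toy version in Python; the scalar "molecules", the scoring function, and the nearest-neighbor surrogate are all hypothetical stand-ins, not the group's actual models:

```python
import random

random.seed(0)

def expensive_assay(x):
    """Stand-in for a costly experiment or simulation on 'molecule' x."""
    return -(x - 0.73) ** 2  # hypothetical property, best near x = 0.73

# Candidate "molecules" encoded as scalars purely for illustration.
candidates = [i / 999 for i in range(1000)]

tested = {}  # candidate -> measured property

def surrogate(x):
    """Cheap model standing in for the ML model: predict the property
    of x from the nearest already-tested neighbor."""
    nearest = min(tested, key=lambda t: abs(t - x))
    return tested[nearest]

# Seed with a few random measurements, then iterate:
# test -> "retrain" (update `tested`) -> pick the predicted-best untested candidate.
for x in random.sample(candidates, 5):
    tested[x] = expensive_assay(x)

for _ in range(20):
    untested = [x for x in candidates if x not in tested]
    pick = max(untested, key=surrogate)   # greedy acquisition over the pool
    tested[pick] = expensive_assay(pick)  # one expensive evaluation per round

best = max(tested, key=tested.get)
```

The point of the sketch is the budget: only 25 expensive evaluations are spent, while the surrogate screens the whole pool of 1,000 candidates for free each round.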
We're relying on these computational tools to help us understand the function of these molecules across a number of different dimensions. And the different applications I talked about-- so drug discovery or materials science or agriculture even-- they're very similar in how you approach discovering molecules. So the process one undertakes looks quite similar for these different applications, even if the details of what makes a good molecule change between them.
So a lot of the techniques that we're developing are trying to make the process of discovering new drugs or new drug candidates faster and cheaper. So we have access to a lot of these computational tools and simulation tools that help us predict molecular properties. There are techniques like docking and molecular dynamics. But efficiently using these tools is still a challenge, again, because we are dealing with such a large search space of candidate structures.
And so these new tools are going to help us accelerate the rate at which we find high-performing compounds and reduce the computational cost. And so overall, that's hopefully going to shorten the development timelines when bringing a drug from an initial early stage all the way through to clinical trials, really shortening and accelerating that part of the drug discovery process.
CONNOR COLEY: So this model-guided optimization relies on the use and availability of models that can correlate molecular structure with their function. This has been a rapidly growing area in machine learning as applied to chemistry. But there are still some significant challenges in how we think about building these models.
So molecules, unlike images, are not inherently numerical. We don't typically describe a molecular structure, which is this very complex shape, in terms of the simple numerical vectors and matrices that machine learning tools are often designed to take as input. And so we have to be very thoughtful about how we represent molecular structures, so that these machine learning models can learn to infer patterns in how structure relates to function.
So we're working on some new techniques for how you actually analyze the molecular structure. And some of these are based on the ideas of graph neural networks, considering molecular structures as graphs where atoms are nodes in the graph and bonds are edges in the graph. But we're also thinking beyond that, because molecules are these very complex three-dimensional structures that are flexible, and they change their shape and orientation depending on their environment and the temperature. So we're working on strategies to take that into account, essentially, and just to have a more nuanced understanding of what it means to have a molecular structure, and how best to capture that as inputs to these models.
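The graph view described above, atoms as nodes and bonds as edges, can be sketched in plain Python; the molecule, the feature encoding, and the single averaging-style message-passing round here are illustrative stand-ins for a real graph neural network:

```python
# Toy molecular graph for ethanol (CCO), hydrogens omitted: atoms are
# nodes carrying a feature vector, bonds are edges.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Hypothetical initial node features: [is_carbon, is_oxygen].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

# Adjacency list built from the bond list.
neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

def message_pass(feats):
    """One round of message passing: each atom averages its own feature
    vector with those of its bonded neighbors."""
    out = {}
    for i in feats:
        msgs = [feats[j] for j in neighbors[i]] + [feats[i]]
        out[i] = [sum(col) / len(msgs) for col in zip(*msgs)]
    return out

updated = message_pass(features)
# Whole-molecule "readout": sum the updated atom features.
readout = [sum(col) for col in zip(*updated.values())]
```

A real model would use learned weights rather than plain averaging, and would stack several such rounds, but the structure (neighbor aggregation, then a readout over atoms) is the same.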
One of the bottlenecks to advancing these tools and methods is having good test cases and applications on which we can benchmark our performance. So there are some specific challenges to address involving the stereochemistry of molecules: two molecules that are mirror images of each other will have different functions in many environments, including in the human body. When we develop new computational techniques that try to take that into account, we need tasks, specific goals, to measure our progress in developing new methods.
And so we're also working on developing what we call synthetic benchmarks: coming up with data that mirrors real experimental data, but is defined in a more controlled environment, so that we can rapidly iterate on these new models and methods until they're mature enough to apply to applications like drug discovery.
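One way to picture a synthetic benchmark of this kind, with hypothetical numbers throughout: a known ground-truth property plus noise, so that any proposed optimization method can be scored against the known optimum:

```python
import math
import random

def make_synthetic_benchmark(noise=0.05, seed=1):
    """Toy stand-in for an experimental assay: a known ground-truth
    response plus Gaussian noise, so a method can be scored against
    the known optimum. All numbers are illustrative."""
    rng = random.Random(seed)
    optimum = 0.3  # known to the benchmark, hidden from the method

    def oracle(x):
        true_value = math.exp(-((x - optimum) ** 2) / 0.02)
        return true_value + rng.gauss(0.0, noise)

    def score(best_x_found):
        return abs(best_x_found - optimum)  # distance-to-optimum metric

    return oracle, score

oracle, score = make_synthetic_benchmark()

# Baseline method under test: plain random search with 50 queries.
search_rng = random.Random(2)
guesses = [search_rng.random() for _ in range(50)]
best = max(guesses, key=oracle)
```

Because the ground truth is known, any new method can be compared against this random-search baseline on exactly the same footing, which is the point of a controlled benchmark.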
Another project that we're working on is the generation of new molecular structures. This is a rapidly growing field that only started a few years ago. And it essentially inverts the typical paradigm of how we discover new molecules. So typically, we take machine learning models, or property prediction models, query them with a molecule of interest, and they tell us what they believe the properties to be. They predict or simulate what the properties are.
But recently, there have been these new inverse design models, or generative models, that take a model trained to predict a property and essentially let it directly generate a new molecular structure. And so it can create a new molecule that we haven't queried it with explicitly, one that's predicted to have some sort of optimal property profile.
And so this notion of inverse design is a really exciting alternative to screening, which is the typical paradigm, because we can essentially ask the model to come up with new ideas for us. These models are coming up with new hypotheses, new structures of molecules that are predicted to have a good set of properties.
Now, these models aren't perfect right now. There have been dozens, or even hundreds, of studies in the past few years. But there are a number of pretty significant limitations that we're still trying to address. So in particular, these types of models that generate new molecular structures tend to be very data-inefficient. It might take them hundreds, thousands, or even hundreds of thousands of guesses before they find a molecule that actually does have the properties we want it to have.
And if you think about evaluating molecules with experiments, we don't want to run hundreds of thousands of experiments on new molecules. If you think about a typical drug discovery campaign for a small molecule, it could involve in total synthesizing hundreds to thousands of different structures. And so we have these models which can identify optimal structures by proposing new molecules, getting information about their performance, and iterating in a closed loop fashion.
But it takes them far too long to get to the right answer. And so one of the directions of the group has been to try to accelerate that process by making these models much more efficient in terms of the number of evaluations required to identify these good performing structures.
CONNOR COLEY: The types of molecules that these models right now generate-- so these are deep-learning models that try to learn patterns of molecular structures and understand what molecules typically look like, and then to understand what molecules that achieve a certain property typically look like. But they have their flaws because they don't fully understand what it takes to actually produce those molecules.
And if you think about how useful these might be to a chemist who's actually trying to experimentally test what's being proposed, the structures that are generated by these deep-learning models might look very unstable. They might be very difficult to synthesize. They might be very expensive, even if they're able to be purchased. And that limits their practical utility for real applications in drug discovery.
Our solution that we're just starting to work on involves constraining the generation of these molecules to abide by the rules of synthetic chemistry as we know them. So we're trying to change how we do the prediction, to not just predict new molecular structures, but predict new synthetic pathways, new recipes for making those molecular structures. So that when we have these models propose hypotheses and suggestions for chemists to test, they're actionable suggestions. They're suggestions where we know exactly what to do when we get to the laboratory.
One of the opportunities that this technology offers is to reduce the reliance on human intuition for a lot of discovery tasks. Currently it takes a lot of expertise to propose these new structures and to efficiently guide the drug discovery process, or to propose a new functional material that ends up being validated in the laboratory. And so if we can use these models to formalize and even codify that process of how we approach the discovery of these new structures, then we're no longer going to be limited by time availability, essentially. And we can free experts in the field to think about higher-level tasks.
This lets us scale out the discovery process much more efficiently. It also brings us one step closer to automating the full system, so automating the full process of discovering new structures. Part of what these models rely on and part of what we need to still improve is their ability to understand how experimental processes work. If we're asking a model to help us plan the synthetic routes to a new chemical structure, it needs to have a good understanding of how molecules react and how different reactant structures can be brought together to form a product.
The closer these models come to predicting what physically happens, the more useful they will be. Right now, it's very hard to predict the outcome of an experiment. So even if these models believe that their suggestion is going to work, if they don't have an accurate model of the world, if they don't understand what happens when you mix two chemicals together or heat those chemicals, then they're not going to provide very accurate suggestions. And so we want to create these computational environments where the model has a good understanding of the behavior of the world. And it can quality-check, essentially, what it proposes to chemists, because it understands what would happen if you actually tried this physically, not just in silico.
One of the other aspects that makes the use of these generative models challenging is that when we think about what makes a drug a good drug, or what makes a material a good material, it's usually not just about one property. Certainly for a drug, it's not just about bioactivity, although that's an important component. There are many other factors that influence whether it's going to be successful in clinical trials or whether it's going to be successful commercially.
And that connects to the multi-objective nature of these kinds of discovery tasks. It's never just about one property. And trying to optimize multiple properties simultaneously can be a challenge for some of these algorithms right now. There are computational techniques that try to balance these multiple objectives and these competing objectives, but understanding the trade-offs and importance of them in advance is still an open question.
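A common way to expose those trade-offs is a Pareto front: the set of candidates that no other candidate beats on every objective at once. A minimal sketch, with made-up property scores:

```python
def pareto_front(candidates):
    """Return names of candidates that no other candidate dominates,
    i.e. beats or matches on every objective (all objectives maximized)."""
    front = []
    for name, objs in candidates:
        dominated = any(
            other != objs and all(o >= s for s, o in zip(objs, other))
            for _, other in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical scores: (bioactivity, bioavailability), higher is better.
mols = [
    ("mol_A", (0.9, 0.2)),
    ("mol_B", (0.6, 0.7)),
    ("mol_C", (0.5, 0.5)),  # beaten on both counts by mol_B
    ("mol_D", (0.2, 0.9)),
]
print(pareto_front(mols))  # ['mol_A', 'mol_B', 'mol_D']
```

The surviving candidates are the genuine trade-offs; choosing among them still requires the kind of judgment about relative importance that the speaker notes is an open question.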
CONNOR COLEY: So a relatively new effort for the group is also focusing on the automation of multi-step chemical syntheses and the automation of chemical reaction screening in very small scale. So the group has worked on a lot of computational techniques to design synthetic pathways, to predict the outcomes of organic reactions, to recommend the conditions with which we should try to run organic reactions.
But these are all coming from the literature. So these are machine learning models largely trained on data we get from the literature. And we, of course, don't have direct control over what's published in the literature; we're working with this historical data. And so a really exciting direction that we're taking now is to try to couple this literature data, which is very rich and captures, to some degree, the past 100 years of chemistry, with our own automation platform to generate new data in a very focused and high-throughput fashion.
You can imagine a situation in which there is a new chemical reaction-- so a new way of combining molecules in a very specific way that lets us access a new type of structure, which is found to have good properties. Maybe it's a new class of antibiotics. But we might not have a full understanding of the limits of that transformation. We might not know if it's robust enough. And what we would like to do is take this sort of baseline understanding of the chemical reaction. So maybe we've seen two papers published on this chemical reaction. We don't really know how well it works. We really don't know what types of structures it's compatible with.
But we'd like to find out. And so we'd like to take this sort of seed of an understanding that we have from the literature and combine that with our own data generation to, again, very rapidly and efficiently learn what makes this reaction work and what its scope is, the extent to which it applies to new structures. And there's this idea of a feedback loop now between data we find in the literature and data we generate experimentally. And we're thinking about ways to design computational tools that can merge the two efficiently: in the fewest number of experiments, how do we most rapidly increase our knowledge of chemical reactivity?
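One simple way to choose the next experiment in such a loop, sketched here with hypothetical substrates and yields, is to query wherever an ensemble of models disagrees the most:

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical substrates for a newly reported reaction. The "literature"
# gives us yield estimates for only two of them; the rest are unknown.
substrates = ["s1", "s2", "s3", "s4", "s5"]
known_yields = {"s1": 0.8, "s4": 0.1}  # illustrative, from two papers

def predict_with_uncertainty(substrate):
    """Toy ensemble of 10 'models': tight agreement on substrates we
    have data for, wide disagreement on the ones we do not."""
    if substrate in known_yields:
        base, spread = known_yields[substrate], 0.02
    else:
        base, spread = statistics.mean(known_yields.values()), 0.3
    preds = [base + rng.gauss(0, spread) for _ in range(10)]
    return statistics.mean(preds), statistics.stdev(preds)

def next_experiment():
    """Run the experiment where the ensemble is least certain."""
    untested = [s for s in substrates if s not in known_yields]
    return max(untested, key=lambda s: predict_with_uncertainty(s)[1])

pick = next_experiment()
```

The literature seed sets the prior, each new measurement shrinks the disagreement for that substrate, and the loop always spends the next experiment where knowledge is thinnest.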
The goal of this work is a better and more formalized understanding of chemical reactivity, which has a number of benefits downstream. So it makes synthesis more predictable. We can, again, get closer to anticipating what will happen in the real world just through a simulation in silico. That means that we'll hopefully have fewer failed experiments in the laboratory, which will lead to faster access to new molecular structures.
The dream is that if there's a new molecular structure that a chemist or a model proposes, we'd really like to test that as quickly as possible to know if it's going to be high performing or not. And to do so we need to plan out the synthesis, and we need to understand if each of those reaction steps is likely to succeed or fail. And with the kinds of models that we're developing, we can try to quantify that and quickly assess which molecules are easy to synthesize and which are hard to synthesize.
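Under the strong simplifying assumption that reaction steps succeed independently, a route's overall feasibility can be scored as the product of per-step success probabilities (all numbers here are illustrative, not from a real model):

```python
def route_feasibility(step_probs):
    """Chance a multi-step route succeeds end to end, assuming each
    step succeeds independently (a strong simplifying assumption)."""
    p = 1.0
    for prob in step_probs:
        p *= prob
    return p

# Hypothetical per-step success estimates from a reaction-outcome model.
easy_route = [0.95, 0.9, 0.92]
hard_route = [0.95, 0.4, 0.9, 0.3]

print(round(route_feasibility(easy_route), 3))  # 0.787
print(round(route_feasibility(hard_route), 3))  # 0.103
```

Even one or two shaky steps collapse the end-to-end probability, which is why a per-step reactivity model translates directly into an easy-versus-hard-to-synthesize ranking.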
Now there's a much more ambitious goal, I would say, behind some of this work, which is trying to fully automate the process of creating these new molecular structures. So synthesis of these new structures is a major bottleneck in discovery efforts. From the time a chemist proposes a dozen new molecules to test, ones they believe might be active against a certain protein target, it might take weeks or months to actually get those molecules in physical form to test them in biological assays. And sometimes that relies on in-house synthesis. Sometimes it relies on contracting that synthesis out overseas to contract research organizations. That's costly and time-consuming, and it slows down the process tremendously.
Now if we have a complete understanding of chemical reactivity or complete enough from these computational models and if we combine that with laboratory automation and robotics, we can start to think about automating the steps it takes to produce these new structures. So automatically picking the solutions off the shelf of your starting materials, automatically mixing them in the right ratios, and heating them, stirring them, purifying the product of that reaction. And if we have a robust enough platform, we can do that multiple times. So we can take then the product we've just made and feed it back in as a starting material for the next step.
And this is, of course, something that chemists do by hand quite well. And it's the way that we've approached synthesis for decades or hundreds of years. But the idea that we can automate the process and have a robotic system be sort of adaptive and flexible enough to respond to the analyses it runs and to understand if it's succeeding or failing in its task, that could be hugely enabling. You can imagine a situation in which the generation of new molecular structures and testing those structures is, again, no longer limited by human time and expert chemist time, but by the availability of these platforms, which then makes it more of a capital problem than a human resource problem.
CONNOR COLEY: There are a lot of both engineering and more theoretical challenges with this project. On the engineering side, there are many complexities in terms of automating the manipulation of solids and understanding just how to get all of these robotic components to work together. But on the more fundamental side, there are questions of identifying unknown outcomes and of responding to failures under uncertainty.
A more concrete example of that is, let's suppose that we've asked the robotic platform to run a certain reaction, and it does so. But in analyzing the results, it needs to check if it succeeded. Has it actually made the product we wanted to make?
Now, there are some analytical techniques we can use to check the structure. And we can try to identify whether or not the spectra that we get from some of these analytical techniques match the spectra we expect to find for that product. But maybe when we do the analysis and the purification and workup, we find a few other peaks in those spectra. We find a few other impurities, a few other structures, but we're not actually sure what's been made.
The task of structural elucidation, or actually automatically identifying what those new chemical structures are, is still a very challenging research problem. It's something that expert chemists learn how to do, something that expert chemists build up a skill set for. But it's not something that we can approach computationally that well yet.
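The simpler half of that check, matching observed peaks against expected ones within a tolerance and flagging the leftovers as possible impurities, can be sketched as follows (the peak positions and tolerance are made up):

```python
def peaks_match(observed, expected, tol=0.05):
    """Check that every expected peak appears in the observed spectrum
    within `tol`; leftover observed peaks are flagged as possible
    impurities. Positions and tolerance are illustrative."""
    missing = [e for e in expected if not any(abs(e - o) <= tol for o in observed)]
    extras = [o for o in observed if not any(abs(o - e) <= tol for e in expected)]
    return len(missing) == 0, extras

# Hypothetical peak positions (e.g. chemical shifts in ppm).
expected = [1.2, 3.7, 7.3]
observed = [1.22, 3.68, 7.31, 5.5]  # one unexplained peak at 5.5

product_confirmed, impurity_peaks = peaks_match(observed, expected)
```

Confirming the expected product is the tractable part; as the speaker notes, working out what the unexplained peaks actually are (structural elucidation) is the hard, still-open problem.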
We need to better understand how to respond to these analytical observations and dynamically choose what to do next, how to assess if we've been successful. Or if we haven't been successful, how do we change what we're doing, change the conditions or starting materials, to increase our success and make the products that we're ultimately after? And it's this adaptability and responsiveness that is missing from these platforms.
That difference is how I like to describe the difference between thinking about automation and autonomy for these platforms. So autonomy has this extra component of responding to surprising events or responding to failure, whereas automation is going through the motions in a prescribed manner. My goal is to achieve that full level of autonomy, where we have these closed-loop systems that generate their own hypotheses, test their own hypotheses, and understand when they've been validated or falsified, and know how to adjust their beliefs accordingly.
Another initiative that we've become involved with recently and helped lead is an initiative called the Open Reaction Database. This is an initiative to create an open access platform for sharing chemical reaction data. All of our techniques and our methods, and many of our colleagues', rely on the availability of well-curated, high-quality chemical reaction data. Of course, that's what a lot of machine-learning tools are built on. And so we're trying to also create a mindset shift, in terms of how data is generated and shared, and trying to bring more of that information out into the open.
One of the ways in which our goals in automation connect to this is when we run chemical reaction screening, when we try to run hundreds or thousands of experiments at microliter scale to understand the outcome, we'll get a number of successes and a number of failures. We might have reactions that work well and reactions that don't work well. And the status quo right now is to just publish what works, just publish what went well. And so with this Open Reaction Database project and other initiatives we're working on, we're trying to create a more open and community-driven mode of sharing data that includes those failures, that includes more detail and more information than what's typically shared.
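A minimal sketch of what such a shared reaction record might look like; this is a hypothetical schema in the spirit of the idea, not the Open Reaction Database's actual format:

```python
import json

def make_record(reactants, conditions, product, yield_pct):
    """One reaction entry; failures get the same structure as successes,
    so they can be shared and learned from too."""
    return {
        "reactants": reactants,
        "conditions": conditions,  # e.g. solvent, temperature
        "product": product,
        "yield_percent": yield_pct,
        "outcome": "success" if yield_pct > 0 else "failure",
    }

records = [
    make_record(["A", "B"], {"solvent": "THF", "temp_C": 25}, "C", 72.0),
    # A failed run, recorded rather than discarded:
    make_record(["A", "D"], {"solvent": "THF", "temp_C": 25}, None, 0.0),
]

serialized = json.dumps(records, indent=2)
```

The key design choice is that a failed run is a first-class record with the same fields, not an absence, so downstream models see negatives as well as positives.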
CONNOR COLEY: So sharing failed reaction data, sharing reactions that didn't lead to the expected results, is quite useful because these are, perhaps, the most informative examples. If somebody ran a reaction, or ran an experiment, and had a surprising outcome, that tells us two things. The first is that the experimenter thought it was worth trying in the first place; they had reason to believe that the experiment would be successful.
The second thing it tells us is that it, of course, wasn't successful, and there is something else going on, some underlying phenomenon that maybe we don't fully understand, or at least that the experimentalist didn't fully understand. And this adds a sort of richness to the data we have, and it lets us learn different trends in the data. We get to learn not just from positive results, trying to understand from the literature what's worked well, but we explicitly see the negatives to know what didn't.
And that contrast helps these machine learning systems learn. It helps them realize what it takes to have a successful experiment and what might lead to an unexpected and more undesired outcome. And that, again, connects to the idea of just having this more formal, wider understanding of chemical reactivity that we can capture in these models.
Creating this culture change in how data is shared and trying to have the chemistry community help take ownership over its curation does present some obstacles. So it does take extra time. It's an additional investment into the data.
But one of the exciting things that we're seeing is, I think, everyone trying to identify the potential benefits of using these types of tools. And everyone is seeing around them the impact of machine learning and trying to understand how that can affect their research and their domain. And there are, obviously, the very high-profile successes of machine learning in protein structure prediction and game playing and machine translation. And there are also those successes in chemistry for synthesis planning and predicting the yields of chemical reactions.
I think, as more of these success stories become visible, it becomes more apparent what the value is in contributing to them, by supporting the data generation process and supporting the development of the tools with these well-curated examples. But there is certainly a culture shift that's needed, and that's one that's hard to instill. We're hoping to help encourage these, again, community-driven efforts to take ownership over the generation and curation of reaction data to improve all of our research processes.
Right now it's not the norm to share your failures, and it's not the norm to share reactions that you've tried that don't fit into the narrative of a final manuscript or publication. That's perhaps a fault of the publication environment and expectations around what it takes to publish and what readers of scientific journals would like to see. And I also think that there's a change in how many reactions people are running. So even just the sheer number of experiments people run before they submit a publication is changing.
And if you're thinking about running, not just a dozen reactions on the bench but 1,000 reactions in an automated platform, there's, of course, a much richer data set behind the scenes. And so including that full story in the publication is something that I think readers should emphasize their interest in, that journals should emphasize their interest in.
And so I think it's a change that can come at a number of different levels: the people who write papers, the people who read papers, the journals themselves, and the publishers. I think creating those community standards has been successful elsewhere, like in structural biology with the Protein Data Bank and in crystallographic data storage, but we just haven't had that mindset change in chemistry yet.