Using Computational Tools for Molecule Discovery
Connor Coley, an assistant professor of chemical engineering, is developing computational tools that can predict molecular behavior and learn from successes and mistakes alike.
Discovering a drug, a material, or anything new requires finding and understanding molecules. It’s a time- and labor-intensive process that can be helped along by a chemist’s expertise, but it can only go so quickly and be so efficient, and there’s no guarantee of success. Connor Coley is looking to change that dynamic. The assistant professor of chemical engineering is developing computational tools that can predict molecular behavior and learn from successes and mistakes alike.
It’s an intuitive approach and one that still has obstacles, but Coley says that this autonomous platform holds enormous potential for remaking the discovery process. A reservoir of untapped and never-before-imagined molecules would open up. The tools could offer suggestions from the outset, giving researchers a running start and shortening the overall timeline from idea to result. And human capital would no longer be a restriction: scientists would be freed from monitoring every step and could instead tackle bigger questions than they were able to before. “This would let us boost our productivity and scale out the discovery process much more efficiently,” he says.
Molecules present a couple of challenges. They take time to figure out, and there are a lot of them. Coley cites estimates that there are 10²⁰ to 10⁶⁰ small, biologically relevant molecules, but fewer than 10⁹ have been synthesized and tested. To close that gap and accelerate the process, his group has been working on computational techniques that learn to correlate molecular structures with their functions.
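The idea of learning a structure-to-function mapping can be sketched with a toy example. The fingerprint, similarity measure, and training data below are invented stand-ins, not the group’s actual models: a molecule’s SMILES string is reduced to a bag of character bigrams, and a property is predicted from the most similar known molecule.

```python
# Toy structure-to-function sketch (illustrative only): represent each
# molecule's SMILES string as a bag of character bigrams, then predict a
# property from its most similar neighbour in a tiny hypothetical dataset.
from collections import Counter

def fingerprint(smiles):
    """Bag of character bigrams -- a crude stand-in for a real molecular fingerprint."""
    return Counter(smiles[i:i + 2] for i in range(len(smiles) - 1))

def similarity(a, b):
    """Tanimoto-style overlap between two bag-of-bigram fingerprints."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0

# Hypothetical training data: SMILES string -> measured property value.
training = {"CCO": 0.8, "CCN": 0.6, "c1ccccc1": 0.1}

def predict(smiles):
    """Nearest-neighbour prediction: copy the property of the closest match."""
    fp = fingerprint(smiles)
    best = max(training, key=lambda s: similarity(fp, fingerprint(s)))
    return training[best]

print(predict("CCCO"))  # closest neighbour in the toy set is CCO
```

Real systems replace each piece, using learned molecular representations instead of character counts and trained models instead of a nearest-neighbour lookup, but the shape of the problem is the same: map structure to a prediction of function.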
One of the tools is guided optimization, which evaluates a molecule across a number of dimensions and determines which candidates will have the best properties for a given task. The aim is for the model to make better predictions as it runs, using a technique called active learning, in which the model chooses which experiments to run next and improves as the results come in. Coley says this might reduce the number of experiments it takes for a hypothetical new drug to go from initial stages to clinical trials “by an order of magnitude.”
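An active-learning loop can be illustrated in miniature. Everything here is a toy stand-in, not Coley’s actual system: the “experiment” is a hidden function, the surrogate model is a nearest-neighbour lookup, and uncertainty is approximated by distance to the closest observation. Each round, the loop spends its experiment budget where the model is least certain.

```python
# Toy active-learning loop (illustrative only; not the actual platform).
# A surrogate model scores candidates; each round we run the experiment
# where the model is least certain, then fold the result back in.
import random

random.seed(0)

def true_property(x):
    """Hidden ground truth the (hypothetical) experiment measures."""
    return -(x - 0.7) ** 2

candidates = [i / 100 for i in range(101)]
observed = {}  # candidate -> measured property

def predict(x):
    """1-nearest-neighbour surrogate built from observations so far."""
    nearest = min(observed, key=lambda o: abs(o - x))
    return observed[nearest]

def uncertainty(x):
    """Distance to the closest observation: a crude uncertainty proxy."""
    return min(abs(o - x) for o in observed)

# Seed with two random experiments, then query where we are least certain.
for x in random.sample(candidates, 2):
    observed[x] = true_property(x)
for _ in range(10):
    x = max((c for c in candidates if c not in observed), key=uncertainty)
    observed[x] = true_property(x)  # "run the experiment"

best = max(observed, key=observed.get)
print(f"best candidate found: {best:.2f}")
```

With only 12 experiments out of 101 possible ones, the loop homes in on the region of the optimum, which is the sense in which active learning can cut the number of experiments needed.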
There are still inherent limitations. The guided optimization relies on models that are currently available, and molecules, unlike images, aren’t numerical or static. Their shapes change based on factors like environment and temperature. Coley is looking to take those elements into account so the tools can learn such patterns; the result would be “a more nuanced understanding of what it means to have a molecular structure and how best to capture that as an input to these machine learning models.”
One bottleneck, as he calls it, is having good test cases to benchmark performance. As an example, two molecules that are mirror images of each other (enantiomers) can behave differently in different environments, the human body among them, but many datasets don’t capture that distinction.
Developing new algorithms and models requires having specific tasks and goals, and he’s working on creating synthetic benchmarks that would be controlled but would still reflect real applications.
Beyond selecting molecules, Coley is also working on tools that would generate new structures. The typical method is for a scientist to design property models and make a query. What comes out is a prediction of molecular function, but only for the molecule that was requested. Coley says that new approaches make it possible to ask the model to come up with new ideas and structures that would have a desired set of properties, even though no specific molecule was queried. In essence, it “inverts” the process.
The potential is enormous, but the models are still data-inefficient. It could take more than 100,000 guesses before a “good” molecule is found, which is too many, says Coley, who wants to be able to discover molecules in a closed-loop fashion. An essential aspect of achieving that goal is to constrain generation to abide by the rules of synthetic chemistry; otherwise it could take months to test what the model proposes. In the new approach, the model would be able to “quality check” its own output, proposing both molecules and the pathways to make them. He also wants to get to the point where models understand the variability and uncertainty of real-world situations. Together, these capabilities would reduce the reliance on human intuition, giving chemists a head start and the time to take on higher-level tasks.
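One way to picture the synthesizability “quality check” is as a filter that sits between the generator and the experiment, rejecting proposals before any lab time is spent on them. The rule, generator, and measurement below are toy stand-ins invented for this sketch.

```python
# Closed-loop sketch with a synthesizability "quality check" (illustrative;
# the rule, generator, and measurement are invented stand-ins). Proposals
# that break the rule are rejected before any "experiment" is run on them.
import random

random.seed(2)

def synthesizable(mol):
    """Toy rule: no more than two identical characters in a row."""
    return all(mol[i:i + 3] != mol[i] * 3 for i in range(len(mol) - 2))

def propose():
    """Toy generator: a random 6-character string over C, N, O."""
    return "".join(random.choice("CNO") for _ in range(6))

def measure(mol):
    """Stand-in experiment: property grows with carbon content."""
    return mol.count("C")

best_mol, best_val = None, float("-inf")
rejected = 0
for _ in range(100):
    mol = propose()
    if not synthesizable(mol):  # quality check before spending an experiment
        rejected += 1
        continue
    val = measure(mol)
    if val > best_val:
        best_mol, best_val = mol, val

print(best_mol, best_val, rejected)
```

The design point the sketch makes is simply that the filter is applied upstream of the expensive step, so the loop’s experiment budget is spent only on candidates that could actually be made.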
One limitation with improving any data-driven model is that it hinges on the available literature. Coley would like to open that up through a collaborative effort he co-leads, the Open Reaction Database. It would be community-driven, focused on synthetic chemistry, and designed to encourage researchers to share experiments that didn’t work and wouldn’t normally be published. That’s not the usual request, and it would entail a mindset shift in the chemistry field, but Coley says that there’s value in looking at what weren’t “successes.” “It adds richness to the data we have,” he says.
That’s the overarching theme of his work. The computational model would build on the last 100 years of chemistry and end up being a platform that keeps learning. The big-picture goal is to fully automate the process of research. Models and robotics could pick the solutions and mixtures and perform the heating, stirring, and purifying, and whatever product was made could be fed back in as the start of the next experiment. “That could be hugely enabling in terms of our ability to efficiently make, test, and discover new chemical matter,” Coley says.
The end result is that restrictions on discovery would come down to the availability of platforms, not the availability of time: a question of capital rather than human resources. The missing piece is a computational approach that can identify new structures and give them a better chance of success from the outset. In actuality, it’s not about automation, which goes through steps in a prescribed manner. What Coley wants is the extra component of being able to generate ideas, test hypotheses, respond to surprises, and adjust accordingly. “My goal is to achieve that full level of autonomy,” he says.