    A new model predicts how molecules will dissolve in different solvents

    August 20, 2025

    Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent — a key step in the synthesis of nearly any pharmaceutical. This type of prediction could make it much easier to develop new ways to produce drugs and other useful molecules.

    The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists to choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.

    “Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs, so there’s been a longstanding interest in being able to make better predictions of solubility,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.

    The researchers have made their model freely available, and many companies and labs have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the most commonly used industrial solvents, the researchers say.

    “There are some solvents which are known to dissolve most things. They’re really useful, but they’re damaging to the environment, and they’re damaging to people, so many companies require that you have to minimize the amount of those solvents that you use,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.”

    William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today in Nature Communications. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.

    Solving solubility

    The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which can be used to estimate a molecule’s overall solubility by adding up the contributions of chemical structures within the molecule. While these predictions are useful, their accuracy is limited.
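
    At its core, the Abraham model is a linear free-energy relationship: a solubility-related quantity is estimated as a weighted sum of solute descriptors, with the weights set by the solvent. A minimal Python sketch of that additive form, using illustrative placeholder numbers rather than fitted parameters, looks roughly like this:

        # Sketch of the additive form behind the Abraham solvation model.
        # Descriptor values and solvent coefficients below are illustrative
        # placeholders, not fitted data; the point is only that the estimate
        # is a weighted sum of per-molecule contributions.

        def abraham_log_property(solute: dict, solvent: dict) -> float:
            """Linear combination of solute descriptors and solvent coefficients."""
            return (
                solvent["c"]
                + solvent["e"] * solute["E"]   # excess molar refraction
                + solvent["s"] * solute["S"]   # dipolarity / polarizability
                + solvent["a"] * solute["A"]   # hydrogen-bond acidity
                + solvent["b"] * solute["B"]   # hydrogen-bond basicity
                + solvent["v"] * solute["V"]   # McGowan volume
            )

        example_solute = {"E": 1.5, "S": 1.6, "A": 0.1, "B": 1.3, "V": 1.4}
        example_solvent = {"c": 0.2, "e": -0.2, "s": 0.7, "a": 3.6, "b": 1.3, "v": 3.6}
        print(abraham_log_property(example_solute, example_solvent))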

    In the past few years, researchers have begun using machine learning to try to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state of the art for predicting solubility was a model developed in Green’s lab in 2022.

    That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict the solubility. However, the model has difficulty predicting solubility for solutes that it hasn’t seen before.

    “For drug and chemical discovery pipelines where you’re developing a new molecule, you want to be able to predict ahead of time what its solubility looks like,” Attia says.

    Part of the reason that existing solubility models haven’t worked well is that there wasn’t a comprehensive dataset to train them on. However, in 2023 a new dataset called BigSolDB was released, which compiled data from nearly 800 published papers, including solubility measurements for about 800 molecules dissolved in more than 100 organic solvents that are commonly used in synthetic chemistry.

    Attia and Burns decided to try training two different types of models on this data. Both of these models represent the chemical structures of molecules using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bound to which other atoms. Models can then use these representations to predict a variety of chemical properties.

    One of the models used in this study, known as FastProp and developed by Burns and others in Green’s lab, incorporates “static embeddings.” This means that the model already knows the embedding for each molecule before it starts doing any kind of analysis.
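
    To make the distinction concrete, here is a minimal sketch of a static embedding. FastProp uses its own curated descriptor set; as a stand-in, this example computes an RDKit Morgan fingerprint, a fixed vector that is the same before, during, and after training:

        # A "static embedding": the molecule's numeric vector is fixed before
        # any training happens. An RDKit Morgan fingerprint stands in here for
        # FastProp's descriptor set.
        import numpy as np
        from rdkit import Chem
        from rdkit.Chem import AllChem

        def static_embedding(smiles: str, n_bits: int = 2048) -> np.ndarray:
            """Return a fixed-length bit vector for a molecule given as SMILES."""
            mol = Chem.MolFromSmiles(smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)  # radius 2
            return np.array(list(fp))

        # The same molecule always maps to the same vector; the model only
        # learns how to map that vector to a property such as solubility.
        print(static_embedding("CCO")[:16])   # ethanol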

    The other model, ChemProp, learns an embedding for each molecule during the training, at the same time that it learns to associate the features of the embedding with a trait such as solubility. This model, developed across multiple MIT labs, has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.
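
    By contrast, a learned embedding is adjusted by the same gradient updates that train the property predictor. The toy PyTorch sketch below is not ChemProp’s message-passing architecture, but it shows the idea: the per-atom vectors are trainable parameters, so they change as the model learns to predict solubility:

        # A toy "learned embedding" (not ChemProp's actual architecture):
        # per-atom vectors are trainable parameters, updated at the same time
        # as the property-prediction head.
        import torch
        import torch.nn as nn

        class LearnedMoleculeModel(nn.Module):
            def __init__(self, n_atom_types: int = 119, dim: int = 64):
                super().__init__()
                self.atom_embedding = nn.Embedding(n_atom_types, dim)  # learned
                self.head = nn.Linear(dim, 1)  # molecule vector -> log-solubility

            def forward(self, atomic_numbers: torch.Tensor) -> torch.Tensor:
                # Sum-pool atom vectors into one molecule vector, then predict.
                mol_vector = self.atom_embedding(atomic_numbers).sum(dim=0)
                return self.head(mol_vector)

        model = LearnedMoleculeModel()
        ethanol_atoms = torch.tensor([6, 6, 8])          # C, C, O
        loss = (model(ethanol_atoms) - torch.tensor([0.5])) ** 2
        loss.sum().backward()   # gradients also flow into the atom embeddings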

    The researchers trained both types of models on over 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a significant role in solubility. Then, they tested the models on about 1,000 solutes that had been withheld from the training data. They found that the models’ predictions were two to three times more accurate than those of SolProp, the previous best model, and the new models were especially accurate at predicting variations in solubility due to temperature.
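
    The evaluation setup can be sketched as grouping the data by solute, so that no test molecule ever appears in training. The snippet below is only an illustration; its column names are assumptions, not BigSolDB’s actual schema:

        # Withhold entire solutes from training, so the model is scored on
        # molecules it has never seen. Column names here are illustrative.
        import pandas as pd
        from sklearn.model_selection import GroupShuffleSplit

        df = pd.DataFrame({
            "solute_smiles":  ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CC(=O)C"],
            "solvent_smiles": ["O", "CC#N", "O", "CCO", "O"],
            "temperature_K":  [298.0, 310.0, 298.0, 323.0, 298.0],
            "log_solubility": [1.1, 1.3, -1.6, -0.9, 0.7],
        })

        # Group by solute so the same molecule never lands in both splits.
        splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
        train_idx, test_idx = next(splitter.split(df, groups=df["solute_smiles"]))
        print(sorted(set(df.iloc[train_idx]["solute_smiles"])))
        print(sorted(set(df.iloc[test_idx]["solute_smiles"])))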

    “Being able to accurately reproduce those small variations in solubility due to temperature, even when the overarching experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” Burns says.

    Accurate predictions

    The researchers had expected that the model based on ChemProp, which is able to learn new representations as it goes along, would be able to make more accurate predictions. However, to their surprise, they found that the two models performed essentially the same. That suggests that the main limitation on their performance is the quality of the data, and that the models are performing as well as theoretically possible based on the data that they’re using, the researchers say.

    “ChemProp should always outperform any static embedding when you have sufficient data,” Burns says. “We were blown away to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets, which indicates to us that the data limitations that are present in this space dominated the model performance.”

    The models could become more accurate, the researchers say, if better training and testing data were available — ideally, data obtained by one person or a group of people all trained to perform the experiments the same way.

    “One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility tests. That contributes to this variability between different datasets,” Attia says.

    Because the model based on FastProp makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Multiple pharmaceutical companies have already begun using it.

    “There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see, outside of formulation and drug discovery, where people may use this model.”

    The research was funded, in part, by the U.S. Department of Energy.
