Benchmarking Protein–Ligand Interaction Energy

by Ishaan Ganti · Jul 11, 2025

Accurately modeling protein–ligand (PL) interactions is core to structure-based drug design. While computing individual protein–ligand interaction energies is a necessary first step in any binding-affinity-prediction workflow, this computation can be quite challenging. Conventional forcefields like those used in molecular dynamics often get non-covalent interactions wrong. Quantum-chemical methods like density-functional theory (DFT) are much more accurate, but are typically unable to scale to the requisite number of atoms. For many complexes, even pruning all residues more than 10 Å away from the ligand still leaves 600–2000 atoms in the truncated system, too many for DFT to routinely handle.

Low-cost methods like neural network potentials (NNPs) and tight-binding semiempirical methods have become increasingly popular in the past few years. These methods give near-DFT accuracy for small molecules but run orders of magnitude faster than DFT or other quantum-chemical methods. However, it remains an open question whether NNPs and semiempirical methods can scale to protein-scale computations with good accuracy; most NNPs are trained only on small organic molecules or periodic systems, often with fewer than a hundred atoms.

Benchmarking protein–ligand interaction energies is challenging because, as discussed above, the systems are too large to directly run reference quantum-chemical calculations. In 2020, Kristian Kříž and Jan Řezáč addressed this problem by publishing the PLA15 benchmark set, which uses fragment-based decomposition to estimate the interaction energy for 15 protein–ligand complexes at the DLPNO-CCSD(T) level of theory. A recent LinkedIn post from Řezáč showed that the new g-xTB method performed well, but didn't include some of the other computational methods we often consider using here at Rowan.

Visualization of the 2CET complex from the PLA15 benchmark set.

We benchmarked a variety of low-cost computational methods against PLA15; the full list of methods appears in the table below.

Here are the results, showing relative percent error, $\frac{100\cdot(\text{pred} - \text{ref})}{|\text{ref}|}$:

Comparison of a variety of low-cost methods on the individual complexes in the PLA15 benchmark.
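For reference, here's how the heatmap's metric can be computed; a minimal sketch, where `pred` and `ref` are arrays of predicted and reference interaction energies:

```python
import numpy as np

def relative_percent_error(pred, ref):
    """Signed relative percent error of predicted vs. reference
    interaction energies. Because interaction energies are negative,
    negative values here mean overbinding (predicted binding too strong)
    and positive values mean underbinding."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return 100.0 * (pred - ref) / np.abs(ref)

def mean_abs_percent_error(pred, ref):
    """The summary statistic quoted below (e.g. 6.1% for g-xTB)."""
    return float(np.mean(np.abs(relative_percent_error(pred, ref))))
```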

There's a lot to digest; here are a few things that immediately pop out:

  1. Among the NNPs, the models trained on OMol25 are by far the best. Egret-1 and AIMNet2 (the latter with both naïve and DSF charge handling) are "middle of the road," while the materials-science models (Orb-v3 and MACE-MP-0b2-L) perform worst.
  2. g-xTB appears to be the best method overall, but GFN2 also looks good.
  3. Most of the models appear to generally underbind the ligands and predict interaction energies that are too small. Interestingly, the models trained on OMol25 all consistently overbind. This may stem from the use of the VV10 correction in the training data, which allegedly causes overbinding in non-covalent systems.
  4. Handling explicit charge differently causes big differences. AIMNet2 consistently underbinds with normal charge handling but switches to occasional overbinding with DSF charge handling (a sketch of the DSF functional form follows this list).
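For context on point 4: DSF refers to damped shifted force electrostatics (Fennell and Gezelter), a smoothly truncated Coulomb sum. A minimal sketch of the pairwise DSF energy, with illustrative parameter values rather than AIMNet2's actual settings:

```python
import math

def dsf_coulomb(qi, qj, r, alpha=0.2, r_cut=10.0):
    """Damped shifted force (DSF) Coulomb energy between two point charges.

    Multiply by Coulomb's constant for physical units. The alpha (1/Å)
    and r_cut (Å) values are illustrative, not AIMNet2's settings."""
    if r >= r_cut:
        return 0.0
    # Shift terms make both the energy and the force vanish at r_cut.
    shift = math.erfc(alpha * r_cut) / r_cut
    force_shift = (math.erfc(alpha * r_cut) / r_cut**2
                   + 2 * alpha / math.sqrt(math.pi)
                   * math.exp(-(alpha * r_cut) ** 2) / r_cut)
    return qi * qj * (math.erfc(alpha * r) / r - shift
                      + force_shift * (r - r_cut))
```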

I was surprised to see how good g-xTB was compared to all of the NNPs, boasting a mean absolute percent error of 6.1% on this set. For comparison, the OMol25-trained models hover around the 11% mark, whereas AIMNet2 and Egret-1 are around 25%. The other NNPs fall well behind.

Comparison of the overall performance of different low-cost methods on the PLA15 benchmark.

The spread on g-xTB is also quite nice; no insane outliers are present. Meanwhile, UMA-medium is the only OMol25-based model without any drastic outlier. For any sort of protein–ligand free-energy predictions, having a stable underlying interaction-energy predictor is a must, suggesting that the current generation of NNPs is likely to do poorly at this task. It may be possible to correct the predictions of models that exhibit a strong systematic error like UMA-medium, potentially via Δ-learning the difference.
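As a minimal sketch, the crudest version of such a correction is a linear rescaling fit by least squares; a real Δ-learning model would instead train a regressor on structural features to predict the residual:

```python
import numpy as np

def fit_linear_correction(pred_train, ref_train):
    """Least-squares fit of ref ≈ a * pred + b; returns (a, b).

    This is only a stand-in for Delta-learning, which would train an
    ML model on molecular features to predict (ref - pred) directly."""
    a, b = np.polyfit(pred_train, ref_train, deg=1)
    return a, b

def apply_correction(pred, a, b):
    """Apply the fitted rescaling to new predictions."""
    return a * np.asarray(pred, float) + b
```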

We also analyzed the Pearson and Spearman correlations of the models under study to see whether the relative rankings were correct.

| Model | Coefficient of Determination ($R^2$) | Spearman $\rho$ | Mean Abs. Percent Error (%) |
|---|---|---|---|
| MACE-MP-0b2-L | 0.611 ± 0.171 | 0.750 ± 0.159 | 67.29 |
| Orb-v3 | 0.565 ± 0.137 | 0.776 ± 0.141 | 46.62 |
| Egret-1 | 0.731 ± 0.107 | 0.876 ± 0.110 | 24.33 |
| ANI-2x | 0.543 ± 0.251 | 0.613 ± 0.232 | 38.76 |
| AIMNet2 | 0.969 ± 0.020 | 0.951 ± 0.050 | 27.42 |
| AIMNet2 (DSF) | 0.633 ± 0.137 | 0.768 ± 0.155 | 22.05 |
| OMol25 eSEN-s | 0.992 ± 0.003 | 0.949 ± 0.046 | 10.91 |
| UMA-s | 0.983 ± 0.009 | 0.950 ± 0.051 | 12.70 |
| UMA-m | 0.991 ± 0.007 | 0.981 ± 0.023 | 9.57 |
| GFN-FF | 0.446 ± 0.225 | 0.532 ± 0.241 | 21.74 |
| GFN2 | 0.985 ± 0.007 | 0.963 ± 0.036 | 8.15 |
| g-xTB | 0.994 ± 0.002 | 0.981 ± 0.023 | 6.09 |

The corresponding $R^2$ and Spearman $\rho$ coefficients largely seem to match what is seen in the previous two figures. The exception is AIMNet2, which seems to rank-order and correlate very strongly with the reference data even while having a high average relative error. This either implies that AIMNet2's outputs are off by roughly a linear transformation (which would be great!), or it's just "lucky" and can be attributed to the small size of the dataset. I'm inclined to say it's the latter, but I remain hopeful. It's worth noting that Egret-1, which has similar relative errors to AIMNet2, does far worse here.
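For completeness, here's a sketch of how such coefficients and their uncertainties can be estimated; the resampling scheme is an assumption on my part, since the original procedure behind the ± values isn't stated here:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def bootstrap_correlations(pred, ref, n_boot=10_000, seed=0):
    """Pearson R^2 and Spearman rho with bootstrap standard deviations,
    resampling the 15 complexes with replacement."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    rng = np.random.default_rng(seed)
    r2s, rhos = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(pred), len(pred))
        # Skip degenerate resamples where a correlation is undefined.
        if np.ptp(pred[idx]) == 0 or np.ptp(ref[idx]) == 0:
            continue
        r2s.append(pearsonr(pred[idx], ref[idx])[0] ** 2)
        rhos.append(spearmanr(pred[idx], ref[idx])[0])
    return (np.mean(r2s), np.std(r2s)), (np.mean(rhos), np.std(rhos))
```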

In accordance with chemical intuition, charge handling seems to matter a lot here: the worst NNPs are those that don't explicitly take a total molecular charge as input. Every complex in the PLA15 dataset contains either a charged ligand or a charged protein; future work in NNP development should focus on better accounting for the effects of charge across large systems, as it seems to be crucial here.

Note: Olexandr Isayev responded on X, saying:

Due to incorrect electrostatics the public aimnet2 models should not be used for protein-ligand systems. The interaction energy is wrong!

This may explain why AIMNet2, the only model here that explicitly includes charge-based energy terms, doesn't perform any better.

To summarize, the PLA15 benchmark shows that there's a pretty stark gap between today's NNPs and semiempirical methods for predicting protein–ligand interaction energies. Though models like UMA-medium show promise, their consistent overbinding suggests a need for a systematic correction. Other models simply have absolute errors that are too large to deem them reliable.

Right now, g-xTB seems to be the clear winner in terms of accuracy. While it can't take advantage of the growing power of GPUs, it's still incredibly fast. We remain hopeful that future generations of NNPs will address the problems discussed here and provide computational chemists with fast, accurate, and reliable methods for computing protein–ligand interaction energies.

Methodology

The PLA15 dataset comes with 15 PDB files and a plain-text file of reference energies. All NNP interaction energies were calculated by masking the protein or ligand out of each PDB based on residue name and then running single-point calculations through the ASE calculator interface. For the tight-binding methods, I converted each PDB into three .xyz files (complex, protein, ligand) and ran jobs through Rowan's Python API. Formal-charge information is given at the top of the supplied PDBs and was passed explicitly where required. All input files and output energies can be found at this GitHub repo.
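As a concrete illustration of that workflow, here is a minimal sketch; the file name, residue name, and `make_calc` factory are illustrative stand-ins, not the exact benchmark script:

```python
import numpy as np
from ase.io import read, write

def split_complex(pdb_path, ligand_resname):
    """Split a PLA15 PDB into complex/protein/ligand ASE Atoms objects
    by masking on the residue-name column of ATOM/HETATM records."""
    is_ligand = []
    with open(pdb_path) as f:
        for line in f:
            if line.startswith(("ATOM", "HETATM")):
                is_ligand.append(line[17:20].strip() == ligand_resname)
    cmplx = read(pdb_path)
    is_ligand = np.asarray(is_ligand)
    assert len(is_ligand) == len(cmplx), "atom/record count mismatch"
    ligand = cmplx[np.flatnonzero(is_ligand)]
    protein = cmplx[np.flatnonzero(~is_ligand)]
    return cmplx, protein, ligand

def interaction_energy(cmplx, protein, ligand, make_calc):
    """E_int = E(complex) - E(protein) - E(ligand), three single points.

    make_calc() should return a fresh ASE-compatible NNP calculator.
    Total charge is calculator-specific, which is why the formal charges
    at the top of each PLA15 PDB must be passed explicitly where supported."""
    energies = []
    for atoms in (cmplx, protein, ligand):
        atoms.calc = make_calc()
        energies.append(atoms.get_potential_energy())
    return energies[0] - energies[1] - energies[2]

# For the tight-binding methods, the same three fragments were written
# out as .xyz files (names below are illustrative):
# cmplx, protein, ligand = split_complex("2cet.pdb", "LIG")
# for name, atoms in [("complex", cmplx), ("protein", protein), ("ligand", ligand)]:
#     write(f"{name}.xyz", atoms)
```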

