Benchmarking Protein–Ligand Interaction Energy

by Ishaan Ganti · Jul 11, 2025

Accurately modeling protein–ligand (PL) interactions is core to structure-based drug design. While computing individual protein–ligand interaction energies is a necessary first step in any binding-affinity-prediction workflow, this computation can be quite challenging. Conventional forcefields like those used in molecular dynamics often get non-covalent interactions wrong. Quantum-chemical methods like density-functional theory (DFT) are much more accurate, but are typically unable to scale to the requisite number of atoms. For many complexes, even pruning all residues more than 10 Å away from the ligand still leaves 600–2000 atoms in the truncated system, too many for DFT to routinely handle.
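To make that truncation step concrete, here is a minimal sketch of distance-based pruning using MDAnalysis (not a tool named in this post); the file name "complex.pdb" and the ligand residue name "LIG" are placeholders rather than actual PLA15 inputs.

```python
import MDAnalysis as mda

u = mda.Universe("complex.pdb")          # placeholder path
ligand = u.select_atoms("resname LIG")   # placeholder ligand residue name

# Keep every protein residue with at least one atom within 10 Å of the ligand.
pocket = u.select_atoms("byres (protein and around 10 group ligand)", ligand=ligand)
truncated = pocket + ligand
print(f"{truncated.n_atoms} atoms remain after truncation")
```

A real truncation scheme would also cap the bonds cut by the pruning (typically with hydrogens), which adds atoms back; the point is simply that even the pruned system is far beyond what DFT can handle routinely.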

Low-cost methods like neural network potentials (NNPs) and tight-binding semiempirical methods have become increasingly popular in the past few years. These methods give near-DFT accuracy for small molecules but run orders of magnitude faster than DFT or other quantum-chemical methods. However, it remains an open question whether NNPs and semiempirical methods can scale to protein-sized systems with good accuracy; most NNPs are trained only on small organic molecules or periodic systems, often with fewer than a hundred atoms.

Benchmarking protein–ligand interaction energies is challenging because, as discussed above, the systems are too large to directly run reference quantum-chemical calculations. In 2020, Kristian Kříž and Jan Řezáč addressed this problem by publishing the PLA15 benchmark set, which uses fragment-based decomposition to estimate the interaction energy for 15 protein–ligand complexes at the DLPNO-CCSD(T) level of theory. A recent LinkedIn post from Řezáč showed that the new g-xTB method performed well, but didn't include some of the other computational methods we often consider using here at Rowan.

Visualization of the 2CET complex from the PLA15 benchmark set.

We benchmarked a variety of low-cost computational methods against PLA15; the full list of methods appears in the figures and table below.

Here are the results, showing relative percent error, $\frac{100\cdot(\text{pred} - \text{ref})}{|\text{ref}|}$:

Comparison of a variety of low-cost methods on the individual complexes in the PLA15 benchmark (grid heatmap).
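One note on reading the heatmap: since interaction energies are negative for bound complexes, this metric comes out positive when a method underbinds and negative when it overbinds. In code:

```python
def relative_percent_error(pred: float, ref: float) -> float:
    """100 * (pred - ref) / |ref|; positive means underbinding when ref < 0."""
    return 100.0 * (pred - ref) / abs(ref)

# Example: a reference interaction energy of -50 kcal/mol predicted as -40 kcal/mol
# (too weak) gives +20%, while a prediction of -60 kcal/mol (too strong) gives -20%.
```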

There's a lot to digest; here are a few things that immediately pop out:

  1. Among the NNPs, the models trained on OMol25 are by far the best. Egret-1 and AIMNet2 (with both naïve and DSF charge handling) are "middle of the road," while the materials-science models (Orb-v3 and MACE-MP-0b2-L) perform worst.
  2. g-xTB appears to be the best method overall, but GFN2 also looks good.
  3. Most of the models generally underbind the ligands, predicting interaction energies that are too small in magnitude. Interestingly, the models trained on OMol25 all consistently overbind. This may stem from the use of the VV10 dispersion correction in the training data, which allegedly causes overbinding in non-covalent systems.
  4. How explicit charge is handled makes a big difference. AIMNet2 consistently underbinds with normal charge handling but switches to occasional overbinding with DSF charge handling.

I was surprised to see how good g-xTB was compared to all of the NNPs, boasting a mean absolute percent error of 6.1% on this set. For comparison, the OMol25-trained models hover around the 11% mark, whereas AIMNet2 and Egret-1 are around 25%. The other NNPs fall well behind.

Comparison of the overall performance of different low-cost methods on the PLA15 benchmark.

The spread on g-xTB is also quite nice; no insane outliers are present. Meanwhile, UMA-medium is the only OMol25-based model without any drastic outlier. For any sort of protein–ligand free-energy predictions, having a stable underlying interaction-energy predictor is a must, suggesting that the current generation of NNPs is likely to do poorly at this task. It may be possible to correct the predictions of models that exhibit a strong systematic error like UMA-medium, potentially via Δ-learning the difference.
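For illustration only, here is the simplest version of that correction idea: fit a linear map from a model's predicted interaction energies to the reference values on some calibration complexes, then apply it to new ones. This is a hypothetical sketch, not something done in this benchmark; a genuine Δ-learning correction would train a model on the residuals instead.

```python
import numpy as np

def fit_linear_correction(pred_train, ref_train):
    """Fit E_ref ≈ a * E_pred + b and return a callable that applies the map."""
    a, b = np.polyfit(pred_train, ref_train, deg=1)
    return lambda pred: a * np.asarray(pred) + b

# Hypothetical usage: calibrate on a few complexes, then correct the rest.
# correct = fit_linear_correction(uma_medium_train_energies, reference_train_energies)
# corrected = correct(uma_medium_test_energies)
```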

We also analyzed the Pearson and Spearman correlations between each model's predictions and the reference data, to see whether the relative rankings of the complexes are reproduced.

| Model | Coefficient of Determination (R²) | Spearman ρ | Mean Abs. Percent Error (%) |
|---|---|---|---|
| MACE-MP-0b2-L | 0.611 ± 0.171 | 0.750 ± 0.159 | 67.29 |
| Orb-v3 | 0.565 ± 0.137 | 0.776 ± 0.141 | 46.62 |
| Egret-1 | 0.731 ± 0.107 | 0.876 ± 0.110 | 24.33 |
| ANI-2x | 0.543 ± 0.251 | 0.613 ± 0.232 | 38.76 |
| AIMNet2 | 0.969 ± 0.020 | 0.951 ± 0.050 | 27.42 |
| AIMNet2 (DSF) | 0.633 ± 0.137 | 0.768 ± 0.155 | 22.05 |
| OMol25 eSEN-s | 0.992 ± 0.003 | 0.949 ± 0.046 | 10.91 |
| UMA-s | 0.983 ± 0.009 | 0.950 ± 0.051 | 12.70 |
| UMA-m | 0.991 ± 0.007 | 0.981 ± 0.023 | 9.57 |
| GFN-FF | 0.446 ± 0.225 | 0.532 ± 0.241 | 21.74 |
| GFN2 | 0.985 ± 0.007 | 0.963 ± 0.036 | 8.15 |
| g-xTB | 0.994 ± 0.002 | 0.981 ± 0.023 | 6.09 |

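As a rough sketch (not necessarily the exact analysis script used here), statistics like these can be computed with a few lines of NumPy and SciPy. The version below assumes the ± values come from bootstrap resampling over the 15 complexes and takes R² as the squared Pearson correlation; `ref` and `pred` are arrays of reference and predicted interaction energies.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def summarize(ref, pred, n_boot=10_000, seed=0):
    """Bootstrap R² (squared Pearson r), Spearman ρ, and mean absolute percent error."""
    ref, pred = np.asarray(ref, float), np.asarray(pred, float)
    mape = 100.0 * np.mean(np.abs(pred - ref) / np.abs(ref))

    rng = np.random.default_rng(seed)
    r2s, rhos = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(ref), len(ref))  # resample complexes with replacement
        r, _ = pearsonr(ref[idx], pred[idx])
        rho, _ = spearmanr(ref[idx], pred[idx])
        r2s.append(r * r)
        rhos.append(rho)

    return {
        "R2": (float(np.mean(r2s)), float(np.std(r2s))),
        "spearman_rho": (float(np.mean(rhos)), float(np.std(rhos))),
        "mape_percent": float(mape),
    }
```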
The corresponding R² and Spearman ρ coefficients largely match what is seen in the previous two figures. The exception is AIMNet2, which rank-orders and correlates very strongly with the reference data even while having a high average relative error. This implies either that AIMNet2's outputs are systematically off by roughly a linear transformation (which would be great!), or that the strong correlation is just "lucky" and attributable to the small size of the dataset. I'm inclined to say it's the latter, but I remain hopeful. It's worth noting that Egret-1, which has similar relative errors to AIMNet2, does far worse here.

In accordance with chemical intuition, charge handling matters a lot here: the worst NNPs are those that don't explicitly take a total molecular charge as input. Every complex in the PLA15 dataset contains either a charged ligand or a charged protein, so future work in NNP development should focus on better accounting for the effect of charge across large systems.

Note: Olexandr Isayev responded on X, saying:

Due to incorrect electrostatics the public aimnet2 models should not be used for protein-ligand systems. The interaction energy is wrong!

This may explain why AIMNet2, the only model which explicitly includes charge-based energy terms, doesn't seem to perform any better here.

To summarize, the PLA15 benchmark shows that there's a pretty stark gap between today's NNPs and semiempirical methods for predicting protein–ligand interaction energies. Though models like UMA-medium show promise, their consistent overbinding suggests a need for a systematic correction. Other models simply have absolute errors that are too large to deem them reliable.

Right now, g-xTB seems to be the clear winner in terms of accuracy. While it can't take advantage of the growing power of GPUs, it's still incredibly fast. We remain hopeful that future generations of NNPs will address the problems discussed here and provide computational chemists with fast, accurate, and reliable methods for computing protein–ligand interaction energies.

Methodology

The PLA15 dataset comes with 15 PDB files and a plain-text file with reference energies. All NNP interaction energies were calculated by masking out the protein or ligand from each PDB based on residue name, then evaluating energies through the ASE calculator interface. For the tight-binding methods, I converted each PDB into three .xyz files (complex, protein, ligand) and ran jobs through Rowan's Python API. Formal charge information is given at the top of the supplied PDBs and was passed explicitly where required. All input files and output energies can be found at this GitHub repo.
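Here is a minimal sketch of that NNP workflow. The function name, the ligand residue name, and the charge-passing convention are placeholders (not part of the actual scripts), `calc` can be any ASE-compatible calculator, the "protein" fragment is simply everything that isn't the ligand, and the sketch assumes the PDB's ATOM/HETATM line order matches ASE's atom order.

```python
from ase.io import read

def interaction_energy(pdb_path, ligand_resname, calc, charges=(0, 0, 0)):
    """E_int = E(complex) - E(protein) - E(ligand), in the calculator's energy units."""
    complex_atoms = read(pdb_path)

    # Residue names in file order, so they line up with ASE's atom ordering.
    with open(pdb_path) as f:
        resnames = [
            line[17:20].strip()
            for line in f
            if line.startswith(("ATOM", "HETATM"))
        ]
    ligand_idx = [i for i, name in enumerate(resnames) if name == ligand_resname]
    protein_idx = [i for i, name in enumerate(resnames) if name != ligand_resname]

    energies = []
    fragments = (complex_atoms, complex_atoms[protein_idx], complex_atoms[ligand_idx])
    for fragment, charge in zip(fragments, charges):  # total charges from the PDB headers
        fragment.calc = calc
        fragment.info["charge"] = charge  # charge-passing conventions differ by calculator
        energies.append(fragment.get_potential_energy())

    e_complex, e_protein, e_ligand = energies
    return e_complex - e_protein - e_ligand
```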
