How to Predict Protein–Ligand Binding Affinity

by Corin Wagen · Feb 11, 2026

A picture of Rowan's RBFE FEP workflow

Prediction of protein–ligand binding affinity is one of the most common and crucial tasks in computer-assisted drug design. Despite the multiparametric nature of drug discovery, achieving good binding affinity remains central to the development of a successful therapeutic: a recent perspective by Mark Murcko argues that the importance of optimizing binding affinity remains underrated.

Unfortunately, accurate computational prediction of binding affinity is extremely difficult. Binding affinity depends not only on the static structures of protein–ligand complexes, which are themselves difficult to predict with experimental accuracy, but also on a host of even more complex and difficult-to-predict factors: subtle non-covalent interactions between ligand atoms and protein sidechains, the structure and behavior of the water solvation shell, the conformational motion of protein and ligand, and the interplay between entropy and enthalpy in determining the ultimate free energy of binding.

Since ideal solutions to these problems are usually prohibitively expensive (if not outright impossible), practical binding-affinity prediction relies on a large number of bespoke approximations which can be deployed at various points in a project lifecycle. This post aims to (1) give a high-level overview of seven approaches widely used in practice, (2) briefly explain how each approach works (and where it might fail), and (3) help novice drug designers choose the right tool for the right occasion in a program.

(Before we start, a necessary disclaimer: this is an exceptionally large and complex field, and the present post does not aspire to be an exhaustive review. Rather, this is intended to be a high-level overview for newcomers to the field. Experienced practitioners will no doubt take issue with some of the statements here; although we've tried to be correct and minimally misleading, inaccuracies certainly remain.)

What do we mean by binding affinity?

Experimentally, teams often work with several related readouts, including Kd and Ki, and assay-dependent quantities like IC50. Kinetic parameters that determine residence time (kon, koff) are also sometimes used. Most physics-based methods aim to estimate a binding free energy, ΔGbind, because it connects most directly to the thermodynamic definition of binding. With appropriate assumptions about the binding model and assay conditions, ΔGbind can be related to Kd or Ki. Unlike Kd or Ki, IC50 is assay- and mechanism-dependent, and it only maps to an inhibition constant under additional assumptions (for example, via the Cheng–Prusoff relationship for simple competitive inhibition).
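These relationships are simple to write down. The sketch below converts a Kd to a binding free energy (relative to the usual 1 M standard state, at T = 298 K) and applies the Cheng–Prusoff correction for simple competitive inhibition; the numeric inputs are hypothetical.

```python
import math

R = 0.0019872  # gas constant, kcal/(mol·K)
T = 298.15     # temperature, K

def kd_to_dg(kd_molar: float) -> float:
    """Binding free energy (kcal/mol) from Kd, relative to a 1 M standard state."""
    return R * T * math.log(kd_molar)

def cheng_prusoff_ki(ic50: float, substrate_conc: float, km: float) -> float:
    """Ki from IC50 for simple competitive inhibition (Cheng–Prusoff)."""
    return ic50 / (1.0 + substrate_conc / km)

# A 1 nM binder corresponds to roughly -12.3 kcal/mol:
print(kd_to_dg(1e-9))  # ≈ -12.28

# An IC50 of 100 nM measured at [S] = Km implies Ki = IC50 / 2 = 50 nM:
print(cheng_prusoff_ki(100e-9, 1e-6, 1e-6))
```

A useful rule of thumb falls out of this: at room temperature, one order of magnitude in Kd corresponds to about 1.36 kcal/mol of binding free energy.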

It is worth noting that even "gold-standard" experimental measurements carry nontrivial uncertainty. Across labs and assay formats, reported Kd and Ki values for the same system can differ by a few tenths of a log unit, and sometimes more. This variability is an important reality check for any model trained on heterogeneous public data, though tighter floors are possible in a single well-controlled assay.

Lower-cost methods often produce uncalibrated scores that are used primarily for ranking rather than for absolute ΔG prediction. Docking scores are best treated as heuristic ranking signals, and even methods that output "ΔG-like" values with energy units (such as MM/GBSA) are usually used for relative ranking unless carefully calibrated.
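Because these scores are judged on ranking rather than absolute accuracy, rank correlations (Spearman's ρ, Kendall's τ) are the natural evaluation metric. Here is a minimal pure-Python Spearman implementation (tie-free case, via the classic d² formula), applied to hypothetical docking scores and experimental pIC50 values:

```python
def rank(values):
    """1-based ranks of each value; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n*(n^2-1)) (no ties)."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Hypothetical docking scores (lower = better) vs experimental pIC50s.
# Negating pIC50 aligns the two "lower is better" orderings.
scores = [-9.1, -8.4, -7.9, -7.2, -6.5]
pic50s = [8.0, 7.1, 7.4, 6.2, 5.9]
print(spearman_rho(scores, [-p for p in pic50s]))  # 0.9: strong but imperfect ranking
```

Production code would use `scipy.stats.spearmanr`, which handles ties properly; the point here is just that the metric cares only about ordering, not about the units or calibration of the raw scores.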

In practice, many teams layer methods together: docking to generate poses, MM/GBSA or ML rescoring to refine ranking, and FEP on a focused set of top candidates. Most of the individual methods below can be (and often are) used in combination, and a layered workflow is often more effective than relying on any single approach.

1. Ligand-based similarity methods

Ligand-based virtual screening starts from the heuristic that similar molecules often have similar activity, while recognizing that activity cliffs are common. The simplest and most ubiquitous operationalization of this is 2D fingerprint similarity—computing Tanimoto coefficients over extended-connectivity fingerprints (ECFP4/Morgan fingerprints) to identify analogs of known actives. This is fast, trivially parallelizable, and often a surprisingly strong baseline.
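The Tanimoto coefficient itself is simple: the number of on-bits two fingerprints share, divided by the number of on-bits in either. The toy sketch below operates on hypothetical sets of fingerprint bit indices; in practice you would generate Morgan/ECFP fingerprints with a cheminformatics toolkit such as RDKit rather than writing them by hand.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints, represented as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Hypothetical on-bit sets for a query active and two library molecules:
query   = {12, 87, 301, 512, 777}
analog  = {12, 87, 301, 512, 941}   # shares most substructural features
distant = {44, 87, 600}             # shares almost nothing

print(tanimoto(query, analog))   # 4 / 6 ≈ 0.667
print(tanimoto(query, distant))  # 1 / 7 ≈ 0.143
```

Typical practice is to flag library molecules above some similarity cutoff (often around 0.3–0.4 for ECFP4, though the right threshold is dataset-dependent) as candidate analogs of the query.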

Shape-based methods extend similarity into 3D by aligning molecules and scoring how well their volumes and pharmacophoric features overlap. ROCS (Rapid Overlay of Chemical Structures) is the classic example; FastROCS is the high-throughput GPU-accelerated version designed for very large libraries. The "color" component of ROCS, which captures overlap of pharmacophoric and chemical features beyond simple 3D shape, is often critical; shape alone can miss many meaningful chemical distinctions (e.g. pyridine vs benzene).

Pharmacophore models are a related ligand-based approach. Rather than comparing full molecular shapes, pharmacophore methods define an abstract spatial arrangement of chemical features (hydrogen-bond donors & acceptors, hydrophobic regions, aromatic rings, charged groups, and so on) and screen for molecules that match. Pharmacophore models can be built from a set of known actives alone or combined with protein-structure information; they are particularly useful for scaffold hopping, since chemically dissimilar molecules can satisfy the same feature geometry.

When you have a known ligand and you care about finding analogs or scaffold hops that preserve key features, similarity and pharmacophore methods can be extremely cost-effective. A key limitation is the need for at least one reasonable query ligand (and, for 3D methods, a plausible query pose), and performance can degrade when the binding mode is not conserved or when active structures are diverse in shape.

Key publications:

Representative tools:

2. Docking and classical scoring functions

Docking tries to answer two questions: (1) what pose does the ligand adopt, and (2) how good is that pose? Most docking engines generate many poses in a binding site and rank them with a scoring function that mixes simplified physics with empirical calibration. Docking is fast enough for screening, provides interpretable poses, and is often the first structure-based method that drug-discovery teams reach for.

Rowan's docking workflow

A visual representation of different docking poses bound to the protein target. The pocket for docking is visible as a cube.

In practice, docking scores are usually poor surrogates for binding free energies, especially across diverse chemotypes. Scoring functions fail to account for numerous physical effects: protein flexibility, waters and ions, alternate protonation states, strain and desolvation effects, and more. As a result, docking is often more reliable for pose generation than for potency ranking, particularly when protein flexibility, waters, or protonation states matter, and obtaining meaningful enrichment from docking scores at scale requires considerable caution. Benchmarking docking enrichment is itself non-trivial—the DUD-E benchmark set (Mysinger et al., 2012) is widely used but has known biases that can inflate apparent performance, particularly for machine-learned models.
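Enrichment at a given fraction of the ranked list is the standard screening metric: the hit rate among the top x% of the ranking, relative to the hit rate expected at random. A minimal sketch, assuming lower docking scores are better and using a hypothetical ten-compound library:

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """EF@fraction: hit rate in the top fraction of the ranked list,
    divided by the hit rate expected from random selection."""
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, is_active))  # lower score = better
    hits_top = sum(active for _, active in ranked[:n_top])
    total_hits = sum(is_active)
    return (hits_top / n_top) / (total_hits / n)

# Toy library: 10 compounds, 2 actives, one of which docks at the very top.
scores = [-9.5, -9.1, -8.8, -8.2, -8.0, -7.7, -7.5, -7.1, -6.8, -6.0]
active = [1,    0,    1,    0,    0,    0,    0,    0,    0,    0]
print(enrichment_factor(scores, active, fraction=0.2))  # (1/2) / (2/10) = 2.5
```

Real evaluations use far larger decoy sets and often early-recognition metrics like BEDROC instead, but the caveats above apply to all of them: an enrichment number is only as meaningful as the benchmark's decoys are unbiased.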

Some of these limitations are manageable in practice. Water thermodynamics analyses can help identify conserved and displaceable waters and can sometimes improve pose selection and triage, though results are system- and protocol-dependent. Similarly, careful attention to protonation and tautomeric states, supported by tools that enumerate relevant forms for both protein and ligand, can help.

Key publications:

Representative tools:

3. QSAR and supervised ML potency models

Quantitative structure–activity relationship (QSAR) models learn relationships between chemical structure and activity from experimental data. These can range from simple linear models on hand-crafted descriptors to modern deep-learning approaches including graph neural networks, transformers, and multitask models. If you have a consistent assay and enough data within a chemical series, QSAR is often the fastest way to get useful potency prioritization.

At the simplest end, Free–Wilson analysis (Free and Wilson, 1964) decomposes activity into additive contributions from substituents at defined positions on a shared scaffold. This remains a valuable tool in lead optimization: it is interpretable, fast to build, and can perform well when additivity approximately holds. (Pat Walters has demonstrated a Python implementation of Free–Wilson analysis on his Practical Cheminformatics blog.)
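The additivity assumption is easy to see with a toy example. With perfectly additive SAR on a two-position scaffold, each substituent contribution is just the activity difference from the unsubstituted parent; real Free–Wilson analyses fit these terms by linear regression over many compounds. All values below are hypothetical.

```python
# Toy Free-Wilson decomposition on a two-position scaffold (hypothetical pIC50s).
data = {  # (R1, R2) -> pIC50
    ("H",  "H"):  6.0,
    ("Me", "H"):  6.5,
    ("H",  "Cl"): 6.3,
}

baseline = data[("H", "H")]
contrib_r1_me = data[("Me", "H")] - baseline  # +0.5 from Me at R1
contrib_r2_cl = data[("H", "Cl")] - baseline  # +0.3 from Cl at R2

# Predict the unmeasured (Me, Cl) analog by summing contributions:
pred = baseline + contrib_r1_me + contrib_r2_cl
print(round(pred, 1))  # 6.8
```

The value of the method in practice is largely diagnostic: when measured activities deviate systematically from these additive predictions, that deviation itself points to a positional interaction worth understanding.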

Matched molecular pair (MMP) analysis is a related idea—rather than decomposing activity on a scaffold, MMP analysis identifies pairs of molecules that differ by a single structural transformation and tabulates the associated activity change. MMP databases built from large corporate datasets can be remarkably useful for predicting the effect of common medicinal chemistry moves.
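The core bookkeeping is straightforward: group observed activity changes by transformation and summarize them. A sketch with hypothetical pairs:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical matched pairs: (transformation, observed delta pIC50 for one pair).
pairs = [
    ("H>>Cl", 0.4), ("H>>Cl", 0.2), ("H>>Cl", 0.6),
    ("H>>OMe", -0.3), ("H>>OMe", -0.1),
]

by_transform = defaultdict(list)
for transform, delta in pairs:
    by_transform[transform].append(delta)

for transform, deltas in sorted(by_transform.items()):
    print(transform, round(mean(deltas), 2), f"(n={len(deltas)})")
# H>>Cl gains ~0.4 log units on average here; H>>OMe loses ~0.2
```

With real data, the spread around each mean matters as much as the mean itself: a transformation with a good average effect but high variance is context-dependent, not a reliable medicinal chemistry move.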

Modern ML-based QSAR typically uses molecular graph- or fingerprint-based representations fed to random forests, gradient-boosted trees, or graph neural networks. These approaches can capture nonlinear SAR and generalize further than Free–Wilson analyses, but they can also struggle in practice: a model can look excellent under retrospective random splits yet fail prospectively when the next design cycle moves into new chemical space or changes assay context. Understanding when and how QSAR models fail—and designing evaluation protocols (temporal splits, scaffold splits) that expose these failures before they matter—is at least as important as choosing the right architecture.
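A temporal split is trivial to implement and usually far more honest than a random split, because it mimics the prospective setting: the model never sees "future" chemistry. A sketch with hypothetical dated assay records:

```python
from datetime import date

# Hypothetical assay records: (registration_date, compound_id, pIC50).
records = [
    (date(2024, 3, 1),  "CPD-001", 6.2),
    (date(2024, 5, 9),  "CPD-002", 6.8),
    (date(2024, 8, 20), "CPD-003", 7.1),
    (date(2025, 1, 15), "CPD-004", 7.5),
    (date(2025, 2, 2),  "CPD-005", 7.9),
]

# Train on everything registered before the cutoff; test on what came after.
cutoff = date(2025, 1, 1)
train = [r for r in records if r[0] < cutoff]
test = [r for r in records if r[0] >= cutoff]
print(len(train), len(test))  # 3 2
```

Scaffold splits follow the same pattern, except compounds are grouped by Bemis–Murcko scaffold (or by cluster) before splitting, so that no scaffold appears on both sides of the boundary.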

Key publications:

Representative tools:

4. Endpoint methods (MM/GBSA, MM/PBSA, and related approaches)

MM/GBSA and MM/PBSA compute a ΔG-like score by combining molecular-mechanics terms with an implicit-solvent model (and optionally an approximate entropy term), typically over MD snapshots. They are often called "endpoint" methods because they do not explicitly alchemically transform one ligand into another, instead comparing bound and unbound states for a given ligand.

In practice, these endpoint methods occupy a middle ground between docking and full free-energy methods. They are often more physically motivated than docking scores and can incorporate local relaxation, but they remain highly protocol-sensitive: implicit solvent is an aggressive approximation, conformational sampling is typically limited, and entropy estimation is imperfect and protocol-dependent. They can be useful for pose rescoring and for rough comparisons within closely related sets, but prospective performance varies widely with force field, sampling, and solvation-model choices.
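The arithmetic behind the common single-trajectory protocol is just an average of per-frame differences, ΔG ≈ ⟨G_complex − G_receptor − G_ligand⟩, where receptor and ligand terms are extracted from the same complex frames. A sketch with hypothetical per-snapshot MM+GBSA energies (kcal/mol):

```python
from statistics import mean

# Hypothetical per-snapshot MM + GBSA energies (kcal/mol) from an MD trajectory.
# Single-trajectory protocol: receptor and ligand terms come from the same frames.
g_complex  = [-4512.3, -4510.8, -4514.1, -4511.6]
g_receptor = [-4321.0, -4320.2, -4322.5, -4320.9]
g_ligand   = [-156.2, -155.9, -156.8, -156.1]

# Endpoint estimate: average the per-frame differences. No entropy term is
# included here; some protocols subtract a normal-mode or interaction-entropy
# estimate, which adds its own noise.
per_frame = [c - r - l for c, r, l in zip(g_complex, g_receptor, g_ligand)]
dg_est = mean(per_frame)
print(round(dg_est, 1))  # -34.8
```

Note that the estimate is a small difference of very large numbers, which is part of why these scores are so sensitive to force field, sampling, and solvation-model choices.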

Linear interaction energy (LIE) is a related endpoint approach that estimates binding free energy from the average electrostatic and van der Waals interaction energies between ligand and surroundings in bound and free states, scaled by fitted coefficients. LIE is conceptually simpler than MM/GBSA and avoids some of its approximations (particularly the implicit-solvent decomposition), but requires empirical parameterization and has seen less widespread adoption.

Semiempirical quantum chemistry has also been explored as a single-structure or limited-sampling scoring approach. SQM2.20 uses a semiempirical QM Hamiltonian to score protein–ligand complexes directly; the authors have reported strong benchmark performance and external evaluations are starting to appear, but its practical domain of reliability is still being mapped.

Key publications:

Representative tools:

5. Relative binding free-energy perturbation (RBFE FEP)

Relative binding free-energy methods compute the change in binding free energy between two ligands by "alchemically" transforming one ligand into another in both solvent and the protein complex. The output is ΔΔG between ligands, which is often exactly what a medicinal chemist needs for SAR ranking. Because many systematic errors partially cancel between similar ligands, RBFE can be among the most accurate prospective options for congeneric series when binding modes are stable.

The main practical constraints are setup complexity, sampling, and overall cost. Robust RBFE requires sensible ligand mappings, careful handling of protonation and tautomers, attention to net charge changes, and enough sampling to converge relevant degrees of freedom. Conversely, RBFE can struggle when ligands change binding mode, when slow protein motions matter, or when the water network reorganizes substantially across the series. These failure modes are increasingly addressable: REST2 (replica exchange with solute tempering) and related enhanced-sampling techniques can improve convergence for slow degrees of freedom, and grand-canonical methods for water insertion and deletion can handle cases where the water network changes between ligands.
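One standard internal consistency check on an RBFE network: because free energy is a state function, ΔΔG values around any closed cycle of transformations should sum to roughly zero, and large cycle-closure errors flag convergence or setup problems. A sketch with hypothetical edges:

```python
# Hypothetical RBFE edges: (ligand_from, ligand_to) -> ddG in kcal/mol.
edges = {
    ("A", "B"): -0.8,
    ("B", "C"): 0.5,
    ("C", "A"): 0.4,  # would be +0.3 if the cycle closed perfectly
}

def cycle_closure_error(cycle, edges):
    """Sum ddG around an ordered cycle of ligands, e.g. ["A", "B", "C"].
    Edges traversed against their stored direction contribute with flipped sign."""
    total = 0.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        total += edges[(a, b)] if (a, b) in edges else -edges[(b, a)]
    return total

err = cycle_closure_error(["A", "B", "C"], edges)
print(round(err, 2))  # 0.1 kcal/mol of hysteresis around A->B->C->A
```

Production tools go further and use all cycles simultaneously to produce maximum-likelihood per-ligand ΔG estimates from the edge ΔΔGs, but the per-cycle residual remains the quickest sanity check.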

Rowan's RBFE FEP workflow

An example of a sample output graph from an RBFE FEP calculation.

Computational cost remains considerable, however. RBFE calculations, while faster than absolute binding free-energy calculations (vide infra), are still slow compared to most other computational chemistry methods. In production-like protocols, RBFE commonly costs on the order of single to tens of GPU-hours per edge, depending on sampling, repeats, and the difficulty of the transformation. In cloud settings, marginal compute costs per evaluated edge can be nontrivial ($10 or more), which is why most teams reserve RBFE for a focused set of high-value decisions.

Development of improved RBFE engines remains an important scientific area. Various companies and research groups are investigating the use of neural network potentials or low-cost quantum-chemical methods as more accurate replacements for force fields, while other teams are studying ways to accelerate RBFE calculations through techniques like local resampling and non-equilibrium switching.

Key publications:

Representative tools:

6. Absolute binding free-energy perturbation (ABFE FEP)

Absolute binding free energy methods aim to compute ΔGbind for a single ligand, typically by applying restraints and alchemically decoupling the ligand from its environment. ABFE is appealing because it promises transferability across chemotypes, including cases where there is no obvious mapping between ligands.

In practice, ABFE is harder to converge than RBFE. Convergence is more challenging—ABFE typically requires substantially more sampling per compound than RBFE, which translates into slower and more expensive calculations—and results are more sensitive to protocol choices. Waters, multiple binding modes, and protein reorganization can dominate the error budget, and they are hard to capture reliably without substantial sampling. That said, there has been significant progress, and ABFE is increasingly used in targeted scenarios where RBFE is not applicable: comparing hits from different chemical series, evaluating fragment binding hypotheses, or connecting distinct scaffolds within a program.

Key publications:

Representative tools:

7. Structure-based ML scoring

A newer category of methods predicts potency-like values or ranking scores from a 3D protein–ligand complex using machine learning. These range from models that score a given pose (analogous to a classical scoring function, but learned from data) to models that jointly predict the 3D structure and affinity from sequence and SMILES alone. The distinction matters because pose-conditioned models inherit pose errors, while joint structure-plus-affinity models compound uncertainty from two difficult predictions.

Among pose-scoring models, early examples include RF-Score (Ballester and Mitchell, 2010), which applied random forests to interaction features, and more recent neural approaches like KDEEP (Jiménez et al., 2018) and gnina (McNutt et al., 2021), which use convolutional neural networks on voxelized representations of the binding site. AEV-PLIG is another representative example, producing affinity estimates from bound complex structures quickly enough for interactive use.

On the joint-prediction side, Boltz-2 (Wohlwend et al., 2025) predicts both the 3D structure of the protein–ligand complex and a binding affinity estimate starting from protein sequence and ligand SMILES. (For more details, see our Boltz-2 FAQ.) DiffDock (Corso et al., 2023) uses a diffusion generative model for pose prediction with learned confidence scores, though it does not directly predict affinity.

Rowan's protein–ligand co-folding workflow

An example of a Boltz-2 calculation with associated binding-affinity predictions.

Many structure-based ML scoring functions perform best near their training distribution and can degrade on new targets, new protein families, or novel chemotypes. A growing body of work has focused on understanding and mitigating this problem. Much of the field has historically trained on PDBbind (Wang et al., 2004), which has well-documented biases: random train/test splits leak information through protein family and ligand similarity, and temporal or scaffold-based splits reveal substantially worse generalization. Careful evaluation hygiene—using time-splits, target-splits, or prospective validation—is essential for understanding what these models can actually do.

A practical way to think about these models is as learned scoring functions that can be extremely useful within a validated domain, but that should not be assumed to generalize without targeted testing. They can be very useful for triage and ranking when you have plausible poses, but they should be validated in the specific context you plan to use them, especially when making expensive synthesis decisions.

Key publications:

Representative tools:

Which approach should you use?

No single method is best across all stages of a program, but a method's cost and failure modes can be matched to the decision at hand.

If you are doing hit finding or exploring ultra-large libraries, prioritize throughput and enrichment. Ligand-based similarity methods (fingerprints, shape, pharmacophores) and docking are often the right first layer, sometimes combined with fast ML rescoring when you have plausible poses.

If you are in lead optimization with a congeneric series and stable binding mode, RBFE FEP is usually the most decision-relevant tool because it targets ΔΔG directly and tends to be most accurate when chemical changes are incremental. For simpler SAR questions in a well-behaved series, Free–Wilson or MMP analysis may give you what you need faster.

If you need to compare across scaffolds, ABFE and carefully validated structure-based ML models become more attractive, but you should expect higher variance and more protocol sensitivity than in RBFE.
