An Introduction to Neural Network Potentials

by Ari Wagen · Dec 19, 2024

Neural network potentials (NNPs)—also called neural network interatomic potentials (NNIPs), machine-learned interatomic potentials (MLIPs), and occasionally machine-learned forcefields (MLFFs)—are machine-learned (ML) models that approximate solutions of the Schrödinger equation, allowing systems of atoms to be modeled with quantum-level accuracy.

This article is written to serve as an introduction to NNPs for scientists without prior knowledge of the field, and it contains the following sections:

  1. Why NNPs Matter
  2. Early NNPs
  3. NNP Datasets
  4. NNP Architectures
  5. General Pretrained NNPs
  6. Benchmarking NNPs
  7. Applications of NNPs in Life and Materials Sciences
  8. Future Directions for NNPs
  9. How to Start Using NNPs
  10. How to Start Training and Fine-Tuning NNPs

Why NNPs Matter

NNPs are an emerging technology for atomistic simulation. But why do we care about simulating atoms in the first place? If we can accurately model atoms, molecules, and systems of molecules, then we can study phenomena that are too small and fast to observe by other methods and we can learn things about chemistry without having to run error-prone and expensive experiments in the real world. In theory, this should let us accelerate the pace of scientific research in important fields like drug discovery and materials science; in practice, simulating atoms turns out to be extremely difficult.

The Schrödinger equation is a quantum mechanical equation that describes the wave-like nature of electrons. When solved, this equation exactly describes the behavior of non-relativistic systems (within the Born–Oppenheimer approximation). Sadly, this equation is very difficult to solve: anything beyond the simplest one- and two-particle systems introduces many-body interactions that make analytical solutions impossible. Even if we treat propane's nuclei as fixed and solve only for the electronic wavefunction, an exact calculation would still take an estimated 6.6 years to compute.

To model systems in a reasonable amount of time, different approximations have been made to the Schrödinger equation, leading to the creation of a number of quantum mechanics (QM) "methods," including the Hartree–Fock method, density-functional theory (DFT), Møller–Plesset perturbation theory, and coupled cluster.

These different methods are among the most complex and compute-intensive algorithms in scientific computing; but at a very high level, they can all be thought of as functions that take a list of atomic numbers and their Cartesian coordinates and return that system's energy, the gradients of that energy with respect to the atomic positions (the negatives of which are the forces), and a number of other electronic properties, including the dipole moment, excitation energies, and electron distribution (Figure 1A).
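To make this function-like framing concrete, here's a minimal sketch using PySCF (one of the open-source QM packages discussed below); exact calls may differ slightly between versions, but the shape of the workflow (structure in, energy and properties out) is the point.

```python
# Minimal sketch: a QM engine as a "function" from a structure to energy and properties.
from pyscf import gto, scf

# Hydrogen fluoride: element symbols and Cartesian coordinates (angstroms)
mol = gto.M(atom="H 0 0 0; F 0 0 0.92", basis="sto-3g")

mf = scf.RHF(mol)                           # Hartree-Fock, the simplest ab initio method
energy = mf.kernel()                        # total energy (hartree)
gradients = mf.nuc_grad_method().kernel()   # dE/dx for each atom; forces are the negatives
dipole = mf.dip_moment()                    # dipole moment vector (debye)

print(energy, gradients.shape, dipole)
```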

Unfortunately, these algorithms—while tractable for systems up to hundreds of atoms—are still very slow. To accelerate QM-style simulation, people have fit the potential energy surface of atomic systems to simple polynomial equations. For simple systems, this can be done very accurately, enabling much longer simulations to be run at high accuracy; this has been done for a number of systems, including hydrogen sulfide, H+H2, NaCl–H2, H2O−H2, and H+D. However, this approach becomes very complex for systems with more than a few atoms.
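To see the polynomial-fitting idea in its simplest form, here's a toy sketch: a Morse curve stands in for computed QM energies of a diatomic molecule, and a polynomial is fit to it. (Real potential energy surface fits for the systems above are far more sophisticated.)

```python
import numpy as np

# Toy stand-in for QM data: a Morse potential for a generic diatomic molecule
r = np.linspace(0.6, 3.0, 40)                         # bond lengths (angstrom)
d_e, a, r_e = 0.18, 1.9, 0.96                         # arbitrary toy parameters
energies = d_e * (1.0 - np.exp(-a * (r - r_e))) ** 2  # "reference" energies (hartree)

# Fit the potential energy curve to a simple polynomial and check the error
coeffs = np.polyfit(r, energies, deg=6)
fitted = np.polyval(coeffs, r)
print("max fit error (hartree):", np.abs(fitted - energies).max())
```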

Advances in machine learning allow us to scale fast approximations of QM to large and complex systems. Instead of fitting a number of polynomial terms to a potential energy surface, ML models can fit millions of parameters to more complex QM data. With these models (NNPs) trained to work as functional approximators of QM, high-accuracy atomistic simulations can now be run many orders of magnitude faster than traditional QM simulations.

A simplified overview of QM and NNPs

Figure 1. A simplified overview of QM and NNPs.

It's worth noting a distinction that will be important in the rest of this article. In molecular simulation, there are two types of calculations: molecular and periodic. Briefly, molecular calculations model finite molecular systems (isolated molecules or groups of molecules surrounded by a vacuum or a dielectric field) while periodic calculations model infinite systems by using a repeating unit cell, where the molecule or group of molecules "sees" itself tiled infinitely in all dimensions. Molecular and periodic QM are usually implemented in different software packages; similarly, NNPs tend to be trained on either only molecular or only periodic QM data.
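In code, this distinction often shows up as nothing more than a periodic-boundary flag and a unit cell. Here's a minimal sketch using ASE's Atoms object (the coordinates and cell below are illustrative, not optimized geometries):

```python
from ase import Atoms

# Molecular: an isolated water molecule surrounded by vacuum (no periodic boundaries)
water = Atoms("OH2",
              positions=[(0.00, 0.00, 0.12),
                         (0.00, 0.76, -0.48),
                         (0.00, -0.76, -0.48)])

# Periodic: two atoms in a cubic unit cell, tiled infinitely in all three dimensions
rock_salt = Atoms("NaCl",
                  positions=[(0.00, 0.00, 0.00), (2.82, 0.00, 0.00)],
                  cell=[5.64, 5.64, 5.64],
                  pbc=True)

print(water.pbc, rock_salt.pbc)   # [False False False] vs. [ True  True  True]
```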

There are a variety of commercial, academic, and open-source QM software packages used to run high accuracy simulations and generate training data for NNPs. These include PySCF, Psi4, CP2K, GAMESS, Gaussian, Q-Chem, FHI-aims, Schrödinger's Jaguar, FACCTs' ORCA, TURBOMOLE, PetaChem's TeraChem, NWChem, Quantum Espresso, and VASP.

An overview of widely-used quantum chemistry software engines

Figure 2. An overview of widely-used commercial, academic, and open-source QM software packages. Some software packages can run both periodic and molecular calculations; in these cases, we've used our discretion in assessing which usage dominates per package.

QM-style simulations aren't the only way to simulate systems of atoms. Molecular mechanics (MM) is an approach to simulation that employs parametric energy-evaluation schemes called forcefields. Forcefields can get quite complex, with terms for bond stretches, angle bends, dihedral rotations, van der Waals forces, polarizability, and more. Despite this complexity, each individual term is typically a simple functional form: a polynomial, a sine or cosine series, or another easily computed expression.
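For a flavor of what these functional forms look like, here's a minimal sketch of three common terms: a harmonic bond stretch, a cosine torsion term, and a Lennard-Jones van der Waals term. The parameters are made-up toy values, not taken from any real forcefield.

```python
import math

def harmonic_bond(r, k, r0):
    """Bond stretch: a simple polynomial (quadratic) in the bond length."""
    return 0.5 * k * (r - r0) ** 2

def torsion(phi, v_n, n, gamma):
    """Dihedral rotation: a cosine series term."""
    return 0.5 * v_n * (1 + math.cos(n * phi - gamma))

def lennard_jones(r, epsilon, sigma):
    """Van der Waals interaction between two non-bonded atoms."""
    return 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

# Toy numbers, just to show how cheap these terms are to evaluate
print(harmonic_bond(1.10, k=620.0, r0=1.09) +
      torsion(math.pi / 3, v_n=1.4, n=3, gamma=0.0) +
      lennard_jones(3.4, epsilon=0.36, sigma=3.4))
```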

The simplicity of these functional forms allows for the efficient simulation of large systems over long time scales. MM methods are commonly used for molecular dynamics (MD) and free-energy-perturbation studies, especially to model protein–ligand interactions or to study conformational motion in biomolecules. However, MM forcefields are generally less accurate and less expressive than QM methods, which limits the reliability of their predictions. One of the major goals of NNPs is to bring QM-level accuracy to the important workflows for which forcefields are typically used today.

To learn more about QM and MM methods, we recommend the following resources:

Early NNPs

The first1 modern NNP was published by Jörg Behler and Michele Parrinello in a 2007 Phys. Rev. Lett. paper studying bulk silicon. This NNP used a simple architecture and was trained on 8,200 DFT energies. Unlike most QM engines, NNP architectures (including this first one) can handle both molecular and periodic systems.

Early researchers in this field trained new NNPs for every system under study. For example, in 2012, Behler and coworkers published an NNP trained on the water dimer. Their NNP learned the full potential energy surface of the water dimer in good agreement with their DFT calculations.

Behler and Parrinello's NNP architecture from their 2007 paper

Figure 3. The basic NNP architecture used by both Behler and Parrinello (2007) and Behler (2012). Atomic coordinates are fed through symmetry functions and then "fed forward" through a number of "hidden" matrix layers to produce per-atom energies. These per-atom energy contributions are summed to yield a total energy prediction. This illustration is taken from Figure 1 of Behler 2012.

In 2017, building on the Behler and Parrinello symmetry functions and architecture, Justin Smith, Olexandr Isayev, and Adrian Roitberg released ANI-1, a general pretrained NNP. The ANI-1 dataset comprised "~17.2 million conformations generated from ~58k small molecules." These molecules contained up to 8 heavy atoms and were built from the elements H, C, N, and O. In the paper, the authors go on to show that the ANI-1 NNP makes accurate predictions on systems containing more than 8 heavy atoms, in good agreement with QM potential energy surfaces. This result showed that NNPs could generalize to unseen topologies: instead of training a new NNP for every system of interest, scientists could use a single pretrained NNP to model unseen molecules with good accuracy, so long as the model had been trained on similar types of molecules.

ANI-1's performance improves as the size of its training data increases

Figure 4. ANI-1 showed that NNPs could generalize across topologies at scale. This illustration is taken from figure 3 of Smith et al. 2017 (the ANI-1 paper).

NNP Datasets

Before continuing to look at the inner workings of NNPs, it's worth taking some time to look at the data these models are trained on. NNPs are trained to recreate the results of QM software. This means that given some list of atomic numbers and positions, NNPs should be able to predict energies, forces, etc. that closely match those predicted by QM.

To match the breadth of QM, people training NNPs spend a lot of time thinking about how to sample molecular space well to ensure that NNPs will be able to handle all kinds of structures. In this vein, Jean-Louis Reymond and co-workers have worked on a series of databases that enumerate all chemically feasible molecules under sets of constraints:

These enumerated structure databases have been used as the starting point for QM databases, including:

The Materials Project (paper, website) was started in the 2010s to be an open repository for computational data on materials. This dataset was originally centered around structures relevant to battery research, but it has since expanded. The Materials Project is the basis of a widely used QM dataset:

To help create datasets that could be used for catalyst design and optimization, Meta and Carnegie Mellon teamed up to start the Open Catalyst Project. The Open Catalyst Project generated QM datasets that included structures relevant to surface catalysis, including:

In 2023, Meta and Georgia Tech collaborated to work on direct air capture, releasing:

In 2024, Meta continued this work, releasing:

In 2023, Peter Eastman teamed up with others to generate very accurate QM data that would enable ML models to predict drug-like small molecule and protein interactions, leading to:

Finally, another dataset that sees frequent use is Miguel Marques and coworkers':

Which should not be confused with:

To read more about datasets that have been created for training NNPs, we recommend:

NNP Architectures

In NNPs, molecular structures are commonly represented as graphs. In this representation, each atom corresponds to a node (or "point") in the graph, and edges connect an atom to its neighboring atoms within a specified cutoff radius. This cutoff defines which atoms are considered "neighbors," ensuring that each atom primarily interacts only with those in its local vicinity. This principle is known as the "locality assumption": the idea that an atom's environment—and hence its contribution to the system's energy—depends predominantly on atoms within a local spherical region around it.
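A minimal sketch of how such a graph can be built for a small, non-periodic system: every ordered pair of atoms closer than the cutoff radius becomes an edge.

```python
import numpy as np

def build_edges(positions: np.ndarray, cutoff: float) -> np.ndarray:
    """Return (i, j) index pairs for all atoms within `cutoff` of each other."""
    diffs = positions[:, None, :] - positions[None, :, :]   # (N, N, 3) displacement vectors
    dists = np.linalg.norm(diffs, axis=-1)                  # (N, N) pairwise distances
    i, j = np.where((dists < cutoff) & (dists > 0.0))       # exclude self-pairs
    return np.stack([i, j])                                 # shape (2, num_edges)

coords = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.1], [0.0, 0.0, 2.2]])
print(build_edges(coords, cutoff=1.5))   # only adjacent atoms count as neighbors
```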

In some NNPs, these local environment representations are then transformed by symmetry functions (more on this in the "Ensuring Invariance/Equivariance" section below). After obtaining these local representations, the neural network processes them through a series of hidden layers to produce atomic energy contributions. Finally, these per-atom energy contributions are summed across all atoms, resulting in a single total energy prediction for the entire structure.
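Here's a toy PyTorch sketch of this sum-of-per-atom-energies structure; the per-atom input features are random placeholders, standing in for symmetry functions or learned embeddings of each atom's local environment.

```python
import torch
import torch.nn as nn

class ToyNNP(nn.Module):
    """Map per-atom environment features to per-atom energies, then sum them."""
    def __init__(self, n_features: int = 16):
        super().__init__()
        self.atomic_net = nn.Sequential(
            nn.Linear(n_features, 32), nn.SiLU(),
            nn.Linear(32, 32), nn.SiLU(),
            nn.Linear(32, 1),
        )

    def forward(self, atom_features: torch.Tensor) -> torch.Tensor:
        per_atom_energies = self.atomic_net(atom_features)   # (n_atoms, 1)
        return per_atom_energies.sum()                       # single total-energy prediction

model = ToyNNP()
features = torch.randn(5, 16)   # placeholder features for 5 atoms
print(model(features))          # scalar total energy
```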

From this basic architecture, NNP architectures have branched out in several directions. Four main themes have emerged: the addition of global inputs, the tension between fully enforcing equivariance and simplifying models for speed, the tradeoff between predicting forces directly and calculating them as derivatives of the potential energy, and the question of how to best handle long-range interactions.

Adding Global Inputs

Some models incorporate global parameters, such as total charge or thermodynamic conditions, to improve predictions beyond neutral, equilibrium scenarios. For example, in MEGNet (2019), Shyue Ping Ong and colleagues input temperature, pressure, and entropy alongside atomic numbers and positions to better predict materials properties. Oliver Unke, Klaus-Robert Müller, and coauthors give SpookyNet (2021) total system charge and multiplicity, enabling it to distinguish ionic and radical states formed from the same atoms. Other models have since followed SpookyNet's approach, including Olexandr Isayev's group's AIMNet2 (2023).

Ensuring Invariance/Equivariance

"Equivariance" means that if you transform the input of a function, the output changes in a predictable way. "Invariance' means that if you transform the input, the output stays exactly the same. With NNPs, scalar predictions like total energy ought to be invariant; if you rotate a system, its total energy shouldn't change. On the other hand, vector predictions like forces and dipole vectors ought to be equivariant; if you rotate a system, its dipole vector should rotate too.

Different NNP architectures handle questions of symmetry and equivariance very differently. At one extreme, a totally unconstrained NNP that can employ any sort of features is extremely expressive, but might struggle to learn the correct symmetry properties of a system—it's generally quite bad when rotating a molecule changes the output of a calculation. At the other end of the spectrum, an NNP that operates only on invariant features (like Behler–Parrinello symmetry functions) will have the correct rotational invariance, but might struggle to achieve sufficient expressivity to describe chemistry accurately. Equivariant architectures based around spherical harmonics and tensor products (often using packages like Mario Geiger and Tess Smidt's e3nn) inherently produce NNPs with the correct symmetries, but the additional mathematics required for training and inference generally makes these networks substantially slower. Significant work has gone into optimizing these tensor operations (for example, NVIDIA's cuEquivariance), but non-equivariant models still remain substantially faster and can often learn approximate equivariance through train-time data augmentation.

Before 2019, most architectures built on Behler–Parrinello symmetry functions, which are invariant. In NequIP (2021), Simon Batzner, Boris Kozinsky, and colleagues enforce equivariance by restricting operations to equivariant spherical harmonics (using e3nn), increasing data efficiency over older Behler–Parrinello models. Stephan Günnemann's group's GemNet (2021) employs simpler spherical representations to ensure equivariance. Other methods approach equivariance differently, such as Equiformer (2022) by Yi-Lun Liao and Tess Smidt, which uses a transformer-like structure, and So3krates (2022) by J. Thorben Frank, Oliver Unke, and Klaus-Robert Müller, which introduces "spherical harmonic coordinates."

Other architectures opt to "teach" NNPs equivariance by rotationally augmenting data during training rather than strictly enforcing it, resulting in faster inference and reduced memory usage. Sergey Pozdnyakov and Michele Ceriotti describe this approach in "Smooth, exact rotational symmetrization for deep learning on point clouds" (2023). Mark Neumann and coworkers' Orb (2024) rotationally augments its data during training, increasing inference speed by "3–6x." Similarly, Eric Qu and Aditi Krishnapriyan's EScAIP (2024) uses rotational data augmentation to increase inference speed and reduce memory costs, achieving state-of-the-art performance on multiple datasets.
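A minimal sketch of what train-time rotational augmentation can look like: each training example gets a random rotation applied consistently to its coordinates and force labels, while the energy label stays the same. (The arrays below are random placeholders for a real training example.)

```python
import numpy as np

def random_rotation_matrix(rng: np.random.Generator) -> np.ndarray:
    """Draw a random 3D rotation via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs for a well-behaved distribution
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1                 # ensure a proper rotation (det = +1)
    return q

def augment(coords, forces, energy, rng):
    """Rotate coordinates and force labels together; the energy label is invariant."""
    R = random_rotation_matrix(rng)
    return coords @ R.T, forces @ R.T, energy

rng = np.random.default_rng(0)
coords = rng.random((5, 3))   # placeholder coordinates for one training example
forces = rng.random((5, 3))   # placeholder force labels
aug_coords, aug_forces, aug_energy = augment(coords, forces, energy=-76.4, rng=rng)
print(aug_coords.shape, aug_energy)
```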

Predicting Forces

"Forces" are the derivatives of a system's total energy with respect to atomic positions. Accurate force predictions are essential for running simulations like relaxations or molecular dynamics. Any method—whether QM, MM, or NNPs—must be capable of predicting forces accurately. Fortunately, neural networks excel at this task. As Behler and Parrinello noted in their 2007 paper: "Once trained, the atomic coordinates are given to the NN and the potential energy, from which also forces can be calculated analytically, is received."

Analytical force calculation through a neural network is enabled by libraries like PyTorch, using automatic differentiation to compute derivatives. This approach is referred to as "conservative" because the predicted forces respect the conservation of energy.
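Here's a minimal PyTorch sketch of this conservative pattern, with a toy differentiable energy function standing in for a trained NNP: the energy is predicted as a scalar, and the forces come out of autograd as its negative gradient.

```python
import torch

def toy_energy(positions: torch.Tensor) -> torch.Tensor:
    """Stand-in for an NNP's energy head: any differentiable scalar function of positions."""
    dists = torch.cdist(positions, positions)
    mask = torch.triu(torch.ones_like(dists, dtype=torch.bool), diagonal=1)
    return ((dists[mask] - 1.5) ** 2).sum()    # toy harmonic pair energy

positions = torch.rand(4, 3, requires_grad=True)
energy = toy_energy(positions)

# Forces are the negative gradient of the energy with respect to positions
(grad,) = torch.autograd.grad(energy, positions)
forces = -grad
print(energy.item(), forces.shape)    # scalar energy, (4, 3) force array
```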

Since analytical differentiation makes inference slower, some architectures opt to predict forces directly as additional model outputs. This requires training the neural network to predict both energies and forces and eschews the differentiation step, roughly doubling the speed of NNP inference. However, this "non-conservative" method can lead to unstable simulations, as highlighted in Filippo Bigi and others' recent "The dark side of the forces: assessing non-conservative force models for atomistic machine learning."

Handling Long-Range Interactions

The locality assumption—that an atom's energy contribution depends primarily on its neighbors within a certain radius of it—fails to capture long-range interactions like electrostatic energy and van der Waals forces.

To make some of these interactions learnable, Kristof T. Schütt, Alexandre Tkatchenko, Klaus-Robert Müller, and others released SchNet (2017), an NNP that uses a graph-based, "message-passing" architecture as introduced in Justin Gilmer and coworkers' "Neural Message Passing for Quantum Chemistry" (2017). In SchNet's framework, each atom and its neighbors (defined by a local cutoff radius) are represented as nodes and edges in a graph. Through iterative message-passing steps, each atom's embedding is updated based on information propagated from its immediate neighbors, allowing the network to build increasingly accurate, context-aware representations of local chemical environments that reach beyond each atom's immediate neighbors.

Message passing can thus mitigate strict locality, but it often slows models down. Ilyes Batatia, Gábor Csányi, and others speed it up in MACE (2023) by passing fewer, higher-order messages, while Seungwu Han and others' SevenNet (2024) parallelizes many of NequIP's message-passing steps. Some architectures, like Allegro (2023) from Albert Musaelian, Simon Batzner, Boris Kozinsky, and collaborators, abandon message passing entirely for better scalability.
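Here's a stripped-down sketch of a single message-passing update over a precomputed edge list. Real architectures add distance-dependent filters, attention, equivariant features, and so on, but the core pattern of aggregating neighbor information into each atom's embedding looks roughly like this:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of neighbor-to-atom message passing over a precomputed edge list."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.message_fn = nn.Linear(2 * dim, dim)   # build a message from (sender, receiver) embeddings
        self.update_fn = nn.Linear(2 * dim, dim)    # merge aggregated messages into each atom's embedding

    def forward(self, node_feats: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        senders, receivers = edge_index             # (num_edges,) each
        messages = self.message_fn(
            torch.cat([node_feats[senders], node_feats[receivers]], dim=-1)
        )
        aggregated = torch.zeros_like(node_feats)
        aggregated.index_add_(0, receivers, messages)   # sum the messages arriving at each atom
        return self.update_fn(torch.cat([node_feats, aggregated], dim=-1))

layer = MessagePassingLayer(dim=32)
node_feats = torch.randn(5, 32)                          # embeddings for 5 atoms
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # neighbor pairs within the cutoff
print(layer(node_feats, edge_index).shape)               # (5, 32) updated embeddings
```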

Other architectures have added physical equations to capture the effect of these long-range interactions. John Parkhill's group's TensorMol-0.1 (2018) divides total energy predictions into short-range embedded atomic energy, electrostatic energy, and van der Waals energy components. To predict the total energy of a structure, TensorMol-0.1 uses physical equations to calculate electrostatic energy and van der Waals energy before using a standard NNP architecture to predict the remaining energy terms. Other architectures, including Jörg Behler and others' "fourth-generation high-dimensional" NNP (2021), Schrödinger's "QRNN" (2022), and Dylan Anstine, Roman Zubatyuk, and Olexandr Isayev's AIMNet2 (2023), have also adopted this approach, combining machine-learned short-range terms with physics-based long-range components to reduce the need for extensive message passing.

TensorMol-0.1's architecture separates short-range and long-range interactions

Figure 5. By calculating electrostatics and van der Waals forces with simple physical equations, TensorMol-0.1 is able to accurately model long-range interactions while maintaining the speed of other NNPs. This illustration is taken from figure 1 of Yao et al. 2018 (the TensorMol-0.1 paper).
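Here's a simplified sketch of the decomposition idea: the total energy is a machine-learned short-range term plus cheap, physics-based electrostatic and dispersion terms summed over atom pairs. The bare Coulomb and -C6/r^6 expressions below are illustrative; TensorMol-0.1 and its successors use damped, smoothly switched versions of these terms.

```python
import numpy as np

COULOMB_CONSTANT = 332.06371   # kcal/(mol*angstrom*e^2), a common MM convention

def electrostatic_energy(charges, positions):
    """Bare pairwise Coulomb sum (real models damp this at short range)."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    i, j = np.triu_indices(len(charges), k=1)
    return COULOMB_CONSTANT * np.sum(charges[i] * charges[j] / dists[i, j])

def dispersion_energy(c6_coeffs, positions):
    """Toy -C6/r^6 van der Waals attraction."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    i, j = np.triu_indices(len(c6_coeffs), k=1)
    c6_pair = np.sqrt(c6_coeffs[i] * c6_coeffs[j])
    return -np.sum(c6_pair / dists[i, j] ** 6)

def total_energy(short_range_nn_energy, charges, c6_coeffs, positions):
    """E_total = E_NN(short range) + E_electrostatic + E_dispersion."""
    return (short_range_nn_energy
            + electrostatic_energy(charges, positions)
            + dispersion_energy(c6_coeffs, positions))

# Toy usage: two partially charged atoms 3 angstroms apart
positions = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
print(total_energy(short_range_nn_energy=-12.3,
                   charges=np.array([0.4, -0.4]),
                   c6_coeffs=np.array([20.0, 15.0]),
                   positions=positions))
```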

To read more about the development of NNP architectures as well as current problems and their potential solutions, we recommend reading:

General Pretrained NNPs

Traditionally, using an NNP required generating data and training a new model for each new system under study. To help bring NNPs to bear on more problems, a growing number of academic labs, startups, and big tech firms have released general pretrained NNPs (sometimes called "foundation models"). These models are trained once on extensive, diverse datasets, with the goal of capturing broad patterns in atomic behavior. The resulting NNPs can then be used to make predictions on novel systems within a similar domain, drastically reducing setup time and effort, or fine-tuned to achieve better data efficiency for specific problems. However, an NNP's efficacy is only ever as good as the scope and quality of its training data, its architecture, and its training procedure.

Olexandr Isayev's group and Adrian Roitberg's group collaborated on the ANI series of models, including:

Since then, the Isayev group has developed the AIMNet architecture, most recently releasing:

Gábor Csányi's group has released pretrained models using the MACE architecture, including:

Other academic labs working on NNPs have released pretrained models as well, including:

Orbital Materials, a startup using NNPs to design advanced green materials, has also released permissively-licensed pretrained NNPs, including:

Many big tech firms have also released pretrained NNPs for different applications, including:

Google trained an NNP for materials discovery applications, but they didn't release the pretrained model:

(And while we're talking about big tech, Apple has a paper out on using machine learning to generate conformers.)

For a more exhaustive list of pretrained NNPs, we recommend looking at:

Benchmarking NNPs

QM and MM methods rely solely on physical or physically-motivated equations, allowing them to generalize effectively to new atomic systems. NNPs, however, are empirical. To quote Yuanqing Wang:

Conceptually, to accurately fit energies and forces on all chemical spaces is no different than having the MLFF model able to solve the Schrödinger's [sic] equation, which seems impossible, judging from the no free lunch theorem. To this end, QM datasets are always curated with biases, in terms of the coverage on chemical spaces and conformational landscape.

An NNP trained exclusively on inorganic materials is unlikely to make accurate predictions for organic structures. Researchers in this field use various tools and strategies to improve the robustness and utility of NNPs, but rigorous benchmarking is essential to ensure their predictions are decision-worthy. To evaluate the performance of different NNPs on specific tasks, researchers and users depend on standardized benchmarks.

In silico benchmarks evaluate NNP predictions against high-level QM results. Energy benchmarks, for instance, compare the relative energy outputs of NNPs to high-accuracy QM energy values. Some energy benchmarks include:

Other benchmarks assess NNP predictions of forces, dipoles, and other electronic properties, comparing them directly to high-level QM results. Some studies go further, using higher-level benchmarks to compare NNP-predicted potential energy surfaces, radial distribution functions, and folded protein structures against those computed using QM or MM methods.

Some NNPs fail to recreate the radial distribution function of amorphous silicon predicted by DFT

Figure 6. The radial distribution function of amorphous Si as predicted by a number of NNPs trained on materials versus the AIMD prediction (vdW-DF-optB88). This illustration is taken from figure 4 of Wines and Choudhary (the CHIPS-FF paper).

One goal of benchmarking NNPs is to assess whether or not they can support stable molecular dynamics (MD) simulations. In "Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations," Rafael Gomez-Bombarelli, Tommi Jaakkola, and colleagues argue that strong energy and force benchmark results alone are insufficient to guarantee MD stability, and instead benchmark different NNPs by directly running MD simulations and checking for instability.

Experimental benchmarks evaluate NNP predictions against experimental data. Since it's impossible to directly measure the total QM energy of a system to validate single-point energy predictions, these benchmarks are all higher level. Common experimental benchmarks include tests for crystal structure stability ranking and the reproduction of protein crystal structures. For example, X23 is a benchmark measuring molecular crystal cell volume and lattice energy predictions.

Benchmarking NNPs is still a very active and fast-moving area of research.

To compare the performance of NNPs on different benchmarks, we recommend checking out:

Applications of NNPs in Life and Materials Sciences

When applied to real-world problems, NNPs can be thought of either as faster QM methods or as more accurate MM methods. As NNPs become both faster and more accurate, they promise to prove valuable in use cases that currently rely on both QM and MM.

In drug discovery, QM methods are used to generate and score conformers, compute charges, run torsional scans, predict pKa, compute bond-dissociation energies, search for tautomers, model reactivity, predict crystal structure stability, and more (this list is taken from Corin Wagen's "Quantum Chemistry in Drug Discovery"). NNPs are already being used for all of these tasks, helping researchers answer questions about structures in a fraction of the time that QM previously needed.

MM methods, on the other hand, are often used to run molecular dynamics simulations and free energy perturbation on systems consisting of, e.g., a protein, a drug-like small molecule, and solvent molecules. NNPs promise to be useful here as well, but the size of the systems being studied and the length and complexity of the simulations mean that NNPs aren't yet seeing much production use in these areas today.

In materials science, a number of startups, including Orbital Materials, cusp.ai, and Entalpic, are training NNPs to predict materials properties. Academic labs and big tech have also developed NNPs with diverse applications in mind, including:

Future Directions for NNPs

To read about future directions for NNPs, we recommend reading:

How to Start Using NNPs

You can run AIMNet2 and OMat24 through the Rowan platform or API. Rowan makes it easy to run molecular simulation jobs with a simple GUI and cloud-based computing. Rowan offers 500 free credits on sign-up, and an additional 20 credits each week:

Many pretrained NNPs have simple tutorials attached to their GitHubs. We'd suggest starting with one of the following models:

To get the best performance out of an NNP, it's important to run it on a GPU. If you don't have a GPU sitting around at home, that's ok! You can quickly run code on cloud GPUs with Modal, which gives users $30 worth of free credits each month:

To run molecular dynamics with a neural network potential, see:
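For a flavor of what this looks like in practice, here's a minimal sketch that attaches a pretrained MACE model to an ASE Atoms object, computes a single-point energy, and runs a short molecular dynamics trajectory. It assumes the mace-torch package and its mace_off helper for the MACE-OFF organic models; check the MACE documentation for the exact interface and model names.

```python
from ase import units
from ase.build import molecule
from ase.md.langevin import Langevin
from mace.calculators import mace_off   # assumes the mace-torch package is installed

# Attach a pretrained NNP to a structure as an ASE calculator
atoms = molecule("CH3CH2OH")                         # ethanol, built from ASE's molecule library
atoms.calc = mace_off(model="medium", device="cpu")  # use device="cuda" on a GPU

# Single-point prediction: energy (eV) and forces (eV/angstrom)
print(atoms.get_potential_energy())
print(atoms.get_forces())

# A short NVT molecular dynamics run with a Langevin thermostat
dyn = Langevin(atoms, timestep=0.5 * units.fs, temperature_K=300, friction=0.01 / units.fs)
dyn.run(100)
```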

How to Start Training and Fine-Tuning NNPs

If you want to train your own NNP, we recommend starting with:

If you want to finetune a pretrained NNP with your own dataset, we'd recommend checking out:

Thanks to Timothy Duignan, Mark Neumann, Corin Wagen, Jonathon Vandezande, and Eli Mann for reading drafts of this piece and providing valuable feedback. Any errors are mine.


  1. Or, at least, one of the first.
