Which Optimizer Should You Use With NNPs?

by Ari Wagen and Corin Wagen · Sep 4, 2025

On August 27th, Orbital Materials released their latest open-source model, "OrbMol." OrbMol combines Orbital's super-scalable Orb-v3 neural network potential (NNP) architecture with Meta et al.'s massive Open Molecules 2025 (OMol25) dataset (which we've written about previously).

To the best of our knowledge, this is the first NNP trained on the full OMol25 dataset from outside the OMol25 team. This is a huge accomplishment for Orbital Materials, and an impressively fast one too—we can only imagine how many GPUs were involved in the creation of OrbMol!

After running OrbMol through our internal benchmark suite at benchmarks.rowansci.com, one result stuck out as anomalous: OrbMol matches the performance of the UMA and eSEN models trained on OMol25 for single-point-energy-derived tasks, but it frequently failed to complete optimizations of 25 drug-like molecules using the Sella optimizer with 0.01 eV/Å fmax and max. 250 steps.

A screenshot of NNP performance on molecular energy and optimization performance from Rowan's benchmarks site

When I (Ari) shared these results on X, many people commented asking about our choice of Sella. In this blog post, we'll discuss how we conceived of this simple optimization test—what it's good for and what it leaves unanswered—and study the effect of the optimizer on these results by comparing each combination of four optimizers (Sella, geomeTRIC, and ASE's implementations of FIRE and L-BFGS) and four NNPs (OrbMol, OMol25's eSEN Conserving Small, AIMNet2, and Egret-1), with GFN2-xTB serving as a "control" low-cost method.

Why Report Successful Optimization Counts at All?

At Rowan and in our conversations with industry scientists, we've found that the most common use case for NNPs is as a drop-in replacement for density-functional-theory (DFT) calculations. Optimizing a molecule with DFT is often the most time-consuming and compute-intensive step in many standard modeling workflows. NNPs can offer real value in the short term, but only if they can reliably replace DFT for these routine optimization tasks.

Most benchmarks for quantum-chemical methods, semiempirical methods, and NNPs measure a method's ability to reproduce energies at specific points on the potential-energy surface. Many of these energy-vs.-energy benchmarks are quite nuanced and insightful, like GMTKN55, which accumulates results across 55 different subsets of relative-energy challenges, or Wiggle150, which measures a method's ability to rank-order highly strained conformers of three drug-like molecules. However, these energy-based benchmarks can't establish that NNPs reliably serve as replacements for DFT in the context of molecular optimization.

To attempt to fill this gap, we chose 25 drug-like molecules and compared each NNP along the following axes:

- how many of the 25 optimizations converge (within 250 steps)
- how many steps each successful optimization takes
- whether the optimized structures are true local minima, as judged by the number of imaginary frequencies

These tests matter because practitioners use DFT to optimize starting structures with the goal of finding true local minima as quickly as possible. An NNP that aims to replace DFT must not only reproduce conformer rankings and relative energies but also reliably complete optimizations in a reasonable number of steps without yielding imaginary frequencies.

The 25 molecular structures under study are available on GitHub.

Optimizer × NNP Study

We chose four common optimization methods for this benchmark study.

1. L-BFGS

The limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) is a classic quasi-Newton algorithm used in all sorts of contexts. Like all second-order methods, L-BFGS can get confused by noisy potential-energy surfaces. In this study, we used the L-BFGS optimizer implemented in the Atomic Simulation Environment.
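To make the algorithm concrete, here's a minimal L-BFGS sketch (the classic two-loop recursion with a backtracking line search, applied to a toy 2-D quadratic). This is an illustrative implementation, not ASE's; the `fmax` here is the maximum absolute gradient component.

```python
import numpy as np

def lbfgs(f, grad, x0, m=10, fmax=1e-8, max_steps=250):
    """Minimal L-BFGS: two-loop recursion + Armijo backtracking (toy sketch)."""
    x = np.asarray(x0, float)
    g = grad(x)
    s_hist, y_hist = [], []
    for step in range(max_steps):
        if np.abs(g).max() < fmax:
            return x, step
        # Two-loop recursion: multiply g by the inverse-Hessian estimate.
        q, alphas = g.copy(), []
        for s, y in zip(reversed(s_hist), reversed(y_hist)):
            a = s.dot(q) / y.dot(s)
            q -= a * y
            alphas.append(a)
        if y_hist:  # scale by an estimate of the inverse curvature
            q *= s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1])
        for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
            q += (a - y.dot(q) / y.dot(s)) * s
        d = -q
        # Armijo backtracking line search on the energy.
        t, fx, slope = 1.0, f(x), g.dot(d)
        while f(x + t * d) > fx + 1e-4 * t * slope and t > 1e-12:
            t *= 0.5
        x_new, g_new = x + t * d, grad(x + t * d)
        s_vec, y_vec = x_new - x, g_new - g
        if y_vec.dot(s_vec) > 1e-12:  # keep the update only if curvature is positive
            s_hist.append(s_vec); y_hist.append(y_vec)
            if len(s_hist) > m:
                s_hist.pop(0); y_hist.pop(0)
        x, g = x_new, g_new
    return x, max_steps

# Converges quickly on a smooth quadratic "potential-energy surface".
A = np.diag([1.0, 10.0])
x_min, n_steps = lbfgs(lambda x: 0.5 * (x - 1) @ A @ (x - 1),
                       lambda x: A @ (x - 1), np.zeros(2))
print(x_min, n_steps)
```

The curvature condition in the update step hints at why noisy forces hurt: if noise flips the sign of `y·s`, the Hessian estimate would become indefinite, so the update must be discarded.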

2. FIRE

The fast inertial relaxation engine (FIRE) is a first-order minimization method designed for fast structural relaxation. FIRE uses a molecular-dynamics-based approach, making it faster and more noise-tolerant than Hessian-based methods like L-BFGS. However, it's also a bit less precise and often performs worse for complex molecular systems. In this study, we used the FIRE optimizer implemented in the Atomic Simulation Environment.
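The MD flavor of FIRE is easiest to see in code. Below is a minimal sketch of the published algorithm (again illustrative, not ASE's implementation), run on a toy quadratic; all hyperparameter values are the conventional defaults.

```python
import numpy as np

def fire(grad, x0, dt=0.1, dt_max=0.5, n_min=5, f_inc=1.1, f_dec=0.5,
         alpha0=0.1, f_alpha=0.99, fmax=1e-6, max_steps=5000):
    """Minimal FIRE (fast inertial relaxation engine) minimizer (toy sketch)."""
    x = np.asarray(x0, float)
    v = np.zeros_like(x)
    alpha, n_pos = alpha0, 0
    for step in range(max_steps):
        force = -grad(x)
        if np.abs(force).max() < fmax:
            return x, step
        if force.dot(v) > 0:  # power > 0: still moving downhill
            n_pos += 1
            # steer the velocity toward the force direction
            v = (1 - alpha) * v + alpha * np.linalg.norm(v) * force / np.linalg.norm(force)
            if n_pos > n_min:  # accelerate: longer timestep, less steering
                dt = min(dt * f_inc, dt_max)
                alpha *= f_alpha
        else:  # moved uphill: stop, shrink the timestep, reset mixing
            v[:] = 0.0
            dt *= f_dec
            alpha, n_pos = alpha0, 0
        v = v + dt * force  # semi-implicit Euler MD step
        x = x + dt * v
    return x, max_steps

# Relaxes a simple quadratic "PES" using only forces, no Hessian information.
A = np.diag([1.0, 5.0])
x_min, n_steps = fire(lambda x: A @ (x - 1), np.zeros(2))
print(x_min, n_steps)
```

Note that the algorithm only ever uses the current force and velocity; nothing depends on the surface being smooth enough to support a curvature estimate, which is where its noise tolerance comes from.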

3. Sella

Sella is an open-source package developed by the Zador lab at Sandia National Laboratories. While Sella is often used for transition-state optimization, it also implements rational function optimization to optimize structures towards minima. Sella uses an internal coordinate system alongside a trust-radius restriction and a quasi-Newton Hessian update.

4. geomeTRIC

geomeTRIC is a general-purpose optimization library developed by the Wang lab at UC Davis. geomeTRIC uses a special internal coordinate scheme called "translation–rotation internal coordinates" (TRIC); within this coordinate system, it uses standard L-BFGS with line search. For this study, geomeTRIC was used with both Cartesian ("cart") and TRIC ("tric") coordinates.

These choices represent common "good" optimizers used by the community; a recent benchmark from Akhil Shajan and co-workers found Sella to be the best open-source geometry optimizer when using HF/6-31G(d,p) gradients on the Baker test set, although geomeTRIC also did quite well.

Convergence Criteria

Most optimization and quantum-chemistry libraries give users detailed control over convergence criteria. For instance, geomeTRIC determines if a structure is optimized based on five criteria:

- the change in energy between steps
- the RMS gradient
- the maximum gradient component
- the RMS displacement
- the maximum displacement

Similar criteria are used by Gaussian, NWChem, Turbomole, TeraChem, Q-Chem, ORCA, and so on. Unfortunately, ASE's interface for optimizers only exposes fmax, the largest per-atom force norm. (This is one of the reasons we currently use geomeTRIC at Rowan; we want to make sure that "optimized" structures are fully optimized.)

To enable a fair comparison with the ASE optimizers, convergence in this study was determined solely on the basis of fmax; all other convergence criteria were disabled in geomeTRIC. (We're aware that this is suboptimal, and we discuss this further in the "Limitations" section.)
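For concreteness, here's what that fmax check amounts to — to our understanding, ASE optimizers compare the largest per-atom force norm against the user's fmax threshold (the forces below are hypothetical):

```python
import numpy as np

def ase_style_fmax(forces):
    """Largest per-atom force norm: the single quantity that ASE-style
    optimizers compare against the user's fmax threshold (eV/Å here)."""
    forces = np.asarray(forces, float)    # shape (n_atoms, 3)
    return np.sqrt((forces ** 2).sum(axis=1)).max()

# Hypothetical forces on a two-atom system:
forces = [[0.003, 0.000, 0.004],          # |f| = 0.005 eV/Å
          [0.000, 0.008, 0.000]]          # |f| = 0.008 eV/Å
print(ase_style_fmax(forces))             # 0.008 -> not converged at fmax = 0.005
```

A structure can satisfy this single criterion while still failing geomeTRIC's displacement and energy-change tests, which is exactly why relying on fmax alone is suboptimal.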

Results

For brevity, OMol25's eSEN Conserving Small is referred to simply as "OMol25 eSEN" in the tables below.

Number Successfully Optimized

This metric reports how many of the 25 systems each NNP–optimizer pairing was able to successfully optimize.

| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 22ᵃ | 23 | 25 | 23 | 24 |
| ASE/FIRE | 20 | 20 | 25 | 20 | 15 |
| Sella | 15 | 24 | 25 | 15 | 25 |
| geomeTRIC (cart) | 8 | 12 | 25 | 7 | 9 |
| geomeTRIC (tric) | 1 | 20 | 14 | 1 | 25 |

ᵃ Benjamin Rhodes reports that OrbMol successfully optimizes all 25 systems when using "float32-highest" precision and L-BFGS with max. 500 steps.

All of the unsuccessful optimizations failed because they exceeded 250 optimization steps, with a single exception: the Sella optimization of brexpiprazole with OMol25's eSEN Conserving Small failed with a nondescript "Internal Error."

Average Number of Steps

This metric tells us the average number of steps needed for successful optimizations.

| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 108.8 | 99.9 | 1.2 | 112.2 | 120.0 |
| ASE/FIRE | 109.4 | 105.0 | 1.5 | 112.6 | 159.3 |
| Sella | 73.1 | 106.5 | 12.9 | 87.1 | 108.0 |
| geomeTRIC (cart) | 182.1 | 158.7 | 13.6 | 175.9 | 195.6 |
| geomeTRIC (tric) | 11 | 114.1 | 49.7 | 13 | 103.5 |

Number of Minima Found

This metric tells us how many of the 25 systems optimized to a local minimum (as opposed to a saddle point). This is determined via a frequency calculation: the presence of an imaginary frequency indicates that a local minimum was not found.

| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 16 | 16 | 21 | 18 | 20 |
| ASE/FIRE | 15 | 14 | 21 | 11 | 12 |
| Sella | 11 | 17 | 21 | 8 | 17 |
| geomeTRIC (cart) | 6 | 8 | 22 | 5 | 7 |
| geomeTRIC (tric) | 1 | 17 | 13 | 1 | 23 |

Average Number of Imaginary Frequencies

This metric tells us how many imaginary frequencies each successfully optimized structure had on average.

| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 0.27 | 0.35 | 0.16 | 0.26 | 0.21 |
| ASE/FIRE | 0.35 | 0.30 | 0.16 | 0.45 | 0.20 |
| Sella | 0.40 | 0.33 | 0.16 | 0.47 | 0.36 |
| geomeTRIC (cart) | 0.38 | 0.33 | 0.12 | 0.29 | 0.22 |
| geomeTRIC (tric) | 0.00 | 0.15 | 0.07 | 0.00 | 0.08 |

Limitations

This is a blog post, not a full study, and there are plenty of limitations here. A few obvious ones:

- Convergence was judged on fmax alone, since that's all ASE's optimizer interface exposes; stricter multi-criteria convergence might change the rankings.
- We studied only 25 drug-like molecules, with a hard cap of 250 optimization steps.
- We ran every optimizer with its default hyperparameters; tuning could plausibly change the results.

There's so much more work to do here! We hope that our work motivates further research, and we plan to investigate this more ourselves.

Takeaways

Still, we think we have enough data to draw a few tentative conclusions.

1. Better Benchmarks Are Needed

A 2022 paper from Xiang Fu and co-workers stated that "forces are not enough" for benchmarking NNPs. Fu showed that even NNPs with good error metrics for energy and forces could give unrealistic and unstable MD simulations, pushing the field to more aggressively benchmark MD stability with new NNPs (like eSEN, which was specifically designed to give smooth potential-energy surfaces).

Just as Fu showed that "the ability to run stable MD simulations" was an important benchmark for NNPs, we think that "the ability to optimize to true minima on the PES" should also be used for benchmarking future NNPs.

2. The Optimization Method and Coordinate System Matter

Different optimization methods performed very differently, even for the same NNP. In our experience, most scientists spend considerably more time benchmarking the level of theory than the optimizer they use. At least for NNPs, our results suggest that finding the right optimizer for a given class of systems might be just as crucial as finding the right method. If you're going to run hundreds or thousands of optimizations, it might be worth benchmarking optimizers or tuning hyperparameters before running everything.
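As a cartoon illustration of how much optimizer settings can change the cost of reaching the same minimum, consider fixed-step gradient descent on an ill-conditioned quadratic stand-in for a PES (everything here is hypothetical and deliberately oversimplified):

```python
import numpy as np

def gd_steps(step_size, fmax=1e-6, max_steps=20_000):
    """Steps for fixed-step gradient descent to reach fmax on an
    ill-conditioned 2-D quadratic (a stand-in for a stiff PES)."""
    A = np.diag([1.0, 50.0])        # condition number 50
    x = np.array([1.0, 1.0])
    for step in range(max_steps):
        g = A @ x
        if np.abs(g).max() < fmax:  # same convergence test as the study
            return step
        x -= step_size * g
    return max_steps

# The same method with two different step sizes: very different cost.
slow, fast = gd_steps(0.002), gd_steps(0.02)
print(slow, fast)
```

Both runs find the same minimum; only one hyperparameter differs, yet the step counts differ by roughly an order of magnitude. Real optimizers and real NNP surfaces are far more complicated, which is exactly why benchmarking the pairing matters.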

3. NNPs Might Need New Optimizers

Second-order optimizers like L-BFGS make certain assumptions about the smoothness of the potential-energy surface that might not be true for all NNPs. A recent study from Filippo Bigi and co-workers shows that non-conservative NNPs perform worse on geometry-optimization benchmarks, particularly for second-order optimizers; we anticipate that other NNP architectural decisions will have similar impacts on optimization performance.

While MD-based optimizers like FIRE are allegedly better in contexts like this, since they make fewer assumptions about the nature of the potential-energy surface, our results show that FIRE is not significantly better on this benchmark.

We'd like to see more work on building noise-tolerant optimizers for NNPs. Papers like this work from Bastian Schaefer and co-workers suggest that it's possible to improve classic L-BFGS methods for noisy cases, but to our knowledge there aren't good implementations of this work or related methods for molecular geometry optimization. New optimization methods could help scientists better leverage advances in NNPs and conduct high-throughput calculations much more efficiently. If you're interested in working on this, please reach out to our team! We'd like to talk.


So, with all this in mind, let's try to answer the question in the title—if you're running optimizations with NNPs today, which optimizer should you use? Our results suggest that ASE's L-BFGS implementation is the best plug-and-play choice right now: for almost every NNP, L-BFGS converges more structures, and L-BFGS structures have fewer imaginary frequencies. While there are plenty of reasons why you might choose a different optimizer (convergence control, constraints, and so on), we think that L-BFGS is an excellent "first choice" for routine work.
