by Ari Wagen and Corin Wagen · Sep 4, 2025
On August 27th, Orbital Materials released their latest open-source model, "OrbMol." OrbMol combines Orbital's super-scalable Orb-v3 neural network potential (NNP) architecture with Meta et al.'s massive Open Molecules 2025 (OMol25) dataset (which we've written about previously).
To the best of our knowledge, this is the first NNP trained on the full OMol25 dataset from outside the OMol25 team. This is a huge accomplishment for Orbital Materials, and an impressively fast one; we can only imagine how many GPUs were involved in the creation of OrbMol!
After running OrbMol through our internal benchmark suite at benchmarks.rowansci.com, one result stuck out as anomalous: OrbMol matches the performance of the UMA and eSEN models trained on OMol25 for single-point-energy-derived tasks, but it frequently failed to complete optimizations of 25 drug-like molecules using the Sella optimizer with an fmax of 0.01 eV/Å and a maximum of 250 steps.
When I (Ari) shared these results on X, many people commented asking about our choice of Sella. In this blog post, we'll discuss how we conceived of this simple optimization test (what it's good for and what it leaves unanswered) and study the effect of the optimizer on these results by comparing each combination of four optimizers (Sella, geomeTRIC, and ASE's implementations of FIRE and L-BFGS) and four NNPs (OrbMol, OMol25's eSEN Conserving Small, AIMNet2, and Egret-1), with GFN2-xTB serving as a low-cost "control" method.
At Rowan, and in our conversations with industry scientists, we've found that the most common use case for NNPs is as a drop-in replacement for density-functional-theory (DFT) calculations. Optimizing a molecule with DFT is often the most time-consuming and compute-intensive step in standard modeling workflows. NNPs can offer real value in the short term, but only if they can reliably replace DFT for these routine optimization tasks.
Most benchmarks of quantum-chemical methods, semiempirical methods, and NNPs measure a method's ability to reproduce energies at specific points on the potential-energy surface. Many of these energy-vs.-energy benchmarks are quite nuanced and insightful, like GMTKN55, which accumulates results across 55 different subsets of relative-energy challenges, or Wiggle150, which measures a method's ability to rank-order highly strained conformers of three drug-like molecules. However, these energy-based benchmarks can't conclusively show that NNPs can reliably serve as replacements for DFT in the context of molecular optimization.
To attempt to fill this gap, we chose 25 drug-like molecules and compared each NNP along the following axes:

- how many of the 25 optimizations complete successfully
- how many steps successful optimizations require
- how many optimizations terminate in true local minima, not saddle points
- how many imaginary frequencies the "optimized" structures have on average
These tests matter because practitioners use DFT to optimize starting structures with the goal of finding true local minima as quickly as possible. An NNP that aims to replace DFT must not only reproduce conformer rankings and relative energies but also reliably complete optimizations in a reasonable number of steps without yielding imaginary frequencies.
The 25 molecular structures under study are available on GitHub.
We chose four common optimization methods for this benchmark study.
The limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) is a classic quasi-Newton algorithm used in all sorts of contexts. Like all second-order methods, L-BFGS can be confused by noisy potential-energy surfaces. In this study, we used the L-BFGS optimizer implemented in the Atomic Simulation Environment (ASE).
The fast inertial relaxation engine (FIRE) is a first-order minimization method designed for fast structural relaxation. FIRE uses a molecular-dynamics-based approach, making it faster and more noise-tolerant than Hessian-based methods like L-BFGS; however, it's also a bit less precise and often performs worse for complex molecular systems. In this study, we used the FIRE optimizer implemented in ASE.
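Concretely, a run with either ASE optimizer looks something like the sketch below. This is a minimal illustration, not the exact code behind this post: EMT stands in for an NNP's ASE calculator, and the input file name is hypothetical; the fmax and step settings match those used in this study.

```python
# Minimal sketch of an ASE optimization run. EMT is a stand-in calculator;
# in practice, attach your NNP's ASE Calculator instead.
from ase.io import read
from ase.calculators.emt import EMT
from ase.optimize import LBFGS, FIRE

atoms = read("molecule.xyz")               # hypothetical input structure
atoms.calc = EMT()                         # stand-in for an NNP calculator

opt = LBFGS(atoms)                         # or: opt = FIRE(atoms)
converged = opt.run(fmax=0.01, steps=250)  # fmax in eV/Å, matching this study
print("converged:", converged)
```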
Sella is an open-source package developed by the Zador lab at Sandia National Laboratories. While Sella is often used for transition-state optimization, it also implements rational function optimization to optimize structures towards minima. Sella uses an internal coordinate system alongside a trust-step restriction and a quasi-Newton Hessian update.
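Used as a minimizer, Sella plugs into the same ASE workflow. The sketch below is a minimal illustration under this post's settings: `order=0` requests a minimum rather than a first-order saddle point (Sella's default), `internal=True` enables its internal-coordinate machinery, and EMT again stands in for an NNP calculator.

```python
# Minimal sketch of a Sella minimization under this post's settings.
from ase.io import read
from ase.calculators.emt import EMT  # stand-in for an NNP calculator
from sella import Sella

atoms = read("molecule.xyz")
atoms.calc = EMT()

opt = Sella(atoms, order=0, internal=True)  # order=0: minimum, not saddle point
opt.run(fmax=0.01, steps=250)
```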
geomeTRIC is a general-purpose optimization library developed by the Wang lab at UC Davis. geomeTRIC uses a special internal-coordinate scheme called "translation–rotation internal coordinates" (TRIC); within this coordinate system, it uses standard L-BFGS with line search. For this study, geomeTRIC was used with both Cartesian ("cart") and TRIC ("tric") coordinates.
These choices represent common "good" optimizers used by the community; a recent benchmark from Akhil Shajan and co-workers found Sella to be the best open-source geometry optimizer when using HF/6-31G(d,p) gradients on the Baker test set, although geomeTRIC also did quite well.
Most optimization and quantum-chemistry libraries give users detailed control over convergence criteria. For instance, geomeTRIC determines whether a structure is optimized based on five criteria:

- the change in energy between steps
- the root-mean-square (RMS) of the gradient
- the maximum component of the gradient
- the RMS of the atomic displacements
- the maximum atomic displacement
These criteria are used by Gaussian, NWChem, TurboMole, TeraChem, Q-Chem, ORCA, and so on. Unfortunately, ASE's interface for optimizers only exposes fmax, the maximum component of the gradient vector. (This is one of the reasons we currently use geomeTRIC at Rowan; we want to make sure that "optimized" structures are fully optimized.)
To enable a fair comparison with ASE optimizers, convergence was determined solely on the basis of the maximum gradient component for this study. All other convergence criteria were disabled in geomeTRIC. (We're aware that this is suboptimal, and discuss this more in our "Limitations" section.)
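For reference, here's a rough sketch of how geomeTRIC can be driven from Python under this fmax-only setup. This is a sketch based on our reading of geomeTRIC's documented run_optimizer() API, not the exact code behind this post: the ASEEngine wrapper is our own illustration (with EMT standing in for an NNP calculator), and loosening every criterion except the maximum gradient is just one possible way to disable the other convergence checks. Note that geomeTRIC works in atomic units, so 0.01 eV/Å is roughly 1.9e-4 Ha/bohr.

```python
# Sketch: driving geomeTRIC from Python with a custom engine.
import numpy as np
from ase import Atoms
from ase.calculators.emt import EMT
import geometric.molecule
import geometric.optimize
from geometric.engine import Engine

HARTREE_EV = 27.211386  # eV per Hartree
BOHR_ANG = 0.529177     # Å per bohr

class ASEEngine(Engine):
    """Minimal geomeTRIC engine wrapping an ASE calculator (EMT as stand-in)."""

    def __init__(self, molecule):
        super().__init__(molecule)
        self.atoms = Atoms(symbols=molecule.elem, positions=molecule.xyzs[0])
        self.atoms.calc = EMT()  # swap in your NNP's ASE calculator here

    def calc_new(self, coords, dirname):
        # geomeTRIC supplies flattened coords in bohr; ASE wants Å.
        self.atoms.positions = coords.reshape(-1, 3) * BOHR_ANG
        energy = self.atoms.get_potential_energy() / HARTREE_EV
        gradient = -self.atoms.get_forces().flatten() * BOHR_ANG / HARTREE_EV
        return {"energy": energy, "gradient": gradient}

molecule = geometric.molecule.Molecule("molecule.xyz")
result = geometric.optimize.run_optimizer(
    customengine=ASEEngine(molecule),
    coordsys="tric",  # or "cart" for Cartesian coordinates
    maxiter=250,
    # One way to mimic fmax-only convergence: loosen everything but gmax.
    # 0.01 eV/Å ≈ 1.9e-4 Ha/bohr.
    converge=["energy", "1.0", "grms", "1.0", "gmax", "1.9e-4",
              "drms", "1.0", "dmax", "1.0"],
    input="molecule_optim.tmp",
)
```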
For brevity, OMol25's eSEN Conserving Small is referred to simply as "OMol25 eSEN" in the tables below.
This metric reports how many of the 25 systems each NNP/optimizer pairing was able to successfully optimize.
| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 22ᵃ | 23 | 25 | 23 | 24 |
| ASE/FIRE | 20 | 20 | 25 | 20 | 15 |
| Sella | 15 | 24 | 25 | 15 | 25 |
| geomeTRIC (cart) | 8 | 12 | 25 | 7 | 9 |
| geomeTRIC (tric) | 1 | 20 | 14 | 1 | 25 |
ᵃ Benjamin Rhodes reports that OrbMol successfully optimizes all 25 systems when using "float32-highest" precision and L-BFGS with a maximum of 500 steps.
All of the unsuccessful optimizations failed because they exceeded 250 optimization steps, with a single exception: the Sella optimization of brexpiprazole with OMol25's eSEN Conserving Small failed with a nondescript "Internal Error."
This metric tells us the average number of steps needed for successful optimizations.
| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 108.8 | 99.9 | 1.2 | 112.2 | 120.0 |
| ASE/FIRE | 109.4 | 105.0 | 1.5 | 112.6 | 159.3 |
| Sella | 73.1 | 106.5 | 12.9 | 87.1 | 108.0 |
| geomeTRIC (cart) | 182.1 | 158.7 | 13.6 | 175.9 | 195.6 |
| geomeTRIC (tric) | 11 | 114.1 | 49.7 | 13 | 103.5 |
This metric tells us how many of the 25 systems optimized to a local minimum (as opposed to a saddle point). This is determined via a frequency calculation; the presence of an imaginary frequency indicates that a local minimum was not found.
| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 16 | 16 | 21 | 18 | 20 |
| ASE/FIRE | 15 | 14 | 21 | 11 | 12 |
| Sella | 11 | 17 | 21 | 8 | 17 |
| geomeTRIC (cart) | 6 | 8 | 22 | 5 | 7 |
| geomeTRIC (tric) | 1 | 17 | 13 | 1 | 23 |
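As a point of reference, here's one way such a frequency check can be run with ASE's finite-difference Vibrations module. This is an illustrative sketch, not necessarily the exact code behind these tables; in practice, near-zero spurious imaginary values arising from translations and rotations are typically screened out with a small threshold.

```python
# Sketch: checking whether an optimized structure is a true minimum.
import numpy as np
from ase.vibrations import Vibrations

vib = Vibrations(atoms)        # `atoms` is an optimized structure with a
vib.run()                      # calculator still attached
freqs = vib.get_frequencies()  # in cm^-1; imaginary modes appear as complex
n_imag = int(np.sum(np.iscomplex(freqs)))
print(f"{n_imag} imaginary frequencies found")
vib.clean()                    # delete cached displacement files
```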
This metric tells us how many imaginary frequencies each successfully optimized structure had on average.
| Optimizer \ Method | OrbMol | OMol25 eSEN | AIMNet2 | Egret-1 | GFN2-xTB |
|---|---|---|---|---|---|
| ASE/L-BFGS | 0.27 | 0.35 | 0.16 | 0.26 | 0.21 |
| ASE/FIRE | 0.35 | 0.30 | 0.16 | 0.45 | 0.20 |
| Sella | 0.40 | 0.33 | 0.16 | 0.47 | 0.36 |
| geomeTRIC (cart) | 0.38 | 0.33 | 0.12 | 0.29 | 0.22 |
| geomeTRIC (tric) | 0.00 | 0.15 | 0.07 | 0.00 | 0.08 |
This is a blog post, not a full study, and there are plenty of limitations here. The most obvious: to allow a fair comparison with ASE's optimizers, convergence was judged solely on the maximum gradient component (fmax). Future benchmarks should assess the performance of methods like FIRE and L-BFGS using state-of-the-art convergence criteria that more closely mimic accepted best practices.

There's so much more work to do here! We hope that our work motivates further research, and we plan to investigate this more ourselves.
Still, we think we have enough data to draw a few tentative conclusions.
A 2022 paper from Xiang Fu and co-workers argued that "forces are not enough" for benchmarking NNPs: even NNPs with good error metrics for energy and forces can give unrealistic and unstable MD simulations. This pushed the field to more aggressively benchmark MD stability with new NNPs (like eSEN, which was specifically designed to give smooth potential-energy surfaces).
Just as Fu showed that "the ability to run stable MD simulations" was an important benchmark for NNPs, we think that "the ability to optimize to true minima on the PES" should also be used for benchmarking future NNPs.
Different optimization methods performed very differently, even for the same NNP. In our experience, most scientists spend considerably more time benchmarking the level of theory than the optimizer they use. At least for NNPs, our results suggest that finding the right optimizer for a given class of systems might be just as crucial as finding the right method. If you're going to run hundreds or thousands of optimizations, it might be worth benchmarking optimizers or tuning hyperparameters before running everything.
Second-order optimizers like L-BFGS make certain assumptions about the smoothness of the potential-energy surface that might not be true for all NNPs. A recent study from Filippo Bigi and co-workers shows that non-conservative NNPs perform worse on geometry-optimization benchmarks, particularly for second-order optimizers; we anticipate that other NNP architectural decisions will have similar impacts on optimization performance.
While MD-based optimizers like FIRE are often claimed to be better suited to contexts like this, since they make fewer assumptions about the nature of the potential-energy surface, our results show that FIRE is not significantly better on this benchmark.
We'd like to see more work on building noise-tolerant optimizers for NNPs. Papers like this work from Bastian Schaefer and co-workers suggest that it's possible to improve classic L-BFGS methods for noisy cases, but to our knowledge there aren't good implementations of this work or related methods for molecular geometry optimization. New optimization methods could help scientists better leverage advances in NNPs and conduct high-throughput calculations much more efficiently. If you're interested in working on this, please reach out to our team! We'd like to talk.
So, with all this in mind, let's try to answer the question in the title: if you're running optimizations with NNPs today, which optimizer should you use? Our results suggest that ASE's L-BFGS is the best plug-and-play choice right now. For almost every NNP, L-BFGS converges more structures, and the resulting structures have fewer imaginary frequencies. While there are plenty of reasons why you might choose a different optimizer (convergence control, constraints, and so on), we think that L-BFGS is an excellent "first choice" for routine work.