GPU-Accelerated DFT with GPU4PySCF

by Jonathon Vandezande · Nov 19, 2025

GPUs have revolutionized the field of scientific computing, but electronic-structure methods like DFT have historically performed poorly on GPUs due to the complicated nature of the two-electron repulsion integrals (ERIs). The first reported use of a GPU for DFT was by Yasuda in 2007, who calculated the Coulomb matrix using Gauss-Rys quadrature on an NVIDIA GeForce 8800 GTX with a beta release of CUDA. However, the lack of fast double-precision arithmetic limited the accuracy of these calculations, with errors in the total energy of around 10⁻⁷ Hartree. These errors could be mitigated by performing only the initial iterations with the GPU or offloading the calculation of larger Coulomb matrix elements to the CPU, but this significantly limited the speedup.

In 2008, Ufimtsev and Martínez reported large speedups for the calculation of both the Coulomb and exchange s- and p-type ERIs using the GPU as a co-processor. This was followed by a direct-SCF implementation of the Coulomb and exchange matrix build that avoided the bandwidth-limiting step of transferring the N⁴ ERI tensor from memory to the GPU. However, these matrices (both N²) still needed to be transferred to the CPU to complete the Fock build, limiting the potential performance gains of using a GPU.
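To get a feel for why moving the N⁴ ERI tensor is so much worse than moving N² matrices, here is a rough back-of-the-envelope sketch. It assumes double-precision values and ignores permutational symmetry and integral screening (symmetry alone would cut the tensor by roughly a factor of eight), so treat the numbers as an upper bound rather than what any particular code stores. The function names are ours, purely for illustration.

```python
def eri_tensor_bytes(n_basis, bytes_per_value=8):
    """Bytes needed for all N^4 two-electron integrals (no symmetry, no screening)."""
    return n_basis**4 * bytes_per_value


def matrix_bytes(n_basis, bytes_per_value=8):
    """Bytes needed for one N x N matrix such as the Coulomb or exchange matrix."""
    return n_basis**2 * bytes_per_value


n = 1000  # illustrative basis-set size
print(f"ERI tensor:    {eri_tensor_bytes(n) / 1e12:.0f} TB")  # 8 TB
print(f"J or K matrix: {matrix_bytes(n) / 1e6:.0f} MB")       # 8 MB
```

With 1000 basis functions the full tensor is a million times larger than a single matrix, which is why direct-SCF schemes recompute integrals on the device instead of shipping them across the bus.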

The lack of d- and f-type functions limited early GPU-based DFT codes to organic molecules and severely hampered the accuracy of the energies that were calculated. Extension to higher angular momentum basis functions often required meta-programming techniques. However, the complexity involved can limit the benefit of using GPUs, since GPUs are best at highly repetitive tasks with minimal branching. When normalized by the higher cost of running a GPU, early implementations often did not represent a significant economic benefit.

Since 2007 there have been a variety of commercial and academic GPU-DFT releases with varying success, including:

and many more packages have implemented GPU acceleration as an option. While the Rowan platform has offered limited GPU-based DFT in the past with an interface to TeraChem, we are happy to announce that we are offering much broader GPU-based DFT support via the addition of GPU4PySCF!

GPU4PySCF

DFT engine comparison

Figure 1: Time to compute an r²SCAN/def2-TZVP single point energy for a series of linear alkanes. PySCF and Psi4 used a c7a.4xlarge (16 vCPUs, 32 GiB RAM) instance while GPU4PySCF used an NVIDIA H200.

The recently released GPU4PySCF package calculates ERIs based on Rys quadrature, allowing straightforward implementation of high-angular-momentum functions and their derivatives (up to g-functions). This method provides a small memory footprint and high data locality, leading to more effective data caching. All of this makes for incredibly fast DFT calculations. When compared to Rowan’s CPU-based DFT engines (PySCF and Psi4), GPU4PySCF quickly outpaces CPU-based implementations (see appendix for a full list of CPU and GPU specifications).
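For readers who want to try this outside of Rowan, the following is a minimal sketch of an r²SCAN/def2-TZVP single point with GPU4PySCF, modeled on the usage pattern in the GPU4PySCF README. The `single_point` helper and the methane geometry are ours, and the settings (default grid, no density fitting) are assumptions; they are not necessarily the settings used for the benchmarks in this post. The GPU libraries are imported inside the function so that the snippet loads on machines without a CUDA GPU.

```python
# Tetrahedral methane in Angstrom (C-H distance about 1.09 Å); illustrative input only.
METHANE = """
C  0.000000  0.000000  0.000000
H  0.629118  0.629118  0.629118
H -0.629118 -0.629118  0.629118
H -0.629118  0.629118 -0.629118
H  0.629118 -0.629118 -0.629118
"""


def single_point(atom, basis="def2-tzvp", xc="r2scan"):
    """Return the total energy (Hartree) from a restricted Kohn-Sham calculation.

    Requires pyscf and gpu4pyscf plus a CUDA-capable GPU; both packages are
    imported lazily so that defining this function has no GPU dependency.
    """
    import pyscf
    from gpu4pyscf.dft import rks

    mol = pyscf.M(atom=atom, basis=basis)
    mf = rks.RKS(mol, xc=xc)   # GPU counterpart of PySCF's RKS object
    return float(mf.kernel())  # run the SCF and return the converged energy
```

On a machine with a supported GPU, `single_point(METHANE)` runs the calculation; swapping `basis` or `xc` is enough to explore other levels of theory.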

GPU4PySCF is significantly faster than CPU-based engines, and Psi4 ran out of memory on the c7a.4xlarge node (32 GiB) after only 30 carbons. The acceleration is not limited to the latest GPU architectures; even the 5-year-old A100 GPUs show significant speedup vs CPU.

r²SCAN/def2-TZVP GPU hardware comparison

Figure 2: Time to compute an r²SCAN/def2-TZVP single point energy for a series of linear alkanes on various hardware. CPU calculations (grey dashed line) performed with Psi4 on a c7a.4xlarge node.

Interestingly, while CPU-based calculations are competitive with small GPUs on smaller molecules, the memory efficiency and batching implemented in GPU4PySCF allow small GPUs to calculate significantly larger systems, despite having less RAM (c7a.4xlarge: 32 GiB; A10 and L4: 24 GB). The GPU results show two scaling regimes, excess RAM and batching, with a distinct elbow for many GPUs. Small systems do not saturate the RAM and thus see minimal benefit from additional RAM, but there are significant speedups for larger systems when moving to GPUs with larger memory (e.g. A100-40GB → A100-80GB, see appendix for more GPU specs). Sadly, the current version of GPU4PySCF does not support the latest NVIDIA Blackwell architecture.

GPU-based DFT implementations have traditionally struggled with high-angular-momentum basis functions, reducing their usefulness for the large basis sets needed for highly accurate energetics. The double-ζ def2-SVP basis set is useful for initial explorations of the potential energy surfaces (PES), but triple-ζ or higher basis sets are typically recommended for accurate energetics.

Basis	ζ	s	p	d	f	g
sto-3g	1	2	1
def2-SVP	2	3	2	1
def2-TZVP	3	5	3	2	1
def2-QZVP	4	7	4	3	2	1

Table 1: Contracted Gaussian functions for carbon in basis sets of increasing size.
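The shell counts in Table 1 translate into basis functions per carbon atom once each shell is expanded into its 2l + 1 angular components. The short sketch below does that arithmetic; it assumes spherical (pure) functions, and the dictionary and function names are ours.

```python
# Contracted shells per angular momentum for carbon, copied from Table 1.
SHELLS = {
    "sto-3g": {"s": 2, "p": 1},
    "def2-SVP": {"s": 3, "p": 2, "d": 1},
    "def2-TZVP": {"s": 5, "p": 3, "d": 2, "f": 1},
    "def2-QZVP": {"s": 7, "p": 4, "d": 3, "f": 2, "g": 1},
}

# Spherical-harmonic functions per shell: 2l + 1 for l = 0..4.
FUNCS_PER_SHELL = {"s": 1, "p": 3, "d": 5, "f": 7, "g": 9}


def n_basis_functions(shells):
    """Number of basis functions on one atom given its contracted shell counts."""
    return sum(count * FUNCS_PER_SHELL[l] for l, count in shells.items())


for name, shells in SHELLS.items():
    print(f"{name:10s} {n_basis_functions(shells)}")
# sto-3g     5
# def2-SVP   14
# def2-TZVP  31
# def2-QZVP  57
```

Going from def2-SVP to def2-QZVP roughly quadruples the number of functions per carbon, and most of the added functions have d, f, or g character.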

Unlike many historic GPU-based DFT implementations, GPU4PySCF can easily handle meta-GGA functionals, dispersion corrections, and high-angular-momentum functions, scaling better with basis set size than CPU-based algorithms.

GPU4PySCF DFT Acceleration across basis sets

Figure 3: Time to compute an r²SCAN single point energy with various basis sets for a series of linear alkanes on various hardware. CPU calculations (grey dashed line) performed with Psi4 on a c7a.4xlarge node.

The speedup trends for basis sets of different sizes mirror the def2-TZVP results, with the Psi4 CPU timings being similar to those of 4-year-old A10 GPUs. def2-QZVP quickly saturates the RAM of smaller GPUs, and there is significant benefit to moving to GPUs with more memory.

Modern range-separated hybrids like ωB97M-V are needed for the highest accuracy calculations, but have more complicated ERIs than pure functionals like the r²SCAN meta-GGA and global-hybrid functionals like B3LYP. (This is due to the adjustment of the Coulomb kernel, $\frac{1}{r_{12}}$, in the exchange ERIs to smoothly interpolate from DFT exchange at short distances to HF exchange at larger distances.) Across basis sets and molecular systems, ωB97M-V is ≈1.5x slower than r²SCAN.
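For reference, the usual way to write this adjustment is the error-function partition of the Coulomb kernel, where $\omega$ is the range-separation parameter. This is the standard textbook form; the specific value of $\omega$ and the mixing coefficients used by ωB97M-V are not given here.

$$
\frac{1}{r_{12}} = \underbrace{\frac{\operatorname{erfc}(\omega r_{12})}{r_{12}}}_{\text{short range}} + \underbrace{\frac{\operatorname{erf}(\omega r_{12})}{r_{12}}}_{\text{long range}}
$$

The long-range term is the piece evaluated with HF exchange, so the exchange ERIs must be computed with an attenuated kernel instead of the bare $\frac{1}{r_{12}}$.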

DFT functional timings with GPU4PySCF

Figure 4: Time to compute an r²SCAN single point energy with various basis sets for a series of linear alkanes on an H200 GPU.

Comparing timings alone does not put CPUs and GPUs on a fair footing, as modern GPUs are often significantly more expensive per hour. Indeed, Modal’s NVIDIA H200 instances are currently $4.54/hour compared to $0.82/hour for a c7a.4xlarge CPU node on AWS. Thus the total cost should also be considered, not just the acceleration, especially if a large number of jobs are intended to be run in an embarrassingly parallel manner.
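The hourly prices above imply a simple break-even point: the GPU only saves money once its speedup exceeds the ratio of the two rates. The sketch below works this out; it assumes billing is linear in wall time with no minimum charge or startup overhead, and the helper names and the 100-second example job are hypothetical.

```python
H200_RATE = 4.54  # $/hour, Modal NVIDIA H200 (from the text)
C7A_RATE = 0.82   # $/hour, AWS c7a.4xlarge (from the text)


def job_cost(rate_per_hour, seconds):
    """Dollar cost of a job billed linearly by wall time."""
    return rate_per_hour * seconds / 3600.0


def break_even_speedup(gpu_rate, cpu_rate):
    """Speedup at which a GPU job and a CPU job cost the same."""
    return gpu_rate / cpu_rate


print(f"break-even speedup: {break_even_speedup(H200_RATE, C7A_RATE):.2f}x")  # 5.54x

# Hypothetical job: 100 s on the GPU and 10x longer on the CPU.
gpu_cost = job_cost(H200_RATE, 100)
cpu_cost = job_cost(C7A_RATE, 100 * 10)
print(f"GPU ${gpu_cost:.3f} vs CPU ${cpu_cost:.3f}")
```

Any job that runs more than about 5.5x faster on the H200 than on the c7a.4xlarge is cheaper on the GPU under these assumptions.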

Total cost of r²SCAN/def2-TZVP on various hardware

Figure 5: Cost to compute an r²SCAN/def2-TZVP single point energy for a series of linear alkanes on various hardware. CPU calculations (grey dashed line) performed with Psi4 on a c7a.4xlarge node.

The increased computational power of A100 and H100/H200 GPUs more than makes up for the increased cost. A100-80GB instances are the most economical for smaller systems, while H200s dominate for large systems because the A100-80GB saturates its memory. The benefits of H200 GPUs (141 GB of RAM) are even more pronounced when dealing with large basis sets, where memory saturation comes at smaller molecule sizes. The cost of c7a nodes is proportional to their memory (see appendix), and thus increasing the memory size to deal with larger systems or basis sets is particularly uneconomical.

Of course, there is no benefit to switching to GPU if the energies are incorrect. While early GPU implementations struggled with single-precision calculations, modern GPU implementations can achieve the same accuracy as CPU codes thanks to advances in 64-bit math support. The observed energy differences are less than the error typically seen from density fitting.

GPU vs CPU energy differences

Figure 6: Energy difference between CPU (PySCF) and GPU (GPU4PySCF) implementations. Note: def2-QZVP and def2-QZVPPD are almost perfectly overlapping in their errors.

Real-World Examples

So what does all of this mean for everyday calculations? For a small reaction like ethyl isocyanate + water, GPU4PySCF optimized the transition state with r²SCAN/def2-TZVP in 43 steps and 160 seconds, whereas Psi4 took 7x as long (GPU4PySCF was 15x faster when comparing ωB97M-V/def2-QZVP timings).

For larger molecules, the speedup is even greater. The 78-atom HIV drug Maraviroc sees a speedup of 13x for a single point energy with r²SCAN/def2-SVP; larger basis sets could not be run with Psi4 without running out of memory (GPU4PySCF calculations with def2-TZVP were just 1.4x longer than for def2-SVP). These speedups also hold for organometallic species. A single-point energy of a 95-atom hydrocupration transition state (previously found with our double-ended-TS-search workflow) runs in 1 minute with ωB97M-D3BJ/vDZP on an H200 GPU, while Psi4 takes >50x longer. Beyond this size of system CPU-based methods consistently run out of RAM on the provided hardware; since the cost of AWS c7a nodes scales with the provided RAM, CPU-based DFT becomes especially unfavorable for larger systems.

Valinomycin (168 atoms) single point run with r²SCAN/def2-TZVP on an H200 GPU in <5 minutes.

The addition of GPU4PySCF to the Rowan platform significantly improves the speed of our DFT, unlocking new possibilities for large-scale and highly accurate modeling. While we currently support only a fraction of the many features in GPU4PySCF, we look forward to adding more, such as support for excited states and advanced property prediction. If you are interested in any specific properties, please reach out to us.

Appendix

Notes on Calculations

Linear alkanes were constructed as straight chains and optimized with GFN2-xTB.

All calculations were run using the Rowan Scientific platform. Psi4 and PySCF calculations were run using a c7a.4xlarge AWS instance with 16 vCPUs and 32 GiB of RAM.

GPU4PySCF calculations were run on Modal GPUs. GPUs were provisioned with a single vCPU and 4 GB RAM. All GPUs were initialized with an untimed r²SCAN/sto-3g energy calculation on methane to ensure the startup time was not included in calculations (for more, see the Modal documentation on cold starts). Obvious outliers were rerun in an ad hoc manner, and accounted for <5% of all jobs (outliers can occur when cloud providers change GPUs and when Modal occasionally upgrades GPUs).
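The warm-up-then-time pattern described above can be sketched with a small, generic helper. This is not Rowan's benchmarking code; `time_after_warmup` is a hypothetical name, and `job` and `warmup` stand in for whatever callables launch the timed calculation and the cheap initialization run.

```python
import time


def time_after_warmup(job, warmup):
    """Run `warmup` untimed, then return (result, elapsed_seconds) for `job`."""
    warmup()  # e.g. a tiny calculation that absorbs startup and compilation cost
    start = time.perf_counter()
    result = job()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy usage with stand-in callables in place of real calculations.
result, seconds = time_after_warmup(job=lambda: sum(range(1000)), warmup=lambda: None)
print(result)  # 499500
```

In practice `warmup` would be a callable wrapping the small methane calculation and `job` a callable wrapping the calculation of interest, so that only the latter contributes to the reported time.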

CPU pricing for an on-demand node was obtained from Amazon's EC2 estimator on 2025-11-19.

GPU pricing was obtained from Modal's pricing page on 2025-11-19.

Node Specifications

GPU	Architecture	Release date	VRAM (GB)	CUDA Cores	FP64 (TFLOPS)	FP32 (TFLOPS)	Cost ($/hour)
T4	Turing	2018-09	16	2560	0.254	8.14	0.59
A10	Ampere	2021-04	24	9216	0.976	31.24	0.80
L4	Ada	2023-03	24	7424	0.473	30.29	1.10
L40S	Ada	2023-08	48	18176	1.431	91.61	1.95
A100-40GB	Ampere	2020-05	40	6912	9.746	19.49	2.10
A100-80GB	Ampere	2020-11	80	6912	9.746	19.49	2.50
H100	Hopper	2023-03	80	14592	25.610	51.22	3.95
H200	Hopper	2023-11	141	16896	30.160	60.32	4.54