Derivative Informed Learning of Exchange-Correlation Functionals
Summary
This ICML 2026 paper introduces Derivative Informed XC-Loss (DI-Loss), a training approach for machine-learned exchange-correlation functionals that incorporates first and second derivative supervision on the Grassmannian of density matrices. Across four architectures, DI-Loss reduces total-energy MAE by 66% compared to energy and density supervision alone, and improves excited-state predictions in TDDFT calculations.
View Cached Full Text
Cached at: 06/05/26, 02:24 AM
# 1 Introduction
Source: [https://arxiv.org/html/2606.04279](https://arxiv.org/html/2606.04279)
marginparsep has been altered\. topmargin has been altered\. marginparpush has been altered\. The page layout violates the ICML style\.Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you\. We’re not able to reliably undo arbitrary changes to the style\. Please remove the offending package\(s\), or layout\-changing commands and try again\.
Derivative Informed Learning of Exchange\-Correlation Functionals
Eike S\. Eberhard123Luca A\. Thiede45Abdul Aldossary4Andreas Burger45Nicholas Gao6Vignesh Bhethanabotla7Alán Aspuru\-Guzik45Stephan Günnemann123
††footnotetext:1Technical University of Munich, Munich, Germany2Munich Data Science Institute, Munich, Germany3Munich Center for Machine Learning, Munich, Germany4University of Toronto, Toronto, Canada5Vector Institute, Toronto, Canada6CuspAI, Berlin, Germany7California Institute of Technology, Pasadena, United States of America\. Correspondence to: Eike S\. Eberhard <eike\.eberhard@tum\.de\>\.
Proceedings of the43rd\\mathit\{43\}^\{rd\}International Conference on Machine Learning, Seoul, South Korea\. PMLR 306, 2026\. Copyright 2026 by the author\(s\)\.###### Abstract
Machine\-learned \(ML\) XC functionals aim to replace human\-designed density functional approximations by learning directly from reference data, but they still do not consistently outperform traditional𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\-scaling hybrid functionals\. We therefore study a hybrid\-distillation setting, where𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling ML\-XC functionals are trained to reproduce B3LYP/def2\-SVP targets\. We introduceDerivative Informed XC\-Loss\(DI\-Loss\), a loss that incorporates additional information from the reference hybrid functional by supervising first and second derivatives of the energy on the Grassmannian of admissible density matrices\. Rather than only matching the self\-consistent fixed point, DI\-Loss aligns the local first\- and second\-order response of the learned functional with that of the target functional\. Across four evaluated architectures, DI\-Loss consistently improves the main energy metrics\. Averaged uniformly across architectures, the total\-energy MAE decreases by66%66\\%relative to energy and density supervision alone\. The density\-sensitive mean\-field energy metricEρE\_\{\\rho\}improves from1\.21\.2to0\.80\.8mEh on average, while dipole andℒ2\\mathcal\{L\}\_\{2\}density errors do not improve uniformly\. We further show that densities from the distilled functionals reduce hybrid\-functional SCF iterations by up to50%50\\%\. In downstream TDDFT calculations, Hessian supervision improves excited\-state predictions, with XCdiff reducing the mean excitation\-energy MAE by1919–35%35\\%\.
The ability to accurately and efficiently predict molecular and materials properties is central to modern computational chemistry\. Such predictions enable the rational design of molecules and materials across a wide range of applications, from catalyst design and energy storage to drug discoverySliwoskiet al\.\([2014](https://arxiv.org/html/2606.04279#bib.bib6)\); Lu \([2021](https://arxiv.org/html/2606.04279#bib.bib7)\); Aldossaryet al\.\([2024](https://arxiv.org/html/2606.04279#bib.bib8)\); Rowaiyeet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib9)\)\. A fundamental challenge in this domain is the tension between predictive accuracy and computational scalability\. In principle, most of chemistry is described by the Schrödinger equation which admits an exact solutionKnowles and Handy \([1984](https://arxiv.org/html/2606.04279#bib.bib210)\)\. However, its factorial scaling with system size renders it infeasible beyond very small systems\. Highly accurate approximations such as coupled\-cluster theoryPurvis III and Bartlett \([1982](https://arxiv.org/html/2606.04279#bib.bib211)\); Raghavachariet al\.\([1989](https://arxiv.org/html/2606.04279#bib.bib212)\)reduce this cost, but still scale steeply with system size, typically as𝒪\(N6\)\\mathcal\{O\}\(N^\{6\}\)for CCSD and𝒪\(N7\)\\mathcal\{O\}\(N^\{7\}\)for CCSD\(T\)\. Density functional theory \(DFT\) occupies a central position in electronic structure theory by offering a favorable compromise between accuracy and cost\.
Density Functional Theory \(DFT\) is based on the Hohenberg\-Kohn \(HK\) theorems\(Hohenberg and Kohn,[1964](https://arxiv.org/html/2606.04279#bib.bib80)\), which reduce the description of an interacting many\-electron system from its3N3N\-dimensional wavefunction to the three\-dimensional electron densityρ\(𝐫\):ℝ3→\[0,∞\)\\rho\(\\mathbf\{r\}\):\\mathbb\{R\}^\{3\}\\rightarrow\[0,\\infty\)\. The first theorem establishes a one\-to\-one correspondence between the external potentialvext\(𝐫\)v\_\{\\text\{ext\}\}\(\\mathbf\{r\}\)generated by the nuclei and the ground\-state densityρ0\(𝐫\)\\rho\_\{0\}\(\\mathbf\{r\}\)\. The second theorem guarantees the existence of an energy functionalE\[ρ\]E\[\\rho\]whose minimization yields the ground\-state density as the minimizer and the ground\-state energy as the minimum value\. The only system\-specific input is the nuclear potentialvext\(𝐫\)v\_\{\\text\{ext\}\}\(\\mathbf\{r\}\), while the remaining contributions form a universal functional shared by all electronic systems\.
While the HK theorems guarantee the existence ofE\[ρ\]E\[\\rho\], they say nothing about its form, leaving its parameterization as the central challenge of DFT\. Kohn and ShamKohn and Sham \([1965](https://arxiv.org/html/2606.04279#bib.bib104)\)proposed mapping the interacting system onto a fictitious non\-interacting reference system with the same ground\-state densityρ∗\\rho^\{\*\}, whose kinetic energyTsT\_\{s\}can be evaluated exactly through*single\-particle orbitals*φi:ℝ3→ℝ\\varphi\_\{i\}:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}\. In this setting, the total energy functional can be decomposed as
E\[ρ\]=Enuc\+Ts\[ρ\]\+Vext\[ρ\]\+EC\[ρ\]\+Exc\[ρ\],\\displaystyle E\[\\rho\]=E\_\{\\mathrm\{nuc\}\}\+T\_\{s\}\[\\rho\]\+V\_\{\\mathrm\{ext\}\}\[\\rho\]\+E\_\{\\mathrm\{C\}\}\[\\rho\]\+E\_\{\\mathrm\{xc\}\}\[\\rho\]\\\>,\(1\)where the nuclear repulsion energyEnucE\_\{\\mathrm\{nuc\}\}, the external potential energyVext\[ρ\]=∫vext\(𝐫\)ρ\(𝐫\)𝑑𝐫V\_\{\\mathrm\{ext\}\}\[\\rho\]=\\int v\_\{\\mathrm\{ext\}\}\(\\mathbf\{r\}\)\\,\\rho\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}, and the Coulomb \(Hartree\) energyECE\_\{\\mathrm\{C\}\}are known exactly, while all many\-body effects beyond classical electrostatics, together with the residual kinetic contributionT\[ρ\]−Ts\[ρ\]T\[\\rho\]\-T\_\{s\}\[\\rho\], are absorbed into the unknown exchange\-correlation \(XC\) functionalExcE\_\{\\mathrm\{xc\}\}\. The density itself is reconstructed from the orbitals viaρ\(𝐫\)=∑i\|φi\(𝐫\)\|2\\rho\(\\mathbf\{r\}\)=\\sum\_\{i\}\|\\varphi\_\{i\}\(\\mathbf\{r\}\)\|^\{2\}, with the orbitals obtained by solving the Kohn\-Sham equations
\{hcore\(𝐫\)\+vC\[ρ\]\(𝐫\)\+vxc\[ρ\]\(𝐫\)\}φi=εiφi,\\displaystyle\\left\\\{h\_\{\\textrm\{core\}\}\(\\mathbf\{r\}\)\+v\_\{C\}\[\\rho\]\(\\mathbf\{r\}\)\+v\_\{\\textrm\{xc\}\}\[\\rho\]\(\\mathbf\{r\}\)\\right\\\}\\varphi\_\{i\}=\\varepsilon\_\{i\}\\varphi\_\{i\}\\\>,\(2\)subject to the orthonormality constraint∫φi\(𝐫\)φj\(𝐫\)𝑑𝐫=δij\\int\\varphi\_\{i\}\(\\mathbf\{r\}\)\\,\\varphi\_\{j\}\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}=\\delta\_\{ij\}\. KS\-DFT retains a favorable computational scaling, making it the workhorse of modern electronic structure calculations\.
The accuracy of KS\-DFT is almost solely limited by the quality of the approximation to the unknown XC functional
Exc:\{ρ:ℝ3→ℝ\+\|∫ρ\(𝐫\)d𝐫=Nelec\}→ℝ,\\displaystyle E\_\{\\mathrm\{xc\}\}:\\left\\\{\\rho:\\mathbb\{R\}^\{3\}\\to\\mathbb\{R\}^\{\+\}\\;\\middle\|\\;\\int\\rho\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}=N\_\{\\mathrm\{elec\}\}\\right\\\}\\to\\mathbb\{R\}\\\>,\(3\)which maps the electron density to a scalar energy contribution\. Recent efforts have attempted to bypass traditional, human\-designed density functional approximations \(DFAs\) by learning XC functionals directly from high\-accuracy reference data\(Snyderet al\.,[2012](https://arxiv.org/html/2606.04279#bib.bib161); Nagaiet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib135); Dick and Fernandez\-Serra,[2021](https://arxiv.org/html/2606.04279#bib.bib51); Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67)\), training the functional to reproduce ground\-state densities and energies obtained from reference calculations such as coupled\-cluster\. While promising, these ML\-driven functionals generally compete with, but do not consistently surpass, traditional𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\-scalinghybridfunctionals on energy benchmarksLuiseet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib125)\); Karton \([2026](https://arxiv.org/html/2606.04279#bib.bib3)\)\.
In this work, we target the accuracy\-efficiency gap by*distilling*a traditional𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\-scaling XC functional into a lower\-cost𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling ML XC functional\. The advantage of this approach is the access to additional information about the reference functional in the vicinity of the ground state, which is not available from expensiveab initioreference calculations\. Rather than supervising only energies and densities as in prior work, we additionally supervise the first and second functional derivatives of the energy with respect to the density, aligning the self\-consistent field \(SCF\) dynamics of the distilled functional with those of the target\. These additional loss terms enable the distillate to more faithfully reproduce ground\-state energies and improve out\-of\-distribution generalization, as well as response properties relevant to excited\-state calculations\. While we restrict ourselves to distilling𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\-scaling functionals in this work, the approach extends to more accurate functionals such as double hybridsGrimme \([2006](https://arxiv.org/html/2606.04279#bib.bib217)\)\.
## 2Background
The discretization of KS\-DFTutilizes a finite basis\{χμ:ℝ3→ℝ\}μ=1B\\\{\\chi\_\{\\mu\}:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}\\\}^\{B\}\_\{\\mu=1\}to solve Eq\. \([2](https://arxiv.org/html/2606.04279#S1.E2)\) in aBB\-dimensional spaceLehtolaet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib114)\)\. In this discretization, the set of eigenfunctions𝝋\(𝒓\)=\(φ1,…,φB\)T\\boldsymbol\{\\varphi\}\(\\boldsymbol\{r\}\)=\(\\varphi\_\{1\},\.\.\.,\\varphi\_\{B\}\)^\{T\}can be represented using anorbital coefficient matrix
𝝋\(𝒓\)=CT𝝌\(𝒓\)\.\\displaystyle\\boldsymbol\{\\varphi\}\(\\boldsymbol\{r\}\)=C^\{T\}\\boldsymbol\{\\chi\}\(\\boldsymbol\{r\}\)\\\>\.\(4\)Notably, only the firstO=Nelec/2O=N\_\{\\textrm\{elec\}\}/2orbitals areoccupiedby electrons, the remainingV=B−OV=B\-Oare commonly referred to asvirtual\.111Here we limit ourselves to restricted/closed\-shell systems typical for stable organic chemistry, which have an even number of electrons\.We denote the common slices𝑪:,:O\\boldsymbol\{C\}\_\{:,:O\}and𝑪:,O:\\boldsymbol\{C\}\_\{:,O:\}as𝑪occ\\boldsymbol\{C\}\_\{\\textrm\{occ\}\}and𝑪virt\\boldsymbol\{C\}\_\{\\textrm\{virt\}\}, respectively\. The electron density can then be written asρ=𝝌𝑪occ𝑪occT𝝌\\rho=\\boldsymbol\{\\chi\}\\boldsymbol\{C\}\_\{\\textrm\{occ\}\}\\boldsymbol\{C\}\_\{\\textrm\{occ\}\}^\{T\}\\boldsymbol\{\\chi\}, making it convenient to define thedensity matrix𝑷=𝑪occ𝑪occT\\boldsymbol\{P\}=\\boldsymbol\{C\}\_\{\\textrm\{occ\}\}\\boldsymbol\{C\}\_\{\\textrm\{occ\}\}^\{T\}\.
Starting from someρ\(t\)\\rho^\{\(t\)\}, one can plug it back into[Equation2](https://arxiv.org/html/2606.04279#S1.E2)to obtain a new set of orbitals by filling the lowest\-energy orbitals up toNelec/2N\_\{\\mathrm\{elec\}\}/2, constructing a newCocc\(t\+1\)C\_\{\\mathrm\{occ\}\}^\{\(t\+1\)\}corresponding toρ\(t\+1\)\\rho^\{\(t\+1\)\}\. This procedure is repeated until self\-consistency, i\.e\., untilρ\(t\)=ρ\(t\+1\)\\rho^\{\(t\)\}=\\rho^\{\(t\+1\)\}, yielding the ground state density,ρgs\\rho\_\{\\mathrm\{gs\}\}\.
From the way we definedPP, it is clear that not any matrix inℝB×B\\mathbb\{R\}^\{B\\times B\}is a valid density matrix\. Interestingly, the set of conditions implicitly defines the Grassmannian manifoldObasis/\(Oocc×Ovirt\)O\_\{\\mathrm\{basis\}\}/\(O\_\{\\mathrm\{occ\}\}\\times O\_\{\\mathrm\{virt\}\}\)of dimensionO×VO\\times V, whereObasisO\_\{\\mathrm\{basis\}\}is the orthogonal group of the size of the basisEdelmanet al\.\([1998](https://arxiv.org/html/2606.04279#bib.bib5)\)\. This manifold can be understood as the space of projection matrices of the occupiedOO, invariant to unitary rotations withinOOorVV\. Starting from a valid coefficient matrixC0C\_\{0\}on this manifold, all other validCCmatrices can be parameterized by
C\(θov\):=C0exp\(0θov−θovT0\),\\displaystyle C\(\\theta\_\{ov\}\):=C\_\{0\}\\exp\\begin\{pmatrix\}0&\\theta\_\{\\mathrm\{ov\}\}\\\\ \-\\theta\_\{\\mathrm\{ov\}\}^\{T\}&0\\end\{pmatrix\}\\\>,\(5\)whereθov∈RO×V\\theta\_\{\\text\{ov\}\}\\in R^\{O\\times V\}are the so\-calledorbital rotation angles\. For more details and derivations, refer to[AppendixA](https://arxiv.org/html/2606.04279#A1)\.
Traditional XC\-functional approximationshave historically been parameterized asExc=∫εxc\[ρ\]\(𝐫\)ρ\(𝐫\)𝑑𝐫E\_\{\\text\{xc\}\}=\\int\\varepsilon\_\{\\text\{xc\}\}\[\\rho\]\(\\mathbf\{r\}\)\\,\\rho\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}, modeling anintensiveenergy densityεxc\\varepsilon\_\{\\text\{xc\}\}that depends on local properties of the electron density \(Appendix[E](https://arxiv.org/html/2606.04279#A5)\)\. Beyond the basic Local Density Approximation \(LDA\) solely based onρ\\rho, Generalized Gradient Approximations \(GGAs\) add dependence on the density gradient∇ρ\\nabla\\rho, while meta\-GGAs further include the kinetic energy densityτ\\tau, all of which retain𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)scaling\. Higher accuracy can be achieved by incorporating exact exchange from Hartree\-Fock theory, which is done for \(range\-separated\)hybridfunctionals\(Becke,[1993](https://arxiv.org/html/2606.04279#bib.bib215); Yanaiet al\.,[2004](https://arxiv.org/html/2606.04279#bib.bib216)\)\.Double hybrids\(Grimme,[2006](https://arxiv.org/html/2606.04279#bib.bib217)\)add perturbative correlation corrections at even greater expense\. This increased accuracy, however, comes at a high computational cost: Hybrid and range\-separated hybrid functionals scale as𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\), while double hybrids scale as𝒪\(N5\)\\mathcal\{O\}\(N^\{5\}\)\. As a result, the most accurate KS\-DFT\-based methods sacrifice some of the favorable scaling that originally motivated it\.
Time\-dependent DFT \(TDDFT\)is the most widely used method for calculating excited\-state properties of molecular systems\. It is formally grounded in the Runge\-Gross theoremRunge and Gross \([1984](https://arxiv.org/html/2606.04279#bib.bib222)\), the time\-dependent analogue of the Hohenberg\-Kohn theorem, which establishes a one\-to\-one correspondence between the time\-dependent external potential and the time\-dependent electron density\. Intuitively, a time\-varying electric field, such as an oscillating light wave, perturbs the ground\-state density and induces a density response\. If the field frequencyω\\omegamatches a natural transition frequency of the system, this response becomes singular \(a “pole”\), and the frequencies at which these poles occur correspond to the molecule’s excitation energies\. In linear\-response TDDFT, computing these poles reduces to a generalized eigenvalue problem whose central operator is the orbital HessianGross and Maitra \([2012](https://arxiv.org/html/2606.04279#bib.bib75)\)\. Its XC contribution is precisely the second functional derivative ofExcE\_\{\\mathrm\{xc\}\}with respect to the density\. TDDFT excitations can therefore probe how well an ML functional reproduces the curvature of the reference functional around the ground\-state density\. Computational details are provided in[AppendixF](https://arxiv.org/html/2606.04279#A6)\.
## 3Related Work
Data\-driven DFAspredate recent machine\-learning approaches; so\-calledempirical functionalsfit free parameters to experimental or compute\-intensive high\-accuracy reference data\. Compared to contemporary ML\-functionals, their parameter count is small; for example, the popular B3LYP functional byBecke \([1993](https://arxiv.org/html/2606.04279#bib.bib215)\)has three open parameters \(a0a\_\{0\},axa\_\{x\}, andaca\_\{c\}\) that are least\-squares fitted to atomization energies, ionization potentials, and proton affinities\. In recent years deep learning has emerged as a promising direction toward improved DFAs\. Beyond architectural advances, composite loss functions exemplify how ML practices have improved the field\. Traditional empirical functionals have historically been overfitted to energies at the expense of accurate electron densities\(Medvedevet al\.,[2017](https://arxiv.org/html/2606.04279#bib.bib132)\), a problem less prevalent in constraint\-driven approaches\(Kaplanet al\.,[2023](https://arxiv.org/html/2606.04279#bib.bib96)\)\. By jointly constraining energies and densities\(Nagaiet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib135)\), composite loss functions mitigate this failure mode\.
The parameterizations of most recent ML\-DFAsshare a common pattern:
Exc\[ρ\]=∫εxcbase\(𝐠local\[ρ\]\(𝐫\)\)Fθ\(𝐠\[ρ\]\(𝐫\)\)ρ\(r\)d3r,\\displaystyle E\_\{\\mathrm\{xc\}\}\[\\rho\]=\\int\\varepsilon\_\{\\mathrm\{xc\}\}^\{\\mathrm\{base\}\}\\bigl\(\\mathbf\{g\}\_\{\\textrm\{local\}\}\[\\rho\]\(\\mathbf\{r\}\)\\bigr\)\\;F\_\{\\theta\}\\\!\\bigl\(\\mathbf\{g\}\[\\rho\]\(\\mathbf\{r\}\)\\bigr\)\\rho\(r\)\\,\\mathrm\{d\}^\{3\}r\\\>,\(6\)whereεxcbase\\varepsilon\_\{\\mathrm\{xc\}\}^\{\\mathrm\{base\}\}is a traditional, computationally cheap approximation of the XC\-energy density \(e\.g\. energy density of the uniform electron gas\),𝐠local\[ρ\]\(𝐫\)\\mathbf\{g\_\{\\textrm\{local\}\}\}\[\\rho\]\(\\mathbf\{r\}\)is a vector of \(semi\)\-local density features, e\.g\.𝐠local\[ρ\]\(𝐫\)=\(ρ\(r\),\|∇ρ\(r\)\|,…\)\\mathbf\{g\}\_\{\\textrm\{local\}\}\[\\rho\]\(\\mathbf\{r\}\)=\(\\rho\(r\),\|\\nabla\\rho\(r\)\|,\.\.\.\)\) \([E\.1](https://arxiv.org/html/2606.04279#A5.SS1)\), andFθF\_\{\\theta\}is a learnableenhancement factorwith parametersθ\\theta\. Typically,FθF\_\{\\theta\}is modeled with a multi\-layer perceptron \(MLP\) or compositions thereof\. Depending on the architecture the density feature vector𝐠\[ρ\]\(𝐫\)\\mathbf\{g\}\[\\rho\]\(\\mathbf\{r\}\)contains only local features\(Dick and Fernandez\-Serra,[2021](https://arxiv.org/html/2606.04279#bib.bib51); Nagaiet al\.,[2022](https://arxiv.org/html/2606.04279#bib.bib136)\)or both local and non\-local density features𝐠\[ρ\]\(𝐫\)=𝐠local\[ρ\]\(𝐫\),𝐠non\-local\[ρ\]\(𝐫\)\\mathbf\{g\}\[\\rho\]\(\\mathbf\{r\}\)=\\mathbf\{g\}\_\{\\textrm\{local\}\}\[\\rho\]\(\\mathbf\{r\}\),\\mathbf\{g\}\_\{\\textrm\{non\-local\}\}\[\\rho\]\(\\mathbf\{r\}\)\. The second category can be further subdivided into static predefined non\-local density features\(Nagaiet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib135)\)and more expressive learnable non\-local features\(Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67); Luiseet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib125)\)\. However, satisfactory basis\-set generalization has not been demonstrated for these architectures, and it is reasonable to assume that evaluating them on a basis set family different from the one used during training would fail\.Medvidovićet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib19)\)proposed to learn the kinetic energy densityτ:ℝ3→ℝ\\tau:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}to distill semi\-local \(𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling\) models into cheaper local models of the same complexity class\.
Training XC\-functionalshas proven to be challenging\.Nagaiet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib135)\)circumvented an auto\-differentiable implementation altogether by falling back to stochastic parameter updates using Metropolis\-Hastings type weight perturbation sampling\. An alternative route of implementing a fully auto\-differentiable SCF solver was first proposed byTamayo\-Mendozaet al\.\([2018](https://arxiv.org/html/2606.04279#bib.bib14)\)\.Liet al\.\([2021](https://arxiv.org/html/2606.04279#bib.bib118)\)showed that this end\-to\-end training approach has a regularizing effect resulting in improved generalization for one\-dimensional hydrogen systems\.Dick and Fernandez\-Serra \([2021](https://arxiv.org/html/2606.04279#bib.bib51)\)and laterGaoet al\.\([2024](https://arxiv.org/html/2606.04279#bib.bib67)\)applied this training approach to cylindrical and arbitrary molecules, respectively\.Kanungoet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib95)\)augment energy supervision with a constraint on theground\-stateXC potential obtained via inverse DFT, penalizing the squared density\-weighted potential error\(∫ρ\(𝐫\)Δvxc\(𝐫\)𝑑𝐫\)2\\big\(\\int\\rho\(\\mathbf\{r\}\)\\,\\Delta v\_\{\\mathrm\{xc\}\}\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}\\big\)^\{2\}\.
## 4Derivative Informed XC\-Loss
We proposeDerivative Informed XC\-Loss\(DI\-Loss\), which consists of two novel loss terms for learnable XC\-functionals\. It supervises the first and second derivatives of the energy with respect to the electron density on the Grassmann manifold of admissible, physically valid \(idempotent\) density matrices\. The total loss function is given by a linear combination of energyℒE\\mathcal\{L\}\_\{E\}and densityℒρ\\mathcal\{L\}\_\{\\rho\}terms, as well as our new gradientℒ∇\\mathcal\{L\}\_\{\\nabla\}and HessianℒH\\mathcal\{L\}\_\{H\}contributions
ℒDI=αEℒE\+αρℒρ\+α∇ℒ∇\+αHℒH\\displaystyle\\mathcal\{L\}\_\{\\textrm\{DI\}\}=\\alpha\_\{E\}\\mathcal\{L\}\_\{E\}\+\\alpha\_\{\\rho\}\\mathcal\{L\}\_\{\\rho\}\+\\alpha\_\{\\nabla\}\\mathcal\{L\}\_\{\\nabla\}\+\\alpha\_\{H\}\\mathcal\{L\}\_\{H\}\(7\)whereαE,αρ,α∇,αH\\alpha\_\{E\},\\alpha\_\{\\rho\},\\alpha\_\{\\nabla\},\\alpha\_\{H\}are the loss weight constants for energy, density, gradient, and Hessian, respectively\. We followLiet al\.\([2021](https://arxiv.org/html/2606.04279#bib.bib118)\)and supervise the first three terms \(ℒE,ℒρ,ℒ∇\\mathcal\{L\}\_\{E\},\\mathcal\{L\}\_\{\\rho\},\\mathcal\{L\}\_\{\\nabla\}\) along the densities of the SCF trajectoryℒ=∑j=0Ncyclesωjℒ\(j\)\\mathcal\{L\}=\\sum\_\{j=0\}^\{N\_\{\\textrm\{cycles\}\}\}\\omega\_\{j\}\\mathcal\{L\}^\{\(j\)\}putting increasing weightsωj\\omega\_\{j\}on the later iterations\. In the following, we denote the difference between a target quantityXXand its predictionX^\\hat\{X\}asΔX:=X^−X\\Delta X:=\\hat\{X\}\-X\.
For theenergylossℒE\\mathcal\{L\}\_\{E\}we use the mean squared error\. For thedensityloss we use the per\-electronL1L^\{1\}norm of the real\-space density error
ℒρ\(j\)\[ρ,ρ^j\]=1Nelec∫\|Δρ\(𝐫\)\|𝑑𝐫,\\displaystyle\\mathcal\{L\}^\{\(j\)\}\_\{\\rho\}\[\\rho,\\hat\{\\rho\}\_\{j\}\]=\\frac\{1\}\{N\_\{\\textrm\{elec\}\}\}\\int\\left\|\\Delta\\rho\(\\mathbf\{r\}\)\\right\|d\\mathbf\{r\}\\\>,\(8\)which performs best among the density loss types we compared \(Appendix[L](https://arxiv.org/html/2606.04279#A12)\)\.
Thegradient𝒈∈ℝO×V\\boldsymbol\{g\}\\in\\mathbb\{R\}^\{O\\times V\}with regard to the tangent coordinates \(Appendix[B](https://arxiv.org/html/2606.04279#A2)\)𝜽ia∈ℝO×V\\boldsymbol\{\\theta\}\_\{ia\}\\in\\mathbb\{R\}^\{O\\times V\}of the Grassmann manifold
𝒈ia\(xc\)\[ρ\]:=∂Exc\[ρ\(𝐫;𝜽\)\]∂𝜽ia\|𝜽=0\\displaystyle\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\_\{ia\}\[\\rho\]:=\\frac\{\\partial E\_\{\\textrm\{xc\}\}\[\\rho\(\\mathbf\{r\};\\boldsymbol\{\\theta\}\)\]\}\{\\partial\\boldsymbol\{\\theta\}\_\{ia\}\}\\bigg\|\_\{\\boldsymbol\{\\theta\}=0\}\(9\)can be efficiently computed in closed\-form \(Appendix[B](https://arxiv.org/html/2606.04279#A2)\)
𝒈ia\(xc\)=−2\(𝑪T𝑽xc𝑪\)ia\.\\displaystyle\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\_\{ia\}=\-2\(\\boldsymbol\{C\}^\{T\}\\boldsymbol\{V\}\_\{\\textrm\{xc\}\}\\boldsymbol\{C\}\)\_\{ia\}\\\>\.\(10\)The gradient of the total energy is given by
𝒈ia\(tot\)\[ρ\]:=∂Eρ\[ρ\(𝐫;𝜽\)\]∂𝜽ia\+𝒈ia\(xc\),\\displaystyle\\boldsymbol\{g\}^\{\\textrm\{\(tot\)\}\}\_\{ia\}\[\\rho\]:=\\frac\{\\partial E\_\{\\rho\}\[\\rho\(\\mathbf\{r\};\\boldsymbol\{\\theta\}\)\]\}\{\\partial\\boldsymbol\{\\theta\}\_\{ia\}\}\+\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\_\{ia\}\\\>,\(11\)such that the mean\-field termEρE\_\{\\rho\}cancels out in the loss computation and we obtain
ℒ∇\(j\)=1Nelec‖\(\(𝑪\(j\)\)T𝚫Vxc\(j\)𝑪\(j\)\)ia‖F\.\\displaystyle\\mathcal\{L\}^\{\(j\)\}\_\{\\nabla\}=\\frac\{1\}\{N\_\{\\textrm\{elec\}\}\}\\left\\\|\\left\(\\left\(\\boldsymbol\{C\}^\{\(j\)\}\\right\)^\{T\}\\boldsymbol\{\\Delta\}V^\{\(j\)\}\_\{\\textrm\{xc\}\}\\,\\boldsymbol\{C\}^\{\(j\)\}\\right\)\_\{ia\}\\right\\\|\_\{F\}\\\>\.\(12\)In contrast to this construction,Kanungoet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib95)\)match the potential through a single density\-weighted scalar per sample,\(∫ρ\(𝐫\)Δvxc\(𝐫\)𝑑𝐫\)2\\big\(\\int\\rho\(\\mathbf\{r\}\)\\,\\Delta v\_\{\\mathrm\{xc\}\}\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}\\big\)^\{2\}\. This fixes only theρ\\rho\-weighted mean of the potential error and leaves every variation with∫ρδv=0\\int\\rho\\,\\delta v=0unconstrained\. Supervising the fullVxc∈ℝB×BV\_\{\\mathrm\{xc\}\}\\in\\mathbb\{R\}^\{B\\times B\}matrix goes to the opposite extreme, since its occupied\-occupied and virtual\-virtual blocks describe rotations that leave the density invariant, forcing the model to spend capacity on potential structure with no physical effect\. Our gradient instead constrains theO×VO\\times Voccupied\-virtual block, the directions that move the density and drive the SCF update, givingO⋅VO\\cdot Vindependent constraints per sample and SCF cycle\. It supervises these directions along the trajectory rather than at the ground state alone\.Kanungoet al\.\([2021](https://arxiv.org/html/2606.04279#bib.bib50)\)show that small density errors correspond to large potential errors, so matching them away from equilibrium is a stronger requirement than matching the ground\-state density\. Early ablations showed that supervising the fullVxcV\_\{\\mathrm\{xc\}\}matrix degrades distillation accuracy relative to the Grassmannian gradient, consistent with the model wasting capacity on the unphysical occupied\-occupied and virtual\-virtual blocks\.
TheHessianis only supervised at the equilibrium density to focus model expressivity on the physically relevant curvature of the energy functional\. The Hessian at the ground state governs SCF stability and linear response, while second\-order information far from the minimum is weakly constrained and unnecessary for many practical applications\. Even though the dimensionality of the Grassmann manifold is much smaller than that of the fullB×BB\\times Bmatrix space, the corresponding Grassmannian Hessian
𝑯iajb=∂2Exc\[ρ\(𝐫;𝜽\)\]∂𝜽ia∂𝜽jb\|𝜽=0\\displaystyle\\boldsymbol\{H\}\_\{iajb\}=\\frac\{\\partial^\{2\}E\_\{\\textrm\{xc\}\}\[\\rho\(\\mathbf\{r\};\\boldsymbol\{\\theta\}\)\]\}\{\\partial\\boldsymbol\{\\theta\}\_\{ia\}\\partial\\boldsymbol\{\\theta\}\_\{jb\}\}\\bigg\|\_\{\\boldsymbol\{\\theta\}=0\}\(13\)remains prohibitively large to materialize\. Instead, we supervise randomly sampled linear responses
δ𝒈μi\(xc\)=∂𝒈μi\(xc\)\[ρ\(r;𝜽\)\]∂𝜽νj\|𝜽=0δ𝜽νj\\displaystyle\\delta\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\_\{\\mu i\}=\\frac\{\\partial\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\_\{\\mu i\}\[\\rho\(r;\\boldsymbol\{\\theta\}\)\]\}\{\\partial\\boldsymbol\{\\theta\}\_\{\\nu j\}\}\\bigg\|\_\{\\boldsymbol\{\\theta\}=0\}\\hskip\-5\.69054pt\\delta\\boldsymbol\{\\theta\}\_\{\\nu j\}\(14\)which are efficiently computable using Hessian\-vector products \(Appendix[D](https://arxiv.org/html/2606.04279#A4)\)\. Perturbation directions are sampled with importance weights biased toward small occupied\-virtual gaps
δθia∝zia/\(ϵa−ϵi\),zia∼𝒩\(0,1\)\.\\displaystyle\\delta\\theta\_\{ia\}\\propto z\_\{ia\}/\(\\epsilon\_\{a\}\-\\epsilon\_\{i\}\),\\quad z\_\{ia\}\\sim\\mathcal\{N\}\(0,1\)\\\>\.\(15\)This weighting focuses the Hessian supervision on the low\-gap transitions that dominate linear response and TDDFT excitation energies\. We define the Hessian loss as the Monte Carlo expectation
ℒH=𝔼δ𝜽\[‖Δδ𝒈\(xc\)‖F2‖δ𝒈\(xc\)‖F2\],\\displaystyle\\mathcal\{L\}\_\{H\}=\\mathbb\{E\}\_\{\\delta\\boldsymbol\{\\theta\}\}\\left\[\\frac\{\\\|\\Delta\\delta\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\\\|\_\{F\}^\{2\}\}\{\\\|\\delta\\boldsymbol\{g\}^\{\\textrm\{\(xc\)\}\}\\\|\_\{F\}^\{2\}\}\\right\]\\\>,\(16\)where we normalize by the target response magnitude to aid training stability, approximated withT=8T=8samples per training step\.
Thegap weighting\([15](https://arxiv.org/html/2606.04279#S4.E15)\) carries a precise physical meaning beyond down\-weighting high\-gap transitions\. We collect the gaps into a diagonal matrix𝑫\\boldsymbol\{D\}with entriesdia=εa−εid\_\{ia\}=\\varepsilon\_\{a\}\-\\varepsilon\_\{i\}, so the sampled direction isδ𝜽=𝑫−1𝒛\\delta\\boldsymbol\{\\theta\}=\\boldsymbol\{D\}^\{\-1\}\\boldsymbol\{z\}\. The same𝑫−1\\boldsymbol\{D\}^\{\-1\}plays three roles\. First, as the leading\-order inverse orbital Hessian it preconditions the SCF iteration, since a first\-order update from the occupied\-virtual Fock residual scales asθia∝−Fia/\(εa−εi\)\\theta\_\{ia\}\\propto\-F\_\{ia\}/\(\\varepsilon\_\{a\}\-\\varepsilon\_\{i\}\)\(Helgakeret al\.,[2000](https://arxiv.org/html/2606.04279#bib.bib10)\)\. Second, it is proportional to the static Kohn\-Sham responseχs\\chi\_\{s\}, diagonal in the particle\-hole basis with entries∝dia−1\\propto d\_\{ia\}^\{\-1\}\(Appendix[G](https://arxiv.org/html/2606.04279#A7)\), which sets the metric of the OEP equationχsV^xcOEP=Λ\\chi\_\{s\}\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}=\\Lambda\. Third, it inverts the independent\-particle block of the Casida matrix𝑨=𝑫\+𝑲\\boldsymbol\{A\}=\\boldsymbol\{D\}\+\\boldsymbol\{K\}\(Appendix[F](https://arxiv.org/html/2606.04279#A6)\), where the supervised Hessian is the exchange\-correlation \(fxcf\_\{xc\}\) part of the coupling block𝑲\\boldsymbol\{K\}\. The sampling therefore drawsδ𝜽∝χs𝒛\\delta\\boldsymbol\{\\theta\}\\propto\\chi\_\{s\}\\boldsymbol\{z\}, the first\-order response to a random potential perturbation, equivalently the SCF rotation induced by a random Fock residual\. The perturbations concentrate where the response function, the SCF iteration, and the OEP residual are jointly most sensitive\. This sampling yields an OEP\-inspired curvature match at the cost of one batched Hessian\-vector product, without forming or inverting the response operatorχs\\chi\_\{s\}that the OEP equation requires\.
The four terms carry complementary information about the ground\-state energy basin\. Energy supervision fixes the scalar value at the minimum and leaves the density that attains it unconstrained\. Density supervision locates the minimum but not the path toward it\. Gradient supervision adds the local slope, the direction of steepest descent in density space\. Hessian supervision adds the curvature, which governs convergence and links to linear response\. The training signal moves from the value at a single point to the shape of the basin around it\.
## 5Experiments
Datasets:We train and test on QM9 molecules\(Ramakrishnanet al\.,[2014](https://arxiv.org/html/2606.04279#bib.bib185)\)and use QM40\(Madushankaet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib186)\)for far out\-of\-distribution evaluation\. We split QM9 by heavy\-atom count, training and validating on the smaller molecules and testing on the larger ones to isolate size extrapolation\. From QM9 we exclude fluorine, since it is rare in the dataset and the heavy\-atom split leaves close to none in the training set\. From QM40 we exclude fluorine, sulfur, and chlorine, since these elements are scarce or absent in QM9 and would otherwise add element extrapolation to the size\-extrapolation test\. We subsample QM40 to 50 molecules per heavy\-atom bin up to 40 heavy atoms\.
Evaluation metrics:Total energy alone does not capture functional quality, especially when directly optimized\. We therefore evaluate a diverse set of properties: total energyEtotE\_\{\\text\{tot\}\}, energy components \(EρE\_\{\\rho\},ECE\_\{C\},ExcE\_\{\\text\{xc\}\}\), HOMO\-LUMO gapεHL\\varepsilon\_\{\\text\{HL\}\}, TDDFT excited\-state energies, density errors \(L1L\_\{1\},L2L\_\{2\}, dipoleμρ\\mu\_\{\\rho\}\), and SCF convergence behavior\. See Appendix[J](https://arxiv.org/html/2606.04279#A10)for details\.
Adaptive Training Stabilization:Training ML\-XC functionals end\-to\-end through the SCF solver is challenging because parameter updates that destabilize the XC\-potential prevent proper SCF convergence, producing unreliable gradients that further degrade subsequent updates\. Viewed as a fixed\-point iterationP\(t\+1\)=fθ\(P\(t\)\)P^\{\(t\+1\)\}=f\_\{\\theta\}\(P^\{\(t\)\}\), where the mapfθf\_\{\\theta\}is implicitly defined by the learned XC functional, SCF can be connected to a setting studied in the literature asDeep Equilibrium Models\(DEQs\)\(Baiet al\.,[2019](https://arxiv.org/html/2606.04279#bib.bib27)\)\.Baiet al\.\([2021](https://arxiv.org/html/2606.04279#bib.bib29)\)pointed out that such architectures are indeed brittle and addressed their instability by regularizing the fixed\-point map Jacobian, but their approach does not transfer directly to the SCF setting\. Our Hessian supervision can be viewed as an alternative to this, as it encodes the ground state as a stable attractor of the SCF\-map\.
However, to reliably compare all loss settings and prevent individual bad updates from derailing the training, we employ a Metropolis\-inspired accept\-reject mechanism based on an adaptive variance estimate of the relative mean epoch loss change \(Appendix[H](https://arxiv.org/html/2606.04279#A8)\)\. Updates exceeding a given tolerance are rejected and the optimizer momentum is rescaled after consecutive rejections\. This stabilization scheme enables a simplified single\-stage, gradient\-based training procedure, unlike the multi\-stage approaches ofDick and Fernandez\-Serra \([2021](https://arxiv.org/html/2606.04279#bib.bib51)\),Kirkpatricket al\.\([2021](https://arxiv.org/html/2606.04279#bib.bib102)\), andLuiseet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib125)\), or the gradient\-free optimization ofNagaiet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib135)\)\. Notably, this procedure converges reliably from standard MINAO density initializations, without requiring the pre\-converged densities used in prior work\(Dick and Fernandez\-Serra,[2021](https://arxiv.org/html/2606.04279#bib.bib51); Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67)\)\.
Parameter Optimization and Initialization:Since DEQs are known to be sensitive w\.r\.t\. their initialization\(Agarwala and Schoenholz,[2022](https://arxiv.org/html/2606.04279#bib.bib21)\), we calibrate the initial weight distribution to be variance preserving across layers222We do not do this for Skala since its reference implementation explicitly pairs uniform Xavier initialization with SiLu\-activations\. More generally, we observe significant improvements using the Muon optimizer\(Jordanet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib15)\)compared to Adam\(Kingma and Ba,[2017](https://arxiv.org/html/2606.04279#bib.bib18)\)across all loss configurations \(Appendix[I](https://arxiv.org/html/2606.04279#A9)\) and use it throughout\. All runs are performed with different random seeds sampled from hardware entropy\.
Models:We evaluate DI\-Loss on multiple ML\-XC architectures\. NNmGGA\(Nagaiet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib135)\)is a simple constraint\-free neural network operating on semi\-local density features\. The semi\-local model byDick and Fernandez\-Serra \([2021](https://arxiv.org/html/2606.04279#bib.bib51)\)explicitly enforces physical constraints through architecture design similar to traditional constraint\-based DFAs\(Sunet al\.,[2015](https://arxiv.org/html/2606.04279#bib.bib170)\)\. EG\-XC\(Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67)\)augments learnable mGGA functionals to non\-local models via equivariant message passing on nuclei\-centered representations\. In our distillation experiments, we use NNmGGA as the local component of EG\-XC\. Skala\(Luiseet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib125)\)is broadly similar to EG\-XC but without message passing on the nuclei\-centered representations; to isolate the influence of their grid\-based MLP architecture, we focus onSkala mGGA, their model with the non\-local augmentation disabled333This is equivalent to setting the non\-local flag to off in their reference implementation\. To compare between models, we match the semi\-local model sizes within each distillation setting \(Appendix[N](https://arxiv.org/html/2606.04279#A14)\)\.
### 5\.1Learning mGGAs

Figure 1:Diagnostic toy example: energy\-density of mGGA functionals distilled from SCAN\.Each panel plots the ratio of learned to target energy density \(y\-axis\) against the dimensionless density gradientss\(x\-axis\), a standard semi\-local input feature \(Appendix[E\.1](https://arxiv.org/html/2606.04279#A5.SS1)\)\. Rows are the three architectures, columns three density regimes set by\(n,τ\)\(n,\\tau\), and a perfect functional traces the line at one\. The bottom row shows the importance\-weighted \(ωquadrature×ρ×\|εtarget\|\\omega\_\{\\text\{quadrature\}\}\\times\\rho\\times\|\\varepsilon\_\{\\text\{target\}\}\|\) distribution ofsson an ethanol quadrature grid, marking where errors carry energetic weight\. Energy supervision deviates from the target even in the data\-rich region\. The gradient and Hessian terms \(\+∇\+\\nabla,\+H\+H\) remove these deviations and extrapolate into low weight grid feature regimes\. Supervising only energy\-type quantities, this setting isolates how the derivative terms reshape the energy density without any density loss\.To investigate the influence of DI\-Loss terms we start with a toy example\. We learn an ML mGGA representation of a traditional mGGA, where the learned and target functionals share the energy\-density formεxc\(ρ,ζ,s,τ\)\\varepsilon\_\{\\text\{xc\}\}\(\\rho,\\zeta,s,\\tau\)and we can inspect the learned functional directly\. Training against a reference in the same XC category keeps model expressivity from confounding the loss\-term influence\(Sunet al\.,[2015](https://arxiv.org/html/2606.04279#bib.bib170)\)\. We scale all semi\-local models to the same parameter count \(Appendix[N](https://arxiv.org/html/2606.04279#A14)\), which gives approximately equal cost per SCF step\. We train on QM9 molecules with up to 5 heavy atoms, which leaves only 174 molecules\. To isolate the derivative terms we supervise only energy\-type quantities, the total energy, the mean\-field energyEρE\_\{\\rho\}in place of the tunedL1L\_\{1\}density loss, and the energy gradient and Hessian\. This handicaps the baseline on purpose, since the point is to read off what the gradient and Hessian add when no real\-space density signal is present\.
Crucially, the mGGA\-to\-mGGA setting provides a diagnostic in which the learned functionals can be analyzed at the level of energy densitiesεxc\(ρ,ζ,s,τ\)\\varepsilon\_\{\\text\{xc\}\}\(\\rho,\\zeta,s,\\tau\)\. Figure[1](https://arxiv.org/html/2606.04279#S5.F1)demonstrates that DI\-Loss systematically corrects energy misallocation \(in the absence of density supervision\) across density regimes, providing insight into where and how the learned functional improves, rather than treating it as a black box\. Table[7](https://arxiv.org/html/2606.04279#A13.T7)shows the same effect on the energy components, where the gradient term reduces the Coulomb and exchange\-correlation errors that energy\-only supervision leaves two orders of magnitude off\.

Figure 2:Distillation of𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\-scaling B3LYP into𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling ML\-XC functionals\.We train on QM9 molecules with up to 7 heavy atoms and evaluate out\-of\-distribution on a random QM40 subsplit, fixed across all models, at the B3LYP/def2\-SVP level of theory\. Bars give the mean MAE over random seeds, error bars the standard error of that mean, and the jittered dots resolve the per\-molecule errors that the means average over\. DI\-Loss sharpens the energetic quantities \(EtotE\_\{\\mathrm\{tot\}\},ExcE\_\{\\mathrm\{xc\}\}\) and leaves the density metrics \(μρ\\mu\_\{\\rho\},L2\[ρ\]L\_\{2\}\[\\rho\]\) near theE\+ρE\+\\rhobaseline across the three semi\-local architectures\. On the more expressive EG\-XC architecture the added terms improve the densities, loweringL2\[ρ\]L\_\{2\}\[\\rho\], whereas they slightly worsen it on the three semi\-local architectures\. We train EG\-XC once per setting, so we treat this as suggestive rather than conclusive\. Full per\-run numbers are in Table[8](https://arxiv.org/html/2606.04279#A13.T8)\.
### 5\.2Distillation of XC\-Functionals
We further evaluate DI\-Loss in a distillation setting, compressing B3LYP \(𝒪\(N4\)\\mathcal\{O\}\(N^\{4\}\)\) into semi\-local functionals \(𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\) and testing out\-of\-distribution on QM40 \(Figure[2](https://arxiv.org/html/2606.04279#S5.F2)\)\. Since hybrid functionals are fundamentally non\-local and cannot fulfill the same set of constraints as semi\-local methods, we additionally evaluate the non\-local EG\-XC functional using NNmGGA as a local model\(Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67); Nagaiet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib135)\)\. The computational scaling of DFT is more accurately described in terms of the basis sizeBBrather than atom count, with the commonly cited𝒪\(NX\)\\mathcal\{O\}\(N^\{X\}\)behavior reflecting scaling inBB, which itself grows linearly with system size for fixed basis set families\. To reduce the cost of the𝒪\(B4\)\\mathcal\{O\}\(B^\{4\}\)\-scaling hybrid reference, we use the smaller def2\-SVP basis set for these experiments\. We use moderately larger local networks than in the SCAN experiments \(Appendix[N](https://arxiv.org/html/2606.04279#A14)\), since the non\-local character of exact exchange places greater demands on model capacity than semi\-local distillation\.
DI\-Loss lowers the total energy error of every architecture and leaves the density metrics near theE\+ρE\+\\rhobaseline\. The effect is largest on EG\-XC, where the full loss cuts the total energy error by80%80\\%and is the only architecture whose density also improves, loweringL2\[ρ\]L\_\{2\}\[\\rho\]by8%8\\%\. We attribute this to its non\-local representation having the capacity to absorb the derivative signal that the semi\-local models cannot, though we have trained EG\-XC only once per setting and cannot rule out seed variance\. The distilled functionals require∼15%\{\\sim\}15\\%more SCF iterations than the B3LYP reference, and the Hessian term recovers part of this overhead, consistent with our DEQ\-regularization hypothesis that curvature supervision regularizes the energy landscape near the ground state\.
### 5\.3Accelerated\-SCF using ML\-XC\-Functionals
\.5 Figure[2](https://arxiv.org/html/2606.04279#S5.F2)shows that the self\-consistent densities of the distilled functionals remain close to the B3LYP optimum in an energetic sense\. When evaluated with the B3LYP functional, these densities yield errors below11mEh on the out\-of\-distribution QM40 test set:0\.850\.85–0\.90\.9mEh for theE\+ρE\+\\rhobaselines and0\.670\.67mEh with DI\-Loss\. The gain from derivative supervision is modest for this metric, but the absolute error is small\. This observation has motivated us to explore the use of these densities as initial guesses for the B3LYP functional as a secondary application for distilled functionals beyond direct evaluation\. We run the lower\-scaling distilled functional and use the resulting density as the initial guess for a subsequent B3LYP calculation\. This follows a standard electronic\-structure strategy, where cheaper methods initialize more expensive calculations\(Stein and Hutter,[2022](https://arxiv.org/html/2606.04279#bib.bib202)\), but applies it to distilled neural network XC functionals\.
Figure[3](https://arxiv.org/html/2606.04279#S5.F3)shows that this results in a significant reduction in the required number of solver iterations relative to standard MINAO initialization for B3LYP\. The traditional semi\-local initialization baseline reduces the relative iteration count to about 65%, while distilled ML\-XC densities reduce it further to about 50% across all architectures\. This acceleration effect is largely unaffected by DI\-Loss\. Distillates trained only withE\+ρE\+\\rhoalready provide low\-RIC initializations, and derivative supervision changes RIC only marginally relative to the gain from distillation itself\.

Figure 3:Relative iteration count \(RIC\) for B3LYPcalculations initialized from SCAN or distilled ML\-XC densities on QM9, trained on molecules with up to 7 heavy atoms and evaluated on 1000 randomly selected molecules with exactly 9 heavy atoms, fixed across all runs\. RIC is the number of B3LYP iterations required from a given initial density, normalized by standard MINAO initialization\. SCAN reduces the iteration count to about 65% of MINAO, while distilled ML\-XC densities reduce it further to about 50% across all architectures\. Hence functional distillation provides a secondary use as initializers for subsequent calculations with more expensive traditional functionals\. Here derivative supervision has only a minor effect on RIC\. DI\-Loss supervision gives the lowest mean RIC, but the gain overE\+ρE\+\\rhois small relative to the gain from distillation itself\.
Figure 4:Hessian supervision improves TDDFT excited\-state accuracy across functionals and molecule sizes\.Each cell reports the mean absolute error \(MAE, meV\) of the lowest five vertical excitation energies against B3LYP, averaged overnmoln\_\{\\text\{mol\}\}molecules at each heavy\-atom count\. Rows group the three neural functionals \(NNmGGA, XCdiff, Skala\-mGGA\)\. Within each functional, the three sub\-rows add training signal cumulatively, from energy and density \(E\+ρE\+\\rho\), to first derivatives \(\+∇\+\\nabla\), to the full DI\-Loss with the Hessian \(\+∇\+H\+\\nabla\+H\)\. Columns give the excited states \(1–5\) and their mean MAE\. The dashed line separates in\-distribution QM9 molecules from larger out\-of\-distribution molecules from QM9 and QM40\. Color encodes the MAE \(light low, dark high, normalized within each heavy atom column\), and±\\pmvalues are the standard error across different random seeds\. Hessian supervision lowers MAE across every functional and size\. In\-distribution, the first excitation already carries the lowest error \(181 meV, XCdiff, 7 heavy atoms\), so the gains concentrate on the higher states, where the DI\-Loss cuts the energy MAE of the fifth excitation from 356 to 233 meV \(XCdiff, 7 heavy atoms\)\. Out\-of\-distribution, the first excitation also improves \(462 to 386 meV, XCdiff, 14 heavy atoms\)\. XCdiff gives the lowest mean MAE at every size\. Relative toE\+ρE\+\\rho, the DI\-Loss reduces the XCdiff mean by 28% \(7 heavy\), 24% \(9 heavy\), 24% \(12 heavy\), and 19% \(14 heavy atoms\)\.RIC reduction alone does not imply walltime reduction, since the distilled pre\-run adds additional SCF cycles\. We therefore measure the end\-to\-end runtime of ML\-XC\-initialized B3LYP in Appendix[K](https://arxiv.org/html/2606.04279#A11)\. On a QM40 molecule with 35 heavy atoms, six NNmGGA cycles followed by the main B3LYP calculation results in a1\.35×1\.35\\timeswalltime speedup\. This finite\-size gain should increase for larger systems and larger basis sets, since the distilled pre\-run avoids the𝒪\(B4\)\\mathcal\{O\}\(B^\{4\}\)exact\-exchange step of B3LYP\. For a fixed number of ML\-XC pre\-run cycles, the relative cost of the initialization therefore decreases with basis size\. In this regime, the walltime reduction approaches the RIC reduction, and the speedup factor approaches1/RIC1/\\mathrm\{RIC\}\. Appendix[K](https://arxiv.org/html/2606.04279#A11)provides single\-cycle benchmarks supporting this scaling trend\.
Prior machine\-learned density\(Song and Feng,[2024](https://arxiv.org/html/2606.04279#bib.bib166)\)and Hamiltonian\(Yuet al\.,[2023](https://arxiv.org/html/2606.04279#bib.bib189)\)predictors have also been used for SCF initialization\. These methods are cheaper than our approach, since they predict an initial density or Hamiltonian in a single forward pass rather than running a distilled SCF calculation\. However, matrix predictors trained only on converged ground states fail at size extrapolation\(Liuet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib20)\)\. Our approach makes a different tradeoff\. It adds the cost of several SCF cycles with the distilled functional, but gives a larger iteration reduction in this setting\. On QM40 at the same B3LYP/def2\-SVP level, solver\-aligned matrix prediction reduces the B3LYP iteration count by about 20–30%\(Eberhardet al\.,[2026](https://arxiv.org/html/2606.04279#bib.bib2)\), while distilled functional initialization reduces it by about 50%\. So these two approaches target different cost regimes\. Matrix predictors provide cheaper initial guesses, while distilled functionals provide stronger initialization when the additional pre\-run cost is amortized by the computational scaling of the hybrid calculation\.
### 5\.4TDDFT
The orbital Hessian that governs the linear\-response equations is the second derivative that the DI\-Loss supervises, so we expect the Hessian term to improve the excitation energies\. We compute the lowest five vertical excitations for each trained functional and report the MAE against the B3LYP target in[Figure4](https://arxiv.org/html/2606.04279#S5.F4), across one in\-distribution size \(7 heavy atoms\) and three out\-of\-distribution sizes \(9 heavy atoms from QM9, and 12 and 14 heavy atoms from QM40\)\.
Adding only the gradient term raises the MAE above theE\+ρE\+\\rhobaseline for every functional and size \(9–14% for XCdiff\)\. The Hessian term reverses this and lowers the MAE below the baseline at every size\. In\-distribution, the first excitation is already accurate, so the reduction concentrates on the higher states\. Out\-of\-distribution, the first state benefits as well\. XCdiff reaches the lowest mean error throughout, where DI\-Loss cuts the mean by 28% at 7 heavy atoms and 19% at 14\.
## 6Discussion
In this work, we introducedDerivative Informed XC\-Loss\(DI\-Loss\), a composite loss function that regularizes ML\-XC functionals by supervising derivatives of the energy on the Grassmannian of admissible density matrices\. We evaluate this objective in a hybrid\-distillation setting, where lower\-scaling ML\-XC functionals \(three semi\-local architectures and the non\-local EG\-XC\) are trained to reproduce B3LYP/def2\-SVP targets while retaining the lower scaling of non\-hybrid functionals\. In this setting, DI\-Loss consistently improves the main energy metrics across all four evaluated architectures\. For the selected loss weights, the total\-energy MAE decreases by66%66\\%when averaged uniformly across architectures\. These improvements are strongest for Skala\-mGGA and EG\-XC, where the total\-energy MAE decreases from15\.815\.8to3\.63\.6mEh and from15\.915\.9to3\.13\.1mEh, respectively\. NNmGGA and XCdiff also improve, with total\-energy MAEs decreasing from4\.24\.2to2\.92\.9mEh and from8\.98\.9to5\.55\.5mEh\. The effect on density\-related quantities is more metric\-dependent\. The mean\-field energy metricEρE\_\{\\rho\}, which has been proposed as a density\-sensitive benchmark byGould \([2023](https://arxiv.org/html/2606.04279#bib.bib72)\), improves on average from1\.21\.2to0\.80\.8mEh\. In contrast, the direct density metricsμρ\\mu\_\{\\rho\}andℒ2\[ρ\]\\mathcal\{L\}\_\{2\}\[\\rho\]do not improve uniformly\.
We show that the response information introduced by Hessian supervision improves downstream excited\-state calculations\. In TDDFT calculations against B3LYP reference excitations, adding Hessian supervision lowers the mean excitation\-energy MAE across functionals and molecule sizes\. For the strongest architecture, XCdiff, the full DI\-Loss reduces the mean MAE relative toE\+ρE\+\\rhotraining by28%28\\%,24%24\\%,24%24\\%, and19%19\\%for molecules with77,99,1212, and1414heavy atoms, respectively, and reaches35%35\\%on the smallest in\-distribution molecules \(55heavy atoms,[SectionM\.3](https://arxiv.org/html/2606.04279#A13.SS3)\)\. In\-distribution, where the first excitation is already comparatively accurate, the gains are concentrated on higher excited states\. For example, the MAE of the fifth excitation decreases from356356to233233meV for XCdiff on molecules with77heavy atoms\. Out\-of\-distribution, the first excitation also improves, decreasing from462462to386386meV for XCdiff on molecules with1414heavy atoms\.
Overall, these results indicate that Grassmannian derivative supervision improves more than the fitted ground\-state energy\. It regularizes the learned functional in directions that affect the self\-consistent density, orbital gaps, and the response properties entering downstream TDDFT calculations\.
Limitations and Future Work\.The additional loss terms we propose are only directly applicable to the distillation setting, where reference gradients and Hessians are readily available from a reference functional\. Extending DI\-Loss to higher\-fidelity targets such as CCSD\(T\) would require inverse DFT techniques to obtain the corresponding XC potential and its derivatives, which remains computationally demanding\(Kanungoet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib95)\)\. Nevertheless, we envision a multi\-stage training strategy analogous to recent advances in machine\-learned interatomic potentials \(MLIPs\), where combining lower\-fidelity pretraining with higher\-fidelity fine\-tuning improves performance at reduced computational cost\(Kulichenkoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib108)\)\. Concretely, DI\-Loss could enable efficient pretraining on range\-separated or double\-hybrid functionals, which currently compete with𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling ML\-XC functionals\(Luiseet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib125)\), followed by fine\-tuning on CCSD\(T\) energies or inverse\-DFT\-derived XC potentials\. For double\-hybrid distillation in particular, correct orbital energies are essential, and density\-only supervision leaves virtual orbital energies largely unconstrained, making our Hessian supervision especially relevant\.
All experiments in this work are restricted to closed\-shell organic molecules, a relatively narrow chemical domain\. While our molecular runtime benchmarks already show favorable scaling compared to B3LYP \(Appendix[K](https://arxiv.org/html/2606.04279#A11)\), hybrid\-functional acceleration is likely even more consequential in solid\-state settings where exact exchange is substantially more expensive\. Whether the benefits of DI\-Loss transfer to periodic solid\-state systems remains an open question\. Additionally, while Hessian supervision improves the quality of the learned functional, it introduces a moderate training cost overhead \(Appendix[K\.2](https://arxiv.org/html/2606.04279#A11.SS2)\)\. However, this overhead is confined entirely to training, and the resulting semi\-local functional incurs no additional inference cost\.
A further open question concerns the expressivity of current ML\-XC architectures\. Our evaluations suggest that even recent𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)\-scaling parameterizations may lack the capacity to fully capture exact exchange effects\. Whether this expressivity gap can be closed without sacrificing computational efficiency remains unclear\. However, it is also unknown whether the true XC functional exhibits such hybrid\-like features that resist approximation by𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)architectures, and this limitation may prove less relevant when training on wavefunction\-based targets such as CCSD\(T\)\. Additionally, we note that our architectural comparisons were performed on down\-scaled versions of the original architectures and larger\-scale ablations would strengthen expressivity claims\.
More broadly, beyond the performance gains reported here, our robust single\-stage training procedure enables cross\-architecture comparisons under consistent protocols, addressing a key evaluation gap in prior ML\-XC work\. Finally, KS\-DFT serves as a reference theory for generating large\-scale datasets used to train machine\-learned force fields\(Levineet al\.,[2025](https://arxiv.org/html/2606.04279#bib.bib116)\), and improvements in XC\-functional accuracy have the potential to indirectly enhance the fidelity of these downstream models, amplifying the impact of improved density functional approximations\.
## Acknowledgements
We thank Jacobus Dijkman for insightful discussions regarding Hessian supervision in classical density functional theory\.
## Impact Statement
This work develops training strategies for ML\-XC\-functionals aiming to improve the accuracy and stability of electronic structure simulations\. By enabling more reliable and efficient quantum chemical calculations, the proposed methods may contribute to advances in computational chemistry workflows used in materials science, molecular modeling, and chemical design\. Potential downstream applications include accelerated screening of functional materials, improved modeling of molecular properties, and more efficient electronic\-structure simulation in academic and industrial research\. We do not foresee societal risks beyond those generally associated with advances in chemical simulation and materials modeling\.
## References
- Deep equilibrium networks are sensitive to initialization statistics\.InPMLR 162,Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.SS0.SSS0.Px3.p1.1),[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.1),[§5](https://arxiv.org/html/2606.04279#S5.p5.1)\.
- A\. Aldossary, J\. A\. Campos\-Gonzalez\-Angulo, S\. Pablo\-García, S\. X\. Leong, E\. M\. Rajaonson, L\. Thiede, G\. Tom, A\. Wang, D\. Avagliano, and A\. Aspuru\-Guzik \(2024\)In silico chemical experiments in the age of ai: from quantum chemistry to machine learning and back\.Advanced Materials36\(30\),pp\. 2402369\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- D\. G\. Anderson \(1965\)Iterative Procedures for Nonlinear Integral Equations\.J\. ACM12\(4\),pp\. 547–560\.Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.2)\.
- S\. Bai, J\. Z\. Kolter, and V\. Koltun \(2019\)Deep Equilibrium Models\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.1),[§5](https://arxiv.org/html/2606.04279#S5.p3.2)\.
- S\. Bai, V\. Koltun, and J\. Z\. Kolter \(2021\)Stabilizing Equilibrium Models by Jacobian Regularization\.InPMLR 139,Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.1),[§5](https://arxiv.org/html/2606.04279#S5.p3.2)\.
- A\. D\. Becke \(1993\)Density\-functional thermochemistry\. iii\. the role of exact exchange\.The Journal of chemical physics98\(7\),pp\. 5648–5652\.Cited by:[§2](https://arxiv.org/html/2606.04279#S2.p4.8),[§3](https://arxiv.org/html/2606.04279#S3.p1.3)\.
- S\. Dick and M\. Fernandez\-Serra \(2021\)Highly accurate and constrained density functional obtained with differentiable programming\.Physical Review B104\(16\),pp\. L161109\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p4.1),[§3](https://arxiv.org/html/2606.04279#S3.p2.10),[§3](https://arxiv.org/html/2606.04279#S3.p3.1),[§5](https://arxiv.org/html/2606.04279#S5.p4.1),[§5](https://arxiv.org/html/2606.04279#S5.p6.1)\.
- J\. Dijkman, M\. Dijkstra, R\. Van Roij, M\. Welling, J\. van de Meent, and B\. Ensing \(2025\)Learning neural free\-energy functionals with pair\-correlation matching\.Physical Review Letters134\(5\),pp\. 056103\.Cited by:[Appendix D](https://arxiv.org/html/2606.04279#A4.SS0.SSS0.Px2.p1.2)\.
- E\. S\. Eberhard, V\. Kotsev, T\. Güthle, and S\. Günnemann \(2026\)Transferable SCF\-Acceleration through Solver\-Aligned Initialization Learning\.arXiv\.Cited by:[§5\.3](https://arxiv.org/html/2606.04279#S5.SS3.p4.1)\.
- A\. Edelman, T\. A\. Arias, and S\. T\. Smith \(1998\)The geometry of algorithms with orthogonality constraints\.SIAM journal on Matrix Analysis and Applications20\(2\),pp\. 303–353\.Cited by:[§2](https://arxiv.org/html/2606.04279#S2.p3.10)\.
- N\. Gao, E\. Eberhard, and S\. Günnemann \(2024\)Learning Equivariant Non\-Local Electron Density Functionals\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Table 16](https://arxiv.org/html/2606.04279#A14.T16),[§1](https://arxiv.org/html/2606.04279#S1.p4.1),[§3](https://arxiv.org/html/2606.04279#S3.p2.10),[§3](https://arxiv.org/html/2606.04279#S3.p3.1),[§5\.2](https://arxiv.org/html/2606.04279#S5.SS2.p1.6),[§5](https://arxiv.org/html/2606.04279#S5.p4.1),[§5](https://arxiv.org/html/2606.04279#S5.p6.1)\.
- L\. Goerigk and S\. Grimme \(2014\)Double\-hybrid density functionals\.WIREs Computational Molecular Science4\(6\),pp\. 576–600\.Cited by:[§E\.3](https://arxiv.org/html/2606.04279#A5.SS3.SSS0.Px3.p1.2)\.
- T\. Gould \(2023\)A step toward density benchmarking—The energy\-relevant “mean field error”\.The Journal of Chemical Physics159\(20\),pp\. 204111\.Cited by:[Appendix J](https://arxiv.org/html/2606.04279#A10.SS0.SSS0.Px5.p1.1),[Appendix L](https://arxiv.org/html/2606.04279#A12.p1.9),[§6](https://arxiv.org/html/2606.04279#S6.p1.14)\.
- S\. Grimme \(2006\)Semiempirical hybrid density functional with perturbative second\-order correlation\.The Journal of chemical physics124\(3\)\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p5.3),[§2](https://arxiv.org/html/2606.04279#S2.p4.8)\.
- E\. K\. U\. Gross and N\. T\. Maitra \(2012\)Introduction to TDDFT\.InFundamentals of Time\-Dependent Density Functional Theory,M\. A\.L\. Marques, N\. T\. Maitra, F\. M\.S\. Nogueira, E\.K\.U\. Gross, and A\. Rubio \(Eds\.\),Vol\.837,pp\. 53–99\.External Links:ISBN 978\-3\-642\-23517\-7 978\-3\-642\-23518\-4Cited by:[§2](https://arxiv.org/html/2606.04279#S2.p5.2)\.
- T\. Helgaker, P\. Jørgensen, and J\. Olsen \(2000\)Hartree–fock theory\.InMolecular Electronic\-Structure Theory,pp\. 433–522\.External Links:ISBN 9781119019572,[Document](https://dx.doi.org/10.1002/9781119019572.ch10),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119019572.ch10)Cited by:[§4](https://arxiv.org/html/2606.04279#S4.p5.13)\.
- P\. Hohenberg and W\. Kohn \(1964\)Inhomogeneous Electron Gas\.Physical Review136\(3B\),pp\. B864–B871\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p2.6)\.
- S\. Ivanov, S\. Hirata, and R\. J\. Bartlett \(2002\)Finite\-basis\-set optimized effective potential exchange\-only method\.The Journal of chemical physics116\(4\),pp\. 1269–1276\.Cited by:[Appendix G](https://arxiv.org/html/2606.04279#A7.SS0.SSS0.Px1.p2.5),[Appendix G](https://arxiv.org/html/2606.04279#A7.SS0.SSS0.Px1.p3.2)\.
- K\. Jordan, J\. Bernstein, B\. Rappazzo, @fernbear\.bsky\.social, B\. Vlado, Y\. Jiacheng, F\. Cesista, B\. Koszarsky, and @Grad62304977 \(2024\)Modded\-nanogpt: speedrunning the nanogpt baseline\.External Links:[Link](https://github.com/KellerJordan/modded-nanogpt)Cited by:[Table 3](https://arxiv.org/html/2606.04279#A9.T3),[§5](https://arxiv.org/html/2606.04279#S5.p5.1)\.
- B\. Kanungo, J\. Hatch, P\. M\. Zimmerman, and V\. Gavini \(2025\)Learning local and semi\-local density functionals from exact exchange\-correlation potentials and energies\.Science Advances11\(38\),pp\. eady8962\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p3.1),[§4](https://arxiv.org/html/2606.04279#S4.p3.10),[§6](https://arxiv.org/html/2606.04279#S6.p4.1)\.
- B\. Kanungo, P\. M\. Zimmerman, and V\. Gavini \(2021\)A Comparison of Exact and Model Exchange–Correlation Potentials for Molecules\.The Journal of Physical Chemistry Letters12\(50\),pp\. 12012–12019\.Cited by:[§4](https://arxiv.org/html/2606.04279#S4.p3.10)\.
- A\. D\. Kaplan, M\. Levy, and J\. P\. Perdew \(2023\)The Predictive Power of Exact Constraints and Appropriate Norms in Density Functional Theory\.Annual Review of Physical Chemistry74\(1\),pp\. 193–218\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p1.3)\.
- A\. Karton \(2026\)W1–SN2–BH: A Large\-Scale CCSD\(T\)/CBS Kinetic Database\.The Journal of Physical Chemistry A,pp\. acs\.jpca\.6c00424\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p4.1)\.
- D\. P\. Kingma and J\. Ba \(2017\)Adam: A Method for Stochastic Optimization\.arXiv\.Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.p3.6),[Table 3](https://arxiv.org/html/2606.04279#A9.T3),[§5](https://arxiv.org/html/2606.04279#S5.p5.1)\.
- J\. Kirkpatrick, B\. McMorrow, D\. H\. P\. Turban, A\. L\. Gaunt, J\. S\. Spencer, A\. G\. D\. G\. Matthews, A\. Obika, L\. Thiry, M\. Fortunato, D\. Pfau, L\. R\. Castellanos, S\. Petersen, A\. W\. R\. Nelson, P\. Kohli, P\. Mori\-Sánchez, D\. Hassabis, and A\. J\. Cohen \(2021\)Pushing the frontiers of density functionals by solving the fractional electron problem\.Science374\(6573\),pp\. 1385–1389\.Cited by:[§5](https://arxiv.org/html/2606.04279#S5.p4.1)\.
- P\. J\. Knowles and N\. C\. Handy \(1984\)A new determinant\-based full configuration interaction method\.Chemical physics letters111\(4\-5\),pp\. 315–321\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- W\. Kohn and L\. J\. Sham \(1965\)Self\-Consistent Equations Including Exchange and Correlation Effects\.Physical Review140\(4A\),pp\. A1133–A1138\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p3.4)\.
- M\. Kulichenko, B\. Nebgen, N\. Lubbers, J\. S\. Smith, K\. Barros, A\. E\. A\. Allen, A\. Habib, E\. Shinkle, N\. Fedik, Y\. W\. Li, R\. A\. Messerly, and S\. Tretiak \(2024\)Data Generation for Machine Learning Interatomic Potentials and Beyond\.Chemical Reviews124\(24\),pp\. 13681–13714\.Cited by:[§6](https://arxiv.org/html/2606.04279#S6.p4.1)\.
- S\. Kümmel and L\. Kronik \(2008\)Orbital\-dependent density functionals: theory and applications\.Reviews of Modern Physics80\(1\),pp\. 3–60\.Cited by:[Appendix G](https://arxiv.org/html/2606.04279#A7.SS0.SSS0.Px2.p1.2)\.
- S\. Lehtola, F\. Blockhuys, and C\. Van Alsenoy \(2020\)An Overview of Self\-Consistent Field Calculations Within Finite Basis Sets\.Molecules25\(5\),pp\. 1218\.Cited by:[Appendix A](https://arxiv.org/html/2606.04279#A1.p1.1),[Appendix A](https://arxiv.org/html/2606.04279#A1.p3.9),[Appendix A](https://arxiv.org/html/2606.04279#A1.p5.9),[Appendix B](https://arxiv.org/html/2606.04279#A2.p1.4),[§2](https://arxiv.org/html/2606.04279#S2.p1.3)\.
- D\. S\. Levine, M\. Shuaibi, E\. W\. C\. Spotte\-Smith, M\. G\. Taylor, M\. R\. Hasyim, K\. Michel, I\. Batatia, G\. Csányi, M\. Dzamba, P\. Eastman, N\. C\. Frey, X\. Fu, V\. Gharakhanyan, A\. S\. Krishnapriyan, J\. A\. Rackers, S\. Raja, A\. Rizvi, A\. S\. Rosen, Z\. Ulissi, S\. Vargas, C\. L\. Zitnick, S\. M\. Blau, and B\. M\. Wood \(2025\)The Open Molecules 2025 \(OMol25\) Dataset, Evaluations, and Models\.Cited by:[§6](https://arxiv.org/html/2606.04279#S6.p7.1)\.
- L\. Li, S\. Hoyer, R\. Pederson, R\. Sun, E\. D\. Cubuk, P\. Riley, and K\. Burke \(2021\)Kohn\-Sham Equations as Regularizer: Building Prior Knowledge into Machine\-Learned Physics\.Physical Review Letters126\(3\),pp\. 036401\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p3.1),[§4](https://arxiv.org/html/2606.04279#S4.p1.11)\.
- E\. H\. Lieb \(1983\)Density functionals for coulomb systems\.International journal of quantum chemistry24\(3\),pp\. 243–277\.Cited by:[Appendix G](https://arxiv.org/html/2606.04279#A7.SS0.SSS0.Px1.p2.4)\.
- Z\. Liu, Y\. Ni, Z\. Pu, Q\. Sun, S\. Liu, and W\. Yan \(2025\)Towards A Universally Transferable Acceleration Method for Density Functional Theory\.arXiv\.Cited by:[§5\.3](https://arxiv.org/html/2606.04279#S5.SS3.p4.1)\.
- Z\. Lu \(2021\)Computational discovery of energy materials in the era of big data and machine learning: a critical review\.Materials Reports: Energy1\(3\),pp\. 100047\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- G\. Luise, C\. Huang, T\. Vogels, D\. P\. Kooi, S\. Ehlert, S\. Lanius, K\. J\. H\. Giesbertz, A\. Karton, D\. Gunceler, M\. Stanley, W\. P\. Bruinsma, L\. Huang, X\. Wei, J\. G\. Torres, A\. Katbashev, B\. Máté, S\. Kaba, R\. Sordillo, Y\. Chen, D\. B\. Williams\-Young, C\. M\. Bishop, J\. Hermann, R\. van den Berg, and P\. Gori\-Giorgi \(2025\)Accurate and scalable exchange\-correlation with deep learning\.arXiv\.Cited by:[Appendix K](https://arxiv.org/html/2606.04279#A11.p3.3),[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.1),[§1](https://arxiv.org/html/2606.04279#S1.p4.1),[§3](https://arxiv.org/html/2606.04279#S3.p2.10),[§5](https://arxiv.org/html/2606.04279#S5.p4.1),[§5](https://arxiv.org/html/2606.04279#S5.p6.1),[§6](https://arxiv.org/html/2606.04279#S6.p4.1)\.
- A\. Madushanka, R\. T\. Moura, and E\. Kraka \(2024\)QM40, Realistic Quantum Mechanical Dataset for Machine Learning in Molecular Science\.Scientific Data11\(1\),pp\. 1376\.Cited by:[§5](https://arxiv.org/html/2606.04279#S5.p1.1)\.
- M\. G\. Medvedev, I\. S\. Bushmarinov, J\. Sun, J\. P\. Perdew, and K\. A\. Lyssenko \(2017\)Density functional theory is straying from the path toward the exact functional\.Science355\(6320\),pp\. 49–52\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p1.3)\.
- M\. Medvidović, J\. C\. Umana, I\. Ahmadabadi, D\. D\. Sante, J\. Flick, and A\. Rubio \(2025\)Neural network distillation of orbital dependent density functional theory\.Physical Review Research7\(2\),pp\. 023113\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p2.10)\.
- R\. Nagai, R\. Akashi, and O\. Sugino \(2020\)Completing density functional theory by machine learning hidden messages from molecules\.npj Computational Materials6\(1\),pp\. 1–8\.Cited by:[§N\.1](https://arxiv.org/html/2606.04279#A14.SS1.p1.6),[§1](https://arxiv.org/html/2606.04279#S1.p4.1),[§3](https://arxiv.org/html/2606.04279#S3.p1.3),[§3](https://arxiv.org/html/2606.04279#S3.p2.10),[§3](https://arxiv.org/html/2606.04279#S3.p3.1),[§5\.2](https://arxiv.org/html/2606.04279#S5.SS2.p1.6),[§5](https://arxiv.org/html/2606.04279#S5.p4.1),[§5](https://arxiv.org/html/2606.04279#S5.p6.1)\.
- R\. Nagai, R\. Akashi, and O\. Sugino \(2022\)Machine\-learning\-based exchange correlation functional with physical asymptotic constraints\.Physical Review Research4\(1\),pp\. 013106\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p2.10)\.
- J\. Olsen, H\. J\. A\. Jensen, and P\. Jørgensen \(1988\)Solution of the large matrix equations which occur in response theory\.Journal of Computational Physics74\(2\),pp\. 265–282\.Cited by:[Appendix F](https://arxiv.org/html/2606.04279#A6.p1.7)\.
- J\. P\. Perdew \(2001\)Jacob’s ladder of density functional approximations for the exchange\-correlation energy\.InAIP Conference Proceedings,Vol\.577,Antwerp \(Belgium\),pp\. 1–20\.Cited by:[Appendix E](https://arxiv.org/html/2606.04279#A5.p1.1)\.
- P\. Pulay \(1980\)Convergence acceleration of iterative sequences\. the case of scf iteration\.Chemical Physics Letters73\(2\),pp\. 393–398\.Cited by:[Appendix H](https://arxiv.org/html/2606.04279#A8.p1.2)\.
- G\. D\. Purvis III and R\. J\. Bartlett \(1982\)A full coupled\-cluster singles and doubles model: the inclusion of disconnected triples\.The Journal of chemical physics76\(4\),pp\. 1910–1918\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- K\. Raghavachari, G\. W\. Trucks, J\. A\. Pople, and M\. Head\-Gordon \(1989\)A fifth\-order perturbation comparison of electron correlation theories\.Chemical Physics Letters157\(6\),pp\. 479–483\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- R\. Ramakrishnan, P\. O\. Dral, M\. Rupp, and O\. A\. von Lilienfeld \(2014\)Quantum chemistry structures and properties of 134 kilo molecules\.Scientific Data1\(1\),pp\. 140022\.Cited by:[§5](https://arxiv.org/html/2606.04279#S5.p1.1)\.
- A\. B\. Rowaiye, A\. A\. Folarin, T\. Akingbade, J\. C\. Okoli, O\. I\. Rowaiye, T\. R\. Folorunso, and D\. Bur \(2025\)Advancing predictive modeling in computational chemistry through quantum chemistry, molecular mechanics, and machine learning\.Discover Chemistry2\(1\),pp\. 281\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- E\. Runge and E\. K\. U\. Gross \(1984\)Density\-Functional Theory for Time\-Dependent Systems\.Physical Review Letters52\(12\),pp\. 997–1000\.Cited by:[§2](https://arxiv.org/html/2606.04279#S2.p5.2)\.
- G\. Sliwoski, S\. Kothiwale, J\. Meiler, and E\. W\. Lowe Jr\. \(2014\)Computational methods in drug discovery\.Pharmacological reviews66\(1\),pp\. 334–395\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p1.2)\.
- J\. C\. Snyder, M\. Rupp, K\. Hansen, K\. Müller, and K\. Burke \(2012\)Finding Density Functionals with Machine Learning\.Physical Review Letters108\(25\),pp\. 253002\.Cited by:[§1](https://arxiv.org/html/2606.04279#S1.p4.1)\.
- F\. Song and J\. Feng \(2024\)NeuralSCF: Neural network self\-consistent fields for density functional theory\.arXiv\.Cited by:[§5\.3](https://arxiv.org/html/2606.04279#S5.SS3.p4.1)\.
- F\. Stein and J\. Hutter \(2022\)Double\-hybrid density functionals for the condensed phase: Gradients, stress tensor, and auxiliary\-density matrix method acceleration\.The Journal of Chemical Physics156\(7\),pp\. 074107\.Cited by:[§5\.3](https://arxiv.org/html/2606.04279#S5.SS3.p1.5)\.
- J\. Sun, A\. Ruzsinszky, and J\. P\. Perdew \(2015\)Strongly Constrained and Appropriately Normed Semilocal Density Functional\.Physical Review Letters115\(3\),pp\. 036402\.Cited by:[§5\.1](https://arxiv.org/html/2606.04279#S5.SS1.p1.3),[§5](https://arxiv.org/html/2606.04279#S5.p6.1)\.
- T\. Tamayo\-Mendoza, C\. Kreisbeck, R\. Lindh, and A\. Aspuru\-Guzik \(2018\)Automatic Differentiation in Quantum Chemistry with Applications to Fully Variational Hartree–Fock\.ACS Central Science4\(5\),pp\. 559–566\.Cited by:[§3](https://arxiv.org/html/2606.04279#S3.p3.1)\.
- T\. Yanai, D\. P\. Tew, and N\. C\. Handy \(2004\)A new hybrid exchange–correlation functional using the coulomb\-attenuating method \(cam\-b3lyp\)\.Chemical physics letters393\(1\-3\),pp\. 51–57\.Cited by:[§2](https://arxiv.org/html/2606.04279#S2.p4.8)\.
- H\. Yu, Z\. Xu, X\. Qian, X\. Qian, and S\. Ji \(2023\)Efficient and Equivariant Graph Networks for Predicting Quantum Hamiltonian\.arXiv\.Cited by:[§5\.3](https://arxiv.org/html/2606.04279#S5.SS3.p4.1)\.
## Appendix AKS\-DFT in Finite Basis Sets
In practice, the minimization of the Kohn\-Sham functional \(Eq\.[1](https://arxiv.org/html/2606.04279#S1.E1)\) is done in a finite basis set using an iterative solver\. In this section we closely followLehtolaet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib114)\)and the notation used therein to provide a brief overview of the solution process, for more detailed explanations we recommend reading their paper\. Note that we further simplify by removing any reference to spin, making our summary here more accessible to a wider audience and implicitly adhering to the spin\-restricted case and an even number of electrons\. For notational convenience, we employ Einstein’s summation convention, where repeated indices imply summation over their range\.
In this work, we use an atom\-centeredGaussian\-typebasis set𝐗:=\{χμ\|χμ:ℝ3→ℝ\}μ=1B\\mathbf\{X\}:=\\\{\\chi\_\{\\mu\}\|\\chi\_\{\\mu\}:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}\\\}\_\{\\mu=1\}^\{B\}\. The overlap between basis functions is defined as the spatial integral of their product\. The correspondingoverlap matrixSSstores all of these pairwise integrals
Sμν:=∫ℝ3χμ\(r\)χν\(r\)dr\.\\displaystyle S\_\{\\mu\\nu\}:=\\int\_\{\\mathbb\{R\}^\{3\}\}\\chi\_\{\\mu\}\(r\)\\chi\_\{\\nu\}\(r\)\\\>\\textrm\{d\}r\\\>\.\(17\)Notably, the basis set is not orthogonal and henceSSis not diagonal\. The motivation to use Gaussians to construct𝐗\\mathbf\{X\}is that most integrals like these can be computed very efficiently using closed\-form solutions\.
In KS\-DFT, the electron densityρ:ℝ3→\[0,∞\)\\rho:\\mathbb\{R\}^\{3\}\\rightarrow\[0,\\infty\)is represented by a set ofmolecular orbitals\(MOs\)\{φi\|φi:ℝ3→ℝ\}i=1B\\\{\\varphi\_\{i\}\|\\varphi\_\{i\}:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}\\\}\_\{i=1\}^\{B\}which are the eigenfunctions of \(eq\.[2](https://arxiv.org/html/2606.04279#S1.E2)\)\. These orbitals can be represented by coefficient vectorsc\(i\)∈ℝBc^\{\(i\)\}\\in\\mathbb\{R\}^\{B\}in a given basis setφi\(r\)=cμ\(i\)χμ\(r\)\\varphi\_\{i\}\(r\)=c\_\{\\mu\}^\{\(i\)\}\\chi\_\{\\mu\}\(r\)The full set of orbitals can conveniently be represented in a matrixC=\(c\(1\),c\(2\),…,c\(B\)\)∈ℝB×BC=\(c^\{\(1\)\},c^\{\(2\)\},\.\.\.,c^\{\(B\)\}\)\\in\\mathbb\{R\}^\{B\\times B\}, where the leading index is used for the basis𝐗\\mathbf\{X\}and the second one indexes the MOs\. We follow the convention to use Greek letters to index the basis functions and Roman letters for the MOs\. Unlike the basis set, the MOs are orthonormal\.
δij=∫ℝ3φi\(r\)φj\(r\)dr=CiμCjν∫ℝ3χμ\(r\)χν\(r\)dr=:CiμCjνSμν\.\\displaystyle\\delta\_\{ij\}=\\int\_\{\\mathbb\{R\}^\{3\}\}\\varphi\_\{i\}\(r\)\\varphi\_\{j\}\(r\)\\\>\\textrm\{d\}r=C\_\{i\\mu\}C\_\{j\\nu\}\\int\_\{\\mathbb\{R\}^\{3\}\}\\chi\_\{\\mu\}\(r\)\\chi\_\{\\nu\}\(r\)\\\>\\textrm\{d\}r=:C\_\{i\\mu\}C\_\{j\\nu\}S\_\{\\mu\\nu\}\.\(18\)While the expansion of the MOs is linear, the resulting expansion of the electron density is quadratic in the basis functions
ρ\(r\)=∑iNe/2\|φi\(r\)\|2=χμ\(r\)χν\(r\)CμiCνi⏟=:Pμν,\\displaystyle\\rho\(r\)=\\sum\_\{i\}^\{N\_\{e\}/2\}\|\\varphi\_\{i\}\(r\)\|^\{2\}=\\chi\_\{\\mu\}\(r\)\\chi\_\{\\nu\}\(r\)\\underbrace\{C\_\{\\mu i\}C\_\{\\nu i\}\}\_\{=:P\_\{\\mu\\nu\}\}\\\>,\(19\)whereP∈ℝB×BP\\in\\mathbb\{R\}^\{B\\times B\}is the so\-called density matrix\. An instructive example of the expressivity of this discretization is that the integral over the density simplifies to a sum overB2B^\{2\}matrix elements∫ℝ3ρ\(r\)dr=PμνSμν\.\\int\_\{\\mathbb\{R\}^\{3\}\}\\rho\(r\)\\\>\\textrm\{d\}r=P\_\{\\mu\\nu\}S\_\{\\mu\\nu\}\\\>\.The total electronic energy in a finite basis set is given by\(Lehtolaet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib114)\)
Etot=PμνHμν\(core\)\+12Pμν\(μν\|λσ\)Pλσ\+Exc\[ρ\(P;𝐗\)\],\\displaystyle E\_\{\\textrm\{tot\}\}=P\_\{\\mu\\nu\}H^\{\\textrm\{\(core\)\}\}\_\{\\mu\\nu\}\+\\frac\{1\}\{2\}P\_\{\\mu\\nu\}\(\\mu\\nu\|\\lambda\\sigma\)P\_\{\\lambda\\sigma\}\+E\_\{\\textrm\{xc\}\}\[\\rho\(P;\\mathbf\{X\}\)\]\\\>,\(20\)with thecore Hamiltonian\(∈ℝB×B\\in\\mathbb\{R\}^\{B\\times B\}\)
Hμν\(core\)=∫ℝ3χμ\(r\)\(−12∇2\+∑nZn\|rn−r1\|\)χμ\(r\)dr1,\\displaystyle H^\{\\textrm\{\(core\)\}\}\_\{\\mu\\nu\}=\\int\_\{\\mathbb\{R\}^\{3\}\}\\chi\_\{\\mu\}\(r\)\\left\(\-\\frac\{1\}\{2\}\\nabla^\{2\}\+\\sum\_\{n\}\\frac\{Z\_\{n\}\}\{\|r\_\{n\}\-r\_\{1\}\|\}\\right\)\\chi\_\{\\mu\}\(r\)\\\>\\textrm\{d\}r\_\{1\}\\\>,\(21\)and theelectron repulsion integral\(ERI\) \(∈ℝB×B×B×B\\in\\mathbb\{R\}^\{B\\times B\\times B\\times B\}\) tensor
\(μν\|λσ\):=∬ℝ3χμ\(r1\)χν\(r1\)1\|r1−r2\|χλ\(r2\)χσ\(r2\)dr1dr2\.\\displaystyle\(\\mu\\nu\|\\lambda\\sigma\):=\\iint\_\{\\mathbb\{R\}^\{3\}\}\\chi\_\{\\mu\}\(r\_\{1\}\)\\chi\_\{\\nu\}\(r\_\{1\}\)\\frac\{1\}\{\|r\_\{1\}\-r\_\{2\}\|\}\\chi\_\{\\lambda\}\(r\_\{2\}\)\\chi\_\{\\sigma\}\(r\_\{2\}\)\\\>\\textrm\{d\}r\_\{1\}\\textrm\{d\}r\_\{2\}\\\>\.\(22\)
Both the core Hamiltonian and the ERI tensor are constant w\.r\.t\. the electron density, and only depend on the basis set444which in turn depends on the nuclear point cloud\. The derivative of the energyEtotE\_\{\\textrm\{tot\}\}w\.r\.t the density matrixPPis commonly referred to as theFock matrix
Fμν:=∂Etot∂Pμν=Hμν\+\(μν\|λσ\)Pλσ\+∂Exc∂Pμν,\\displaystyle F\_\{\\mu\\nu\}:=\\frac\{\\partial E\_\{\\textrm\{tot\}\}\}\{\\partial P\_\{\\mu\\nu\}\}=H\_\{\\mu\\nu\}\+\(\\mu\\nu\|\\lambda\\sigma\)P\_\{\\lambda\\sigma\}\+\\frac\{\\partial E\_\{\\textrm\{xc\}\}\}\{\\partial P\_\{\\mu\\nu\}\},\(23\)where the last term is the discretizedXC\-potentialdenoted asVμν\(xc\)V^\{\\textrm\{\(xc\)\}\}\_\{\\mu\\nu\}\. Its continuous counterpartvxc:ℝ3→ℝv\_\{\\textrm\{xc\}\}:\\mathbb\{R\}^\{3\}\\rightarrow\\mathbb\{R\}appears in Equation[2](https://arxiv.org/html/2606.04279#S1.E2), these relate to each other via
Vμν\(xc\)=∫χμ\(𝐫\)vxc\(𝐫\)χν\(𝐫\)𝑑𝐫\.\\displaystyle V^\{\\textrm\{\(xc\)\}\}\_\{\\mu\\nu\}=\\int\\chi\_\{\\mu\}\(\\mathbf\{r\}\)v\_\{xc\}\(\\mathbf\{r\}\)\\chi\_\{\\nu\}\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}\\\>\.\(24\)
One can show \(see, for example,Lehtolaet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib114)\)\) that the discretized problem of minimizing Equation \([1](https://arxiv.org/html/2606.04279#S1.E1)\) with respect to the density leads to the Roothaan–Hall equations
F\(P\(C\)\)C=diag\(ε1,…,εB\)SC,\\displaystyle F\(P\(C\)\)\\,C=\\mathrm\{diag\}\(\\varepsilon\_\{1\},\\ldots,\\varepsilon\_\{B\}\)\\,SC\\\>,\(25\)where the eigenvaluesε\\varepsiloncorrespond to the Kohn–Sham orbital energies andC∈ℝB×BC\\in\\mathbb\{R\}^\{B\\times B\}contains the orbital coefficients\. We explicitly indicate the dependence of the Fock matrixFFon the density matrixPPthrough the density dependence of the exchange\-correlation potentialV\(xc\)V^\{\\mathrm\{\(xc\)\}\}, highlighting that \([25](https://arxiv.org/html/2606.04279#A1.E25)\) constitutes a nonlinear eigenvalue problem\. In practice, this problem is solved iteratively by linearizing the equations through fixingFF, solving the resulting generalized eigenvalue problem to obtain an updated density matrixPP, and recomputingFFfrom it\. Repeating this procedure until convergence yields a self\-consistent solution\. Consequently, finding the ground\-state density can be interpreted as a fixed\-point problem in the space of density matrices, or equivalently Fock matrices, satisfying
F∗=SCFstep\(F∗\)\.\\displaystyle F^\{\*\}=\\mathrm\{SCF\}\_\{\\mathrm\{step\}\}\(F^\{\*\}\)\\\>\.\(26\)
## Appendix BOrbital Rotations and Direct Minimization
It can be shown that the density matrixPPis invariant under orthonormal linear transformationsON\(ℝ\)=\{U∈ℝN×N\|UTU=UUT=𝟏\}\\textrm\{O\}\_\{N\}\(\\mathbb\{R\}\)=\\\{U\\in\\mathbb\{R\}^\{N\\times N\}\|U^\{T\}U=UU^\{T\}=\\mathbf\{1\}\\\}of the occupied and virtual submatrices of the coefficient matrix
C=\(CoCv\),\\displaystyle C=\\begin\{pmatrix\}C\_\{\\textrm\{o\}\}&C\_\{\\textrm\{v\}\}\\end\{pmatrix\},\(27\)whereCo∈ℝO×BC\_\{\\textrm\{o\}\}\\in\\mathbb\{R\}^\{O\\times B\}andCv∈ℝV×BC\_\{\\textrm\{v\}\}\\in\\mathbb\{R\}^\{V\\times B\}\(Lehtolaet al\.,[2020](https://arxiv.org/html/2606.04279#bib.bib114)\)\. Meaning that
∀Uoo∈Oo\(ℝ\),Uvv∈Ov\(ℝ\)andC′:=C\(Uoo00Uvv\):P\(C′\)=P\(C\)\.\\displaystyle\\forall\\\>U\_\{\\textrm\{oo\}\}\\in\\textrm\{O\}\_\{\\textrm\{o\}\}\(\\mathbb\{R\}\)\\textrm\{, \}U\_\{\\textrm\{vv\}\}\\in\\textrm\{O\}\_\{\\textrm\{v\}\}\(\\mathbb\{R\}\)\\textrm\{ and \}C^\{\\prime\}:=C\\begin\{pmatrix\}U\_\{\\mathrm\{oo\}\}&0\\\\ 0&U\_\{\\mathrm\{vv\}\}\\end\{pmatrix\}:\\\>P\(C^\{\\prime\}\)=P\(C\)\\\>\.\(28\)
C\(θ\):=C0exp\(0θov−θovT0\)\\displaystyle C\(\\theta\):=C\_\{0\}\\exp\\begin\{pmatrix\}0&\\theta\_\{\\mathrm\{ov\}\}\\\\ \-\\theta\_\{\\mathrm\{ov\}\}^\{T\}&0\\end\{pmatrix\}\(29\)
In the following, we usei,j,…∈\{1,…,O\}i,j,\.\.\.\\in\\\{1,\.\.\.,O\\\}anda,b,…∈\{O\+1,…,B\}a,b,\.\.\.\\in\\\{O\+1,\.\.\.,B\\\}to index occupied and virtual orbitals, respectively
∂E∂θia\|θ=0=∂E∂Pηζ∂Pηζ∂θia=−Fηζ\[CηaCζi\+CηiCζa\]=−2\(CTFC\)ia=:−2F~ia\\displaystyle\\frac\{\\partial E\}\{\\partial\\theta\_\{ia\}\}\\bigg\|\_\{\\theta=0\}=\\frac\{\\partial E\}\{\\partial P\_\{\\eta\\zeta\}\}\\frac\{\\partial P\_\{\\eta\\zeta\}\}\{\\partial\\theta\_\{ia\}\}=\-F\_\{\\eta\\zeta\}\\left\[C\_\{\\eta a\}C\_\{\\zeta i\}\+C\_\{\\eta i\}C\_\{\\zeta a\}\\right\]=\-2\(C^\{T\}FC\)\_\{ia\}=:\-2\\tilde\{F\}\_\{ia\}\(30\)Analogously, we can derive that∂Exc∂θia=−2V~ia\(xc\)\\frac\{\\partial E\_\{\\textrm\{xc\}\}\}\{\\partial\\theta\_\{ia\}\}=\-2\\tilde\{V\}^\{\\textrm\{\(xc\)\}\}\_\{ia\}\.
## Appendix CRelevance of the Hessian
To first order in𝜽∈ℝO×V\\boldsymbol\{\\theta\}\\in\\mathbb\{R\}^\{O\\times V\}, the molecular orbital coefficients are given by
𝐂\(𝜽;𝐂0\)=𝐂0\+𝐂0\(0θov−θovT0\)\+𝒪\(𝜽2\)\\displaystyle\\mathbf\{C\}\(\\boldsymbol\{\\theta\};\\mathbf\{C\}\_\{0\}\)=\\mathbf\{C\}\_\{0\}\+\\mathbf\{C\}\_\{0\}\\begin\{pmatrix\}0&\\theta\_\{\\mathrm\{ov\}\}\\\\ \-\\theta\_\{\\mathrm\{ov\}\}^\{T\}&0\\end\{pmatrix\}\+\\mathcal\{O\}\(\\boldsymbol\{\\theta\}^\{2\}\)\(31\)Expanding the energy in𝜽∈ℝO×V\\boldsymbol\{\\theta\}\\in\\mathbb\{R\}^\{O\\times V\}around the ground\-state coefficientsC∗C^\{\*\}
Etot\(𝜽\)=Etot∗\+∂Etot∂θia⏟=0θia\+12∂2Etot∂θia∂θjbθiaθjb\+𝒪\(𝜽3\)\\displaystyle E\_\{\\textrm\{tot\}\}\(\\boldsymbol\{\\theta\}\)=E\_\{\\textrm\{tot\}\}^\{\*\}\+\\underbrace\{\\frac\{\\partial E\_\{\\textrm\{tot\}\}\}\{\\partial\\theta\_\{ia\}\}\}\_\{=0\}\\theta\_\{ia\}\+\\frac\{1\}\{2\}\\frac\{\\partial^\{2\}E\_\{\\textrm\{tot\}\}\}\{\\partial\\theta\_\{ia\}\\partial\\theta\_\{jb\}\}\\theta\_\{ia\}\\theta\_\{jb\}\+\\mathcal\{O\}\(\\boldsymbol\{\\theta\}^\{3\}\)\(32\)∂2Exc∂θia∂θjb=2Cμa∂Vμνxc∂θjb\|0Cνi⏟kernel term\+2\(δijCμaVμνxcCνb−δabCμiVμνxcCνj\)⏟potential terms\\displaystyle\\frac\{\\partial^\{2\}E\_\{\\textrm\{xc\}\}\}\{\\partial\\theta\_\{ia\}\\partial\\theta\_\{jb\}\}=\\underbrace\{2\\,C\_\{\\mu a\}\\frac\{\\partial V^\{\\mathrm\{xc\}\}\_\{\\mu\\nu\}\}\{\\partial\\theta\_\{jb\}\}\\bigg\|\_\{0\}C\_\{\\nu i\}\}\_\{\\text\{kernel term\}\}\+\\underbrace\{2\\left\(\\delta\_\{ij\}\\,C\_\{\\mu a\}V^\{\\mathrm\{xc\}\}\_\{\\mu\\nu\}C\_\{\\nu b\}\-\\delta\_\{ab\}\\,C\_\{\\mu i\}V^\{\\mathrm\{xc\}\}\_\{\\mu\\nu\}C\_\{\\nu j\}\\right\)\}\_\{\\text\{potential terms\}\}\(33\)
## Appendix DHessian\-Vector Products on the Grassmannian\.
Let𝜽:=vec\(θia\)∈ℝOV\\boldsymbol\{\\theta\}:=\\operatorname\{vec\}\(\\theta\_\{ia\}\)\\in\\mathbb\{R\}^\{OV\}denote the vectorized \(flattened\) occupied\-virtual orbital rotation angles, which parameterize the Grassmannian manifoldGr\(O,B\)\\text\{Gr\}\(O,B\)\. The orbital rotation Hessian
𝑯∈ℝOV×OV,𝑯iajb=∂2E∂θia∂θjb\|𝜽=0\\boldsymbol\{H\}\\in\\mathbb\{R\}^\{OV\\times OV\},\\qquad\\boldsymbol\{H\}\_\{iajb\}=\\frac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{ia\}\\partial\\theta\_\{jb\}\}\\bigg\|\_\{\\boldsymbol\{\\theta\}=0\}
characterizes the local curvature of the energy surface at a stationary point\. Explicitly, the Hessian\-vector product with a tangent directionδ𝜽∈ℝOV\\delta\\boldsymbol\{\\theta\}\\in\\mathbb\{R\}^\{OV\}reads:
𝑯δ𝜽=\[∂2E∂θ1∂θ1∂2E∂θ1∂θ2⋯∂2E∂θ1∂θOV∂2E∂θ2∂θ1∂2E∂θ2∂θ2⋯∂2E∂θ2∂θOV⋮⋮⋱⋮∂2E∂θOV∂θ1∂2E∂θOV∂θ2⋯∂2E∂θOV∂θOV\]\[δθ1δθ2⋮δθOV\]=\[∑jb∂2E∂θ1∂θjbδθjb∑jb∂2E∂θ2∂θjbδθjb⋮∑jb∂2E∂θOV∂θjbδθjb\]\.\\boldsymbol\{H\}\\,\\delta\\boldsymbol\{\\theta\}=\\begin\{bmatrix\}\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{1\}\\partial\\theta\_\{1\}\}&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{1\}\\partial\\theta\_\{2\}\}&\\cdots&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{1\}\\partial\\theta\_\{OV\}\}\\\\\[6\.0pt\] \\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{2\}\\partial\\theta\_\{1\}\}&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{2\}\\partial\\theta\_\{2\}\}&\\cdots&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{2\}\\partial\\theta\_\{OV\}\}\\\\ \\vdots&\\vdots&\\ddots&\\vdots\\\\ \\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{OV\}\\partial\\theta\_\{1\}\}&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{OV\}\\partial\\theta\_\{2\}\}&\\cdots&\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{OV\}\\partial\\theta\_\{OV\}\}\\end\{bmatrix\}\\\!\\begin\{bmatrix\}\\delta\\theta\_\{1\}\\\\ \\delta\\theta\_\{2\}\\\\ \\vdots\\\\ \\delta\\theta\_\{OV\}\\end\{bmatrix\}=\\begin\{bmatrix\}\\displaystyle\\sum\_\{jb\}\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{1\}\\partial\\theta\_\{jb\}\}\\,\\delta\\theta\_\{jb\}\\\\\[10\.0pt\] \\displaystyle\\sum\_\{jb\}\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{2\}\\partial\\theta\_\{jb\}\}\\,\\delta\\theta\_\{jb\}\\\\ \\vdots\\\\ \\displaystyle\\sum\_\{jb\}\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{OV\}\\partial\\theta\_\{jb\}\}\\,\\delta\\theta\_\{jb\}\\end\{bmatrix\}\.
Choosingδ𝜽=𝐞jb\\delta\\boldsymbol\{\\theta\}=\\mathbf\{e\}\_\{jb\}\(a unit vector corresponding to a single occupied\-virtual pair\) extracts the\(jb\)\(jb\)\-th column of the Hessian, which equals the linear response of the gradient:
𝑯𝐞jb=\[∂2E∂θ1∂θjb∂2E∂θ2∂θjb⋮∂2E∂θOV∂θjb\]=∂𝒈\(xc\)∂θjb\|𝜽=0\.\\boldsymbol\{H\}\\,\\mathbf\{e\}\_\{jb\}=\\begin\{bmatrix\}\\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{1\}\\partial\\theta\_\{jb\}\}\\\\\[6\.0pt\] \\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{2\}\\partial\\theta\_\{jb\}\}\\\\ \\vdots\\\\ \\dfrac\{\\partial^\{2\}E\}\{\\partial\\theta\_\{OV\}\\partial\\theta\_\{jb\}\}\\end\{bmatrix\}=\\frac\{\\partial\\boldsymbol\{g\}^\{\(\\mathrm\{xc\}\)\}\}\{\\partial\\theta\_\{jb\}\}\\bigg\|\_\{\\boldsymbol\{\\theta\}=0\}\.
For a general tangent vectorδ𝜽∈TPGr\(O,B\)\\delta\\boldsymbol\{\\theta\}\\in T\_\{P\}\\text\{Gr\}\(O,B\), the resultδ𝒈\(xc\)=𝑯δ𝜽\\delta\\boldsymbol\{g\}^\{\(\\mathrm\{xc\}\)\}=\\boldsymbol\{H\}\\,\\delta\\boldsymbol\{\\theta\}is the linear response of the XC gradient, with each component being a weighted sum over Hessian elements\.
#### Efficient Computation via Automatic Differentiation\.
Materializing the full Hessian𝑯\\boldsymbol\{H\}requires𝒪\(\(OV\)2\)\\mathcal\{O\}\(\(OV\)^\{2\}\)storage and𝒪\(\(OV\)2\)\\mathcal\{O\}\(\(OV\)^\{2\}\)gradient evaluations, which becomes prohibitive for large systems\. Instead, we leverage the fact that HVPs can be computed in𝒪\(OV\)\\mathcal\{O\}\(OV\)time and memory—the same cost as a single gradient evaluation—using forward\-mode automatic differentiation\.
The key identity is
𝑯δ𝜽=∂∂ε\|ε=0𝒈\(xc\)\(𝜽\+εδ𝜽\),\\boldsymbol\{H\}\\,\\delta\\boldsymbol\{\\theta\}=\\frac\{\\partial\}\{\\partial\\varepsilon\}\\bigg\|\_\{\\varepsilon=0\}\\boldsymbol\{g\}^\{\(\\mathrm\{xc\}\)\}\(\\boldsymbol\{\\theta\}\+\\varepsilon\\,\\delta\\boldsymbol\{\\theta\}\),i\.e\., the HVP is the Jacobian\-vector product \(JVP\) of the gradient function\. This operation can be vectorized \(batches\) over multiple directions, enabling the computation ofTTindependent HVPs with minimal overhead\.
#### Computational Complexity\.
HereCgC\_\{g\}denotes the cost of a single gradient evaluation\. By supervisingT≪OVT\\ll OVrandomly sampled linear responses, we obtain stochastic curvature information at a fraction of the cost of the full Hessian, making derivative\-informed training tractable for systems with hundreds of basis functions\. Similar HVP\-based approaches for functional learning have been investigated in classical\-DFT settingsDijkmanet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib1)\)
## Appendix ECategorization of Traditional XC\-Models: Jacob’s Ladder
Perdew \([2001](https://arxiv.org/html/2606.04279#bib.bib143)\)established a commonly referred to ranking of XC\-models a\.k\.a\. theJacob’s Ladder, with each rung representing an increase in expressivity, computational cost, and accuracy\.
### E\.1Grid Features
Semi\-local functionals depend on quantities evaluated at each point of a numerical integration grid:
- •Electron density:n=n↑\+n↓n=n\_\{\\uparrow\}\+n\_\{\\downarrow\}
- •Spin polarization:ζ=\(n↑−n↓\)/n∈\[−1,1\]\\zeta=\(n\_\{\\uparrow\}\-n\_\{\\downarrow\}\)/n\\in\[\-1,1\]
- •Reduced density gradient:s=\|∇n\|/\(2\(3π2\)1/3n4/3\)s=\|\\nabla n\|/\(2\(3\\pi^\{2\}\)^\{1/3\}n^\{4/3\}\), measuring density inhomogeneity relative to the Fermi wavelength
- •Kinetic energy density:τ=12∑μνPμν∇χμ⋅∇χν\\tau=\\frac\{1\}\{2\}\\sum\_\{\\mu\\nu\}P\_\{\\mu\\nu\}\\nabla\\chi\_\{\\mu\}\\cdot\\nabla\\chi\_\{\\nu\}, encoding orbital structure
### E\.2Local and Semi\-local Models
These functionals express the XC energy asExc\[n\]=∫n\(𝐫\)εxc\(𝐫\)𝑑𝐫E\_\{\\mathrm\{xc\}\}\[n\]=\\int n\(\\mathbf\{r\}\)\\,\\varepsilon\_\{\\mathrm\{xc\}\}\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}, differing in which grid featuresεxc\\varepsilon\_\{\\mathrm\{xc\}\}may depend on:
- •Local Density Approximation \(LDA\):εxc\(n,ζ\)\\varepsilon\_\{\\mathrm\{xc\}\}\(n,\\zeta\)
- •Generalized Gradient Approximation \(GGA\):εxc\(n,ζ,s\)\\varepsilon\_\{\\mathrm\{xc\}\}\(n,\\zeta,s\)
- •meta\-GGA:εxc\(n,ζ,s,τ\)\\varepsilon\_\{\\mathrm\{xc\}\}\(n,\\zeta,s,\\tau\)
### E\.3Non\-local Models
Higher rungs incorporate explicit orbital dependence, breaking the semi\-local form\.
#### Hybrid functionals
mix semi\-local exchange with a fraction of exact \(Hartree–Fock\) exchange:
Exchybrid=aExHF\+\(1−a\)ExDFT\+EcDFT\.\\displaystyle E\_\{\\mathrm\{xc\}\}^\{\\mathrm\{hybrid\}\}=a\\,E\_\{x\}^\{\\mathrm\{HF\}\}\+\(1\-a\)\\,E\_\{x\}^\{\\mathrm\{DFT\}\}\+E\_\{c\}^\{\\mathrm\{DFT\}\}\\,\.\(34\)
#### Range\-separated hybrids
partition the Coulomb operator1/r121/r\_\{12\}into short\- and long\-range components using the error function, applying different exchange treatments to each range\.
#### Double hybrids
additionally incorporate MP2\-like correlation\(Goerigk and Grimme,[2014](https://arxiv.org/html/2606.04279#bib.bib12)\):
EcPT2=14∑ijocc∑abvirt\|⟨ij\|\|ab⟩\|2εi\+εj−εa−εb\.\\displaystyle E\_\{c\}^\{\\mathrm\{PT2\}\}=\\frac\{1\}\{4\}\\sum\_\{ij\}^\{\\mathrm\{occ\}\}\\sum\_\{ab\}^\{\\mathrm\{virt\}\}\\frac\{\|\\langle ij\|\|ab\\rangle\|^\{2\}\}\{\\varepsilon\_\{i\}\+\\varepsilon\_\{j\}\-\\varepsilon\_\{a\}\-\\varepsilon\_\{b\}\}\\,\.\(35\)Range\-separated variants replace the bare two\-electron integrals with a range\-separated interactiong^ω\(r\)\\hat\{g\}\_\{\\omega\}\(r\):
EcPT2\(ω\)=14∑ijocc∑abvirt\|\(ia\|g^ω\|jb\)\|2εi\+εj−εa−εb\.\\displaystyle E\_\{c\}^\{\\mathrm\{PT2\}\}\(\\omega\)=\\frac\{1\}\{4\}\\sum\_\{ij\}^\{\\mathrm\{occ\}\}\\sum\_\{ab\}^\{\\mathrm\{virt\}\}\\frac\{\|\(ia\|\\hat\{g\}\_\{\\omega\}\|jb\)\|^\{2\}\}\{\\varepsilon\_\{i\}\+\\varepsilon\_\{j\}\-\\varepsilon\_\{a\}\-\\varepsilon\_\{b\}\}\\,\.\(36\)
## Appendix FTime\-Dependent DFT \(TDDFT\)
To find the resonance frequenciesωn\\omega\_\{n\}and thereby the excited state energiesEn=E0\+ωnE\_\{n\}=E\_\{0\}\+\\omega\_\{n\}, we have to solve the Casida equations, a non\-Hermitian eigenvalue problem given as
\(𝐀𝐁𝐁∗𝐀∗\)\(XnYn\)=ωn\(𝟏𝟎𝟎−𝟏\)\(XnYn\)\\displaystyle\\begin\{pmatrix\}\\mathbf\{A\}&\\mathbf\{B\}\\\\ \\mathbf\{B^\{\*\}\}&\\mathbf\{A^\{\*\}\}\\end\{pmatrix\}\\begin\{pmatrix\}X\_\{n\}\\\\ Y\_\{n\}\\end\{pmatrix\}=\\omega\_\{n\}\\begin\{pmatrix\}\\mathbf\{1\}&\\mathbf\{0\}\\\\ \\mathbf\{0\}&\-\\mathbf\{1\}\\end\{pmatrix\}\\begin\{pmatrix\}X\_\{n\}\\\\ Y\_\{n\}\\end\{pmatrix\}\(37\)The eigenvectors\(X,Y\)\(X,Y\)can be used to calculate properties like the transition amplitudes, intensity and dynamic polarizabilities\. The submatricesAAandBBare constructed from orbital derivatives of the DFT functional:
Aia,jb\\displaystyle A\_\{ia,jb\}=δijδab\(εa−εi\)\+Kia,jb\\displaystyle=\\delta\_\{ij\}\\delta\_\{ab\}\(\\varepsilon\_\{a\}\-\\varepsilon\_\{i\}\)\+K\_\{ia,jb\}\(38\)Kia,jb\\displaystyle K\_\{ia,jb\}=∬ψi\(𝐫\)ψa\(𝐫\)\[1\|𝐫−𝐫′\|\+fxc\(𝐫,𝐫′\)\]ψj\(𝐫′\)ψb\(𝐫′\)𝑑𝐫𝑑𝐫′\\displaystyle=\\iint\\psi\_\{i\}\(\\mathbf\{r\}\)\\psi\_\{a\}\(\\mathbf\{r\}\)\\left\[\\frac\{1\}\{\|\\mathbf\{r\}\-\\mathbf\{r\}^\{\\prime\}\|\}\+f\_\{xc\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)\\right\]\\psi\_\{j\}\(\\mathbf\{r\}^\{\\prime\}\)\\psi\_\{b\}\(\\mathbf\{r\}^\{\\prime\}\)\\,d\\mathbf\{r\}\\,d\\mathbf\{r\}^\{\\prime\}\(39\)with the kernelfxc\(𝐫,𝐫′\)=δ2Excδρ\(𝐫\)δρ\(𝐫′\)f\_\{xc\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)=\\frac\{\\delta^\{2\}E\_\{xc\}\}\{\\delta\\rho\(\\mathbf\{r\}\)\\delta\\rho\(\\mathbf\{r\}^\{\\prime\}\)\}, and
Bia,jb=Kia,bj\\displaystyle B\_\{ia,jb\}=K\_\{ia,bj\}\(40\)Since the Casida matrix is extremely large, full diagonalization is intractable\. Instead, most implementations resort to iterative diagonalization with the Davidson algorithm\. Our implementation follows the implementation given inOlsenet al\.\([1988](https://arxiv.org/html/2606.04279#bib.bib223)\)\. TDDFT can have numerical convergence problems, which can often be improved on using the TDA approximation that sets𝐁=0\\mathbf\{B\}=0\. As we can see from the above, TDDFT depends on the molecular energies, which in turn depend on the first functional derivative of the exchange\-correlation functional, and the exchange\-correlation kernel, which is the second functional derivative\.
## Appendix GThe Optimized Effective Potential \(OEP\)
A hybrid exchange\-correlation functional depends explicitly on the density matrixPP, taking the formExc\[ρ,P\]E\_\{xc\}\[\\rho,P\]\. The minimization of this energy with respect to the orbitals naturally yields a non\-local potential operator,Σxc\(𝐫,𝐫′\)\\Sigma\_\{xc\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\), defined by the functional derivative with respect to the density matrix:
Σxc\(𝐫,𝐫′\)=δExcδP\(𝐫,𝐫′\)\\displaystyle\\Sigma\_\{xc\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)=\\frac\{\\delta E\_\{xc\}\}\{\\delta P\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)\}\(41\)In the context of pure Kohn\-Sham DFT, however, the existence of a non\-interacting system reproducing the interacting density relies on the Hohenberg\-Kohn theorem, which requires a multiplicative, local potentialVxc\(𝐫\)V\_\{xc\}\(\\mathbf\{r\}\)\. A functional of the formExc\[ρ\]E\_\{xc\}\[\\rho\]yields such a local potential via the functional derivative with respect to the density:
Vxc\(𝐫\)=δExcδρ\(𝐫\)\\displaystyle V\_\{xc\}\(\\mathbf\{r\}\)=\\frac\{\\delta E\_\{xc\}\}\{\\delta\\rho\(\\mathbf\{r\}\)\}\(42\)The fundamental challenge in constructing local approximations for orbital\-dependent functionals lies in finding a local potentialVxcOEP\(𝐫\)V^\{\\mathrm\{OEP\}\}\_\{xc\}\(\\mathbf\{r\}\)that best imitates the energetic effects of the true non\-local operatorΣxc\\Sigma\_\{xc\}\.
#### The OEP Condition:
To resolve this conflict within the rigorous framework of Kohn\-Sham theory, the Optimized Effective Potential \(OEP\) method determines the local potential variationally\. The optimal local potential is defined as the one for which the total energy, including the orbital\-dependent terms, is stationary\. Mathematically, this is formulated by the Sharp\-Horton condition \(which is mathematically equivalent to the Hohenberg\-Kohn variational principle\), which enforces that the first variation of the total energy with respect to the local potential vanishes:
δEtotδVKS\(𝐫\)=0\\displaystyle\\frac\{\\delta E\_\{tot\}\}\{\\delta V\_\{KS\}\(\\mathbf\{r\}\)\}=0\(43\)This condition ensures that the resulting local potential minimizes the energy within the constraints of a multiplicative Kohn\-Sham operator\.
Applying the chain rule to the Sharp\-Horton condition and utilizing the response function of the non\-interacting system leads to a Fredholm integral equation of the first kindIvanovet al\.\([2002](https://arxiv.org/html/2606.04279#bib.bib219)\):
∫χs\(𝐫,𝐫′\)V^xcOEP\(𝐫′\)𝑑𝐫′=Λ\(𝐫\)\\displaystyle\\int\\chi\_\{s\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}\(\\mathbf\{r\}^\{\\prime\}\)d\\mathbf\{r\}^\{\\prime\}=\\Lambda\(\\mathbf\{r\}\)\(44\)whereχs\(𝐫,𝐫′\)\\chi\_\{s\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)is the static density\-density response function of the non\-interacting Kohn\-Sham system:
χs\(𝐫,𝐫′\)=∑iocc∑a∞ϕi∗\(𝐫\)ϕa\(𝐫\)ϕa∗\(𝐫′\)ϕi\(𝐫′\)εi−εa\+c\.c\.\\chi\_\{s\}\(\\mathbf\{r\},\\mathbf\{r\}^\{\\prime\}\)=\\sum\_\{i\}^\{\\text\{occ\}\}\\sum\_\{a\}^\{\\infty\}\\frac\{\\phi\_\{i\}^\{\*\}\(\\mathbf\{r\}\)\\phi\_\{a\}\(\\mathbf\{r\}\)\\phi\_\{a\}^\{\*\}\(\\mathbf\{r\}^\{\\prime\}\)\\phi\_\{i\}\(\\mathbf\{r\}^\{\\prime\}\)\}\{\\varepsilon\_\{i\}\-\\varepsilon\_\{a\}\}\+c\.c\.\(45\)\(c\.c\. is the complex conjugate\) and the inhomogeneityΛ\(𝐫\)\\Lambda\(\\mathbf\{r\}\)collects the contributions arising from the non\-local exchange\-correlation operator:
Λ\(𝐫\)=∑iocc∑a∞⟨a\|Σ^xc\|i⟩,ϕi∗\(𝐫\)ϕa\(𝐫\)εi−εa\+c\.c\.\\Lambda\(\\mathbf\{r\}\)=\\sum\_\{i\}^\{\\text\{occ\}\}\\sum\_\{a\}^\{\\infty\}\\frac\{\\langle a\|\\hat\{\\Sigma\}\_\{xc\}\|i\\rangle,\\phi\_\{i\}^\{\*\}\(\\mathbf\{r\}\)\\phi\_\{a\}\(\\mathbf\{r\}\)\}\{\\varepsilon\_\{i\}\-\\varepsilon\_\{a\}\}\+c\.c\.\(46\)It is not obvious that there exists a localV^xcOEP\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}that can solve equation \([44](https://arxiv.org/html/2606.04279#A7.E44)\)\. In fact, it is known that there can exist systems with densities, that cannot be reproduced by a non\-interacting single slater determinant moving in a local potentialVs\(𝐫\)V\_\{s\}\(\\mathbf\{r\}\)Lieb \([1983](https://arxiv.org/html/2606.04279#bib.bib220)\)\(mainly if the ground state is degenerate\)\.
Even if the solution isvsv\_\{s\}representable, it is known that Fredholm Integral equations of the first kind can become hypersensitive, which makes their numerical solution ill\-posed\. Additionally, in practice, we have to use a finite orbital or auxiliary basis to discretize equation \([44](https://arxiv.org/html/2606.04279#A7.E44)\), which then reduces to a linear system of equations:
∑νχμν,\(V^xcOEP\)ν=Λμ\\sum\_\{\\nu\}\\chi\_\{\\mu\\nu\},\(\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}\)\_\{\\nu\}=\\Lambda\_\{\\mu\}\(47\)The finite basis set expansion makes the kernel degenerate, and therefore unsolvable\. To deal with these problems, usually only a regularized version of \([47](https://arxiv.org/html/2606.04279#A7.E47)\) is being solvedIvanovet al\.\([2002](https://arxiv.org/html/2606.04279#bib.bib219)\)\.
In OEP KS\-DFT, at every step of the SCF, we would have to take the current set of orbitals, solve \(the regularized\) equation \([48](https://arxiv.org/html/2606.04279#A7.E48)\) to get the new localV^xcOEP\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}to build our Hamiltonian matrix that we then diagonalize to get the new set of orbitals\.
#### Simplified OEP as a regularizer
Given that solving equation \([47](https://arxiv.org/html/2606.04279#A7.E47)\) is very expensive and in practice not well posed to begin with, we do not use the OEP directly to supervise our model, but instead use a simpler regularizer merely inspired by the OEP\. After some basic algebra, equation \([44](https://arxiv.org/html/2606.04279#A7.E44)\) can alternatively be expressed as a sum over occupied \(ii\) and virtual \(aa\) statesKümmel and Kronik \([2008](https://arxiv.org/html/2606.04279#bib.bib218)\):
∑iocc∑a∞\(⟨a\|V^xcOEP\|i⟩−⟨a\|Σ^xc\|i⟩\)ϕi∗\(𝐫\)ϕa\(𝐫\)εi−εa\+c\.c\.=0\\displaystyle\\sum\_\{i\}^\{\\text\{occ\}\}\\sum\_\{a\}^\{\\infty\}\\frac\{\\left\(\\langle a\|\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}\|i\\rangle\-\\langle a\|\\hat\{\\Sigma\}\_\{xc\}\|i\\rangle\\right\)\\phi\_\{i\}^\{\*\}\(\\mathbf\{r\}\)\\phi\_\{a\}\(\\mathbf\{r\}\)\}\{\\varepsilon\_\{i\}\-\\varepsilon\_\{a\}\}\+c\.c\.=0\(48\)
Here,εi\\varepsilon\_\{i\}andεa\\varepsilon\_\{a\}are the Kohn\-Sham eigenvalues\. The term⟨a\|Σ^xc\|i⟩=dExc\[P\]dPai\\langle a\|\\hat\{\\Sigma\}\_\{xc\}\|i\\rangle=\\frac\{dE\_\{xc\}\[P\]\}\{dP\}\_\{ai\}represents the matrix element of the orbital\-specific non\-local operator, while⟨a\|V^xcOEP\|i⟩=dExc\[ρ\[P\]\]dPai\\langle a\|\\hat\{V\}\_\{xc\}^\{\\mathrm\{OEP\}\}\|i\\rangle=\\frac\{dE\_\{xc\}\[\\rho\[P\]\]\}\{dP\}\_\{ai\}is the corresponding matrix element of the local potential\. Formally, the sum over virtualsaamust run to infinity, though in practice it is truncated by the finite basis set size again\.
Looking at equation \([48](https://arxiv.org/html/2606.04279#A7.E48)\) we see that the exact OEP equation requires only that the weighted sum of the differences between the local and non\-local matrix elements vanishes at every point in space\. A stricter condition is to demand that the matrix elements match individually for every occupied\-virtual pair:
⟨a\|V^xc\|i⟩=⟨a\|Σ^xc\|i⟩∀i,a\\displaystyle\\langle a\|\\hat\{V\}\_\{xc\}\|i\\rangle=\\langle a\|\\hat\{\\Sigma\}\_\{xc\}\|i\\rangle\\quad\\forall i,a\(49\)
If this condition were met, the numerator in the OEP equation would vanish term\-by\-term, trivially satisfying the integral equation\. While the local potential generally lacks sufficient degrees of freedom to satisfy this equality strictly for all pairs, enforcing this condition in a least\-squares sense \(e\.g\. via DI\-Loss\) regularizes our functional and guides the optimization toward an OEP\-like potential\.
## Appendix HAdaptive Training Stabilization
As explicitly noted byLuiseet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib125)\), the online learning of the XC\-functional in an SCF solver presents a challenge\. During our evaluations, we noticed the same, due to the interaction between density optimization and parameter optimization\. If the parameter update from the previous Self\-Consistent Field \(SCF\) computation leads to an unstable XC\-potential, which does not properly converge to a plausible density, the next parameter update becomes even worse\. This type of training instability and architecture brittleness has been extensively explored in the literature on so\-calledDeep Equilibrium models\(DEQs\)\(Baiet al\.,[2019](https://arxiv.org/html/2606.04279#bib.bib27);[2021](https://arxiv.org/html/2606.04279#bib.bib29); Agarwala and Schoenholz,[2022](https://arxiv.org/html/2606.04279#bib.bib21)\)\. By matching the established notationz∗=fθ\(x,z∗\)z^\{\*\}=f\_\{\\theta\}\(x,z^\{\*\}\)of DEQs, this connection becomes immediate
P∗=SCFθ\(P0,P∗\)=SCF\(P0,Fθ\(P∗\)\)\.\\displaystyle P^\{\*\}=\\textrm\{SCF\}\_\{\\theta\}\(P\_\{0\},P^\{\*\}\)=\\textrm\{SCF\}\(P\_\{0\},F\_\{\\theta\}\(P\_\{\*\}\)\)\\\>\.\(50\)We note that additional connections can be drawn for convergence acceleration techniques proposed in the context of DEQs and SCF calculations\(Anderson,[1965](https://arxiv.org/html/2606.04279#bib.bib16); Pulay,[1980](https://arxiv.org/html/2606.04279#bib.bib151)\)\.
Training neural network XC functionals end\-to\-end through SCF requires safeguards against destabilizing parameter updates\. We employ a Metropolis\-inspired accept\-reject mechanism that adaptively identifies outlier updates based on the relative change in mean epoch loss\. The mechanism uses a tolerance schedule that relaxes constraints during early training, then tightens as the model stabilizes\. Letτw\\tau\_\{\\mathrm\{w\}\}denote the warmup length in epochs, and letΔltol\(0\)\\Delta l\_\{\\textrm\{tol\}\}^\{\(0\)\}andΔltol\\Delta l\_\{\\textrm\{tol\}\}denote the initial and target tolerance thresholds, respectively\. We define a piecewise linear tolerance schedule:
Δltol\(t\)=\{Δltol\(0\)0≤t<τw2,Δltol\(0\)\+Δltol−Δltol\(0\)τw/2\(t−τw2\)τw2≤t<τw,Δltolt≥τw\.\\displaystyle\\Delta l\_\{\\textrm\{tol\}\}\(t\)=\\begin\{cases\}\\Delta l\_\{\\textrm\{tol\}\}^\{\(0\)\}&0\\leq t<\\dfrac\{\\tau\_\{\\mathrm\{w\}\}\}\{2\},\\\\\[8\.00003pt\] \\Delta l\_\{\\textrm\{tol\}\}^\{\(0\)\}\+\\dfrac\{\\Delta l\_\{\\textrm\{tol\}\}\-\\Delta l\_\{\\textrm\{tol\}\}^\{\(0\)\}\}\{\\tau\_\{\\mathrm\{w\}\}/2\}\\left\(t\-\\dfrac\{\\tau\_\{\\mathrm\{w\}\}\}\{2\}\\right\)&\\dfrac\{\\tau\_\{\\mathrm\{w\}\}\}\{2\}\\leq t<\\tau\_\{\\mathrm\{w\}\},\\\\\[8\.00003pt\] \\Delta l\_\{\\textrm\{tol\}\}&t\\geq\\tau\_\{\\mathrm\{w\}\}\.\\end\{cases\}\(51\)Early in training, the fixed\-point map may only be marginally attractive, making SCF convergence fragile\. We therefore start with a tighter toleranceΔltol\(0\)\\Delta l\_\{\\textrm\{tol\}\}^\{\(0\)\}to aggressively reject destabilizing updates, then linearly relax to the target toleranceΔltol\\Delta l\_\{\\textrm\{tol\}\}over the second half of warmup as the model stabilizes, so as not to interfere excessively with the gradient\-based optimizer\.
At each epochtt, we compute the relative loss changeΔlt=\(Lt−Lt−1\)/Lt−1\\Delta l\_\{t\}=\(L\_\{t\}\-L\_\{t\-1\}\)/L\_\{t\-1\}whereLtL\_\{t\}is the mean epoch loss\. We maintain an exponential moving average estimate of the*uncentered*standard deviation
σ^t2\\displaystyle\\hat\{\\sigma\}\_\{t\}^\{2\}=βσ^t−12\+\(1−β\)\(Δlt\)2,\\displaystyle=\\beta\\hat\{\\sigma\}\_\{t\-1\}^\{2\}\+\(1\-\\beta\)\(\\Delta l\_\{t\}\)^\{2\},\(52\)which captures the typical scale of epoch\-to\-epoch loss fluctuations similar to the step\-to\-step variance estimate of the Adam optimizer\(Kingma and Ba,[2017](https://arxiv.org/html/2606.04279#bib.bib18)\)\. To ensure numerical stability, we clip this estimate to a predefined range:
σ~t\\displaystyle\\tilde\{\\sigma\}\_\{t\}=clip\(σ^t,σmin,σmax\)\.\\displaystyle=\\operatorname\{clip\}\\bigl\(\\hat\{\\sigma\}\_\{t\},\\,\\sigma\_\{\\min\},\\,\\sigma\_\{\\max\}\\bigr\)\.\(53\)The lower boundσmin\\sigma\_\{\\min\}prevents the mechanism from becoming overly sensitive to small fluctuations, which would artificially throttle training\. Overly tight thresholds would erroneously reject updates important for informing the inner gradient\-based optimizer\. Thescaled deviationztz\_\{t\}measures how many standard deviations the current loss change exceeds the tolerance:
zt\\displaystyle z\_\{t\}=Δltσ~t\.\\displaystyle=\\frac\{\\Delta l\_\{t\}\}\{\\tilde\{\\sigma\}\_\{t\}\}\.\(54\)
#### Cosine decay function\.
To smoothly interpolate between full acceptance and full rejection, we define:
c\(x;s\)\\displaystyle c\(x;s\)=12\(1\+cos\(πclip\(x,0,s\)s\)\),\\displaystyle=\\frac\{1\}\{2\}\\left\(1\+\\cos\\\!\\left\(\\pi\\,\\frac\{\\operatorname\{clip\}\(x,0,s\)\}\{s\}\\right\)\\right\),\(55\)which transitions from11atx=0x=0to0atx=sx=s\.
#### Adaptive rejection factor\.
Letκt\\kappa\_\{t\}denote the tolerance in units of standard deviations \(annealing from11toκ\\kappaover warmup\) andht=κt/2h\_\{t\}=\\kappa\_\{t\}/2its half\-width\. The acceptance probabilitypt∈\[0,1\]p\_\{t\}\\in\[0,1\]is:
pt\\displaystyle p\_\{t\}=\{1,zt<ht,c\(zt−ht;ht\),zt≥ht\.\\displaystyle=\\begin\{cases\}1,&z\_\{t\}<h\_\{t\},\\\\\[6\.00006pt\] c\\bigl\(z\_\{t\}\-h\_\{t\};\\,h\_\{t\}\\bigr\),&z\_\{t\}\\geq h\_\{t\}\.\\end\{cases\}\(56\)Updates withzt<htz\_\{t\}<h\_\{t\}are always accepted\. For larger deviations, we accept the update with probabilityptp\_\{t\}, which decays smoothly via the cosine function, reaching zero \(certain rejection\) whenzt≥κtz\_\{t\}\\geq\\kappa\_\{t\}\. After consecutive rejections, we additionally rescale the optimizer momentum to prevent the accumulation of destabilizing gradient directions\.
#### Initialization selection\.
The choice of initial network parameters has an outsized effect on end\-to\-end SCF training\.Agarwala and Schoenholz \([2022](https://arxiv.org/html/2606.04279#bib.bib21)\)show that DEQs are unusually sensitive to the statistics of their initialization, with poorly scaled weights yielding ill\-conditioned or non\-contractive fixed\-point maps\. Since our SCF map inherits this DEQ structure, an unfortunate initialization can place the XC functional in a regime where the iteration fails to reach a plausible density from the outset, before the stabilizer has any loss signal to act on\. Rather than hand\-tuning the initialization statistics, we treat the initialization as a cheap outer\-loop meta\-optimization: at the start of training we drawninitn\_\{\\mathrm\{init\}\}independent random parameter initializations, run a single epoch with each from a fresh optimizer state, and continue the main run from the parameters and optimizer state that achieved the lowest first\-epoch loss\. The selection is performed once and is negligible relative to the full training, while reliably discarding the small fraction of initializations that would otherwise stall SCF convergence\. This concerns the network parameters only; the electron density is always initialized from a standard MINAO guess\.
Table 1:Adaptive Training StabilizationParameterValueNotesWarm\-up schedule10 epochsFirst 10 epochs strict improvement, rejecting any updates that increase the loss10 epochsfollowed by a 10 epoch linear tolerance schedule until specified tolerance is reachedTolerance5σ5\\,\\sigmaBest out of\{4σ,5σ,6σ\}\\\{4\\,\\sigma,5\\,\\sigma,6\\,\\sigma\\\}Lower bound onσ^\(i\)\\widehat\{\\sigma\}^\{\(i\)\}20%Best out of\{15%,20%,25%\}\\\{15\\%,20\\%,25\\%\\\}Variance estimator momentumβ\\beta0\.75Best out of\{0\.75,0\.8,0\.9\}\\\{0\.75,0\.8,0\.9\\\}differences are marginalParameter momentum rescaling0\.7applied at each reject after22consecutive rejects \(also tried 0\.55 and 0\.85\)Initialization tryoutsninitn\_\{\\mathrm\{init\}\}5Random parameter inits trained for one epoch; lowest first\-epoch loss retained
## Appendix IOptimizer Ablations
Table 2:Optimizer configuration for B3LYP/def2\-SVP on QM9 \(train up to 7 heavy atoms; test on 1000 molecules with exactly 9 heavy atoms\) experimentsParameterSCAN/def2\-TZVPDB3LYP/def2\-SVPNotesOptimizermuon”See head\-to\-head comparison to Adam in Table[3](https://arxiv.org/html/2606.04279#A9.T3)Warmuplinear”From EG\-XCDecay11\+i/Nscale\\frac\{1\}\{1\+i/N\_\{\\textrm\{scale\}\}\}”Infinite range while monotonically decreasinglr3e\-2 / 1e\-2 / 1e\-1”NNmGGA / XCdiff / Skala\-mGGA best out of\{\\\{1e\-1, 3e\-2, 1e\-2, 3e\-3\}\\\}\+ warmup steps1000100010001000From EG\-XC\+ decay scale7507501000EG\-XC reference used 1000 throughout, we observed quicker training with750750for small subsplits of QM9Weight decay1e\-5”Table 3:NNmGGA models on QM9, trained on molecules with up to 5 heavy atoms and evaluated on 1000 randomly selected molecules with exactly 9 heavy atoms, fixed across all runs, at SCAN/def2\-TZVPD level of theory\. Energy MAEs inmEh\\textrm\{mE\}\_\{h\}\. These are early\-stage ablations with different hyperparameters than those we later settled on; nevertheless, the results are representative of the optimizer’s performance\. We observe higher training stability and improvements on almost all metrics across all loss compositions when using Muon\(Jordanet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib15)\)compared to Adam\(Kingma and Ba,[2017](https://arxiv.org/html/2606.04279#bib.bib18)\)\.
## Appendix JEvaluation metrics
Here we list and describe all the metrics we are using:
#### Total energyEtotE\_\{\\text\{tot\}\}:
The total energy error between the self\-consistent density of the learned functionalρML\\rho\_\{\\text\{ML\}\}and the self\-consistent reference densityρref\\rho\_\{\\text\{ref\}\}
ΔEtot=\|Etot,ML\[ρML\]−Etot,ref\[ρref\]\|\\displaystyle\\Delta E\_\{\\text\{tot\}\}=\|E\_\{\\text\{tot,ML\}\}\[\\rho\_\{\\text\{ML\}\}\]\-E\_\{\\text\{tot,ref\}\}\[\\rho\_\{\\text\{ref\}\}\]\|\(57\)
#### Total energy error with donated densityErefE\_\{\\text\{ref\}\}
The energy error for the self\-consistent density of the learned functional evaluated with the reference functional
ΔEref=\|Etot,ref\[ρML\]−Etot,ref\[ρref\]\|\\displaystyle\\Delta E\_\{\\text\{ref\}\}=\|E\_\{\\text\{tot,ref\}\}\[\\rho\_\{\\text\{ML\}\}\]\-E\_\{\\text\{tot,ref\}\}\[\\rho\_\{\\text\{ref\}\}\]\|\(58\)
#### Exchange energyExcE\_\{\\text\{xc\}\}:
The XC energy error between the learned XC functional with its self\-consistent density, and the reference XC functional with its own self\-consistent reference density
ΔExc=\|Exc,1\[ρML\]−Exc,ref\[ρref\]\|\\displaystyle\\Delta E\_\{\\text\{xc\}\}=\|E\_\{\\text\{xc,1\}\}\[\\rho\_\{\\text\{ML\}\}\]\-E\_\{\\text\{xc,ref\}\}\[\\rho\_\{\\text\{ref\}\}\]\|\(59\)
#### Coulomb \(Hartree\) metric:
The error of the Coulomb energy between the learned self\-consistent density and the ground truth self\-consistent density
ΔEC=\|EC\[ρML\]−EC\[ρref\]\|\\displaystyle\\Delta E\_\{C\}=\|E\_\{C\}\[\\rho\_\{\\text\{ML\}\}\]\-E\_\{C\}\[\\rho\_\{\\text\{ref\}\}\]\|\(60\)
#### Mean\-field energy error:
The mean\-field energy difference is defined as all the electronic energy contributions except forExcE\_\{\\text\{xc\}\}\(Gould,[2023](https://arxiv.org/html/2606.04279#bib.bib72)\)
FMF\[P\]=Ts\[P\]\+∫vext\(𝐫\)ρ\[P\]\(𝐫\)𝑑𝐫\+EH\[P\],F\_\{\\mathrm\{MF\}\}\[P\]=T\_\{s\}\[P\]\+\\int v\_\{\\mathrm\{ext\}\}\(\\mathbf\{r\}\)\\,\\rho\[P\]\(\\mathbf\{r\}\)\\,d\\mathbf\{r\}\+E\_\{\\mathrm\{H\}\}\[P\],the corresponding density error is\|FMF\[PML\]−FMF\[Pref\]\|\|F\_\{\\mathrm\{MF\}\}\[P\_\{\\text\{ML\}\}\]\-F\_\{\\mathrm\{MF\}\}\[P\_\{\\text\{ref\}\}\]\|with the density matrixPP\.
#### Homo\-Lumo GapΔεHL\\Delta\\varepsilon\_\{\\text\{HL\}\}:
The KS Homo\-Lumo gap for a densityρ\\rhois defined as
ΔHL\[ρ\]=εL\[ρ\]−εH\[ρ\],\\Delta\_\{\\mathrm\{HL\}\}\[\\rho\]=\\varepsilon\_\{\\mathrm\{L\}\}\[\\rho\]\-\\varepsilon\_\{\\mathrm\{H\}\}\[\\rho\],whereεH\\varepsilon\_\{\\mathrm\{H\}\}andεL\\varepsilon\_\{\\mathrm\{L\}\}are the highest occupied and lowest unoccupied KS eigenvalues of the effective one\-electron Hamiltonian corresponding toρ\[P\]\\rho\[P\]\. Given a reference densityρref\\rho\_\{\\mathrm\{ref\}\}, the Homo\-Lumo gap error is then
ΔεHL=\|ΔHL\[ρ\[PML\]\]−ΔHL\[ρ\[Pref\]\]\|\.\\Delta\\varepsilon\_\{\\text\{HL\}\}=\|\\Delta\_\{\\mathrm\{HL\}\}\[\\rho\[P\_\{\\text\{ML\}\}\]\]\-\\Delta\_\{\\mathrm\{HL\}\}\[\\rho\[P\_\{\\text\{ref\}\}\]\]\|\.
#### L1L\_\{1\}norm:
TheL1L\_\{1\}distance uniformly measures the total absolute difference between two densities:
L1\[Δρ\]=∫\|ρML\(𝐫\)−ρref\(𝐫\)\|𝑑𝐫\.L\_\{1\}\[\\Delta\\rho\]=\\int\\left\|\\rho\_\{\\text\{ML\}\}\(\\mathbf\{r\}\)\-\\rho\_\{\\text\{ref\}\}\(\\mathbf\{r\}\)\\right\|\\,d\\mathbf\{r\}\\\>\.
#### L2L\_\{2\}norm:
TheL2L\_\{2\}distance emphasizes larger local deviations more strongly than theL1L\_\{1\}norm and is therefore more sensitive to large spikes that can occur, for example, around the core:
L2\[Δρ\]=\(∫\|ρML\(𝐫\)−ρref\(𝐫\)\|2𝑑𝐫\)1/2\.L\_\{2\}\[\\Delta\\rho\]=\\left\(\\int\\left\|\\rho\_\{\\text\{ML\}\}\(\\mathbf\{r\}\)\-\\rho\_\{\\text\{ref\}\}\(\\mathbf\{r\}\)\\right\|^\{2\}\\,d\\mathbf\{r\}\\right\)^\{1/2\}\\\>\.
#### Dipole error:
The dipole difference,
𝝁Δρ=\|μ\[ρML\]−μ\[ρref\]\|,\\boldsymbol\{\\mu\}\_\{\\Delta\\rho\}=\|\\mathbf\{\\mu\}\[\\rho\_\{\\text\{ML\}\}\]\-\\mathbf\{\\mu\}\[\\rho\_\{\\text\{ref\}\}\]\|,captures how discrepancies in the density translate into errors in the first moment of the charge distribution\. It probes global shifts of electronic charge and is directly related to experimentally observable response properties\.
## Appendix KRuntime Benchmarks
To complement the possible runtime speedup analysis in Figure[3](https://arxiv.org/html/2606.04279#S5.F3), we benchmark the walltime of a single SCF cycle for B3LYP and the distilled NNmGGA functional\. Each timing is averaged over ten repetitions on one NVIDIA H200 GPU\. All measurements are from our unoptimized research codebase and should be interpreted as indicative single\-cycle costs rather than production\-ready inference timings\. We use PySCF grid level 1 throughout these experiments\. This grid choice reduces peak memory consumption, since our current implementation keeps the density\-fitted ERI tensor in GPU memory rather than rematerializing it on the fly as in production\-ready inference\-only DFT codes\.
Table 4:Single\-cycle SCF walltime benchmarks on QM9 molecules\. Timings are reported in milliseconds as mean±\\pmstandard deviation over ten repetitions on one NVIDIA H200 GPU\. The speedup is computed as the B3LYP walltime divided by the NNmGGA walltime\.The NNmGGA pre\-run includes neural network evaluation and is already comparable to B3LYP for small def2\-SVP molecules\. The speedup increases consistently with system size and basis size, reaching more than5×5\\timesat the def2\-TZVPD level for 19 heavy atoms\. This trend reflects the favorable𝒪\(B3\)\\mathcal\{O\}\(B^\{3\}\)scaling of semi\-local ML\-XC functionals relative to the𝒪\(B4\)\\mathcal\{O\}\(B^\{4\}\)exact\-exchange contribution in B3LYP\. At the smallest system sizes the neural network evaluation introduces a modest overhead, but this overhead is quickly amortized as exact exchange dominates the hybrid\-functional cost\.
For a fully optimized, production\-ready integration of a related architecture,Luiseet al\.\([2025](https://arxiv.org/html/2606.04279#bib.bib125)\)report GPU\-accelerated Skala runtimes matching semi\-local r2SCAN cost and more than10×10\\timesbelow hybrid\-functional cost\. Their CPU\-based PySCF implementation shows an approximately33–4×4\\timesprefactor over r2SCAN, consistent with the overhead observed in our unoptimized implementation\.
### K\.1End\-to\-End Initialization Runtime
We additionally measure end\-to\-end walltime for using an ML\-XC pre\-run to initialize B3LYP on a randomly selected large QM40 molecule with 35 heavy atoms at the B3LYP/def2\-SVP level of theory\. All workflows converge to the same final B3LYP energy of−1587\.69540087Eh\-1587\.69540087\\,E\_\{h\}\. The baseline is a direct B3LYP calculation from the standard initialization, while the remaining workflows first perform a fixed number of NNmGGA SCF cycles and then continue with B3LYP from the resulting density\.
Table 5:End\-to\-end walltime for ML\-XC initialized B3LYP calculations on a 35\-heavy\-atom QM40 molecule at the B3LYP/def2\-SVP level of theory\. Walltimes are reported in seconds as mean±\\pmstandard deviation\. The speedup is relative to direct B3LYP\.For this molecule, the best trade\-off is reached after six NNmGGA cycles\. This yields a net1\.35×1\.35\\timeswalltime speedup while keeping the total number of SCF cycles equal to the direct B3LYP calculation\. Since the single\-cycle benchmarks show increasing ML\-XC speedups with system size, we expect the net end\-to\-end speedup to improve further for larger molecules\.
### K\.2Training Wall\-Clock Runtime
We also report training wall\-clock curves for the B3LYP runs\. All runs were performed on the same compute node with one H100 GPU used per run\. Figure[5](https://arxiv.org/html/2606.04279#A11.F5)tracks training progress and validation energy error as a function of wall time for each loss variant\. All runs use the same early\-stopping parameters, and most training sessions finish at comparable wall\-clock times\. The per\-step overhead of Hessian supervision is visible as fewer completed optimization steps at a fixed wall time, but the total training duration remains comparable across loss variants\. On a single H100 GPU, the approximate per\-step times are300300ms for theE\+ρE\+\\rhodensity baseline,310310ms with gradient supervision, and400400ms with Hessian supervision, corresponding to an approximately33%33\\%overhead for the Hessian term\. The validation set is drawn from the same QM7 distribution as the training data; we therefore use validation energy error here only as a common, physically interpretable observable for tracking training progress over time\.

Figure 5:Training wall\-clock curves for B3LYP density\-baseline runs on a single H100 GPU\. The left panel shows training progress as a function of wall time, where Hessian supervision completes fewer steps at a fixed time because of its per\-step overhead\. The right panel shows validation energy error on the in\-distribution QM7 validation split, used here as a common observable for comparing training progress\.
## Appendix LDensity Loss\-Type Ablations
Table 6:Density loss ablations with NNmGGA at B3LYP/def2\-SVP level of theory on QM9, trained on molecules with up to 7 heavy atoms and evaluated on 1000 randomly selected molecules with exactly 9 heavy atoms, fixed across all runs \(same setting as Table[7](https://arxiv.org/html/2606.04279#A13.T7)\)\.We compare three choices for the density term in theE\+ρE\+\\rhobaseline, the per\-electronL1L\_\{1\}andL2L\_\{2\}norms of the real\-space density error and the mean\-field energy errorEρE\_\{\\rho\}\(Gould,[2023](https://arxiv.org/html/2606.04279#bib.bib72)\)\. We sweep the weightαρ\\alpha\_\{\\rho\}for each and report the resulting metrics in Table[6](https://arxiv.org/html/2606.04279#A12.T6)\. The mean\-field energy error gives poor real\-space density agreement, which motivates the two real\-space norms\. TheL1L\_\{1\}norm atαρ=10−3\\alpha\_\{\\rho\}=10^\{\-3\}reaches the lowest dipole andL1\[ρ\]L\_\{1\}\[\\rho\]error, so we adopt it as the density term throughout\. TheL2L\_\{2\}norm reaches lower total and mean\-field energy errors but worse density shape, which we do not prioritize since the energy term already supervises the energy\.
## Appendix MData Tables
### M\.1Learning SCAN
Table 7:Local models on QM9, trained on molecules with up to 5 heavy atoms and evaluated on 1000 randomly selected molecules with exactly 9 heavy atoms, fixed across all runs, at SCAN/def2\-TZVPD level of theory\. Energy MAEs inmEh\\textrm\{mE\}\_\{h\}\.
### M\.2Distilling B3LYP into Semi\-Local ML\-XC\-Functionals
Table 8:Raw data of all late\-stage runs performed for B3LYP/def2\-SVP distillation trained on QM9 up to 7 heavy atoms and evaluated on a random subset of QM40\. Energy MAEs \(EtotE\_\{\\textrm\{tot\}\},EρE\_\{\\rho\},ExcE\_\{\\textrm\{xc\}\},ΔεHL\\Delta\\varepsilon\_\{\\textrm\{HL\}\},EB3LYPE\_\{\\textrm\{B3LYP\}\}\) are inmEh\\textrm\{mE\}\_\{h\}\. Best per model metrics are highlighted in bold\. One molecule \(ID 86802\) has been removed from the QM40 evaluation due to convergence issues across all models which dominated the MAE, the affected rows are marked with an asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column\. The grey rows are not plotted in the figures, but they informed the loss\-weights chosen for the mGGA models \(1e\-3 forα∇\\alpha\_\{\\nabla\}and 3e\-5 forαH\\alpha\_\{H\}\)\.ModelNRunsN\_\{\\textrm\{Runs\}\}α∇\\alpha\_\{\\nabla\}αH\\alpha\_\{H\}EtotE\_\{\\textrm\{tot\}\}EρE\_\{\\rho\}ExcE\_\{\\textrm\{xc\}\}ΔεHL\\Delta\\varepsilon\_\{\\textrm\{HL\}\}μρ\\mu\_\{\\rho\}ℒ2\[ρ\]\\mathcal\{L\}\_\{2\}\[\\rho\]RICEB3LYPE\_\{\\textrm\{B3LYP\}\}NNmGGA4∗4^\{\*\}\-\-4\.24\.2±0\.6\\pm 0\.61\.41\.4±0\.2\\pm 0\.23\.03\.0±0\.4\\pm 0\.451\.551\.5±0\.1\\pm 0\.10\.0180\\mathbf\{0\.0180\}1\.0e\-51\.161\.160\.90\.93∗3^\{\*\}1e\-3\-5\.75\.7±0\.4\\pm 0\.40\.90\.9±0\.1\\pm 0\.15\.45\.4±0\.3\\pm 0\.353\.453\.4±0\.3\\pm 0\.30\.01970\.01971\.2e\-51\.171\.170\.7\\mathbf\{0\.7\}3∗3^\{\*\}1e\-31e\-53\.23\.2±0\.3\\pm 0\.30\.8\\mathbf\{0\.8\}±0\.1\\pm 0\.13\.13\.1±0\.2\\pm 0\.250\.950\.9±0\.0\\pm 0\.00\.01900\.01901\.2e\-51\.151\.150\.70\.73∗3^\{\*\}1e\-33e\-52\.92\.9±0\.5\\pm 0\.50\.90\.9±0\.2\\pm 0\.22\.62\.6±0\.4\\pm 0\.449\.649\.6±0\.1\\pm 0\.10\.01890\.01891\.2e\-51\.151\.150\.70\.711e\-31e\-42\.2\\mathbf\{2\.2\}\-0\.80\.8\-2\.3\\mathbf\{2\.3\}\-48\.9\\mathbf\{48\.9\}\-0\.02050\.02051\.3e\-51\.14\\mathbf\{1\.14\}0\.70\.7Skala3∗3^\{\*\}\-\-15\.815\.8±1\.8\\pm 1\.81\.21\.2±0\.1\\pm 0\.114\.614\.6±1\.8\\pm 1\.850\.350\.3±0\.2\\pm 0\.20\.01760\.01769\.1e\-61\.151\.150\.90\.9\-mGGA3∗3^\{\*\}1e\-3\-5\.85\.8±1\.5\\pm 1\.50\.90\.9±0\.2\\pm 0\.25\.45\.4±1\.3\\pm 1\.352\.652\.6±0\.2\\pm 0\.20\.01920\.01921\.3e\-51\.161\.160\.70\.71∗1^\{\*\}\-1e\-616\.316\.3\-1\.61\.6\-14\.714\.7\-50\.550\.5\-0\.01750\.01758\.5e\-61\.161\.160\.90\.91∗1^\{\*\}\-1e\-510\.510\.5\-1\.51\.5\-9\.19\.1\-49\.049\.0\-0\.0169\\mathbf\{0\.0169\}8\.1e\-61\.151\.150\.90\.91∗1^\{\*\}\-1e\-48\.98\.9\-1\.41\.4\-7\.67\.6\-48\.2\\mathbf\{48\.2\}\-0\.01880\.01881\.0e\-51\.13\\mathbf\{1\.13\}0\.90\.92∗2^\{\*\}1e\-31e\-54\.74\.7±1\.2\\pm 1\.21\.01\.0±0\.2\\pm 0\.24\.24\.2±0\.7\\pm 0\.750\.450\.4±0\.1\\pm 0\.10\.01810\.01811\.1e\-51\.151\.150\.70\.73∗3^\{\*\}1e\-33e\-53\.6\\mathbf\{3\.6\}±0\.2\\pm 0\.20\.7\\mathbf\{0\.7\}±0\.0\\pm 0\.03\.7\\mathbf\{3\.7\}±0\.2\\pm 0\.249\.549\.5±0\.1\\pm 0\.10\.01820\.01821\.1e\-51\.141\.140\.7\\mathbf\{0\.7\}XCdiff3∗3^\{\*\}\-\-8\.98\.9±0\.9\\pm 0\.91\.21\.2±0\.1\\pm 0\.17\.97\.9±1\.0\\pm 1\.051\.851\.8±0\.1\\pm 0\.10\.01820\.01829\.8e\-61\.171\.170\.90\.911e\-4\-11\.711\.7\-1\.01\.0\-10\.910\.9\-52\.052\.0\-0\.01830\.01839\.3e\-61\.171\.170\.80\.83∗3^\{\*\}1e\-3\-3\.73\.7±0\.4\\pm 0\.41\.01\.0±0\.0\\pm 0\.03\.43\.4±0\.2\\pm 0\.253\.253\.2±0\.1\\pm 0\.10\.01910\.01911\.2e\-51\.181\.180\.70\.71∗1^\{\*\}3e\-3\-2\.2\\mathbf\{2\.2\}\-1\.11\.1\-2\.6\\mathbf\{2\.6\}\-53\.853\.8\-0\.02010\.02011\.5e\-51\.191\.190\.6\\mathbf\{0\.6\}1∗1^\{\*\}\-1e\-68\.58\.5\-1\.71\.7\-6\.86\.8\-51\.051\.0\-0\.0177\\mathbf\{0\.0177\}9\.2e\-61\.171\.170\.90\.91∗1^\{\*\}\-1e\-512\.812\.8\-1\.71\.7\-11\.111\.1\-49\.449\.4\-0\.01770\.01779\.1e\-61\.161\.160\.90\.91∗1^\{\*\}\-1e\-48\.18\.1\-1\.21\.2\-7\.07\.0\-48\.4\\mathbf\{48\.4\}\-0\.01820\.01829\.9e\-61\.14\\mathbf\{1\.14\}0\.90\.92∗2^\{\*\}1e\-31e\-54\.64\.6±0\.6\\pm 0\.60\.80\.8±0\.1\\pm 0\.15\.05\.0±0\.4\\pm 0\.450\.550\.5±0\.0\\pm 0\.00\.01820\.01821\.1e\-51\.161\.160\.70\.73∗3^\{\*\}1e\-33e\-55\.55\.5±1\.6\\pm 1\.60\.7\\mathbf\{0\.7\}±0\.0\\pm 0\.05\.55\.5±1\.6\\pm 1\.649\.449\.4±0\.0\\pm 0\.00\.01870\.01871\.1e\-51\.141\.140\.70\.71∗1^\{\*\}1e\-31e\-46\.76\.7\-0\.70\.7\-7\.07\.0\-49\.049\.0\-0\.01950\.01951\.2e\-51\.151\.150\.70\.7EG\-XC1∗1^\{\*\}\-\-15\.915\.9\-1\.0\\mathbf\{1\.0\}\-15\.615\.6\-51\.851\.8\-0\.01820\.01821\.0e\-51\.16\\mathbf\{1\.16\}0\.90\.91∗1^\{\*\}1e\-6\-3\.43\.4\-1\.21\.2\-3\.3\\mathbf\{3\.3\}\-52\.352\.3\-0\.0179\\mathbf\{0\.0179\}9\.8e\-61\.181\.180\.7\\mathbf\{0\.7\}11e\-61e\-53\.1\\mathbf\{3\.1\}\-1\.01\.0\-3\.43\.4\-49\.2\\mathbf\{49\.2\}\-0\.01810\.01819\.2e\-61\.161\.160\.70\.7
### M\.3Time\-Dependent DFT \(TDDFT\) Distillation Energy Errors
Table 9:TDDFT distillation errors for QM9 molecules with up to5 heavy atoms\(n=50n=50\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.Table 10:TDDFT distillation errors for QM9 molecules with up to7 heavy atoms\(n=20n=20\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.Table 11:TDDFT distillation errors for QM9 molecules with up to9 heavy atoms\(n=20n=20\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.Table 12:TDDFT distillation errors for QM40 molecules with up to10 heavy atoms\(n=20n=20\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.Table 13:TDDFT distillation errors for QM40 molecules with up to12 heavy atoms\(n=20n=20\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.Table 14:TDDFT distillation errors for QM40 molecules with up to14 heavy atoms\(n=10n=10\)\. Excitation\-energy MAEs \(MAEall\\mathrm\{MAE\}\_\{\\mathrm\{all\}\},MAES0\\mathrm\{MAE\}\_\{S0\}–MAES4\\mathrm\{MAE\}\_\{S4\}\) are in meV;RelMAEall\\mathrm\{RelMAE\}\_\{\\mathrm\{all\}\}is in %\. An asterisk in theNRunsN\_\{\\textrm\{Runs\}\}column indicates that at least one molecule did not converge for that loss configuration\.
## Appendix NMachine Learnable XC\-Models
### N\.1NN\-mGGA
In their foundational work on machine\-learned exchange\-correlation functionalsNagaiet al\.\([2020](https://arxiv.org/html/2606.04279#bib.bib135)\)circumvented the parameter gradient calculation by using a probabilistic \(Metropolis\-Hastings\-type\) training method\. To make their model more amenable to gradient\-based training methods, we replaced theirC1C^\{1\}activation function \(ELU\) by aC∞C^\{\\infty\}activation \(shifted softplus\), while maintaining the same value rangeσ:ℝ→\(−1,∞\)\\sigma:\\mathbb\{R\}\\rightarrow\(\-1,\\infty\)and zero fixed pointσ\(0\)=0\\sigma\(0\)=0\. Moreover, we standardize the output space of the neural network by adding a scaling factors=0\.1s=0\.1and by initializing the bias of the output layer, s\.t\.σ\(b\)=0\.1\\sigma\(b\)=0\.1to account for the systematic bias of the uniform electron gas \(UEG/LDA\) on the stable organic chemistry datasets that we focus on in this work
Fθ\(fρ\(r\)\)=1\+σ\(s⋅NNθ\(fρ\(r\)\)\+b\)\.\\displaystyle F\_\{\\theta\}\(f\_\{\\rho\}\(r\)\)=1\+\\sigma\(s\\cdot\\textrm\{NN\}\_\{\\theta\}\(f\_\{\\rho\}\(r\)\)\+b\)\\\>\.\(61\)While this shift can be unlearned by the bias of the last layer, by shifting the output at initialization, the training already starts at a better point in parameter space\. We verified all of these tweaks to aid model performance during late\-stage development on QM9 molecules with up to 4 heavy atoms\.
### N\.2Hyperparameters
Semi\-local model sizes are matched to XCdiff within each teacher\-functional setting to keep the architecture comparison focused on the effect of the loss terms\. The larger B3LYP models reflect the higher capacity required to distill the non\-local exact\-exchange contribution of the hybrid reference\.
Table 15:Semi\-local model hyperparametersTable 16:EG\-XC specific hyperparameters\(Gaoet al\.,[2024](https://arxiv.org/html/2606.04279#bib.bib67)\)EmbeddingRadial cutoff5\.0 ÅFrom EG\-XC\# radial filters32From EG\-XCnuclei partitioning’exponential’From EG\-XCGNNHidden irreps128x0e\+64x1o128x0e\+64x1oWe deviate from the original32x0e\+32x1o\+32x2e32x0e\+32x1o\+32x2eto improve compile speed and reduce memory requirements, while observing little to no performance lossRadial basis size8From EG\-XCMessage distance cutoff5\.0 ÅFrom EG\-XC\# layers3From EG\-XCNon\-Local Reweighting\-MLP\# non\-local grid features16From EG\-XCHidden dim\.16From EG\-XC\# layers3From EG\-XCActivationSiLUInfinitely smooth differentiable \(from EG\-XC\)initial output scale0\.1Reduced from 1 for increased training stabilitySimilar Articles
Accurate and scalable exchange-correlation with deep learning
Microsoft Research releases Skala, a deep-learning exchange-correlation functional for DFT that achieves 2.8 kcal/mol accuracy on GMTKN55 at semi-local cost, outperforming traditional functionals across broad chemistry benchmarks.
Agentic Discovery of Exchange-Correlation Density Functionals
This paper presents an agentic system using Large Language Models to automate the discovery of exchange-correlation functionals in Density Functional Theory, achieving improvements over human-designed baselines while highlighting challenges with benchmark overfitting.
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
This paper introduces Digit Entropy Loss (DEL), a novel loss function for numerical learning in large language models that reformulates entropy optimization to improve digit-level prediction accuracy and handle floating-point numbers, consistently outperforming existing methods on mathematical reasoning benchmarks.
Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components
This paper diagnoses the loss landscape of gradient-based inversion for the Gray-Scott reaction-diffusion system, showing that direct backpropagation fails due to flat plateaus and sharp cliffs, while PINN components like residual loss smooth the landscape. The findings provide design implications for PINN-type methods.
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
This paper introduces DiMS, a dynamical system sampler that guarantees exact sampling from the submanifold of minimum loss solutions in neural networks, enabling better uncertainty quantification in Bayesian inference.