Stein Kernelized Molecular Dynamics for Active Learning of Interatomic Potentials

arXiv cs.LG 06/04/26, 04:00 AM Papers
Summary
Researchers from MIT, University of Warwick, and NVIDIA introduce Stein Kernelized Molecular Dynamics (SKMD), an enhanced sampling method that uses interacting particle dynamics to acquire informative training configurations for active learning and fine-tuning of machine learning interatomic potentials (MLIPs). SKMD is a stochastic variant of Stein variational gradient descent adapted for molecular dynamics, preserving the Boltzmann distribution while achieving higher model accuracy in fewer training iterations compared to baselines.
arXiv:2606.04100v1 Announce Type: new
# Stein Kernelized Molecular Dynamics for Active Learning of Interatomic Potentials
Source: [https://arxiv.org/html/2606.04100](https://arxiv.org/html/2606.04100)
Joanna Zou¹, Fraser Birks², Dallas Foster³, Youssef Marzouk¹

¹Center for Computational Science & Engineering, Schwarzman College of Computing, MIT; ²Warwick Centre for Predictive Modelling, School of Engineering, University of Warwick; ³NVIDIA

###### Abstract

Machine learning interatomic potentials (MLIPs) enable efficient and accurate atomistic simulations but depend critically on the quality and diversity of the training data. We introduce Stein kernelized molecular dynamics (SKMD), an enhanced sampling method that uses interacting particle dynamics to acquire informative training configurations for the active learning and fine-tuning of MLIPs. SKMD corresponds to a stochastic variant of Stein variational gradient descent that is adapted for molecular dynamics by incorporating asynchronous particle updates and a kernel of global atomic descriptors, which provides a symmetry-aware measure of configurational similarity. Unlike other enhanced samplers used in molecular dynamics, SKMD preserves the Boltzmann distribution as the asymptotic distribution of the dynamics. This property enforces a balance between the exploration of diverse configurations and attraction toward high-probability regions of the energy landscape. We further propose an approach to efficient online data acquisition using an adaptive stopping criterion that selects non-redundant training data over the course of simulation. We demonstrate SKMD for the active learning of a neural network model of the Müller–Brown potential and the fine-tuning of a MACE interatomic potential for alanine dipeptide. Compared to active learning baselines, our method achieves higher model accuracy in fewer training iterations, with the same number of acquired training samples.

## 1 Introduction

Many modern advances in the atomistic simulation of chemical phenomena stem from the use of machine learning interatomic potentials (MLIPs), data-driven surrogate models of atomistic force fields that enable molecular dynamics (MD) simulation at greater system sizes and time scales than is feasible with *ab initio* methods. The accuracy of MLIPs depends critically on the quality of the training data: training configurations must be representative of both the key thermodynamic states of the system and the transitional states bridging them in order for the MLIP to correctly characterize chemical properties. Training data for these transitional states, or for unobserved thermodynamic states, are challenging to acquire due to the infrequency of transitions during simulation. Moreover, the high cost of labeling data with quantum-mechanical reference calculations limits the number of samples that can feasibly be added to the training set.

In *active learning* of MLIPs, we progressively improve the accuracy of the model by alternating between the collection of training data and retraining of the model on the augmented dataset. A wide array of existing active learning approaches for MLIPs leverage subset selection methods for selecting informative subsets of the simulated MD trajectory to add to the training set. These methods are based, e.g., on D-optimal design \[[47](https://arxiv.org/html/2606.04100#bib.bib55)\], CUR decomposition \[[10](https://arxiv.org/html/2606.04100#bib.bib53)\], the MaxVol algorithm \[[38](https://arxiv.org/html/2606.04100#bib.bib6),[37](https://arxiv.org/html/2606.04100#bib.bib49)\], Gaussian process variance \[[28](https://arxiv.org/html/2606.04100#bib.bib51),[58](https://arxiv.org/html/2606.04100#bib.bib20),[59](https://arxiv.org/html/2606.04100#bib.bib28),[62](https://arxiv.org/html/2606.04100#bib.bib24),[61](https://arxiv.org/html/2606.04100#bib.bib35)\], entropy maximization \[[29](https://arxiv.org/html/2606.04100#bib.bib78),[52](https://arxiv.org/html/2606.04100#bib.bib50)\], query-by-committee \[[3](https://arxiv.org/html/2606.04100#bib.bib52)\], and determinantal point processes \[[66](https://arxiv.org/html/2606.04100#bib.bib76)\]. However, standard MD trajectories can remain trapped in energy basins, producing highly correlated data irrespective of the subset selection technique, which limits model improvement.

For this reason, recent active learning methods utilize *enhanced sampling* in molecular dynamics to promote the exploration of novel regions of configuration space, including metadynamics \[[26](https://arxiv.org/html/2606.04100#bib.bib54)\], uncertainty-driven dynamics \[[31](https://arxiv.org/html/2606.04100#bib.bib61)\], and hyperactive learning \[[57](https://arxiv.org/html/2606.04100#bib.bib62)\]. These methods introduce an adaptive biasing force which drives the dynamics toward underrepresented regions of configuration space, and they define acquisition criteria for selecting non-redundant training data over the course of simulation. However, the biased dynamics do not retain fidelity to the Boltzmann distribution associated with the MLIP, and the acquisition criteria generally do not take into account the underlying energy landscape. Therefore, the chosen training configurations may not be representative of physically meaningful configurations or the true distribution of thermodynamic states.

We address these problems with Stein kernelized molecular dynamics (SKMD), a novel enhanced sampling method for active learning of MLIPs. Our idea is to adapt variational inference approaches in Bayesian inference and statistics for sampling problems in molecular dynamics. SKMD is derived from Stein variational gradient descent (SVGD) \[[34](https://arxiv.org/html/2606.04100#bib.bib15)\], a particle-based variational inference algorithm which utilizes an interacting particle set to approximate a target distribution. Our method improves upon the other enhanced samplers by retaining the Boltzmann distribution of the MLIP as the asymptotic distribution of the dynamics. Furthermore, the SKMD biasing force offers a means to define an acquisition criterion which balances the selection of diverse configurations with those informed by the energy landscape.

Active learning is related to model fine-tuning, which improves the accuracy of model outputs in regions of data space as specified by a reward function. While flow-based generative methods have been adopted for Boltzmann sampling and fine-tuning \[[45](https://arxiv.org/html/2606.04100#bib.bib56),[46](https://arxiv.org/html/2606.04100#bib.bib57),[44](https://arxiv.org/html/2606.04100#bib.bib58)\], they require that training data already exist in regions which are targeted for additional sampling and can struggle to sufficiently sample regions with poor data coverage. We argue that an enhanced sampling framework is better suited for the task of active learning, as local particle transforms enable the discovery of thermodynamic states which are unseen by the existing training data.

We summarize our contributions as follows:

- We propose SKMD as a stochastic variant of SVGD implemented with asynchronous particle updates and a kernel of global atomic descriptors, thus adapting the algorithm to molecular dynamics settings.
- We prove that the asymptotic distribution of SKMD dynamics is the Boltzmann distribution of the system. In [Proposition 1](https://arxiv.org/html/2606.04100#Thmproposition1), we show that its mean-field limit coincides with that of SVGD, which converges to the Boltzmann distribution under appropriate conditions.
- We develop an approach to online data acquisition in the form of an adaptive stopping criterion for SKMD, discussed in greater detail in [Section C.3](https://arxiv.org/html/2606.04100#A3.SS3) and [Section D.1](https://arxiv.org/html/2606.04100#A4.SS1).
- We show that SKMD outperforms other sampling techniques for data generation and active learning of MLIPs, demonstrated on problems of learning a neural network model of the Müller–Brown potential and fine-tuning a MACE foundation model of organic molecules for alanine dipeptide.

## 2 Background

### 2.1 Machine learning atomistic force fields

*Interatomic potentials* model the total potential energy of a configuration of $N$ atoms as a function of the atomic positions $\mathbf{x}=(x_1,\ldots,x_N)\in\mathbb{R}^{3N}$, where $x_n\in\mathbb{R}^3$. Whereas classical empirical potentials \[[53](https://arxiv.org/html/2606.04100#bib.bib89),[50](https://arxiv.org/html/2606.04100#bib.bib90),[14](https://arxiv.org/html/2606.04100#bib.bib91),[6](https://arxiv.org/html/2606.04100#bib.bib92)\] are analytical functions taking simple parametric forms, machine learning interatomic potentials (MLIPs) are flexible function approximations learned from higher-fidelity reference data such as density functional theory (DFT) calculations. Let $V:\mathbb{R}^{3N}\to\mathbb{R}$ and $-\nabla_{\mathbf{x}}V:\mathbb{R}^{3N}\to\mathbb{R}^{3N}$ correspond to reference DFT calculations of the potential energy and forces. An MLIP $V_\theta:\mathbb{R}^{3N}\to\mathbb{R}$ is typically learned from a weighted least squares objective, where the model parameters $\theta\in\Theta$ are the minimizer of the loss function $\mathcal{L}$,

$$\mathcal{L}(\theta)=\frac{\lambda_0}{K}\sum_{k=1}^{K}\big|V_\theta(\mathbf{x}^k)-V(\mathbf{x}^k)\big|^2+\frac{\lambda_1}{K}\sum_{k=1}^{K}\big\|\nabla_{\mathbf{x}}V_\theta(\mathbf{x}^k)-\nabla_{\mathbf{x}}V(\mathbf{x}^k)\big\|^2, \tag{1}$$

defined by $\lambda_0,\lambda_1>0$ and evaluated at a training set $\mathcal{D}:=\{(\mathbf{x}^k,V(\mathbf{x}^k),\nabla_{\mathbf{x}}V(\mathbf{x}^k))\}_{k=1}^{K}$.
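As a minimal sketch of how the objective in (1) can be evaluated, the NumPy function below computes the energy and gradient terms from arrays of predictions and reference labels. The function name `energy_force_loss`, the array layout, and the toy numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def energy_force_loss(V_pred, V_ref, grad_pred, grad_ref, lam0=1.0, lam1=1.0):
    """Weighted least-squares loss in the form of Eq. (1).

    V_pred, V_ref:       shape (K,), model and reference energies.
    grad_pred, grad_ref: shape (K, 3N), model and reference gradients.
    lam0, lam1:          positive weights on the energy and gradient terms.
    """
    V_pred, V_ref = np.asarray(V_pred, float), np.asarray(V_ref, float)
    grad_pred, grad_ref = np.asarray(grad_pred, float), np.asarray(grad_ref, float)
    K = V_pred.shape[0]
    energy_term = lam0 / K * np.sum((V_pred - V_ref) ** 2)
    # Summing all squared entries equals the sum of squared Euclidean norms over k.
    grad_term = lam1 / K * np.sum((grad_pred - grad_ref) ** 2)
    return energy_term + grad_term

# Toy data (hypothetical): K = 2 configurations of a single atom, so 3N = 3.
loss = energy_force_loss(
    V_pred=[1.0, 2.0], V_ref=[0.0, 2.0],
    grad_pred=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    grad_ref=np.zeros((2, 3)),
    lam0=1.0, lam1=10.0,
)
print(loss)  # 5.5 = (1/2) * 1 + (10/2) * 1
```

The ratio of `lam1` to `lam0` sets how strongly gradient (force) errors are penalized relative to energy errors. In practice the minimization over $\theta$ would be carried out with an automatic-differentiation framework rather than NumPy.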

*Atomic descriptors* are feature representations of local atomic environments which form the basis of many MLIPs. A local descriptor $\tilde{g}(\mathbf{x})\in\tilde{\Omega}\subseteq\mathbb{R}^d$ can be learned from data, as with invariant latent representations from GNN potentials such as NequIP \[[9](https://arxiv.org/html/2606.04100#bib.bib93)\], Allegro \[[43](https://arxiv.org/html/2606.04100#bib.bib94)\], or MACE \[[8](https://arxiv.org/html/2606.04100#bib.bib83),[7](https://arxiv.org/html/2606.04100#bib.bib96)\], or constructed explicitly from symmetry-adapted bases of local atomic neighborhoods which enforce invariances under SO(3), such as SOAP descriptors \[[5](https://arxiv.org/html/2606.04100#bib.bib9)\], bispectrum components \[[54](https://arxiv.org/html/2606.04100#bib.bib37)\], and ACE basis functions \[[18](https://arxiv.org/html/2606.04100#bib.bib5)\]. Whereas local descriptors are per-atom representations, *global descriptors* $g(\mathbf{x})\in\Omega\subseteq\mathbb{R}^d$ are representations of the multi-atom configuration, typically a composition of local descriptors $\tilde{g}(\mathbf{x})=[\tilde{g}^1(\mathbf{x}),\ldots,\tilde{g}^N(\mathbf{x})]\in\tilde{\Omega}^N\subseteq\mathbb{R}^{Nd}$. An example of a global descriptor, used in \[[29](https://arxiv.org/html/2606.04100#bib.bib78)\], is the mean of the local descriptors, $g(\mathbf{x})=\frac{1}{N}\sum_{n=1}^{N}\tilde{g}^n(\mathbf{x})$. Depending on the application, a more informative global descriptor could be the mean of local descriptors of a subset of the atoms, e.g., of interfacial atoms within a bulk configuration.

*Boltzmann sampling* in molecular dynamics consists of generating samples distributed according to the Boltzmann distribution of configurations of the atomic system, $\pi(\mathbf{x})=\frac{1}{Z}\exp\big(-\beta V(\mathbf{x})\big)$, where $\beta$ is the inverse temperature and $Z$ is the normalizing constant. Given an MLIP which is cheaper to evaluate than the reference, one generally approximates $\pi$ with the Boltzmann distribution of the MLIP, $\pi_\theta(\mathbf{x})=\frac{1}{Z_\theta}\exp\big(-\beta V_\theta(\mathbf{x})\big)$. A common approach to Boltzmann sampling is to perform molecular dynamics simulation with a Langevin thermostat, which corresponds to underdamped Langevin dynamics. If the dynamics are ergodic, then the marginal invariant distribution of positions coincides with $\pi_\theta$ \[[32](https://arxiv.org/html/2606.04100#bib.bib41)\].

In the overdamped limit of Langevin dynamics, $\pi_\theta$ remains the marginal invariant distribution of positions, and the molecular dynamics according to

$$\mathrm{d}\mathbf{x}_t=-\nabla_{\mathbf{x}}V_\theta(\mathbf{x}_t)\,\mathrm{d}t+\sqrt{2\beta^{-1}}\,\mathrm{d}W_t \tag{2}$$

can be used to sample from the Boltzmann distribution. However, on practical simulation time scales, the simulation of either underdamped or overdamped Langevin dynamics does not guarantee an accurate sample from $\pi_\theta$, as the dynamics are susceptible to metastability. As the Boltzmann distribution of molecular systems is typically highly multi-modal, trajectories can remain confined for long times within metastable basins separated by high barriers in the free energy landscape, leading to slow mixing and poor sampling of the full distribution \[[25](https://arxiv.org/html/2606.04100#bib.bib44)\].
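To make the metastability issue concrete, here is a small sketch of an Euler–Maruyama discretization of (2). The one-dimensional double-well potential $V(x)=(x^2-1)^2$, the step size, and all function names are illustrative assumptions and do not appear in the paper.

```python
import numpy as np

def overdamped_langevin(grad_V, x0, beta, dt, n_steps, rng):
    """Euler-Maruyama discretization of Eq. (2):
    x_{t+1} = x_t - dt * grad V(x_t) + sqrt(2 * dt / beta) * xi_t."""
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1,) + x.shape)
    traj[0] = x
    for t in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - dt * grad_V(x) + np.sqrt(2.0 * dt / beta) * xi
        traj[t + 1] = x
    return traj

def grad_double_well(x):
    # Gradient of the toy potential V(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1.
    return 4.0 * x * (x ** 2 - 1.0)

rng = np.random.default_rng(0)
# Barrier height is 1, so beta = 20 makes crossings rare on this time scale.
traj = overdamped_langevin(grad_double_well, x0=[-1.0], beta=20.0,
                           dt=1e-3, n_steps=2000, rng=rng)
print(traj.min(), traj.max())
```

With these settings the trajectory stays in the left well for the whole run, even though the Boltzmann distribution places equal mass on both wells. This is the kind of slow mixing that enhanced sampling is meant to address.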

### 2.2 Particle-based variational inference

*Stein variational gradient descent (SVGD)* is a particle-based variational inference algorithm which approximates a target density $\pi$ on a state space $\mathcal{X}$ with the empirical distribution $\hat{q}_t$ of a set of interacting particles $\bar{X}_t=\{\mathbf{x}^i_t\}_{i=1}^{J}$. The evolution of each particle for $i=1,\ldots,J$ is given by the following update equation, for time step $\epsilon>0$ and symmetric positive semi-definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$,

$$\mathbf{x}^i_{t+1}\leftarrow\mathbf{x}^i_t+\epsilon\,\hat{\phi}^*_t(\mathbf{x}^i_t;\bar{X}_t), \tag{3a}$$

$$\hat{\phi}^*_t(\cdot\,;\bar{X}_t)=\frac{1}{J}\sum_{j=1}^{J}\Big[\nabla_{\mathbf{x}^j_t}\log\pi(\mathbf{x}^j_t)\,k(\mathbf{x}^j_t,\cdot)+\nabla_{\mathbf{x}^j_t}k(\mathbf{x}^j_t,\cdot)\Big],\quad\mathbf{x}^j_t\in\bar{X}_t. \tag{3b}$$

One can show that ([3a](https://arxiv.org/html/2606.04100#S2.E3.1)) corresponds to the Euler discretization of a continuous-time ODE describing the particle dynamics, and ([3b](https://arxiv.org/html/2606.04100#S2.E3.2)) corresponds to a Monte Carlo estimator of an expectation resulting from Stein's identity \[[49](https://arxiv.org/html/2606.04100#bib.bib81)\],

$$\mathrm{d}\mathbf{x}^i_t=\phi^*_t(\mathbf{x}^i_t)\,\mathrm{d}t, \tag{4a}$$

$$\phi^*_t(\cdot)=\mathbb{E}_{\mathbf{x}\sim\hat{q}_t}\big[\nabla_{\mathbf{x}}\log\pi(\mathbf{x})\,k(\mathbf{x},\cdot)+\nabla_{\mathbf{x}}k(\mathbf{x},\cdot)\big]. \tag{4b}$$

In the limit of infinite time $t\to\infty$ and infinite particles $J\to\infty$, the empirical distribution $\hat{q}_t$ converges weakly to $\pi$ in Kullback–Leibler (KL) divergence \[[35](https://arxiv.org/html/2606.04100#bib.bib16),[36](https://arxiv.org/html/2606.04100#bib.bib101)\]. [Section C.2](https://arxiv.org/html/2606.04100#A3.SS2) describes the SVGD formulation in further detail.
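The update in (3a) and (3b) can be sketched in a few lines of NumPy. The example below is an illustration only: it assumes a unit-amplitude Gaussian kernel in Cartesian coordinates with a fixed length scale `b`, a one-dimensional standard normal target, and hypothetical names (`svgd_direction`, `grad_log_pi`) that are not taken from the paper.

```python
import numpy as np

def svgd_direction(X, grad_log_pi, b=1.0):
    """Monte Carlo estimate of Eq. (3b), evaluated at every particle.

    X: array of shape (J, n) holding the particle positions.
    Uses the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 b^2)).
    """
    diff = X[:, None, :] - X[None, :, :]          # diff[j, i] = x_j - x_i
    Kmat = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * b ** 2))  # Kmat[j, i] = k(x_j, x_i)
    G = grad_log_pi(X)                            # gradient of log pi at each x_j
    attract = Kmat.T @ G                          # sum_j k(x_j, x_i) grad log pi(x_j)
    # Gradient of k(x_j, x_i) with respect to x_j is -(x_j - x_i) / b^2 * k(x_j, x_i).
    repel = np.sum(-diff / b ** 2 * Kmat[:, :, None], axis=0)
    return (attract + repel) / X.shape[0]

# Toy target (assumption): standard normal, so grad log pi(x) = -x.
rng = np.random.default_rng(0)
X = 3.0 + 0.5 * rng.standard_normal((50, 1))      # particles start far from the mode
eps = 0.1
for _ in range(500):
    X = X + eps * svgd_direction(X, lambda Z: -Z)  # Eq. (3a)
print(X.mean(), X.std())
```

The first term pulls particles toward high-density regions, and the kernel-gradient term pushes neighboring particles apart so that the ensemble does not collapse onto the mode.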

## 3 Stein kernelized molecular dynamics

Active learning is performed with an iterative scheme of (i) sampling configuration space, (ii) selecting configurations to add as training data, and (iii) retraining the MLIP on the augmented training set. In the following, we introduce the SKMD sampling algorithm and its associated adaptive stopping criterion, which is a kernel-based strategy for online data acquisition.

### 3.1 Sampling algorithm

Let $\bar{X}_s=\{\mathbf{x}_s^j\}_{j=1}^{J}$ be an ensemble of $J$ particles at a given time step $s>0$. Let $k:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}$ be a symmetric positive semi-definite kernel. For stopping time $\ell>0$, step size $\epsilon>0$, constant $\eta>0$, scale parameter $A:\mathbb{R}^n\to\mathbb{R}$, and $\xi^i_t\sim N(0,\mathbb{I}_n)$, the evolution of the $i$th particle in the set for $t=s,\ldots,s+\ell-1$ is given by

$$\mathbf{x}^i_{t+1}\leftarrow\mathbf{x}^i_t+\epsilon\big[-A(\mathbf{x}^i_t)\,\beta\,\nabla_{\mathbf{x}^i_t}V_\theta(\mathbf{x}^i_t)+F^{\mathrm{SKMD}}_{\theta,s}(\mathbf{x}^i_t;\bar{X}_s)\big]+\sqrt{\tfrac{2\epsilon\eta}{J}}\,\xi^i_t, \tag{5a}$$

$$F^{\mathrm{SKMD}}_{\theta,s}(\cdot\,;\bar{X}_s)=\frac{1}{J-1}\sum_{j=1}^{J-1}\Big[-\beta\,\nabla_{\mathbf{x}_s^j}V_\theta(\mathbf{x}_s^j)\,k(\mathbf{x}_s^j,\cdot)+\nabla_{\mathbf{x}_s^j}k(\mathbf{x}_s^j,\cdot)\Big],\quad\mathbf{x}_s^j\in\bar{X}_s\setminus\{\mathbf{x}_s^i\}. \tag{5b}$$

After $\ell$ steps, the current point $\mathbf{x}^i_{s+\ell}$ replaces $\mathbf{x}^i_s$ in the ensemble and we switch to evolving the next particle in the ensemble starting at step $s+\ell$. The sampling algorithm, summarized in [Algorithm 2](https://arxiv.org/html/2606.04100#alg2), has the following characteristics:

**Adaptive biasing force.** The force field in ([5a](https://arxiv.org/html/2606.04100#S3.E5.1)) is the combination of $-\beta\,\nabla_{\mathbf{x}^i_t}V_\theta(\mathbf{x}^i_t)$, referred to as the gradient force, and $F^{\mathrm{SKMD}}_{\theta,s}(\mathbf{x}^i_t;\bar{X}_s)$, referred to as the SKMD biasing force. In the first term in the summand of ([5b](https://arxiv.org/html/2606.04100#S3.E5.2)), the gradient force at each particle in the ensemble, scaled by the kernel, acts as an attractive force that draws the trajectory toward low-energy configurations to promote fidelity to the free energy landscape. The second term, the gradient of the kernel, acts as a repulsive force that drives particles apart in order to promote exploration of the configuration space. The attractive and repulsive forces strike a balance between the *exploration* of novel configurations and the *exploitation* of high-probability regions, which ensures the accuracy of the asymptotic distribution of samples.

We introduce a scale parameter $A$ to differentially weigh the strength of the gradient force relative to the SKMD biasing force at the current point in the simulation path. By setting $A(\mathbf{x})=k(\mathbf{x},\mathbf{x})=a$, $\forall\mathbf{x}\in\mathbb{R}^n$, the drift of ([5a](https://arxiv.org/html/2606.04100#S3.E5.1)) coincides with the velocity field of SVGD at $t=s$. Setting a larger kernel amplitude $a>0$ can enhance the repulsive effect of the dynamics, but setting $A=a$ would in turn magnify the scale of the negative gradient of the potential, making it more difficult for the path to cross an energy barrier. Therefore, for greater flexibility, we do not strictly require that $A$ equal the kernel self-similarity in the SKMD algorithm. In our implementation, we generally set $A=1$.
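A single SKMD step for one active particle can be sketched as follows. This is not the authors' implementation: it assumes a Gaussian kernel applied directly to Cartesian coordinates (i.e., an identity descriptor), a constant scale parameter `A`, and a harmonic toy potential. The names `skmd_bias_force` and `skmd_step` are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, a=1.0, b=1.0):
    # k(x, y) = a * exp(-||x - y||^2 / (2 b^2)), here on raw coordinates (assumption).
    return a * np.exp(-np.sum((x - y) ** 2) / (2.0 * b ** 2))

def skmd_bias_force(x, others, grad_V, beta, a=1.0, b=1.0):
    """Biasing force in the form of Eq. (5b), averaged over the frozen
    ensemble members other than the active particle."""
    F = np.zeros_like(x)
    for xj in others:
        kj = gaussian_kernel(xj, x, a, b)
        grad_k = -(xj - x) / b ** 2 * kj           # gradient w.r.t. the first argument x_j
        F += -beta * grad_V(xj) * kj + grad_k      # attractive term + repulsive term
    return F / len(others)

def skmd_step(x, others, grad_V, beta, eps, eta, A=1.0, a=1.0, b=1.0, rng=None):
    """One update of the active particle in the form of Eq. (5a)."""
    rng = np.random.default_rng() if rng is None else rng
    J = len(others) + 1                            # ensemble size, including the active particle
    drift = -A * beta * grad_V(x) + skmd_bias_force(x, others, grad_V, beta, a, b)
    xi = rng.standard_normal(x.shape)
    return x + eps * drift + np.sqrt(2.0 * eps * eta / J) * xi

# Toy setup (assumption): V(x) = ||x||^2 / 2, one frozen particle at the minimum.
grad_V = lambda x: x
x = np.array([1.0, 0.0])
others = [np.zeros(2)]
F = skmd_bias_force(x, others, grad_V, beta=1.0)
x_next = skmd_step(x, others, grad_V, beta=1.0, eps=0.1, eta=0.0,
                   rng=np.random.default_rng(0))
print(F, x_next)
```

Because the frozen particle sits at the minimum, its attractive contribution vanishes and the biasing force reduces to the kernel gradient, which points from the frozen particle toward the active one and so pushes the active particle away.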

**Kernel of global descriptors.** The kernel measures the degree of similarity between two configurations. We assume that configurations are more clearly separated in the Euclidean geometry of the descriptor space $\Omega$ than in Cartesian space, since the descriptor map encodes invariance or equivariance relations. Therefore, we use translation-invariant kernels based on distances between the descriptors; for example, a Gaussian kernel with amplitude $a>0$ and length scale $b>0$,

$$k(\mathbf{x},\mathbf{x}')=k_g\big(g(\mathbf{x}),g(\mathbf{x}')\big)=a\exp\Big(-\frac{\|g(\mathbf{x})-g(\mathbf{x}')\|^2}{2b^2}\Big). \tag{6}$$

The gradient of the kernel with respect to its first argument is computed via the chain rule as

$$\nabla_{\mathbf{x}}k(\mathbf{x},\mathbf{x}')=-\frac{1}{b^2}\,k(\mathbf{x},\mathbf{x}')\,\mathcal{J}(\mathbf{x})^{\top}\big(g(\mathbf{x})-g(\mathbf{x}')\big), \tag{7}$$

where the Jacobian $\mathcal{J}\in\mathbb{R}^{d\times 3N}$ is a block matrix with $\mathcal{J}_{ij}(\mathbf{x})=\frac{\partial g_i}{\partial\mathbf{x}_j}(\mathbf{x})$.
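The chain-rule expression in (7) can be checked numerically against finite differences. The sketch below uses a deliberately simple global descriptor, the squared radius of gyration with $d=1$, chosen only because its Jacobian has a closed form. It is an assumption for illustration and not one of the descriptors used in the paper.

```python
import numpy as np

def descriptor(x):
    """Toy global descriptor (assumption): squared radius of gyration, d = 1.
    Invariant to translations and rotations of the configuration."""
    P = x.reshape(-1, 3)
    R = P - P.mean(axis=0)
    return np.array([np.mean(np.sum(R ** 2, axis=1))])

def descriptor_jacobian(x):
    """Jacobian of the toy descriptor, shape (d, 3N) = (1, 3N)."""
    P = x.reshape(-1, 3)
    R = P - P.mean(axis=0)
    return (2.0 / P.shape[0]) * R.reshape(1, -1)

def kernel(x, xp, a=1.0, b=1.0):
    # Gaussian kernel on descriptor distances, as in Eq. (6).
    dg = descriptor(x) - descriptor(xp)
    return a * np.exp(-(dg @ dg) / (2.0 * b ** 2))

def kernel_grad(x, xp, a=1.0, b=1.0):
    # Gradient with respect to the first argument, as in Eq. (7).
    dg = descriptor(x) - descriptor(xp)
    return -(1.0 / b ** 2) * kernel(x, xp, a, b) * (descriptor_jacobian(x).T @ dg)

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(12), rng.standard_normal(12)   # N = 4 atoms each

# Central finite differences of the kernel in each Cartesian coordinate.
h = 1e-5
fd = np.array([(kernel(x + h * e, xp) - kernel(x - h * e, xp)) / (2.0 * h)
               for e in np.eye(x.size)])
print(np.max(np.abs(fd - kernel_grad(x, xp))))
```

Because the kernel depends on the configuration only through the descriptor, it inherits the descriptor's invariances: shifting every atom of `x` by the same vector leaves `kernel(x, xp)` unchanged. For learned descriptors the Jacobian would typically come from automatic differentiation rather than a closed form.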

**Asynchronous particle updates.** Rather than updating the positions of all particles in the ensemble at once, as is typically done in interacting particle systems, one particle is updated at a time for a finite number of steps $\ell$. The asynchronous scheme allows one to use single-particle simulation to approximate the ensemble dynamics of the interacting particle system. This approach both reduces the computational overhead associated with inter-particle communication and simplifies the integration of SKMD into existing molecular dynamics workflows, such as those in LAMMPS \[[55](https://arxiv.org/html/2606.04100#bib.bib99)\]. We evaluate the effect of asynchronous particle updates on the fidelity of SKMD samples to the Boltzmann distribution in [Section B.1](https://arxiv.org/html/2606.04100#A2.SS1).

**Isotropic stochastic noise.** The addition of stochastic noise in ([5a](https://arxiv.org/html/2606.04100#S3.E5.1)) serves to improve the mixing of the simulation path for small ensemble sizes ($O(10)$) and relates the update equation to a stochastic variant of SVGD first proposed in \[[23](https://arxiv.org/html/2606.04100#bib.bib60)\]. In \[[23](https://arxiv.org/html/2606.04100#bib.bib60),[20](https://arxiv.org/html/2606.04100#bib.bib100)\], stochastic SVGD is defined by a kernel-dependent diffusion coefficient. However, we found in numerical experiments that the asynchronous scheme in ([5a](https://arxiv.org/html/2606.04100#S3.E5.1)) with the isotropic, state-independent diffusion coefficient produces samples which exhibit better agreement with the Boltzmann distribution compared to the implementation with the kernel-dependent diffusion coefficient. A state-independent diffusion coefficient is also adopted in \[[63](https://arxiv.org/html/2606.04100#bib.bib63)\] for a single-trajectory variant of stochastic SVGD. Our choice of diffusion coefficient does not perturb the mean-field limit of the dynamics, as shown in [Section A.1](https://arxiv.org/html/2606.04100#A1.SS1).

**Fidelity to the Boltzmann distribution.** A key property of SKMD is that, in the limit of $\epsilon\to 0$, $\ell\to 0$, and $J\to\infty$, the empirical distribution of the particles approaches the solution to a mean-field equation whose stationary solution is the Boltzmann distribution. We show this result in [Proposition 1](https://arxiv.org/html/2606.04100#Thmproposition1). Consequently, SKMD inherits the convergence properties of SVGD that are derived from the mean-field equation, and the asymptotic behavior of ([5](https://arxiv.org/html/2606.04100#S3.E5)) remains consistent with the Boltzmann distribution $\pi_\theta$. Intuitively, the local attractive and repulsive dynamics in configuration space correspond to global transport of the empirical measure of particles in probability space, such that in the appropriate asymptotic regime, the empirical distribution weakly converges to $\pi_\theta$ in KL divergence. See [Section A.1](https://arxiv.org/html/2606.04100#A1.SS1) for the necessary assumptions and further discussion.

**Algorithm 1** SKMD for active learning with online data acquisition

Initial ensemble $\bar{X}_0=\{\mathbf{x}^1_0,\ldots,\mathbf{x}^J_0\}$, training set $\mathcal{D}_0$, model $V_{\theta_0}$, kernel $k$, threshold $\zeta_0$

- **for** training iteration $\tau=0,1,\ldots$ **do**
  - **for** ensemble member $j=1,\ldots,J$ **do**
    - Initialize trajectory at $\mathbf{x}_0\leftarrow\mathbf{x}^j_\tau$
    - **repeat** for $t=0,1,\ldots$
      - Simulate one step of ([5](https://arxiv.org/html/2606.04100#S3.E5)): $\mathbf{x}_{t+1}\leftarrow\mathrm{SKMD}(\mathbf{x}_t,\bar{X}_\tau,V_{\theta_\tau},k)$
    - **until** acquisition criterion ([8](https://arxiv.org/html/2606.04100#S3.E8)) is met at $\mathbf{x}_{t+1}$ with threshold $\zeta_0$
    - Add $\mathbf{x}_{t+1}$ to the set $\mathcal{D}_{\tau+1}$
    - Update ensemble: replace $\mathbf{x}^j_\tau$ with $\mathbf{x}_{t+1}$ in $\bar{X}_\tau$
  - Train new model $V_{\theta_{\tau+1}}$ on the augmented training set $\bigcup_{k=0}^{\tau+1}\mathcal{D}_k$
- **return** Final model $V_{\theta_{\text{final}}}$ and training set $\mathcal{D}_{\text{final}}$

### 3.2 Data acquisition

In [Section 3.1](https://arxiv.org/html/2606.04100#S3.SS1), we presented SKMD as a general-purpose sampling algorithm for a given Boltzmann distribution. In this section, we discuss approaches to adapt the sampling algorithm for active learning, where the goal is to select new training data which improve model quality while minimizing the cost of data acquisition and labeling.

*Offline data acquisition* is performed by choosing a subset of a pre-collected pool of candidate configurations to label with the oracle and add to the training data. In many such methods, the chosen subset is the one that maximizes some functional measuring the dissimilarity of elements, such as the determinant of a feature kernel, as with the MaxVol algorithm \[[38](https://arxiv.org/html/2606.04100#bib.bib6),[37](https://arxiv.org/html/2606.04100#bib.bib49)\] and determinantal point processes \[[15](https://arxiv.org/html/2606.04100#bib.bib75),[66](https://arxiv.org/html/2606.04100#bib.bib76)\], or the distance between elements, as with farthest point sampling (FPS). These methods may be directly adopted by utilizing samples generated by SKMD as the pool of candidate configurations, as described in [Algorithm 3](https://arxiv.org/html/2606.04100#alg3). Offline methods for data acquisition often incur high computational costs, as the selection of subsets from a candidate pool is a combinatorially hard problem. Practical algorithms rely on greedy approximations of optimal subset selection, which require repeated matrix determinant evaluations whose cost scales with the total set size.

*Online data acquisition* is a cost-effective alternative to offline approaches, performed by specifying criteria for collecting training data during simulation. While the sampling algorithm in ([5](https://arxiv.org/html/2606.04100#S3.E5)) is defined for a fixed stopping time $\ell>0$, SKMD can be adapted into a technique for online data acquisition by introducing an adaptive stopping criterion. Let the Euclidean norm of the SKMD biasing force define an acquisition function $\alpha_s$. For a given particle that evolves from a time $s\geq 0$, the adaptive stopping time $\ell(s;\zeta_0)$ is the first time after a set number of steps $\ell_0$ that the acquisition function falls below a threshold $\zeta_0>0$, i.e.,

$$\ell(s;\zeta_0):=\inf\{t\geq\ell_0:\alpha_s(\mathbf{x}_{s+t})<\zeta_0\},\qquad\alpha_s(\mathbf{x})=\big\|F^{\mathrm{SKMD}}_{\theta,s}(\mathbf{x};\bar{X}_s)\big\|. \tag{8}$$

At the point $\mathbf{x}_{s+\ell}$ where the criterion is met, we collect that point as a new training datum and switch the simulation to the next particle in $\bar{X}_s$. After advancing each particle in the ensemble and collecting $J$ configurations, we label the data with reference calculations and retrain the model on the augmented training set. This approach is outlined in [Algorithm 1](https://arxiv.org/html/2606.04100#alg1).
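The stopping rule in (8) amounts to a short loop around any single-step sampler. The sketch below is generic and hedged: `step_fn` and `bias_force_fn` are hypothetical callables standing in for one SKMD update and the SKMD biasing force, and the `max_steps` cap is a practical safeguard added here that is not part of the criterion in the text.

```python
import numpy as np

def adaptive_stop(x_s, step_fn, bias_force_fn, zeta0, ell0, max_steps=10_000):
    """Advance a particle until the criterion of Eq. (8) is met.

    Returns (x, t, met): the final point, the number of steps taken, and
    whether the acquisition function fell below zeta0 at some t >= ell0.
    """
    x = np.array(x_s, dtype=float)
    t = 0
    while t < max_steps:
        # Acquisition function: Euclidean norm of the biasing force at x_{s+t}.
        if t >= ell0 and np.linalg.norm(bias_force_fn(x)) < zeta0:
            return x, t, True
        x = step_fn(x)
        t += 1
    return x, t, False

# Deterministic toy stand-ins (assumptions): the particle drifts away from an
# ensemble located at the origin, and the force norm decays like a Gaussian.
step_fn = lambda x: x + 0.5
bias_force_fn = lambda x: np.exp(-0.5 * x ** 2)

x_acq, ell, met = adaptive_stop([0.0], step_fn, bias_force_fn, zeta0=0.1, ell0=2)
print(x_acq, ell, met)  # [2.5] 5 True
```

The minimum number of steps `ell0` keeps the loop from stopping immediately at points where the force happens to be small at the start of a trajectory, and `zeta0` trades off how far a particle travels against how many simulation steps each acquisition costs.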

The acquisition criterion is a Stein-informed heuristic for when a given point is informative with respect to both the potential energy surface and the existing training set. It promotes the selection of points at key geometric features of the potential energy surface, since the gradient of the potential in the SKMD biasing force is small in energy basins and at saddle points. Moreover, it promotes the selection of points which are distinct from previously added training data, since the kernel and its gradient become small when the point is far from $\bar{X}_s$ in descriptor space.

## 4 Related work

Enhanced sampling \[[25](https://arxiv.org/html/2606.04100#bib.bib44)\] encompasses a broad class of methods used to promote some desired sampling behavior in molecular dynamics, reviewed in [Section C.1](https://arxiv.org/html/2606.04100#A3.SS1). The method most closely related to our application is uncertainty-driven dynamics (UDD) \[[31](https://arxiv.org/html/2606.04100#bib.bib61)\],

$$\mathrm{d}\mathbf{x}_t = \big(-\nabla_{\mathbf{x}} V_\theta(\mathbf{x}_t) - \nabla_{\mathbf{x}} V^{\textup{UDD}}_t(\mathbf{x}_t)\big)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}W_t, \tag{9}$$

a query-by-committee approach to active learning which introduces a biasing potential based on a Gaussian kernel of the empirical variance in energy predictions by a committee of MLIPs. The committee consists of copies of the MLIP with different parameters learned from $k$-fold splits of the training data. Hyperactive learning \[[57](https://arxiv.org/html/2606.04100#bib.bib62)\] employs a sampler of the same form, but with the committee parameters drawn from a Bayesian posterior. Online data acquisition is performed by querying points at which the committee uncertainty exceeds a certain threshold. Both methods are designed to preferentially sample regions of high uncertainty, but unlike SKMD, they do not maintain the Boltzmann distribution as the invariant distribution and do not incorporate information on the free energy landscape in the acquisition criteria. Moreover, because UDD requires the training and simulation of multiple MLIPs, each active learning iteration with UDD incurs a higher computational cost than with SKMD, as reflected by the run times and memory usage detailed in [Section E.3](https://arxiv.org/html/2606.04100#A5.SS3).
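To make the form of the biased dynamics in (9) concrete, the following Python sketch performs one Euler–Maruyama step of an overdamped Langevin equation with an additional biasing gradient. It is a generic discretization under stated assumptions, not the UDD implementation: `grad_v` and `grad_bias` are hypothetical callables standing in for $\nabla_{\mathbf{x}} V_\theta$ and $\nabla_{\mathbf{x}} V^{\textup{UDD}}_t$, and the committee-based bias itself is not constructed.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def biased_langevin_step(
    x: Vector,
    grad_v: Callable[[Vector], Vector],
    grad_bias: Callable[[Vector], Vector],
    beta: float,
    dt: float,
    rng: random.Random,
) -> List[float]:
    """One Euler-Maruyama step of
    dx = (-grad V(x) - grad V_bias(x)) dt + sqrt(2 / beta) dW.
    """
    gv = grad_v(x)
    gb = grad_bias(x)
    # Brownian increments over a step of size dt have variance dt.
    noise_scale = math.sqrt(2.0 * dt / beta)
    return [
        xi - (gvi + gbi) * dt + noise_scale * rng.gauss(0.0, 1.0)
        for xi, gvi, gbi in zip(x, gv, gb)
    ]


# Toy usage: harmonic potential V(x) = |x|^2 / 2 with no bias (assumed example).
rng = random.Random(0)
state = [1.0, -1.0]
for _ in range(100):
    state = biased_langevin_step(
        state, lambda x: list(x), lambda x: [0.0] * len(x),
        beta=1.0, dt=0.01, rng=rng,
    )
```

Setting `grad_bias` to zero recovers unbiased overdamped Langevin dynamics, which leaves the Boltzmann distribution of $V_\theta$ invariant in the continuous-time limit; a nonzero bias of this additive form generally changes the invariant distribution, which is the distinction drawn in the text.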

Kernel herding constructs a set of representative points whose empirical distribution closely matches a target distribution $\pi$ in terms of their kernel mean embeddings. The SKMD acquisition criterion can be compared to kernel herding via Stein points \[[11](https://arxiv.org/html/2606.04100#bib.bib67)\], where points are selected greedily according to

$$\mathbf{x}_n \in \underset{\mathbf{x}\in\mathcal{X}}{\text{argmin}} \sum_{i=1}^{n-1} \mathcal{S}^{\mathbf{x}_i}_{\pi}\mathcal{S}^{\mathbf{x}}_{\pi} \otimes k(\mathbf{x}_i,\mathbf{x}) \tag{10}$$

given previously chosen points $\{\mathbf{x}_1,\ldots,\mathbf{x}_{n-1}\}$ and Stein operator $\mathcal{S}^{\mathbf{x}}_{\pi}$. In our acquisition criterion, the previously chosen points are training data $\bar{X} \subseteq \mathcal{D}$, the target distribution is the Boltzmann distribution $\pi_\theta$, and rather than searching for a minimum over the full configuration space $\mathcal{X}$, we select points along the simulation path with low values of $\sum_{i=1}^{J-1} \mathcal{S}^{\cdot}_{\pi_\theta} \otimes k(\mathbf{x}_i,\cdot)$ in terms of its Euclidean norm. Therefore, the SKMD acquisition scheme can be interpreted as selecting points which reduce a simpler proxy for the Stein points objective. We discuss this relation in [Section C.3](https://arxiv.org/html/2606.04100#A3.SS3).

SVGD has spawned a plethora of variants of Stein repulsive dynamics, for applications in Bayesian inference \[[23](https://arxiv.org/html/2606.04100#bib.bib60),[16](https://arxiv.org/html/2606.04100#bib.bib72),[60](https://arxiv.org/html/2606.04100#bib.bib71),[33](https://arxiv.org/html/2606.04100#bib.bib18),[63](https://arxiv.org/html/2606.04100#bib.bib63),[1](https://arxiv.org/html/2606.04100#bib.bib66)\], data assimilation \[[51](https://arxiv.org/html/2606.04100#bib.bib22)\], and POMDPs \[[65](https://arxiv.org/html/2606.04100#bib.bib21)\]. To our knowledge, our work is the first to derive an explicit active learning framework from principles of SVGD.

## 5 Experiments

### 5.1 Neural network potential

System. We evaluate the performance of SKMD for active learning (AL) of a neural network potential. As the reference, we use the Müller–Brown potential, a common 2D benchmark for studying problems in molecular dynamics. The neural network potential is a multi-layer perceptron defined by a single 20-dimensional hidden layer, tanh activation functions, and a confining quartic term to ensure that the system has a well-defined Boltzmann distribution. We use automatic differentiation to compute the negative gradient of the neural network potential.
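For readers unfamiliar with the reference system, the sketch below evaluates the Müller–Brown potential and its force in plain Python. It uses the standard parameter set from the literature, which is an assumption here: the section above does not state whether the experiments use a rescaled or shifted variant. The neural network surrogate and its confining quartic term are not reproduced, and the constant and function names are illustrative.

```python
import math
from typing import Tuple

# Standard Muller-Brown parameters (assumed; the paper may use a rescaled form).
AMP = (-200.0, -100.0, -170.0, 15.0)
A_XX = (-1.0, -1.0, -6.5, 0.7)
B_XY = (0.0, 0.0, 11.0, 0.6)
C_YY = (-10.0, -10.0, -6.5, 0.7)
X0 = (1.0, 0.0, -0.5, -1.0)
Y0 = (0.0, 0.5, 1.5, 1.0)


def muller_brown(x: float, y: float) -> float:
    """Potential energy: a sum of four anisotropic Gaussian-like terms."""
    total = 0.0
    for amp, a, b, c, x0, y0 in zip(AMP, A_XX, B_XY, C_YY, X0, Y0):
        dx, dy = x - x0, y - y0
        total += amp * math.exp(a * dx * dx + b * dx * dy + c * dy * dy)
    return total


def muller_brown_force(x: float, y: float) -> Tuple[float, float]:
    """Force, i.e. the negative analytic gradient of the potential."""
    fx = fy = 0.0
    for amp, a, b, c, x0, y0 in zip(AMP, A_XX, B_XY, C_YY, X0, Y0):
        dx, dy = x - x0, y - y0
        term = amp * math.exp(a * dx * dx + b * dx * dy + c * dy * dy)
        fx -= term * (2.0 * a * dx + b * dy)
        fy -= term * (b * dx + 2.0 * c * dy)
    return fx, fy


# Energies near the three local minima of the standard parameterization.
deep = muller_brown(-0.558, 1.442)
mid = muller_brown(0.623, 0.028)
shallow = muller_brown(-0.050, 0.467)
```

With these parameters the surface has three basins of different depth separated by saddle points, which is what makes it a convenient test of whether a sampler escapes the basin containing the initial training data.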

Baselines. We compare against two active learning baselines. In the first, we simulate overdamped Langevin dynamics at a set temperature for a fixed number of time steps and select data along the simulation path at periodic time intervals. In the second, we implement the UDD sampler \[[31](https://arxiv.org/html/2606.04100#bib.bib61)\] with a committee of 5 models which are trained on $k$-fold splits of the training data, and query data using the UDD uncertainty-based criterion. For all schemes, we use the same initial model and dataset, a training loss of the form in ([1](https://arxiv.org/html/2606.04100#S2.E1)), and the same number of additional training data in each active learning iteration. For fair comparison, we fix the temperature and the realization of Brownian motion in each active learning trial performed with each scheme, such that the only variation in their performance is a result of the choice of sampler and data acquisition criteria.

Active learning protocol. We define an initial training set of 32 points concentrated in a metastable region of state space, to emulate conditions where the starting configurations are highly correlated. The neural network potential trained on the initial set, shown in Figure [1](https://arxiv.org/html/2606.04100#S5.F1) at iteration 0, primarily characterizes the region of state space informed by the data.

We implement two versions of our method: one in which new data are selected along the simulation path at periodic time intervals (“SKMD”) and one in which new data are selected with the acquisition criterion ([8](https://arxiv.org/html/2606.04100#S3.E8)) (“a-SKMD”). Further details on the experimental setup and model parameters are provided in [Section E.1](https://arxiv.org/html/2606.04100#A5.SS1).

Results. At the given temperature, the samples from Langevin dynamics remain confined to the energy basin and the model is not able to resolve the potential in regions which fall outside the coverage of the training data, as shown in Figure [1](https://arxiv.org/html/2606.04100#S5.F1). In UDD, the queried data cluster in close proximity to each other on several occasions. This behavior arises because the uncertainty metric defining the UDD stopping criterion may be large only in concentrated regions, and the path tends to revisit regions where the uncertainty metric is high, as shown in [Section D.1](https://arxiv.org/html/2606.04100#A4.SS1). a-SKMD leads to faster mixing, such that the model quickly resolves the other two energy basins which are unidentified in the initial model. Moreover, the data points queried at each training iteration exhibit good spread in the state space, indicating that the SKMD stopping criterion is effective in promoting dissimilarity among the queried points.

![Refer to caption](https://arxiv.org/html/2606.04100v1/figures/al_plot_grid.png)

Figure 1: Contours of the neural network potential at iterations $\{1,2,4,8\}$ of active learning by overdamped Langevin dynamics (top row), UDD (middle row), and a-SKMD (bottom row). The accumulated training data are shown in red, the queried data at the current iteration in cyan, and the path from the previous stopping time to the current stopping time in dark blue. The reference Müller–Brown potential and the initial model are to the left.

[Figure 2](https://arxiv.org/html/2606.04100#S5.F2) shows the statistics of the model error computed over 50 active learning trials. Model error is evaluated in terms of the root mean square error (RMSE) in the potential energy and forces from a test set of samples distributed according to the Boltzmann distribution of the Müller–Brown potential, $\pi$. Between the two schemes with naïve data acquisition, SKMD leads to lower error compared to the Langevin scheme. Between the two schemes with adaptive acquisition criteria, a-SKMD more rapidly decreases error compared to UDD. a-SKMD shows significant performance gains over the other methods, converging to a lower value of error with smaller variance in the fewest iterations.

![Refer to caption](https://arxiv.org/html/2606.04100v1/figures/al_test_error_h.png)

Figure 2: Results from active learning with the Müller–Brown potential as the reference. Root mean square error (RMSE) in potential energy (left) and forces (right) of the neural network potential across training iterations for the Langevin (blue), UDD (orange), SKMD (green), and a-SKMD (purple) schemes. The solid lines show the median error and the shaded regions show the 25th to 75th percentile range of the error across 50 trials.

### 5.2 MACE fine-tuning

System. We next apply SKMD to perform fine-tuning of a MACE potential on the alanine dipeptide molecule. Alanine dipeptide is commonly used as an example in studies of active learning and enhanced sampling \[[48](https://arxiv.org/html/2606.04100#bib.bib85),[64](https://arxiv.org/html/2606.04100#bib.bib84)\] since its Ramachandran ($\psi,\phi$) conformational energy landscape contains multiple well-separated minima. These minimum energy conformations and their positions on the landscape are shown in panels (a) and (b) of Figure [3](https://arxiv.org/html/2606.04100#S5.F3).

MACE models. We start from two MACE models: a larger oracle model (which we treat as the ground truth), and a smaller surrogate model which serves as a starting point for fine-tuning. For the oracle, we use the MACE-OFF-23-small foundation model \[[30](https://arxiv.org/html/2606.04100#bib.bib86)\], which is fit to the SPICE dataset \[[22](https://arxiv.org/html/2606.04100#bib.bib87)\]. For the surrogate starting point, we fit a MACE model with 32 invariant channels to a randomly selected 0.1% of the MACE-OFF-23-small training dataset.

Baselines. We run three independent samplers: underdamped MD at 300 K, underdamped MD at 700 K, and SKMD. Both unbiased underdamped simulations have a timestep of 1 fs and use Langevin thermostats with damping parameters of 0.1 ps. SKMD uses overdamped (Brownian) dynamics with a temperature of 50 K, a timestep of 0.1 fs, and an isotropic translational viscous damping coefficient of 75.0 $\mathrm{g\,mol^{-1}\,ps^{-1}}$. For SKMD, the global descriptor is defined as the sum of local atomic descriptors. Each atomic descriptor is the concatenated invariant representations from the first and second layers of the MACE model, giving a descriptor dimension of $d = 64$.

Active learning protocol. The starting configuration is obtained by relaxing Structure 1 from Figure [3](https://arxiv.org/html/2606.04100#S5.F3) with the surrogate potential, followed by a 4000-step burn-in simulation. Particle positions for the first active learning iteration are generated by running 6000 further simulation steps, with snapshots captured every 200 steps. For both the unbiased and SKMD runs, the core sampling algorithm is then the same: each particle is propagated for 500 steps in a round-robin fashion, and the trajectory endpoints are taken as sampling candidates. This process is repeated 300 times, yielding 300 candidates. A greedy furthest-point algorithm is used to select 100 diverse samples from this candidate set using distances in the full descriptor space, and the selected samples are then labeled by the oracle. All previous samples are taken into account in the furthest-point sampling step. At each iteration, newly sampled data are added to the existing fine-tuning dataset. Further details on the experimental setup are in [Section E.2](https://arxiv.org/html/2606.04100#A5.SS2).
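The furthest-point selection step can be sketched as follows. This is a generic greedy max-min selection written under stated assumptions rather than the authors' implementation: descriptors are plain lists of floats, distances are Euclidean, and when there are no previous samples the first candidate is taken as the seed. The function name and arguments are illustrative.

```python
import math
from typing import List, Sequence

Descriptor = Sequence[float]


def furthest_point_selection(
    candidates: Sequence[Descriptor],
    previous: Sequence[Descriptor],
    n_select: int,
) -> List[int]:
    """Greedily pick candidate indices that maximize the minimum Euclidean
    distance to everything selected so far, including `previous` samples."""
    # Distance from each candidate to its nearest already-selected point
    # (infinite when there are no previous samples).
    min_dist = [
        min((math.dist(c, p) for p in previous), default=math.inf)
        for c in candidates
    ]
    chosen: List[int] = []
    for _ in range(min(n_select, len(candidates))):
        best = max(range(len(candidates)), key=lambda i: min_dist[i])
        chosen.append(best)
        # Update nearest-selected distances with the newly chosen point.
        for i, c in enumerate(candidates):
            min_dist[i] = min(min_dist[i], math.dist(c, candidates[best]))
        # Exclude the chosen candidate from all later rounds.
        min_dist[best] = -math.inf
    return chosen


# Toy usage with 1D descriptors and one previously selected sample at 0.0.
toy_candidates = [[0.0], [0.1], [1.0], [0.9], [0.5]]
picked = furthest_point_selection(toy_candidates, previous=[[0.0]], n_select=3)
```

Seeding the distance table with the previous samples is what makes each round avoid regions that earlier iterations already covered; in the protocol above the call would correspond to 300 candidates and `n_select=100`.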

![Refer to caption](https://arxiv.org/html/2606.04100v1/figures/macefig1_draft_3.png)

Figure 3: Results from alanine dipeptide enhanced sampling and fine-tuning. (a) A contour map of the $E(\psi,\phi)$ Ramachandran surface of alanine dipeptide using the MACE-OFF-23-small foundation model. (b) Three minimum energy configurations of alanine dipeptide numbered 1 to 3. The Ramachandran angles ($\psi,\phi$) are labeled on 1. (c)–(e) Heat maps of all 1000 samples taken during the 10 iterations of active learning with unbiased MD at (c) 300 K and (d) 700 K, and with (e) SKMD. The SKMD samples exhibit the broadest surface coverage. (f)–(g) Curves showing the change in (f) energy RMSE and (g) force RMSE on a diverse held-back test set over 10 iterations of the active learning process. Results for unbiased MD simulations at 300 K and 700 K are shown as blue and orange curves, respectively. SKMD results are shown in green. The test set RMSE falls more rapidly and remains lower for SKMD than for the unbiased MD simulations.

Results. Panels (c)–(e) of Figure [3](https://arxiv.org/html/2606.04100#S5.F3) show the coverage of samples across the Ramachandran $\psi,\phi$ surface for all 1000 samples taken over 10 active learning iterations from each run. The 700 K unbiased MD run provides broader coverage than the 300 K run, but SKMD covers a substantially larger region of the $\psi,\phi$ surface than either unbiased simulation. Panels (f) and (g) of Figure [3](https://arxiv.org/html/2606.04100#S5.F3) show the energy and force RMSEs of the fine-tuned model after each active learning iteration on a held-back test set consisting of 300 diverse samples generated by running SKMD with the oracle model. Over the course of active learning, the RMSEs decrease in all cases, with the fastest and largest reductions obtained when SKMD is used to generate samples.

## 6 Conclusions

We introduce a novel framework to improve the efficiency of active learning and fine\-tuning of MLIPs\. Our method, SKMD, is defined by an adaptive biasing force that balances repulsive dynamics, which promote the exploration of new configurations, with attractive dynamics, which promote fidelity to the Boltzmann distribution\. We evaluate SKMD in two numerical experiments: in a 2D benchmark, SKMD outperforms overdamped Langevin dynamics and UDD in generating training data that improve model quality in the fewest active learning iterations\. For fine\-tuning a MACE foundation model of organic molecules for alanine dipeptide, SKMD produces models with lower energy and force error compared to those trained on data from standard and tempered versions of underdamped Langevin dynamics\.

SKMD may be used as a general purpose algorithm for sampling configuration space as well as for online data acquisition\. A limitation of the adaptive stopping criterion is that it requires that SKMD be defined by a kernel with a fixed bandwidth\. When SKMD is defined by a variable kernel bandwidth \(e\.g\., the median distance between particles\), the norm of the SKMD biasing force defining the acquisition function does not decrease significantly over simulation time steps\. One can implement active learning in a two\-phase scheme: first, an exploration phase which utilizes SKMD with a variable kernel bandwidth and fixed stopping to promote faster mixing in high\-dimensional configuration space, followed by an exploitation phase in which the interacting particles sample energy basins utilizing SKMD with a fixed kernel bandwidth and adaptive stopping\. The efficacy of this two\-phase scheme for more complex chemical systems will be studied in future work\.

## Acknowledgments and Disclosure of Funding

The work of JZ and YM was supported by the United States Department of Energy, National Nuclear Security Administration under Award Number DE\-NA0003965\. The work of FB was supported by a studentship from the UK Engineering and Physical Sciences Research Council–funded Centre for Doctoral Training in Modelling of Heterogeneous Systems \(Grant No\. EP/S022848/1\)\. JZ and FB gratefully acknowledge the support of a Research Workshop Follow\-on Grant from the International Centre for Mathematical Sciences, Edinburgh\. FB further acknowledges the University of Warwick Scientific Computing Research Technology Platform for computational support, as well as the use of resources provided by the Isambard\-AI National AI Research Resource \(AIRR\)\. Isambard\-AI is operated by the University of Bristol and is funded by the UK Government’s Department for Science, Innovation and Technology \(DSIT\) via UK Research and Innovation; and the Science and Technology Facilities Council \[ST/AIRR/I\-A\-I/1023\]\[[41](https://arxiv.org/html/2606.04100#bib.bib102)\]\.

## References

- \[1\]T\. Alsup, L\. Venturi, and B\. Peherstorfer\(2021\-04\)Multilevel stein variational gradient descent with applications to bayesian inverse problems\.External Links:[Link](http://arxiv.org/abs/2104.01945)Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[2\]L\. Ambrosio, N\. Gigli, and G\. Savaré\(2008\)Gradient flows: in metric spaces and in the space of probability measures\.Lectures in Mathematics ETH Zürich,Birkhäuser Basel\.External Links:ISBN 978\-3\-7643\-8722\-8Cited by:[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p1.8)\.
- \[3\]N\. Artrith and J\. Behler\(2012\-01\)High\-dimensional neural network potentials for metal surfaces: a prototype study for copper\.Phys\. Rev\. B85,pp\. 045439\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevB.85.045439),[Link](https://link.aps.org/doi/10.1103/PhysRevB.85.045439)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[4\]A\. Barducci, G\. Bussi, and M\. Parrinello\(2008\-01\)Well\-tempered metadynamics: a smoothly converging and tunable free\-energy method\.Phys\. Rev\. Lett\.100,pp\. 020603\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.100.020603)Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p3.3)\.
- \[5\]A\. P\. Bartók, M\. C\. Payne, R\. Kondor, and G\. Csányi\(2010\)Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons\.Physical review letters104\(13\),pp\. 136403\.Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[6\]M\. I\. Baskes and R\. A\. Johnson\(1994\)Modified embedded atom potentials for hcp metals\.Modelling and Simulation in Materials Science and Engineering2,pp\. 147–163\.External Links:[Document](https://dx.doi.org/10.1088/0965-0393/2/1/011)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p1.8)\.
- \[7\]I\. Batatia, P\. Benner, Y\. Chiang, A\. M\. Elena, D\. P\. Kovács, J\. Riebesell, X\. R\. Advincula, M\. Asta, M\. Avaylon, W\. J\. Baldwin, F\. Berger, N\. Bernstein, A\. Bhowmik, F\. Bigi, S\. M\. Blau, V\. Cărare, M\. Ceriotti, S\. Chong, J\. P\. Darby, S\. De, F\. Della Pia, V\. L\. Deringer, R\. Elijošius, Z\. El\-Machachi, E\. Fako, F\. Falcioni, A\. C\. Ferrari, J\. L\. A\. Gardner, M\. J\. Gawkowski, A\. Genreith\-Schriever, J\. George, R\. E\. A\. Goodall, J\. Grandel, C\. P\. Grey, P\. Grigorev, S\. Han, W\. Handley, H\. H\. Heenen, K\. Hermansson, C\. H\. Ho, S\. Hofmann, C\. Holm, J\. Jaafar, K\. S\. Jakob, H\. Jung, V\. Kapil, A\. D\. Kaplan, N\. Karimitari, J\. R\. Kermode, P\. Kourtis, N\. Kroupa, J\. Kullgren, M\. C\. Kuner, D\. Kuryla, G\. Liepuoniute, C\. Lin, J\. T\. Margraf, I\. Magdău, A\. Michaelides, J\. H\. Moore, A\. A\. Naik, S\. P\. Niblett, S\. W\. Norwood, N\. O’Neill, C\. Ortner, K\. A\. Persson, K\. Reuter, A\. S\. Rosen, L\. A\. M\. Rosset, L\. L\. Schaaf, C\. Schran, B\. X\. Shi, E\. Sivonxay, T\. K\. Stenczel, C\. Sutton, V\. Svahn, T\. D\. Swinburne, J\. Tilly, C\. van der Oord, S\. Vargas, E\. Varga\-Umbrich, T\. Vegge, M\. Vondrák, Y\. Wang, W\. C\. Witt, T\. Wolf, F\. Zills, and G\. Csányi\(2025\-11\)A foundation model for atomistic materials chemistry\.The Journal of Chemical Physics163\(18\),pp\. 184110\.External Links:ISSN 0021\-9606,[Document](https://dx.doi.org/10.1063/5.0297006)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[8\]I\. Batatia, D\. P\. Kovacs, G\. Simm, C\. Ortner, and G\. Csanyi\(2022\)MACE: higher order equivariant message passing neural networks for fast and accurate force fields\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 11423–11436\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/4a36c3c51af11ed9f34615b81edb5bbc-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[9\]S\. Batzner, A\. Musaelian, L\. Sun, M\. Geiger, J\. P\. Mailoa, M\. Kornbluth, N\. Molinari, T\. E\. Smidt, and B\. Kozinsky\(2022\)E\(3\)\-equivariant graph neural networks for data\-efficient and accurate interatomic potentials\.Nature Communications13\(2453\)\.External Links:[Document](https://dx.doi.org/10.1038/s41467-022-29939-5)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[10\]N\. Bernstein, G\. Csányi, and V\. L\. Deringer\(2019\)De novo exploration and self\-guided learning of potential\-energy surfaces\.npj Computational Materials5\(1\),pp\. 99\.External Links:[Document](https://dx.doi.org/10.1038/s41524-019-0236-6)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[11\]W\. Y\. Chen, L\. Mackey, J\. Gorham, F\. Briol, and C\. Oates\(2018\)Stein points\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 844–853\.Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p2.1),[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p3.3),[§4](https://arxiv.org/html/2606.04100#S4.p2.1)\.
- \[12\]Y\. Chen, M\. Welling, and A\. J\. Smola\(2010\)Super\-samples from kernel herding\.InProceedings of the Twenty\-Sixth Conference on Uncertainty in Artificial Intelligence,UAI’10,pp\. 109–116\.Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[13\]E\. Darve and A\. Pohorille\(2001\)Calculating free energies using average force\.The Journal of Chemical Physics115\(20\),pp\. 9169–9183\.Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p2.3)\.
- \[14\]M\. S\. Daw and M\. I\. Baskes\(1984\)Embedded\-atom method: derivation and application to impurities, surfaces, and other defects in metals\.Physical Review B29\(12\),pp\. 6443–6453\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevB.29.6443)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p1.8)\.
- \[15\]M\. Dereziński, F\. Liang, and M\. W\. Mahoney\(2020\)Bayesian experimental design using regularized determinantal point processes\.AISTATS\.Cited by:[§3\.2](https://arxiv.org/html/2606.04100#S3.SS2.p2.1)\.
- \[16\]G\. Detommaso, T\. Cui, A\. Spantini, Y\. Marzouk, and R\. Scheichl\(2018\-06\)A stein variational newton method\.External Links:[Link](http://arxiv.org/abs/1806.03085)Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[17\]C\. Domingo\-Enrich, R\. Dwivedi, and L\. Mackey\(2023\)Compress then test: powerful kernel testing in near\-linear time\.InProceedings of the 26th International Conference on Artificial Intelligence and Statistics \(AISTATS\),Proceedings of Machine Learning Research, Vol\.206\.Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[18\]R\. Drautz\(2019\)Atomic cluster expansion for accurate and transferable interatomic potentials\.Physical Review B99\(1\),pp\. 014104\.Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[19\]A\. B\. Duncan, N\. Nüsken, and G\. A\. Pavliotis\(2017\-12\)Using perturbed underdamped langevin dynamics to efficiently sample from probability distributions\.Journal of Statistical Physics169,pp\. 1098–1131\.External Links:[Document](https://dx.doi.org/10.1007/s10955-017-1906-8),ISSN 00224715Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p2.4)\.
- \[20\]A\. Duncan, L\. Szpruch, and N\. Nusken\(2023\)On the geometry of stein variational gradient descent\.Journal of Machine Learning Research24\(56\),pp\. 1–39\.External Links:ISSN 1532\-4435Cited by:[§A\.1](https://arxiv.org/html/2606.04100#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2606.04100#A1.SS2.1.p1.6),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p4.2),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p5.6),[§3\.1](https://arxiv.org/html/2606.04100#S3.SS1.p6.1)\.
- \[21\]R\. Dwivedi and L\. Mackey\(2024\)Kernel thinning\.Journal of Machine Learning Research25\(152\),pp\. 1–77\.Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[22\]P\. Eastman, P\. K\. Behara, D\. L\. Dotson, R\. Galvelis, J\. E\. Herr, J\. T\. Horton, Y\. Mao, J\. D\. Chodera, B\. P\. Pritchard, Y\. Wang, G\. De Fabritiis, and T\. E\. Markland\(2023\-01\)SPICE, a dataset of drug\-like molecules and peptides for training machine learning potentials\.Scientific Data10\(1\),pp\. 11\(en\)\.External Links:ISSN 2052\-4463,[Document](https://dx.doi.org/10.1038/s41597-022-01882-6)Cited by:[§5\.2](https://arxiv.org/html/2606.04100#S5.SS2.p2.1)\.
- \[23\]V\. Gallego and D\. R\. Insua\(2018\-11\)Stochastic gradient mcmc with repulsive forces\.External Links:[Link](http://arxiv.org/abs/1812.00071)Cited by:[§3\.1](https://arxiv.org/html/2606.04100#S3.SS1.p6.1),[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[24\]M\. Girolami and B\. Calderhead\(2011\)Riemann manifold langevin and hamiltonian monte carlo methods\.Journal of the Royal Statistical Society: Series B73\(2\),pp\. 123–214\.Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p2.4)\.
- \[25\]J\. Hénin, T\. Lelièvre, M\. R\. Shirts, O\. Valsson, and L\. Delemotte\(2022\-Dec\.\)Enhanced sampling methods for molecular dynamics simulations\.Living Journal of Computational Molecular Science4\(1\),pp\. 1583\.External Links:[Document](https://dx.doi.org/10.33011/livecoms.4.1.1583)Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p4.2),[§4](https://arxiv.org/html/2606.04100#S4.p1.2)\.
- \[26\]J\. E\. Herr, K\. Yao, R\. McIntyre, D\. W\. Toth, and J\. Parkhill\(2018\-03\)Metadynamics for training neural network model chemistries: a competitive assessment\.The Journal of Chemical Physics148\(24\),pp\. 241710\.External Links:ISSN 0021\-9606,[Document](https://dx.doi.org/10.1063/1.5020067),[Link](https://doi.org/10.1063/1.5020067)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p3.1)\.
- \[27\]E\. J\. Hu, yelong shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§E\.2](https://arxiv.org/html/2606.04100#A5.SS2.p1.1)\.
- \[28\]R\. Jinnouchi, F\. Karsai, and G\. Kresse\(2019\-07\)On\-the\-fly machine learning force field generation: application to melting points\.Phys\. Rev\. B100,pp\. 014105\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevB.100.014105),[Link](https://link.aps.org/doi/10.1103/PhysRevB.100.014105)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[29\]M\. Karabin and D\. Perez\(2020\-02\)An entropy\-maximization approach to automated training set generation for interatomic potentials\.External Links:[Link](http://arxiv.org/abs/2002.07876)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[30\]D\. P\. Kovács, J\. H\. Moore, N\. J\. Browning, I\. Batatia, J\. T\. Horton, Y\. Pu, V\. Kapil, W\. C\. Witt, I\. Magdău, D\. J\. Cole, and G\. Csányi\(2025\-05\)MACE\-off: short\-range transferable machine learning force fields for organic molecules\.Journal of the American Chemical Society147\(21\),pp\. 17598–17611\.External Links:ISSN 0002\-7863,[Document](https://dx.doi.org/10.1021/jacs.4c07099)Cited by:[§5\.2](https://arxiv.org/html/2606.04100#S5.SS2.p2.1)\.
- \[31\]M\. Kulichenko, K\. Barros, N\. Lubbers, Y\. W\. Li, R\. Messerly, S\. Tretiak, J\. S\. Smith, and B\. Nebgen\(2023\-03\)Uncertainty\-driven dynamics for active learning of interatomic potentials\.Nature Computational Science\.External Links:[Document](https://dx.doi.org/10.1038/s43588-023-00406-5),ISSN 26628457Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p5.2),[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p5.5),[§E\.1](https://arxiv.org/html/2606.04100#A5.SS1.p3.1),[§1](https://arxiv.org/html/2606.04100#S1.p3.1),[§4](https://arxiv.org/html/2606.04100#S4.p1.2),[§5\.1](https://arxiv.org/html/2606.04100#S5.SS1.p2.1)\.
- \[32\]T\. Lelievre, M\. Rousset, and G\. Stoltz\(2010\)Free energy computations: a mathematical perspective\.World Scientific Publishing Company\.External Links:ISBN 9781908978752Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p3.5)\.
- \[33\]L\. Li, Y\. Li, J\. Liu, Z\. Liu, and J\. Lu\(2020\)A stochastic version of stein variational gradient descent for efficient sampling\.Communications in Applied Mathematics and Computational Science15\(1\),pp\. 37–63\.Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[34\]Q\. Liu and D\. Wang\(2016\)Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm\.NeurIPS29\.Cited by:[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p2.1),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p3.24),[§1](https://arxiv.org/html/2606.04100#S1.p4.1)\.
- \[35\]Q\. Liu\(2017\)Stein variational gradient descent as gradient flow\.NeurIPS30\.Cited by:[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p1.8),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p3.24),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p4.2),[§2\.2](https://arxiv.org/html/2606.04100#S2.SS2.p1.11)\.
- \[36\]J\. Lu, Y\. Lu, and J\. Nolen\(2019\)On the mean\-field limit of stein variational gradient descent\.SIAM Journal on Mathematical Analysis51\(5\),pp\. 3611–3640\.Cited by:[§A\.1](https://arxiv.org/html/2606.04100#A1.SS1.p3.6),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p4.4),[§C\.2](https://arxiv.org/html/2606.04100#A3.SS2.p5.6),[§2\.2](https://arxiv.org/html/2606.04100#S2.SS2.p1.11)\.
- \[37\]Y\. Lysogorskiy, A\. Bochkarev, M\. Mrovec, and R\. Drautz\(2023\-04\)Active learning strategies for atomic cluster expansion models\.Phys\. Rev\. Mater\.7,pp\. 043801\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevMaterials.7.043801)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.04100#S3.SS2.p2.1)\.
- \[38\]Y\. Lysogorskiy, C\. v\. d\. Oord, A\. Bochkarev, S\. Menon, M\. Rinaldi, T\. Hammerschmidt, M\. Mrovec, A\. Thompson, G\. Csányi, C\. Ortner,et al\.\(2021\)Performant implementation of the atomic cluster expansion \(pace\) and application to copper and silicon\.npj Computational Materials7\(1\),pp\. 1–12\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.04100#S3.SS2.p2.1)\.
- \[39\]S\. Mak and V\. R\. Joseph\(2017\)Projected support points: a new method for high\-dimensional data reduction\.Note:arXiv:1708\.06897External Links:1708\.06897Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[40\]S\. Mak and V\. R\. Joseph\(2018\)Support points\.The Annals of Statistics46\(6A\),pp\. 2562–2592\.External Links:[Document](https://dx.doi.org/10.1214/17-AOS1629)Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[41\]S\. McIntosh\-Smith, S\. Alam, and C\. Woods\(2025\)Isambard\-ai: a leadership\-class supercomputer optimised specifically for artificial intelligence\.InProceedings of the Cray User Group,CUG ’24,New York, NY, USA,pp\. 44–54\.External Links:ISBN 9798400713286,[Link](https://doi.org/10.1145/3725789.3725794),[Document](https://dx.doi.org/10.1145/3725789.3725794)Cited by:[Acknowledgments and Disclosure of Funding](https://arxiv.org/html/2606.04100#Sx1.p1.1)\.
- \[42\]B\. B\. Moser, A\. S\. Shanbhag, S\. Frolov, F\. Raue, J\. Folz, and A\. Dengel\(2025\)A coreset selection of coreset selection literature: introduction and recent advances\.arXiv preprint arXiv:2505\.17799\.Cited by:[§C\.3](https://arxiv.org/html/2606.04100#A3.SS3.p1.4)\.
- \[43\]A\. Musaelian, S\. Batzner, A\. Johansson,et al\.\(2023\)Learning local equivariant representations for large\-scale atomistic dynamics\.Nature Communications14\(579\)\.External Links:[Document](https://dx.doi.org/10.1038/s41467-023-36329-y)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[44\]J\. Nam, S\. Liu, G\. Winter, K\. Jun, S\. Yang, and R\. Gómez\-Bombarelli\(2025\)Flow matching for accelerated simulation of atomic transport in crystalline materials\.Nature Machine Intelligence\.External Links:[Document](https://dx.doi.org/10.1038/s42256-025-01125-4)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p5.1)\.
- \[45\]F\. Noé, S\. Olsson, J\. Köhler, and H\. Wu\(2019\)Boltzmann generators: Sampling equilibrium states of many\-body systems with deep learning\.Science365\(6457\),pp\. eaaw1147\.External Links:[Document](https://dx.doi.org/10.1126/science.aaw1147)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p5.1)\.
- \[46\]M\. Plainer, H\. Wu, L\. Klein, S\. Günnemann, and F\. Noé\(2025\)Consistent sampling and simulation: molecular dynamics with energy\-based diffusion models\.InAdvances in Neural Information Processing Systems,Note:arXiv:2506\.17139Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p5.1)\.
- \[47\]E\. V\. Podryabinkin and A\. V\. Shapeev\(2017\-12\)Active learning of linearly parametrized interatomic potentials\.Computational Materials Science140,pp\. 171–180\.External Links:[Document](https://dx.doi.org/10.1016/j.commatsci.2017.08.031),ISSN 09270256Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[48\]D\. Schwalbe\-Koda, A\. R\. Tan, and R\. Gómez\-Bombarelli\(2021\-08\)Differentiable sampling of molecular geometries with uncertainty\-based adversarial attacks\.Nature Communications12\(1\),pp\. 5104\(en\)\.External Links:ISSN 2041\-1723,[Document](https://dx.doi.org/10.1038/s41467-021-25342-8)Cited by:[§5\.2](https://arxiv.org/html/2606.04100#S5.SS2.p1.1)\.
- \[49\]C\. Stein\(1972\)A bound for the error in the normal approximation to the distribution of a sum of dependent random variables\.Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability2,pp\. 583–602\.Cited by:[§2\.2](https://arxiv.org/html/2606.04100#S2.SS2.p1.12)\.
- \[50\]F\. H\. Stillinger and T\. A\. Weber\(1985\)Computer simulation of local order in condensed phases of silicon\.Physical Review B31\(8\),pp\. 5262–5271\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevB.31.5262)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p1.8)\.
- \[51\]A\. S\. Stordal, R\. J\. Moraes, P\. N\. Raanes, and G\. Evensen\(2021\)P\-kernel stein variational gradient descent for data assimilation and history matching\.Mathematical Geosciences53,pp\. 375–393\.External Links:[Document](https://dx.doi.org/10.1007/s11004-021-09937-x)Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[52\]A\. P\. A\. Subramanyam and D\. Perez\(2025\)Information\-entropy\-driven generation of material\-agnostic datasets for machine\-learning interatomic potentials\.npj Computational Materials11\(1\),pp\. 218\.External Links:[Document](https://dx.doi.org/10.1038/s41524-025-01602-9)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[53\]J\. Tersoff\(1988\)Empirical interatomic potential for silicon with improved elastic properties\.Physical Review B38\(14\),pp\. 9902–9905\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevB.38.9902)Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p1.8)\.
- \[54\]A\. P\. Thompson, L\. P\. Swiler, C\. R\. Trott, S\. M\. Foiles, and G\. J\. Tucker\(2015\)Spectral neighbor analysis method for automated generation of quantum\-accurate interatomic potentials\.Journal of Computational Physics285,pp\. 316–330\.Cited by:[§2\.1](https://arxiv.org/html/2606.04100#S2.SS1.p2.4)\.
- \[55\]A\. P\. Thompson,et al\.\(2022\)LAMMPS \- a flexible simulation tool for particle\-based materials modeling at the atomic, meso, and continuum scales\.Computer Physics Communications\.External Links:[Document](https://dx.doi.org/10.1016/j.cpc.2021.108171)Cited by:[§3\.1](https://arxiv.org/html/2606.04100#S3.SS1.p5.1)\.
- \[56\]O\. Valsson, P\. Tiwary, and M\. Parrinello\(2016\)Enhancing important fluctuations: rare events and metadynamics from a conceptual viewpoint\.Annual Review of Physical Chemistry67,pp\. 159–184\.External Links:[Document](https://dx.doi.org/10.1146/annurev-physchem-040215-112229)Cited by:[§C\.1](https://arxiv.org/html/2606.04100#A3.SS1.p3.3)\.
- \[57\]C\. van der Oord, M\. Sachs, D\. P\. Kovács, C\. Ortner, and G\. Csányi\(2022\-10\)Hyperactive learning \(hal\) for data\-driven interatomic potentials\.External Links:[Link](http://arxiv.org/abs/2210.04225)Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p3.1),[§4](https://arxiv.org/html/2606.04100#S4.p1.1)\.
- \[58\]J\. Vandermause, S\. B\. Torrisi, S\. Batzner, Y\. Xie, L\. Sun, A\. M\. Kolpak, and B\. Kozinsky\(2020\)On\-the\-fly active learning of interpretable bayesian force fields for atomistic rare events\.npj Computational Materials6\(1\),pp\. 1–11\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[59\]J\. Vandermause, Y\. Xie, J\. S\. Lim, C\. J\. Owen, and B\. Kozinsky\(2021\)Active learning of reactive bayesian force fields: application to heterogeneous hydrogen\-platinum catalysis dynamics\.arXiv preprint arXiv:2106\.01949\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[60\]Y\. Wang and W\. Li\(2019\-09\)Accelerated information gradient flow\.External Links:[Link](http://arxiv.org/abs/1909.02102)Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[61\]Y\. Xie, J\. Vandermause, S\. Ramakers, N\. H\. Protik, A\. Johansson, and B\. Kozinsky\(2022\)Uncertainty\-aware molecular dynamics from bayesian active learning: phase transformations and thermal transport in sic\.arXiv preprint arXiv:2203\.03824\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[62\]Y\. Xie, J\. Vandermause, L\. Sun, A\. Cepellotti, and B\. Kozinsky\(2021\)Bayesian force fields from active learning for simulation of inter\-dimensional transformation of stanene\.npj Computational Materials7\(1\),pp\. 1–10\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1)\.
- \[63\]M\. Ye, T\. Ren, and Q\. Liu\(2020\-02\)Stein Self\-Repulsive Dynamics: Benefits From Past Samples\.External Links:[Link](http://arxiv.org/abs/2002.09070)Cited by:[§3\.1](https://arxiv.org/html/2606.04100#S3.SS1.p6.1),[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[64\]V\. Zaverkin, D\. Holzmüller, H\. Christiansen, F\. Errica, F\. Alesiani, M\. Takamoto, M\. Niepert, and J\. Kästner\(2024\-04\)Uncertainty\-biased molecular dynamics for learning uniformly accurate interatomic potentials\.npj Computational Materials10\(1\),pp\. 83\(en\)\.External Links:ISSN 2057\-3960,[Document](https://dx.doi.org/10.1038/s41524-024-01254-1)Cited by:[§5\.2](https://arxiv.org/html/2606.04100#S5.SS2.p1.1)\.
- \[65\]Y\. Zhang, B\. Luo, A\. Mukhopadhyay, G\. Karsai, and A\. Dubey\(2025\)ESCORT: efficient stein\-variational and sliced consistency\-optimized temporal belief representation for POMDPs\.InAdvances in Neural Information Processing Systems,Note:arXiv:2510\.21107Cited by:[§4](https://arxiv.org/html/2606.04100#S4.p3.1)\.
- \[66\]J\. Zou and Y\. Marzouk\(2025\)Data curation for machine learning interatomic potentials by determinantal point processes\.ICLR AI4MAT Workshop\.Cited by:[§1](https://arxiv.org/html/2606.04100#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.04100#S3.SS2.p2.1)\.

## Appendix A Assumptions, Analysis, and Proofs

### A.1 Asymptotic analysis of SKMD

We show that the asymptotic behavior of ([5](https://arxiv.org/html/2606.04100#S3.E5)) maintains fidelity to the Boltzmann distribution $\pi_{\theta}$, so that SKMD can be used for Boltzmann sampling. First, we state the necessary assumptions, following the analysis in [[20](https://arxiv.org/html/2606.04100#bib.bib100)].

###### Condition 1\.

The kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is continuous and symmetric positive semi-definite, i.e., it satisfies $\sum_{i,j=1}^{J}\alpha_{i}\alpha_{j}k(\mathbf{x}^{i},\mathbf{x}^{j})\geq 0$ for all $J\in\mathbb{N}$, $\alpha_{1},\ldots,\alpha_{J}\in\mathbb{R}$, and $\mathbf{x}^{1},\ldots,\mathbf{x}^{J}\in\mathcal{X}$.

###### Condition 2\.

The probability measure associated with the Boltzmann distribution $\pi_{\theta}$ belongs to $\mathcal{P}_{k}(\mathcal{X})$, a subset of $\mathcal{P}(\mathcal{X})$ defined as

$$\mathcal{P}_{k}(\mathcal{X})\coloneqq\Big\{\rho\in\mathcal{P}(\mathcal{X}):\ \rho\text{ admits a smooth Lebesgue density},\ \operatorname{supp}\rho=\mathcal{X},\ \int_{\mathcal{X}}k(\mathbf{x},\mathbf{x})\,\mathrm{d}\rho(\mathbf{x})<\infty\Big\}. \tag{11}$$

In the limit of $\epsilon\to 0$ and $\ell\to 0$, the SKMD update equation in ([5](https://arxiv.org/html/2606.04100#S3.E5)) converges to the stochastic differential equation

$$\mathrm{d}\mathbf{x}^{i}_{t}=\frac{1}{J}\sum_{j=1}^{J}\Big[-k(\mathbf{x}^{j}_{t},\mathbf{x}^{i}_{t})\,\beta\,\nabla_{\mathbf{x}^{j}_{t}}V_{\theta}(\mathbf{x}^{j}_{t})+\nabla_{\mathbf{x}^{j}_{t}}k(\mathbf{x}^{j}_{t},\mathbf{x}^{i}_{t})\Big]\mathrm{d}t+\sqrt{\tfrac{2\eta}{J}}\,\mathrm{d}W_{t}^{i},\quad\mathbf{x}^{i}_{t},\mathbf{x}^{j}_{t}\in\bar{X}_{t}, \tag{12}$$

for $A(\mathbf{x})=k(\mathbf{x},\mathbf{x})$, where $W^{i}_{t}$, $i=1,\ldots,J$, are independent copies of $n$-dimensional standard Brownian motion. In [Proposition 1](https://arxiv.org/html/2606.04100#Thmproposition1), we establish that in the limit of $J\to\infty$, ([12](https://arxiv.org/html/2606.04100#A1.E12)) has a mean-field equation that is identical in form to that of SVGD. The proof is provided in [Section A.2](https://arxiv.org/html/2606.04100#A1.SS2).

###### Proposition 1\(Mean field limit of SKMD\)\.

Assume [Conditions 1](https://arxiv.org/html/2606.04100#Thmcondition1) and [2](https://arxiv.org/html/2606.04100#Thmcondition2) hold and that $A(\mathbf{x})=k(\mathbf{x},\mathbf{x})$ for all $\mathbf{x}\in\mathcal{X}$. As $J\to\infty$, the empirical measure $\hat{q}^{J}_{t}(\mathbf{x})\coloneqq\frac{1}{J}\sum_{j=1}^{J}\delta(\mathbf{x}-\mathbf{x}_{t}^{j})$ of particles evolving according to ([12](https://arxiv.org/html/2606.04100#A1.E12)) converges weakly to $q_{t}$, the solution to

$$\frac{\partial}{\partial t}q_{t}(\mathbf{x})=\nabla\cdot\Bigg(q_{t}(\mathbf{x})\int_{\mathcal{X}}k(\mathbf{x},\mathbf{x}')\big(\nabla_{\mathbf{x}'}q_{t}(\mathbf{x}')+q_{t}(\mathbf{x}')\,\beta\,\nabla_{\mathbf{x}'}V_{\theta}(\mathbf{x}')\big)\,\mathrm{d}\mathbf{x}'\Bigg). \tag{13}$$

Therefore, the mean-field limit of ([12](https://arxiv.org/html/2606.04100#A1.E12)) has a stationary solution at $\pi_{\theta}\propto\exp(-\beta V_{\theta})$. Note that ([13](https://arxiv.org/html/2606.04100#A1.E13)) does not depend on the scaling $\eta$, meaning that the diffusion coefficient can be of arbitrary magnitude as long as it vanishes as $J\to\infty$. By way of the mean-field limit, SKMD inherits the convergence properties of SVGD; in particular, one can establish conditions which guarantee that $q_{t}$ converges weakly to $\pi_{\theta}$ as $t\to\infty$ [[36](https://arxiv.org/html/2606.04100#bib.bib101), Thm. 2.8].
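To make the limiting dynamics in (12) concrete, the following is a minimal sketch of one Euler-Maruyama step for all $J$ particles. It rests on assumptions that are not taken from the paper: a Gaussian kernel acting directly on coordinates (the paper uses a kernel of global atomic descriptors), a user-supplied `grad_V`, and the hypothetical helper names `rbf_kernel` and `skmd_step`.

```python
import numpy as np

def rbf_kernel(X, h):
    """Gaussian kernel matrix K[j, i] = k(x^j, x^i) and its gradient in x^j."""
    diff = X[:, None, :] - X[None, :, :]           # diff[j, i] = x^j - x^i
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h**2))
    grad_K = -diff / h**2 * K[:, :, None]          # grad_K[j, i] = d k(x^j, x^i) / d x^j
    return K, grad_K

def skmd_step(X, grad_V, beta, eta, eps, h, rng):
    """One Euler-Maruyama step of size eps for the SDE in (12) (illustrative only)."""
    J, n = X.shape
    K, grad_K = rbf_kernel(X, h)
    # drift_i = (1/J) sum_j [ -k(x^j, x^i) beta grad V(x^j) + grad_{x^j} k(x^j, x^i) ]
    drift = (-(K.T @ (beta * grad_V(X))) + grad_K.sum(axis=0)) / J
    # Noise with variance (2 eta / J) * eps per coordinate
    noise = np.sqrt(2.0 * eta / J * eps) * rng.standard_normal((J, n))
    return X + eps * drift + noise

# Usage with an assumed quadratic potential V(x) = |x|^2 / 2, so grad V(x) = x
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
X_next = skmd_step(X, lambda Y: Y, beta=1.0, eta=1.0, eps=1e-2, h=1.0, rng=rng)
```

The first drift term pulls each particle toward low-energy regions seen by its neighbours, while the kernel-gradient term pushes nearby particles apart; the noise amplitude shrinks as $J$ grows, in line with the $\sqrt{2\eta/J}$ scaling.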

### A.2 Proof of [Proposition 1](https://arxiv.org/html/2606.04100#Thmproposition1)

###### Proof\.

The proof closely follows that of Proposition 2 in [[20](https://arxiv.org/html/2606.04100#bib.bib100)], with a modification to the diffusion coefficient of the SDE. In the following, let

$$b(\mathbf{x},q)\coloneqq\int_{\mathcal{X}}\Big[-k(\mathbf{x},\mathbf{x}')\,\beta\,\nabla_{\mathbf{x}'}V_{\theta}(\mathbf{x}')+\nabla_{\mathbf{x}'}k(\mathbf{x},\mathbf{x}')\Big]\mathrm{d}q(\mathbf{x}')$$

and

$$\sigma(\mathbf{x})\coloneqq\sqrt{\frac{2\eta}{J}},\quad\Sigma(\mathbf{x})\coloneqq\sigma(\mathbf{x})\sigma(\mathbf{x})^{\top}=\frac{2\eta}{J}.$$

We use the duality principle with test functions to prove convergence of $\hat{q}^{J}_{t}$ to $q_{t}$: it suffices to show that for a smooth test function with compact support $\phi\in C^{\infty}_{c}([0,\infty)\times\mathcal{X})$, $\int_{\mathcal{X}}\phi(t,\mathbf{x})\,\mathrm{d}\hat{q}^{J}_{t}(\mathbf{x})\to\int_{\mathcal{X}}\phi(t,\mathbf{x})\,\mathrm{d}q_{t}(\mathbf{x})$ as $J\to\infty$.

Let $\bar{\mathbf{x}}_{t}=\{\mathbf{x}^{j}_{t}\}_{j=1}^{J}$ denote the set of all particles whose empirical distribution corresponds to $\hat{q}^{J}_{t}$, and let $\Phi(t,\bar{\mathbf{x}}_{t})\coloneqq\frac{1}{J}\sum_{j=1}^{J}\phi(t,\mathbf{x}_{t}^{j})=\langle\phi(t,\cdot),\hat{q}^{J}_{t}\rangle$ denote the aggregate test function, which averages the test function evaluated at all the particles. Here, the brackets denote the duality pairing between test functions and measures. By Itô's formula, we have that

$$\begin{aligned}\mathrm{d}\Phi(t,\bar{\mathbf{x}}_{t})&=\frac{1}{J}\sum_{j=1}^{J}\left[\partial_{t}\phi(t,\mathbf{x}^{j}_{t})+\nabla\phi(t,\mathbf{x}^{j}_{t})\cdot b(\mathbf{x}^{j}_{t},\hat{q}^{J}_{t})\right]\mathrm{d}t\\&\quad+\frac{\eta}{J}\operatorname{Tr}\left(\operatorname{Hess}\Phi(t,\bar{\mathbf{x}}_{t})\right)\mathrm{d}t+\mathrm{d}\mathcal{M}_{t},\end{aligned}\tag{14}$$

where $\mathrm{d}\mathcal{M}_{t}\coloneqq\frac{\sqrt{2\eta}}{J\sqrt{J}}\sum_{j=1}^{J}\nabla\phi(\mathbf{x}^{j}_{t})\cdot\mathrm{d}W^{j}_{t}$ is a local martingale. The Hessian $\operatorname{Hess}\Phi\in\mathbb{R}^{Jn\times Jn}$ is block diagonal with blocks

$$[\operatorname{Hess}\Phi(\bar{\mathbf{x}})]_{ij}=\begin{cases}\dfrac{1}{J}\operatorname{Hess}\phi(\mathbf{x}^{i})&i=j,\\[6pt]0&i\neq j.\end{cases}$$

Therefore, the Itô correction evaluates to

$$\frac{\eta}{J}\operatorname{Tr}\left(\operatorname{Hess}\Phi(t,\bar{\mathbf{x}}_{t})\right)=\frac{\eta}{J^{2}}\sum_{j=1}^{J}\Delta\phi(\mathbf{x}^{j}_{t})=\frac{\eta}{J}\int_{\mathcal{X}}\Delta\phi(\mathbf{x})\,\mathrm{d}\hat{q}^{J}_{t}(\mathbf{x}).$$

Integrating ([A.2](https://arxiv.org/html/2606.04100#A1.Ex3)) from $0$ to $t$, we obtain

$$\begin{aligned}\langle\phi(t,\cdot),\hat{q}^{J}_{t}\rangle-\langle\phi(0,\cdot),\hat{q}^{J}_{0}\rangle&=\int_{0}^{t}\langle\partial_{s}\phi(s,\cdot),\hat{q}^{J}_{s}\rangle\,\mathrm{d}s+\int_{0}^{t}\langle\nabla\phi(s,\cdot)\cdot b(\cdot,\hat{q}^{J}_{s}),\hat{q}^{J}_{s}\rangle\,\mathrm{d}s\\&\quad+\frac{\eta}{J}\int_{0}^{t}\langle\Delta\phi(\cdot),\hat{q}^{J}_{s}\rangle\,\mathrm{d}s+\mathcal{M}_{t}.\end{aligned}\tag{15}$$

The quadratic variation of the local martingale is

$$\begin{aligned}{}[\mathcal{M}_{\cdot},\mathcal{M}_{\cdot}]_{t}&=\frac{2\eta}{J^{3}}\sum_{i,j=1}^{J}\int_{0}^{t}\nabla\phi(\mathbf{x}^{i}_{s})\cdot\nabla\phi(\mathbf{x}^{j}_{s})\,\mathrm{d}s\\&=\frac{2\eta}{J}\int_{0}^{t}\int_{\mathcal{X}}\int_{\mathcal{X}}\nabla\phi(\mathbf{x})\cdot\nabla\phi(\mathbf{x}')\,\mathrm{d}\hat{q}^{J}_{s}(\mathbf{x})\,\mathrm{d}\hat{q}^{J}_{s}(\mathbf{x}')\,\mathrm{d}s,\end{aligned}$$

which is $O(J^{-1})$. Therefore, $\mathcal{M}_{t}\to 0$ in probability as $J\to\infty$.

Assume that the family $\{\hat{q}^{J}_{\cdot}:J\in\mathbb{N}\}$ has a limit point $q_{\cdot}\in\mathcal{P}(C[0,T])$. Formally, as $J\to\infty$, the Itô correction and quadratic variation of the local martingale both vanish, yielding

$$\langle\phi(t,\cdot),q_{t}\rangle-\langle\phi(0,\cdot),q_{0}\rangle=\int_{0}^{t}\langle\partial_{s}\phi(s,\cdot),q_{s}\rangle\,\mathrm{d}s+\int_{0}^{t}\langle\nabla\phi(s,\cdot)\cdot b(\cdot,q_{s}),q_{s}\rangle\,\mathrm{d}s, \tag{16}$$

which is the weak formulation of the nonlinear transport equation

$$\partial_{t}q_{t}(\mathbf{x})=-\nabla\cdot\big(b(\mathbf{x},q_{t})\,q_{t}(\mathbf{x})\big). \tag{17}$$

Integrating the kernel-gradient term in $b$ by parts gives $\int_{\mathcal{X}}\nabla_{\mathbf{x}'}k(\mathbf{x},\mathbf{x}')\,q_{t}(\mathbf{x}')\,\mathrm{d}\mathbf{x}'=-\int_{\mathcal{X}}k(\mathbf{x},\mathbf{x}')\,\nabla_{\mathbf{x}'}q_{t}(\mathbf{x}')\,\mathrm{d}\mathbf{x}'$, so that $b(\mathbf{x},q_{t})=-\int_{\mathcal{X}}k(\mathbf{x},\mathbf{x}')\big(\nabla_{\mathbf{x}'}q_{t}(\mathbf{x}')+q_{t}(\mathbf{x}')\,\beta\,\nabla_{\mathbf{x}'}V_{\theta}(\mathbf{x}')\big)\,\mathrm{d}\mathbf{x}'$. Substituting this expression for $b$ into the transport equation, we recover ([13](https://arxiv.org/html/2606.04100#A1.E13)). ∎

One can easily verify that the stationary condition $\partial_{t}q_{t}=0$ is satisfied by $q_{t}=\pi_{\theta}\propto\exp(-\beta V_{\theta})$, since

$$\begin{aligned}\beta\nabla V_{\theta}(\mathbf{x}')\pi_{\theta}(\mathbf{x}')+\nabla\pi_{\theta}(\mathbf{x}')&=\pi_{\theta}(\mathbf{x}')\left[\beta\nabla V_{\theta}(\mathbf{x}')+\nabla\log\pi_{\theta}(\mathbf{x}')\right]\\&=\pi_{\theta}(\mathbf{x}')\left[\beta\nabla V_{\theta}(\mathbf{x}')-\beta\nabla V_{\theta}(\mathbf{x}')\right]=0.\end{aligned}$$
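The stationarity argument can also be checked numerically. The sketch below, which assumes a one-dimensional double-well potential chosen only for illustration, evaluates the term $\beta\nabla V\,q+\nabla q$ on a grid with finite differences, once for the Boltzmann density and once for a standard Gaussian that is not stationary.

```python
import numpy as np

beta = 1.0
x = np.linspace(-3.0, 3.0, 2001)
dx = x[1] - x[0]
V = (x**2 - 1.0) ** 2                      # assumed double-well potential
grad_V = 4.0 * x * (x**2 - 1.0)

def flux_integrand(q):
    """beta * grad V * q + grad q, the term that must vanish at stationarity."""
    return beta * grad_V * q + np.gradient(q, x)

pi = np.exp(-beta * V)
pi /= pi.sum() * dx                        # normalize on the grid
gauss = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

res_pi = np.max(np.abs(flux_integrand(pi)))        # close to zero
res_gauss = np.max(np.abs(flux_integrand(gauss)))  # clearly nonzero
print(res_pi, res_gauss)
```

The residual for the Boltzmann density is limited only by the finite-difference error, whereas the Gaussian leaves a residual of order one, so the right-hand side of the mean-field equation would not vanish for it.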

## Appendix B Algorithm Details and Variants

### B.1 SKMD for Boltzmann sampling

SKMD can be used as a general-purpose algorithm for sampling the invariant distribution of an energy-based system. [Algorithm 2](https://arxiv.org/html/2606.04100#alg2) corresponds to an asynchronous implementation of stochastic SVGD, where each particle is evolved one at a time for a fixed number of steps $\ell$, such that the interacting particle dynamics may be approximated by a single chain. As shown in [Appendix A](https://arxiv.org/html/2606.04100#A1), the distribution of the dynamics defined by a potential $V$ in the continuous-time limit ($\epsilon\to 0$), synchronized limit ($\ell\to 0$), and mean-field limit ($J\to\infty$) approaches the invariant distribution $\pi\propto\exp(-V)$, assuming $\beta=1$ without loss of generality.

Algorithm 2: SKMD for Boltzmann sampling

Input: initial ensemble $\bar{X}_{0}=\{\mathbf{x}^{1}_{0},\ldots,\mathbf{x}^{J}_{0}\}$, trajectory $X=\emptyset$, potential $V$, kernel $k$, stopping time $\ell$

1. Initialize ensemble index $s\leftarrow 0$, active particle index $i\leftarrow 1$
2. for $t=0,1,2,\ldots$ do
    1. Simulate one step of ([5](https://arxiv.org/html/2606.04100#S3.E5)): $\mathbf{x}^{i}_{t+1}\leftarrow\text{SKMD}(\mathbf{x}^{i}_{t},\bar{X}_{s},V,k)$
    2. if $t+1=s+\ell$ then
        1. Update ensemble: $\bar{X}_{s+\ell}\leftarrow\bigl(\bar{X}_{s}\setminus\{\mathbf{x}^{i}_{s}\}\bigr)\cup\{\mathbf{x}^{i}_{s+\ell}\}$
        2. Append to trajectory: $X\leftarrow X\cup\{\mathbf{x}^{i}_{s},\ldots,\mathbf{x}^{i}_{s+\ell}\}$
        3. Advance ensemble index: $s\leftarrow s+\ell$
        4. Switch active particle: $i\leftarrow(i\bmod J)+1$
3. return trajectory $X$ with approximate distribution $\pi\propto\exp(-V)$
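The asynchronous update pattern of Algorithm 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the per-step update is an Euler-Maruyama discretization with a Gaussian kernel on raw coordinates, $\beta=1$, the active particle's slot in the ensemble tracks its current position, and the names `drift` and `skmd_sample` are hypothetical.

```python
import numpy as np

def drift(x, ensemble, grad_V, h):
    """Kernelized drift on x from the ensemble (Gaussian kernel, bandwidth h, beta = 1)."""
    diff = ensemble - x                                    # rows: x^j - x
    k = np.exp(-np.sum(diff**2, axis=1) / (2.0 * h**2))    # k(x^j, x)
    attract = -(k[:, None] * grad_V(ensemble)).sum(axis=0)
    repel = (-diff / h**2 * k[:, None]).sum(axis=0)        # sum_j grad_{x^j} k(x^j, x)
    return (attract + repel) / len(ensemble)

def skmd_sample(X0, grad_V, n_steps, ell, eps, eta, h, seed=0):
    """Evolve one particle at a time for ell steps; return the single-chain trajectory."""
    rng = np.random.default_rng(seed)
    ensemble = np.array(X0, dtype=float)                   # copy of the initial ensemble
    J, n = ensemble.shape
    i, trajectory = 0, []
    for t in range(n_steps):
        x = ensemble[i]
        noise = np.sqrt(2.0 * eta * eps / J) * rng.standard_normal(n)
        # Assumption: the other J - 1 particles stay frozen while particle i moves.
        ensemble[i] = x + eps * drift(x, ensemble, grad_V, h) + noise
        trajectory.append(ensemble[i].copy())
        if (t + 1) % ell == 0:
            i = (i + 1) % J                                # switch active particle
    return np.array(trajectory), ensemble

# Usage with an assumed quadratic potential, grad V(x) = x
X0 = np.random.default_rng(1).standard_normal((4, 2))
traj, final_ensemble = skmd_sample(X0, lambda Y: Y, n_steps=200, ell=10,
                                   eps=1e-2, eta=1.0, h=1.0)
```

Because only one particle moves at a time, each step costs one pass over the ensemble rather than a full $J\times J$ kernel evaluation, which is what allows the dynamics to be run as a single chain.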

We numerically evaluate the efficacy of SKMD as a sampling algorithm using the Müller-Brown potential as an illustrative example. We compare the distribution of overdamped Langevin dynamics and SKMD in terms of their Wasserstein-2 distance from the Boltzmann distribution of the Müller-Brown potential after $1\times 10^{5}$ simulation time steps. To study the effect of the asynchronous particle updates introduced by the stopping time on the quality of the generated samples, we implement SKMD with four different stopping times, $\ell\in\{1,10^{1},10^{2},10^{3}\}$. The metric is computed over 100 random trials, where in each trial the Brownian motion realization is identical across sampling methods. [Figure 4](https://arxiv.org/html/2606.04100#A2.F4) shows that samples drawn with SKMD have lower mean error with respect to the invariant distribution and lower variance in error compared to those drawn with overdamped Langevin dynamics. With SKMD, increasing the stopping time increases the discrepancy with respect to the invariant distribution, but even with $\ell=10^{3}$ SKMD remains more accurate than Langevin dynamics. The higher error associated with Langevin dynamics is a consequence of slow mixing and metastability, as the invariant distribution of the Müller-Brown potential is non-log-concave with well-separated modes.

![Refer to caption](https://arxiv.org/html/2606.04100v1/figures/skmd_w2_boxplot.png)

Figure 4: Comparison of the quality of samples from overdamped Langevin dynamics and SKMD with varying stopping time $\ell$. Quality is measured in terms of a sample-based estimator of the Wasserstein-2 distance with respect to the Boltzmann distribution.

### B.2 SKMD for active learning with offline data acquisition

SKMD may be paired with subset selection methods for offline data acquisition. In this setting, SKMD is used purely as a sampling algorithm for generating a pool of candidate configurations. We then rely on any subset selection method, such as the ones reviewed in [Section 1](https://arxiv.org/html/2606.04100#S1), to draw sets of informative data to label with reference calculations and add to the training set. This approach is summarized in [Algorithm 3](https://arxiv.org/html/2606.04100#alg3).

Algorithm 3: SKMD for active learning with offline data acquisition

Input: initial ensemble $\bar{X}_{0}=\{\mathbf{x}^{1}_{0},\ldots,\mathbf{x}^{J}_{0}\}$, training set $\mathcal{D}_{0}$, model $V_{\theta_{0}}$, kernel $k$, stopping time $\ell$

1. for training iteration $\tau=0,1,\ldots$ do
    1. Generate SKMD trajectory $X$ from [Algorithm 2](https://arxiv.org/html/2606.04100#alg2) using $\bar{X}_{0}$, $V_{\theta_{\tau}}$, $k$, $\ell$
    2. Select training data $\mathcal{D}_{\tau+1}\subset X$ using any subset selection method
    3. Train new model $V_{\theta_{\tau+1}}$ on the augmented training set $\bigcup_{k=0}^{\tau+1}\mathcal{D}_{k}$
2. return final model $V_{\theta_{\text{final}}}$ and training set $\mathcal{D}_{\text{final}}$

A common heuristic with SVGD is to use a kernel with a variable bandwidth, such as one proportional to the median distance between particles, as a strategy to accelerate mixing in high-dimensional state spaces. When implementing SKMD with a variable kernel bandwidth, offline data acquisition tends to perform better than online data acquisition. The acquisition criterion in ([8](https://arxiv.org/html/2606.04100#S3.E8)) remains relatively constant under the median bandwidth heuristic, since the kernel bandwidth adapts such that the kernel terms in the SKMD biasing force are nonzero and the particles remain interactive at further distances. For this reason, in [Section 5.2](https://arxiv.org/html/2606.04100#S5.SS2), we utilize the median bandwidth heuristic and furthest point sampling to select training configurations for active learning of the MACE interatomic potential.
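The two ingredients named above, a median-distance bandwidth and furthest point sampling, can be sketched as follows. The sketch assumes Euclidean distances between generic feature vectors (the paper's kernel acts on global atomic descriptors), sets the bandwidth equal to the median distance although any proportionality constant could be used, and uses hypothetical function names.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def median_bandwidth(X):
    """Median of the pairwise distances between distinct particles."""
    D = pairwise_distances(X)
    return float(np.median(D[np.triu_indices(len(X), k=1)]))

def furthest_point_sampling(X, n_select, start=0):
    """Greedily pick the point furthest from everything already selected."""
    selected = [start]
    min_dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Distance of every point to its nearest selected point
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Usage on a toy pool of candidate configurations
pool = np.array([[0.0], [1.0], [2.0], [10.0]])
h = median_bandwidth(pool)
chosen = furthest_point_sampling(pool, n_select=3)
```

Furthest point sampling needs only the candidate pool and a distance, so it can be applied after the trajectory has been generated, which matches the offline setting described in this section.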

## Appendix C Additional Background and Related Work

### C.1 Enhanced sampling methods

Biasing force methods are a class of enhanced samplers which modify the force field of the dynamics by the addition of a biasing force; see [[25](https://arxiv.org/html/2606.04100#bib.bib44)] for a review. In these methods, the force field in overdamped or underdamped Langevin dynamics is replaced by

$$\tilde{F}_{\theta,t}(\mathbf{x}_{t})=-\nabla_{\mathbf{x}}V_{\theta}(\mathbf{x}_{t})+F_{t}^{\text{bias}}(\mathbf{x}_{t}). \tag{18}$$
In the standard formulation of adaptive biasing forces (ABF) [[13](https://arxiv.org/html/2606.04100#bib.bib25)], the objective is to reduce energy barriers characterized in a reduced coordinate space of collective variables (CVs) $\xi(\mathbf{x})\in\mathbb{R}^{r}$, where $r\ll 3N$. At a given point $\mathbf{x}_{t}\in\mathbb{R}^{3N}$, ABF adaptively learns and cancels the gradient of the free energy in CV space by setting the biasing force to be

$$F^{\text{ABF}}_{t}(\mathbf{x}_{t})=-\nabla_{\mathbf{x}}\xi(\mathbf{x}_{t})^{\top}\nabla_{\xi}\mathcal{F}(\xi(\mathbf{x}_{t})), \tag{19}$$

where $\nabla_{\mathbf{x}}\xi$ is the Jacobian of the CV map. The resulting dynamics correspond to uniform sampling of the CV space and enhanced exploration of state space regions separated by high-energy barriers. The performance of ABF depends heavily on the identification of a reliable CV map; otherwise, the method is not guaranteed to improve sampling. Various other methods in the same vein as adaptive biasing force methods have been proposed in the statistics literature which preserve the target invariant measure while altering transient behavior to accelerate convergence; see [[24](https://arxiv.org/html/2606.04100#bib.bib26), [19](https://arxiv.org/html/2606.04100#bib.bib69)] for discussion on reversible and irreversible perturbations for MCMC.

Adaptive biasing potential methods are a subclass of adaptive biasing force methods which are obtained when the biasing force is conservative, meaning that it can be written as the gradient of a potential, $F_{t}^{\text{bias}}=-\nabla_{\mathbf{x}}V_{t}^{\text{bias}}$. In this case, the biased dynamics are described by a modified potential energy,

$$\tilde{V}_{\theta,t}(\mathbf{x}_{t})=V_{\theta}(\mathbf{x}_{t})+V_{t}^{\text{bias}}(\mathbf{x}_{t}). \tag{20}$$

The form of the biasing potential $V_{t}^{\text{bias}}:\mathbb{R}^{3N}\to\mathbb{R}$ depends on the sampling objective. In metadynamics [[4](https://arxiv.org/html/2606.04100#bib.bib45), [56](https://arxiv.org/html/2606.04100#bib.bib46)], sampling is promoted along selected directions of the state space defined by CVs in order to characterize the free energy landscape or rare-event kinetics. The biasing potential in metadynamics is a sum of repulsive Gaussian kernels which are deposited at times $\mathcal{S}_{\tau}=(s_{1},\ldots,s_{\tau})$ along the simulation path in CV space,

$$V_{t}^{\text{meta}}(\mathbf{x}_{t})=\sum_{s\in\mathcal{S}_{\tau}}a_{s}\exp\Big(-\frac{\|\xi(\mathbf{x}_{s})-\xi(\mathbf{x}_{t})\|^{2}}{2b_{s}^{2}}\Big),$$

where $a_{s},b_{s}>0$ are the magnitude and length scale parameters of the Gaussian kernel deposited at time $s$. Metadynamics acts to "fill" visited regions with repulsive kernels until the modified free energy landscape is leveled and sampling becomes approximately uniform in the CV space. As with ABF, metadynamics relies on the availability of an informative collective variable.
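As a small illustration of the metadynamics bias above, the sketch below evaluates the sum of deposited Gaussians at a current CV value. The CV values, heights, and widths are made-up numbers, and `metadynamics_bias` is a hypothetical helper that expects the deposited CV values as an array of shape `(tau, r)`.

```python
import numpy as np

def metadynamics_bias(xi_t, xi_centers, heights, widths):
    """Sum of Gaussians deposited at earlier CV values, evaluated at the current CV value."""
    xi_t = np.asarray(xi_t, dtype=float)
    d2 = np.sum((np.asarray(xi_centers, dtype=float) - xi_t) ** 2, axis=1)
    return float(np.sum(heights * np.exp(-d2 / (2.0 * widths**2))))

# Three hills deposited along a one-dimensional CV (assumed values)
centers = np.array([[0.0], [0.1], [2.0]])
heights = np.full(3, 1.0)
widths = np.full(3, 0.5)

bias_visited = metadynamics_bias(np.array([0.05]), centers, heights, widths)
bias_unvisited = metadynamics_bias(np.array([1.0]), centers, heights, widths)
```

The bias is larger near CV values that were visited repeatedly than in the unvisited gap between hills, which is the "filling" effect that drives the dynamics toward new regions of CV space.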

In uncertainty-driven dynamics (UDD) [[31](https://arxiv.org/html/2606.04100#bib.bib61)], a biasing potential is introduced to enhance the sampling of uncertain regions of the free energy landscape for the purpose of active learning. Given a committee of model potentials $(\hat{V}^{(m)}_{t})_{m\in[1,M]}$ and their mean prediction $\bar{V}_{t}=\frac{1}{M}\sum_{m=1}^{M}\hat{V}^{(m)}_{t}$, a biasing potential is constructed as

$$V_{t}^{\text{UDD}}(\mathbf{x}_{t})=a\Big[\exp\Big(-\frac{\sigma^{2}_{V}(\mathbf{x}_{t})}{NMb^{2}}\Big)-1\Big],$$

where $a,b>0$ are scale parameters and $\sigma_{V}^{2}$ is the empirical variance of energy predictions by models in the committee,

$$\sigma_{V}^{2}(\mathbf{x}_{t})=\frac{1}{M}\sum_{m=1}^{M}\big(\hat{V}^{(m)}_{t}(\mathbf{x}_{t})-\bar{V}_{t}(\mathbf{x}_{t})\big)^{2}. \tag{21}$$

The model committee in [[31](https://arxiv.org/html/2606.04100#bib.bib61)] is constructed via $k$-fold cross validation, with each model trained on a distinct partition of the dataset.
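The UDD bias is simple to evaluate once the committee predictions at a configuration are available. The sketch below assumes those predictions are passed in as a plain list of energies; the function name `udd_bias` and the numerical values are illustrative only.

```python
import numpy as np

def udd_bias(committee_energies, n_atoms, a, b):
    """UDD biasing potential from M committee energy predictions at one configuration."""
    E = np.asarray(committee_energies, dtype=float)
    M = E.shape[0]
    var = np.mean((E - E.mean()) ** 2)     # empirical variance, as in (21)
    return a * (np.exp(-var / (n_atoms * M * b**2)) - 1.0)

# Agreement between models gives no bias; disagreement lowers the biased energy
bias_agree = udd_bias([-10.0, -10.0, -10.0], n_atoms=22, a=1.0, b=0.01)
bias_disagree = udd_bias([-10.2, -10.0, -9.8], n_atoms=22, a=1.0, b=0.01)
```

Since the bias is zero where the committee agrees and negative where it disagrees, the modified potential energy is lowered in uncertain regions, which attracts the dynamics toward them.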

In most cases, the biasing potential alters the invariant measure of the dynamics. Assuming the biasing potential remains time-invariant or approaches a stationary limit, $\lim_{t\to\infty}V^{\text{bias}}_{t}=V^{\text{bias}}$, the state dynamics converge to a biased Boltzmann distribution,

$$
\tilde{\pi}_{\theta}(\mathbf{x})=\frac{1}{\hat{Z}}\exp\big(-\tilde{V}_{\theta}(\mathbf{x})\big)=\frac{1}{\hat{Z}}\exp\big(-V_{\theta}(\mathbf{x})-V^{\text{bias}}(\mathbf{x})\big)\ . \tag{22}
$$

Samples from $\tilde{\pi}_{\theta}$ may still be used to calculate summary statistics of the system under the original Boltzmann distribution $\pi_{\theta}$ through an appropriate importance reweighting scheme. For some measurable function of positions $f(\mathbf{x})$, its expectation with respect to $\pi_{\theta}$ can be computed as

$$
\mathbb{E}_{\mathbf{x}\sim\pi_{\theta}}[f(\mathbf{x})]=\mathbb{E}_{\mathbf{x}\sim\tilde{\pi}_{\theta}}\big[f(\mathbf{x})w(\mathbf{x})\big]\ , \tag{23}
$$

where $w(\mathbf{x})$ is the self-normalized importance sampling reweighting function given by

$$
w(\mathbf{x})=\frac{\omega(\mathbf{x})}{\int\omega(\mathbf{x})\,\textup{d}\mathbf{x}},\quad\omega(\mathbf{x})=\frac{\exp(-V_{\theta})}{\exp(-V_{\theta}-V^{\text{bias}})}=\exp(V^{\text{bias}})\ . \tag{24}
$$

Note that $\omega$ is the ratio of the unnormalized forms of $\pi_{\theta}$ and $\tilde{\pi}_{\theta}$. Since the reweighting function depends exponentially on the biasing potential, even moderate distortions of the biasing potential can induce highly variable weights, as well as high variance or numerical instability of the importance sampling estimator.
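The reweighting step can be sketched on a one-dimensional toy problem. The example below assumes a harmonic potential $V_{\theta}(x)=x^{2}/2$ and a hypothetical bias $V^{\text{bias}}(x)=-x^{2}/4$, so the biased density is Gaussian with variance 2 and can be sampled directly. In practice the normalizer is replaced by the sum of $\omega$ over the samples, as done here. The effective sample size is an added diagnostic for weight variability and is not part of the paper.

```python
import math
import random

def reweighted_mean(samples, f, v_bias):
    """Self-normalized importance sampling estimate of E_pi[f] from biased samples."""
    omegas = [math.exp(v_bias(x)) for x in samples]    # omega = exp(V_bias)
    total = sum(omegas)
    estimate = sum(f(x) * w for x, w in zip(samples, omegas)) / total
    ess = total ** 2 / sum(w * w for w in omegas)      # effective sample size
    return estimate, ess

rng = random.Random(0)
v_bias = lambda x: -0.25 * x * x    # hypothetical bias that flattens the well
# V + V_bias = x^2 / 4, so the biased density is Gaussian with variance 2.
samples = [rng.gauss(0.0, math.sqrt(2.0)) for _ in range(20000)]
# Under the unbiased target N(0, 1), the exact value of E[x^2] is 1.
estimate, ess = reweighted_mean(samples, lambda x: x * x, v_bias)
```

With this mild bias the weights are bounded and the effective sample size stays close to the number of samples. A bias with larger magnitude concentrates the weight on fewer samples, which is the instability described above.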

### C.2 Variational inference and gradient flows

In various sampling and generative modeling problems, the goal is to approximate a target density $\pi$ on $\mathcal{X}$ with a set of samples. Variational inference is a class of techniques that find an optimal density $q^{*}$, typically from a parametric family $\mathcal{Q}$, which minimizes the Kullback-Leibler (KL) divergence from the target density,

$$
q^{*}\in\underset{q\in\mathcal{Q}}{\text{argmin}}\ \mathbb{D}_{\textup{KL}}(q\,\|\,\pi)\ . \tag{25}
$$

Recently, a class of particle-based variational inference techniques has emerged that approximate a target density with the empirical distribution of a dynamic set of particles. The core principle behind these methods is to view variational inference as a gradient flow of KL divergence on the space of probability distributions. The evolution of a distribution $q$ over time is given by the continuity equation,

$$
\frac{\partial q_{t}}{\partial t}=-\nabla\cdot(q_{t}\phi_{t})\ , \tag{26}
$$

where $\phi_{t}$ is a velocity field moving probability mass. One can discretize the flow in ([26](https://arxiv.org/html/2606.04100#A3.E26)) with particles whose dynamics are governed by

$$
\frac{\textup{d}\mathbf{x}_{t}}{\textup{d}t}=\phi_{t}(\mathbf{x}_{t}),\quad\mathbf{x}_{0}\sim q_{0}, \tag{27}
$$

such that $q_{t}=\text{Law}(\mathbf{x}_{t})$. The velocity field $\phi_{t}$ is not uniquely determined a priori, but depends on the choice of geometric structure on the space of probability measures [[2](https://arxiv.org/html/2606.04100#bib.bib27)]. For instance, the law of overdamped Langevin dynamics evolves according to the Fokker-Planck equation, which can be written in the form of a continuity equation and interpreted as the Wasserstein gradient flow of KL divergence [[35](https://arxiv.org/html/2606.04100#bib.bib16)].

Stein variational gradient descent (SVGD), introduced in [[34](https://arxiv.org/html/2606.04100#bib.bib15)], is a particle-based variational inference algorithm which corresponds to a gradient flow restricted to velocity fields that lie in a reproducing kernel Hilbert space (RKHS). This constraint yields a closed-form expression for the steepest descent direction of KL divergence within the RKHS and a tractable particle-based algorithm.

In SVGD, one solves ([25](https://arxiv.org/html/2606.04100#A3.E25)) by constructing a sequence of empirical measures $(\hat{q}_{0},\hat{q}_{1},\ldots,\hat{q}_{t})$ which converges weakly to the target distribution $\pi$. Beginning with the empirical measure $\hat{q}_{0}(\mathbf{x})=\frac{1}{J}\sum_{j=1}^{J}\delta(\mathbf{x}-\mathbf{x}^{j}_{0})$ of an initial set of particles $\bar{X}_{0}=\{\mathbf{x}_{0}^{j}\}_{j=1}^{J}$, we take the empirical measure at iteration $t$ to be $\hat{q}_{t}=T_{t\sharp}\hat{q}_{0}$, or the pushforward of $\hat{q}_{0}$ under some transport map $T_{t}$. Equation ([25](https://arxiv.org/html/2606.04100#A3.E25)) then becomes the minimization of discrepancy between $T_{t\sharp}q_{0}$ and the target distribution,

$$
T_{t}^{*}\in\underset{T_{t}}{\text{argmin}}\ \mathbb{D}_{\textup{KL}}(T_{t\sharp}q_{0}\,\|\,\pi)\ , \tag{28}
$$

where $q_{0}$ is the mean-field measure corresponding to $\hat{q}_{0}$. The transport map at iteration $t$ is composed of intermediate maps $T_{t}=\bar{T}_{1}\circ\bar{T}_{2}\circ\ldots\circ\bar{T}_{t}$ which take the form $\bar{T}_{t}(\mathbf{x})=\mathbf{x}+\epsilon\phi_{t}(\mathbf{x})$. The transformation consists of perturbing the positions of particles at time $t$ by a step size $\epsilon$ in the direction $\phi_{t}$, which belongs to a class of bounded functions $\mathcal{F}\coloneqq\{f:\|f\|_{\mathcal{F}}\leq 1\}$. We can then reframe the optimization problem as one of choosing $\phi_{t}$ to be the direction of steepest descent, maximally decreasing KL divergence with respect to the target distribution, i.e.

$$
\phi_{t}^{*}\in\underset{\phi_{t}\in\mathcal{F}}{\text{argmax}}\ \big[-\nabla_{\epsilon}\mathbb{D}_{\textup{KL}}(T_{t\sharp}q_{0}\,\|\,\pi)|_{\epsilon=0}\big]\ . \tag{29}
$$

It is shown in [[34](https://arxiv.org/html/2606.04100#bib.bib15), [35](https://arxiv.org/html/2606.04100#bib.bib16)] that a practical approach to solving the above optimization problem is to project $\phi_{t}$ onto a reproducing kernel Hilbert space (RKHS). We choose $\mathcal{F}$ to be the RKHS $\mathcal{H}^{n}$ for $\mathcal{X}\subseteq\mathbb{R}^{n}$, defined by a symmetric positive semi-definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. Then we have a closed-form expression for the gradient of KL divergence, given by

$$
\nabla_{\epsilon}\mathbb{D}_{\textup{KL}}(q_{t}\,\|\,\pi)|_{\epsilon=0}=-\langle\phi_{t},\psi_{t}\rangle_{\mathcal{H}^{n}} \tag{30a}
$$

$$
\psi_{t}(\cdot)=\mathbb{E}_{\mathbf{x}\sim q_{t}}\big[k(\mathbf{x},\cdot)\nabla_{\mathbf{x}}\log\pi(\mathbf{x})+\nabla_{\mathbf{x}}k(\mathbf{x},\cdot)\big]\ . \tag{30b}
$$

It follows that the solution to ([29](https://arxiv.org/html/2606.04100#A3.E29)), giving the steepest descent direction at step $t$, is $\phi_{t}^{*}=\psi_{t}$.

The flow of empirical measures corresponds to the state evolution of the interacting particle set, represented by the ordinary differential equation

$$
\textup{d}\mathbf{x}^{i}_{t}=\phi^{*}_{t}(\mathbf{x}^{i}_{t})\,\textup{d}t, \tag{31a}
$$

$$
\phi^{*}_{t}(\cdot)=\frac{1}{J}\sum_{j=1}^{J}\Big[k(\mathbf{x}^{j}_{t},\cdot)\nabla_{\mathbf{x}^{j}_{t}}\log\pi(\mathbf{x}^{j}_{t})+\nabla_{\mathbf{x}^{j}_{t}}k(\mathbf{x}^{j}_{t},\cdot)\Big], \tag{31b}
$$

for all $\mathbf{x}^{i}_{t},\mathbf{x}^{j}_{t}\in\bar{X}_{t}$. As $J\to\infty$, the mean field limit of ([31](https://arxiv.org/html/2606.04100#A3.E31)) is the following, as shown in [[35](https://arxiv.org/html/2606.04100#bib.bib16), [20](https://arxiv.org/html/2606.04100#bib.bib100)]:

$$
\frac{\partial}{\partial t}q_{t}(\mathbf{x})=\nabla\cdot\Bigg(q_{t}(\mathbf{x})\int_{\mathcal{X}}k(\mathbf{x},\mathbf{x}^{\prime})\big(\nabla_{\mathbf{x}^{\prime}}q_{t}(\mathbf{x}^{\prime})-q_{t}(\mathbf{x}^{\prime})\nabla_{\mathbf{x}^{\prime}}\log\pi(\mathbf{x}^{\prime})\big)\textup{d}\mathbf{x}^{\prime}\Bigg)\ . \tag{32}
$$

Under appropriate assumptions on the kernel, target measure, and initial condition, the solution to ([32](https://arxiv.org/html/2606.04100#A3.E32)) exists and is unique [[36](https://arxiv.org/html/2606.04100#bib.bib101), Thm. 3.2].

It is straightforward to verify that $\pi$ is a stationary point of ([32](https://arxiv.org/html/2606.04100#A3.E32)), since $\frac{\partial q_{t}}{\partial t}=0$ when $q_{t}=\pi$. A proof of qualitative convergence was provided in [[36](https://arxiv.org/html/2606.04100#bib.bib101), Thm. 2.8], which established conditions under which $q_{t}$ converges weakly to $\pi$ as $t\to\infty$. To prove convergence with quantitative rates, the authors of [[20](https://arxiv.org/html/2606.04100#bib.bib100)] first recognize that SVGD defines a gradient flow of KL divergence not in the Wasserstein geometry, but in the Stein geometry induced by the kernel. In this geometry, the KL functional is not geodesically convex, meaning that in general one cannot establish functional inequalities that would prove global exponential convergence to the target measure. However, one can prove exponential convergence provided that one has local strong convexity and starts close to the target measure at equilibrium [[20](https://arxiv.org/html/2606.04100#bib.bib100), Lemma 28].
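The particle system in (31) can be sketched with an explicit Euler discretization in one dimension. The example below is a minimal illustration of plain deterministic SVGD, not of SKMD itself. The standard normal target, the Gaussian kernel with fixed bandwidth `h`, the step size, and the initial particle positions are all assumptions chosen for illustration.

```python
import math

def svgd_step(xs, grad_log_pi, eps, h):
    """One explicit Euler step of the particle system in Eq. (31), in 1D."""
    n_particles = len(xs)
    new_xs = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-(xj - xi) ** 2 / (2.0 * h * h))
            grad_k = -(xj - xi) / (h * h) * k      # derivative of k(x_j, x_i) in x_j
            phi += k * grad_log_pi(xj) + grad_k    # attraction term + repulsion term
        new_xs.append(xi + eps * phi / n_particles)
    return new_xs

grad_log_pi = lambda x: -x                   # score of a standard normal target
xs = [3.0 + 0.1 * i for i in range(20)]      # particles start far from the mode
for _ in range(500):
    xs = svgd_step(xs, grad_log_pi, eps=0.1, h=0.5)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
```

The first term in `phi` pulls particles toward high-probability regions, while the kernel gradient pushes nearby particles apart. This is the balance between attraction and exploration that the section describes.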

### C.3 Kernel herding

Kernel herding [[12](https://arxiv.org/html/2606.04100#bib.bib31)] is an approach to constructing a discrete approximation of a probability distribution using a finite set of points, referred to as support points [[39](https://arxiv.org/html/2606.04100#bib.bib30), [40](https://arxiv.org/html/2606.04100#bib.bib29)] or coresets [[21](https://arxiv.org/html/2606.04100#bib.bib32), [17](https://arxiv.org/html/2606.04100#bib.bib33), [42](https://arxiv.org/html/2606.04100#bib.bib34)] in the literature. For a kernel $k(\cdot,\cdot)$ which defines a reproducing kernel Hilbert space $\mathcal{H}$, the kernel mean embedding of a probability distribution $\pi$ in $\mathcal{H}$ is defined as

$$
\mu_{\pi}\coloneqq\mathbb{E}_{\mathbf{x}\sim\pi}[k(\mathbf{x},\cdot)]\ .
$$

The goal is then to approximate $\mu_{\pi}$ with an empirical embedding

$$
\hat{\mu}_{N}=\frac{1}{N}\sum_{i=1}^{N}k(\mathbf{x}_{i},\cdot)\ ,
$$

defined by $N$ points which minimize the maximum mean discrepancy $\text{MMD}^{2}(\hat{q}_{N},\pi)\coloneqq\|\hat{\mu}_{N}-\mu_{\pi}\|^{2}_{\mathcal{H}}$, where $\hat{q}_{N}=\tfrac{1}{N}\sum_{j=1}^{N}\delta(\mathbf{x}-\mathbf{x}^{j})$ is the empirical distribution of the point set. Since global minimization of the objective over point sets is generally intractable, kernel herding finds a greedy solution to the minimization problem: for $n=1,\ldots,N$, given a set of previously chosen points $\{\mathbf{x}_{i}\}_{i=1}^{n}$, kernel herding selects the next point as the one which decreases $\text{MMD}^{2}(\hat{q}_{n},\pi)$ according to

$$
\mathbf{x}_{n+1}\in\underset{\mathbf{x}\in\mathcal{X}}{\text{argmin}}\Bigg[\frac{1}{n}\sum_{i=1}^{n}k(\mathbf{x}_{i},\mathbf{x})-\mathbb{E}_{\mathbf{x}^{\prime}\sim\pi}[k(\mathbf{x}^{\prime},\mathbf{x})]\Bigg]\ . \tag{33}
$$
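The greedy rule in (33) can be sketched by restricting the search to a finite candidate grid, which is a simplification of the global optimization over $\mathcal{X}$. The example assumes a standard normal target and a Gaussian kernel with bandwidth $h$, for which the mean embedding has the closed form $\frac{h}{\sqrt{1+h^{2}}}\exp\big(-\frac{x^{2}}{2(1+h^{2})}\big)$. Treating the empty sum as zero for the first point is also an assumption.

```python
import math

def herd(candidates, kernel, mean_embedding, n_points):
    """Greedy kernel herding over a finite candidate set, following Eq. (33)."""
    chosen = []
    for _ in range(n_points):
        def objective(x):
            # Average similarity to points already chosen (zero if none yet).
            if chosen:
                repulsion = sum(kernel(xi, x) for xi in chosen) / len(chosen)
            else:
                repulsion = 0.0
            return repulsion - mean_embedding(x)
        chosen.append(min(candidates, key=objective))
    return chosen

h = 1.0
kernel = lambda x, y: math.exp(-(x - y) ** 2 / (2.0 * h * h))
# Closed-form embedding of N(0, 1) under the Gaussian kernel above.
mean_embedding = lambda x: h / math.sqrt(1.0 + h * h) * math.exp(-x * x / (2.0 * (1.0 + h * h)))
grid = [i / 20 for i in range(-80, 81)]      # candidate points on [-4, 4]
points = herd(grid, kernel, mean_embedding, n_points=6)
```

The first point lands at the mode, where the embedding is largest. Later points trade off closeness to high-probability regions against similarity to points already selected.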

Stein herding [[11](https://arxiv.org/html/2606.04100#bib.bib67)] is an analogous approach to kernel herding which targets the minimization of kernel Stein discrepancy. Let $\mathcal{S}_{\pi}$ be the Stein operator where

$$
\mathcal{S}_{\pi}\otimes k(\mathbf{x},\cdot)\coloneqq\nabla_{\mathbf{x}}\log\pi(\mathbf{x})\,k(\mathbf{x},\cdot)+\nabla_{\mathbf{x}}k(\mathbf{x},\cdot)\ .
$$

Kernel Stein discrepancy is defined as $\textup{KSD}^{2}(\hat{q}_{n},\pi)\coloneqq\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim\hat{q}_{n}}[\kappa_{\pi}(\mathbf{x},\mathbf{x}^{\prime})]$, where $\kappa_{\pi}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is the Stein kernel

$$
\begin{aligned}
\kappa_{\pi}(\mathbf{x},\mathbf{x}^{\prime})&\coloneqq\mathcal{S}^{\mathbf{x}}_{\pi}\mathcal{S}^{\mathbf{x}^{\prime}}_{\pi}\otimes k(\mathbf{x},\mathbf{x}^{\prime})\\
&=\nabla_{\mathbf{x}}\cdot\nabla_{\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})\\
&\quad+\nabla_{\mathbf{x}}k(\mathbf{x},\mathbf{x}^{\prime})\cdot\nabla_{\mathbf{x}^{\prime}}\log\pi(\mathbf{x}^{\prime})\\
&\quad+\nabla_{\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})\cdot\nabla_{\mathbf{x}}\log\pi(\mathbf{x})\\
&\quad+k(\mathbf{x},\mathbf{x}^{\prime})\nabla_{\mathbf{x}}\log\pi(\mathbf{x})\cdot\nabla_{\mathbf{x}^{\prime}}\log\pi(\mathbf{x}^{\prime})\ .
\end{aligned}
$$

In Stein herding, the next point which minimizes $\textup{KSD}^{2}(\hat{q}_{n},\pi)$ is given by

$$
\mathbf{x}_{n+1}\in\underset{\mathbf{x}\in\mathcal{X}}{\text{argmin}}\Big[\frac{1}{n}\sum_{i=1}^{n}\kappa_{\pi}(\mathbf{x}_{i},\mathbf{x})\Big]\ , \tag{34}
$$

which is identical to the expression in ([10](https://arxiv.org/html/2606.04100#S4.E10)). As noted in [[11](https://arxiv.org/html/2606.04100#bib.bib67)], each iteration $n$ of Stein herding or kernel herding requires solving a global optimization problem over $\mathcal{X}$, which the authors perform using numerical optimization methods such as the Nelder-Mead algorithm and brute-force search.
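The Stein kernel and the greedy rule in (34) can be sketched in one dimension as follows. The Gaussian base kernel, the standard normal score, the candidate grid, and the user-supplied starting point are all assumptions for illustration. The objective follows (34) as written above, and a grid search stands in for the numerical optimizers mentioned in the text.

```python
import math

def stein_kernel(x, y, score, h):
    """1D Stein kernel kappa_pi(x, y) built from a Gaussian base kernel."""
    d = x - y
    k = math.exp(-d * d / (2.0 * h * h))
    dk_dx = -d / (h * h) * k
    dk_dy = d / (h * h) * k
    d2k = (1.0 / (h * h) - d * d / h ** 4) * k    # mixed derivative d^2 k / dx dy
    return d2k + dk_dx * score(y) + dk_dy * score(x) + k * score(x) * score(y)

def stein_herd(candidates, start, score, h, n_points):
    """Greedy selection over a finite candidate set, following Eq. (34)."""
    chosen = [start]
    while len(chosen) < n_points:
        def objective(x):
            return sum(stein_kernel(xi, x, score, h) for xi in chosen) / len(chosen)
        chosen.append(min(candidates, key=objective))
    return chosen

score = lambda x: -x                          # gradient of log N(0, 1)
grid = [i / 20 for i in range(-80, 81)]       # candidate points on [-4, 4]
chosen = stein_herd(grid, start=0.0, score=score, h=1.0, n_points=5)
```

Unlike kernel herding, this rule needs only the score $\nabla\log\pi$ and not an expectation under $\pi$, which is what makes it usable when the target is known only up to normalization.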

## Appendix D Additional Discussion

### D.1 Online data acquisition

In the numerical studies of [Section 5.1](https://arxiv.org/html/2606.04100#S5.SS1), we compare the online data acquisition schemes of UDD and a-SKMD. UDD utilizes an acquisition function defined as the scaled standard deviation of committee predictions,

$$
\rho(\mathbf{x})\coloneqq\sqrt{\tfrac{2}{M}}\,\sigma_{V}(\mathbf{x})\ , \tag{35}
$$

where $\sigma_{V}$ is the square root of the committee variance defined in ([21](https://arxiv.org/html/2606.04100#A3.E21)). When the uncertainty metric $\rho$ exceeds a threshold, i.e., $\rho(\mathbf{x}_{t})>\zeta_{1}$, the current point is collected as a new training datum.

[Figure 5](https://arxiv.org/html/2606.04100#A4.F5) compares the acquisition functions of a-SKMD and UDD after one iteration of active learning. The realization of Brownian motion for the path simulation is fixed between the two schemes shown. The SKMD acquisition function, defined as the norm of the SKMD biasing force, is informed by both the underlying potential and the kernel. Panel (c) of [Figure 5](https://arxiv.org/html/2606.04100#A4.F5) shows that the SKMD acquisition function is low near other particles in the ensemble that lie in energy basins, as a result of the attractive component of the biasing force, and high near other particles located in high-energy regions, as a result of the repulsive component of the biasing force. Therefore, the criterion serves to select points which are diverse with respect to the other particles but also representative of the underlying energy landscape. Moreover, the threshold of the SKMD criterion spans a large area of state space, such that with each active learning iteration, the criterion leads to a well-defined radial expansion in state space of selected points. Panel (d) shows that the UDD acquisition function increases with distance from the existing training data, with regions of high uncertainty concentrated in certain domains of the state space. The location of these regions does not correlate with the underlying energy landscape. Moreover, the UDD criterion has no means of enforcing diversity among points selected at a given active learning iteration, which can lead to clustering in the high-uncertainty regions. As a result, the data selected by the SKMD criterion often exhibit greater spread in the state space than those selected by the UDD criterion, as shown in panels (a) and (b).

![Refer to caption](https://arxiv.org/html/2606.04100v1/figures/adaptstop.png)

Figure 5: The neural network potential (contours), previous training data (red), and selected data (cyan) corresponding to one iteration of active learning by a-SKMD (a) and UDD (b). The corresponding contours of the acquisition function, selected data (cyan), and threshold for the acquisition criterion (red line) for a-SKMD (c) and UDD (d).

## Appendix E Experimental Details

### E.1 Neural network potential

For all active learning schemes in the comparison, we fix the temperature parameter to be $\beta=1.0$ and perform numerical simulation with a time step $\epsilon=0.002$. In each trial, a maximum of 16 active learning iterations are performed and 32 new data points are queried in each active learning iteration.

Data acquisition schemes. In the naïve data acquisition scheme utilized in overdamped Langevin dynamics and SKMD without adaptive stopping, we collect data every 100 simulation steps and stop the simulation after 32 points have been collected. SKMD is implemented according to [Algorithm 3](https://arxiv.org/html/2606.04100#alg3) using $\ell=100$.

We implement UDD using the uncertainty-based acquisition criterion ([35](https://arxiv.org/html/2606.04100#A4.E35)) specified in [[31](https://arxiv.org/html/2606.04100#bib.bib61)]. From a pilot simulation, we determined $\zeta_{1}=0.02$ to be an appropriate uncertainty threshold. In order to reduce redundancy in the collected points, we further impose that UDD performs a minimum of 50 simulation steps before the next datum is collected.
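A threshold rule with a minimum spacing between collected points can be sketched as below. The function names and the synthetic trace of $\rho$ values are hypothetical. Letting the first datum be collected without waiting for the minimum number of steps is an assumption, since the text does not specify how the first point is handled.

```python
import math

def udd_acquisition(sigma_v, n_models):
    """rho = sqrt(2 / M) * sigma_V, as in Eq. (35)."""
    return math.sqrt(2.0 / n_models) * sigma_v

def select_steps(rho_trace, threshold, min_gap):
    """Indices where rho exceeds the threshold, at least min_gap steps apart."""
    selected = []
    last = -min_gap          # assumption: the very first step may qualify
    for step, rho in enumerate(rho_trace):
        if rho > threshold and step - last >= min_gap:
            selected.append(step)
            last = step
    return selected

# Synthetic trace: low uncertainty for 10 steps, then above the threshold.
trace = [0.0] * 10 + [0.05] * 120
steps = select_steps(trace, threshold=0.02, min_gap=50)
```

Without the minimum gap, every step after the tenth would be collected, which illustrates the redundancy that the 50-step constraint is meant to reduce.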

a-SKMD utilizes the acquisition function defined by the norm of the SKMD biasing force in ([8](https://arxiv.org/html/2606.04100#S3.E8)), requiring a minimum number of simulation steps $\ell_{0}=50$ before stopping. We determine an appropriate threshold value, $\zeta_{0}=1.0$, by tracking the acquisition function $\alpha_{s}$ over a pilot simulation.

SKMD parameters. Both SKMD and a-SKMD utilize $A=1$ and a Gaussian kernel defined by the Euclidean distance in state space with amplitude $a=2$ and fixed length scale $b=0.41$, which is the median distance between elements in the initial ensemble. For both schemes, the interacting set of 32 points is initialized at the initial training set. In each active learning iteration, all particles are simulated one at a time and the data are selected to be the stopping points of each particle.
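The median-distance choice of length scale can be sketched as follows. The kernel form $a\exp(-\|\mathbf{x}-\mathbf{y}\|^{2}/(2b^{2}))$ used here is an assumption, since the exact convention for amplitude and length scale is given in the main text rather than in this appendix. The four-point ensemble is hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def median_pairwise_distance(points):
    """Median Euclidean distance over all distinct pairs of points."""
    return median(math.dist(p, q) for p, q in combinations(points, 2))

def gaussian_kernel(x, y, a, b):
    """Assumed form a * exp(-||x - y||^2 / (2 b^2)); the paper's convention may differ."""
    return a * math.exp(-math.dist(x, y) ** 2 / (2.0 * b * b))

# Hypothetical initial ensemble: corners of the unit square.
ensemble = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
b = median_pairwise_distance(ensemble)
k_same = gaussian_kernel(ensemble[0], ensemble[0], a=2.0, b=b)
k_far = gaussian_kernel(ensemble[0], ensemble[3], a=2.0, b=b)
```

Tying $b$ to the median distance keeps the kernel sensitive to typical separations in the ensemble: a much smaller $b$ would make particles effectively non-interacting, and a much larger one would make them indistinguishable.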

### E.2 MACE fine-tuning

Training protocol. The same training protocol was used for every active learning iteration and for all sampling methods. We use the Low-Rank Adaptation (LoRA) method [[27](https://arxiv.org/html/2606.04100#bib.bib88)] for the fine-tuning of model parameters, in which pretrained model weights are adapted to new data. In each case, the initial model was fine-tuned for 100 epochs using LoRA with rank 4 and scaling factor $\alpha=1.0$. The isolated atom energies were consistent between the pre-trained model and the oracle-labeled dataset, so multi-head fine-tuning was disabled. Training used a two-stage schedule: during the first 50 epochs, the force and energy loss weights were 1000 and 40, respectively, with a learning rate of 0.01. From epoch 50 onward, the force and energy weights were changed to 10 and 1000, respectively, and the learning rate was reduced to 0.001.
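The two-stage schedule can be written as a small lookup, shown here as a plain Python sketch rather than as a MACE configuration. The function name is hypothetical, and counting epochs from zero, so that the switch happens at index 50, is an assumption about the indexing convention.

```python
def loss_schedule(epoch, switch_epoch=50):
    """Return (force_weight, energy_weight, learning_rate) for a 0-indexed epoch."""
    if epoch < switch_epoch:
        # Stage 1: emphasize forces.
        return 1000.0, 40.0, 0.01
    # Stage 2: emphasize energies, with a smaller learning rate.
    return 10.0, 1000.0, 0.001

schedule = [loss_schedule(epoch) for epoch in range(100)]
```

The first stage weights the force loss far more heavily than the energy loss, and the second stage reverses this emphasis while lowering the learning rate by a factor of ten.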

SKMD outer-loop schedule. The 300 outer iterations are split into an exploration phase (iterations 1–250) and a relaxation phase (iterations 251–300). During exploration, the kernel amplitude is held fixed at $a=3.0$ and the bandwidth $b$ is set adaptively at each iteration to the median pairwise Euclidean distance between the descriptor vectors of all particles. During the relaxation phase, both $a$ and $b$ decay linearly from their values at the end of exploration to half those values by iteration 300. This relaxation allows the sampler to settle into low-energy regions of the landscape.
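One way to realize this schedule is sketched below. The function names are hypothetical, and the interpolation endpoints are an assumption: the decay factor is taken to be 1 at iteration 250 and 0.5 at iteration 300, with a straight line in between.

```python
def relaxation_factor(iteration, n_explore=250, n_total=300):
    """Multiplier applied to a and b: 1 during exploration, 0.5 at the last iteration."""
    if iteration <= n_explore:
        return 1.0
    frac = (iteration - n_explore) / (n_total - n_explore)
    return 1.0 - 0.5 * frac

def kernel_schedule(iteration, b_median, b_explore_end, a0=3.0, n_explore=250):
    """Return (a, b) for a 1-indexed outer iteration.

    b_median is the current median pairwise descriptor distance (used during
    exploration); b_explore_end is the bandwidth at the end of exploration.
    """
    if iteration <= n_explore:
        return a0, b_median
    s = relaxation_factor(iteration, n_explore)
    return a0 * s, b_explore_end * s

a_final, b_final = kernel_schedule(300, b_median=0.8, b_explore_end=1.2)
```

Shrinking $a$ and $b$ together weakens and localizes the interaction between particles, consistent with the stated aim of letting the sampler settle into low-energy regions.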

Structure selection. Candidate descriptor vectors are $L^{2}$-normalised prior to computing pairwise distances in the furthest-point sampling step. When selecting structures at iteration $t>1$, the descriptors of all previously selected structures are used to initialise the minimum-distance field, so that newly selected structures are maximally diverse relative to the entire accumulated dataset and not only relative to the current batch of candidates.
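The selection step described above can be sketched as a greedy max-min procedure. The function names and the toy two-dimensional descriptors are hypothetical, and tie-breaking by lowest index is an assumption. Actual descriptors would be the global atomic descriptors used in the paper.

```python
import math

def l2_normalise(v):
    """Scale a descriptor vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def furthest_point_sampling(candidates, previous, n_select):
    """Greedy max-min selection of candidate indices, seeded by earlier selections."""
    cands = [l2_normalise(v) for v in candidates]
    prev = [l2_normalise(v) for v in previous]
    # Minimum-distance field: distance from each candidate to anything already selected.
    min_dist = [min((math.dist(c, p) for p in prev), default=math.inf) for c in cands]
    chosen = []
    for _ in range(n_select):
        idx = max(range(len(cands)), key=lambda i: min_dist[i])
        chosen.append(idx)
        # Update the field with distances to the newly selected candidate.
        min_dist = [min(d, math.dist(c, cands[idx])) for d, c in zip(min_dist, cands)]
    return chosen

candidates = [(1.0, 0.0), (2.0, 0.02), (0.0, 1.0), (-1.0, 0.0)]
previous = [(1.0, 0.0)]      # descriptor of a structure selected earlier
chosen = furthest_point_sampling(candidates, previous, n_select=2)
```

In this toy case the first two candidates point in nearly the same direction as the previously selected structure, so after normalisation they sit close to it and are passed over in favor of the two candidates furthest from the accumulated set.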

Evaluation metrics. Energy error is reported as the root mean square error (RMSE) in the total potential energy over all 300 test structures. Force error is the RMSE of all $3N$ force components across the test set. Both are computed relative to oracle (MACE-OFF-23-small) predictions.
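The two metrics can be sketched as below. The nested-list layout (structures, then atoms, then Cartesian components) and the function names are assumptions made for illustration. Pooling all force components before averaging is one reading of "all $3N$ force components across the test set".

```python
import math

def energy_rmse(pred, ref):
    """RMSE of total energies, one value per test structure."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def force_rmse(pred, ref):
    """RMSE pooled over every Cartesian force component of every structure."""
    sq_sum, count = 0.0, 0
    for f_pred, f_ref in zip(pred, ref):            # loop over structures
        for a_pred, a_ref in zip(f_pred, f_ref):    # loop over atoms
            for p, r in zip(a_pred, a_ref):         # loop over x, y, z
                sq_sum += (p - r) ** 2
                count += 1
    return math.sqrt(sq_sum / count)

# Toy values: two structures for energies, one single-atom structure for forces.
e_err = energy_rmse([1.0, 3.0], [0.0, 0.0])
f_err = force_rmse([[[1.0, 0.0, 0.0]]], [[[0.0, 0.0, 0.0]]])
```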

### E.3 Compute Infrastructure

We performed experiments with the neural network potential on x86 server CPUs (AMD EPYC and Intel Xeon processors) using batch SLURM jobs, each with 2 CPU cores, 80GB memory for the Langevin, SKMD, and a-SKMD runs, and 200GB memory for the UDD runs. The typical run times per job were 5 minutes for the Langevin scheme, 10 minutes for the SKMD and a-SKMD schemes, and 1 hour for the UDD scheme. For the MACE examples, the active learning loops were conducted on the Isambard-AI Phase 2 system at the Bristol Centre for Supercomputing, which is built from HPE Cray EX nodes containing four NVIDIA GH200 Grace Hopper Superchips per node. Molecular dynamics simulations were run as single-core jobs on the NVIDIA Grace CPU (72 Arm Neoverse V2 cores, Armv9.0-A, 3.1 GHz, with 4×128-bit SVE2 vector units and 120 GB of co-packaged LPDDR5X memory), while MACE training was performed on the NVIDIA H100 Hopper GPU (96 GB HBM3) of the same superchip, accessed via the 900 GB/s coherent NVLink-C2C interface. Total resource usage for results generation was minimal, at approximately 6 CPU-hours and 4 GPU-hours across all MD and MACE training.

## Appendix F Societal Impacts

We develop a method for improving the efficiency of atomistic simulations, with the potential to accelerate scientific discovery in chemistry, materials science, and drug design. By reducing the number of expensive quantum-mechanical calculations required for training MLIPs, our method can lower the computational cost and energy consumption of atomistic simulation workflows. While we do not foresee any direct negative societal impacts, atomistic simulation tools could be misused, for example, for the development of biochemical weapons. Such risks are not specific to our method.