Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics

arXiv cs.LG 05/18/26, 04:00 AM Papers
Summary
This paper introduces DiMS, a dynamical system sampler that guarantees exact sampling from the submanifold of minimum loss solutions in neural networks, enabling better uncertainty quantification in Bayesian inference.
arXiv:2605.15459v1 Announce Type: new Abstract: The minima of modern neural network loss functions are typically not isolated, rather they form connected components of reparameterization invariant solutions on the training data. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible. By construction, existing methods either spread over low-loss regions, and thus do not sample reparameterization invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys. We propose sampling such reparameterization invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system. Our proposed sampler, DiMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters which allows control over the exploration capabilities of the sampler. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches.
Original Article
View Cached Full Text
Cached at: 05/18/26, 06:41 AM
# Don’t Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
Source: [https://arxiv.org/html/2605.15459](https://arxiv.org/html/2605.15459)
Albert Kjøller Jacobsen1&Leo Uhre Jakobsen2&Johanna Marie Gegenfurtner1&Georgios Arvanitidis1

1Section for Cognitive Systems, Technical University of Denmark 2Center for Quantum Information Physics, New York University \{akjja, johge, gear\}@dtu\.dk,luj2004@nyu\.edu

###### Abstract

The minima of modern neural network loss functions are typically not isolated, rather they form connected components of reparameterization invariant solutions on the training data\. Analytically characterizing these solutions is a hard problem, but sampling approaches are feasible\. By construction, existing methods either spread over low\-loss regions, and thus do not sample reparameterization invariant solutions exactly, or are inherently local, which limits exploration of other minima valleys\. We propose sampling such reparameterization invariant models using a dynamical system based on kinetic energy, subject to a gravitational pull and a friction term that dissipates energy from the system\. Our proposed sampler,DiMS, is guaranteed to sample exactly from the minimum level sets and depends on physically motivated hyperparameters which allows control over the exploration capabilities of the sampler\. We consider uncertainty quantification in Bayesian inference as the motivating problem and observe improved performance compared to previously proposed approaches\.

## 1Introduction

![Refer to caption](https://arxiv.org/html/2605.15459v1/figures/figure1.png)Figure 1:Two trajectories through parameter space via different dynamical systems\. The sphere constitutes a submanifold of parameters that minimizeℒ\(𝜽\)=∑i=1Kθi2−1\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)=\\sum\_\{i=1\}^\{K\}\\theta\_\{i\}^\{2\}\-1\. While a geodesic on the loss surface \(\) is the straightest path, it never converges to a minimum solution, even when starting from one\. Our modified dynamics \(\) ensure energy dissipation which leads the particle to stop at some minimum solution\.Training a deep neural network results in finding a single point estimate with minimum loss within a vast, high\-dimensional parameter space\. Yet the loss landscape has a rich geometric structure and a single point estimate can only capture local information about it\. This can have important shortcomings for downstream tasks, for instance, in Bayesian inference\. The architecture and the data together shape the loss landscape such that several parameter configurations perform equally well on the training data; the minimum level sets of modern overparameterized networks form extended, continuous submanifolds for which the model is*reparameterization invariant*on the observed data\(Dinhet al\.,[2017](https://arxiv.org/html/2605.15459#bib.bib32); Liet al\.,[2018](https://arxiv.org/html/2605.15459#bib.bib33); Garipovet al\.,[2018](https://arxiv.org/html/2605.15459#bib.bib30); Draxleret al\.,[2018](https://arxiv.org/html/2605.15459#bib.bib31)\)\.

In this paper we ask*“how should we navigate the space of good solutions”*rather than*“where is a good solution”*\. We propose theDissipativeMinimaSampler \(DiMS\), formulated as a dynamical system that navigates the loss landscape into low\-loss regions while guaranteeing convergence to the submanifold of minimum solutions\. As a result, the gathered samples correspond to a diverse set of functions that agree on the training data but vary in their predictions elsewhere\.

Recent works have focused on sampling low\-loss regions by respecting the inherent geometry of the problem\(Bergaminet al\.,[2023](https://arxiv.org/html/2605.15459#bib.bib3); Yuet al\.,[2024](https://arxiv.org/html/2605.15459#bib.bib4); Fadelet al\.,[2025](https://arxiv.org/html/2605.15459#bib.bib29)\), and even restricting samples to the reparameterization invariant set\(Royet al\.,[2024](https://arxiv.org/html/2605.15459#bib.bib2)\)\. Our proposed method differ by guaranteeing convergence to the minimum manifold while allowing for escaping the initial loss basin to find other connected minima \- a combination that neither optimization\-based methods nor existing sampling\-based methods are designed for\. Our contributions include:

1. 1\.A deterministic sampler for the minimum submanifolds, continuously visiting distinct points on the minimum submanifolds of the training loss\. We formalize geometry\-driven dynamics on Riemannian manifolds induced by the loss with physically interpretable hyperparameters controlling the exploration behavior of the sampler\. The resulting trajectories can be considered as an implicit approximate Bayesian posterior concentrated on the minimum manifold\.
2. 2\.Theoretical guaranteesthat our dynamics explore the loss surface before converging to the minimum manifold in finite time\.
3. 3\.Empirical validationthat posterior approximations concentrated on the minimum manifold generalize well, and can improve predictive performance, calibration, and especially out\-of\-domain uncertainty in Bayesian inference settings\.

#### Intuition:

The sampling process \(Figure[1](https://arxiv.org/html/2605.15459#S1.F1)\) starts by shooting out a particle from a fixed point𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}that minimizes the loss\. The direction and initial speed of the particle is randomly sampled from a distribution,𝒗∼q\(𝒗\)\\boldsymbol\{v\}\\sim q\(\\boldsymbol\{v\}\), and determines the initial kinetic energy of the particle\. By introducing external forces well\-known from physics, we guide the particle toward minima, eventually stopping at a minimum level set\. Every intersection with this set corresponds to a model that optimally fits the training data, yet exhibits different behavior in other data regions\. This leads to high functional variation in regions away from observed data \(Figure[2](https://arxiv.org/html/2605.15459#S1.F2)\), which is desired in Bayesian inference settings\. Remark thatDiMSis not a traditional approximate posterior as it deliberately generates samples only at the maxima ridges of the true posterior\.

![Refer to caption](https://arxiv.org/html/2605.15459v1/x1.png)Figure 2:We fit a neural network toN=7N=7training points and show function\-space samples from two geometry\-aware sampling schemes formulated as continuous dynamical systems\. The Riemannian Laplace approximation \(left\) explores low\-loss regions but does not enforce interpolating the training data\. Our samplerDiMS\(center\) leverages an alternative formulation of the system and guarantees convergence to minimum training loss solutions, ensuring all function\-space samples to interpolate the training data, while remaining in low\-loss regions along paths generated by the dynamics \(right\)\.

## 2Background \- geometry of loss surfaces

Let𝒟=\{\(𝒙i,yi\)\}i=1N\\mathcal\{D\}=\\\{\(\\boldsymbol\{x\}\_\{i\},y\_\{i\}\)\\\}\_\{i=1\}^\{N\}denote a dataset withNNpaired input\-output observations of𝒙i∈𝒳⊆ℝD\\boldsymbol\{x\}\_\{i\}\\in\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{D\}andyi∈𝒴y\_\{i\}\\in\\mathcal\{Y\}where𝒴⊆ℝ\\mathcal\{Y\}\\subseteq\\mathbb\{R\}for regression or𝒴=\{1,…,C\}\\mathcal\{Y\}=\\\{1,\\dots,C\\\}for classification amongCCclasses\. In most predictive settings we can define a likelihood parameterized by a neural network, then following Bayes’ theorem the negative log\-posterior equals the log\-joint up to a constant:

ℒ\(𝜽;𝒟\):=−log⁡p\(𝒟\|𝜽\)−log⁡p\(𝜽\)\+log⁡p\(𝒟\)=−log⁡p\(𝜽\)−∑i=1Nlog⁡p\(yi\|f𝜽\(𝒙i\)\)\+C\\mathcal\{L\}\(\\boldsymbol\{\\theta\};\\mathcal\{D\}\):=\-\\log p\(\\mathcal\{D\}\|\\boldsymbol\{\\theta\}\)\-\\log p\(\\boldsymbol\{\\theta\}\)\+\\log p\(\\mathcal\{D\}\)=\-\\log p\(\\boldsymbol\{\\theta\}\)\-\\sum\\nolimits\_\{i=1\}^\{N\}\\log p\\left\(y\_\{i\}\|f\_\{\\boldsymbol\{\\theta\}\}\(\\boldsymbol\{x\}\_\{i\}\)\\right\)\+C\(1\)wheref𝜽:𝒳→𝒴f\_\{\\boldsymbol\{\\theta\}\}:\\mathcal\{X\}\\rightarrow\\mathcal\{Y\}is the neural network with parameters𝜽∈Θ\\boldsymbol\{\\theta\}\\in\\Thetaandp\(𝜽\)p\(\\boldsymbol\{\\theta\}\)is a prior distribution over the parameter spaceΘ⊆ℝK\\Theta\\subseteq\\mathbb\{R\}^\{K\}\. The posterior distributionp\(𝜽\|𝒟\)p\(\\boldsymbol\{\\theta\}\|\\mathcal\{D\}\)is intractable when working with neural networks and we thus resort to an approximate posteriorq\(𝜽\)q\(\\boldsymbol\{\\theta\}\)rather than the true one\. Using the generalized Bayesian framework, we formulate a Gibbs posterior over the parameter space that enables a Bayesian treatment of any neural network without requiring a likelihood specification\.

![Refer to caption](https://arxiv.org/html/2605.15459v1/x2.png)Figure 3:The loss surface viewed as a Riemannian manifold\(ℳ,g\)\(\\mathcal\{M\},g\)embedded inℝK\+1\\mathbb\{R\}^\{K\+1\}\. The parameter space inherits the geometry of the loss surface via the pullback metric𝐆\(𝜽\)\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\. A curve on the manifold𝜸𝒗\(t\)∈ℳ\\boldsymbol\{\\gamma\}\_\{\\boldsymbol\{v\}\}\(t\)\\in\\mathcal\{M\}starting at𝝁\\boldsymbol\{\\mu\}is generated by the initial velocity𝒗\\boldsymbol\{v\}and can be expressed through its corresponding path in the parameter space𝜶𝒗~\(t\)∈Θ\\boldsymbol\{\\alpha\}\_\{\\tilde\{\\boldsymbol\{v\}\}\}\(t\)\\in\\Thetainduced by𝒗~\\tilde\{\\boldsymbol\{v\}\}\. We denote the mappings from the initial velocities to the curve at timettbyϕ𝝁,t:𝒯𝝁ℳ→ℳ\\phi\_\{\\boldsymbol\{\\mu\},t\}:\\mathcal\{T\}\_\{\\boldsymbol\{\\mu\}\}\\mathcal\{M\}\\rightarrow\\mathcal\{M\}andψ𝝁~,t:𝒯𝝁~Θ→Θ\\psi\_\{\\tilde\{\\boldsymbol\{\\mu\}\},t\}:\\mathcal\{T\}\_\{\\tilde\{\\boldsymbol\{\\mu\}\}\}\\Theta\\rightarrow\\Theta\.We briefly review Riemannian geometry of loss surfaces, which forms the foundation of our approach\. For notational convenience we suppress the dependency on the data𝒟\\mathcal\{D\}unless explicitly needed\.

### 2\.1Defining the loss surface as a Riemannian manifold

Given a smooth loss functionℒ:ℝK→ℝ\\mathcal\{L\}:\\mathbb\{R\}^\{K\}\\rightarrow\\mathbb\{R\}, the graph of the function over the parameter space defines a smooth,KK\-dimensional submanifold embedded inℝK\+1\\mathbb\{R\}^\{K\+1\}, often called the loss surface:

ℳ=\{h\(𝜽\)∣𝜽∈Θ\}⊆ℝK\+1,withh\(𝜽\)=\(θ1,…,θK,ℒ\(𝜽\)\)\.\\mathcal\{M\}=\\left\\\{h\(\\boldsymbol\{\\theta\}\)\\mid\\boldsymbol\{\\theta\}\\in\\Theta\\right\\\}\\subseteq\\mathbb\{R\}^\{K\+1\},\\qquad\\textrm\{with\}\\qquad h\(\\boldsymbol\{\\theta\}\)=\\left\(\\theta\_\{1\},\\dots,\\theta\_\{K\},\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\right\)\.\(2\)This definition ofℳ\\mathcal\{M\}naturally induces a Riemannian metric on the parameter space, as we now establish\. For a configuration𝜽\\boldsymbol\{\\theta\}, the ambient tangent space centered ath\(𝜽\)h\(\\boldsymbol\{\\theta\}\)is spanned by the Jacobian𝐉h\(𝜽\)=∇𝜽h\(𝜽\)=\[𝕀K,∇𝜽ℒ\(𝜽\)\]⊤∈ℝ\(K\+1\)×K\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)=\\nabla\_\{\\boldsymbol\{\\theta\}\}h\(\\boldsymbol\{\\theta\}\)=\\begin\{bmatrix\}\\mathbb\{I\}\_\{K\},\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\end\{bmatrix\}^\{\\top\}\\in\\mathbb\{R\}^\{\(K\+1\)\\times K\}, and we can express two tangent space vectors𝒗1,𝒗2∈𝒯𝒑ℳ\\boldsymbol\{v\}\_\{1\},\\boldsymbol\{v\}\_\{2\}\\in\\mathcal\{T\}\_\{\\boldsymbol\{p\}\}\\mathcal\{M\}in parameter space coordinates as𝒗1=𝐉h\(𝜽\)𝒗~1\\boldsymbol\{v\}\_\{1\}=\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}\_\{1\}and𝒗2=𝐉h\(𝜽\)𝒗~2\\boldsymbol\{v\}\_\{2\}=\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}\_\{2\}where𝒗~1,𝒗~2∈𝒯𝜽ℳ\\tilde\{\\boldsymbol\{v\}\}\_\{1\},\\tilde\{\\boldsymbol\{v\}\}\_\{2\}\\in\\mathcal\{T\}\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{M\}denote the corresponding tangent vectors in parameter space coordinates\. We can then define inner products on the manifold via the metric tensorg𝒑:𝒯𝒑ℳ×𝒯𝒑ℳ→ℝg\_\{\\boldsymbol\{p\}\}:\\mathcal\{T\}\_\{\\boldsymbol\{p\}\}\\mathcal\{M\}\\times\\mathcal\{T\}\_\{\\boldsymbol\{p\}\}\\mathcal\{M\}\\rightarrow\\mathbb\{R\}as:

g𝒑\(𝒗1,𝒗2\)=𝒗1⊤𝒗2=𝒗~1⊤𝐉h\(𝜽\)⊤𝐉h\(𝜽\)𝒗~2=𝒗~1⊤𝐆\(𝜽\)𝒗~2\.g\_\{\\boldsymbol\{p\}\}\(\\boldsymbol\{v\}\_\{1\},\\boldsymbol\{v\}\_\{2\}\)=\\boldsymbol\{v\}\_\{1\}^\{\\top\}\\boldsymbol\{v\}\_\{2\}=\\tilde\{\\boldsymbol\{v\}\}\_\{1\}^\{\\top\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}\_\{2\}=\\tilde\{\\boldsymbol\{v\}\}\_\{1\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}\_\{2\}\.\(3\)Here𝐆\(𝜽\)=𝕀K\+∇𝜽ℒ\(𝜽\)∇𝜽ℒ\(𝜽\)⊤\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)=\\mathbb\{I\}\_\{K\}\+\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}is the symmetric positive\-definite \(SPD\) metric matrix, making\(ℳ,g\)\(\\mathcal\{M\},g\)a Riemannian manifold whose inner products are locally distorted by curvature information from the loss gradient\. This definition of the inner product is the*pullback metric*as it pulls the Euclidean inner product on the tangent space𝒯𝒑ℳ\\mathcal\{T\}\_\{\\boldsymbol\{p\}\}\\mathcal\{M\}to the parameter spaceΘ\\Theta, in which all geometric computations onℳ\\mathcal\{M\}can therefore be carried out\. Note that while the embeddinghhfrom Equation \([2](https://arxiv.org/html/2605.15459#S2.E2)\) captures only the loss geometry, the theory extends to any smooth embedding function\.

### 2\.2Loss\-minimizing submanifolds

For a non\-convex, data\-dependent loss functionℒ:Θ→ℝ\\mathcal\{L\}:\\Theta\\rightarrow\\mathbb\{R\}, we define the loss level set atllasSl\(𝒟\)=⨆k∈𝒦\{𝜽∈ℬk:ℒ\(𝜽\)=l≥ℒ∗\(𝒟\)\},S\_\{l\}\(\\mathcal\{D\}\)=\\bigsqcup\_\{k\\in\\mathcal\{K\}\}\\left\\\{\\boldsymbol\{\\theta\}\\in\\mathcal\{B\}\_\{k\}:\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)=l\\geq\\mathcal\{L\}^\{\\ast\}\(\\mathcal\{D\}\)\\right\\\},whereℬk\\mathcal\{B\}\_\{k\}is the basin of attraction of thekk\-th local minimum,𝒦\\mathcal\{K\}is the index set over all local minima andℒ∗\(𝒟\)\\mathcal\{L\}^\{\\ast\}\(\\mathcal\{D\}\)is the global minimum loss on data𝒟\\mathcal\{D\}\. We define the*loss\-minimizing submanifold*as the disjoint union of minimum level sets within basins:

S∗\(𝒟\)=⨆k∈𝒦\{𝜽∈ℬk:ℒ\(𝜽\)=ℒk∗\(𝒟\)\}\.S\_\{\\ast\}\(\\mathcal\{D\}\)=\\bigsqcup\\nolimits\_\{k\\in\\mathcal\{K\}\}\\left\\\{\\boldsymbol\{\\theta\}\\in\\mathcal\{B\}\_\{k\}:\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)=\\mathcal\{L\}\_\{k\}^\{\\ast\}\(\\mathcal\{D\}\)\\right\\\}\.\(4\)Hereℒk∗\(𝒟\)=min𝜽′∈ℬk⁡ℒ\(𝜽′;𝒟\)\\mathcal\{L\}^\{\\ast\}\_\{k\}\(\\mathcal\{D\}\)=\\min\_\{\\boldsymbol\{\\theta\}^\{\\prime\}\\in\\mathcal\{B\}\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}^\{\\prime\};\\mathcal\{D\}\)denotes the minimum loss value of thekk\-th basin, soS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\)is essentially the collection of all local minima\. A special case of the loss level submanifold isSℒ∗\(𝒟\)S\_\{\\mathcal\{L\}^\{\\ast\}\(\\mathcal\{D\}\)\}for which all solutions have minimal loss and are functionally similar on all points in𝒟\\mathcal\{D\}\. In the interpolation regime whereℒ∗\(𝒟\)=0\\mathcal\{L\}^\{\\ast\}\(\\mathcal\{D\}\)=0, all parameters inS0S\_\{0\}exactly interpolate the training data𝒟\\mathcal\{D\}, yet may induce functionally distinct predictions on unseen inputs, as depicted in Figure[2](https://arxiv.org/html/2605.15459#S1.F2)\. We remark that in the overparameterized regime whereK≫NK\\gg N, the loss imposes at mostNNconstraints on theKK\-dimensional parameter space\. With far more parameters than data points, there are locally at leastK−NK\-Ndirections in which the parameters can move without changing the loss value\.

## 3Geometric mechanics \- curves on manifolds induced by classical mechanics

Classical mechanics is fundamentally the study of motion and thus naturally describes the evolution of a dynamical system over time\. In the following, we recall the Lagrangian formulation for a single\-particle system evolving in the intrinsic coordinates of a Riemannian manifold, which in our case is the parameter spaceΘ\\Theta\. The trajectories will serve as the foundation for our sampler\.

The path of a particle traveling on the loss manifold𝜸:\[t0,t1\]→ℳ\\boldsymbol\{\\gamma\}:\[t\_\{0\},t\_\{1\}\]\\rightarrow\\mathcal\{M\}can also be expressed as a parameter space curve𝜶:\[t0,t1\]→Θ\\boldsymbol\{\\alpha\}:\[t\_\{0\},t\_\{1\}\]\\rightarrow\\Theta, such that𝜸\(t\)=h\(𝜶\(t\)\)\\boldsymbol\{\\gamma\}\(t\)=h\(\\boldsymbol\{\\alpha\}\(t\)\), illustrated in Figure[3](https://arxiv.org/html/2605.15459#S2.F3)\. At any time along the path, we characterize the energy of the particle by its kinetic and potential energies:

Ekin=T\(𝜶,𝜶˙\)=12𝜶˙⊤𝐆\(𝜶\)𝜶˙Epot=V\(𝜶\)=κ⋅\(ℒ\(𝜶\)−ℒ\(𝜽glob∗\)\)E\_\{\\text\{kin\}\}=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=\\frac\{1\}\{2\}\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\qquad\\qquad E\_\{\\text\{pot\}\}=V\(\\boldsymbol\{\\alpha\}\)=\\kappa\\cdot\\left\(\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\mathcal\{L\}\(\\boldsymbol\{\\theta\}^\{\\ast\}\_\{\\text\{glob\}\}\)\\right\)\(5\)where𝜶˙=ddt𝜶\(t\)\\dot\{\\boldsymbol\{\\alpha\}\}=\\frac\{d\}\{dt\}\\boldsymbol\{\\alpha\}\(t\)is the velocity, and we follow standard notation conventions and neglect the explicit time\-dependency unless needed\. Note that𝜽glob∗\\boldsymbol\{\\theta\}^\{\\ast\}\_\{\\text\{glob\}\}denotes any global minimum and thatκ\>0\\kappa\>0is a constant encoding the coupling of mass \(which we set to unity\) and gravitational acceleration\. The particle dynamics are governed by the Euler\-Lagrange \(EL\) equations, which are derived from the Lagrangian defined as the difference between kinetic and potential energy\. The EL equations simplify under the definition of the manifold \(see Appendix[B](https://arxiv.org/html/2605.15459#A2)\), and we get a second\-order ordinary differential equation \(ODE\) governing the acceleration:

𝜶¨=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅grad⁡ℒ\(𝜶\)\.\\ddot\{\\boldsymbol\{\\alpha\}\}=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.\(6\)where𝐇ℒ\(𝜶\):=∇𝜽2ℒ\(𝜶\)\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\):=\\nabla\_\{\\boldsymbol\{\\theta\}\}^\{2\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)is the Hessian of the loss andgrad⁡ℒ\(𝜶\)=𝐆\(𝜶\)−1∇θℒ\(𝜶\)\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)^\{\-1\}\\nabla\_\{\\theta\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)is the Riemannian gradient, hence guiding motion toward the steepest descent, which corresponds to a gravitational pull on the particle traveling on the manifold\. Remark that the Hessian term relates to the kinetic energy and that theκ\\kappaterm relates to the potential energy\. In practice, we can efficiently solve the second\-order ODE numerically using automatic differentiation\.

![Refer to caption](https://arxiv.org/html/2605.15459v1/x3.png)Figure 4:Top:With the white initial velocity, the systems affected by a gravity pull remain closer to the minimum level set, yet without dissipation the particle never converges\. With dissipation via a friction force, convergence is guaranteed, and our speed\-dependent dissipation function allows traveling further than constant friction\.Bottom:With the black initial velocity, curves affected by gravity oscillates around the minimum, while the dissipative systems converge steadily at different speeds\. The free particle can travel forever\.In the following two paragraphs, we highlight key properties, and the main issue with the dynamical system developed so far for exploring the submanifold of minima\. In Figure[4](https://arxiv.org/html/2605.15459#S3.F4)we show the behavior of the system as is, and under our improved dynamics presented in Subsection[3\.1](https://arxiv.org/html/2605.15459#S3.SS1)\.

#### The straightest paths onℳ\\mathcal\{M\}\.

The Lagrangian dynamics presented in Equation \([6](https://arxiv.org/html/2605.15459#S3.E6)\) confirm the well\-known fact that geodesics \- which are the straightest paths on the manifold \- are curves that extremize the kinetic energy integral, since the dynamics coincide with the geodesic equations forκ=0\\kappa=0, i\.e\. when we disregard the potential energy\. A particle with dynamics governed only by kinetic energy is a*free particle*, meaning that it has zero tangential acceleration and constant velocity along the path\. Moreover in the case ofκ\>0\\kappa\>0, the resulting trajectories can be interpreted as geodesics on a Finsler manifold under a modified metric:

F\(𝜶,𝜶˙\)=𝜶˙⊤𝐆\(𝜶\)𝜶˙\+2κ⋅ℒ\(𝜶\)\.F\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=\\sqrt\{\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+2\\kappa\\cdot\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\}\.\(7\)For further details we refer to textbooks on Riemannian mechanics, e\.g\.Bucataru and Miron \([2007](https://arxiv.org/html/2605.15459#bib.bib27)\)\.

#### Particles that never rest\.

A free particle \(κ=0\\kappa=0\) has constant kinetic energy and no potential energy, and can therefore travel forever on a complete manifold \- its trajectory length is unbounded and scales with integration time\. Introducing a potential energy term can be interpreted as a gravitational pull on the particle\. While this affects the individual energy components along the trajectory, the total energy of the system is conserved; motion up\- or downhill on the loss surface simply trades kinetic for potential energy and vice versa\. Hence a particle with non\-zero initial velocity and dynamics governed by Equation \([6](https://arxiv.org/html/2605.15459#S3.E6)\) never comes to rest and will instead oscillate around the minimum level set\. This is problematic, especially in high\-dimensional settings, where the dynamics are both computationally expensive to solve and difficult to simulate over long time horizons\. See Appendix[B](https://arxiv.org/html/2605.15459#A2)for details\.

### 3\.1The secret ingredient: energy dissipation through speed\-dependent friction

An obvious way to remove energy from the system \- or formally to*dissipate*energy \- is through the introduction of external forces, e\.g\.\(Minguzzi,[2015](https://arxiv.org/html/2605.15459#bib.bib15); Cline,[2017](https://arxiv.org/html/2605.15459#bib.bib16)\)\. In the following we break energy conservation by introducing*friction*; a dissipative force that continuously removes energy from the system\. For a dissipation functionη\(t\)\>0\\eta\(t\)\>0, the alternative dynamical system is:

𝜶¨=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅grad⁡ℒ\(𝜶\)−η\(t\)⋅𝜶˙,\\ddot\{\\boldsymbol\{\\alpha\}\}=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\eta\(t\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\},\(8\)for which we describe key properties of the total energy of the system in Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1)and Theorem[1](https://arxiv.org/html/2605.15459#Thmtheorem1)\. We provide proofs for all theorems and lemmas in Appendix[C](https://arxiv.org/html/2605.15459#A3)\.

###### Lemma 1\(Energy dissipation and boundedness\)\.

LetH\(t\)=T\(𝛂,𝛂˙,t\)\+V\(𝛂,t\)H\(t\)=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\+V\(\\boldsymbol\{\\alpha\},t\)be the the total energy of a particle moving along𝛂\(t\)\\boldsymbol\{\\alpha\}\(t\)at any timet\>t0t\>t\_\{0\}whereT,V≥0T,V\\geq 0and assume thatH\(t0\)<∞H\(t\_\{0\}\)<\\infty\. Under the dissipative dynamics in \([8](https://arxiv.org/html/2605.15459#S3.E8)\) whereη\(t\)\>0\\eta\(t\)\>0is the dissipation function and holds for allt≥t0t\\geq t\_\{0\}, the total energy dissipates over time asH˙=−2η\(t\)T\(𝛂,𝛂˙,t\)\\dot\{H\}=\-2\\eta\(t\)T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)and converges toH∞:=limt→∞H\(t\)H\_\{\\infty\}:=\\lim\_\{t\\rightarrow\\infty\}H\(t\)\. For allt≥t0t\\geq t\_\{0\}the total energy is bounded below as:

0≤H∞≤H\(t\)≤H\(t0\)<∞0\\leq H\_\{\\infty\}\\leq H\(t\)\\leq H\(t\_\{0\}\)<\\infty

###### Theorem 1\(Convergence of motion\)\.

Under the assumptions in Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1), and further assuming that the sublevel setΩ=\{\(𝛂,𝛂˙\):H\(𝛂,𝛂˙\)≤H\(t0\)\}\\Omega=\\\{\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\):H\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\\leq H\(t\_\{0\}\)\\\}is compact, all trajectories converge by LaSalle’s invariance principle to the largest invariant set

ℐ=\{\(𝜶,𝜶˙\):𝜶˙=𝟎,grad⁡ℒ\(𝜶\)=𝟎\},\\mathcal\{I\}=\\left\\\{\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\):\\dot\{\\boldsymbol\{\\alpha\}\}=\\boldsymbol\{0\},\\ \\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\\boldsymbol\{0\}\\right\\\},consisting of equilibria of the loss\. Assuming thatℒ\\mathcal\{L\}satisfies the strict saddle property, the attracting elements ofℐ\\mathcal\{I\}are almost surely local minima and converge to any𝛉∈S∗\(𝒟\)\\boldsymbol\{\\theta\}\\in S\_\{\\ast\}\(\\mathcal\{D\}\)\.

#### A speed\-dependent dissipation function\.

We suggest defining the dissipation function such that the friction force scales with the speed of the particle, i\.e\. the square root of the kinetic energy, thereby dissipating energy more at higher speed:

η\(t\)=η0T\(𝜶,𝜶˙,t\)=η0∥𝜶˙\(t\)∥1\+∥∇𝜽ℒ\(𝜶\(t\)\)∥2cos2⁡\(ω\)\\eta\(t\)=\\eta\_\{0\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\}=\\eta\_\{0\}\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)\\rVert\\sqrt\{1\+\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\)\\rVert^\{2\}\\cos^\{2\}\(\\omega\)\}\(9\)whereη0\\eta\_\{0\}is a hyperparameter controlling the effect of the dissipative force andω\\omegadenotes the angle between the velocity𝜶˙\(t\)\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)and the gradient∇𝜽ℒ\(𝜶\(t\)\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\)\. Together with Equation \([8](https://arxiv.org/html/2605.15459#S3.E8)\), this dissipation function is what constitutes the sampling dynamics used byDiMS\.

![Refer to caption](https://arxiv.org/html/2605.15459v1/x4.png)Figure 5:The dynamics of the proposed improved dynamical system\. Depending on the initial velocity sample𝒗~∼q\(𝒗~\)\\tilde\{\\boldsymbol\{v\}\}\\sim q\(\\tilde\{\\boldsymbol\{v\}\}\),DiMSis capable of sampling distinct minimum level sets by dissipating energy particularly when moving proportional to the gradient, and eventually stop\. In contrast the geodesic path is unconstrained and requires defining a stop time\.We remark that the dissipation strength directly depends on the Euclidean norm of the intrinsic velocity,∥𝜶˙\(t\)∥\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)\\rVert, revealing that the particle slows down faster at high speeds\. Furthermore, the dissipation strength depends explicitly on the angleω\\omegabetween the velocity of the particle and the steepness of the surface expressed by the gradient\. As such, the energy dissipation is maximized when the velocity is proportional to the gradient, i\.e\.𝜶˙\(t\)∝∇𝜽ℒ\(𝜶\(t\)\)\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)\\propto\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\), while dissipation only depends on the speed of the Euclidean parameter space curve when the velocity is perpendicular to the gradient, i\.e\.𝜶˙\(t\)⟂∇𝜽ℒ\(𝜶\(t\)\)\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)\\perp\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\)\. In other words, the particle is slowed down the least when moving in the direction∇θ⟂ℒ\(𝜶\(t\)\)\\nabla\_\{\\theta\}^\{\\perp\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\)that traverses the current level set\.

In Theorems[2](https://arxiv.org/html/2605.15459#Thmtheorem2)and[3](https://arxiv.org/html/2605.15459#Thmtheorem3)we extend the previous convergence analysis to this specific dissipation function, from which we get a maximal convergence time and an exploration guarantee before convergence\.

###### Theorem 2\(Convergence time\)\.

For a particle that traces the curve𝛂:\[t0,t1\]→Θ\\boldsymbol\{\\alpha\}:\[t\_\{0\},t\_\{1\}\]\\rightarrow\\Thetawhich converges to𝛂∞∈ℐ\\boldsymbol\{\\alpha\}\_\{\\infty\}\\in\\mathcal\{I\}ast1→∞t\_\{1\}\\rightarrow\\inftyaccording to Theorem[1](https://arxiv.org/html/2605.15459#Thmtheorem1), letH\(t\)H\(t\)be the total energy at timettdescribed in Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1)\. Let furthermoretϵ=inf\{t≥t1:T\(𝛂,𝛂˙,t\)<ϵ\}t\_\{\\epsilon\}=\\inf\\\{t\\geq t\_\{1\}:T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)<\\epsilon\\\}be the proxy convergence time at which the kinetic energy is sufficiently low to consider the particle not moving much\. Then for dissipation functionη\(t\)=η0T\(𝛂,𝛂˙,t\)≥0\\eta\(t\)=\\eta\_\{0\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\}\\geq 0, an upper bound on the proxy convergence time is

tϵ≤t0\+H\(t0\)\(2η0ϵ3/2\)−1\.t\_\{\\epsilon\}\\leq t\_\{0\}\+H\(t\_\{0\}\)\\left\(2\\eta\_\{0\}\\epsilon^\{3/2\}\\right\)^\{\-1\}\.

###### Theorem 3\(Lower bound on the𝜸\\boldsymbol\{\\gamma\}\-arc\-length\)\.

Under the assumptions in Theorem[2](https://arxiv.org/html/2605.15459#Thmtheorem2), and further definingΔℒ=ℒ\(𝛂\(t0\)\)−ℒ\(𝛂∞\)\\Delta\\mathcal\{L\}=\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\\left\(t\_\{0\}\\right\)\)\-\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\_\{\\infty\}\), the arc\-length of the curve𝛄\(t\)=h\(𝛂\)∈ℳ\\boldsymbol\{\\gamma\}\(t\)=h\(\\boldsymbol\{\\alpha\}\)\\in\\mathcal\{M\}satisfies

T\(𝜶,𝜶˙,t0\)\+κΔℒ2η0H\(t0\)≤Length\(𝜸\)\.\\frac\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\_\{0\}\)\+\\kappa\\Delta\\mathcal\{L\}\}\{\\sqrt\{2\}\\eta\_\{0\}H\(t\_\{0\}\)\}\\leq\\textsc\{Length\}\(\\boldsymbol\{\\gamma\}\)\.

From Theorem[2](https://arxiv.org/html/2605.15459#Thmtheorem2)we conclude that the time,tϵt\_\{\\epsilon\}, at which the particle converges to a low energy\-state is finite and depends on the initial total energy, friction coefficient and the allowed speed\-toleranceϵ\\epsilon\. While the bound may not be tight, it establishes that the dynamics converge in finite time, which is of practical and theoretical significance\. Theorem[3](https://arxiv.org/html/2605.15459#Thmtheorem3)reveals that the particle travels at least a certain distance and that this distance is also based on the initial energy of the system\. Hence, sampling with our proposed dynamics guarantees some exploration of the manifold before convergence\.

### 3\.2How do we obtain samples fromS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\)?

Essentially, to generate samples withDiMS, it suffices to appropriately initialize the dynamical system\. Since we want to exploit the modified dynamical system defined in Equations \([8](https://arxiv.org/html/2605.15459#S3.E8)\)\-\([9](https://arxiv.org/html/2605.15459#S3.E9)\) to sample the submanifold of minimaS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\), we typically set the particle’s initial position𝝁~\\tilde\{\\boldsymbol\{\\mu\}\}to a minimum solution𝜽∗∈S∗\(𝒟\)\\boldsymbol\{\\theta\}^\{\\ast\}\\in S\_\{\\ast\}\(\\mathcal\{D\}\)obtained by training a neural network\. This choice is purely practical as it reduces run time, and we remark that convergence toS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\)is guaranteed for any initial position\.

A point estimate obtained via optimization corresponds to a stable equilibrium of the loss function, meaning that it can not move without an external force acting on it\. We therefore need an initial velocity𝒗~\\tilde\{\\boldsymbol\{v\}\}along which motion is initialized\. The particle path𝜶𝝁~,𝒗~,η0:\[0,t1\]→Θ\\boldsymbol\{\\alpha\}\_\{\\tilde\{\\boldsymbol\{\\mu\}\},\\tilde\{\\boldsymbol\{v\}\},\\eta\_\{0\}\}:\[0,t\_\{1\}\]\\rightarrow\\Thetais then computed by numerically integrating the second\-order ODE with initial conditions𝜶\(0\)=𝝁~\\boldsymbol\{\\alpha\}\(0\)=\\tilde\{\\boldsymbol\{\\mu\}\}and𝜶˙\(0\)=𝒗~\\dot\{\\boldsymbol\{\\alpha\}\}\(0\)=\\tilde\{\\boldsymbol\{v\}\}\. We treat the initial velocity as a random variable and place a distribution over it,𝒗~∼q\(𝒗~\)\\tilde\{\\boldsymbol\{v\}\}\\sim q\(\\tilde\{\\boldsymbol\{v\}\}\), to ensure variation over the constructed paths\. The distribution can freely be chosen, however we define it heuristically as the Laplace approximation; a Gaussian respecting the local loss geometry at𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}, which is common in approximate Bayesian inference\(MacKay,[1992](https://arxiv.org/html/2605.15459#bib.bib36); Daxbergeret al\.,[2021](https://arxiv.org/html/2605.15459#bib.bib20); Immeret al\.,[2021b](https://arxiv.org/html/2605.15459#bib.bib21); Bergaminet al\.,[2023](https://arxiv.org/html/2605.15459#bib.bib3); Yuet al\.,[2024](https://arxiv.org/html/2605.15459#bib.bib4)\)\. We discuss this further in Appendix[A\.2](https://arxiv.org/html/2605.15459#A1.SS2)\.

Our sampler generates an implicit distribution over the submanifold of minima and can be formalized:

𝒗~∼q\(𝒗~\)=𝒩\(𝟎,𝐇ℒ\(𝝁~\)−1\)𝜽\(s\)=DiMS\(𝒗~\|t1,η0,𝝁~,𝒟\):=𝜶𝝁~,𝒗~,η0\(t1\)∈S∗\(𝒟\)\\tilde\{\\boldsymbol\{v\}\}\\sim q\(\\tilde\{\\boldsymbol\{v\}\}\)=\\mathcal\{N\}\\left\(\\boldsymbol\{0\},\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\tilde\{\\boldsymbol\{\\mu\}\}\)^\{\-1\}\\right\)\\quad\\boldsymbol\{\\theta\}^\{\(s\)\}=\\textsc\{DiMS\}\(\\tilde\{\\boldsymbol\{v\}\}\|t\_\{1\},\\eta\_\{0\},\\tilde\{\\boldsymbol\{\\mu\}\},\\mathcal\{D\}\):=\\boldsymbol\{\\alpha\}\_\{\\tilde\{\\boldsymbol\{\\mu\}\},\\tilde\{\\boldsymbol\{v\}\},\\eta\_\{0\}\}\(t\_\{1\}\)\\in S\_\{\\ast\}\(\\mathcal\{D\}\)\(10\)where convergence toS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\)is guaranteed for large enought1t\_\{1\}\. As pointed out in Theorem[2](https://arxiv.org/html/2605.15459#Thmtheorem2)this time depends on the hyperparameterη0\\eta\_\{0\}that controls exploration\. Flipping the relation, we argue that selectingη0\\eta\_\{0\}only depends on the computational budget available and that smallerη0\\eta\_\{0\}are preferred\. We describe how to obtain samples withDiMScomputationally in Appendix[D](https://arxiv.org/html/2605.15459#A4)\.

## 4Related work

#### Loss landscape geometry\.

Understanding the geometry of the loss landscape has been a fundamental research topic in modern machine learning\.Dinhet al\.\([2017](https://arxiv.org/html/2605.15459#bib.bib32)\)showed that neural network parameter spaces contain reparameterization invariant sets over which loss properties such as flatness are non\-constant, motivating geometry\-aware methods that are invariant to such reparameterizations\. More recentlyGaripovet al\.\([2018](https://arxiv.org/html/2605.15459#bib.bib30)\)andDraxleret al\.\([2018](https://arxiv.org/html/2605.15459#bib.bib31)\)revealed that independently trained neural networks are connected by low\-loss paths, suggesting that the reparameterization invariant set that minimizes the loss is well\-described as a connected component\.Entezariet al\.\([2021](https://arxiv.org/html/2605.15459#bib.bib42)\)conjectured that this connectivity is explained by permutation symmetries within the parameter space, withZhaoet al\.\([2025](https://arxiv.org/html/2605.15459#bib.bib43)\)extending this line of work to continuous symmetries\.

#### Flexible posterior approximations\.

A conceptually simple approximate posterior technique is the Laplace approximation\(MacKay,[1992](https://arxiv.org/html/2605.15459#bib.bib36)\)that defines the approximate distribution as a Gaussian with covariance given by the inverse Hessian\. In neural network parameter spaces this assumption is often impractical and follow up works has instead used the the Generalized Gauss\-Newton as the covariance\(Daxbergeret al\.,[2021](https://arxiv.org/html/2605.15459#bib.bib20)\)or linearized the function space samples\(Immeret al\.,[2021b](https://arxiv.org/html/2605.15459#bib.bib21)\)\. Trajectory\-based methods such as SWAG\(Maddoxet al\.,[2019](https://arxiv.org/html/2605.15459#bib.bib49)\)and deep ensembles provide scalable uncertainty estimates but remain agnostic to the geometric structure of the loss landscape\. A more principled line of work suggests exploiting the local geometry when defining the approximate posterior;Bergaminet al\.\([2023](https://arxiv.org/html/2605.15459#bib.bib3)\)introduced the Riemannian Laplace approximation that defines a distribution in low\-loss regions using the pullback metric defined previously, whereasYuet al\.\([2024](https://arxiv.org/html/2605.15459#bib.bib4)\)corrected an inherent bias of the method by replacing the metric with the Fisher metric\. In contrast our approach adjusts the bias by modifying the dynamical system\. Other works\(Royet al\.,[2024](https://arxiv.org/html/2605.15459#bib.bib2); Fadelet al\.,[2025](https://arxiv.org/html/2605.15459#bib.bib29); Reichlinet al\.,[2025](https://arxiv.org/html/2605.15459#bib.bib5)\)followed alternative approaches that used a diffusion\-like exploration of the reparameterization invariant solutions in the sampling process\. While geometrically principled, these approaches are inherently local and confined to a single loss basin\.

#### Mechanical systems in optimization and sampling theory\.

First\-order optimization methods with momentum such as Polyak’s heavy\-ball\(Polyak,[1964](https://arxiv.org/html/2605.15459#bib.bib44)\)and Nesterov’s accelerated gradient\(Nesterov,[1983](https://arxiv.org/html/2605.15459#bib.bib45)\)correspond to discretizations of second\-order dissipative ODEs, a connection clarified bySuet al\.\([2016](https://arxiv.org/html/2605.15459#bib.bib48)\)andKovachki and Stuart \([2021](https://arxiv.org/html/2605.15459#bib.bib28)\)\. In the Euclidean setting with constant friction, our proposed continuous\-time dynamics recover Polyak’s heavy\-ball exactly and can thus be seen as a Riemannian generalization of this classical system\. Sampling algorithms based on mechanics \- such as stochastic gradient Langevin dynamics \(SGLD\)\(Welling and Teh,[2011](https://arxiv.org/html/2605.15459#bib.bib10)\), Hamiltonian Monte Carlo \(HMC\)\(Neal,[2011](https://arxiv.org/html/2605.15459#bib.bib40)\)or its Lagrangian reformulation\(Lanet al\.,[2012](https://arxiv.org/html/2605.15459#bib.bib51)\), along with their Riemannian variants\(Girolami and Calderhead,[2011](https://arxiv.org/html/2605.15459#bib.bib39); Chenet al\.,[2014](https://arxiv.org/html/2605.15459#bib.bib38)\)\- are guaranteed to recover the full posterior in the limit of infinite sampling time, which is rarely achieved in practice\.

## 5Experiments

![Refer to caption](https://arxiv.org/html/2605.15459v1/x5.png)\(a\)Trained initial𝝁~=𝜽∗∈S∗\(𝒟\)\\tilde\{\\boldsymbol\{\\mu\}\}=\\boldsymbol\{\\theta\}^\{\\ast\}\\in S\_\{\\ast\}\(\\mathcal\{D\}\)
![Refer to caption](https://arxiv.org/html/2605.15459v1/x6.png)\(b\)Random initial position𝝁~∉S∗\(𝒟\)\\tilde\{\\boldsymbol\{\\mu\}\}\\not\\in S\_\{\\ast\}\(\\mathcal\{D\}\)

Figure 6:Function space samples obtained byDiMSgive high OOD estimates without breaking the fit on training data, whichRLAdoes not guarantee\. Even when initialized from a suboptimal position,DiMSprovides well\-behaved function space samples\. Red and white dots are ID and OOD data, respectively, and the black line is the function induced by the initial position parameters\. Uncertainty ranges are based on standard deviation and we plot the average over the ensemble as colored lines\.We evaluateDiMSon uncertainty quantification tasks and compare with methods based on the Laplace approximation, including the sampled \(LA\), linearized \(LinLA\) and Riemannian \(RLA\) Laplace approximations\. For all experiments we apply a Bayesian treatment of*all*network parameters and tune the prior precision and likelihood parameters \(if any\) using the marginal log\-likelihood as proposed byDaxbergeret al\.\([2021](https://arxiv.org/html/2605.15459#bib.bib20)\)\. We first consider illustrative settings and later extend the analysis to larger neural networks\. All experimental details are provided in Appendix[D](https://arxiv.org/html/2605.15459#A4)·

#### Snelson regression dataset\.

We train a fully connected neural network with22hidden layers of1616hidden units andGELUactivations\. We use2020% dropout during training to increase the number of active neurons\. We set the friction hyperparameter sufficiently lowη0=0\.1\\eta\_\{0\}=0\.1and runDiMSsamples until convergence, whereasRLAper definition stops att1=1t\_\{1\}=1\. The resulting predictive uncertainty based onS=100S=100Monte Carlo \(MC\) samples is shown in Figure[6\(a\)](https://arxiv.org/html/2605.15459#S5.F6.sf1)\. ThoughRLAis highly uncertain out\-of\-domain \(OOD\) it comes at the cost of fitting the data worse in\-domain \(ID\)\.

As argued byBergaminet al\.\([2023](https://arxiv.org/html/2605.15459#bib.bib3)\)the sampled LA underperforms in regression settings, even when the Hessian used as the covariance matrix is not ill\-conditioned and requires either linearization or geometric awareness to obtain a proper fit\. However, this depends on the assumption that the starting position is initialized at an optimum\.DiMSis position\-agnostic in the sense that the gravitational pull forces the particle towardS∗\(𝒟\)S\_\{\\ast\}\(\\mathcal\{D\}\)no matter the starting position\. This setting is similar to a continuous\-time gradient descent with momentum and non\-zero initial velocity, determining the direction to initialize the optimization process\. In Figure[6\(b\)](https://arxiv.org/html/2605.15459#S5.F6.sf2)we show the implication of starting the sampling process from a random initialization of the network withη0=1\.0\\eta\_\{0\}=1\.0and running until reaching a minimum\. We highlight that samples fromDiMSstill converge to solutions that fit the training data well and has variation OOD, whileRLAsamples perform worse under this initialization\. Remark also that the OOD variation forDiMSis more limited when initialized randomly, suggesting that our proposed dynamics allow for maximum exploration when the motion is dominated by the kinetic term rather than the gradient flow dynamics\.

![Refer to caption](https://arxiv.org/html/2605.15459v1/x7.png)Figure 7:2D binary classification on the banana dataset\. The trainedMAPmodel fits the training data well, yet gives unreliable uncertainty estimates\.LAandLinLAinduce predictive distributions over the input space, yet their approximations are too crude to fit the data properly\. In contrast the geometry\-aware methods fit the data well while providing better uncertainty estimates than theMAPmodel, especiallyDiMSthat captures OOD uncertainty well, better thanRLA\. Uncertainty is given by the variance of the predictive distribution and we provide calibration metrics in Appendix[D\.2](https://arxiv.org/html/2605.15459#A4.SS2)\.
#### Banana classification dataset\.

To solve the banana classification task depicted in Figure[7](https://arxiv.org/html/2605.15459#S5.F7), we train a fully connected with44hidden layers of1616units each, usingSiLUactivations,50%50\\%dropout and label smoothing of0\.050\.05, to allow for high OOD variation\. We observe thatDiMSexhibits high OOD uncertainty while remaining confident in regions supported by training data and preserving fine\-grained details of the data distribution\. WhileRLAperforms similarly, it is less uncertain in a large part of the input space\. The remaining methods either underfit the data or provide unreliable OOD uncertainty\.

Table 1:Negative log\-likelihood \(NLL↓\\downarrow\) on UCI classification test sets\. We estimate the predictive distribution withS=30S=30samples and report the mean over 5 seeds, including the standard error\.
#### UCI classification datasets\.

We consider66classification problems from the UCI repository and train a classifier for each task, consisting of a33\-layer network with1616hidden units per layer,SiLUactivations and50%50\\%dropout\. We evaluate these using common uncertainty quantification metrics, specifically the negative log\-likelihood \(NLL\), Brier score, expected calibration error \(ECE\), and maximum calibration error \(MCE\)\. For each seed, metrics are computed by averaging overS=30S=30sampled models, and we report the mean NLL and its standard error across55seeds in Table[1](https://arxiv.org/html/2605.15459#S5.T1)\. The remaining metrics are provided in Appendix[D\.3](https://arxiv.org/html/2605.15459#A4.SS3)and yield similar conclusions\.

Overall,DiMSachieves competitive or superior performance across the datasets, whileRLAoutperformsDiMSon22out of the66datasets\. Notably,DiMSwas run with a fixed default friction coefficient ofη0=0\.5\\eta\_\{0\}=0\.5across all datasets with no per\-dataset tuning – adjusting this hyperparameter yields better performance thanRLAon one of the two datasets whereRLAotherwise has an advantage\. This suggests thatDiMSis robust to the choice of hyperparameters and can achieve stronger performance with minor tuning effort, as further detailed in Appendix[D\.3](https://arxiv.org/html/2605.15459#A4.SS3)\.

#### MNIST and related datasets\.

We train a convolutional neural network \(the LeNet architecture\(LeCunet al\.,[1989](https://arxiv.org/html/2605.15459#bib.bib53)\)\) with approximatelyK=44,000K=44\{,\}000parameters on MNIST and Fashion MNIST and evaluate each method on the test sets along with the test sets of Extended MNIST and KMNIST\. We provide the ID and OOD performance in Tables[2](https://arxiv.org/html/2605.15459#S5.T2)and[3](https://arxiv.org/html/2605.15459#S5.T3)\. To define the Laplace approximations we compute the Hessian based on a randomly selected subset containingN=1,000N=1\{,\}000of the training samples\. We drawSB=20S\_\{B\}=20parameter samples from each method and repeat the sampling process overB=5B=5batches, resulting in a total ofS=100S=100samples per method using a friction coefficient ofη0=0\.1\\eta\_\{0\}=0\.1\. Computing the full Hessian is infeasible for a neural network with a high\-dimensional parameter space as it requires inverting aK×KK\\times Kmatrix\. We use a low\-rank approximation of the Hessian and discuss its implications further in Appendix[D\.4](https://arxiv.org/html/2605.15459#A4.SS4)\.

In general,DiMSis better calibrated and fits ID data better thanRLAwhile sampledLAlacks expressiveness\. As expected,DiMSconsistently outperforms the other methods on OOD data\.

Table 2:In\-distribution performance on MNIST and FMNIST with low\-rank Hessian approximation\.Table 3:Out\-of\-distribution performance measured as AUROC↑\\uparrow\.

## 6Conclusion

We proposedDiMS, a sampler grounded in dissipative mechanics on Riemannian manifolds that is guaranteed to explore the minimum\-loss submanifolds of neural networks, i\.e\. parameter space regions where models perform well on training data while retaining high predictive uncertainty on out\-of\-distribution inputs\. We provided theoretical guarantees that the dynamics converge to these minimum\-loss submanifolds in finite time regardless of the initial conditions, and validated the sampler empirically across various settings\.DiMSconsistently improves over or performs on par with competing Laplace\-based methods on uncertainty quantification tasks, while being robust to its single physically\-motivated hyperparameter –the friction coefficient– which governs energy dissipation in the dynamical system\. We hope this work encourages further research on geometry and physics inspired approaches to minima exploration and uncertainty quantification in deep learning\.

Limitations and future work\.DiMSis, by design, a geometry\-based approach and therefore computationally demanding, as is typical for such methods\. The main cost stems from solving an ODE in a high\-dimensional space, which could potentially be mitigated using tailored or stochastic solvers\. WhileDiMSis robust to the friction hyperparameter, a position\-dependent formulation could be considered\. Sampling reparameterization invariant models requires a sufficiently expressive function space, and it is of interest to characterize the conditions under which this holds\.

## Acknowledgments and Disclosure of Funding

This work was supported by the Danish Data Science Academy, which is funded by the Novo Nordisk Foundation \(NNF21SA0069429\) and VILLUM FONDEN \(40516\), and by the DFF Sapere Aude Starting Grant “GADL”\.

## References

- Riemannian Laplace approximations for Bayesian neural networks\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.15459#S3.SS2.p2.6),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.15459#S5.SS0.SSS0.Px1.p2.2)\.
- I\. Bucataru and R\. Miron \(2007\)Finsler\-lagrange geometry: applications to dynamical systems\.Editura Academiei Române Buchurest\.Cited by:[§B\.1](https://arxiv.org/html/2605.15459#A2.SS1.SSS0.Px3.p1.1),[§3](https://arxiv.org/html/2605.15459#S3.SS0.SSS0.Px1.p1.3)\.
- R\. T\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. K\. Duvenaud \(2018\)Neural ordinary differential equations\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[Appendix D](https://arxiv.org/html/2605.15459#A4.SS0.SSS0.Px1.p1.6)\.
- T\. Chen, E\. Fox, and C\. Guestrin \(2014\)Stochastic gradient Hamiltonian Monte Carlo\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- D\. Cline \(2017\)Variational principles in classical mechanics\.University of Rochester River Campus Librarie\.Cited by:[§3\.1](https://arxiv.org/html/2605.15459#S3.SS1.p1.1)\.
- E\. Daxberger, A\. Kristiadi, A\. Immer, R\. Eschenhagen, M\. Bauer, and P\. Hennig \(2021\)Laplace redux\-effortless Bayesian deep learning\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§3\.2](https://arxiv.org/html/2605.15459#S3.SS2.p2.6),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.15459#S5.p1.1)\.
- L\. Dinh, R\. Pascanu, S\. Bengio, and Y\. Bengio \(2017\)Sharp minima can generalize for deep nets\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p1.1),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px1.p1.1)\.
- M\. P\. Do Carmo and J\. Flaherty Francis \(1992\)Riemannian geometry\.Springer\.Cited by:[§A\.1](https://arxiv.org/html/2605.15459#A1.SS1.p1.9)\.
- J\. R\. Dormand and P\. J\. Prince \(1980\)A family of embedded Runge\-Kutta formulae\.Journal of computational and applied mathematics6\(1\),pp\. 19–26\.Cited by:[Appendix D](https://arxiv.org/html/2605.15459#A4.SS0.SSS0.Px1.p1.6)\.
- F\. Draxler, K\. Veschgini, M\. Salmhofer, and F\. Hamprecht \(2018\)Essentially no barriers in neural network energy landscape\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p1.1),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px1.p1.1)\.
- R\. Entezari, H\. Sedghi, O\. Saukh, and B\. Neyshabur \(2021\)The role of permutation invariance in linear mode connectivity of neural networks\.arXiv preprint arXiv:2110\.06296\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px1.p1.1)\.
- S\. G\. Fadel, H\. Roy, N\. Krämer, Y\. Zainchkovskyy, S\. Syrota, A\. V\. Mahou, C\. H\. Ek, and S\. Hauberg \(2025\)VIKING: Deep variational inference with stochastic projections\.arXiv preprint arXiv:2510\.23684\.Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p3.1),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- T\. Garipov, P\. Izmailov, D\. Podoprikhin, D\. P\. Vetrov, and A\. G\. Wilson \(2018\)Loss surfaces, mode connectivity, and fast ensembling of dnns\.InNeural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p1.1),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px1.p1.1)\.
- M\. Girolami and B\. Calderhead \(2011\)Riemann manifold langevin and hamiltonian monte carlo methods\.Journal of the Royal Statistical Society Series B: Statistical Methodology\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- A\. Immer, M\. Bauer, V\. Fortuin, G\. Rätsch, and K\. M\. Emtiyaz \(2021a\)Scalable marginal likelihood estimation for model selection in deep learning\.InInternational Conference on Machine Learning \(ICML\),Cited by:[Appendix D](https://arxiv.org/html/2605.15459#A4.SS0.SSS0.Px1.p1.6)\.
- A\. Immer, M\. Korzepa, and M\. Bauer \(2021b\)Improving predictions of Bayesian neural nets via local linearization\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),Cited by:[§3\.2](https://arxiv.org/html/2605.15459#S3.SS2.p2.6),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- H\. K\. Khalil and J\. W\. Grizzle \(2002\)Nonlinear systems\.Vol\.3,Prentice hall Upper Saddle River, NJ\.Cited by:[Appendix C](https://arxiv.org/html/2605.15459#A3.SS0.SSS0.Px2.p1.17)\.
- N\. B\. Kovachki and A\. M\. Stuart \(2021\)Continuous time analysis of momentum methods\.Journal of Machine Learning Research \(JMLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- S\. Lan, V\. Stathopoulos, B\. Shahbaba, and M\. Girolami \(2012\)Lagrangian Dynamical Monte Carlo\.arXiv preprint arXiv:1211\.3759\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- Y\. LeCun, B\. Boser, J\. S\. Denker, D\. Henderson, R\. E\. Howard, W\. Hubbard, and L\. D\. Jackel \(1989\)Backpropagation applied to handwritten zip code recognition\.Neural computation\.Cited by:[§5](https://arxiv.org/html/2605.15459#S5.SS0.SSS0.Px4.p1.7)\.
- J\. M\. Lee \(2018\)Introduction to riemannian manifolds\.Springer\.Cited by:[§A\.1](https://arxiv.org/html/2605.15459#A1.SS1.p1.9)\.
- H\. Li, Z\. Xu, G\. Taylor, C\. Studer, and T\. Goldstein \(2018\)Visualizing the Loss Landscape of Neural Nets\.InNeural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p1.1)\.
- D\. J\. MacKay \(1992\)A practical Bayesian framework for backpropagation networks\.Neural computation\.Cited by:[§3\.2](https://arxiv.org/html/2605.15459#S3.SS2.p2.6),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- W\. J\. Maddox, P\. Izmailov, T\. Garipov, D\. P\. Vetrov, and A\. G\. Wilson \(2019\)A simple baseline for Bayesian uncertainty in deep learning\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- E\. Minguzzi \(2015\)Rayleigh’s dissipation function at work\.European Journal of Physics\.Cited by:[§3\.1](https://arxiv.org/html/2605.15459#S3.SS1.p1.1)\.
- R\. M\. Neal \(2011\)MCMC using Hamiltonian dynamics\.Handbook of Markov Chain Monte Carlo\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- Y\. Nesterov \(1983\)A method for solving the convex programming problem with convergence rateO\(1/k2\)O\(1/k^\{2\}\)\.InDokl akad nauk Sssr,Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- B\. T\. Polyak \(1964\)Some methods of speeding up the convergence of iteration methods\.Ussr computational mathematics and mathematical physics\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- A\. Reichlin, M\. Vasco, and D\. Kragic Jensfelt \(2025\)Walking on the Fiber: A Simple Geometric Approximation for Bayesian Neural Networks\.Transactions on Machine Learning Research \(TMLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- H\. Roy, M\. Miani, C\. H\. Ek, P\. Hennig, M\. Pförtner, L\. Tatzel, and S\. Hauberg \(2024\)Reparameterization invariance in approximate Bayesian inference\.Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p3.1),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- W\. Su, S\. Boyd, and E\. J\. Candes \(2016\)A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights\.Journal of Machine Learning Research \(JMLR\)\.Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- M\. Welling and Y\. W\. Teh \(2011\)Bayesian learning via stochastic gradient Langevin dynamics\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px3.p1.1)\.
- H\. Yu, M\. Hartmann, B\. W\. M\. Sanchez, M\. Girolami, and A\. Klami \(2024\)Riemannian Laplace Approximation with the Fisher Metric\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),Cited by:[§1](https://arxiv.org/html/2605.15459#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.15459#S3.SS2.p2.6),[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px2.p1.1)\.
- B\. Zhao, N\. Dehmamy, R\. Walters, and R\. Yu \(2025\)Understanding Mode Connectivity via Parameter Space Symmetry\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§4](https://arxiv.org/html/2605.15459#S4.SS0.SSS0.Px1.p1.1)\.

## Appendix ADetails on the loss manifold

### A\.1Quantities of the pullback metric required for the derivations

The metric is given by:

𝐆\(𝜽\)=𝕀K\+∇𝜽ℒ\(𝜽\)∇𝜽ℒ\(𝜽\)⊤,\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)=\\mathbb\{I\}\_\{K\}\+\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\},\(A\.1\)The inverse metric and its square root follows from the Sherman\-Morisson formula:

𝐆\(𝜽\)−1=𝕀K−∇𝜽ℒ\(𝜽\)∇𝜽ℒ\(𝜽\)⊤1\+‖∇𝜽ℒ\(𝜽\)‖2,\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-1\}=\\mathbb\{I\}\_\{K\}\-\\frac\{\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\}\{1\+\\left\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\right\\rVert^\{2\}\},\(A\.2\)𝐆\(𝜽\)−12=𝕀K−∇𝜽ℒ\(𝜽\)∇𝜽ℒ\(𝜽\)⊤1\+‖∇𝜽ℒ\(𝜽\)‖2\+1\+‖∇𝜽ℒ\(𝜽\)‖2\.\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-\\frac\{1\}\{2\}\}=\\mathbb\{I\}\_\{K\}\-\\frac\{\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\}\{1\+\\left\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\right\\rVert^\{2\}\+\\sqrt\{1\+\\left\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\right\\rVert^\{2\}\}\}\.\(A\.3\)The Riemannian gradient can then be expressed as:

grad⁡ℒ\(𝜽\)=𝐆\(𝜽\)−1∇𝜽ℒ\(𝜽\)=∇𝜽ℒ\(𝜽\)1\+∥∇𝜽ℒ\(𝜽\)∥2\.\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)=\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-1\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)=\\frac\{\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\}\{1\+\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\rVert^\{2\}\}\.\(A\.4\)The Christoffel symbols are needed to solve the Euler\-Lagrange equations\. These describe how coordinate basis vectors change along the manifold and determine how tangent vectors are corrected when moving on the manifold\. We refer to classic textbooks on differential geometry, e\.g\.Do Carmo and Flaherty Francis \[[1992](https://arxiv.org/html/2605.15459#bib.bib26)\], Lee \[[2018](https://arxiv.org/html/2605.15459#bib.bib25)\], for further reading\. For the specific definition of the manifold, the Christoffel symbols simplify to:

Γijk\(𝜽\)=\(∂2∂θi∂θjℒ\(𝜽\)⋅∑mK𝐆km−1\(𝜽\)∂∂θmℒ\(𝜽\)\)\.\\Gamma\_\{ij\}^\{k\}\(\\boldsymbol\{\\theta\}\)=\\left\(\\frac\{\\partial^\{2\}\}\{\\partial\\theta\_\{i\}\\partial\\theta\_\{j\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\cdot\\sum\_\{m\}^\{K\}\\mathbf\{G\}\_\{km\}^\{\-1\}\(\\boldsymbol\{\\theta\}\)\\frac\{\\partial\}\{\\partial\\theta\_\{m\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\right\)\.\(A\.5\)When computing the Christoffel symbols, we leveraged the closed form expression of the derivative of the metric that is the outer product of the Hessian columns with the gradient of the loss\. By letting𝐇ℒ\(k\)\(𝜽\)\\mathbf\{H\}\_\{\\mathcal\{L\}\}^\{\(k\)\}\(\\boldsymbol\{\\theta\}\)denote thekk’th column of the Hessian ofℒ\(𝜽\)\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\), the derivative of the metric is:

∂∂θk𝐆\(𝜽\)=𝐇ℒ\(k\)\(𝜽\)∇𝜽ℒ\(𝜽\)⊤\+∇𝜽ℒ\(𝜽\)𝐇ℒ\(k\)\(𝜽\)⊤\.\\frac\{\\partial\}\{\\partial\\theta\_\{k\}\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)=\\mathbf\{H\}\_\{\\mathcal\{L\}\}^\{\(k\)\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\+\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\mathbf\{H\}\_\{\\mathcal\{L\}\}^\{\(k\)\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\.\(A\.6\)For defining how the energy of a particle evolves over time, we need a description of how the metric evolves along the curve\. We therefore establish that the time\-derivative of the metric along𝜶\(t\)\\boldsymbol\{\\alpha\}\(t\)is:

𝐆˙\(𝜶\)=∑k=1Kα˙k∂∂θk𝐆\(𝜶\)=𝐇ℒ\(𝜶\)𝜶˙∇𝜽ℒ\(𝜶\)⊤\+∇𝜽ℒ\(𝜶\)𝜶˙⊤𝐇ℒ\(𝜶\)\\dot\{\\mathbf\{G\}\}\(\\boldsymbol\{\\alpha\}\)=\\sum\_\{k=1\}^\{K\}\\dot\{\\alpha\}\_\{k\}\\frac\{\\partial\}\{\\partial\\theta\_\{k\}\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)=\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)^\{\\top\}\+\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\(A\.7\)Furthermore, the𝐆\\mathbf\{G\}\-norm of the gradient is:

∥∇𝜽ℒ\(𝜽\)∥𝐆\(𝜽\)=∇𝜽ℒ\(𝜽\)⊤𝐆\(𝜽\)∇𝜽ℒ\(𝜽\)=∥∇𝜽ℒ\(𝜽\)∥2\+∥∇𝜽ℒ\(𝜽\)∥4\.\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\rVert\_\{\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\}=\\sqrt\{\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\}=\\sqrt\{\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\rVert^\{2\}\+\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\theta\}\)\\rVert^\{4\}\}\.\(A\.8\)

### A\.2Geometry\-driven posterior approximations on the loss manifold

We consider the manifoldℳ\\mathcal\{M\}defined as the graph of the negative log\-posterior loss on the training data as in Equation \([1](https://arxiv.org/html/2605.15459#S2.E1)\)\. For posterior approximations defined as the pushforward of a Gaussian in the tangent space via some map, we provide a schematic overview of the relations between the ambient tangent space𝒯𝝁ℳ\\mathcal\{T\}\_\{\\boldsymbol\{\\mu\}\}\\mathcal\{M\}, intrinsic tangent space𝒯𝝁~Θ\\mathcal\{T\}\_\{\\tilde\{\\boldsymbol\{\\mu\}\}\}\\Theta, parameter spaceΘ\\Thetaand the manifoldℳ\\mathcal\{M\}\. The schematic is meant as a side\-kick to what we depicted in Figure[3](https://arxiv.org/html/2605.15459#S2.F3)\. We remark that we do not requires the pushforward map to be a diffeomorphism, yet remark that for non\-bijective maps, the inverse mapping from samples to the initial position does not exist\. This limits obtaining an explicit approximate posterior via the change\-of\-variables formula, yet admits an implicit approximate posterior obtained through sampling\.

𝒯𝝁~Θ\\mathcal\{T\}\_\{\\tilde\{\\boldsymbol\{\\mu\}\}\}\\ThetaΘ\\Thetaℳ\\mathcal\{M\}𝒯𝝁ℳ\\mathcal\{T\}\_\{\\boldsymbol\{\\mu\}\}\\mathcal\{M\}ψ\\psi\(ψ−1\)\\left\(\\psi^\{\-1\}\\right\)h−1h^\{\-1\}hh\(ϕ−1\)\\left\(\\phi^\{\-1\}\\right\)ϕ\\phi𝐉h†\(𝝁\)\\mathbf\{J\}\_\{h\}^\{\\dagger\}\(\\boldsymbol\{\\mu\}\)𝐉h\(𝝁~\)\\mathbf\{J\}\_\{h\}\(\\tilde\{\\boldsymbol\{\\mu\}\}\)

The speed of a curve𝜸\(t\)∈ℳ\\boldsymbol\{\\gamma\}\(t\)\\in\\mathcal\{M\}at timettis given by the norm of its velocity,‖𝜸˙\(t\)‖\\left\\lVert\\dot\{\\boldsymbol\{\\gamma\}\}\(t\)\\right\\rVert\. Recall that an element of the ambient tangent space can be expressed in intrinsic coordinates as𝐉h\(𝜽\)𝒗~\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}, and using the Moore\-Penrose inverse,𝐉h\(𝜽\)†=\(𝐉h\(𝜽\)⊤𝐉h\(𝜽\)\)−1𝐉h\(𝜽\)⊤=𝐆\(𝜽\)−1𝐉h\(𝜽\)⊤\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\dagger\}=\\left\(\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\right\)^\{\-1\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}=\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-1\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\top\}we can hence map velocity samples from the ambient tangent space to their parameter space coordinates, since the mapping is preserved:

𝒗~=𝐉h\(𝜽\)†𝒗=𝐉h\(𝜽\)†𝐉h\(𝜽\)𝒗~=𝐆\(𝜽\)−1𝐆\(𝜽\)𝒗~=𝒗~\.\\tilde\{\\boldsymbol\{v\}\}=\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\dagger\}\\boldsymbol\{v\}=\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)^\{\\dagger\}\\mathbf\{J\}\_\{h\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}=\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-1\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}=\\tilde\{\\boldsymbol\{v\}\}\.\(A\.9\)Hence, mapping a vector from the parameter space coordinates to its tangent space coordinates preserves the mapping\. We then define the distribution of initial velocities directly in the parameter space coordinates as:

𝒗~=𝐆\(𝜽\)−12ϵ,q\(ϵ\)=𝒩\(𝟎,𝚺\)\\tilde\{\\boldsymbol\{v\}\}=\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-\\frac\{1\}\{2\}\}\\boldsymbol\{\\epsilon\},\\qquad q\(\\boldsymbol\{\\epsilon\}\)=\\mathcal\{N\}\\left\(\\boldsymbol\{0\},\\mathbf\{\\Sigma\}\\right\)\(A\.10\)due to linearity of Gaussians\. We see that the initial speed samples from this distribution is invariant to the position𝝁\\boldsymbol\{\\mu\}, since:

∥𝒗∥2=∥𝒗~∥𝐆\(𝜽\)2=ϵ⊤𝐆\(𝜽\)−12𝐆\(𝜽\)𝐆\(𝜽\)−12⏟=𝕀Kϵ=∥ϵ∥2,\\lVert\\boldsymbol\{v\}\\rVert^\{2\}=\\lVert\\tilde\{\\boldsymbol\{v\}\}\\rVert^\{2\}\_\{\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\}=\\boldsymbol\{\\epsilon\}^\{\\top\}\\underset\{=\\mathbb\{I\}\_\{K\}\}\{\\underbrace\{\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-\\frac\{1\}\{2\}\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)^\{\-\\frac\{1\}\{2\}\}\}\}\\ \\boldsymbol\{\\epsilon\}=\\lVert\\boldsymbol\{\\epsilon\}\\rVert^\{2\},\(A\.11\)which is not the case if we disregard precoditioning by the square\-root of the inverse metric, see:

∥𝒗∥2=∥𝒗~∥𝐆\(𝜽\)2=𝒗~⊤𝐆\(𝜽\)𝒗~\.\\lVert\\boldsymbol\{v\}\\rVert^\{2\}=\\lVert\\tilde\{\\boldsymbol\{v\}\}\\rVert^\{2\}\_\{\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\}=\\tilde\{\\boldsymbol\{v\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\\tilde\{\\boldsymbol\{v\}\}\.\(A\.12\)Remark that at a local minimum𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}the preconditioning has no effect since𝐆\(𝜽∗\)−12=𝕀K\\mathbf\{G\}\(\\boldsymbol\{\\theta\}^\{\\ast\}\)^\{\-\\frac\{1\}\{2\}\}=\\mathbb\{I\}\_\{K\}by definition\. For geometry\-driven approximate posteriors that initialize the dynamics at a minimum \- which is the case for any such distribution based on the Laplace approximation \- the preconditioning does not need to be accounted for\.

## Appendix BDerivations related to curve energy

For a unit\-mass particle traveling on the loss manifold, its energy is fully captured by a kinetic and a potential energy component\. The kinetic energy of a particle moving on the manifold can equally be formulated in its intrinsic coordinates, here referring to the parameter space\. These formulations relate as:

Ekin=T\(𝜸,𝜸˙\)=12⋅g𝜸\(𝜸˙,𝜸˙\)=12⋅𝜶˙⊤𝐆\(𝜶\)𝜶˙=T\(𝜶,𝜶˙\)≥0,E\_\{\\text\{kin\}\}=T\(\\boldsymbol\{\\gamma\},\\dot\{\\boldsymbol\{\\gamma\}\}\)=\\frac\{1\}\{2\}\\cdot g\_\{\\boldsymbol\{\\gamma\}\}\(\\dot\{\\boldsymbol\{\\gamma\}\},\\dot\{\\boldsymbol\{\\gamma\}\}\)=\\frac\{1\}\{2\}\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\\geq 0,\(B\.1\)Similarly, the potential energy of such a system is described as:

Epot=V\(𝜸\)=κ⋅\(ℒ∘h−1\)\(𝜸\)−V∗=κ⋅\(ℒ\(𝜶\)−ℒ\(𝜽glob∗\)\)=V\(𝜶\)≥0,E\_\{\\text\{pot\}\}=V\(\\boldsymbol\{\\gamma\}\)=\\kappa\\cdot\\left\(\\mathcal\{L\}\\circ h^\{\-1\}\\right\)\(\\boldsymbol\{\\gamma\}\)\-V^\{\\ast\}=\\kappa\\cdot\\left\(\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\mathcal\{L\}\(\\boldsymbol\{\\theta\}^\{\\ast\}\_\{\\text\{glob\}\}\)\\right\)=V\(\\boldsymbol\{\\alpha\}\)\\geq 0,\(B\.2\)where𝜽glob∗\\boldsymbol\{\\theta\}^\{\\ast\}\_\{\\text\{glob\}\}is a parameter configuration associated to the minimum potentialV∗V^\{\\ast\}andκ\>0\\kappa\>0is a positive constant summarizing the gravitational acceleration and the particle’s mass\. The particle dynamics are then described through the Lagrangian that encodes the energy balance between the kinetic and potential energies:

L\(𝜸,𝜸˙\)=T\(𝜸,𝜸˙\)−V\(𝜸\)=T\(𝜶,𝜶˙\)−V\(𝜶\)=L\(𝜶,𝜶˙\)\.L\(\\boldsymbol\{\\gamma\},\\dot\{\\boldsymbol\{\\gamma\}\}\)=T\(\\boldsymbol\{\\gamma\},\\dot\{\\boldsymbol\{\\gamma\}\}\)\-V\(\\boldsymbol\{\\gamma\}\)=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\-V\(\\boldsymbol\{\\alpha\}\)=L\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\.\(B\.3\)An extremal action curve is a trajectory of the particle, i\.e\.𝜸:\[t0,t1\]→ℳ\\boldsymbol\{\\gamma\}:\[t\_\{0\},t\_\{1\}\]\\rightarrow\\mathcal\{M\}, that respects the principle of stationary action, meaning that it minimizes an action functionalSSdefined by:

arg⁡min𝜸S=arg⁡min𝜸∫t0t1L\(𝜸,𝜸˙\)𝑑t=arg⁡min𝜶∫t0t1L\(𝜶,𝜶˙\)𝑑t\.\\underset\{\\boldsymbol\{\\gamma\}\}\{\\arg\\min\}\\ S=\\underset\{\\boldsymbol\{\\gamma\}\}\{\\arg\\min\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}L\(\\boldsymbol\{\\gamma\},\\dot\{\\boldsymbol\{\\gamma\}\}\)dt=\\underset\{\\boldsymbol\{\\alpha\}\}\{\\arg\\min\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}L\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)dt\.\(B\.4\)What should be evident from Equations \([B\.1](https://arxiv.org/html/2605.15459#A2.E1)\)\-\([B\.4](https://arxiv.org/html/2605.15459#A2.E4)\) is that energies and dynamics of a particle moving on the loss manifold can equivalently be expressed and computed by considering the parameter space trajectory of the particle\.

A curve𝜶\(t\)\\boldsymbol\{\\alpha\}\(t\)is a stationary point ofSSif and only if it satisfies the Euler\-Lagrange equations:

∂L∂αk−ddt∂L∂α˙k=0\.\\frac\{\\partial L\}\{\\partial\\alpha\_\{k\}\}\-\\frac\{d\}\{dt\}\\frac\{\\partial L\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}=0\.\(B\.5\)
### B\.1Solution to the Euler\-Lagrange equations

Inserting the definition of the Lagrangian from Equation \([B\.3](https://arxiv.org/html/2605.15459#A2.E3)\) into the Euler\-Lagrange equations, we get:

∂T∂αk−∂V∂αk−ddt\(∂T∂α˙k−∂V∂α˙k\)=0,\\frac\{\\partial T\}\{\\partial\\alpha\_\{k\}\}\-\\frac\{\\partial V\}\{\\partial\\alpha\_\{k\}\}\-\\frac\{d\}\{dt\}\\left\(\\frac\{\\partial T\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}\-\\frac\{\\partial V\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}\\right\)=0,\(B\.6\)and since the potential is a function of the coordinates only, this simplifies to:

∂T∂αk−ddt∂T∂α˙k=∂V∂αk\.\\frac\{\\partial T\}\{\\partial\\alpha\_\{k\}\}\-\\frac\{d\}\{dt\}\\frac\{\\partial T\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}=\\frac\{\\partial V\}\{\\partial\\alpha\_\{k\}\}\.\(B\.7\)Remark the coordinate dependency of the kinetic energy in the first term, a result of the fact that the particle is moving on the manifold with metric𝐆\(𝜽\)\\mathbf\{G\}\(\\boldsymbol\{\\theta\}\)\. We insert the kinetic and potential energies and get:

12\(∂∂αk\(𝜶˙⊤𝐆\(𝜶\)𝜶˙\)−ddt∂∂α˙k\(𝜶˙⊤𝐆\(𝜶\)𝜶˙\)\)=κ⋅∂∂αkℒ\(𝜶\)\.\\displaystyle\\frac\{1\}\{2\}\\left\(\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\right\)\-\\frac\{d\}\{dt\}\\frac\{\\partial\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\right\)\\right\)=\\kappa\\cdot\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.\(B\.8\)We proceed by considering the derivative terms individually\. First realize that:

∂∂αk\(𝜶˙⊤𝐆\(𝜶\)𝜶˙\)=∂𝐆ij\(𝜶\)∂αkα˙iα˙j\.\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\right\)=\\frac\{\\partial\\mathbf\{G\}\_\{ij\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{k\}\}\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\.\(B\.9\)Next, we leverage the product and the chain rule to get:

ddt∂∂α˙k\(𝜶˙⊤𝐆\(𝜶\)𝜶˙\)\\displaystyle\\frac\{d\}\{dt\}\\frac\{\\partial\}\{\\partial\\dot\{\\alpha\}\_\{k\}\}\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\right\)=2ddt𝐆kj\(𝜶\)α˙j\\displaystyle=2\\frac\{d\}\{dt\}\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\alpha\}\_\{j\}\(B\.10\)=2\(ddt𝐆kj\(𝜶\)\)α˙j\+2𝐆kj\(𝜶\)α¨j\\displaystyle=2\\left\(\\frac\{d\}\{dt\}\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\right\)\\dot\{\\alpha\}\_\{j\}\+2\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\alpha\}\_\{j\}=2∂𝐆kj\(𝜶\)∂αiα˙iα˙j\+2𝐆kj\(𝜶\)α¨j\\displaystyle=2\\frac\{\\partial\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{i\}\}\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\+2\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\alpha\}\_\{j\}=\(∂𝐆kj\(𝜶\)∂αi\+∂𝐆ki\(𝜶\)∂αj\)α˙iα˙j\+2𝐆kj\(𝜶\)α¨j,\\displaystyle=\\left\(\\frac\{\\partial\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{i\}\}\+\\frac\{\\partial\\mathbf\{G\}\_\{ki\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{j\}\}\\right\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\+2\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\alpha\}\_\{j\},and inserting Equations \([B\.9](https://arxiv.org/html/2605.15459#A2.E9)\)\-\([B\.10](https://arxiv.org/html/2605.15459#A2.E10)\) into Equation \([B\.8](https://arxiv.org/html/2605.15459#A2.E8)\) gives:

12\(∂𝐆ij\(𝜶\)\)∂αk−∂𝐆kj\(𝜶\)\)∂αi−∂𝐆ki\(𝜶\)\)∂αj\)α˙iα˙j−𝐆kj\(𝜶\)α¨j=κ⋅∂∂αkℒ\(𝜶\)\.\\frac\{1\}\{2\}\\left\(\\frac\{\\partial\\mathbf\{G\}\_\{ij\}\(\\boldsymbol\{\\alpha\}\)\)\}\{\\partial\\alpha\_\{k\}\}\-\\frac\{\\partial\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\)\}\{\\partial\\alpha\_\{i\}\}\-\\frac\{\\partial\\mathbf\{G\}\_\{ki\}\(\\boldsymbol\{\\alpha\}\)\)\}\{\\partial\\alpha\_\{j\}\}\\right\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\-\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\alpha\}\_\{j\}=\\kappa\\cdot\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.\(B\.11\)To find the equations of motion, we isolate the acceleration\. To do so, realize that:

∑k𝐆lk−1\(𝜶\)𝐆kj\(𝜶\)=δjl,\\sum\_\{k\}\\mathbf\{G\}^\{\-1\}\_\{lk\}\(\\boldsymbol\{\\alpha\}\)\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)=\\delta\_\{j\}^\{l\},\(B\.12\)which allows for rewriting Equation \([B\.11](https://arxiv.org/html/2605.15459#A2.E11)\) using the Christoffel symbols:

12𝐆lk−1\(𝜶\)\(∂𝐆kj\(𝜶\)∂αi\+∂𝐆ki\(𝜶\)∂αj−∂𝐆ij\(𝜶\)∂αk\)α˙iα˙j\+α¨l\\displaystyle\\frac\{1\}\{2\}\\mathbf\{G\}^\{\-1\}\_\{lk\}\(\\boldsymbol\{\\alpha\}\)\\left\(\\frac\{\\partial\\mathbf\{G\}\_\{kj\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{i\}\}\+\\frac\{\\partial\\mathbf\{G\}\_\{ki\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{j\}\}\-\\frac\{\\partial\\mathbf\{G\}\_\{ij\}\(\\boldsymbol\{\\alpha\}\)\}\{\\partial\\alpha\_\{k\}\}\\right\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\+\\ddot\{\\alpha\}\_\{l\}=−κ⋅𝐆lk−1\(𝜶\)⋅∂∂αkℒ\(𝜶\)\\displaystyle=\-\\kappa\\cdot\\mathbf\{G\}^\{\-1\}\_\{lk\}\(\\boldsymbol\{\\alpha\}\)\\cdot\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\(B\.13\)⇒Γijl\(𝜶\)α˙iα˙j\+α¨l\\displaystyle\\Rightarrow\\hskip 18\.49988pt\\hskip 18\.49988pt\\Gamma\_\{ij\}^\{l\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\+\\ddot\{\\alpha\}\_\{l\}=−κ⋅𝐆lk−1\(𝜶\)⋅∂∂αkℒ\(𝜶\),\\displaystyle=\-\\kappa\\cdot\\mathbf\{G\}^\{\-1\}\_\{lk\}\(\\boldsymbol\{\\alpha\}\)\\cdot\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\),\(B\.14\)from which the acceleration can be isolated and forms a second\-order ordinary differential equation \(ODE\):

α¨l=−Γijl\(𝜶\)α˙iα˙j−κ⋅𝐆lk−1\(𝜶\)⋅∂∂αkℒ\(𝜶\),\\ddot\{\\alpha\}\_\{l\}=\-\\Gamma\_\{ij\}^\{l\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\-\\kappa\\cdot\\mathbf\{G\}^\{\-1\}\_\{lk\}\(\\boldsymbol\{\\alpha\}\)\\cdot\\frac\{\\partial\}\{\\partial\\alpha\_\{k\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\),\(B\.15\)By swapping indices and recalling that the term coming from the potential energy equals the Riemannian gradient, the expression becomes:

α¨k=−Γijk\(𝜶\)α˙iα˙j−κ⋅grad⁡ℒ\(𝜶\)k,\\ddot\{\\alpha\}\_\{k\}=\-\\Gamma\_\{ij\}^\{k\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\alpha\}\_\{i\}\\dot\{\\alpha\}\_\{j\}\-\\kappa\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\_\{k\},\(B\.16\)wheregrad⁡ℒ\(𝜶\)k\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\_\{k\}denotes thekk\-th index of the Riemannian gradient\. Remark that we used no assumptions about the metric and that these dynamics hold for a particle traveling on any Riemannian manifold\.

#### Solution under the pullback metric\.

By leveraging the Christoffel symbols for the loss manifold provided in Equation \([A\.5](https://arxiv.org/html/2605.15459#A1.E5)\), it is straight\-forward to show that the Euler\-Lagrange equation from above simplifies to:

𝜶¨=−𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙1\+‖∇𝜽ℒ\(𝜶\)‖2⋅∇𝜽ℒ\(𝜶\)−κ⋅grad⁡ℒ\(𝜶\)=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅grad⁡ℒ\(𝜶\)\.\\ddot\{\\boldsymbol\{\\alpha\}\}=\-\\frac\{\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\}\{1\+\\left\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\right\\rVert^\{2\}\}\\cdot\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\kappa\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.\(B\.17\)

#### Interpreting the potential term as a gravitational pull\.

The dynamics derived in the previous section also appear when adding a gravitational force component to the tangential acceleration of the particle traveling onℳ\\mathcal\{M\}along𝜸\(t\)\\boldsymbol\{\\gamma\}\(t\)\. Leveraging the covariant derivative, see that:

∇𝜸˙𝜸˙=𝒅−⟨𝒅,𝒏^\(𝜶\)⟩𝒏^\(𝜶\)=\[−κ⋅grad⁡ℒ\(𝜶\),κ⋅∥∇𝜽ℒ\(𝜶\)∥21\+∥∇𝜽ℒ\(𝜶\)∥2\]⊤,\\nabla\_\{\\dot\{\\boldsymbol\{\\gamma\}\}\}\\dot\{\\boldsymbol\{\\gamma\}\}=\\boldsymbol\{d\}\-\\langle\\boldsymbol\{d\},\\hat\{\\boldsymbol\{n\}\}\(\\boldsymbol\{\\alpha\}\)\\rangle\\hat\{\\boldsymbol\{n\}\}\(\\boldsymbol\{\\alpha\}\)=\\begin\{bmatrix\}\-\\kappa\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\),&\\kappa\\cdot\\frac\{\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\rVert^\{2\}\}\{1\+\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\rVert^\{2\}\}\\end\{bmatrix\}^\{\\top\},\(B\.18\)where𝒅=\[0,…,0,−κ\]⊤∈ℝK\+1\\boldsymbol\{d\}=\\left\[0,\\dots,0,\-\\kappa\\right\]^\{\\top\}\\in\\mathbb\{R\}^\{K\+1\}is the gravitational pull and the normal vector to the loss surface is given by𝒏^\(𝜶\)=\[−∇𝜽ℒ\(𝜶\),1\]⊤∈ℝK\+1\\hat\{\\boldsymbol\{n\}\}\(\\boldsymbol\{\\alpha\}\)=\\left\[\-\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\),1\\right\]^\{\\top\}\\in\\mathbb\{R\}^\{K\+1\}\. Considering all parameter space dimensions, we see that this corresponds to the term originating from the potential energy when solving the Euler\-Lagrange equations\.

#### Gravity\-influenced curves are geodesics on Finsler manifolds\.

As argued byBucataru and Miron\[[2007](https://arxiv.org/html/2605.15459#bib.bib27)\], a Finslerian mechanical system with metricF\(𝜶,𝜶˙,t\)F\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)corresponds to defining the Lagrangian as:

L\(𝜶,α˙,t\)=12F\(𝜶,𝜶˙,t\)2\.L\(\\boldsymbol\{\\alpha\},\\dot\{\\alpha\},t\)=\\frac\{1\}\{2\}F\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)^\{2\}\.\(B\.19\)Inverting the relation and using the definition of the Lagrangian as the energy balance between the kinetic and potential energy, it is easy to see that the proposed dynamics from Equation \([B\.17](https://arxiv.org/html/2605.15459#A2.E17)\) correspond to motion on a Finsler manifold with metric:

F\(𝜶,𝜶˙,t\)=𝜶˙\(t\)⊤𝐆\(𝜶\(t\)\)𝜶˙\(t\)\+2κ⋅ℒ\(𝜶\(t\)\)\.F\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=\\sqrt\{\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\(t\)\)\\dot\{\\boldsymbol\{\\alpha\}\}\(t\)\+2\\kappa\\cdot\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\)\)\}\.\(B\.20\)

### B\.2Energy flow equations under the pullback metric

We consider a parameter space curve𝜶\(t\)\\boldsymbol\{\\alpha\}\(t\)induced by the non\-dissipative dynamics stated in Equation \([B\.17](https://arxiv.org/html/2605.15459#A2.E17)\)\. In this section we will show the well\-known fact that a non\-dissipative system conserves energy by trading of kinetic with potential energy and vice versa and later show the implication of introducing dissipation as a friction force\. First, we take the time\-derivative of the kinetic energy:

ddtT\(𝜶,𝜶˙\)=12ddt\(𝜶˙⊤𝐆\(𝜶\)𝜶˙\)=𝜶˙⊤𝐆\(𝜶\)𝜶¨\+12𝜶˙⊤𝐆˙\(𝜶\)𝜶˙\.\\displaystyle\\frac\{d\}\{dt\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=\\frac\{1\}\{2\}\\frac\{d\}\{dt\}\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\right\)=\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\boldsymbol\{\\alpha\}\}\+\\frac\{1\}\{2\}\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\dot\{\\mathbf\{G\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\.\(B\.21\)We study the two terms separately\. For the first term, we leverage the definition of the acceleration from Equation \([B\.17](https://arxiv.org/html/2605.15459#A2.E17)\) to get:

𝜶˙⊤𝐆\(𝜶\)𝜶¨\\displaystyle\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\ddot\{\\boldsymbol\{\\alpha\}\}=−𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙1\+‖∇𝜽ℒ\(𝜶\)‖2⋅𝜶˙⊤𝐆\(𝜶\)∇𝜽ℒ\(𝜶\)−κ⋅𝜶˙⊤∇𝜽ℒ\(𝜶\)\\displaystyle=\-\\frac\{\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\}\{1\+\\left\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\right\\rVert^\{2\}\}\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\kappa\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\(B\.22\)=−𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙⋅𝜶˙⊤𝐆\(𝜶\)𝐆\(𝜶\)−1∇𝜽ℒ\(𝜶\)−κ⋅𝜶˙⊤∇𝜽ℒ\(𝜶\)\\displaystyle=\-\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)^\{\-1\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\kappa\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅𝜶˙⊤∇𝜽ℒ\(𝜶\)\.\\displaystyle=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.For the second term, we insert the definition of the time\-derivative of the metric and get:

12𝜶˙⊤𝐆˙\(𝜶\)𝜶˙=𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙⋅𝜶˙⊤∇𝜽ℒ\(𝜶\)\.\\displaystyle\\frac\{1\}\{2\}\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\dot\{\\mathbf\{G\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}=\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\.\(B\.23\)Combining the two terms and applying the chain rule inversely, we obtain the*law of conservation of energy*which is well\-known from physics and exactly reveals the direct relation between kinetic and potential energy for the moving particle:

T˙\(𝜶,𝜶˙\)=ddtT\(𝜶,𝜶˙\)=−κ⋅𝜶˙⊤∇𝜽ℒ\(𝜶\)=−κ⋅ddtℒ\(𝜶\)=−ddtV\(𝜶\)=−V˙\(𝜶\)\.\\dot\{T\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=\\frac\{d\}\{dt\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=\-\\kappa\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\-\\kappa\\cdot\\frac\{d\}\{dt\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\-\\frac\{d\}\{dt\}V\(\\boldsymbol\{\\alpha\}\)=\-\\dot\{V\}\(\\boldsymbol\{\\alpha\}\)\.\(B\.24\)
#### Introducing dissipation through friction\.

The introduction of dissipation as a friction force based on dissipation functionη\(t\)\>0\\eta\(t\)\>0results in a modified version of the particle’s acceleration:

𝜶¨=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅grad⁡ℒ\(𝜶\)−η\(t\)⋅𝜶˙\.\\ddot\{\\boldsymbol\{\\alpha\}\}=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\eta\(t\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}\.\(B\.25\)As a consequence the the time\-derivative of the kinetic energy changes\. Inserting the dissipative acceleration equation into the previous derivation, it is straight\-forward to see that:

T˙\(𝜶,𝜶˙\)\\displaystyle\\dot\{T\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)=−V˙\(𝜶\)−η\(t\)⋅𝜶˙⊤𝐆\(𝜶\)𝜶˙\\displaystyle=\-\\dot\{V\}\(\\boldsymbol\{\\alpha\}\)\-\\eta\(t\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\{\\boldsymbol\{\\alpha\}\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\(B\.26\)=−V˙\(𝜶\)−2η\(t\)⋅T\(𝜶,𝜶˙\)\.\\displaystyle=\-\\dot\{V\}\(\\boldsymbol\{\\alpha\}\)\-2\\eta\(t\)\\cdot T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\.Sinceη\(t\)⋅𝜶˙⊤𝐆\(𝜶\)𝜶˙≥0\\eta\(t\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\{\\boldsymbol\{\\alpha\}\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\geq 0by definition, the kinetic energy gradually decreases even though a drop in potential energy can still locally increase the kinetic energy\.

Lastly, we rewrite the speed\-dependent dissipation function considered in our samplerDiMSas:

η\(t\)=η0T\(𝜶,𝜶˙\)\\displaystyle\\eta\(t\)=\\eta\_\{0\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\}=η0𝜶˙⊤𝐆\(𝜶\)𝜶˙\\displaystyle=\\eta\_\{0\}\\sqrt\{\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\}\(B\.27\)=η0∥𝜶˙∥2\+∥𝜶˙⊤∇𝜽ℒ\(𝜶\)∥2\\displaystyle=\\eta\_\{0\}\\sqrt\{\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert^\{2\}\+\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\rVert^\{2\}\}=η0∥𝜶˙∥2\+\(∥𝜶˙∥∥∇𝜽ℒ\(𝜶\)∥cos⁡\(ω\)\)2\\displaystyle=\\eta\_\{0\}\\sqrt\{\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert^\{2\}\+\\left\(\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\rVert\\cos\(\\omega\)\\right\)^\{2\}\}=η0∥𝜶˙∥1\+∥∇𝜽ℒ\(𝜶\)∥2cos2⁡\(ω\),\\displaystyle=\\eta\_\{0\}\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert\\sqrt\{1\+\\lVert\\nabla\_\{\\boldsymbol\{\\theta\}\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\\rVert^\{2\}\\cos^\{2\}\(\\omega\)\},whereη0\>0\\eta\_\{0\}\>0is the friction coefficient, treated as a hyperparameter\. This clearly shows that the dissipation depends on the angle between the velocity and the gradient,ω\\omega, for the position\-velocity state at timett\.

## Appendix CProofs of lemmas and theorems

#### Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1)\- energy dissipation and boundedness\.

Consider a curve𝜶:\[t0,∞\]→Θ\\boldsymbol\{\\alpha\}:\[t\_\{0\},\\infty\]\\rightarrow\\Thetafollowing the general dissipative dynamics presented in Equation \([B\.25](https://arxiv.org/html/2605.15459#A2.E25)\), and recall that by definition the kinetic and potential from Equations \([B\.1](https://arxiv.org/html/2605.15459#A2.E1)\)\-\([B\.2](https://arxiv.org/html/2605.15459#A2.E2)\), the energy terms are bounded below asT\(𝜶,𝜶˙,t\),V\(𝜶,t\)≥0T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\),V\(\\boldsymbol\{\\alpha\},t\)\\geq 0\. We define the total energy of the particle at timet≥t0t\\geq t\_\{0\}as the Hamiltonian:

H\(t\):=H\(𝜶,𝜶˙,t\)=T\(𝜶,𝜶˙,t\)\+V\(𝜶,t\)≥0\.H\(t\):=H\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\+V\(\\boldsymbol\{\\alpha\},t\)\\geq 0\.\(C\.1\)From Equation \([B\.26](https://arxiv.org/html/2605.15459#A2.E26)\) and given thatη\(t\)\>0\\eta\(t\)\>0it holds that:

H˙\(𝜶,𝜶˙,t\)=T˙\(𝜶,𝜶˙,t\)\+V˙\(𝜶,t\)=−2η\(t\)T\(𝜶,𝜶˙,t\)≤0\\displaystyle\\dot\{H\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=\\dot\{T\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\+\\dot\{V\}\(\\boldsymbol\{\\alpha\},t\)=\-2\\eta\(t\)T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\leq 0\(C\.2\)revealing that the total energyH\(t\)H\(t\)is monotonically non\-increasing intt, and hence0≤H\(t\)≤H\(t0\)0\\leq H\(t\)\\leq H\(t\_\{0\}\)for allt≥t0t\\geq t\_\{0\}\. AssumingH\(t0\)<∞H\(t\_\{0\}\)<\\infty, the Hamiltonian is bounded below and monotone, and therefore converges to the limitH∞:=limt→∞H\(t\)∈ℝH\_\{\\infty\}:=\\lim\_\{t\\rightarrow\\infty\}H\(t\)\\in\\mathbb\{R\}by completeness ofℝ\\mathbb\{R\}, such that0≤H∞≤H\(t\)≤H\(t0\)0\\leq H\_\{\\infty\}\\leq H\(t\)\\leq H\(t\_\{0\}\)for allt≥t0t\\geq t\_\{0\}\.

#### Theorem[1](https://arxiv.org/html/2605.15459#Thmtheorem1)\- convergence guarantee\.

From Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1)and under the dynamics in Equation \([B\.25](https://arxiv.org/html/2605.15459#A2.E25)\), recall that0≤H∞≤H\(t\)≤H\(t0\)0\\leq H\_\{\\infty\}\\leq H\(t\)\\leq H\(t\_\{0\}\)for allt≥t0t\\geq t\_\{0\}, so all trajectories remain in the sublevel setΩ=\{\(𝜶,𝜶˙\):H\(𝜶,𝜶˙,t\)≤H\(t0\)\}\\Omega=\\\{\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\):H\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\leq H\(t\_\{0\}\)\\\}\. IfV\(𝜶,t\)V\(\\boldsymbol\{\\alpha\},t\)is coercive, i\.e\.V\(𝜶,t\)→∞V\(\\boldsymbol\{\\alpha\},t\)\\rightarrow\\inftyas∥𝜶∥→∞\\lVert\\boldsymbol\{\\alpha\}\\rVert\\rightarrow\\infty, then sinceT\(𝜶,𝜶˙,t\)→∞T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\rightarrow\\inftyas∥𝜶˙∥→∞\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert\\rightarrow\\infty, it holds that

H\(𝜶,𝜶˙,t\)=T\(𝜶,𝜶˙,t\)\+V\(𝜶,t\)→∞as∥𝜶∥→∞or∥𝜶˙∥→∞,H\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\+V\(\\boldsymbol\{\\alpha\},t\)\\rightarrow\\infty\\quad\\textrm\{as\}\\quad\\lVert\\boldsymbol\{\\alpha\}\\rVert\\rightarrow\\infty\\textrm\{ or \}\\lVert\\dot\{\\boldsymbol\{\\alpha\}\}\\rVert\\rightarrow\\infty,\(C\.3\)soHHis a radially unbounded function in𝜶\\boldsymbol\{\\alpha\}and𝜶˙\\dot\{\\boldsymbol\{\\alpha\}\}, and thusΩ\\Omegais compact in𝜶\\boldsymbol\{\\alpha\}and𝜶˙\\dot\{\\boldsymbol\{\\alpha\}\}\. Moreover assuming thatHHis continuously differentiable and sinceH˙\(𝜶,𝜶˙,t\)≤0\\dot\{H\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\leq 0onΩ\\Omega, the conditions of LaSalle’s invariance principle \(e\.g\. in the textbook byKhalil and Grizzle\[[2002](https://arxiv.org/html/2605.15459#bib.bib54)\]\) are satisfied and all trajectories converge to the largest invariant subset contained in

𝒮=\{\(𝜶,𝜶˙\):H˙\(t\)=0\}\.\\mathcal\{S\}=\\\{\\left\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\\right\):\\dot\{H\}\(t\)=0\\\}\.\(C\.4\)Recall thatH˙\(𝜶,𝜶˙,t\)=−2η\(t\)T\(𝜶,𝜶˙,t\)=−η\(t\)𝜶˙⊤𝐆\(𝜶\)𝜶˙≤0\\dot\{H\}\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=\-2\\eta\(t\)T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=\-\\eta\(t\)\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\\leq 0, due to positive definiteness of𝐆\(𝜶\)\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)andη\(t\)\>0\\eta\(t\)\>0for alltt, we note thatH˙\(t\)=0\\dot\{H\}\(t\)=0if and only if𝜶˙=𝟎\\dot\{\\boldsymbol\{\\alpha\}\}=\\boldsymbol\{0\}, hence:

𝒮=\{\(𝜶,𝜶˙\):𝜶˙=𝟎\}\.\\mathcal\{S\}=\\\{\\left\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\\right\):\\dot\{\\boldsymbol\{\\alpha\}\}=\\boldsymbol\{0\}\\\}\.\(C\.5\)Next, note that a trajectory starting in𝒮\\mathcal\{S\}remains in𝒮\\mathcal\{S\}only if𝜶¨=0\\ddot\{\\boldsymbol\{\\alpha\}\}=0\. Following the dynamics from Equation \([B\.25](https://arxiv.org/html/2605.15459#A2.E25)\) for the particular states\(𝜶,𝜶˙\)∈𝒮\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\\in\\mathcal\{S\}, see that:

𝜶¨=−\(𝜶˙⊤𝐇ℒ\(𝜶\)𝜶˙\+κ\)⋅grad⁡ℒ\(𝜶\)−η\(t\)⋅𝜶˙=\(𝜶,𝜶˙\)∈𝒮−κgrad⁡ℒ\(𝜶\)=𝟎\.\\displaystyle\\ddot\{\\boldsymbol\{\\alpha\}\}=\-\\left\(\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\+\\kappa\\right\)\\cdot\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)\-\\eta\(t\)\\cdot\\dot\{\\boldsymbol\{\\alpha\}\}\\overset\{\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\\in\\mathcal\{S\}\}\{=\}\-\\kappa\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\\boldsymbol\{0\}\.\(C\.6\)The equality is satisfied if and only ifgrad⁡ℒ\(𝜶\)=𝟎\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\\boldsymbol\{0\}sinceκ\>0\\kappa\>0by definition, hence the largest invariant subsetℐ\\mathcal\{I\}of𝒮\\mathcal\{S\}to which all trajectories converge ast→∞t\\rightarrow\\inftyis exactly zero\-velocity states reaching stationary points of the loss surface:

ℐ=\{\(𝜶,𝜶˙\):𝜶˙=𝟎,grad⁡ℒ\(𝜶\)=𝟎\}⊂𝒮\.\\mathcal\{I\}=\\left\\\{\\left\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\\right\):\\dot\{\\boldsymbol\{\\alpha\}\}=\\boldsymbol\{0\},\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\)=\\boldsymbol\{0\}\\right\\\}\\subset\\mathcal\{S\}\.\(C\.7\)
Assuming that the lossℒ\\mathcal\{L\}satisfies the strict saddle property, saddle points and local maxima inℐ\\mathcal\{I\}are unstable equilibria and trajectories converge almost surely to a local minimum ofℒ\\mathcal\{L\}\. We thereby conclude that the proposed speed\-dependent dissipative dynamics converge to a local minimum ofℒ\\mathcal\{L\}\.

#### Theorem[2](https://arxiv.org/html/2605.15459#Thmtheorem2)\- convergence time\.

Under the assumptions stated in Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1), the Hamiltonian dissipates energy, hence0≤H∞≤H\(t1\)≤H\(t0\)0\\leq H\_\{\\infty\}\\leq H\(t\_\{1\}\)\\leq H\(t\_\{0\}\)\. We formulate the energy difference equation of the Hamiltonian as the difference in total energy between the start\- and endpoint of the curve as:

ΔH=H\(t0\)−H\(t1\)=−∫t0t1H˙\(t\)𝑑t=2∫t0t1η\(t\)T\(𝜶,𝜶˙,t\)𝑑t=2η0∫t0t1T\(𝜶,𝜶˙,t\)3/2𝑑t,\\Delta H=H\(t\_\{0\}\)\-H\(t\_\{1\}\)=\-\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\dot\{H\}\(t\)dt=2\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\eta\(t\)T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)dt=2\\eta\_\{0\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)^\{3/2\}dt,\(C\.8\)Now, lettϵt\_\{\\epsilon\}be a proxy convergence time at which the kinetic energy drops below a thresholdϵ\\epsilon, i\.e\.tϵ=inf\{t≥t0:T\(𝜶,𝜶˙,t\)≤ϵ\}\.t\_\{\\epsilon\}=\\inf\\left\\\{t\\geq t\_\{0\}:T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\leq\\epsilon\\right\\\}\.Due to the infimum condition that definestϵt\_\{\\epsilon\}it holds that:

ϵ3/2\(tϵ−t0\)=∫t0tϵϵ3/2𝑑t≤∫t0tϵT\(𝜶,𝜶˙,t\)3/2𝑑t,\\epsilon^\{3/2\}\(t\_\{\\epsilon\}\-t\_\{0\}\)=\\int\_\{t\_\{0\}\}^\{t\_\{\\epsilon\}\}\\epsilon^\{3/2\}dt\\leq\\int\_\{t\_\{0\}\}^\{t\_\{\\epsilon\}\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)^\{3/2\}dt,\(C\.9\)Now inserting this lower bound into the difference equation along with boundedness of the Hamiltonian gives:

2η0ϵ3/2\(tϵ−t0\)≤2η0∫t0t1T\(𝜶,𝜶˙,t\)3/2𝑑t=ΔH≤H\(t0\)\.2\\eta\_\{0\}\\epsilon^\{3/2\}\(t\_\{\\epsilon\}\-t\_\{0\}\)\\leq 2\\eta\_\{0\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)^\{3/2\}dt=\\Delta H\\leq H\(t\_\{0\}\)\.\(C\.10\)By isolating the proxy convergence time, we get an upper bound on the form:

tϵ≤t0\+H\(t0\)2η0ϵ3/2\.t\_\{\\epsilon\}\\leq t\_\{0\}\+\\frac\{H\(t\_\{0\}\)\}\{2\\eta\_\{0\}\\epsilon^\{3/2\}\}\.\(C\.11\)

#### Theorem[3](https://arxiv.org/html/2605.15459#Thmtheorem3)\-𝜸\\boldsymbol\{\\gamma\}\-arc\-length\.

Given the Hamiltonian described in Lemma[1](https://arxiv.org/html/2605.15459#Thmlemma1)and the formulation of the Hamiltonian difference equation described in Theorem[2](https://arxiv.org/html/2605.15459#Thmtheorem2)along with the fact thatT\(𝜶,𝜶˙,t\)≤H\(t\)≤H\(t0\)T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\\leq H\(t\)\\leq H\(t\_\{0\}\)for anyt≥t0t\\geq t\_\{0\}, it holds that:

ΔH=2η0∫t0t1T\(𝜶,𝜶˙\)3/2𝑑t=2η0∫t0t1T\(𝜶,𝜶˙\)⋅T\(𝜶,𝜶˙\)𝑑t≤2η0H\(t0\)∫t0t1T\(𝜶,𝜶˙\)𝑑t\.\\Delta H=2\\eta\_\{0\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)^\{3/2\}dt=2\\eta\_\{0\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\}\\cdot T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)dt\\leq 2\\eta\_\{0\}H\(t\_\{0\}\)\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\)\}dt\.\(C\.12\)Note furthermore that the integral of the square\-root of the kinetic energy is proportional to the arc\-length of the curve𝜸:\[t0→t1\]∈ℳ\\boldsymbol\{\\gamma\}:\[t\_\{0\}\\rightarrow t\_\{1\}\]\\in\\mathcal\{M\}by:

∫t0t1T\(𝜶,𝜶˙\)dt=12∫t0t1𝜶˙⊤𝐆\(𝜶\)𝜶˙dt=12∫t0t1∥𝜸˙∥dt=Length\(𝜸\)2,\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\sqrt\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\}\}\)dt=\\frac\{1\}\{\\sqrt\{2\}\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\sqrt\{\\dot\{\\boldsymbol\{\\alpha\}\}^\{\\top\}\\mathbf\{G\}\(\\boldsymbol\{\\alpha\}\)\\dot\{\\boldsymbol\{\\alpha\}\}\}dt=\\frac\{1\}\{\\sqrt\{2\}\}\\int\_\{t\_\{0\}\}^\{t\_\{1\}\}\\lVert\\dot\{\\boldsymbol\{\\gamma\}\}\\rVert dt=\\frac\{\\textsc\{Length\}\(\\boldsymbol\{\\gamma\}\)\}\{\\sqrt\{2\}\},\(C\.13\)whereLength\(𝜸\)\\textsc\{Length\}\(\\boldsymbol\{\\gamma\}\)denotes the arc\-length of the curve\. In combination with the upper bound on the energy difference, we get a lower bound on the arc\-length:

2ΔH2η0H\(t0\)=ΔH2η0H\(t0\)≤Length\(𝜸\)\.\\frac\{\\sqrt\{2\}\\Delta H\}\{2\\eta\_\{0\}H\(t\_\{0\}\)\}=\\frac\{\\Delta H\}\{\\sqrt\{2\}\\eta\_\{0\}H\(t\_\{0\}\)\}\\leq\\textsc\{Length\}\(\\boldsymbol\{\\gamma\}\)\.\(C\.14\)From the convergence properties in Theorem[3](https://arxiv.org/html/2605.15459#Thmtheorem3), the asymptotic behavior of the Hamiltonian guarantees that:

limt→∞H\(t\)=limt→∞T\(𝜶,𝜶˙,t\)\+limt→∞V\(𝜶,t\)=V\(𝜶,∞\)=κ⋅\(ℒ\(𝜶∞\)−ℒ\(𝜶∗\)\)≥0,\\lim\_\{t\\rightarrow\\infty\}H\(t\)=\\lim\_\{t\\rightarrow\\infty\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)\+\\lim\_\{t\\rightarrow\\infty\}V\(\\boldsymbol\{\\alpha\},t\)=V\(\\boldsymbol\{\\alpha\},\\infty\)=\\kappa\\cdot\\left\(\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\_\{\\infty\}\)\-\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}^\{\\ast\}\)\\right\)\\geq 0,\(C\.15\)sincelimt→∞T\(𝜶,𝜶˙,t\)=0\\lim\_\{t\\rightarrow\\infty\}T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\)=0\. Note that we by𝜶∞\\boldsymbol\{\\alpha\}\_\{\\infty\}denote the position to which the trajectory converges wheregrad⁡ℒ\(𝜶∞\)=𝟎\\operatorname\{grad\}\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\_\{\\infty\}\)=\\boldsymbol\{0\}and remark that𝜶∞\\boldsymbol\{\\alpha\}\_\{\\infty\}is not necessarily a global minimum, henceℒ\(𝜶∞\)≥ℒ\(𝜶∗\)\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\_\{\\infty\}\)\\geq\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}^\{\\ast\}\)\. Now, letting the stop timet1→∞t\_\{1\}\\rightarrow\\infty, we see that:

limt1→∞ΔH\\displaystyle\\lim\_\{t\_\{1\}\\rightarrow\\infty\}\\Delta H=H\(t0\)−limt1→∞H\(t1\)\\displaystyle=H\(t\_\{0\}\)\-\\lim\_\{t\_\{1\}\\rightarrow\\infty\}H\(t\_\{1\}\)\(C\.16\)=T\(𝜶,𝜶˙,t0\)\+κ⋅\(ℒ\(𝜶\(t0\)\)−ℒ\(𝜶∞\)\)\\displaystyle=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\_\{0\}\)\+\\kappa\\cdot\\left\(\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\(t\_\{0\}\)\)\-\\mathcal\{L\}\(\\boldsymbol\{\\alpha\}\_\{\\infty\}\)\\right\)\(C\.17\)=T\(𝜶,𝜶˙,t0\)\+κΔℒ\.\\displaystyle=T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\_\{0\}\)\+\\kappa\\Delta\\mathcal\{L\}\.\(C\.18\)Collecting the pieces concludes the proof and we get a lower bound on the form:

T\(𝜶,𝜶˙,t0\)\+κΔℒ2η0H\(t0\)≤Length\(𝜸\)\.\\frac\{T\(\\boldsymbol\{\\alpha\},\\dot\{\\boldsymbol\{\\alpha\}\},t\_\{0\}\)\+\\kappa\\Delta\\mathcal\{L\}\}\{\\sqrt\{2\}\\eta\_\{0\}H\(t\_\{0\}\)\}\\leq\\textsc\{Length\}\(\\boldsymbol\{\\gamma\}\)\.\(C\.19\)

## Appendix DExperimental details

#### Implementation details\.

All experiments are implemented in PyTorch\. Differential equations are solved using the Dormand\-Prince 5th\-order solver\[Dormand and Prince,[1980](https://arxiv.org/html/2605.15459#bib.bib56)\]fromtorchdiffeq\[Chenet al\.,[2018](https://arxiv.org/html/2605.15459#bib.bib55)\], with absolute and relative tolerances ofatol=1e\-6andrtol=1e\-7to ensure stable dynamics\. These tolerances are a computational bottleneck for both standard geodesic sampling \(RLA\) and forDiMSat high exploration rates \(lowη0\\eta\_\{0\}and larget1t\_\{1\}\), as the adaptive step\-size controller requires many evaluations to satisfy the error bounds\. Automatic differentiation is used throughout to compute acceleration terms efficiently via Hessian\-vector products\. For all experiments, the prior precision and noise variance are optimized by maximizing the log\-marginal likelihood following the suggestion byImmeret al\.\[[2021a](https://arxiv.org/html/2605.15459#bib.bib19)\]\. The gravitational pull is fixed toκ=1\\kappa=1across all runs, leaving the friction coefficientη0\\eta\_\{0\}as the primary hyperparameter controlling exploration\. Networks are trained using the Adam optimizer with learning rate0\.0010\.001and weight decay0\.010\.01\.

The linearized Laplace \(LinLA\) uses the same Gaussian samples as the standard, sampled Laplace, i\.e\. a Gaussian𝜽s∼𝒩\(𝜽∗,𝚺\)\\boldsymbol\{\\theta\}\_\{s\}\\sim\\mathcal\{N\}\(\\boldsymbol\{\\theta\}^\{\\ast\},\\mathbf\{\\Sigma\}\)centered at the MAP𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}, but it computes predictions via the first\-order Taylor expansionf𝜽∗\(𝒙\)\+𝐉f𝜽∗\(𝒙\)\(𝜽s−𝜽∗\)f\_\{\\boldsymbol\{\\theta\}^\{\\ast\}\}\(\\boldsymbol\{x\}\)\+\\mathbf\{J\}\_\{f\_\{\\boldsymbol\{\\theta\}^\{\\ast\}\}\}\(\\boldsymbol\{x\}\)\(\\boldsymbol\{\\theta\}\_\{s\}\-\\boldsymbol\{\\theta\}^\{\\ast\}\), which can be evaluated efficiently with forward\-mode automatic differentiation\.RLAuses the centered Laplace samples𝒗~s=𝜽s−𝜽∗\\tilde\{\\boldsymbol\{v\}\}\_\{s\}=\\boldsymbol\{\\theta\}\_\{s\}\-\\boldsymbol\{\\theta\}^\{\\ast\}as initial velocities and integrates the geodesic equation on the loss manifold from𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}for timet1t\_\{1\}\.

### D\.1Snelson regression

A22\-layer fully connected network with1616hidden units per layer andGELUactivations was trained for50,00050,000epochs using full\-batch gradient descent with20%20\\%dropout\. The Laplace approximation was computed using the exact full Hessian over all parameters, with prior precision and variance optimized accordingly\. Out\-of\-distribution data consists of the5252points inx∈\[1\.5,3\]x\\in\[1\.5,3\], and the remainingN=148N=148points form the training set\. We setη0=0\.1\\eta\_\{0\}=0\.1and integrate the dynamics for timet1=50t\_\{1\}=50, drawingS=100S=100samples from each method\.

### D\.2Banana classification

A44\-layer fully connected network with1616hidden units per layer andSiLUactivations was trained for5,0005,000epochs using full\-batch gradient descent, with50%50\\%dropout and label smoothing of0\.050\.05\. Of all the available data, we randomly selectedN=265N=265points as the training set while the remaining approximately5,0005,000points serve as the test set\. We setη0=0\.01\\eta\_\{0\}=0\.01and integrate for timet1=100t\_\{1\}=100, drawingS=50S=50samples from each method\.

In Table[4](https://arxiv.org/html/2605.15459#A4.T4)we report in\-distribution performance on the banana classification task\. TheMAPestimate achieves the highest accuracy \(88\.43%\) but is a point estimate and thus provides no valid uncertainty quantification out\-of\-domain\. BothLAandLinLAare too crude approximations, resulting in low accuracy and poor calibration, highlighting the sensitivity of linear Laplace methods to the choice of Hessian approximation in this setting\.RLAachieves strong performance across all metrics, closely approachingMAPaccuracy \(86\.49%\) while providing meaningful uncertainty estimates\.DiMSperforms comparably toRLAin terms of accuracy \(85\.49%\), though with slightly higher NLL and calibration errors, and somewhat larger variance across runs — consistent with the higher exploration induced by the dissipative dynamics\. Note, however, that this need not always be the case, as evidenced by the remaining experiments where these methods remain competitive in\-distribution\.

Table 4:In\-distribution performance for the banana classification dataset\. We report the mean values over theS=50S=50samples along with the standard error of the mean as the uncertainty estimate\. Calibration metrics are computed with1010bins\.
### D\.3UCI classification

A33\-layer fully connected network with1616hidden units per layer andSiLUactivations was trained for10,00010,000epochs with a batch size of6464and50%50\\%dropout\. We drawS=30S=30samples from each method and repeat experiments over55random seeds\.

We provide test performance in Table[6](https://arxiv.org/html/2605.15459#A4.T6)and in\-distribution training performance in Table[7](https://arxiv.org/html/2605.15459#A4.T7)\. For each seed, performance metrics are computed by averaging over theS=30S=30sample, and results are reported as the mean with the standard error across the 5 seeds as the uncertainty estimate\. Calibration errors are computed using1010bins\. Overall,DiMSperforms competitively withRLAacross datasets, with the geometry\-aware methods generally dominating over the standard approximations\. As in the banana experiment, sampling\-based methods occasionally trade a degree of in\-distribution accuracy for improved calibration and uncertainty quantification\. This is seen both on the test and training set\.

### D\.4MNIST and Fashion MNIST

We train a LeNet classifier \(approx\.44,00044\{,\}000parameters\) on MNIST and Fashion MNIST using Adam for1010epochs using a batch size of128128, with a log\-softmax output layer and negative log\-likelihood loss\. After training, the MAP checkpoint𝜽∗\\boldsymbol\{\\theta\}^\{\\ast\}serves as the reference point for all posterior approximations\.

We compareDiMSto Laplace\-based approximations that fit a Gaussian𝒩\(𝜽∗,𝚺\)\\mathcal\{N\}\(\\boldsymbol\{\\theta\}^\{\\ast\},\\mathbf\{\\Sigma\}\)whereΣ−1=𝐇ℒ\(𝜽∗\)\\Sigma^\{\-1\}=\\mathbf\{H\}\_\{\\mathcal\{L\}\}\(\\boldsymbol\{\\theta\}^\{\\ast\}\)is the Hessian of the negative log\-posterior\. For computational reasons, we estimate the Hessian matrix on a randomly selected subset ofN=1,000N=1\{,\}000training samples and since the full Hessian is intractable for a network with a high\-dimensional parameter space, we resort to an approximation of the Hessian\. ForDiMSwe set the friction coefficient toη0=0\.1\\eta\_\{0\}=0\.1and consider the study of three Hessian approximations \- specifically the diagonal \(D\), Kronecker\-factored Approximate Curvature \(KFAC or K\) and a low\-rank approximation \(LR\) based on the eigenvalues of the full Hessian \- as an ablation study on which approximate Hessian to use for further experiments\. As can be seen in Table[5](https://arxiv.org/html/2605.15459#A4.T5), the diagonal approximation underfits severely in the standard and linearized Laplace settings, in fact to such an extend that the initial velocities results in unstable behavior of the geometry\-based methods that never converge within the tolerances set, for which reason we exclude these results\. While KFAC also results in underfitting for the standard Laplace and linearized Laplace, the geometry\-aware methods perform relatively well on the task withDiMS\-Kbeing superior toRLA\-K\. Lastly, the LR approximations improve results significantly for the sampled, linearized and Riemannian Laplace approximations, and also forDiMSon certaint metrics\. We remark thatDiMSis superior or on par withRLA\.

With this experiment we therefore highlight the limitation of the proposed methods originating from the approximation of the Hessian that is used to generate the initial velocities\. Nevertheless,DiMSappears to be somewhat robust to this choice, which is a clear benefit of the approach\. We continue with the low\-rank approximation and evaluate the generated samples from each method on out\-of\-distribution \(OOD\) data via the AUROC score\. For MNIST the OOD sets are FashionMNIST, EMNIST, and KMNIST, and for FashionMNIST they are MNIST, EMNIST, and KMNIST\. These results are provided in Table[3](https://arxiv.org/html/2605.15459#S5.T3)in the main paper\. We provide ID and OOD results under different settings in Tables[8](https://arxiv.org/html/2605.15459#A4.T8)\-[10](https://arxiv.org/html/2605.15459#A4.T10)\.

Table 5:Influence of the Hessian approximation\. Remark that the diagonal approximation is unstable for the geometric methods, given the already low tolerances, and therefore excluded\. Results are based on a single seed for exploration purposes and serves as a first step targeting which approximation to choose for further experiments\.Table 6:UCI results on the test sets\. We report the mean of 5 seeds along with the standard error\.Table 7:UCI results on the training sets\. We report the mean of 5 seeds along with the standard error\.Table 8:ID results for a random seed on MNIST and Fashion MNIST\. We show that the running time does not matter much, when compared to theRLAbaseline\. We use the low\-rank Hessian approximation for all cases, and provide the associated OOD detection results below, given by theAUROCscore \(↑\\uparrow\)\.
Table 9:ID results for a different random seed on MNIST and Fashion MNIST\. We show again that the running time does not matter much, compared to theRLAbaseline\. We also observe that our proposed sampler works well for higher friction coefficientsη0\\eta\_\{0\}\. We use the low\-rank Hessian approximation for all cases, and provide the associated OOD detection results below, given by theAUROCscore \(↑\\uparrow\)\.
Table 10:ID results for a third random seed on MNIST and Fashion MNIST\. We use the low\-rank Hessian approximation for all cases, and provide the associated OOD detection results below, given by theAUROCscore \(↑\\uparrow\)\.
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics

Similar Articles

Why DDIM Hallucinates More than DDPM: A Theoretical Analysis of Reverse Dynamics

Debiased Model-based Representations for Sample-efficient Continuous Control

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum

Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Submit Feedback

Similar Articles

Why DDIM Hallucinates More than DDPM: A Theoretical Analysis of Reverse Dynamics
Debiased Model-based Representations for Sample-efficient Continuous Control
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics
DP-MacAdam: Differentially Private Mechanism with Adaptive Clipping and Adaptive Momentum
Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning