Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective

arXiv cs.LG Papers

Summary

This paper proposes a unified framework for energy-based generative models by casting density transport as a nonlinear control problem with KL divergence as a Lyapunov function. It derives finite-step stopping criteria and demonstrates how nonlinear control theory tools can be applied to static scalar energy models.

arXiv:2605.05530v1 Announce Type: new


# Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective

Source: [https://arxiv.org/html/2605.05530](https://arxiv.org/html/2605.05530)

Yixuan Wang, Wenqian Xue, Warren E. Dixon
Department of Mechanical and Aerospace Engineering, University of Florida
{wang.yixuan, w.xue, wdixon}@ufl.edu

###### Abstract

Generative models based on static scalar energy functions represent an emerging paradigm in which a single time-independent potential drives sample generation through its gradient field, eliminating the need for time conditioning entirely. We unify the training and sampling phases of this paradigm, conventionally treated as separate procedures, within a single framework: density transport on the Wasserstein space, cast as a nonlinear control problem in which the Kullback-Leibler (KL) divergence serves as a Lyapunov function. Training and sampling are then two instances of this same master dynamics, differing only in initial condition. Within this autonomous framework we develop two analytic results. First, since the Lyapunov certificate is asymptotic, we derive a finite-step stopping criterion for Langevin sampling and prove that no Lyapunov certificate exists for the deterministic gradient flow on the same energy landscape. Second, the reformulation brings the toolkit of nonlinear control theory to bear on static scalar energy generative modeling: we show that additive composition of trained scalar energies retains an explicit Gibbs invariant measure and inherits the closed-loop Lyapunov certificate. Beyond these immediate results, this reformulation bridges static scalar energy generative models with the full toolkit of nonlinear control theory, opening the door to barrier functions for constrained generation and contraction metrics for accelerated sampling. Experiments on synthetic distributions validate the theoretical predictions.

## 1 Introduction

The primary task for generative models is to construct a mechanism that produces samples from an optimal distribution $\rho^{*}$, an approximation of the underlying data distribution, given only a finite training set $\{x_i\}_{i=1}^{N}\subset\mathbb{R}^{d}$. Diffusion models (Sohl-Dickstein et al. [2015](https://arxiv.org/html/2605.05530#bib.bib25); Ho et al. [2020](https://arxiv.org/html/2605.05530#bib.bib1); Song and Ermon [2019](https://arxiv.org/html/2605.05530#bib.bib22); Song et al. [2021](https://arxiv.org/html/2605.05530#bib.bib2)) and flow matching (Lipman et al. [2023](https://arxiv.org/html/2605.05530#bib.bib3); Liu et al. [2023](https://arxiv.org/html/2605.05530#bib.bib4); Albergo et al. [2023](https://arxiv.org/html/2605.05530#bib.bib26); Tong et al. [2024](https://arxiv.org/html/2605.05530#bib.bib27)) have emerged as the dominant approaches, learning time-dependent vector fields or score functions that transport a simple noise distribution to the data distribution along a continuous probability path. A recent paradigm, Energy Matching (Balcerak et al. [2025](https://arxiv.org/html/2605.05530#bib.bib8)), removes time conditioning entirely. Instead of learning time-varying dynamics, one learns a single scalar energy function $U_{\theta}\colon\mathbb{R}^{d}\to\mathbb{R}$, defines the Gibbs distribution $\rho_{\theta}(x)=\exp(-U_{\theta}(x))/Z_{\theta}$ as a surrogate for $\rho_{\mathrm{data}}$, and draws samples by running dynamics on this static energy landscape.

The static energy paradigm possesses a structural property that the time-conditioned paradigm does not: it produces an autonomous (time-independent) dynamical system. Concretely, samples are drawn by simulating the stochastic differential equation $dx_t=-\nabla U_{\theta}(x_t)\,dt+\sqrt{2}\,dW_t$, whose distributional counterpart is the Fokker-Planck equation $\partial_t\rho_t=\nabla\cdot(\rho_t\nabla U_{\theta})+\Delta\rho_t$, both governed by the time-invariant drift $\nabla U_{\theta}$. Autonomous systems are precisely the setting where the classical toolkit of nonlinear control theory applies most naturally: Lyapunov stability and control barrier functions (Khalil [2002](https://arxiv.org/html/2605.05530#bib.bib33); Sontag [1998](https://arxiv.org/html/2605.05530#bib.bib34); Ames et al. [2017](https://arxiv.org/html/2605.05530#bib.bib44)).
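As a concrete numerical sketch (ours, not part of the paper's experiments; the quadratic energy is an assumption made here so that the Gibbs measure is the standard Gaussian), the Langevin SDE above can be discretized by the Euler-Maruyama scheme:

```python
import numpy as np

def grad_U(x):
    # Illustrative energy U(x) = ||x||^2 / 2, whose Gibbs density
    # exp(-U)/Z is the standard Gaussian N(0, I).
    return x

def langevin_sample(n_particles=5000, n_steps=500, dt=1e-2, d=2, seed=0):
    """Euler-Maruyama discretization of dx = -grad U(x) dt + sqrt(2) dW."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=3.0, size=(n_particles, d))  # broad prior rho_0
    for _ in range(n_steps):
        x = x - grad_U(x) * dt + np.sqrt(2.0 * dt) * rng.normal(size=x.shape)
    return x

samples = langevin_sample()
# Empirical moments approach those of the invariant measure N(0, I).
print(samples.mean(axis=0), samples.var(axis=0))
```

For small `dt` the empirical mean approaches 0 and the per-coordinate variance approaches 1, up to an $O(dt)$ discretization bias.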

The classical analytical perspective on the Langevin sampler treats $\mathrm{KL}(\rho_t\,\|\,\rho_{\theta})$ as a Lyapunov function on $\mathcal{P}_2(\mathbb{R}^d)$ (Bakry et al. [2014](https://arxiv.org/html/2605.05530#bib.bib13)). This is an analytical guarantee for a dynamics inherited from physics. However, it does not address why the Langevin stochastic differential equation (SDE), rather than its deterministic counterpart $\dot{x}_t=-\nabla U_{\theta}(x_t)$ or any other autonomous flow on the same energy landscape, is the correct distributional generator in the first place. The generative modeling objective is fundamentally distributional: the task is not to move one particle to one mode but to steer a prior distribution $\rho_0$ toward $\rho^{*}$ so as to minimize a metric such as total variation or KL divergence.

The present work reformulates the static energy generative problem as a nonlinear control problem of probability density transport on $\mathcal{P}_2(\mathbb{R}^d)$. In this reformulation, the state variable is the probability density $\rho_t$ itself, the open-loop dynamics are governed by the continuity equation $\partial_t\rho_t=-\nabla\cdot(\rho_t u_t)$ with the velocity field $u_t$ serving as the control input, and the KL divergence $\mathrm{KL}(\rho_t\,\|\,\rho^{*})$ serves as the Lyapunov function. The Lyapunov design requirement that $\frac{d}{dt}\mathrm{KL}(\rho_t\,\|\,\rho^{*})\leq 0$ along all trajectories determines a unique optimal velocity field, and the resulting closed-loop density dynamics is the Fokker-Planck equation.

Our contributions are as follows.

1. We develop a control-theoretic reformulation in which training, sampling, and smoothing are unified as three instances of density transport on $\mathcal{P}_2(\mathbb{R}^d)$, each governed by a continuity equation.
2. We provide a finite-time stopping analysis for deterministic sampling and Langevin sampling. The deterministic flow is analyzed as an early-stopped sampling surrogate, despite its asymptotic tendency toward mode collapse, while Langevin dynamics is treated as the intrinsic stochastic sampler. Our analysis identifies when deterministic transport can be terminated in a meaningful distributional regime and uses a Lyapunov-based argument to characterize the performance of Langevin sampling.
3. We bring the toolkit of nonlinear control theory to bear on static scalar energy generative modeling. We show that additive composition of trained scalar energies retains an explicit Gibbs invariant measure and inherits the closed-loop Lyapunov certificate, which enables the use of control barrier functions for safety-guaranteed generation.

#### Notation.

Throughout, $\nabla$ denotes the gradient operator $(\partial_{x_1},\ldots,\partial_{x_d})^{\top}$ and $\nabla\cdot$ denotes the divergence operator acting on vector fields: $\nabla\cdot F=\sum_{i=1}^{d}\partial_{x_i}F_i$. The Laplacian is $\Delta=\nabla\cdot\nabla=\sum_{i=1}^{d}\partial_{x_i}^{2}$. The space $\mathcal{P}_2(\mathbb{R}^d)$ consists of Borel probability measures on $\mathbb{R}^d$ with finite second moment. For $\rho,\nu\in\mathcal{P}_2(\mathbb{R}^d)$ with $\rho$ absolutely continuous with respect to $\nu$, the KL divergence is $\mathrm{KL}(\rho\,\|\,\nu)=\int_{\mathbb{R}^d}\rho\log(\rho/\nu)\,dx$, the total variation distance is $\mathrm{TV}(\rho,\nu)=\frac{1}{2}\int_{\mathbb{R}^d}|\rho-\nu|\,dx$, and the relative Fisher information is $\mathcal{I}(\rho\,\|\,\nu)=\int_{\mathbb{R}^d}\rho\,\|\nabla\log(\rho/\nu)\|^{2}\,dx$. For $A\in\mathbb{R}^{n\times n}$, $A\succeq 0$ denotes that $A$ is positive semidefinite. $K_h$ is a symmetric positive definite kernel used in kernel density estimation (KDE) with bandwidth $h>0$, for example the Gaussian kernel $K_h(x)=(2\pi h^{2})^{-d/2}\exp(-\|x\|^{2}/(2h^{2}))$. $x$ denotes the data input and $\theta$ denotes the parameters of the neural network.

## 2 Control-Theoretic Reformulation

In this section, we develop the control-theoretic reformulation. Take the Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ as the state space and the probability density $\rho\in\mathcal{P}_2(\mathbb{R}^d)$ as the state variable. The open-loop dynamics of the probability density are governed by the continuity equation

$$\partial_t\rho=-\nabla\cdot(\rho\,u),\tag{1}$$

in which the velocity field $u\colon\mathbb{R}^d\to\mathbb{R}^d$ enters as the control input. The design objective is to select a controller $u$ such that the resulting closed-loop dynamics drives a prior distribution $\rho_0$ to a target distribution. The prior and target distributions differ between the training phase and the sampling phase. In the training phase, a neural network learns a static scalar energy field $U_{\theta}^{*}$ such that the corresponding Gibbs density $\rho_{\theta}^{*}(x)=\exp(-U_{\theta}^{*}(x))/Z_{\theta}^{*}$ is an optimal approximation of $\rho^{*}$ as $t\to\infty$. In the sampling phase, a well-trained energy field $U_{\theta}^{*}$ is given, and the density transport is from the prior noise distribution $\rho_0$ to $\rho_{\theta}^{*}$. Since both training and sampling share the same dynamical system, we take the training phase as the running example in the derivations below.
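A minimal one-dimensional sketch of the open-loop continuity equation (1) (ours; the constant velocity field and the upwind discretization are assumptions made for illustration): an explicit upwind step transports the density while conserving probability mass.

```python
import numpy as np

# Uniform 1-D grid for the density rho
n = 400
xs = np.linspace(-5.0, 5.0, n)
dx = xs[1] - xs[0]

# Initial density: a normalized Gaussian bump centered at the origin
rho = np.exp(-xs**2)
rho /= rho.sum() * dx

u = 1.0                  # constant rightward control velocity (assumed)
dt = 0.5 * dx / abs(u)   # CFL-stable explicit time step

for _ in range(200):
    flux = rho * u                       # F = rho * u
    # Upwind divergence for u > 0: (F_i - F_{i-1}) / dx
    div = (flux - np.roll(flux, 1)) / dx
    div[0] = 0.0                         # crude inflow boundary condition
    rho = rho - dt * div                 # d rho / dt = -div(rho u)

mass = rho.sum() * dx                    # total probability mass stays ~1
center = (xs * rho).sum() * dx           # mean advects to roughly u * T
print(mass, center)
```

The step is positivity-preserving under the CFL condition, so `rho` remains a valid (discrete) probability density throughout the transport.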

To accomplish this objective through Lyapunov’s direct method, we choose the KL divergence as the Lyapunov candidate

$$V(\rho_{\theta})=\mathrm{KL}(\rho_{\theta}\,\|\,\rho^{*})=\int_{\mathbb{R}^{d}}\rho_{\theta}(x)\log\frac{\rho_{\theta}(x)}{\rho^{*}(x)}\,dx.\tag{2}$$

Then, following control theory, the Lie derivative (Yano [2020](https://arxiv.org/html/2605.05530#bib.bib52)) of $V$ along the open-loop dynamics in [(1)](https://arxiv.org/html/2605.05530#S2.E1) is computed as (see Appendix [A.1](https://arxiv.org/html/2605.05530#A1.SS1))

$$\mathcal{L}_{u}V(\rho_{\theta})=\int_{\mathbb{R}^{d}}\rho_{\theta}(x)\,u(x)\cdot\bigl(\nabla\log\rho_{\theta}(x)+\nabla U^{*}\bigr)\,dx,\tag{3}$$

where $U^{*}$ corresponds to $\rho^{*}$ in the Gibbs density. The same expression appears in Wasserstein gradient flow theory (Jordan et al. [1998](https://arxiv.org/html/2605.05530#bib.bib10); Villani [2009](https://arxiv.org/html/2605.05530#bib.bib11)), here reached through the control-theoretic route of computing the directional derivative of a Lyapunov candidate along the controlled flow. To ensure convergence, the Lyapunov condition $\mathcal{L}_{u}V(\rho_{\theta})<0$ for $\rho_{\theta}\neq\rho^{*}$ must hold. Selecting the optimal controller $u$ as

$$u^{*}(x,\rho_{\theta})=-\nabla U^{*}(x)-\nabla\log\rho_{\theta}(x),\tag{4}$$

the Lyapunov derivative collapses to the negative relative Fisher information,

$$\mathcal{L}_{u^{*}}V(\rho_{\theta})=-\int_{\mathbb{R}^{d}}\rho_{\theta}(x)\,\bigl\|\nabla\log\rho_{\theta}(x)+\nabla U^{*}(x)\bigr\|^{2}\,dx=-\mathcal{I}(\rho_{\theta}\,\|\,\rho^{*})\leq 0,\tag{5}$$

where equality holds if and only if $\rho_{\theta}=\rho^{*}$. Then, by Lyapunov stability principles (Khalil [2002](https://arxiv.org/html/2605.05530#bib.bib33)), the controlled trajectory satisfies $\rho_{\theta}\to\rho^{*}$ as $t\to\infty$. The neural network parameters $\theta$ are updated at each training cycle, which in turn updates the controller, as discussed later.

The closed-loop density dynamics, obtained by substituting [(4)](https://arxiv.org/html/2605.05530#S2.E4) into [(1)](https://arxiv.org/html/2605.05530#S2.E1) and using $\rho\nabla\log\rho=\nabla\rho$, is the Fokker-Planck equation

$$\partial_t\rho=\nabla\cdot(\rho\,\nabla U^{*})+\Delta\rho,\tag{6}$$

whose particle-level realization is the Langevin SDE (Risken [1996](https://arxiv.org/html/2605.05530#bib.bib14))

$$dx_t=-\nabla U^{*}\,dt+\sqrt{2}\,dW_t,\tag{7}$$

where $W_t$ is a standard Brownian motion.

On the sampling side, the Langevin SDE is recovered as the steepest-descent controller selected by the Lyapunov design. The optimal control law $u^{*}$ admits a decomposition into a feedforward term $-\nabla U_{\theta}^{*}$, encoding the score of the target measure, and a state-dependent feedback term $-\nabla\log\rho_{\theta}$ that the closed-loop system in [(7)](https://arxiv.org/html/2605.05530#S2.E7) realizes through the Brownian increment $\sqrt{2}\,dW_t$. Removing the Brownian noise produces deterministic gradient descent on the learned energy field $U_{\theta}^{*}$, which, as we rigorously establish in Section [3](https://arxiv.org/html/2605.05530#S3), cannot serve as a distributional sampler.

On the training side, plugging $\rho_{\theta}$ into [(4)](https://arxiv.org/html/2605.05530#S2.E4) identifies the velocity field $u^{*}(x,\rho_{\theta})=-\nabla U^{*}(x)-s_{\rho_{\theta}}(x)$, where $s_{\rho_{\theta}}\triangleq\nabla\log\rho_{\theta}$ is the score of the model density induced by $U_{\theta}$, so that realizing $u^{*}$ is equivalent to fitting the energy $U_{\theta}$. To achieve the convergence $\rho_{\theta}\to\rho^{*}$, fitting $U_{\theta}$ corresponds to two matching contributions. One is implicit score matching, which minimizes $\mathbb{E}_{\rho^{*}}\bigl[\tfrac{1}{2}\|\nabla U_{\theta}\|^{2}-\Delta U_{\theta}\bigr]$ and recovers the score via the integration-by-parts identity (Hyvärinen [2005](https://arxiv.org/html/2605.05530#bib.bib49)). The other is denoising score matching, which minimizes $\mathbb{E}_{\rho^{*},q_{\sigma}}\bigl[\|\nabla U_{\theta}(\tilde{x})+\nabla\log q_{\sigma}(\tilde{x}\mid x)\|^{2}\bigr]$ under a perturbation kernel $q_{\sigma}$ (Vincent [2011](https://arxiv.org/html/2605.05530#bib.bib50)); note that the model score is $-\nabla U_{\theta}$, hence the plus sign. Both losses, optimized over a sufficiently expressive scalar parameterization, drive $\rho_{\theta}\to\rho^{*}$ in the metric induced by the relative Fisher information. Therefore, training and sampling are not two separate algorithmic blocks but two specifications of the same Lyapunov controller within a unified framework: the training stage approximates $\rho^{*}$, while the sampling stage applies the full controller $u^{*}$ to a noise initial condition.
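To make the denoising contribution concrete, the following toy sketch (ours, not the authors' implementation; the one-parameter quadratic energy $U_{\theta}(x)=\theta\|x\|^{2}/2$ and the Gaussian data are assumptions) fits $\theta$ by gradient descent on the denoising score-matching loss with a Gaussian perturbation kernel $q_{\sigma}$. For data drawn from $N(0,s^{2}I)$, the recovered $\theta$ approaches $1/(s^{2}+\sigma^{2})$, the precision of the smoothed target.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 2, 20000
s, sigma = 1.0, 0.5

x = rng.normal(scale=s, size=(N, d))       # toy "data" from N(0, s^2 I)
eps = rng.normal(size=(N, d))
x_tilde = x + sigma * eps                  # perturbed samples from q_sigma

# One-parameter energy U_theta(x) = theta * ||x||^2 / 2, so the model
# score is -grad U_theta(x) = -theta * x.  Denoising score matching fits
# -grad U_theta(x_tilde) to grad log q_sigma(x_tilde|x) = -(x_tilde - x)/sigma^2.
target = -(x_tilde - x) / sigma**2
theta, lr = 0.0, 0.05                      # illustrative init / learning rate
for _ in range(500):
    resid = (-theta * x_tilde) - target            # pointwise score residual
    grad_theta = 2.0 * np.mean(np.sum(resid * (-x_tilde), axis=1))
    theta -= lr * grad_theta

# The smoothed target N(0, (s^2 + sigma^2) I) has precision
# 1 / (s^2 + sigma^2) = 0.8 for these values.
print(theta)
```

The closed-form optimum here is $\theta^{*}=\mathbb{E}[\tilde{x}\cdot(\tilde{x}-x)]/(\sigma^{2}\,\mathbb{E}\|\tilde{x}\|^{2})=1/(s^{2}+\sigma^{2})$, which the gradient iteration recovers up to Monte Carlo error.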

Regarding the form of the available data providing the training target $\rho^{*}$: the empirical measure, as a Dirac representation $\hat{\rho}_{\mathrm{data}}=\frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}$, is singular and admits neither a score nor a finite KL divergence to any absolutely continuous density. Therefore, the effective target of score-based or KL-based training procedures is a smoothed density $\rho^{*}=\hat{\rho}_{\mathrm{data}}\ast K_h$, obtained by convolution with a symmetric, strictly positive kernel $K_h$ of bandwidth $h>0$ (Hoti [2003](https://arxiv.org/html/2605.05530#bib.bib51)). The smoothing operation $\hat{\rho}_{\mathrm{data}}\mapsto\rho^{*}$ is itself a kernel operation on $\mathcal{P}_2(\mathbb{R}^d)$ in the limiting sense that convolution with a diffusion semigroup interpolates between the singular and the smoothed measures. The smoothed density $\rho^{*}=\hat{\rho}_{\mathrm{data}}\ast K_h$ enters the framework as the fixed target of the training transport. The training process then seeks an optimal result such that $\rho_{\theta}^{*}$ is a close approximation of $\rho^{*}$, which closes the framework, identifying the Lyapunov target with the smoothed data measure and the closed-loop sampler [(7)](https://arxiv.org/html/2605.05530#S2.E7) as a transport whose stationary distribution is exactly $\rho^{*}$. Table [1](https://arxiv.org/html/2605.05530#S2.T1) summarises the transports unified by this construction.

Table 1: Two density transports on $\mathcal{P}_2(\mathbb{R}^d)$ unified by the control-theoretic reformulation. Both are governed by a continuity equation.

| Transport | Initial density | Target density |
| --- | --- | --- |
| Training | model density $\rho_{\theta}$ | smoothed data density $\rho^{*}=\hat{\rho}_{\mathrm{data}}\ast K_h$ |
| Sampling | noise prior $\rho_0$ | trained Gibbs density $\rho_{\theta}^{*}$ |
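The smoothed target is directly computable: with a Gaussian kernel, $\rho^{*}=\hat{\rho}_{\mathrm{data}}\ast K_h$ is a mixture of $N$ Gaussians whose log-density and score $\nabla\log\rho^{*}$ are available in closed form. The sketch below (ours; the toy data set is an assumption) evaluates both with a numerically stable log-sum-exp.

```python
import numpy as np

def kde_log_density_and_score(x, data, h):
    """log rho*(x) and grad log rho*(x) for rho* = (1/N) sum_i K_h(x - x_i),
    with the Gaussian kernel K_h(u) = (2 pi h^2)^(-d/2) exp(-||u||^2/(2 h^2))."""
    d = data.shape[1]
    diff = x[None, :] - data                      # (N, d) displacements
    log_k = (-np.sum(diff**2, axis=1) / (2 * h**2)
             - 0.5 * d * np.log(2 * np.pi * h**2))  # log K_h(x - x_i)
    m = log_k.max()                               # log-sum-exp stabilization
    w = np.exp(log_k - m)
    log_rho = m + np.log(w.mean())
    # Score: softmax-weighted average of the per-kernel scores -(x - x_i)/h^2
    weights = w / w.sum()
    score = -(weights[:, None] * diff).sum(axis=0) / h**2
    return log_rho, score

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))                  # toy training set
log_rho, score = kde_log_density_and_score(np.zeros(2), data, h=0.5)
print(log_rho, score)
```

A finite-difference check of $\nabla\log\rho^{*}$ against the returned score confirms the closed form; this score is exactly the quantity the denoising loss estimates implicitly.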
## 3 Sampling Convergence Rates and Finite Step Stopping

### 3.1 Deterministic Sampling Leads to Mode Collapse

Given a learned energy $U_{\theta}^{*}$, setting the noise to zero in [(7)](https://arxiv.org/html/2605.05530#S2.E7) produces the deterministic gradient flow $\dot{x}_t=-\nabla U_{\theta}^{*}(x_t)$, whose distributional counterpart is the continuity equation

$$\partial_t\rho_t=\nabla\cdot(\rho_t\,\nabla U_{\theta}^{*}),\tag{8}$$

corresponding to the controller $u_{d}^{*}(x)=-\nabla U_{\theta}^{*}(x)$. This dynamics drives every individual trajectory toward a critical point of $U_{\theta}^{*}$, concentrating the distribution onto local minima, a phenomenon known as mode collapse. To analyze the resulting failure to preserve the target distribution, substituting the Gibbs density $\rho_{\theta}^{*}=e^{-U_{\theta}^{*}}/Z_{\theta}^{*}$ into the right-hand side of [(8)](https://arxiv.org/html/2605.05530#S3.E8) and using $\nabla\rho_{\theta}^{*}=-\rho_{\theta}^{*}\nabla U_{\theta}^{*}$ gives

$$\nabla\cdot(\rho_{\theta}^{*}\nabla U_{\theta}^{*})=(-\rho_{\theta}^{*}\nabla U_{\theta}^{*})\cdot\nabla U_{\theta}^{*}+\rho_{\theta}^{*}\Delta U_{\theta}^{*}=\rho_{\theta}^{*}\,h(x),\tag{9}$$

where $h(x)\triangleq\Delta U_{\theta}^{*}(x)-\|\nabla U_{\theta}^{*}(x)\|^{2}$ is a residual function (see details in Appendix [A.2](https://arxiv.org/html/2605.05530#A1.SS2)). At any nondegenerate local minimum $x_0$, one has $h(x_0)=\mathrm{tr}(\nabla^{2}U_{\theta}^{*}(x_0))>0$, while under the assumption that $U_{\theta}^{*}$ has Lipschitz gradient and confining tails forcing $\|\nabla U_{\theta}^{*}(x)\|^{2}\to\infty$ as $\|x\|\to\infty$, we have $h(x)\to-\infty$. Therefore, $h$ changes sign and cannot vanish identically. It follows that $\partial_t\rho_{\theta}^{*}\neq 0$, so the Gibbs density is not a stationary solution of the deterministic flow in [(8)](https://arxiv.org/html/2605.05530#S3.E8), and the deterministic flow possesses no equilibrium that coincides with the target. We now show a stronger result extending this local failure to a global obstruction.
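The sign change of $h$ is easy to verify numerically (a toy check, ours; the Gaussian energy is an assumption): for $U(x)=\|x\|^{2}/2$ one has $\Delta U=d$ and $\|\nabla U\|^{2}=\|x\|^{2}$, so $h(x)=d-\|x\|^{2}$ is positive at the minimum and negative in the tails.

```python
import numpy as np

d = 3  # dimension of the toy example

def h(x):
    # h(x) = Laplacian(U)(x) - ||grad U(x)||^2 for U(x) = ||x||^2 / 2:
    # the Hessian is the identity, so Laplacian(U) = d, and grad U(x) = x.
    return d - np.sum(x**2)

x_min = np.zeros(d)        # the unique (nondegenerate) minimum of U
x_tail = 4.0 * np.ones(d)  # a point far out in the tail

# h > 0 at the minimum and h < 0 in the tail, so h changes sign and the
# Gibbs density cannot be stationary under the deterministic flow (9).
print(h(x_min), h(x_tail))
```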

###### Theorem 2 (Failure of Deterministic Sampling).

Let $U_{\theta}^{*}:\mathbb{R}^d\to\mathbb{R}$ have Lipschitz gradient and satisfy $\|\nabla U_{\theta}^{*}(x)\|\to\infty$ as $\|x\|\to\infty$. Consider the deterministic density evolution in [(8)](https://arxiv.org/html/2605.05530#S3.E8). There exists no Lyapunov functional $V:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}_{\geq 0}$ that simultaneously satisfies (i) $V(\rho)=0$ if and only if $\rho=\rho_{\theta}^{*}$, (ii) $V(\rho)\geq 0$ for all $\rho\in\mathcal{P}_2(\mathbb{R}^d)$, and (iii) $V(\rho_t)$ nonincreasing along every solution of [(8)](https://arxiv.org/html/2605.05530#S3.E8).

The proof is given in Appendix [A.4](https://arxiv.org/html/2605.05530#A1.SS4). The interpretation is that the entropy gradient term $-\nabla\log\rho_t$, realized at the particle level through the Brownian noise $\sqrt{2}\,dW_t$ in [(7)](https://arxiv.org/html/2605.05530#S2.E7), is an essential component of the optimal transport mechanism in Section [2](https://arxiv.org/html/2605.05530#S2). With this term removed, Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2) establishes that no Lyapunov certificate can exist for the deterministic flow.

### 3.2 Convergence Rate of the Langevin Dynamics

![Refer to caption](https://arxiv.org/html/2605.05530v1/det_converge.png)

Figure 1: Trajectory snapshots on the eight Gaussian targets with sampling steps $K=200$. Top row: deterministic flow. The empirical distribution is closest to $\rho^{*}$ at $k=50$ and degenerates into a sum of Dirac masses on the eight critical points of $U_{\theta}^{*}$ as $k$ increases, in agreement with Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2). The deterministic stopping rule in [(16)](https://arxiv.org/html/2605.05530#S3.E16) predicts $k^{*}=49$, marked by the proximity of the $k=50$ panel to the data. Bottom row: Langevin SDE. The empirical distribution converges to the Gibbs distribution $\rho_{\theta}^{*}$ by $k\approx 100$ and remains visually unchanged at $k=200$. Additional integration past $\tau_{\mathrm{Lang}}^{*}$ does not degrade the bound.

Given $\rho_{\theta}^{*}$ obtained from the trained model, the Lyapunov candidate for sampling is given by

$$V(\rho_t)=\mathrm{KL}(\rho_t\,\|\,\rho_{\theta}^{*}).\tag{10}$$

The Lyapunov derivative $\mathcal{L}_{u}V(\rho_t)=-\mathcal{I}(\rho_t\,\|\,\rho_{\theta}^{*})$ established in [(5)](https://arxiv.org/html/2605.05530#S2.E5) provides a qualitative convergence guarantee for $\rho_t\to\rho_{\theta}^{*}$. To obtain the convergence rate, we invoke a functional inequality relating the KL divergence and the Fisher information. Specifically, the energy $U_{\theta}^{*}$ is said to satisfy the log-Sobolev inequality with constant $C_{\mathrm{LS}}>0$ when

$$V(\rho_t)\;\leq\;\frac{1}{2C_{\mathrm{LS}}}\,\mathcal{I}(\rho_t\,\|\,\rho_{\theta}^{*})\quad\text{for every smooth density }\rho_t\in\mathcal{P}_2(\mathbb{R}^d).\tag{11}$$

This condition is guaranteed by the Bakry–Émery criterion (Bakry and Émery [1985](https://arxiv.org/html/2605.05530#bib.bib12); Bakry et al. [2014](https://arxiv.org/html/2605.05530#bib.bib13)) when $\nabla^{2}U_{\theta}^{*}(x)\succeq C_{\mathrm{LS}}\,I_d$ for all $x$. Combining [(11)](https://arxiv.org/html/2605.05530#S3.E11) with the Lyapunov derivative in [(5)](https://arxiv.org/html/2605.05530#S2.E5) on the infinite-dimensional state space $\mathcal{P}_2(\mathbb{R}^d)$, Grönwall's inequality yields

$$V(\rho_t)\;\leq\;e^{-2C_{\mathrm{LS}}t}\,V(\rho_0),\tag{12}$$

which implies exponential convergence of the KL divergence $\mathrm{KL}(\rho_t\,\|\,\rho_{\theta}^{*})$ along the closed-loop dynamics. The rate in [(12)](https://arxiv.org/html/2605.05530#S3.E12) depends on the log-Sobolev constant $C_{\mathrm{LS}}$, which can degrade significantly in multimodal settings. In particular, for energy landscapes with barriers of height $\Delta E$, one typically has $C_{\mathrm{LS}}\sim e^{-2\Delta E}$, leading to exponentially slow convergence (Holley and Stroock [1987](https://arxiv.org/html/2605.05530#bib.bib16)). This limitation is inherent to analyses based on [(11)](https://arxiv.org/html/2605.05530#S3.E11) and motivates the empirical surrogate and deterministic stopping rule introduced in the next subsection.
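The rate (12) can be checked in closed form in a toy case (ours; the one-dimensional quadratic energy, for which $C_{\mathrm{LS}}=1$, is an assumption): a Gaussian initial density $N(0,v_0)$ stays Gaussian under the Fokker-Planck equation, its variance obeys $\dot{v}=2(1-v)$, and the resulting KL divergence to $N(0,1)$ stays below the exponential envelope.

```python
import numpy as np

def kl_gauss(v):
    # KL( N(0, v) || N(0, 1) ) in one dimension
    return 0.5 * (v - 1.0 - np.log(v))

v0 = 9.0                                   # variance of the initial density
ts = np.linspace(0.0, 3.0, 301)
v = 1.0 + (v0 - 1.0) * np.exp(-2.0 * ts)   # solution of v' = 2(1 - v)
kl = kl_gauss(v)

# Lyapunov bound (12) with C_LS = 1: KL(t) <= exp(-2 t) KL(0)
bound = np.exp(-2.0 * ts) * kl_gauss(v0)
print(bool(np.all(kl <= bound + 1e-12)))
```

The bound is tight at $t=0$ and conservative afterwards, as expected from a worst-case functional inequality.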

### 3.3 Stopping Criteria for Sampling

The convergence rate in [(12)](https://arxiv.org/html/2605.05530#S3.E12) prescribes a stopping time for Langevin sampling given the trained model $\rho_{\theta}^{*}$. The time required to achieve a prescribed generation accuracy $V(\rho_t)\leq\varepsilon$ is obtained by solving [(12)](https://arxiv.org/html/2605.05530#S3.E12) for $t$:

$$\tau_{\mathrm{Lang}}^{*}\;=\;\frac{1}{2C_{\mathrm{LS}}}\,\log\Bigl(\frac{V(\rho_0)}{\varepsilon}\Bigr),\tag{13}$$

at which the bound $\mathrm{KL}(\rho_{\tau_{\mathrm{Lang}}^{*}}\,\|\,\rho_{\theta}^{*})\leq\varepsilon$ holds, as seen by substituting $t=\tau_{\mathrm{Lang}}^{*}$ into [(12)](https://arxiv.org/html/2605.05530#S3.E12). This result aligns with asymptotic convergence rate analysis. Equation [(13)](https://arxiv.org/html/2605.05530#S3.E13) is a sufficient sampling time, not a stopping rule in the operational sense: the Lyapunov decay in [(5)](https://arxiv.org/html/2605.05530#S2.E5) guarantees that $\mathrm{KL}(\rho_t\,\|\,\rho_{\theta}^{*})$ decreases along the closed loop, so running the Langevin sampler past $\tau_{\mathrm{Lang}}^{*}$ never degrades the bound. The role of [(13)](https://arxiv.org/html/2605.05530#S3.E13) is to certify that, for a given performance bound $\varepsilon$, a sufficient sampling time is $t=\tau_{\mathrm{Lang}}^{*}$.
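As a numerical illustration (the values are assumed here, not taken from the paper), with $C_{\mathrm{LS}}=0.5$, an initial divergence $V(\rho_0)=10$ nats, and target accuracy $\varepsilon=10^{-3}$, equation (13) certifies a sufficient sampling time of about $9.21$:

```python
import numpy as np

def tau_langevin(c_ls, v0, eps):
    # Sufficient Langevin sampling time (13): tau = log(v0 / eps) / (2 c_ls)
    return np.log(v0 / eps) / (2.0 * c_ls)

tau = tau_langevin(c_ls=0.5, v0=10.0, eps=1e-3)
print(tau)  # log(10 / 1e-3) / (2 * 0.5) = log(1e4) ≈ 9.2103
```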

By Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2), no Lyapunov certificate of $\rho_{\theta}^{*}$ exists for [(8)](https://arxiv.org/html/2605.05530#S3.E8), and running the deterministic flow longer eventually drives $\rho_t$ away from the target through mode collapse onto the critical set of $U_{\theta}^{*}$. A finite-step analysis is nevertheless possible if one accepts that the goal is no longer to certify convergence to $\rho_{\theta}^{*}$, but to identify the latest moment at which the deterministic trajectory still represents a meaningful approximation to $\rho_{\theta}^{*}$ before Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2) takes effect.[^1] Differentiating the Lyapunov function $V(\rho_t)$ in [(10)](https://arxiv.org/html/2605.05530#S3.E10) along the deterministic evolution $\partial_t\rho_t=\nabla\cdot(\rho_t\nabla U_{\theta}^{*})$ and applying integration by parts gives (see Appendix [A.3](https://arxiv.org/html/2605.05530#A1.SS3))

[^1]: The convergence failure of deterministic sampling in Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2) is a distributional statement; at the particle level, the deterministic flow concentrates each trajectory onto a critical point of $U_{\theta}^{*}$, which often coincides with a high-quality region of the data manifold and yields visually plausible samples in high-dimensional practice, although mode coverage is generically lost.

$$\mathcal{L}_{u^{*}}V(\rho_t)\;=\;\mathbb{E}_{\rho_t}[\Delta U_{\theta}^{*}]\;-\;\mathbb{E}_{\rho_t}[\|\nabla U_{\theta}^{*}\|^{2}]\;=\;-\,\mathbb{E}_{\rho_t}[\|\nabla U_{\theta}^{*}\|^{2}]\,(1-R_t),\tag{14}$$

where the dimensionless ratio

$$R_t\;:=\;\frac{\mathbb{E}_{\rho_t}[\Delta U_{\theta}^{*}]}{\mathbb{E}_{\rho_t}[\|\nabla U_{\theta}^{*}\|^{2}]}\tag{15}$$

compares the Laplacian term with the drift-induced dissipation. The threshold value $R_t=1$ is precisely where the right-hand side of [(14)](https://arxiv.org/html/2605.05530#S3.E14) vanishes, and crossing it marks the regime change from drift-dominated dissipation to Laplacian-dominated concentration. However, the equality $\mathbb{E}_{\rho_t}[\Delta U_{\theta}^{*}]=\mathbb{E}_{\rho_t}[\|\nabla U_{\theta}^{*}\|^{2}]$ does not imply $\rho_t=\rho_{\theta}^{*}$. The deterministic flow therefore exhibits two regimes: a drift-dominated phase $R_t<1$ in which the KL divergence decreases, and a Laplacian-dominated phase $R_t>1$ in which the KL divergence increases, indicating concentration of mass near local minima. The natural stopping time is the transition point

$$\tau_{\mathrm{det}}^{*} \;=\; \inf\{\,\tau\geq 0 \,:\, R_{\tau}=1\,\}, \tag{16}$$

at which the derivative of the KL divergence changes sign and the KL divergence attains its minimum along the trajectory. Integrating the derivative in [(14)](https://arxiv.org/html/2605.05530#S3.E14) up to this stopping time yields

$$\mathrm{KL}(\rho_{\tau_{\mathrm{det}}^{*}}\,\|\,\rho^{*}_{\theta}) \;=\; \mathrm{KL}(\rho_{0}\,\|\,\rho^{*}_{\theta}) \;-\; \int_{0}^{\tau_{\mathrm{det}}^{*}}\mathbb{E}_{\rho_{t}}[\|\nabla U^{*}_{\theta}\|^{2}]\,(1-R_{t})\,dt, \tag{17}$$

which represents the maximum decrease of KL divergence achievable under deterministic evolution.
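The stopping rule in (16) can be monitored directly from particles: at each step one estimates $R_{t}$ by Monte Carlo averages of $\Delta U^{*}_{\theta}$ and $\|\nabla U^{*}_{\theta}\|^{2}$ and halts at the first crossing of one. A minimal sketch, using a hypothetical one-dimensional double-well energy in place of a trained $U^{*}_{\theta}$:

```python
import numpy as np

# Double-well energy U(x) = (x^2 - 1)^2 as a stand-in for a trained U_theta*.
grad_U = lambda x: 4.0 * x * (x**2 - 1.0)   # U'
lap_U  = lambda x: 12.0 * x**2 - 4.0        # U'' (the Laplacian in 1-D)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=20_000)       # particles drawn from rho_0
dt, k_star = 1e-3, None

for k in range(5_000):
    # Dimensionless ratio R_t = E[Delta U] / E[|grad U|^2]  (Eq. 15)
    R = lap_U(x).mean() / (grad_U(x)**2).mean()
    if R >= 1.0:                            # stopping rule (Eq. 16)
        k_star = k
        break
    x -= dt * grad_U(x)                     # deterministic gradient step

print(f"estimated stopping step k* = {k_star}, R at crossing = {R:.3f}")
```

On this landscape the ratio starts far below one while the broad initial density dissipates, and crosses one once mass has concentrated near the wells, after which further deterministic steps only sharpen the collapse.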

The two stopping times in [(13)](https://arxiv.org/html/2605.05530#S3.E13) and [(16)](https://arxiv.org/html/2605.05530#S3.E16) play structurally different roles. The Langevin time in [(13)](https://arxiv.org/html/2605.05530#S3.E13) is a sufficient sampling budget, justified by the Lyapunov certificate of Section [2](https://arxiv.org/html/2605.05530#S2), and running past it is harmless. Running past the deterministic time in [(16)](https://arxiv.org/html/2605.05530#S3.E16), by contrast, strictly degrades the approximation. Figure [1](https://arxiv.org/html/2605.05530#S3.F1) contrasts the two samplers on the eight-Gaussian target. The deterministic flow reaches a near-optimal distributional approximation to $\rho_{\theta}^{*}$ at $k=50$ and collapses to eight Dirac masses on the critical set of $U_{\theta}^{*}$ as $k$ increases. The stopping time predicted by [(16)](https://arxiv.org/html/2605.05530#S3.E16) during sampling is $k^{*}=49$, in agreement with the empirical optimum. The Langevin sampler reaches the Gibbs invariant measure by $k\approx 100$ and remains on it for the remaining integration steps, with the $k=200$ panel visually indistinguishable from the $k=100$ panel. The figure makes the structural distinction between [(13)](https://arxiv.org/html/2605.05530#S3.E13) and [(16)](https://arxiv.org/html/2605.05530#S3.E16) concrete: running the Langevin sampler past its sufficient time is harmless, while running the deterministic flow past $\tau_{\mathrm{det}}^{*}$ strictly degrades the approximation in the manner asserted by Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2).

## 4 Composition, Diagnostic OOD, and Outlook

### 4.1 Compositional Generation Under Scalar Parameterization

Let $U_{+},U_{-}:\mathbb{R}^{d}\to\mathbb{R}$ be two scalar energy functions trained independently on data sets $\rho_{+}$ and $\rho_{-}$. For coefficients $\alpha_{+},\alpha_{-}\in\mathbb{R}$, the additive composition

$$U_{\rm comp}(x) \,:=\, \alpha_{+}\,U_{+}(x) \,+\, \alpha_{-}\,U_{-}(x) \tag{18}$$

is itself a scalar function on $\mathbb{R}^{d}$. Therefore its gradient $\nabla U_{\rm comp}=\alpha_{+}\nabla U_{+}+\alpha_{-}\nabla U_{-}$ is the gradient of a scalar potential, and the closed-loop sampler in (7) driven by the drift $-\nabla U_{\rm comp}$ retains an explicit Gibbs invariant measure

$$\rho_{\rm comp}(x) \,\propto\, e^{-\alpha_{+}U_{+}(x)-\alpha_{-}U_{-}(x)}. \tag{19}$$
#### Empirical verification of compositional generation.

We train two scalar energies $U_{A},U_{B}$ on disjoint four-Gaussian mixtures in $\mathbb{R}^{2}$ (centers $A=\{(\pm 3,\pm 3)\}$, centers $B=\{(0,\pm 4),(\pm 4,0)\}$) and sample three compositions by the Langevin SDE under the drift $-\nabla U_{\rm comp}$, with no retraining. The compositions are

$$\begin{aligned}
\text{Conjunction:} &\quad U_{\rm comp} \,=\, U_{A}+U_{B},\\
\text{Disjunction:} &\quad U_{\rm comp} \,=\, -\log\bigl(e^{-U_{A}}+e^{-U_{B}}\bigr),\\
\text{Negation of }A\text{ given }B\text{:} &\quad U_{\rm comp} \,=\, U_{B}-0.35\,U_{A}.
\end{aligned}$$

Figure [2](https://arxiv.org/html/2605.05530#S4.F2) shows $5{,}000$ samples per composition.

![Refer to caption](https://arxiv.org/html/2605.05530v1/compose_samples.png)

Figure 2: Compositional generation by additive energy operations. From left to right: expert $A$ samples, expert $B$ samples, conjunction $U_{A}+U_{B}$, disjunction $-\log(e^{-U_{A}}+e^{-U_{B}})$, negation $U_{B}-0.35\,U_{A}$. Black crosses mark the $A$ centers, and gold stars mark the $B$ centers. The conjunction places samples on a ring through compromise locations between the two mode sets, since the centers of $A$ and $B$ do not coincide and no single location is a mode of both energies. The negation $U_{B}-0.35\,U_{A}$ produces $4{,}973$ of $5{,}000$ samples within $1.5$ of a $B$ center and $0$ within $1.5$ of any $A$ center, consistent with $A$ modes being repelled by the negative coefficient. Composed energies were not retrained.

The conjunction $U_{A}+U_{B}$ does not place samples at a hypothetical intersection of the two mode sets, because no point in $\mathbb{R}^{2}$ is simultaneously a mode of both energies. Instead, it samples the new minima of $U_{A}+U_{B}$, which are compromise locations on a ring at intermediate radius. The disjunction tilts toward $B$ because of a small mismatch in normalization between the two trained energies (alignment offset $0.023$ on a scale of $\approx 62$); the qualitative shape places mass near both mode sets. The negation produces a clean separation: zero samples near $A$ centers, $4{,}973$ of $5{,}000$ near $B$ centers. Across all three compositions, the curl of the composed drift remains at the level of automatic-differentiation noise, confirming the structural prediction [(19)](https://arxiv.org/html/2605.05530#S4.E19).
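The invariance claim in (19) can be checked in closed form when the experts are simple. The sketch below, assuming two hypothetical one-dimensional quadratic experts rather than the paper's trained networks, runs unadjusted Langevin under the conjunction drift $-\nabla(U_{A}+U_{B})$ and recovers the known product measure:

```python
import numpy as np

# Two hypothetical 1-D "expert" energies; the paper's U_A, U_B are neural
# networks, but any scalar potentials compose the same way.
a, b = -1.0, 3.0
grad_A = lambda x: x - a          # gradient of U_A(x) = (x - a)^2 / 2
grad_B = lambda x: x - b          # gradient of U_B(x) = (x - b)^2 / 2

# Conjunction: drift = -grad(U_A + U_B); the Gibbs measure e^{-U_A - U_B}
# is N((a+b)/2, 1/2) in closed form, which lets us verify Eq. (19).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50_000)     # 50k independent Langevin chains
dt = 1e-2
for _ in range(3_000):
    drift = -(grad_A(x) + grad_B(x))
    x += dt * drift + np.sqrt(2 * dt) * rng.normal(size=x.shape)

print(x.mean(), x.var())   # approx. 1.0 and approx. 0.5
```

For these experts the composed Gibbs measure is $\mathcal{N}((a+b)/2,\,1/2)$, so the empirical mean and variance of the chains directly certify that the composed drift samples the product density without any retraining.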

### 4.2 Diagnostic Study: Gradient Norm as an OOD Score

We report a diagnostic study evaluating three scores derived from the closed-loop Langevin dynamics on synthetic and CIFAR 10 data. We frame this as a diagnostic study, not as a contribution of new OOD methodology; the contribution is the systematic evaluation of all three scores on the same scalar EBM.

#### Score hierarchy from the closed-loop dynamics.

Consider three scores

$$S_{1}(x) \,=\, U_{\theta}(x),\qquad S_{2}(x) \,=\, \|\nabla U_{\theta}(x)\|,\qquad S_{3}(x) \,=\, \bigl\|x-\Phi^{(\tau)}_{\Delta t}(x)\bigr\|,$$

where $\Phi^{(\tau)}_{\Delta t}$ is the $\tau$-step Euler discretization of the deterministic flow $\dot{x}=-\nabla U_{\theta}(x)$ with step size $\Delta t$, and $\log Z_{\theta}$ is computed by thermodynamic integration along the path $U_{s}=(1-s)U_{0}+sU_{\theta}$ from a Gaussian reference $U_{0}$. The score $S_{1}$ is the negative log likelihood up to the constant $\log Z_{\theta}$. The score $S_{2}$ is the magnitude of the deterministic descent rate from $\rho_{0}=\delta_{x}$, since $\frac{d}{dt}\mathbb{E}_{\rho_{t}}U_{\theta}\big|_{t=0^{+}}=-\|\nabla U_{\theta}(x)\|^{2}$. The score $S_{3}$ measures trajectory displacement under the deterministic flow; points near critical regions of $U_{\theta}$ are barely displaced.
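All three scores need only energy and gradient evaluations plus a short deterministic rollout. A minimal sketch with a hypothetical quadratic bowl standing in for the trained $U_{\theta}$ (so $\nabla U_{\theta}(x)=x$ is available in closed form):

```python
import numpy as np

# Hypothetical stand-in for the trained scalar energy: U(x) = ||x||^2 / 2,
# whose gradient is simply x; the paper's U_theta is a neural network.
U      = lambda x: 0.5 * (x ** 2).sum(axis=-1)
grad_U = lambda x: x

def ood_scores(x, tau=20, dt=0.1):
    s1 = U(x)                                   # S1: energy (NLL up to log Z)
    s2 = np.linalg.norm(grad_U(x), axis=-1)     # S2: gradient norm
    y = x.copy()
    for _ in range(tau):                        # tau-step Euler flow dx/dt = -grad U
        y = y - dt * grad_U(y)
    s3 = np.linalg.norm(x - y, axis=-1)         # S3: trajectory displacement
    return s1, s2, s3

s_in  = ood_scores(np.array([[0.1, 0.0]]))      # point near the mode
s_out = ood_scores(np.array([[4.0, 3.0]]))      # point far from the mode
print(np.concatenate(s_in), np.concatenate(s_out))
```

On this toy bowl every score is larger far from the mode; on the real energy, the finding reported in Table 2 is that the gradient norm $S_{2}$ is the most discriminative of the three on natural OOD sets.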

#### CIFAR 10 study.

We use the checkpoint of Balcerak et al. ([2025](https://arxiv.org/html/2605.05530#bib.bib8)), generation FID $3.34$, with the internal time argument fixed at $t=0.5$ so that the network acts as a time-invariant scalar potential. The in-distribution set is the CIFAR 10 test split with $N=10{,}000$, while the out-of-distribution (OOD) sets are SVHN test, the Describable Textures Dataset, and pixel-wise independent uniform noise. Table [2](https://arxiv.org/html/2605.05530#S4.T2) reports the OOD AUROC scores.

Table 2: CIFAR 10 OOD results: AUROC for each score on each OOD set. Bold marks the best LEM-derived score per column. The score $S_{2}$ on natural OOD sets (SVHN, DTD) attains $0.763$ on average, exceeding all reported baselines, including the EqM dot-product variant by $0.243$ absolute, without retraining or auxiliary heads.

### 4.3 Outlook: Control Barrier Functionals on $\mathcal{P}_{2}(\mathbb{R}^{d})$

The control-theoretic perspective taken in this paper opens a path toward formal safety guarantees for generative models in the form of control barrier functionals (CBF) on the Wasserstein space $\mathcal{P}_{2}(\mathbb{R}^{d})$. In control of finite-dimensional dynamical systems $\dot{x}=f(x,u)$, a barrier function $b:\mathbb{R}^{d}\to\mathbb{R}$ encodes a safe set $C=\{x:b(x)\geq 0\}$, and the controller is required to satisfy the pointwise constraint $\nabla b(x)\cdot u(x)+\alpha(b(x))\geq 0$ for some class-$\mathcal{K}$ function $\alpha$, ensuring forward invariance of $C$ along the controlled trajectory (Ames et al., [2017](https://arxiv.org/html/2605.05530#bib.bib44)).

In our LEM setting, safety is naturally formulated at the distributional level. Given an unsafe set $\mathcal{U}\subset\mathbb{R}^{d}$ and a tolerance $\beta\in[0,1)$, the candidate barrier functional is

$$B(\rho) \,=\, \beta-\rho(\mathcal{U}),\qquad \mathcal{S} \,=\, \{\rho\in\mathcal{P}_{2}(\mathbb{R}^{d}):B(\rho)\geq 0\},$$

and the lifted CBF condition requires the controlled flow $u^{*}$ to keep $\rho_{t}$ within $\mathcal{S}$ for all $t\geq 0$. The conservative structure of the scalar parameterization (Section [4.1](https://arxiv.org/html/2605.05530#S4.SS1)) is well aligned with this formulation: a safety penalty implemented by the additive energy composition $U_{\rm safe}=U_{+}-\lambda\,U_{-}$ with $U_{-}$ trained on negative examples retains an explicit Gibbs invariant measure $\rho_{\rm safe}\propto e^{-U_{\rm safe}}$ and inherits the closed-loop guarantees of Section 3.
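The barrier functional is a plain probability constraint, so during sampling it can be monitored by the empirical unsafe mass. A sketch under assumed ingredients (a standard normal stand-in for $\rho_{t}$ and a half-plane unsafe set), not a construction from the paper:

```python
import numpy as np

# Barrier functional B(rho) = beta - rho(U) for a half-plane unsafe set
# U = {x : x_1 > c}; rho(U) is estimated by the empirical unsafe mass.
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=(100_000, 2))   # stand-in for rho_t
c, beta = 2.0, 0.05                                  # unsafe threshold, tolerance

rho_U = (samples[:, 0] > c).mean()   # Monte Carlo estimate of rho(U)
B = beta - rho_U                     # B >= 0 certifies rho_t lies in S
print(rho_U, B)
```

One natural use, not developed in the paper, is to track this estimate of $B(\rho_{t})$ along the closed-loop dynamics and strengthen the safety penalty $U_{\rm safe}=U_{+}-\lambda\,U_{-}$ whenever it approaches zero.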

## 5 Conclusion

We have presented a control-theoretic reformulation of static scalar energy generative modeling on the Wasserstein space $\mathcal{P}_{2}(\mathbb{R}^{d})$. Training and sampling are unified as two instances of one autonomous dynamical system. The smoothing operation $\hat{\rho}_{\mathrm{data}}\mapsto\rho^{*}=\hat{\rho}_{\mathrm{data}}*K_{h}$ enters the framework as the prescription for the training target. Within this single dynamics, two analytic results follow naturally. The deterministic gradient flow on the learned energy landscape admits no distributional Lyapunov certificate, which formalizes the long-observed mode collapse phenomenon at the level of the Wasserstein space rather than at the level of individual trajectories. The stochastic Langevin sampler recovers the target distribution under the same Lyapunov function, and the framework supplies stopping criteria for both samplers that distinguish a sufficient budget from a hard cutoff. The deeper insight is that mode collapse and stochastic correctness are not two separate phenomena, but two consequences of whether noise is present in the controller.

The structural payoff of the reformulation is that the static scalar energy paradigm now sits inside the same conceptual setting as nonlinear control theory. Scalar energies compose additively while preserving the Gibbs invariant form and the closed-loop Lyapunov certificate, so the additive composition operation needed to combine trained energies is algebraically free of structural cost. This is the gateway through which control barrier functionals on the Wasserstein space encode safety as forward invariance of a sublevel set, and through which contraction metrics, nonlinear observers, and stochastic safety certificates become accessible to generative modeling. We view the present paper as the foundational layer that makes such extensions possible for the static scalar energy paradigm, and as an invitation to read generative modeling as a chapter of nonlinear control theory rather than a discipline parallel to it.

## References

- M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden (2023). Stochastic interpolants: a unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.
- A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada (2017). Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control 62(8), pp. 3861–3876.
- D. Bakry and M. Émery (1985). Diffusions hypercontractives. Séminaire de Probabilités XIX, pp. 177–206.
- D. Bakry, I. Gentil, and M. Ledoux (2014). Analysis and Geometry of Markov Diffusion Operators. Springer.
- M. Balcerak, T. Amiranashvili, A. Terpin, S. Shit, L. Bogensperger, S. Kaltenbach, P. Koumoutsakos, and B. Menze (2025). Energy matching: unifying flow matching and energy-based models for generative modeling. arXiv preprint arXiv:2504.10612.
- Y. Du and I. Mordatch (2019). Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, Vol. 32.
- J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6840–6851.
- R. Holley and D. Stroock (1987). Logarithmic Sobolev inequalities and stochastic Ising models. Journal of Statistical Physics 46, pp. 1159–1194.
- F. Hoti (2003). On estimation of a probability density function and mode.
- A. Hyvärinen (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(24), pp. 695–709.
- R. Jordan, D. Kinderlehrer, and F. Otto (1998). The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29(1), pp. 1–17.
- H. K. Khalil (2002). Nonlinear Systems. 3rd edition, Prentice Hall.
- Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2023). Flow matching for generative modeling. In International Conference on Learning Representations.
- X. Liu, C. Gong, and Q. Liu (2023). Flow straight and fast: learning to generate and transfer data with rectified flow. In International Conference on Learning Representations.
- H. Risken (1996). The Fokker–Planck Equation: Methods of Solution and Applications. 2nd edition, Springer.
- J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
- Y. Song and S. Ermon (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, Vol. 32.
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
- E. D. Sontag (1998). Mathematical Control Theory: Deterministic Finite Dimensional Systems. 2nd edition, Springer.
- A. Tong, N. Malkin, G. Huguet, Y. Zhang, J. Rector-Brooks, K. Fatras, G. Wolf, and Y. Bengio (2024). Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research.
- C. Villani (2009). Optimal Transport: Old and New. Springer.
- P. Vincent (2011). A connection between score matching and denoising autoencoders. Neural Computation 23, pp. 1661–1674.
- R. Wang and Y. Du (2025). Equilibrium matching. arXiv preprint arXiv:2510.02300.
- K. Yano (2020). The Theory of Lie Derivatives and Its Applications. Courier Dover Publications.

## Appendix A Proof Details

### A.1 Lyapunov Derivative Computation

Since $\rho^{*}=\exp(-U^{*})/Z$, we have $\log\rho^{*}(x)=-U^{*}-\log Z$ and thus $\log(\rho_{\theta}/\rho^{*})=\log\rho_{\theta}+U^{*}+\log Z$. Differentiating $V(\rho_{\theta})=\int_{\mathbb{R}^{d}}\rho_{\theta}(\log\rho_{\theta}+U^{*}+\log Z)\,dx$ with respect to $t$ yields

$$\frac{d}{dt}V(\rho_{\theta})=\frac{d}{dt}\int_{\mathbb{R}^{d}}\rho_{\theta}\log\rho_{\theta}\,dx+\frac{d}{dt}\int_{\mathbb{R}^{d}}\rho_{\theta}U^{*}\,dx+\frac{d}{dt}\int_{\mathbb{R}^{d}}\rho_{\theta}\log Z\,dx. \tag{20}$$

The first term gives

$$\frac{d}{dt}\int_{\mathbb{R}^{d}}\rho_{\theta}\log\rho_{\theta}\,dx=\int_{\mathbb{R}^{d}}(\partial_{t}\rho_{\theta})\log\rho_{\theta}\,dx+\int_{\mathbb{R}^{d}}\partial_{t}\rho_{\theta}\,dx, \tag{21}$$

where $\int\partial_{t}\rho_{\theta}\,dx=0$ by conservation of mass $\int\rho_{\theta}\,dx=1$. Since $U^{*}$ is time independent, the second term gives $\frac{d}{dt}\int\rho_{\theta}U^{*}\,dx=\int(\partial_{t}\rho_{\theta})U^{*}\,dx$. Since $\log Z$ is constant, combining these results gives

$$\frac{d}{dt}V(\rho_{\theta})=\int_{\mathbb{R}^{d}}(\partial_{t}\rho_{\theta})\bigl(\log\rho_{\theta}+U^{*}\bigr)\,dx. \tag{22}$$

Using the continuity equation $\partial_{t}\rho_{\theta}=-\nabla\cdot(\rho_{\theta}u)$ and integration by parts with parts $\rho_{\theta}u$ and $\log\rho_{\theta}+U^{*}$ (boundary terms vanish with the decay of $\rho_{t}$ at infinity) yields

$$\begin{aligned}
\frac{d}{dt}V(\rho_{\theta}) &= -\int_{\mathbb{R}^{d}}\nabla\cdot(\rho_{\theta}u)\bigl(\log\rho_{\theta}+U^{*}\bigr)\,dx\\
&= \int_{\mathbb{R}^{d}}\rho_{\theta}\,u\cdot\nabla\bigl(\log\rho_{\theta}+U^{*}\bigr)\,dx = \int_{\mathbb{R}^{d}}\rho_{\theta}\,u\cdot\bigl(\nabla\log\rho_{\theta}+\nabla U^{*}\bigr)\,dx. \tag{23}
\end{aligned}$$

### A.2 Fixed Point Analysis

For the learned $\rho^{*}_{\theta}$, $U^{*}_{\theta}$, and $Z^{*}_{\theta}$, we drop the scripts below for simplicity. Substituting the Gibbs density $\rho=\frac{e^{-U}}{Z}$ into the deterministic flow term $\nabla\cdot(\rho\nabla U)$ yields

$$\nabla\cdot(\rho\nabla U) \;=\; \nabla\rho\cdot\nabla U+\rho\,\Delta U. \tag{24}$$

Using the chain rule, we have

$$\nabla\rho=\nabla\Bigl(\frac{e^{-U}}{Z}\Bigr)=-\Bigl(\frac{e^{-U}}{Z}\Bigr)\nabla U=-\rho\nabla U. \tag{25}$$

Then we write

$$\nabla\cdot(\rho\nabla U) \;=\; -\rho\,\|\nabla U\|^{2}+\rho\,\Delta U \;=\; \rho\bigl(\Delta U-\|\nabla U\|^{2}\bigr). \tag{26}$$
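The identity in (24)–(26) can be confirmed symbolically; the sketch below checks the one-dimensional case for an arbitrary smooth energy (sympy assumed available):

```python
import sympy as sp

x = sp.symbols('x')
U = sp.Function('U')(x)     # arbitrary smooth 1-D energy
rho = sp.exp(-U)            # Gibbs density up to the constant 1/Z

lhs = sp.diff(rho * sp.diff(U, x), x)                 # d/dx (rho U'), i.e. div(rho grad U)
rhs = rho * (sp.diff(U, x, 2) - sp.diff(U, x) ** 2)   # rho (Delta U - |grad U|^2)
assert sp.simplify(lhs - rhs) == 0                    # Eq. (26) holds identically
```

The normalization constant $Z$ cancels from both sides, so working with the unnormalized density loses nothing.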

### A.3 KL Divergence Decay Along the Deterministic Flow

With the deterministic flow in [(8)](https://arxiv.org/html/2605.05530#S3.E8) and the corresponding controller $u^{*}_{d}(x)=-\nabla U_{\theta}^{*}(x)$, the Lyapunov derivative in [(3)](https://arxiv.org/html/2605.05530#S2.E3) becomes

$$\begin{aligned}
\frac{d\,\mathrm{KL}(\rho_{t}\|\rho^{*}_{\theta})}{dt} &= -\int_{\mathbb{R}^{d}}\rho_{t}\,\nabla U^{*}_{\theta}\cdot\bigl(\nabla\log\rho_{t}+\nabla U^{*}_{\theta}\bigr)\,dx\\
&= -\int_{\mathbb{R}^{d}}\rho_{t}\,\nabla U^{*}_{\theta}\cdot\nabla\log\rho_{t}\,dx-\int_{\mathbb{R}^{d}}\rho_{t}\,\|\nabla U^{*}_{\theta}\|^{2}\,dx. \tag{27}
\end{aligned}$$

Since $\rho_{t}\nabla\log\rho_{t}=\nabla\rho_{t}$, the first term becomes

$$-\int_{\mathbb{R}^{d}}\rho_{t}\,\nabla U^{*}_{\theta}\cdot\nabla\log\rho_{t}\,dx=-\int_{\mathbb{R}^{d}}\nabla U^{*}_{\theta}\cdot\nabla\rho_{t}\,dx=\int_{\mathbb{R}^{d}}\rho_{t}\,\Delta U^{*}_{\theta}\,dx. \tag{28}$$

Using this result in [(27)](https://arxiv.org/html/2605.05530#A1.Ex8) yields

$$\frac{d\,\mathrm{KL}(\rho_{t}\|\rho^{*}_{\theta})}{dt}=\int_{\mathbb{R}^{d}}\rho_{t}\,\Delta U^{*}_{\theta}\,dx-\int_{\mathbb{R}^{d}}\rho_{t}\,\|\nabla U^{*}_{\theta}\|^{2}\,dx, \tag{29}$$

which is equivalent to [(14)](https://arxiv.org/html/2605.05530#S3.E14).

### A.4 Proof of Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2)

Suppose there exists a Lyapunov functional $V:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}_{\geq 0}$ satisfying conditions (i)–(iii). Consider the deterministic flow initialized at the target density, i.e., $\rho_{0}=\rho^{*}_{\theta}$. By condition (ii), $V(\rho)\geq 0$ for all $\rho\in\mathcal{P}_{2}(\mathbb{R}^{d})$, and by condition (i), $V(\rho^{*}_{\theta})=0$. Condition (iii) implies that $V(\rho_{t})$ is nonincreasing along trajectories of the deterministic flow. Therefore,

$$V(\rho_{t})\leq V(\rho_{\theta}^{*})=0,\quad\text{for all }t\geq 0. \tag{30}$$

Since $V\geq 0$, $V(\rho_{t})=0$ for all $t\geq 0$. By condition (i), this implies $\rho_{t}=\rho_{\theta}^{*}$ for all $t\geq 0$, so the target density $\rho_{\theta}^{*}$ must be a stationary solution (a fixed point) of the deterministic flow. From the continuity equation in [(8)](https://arxiv.org/html/2605.05530#S3.E8), stationarity of $\rho^{*}_{\theta}$ requires $\nabla\cdot(\rho_{\theta}^{*}\nabla U^{*}_{\theta})=0$. By Appendix [A.2](https://arxiv.org/html/2605.05530#A1.SS2), this condition is equivalent to $h(x)=\Delta U^{*}_{\theta}(x)-\|\nabla U^{*}_{\theta}(x)\|^{2}=0$ for all $x$.

We now show that $h(x)$ cannot vanish identically. Let $x_{0}$ be any critical point, so that $\nabla U^{*}_{\theta}(x_{0})=0$. Then $h(x_{0})=\Delta U^{*}_{\theta}(x_{0})=\mathrm{tr}(\nabla^{2}U^{*}_{\theta}(x_{0}))$. At any nondegenerate local minimum, the Hessian $\nabla^{2}U^{*}_{\theta}(x_{0})$ is positive definite, and therefore

$$h(x_{0})=\mathrm{tr}\bigl(\nabla^{2}U^{*}_{\theta}(x_{0})\bigr)>0. \tag{31}$$

In addition, by the assumption in Theorem [2](https://arxiv.org/html/2605.05530#Thmtheorem2), $\|\nabla U^{*}_{\theta}(x)\|^{2}\to\infty$ as $\|x\|\to\infty$, while the Lipschitz gradient assumption implies that the Hessian is bounded, so $|\Delta U_{\theta}^{*}(x)|=|\mathrm{tr}(\nabla^{2}U_{\theta}^{*}(x))|\leq dL$ for the Lipschitz constant $L>0$. Therefore,

$$h(x)=\Delta U_{\theta}^{*}(x)-\|\nabla U_{\theta}^{*}(x)\|^{2}\to-\infty\quad\text{as }\|x\|\to\infty. \tag{32}$$

Since $h$ is continuous, positive at any nondegenerate minimum, and tends to $-\infty$ at infinity, $h$ must change sign and consequently is not identically zero, contradicting the stationarity requirement.
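The sign-change argument is concrete on any multiwell landscape. For the double well $U(x)=(x^{2}-1)^{2}$, a hypothetical stand-in for $U^{*}_{\theta}$ with $\Delta U=12x^{2}-4$ and $\nabla U=4x(x^{2}-1)$, the function $h$ is positive at a minimum and negative in the tails:

```python
# h(x) = Delta U - |grad U|^2 for the double well U(x) = (x^2 - 1)^2.
h = lambda x: (12 * x**2 - 4) - (4 * x * (x**2 - 1))**2

print(h(1.0))   # at the nondegenerate minimum x = 1: h = tr(Hessian) = 8 > 0
print(h(3.0))   # in the tail: the squared gradient dominates, so h < 0
```

By continuity, the intermediate value theorem then forces a sign change of $h$, which is exactly the contradiction used in the proof.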
