A lift for input-convex neural network training

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
Proposes a 'lift' method for training input-convex neural networks (ICNNs) that uses an unconstrained hypernetwork to emit non-negative inter-layer weights, softening the loss landscape and escaping gradient attenuation, achieving lower test loss than projected gradient descent and softplus reparametrization.
arXiv:2605.24274v1 Announce Type: new Abstract: Input-convex neural networks (ICNNs) are widely used for log-concave density estimation, convex-potential normalizing flows, optimal transport, and transport-map inversion for high-dimensional Bayesian posteriors. These tasks share a structural constraint: the inter-layer weights of the ICNN must remain non-negative. The standard recipe, projected gradient descent (PGD) onto the non-negative cone, applies a hard, non-smooth projection -- the stiff-penalty limit of an ADMM-style constraint splitting -- and its classical convergence guarantees do not transfer to the non-smooth ICNN training landscape; the differentiable alternative, softplus reparametrization, attenuates the gradient exponentially in the weight magnitude, stalling training with dead inter-layer weights and plateaued loss. Inspired by parameter-extension lifts of PDE-constrained inverse problems, we propose the lift: instead of constraining the inter-layer weights directly, we train an unconstrained hypernetwork that emits them from a permutation-invariant summary of the input batch. This adds stochasticity to the training dynamics that softens the loss landscape, letting the iterates escape the gradient-attenuated region where direct softplus stalls. We trace this softening to three structural ingredients -- a learnable bias acting as slack, a hypernetwork body that conditions on the target batch, and a cross-covariance coupling the two through batch stochasticity -- and prove each one necessary: deleting any single ingredient collapses the cross-covariance that carries the softening. On log-concave energy-based modeling from one-dimensional toy targets to image-flavored latents, and convex-potential normalizing flows on a 21-dimensional tabular benchmark, we show that the lift reaches a lower test loss than both PGD and direct softplus, and turns a plateau-bounded training trajectory into a valley-descending one.
# A lift for input-convex neural network training
Source: [https://arxiv.org/html/2605.24274](https://arxiv.org/html/2605.24274)
###### Abstract

Input-convex neural networks (ICNNs) are widely used for a range of learning tasks: log-concave density estimation, convex-potential normalizing flows, optimal transport, and transport-map inversion for high-dimensional Bayesian posteriors. All of these tasks share a structural constraint: the inter-layer weights of the ICNN must remain non-negative. The standard recipe for enforcing it, projected gradient descent (PGD) onto the non-negative cone, applies a hard, non-smooth projection (the stiff-penalty limit of an ADMM-style constraint splitting), and its classical convergence guarantees do not transfer to the non-smooth ICNN training landscape; the differentiable alternative, softplus reparametrization, instead attenuates the gradient exponentially in the weight magnitude, stalling training with dead inter-layer weights and plateaued loss. To address this limitation, and inspired by the parameter-extension lifts of PDE-constrained inverse problems, we propose the *lift*: instead of constraining the inter-layer weights directly, we train an unconstrained hypernetwork that emits them from a permutation-invariant summary of the input batch. This adds a source of stochasticity to the training dynamics that softens the loss landscape, letting the iterates escape the gradient-attenuated region where direct softplus stalls. We trace this softening to three structural ingredients (a learnable bias acting as *slack*, a hypernetwork *body* that conditions on the target batch, and a *cross-covariance* coupling the two through batch stochasticity) and prove each one necessary: deleting any single ingredient collapses the cross-covariance that carries the softening.
By means of log-concave energy-based modeling at scales from one-dimensional toy targets to image-flavored latents, and convex-potential normalizing flows on a 21-dimensional tabular benchmark, we show that *the lift reaches a lower test loss than both PGD and direct softplus, and turns a plateau-bounded training trajectory into a valley-descending one*.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x1.png)

\(a\) Parameter\-space view\.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x2.png)

\(b\) Loss\-space view\.

Figure 1: Three positivity reparametrizations on log-concave EBM training, three different fates (21-dimensional tabular target; test negative log-likelihood reported at each method’s lowest-validation-loss checkpoint). (a) Loss landscape on a two-dimensional PGD-anchored slice ([Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3)), with the converged hypernet at the origin (gold star); (b) held-out validation loss versus iteration on the same run. Hypernet descends through the basin to the deepest loss; PGD lands at the cone boundary; direct softplus is trapped on the readout shoulder ([Section 2.2](https://arxiv.org/html/2605.24274#S2.SS2)), an order of magnitude above the other two. The lift’s margin above PGD is the landscape-smoothing benefit of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1); the gap above direct is the Kramers escape of [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1). Code to partially reproduce the results is available on [GitHub](https://github.com/luqigroup/icnnlift).

## 1 Introduction

Input-convex neural networks (ICNNs) (Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)) parametrize the convex scalar fields that drive several modern learning tasks in probabilistic modeling and Bayesian inference. They underlie the negative log-density of a log-concave (Prékopa, [1971](https://arxiv.org/html/2605.24274#bib.bib21); Saumard and Wellner, [2014](https://arxiv.org/html/2605.24274#bib.bib23)) energy-based model, the convex potential whose gradient defines a convex-potential normalizing flow (Huang et al., [2021](https://arxiv.org/html/2605.24274#bib.bib11)), the convex potential of a PCP-Map (Wang et al., [2025](https://arxiv.org/html/2605.24274#bib.bib16); Bunne et al., [2022](https://arxiv.org/html/2605.24274#bib.bib5)) for transport-map posterior sampling, and the Brenier potential of ICNN-parametrized optimal transport (Makkuva et al., [2020](https://arxiv.org/html/2605.24274#bib.bib17); Korotin et al., [2021](https://arxiv.org/html/2605.24274#bib.bib14)). These transport-map constructions sit within a broader move toward generative posterior sampling for high-dimensional Bayesian inverse problems, a setting in which score-based and diffusion-model samplers have also been applied to seismic imaging (Baldassari et al., [2024](https://arxiv.org/html/2605.24274#bib.bib2); Siahkoohi et al., [2026](https://arxiv.org/html/2605.24274#bib.bib27)). Across the ecosystem the ambient dimension ranges from a single coordinate on toy targets to thousands of pixels on image-density and PCP-Map applications, and all four applications share one structural feature: input-convexity demands positivity of the inter-layer weights, i.e., $\bm{\theta} \succeq \bm{0}$.

The dominant practical recipe for enforcing this constraint is projected gradient descent (PGD) (Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)): an unconstrained step on $\bm{\theta}$ followed by the projection $\bm{\theta} \leftarrow \max(\bm{\theta}, 0)$. The projection is non-differentiable on the active set, where the iterates concentrate, and PGD’s classical convergence guarantees ($O(1/k)$ for smooth convex problems with a closed convex constraint and $O(1/\sqrt{k})$ to stationarity for smooth non-convex problems (Nesterov, [2018](https://arxiv.org/html/2605.24274#bib.bib20); Beck, [2017](https://arxiv.org/html/2605.24274#bib.bib3))) assume a Lipschitz-smooth objective that the ICNN training landscape violates there, so they do not transfer. A differentiable alternative reparametrizes the inter-layer weights as $\bm{\theta} = \psi(\tilde{\bm{\theta}})$ with $\tilde{\bm{\theta}} \in \mathbb{R}^{d}$ unconstrained and $\psi$ a monotone non-negative map, of which there are two genuine families: the smooth one, softplus, and the non-smooth max-based one, the ReLU-type readouts $\psi(\tilde{\theta}) = \max(\tilde{\theta}, \epsilon)$ that project onto the non-negative cone (the cone projection being the $\epsilon = 0$ case). Each enforces $\bm{\theta} \succeq \bm{0}$ by construction, and each introduces a chain-rule prefactor $\psi'(\tilde{\bm{\theta}})$ that multiplies every downstream gradient and collapses on an extended region of parameter space: the softplus prefactor $\psi'(\tilde{\theta})$ vanishes smoothly as $\tilde{\theta} \to -\infty$ (Hoedt and Klambauer, [2023](https://arxiv.org/html/2605.24274#bib.bib10)), while the ReLU-type prefactor is identically zero on the gated set and matches PGD at the cost of differentiability.
Stochastic gradient descent escapes this attenuated region only on a time scale that grows exponentially with the inverse noise level ([Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1)), the Kramers–Arrhenius (Kramers, [1940](https://arxiv.org/html/2605.24274#bib.bib15); Hänggi et al., [1990](https://arxiv.org/html/2605.24274#bib.bib9); Xie et al., [2021](https://arxiv.org/html/2605.24274#bib.bib65)) regime, fundamentally slower than the polynomial classical PGD rates. Existing remediations treat the symptom rather than the structure: specialized initialization schedules (Hoedt and Klambauer, [2023](https://arxiv.org/html/2605.24274#bib.bib10)) each tame an individual failure mode; plain alternating direction method of multipliers (ADMM) with positivity (Boyd et al., [2011](https://arxiv.org/html/2605.24274#bib.bib4)) enforces the constraint by a closed-form prox on the slack but leaves the primal block without data-conditioned reparametrization, and its stiff-penalty limit $\rho \to \infty$ reduces structurally to PGD on the cone.
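
The two update rules contrasted in this paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pgd_step` and `softplus_step`, the learning rates, and the toy gradient values are all hypothetical.

```python
import math

def sigmoid(z):
    # Derivative of softplus, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def pgd_step(theta, grad_theta, lr):
    # Unconstrained gradient step, then projection onto the non-negative cone.
    return [max(t - lr * g, 0.0) for t, g in zip(theta, grad_theta)]

def softplus_step(theta_tilde, grad_theta, lr):
    # Chain rule: dL/dtheta_tilde = softplus'(theta_tilde) * dL/dtheta.
    return [z - lr * sigmoid(z) * g for z, g in zip(theta_tilde, grad_theta)]

# Toy values (hypothetical), chosen to show each failure mode in isolation.
projected = pgd_step([0.1, 0.5], [1.0, -1.0], lr=0.2)   # first coordinate is clipped to the boundary
attenuated = softplus_step([-10.0], [1.0], lr=1.0)      # unit gradient, yet the iterate barely moves
```

In the first call the projection clips a coordinate to the cone boundary; in the second, a unit loss gradient moves the pre-readout coordinate by less than $10^{-4}$ because the prefactor $\psi'(-10)$ is tiny.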

Inspired by the parameter-extension lifts of full-waveform inversion (FWI) (Symes et al., [2020](https://arxiv.org/html/2605.24274#bib.bib49); van Leeuwen and Herrmann, [2013](https://arxiv.org/html/2605.24274#bib.bib29)), where a stiff variable is recast in a larger space to turn a hard non-convex problem into a smoother one, we propose the *lift*. Rather than constraining the inter-layer weights directly, we train an unconstrained hypernetwork (Ha et al., [2017](https://arxiv.org/html/2605.24274#bib.bib64)) that emits them from the input batch. Because the batch is resampled at every step, the emitted weights fluctuate as training proceeds, an additional source of stochasticity in the training dynamics. Unlike ordinary mini-batch gradient noise, this fluctuation is not suppressed by the gradient attenuation that stalls direct softplus, and it softens the loss landscape around the region where training would otherwise plateau. We trace the effect to three structural ingredients of the construction (a learnable slack, the batch-conditioned body, and the stochastic coupling between them) and prove that none is dispensable ([Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1)). On a one-dimensional log-concave energy-based model (EBM) and a convex-potential normalizing flow on a 21-dimensional tabular benchmark, the lift descends to a lower test loss than both PGD and direct softplus, turning a plateau-bounded trajectory into a valley-descending one ([Figure 1](https://arxiv.org/html/2605.24274#S0.F1)).

### 1.1 Contributions

1. *The lift.* We propose the hypernetwork-emitted ICNN parametrization of equation ([1](https://arxiv.org/html/2605.24274#S3.E1)), which introduces an additional source of stochasticity into the training dynamics that effectively smooths the loss landscape around the readout shoulder. We then read the parametrization as a split-variable decomposition ([Section 3](https://arxiv.org/html/2605.24274#S3), [Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2)), an interpretation that delivers the lift’s conditioning advantage without invoking the rate-theoretic assumptions that an ADMM convergence analysis would require.
2. *Three jointly necessary structural ingredients.* We identify three ingredients of the lift’s conditioning advantage (an identity-Jacobian slack, a data-conditioned body, and a non-vanishing cross-covariance between them) and prove that each is necessary for an operationally measurable cross-covariance estimator ([Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1)). An implicit strong-convexification result links the cross-covariance to an added curvature modulus on the loss landscape, scoped to the stochastic-readout regime.
3. *Empirical evidence across two ICNN paradigms.* On one-dimensional log-concave EBM targets, a four-architecture ablation isolates each ingredient and a 30-seed paired sweep bounds the lift’s distributional improvement across the log-concave family ([Section 5.1](https://arxiv.org/html/2605.24274#S5.SS1)). On a 21-dimensional tabular target and on two-dimensional synthetic targets for convex potential flows, the lift improves test loss over direct softplus and produces a measurably smoother training trajectory and a better-conditioned loss-landscape geometry ([Section 5.2](https://arxiv.org/html/2605.24274#S5.SS2), [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3)).

The remainder of the paper is organized as follows. We first name the chain-rule attenuation pathology that motivates the lift ([Section 2](https://arxiv.org/html/2605.24274#S2)), introduce the slack-plus-batch-summary reparametrization that resolves it ([Section 3](https://arxiv.org/html/2605.24274#S3)), and decompose the conditioning advantage into three structural ingredients ([Section 4](https://arxiv.org/html/2605.24274#S4)). We then contrast the construction against plain ADMM-with-positivity ([Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2)) and present the empirical evidence across log-concave EBM training and convex-potential flow estimation ([Section 5](https://arxiv.org/html/2605.24274#S5)). The PGD-baseline ablation ([Section 6](https://arxiv.org/html/2605.24274#S6)) and a discussion of scope and related work ([Section 7](https://arxiv.org/html/2605.24274#S7)) close out the argument before the conclusion ([Section 8](https://arxiv.org/html/2605.24274#S8)).

## 2 Positivity-reparametrized ICNNs attenuate the gradient on the readout shoulder

Before introducing the lift, we name the structural pathology it resolves: a positivity readout $\psi$ whose chain-rule prefactor $\psi'$ collapses on an extended region of parameter space, trapping stochastic gradient descent (SGD). [Section 2.1](https://arxiv.org/html/2605.24274#S2.SS1) fixes the ICNN training setup; [Section 2.2](https://arxiv.org/html/2605.24274#S2.SS2) defines the readout shoulder; [Section 2.3](https://arxiv.org/html/2605.24274#S2.SS3) closes the case that both readout families (smooth softplus and the non-smooth ReLU-type) fail there uniformly.

### 2.1 ICNN training and the positivity constraint

We train ICNN-parametrized models (Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)) by SGD on a data-driven loss $\mathcal{L}$. The running example is a single-component ICNN-EBM, $p_{\bm{\theta}}(\bm{x}) \propto \exp(-E_{\bm{\theta}}(\bm{x}))$, with $E_{\bm{\theta}}$ an input-convex neural network whose convex non-decreasing activations and non-negative inter-layer weights $\bm{\theta}_{l} \succeq \bm{0}$ together enforce input-convexity of $E_{\bm{\theta}}$; convex potential flows, PCP-Map, and ICNN-parametrized optimal transport share the same structure. Positivity is enforced through a monotone non-negative readout $\psi: \mathbb{R} \to \mathbb{R}_{\geq 0}$, $\bm{\theta} = \psi(\tilde{\bm{\theta}})$, with $\tilde{\bm{\theta}} \in \mathbb{R}^{d}$ unconstrained. The canonical instance is $\psi = \mathrm{softplus}$, whose derivative $\psi'(\tilde{\theta}) \in (0,1)$ multiplies every downstream gradient through the chain rule. Throughout, $\bm{\theta} \in \mathbb{R}^{d}$ denotes the flattened vector of inter-layer weights to which the positivity constraint applies; the ICNN’s other parameters (per-layer biases and any unconstrained weights) are absorbed into the function $E_{\bm{\theta}}$ but are not analyzed separately, because the lift mechanism acts only on the constrained weights.

### 2.2 The readout shoulder

The locus $\psi'(\tilde{\bm{\theta}}) = \sigma_{s} \ll 1$ is the *softplus shoulder* of width $\sigma_{s}$ ([Figure 2](https://arxiv.org/html/2605.24274#S2.F2)). The shoulder is an extended region of parameter space, not a thin set: at $\sigma_{s} = 0.05$ it includes the entire half-line $\tilde{\theta} \lesssim -3$, so an iterate that enters it has room to wander. Once inside the shoulder, the chain-rule prefactor $\psi'(\tilde{\bm{\theta}})$ is exponentially small in the weight magnitude, and SGD escapes only on a time scale that grows exponentially with the inverse noise level, the Kramers–Arrhenius regime (Kramers, [1940](https://arxiv.org/html/2605.24274#bib.bib15); Hänggi et al., [1990](https://arxiv.org/html/2605.24274#bib.bib9)) quantified by [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1) in [Section 4.1](https://arxiv.org/html/2605.24274#S4.SS1). The same exponential escape time appears for the other canonical positivity reparametrizations ([Remark 1](https://arxiv.org/html/2605.24274#Thmremark1)).
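
The quoted half-line can be checked by inverting the softplus derivative, which is the logistic sigmoid. The short derivation below is a worked check under that standard identity; the symbol $\tilde{\theta}_{s}$ for the shoulder edge is introduced here for illustration only.

```latex
\psi'(\tilde{\theta}) = \frac{1}{1 + e^{-\tilde{\theta}}} < \sigma_{s}
\;\Longleftrightarrow\;
\tilde{\theta} < \tilde{\theta}_{s} := \log\frac{\sigma_{s}}{1 - \sigma_{s}},
\qquad
\tilde{\theta}_{s}\Big|_{\sigma_{s} = 0.05} = \log\tfrac{1}{19} \approx -2.94 .
```

Deep inside the shoulder, $\psi'(\tilde{\theta}) \approx e^{\tilde{\theta}}$, which is the exponential attenuation referred to in the text.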

![Refer to caption](https://arxiv.org/html/2605.24274v1/x3.png)

Figure 2: Softplus shoulder: an extended region of parameter space where the chain-rule prefactor $\psi'$ collapses. The readout $\psi(\tilde{\theta}) = \mathrm{softplus}(\tilde{\theta})$ (black, left axis) and its derivative $\psi'(\tilde{\theta})$ (red dashed, right axis) on a single scalar coordinate $\tilde{\theta}$; the shaded region is the shoulder $\{\tilde{\theta} : \psi'(\tilde{\theta}) < \sigma_{s}\}$ with $\sigma_{s} = 0.05$. Iterates that enter the shoulder have body-path gradient $\partial\bm{\theta}/\partial h_{\bm{\phi}} = \psi'(\tilde{\bm{\theta}})$ attenuated below $\sigma_{s}$; the lift’s identity-Jacobian slack channel ([Section 3](https://arxiv.org/html/2605.24274#S3)) provides the unmodulated escape route.

### 2.3 Direct readouts fail uniformly on the shoulder

The direct softplus parametrization treats $\tilde{\bm{\theta}}$ as a free parameter and updates it through the chain-rule gradient $\nabla_{\tilde{\bm{\theta}}}\mathcal{L} = \psi'(\tilde{\bm{\theta}}) \odot \nabla_{\bm{\theta}}\mathcal{L}$. On the softplus shoulder the prefactor $\psi'$ vanishes and the trajectory stalls; on the ReLU-type gate the prefactor is identically zero on the gated set and the trajectory is trapped. The direct softplus is feasible by construction but is not amenable to gradient descent in the high-attenuation regime, a property shared by every canonical positivity reparametrization.
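
The stall can be reproduced on a one-parameter toy problem. The sketch below is illustrative only: the quadratic loss $\tfrac{1}{2}(\theta - 1)^2$, the helper name `train_direct`, the step count, and the learning rate are assumptions chosen for clarity, and the run uses noiseless gradient descent in place of SGD.

```python
import math

def softplus(z):
    # Stable evaluation of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    # softplus'(z), written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_direct(theta_tilde, steps=1000, lr=0.1, target=1.0):
    # Gradient descent on L = 0.5 * (theta - target)^2 with theta = softplus(theta_tilde).
    for _ in range(steps):
        grad_theta = softplus(theta_tilde) - target              # dL/dtheta
        theta_tilde -= lr * sigmoid(theta_tilde) * grad_theta    # chain-rule prefactor
    return theta_tilde

healthy = train_direct(0.0)     # initialized off the shoulder
stalled = train_direct(-12.0)   # initialized deep on the shoulder
```

With these settings the first run reaches the minimizer, while the second moves by less than $10^{-2}$ in a thousand steps, even though both face the same loss.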

## 3 The lift: slack-plus-hypernetwork reparametrization of the constrained weights

While mini-batch gradient noise reliably escapes sharp basins in standard deep-learning training (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2605.24274#bib.bib62); Keskar et al., [2017](https://arxiv.org/html/2605.24274#bib.bib61); Mandt et al., [2017](https://arxiv.org/html/2605.24274#bib.bib37); Jastrzębski et al., [2017](https://arxiv.org/html/2605.24274#bib.bib63); Li et al., [2017](https://arxiv.org/html/2605.24274#bib.bib36)), the previous section showed that the readout shoulder attenuates this very noise: the chain-rule prefactor $\psi'$ shrinks the deterministic drift and the stochastic kick at the same rate, and SGD stalls. In this section we introduce the lift, a reparametrization that gives SGD a structurally distinct noise source bypassing $\psi'$: [Section 3.1](https://arxiv.org/html/2605.24274#S3.SS1) makes the construction precise, [Section 3.2](https://arxiv.org/html/2605.24274#S3.SS2) reads off the two structural channels it opens, [Section 3.3](https://arxiv.org/html/2605.24274#S3.SS3) traces how they combine with batch stochasticity to deliver an implicit smoothing of the shoulder, and [Section 3.4](https://arxiv.org/html/2605.24274#S3.SS4) packages the result as a one-call wrapper for any existing ICNN training pipeline. The formal claims behind the intuition land in [Section 4](https://arxiv.org/html/2605.24274#S4).

### 3.1 Routing the constraint through a slack-plus-hypernetwork emission

The *lift* ([Figure 3](https://arxiv.org/html/2605.24274#S3.F3)) replaces the direct softplus’s free parameter $\tilde{\bm{\theta}}$ with the sum of a learnable slack bias and a permutation-invariant hypernetwork emission, then routes the result through the positivity readout:

$$\bm{\theta} \;=\; \psi(\tilde{\bm{\theta}}), \qquad \tilde{\bm{\theta}} \;=\; \bm{b} \;+\; h_{\bm{\phi}}(\bm{X}), \qquad h_{\bm{\phi}}(\bm{X}) \;\equiv\; h_{\bm{\phi}}^{(2)}\!\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} h_{\bm{\phi}}^{(1)}(\bm{x}_{i})\Bigr). \tag{1}$$

Here $\bm{b} \in \mathbb{R}^{d}$ is the per-coordinate slack bias; $h_{\bm{\phi}}^{(1)}$ and $h_{\bm{\phi}}^{(2)}$ are fully-connected branches with collected weights $\bm{\phi}$, applied per-point and after mean-pooling respectively over the conditioning batch $\bm{X} = (\bm{x}_{1}, \dots, \bm{x}_{n})$; and $\psi$ is the positivity readout of [Section 2](https://arxiv.org/html/2605.24274#S2). The DeepSets-style mean-pool makes $h_{\bm{\phi}}$ permutation-invariant in the ordering of $\bm{X}$ (Zaheer et al., [2017](https://arxiv.org/html/2605.24274#bib.bib32)); this DeepSets-hypernetwork emission pattern was first applied to generative-model training by Mayer et al. ([2024](https://arxiv.org/html/2605.24274#bib.bib38)), where conditioning the emitted weights on a permutation-invariant batch summary mitigates model autophagy disorder in self-consumed training loops (Alemohammad et al., [2024](https://arxiv.org/html/2605.24274#bib.bib39)), and a function-space variant was applied by Thatipelli and Siahkoohi ([2026](https://arxiv.org/html/2605.24274#bib.bib57)) to produce grid-independent finite-dimensional representations of functional data for downstream clustering. Training minimizes the same ICNN loss $\mathcal{L}$ as the direct softplus, but over the lifted parameters $(\bm{\phi}, \bm{b})$ in place of $\bm{\theta}$ directly:

$$\boxed{\quad \textbf{The lift's training objective:} \quad \min_{\bm{\phi},\,\bm{b}} \; \mathcal{L}\bigl(\psi(\tilde{\bm{\theta}})\bigr) \quad} \tag{2}$$

The loss family, the model class, the data, and the evaluation metric are unchanged relative to the direct softplus, which is recovered by freezing $\bm{b} \equiv \bm{0}$ and $h_{\bm{\phi}} \equiv \bm{0}$ and optimizing $\tilde{\bm{\theta}}$ in their place. Only the optimization variable differs.
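
The emission of equation (1) can be sketched without any deep-learning framework. The code below is a minimal, hedged illustration: the helper names (`make_layer`, `linear`, `emit_weights`), the single-layer branches, the tanh nonlinearity, and all sizes are assumptions, not details of the authors' architecture.

```python
import math
import random

def softplus(z):
    # Stable evaluation of log(1 + exp(z)); this is the positivity readout psi.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def linear(W, c, v):
    # Dense layer: W is a list of rows, c the bias vector.
    return [sum(w * x for w, x in zip(row, v)) + ci for row, ci in zip(W, c)]

def make_layer(rng, n_out, n_in, scale=0.5):
    # Random Gaussian weights and zero biases (an illustrative initialization).
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def emit_weights(params, X):
    """theta = psi(b + h2(mean_i h1(x_i))), following equation (1)."""
    (W1, c1), (W2, c2), b = params["h1"], params["h2"], params["b"]
    # Per-point branch h^(1) with a tanh nonlinearity (an illustrative choice).
    feats = [[math.tanh(a) for a in linear(W1, c1, x)] for x in X]
    # Mean-pool over the batch: the permutation-invariant summary.
    pooled = [sum(col) / len(X) for col in zip(*feats)]
    # Post-pooling branch h^(2), then add the slack bias b.
    theta_tilde = [bi + hi for bi, hi in zip(b, linear(W2, c2, pooled))]
    return [softplus(z) for z in theta_tilde]

rng = random.Random(0)
x_dim, hidden, d = 2, 4, 3   # hypothetical sizes
params = {"h1": make_layer(rng, hidden, x_dim),
          "h2": make_layer(rng, d, hidden),
          "b": [0.0] * d}
X = [[rng.gauss(0.0, 1.0) for _ in range(x_dim)] for _ in range(5)]
theta = emit_weights(params, X)
theta_perm = emit_weights(params, list(reversed(X)))   # same batch, different order
```

The emitted weights are strictly positive by construction, and reordering the batch leaves them unchanged up to floating-point rounding, which is the permutation invariance supplied by the mean-pool.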

Figure 3: The lift in one picture. Top row (red, direct softplus baseline): the pre-readout iterate $\tilde{\bm{\theta}}$ is a free parameter, passed through the positivity readout $\psi$ to produce the constrained weight $\bm{\theta} \succeq \bm{0}$, then through the ICNN energy $E_{\bm{\theta}}(\bm{x})$. Bottom row (orange, lift): the conditioning batch $\bm{X} = (\bm{x}_{1}, \dots, \bm{x}_{n})$ feeds the DeepSets hypernetwork $h_{\bm{\phi}}(\bm{X}) = h_{\bm{\phi}}^{(2)}\bigl(\tfrac{1}{n}\sum_{i} h_{\bm{\phi}}^{(1)}(\bm{x}_{i})\bigr)$; its emission is summed with the slack bias $\bm{b}$ at the $\oplus$ node to produce $\tilde{\bm{\theta}} = \bm{b} + h_{\bm{\phi}}(\bm{X})$, which feeds the same shared $\psi \to \bm{\theta} \to E_{\bm{\theta}}(\bm{x})$ tail. The slack and body channels and their Jacobians through the readout are analyzed in [Sections 3.2](https://arxiv.org/html/2605.24274#S3.SS2) and [4](https://arxiv.org/html/2605.24274#S4).

### 3.2 The lift decomposes the pre-readout iterate into a constant slack and a batch-conditioned body

The slack-plus-hypernetwork decomposition $\tilde{\bm{\theta}} = \bm{b} + h_{\bm{\phi}}(\bm{X})$ splits the pre-readout iterate into two terms with very different dependence on the conditioning batch $\bm{X}$. The slack $\bm{b}$ is a parameter, constant across SGD iterations that draw different batches. The body $h_{\bm{\phi}}(\bm{X})$ is a function of the batch: its output varies as $\bm{X}$ varies. Both terms enter $\tilde{\bm{\theta}}$ additively, so their first-order Jacobians are identical ($\partial\tilde{\bm{\theta}}/\partial\bm{b} = \partial\tilde{\bm{\theta}}/\partial h_{\bm{\phi}} = \bm{I}_{d}$) and each picks up the same readout prefactor $\mathrm{diag}(\psi'(\tilde{\bm{\theta}}))$ when chain-ruled to the constrained weight $\bm{\theta} = \psi(\tilde{\bm{\theta}})$. The asymmetry the lift introduces is not in the Jacobians (both optimization variables see the same shoulder attenuation as the direct softplus) but in which term carries the batch dependence.

At iteration $t$, a fresh batch $\bm{X}^{(t)}$ is drawn and the lifted iterate evaluates to $\tilde{\bm{\theta}}^{(t)} = \bm{b} + h_{\bm{\phi}}(\bm{X}^{(t)})$. Across a trailing window of $T$ iterations at approximately fixed $(\bm{\phi}, \bm{b})$ (the SGD updates to these parameters are slow relative to the batch-to-batch variation in $h_{\bm{\phi}}(\bm{X}^{(t)})$), let $\bar{\tilde{\bm{\theta}}} := \tfrac{1}{T}\sum_{s}\tilde{\bm{\theta}}^{(s)}$ denote the trailing mean and $\delta\tilde{\bm{\theta}}^{(t)} := \tilde{\bm{\theta}}^{(t)} - \bar{\tilde{\bm{\theta}}}$ the batch-induced fluctuation of the iterate. The slack contributes a constant offset and drops out of the fluctuation; the body carries it entirely:

$$\delta\tilde{\bm{\theta}}^{(t)} \;=\; \delta h_{\bm{\phi}}(\bm{X}^{(t)}). \tag{3}$$

The direct softplus parametrization has no analog: its pre-readout iterate is a free parameter, independent of $\bm{X}$, so $\delta\tilde{\bm{\theta}}^{(t)} \equiv \bm{0}$. [Section 3.3](https://arxiv.org/html/2605.24274#S3.SS3) shows how this batch-induced fluctuation couples to mini-batch gradient noise and delivers an implicit smoothing of the readout shoulder.
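
Equation (3) follows in one line from the additive structure of equation (1). The derivation below is a sketch under the stated assumption that $(\bm{\phi}, \bm{b})$ are approximately fixed across the window; the shorthand $\bar{h}_{\bm{\phi}}$ for the window mean of the body is introduced here for illustration.

```latex
\bar{\tilde{\bm{\theta}}}
  = \frac{1}{T}\sum_{s}\bigl(\bm{b} + h_{\bm{\phi}}(\bm{X}^{(s)})\bigr)
  = \bm{b} + \bar{h}_{\bm{\phi}},
\qquad
\delta\tilde{\bm{\theta}}^{(t)}
  = \bigl(\bm{b} + h_{\bm{\phi}}(\bm{X}^{(t)})\bigr) - \bigl(\bm{b} + \bar{h}_{\bm{\phi}}\bigr)
  = h_{\bm{\phi}}(\bm{X}^{(t)}) - \bar{h}_{\bm{\phi}}
  = \delta h_{\bm{\phi}}(\bm{X}^{(t)}).
```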

### 3.3 Why this helps: implicit landscape smoothing through batch stochasticity

The previous subsection split the pre-readout iterate into a constant slack and a batch-conditioned body; we now walk, in four steps, through why that split lets SGD escape a shoulder on which the direct softplus stalls. The mathematics is assembled formally in [Section 4](https://arxiv.org/html/2605.24274#S4); here we keep only the intuitive chain.

##### Step 1: the shoulder kills every gradient\-driven channel, the lift included\.

Every parameter update the optimizer takes is a gradient step, and every gradient in $\tilde{\bm{\theta}}$-coordinates factors through the chain rule as a Jacobian acting on $\bm{g} = \psi'(\tilde{\bm{\theta}}) \odot \nabla_{\bm{\theta}}\mathcal{L}$, so it carries the readout prefactor $\psi'(\tilde{\bm{\theta}})$. On the shoulder $\psi'(\tilde{\bm{\theta}}) \approx \sigma_{s} \ll 1$, and the gradient-driven part of the lifted increment $\Delta\tilde{\bm{\theta}}$ is bilinear in $\bm{g}$, hence of order $\psi'(\tilde{\bm{\theta}})^{2}$ in every term, so the deterministic drift and the gradient-sourced kick collapse together. The lift’s gradient descent therefore freezes on the shoulder exactly as the direct softplus’s does; whatever advantage the lift carries cannot come from a gradient step.

##### Step 2: but the lift’s iterate keeps moving without a gradient step\.

The lifted pre-readout iterate is $\tilde{\bm{\theta}}^{(t)} = \bm{b} + h_{\bm{\phi}}(\bm{X}^{(t)})$, and at the variance-maximizing operating point $n = 1$ of [Remark 3](https://arxiv.org/html/2605.24274#Thmremark3) the conditioning sample $\bm{X}^{(t)}$ is redrawn on every forward pass. Even with the parameters $(\bm{b}, \bm{\phi})$ held frozen, and hence with no gradient step taken, the iterate jitters from one pass to the next by $\delta\tilde{\bm{\theta}}^{(t)} = \delta h_{\bm{\phi}}(\bm{X}^{(t)})$ of ([3](https://arxiv.org/html/2605.24274#S3.E3)): a re-evaluation of the emission on a fresh batch, not a gradient, carrying no $\psi'$ factor. The direct softplus has no body to re-evaluate (its pre-readout iterate is a free parameter, $\delta\tilde{\bm{\theta}}^{(t)} \equiv \bm{0}$), so once its gradients die on the shoulder, its iterate is motionless. The batch-resampling jitter of the lifted iterate is the one channel of motion that the shoulder leaves intact.
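
A scalar toy makes the contrast concrete. In the sketch below the body `body(x) = w * tanh(x)`, the frozen values of `w` and `b`, and the standard-normal conditioning samples are all hypothetical; the point is only that a frozen lifted iterate still varies across forward passes at $n = 1$, while a frozen direct iterate does not.

```python
import math
import random
import statistics

def body(x, w=0.8):
    # Hypothetical scalar body h_phi(x); the weight w is held frozen.
    return w * math.tanh(x)

rng = random.Random(0)
b = -4.0  # frozen slack; softplus'(-4) is about 0.018, inside the sigma_s = 0.05 shoulder

# Lift with n = 1: one fresh conditioning sample per forward pass, no gradient step taken.
lift_iterates = [b + body(rng.gauss(0.0, 1.0)) for _ in range(1000)]
# Direct softplus: the pre-readout iterate is a free parameter, independent of the batch.
direct_iterates = [b for _ in range(1000)]

lift_std = statistics.pstdev(lift_iterates)
direct_std = statistics.pstdev(direct_iterates)
```

Here `lift_std` is strictly positive while `direct_std` is exactly zero, mirroring $\delta\tilde{\bm{\theta}}^{(t)} = \delta h_{\bm{\phi}}(\bm{X}^{(t)})$ against $\delta\tilde{\bm{\theta}}^{(t)} \equiv \bm{0}$.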

##### Step 3: that jitter is a genuine SGD noise channel, $\sigma_{\mathrm{Jac}}$.

In the stochastic differential equation (SDE) picture of SGD (Mandt et al., [2017](https://arxiv.org/html/2605.24274#bib.bib37); Li et al., [2017](https://arxiv.org/html/2605.24274#bib.bib36)), mini-batch SGD is a drift plus a diffusion, and the diffusion of the lifted iterate has two parts: a $\psi'$-attenuated gradient-noise part of scale $\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}$, which dies on the shoulder with the rest of the gradient-driven channel, and the unattenuated batch-resampling part. The second part is measured by the trace of the cross-covariance between the iterate jitter $\delta\tilde{\bm{\theta}}$ and the gradient fluctuation $\delta\bm{g}=\bm{g}-\mathbb{E}_{\bm{X}}[\bm{g}]$,

$$\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)}\;=\;\frac{1}{T}\sum_{s=t-T}^{t-1}\delta\tilde{\bm{\theta}}^{(s)}\,(\delta\bm{g}^{(s)})^{\top},\qquad\sigma_{\mathrm{Jac}}^{2}\;\equiv\;\mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}\bigr],\tag{4}$$

with $s$ indexing SGD iterations in the trailing window $[t-T,\,t-1]$. This is a *cross*-covariance (the jitter $\delta\tilde{\bm{\theta}}$ against the gradient fluctuation $\delta\bm{g}$), and it is nonzero because both ride the same conditioning batch $\bm{X}$: the body emits $\tilde{\bm{\theta}}$ from $\bm{X}$ and the loss gradient is evaluated on the same $\bm{X}$, so their fluctuations are correlated rather than independent. The direct softplus has $\delta\tilde{\bm{\theta}}\equiv\bm{0}$, so its $\sigma_{\mathrm{Jac}}$ is identically zero; $\sigma_{\mathrm{Jac}}$ is the lift's surviving noise channel in one number.
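The estimator in (4) is an average of outer products over the trailing window, which the following minimal sketch spells out in plain Python. The function names and the toy window are illustrative assumptions, not the authors' code, and the inputs are assumed to be centered already (jitter and gradient fluctuation with their batch means removed).

```python
def slack_cross_covariance(jitters, grad_flucts):
    """Trailing-window estimate (1/T) * sum_s jitter_s grad_fluct_s^T.

    jitters, grad_flucts: length-T lists of length-d lists, assumed centered.
    Returns a d x d matrix as a list of rows.
    """
    T = len(jitters)
    if T == 0 or T != len(grad_flucts):
        raise ValueError("need two non-empty windows of equal length")
    d = len(jitters[0])
    sigma = [[0.0] * d for _ in range(d)]
    for dtheta, dg in zip(jitters, grad_flucts):
        for i in range(d):
            for j in range(d):
                sigma[i][j] += dtheta[i] * dg[j] / T
    return sigma


def trace(matrix):
    """Sum of the diagonal entries; the sample analogue of sigma_Jac^2."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def frobenius_norm(matrix):
    """Frobenius magnitude of the estimator."""
    return sum(v * v for row in matrix for v in row) ** 0.5


# Toy window (T = 4, d = 2) in which the gradient fluctuation is twice the jitter,
# so the two ride the same "batch" and are correlated.
jitters = [[0.2, -0.1], [-0.2, 0.1], [0.1, 0.3], [-0.1, -0.3]]
grad_flucts = [[0.4, -0.2], [-0.4, 0.2], [0.2, 0.6], [-0.2, -0.6]]
lift_sigma = slack_cross_covariance(jitters, grad_flucts)

# Direct softplus: the iterate is a free parameter, so every jitter is zero.
direct_sigma = slack_cross_covariance([[0.0, 0.0]] * 4, grad_flucts)

print(trace(lift_sigma), frobenius_norm(direct_sigma))
```

With a zero jitter every summand vanishes, so the direct-softplus reading is zero regardless of how large the gradient fluctuation is.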

##### Step 4: how $\sigma_{\mathrm{Jac}}$ enters the equations.

The lifted iterate's effective SGD diffusion is the sum of the two parts of Step 3, $\sigma_{\mathrm{eff}}^{2}=\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}+\sigma_{\mathrm{Jac}}^{2}$; on the shoulder the first term collapses and $\sigma_{\mathrm{eff}}^{2}\approx\sigma_{\mathrm{Jac}}^{2}$, so SGD continues to sample its surroundings on a shoulder that has frozen the direct softplus. This residual diffusion enters the downstream equations at two places, each made formal in [Section 4](https://arxiv.org/html/2605.24274#S4). Through the small-noise expansion of the SGD invariant measure, it adds to the effective landscape a strong-convexity modulus $\mu_{\mathrm{eff}}=\Theta(\sigma_{\mathrm{Jac}}^{2}/(d\kappa^{2}))$, with $\kappa=\|\tilde{\bm{\theta}}\|_{\infty}$ the typical readout scale and $d$ the constrained-weight dimension: an implicit smoothing of the shoulder that requires no change to $\psi$ (formal statement: [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1)).
And it sits in the Kramers escape exponent of the bias-channel SDE, $\mathbb{E}[\tau_{\mathrm{hyper}}]\propto\exp(2\alpha/(\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}+\sigma_{\mathrm{Jac}}^{2}))$ against the direct softplus's $\exp(2\alpha/(\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}))$, with $\alpha$ the height of the loss barrier across the shoulder: the unattenuated $\sigma_{\mathrm{Jac}}^{2}$ enlarges the denominator and so lowers the exponent, and once it grows comparable to the barrier the exponential saturates and the escape crosses over to a polynomial $\sigma_{\mathrm{Jac}}^{-2}$ free-diffusion scaling (formal statement: [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1); the diffusive regime is probed empirically in [Section 5.1.2](https://arxiv.org/html/2605.24274#S5.SS1.SSS2)).

##### The cross\-covariance estimator is the deployable signature\.

The cross-covariance estimator ([4](https://arxiv.org/html/2605.24274#S3.E4)) is nonzero only when the slack, the body, and their batch-coupled cross-correlation are all simultaneously present ([Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1)); the four-architecture ablation of [Figure 6](https://arxiv.org/html/2605.24274#S4.F6) tests this architecturally: deleting any one ingredient zeros the estimator, and only the full lift returns a finite reading. The estimator is cheap (an $O(d^{2})$ accumulator over the trailing window, well below $1\%$ overhead in our experiments), so it doubles as a structural test of the mechanism and a deployable training-time monitor. [Figure 4](https://arxiv.org/html/2605.24274#S3.F4) renders this signature on an actual ICNN-EBM training loop: the slack-channel cross-covariance peaks during the descent into the shoulder and sustains a finite reading once a non-trivial tail of the ICNN weights enters the shoulder window. The nonzero plateau, not the spike, is what [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) measures.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x4.png)

Figure 4: The slack-channel cross-covariance is sustained throughout the shoulder window, not transient. Three seeds of the full lift trained under forward-KL on the one-dimensional Gumbel target; the gray band marks iterations where the minimum across coordinates of $\tilde{\theta}_{l}$ sits below the readout shoulder threshold. Top: per-iteration Frobenius magnitude $\|\widehat{\bm{\Sigma}}_{\mathrm{slack}}\|_{F}$ of the slack-channel cross-covariance on the trailing window (light orange per seed, bold orange median). The slack-channel cross-covariance peaks during the descent and sustains a finite reading once the lower tail of the weight distribution enters the shoulder, distinct from the structural zero returned by every architecture that deletes one of the lift's three ingredients ([Figure 6](https://arxiv.org/html/2605.24274#S4.F6)). Bottom: the per-coordinate ensemble of pre-readout weights $\tilde{\theta}_{l}$ across the positivity-tagged ICNN coordinates and all three seeds, with the dashed line at the shoulder threshold. The median weight stays above the threshold throughout training, but the lower percentile band brushes it from mid-training onward: a non-trivial tail of the ICNN's weights lives in the shoulder, and this tail's batch-induced fluctuation produces the sustained cross-covariance floor in the top panel.

### 3.4 A drop-in wrapper for any ICNN training pipeline

Before the code, two structural points orient the implementation: a hyperparameter choice that determines the magnitude of the batch-induced fluctuation of [Section 3.2](https://arxiv.org/html/2605.24274#S3.SS2), and a parameter-count clarification for the experimental comparisons of [Section 5](https://arxiv.org/html/2605.24274#S5).

With these two points in hand, the lift drops into any existing ICNN training pipeline as a generic wrapper around the user's convex network: a one-time positivity tag at construction time, then a two-line swap at the call site ([Figure 5](https://arxiv.org/html/2605.24274#S3.F5)). The hypernetwork walks the user-supplied module's `named_parameters()`, picks up every tensor flagged `_pos_required = True`, and routes only those readouts through the positivity reparametrization; all other weights are emitted unconstrained. Every result in [Section 5](https://arxiv.org/html/2605.24274#S5) is produced by exactly this wrapper; the surrounding training script (loss, optimizer, log-det estimator, evaluation grid) is bit-identical to the direct-softplus baseline.

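    from torch import nn

    class MyConvexNet(nn.Module):
        def __init__(self, ...):
            ...
            # One-time positivity tag: flag every weight that must stay non-negative.
            for w in self.constrained_weights:
                w._pos_required = True

    # Call site: wrap the user's convex network in the hypernetwork.
    icnn = MyConvexNet(...)
    hyper = HyperNetwork(input_size=code_dim,
                         hidden_sizes=[64, 64, 96],
                         downstream_network=icnn)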

Figure 5: The lift as a two-line drop-in wrapper. Mark the constrained weights of any user-supplied convex network with a `_pos_required` flag, then wrap with `HyperNetwork`. No manual list of positivity-tagged parameter names, no manual softplus, no constraint code in the training loop.
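As a rough illustration of the tag-discovery step described above, the sketch below shows one way a wrapper could find the flagged tensors and route only those through softplus. It assumes PyTorch; `TinyConvexNet`, `positivity_tagged`, and `readout` are hypothetical names introduced here and are not the paper's `HyperNetwork` implementation, whose internals the text does not show.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyConvexNet(nn.Module):
    """Illustrative container of ICNN-style layers (forward pass omitted)."""

    def __init__(self, in_dim=2, hidden=4):
        super().__init__()
        self.input_layer = nn.Linear(in_dim, hidden)         # unconstrained
        self.skip_layer = nn.Linear(in_dim, 1)                # unconstrained
        self.inter_layer = nn.Linear(hidden, 1, bias=False)   # must stay non-negative
        self.inter_layer.weight._pos_required = True          # one-time positivity tag


def positivity_tagged(module):
    """Names of the parameters flagged with _pos_required = True."""
    return [name for name, param in module.named_parameters()
            if getattr(param, "_pos_required", False)]


def readout(name, raw, tagged):
    """Route tagged tensors through softplus; pass all others through unchanged."""
    return F.softplus(raw) if name in tagged else raw


net = TinyConvexNet()
tagged = positivity_tagged(net)
raw = torch.full((1, 4), -3.0)  # stand-in for an emitted pre-readout weight
print(tagged, readout("inter_layer.weight", raw, tagged))
```

Keeping the flag on the parameter object itself means the call site never has to maintain a separate list of constrained names, which matches the convenience the caption of Figure 5 describes.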

## 4 Three structural ingredients drive the conditioning advantage

We now state the formal core of the lift mechanism that [Section 3.3](https://arxiv.org/html/2605.24274#S3.SS3) developed intuitively. Proofs are in [Appendix A](https://arxiv.org/html/2605.24274#A1); the prose between formal statements carries the story in compact form.

The slack-channel cross-covariance $\mathbb{E}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}]$ of [Section 3.3](https://arxiv.org/html/2605.24274#S3.SS3) reaches the lifted parameter's two blocks, the body $\bm{\phi}\in\mathbb{R}^{p}$ and the slack $\bm{b}\in\mathbb{R}^{d}$, through their respective Jacobians $\partial\tilde{\bm{\theta}}/\partial\bm{\phi}\in\mathbb{R}^{d\times p}$ and $\partial\tilde{\bm{\theta}}/\partial\bm{b}=\bm{I}_{d}$:

$$\underbrace{\bigl(\partial\tilde{\bm{\theta}}/\partial\bm{\phi}\bigr)^{\top}\mathbb{E}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}]}_{\text{body-channel reading}}\quad\text{vs.}\quad\underbrace{\bigl(\partial\tilde{\bm{\theta}}/\partial\bm{b}\bigr)^{\top}\mathbb{E}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}]\;=\;\mathbb{E}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}]}_{\text{slack-channel reading, identity Jacobian}}.\tag{5}$$

The body-channel reading composes the cross-covariance with the body Jacobian, which the readout attenuates as the shoulder squeezes; the slack-channel reading is the cross-covariance itself, with no further attenuating composition. Its trace is the $\sigma_{\mathrm{Jac}}^{2}$ of ([6](https://arxiv.org/html/2605.24274#S4.E6)), structurally zero for direct softplus ($\delta\tilde{\bm{\theta}}\equiv\bm{0}$) and generically nonzero for the lift. The estimator ([4](https://arxiv.org/html/2605.24274#S3.E4)) of $\sigma_{\mathrm{Jac}}^{2}$ is also a structural test: delete any one of the three ingredients and it zeros out.

###### Theorem 1 (Each structural ingredient is necessary for the cross-covariance estimator to be nonzero).

Let $\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)}$ denote the slack-channel cross-covariance estimator ([4](https://arxiv.org/html/2605.24274#S3.E4)). Deleting any one of the lift's three structural ingredients, namely (i) the slack channel $\bm{b}$, (ii) the batch-dependence of the body $h_{\bm{\phi}}(\bm{X})$, or (iii) the batch-coupling of $\bm{g}$ and $\tilde{\bm{\theta}}$, makes the slack-channel reading of the estimator vanish: (i) and (ii) zero it identically for every $t$ and window length $T$, while (iii) zeros the population cross-covariance, of which the estimator is the unbiased, $O(T^{-1/2})$-consistent sample version. Full statement (including the regularity assumptions of [Appendix A](https://arxiv.org/html/2605.24274#A1)) and proof in [Section A.1](https://arxiv.org/html/2605.24274#A1.SS1).

Intuition. Each of the three deletions zeros the estimator by a distinct mechanism: a zero slack Jacobian premultiplies every summand (i), a vanishing iterate fluctuation $\delta\tilde{\bm{\theta}}$ makes every summand zero (ii), or conditional independence makes the population coupling, and hence the limit of the estimator, zero (iii). The four-architecture ablation of [Figure 6](https://arxiv.org/html/2605.24274#S4.F6) realizes the three deletions architecturally; only the full lift returns a finite reading. The theorem is a necessity statement for the *estimator*, not for the conditioning advantage itself: a structural zero of $\widehat{\bm{\Sigma}}_{\mathrm{slack}}$ certifies that the slack channel of [Equation 5](https://arxiv.org/html/2605.24274#S4.E5) is inactive, but the converse (that any method with a structural zero must lose the advantage) is not claimed, and PGD (a structural zero by the bias-only deletion) is observed in [Section 6.2](https://arxiv.org/html/2605.24274#S6.SS2) to reach the lift's total variation (TV) on the lowest-dimensional target. The four-architecture ablation realizing the deletions is constructed in [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1).
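The three deletion mechanisms can be checked on a one-dimensional toy model, sketched below under loud assumptions: the body is a scalar multiple of the batch mean, the "gradient" is simply the batch mean, and the names `body` and `slack_reading` are hypothetical. It is an illustration of the intuition above, not the paper's ablation.

```python
import random
import statistics


def body(phi, batch):
    """Toy hypernetwork body: a scalar function of a permutation-invariant batch mean."""
    return phi * statistics.fmean(batch)


def slack_reading(has_bias, has_body, coupled=True, n_passes=2000, seed=0):
    """Monte Carlo estimate of J_b * E[dtheta * dg] for a one-dimensional toy model.

    has_bias: the learnable slack b exists (slack Jacobian 1, otherwise 0).
    has_body: the iterate is emitted from the batch (otherwise a free parameter).
    coupled:  the gradient is evaluated on the same batch that conditions the body.
    """
    rng = random.Random(seed)
    b, phi, free_param = 0.5, 0.8, -1.0
    thetas, grads = [], []
    for _ in range(n_passes):
        batch = [rng.gauss(0.0, 1.0)]                       # n = 1 conditioning sample
        grad_batch = batch if coupled else [rng.gauss(0.0, 1.0)]
        emitted = body(phi, batch) if has_body else free_param
        thetas.append((b if has_bias else 0.0) + emitted)
        grads.append(statistics.fmean(grad_batch))          # toy gradient proxy
    theta_bar, grad_bar = statistics.fmean(thetas), statistics.fmean(grads)
    cross = statistics.fmean([(t - theta_bar) * (g - grad_bar)
                              for t, g in zip(thetas, grads)])
    slack_jacobian = 1.0 if has_bias else 0.0
    return slack_jacobian * cross


readings = {
    "direct softplus": slack_reading(has_bias=False, has_body=False),
    "direct with bias": slack_reading(has_bias=True, has_body=False),
    "body without bias": slack_reading(has_bias=False, has_body=True),
    "full lift": slack_reading(has_bias=True, has_body=True),
    "lift, decoupled gradient": slack_reading(True, True, coupled=False),
}
for name, value in readings.items():
    print(f"{name:26s} {value:+.4f}")
```

In this toy the first three rows are exact zeros (deletions (i) and (ii)), the decoupled row is only small because (iii) zeros the population quantity while the finite-sample estimate still fluctuates, and the full lift is the only row with a reading that stays away from zero.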

![Refer to caption](https://arxiv.org/html/2605.24274v1/x5.png)

Figure 6: Only the architecture that retains all three ingredients returns a finite cross-covariance reading. Time-averaged Frobenius norm of the slack-channel cross-covariance on the small-$\sigma$ region, across the four-architecture ablation of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1). The three deletions (direct softplus, direct with bias, body without bias) return the structural zero predicted by [Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1), plotted at the figure floor for log-axis legibility. The full lift (orange, dashed box) is the only architecture that admits a finite stable reading.

### 4.1 The cross-covariance strong-convexifies the loss landscape on the readout shoulder

We adopt the SDE-of-SGD proxy of Mandt et al. ([2017](https://arxiv.org/html/2605.24274#bib.bib37)) and Li et al. ([2017](https://arxiv.org/html/2605.24274#bib.bib36)): in the small-learning-rate limit, the SGD trajectory on $\bm{\phi}$ is approximated by $d\bm{\phi}=-\nabla_{\bm{\phi}}\mathcal{L}\,dt+\bm{\Sigma}^{1/2}(\bm{\phi})\,d\bm{B}_{t}$, with diffusion sourced by mini-batch noise and batches $\bm{X}$ i.i.d. from the target. Three scalars characterize the regime: the gradient-noise variance, the slack-channel cross-covariance, and the shoulder prefactor,

$$\sigma_{\mathrm{obj}}^{2}\equiv\mathrm{tr}\,\mathrm{Var}_{\bm{X}}[\nabla_{\bm{\theta}}\mathcal{L}],\qquad\sigma_{\mathrm{Jac}}^{2}\equiv\mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}\bigr],\qquad\sigma_{s}\equiv\psi'(\tilde{w}_{s}),\tag{6}$$

with $\tilde{w}_{s}$ the shoulder location and $\delta\bm{g}=\bm{g}-\mathbb{E}_{\bm{X}}[\bm{g}]$ the centered fluctuation of the pre-readout gradient $\bm{g}=\nabla_{\tilde{\bm{\theta}}}\mathcal{L}$. Here $\sigma_{\mathrm{obj}}^{2}$ is the variance of the *constrained-coordinate* gradient $\nabla_{\bm{\theta}}\mathcal{L}$, taken before the readout prefactor is applied, so that the once-attenuated objective-noise variance entering the bias-channel diffusion is the product $\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}$ (the prefactor is counted exactly once); and $\sigma_{\mathrm{Jac}}^{2}$ is the trace of the cross-covariance between the centered iterate fluctuation $\delta\tilde{\bm{\theta}}$ and $\delta\bm{g}$. Their dimensionless ratio

$$\varrho\equiv\sigma_{\mathrm{Jac}}^{2}/(\sigma_{s}^{2}\,\sigma_{\mathrm{obj}}^{2})\tag{7}$$

measures which noise channel carries the lift's effective variance: $\varrho\gg 1$ when the unmodulated Jacobian noise outweighs the prefactor-attenuated gradient noise, so that $\sigma_{\mathrm{eff},\mathrm{hyper}}^{2}\approx\sigma_{\mathrm{Jac}}^{2}$ (we write $\varrho$ to disambiguate it from the ADMM penalty $\rho$ of ([10](https://arxiv.org/html/2605.24274#S4.E10))). The two results below hold under the regularity assumptions (A1)–(A5) stated in full in [Appendix A](https://arxiv.org/html/2605.24274#A1): a small-$\eta$ SDE-of-SGD model with i.i.d. batches, a slowly-varying single-prefactor idealization of the shoulder, the identity slack Jacobian, a forward-KL Hessian linearization of the gradient fluctuation, and (for [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1) only) a metastable single-barrier potential.
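To give a feel for the magnitudes in (6) and (7), here is a small worked computation. It assumes the default softplus readout, borrows the shoulder location $\tilde{w}_s=-13.82$ from the toy SDE of Section 5.1.2, and uses made-up values for $\sigma_{\mathrm{Jac}}^2$ and $\sigma_{\mathrm{obj}}^2$; the function names are illustrative.

```python
import math


def softplus_prime(w):
    """Derivative of softplus, i.e. the logistic sigmoid, computed stably."""
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    e = math.exp(w)
    return e / (1.0 + e)


def noise_ratio(sigma_jac2, sigma_obj2, w_shoulder):
    """Return (varrho of equation (7), sigma_s^2 sigma_obj^2 + sigma_Jac^2)."""
    sigma_s2 = softplus_prime(w_shoulder) ** 2      # squared shoulder prefactor
    attenuated = sigma_s2 * sigma_obj2              # once-attenuated gradient noise
    return sigma_jac2 / attenuated, attenuated + sigma_jac2


# Assumed illustrative values: sigma_Jac^2 = 1e-4 and sigma_obj^2 = 1.
varrho, sigma_eff2 = noise_ratio(sigma_jac2=1e-4, sigma_obj2=1.0, w_shoulder=-13.82)
print(f"varrho = {varrho:.3e}, sigma_eff^2 = {sigma_eff2:.3e}")
```

Because the softplus derivative at the shoulder is of order $10^{-6}$, its square suppresses the gradient-noise term by roughly twelve orders of magnitude, so even a modest $\sigma_{\mathrm{Jac}}^2$ puts the toy firmly in the $\varrho\gg 1$ regime.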

###### Lemma 1 (Implicit strong-convexification from data-conditioned Jacobian noise).

Under the regularity assumptions of [Appendix A](https://arxiv.org/html/2605.24274#A1), the curvature $\bm{H}^{\star}\equiv\nabla^{2}_{\bm{\phi}}\tilde{\mathcal{L}}(\bm{\phi}^{\star})$ of the pullback forward-KL landscape $\tilde{\mathcal{L}}(\bm{\phi})\equiv\mathbb{E}_{\bm{X}}[\mathcal{L}(\psi(\bm{b}+h_{\bm{\phi}}(\bm{X})))]$ at the converged $\bm{\phi}^{\star}$ acquires, on the slack subspace, an added curvature sourced by the slack-channel cross-covariance:

$$\mathrm{tr}\,\bm{H}^{\star}\;\geq\;\mathrm{tr}\,\mathbb{E}_{\bm{X}}[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}]\;+\;d\,\mu_{\mathrm{eff}},\qquad\mu_{\mathrm{eff}}\;=\;\Theta\!\bigl(\sigma_{\mathrm{Jac}}^{2}/(d\,\kappa^{2})\bigr),\tag{8}$$

with $\kappa=\|\tilde{\bm{\theta}}\|_{\infty}$ the typical readout scale and $d$ the slack dimension, the inequality holding to leading order in the iterate-fluctuation magnitude. The slack-channel cross-covariance therefore strongly-convexifies the landscape that SGD samples by a per-dimension modulus $\mu_{\mathrm{eff}}$ set by $\sigma_{\mathrm{Jac}}^{2}$, without deforming $\psi$.

Intuition. The lift's extra noise channel, the resampled-batch fluctuation of the emitted weights, enters the sampled landscape as an added strongly-convex quadratic, an implicit smoothing of the effective optimization geometry that requires no change to the readout $\psi$; the cross-covariance is the formal measure of that channel. Full statement and proof in [Section A.2](https://arxiv.org/html/2605.24274#A1.SS2).

###### Corollary 1 (Mean first-passage time across the shoulder, Arrhenius regime).

Under the regularity assumptions of [Appendix A](https://arxiv.org/html/2605.24274#A1) and a non-vanishing barrier of reference action $\alpha>0$ along the bias-channel direction, the mean first-passage times across the shoulder satisfy

$$\mathbb{E}[\tau_{\mathrm{direct}}]\;\asymp\;\exp\!\bigl(2\alpha\big/(\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2})\bigr)\quad\text{vs.}\quad\mathbb{E}[\tau_{\mathrm{hyper}}]\;\asymp\;\exp\!\bigl(2\alpha\big/(\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}+\sigma_{\mathrm{Jac}}^{2})\bigr),\tag{9}$$

where $\asymp$ denotes equality of the leading exponential factor up to a sub-exponential prefactor. The lifted excess $\sigma_{\mathrm{Jac}}^{2}$ is the slack-channel cross-covariance contribution of [Equation 5](https://arxiv.org/html/2605.24274#S4.E5), structurally absent for direct softplus, so the lift escapes the shoulder strictly faster than the direct softplus.

Intuition. The unattenuated $\sigma_{\mathrm{Jac}}^{2}$ enters the Kramers escape exponent additively, so the lift escapes the shoulder strictly faster than the direct softplus. Full statement and proof in [Section A.3](https://arxiv.org/html/2605.24274#A1.SS3).
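A quick numerical reading of (9) makes the size of the effect concrete. The sketch below compares only the leading exponents, ignoring the sub-exponential prefactor; the barrier height and noise levels are assumed values chosen for round numbers, and `kramers_exponent` is an illustrative helper name.

```python
import math


def kramers_exponent(alpha, attenuated_noise2, sigma_jac2=0.0):
    """Leading exponent 2*alpha / (sigma_s^2 sigma_obj^2 + sigma_Jac^2) of equation (9)."""
    return 2.0 * alpha / (attenuated_noise2 + sigma_jac2)


alpha = 0.05          # assumed barrier height
attenuated = 1e-3     # assumed sigma_s^2 * sigma_obj^2 on the shoulder

direct = kramers_exponent(alpha, attenuated)                  # about 100
lift = kramers_exponent(alpha, attenuated, sigma_jac2=9e-3)   # about 10

# Difference of exponents = log of the ratio of leading exponential factors.
log_speedup = direct - lift
print(f"direct exponent {direct:.1f}, lift exponent {lift:.1f}, "
      f"log speed-up {log_speedup:.1f} (factor ~{math.exp(log_speedup):.2e})")
```

Because the noise sits in the denominator of an exponent, a tenfold increase in effective variance shortens the leading-order escape time by a factor exponential in the barrier-to-noise ratio rather than by a factor of ten.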

### 4.2 The lift as an ADMM consensus reformulation

The lift of [Section 3](https://arxiv.org/html/2605.24274#S3) is a reparametrization, not an alternating algorithm: the boxed objective ([2](https://arxiv.org/html/2605.24274#S3.E2)) is optimized by plain SGD on $(\bm{\phi},\bm{b})$ simultaneously. The reparametrization nonetheless structurally resembles an ADMM consensus splitting. Introduce an auxiliary primal $\bm{z}$ and a Lagrange multiplier $\bm{y}$ for the consensus constraint $\bm{z}=\psi(\bm{b}+h_{\bm{\phi}}(\bm{X}))$, giving the augmented-Lagrangian saddle-point problem

$$\min_{\bm{\phi},\,\bm{b},\,\bm{z}}\;\max_{\bm{y}}\;\mathcal{L}(\bm{z})\;+\;\bm{y}^{\top}\!\bigl(\bm{z}-\psi(\bm{b}+h_{\bm{\phi}}(\bm{X}))\bigr)\;+\;(\rho/2)\,\bigl\|\bm{z}-\psi(\bm{b}+h_{\bm{\phi}}(\bm{X}))\bigr\|_{2}^{2}.\tag{10}$$

The lift is ([10](https://arxiv.org/html/2605.24274#S4.E10)) restricted to its feasible set: where $\bm{z}=\psi(\bm{b}+h_{\bm{\phi}}(\bm{X}))$ exactly, the penalty and multiplier terms vanish and the objective collapses to ([2](https://arxiv.org/html/2605.24274#S3.E2)). We use ([10](https://arxiv.org/html/2605.24274#S4.E10)) only to locate the lift within the ADMM family. It is a structural lens, not a training objective: $\bm{z}$ and $\bm{y}$ are never instantiated, we do not run the alternating schedule, and we do not claim the lift as a $\rho\to\infty$ limit, since eliminating $\bm{z}$ would interchange that limit with a non-convex stochastic infimum. The classical ADMM convergence theory does not transfer either, because $\mathcal{L}$ is non-convex over the ICNN energy parameters, so the resemblance is structural, not an inherited algorithm or rate theorem. The empirical contrast with plain ADMM-with-positivity that *is* run, its $\bm{z}$-step a cone projection, is given in [Section 6.1](https://arxiv.org/html/2605.24274#S6.SS1).

What survives this non-convex setting is the cross-covariance reading of ([5](https://arxiv.org/html/2605.24274#S4.E5)): the slack channel reads the cross-covariance through an identity Jacobian with no attenuating composition, and the cross-covariance is structurally nonzero for the lift and identically zero for the slack-free direct parametrization. The empirical evidence of [Section 5](https://arxiv.org/html/2605.24274#S5) measures this cross-covariance, independent of whether the optimizer runs synchronous SGD or alternating ADMM updates.

## 5 Empirical evidence across two ICNN paradigms

Two training paradigms put ICNNs to work within this paper's scope: log-concave energy-based modeling ([Section 5.1](https://arxiv.org/html/2605.24274#S5.SS1)) and convex potential flow estimation ([Section 5.2](https://arxiv.org/html/2605.24274#S5.SS2)). Every comparison reported below is a three-way reading of hypernet vs direct-softplus vs PGD: PGD on the non-negative cone is the original ICNN recipe (Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)) and a structurally-zero sentinel under the same argument as the bias-only deletion of [Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2). The log-concave EBM subsection carries the structural evidence: a four-architecture cross-covariance ablation (the load-bearing test of [Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1)), a diffusive-escape SDE on a one-dimensional Gumbel target, and a four-target log-concave gallery with paired 30-seed TV bounds. The convex potential flow subsection transfers the lift to the change-of-variables likelihood on two-dimensional synthetic targets and on a 21-dimensional tabular benchmark, with the loss-landscape geometry rendered in both parameter spaces. A closing subsection ([Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3)) visualizes the loss landscape across both paradigms in both the constrained ICNN-$\theta$ and lifted $(\bm{\phi},\bm{b})$ coordinate systems.

### 5.1 Log-concave EBM training: structural evidence for the lift mechanism

We train ICNN-EBMs with the forward-KL objective. The model parametrizes a density $p_{\bm{\theta}}(\bm{x})=Z(\bm{\theta})^{-1}\exp(-E_{\bm{\theta}}(\bm{x}))$ with $E_{\bm{\theta}}:\mathbb{R}^{d}\to\mathbb{R}$ the output of an ICNN, $\bm{\theta}=\psi(\tilde{\bm{\theta}})$ (softplus by default) enforcing positivity, and the hypernet emit ([1](https://arxiv.org/html/2605.24274#S3.E1)) routing $\tilde{\bm{\theta}}$ through the slack-plus-body decomposition of [Section 3](https://arxiv.org/html/2605.24274#S3). The non-negative inter-layer weights are initialized with the folded-normal scheme of Hoedt and Klambauer ([2023](https://arxiv.org/html/2605.24274#bib.bib10)). The forward KL divergence between data and model takes the form

$$\mathcal{L}_{\mathrm{FKL}}(\bm{\theta})\;=\;\mathbb{E}_{\bm{x}\sim p_{\mathrm{data}}}\bigl[\,E_{\bm{\theta}}(\bm{x})\,\bigr]\;+\;\log Z(\bm{\theta}),\tag{11}$$

with the data expectation estimated by a mini-batch average and the log-normalizer $\log Z(\bm{\theta})=\log\int\exp(-E_{\bm{\theta}}(\bm{x}))\,d\bm{x}$ estimated by self-normalized importance sampling (SNIS) (Owen, [2013](https://arxiv.org/html/2605.24274#bib.bib66)) against a conditional sampler flow $q_{\phi}$ trained jointly with the energy; we use a Student-$t$ base with $\nu=3$ for $q_{\phi}$ following Jaini et al. ([2020](https://arxiv.org/html/2605.24274#bib.bib12)), because log-concave ICNN-EBMs have at-most-exponential tails (Saumard and Wellner, [2014](https://arxiv.org/html/2605.24274#bib.bib23)) that a Gaussian-base flow cannot cover at high dimension and SNIS has finite variance only when the proposal has at-least-as-heavy tails as the target. The optimizer is Adam (Kingma and Ba, [2015](https://arxiv.org/html/2605.24274#bib.bib70)) on all parameters at $\mathrm{lr}=10^{-3}$ with the sampler flow updated jointly with the EBM at a $1\!:\!1$ inner-outer schedule. We adopt $n=1$ for the hypernetwork's conditioning-batch size throughout (see [Remark 3](https://arxiv.org/html/2605.24274#Thmremark3) in [Section 3.4](https://arxiv.org/html/2605.24274#S3.SS4)).
The three-way method comparison is identical across all experiments below: the hypernet body is the same batch-summary multilayer perceptron (MLP) throughout, the direct-softplus baseline is the slack-free $\bm{b}\equiv 0$, $h_{\bm{\phi}}\equiv 0$ limit of ([1](https://arxiv.org/html/2605.24274#S3.E1)), and the PGD baseline takes an unconstrained step on $\bm{\theta}$ followed by the projection $\bm{\theta}\leftarrow\max(\bm{\theta},0)$, the standard ICNN training recipe (Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)) and the $\rho\to\infty$ stiff-penalty limit of plain ADMM-with-positivity ([Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2)). The structural reading of the lift's contribution (slack, body, and cross-covariance) is settled by the four-architecture ablation of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1) rather than by any single test-time outcome ([Section 6.2](https://arxiv.org/html/2605.24274#S6.SS2)).
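The role of the heavy-tailed proposal in estimating $\log Z$ can be illustrated on a case with a known answer. The sketch below is a simplification under stated assumptions: it uses ordinary importance sampling with a fixed one-dimensional Student-$t$ ($\nu=3$) proposal rather than the paper's jointly trained sampler flow, and a quadratic energy $E(x)=x^2/2$ whose exact normalizer is $\sqrt{2\pi}$. All function names are illustrative.

```python
import math
import random


def student_t3_logpdf(x):
    """Log-density of the Student-t distribution with nu = 3 degrees of freedom."""
    nu = 3.0
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return log_norm - (nu + 1) / 2 * math.log1p(x * x / nu)


def sample_student_t3(rng):
    """Draw t = z / sqrt(chi2 / nu), with chi2 ~ Gamma(shape nu/2, scale 2)."""
    nu = 3.0
    chi2 = rng.gammavariate(nu / 2, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)


def log_z_estimate(energy, n_samples=20000, seed=0):
    """log Z ~= logsumexp_i(-E(x_i) - log q(x_i)) - log N, with x_i drawn from q."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(n_samples):
        x = sample_student_t3(rng)
        log_w.append(-energy(x) - student_t3_logpdf(x))
    m = max(log_w)                       # log-sum-exp shift for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in log_w)) - math.log(n_samples)


def quadratic_energy(x):
    """Convex toy energy whose density is the standard normal."""
    return 0.5 * x * x


estimate = log_z_estimate(quadratic_energy)
exact = 0.5 * math.log(2.0 * math.pi)
print(f"estimated log Z = {estimate:.4f}, exact = {exact:.4f}")
```

Here the importance weights stay bounded because the proposal's polynomial tails dominate the target's Gaussian tails, which is the one-dimensional version of the tail-coverage argument made above.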

#### 5.1.1 Only the full lift returns a finite cross-covariance reading

The four architectures of [Figure 6](https://arxiv.org/html/2605.24274#S4.F6) realize the three deletions of [Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1). Direct softplus, $\psi(\tilde{\bm{\theta}})$, has no slack and no body. Direct softplus with bias, $\psi(\bm{b}+\tilde{\bm{\theta}})$, has slack but no body, so the pre-activation has no batch-induced variance. Hypernet without bias, $\psi(h_{\bm{\phi}}(\bm{X}))$, has body but no slack. The full lift, $\psi(\bm{b}+h_{\bm{\phi}}(\bm{X}))$, retains all three ingredients. For each architecture, we wire the cross-covariance estimator ([4](https://arxiv.org/html/2605.24274#S3.E4)) into the trailing window and report the time-averaged Frobenius norm of $\widehat{\bm{\Sigma}}_{\mathrm{slack}}$ on the small-$\sigma$ region $\min\tilde{\theta}<-3$ ([Figure 6](https://arxiv.org/html/2605.24274#S4.F6)).

The three deletions return zero for distinct structural reasons. Direct softplus lacks $\bm{b}$ entirely. Direct softplus with bias has scalar slack with no batch-induced variance. The body-only architecture keeps $\min\tilde{\theta}$ above the shoulder threshold throughout training, so the small-$\sigma$ region is never entered. The full lift is the only architecture whose trajectory enters the small-$\sigma$ region and whose cross-covariance is finite and stable. PGD has neither slack to differentiate nor a data-conditioned body, so its reading is structurally zero by the same argument as the bias-only architecture. The lift's finite cross-covariance is the load-bearing structural reading of the mechanism; body width alone cannot close the conditioning gap; and the sentinel zero on the three deletions is invariant to seed.

#### 5.1.2 Diffusive-escape SDE: the lift's escape rate rises with the bias-channel noise

The SDE we simulate is drift-free: both terms are martingale increments, so the regime is diffusive escape from an absorbing region under a state-dependent noise variance, not Arrhenius barrier-crossing. The toy simulation that mirrors the lift's diffusive escape evolves $256$ replicates of the bias-channel SDE

$$d\tilde{w}_{t}\;=\;-\psi'(\tilde{w}_{t})\,\sigma_{\mathrm{obj}}\,dW_{t}^{\mathrm{obj}}+\sigma_{\mathrm{Jac}}\,dW_{t}^{\mathrm{jac}}$$

across an absorbing shoulder $\tilde{w}_{s}=-13.82$ from an initial condition $\tilde{w}_{0}=-16$ for $n_{\mathrm{iters}}=20{,}000$ steps. The first term carries the readout-prefactor-attenuated objective noise; the second term carries the unmodulated bias-channel cross-covariance.

[Figure 7](https://arxiv.org/html/2605.24274#S5.F7) reports the empirical escape rates and first-passage times (FPTs) for the three-way hypernet/direct/PGD reading. The empirical FPT decreases monotonically with $\sigma_{\mathrm{Jac}}$, the qualitative signature of the diffusive mechanism; the idealized envelope $\mathbb{E}[\tau]\propto(\tilde{w}_{s}-\tilde{w}_{0})^{2}/\sigma_{\mathrm{Jac}}^{2}$ is an upper reference the finite-budget SDE does not attain, since the effective escape distance is shorter than the nominal $\tilde{w}_{s}-\tilde{w}_{0}$. As expected, at the two lowest-noise cells the prefactor-attenuated objective noise dominates the unmodulated Jacobian noise and no replicate escapes at the $20{,}000$-step budget. This SDE is a synthetic illustration of the mechanism; the escape behavior on an actual ICNN-EBM is measured in [Figure 8](https://arxiv.org/html/2605.24274#S5.F8).
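The bias-channel SDE above can be integrated with a basic Euler–Maruyama scheme, as in the following sketch. It is not the paper's simulation: the unit time step, $\sigma_{\mathrm{obj}}=1$, the reduced budget (64 replicates, 2,000 steps instead of 256 and 20,000), and the three $\sigma_{\mathrm{Jac}}$ values are all assumptions chosen so the example runs quickly in plain Python.

```python
import math
import random


def softplus_prime(w):
    """Derivative of softplus (logistic sigmoid), computed stably."""
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    e = math.exp(w)
    return e / (1.0 + e)


def simulate_escape(sigma_jac, sigma_obj=1.0, w0=-16.0, w_shoulder=-13.82,
                    n_replicates=64, n_steps=2000, dt=1.0, seed=0):
    """Euler-Maruyama on dw = -psi'(w) sigma_obj dW_obj + sigma_Jac dW_jac.

    A replicate is absorbed the first time it reaches w_shoulder from below.
    Returns (escape fraction, mean first-passage step among escapees or None).
    """
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    passage_steps = []
    for _ in range(n_replicates):
        w = w0
        for step in range(1, n_steps + 1):
            dw_obj = rng.gauss(0.0, sqrt_dt)     # attenuated objective-noise increment
            dw_jac = rng.gauss(0.0, sqrt_dt)     # unmodulated slack-channel increment
            w += -softplus_prime(w) * sigma_obj * dw_obj + sigma_jac * dw_jac
            if w >= w_shoulder:
                passage_steps.append(step)
                break
    fraction = len(passage_steps) / n_replicates
    mean_fpt = sum(passage_steps) / len(passage_steps) if passage_steps else None
    return fraction, mean_fpt


frozen = simulate_escape(sigma_jac=0.0)    # the direct-softplus slice
slow = simulate_escape(sigma_jac=0.05)
fast = simulate_escape(sigma_jac=0.2)
for label, result in (("sigma_Jac=0", frozen), ("0.05", slow), ("0.2", fast)):
    print(label, result)
```

With $\sigma_{\mathrm{Jac}}=0$ the only noise is scaled by $\psi'(\tilde{w})\approx 10^{-7}$, so no replicate can cover the distance to the shoulder within the budget, while larger $\sigma_{\mathrm{Jac}}$ raises the escape fraction.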

![Refer to caption](https://arxiv.org/html/2605.24274v1/x6.png)Figure 7:The lift’s escape rate rises monotonically with the bias\-channel noise; direct softplus cannot escape\.A synthetic bias\-channel SDE that illustrates the diffusive\-escape mechanism—not an ICNN run; the real\-ICNN confirmation is[Figure˜8](https://arxiv.org/html/2605.24274#S5.F8)\.\(a\)Escape rate vsσJac\\sigma\_\{\\mathrm\{Jac\}\}:hypernetrises monotonically;direct softpluscannot escape because the slack channel is absent;PGDnever enters a readout shoulder, so the escape question is degenerate\.\(b\)Mean first\-passage time on log\-log axes, decreasing withσJac\\sigma\_\{\\mathrm\{Jac\}\}; the gray dashed line is the idealized diffusive envelope\(w~s−w~0\)2/σJac2\(\\tilde\{w\}\_\{s\}\-\\tilde\{w\}\_\{0\}\)^\{2\}/\\sigma\_\{\\mathrm\{Jac\}\}^\{2\}, an upper reference the finite\-budget SDE does not attain\.\(c\)Zoomed first\-passage time with a power\-law fit; the empirical exponent is shallower than the idealized−2\-2because the SDE’s effective escape distance is shorter than the nominalw~s−w~0\\tilde\{w\}\_\{s\}\-\\tilde\{w\}\_\{0\}\.The three\-way comparison closes structurally\. Thedirect\-softplus method corresponds to theσJac=0\\sigma\_\{\\mathrm\{Jac\}\}=0slice of the SDE—no unmodulated Jacobian\-side noise, no escape at any budget\. ThePGDmethod has noψ\\psiat all, so its iterate is never trapped on a readout shoulder; the diffusive\-escape question is degenerate for PGD, consistent with the bias\-only reading of[Section˜4\.2](https://arxiv.org/html/2605.24274#S4.SS2)\.

The escape is not only a property of the synthetic SDE. [Figure 8](https://arxiv.org/html/2605.24274#S5.F8) measures the same escape on an actual ICNN-EBM: the lift and direct softplus trained under forward-KL on the one-dimensional Gumbel target, five seeds, with the full per-coordinate pre-readout iterate logged so escape reads as a cohort statistic. The reading is decisive and seed-consistent: for direct softplus the readout shoulder is an absorbing trap, its shoulder-coordinate population monotone non-decreasing and shoulder-touching coordinates almost never leaving; for the lift the shoulder is transient, coordinates cycling in and out as the batch-coupled slack channel keeps the iterate wandering across the shoulder boundary. The conditional escape statistics factor out the initialization and isolate the mechanism itself: the lift’s per-coordinate leave-rate is more than an order of magnitude above that of direct softplus. A real ICNN cannot reproduce the synthetic SDE’s controlled $\sigma_{\mathrm{Jac}}$ sweep, since $\sigma_{\mathrm{Jac}}$ is emergent rather than a dial; what it confirms is the binary structural fact the SDE abstracts: with the slack channel, shoulder coordinates escape; without it, they do not.
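One way such cohort statistics could be computed from a logged pre-readout iterate is sketched below. The definitions are assumptions made for illustration, since the text does not spell them out: a coordinate counts as "in the shoulder" when its pre-readout value lies below a hypothetical `threshold`, the leave-rate is the fraction of in-shoulder (coordinate, step) pairs that are out at the next step, and the dwell fraction is the share of steps a coordinate spends in the shoulder after first touching it.

```python
import numpy as np

def shoulder_stats(w_tilde, threshold):
    """Cohort statistics from a logged iterate of shape (n_iters, n_coords).

    All definitions here are illustrative assumptions, not the paper's.
    """
    in_shoulder = w_tilde < threshold
    occupancy = in_shoulder.mean(axis=1)              # per-iteration occupancy

    # Leave-rate: conditioned on being in the shoulder at step t,
    # how often is the coordinate outside it at step t + 1?
    was_in = in_shoulder[:-1]
    left = was_in & ~in_shoulder[1:]
    leave_rate = left.sum() / max(was_in.sum(), 1)

    # Dwell fraction: for coordinates that ever touch the shoulder,
    # fraction of steps spent inside it from the first touch onward.
    touched = np.flatnonzero(in_shoulder.any(axis=0))
    first_touch = in_shoulder.argmax(axis=0)
    dwell = [in_shoulder[first_touch[j]:, j].mean() for j in touched]
    dwell_fraction = float(np.mean(dwell)) if dwell else 0.0
    return occupancy, leave_rate, dwell_fraction

# Tiny hand-made log: coordinate 0 visits the shoulder once, coordinate 1 never leaves.
log = np.array([[1.0, -1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, -1.0]])
occupancy, leave_rate, dwell_fraction = shoulder_stats(log, threshold=0.0)
print(occupancy, leave_rate, dwell_fraction)
```

Conditioning on a coordinate already being in the shoulder is what removes the dependence on how many coordinates start there.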

![Refer to caption](https://arxiv.org/html/2605.24274v1/x7.png)

Figure 8: On a real ICNN-EBM the lift’s readout shoulder is transient, while direct softplus’s is an absorbing trap. The hypernet and direct softplus methods trained under forward-KL on the one-dimensional Gumbel target, five seeds, with the pre-readout iterate logged per coordinate. (a) Shoulder occupancy versus iteration: the direct softplus population grows monotonically and never drains, while the hypernet population stays a thin tail. (b) Per-coordinate pre-readout-iterate ensemble: the direct tail sinks deep into the shoulder, the hypernet tail only brushes the threshold. (c) Conditional escape statistics (leave-rate and dwell fraction, conditioned on a coordinate already being in the shoulder so initialization is factored out): the hypernet leave-rate is more than an order of magnitude above that of direct softplus, and its dwell fraction is roughly half.

#### 5.1.3 The conditioning advantage transfers from one-dimensional toy targets to a 32-dimensional image-flavored latent

The lift’s median TV-to-target sits strictly below direct softplus on every one of four log-concave one-dimensional targets, with the gap widening as the target’s support tightens. [Figure 9](https://arxiv.org/html/2605.24274#S5.F9) delivers both readings in a single $4\times 2$ panel: single-seed density overlays (top row) and paired TV-to-target histograms across 30 seeds (bottom row) for the hypernet and direct softplus methods across four log-concave one-dimensional targets (Gumbel, Laplace, Gamma, Beta). The hypernet median sits strictly below the direct median on every target, with widening margins as the support tightens; Beta on $[0,1]$ separates the two methods most strongly. The conditioning advantage is loss-agnostic across the log-concave family of [Remark 1](https://arxiv.org/html/2605.24274#Thmremark1). PGD on the same Gumbel target reaches the hypernet’s TV ([Section 6.2](https://arxiv.org/html/2605.24274#S6.SS2)); the structural probe that distinguishes hypernet from PGD is the cross-covariance reading of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1), not the test-time density.
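For a one-dimensional target, a TV-to-target reading can be sketched on a grid. The example assumes the standard definition $\mathrm{TV}(p,q)=\tfrac{1}{2}\int|p-q|\,\mathrm{d}x$ and uses a shifted Gumbel as a stand-in for a fitted model; the paper's own evaluation grid and normalization procedure are not specified here, so `tv_distance` and the grid bounds are illustrative choices.

```python
import numpy as np

def gumbel_pdf(x, loc=0.0, scale=1.0):
    """Density of the standard (maximum) Gumbel distribution, shifted and scaled."""
    z = (x - loc) / scale
    return np.exp(-(z + np.exp(-z))) / scale

def tv_distance(p, q, x):
    """Total variation 0.5 * integral |p - q| dx, trapezoidal rule on grid x."""
    diff = np.abs(p - q)
    return 0.5 * np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(x))

x = np.linspace(-10.0, 40.0, 50001)
target = gumbel_pdf(x)
model = gumbel_pdf(x, loc=0.5)        # hypothetical mis-located fit
print(tv_distance(target, model, x))
```

The value lies in $[0,1]$, with $0$ for a perfect fit and values near $1$ when the two densities barely overlap.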

![Refer to caption](https://arxiv.org/html/2605.24274v1/x8.png)

Figure 9: The lift uniformly improves the convergence distribution across four log-concave one-dimensional ICNN-EBM targets. Top row: single-seed density fits. Target (black) vs hypernet vs direct softplus (dashed). The hypernet curve overlays the target on every panel; the direct curve over-shoots the mode or mis-fits the tail. Bottom row: per-seed total-variation distance to the target across 30 seeds per method, with color-matched dashed medians and a dotted success threshold. The probe that distinguishes the lift from PGD is in the cross-covariance reading of [Figure 6](https://arxiv.org/html/2605.24274#S4.F6); the landscape geometry of the lift versus the direct constraint is rendered in [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3).

##### MNIST autoencoder-latent.

A frozen convolutional autoencoder maps each $28\times 28$ MNIST (LeCun et al., [1998](https://arxiv.org/html/2605.24274#bib.bib69)) digit image to a 32-dimensional latent vector, and a per-class log-concave EBM is fit to each digit’s empirical latent distribution under forward-KL. Decoding posterior samples through the frozen autoencoder recovers an image-quality reading on top of the latent test-loss metric. The setup transfers the lift verbatim: each per-class hypernet emits the same softplus-tagged inter-layer weights as the one-dimensional EBM line, and the same batch-summary plus slack-bias decomposition applies. All ten digit classes are recovered and the basin/plateau geometry holds seed-by-seed.

The MNIST autoencoder-latent line extends the EBM evidence to 32-dimensional image-flavored latents while remaining inside the log-concave scope of the paper’s analysis. The per-class ensemble of [Figure 10](https://arxiv.org/html/2605.24274#S5.F10) shows the lift’s conditioning advantage transfers beyond the one-dimensional gallery of [Figure 9](https://arxiv.org/html/2605.24274#S5.F9), and the loss-landscape view in [Figure 11](https://arxiv.org/html/2605.24274#S5.F11) is the train-time signature that the test-loss bar in [Figure 12](https://arxiv.org/html/2605.24274#S5.F12) corroborates at the operating point. [Figure 13](https://arxiv.org/html/2605.24274#S5.F13) aggregates the EBM line across one-, two-, six-, and 32-dimensional log-concave targets and shows that the lift’s metric advantage scales without breaking.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x9.png)

Figure 10: The lift recovers all ten digit classes. Decoded samples from the per-class hypernet EBM through the frozen autoencoder: one row per digit class, all ten recovered with class-recognizable character. The headline lift contrast lives in the multi-seed convergence and landscape figures below.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x10.png)

\(a\) Parameter\-space view \(three seeds\)\.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x11.png)

\(b\) Loss\-space view \(pooled, three seeds\)\.

Figure 11: On a 32-dimensional image-flavored latent the lift descends through the basin while direct softplus pins to a higher plateau. Three seeds; test loss at each method’s lowest-validation-loss checkpoint. (a) Loss landscape on a two-dimensional slice through the converged hypernet (origin, gold star), one panel per seed; legend mirrors [Figure 1](https://arxiv.org/html/2605.24274#S0.F1)(a). (b) Held-out test loss versus iteration pooled across the same three seeds. The hypernet descends through the basin while direct softplus pins to the readout shoulder, a gap that holds seed-by-seed in (b)’s late-training inset. Axes of (a) are constructed by the adaptive scheme of [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3). A lifted-space companion panel is omitted because the per-iteration hypernet snapshots were not persisted for this run.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x12.png)

Figure 12: The lift sits below both PGD and a classical Gaussian on every digit class. Per-class held-out test loss on the 32-dimensional MNIST autoencoder-latent target: the hypernet, PGD on the non-negative cone, and a per-class full-covariance Gaussian baseline, each fit on the same per-class latent partition. The hypernet sits strictly below PGD, which in turn sits below the Gaussian, on all ten digits; this is the per-class reading of the same three-way ordering the rest of [Section 5.1](https://arxiv.org/html/2605.24274#S5.SS1) reports. The all-class direct-softplus contrast on the same dataset is in [Figure 11](https://arxiv.org/html/2605.24274#S5.F11).

![Refer to caption](https://arxiv.org/html/2605.24274v1/x13.png)

Figure 13: The lift’s conditioning advantage scales with ambient dimension without breaking. Hypernet vs direct softplus, three seeds each, across four representative log-concave targets at one-, two-, six-, and 32-dimensional scales. Left: normalized metric (value relative to the direct mean); at every dimension the hypernet bar lands well below the direct reference (lower is better). Right: absolute values on log scale with the numbers annotated.

### 5.2 Convex potential flows: the lift transfers to the change-of-variables likelihood

The convex-potential-flow paradigm (Huang et al., [2021](https://arxiv.org/html/2605.24274#bib.bib11)) replaces the forward-KL data expectation of [Section 5.1](https://arxiv.org/html/2605.24274#S5.SS1) with a change-of-variables likelihood, and the lift transfers to it without modification. The transport map $T:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is realized as the gradient of a convex potential $f_{\bm{\theta}}:\mathbb{R}^{d}\to\mathbb{R}$ implemented by an ICNN with non-negative inter-layer weights $\bm{\theta}_{l}\succeq\bm{0}$ enforced by the same readout $\psi$ as in the EBM line. The pushed-forward density takes the form

$$\log p_{X}(\bm{x})\;=\;\log p_{\mathcal{N}}\big(\nabla f_{\bm{\theta}}(\bm{x});\,\bm{0},\,\bm{I}\big)\;+\;\log\det\big(\nabla^{2}f_{\bm{\theta}}(\bm{x})\big),\tag{12}$$

in which $\det\nabla^{2}f_{\bm{\theta}}$ is the Jacobian determinant of the Brenier map (Brenier, [1991](https://arxiv.org/html/2605.24274#bib.bib67)) and is estimated stochastically (Hutchinson–CG (Huang et al., [2021](https://arxiv.org/html/2605.24274#bib.bib11))) when the ambient dimension makes exact evaluation infeasible. The hypernet emit ([1](https://arxiv.org/html/2605.24274#S3.E1)) replaces the direct $\tilde{\bm{\theta}}$ by the slack-plus-body decomposition, leaving every other component of the CPFlow stack unchanged. The two-dimensional experiments use a $3$-layer ICNN of hidden width $64$ with strong-convexity coefficient $0.05$; the 21-dimensional tabular experiment uses the PICNN of Amos et al. ([2017](https://arxiv.org/html/2605.24274#bib.bib1)) configured per the PCP-Map recipe of Wang et al. ([2025](https://arxiv.org/html/2605.24274#bib.bib16)) as the convex potential, trained as an unconditional density by collapsing the PICNN’s conditioning input to a constant.
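Equation (12) can be sanity-checked on the simplest convex potential. The sketch below assumes a quadratic $f(\bm{x})=\tfrac{1}{2}\bm{x}^{\top}\bm{A}\bm{x}$ with a symmetric positive-definite $\bm{A}$ in place of an ICNN, so that $\nabla f(\bm{x})=\bm{A}\bm{x}$ and $\nabla^{2}f=\bm{A}$ are available in closed form and the log-determinant is exact rather than estimated by Hutchinson–CG. The function name `cpflow_logp_quadratic` and the matrix `A` are illustrative choices, not part of the paper's stack.

```python
import numpy as np

def cpflow_logp_quadratic(x, A):
    """Change-of-variables log-density for the potential f(x) = 0.5 * x^T A x.

    x has shape (n, d); A is symmetric positive definite (a stand-in for an ICNN).
    """
    d = x.shape[-1]
    z = x @ A                                   # grad f(x) = A x, row-wise (A symmetric)
    log_base = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)
    _, logdet = np.linalg.slogdet(A)            # Hessian of f is A at every x
    return log_base + logdet

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x = np.array([[0.0, 0.0], [0.3, -0.2]])
print(cpflow_logp_quadratic(x, A))
```

For this potential the model is a zero-mean Gaussian with covariance $\bm{A}^{-2}$, so the density should integrate to one, which gives a quick check that the log-det term carries the right sign.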

The CPFlow experiments below are run as a two-way comparison, hypernet vs direct softplus; the three-way structural reading that includes PGD is settled by the cross-covariance test of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1). What this subsection adds is loss-agnostic confirmation: the same conditioning advantage the EBM line measured under forward-KL carries through the change-of-variables likelihood.

#### 5.2.1 Two-dimensional synthetic targets: the lift’s advantage surfaces in the convergence distribution

The two-dimensional demo trains a single-block convex potential flow on the 8-Gaussians and 2-spirals targets, both backends, under an apples-to-apples symmetric initialization: each backend’s positivity readout receives the same per-element $\mathcal{N}(0,\sigma_{b_{h}}^{2})$ pre-softplus jitter, so the only remaining difference between the arms is the lift itself. Training is $6{,}000$ Adam iterations at $\mathrm{lr}=10^{-3}$, batch size $64$, held-out 1024-sample test loss; the sweep runs $100$ paired seeds per method per target.

[Figure 14](https://arxiv.org/html/2605.24274#S5.F14) reports the demo distributionally: the held-out test loss of all $100$ seeds per method. The reading is consistent with the $\varrho\ll 1$ regime of ([7](https://arxiv.org/html/2605.24274#S4.E7)) on the CPFlow loss landscape: the CG-Hutchinson log-det stochastic estimator injects an $O(d)$ gradient-noise floor that dilutes the dimensionless control $\varrho$ below the Arrhenius-to-diffusive threshold, so the lift’s advantage surfaces as a shift of the convergence distribution, exactly as [Remark 5](https://arxiv.org/html/2605.24274#Thmremark5)(c) predicts for regime-conditional smoothing.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x14.png)

![Refer to caption](https://arxiv.org/html/2605.24274v1/x15.png)

Figure 14: The lift shifts the convergence distribution toward a basin the direct softplus essentially never reaches. Paired test-loss histograms on two-dimensional convex potential flows across 100 seeds per method per target. Left panel is 8-Gaussians, right is 2-spirals; step histograms of the held-out test loss, with dashed median lines per method and a dotted vertical line at the data-honest threshold $\tau$ separating the hypernet’s converged mode from the direct method’s plateau (red dashed rectangle, “direct plateau”).

The 100-seed sweep separates a hypernet basin from a direct plateau on both targets, with the hypernet’s median sitting well below the direct’s. Adopting a data-honest threshold $\tau$ between the hypernet minimum and the direct median, a large majority of hypernet seeds reach the lower-loss mode while essentially no direct seed does ([Figure 14](https://arxiv.org/html/2605.24274#S5.F14)). The direct plateau (the red dashed rectangle in each panel) is the operational signature of the failure mode and tracks the same shape as the paired-TV bound of [Section 5.1.3](https://arxiv.org/html/2605.24274#S5.SS1.SSS3). The mechanism applies loss-agnostically: the change-of-variables likelihood replaces the forward-KL data expectation as the source of batch-coupled gradient noise, but ingredient (iii) of [Section 4](https://arxiv.org/html/2605.24274#S4) survives. The lift shifts the convergence distribution toward a basin the direct softplus essentially never reaches at this budget.

#### 5.2.2 Loss-landscape geometry: the lift carves a valley where the direct softplus sees a plateau

[Figure 15](https://arxiv.org/html/2605.24274#S5.F15) reports the change-of-variables likelihood of ([12](https://arxiv.org/html/2605.24274#S5.E12)) on a two-dimensional slice through a single seed of the symmetric-init sweep on the 8-Gaussians target, with both methods’ per-iteration trajectories projected onto the plane. This is one seed of the same configuration that drives [Figure 14](https://arxiv.org/html/2605.24274#S5.F14), so both 8-Gaussians figures are one apples-to-apples experiment. The slice convention is the adaptive scheme of [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3), instantiated separately for the constrained and lifted parameter spaces.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x16.png)

(a) Parameter-space view (two spaces: constrained $\theta$ left; lifted $(\bm{\phi},\bm{b})$ right).

![Refer to caption](https://arxiv.org/html/2605.24274v1/x17.png)

\(b\) Loss\-space view\.

Figure 15: The same training trajectory traces a plateau in constrained space and a clean valley in lifted space. Convex potential flow on the 8-Gaussians target, a single seed of the same symmetric-init sweep that drives [Figure 14](https://arxiv.org/html/2605.24274#S5.F14). (a) Left: constrained ICNN parameter space anchored at the converged hypernet. Right: lifted hypernet parameter space; axes are the top two trajectory-PCA components of the lifted hypernet snapshots. (b) Held-out loss versus iteration on the same run. In panel (a), the hypernet descends to the origin (gold star) and direct softplus pins to the plateau in constrained space, while the lifted-space sub-panel renders the same hypernet trajectory as a clean valley. Panel (b) shows the loss-space counterpart: the hypernet descends to a held-out loss of $0.58$ nats, while direct softplus stalls on the constrained-space plateau at $3.54$ nats, the single-seed signature of the distributional gap that the 100-seed paired sweep of [Figure 14](https://arxiv.org/html/2605.24274#S5.F14) quantifies. The slice planes are constructed by the adaptive scheme of [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3), instantiated separately for the constrained and lifted spaces. A strict random-direction slice in the same neighborhood looks bowl-shaped because two random filter-norm directions almost surely miss the narrow channel separating the methods; in contrast, the adaptive frame is dense in the directions of optimization interest and renders the basin/plateau geometry that [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) predicts.

The reading is the structural signature [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) predicts. In $\theta$-space both trajectories trace nearly the same path for most of training, pinned to the plateau around the converged direct softplus, before the hypernet’s emitted PICNN escapes the plateau and descends to the origin (gold star). In lifted space the same trajectory traces a clean valley with no plateau structure intervening. The lift converts a plateau-bounded trajectory in constrained space into a valley-descending trajectory in lifted space.

#### 5.2.3 UCI HEPMASS: the lift improves over a literature-scale direct-softplus convex-potential flow

On the 21-dimensional tabular benchmark (Baldi et al., [2016](https://arxiv.org/html/2605.24274#bib.bib46); Papamakarios et al., [2017](https://arxiv.org/html/2605.24274#bib.bib22)), at the five-block convex-potential flow architecture of Huang et al. ([2021](https://arxiv.org/html/2605.24274#bib.bib11)), the lift descends to a lower test loss than direct softplus; the direct-softplus reproduction itself matches the published convex-potential flow result of Huang et al. ([2021](https://arxiv.org/html/2605.24274#bib.bib11)), so the comparison is anchored at literature scale. The two-dimensional evidence above adjudicates the lift at the lowest non-degenerate ambient dimension; this tabular reading is the literature-scale CPFlow benchmark, confirming that the per-block strong-convexification of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) translates from the two-dimensional toy to the canonical tabular regime: each CPFlow block emits a non-negative inter-layer weight stack through the same hypernet readout, so the per-block conditioning advantage compounds across depth without modification to the change-of-variables likelihood or the log-det stochastic estimator. The structural reading (slack, body, and cross-covariance) is settled by the four-architecture ablation of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1); [Figure 16](https://arxiv.org/html/2605.24274#S5.F16) reports the loss-space trace against the literature reference.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x18.png)

(a) Lifted-space loss landscape with hypernet trajectory.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x19.png)

(b) Test loss versus iteration (both methods).

Figure 16: The lift improves over a literature-scale direct-softplus convex-potential flow. (a) Lifted hypernetwork-parameter space $(\bm{\phi},\bm{b})$; the hypernet trajectory descends a coherent valley from initialization to the converged basin at the origin (gold star). White tiles mark offsets where the log-det estimator diverged: infeasible regions of the convex-potential parameter space, and the training trajectory stays inside the feasibility envelope. (b) Held-out test loss versus iteration; dashed gray line marks the published reference from Huang et al. ([2021](https://arxiv.org/html/2605.24274#bib.bib11)). The direct softplus reproduces the published number, and the lift improves on it: the per-block strong-convexification of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) compounds across depth without modifying the change-of-variables likelihood or the log-det estimator.

### 5.3 The same trajectory traces a valley in lifted space and a plateau in constrained space

The lift is a coordinate change. The same training trajectory has two coexisting renderings (one in the constrained ICNN/PICNN $\theta$-space that direct softplus optimizes, one in the lifted $(\bm{\phi},\bm{b})$-space that the hypernet optimizes), and side-by-side comparison localizes [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1)’s landscape smoothing on the surface that gradient descent actually walks. [Figure 17](https://arxiv.org/html/2605.24274#S5.F17) arranges two representative ICNN-EBM problems as rows (one-dimensional Gumbel, two-dimensional gamma-mode) and the two coordinate systems as columns.

##### Landscape visualization method\.

All loss-landscape figures in this section evaluate the test loss on a two-dimensional slice of parameter space centered at the converged hypernet. The slice plane is chosen adaptively per figure to expose the gap between the methods. In constrained parameter space, the first axis is the converged direct-minus-hypernet end-difference and the second is the top trajectory-PCA direction orthogonal to it. In lifted parameter space, both axes are the top two trajectory-PCA components of the lifted hypernet snapshots. Filter-normalization follows Li et al. ([2018](https://arxiv.org/html/2605.24274#bib.bib35)). A strict random-direction slice in the same neighborhood looks bowl-shaped because two random filter-norm directions almost surely miss the narrow channel separating the methods in non-trivial parameter spaces. The converged hypernet sits at the origin (gold star) in every panel.
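The axis construction described above can be sketched on flattened parameter vectors. This is one possible reading under stated assumptions: "top trajectory-PCA direction orthogonal to it" is implemented by projecting the end-difference out of the centered trajectory before taking the leading principal direction, filter-normalization is omitted, and the function names and the toy `loss_fn` are hypothetical.

```python
import numpy as np

def constrained_slice_axes(theta_direct, theta_hyper, trajectory):
    """Axis 1: direct-minus-hypernet end-difference.
    Axis 2: leading PCA direction of the trajectory after removing axis 1."""
    u = theta_direct - theta_hyper
    u = u / np.linalg.norm(u)
    centered = trajectory - trajectory.mean(axis=0)
    residual = centered - np.outer(centered @ u, u)   # project out axis 1
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    v = vt[0] - (vt[0] @ u) * u                       # guard against round-off
    return u, v / np.linalg.norm(v)

def lifted_slice_axes(snapshots):
    """Top two PCA components of the lifted hypernet snapshots."""
    centered = snapshots - snapshots.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0], vt[1]

def evaluate_slice(loss_fn, center, u, v, span=1.0, n=21):
    """Loss on an n x n grid of offsets a*u + b*v around `center`."""
    offsets = np.linspace(-span, span, n)
    return np.array([[loss_fn(center + a * u + b * v) for b in offsets]
                     for a in offsets])

rng = np.random.default_rng(0)
trajectory = rng.standard_normal((50, 10)).cumsum(axis=0)   # stand-in snapshots
theta_hyper, theta_direct = trajectory[-1], rng.standard_normal(10)
u, v = constrained_slice_axes(theta_direct, theta_hyper, trajectory)
grid = evaluate_slice(lambda th: float(np.sum((th - theta_hyper) ** 2)),
                      theta_hyper, u, v)
```

Centering the grid at the converged hypernet places it at the origin of every panel, matching the gold-star convention in the figures.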

The reading is consistent across both rows. The constrained-space panel pins the direct softplus trajectory to a flat basin while the hypernet trajectory descends to a converged point at substantially lower loss; the lifted-space panel renders that same trajectory as a smooth descent down a single contiguous valley. The loss-versus-iteration companion, [Figure 18](https://arxiv.org/html/2605.24274#S5.F18), shows the lift converging to a lower loss than direct softplus on both rows.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x20.png)

Figure 17: The lift reshapes a plateau-bounded constrained-space surface into a valley-descending lifted-space surface. Same training trajectory viewed in two parameter spaces (columns) across two problems (rows: one-dimensional Gumbel and two-dimensional gamma-mode ICNN-EBM). Left column: constrained parameter space. Right column: lifted parameter space. Hypernet (solid squares) and direct softplus (dashed open circles) trajectories; converged hypernet at the origin (gold star). Slice planes constructed by the adaptive scheme of [Section 5.3](https://arxiv.org/html/2605.24274#S5.SS3). The loss-vs-iter companion is [Figure 18](https://arxiv.org/html/2605.24274#S5.F18).

![Refer to caption](https://arxiv.org/html/2605.24274v1/x21.png)

Figure 18: The lift’s lower converged loss holds across both problems. Loss-space companion to [Figure 17](https://arxiv.org/html/2605.24274#S5.F17): one panel per problem (one-dimensional Gumbel and two-dimensional gamma-mode ICNN-EBM), held-out validation loss on log-log axes, late-training inset on linear y. On both rows the hypernet descends to a lower converged loss than direct softplus, matching the basin/plateau geometry of the corresponding rows in [Figure 17](https://arxiv.org/html/2605.24274#S5.F17).

## 6 Is the lift necessary?

Three structural rejoinders threaten the lift’s contribution. (i) Plain ADMM-with-positivity could enforce the non-negativity constraint without a data-conditioned body, in which case the body provides no advantage above the projection-enforcement alternative. (ii) The PGD baseline could match the lift’s test-time metric without the body’s batch-conditioning machinery, in which case ingredient (ii) of [Section 4](https://arxiv.org/html/2605.24274#S4) is not load-bearing. (iii) The hypernet’s body could be doing nothing more than over-parametrizing the search, in which case widening the direct softplus to the hypernet’s parameter count should recover the lift’s metric. [Sections 6.1](https://arxiv.org/html/2605.24274#S6.SS1), [6.2](https://arxiv.org/html/2605.24274#S6.SS2) and [6.3](https://arxiv.org/html/2605.24274#S6.SS3) test each directly.

### 6.1 Plain ADMM-with-positivity does not close the gap

If the lift is merely an ADMM consensus reformulation ([Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2)), why introduce a body at all? Standard ADMM-with-positivity, that is, $\min\mathcal{L}(\bm{\theta})+(\rho/2)\|\bm{\theta}-\bm{z}+\bm{y}/\rho\|^{2}$ with $\bm{z}\succeq\bm{0}$, closed-form prox $\bm{z}\leftarrow\max(\bm{\theta}+\bm{y}/\rho,0)$, and dual step $\bm{y}\leftarrow\bm{y}+\rho(\bm{\theta}-\bm{z})$, requires no body and inherits the $O(1/k)$ rate under per-block convexity (Boyd et al., [2011](https://arxiv.org/html/2605.24274#bib.bib4); He and Yuan, [2012](https://arxiv.org/html/2605.24274#bib.bib8)). While the constraint is enforced, the optimization pathology is left untouched: the failure mode of direct-softplus training is not infeasibility but a diffusively slow escape on the readout shoulder. The primal variable $\bm{\theta}$ has no data-conditioned reparametrization, so $\mathrm{Var}_{\bm{X}}[\tilde{\bm{\theta}}]=0$ and ingredient (ii) of [Section 4](https://arxiv.org/html/2605.24274#S4) is identically zero. The empirical version of this argument is the PGD baseline of [Section 6.2](https://arxiv.org/html/2605.24274#S6.SS2), the $\rho\to\infty$ stiff-penalty limit of plain ADMM. The lift’s contribution above plain ADMM is precisely ingredient (ii), and ([4](https://arxiv.org/html/2605.24274#S3.E4)) is the diagnostic that measures it.
Empirically, every stable $\rho$ schedule of plain ADMM-with-positivity we tried, namely fixed $\rho$ (Boyd et al., [2011](https://arxiv.org/html/2605.24274#bib.bib4)), residual-balance auto-$\rho$ (He et al., [2000](https://arxiv.org/html/2605.24274#bib.bib58); Boyd et al., [2011](https://arxiv.org/html/2605.24274#bib.bib4); Wohlberg, [2017](https://arxiv.org/html/2605.24274#bib.bib59)), and $\rho$-doubling toward stiff (Bertsekas, [1999](https://arxiv.org/html/2605.24274#bib.bib48)), plateaus several-fold above PGD ([Figure 19](https://arxiv.org/html/2605.24274#S6.F19)). The lift sits several-fold below the best stable plain-ADMM cell at the same compute budget. Sophisticated ADMM accelerators that adapt $\rho$ via Barzilai–Borwein dual steps and line search (Zarepisheh et al., [2018](https://arxiv.org/html/2605.24274#bib.bib60)) attack the same $\rho$-sensitivity issue; the Barzilai–Borwein component is a drop-in replacement for the dual update and would behave qualitatively similarly to our residual-balance schedule, since both adapt $\rho$ from the primal–dual residual ratio. The line-search component is not deep-learning-compatible: each candidate dual step requires re-solving the inner primal optimization (re-training the network) at every candidate $\bm{y}$, which is prohibitively expensive for neural-network training. The structural argument therefore survives any acceleration of plain ADMM: ingredient (ii) is not what ADMM-with-positivity provides at any $\rho$ schedule, however adaptively chosen.
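The three ADMM-with-positivity updates quoted above can be traced on a convex toy loss. The sketch assumes $\mathcal{L}(\bm{\theta})=\tfrac{1}{2}\|\bm{\theta}-\bm{c}\|^{2}$, for which the primal step has a closed form and the constrained minimizer is $\max(\bm{c},0)$; it is not the ICNN training loop, where the primal step is an inexact Adam solve on a non-convex loss. The function name `admm_positivity` and the vector `c` are illustrative.

```python
import numpy as np

def admm_positivity(c, rho=1.0, n_iter=200):
    """ADMM for  min_theta 0.5 * ||theta - c||^2  subject to theta >= 0.

    Follows the update order in the text: primal step, prox, dual step.
    """
    theta = np.zeros_like(c)
    z = np.zeros_like(c)
    y = np.zeros_like(c)
    for _ in range(n_iter):
        # Primal step: minimize 0.5||theta - c||^2 + (rho/2)||theta - z + y/rho||^2.
        theta = (c + rho * z - y) / (1.0 + rho)
        # Closed-form prox: projection onto the non-negative cone.
        z = np.maximum(theta + y / rho, 0.0)
        # Dual step on the consensus residual theta - z.
        y = y + rho * (theta - z)
    return theta, z, y

c = np.array([1.5, -2.0, 0.3, -0.1])
theta, z, y = admm_positivity(c)
print(z)   # the projected iterate, compared against np.maximum(c, 0)
```

Nothing in these updates depends on the data batch beyond the loss gradient itself, which is the sense in which the primal variable carries no data-conditioned reparametrization.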

The standard rescue for non-convex ADMM is to move toward Boyd’s exact-solve regime by taking many Adam steps per outer iteration on the augmented Lagrangian. This does not help here either, for a structural reason: the forward-KL loss is not bounded below on the off-cone region of $\tilde{\bm{\theta}}$, because the importance-sampling proposal for $\log Z$ becomes uncovered when the ICNN energy is improper, so the inner subproblem $\min_{\tilde{\bm{\theta}}}\mathcal{L}(\tilde{\bm{\theta}})+(\rho/2)\|\tilde{\bm{\theta}}-\bm{z}+\bm{y}/\rho\|^{2}$ has no finite minimum. Additional inner steps only extend the off-cone drift more aggressively. PGD avoids this pathology because it carries no penalty term: it is one Adam step on $\mathcal{L}(\bm{\theta})$ followed by a hard projection, with no dual variable to track and no approximate primal solve to accumulate error.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x22.png)

Figure 19: Plain ADMM-with-positivity does not close the lift-vs-PGD gap. Three stable schedules of textbook ADMM-with-positivity (fixed $\rho=10$, residual-balance auto-$\rho$, and $\rho$-doubling toward stiff) trained on the same ICNN architecture, budget, and forward-KL objective as the hypernet, direct softplus, and PGD baselines of [Section 6.2](https://arxiv.org/html/2605.24274#S6.SS2); test loss is read on the projected iterate $\max(\tilde{\bm{\theta}},0)$ at each method’s lowest-validation-loss checkpoint. All three ADMM schedules plateau several-fold above PGD (green dashed line), and the hypernet sits several-fold below the best of them: the data-conditioned body, not the constraint-enforcement mechanism, is the load-bearing source of the lift’s advantage over ADMM-with-positivity (ingredient (ii) of [Section 4](https://arxiv.org/html/2605.24274#S4)).

### 6.2 Projection alone misses the lift’s margin

PGD (Bertsekas, [1999](https://arxiv.org/html/2605.24274#bib.bib48); Amos et al., [2017](https://arxiv.org/html/2605.24274#bib.bib1)) takes an unconstrained step on $\bm{\theta}$ followed by the projection $\bm{\theta}\leftarrow\max(\bm{\theta},0)$, avoids the readout shoulder by construction (no $\psi$ reparametrization), and is the $\rho\to\infty$ stiff-penalty limit of plain ADMM-with-positivity of [Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2). It shares with ADMM-with-positivity an identity-Jacobian path through the constraint and the absence of a data-conditioned body, so the verdict on PGD is the verdict on whether the body’s batch-conditioning is structurally necessary above the projection alternative.
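The step-then-project recipe is short enough to write out. The sketch below uses plain gradient descent on a toy quadratic loss rather than the Adam step on the ICNN loss used in the experiments; `pgd_nonneg`, the learning rate, and the vector `c` are illustrative assumptions.

```python
import numpy as np

def pgd_nonneg(grad, theta0, lr=0.1, n_iter=500):
    """Projected gradient descent onto the non-negative cone."""
    theta = np.maximum(np.asarray(theta0, dtype=float), 0.0)
    for _ in range(n_iter):
        theta = theta - lr * grad(theta)   # unconstrained step on the loss
        theta = np.maximum(theta, 0.0)     # hard projection, no dual variable
    return theta

# Toy loss 0.5 * ||theta - c||^2, whose constrained minimizer is max(c, 0).
c = np.array([1.5, -2.0, 0.3, -0.1])
theta_pgd = pgd_nonneg(lambda th: th - c, np.zeros_like(c))
print(theta_pgd)
```

The gradient reaches $\bm{\theta}$ directly rather than through a readout $\psi$, which is the identity-Jacobian path referred to above.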

On the one-dimensional Gumbel target, PGD reaches the hypernet’s TV but returns a structurally zero reading on the cross-covariance probe of [Section 5.1.1](https://arxiv.org/html/2605.24274#S5.SS1.SSS1), since PGD has neither slack nor data-conditioned body. PGD and the lift are therefore operationally equivalent in TV at $d=1$, but the cross-covariance diagnostic separates them: the lift’s contribution is the structural reading, not the test-time TV.

The question becomes empirical at higher dimension: does the smooth-autodiff alternative reach the lift’s loss? On the 21-dimensional tabular benchmark (single seed; [Figure 20](https://arxiv.org/html/2605.24274#S6.F20)), the three methods separate by orders of magnitude: direct softplus is trapped on the readout shoulder and fails to converge at this budget, PGD escapes the cone boundary, and the hypernet sits below both. The lift’s margin above PGD, a $4.2$-nat gap read off [Figure 20](https://arxiv.org/html/2605.24274#S6.F20)(c), is the landscape-smoothing benefit of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1); PGD escapes the cone but does not deliver it.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x23.png)

\(a\) Direct\-anchored slice\.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x24.png)

\(b\) Loss\-vs\-iter trace\.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x25.png)

\(c\) Final test loss \(log\-x\)\.

Figure 20: Three positivity recipes on a 21-dimensional tabular target. Same training run as [Figure 1](https://arxiv.org/html/2605.24274#S0.F1), with the slice in (a) anchored at the converged direct softplus so its trapped trajectory is in-plane. (a) Direct-anchored landscape slice. (b) Held-out validation loss versus iteration on the same run. (c) Final test loss on log-x; single-Gaussian shown as a dashed reference. The hypernet sits at the origin (gold star) of (a) and at the lowest loss in (c); direct softplus is pinned on the readout shoulder; PGD's converged point lies off-axis. The PGD-anchored companion slice is [Figure 1](https://arxiv.org/html/2605.24274#S0.F1)(a).

### 6.3 Capacity vs conditioning: widened-direct ablation

A natural rejoinder is that the hypernet's body has many more parameters than the ICNN it emits, so the advantage could be raw over-parametrization rather than the slack-plus-body decomposition. We test this directly by widening the direct ICNN to match the hypernet's parameter count, training under forward-KL on three seeds, otherwise identical to the headline configuration. The widened direct softplus collapses on both targets ([Figure 21](https://arxiv.org/html/2605.24274#S6.F21)): on a one-dimensional Gumbel target it saturates at the worst possible total-variation distance, never learning a normalized density; on a six-dimensional tabular target it diverges to many orders of magnitude in test loss. The hypernet at the same parameter budget matches its narrow-width performance on both targets. Widening the direct $\bm{\theta}$-space does not close the gap: the saddles of the readout shoulder are invariant under widening, only the dimension of the space they live in changes. The lift's advantage is the optimization landscape opened by the slack and the body, not the parameter count.

![Refer to caption](https://arxiv.org/html/2605.24274v1/x26.png)

Figure 21: Widening the direct $\bm{\theta}$-space to the hypernet parameter count does not rescue it. Per-seed final metric on two representative targets (one-dimensional Gumbel and six-dimensional UCI POWER, three seeds each, forward-KL-direct, otherwise identical to the headline configuration), with the matched-capacity direct softplus ($\sim 10^{6}$ parameters at $\mathrm{hidden}_{\mathrm{dim}} = 512$, $n_{\mathrm{layers}} = 5$) against the hypernet at the same parameter budget; single-Gaussian dashed for reference. Left: one-dimensional Gumbel TV ($\downarrow$); the matched-capacity direct saturates at TV $\approx 1$ (never learns a normalized density) while the hypernet stays at TV $\approx 0.04$. Right: UCI POWER test loss ($\downarrow$); direct diverges to $10^{4}$–$10^{5}$ nat while the hypernet sits at $\approx 3.1$ nat. The hypernet matches its narrow-width performance at the higher budget, so the gap is not closed by raw capacity: it is the conditioning landscape, not the parameter count.

## 7 Discussion

The positivity constraint on the inter\-layer weights of an ICNN is what makes the architecture work: without it, the network ceases to be a convex function of its input, and the downstream applications—log\-concave density estimation, convex potential flows, optimal transport, transport\-map inversion—lose their structural guarantees\. However, neither existing recipe for enforcing this constraint reshapes the optimization landscape the way the lift does\. Projected gradient descent onto the non\-negative cone is the standard ICNN recipe and escapes the cone boundary efficiently, but its hard, non\-smooth projection is the stiff\-penalty limit of an ADMM\-style constraint splitting whose classical convergence guarantees do not transfer to the non\-smooth ICNN landscape, and it delivers no landscape smoothing; softplus reparametrization installs a chain\-rule prefactor that vanishes on an extended region of parameter space and exponentially attenuates the gradient there\. The lift improves on both by routing the constraint through a slack\-plus\-hypernetwork emission, opening a batch\-stochastic channel into the pre\-readout iterate that survives the readout prefactor \([Theorem˜1](https://arxiv.org/html/2605.24274#Thmtheorem1)\)\.
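To make the slack-plus-hypernetwork emission concrete, the sketch below computes $\bm{\theta} = \psi(\bm{b} + h_{\bm{\phi}}(\bm{X}))$ for a toy batch. It is an illustration under stated assumptions, not the authors' architecture: the per-feature mean as the permutation-invariant summary, the linear body, and all names and numbers are hypothetical.

```python
import math
import random

def softplus(w):
    """Numerically stable softplus readout, log(1 + exp(w))."""
    return max(w, 0.0) + math.log1p(math.exp(-abs(w)))

def batch_summary(batch):
    """Permutation-invariant summary of the batch.
    Assumption: a per-feature mean; the paper's summary may differ."""
    n = len(batch)
    return [sum(row[j] for row in batch) / n for j in range(len(batch[0]))]

def emit_weights(bias, body, batch):
    """Emit non-negative weights theta = softplus(b + h_phi(X)).
    Assumption: h_phi is a single linear map applied to the batch summary."""
    s = batch_summary(batch)
    pre_readout = [b_i + sum(w * s_j for w, s_j in zip(row, s))
                   for b_i, row in zip(bias, body)]
    return [softplus(p) for p in pre_readout]

rng = random.Random(0)
bias = [0.1, -0.3]                              # learnable slack b (unconstrained)
body = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]     # hypothetical body parameters phi
batch = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
resampled = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]

theta = emit_weights(bias, body, batch)
theta_reordered = emit_weights(bias, body, batch[::-1])  # same batch, new order
theta_resampled = emit_weights(bias, body, resampled)    # a different batch
```

Both `bias` and `body` are unconstrained, yet every emitted weight is non-negative. Reordering the batch leaves the emission unchanged up to floating-point rounding, while resampling the batch moves it, which is the batch-stochastic channel the discussion refers to.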

##### Differences from extended\-source FWI\.

The lift's structural antecedent is the parameter-extension family of full-waveform inversion (Symes, [2008](https://arxiv.org/html/2605.24274#bib.bib28); Symes et al., [2020](https://arxiv.org/html/2605.24274#bib.bib49); van Leeuwen and Herrmann, [2013](https://arxiv.org/html/2605.24274#bib.bib29), [2016](https://arxiv.org/html/2605.24274#bib.bib40); Aghamiry et al., [2019](https://arxiv.org/html/2605.24274#bib.bib50); Siahkoohi et al., [2026](https://arxiv.org/html/2605.24274#bib.bib27)): a point source is lifted to a generalized source field, the data-fit objective is augmented by a wave-equation residual penalty, and the lifted problem becomes amenable to gradient descent where the unlifted variant is non-convex through cycle skipping. The structure is identical (a constraint lifted into a higher-dimensional space reappears as a consensus penalty), and the lift here inherits the favorable conditioning of those parameter-extension formulations. The lift belongs to a broader family of methods that recast a hard nonconvex problem as a tractable convex one: functional lifting convexifies nonconvex variational problems in imaging by lifting them to a higher dimension (Pock et al., [2010](https://arxiv.org/html/2605.24274#bib.bib41); Vogt et al., [2020](https://arxiv.org/html/2605.24274#bib.bib42)), and lossless convexification and extended convex lifting do the same in optimal and robust control (Açıkmeşe and Blackmore, [2011](https://arxiv.org/html/2605.24274#bib.bib44); Zheng et al., [2026](https://arxiv.org/html/2605.24274#bib.bib43)). Those methods reach exact convexity; the lift here trades dimension for a smoother landscape, not a convex one: the ICNN training loss remains non-convex. The two settings differ in the driving mechanism: *deterministic* landscape smoothing for extended-source FWI versus *data-driven Fokker–Planck escape* for the ICNN lift. A transfer test (an ICNN-style hypernet-plus-bias lift on an FWI cycle-skipping problem) is left to future work.

##### Differences from NTK/lazy/mean\-field analyses\.

The lift does not fit cleanly into the neural tangent kernel (NTK) (Jacot et al., [2018](https://arxiv.org/html/2605.24274#bib.bib13)), lazy (Chizat et al., [2019](https://arxiv.org/html/2605.24274#bib.bib6)), or mean-field (Mei et al., [2018](https://arxiv.org/html/2605.24274#bib.bib19)) regimes. The body's Jacobian is genuinely data-dependent and finite-width; the slack's identity Jacobian is constant by construction rather than by Taylor approximation around the initialization; and the body's scaling is fixed by the hypernet architecture rather than by the $1/\sqrt{N}$ feature-learning law. The closest analogue is the regularization-driven separation of Wei et al. ([2019](https://arxiv.org/html/2605.24274#bib.bib31)), which shows finite-width regularized networks can be polynomially more sample-efficient than their NTK kernel limit; this concerns feature learning, not constraint traversal. A prefactor-precise theory of the lift's diffusive escape would combine ADMM-style consensus optimization with stochastic-gradient analysis; neither alone explains the $\sigma_{\mathrm{Jac}}^{-2}$ scaling of [Section 5.1.2](https://arxiv.org/html/2605.24274#S5.SS1.SSS2).

##### Reach to other ICNN paradigms\.

The cross-covariance mechanism is loss-agnostic at the structural level ([Remark 2](https://arxiv.org/html/2605.24274#Thmremark2)): only the gradient $\bm{g}$ picks up the loss, while the slack's identity Jacobian and the body's batch-induced fluctuation are properties of the lift's architecture. The convex-potential-flow application is exercised directly ([Section 5.2](https://arxiv.org/html/2605.24274#S5.SS2)); the same structural argument applies to PCP-Map transport-map estimation (Wang et al., [2025](https://arxiv.org/html/2605.24274#bib.bib16); El Moselhy and Marzouk, [2012](https://arxiv.org/html/2605.24274#bib.bib47); Spantini et al., [2018](https://arxiv.org/html/2605.24274#bib.bib53); Baptista et al., [2024](https://arxiv.org/html/2605.24274#bib.bib54)), ICNN-parametrized optimal transport (Makkuva et al., [2020](https://arxiv.org/html/2605.24274#bib.bib17); Korotin et al., [2021](https://arxiv.org/html/2605.24274#bib.bib14)), and score matching for log-concave EBMs (Vincent, [2011](https://arxiv.org/html/2605.24274#bib.bib30)). Empirical adjudication on those targets is left to future work, but the loss-agnosticism makes the structural transfer routine: what changes across applications is the gradient noise structure, not the slack-channel decomposition.

##### What the lift does not do\.

The framework makes structural claims at the per-instance cross-covariance level. The ADMM-style reading of [Section 4.2](https://arxiv.org/html/2605.24274#S4.SS2) is a structural interpretation, not an algorithm we run: there is no explicit dual update on $\bm{b}$, and the classical rate theorems of ADMM do not transfer to the non-convex ICNN setting. The lift is correctly silent in the deterministic-readout regime of [Remark 5](https://arxiv.org/html/2605.24274#Thmremark5), where freezing the conditioning batch to a fixed anchor shuts off the extra noise channel (the resampled-batch fluctuation of the emitted weights) and collapses $\sigma_{\mathrm{Jac}}^{2}$ to zero, a negative result that validates the scope rather than contradicting it. The smooth-autodiff PGD alternative reaches the cone boundary efficiently but does not deliver the landscape smoothing of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1); the empirical 4.2-nat gap above PGD is the operating-point signature of that distinction. A learnable per-iterate adaptation of the conditioning-batch dimension $n$ ([Remark 3](https://arxiv.org/html/2605.24274#Thmremark3)), tying the smoothing strength to the trajectory's distance from the shoulder, would tighten the mechanism but is outside the present scope.

## 8 Conclusion

Input\-convex neural networks demand non\-negative inter\-layer weights, and current approaches to enforcing this constraint each leave the optimization landscape unsmoothed: projected gradient descent, the standard ICNN recipe, escapes the cone boundary but applies a hard, non\-smooth projection—the stiff\-penalty limit of an ADMM\-style constraint splitting—whose classical convergence guarantees do not transfer to the non\-smooth ICNN landscape, and softplus reparametrization installs a chain\-rule prefactor that vanishes on an extended region of parameter space and traps SGD on a time scale exponential in the inverse noise level\. Theliftreplaces the constrained weight by an unconstrained hypernetwork emission summed with a learnable slack bias—a split\-variable reparametrization that adds an extra source of stochasticity to the training dynamics, the resampled\-batch fluctuation of the emitted weights, which the readout attenuation does not suppress and which softens the loss landscape\. Three structural ingredients decompose its conditioning advantage: an identity\-Jacobianslack, a data\-conditionedbody, and a non\-vanishingcross\-covariancecoupling them through batch stochasticity\. Each is necessary \([Theorem˜1](https://arxiv.org/html/2605.24274#Thmtheorem1)\); the cross\-covariance acts as an implicit strong\-convexification of the loss landscape \([Lemma˜1](https://arxiv.org/html/2605.24274#Thmlemma1)\)\. On log\-concave EBM training across one\-, two\-, six\-, and 32\-dimensional targets, and on convex\-potential normalizing flows on a 21\-dimensional tabular benchmark, the lift descends to a lower test loss than both projected gradient descent and direct softplus, and converts a plateau\-bounded training trajectory into a valley\-descending one\. The four\-architecture cross\-covariance ablation isolates each ingredient empirically, and the widened\-direct ablation shows the advantage cannot be closed by raw capacity\. 
The natural next step is the deterministic\-readout regime in which[Remark˜5](https://arxiv.org/html/2605.24274#Thmremark5)predicts the lift to fall silent—a structural test the framework predicts in advance, and that no current ICNN application directly enters\.

## Acknowledgments

AS and AT acknowledge support from the Institute for Artificial Intelligence at the University of Central Florida\. AS thanks Felix J\. Herrmann, whose long\-standing intuition that the parameter\-extension methods of full\-waveform inversion \([Section˜7](https://arxiv.org/html/2605.24274#S7)\) might cross into deep learning first gave this paper its lift, and Maarten V\. de Hoop, whose deep insight, sustained support, and faith in the direction kept it aloft\.

## Appendix A Proofs

This appendix collects the full statements and proofs of the theorem, lemma, and corollary stated succinctly in[Section˜4](https://arxiv.org/html/2605.24274#S4)and its subsections\. Each subsection title names the result it proves, and gives the complete formal statement—with all regularity conditions—immediately before its proof\.

### Regularity assumptions

The formal results of [Section 4](https://arxiv.org/html/2605.24274#S4) hold under the following load-bearing hypotheses.

- (A1) SDE-of-SGD with $\eta$ small, i.i.d. batches, and gradient-noise covariance Lipschitz in $\bm{\phi}$.
- (A2) $\psi \in C^{1}(\mathbb{R})$ monotone non-decreasing, and on the operative shoulder region its derivative is approximately constant across coordinates, $\psi'(\tilde{\bm{\theta}}) \approx \sigma_{s}\,\bm{I}_{d}$ with $\sigma_{s} = \psi'(\tilde{w}_{s})$ (a single-prefactor idealization of the shoulder; justified because the operative region is a narrow band of the softplus shoulder over which $\psi'$ varies slowly).
- (A3) The slack Jacobian $\partial\tilde{\bm{\theta}}/\partial\bm{b} = \bm{I}_{d}$ has full rank.
- (A4) On the operative region the forward-KL gradient fluctuation is, to leading order, the expected loss Hessian acting on the iterate fluctuation, $\delta\bm{g} = \bm{H}_{\tilde{\bm{\theta}}}\,\delta\tilde{\bm{\theta}} + \bm{r}$ with $\bm{H}_{\tilde{\bm{\theta}}} \equiv \mathbb{E}_{\bm{X}}[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}] \succeq \bm{0}$ symmetric positive semi-definite (PSD) and $\bm{r}$ a remainder of order $\|\delta\tilde{\bm{\theta}}\|^{2}$ (justified because $\bm{g} = \nabla_{\tilde{\bm{\theta}}}\mathcal{L}$, so two iterate-driven perturbations of $\bm{g}$ are coupled through $\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}$, which is PSD near a local minimum of the locally convex forward-KL population loss of a log-concave target).
- (A5) For [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1) only, the one-dimensional bias-channel projection of $\tilde{\mathcal{L}}$ has a $C^{2}$ single-barrier potential and the effective noise is small relative to the barrier (the metastable regularity Kramers' asymptotics require).

[Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1) uses only (A1); [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) uses (A1), (A3), (A4); [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1) uses (A1)–(A5).

### A.1 Each structural ingredient is necessary for the cross-covariance estimator to be nonzero

Statement (restatement of [Theorem 1](https://arxiv.org/html/2605.24274#Thmtheorem1)). Let $\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)}$ denote the slack-channel cross-covariance estimator

$$
\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)} \;=\; \frac{1}{T}\sum_{s=t-T}^{t-1}\delta\tilde{\bm{\theta}}^{(s)}\,(\delta\bm{g}^{(s)})^{\top}.
$$

If any one of

1. (i) the slack channel $\bm{b}$ is absent ($\partial\tilde{\bm{\theta}}/\partial\bm{b} \equiv \bm{0}$);
2. (ii) the body $h_{\bm{\phi}}(\bm{X})$ is constant in $\bm{X}$ ($\delta\tilde{\bm{\theta}} \equiv 0$);
3. (iii) $\bm{g}$ and $\tilde{\bm{\theta}}$ are independent conditional on $\bm{\phi}$, with i.i.d. batches (A1);

holds, then the slack-channel reading of the estimator vanishes: under (i) the contraction of the estimator with the slack Jacobian is identically zero, under (ii) the estimator itself is identically zero, both for every $t$ and every window length $T$, and under (iii) the population cross-covariance is zero and the estimator is its unbiased, $O(T^{-1/2})$-consistent sample version.
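The estimator in the statement is a trailing-window sample cross-covariance, which can be sketched directly. The code below is a minimal pure-Python illustration, assuming the fluctuations $\delta(\cdot)$ are formed by subtracting the window mean, as the proof's centering construction suggests; the function name and the toy inputs are hypothetical.

```python
def cross_cov(thetas, grads):
    """Windowed cross-covariance (1/T) * sum_s dtheta_s dgrad_s^T.
    thetas, grads: lists of T vectors (the pre-readout iterates and
    gradients over the trailing window), centered on their window means."""
    T = len(thetas)
    d, m = len(thetas[0]), len(grads[0])
    th_bar = [sum(th[i] for th in thetas) / T for i in range(d)]
    g_bar = [sum(g[j] for g in grads) / T for j in range(m)]
    return [[sum((th[i] - th_bar[i]) * (g[j] - g_bar[j])
                 for th, g in zip(thetas, grads)) / T
             for j in range(m)]
            for i in range(d)]

# Deletion (ii): a body constant in X emits the same iterate for every batch,
# so every fluctuation is zero and the estimator vanishes for any gradients.
frozen = cross_cov([[0.5, -1.25]] * 6,
                   [[float(s), 1.0 - s] for s in range(6)])

# A toy coupled window, where the gradient is exactly twice the iterate.
coupled = cross_cov([[0.0], [1.0], [2.0], [3.0]],
                    [[0.0], [2.0], [4.0], [6.0]])
```

Here `frozen` is the all-zero 2x2 matrix regardless of how the gradients vary, while `coupled` is the 1x1 matrix `[[2.5]]`, twice the window variance of the iterate.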

Proof. The estimator $\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)}$ is the empirical second moment of the pair $(\delta\tilde{\bm{\theta}}^{(s)}, \delta\bm{g}^{(s)})$ over the trailing window, and its slack-channel reading is the contraction with the slack Jacobian $\bm{J}_{\bm{b}} \equiv \partial\tilde{\bm{\theta}}/\partial\bm{b}$. We do not claim a single product factorization of the estimator: each of the three deletions zeros it by a distinct mechanism, and we treat them separately. The statement places no condition on the readout $\psi$ beyond its presence; (A2) is used downstream in [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) but is not needed here, since each deletion zeros the estimator before any property of $\psi$ is invoked.

Case (i) (slack channel absent: slack-channel reading is identically zero). Under deletion of the slack channel, the slack Jacobian satisfies $\bm{J}_{\bm{b}} \equiv \bm{0}$, and the slack-channel reading of the estimator is the contraction

$$
\bm{J}_{\bm{b}}^{\top}\,\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)} \;=\; \frac{1}{T}\sum_{s=t-T}^{t-1}\bm{J}_{\bm{b}}^{\top}\,\delta\tilde{\bm{\theta}}^{(s)}\,(\delta\bm{g}^{(s)})^{\top} \;\equiv\; \bm{0},
$$

identically for every $t$ and every $T$, since each summand is pre-multiplied by the zero slack Jacobian.

Case (ii) (body constant in $\bm{X}$, identical zero). If $h_{\bm{\phi}}(\bm{X})$ is constant in $\bm{X}$, the per-batch fluctuation of ([3](https://arxiv.org/html/2605.24274#S3.E3)) vanishes:

$$
\delta\tilde{\bm{\theta}}^{(s)} \;=\; \delta h_{\bm{\phi}}(\bm{X}^{(s)}) \;\equiv\; \bm{0} \quad \text{a.s. over the trailing window.}
$$

Substituting into the estimator gives

$$
\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)} \;=\; \frac{1}{T}\sum_{s=t-T}^{t-1}\bm{0}\cdot(\delta\bm{g}^{(s)})^{\top} \;\equiv\; \bm{0},
$$

identically for every $t$ and every $T$. This deletion is realized architecturally by the direct-softplus parametrization, whose pre-readout iterate is independent of $\bm{X}$.

Case (iii) (independence of $\bm{g}$ and $\tilde{\bm{\theta}}$ conditional on $\bm{\phi}$, population zero). If $\bm{g}^{(s)} \perp\!\!\!\perp \tilde{\bm{\theta}}^{(s)} \mid \bm{\phi}$, then the centered fluctuations $\delta\bm{g}^{(s)}$ and $\delta\tilde{\bm{\theta}}^{(s)}$ are also conditionally independent, and the population cross-covariance factors:

$$
\bm{\Sigma}_{\mathrm{slack}} \;\equiv\; \mathbb{E}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top} \mid \bm{\phi}] \;=\; \mathbb{E}[\delta\tilde{\bm{\theta}} \mid \bm{\phi}]\,\mathbb{E}[\delta\bm{g}^{\top} \mid \bm{\phi}] \;=\; \bm{0},
$$

the last equality using $\mathbb{E}[\delta\tilde{\bm{\theta}} \mid \bm{\phi}] = \mathbb{E}[\delta\bm{g} \mid \bm{\phi}] = \bm{0}$ by the centering construction. The factorization is legitimate because, under (A1), conditional independence at fixed $s$ together with cross-iteration independence of the i.i.d. batches makes the whole window collection $\{\tilde{\bm{\theta}}^{(s)}\}$ independent of $\{\bm{g}^{(s)}\}$ given $\bm{\phi}$, so the trailing-window-centered fluctuations $\delta\tilde{\bm{\theta}}^{(s)}$ and $\delta\bm{g}^{(s)}$ (each a function of one of the two collections) remain conditionally independent. Unlike cases (i) and (ii), this does not make the finite-window estimator identically zero. The summands $\delta\tilde{\bm{\theta}}^{(s)}(\delta\bm{g}^{(s)})^{\top}$ are identically distributed with mean $\bm{0}$, but not independent across $s$, since each shares the common window means $\bar{\tilde{\bm{\theta}}}$ and $\bar{\bm{g}}$; subtracting those means, the standard sample-cross-covariance identity rewrites the estimator as the difference of an i.i.d. average and a mean-product term,

$$
\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)} \;=\; \frac{1}{T}\sum_{s=t-T}^{t-1}\delta\tilde{\bm{\theta}}^{(s)}\,(\delta\bm{g}^{(s)})^{\top} \;=\; \frac{1}{T}\sum_{s=t-T}^{t-1}\bigl(\tilde{\bm{\theta}}^{(s)}-\mathbb{E}[\tilde{\bm{\theta}} \mid \bm{\phi}]\bigr)\bigl(\bm{g}^{(s)}-\mathbb{E}[\bm{g} \mid \bm{\phi}]\bigr)^{\top} \;-\; \bar{\tilde{\bm{\theta}}}_{c}\,\bar{\bm{g}}_{c}^{\top},
$$

where $\bar{\tilde{\bm{\theta}}}_{c}, \bar{\bm{g}}_{c}$ are the window means of the $\mathbb{E}[\cdot \mid \bm{\phi}]$-centered fluctuations. The summands of the first average are genuinely i.i.d. matrix-valued with mean $\bm{0}$ (each depends on batch $s$ alone), so the estimator is unbiased, $\mathbb{E}[\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)}] = \bm{0}$, and by the strong law of large numbers the first average converges a.s. to $\bm{0}$ while the mean-product term is $O(T^{-1})$; hence

$$
\widehat{\bm{\Sigma}}_{\mathrm{slack}}^{(t)} \;\xrightarrow[T\to\infty]{\mathrm{a.s.}}\; \bm{0},
$$

with finite-$T$ fluctuations of order $O(T^{-1/2})$. Case (iii) thus zeros the population cross-covariance the estimator targets, and the estimator inherits this zero in expectation and in the large-window limit. ∎
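Case (iii) can be checked numerically on a scalar toy model. The sketch below is an illustration only: standard Gaussian draws stand in for the iterates and gradients, and the coupled comparison (gradient equal to twice the iterate plus noise) is a hypothetical contrast, not a model of ICNN training.

```python
import random

def scalar_cross_cov(xs, ys):
    """Window-mean-centered sample cross-covariance of two scalar series."""
    T = len(xs)
    x_bar, y_bar = sum(xs) / T, sum(ys) / T
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / T

rng = random.Random(0)

def independent_estimate(T):
    """Iterates and gradients drawn independently: population value is 0."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(T)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(T)]
    return scalar_cross_cov(xs, ys)

def coupled_estimate(T):
    """Gradient depends on the iterate: population value is 2."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(T)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return scalar_cross_cov(xs, ys)

small = independent_estimate(20000)   # near zero, but not exactly zero
large = coupled_estimate(20000)       # near the population value 2
```

With a window of 20000 samples the independent estimate has a standard deviation of roughly $1/\sqrt{20000} \approx 0.007$. It is therefore small but generically nonzero, in line with the distinction between a population zero and an identically zero estimator.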

### A.2 Implicit strong-convexification from data-conditioned Jacobian noise

Statement (restatement of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1)). Under the regularity assumptions (A1), (A3), (A4) of [Appendix A](https://arxiv.org/html/2605.24274#A1.SSx1), let $\tilde{\mathcal{L}}(\bm{\phi}) \equiv \mathbb{E}_{\bm{X}}[\mathcal{L}(\psi(\bm{b} + h_{\bm{\phi}}(\bm{X})))]$ denote the pullback forward-KL landscape on $\bm{\phi}$. In the small-noise expansion of the invariant measure of the Itô process

$$
d\bm{\phi} \;=\; -\nabla_{\bm{\phi}}\mathcal{L}\,dt \;+\; \bm{\Sigma}^{1/2}(\bm{\phi})\,d\bm{B}_{t},
$$

the second-order term in $\bm{\phi} - \bm{\phi}^{\star}$ around the converged $\bm{\phi}^{\star}$ is a quadratic form whose curvature $\bm{H}^{\star} \equiv \nabla^{2}_{\bm{\phi}}\tilde{\mathcal{L}}(\bm{\phi}^{\star})$ receives an additive contribution from the slack-channel cross-covariance: on the slack subspace,

$$
\mathrm{tr}\,\bm{H}^{\star} \;\geq\; \mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}\bigr] \;+\; d\,\mu_{\mathrm{eff}}, \qquad \mu_{\mathrm{eff}} \;=\; \Theta\!\bigl(\sigma_{\mathrm{Jac}}^{2}/(d\,\kappa^{2})\bigr),
$$

where $\kappa = \|\tilde{\bm{\theta}}\|_{\infty}$ is the typical readout scale, $d$ the slack dimension, and the inequality holds up to leading order in the iterate-fluctuation magnitude with an $O(\eta\,\sigma_{\mathrm{Jac}}^{2}/(d\,\kappa^{2}))$ Itô–Taylor truncation correction. The cross-covariance therefore adds a strongly-convex quadratic of per-dimension modulus $\mu_{\mathrm{eff}}$ to the landscape that SGD samples, without deforming $\psi$.

Proof. The argument proceeds in four steps: the small-$\eta$ ergodic expansion of the invariant measure, the quadratic expansion of the pullback landscape around $\bm{\phi}^{\star}$, the effective-Hessian decomposition that imports the cross-covariance, and the resulting added strong-convexity modulus.

In the small-learning-rate ergodic limit of Mandt et al. ([2017](https://arxiv.org/html/2605.24274#bib.bib37)); Li et al. ([2017](https://arxiv.org/html/2605.24274#bib.bib36)), the Itô process of the statement is ergodic with invariant measure

$$
\pi(\bm{\phi}) \;\propto\; \exp\!\bigl(-2\tilde{\mathcal{L}}(\bm{\phi})/\eta\bigr)\,\det\bm{\Sigma}(\bm{\phi})^{-1/2}\,\bigl(1+O(\eta)\bigr),
$$

under (A1)'s Lipschitz-covariance condition (which controls the Itô–Taylor remainder). The leading dependence on $\bm{\phi}$ is through the Boltzmann factor $\exp(-2\tilde{\mathcal{L}}(\bm{\phi})/\eta)$; the prefactor $\det\bm{\Sigma}(\bm{\phi})^{-1/2}$ contributes a sub-leading correction at $O(\eta)$ and does not affect the quadratic structure around $\bm{\phi}^{\star}$.

Taylor-expand the pullback landscape $\tilde{\mathcal{L}}(\bm{\phi}) \equiv \mathbb{E}_{\bm{X}}[\mathcal{L}(\psi(\bm{b} + h_{\bm{\phi}}(\bm{X})))]$ around the converged critical point $\bm{\phi}^{\star}$ to second order. The first-order term vanishes at the critical point, and the remainder is cubic:

$$
\tilde{\mathcal{L}}(\bm{\phi}) \;=\; \tilde{\mathcal{L}}(\bm{\phi}^{\star}) \;+\; \tfrac{1}{2}(\bm{\phi}-\bm{\phi}^{\star})^{\top}\,\bm{H}^{\star}\,(\bm{\phi}-\bm{\phi}^{\star}) \;+\; O\!\bigl(\|\bm{\phi}-\bm{\phi}^{\star}\|^{3}\bigr), \qquad \bm{H}^{\star} \equiv \nabla^{2}_{\bm{\phi}}\tilde{\mathcal{L}}(\bm{\phi}^{\star}).
$$

Around $\bm{\phi}^{\star}$ the invariant measure is therefore a Gaussian whose precision is set by $\bm{H}^{\star}$ (up to the $O(\eta)$ prefactor), so the quadratic form $\bm{H}^{\star}$ governs the curvature of the landscape that SGD samples from.
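As a worked step, the Gaussian approximation can be written out explicitly. This is a sketch under one added assumption not stated in the text, namely that $\bm{H}^{\star}$ is positive definite so that the Gaussian is normalizable. Substituting the quadratic expansion into the Boltzmann factor and dropping the cubic remainder and the $O(\eta)$ prefactor gives

$$
\pi(\bm{\phi}) \;\propto\; \exp\!\Bigl(-\tfrac{1}{\eta}\,(\bm{\phi}-\bm{\phi}^{\star})^{\top}\bm{H}^{\star}(\bm{\phi}-\bm{\phi}^{\star})\Bigr)
\quad\Longrightarrow\quad
\bm{\phi} \;\approx\; \mathcal{N}\!\Bigl(\bm{\phi}^{\star},\; \tfrac{\eta}{2}\,(\bm{H}^{\star})^{-1}\Bigr).
$$

The precision matrix is $2\bm{H}^{\star}/\eta$, so any curvature added to $\bm{H}^{\star}$ tightens the stationary spread of the iterates around $\bm{\phi}^{\star}$ in proportion.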

The Hessian $\bm{H}^{\star}$, a symmetric matrix, decomposes into a deterministic loss-curvature piece and a stochastic cross-covariance piece. Differentiating $\tilde{\mathcal{L}}$ twice and changing variables from $\bm{\phi}$ to $\tilde{\bm{\theta}} = \bm{b} + h_{\bm{\phi}}(\bm{X})$ via the body Jacobian, the curvature on the slack subspace has a Gauss–Newton-like contribution $\mathbb{E}_{\bm{X}}[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}]$ plus a fluctuation-induced contribution sourced by the slack-channel cross-covariance $\bm{\Sigma}_{\mathrm{slack}} = \mathbb{E}_{\bm{X}}[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}]$ of ([4](https://arxiv.org/html/2605.24274#S3.E4)). Because $\bm{H}^{\star}$ is symmetric, only the symmetric part of the cross-covariance contributes; with the body-Jacobian-to-parameter rescaling that converts $\tilde{\bm{\theta}}$-units to $\bm{\phi}$-units through the typical readout scale $\kappa = \|\tilde{\bm{\theta}}\|_{\infty}$, the decomposition reads

$$
\bm{H}^{\star} \;=\; \mathbb{E}_{\bm{X}}\!\bigl[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}\bigr] \;+\; \frac{1}{\kappa^{2}}\cdot\tfrac{1}{2}\bigl(\bm{\Sigma}_{\mathrm{slack}}+\bm{\Sigma}_{\mathrm{slack}}^{\top}\bigr).
$$

The cross-covariance $\bm{\Sigma}_{\mathrm{slack}}$ is itself not symmetric and not a priori sign-definite, so its symmetric part and its trace must be sign-controlled before any curvature claim. This is the role of (A4). Substituting the forward-KL chain $\delta\bm{g} = \bm{H}_{\tilde{\bm{\theta}}}\,\delta\tilde{\bm{\theta}} + \bm{r}$ with $\bm{H}_{\tilde{\bm{\theta}}} = \mathbb{E}_{\bm{X}}[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}] \succeq \bm{0}$ symmetric PSD,

$$
\bm{\Sigma}_{\mathrm{slack}} \;=\; \mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\delta\bm{g}^{\top}\bigr] \;=\; \mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\delta\tilde{\bm{\theta}}^{\top}\bigr]\,\bm{H}_{\tilde{\bm{\theta}}} \;+\; \mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\bm{r}^{\top}\bigr],
$$

and writing $\bm{V} \equiv \mathbb{E}_{\bm{X}}[\delta\tilde{\bm{\theta}}\,\delta\tilde{\bm{\theta}}^{\top}] \succeq \bm{0}$ for the iterate-fluctuation covariance, the symmetric part of the leading term is $\tfrac{1}{2}(\bm{V}\bm{H}_{\tilde{\bm{\theta}}} + \bm{H}_{\tilde{\bm{\theta}}}\bm{V})$. Its trace splits into a sign-controlled leading term and a subleading remainder:

$$
\sigma_{\mathrm{Jac}}^{2} \;=\; \mathrm{tr}\,\bm{\Sigma}_{\mathrm{slack}} \;=\; \mathrm{tr}\bigl(\bm{V}\,\bm{H}_{\tilde{\bm{\theta}}}\bigr) \;+\; \mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\delta\tilde{\bm{\theta}}\,\bm{r}^{\top}\bigr], \qquad \mathrm{tr}\bigl(\bm{V}\,\bm{H}_{\tilde{\bm{\theta}}}\bigr) \;\geq\; 0,
$$

the inequality holding because the trace of the product of the two PSD matrices $\bm{V}$ and $\bm{H}_{\tilde{\bm{\theta}}}$ is non-negative, and the remainder $\mathrm{tr}\,\mathbb{E}_{\bm{X}}[\delta\tilde{\bm{\theta}}\,\bm{r}^{\top}] = O(\mathbb{E}\|\delta\tilde{\bm{\theta}}\|^{3})$ being a higher-order central moment. So $\sigma_{\mathrm{Jac}}^{2} \geq 0$ to leading order in the iterate-fluctuation magnitude (a genuine non-negative quantity once the cubic remainder is dropped), and it is (A4), not (A3), that supplies its sign. ((A3), the full-rank slack Jacobian $\partial\tilde{\bm{\theta}}/\partial\bm{b} = \bm{I}_{d}$, is what makes the slack subspace the full $\mathbb{R}^{d}$ so that the added curvature is not confined to a proper subspace.)
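The non-negativity of the trace of a product of two PSD matrices follows from a standard argument, sketched here for completeness. The symmetric square root $\bm{V}^{1/2}$ and the standard basis vectors $\bm{e}_{i}$ are auxiliary objects introduced only for this step. By cyclicity of the trace,

$$
\mathrm{tr}\bigl(\bm{V}\,\bm{H}_{\tilde{\bm{\theta}}}\bigr)
\;=\; \mathrm{tr}\bigl(\bm{V}^{1/2}\,\bm{H}_{\tilde{\bm{\theta}}}\,\bm{V}^{1/2}\bigr)
\;=\; \sum_{i=1}^{d}\bigl(\bm{V}^{1/2}\bm{e}_{i}\bigr)^{\top}\bm{H}_{\tilde{\bm{\theta}}}\,\bigl(\bm{V}^{1/2}\bm{e}_{i}\bigr)
\;\geq\; 0,
$$

since each summand is a quadratic form of the PSD matrix $\bm{H}_{\tilde{\bm{\theta}}}$. The product $\bm{V}\bm{H}_{\tilde{\bm{\theta}}}$ itself need not be symmetric or PSD, so only its trace is sign-controlled by this argument.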

The smoothing modulus is now extracted from the trace, with no operator inequality on a non-symmetric matrix and no isotropy assumption. Taking the trace of the Hessian decomposition and dividing the cross-covariance contribution by the readout rescaling $\kappa^{2}$,

$$
\mathrm{tr}\,\bm{H}^{\star} \;=\; \mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}\bigr] \;+\; \frac{1}{\kappa^{2}}\,\mathrm{tr}\,\tfrac{1}{2}\bigl(\bm{\Sigma}_{\mathrm{slack}}+\bm{\Sigma}_{\mathrm{slack}}^{\top}\bigr) \;=\; \mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}\bigr] \;+\; \frac{\sigma_{\mathrm{Jac}}^{2}}{\kappa^{2}},
$$
using $\mathrm{tr}\,\tfrac{1}{2}(\bm{\Sigma}_{\mathrm{slack}}+\bm{\Sigma}_{\mathrm{slack}}^{\top})=\mathrm{tr}\,\bm{\Sigma}_{\mathrm{slack}}=\sigma_{\mathrm{Jac}}^{2}$, since the trace is insensitive to the symmetrization. Writing the added trace as $d\,\mu_{\mathrm{eff}}$ defines the per-dimension modulus

$$
\mu_{\mathrm{eff}} \;=\; \frac{\sigma_{\mathrm{Jac}}^{2}}{d\,\kappa^{2}} \;=\; \Theta\!\bigl(\sigma_{\mathrm{Jac}}^{2}/(d\kappa^{2})\bigr),
$$
which carries the dimension factor $1/d$ explicitly. The leading term $\mathrm{tr}(\bm{V}\bm{H}_{\tilde{\bm{\theta}}})\geq 0$ established above makes $\sigma_{\mathrm{Jac}}^{2}\geq 0$ to leading order, so the added trace is non-negative: the cross-covariance can only add curvature. Since the deterministic term $\mathbb{E}_{\bm{X}}[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}]$ is itself PSD at a local minimum of $\mathcal{L}\circ\psi$, the trace inequality

$$
\mathrm{tr}\,\bm{H}^{\star} \;\geq\; \mathrm{tr}\,\mathbb{E}_{\bm{X}}\!\bigl[\nabla^{2}_{\tilde{\bm{\theta}}}\mathcal{L}\bigr] \;+\; d\,\mu_{\mathrm{eff}} \qquad \text{on the slack subspace at } \bm{\phi}^{\star}
$$
follows. The cross-covariance therefore adds a strongly-convex quadratic of per-dimension modulus $\mu_{\mathrm{eff}}$ to the second-order term of the invariant measure: the landscape that SGD samples is strongly-convexified by $\Theta(\sigma_{\mathrm{Jac}}^{2}/(d\kappa^{2}))$ without any deformation of $\psi$. The added curvature is in general anisotropic (a quadratic form aligned with the batch-coupled gradient noise, not a multiple of $\bm{I}$; see [Remark 5](https://arxiv.org/html/2605.24274#Thmremark5)), and ([8](https://arxiv.org/html/2605.24274#S4.E8)) reports its trace, the dimension-averaged modulus, which needs no isotropy. We do not claim that $\tilde{\mathcal{L}}$ is the Moreau envelope (Beck, [2017](https://arxiv.org/html/2605.24274#bib.bib3)) of $\mathcal{L}\circ\psi$, only that the batch-stochastic channel reproduces the second-order signature of one, an added strongly-convex quadratic. The remainder is the Itô–Taylor truncation error, $O(\eta\,\sigma_{\mathrm{Jac}}^{2}/(d\kappa^{2}))$ at leading order in the small-$\eta$ expansion. ∎
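As a rough illustration of how the per-dimension modulus could be estimated from samples, the sketch below uses the identity $\mathrm{tr}(\bm{u}\bm{v}^{\top})=\bm{u}^{\top}\bm{v}$ to turn the trace of the cross-covariance into a mean inner product, then divides by $d\kappa^{2}$. The function names and the toy data model (a diagonal PSD Hessian with zero remainder $\bm{r}$) are assumptions made for illustration and are not the paper's experimental setup.

```python
import random


def estimate_mu_eff(delta_theta, delta_g, kappa):
    """Monte Carlo estimate of sigma_Jac^2 = tr E[dtheta dg^T] and of mu_eff."""
    n, d = len(delta_theta), len(delta_theta[0])
    # tr(u v^T) = u . v, so the trace of the cross-covariance
    # is the sample mean of the inner products.
    sigma_jac_sq = sum(sum(t * g for t, g in zip(u, v))
                       for u, v in zip(delta_theta, delta_g)) / n
    mu_eff = sigma_jac_sq / (d * kappa ** 2)
    return sigma_jac_sq, mu_eff


def toy_fluctuations(h_diag, scale, n, seed=0):
    """Hypothetical toy data: dg = H dtheta, diagonal PSD H, remainder r = 0."""
    rng = random.Random(seed)
    thetas = [[rng.gauss(0.0, scale) for _ in h_diag] for _ in range(n)]
    grads = [[h * t for h, t in zip(h_diag, u)] for u in thetas]
    return thetas, grads
```

For instance, `estimate_mu_eff(*toy_fluctuations([0.5, 1.0, 2.0], 0.1, 5000), kappa=2.0)` gives a non-negative $\sigma_{\mathrm{Jac}}^{2}$ close to $\mathrm{tr}(\bm{V}\bm{H})$ for this toy model, because every term in the sum is a non-negative Hessian entry times a squared fluctuation.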

### A.3 Mean first-passage time across the shoulder

Statement (restatement of [Corollary 1](https://arxiv.org/html/2605.24274#Thmcorollary1)). Adopt the regularity assumptions (A1)–(A5) of [Appendix A](https://arxiv.org/html/2605.24274#A1.SSx1): the hypotheses (A1), (A3), (A4) of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1), the single-prefactor idealization (A2), and the metastable regularity (A5). Assume in addition a non-vanishing barrier of reference action $\alpha=(\mathcal{L}\circ\psi)(\tilde{w}_{s})-(\mathcal{L}\circ\psi)(\tilde{w}_{b})>0$, with $\tilde{w}_{b}$ the pre-barrier basin minimum and $\tilde{w}_{s}$ the shoulder location, along the bias-channel direction (Arrhenius regime). Reduce the Itô process of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) to its bias-channel projection $d\tilde{w}=-\tilde{\mathcal{L}}^{\prime}(\tilde{w})\,dt+\sigma_{\mathrm{eff}}(\tilde{w})\,dB_{t}$, with direct effective variance $\sigma_{\mathrm{eff},\mathrm{direct}}^{2}=\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}$ and lifted effective variance $\sigma_{\mathrm{eff},\mathrm{hyper}}^{2}=\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}+\sigma_{\mathrm{Jac}}^{2}$, the lifted excess $\sigma_{\mathrm{Jac}}^{2}$ being the slack-channel cross-covariance contribution of [Equation 5](https://arxiv.org/html/2605.24274#S4.E5), which is structurally absent for direct softplus. Then the mean first-passage times across the shoulder satisfy ([9](https://arxiv.org/html/2605.24274#S4.E9)), where $\asymp$ denotes equality of the leading exponential factor up to a sub-exponential prefactor.

Proof. The argument is in two steps: the reduction of the Itô process to a one-dimensional bias-channel SDE, and the Kramers asymptotics for that SDE under (A5).

*Bias-channel reduction.* Let $\bm{e}_{b}$ be the unit vector along the slack-channel direction of [Lemma 1](https://arxiv.org/html/2605.24274#Thmlemma1) and project the iterate onto it, $\tilde{w}\equiv\bm{e}_{b}^{\top}(\bm{b}+h_{\bm{\phi}}(\bm{X}))$. Projecting the Itô process $d\bm{\phi}=-\nabla_{\bm{\phi}}\mathcal{L}\,dt+\bm{\Sigma}^{1/2}\,d\bm{B}_{t}$ onto $\bm{e}_{b}$ and writing the drift through the pullback landscape $\tilde{\mathcal{L}}$ gives a scalar SDE

$$
d\tilde{w} \;=\; -\tilde{\mathcal{L}}^{\prime}(\tilde{w})\,dt \;+\; \sigma_{\mathrm{eff}}(\tilde{w})\,dB_{t},
$$
where $\tilde{\mathcal{L}}^{\prime}$ is the bias-channel derivative of $\tilde{\mathcal{L}}$ and $\sigma_{\mathrm{eff}}^{2}(\tilde{w})=\bm{e}_{b}^{\top}\bm{\Sigma}(\bm{\phi})\,\bm{e}_{b}$ is the projected diffusion. The diffusion has two contributions. The first is the readout-prefactor-attenuated gradient-driven part, which is bilinear in $\bm{g}=\psi^{\prime}(\tilde{\bm{\theta}})\nabla_{\bm{\theta}}\mathcal{L}$ and hence carries the prefactor as $\psi^{\prime}(\tilde{\bm{\theta}})^{2}$, contributing $\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}$ (the prefactor $\sigma_{s}^{2}$ of (A2) times the gradient-noise variance $\sigma_{\mathrm{obj}}^{2}$ of ([6](https://arxiv.org/html/2605.24274#S4.E6))). The second is the slack-channel cross-covariance of ([5](https://arxiv.org/html/2605.24274#S4.E5)), contributing $\sigma_{\mathrm{Jac}}^{2}$. Direct softplus has $\delta\tilde{\bm{\theta}}\equiv\bm{0}$ and hence only the attenuated part, while the lift has both, giving

$$
\sigma_{\mathrm{eff},\mathrm{direct}}^{2} \;=\; \sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}, \qquad \sigma_{\mathrm{eff},\mathrm{hyper}}^{2} \;=\; \sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2} \;+\; \sigma_{\mathrm{Jac}}^{2}.
$$

*Kramers asymptotics.* By (A5) the bias-channel projection of $\tilde{\mathcal{L}}$ has a $C^{2}$ single-barrier potential and the effective noise is small relative to the barrier $\alpha$; the SDE is then a metastable one-dimensional diffusion in the Arrhenius regime. The diffusion coefficient $\sigma_{\mathrm{eff}}^{2}(\tilde{w})$ is in general state-dependent, but the rate-limiting barrier band is the narrow operative shoulder region of (A2), over which $\psi^{\prime}$ (and hence the attenuated component $\sigma_{s}^{2}\sigma_{\mathrm{obj}}^{2}$) varies slowly; treating $\sigma_{\mathrm{Jac}}^{2}$ as locally constant on the same band, the effective variance is constant to leading order across the barrier. For a metastable diffusion with effective variance $\sigma_{\mathrm{eff}}^{2}$ constant across the barrier band, the Kramers (Eyring–Arrhenius) escape law (Kramers, [1940](https://arxiv.org/html/2605.24274#bib.bib15); Hänggi et al., [1990](https://arxiv.org/html/2605.24274#bib.bib9)) gives the mean first-passage time over the barrier

$$
\mathbb{E}[\tau] \;\asymp\; \exp\!\bigl(2\alpha/\sigma_{\mathrm{eff}}^{2}\bigr),
$$
where $\asymp$ denotes equality of the leading exponential factor up to a sub-exponential prefactor set by the curvatures at the basin minimum and the barrier top. Substituting the two effective variances yields ([9](https://arxiv.org/html/2605.24274#S4.E9)). The barrier $\alpha$ is the reference action of the un-smoothed $\mathcal{L}\circ\psi$ held fixed across the comparison, so the lift's benefit enters through $\sigma_{\mathrm{eff}}^{2}$ alone. ∎
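To make the comparison concrete, the sketch below evaluates the leading exponent $2\alpha/\sigma_{\mathrm{eff}}^{2}$ for the direct and lifted effective variances and reports the difference of the two exponents, i.e. the logarithm of the ratio of mean first-passage times with the sub-exponential prefactors ignored. The function names and the numerical values in the usage note are hypothetical and chosen only for illustration.

```python
def kramers_exponent(alpha, sigma_eff_sq):
    """Leading exponent 2*alpha / sigma_eff^2 of the escape law."""
    if alpha <= 0 or sigma_eff_sq <= 0:
        raise ValueError("alpha and sigma_eff_sq must be positive")
    return 2.0 * alpha / sigma_eff_sq


def compare_escape(alpha, sigma_s_sq, sigma_obj_sq, sigma_jac_sq):
    """Return (direct exponent, lifted exponent, log MFPT ratio direct/lifted).

    The log ratio drops the sub-exponential prefactors, so it is only the
    leading-order prediction in the small-noise (Arrhenius) regime.
    """
    direct_var = sigma_s_sq * sigma_obj_sq      # attenuated part only
    lifted_var = direct_var + sigma_jac_sq      # plus slack-channel excess
    e_direct = kramers_exponent(alpha, direct_var)
    e_lifted = kramers_exponent(alpha, lifted_var)
    return e_direct, e_lifted, e_direct - e_lifted
```

With the hypothetical values `compare_escape(1.0, 0.01, 1.0, 0.09)`, the direct exponent is about 200 and the lifted exponent about 20. Setting `sigma_jac_sq=0.0` recovers the direct case, with a log ratio of zero.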

The asymptotic ([9](https://arxiv.org/html/2605.24274#S4.E9)) is the metastable, small-noise prediction. When the unattenuated $\sigma_{\mathrm{Jac}}^{2}$ grows comparable to the barrier $2\alpha$, the lifted exponent drops to order one, the effective noise is no longer small relative to the barrier, and (A5) fails: the escape leaves the Arrhenius regime and is governed instead by free diffusion over the escape distance, with mean first-passage time the polynomial $\Theta(\sigma_{\mathrm{Jac}}^{-2})$ of the standard Brownian hitting-time identity. The drift-free Gumbel probe of [Section 5.1.2](https://arxiv.org/html/2605.24274#S5.SS1.SSS2) sits in this crossover ([Remark 6](https://arxiv.org/html/2605.24274#Thmremark6)).
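The polynomial scaling in the free-diffusion regime can be illustrated with the textbook identity for driftless Brownian motion: for $dw=\sigma\,dB_{t}$ started at the centre of an interval $(-L, L)$, the mean exit time is $L^{2}/\sigma^{2}$. The sketch below compares an Euler–Maruyama estimate with this closed form. Here `sigma` stands in for $\sigma_{\mathrm{Jac}}$ and `half_width` for the escape distance; both names, the symmetric interval, and the step size are assumptions made for illustration, not the paper's Gumbel probe.

```python
import random


def mean_exit_time(sigma, half_width=1.0, dt=1e-3, n_paths=1000, seed=0):
    """Euler-Maruyama estimate of E[tau] for dw = sigma dB leaving (-L, L)."""
    rng = random.Random(seed)
    step = sigma * dt ** 0.5  # standard deviation of one increment
    total = 0.0
    for _ in range(n_paths):
        w, t = 0.0, 0.0
        while abs(w) < half_width:
            w += step * rng.gauss(0.0, 1.0)
            t += dt
        total += t
    return total / n_paths


def exact_exit_time(sigma, half_width=1.0):
    """Closed form L^2 / sigma^2 for a start at the centre of the interval."""
    return half_width ** 2 / sigma ** 2
```

Doubling `sigma` should cut the estimated exit time by roughly a factor of four, in line with the $\sigma^{-2}$ scaling. The discrete-time check only detects exits at grid times, so the estimate sits slightly above the closed form, in addition to the Monte Carlo error.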

## References

- B. Açıkmeşe and L. Blackmore (2011) Lossless convexification of a class of optimal control problems with non-convex control constraints. Automatica 47(2), pp. 341–347. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- H. S. Aghamiry, A. Gholami, and S. Operto (2019) Improving full-waveform inversion by wavefield reconstruction with the alternating direction method of multipliers. Geophysics 84(1), pp. R139–R162. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- S. Alemohammad, J. Casco-Rodriguez, L. Luzi, A. I. Humayun, H. Babaei, D. LeJeune, A. Siahkoohi, and R. Baraniuk (2024) Self-consuming generative models go MAD. In International Conference on Learning Representations. Cited by: [§3.1](https://arxiv.org/html/2605.24274#S3.SS1.p1.12).
- B. Amos, L. Xu, and J. Z. Kolter (2017) Input convex neural networks. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§1](https://arxiv.org/html/2605.24274#S1.p2.14), [§2.1](https://arxiv.org/html/2605.24274#S2.SS1.p1.12), [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.17), [§5.2](https://arxiv.org/html/2605.24274#S5.SS2.p1.9), [§5](https://arxiv.org/html/2605.24274#S5.p1.2), [§6.2](https://arxiv.org/html/2605.24274#S6.SS2.p1.4).
- L. Baldassari, A. Siahkoohi, J. Garnier, K. Solna, and M. V. de Hoop (2024) Conditional score-based diffusion models for Bayesian inference in infinite dimensions. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1).
- P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson (2016) Parameterized neural networks for high-energy physics. European Physical Journal C 76(5), pp. 235. Cited by: [§5.2.3](https://arxiv.org/html/2605.24274#S5.SS2.SSS3.p1.1).
- R. Baptista, Y. Marzouk, and O. Zahm (2024) On the representation and learning of monotone triangular transport maps. Foundations of Computational Mathematics 24, pp. 2063–2108. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1).
- A. Beck (2017) First-order methods in optimization. SIAM. Cited by: [§A.2](https://arxiv.org/html/2605.24274#A1.SS2.p6.16), [§1](https://arxiv.org/html/2605.24274#S1.p2.14).
- D. P. Bertsekas (1999) Nonlinear programming. 2nd edition, Athena Scientific. Cited by: [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17), [§6.2](https://arxiv.org/html/2605.24274#S6.SS2.p1.4).
- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), pp. 1–122. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p2.14), [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17).
- Y. Brenier (1991) Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics 44(4), pp. 375–417. Cited by: [§5.2](https://arxiv.org/html/2605.24274#S5.SS2.p1.9).
- C. Bunne, A. Krause, and M. Cuturi (2022) Supervised training of conditional Monge maps. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1).
- L. Chizat, E. Oyallon, and F. Bach (2019) On lazy training in differentiable programming. In Advances in Neural Information Processing Systems. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px2.p1.2).
- T. A. El Moselhy and Y. M. Marzouk (2012) Bayesian inference with optimal maps. Journal of Computational Physics 231(23), pp. 7815–7850. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- D. Ha, A. M. Dai, and Q. V. Le (2017) HyperNetworks. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p3.1).
- P. Hänggi, P. Talkner, and M. Borkovec (1990) Reaction-rate theory: fifty years after Kramers. Reviews of Modern Physics 62(2), pp. 251–341. Cited by: [§A.3](https://arxiv.org/html/2605.24274#A1.SS3.p4.8), [§1](https://arxiv.org/html/2605.24274#S1.p2.14), [§2.2](https://arxiv.org/html/2605.24274#S2.SS2.p1.5).
- B. He, H. Yang, and S. L. Wang (2000) Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications 106(2), pp. 337–356. Cited by: [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17).
- B. He and X. Yuan (2012) On the $O(1/n)$ convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis 50(2), pp. 700–709. Cited by: [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17).
- S. Hochreiter and J. Schmidhuber (1997) Flat minima. Neural Computation 9(1), pp. 1–42. Cited by: [§3](https://arxiv.org/html/2605.24274#S3.p1.2).
- P. Hoedt and G. Klambauer (2023) Principled weight initialisation for input-convex neural networks. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p2.14), [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.4).
- C. Huang, R. T. Q. Chen, C. Tsirigotis, and A. Courville (2021) Convex potential flows: universal probability distributions with optimal transport and convex optimization. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [Figure 16](https://arxiv.org/html/2605.24274#S5.F16), [§5.2.3](https://arxiv.org/html/2605.24274#S5.SS2.SSS3.p1.1), [§5.2](https://arxiv.org/html/2605.24274#S5.SS2.p1.4), [§5.2](https://arxiv.org/html/2605.24274#S5.SS2.p1.9), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px2.p1.2).
- P. Jaini, I. Kobyzev, M. Brubaker, and Y. Yu (2020) Tails of Lipschitz triangular flows. In International Conference on Machine Learning. Cited by: [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.17).
- S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey (2017) Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623. Cited by: [§3](https://arxiv.org/html/2605.24274#S3.p1.2).
- N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Learning Representations. Cited by: [§3](https://arxiv.org/html/2605.24274#S3.p1.2).
- D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations. Cited by: [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.17).
- A. Korotin, L. Li, J. Solomon, and E. Burnaev (2021) Continuous Wasserstein-2 barycenter estimation without minimax optimization. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- H. A. Kramers (1940) Brownian motion in a field of force and the diffusion model of chemical reactions. Physica 7(4), pp. 284–304. Cited by: [§A.3](https://arxiv.org/html/2605.24274#A1.SS3.p4.8), [§1](https://arxiv.org/html/2605.24274#S1.p2.14), [§2.2](https://arxiv.org/html/2605.24274#S2.SS2.p1.5).
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324. Cited by: [§5.1.3](https://arxiv.org/html/2605.24274#S5.SS1.SSS3.Px1.p1.1).
- H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems. Cited by: [§5.3](https://arxiv.org/html/2605.24274#S5.SS3.SSS0.Px1.p1.1).
- Q. Li, C. Tai, and W. E (2017) Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning. Cited by: [§A.2](https://arxiv.org/html/2605.24274#A1.SS2.p3.6), [§3.3](https://arxiv.org/html/2605.24274#S3.SS3.SSS0.Px3.p1.4), [§3](https://arxiv.org/html/2605.24274#S3.p1.2), [§4.1](https://arxiv.org/html/2605.24274#S4.SS1.p1.3).
- A. V. Makkuva, A. Taghvaei, J. Lee, and S. Oh (2020) Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- S. Mandt, M. D. Hoffman, and D. M. Blei (2017) Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18(134), pp. 1–35. Cited by: [§A.2](https://arxiv.org/html/2605.24274#A1.SS2.p3.6), [§3.3](https://arxiv.org/html/2605.24274#S3.SS3.SSS0.Px3.p1.4), [§3](https://arxiv.org/html/2605.24274#S3.p1.2), [§4.1](https://arxiv.org/html/2605.24274#S4.SS1.p1.3).
- P. Mayer, L. Luzi, A. Siahkoohi, D. H. Johnson, and R. G. Baraniuk (2024) Improving fairness and mitigating MADness in generative models. arXiv preprint arXiv:2405.13977. Cited by: [§3.1](https://arxiv.org/html/2605.24274#S3.SS1.p1.12).
- S. Mei, A. Montanari, and P. Nguyen (2018) A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115(33), pp. E7665–E7671. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px2.p1.2).
- Y. Nesterov (2018) Lectures on convex optimization. 2nd edition, Springer. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p2.14).
- A. B. Owen (2013) Monte Carlo theory, methods and examples. [https://artowen.su.domains/mc/](https://artowen.su.domains/mc/). Cited by: [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.17).
- G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems. Cited by: [§5.2.3](https://arxiv.org/html/2605.24274#S5.SS2.SSS3.p1.1).
- T. Pock, D. Cremers, H. Bischof, and A. Chambolle (2010) Global solutions of variational models with convex regularization. SIAM Journal on Imaging Sciences 3(4), pp. 1122–1145. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- A. Prékopa (1971) Logarithmic concave measures with application to stochastic programming. Acta Scientiarum Mathematicarum (Szeged) 32, pp. 301–316. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1).
- A. Saumard and J. A. Wellner (2014) Log-concavity and strong log-concavity: a review. Statistics Surveys 8, pp. 45–114. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§5.1](https://arxiv.org/html/2605.24274#S5.SS1.p1.17).
- A. Siahkoohi, K. Aghazade, and A. Gholami (2026) Dual-space posterior sampling for Bayesian inference in constrained inverse problems. arXiv preprint arXiv:2603.00393. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- A. Spantini, D. Bigoni, and Y. Marzouk (2018) Inference via low-dimensional couplings. Journal of Machine Learning Research 19(66), pp. 1–71. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1).
- W. W. Symes, H. Chen, and S. E. Minkoff (2020) Full-waveform inversion by source extension: why it works. In SEG Technical Program Expanded Abstracts, pp. 765–769. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p3.1), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- W. W. Symes (2008) Migration velocity analysis and waveform inversion. Geophysical Prospecting 56(6), pp. 765–790. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- A. Thatipelli and A. Siahkoohi (2026) Hypernetwork-based approach for grid-independent functional data clustering. arXiv preprint arXiv:2602.22823. Cited by: [§3.1](https://arxiv.org/html/2605.24274#S3.SS1.p1.12).
- T. van Leeuwen and F. J. Herrmann (2013) Mitigating local minima in full-waveform inversion by expanding the search space. Geophysical Journal International 195(1), pp. 661–667. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p3.1), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- T. van Leeuwen and F. J. Herrmann (2016) A penalty method for PDE-constrained optimization in inverse problems. Inverse Problems 32(1), pp. 015007. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23(7), pp. 1661–1674. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- T. Vogt, E. Strekalovskiy, D. Cremers, and J. Lellmann (2020) Lifting methods for manifold-valued variational problems. In Handbook of Variational Methods for Nonlinear Geometric Data, pp. 95–119. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).
- Z. O. Wang, R. Baptista, Y. Marzouk, L. Ruthotto, and D. Verma (2025) Efficient neural network approaches for conditional optimal transport with applications in Bayesian inference. SIAM Journal on Scientific Computing 47(4), pp. C979–C1005. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p1.1), [§5.2](https://arxiv.org/html/2605.24274#S5.SS2.p1.9), [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px3.p1.1), [Remark 2](https://arxiv.org/html/2605.24274#Thmremark2.p1.3).
- C. Wei, J. D. Lee, Q. Liu, and T. Ma (2019) Regularization matters: generalization and optimization of neural nets v.s. their induced kernel. In Advances in Neural Information Processing Systems. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px2.p1.2).
- B. Wohlberg (2017) ADMM penalty parameter selection by residual balancing. arXiv preprint arXiv:1704.06209. Cited by: [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17).
- Z. Xie, I. Sato, and M. Sugiyama (2021) A diffusion theory for deep learning dynamics: stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations. Cited by: [§1](https://arxiv.org/html/2605.24274#S1.p2.14).
- M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems. Cited by: [§3.1](https://arxiv.org/html/2605.24274#S3.SS1.p1.12).
- M. Zarepisheh, L. Xing, and Y. Ye (2018) A computation study on an integrated alternating direction method of multipliers for large scale optimization. Optimization Letters 12(1), pp. 3–15. Cited by: [§6.1](https://arxiv.org/html/2605.24274#S6.SS1.p1.17).
- Y. Zheng, C. R. Pai, and Y. Tang (2026) Benign nonconvex landscapes in optimal and robust control, Part II: extended convex lifting. IEEE Transactions on Automatic Control, pp. 1–16. Cited by: [§7](https://arxiv.org/html/2605.24274#S7.SS0.SSS0.Px1.p1.1).