Constrained Diffusion Models with Primal-Dual Inference

arXiv cs.LG 06/17/26, 04:00 AM Papers
Summary
This paper proposes primal-dual inference for constrained diffusion models, jointly inferring the optimal distribution and its dual variable via a dual-conditioned score network, with convergence guarantees and applications in wireless resource allocation and portfolio management.
arXiv:2606.17192v1 Announce Type: new
Cached at: 06/17/26, 05:36 AM
# Constrained Diffusion Models with Primal-Dual Inference
Source: [https://arxiv.org/html/2606.17192](https://arxiv.org/html/2606.17192)
Samar Hadou, Yiğit Berkay Uslu, Alejandro Ribeiro

Department of Electrical and Systems Engineering, University of Pennsylvania

{selaraby,ybuslu,aribeiro}@seas.upenn.edu

###### Abstract

This paper develops constrained diffusion models with primal-dual inference (PDI) to sample from optimal distributions of entropy-regularized optimization problems with *average* constraints. We formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution takes the form of a Gibbs distribution indexed by the optimal dual variable. Rather than estimating this dual multiplier before sampling and freezing it throughout generation, PDI jointly infers the optimal primal distribution and its parametrizing dual variable. Each reverse diffusion step denoises using the score field associated with the current multiplier and then updates the multiplier through dual ascent using the estimated constraint violation of the denoised samples. To enable this conditional score field, we train a single dual-conditioned score network over the family of Gibbs distributions induced by the dual variables encountered during inference. We prove that the time average of the dual variables generated along the inference trajectory converges to a neighborhood of the dual optimum and bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors. We evaluate PDI on constrained sampling from a mixture of Gaussians, wireless resource allocation, and portfolio management.

## 1 Introduction

Diffusion models have become a dominant framework for sampling from complex, high-dimensional distributions by learning to reverse a noising process Sohl-Dickstein et al. ([2015](https://arxiv.org/html/2606.17192#bib.bib10)); Ho et al. ([2020](https://arxiv.org/html/2606.17192#bib.bib20)); Song et al. ([2021b](https://arxiv.org/html/2606.17192#bib.bib16)). While their most apparent successes lie in unconstrained perceptual tasks such as image Rombach et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib11)), audio Kong et al. ([2021](https://arxiv.org/html/2606.17192#bib.bib12)), and video synthesis Ho et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib13)), the same distributional modeling capabilities are valuable for tackling a broad class of constrained optimization problems whose optimal solutions are probabilistic rather than deterministic. Wireless resource allocation, for instance, employs stochastic, time-sharing policies to maximize a network-wide utility, subject to per-user service guarantees Uslu et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib7)); Babazadeh Darabi and Coleri ([2025](https://arxiv.org/html/2606.17192#bib.bib58)). Similarly, portfolio management seeks to diversify allocations to maximize returns under expected risk constraints Tiwari ([2026](https://arxiv.org/html/2606.17192#bib.bib65)); He et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib63)). Standard diffusion models, however, lack reliable mechanisms to enforce such statistical constraints during sampling. This paper develops diffusion models with primal-dual inference to bridge that gap.

Entropy-regularized formulations that optimize over distributions with average constraints admit a clean form in the Lagrangian dual domain. For a fixed dual variable, the distributional minimizer of the Lagrangian is a Gibbs distribution whose energy is linear in the dual variable. This connects constrained sampling to the vast literature on diffusion-based samplers from unnormalized Gibbs targets Zhang and Chen ([2022](https://arxiv.org/html/2606.17192#bib.bib25)); Vargas et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib26)); Berner et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib27)); Richter and Berner ([2024](https://arxiv.org/html/2606.17192#bib.bib29)). In principle, one could solve for the optimal dual variable and train a diffusion model to sample from the corresponding Gibbs distribution. In our setting, however, the sampler must operate over a family of constrained optimization instances, and each instance can induce a different optimal multiplier. Estimating these multipliers before sampling is fragile and expensive. The optimal multiplier is defined through intractable expectations under the Gibbs distribution, so each dual update requires estimating statistics of the very distribution being learned. Prior dual-training (DT) approaches Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)) address this by alternating dual updates with score-model training. With only a limited number of score updates per dual iterate, the learned score model can reflect the accumulated history of these targets rather than the Gibbs score field indexed by the final multiplier. Once training ends, the resulting sampler is fixed and cannot respond to constraint violations during generation.

This paper proposes a different view. Rather than treating the dual variable as a quantity that must be estimated before sampling, we make it part of the sampling state. We propose primal–dual inference (PDI), which interleaves dual ascent with the reverse diffusion process. At each denoising step, PDI uses the score field associated with the current multiplier to update the samples, estimates constraint violations from Tweedie-based estimates of denoised samples Efron ([2011](https://arxiv.org/html/2606.17192#bib.bib42)), and updates the multiplier before the next reverse step. The dual variable becomes an inference-time state of the sampler that steers the trajectories toward feasibility, and the sampler follows a path-dependent sequence of Gibbs score fields. This dynamic coupling keeps the score field aligned with the current multiplier throughout inference and allows early noisy steps to absorb dual mismatch while later steps refine samples under multipliers that have already reached a region close to the optimal multiplier. Our analysis captures this effect by proving convergence of the time-averaged dual variables and bounding how residual dual mismatch propagates through the reverse process.

To enable this primal–dual coupling, we train a single score network conditioned on the dual variable, so that the model represents a family of Gibbs targets rather than a single constrained distribution\. This gives PDI a direct mechanism for adapting to shifted constraints and out\-of\-distribution \(OOD\) instances, as long as the induced dual trajectory remains within the multiplier range covered during training\. Empirically, we show that PDI outperforms DT and remains more robust under shifted constraints\. When DT is strengthened by freezing a candidate final multiplier and continuing score training for that fixed Gibbs target, it becomes competitive with PDI, but only after this additional training stage\. This shows that PDI’s advantage is not merely estimating a good multiplier, but keeping multiplier updates coupled to the denoising trajectory during inference\.

We make the following contributions\.

1. (C1) We cast constrained sampling with average constraints as a saddle-point problem in the Lagrangian dual domain, where the optimal distribution is a Gibbs distribution indexed by the optimal dual variable.
2. (C2) We introduce the PDI algorithm, a reverse-diffusion sampler in which primal denoising and dual ascent evolve jointly. At each step, the sampler uses the score field associated with the current multiplier and then updates the multiplier using constraint violations.
3. (C3) We train a single dual-variable-conditioned score network that generalizes across the family of Gibbs distributions indexed by the dual variable encountered during inference.
4. (C4) We establish convergence of the time-averaged dual iterates to a neighborhood of the optimal multiplier. We also bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors.
5. (C5) We validate PDI empirically on a constrained Gaussian mixture, a wireless power-control problem with per-user rate requirements, and a portfolio-construction task with expected risk constraints.

Figure 1 makes the effect of inference-time dual updates concrete. We showcase a wireless optimization problem where users share a channel and each needs a minimum average service rate. An unconstrained sampler ignores the requirements and leaves many users underserved. DT fixes the dual variables during training, before sampling, and so cannot correct violations during generation. PDI, by updating the dual variables as it denoises, steers samples toward (average) feasibility and lifts the worst-served users much closer to their targets.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x1.png)

Figure 1: Wireless power allocation under ergodic minimum-rate $r_{\min}$ constraints. Nodes represent users and colors indicate how well the rate requirement of each user is met, with red marking those that fall short. Here, average constraints are meaningful because mutual interference prevents any single solution from satisfying every requirement at once. Optimal solutions must therefore alternate samples (i.e., power allocations and/or channel access) over time, so that some compensate for others. The unconstrained sampler leaves many users below the target rate (red nodes), while DT partially improves feasibility through training-time dual updates. PDI further improves the lower-rate users by updating the dual variables during denoising.

## 2 Related Work

##### Diffusion model sampling with constraints\.

Guidance techniques steer reverse diffusions toward reward or constraint alignment by modifying the score in Tweedie’s denoising formula Efron ([2011](https://arxiv.org/html/2606.17192#bib.bib42)), as in classifier guidance Dhariwal and Nichol ([2021](https://arxiv.org/html/2606.17192#bib.bib40)), classifier-free guidance Ho and Salimans ([2022](https://arxiv.org/html/2606.17192#bib.bib41)), and CLIP-based variants Nichol et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib43)). A separate line of work enforces domain-feasibility constraints directly on the diffusion process through boundary reflections Lou and Ermon ([2023](https://arxiv.org/html/2606.17192#bib.bib44)), mirror maps Liu et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib45)), Riemannian formulations De Bortoli et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib46)), log-barrier constructions Fishman et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib47)), feasibility projections Christopher et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib48)), and constrained reverse steps Huang et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib50)). Most importantly, these methods address pointwise constraints rather than distributional (average) ones.

##### Constrained diffusion modeling with Lagrangian dual formulations\.

Closely related to our work are Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)); Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)), which adopt primal–dual formulations for average constraints. Specifically, Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)) pair outer-loop dual ascent with inner-loop score matching but retrain the score network for each dual iterate, while primal–dual Langevin Monte Carlo Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)) jointly evolves samples and multipliers at sampling time but forgoes a learned generative model. Our method shares the Lagrangian saddle-point structure of these works but differs in two key aspects. First, the dual variables evolve jointly with the denoising trajectory rather than in a separate training loop or ergodic Langevin chain. Second, a single score network, conditioned on the dual variable, is trained across a distribution of multipliers rather than refitted per iterate. An extended related work is provided in Appendix [A](https://arxiv.org/html/2606.17192#A1).

## 3 Constrained Diffusion Models

Consider a target distribution $\mu^{*}$ that trades off minimization of a variational objective function with satisfying a set of constraints in expectation. More concretely, $\mu^{*}$ is the minimizer of

$$
P^{*} ~=~ \min_{\mu\in\mathcal{P}_{2}} \quad \mathbb{E}_{\mu}\Big[f_{0}(\mathbf{x})\Big]-\beta\,\mathcal{H}(\mu), \qquad \text{s.t.} \quad \mathbb{E}_{\mu}\Big[\mathbf{f}(\mathbf{x})\Big]\preceq\mathbf{0}, \tag{1}
$$

where $f_{0}:\mathcal{X}\to\mathbb{R}$, $\mathbf{f}:\mathcal{X}\to\mathbb{R}^{d}$, $\mathcal{X}\subseteq\mathbb{R}^{p}$ is the domain of feasible decisions, and $\mathcal{P}_{2}(\mathcal{X})$ denotes the space of all probability distributions supported on $\mathcal{X}$ with finite second moments. We augment the objective with an entropy regularization term to penalize degenerate solutions and ensure sample diversity. This formulation is particularly relevant in multi-player cooperative games and resource allocation problems, where a single deterministic policy may not satisfy all competing requirements simultaneously. In such settings, time-sharing policies sampled from $\mu^{*}$ can satisfy the constraints on average while improving the tradeoff between optimality and feasibility.

The problem in ([1](https://arxiv.org/html/2606.17192#S3.Ex1)) is ($\beta$-strongly) convex in $\mu$ with zero duality gap and can be handled in the dual domain by constructing the Lagrangian function,

$$
\mathcal{L}\big(\mu,\boldsymbol{\lambda}\big) ~=~ \mathbb{E}_{\mathbf{x}\sim\mu}\Big[\,f_{0}(\mathbf{x})+\boldsymbol{\lambda}^{\top}\mathbf{f}(\mathbf{x})\,\Big]-\beta\,\mathcal{H}(\mu), \tag{2}
$$

where $\boldsymbol{\lambda}\succeq\mathbf{0}$ contains the Lagrangian multipliers. The dual problem is then defined as

$$
D^{*} ~=~ \max_{\boldsymbol{\lambda}\succeq\mathbf{0}}\,\min_{\mu\in\mathcal{P}_{2}}\ \mathbb{E}_{\mathbf{x}\sim\mu}\Big[\,f_{0}(\mathbf{x})+\boldsymbol{\lambda}^{\top}\mathbf{f}(\mathbf{x})\,\Big]-\beta\,\mathcal{H}(\mu). \tag{3}
$$

The two problems are equivalent by the following proposition.

###### Assumption 1 (Existence of a strictly feasible solution).

There exists $\mu^{\prime}\in\mathcal{P}_{2}$ such that $\mathbb{E}_{\mu^{\prime}}[f_{0}(\mathbf{x})]-\beta\,\mathcal{H}(\mu^{\prime})<C<\infty$, and $\mathbb{E}_{\mu^{\prime}}[\mathbf{f}(\mathbf{x})]\preceq-\xi\mathbf{1}\prec\mathbf{0}$.

###### Proposition 1 (Strong duality, Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1))).

Under Assumption [1](https://arxiv.org/html/2606.17192#Thmassumption1), and for $\beta>0$, we have $P^{*}=D^{*}$, and the optimal pair $(\mu^{*},\boldsymbol{\lambda}^{*})$ is a saddle-point of the Lagrangian, i.e., $\mathcal{L}(\mu^{*},\boldsymbol{\lambda})~\leq~\mathcal{L}(\mu^{*},\boldsymbol{\lambda}^{*})~\leq~\mathcal{L}(\mu,\boldsymbol{\lambda}^{*})$ for any $\mu\in\mathcal{P}_{2}(\mathcal{X})$ and $\boldsymbol{\lambda}\in\mathbb{R}^{d}_{+}$. It also holds that $\boldsymbol{\lambda}^{*}$ is finite, i.e., $\boldsymbol{\lambda}^{*}\in\boldsymbol{\Lambda}\subset\mathbb{R}^{d}_{+}$. Moreover, $\mu^{*}$ follows the law of the Gibbs distribution,

$$
\mu^{*}(\mathbf{x}) ~=~ \mu_{\boldsymbol{\lambda}^{*}}(\mathbf{x}) ~=~ \frac{1}{Z(\boldsymbol{\lambda}^{*})}\cdot\exp\Big(-\frac{1}{\beta}\big(f_{0}(\mathbf{x})+{\boldsymbol{\lambda}^{*}}^{\top}\mathbf{f}(\mathbf{x})\big)\Big), \tag{4}
$$

with $Z(\boldsymbol{\lambda}^{*}) ~=~ \int\exp\Big(-\frac{1}{\beta}\big(f_{0}(\mathbf{x})+{\boldsymbol{\lambda}^{*}}^{\top}\mathbf{f}(\mathbf{x})\big)\Big)\,\mathrm{d}\mathbf{x}$.

Proposition [1](https://arxiv.org/html/2606.17192#Thmproposition1) shows that the optimal distribution is a Gibbs distribution parametrized by the optimal dual variable, which can equivalently be viewed as an exponential tilting of the implicit uniform prior on $\mathcal{X}$. Since the energy functions are known, sampling from this distribution is straightforward in principle. However, a computational challenge lies in that the optimal dual multiplier satisfies a saddle-point condition involving expectations under the very distribution it parametrizes. Prior work Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5)) addresses this by estimating $\boldsymbol{\lambda}^{*}$ through a dual-training loop wrapped around diffusion-based score matching. However, this requires retraining or fine-tuning the score network for each dual iterate, let alone each new problem instance, and leaves a fixed sampler after training. To alleviate these challenges, we propose the PDI algorithm, which shifts the search for $\boldsymbol{\lambda}^{*}$ from training to inference.

### 3.1 Primal–Dual Inference

For a given $\boldsymbol{\lambda}$, the inner minimizer of the Lagrangian in ([3](https://arxiv.org/html/2606.17192#S3.E3)) is the Gibbs distribution

$$
\mu_{\boldsymbol{\lambda}}^{\dagger}(\mathbf{x})\propto\exp\Big(-E(\mathbf{x},\boldsymbol{\lambda})\Big),\quad E(\mathbf{x},\boldsymbol{\lambda}) ~=~ \frac{1}{\beta}\Big(f_{0}(\mathbf{x})+\boldsymbol{\lambda}^{\top}\mathbf{f}(\mathbf{x})\Big), \tag{5}
$$

where the energy function $E$ is the point-wise Lagrangian function scaled by $1/\beta$. We utilize diffusion models to sample from the family of Gibbs distributions $\{\mu_{\boldsymbol{\lambda}}^{\dagger}\,|\,\boldsymbol{\lambda}\in\boldsymbol{\Lambda}\}$. If $\boldsymbol{\lambda}=\boldsymbol{\lambda}^{*}$, this Gibbs distribution coincides with the optimal constrained distribution $\mu^{*}$.
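
A short derivation may help explain why the inner minimizer has this Gibbs form and why dual ascent follows the expected constraint violation. It is a sketch under two assumptions not stated in this section: that $\mathcal{H}(\mu)=-\int\mu\log\mu\,\mathrm{d}\mathbf{x}$ is the differential entropy, and that differentiation under the integral sign is permitted. Writing $Z(\boldsymbol{\lambda})=\int\exp(-E(\mathbf{x},\boldsymbol{\lambda}))\,\mathrm{d}\mathbf{x}$ for the normalizing constant,

$$
\mathcal{L}(\mu,\boldsymbol{\lambda}) ~=~ \beta\int\mu(\mathbf{x})\Big(\log\mu(\mathbf{x})+E(\mathbf{x},\boldsymbol{\lambda})\Big)\mathrm{d}\mathbf{x} ~=~ \beta\,\mathrm{KL}\big(\mu\,\|\,\mu_{\boldsymbol{\lambda}}^{\dagger}\big)-\beta\log Z(\boldsymbol{\lambda}).
$$

The KL term is nonnegative and vanishes only at $\mu=\mu_{\boldsymbol{\lambda}}^{\dagger}$, so the minimum over $\mu$ is attained at the Gibbs distribution and equals $-\beta\log Z(\boldsymbol{\lambda})$. Differentiating this value with respect to the multiplier gives

$$
\nabla_{\boldsymbol{\lambda}}\big(-\beta\log Z(\boldsymbol{\lambda})\big) ~=~ \frac{1}{Z(\boldsymbol{\lambda})}\int\mathbf{f}(\mathbf{x})\exp\big(-E(\mathbf{x},\boldsymbol{\lambda})\big)\,\mathrm{d}\mathbf{x} ~=~ \mathbb{E}_{\mu_{\boldsymbol{\lambda}}^{\dagger}}\big[\mathbf{f}(\mathbf{x})\big],
$$

which is the average constraint value under the current Gibbs distribution: the multiplier grows where a constraint is violated on average and shrinks where it holds with slack.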

Diffusion models learn the reverse dynamics of a Gaussian noising process initialized at the target distribution Song et al. ([2021b](https://arxiv.org/html/2606.17192#bib.bib16)). For each Gibbs distribution, we define a forward process,

$$
\mathbf{y}_{\tau}(\boldsymbol{\lambda}) ~=~ \sqrt{a_{\tau}}\,\mathbf{y}_{\tau-1}(\boldsymbol{\lambda})+\sqrt{b_{\tau}}\,\boldsymbol{\epsilon}_{\tau},\quad \mathbf{y}_{0}(\boldsymbol{\lambda})\sim\mu_{\boldsymbol{\lambda}}^{\dagger},\quad \mathbf{y}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), \tag{6}
$$

with $\boldsymbol{\epsilon}_{\tau}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. All processes in this family share the same decay schedule $\{a_{\tau}\}$, noise schedule $\{b_{\tau}\}$, and terminal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$. We reserve the subscript $\tau$ for the time of the forward processes running from $0$ to $T$. Each forward process induces a forward marginal distribution $q_{\tau}^{\boldsymbol{\lambda}}\coloneqq q_{\tau}(\cdot|\boldsymbol{\lambda})$. The conditional distribution at time $\tau$ is $q_{\tau|0}(\cdot\,|\,\mathbf{y}_{0},\boldsymbol{\lambda})=\mathcal{N}(\alpha_{\tau}\mathbf{y}_{0},\sigma_{\tau}^{2}\mathbf{I})$ with $\alpha_{\tau} ~=~ \prod_{s=1}^{\tau}\sqrt{a_{s}}$ and $\sigma_{\tau}^{2} ~=~ \sum_{j=1}^{\tau}b_{j}\prod_{s=j+1}^{\tau}a_{s}$. Throughout the paper, we refer to $\text{SNR}_{\tau} ~\coloneqq~ \alpha_{\tau}^{2}/\sigma_{\tau}^{2}$ as the signal-to-noise ratio. For a fixed $\boldsymbol{\lambda}$, we associate to the forward process a score-based reverse sampler,

$$
\mathbf{x}_{t+1}(\boldsymbol{\lambda}) ~=~ \frac{1}{\sqrt{a_{T-t}}}\Big(\mathbf{x}_{t}(\boldsymbol{\lambda})+b_{T-t}\nabla\log q_{T-t}(\mathbf{x}_{t}|\boldsymbol{\lambda})\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}, \tag{7}
$$

with $\mathbf{x}_{0}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, $\mathbf{x}_{T}(\boldsymbol{\lambda})\sim\mu_{\boldsymbol{\lambda}}^{\dagger}$, and $\boldsymbol{\epsilon}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. We use the subscript $t$ to refer to the inference time running from Gaussian noise to the target distribution, with $t=T-\tau$. The reverse process induces marginal distributions $\widetilde{p}_{t}(\cdot|\boldsymbol{\lambda})$ that ideally match those of the forward process, namely $\widetilde{p}_{t}\approx q_{T-t}$, for all $\boldsymbol{\lambda}$.
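
The forward-process coefficients defined above can be computed directly from the two schedules. The sketch below is illustrative only: the function name `schedule_stats` and the linear, variance-preserving schedule ($a_\tau = 1 - b_\tau$) are assumptions made for the example, not choices taken from the paper. It uses the recursion $\sigma_\tau^2 = a_\tau\sigma_{\tau-1}^2 + b_\tau$, which unrolls to the sum formula for $\sigma_\tau^2$.

```python
import numpy as np

def schedule_stats(a, b):
    """Return (alpha, sigma2, snr) for tau = 1..T from per-step schedules a_tau, b_tau."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # alpha_tau = prod_{s=1}^{tau} sqrt(a_s)
    alpha = np.sqrt(np.cumprod(a))
    # sigma_tau^2 = sum_j b_j prod_{s>j} a_s, accumulated via
    # sigma_tau^2 = a_tau * sigma_{tau-1}^2 + b_tau with sigma_0^2 = 0
    sigma2 = np.empty_like(b)
    acc = 0.0
    for i in range(len(b)):
        acc = a[i] * acc + b[i]
        sigma2[i] = acc
    snr = alpha**2 / sigma2
    return alpha, sigma2, snr

# Hypothetical variance-preserving schedule, chosen only for illustration.
b = np.linspace(1e-4, 2e-2, 1000)
alpha, sigma2, snr = schedule_stats(1.0 - b, b)
```

With $a_\tau = 1 - b_\tau$ the recursion gives $\sigma_\tau^2 = 1 - \alpha_\tau^2$, so the SNR decreases monotonically in $\tau$ and the terminal marginal approaches $\mathcal{N}(\mathbf{0},\mathbf{I})$.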

Running ([7](https://arxiv.org/html/2606.17192#S3.E7)) under the optimal value $\boldsymbol{\lambda}^{*}$ produces samples from the optimal distribution $\mu_{\boldsymbol{\lambda}^{*}}^{\dagger}$, which coincides with $\mu^{*}$ under strong duality. PDI replaces this fixed-multiplier sampler with a time-varying reverse process whose multiplier is updated during inference through dual ascent. Each reverse step uses the Gibbs score field indexed by the current multiplier, and the multiplier is updated from the constraint residual evaluated on Tweedie posterior-mean estimates:

$$
\mathbf{x}_{t+1} ~=~ \frac{1}{\sqrt{a_{T-t}}}\Big(\,\mathbf{x}_{t}+b_{T-t}\,\nabla_{\mathbf{x}_{t}}\log q_{T-t}(\mathbf{x}_{t}|\boldsymbol{\lambda}_{t})\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}, \tag{PDI-P}
$$

$$
\boldsymbol{\lambda}_{t+1} ~=~ \bigg[\,\boldsymbol{\lambda}_{t}+\eta_{t}\,\mathbb{E}_{\mathbf{x}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}\big[\,\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\,\big]\Big)\Big]\bigg]_{+}. \tag{PDI-D}
$$

The operator $[\cdot]_{+}$ denotes the projection onto the nonnegative orthant, $\eta_{t}$ is the dual step size, and $\mathbb{E}\big[\,\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\,\big]$ is the Tweedie posterior-mean estimate of the clean samples under the forward process initialized at $\mu_{\boldsymbol{\lambda}_{t}}^{\dagger}$. The primal step uses the score of the forward marginal $q_{T-t}$ selected by the current multiplier, while the dual step updates the multipliers in the direction of the average constraint violation estimated from the Tweedie clean-sample estimates. The primal samples and dual variable therefore evolve together during generation. As the dual iterates approach a neighborhood of $\boldsymbol{\lambda}^{*}$, the sampler is steered through Gibbs distributions indexed by near-optimal multipliers.
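
The two updates can be exercised on a toy problem where the score is known in closed form. The following is a minimal sketch, not the authors' implementation. It assumes a hypothetical 1-D instance with $f_0(x)=x^2/2$, $f(x)=c-x$ (that is, the constraint $\mathbb{E}[x]\ge c$) and $\beta=1$, so that $\mu_\lambda^\dagger=\mathcal{N}(\lambda,1)$ and $\lambda^{*}=c$ for $c>0$. It also assumes a variance-preserving schedule $a_\tau=1-b_\tau$, under which $q_\tau(\cdot|\lambda)=\mathcal{N}(\alpha_\tau\lambda,1)$. The Tweedie estimate is evaluated at the forward time $T-t-1$ of the freshly denoised sample, which is one reading of (PDI-D); the function name `pdi_toy` and all numerical settings are illustrative.

```python
import numpy as np

def pdi_toy(c=1.0, T=1000, n=10_000, kappa=1.0, seed=0):
    """Toy primal-dual inference with an analytic score (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    # Schedules indexed by forward time tau = 1..T; index 0 is padding (b_0 = 0, a_0 = 1).
    b = np.concatenate(([0.0], np.linspace(1e-4, 2e-2, T)))
    a = 1.0 - b
    alpha = np.sqrt(np.cumprod(a))   # alpha[0] = 1
    sigma2 = 1.0 - alpha**2          # valid because a_tau = 1 - b_tau
    eta = kappa / np.sqrt(T)         # constant dual step size

    def score(x, tau, lam):
        # grad_x log q_tau(x | lam) for q_tau = N(alpha_tau * lam, 1)
        return -(x - alpha[tau] * lam)

    x = rng.standard_normal(n)       # x_0 ~ N(0, 1)
    lam, lams = 0.0, []
    for t in range(T):
        tau = T - t
        eps = rng.standard_normal(n)
        # Primal step (PDI-P) with the score field indexed by the current multiplier.
        x = (x + b[tau] * score(x, tau, lam)) / np.sqrt(a[tau]) + np.sqrt(b[tau]) * eps
        # Tweedie posterior mean of the clean sample at forward time tau - 1.
        y0_hat = (x + sigma2[tau - 1] * score(x, tau - 1, lam)) / alpha[tau - 1]
        # Dual step (PDI-D): projected ascent on the average violation of f(x) = c - x.
        lam = max(0.0, lam + eta * np.mean(c - y0_hat))
        lams.append(lam)
    return x, lam, float(np.mean(lams))

samples, lam_final, lam_bar = pdi_toy()
```

In this toy case the multiplier settles near $\lambda^{*}=c$ while the samples are still mostly noise, and the terminal samples have mean close to $c$, so the constraint holds approximately on average without $\lambda^{*}$ ever being supplied to the sampler.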

### 3.2 Score Network Training

PDI requires access to the score field $\nabla_{\mathbf{x}_{t}}\log q_{T-t}(\mathbf{x}_{t}|\boldsymbol{\lambda}_{t})$ for the multiplier encountered at each denoising step. Since this score is not available in closed form, we learn a dual-conditioned score model $s_{\boldsymbol{\theta}}(\mathbf{x}_{t},t,\boldsymbol{\lambda}_{t})$ that approximates the family of noised Gibbs score fields indexed by $\boldsymbol{\lambda}$. The primal update in ([PDI-P](https://arxiv.org/html/2606.17192#S3.Ex2)) is then replaced with

$$
\mathbf{x}_{t+1} ~=~ \frac{1}{\sqrt{a_{T-t}}}\Big(\,\mathbf{x}_{t}+b_{T-t}\,s_{\boldsymbol{\theta}}\big(\mathbf{x}_{t},t,\boldsymbol{\lambda}_{t}\big)\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}. \tag{8}
$$

The score model is trained across a family of problem instances parameterized by $\mathcal{G}$. Thus, the score model formally depends on the problem instance and should be written as $s_{\boldsymbol{\theta}}(\mathbf{x}_{t},t,\boldsymbol{\lambda}_{t},\mathcal{G})$. To keep the notation uncluttered, we suppress the explicit dependence on $\mathcal{G}$ and write $s_{\boldsymbol{\theta}}(\mathbf{x}_{t},t,\boldsymbol{\lambda}_{t})$ throughout.

The score model is trained to approximate the score of the Gibbs distribution indexed by different dual variables\. The training objective is

$$
\boldsymbol{\theta}^{*} ~\in~ \operatorname*{argmin}_{\boldsymbol{\theta}}\quad \mathbb{E}_{\boldsymbol{\lambda},\mathbf{x}_{t},t}\Big[\omega(t)\,\big\|s_{\boldsymbol{\theta}}\big(\mathbf{x}_{t},t,\boldsymbol{\lambda}\big)-\nabla_{\mathbf{x}_{t}}\log q_{T-t}(\mathbf{x}_{t}|\boldsymbol{\lambda})\big\|_{2}^{2}\Big], \tag{9}
$$

where $\omega(t)$ is a time-dependent weighting function. The expectation is taken over diffusion times, noisy samples, and a training distribution of problem instances and dual variables. This differs from standard denoising or score-matching objectives in two ways. First, the relevant distribution over dual variables is not known in advance because it is induced by the inference-time dual trajectory. Second, unlike computer vision tasks, where training starts from samples of the target distribution, our starting point is the optimization problem in ([1](https://arxiv.org/html/2606.17192#S3.Ex1)), whose solution distributions and samples are not necessarily available a priori.

To mitigate these two issues, we propose Algorithm [1](https://arxiv.org/html/2606.17192#alg1) in Appendix [B](https://arxiv.org/html/2606.17192#A2). The training procedure starts by sampling dual multipliers from a prior distribution and rolling out trajectories using the untrained score network to collect noisy samples $\mathbf{x}_{t}$. As training progresses, we sample pairs $(\mathbf{x}_{t},\boldsymbol{\lambda}_{t})$ along these trajectories instead of relying on the initial prior alone. For each pair, we use a Monte Carlo estimator of the score to define the regression target. This gradually shifts training toward the joint distribution of noisy samples and dual variables encountered by the sampler at inference time.
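
The exact Monte Carlo estimator is specified in the appendix and is not reproduced in this section. As a hedged illustration of how a regression target can be built from the energy alone, the sketch below uses a standard self-normalized importance-sampling estimator, which may differ from the authors' choice. It relies on the identity $\nabla_{\mathbf{x}}\log q_\tau(\mathbf{x}|\boldsymbol{\lambda})=\mathbb{E}[(\alpha_\tau\mathbf{y}_0-\mathbf{x})/\sigma_\tau^2\,|\,\mathbf{x}]$, draws proposals $\mathbf{y}_0\sim\mathcal{N}(\mathbf{x}/\alpha_\tau,(\sigma_\tau/\alpha_\tau)^2)$, and weights them by $\exp(-E(\mathbf{y}_0,\boldsymbol{\lambda}))$. The function name `mc_score`, the scalar setting, and the quadratic energy are assumptions for the example.

```python
import numpy as np

def mc_score(x, alpha, sigma, energy, lam, n_samples=200_000, seed=0):
    """Self-normalized importance-sampling estimate of d/dx log q_tau(x | lam), scalar x."""
    rng = np.random.default_rng(seed)
    # Proposal proportional (in y0) to the forward kernel N(x; alpha * y0, sigma^2).
    y0 = x / alpha + (sigma / alpha) * rng.standard_normal(n_samples)
    # Unnormalized Gibbs weights exp(-E); subtract the max for numerical stability.
    log_w = -energy(y0, lam)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Posterior average of the conditional score (alpha * y0 - x) / sigma^2.
    return float(np.sum(w * (alpha * y0 - x)) / sigma**2)

# Hypothetical 1-D instance: f0(x) = x^2 / 2, f(x) = c - x, beta = 1, so mu_lam = N(lam, 1).
c, lam = 1.0, 1.0
energy = lambda y, lam: 0.5 * y**2 + lam * (c - y)
alpha, sigma, x = 0.8, 0.6, 0.5
estimate = mc_score(x, alpha, sigma, energy, lam)
# Closed form for this Gaussian case: q_tau = N(alpha * lam, alpha^2 + sigma^2).
exact = -(x - alpha * lam) / (alpha**2 + sigma**2)
```

Estimators of this kind need only evaluations of $f_0$ and $\mathbf{f}$, which matches the setting described above where no samples from the target are available. Their variance grows when $\alpha_\tau$ is small, since the proposal becomes very wide.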

## 4 Convergence Analysis

The PDI algorithm defines a sampling process steered by a sequence of dual variables estimated via dual ascent. Because the multiplier changes along the trajectory, the marginal law $p_{t}$ generated by PDI is generally not equal to $q_{T-t}(\cdot|\boldsymbol{\lambda}_{t})$. The latter only defines the local fixed-multiplier score field used at that step. Thus, PDI is a path-dependent reverse process rather than the exact reversal of any single forward process in ([6](https://arxiv.org/html/2606.17192#S3.E6)). We therefore analyze the convergence of the time-averaged dual iterates to a neighborhood of the optimal dual variable $\boldsymbol{\lambda}^{*}$. We then analyze how the residual errors in the multipliers and score fields propagate through the reverse dynamics and affect the terminal law $p_{T}$.

###### Theorem 1 (Convergence of time-average dual variables).

Let $\eta_{t}=\kappa/\sqrt{T}$ for all $t$. For a sequence $\{\boldsymbol{\lambda}_{t}\}$ generated by PDI, it holds under regularity Assumptions [2](https://arxiv.org/html/2606.17192#Thmassumption2)–[4](https://arxiv.org/html/2606.17192#Thmassumption4) and arbitrary $\delta>0$ that:

$$
\mathbb{E}\big[\|\bar{\boldsymbol{\lambda}}-\boldsymbol{\lambda}^{*}\|^{2}\big] ~\leq~ \frac{2}{\vartheta}\,\mathbb{E}\big[D^{*}-g\big(\bar{\boldsymbol{\lambda}}\big)\big] ~\leq~ \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)+\Delta_{\text{floor}}, \tag{10}
$$

with

$$
\Delta_{\text{floor}} ~=~ \frac{12L_{f}^{2}}{\vartheta^{2}T}\,\sum_{t=0}^{T-1}\Big(\epsilon^{2}_{\text{TW}}(t+1)+\frac{\sigma^{4}_{T-t}}{\overline{\alpha}^{2}_{T-t}}\,\epsilon_{\text{app}}^{2}(t+1)\Big)+\frac{12L_{h}^{2}}{\vartheta^{2}}\Big(1+\frac{1}{\delta}\Big)\epsilon_{\text{hist}}^{2},
$$

and where $\bar{\boldsymbol{\lambda}} ~=~ \tfrac{1}{T}\sum_{t}\boldsymbol{\lambda}_{t}$ is the time-average of the multipliers, $g(\boldsymbol{\lambda})=\mathcal{L}(\mu^{\dagger}_{\boldsymbol{\lambda}},\boldsymbol{\lambda})$ is the dual function with strong concavity index $\vartheta$, $\overline{\alpha}_{T-t}=\max\{\alpha_{\min},\alpha_{T-t}\}$, and $L_{f},L_{h}$ are Lipschitz constants.

The assumptions and proof are provided in Appendix [C.2](https://arxiv.org/html/2606.17192#A3.SS2). The theorem shows that the time-average of the dual variable converges at a rate of $1/\sqrt{T}$ to a region around the dual optimum. The floor arises from three error sources. The first is the Tweedie posterior-mean error $\epsilon_{\text{TW}}(t)$, which results from evaluating constraint violations on Tweedie posterior means rather than on clean samples from the current Gibbs distribution. The second is the score-approximation error $\epsilon_{\text{app}}(t)$, which measures how errors in the learned score affect the Tweedie estimate. The third is the trajectory-history mismatch $\epsilon_{\mathrm{hist}}$, which captures how past score-approximation errors propagate through the reverse process. Empirically, the score-approximation terms appear negligible. The dominant residual bias is therefore the Tweedie term. Its effect decreases toward the data side because the SNR increases, making the posterior mean estimate of the clean sample more accurate and reducing the bias in the dual update.

Since the dual variable parametrizes the Gibbs target, this implies that the time-averaged multiplier $\bar{\boldsymbol{\lambda}}$ defines a near-optimal Gibbs target. A fresh fixed-multiplier sampler using $\bar{\boldsymbol{\lambda}}$ across all diffusion steps would then target this distribution.
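The qualitative content of Theorem 1 can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a scalar quadratic dual function $g(\lambda) = -\tfrac{\vartheta}{2}(\lambda - \lambda^*)^2$, and it replaces the Tweedie-based constraint estimate with the exact gradient plus a hypothetical persistent `bias` and zero-mean noise. The names `averaged_dual_ascent`, `bias`, and `noise_std` are illustrative only.

```python
import math
import random


def averaged_dual_ascent(T, kappa=1.0, theta=1.0, lam_star=2.0,
                         bias=0.0, noise_std=1.0, seed=0):
    """Projected dual ascent on the toy dual g(lam) = -(theta/2)*(lam - lam_star)**2."""
    rng = random.Random(seed)
    eta = kappa / math.sqrt(T)      # constant step size, as in Theorem 1
    lam, running_sum = 0.0, 0.0
    for _ in range(T):
        running_sum += lam
        # Exact gradient of g, corrupted by a persistent bias and zero-mean noise.
        grad = theta * (lam_star - lam) + bias + rng.gauss(0.0, noise_std)
        lam = max(0.0, lam + eta * grad)   # ascent step, projected onto lam >= 0
    return running_sum / T          # time-averaged multiplier


err_unbiased = abs(averaged_dual_ascent(T=10_000) - 2.0)
err_biased = abs(averaged_dual_ascent(T=10_000, bias=0.5) - 2.0)
print(f"unbiased: {err_unbiased:.3f}, biased: {err_biased:.3f}")
```

With an unbiased gradient the averaged multiplier lands close to $\lambda^*$, whereas a persistent bias shifts it by roughly `bias / theta`. In this toy setting the bias plays the role of the error terms collected in $\Delta_{\text{floor}}$.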

Dual convergence alone does not control the generated samples, because errors in $\boldsymbol{\lambda}_t$ enter the reverse dynamics through the score field and may be amplified by later kernels. We therefore quantify the stability of these reverse kernels. For a fixed multiplier $\boldsymbol{\lambda}$, let $K_t^{\boldsymbol{\lambda}}$ denote the Markov kernel of the one-step reverse update in ([7](https://arxiv.org/html/2606.17192#S3.E7)) using the score field $\nabla\log q_{T-t}(\cdot|\boldsymbol{\lambda})$. For any input law $\nu$, the measure $\nu K_t^{\boldsymbol{\lambda}}$ denotes the law obtained by propagating $\nu$ through the one-step reverse kernel $K_t^{\boldsymbol{\lambda}}$.

###### Proposition 2 ($W_2$ stability of the optimal kernel).

Let $K_t^{\boldsymbol{\lambda}^*}$ be the Markov kernel of a reverse step ([7](https://arxiv.org/html/2606.17192#S3.E7)) under the frozen optimal multiplier $\boldsymbol{\lambda}^*$ and the true score $\nabla\log q_{T-t}(\cdot|\boldsymbol{\lambda}^*)$. Under Assumption [5](https://arxiv.org/html/2606.17192#Thmassumption5), for all $\nu, \mu \in \mathcal{P}_2(\mathcal{X})$, it holds that

$$W_2\big(\nu K_t^{\boldsymbol{\lambda}^*}, \mu K_t^{\boldsymbol{\lambda}^*}\big) \leq \rho_t\, W_2(\nu, \mu), \tag{11}$$

with $\rho_t = \frac{1}{\sqrt{a_{T-t}}}\sup_{\mathbf{x}\in\mathcal{X}}\big\|\mathbf{I} + b_{T-t}\nabla_{\mathbf{x}}^2\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^*)\big\|_{\text{op}}$.

The proof is provided in Appendix [C.5](https://arxiv.org/html/2606.17192#A3.SS5). The coefficient $\rho_t$ measures how stable one reverse step is and depends on the noise level. At high noise, the score field varies slowly and the reverse step is typically stable, i.e., $\rho_t < 1$. At medium noise, $\rho_t$ is usually expected to remain moderate, although not necessarily smaller than one. Near the data side, the distribution becomes sharper and the score field can vary rapidly around the data manifold, so $\rho_t$ may become larger and small errors can be amplified. Based on this kernel stability, we characterize the $W_2$ distance between the terminal and optimal distributions.
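To make the dependence of $\rho_t$ on the noise level concrete, the following sketch evaluates it in closed form for a one-dimensional, symmetric two-mode Gaussian mixture. Several ingredients are assumptions and not taken from the paper: the DDPM linear schedule, the identification $a = 1 - \beta$ and $b = \beta$ for the reverse-step coefficients, and the toy target with modes at $\pm m_0$ and standard deviation $s_0$. For this target, $\partial_x^2 \log q(x) = -1/s^2 + (m^2/s^4)\,\mathrm{sech}^2(mx/s^2)$, so the supremum in $\rho_t$ is attained either far from the origin or at $x = 0$.

```python
import math

T = 1000
# DDPM linear beta schedule (assumed; the paper's exact schedule is not restated here).
betas = [1e-4 + (0.02 - 1e-4) * k / (T - 1) for k in range(T)]
m0, s0 = 1.0, 0.1   # clean modes at +/- m0 with standard deviation s0 (illustrative)


def rho_profile():
    """Return rho at noise levels tau = 1..T for the two-mode Gaussian mixture."""
    rhos, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        m_sq = alpha_bar * m0 ** 2                      # squared mode location after noising
        s_sq = alpha_bar * s0 ** 2 + (1.0 - alpha_bar)  # per-mode variance after noising
        h_far = -1.0 / s_sq                             # Hessian of log q far from the origin
        h_mid = -1.0 / s_sq + m_sq / s_sq ** 2          # Hessian of log q at x = 0
        a, b = 1.0 - beta, beta                         # assumed reverse-step coefficients
        rhos.append(max(abs(1 + b * h_far), abs(1 + b * h_mid)) / math.sqrt(a))
    return rhos


rhos = rho_profile()
print(f"rho at highest noise: {rhos[-1]:.3f}, rho next to the data: {rhos[0]:.3f}")
```

Here `rhos[-1]` corresponds to the first reverse step ($t = 0$, noise level $T$) and `rhos[0]` to the last one. Under these assumptions the step is contractive at high noise and expansive next to the data, because the log-density is locally convex between two sharp modes.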

###### Theorem 2\.

Under Assumptions [2](https://arxiv.org/html/2606.17192#Thmassumption2), [4](https://arxiv.org/html/2606.17192#Thmassumption4) and [5](https://arxiv.org/html/2606.17192#Thmassumption5), the Wasserstein distance between the terminal distribution of PDI and the optimal distribution is controlled by the dual mismatch:

$$\mathbb{E}\big[W_2\big(p_T, \mu^*\big)\big] ~\leq~ \sum_{t=0}^{T-1}\Psi_{t,T}\,\frac{b_{T-t}}{\sqrt{a_{T-t}}}\Big(\gamma_t\,\mathbb{E}\big[\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|\big] + \epsilon_{\text{app}}(t)\Big), \tag{12}$$

where $\Psi_{s,T} = \prod_{t=s+1}^{T}\rho_t$, and $\gamma_t = \frac{R\,\alpha_{T-t}}{\beta\,\sigma^2_{T-t}}\sqrt{\sup_{\boldsymbol{\lambda}}\big\|\mathrm{Cov}_{\pi_t}(\mathbf{y}_0|\mathbf{x}, \boldsymbol{\lambda})\big\|_{\text{op}}}$.

The proof is in Appendix [C.3](https://arxiv.org/html/2606.17192#A3.SS3). The theorem shows that a dual mismatch at time $t$ affects the final distribution through the future stability product $\prod_{s=t+1}^{T}\rho_s$. This product is controlled by the noise schedule. Keeping the SNR low for enough reverse steps keeps the score field smooth and the kernels more contractive, so early dual errors are damped rather than amplified. This provides a stable phase in which the dual variable can approach a neighborhood of $\boldsymbol{\lambda}^*$. Later, when the SNR increases and the reverse dynamics become more sensitive, the remaining mismatch is already small. Empirically, we find that standard DDPM linear and cosine schedules provide this behavior.
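The weighting by future stability products can be sketched with a single backward pass. The profile of $\rho_t$ below is hypothetical (contractive for the first 80 reverse steps, expansive afterwards) and is chosen only to show how $\Psi_{t,T}$ treats early and late mismatches differently. The helper name `stability_products` is illustrative.

```python
def stability_products(rho):
    """Given rho[1..T] (rho[0] is unused), return psi with psi[t] = prod_{s=t+1}^{T} rho[s]."""
    T = len(rho) - 1
    psi = [1.0] * (T + 1)              # psi[T] is the empty product
    for t in range(T - 1, -1, -1):     # accumulate the suffix product backwards
        psi[t] = rho[t + 1] * psi[t + 1]
    return psi


T = 100
# Hypothetical profile: contractive for the first 80 reverse steps, expansive afterwards.
rho = [None] + [0.97 if t <= 80 else 1.05 for t in range(1, T + 1)]
psi = stability_products(rho)
print(f"Psi_0,T = {psi[0]:.3f}, Psi_80,T = {psi[80]:.3f}, Psi_99,T = {psi[99]:.3f}")
```

In this toy profile a mismatch injected at $t = 0$ is scaled by a factor below one, while a mismatch injected at $t = 80$ is amplified. This is the mechanism by which a long low-SNR phase protects the terminal distribution from early dual errors.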

## 5 Numerical Results

We provide numerical evidence in three stochastic optimization problems: constrained mixture of Gaussians (MoG), wireless resource allocation, and portfolio management. More details are provided in Appendices [E](https://arxiv.org/html/2606.17192#A5), [F](https://arxiv.org/html/2606.17192#A6), and [G](https://arxiv.org/html/2606.17192#A7).

### 5.1 Mixture of Gaussians

We first evaluate our approach on sampling from a weighted mixture of Gaussians in $\mathbb{R}^d$, $f_0(\mathbf{x}) = -\log\sum_{k=1}^{K} w_k\,\mathcal{N}\big(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)$, truncated to a polytope $\{\mathbf{x} : \mathbf{A}^\top\mathbf{x} \preceq \mathbf{b}\}$. The optimization problem is

$$P^*_{\text{MoG}} ~=~ \min_{\mu(\mathbf{x})}\; \mathbb{E}_{\mathbf{x}\sim\mu}\bigl[f_0(\mathbf{x})\bigr] - \beta\mathcal{H}(\mu) \quad \text{s.t.} \quad \mathbb{E}_{\mathbf{x}\sim\mu}\bigl[\mathbf{A}^\top\mathbf{x} - \mathbf{b}\bigr] \;\preceq\; \mathbf{0}, \tag{13}$$

i.e., the sampler must produce a distribution that concentrates on high-density regions of the mixture while satisfying $M$ linear inequality constraints in expectation.

Baselines. We compare our approach to two baselines: i) a projected diffusion model (PDM) Christopher et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib48)), and ii) primal-dual Langevin (PDL) dynamics Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)). For PDM, we reuse our trained model with $\boldsymbol{\lambda} = \mathbf{0}$ as the base model that samples from the unconstrained MoG, and we perform a Cimmino parallel projection Sloboda ([1991](https://arxiv.org/html/2606.17192#bib.bib4)) onto the polytope at each time step.

Results. As shown in Figure [2](https://arxiv.org/html/2606.17192#S5.F2), all constrained methods achieve the same constraint feasibility. However, our method (PDI-Net and PDI-MC) obtains a better objective value than the other constrained baselines. This is because the two baselines enforce the constraints on every generated sample, whereas our formulation only requires the constraints to hold in expectation. Pointwise enforcement is therefore more conservative: it can discard samples that are useful for optimality and violate the constraints individually, even though their violations could be offset by other samples in the distribution. We also include an ablation on the dual variable through the unconstrained baseline, which runs our model with $\boldsymbol{\lambda}$ fixed to zero at all time steps. As the figure shows, the unconstrained model ignores the constraints entirely and achieves the lowest objective. This confirms that the constrained behavior is not merely a consequence of the training algorithm.
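The difference between average and pointwise feasibility can be made concrete with a small two-dimensional example. The sketch below assumes isotropic component covariances (the paper allows general $\boldsymbol{\Sigma}_k$), omits the entropy term, and uses a hypothetical two-mode mixture with a single constraint $x_1 \leq 0.5$. The names `f0` and `average_violation` are illustrative.

```python
import math


def f0(x, weights, means, sigmas):
    """Negative log-density of an isotropic Gaussian mixture (isotropy is an assumption)."""
    d = len(x)
    log_terms = []
    for w, mu, s in zip(weights, means, sigmas):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        log_norm = -0.5 * d * math.log(2 * math.pi * s ** 2)
        log_terms.append(math.log(w) + log_norm - 0.5 * sq_dist / s ** 2)
    top = max(log_terms)                       # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))


def average_violation(samples, A_cols, b):
    """Entries of the sample mean of A^T x - b; each column of A is one constraint normal."""
    n = len(samples)
    return [sum(sum(a_i * x_i for a_i, x_i in zip(a, x)) for x in samples) / n - b_m
            for a, b_m in zip(A_cols, b)]


weights, means, sigmas = [0.5, 0.5], [[-2.0, 0.0], [2.0, 0.0]], [0.5, 0.5]
A_cols, b = [[1.0, 0.0]], [0.5]                # single constraint: x_1 <= 0.5
samples = [[-2.0, 0.0], [2.0, 0.0]]            # one sample at each mode

objective = sum(f0(x, weights, means, sigmas) for x in samples) / len(samples)
avg_viol = average_violation(samples, A_cols, b)
pointwise = [x[0] - 0.5 for x in samples]
print(f"objective {objective:.4f}, average violation {avg_viol}, pointwise {pointwise}")
```

The sample at the right-hand mode violates the constraint on its own, yet the pair is feasible on average while sitting exactly on the two high-density modes. A pointwise method would have to move or discard that sample, which is the conservatism described above.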

![Refer to caption](https://arxiv.org/html/2606.17192v1/x2.png)

Figure 2: Mixture of Gaussians ($K=12$ modes with $M=10$ constraints in a $30$-dimensional space). While PDI, PDL and PDM maintain full feasibility, PDI exhibits the best objective and mode diversity.

### 5.2 Wireless Resource Allocation

For a given network state $\mathbf{H}$ with $N=200$ users, the goal is to maximize the ergodic sum-rate while satisfying a minimum ergodic rate constraint for each user. That is,

$$P^*_{\mathrm{pc}} ~=~ \max_{\mu(\mathbf{x})}\; \mathbf{1}^\top\mathbf{r}\left(\mu, \mathbf{H}\right) + \beta\mathcal{H}(\mu), \quad \text{s.t.} \quad \mathbf{r}\left(\mu, \mathbf{H}\right) \geq \mathbf{1}\cdot r_{\min}, \tag{14}$$

where $\mu$ denotes the distribution of power allocations, and $r_{\min}$ is the minimum ergodic rate requirement. The optimal solution is rarely a degenerate policy. Since users compete for the communication channels and cause interference, activating all users simultaneously is typically suboptimal and may prevent the rate constraints from being satisfied. Instead, the optimal distribution often corresponds to a multimodal switching policy, in which different subsets of users transmit at different times while meeting their rate requirements in expectation.
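A two-user example shows why a switching policy can dominate a degenerate one. The rate model below is an assumption, since the paper's rate function $\mathbf{r}(\mu, \mathbf{H})$ is not restated in this section: each user's instantaneous rate is taken to be $\log_2(1 + \text{SINR})$ with unit maximum power and noise power $0.1$. The channel matrix, the value of `r_min`, and the function names are hypothetical.

```python
import math


def rates(p, H, noise=0.1):
    """Per-user rate log2(1 + SINR) for one power allocation p (assumed rate model)."""
    n = len(p)
    out = []
    for i in range(n):
        interference = sum(H[j][i] * p[j] for j in range(n) if j != i)
        out.append(math.log2(1.0 + H[i][i] * p[i] / (noise + interference)))
    return out


def ergodic_rates(policy, H):
    """Average rates under a policy given as (probability, power allocation) pairs."""
    avg = [0.0] * len(H)
    for prob, p in policy:
        for i, r in enumerate(rates(p, H)):
            avg[i] += prob * r
    return avg


H = [[1.0, 1.0], [1.0, 1.0]]                         # strong cross-interference
all_on = [(1.0, [1.0, 1.0])]                         # degenerate policy
switching = [(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])]   # two-mode switching policy
r_min = 1.0

r_all, r_switch = ergodic_rates(all_on, H), ergodic_rates(switching, H)
print(f"all-on: {r_all}, switching: {r_switch}")
```

Under these assumptions the all-on policy leaves both users below `r_min`, while time-sharing satisfies the constraint in expectation and also achieves a higher sum-rate.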

Baselines. In addition to PDL and PDM, we include two other diffusion-based baselines: i) diffusion posterior sampling (DPS) Chung et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib3)), and ii) supervised training (ST) with expert data Uslu et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib8)). The expert data are generated by running a primal-dual method on the deterministic version of ([14](https://arxiv.org/html/2606.17192#S5.E14)); the resulting trajectories (after a transient period) are collected to train the diffusion model.

Table 1: Baseline comparisons. PDI achieves a near-feasible performance and provides the best balance between mean rates and constraint violations in the wireless power allocation problem ($r_{\min}=0.6$).

Results. Table [1](https://arxiv.org/html/2606.17192#S5.T1) shows that PDI achieves the best overall balance between mean rates (objective) and feasibility, as measured by the 5th-percentile rate and the mean and sum violations. The 1st-percentile users, however, are harder to correct through the dual variables and suffer under all methods except ST. As in the MoG experiment, PDL and PDM achieve lower mean rates (worse objectives) with no substantial gains in feasibility because they sacrifice optimality to enforce pointwise constraints. Moreover, ST generates feasible solutions at the expense of optimality. Because ST is trained only to imitate expert trajectories, it is agnostic to the dual structure of the constrained distribution and cannot explicitly balance utility against feasibility during sampling. Lastly, we consider an ST warm-up: instead of using the untrained score model to generate rollouts during the early epochs of training, we initialize these rollouts with the ST model. We observe that the trained PDI-Net with ST warm-up inherits the same bias toward feasibility, achieving a higher, feasible 1st-percentile rate. However, this comes at the expense of the objective value, which decreases from $2.80$ to $2.58$.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.17192v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.17192v1/x5.png)

Figure 3: Ablation. Comparisons between our base PDI-Net model, plotted in blue, and (left) a PDI-Net with $\boldsymbol{\lambda}_t$ set to a fixed value along the denoising trajectory, (middle) a PDI-Net with different noise schedules, and (right) DT checkpoints corresponding to different training epochs and different estimates of the dual variable.

Ablation. Figure [3](https://arxiv.org/html/2606.17192#S5.F3) (left) evaluates the effect of keeping the dual variable fixed during inference. We run the score model with two fixed values: the final dual iterate $\boldsymbol{\lambda}_T$ and the time-average dual variable $\bar{\boldsymbol{\lambda}}$, both computed from our base experiment. The results suggest that the distribution generated by the full PDI dynamics is not equivalent to the Gibbs distribution associated with either fixed value. Instead, it behaves like a mixture induced by the changing dual variable, and this mixture exhibits better feasibility. Among the two values, the Gibbs distribution with $\bar{\boldsymbol{\lambda}}$ provides a better approximation to the PDI-generated distribution than the one associated with the final iterate.

To evaluate the role of the noise schedule, we report the performance metrics of four schedules in Figure [3](https://arxiv.org/html/2606.17192#S5.F3) (middle): the DDPM cosine schedule used in our base experiment, the DDPM linear schedule, and the linear and polynomial SNR schedules. The latter two schedules fail to generate feasible samples. This is because they leave the low-SNR regime too quickly, before the dual updates have reduced the dual mismatch. As a result, in the high-SNR regime, when the stability factor $\rho_t$ becomes larger than one, the reverse process amplifies the remaining mismatch, leading to large constraint violations.
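How long a schedule stays in the low-SNR regime can be checked directly. The sketch below uses the commonly cited default parameterizations of the DDPM linear-beta and cosine schedules, and defines the SNR as $\bar{\alpha}/(1-\bar{\alpha})$ with $\bar{\alpha}$ the cumulative signal level. These parameter values, the threshold of one, and the function names are assumptions. The linear and polynomial SNR schedules from the ablation are not reproduced because their definitions are not given in this section.

```python
import math

T = 1000


def alpha_bar_linear(T, beta_min=1e-4, beta_max=0.02):
    """Cumulative signal level of the DDPM linear-beta schedule at noise levels 1..T."""
    out, prod = [], 1.0
    for k in range(T):
        prod *= 1.0 - (beta_min + (beta_max - beta_min) * k / (T - 1))
        out.append(prod)
    return out


def alpha_bar_cosine(T, s=0.008):
    """Cumulative signal level of the cosine schedule at noise levels 1..T."""
    def f(tau):
        return math.cos((tau / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(tau) / f(0) for tau in range(1, T + 1)]


def low_snr_fraction(alpha_bar):
    """Fraction of steps whose SNR = alpha_bar / (1 - alpha_bar) is below one."""
    return sum(ab / (1.0 - ab) < 1.0 for ab in alpha_bar) / len(alpha_bar)


frac_linear = low_snr_fraction(alpha_bar_linear(T))
frac_cosine = low_snr_fraction(alpha_bar_cosine(T))
print(f"low-SNR fraction: linear {frac_linear:.2f}, cosine {frac_cosine:.2f}")
```

With these defaults, both schedules spend about half or more of their steps below an SNR of one, which leaves room for the dual variable to settle before the dynamics become sensitive. A schedule that exits this regime early would shrink that window.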

Dual training. To ablate the role of inference-time dual updates, we train a score model $\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_t, t)$ that does not take the dual variable as input. During training, we still update the dual variables using a primal-dual algorithm, so the score model is trained against targets induced by the evolving dual variables. At inference, there are no more dual updates and the trained model therefore acts as a fixed sampling policy that is intended to sample from the optimal distributions. We refer to this approach as dual training (DT) and the training algorithm is provided in Appendix [D](https://arxiv.org/html/2606.17192#A4).

Figures [3](https://arxiv.org/html/2606.17192#S5.F3) (right) and [8](https://arxiv.org/html/2606.17192#A6.F8) compare DT checkpoints along the training trajectory with PDI. Although some DT checkpoints reach dual variables close to those produced by PDI, their generated samples exhibit lower tail rates and lower feasibility. This indicates that matching the dual variables alone is not sufficient. Since DT does not condition the score model on the current multiplier, a checkpoint trained over a changing dual trajectory need not represent the score field associated with its current dual value. Instead, it reflects the accumulated effect of the preceding training trajectory. We further verify this in Appendix [F.3](https://arxiv.org/html/2606.17192#A6.SS3) by freezing DT multipliers and continuing score training, which substantially improves performance but requires a much larger training budget. PDI avoids this limitation by keeping the dual variable as an explicit state during inference. Thus, when the dual update increases the penalty on violated constraints, the penalty is reflected directly in the denoising dynamics.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x6.png)

Figure 4: Out-of-distribution constraints. The models are trained under $r_{\min}=0.6$ and tested under different values of constraint levels. PDI degrades more gracefully than DT, showing more robustness to distribution shifts.

Out-of-distribution constraints. The dual state conditioning in our PDI score model also benefits OOD performance. In Figure [4](https://arxiv.org/html/2606.17192#S5.F4), we plot the sum violation and feasibility percentage under different constraint levels $r_{\min}$, with $r_{\min}=0.6$ corresponding to the in-distribution setting. We observe that PDI is more robust to the shift in $r_{\min}$. This is because the score function $\nabla\log q_{T-t}(\mathbf{x}_t \mid \boldsymbol{\lambda}_t)$ does not depend on $r_{\min}$ directly. Instead, changing $r_{\min}$ changes the constraint residual and, therefore, induces a different trajectory of dual variables. Since PDI is trained with both exploitation and exploration over $\boldsymbol{\lambda}$, the score model learns a family of score fields over a range of in-distribution dual values. Thus, as long as the dual trajectory induced by the new $r_{\min}$ remains close to this learned range, PDI can adapt to the new OOD constraints.

### 5.3 Portfolio Management

In constrained portfolio optimization, we aim to allocate weights $\mathbf{x} \in \Delta^{N-1}$, where $\Delta^{N-1}$ is the $(N-1)$-dimensional probability simplex, across $N=500$ assets, organized in $10$ sectors. The return $\mathbf{r}$ is drawn from a factor model with mean $\boldsymbol{\mu}$ and block-diagonal covariance $\boldsymbol{\Sigma}$. Our objective is to maximize expected return subject to per-asset variance-risk budgets, i.e.,

$$P_{\text{pm}}^{*} ~=~ \max_{\mu(\mathbf{x})}\; \mathbb{E}_{\mathbf{x}\sim\mu}\Bigl[\mathbb{E}_{\mathbf{r}}\bigl[\mathbf{r}^{\top}\mathbf{x}\bigr]\Bigr] + \beta\mathcal{H}(\mu) \quad \text{s.t.} \quad \mathbb{E}_{\mathbf{x}\sim\mu}\bigl[x_j\,(\boldsymbol{\Sigma}\mathbf{x})_j\bigr] \;\leq\; b_j, \quad j \in [N]. \tag{15}$$

The budget $b_j > 0$ caps how much variance asset $j$ is permitted to contribute in expectation over all portfolios. The average constraints allow the sampler to trade risk across different portfolio samples, admitting occasional high-risk, high-return allocations while ensuring that the overall exposure of each asset remains within its prescribed budget in expectation.
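The per-asset risk contributions in (15) are straightforward to evaluate on sampled portfolios. The three-asset covariance, budgets, and portfolios below are hypothetical and chosen only to show one concentrated allocation that exceeds a budget on its own while the average over samples stays within every budget. The function names are illustrative.

```python
def risk_contributions(x, Sigma):
    """Per-asset variance contributions x_j * (Sigma x)_j; they sum to x^T Sigma x."""
    Sx = [sum(s_jk * x_k for s_jk, x_k in zip(row, x)) for row in Sigma]
    return [x_j * sx_j for x_j, sx_j in zip(x, Sx)]


def average_risk(portfolios, Sigma):
    """Monte Carlo estimate of E[x_j (Sigma x)_j] over sampled portfolios."""
    n = len(portfolios)
    totals = [0.0] * len(Sigma)
    for x in portfolios:
        for j, c in enumerate(risk_contributions(x, Sigma)):
            totals[j] += c / n
    return totals


Sigma = [[0.04, 0.01, 0.0], [0.01, 0.09, 0.0], [0.0, 0.0, 0.01]]
budgets = [0.02, 0.02, 0.02]
portfolios = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]    # weights on the simplex

avg = average_risk(portfolios, Sigma)
per_sample_ok = [all(c <= b for c, b in zip(risk_contributions(x, Sigma), budgets))
                 for x in portfolios]
avg_ok = all(a <= b for a, b in zip(avg, budgets))
print(f"average contributions {avg}, per-sample feasible {per_sample_ok}, average feasible {avg_ok}")
```

The first portfolio concentrates on asset 1 and breaches its budget individually, but the expectation over both portfolios satisfies all budgets, which is the kind of risk trading across samples that the average constraints permit.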

Results. Table [2](https://arxiv.org/html/2606.17192#S5.T2) shows that PDI achieves the best tradeoff between return, feasibility, and diversification among all diffusion samplers. The unconstrained model and DPS achieve higher returns, but only by incurring larger constraint violations and lower feasibility. In contrast, PDM and PDL are more feasible but more conservative, leading to lower returns. PDI lies between these extremes. It maintains high feasibility and very small mean violations while achieving the highest return among the feasible diffusion-based methods. Additional ablations in Appendix [G](https://arxiv.org/html/2606.17192#A7) confirm the observations made in the wireless allocation problem: freezing the final or time-averaged multiplier underperforms the full PDI dynamics, and the noise schedule controls feasibility in a manner consistent with our stability analysis.

Table 2: Baseline comparison in portfolio management. PDI provides the best tradeoff between return, entropy and risk.

| Method | Mean return (↑) | Feasibility (↑) | Mean viol. (↓) | $N_{\mathrm{eff}}$ | Entropy | \|Top1\| |
|---|---|---|---|---|---|---|
| PDI–Net | 0.189 ± 0.031 | 0.91 ± 0.09 | 8.18e-07 | 16.3 ± 2.0 | 4.57 | 151 |
| PDI–MC | 0.195 ± 0.028 | 0.92 ± 0.13 | 1.45e-04 | 15.9 ± 8.4 | 5.23 | 240 |
| PDL | 0.165 ± 0.013 | 0.96 ± 0.03 | 8.50e-07 | 21.8 ± 0.7 | 5.78 | 376 |
| PDM | 0.125 ± 0.012 | 1.00 ± 0.00 | 0.00e+00 | 63.2 ± 13.0 | 3.88 | 100 |
| DPS | 0.333 ± 0.048 | 0.70 ± 0.04 | 4.29e-04 | 4.5 ± 0.4 | 4.91 | 202 |
| Unconstrained | 0.477 ± 0.121 | 0.76 ± 0.04 | 1.95e-03 | 4.2 ± 0.6 | 4.68 | 159 |

## 6 Conclusions

We introduced a primal-dual inference (PDI) framework that enforces distributional constraints by coupling a reverse diffusion process with dual ascent dynamics. We trained a single dual-variable-conditioned score network that serves the entire family of Gibbs targets encountered along the dual trajectory. At inference, PDI generates samples through a path-dependent sequence of Gibbs score fields indexed by the evolving multipliers, avoiding the need to first estimate and freeze an optimal dual variable. We analyzed PDI through the convergence of time-averaged dual iterates and a stability bound that quantifies how residual dual mismatch propagates through the reverse process. We validated our framework and methodology on constrained Gaussian sampling, wireless power control, and portfolio management. Extensions of PDI to adaptive dual step sizes, chance-constrained problems, and larger-scale setups are promising directions for future work.

Our work has some limitations worth noting. The convergence guarantees of our algorithm apply to time-averaged dual iterates rather than to the final iterates or the primal distributions directly. Additionally, PDI requires problem-specific tuning of the dual-variable prior used during training and of the entropy regularization strength $\beta$, for which principled selection rules remain an open question. Our code is available at: [https://github.com/SMRhadou/PDI-Diffusion](https://github.com/SMRhadou/PDI-Diffusion).

## References

- T. Akhound-Sadegh, J. Rector-Brooks, A. J. Bose, S. Mittal, P. Lemos, C. Liu, M. Sendera, S. Ravanbakhsh, G. Gidel, Y. Bengio, N. Malkin, and A. Tong (2024) Iterated denoising energy matching for sampling from Boltzmann densities. In International Conference on Machine Learning (ICML).
- Residual diffusion models for joint source channel coding of mimo csi. In 2025 59th Asilomar Conference on Signals, Systems, and Computers, pp. 55–62.
- M. Arvinte and J. I. Tamir (2023) MIMO channel estimation using score-based generative models. IEEE Transactions on Wireless Communications 22(6), pp. 3698–3713. External Links: [Document](https://dx.doi.org/10.1109/TWC.2022.3220784).
- A. Babazadeh Darabi and S. Coleri (2025) Diffusion model based resource allocation strategy in ultra-reliable wireless networked control systems. IEEE Communications Letters 29(1), pp. 85–89. External Links: [Document](https://dx.doi.org/10.1109/LCOMM.2024.3499745).
- J. Berner, L. Richter, and K. Ullrich (2024) An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research. External Links: ISSN 2835-8856.
- K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine (2024) Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations.
- L. F. Chamon, M. R. Karimi, and A. Korba (2024) Constrained sampling with primal-dual langevin monte carlo. Advances in Neural Information Processing Systems 37, pp. 29285–29323.
- J. Chen, L. Richter, J. Berner, D. Blessing, G. Neumann, and A. Anandkumar (2024) Sequential controlled langevin diffusions. External Links: 2412.07081, [Link](https://arxiv.org/abs/2412.07081).
- H. Choudhary, A. Orra, and M. Thakur (2025) Diffusion-augmented reinforcement learning for robust portfolio optimization under stress scenarios. External Links: 2510.07099.
- J. K. Christopher, S. Baek, and F. Fioretto (2024) Constrained synthesis with projected diffusion models. Advances in Neural Information Processing Systems 37, pp. 89307–89333.
- H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye (2023) Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations (ICLR).
- V. De Bortoli, E. Mathieu, M. Hutchinson, J. Thornton, Y. W. Teh, and A. Doucet (2022) Riemannian score-based generative modelling. Advances in neural information processing systems 35, pp. 2406–2422.
- P. Dhariwal and A. Q. Nichol (2021) Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.).
- C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Chen (2024) Adjoint matching: fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. arXiv preprint arXiv:2409.08861.
- B. Efron (2011) Tweedie’s formula and selection bias. Journal of the American Statistical Association 106(496), pp. 1602–1614.
- N. Fishman, L. Klarner, E. Mathieu, M. Hutchinson, and V. De Bortoli (2023) Metropolis sampling for constrained diffusion models. Advances in Neural Information Processing Systems 36, pp. 62296–62331.
- J. Guo, X. Xu, Y. Liu, and A. Nallanathan (2025) Diffusion model for multiple antenna communication. IEEE Communications Magazine 63(10), pp. 44–50. External Links: [Document](https://dx.doi.org/10.1109/MCOM.001.2400766).
- Y. Guo, H. Yuan, Y. Yang, M. Chen, and M. Wang (2024) Gradient guidance for diffusion models: an optimization perspective. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 90736–90770. External Links: [Document](https://dx.doi.org/10.52202/079017-2881), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/a5059a9a389ccc76da85760ea79490d8-Paper-Conference.pdf).
- Z. Guo, W. Tang, and R. Xu (2026) Conditional diffusion guidance under hard constraint: a stochastic analysis approach. arXiv preprint arXiv:2602.05533.
- M. He, X. He, and X. Gao (2025a) Factor-based conditional diffusion model for portfolio optimization. In NeurIPS 2025 Workshop: Generative AI in Finance.
- Z. He, F. Héliot, and Y. Ma (2025b) Deterministic score-based diffusion model for channel estimation in ris-assisted mimo systems. In 2025 IEEE 26th International Workshop on Signal Processing and Artificial Intelligence for Wireless Communications (SPAWC), pp. 1–5. External Links: [Document](https://dx.doi.org/10.1109/SPAWC66079.2025.11143296).
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
- J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. Fleet (2022) Video diffusion models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35, pp. 8633–8646.
- J. Ho and T. Salimans (2022) Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- W. Huang, Y. Jiang, T. Van Wouwe, and C. K. Liu (2024) Constrained diffusion with trust sampling. NeurIPS.
- T. Hui, X. Tang, Y. Wang, Q. Du, D. Niyato, and Z. Han (2026) Channel-aware conditional diffusion model for secure mu-miso communications. IEEE Transactions on Vehicular Technology, pp. 1–6. External Links: ISSN 1939-9359, [Link](http://dx.doi.org/10.1109/TVT.2026.3662745), [Document](https://dx.doi.org/10.1109/tvt.2026.3662745).
- A. Hyvärinen and P. Dayan (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4).
- C. Jin and A. Agarwal (2025) Forecasting implied volatility surface with generative diffusion models. External Links: 2511.07571, [Link](https://arxiv.org/abs/2511.07571).
- T. Karras, M. Aittala, T. Aila, and S. Laine (2022) Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS.
- S. Khalafi, D. Ding, and A. Ribeiro (2024) Constrained diffusion models via dual training. Advances in Neural Information Processing Systems 37, pp. 26543–26576.
- S. Khalafi, I. Hounie, D. Ding, and A. Ribeiro (2025) Composition and alignment of diffusion models using constrained learning. arXiv preprint arXiv:2508.19104.
- H. Kim, T. Lee, H. Kim, G. De Veciana, M. A. Arfaoui, A. Koc, P. Pietraski, G. Zhang, and J. Kaewell (2025) Generative diffusion model-based compression of mimo csi. In ICC 2025 - IEEE International Conference on Communications, pp. 6323–6328. External Links: [Document](https://dx.doi.org/10.1109/ICC52391.2025.11161629).
- L. Kong, Y. Du, W. Mu, K. Neklyudov, V. D. Bortoli, D. Wu, H. Wang, A. Ferber, Y. Ma, C. P. Gomes, and C. Zhang (2025a) Diffusion models as constrained samplers for optimization with unknown constraints. External Links: 2402.18012, [Link](https://arxiv.org/abs/2402.18012).
- L. Kong, H. Wang, Y. Pan, C. W. Kim, M. Song, A. Nguyen, T. Wang, H. Xu, and M. Tambe (2025b) Robust optimization with diffusion models for green security. In Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, S. Chiappa and S. Magliacane (Eds.), Proceedings of Machine Learning Research, Vol. 286, pp. 2325–2344.
- Z\. Kong, W\. Ping, J\. Huang, K\. Zhao, and B\. Catanzaro \(2021\)DiffWave: a versatile diffusion model for audio synthesis\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.17192#S1.p1.1)\.
- T\. Lee, J\. Park, H\. Kim, and J\. G\. Andrews \(2026\)Generating high dimensional user\-specific wireless channels using diffusion models\.IEEE Transactions on Wireless Communications25\(\),pp\. 2907–2921\.External Links:[Document](https://dx.doi.org/10.1109/TWC.2025.3600286)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p1.1)\.
- G\. Liu, T\. Chen, E\. Theodorou, and M\. Tao \(2023\)Mirror diffusion models for constrained and watermarked generation\.Advances in Neural Information Processing Systems36,pp\. 42898–42917\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px3.p2.1),[§2](https://arxiv.org/html/2606.17192#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Lou and S\. Ermon \(2023\)Reflected diffusion models\.InInternational Conference on Machine Learning,pp\. 22675–22701\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px3.p2.1),[§2](https://arxiv.org/html/2606.17192#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Lu, Y\. Zhou, F\. Bao, J\. Chen, C\. Li, and J\. Zhu \(2022\)DPM\-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps\.arXiv preprint arXiv:2206\.00927\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px1.p1.1)\.
- A\. Nichol, P\. Dhariwal, A\. Ramesh, P\. Shyam, P\. Mishkin, B\. McGrew, I\. Sutskever, and M\. Chen \(2022\)GLIDE: towards photorealistic image generation and editing with text\-guided diffusion models\.External Links:2112\.10741,[Link](https://arxiv.org/abs/2112.10741)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.17192#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Q\. Nichol and P\. Dhariwal \(2021\)Improved denoising diffusion probabilistic models\.InInternational conference on machine learning,pp\. 8162–8171\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px1.p1.1)\.
- L\. Richter and J\. Berner \(2024\)Improved sampling via learned diffusions\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=h4pNROsO06)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.17192#S1.p2.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-Resolution Image Synthesis with Latent Diffusion Models\.In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Vol\.,Los Alamitos, CA, USA,pp\. 10674–10685\.External Links:ISSN,[Document](https://dx.doi.org/10.1109/CVPR52688.2022.01042)Cited by:[§1](https://arxiv.org/html/2606.17192#S1.p1.1)\.
- M\. Sendera, M\. Kim, S\. Mittal, P\. Lemos, L\. Scimeca, J\. Rector\-Brooks, A\. Adam, Y\. Bengio, and N\. Malkin \(2024\)Improved off\-policy training of diffusion samplers\.Advances in Neural Information Processing Systems37,pp\. 81016–81045\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px2.p1.2)\.
- F\. Sloboda \(1991\)A projection method of the cimmino type for linear algebraic systems\.Parallel Computing17\(4\),pp\. 435–442\.External Links:ISSN 0167\-8191,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0167-8191%2805%2980146-2),[Link](https://www.sciencedirect.com/science/article/pii/S0167819105801462)Cited by:[§5\.1](https://arxiv.org/html/2606.17192#S5.SS1.p2.1)\.
- J\. Sohl\-Dickstein, E\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InProceedings of the 32nd International Conference on Machine Learning,F\. Bach and D\. Blei \(Eds\.\),Proceedings of Machine Learning Research, Vol\.37,Lille, France,pp\. 2256–2265\.Cited by:[§1](https://arxiv.org/html/2606.17192#S1.p1.1)\.
- J\. Song, C\. Meng, and S\. Ermon \(2021a\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px1.p1.1)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.Advances in neural information processing systems32\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px1.p1.1)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2021b\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PxTIG12RRHS)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.17192#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.17192#S3.SS1.p2.23)\.
- Z\. Sun and Y\. Yang \(2023\)DIFUSCO: graph\-based diffusion solvers for combinatorial optimization\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=JV8Ff0lgVV)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px4.p1.1)\.
- T\. Takahashi and T\. Mizuno \(2025\)Generation of synthetic financial time series by diffusion models\.Quantitative Finance25\(10\),pp\. 1507–1516\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p2.1)\.
- N\. Tiwari \(2026\)Generative diffusion model for risk\-neutral derivative pricing\.External Links:2603\.20582,[Link](https://arxiv.org/abs/2603.20582)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p2.1),[§1](https://arxiv.org/html/2606.17192#S1.p1.1)\.
- Y\. B\. Uslu, S\. Hadou, S\. S\. Bidokhti, and A\. Ribeiro \(2025a\)Generative diffusion models for resource allocation in wireless networks\.In2025 IEEE 10th International Workshop on Computational Advances in Multi\-Sensor Adaptive Processing \(CAMSAP\),pp\. 201–205\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p1.1),[Appendix F](https://arxiv.org/html/2606.17192#A6.p1.1),[§1](https://arxiv.org/html/2606.17192#S1.p1.1)\.
- Y\. B\. Uslu, S\. Hadou, S\. S\. Bidokhti, and A\. Ribeiro \(2026\)Graph signal diffusion models for wireless resource allocation\.External Links:2604\.05175,[Link](https://arxiv.org/abs/2604.05175)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p1.1),[§F\.2](https://arxiv.org/html/2606.17192#A6.SS2.SSS0.Px3.p1.3),[§F\.2](https://arxiv.org/html/2606.17192#A6.SS2.SSS0.Px3.p3.1),[Table 4](https://arxiv.org/html/2606.17192#A6.T4.23.26.3.2),[Appendix F](https://arxiv.org/html/2606.17192#A6.p1.1),[§5\.2](https://arxiv.org/html/2606.17192#S5.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.17192#S5.T1.56.54.7)\.
- Y\. B\. Uslu, N\. NaderiAlizadeh, M\. Eisen, and A\. Ribeiro \(2025b\)Fast state\-augmented learning for wireless resource allocation with dual variable regression\.External Links:2506\.18748,[Link](https://arxiv.org/abs/2506.18748)Cited by:[Appendix F](https://arxiv.org/html/2606.17192#A6.p1.1)\.
- F\. Vargas, W\. S\. Grathwohl, and A\. Doucet \(2023\)Denoising diffusion samplers\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8pvnfTAbu1f)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.17192#S1.p2.1)\.
- J\. Wen and J\. Yang \(2025\)Distributionally robust optimization via diffusion ambiguity modeling\.arXiv preprint arXiv:2510\.22757\.Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px4.p1.1)\.
- Q\. Zhang and Y\. Chen \(2022\)Path integral sampler: a stochastic control approach for sampling\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.17192#S1.p2.1)\.
- X\. Zhang and J\. Yu \(2026\)Improve the training efficiency of drl for wireless communication resource allocation: the role of generative diffusion models\.IEEE Transactions on Wireless Communications25\(\),pp\. 11593–11608\.External Links:[Document](https://dx.doi.org/10.1109/TWC.2026.3660250)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p1.1)\.
- N\. Zilberstein, A\. Swami, and S\. Segarra \(2024\)Joint channel estimation and data detection in massive mimo systems based on diffusion models\.InICASSP 2024 \- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),Vol\.,pp\. 13291–13295\.External Links:[Document](https://dx.doi.org/10.1109/ICASSP48485.2024.10446413)Cited by:[Appendix A](https://arxiv.org/html/2606.17192#A1.SS0.SSS0.Px5.p1.1)\.

## Appendix A Extended Related Work

##### Diffusion models and score\-based generative modeling\.

Score-based diffusion models learn a stochastic reversal of a progressive noising process by matching the gradient of the log-density, $\nabla_{x}\log p(x)$, known as the score function, for noisy observations Hyvärinen and Dayan ([2005](https://arxiv.org/html/2606.17192#bib.bib15)). Noise-conditional score networks Song and Ermon ([2019](https://arxiv.org/html/2606.17192#bib.bib14)) train a single network across noise levels and sample via annealed Langevin dynamics, while denoising diffusion probabilistic models Ho et al. ([2020](https://arxiv.org/html/2606.17192#bib.bib20)) frame generation as iterative Gaussian denoising guided by a simplified ELBO. These formulations are unified in Song et al. ([2021b](https://arxiv.org/html/2606.17192#bib.bib16)) as instances of a common SDE whose time reversal yields a generative process. This SDE also admits an equivalent deterministic probability-flow ODE that enables exact likelihood computation and has motivated accelerated sampling schemes such as denoising diffusion implicit models (DDIMs) Song et al. ([2021a](https://arxiv.org/html/2606.17192#bib.bib21)); Nichol and Dhariwal ([2021](https://arxiv.org/html/2606.17192#bib.bib22)); Lu et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib23)); Karras et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib24)).

##### Energy\-based models and sampling from Gibbs targets\.

Energy-based models define distributions of the form $p(x)\propto\exp(-E(x))$, whose normalizing constant is generally intractable. Score-based methods sidestep this because the score $\nabla_{x}\log p(x)=-\nabla_{x}E(x)$ is independent of this constant. This has motivated the use of diffusion as a sampler from unnormalized Gibbs targets by learning a controlled SDE whose terminal law matches the target, e.g., Zhang and Chen ([2022](https://arxiv.org/html/2606.17192#bib.bib25)); Vargas et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib26)); Berner et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib27)); Richter and Berner ([2024](https://arxiv.org/html/2606.17192#bib.bib29)); Chen et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib30)); Domingo-Enrich et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib28)); Sendera et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib31)). In our setting, the Lagrangian minimizers are also Gibbs distributions whose energies are linear in the dual variables. Crucially, the energies are not fixed: they evolve during inference through dual ascent. Furthermore, the score network is trained over a distribution of dual variables, or equivalently, a distribution of target energies, rather than for a single one.

##### Guided and constrained diffusion sampling\.

Guidance techniques steer reverse diffusion processes toward improved fidelity or, more generally, toward reward or constraint alignment. They do so by modifying the score entering Tweedie's denoising formula Efron ([2011](https://arxiv.org/html/2606.17192#bib.bib42)), which expresses the MMSE estimate of a clean signal (the posterior mean) given a noisy observation in terms of the score function. Notably, classifier guidance Dhariwal and Nichol ([2021](https://arxiv.org/html/2606.17192#bib.bib40)) perturbs the predicted score with the gradient of a noise-aware classifier, trading diversity for fidelity. Classifier-free guidance Ho and Salimans ([2022](https://arxiv.org/html/2606.17192#bib.bib41)) removes the external classifier, instead interpolating between conditional and unconditional modes of a shared model during inference. Subsequent variants, such as CLIP-based guidance Nichol et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib43)), underpin most state-of-the-art diffusion systems.

Beyond guidance, an extensive body of work enforces domain-feasibility constraints on the diffusion process directly, e.g., boundary reflections Lou and Ermon ([2023](https://arxiv.org/html/2606.17192#bib.bib44)), mirror diffusion Liu et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib45)), Riemannian score-based models De Bortoli et al. ([2022](https://arxiv.org/html/2606.17192#bib.bib46)), and log-barrier constructions Fishman et al. ([2023](https://arxiv.org/html/2606.17192#bib.bib47)). A separate line of work targets constraint-aware inference. Christopher et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib48)) propose projected diffusions that apply a feasibility projection after each denoising step, while the trust sampling approach of Huang et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib50)) reformulates each reverse step as a separate constrained optimization problem.

Closest to this paper are Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)); Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)), which adopt a constrained learning formulation for average (expectation) constraints. Specifically, Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5)) shows that the optimal KL-constrained distribution is an entropy-tilted mixture of reference distributions, enforced through a primal-dual training scheme that pairs an outer-loop dual ascent with inner-loop Lagrangian score matching, with Khalafi et al. ([2025](https://arxiv.org/html/2606.17192#bib.bib6)) extending this to reward-based alignment and product/mixture compositions. Primal-dual Langevin Monte Carlo (PD-LMC) Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)) operates instead at sampling time and enforces equality and average constraints via gradient descent-ascent in Wasserstein space, without a learned generative model.

Our method shares the Lagrangian saddle-point structure of these works but differs in three key aspects: (i) the dual variable evolves jointly with the denoising trajectory rather than in a separate outer training loop Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)) or an ergodic Langevin chain (PD-LMC) Chamon et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib1)), (ii) a score network is trained across a distribution of dual variables and conditioned on each dual variable, rather than refitting an unconditional score model per dual iterate, and (iii) the aggregation of samples along the dual-ascent path yields a robust mixture of primal distributions that realizes the entropy-regularized mixture structure of Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5)), as well as the Gibbs targets of the energy-based sampling approaches discussed above, empirically at inference time rather than analytically at a single optimal multiplier.

##### Diffusion\-enabled stochastic optimization\.

Pretrained diffusion models constrained on the data manifold can be steered toward downstream objectives through gradient-based guidance or fine-tuning Guo et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib36)), making them viable tools for learning-enabled optimization. Several works have exploited this connection between guidance with optimization gradients and regularized optimization. Kong et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib33)) combined guided diffusion models and Langevin dynamics in a two-stage scheme that learns constrained samplers for optimization with unknown constraints. Black et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib34)) reinterpreted the denoising chain as a multi-step Markov Decision Process (MDP) and applied policy-gradient methods to fine-tune diffusion models on non-differentiable rewards. Beyond continuous settings, diffusion models have also advanced combinatorial optimization Sun and Yang ([2023](https://arxiv.org/html/2606.17192#bib.bib35)) and distributionally robust optimization Wen and Yang ([2025](https://arxiv.org/html/2606.17192#bib.bib37)); Kong et al. ([2025b](https://arxiv.org/html/2606.17192#bib.bib38)); Guo et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib39)).

##### Diffusion models in wireless optimization and portfolio management\.

Diffusion models recast stochastic optimization as conditional sampling from learned solution distributions. This perspective is especially relevant to wireless communication systems, whose non-convex, constrained design problems typically admit optimal solutions that are probability distributions over optimization variables. For example, Arvinte and Tamir ([2023](https://arxiv.org/html/2606.17192#bib.bib51)); Zilberstein et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib52)); He et al. ([2025b](https://arxiv.org/html/2606.17192#bib.bib53)) leveraged diffusion priors for score-based MIMO channel estimation, with subsequent works extending their use to CSI compression Lee et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib54)); Ankireddy et al. ([2025](https://arxiv.org/html/2606.17192#bib.bib55)); Kim et al. ([2025](https://arxiv.org/html/2606.17192#bib.bib56)), beamforming design Guo et al. ([2025](https://arxiv.org/html/2606.17192#bib.bib59)); Hui et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib60)), and resource allocation Zhang and Yu ([2026](https://arxiv.org/html/2606.17192#bib.bib57)); Babazadeh Darabi and Coleri ([2025](https://arxiv.org/html/2606.17192#bib.bib58)); Uslu et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib7), [2026](https://arxiv.org/html/2606.17192#bib.bib8)). Specifically for power control, Babazadeh Darabi and Coleri ([2025](https://arxiv.org/html/2606.17192#bib.bib58)) trains DDPMs to generate optimal allocations conditioned on the channel state, while Uslu et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib7), [2026](https://arxiv.org/html/2606.17192#bib.bib8)) proposes graph-signal diffusion models for matching primal-dual expert policies with near-optimal ergodic rates and cross-topology transferability.

Diffusion models have also found wide use in quantitative finance, where non-convex constraints and high-dimensional, heavy-tailed return distributions present challenges analogous to those encountered in wireless system design. For example, Takahashi and Mizuno ([2025](https://arxiv.org/html/2606.17192#bib.bib61)) proposed wavelet-DDPMs for generating scenarios and stylized facts. Choudhary et al. ([2025](https://arxiv.org/html/2606.17192#bib.bib62)); He et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib63)) tackled robust portfolio management with conditional diffusion models, while Jin and Agarwal ([2025](https://arxiv.org/html/2606.17192#bib.bib64)); Tiwari ([2026](https://arxiv.org/html/2606.17192#bib.bib65)) enforce arbitrage-free and risk-neutral diffusion dynamics for volatility forecasting.

## Appendix B Proposed Algorithms

### B.1 Score Network Training

We follow Algorithm [1](https://arxiv.org/html/2606.17192#alg1) in training the score model $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_t, t, \boldsymbol{\lambda}_t)$ using the regression loss in ([9](https://arxiv.org/html/2606.17192#S3.E9)). To compute the true score for the pairs $(\mathbf{x}_t, \boldsymbol{\lambda}_t)$, we use a Monte Carlo (MC) estimator of the score function of the Gibbs distribution, described in the following lemma.

###### Lemma 1 (MC estimator of the Gibbs score, Akhound-Sadegh et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib2))).

For any $\mathbf{y}_{\tau}$, the score function of the Gibbs distribution can be estimated by $K_{\text{MC}}$ Monte Carlo samples as

$$
\nabla_{\mathbf{y}_{\tau}}\log q_{\tau}(\mathbf{y}_{\tau}\,|\,\boldsymbol{\lambda}) ~\approx~ \nabla_{\mathbf{y}_{\tau}}\,\log\frac{1}{K_{\text{MC}}}\sum_{k=1}^{K_{\text{MC}}}\exp\Bigg(-E\Big(\,\frac{\mathbf{y}_{\tau}+\sigma_{\tau}\boldsymbol{\epsilon}_{k}}{\alpha_{\tau}},\,\boldsymbol{\lambda}\,\Big)\Bigg), \tag{16}
$$

where $\boldsymbol{\epsilon}_{k}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.

We refer the reader to Akhound-Sadegh et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib2)) for the proof.
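As an illustration of Lemma 1, the following is a minimal NumPy sketch of the estimator in (16), not the implementation used in the paper. It assumes a hypothetical quadratic energy $E(\mathbf{u},\lambda)=\tfrac12\lVert\mathbf{u}\rVert^2+\lambda(c-u_1)$ with a single scalar multiplier, so the Gibbs distribution is a unit-covariance Gaussian and the exact diffused score is available for comparison. The function names, the energy, and the use of an analytic energy gradient (instead of automatic differentiation) are all assumptions made for this example.

```python
import numpy as np

def energy(u, lam, c=0.0):
    # Hypothetical energy: E(u, lam) = 0.5*||u||^2 + lam*(c - u[0]).
    return 0.5 * np.sum(u**2, axis=-1) + lam * (c - u[..., 0])

def grad_energy(u, lam):
    # Analytic gradient of the hypothetical energy with respect to u.
    g = u.copy()
    g[..., 0] -= lam
    return g

def mc_gibbs_score(y, lam, alpha, sigma, k_mc=20_000, seed=0):
    """MC estimate of grad_y log q_tau(y | lam) following (16)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((k_mc, y.shape[0]))
    u = (y + sigma * eps) / alpha            # candidate clean points, shape (K, d)
    log_w = -energy(u, lam)                  # log of the unnormalized weights
    w = np.exp(log_w - log_w.max())          # stabilized softmax weights
    w /= w.sum()
    # Gradient of log-mean-exp: sum_k w_k * (-grad E(u_k)) * (du/dy = 1/alpha).
    return -(w[:, None] * grad_energy(u, lam)).sum(axis=0) / alpha
```

For this Gaussian toy target and a variance-preserving pair with $\alpha_\tau^2+\sigma_\tau^2=1$, the exact score is $-(\mathbf{y}-\alpha_\tau\lambda\,\mathbf{e}_1)$, which gives a simple sanity check on the estimate. Subtracting the maximum log-weight before exponentiating is a standard numerical safeguard and does not change the estimator.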

### B.2 Inference Algorithm

At inference, we run multiple chains in parallel as described in Algorithm [2](https://arxiv.org/html/2606.17192#alg2).

### B.3 Practical Considerations

In training the score model, we combine exploitation and exploration to generate training pairs $(\mathbf{x},\boldsymbol{\lambda})$. Exploitation is performed by rolling out the sampling process using the most recent score model and storing the resulting primal-dual trajectories in a replay buffer. These rollouts expose the model to the states and dual variables it is likely to encounter during inference. Exploration is introduced in two ways. First, we sample dual variables from a prior distribution to cover regions that may not be visited by the current rollouts. Second, we perturb both the primal and dual variables to improve robustness to local deviations from the replayed trajectories. The exploitation fraction $\rho_{\mathrm{exp}}(n)$ controls the balance between replayed dual variables and prior-sampled dual variables throughout training.

Although we refer to the learned network as a score model, our implementation uses the noise-prediction parameterization. For each training pair $(\mathbf{x}_t,\boldsymbol{\lambda})$, the MC estimator computes the score of the Gibbs distribution indexed by $\boldsymbol{\lambda}$ directly. Rather than regressing the network directly to this score, we reparameterize the target into the equivalent noise via $\boldsymbol{\epsilon}_t=-\sigma_{T-t}\mathbf{s}_t$, where $\mathbf{s}_t$ is the MC score estimate. The network is therefore trained to predict this noise target. During sampling, the predicted noise is converted back into score units using the same relation, and the resulting score is used in the PDI reverse update. Thus, throughout the paper, training a score model should be understood as training a noise-prediction network whose target is derived from the same MC score estimate.

In some of our numerical examples, the samples must satisfy physical domain constraints, such as box constraints on transmit powers or simplex constraints on portfolio weights. We enforce these constraints by projecting the Tweedie estimate $\widehat{\mathbf{y}}_0$ onto the corresponding domain immediately after Step 7 of Algorithm [2](https://arxiv.org/html/2606.17192#alg2). When computing the Tweedie estimate, we also divide by the clamped coefficient $\overline{\alpha}_{T-t}=\max\{\alpha_{\min},\alpha_{T-t}\}$ to avoid numerical instabilities in the low-SNR regime.
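The two practical devices above (the noise/score conversion and the clamped, projected Tweedie estimate) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the paper's code: the function names, the default `alpha_min`, and the choice of a box domain are hypothetical, and Tweedie's formula is written in its standard form $(\mathbf{x}+\sigma^2\mathbf{s})/\alpha$, which is the form consistent with $\boldsymbol{\epsilon}=-\sigma\mathbf{s}$. The sign should be matched to the score convention of the actual implementation.

```python
import numpy as np

def score_to_noise(score, sigma):
    # Reparameterize a score target into the equivalent noise: eps = -sigma * s.
    return -sigma * score

def noise_to_score(eps_hat, sigma):
    # Inverse relation used at sampling time: s = -eps / sigma.
    return -eps_hat / sigma

def projected_tweedie(x, score, alpha, sigma, alpha_min=1e-3, lo=0.0, hi=1.0):
    """Tweedie estimate with a clamped divisor, projected onto the box [lo, hi]."""
    alpha_bar = max(alpha_min, alpha)        # avoids dividing by a vanishing alpha
    y0_hat = (x + sigma**2 * score) / alpha_bar
    return np.clip(y0_hat, lo, hi)           # Euclidean projection onto a box
```

A simplex constraint, as in the portfolio example, would need a simplex projection in place of `np.clip`. The clamp only changes the estimate when $\alpha$ falls below `alpha_min`, that is, in the high-noise steps at the start of the reverse process.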

**Algorithm 1** Score Network Training

1: **Input:** decay and noise schedule $a_{0:T}, b_{0:T}$, perturbation parameters $(\rho_{\text{pert}},\epsilon_{\mathbf{x}},\epsilon_{\boldsymbol{\lambda}})$, exploitation fraction $\rho_{\text{exp}}(n)$ for all $n$

2: Initialize replay buffer $\mathcal{B}\leftarrow\emptyset$, score network $\mathbf{s}_{\boldsymbol{\theta}}$

3: **for** $n=1,\ldots,N_{\mathrm{outer}}$ **do**

4: **for each** problem $\mathcal{G}$ in a batch $\mathcal{B}_{\text{train}}$ **do** $\triangleright$ Rollouts

5: Sample $\{\mathbf{x}_{0}^{(i)}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\}_{i}$, initialize $\boldsymbol{\lambda}\leftarrow\boldsymbol{\lambda}_{0}$

6: Run inference (Algorithm [2](https://arxiv.org/html/2606.17192#alg2)) with $\mathbf{s}_{\boldsymbol{\theta}}$ to obtain $\{(\mathbf{x}_{t}^{(i)},t,\boldsymbol{\lambda},\mathcal{G})\}_{t=0}^{T-1},\ \forall i\in[I]$

7: Push $\{(\mathbf{x}_{t}^{(i)},t,\boldsymbol{\lambda},\mathcal{G})\}_{i}^{I}$ to $\mathcal{B}$

8: **end for**

9: **for** inner $=1,\ldots,N_{\mathrm{inner}}$ **do**

10: Sample a minibatch $\{(\mathbf{x}_{t}^{(j)},t^{(j)},\boldsymbol{\lambda}^{(j)},\mathcal{G}^{(j)})\}_{j=1}^{B}$ from $\mathcal{B}$

11: **for** $j=1,\ldots,B$ **do**

12: **if** $\mathrm{Bernoulli}(\rho_{\text{pert}})=1$ **then** $\triangleright$ Perturb training pairs

13: $\mathbf{x}_{t}^{(j)}\leftarrow\mathbf{x}_{t}^{(j)}+\epsilon_{\mathbf{x}}\cdot\boldsymbol{z}_{x}$, $\quad\boldsymbol{z}_{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$

14: $\boldsymbol{\lambda}^{(j)}\leftarrow[\boldsymbol{\lambda}^{(j)}+\epsilon_{\boldsymbol{\lambda}}\cdot\boldsymbol{z}_{\lambda}]_{+}$, $\quad\boldsymbol{z}_{\lambda}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$

15: **end if**

16: **if** $\mathrm{Bernoulli}(1-\rho_{\text{exp}}(n))=1$ **then**

17: $\boldsymbol{\lambda}^{(j)}\leftarrow\mathbf{v},\quad\mathbf{v}\sim\pi_{\text{prior}}(\cdot)$ $\triangleright$ Sample from a prior

18: **end if**

19: **end for**

20: $\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\alpha\nabla\frac{1}{B}\sum_{j}w(t^{(j)})\bigl\lVert\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{t}^{(j)},t^{(j)},\boldsymbol{\lambda}^{(j)},\mathcal{G}^{(j)})-\nabla\log q_{T-t^{(j)}}(\mathbf{x}_{t}^{(j)}\mid\boldsymbol{\lambda}^{(j)};\mathcal{G}^{(j)})\bigr\rVert_{2}^{2}$

21: **end for**

22: **end for**

23: **return** $\boldsymbol{\theta}$
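The exploration part of Algorithm 1 (Steps 12 to 18) can be written as a vectorized minibatch transform. The sketch below is a hedged illustration, not the authors' code: the function name, the array layout (one row per training pair), and the `sample_prior` callback are assumptions introduced here.

```python
import numpy as np

def perturb_and_mix(x, lam, rho_pert, eps_x, eps_lam, rho_exp, sample_prior, rng):
    """Apply Steps 12-18 of Algorithm 1 to a minibatch.

    x: (B, d) noisy primal states; lam: (B, m) dual variables.
    sample_prior(B, rng) is a hypothetical callback returning (B, m) prior draws.
    """
    batch = x.shape[0]
    # Steps 12-15: with probability rho_pert, jitter both x and lambda.
    perturb = rng.random(batch) < rho_pert
    x_new = x + perturb[:, None] * eps_x * rng.standard_normal(x.shape)
    lam_pert = np.maximum(0.0, lam + eps_lam * rng.standard_normal(lam.shape))
    lam_new = np.where(perturb[:, None], lam_pert, lam)
    # Steps 16-18: with probability 1 - rho_exp, replace lambda by a prior draw.
    explore = rng.random(batch) < 1.0 - rho_exp
    lam_new = np.where(explore[:, None], sample_prior(batch, rng), lam_new)
    return x_new, lam_new
```

As in the algorithm, the prior replacement is applied after the perturbation, so a pair selected by both branches ends up with a prior-sampled multiplier. The clipping with `np.maximum` implements the projection $[\cdot]_+$ that keeps perturbed multipliers nonnegative.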

**Algorithm 2** PDI Dynamics for a given problem instance $\mathcal{G}$

1: **Input:** trained score network $\mathbf{s}_{\theta^{*}}$, dual step size $\eta_{t}$, decay and noise schedule $a_{0:T}, b_{0:T}$

2: Initialize replay buffer $\mathcal{B}\leftarrow\emptyset$

3: **for** chain $=1,\dots,C$ **do**

4: Sample $\mathbf{x}_{0}^{(i)}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\quad i=1,\dots,I$ $\triangleright$ $I$ i.i.d. samples

5: Initialize $\boldsymbol{\lambda}\leftarrow\boldsymbol{\lambda}_{0}$

6: **for** $t=0,\ldots,T-1$ **do**

7: $\mathbf{x}_{t+1}^{(i)}~=~\frac{1}{\sqrt{a_{T-t}}}\Big(\,\mathbf{x}_{t}^{(i)}+b_{T-t}\,\mathbf{s}_{\theta^{*}}(\mathbf{x}_{t}^{(i)},t,\boldsymbol{\lambda},\mathcal{G})\Big)+\sqrt{b_{T-t}}\,\mathbf{z},\quad\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$

8: $\widehat{\mathbf{y}}_{0}^{(i)}\leftarrow\big(\mathbf{x}_{t+1}^{(i)}-\sigma_{T-t-1}^{2}\,\mathbf{s}_{\theta^{*}}(\mathbf{x}_{t+1}^{(i)},t+1,\boldsymbol{\lambda},\mathcal{G})\big)\,/\,\overline{\alpha}_{T-t-1}$ $\triangleright$ Tweedie estimate

9: $\boldsymbol{\lambda}\leftarrow\left[\boldsymbol{\lambda}+\eta_{t}\cdot\sum_{i}\mathbf{f}\left(\widehat{\mathbf{y}}_{0}^{(i)}\right)/I\right]_{+}$ $\triangleright$ Dual ascent

10: **end for**

11: Push $\{\mathbf{x}_{T}^{(i)}\}_{i}$ to $\mathcal{B}$

12: **end for**

13: **return** $\mathcal{B}$
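To make the inner loop of Algorithm 2 concrete, the following sketch runs a single chain on a toy problem whose dual-conditioned score is known in closed form, so no trained network is required. Everything specific here is an assumption made for illustration and is not taken from the paper: a one-dimensional objective $f_0(y)=y^2/2$ with one average constraint $f(y)=c-y\le 0$ and $\beta=1$ (so the Gibbs distribution is $\mathcal{N}(\lambda,1)$ and the optimal multiplier is $\lambda^\star=c$), a variance-preserving linear schedule with $a_\tau=1-b_\tau$, $\alpha_\tau^2=\prod_{k\le\tau}a_k$ and $\sigma_\tau^2=1-\alpha_\tau^2$, a constant dual step size, and the function names. The Tweedie step uses the standard form $(x+\sigma^2 s)/\alpha$; the sign should be matched to the score convention of the actual implementation.

```python
import numpy as np

def make_schedule(T, b_min=1e-3, b_max=0.1):
    # Index tau = 0..T; b[0] = 0 so that alpha[0] = 1 and sigma[0] = 0.
    b = np.concatenate([[0.0], np.linspace(b_min, b_max, T)])
    a = 1.0 - b
    abar = np.cumprod(a)
    return a, b, np.sqrt(abar), np.sqrt(1.0 - abar)

def gibbs_score(x, tau, lam, alpha):
    # Exact score of the diffused toy Gibbs target N(lam, 1): N(alpha_tau*lam, 1).
    return -(x - alpha[tau] * lam)

def pdi_sample(c=1.5, T=200, n=20_000, eta=0.1, alpha_min=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    a, b, alpha, sigma = make_schedule(T)
    x = rng.standard_normal(n)               # I = n i.i.d. starting samples
    lam, lam_path = 0.0, []
    for t in range(T):
        tau = T - t
        # Reverse update with the score of the current multiplier.
        z = rng.standard_normal(n)
        x = (x + b[tau] * gibbs_score(x, tau, lam, alpha)) / np.sqrt(a[tau])
        x = x + np.sqrt(b[tau]) * z
        # Tweedie estimate at the new noise level, with clamped divisor.
        s_next = gibbs_score(x, tau - 1, lam, alpha)
        y0_hat = (x + sigma[tau - 1] ** 2 * s_next) / max(alpha_min, alpha[tau - 1])
        # Projected dual ascent on the estimated average constraint violation.
        lam = max(0.0, lam + eta * float(np.mean(c - y0_hat)))
        lam_path.append(lam)
    return x, np.array(lam_path)
```

In this toy setting the multiplier is expected to move from $\lambda_0=0$ toward $c$ during the early, high-noise steps, and the terminal samples are expected to have mean close to $c$, so the average constraint holds approximately with equality. Running several chains and pooling their terminal samples, as Algorithm 2 does, amounts to calling `pdi_sample` with different seeds.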

## Appendix C Analytical Proofs

The dual function of ([3](https://arxiv.org/html/2606.17192#S3.Ex1)) is defined as

$$
g(\boldsymbol{\lambda})~=~\mathcal{L}(\mu_{\boldsymbol{\lambda}}^{\dagger},\boldsymbol{\lambda})~=~\min_{\mu\in\mathcal{P}_{2}(\mathcal{X})}\,\mathcal{L}(\mu,\boldsymbol{\lambda}), \tag{17}
$$

where $\mu_{\boldsymbol{\lambda}}^{\dagger}$ is the Gibbs distribution induced by the dual variable $\boldsymbol{\lambda}$. We can then write the dual function in closed form as

$$
g(\boldsymbol{\lambda})~=~\mathbb{E}_{\mu_{\boldsymbol{\lambda}}^{\dagger}}\Big[f_{0}(\mathbf{x})+\boldsymbol{\lambda}^{\top}\mathbf{f}(\mathbf{x})\Big]-\beta\mathcal{H}(\mu_{\boldsymbol{\lambda}}^{\dagger})~=~-\beta\log Z(\boldsymbol{\lambda}). \tag{18}
$$
The true gradient of the dual function is

$$
\nabla g(\boldsymbol{\lambda}_t) = \mathbb{E}_{\mu_{\boldsymbol{\lambda}_t}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_0)\big], \tag{19}
$$

which is the constraint violation evaluated on clean samples of the distribution $\mu_{\boldsymbol{\lambda}_t}^{\dagger}$. PDI replaces this gradient with

$$
\widehat{\nabla} g(\boldsymbol{\lambda}_t) = \mathbb{E}_{\mathbf{x}_{t+1} \sim p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big], \tag{20}
$$

where $p_t$ is the marginal distribution at time $t$ and $\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]$ is the Tweedie posterior mean under the score model $\mathbf{s}_{\boldsymbol{\theta}}$. This formula replaces clean samples with the Tweedie posterior means under the forward process started with $\mu_{\boldsymbol{\lambda}_t}^{\dagger}$. The bias in this gradient comes from three sources: i) replacing the score field with a parameterized network $\mathbf{s}_{\boldsymbol{\theta}}$, ii) propagating the dual mismatch and score-approximation error across diffusion steps (marginal mismatch), and iii) replacing clean samples $\mathbf{y}_0$ with a posterior mean $\mathbb{E}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]$ (Tweedie's error).

More concretely, we write the difference between the estimated gradient $\widehat{\nabla} g$ and the true gradient $\nabla g$ as

$$
\begin{align}
\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t) &\coloneqq \widehat{\nabla} g(\boldsymbol{\lambda}_t) - \nabla g(\boldsymbol{\lambda}_t) \tag{21} \\
&= \mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] - \mathbb{E}_{\mu_{\boldsymbol{\lambda}_t}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_0)\big] \tag{22} \\
&= \mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] - \mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] \tag{23} \\
&\quad + \mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\,\mathbf{y}_0 \,|\, \mathbf{x}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] - \mathbb{E}_{\widetilde{p}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\,\mathbf{y}_0 \,|\, \widetilde{\mathbf{x}}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] \notag \\
&\quad + \mathbb{E}_{\widetilde{p}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\,\mathbf{y}_0 \,|\, \widetilde{\mathbf{x}}_{t+1}, \boldsymbol{\lambda}_t\,\big]\Big)\Big] - \mathbb{E}_{\mu_{\boldsymbol{\lambda}_t}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_0)\big]. \tag{24}
\end{align}
$$

The three differences in (23)–(24) are, in order, the parameterization error, the marginal mismatch, and Tweedie's error. We bound the norm of the three terms in Lemmas [3](https://arxiv.org/html/2606.17192#Thmlemma3), [4](https://arxiv.org/html/2606.17192#Thmlemma4) and [5](https://arxiv.org/html/2606.17192#Thmlemma5) (Section [C.4](https://arxiv.org/html/2606.17192#A3.SS4)).

The bias in the gradient results in dual convergence with a small but irreducible floor error. Theorem [1](https://arxiv.org/html/2606.17192#Thmtheorem1) characterizes this floor error and shows that it is controlled by the expected Tweedie posterior-mean error and the approximation error of the parameterization. Theorem [2](https://arxiv.org/html/2606.17192#Thmtheorem2) then analyzes how the residual errors in the multipliers and score fields propagate through the reverse dynamics and affect the terminal law $p_T$. In Sections [C.2](https://arxiv.org/html/2606.17192#A3.SS2) and [C.3](https://arxiv.org/html/2606.17192#A3.SS3), we provide the analytical proofs of these theorems. The assumptions under which our analysis holds are listed in Section [C.1](https://arxiv.org/html/2606.17192#A3.SS1).

### C.1 Assumptions

Our proof holds under the following set of assumptions:

###### Assumption 2 (Constraint regularity).

The constraint function $\mathbf{f}$ is $L_f$-Lipschitz, i.e., $\|\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{y})\| \leq L_f\,\|\mathbf{x} - \mathbf{y}\|$. Also, the constraint function is bounded on the compact set $\mathcal{X}$, i.e., $\sup_{\mathbf{x} \in \mathcal{X}} \|\mathbf{f}(\mathbf{x})\| \leq R$.

###### Assumption 3 (Strong dual concavity).

The dual function $g$ is $\vartheta$-strongly concave, i.e., $\nabla^2_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda}) \preceq -\vartheta\,\mathbf{I}$, for all $\boldsymbol{\lambda} \in \boldsymbol{\Lambda}$.

The strong concavity modulus $\vartheta$ is not an independent quantity but is determined by the temperature $\beta$ and the constraint covariance under the induced Gibbs-distribution family. Since the dual function is $g(\boldsymbol{\lambda}) = -\beta \log Z(\boldsymbol{\lambda})$, its Hessian is given by

$$
\nabla^2 g(\boldsymbol{\lambda}) = -\frac{1}{\beta}\,\mathbf{Cov}_{\mu_{\boldsymbol{\lambda}}^{\dagger}}\big(\mathbf{f}(\mathbf{x})\big),
$$

and $\vartheta = \frac{1}{\beta} \inf_{\boldsymbol{\lambda} \in \boldsymbol{\Lambda}} \sigma_{\min}\Big(\mathbf{Cov}_{\mu_{\boldsymbol{\lambda}}^{\dagger}}\big(\mathbf{f}(\mathbf{x})\big)\Big)$, where $\sigma_{\min}$ is the smallest eigenvalue of the covariance matrix. This makes Assumption [3](https://arxiv.org/html/2606.17192#Thmassumption3) a mild and verifiable condition: the modulus $\vartheta$ is strictly positive whenever the constraint covariance is uniformly nonsingular over $\boldsymbol{\Lambda}$, and its smallest eigenvalue can be estimated directly from the samples generated during inference.
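The gradient and Hessian formulas above can be checked numerically on a finite state space, where $Z(\boldsymbol{\lambda})$ is a finite sum. The following is a minimal sketch under that assumption (counting measure on a hypothetical set of `n_states` points with random values for $f_0$ and $\mathbf{f}$); the function names are illustrative and not part of the paper.

```python
import numpy as np

def gibbs_probs(lam, f0, F, beta):
    """Probabilities of the Gibbs distribution, proportional to exp(-(f0 + F @ lam) / beta)."""
    logits = -(f0 + F @ lam) / beta
    p = np.exp(logits - logits.max())          # subtract the max for numerical stability
    return p / p.sum()

def dual_value(lam, f0, F, beta):
    """g(lam) = -beta * log Z(lam), with Z a finite sum over states."""
    logits = -(f0 + F @ lam) / beta
    m = logits.max()
    return -beta * (m + np.log(np.exp(logits - m).sum()))

def dual_grad(lam, f0, F, beta):
    """Gradient of g: expected constraint value under the Gibbs distribution."""
    return gibbs_probs(lam, f0, F, beta) @ F

def dual_hessian(lam, f0, F, beta):
    """Hessian of g: minus the constraint covariance divided by beta."""
    p = gibbs_probs(lam, f0, F, beta)
    centered = F - p @ F
    cov = centered.T @ (p[:, None] * centered)
    return -cov / beta

def make_toy(n_states=50, n_constraints=3, seed=0):
    """Random objective values f0 (n_states,) and constraint values F (n_states, M)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n_states), rng.normal(size=(n_states, n_constraints))
```

At a single $\boldsymbol{\lambda}$, the local concavity modulus is `-np.linalg.eigvalsh(dual_hessian(lam, f0, F, beta)).max()`; the modulus $\vartheta$ in Assumption 3 is the infimum of this quantity over $\boldsymbol{\Lambda}$.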

###### Assumption 4 (Rich parameterization).

For any function $\mathbf{s}_t$, there exists a parameterization $\boldsymbol{\theta}$ such that, uniformly over $\boldsymbol{\lambda} \in \boldsymbol{\Lambda}$, it holds that

$$
\mathbb{E}_{p_t, \mathcal{G}}\Big[\big\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_t, t, \boldsymbol{\lambda}, \mathcal{G}) - \mathbf{s}_t(\mathbf{x}_t \,|\, \boldsymbol{\lambda}; \mathcal{G})\big\|_2^2\Big] \leq \epsilon_{\text{app}}^2(t). \tag{25}
$$

Assumption 4 is an approximation condition on the score network. We require it only uniformly over the compact set $\boldsymbol{\Lambda}$ in which the dual iterates evolve. The error $\epsilon_{\text{app}}^2(t)$ aggregates three sources: i) the finite expressivity of the parameterization, ii) the variance of the Monte Carlo score estimator (Lemma [1](https://arxiv.org/html/2606.17192#Thmlemma1)) used to form the regression targets, and iii) the mismatch between the rollout-induced training distribution of $(\mathbf{x}_t, \boldsymbol{\lambda})$ pairs and the distribution encountered at inference. The expressivity of the parameterization controls the first source, and our training procedure is designed to control the latter two: the exploration/exploitation rollout scheme with a replay buffer reduces the distribution shift, and the $K_{\text{MC}}$ candidates reduce the estimator variance. Our numerical experiments show that the difference between PDI with a trained score model (PDI–Net) and PDI with an MC score field (PDI–MC) is negligible. In some cases, PDI–Net improves on PDI–MC in feasibility and tail-rate metrics.

###### Assumption 5.

Assume the noised optimal score is spatially Lipschitz, i.e., $\nabla^2_{\mathbf{x}} \log q_{T-t}(\mathbf{x} \,|\, \boldsymbol{\lambda})$ exists and is bounded in operator norm, $\sup_{\mathbf{x} \in \mathcal{X}} \big\|\nabla^2_{\mathbf{x}} \log q_{T-t}(\mathbf{x} \,|\, \boldsymbol{\lambda})\big\|_{\text{op}} \leq M_t$, for each $t$ and $\boldsymbol{\lambda} \in \boldsymbol{\Lambda}$.

### C.2 Dual Convergence: Proof of Theorem [1](https://arxiv.org/html/2606.17192#Thmtheorem1)

###### Proof.

From Lemma [2](https://arxiv.org/html/2606.17192#Thmlemma2), the dual recursion is

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2\big] + \eta\,\mathbb{E}\big[D^* - g(\boldsymbol{\lambda}_t)\big] + 2\eta\beta\,\mathbb{E}\big[D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*)\big] \leq \mathbb{E}\big[\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2\big] + \frac{2\eta}{\vartheta}\,\mathbb{E}\big[\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2\big] + \eta^2 R^2. \tag{26}
$$

Summing over $t = 0, \ldots, T-1$ and telescoping, we get

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_T - \boldsymbol{\lambda}^*\|^2\big] + \eta \sum_{t=0}^{T-1} \mathbb{E}\big[D^* - g(\boldsymbol{\lambda}_t)\big] + 2\eta\beta \sum_{t=0}^{T-1} \mathbb{E}\big[D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*)\big] \leq \mathbb{E}\big[\|\boldsymbol{\lambda}_0 - \boldsymbol{\lambda}^*\|^2\big] + \frac{2\eta}{\vartheta} \sum_{t=0}^{T-1} \mathbb{E}\big[\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2\big] + T\eta^2 R^2. \tag{27}
$$

The bias comprises three components as described in ([24](https://arxiv.org/html/2606.17192#A3.E24)). The Tweedie posterior mean error is bounded by Lemma [3](https://arxiv.org/html/2606.17192#Thmlemma3):

$$
\mathbb{E}\Big[\|B_1(t+1)\|^2\Big] \leq L_f^2\,\epsilon_{TW}^2(t+1). \tag{28}
$$

The marginal mismatch error is bounded by Lemma [4](https://arxiv.org/html/2606.17192#Thmlemma4):

$$
\mathbb{E}\Big[\|B_2(t+1)\|^2\Big] \leq L_h^2 \Big((1+\delta)\,C_{\text{hist}}^2\,\eta^2 R^2 + \big(1 + \tfrac{1}{\delta}\big)\,\epsilon_{\text{hist}}^2\Big). \tag{29}
$$

Finally, the parameterization error is bounded by Lemma [5](https://arxiv.org/html/2606.17192#Thmlemma5):

$$
\mathbb{E}\Big[\|B_3(t+1)\|^2\Big] \leq L_f^2\,\frac{\sigma_{T-t}^4}{\overline{\alpha}_{T-t}^2}\,\epsilon_{\text{app}}^2(t+1). \tag{30}
$$

Since $\|a + b + c\|^2 \leq 3\big(\|a\|^2 + \|b\|^2 + \|c\|^2\big)$, the total bias can then be bounded as

$$
\begin{align}
\mathbb{E}\Big[\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2\Big] &\leq 3\,\mathbb{E}\Big[\|B_1(t+1)\|^2\Big] + 3\,\mathbb{E}\Big[\|B_2(t+1)\|^2\Big] + 3\,\mathbb{E}\Big[\|B_3(t+1)\|^2\Big] \tag{31} \\
&\leq 3L_f^2\,\epsilon_{TW}^2(t+1) + 3L_h^2 \Big((1+\delta)\,C_{\text{hist}}^2\,\eta^2 R^2 + \big(1 + \tfrac{1}{\delta}\big)\,\epsilon_{\text{hist}}^2\Big) \tag{32} \\
&\quad + 3L_f^2\,\frac{\sigma_{T-t}^4}{\overline{\alpha}_{T-t}^2}\,\epsilon_{\text{app}}^2(t+1). \tag{33}
\end{align}
$$

Substituting this bound into the telescoped inequality ([27](https://arxiv.org/html/2606.17192#A3.Ex7)), its right-hand side, denoted $\mathcal{I}$, satisfies

$$
\begin{align}
\mathcal{I} \leq{}& \mathbb{E}\big[\|\boldsymbol{\lambda}_0 - \boldsymbol{\lambda}^*\|^2\big] + \frac{6L_f^2\eta}{\vartheta} \sum_{t=0}^{T-1} \epsilon_{TW}^2(t+1) + (1+\delta)\,\frac{6L_h^2\eta}{\vartheta}\,T\,C_{\text{hist}}^2\,\eta^2 R^2 \tag{34} \\
& + \frac{6L_h^2\eta}{\vartheta}\Big(1 + \frac{1}{\delta}\Big) T\,\epsilon_{\text{hist}}^2 + \frac{6L_f^2\eta}{\vartheta} \sum_{t=0}^{T-1} \frac{\sigma_{T-t}^4}{\overline{\alpha}_{T-t}^2}\,\epsilon_{\text{app}}^2(t+1) + T\eta^2 R^2. \tag{35}
\end{align}
$$

Choose $\eta = \kappa/\sqrt{T}$. Since the left-hand-side terms in ([27](https://arxiv.org/html/2606.17192#A3.Ex7)) are non-negative, each one of them is bounded above by $\mathcal{I}$. Thus, dividing by $\eta T$, we get a uniform average dual gap of

$$
\begin{align}
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[D^* - g(\boldsymbol{\lambda}_t)\big] &\leq \frac{\mathcal{I}}{\eta T} \tag{36} \\
&\leq \underbrace{\frac{\mathbb{E}\big[\|\boldsymbol{\lambda}_0 - \boldsymbol{\lambda}^*\|^2\big]}{\eta T}}_{\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)} + \frac{6L_f^2}{\vartheta T} \sum_{t=0}^{T-1} \epsilon_{TW}^2(t+1) + \underbrace{(1+\delta)\,\frac{6L_h^2}{\vartheta}\,C_{\text{hist}}^2 R^2 \eta^2}_{\mathcal{O}\left(\frac{1}{T}\right)} \notag \\
&\quad + \frac{6L_h^2}{\vartheta}\Big(1 + \frac{1}{\delta}\Big)\epsilon_{\text{hist}}^2 + \frac{6L_f^2}{\vartheta T} \sum_{t=0}^{T-1} \frac{\sigma_{T-t}^4}{\overline{\alpha}_{T-t}^2}\,\epsilon_{\text{app}}^2(t+1) + \underbrace{\eta R^2}_{\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)} \tag{37} \\
&= \mathcal{O}\left(\frac{1}{\sqrt{T}}\right) + \frac{6L_f^2}{\vartheta T} \sum_{t=0}^{T-1} \Big(\epsilon_{TW}^2(t+1) + \frac{\sigma_{T-t}^4}{\overline{\alpha}_{T-t}^2}\,\epsilon_{\text{app}}^2(t+1)\Big) + \frac{6L_h^2}{\vartheta}\Big(1 + \frac{1}{\delta}\Big)\epsilon_{\text{hist}}^2. \tag{38}
\end{align}
$$

The same argument applies to the time-averaged KL term on the left-hand side of ([27](https://arxiv.org/html/2606.17192#A3.Ex7)). Both quantities converge at rate $1/\sqrt{T}$ to a neighborhood determined by the errors $\epsilon_{TW}$, $\epsilon_{\text{app}}$, and $\epsilon_{\text{hist}}$ accumulated across diffusion steps.

Define the time-average dual variable $\bar{\boldsymbol{\lambda}} = \tfrac{1}{T} \sum_{t=0}^{T-1} \boldsymbol{\lambda}_t$. By concavity of $g$ (Jensen's inequality), we have

$$
\mathbb{E}\big[D^* - g(\bar{\boldsymbol{\lambda}})\big] \leq \mathbb{E}\Big[D^* - \frac{1}{T} \sum_{t=0}^{T-1} g(\boldsymbol{\lambda}_t)\Big] = \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[D^* - g(\boldsymbol{\lambda}_t)\big], \tag{39}
$$

which is bounded by ([38](https://arxiv.org/html/2606.17192#A3.E38)). Furthermore, by the strong concavity of $g$, we can also write

$$
\mathbb{E}\big[\|\bar{\boldsymbol{\lambda}} - \boldsymbol{\lambda}^*\|^2\big] \leq \frac{2}{\vartheta}\,\mathbb{E}\big[D^* - g(\bar{\boldsymbol{\lambda}})\big], \tag{40}
$$

which completes the proof. ∎
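The qualitative behavior in (38) to (40), a $1/\sqrt{T}$ transient plus a floor set by the gradient bias, can be seen in a small simulation. The sketch below is illustrative only: it assumes a scalar quadratic surrogate $g(\lambda) = -\tfrac{\vartheta}{2}(\lambda - \lambda^*)^2$ and a constant additive bias in place of the three bias terms, neither of which comes from the paper, and the function name is hypothetical.

```python
import numpy as np

def biased_dual_ascent(T, kappa=1.0, theta=1.0, lam_star=1.0, bias=0.0, lam0=0.0):
    """Projected dual ascent on a quadratic surrogate dual with a constant gradient bias.

    Returns the time average of the iterates lam_0, ..., lam_{T-1}.
    """
    eta = kappa / np.sqrt(T)                    # step size eta = kappa / sqrt(T)
    lam, history = lam0, []
    for _ in range(T):
        history.append(lam)
        grad = -theta * (lam - lam_star)        # exact gradient of the surrogate dual
        lam = max(0.0, lam + eta * (grad + bias))   # biased ascent step, projected onto lam >= 0
    return float(np.mean(history))

# Without bias the averaged iterate approaches lam_star as T grows;
# with bias it settles near lam_star + bias / theta instead.
err_short = abs(biased_dual_ascent(100) - 1.0)
err_long = abs(biased_dual_ascent(10_000) - 1.0)
```

For this surrogate the biased iteration has fixed point $\lambda^* + \text{bias}/\vartheta$, which plays the role of the irreducible neighborhood in the theorem.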

###### Lemma 2.

Under Assumptions [2](https://arxiv.org/html/2606.17192#Thmassumption2) and [3](https://arxiv.org/html/2606.17192#Thmassumption3), and with $\eta_t = \eta$, the expected distance between $\boldsymbol{\lambda}_t$ and the optimum satisfies

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2\big] + \eta\,\mathbb{E}\big[D^* - g(\boldsymbol{\lambda}_t)\big] + 2\eta\beta\,\mathbb{E}\big[D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*)\big] \leq \mathbb{E}\big[\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2\big] + \frac{2\eta}{\vartheta}\,\mathbb{E}\big[\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2\big] + \eta^2 R^2. \tag{41}
$$

###### Proof.

By the non-expansive property of the projection operator onto $\mathbb{R}_+^M$, we have $\|[\boldsymbol{\lambda}]_+ - \boldsymbol{\lambda}^*\| \leq \|\boldsymbol{\lambda} - \boldsymbol{\lambda}^*\|$, and therefore,

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2 \,|\, \mathcal{F}_t\big] \leq \|\boldsymbol{\lambda}_t + \eta\,\widehat{\nabla} g(\boldsymbol{\lambda}_t) - \boldsymbol{\lambda}^*\|^2, \tag{42}
$$

where $\mathcal{F}_t$ is the filtration generated by the history of the PDI process up to time $t$.

Define the true gradient of the dual function as

$$
\nabla g(\boldsymbol{\lambda}_t) = \mathbb{E}_{\mu_{\boldsymbol{\lambda}_t}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_0)\big],
$$

which is the constraint violation evaluated on clean samples under the current Gibbs distribution $\mu_{\boldsymbol{\lambda}_t}^{\dagger}$. PDI replaces this true gradient with a biased one, $\widehat{\nabla} g(\boldsymbol{\lambda}_t)$, and the bias $\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)$ is split into three terms, as characterized in ([24](https://arxiv.org/html/2606.17192#A3.E24)). Thus, we can write

$$
\begin{align}
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2 \,|\, \mathcal{F}_t\big] &\leq \|\boldsymbol{\lambda}_t + \eta\,\widehat{\nabla} g(\boldsymbol{\lambda}_t) - \boldsymbol{\lambda}^*\|^2 \tag{43} \\
&= \|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2 + 2\eta\,\langle \boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*, \widehat{\nabla} g(\boldsymbol{\lambda}_t) \rangle + \eta^2\,\|\widehat{\nabla} g(\boldsymbol{\lambda}_t)\|^2 \tag{44} \\
&= \|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2 + 2\eta\,\langle \boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*, \nabla g(\boldsymbol{\lambda}_t) \rangle + 2\eta\,\langle \boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*, \mathbf{b}_{t+1}(\boldsymbol{\lambda}_t) \rangle + \eta^2\,\|\widehat{\nabla} g(\boldsymbol{\lambda}_t)\|^2. \tag{45}
\end{align}
$$
By Lemma [8](https://arxiv.org/html/2606.17192#Thmlemma8), the Lagrangian function can be written as

$$
\mathcal{L}(\mu, \boldsymbol{\lambda}) = \beta\,D_{\text{KL}}(\mu \,\|\, \mu_{\boldsymbol{\lambda}}^{\dagger}) + g(\boldsymbol{\lambda}),
$$

which leads to

$$
\begin{align}
\langle \boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*, \nabla g(\boldsymbol{\lambda}_t) \rangle &= \mathcal{L}(\mu_{\boldsymbol{\lambda}_t}^{\dagger}, \boldsymbol{\lambda}_t) - \mathcal{L}(\mu_{\boldsymbol{\lambda}_t}^{\dagger}, \boldsymbol{\lambda}^*) \tag{46} \\
&= \underbrace{g(\boldsymbol{\lambda}_t) - g(\boldsymbol{\lambda}^*)}_{\text{duality gap}} - \beta\,D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*). \tag{47}
\end{align}
$$

The first equality follows from the definition in ([2](https://arxiv.org/html/2606.17192#S3.E2)), since the Lagrangian is linear in $\boldsymbol{\lambda}$, and the second follows from the identity in Lemma [8](https://arxiv.org/html/2606.17192#Thmlemma8). Using Young's inequality, for any $c > 0$, we can bound the inner product between $\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*$ and the bias by

$$
2\,\langle \boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*, \mathbf{b}_{t+1}(\boldsymbol{\lambda}_t) \rangle \leq c\,\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2 + \frac{1}{c}\,\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2. \tag{48}
$$

Choose $c = \vartheta/2$. Thus, we get

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2 \,|\, \mathcal{F}_t\big] + 2\eta\,\big(g(\boldsymbol{\lambda}^*) - g(\boldsymbol{\lambda}_t)\big) + 2\eta\beta\,D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*) \leq \Big(1 + \frac{\eta\vartheta}{2}\Big)\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2 + \frac{2\eta}{\vartheta}\,\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2 + \eta^2\,\|\widehat{\nabla} g(\boldsymbol{\lambda}_t)\|^2. \tag{49}
$$

From the strong concavity of $g$ (Assumption [3](https://arxiv.org/html/2606.17192#Thmassumption3)), we have $D^* - g(\boldsymbol{\lambda}_t) \geq \frac{\vartheta}{2}\,\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2$ with $D^* = g(\boldsymbol{\lambda}^*)$.

We then split the dual gap $2\eta\,(D^* - g(\boldsymbol{\lambda}_t))$ on the LHS into two halves and lower bound one of them by $\frac{\eta\vartheta}{2}\,\|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2$, which cancels the matching term on the RHS. Thus, we have

$$
\mathbb{E}\big[\|\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}^*\|^2 \,|\, \mathcal{F}_t\big] + \eta\,\big(D^* - g(\boldsymbol{\lambda}_t)\big) + 2\eta\beta\,D_{\text{KL}}(\mu_{\boldsymbol{\lambda}_t}^{\dagger} \,\|\, \mu^*) \leq \|\boldsymbol{\lambda}_t - \boldsymbol{\lambda}^*\|^2 + \frac{2\eta}{\vartheta}\,\|\mathbf{b}_{t+1}(\boldsymbol{\lambda}_t)\|^2 + \eta^2\,\|\widehat{\nabla} g(\boldsymbol{\lambda}_t)\|^2. \tag{50}
$$

Taking the total expectation and bounding the gradient norm by $R$ (Assumption [2](https://arxiv.org/html/2606.17192#Thmassumption2)) completes the proof. ∎
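The two identities used in this proof, the KL form of the Lagrangian and the inner-product identity in (46) to (47), can be verified numerically on a finite state space. The sketch below assumes a counting measure on a hypothetical set of random states; all function names are illustrative. The inner-product identity is checked for an arbitrary pair of multipliers, since its derivation does not use optimality of the second one.

```python
import numpy as np

def gibbs_probs(lam, f0, F, beta):
    """Gibbs distribution proportional to exp(-(f0 + F @ lam) / beta) on a finite set."""
    logits = -(f0 + F @ lam) / beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def dual_value(lam, f0, F, beta):
    """g(lam) = -beta * log Z(lam)."""
    logits = -(f0 + F @ lam) / beta
    m = logits.max()
    return -beta * (m + np.log(np.exp(logits - m).sum()))

def lagrangian(mu, lam, f0, F, beta):
    """L(mu, lam) = E_mu[f0 + lam^T f] - beta * H(mu), for strictly positive mu."""
    entropy = -(mu * np.log(mu)).sum()
    return mu @ (f0 + F @ lam) - beta * entropy

def kl(p, q):
    """KL divergence between strictly positive discrete distributions."""
    return (p * np.log(p / q)).sum()

def check_identities(seed=0, n_states=40, n_constraints=2, beta=1.5):
    """Return residuals of the Lagrangian identity and the inner-product identity."""
    rng = np.random.default_rng(seed)
    f0 = rng.normal(size=n_states)
    F = rng.normal(size=(n_states, n_constraints))
    lam_a = rng.uniform(0, 2, n_constraints)
    lam_b = rng.uniform(0, 2, n_constraints)
    mu = rng.dirichlet(np.ones(n_states))          # arbitrary test distribution
    p_a = gibbs_probs(lam_a, f0, F, beta)
    p_b = gibbs_probs(lam_b, f0, F, beta)
    r1 = lagrangian(mu, lam_a, f0, F, beta) - (beta * kl(mu, p_a) + dual_value(lam_a, f0, F, beta))
    lhs = (lam_a - lam_b) @ (p_a @ F)              # <lam_a - lam_b, grad g(lam_a)>
    rhs = dual_value(lam_a, f0, F, beta) - dual_value(lam_b, f0, F, beta) - beta * kl(p_a, p_b)
    return r1, lhs - rhs
```

Both residuals should vanish up to floating-point error.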

### C.3 Primal Recursion: Proof of Theorem [2](https://arxiv.org/html/2606.17192#Thmtheorem2)

###### Proof.

Let $K_t^{\boldsymbol{\lambda}_t}$ be the kernel of the parametrized PDI process, i.e., $p_{t+1} = p_t K_t^{\boldsymbol{\lambda}_t}$, and let $K_t^{\boldsymbol{\lambda}^*}$ be its counterpart in the backward process with a fixed dual variable $\boldsymbol{\lambda}^*$ and the true score, i.e., $q_{T-t-1}^{\boldsymbol{\lambda}^*} = q_{T-t}^{\boldsymbol{\lambda}^*} K_t^{\boldsymbol{\lambda}^*}$. By the triangle inequality, we get

$$
\begin{align}
W_2\big(p_{t+1}, q_{T-t-1}^{\boldsymbol{\lambda}^*}\big) &= W_2\big(p_t K_t^{\boldsymbol{\lambda}_t}, q_{T-t}^{\boldsymbol{\lambda}^*} K_t^{\boldsymbol{\lambda}^*}\big) \tag{51} \\
&\leq W_2\big(p_t K_t^{\boldsymbol{\lambda}^*}, q_{T-t}^{\boldsymbol{\lambda}^*} K_t^{\boldsymbol{\lambda}^*}\big) + W_2\big(p_t K_t^{\boldsymbol{\lambda}_t}, p_t K_t^{\boldsymbol{\lambda}^*}\big). \tag{52}
\end{align}
$$

The first term is bounded by ([11](https://arxiv.org/html/2606.17192#S4.E11)) in Proposition [2](https://arxiv.org/html/2606.17192#Thmproposition2), and the second term captures the effect of the mismatch in the dual variables.

Synchronous Coupling: We consider a synchronous coupling $\Pi_t$ between $p_t K_t^{\boldsymbol{\lambda}_t}$ and $p_t K_t^{\boldsymbol{\lambda}^*}$ that shares the noise $\boldsymbol{\epsilon}$, since the two kernels have the same noise covariance. Given a particle $\mathbf{x}_t \sim p_t$, the PDI kernel converts it into

$$
\mathbf{x}_{t+1} = \frac{1}{\sqrt{a_\tau}}\Big(\mathbf{x}_t + b_\tau\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_t, t, \boldsymbol{\lambda}_t)\Big) + \sqrt{b_\tau}\,\boldsymbol{\epsilon}, \tag{53}
$$

where $\tau = T - t$, while the optimal kernel transports it into

$$
\widetilde{\mathbf{x}}_{t+1} = \frac{1}{\sqrt{a_\tau}}\Big(\mathbf{x}_t + b_\tau\,\mathbf{s}_t(\mathbf{x}_t \,|\, \boldsymbol{\lambda}^*)\Big) + \sqrt{b_\tau}\,\boldsymbol{\epsilon}, \tag{54}
$$

where $\mathbf{s}_t = \nabla \log q_{T-t}$ is the true score and $\mathbf{s}_{\boldsymbol{\theta}}$ is the parameterized score function. By construction, $\Pi_t$ is a valid coupling.
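A single synchronously coupled step can be sketched as follows. The example assumes two hypothetical Gaussian score fields with different means (standing in for the scores under $\boldsymbol{\lambda}_t$ and $\boldsymbol{\lambda}^*$) and arbitrary values for $a_\tau$ and $b_\tau$; none of these choices come from the paper. Because both updates share the same particle and the same noise draw, the noise cancels in the difference and only the score gap remains.

```python
import numpy as np

def reverse_step(x, score, a, b, noise):
    """One reverse step of the form (x + b * score(x)) / sqrt(a) + sqrt(b) * noise."""
    return (x + b * score(x)) / np.sqrt(a) + np.sqrt(b) * noise

def coupled_step_gap(n=1000, d=3, a=0.98, b=0.02, seed=0):
    """Compare two kernels on shared particles and shared noise (synchronous coupling)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))            # particles x_t ~ p_t (toy choice)
    eps = rng.standard_normal((n, d))          # shared noise draw
    mean_a, mean_b = np.full(d, 0.5), np.full(d, 1.5)
    score_a = lambda v: -(v - mean_a)          # stand-in for the score under lambda_t
    score_b = lambda v: -(v - mean_b)          # stand-in for the score under lambda^*
    xa = reverse_step(x, score_a, a, b, eps)
    xb = reverse_step(x, score_b, a, b, eps)
    gap = np.linalg.norm(xa - xb, axis=1)
    predicted = b / np.sqrt(a) * np.linalg.norm(score_a(x) - score_b(x), axis=1)
    coupling_cost = float(np.sqrt(np.mean(gap ** 2)))   # upper bound on W2 of the two outputs
    return gap, predicted, coupling_cost
```

The returned `coupling_cost` is the root-mean-square transport cost of this particular coupling, which upper-bounds the $W_2$ distance between the two output distributions.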

Bounding theW2W\_\{2\}distance then reduces to bounding the norm of the difference between the score functions under two different dual variables\. More concretely, we have

‖𝐱t\+1−𝐱~t\+1‖\\displaystyle\\\|\{\\mathbf\{x\}\}\_\{t\+1\}\-\\widetilde\{\\mathbf\{x\}\}\_\{t\+1\}\\\|=bτaτ∥𝐬𝜽\(𝐱t,t,𝝀t\)−𝐬t\(𝐱t\|𝝀∗\)∥\\displaystyle~=~\\frac\{b\_\{\\tau\}\}\{\\sqrt\{a\_\{\\tau\}\}\}\\big\\\|\{\\mathbf\{s\}\}\_\{\\boldsymbol\{\\theta\}\}\(\{\\mathbf\{x\}\}\_\{t\},t,\\boldsymbol\{\\lambda\}\_\{t\}\)\-\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}^\{\*\}\)\\big\\\|\(55\)≤bτaτ\(∥𝐬𝜽\(𝐱t,t,𝝀t\)−𝐬t\(𝐱t\|𝝀t\)∥\+∥𝐬t\(𝐱t\|𝝀t\)−𝐬t\(𝐱t\|𝝀∗\)∥\)\.\\displaystyle~\\leq~\\frac\{b\_\{\\tau\}\}\{\\sqrt\{a\_\{\\tau\}\}\}\\Big\(\\big\\\|\{\\mathbf\{s\}\}\_\{\\boldsymbol\{\\theta\}\}\(\{\\mathbf\{x\}\}\_\{t\},t,\\boldsymbol\{\\lambda\}\_\{t\}\)\-\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}\_\{t\}\)\\big\\\|\+\\big\\\|\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}\_\{t\}\)\-\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}^\{\*\}\)\\big\\\|\\Big\)\.\(56\)By Lemma[7](https://arxiv.org/html/2606.17192#Thmlemma7), we have

∥𝐬t\(𝐱t\|𝝀t\)−𝐬t\(𝐱t\|𝝀∗\)∥≤γt∥𝝀t−𝝀∗∥\.\\displaystyle\\big\\\|\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}\_\{t\}\)\-\{\\mathbf\{s\}\}\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\}\|\\boldsymbol\{\\lambda\}^\{\*\}\)\\big\\\|~\\leq~\\gamma\_\{t\}\\,\\\|\\boldsymbol\{\\lambda\}\_\{t\}\-\\boldsymbol\{\\lambda\}^\{\*\}\\\|\.\(57\)Combined with Assumption[4](https://arxiv.org/html/2606.17192#Thmassumption4), we can bound the dual mismatch with

W2\(ptKt𝝀t,ptKt𝝀∗\)\\displaystyle W\_\{2\}\\big\(p\_\{t\}K\_\{t\}^\{\{\\boldsymbol\{\\lambda\}\}\_\{t\}\},p\_\{t\}K\_\{t\}^\{\{\\boldsymbol\{\\lambda\}\}^\{\*\}\}\\big\)=\(infπ∫‖𝐱t\+1−𝐱~t\+1‖2dπ\(𝐱t\+1,𝐱~t\+1\)\)1/2\\displaystyle~=~\\Bigg\(\\inf\_\{\\pi\}\\int\\\|\{\\mathbf\{x\}\}\_\{t\+1\}\-\\widetilde\{\\mathbf\{x\}\}\_\{t\+1\}\\\|^\{2\}\\,\{\\text\{d\}\}\\pi\(\{\\mathbf\{x\}\}\_\{t\+1\},\\widetilde\{\\mathbf\{x\}\}\_\{t\+1\}\)\\Bigg\)^\{1/2\}\(58\)≤\(∫‖𝐱t\+1−𝐱~t\+1‖2dΠt\(𝐱t\+1,𝐱~t\+1\)\)1/2\\displaystyle~\\leq~\\Bigg\(\\int\\\|\{\\mathbf\{x\}\}\_\{t\+1\}\-\\widetilde\{\\mathbf\{x\}\}\_\{t\+1\}\\\|^\{2\}\\,\{\\text\{d\}\}\\Pi\_\{t\}\(\{\\mathbf\{x\}\}\_\{t\+1\},\\widetilde\{\\mathbf\{x\}\}\_\{t\+1\}\)\\Bigg\)^\{1/2\}\(59\)≤bτaτ\(γt‖𝝀t−𝝀∗‖\+ϵapp\(t\)\),\\displaystyle~\\leq~\\frac\{b\_\{\\tau\}\}\{\\sqrt\{a\_\{\\tau\}\}\}\\big\(\\gamma\_\{t\}\\,\\\|\\boldsymbol\{\\lambda\}\_\{t\}\-\\boldsymbol\{\\lambda\}^\{\*\}\\\|\+\\epsilon\_\{\\text\{app\}\}\(t\)\\big\),\(60\)where the last inequality follows by Minkowski inequality\. Substituting in \([52](https://arxiv.org/html/2606.17192#A3.E52)\), we get

$$
W_{2}\big(p_{t+1},q_{T-t-1}^{\boldsymbol{\lambda}^{*}}\big) \leq \rho_{t}W_{2}\big(p_{t},q_{T-t}^{\boldsymbol{\lambda}^{*}}\big)+\frac{b_{\tau}}{\sqrt{a_{\tau}}}\gamma_{t}\,\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}^{*}\|+\frac{b_{\tau}}{\sqrt{a_{\tau}}}\epsilon_{\text{app}}(t). \tag{61}
$$
Taking expectation with respect to $\boldsymbol{\lambda}_{t}$ and unrolling ([61](https://arxiv.org/html/2606.17192#A3.E61)) for $T$ steps results in

$$
\mathbb{E}\big[W_{2}\big(p_{T},\mu^{*}\big)\big] \leq \Psi_{0,T}W_{2}\big(p_{0},q_{T}^{\boldsymbol{\lambda}^{*}}\big)+\sum_{t=0}^{T-1}\Psi_{t,T}\frac{b_{T-t}}{\sqrt{a_{T-t}}}\Big(\gamma_{t}\,\mathbb{E}\big[\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}^{*}\|\big]+\epsilon_{\text{app}}(t)\Big), \tag{62}
$$

where $\Psi_{s,T}=\prod_{t=s+1}^{T}\rho_{t}$. For large $T$, it holds that $q_{T}^{\boldsymbol{\lambda}^{*}}$ converges to $\mathcal{N}(\mathbf{0},\mathbf{I})$, making the first term go to zero. This completes the proof. ∎
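The unrolling step can be checked numerically. The sketch below iterates a scalar recursion of the form $w_{t+1}=\rho_{t}w_{t}+c_{t}$ and compares it with the closed form built from products of the $\rho_{t}$. The random values of `rho` and `c`, the function names, and the zero-based indexing of the products are assumptions made for illustration and do not follow the paper's schedule or its exact index convention for $\Psi$.

```python
import numpy as np

def iterate_bound(w0, rho, c):
    """Step the recursion w_{t+1} = rho[t] * w_t + c[t] for len(rho) steps."""
    w = w0
    for r, ct in zip(rho, c):
        w = r * w + ct
    return w

def unrolled_bound(w0, rho, c):
    """Closed form: prod(rho) * w0 + sum_t (prod_{r > t} rho[r]) * c[t]."""
    T = len(rho)
    tail = np.ones(T)  # tail[t] = rho[t+1] * ... * rho[T-1], empty product = 1
    for t in range(T - 2, -1, -1):
        tail[t] = tail[t + 1] * rho[t + 1]
    return np.prod(rho) * w0 + np.sum(tail * c)

# Hypothetical per-step moduli and forcing terms (illustrative only).
rng = np.random.default_rng(0)
T = 50
rho = rng.uniform(0.7, 1.05, size=T)
c = rng.uniform(0.0, 0.1, size=T)
w0 = 2.0

closed = unrolled_bound(w0, rho, c)
stepped = iterate_bound(w0, rho, c)
print(closed, stepped)
```

The two values agree up to floating-point error, which mirrors how the per-step mismatch terms in (61) are each weighted by the product of the moduli of the steps that follow them.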

### C\.4 Gradient Bias

#### C\.4\.1 Tweedie’s Error

###### Lemma 3\.

Consider the reverse diffusion sampler, defined in ([7](https://arxiv.org/html/2606.17192#S3.E7)), associated with a dual value $\boldsymbol{\lambda}_{t}$. Then, under Assumption [2](https://arxiv.org/html/2606.17192#Thmassumption2), it holds that

$$
\mathbb{E}_{\boldsymbol{\lambda}_{t}}\Bigg[\Big\|\,\mathbb{E}_{\widetilde{p}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)\Big]-\mathbb{E}_{\mu_{\boldsymbol{\lambda}_{t}}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_{0})\big]\,\Big\|_{2}\Bigg] \leq L_{f}\,\epsilon_{TW}(t+1), \tag{63}
$$

where $\widetilde{p}_{t}$ is the marginal distribution of the reverse process under fixed $\boldsymbol{\lambda}_{t}$, $\widetilde{\pi}_{t}$ is the posterior density under the same process, $L_{f}$ is the Lipschitz constant of $\mathbf{f}$, and $\epsilon_{TW}$ is the expected Tweedie posterior-mean error.

###### Proof\.

Define the joint distribution $\Pi_{t+1}^{\boldsymbol{\lambda}_{t}}\big(\mathbf{y}_{0},\widetilde{\mathbf{x}}_{t+1}\big|\boldsymbol{\lambda}_{t}\big)=\widetilde{\pi}_{t+1}(\mathbf{y}_{0}|\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t})\,\widetilde{p}_{t+1}(\widetilde{\mathbf{x}}_{t+1}|\boldsymbol{\lambda}_{t})$ with marginals $\mu_{\boldsymbol{\lambda}_{t}}^{\dagger}$ and $\widetilde{p}_{t+1}$. Then, we can write the bias as

$$
\begin{aligned}
B_{1}(\boldsymbol{\lambda}_{t}) &\coloneqq \mathbb{E}_{\widetilde{p}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)\Big]-\mathbb{E}_{\mu_{\boldsymbol{\lambda}_{t}}^{\dagger}}\big[\mathbf{f}(\mathbf{y}_{0})\big] && \text{(64)}\\
&= \mathbb{E}_{\Pi_{t+1}^{\boldsymbol{\lambda}_{t}}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)-\mathbf{f}(\mathbf{y}_{0})\mid\boldsymbol{\lambda}_{t}\Big]. && \text{(65)}
\end{aligned}
$$

Taking the norm and applying Jensen’s inequality and Assumption [2](https://arxiv.org/html/2606.17192#Thmassumption2), we get

$$
\begin{aligned}
\|B_{1}(\boldsymbol{\lambda}_{t})\|_{2}^{2} &\leq \mathbb{E}_{\Pi_{t+1}^{\boldsymbol{\lambda}_{t}}}\Big[\big\|\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)-\mathbf{f}(\mathbf{y}_{0})\big\|_{2}^{2}\mid\boldsymbol{\lambda}_{t}\Big] && \text{(66)}\\
&\leq L_{f}^{2}\,\mathbb{E}_{\Pi_{t+1}^{\boldsymbol{\lambda}_{t}}}\Big[\big\|\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]-\mathbf{y}_{0}\big\|_{2}^{2}\mid\boldsymbol{\lambda}_{t}\Big]. && \text{(67)}
\end{aligned}
$$

Taking expectation with respect to $\boldsymbol{\lambda}_{t}$, we have

$$
\mathbb{E}\Big[\|B_{1}(\boldsymbol{\lambda}_{t})\|_{2}^{2}\Big] \leq L_{f}^{2}\,\mathbb{E}_{\boldsymbol{\lambda}_{t}}\mathbb{E}_{\Pi_{t+1}^{\boldsymbol{\lambda}_{t}}}\big\|\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]-\mathbf{y}_{0}\big\|_{2}^{2} \eqqcolon L_{f}^{2}\,\epsilon^{2}_{TW}(t+1), \tag{68}
$$

where $\epsilon^{2}_{TW}(t+1)$ is the expected Tweedie posterior-mean error at step $t+1$. Applying Jensen’s inequality once more, $\mathbb{E}\big[\|B_{1}(\boldsymbol{\lambda}_{t})\|_{2}\big]\leq\big(\mathbb{E}\big[\|B_{1}(\boldsymbol{\lambda}_{t})\|_{2}^{2}\big]\big)^{1/2}\leq L_{f}\,\epsilon_{TW}(t+1)$, which is the claimed bound. ∎
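The chain of Jensen and Lipschitz steps in this proof can be illustrated on a toy model where the posterior mean is known in closed form. The sketch below assumes a scalar prior $\mathbf{y}_{0}\sim\mathcal{N}(0,1)$, a noisy observation $\mathbf{x}=\alpha\mathbf{y}_{0}+\sigma\boldsymbol{\epsilon}$, and the 1-Lipschitz test function $f(y)=|y|$. All of these choices, and the numerical values of `alpha` and `sigma2`, are illustrative assumptions rather than the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, sigma2, n = 0.8, 0.36, 10_000   # hypothetical noise-schedule values

# Toy Gaussian model: y0 ~ N(0, 1), x = alpha * y0 + sigma * eps.
y0 = rng.normal(size=n)
x = alpha * y0 + np.sqrt(sigma2) * rng.normal(size=n)

# Exact posterior mean E[y0 | x] for this Gaussian model.
post_mean = alpha * x / (alpha**2 + sigma2)

f = np.abs                      # 1-Lipschitz constraint function, so L_f = 1
L_f = 1.0

# Empirical bias: | E[f(E[y0 | x])] - E[f(y0)] |
bias = abs(np.mean(f(post_mean)) - np.mean(f(y0)))
# Empirical root-mean-square posterior-mean error (the role of eps_TW).
eps_tw = np.sqrt(np.mean((post_mean - y0) ** 2))

print(bias, L_f * eps_tw)
```

For the empirical averages, the inequality `bias <= L_f * eps_tw` holds exactly, by the same Jensen and Lipschitz argument applied to the empirical measure.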

#### C\.4\.2 Marginal Mismatch

###### Lemma 4\.

Under Assumptions [4](https://arxiv.org/html/2606.17192#Thmassumption4) and [5](https://arxiv.org/html/2606.17192#Thmassumption5), it holds that

$$
\begin{aligned}
\Big\|\,\mathbb{E}_{\widetilde{p}_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\widetilde{\mathbf{x}}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)\Big] &-\mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)\Big]\,\Big\|_{2}\\
&\leq L_{h}\sum_{j=0}^{t-1}\Psi_{j-1,t}\frac{b_{T-j}}{\sqrt{a_{T-j}}}\Big(\gamma_{j}\|\boldsymbol{\lambda}_{j}-\boldsymbol{\lambda}_{t}\|_{2}+\epsilon_{j}\Big), && \text{(69)}
\end{aligned}
$$

where $\Psi_{s,t}=\prod_{r=s+1}^{t}\rho_{r}(\boldsymbol{\lambda}_{r})$ is the stability product and $\rho_{r},\gamma_{j}$ are the moduli of Lemma [6](https://arxiv.org/html/2606.17192#Thmlemma6) and Lemma [7](https://arxiv.org/html/2606.17192#Thmlemma7), generalized to any given $\boldsymbol{\lambda}$.

###### Proof\.

We use synchronous coupling to compare the Tweedie posterior mean at step $t+1$ under two processes: 1) PDI, and 2) the reverse process ([7](https://arxiv.org/html/2606.17192#S3.E7)) under $\boldsymbol{\lambda}_{t}$.

Under the PDI dynamics, one reverse step is given by

$$
\mathbf{x}_{s+1} = \frac{1}{\sqrt{a_{T-s}}}\Big(\mathbf{x}_{s}+b_{T-s}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{s},s,\boldsymbol{\lambda}_{s})\Big)+\sqrt{b_{T-s}}\,\boldsymbol{\epsilon}_{s}, \tag{70}
$$

where $\mathbf{x}_{s}\sim p_{s}$, while the corresponding kernel $K_{s}^{\boldsymbol{\lambda}_{t}}$ under ([7](https://arxiv.org/html/2606.17192#S3.E7)) with fixed $\boldsymbol{\lambda}_{t}$ is

$$
\widetilde{\mathbf{x}}_{s+1} = \frac{1}{\sqrt{a_{T-s}}}\Big(\widetilde{\mathbf{x}}_{s}+b_{T-s}\,\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{t})\Big)+\sqrt{b_{T-s}}\,\boldsymbol{\epsilon}_{s}. \tag{71}
$$

Under the synchronous coupling, the two processes share the same noise $\boldsymbol{\epsilon}_{s}$ and the same initialization, i.e., $\mathbf{x}_{0}=\widetilde{\mathbf{x}}_{0}$. Thus, the difference between the two noisy samples $\mathbf{x}_{s+1}$ and $\widetilde{\mathbf{x}}_{s+1}$ is

$$
\begin{aligned}
\mathbf{x}_{s+1}-\widetilde{\mathbf{x}}_{s+1} = \frac{1}{\sqrt{a_{T-s}}}\Big(\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\Big) &+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{s})\Big)\\
&+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{s},s,\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})\Big)\\
&+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{t})\Big), && \text{(72)}
\end{aligned}
$$

where we added and subtracted $\frac{b_{T-s}}{\sqrt{a_{T-s}}}\left(\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{s})+\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})\right)$. These quantities have been analyzed in Lemmas [6](https://arxiv.org/html/2606.17192#Thmlemma6) and [7](https://arxiv.org/html/2606.17192#Thmlemma7) for the optimal kernel $K_{t}^{\boldsymbol{\lambda}^{*}}$. Here, we use generalized formulas for any kernel associated with ([7](https://arxiv.org/html/2606.17192#S3.E7)). Similarly to ([97](https://arxiv.org/html/2606.17192#A3.E97))–([101](https://arxiv.org/html/2606.17192#A3.E101)), we can bound the first two terms jointly with

$$
\Big\|\frac{1}{\sqrt{a_{T-s}}}\Big(\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\Big)+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{s})\Big)\Big\|_{2} \leq \rho_{s}(\boldsymbol{\lambda}_{s})\cdot\|\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\|_{2}, \tag{73}
$$

where $\rho_{s}(\boldsymbol{\lambda}_{s})=\frac{1}{\sqrt{a_{T-s}}}\,\sup_{\mathbf{x}}\left\|\mathbf{I}+b_{T-s}\nabla_{\mathbf{x}}^{2}\log q_{T-s}(\mathbf{x}|\boldsymbol{\lambda}_{s})\right\|_{\text{op}}$ is the stability factor of the one-step kernel $K_{s}^{\boldsymbol{\lambda}_{s}}$. For the last term, we follow ([103](https://arxiv.org/html/2606.17192#A3.E103))–([107](https://arxiv.org/html/2606.17192#A3.E107)) to get

$$
\|\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\widetilde{\mathbf{x}}_{s}|\boldsymbol{\lambda}_{t})\|_{2} \leq \gamma_{s}\cdot\|\boldsymbol{\lambda}_{s}-\boldsymbol{\lambda}_{t}\|_{2}, \tag{74}
$$

where $\gamma_{s}=\frac{R\alpha_{T-s}}{\beta\sigma^{2}_{T-s}}\sqrt{\sup_{\boldsymbol{\lambda}\in\boldsymbol{\Lambda}}\|\mathbf{Cov}_{\pi_{s}}(\mathbf{y}_{0}|\widetilde{\mathbf{x}}_{s},\boldsymbol{\lambda})\|_{\text{op}}}$.

Combining the two results, along with Assumption[4](https://arxiv.org/html/2606.17192#Thmassumption4), we obtain

$$
\begin{aligned}
\|\mathbf{x}_{s+1}-\widetilde{\mathbf{x}}_{s+1}\|_{2} \leq{}& \rho_{s}(\boldsymbol{\lambda}_{s})\cdot\|\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\|_{2}\\
&+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\gamma_{s}\|\boldsymbol{\lambda}_{s}-\boldsymbol{\lambda}_{t}\|_{2}+\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{s},s,\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})\|_{2}\Big). && \text{(75)}
\end{aligned}
$$

Then, we take expectation with respect to the coupling $\Pi_{s+1}$ to get

$$
\begin{aligned}
\mathbb{E}_{\Pi_{s+1}}\Big[\|\mathbf{x}_{s+1}-\widetilde{\mathbf{x}}_{s+1}\|_{2}\Big] \leq{}& \rho_{s}(\boldsymbol{\lambda}_{s})\cdot\mathbb{E}_{\Pi_{s}}\Big[\|\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\|_{2}\Big]\\
&+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\gamma_{s}\cdot\|\boldsymbol{\lambda}_{s}-\boldsymbol{\lambda}_{t}\|_{2}+\mathbb{E}_{p_{s}}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{s},s,\boldsymbol{\lambda}_{s})-\mathbf{s}_{s}(\mathbf{x}_{s}|\boldsymbol{\lambda}_{s})\|_{2}\Big)\\
\leq{}& \rho_{s}(\boldsymbol{\lambda}_{s})\cdot\mathbb{E}_{\Pi_{s}}\Big[\|\mathbf{x}_{s}-\widetilde{\mathbf{x}}_{s}\|_{2}\Big]+\frac{b_{T-s}}{\sqrt{a_{T-s}}}\Big(\gamma_{s}\cdot\|\boldsymbol{\lambda}_{s}-\boldsymbol{\lambda}_{t}\|_{2}+\epsilon_{s}\Big). && \text{(76)}
\end{aligned}
$$

The last inequality follows from Assumption [4](https://arxiv.org/html/2606.17192#Thmassumption4) and Jensen’s inequality. With this inequality in place, let $\mathbf{h}(\mathbf{x}_{t+1})\coloneqq\mathbf{f}\Big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\Big)$, and bound the bias with

$$
\begin{aligned}
\|B_{2}(\boldsymbol{\lambda}_{t};\mathcal{F}_{t})\|_{2} &\coloneqq \Big\|\,\mathbb{E}_{\widetilde{p}_{t}}\big[\mathbf{h}(\widetilde{\mathbf{x}}_{t+1})\big]-\mathbb{E}_{p_{t}}\big[\mathbf{h}(\mathbf{x}_{t+1})\big]\,\Big\|_{2}\\
&= \Big\|\,\mathbb{E}_{\Pi_{t}}\big[\mathbf{h}(\widetilde{\mathbf{x}}_{t+1})-\mathbf{h}(\mathbf{x}_{t+1})\big]\,\Big\|_{2} && \text{(77)}\\
&\leq L_{h}\,\mathbb{E}_{\Pi_{t+1}}\|\widetilde{\mathbf{x}}_{t+1}-\mathbf{x}_{t+1}\|_{2} && \text{(78)}\\
&\leq L_{h}\Big(\rho_{t}(\boldsymbol{\lambda}_{t})\cdot\mathbb{E}_{\Pi_{t}}\big[\|\mathbf{x}_{t}-\widetilde{\mathbf{x}}_{t}\|_{2}\big]+\frac{b_{T-t}}{\sqrt{a_{T-t}}}\big(\gamma_{t}\cdot\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}_{t}\|_{2}+\epsilon_{\text{app}}(t)\big)\Big), && \text{(79)}
\end{aligned}
$$

where $L_{h}$ is the Lipschitz constant of $\mathbf{h}$, which depends on $L_{f}$ and the score function. The dual mismatch $\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}_{t}\|_{2}$ is zero here, but we keep it to show the decomposition of one recursion step.

Recursively, we get the bound:

$$
\|B_{2}(\boldsymbol{\lambda}_{t};\mathcal{F}_{t})\|_{2} \leq L_{h}\sum_{j=0}^{t}\Bigg(\prod_{s=j}^{t}\rho_{s}(\boldsymbol{\lambda}_{s})\Bigg)\cdot\frac{b_{T-j}}{\sqrt{a_{T-j}}}\Big(\gamma_{j}\|\boldsymbol{\lambda}_{j}-\boldsymbol{\lambda}_{t}\|_{2}+\epsilon_{\text{app}}(j)\Big). \tag{80}
$$

Let $\Psi_{j-1,t}=\prod_{s=j}^{t}\rho_{s}(\boldsymbol{\lambda}_{s})$ be the product of the stability coefficients over the past $t-j$ steps, then

$$
\|B_{2}(\boldsymbol{\lambda}_{t};\mathcal{F}_{t})\|_{2} \leq L_{h}\sum_{j=0}^{t}\Psi_{j-1,t}\frac{b_{T-j}}{\sqrt{a_{T-j}}}\Big(\gamma_{j}\|\boldsymbol{\lambda}_{j}-\boldsymbol{\lambda}_{t}\|_{2}+\epsilon_{\text{app}}(j)\Big). \tag{81}
$$

It is worth noting that

$$
\frac{b_{T-j}}{\sqrt{a_{T-j}}}\gamma_{j}=\frac{b_{T-j}}{\sqrt{a_{T-j}}}\frac{\alpha_{T-j}}{\sigma^{2}_{T-j}}\frac{R}{\beta}\sqrt{\sup_{\boldsymbol{\lambda}\in\boldsymbol{\Lambda}}\|\mathbf{Cov}_{\pi_{j}}(\mathbf{y}_{0}|\widetilde{\mathbf{x}}_{j},\boldsymbol{\lambda})\|_{\text{op}}},
$$

where $\frac{b_{T-j}}{\sqrt{a_{T-j}}}\frac{\alpha_{T-j}}{\sigma^{2}_{T-j}}=\alpha_{T-j-1}b_{T-j}/(b_{T-j}+a_{T-j}\sigma_{T-j-1}^{2})$ is a non-negative value below $1$ for variance-preserving noise schedules. Therefore, the bias depends on the noise schedule mainly through the stability product $\Psi_{j-1,t}$.

Now assume that, for all $t\leq T$, it holds that

$$
\sum_{j=0}^{t}\Psi_{j-1,t}\frac{b_{T-j}}{\sqrt{a_{T-j}}}\gamma_{j}(t-j) \leq C_{\text{hist}} < \infty, \tag{82}
$$

and

$$
\sum_{j=0}^{t}\Psi_{j-1,t}\frac{b_{T-j}}{\sqrt{a_{T-j}}}\epsilon_{\text{app}}(j) \leq \epsilon_{\text{hist}} < \infty. \tag{83}
$$

It is also true that

$$
\|\boldsymbol{\lambda}_{j}-\boldsymbol{\lambda}_{t}\|_{2}\leq(t-j)\,\eta R. \tag{84}
$$

Therefore, we have $\|B_{2}(\boldsymbol{\lambda}_{t};\mathcal{F}_{t})\|_{2}^{2}\leq(1+\delta)L_{h}^{2}C_{\text{hist}}^{2}\eta^{2}R^{2}+(1+\frac{1}{\delta})L_{h}^{2}\epsilon^{2}_{\text{hist}}$, for an arbitrary $\delta>0$, using Young’s inequality, which completes the proof.

∎
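The drift bound (84) and the final use of Young’s inequality can be sanity-checked with a small simulation. The sketch below assumes a projected dual-ascent update $\boldsymbol{\lambda}_{k+1}=[\boldsymbol{\lambda}_{k}+\eta\,\mathbf{g}_{k}]_{+}$ with surrogate gradients satisfying $\|\mathbf{g}_{k}\|_{2}\leq R$. The projection onto the non-negative orthant, the random gradients, and the numerical values are assumptions made for illustration, not the paper's exact update.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, R, m, steps = 0.05, 2.0, 3, 40    # hypothetical step size, gradient bound, dimensions

lam = np.zeros(m)
traj = [lam.copy()]
for _ in range(steps):
    g = rng.normal(size=m)
    g *= R * rng.uniform() / np.linalg.norm(g)   # rescale so that ||g||_2 <= R
    lam = np.maximum(lam + eta * g, 0.0)         # projected ascent step (non-expansive)
    traj.append(lam.copy())
traj = np.array(traj)

# Check ||lambda_j - lambda_t||_2 <= (t - j) * eta * R for all j <= t.
t = steps
drift = np.linalg.norm(traj[: t + 1] - traj[t], axis=1)
gaps = t - np.arange(t + 1)
max_excess = np.max(drift - gaps * eta * R)

# Young's inequality: (a + b)^2 <= (1 + delta) a^2 + (1 + 1/delta) b^2.
a, b, delta = 0.3, 0.7, 0.5            # arbitrary illustrative values
young_gap = (1 + delta) * a**2 + (1 + 1 / delta) * b**2 - (a + b) ** 2

print(max_excess, young_gap)
```

Because each step moves the multiplier by at most $\eta R$, the triangle inequality gives the linear-in-$(t-j)$ drift, so `max_excess` is non-positive; `young_gap` is non-negative for any $\delta>0$.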

As discussed in Section [4](https://arxiv.org/html/2606.17192#S4), under noise schedules with a sufficiently long stable high-noise phase, the stability products decay with the length of the history window. Hence, contributions from early dual mismatches are damped, and only the recent history can have a non-negligible effect near the high-SNR region, where the reverse dynamics become more sensitive. In this regime, the remaining history-dependent contribution can be controlled by the step size $\eta_{t}$.

#### C\.4\.3 Parameterization Error

###### Lemma 5\.

Under Assumptions [2](https://arxiv.org/html/2606.17192#Thmassumption2) and [4](https://arxiv.org/html/2606.17192#Thmassumption4), it holds that

$$
\Big\|\mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\big(\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\big)-\mathbf{f}\big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\big)\Big]\Big\|_{2}^{2} \leq L_{f}^{2}\frac{\sigma_{T-t}^{4}}{\overline{\alpha}_{T-t}^{2}}\epsilon_{\text{app}}^{2}(t+1). \tag{85}
$$

###### Proof\.

The Tweedie posterior means are given as

$$
\begin{aligned}
\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big] &= \frac{1}{\overline{\alpha}_{T-t}}\Big(\mathbf{x}_{t+1}+\sigma_{T-t}^{2}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{t+1},t+1,\boldsymbol{\lambda}_{t})\Big), && \text{(86)}\\
\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big] &= \frac{1}{\overline{\alpha}_{T-t}}\Big(\mathbf{x}_{t+1}+\sigma_{T-t}^{2}\,\mathbf{s}_{t+1}(\mathbf{x}_{t+1}|\boldsymbol{\lambda}_{t})\Big), && \text{(87)}
\end{aligned}
$$

where $\overline{\alpha}_{T-t}=\max\{\alpha_{\min},\alpha_{T-t}\}$. Therefore, we have

$$
\begin{aligned}
\Big\|\mathbb{E}_{p_{t+1}}\Big[\mathbf{f}\big(&\mathbb{E}_{\widetilde{\pi}_{t+1}^{\boldsymbol{\theta}}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\big)-\mathbf{f}\big(\mathbb{E}_{\widetilde{\pi}_{t+1}}\big[\mathbf{y}_{0}\,|\,\mathbf{x}_{t+1},\boldsymbol{\lambda}_{t}\big]\big)\Big]\Big\|_{2} && \text{(88)}\\
&\leq L_{f}\frac{\sigma_{T-t}^{2}}{\overline{\alpha}_{T-t}}\,\mathbb{E}_{p_{t+1}}\Big\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{t+1},t+1,\boldsymbol{\lambda}_{t})-\mathbf{s}_{t+1}(\mathbf{x}_{t+1}|\boldsymbol{\lambda}_{t})\Big\|_{2} \leq L_{f}\frac{\sigma_{T-t}^{2}}{\overline{\alpha}_{T-t}}\epsilon_{t+1}, && \text{(89)}
\end{aligned}
$$

by Jensen’s inequality and Assumptions [2](https://arxiv.org/html/2606.17192#Thmassumption2) and [4](https://arxiv.org/html/2606.17192#Thmassumption4). Squaring both sides, with $\epsilon_{t+1}$ denoting $\epsilon_{\text{app}}(t+1)$, gives (85). ∎
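The way a score error propagates to the Tweedie posterior mean can be seen directly from (86) and (87). The sketch below uses a scalar toy model with prior $\mathbf{y}_{0}\sim\mathcal{N}(0,1)$, for which the marginal score and the posterior mean are known in closed form. The helper name `tweedie_mean`, the values of `alpha` and `sigma2`, and the constant score perturbation are illustrative assumptions; the clamp $\overline{\alpha}=\max\{\alpha_{\min},\alpha\}$ is omitted by assuming $\alpha\geq\alpha_{\min}$.

```python
import numpy as np

def tweedie_mean(x, score, alpha, sigma2):
    """Posterior-mean estimate (x + sigma^2 * score) / alpha."""
    return (x + sigma2 * score) / alpha

alpha, sigma2 = 0.8, 0.36          # hypothetical schedule values
x = np.linspace(-3.0, 3.0, 7)

# Toy model: y0 ~ N(0, 1), x = alpha * y0 + sigma * eps, so x ~ N(0, alpha^2 + sigma^2).
var_x = alpha**2 + sigma2
true_score = -x / var_x            # score of the Gaussian marginal
exact_mean = alpha * x / var_x     # exact E[y0 | x] for this model

est = tweedie_mean(x, true_score, alpha, sigma2)

# A score error of size delta shifts the posterior mean by (sigma^2 / alpha) * delta.
delta = 0.01
perturbed = tweedie_mean(x, true_score + delta, alpha, sigma2)
mean_err = np.abs(perturbed - est)

print(est - exact_mean, mean_err)
```

With the exact score, the estimate matches the closed-form posterior mean; with a perturbed score, the mean error equals $\sigma^{2}/\alpha$ times the score error, which is the factor that appears in (89) before the Lipschitz constant $L_{f}$ is applied.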

### C\.5 Proof of Proposition [2](https://arxiv.org/html/2606.17192#Thmproposition2)

###### Proof\.

Let $\pi$ be any coupling between $\nu$ and $\mu$. Apply the synchronous coupling (same noise $\boldsymbol{\epsilon}_{t}$) to each realized pair $(\mathbf{x},\mathbf{y})\sim\pi$:

$$
\begin{aligned}
\mathbf{X}^{+} &= \frac{1}{\sqrt{a_{T-t}}}\Big(\mathbf{x}+b_{T-t}\nabla\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^{*})\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}, && \text{(90)}\\
\mathbf{Y}^{+} &= \frac{1}{\sqrt{a_{T-t}}}\Big(\mathbf{y}+b_{T-t}\nabla\log q_{T-t}(\mathbf{y}|\boldsymbol{\lambda}^{*})\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}. && \text{(91)}
\end{aligned}
$$

Let $\Gamma$ be the joint law of $(\mathbf{X}^{+},\mathbf{Y}^{+})$. By construction, the marginal distribution of $\mathbf{X}^{+}$ is $\nu K_{t}^{\boldsymbol{\lambda}^{*}}$, and the marginal distribution of $\mathbf{Y}^{+}$ is $\mu K_{t}^{\boldsymbol{\lambda}^{*}}$. Therefore, $\Gamma$ is a valid coupling between $\nu K_{t}^{\boldsymbol{\lambda}^{*}}$ and $\mu K_{t}^{\boldsymbol{\lambda}^{*}}$.

Using the definition of $W_{2}$, we have

$$
\begin{aligned}
W_{2}^{2}(\nu K_{t}^{\boldsymbol{\lambda}^{*}},\mu K_{t}^{\boldsymbol{\lambda}^{*}}) &\leq \mathbb{E}_{\Gamma}\big[\|\mathbf{X}^{+}-\mathbf{Y}^{+}\|^{2}\big] && \text{(92)}\\
&= \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\pi}\Big[\mathbb{E}\big[\|\mathbf{X}^{+}-\mathbf{Y}^{+}\|^{2}\mid\mathbf{x},\mathbf{y}\big]\Big] && \text{(93)}\\
&\leq \rho_{t}^{2}\,\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\pi}\left[\|\mathbf{x}-\mathbf{y}\|^{2}\right], && \text{(94)}
\end{aligned}
$$

which follows by Lemma [6](https://arxiv.org/html/2606.17192#Thmlemma6). Note that, conditioned on the pair $(\mathbf{x},\mathbf{y})$, $\|\mathbf{X}^{+}-\mathbf{Y}^{+}\|^{2}$ is deterministic and the inner expectation in ([93](https://arxiv.org/html/2606.17192#A3.E93)) is trivial. Since this holds for every coupling $\pi\in\Pi(\nu,\mu)$, we take the infimum over $\pi$ and obtain

$$
W_{2}^{2}(\nu K_{t}^{\boldsymbol{\lambda}^{*}},\mu K_{t}^{\boldsymbol{\lambda}^{*}})\leq\rho_{t}^{2}\,W_{2}^{2}(\nu,\mu). \tag{95}
$$

Taking square roots gives

$$
W_{2}(\nu K_{t}^{\boldsymbol{\lambda}^{*}},\mu K_{t}^{\boldsymbol{\lambda}^{*}})\leq\rho_{t}\,W_{2}(\nu,\mu). \tag{96}
$$

This proves the proposition. ∎

### C\.6 Additional Lemmas

###### Lemma 6 \(One\-step contraction modulus\)\.

Under Assumption [5](https://arxiv.org/html/2606.17192#Thmassumption5), for any $\mathbf{x}$, $\mathbf{y}$ and under the synchronous coupling (common noise $\boldsymbol{\epsilon}_{t}$), the reverse-step outputs satisfy $\|\mathbf{X}^{+}-\mathbf{Y}^{+}\|^{2}\leq\rho_{t}^{2}\|\mathbf{x}-\mathbf{y}\|^{2}$, almost surely, with

$$
\rho_{t}=\frac{1}{\sqrt{a_{T-t}}}\,\sup_{\mathbf{x}}\,\left\|\mathbf{I}+b_{T-t}\nabla_{\mathbf{x}}^{2}\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^{*})\right\|_{\text{op}}.
$$

###### Proof\.

Under synchronous coupling, we have

$$
\mathbf{X}^{+}-\mathbf{Y}^{+}=\frac{1}{\sqrt{a_{T-t}}}\Big(\mathbf{x}-\mathbf{y}+b_{T-t}\big(\nabla\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^{*})-\nabla\log q_{T-t}(\mathbf{y}|\boldsymbol{\lambda}^{*})\big)\Big). \tag{97}
$$

The noise cancels exactly since the two processes share the same noise. Thus, this difference is deterministic given $\mathbf{x}$ and $\mathbf{y}$. Then, by the mean value theorem in integral form along the segment from $\mathbf{y}$ to $\mathbf{x}$, we get

$$
\nabla\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^{*})-\nabla\log q_{T-t}(\mathbf{y}|\boldsymbol{\lambda}^{*}) = \Big(\int_{0}^{1}\nabla^{2}\log q_{T-t}(\mathbf{y}+s(\mathbf{x}-\mathbf{y})|\boldsymbol{\lambda}^{*})\,\mathrm{d}s\Big)\cdot(\mathbf{x}-\mathbf{y}). \tag{98}
$$

Substituting back into ([97](https://arxiv.org/html/2606.17192#A3.E97)), we get

$$
\mathbf{X}^{+}-\mathbf{Y}^{+}=\frac{1}{\sqrt{a_{T-t}}}(\mathbf{I}+b_{T-t}\widehat{\mathbf{H}})\cdot(\mathbf{x}-\mathbf{y}), \tag{99}
$$

where $\widehat{\mathbf{H}}$ is the averaged Hessian along the segment. Taking norms,

$$
\begin{aligned}
\|\mathbf{X}^{+}-\mathbf{Y}^{+}\| &\leq \frac{1}{\sqrt{a_{T-t}}}\left\|\mathbf{I}+b_{T-t}\widehat{\mathbf{H}}\right\|_{\text{op}}\cdot\|\mathbf{x}-\mathbf{y}\| && \text{(100)}\\
&\leq \frac{1}{\sqrt{a_{T-t}}}\sup_{\mathbf{x}\in\mathcal{X}}\,\left\|\mathbf{I}+b_{T-t}\nabla_{\mathbf{x}}^{2}\log q_{T-t}(\mathbf{x}|\boldsymbol{\lambda}^{*})\right\|_{\text{op}}\|\mathbf{x}-\mathbf{y}\| \eqqcolon \rho_{t}\|\mathbf{x}-\mathbf{y}\|. && \text{(101)}
\end{aligned}
$$

The last inequality upper-bounds the averaged Hessian term by the supremum over the compact domain $\mathcal{X}$. This completes the proof. ∎
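The cancellation of the shared noise and the resulting contraction modulus can be illustrated in a case where the Hessian is constant. The sketch below assumes an isotropic Gaussian $q=\mathcal{N}(\mathbf{0},v\mathbf{I})$, for which $\nabla\log q(\mathbf{x})=-\mathbf{x}/v$ and $\nabla^{2}\log q=-\mathbf{I}/v$, so that $\rho=|1-b/v|/\sqrt{a}$. The Gaussian target and the values of `a`, `b`, and `v` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, a, b, v = 4, 0.95, 0.05, 1.5     # hypothetical schedule coefficients and variance

def reverse_step(z, eps):
    """One reverse step with the exact score of N(0, v I)."""
    score = -z / v
    return (z + b * score) / np.sqrt(a) + np.sqrt(b) * eps

x, y = rng.normal(size=d), rng.normal(size=d)
eps = rng.normal(size=d)            # synchronous coupling: both chains share this noise
x_plus, y_plus = reverse_step(x, eps), reverse_step(y, eps)

# Contraction modulus ||I + b * Hessian||_op / sqrt(a) with Hessian = -I / v.
rho = abs(1.0 - b / v) / np.sqrt(a)

lhs = np.linalg.norm(x_plus - y_plus)
rhs = rho * np.linalg.norm(x - y)
print(lhs, rhs, rho)
```

In this Gaussian case the Hessian does not vary along the segment, so the bound holds with equality and `lhs` matches `rhs` up to rounding; for non-Gaussian targets the supremum over the domain would give an upper bound only.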

###### Lemma 7\.

Under Assumption [2](https://arxiv.org/html/2606.17192#Thmassumption2), the score function $\mathbf{s}_{t}(\cdot\,|\,\boldsymbol{\lambda})$ is Lipschitz in $\boldsymbol{\lambda}$, and

$$
\big\|\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda}_{t})-\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda}^{*})\big\| \leq \gamma_{t}\left\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}^{*}\right\|_{2}, \tag{102}
$$

with $\gamma_{t}=\frac{R\alpha_{T-t}}{\beta\sigma^{2}_{T-t}}\sqrt{\sup_{\boldsymbol{\lambda}}\|\mathbf{Cov}_{\pi_{t}}(\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda})\|_{\text{op}}}\geq 0$.

###### Proof\.

By the mean value theorem, the difference is bounded by

$$\big\|\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda}_{t})-\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda}^{*})\big\| \;\leq\; \Big(\sup_{\boldsymbol{\lambda}}\big\|\nabla_{\boldsymbol{\lambda}}\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda})\big\|_{\text{op}}\Big)\,\|\boldsymbol{\lambda}_{t}-\boldsymbol{\lambda}^{*}\|_{2}, \tag{103}$$

where $\|\cdot\|_{\text{op}}$ is the operator norm. The score function is given by Tweedie's formula as

$$\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda})=\frac{1}{\sigma^{2}_{T-t}}\Big(\alpha_{T-t}\,\mathbb{E}_{\pi_{t}}[\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda}]-\mathbf{x}\Big), \tag{104}$$

where $\pi_{t}$ is the posterior distribution:

$$\pi_{t}(\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda})\;\propto\;\frac{1}{Z(\boldsymbol{\lambda})}\exp\Big(-E(\mathbf{y}_{0},\boldsymbol{\lambda})-\frac{1}{2\sigma^{2}_{T-t}}\|\mathbf{x}-\alpha_{T-t}\mathbf{y}_{0}\|^{2}\Big). \tag{105}$$

The posterior depends on $\boldsymbol{\lambda}$ only through the prior. Thus, we have

$$\nabla_{\boldsymbol{\lambda}}\mathbb{E}_{\pi_{t}}\big[\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda}\big] \;=\; -\frac{1}{\beta}\,\mathbf{Cov}_{\pi_{t}}\Big(\mathbf{y}_{0},\mathbf{f}(\mathbf{y}_{0})\,\big|\,\mathbf{x},\boldsymbol{\lambda}\Big). \tag{106}$$

Taking the operator norm, we get

$$\frac{\beta\,\sigma^{2}_{T-t}}{\alpha_{T-t}}\big\|\nabla_{\boldsymbol{\lambda}}\mathbf{s}_{t}(\mathbf{x}|\boldsymbol{\lambda})\big\|_{\text{op}} \;\leq\; \Big(\|\mathbf{Cov}_{\pi_{t}}(\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda})\|_{\text{op}}\Big)^{1/2}\cdot\Big(\|\mathbf{Cov}_{\pi_{t}}(\mathbf{f}(\mathbf{y}_{0})|\mathbf{x},\boldsymbol{\lambda})\|_{\text{op}}\Big)^{1/2} \tag{107}$$

$$\qquad\leq\; R\,\sqrt{\|\mathbf{Cov}_{\pi_{t}}(\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda})\|_{\text{op}}}, \tag{108}$$

where the first inequality is the Cauchy–Schwarz bound on the cross-covariance and the second uses Assumption [2](https://arxiv.org/html/2606.17192#Thmassumption2). Setting $\gamma_{t}=\frac{R\,\alpha_{T-t}}{\beta\,\sigma^{2}_{T-t}}\sqrt{\sup_{\boldsymbol{\lambda}}\|\mathbf{Cov}_{\pi_{t}}(\mathbf{y}_{0}|\mathbf{x},\boldsymbol{\lambda})\|_{\text{op}}}$ completes the proof. ∎
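The covariance identity in (106) can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes a scalar $\mathbf{y}_0$ restricted to a grid, an energy of the form $E(y,\lambda)=(f_0(y)+\lambda f(y))/\beta$ (consistent with (110)), and hypothetical choices of `f0`, `f`, `alpha`, and `sigma2`. It compares a central finite difference of the posterior mean with $-\mathrm{Cov}(y_0, f(y_0))/\beta$.

```python
import math

def posterior(x, lam, ys, f0, f, beta, alpha, sigma2):
    """Discrete posterior over the grid ys, proportional to
    exp(-(f0(y) + lam * f(y)) / beta - (x - alpha * y)^2 / (2 * sigma2))."""
    logits = [-(f0(y) + lam * f(y)) / beta - (x - alpha * y) ** 2 / (2.0 * sigma2)
              for y in ys]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_mean(x, lam, ys, **model):
    probs = posterior(x, lam, ys, **model)
    return sum(p * y for p, y in zip(probs, ys))

def cross_covariance(x, lam, ys, **model):
    """Cov(y0, f(y0)) under the posterior."""
    probs = posterior(x, lam, ys, **model)
    f = model["f"]
    mean_y = sum(p * y for p, y in zip(probs, ys))
    mean_f = sum(p * f(y) for p, y in zip(probs, ys))
    return sum(p * (y - mean_y) * (f(y) - mean_f) for p, y in zip(probs, ys))

# Hypothetical toy model: quadratic objective, one linear constraint y <= 1.
ys = [-3.0 + 0.01 * i for i in range(601)]
model = dict(f0=lambda y: 0.5 * y * y, f=lambda y: y - 1.0,
             beta=0.5, alpha=0.8, sigma2=0.36)
x, lam, h = 0.3, 1.2, 1e-5

finite_diff = (posterior_mean(x, lam + h, ys, **model)
               - posterior_mean(x, lam - h, ys, **model)) / (2.0 * h)
analytic = -cross_covariance(x, lam, ys, **model) / model["beta"]
print(finite_diff, analytic)
```

The two printed values should agree up to finite-difference error. Since $f$ is increasing in $y$ here, the covariance is positive and the posterior mean decreases as the multiplier grows.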

###### Lemma 8\.

The Lagrangian in ([2](https://arxiv.org/html/2606.17192#S3.E2)) can equivalently be written as

$$\mathcal{L}(\mu,\boldsymbol{\lambda}) \;=\; \beta\,D_{\text{KL}}\big(\mu\,\|\,\mu_{\boldsymbol{\lambda}}^{\dagger}\big)+g(\boldsymbol{\lambda}).$$

###### Proof\.

Start with the definition of KL divergence:

$$D_{\text{KL}}\big(\mu\,\|\,\mu_{\boldsymbol{\lambda}}^{\dagger}\big) \;=\; \mathbb{E}_{\mu}\Big[\log\mu(\mathbf{x})-\log\mu_{\boldsymbol{\lambda}}^{\dagger}(\mathbf{x})\Big] \tag{109}$$

$$\qquad=\; \mathbb{E}_{\mu}[\log\mu(\mathbf{x})]+\frac{1}{\beta}\,\mathbb{E}_{\mu}\Big[f_{0}(\mathbf{x})+\boldsymbol{\lambda}^{\top}\mathbf{f}(\mathbf{x})\Big]+\log Z(\boldsymbol{\lambda}) \tag{110}$$

$$\qquad=\; \frac{1}{\beta}\,\mathcal{L}(\mu,\boldsymbol{\lambda})+\log Z(\boldsymbol{\lambda}). \tag{111}$$

The second equality follows from the definition of $\mu_{\boldsymbol{\lambda}}^{\dagger}$, while the last one uses the definition of the Lagrangian. Multiplying both sides by $\beta$ and using $g(\boldsymbol{\lambda})=-\beta\log Z(\boldsymbol{\lambda})$ completes the proof. ∎
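As a quick sanity check of Lemma 8, the identity can be verified on a finite state space. The sketch below assumes a single scalar multiplier and takes the Lagrangian in the form implied by (110)–(111), namely $\mathcal{L}(\mu,\lambda)=\mathbb{E}_\mu[f_0+\lambda f]+\beta\,\mathbb{E}_\mu[\log\mu]$. The numerical values of `f0`, `f`, `beta`, `lam`, and `mu` are arbitrary illustrations.

```python
import math

def lagrangian(mu, lam, f0, f, beta):
    """E_mu[f0 + lam * f] + beta * E_mu[log mu] on a finite state space."""
    return sum(m * (a + lam * b) + beta * m * math.log(m)
               for m, a, b in zip(mu, f0, f) if m > 0.0)

def gibbs(lam, f0, f, beta):
    """Gibbs distribution proportional to exp(-(f0 + lam * f) / beta) and its normalizer Z."""
    weights = [math.exp(-(a + lam * b) / beta) for a, b in zip(f0, f)]
    z = sum(weights)
    return [w / z for w in weights], z

def kl(mu, nu):
    return sum(m * math.log(m / n) for m, n in zip(mu, nu) if m > 0.0)

# Arbitrary illustrative values on four states.
f0 = [0.2, 1.0, 0.5, 2.0]
f = [1.0, -0.5, 0.3, -1.0]
beta, lam = 0.7, 1.5
mu = [0.1, 0.4, 0.3, 0.2]

mu_dagger, z = gibbs(lam, f0, f, beta)
g = -beta * math.log(z)                    # dual function g(lam)
lhs = lagrangian(mu, lam, f0, f, beta)     # L(mu, lam)
rhs = beta * kl(mu, mu_dagger) + g         # beta * KL(mu || mu_dagger) + g(lam)
print(lhs, rhs)
```

Because the KL term is nonnegative and vanishes only at $\mu=\mu_{\boldsymbol{\lambda}}^{\dagger}$, the same computation shows that the Gibbs distribution minimizes the Lagrangian and attains the value $g(\boldsymbol{\lambda})$.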

## Appendix D Discussions

### D.1 Dual Training

PDI updates the dual variables during inference, allowing the sampler to react immediately to constraint violations. An alternative is to estimate the optimal dual variables during training while training a score model to sample directly from the optimal distribution. We refer to this approach as dual training (DT). The idea behind DT is related to prior work on constrained diffusion models, Khalafi et al. ([2024](https://arxiv.org/html/2606.17192#bib.bib5), [2025](https://arxiv.org/html/2606.17192#bib.bib6)). However, this prior work typically considers a single optimization problem, such as fine-tuning a pre-trained model to satisfy a fixed set of constraints, where there is only one optimal dual variable to estimate. In contrast, our formulation considers a family of problem instances, each with its own optimal distribution and corresponding optimal dual variable. In this section, we extend DT to this setting by estimating the dual variables separately for each problem instance during training. The resulting method then serves as an ablation baseline to test the effect of inference-time dual updates.

The dual problem in ([3](https://arxiv.org/html/2606.17192#S3.E3)), for a given problem instance $\mathcal{G}$, can be rewritten as a bi-level problem,

$$D^{*}(\mathcal{G}) \;=\; \max_{\boldsymbol{\lambda}\succeq\mathbf{0}}\ \mathcal{L}(\mu_{\boldsymbol{\lambda}}^{\dagger},\boldsymbol{\lambda};\mathcal{G}) \quad\text{with}\quad \mu_{\boldsymbol{\lambda}}^{\dagger}=\operatorname*{argmin}_{\mu\in\mathcal{P}_{2}}\ \mathcal{L}(\mu,\boldsymbol{\lambda};\mathcal{G}). \tag{112}$$

The inner problem is equivalent to an unconstrained sampling problem, and the diffusion process defined in ([6](https://arxiv.org/html/2606.17192#S3.E6))–([7](https://arxiv.org/html/2606.17192#S3.E7)) provides a mechanism to sample from the target distribution $\mu_{\boldsymbol{\lambda}}^{\dagger}$. Therefore, we replace the inner problem with the problem of training the score model of the reverse process ([7](https://arxiv.org/html/2606.17192#S3.E7)). Let $\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x},t,\mathcal{G})$ be the score model, parameterized by $\boldsymbol{\phi}$, and $p_{T}^{\boldsymbol{\phi}}$ be the terminal distribution of the reverse process under a given dual variable $\boldsymbol{\lambda}$. The parameterized bi-level problem can then be cast as

$$D_{\boldsymbol{\phi}}^{*}(\mathcal{G})=\max_{\boldsymbol{\lambda}\succeq\mathbf{0}}\ \mathcal{L}\big(p_{T}^{\boldsymbol{\phi}_{\boldsymbol{\lambda}}},\boldsymbol{\lambda};\mathcal{G}\big) \quad\text{with}\quad \boldsymbol{\phi}_{\boldsymbol{\lambda}}(\mathcal{G})=\operatorname*{argmin}_{\boldsymbol{\phi}}\ \mathbb{E}_{t,\mathbf{x}_{t}}\Big[\omega(t)\big\|\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_{t},t,\mathcal{G})-\nabla\log q_{T-t}(\mathbf{x}_{t}|\boldsymbol{\lambda};\mathcal{G})\big\|_{2}^{2}\Big],\ \forall\boldsymbol{\lambda}. \tag{113}$$

The expectation is with respect to the noisy samples and the diffusion time. The inner problem in its current form assigns a different parametrization $\boldsymbol{\phi}$ to each dual variable $\boldsymbol{\lambda}$ and problem instance $\mathcal{G}$. Training a separate model for every such pair is not a practical training objective and is generally infeasible. To overcome this challenge, we alternate between the two problems so that, for an outer iteration $k$, the score model is trained only against the most recent dual variable, which is then updated based on the recent constraint violations. More concretely, the training algorithm performs the following dual-ascent iterations:

$$\boldsymbol{\phi}_{k} \;=\; \operatorname*{argmin}_{\boldsymbol{\phi}}\ \mathbb{E}_{\mathcal{G},t,\mathbf{x}_{t}}\Big[\omega(t)\big\|\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_{t},t,\mathcal{G})-\nabla\log q_{T-t}\big(\mathbf{x}_{t}|\boldsymbol{\lambda}_{k}(\mathcal{G});\mathcal{G}\big)\big\|_{2}^{2}\Big], \tag{114}$$

$$\boldsymbol{\lambda}_{k+1}(\mathcal{G}) \;=\; \Big[\boldsymbol{\lambda}_{k}(\mathcal{G})+\eta_{\text{DT}}\,\mathbb{E}_{p_{T}^{\boldsymbol{\phi}_{k}}}\big[\mathbf{f}(\mathbf{x}_{T};\mathcal{G})\big]\Big]_{+},\ \forall\,\mathcal{G}, \tag{115}$$

where $\eta_{\text{DT}}$ is the dual step size. The expectation in the training objective is now taken jointly over diffusion time, noisy samples, and the problem distribution, so that a single $\mathcal{G}$-conditional parameterization learns the score field for a family of problems. The dual variables, however, remain instance-specific. Each problem instance $\mathcal{G}$ maintains its own multiplier $\boldsymbol{\lambda}_{k}(\mathcal{G})$, updated only from the constraint violations observed for that instance.

Updating the dual variables requires estimating the constraint violations under the current sampler, which in turn requires running the full reverse process to obtain clean samples. Implementing ([114](https://arxiv.org/html/2606.17192#A4.E114))–([115](https://arxiv.org/html/2606.17192#A4.E115)) to completion would therefore be computationally prohibitive. In practice, we use an alternating implementation. We perform a few gradient steps on the score-matching objective in ([114](https://arxiv.org/html/2606.17192#A4.E114)), then update the dual variables for a subset of problem instances. Specifically, at each outer iteration, Algorithm [3](https://arxiv.org/html/2606.17192#alg3) samples a batch of instances $\mathcal{B}_{\boldsymbol{\lambda}}$ and updates only their corresponding multipliers, while the remaining dual variables are kept unchanged. As in Algorithm [1](https://arxiv.org/html/2606.17192#alg1), the score target is computed using the MC estimator in ([16](https://arxiv.org/html/2606.17192#A2.E16)), and the noisy samples $\mathbf{x}_{t}$ are obtained from rollouts of the inference dynamics at the beginning of each outer iteration.

At inference time, DT starts from Gaussian noise and runs the reverse diffusion process using the trained score model $\mathbf{s}_{\boldsymbol{\phi}^{*}}(\mathbf{x}_{t},t,\mathcal{G})$. For a test instance $\mathcal{G}$, the reverse dynamics are

$$\mathbf{x}_{t+1} \;=\; \frac{1}{\sqrt{a_{T-t}}}\Big(\mathbf{x}_{t}+b_{T-t}\,\mathbf{s}_{\boldsymbol{\phi}^{*}}\big(\mathbf{x}_{t},t,\mathcal{G}\big)\Big)+\sqrt{b_{T-t}}\,\boldsymbol{\epsilon}_{t}. \tag{116}$$

No dual variable is initialized, provided to the score model, or updated during this process. The constraint information enters only through the learned parameters $\boldsymbol{\phi}^{*}$, which were trained using the instance-dependent dual estimates in ([114](https://arxiv.org/html/2606.17192#A4.E114))–([115](https://arxiv.org/html/2606.17192#A4.E115)). Thus, DT produces samples in a single reverse pass and does not adapt the trajectory to constraint violations observed during inference.
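A minimal sketch of the reverse pass in (116) follows. The schedules `a` and `b` and the score function are assumptions made for illustration: the coefficients $a_k, b_k$ are defined in the main text, so the example uses a hypothetical variance-preserving choice $a_k = 1 - b_k$, and the trained network is replaced by the score of a standard Gaussian, $\mathbf{s}(\mathbf{x}) = -\mathbf{x}$.

```python
import math
import random

def reverse_step(x, t, T, a, b, score, rng):
    """One step of (116):
    x_{t+1} = (x_t + b_{T-t} * s(x_t, t)) / sqrt(a_{T-t}) + sqrt(b_{T-t}) * eps."""
    a_k, b_k = a[T - t], b[T - t]
    s = score(x, t)
    return [(xi + b_k * si) / math.sqrt(a_k) + math.sqrt(b_k) * rng.gauss(0.0, 1.0)
            for xi, si in zip(x, s)]

def sample(dim, T, a, b, score, seed=0):
    """Single reverse pass from Gaussian noise; no dual variable appears anywhere."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T):
        x = reverse_step(x, t, T, a, b, score, rng)
    return x

# Hypothetical schedule, indexed 1..T (index 0 is unused padding).
T = 10
b = [0.0] + [0.02 * k for k in range(1, T + 1)]
a = [1.0 - bk for bk in b]

# Stand-in for the trained network: score of N(0, I).
gaussian_score = lambda x, t: [-xi for xi in x]

print(sample(3, T, a, b, gaussian_score))
```

With this stand-in score and $a_k = 1 - b_k$, each step maps $\mathcal{N}(\mathbf{0},\mathbf{I})$ to itself, which is a convenient check that the update is wired correctly before plugging in a learned model.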

Algorithm 3 Dual Training

1. Initialize $\boldsymbol{\lambda}(\mathcal{G})\leftarrow\boldsymbol{\lambda}_{0}$ for all $\mathcal{G}$
2. Initialize score network $\mathbf{s}_{\boldsymbol{\phi}}$
3. for $d=1,\dots,D$ do ▷ Outer: dual iterations
4. Initialize replay buffer $\mathcal{B}\leftarrow\varnothing$
5. for each problem $\mathcal{G}$ in a batch $\mathcal{B}_{\text{train}}$ do
6. Rollout: sample $\mathbf{x}_{0}^{(i)}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$; run inference with $\mathbf{s}_{\boldsymbol{\phi}}$ to obtain $\{(\mathbf{x}_{t}^{(i)},t)\}_{t=0}^{T-1},\ \forall i\in[I]$
7. Push $\{(\mathbf{x}_{t}^{(i)},t,\mathcal{G})\}_{i,t}$ to $\mathcal{B}$
8. end for
9. for $e=1,\dots,E$ do ▷ Inner: train at fixed $\boldsymbol{\lambda}$
10. Sample a minibatch $\{(\mathbf{x}_{t}^{(j)},t^{(j)},\mathcal{G}^{(j)})\}_{j=1}^{B}$ from $\mathcal{B}$
11. for $j=1,\ldots,B$ do
12. if $\mathrm{Bernoulli}(\rho_{\text{pert}})=1$ then ▷ Perturb the sample
13. $\mathbf{x}_{t}^{(j)}\leftarrow\mathbf{x}_{t}^{(j)}+\epsilon_{\mathbf{x}}\cdot\boldsymbol{z}_{x}$, $\boldsymbol{z}_{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
14. end if
15. end for
16. $\boldsymbol{\phi}\leftarrow\boldsymbol{\phi}-\alpha\,\nabla\frac{1}{B}\sum_{j}w(t^{(j)})\,\big\lVert\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_{t}^{(j)},t^{(j)},\mathcal{G}^{(j)})-\nabla\log q_{T-t^{(j)}}(\mathbf{x}_{t}^{(j)}\mid\boldsymbol{\lambda}(\mathcal{G}^{(j)});\mathcal{G}^{(j)})\big\rVert_{2}^{2}$
17. end for
18. for each problem $\mathcal{G}$ in a batch $\mathcal{B}_{\boldsymbol{\lambda}}$ do ▷ Dual ascent
19. Generate samples $\{\mathbf{x}_{T}^{(i)}\sim p_{T}^{\boldsymbol{\phi}}\}_{i=1}^{I}$ via full reverse diffusion using the most recent $\boldsymbol{\phi}$
20. $\boldsymbol{\lambda}(\mathcal{G})\leftarrow\Big[\boldsymbol{\lambda}(\mathcal{G})+\eta_{\text{DT}}\,\sum_{i}\mathbf{f}(\mathbf{x}_{T}^{(i)};\mathcal{G})/I\Big]_{+}$
21. end for
22. end for
23. return $\boldsymbol{\phi}$
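The dual-ascent part of Algorithm 3 (steps 18–21, i.e., the update in (115) with a Monte Carlo average) can be sketched as follows. The names `dual_ascent_step`, `update_batch`, `sampler`, and `constraints` are hypothetical placeholders: `sampler(g)` stands in for a full reverse-diffusion rollout on instance `g`, and `constraints(x, g)` returns the vector $\mathbf{f}(\mathbf{x};\mathcal{G})$.

```python
def dual_ascent_step(lam, samples, constraints, eta):
    """Projected dual ascent: lam <- [lam + eta * mean_i f(x_i)]_+ (componentwise)."""
    n = len(samples)
    mean_violation = [0.0] * len(lam)
    for x in samples:
        fx = constraints(x)
        for j in range(len(lam)):
            mean_violation[j] += fx[j] / n
    return [max(0.0, lj + eta * vj) for lj, vj in zip(lam, mean_violation)]

def update_batch(lam_table, batch, sampler, constraints, eta):
    """Update only the multipliers of instances in `batch`; all others are left unchanged."""
    for g in batch:
        samples = sampler(g)  # placeholder for a full reverse-diffusion rollout
        lam_table[g] = dual_ascent_step(
            lam_table[g], samples, lambda x: constraints(x, g), eta)
    return lam_table

# Toy usage: two instances, each with the single constraint x[0] - 1 <= 0.
lam_table = {"g1": [0.0], "g2": [1.0]}
update_batch(lam_table, ["g1"],
             sampler=lambda g: [[2.0], [3.0]],
             constraints=lambda x, g: [x[0] - 1.0],
             eta=0.1)
print(lam_table)
```

In the toy usage, only the multiplier of `g1` moves, because `g2` is not in the sampled batch. The projection onto the nonnegative orthant keeps multipliers at zero for constraints that are satisfied with slack on average.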

### D.2 DT Challenges

Since we execute a few gradient steps instead of fully solving the minimization problem for each outer iteration, it is sensible to assume that the score model at iteration $k$ satisfies

$$\boldsymbol{\phi}_{k} \;\approx\; \operatorname*{argmin}_{\boldsymbol{\phi}}\ \mathbb{E}_{\mathcal{G},t,\mathbf{x}_{t}}\,\mathbb{E}_{\boldsymbol{\lambda}\sim\pi_{k}(\cdot\mid\mathcal{G})}\Big[\omega(t)\big\|\mathbf{s}_{\boldsymbol{\phi}}(\mathbf{x}_{t},t,\mathcal{G})-\nabla\log q_{T-t}\big(\mathbf{x}_{t}|\boldsymbol{\lambda};\mathcal{G}\big)\big\|_{2}^{2}\Big], \tag{117}$$

where $\pi_{k}(\cdot\mid\mathcal{G})=\sum_{i=0}^{k}w_{i}\,\delta_{\boldsymbol{\lambda}_{i}(\mathcal{G})}$ captures the mixture of dual variables encountered in the preceding training iterations and $\sum_{i}w_{i}=1$. The weights $w_{i}$ reflect the effective contribution of each past dual iterate to the score-model parameters. Although each dual iterate is used only once in ([114](https://arxiv.org/html/2606.17192#A4.E114)), later iterates typically have larger weights because their gradient updates are applied closer to the final parameter state. Since the dual variable is not provided as an input to the score model, the corresponding predictor is an averaged score field,

$$\mathbf{s}_{\boldsymbol{\phi}_{k}}(\mathbf{x}_{t},t,\mathcal{G}) \;\approx\; \mathbb{E}_{\boldsymbol{\lambda}\sim\pi_{k}(\cdot\mid\mathcal{G})}\Big[\nabla_{\mathbf{x}_{t}}\log q_{T-t}\big(\mathbf{x}_{t}|\boldsymbol{\lambda};\mathcal{G}\big)\Big]. \tag{118}$$

This averaging is benign when the dual distribution collapses to a point mass at a value close to the optimum, or when the conditional scores associated with the dual iterates are nearly aligned.
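The step from (117) to (118) rests on a standard least-squares fact: a predictor that does not see $\boldsymbol{\lambda}$ and is fit with squared loss against a weighted mixture of targets converges to their weighted mean. The sketch below illustrates this at a single point $x$ with a hypothetical one-dimensional family: for $f_0(x)=x^2/2$, $f(x)=x$, and $\beta=1$, the Gibbs density is $\mathcal{N}(-\lambda, 1)$ and its (noise-free) score is $-(x+\lambda)$. The weights and multiplier values are arbitrary.

```python
def averaged_score(x, lambdas, weights, cond_score):
    """Right-hand side of (118): mixture average of the conditional scores."""
    return sum(w * cond_score(x, lam) for w, lam in zip(weights, lambdas))

def fit_unconditional(x, lambdas, weights, cond_score, lr=0.1, steps=500):
    """Gradient descent on sum_i w_i * (s - s_i)^2 for a lambda-agnostic scalar predictor s."""
    s = 0.0
    for _ in range(steps):
        grad = sum(2.0 * w * (s - cond_score(x, lam)) for w, lam in zip(weights, lambdas))
        s -= lr * grad
    return s

# Hypothetical conditional score: Gibbs density N(-lam, 1), so score = -(x + lam).
cond_score = lambda x, lam: -(x + lam)

lambdas = [0.5, 2.0]    # two dual iterates seen during training
weights = [0.25, 0.75]  # later iterate weighted more heavily
x = 1.0

fitted = fit_unconditional(x, lambdas, weights, cond_score)
target = averaged_score(x, lambdas, weights, cond_score)
mismatch = [abs(fitted - cond_score(x, lam)) for lam in lambdas]
print(fitted, target, mismatch)
```

The fitted value matches the mixture average but coincides with neither conditional score, and the gap to each one grows with the spread of the multipliers. This is the multiplier-specific deviation that a $\boldsymbol{\lambda}$-agnostic model cannot represent.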

However, for binding and near\-binding constraints, where the constraint residual is close to zero, small changes in the generated samples change the constraint residual and therefore the multiplier update\. As a result, the score model is trained on targets associated with a local range of dual values around the equilibrium\. When two of these multipliers induce noticeably different score directions or correction strengths, the averaged score no longer matches the score associated with either dual value exactly\. DT can therefore retain the part of the correction that is common across the dual trajectory, but it loses the multiplier\-specific deviations\. This explains why DT can still learn a reasonable constrained sampler when the dual trajectory repeatedly emphasizes the same constraints, while remaining less accurate for binding or near\-binding constraints whose correction strength depends on the exact dual variable\.

This limitation is amplified at test time for unseen problem instances\. In DT, no multiplier is inferred during sampling\. Therefore, for a new instance, the model must implicitly infer from the problem data which constraints are active and what correction strength is needed\. When the new instance induces multipliers close to those encountered during training, DT can still perform well\. However, if the new instance changes the binding constraints or requires noticeably different multiplier magnitudes, the learned averaged score field has no explicit mechanism to adapt\.

## Appendix E Extended Numerical Results: Mixture of Gaussians

We first validate PDI on the problem of sampling from a weighted mixture of Gaussians in $\mathbb{R}^{d}$,

$$f_{0}(\mathbf{x}) \;=\; -\log\sum_{k=1}^{K}w_{k}\,\mathcal{N}\big(\mathbf{x};\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k}\big), \tag{119}$$

truncated to a polytope $\{\mathbf{x}:\mathbf{A}\mathbf{x}\leq\mathbf{b}\}$. The optimization problem is

$$\min_{\mu(\mathbf{x})}\;\mathbb{E}_{\mathbf{x}\sim\mu}\big[f_{0}(\mathbf{x})\big]\qquad\text{s.t.}\qquad\mathbb{E}_{\mathbf{x}\sim\mu}\big[\mathbf{A}^{\top}\mathbf{x}-\mathbf{b}\big]\;\preceq\;\mathbf{0}, \tag{120}$$

i.e., the sampler must produce a distribution that concentrates on high-density regions of the mixture while satisfying $M$ linear inequality constraints in expectation. This is not an everywhere-constrained sampling problem, where each sample $\mathbf{x}\in\mathbb{R}^{d}$ is required to satisfy the constraints. In ([13](https://arxiv.org/html/2606.17192#S5.E13)), the constraints are imposed on average over the target distribution. This allows some samples to violate the constraints, provided that others are strictly feasible to compensate.
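To make the objective in (119) and the average constraints in (120) concrete, here is a small sketch. It assumes isotropic covariances $\boldsymbol{\Sigma}_k=\sigma_k^2\mathbf{I}$ (as described for the experiments) and represents each linear constraint by a hypothetical pair (`normal`, `offset`), so that constraint $j$ reads $\mathbf{a}_j^\top\mathbf{x}-b_j\le 0$. The sample values are made up.

```python
import math

def mog_energy(x, weights, means, variances):
    """f0(x) = -log sum_k w_k N(x; mu_k, var_k * I), via a stable log-sum-exp."""
    d = len(x)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        log_terms.append(math.log(w) - 0.5 * d * math.log(2.0 * math.pi * var)
                         - 0.5 * sq / var)
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))

def average_violation(samples, normals, offsets):
    """Monte Carlo estimate of E[a_j^T x - b_j] for every constraint j."""
    n = len(samples)
    return [sum(sum(a * xi for a, xi in zip(normal, x)) for x in samples) / n - offset
            for normal, offset in zip(normals, offsets)]

# One constraint x_1 <= 1 in two dimensions; the first sample violates it.
samples = [[2.0, 0.0], [-1.0, 0.5]]
violation = average_violation(samples, normals=[[1.0, 0.0]], offsets=[1.0])
print(violation)  # negative entry: feasible on average

print(mog_energy([0.0, 0.0], weights=[0.5, 0.5],
                 means=[[0.0, 0.0], [3.0, 0.0]], variances=[1.0, 0.5]))
```

The first sample is individually infeasible, yet the batch satisfies the constraint on average. This is exactly the slack that per-sample enforcement gives up.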

### E.1 Baseline configuration

Each baseline shares the evaluation seed, batch size, and diffusion discretisation $T=500$ where applicable.

##### Unconstrained DM\.

We run our trained model with $\boldsymbol{\lambda}_{t}=\mathbf{0}$ for all time steps. This eliminates the constraint term from the energy function and lets the model sample from the unconstrained Boltzmann density $\exp(-f_{0}(\mathbf{x})/\beta)$.

##### PDM\.

Similarly, we run a reverse process with $\boldsymbol{\lambda}_{t}=\mathbf{0},\ \forall t$. At each step, the Tweedie prediction $\hat{\mathbf{x}}_{0}$ is projected onto the feasible set before being used in the DDPM posterior,

$$\tilde{\mathbf{x}}_{0}\;=\;\Pi_{\mathcal{F}}\big(\hat{\mathbf{x}}_{0}\big),\qquad\mathcal{F}=\big\{\mathbf{x}:f_{j}(\mathbf{x})\leq 0,\;j=1,\dots,M\big\}. \tag{121}$$

PDM thus enforces feasibility by construction and is therefore a hard-constraint method, unlike PDI.
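For intuition about the projection $\Pi_{\mathcal{F}}$, the sketch below handles the simplest case of a single halfspace $\{\mathbf{x}:\mathbf{a}^\top\mathbf{x}\le b\}$, where the Euclidean projection has a closed form. This is a deliberate simplification: projecting onto the intersection of $M$ halfspaces generally requires solving a quadratic program, and the function name is hypothetical.

```python
def project_halfspace(x, a, b):
    """Euclidean projection of x onto {z : a^T z <= b}.
    Feasible points are returned unchanged; infeasible points are moved
    along the normal a until they land on the boundary a^T z = b."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]

# A predicted clean sample that violates x_1 <= 1 is pulled back to the boundary.
print(project_halfspace([3.0, 4.0], a=[1.0, 0.0], b=1.0))
# A feasible prediction is left untouched.
print(project_halfspace([0.5, 4.0], a=[1.0, 0.0], b=1.0))
```

Every infeasible prediction ends up exactly on the boundary, which illustrates why per-sample projection tends to pile mass near the edge of the feasible set.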

##### PDL\.

Direct Langevin dynamics in the $\mathbf{x}$-space with per-sample dual variables:

$$\mathbf{x}_{t+1}=\mathbf{x}_{t}-\eta_{\text{p}}\,\nabla_{\mathbf{x}_{t}}\Big(f_{0}(\mathbf{x}_{t})+\boldsymbol{\lambda}_{t}^{\top}\mathbf{f}(\mathbf{x}_{t})\Big)+\sqrt{2\eta_{\text{p}}}\,\boldsymbol{\epsilon}_{t}, \tag{122}$$

$$\boldsymbol{\lambda}_{t+1}=\big[\boldsymbol{\lambda}_{t}+\eta_{\text{d}}\,\mathbf{f}(\mathbf{x}_{t+1})\big]_{+}, \tag{123}$$

where $\boldsymbol{\epsilon}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The main difference between PDL and PDI is that the former operates in the clean sample space. It therefore uses the gradient of the energy function to update $\mathbf{x}$. In PDI, the dynamics evolve under a decreasing noise level, and at each time step, we follow the score of the corresponding denoised distribution.

A distinction arises between these baselines and our method PDI. In all the aforementioned updates, the correction is made to each sample individually, forcing all samples to abide by the constraints. Their solutions are therefore feasible in our problem formulation, but suboptimal because they are more conservative. Figure [5](https://arxiv.org/html/2606.17192#A5.F5) shows a toy example of the problem with $d=2$, $K=3$, and $M=2$. Per-sample enforcement sacrifices the opportunity to sample from the high-density regions of the MoG while still satisfying the constraints on average. Instead, it concentrates samples near the boundary of the feasible set, shown as the white region.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x7.png)
Figure 5: Comparison of the Constrained Energy-based Diffusion-sampler (CED) enforcing the constraints in expectation (top) vs. per sample (bottom). Per-sample enforcement produces conservative samples that satisfy the constraints individually but fall short of a competitive objective value.

In the following table, we provide all the hyperparameters of the baselines:
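The PDL updates (122)–(123) can be written out for a single scalar sample with one constraint. The sketch below is illustrative: the function name `pdl_step`, the quadratic objective, the constraint $x \le 1$, and the step sizes are all assumptions chosen to keep the example short.

```python
import math
import random

def pdl_step(x, lam, grad_f0, f, grad_f, eta_p, eta_d, rng):
    """One primal-dual Langevin step for scalar x and a single constraint.
    Primal: Langevin step on f0 + lam * f.  Dual: projected ascent on f(x_new)."""
    drift = grad_f0(x) + lam * grad_f(x)
    x_new = x - eta_p * drift + math.sqrt(2.0 * eta_p) * rng.gauss(0.0, 1.0)
    lam_new = max(0.0, lam + eta_d * f(x_new))
    return x_new, lam_new

# Hypothetical problem: f0(x) = (x - 3)^2 / 2 with constraint f(x) = x - 1 <= 0.
grad_f0 = lambda x: x - 3.0
f = lambda x: x - 1.0
grad_f = lambda x: 1.0

rng = random.Random(0)
x, lam = 0.0, 0.0
trajectory = []
for _ in range(2000):
    x, lam = pdl_step(x, lam, grad_f0, f, grad_f, eta_p=0.01, eta_d=0.01, rng=rng)
    trajectory.append(x)

tail = trajectory[1000:]
print(sum(tail) / len(tail), lam)
```

Each sample carries its own multiplier, which is driven by that sample's own constraint value rather than by a batch average. This is the per-sample enforcement contrasted with PDI in this section.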

### E.2 Data and Architecture

The problem consists of a $K$-component Gaussian mixture, truncated by $M$ halfplane constraints. Mode centers are placed uniformly at random in a ball of radius proportional to a spread parameter, with isotropic covariances drawn from a specified range. The constraint normals are chosen so that some halfplanes separate pairs of modes while others cut through mode basins, ensuring that constraints are active near high-density regions. Constraint levels $\mathbf{b}$ are calibrated so that a target fraction of modes have infeasible centers and the remaining feasible modes sit close to constraint boundaries. Since this example is used as a proof of concept, the dataset has only one problem instance.

The score network is an MLP with FiLM conditioning. The input $(\mathbf{x}_{t},\boldsymbol{\lambda})$ is projected into a hidden dimension, added to the sinusoidal timestep embedding, and passed through a stack of residual blocks. Each block applies a linear layer, layer normalization with FiLM modulation from the timestep embedding (scale and shift), and a SiLU activation. The output is projected back to $\mathbb{R}^{d}$ to predict the noise $\boldsymbol{\epsilon}_{t}$.

### E.3 Hyperparameter Choices

We ran a sweep over the hyperparameters $\beta$, $T$, $\eta$, $\rho$, $K_{\text{MC}}$, $\nu_{\text{max}}$, $\epsilon_{\mathbf{x}}$, and $\epsilon_{\boldsymbol{\lambda}}$, and Table [3](https://arxiv.org/html/2606.17192#A5.T3) shows the values we use in our experiments. All experiments were run on a single NVIDIA GeForce RTX 3090 card.

Table 3: PDI training hyperparameters for the MoG experiment.

| Group | Parameter | Value |
| --- | --- | --- |
| Problem | Dimension $d$ | 30 |
| Problem | Mixture components $K$ | 12 |
| Problem | Constraints $M$ | 10 |
| Training | Optimizer | AdamW |
| Training | Learning rate | $10^{-3}$ |
| Training | SGD steps | 8000 ($800\times 10$) |
| Training | Batch size $B$ | 256 |
| Training | MC samples $K_{\mathrm{MC}}$ | 256 |
| Training | Rollout batch $\lvert\mathcal{B}_{\text{train}}\rvert$ | 4 graphs |
| Training | Exploitation fraction $\rho_{\text{exp}}$ | 0 $\to$ 0.7 |
| Diffusion | Noise schedule | cosine |
| Diffusion | Timesteps $T$ | 500 |
| Diffusion | Inverse temperature $\beta^{-1}$ | 50 |
| Training – dual ascent rollouts | Training dual step size $\eta_{\mathrm{train}}$ | 0.01 |
| Training – dual ascent rollouts | Dual variable cap $\boldsymbol{\lambda}_{\max}$ | 50 |
| Score network | Architecture | MLP |
| Score network | Hidden dimension | 512 |
| Score network | Number of layers | 8 |
| Score network | Parameters | 8.73M |
| Score network | Conditioning inputs | $(\mathbf{x}_{t},t,\boldsymbol{\lambda})$ |
| Training – $\boldsymbol{\lambda}$ prior | $\boldsymbol{\lambda}$-prior family | $\exp(\nu)$ |
| Training – $\boldsymbol{\lambda}$ prior | $\nu_{\min}$ | 0.1 |
| Training – $\boldsymbol{\lambda}$ prior | $\nu_{\max}$ | 20.0 |
| Training – perturbation | Perturbation fraction $\rho_{\text{pert}}$ | 1.0 |
| Training – perturbation | $\mathbf{x}_{t}$ perturbation std $\epsilon_{\mathbf{x}}$ | 2.0 |
| Training – perturbation | $\boldsymbol{\lambda}$ perturbation std $\epsilon_{\boldsymbol{\lambda}}$ | 5.0 |
| Inference | Inference dual step size $\eta$ | 1.0 |
| Inference | Initial $\boldsymbol{\lambda}_{0}$ | $\mathbf{0}$ |
| Inference | Feasibility tolerance | 0.02 |
| Inference | Inverse temperature $\beta^{-1}$ | 50 |

### E.4 Additional Results

We also record the input-to-output transition matrix and report it in Figure [6](https://arxiv.org/html/2606.17192#A5.F6). The figure shows that PDL divides the mass between one mode (#10) and the diagonal of the transition matrix. The mass on the diagonal indicates that each input sample typically converges to its nearest mode. In contrast, PDI does not exhibit this behavior. This is partly due to the noise injected along the reverse trajectory, which acts as an annealing mechanism and allows samples to move across modes. It is also due to the entropy regularization in the PDI formulation, controlled by the temperature parameter, which encourages a distributional solution rather than deterministic convergence to the nearest mode. This temperature-driven randomization is absent in PDL.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x8.png)
Figure 6: Initial-to-final transition matrix. The matrix is almost diagonal in the PD Langevin case, reflecting the fact that samples converge to their nearest modes.

## Appendix F Extended Numerical Results: Wireless Resource Allocation

We study stochastic constrained optimization for optimal wireless resource allocation, Uslu et al. ([2025a](https://arxiv.org/html/2606.17192#bib.bib7), [2026](https://arxiv.org/html/2606.17192#bib.bib8), [b](https://arxiv.org/html/2606.17192#bib.bib9)).

Consider an ad-hoc wireless network comprised of $N=200$ transmitter-receiver (tx-rx) pairs deployed uniformly over a square area of side length $R=5800$ meters, yielding an average density of $\nu=5.9$ tx-rx pairs/km$^{2}$. Each transmitter communicates with a single designated receiver, and each receiver treats incoming signals from all other transmitters as interference. The resource allocation (optimization) variable is the transmit power vector $\mathbf{x}\in[0,P_{\max}]^{N}$, where $P_{\max}=10$ mW.

##### Channel model\.

The channel model includes both large-scale and small-scale fading effects. We denote by $h_{ij}$ the large-scale channel gain from transmitter $i$ to receiver $j$, governed by a dual-slope path-loss model with log-normal shadowing of standard deviation $\sigma=7$ dB. A *network configuration* $\mathbf{H}\in\mathbb{R}^{N\times N}_{+}$ is the matrix of all large-scale gains $[\mathbf{H}]_{ij}=h_{ij}$, and it is jointly determined by the random geometric deployment of the tx-rx pairs, which fixes the pairwise distances and hence the path-loss components, and the realization of the log-normal shadowing. Each network configuration $\mathbf{H}$ remains fixed over the operating horizon, while the small-scale fading, which follows a Rayleigh distribution with a pedestrian velocity of $1$ m/s, produces time-varying instantaneous channel gains $h_{ij,t}$ at each time slot $t$ of duration $50$ milliseconds. We set the channel bandwidth to $W=40$ MHz and the noise power spectral density to $N_{0}=-174$ dBm/Hz.

Given an allocation $\mathbf{x}_{t}$ and fading realization $\mathbf{H}_{t}$ at time slot $t$, the instantaneous rate of receiver $j$ is

$$\tilde{r}_{j}(\mathbf{x}_{t},\mathbf{H}_{t})=\log_{2}\left(1+\frac{[\mathbf{x}_{t}]_{j}\cdot|h_{jj,t}|^{2}}{WN_{0}+\sum_{i\neq j}[\mathbf{x}_{t}]_{i}\cdot|h_{ij,t}|^{2}}\right). \tag{124}$$

The ergodic rate of receiver $j$ is obtained by averaging the instantaneous rates over $T$ time slots, i.e., $r_{j}\approx(1/T)\sum_{t=1}^{T}\tilde{r}_{j}(\mathbf{x}_{t},\mathbf{H}_{t})$.
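The rate formula (124) and the time-averaged ergodic rate translate directly into code. The sketch below assumes powers are expressed in mW, so that the noise power $WN_0$ is obtained by converting $N_0$ from dBm/Hz to mW/Hz. The function names and the toy two-user channel are illustrative.

```python
import math

def noise_power_mw(bandwidth_hz, n0_dbm_per_hz):
    """W * N0 in mW, converting the noise PSD from dBm/Hz to mW/Hz."""
    return bandwidth_hz * 10.0 ** (n0_dbm_per_hz / 10.0)

def instantaneous_rates(x, h, noise_mw):
    """Rates from (124). x[j]: transmit power of pair j (mW);
    h[i][j]: channel amplitude from transmitter i to receiver j."""
    n = len(x)
    rates = []
    for j in range(n):
        signal = x[j] * abs(h[j][j]) ** 2
        interference = sum(x[i] * abs(h[i][j]) ** 2 for i in range(n) if i != j)
        rates.append(math.log2(1.0 + signal / (noise_mw + interference)))
    return rates

def ergodic_rates(xs, hs, noise_mw):
    """Average of the instantaneous rates over the provided time slots."""
    slots = len(xs)
    acc = [0.0] * len(xs[0])
    for x, h in zip(xs, hs):
        for j, r in enumerate(instantaneous_rates(x, h, noise_mw)):
            acc[j] += r / slots
    return acc

# Noise power for W = 40 MHz and N0 = -174 dBm/Hz.
print(noise_power_mw(40e6, -174.0))

# Toy two-user example with unit noise: a switching policy over two slots.
h = [[1.0, 0.5], [0.5, 1.0]]
xs = [[3.0, 0.0], [0.0, 3.0]]  # user 0 active in slot 1, user 1 in slot 2
print(ergodic_rates(xs, [h, h], noise_mw=1.0))
```

In the toy example, each user transmits alone in one of the two slots, so no interference is incurred and both users obtain the same average rate. This is a small instance of the switching behavior discussed for the optimal policy.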

##### Optimization objective\.

For a given network state𝐇\\mathbf\{H\}, the goal is to maximize the ergodic sum\-rate subject to per\-user minimum ergodic rate constraints\. That is,

Ppc∗=maxμ\(𝐱\)⁡1⊤𝐫\(μ,𝐇\)\+βℋ\(μ\),s\.t\.𝐫\(μ,𝐇\)≥𝟏⋅rmin,P^\{\*\}\_\{\\text\{pc\}\}~=~\\max\_\{\\mu\(\{\\mathbf\{x\}\}\)\}\\;\\mathbf\{1\}^\{\\top\}\\mathbf\{r\}\\\!\\left\(\\mu,\\mathbf\{H\}\\right\)\+\\beta\{\\mathcal\{H\}\}\(\\mu\),\\quad\\text\{s\.t\.\}\\quad\\mathbf\{r\}\\\!\\left\(\\mu,\\mathbf\{H\}\\right\)\\geq\\mathbf\{1\}\\cdot r\_\{\\min\},\(125\)whereμ\\mudenotes a power allocation policy, andrminr\_\{\\min\}is the minimum ergodic rate requirement\. The solution to this problem is rarely a degenerate policy\. Because users compete for the same communication resources and interfere with one another, it is generally impossible for all users to transmit at high power simultaneously while satisfying their rate requirements\. The optimal policy is therefore often a switching policy, where different users are activated at different times so that each receives enough transmission opportunities to meet its rate constraint on average\. This naturally leads to a multimodal optimal distribution, with different modes corresponding to different active\-user patterns\.

### F.1 Data and Architecture

##### Dataset Generation.

We generate a dataset of $256$ network configurations in total, viewed as independent samples from the underlying distribution of network configurations induced by the random geometry of tx-rx deployments and random shadowing for the given value of $R$. A 5:1:2 train-validation-test split results in a training dataset of $|\mathcal{D}|=160$ networks, $|\mathcal{V}_{\text{al}}|=32$ validation networks, and $|\mathcal{T}|=64$ test networks.

##### Graph representation.

We represent each network configuration $\mathbf{H}$ as a directed graph $\mathcal{G}_{\mathbf{H}}=(\mathcal{V},\mathcal{E},\mathbf{w})$, where the node set $\mathcal{V}=\{1,\dots,N\}$ corresponds to the tx-rx pairs. We assign each directed edge $(i,j)\in\mathcal{E}$ a weight proportional to the log-normalized channel gain given by

$$e_{ij}\propto\log_2\!\left(1+\frac{P_{\max}|h_{ij}|^2}{WN_0}\right), \tag{126}$$

and sparsify the graph by thresholding the weak edges so that we retain only the top-$10$ strongest incoming edges per receiver node.
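The graph construction can be sketched as follows. This is an illustrative reading of Eq. (126) and the top-$k$ rule, not the authors' code: the proportionality constant is taken to be one, the self edge $(j,j)$ is treated like any other incoming edge, and the random gains are placeholders.

```python
import math
import random

def edge_weights(gains, p_max, noise):
    """Eq. (126) with unit proportionality constant.
    gains[i][j] stands for |h_ij|^2; noise stands for W * N0."""
    n = len(gains)
    return [[math.log2(1.0 + p_max * gains[i][j] / noise) for j in range(n)]
            for i in range(n)]

def keep_top_k_incoming(weights, k):
    """For every receiver j keep the k largest weights[i][j]; zero the rest."""
    n = len(weights)
    sparse = [[0.0] * n for _ in range(n)]
    for j in range(n):
        strongest = sorted(range(n), key=lambda i: weights[i][j], reverse=True)[:k]
        for i in strongest:
            sparse[i][j] = weights[i][j]
    return sparse

rng = random.Random(0)
n, k = 6, 3  # small stand-ins for N = 200 and top-10
gains = [[rng.uniform(0.01, 1.0) for _ in range(n)] for _ in range(n)]
sparse = keep_top_k_incoming(edge_weights(gains, p_max=1.0, noise=0.1), k)
incoming_counts = [sum(1 for i in range(n) if sparse[i][j] > 0.0) for j in range(n)]
print(incoming_counts)
```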

##### Graph Neural Networks (GNNs).

The power allocations $\mathbf{x}$ and ergodic rate vectors $\mathbf{r}$ are naturally interpreted as node-level signals supported on $\mathcal{G}_{\mathbf{H}}$. We therefore use GNNs as the base parameterization for the score model, which takes the problem representation $\mathcal{G}_{\mathbf{H}}$ as input. We implement graph convolutions using PyTorch Geometric's TAGConv (with $K=2$). The model consists of $L=8$ TAGConv residual blocks with hidden dimension $d_h=256$.

The timestep information is injected through Feature-wise Linear Modulation (FiLM). The dual variable $\boldsymbol{\lambda}\in\mathbb{R}^{N}$ is concatenated to the input features $\mathbf{x}$.
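The two conditioning mechanisms can be illustrated without any deep-learning library. In the sketch below, `concat_node_features` appends the dual variable as an extra per-node input channel, and `film` applies a channel-wise scale and shift. In a trained model the FiLM parameters `gamma` and `beta` would be produced from the timestep embedding by learned layers; here they are passed in directly, and all names are hypothetical.

```python
def concat_node_features(x, lam):
    """Per-node input [x_j, lambda_j]: the dual variable becomes an extra channel."""
    return [[xj, lj] for xj, lj in zip(x, lam)]

def film(hidden, gamma, beta):
    """FiLM: scale and shift every node's hidden vector channel by channel."""
    return [[g * h + b for h, g, b in zip(node, gamma, beta)] for node in hidden]

# Three nodes with allocation x and dual variable lambda (illustrative values).
x = [0.2, 0.9, 0.0]
lam = [10.0, 3.5, 0.0]
node_inputs = concat_node_features(x, lam)

# Two hidden channels per node, modulated by timestep-dependent (gamma, beta).
hidden = [[1.0, 2.0], [0.5, -1.0], [0.0, 4.0]]
modulated = film(hidden, gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(node_inputs)
print(modulated)
```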

### F.2 Baselines

##### PDM.

We run our trained score model with $\boldsymbol{\lambda}_t=0$ for all the denoising steps to eliminate the dual ascent. After each step, we perform a projection into the feasible set by solving the augmented-Lagrangian problem:

$$\mathbf{x}_t\leftarrow\arg\min_{\mathbf{w}}\;\|\mathbf{w}-\mathbf{x}_t\|^2+\tfrac{\zeta}{2}\|[\mathbf{f}(\mathbf{w})]_+\|^2, \tag{127}$$

where $[\mathbf{x}]_+=\max\{\mathbf{0},\mathbf{x}\}$. We solve the problem via a few steps of gradient descent.
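A minimal sketch of the projection in Eq. (127), assuming a hypothetical linear constraint map $\mathbf{f}(\mathbf{w})=A\mathbf{w}-\mathbf{b}$ so that the gradient is available in closed form. The values of `zeta`, `lr`, and `steps` are illustrative, not the settings used in the experiments.

```python
def constraint(w, A, b):
    """Hypothetical linear constraint map f(w) = A w - b (feasible when f <= 0)."""
    return [sum(a * wi for a, wi in zip(row, w)) - bi for row, bi in zip(A, b)]

def penalty_projection(x, A, b, zeta=10.0, lr=0.05, steps=50):
    """Gradient descent on ||w - x||^2 + (zeta / 2) * ||[f(w)]_+||^2."""
    w = list(x)
    for _ in range(steps):
        violation = [max(0.0, v) for v in constraint(w, A, b)]
        grad = [2.0 * (w[i] - x[i])
                + zeta * sum(A[m][i] * violation[m] for m in range(len(A)))
                for i in range(len(w))]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

A, b = [[1.0, 1.0]], [2.0]  # constraint w_0 + w_1 <= 2
x = [2.0, 2.0]              # infeasible starting point
w = penalty_projection(x, A, b)
print(w, constraint(w, A, b))
```

For this example the minimizer is $w_0=w_1=12/11$, which still violates the constraint slightly. This is expected from a quadratic penalty with finite $\zeta$: the residual shrinks as $\zeta$ grows but does not vanish.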

##### DPS.

We run a reverse process with our trained score model while forcing $\boldsymbol{\lambda}_t=\mathbf{0},\ \forall t,$ to eliminate the dual ascent. At each step in the inference process, the prediction $\mathbf{x}_t$ is corrected by a gradient step that penalizes constraint violation before it is fed to the next iteration,

$$\mathbf{x}_t\leftarrow\mathbf{x}_t\;-\;s_t\;\nabla_{\mathbf{x}_t}\bigl\|\mathbf{f}(\mathbf{x}_t)\bigr\|^2. \tag{128}$$

The gradient is applied directly to the denoised signal at the end of each step, and weighted by a scalar step size $s_t$. The step size is set to the inverse of the noise variance at each denoising step.
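The correction in Eq. (128) can be sketched for a hypothetical linear constraint map $\mathbf{f}(\mathbf{x})=A\mathbf{x}-\mathbf{b}$, for which $\nabla\|\mathbf{f}(\mathbf{x})\|^2=2A^{\top}\mathbf{f}(\mathbf{x})$. The noise variance below is an arbitrary illustrative value.

```python
def constraint(x, A, b):
    """Hypothetical linear constraint map f(x) = A x - b."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]

def dps_correction(x, A, b, noise_var):
    """One step of Eq. (128) with step size s_t = 1 / noise_var."""
    s_t = 1.0 / noise_var
    f = constraint(x, A, b)
    # For linear f, the gradient of ||f(x)||^2 is 2 A^T f(x).
    grad = [2.0 * sum(A[m][i] * f[m] for m in range(len(A)))
            for i in range(len(x))]
    return [xi - s_t * gi for xi, gi in zip(x, grad)]

A, b = [[1.0, 1.0]], [2.0]
x = [2.0, 2.0]
x_new = dps_correction(x, A, b, noise_var=10.0)
print(x_new, constraint(x_new, A, b))
```

Note that, as written, Eq. (128) penalizes $\|\mathbf{f}\|^2$ rather than the clipped violation $\|[\mathbf{f}]_+\|^2$ of Eq. (127), so this sketch follows that form literally.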

##### ST & PD Expert Policy.

Following the methodology in Uslu et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib8)), for each training network $\{\mathbf{H}^{(b)}\}_{b\in\mathcal{D}}$, we solve the power control problem using an expert algorithm, specifically a dual gradient descent algorithm. The expert algorithm generates trajectories of primal iterates that are near-optimal and feasible in the ergodic sense. We view these iterates as draws from a stochastic, optimal policy, collecting $M=200$ samples from the convergence regime of the algorithm to obtain an expert dataset of resource allocation vectors $\{\mathbf{x}^{(m)}(\mathbf{H}^{(b)})\}_{m=1}^{M}$.

We train a diffusion model to imitate the expert data generated by the PD Expert. The neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ is trained via the standard denoising score matching objective and sampled using the deterministic DDIM update:

$$\mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0+\sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t). \tag{129}$$

This method has no explicit constraint enforcement mechanism; feasibility depends entirely on the training data distribution and the expressivity of the parameterization.
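A minimal sketch of the deterministic DDIM update in Eq. (129). The clean-sample estimate $\hat{\mathbf{x}}_0=(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta)/\sqrt{\bar{\alpha}_t}$ is the standard DDIM choice and is an assumption here, since this section does not spell it out. The noise prediction is passed in as a plain list in place of a trained network.

```python
import math

def ddim_step(x_t, eps, abar_t, abar_prev):
    """Eq. (129): x_{t-1} = sqrt(abar_prev) * x0_hat + sqrt(1 - abar_prev) * eps."""
    # Standard clean-sample estimate from the predicted noise (assumed form).
    x0_hat = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
              for x, e in zip(x_t, eps)]
    x_prev = [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
              for x0, e in zip(x0_hat, eps)]
    return x_prev, x0_hat

# Sanity check with a known clean sample and a "perfect" noise prediction.
x0 = [0.3, -1.2]
eps = [0.5, 0.1]
abar_t, abar_prev = 0.4, 0.7
x_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e for a, e in zip(x0, eps)]
x_prev, x0_hat = ddim_step(x_t, eps, abar_t, abar_prev)
print(x0_hat, x_prev)
```

When the predicted noise equals the true noise, the step recovers $\mathbf{x}_0$ exactly and re-noises it to the level $\bar{\alpha}_{t-1}$.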

In order to replicate the performance of Uslu et al. ([2026](https://arxiv.org/html/2606.17192#bib.bib8)), we use the U-Graph Neural Network (U-GNN) parametrization proposed therein, which was designed for graph signal diffusion. This ensures that the DDIM imitation baseline follows the original architectural setup. We do not use U-GNN as the base parametrization for our method because it is computationally heavy. Instead, we use a vanilla GNN, since PDI enforces the constraints through dual ascent during sampling and therefore does not need to rely solely on a highly expressive score parametrization for feasibility.

##### DT.

The training procedure is described in Appendix [D](https://arxiv.org/html/2606.17192#A4). The score model shares the same architecture as PDI-Net, except that it uses three conditioning channels instead of four.

All hyperparameters of these baselines are given in Table [4](https://arxiv.org/html/2606.17192#A6.T4).

Table 4: Baseline hyperparameters for the wireless allocation experiment. The DT settings differ from PDI in Table [6](https://arxiv.org/html/2606.17192#A6.T6); all other problem, diffusion, architecture, and evaluation settings are identical.

### F.3 Additional Results

![Refer to caption](https://arxiv.org/html/2606.17192v1/x9.png)

Figure 7: Dual Training. Cosine similarity and $L_2$ distance between the dual variables across the DT training iterations and the last-iterate PDI dual variables. The shaded region indicates the training iterations where the cosine similarity peaks above $0.95$. The metrics then drift slightly afterwards.

Dual training (continued). We investigate the difference between PDI and DT more closely by comparing the dual trajectories. Figure [7](https://arxiv.org/html/2606.17192#A6.F7) plots the cosine similarity and the $L_2$ distance between the dual variables generated along the DT trajectory and the last dual iterate obtained by the PDI dynamics. These quantities are averaged over the training examples. We use the training dataset because DT does not infer dual variables during inference.
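The two comparison metrics are standard and can be computed as below. The dual vectors are made-up numbers used only to exercise the functions; averaging over training examples is done with a plain mean.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(values):
    return sum(values) / len(values)

# Hypothetical dual variables for two training networks:
# DT duals at one checkpoint versus the last PDI dual iterate.
dt_duals = [[1.0, 2.0, 0.0], [0.5, 0.5, 3.0]]
pdi_duals = [[2.0, 4.0, 0.0], [0.4, 0.6, 2.5]]

avg_cos = mean([cosine_similarity(a, b) for a, b in zip(dt_duals, pdi_duals)])
avg_l2 = mean([l2_distance(a, b) for a, b in zip(dt_duals, pdi_duals)])
print(avg_cos, avg_l2)
```

The first pair shows why both metrics are reported: the vectors are perfectly aligned (cosine similarity one) yet differ in scale, which only the $L_2$ distance detects.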

The cosine similarity reaches its maximum between epochs $2200$ and $3200$, after which the DT dual variables begin to drift away from the PDI dual variables. We therefore save all DT checkpoints after epoch $2200$, evaluate them on the test set, and report the resulting metrics in Figure [8](https://arxiv.org/html/2606.17192#A6.F8). The figure shows that all DT checkpoints achieve higher mean rates (objective) than PDI, shown by the dashed blue line. This suggests that the models offer higher mean rates by reducing interference through more conservative power allocations. This reduction in interference benefits some users but sacrifices tail performance, as reflected in the lower fifth and first percentile rates and feasibility percentages across all checkpoints.

These observations suggest that the dual variables generated by DT do not induce the same primal solutions obtained by PDI dynamics. This can be attributed to the fact that the PDI score field is conditioned on the current dual variable $\boldsymbol{\lambda}_t$ while that of DT is an unconditional score field obtained after training over a history of changing dual variables. Therefore, even when the DT dual variable is close to the PDI dual variable, the checkpoint at that epoch does not need to approximate the score field associated with that particular dual variable. The reason is that the checkpoint represents the accumulated effect of the preceding training trajectory, not a score model explicitly indexed by the current dual variable.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x10.png)

Figure 8: Performance of DT checkpoints along the training trajectory. The DT models fail to attain the same $5$th percentile rates, mean violations and feasibility percentage as PDI. All the metrics are evaluated over the test set.

To further separate the quality of the learned dual variables from the quality of the corresponding score model, we continue training several DT checkpoints for an extra $1000$ outer iterations while freezing their dual variables. We refer to the resulting models as DT+. This experiment asks whether the dual variables produced by DT are useful once the score model is given enough optimization time to adapt to them. Table [5](https://arxiv.org/html/2606.17192#A6.T5) reports the two best DT+ models and shows a substantial improvement over the original DT checkpoints. This indicates that DT can reach informative dual variables, but the score model at the corresponding checkpoint has not necessarily learned the score field induced by that dual value. Rather, because the model is updated only partially while the dual variables evolve, the checkpoint reflects the accumulated effect of the preceding training trajectory. In this sense, practical DT behaves as a score model shaped by a history of dual variables, and not as a fully optimized score model for the current dual variable. The improvement of DT+ also comes with a much larger training cost ($4000$ original outer iterations plus an extra $1000$ iterations for tuning the model), whereas PDI reaches comparable performance using only $400$ outer training iterations.

Table 5: DT after extra training at fixed dual variables (DT+). We continue training several DT checkpoints with no dual updates and report the best two results (corresponding to iterations 2200 and 2900). PDI achieves comparable performance with much lighter training.

The improvement of DT+ shows that the dual variables produced by DT are informative. However, it does not remove the limitation of using an unconditional score model. Since DT does not take the dual variable as input, the correction induced by the dual trajectory is absorbed into the score-model parameters rather than represented explicitly. This can work well for problem instances whose required correction is close to the one encoded by the trained model. At inference, however, new problem instances may induce different optimal multipliers or different constraint residuals along the denoising trajectory. The DT score field cannot adapt to these instance-dependent corrections during sampling, because the dual variable is not provided to the model.

PDI reduces this limitation by learning a $\boldsymbol{\lambda}$-conditioned family of score fields. During inference, the current dual iterate is passed directly to the score model, so each reverse step uses a score field matched to the dual correction used at that step. This does not mean that PDI samples from the Gibbs distribution of one fixed dual variable, since the final sample is produced by a sequence of kernels conditioned on different dual iterates. Rather, the advantage is that the constraint correction and denoising dynamics remain coupled along the trajectory, allowing constraint violations to influence subsequent denoising steps more directly.
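The coupling between the sample update and the multiplier can be sketched as a loop that alternates a $\boldsymbol{\lambda}$-conditioned step with a projected dual ascent step. The code below is a toy illustration under strong assumptions: `toy_reverse_step` is a hand-written stand-in for a reverse diffusion step driven by the trained dual-conditioned score network, the multiplier is a scalar, the constraint is $\mathbb{E}[x]\le 1$, and the dual update $\lambda\leftarrow\min\{\lambda_{\max},[\lambda+\eta\,\bar f]_+\}$ is the assumed form of the ascent step.

```python
def pdi_sample(reverse_step, constraint, samples, lam, num_steps, eta, lam_max):
    """Alternate a lambda-conditioned sample update with projected dual ascent."""
    lam_history = []
    for step in reversed(range(num_steps)):
        # 1) Update every sample using the *current* multiplier.
        samples = [reverse_step(x, step, lam) for x in samples]
        # 2) Estimate the average constraint value (positive means violation).
        avg_violation = sum(constraint(x) for x in samples) / len(samples)
        # 3) Dual ascent, projected onto [0, lam_max].
        lam = min(lam_max, max(0.0, lam + eta * avg_violation))
        lam_history.append(lam)
    return samples, lam, lam_history

def toy_reverse_step(x, step, lam):
    """Stand-in for a denoising step: pull x toward the lambda-dependent mode 2 - lam."""
    return x + 0.5 * ((2.0 - lam) - x)

def toy_constraint(x):
    """Average constraint E[x] <= 1, written as f(x) = x - 1 <= 0."""
    return x - 1.0

samples, lam, lam_history = pdi_sample(
    toy_reverse_step, toy_constraint,
    samples=[0.0, 0.5, 3.0], lam=0.0, num_steps=200, eta=0.5, lam_max=50.0)
mean_x = sum(samples) / len(samples)
print(mean_x, lam)
```

In this toy problem the unconstrained mode sits at $x=2$, and the loop settles at $\lambda\approx 1$ with the samples on the constraint boundary. Nothing about the diffusion noise schedule is modeled here.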

Temperature effect. In Figure [9](https://arxiv.org/html/2606.17192#A6.F9), we analyze the effect of the temperature parameter $\beta$. A large $\beta$, equivalently a small $\beta^{-1}$, generates more uniform samples, which cause severe interference and in turn the lowest mean rates and feasibility. Lower temperatures produce more degenerate distributions that concentrate the samples at the corners. This prevents severe interference and provides a better mean rate, but does not fully benefit from time sharing (i.e., low feasibility). We find that $\beta=1$ in this experiment balances the mean rates with feasibility.
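The qualitative effect of $\beta$ can be reproduced on a discrete toy problem. Maximizing $\sum_i \mu_i s_i+\beta\mathcal{H}(\mu)$ over distributions on a finite set gives the Gibbs weights $\mu_i\propto\exp(s_i/\beta)$. The sketch below ignores the constraints and uses made-up scores; it only shows how the entropy of the solution varies with $\beta$.

```python
import math

def gibbs(scores, beta):
    """Maximizer of sum_i mu_i * s_i + beta * H(mu): mu_i proportional to exp(s_i / beta)."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / beta) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

scores = [1.0, 0.8, 0.2, 0.0]  # hypothetical objective values of four allocations
hot = gibbs(scores, beta=100.0)   # high temperature
cold = gibbs(scores, beta=0.05)   # low temperature
print("beta=100 :", hot, entropy(hot))
print("beta=0.05:", cold, entropy(cold))
```

At high temperature the distribution is nearly uniform, while at low temperature almost all mass sits on the best-scoring allocation.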

Noise annealing effect. PDI differs from PDL because it follows the score fields of a sequence of progressively denoised distributions, rather than the score field of the clean target distribution. This mechanism is reminiscent of simulated annealing. Early noisy diffusion steps smooth the target landscape and promote exploration across modes, while later low-noise steps gradually sharpen the distribution and refine the samples toward the constrained target. Unlike classical simulated annealing, however, the annealing in PDI is induced by the diffusion noise schedule rather than by the temperature of the energy function. Figure [10](https://arxiv.org/html/2606.17192#A6.F10) shows this effect, represented in the diverse samples generated by PDI compared with the more concentrated samples produced by PDL.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x11.png)

Figure 9: Temperature effect. Two-dimensional scatter plots of generated samples of two test networks. The title shows the mean rates and feasibility of the network. Higher temperature (left) promotes more diverse allocations while lower temperature (right) generates more degenerate samples. The value $\beta=1$ provides a good balance between mean rates and feasibility.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x12.png)

Figure 10: Annealing effect of PDI. A scatter plot of the predicted power allocation of four pairs of agents. PDI discovers more transmission modes, allowing more agents to transmit with higher power to achieve higher rates on average. PDL concentrates the samples in the same modes.
### F.4 Hyperparameter Choices

Table [6](https://arxiv.org/html/2606.17192#A6.T6) collects the hyperparameters of the problem and PDI training. All experiments were run on a single NVIDIA GeForce RTX 3090 card.

Table 6: PDI hyperparameters for the wireless power allocation experiment.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| *Problem setup* | | *Training* | |
| Number of users $N$ | 200 | Optimizer | AdamW |
| Max transmit power $P_{\max}$ | 10 dBm | Learning rate | $10^{-3}$ ($\to 10^{-5}$) |
| Bandwidth | 40 MHz | Weight decay | $10^{-4}$ |
| Minimum rate $r_{\min}$ | 0.6 bps/Hz | Outer iterations $N_{\text{outer}}$ | 400 |
| Number of networks | 256 | | |
| *Score network* | | *Training* | |
| Hidden dimension | 256 | Inner SGD steps $N_{\text{inner}}$ | $10\to 30$ |
| GNN layers | 8 | Batch size $\lvert\mathcal{B}_{\text{train}}\rvert$ | 4 problems |
| Filter order $K$ | 2 | Replay buffer capacity $\lvert\mathcal{B}\rvert$ | 4096 |
| Time embedding dimension | 128 | Minibatch size $B$ | 128 |
| Conditioning channels | 4 $(\mathbf{x},t,\boldsymbol{\lambda},\mathcal{G})$ | | |
| Activation / normalization | SiLU / LayerNorm | | |
| *Diffusion process* | | *Training – perturbation and prior* | |
| Timesteps $T$ | 500 | Exploitation fraction $\rho_{\text{exp}}$ | $0\to 0.7$ |
| Noise schedule | Cosine | $\boldsymbol{\lambda}$ prior range $[\nu_{\min},\nu_{\max}]$ | $[2,10]$ |
| MC candidates $K_{\text{MC}}$ | 88 | Perturbation std $(\epsilon_{\mathbf{x}}/\epsilon_{\boldsymbol{\lambda}})$ | $0.1/0.5$ |
| | | Perturbation fraction $\rho_{\text{pert}}$ | 0.5 |
| *Dual variable (inference)* | | *Evaluation* | |
| Inverse temperature $\beta^{-1}$ | 1.0 | Policies per network $K$ | 200 |
| Step size $\eta_t$ | 0.05 | Time-sharing timeslots $R_{\mathrm{eval}}$ | 500 |
| Initialization $\boldsymbol{\lambda}_0$ | 10.0 | Test networks | 64 |
| Clamp $\boldsymbol{\lambda}_{\max}$ | 50 | Evolution trials | 50 |
| | | Sub-batch groups $C$ | 10 |

## Appendix G Extended Numerical Results: Portfolio Management

In constrained portfolio optimization, we aim to allocate weights $\mathbf{x}\in\Delta^{N-1}$ across $N=500$ assets, organized in $10$ sectors. The return $\mathbf{r}$ is drawn from a factor model with mean $\boldsymbol{\mu}$ and block-diagonal covariance $\boldsymbol{\Sigma}$. Our objective is to maximize expected return subject to per-asset variance-risk budgets, i.e.,

$$P_{\text{pm}}^{*}~=~\max_{\mu(\mathbf{x})}\;\mathbb{E}_{\mathbf{x}\sim\mu}\!\Bigl[\mathbb{E}_{\mathbf{r}}\!\bigl[\mathbf{r}^{\top}\mathbf{x}\bigr]\Bigr]+\beta\mathcal{H}(\mu)\quad\text{s.t.}\quad\mathbb{E}_{\mathbf{x}\sim\mu}\!\bigl[x_j\,(\boldsymbol{\Sigma}\mathbf{x})_j\bigr]\;\leq\;b_j,\quad j\in[N]. \tag{130}$$

The budget $b_j>0$ caps how much variance asset $j$ is permitted to contribute in expectation over all portfolios. In practice, portfolio construction is rarely a single-solution problem. Even under common market assumptions, different clients or rebalancing instances may require different allocations due to heterogeneous preferences, liquidity needs, tax considerations, and model uncertainty. This motivates learning a distribution over portfolios rather than a single deterministic allocation. The average risk constraints allow the sampler to trade risk across different portfolio samples, admitting occasional high-risk, high-return allocations while ensuring that the overall exposure of each asset remains within its prescribed budget in expectation.
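The constraint in Eq. (130) can be evaluated by a simple Monte Carlo average over sampled portfolios. In the sketch below the covariance, budgets, and portfolios are made-up two-asset values. The per-asset terms $x_j(\boldsymbol{\Sigma}\mathbf{x})_j$ sum to the portfolio variance $\mathbf{x}^{\top}\boldsymbol{\Sigma}\mathbf{x}$, which is why they can be read as variance contributions.

```python
def risk_contributions(x, cov):
    """Per-asset term x_j * (Sigma x)_j; the terms sum to x^T Sigma x."""
    sigma_x = [sum(c * xi for c, xi in zip(row, x)) for row in cov]
    return [xj * sj for xj, sj in zip(x, sigma_x)]

def average_contributions(portfolios, cov):
    """Monte Carlo estimate of E_{x ~ mu}[x_j * (Sigma x)_j] for every asset j."""
    totals = [0.0] * len(cov)
    for x in portfolios:
        for j, c in enumerate(risk_contributions(x, cov)):
            totals[j] += c
    return [t / len(portfolios) for t in totals]

cov = [[0.04, 0.01], [0.01, 0.09]]      # hypothetical covariance
budgets = [0.03, 0.05]                  # hypothetical budgets b_j
portfolios = [[1.0, 0.0], [0.0, 1.0]]   # two sampled portfolios on the simplex

single = risk_contributions(portfolios[0], cov)
avg = average_contributions(portfolios, cov)
feasible_on_average = all(a <= b for a, b in zip(avg, budgets))
print(single, avg, feasible_on_average)
```

The first portfolio alone exceeds the budget of asset 0, yet the pair satisfies both budgets on average, which is the kind of trade across samples that the average constraint permits.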

### G.1 Data & Architecture

##### Data.

We generate synthetic portfolio instances using a latent factor model with sector structure. Each instance consists of $N=500$ assets organized into $S=10$ sectors. Each asset's log-returns are drawn from a multivariate normal with a sector-structured covariance matrix $\boldsymbol{\Sigma}$, where assets within the same sector are more correlated with each other than with assets in other sectors. We draw $1000$ return scenarios and convert to simple returns via exponentiation.
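One simple way to obtain such sector-structured returns is a one-factor-per-sector model, sketched below. This is an assumed construction for illustration, not the authors' generator: each log-return is a common drift plus a shared sector shock plus an idiosyncratic shock, which yields a block-diagonal covariance. The volatilities, drift, and sizes are hypothetical. Simple returns are obtained as $\exp(\text{log-return})-1$.

```python
import math
import random

def simulate_simple_returns(n_assets, n_sectors, n_scenarios,
                            mu=0.0005, sector_vol=0.01, idio_vol=0.01, seed=0):
    """Log-return = mu + sector shock + idiosyncratic shock, then exponentiate."""
    rng = random.Random(seed)
    sector_of = [i * n_sectors // n_assets for i in range(n_assets)]
    scenarios = []
    for _ in range(n_scenarios):
        shocks = [rng.gauss(0.0, sector_vol) for _ in range(n_sectors)]
        log_r = [mu + shocks[sector_of[i]] + rng.gauss(0.0, idio_vol)
                 for i in range(n_assets)]
        scenarios.append([math.exp(v) - 1.0 for v in log_r])
    return scenarios, sector_of

def correlation(scenarios, i, j):
    """Sample correlation between the simple returns of assets i and j."""
    a = [s[i] for s in scenarios]
    b = [s[j] for s in scenarios]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

scenarios, sector_of = simulate_simple_returns(n_assets=6, n_sectors=2, n_scenarios=2000)
same_sector = correlation(scenarios, 0, 1)
cross_sector = correlation(scenarios, 0, 3)
print(sector_of, same_sector, cross_sector)
```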

##### Graph construction.

To construct the graph, we compute the absolute correlation matrix from $\boldsymbol{\Sigma}$, remove self-loops, and for each asset retain edges to its $k$-nearest neighbors (we use $k=20$) by absolute correlation magnitude. Edge weights are normalized by the largest eigenvalue of the resulting adjacency matrix.
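The steps above can be sketched on a four-asset toy covariance. Two points are assumptions of this sketch: the $k$-nearest-neighbor graph is left directed (the text does not say whether it is symmetrized), and the largest eigenvalue is estimated by power iteration, which is adequate for the non-negative matrix used here.

```python
import math

def abs_correlation(cov):
    n = len(cov)
    return [[abs(cov[i][j]) / math.sqrt(cov[i][i] * cov[j][j]) for j in range(n)]
            for i in range(n)]

def knn_graph(corr, k):
    """Drop self-loops and keep, for each asset, its k most correlated neighbours."""
    n = len(corr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: corr[i][j], reverse=True)[:k]
        for j in neighbours:
            adj[i][j] = corr[i][j]
    return adj

def largest_eigenvalue(matrix, iters=500):
    """Power iteration estimate of the dominant eigenvalue magnitude."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(c * c for c in w))
        v = [c / lam for c in w]
    return lam

# Two sectors of two assets each: 0.5 within a sector, 0.1 across sectors.
cov = [[1.0, 0.5, 0.1, 0.1],
       [0.5, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.5],
       [0.1, 0.1, 0.5, 1.0]]
adj = knn_graph(abs_correlation(cov), k=2)
lam_max = largest_eigenvalue(adj)
normalized = [[a / lam_max for a in row] for row in adj]
print(lam_max)
```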

##### Conditioning and architecture.

We use the same GNN architecture used in the wireless allocation problem. The timestep information is injected through Feature-wise Linear Modulation (FiLM). The shared dual variable $\boldsymbol{\lambda}\in\mathbb{R}^{N}$ is concatenated to the input features. The network comprises $L=6$ residual blocks with hidden dimension $d_h=128$.

### G.2 Baselines

The hyperparameters of the baselines are summarized in Table [7](https://arxiv.org/html/2606.17192#A7.T7).

Table 7: Baseline hyperparameters in the portfolio management problem.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| *PD-Langevin* | | | |
| Primal step size $\eta_{\text{p}}$ | $10^{-2}$ | Iterations | 500 |
| Dual step size $\eta_{\text{d}}$ | 100 | | |
| *PDM* | | | |
| Projection iterations | 50 | $\boldsymbol{\lambda}$ (net input) | 0 |
| *PDI–MC* | | | |
| Inverse temperature $\beta^{-1}$ | 2000 | Dual step size $\eta_t$ | 300 |
| MC candidates $K_{\text{MC}}$ | 512 | $\boldsymbol{\lambda}_{\text{decay}}$ | 0.001 |
| *Evaluation* | | | |
| Generated samples | 1024 | Test instances | 20 |
### G.3 Additional Results

We ran additional experiments to evaluate different aspects of our algorithm, and the results match our previous findings.

First, we compare the full PDI algorithm with variants that freeze the dual variable after an initial primal-dual run and report the results in Figure [11](https://arxiv.org/html/2606.17192#A7.F11). In this experiment, we consider the last dual iterate and the time-averaged dual variable. The results show that the time-averaged multiplier performs better than the last iterate. This is consistent with our convergence result, which guarantees convergence for the time-average of the dual variables rather than for the final iterate itself. However, both fixed-dual variants remain below the full PDI performance. We also notice that the initial $\boldsymbol{\lambda}_0$ does not affect the PDI trajectory. This suggests that PDI is not simply estimating one good multiplier and then sampling from it. Its advantage comes from updating the multiplier throughout the denoising trajectory.

Second, we test in Figure [12](https://arxiv.org/html/2606.17192#A7.F12) whether using only a late-stage average of the dual trajectory is enough. The answer is negative. The performance degrades when the average is computed only over a suffix of the trajectory. This suggests that the early dual iterates are not merely transient noise; they contribute to the effective multiplier that governs the sampled distribution. In other words, the full trajectory matters, but its aggregate effect can be captured well by the full time-average.
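The three frozen multipliers compared in these two experiments (last iterate, full time-average, and suffix average) are computed from a recorded dual trajectory as below. The trajectory is a made-up sequence, and the `fraction` argument that selects the suffix is a hypothetical parameterization.

```python
def time_average(trajectory):
    """Component-wise mean of a list of dual vectors."""
    n = len(trajectory[0])
    return [sum(lam[j] for lam in trajectory) / len(trajectory) for j in range(n)]

def tail_average(trajectory, fraction):
    """Mean over the last `fraction` of the trajectory (at least one iterate)."""
    start = int(len(trajectory) * (1.0 - fraction))
    start = min(start, len(trajectory) - 1)
    return time_average(trajectory[start:])

# Hypothetical dual trajectory with one constraint and four iterates.
trajectory = [[0.0], [2.0], [4.0], [6.0]]
last_iterate = trajectory[-1]
full_average = time_average(trajectory)
suffix_average = tail_average(trajectory, fraction=0.5)
print(last_iterate, full_average, suffix_average)
```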

Our third experiment in Figure [13](https://arxiv.org/html/2606.17192#A7.F13) investigates the role of the entropy regularization on the final solutions and trajectories. The figure shows that when the temperature $\beta$ is high, the sampler produces more diverse samples, but the objective can be weaker. When the temperature is lower, the distribution becomes more concentrated around high-quality solutions. This improves the objective, but it can also make the dynamics sharper and more sensitive.

Finally, the noise-schedule comparisons of Figure [14](https://arxiv.org/html/2606.17192#A7.F14) show that the schedule is important for convergence. Schedules that keep the process noisy for longer, such as the standard cosine and DDPM linear schedules, give more stable behavior and better feasibility. In contrast, schedules that reconstruct the clean sample too early make the sampler sensitive before the dual variable has stabilized. This leads to larger violations and weaker convergence. These results agree with Theorem [2](https://arxiv.org/html/2606.17192#Thmtheorem2), which suggests that early high-noise steps damp the effect of dual mismatch, while later low-noise steps should only occur after the dual variable has moved close to a good region.
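For reference, the cumulative signal level $\bar{\alpha}_t$ of the two standard schedules named above can be computed as follows. The cosine form with offset $s=0.008$ and the linear range $\beta_t\in[10^{-4},0.02]$ are the commonly used defaults and are assumptions here, since the exact schedule parameters are not listed in this appendix. A lower $\bar{\alpha}_t$ means a noisier state at step $t$.

```python
import math

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule: alpha_bar_t is the running product of (1 - beta_s)."""
    alpha_bar = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_start + (beta_end - beta_start) * (t - 1) / (T - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta_t))
    return alpha_bar

T = 500
cos_ab = cosine_alpha_bar(T)
lin_ab = linear_alpha_bar(T)
for t in (0, 125, 250, 375, 500):
    print(t, round(cos_ab[t], 4), round(lin_ab[t], 4))
```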

![Refer to caption](https://arxiv.org/html/2606.17192v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.17192v1/x14.png)

Figure 11: Performance of PDI while fixing $\boldsymbol{\lambda}$ along the diffusion steps. The time-average dual variable gives better objective and constraint satisfaction compared to the last iterate. However, running the dual updates beats both of them and does not depend on the initial value.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x15.png)

![Refer to caption](https://arxiv.org/html/2606.17192v1/x16.png)

Figure 12: Frozen $\bar{\boldsymbol{\lambda}}_{\text{tail}}$ during sampling. The effect of running the sampler with a fixed dual variable equal to the time-average of the tail dual iterates.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x17.png)

![Refer to caption](https://arxiv.org/html/2606.17192v1/x18.png)

Figure 13: Temperature effect. Lower temperature $\beta$, equivalently larger $\beta^{-1}$ (yellow), helps improve returns and constraint satisfaction.

![Refer to caption](https://arxiv.org/html/2606.17192v1/x19.png)

Figure 14: Effect of noise schedules. Standard DDPM cosine and linear schedules give better feasibility. Schedules that reconstruct the clean sample too early make the sampler sensitive to perturbations in the dual variable before the dual variable has stabilized.
### G.4 Hyperparameter Choices

We ran a sweep over $\beta$ and $\eta_t$. Table [8](https://arxiv.org/html/2606.17192#A7.T8) displays all hyperparameters used in experiments. All experiments were run on a single NVIDIA GeForce RTX 3090 card.

Table 8: PDI hyperparameters in the portfolio management problem (training and inference).

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| *Diffusion process* | | *Training* | |
| Timesteps $T$ | 500 | Optimizer | AdamW |
| Noise schedule | Cosine | Learning rate | $3\times10^{-4}\to 3\times10^{-5}$ (cosine) |
| Inverse temperature $\beta^{-1}$ | 2000 | $\boldsymbol{\lambda}$ prior $[\nu_{\min},\nu_{\max}]$ | $[150,1500]$ |
| MC candidates $K_{\text{MC}}$ | 512 | $\boldsymbol{\lambda}$ perturbation (mult.) $\epsilon_{\boldsymbol{\lambda}}$ | 0.3 |
| | | Exploitation ratio $\rho_{\text{exp}}$ | $0\to 0.7$ |
| | | Minibatch size $B$ | 128 |
| | | Training instances | 200 |
| *Dual variable (inference)* | | *Training* | |
| Step size $\eta_t$ | 300 | Validation instances | 10 |
| Initialization $\boldsymbol{\lambda}_0$ | 0 | Inner SGD steps $N_{\text{inner}}$ | $10\to 100$ |
| Clamp $\boldsymbol{\lambda}_{\max}$ | $10^{4}$ | Outer iterations $N_{\text{outer}}$ | 600 |
| | | Weight decay | 0.001 |
| | | Replay buffer capacity | 8192 |
| | | Rollout minibatch $\lvert\mathcal{B}_{\text{train}}\rvert$ | 4 problems |
| *Score network* | | | |
| Backbone | GNN | Hidden dimension $d_h$ | 128 |
| Residual blocks $L$ | 6 | Graph filter order $K$ | 2 |
| Graph sparsity (top-$k$) | 20 | Conditioning | FiLM + $\boldsymbol{\lambda}$ concat |