DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising
Summary
DiRecT introduces a training-free algorithm for safe diffusion-based planning that enforces constraints only on final clean trajectories using receding-horizon denoising, improving safety and performance over existing methods.
View Cached Full Text
Cached at: 06/16/26, 11:42 AM
# Safe Diffusion-Based Planning via Receding-Horizon Denoising
Source: [https://arxiv.org/html/2606.15359](https://arxiv.org/html/2606.15359)
Paolo Giaretta MIT pgiarett@mit\.edu &Zeyang Li MIT zeyang@mit\.edu &Navid Azizan MIT azizan@mit\.edu
###### Abstract
Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories\. Yet reliable inference\-time safety enforcement remains a key barrier to their deployment in safety\-critical tasks\. Existing approaches typically project each denoising iterate onto the feasible set, even though constraints are defined only on the final clean trajectory\. Enforcing feasibility on noisy intermediate samples can therefore overconstrain the sampling dynamics, substantially degrading sample quality\. To address this limitation, we introduce DiRecT \(Diffusion\-based planning viaReceding\-horizon denoising withTerminal constraints\), a training\-free algorithm for constrained sampling from diffusion models via stochastic optimal control \(SOC\)\. DiRecT enforces constraints only on the final clean sample, avoiding unnecessary restrictions on the intermediate denoising dynamics\. Inspired by model predictive control, we derive a principled receding\-horizon surrogate for the otherwise intractable constrained SOC formulation, yielding an efficient algorithm that cleanly separates stochastic denoising from constraint satisfaction, progressively steering samples toward feasible final trajectories without distorting the learned diffusion dynamics\. Furthermore, DiRecT is highly flexible: it can leverage off\-the\-shelf or domain\-specific optimizers, incorporate priors over environment dynamics, and optimize additional soft rewards\. Extensive experiments on safe planning benchmarks demonstrate that DiRecT substantially improves deployment safety and task performance over existing diffusion\-based planning baselines\. Our code is available at[https://github\.com/azizanlab/DiRecT](https://github.com/azizanlab/DiRecT)\.
††footnotetext:Correspondence to Zeyang Li \(zeyang@mit\.edu\)\.## 1Introduction
Safe and reliable planning remains a central challenge, requiring algorithms that can satisfy constraints while adapting to diverse environments, objectives, and constraint types\. Classical planning methods\[[25](https://arxiv.org/html/2606.15359#bib.bib1),[33](https://arxiv.org/html/2606.15359#bib.bib2),[36](https://arxiv.org/html/2606.15359#bib.bib3),[32](https://arxiv.org/html/2606.15359#bib.bib4),[51](https://arxiv.org/html/2606.15359#bib.bib5),[57](https://arxiv.org/html/2606.15359#bib.bib7)\]have achieved strong results in many structured settings, but they face important limitations\. Search\-based methods can become computationally prohibitive in large\-scale multi\-agent problems\[[70](https://arxiv.org/html/2606.15359#bib.bib9)\], while optimization\-based methods are often local and sensitive to initialization\[[51](https://arxiv.org/html/2606.15359#bib.bib5),[57](https://arxiv.org/html/2606.15359#bib.bib7)\]\. These limitations have driven growing interest in*data\-driven*generative planners, which learn reusable priors from offline data and can model diverse, high\-dimensional behaviors\. Building on their success in image generation\[[17](https://arxiv.org/html/2606.15359#bib.bib20)\], diffusion models\[[60](https://arxiv.org/html/2606.15359#bib.bib12),[27](https://arxiv.org/html/2606.15359#bib.bib11),[62](https://arxiv.org/html/2606.15359#bib.bib13)\]have emerged as a powerful framework for planning and control, capable of capturing complex, multimodal state\-action distributions\[[29](https://arxiv.org/html/2606.15359#bib.bib24)\]while supporting flexible inference\-time guidance\[[1](https://arxiv.org/html/2606.15359#bib.bib26)\]\.
Despite these advances, deploying diffusion\-based planners in real\-world safety\-critical tasks remains challenging\. For instance, a generated trajectory must avoid collisions with obstacles, as even a single violation can lead to catastrophic failure\. Since diffusion models rely on high\-dimensional stochastic denoising dynamics, typically parameterized by large neural networks, they may generate unsafe plans even when trained on feasible trajectories\. This challenge is further amplified when trajectories must satisfy novel constraints that are not captured by the dataset\. These limitations have motivated growing interest in*test\-time*mechanisms for constraining diffusion planners\. Inspired by guidance techniques in image generation\[[17](https://arxiv.org/html/2606.15359#bib.bib20),[28](https://arxiv.org/html/2606.15359#bib.bib21),[15](https://arxiv.org/html/2606.15359#bib.bib22)\], early approaches steer samples during denoising by encoding constraint satisfaction as a*soft*guidance signal\[[45](https://arxiv.org/html/2606.15359#bib.bib29)\]\. However, soft guidance can only encourage feasibility and does not guarantee hard constraint satisfaction\. More recent work has therefore focused on enforcing hard constraints directly on sampled trajectories, with the goal of providing stronger guarantees for safety\-critical planning\. Particularly, most hard\-constrained diffusion samplers enforce feasibility through projections or constraint\-driven updates along the denoising trajectory\[[14](https://arxiv.org/html/2606.15359#bib.bib28),[43](https://arxiv.org/html/2606.15359#bib.bib33),[39](https://arxiv.org/html/2606.15359#bib.bib60),[71](https://arxiv.org/html/2606.15359#bib.bib30),[69](https://arxiv.org/html/2606.15359#bib.bib31)\]\. This creates a mismatch: planning constraints are defined on the final clean trajectory, while intermediate denoising iterates are noisy and need not themselves be feasible\. Imposing constraints throughout sampling can therefore overconstrain the learned reverse dynamics, significantly degrading sample quality\. Moreover, projection\-based formulations do not naturally provide a unified mechanism for handling hard constraints together with additional soft rewards or costs\.
To address these limitations, we introduce DiRecT \(Diffusion\-based planning viaReceding\-horizon denoising withTerminal constraints\), a*training\-free*algorithm for constrained sampling in diffusion\-based planning\. We formulate inference\-time constraint enforcement through the lens of stochastic optimal control \(SOC\)\[[31](https://arxiv.org/html/2606.15359#bib.bib35)\], where the learned reverse diffusion process serves as the nominal stochastic dynamics and control inputs steer the sampler toward feasibility of the final clean trajectory\. Crucially, constraints are imposed only on this terminal clean sample, avoiding unnecessary restrictions on noisy intermediate denoising iterates\.
Solving the resulting terminally constrained SOC problem is computationally intractable\. We therefore exploit the structure of diffusion models, together with ideas from model predictive control \(MPC\)\[[52](https://arxiv.org/html/2606.15359#bib.bib40)\], to derive a principled and scalable receding\-horizon surrogate\. At each denoising step, DiRecT predicts the final clean trajectory implied by the current noisy iterate, solves a constrained optimization problem over this prediction, and converts the resulting refinement into a controlled update of the current noisy iterate\.
The contributions of this paper are:
- •We identify a key limitation of prior constrained diffusion samplers: enforcing feasibility on noisy intermediate denoising iterates can overconstrain the sampling dynamics and degrade sample quality\. In contrast, we formulate training\-free constrained diffusion planning as a terminally constrained SOC problem that enforces feasibility only on the final clean trajectory while staying close to the learned diffusion dynamics\.
- •We present DiRecT, a training\-free constrained diffusion sampler that reduces the intractable constrained SOC problem to a sequence of tractable clean\-trajectory optimization subproblems\. By refining predicted clean trajectories and translating these refinements into controlled updates of the noisy iterates, DiRecT steers samples toward constraint satisfaction without distorting the learned denoising process\.
- •We show that DiRecT is highly flexible, supporting off\-the\-shelf and domain\-specific optimizers, equality and inequality constraints, environment\-specific dynamics priors, and additional soft rewards or costs at inference time\.
- •We evaluate DiRecT across diverse robotic planning applications, including maze navigation in Maze2D, robotic manipulation in D3IL, multi\-robot motion planning \(MRMP\), and diverse contact\-rich manipulation in PushT\. Across these tasks, DiRecT consistently improves constraint satisfaction and task success over existing diffusion\-based planning baselines\.
## 2Related Work
This section situates our work within the broader literature\. First, we review constrained sampling techniques for diffusion\-based planners, emphasizing the key limitation that motivates our method\. Second, we discuss prior work that also explores connections between diffusion models and stochastic optimal control, clarifying how our constrained sampling perspective differs from these approaches\. Finally, we mention a parallel but complementary line of work on training\-based methods for planning with diffusion models\. Due to space constraints, the details are deferred to Appendix[A](https://arxiv.org/html/2606.15359#A1)\.
## 3Background
##### Diffusion models\.
We employ the continuous\-time formulation and define diffusion models as a pair of Itô processes: a*forward*process that gradually corrupts samples drawn from the data distributionp0p\_\{0\}through the stochastic differential equation \(SDE\)dXt=ft\(Xt\)dt\+gt\(Xt\)dWt,X0∼p0,t∈\[0,1\]dX\_\{t\}=f\_\{t\}\(X\_\{t\}\)dt\+g\_\{t\}\(X\_\{t\}\)dW\_\{t\},\\;X\_\{0\}\\sim p\_\{0\},\\;t\\in\\left\[0,1\\right\], and a*backward*process that generates samples starting from the simple prior distributionp1≈𝒩\(0,Id\)p\_\{1\}\\approx\\mathcal\{N\}\\left\(0,I\_\{d\}\\right\)by time reversal of the forward SDE\[[3](https://arxiv.org/html/2606.15359#bib.bib42)\]:
dXt=\[ft\(Xt\)−gt2\(Xt\)∇Xtlogpt\(Xt\)\]dt\+gt\(Xt\)dW¯t,X1∼p1,dX\_\{t\}=\\left\[f\_\{t\}\(X\_\{t\}\)\-g\_\{t\}^\{2\}\(X\_\{t\}\)\\nabla\_\{X\_\{t\}\}\\log p\_\{t\}\(X\_\{t\}\)\\right\]dt\+g\_\{t\}\(X\_\{t\}\)d\\bar\{W\}\_\{t\},\\quad X\_\{1\}\\sim p\_\{1\},\(1\)where \([1](https://arxiv.org/html/2606.15359#S3.E1)\) is integrated backward in time,WtW\_\{t\}andW¯t\\bar\{W\}\_\{t\}denote the forward and backward Wiener processes, andst\(xt\)=∇xtlogpt\(xt\)s\_\{t\}\(x\_\{t\}\)=\\nabla\_\{x\_\{t\}\}\\log p\_\{t\}\(x\_\{t\}\)is the score function of the marginal distributionptp\_\{t\}\. Moreover, we restrict our focus to the common case of Gaussian\-affine schedules, for which the conditional distribution of the forward noising process has the closed\-form solutionqt\(xt\|x0\)=𝒩\(αtx0,σt2Id\)q\_\{t\}\(x\_\{t\}\|x\_\{0\}\)=\\mathcal\{N\}\\left\(\\alpha\_\{t\}x\_\{0\},\\sigma\_\{t\}^\{2\}I\_\{d\}\\right\), where the noise\-schedule coefficients\{αt,σt\}t∈\[0,1\]\\\{\\alpha\_\{t\},\\sigma\_\{t\}\\\}\_\{t\\in\\left\[0,1\\right\]\}are related to the SDE drift and diffusion functions byf\(x,t\)=α˙tαtx,g2\(t\)=ddtσt2−2α˙tαtσt2f\(x,t\)=\\frac\{\\dot\{\\alpha\}\_\{t\}\}\{\\alpha\_\{t\}\}x,\\quad g^\{2\}\(t\)=\\frac\{d\}\{dt\}\\sigma\_\{t\}^\{2\}\-2\\frac\{\\dot\{\\alpha\}\_\{t\}\}\{\\alpha\_\{t\}\}\\sigma\_\{t\}^\{2\}\. Following denoising score matching\[[67](https://arxiv.org/html/2606.15359#bib.bib16),[62](https://arxiv.org/html/2606.15359#bib.bib13)\], the score functionst\(xt\)s\_\{t\}\(x\_\{t\}\)is approximated by a neural networkstθ\(xt\)s^\{\\theta\}\_\{t\}\(x\_\{t\}\)trained to minimize the conditional score\-matching \(CSM\) loss:
ℒCSM\(θ\)=𝔼t,x0,ε\[λ\(t\)‖stθ\(αtx0\+σtε\)\+εσt‖2\],\\mathcal\{L\}\_\{\\mathrm\{CSM\}\}\(\\theta\)=\\mathbb\{E\}\_\{t,x\_\{0\},\\varepsilon\}\\left\[\\lambda\(t\)\\left\\lVert s^\{\\theta\}\_\{t\}\(\\alpha\_\{t\}x\_\{0\}\+\\sigma\_\{t\}\\varepsilon\)\+\\frac\{\\varepsilon\}\{\\sigma\_\{t\}\}\\right\\rVert^\{2\}\\right\],\(2\)wheret∼𝒰\[0,1\]t\\sim\\mathcal\{U\}\[0,1\],x0∼p0x\_\{0\}\\sim p\_\{0\},ε∼𝒩\(0,Id\)\\varepsilon\\sim\\mathcal\{N\}\(0,I\_\{d\}\), andλ\(t\)\\lambda\(t\)is a time\-dependent weighting\.
##### Sampling and Tweedie’s formula\.
Sampling from a diffusion model is achieved by numerically simulating the reverse dynamics \([1](https://arxiv.org/html/2606.15359#S3.E1)\) using the learned score function\. Common stochastic samplers include DDPM\[[27](https://arxiv.org/html/2606.15359#bib.bib11)\], Euler–Maruyama\[[35](https://arxiv.org/html/2606.15359#bib.bib61)\], and higher\-order solvers\[[42](https://arxiv.org/html/2606.15359#bib.bib18)\]\. Despite our focus on*stochastic*dynamics, few\-step*deterministic*samplers are commonly obtained by numerical integration of the probability\-flow ODE\[[62](https://arxiv.org/html/2606.15359#bib.bib13)\], which shares the same marginal distributions as \([1](https://arxiv.org/html/2606.15359#S3.E1)\)\. Examples include DDIM\[[61](https://arxiv.org/html/2606.15359#bib.bib19)\]and ODE\-based solvers\[[41](https://arxiv.org/html/2606.15359#bib.bib17)\]\. A one\-step instantiation of these samplers yields an approximation of the posterior conditional mean of the forward process, which is related to the score function through Tweedie’s formula\[[54](https://arxiv.org/html/2606.15359#bib.bib23)\]:
𝔼\[X0\|Xt\]≈x^0θ\(xt,t\)=xt\+σt2stθ\(xt\)αt\.\\mathbb\{E\}\[X\_\{0\}\\,\|\\,X\_\{t\}\]\\approx\\hat\{x\}^\{\\theta\}\_\{0\}\(x\_\{t\},t\)=\\frac\{x\_\{t\}\+\\sigma\_\{t\}^\{2\}s^\{\\theta\}\_\{t\}\(x\_\{t\}\)\}\{\\alpha\_\{t\}\}\.\(3\)Therefore, we usestθs^\{\\theta\}\_\{t\}andx^0θ\\hat\{x\}^\{\\theta\}\_\{0\}interchangeably when convenient, assuming they are related by \([3](https://arxiv.org/html/2606.15359#S3.E3)\)\.
##### Diffusion planners\.
Following diffusion\-based planning formulations\[[29](https://arxiv.org/html/2606.15359#bib.bib24),[13](https://arxiv.org/html/2606.15359#bib.bib25)\], we treat the clean sample as a finite\-horizon plan rather than a single configuration\. LetH∈ℕH\\in\\mathbb\{N\}be the prediction horizon and letℋ=\{0,…,H\}\\mathcal\{H\}=\\\{0,\\ldots,H\\\}\. We denote plans as𝝉=\(𝝉0,…,𝝉H\)∈𝒟H\+1\\bm\{\\tau\}=\(\\bm\{\\tau\}\_\{0\},\\ldots,\\bm\{\\tau\}\_\{H\}\)\\in\\mathcal\{D\}^\{H\+1\}, where each element𝝉k∈𝒟⊆ℝD\\bm\{\\tau\}\_\{k\}\\in\\mathcal\{D\}\\subseteq\\mathbb\{R\}^\{D\}may encode a state, an action, or a state\-action pair depending on the planner parameterization\. A diffusion planner learns a distribution over such plans and generates𝝉\\bm\{\\tau\}by denoising a Gaussian prior sample\. In closed\-loop deployment, the planner executes only the firstM≤HM\\leq Helements before re\-planning, and the concatenation of executed prefixes forms the rollout\.
## 4Method
We now formalize hard\-constrained sampling for diffusion models and derive DiRecT\. The derivation is similar in spirit to the test\-time guidance framework for flow\-matching models developed in\[[38](https://arxiv.org/html/2606.15359#bib.bib86)\], while differing in the diffusion\-model setting and the resulting algorithmic form\. Due to space limitations, we defer the full derivation of the algorithm to Appendix[C](https://arxiv.org/html/2606.15359#A3)\. Given a pretrained score model, a constraint set𝒮⊆ℝd\\mathcal\{S\}\\subseteq\\mathbb\{R\}^\{d\}, and a cost functionC:ℝd→ℝC:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}, we seek to sample denoised plans𝝉\\bm\{\\tau\}from the model that are: \(i\)*safe*\(𝝉∈𝒮\\bm\{\\tau\}\\in\\mathcal\{S\}\), \(ii\)*minimize*the costCC, and \(iii\) remain*proximal*to the learned data distribution\. We now expand on the problem setting and show how safe planning can naturally be framed as solving a stochastic optimal control problem with terminal constraints\. We then derive a receding\-horizon surrogate formulation that leads to a tractable algorithm\. In this section, we use a general diffusion\-centric notation and denote denoised plans as𝑿0\\bm\{X\}\_\{0\}\.
##### Safe planning with diffusion models and stochastic optimal control\.
Denote byf~tθ\(Xt\)\\tilde\{f\}\_\{t\}^\{\\theta\}\(X\_\{t\}\)the reverse driftft\(Xt\)−g\(t\)2stθ\(Xt\)f\_\{t\}\(X\_\{t\}\)\-g\(t\)^\{2\}s\_\{t\}^\{\\theta\}\(X\_\{t\}\), so that the sampling dynamics of the trained model aredXt=f~tθ\(Xt\)dt\+g\(t\)dW¯t,X1∼p1dX\_\{t\}=\\tilde\{f\}\_\{t\}^\{\\theta\}\(X\_\{t\}\)dt\+g\(t\)d\\bar\{W\}\_\{t\},\\;X\_\{1\}\\sim p\_\{1\}\. Letu:ℝd×\[0,1\]→ℝdu:\\mathbb\{R\}^\{d\}\\times\[0,1\]\\rightarrow\\mathbb\{R\}^\{d\}be a control drift that steers the generative process toward low\-cost feasible samples through the controlled SDE:
dXtu=\[f~tθ\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dW¯t,X1∼p1dX^\{u\}\_\{t\}=\\left\[\\tilde\{f\}\_\{t\}^\{\\theta\}\(X^\{u\}\_\{t\}\)\+g\(t\)u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\]dt\+g\(t\)d\\bar\{W\}\_\{t\},\\;X\_\{1\}\\sim p\_\{1\}\(4\)Our three requirements for a safe, performant, and proximal planner naturally fit within a terminally constrained stochastic optimal control perspective\[[8](https://arxiv.org/html/2606.15359#bib.bib49)\], which aims to find the optimal control that solves Problem[1](https://arxiv.org/html/2606.15359#Thmproblem1):
###### Problem 1\(Stochastic optimal control problem with terminal constraints\)\.
Given reference dynamics, constraint set𝒮\\mathcal\{S\}, terminal costCC, and cost weightλ\\lambda, solve for the optimal control driftu⋆u^\{\\star\}:
minu\\displaystyle\\min\_\{u\}𝔼\[λC\(X0u\)\+12∫01‖ut\(Xtu\)‖22𝑑t\]\\displaystyle\\mathbb\{E\}\\left\[\\lambda\\,C\(X^\{u\}\_\{0\}\)\+\\frac\{1\}\{2\}\\int\_\{0\}^\{1\}\\left\\lVert u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\\rVert\_\{2\}^\{2\}\\,dt\\right\]\(5\)s\.t\.dXtu=\[f~tθ\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dW¯t,X1∼p1\\displaystyle dX^\{u\}\_\{t\}=\\left\[\\tilde\{f\}\_\{t\}^\{\\theta\}\(X^\{u\}\_\{t\}\)\+g\(t\)u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\]dt\+g\(t\)d\\bar\{W\}\_\{t\},\\quad X\_\{1\}\\sim p\_\{1\}\\X0u∈𝒮ℙu\-a\.s\.\\displaystyle X^\{u\}\_\{0\}\\in\\mathcal\{S\}\\qquad\\mathbb\{P\}^\{u\}\\text\{\-a\.s\.\}
whereℙu\\mathbb\{P\}^\{u\}denotes the path measure induced by \([4](https://arxiv.org/html/2606.15359#S4.E4)\)\. Notably, Problem[1](https://arxiv.org/html/2606.15359#Thmproblem1)enforces constraint satisfaction only at the terminal state\. Intermediate latents may therefore deviate significantly from the feasible set if doing so reduces control energy and keeps the process closer to the pretrained dynamics\. In principle, the optimal controller can be obtained by solving the Hamilton–Jacobi–Bellman \(HJB\)\[[5](https://arxiv.org/html/2606.15359#bib.bib53)\]equation\. However, exact solutions are rarely possible for the rich, high\-dimensional dynamics induced by the backward SDE\. See Appendix[B](https://arxiv.org/html/2606.15359#A2)for an overview of SOC, the HJB equation, and their relation to KL control\.
##### Receding\-horizon surrogate formulation\.
We briefly summarize the main steps used to transform the continuous\-time SOC problem into a tractable receding\-horizon surrogate\. We first discretize the reverse dynamics in \([1](https://arxiv.org/html/2606.15359#S3.E1)\)\. LetN∈ℕN\\in\\mathbb\{N\}denote the number of discretization steps, and let0=t0<t1<⋯<tN=10=t\_\{0\}<t\_\{1\}<\\cdots<t\_\{N\}=1be the sampling grid\. We write the uncontrolled reverse update fromtit\_\{i\}toti−1t\_\{i\-1\}as
Xi−1=Φiθ\(Xi,εi\),εi∼𝒩\(0,Id\),i=1,…,N,X\_\{i\-1\}=\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\},\\varepsilon\_\{i\}\),\\qquad\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\),\\qquad i=1,\\ldots,N,\(6\)whereΦiθ:ℝd×ℝd→ℝd\\Phi\_\{i\}^\{\\theta\}:\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}denotes theii\-th denoising transition of the chosen sampler\. This covers standard sampling schemes such as DDPM\[[27](https://arxiv.org/html/2606.15359#bib.bib11)\]as well as deterministic or higher\-order solvers\.
To control generation, we perturb each uncontrolled transition by an Euler discretization of the control input over\[ti−1,ti\]\[t\_\{i\-1\},t\_\{i\}\], yielding the*discrete\-time*version of the stochastic optimal control problem
min\{Xi\}i=0N,\{ui\}i=1N\\displaystyle\\min\_\{\\\{X\_\{i\}\\\}\_\{i=0\}^\{N\},\\\{u\_\{i\}\\\}^\{N\}\_\{i=1\}\}𝔼\[λC\(X0\)\+12∑i=1N‖ui\(Xi\)‖22Δti\]\\displaystyle\\mathbb\{E\}\\left\[\\lambda C\(X\_\{0\}\)\+\\frac\{1\}\{2\}\\sum\_\{i=1\}^\{N\}\\\|u\_\{i\}\(X\_\{i\}\)\\\|\_\{2\}^\{2\}\\Delta t\_\{i\}\\right\]\(7\)s\.t\.X0∈𝒮,\\displaystyle\\text\{s\.t\.\}\\quad X\_\{0\}\\in\\mathcal\{S\},Xi−1=Φiθ\(Xi,εi\)\+giui\(Xi\)Δti,εi∼𝒩\(0,Id\),i=1,…,N,\\displaystyle X\_\{i\-1\}=\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\},\\varepsilon\_\{i\}\)\+g\_\{i\}u\_\{i\}\(X\_\{i\}\)\\Delta t\_\{i\},\\quad\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\),\\quad i=1,\\ldots,N,\\quadAlthough discretization yields a finite\-horizon control problem, the discrete SOC optimization remains computationally prohibitive for most diffusion\-based applications\. The decision variables are feedback controls overℝd\\mathbb\{R\}^\{d\}, while the objective requires an expectation over all controlled stochastic trajectories\. In addition, terminal constraints and costs couple all optimization variables with multi\-step nonconvex neural dynamics, distorting the optimization landscape to the point that, even for a single noise realization, off\-the\-shelf solvers may struggle to find feasible solutions\.
To tackle these limitations, we simplify the SOC formulation by introducing two additional approximations\. First, we adopt a model predictive control \(MPC\) perspective, replacing the full\-horizon problem with a sequence of one\-step subproblems\. To do so, we exploit the structure of diffusion models and use Tweedie’s formula to construct proxy estimates of the terminal cost\-to\-go and terminal feasibility condition:
𝔼\[C\(X0\)∣Xi\]≈C\(x^0θ\(Xi,ti\)\),x^0θ\(Xi,ti\)∈𝒮,i=0,…,N−1\.\\mathbb\{E\}\\\!\\left\[C\(X\_\{0\}\)\\mid X\_\{i\}\\right\]\\approx C\\\!\\left\(\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\},t\_\{i\}\)\\right\),\\qquad\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\},t\_\{i\}\)\\in\\mathcal\{S\},\\qquad i=0,\\dots,N\-1\.\(8\)Second, to avoid repeatedly solving constrained optimization problems through the neural denoising map, we approximately invert Tweedie’s formula by a fixed\-point argument\. This allows us to replace the latent displacementXi−X¯iX\_\{i\}\-\\bar\{X\}\_\{i\}with a scaled difference between predicted clean samples:
Xi−X¯i≈αti\(x^0θ\(Xi,ti\)−x^0θ\(X¯i,ti\)\)\.X\_\{i\}\-\\bar\{X\}\_\{i\}\\approx\\alpha\_\{t\_\{i\}\}\\left\(\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\},t\_\{i\}\)\-\\hat\{x\}\_\{0\}^\{\\theta\}\(\\bar\{X\}\_\{i\},t\_\{i\}\)\\right\)\.\(9\)Together, these approximations move the one\-step constrained optimization from the latent space to the data domain, where the cost and constraint are evaluated directly on the predicted clean sample\. We provide the full derivation from Problem[1](https://arxiv.org/html/2606.15359#Thmproblem1)to the final algorithm in Appendix[C](https://arxiv.org/html/2606.15359#A3), including all assumptions and approximations\. We summarize the complete sampling procedure in Algorithm[1](https://arxiv.org/html/2606.15359#alg1)\.
Although several approximations are introduced to reduce the general stochastic optimal control problem to a tractable form, terminal constraint satisfaction is not relaxed, as formalized in Proposition[1](https://arxiv.org/html/2606.15359#Thmproposition1)\.
###### Proposition 1\(Sample feasibility\)\.
Suppose that the final subproblem in Algorithm[1](https://arxiv.org/html/2606.15359#alg1)admits a feasible solutionX^0\|0⋆∈𝒮\\hat\{X\}^\{\\star\}\_\{0\|0\}\\in\\mathcal\{S\}\. Then the final sample returned by the recursion satisfies the terminal constraint\.
###### Proof\.
By construction, the final step of Algorithm[1](https://arxiv.org/html/2606.15359#alg1)returnsX0⋆=X^0\|0⋆X^\{\\star\}\_\{0\}=\\hat\{X\}^\{\\star\}\_\{0\|0\}\. Since the final subproblem imposesX^0\|0⋆∈𝒮\\hat\{X\}^\{\\star\}\_\{0\|0\}\\in\\mathcal\{S\}, it follows immediately that the final sample satisfies the constraintX0⋆∈𝒮X^\{\\star\}\_\{0\}\\in\\mathcal\{S\}\. ∎
Input:Learned score matching modelstθ\(⋅\)s\_\{t\}^\{\\theta\}\(\\cdot\), costC\(⋅\)C\(\\cdot\), cost weightλ\>0\\lambda\>0, feasible set𝒮\\mathcal\{S\}, discretization stepsNN, time grid0=t0<t1<…<tN=10=t\_\{0\}<t\_\{1\}<\\ldots<t\_\{N\}=1, sampling scheme\{Φiθ\}i=1N\\\{\\Phi^\{\\theta\}\_\{i\}\\\}\_\{i=1\}^\{N\}, differentiable affine scheduler\(αt,σt\)\(\\alpha\_\{t\},\\sigma\_\{t\}\)\.
Output:Safe plan
X0⋆X\_\{0\}^\{\\star\}\.
1
XN⋆∼𝒩\(0,Id\)X\_\{N\}^\{\\star\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\);
//Sample prior
2
3for*i=N,…,1i=N,\\ldots,1*do
Δti←ti−ti−1\\Delta t\_\{i\}\\leftarrow t\_\{i\}\-t\_\{i\-1\};
//Time step
4
εi∼𝒩\(0,Id\)\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\);
//Reverse noise
5
gi←\(ddtσt2−2α˙tαtσt2\)1/2\|t=tig\_\{i\}\\leftarrow\\left\(\\frac\{d\}\{dt\}\\sigma\_\{t\}^\{2\}\-2\\frac\{\\dot\{\\alpha\}\_\{t\}\}\{\\alpha\_\{t\}\}\\sigma\_\{t\}^\{2\}\\right\)^\{1/2\}\\bigg\|\_\{t=t\_\{i\}\};
//Diffusion coefficient
6
X¯i−1εi←Φiθ\(Xi⋆,εi\)\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\\leftarrow\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\}^\{\\star\},\\varepsilon\_\{i\}\);
//Uncontrolled denoising proposal
7
X~0\|i−1←x^0θ\(X¯i−1εi,ti−1\)\\tilde\{X\}\_\{0\|i\-1\}\\leftarrow\\hat\{x\}\_\{0\}^\{\\theta\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\},t\_\{i\-1\}\);
//Tweedie clean\-sample estimate
8
9Solve the constrained subproblem;
10X^0\|i−1⋆∈argminX^0\|i−1∈𝒮λC\(X^0\|i−1\)\+αti−122gi2Δti‖X^0\|i−1−X~0\|i−1‖22\\displaystyle\\hat\{X\}^\{\\star\}\_\{0\|i\-1\}\\in\\operatorname\*\{arg\\,min\}\_\{\\hat\{X\}\_\{0\|i\-1\}\\in\\mathcal\{S\}\}\\lambda C\(\\hat\{X\}\_\{0\|i\-1\}\)\+\\frac\{\\alpha\_\{t\_\{i\-1\}\}^\{2\}\}\{2g\_\{i\}^\{2\}\\Delta t\_\{i\}\}\\\|\\hat\{X\}\_\{0\|i\-1\}\-\\tilde\{X\}\_\{0\|i\-1\}\\\|\_\{2\}^\{2\};
11\(10\)if*i\>1i\>1*then
Xi−1⋆←X¯i−1εi\+αti−1\(X^0\|i−1⋆−X~0\|i−1\)X^\{\\star\}\_\{i\-1\}\\leftarrow\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\+\\alpha\_\{t\_\{i\-1\}\}\(\\hat\{X\}^\{\\star\}\_\{0\|i\-1\}\-\\tilde\{X\}\_\{0\|i\-1\}\);
//Latent correction
12
13else
X0⋆←X^0\|0⋆X^\{\\star\}\_\{0\}\\leftarrow\\hat\{X\}^\{\\star\}\_\{0\|0\};
//Return feasible terminal sample
14
15
return*X0⋆X\_\{0\}^\{\\star\}*
Algorithm 1DiRecT: Safe diffusion\-based planning
## 5Experiments
We seek to evaluate DiRecT to answer the following questions: \(i\)*Can our method enforce test\-time constraints, while retaining task performance?*\(ii\)*How does DiRecT compare with existing methods in terms of safety, task success, and computational overhead?*\(iii\)*How flexible is our method in handling equality and inequality constraints, high\-dimensional nonconvex constraint sets, and soft rewards or costs?*To answer these questions, we test our method on maze navigation in Maze2D\[[23](https://arxiv.org/html/2606.15359#bib.bib62)\], robotic manipulation in D3IL\[[30](https://arxiv.org/html/2606.15359#bib.bib63)\], multi\-robot motion planning\[[58](https://arxiv.org/html/2606.15359#bib.bib65)\], and diverse contact\-rich manipulation in PushT\[[13](https://arxiv.org/html/2606.15359#bib.bib25),[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\. Detailed settings on the experimental setup are provided in Appendix[D](https://arxiv.org/html/2606.15359#A4)and additional results and ablations in Appendix[E](https://arxiv.org/html/2606.15359#A5)\.
### 5\.1Safe maze navigation
\(a\)
\(b\)
\(c\)
\(d\)
Figure 1:Visualization of DiRecT for maze navigation and robotic manipulation with test\-time hard constraints\. InMaze2D\-broad\([1\(a\)](https://arxiv.org/html/2606.15359#S5.F1.sf1)\) andMaze2D\-narrow\([1\(b\)](https://arxiv.org/html/2606.15359#S5.F1.sf2)\), the robot has to navigate inside a maze toward the bottom\-right corner while avoiding the obstacles represented in red\. InD3IL\-avoiding, a robotic arm moves toward the green target line, while avoiding six pillars \([1\(c\)](https://arxiv.org/html/2606.15359#S5.F1.sf3)\)\. We constrain the problem during inference with additional obstacles, represented in blue \([1\(d\)](https://arxiv.org/html/2606.15359#S5.F1.sf4)\)\.First, we test our method on safe maze navigation for the D4RLMaze2D\-largeenvironment\[[23](https://arxiv.org/html/2606.15359#bib.bib62)\]\. In this task, a small ball traverses a planar maze toward a target position, while avoiding additional*test\-time*obstacles as depicted in Figure[1](https://arxiv.org/html/2606.15359#S5.F1)\. We train a continuous\-time diffusion model to sample state\-action pairs over the prediction horizonHH, where each states∈ℝ4s\\in\\mathbb\{R\}^\{4\}includes position and velocity of the ball and actionsa∈ℝ2a\\in\\mathbb\{R\}^\{2\}represent force inputs\. During inference, we sample*one*trajectory from the model conditioned on endpoint states and execute actions by tracking with a proportional\-derivative \(PD\) controller\. Although the sampled*planned*trajectory may satisfy test\-time constraints, the*rollout*trajectory can deviate from the plan and become infeasible due to mismatch between learned and environment dynamics\. To improve rollout fidelity\[[55](https://arxiv.org/html/2606.15359#bib.bib64),[9](https://arxiv.org/html/2606.15359#bib.bib83)\], we impose additional*equality*constraints in the form of linearized dynamics fitted from the training data \(see Appendix[D\.1](https://arxiv.org/html/2606.15359#A4.SS1)\)\.
##### Baselines\.
We compare against five baselines to represent different classes of constrained planners:Diffuser\[[29](https://arxiv.org/html/2606.15359#bib.bib24)\]\(unguided reference\),Gradient Guidance\[[15](https://arxiv.org/html/2606.15359#bib.bib22)\]\(soft constraining with posterior sampling\),Projected Diffusion\[[14](https://arxiv.org/html/2606.15359#bib.bib28)\]\(per\-step projection\),Augmented Lagrangian\[[71](https://arxiv.org/html/2606.15359#bib.bib30)\]\(early\-stage Lagrangian constraint relaxation\), andSafeDiffuser\-RoS\[[69](https://arxiv.org/html/2606.15359#bib.bib31)\]\(CBF\-based guidance\)\. We adapt existing methods to include dynamic constraints and employ IPOPT\[[68](https://arxiv.org/html/2606.15359#bib.bib72)\]as an*off\-the\-shelf*solver\. Additional results are provided in Appendix[E\.1](https://arxiv.org/html/2606.15359#A5.SS1)\.
##### Evaluation\.
We test all methods over two progressively challenging variants\[[13](https://arxiv.org/html/2606.15359#bib.bib25),[71](https://arxiv.org/html/2606.15359#bib.bib30),[29](https://arxiv.org/html/2606.15359#bib.bib24)\], which we denote asbroad\(Fig\.[1\(a\)](https://arxiv.org/html/2606.15359#S5.F1.sf1)\) andnarrow\(Fig\.[1\(b\)](https://arxiv.org/html/2606.15359#S5.F1.sf2)\)\. For each variant, we evaluate100100independent trials to measure how well each method adapts to additional prior environment knowledge in the form of dynamic constraints for safe navigation\. We report in Table[1](https://arxiv.org/html/2606.15359#S5.T1)four evaluation metrics:*Safety Rate*\(SR\), the fraction of collision\-free rollouts;*Violations*, the mean number of steps the ball spends inside obstacles; D4RL normalized*Score*, which reflects task completion performance; and sampling*Time*\. DiRecT is the only method that achieves high safety and near\-zero violations while retaining strong performance\.
Table 1:Results on the safe navigation inMaze2D\. Mean values over100100i\.i\.d\. samples and reported alongside their standard deviation\. Boldface indicates the best score\.
### 5\.2Safe robotic manipulation
We next evaluate the reliability and performance of our method for closed\-loop robotic manipulation on the D3ILavoidingenvironment\[[30](https://arxiv.org/html/2606.15359#bib.bib63)\]\. In this task, a robotic manipulator moves toward a target region while its end\-effector avoids six pillars \(Fig\.[1\(c\)](https://arxiv.org/html/2606.15359#S5.F1.sf3)\)\. We train a continuous\-time diffusion model on the D3IL demonstration dataset, consisting of expert state\-action trajectories, where the robot states∈ℝ4s\\in\\mathbb\{R\}^\{4\}is composed of the current and desired positions, and the actiona∈ℝ2a\\in\\mathbb\{R\}^\{2\}represents the end\-effector velocity\. At test time, the end\-effector must reach the target region while avoiding the environment pillars and*additional*planar and circular constraints \(Fig\.[1\(d\)](https://arxiv.org/html/2606.15359#S5.F1.sf4)\)\. Rollouts are performed in a receding\-horizon fashion, in which we executeM<HM<Hactions before re\-planning\. Similarly to the maze navigation task, we test the ability of our method to exploit additional environment priors in the form of linearized dynamic constraints\.
##### Baselines\.
We compare against the same baselines as in the Maze2D setting\. Additional results and baselines are presented in Appendix[E\.2](https://arxiv.org/html/2606.15359#A5.SS2)\.
##### Evaluation\.
We test each method over100100i\.i\.d\. rollouts and compute the following metrics:*Safety Rate*\(SR\), the fraction of trials satisfying both environment and test\-time constraints;*Task Success*\(TS\), the proportion of trajectories that reach the end goal in the allowed number of steps; average*Steps*taken to reach the target for safe rollouts; sampling*Time*\. Results in Table[2](https://arxiv.org/html/2606.15359#S5.T2)show that our method is the only one to maintain performance and safety for*all*initializations\.
Table 2:Results on constrained manipulation in D3ILavoiding\. Mean values are computed over100100i\.i\.d\. trials and reported with standard deviations where applicable\. Boldface indicates the best score\.
### 5\.3Safe multi\-robot motion planning
We test DiRecT on the coupled diffusion generation benchmark of\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], which is based on the multi\-robot motion planning \(MRMP\) environment introduced by\[[58](https://arxiv.org/html/2606.15359#bib.bib65)\]\. Given starting locations for each robot, the goal is to simultaneously sample trajectories for all robots from a diffusion model trained on single\-agent examples, while avoiding*inter\-robot*and*obstacle*collisions, satisfying inference\-time*velocity*constraints, and exhibiting target environment\-specific behaviors \(Figure[2](https://arxiv.org/html/2606.15359#S5.F2)\)\. Given the high\-dimensional and nonconvex nature of the MRMP problem, we test the robustness of different methods by scaling the number of agents up to twenty, while imposing velocity limits as in\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\.
Figure 2:Highwayswith88agents\. Desired motion is counterclockwise around the central obstacle\.##### Baselines\.
We compare our method with the following references:Diffuser\[[29](https://arxiv.org/html/2606.15359#bib.bib24)\]\(unconstrained reference\),MMD\-CBS\[[58](https://arxiv.org/html/2606.15359#bib.bib65)\]\(tree\-search collision search without velocity constraints\),PCD\-LBandPCD\-SHD\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\(soft guidance for collision avoidance and projection for velocity feasibility\)\. For fast and parallelizable projection of trajectories onto the nonconvex feasible set, we develop a custom*domain\-specific*optimizer that is amenable to just\-in\-time compilation and GPU parallelization in JAX\[[11](https://arxiv.org/html/2606.15359#bib.bib73)\]\(see Appendix[D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1)\)\. To show that the interplay between the generative process and projection optimization is integral to the success and reliability of the algorithm, we implementFinal Projection, in which samples are generated unconditionally and then projected*post\-sampling*onto the feasible set\.
##### Evaluation\.
We assess each method over100100starting configurations, generating128128candidate trajectories for each initialization as in\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\. We report the following metrics:*Success Rate*, which is the fraction of trials with at least one collision\-free trajectory among the generated candidate trajectories;*Constraint Safety*, which is the proportion of*all*samples satisfying collision and kinematic constraints, and*Data Adherence*, which measures environment\-specific task success\. As shown in Figure[3](https://arxiv.org/html/2606.15359#S5.F3), DiRecT outperforms all other baselines in terms of safety, especially when scaling to twenty agents\. Note that the relatively lower data adherence onConveyoris not representative of a relative lack of performance, as all other baselines mostly generate infeasible samples\. We provide additional experimentation details in Appendix[D\.3](https://arxiv.org/html/2606.15359#A4.SS3)and extensive sweep results in Appendix[E\.3](https://arxiv.org/html/2606.15359#A5.SS3)\.
\(a\)Emptywith agent velocities limited to0\.6750\.675\.
\(b\)Highwayswith agent velocities limited to0\.6470\.647\.
\(c\)Conveyorwith agent velocities limited to1\.211\.21\.
\(d\)Drop\-Regionwith agent velocities limited to0\.9280\.928\.
Figure 3:Performance comparison forN∈\{4,8,12,16,20\}N\\in\\\{4,8,12,16,20\\\}agents on the four MMD environments\[[58](https://arxiv.org/html/2606.15359#bib.bib65)\]\. We impose obstacle and inter\-robot constraints and restrict agent velocities\. Our method outperforms all other baselines in terms of safety, while maintaining adherence to the expected motion pattern\.
### 5\.4Safe and diverse contact\-rich manipulation
Finally, we show that DiRecT can effectively incorporate and optimize over*soft*costs while satisfying test\-time constraints\. We base our evaluation on simultaneous coupled trajectory generation onPushT\[[22](https://arxiv.org/html/2606.15359#bib.bib66)\]\. In the single\-agent setting, the objective is to push a T\-shaped block from a randomized starting position until it overlaps with a target pose \(Figure[4](https://arxiv.org/html/2606.15359#S5.F4)\)\. Following\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], we consider the simultaneous generation of a pair of distinct trajectories while imposing*hard*constraints on the maximum velocity and*soft*coupling costs to encourage non\-intersecting, diverse trajectories\.
##### Baselines\.
We compare against vanillaDiffusion Policy\[[13](https://arxiv.org/html/2606.15359#bib.bib25)\], Projected Coupled Diffusion variants with Determinantal Point Process \(DPP\) and Log\-barrier \(LB\) losses, both without velocity projection \(CD\-DPP,CD\-LB\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\) and with velocity projection \(PCD\-DPP,PCD\-LB\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]\)\. To demonstrate that our method can improve diversity with the same computational overhead, we incorporate DPP and LB costs into \([1](https://arxiv.org/html/2606.15359#alg1)\), and solve the resulting optimization with one step of Projected Gradient Descent\[[37](https://arxiv.org/html/2606.15359#bib.bib78)\], thereby requiring one projection and one gradient computation per sampling step \(see Appendix[D\.4](https://arxiv.org/html/2606.15359#A4.SS4)\)\. We denote the two variants of our method asDiRecT\-DPPandDiRecT\-LB\.
Figure 4:Coupled generation onPushT\. Two agents push gray blocks onto a green target pose while avoiding intersecting trajectories\.
##### Evaluation\.
We follow the same experimental procedure, generating100100*pairs*of trajectories over5050random initializations for the block and agent positions and report the following metrics:*Dynamic Time Warping*\(DTW\)\[[6](https://arxiv.org/html/2606.15359#bib.bib67),[46](https://arxiv.org/html/2606.15359#bib.bib68)\]and*Discrete Fréchet Distance*\(DFD\)\[[2](https://arxiv.org/html/2606.15359#bib.bib69)\]to evaluate diversity;*Task Completion*score \(TC\)\[[22](https://arxiv.org/html/2606.15359#bib.bib66),[13](https://arxiv.org/html/2606.15359#bib.bib25)\], measuring successful overlap of the blocks onto the respective targets;*Constraint Satisfaction*rate \(CS\)\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]as the fraction of samples satisfying velocity limits; and sampling*Time*\. As summarized in Table[3](https://arxiv.org/html/2606.15359#S5.T3), our formulation improves both diversity and task success relative to other constrained baselines, highlighting the importance of performing both*soft*and*hard*guidance near the data manifold\. As observed by\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], diversity objectives conflict with task completion, leading to Pareto\-optimal solutions as a function of guidance strength\. In Appendix[E\.4](https://arxiv.org/html/2606.15359#A5.SS4), we show that DiRecT dominates PCD in terms of Pareto optimality\.
Table 3:Results on generation of safe and diverse contact\-rich manipulation inPushT, with a velocity limit of8\.48\.4\. Best overall value\(s\) in boldface; underlining denotes best among constrained samplers\.
## 6Conclusion
In this paper, we introduced DiRecT, a training\-free algorithm for constrained sampling in safe diffusion\-based planning\. By formulating constrained sampling as a terminally constrained stochastic optimal control problem and deriving a scalable receding\-horizon surrogate, DiRecT separates stochastic denoising from trajectory\-level constraint satisfaction\. This design avoids overconstraining noisy intermediate iterates while retaining the flexibility to incorporate domain\-specific optimizers, dynamics priors, and additional soft objectives\. Experiments across diverse robotic planning domains demonstrate that DiRecT improves both constraint satisfaction and task success, highlighting its potential for reliable deployment in safety\-critical planning tasks\.
##### Limitations and future work\.
Our method enforces hard constraints for pretrained diffusion planners in a*training\-free*manner, making it broadly applicable to existing models without retraining\. This flexibility, however, comes with additional inference\-time computation\. A natural direction for future work is to combine terminal constraint control with model fine\-tuning, reducing runtime overhead and improving adaptability to test\-time constraints\.
## References
- \[1\]\(2023\)Is conditional generative modeling all you need for decision\-making?\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[2\]H\. Alt and M\. Godau\(1995\)Computing the fréchet distance between two polygonal curves\.International Journal of Computational Geometry & Applications5\(01n02\),pp\. 75–91\.Cited by:[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2)\.
- \[3\]B\. D\.O\. Anderson\(1982\)Reverse\-time diffusion equation models\.Stochastic Processes and their Applications12\(3\),pp\. 313–326\.External Links:ISSN 0304\-4149,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0304-4149%2882%2990051-5),[Link](https://www.sciencedirect.com/science/article/pii/0304414982900515)Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px1.p1.3)\.
- \[4\]J\. A\. E\. Andersson, J\. Gillis, G\. Horn, J\. B\. Rawlings, and M\. Diehl\(2019\)CasADi – A software framework for nonlinear optimization and optimal control\.Mathematical Programming Computation11\(1\),pp\. 1–36\.External Links:[Document](https://dx.doi.org/10.1007/s12532-018-0139-4)Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1)\.
- \[5\]R\. Bellman\(1954\)The theory of dynamic programming\.PaperTechnical ReportP\-550,RAND Corporation,Santa Monica, CA\.External Links:[Link](https://www.rand.org/pubs/papers/P550.html)Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px2.p2.2),[§4](https://arxiv.org/html/2606.15359#S4.SS0.SSS0.Px1.p2.1),[Theorem 1](https://arxiv.org/html/2606.15359#Thmtheorem1.1.p1.2.2)\.
- \[6\]D\. J\. Berndt and J\. Clifford\(1994\)Using dynamic time warping to find patterns in time series\.InProceedings of the 3rd international conference on knowledge discovery and data mining,pp\. 359–370\.Cited by:[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2)\.
- \[7\]D\. P\. Bertsekas and J\. N\. Tsitsiklis\(1996\)Neuro\-dynamic programming\.Athena Scientific\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px2.p3.1)\.
- \[8\]B\. Bouchard, R\. Elie, and C\. Imbert\(2010\)Optimal control under stochastic target constraints\.SIAM Journal on Control and Optimization48\(5\),pp\. 3501–3531\.Cited by:[§4](https://arxiv.org/html/2606.15359#S4.SS0.SSS0.Px1.p1.5)\.
- \[9\]J\. Bouvier, K\. Ryu, K\. Nagpal, Q\. Liao, K\. Sreenath, and N\. Mehr\(2025\)DDAT: diffusion policies enforcing dynamically admissible robot trajectories\.InRobotics: Science and Systems \(RSS\),Cited by:[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.p1.3)\.
- \[10\]S\. Boyd, N\. Parikh, E\. Chu, B\. Peleato, and J\. Eckstein\(2011\)Distributed optimization and statistical learning via the alternating direction method of multipliers\.Foundations and Trends in Machine Learning3\(1\),pp\. 1–122\.Cited by:[§D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1.Px1.p4.4)\.
- \[11\]J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, Y\. Katariya, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang\(2018\)JAX: composable transformations of Python\+NumPy programs\.External Links:[Link](http://github.com/jax-ml/jax)Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1.Px3.p3.4),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.SSS0.Px1.p1.1)\.
- \[12\]L\. F\. Chamon, M\. R\. Karimi, and A\. Korba\(2024\)Constrained sampling with primal\-dual langevin monte carlo\.Advances in Neural Information Processing Systems37,pp\. 29285–29323\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px4.p1.1)\.
- \[13\]C\. Chi, Z\. Xu, S\. Feng, E\. Cousineau, Y\. Du, B\. Burchfiel, R\. Tedrake, and S\. Song\(2025\)Diffusion policy: visuomotor policy learning via action diffusion\.The International Journal of Robotics Research44\(10\-11\),pp\. 1684–1704\.Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px2.p1.3),[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px3.p1.6),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px2.p1.1),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.10.10.10.10.10.10.10.10.10.6),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2),[§5](https://arxiv.org/html/2606.15359#S5.p1.1)\.
- \[14\]J\. K\. Christopher, S\. Baek, and F\. Fioretto\(2024\)Constrained synthesis with projected diffusion models\.Advances in Neural Information Processing Systems37,pp\. 89307–89333\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1),[3rd item](https://arxiv.org/html/2606.15359#A4.I1.i3.p1.1),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.122.120.120.120.120.120.120.120.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.42.40.40.40.40.40.40.40.9),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.21.19.19.19.19.19.19.19.5),[§1](https://arxiv.org/html/2606.15359#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.15.13.13.13.13.13.13.13.4),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.33.31.31.31.31.31.31.31.4),[Table 2](https://arxiv.org/html/2606.15359#S5.T2.12.10.10.10.10.10.10.10.3)\.
- \[15\]H\. Chung, J\. Kim, M\. T\. Mccann, M\. L\. Klasky, and J\. C\. Ye\(2023\)Diffusion posterior sampling for general noisy inverse problems\.InInternational Conference on Learning Representations,Cited by:[2nd item](https://arxiv.org/html/2606.15359#A4.I1.i2.p1.4),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.114.112.112.112.112.112.112.112.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.34.32.32.32.32.32.32.32.9),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.17.15.15.15.15.15.15.15.5),[§1](https://arxiv.org/html/2606.15359#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.12.10.10.10.10.10.10.10.4),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.30.28.28.28.28.28.28.28.4),[Table 2](https://arxiv.org/html/2606.15359#S5.T2.10.8.8.8.8.8.8.8.3)\.
- \[16\]X\. Dai, Z\. Yang, D\. Yu, F\. Liu, H\. Sadeghian, S\. Haddadin, and S\. Hirche\(2025\)SafeFlow: safe robot motion planning with flow matching via control barrier functions\.External Links:2504\.08661,[Link](https://arxiv.org/abs/2504.08661)Cited by:[§E\.1](https://arxiv.org/html/2606.15359#A5.SS1.p1.3)\.
- \[17\]P\. Dhariwal and A\. Nichol\(2021\)Diffusion models beat gans on image synthesis\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 8780–8794\.Cited by:[§E\.1](https://arxiv.org/html/2606.15359#A5.SS1.p1.3),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.106.104.104.104.104.104.104.104.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.26.24.24.24.24.24.24.24.9),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.13.11.11.11.11.11.11.11.4),[§1](https://arxiv.org/html/2606.15359#S1.p1.1),[§1](https://arxiv.org/html/2606.15359#S1.p2.1)\.
- \[18\]C\. Domingo\-Enrich, M\. Drozdzal, B\. Karrer, and R\. T\. Chen\(2024\)Adjoint matching: fine\-tuning flow and diffusion generative models with memoryless stochastic optimal control\.arXiv preprint arXiv:2409\.08861\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1)\.
- \[19\]C\. Fan, C\. Bai, Z\. Shan, H\. He, Y\. Zhang, and Z\. Wang\(2024\)Task\-agnostic pre\-training and task\-guided fine\-tuning for versatile diffusion planner\.arXiv preprint arXiv:2409\.19949\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px3.p1.1)\.
- \[20\]Z\. Feng, H\. Luan, P\. Goyal, and H\. Soh\(2024\)LTLDoG: satisfying temporally\-extended symbolic constraints for safe diffusion\-based planning\.IEEE Robotics and Automation Letters9\(10\),pp\. 8571–8578\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2024.3443501)Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§D\.4](https://arxiv.org/html/2606.15359#A4.SS4.SSS0.Px1.p1.1),[§D\.4](https://arxiv.org/html/2606.15359#A4.SS4.SSS0.Px3.p2.2)\.
- \[21\]W\. H\. Fleming and R\. W\. Rishel\(2012\)Deterministic and stochastic optimal control\.Springer Science & Business Media\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.p1.1),[Theorem 1](https://arxiv.org/html/2606.15359#Thmtheorem1),[Theorem 1](https://arxiv.org/html/2606.15359#Thmtheorem1.1.p1.2.2)\.
- \[22\]P\. Florence, C\. Lynch, A\. Zeng, O\. A\. Ramirez, A\. Wahid, L\. Downs, A\. Wong, J\. Lee, I\. Mordatch, and J\. Tompson\(2021\)Implicit behavioral cloning\.In5th Annual Conference on Robot Learning,External Links:[Link](https://openreview.net/forum?id=rif3a5NAxU6)Cited by:[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.p1.1)\.
- \[23\]J\. Fu, A\. Kumar, O\. Nachum, G\. Tucker, and S\. Levine\(2020\)D4RL: datasets for deep data\-driven reinforcement learning\.External Links:2004\.07219Cited by:[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px1.p1.8),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.p1.3),[§5](https://arxiv.org/html/2606.15359#S5.p1.1)\.
- \[24\]J\. Han, A\. Jentzen, and W\. E\(2018\)Solving high\-dimensional partial differential equations using deep learning\.Proceedings of the National Academy of Sciences115\(34\),pp\. 8505–8510\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px2.p3.1)\.
- \[25\]P\. E\. Hart, N\. J\. Nilsson, and B\. Raphael\(1968\)A formal basis for the heuristic determination of minimum cost paths\.IEEE Transactions on Systems Science and Cybernetics4\(2\),pp\. 100–107\.External Links:[Document](https://dx.doi.org/10.1109/TSSC.1968.300136)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[26\]U\. G\. Haussmann and J\. P\. Lepeltier\(1990\)On the existence of optimal controls\.SIAM Journal on Control and Optimization28\(4\),pp\. 851–902\.External Links:[Document](https://dx.doi.org/10.1137/0328049)Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px1.p2.4)\.
- \[27\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 6840–6851\.Cited by:[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px3.p1.3),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px2.p1.3),[§1](https://arxiv.org/html/2606.15359#S1.p1.1),[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3),[§4](https://arxiv.org/html/2606.15359#S4.SS0.SSS0.Px2.p1.6)\.
- \[28\]J\. Ho and T\. Salimans\(2021\)Classifier\-free diffusion guidance\.InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,External Links:[Link](https://openreview.net/forum?id=qw8AKxfYbI)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p2.1)\.
- \[29\]M\. Janner, Y\. Du, J\. B\. Tenenbaum, and S\. Levine\(2022\)Planning with diffusion for flexible behavior synthesis\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 9902–9915\.Cited by:[1st item](https://arxiv.org/html/2606.15359#A4.I1.i1.p1.1),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.18.16.16.16.16.16.16.16.10),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.98.96.96.96.96.96.96.96.10),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.10.8.8.8.8.8.8.8.5),[§1](https://arxiv.org/html/2606.15359#S1.p1.1),[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px3.p1.6),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.27.25.25.25.25.25.25.25.5),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.9.7.7.7.7.7.7.7.5),[Table 2](https://arxiv.org/html/2606.15359#S5.T2.8.6.6.6.6.6.6.6.4)\.
- \[30\]X\. Jia, D\. Blessing, X\. Jiang, M\. Reuss, A\. Donat, R\. Lioutikov, and G\. Neumann\(2024\)Towards diverse behaviors: a benchmark for imitation learning with human demonstrations\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6pPYRXKPpw)Cited by:[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px1.p1.8),[§D\.2](https://arxiv.org/html/2606.15359#A4.SS2.p1.8),[§5\.2](https://arxiv.org/html/2606.15359#S5.SS2.p1.3),[§5](https://arxiv.org/html/2606.15359#S5.p1.1)\.
- \[31\]H\. J\. Kappen\(2005\)Path integrals and symmetry breaking for optimal control theory\.Journal of Statistical Mechanics: Theory and Experiment2005\(11\),pp\. P11011\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.15359#S1.p3.1)\.
- \[32\]S\. Karaman and E\. Frazzoli\(2011\)Sampling\-based algorithms for optimal motion planning\.The International Journal of Robotics Research30\(7\),pp\. 846–894\.External Links:[Document](https://dx.doi.org/10.1177/0278364911406761)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[33\]L\.E\. Kavraki, P\. Svestka, J\.\-C\. Latombe, and M\.H\. Overmars\(1996\)Probabilistic roadmaps for path planning in high\-dimensional configuration spaces\.IEEE Transactions on Robotics and Automation12\(4\),pp\. 566–580\.External Links:[Document](https://dx.doi.org/10.1109/70.508439)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[34\]D\. P\. Kingma and J\. Ba\(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px1.p1.8),[§D\.2](https://arxiv.org/html/2606.15359#A4.SS2.p1.8)\.
- \[35\]P\. E\. Kloeden and E\. Platen\(1992\)Numerical solution of stochastic differential equations\.Springer\.Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[36\]S\. LaValle\(1998\)Rapidly\-exploring random trees: a new tool for path planning\.Research Report 9811\.Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[37\]E\. S\. Levitin and B\. T\. Polyak\(1966\)Constrained minimization methods\.USSR Computational Mathematics and Mathematical Physics6\(5\),pp\. 1–50\.Cited by:[§D\.4](https://arxiv.org/html/2606.15359#A4.SS4.SSS0.Px4.p1.5),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px1.p1.1)\.
- \[38\]Z\. Li, K\. Alim, and N\. Azizan\(2025\)HardFlow: hard\-constrained sampling for flow\-matching models via trajectory optimization\.arXiv preprint arXiv:2511\.08425\.Cited by:[§4](https://arxiv.org/html/2606.15359#S4.p1.6)\.
- \[39\]J\. Liang, J\. K\. Christopher, S\. Koenig, and F\. Fioretto\(2025\)Simultaneous multi\-robot motion planning with projected diffusion models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=Sp7jclUwkV)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px1.p3.9),[§D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1.Px1.p4.4),[§1](https://arxiv.org/html/2606.15359#S1.p2.1)\.
- \[40\]Z\. Liang, Y\. Mu, M\. Ding, F\. Ni, M\. Tomizuka, and P\. Luo\(2023\)Adaptdiffuser: diffusion models as adaptive self\-evolving planners\.arXiv preprint arXiv:2302\.01877\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px3.p1.1)\.
- \[41\]C\. Lu, Y\. Zhou, F\. Bao, J\. Chen, C\. Li, and J\. Zhu\(2022\)DPM\-Solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps\.InAdvances in Neural Information Processing Systems,Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[42\]C\. Lu, Y\. Zhou, F\. Bao, J\. Chen, C\. Li, and J\. Zhu\(2025\)Dpm\-solver\+\+: fast solver for guided sampling of diffusion probabilistic models\.Machine Intelligence Research22\(4\),pp\. 730–751\.Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[43\]H\. Luan, Y\. X\. Goh, S\. Ng, and C\. K\. Ling\(2026\)Projected coupled diffusion for test\-time constrained joint generation\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=1FEm5JLpvg)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1),[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px1.p2.1),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px1.p3.22),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px2.p1.3),[§D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1.Px1.p4.4),[§D\.4](https://arxiv.org/html/2606.15359#A4.SS4.SSS0.Px2.p1.6),[§E\.4](https://arxiv.org/html/2606.15359#A5.SS4.p1.1),[§1](https://arxiv.org/html/2606.15359#S1.p2.1),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.SSS0.Px1.p1.1),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.SSS0.Px2.p1.2),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.p1.1),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.15.15.15.15.15.15.15.15.15.6),[Table 3](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.20.20.20.20.20.20.20.20.20.6),[Table 3](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.25.25.25.25.25.25.25.25.25.6),[Table 3](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.30.30.30.30.30.30.30.30.30.6),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2),[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.p1.1),[§5](https://arxiv.org/html/2606.15359#S5.p1.1)\.
- \[44\]Y\. Mao, M\. Szmuk, X\. Xu, and B\. Açıkmeşe\(2018\)Successive convexification: a superlinearly convergent algorithm for non\-convex optimal control problems\.arXiv preprint arXiv:1804\.06539\.Cited by:[§D\.3\.1](https://arxiv.org/html/2606.15359#A4.SS3.SSS1.Px1.p4.4)\.
- \[45\]K\. Mizuta and K\. Leung\(2024\)Cobl\-diffusion: diffusion\-based conditional robot planning in dynamic environments using control barrier and lyapunov functions\.In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),pp\. 13801–13808\.Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p2.1)\.
- \[46\]M\. Müller\(2007\)Information retrieval for music and motion\.Springer\.Cited by:[§5\.4](https://arxiv.org/html/2606.15359#S5.SS4.SSS0.Px2.p1.2)\.
- \[47\]A\. Q\. Nichol and P\. Dhariwal\(2021\)Improved denoising diffusion probabilistic models\.InProceedings of the 38th International Conference on Machine Learning,pp\. 8162–8171\.Cited by:[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px1.p1.8),[§D\.2](https://arxiv.org/html/2606.15359#A4.SS2.p1.8),[Remark 1](https://arxiv.org/html/2606.15359#Thmremark1.p1.2.2)\.
- \[48\]K\. Pandey, F\. M\. Sofian, F\. Draxler, T\. Karaletsos, and S\. Mandt\(2025\)Variational control for guidance in diffusion models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=Z0ffRRtOim)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1)\.
- \[49\]A\. Paszke, S\. Gross, S\. Chintala, G\. Chanan, E\. Yang, Z\. DeVito, Z\. Lin, A\. Desmaison, L\. Antiga, and A\. Lerer\(2017\)Automatic differentiation in pytorch\.InNIPS\-W,Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1)\.
- \[50\]T\. Power, R\. Soltani\-Zarrin, S\. Iba, and D\. Berenson\(2023\)Sampling constrained trajectories using composable diffusion models\.InIROS 2023 Workshop on Differentiable Probabilistic Robotics: Emerging Perspectives on Robot Learning,External Links:[Link](https://openreview.net/forum?id=UAylEpIMNE)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1)\.
- \[51\]N\. Ratliff, M\. Zucker, J\. A\. Bagnell, and S\. Srinivasa\(2009\)CHOMP: gradient optimization techniques for efficient motion planning\.In2009 IEEE International Conference on Robotics and Automation,Vol\.,pp\. 489–494\.External Links:[Document](https://dx.doi.org/10.1109/ROBOT.2009.5152817)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[52\]J\. B\. Rawlings, D\. Q\. Mayne, and M\. M\. Diehl\(2017\)Model predictive control: theory, computation, and design\.Nob Hill Publishing\.Cited by:[Appendix C](https://arxiv.org/html/2606.15359#A3.p8.3),[§1](https://arxiv.org/html/2606.15359#S1.p4.1)\.
- \[53\]A\. Z\. Ren, J\. Lidard, L\. L\. Ankile, A\. Simeonov, P\. Agrawal, A\. Majumdar, B\. Burchfiel, H\. Dai, and M\. Simchowitz\(2024\)Diffusion policy policy optimization\.arXiv preprint arXiv:2409\.00588\.Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px3.p1.1)\.
- \[54\]H\. Robbins\(1992\)An empirical bayes approach to statistics\.Breakthroughs in Statistics,pp\. 388–394\.Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[55\]R\. Römer, A\. v\. Rohr, and A\. Schoellig\(2025\)Diffusion predictive control with constraints\.InProceedings of the 7th Annual Learning for Dynamics & Control Conference,pp\. 791–803\.Cited by:[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.p1.3)\.
- \[56\]L\. Rout, Y\. Chen, N\. Ruiz, A\. Kumar, C\. Caramanis, S\. Shakkottai, and W\. Chu\(2025\)RB\-modulation: training\-free stylization using reference\-based modulation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=bnINPG5A32)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1)\.
- \[57\]J\. Schulman, Y\. Duan, J\. Ho, A\. Lee, I\. Awwal, H\. Bradlow, J\. Pan, S\. Patil, K\. Goldberg, and P\. Abbeel\(2014\)Motion planning with sequential convex optimization and convex collision checking\.The International Journal of Robotics Research33\(9\),pp\. 1251–1270\.Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[58\]Y\. Shaoul, I\. Mishani, S\. Vats, J\. Li, and M\. Likhachev\(2025\)Multi\-robot motion planning with diffusion models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=AUCYptvAf3)Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px1.p1.2),[§D\.3](https://arxiv.org/html/2606.15359#A4.SS3.SSS0.Px1.p3.9),[Figure 3](https://arxiv.org/html/2606.15359#S5.F3),[Figure 3](https://arxiv.org/html/2606.15359#S5.F3.2.1),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.SSS0.Px1.p1.1),[§5\.3](https://arxiv.org/html/2606.15359#S5.SS3.p1.1),[§5](https://arxiv.org/html/2606.15359#S5.p1.1)\.
- \[59\]J\. Sirignano and K\. Spiliopoulos\(2018\)DGM: a deep learning algorithm for solving partial differential equations\.Journal of Computational Physics375,pp\. 1339–1364\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px2.p3.1)\.
- \[60\]J\. Sohl\-Dickstein, E\. A\. Weiss, N\. Maheswaranathan, and S\. Ganguli\(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InProceedings of the 32nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.37,pp\. 2256–2265\.Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[61\]J\. Song, C\. Meng, and S\. Ermon\(2021\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=St1giarCHLP)Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[62\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1),[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px1.p1.12),[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px2.p1.3)\.
- \[63\]R\. S\. Sutton and A\. G\. Barto\(2018\)Reinforcement learning: an introduction\.2 edition,MIT Press\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px2.p3.1)\.
- \[64\]E\. A\. Theodorou and E\. Todorov\(2012\)Relative entropy and free energy dualities: connections to path integral and KL control\.InProceedings of the IEEE Conference on Decision and Control \(CDC\),pp\. 1466–1473\.Cited by:[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px3.1.p1.2)\.
- \[65\]F\. Vargas, W\. S\. Grathwohl, and A\. Doucet\(2023\)Denoising diffusion samplers\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8pvnfTAbu1f)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1)\.
- \[66\]F\. Vargas, S\. Padhy, D\. Blessing, and N\. Nüsken\(2024\)Transport meets variational inference: controlled monte carlo diffusions\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PP1rudnxiW)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px2.p1.1)\.
- \[67\]P\. Vincent\(2011\)A connection between score matching and denoising autoencoders\.Neural Computation23\(7\),pp\. 1661–1674\.Cited by:[§3](https://arxiv.org/html/2606.15359#S3.SS0.SSS0.Px1.p1.12)\.
- \[68\]A\. Wächter and L\. T\. Biegler\(2006\)On the implementation of an interior\-point filter line\-search algorithm for large\-scale nonlinear programming\.Mathematical Programming106\(1\),pp\. 25–57\.External Links:[Document](https://dx.doi.org/10.1007/s10107-004-0559-y)Cited by:[Appendix D](https://arxiv.org/html/2606.15359#A4.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1)\.
- \[69\]W\. Xiao, T\. Wang, C\. Gan, R\. Hasani, M\. Lechner, and D\. Rus\(2025\)SafeDiffuser: safe planning with diffusion probabilistic models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ig2wk7kK9J)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1),[5th item](https://arxiv.org/html/2606.15359#A4.I1.i5.p1.1),[§D\.1](https://arxiv.org/html/2606.15359#A4.SS1.SSS0.Px2.p1.6),[§E\.1](https://arxiv.org/html/2606.15359#A5.SS1.p1.3),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.146.144.144.144.144.144.144.144.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.154.152.152.152.152.152.152.152.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.162.160.160.160.160.160.160.160.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.66.64.64.64.64.64.64.64.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.74.72.72.72.72.72.72.72.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.82.80.80.80.80.80.80.80.9),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.29.27.27.27.27.27.27.27.5),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.33.31.31.31.31.31.31.31.5),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.37.35.35.35.35.35.35.35.5),[§1](https://arxiv.org/html/2606.15359#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.21.19.19.19.19.19.19.19.4),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.39.37.37.37.37.37.37.37.4),[Table 2](https://arxiv.org/html/2606.15359#S5.T2.16.14.14.14.14.14.14.14.3)\.
- \[70\]J\. Yu\(2016\)Intractability of optimal multi\-robot path planning on planar graphs\.IEEE Robotics and Automation Letters1\(1\),pp\. 33–40\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2015.2503143)Cited by:[§1](https://arxiv.org/html/2606.15359#S1.p1.1)\.
- \[71\]J\. Zhang, L\. Zhao, A\. Papachristodoulou, and J\. Umenberger\(2025\)Constrained diffusers for safe planning and control\.External Links:2506\.12544,[Link](https://arxiv.org/abs/2506.12544)Cited by:[Appendix A](https://arxiv.org/html/2606.15359#A1.SS0.SSS0.Px1.p1.1),[Appendix B](https://arxiv.org/html/2606.15359#A2.SS0.SSS0.Px4.p1.1),[4th item](https://arxiv.org/html/2606.15359#A4.I1.i4.p1.2),[§E\.1](https://arxiv.org/html/2606.15359#A5.SS1.p1.3),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.130.128.128.128.128.128.128.128.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.138.136.136.136.136.136.136.136.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.50.48.48.48.48.48.48.48.9),[Table 5](https://arxiv.org/html/2606.15359#A5.T5.58.56.56.56.56.56.56.56.9),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.25.23.23.23.23.23.23.23.5),[Table 6](https://arxiv.org/html/2606.15359#A5.T6.41.39.39.39.39.39.39.40.1.1),[§1](https://arxiv.org/html/2606.15359#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.15359#S5.SS1.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.18.16.16.16.16.16.16.16.4),[Table 1](https://arxiv.org/html/2606.15359#S5.T1.36.34.34.34.34.34.34.34.4),[Table 2](https://arxiv.org/html/2606.15359#S5.T2.14.12.12.12.12.12.12.12.3)\.
Appendix
## Appendix ARelated Work
##### Safe diffusion\-based planning\.
Existing approaches to diffusion\-based planning with test\-time constraint enforcement can largely be understood through the lens of projection, differing primarily in*when*projection is applied and how strongly it restricts the denoising process\. A simple strategy is*post\-sampling*projection, where an unconstrained diffusion model first generates a trajectory and feasibility is enforced only afterward\. This direction was explored byPoweret al\.\[[50](https://arxiv.org/html/2606.15359#bib.bib27)\]\. While conceptually straightforward, post\-sampling projection can move samples substantially away from the pretrained diffusion distribution, as observed byChristopheret al\.\[[14](https://arxiv.org/html/2606.15359#bib.bib28)\]\. Most subsequent methods instead perform projection*during*sampling\. For example,Christopheret al\.\[[14](https://arxiv.org/html/2606.15359#bib.bib28)\]formulate constrained sampling as a maximum\-likelihood problem in which all intermediate latents are required to lie in the feasible set, and sample using projected stochastic Langevin dynamics\. Building on this per\-step projection perspective,Luanet al\.\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]extend projected diffusion models to multi\-robot motion planning with coupled trajectory generation and additional soft costs, whileLianget al\.\[[39](https://arxiv.org/html/2606.15359#bib.bib60)\]introduce a customized Lagrangian\-based projector that relaxes nonconvex collision constraints\. Recognizing that enforcing constraints throughout the entire denoising process can be overly restrictive, more recent methods relax early\-stage sampling and tighten constraint enforcement near the end of generation\.Zhanget al\.\[[71](https://arxiv.org/html/2606.15359#bib.bib30)\]propose a Lagrangian formulation for safe planning, where constraint strictness is progressively increased through primal\-dual or augmented\-Lagrangian updates\. In a related direction, SafeDiffuser\[[69](https://arxiv.org/html/2606.15359#bib.bib31)\]views denoising as a controlled dynamical system and incorporates control barrier functions \(CBFs\) into the sampling process\. This allows constraint violations during early denoising while enforcing terminal feasibility through the CBF condition\. However, balancing early\-stage relaxation with reliable terminal feasibility can require careful parameter tuning; moreover, even relaxed variants still impose restrictions along the denoising trajectory\. Overall, most existing methods share a common feature: constraint satisfaction is enforced through projections or constraint\-driven updates along the denoising path\. In contrast, our method adopts a stochastic optimal control perspective that imposes hard constraints only on the terminal clean sample, rather than requiring feasibility of noisy intermediate iterates\. It instead steers the sampling process toward feasibility of the final trajectory, reducing unnecessary restrictions on the learned diffusion dynamics\. Moreover, beyond hard constraints, practical planning problems often involve soft costs or rewards\. While projection\-based approaches can incorporate such objectives in specific implementations, they do not naturally provide a unified framework for jointly handling hard feasibility and soft optimality\. Our stochastic optimal control formulation addresses both within a single inference\-time procedure, enabling efficient and flexible constrained sampling\.
##### Diffusion models and stochastic optimal control\.
A growing body of work has connected diffusion sampling with stochastic optimal control \(SOC\), viewing the reverse diffusion process as a controlled stochastic dynamics whose drift can be modified while penalizing deviations from a reference process\. One line of work uses this perspective for sampling from unnormalized target densities, where a reference diffusion process is controlled to match a desired distribution; representative examples include path\-integral sampling\[[31](https://arxiv.org/html/2606.15359#bib.bib35)\], denoising diffusion samplers\[[65](https://arxiv.org/html/2606.15359#bib.bib47)\], and controlled Monte Carlo diffusions\[[66](https://arxiv.org/html/2606.15359#bib.bib48)\]\. Another line of work applies SOC to reward optimization in generative models, either through training\-based reward fine\-tuning, such as Adjoint Matching\[[18](https://arxiv.org/html/2606.15359#bib.bib36)\], or through training\-free guidance toward soft objectives, such as RB\-Modulation\[[56](https://arxiv.org/html/2606.15359#bib.bib43)\]and variational diffusion guidance\[[48](https://arxiv.org/html/2606.15359#bib.bib44)\]\. These methods demonstrate the usefulness of the SOC viewpoint for controlling diffusion samplers, but they primarily target distributional sampling or soft reward optimization\. They do not directly address hard\-constrained planning, where the final trajectory must satisfy feasibility constraints while remaining faithful to the learned diffusion dynamics\. Extending SOC\-based diffusion control to this setting is nontrivial, and existing reward\-guidance methods do not yield an efficient constrained sampler\. In contrast, our work develops a tractable surrogate tailored to terminally constrained sampling for diffusion\-based planning, enabling efficient inference\-time enforcement of hard constraints while also accommodating additional soft costs or rewards\.
##### Training\-based diffusion planners\.
A parallel and complementary line of work adapts diffusion\-based planners through additional training or fine\-tuning\. For example, AdaptDiffuser\[[40](https://arxiv.org/html/2606.15359#bib.bib37)\]improves diffusion planners via self\-evolution, using reward\-guided synthetic trajectories to update the model\. SODP\[[19](https://arxiv.org/html/2606.15359#bib.bib38)\]pretrains a task\-agnostic diffusion planner on diverse suboptimal data and then fine\-tunes it with task\-specific rewards\. DPPO\[[53](https://arxiv.org/html/2606.15359#bib.bib39)\]further adapts diffusion policies for continuous control and robot learning using policy\-gradient reinforcement learning\. These methods improve planning performance by modifying model parameters, whereas our work enforces hard constraints at inference time without retraining, making it orthogonal to training\-based adaptation\.
## Appendix BStochastic optimal control and constrained sampling
We expand on Section[4](https://arxiv.org/html/2606.15359#S4)and present the connection between stochastic optimal control and safe planning with*test\-time*constraints\. We recall central ideas from optimal control literature\[[21](https://arxiv.org/html/2606.15359#bib.bib51)\]and motivate grounding constrained planning in this line of work\.
##### Stochastic Optimal Control\.
For simplicity and coherence with the literature, we consider a*forward\-time*Itô process between0≤t≤10\\leq t\\leq 1, transporting probability mass from prior distributionp0p\_\{0\}towardpbasep\_\{base\}\. The diffusion\-centric formulation in \([12](https://arxiv.org/html/2606.15359#A2.E12)\) is equivalent up to time reversal\. We parametrize the SDE by a given reference driftf:ℝd×\[0,1\]→ℝdf:\\mathbb\{R\}^\{d\}\\times\[0,1\]\\rightarrow\\mathbb\{R\}^\{d\}and diffusion coefficientg:\[0,1\]→ℝg:\[0,1\]\\rightarrow\\mathbb\{R\}:
dXt=ft\(Xt\)dt\+g\(t\)dWt,X0∼p0,t∈\[0,1\]dX\_\{t\}=f\_\{t\}\(X\_\{t\}\)dt\+g\(t\)dW\_\{t\},\\quad X\_\{0\}\\sim p\_\{0\},\\quad t\\in\[0,1\]\(11\)where\(Wt\)t≥0\\left\(W\_\{t\}\\right\)\_\{t\\geq 0\}denotes a standard Wiener process\. To control generation, we introduce an additive control driftu:ℝd×\[0,1\]→𝒰u:\\mathbb\{R\}^\{d\}\\times\[0,1\]\\rightarrow\\mathcal\{U\}producing actions in the admissible set of control inputs𝒰\\mathcal\{U\}\. As optimization objectives, letC:ℝd→ℝC:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}denote the terminal cost, andℓ:ℝd×𝒰×\[0,1\]→ℝ\\ell:\\mathbb\{R\}^\{d\}\\times\\mathcal\{U\}\\times\[0,1\]\\rightarrow\\mathbb\{R\}be the running cost\. Stochastic optimal control formulates the minimization problem of the expected cost functionalJJover the path measureℙu\\mathbb\{P\}^\{u\}induced by the controlled dynamics:
minu∈𝒜J\(u\)=\\displaystyle\\min\_\{u\\in\\mathcal\{A\}\}\\quad J\(u\)=𝔼\[C\(X1u\)\+∫01ℓ\(Xtu,ut\(Xtu\),t\)𝑑t\]\\displaystyle\\mathbb\{E\}\\left\[C\(X^\{u\}\_\{1\}\)\+\\int\_\{0\}^\{1\}\\ell\\left\(X^\{u\}\_\{t\},u\_\{t\}\(X^\{u\}\_\{t\}\),t\\right\)\\,dt\\right\]\(12\)s\.t\.dXtu=\[ft\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dWt,\\displaystyle\\text\{s\.t\.\}\\quad dX^\{u\}\_\{t\}=\\left\[f\_\{t\}\(X^\{u\}\_\{t\}\)\+g\(t\)u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\]dt\+g\(t\)dW\_\{t\},X0u∼p0,t∈\[0,1\]\.\\displaystyle\\qquad X^\{u\}\_\{0\}\\sim p\_\{0\},\\quad t\\in\[0,1\]\.
where𝒜\\mathcal\{A\}is the admissible class of progressively measurable controls satisfyingut\(Xtu\)∈𝒰u\_\{t\}\(X\_\{t\}^\{u\}\)\\in\\mathcal\{U\}and𝔼Xu∼ℙu\[∫01‖ut\(Xtu\)‖2𝑑t\]<∞\\mathbb\{E\}\_\{X^\{u\}\\sim\\mathbb\{P\}^\{u\}\}\\left\[\\int\_\{0\}^\{1\}\\\|u\_\{t\}\(X\_\{t\}^\{u\}\)\\\|^\{2\}dt\\right\]<\\infty\. Under standard coercivity, convexity, and continuity assumptions on the data that ensure existence of optimal controls for controlled Itô diffusions\[[26](https://arxiv.org/html/2606.15359#bib.bib52)\], we denote byu⋆∈argminu∈𝒜J\(u\)u^\{\\star\}\\in\\arg\\min\_\{u\\in\\mathcal\{A\}\}J\(u\)an optimal solution of \([12](https://arxiv.org/html/2606.15359#A2.E12)\)\.
##### Solving the SOC and the Hamilton–Jacobi–Bellman equation\.
Solving \([12](https://arxiv.org/html/2606.15359#A2.E12)\)*as\-is*is generally intractable due to the stochastic dynamics and the expectation over all realizable paths\. We present a classical dynamic programming formulation that transforms the functional minimization problem into a partial differential equation \(PDE\)\. First, for a given admissible control, define the*cost\-to\-go*J:𝒜×ℝd×\[0,1\]→ℝJ:\\mathcal\{A\}\\times\\mathbb\{R\}^\{d\}\\times\[0,1\]\\to\\mathbb\{R\}by
J\(u;x,t\)=𝔼ℙt,xu\[∫t1ℓ\(Xsu,us\(Xsu\),s\)𝑑s\+C\(X1u\)\],J\(u;x,t\)=\\mathbb\{E\}\_\{\\mathbb\{P\}^\{u\}\_\{t,x\}\}\\left\[\\int\_\{t\}^\{1\}\\ell\\left\(X\_\{s\}^\{u\},u\_\{s\}\(X\_\{s\}^\{u\}\),s\\right\)\\,ds\+C\(X\_\{1\}^\{u\}\)\\right\],\(13\)whereℙt,xu\\mathbb\{P\}^\{u\}\_\{t,x\}denotes the path measure induced by the controlled dynamics initialized atXtu=xX\_\{t\}^\{u\}=x\. The cost\-to\-go coincides with the expected future cost incurred by applyinguufrom statexxat timett\. We then define the*value function*V:ℝd×\[0,1\]→ℝV:\\mathbb\{R\}^\{d\}\\times\[0,1\]\\to\\mathbb\{R\}as the infimum of this cost over all admissible controls:
V\(x,t\)=infu∈𝒜J\(u;x,t\)\.V\(x,t\)=\\inf\_\{u\\in\\mathcal\{A\}\}J\(u;x,t\)\.\(14\)
The SOC and value formulations are related as taking expectations over the initial distribution yields:
J\(u\)=𝔼X0∼p0\[J\(u;X0,0\)\],J\(u\)=\\mathbb\{E\}\_\{X\_\{0\}\\sim p\_\{0\}\}\\left\[J\(u;X\_\{0\},0\)\\right\],\\\\\(15\)J\(u⋆\)=𝔼X0∼p0\[V\(X0,0\)\]\.J\(u^\{\\star\}\)=\\mathbb\{E\}\_\{X\_\{0\}\\sim p\_\{0\}\}\\left\[V\(X\_\{0\},0\)\\right\]\.\(16\)A solution to the stochastic optimal control problem can be achieved by computing the value function, which, under regularity conditions, satisfies the Hamilton–Jacobi–Bellman \(HJB\)\[[5](https://arxiv.org/html/2606.15359#bib.bib53)\]partial differential equation \(Theorem[1](https://arxiv.org/html/2606.15359#Thmtheorem1)\)\.
###### Theorem 1\(Hamilton–Jacobi–Bellman equation\[[21](https://arxiv.org/html/2606.15359#bib.bib51)\]\)\.
Assume the standard conditions for the dynamic programming principle hold, and suppose that the value function satisfiesV∈C1,2\(ℝd×\[0,1\]\)V\\in C^\{1,2\}\(\\mathbb\{R\}^\{d\}\\times\[0,1\]\)\. Define the stochastic Hamiltonian
ℋ\(x,p,M,u,t\):=ℓ\(x,u,t\)\+\(ft\(x\)\+g\(t\)u\)⊤p\+12g\(t\)2tr\(M\)\.\\mathcal\{H\}\(x,p,M,u,t\):=\\ell\(x,u,t\)\+\\left\(f\_\{t\}\(x\)\+g\(t\)u\\right\)^\{\\top\}p\+\\frac\{1\}\{2\}g\(t\)^\{2\}\\operatorname\{tr\}\(M\)\.\(17\)ThenVVsolves the Hamilton–Jacobi–Bellman equation
−∂tV\(x,t\)=minu∈𝒰ℋ\(x,∇xV\(x,t\),∇x2V\(x,t\),u,t\),V\(x,1\)=C\(x\)\.\-\\partial\_\{t\}V\(x,t\)=\\min\_\{u\\in\\mathcal\{U\}\}\\mathcal\{H\}\\left\(x,\\nabla\_\{x\}V\(x,t\),\\nabla\_\{x\}^\{2\}V\(x,t\),u,t\\right\),\\quad V\(x,1\)=C\(x\)\.\(18\)Moreover, whenever the minimum is attained, an optimal feedback control satisfies
u⋆\(x,t\)∈argminu∈𝒰ℋ\(x,∇xV\(x,t\),∇x2V\(x,t\),u,t\)\.u^\{\\star\}\(x,t\)\\in\\arg\\min\_\{u\\in\\mathcal\{U\}\}\\mathcal\{H\}\\left\(x,\\nabla\_\{x\}V\(x,t\),\\nabla\_\{x\}^\{2\}V\(x,t\),u,t\\right\)\.\(19\)
###### Proof\.
The result follows from Bellman’s dynamic programming principle\[[5](https://arxiv.org/html/2606.15359#bib.bib53)\]\. Applying the principle over\[t,t\+Δt\]\[t,t\+\\Delta t\], expanding the value function with Itô’s formula, taking expectations, and sendingΔt→0\\Delta t\\to 0leads to the differential HJB formulation\. The terminal condition follows from the definition of the cost\-to\-go\. SeeFleming and Rishel \[[21](https://arxiv.org/html/2606.15359#bib.bib51)\]\. ∎
In practice, solving the HJB PDE exactly is rarely tractable beyond low\-dimensional systems, because grid\-based dynamic programming suffers from the curse of dimensionality; since our focus is on*training\-free*guidance rather than learning value functions, we refer the reader to training\-based approaches to HJB equations and stochastic optimal control, including approximate dynamic programming, reinforcement learning, and neural PDE methods\[[7](https://arxiv.org/html/2606.15359#bib.bib55),[63](https://arxiv.org/html/2606.15359#bib.bib56),[24](https://arxiv.org/html/2606.15359#bib.bib57),[59](https://arxiv.org/html/2606.15359#bib.bib58)\]\.
##### Relation to KL control\.
We now specialize the stochastic optimal control formulation by choosing the quadratic running cost
ℓ\(x,u,t\)=12‖u‖2\.\\ell\(x,u,t\)=\\frac\{1\}\{2\}\\left\\lVert u\\right\\rVert^\{2\}\.\(20\)This choice gives the control objective a path\-space relative\-entropy interpretation, connecting stochastic optimal control to Schrödinger Bridge problems, path\-integral control, and KL\-control as outlined in the following theorem:
###### Theorem 2\(Path\-space KL representation\)\.
Let\{Xt0\}t∈\[0,1\]\\\{X\_\{t\}^\{0\}\\\}\_\{t\\in\[0,1\]\}be the reference stochastic interpolant
dXt0=ft\(Xt0\)dt\+g\(t\)dWt,X00∼p0,dX\_\{t\}^\{0\}=f\_\{t\}\(X\_\{t\}^\{0\}\)\\,dt\+g\(t\)dW\_\{t\},\\qquad X\_\{0\}^\{0\}\\sim p\_\{0\},\(21\)with induced path measureℙ0\\mathbb\{P\}^\{0\}\. Letℙu\\mathbb\{P\}^\{u\}be the path measure induced by the controlled stochastic interpolant
dXtu=\[ft\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dWt,X0u∼p0\.dX\_\{t\}^\{u\}=\\left\[f\_\{t\}\(X\_\{t\}^\{u\}\)\+g\(t\)u\_\{t\}\(X\_\{t\}^\{u\}\)\\right\]dt\+g\(t\)dW\_\{t\},\\qquad X\_\{0\}^\{u\}\\sim p\_\{0\}\.\(22\)Then, for progressively measurable controls satisfying the standard Girsanov integrability condition, the SOC objective withℓ\(x,u,t\)=12‖u‖2\\ell\(x,u,t\)=\\frac\{1\}\{2\}\\left\\lVert u\\right\\rVert^\{2\}is equivalent to the KL\-regularized path\-space objective
infu∈𝒜\{KL\(ℙu∥ℙ0\)\+𝔼ℙu\[C\(X1u\)\]\}\.\\inf\_\{u\\in\\mathcal\{A\}\}\\left\\\{\\mathrm\{KL\}\\left\(\\mathbb\{P\}^\{u\}\\,\\\|\\,\\mathbb\{P\}^\{0\}\\right\)\+\\mathbb\{E\}\_\{\\mathbb\{P\}^\{u\}\}\\left\[C\(X\_\{1\}^\{u\}\)\\right\]\\right\\\}\.\(23\)
###### Proof\.
By Girsanov’s theorem, the controlled drift perturbationg\(t\)utg\(t\)u\_\{t\}induces an absolutely continuous change of path measure whose Radon–Nikodym derivative yields
KL\(ℙu∥ℙ0\)=𝔼ℙu\[∫0112‖ut\(Xtu\)‖2𝑑t\]\.\\mathrm\{KL\}\\left\(\\mathbb\{P\}^\{u\}\\,\\\|\\,\\mathbb\{P\}^\{0\}\\right\)=\\mathbb\{E\}\_\{\\mathbb\{P\}^\{u\}\}\\left\[\\int\_\{0\}^\{1\}\\frac\{1\}\{2\}\\left\\lVert u\_\{t\}\(X\_\{t\}^\{u\}\)\\right\\rVert^\{2\}dt\\right\]\.Substituting this identity into the SOC objective withℓ\(x,u,t\)=12‖u‖2\\ell\(x,u,t\)=\\frac\{1\}\{2\}\\left\\lVert u\\right\\rVert^\{2\}yields \([23](https://arxiv.org/html/2606.15359#A2.E23)\)\. SeeTheodorou and Todorov \[[64](https://arxiv.org/html/2606.15359#bib.bib59)\]for the corresponding relative\-entropy derivation in path\-integral and KL\-control stochastic optimal control\. ∎
Theorem[2](https://arxiv.org/html/2606.15359#Thmtheorem2)shows that the quadratic\-control SOC objective selects a guided sampler that reduces terminal cost while remaining close to the reference stochastic interpolant in path\-space KL\. This principle of remaining maximally close to the reference sampler, and hence avoiding unnecessary over\-constraint of the latent dynamics, offers a useful lens for safe planning with pretrained diffusion models\. Although the full stochastic optimal control problem is generally intractable, we emphasize that this distributional viewpoint provides a principled idealization from which reliable and efficient training\-free surrogate approximations may be derived\.
##### Constrained sampling\.
As noted in prior work on constrained sampling\[[71](https://arxiv.org/html/2606.15359#bib.bib30),[12](https://arxiv.org/html/2606.15359#bib.bib34)\], the stochastic optimal control formulation naturally generalizes to planning with*hard*constraints\. In particular, let𝒮⊆ℝd\\mathcal\{S\}\\subseteq\\mathbb\{R\}^\{d\}denote a feasible set and define the extended\-valued indicator
ι𝒮\(x\)=\{0,x∈𝒮,\+∞,x∉𝒮\.\\iota\_\{\\mathcal\{S\}\}\(x\)=\\begin\{cases\}0,&x\\in\\mathcal\{S\},\\\\ \+\\infty,&x\\notin\\mathcal\{S\}\.\\end\{cases\}\(24\)Then hard\-constrained sampling can be written by choosing the terminal costC′\(x\)=C\(x\)\+ι𝒮\(x\)C^\{\\prime\}\(x\)=C\(x\)\+\\iota\_\{\\mathcal\{S\}\}\(x\), yielding the constrained SOC problem
minu∈𝒜\\displaystyle\\min\_\{u\\in\\mathcal\{A\}\}𝔼\[∫0112‖ut\(Xtu\)‖2𝑑t\+C′\(X1u\)\]\\displaystyle\\mathbb\{E\}\\left\[\\int\_\{0\}^\{1\}\\frac\{1\}\{2\}\\left\\lVert u\_\{t\}\(X\_\{t\}^\{u\}\)\\right\\rVert^\{2\}\\,dt\+C^\{\\prime\}\(X\_\{1\}^\{u\}\)\\right\]\(25\)s\.t\.dXtu=\[ft\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dWt,X0u∼p0\.\\displaystyle dX\_\{t\}^\{u\}=\\left\[f\_\{t\}\(X\_\{t\}^\{u\}\)\+g\(t\)u\_\{t\}\(X\_\{t\}^\{u\}\)\\right\]dt\+g\(t\)dW\_\{t\},\\qquad X\_\{0\}^\{u\}\\sim p\_\{0\}\.
Now letC′\(x\)=C\(x\)\+ι𝒮\(x\)C^\{\\prime\}\(x\)=C\(x\)\+\\iota\_\{\\mathcal\{S\}\}\(x\)\. Since
𝔼\[C′\(X1\)\]=𝔼\[C\(X1\)\]\+𝔼\[ι𝒮\(X1\)\],\\mathbb\{E\}\[C^\{\\prime\}\(X\_\{1\}\)\]=\\mathbb\{E\}\[C\(X\_\{1\}\)\]\+\\mathbb\{E\}\[\\iota\_\{\\mathcal\{S\}\}\(X\_\{1\}\)\],any finite\-cost solution must satisfyX1∈𝒮X\_\{1\}\\in\\mathcal\{S\}almost surely\. Therefore, the extended\-valued terminal cost is equivalently represented as an explicit terminal constraint, yielding
minu∈𝒜\\displaystyle\\min\_\{u\\in\\mathcal\{A\}\}𝔼ℙu\[C\(X1\)\+∫0112‖ut\(Xt\)‖2𝑑t\]\\displaystyle\\mathbb\{E\}\_\{\\mathbb\{P\}^\{u\}\}\\left\[C\(X\_\{1\}\)\+\\int\_\{0\}^\{1\}\\frac\{1\}\{2\}\\left\\lVert u\_\{t\}\(X\_\{t\}\)\\right\\rVert^\{2\}\\,dt\\right\]\(26\)s\.t\.\\displaystyle\\mathrm\{s\.t\.\}dXtu=\[ft\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dWt,X0u∼p0\\displaystyle dX\_\{t\}^\{u\}=\\left\[f\_\{t\}\(X\_\{t\}^\{u\}\)\+g\(t\)u\_\{t\}\(X\_\{t\}^\{u\}\)\\right\]dt\+g\(t\)dW\_\{t\},\\qquad X\_\{0\}^\{u\}\\sim p\_\{0\}X1∈𝒮ℙu\-a\.s\.\\displaystyle X\_\{1\}\\in\\mathcal\{S\}\\qquad\\mathbb\{P\}^\{u\}\\text\{\-a\.s\.\}
## Appendix CAlgorithm derivation
In this section, we provide the full derivation of the algorithm, highlighting all the approximations employed to reach a computationally tractable formulation\. For a self\-contained derivation, we start by repeating the terminally constrained stochastic optimal control problem\.
###### Problem 1\(Continuous\-time stochastic optimal control problem with terminal constraints\)\.
Given reference dynamics, constraint set𝒮\\mathcal\{S\}, terminal costCC, and cost weightλ\\lambda, solve for the optimal control driftu⋆u^\{\\star\}:
minu\\displaystyle\\min\_\{u\}𝔼\[λC\(X0u\)\+12∫01‖ut\(Xtu\)‖22𝑑t\]\\displaystyle\\mathbb\{E\}\\left\[\\lambda\\,C\(X^\{u\}\_\{0\}\)\+\\frac\{1\}\{2\}\\int\_\{0\}^\{1\}\\left\\lVert u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\\rVert\_\{2\}^\{2\}\\,dt\\right\]\(27\)s\.t\.dXtu=\[f~tθ\(Xtu\)\+g\(t\)ut\(Xtu\)\]dt\+g\(t\)dW¯t,X1∼p1\\displaystyle dX^\{u\}\_\{t\}=\\left\[\\tilde\{f\}\_\{t\}^\{\\theta\}\(X^\{u\}\_\{t\}\)\+g\(t\)u\_\{t\}\(X^\{u\}\_\{t\}\)\\right\]dt\+g\(t\)d\\bar\{W\}\_\{t\},\\quad X\_\{1\}\\sim p\_\{1\}\\X0u∈𝒮ℙu\-a\.s\.\\displaystyle X^\{u\}\_\{0\}\\in\\mathcal\{S\}\\qquad\\mathbb\{P\}^\{u\}\\text\{\-a\.s\.\}
where the reverse drift is given by the learned score functionft~θ\(Xt\)=ft\(Xt\)−g\(t\)2stθ\(Xt\)\\tilde\{f\_\{t\}\}^\{\\theta\}\(X\_\{t\}\)=f\_\{t\}\(X\_\{t\}\)\-g\(t\)^\{2\}s^\{\\theta\}\_\{t\}\(X\_\{t\}\)\.
For the numerical integration of the reverse\-time dynamics, letN∈ℕN\\in\\mathbb\{N\}denote the number of discretization steps, and let0=t0<t1<…<tN−1<tN=10=t\_\{0\}<t\_\{1\}<\.\.\.<t\_\{N\-1\}<t\_\{N\}=1be the time grid used for sampling from the diffusion model\. Since diffusion sampling proceeds backward in time, we define the uncontrolled reverse update fromtit\_\{i\}toti−1t\_\{i\-1\}by
Xi−1=Φiθ\(Xi,εi\),i=1,…,N,X\_\{i\-1\}=\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\},\\varepsilon\_\{i\}\),\\qquad i=1,\\dots,N,whereΦiθ:ℝd×ℝd→ℝd\\Phi\_\{i\}^\{\\theta\}:\\mathbb\{R\}^\{d\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}is theii\-th denoising transition map, andεi∼𝒩\(0,Id\)\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\)are independent acrossii\. The mapsΦiθ\\Phi\_\{i\}^\{\\theta\}are left abstract in order to accommodate different numerical sampling schemes\. Moreover, as is commonly done to reduce discretization\-induced noise in the final sample, we assume that the last denoising step is deterministic and returns the model’s data prediction without injecting additional noise\. That is,
X0=Φ1θ\(X1\)=x^0θ\(X1,t1\)\.X\_\{0\}=\\Phi^\{\\theta\}\_\{1\}\(X\_\{1\}\)=\\hat\{x\}^\{\\theta\}\_\{0\}\(X\_\{1\},t\_\{1\}\)\.
For the SOC surrogate, we approximate the controlled dynamics by applying an Euler discretization to the control input over each interval\[ti−1,ti\]\[t\_\{i\-1\},t\_\{i\}\]\. This yields
Xi−1u=Φiθ\(Xiu,εi\)\+giui\(Xiu\)Δti,i=1,…,N,X^\{u\}\_\{i\-1\}=\\Phi^\{\\theta\}\_\{i\}\(X^\{u\}\_\{i\},\\varepsilon\_\{i\}\)\+g\_\{i\}u\_\{i\}\(X^\{u\}\_\{i\}\)\\Delta t\_\{i\},\\qquad i=1,\\dots,N,\(28\)wheregi=g\(ti\)g\_\{i\}=g\(t\_\{i\}\),Δti=ti−ti−1\\Delta t\_\{i\}=t\_\{i\}\-t\_\{i\-1\}, andεi∼𝒩\(0,Id\)\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\)\. After discretizing the control energy, we obtain the discrete\-time formulation of the terminally constrained stochastic optimal control problem, stated in Problem[2](https://arxiv.org/html/2606.15359#Thmproblem2)\.
###### Problem 2\(Discrete\-time stochastic optimal control problem with terminal constraints\)\.
Given reference driftf~tθ\\tilde\{f\}^\{\\theta\}\_\{t\}, constraint set𝒮\\mathcal\{S\}, and cost weightλ\\lambda, solve for the control functions\{ui⋆\}i=1N\\\{u\_\{i\}^\{\\star\}\\\}\_\{i=1\}^\{N\}:
min\{ui\}i=1N\\displaystyle\\min\_\{\\\{u\_\{i\}\\\}^\{N\}\_\{i=1\}\}𝔼\[λC\(X0u\)\+12∑i=1N‖ui\(Xiu\)‖22Δti\]\\displaystyle\\mathbb\{E\}\\left\[\\lambda\\,C\(X^\{u\}\_\{0\}\)\+\\frac\{1\}\{2\}\\sum\_\{i=1\}^\{N\}\\left\\lVert u\_\{i\}\(X^\{u\}\_\{i\}\)\\right\\rVert\_\{2\}^\{2\}\\Delta t\_\{i\}\\right\]\(29\)s\.t\.Xi−1u=Φiθ\(Xiu,εi\)\+giui\(Xiu\)Δti,εi∼𝒩\(0,Id\),∀i,1≤i≤N,\\displaystyle X^\{u\}\_\{i\-1\}=\\Phi^\{\\theta\}\_\{i\}\(X^\{u\}\_\{i\},\\varepsilon\_\{i\}\)\+g\_\{i\}\\,u\_\{i\}\(X^\{u\}\_\{i\}\)\\,\\Delta t\_\{i\},\\quad\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\\left\(0,I\_\{d\}\\right\),\\quad\\forall i,1\\leq i\\leq N,X0u∈𝒮ℙdu\-a\.s\.\\displaystyle X^\{u\}\_\{0\}\\in\\mathcal\{S\}\\qquad\\mathbb\{P\}\_\{d\}^\{u\}\\text\{\-a\.s\.\}
whereℙdu\\mathbb\{P\}\_\{d\}^\{u\}is the probability measure over state tuples induced by the controlled dynamics\.
Although the preceding discretization yields a finite\-horizon control problem, the optimization problem remains largely intractable in its full generality\. First, the control variables are defined over the full state spaceℝd\\mathbb\{R\}^\{d\}, while the objective involves an expectation over all realizations of the controlled stochastic dynamics\. Moreover, terminal costs and hard terminal constraints induce temporal coupling across the full reverse process: the effect of an early control perturbation on the terminal state is mediated by a long composition of neural\-network denoising maps\.
Rearranging the controlled dynamics, we can express the control action at each step in terms of the deviation from the uncontrolled denoising update\.111We also simplify the notation and suppress the superscriptuuand writeXiX\_\{i\}instead ofXiuX\_\{i\}^\{u\}for controlled states\.Namely,
ui=Xi−1−X¯i−1εigiΔti,X¯i−1εi:=Φiθ\(Xi,εi\),i=1,…,N\.u\_\{i\}=\\frac\{X\_\{i\-1\}\-\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\}\{g\_\{i\}\\Delta t\_\{i\}\},\\qquad\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}:=\\Phi^\{\\theta\}\_\{i\}\(X\_\{i\},\\varepsilon\_\{i\}\),\\qquad i=1,\\dots,N\.\(30\)
Second, motivated by model predictive control\[[52](https://arxiv.org/html/2606.15359#bib.bib40)\], we replace the full\-horizon optimization by a one\-step look\-ahead surrogate\. Specifically, at stepii, the transition fromXiX\_\{i\}toXi−1X\_\{i\-1\}is evaluated using the original denoising update, while the cost\-to\-go over the remaining reverse process is approximated using the model’s predicted clean sample\. This yields the proxy
𝔼\[C\(X0\)∣Xi−1\]≈C\(𝔼\[X0∣Xi−1\]\)≈C\(x^0θ\(Xi−1,ti−1\)\),i=1,…,N\.\\mathbb\{E\}\\\!\\left\[C\(X\_\{0\}\)\\mid X\_\{i\-1\}\\right\]\\approx C\\\!\\left\(\\mathbb\{E\}\[X\_\{0\}\\mid X\_\{i\-1\}\]\\right\)\\approx C\\\!\\left\(\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\\right\),\\qquad i=1,\\dots,N\.\(31\)
The total discrete\-time SOC cost\-to\-go from stepiiis therefore approximated by retaining only the one\-step control effort and replacing the remaining terminal cost by the Tweedie\-based prediction\. Namely, we use the surrogate
𝔼\[λC\(X0\)\+12∑k=1i∥uk\(Xk\)∥22Δtk\|Xi\]≈λC\(x^0θ\(Xi−1,ti−1\)\)\+12gi2Δti∥Xi−1−X¯i−1εi∥22\.\\mathbb\{E\}\\left\[\\lambda C\(X\_\{0\}\)\+\\frac\{1\}\{2\}\\sum\_\{k=1\}^\{i\}\\\|u\_\{k\}\(X\_\{k\}\)\\\|\_\{2\}^\{2\}\\Delta t\_\{k\}\\,\\middle\|\\,X\_\{i\}\\right\]\\approx\\lambda C\\\!\\left\(\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\\right\)\+\\frac\{1\}\{2g\_\{i\}^\{2\}\\Delta t\_\{i\}\}\\left\\lVert X\_\{i\-1\}\-\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\\right\\rVert\_\{2\}^\{2\}\.\(32\)
Similarly, we enforce constraint satisfaction on the predicted clean samplex^0θ\(Xi−1,ti−1\)\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)as a proxy for terminal feasibility, thereby replacing the full joint optimization over all control variables with a sequence of one\-step subproblems\. At each denoising step, we first computeX¯i−1εi=Φiθ\(Xi,εi\)\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}=\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\},\\varepsilon\_\{i\}\), withεi∼𝒩\(0,Id\)\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\), and correct the uncontrolled estimate by solving:
Xi−1⋆∈argminXi−1\\displaystyle X^\{\\star\}\_\{i\-1\}\\in\\operatorname\*\{arg\\,min\}\_\{X\_\{i\-1\}\}λC\(X^0\|i−1\)\+12gi2Δti‖Xi−1−X¯i−1εi‖22\\displaystyle\\lambda C\(\\hat\{X\}\_\{0\|i\-1\}\)\+\\frac\{1\}\{2g\_\{i\}^\{2\}\\Delta t\_\{i\}\}\\\|X\_\{i\-1\}\-\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\\\|\_\{2\}^\{2\}\(33\)s\.t\.X^0\|i−1∈𝒮,X^0\|i−1=x^0θ\(Xi−1,ti−1\)\.\\displaystyle\\hat\{X\}\_\{0\|i\-1\}\\in\\mathcal\{S\},\\qquad\\hat\{X\}\_\{0\|i\-1\}=\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\.
Each one\-step subproblem still contains a nontrivial coupling between the optimization variableXi−1X\_\{i\-1\}and the predicted clean sampleX^0\|i−1:=x^0θ\(Xi−1,ti−1\)\\hat\{X\}\_\{0\|i\-1\}:=\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\. This coupling enters both the terminal cost and the feasibility constraintX^0\|i−1∈𝒮\\hat\{X\}\_\{0\|i\-1\}\\in\\mathcal\{S\}\. Therefore, even whenCCand𝒮\\mathcal\{S\}have simple structure in the data domain, their pullback through the neural denoising mapx^0θ\(⋅,ti−1\)\\hat\{x\}\_\{0\}^\{\\theta\}\(\\cdot,t\_\{i\-1\}\)may define a highly nonlinear and nonconvex optimization problem in the latent variableXi−1X\_\{i\-1\}\. For diffusion policies with tentatively millions of parameters, repeatedly embedding neural\-network evaluations within the subproblem can introduce substantial computational overhead and may limit the practicality of direct latent\-space optimization with generic off\-the\-shelf solvers\. To phrase the optimization in the data domain, we approximate the latent displacementXi−1−X¯i−1εiX\_\{i\-1\}\-\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}through the corresponding difference between predicted clean samples\. The key step is to approximately invert Tweedie’s formula by a fixed\-point scheme\. Rearranging \([3](https://arxiv.org/html/2606.15359#S3.E3)\) at timeti−1t\_\{i\-1\}gives
Xi−1=αti−1x^0θ\(Xi−1,ti−1\)−σti−12sti−1θ\(Xi−1\)\.X\_\{i\-1\}=\\alpha\_\{t\_\{i\-1\}\}\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\-\\sigma\_\{t\_\{i\-1\}\}^\{2\}s\_\{t\_\{i\-1\}\}^\{\\theta\}\(X\_\{i\-1\}\)\.\(34\)Now fixy=x^0θ\(Xi−1,ti−1\)y=\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\. Inverting Tweedie’s estimate for this prescribed clean prediction amounts to findingXi−1X\_\{i\-1\}such that
Xi−1=Fti−1θ\(Xi−1;y\),Fti−1θ\(x;y\):=αti−1y−σti−12sti−1θ\(x\)\.X\_\{i\-1\}=F\_\{t\_\{i\-1\}\}^\{\\theta\}\(X\_\{i\-1\};y\),\\qquad F\_\{t\_\{i\-1\}\}^\{\\theta\}\(x;y\):=\\alpha\_\{t\_\{i\-1\}\}y\-\\sigma\_\{t\_\{i\-1\}\}^\{2\}s\_\{t\_\{i\-1\}\}^\{\\theta\}\(x\)\.\(35\)
AssumingFti−1θ\(⋅;y\)F\_\{t\_\{i\-1\}\}^\{\\theta\}\(\\cdot;y\)is a contraction, the Banach fixed\-point theorem guarantees a unique fixed point, which can be obtained by repeated application ofFti−1θ\(⋅;y\)F\_\{t\_\{i\-1\}\}^\{\\theta\}\(\\cdot;y\):
Xti−1⋆\(y\)=limk→∞\(Fti−1θ\(⋅;y\)\)\(k\)\(X\(0\)\)\.X^\{\\star\}\_\{t\_\{i\-1\}\}\(y\)=\\lim\_\{k\\to\\infty\}\\left\(F\_\{t\_\{i\-1\}\}^\{\\theta\}\(\\cdot;y\)\\right\)^\{\(k\)\}\(X^\{\(0\)\}\)\.\(36\)
In practice, we initialize the fixed\-point inversion at the uncontrolled proposalX¯i−1εi\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}, which provides a natural estimate of the latent state before control is applied\. Applying one fixed\-point iteration gives
Xi−1≈Fti−1θ\(X¯i−1εi;y\)=αti−1y−σti−12sti−1θ\(X¯i−1εi\)\.X\_\{i\-1\}\\approx F^\{\\theta\}\_\{t\_\{i\-1\}\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\};y\)=\\alpha\_\{t\_\{i\-1\}\}y\-\\sigma\_\{t\_\{i\-1\}\}^\{2\}s^\{\\theta\}\_\{t\_\{i\-1\}\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\)\.\(37\)Combining this approximation with \([34](https://arxiv.org/html/2606.15359#A3.E34)\) applied toX¯i−1εi\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}yields
Xi−1−X¯i−1εi\\displaystyle X\_\{i\-1\}\-\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}≈αti−1\(y−x^0θ\(X¯i−1εi,ti−1\)\)\\displaystyle\\approx\\alpha\_\{t\_\{i\-1\}\}\\left\(y\-\\hat\{x\}\_\{0\}^\{\\theta\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\},t\_\{i\-1\}\)\\right\)\(38\)=αti−1\(x^0θ\(Xi−1,ti−1\)−x^0θ\(X¯i−1εi,ti−1\)\)\.\\displaystyle=\\alpha\_\{t\_\{i\-1\}\}\\left\(\\hat\{x\}\_\{0\}^\{\\theta\}\(X\_\{i\-1\},t\_\{i\-1\}\)\-\\hat\{x\}\_\{0\}^\{\\theta\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\},t\_\{i\-1\}\)\\right\)\.
Thus, the latent displacement can be approximated by a scaled discrepancy between Tweedie clean\-sample predictions\. This provides a gradient\-free correction direction in latent space and allows the control penalty to be expressed in the data domain\. Substituting \([38](https://arxiv.org/html/2606.15359#A3.E38)\) into the one\-step surrogate yields the final receding\-horizon formulation in Problem[3](https://arxiv.org/html/2606.15359#Thmproblem3), where the constrained optimization is carried out in the data domain rather than through repeated neural\-network evaluations inside the solver\.
###### Problem 3\(DiRecT: receding\-horizon formulation\)\.
Given a sampling scheme\{Φiθ\}i=1N\\\{\\Phi\_\{i\}^\{\\theta\}\\\}\_\{i=1\}^\{N\}, constraint set𝒮\\mathcal\{S\}, cost weightλ\\lambda, and initial sampleXN⋆∼ppriorX\_\{N\}^\{\\star\}\\sim p\_\{\\mathrm\{prior\}\}, solve the following recursion fori=N,…,1i=N,\\ldots,1:
\{X¯i−1εi=Φiθ\(Xi⋆,εi\),εi∼𝒩\(0,Id\),X~0\|i−1=x^0θ\(X¯i−1εi,ti−1\),X^0\|i−1⋆∈argminX^0\|i−1λC\(X^0\|i−1\)\+αti−122gi2Δti‖X^0\|i−1−X~0\|i−1‖22s\.t\.X^0\|i−1∈𝒮,Xi−1⋆=\{X¯i−1εi\+αti−1\(X^0\|i−1⋆−X~0\|i−1\),i\>1,X^0\|0⋆,i=1\.\\left\\\{\\begin\{aligned\} \\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}&=\\Phi\_\{i\}^\{\\theta\}\(X\_\{i\}^\{\\star\},\\varepsilon\_\{i\}\),\\qquad\\varepsilon\_\{i\}\\sim\\mathcal\{N\}\(0,I\_\{d\}\),\\\\ \\tilde\{X\}\_\{0\|i\-1\}&=\\hat\{x\}\_\{0\}^\{\\theta\}\(\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\},t\_\{i\-1\}\),\\\\ \\hat\{X\}^\{\\star\}\_\{0\|i\-1\}&\\in\\operatorname\*\{arg\\,min\}\_\{\\hat\{X\}\_\{0\|i\-1\}\}\\lambda\\,C\(\\hat\{X\}\_\{0\|i\-1\}\)\+\\frac\{\\alpha\_\{t\_\{i\-1\}\}^\{2\}\}\{2g\_\{i\}^\{2\}\\Delta t\_\{i\}\}\\\|\\hat\{X\}\_\{0\|i\-1\}\-\\tilde\{X\}\_\{0\|i\-1\}\\\|\_\{2\}^\{2\}\\quad\\text\{s\.t\.\}\\quad\\hat\{X\}\_\{0\|i\-1\}\\in\\mathcal\{S\},\\\\ X^\{\\star\}\_\{i\-1\}&=\\begin\{cases\}\\bar\{X\}^\{\\varepsilon\_\{i\}\}\_\{i\-1\}\+\\alpha\_\{t\_\{i\-1\}\}\(\\hat\{X\}^\{\\star\}\_\{0\|i\-1\}\-\\tilde\{X\}\_\{0\|i\-1\}\),&i\>1,\\\\ \\hat\{X\}^\{\\star\}\_\{0\|0\},&i=1\.\\end\{cases\}\\end\{aligned\}\\right\.\(39\)
Although several approximations are introduced to reduce the general stochastic optimal control problem to a tractable form, terminal constraint satisfaction is not relaxed, as formalized in Proposition[1](https://arxiv.org/html/2606.15359#Thmproposition1)\.
## Appendix DImplementation details
##### Hardware\.
AllMaze2DandD3ILexperiments were performed on a compute node equipped with two Intel\(R\) Xeon\(R\) Gold 5218 @ 2\.30GHz CPUs and six NVIDIA Quadro RTX 8000 48GB,MRMPon a node with two Intel Xeon E5\-2670 v2 @ 2\.50 GHz CPUs and one NVIDIA Tesla V100\-SXM2 32 GB GPU, andPushTon a machine equipped with two AMD EPYC 75F3 32\-Core Processor CPUs and four NVIDIA A100 GPUs\. All computations for training, model evaluation, and gradient calculation were performed on a single GPU, whereas environment rollouts and IPOPT optimizations were performed on a CPU\.
##### Software\.
### D\.1Safe maze navigation
##### Training\.
We train a*continuous\-time*diffusion model with a standard temporal UNet architecture\[[30](https://arxiv.org/html/2606.15359#bib.bib63)\]and horizonH=384H=384\. The model outputs the state\-action pair trajectory\(s0,a0,s1,a1,…,sH−1,aH−1\)∈ℝ6H\(s\_\{0\},a\_\{0\},s\_\{1\},a\_\{1\},\\ldots,s\_\{H\-1\},a\_\{H\-1\}\)\\in\\mathbb\{R\}^\{6H\}conditioned on given start and end states\. We employ data\-prediction parametrization and a cosine noise schedule with offsets=0\.008s=0\.008\[[47](https://arxiv.org/html/2606.15359#bib.bib70)\]\. We train the model with Adam\[[34](https://arxiv.org/html/2606.15359#bib.bib80)\]for500,000500\{,\}000steps with a batch size of256256on the D4RLmaze2d\-large\-v1\[[23](https://arxiv.org/html/2606.15359#bib.bib62)\]666[https://github\.com/Farama\-Foundation/D4RL](https://github.com/Farama-Foundation/D4RL)dataset \(CC BY 4\.0\), scaled by min\-max normalization\. We use a learning rate of0\.0010\.001with a cosine annealing schedule for the first10,00010\{,\}000steps\. We save checkpoints using an exponential moving average of the model weights with decay0\.9950\.995\.
##### Environment details\.
We introduce obstacles as ellipses and super\-ellipses centered in\(cx,cy\)\(c\_\{x\},c\_\{y\}\), semi\-axes\(rx,ry\)\(r\_\{x\},r\_\{y\}\), and orderpp:
\|x−cxrx\|p\+\|y−cyry\|p≥1\\left\|\\frac\{x\-c\_\{x\}\}\{r\_\{x\}\}\\right\|^\{p\}\+\\left\|\\frac\{y\-c\_\{y\}\}\{r\_\{y\}\}\\right\|^\{p\}\\geq 1\(40\)For bothBroadandNarrowvariants, we follow the obstacle placement in\[[69](https://arxiv.org/html/2606.15359#bib.bib31)\]\. To enforce dynamics constraints we fit by ordinary least\-squares regression the normalized transitions from the training data for a total of 3,993,503 across 1,062 episodes\. For simplicity, although equivalent, we separate the state and action coefficients with the following functional forms′=A⋅s\+B⋅a\+cs^\{\\prime\}=A\\cdot s\+B\\cdot a\+c, where each transition is defined by the tuple\(s,a,s′\)\(s,a,s^\{\\prime\}\)\. When enforcing dynamic constraints, we impose the fitted dynamics as*strict*equality constraints over the planning horizon\. Although the per\-step least squares residuals are contained, open\-loop execution over the full horizon remains impractical\. In this regime, predicted and executed trajectories may diverge, hindering proper evaluation of the planner\. To mitigate this issue, we perform policy rollout by tracking the generated trajectory with a proportional\-derivative \(PD\) controller with constantsP=5,D=1P=5,D=1\.
##### Evaluation\.
We evaluate each method over100100i\.i\.d\. generated samples conditioned on the same fixed endpoints\. For each sample, we execute a rollout up to800800environment steps, at which the episode is terminated\. For all methods we denoise with a DDPM sampler\[[27](https://arxiv.org/html/2606.15359#bib.bib11)\]on a uniform time discretization\. We follow standard practice and perform the last denoising iteration by returning the model prediction\. To encourage shorter paths, when applicable, the cost functionC\(x\)C\(x\)is chosen to be the squared path length of the generated trajectory\. We now briefly describe the implementation details of each method, as well as their adaptation to include equality dynamic constraints:
- •Diffuser\[[29](https://arxiv.org/html/2606.15359#bib.bib24)\]\. We employ3232denoising iterations as an unguided reference for the base pretrained model\. No obstacles or dynamic constraints are imposed\.
- •Gradient Guidance\[[15](https://arxiv.org/html/2606.15359#bib.bib22)\]\. We employ3232denoising iterations and guide the noisy latentxtx\_\{t\}by computing the gradient of the cost function using posterior sampling as∇C′\(x^0,θ\(xt\)\)\\nabla C^\{\\prime\}\(\\hat\{x\}\_\{0,\\theta\}\(x\_\{t\}\)\), where constraint violations are added as a penalty term in the cost functionC′\(x\)=C\(x\)\+Cpenalty\(x\)C^\{\\prime\}\(x\)=C\(x\)\+C\_\{\\mathrm\{penalty\}\}\(x\)\. To impose dynamic constraints, we penalize the squared residuals of each predicted transition along the prediction horizon by adding them as an additional cost penalty\.
- •Projection\[[14](https://arxiv.org/html/2606.15359#bib.bib28)\]\. We employ3232denoising steps and project the second half of the generated latents onto the feasible set\. Dynamic constraints are introduced directly in the optimization process, which is solved with IPOPT\.
- •Augmented Lagrangian\[[71](https://arxiv.org/html/2606.15359#bib.bib30)\]\. We adapt the original implementation to the continuous\-time case by scaling hyperparameters to generate equivalent guided transitions\. We use256256sampling steps with an additional200200iterations for the last denoising step\. We use the original implementation for handling dynamic equality constraints\.
- •SafeDiffuser\-RoS\[[69](https://arxiv.org/html/2606.15359#bib.bib31)\]\. We adapt the original implementation to the continuous\-time case by scaling hyperparameters to generate equivalent guided transitions with256256sampling steps\. We introduce dynamic constraints directly into the quadratic program\. The resulting optimization problem is solved with IPOPT\.
- •DiRecT\. We employ3232denoising steps and optimize over the second half of the generative process\. We incorporate dynamic constraints directly into the receding\-horizon subproblem, which is solved with IPOPT\.
### D\.2Safe robotic manipulation
We train a*continuous\-time*diffusion model with a temporal UNet architecture\[[30](https://arxiv.org/html/2606.15359#bib.bib63)\]and horizonH=16H=16\. The model outputs the state\-action pair trajectory\(s0,a0,s1,a1,…,sH−1,aH−1\)∈ℝ6H\(s\_\{0\},a\_\{0\},s\_\{1\},a\_\{1\},\\ldots,s\_\{H\-1\},a\_\{H\-1\}\)\\in\\mathbb\{R\}^\{6H\}over the entire prediction horizon\. We employ data\-prediction parametrization and a cosine noise schedule with offsets=0\.008s=0\.008\[[47](https://arxiv.org/html/2606.15359#bib.bib70)\]\. We train the model with Adam\[[34](https://arxiv.org/html/2606.15359#bib.bib80)\]for500,000500\{,\}000steps with a batch size of3232on the D3ILavoidingdataset\[[30](https://arxiv.org/html/2606.15359#bib.bib63)\]777[https://github\.com/ALRhub/d3il](https://github.com/ALRhub/d3il)\(MIT\), scaled by min\-max normalization\. We use a learning rate of2×10−42\\times 10^\{\-4\}with a cosine annealing schedule for the first100,000100\{,\}000steps\. We save checkpoints using an exponential moving average of the model weights with decay0\.9950\.995\.
##### Environment details\.
The base D3ILavoidingtask presents six circular pillars that must be avoided by the end\-effector\. The unguided diffusion policy learns to dodge these obstacles directly from human demonstrations\. To assess performance over*unseen*constraints, we restrict planning by increasing the radius of the first pillar and limiting side\-movements by two planar regions\. Dynamics constraints are imposed by least\-squares fitting of the72097209training transitions across9696demonstrations, with the same procedure for Maze2D \(see Appendix[D\.1](https://arxiv.org/html/2606.15359#A4.SS1)\)\.
##### Evaluation\.
We evaluate each method over100100i\.i\.d\. generated samples conditioned on a fixed initial position and with zero starting velocity\. For each sample, we execute a rollout up to100100environment steps with a planning horizonH=16H=16and execution horizonM=4M=4\. We terminate the episode when the end\-effector collides with a pillar, while we continue rollouts for violations of*test\-time*constraints\. All other implementation details follow the procedure for Maze2D \(see Appendix[D\.1](https://arxiv.org/html/2606.15359#A4.SS1)\)\.
### D\.3Safe multi\-robot motion planning
##### Problem setup\.
In this task, multiple agents’ trajectories in a common environment are generated in a collision\-free and velocity\-constrained manner\. We test our method on the four environments from the MMD\[[58](https://arxiv.org/html/2606.15359#bib.bib65)\]benchmark\. A diffusion model is trained for each environment on single\-agent trajectories showing a specific motion pattern\. In the unconstrained setting, the objective is to generate trajectories𝝉∈ℝH×2\\bm\{\\tau\}\\in\\mathbb\{R\}^\{H\\times 2\}conditioned on given start and target endpoints\(b,e\)∈ℝ2×2\(b,e\)\\in\\mathbb\{R\}^\{2\\times 2\}presenting the desired motion behaviors for each MMD environment \(Figure[5](https://arxiv.org/html/2606.15359#A4.F5)\):
- •Emptyis the simplest map, where demonstrations consist of straight\-line motions and robots are penalized for deviating from a direct path\.
- •Highwayscontains a central block that induces a warehouse\-like traffic pattern, where agents are expected to move*counterclockwise*around the obstacle\.
- •Conveyorconsists of two narrow corridors\. Agents are rewarded for moving leftward in the top corridor and rightward in the bottom corridor\.
- •Drop\-Regionfeatures an obstacle\-free central space and four drop\-off regions\. Agents are rewarded for pausing near the midpoint of any of the four square regions, emulating package\-delivery tasks\.
\(a\)Empty
\(b\)Highways
\(c\)Conveyor
\(d\)Drop\-Region
Figure 5:Four major MMD environments with feasible optimized solutions\. Robots navigate toward their goal positions while avoiding inter\-robot and obstacle collisions and satisfying velocity constraints\. Each environment exhibits a characteristic motion pattern that should be preserved\.Following the coupled trajectory\-generation setup in\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], we seek to generateNaN\_\{a\}agent trajectories*simultaneously*\. The generated trajectories must avoid inter\-robot and robot\-obstacle collisions, satisfy*test\-time*velocity restrictions, and exhibit the desired MMD motion pattern\.
Let\[K\]\[K\]denote\{1,…,K\}\\\{1,\.\.\.,K\\\}for any natural numberK∈ℕK\\in\\mathbb\{N\}\. We denote a generated sample of trajectories by𝝉=\(𝝉1,…,𝝉Na\)∈ℝNa×H×2\\bm\{\\tau\}=\\left\(\\bm\{\\tau\}\_\{1\},\.\.\.,\\bm\{\\tau\}\_\{N\_\{a\}\}\\right\)\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}, where𝝉ij\\bm\{\\tau\}^\{j\}\_\{i\}is thejj\-th predicted step for theii\-th agent\. Given an optimization objective𝒥:ℝNa×H×2→ℝ\\mathcal\{J\}:\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}\\rightarrow\\mathbb\{R\}and starting locations\{bi\}i∈\[Na\]\\\{b\_\{i\}\\\}\_\{i\\in\[N\_\{a\}\]\}for each robot, the general MRMP problem is formulated as\[[39](https://arxiv.org/html/2606.15359#bib.bib60),[58](https://arxiv.org/html/2606.15359#bib.bib65)\]
min𝝉∈ℝNa×H×2\\displaystyle\\min\_\{\\bm\{\\tau\}\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}\}\\quad𝒥\(𝝉\)\\displaystyle\\mathcal\{J\}\(\\bm\{\\tau\}\)\(41a\)s\.t\.‖𝝉ik\+1−𝝉ik‖2≤vmaxΔt,\\displaystyle\\left\\lVert\\bm\{\\tau\}\_\{i\}^\{k\+1\}\-\\bm\{\\tau\}\_\{i\}^\{k\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,∀i∈\[Na\],∀k∈\[H−1\],\\displaystyle\\forall i\\in\[N\_\{a\}\],\\ \\forall k\\in\[H\-1\],\(41b\)‖𝝉i1−bi‖2≤vmaxΔt,\\displaystyle\\left\\lVert\\bm\{\\tau\}\_\{i\}^\{1\}\-b\_\{i\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,∀i∈\[Na\],\\displaystyle\\forall i\\in\[N\_\{a\}\],\(41c\)‖𝝉ik−𝝉jk‖2≥2δ,\\displaystyle\\left\\lVert\\bm\{\\tau\}\_\{i\}^\{k\}\-\\bm\{\\tau\}\_\{j\}^\{k\}\\right\\rVert\_\{2\}\\geq 2\\delta,∀i≠j,∀k∈\[H\],\\displaystyle\\forall i\\neq j,\\ \\forall k\\in\[H\],\(41d\)d\(𝝉ik,𝒪\)≥δ,\\displaystyle d\(\\bm\{\\tau\}\_\{i\}^\{k\},\\mathcal\{O\}\)\\geq\\delta,∀i∈\[Na\],∀k∈\[H\]\.\\displaystyle\\forall i\\in\[N\_\{a\}\],\\ \\forall k\\in\[H\]\.\(41e\)whereδ\\deltais the robot radius,vmaxv\_\{\\max\}is the test\-time maximum allowable velocity,Δt\\Delta tis the environment time increment between consecutive generated steps, andd\(⋅,𝒪\):ℝ2→ℝd\(\\cdot,\\mathcal\{O\}\):\\mathbb\{R\}^\{2\}\\rightarrow\\mathbb\{R\}is the distance function to the obstacle set\. LetΩV\\Omega\_\{V\},ΩCR\\Omega\_\{CR\}, andΩCO\\Omega\_\{CO\}denote the feasible sets associated with the velocity, inter\-robot collision\-avoidance, and robot\-obstacle collision\-avoidance constraints, respectively\. We define the feasible set of Problem[41](https://arxiv.org/html/2606.15359#A4.E41)asΩ=ΩV∩ΩCR∩ΩCO\.\\Omega=\\Omega\_\{V\}\\cap\\Omega\_\{CR\}\\cap\\Omega\_\{CO\}\.The data adherence objective is implicitly specified by the single\-agent diffusion policy\. In this setting, Problem[41](https://arxiv.org/html/2606.15359#A4.E41)can be naturally interpreted as constrained sampling from a pretrained diffusion model applied to the concatenation of all agent trajectories\. More explicitly, given a diffusion model trained for a single agent, we can generateNaN\_\{a\}independent trajectories by sampling from the product distribution
𝝉∼p0⋆\(𝝉\)=∏i=1Nap0\(𝝉i\)\.\\bm\{\\tau\}\\sim p\_\{0\}^\{\\star\}\(\\bm\{\\tau\}\)=\\prod\_\{i=1\}^\{N\_\{a\}\}p\_\{0\}\(\\bm\{\\tau\}\_\{i\}\)\.Therefore, the corresponding concatenated denoiserx^0,θ⋆\(𝝉\)=⨁i=1Nax^0,θ\(𝝉i\)\\hat\{x\}^\{\\star\}\_\{0,\\theta\}\(\\bm\{\\tau\}\)=\\bigoplus\_\{i=1\}^\{N\_\{a\}\}\\hat\{x\}\_\{0,\\theta\}\(\\bm\{\\tau\}\_\{i\}\)is equivalent to a pretrained diffusion model that interpolates between the prior𝒩\(0,𝑰Na×H×2\)\\mathcal\{N\}\\left\(0,\\bm\{I\}\_\{N\_\{a\}\\times H\\times 2\}\\right\)and the product distributionp0⋆p\_\{0\}^\{\\star\}\. To impose*test\-time*constraints within our framework, we define a*projection*operator onto the feasible setΩ\\Omega\. This differs from the coupled diffusion framework of\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]where only velocity limits are enforced exactly: rather than relying on soft collision penalties, we impose the nonconvex collision\-avoidance constraints as hard constraints in the constrained optimization problem\.
##### Evaluation details\.
We use the pretrained checkpoints provided by\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], which are trained with aDiffusion Policy\[[13](https://arxiv.org/html/2606.15359#bib.bib25)\]backbone on the four MMD environments\. For each environment, we impose a maximum allowable agent velocity, computed from forward differences of trajectory positions\. For all methods, we generate samples using100100DDPM\[[27](https://arxiv.org/html/2606.15359#bib.bib11)\]denoising steps, applying guidance during the second half of the denoising process\. We evaluate each method over100100randomly selected initial configurations\. For each configuration, we generate a batch of128128candidate trajectories in parallel\. Following\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], we define*Success Rate*as the fraction of trials for which at least one trajectory in the generated batch is collision\-free\. In contrast,*Constraint Safety*measures the average feasibility rate over*all*generated trajectories in terms of both collision and kinematic constraints\.
#### D\.3\.1Custom MRMP optimizer
##### Problem setup\.
We explicitly state the projection problem solved by the MRMP optimizer\. LetX∈ℝNa×H×2X\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}denote the optimized trajectory tuple, withXik∈ℝ2X\_\{i\}^\{k\}\\in\\mathbb\{R\}^\{2\}denoting the position of agentiiat waypointkk\. LetX^∈ℝNa×H×2\\hat\{X\}\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}denote the Tweedie clean\-trajectory estimate produced by the diffusion model\. The proximal safe trajectory is computed as the solution of
X⋆=argminX∈ℝNa×H×212‖X−X^‖F2\+λsm2∑i=1Na∑k=1H−1‖Xik\+1−Xik‖22\.X^\{\\star\}=\\operatorname\*\{arg\\,min\}\_\{X\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}\}\\quad\\frac\{1\}\{2\}\\left\\lVert X\-\\hat\{X\}\\right\\rVert\_\{F\}^\{2\}\+\\frac\{\\lambda\_\{\\rm sm\}\}\{2\}\\sum\_\{i=1\}^\{N\_\{a\}\}\\sum\_\{k=1\}^\{H\-1\}\\left\\lVert X\_\{i\}^\{k\+1\}\-X\_\{i\}^\{k\}\\right\\rVert\_\{2\}^\{2\}\.\(42a\)
s\.t\.‖Xi1−bi‖2≤vmaxΔt,\\displaystyle\\left\\lVert X\_\{i\}^\{1\}\-b\_\{i\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,∀i∈\[Na\],\\displaystyle\\forall i\\in\[N\_\{a\}\],\(42b\)‖Xik\+1−Xik‖2≤vmaxΔt,\\displaystyle\\left\\lVert X\_\{i\}^\{k\+1\}\-X\_\{i\}^\{k\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,∀i∈\[Na\],∀k∈\[H−1\],\\displaystyle\\forall i\\in\[N\_\{a\}\],\\ \\forall k\\in\[H\-1\],\(42c\)‖Xik−Xjk‖2≥2δ,\\displaystyle\\left\\lVert X\_\{i\}^\{k\}\-X\_\{j\}^\{k\}\\right\\rVert\_\{2\}\\geq 2\\delta,∀i≠j,∀k∈\[H\],\\displaystyle\\forall i\\neq j,\\ \\forall k\\in\[H\],\(42d\)d\(Xik,𝒪\)≥δ,\\displaystyle d\(X\_\{i\}^\{k\},\\mathcal\{O\}\)\\geq\\delta,∀i∈\[Na\],∀k∈\[H\]\.\\displaystyle\\forall i\\in\[N\_\{a\}\],\\ \\forall k\\in\[H\]\.\(42e\)
The first term keeps the projected tuple close to the diffusion model’s Tweedie estimate
X^\\hat\{X\}, while the smoothness term with strength
λsm\\lambda\_\{\\rm sm\}optionally encourages straight\-line trajectories\. The constraints enforce initial\-state consistency, inference\-time velocity limits, inter\-robot separation, and robot\-obstacle clearance\. Thus, the optimizer defines a \(proximal\) projection
X⋆=ΠΩ\(X^\)X^\{\\star\}=\\Pi\_\{\\Omega\}\(\\hat\{X\}\)onto the feasible set
Ω=ΩV∩ΩCR∩ΩCO\\Omega=\\Omega\_\{V\}\\cap\\Omega\_\{CR\}\\cap\\Omega\_\{CO\}\.
Different approaches are possible for solving this problem\. A first possibility is to employ an off\-the\-shelf nonlinear solver such as IPOPT\. However, general\-purpose nonlinear solvers are typically designed around CPU\-based sparse linear algebra and do not provide a plug\-and\-play GPU implementation suitable for extensive sweeps and batched inference\. An alternative is to relax the nonconvex collision constraints and optimize the resulting objective with Lagrangian or augmented\-Lagrangian methods\[[39](https://arxiv.org/html/2606.15359#bib.bib60)\]\. However, since the collision constraints enter the objective through nonlinear penalty or multiplier terms, the resulting optimization can remain nonconvex and still requires a general\-purpose CPU\-based nonlinear solver\. Inspired by the ADMM projector in\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], given the convexity of the velocity constraints and the spatially local structure of the collision constraints, we derive a custom optimizer by combining successive convexification of the active collision constraints with an ADMM splitting of the resulting convex subproblem\[[10](https://arxiv.org/html/2606.15359#bib.bib81),[44](https://arxiv.org/html/2606.15359#bib.bib82)\]\. The complete derivation is as follows\.
##### Successive convexification\.
The nonconvexity of \([42](https://arxiv.org/html/2606.15359#A4.E42)\) arises from the inter\-robot and robot\-obstacle collision constraints\. We handle these constraints with an outer successive\-convexification loop\. LetX\(m\)X^\{\(m\)\}denote the current trajectory at SCP iterationmm\. For each active inter\-robot constraint, define
rijk,\(m\)=Xik,\(m\)−Xjk,\(m\),nijk,\(m\)=rijk,\(m\)‖rijk,\(m\)‖2\.r\_\{ij\}^\{k,\(m\)\}=X\_\{i\}^\{k,\(m\)\}\-X\_\{j\}^\{k,\(m\)\},\\qquad n\_\{ij\}^\{k,\(m\)\}=\\frac\{r\_\{ij\}^\{k,\(m\)\}\}\{\\left\\lVert r\_\{ij\}^\{k,\(m\)\}\\right\\rVert\_\{2\}\}\.Using the first\-order lower approximation of the norm, we replace‖Xik−Xjk‖2≥2δ\\left\\lVert X\_\{i\}^\{k\}\-X\_\{j\}^\{k\}\\right\\rVert\_\{2\}\\geq 2\\deltawith
\(nijk,\(m\)\)⊤\(Xik−Xjk\)≥2δ\.\\big\(n\_\{ij\}^\{k,\(m\)\}\\big\)^\{\\top\}\(X\_\{i\}^\{k\}\-X\_\{j\}^\{k\}\)\\geq 2\\delta\.
For robot\-obstacle constraints, letdℓ\(p\)d\_\{\\ell\}\(p\)denote the signed distance from pointppto obstacle primitive𝒪ℓ\\mathcal\{O\}\_\{\\ell\}\. For each active constraintdℓ\(Xik\)−δ≥0d\_\{\\ell\}\(X\_\{i\}^\{k\}\)\-\\delta\\geq 0, we use the first\-order approximation aroundXik,\(m\)X\_\{i\}^\{k,\(m\)\}:
dℓ\(Xik,\(m\)\)−δ\+∇dℓ\(Xik,\(m\)\)⊤\(Xik−Xik,\(m\)\)≥0\.d\_\{\\ell\}\(X\_\{i\}^\{k,\(m\)\}\)\-\\delta\+\\nabla d\_\{\\ell\}\(X\_\{i\}^\{k,\(m\)\}\)^\{\\top\}\\big\(X\_\{i\}^\{k\}\-X\_\{i\}^\{k,\(m\)\}\\big\)\\geq 0\.For all MMD environments, obstacles decompose into rectangular and circular primitives, so both signed distances and their gradients can be computed exactly\. At SCP iterationmm, we compute the next trajectoryX\(m\+1\)X^\{\(m\+1\)\}by solving the following convexified, slack\-penalized projection problem:
\(X\(m\+1\),s\(m\+1\)\)∈argminX,s12‖X−X^‖F2\+λsm2∑i=1Na∑k=1H−1‖Xik\+1−Xik‖22\+τm2‖X−X\(m\)‖F2\+μm𝟏⊤s\.\\displaystyle\\begin\{aligned\} \(X^\{\(m\+1\)\},s^\{\(m\+1\)\}\)\\in\\operatorname\*\{arg\\,min\}\_\{X,s\}\\quad&\\frac\{1\}\{2\}\\left\\lVert X\-\\hat\{X\}\\right\\rVert\_\{F\}^\{2\}\+\\frac\{\\lambda\_\{\\rm sm\}\}\{2\}\\sum\_\{i=1\}^\{N\_\{a\}\}\\sum\_\{k=1\}^\{H\-1\}\\left\\lVert X\_\{i\}^\{k\+1\}\-X\_\{i\}^\{k\}\\right\\rVert\_\{2\}^\{2\}\\\\ &\+\\frac\{\\tau\_\{m\}\}\{2\}\\left\\lVert X\-X^\{\(m\)\}\\right\\rVert\_\{F\}^\{2\}\+\\mu\_\{m\}\\mathbf\{1\}^\{\\top\}s\.\\end\{aligned\}\(43a\)
s\.t\.‖Xi1−bi‖2≤vmaxΔt,\\displaystyle\\left\\lVert X\_\{i\}^\{1\}\-b\_\{i\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,∀i∈\[Na\],\\displaystyle\\forall i\\in\[N\_\{a\}\],\(43b\)‖Xik\+1−Xik‖2≤vmaxΔt,\\displaystyle\\left\\lVert X\_\{i\}^\{k\+1\}\-X\_\{i\}^\{k\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,i∈\[Na\],k∈\[H−1\],\\displaystyle i\\in\[N\_\{a\}\],\\ k\\in\[H\-1\],\(43c\)\(nijk,\(m\)\)⊤\(Xik−Xjk\)\+sijk≥2δ,\\displaystyle\\big\(n\_\{ij\}^\{k,\(m\)\}\\big\)^\{\\top\}\(X\_\{i\}^\{k\}\-X\_\{j\}^\{k\}\)\+s\_\{ij\}^\{k\}\\geq 2\\delta,∀\(i,j,k\)∈𝒜RR\(m\),\\displaystyle\\forall\(i,j,k\)\\in\\mathcal\{A\}\_\{\\rm RR\}^\{\(m\)\},\(43d\)\(niℓk,\(m\)\)⊤Xik\+siℓk≥\(niℓk,\(m\)\)⊤Xik,\(m\)−\(dℓ\(Xik,\(m\)\)−δ\),\\displaystyle\\big\(n\_\{i\\ell\}^\{k,\(m\)\}\\big\)^\{\\top\}X\_\{i\}^\{k\}\+s\_\{i\\ell\}^\{k\}\\geq\\big\(n\_\{i\\ell\}^\{k,\(m\)\}\\big\)^\{\\top\}X\_\{i\}^\{k,\(m\)\}\-\\big\(d\_\{\\ell\}\(X\_\{i\}^\{k,\(m\)\}\)\-\\delta\\big\),∀\(i,ℓ,k\)∈𝒜O\(m\),\\displaystyle\\forall\(i,\\ell,k\)\\in\\mathcal\{A\}\_\{\\rm O\}^\{\(m\)\},\(43e\)s≥0\.\\displaystyle s\\geq 0\.\(43f\)
Here𝒜RR\(m\)\\mathcal\{A\}\_\{\\rm RR\}^\{\(m\)\}and𝒜O\(m\)\\mathcal\{A\}\_\{\\rm O\}^\{\(m\)\}denote the active inter\-robot and robot\-obstacle constraints at the linearization pointX\(m\)X^\{\(m\)\}\. The slack variabless≥0s\\geq 0keep the convexified subproblem feasible, while theℓ1\\ell\_\{1\}penalty𝟏⊤s=‖s‖1\\mathbf\{1\}^\{\\top\}s=\\left\\lVert s\\right\\rVert\_\{1\}encourages sparse violations\. We use monotone schedulesμm\+1=min\{γμμm,μmax\}\\mu\_\{m\+1\}=\\min\\\{\\gamma\_\{\\mu\}\\mu\_\{m\},\\mu\_\{\\max\}\\\}andτm\+1=min\{γττm,τmax\}\\tau\_\{m\+1\}=\\min\\\{\\gamma\_\{\\tau\}\\tau\_\{m\},\\tau\_\{\\max\}\\\}, withγμ,γτ\>1\\gamma\_\{\\mu\},\\gamma\_\{\\tau\}\>1\. Increasingμm\\mu\_\{m\}progressively enforces collision satisfaction, while increasingτm\\tau\_\{m\}regularizes each SCP step around its linearization point\.
##### ADMM splitting\.
To solve \([43](https://arxiv.org/html/2606.15359#A4.E43)\), we introduce auxiliary variablesV,C,Z∈ℝNa×H×2V,C,Z\\in\\mathbb\{R\}^\{N\_\{a\}\\times H\\times 2\}\. The variableVVis used to impose velocity feasibility,ZZstores the corresponding velocity increments, andCCisolates the linearized collision constraints\. For notational convenience, setVi0=biV\_\{i\}^\{0\}=b\_\{i\}and define the anchored first\-difference operator
\(DbV\)ik=Vik−Vik−1,i∈\[Na\],k∈\[H\]\.\(D\_\{b\}V\)\_\{i\}^\{k\}=V\_\{i\}^\{k\}\-V\_\{i\}^\{k\-1\},\\qquad i\\in\[N\_\{a\}\],\\ k\\in\[H\]\.Thus,\(DbV\)i1=Vi1−bi\(D\_\{b\}V\)\_\{i\}^\{1\}=V\_\{i\}^\{1\}\-b\_\{i\}encodes the initial velocity constraint, while the remaining entries encode consecutive displacements\. The velocity constraint is equivalently written asDbV=Z,Z∈𝒦ZD\_\{b\}V=Z,\\;Z\\in\\mathcal\{K\}\_\{Z\}with
𝒦Z=\{Z:‖Zik‖2≤vmaxΔt,i∈\[Na\],k∈\[H\]\}\.\\mathcal\{K\}\_\{Z\}=\\\{Z:\\left\\lVert Z\_\{i\}^\{k\}\\right\\rVert\_\{2\}\\leq v\_\{\\max\}\\Delta t,\\ i\\in\[N\_\{a\}\],\\ k\\in\[H\]\\\}\.
LetG\(m\)G^\{\(m\)\}denote the matrix collecting the active linearized collision constraints at SCP iterationmm, and leth\(m\)h^\{\(m\)\}denote the corresponding vector of right\-hand sides\. The slack variablesshas the same dimension ash\(m\)h^\{\(m\)\}and keeps the active collision constraints feasible\. We write these constraints compactly as\(C,s\)∈𝒦C\(m\)\(C,s\)\\in\\mathcal\{K\}\_\{C\}^\{\(m\)\}, where
𝒦C\(m\)=\{\(C,s\):G\(m\)C\+s≥h\(m\),s≥0\}\.\\mathcal\{K\}\_\{C\}^\{\(m\)\}=\\\{\(C,s\):G^\{\(m\)\}C\+s\\geq h^\{\(m\)\},\\ s\\geq 0\\\}\.Let𝕀𝒦\\mathbb\{I\}\_\{\\mathcal\{K\}\}denote the indicator function of a set𝒦\\mathcal\{K\}, equal to0on𝒦\\mathcal\{K\}and\+∞\+\\inftyotherwise\. Define
fm\(X\)=12‖X−X^‖F2\+λsm2∑i=1Na∑k=1H−1‖Xik\+1−Xik‖22\+τm2‖X−X\(m\)‖F2\.f\_\{m\}\(X\)=\\frac\{1\}\{2\}\\left\\lVert X\-\\hat\{X\}\\right\\rVert\_\{F\}^\{2\}\+\\frac\{\\lambda\_\{\\rm sm\}\}\{2\}\\sum\_\{i=1\}^\{N\_\{a\}\}\\sum\_\{k=1\}^\{H\-1\}\\left\\lVert X\_\{i\}^\{k\+1\}\-X\_\{i\}^\{k\}\\right\\rVert\_\{2\}^\{2\}\+\\frac\{\\tau\_\{m\}\}\{2\}\\left\\lVert X\-X^\{\(m\)\}\\right\\rVert\_\{F\}^\{2\}\.The SCP subproblem can then be written in ADMM form as
minX,V,C,Z,s\\displaystyle\\min\_\{X,V,C,Z,s\}fm\(X\)\+μm𝟏⊤s\+𝕀𝒦Z\(Z\)\+𝕀𝒦C\(m\)\(C,s\)\\displaystyle f\_\{m\}\(X\)\+\\mu\_\{m\}\\mathbf\{1\}^\{\\top\}s\+\\mathbb\{I\}\_\{\\mathcal\{K\}\_\{Z\}\}\(Z\)\+\\mathbb\{I\}\_\{\\mathcal\{K\}\_\{C\}^\{\(m\)\}\}\(C,s\)\(44\)s\.t\.X=V,X=C,DbV=Z,s≥0\.\\displaystyle X=V,\\qquad X=C,\\qquad D\_\{b\}V=Z,\\qquad s\\geq 0\.Using scaled dual variablesUXVU\_\{XV\},UXCU\_\{XC\}, andUZU\_\{Z\}with penaltiesρXV,ρXC,ρZ\>0\\rho\_\{XV\},\\rho\_\{XC\},\\rho\_\{Z\}\>0, the corresponding scaled augmented Lagrangian formulation of \([44](https://arxiv.org/html/2606.15359#A4.E44)\) is:
ℒρ=\\displaystyle\\mathcal\{L\}\_\{\\rho\}=fm\(X\)\+μm𝟏⊤s\+𝕀𝒦Z\(Z\)\+𝕀𝒦C\(m\)\(C,s\)\\displaystyle\\ f\_\{m\}\(X\)\+\\mu\_\{m\}\\mathbf\{1\}^\{\\top\}s\+\\mathbb\{I\}\_\{\\mathcal\{K\}\_\{Z\}\}\(Z\)\+\\mathbb\{I\}\_\{\\mathcal\{K\}\_\{C\}^\{\(m\)\}\}\(C,s\)\(45\)\+ρXV2‖X−V\+UXV‖F2−ρXV2‖UXV‖F2\\displaystyle\+\\frac\{\\rho\_\{XV\}\}\{2\}\\left\\lVert X\-V\+U\_\{XV\}\\right\\rVert\_\{F\}^\{2\}\-\\frac\{\\rho\_\{XV\}\}\{2\}\\left\\lVert U\_\{XV\}\\right\\rVert\_\{F\}^\{2\}\+ρXC2‖X−C\+UXC‖F2−ρXC2‖UXC‖F2\\displaystyle\+\\frac\{\\rho\_\{XC\}\}\{2\}\\left\\lVert X\-C\+U\_\{XC\}\\right\\rVert\_\{F\}^\{2\}\-\\frac\{\\rho\_\{XC\}\}\{2\}\\left\\lVert U\_\{XC\}\\right\\rVert\_\{F\}^\{2\}\+ρZ2‖DbV−Z\+UZ‖F2−ρZ2‖UZ‖F2\.\\displaystyle\+\\frac\{\\rho\_\{Z\}\}\{2\}\\left\\lVert D\_\{b\}V\-Z\+U\_\{Z\}\\right\\rVert\_\{F\}^\{2\}\-\\frac\{\\rho\_\{Z\}\}\{2\}\\left\\lVert U\_\{Z\}\\right\\rVert\_\{F\}^\{2\}\.ADMM then alternates between the following updates at inner iterationqqof SCP iterationmm\. For the linear solves, write the anchored difference asDbV=AV−BD\_\{b\}V=AV\-B, whereAAis the first\-difference matrix with\(AV\)i1=Vi1\(AV\)\_\{i\}^\{1\}=V\_\{i\}^\{1\}and\(AV\)ik=Vik−Vik−1\(AV\)\_\{i\}^\{k\}=V\_\{i\}^\{k\}\-V\_\{i\}^\{k\-1\}fork≥2k\\geq 2, whileBi1=biB\_\{i\}^\{1\}=b\_\{i\}andBik=0B\_\{i\}^\{k\}=0fork≥2k\\geq 2\.
- •XX\-update\.TheXX\-update minimizes the scaled augmented Lagrangian with respect toXXwhile keeping the remaining variables fixed: Xq\+1=argminX\{fm\(X\)\+ρXV2‖X−Vq\+UXVq‖F2\+ρXC2‖X−Cq\+UXCq‖F2\}\.X^\{q\+1\}=\\operatorname\*\{arg\\,min\}\_\{X\}\\left\\\{f\_\{m\}\(X\)\+\\frac\{\\rho\_\{XV\}\}\{2\}\\left\\lVert X\-V^\{q\}\+U\_\{XV\}^\{q\}\\right\\rVert\_\{F\}^\{2\}\+\\frac\{\\rho\_\{XC\}\}\{2\}\\left\\lVert X\-C^\{q\}\+U\_\{XC\}^\{q\}\\right\\rVert\_\{F\}^\{2\}\\right\\\}\.This is an unconstrained quadratic program\. Setting its gradient to zero gives the linear system \(\(1\+τm\+ρXV\+ρXC\)I\+λsmD⊤D\)Xq\+1=X^\+τmX\(m\)\+ρXV\(Vq−UXVq\)\+ρXC\(Cq−UXCq\),\\Big\(\(1\+\\tau\_\{m\}\+\\rho\_\{XV\}\+\\rho\_\{XC\}\)I\+\\lambda\_\{\\rm sm\}D^\{\\top\}D\\Big\)X^\{q\+1\}=\\hat\{X\}\+\\tau\_\{m\}X^\{\(m\)\}\+\\rho\_\{XV\}\(V^\{q\}\-U\_\{XV\}^\{q\}\)\+\\rho\_\{XC\}\(C^\{q\}\-U\_\{XC\}^\{q\}\),whereDDis the unanchored first\-difference operator appearing in the smoothness term\.
- •VV\-update\.TheVV\-update collects the terms involving the velocity split: Vq\+1=argminV\{ρXV2‖Xq\+1−V\+UXVq‖F2\+ρZ2‖DbV−Zq\+UZq‖F2\}\.V^\{q\+1\}=\\operatorname\*\{arg\\,min\}\_\{V\}\\left\\\{\\frac\{\\rho\_\{XV\}\}\{2\}\\left\\lVert X^\{q\+1\}\-V\+U\_\{XV\}^\{q\}\\right\\rVert\_\{F\}^\{2\}\+\\frac\{\\rho\_\{Z\}\}\{2\}\\left\\lVert D\_\{b\}V\-Z^\{q\}\+U\_\{Z\}^\{q\}\\right\\rVert\_\{F\}^\{2\}\\right\\\}\.This is also an unconstrained quadratic program\. UsingDbV=AV−BD\_\{b\}V=AV\-B, its optimality condition yields \(ρXVI\+ρZA⊤A\)Vq\+1=ρXV\(Xq\+1\+UXVq\)\+ρZA⊤\(Zq−UZq\+B\)\.\\big\(\\rho\_\{XV\}I\+\\rho\_\{Z\}A^\{\\top\}A\\big\)V^\{q\+1\}=\\rho\_\{XV\}\(X^\{q\+1\}\+U\_\{XV\}^\{q\}\)\+\\rho\_\{Z\}A^\{\\top\}\(Z^\{q\}\-U\_\{Z\}^\{q\}\+B\)\.
- •ZZ\-update\.TheZZ\-update enforces the velocity bound by projecting the current anchored displacement estimate onto𝒦Z\\mathcal\{K\}\_\{Z\}: Zq\+1=Π𝒦Z\(DbVq\+1\+UZq\)\.Z^\{q\+1\}=\\Pi\_\{\\mathcal\{K\}\_\{Z\}\}\\big\(D\_\{b\}V^\{q\+1\}\+U\_\{Z\}^\{q\}\\big\)\.This projection has a closed\-form solution as it is applied*row\-wise*\. Forνik=\(DbVq\+1\)ik\+\(UZq\)ik\\nu\_\{i\}^\{k\}=\(D\_\{b\}V^\{q\+1\}\)\_\{i\}^\{k\}\+\(U\_\{Z\}^\{q\}\)\_\{i\}^\{k\}, Zik,q\+1=\{\(vmaxΔt\)νik‖νik‖2,‖νik‖2\>vmaxΔt,νik,otherwise,i∈\[Na\],k∈\[H\]\.Z\_\{i\}^\{k,q\+1\}=\\begin\{cases\}\(v\_\{\\max\}\\Delta t\)\\dfrac\{\\nu\_\{i\}^\{k\}\}\{\\left\\lVert\\nu\_\{i\}^\{k\}\\right\\rVert\_\{2\}\},&\\left\\lVert\\nu\_\{i\}^\{k\}\\right\\rVert\_\{2\}\>v\_\{\\max\}\\Delta t,\\\\\[8\.00003pt\] \\nu\_\{i\}^\{k\},&\\text\{otherwise\},\\end\{cases\}\\qquad i\\in\[N\_\{a\}\],\\ k\\in\[H\]\.
- •C,sC,s\-update\.The collision update projects onto the linearized collision constraints while penalizing slack: \(Cq\+1,sq\+1\)=argmin\(C,s\)∈𝒦C\(m\)\{ρXC2‖C−\(Xq\+1\+UXCq\)‖F2\+μm𝟏⊤s\}\.\(C^\{q\+1\},s^\{q\+1\}\)=\\operatorname\*\{arg\\,min\}\_\{\(C,s\)\\in\\mathcal\{K\}\_\{C\}^\{\(m\)\}\}\\left\\\{\\frac\{\\rho\_\{XC\}\}\{2\}\\left\\lVert C\-\(X^\{q\+1\}\+U\_\{XC\}^\{q\}\)\\right\\rVert\_\{F\}^\{2\}\+\\mu\_\{m\}\\mathbf\{1\}^\{\\top\}s\\right\\\}\.This is a convex quadratic program with linear inequality constraints\. Rather than calling a generic QP solver, we solve its box\-constrained dual\. LetQq=Xq\+1\+UXCqQ^\{q\}=X^\{q\+1\}\+U\_\{XC\}^\{q\}\. The dual problem is λ⋆=argmax0≤λ≤μm\{−12ρXC‖\(G\(m\)\)⊤λ‖22\+λ⊤\(h\(m\)−G\(m\)Qq\)\}\.\\lambda^\{\\star\}=\\operatorname\*\{arg\\,max\}\_\{0\\leq\\lambda\\leq\\mu\_\{m\}\}\\left\\\{\-\\frac\{1\}\{2\\rho\_\{XC\}\}\\left\\lVert\(G^\{\(m\)\}\)^\{\\top\}\\lambda\\right\\rVert\_\{2\}^\{2\}\+\\lambda^\{\\top\}\\big\(h^\{\(m\)\}\-G^\{\(m\)\}Q^\{q\}\\big\)\\right\\\}\.We solve this dual by projected gradient ascent onλ\\lambda\. The primal variables are then recovered as Cq\+1=Qq\+1ρXC\(G\(m\)\)⊤λ⋆,sq\+1=max\{0,h\(m\)−G\(m\)Cq\+1\}\.C^\{q\+1\}=Q^\{q\}\+\\frac\{1\}\{\\rho\_\{XC\}\}\(G^\{\(m\)\}\)^\{\\top\}\\lambda^\{\\star\},\\qquad s^\{q\+1\}=\\max\\\{0,h^\{\(m\)\}\-G^\{\(m\)\}C^\{q\+1\}\\\}\.
- •Dual updates\.Finally, the scaled dual variables are updated by accumulating the primal residuals of the three splitting constraints: UXVq\+1=UXVq\+Xq\+1−Vq\+1,UXCq\+1=UXCq\+Xq\+1−Cq\+1,U\_\{XV\}^\{q\+1\}=U\_\{XV\}^\{q\}\+X^\{q\+1\}\-V^\{q\+1\},\\qquad U\_\{XC\}^\{q\+1\}=U\_\{XC\}^\{q\}\+X^\{q\+1\}\-C^\{q\+1\},UZq\+1=UZq\+DbVq\+1−Zq\+1\.U\_\{Z\}^\{q\+1\}=U\_\{Z\}^\{q\}\+D\_\{b\}V^\{q\+1\}\-Z^\{q\+1\}\.
TheXX\- andVV\-updates require solving linear systems with matrices of the formαI\+βD⊤D\\alpha I\+\\beta D^\{\\top\}DorαI\+βA⊤A\\alpha I\+\\beta A^\{\\top\}A, which are tridiagonal along the time dimension\. Their factorizations can therefore be cached and reused across ADMM iterations within each SCP iteration\. We implement the projector in JAX\[[11](https://arxiv.org/html/2606.15359#bib.bib73)\], using just\-in\-time compilation and GPU parallelism for batched inference\.
For simplicity in implementation, we use fixed iteration budgets for all nested loops:KcvxK\_\{\\rm cvx\}successive\-convexification iterations,KADMMK\_\{\\rm ADMM\}ADMM iterations, andKQPK\_\{\\rm QP\}projected\-gradient steps for the local collision QP\. Developing principled termination criteria for these loops is an interesting direction for future work and could further reduce inference\-time computational overhead\. We summarize the full procedure in Algorithm[2](https://arxiv.org/html/2606.15359#alg2)and default parameters in Table[4](https://arxiv.org/html/2606.15359#A4.T4)\.
1
Input:Tweedie reference
X^\\hat\{X\}, starts
bb, obstacle primitives
𝒪\\mathcal\{O\}, robot radius
δ\\delta, step bound
dmax=vmaxΔtd\_\{\\max\}=v\_\{\\max\}\\Delta t; iterations
Kcvx,KADMM,KQPK\_\{\\rm cvx\},K\_\{\\rm ADMM\},K\_\{\\rm QP\}; penalties
ρXV,ρXC,ρZ\\rho\_\{XV\},\\rho\_\{XC\},\\rho\_\{Z\}; schedules
μm,τm\\mu\_\{m\},\\tau\_\{m\}; dual step size
α\\alpha
Output:Projected trajectory
XX
2
3Pre\-compute
DD,
AA, and
BBsuch that
DbV=AV−BD\_\{b\}V=AV\-B, with
Bi1=biB\_\{i\}^\{1\}=b\_\{i\}and
Bik=0B\_\{i\}^\{k\}=0for
k≥2k\\geq 2;
X←X^X\\leftarrow\\hat\{X\},
V←XV\\leftarrow X,
C←XC\\leftarrow X;
//Initialize from diffusion estimate
Z←Π𝒦Z\(AV−B\)Z\\leftarrow\\Pi\_\{\\mathcal\{K\}\_\{Z\}\}\(AV\-B\),
UXV,UXC,UZ←0U\_\{XV\},U\_\{XC\},U\_\{Z\}\\leftarrow 0;
//Initialize velocity split and duals
4
5SCP convexification;
6for*m=0,…,Kcvx−1m=0,\\ldots,K\_\{\\rm cvx\}\-1*do
7
Xlin←XX\_\{\\rm lin\}\\leftarrow X;
8Select active constraints
𝒜RR\(m\),𝒜O\(m\)\\mathcal\{A\}\_\{\\rm RR\}^\{\(m\)\},\\mathcal\{A\}\_\{\\rm O\}^\{\(m\)\}at
XlinX\_\{\\rm lin\};
9Build the linearized collision constraints
G\(m\)C\+s≥h\(m\)G^\{\(m\)\}C\+s\\geq h^\{\(m\)\},
s≥0s\\geq 0;
HX←\(1\+τm\+ρXV\+ρXC\)I\+λsmD⊤DH\_\{X\}\\leftarrow\(1\+\\tau\_\{m\}\+\\rho\_\{XV\}\+\\rho\_\{XC\}\)I\+\\lambda\_\{\\rm sm\}D^\{\\top\}D;
//
XX\-update system
HV←ρXVI\+ρZA⊤AH\_\{V\}\\leftarrow\\rho\_\{XV\}I\+\\rho\_\{Z\}A^\{\\top\}A;
//
VV\-update system
Cache factorizations of
HXH\_\{X\}and
HVH\_\{V\};
//Tridiagonal along time
10
C←XC\\leftarrow X,
s←0s\\leftarrow 0,
UXC←0U\_\{XC\}\\leftarrow 0;
11
12ADMM splitting;
13for*q=0,…,KADMM−1q=0,\\ldots,K\_\{\\rm ADMM\}\-1*do
14
X,V,ZX,V,Zupdates;
15
RX←X^\+τmXlin\+ρXV\(V−UXV\)\+ρXC\(C−UXC\)R\_\{X\}\\leftarrow\\hat\{X\}\+\\tau\_\{m\}X\_\{\\rm lin\}\+\\rho\_\{XV\}\(V\-U\_\{XV\}\)\+\\rho\_\{XC\}\(C\-U\_\{XC\}\);
X←SolveBatch\(HXX=RX\)X\\leftarrow\\textsc\{SolveBatch\}\(H\_\{X\}X=R\_\{X\}\);
//Batched linear solve on GPU
16
RV←ρXV\(X\+UXV\)\+ρZA⊤\(Z−UZ\+B\)R\_\{V\}\\leftarrow\\rho\_\{XV\}\(X\+U\_\{XV\}\)\+\\rho\_\{Z\}A^\{\\top\}\(Z\-U\_\{Z\}\+B\);
V←SolveBatch\(HVV=RV\)V\\leftarrow\\textsc\{SolveBatch\}\(H\_\{V\}V=R\_\{V\}\);
//Batched linear solve on GPU
17
W←AV−B\+UZW\\leftarrow AV\-B\+U\_\{Z\};
Zik←min\{1,dmax‖Wik‖2\+εmin\}Wik,i∈\[Na\],k∈\[H\]Z\_\{i\}^\{k\}\\leftarrow\\min\\left\\\{1,\\dfrac\{d\_\{\\max\}\}\{\\left\\lVert W\_\{i\}^\{k\}\\right\\rVert\_\{2\}\+\\varepsilon\_\{min\}\}\\right\\\}W\_\{i\}^\{k\},\\quad i\\in\[N\_\{a\}\],\\ k\\in\[H\];
//Project onto velocity balls
18
19Local collision\-QP dual;
20
Q←X\+UXCQ\\leftarrow X\+U\_\{XC\},
λ←0\\lambda\\leftarrow 0;
21for*r=0,…,KQP−1r=0,\\ldots,K\_\{\\rm QP\}\-1*do
gλ←−ρXC−1G\(m\)\(G\(m\)\)⊤λ\+\(h\(m\)−G\(m\)Q\)g\_\{\\lambda\}\\leftarrow\-\\rho\_\{XC\}^\{\-1\}G^\{\(m\)\}\(G^\{\(m\)\}\)^\{\\top\}\\lambda\+\\big\(h^\{\(m\)\}\-G^\{\(m\)\}Q\\big\);
//Dual gradient
λ←clip\(λ\+αgλ,0,μm\)\\lambda\\leftarrow\\operatorname\{clip\}\(\\lambda\+\\alpha g\_\{\\lambda\},\\ 0,\\ \\mu\_\{m\}\);
//Box\-constrained dual step
22
23
24
C,sC,srecovery and dual updates;
25
C←Q\+ρXC−1\(G\(m\)\)⊤λC\\leftarrow Q\+\\rho\_\{XC\}^\{\-1\}\(G^\{\(m\)\}\)^\{\\top\}\\lambda;
26
s←max\{0,h\(m\)−G\(m\)C\}s\\leftarrow\\max\\\{0,\\ h^\{\(m\)\}\-G^\{\(m\)\}C\\\};
UXV←UXV\+X−VU\_\{XV\}\\leftarrow U\_\{XV\}\+X\-V;
//Position\-\-velocity residual
UXC←UXC\+X−CU\_\{XC\}\\leftarrow U\_\{XC\}\+X\-C;
//Position\-\-collision residual
UZ←UZ\+AV−B−ZU\_\{Z\}\\leftarrow U\_\{Z\}\+AV\-B\-Z;
//Velocity\-projection residual
27
28
return*XX*
Algorithm 2Batched SCP–ADMM Projection for MRMPTable 4:Main hyperparameters used by the batched SCP–ADMM MRMP optimizer\.
### D\.4Safe and diverse contact\-rich manipulation
##### Training\.
We employ a diffusion model with noise\-prediction parametrization on the augmented expert demonstrations dataset using default parameters in\[[20](https://arxiv.org/html/2606.15359#bib.bib75)\]\.
##### Evaluation\.
We follow the same setup as in\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\]and generate100100pairs of trajectories for each of5050random starting configurations\. We constrain each agent by imposing*test\-time*velocity limits corresponding to the90%90\\%quantile in the original work\. Sampling is performed with a planning horizon ofHp=16H\_\{p\}=16, execution horizonHa=8H\_\{a\}=8, and for a maximum of360360environment steps\.
##### Cost functions\.
For clarity of exposition, we repeat the definition of the cost functions used to encourage non\-intersecting trajectories during coupled sampling\.
Determinantal Point Process \(DPP\)\[[20](https://arxiv.org/html/2606.15359#bib.bib75)\]\. Given a generated pair\(X,Y\)∈ℝH×2×2\(X,Y\)\\in\\mathbb\{R\}^\{H\\times 2\\times 2\}, the DPP cost directly penalizes high cosine similarity between the flattened trajectories\(X¯,Y¯\)∈ℝ2H×2\(\\bar\{X\},\\bar\{Y\}\)\\in\\mathbb\{R\}^\{2H\\times 2\}:
cDPP\(X,Y\)=log\(cos∠\(X¯,Y¯\)\+ε\)c\_\{\\text\{DPP\}\}\(X,Y\)=\\log\\left\(\\cos\\angle\(\\bar\{X\},\\bar\{Y\}\)\+\\varepsilon\\right\)\(46\)whereε\>1\\varepsilon\>1regulates cost sharpness\.
Log\-Barrier \(LB\)\. Given a trajectory tuple\(X,Y\)∈ℝH×2×2\(X,Y\)\\in\\mathbb\{R\}^\{H\\times 2\\times 2\}, log\-barrier cost discourages proximity of trajectories, as measured by the Euclidean distance:
cLB\(X,Y\)=−∑i=1Hlog\(‖Xi−Yi‖\+α\)c\_\{\\text\{LB\}\}\(X,Y\)=\-\\sum\_\{i=1\}^\{H\}\\log\(\\left\\lVert X\_\{i\}\-Y\_\{i\}\\right\\rVert\+\\alpha\)\(47\)whereα\>0\\alpha\>0is an offset parameter\.
##### Projected Gradient Descent\.
For the PCD\-DPP and PCD\-LB baselines, we use one gradient iteration per denoising step with guidance strengths ofγDPP=0\.2\\gamma\_\{DPP\}=0\.2andγLB=0\.02\\gamma\_\{LB\}=0\.02respectively\. For DiRecT\-DPP and DiRecT\-LB we introduce diversity costs directly in the optimization problem \([33](https://arxiv.org/html/2606.15359#A3.E33)\) scaled by an appropriate weightλc\>0\\lambda\_\{c\}\>0\. For a comparison under similar computational budgets, the optimization problem is not solved to convergence, instead we employ one iteration of projected gradient descent\[[37](https://arxiv.org/html/2606.15359#bib.bib78)\]\. Given a pair of denoised estimates\(X¯,Y¯\)\(\\bar\{X\},\\bar\{Y\}\), the optimized pair\(X⋆,Y⋆\)\(X^\{\\star\},Y^\{\\star\}\)becomes888We include the step size for the gradient update inλc\\lambda\_\{c\}, instead of defining an additional hyperparameter\. Moreover, the gradient of the quadratic regularization from Tweedie’s estimate vanishes at\(X¯,Y¯\)\(\\bar\{X\},\\bar\{Y\}\)\.:
X⋆\\displaystyle X^\{\\star\}=Π𝒦\(X¯−λc∇XcDPP/LB\(X,Y\)\|X¯,Y¯\),\\displaystyle=\\Pi\_\{\\mathcal\{K\}\}\\left\(\\bar\{X\}\-\\lambda\_\{c\}\\nabla\_\{X\}c\_\{\\text\{DPP\}/\\text\{LB\}\}\(X,Y\)\\big\|\_\{\\bar\{X\},\\bar\{Y\}\}\\right\),\(48\)Y⋆\\displaystyle Y^\{\\star\}=Π𝒦\(Y¯−λc∇YcDPP/LB\(X,Y\)\|X¯,Y¯\)\.\\displaystyle=\\Pi\_\{\\mathcal\{K\}\}\\left\(\\bar\{Y\}\-\\lambda\_\{c\}\\nabla\_\{Y\}c\_\{\\text\{DPP\}/\\text\{LB\}\}\(X,Y\)\\big\|\_\{\\bar\{X\},\\bar\{Y\}\}\\right\)\.Here,ΠK\\Pi\_\{K\}is the same ADMM\-based projector as the PCD baselines over the set𝒦\\mathcal\{K\}of velocity\-restricted trajectories\. We sweep multiple weights forλc\\lambda\_\{c\}and chooseλc=0\.005\\lambda\_\{c\}=0\.005for DiRecT\-LB andλc=0\.05\\lambda\_\{c\}=0\.05for DiRecT\-DPP\. Additional details are provided in Appendix[E\.4](https://arxiv.org/html/2606.15359#A5.SS4)\.
## Appendix EAdditional results
### E\.1Safe maze navigation
We report in Table[5](https://arxiv.org/html/2606.15359#A5.T5)a detailed comparison of our method against additional baselines, with a focus on the difference between*generated*and*rollout*safety\. We consider these additional methods for comparison:Classifier Guidance\[[17](https://arxiv.org/html/2606.15359#bib.bib20)\]which employs gradient guidance through*soft*violation costs without posterior sampling,Primal\-Dual\[[71](https://arxiv.org/html/2606.15359#bib.bib30)\]as an additional Lagrangian\-based method, and the two additional SafeDiffuser variationsSafeDiffuser\-ReS, andSafeDiffuser\-TVS\[[69](https://arxiv.org/html/2606.15359#bib.bib31)\]for CBF\-based constraining\. We also report additional metrics on the*generated*trajectory: \(i\)*Safety Rate*, measuring the fraction of samples that correctly avoid obstacles; \(ii\) mean*Violations*, measuring the number of steps the predicted trajectory remains infeasible; \(iii\)*Acceleration Smoothness*\(AS\), quantifying the change in velocity across the trajectory, and computed as the magnitude of the second\-order position differences averaged across all planned transitions:AS=\(H−2\)−1∑i=1H−2‖si\+1−2si\+si−1‖AS=\(H\-2\)^\{\-1\}\\ \\sum\_\{i=1\}^\{H\-2\}\\left\\lVert s\_\{i\+1\}\-2s\_\{i\}\+s\_\{i\-1\}\\right\\rVert; \(iv\)*Curvature Smoothness*\(CS\), measuring the mean curvature of the generated trajectory\. Following\[[16](https://arxiv.org/html/2606.15359#bib.bib71)\], we define the turning angleθi=cos−1\(wiTwi\+1/\(‖wi‖‖wi\+1‖\)\),wi=si\+1−si,i=0,…,H−3\\theta\_\{i\}=\\cos^\{\-1\}\(w\_\{i\}^\{T\}w\_\{i\+1\}/\(\\left\\lVert w\_\{i\}\\right\\rVert\\left\\lVert w\_\{i\+1\}\\right\\rVert\)\),\\;\\;w\_\{i\}=s\_\{i\+1\}\-s\_\{i\},\\;\\;i=0,\.\.\.,H\-3, and compute the metric as the average across the planned horizonCS=\(H−2\)−1∑i=0H−3\(1−cosθi\)CS=\(H\-2\)^\{\-1\}\\sum\_\{i=0\}^\{H\-3\}\(1\-\\cos\\theta\_\{i\}\)\.
Table 5:Results on the Maze2D constrained navigation task\. Mean values are computed over100100i\.i\.d\. samples and reported alongside their standard deviation\. Gray columns report metrics evaluated on the planned trajectories before execution\. Best value\(s\) in boldface\.In both variants DiRecT outperforms all other baselines in terms of*rollout*safety\. Even though Projected Diffusion, Primal\-Dual and the three SafeDiffuser variations generate safe trajectories at planning time on thebroadinstance of the problem, their rollout performance remains limited\. This can be explained by noting that these methods find feasible solutions to the optimization problem by departing substantially from the data distribution\. As shown in Figures[7](https://arxiv.org/html/2606.15359#A6.F7)–[8](https://arxiv.org/html/2606.15359#A6.F8), prior methods in highly constrained dynamic settings often produce trajectories that depart from the learned maze structure and pass through maze walls\. In contrast, our method performs optimization near the data manifold, thereby remaining proximal to the data distribution\. In the narrow setting, we observe that IPOPT struggles to find feasible solutions within the allowed200200iterations, sometimes leading to infeasible generated solutions for projection\-based methods\. Moreover, primal\-dual methods and some SafeDiffuser variants exhibit instabilities on certain samples under dynamic constraints\.
### E\.2Safe robotic manipulation
We compare the same extended baselines as in Table[5](https://arxiv.org/html/2606.15359#A5.T5)and report safety and success rates of these additional policies in Table[6](https://arxiv.org/html/2606.15359#A5.T6)\. Our method is the only one to obtain perfect task success rate\.
For the primal\-dual implementation, we tuned the reference hyperparameters to the best of our ability, including the dual\-ascent learning rate\. Nevertheless, in the highly constrained setting with dynamics constraints, the method exhibited increasing instability, occasionally producing trajectories that triggered errors in the MuJoCo simulator during execution\.
Table 6:Results on the D3IL avoiding constrained manipulation task\. Mean values are computed over100100i\.i\.d\. trials and reported with standard deviations where applicable\. Best value in boldface\.
### E\.3Safe multi\-robot motion planning
We present sweep results for each of the three velocity limitations and number of agents in Tables[7](https://arxiv.org/html/2606.15359#A5.T7),[8](https://arxiv.org/html/2606.15359#A5.T8),[9](https://arxiv.org/html/2606.15359#A5.T9),[10](https://arxiv.org/html/2606.15359#A5.T10)\. We report computational times as well as the mean number of collisions\. DiRecT consistently improves*Success*and*Constraint safety*\. OnEmptyandHighways, our method matches the*data adherence*of the pretrained model for all velocity limitations\. We note that onConveyorandDrop\-Regioncomparing data adherence between DiRecT and the PCD variants is not straightforward forN∈\{12,16,20\}N\\in\\\{12,16,20\\\}, as the latter are not capable of generating feasible trajectories, thereby showcasing high task success while neglecting safety\.
Table 7:MRMP results onEmpty\. SR: success rate \(%\), CS: constraint safety \(%\), Coll: mean collisions, DA: data adherence, Time: computation time \(s\)\. Standard deviations are over different initializations and boldface represents best among the methods actively enforcing constraints\.Table 8:MRMP results onHighways\. SR: success rate \(%\), CS: constraint safety \(%\), Coll: mean collisions, DA: data adherence, Time: computation time \(s\)\. Standard deviations are over different initializations and boldface represents best among the methods actively enforcing constraints\.Table 9:MRMP results onConveyor\. SR: success rate \(%\), CS: constraint safety \(%\), Coll: mean collisions, DA: data adherence, Time: computation time \(s\)\. Standard deviations are over different initializations and boldface represents best among the methods actively enforcing constraints\.Table 10:MRMP results onDrop\-Region\. SR: success rate \(%\), CS: constraint safety \(%\), Coll: mean collisions, DA: data adherence, Time: computation time \(s\)\. Standard deviations are over different initializations and boldface represents best among the methods actively enforcing constraints\.
### E\.4Safe and diverse contact\-rich manipulation
As outlined in the ablation studies of\[[43](https://arxiv.org/html/2606.15359#bib.bib33)\], task success and trajectory diversity \(DTW, DFD\) are conflicting objectives in thePushTtask\. Therefore, there exists a Pareto\-optimality front that can be traced by varying coupling strength\. We sweep the coupling strength coefficient for gradient guidance in PCD and the step size of the Projected Gradient Descent updateλc\\lambda\_\{c\}for DiRecT\. Figure[6](https://arxiv.org/html/2606.15359#A5.F6)shows that Pareto fronts obtained with our method dominate those derived from PCD for both DPP and LB cost functions\. Furthermore, DiRecT improves over the baselines while requiring only*half*the projection and gradient evaluations, as we start guidance in the second half of denoising\. This highlights the importance of optimizing near the data distribution where constraints and costs are more naturally defined\.
\(a\)TCvsDTW\. Top\-right is better\.
\(b\)TCvsDFD\. Top\-right is better\.
Figure 6:Pareto fronts for the PCD and DiRecT variants, obtained by sweeping the gradient coupling strengthγ\\gammafor PCD and the cost weightλc\\lambda\_\{c\}for DiRecT\. Fronts are drawn as lines connecting Pareto\-optimal points for each method, while dominated points are shaded\.
## Appendix FVisualizations
Figure 7:Comparison of*planned*and*executed*trajectories for safe navigation inMaze2D\-broad\. Each row corresponds to a planning method\.Figure 8:Comparison of*planned*and*executed*trajectories for safe navigation inMaze2D\-narrow\. Each row corresponds to a planning method\.Figure 9:Comparison of environment rollouts for different sampling schemes for safe manipulation in D3ILavoiding\.Figure 10:Visualizations of trajectories produced by different algorithms for theEmptyMMD environment\. Velocity is limited to80%80\\%of the maximum action\. Robots are rewarded for moving in a straight line, while avoiding collisions \(red crosses\) and velocity violations \(blue stars\)\.Figure 11:Visualizations of trajectories produced by different algorithms for theHighwaysMMD environment\. Velocity is limited to75%75\\%of the maximum action\. Robots are rewarded for moving*counterclockwise*around the central obstacle, while avoiding collisions \(red crosses\) and velocity violations \(blue stars\)\.Figure 12:Visualizations of trajectories produced by different algorithms for theConveyorMMD environment\. Velocity is limited to85%85\\%of the maximum action\. Robots are rewarded for moving*leftward*in the top corridor and*rightward*in the bottom one, while avoiding collisions \(red crosses\) and velocity violations \(blue stars\)\.Figure 13:Visualizations of trajectories produced by different algorithms for theDrop\-RegionMMD environment\. Velocity is limited to85%85\\%of the maximum action\. Robots are rewarded for*stopping*at the drop\-off regions, while avoiding collisions \(red crosses\) and velocity violations \(blue stars\)\.Figure 14:Visual comparison of different methods on coupledPushTgeneration with velocity constraints\. Each column corresponds to the first sample of different initializations with red crosses to denote constraint violations\. Only the first 64 executed steps are depicted to avoid clutter\.Similar Articles
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 is a new discrete diffusion planner for autonomous driving that uses reinforcement learning to enable self-editing of trajectory tokens, achieving high performance and low latency on the NAVSIM benchmark.
SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 introduces an online reinforcement learning framework using GRPO and a steering reward mechanism to improve safety in diffusion models without requiring supervised data or reward tuning, achieving state-of-the-art performance on multiple harm categories.
The Safety-Aware Denoiser for Text Diffusion Models
This paper introduces the Safety-Aware Denoiser (SAD), a framework for integrating safety constraints into text diffusion models during the denoising process. It aims to reduce unsafe generations while preserving quality, addressing a gap in safety research for non-autoregressive models.
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
GDSD proposes a reinforcement learning method that directly distills denoisers from advantage-guided self-teachers for diffusion language models, avoiding biases from ELBO-based likelihood surrogates. It achieves up to +19.6% accuracy improvements on planning, math, and coding benchmarks over prior state-of-the-art methods.
Spectral Guidance for Flexible and Efficient Control of Diffusion Models
Introduces Spectral Guidance, a framework for controlling diffusion models by leveraging low-dimensional representations of the diffusion process, enabling flexible and stable control without task-specific retraining or backpropagation through the denoiser.