Trust-Region Diffusion Policies for Massively Parallel On-Policy RL
Summary
Introduces TruDi, a method that enables training diffusion policies in massively parallel on-policy reinforcement learning by using a trust-region optimization rule to enforce KL constraints, achieving strong performance across 73 tasks.
# Trust-Region Diffusion Policies for Massively Parallel On-Policy RL
Source: [https://arxiv.org/html/2606.15260](https://arxiv.org/html/2606.15260)
Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann
###### Abstract
Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.
Machine Learning, Reinforcement Learning, Diffusion Policies, ICML
## 1 Introduction
Diffusion models (Ho et al., [2020](https://arxiv.org/html/2606.15260#bib.bib76); Sohl-Dickstein et al., [2015](https://arxiv.org/html/2606.15260#bib.bib64); Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)) have shown remarkable results in high-dimensional generative tasks, notably in domains where data from the target distribution is available, e.g., image generation (Ho et al., [2022](https://arxiv.org/html/2606.15260#bib.bib65); Saharia et al., [2022](https://arxiv.org/html/2606.15260#bib.bib66); Rombach et al., [2022](https://arxiv.org/html/2606.15260#bib.bib67)) or imitation learning (Chi et al., [2023](https://arxiv.org/html/2606.15260#bib.bib68); Zhou et al., [2024](https://arxiv.org/html/2606.15260#bib.bib55); Carvalho et al., [2025](https://arxiv.org/html/2606.15260#bib.bib69)). More recently, their strong representational properties have been explored in reinforcement learning (RL) (Celik et al., [2025](https://arxiv.org/html/2606.15260#bib.bib10); Le et al., [2025](https://arxiv.org/html/2606.15260#bib.bib34); Wang et al., [2024](https://arxiv.org/html/2606.15260#bib.bib48), [2023](https://arxiv.org/html/2606.15260#bib.bib38); Ding et al., [2024](https://arxiv.org/html/2606.15260#bib.bib92); Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32)), where the policy is represented by a diffusion model and is trained from scratch. Here, generating an action conditioned on an observation requires first running the diffusion process. The action after the last diffusion time step is then executed in the environment (Wang et al., [2023](https://arxiv.org/html/2606.15260#bib.bib38); Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32); Celik et al., [2025](https://arxiv.org/html/2606.15260#bib.bib10); Le et al., [2025](https://arxiv.org/html/2606.15260#bib.bib34)). Diffusion-based policies have been primarily integrated into off-policy RL settings to leverage the framework's data efficiency. This approach has yielded remarkable results with state-of-the-art performance on a wide range of benchmarks.
Recent advances in highly parallelized simulators (Mittal et al., [2025](https://arxiv.org/html/2606.15260#bib.bib25); Zakka et al., [2025](https://arxiv.org/html/2606.15260#bib.bib27); Tao et al., [2025](https://arxiv.org/html/2606.15260#bib.bib4); Hoang et al., [2026](https://arxiv.org/html/2606.15260#bib.bib93)) have led to a significant acceleration of RL training and to impressive sim-to-real transfer capabilities using Gaussian policy representations (Rudin et al., [2022](https://arxiv.org/html/2606.15260#bib.bib1); Zakka et al., [2025](https://arxiv.org/html/2606.15260#bib.bib27); Kumar et al., [2021](https://arxiv.org/html/2606.15260#bib.bib59); He et al., [2025](https://arxiv.org/html/2606.15260#bib.bib60)) in on-policy RL. Despite this breakthrough, training diffusion-based policies from scratch using on-policy RL has not been studied in the literature, for two main reasons. First, training diffusion-based policies is generally more expensive, because several diffusion steps are necessary to generate a single action. Second, essential statistics such as the likelihoods are not easily tractable for diffusion models (Zhou et al., [2024](https://arxiv.org/html/2606.15260#bib.bib55)). The latter is particularly important because trust-region constraints have proven essential in on-policy RL methods to avoid premature convergence (Peters et al., [2010a](https://arxiv.org/html/2606.15260#bib.bib18); Schulman et al., [2015](https://arxiv.org/html/2606.15260#bib.bib16), [2017](https://arxiv.org/html/2606.15260#bib.bib19); Abdolmaleki et al., [2015](https://arxiv.org/html/2606.15260#bib.bib17); Hoang et al., [2025](https://arxiv.org/html/2606.15260#bib.bib20)), but they require evaluating these essential statistics.
Moreover, recent work (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)) has demonstrated that explicitly learning a Q-function in the trust-region-constrained maximum entropy objective additionally improves performance in on-policy RL and significantly outperforms PPO (Schulman et al., [2017](https://arxiv.org/html/2606.15260#bib.bib19)), which is the common choice among practitioners. Learning this Q-function additionally provides gradient information for updating the policy, which is commonly used in diffusion-based inference methods (Berner et al., [2022](https://arxiv.org/html/2606.15260#bib.bib82); Vargas et al., [2023](https://arxiv.org/html/2606.15260#bib.bib84); Richter and Berner, [2024](https://arxiv.org/html/2606.15260#bib.bib85); Nusken et al., [2024](https://arxiv.org/html/2606.15260#bib.bib83)). These insights motivate the question of whether we can leverage the strong representational capacities of diffusion models in the trust-region-constrained maximum entropy on-policy RL setting.
This paper aims to answer this question by proposing Trust-region Diffusion Policies (TruDi). We build upon the probabilistic inference perspective on maximum entropy RL (Toussaint, [2009](https://arxiv.org/html/2606.15260#bib.bib51); Ziebart et al., [2008](https://arxiv.org/html/2606.15260#bib.bib50); Haarnoja et al., [2017](https://arxiv.org/html/2606.15260#bib.bib52); Levine, [2018](https://arxiv.org/html/2606.15260#bib.bib49)) and adopt the tractable lower bound for diffusion policies on this objective as proposed by Celik et al. ([2025](https://arxiv.org/html/2606.15260#bib.bib10)). While Celik et al. ([2025](https://arxiv.org/html/2606.15260#bib.bib10)) applied this formulation to off-policy learning, in on-policy RL, trust-region constraints have been well established for stable training (Peters et al., [2010b](https://arxiv.org/html/2606.15260#bib.bib53); Schulman et al., [2015](https://arxiv.org/html/2606.15260#bib.bib16), [2017](https://arxiv.org/html/2606.15260#bib.bib19)). However, these trust regions are non-trivial to enforce for diffusion policies due to their intractable marginal likelihoods (Zhou et al., [2024](https://arxiv.org/html/2606.15260#bib.bib55)). To address this, we derive a tractable upper bound on the marginal Kullback-Leibler (KL) divergence by constraining the divergence between the entire diffusion trajectories of the current and behavior policies. Constraining this upper bound effectively enforces a trust region over the whole generation process, ensuring the stability required for massively parallel on-policy learning. The resulting algorithm is a trust-region constrained on-policy RL method that is sample-efficient and requires only marginally longer wall-clock training time compared to Gaussian counterparts. Additionally, we demonstrate how to leverage the probability flow ODE (Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)) to generate high-return actions with a higher likelihood compared to those obtained using the SDE and other evaluation techniques in diffusion-based policies.
To summarize, our contributions are threefold: (i) we introduce a principled framework for training diffusion policies in the on-policy, massively parallel RL setting by enforcing a trust-region constraint over the full diffusion trajectory; (ii) we provide a comprehensive empirical evaluation across standard continuous-control and large-scale robotic benchmarks, showing that our method is competitive with strong Gaussian on-policy baselines on standard tasks while delivering clear gains on challenging high-dimensional humanoid control; and (iii) we present detailed analyses of key design choices, including the effect of the trust-region threshold, different sampling strategies for the policy evaluation (e.g., SDE/ODE/best-of-$K$), and multimodality on symmetric tasks, highlighting which components are most important for stable and effective training.
## 2 Related Work
Massively Parallel RL. The advent of GPU-accelerated simulators (Mittal et al., [2025](https://arxiv.org/html/2606.15260#bib.bib25); Tao et al., [2025](https://arxiv.org/html/2606.15260#bib.bib4); Hoang et al., [2026](https://arxiv.org/html/2606.15260#bib.bib93)) has shifted the computational bottleneck from data generation to policy learning. To leverage this throughput, research initially focused on scaling off-policy algorithms: pioneering efforts like Parallel Q-Learning (Li et al., [2023](https://arxiv.org/html/2606.15260#bib.bib7)) decoupled actor and learner processes, while recent state-of-the-art methods such as FastTD3 (Seo et al., [2025b](https://arxiv.org/html/2606.15260#bib.bib8)) and FastSAC (Seo et al., [2025a](https://arxiv.org/html/2606.15260#bib.bib70)) utilize large-batch updates with on-GPU replay buffers to minimize training latency. However, despite their speed, these off-policy approaches incur significant memory overheads, as maintaining large buffers in memory constrains the capacity for parallel environments and large network architectures. This has renewed interest in on-policy methods, which minimize memory footprint by consuming data immediately. This lineage spans from constrained optimization formulations like REPS (Peters et al., [2010a](https://arxiv.org/html/2606.15260#bib.bib18)) and MORE (Abdolmaleki et al., [2015](https://arxiv.org/html/2606.15260#bib.bib17)) to scalable deep RL approximations like PPO (Schulman et al., [2017](https://arxiv.org/html/2606.15260#bib.bib19)) and differentiable projection layers (Otto et al., [2021](https://arxiv.org/html/2606.15260#bib.bib15); Li et al., [2024a](https://arxiv.org/html/2606.15260#bib.bib21)).
Building on these foundations, REPPO (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)) stabilizes pathwise policy gradients via primal-dual trust-region updates, rigorously enforcing the trust-region constraint via dual ascent, which allows on-policy RL to match off-policy sample efficiency without the associated memory bottlenecks. While this primal-dual framework is straightforward for Gaussian policies with exact action likelihoods, applying it to diffusion policies is fundamentally bottlenecked by their intractable marginals. TruDi overcomes this barrier by introducing a novel trajectory-level trust region, enabling highly expressive, multimodal diffusion policies for massively parallel on-policy RL.
Diffusion-based policies in RL. Early research on diffusion models in reinforcement learning primarily focused on the offline setting (Wang et al., [2023](https://arxiv.org/html/2606.15260#bib.bib38); Janner et al., [2022](https://arxiv.org/html/2606.15260#bib.bib37); Chen et al., [2023](https://arxiv.org/html/2606.15260#bib.bib39)). These works utilized diffusion models either as high-fidelity trajectory generators (Janner et al., [2022](https://arxiv.org/html/2606.15260#bib.bib37)) or as expressive policy priors to regularize behavior in static datasets (Hansen-Estruch et al., [2023](https://arxiv.org/html/2606.15260#bib.bib40); Kang et al., [2023](https://arxiv.org/html/2606.15260#bib.bib41); Lu et al., [2023](https://arxiv.org/html/2606.15260#bib.bib42); Mao et al., [2024](https://arxiv.org/html/2606.15260#bib.bib43); Fang et al., [2025](https://arxiv.org/html/2606.15260#bib.bib44)). The success of these methods catalyzed the development of online diffusion-based RL. Initial approaches, such as DIPO (Yang et al., [2023](https://arxiv.org/html/2606.15260#bib.bib45)) and its multimodal extension (Li et al., [2024b](https://arxiv.org/html/2606.15260#bib.bib46)), employed behavior cloning updates with Q-function guidance but relied on intrinsic stochasticity for exploration. Subsequent methods like QSM (Psenka et al., [2024](https://arxiv.org/html/2606.15260#bib.bib47)) and QVPO (Ding et al., [2024](https://arxiv.org/html/2606.15260#bib.bib92)) streamlined optimization by matching scores or weighting diffusion losses with Q-values. However, these methods often disregarded policy entropy, necessitating ad-hoc exploration heuristics like Gaussian noise injection (Psenka et al., [2024](https://arxiv.org/html/2606.15260#bib.bib47)) or uniform sampling (Ding et al., [2024](https://arxiv.org/html/2606.15260#bib.bib92)).
DACER (Wang et al., [2024](https://arxiv.org/html/2606.15260#bib.bib48)) attempted to address this with an entropy regulator but relied on approximate Gaussian Mixture Models. DIME (Celik et al., [2025](https://arxiv.org/html/2606.15260#bib.bib10)) introduced diffusion models into the Maximum Entropy RL framework for continuous control. In parallel, HyDo (Le et al., [2025](https://arxiv.org/html/2606.15260#bib.bib34)) leveraged a maximum entropy formulation for hybrid action spaces in manipulation tasks. However, despite their theoretical rigor, these methods generally operate in the off-policy regime. Their reliance on large experience replay buffers creates substantial memory bottlenecks, making it difficult to scale to the massively parallel simulation environments required for modern robot learning (Seo et al., [2025b](https://arxiv.org/html/2606.15260#bib.bib8), [a](https://arxiv.org/html/2606.15260#bib.bib70)).
To overcome these scalability bottlenecks, a recent wave of works has shifted toward the on-policy setting. One line of research retains the diffusion formulation, adapting trust-region or mirror descent methods to handle stochastic chains. For instance, DPPO (Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32)) fine-tunes diffusion policies using PPO, while GenPO (Ding et al., [2025](https://arxiv.org/html/2606.15260#bib.bib33)) approximates likelihoods to enable learning from scratch. This direction has also extended to discrete combinatorial spaces, where Ma et al. ([2025](https://arxiv.org/html/2606.15260#bib.bib28)) utilized Policy Mirror Descent to optimize discrete diffusion policies. A parallel direction leverages flow matching to bypass the intractability of diffusion likelihoods, as seen in FPO (McAllister et al., [2025](https://arxiv.org/html/2606.15260#bib.bib31)), FlowRL (Lv et al., [2025](https://arxiv.org/html/2606.15260#bib.bib29)), and ReinFlow (Zhang et al., [2026](https://arxiv.org/html/2606.15260#bib.bib30)). However, standard flow-based methods often lack explicit entropy regularization, and concurrent diffusion approaches like DPPO (Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32)) and GenPO (Ding et al., [2025](https://arxiv.org/html/2606.15260#bib.bib33)) rely on heuristic per-step clipping or complex invertible mappings. TruDi derives a principled trust-region formulation over the full diffusion trajectory within the Maximum Entropy RL framework, utilizing a tractable trajectory-level KL bound to enable stable, Q-based pathwise policy updates.
## 3 Preliminaries
Notation. We consider a Markov decision process defined by the tuple $(\mathcal{S},\mathcal{A},r,p,\rho_{\pi},\gamma)$, where we aim to optimize a probabilistic policy $\pi_{\theta}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+}$ that is defined in continuous state and action spaces denoted by $\mathcal{S}$ and $\mathcal{A}$, respectively, and is parameterized with parameters $\theta$. The objective function $J(\pi_{\theta})$ is defined as the discounted sum of expected future rewards, where $\gamma\in[0,1)$ denotes the discount factor, and $r:\mathcal{S}\times\mathcal{A}\rightarrow[r_{\text{min}},r_{\text{max}}]$ is a bounded reward function. In state $s_{t}\in\mathcal{S}$ at time step $t$, the agent transitions to state $s_{t+1}\in\mathcal{S}$ by executing an action $a_{t}$ that is sampled from the policy $\pi_{\theta}$. The probability density of transitioning to the state $s_{t+1}$ is denoted by $p:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+}$. We follow prior work (Haarnoja et al., [2018a](https://arxiv.org/html/2606.15260#bib.bib5)) and overload the notation for the state and state-action distributions induced by the policy by referring to both distributions as $\rho_{\pi}$.
Maximum Entropy Reinforcement Learning. To control the exploration-exploitation tradeoff, maximum entropy RL (MaxEnt RL) (Ziebart et al., [2008](https://arxiv.org/html/2606.15260#bib.bib50); Toussaint, [2009](https://arxiv.org/html/2606.15260#bib.bib51); Haarnoja et al., [2017](https://arxiv.org/html/2606.15260#bib.bib52)) augments the reward in each time step with the entropy of the current policy $\mathcal{H}(\pi_{\theta}(a\mid s))=-\int\pi_{\theta}(a\mid s)\log\pi_{\theta}(a\mid s)\,\mathrm{d}a$, resulting in the objective

$$J(\pi_{\theta})=\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\rho_{\pi}}\left[r_{t}+\alpha\mathcal{H}(\pi_{\theta}(a_{t}\mid s_{t}))\right],\tag{1}$$

where $\alpha\in[0,\infty)$ is an entropy scaling factor. Note that we denote the reward at time step $t$ as $r_{t}\triangleq r(s_{t},a_{t})$. We similarly define the entropy-augmented Q-function under the current policy $\pi_{\theta}$ as

$$Q^{\pi_{\theta}}(s_{t},a_{t})=r_{t}+\sum_{l=1}^{\infty}\gamma^{l}\,\mathbb{E}_{\rho_{\pi}}\left[r_{t+l}+\alpha\mathcal{H}\big(\pi_{\theta}(a_{t+l}\mid s_{t+l})\big)\right].\tag{2}$$

In the context of policy iteration, the policy improvement step can be formulated as an approximate inference problem (Haarnoja et al., [2017](https://arxiv.org/html/2606.15260#bib.bib52)). Given the Q-function of the previous policy $\pi_{\text{old}}$, we seek to find a new policy $\pi_{\theta}$ that minimizes the KL divergence to the Boltzmann distribution induced by $Q^{\pi_{\text{old}}}$:

$$\mathcal{L}(\pi_{\theta})=D_{\text{KL}}\left(\pi_{\theta}(a_{t}\mid s_{t})\,\Big\|\,\frac{\exp\left(Q^{\pi_{\text{old}}}(s_{t},a_{t})/\alpha\right)}{\mathcal{Z}^{\pi_{\text{old}}}(s_{t})}\right),\tag{3}$$

where $\mathcal{Z}^{\pi_{\text{old}}}(s_{t})=\int\exp\left(Q^{\pi_{\text{old}}}(s_{t},a_{t})/\alpha\right)\mathrm{d}a_{t}$ is the partition function. Minimizing this KL divergence projects the parametric policy $\pi_{\theta}$ onto the target Boltzmann distribution, thereby guaranteeing an improvement in the MaxEnt objective in Eq. [1](https://arxiv.org/html/2606.15260#S3.E1).
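To make the projection in Eq. 3 concrete, the following minimal sketch works through a discrete-action analogue, where the integral in the partition function becomes a sum. The three Q-values, the temperature, and all function names are illustrative assumptions, not part of the paper. It checks the identity $\alpha\,D_{\text{KL}}(\pi\,\|\,\text{target})=\alpha\log\mathcal{Z}-\big(\mathbb{E}_{\pi}[Q]+\alpha\mathcal{H}(\pi)\big)$, which is why minimizing the KL divergence maximizes the entropy-regularized value at a given state.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Discrete analogue of exp(Q / alpha) / Z, computed as a stable softmax."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_objective(p, q_values, alpha):
    """E_p[Q] + alpha * H(p): the per-state quantity the KL projection maximizes."""
    return sum(pi * q for pi, q in zip(p, q_values)) + alpha * entropy(p)

# Hypothetical Q-values of the old policy for three actions (assumption).
q_old = [1.0, 2.0, 0.5]
alpha = 0.5

target = boltzmann_policy(q_old, alpha)
uniform = [1.0 / 3.0] * 3

# alpha * KL(p || target) = alpha * log Z - (E_p[Q] + alpha * H(p))
log_z = math.log(sum(math.exp(q / alpha) for q in q_old))
gap = alpha * log_z - soft_objective(uniform, q_old, alpha)

print("target policy:", [round(p, 4) for p in target])
print("alpha * KL(uniform || target):", alpha * kl(uniform, target))
print("gap from identity:", gap)
```

The Boltzmann target attains zero KL and therefore the largest soft objective at this state; any other distribution, such as the uniform one here, falls short by exactly $\alpha$ times its KL divergence to the target.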
Denoising Diffusion Policy. We consider the state-conditioned variance preserving (VP) stochastic differential equation (SDE) in which the forward or noising process is given by an Ornstein-Uhlenbeck (OU) process (Särkkä and Solin, [2019](https://arxiv.org/html/2606.15260#bib.bib56)) defined by the SDE

$$\mathrm{d}a^{\tau}=-\beta^{\tau}a^{\tau}\,\mathrm{d}\tau+\eta\sqrt{2\beta^{\tau}}\,\mathrm{d}B^{\tau},\quad a^{0}\sim\vec{\pi}^{0}(\cdot\mid s),\tag{4}$$

with Brownian motion $(B^{\tau})^{\tau\in[0,T]}$, diffusion coefficient $\beta:[0,T]\rightarrow\mathbb{R}^{+}$ and the prior distribution's standard deviation $\eta$. (Note that we deviate from the standard notation and denote the diffusion time parameter by the superscript $\tau$ in order to distinguish it from the MDP time step, which we denote by the subscript $t$.) Given a state $s$, the forward process starts from a target policy $a^{0}\sim\vec{\pi}^{0}(\cdot\mid s)$ at $\tau=0$ and continuously adds noise, such that (for large enough $T$) the marginal distribution defined by the SDE is given by $\vec{\pi}^{T}(\cdot\mid s)\approx\mathcal{N}(0,\eta^{2}I)$.

The backward (or generative) process corresponding to the SDE in Eq. [4](https://arxiv.org/html/2606.15260#S3.E4) is given by

$$\mathrm{d}a^{\tau}=\left(-\beta^{\tau}a^{\tau}-2\eta^{2}\beta^{\tau}\nabla_{a}\log\vec{\pi}^{\tau}(a^{\tau}\mid s)\right)\mathrm{d}\tau+\eta\sqrt{2\beta^{\tau}}\,\mathrm{d}B^{\tau},\tag{5}$$

which starts from Gaussian noise at $\tau=T$ and gradually transforms it into samples from the target action distribution at $\tau=0$. In practice, the score $\nabla_{a}\log\vec{\pi}^{\tau}(a^{\tau}\mid s)$ is unknown and approximated by a neural network $f_{\theta}^{\tau}(a^{\tau},s)$. After training, simulating the reverse process yields a sample $a^{0}$, which we execute as the action $a\sim\pi_{\theta}(\cdot\mid s)$.
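As a quick sanity check on the forward process in Eq. 4, the sketch below propagates the mean and variance of an Euler-Maruyama discretization for a scalar action and compares them with the closed-form OU moments. It assumes a constant diffusion coefficient $\beta$ (the paper allows a time-dependent one), and the function names and numerical values are illustrative only.

```python
import math

def euler_maruyama_moments(a0, beta, eta, T, num_steps):
    """Mean and variance of the Euler-Maruyama discretization of
    da = -beta * a dtau + eta * sqrt(2 * beta) dB, started at the point a0.
    Assumes a constant beta (simplification of the time-dependent schedule)."""
    delta = T / num_steps
    mean, var = a0, 0.0
    for _ in range(num_steps):
        # One step: a <- (1 - beta * delta) * a + eta * sqrt(2 * beta * delta) * xi
        mean = (1.0 - beta * delta) * mean
        var = (1.0 - beta * delta) ** 2 * var + 2.0 * beta * delta * eta ** 2
    return mean, var

def exact_moments(a0, beta, eta, T):
    """Closed-form OU moments for constant beta."""
    mean = a0 * math.exp(-beta * T)
    var = eta ** 2 * (1.0 - math.exp(-2.0 * beta * T))
    return mean, var

mean_em, var_em = euler_maruyama_moments(a0=2.0, beta=1.0, eta=0.5, T=8.0, num_steps=8000)
mean_ex, var_ex = exact_moments(a0=2.0, beta=1.0, eta=0.5, T=8.0)

print("discretized mean/var:", mean_em, var_em)
print("closed-form mean/var:", mean_ex, var_ex)
```

For a sufficiently long horizon the mean decays towards zero and the variance approaches $\eta^{2}$, which matches the statement that the terminal marginal is approximately $\mathcal{N}(0,\eta^{2}I)$ regardless of the starting action.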
Discretizing the Diffusion Policy. The discretization implies Gaussian transition kernels between successive diffusion steps, and the corresponding trajectory distributions factorize as

$$\vec{\pi}^{0:N}(a^{0:N}\mid s)=\vec{\pi}^{0}(a^{0}\mid s)\prod_{n=0}^{N-1}\vec{\pi}^{n+1\mid n}(a^{n+1}\mid a^{n},s),\tag{6}$$

$$\overleftarrow{\pi}_{\theta}^{0:N}(a^{0:N}\mid s)=\overleftarrow{\pi}^{N}(a^{N}\mid s)\prod_{n=1}^{N}\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a^{n-1}\mid a^{n},s),\tag{31}$$

where $\overleftarrow{\pi}^{N}(\cdot\mid s)$ denotes the (fixed) Gaussian prior at diffusion step $N$. The executed action distribution is the marginal of the reverse chain, $\pi_{\theta}(a\mid s)\equiv\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s)$, obtained by integrating out the latent trajectory $a^{1:N}$. These tractable trajectory factorizations will be central in the next section, where we derive tractable bounds for MaxEnt policy improvement and trust-region constraints in joint trajectory space.
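The factorization of the reverse chain means that the log-density of a full denoising trajectory is a sum of Gaussian log-densities, even though the marginal over $a^{0}$ alone is intractable. The sketch below illustrates this for a scalar action. The denoising mean `toy_mean_fn`, the per-step standard deviations `step_stds`, and the example trajectory are hypothetical placeholders; in the paper the kernel means come from the learned score network.

```python
import math

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def reverse_trajectory_logprob(traj, state, mean_fn, step_stds, eta):
    """Log-density of a reverse trajectory traj = [a^0, a^1, ..., a^N] for a
    scalar action: log prior(a^N) + sum_n log kernel(a^{n-1} | a^n, s)."""
    N = len(traj) - 1
    logp = gaussian_logpdf(traj[N], 0.0, eta)  # fixed Gaussian prior at step N
    for n in range(N, 0, -1):                  # kernels a^n -> a^{n-1}
        mean = mean_fn(traj[n], state, n)
        logp += gaussian_logpdf(traj[n - 1], mean, step_stds[n - 1])
    return logp

def toy_mean_fn(a_n, state, n):
    """Hypothetical stand-in for the learned denoising mean (assumption)."""
    return 0.9 * a_n + 0.1 * state

traj = [0.3, 0.5, -0.2]  # a^0, a^1, a^2, so N = 2
logp = reverse_trajectory_logprob(
    traj, state=1.0, mean_fn=toy_mean_fn, step_stds=[0.2, 0.3], eta=1.0
)
print("trajectory log-probability:", logp)
```

Every term is available in closed form, which is what makes trajectory-level quantities convenient to work with compared to the marginal likelihood of $a^{0}$.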
## 4 Trust-region Diffusion Policies
A common approach for preventing premature convergence and stabilizing the training in on-policy reinforcement learning is employing a trust region on the policy $\pi_{\theta}$ (Peters et al., [2010b](https://arxiv.org/html/2606.15260#bib.bib53); Schulman et al., [2015](https://arxiv.org/html/2606.15260#bib.bib16); Otto et al., [2021](https://arxiv.org/html/2606.15260#bib.bib15); Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)), leading to the constrained optimization problem

$$\max_{\theta}\quad\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\rho_{\pi}}\left[r_{t}+\alpha\mathcal{H}\big(\overleftarrow{\pi}^{0}_{\theta}(a^{0}_{t}\mid s_{t})\big)\right]\tag{40}$$

$$\text{s.t.}\quad\mathbb{E}_{\rho_{\pi}}\left[D_{\mathrm{KL}}\left(\overleftarrow{\pi}_{\text{old}}^{0}(a^{0}_{t}\mid s_{t})\,\Big\|\,\overleftarrow{\pi}^{0}_{\theta}(a^{0}_{t}\mid s_{t})\right)\right]\leq\epsilon,\tag{57}$$

where the policy parameters $\theta$ are updated such that the maximum entropy RL objective is maximized under the constraint that the KL divergence between the old policy $\overleftarrow{\pi}_{\text{old}}^{0}=\overleftarrow{\pi}^{0}_{\theta_{\text{old}}}$ and the current policy $\overleftarrow{\pi}^{0}_{\theta}$ is bounded by $\epsilon\in\mathbb{R}^{+}$. Both the objective in Eq. [40](https://arxiv.org/html/2606.15260#S4.E40) and the constraint in Eq. [57](https://arxiv.org/html/2606.15260#S4.E57) require calculating the likelihood of the marginal distribution $\overleftarrow{\pi}^{0}_{\theta}(a^{0}_{t}\mid s_{t})$, which is hard to calculate for diffusion models (Zhou et al., [2024](https://arxiv.org/html/2606.15260#bib.bib55)).
To guide the reader, we briefly outline the main steps of this section. We first interpret the denoising process (Eq. [31](https://arxiv.org/html/2606.15260#S3.E31)) as a latent-variable model (Luo, [2022](https://arxiv.org/html/2606.15260#bib.bib75); Ho et al., [2020](https://arxiv.org/html/2606.15260#bib.bib76)), which yields a tractable lower bound on the objective. Based on this view, we derive a tractable upper bound for the KL constraint in Eq. [57](https://arxiv.org/html/2606.15260#S4.E57) and use it to formulate our final optimization problem. We then describe the resulting policy update and parameter learning procedure for TruDi. Finally, we introduce a deterministic evaluation scheme based on the probability-flow ordinary differential equation (ODE) associated with the diffusion policy.
### 4.1 Diffusion Policies as Latent Variable Models in MaxEnt RL
The discretized processes in Eq. [6](https://arxiv.org/html/2606.15260#S3.E6) and Eq. [31](https://arxiv.org/html/2606.15260#S3.E31) motivate viewing diffusion policies as latent variable models (Luo, [2022](https://arxiv.org/html/2606.15260#bib.bib75)) in which the final action $a^{0}$ of the diffusion process is the result of sampling from the marginal policy
$$\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s)=\int\overleftarrow{\pi}_{\theta}^{0:N}(a^{0:N}\mid s)\,\mathrm{d}a^{1:N}, \tag{74}$$
where $a^{1:N}$ are considered latent variables. Evaluating the likelihood in Eq. [74](https://arxiv.org/html/2606.15260#S4.E74) is not straightforward, which renders the commonly used approximate inference scheme (Eq. [3](https://arxiv.org/html/2606.15260#S3.E3)) for maximum entropy RL intractable for diffusion-based policies. However, we can obtain a tractable upper bound by applying the data processing inequality (Cover, [1999](https://arxiv.org/html/2606.15260#bib.bib57))
$$\begin{align}
D_{\text{KL}}\Big(\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s)\,\|\,\vec{\pi}^{0}(a^{0}\mid s)\Big) & \tag{83}\\
\leq D_{\text{KL}}\Big(\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)\,\|\,\vec{\pi}(a^{0:N}\mid s)\Big), & \tag{92}
\end{align}$$
where $\vec{\pi}^{0}(a^{0}\mid s)=\frac{\exp\left(Q_{\phi}^{\overleftarrow{\pi}}(s,a^{0})/\alpha\right)}{\mathcal{Z}^{\overleftarrow{\pi}}(s)}$. Instead of matching the action marginals (LHS of the inequality), this upper bound provides an objective to match the denoising process (Eq. [31](https://arxiv.org/html/2606.15260#S3.E31)) with the noising process (Eq. [6](https://arxiv.org/html/2606.15260#S3.E6)).
In other words, aligning the denoising process $\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)$ with the noising process $\vec{\pi}(a^{0:N}\mid s)$ minimizes an upper bound on the approximate inference objective and hence allows maximizing the maximum entropy RL objective in Eq. [40](https://arxiv.org/html/2606.15260#S4.E40). Celik et al. ([2025](https://arxiv.org/html/2606.15260#bib.bib10)), Berner et al. ([2022](https://arxiv.org/html/2606.15260#bib.bib82)), and Nusken et al. ([2024](https://arxiv.org/html/2606.15260#bib.bib83)) demonstrated that the upper bound in Eq. [92](https://arxiv.org/html/2606.15260#S4.E92) leads to the lower bound on the marginal entropy $\mathcal{H}(\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s))\geq l_{\overleftarrow{\pi}_{\theta}}(a^{0},s)$, where
$$l_{\overleftarrow{\pi}_{\theta}}(a^{0},s)=\mathbb{E}_{\overleftarrow{\pi}_{\theta}^{0:N}}\left[\log\frac{\vec{\pi}^{1:N|0}(a^{1:N}\mid a^{0},s)}{\overleftarrow{\pi}_{\theta}^{0:N}(a^{0:N}\mid s)}\right], \tag{117}$$
which redefines the Q function for diffusion-based policies
$$Q_{\phi}^{\overleftarrow{\pi}}(s_{t},a^{0}_{t})=r_{t}+\sum_{l=1}^{\infty}\gamma^{l}\,\mathbb{E}_{\rho_{\pi}}\left[r_{t+l}+\alpha\, l_{\overleftarrow{\pi}_{\theta}}(a^{0}_{t+l},s_{t+l})\right]. \tag{118}$$
Crucially, this latent variable model view allows us to impose trust-region constraints directly on the joint trajectory distributions, yielding a tractable and principled alternative to marginal KL constraints (Eq. [57](https://arxiv.org/html/2606.15260#S4.E57)), which are otherwise intractable for diffusion policies. We develop this idea in the next section.
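As a worked step, the entropy bound in Eq. 117 can be expanded into per-step terms. The expansion below is a sketch that assumes both processes factorize as Markov chains (in line with the discretized processes referenced as Eq. 6 and Eq. 31, though the factorization is not restated in this section), i.e. $\overleftarrow{\pi}_{\theta}^{0:N}(a^{0:N}\mid s)=\overleftarrow{\pi}^{N}(a^{N}\mid s)\prod_{n=1}^{N}\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a^{n-1}\mid a^{n},s)$ and $\vec{\pi}^{1:N|0}(a^{1:N}\mid a^{0},s)=\prod_{n=1}^{N}\vec{\pi}^{n\mid n-1}(a^{n}\mid a^{n-1},s)$. Taking the logarithm of the ratio in Eq. 117 then gives
$$l_{\overleftarrow{\pi}_{\theta}}(a^{0},s)=\mathbb{E}_{\overleftarrow{\pi}_{\theta}^{0:N}}\left[-\log\overleftarrow{\pi}^{N}(a^{N}\mid s)+\sum_{n=1}^{N}\log\frac{\vec{\pi}^{n\mid n-1}(a^{n}\mid a^{n-1},s)}{\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a^{n-1}\mid a^{n},s)}\right].$$
Every term is a log-density of either the prior or a single transition kernel, so under these assumptions the bound can be estimated from sampled denoising chains without evaluating the marginal in Eq. 74.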
### 4.2 Enforcing Trust-Region for Diffusion Policies
Similar to the entropy in Eq. [40](https://arxiv.org/html/2606.15260#S4.E40), the trust region in Eq. [57](https://arxiv.org/html/2606.15260#S4.E57) requires evaluating the marginal likelihood (Eq. [74](https://arxiv.org/html/2606.15260#S4.E74)), which is not straightforward to calculate. However, instead of constraining the policy update using Eq. [57](https://arxiv.org/html/2606.15260#S4.E57), which measures the KL between the marginal action distributions, we propose using an upper bound that measures the information loss on the joint distributions of the denoising processes
$$\begin{align}
D_{\text{KL}}\Big(\overleftarrow{\pi}_{\text{old}}^{0}(a^{0}\mid s)\,\|\,\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s)\Big) & \tag{135}\\
\leq D_{\text{KL}}\Big(\overleftarrow{\pi}_{\text{old}}^{0:N}(a^{0:N}\mid s)\,\|\,\overleftarrow{\pi}^{0:N}_{\theta}(a^{0:N}\mid s)\Big). & \tag{152}
\end{align}$$
A derivation is provided in Appendix [A](https://arxiv.org/html/2606.15260#A1). This upper bound can now easily be approximated by samples, as the likelihoods are with respect to the tractable joint distribution (Eq. [31](https://arxiv.org/html/2606.15260#S3.E31)).
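To make the inequality between the marginal and joint KL concrete, the following is a minimal numerical sketch on a toy problem, not the paper's implementation. It assumes a hypothetical two-step linear-Gaussian chain, $a^{1}\sim\mathcal{N}(0,1)$ and $a^{0}\mid a^{1}\sim\mathcal{N}(w\,a^{1},\sigma^{2})$, where both KL terms have closed forms; the names `kl_gauss` and `marginal_and_joint_kl` are illustrative only.

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def marginal_and_joint_kl(w_old, s_old, w_new, s_new):
    """Toy chain (assumption): a1 ~ N(0, 1) is a shared prior,
    a0 | a1 ~ N(w * a1, s^2) is the single 'denoising' step."""
    # Marginal of a0 is N(0, w^2 + s^2) for each policy.
    marginal = kl_gauss(0.0, w_old**2 + s_old**2, 0.0, w_new**2 + s_new**2)
    # Joint KL = KL of priors (zero, shared) + E_{a1}[KL of conditionals].
    # The conditional means differ by (w_old - w_new) * a1 and E[a1^2] = 1.
    joint = 0.5 * (np.log(s_new**2 / s_old**2)
                   + (s_old**2 + (w_old - w_new) ** 2) / s_new**2 - 1.0)
    return marginal, joint

marginal, joint = marginal_and_joint_kl(w_old=1.0, s_old=0.5, w_new=1.2, s_new=0.6)
print(f"marginal KL = {marginal:.4f}, joint KL = {joint:.4f}")
```

In this toy case the joint KL only needs the per-step conditional densities, which mirrors why the joint bound is the tractable quantity for a diffusion policy.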
Using the upper bounds for the objective \(Eq\.[92](https://arxiv.org/html/2606.15260#S4.E92)\) and the constraint \(Eq\.[135](https://arxiv.org/html/2606.15260#S4.E135)\), we can now formulate our final optimization problem as
$$\min_{\theta}\quad \mathbb{E}_{\rho^{\pi}}\left[D_{\text{KL}}\left(\overleftarrow{\pi}_{\theta}^{0:N}(a^{0:N}\mid s)\,\|\,\vec{\pi}^{0:N}(a^{0:N}\mid s)\right)\right] \tag{161}$$
$$\text{s.t.}\quad \mathbb{E}_{\rho^{\pi}}\left[D_{\text{KL}}\left(\overleftarrow{\pi}_{\text{old}}^{0:N}(a^{0:N}\mid s)\,\|\,\overleftarrow{\pi}^{0:N}_{\theta}(a^{0:N}\mid s)\right)\right]\leq\epsilon. \tag{178}$$
### 4.3 Policy and Critic Optimization
We are now ready to describe the TruDi update rules. Building on the formulation in Section [4.2](https://arxiv.org/html/2606.15260#S4.SS2), we follow the standard actor-critic policy iteration framework of Haarnoja et al. ([2018a](https://arxiv.org/html/2606.15260#bib.bib5)) and adapt it to the on-policy setting, as in REPPO (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)).
Fitting the Q-function. For optimizing the parameters $\phi$ of the Q-function in Eq. [118](https://arxiv.org/html/2606.15260#S4.E118), we rely on recent insights for on-policy RL methods (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)). More precisely, we employ TD-$\lambda$ (Sutton, [1988](https://arxiv.org/html/2606.15260#bib.bib74)) for generating the soft target values corresponding to the current policy $\pi_{\theta}$:
$$G^{\lambda}(s,a)=\frac{1}{\sum_{n=0}^{N}\lambda^{n}}\sum_{n=0}^{N}\lambda^{n}G^{(n)}(s,a). \tag{179}$$
This represents a Monte-Carlo estimate of the soft Q-function generated with the data from the current policy $\pi_{\theta}$. Here, $G^{(n)}$ is defined as
$$\begin{split}G^{(n)}(s_{t},a_{t})=\sum_{k=t}^{n-1}\gamma^{k}\bigg(&r(s_{k},a_{k})-\alpha\log\overleftarrow{\pi}^{N}(a_{k}^{N}\mid s_{k})\\ &+\alpha\sum_{n=1}^{N}\log\frac{\vec{\pi}^{n|n-1}(a_{k}^{n}\mid a_{k}^{n-1},s_{k})}{\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a_{k}^{n-1}\mid a_{k}^{n},s_{k})}\bigg)+\gamma^{n}Q(s_{n},a_{n}).\end{split} \tag{180}$$
The entropy terms inside the parentheses are the per-step expansion of $\alpha\, l_{\overleftarrow{\pi}_{\theta}}$ from Eq. [117](https://arxiv.org/html/2606.15260#S4.E117). For fitting the Q-function's parameters $\phi$, we use HL-Gauss (Imani and White, [2018](https://arxiv.org/html/2606.15260#bib.bib81); Farebrother et al., [2024](https://arxiv.org/html/2606.15260#bib.bib80)), which relies on a cross-entropy loss rather than the commonly used squared Bellman residual, which is sensitive to outliers (Farebrother et al., [2024](https://arxiv.org/html/2606.15260#bib.bib80); Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)).
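The normalized TD-$\lambda$ average in Eq. 179 can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes a single trajectory starting at time 0, a precomputed per-step entropy-bound sample `entropy_bonus[k]` standing in for the bracketed entropy terms of Eq. 180, and a critic value `q_boot[n]` for bootstrapping after `n` steps. All function and argument names are hypothetical.

```python
import numpy as np

def n_step_soft_return(rewards, entropy_bonus, q_boot, gamma, alpha, n):
    """n-step soft return from time 0: discounted (reward + alpha * bonus)
    over the first n steps, then a bootstrap with the critic value q_boot[n]."""
    k = np.arange(n)
    soft_rewards = rewards[:n] + alpha * entropy_bonus[:n]
    return np.sum(gamma ** k * soft_rewards) + gamma ** n * q_boot[n]

def td_lambda_target(rewards, entropy_bonus, q_boot, gamma, alpha, lam):
    """Normalized lambda-weighted average of the n-step returns, n = 0..N."""
    N = len(rewards)
    returns = np.array([
        n_step_soft_return(rewards, entropy_bonus, q_boot, gamma, alpha, n)
        for n in range(N + 1)
    ])
    weights = lam ** np.arange(N + 1)
    return np.sum(weights * returns) / np.sum(weights)

# Tiny example: three steps of reward 1, no entropy bonus, zero critic values.
target = td_lambda_target(
    rewards=np.ones(3), entropy_bonus=np.zeros(3), q_boot=np.zeros(4),
    gamma=1.0, alpha=0.1, lam=1.0,
)
print(target)
```

With `lam = 0` only the zero-step term survives and the target collapses to the critic value, while larger `lam` shifts weight toward longer rollouts.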

Figure 1 (panels: (a) SDE; (b) ODE for $c=0.5$; (c) ODE for $c=1.0$): Scaling the score function modifies the marginal distributions in each time step, inducing greedier sampling at higher scaling values. (a) visualizes the trajectories (white) of the (unscaled) denoising process starting at the Gaussian prior (right) and generating samples from the target distribution (left). (b) visualizes the trajectories (white) of the ODE whose marginal distributions are the same as the SDE in (a). This alignment results from scaling the score with $c=0.5$ (Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)). Higher values for $c$ result in sharper marginal distributions, leading to more greedy samples, as shown by the trajectories of the ODE for $c=1.0$ in (c). During the evaluation of a diffusion-based policy, simulating the ODE with a scaled score function leads to higher returns.
Policy Update. For a fixed Lagrangian multiplier $\lambda$ (distinct from the TD-$\lambda$ trace parameter used above) for the constraint in Eq. [178](https://arxiv.org/html/2606.15260#S4.E178) and a fixed entropy-scaling parameter $\alpha$, recent on-policy approaches (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)) have suggested a split objective
$$\mathcal{L}_{\mathrm{TR}}(\theta\mid\lambda,\alpha,\{s_{i}\}_{i=1}^{B})=\frac{1}{B}\sum_{i=1}^{B}\begin{cases}h(s_{i},a^{0}), & \text{if}~~c(s_{i})\leq\epsilon\\ g(s_{i},\lambda), & \text{otherwise}.\end{cases} \tag{181}$$
This objective first checks whether the trust region is violated by evaluating
$$c(s_{i})=\frac{1}{K}\sum_{j=1}^{K}\sum_{n=1}^{N}\log\frac{\overleftarrow{\pi}_{\text{old}}^{n-1\mid n}(a_{j}^{n-1}\mid a_{j}^{n},s_{i})}{\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a_{j}^{n-1}\mid a_{j}^{n},s_{i})},$$
which is a sample-based approximation of the trust region upper bound Eq. [135](https://arxiv.org/html/2606.15260#S4.E135) with $K$ action samples $a\sim\pi_{\theta}(\cdot\mid s_{i})$ from the current policy. For the case $c(s_{i})\leq\epsilon$ (trust region not violated), this objective considers the lower bound on the maximum entropy RL objective
$$\begin{split}h(s_{i},a^{0})&=\alpha\log\overleftarrow{\pi}^{N}(a^{N}\mid s_{i})-Q^{\overleftarrow{\pi}}_{\phi}(s_{i},a^{0})\\ &+\alpha\sum_{n=1}^{N}\log\frac{\overleftarrow{\pi}_{\theta}^{n-1\mid n}(a^{n-1}\mid a^{n},s_{i})}{\vec{\pi}^{n\mid n-1}(a^{n}\mid a^{n-1},s_{i})}.\end{split} \tag{182}$$
Recall that $a^{0}\sim\pi_{\theta}(\cdot\mid s_{i})$. Finally, for the case $c(s_{i})>\epsilon$ (trust region violated), the parameters $\theta$ of the policy are updated purely based on
$$g(s_{i},\lambda)=\lambda\, c(s_{i}),$$
where $c(s_{i})$ is the trust-region estimate from Eq. [4.3](https://arxiv.org/html/2606.15260#S4.Ex1).
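The per-state switch in Eq. 181 can be sketched in a few lines. This is a forward-pass illustration only, under the assumption that the per-step log-densities of the sampled denoising chains have already been evaluated under both policies and that a per-state value of $h$ is available; in actual training these quantities would come from a differentiable model so that gradients reach $\theta$. The function name and array layout are hypothetical.

```python
import numpy as np

def trust_region_loss(logp_old, logp_new, h_values, lam, eps):
    """Split objective of Eq. (181), forward pass only.

    logp_old, logp_new: shape (B, K, N), log-densities of each denoising
        step of K sampled chains per state under the old / current policy.
    h_values: shape (B,), per-state estimate of h from Eq. (182).
    """
    # c(s_i): sum the log-ratios over the N denoising steps,
    # then average over the K sampled chains.
    c = np.mean(np.sum(logp_old - logp_new, axis=2), axis=1)
    g = lam * c
    # Inside the trust region use h, otherwise use the penalty g.
    per_state = np.where(c <= eps, h_values, g)
    return per_state.mean(), c

# Two states, one chain each, two denoising steps.
logp_old = np.zeros((2, 1, 2))
logp_new = np.zeros((2, 1, 2))
logp_new[1] = -1.0  # second state drifted away from the old policy
loss, c = trust_region_loss(logp_old, logp_new, np.array([10.0, 20.0]), lam=3.0, eps=0.5)
print(loss, c)
```

The switch is applied per state, so a single batch can mix states that optimize the entropy-regularized objective with states that are only pulled back toward the old policy.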
Dual Parameter Updates\.Following recent actor\-critic RL frameworks\(Haarnojaet al\.,[2018b](https://arxiv.org/html/2606.15260#bib.bib58); Voelckeret al\.,[2025](https://arxiv.org/html/2606.15260#bib.bib14)\), we auto\-tune the entropy temperatureα\\alphatoward a target entropyϵH¯\\epsilon\_\{\\bar\{H\}\}, and we update the Lagrangian multiplierλ\\lambdaassociated with the trust\-region constraint in Eq\.[178](https://arxiv.org/html/2606.15260#S4.E178)using the dual updates
α←α−ηα∇α𝔼ρπ\[l→
πθ\(a0,s\)−ϵH¯\]\\alpha\\leftarrow\\alpha\-\\eta\_\{\\alpha\}\\nabla\_\{\\alpha\}\\mathbb\{E\}\_\{\\rho^\{\\pi\}\}\\left\[l\_\{\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.10971pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-1\.50694pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}\_\{\\theta\}\}\(a^\{0\},s\)\-\\epsilon\_\{\\bar\{H\}\}\\right\]\(183\)λ←λ−ηλ∇λ𝔼ρπ\[DKL\(→
πold0:N∣∣→
πθ0:N\)−ϵ\],\\begin\{split\}\\lambda\\leftarrow\\lambda\-\\eta\_\{\\lambda\}\\nabla\_\{\\lambda\}\\mathbb\{E\}\_\{\\rho^\{\\pi\}\}\\bigg\[D\_\{\\text\{KL\}\}\\big\(\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}\_\{\\text\{old\}\}^\{0:N\}\\mid\\mid\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}^\{0:N\}\_\{\\theta\}\\big\)\-\\epsilon\\bigg\],\\end\{split\}\(184\)where we denote→ πθ0:N=→ πθ0:N\(a0:N\|s\)\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 
4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}^\{0:N\}\_\{\\theta\}=\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}^\{0:N\}\_\{\\theta\}\(a^\{0:N\}\|s\)and→ πold0:N=→ πold0:N\(a0:N\|s\)\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}\_\{\\text\{old\}\}^\{0:N\}=\{\\mathchoice\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\displaystyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\displaystyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\textstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-4\.30554pt\\cr$\\textstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-3\.01389pt\\cr$\\scriptstyle\\pi$\\cr\}\}\}\{\\vbox\{\\halign\{\#\\cr\\reflectbox\{$\\scriptscriptstyle\\vec\{\}\\mkern 4\.0mu$\}\\cr\\kern\-2\.15277pt\\cr$\\scriptscriptstyle\\pi$\\cr\}\}\}\}\_\{\\text\{old\}\}^\{0:N\}\(a^\{0:N\}\|s\)in the dual update, respectively\. 
Although iteratively updating $\phi, \theta$ and $\alpha, \lambda$ does not guarantee that the constraints are satisfied, this dual descent strategy has empirically shown sufficiently fast adaptation of the Lagrangian multipliers while being easy to implement in practice.
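The dual descent step on the trust-region multiplier can be illustrated with a minimal sketch. It assumes the dual term has the form $\lambda\,(\epsilon - D_{\text{KL}})$, so that $\lambda$ grows when the estimated KL exceeds $\epsilon$; the sign convention, the projection onto $\lambda \geq 0$, the function name `dual_descent_step`, and the KL values are all illustrative assumptions rather than details taken from the paper.

```python
def dual_descent_step(lmbda, kl_estimate, epsilon, lr):
    """One projected gradient step on the multiplier of a KL trust-region constraint."""
    # Assumed dual term: lmbda * (epsilon - kl); its gradient w.r.t. lmbda is (epsilon - kl).
    grad = epsilon - kl_estimate
    # Gradient descent on lmbda, then projection onto lmbda >= 0 (the projection is an assumption).
    return max(0.0, lmbda - lr * grad)


lmbda = 1.0
history = []
for kl in [0.5, 0.4, 0.3, 0.1, 0.05]:  # made-up batch estimates of the trajectory KL
    lmbda = dual_descent_step(lmbda, kl, epsilon=0.2, lr=1.0)
    history.append(lmbda)
```

Under this convention the multiplier rises while the KL estimate is above the bound and falls once it drops below, which is the adaptation behavior the paragraph above describes.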
### 4\.4Probability\-Flow ODE for Policy Evaluation
During evaluation, we aim to generate actions deterministically that are likely to achieve high returns. For Gaussian policies, a natural deterministic representative is the mean action; for diffusion policies, actions are obtained through an iterative denoising process, so an analogous procedure is less obvious. We therefore use the probability-flow ODE associated with the reverse diffusion process (Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)). Using an Euler discretization, we obtain

$$a^{n-1} = a^{n} + \left(\beta^{n} a^{n} + c \cdot 2\eta^{2}\beta^{n}\,\nabla\log\vec{\pi}^{n}(a^{n}\mid s)\right)\delta, \tag{185}$$

where $c\in\mathbb{R}^{+}$ scales the score term. For $c=\tfrac{1}{2}$, the probability-flow dynamics form the deterministic counterpart to the following SDE (Särkkä and Solin, [2019](https://arxiv.org/html/2606.15260#bib.bib56))

$$a^{n-1} = a^{n} + \left(\beta^{n} a^{n} + 2\eta^{2}\beta^{n}\,\nabla\log\vec{\pi}^{n}(a^{n}\mid s)\right)\delta + \xi^{n}, \tag{186}$$

and match its marginals (Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)). This alignment is also reflected by the similar behavior of the stochastic SDE trajectories in [Figure 1](https://arxiv.org/html/2606.15260#S4.F1)a and the deterministic ODE trajectories in [Figure 1](https://arxiv.org/html/2606.15260#S4.F1)b.
While matching marginals is desirable, it does not necessarily concentrate the deterministic trajectory on the highest-return regions. In practice, we use $c>\tfrac{1}{2}$ as an evaluation-only heuristic that biases trajectories toward higher-density regions and empirically increases the likelihood of high-return actions ([Figure 1](https://arxiv.org/html/2606.15260#S4.F1)c). Intuitively, scaling the score can be interpreted as tempering intermediate marginals since

$$\nabla\log\left(\vec{\pi}^{n}(a^{n}\mid s)\right)^{c} = c\,\nabla\log\vec{\pi}^{n}(a^{n}\mid s), \tag{187}$$

so larger $c$ sharpens the distribution and smaller $c$ smooths it. Note that $c\neq\tfrac{1}{2}$ changes the score field (Song et al., [2021](https://arxiv.org/html/2606.15260#bib.bib61)). However, we apply this modification only at evaluation time and do not update the score network.
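To make the effect of the score scale concrete, the following sketch runs the Euler update of the probability-flow ODE on a one-dimensional toy problem. Everything beyond the update rule itself is an assumption for illustration: a constant $\beta$, a Gaussian target with mean `mu` and standard deviation `sigma0`, an Ornstein-Uhlenbeck forward process whose stationary variance is $\eta^{2}$ (which gives the closed-form score used here), and the hypothetical function names. In the paper the score comes from a learned network.

```python
import math


def gaussian_score(a, t, mu=1.0, sigma0=0.5, beta=1.0, eta=1.0):
    """Closed-form score of the noised marginal at time t for a 1-D Gaussian toy target."""
    decay = math.exp(-beta * t)
    mean = mu * decay
    var = sigma0**2 * decay**2 + eta**2 * (1.0 - decay**2)
    return -(a - mean) / var


def probability_flow_euler(a_start, c, num_steps=100, horizon=2.0, beta=1.0, eta=1.0):
    """Deterministic Euler integration from step N down to step 0 with score scale c."""
    delta = horizon / num_steps
    a = a_start
    for n in range(num_steps, 0, -1):
        t = n * delta
        score = gaussian_score(a, t, beta=beta, eta=eta)
        # Drift of the scaled probability-flow ODE: beta * a + c * 2 * eta^2 * beta * score.
        drift = beta * a + c * 2.0 * eta**2 * beta * score
        a = a + drift * delta
    return a


# Distance between two trajectories started at -1 and +1, for two score scales.
spread_half = abs(probability_flow_euler(1.0, c=0.5) - probability_flow_euler(-1.0, c=0.5))
spread_one = abs(probability_flow_euler(1.0, c=1.0) - probability_flow_euler(-1.0, c=1.0))
```

In this toy setting the two trajectories end closer together for $c=1$ than for $c=\tfrac{1}{2}$, matching the intuition that a larger score scale sharpens the distribution of generated actions.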
Figure 2: Aggregated Performance on Continuous Control Benchmarks. We compare TruDi (Ours) against strong on-policy baselines (REPPO, PPO) and the diffusion-based online RL method DIME (on-policy). Curves depict the *Interquartile Mean (IQM)* of episode returns with 95% stratified bootstrap confidence intervals. The evaluation spans five task groups drawn from four benchmark suites: (a) Standard continuous control (MuJoCo Playground DMC, 20 tasks); (b) Fine-grained manipulation (ManiSkill3, 14 tasks); (c) Robust locomotion & dexterity (IsaacLab, 6 tasks, normalized returns); (d) Humanoid locomotion (MuJoCo Playground, 4 tasks); and (e) Whole-body control (HumanoidBench, 27 tasks). TruDi matches or exceeds the performance of strong on-policy baselines (REPPO) on standard control tasks (a-c) while significantly outperforming baselines on difficult, high-dimensional humanoid tasks (d-e).
## 5Experimental Setup
We evaluate TruDi against state-of-the-art on-policy baselines, covering both Gaussian policies (PPO (Schulman et al., [2017](https://arxiv.org/html/2606.15260#bib.bib19)), REPPO (Voelcker et al., [2025](https://arxiv.org/html/2606.15260#bib.bib14)), SPO (Xie et al., [2025](https://arxiv.org/html/2606.15260#bib.bib63))) and diffusion-based (or flow-based) policies (DIME (Celik et al., [2025](https://arxiv.org/html/2606.15260#bib.bib10)), DPPO (Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32)), FPO (McAllister et al., [2025](https://arxiv.org/html/2606.15260#bib.bib31))). See [Section B.5](https://arxiv.org/html/2606.15260#A2.SS5) for details. We conduct experiments on four commonly used RL benchmark suites: 20 tasks from the MuJoCo Playground DMC suite, 13 ManiSkill environments (Tao et al., [2025](https://arxiv.org/html/2606.15260#bib.bib4)), 30 Humanoid-Bench environments (Sferrazza et al., [2024](https://arxiv.org/html/2606.15260#bib.bib26)), and 6 environments in IsaacLab (Mittal et al., [2025](https://arxiv.org/html/2606.15260#bib.bib25)). ManiSkill, MuJoCo Playground and IsaacLab provide GPU-accelerated simulation, which is particularly well suited for on-policy RL methods. These benchmarks include a wide range of tasks, from classical control to robotic manipulation and locomotion. To further stress-test the benefit of diffusion policies, we additionally evaluate TruDi on Humanoid-Bench (Sferrazza et al., [2024](https://arxiv.org/html/2606.15260#bib.bib26)), a recent benchmark consisting of 36 challenging whole-body manipulation tasks on a humanoid.
We report average returns and, when available, success rates using the interquartile mean (IQM) with 95% confidence intervals over $10$ seeds for all methods, as suggested by Agarwal et al. ([2021](https://arxiv.org/html/2606.15260#bib.bib91)).
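As a rough illustration of this reporting protocol, the sketch below computes an interquartile mean and a percentile bootstrap interval in plain Python. It is a simplified stand-in: the resampling here is a plain bootstrap over a flat list of scores, not the stratified bootstrap over tasks and seeds, and the function names and the example scores are made up.

```python
import random


def iqm(values):
    """Interquartile mean: average of the values left after trimming 25% from each end."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of values dropped from each tail
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(values, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the IQM (plain resampling, not stratified)."""
    rng = random.Random(seed)
    stats = sorted(
        iqm([rng.choice(values) for _ in values]) for _ in range(num_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * (num_resamples - 1))
    hi_idx = int((1.0 + level) / 2.0 * (num_resamples - 1))
    return stats[lo_idx], stats[hi_idx]


returns = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]  # made-up scores with one outlier
point = iqm(returns)
low, high = bootstrap_ci(returns)
```

Because the top and bottom quarters are discarded, the single outlier in `returns` does not affect `point`, whereas it dominates the plain mean; this robustness is the usual motivation for reporting IQM.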
## 6Results
Standard Benchmarks. On the standard RL benchmarks including the MuJoCo Playground DMC suite ([Figure 2](https://arxiv.org/html/2606.15260#S4.F2)a), ManiSkill ([Figure 2](https://arxiv.org/html/2606.15260#S4.F2)b), and IsaacLab ([Figure 2](https://arxiv.org/html/2606.15260#S4.F2)c), TruDi consistently outperforms all baselines throughout training and achieves the best final performance. In these settings, TruDi and REPPO form a clear top tier, while the remaining methods often converge early and appear to get stuck in sub-optimal regions. Compared to REPPO, TruDi converges to a higher asymptotic return on both DMC and IsaacLab with a visible gap in the final performance. DIME also performs strongly on the DMC suite, consistent with the results reported in the original paper, but its performance drops when scaling to the high-dimensional, contact-rich environments present in our experiments.
Complex Humanoid Tasks. We next move to the humanoid benchmarks, namely the MuJoCo Playground Humanoid tasks and Humanoid-Bench, which require high-dimensional whole-body control with long-horizon behaviors and complex coordination. [Figure 2](https://arxiv.org/html/2606.15260#S4.F2)(d,e) shows a clear gap between TruDi and its main competitor REPPO, with TruDi achieving substantially higher episode return on both suites. This highlights the benefit of combining diffusion policies with our trust-region update, which improves exploration while remaining stable in these difficult settings, whereas the adaptive on-policy diffusion approach from DIME often fails to make consistent progress in our massively parallel setup. Looking at the per-task results in Appendix Figures [13](https://arxiv.org/html/2606.15260#A6.F13) and [14](https://arxiv.org/html/2606.15260#A6.F14), the advantage becomes most visible on the hardest environments such as window cleaning, balancing, hurdling, and cube, where Gaussian baselines frequently plateau early while TruDi continues to improve and reaches higher final performance. Crucially, these results also support the stability of our method: it matches tuned Gaussian policies on standard domains, yet uses the added expressivity of diffusion policies when the action distribution becomes more complex in humanoid control.
Figure 3: Evaluation on Diffusion/Flow-based Policies. Left: MuJoCo Playground DMC. Right: MuJoCo Playground Humanoid. Legend: TruDi (Ours), FPO, DPPO.

##### Comparison with Diffusion and Flow-Based Baselines.
We evaluate TruDi against two state-of-the-art methods that also utilize expressive generative policies: DPPO (Ren et al., [2024](https://arxiv.org/html/2606.15260#bib.bib32)), which applies PPO-style clipping to diffusion policies, and FPO (McAllister et al., [2025](https://arxiv.org/html/2606.15260#bib.bib31)), which utilizes Flow Matching for policy parameterization. Figure [3](https://arxiv.org/html/2606.15260#S6.F3) presents the aggregated learning curves on the MuJoCo Playground DMC (Left) and Humanoid (Right) benchmarks. On the *DMC suite*, TruDi achieves comparable or superior sample efficiency to FPO and significantly outperforms DPPO. This performance gap becomes more pronounced on the high-dimensional *Humanoid tasks* (Right).
Table 1: Multimodality Analysis. Evaluation of solution diversity on symmetric tasks (200 deterministic runs). Behavior Entropy (∗) (see definition in Appendix [Section D.3](https://arxiv.org/html/2606.15260#A4.SS3)) is normalized, with $1$ indicating a uniform distribution and $0$ indicating mode collapse. As expected, diffusion policies such as TruDi and DIME can capture multiple modes while the pure-Gaussian policy collapses.

| Task | Method | Entropy∗ | Episode Return |
| --- | --- | --- | --- |
| PushT | REPPO | 0 | 154.92 ± 5.85 |
| PushT | DIME | 0.20 ± 0.20 | 171.92 ± 1.08 |
| PushT | TruDi | 0.32 ± 0.13 | 167.32 ± 4.42 |
| StackCube | REPPO | 0 | 83.92 ± 0.24 |
| StackCube | DIME | 0.68 ± 0.15 | 84.40 ± 0.69 |
| StackCube | TruDi | 0.87 ± 0.08 | 84.69 ± 0.10 |
##### Multimodality of Diffusion Policies\.
To empirically validate the capacity of TruDi to represent multimodal action distributions, we evaluate it on the PushT and StackCube tasks. As shown in Figure [4](https://arxiv.org/html/2606.15260#S6.F4), these environments are designed with symmetric optimal solutions (e.g., stacking Red-on-Blue vs. Blue-on-Red) inspired by the object manipulation setups in ManiSkill3 (Tao et al., [2025](https://arxiv.org/html/2606.15260#bib.bib4)), creating a bimodal optimization problem. To quantify this multimodality, we adopt the Behavior Entropy metric (Jia et al., [2024](https://arxiv.org/html/2606.15260#bib.bib89)). Let $\mathcal{B}$ represent the discrete set of successful behavior modes ($|\mathcal{B}|=2$ for our symmetric tasks). The normalized entropy is computed over the empirical behavior distribution $\pi(\beta)$ as:

$$\mathcal{H}(\pi(\beta)) = -\sum_{\beta\in\mathcal{B}} \pi(\beta)\log_{|\mathcal{B}|}\pi(\beta)$$

This scales the metric into the $[0,1]$ interval, where $0$ indicates complete mode collapse and $1$ indicates perfectly balanced diversity. We conduct 200 evaluation runs from a fixed initial state, using deterministic generation (ODE for diffusion, mean for Gaussian) to ensure that observed diversity arises from learned modes rather than stochastic noise. As shown in Table 1, the Gaussian-based baseline, REPPO, exhibits zero entropy across both tasks, indicating a complete collapse to a single deterministic solution. In contrast, TruDi achieves higher entropy ($0.32$ and $0.87$), confirming that it successfully leverages the expressivity of diffusion models to capture multiple valid solutions in the on-policy setting without sacrificing performance.
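The normalized entropy above can be computed directly from per-mode counts. The sketch below is an illustration only: the function name `behavior_entropy`, the convention of returning zero when fewer than two modes exist, and the example counts are assumptions, not taken from the paper.

```python
import math


def behavior_entropy(mode_counts):
    """Normalized entropy of the empirical behavior distribution, using log base |B|."""
    num_modes = len(mode_counts)
    total = sum(mode_counts)
    if num_modes < 2 or total == 0:
        return 0.0  # convention assumed here; log base 1 is undefined
    entropy = 0.0
    for count in mode_counts:
        if count > 0:  # a mode that never occurs contributes nothing
            p = count / total
            entropy -= p * math.log(p, num_modes)
    return entropy


balanced = behavior_entropy([100, 100])   # both modes equally often over 200 runs
collapsed = behavior_entropy([200, 0])    # every run ends in the same mode
skewed = behavior_entropy([150, 50])      # one mode preferred three to one
```

With two modes, an even split gives the maximum value of one, a single mode gives zero, and uneven splits fall strictly in between.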


Figure 4: Environments for evaluating multimodal behavior. Left: PushT, where rotating a T-shaped block 180 degrees to a goal orientation is optimally solved either clockwise or counter-clockwise. Right: StackCube, where stacking red-on-blue or blue-on-red are both optimal strategies. Both configurations create two equivalent (bimodal) solutions.

Table 2: Computational Efficiency. Wall-clock time and aggregated performance after 50M steps on MuJoCo Playground Humanoid tasks. TruDi achieves the highest performance with a manageable increase in training duration.

| Method | Training Time (Hours) | Episode Return (IQM) |
| --- | --- | --- |
| PPO | 0.95 ± 0.35 | 0.1 ± 3.7 |
| REPPO | 1.07 ± 0.35 | 29.6 ± 7.2 |
| DIME | 1.38 ± 0.36 | 15.6 ± 9.4 |
| TruDi (Ours) | 1.95 ± 0.46 | 34.8 ± 3.0 |
##### Computational Efficiency and Wall\-Clock Time\.
We measure wall-clock training cost on the high-dimensional MuJoCo Playground Humanoid tasks. Diffusion policies incur additional compute due to iterative denoising, and TruDi further adds overhead from evaluating the trust-region constraint against the previous policy at each update. As a result, for a fixed budget of 50M environment steps, TruDi is slower than the Gaussian REPPO baseline and DIME, but this extra cost is offset by substantially better final performance. To account for this computational overhead, we additionally evaluate TruDi against REPPO under a strictly matched wall-clock time budget. As shown in Figure [5](https://arxiv.org/html/2606.15260#S6.F5) (Left), TruDi substantially outperforms REPPO even under identical time constraints ($\approx 2.5$ hours). Furthermore, extending REPPO's training to 150M environment steps ($2.78$ hours) yields no further improvement in episode return. This indicates that TruDi's advantage stems from the expressiveness of the diffusion policy and the stability of our trust-region updates, rather than being an artifact of additional compute time.


Figure 5: Left: Wall-clock time comparison on the MuJoCo Playground Humanoid tasks. TruDi substantially outperforms REPPO when evaluated under a strictly matched wall-clock time budget. Right: Ablation on diffusion steps $T$. We evaluate $T\in\{1,4,8,16,32\}$, with colors indicating the step count (e.g., $T=1$, $T=4$, $T=8$, $T=32$). Performance saturates at $T=8$.
##### Ablation on Diffusion Steps\.
To understand the trade-off between performance and computational cost, we ablate the number of diffusion steps $T$ used during training and inference. Figure [5](https://arxiv.org/html/2606.15260#S6.F5) (Right) illustrates the aggregated performance for $T\in\{1,4,8,16,32\}$. At $T=1$, the policy fails to learn effectively, confirming that an iterative diffusion process is essential for representing these complex action distributions. Performance improves steadily as $T$ increases but saturates at $T=8$. Beyond this threshold, computational time continues to grow linearly without yielding further improvements in final performance, establishing $T=8$ as the best trade-off between policy expressivity and computational efficiency among the tested values.


Figure 6: Left: Sensitivity analysis of the trust-region threshold $\epsilon$. We vary the KL constraint from strict ($0.01$) to loose ($50$) bounds. Right: Comparison of policy evaluation strategies, including SDE sampling, ODE sampling, and best-of-$K$ sampling.

Sensitivity to the trust-region threshold. We study the effect of the trust-region constraint by varying $\epsilon$ on the *IsaacLab benchmark*. We sweep $\epsilon$ from $0.01$ to $50$ and report the aggregated normalized episode return, where curves are color-coded by $\log_{10}(\epsilon)$. As shown in [Figure 6](https://arxiv.org/html/2606.15260#S6.F6) (Left), intermediate values around $0.1$ to $0.4$ (cyan) give the best performance, and the optimal region is reasonably broad within this range. When $\epsilon$ is set too large (dark blue), the constraint becomes loose, and the update behaves closer to the unconstrained case, which leads to a clear drop in final return. On the other hand, a very small $\epsilon$ is overly conservative, limiting the policy update and resulting in slower learning and lower asymptotic performance.
Ablation on Evaluation Strategies. We investigate the impact of different action generation strategies during evaluation. While stochastic sampling (SDE) is essential for exploration during training, deterministic execution of high-return actions is preferred during evaluation. We compare the standard SDE sampler against the Probability Flow ODE (with score scaling $c=1.0$) and a Best-of-$K$ strategy ($K\in\{10,20\}$), which samples $K$ actions via SDE and selects the one maximizing the learned Q-value. The aggregated results on the MuJoCo Playground Humanoid benchmark (Figure [6](https://arxiv.org/html/2606.15260#S6.F6) (Right)) demonstrate that the ODE integrator ($c=1.0$) consistently yields the highest performance. Notably, the ODE solver outperforms the Best-of-$K$ baselines, suggesting that deterministically tracing the probability flow toward the mode of the policy distribution is more reliable than the standard sampling strategies.
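For reference, the Best-of-$K$ baseline amounts to the selection rule sketched below. The sampler and critic here are toy stand-ins (a Gaussian draw and a quadratic score) chosen only so the snippet runs on its own; in the paper the candidates come from the SDE sampler and the scores from the learned Q-function, and all names in the snippet are hypothetical.

```python
import random


def best_of_k(sample_action, q_value, state, k, rng):
    """Draw k candidate actions and return the one with the largest critic value."""
    candidates = [sample_action(state, rng) for _ in range(k)]
    return max(candidates, key=lambda action: q_value(state, action))


# Toy stand-ins (hypothetical): a Gaussian sampler and a critic peaked at action = 1.
def toy_sampler(state, rng):
    return rng.gauss(0.0, 1.0)


def toy_critic(state, action):
    return -(action - 1.0) ** 2


state = None  # unused by the toy functions
single = toy_sampler(state, random.Random(0))
best = best_of_k(toy_sampler, toy_critic, state, k=10, rng=random.Random(0))
```

Since both calls use the same seed, `single` is the first of the ten candidates, so the selected action can never score lower than it under the toy critic. The quality of the selection still depends entirely on the critic, which is one possible reason a deterministic ODE rollout can be more reliable.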
## 7Conclusion
We introduced TruDi, a principled approach for training diffusion policies in the on-policy, massively parallel RL regime by enforcing a trust-region constraint over the full diffusion trajectory, which yields a tractable alternative to marginal KL constraints for diffusion models. Across 4 benchmark suites with 73 tasks, TruDi is competitive with strong Gaussian on-policy baselines on standard control tasks while achieving clear gains on challenging high-dimensional humanoid control, where stable exploration and expressive action distributions matter most. In addition, our analyses highlight practical ingredients for strong performance in this setting, including the sensitivity to the trust-region threshold and the benefit of deterministic policy evaluation via the probability-flow ODE compared to standard SDE sampling and best-of-$K$ selection. Overall, TruDi establishes a strong baseline for combining diffusion policies with trust-region optimization in massively parallel on-policy RL.
Limitations and Future Work. A current limitation is the added computational cost of diffusion policies due to iterative sampling, which motivates improving efficiency through fewer denoising steps, distillation, or faster samplers. Looking ahead, it would be interesting to extend our trust-region diffusion framework to more sophisticated training settings, such as offline-to-online RL and sim-to-real transfer with richer observations.
## Impact Statement
This paper presents work whose goal is to advance the field of Robot Learning\. Our research enables stable training of expressive diffusion policies in on\-policy, massively parallel reinforcement learning, which may improve learning performance on challenging high\-dimensional control tasks and broaden the practical use of diffusion models in robotics simulation\. As far as we are aware, our work does not raise any specific ethical issues\.
## Acknowledgements
We thank the anonymous reviewers for their valuable feedback and suggestions\. GN is supported by the European Research Council \(ERC\) under the European Union’s Horizon Europe programme through the project SMARTI³ \(Grant Agreement No\. 101171393\), and by the German Federal Ministry of Research, Technology, and Space \(BMFTR\) under the Robotics Institute Germany \(RIG\)\. The authors acknowledge support by the state of Baden\-Württemberg through bwHPC, as well as the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden\-Württemberg and by the German Federal Ministry of Education and Research\.
## References
- A. Abdolmaleki, R. Lioutikov, J. R. Peters, N. Lau, L. Paulo Reis, and G. Neumann (2015) Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Cited by: [§1](https://arxiv.org/html/2606.15260#S1.p2.1), [§2](https://arxiv.org/html/2606.15260#S2.p1.1).
- R\. Agarwal, M\. Schwarzer, P\. S\. Castro, A\. C\. Courville, and M\. Bellemare \(2021\)Deep reinforcement learning at the edge of the statistical precipice\.Advances in Neural Information Processing Systems34\.Cited by:[§5](https://arxiv.org/html/2606.15260#S5.p2.1)\.
- J\. L\. Ba, J\. R\. Kiros, and G\. E\. Hinton \(2016\)Layer normalization\.External Links:1607\.06450,[Link](https://arxiv.org/abs/1607.06450)Cited by:[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p1.7),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p2.1)\.
- J\. Berner, L\. Richter, and K\. Ullrich \(2022\)An optimal control perspective on diffusion\-based generative modeling\.arXiv preprint arXiv:2211\.01364\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.15260#S4.SS1.p1.6)\.
- J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang \(2018\)JAX: composable transformations of Python\+NumPy programs\.External Links:[Link](http://github.com/jax-ml/jax)Cited by:[Appendix C](https://arxiv.org/html/2606.15260#A3.p1.1)\.
- J\. Carvalho, A\. Le, P\. Kicki, D\. Koert, and J\. Peters \(2025\)Motion planning diffusion: learning and adapting robot motion planning with diffusion models\.External Links:2412\.19948,[Link](https://arxiv.org/abs/2412.19948)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- O\. Celik, Z\. Li, D\. Blessing, G\. Li, D\. Palenicek, J\. Peters, G\. Chalvatzaki, and G\. Neumann \(2025\)DIME: diffusion\-based maximum entropy reinforcement learning\.InProceedings of the International Conference on Machine Learning,Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p7.1),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p1.7),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.15260#S4.SS1.p1.6),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- H\. Chen, C\. Lu, C\. Ying, H\. Su, and J\. Zhu \(2023\)Offline reinforcement learning via high\-fidelity generative behavior modeling\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- C\. Chi, S\. Feng, Y\. Du, Z\. Xu, E\. Cousineau, B\. Burchfiel, and S\. Song \(2023\)Diffusion policy: visuomotor policy learning via action diffusion\.InProceedings of Robotics: Science and Systems \(RSS\),Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- T\. M\. Cover \(1999\)Elements of information theory\.John Wiley & Sons\.Cited by:[Appendix A](https://arxiv.org/html/2606.15260#A1.p1.2),[§4\.1](https://arxiv.org/html/2606.15260#S4.SS1.p1.2)\.
- S\. Ding, K\. Hu, Z\. Zhang, J\. Yu, J\. Wang, K\. Ren, Y\. Shi, and W\. Zhang \(2024\)Diffusion\-based reinforcement learning via q\-weighted variational policy optimization\.arXiv preprint arXiv:2405\.16173\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- S\. Ding, K\. Hu, S\. Zhong, H\. Luo, W\. Zhang, J\. Wang, J\. Wang, and Y\. Shi \(2025\)GenPO: generative diffusion models meet on\-policy reinforcement learning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=BmRNz1TpCc)Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p3.1)\.
- L\. Fang, R\. Liu, J\. Zhang, W\. Wang, and B\. Jing \(2025\)Diffusion actor\-critic: formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- J\. Farebrother, J\. Orbay, Q\. Vuong, A\. A\. Taïga, Y\. Chebotar, T\. Xiao, A\. Irpan, S\. Levine, P\. S\. Castro, A\. Faust,et al\.\(2024\)Stop regressing: training value functions via classification for scalable deep rl\.arXiv preprint arXiv:2403\.03950\.Cited by:[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p2.6)\.
- T\. Haarnoja, H\. Tang, P\. Abbeel, and S\. Levine \(2017\)Reinforcement learning with deep energy\-based policies\.InInternational conference on machine learning,pp\. 1352–1361\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§3](https://arxiv.org/html/2606.15260#S3.p2.1),[§3](https://arxiv.org/html/2606.15260#S3.p2.8)\.
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018a\)Soft actor\-critic: off\-policy maximum entropy deep reinforcement learning with a stochastic actor\.InProceedings of the International Conference on Machine Learning,Cited by:[§3](https://arxiv.org/html/2606.15260#S3.p1.16),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p1.1)\.
- T\. Haarnoja, A\. Zhou, K\. Hartikainen, G\. Tucker, S\. Ha, J\. Tan, V\. Kumar, H\. Zhu, A\. Gupta, P\. Abbeel,et al\.\(2018b\)Soft actor\-critic algorithms and applications\.arXiv preprint arXiv:1812\.05905\.Cited by:[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p4.3)\.
- P\. Hansen\-Estruch, I\. Kostrikov, M\. Janner, J\. G\. Kuba, and S\. Levine \(2023\)IDQL: implicit q\-learning as an actor\-critic method with diffusion policies\.arXiv preprint arXiv:2304\.10573\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- T\. He, Z\. Wang, H\. Xue, Q\. Ben, Z\. Luo, W\. Xiao, Y\. Yuan, X\. Da, F\. Castañeda, S\. Sastry, C\. Liu, G\. Shi, L\. Fan, and Y\. Zhu \(2025\)VIRAL: visual sim\-to\-real at scale for humanoid loco\-manipulation\.arXiv preprint arXiv:2511\.15200\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- J\. Ho, W\. Chan, C\. Saharia, J\. Whang, R\. Gao, A\. Gritsenko, D\. P\. Kingma, B\. Poole, M\. Norouzi, D\. J\. Fleet, and T\. Salimans \(2022\)Imagen video: high definition video generation with diffusion models\.External Links:2210\.02303,[Link](https://arxiv.org/abs/2210.02303)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§4](https://arxiv.org/html/2606.15260#S4.p2.1)\.
- T\. Hoang, H\. Le, P\. Becker, V\. A\. Ngo, and G\. Neumann \(2025\)Geometry\-aware RL for manipulation of varying shapes and deformable objects\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=7BLXhmWvwF)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- T\. Hoang, A\. Trenta, A\. Gravina, N\. Freymuth, P\. Becker, D\. Bacciu, and G\. Neumann \(2026\)Improving long\-range interactions in graph neural simulators via hamiltonian dynamics\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=x66u6TEDUw)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1)\.
- E\. Imani and M\. White \(2018\)Improving regression performance with distributional losses\.InInternational conference on machine learning,pp\. 2157–2166\.Cited by:[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p2.6)\.
- M\. Janner, Y\. Du, J\. B\. Tenenbaum, and S\. Levine \(2022\)Planning with diffusion for flexible behavior synthesis\.InInternational Conference on Machine Learning,pp\. 9902–9915\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- X\. Jia, D\. Blessing, X\. Jiang, M\. Reuss, A\. Donat, R\. Lioutikov, and G\. Neumann \(2024\)Towards diverse behaviors: a benchmark for imitation learning with human demonstrations\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6pPYRXKPpw)Cited by:[§D\.3](https://arxiv.org/html/2606.15260#A4.SS3.p1.1),[§6](https://arxiv.org/html/2606.15260#S6.SS0.SSS0.Px2.p1.3)\.
- B\. Kang, X\. Ma, C\. Du, T\. Pang, and S\. Yan \(2023\)Efficient diffusion policies for offline reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- A\. Kumar, Z\. Fu, D\. Pathak, and J\. Malik \(2021\)Rma: rapid motor adaptation for legged robots\.InRobotics: Science and Systems,Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- H\. Le, T\. Hoang, M\. Gabriel, G\. Neumann, and N\. A\. Vien \(2025\)Enhancing exploration with diffusion policies in hybrid off\-policy rl: application to non\-prehensile manipulation\.IEEE Robotics and Automation Letters10\(6\),pp\. 6143–6150\.External Links:[Document](https://dx.doi.org/10.1109/LRA.2025.3564780)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. URL: https://arxiv.org/pdf/1805.00909.pdf. Cited by: [§1](https://arxiv.org/html/2606.15260#S1.p3.1).
- G\. Li, H\. Zhou, D\. Roth, S\. Thilges, F\. Otto, R\. Lioutikov, and G\. Neumann \(2024a\)Open the black box: step\-based policy updates for temporally\-correlated episodic reinforcement learning\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mnipav175N)Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p1.1)\.
- Z\. Li, T\. Chen, Z\. Hong, A\. Ajay, and P\. Agrawal \(2023\)Parallel q\-learning: scaling off\-policy reinforcement learning under massively parallel simulation\.InProceedings of the International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p1.1)\.
- Z\. Li, R\. Krohn, T\. Chen, A\. Ajay, P\. Agrawal, and G\. Chalvatzaki \(2024b\)Learning multimodal behaviors from scratch with diffusion policy gradient\.arXiv preprint arXiv:2406\.00681\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- C\. Lu, H\. Chen, J\. Chen, H\. Su, C\. Li, and J\. Zhu \(2023\)Contrastive energy prediction for exact energy\-guided diffusion sampling in offline reinforcement learning\.arXiv preprint arXiv:2304\.12824\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- C\. Luo \(2022\)Understanding diffusion models: a unified perspective\.arXiv preprint arXiv:2208\.11970\.Cited by:[§4\.1](https://arxiv.org/html/2606.15260#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.15260#S4.p2.1)\.
- L\. Lv, Y\. Li, Y\. Luo, F\. Sun, T\. Kong, J\. Xu, and X\. Ma \(2025\)Flow\-based policy for online reinforcement learning\.External Links:2506\.12811,[Link](https://arxiv.org/abs/2506.12811)Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p3.1)\.
- H\. Ma, O\. Nabati, A\. Rosenberg, B\. Dai, O\. Lang, I\. Szpektor, C\. Boutilier, N\. Li, S\. Mannor, L\. Shani, and G\. Tenneholtz \(2025\)Reinforcement learning with discrete diffusion policies for combinatorial action spaces\.External Links:2509\.22963,[Link](https://arxiv.org/abs/2509.22963)Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p3.1)\.
- L\. Mao, X\. Zhan, W\. Zhang, H\. Xu, and A\. Zhang \(2024\)Diffusion\-dice: in\-sample diffusion guidance for offline reinforcement learning\.arXiv preprint arXiv:2407\.20109\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- D\. McAllister, S\. Ge, B\. Yi, C\. M\. Kim, E\. Weber, H\. Choi, H\. Feng, and A\. Kanazawa \(2025\)Flow matching policy gradients\.External Links:2507\.21053,[Link](https://arxiv.org/abs/2507.21053)Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p4.1),[§2](https://arxiv.org/html/2606.15260#S2.p3.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1),[§6](https://arxiv.org/html/2606.15260#S6.SS0.SSS0.Px1.p1.1)\.
- M\. Mittal, P\. Roth, J\. Tigue, A\. Richard, O\. Zhang, P\. Du, A\. Serrano\-Muñoz, X\. Yao, R\. Zurbrügg, N\. Rudin, L\. Wawrzyniak, M\. Rakhsha, A\. Denzler, E\. Heiden, A\. Borovicka, O\. Ahmed, I\. Akinola, A\. Anwar, M\. T\. Carlson, J\. Y\. Feng, A\. Garg, R\. Gasoto, L\. Gulich, Y\. Guo, M\. Gussert, A\. Hansen, M\. Kulkarni, C\. Li, W\. Liu, V\. Makoviychuk, G\. Malczyk, H\. Mazhar, M\. Moghani, A\. Murali, M\. Noseworthy, A\. Poddubny, N\. Ratliff, W\. Rehberg, C\. Schwarke, R\. Singh, J\. L\. Smith, B\. Tang, R\. Thaker, M\. Trepte, K\. V\. Wyk, F\. Yu, A\. Millane, V\. Ramasamy, R\. Steiner, S\. Subramanian, C\. Volk, C\. Chen, N\. Jawale, A\. V\. Kuruttukulam, M\. A\. Lin, A\. Mandlekar, K\. Patzwaldt, J\. Welsh, H\. Zhao, F\. Anes, J\. Lafleche, N\. Moënne\-Loccoz, S\. Park, R\. Stepinski, D\. V\. Gelder, C\. Amevor, J\. Carius, J\. Chang, A\. H\. Chen, P\. de Heras Ciechomski, G\. Daviet, M\. Mohajerani, J\. von Muralt, V\. Reutskyy, M\. Sauter, S\. Schirm, E\. L\. Shi, P\. Terdiman, K\. Vilella, T\. Widmer, G\. Yeoman, T\. Chen, S\. Grizan, C\. Li, L\. Li, C\. Smith, R\. Wiltz, K\. Alexis, Y\. Chang, D\. Chu, L\. ”\. Fan, F\. Farshidian, A\. Handa, S\. Huang, M\. Hutter, Y\. Narang, S\. Pouya, S\. Sheng, Y\. Zhu, M\. Macklin, A\. Moravanszky, P\. Reist, Y\. Guo, D\. Hoeller, and G\. State \(2025\)Isaac lab: a gpu\-accelerated simulation framework for multi\-modal robot learning\.arXiv preprint arXiv:2511\.04831\.External Links:[Link](https://arxiv.org/abs/2511.04831)Cited by:[§B\.3](https://arxiv.org/html/2606.15260#A2.SS3.p1.1),[Appendix B](https://arxiv.org/html/2606.15260#A2.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- N\. Nusken, F\. Vargas, S\. Padhy, and D\. Blessing \(2024\)Transport meets variational inference: controlled monte carlo diffusions\.InThe Twelfth International Conference on Learning Representations: ICLR 2024,Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.15260#S4.SS1.p1.6)\.
- F\. Otto, P\. Becker, N\. Anh Vien, H\. C\. Ziesche, and G\. Neumann \(2021\)Differentiable trust region layers for deep reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§4](https://arxiv.org/html/2606.15260#S4.p1.1)\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Köpf, E\. Z\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala \(2019\)PyTorch: an imperative style, high\-performance deep learning library\.InAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada,H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\),pp\. 8024–8035\.External Links:[Link](https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html)Cited by:[Appendix C](https://arxiv.org/html/2606.15260#A3.p1.1)\.
- J\. Peters, K\. Mülling, and Y\. Altün \(2010a\)Relative entropy policy search\.InProceedings of the Twenty\-Fourth AAAI Conference on Artificial Intelligence,AAAI’10,pp\. 1607–1612\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1)\.
- J\. Peters, K\. Mulling, and Y\. Altun \(2010b\)Relative entropy policy search\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.24,pp\. 1607–1612\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§4](https://arxiv.org/html/2606.15260#S4.p1.1)\.
- M\. Psenka, A\. Escontrela, P\. Abbeel, and Y\. Ma \(2024\)Learning a diffusion model policy from rewards via q\-score matching\.arXiv preprint arXiv:2312\.11752\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- A\. Z\. Ren, J\. Lidard, L\. L\. Ankile, A\. Simeonov, P\. Agrawal, A\. Majumdar, B\. Burchfiel, H\. Dai, and M\. Simchowitz \(2024\)Diffusion policy policy optimization\.External Links:2409\.00588,[Link](https://arxiv.org/abs/2409.00588)Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p5.1),[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p3.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1),[§6](https://arxiv.org/html/2606.15260#S6.SS0.SSS0.Px1.p1.1)\.
- L\. Richter and J\. Berner \(2024\)Improved sampling via learned diffusions\.InInternational Conference on Learning Representations 2024,Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.External Links:2112\.10752,[Link](https://arxiv.org/abs/2112.10752)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- N\. Rudin, D\. Hoeller, P\. Reist, and M\. Hutter \(2022\)Learning to walk in minutes using massively parallel deep reinforcement learning\.InConference on Robot Learning,pp\. 91–100\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- C\. Saharia, W\. Chan, S\. Saxena, L\. Li, J\. Whang, E\. L\. Denton, K\. Ghasemipour, R\. Gontijo Lopes, B\. Karagol Ayan, T\. Salimans, J\. Ho, D\. J\. Fleet, and M\. Norouzi \(2022\)Photorealistic text\-to\-image diffusion models with deep language understanding\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 36479–36494\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- S\. Särkkä and A\. Solin \(2019\)Applied stochastic differential equations\.Vol\.10,Cambridge University Press\.Cited by:[§3](https://arxiv.org/html/2606.15260#S3.p3.9),[§4\.4](https://arxiv.org/html/2606.15260#S4.SS4.p1.2)\.
- J\. Schulman, S\. Levine, P\. Moritz, M\. Jordan, and P\. Abbeel \(2015\)Trust region policy optimization\.InProceedings of the 32nd International Conference on International Conference on Machine Learning \- Volume 37,ICML’15,pp\. 1889–1897\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§4](https://arxiv.org/html/2606.15260#S4.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.External Links:1707\.06347,[Link](https://arxiv.org/abs/1707.06347)Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p2.1),[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- C\. Schwarke, M\. Mittal, N\. Rudin, D\. Hoeller, and M\. Hutter \(2025\)RSL\-rl: a learning library for robotics research\.arXiv preprint arXiv:2509\.10771\.Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p2.1)\.
- Y\. Seo, C\. Sferrazza, J\. Chen, G\. Shi, R\. Duan, and P\. Abbeel \(2025a\)Learning sim\-to\-real humanoid locomotion in 15 minutes\.External Links:2512\.01996,[Link](https://arxiv.org/abs/2512.01996)Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- Y\. Seo, C\. Sferrazza, H\. Geng, M\. Nauman, Z\. Yin, and P\. Abbeel \(2025b\)FastTD3: simple, fast, and capable reinforcement learning for humanoid control\.arXiv preprint arXiv:2505\.22642\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- C\. Sferrazza, D\. Huang, X\. Lin, Y\. Lee, and P\. Abbeel \(2024\)HumanoidBench: simulated humanoid benchmark for whole\-body locomotion and manipulation\.Cited by:[§B\.2](https://arxiv.org/html/2606.15260#A2.SS2.p1.1),[Appendix B](https://arxiv.org/html/2606.15260#A2.p1.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- J\. Sohl\-Dickstein, E\. A\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InProceedings of the 32nd International Conference on International Conference on Machine Learning \- Volume 37,ICML’15,pp\. 2256–2265\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PxTIG12RRHS)Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[Figure 1](https://arxiv.org/html/2606.15260#S4.F1),[§4\.4](https://arxiv.org/html/2606.15260#S4.SS4.p1.3),[§4\.4](https://arxiv.org/html/2606.15260#S4.SS4.p1.4),[§4\.4](https://arxiv.org/html/2606.15260#S4.SS4.p2.4)\.
- R\. S\. Sutton \(1988\)Learning to predict by the methods of temporal differences\.Mach\. Learn\.3\(1\),pp\. 9–44\.External Links:ISSN 0885\-6125,[Link](https://doi.org/10.1023/A:1022633531479),[Document](https://dx.doi.org/10.1023/A%3A1022633531479)Cited by:[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px2.p2.3),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p2.3)\.
- S\. Tao, F\. Xiang, A\. Shukla, Y\. Qin, X\. Hinrichsen, X\. Yuan, C\. Bao, X\. Lin, Y\. Liu, T\. Chan,et al\.\(2025\)ManiSkill3: gpu parallelized robotics simulation and rendering for generalizable embodied ai\.InRobotics: Science and Systems,Cited by:[§B\.4](https://arxiv.org/html/2606.15260#A2.SS4.SSS0.Px2.p1.3),[§B\.4](https://arxiv.org/html/2606.15260#A2.SS4.p1.1),[Appendix B](https://arxiv.org/html/2606.15260#A2.p1.1),[§D\.1\.1](https://arxiv.org/html/2606.15260#A4.SS1.SSS1.p1.1),[§D\.2\.1](https://arxiv.org/html/2606.15260#A4.SS2.SSS1.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1),[§6](https://arxiv.org/html/2606.15260#S6.SS0.SSS0.Px2.p1.3)\.
- M\. Toussaint \(2009\)Robot trajectory optimization using approximate inference\.InProceedings of the 26th annual international conference on machine learning,pp\. 1049–1056\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§3](https://arxiv.org/html/2606.15260#S3.p2.1)\.
- F\. Vargas, W\. S\. Grathwohl, and A\. Doucet \(2023\)Denoising diffusion samplers\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6000–6010\.External Links:ISBN 9781510860964Cited by:[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p1.7)\.
- C\. Voelcker, A\. Brunnbauer, M\. Hussing, M\. Nauman, P\. Abbeel, E\. Eaton, R\. Grosu, A\. Farahmand, and I\. Gilitschenski \(2025\)Relative entropy pathwise policy optimization\.External Links:2507\.11019,[Link](https://arxiv.org/abs/2507.11019)Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p6.1),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p2.1),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px1.p3.1),[Appendix C](https://arxiv.org/html/2606.15260#A3.SS0.SSS0.Px2.p2.3),[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§2](https://arxiv.org/html/2606.15260#S2.p1.1),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p2.3),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p2.6),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p3.3),[§4\.3](https://arxiv.org/html/2606.15260#S4.SS3.p4.3),[§4](https://arxiv.org/html/2606.15260#S4.p1.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- Y\. Wang, L\. Wang, Y\. Jiang, X\. Song, W\. Wang, W\. Zou, T\. Liu, L\. Xiao, J\. Wu, J\. Duan,et al\.\(2024\)Diffusion actor\-critic with entropy regulator\.arXiv preprint arXiv:2405\.15177\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- Z\. Wang, J\. J\. Hunt, and M\. Zhou \(2023\)Diffusion policies as an expressive policy class for offline reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- F\. Xiang, Y\. Qin, K\. Mo, Y\. Xia, H\. Zhu, F\. Liu, M\. Liu, H\. Jiang, Y\. Yuan, H\. Wang, L\. Yi, A\. X\. Chang, L\. J\. Guibas, and H\. Su \(2020\)SAPIEN: a simulated part\-based interactive environment\.InThe IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§B\.4](https://arxiv.org/html/2606.15260#A2.SS4.p1.1)\.
- Z\. Xie, Q\. Zhang, F\. Yang, M\. Hutter, and R\. Xu \(2025\)Simple policy optimization\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=SG8Yx1FyeU)Cited by:[§B\.5](https://arxiv.org/html/2606.15260#A2.SS5.p3.1),[§5](https://arxiv.org/html/2606.15260#S5.p1.1)\.
- L\. Yang, Z\. Huang, F\. Lei, Y\. Zhong, Y\. Yang, C\. Fang, S\. Wen, B\. Zhou, and Z\. Lin \(2023\)Policy representation via diffusion probability model for reinforcement learning\.arXiv preprint arXiv:2305\.13122\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p2.1)\.
- K\. Zakka, B\. Tabanpour, Q\. Liao, M\. Haiderbhai, S\. Holt, J\. Y\. Luo, A\. Allshire, E\. Frey, K\. Sreenath, L\. A\. Kahrs, C\. Sferrazza, Y\. Tassa, and P\. Abbeel \(2025\)MuJoCo playground: an open\-source framework for gpu\-accelerated robot learning and sim\-to\-real transfer\.GitHub\.External Links:[Link](https://github.com/google-deepmind/mujoco_playground)Cited by:[§B\.1](https://arxiv.org/html/2606.15260#A2.SS1.p1.1),[Appendix B](https://arxiv.org/html/2606.15260#A2.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p2.1)\.
- T\. Zhang, S\. Su, C\. Yu, and Y\. Wang \(2026\)ReinFlow: fine\-tuning flow matching policy with online reinforcement learning\.arXiv preprint arXiv:2505\.22094\.Cited by:[§2](https://arxiv.org/html/2606.15260#S2.p3.1)\.
- H\. Zhou, D\. Blessing, G\. Li, O\. Celik, X\. Jia, G\. Neumann, and R\. Lioutikov \(2024\)Variational distillation of diffusion policies into mixture of experts\.Advances in Neural Information Processing Systems37,pp\. 12739–12766\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p1.1),[§1](https://arxiv.org/html/2606.15260#S1.p2.1),[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§4](https://arxiv.org/html/2606.15260#S4.p1.6)\.
- B\. D\. Ziebart, A\. L\. Maas, J\. A\. Bagnell, A\. K\. Dey,et al\.\(2008\)Maximum entropy inverse reinforcement learning\.InAAAI,Vol\.8,pp\. 1433–1438\.Cited by:[§1](https://arxiv.org/html/2606.15260#S1.p3.1),[§3](https://arxiv.org/html/2606.15260#S3.p2.1)\.
## Appendix A Derivation of Trajectory KL Upper Bound
The KL divergence between the marginal distributions of the old and the current policy (cf. Eq. [135](https://arxiv.org/html/2606.15260#S4.E135)) decomposes as
$$
\begin{aligned}
D_{\text{KL}}\Big(\overleftarrow{\pi}_{\text{old}}(a^{0}\mid s)\,\Big\|\,\overleftarrow{\pi}_{\theta}^{0}(a^{0}\mid s)\Big)
&= \int \overleftarrow{\pi}_{\text{old}}(a^{0}\mid s)\,\log\frac{\overleftarrow{\pi}_{\text{old}}(a^{0}\mid s)}{\overleftarrow{\pi}_{\theta}(a^{0}\mid s)}\,da^{0} && \text{(228)}\\
&= \int \overleftarrow{\pi}_{\text{old}}(a^{0:N}\mid s)\,\log\left(\frac{\overleftarrow{\pi}_{\text{old}}(a^{0:N}\mid s)}{\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)}\,\frac{\overleftarrow{\pi}_{\theta}(a^{1:N}\mid a^{0},s)}{\overleftarrow{\pi}_{\text{old}}(a^{1:N}\mid a^{0},s)}\right)da^{0:N} && \text{(269)}\\
&= D_{\text{KL}}\Big(\overleftarrow{\pi}^{0:N}_{\text{old}}(a^{0:N}\mid s)\,\Big\|\,\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)\Big) && \text{(286)}\\
&\quad - \mathbb{E}_{\overleftarrow{\pi}_{\text{old}}(a^{0}\mid s)}\left[D_{\text{KL}}\Big(\overleftarrow{\pi}^{1:N}_{\text{old}}(a^{1:N}\mid a^{0},s)\,\Big\|\,\overleftarrow{\pi}_{\theta}(a^{1:N}\mid a^{0},s)\Big)\right], && \text{(311)}
\end{aligned}
$$

where we used the identity $\overleftarrow{\pi}_{\theta}(a^{0}\mid s)=\frac{\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)}{\overleftarrow{\pi}_{\theta}(a^{1:N}\mid a^{0},s)}$ in Eq. [269](https://arxiv.org/html/2606.15260#A1.E269).
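Because the KL divergence is non-negative, dropping the expected conditional KL term in Eq. [311](https://arxiv.org/html/2606.15260#A1.E311) yields the upper bound $D_{\text{KL}}\big(\overleftarrow{\pi}_{\text{old}}(a^{0}\mid s)\,\|\,\overleftarrow{\pi}_{\theta}(a^{0}\mid s)\big) \leq D_{\text{KL}}\big(\overleftarrow{\pi}_{\text{old}}(a^{0:N}\mid s)\,\|\,\overleftarrow{\pi}_{\theta}(a^{0:N}\mid s)\big)$. Alternatively, the same bound follows from the data processing inequality (Cover, [1999](https://arxiv.org/html/2606.15260#bib.bib57)).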
Remark. [Equation 311](https://arxiv.org/html/2606.15260#A1.E311) shows that the gap between the true marginal KL (left-hand side) and our upper bound, i.e., the joint trajectory KL (first term on the right-hand side), is precisely the expected conditional KL (second term on the right-hand side). If this term is zero, the bound is tight. However, for the stochastic diffusion policies considered in our work, this term is generally nonzero, so tightness is difficult to characterize in full generality. Nevertheless, as we show empirically in [Figure 6](https://arxiv.org/html/2606.15260#S6.F6), a valid upper bound on the marginal KL ([Equation 286](https://arxiv.org/html/2606.15260#A1.E286)) is sufficient. This is reflected in Fig. 4 (Left), where overly strict bounds ($\epsilon = 0.01$) lead to slow updates, but moderate values $\epsilon \in [0.1, 0.4]$ yield strong performance. More broadly, TruDi exhibits robust behavior across a wide range of $\epsilon$ values (Fig. 4), suggesting that the proposed trajectory KL upper bound is effective in practice.
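The decomposition above can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code: it assumes small discrete distributions in place of the diffusion trajectory, with the whole tail $a^{1:N}$ collapsed into a single variable, and all names (`random_joint`, `kl`, `p_old`, `p_new`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_joint(n0, n1, rng):
    """Random strictly positive joint distribution p(a0, a_tail) on a small grid."""
    p = rng.random((n0, n1)) + 1e-3
    return p / p.sum()


def kl(p, q):
    """KL(p || q) for strictly positive arrays of the same shape."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Toy stand-ins for pi_old(a^{0:N} | s) and pi_theta(a^{0:N} | s).
p_old = random_joint(4, 6, rng)
p_new = random_joint(4, 6, rng)

# Marginals over a0 and conditionals of the tail given a0.
m_old, m_new = p_old.sum(axis=1), p_new.sum(axis=1)
c_old = p_old / m_old[:, None]
c_new = p_new / m_new[:, None]

marginal_kl = kl(m_old, m_new)
joint_kl = kl(p_old, p_new)
# Expected conditional KL, with the expectation taken under the old marginal.
cond_kl = float(np.sum(m_old * np.sum(c_old * (np.log(c_old) - np.log(c_new)), axis=1)))

# Chain rule: marginal KL = joint KL - expected conditional KL.
assert np.isclose(marginal_kl, joint_kl - cond_kl)
# Dropping the non-negative conditional term gives the upper bound.
assert marginal_kl <= joint_kl
```

The gap `joint_kl - marginal_kl` equals `cond_kl`, which is the quantity the remark identifies as the slack of the bound.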
## Appendix B Experiment Details
We evaluate TruDi on four distinct environment suites: MuJoCo Playground (Zakka et al., [2025](https://arxiv.org/html/2606.15260#bib.bib27)), HumanoidBench (Sferrazza et al., [2024](https://arxiv.org/html/2606.15260#bib.bib26)), IsaacLab (Mittal et al., [2025](https://arxiv.org/html/2606.15260#bib.bib25)), and ManiSkill3 (Tao et al., [2025](https://arxiv.org/html/2606.15260#bib.bib4)). These benchmarks were selected to comprehensively test the algorithm's capability across standard continuous control, high-DoF humanoid locomotion, rough-terrain navigation, and contact-rich manipulation.
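As a quick sanity check, the per-suite task counts reported in the subsections of this appendix can be tallied against the total of 73 tasks stated in the abstract. This is a small sketch: the Humanoid Joystick count of four is an inference from two robots on two terrains, and the dictionary keys are illustrative labels rather than names used by the authors.

```python
# Per-suite task counts as reported in Appendix B.
task_counts = {
    "MuJoCo Playground / DMC": 20,
    # Assumption: G1 and T1, each on Flat and Rough terrain.
    "MuJoCo Playground / Humanoid Joystick": 2 * 2,
    "HumanoidBench": 30,
    # Four locomotion tasks plus two in-hand reorientation tasks.
    "IsaacLab": 4 + 2,
    "ManiSkill3": 13,
}

total = sum(task_counts.values())
print(total)  # 73
```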
### B.1 MuJoCo Playground
We utilize the MuJoCo XLA (MJX) implementation provided by MuJoCo Playground (Zakka et al., [2025](https://arxiv.org/html/2606.15260#bib.bib27)), which allows for massive parallelization on GPU. We evaluate on two distinct subsets: the classic DeepMind Control (DMC) Suite and the modern Humanoid Joystick tasks.
##### DeepMind Control Suite.
We evaluate on a comprehensive set of 20 tasks covering a wide range of dimensionality and difficulty. The selection includes both dense and sparse reward variants to explicitly test exploration capabilities:
- Locomotion (7 tasks): Cheetah Run, Fish Swim, Hopper Hop/Stand, and Walker Run/Stand/Walk.
- Classic Control & Manipulation (13 tasks, counting the three sparse variants listed below): Acrobot Swingup, Ball In Cup, Cartpole Balance/Swingup, Finger Spin, Finger Turn Easy/Hard, Pendulum Swingup, and Reacher Easy/Hard.
- Sparse Variants (counted in the 13 tasks above): We explicitly include AcrobotSwingupSparse, CartpoleBalanceSparse, and CartpoleSwingupSparse to test exploration in sparse-reward settings.

Observation & Action Spaces: The state space $\mathcal{S}$ consists of proprioceptive data (joint positions, velocities) and task-specific features (e.g., target coordinates). The action dimension ranges from $|\mathcal{A}| = 1$ (Pendulum) to $|\mathcal{A}| = 6$ (Walker/Cheetah), with actions normalized to $[-1, 1]$. Reward: We use the standard rewards defined in the suite; the dense rewards generally combine a velocity tracking term $\mathcal{R}_{\text{track}}$ with small control regularization penalties $|u|^{2}$.
##### Humanoid Locomotion (G1 & T1).
We evaluate the Joystick tasks on Flat and Rough terrains using two modern humanoid robots: the Unitree G1 (29 DoF) and the Booster T1.
- Objective: The agent must track a randomized command vector $c = (v_x, v_y, \omega_z)$ specifying target linear and angular velocities.
- Observation: The observation space $\mathcal{S} \in \mathbb{R}^{\approx 60-70}$ includes joint positions and velocities, base linear/angular velocity, the projected gravity vector, and the command history.
- Reward: The reward is defined as a product $r_t = r_{\text{tracking}} \cdot r_{\text{penalty}}$. The $r_{\text{tracking}}$ term encourages matching the command velocity, while $r_{\text{penalty}}$ minimizes energy consumption and joint jerk to promote smooth, transferable gaits.
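The command-tracking setup described above can be sketched as follows. This is an illustration only, not the benchmark's reward code: the exponential kernels, the command ranges, and the names `sample_command`, `joystick_reward`, `sigma`, and `penalty_scale` are all assumptions, and squared torque stands in for the energy and jerk penalties.

```python
import numpy as np


def sample_command(rng, v_max=1.0, w_max=1.0):
    """Draw c = (v_x, v_y, omega_z) uniformly from a symmetric box (hypothetical ranges)."""
    low = [-v_max, -v_max, -w_max]
    high = [v_max, v_max, w_max]
    return rng.uniform(low, high)


def joystick_reward(base_vel, command, torques, sigma=0.25, penalty_scale=1e-3):
    """Product reward r_t = r_tracking * r_penalty with illustrative exponential kernels."""
    # Tracking factor: 1 when the measured (v_x, v_y, omega_z) matches the command.
    r_tracking = np.exp(-np.sum((base_vel - command) ** 2) / sigma)
    # Penalty factor: 1 at zero effort, decaying with squared torque.
    r_penalty = np.exp(-penalty_scale * np.sum(torques ** 2))
    return float(r_tracking * r_penalty)


rng = np.random.default_rng(0)
command = sample_command(rng)
perfect = joystick_reward(command, command, np.zeros(12))
off_target = joystick_reward(command + 0.5, command, np.zeros(12))
```

Because the two factors are multiplied, a policy cannot compensate for poor tracking by being smooth, or the reverse: either factor approaching zero drives the whole reward toward zero.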
### B.2 HumanoidBench
We evaluate on HumanoidBench (Sferrazza et al., [2024](https://arxiv.org/html/2606.15260#bib.bib26)), utilizing the h1hand variant of the Unitree H1 robot. This setup presents a significantly harder challenge than standard locomotion benchmarks, as it requires the policy to simultaneously manage whole-body balance, locomotion, and fine-grained dexterous manipulation.
##### Tasks.
We evaluate on a comprehensive suite of 30 tasks spanning three distinct behavioral categories. This diverse set tests the agent's ability to generalize across varying dynamic requirements:
- Locomotion & Agility (13 tasks): walk, run, stand, crawl, hurdle, stair, slide, maze, pole, balance_simple/hard, and sit_simple/hard.
- Whole-Body Manipulation (12 tasks): push, reach, truck, package, cabinet, door, window, room, kitchen, basketball, and bookshelf_simple/hard.
- Dexterous Manipulation (5 tasks): Fine-motor tasks including cube, spoon, powerlift, and insert_normal/small.
##### Spaces.
The state space is high-dimensional ($\mathcal{S} \in \mathbb{R}^{\approx 60-90}$), consisting of the root state (orientation, velocity), full-body joint states, and task-specific features such as object poses. Crucially, the action space is $\mathcal{A} \in \mathbb{R}^{61}$, controlling both the 19 actuated joints of the humanoid body and the 42 degrees of freedom of the dexterous hands. This vast action space makes the exploration problem particularly acute, motivating the maximum entropy formulation in TruDi.
##### Reward.
We utilize the standardized multiplicative reward function provided by the benchmark: $r_t = r_{\text{track}} \cdot r_{\text{energy}} \cdot r_{\text{alive}}$. This structure acts as a soft constraint: because the factors are multiplied, the agent must maintain stability ($r_{\text{alive}}$) and energy efficiency ($r_{\text{energy}}$) while maximizing task progress ($r_{\text{track}}$).
### B.3 IsaacLab
We evaluate the robustness and dexterity of learned policies using IsaacLab (Mittal et al., [2025](https://arxiv.org/html/2606.15260#bib.bib25)), an Omniverse-based simulation framework. Our evaluation covers two distinct domains: rough-terrain humanoid locomotion and high-DoF dexterous in-hand manipulation.
##### Humanoid Locomotion.
We evaluate locomotion on four primary tasks utilizing the Unitree G1 and H1 robots: Isaac-Velocity-Flat-G1-v0, Isaac-Velocity-Rough-G1-v0, Isaac-Velocity-Flat-H1-v0, and Isaac-Velocity-Rough-H1-v0.
- Terrain: While Flat tasks operate on a plane, the Rough variants utilize procedural terrain generation, presenting slopes, stairs, and discrete obstacles that require robust recovery behaviors.
- Spaces: The state space is $\mathcal{S} \in \mathbb{R}^{48}$ for flat terrain. For rough terrain, this expands to $\mathcal{S} \in \mathbb{R}^{235}$ by including a 187-dimensional height scan for local path planning. The action space consists of continuous PD position targets (23 DoF for G1, 19 DoF for H1).
- Reward: We employ a dense reward function tracking linear/angular velocity commands while penalizing joint torques, accelerations, and unstable base orientations.
##### Dexterous Manipulation\.
We evaluate fine\-grained contact control on two in\-hand reorientation tasks: Isaac\-Repose\-Cube\-Allegro\-Direct\-v0 and Isaac\-Repose\-Cube\-Shadow\-Direct\-v0\.
- Objective: The agent controls a multi\-fingered robotic hand \(4\-fingered Allegro Hand or 5\-fingered Shadow Hand\) to rotate a cube to a randomly sampled target orientation\.
- Spaces: The observation space includes the hand’s joint positions and velocities, the object’s current pose \(position and quaternion\), and the target orientation\. The action space controls the hand’s actuated joints \(16 DoF for Allegro, 24 DoF for Shadow Hand\), presenting a challenging high\-dimensional continuous control problem with frequent contact discontinuities\.
### B\.4 ManiSkill3
We evaluate fine\-grained robotic manipulation capabilities using ManiSkill3 \(Tao et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib4)\), a GPU\-parallelized simulator built on the SAPIEN engine \(Xiang et al\., [2020](https://arxiv.org/html/2606.15260#bib.bib86)\)\. Unlike standard rigid\-body benchmarks, ManiSkill3 emphasizes contact\-rich, physical interaction with diverse object geometries\. We evaluate on a custom suite of 13 tasks utilizing the Franka Emika Panda arm\.
##### Tasks\.
The selected tasks cover a spectrum of manipulation challenges, ranging from basic pick\-and\-place to high\-precision assembly and multi\-agent coordination:
- Standard Manipulation: PickCube, StackCube, PokeCube, PullCube, PushCube, and RollBall\.
- Precision & Assembly: LiftPegUpright, PlaceSphere, PlugCharger, and the trajectory\-centric PushT\.
- Tool Use: PullCubeTool, where the agent must grasp an intermediate tool to manipulate a target object out of reach\.
- Multi\-Robot Coordination: TwoRobotPickCube and TwoRobotStackCube, requiring the coordination of two independent robot arms \(16 DoF total\) to solve a shared objective\.
##### Spaces\.
The state space $\mathcal{S} \in \mathbb{R}^{40+}$ generally comprises robot proprioception \(joint positions, velocities, gripper width\) and ground\-truth object states \(pose, linear/angular velocities\)\. The action space $\mathcal{A}$ utilizes delta joint position control \(normalized $\Delta q \in [-1, 1]$\), which has been shown to minimize the Sim2Real gap \(Tao et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib4)\):
- Single\-Arm: $\mathcal{A} \in \mathbb{R}^{8}$ \(7 arm joints \+ 1 gripper\)\.
- Dual\-Arm: $\mathcal{A} \in \mathbb{R}^{16}$ \(concatenated controls for both robots\)\.
##### Reward\.
We utilize the standardized dense reward functions provided by the benchmark\. These typically consist of a reaching term \(distance to object\), a manipulation term \(distance to target pose\), and binary success indicators to guide exploration in sparse\-contact phases\.
### B\.5 Baselines
We compare TruDi against the following state\-of\-the\-art baselines, covering standard on\-policy methods as well as diffusion\- and flow\-based policy approaches:
PPO \(Proximal Policy Optimization\) \(Schulman et al\., [2017](https://arxiv.org/html/2606.15260#bib.bib19)\): As the standard on\-policy algorithm for continuous control, PPO modifies the policy gradient objective by clipping the probability ratio, which approximates a trust region and discourages destructively large policy updates\. We use the highly optimized implementation from RSL RL \(Schwarke et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib62)\), parameterizing the policy as a diagonal Gaussian distribution where the mean and standard deviation are output by the actor network\. Value targets are computed using Generalized Advantage Estimation \(GAE\), and the network is trained via the Adam optimizer\.
SPO \(Simple Policy Optimization\) \(Xie et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib63)\): This unconstrained first\-order algorithm is designed to enforce trust region constraints more effectively than PPO’s hard clipping\. SPO employs a specialized objective that imposes a soft constraint on the probability ratio deviation, dynamically scaled by the magnitude of the advantage estimates\. Specifically, the policy objective is formulated as $\mathbb{E}\left[r_t(\theta)\hat{A}_t - \frac{|\hat{A}_t|}{2\epsilon}\left(r_t(\theta) - 1\right)^{2}\right]$, which penalizes large deviations from the behavior policy without zeroing out gradients \(unlike clipping\)\. We utilize the official implementation and adapt it on top of the RSL RL PPO framework with GAE and Adam, using an MLP\-based diagonal Gaussian parameterization to ensure a fair comparison with the PPO baseline\.
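To make the contrast with clipping concrete, the following sketch evaluates both per-sample objectives as functions of the probability ratio. It is an illustrative re-implementation of the formula above, not the official SPO or RSL RL code; the function names are hypothetical, and $\epsilon = 0.2$ is used only as an example value.

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def spo_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample SPO surrogate: r * A - |A| / (2 * eps) * (r - 1)^2."""
    return ratio * advantage - abs(advantage) / (2.0 * eps) * (ratio - 1.0) ** 2


def spo_gradient(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Analytic derivative of the SPO surrogate with respect to the ratio."""
    return advantage - abs(advantage) / eps * (ratio - 1.0)


# Positive advantage, ratio already outside the trust region (r > 1 + eps).
A = 2.0
ppo_flat = ppo_clip_objective(1.6, A) - ppo_clip_objective(1.5, A)  # no signal
spo_pull = spo_gradient(1.5, A)  # negative: pushes the ratio back toward 1 + eps
```

For a positive advantage, the SPO surrogate is a concave quadratic in the ratio that peaks at $r = 1 + \epsilon$, so its gradient turns negative beyond that point, whereas the clipped PPO surrogate is flat there.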
FPO \(Flow Policy Optimization\) \(McAllister et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib31)\): An on\-policy formulation that integrates flow matching into the policy gradient framework\. FPO circumvents the intractability of exact likelihood computation in flow\-based models by casting policy optimization as maximizing an advantage\-weighted ratio derived from the conditional flow matching loss\. This objective is optimized within a PPO\-style clipping framework\. We use the official implementation, parameterizing the vector field with an MLP\.
DPPO \(Diffusion Policy Policy Optimization\) \(Ren et al\., [2024](https://arxiv.org/html/2606.15260#bib.bib32)\): A systematic framework for fine\-tuning diffusion\-based policies using on\-policy RL\. DPPO adapts the PPO objective to diffusion models by parameterizing the policy as a conditional diffusion model at each denoising step\. To enable efficient online training, we utilize the official codebase and implement it on top of the RSL RL PPO framework, parameterizing the diffusion backbone as an MLP with sinusoidal timestep embeddings\. The policy is optimized using Adam with a learning rate decay schedule, while the critic remains a standard MLP trained to minimize temporal difference error\.
REPPO \(Relative Entropy Pathwise Policy Optimization\) \(Voelcker et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib14)\): An on\-policy algorithm that enables the use of low\-variance pathwise gradients \(reparameterization trick\) by training a critic on on\-policy data, stabilized via a relative entropy \(KL\) trust\-region constraint\. This serves as a critical ablation: by comparing Gaussian\-REPPO against our diffusion\-based method, we isolate the performance gains attributable to the expressivity of the generative parameterization from those derived from the underlying pathwise optimization objective\.
DIME \(Diffusion\-Based Maximum Entropy RL\) \(Celik et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib10)\): The state\-of\-the\-art off\-policy algorithm for diffusion policies\. DIME extends the standard MaxEnt framework to diffusion models by deriving a tractable lower bound on the policy entropy, allowing for principled exploration without ad\-hoc regularizers\. We utilize identical MLP backbones for DIME as in our proposed method to remove confounding factors related to network architecture\.
## Appendix C Implementation and Training Details
We provide a comprehensive overview of the network architectures, training procedures, and hyperparameters used for TruDi\. Our implementation leverages both the JAX \(Bradbury et al\., [2018](https://arxiv.org/html/2606.15260#bib.bib87)\) and PyTorch \(Paszke et al\., [2019](https://arxiv.org/html/2606.15260#bib.bib88)\) frameworks to maximize compatibility and training throughput\.
##### Network Architectures\.
The diffusion policy $\pi_{\theta}(a|s)$ is parameterized as a conditional noise prediction network $\epsilon_{\theta}(x_t, t, s)$\. Following the design principles in DIME \(Celik et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib10)\), we utilize a Multi\-Layer Perceptron \(MLP\) adapted for low\-dimensional state\-action spaces, departing from the U\-Net architectures typically used in image generation\. The network conditions on the state $s$ and the diffusion timestep $t$\. We encode $t$ using sinusoidal Fourier features, as in Vaswani et al\. \([2017](https://arxiv.org/html/2606.15260#bib.bib72)\), followed by a 2\-layer MLP projection\. The state $s$, noisy action $x_t$, and time embedding are concatenated and passed through a residual backbone consisting of 3 hidden layers with 512 units each and GeLU activations\. We apply Layer Normalization \(Ba et al\., [2016](https://arxiv.org/html/2606.15260#bib.bib71)\) to the inputs of each residual block to stabilize training\.
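A sinusoidal timestep embedding of the kind described above can be sketched as follows. The embedding dimension of 32, the `max_period` of 10000, and the layout (all sine features followed by all cosine features, rather than interleaved) are assumptions made for illustration, and the function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_embedding(t: float, dim: int = 32, max_period: float = 10000.0) -> List[float]:
    """Map a scalar timestep to `dim` sinusoidal features at geometrically spaced frequencies."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so sine and cosine features pair up")
    half = dim // 2
    # Frequencies decay geometrically from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


# Embeddings for the first and last of 8 denoising steps.
emb_first = sinusoidal_embedding(0.0)
emb_last = sinusoidal_embedding(7.0)
```

In the architecture described above, a vector like this would then pass through the 2-layer MLP projection before being concatenated with the state and the noisy action.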
For the value function, we follow the design proposed in REPPO \(Voelcker et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib14)\) to ensure stable pathwise gradient propagation\. The critic $Q_{\phi}(s, a)$ is parameterized as a 3\-layer MLP with 512 hidden units and SiLU activations\. To model the value uncertainty, we employ a distributional head with a support size of 151 bins\. Crucially, consistent with the REPPO architecture, we apply Layer Normalization \(Ba et al\., [2016](https://arxiv.org/html/2606.15260#bib.bib71)\) to the first hidden layer of the critic\. This normalization prevents the scale of the value gradients from diverging, which is essential for maintaining the trust region without aggressive gradient clipping\.
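One common way to train a critic with a binned distributional head is the HL-Gauss construction, in which a scalar target is spread over neighboring bins with a Gaussian. The sketch below is an assumption about how such targets could be built, not the authors' implementation: the value range of $\pm 10$, the reading of 151 as the number of bins, the smoothing ratio `sigma_ratio`, and all function names are illustrative.

```python
import math
from typing import List


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hl_gauss_target(y: float, v_min: float, v_max: float,
                    num_bins: int = 151, sigma_ratio: float = 0.75) -> List[float]:
    """Turn a scalar target into bin probabilities via Gaussian CDF differences."""
    width = (v_max - v_min) / num_bins
    sigma = sigma_ratio * width            # smoothing tied to the bin width (assumed)
    y = min(max(y, v_min), v_max)          # clamp the target to the supported range
    edges = [v_min + i * width for i in range(num_bins + 1)]
    cdf = [normal_cdf((e - y) / sigma) for e in edges]
    mass = [cdf[i + 1] - cdf[i] for i in range(num_bins)]
    total = sum(mass)                      # renormalize mass cut off at the range ends
    return [m / total for m in mass]


def expected_value(probs: List[float], v_min: float, v_max: float) -> float:
    """Recover a scalar value estimate as the probability-weighted bin center."""
    width = (v_max - v_min) / len(probs)
    return sum(p * (v_min + (i + 0.5) * width) for i, p in enumerate(probs))


probs = hl_gauss_target(2.5, v_min=-10.0, v_max=10.0)
recovered = expected_value(probs, v_min=-10.0, v_max=10.0)
```

A critic trained this way would minimize a cross-entropy loss against `probs` and report `expected_value` of its predicted distribution as the scalar Q-value.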
##### Training Procedure\.
Training proceeds in an on\-policy fashion\. In each iteration, we collect trajectories by rolling out the current diffusion policy in parallel environments\. Actions are generated using the DIME scheduler \(Celik et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib10)\) with $K = 8$ denoising steps\. We normalize environment observations using online running statistics \(mean and variance\), which are updated during rollouts and frozen during gradient updates\.
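The observation normalization described above can be sketched with a running mean and variance that are updated during rollouts and frozen during gradient updates. This is a minimal single-sample version using Welford's algorithm; the class name, the `frozen` flag, and the `eps` constant are assumptions for illustration, and a real implementation would update from batches across parallel environments.

```python
import math
from typing import List, Sequence


class RunningNormalizer:
    """Tracks per-dimension running mean and variance of observations."""

    def __init__(self, dim: int, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim      # running sum of squared deviations from the mean
        self.eps = eps
        self.frozen = False        # set to True while gradient updates are running

    def update(self, obs: Sequence[float]) -> None:
        """Fold one observation into the statistics unless they are frozen."""
        if self.frozen:
            return
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs: Sequence[float]) -> List[float]:
        """Standardize an observation with the current (population) statistics."""
        if self.count < 2:
            return list(obs)
        return [(x - m) / math.sqrt(s / self.count + self.eps)
                for x, m, s in zip(obs, self.mean, self.m2)]


norm = RunningNormalizer(dim=1)
for value in (1.0, 2.0, 3.0):      # rollout phase: statistics are updated
    norm.update([value])
norm.frozen = True                 # update phase: statistics stay fixed
norm.update([100.0])               # ignored while frozen
```

Freezing the statistics keeps the inputs seen by the actor and critic stationary within a single batch of gradient steps.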
Following the training recipe of REPPO \(Voelcker et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib14)\), we utilize TD\($\lambda$\) \(Sutton, [1988](https://arxiv.org/html/2606.15260#bib.bib74)\) to compute targets for the critic, rather than Generalized Advantage Estimation \(GAE\)\. Specifically, we calculate the entropy\-augmented $\lambda$\-return $G^{\lambda}_{t}$ to serve as the regression target for the value function\. This formulation effectively balances the bias\-variance tradeoff in the target estimates without requiring a separate advantage computation\. The policy is then updated to maximize the learned Q\-value via pathwise gradients, subject to a trust\-region constraint where the intractable entropy term is approximated via the diffusion training loss\.
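The $\lambda$-return used as the critic target can be computed with a single backward pass over a rollout. The sketch below implements the standard recursion $G^{\lambda}_t = r_t + \gamma\left[(1-\lambda)V(s_{t+1}) + \lambda G^{\lambda}_{t+1}\right]$. How the entropy term enters in TruDi is not specified here, so the optional `entropy_bonus` argument is a hypothetical placeholder, as are the function and argument names.

```python
from typing import List, Optional, Sequence


def lambda_returns(rewards: Sequence[float], next_values: Sequence[float],
                   dones: Sequence[bool], gamma: float = 0.99, lam: float = 0.95,
                   entropy_bonus: Optional[Sequence[float]] = None) -> List[float]:
    """Backward recursion for lambda-returns over one rollout of length T.

    next_values[t] is the value estimate of the state reached after step t;
    dones[t] marks that the episode terminated at step t (no bootstrapping).
    """
    horizon = len(rewards)
    bonus = entropy_bonus if entropy_bonus is not None else [0.0] * horizon
    returns = [0.0] * horizon
    next_return = next_values[-1]          # bootstrap beyond the end of the rollout
    for t in reversed(range(horizon)):
        not_done = 0.0 if dones[t] else 1.0
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        returns[t] = rewards[t] + bonus[t] + gamma * not_done * blended
        next_return = returns[t]
    return returns


rewards = [1.0, 1.0, 1.0]
next_values = [0.5, 0.5, 0.5]
no_dones = [False, False, False]
td0_targets = lambda_returns(rewards, next_values, no_dones, gamma=0.9, lam=0.0)
mc_targets = lambda_returns(rewards, next_values, no_dones, gamma=0.9, lam=1.0)
```

Setting `lam=0.0` recovers one-step TD targets, while `lam=1.0` gives the discounted return bootstrapped only at the end of the rollout; intermediate values interpolate between the two.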
##### Hyperparameters and Resources\.
We use the Adam optimizer for both the actor and critic networks\. Table [3](https://arxiv.org/html/2606.15260#A3.T3) summarizes the core hyperparameters used across the MuJoCo Playground and HumanoidBench experiments\. All experiments were conducted on a high\-performance computing cluster equipped with NVIDIA A100 \(80GB\) and H100 GPUs\.
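Table 3: Hyperparameters for TruDi\.

| Group | Parameter | Value |
| --- | --- | --- |
| General Settings | Total Timesteps \(Locomotion\) | $5 \times 10^{7}$ – $3 \times 10^{8}$ |
| General Settings | Total Timesteps \(Manipulation\) | $5 \times 10^{7}$ – $3 \times 10^{8}$ |
| General Settings | Num Environments | 2048 – 4096 |
| General Settings | Rollout Length | 128 |
| General Settings | Discount Factor $\gamma$ | 0\.99 |
| General Settings | GAE $\lambda$ | 0\.95 |
| Diffusion Policy \(Actor\) | Architecture | Residual MLP \+ LayerNorm |
| Diffusion Policy \(Actor\) | Hidden Layers | 3 |
| Diffusion Policy \(Actor\) | Hidden Dimension | 512 |
| Diffusion Policy \(Actor\) | Time Dimension | 32 |
| Diffusion Policy \(Actor\) | Activation | GeLU |
| Diffusion Policy \(Actor\) | Diffusion Steps $T$ | 8 |
| Diffusion Policy \(Actor\) | Noise Schedule | Cosine \($\beta_{min} = 10^{-4}$, $\beta_{max} = 2 \times 10^{-2}$\) |
| Value Function \(Critic\) | Architecture | MLP \+ LayerNorm |
| Value Function \(Critic\) | Hidden Layers | 3 |
| Value Function \(Critic\) | Hidden Dimension | 512 |
| Value Function \(Critic\) | Activation | GeLU |
| Value Function \(Critic\) | Ensemble/Bin Size | 256 |
| Optimization | Optimizer | Adam |
| Optimization | Actor Learning Rate | $3 \times 10^{-4}$ |
| Optimization | Critic Learning Rate | $3 \times 10^{-4}$ |
| Optimization | Mini\-batch Size | 2048 |
| Optimization | Num Epochs | 8 |
| Optimization | Trust Region $\epsilon$ | 0\.1 |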
## Appendix D Multi\-Modal Validation Tasks
To highlight the multi\-modal solution generation capabilities of TruDi, we design two illustrative tasks: Push\-T and StackCube\. These tasks are specifically engineered to possess symmetric solutions, allowing us to test whether the policy can learn and represent multiple distinct geometric modes simultaneously\.
### D\.1 Push\-T Task
In the Push\-T task, the robot must rotate a T\-shaped block $\pi$ radians \(180 degrees\) to match a goal orientation\. The task is designed with inherent geometric symmetry: from a neutral starting position, rotating the block either clockwise or counter\-clockwise constitutes a valid and optimal solution\. A uni\-modal policy \(e\.g\., Gaussian\) often averages these modes, resulting in failure \(e\.g\., getting stuck in the middle\), whereas a multi\-modal policy should be able to commit to one specific direction\.
#### D\.1\.1 Reward Design
We implement these environments as modified versions of the standard Push\-T task from the ManiSkill simulation framework \(Tao et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib4)\), specifically adapted to highlight multi\-modal solution capabilities\.
##### PushT\-Train \(Training Environment\)\.
To ensure robust policy learning, the training environment features full randomization\. Both the T\-shaped block and the goal position are initialized randomly within the workspace bounds \($x \in [-0.3, 0.1]$ m, $y \in [-0.3, 0.2]$ m\), with random Z\-axis rotations $\theta \in [0, 2\pi]$\.
We employ a dense, shaped reward function $r = r_{\text{rot}} + r_{\text{pos}} + r_{\text{tcp}}$ \(max value 3\.0\), consisting of three components:
1. Rotation Alignment:
   $$r_{\text{rot}} = \frac{1}{2}\left(\frac{\cos(\theta_{\text{block}} - \theta_{\text{goal}}) + 1}{2}\right)^{2} \tag{312}$$
   where $\theta$ denotes the Z\-axis rotation in radians\. The squaring operation amplifies rewards as the alignment approaches perfection\.
2. Position Alignment:
   $$r_{\text{pos}} = \frac{1}{2}\left(1 - \tanh\left(5\,\|p_{\text{block}} - p_{\text{goal}}\|_{2}\right)\right)^{2} \tag{313}$$
   where $p \in \mathbb{R}^{2}$ denotes the 2D position\.
3. Manipulation Guidance:
   $$r_{\text{tcp}} = \frac{1}{20}\sqrt{1 - \tanh\left(5\,\|p_{\text{tcp}} - p_{\text{block}}\|_{2}\right)} \tag{314}$$
   This small shaping term encourages the tool center point \(TCP\) to remain close to the block\.
Success Criterion: Success is defined as $\geq 90\%$ intersection area between the block and goal footprints\.
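The three shaped terms can be evaluated directly from the formulas above. The sketch below is an illustrative transcription, not the environment code: the function name is hypothetical, and all positions are treated as 2D points for simplicity. It covers only the three terms as written, which peak at 0.5, 0.5, and 0.05 respectively; how the stated maximum of 3.0 is reached (for example through a success bonus) is not specified above and is not modeled.

```python
import math
from typing import Sequence, Tuple


def push_t_shaped_terms(theta_block: float, theta_goal: float,
                        p_block: Sequence[float], p_goal: Sequence[float],
                        p_tcp: Sequence[float]) -> Tuple[float, float, float]:
    """Return (r_rot, r_pos, r_tcp) for the Push-T shaped reward."""
    # Rotation alignment: squared, rescaled cosine similarity of the Z rotations.
    r_rot = 0.5 * ((math.cos(theta_block - theta_goal) + 1.0) / 2.0) ** 2
    # Position alignment: squared tanh-shaped distance between block and goal.
    r_pos = 0.5 * (1.0 - math.tanh(5.0 * math.dist(p_block, p_goal))) ** 2
    # Manipulation guidance: small bonus for keeping the TCP near the block.
    r_tcp = math.sqrt(1.0 - math.tanh(5.0 * math.dist(p_tcp, p_block))) / 20.0
    return r_rot, r_pos, r_tcp


origin = (0.0, 0.0)
aligned = push_t_shaped_terms(0.0, 0.0, origin, origin, origin)
flipped = push_t_shaped_terms(math.pi, 0.0, origin, origin, origin)
turn_pos = push_t_shaped_terms(0.3, 0.0, origin, origin, origin)
turn_neg = push_t_shaped_terms(-0.3, 0.0, origin, origin, origin)
```

Because the rotation term depends only on the cosine of the angular error, `turn_pos` and `turn_neg` receive the same reward, which is the symmetry the multi-modal test environment relies on.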
##### PushT\-Test \(Multi\-Modal Evaluation\)\.
We design a deterministic test environment to isolate multi\-modal decision\-making\. The goal is fixed at the workspace center with zero rotation, and the block is initialized at the same position but rotated exactly $\pi$ radians\. Crucially, the reward function is rotation\-symmetric: rotating by $+\theta$ \(counter\-clockwise\) or $-\theta$ \(clockwise\) yields identical rewards\. This symmetry enables us to quantitatively analyze whether TruDi can model the bimodal distribution of valid trajectories\.

Figure 7: Push\-T Multi\-Modal Task\. Panels: \(a\) Initial Position, \(b\) Mode 1: Clockwise, \(c\) Mode 2: Counter\-Clockwise\. The agent must rotate the T\-block $\pi$ radians\. The task admits two symmetric solutions \(modes\), requiring the policy to commit to one direction rather than averaging them\.
### D\.2 StackCube Task
To further evaluate multi\-modality in a contact\-rich manipulation setting, we extend the standard stacking task\. In our setup, two cubes \(Red and Blue\) are placed in the scene\. The goal is simply to stack one cube on top of the other\. This creates two valid geometric modes: Red\-on\-Blue or Blue\-on\-Red\.
#### D\.2\.1 Reward Design
We implement this task by extending the ManiSkill \(Tao et al\., [2025](https://arxiv.org/html/2606.15260#bib.bib4)\) stacking environment\.
##### StackCube\-Train \(Training Environment\)\.
During training, the environment is fully randomized to promote generalization\. The Red cube \(Cube A\) and Blue cube \(Cube B\) spawn at random XY positions, and the specific stacking goal \(A\-on\-B or B\-on\-A\) is randomly selected per episode\. The dense reward function consists of five stages \(max value 8\.0\):
1. Reaching \(Stage 1\): Guides the end\-effector to the nearest cube:
   $$r_{\text{reach}} = 2\left(1 - \tanh\left(5\,d_{\text{tcp-nearest}}\right)\right) \tag{315}$$
   where $d_{\text{tcp-nearest}} = \min\left(\|p_{\text{tcp}} - p_{\text{A}}\|, \|p_{\text{tcp}} - p_{\text{B}}\|\right)$\.
2. Placing \(Stage 2\): Uses a symmetric maximum operator to avoid biasing the policy toward a specific cube order:
   $$r_{\text{place}} = \max\left(r_{\text{A-on-B}}, r_{\text{B-on-A}}\right) \tag{316}$$
   where $r_{\text{X-on-Y}}$ encourages moving cube X to the top of cube Y\.
3. Release & Stability \(Stage 3\): Once stacked, additional terms $r_{\text{ungrasp}}$ and $r_{\text{static}}$ encourage the robot to release the object and ensure the stack is stable\.
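The symmetric operators in the first two stages can be sketched as below. This is an illustrative transcription of the reaching and placing formulas only: the function names are hypothetical, the per-order placing rewards are passed in as plain numbers because their definitions are not given above, and the remaining stages are not modeled.

```python
import math
from typing import Sequence


def reach_reward(p_tcp: Sequence[float], p_a: Sequence[float], p_b: Sequence[float]) -> float:
    """Stage 1: reward proximity of the end-effector to whichever cube is nearer."""
    d_nearest = min(math.dist(p_tcp, p_a), math.dist(p_tcp, p_b))
    return 2.0 * (1.0 - math.tanh(5.0 * d_nearest))


def place_reward(r_a_on_b: float, r_b_on_a: float) -> float:
    """Stage 2: take the better of the two stacking orders."""
    return max(r_a_on_b, r_b_on_a)


# End-effector midway between two cubes placed symmetrically along y.
tcp = (0.0, 0.0, 0.1)
cube_a = (0.0, -0.35, 0.02)
cube_b = (0.0, 0.35, 0.02)
r_original = reach_reward(tcp, cube_a, cube_b)
r_swapped = reach_reward(tcp, cube_b, cube_a)
```

Swapping the roles of the two cubes leaves both terms unchanged, so neither stacking order is favored by the reward itself.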
##### StackCube\-Test \(Multi\-Modal Evaluation\)\.
The testing environment is deterministic and explicitly constructed to challenge the policy’s mode\-selection capability\.
- Setup: Cube A \(Red\) is placed at the far left \($y = -0.35$ m\) and Cube B \(Blue\) at the far right \($y = +0.35$ m\)\.
- Challenge: The wide lateral separation \(0\.7 m\) creates two disjoint basins of attraction in joint space\. A valid policy must essentially "choose" between a leftward trajectory \(stack Red\-on\-Blue\) or a rightward trajectory \(stack Blue\-on\-Red\)\.
- Symmetry: The reward function remains identical to the training setup\. The $\min$ operator in reaching and the $\max$ operator in placing ensure the reward landscape is perfectly symmetric, allowing the agent to dynamically select either mode\.

Figure 8: StackCube Multi\-Modal Task\. Panels: \(a\) Initial Position, \(b\) Mode 1: Red on Blue, \(c\) Mode 2: Blue on Red\. Two cubes are placed far apart\. The agent must choose to either stack Blue on Red \(Right\-to\-Left\) or Red on Blue \(Left\-to\-Right\)\. Gaussian policies often fail by averaging these distinct trajectories\.
### D\.3 Evaluation Metric: Behavior Entropy
To quantitatively evaluate the policy’s ability to cover both symmetric solutions, we adopt the Behavior Entropy metric adapted from the D3IL benchmark \(Jia et al\., [2024](https://arxiv.org/html/2606.15260#bib.bib89)\)\. This metric measures the diversity of the behaviors by computing the entropy of the discrete mode distribution achieved by the policy\.
For a set of $N$ evaluation trajectories, we classify each trajectory $\tau_i$ into a binary mode $m \in \{0, 1\}$ using a task\-specific classifier $C(\tau_i)$\. We estimate the empirical probability of each mode as $\hat{p}_m = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(C(\tau_i) = m)$\.
Since our tasks contain exactly two distinct modes \($|\mathcal{M}| = 2$\), we calculate the entropy using the base\-2 logarithm:
$$\mathcal{H}_{\text{behavior}} = -\sum_{m \in \{0, 1\}} \hat{p}_m \log_{2}(\hat{p}_m) \tag{317}$$
This formulation yields an entropy score in the range $[0, 1]$, using the convention $0 \log_{2} 0 = 0$\. A score of $\mathcal{H}_{\text{behavior}} = 1$ indicates maximum diversity \(a perfect 50/50 split between modes\), while $\mathcal{H}_{\text{behavior}} = 0$ indicates mode collapse \(the policy executes only one solution\)\.
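The metric is straightforward to compute from a list of per-trajectory mode labels. The following is a minimal sketch with a hypothetical function name; modes that never occur contribute nothing to the sum, which matches the usual convention that $0 \log_2 0 = 0$.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def behavior_entropy(modes: Sequence[Hashable]) -> float:
    """Base-2 entropy of the empirical mode distribution over evaluation trajectories."""
    n = len(modes)
    if n == 0:
        raise ValueError("at least one classified trajectory is required")
    probs = [count / n for count in Counter(modes).values()]
    return sum(-p * math.log2(p) for p in probs)


balanced = behavior_entropy([0, 1, 0, 1])    # both modes equally often
collapsed = behavior_entropy([0, 0, 0, 0])   # a single mode only
skewed = behavior_entropy([0, 0, 0, 1])      # 75/25 split
```

A balanced split gives the maximum score of 1, a single mode gives 0, and skewed splits fall in between.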
##### Mode Definitions\.
We classify the mode of a trajectory based solely on the environment state at the final timestep\. The specific criteria for our tasks are:
- Push\-T \($|\mathcal{M}| = 2$\): Modes are distinguished by the sign of the total Z\-axis rotation $\Delta\theta$ \(derived from the block’s final quaternion\):
  - Mode 0 \(Counter\-Clockwise\): $\Delta\theta > 0$\.
  - Mode 1 \(Clockwise\): $\Delta\theta < 0$\.
- StackCube \($|\mathcal{M}| = 2$\): Modes are distinguished by the final vertical arrangement of the cubes \(where $h_{\text{cube}} = 0.04$ m\):
  - Mode 0 \(Blue\-on\-Red\): $z_{\text{blue}} > z_{\text{red}} + h_{\text{cube}}$\.
  - Mode 1 \(Red\-on\-Blue\): $z_{\text{red}} > z_{\text{blue}} + h_{\text{cube}}$\.
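The mode criteria above translate into two small classifiers. The sketch below uses hypothetical function names and returns `None` when neither criterion holds (for example, when no rotation occurred or the cubes were not stacked); how such unclassified trajectories are handled is not stated above, so that choice is an assumption.

```python
from typing import Optional

CUBE_HEIGHT = 0.04  # h_cube in meters, as given in the criteria above


def classify_push_t(delta_theta: float) -> Optional[int]:
    """Mode from the sign of the total Z-axis rotation of the block."""
    if delta_theta > 0.0:
        return 0  # counter-clockwise
    if delta_theta < 0.0:
        return 1  # clockwise
    return None   # no net rotation: left unclassified (assumption)


def classify_stack_cube(z_red: float, z_blue: float,
                        h_cube: float = CUBE_HEIGHT) -> Optional[int]:
    """Mode from the final vertical arrangement of the two cubes."""
    if z_blue > z_red + h_cube:
        return 0  # Blue-on-Red
    if z_red > z_blue + h_cube:
        return 1  # Red-on-Blue
    return None   # neither cube is clearly on top: left unclassified (assumption)


push_modes = [classify_push_t(d) for d in (3.1, -3.1, 2.9)]
stack_mode = classify_stack_cube(z_red=0.02, z_blue=0.07)
```

The resulting labels are the per-trajectory inputs from which the empirical mode probabilities $\hat{p}_m$ are estimated.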
## Appendix E TruDi Main Experiments
### E\.1 Hyperparameters
Table [4](https://arxiv.org/html/2606.15260#A5.T4) lists the default hyperparameters used for the algorithms\. We maintain a consistent architecture for generative policies while adapting sampling budgets to the specific requirements of each environment \(Table [5](https://arxiv.org/html/2606.15260#A5.T5)\)\. Note that for the Humanoid\-Bench tasks, TruDi and REPPO use a reduced hidden dimension of 256, while SPO, PPO, and DPPO use a reduced hidden dimension of 128\.
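Table 4: Algorithm Hyperparameters\. Constant across all experiments unless noted\.

| Group | Parameter | TruDi \(Ours\) | REPPO | SPO | PPO | FPO | DPPO |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Actor Network | Hidden Dim | 512 | 512 | 256 | 256 | 32 | 256 |
| Actor Network | Hidden Layers | 3 | 3 | 3 | 3 | 5 | 3 |
| Actor Network | Activation | GeLU | GeLU | ELU | ELU | SiLU | Mish |
| Actor Network | Flow/Diff Steps | 8 | N/A | N/A | N/A | 10 | 8 |
| Critic Network | Critic Type | Dist\. \(HL\-Gauss\) | Dist\. \(HL\-Gauss\) | MSE | MSE | MSE | MSE |
| Critic Network | Hidden Dim | 512 | 512 | 256 | 256 | 256 | 256 |
| Critic Network | Hidden Layers | 3 | 3 | 3 | 3 | 5 | 3 |
| Critic Network | Ensemble/Bin | 151 Bins | 151 Bins | 1 | 1 | 1 | 1 |
| Critic Network | Activation | GeLU | GeLU | ELU | ELU | SiLU | Mish |
| Optimization | Optimizer | Adam | Adam | Adam | Adam | Adam | Adam |
| Optimization | Actor LR | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ |
| Optimization | Critic LR | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ |
| Optimization | Max Grad Norm | 0\.5 | 0\.5 | 1\.0 | 1\.0 | 1\.0 | 1\.0 |
| Optimization | Constraints | $\epsilon = 0.1$ \(KL\) | $\epsilon = 0.1$ \(KL\) | $\epsilon = 0.2$ \(Penalty\) | $\epsilon = 0.2$ \(Clip\) | 0\.2 \(Clip\) | 0\.2 \(Clip\) |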
Table 5: Task\-Specific Configurations\. Summary of environment settings\. We align the sampling budget \(Envs $\times$ Horizon\) and discount factors \($\gamma$\) to the specific needs of each benchmark\.

| Benchmark | Total Steps | $\gamma$ | Critic Range | Sampling \(Envs $\times$ Horizon\) |
| --- | --- | --- | --- | --- |
| MuJoCo \(DMC\) | 50M | 0\.99 | $[0, 150]$ | $1024 \times 128$ |
| MuJoCo \(Humanoid\) | 50M | 0\.97 | $\pm 10$ | $1024 \times 128$ |
| IsaacLab | 300M | 0\.97 | $\pm 10$ | $4096 \times 64$ |
| ManiSkill3 | 50M | 0\.99 | $\pm 15$ | $1024 \times 128$ |
| HumanoidBench | 50M | 0\.99 | $\pm 250$ | $128 \times 128$ |
## Appendix F Per\-Task Learning Curves
In this section, we provide sample efficiency curves per environment\.
Figure 9: IQM episode return for each individual MuJoCo Playground Humanoid task\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), FPO, SPO, PPO, DPPO\.

Figure 10: IQM episode return for each individual MuJoCo Playground DMC task\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), FPO, SPO, PPO, DPPO\.

Figure 11: IQM episode return for each individual ManiSkill task\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), SPO, PPO\.

Figure 12: IQM episode return for each individual IsaacLab task\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), SPO, PPO\.

Figure 13: IQM episode return for each individual Humanoid\-Bench task \(Part 1\)\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), PPO\.

Figure 14: IQM episode return for each individual Humanoid\-Bench task \(Part 2\)\. Legend: TruDi \(Ours\), REPPO, DIME \(on\-policy data\), PPO\.