Diffusion Policy Optimization without Drifting Apart
Summary
DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.
# Diffusion Policy Optimization without Drifting Apart
Source: [https://arxiv.org/html/2606.13795](https://arxiv.org/html/2606.13795)
Haozhe Jiang1,2, Haiwen Feng1,2, Pieter Abbeel1, Jiantao Jiao1,3, Angjoo Kanazawa1‡, Nika Haghtalab1‡
###### Abstract
RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose DiPOD, a diffusion policy optimization framework that maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
Affiliations: 1 UC Berkeley; 2 Impossible, Inc.; 3 NVIDIA. ‡ Equal advising.

## 1 Introduction
Figure 1: Illustration of DiPOD using an expected-return landscape over the parameter space of the policy. We want the policy to both guarantee policy improvement and have a tight ELBO, as illustrated by the colored curves in the figure. Prior proxy-based algorithms (the plotted reward curve comes from a real FPO (McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4)) experiment on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.13795#bib.bib11)); see [Figure 3](https://arxiv.org/html/2606.13795#S4.F3)) can initially achieve policy improvements, but later become unstable as the ELBOs are no longer tight. In contrast, DiPOD interleaves self-distillation steps and adequate gradient updates, as illustrated by the blue arrows. Gradient updates guarantee policy improvements locally, and self-distillation brings the parameter to where the ELBO is tight while keeping the expected reward unchanged.

Diffusion models are emerging as a powerful paradigm for discrete generation in language, code, and mathematical reasoning, while diffusion and flow models more broadly support strong performance in continuous domains such as images (Ho et al., [2020](https://arxiv.org/html/2606.13795#bib.bib15); Song et al., [2020](https://arxiv.org/html/2606.13795#bib.bib16); Song and Ermon, [2019](https://arxiv.org/html/2606.13795#bib.bib17)), videos (Ho et al., [2022b](https://arxiv.org/html/2606.13795#bib.bib18), [a](https://arxiv.org/html/2606.13795#bib.bib19)), and robotic control ([Black et al.,](https://arxiv.org/html/2606.13795#bib.bib20); Bjorck et al., [2025](https://arxiv.org/html/2606.13795#bib.bib21)).
This promise is especially compelling for reasoning with diffusion LLMs (dLLMs) (Nie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib10); Xie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib2); Song et al., [2025](https://arxiv.org/html/2606.13795#bib.bib22)), which offer fast parallel sampling and flexible non-autoregressive decoding (Jiang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib50); Ermon, [2026](https://arxiv.org/html/2606.13795#bib.bib51); Chen et al., [2026](https://arxiv.org/html/2606.13795#bib.bib52)). Yet post-training diffusion policies with reinforcement learning remains fundamentally difficult: standard policy-gradient methods rely on the log-likelihood $\log\pi_{\theta}(a \mid o)$, while for diffusion models this quantity is generally not tractable. Recent methods (Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1); McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4); Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)) have therefore introduced a range of more tractable proxies for the likelihood, but these approaches still do not provide a systematic understanding of when proxy-based policy-gradient updates remain reliable. This gap is particularly consequential for dLLMs, where unstable RL post-training causes reasoning gains to lag far behind those of autoregressive counterparts.
In this work, we propose DiPOD (Diffusion Policy optimization withOut Drifting apart), a principled framework for reliable policy-gradient updates that is native to diffusion and flow policies. When applied to dLLM post-training, DiPOD substantially stabilizes learning and improves reasoning performance on benchmarks such as GSM8K, MATH500, Countdown, and Sudoku, including being the first method to saturate Sudoku in the zero-shot setting ([Section 4.2](https://arxiv.org/html/2606.13795#S4.SS2) and [Table 1](https://arxiv.org/html/2606.13795#S4.T1)).
As an RL post-training method native to diffusion policies, DiPOD is designed to satisfy two desiderata. First, (A) *policy-gradient updates should improve the downstream objective*, as this is the main goal of post-training. However, reward improvement alone is not enough for improving diffusion models, especially when the probabilistic structure that makes the policy a coherent diffusion model is destroyed in the process. Hence, in addition, we require (B) *fine-tuning to preserve the diffusion-model structure by keeping the proxy objective tight*. DiPOD meets both criteria. As illustrated in [Figure 1](https://arxiv.org/html/2606.13795#S1.F1), DiPOD prevents proxy updates from drifting away from the intended policy-improvement direction by *interleaving* (i) a policy-preserving self-distillation step that ensures minimal discrepancy between the likelihood and its proxies, with (ii) *adequate* policy-gradient updates (footnote 1) that improve expected return. From this framework, we derive a surprisingly simple drop-in implementation: add a per-update ELBO regularization term to the diffusion policy-gradient update ([Algorithm 2](https://arxiv.org/html/2606.13795#alg2)).

Footnote 1: "Adequate" is a technical term (see Definition [2.1](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px3)) for a common type of policy-gradient update that is accurate in the absence of large discrepancy between likelihood and its proxies.
#### Variational\-Inference Approaches to RL Post\-training\.
Why is updating diffusion with policy gradient hard in the first place? Treating the denoising chain as an MDP makes likelihoods tractable (Black et al., [2023](https://arxiv.org/html/2606.13795#bib.bib26); Ren et al., [2024](https://arxiv.org/html/2606.13795#bib.bib29)), but it ties training to a sampler and is a poor fit for dLLMs, whose post-training should preserve flexible decoding orders and inference budgets. Recent *non-MDP* dLLM methods therefore approximate likelihoods directly, using mean-field or one-step estimators, partial dependency restorations, or imposed autoregressive orders (Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1); Xie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib2)); these can alter the optimized policy from the executed diffusion policy. Variational-inference (VI) approaches instead replace $\log\pi_{\theta}$ with evidence bounds such as ELBO (McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4)) or EUBO (Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)), preserving diffusion's native sampling flexibility while connecting naturally to pretraining objectives.
The key caveat is that this faithfulness is only local. Variational diffusion-RL methods can work well *initially*: near a well-pretrained initialization where the evidence bound is tight, ELBO remains a good local proxy for the true log-likelihood. However, RL updates do not preserve this tight-bound regime. As optimization proceeds, the discrepancy between ELBO and log-likelihood can grow, and our analysis pinpoints the Achilles heel: once ELBO drifts from log-likelihood, policy-gradient fidelity drifts with it. We refer to this coupled effect as the double drift phenomenon.
DiPOD addresses this by making tight\-bound behavior an explicit design constraint\. Our key observation is that many VI\-based estimators are*adequate*\(Definition[2](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px3)\): if the diffusion model were perfectly trained so that the evidence bound is tight, the proxy gradient matches the true log\-likelihood gradient, and a policy\-gradient step based on the proxy coincides with the true policy gradient\. DiPOD makes use of this by repeatedly*pulling the model back*toward a tight\-bound regime via self\-distillation—which tightens the evidence bound under a reference rollout distribution without changing the policy distribution—and only then taking policy\-gradient steps\. In the idealized interleaved form shown in[Figure˜1](https://arxiv.org/html/2606.13795#S1.F1), gradient updates guarantee local policy improvement, while self\-distillation restores tightness without changing the policy output distribution\.
Practically, we implement a simple approximation of this interleaving: for each batch of rollouts, we update parameters using the usual diffusion policy gradient estimator *plus* an ELBO maximization term on the same rollout data. This additional ELBO term acts as a regularizer that reduces variational discrepancy on-policy, improving alignment between the proxy gradient and the true policy gradient. Empirically, this simple DiPOD implementation is surprisingly effective: it substantially stabilizes training and improves reward optimization across both discrete dLLM post-training tasks (e.g., GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.13795#bib.bib11)), MATH500 (Lightman et al., [2023](https://arxiv.org/html/2606.13795#bib.bib12)), Sudoku ([Arel](https://arxiv.org/html/2606.13795#bib.bib14)), Countdown (Pan et al., [2025](https://arxiv.org/html/2606.13795#bib.bib13))) and continuous-control flow policies (e.g., motion tracking for humanoid robots; details in [Section 4](https://arxiv.org/html/2606.13795#S4)).
In summary, we make three contributions:
- We identify a fundamental failure mode in variational diffusion RL: double drift. As RL updates loosen the evidence bound, ELBO drifts from log-likelihood, which in turn causes proxy policy-gradient updates to drift from the intended policy-gradient direction ([Section 3.1](https://arxiv.org/html/2606.13795#S3.SS1)).
- We propose DiPOD, a principled framework for reliable diffusion policy optimization that makes staying in the tight-bound regime an explicit design objective, achieved by interleaving policy-gradient updates with policy-preserving self-distillation ([Section 3.2](https://arxiv.org/html/2606.13795#S3.SS2)).
- We derive a simple, drop-in practical algorithm that instantiates this principle by adding a per-update ELBO regularizer, substantially improving the stability and reward optimization of diffusion RL in practice ([Algorithm 2](https://arxiv.org/html/2606.13795#alg2) and [Section 4](https://arxiv.org/html/2606.13795#S4)).
### 1.1 Related Work
#### Diffusion models\.
Diffusion models are a powerful class of generative models, with applications in image generation\(Hoet al\.,[2020](https://arxiv.org/html/2606.13795#bib.bib15); Songet al\.,[2020](https://arxiv.org/html/2606.13795#bib.bib16); Song and Ermon,[2019](https://arxiv.org/html/2606.13795#bib.bib17)\), video generation\(Hoet al\.,[2022b](https://arxiv.org/html/2606.13795#bib.bib18),[a](https://arxiv.org/html/2606.13795#bib.bib19)\), and robotics\([Blacket al\.,](https://arxiv.org/html/2606.13795#bib.bib20); Bjorcket al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib21)\)\. More recently, diffusion language models have emerged as a promising alternative to autoregressive models for fast and flexible decoding\(Nieet al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib10); Xieet al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib2); Songet al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib22)\)\. We study diffusion policies through the variational\-inference perspective\(Kingma and Gao,[2023](https://arxiv.org/html/2606.13795#bib.bib23); Sohl\-Dicksteinet al\.,[2015](https://arxiv.org/html/2606.13795#bib.bib24)\), which is the lens used by DiPOD to reason about likelihood proxies and evidence bounds\.
#### Policy gradients\.
Policy\-gradient methods directly optimize a parameterized policy for expected return\(Williams,[1992](https://arxiv.org/html/2606.13795#bib.bib25); Schulmanet al\.,[2017](https://arxiv.org/html/2606.13795#bib.bib6)\), and have been central to high\-dimensional control\(Schulmanet al\.,[2016](https://arxiv.org/html/2606.13795#bib.bib30)\), locomotion\(Rudinet al\.,[2022](https://arxiv.org/html/2606.13795#bib.bib27)\), manipulation\(Schwarkeet al\.,[2023](https://arxiv.org/html/2606.13795#bib.bib28)\), and modern language\-model post\-training\(Shaoet al\.,[2024](https://arxiv.org/html/2606.13795#bib.bib3)\)\. DiPOD targets this online policy\-optimization setting for diffusion policies\.
#### RL with diffusion policies\.
The main obstacle to applying policy gradients to diffusion policies is that the exact log\-likelihood is intractable\. Prior work either treats the denoising process itself as an MDP\(Blacket al\.,[2023](https://arxiv.org/html/2606.13795#bib.bib26); Renet al\.,[2024](https://arxiv.org/html/2606.13795#bib.bib29)\), replaces likelihoods with variational evidence bounds\(McAllisteret al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib4); Wanget al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib7)\), or simplifies dependencies in diffusion language models to obtain tractable likelihood surrogates\(Zhaoet al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib1); Yanget al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib5); Xieet al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib2); Tanget al\.,[2025](https://arxiv.org/html/2606.13795#bib.bib32)\)\. These approaches trade off tractability, sampler flexibility, and alignment between the optimized proxy and the executed diffusion policy\. DiPOD focuses on the variational line and stabilizes it by keeping the evidence bound aligned with the true likelihood during policy optimization\. A fuller discussion of related work appears in[Appendix˜A](https://arxiv.org/html/2606.13795#A1)\.
## 2 Preliminaries
#### Policy Gradient Method\.
The goal of reinforcement learning (RL) is to learn a policy $\pi_{\theta}$ that maximizes the expected return in an environment. Here $\theta$ denotes the parameters of the policy. At timestep $t$, the policy takes an observation $o_t$ from the environment, then executes action $a_t \sim \pi_{\theta}(\cdot \mid o_t)$, and the environment provides reward $r_t$ as feedback. The return is defined as the cumulative reward until the environment reaches a terminal state. A policy gradient algorithm updates the policy parameters $\theta$ using an estimator of the gradient of the expected return. Most policy gradient methods can be viewed as using a variant of the following gradient estimator

$$\mathbb{E}_{\theta}\left[\nabla_{\theta}\log\pi_{\theta}(a_t \mid o_t)\,\hat{A}_t(o_t,a_t)\right], \tag{1}$$

where $\hat{A}_t$ is the estimated advantage function at timestep $t$. The expectation is taken over the randomness of the environment and the policy. Here $\mathbb{E}_{\theta}$ means that $a_t$ is generated according to $\pi_{\theta}(\cdot \mid o_t)$. Well-known practical algorithms such as PPO (Schulman et al., [2017](https://arxiv.org/html/2606.13795#bib.bib6)) and GRPO (Shao et al., [2024](https://arxiv.org/html/2606.13795#bib.bib3)) can be understood as introducing modifications to this update in order to improve stability and sample efficiency.

Estimating [Equation 1](https://arxiv.org/html/2606.13795#S2.E1) becomes intractable when $\pi_{\theta}$ is parameterized by a diffusion or flow model, motivating a range of proxy objectives, including those based on variational bounds.
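To make Equation 1 concrete, here is a minimal sketch for a case where the log-likelihood is tractable: a softmax policy over a few discrete actions in a one-step problem. Everything in it (the function names, the bandit-style rewards, the use of the exact expected reward as a baseline) is an illustrative assumption and not part of DiPOD. It only shows what the estimator computes when $\nabla_{\theta}\log\pi_{\theta}$ is available in closed form, which is what diffusion policies lack.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def exact_policy_gradient(logits, rewards):
    """Equation 1 with the expectation over actions computed exactly."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        advantage = rewards[a] - baseline
        score = grad_log_prob(logits, a)
        for k in range(len(grad)):
            grad[k] += p * advantage * score[k]
    return grad


def sampled_policy_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo version: average advantage * score over sampled actions."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [0.0] * len(logits)
    actions = range(len(logits))
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]
        advantage = rewards[a] - baseline
        score = grad_log_prob(logits, a)
        for k in range(len(grad)):
            grad[k] += advantage * score[k] / num_samples
    return grad
```

In this toy setting the exact gradient with respect to logit $k$ reduces to $p_k (r_k - J)$, where $J$ is the expected reward, so the sampled estimate can be checked against a closed form.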
#### Variational Bounds\.
The evidence lower bound \(ELBO\) satisfies
$$\text{ELBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)-\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t) \tag{2}$$

where $\mathcal{D}_{\theta}^{\text{L}}$ is a non-negative discrepancy term that measures the gap between the ELBO and true log-likelihood. Furthermore, with appropriate and well-trained $\theta$, $\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)=0$ for all $o_t$ and $a_t$ (footnote 2). This means that ELBO is a tight lower bound of the true log-likelihood. Outside the reinforcement learning context, generative models are usually trained by maximizing

$$\mathbb{E}_{o_t,a_t\sim p_{\text{data}}}\left[\text{ELBO}_{\theta}(a_t \mid o_t)\right] \tag{3}$$

where $p_{\text{data}}$ is the distribution that the model tries to learn. Such surrogate objectives are useful because they avoid direct likelihood computation. For diffusion models, under idealized assumptions, perfect training makes the ELBO tight, so $\text{ELBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)$. For flow models trained by conditional flow matching (CFM), the analogous idealized optimum has zero CFM loss, so the learned vector field matches the target conditional vector field.

Footnote 2: Here we assume that $\theta$ has enough representation power to cover the zero discrepancy case.
Sometimes it is also useful to consider the corresponding upper bound on log likelihood\. The evidence upper bound \(EUBO\) satisfies
$$\text{EUBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)+\mathcal{D}_{\theta}^{\text{U}}(o_t,a_t)$$

where $\mathcal{D}_{\theta}^{\text{U}}$ is a non-negative discrepancy term that can reach zero with appropriate $\theta$ as well. While $\mathcal{D}_{\theta}^{\text{L}}$ can be interpreted as a KL discrepancy between posterior distributions, $\mathcal{D}_{\theta}^{\text{U}}$ is characterized by a Rényi-type posterior discrepancy. Furthermore, if $\text{ELBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)$ for a pair $(o_t,a_t)$, EUBO also satisfies $\text{EUBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)$. Hence, if a diffusion model is perfectly trained, we have $\text{ELBO}_{\theta}(a_t \mid o_t)=\text{EUBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)$ for all $o_t$ and $a_t$. As opposed to the ELBO, however, EUBO itself is not tractable for diffusion models and must be approximated in practice.
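For readers who want to see where a decomposition of the form in Equation 2 comes from, the following is the standard identity for a generic latent-variable model. The symbols $z$ (latent variables, for example the noised intermediate states), $q$ (a variational distribution, for example the forward noising process), and $p_{\theta}(a, z \mid o)$ are notational assumptions introduced here; they are not taken from the paper's formal setup, which may define the discrepancy differently.

```latex
\begin{aligned}
\log \pi_{\theta}(a \mid o)
  &= \log \int p_{\theta}(a, z \mid o)\, dz \\
  &= \underbrace{\mathbb{E}_{q(z \mid a, o)}\!\left[\log \frac{p_{\theta}(a, z \mid o)}{q(z \mid a, o)}\right]}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\!\left(q(z \mid a, o)\,\middle\|\,p_{\theta}(z \mid a, o)\right)}_{\ge 0}.
\end{aligned}
```

Under these assumptions the KL term plays the role of $\mathcal{D}_{\theta}^{\text{L}}$: it is non-negative, and it vanishes exactly when the model posterior matches $q$, which is the tight-bound case.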
#### Adequate Gradient Estimators\.
Variational bounds provide attractive surrogates for the intractable log\-likelihood because they can be tight\. As we will see in the following sections, some variational\-inference\-based \(VI\-based\) gradient estimators satisfy a useful property in this tight regime: once the surrogate coincides with the true log\-likelihood, the estimator also coincides with the policy\-gradient integrand in[Equation˜1](https://arxiv.org/html/2606.13795#S2.E1)\. We call such estimators adequate\.
###### Definition 2.1 (Adequate estimator).
A gradient estimator $g_{\theta}(o_t,a_t)$ is called *adequate* if, for any $(o_t,a_t)$, whenever the evidence bound is tight so that $\text{ELBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t)$, it satisfies

$$g_{\theta}(o_t,a_t)=\hat{A}_t(o_t,a_t)\,\nabla_{\theta}\log\pi_{\theta}(a_t \mid o_t).$$
## 3 Method
As discussed in the previous section, diffusion policies allow efficient sampling but make the exact log-likelihood $\log\pi_{\theta}(a_t \mid o_t)$ (and its gradient) intractable, forcing RL for diffusion models to rely on proxy objectives/estimators. Our introduction diagnosed a double drift mechanism that is particularly acute for *variational-inference (VI)* based diffusion RL: (i) the ELBO can drift from the true log-likelihood as the variational discrepancy grows, and consequently (ii) the proxy policy gradient will drift apart from the true policy gradient of expected return. In this section, we first use this double-drift lens to clarify limitations of representative diffusion-RL algorithms ([Section 3.1](https://arxiv.org/html/2606.13795#S3.SS1)). We then develop a principled framework that *prevents drifting* by keeping the evidence bound tight on-policy, which in turn keeps policy-gradient updates aligned with the true policy gradient. Finally, we derive a simple practical algorithm that implements this principle as a drop-in regularization term.
### 3.1 The Double-Drift Phenomenon
Many diffusion-RL methods differ in formulation, but the key question under our lens is: *does the proxy remain faithful to $\log\pi_{\theta}$, and does its induced gradient remain faithful to the true policy gradient?* For VI-based approaches, the core issue is that an evidence bound update can change both the likelihood term and the variational gap, and once the gap grows, the proxy gradient inevitably deviates from $\nabla_{\theta}\log\pi_{\theta}$.
#### FPO (McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4)).
FPO replaces the intractable likelihood score in the policy-gradient update with an ELBO score. Formally, FPO uses the gradient estimator

$$g_{\theta}^{\text{FPO}}(o_t,a_t)=\nabla_{\theta}\text{ELBO}_{\theta}(a_t \mid o_t)\,\hat{A}_t(o_t,a_t), \tag{4}$$

up to PPO-style clipping for stability. The intuition is that the ELBO acts as a tractable proxy for the log-likelihood score: in particular, increasing the ELBO on positive-advantage samples should both increase the likelihood of desirable actions and push the diffusion model closer to a tight-bound regime. However, since ELBO decomposes as the true log-likelihood minus a nonnegative discrepancy term (cf. [Equation 2](https://arxiv.org/html/2606.13795#S2.E2)), ELBO changes generally *do not uniquely determine* how $\log\pi_{\theta}(a_t \mid o_t)$ changes. This creates the first drift: ELBO–likelihood inconsistency. Concretely:
- The ELBO may increase even if $\log\pi_{\theta}(a_t \mid o_t)$ decreases, provided the discrepancy decreases more. This ambiguity is especially problematic for negative-advantage updates: the ELBO can be reduced by *increasing* the discrepancy, i.e., "cheating" by destabilizing the diffusion model rather than reliably decreasing the true likelihood of undesirable actions.
- Once the discrepancy is non-negligible, the second drift follows automatically: the proxy gradient used in FPO is $\nabla_{\theta}\text{ELBO}_{\theta}(a_t \mid o_t)$, but $\nabla_{\theta}\text{ELBO}_{\theta}=\nabla_{\theta}\log\pi_{\theta}-\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}$. Thus, the update direction can become partially spent on changing the variational gap rather than following $\nabla_{\theta}\log\pi_{\theta}$, and the induced policy-gradient step may not align with the true policy gradient in [Equation 1](https://arxiv.org/html/2606.13795#S2.E1).
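The first drift can be reproduced in a very small toy model. The sketch below uses a binary latent variable with a fixed variational distribution and two hypothetical parameter settings; none of the names or numbers come from the paper, and a two-state latent model is of course far simpler than a diffusion policy. It only illustrates that a bound of the form "log-likelihood minus a KL gap" can move in the opposite direction from the log-likelihood itself.

```python
import math


def log_likelihood(prior, lik):
    """log p(a) = log sum_z p(z) p(a | z) for one fixed action a."""
    return math.log(sum(pz * la for pz, la in zip(prior, lik)))


def elbo(q, prior, lik):
    """E_q[log p(a, z) - log q(z)] under a fixed variational distribution q."""
    return sum(
        qz * (math.log(pz * la) - math.log(qz))
        for qz, pz, la in zip(q, prior, lik)
    )


def discrepancy(q, prior, lik):
    """Gap log p(a) - ELBO, which equals KL(q || p(z | a)) and is >= 0."""
    return log_likelihood(prior, lik) - elbo(q, prior, lik)


# Fixed variational distribution and prior over a binary latent z.
Q = [0.5, 0.5]
PRIOR = [0.5, 0.5]

# Two hypothetical parameter settings, given as p(a | z) for the sampled a.
LIK_OLD = [0.9, 0.1]    # higher likelihood, loose bound
LIK_NEW = [0.45, 0.45]  # lower likelihood, tight bound

for name, lik in [("old", LIK_OLD), ("new", LIK_NEW)]:
    print(
        name,
        round(log_likelihood(PRIOR, lik), 4),
        round(elbo(Q, PRIOR, lik), 4),
        round(discrepancy(Q, PRIOR, lik), 4),
    )
```

Moving from the "old" to the "new" setting raises the ELBO by roughly 0.41 nats while the log-likelihood falls by roughly 0.11 nats, because the gap closes by more than the likelihood drops. Read in the reverse direction, it is the negative-advantage failure case: the ELBO goes down while the likelihood of the action goes up.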
These issues help explain why ELBO\-based diffusion RL can be unstable in practice: once ELBO drifts from likelihood, the proxy gradient can drift from the true policy gradient as well\.
#### SPG (Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)).
SPG explicitly targets an objective that mirrors the policy-gradient structure by using ELBO for positive advantages and EUBO for negative advantages. Formally, SPG uses the gradient estimator

$$g_{\theta}^{\text{SPG}}(o_t,a_t)=\mathbb{1}_{\hat{A}>0}\,\nabla_{\theta}\text{ELBO}_{\theta}(a_t \mid o_t)\,\hat{A}+\mathbb{1}_{\hat{A}<0}\,\nabla_{\theta}\text{EUBO}_{\theta}(a_t \mid o_t)\,\hat{A}, \tag{5}$$

where $\hat{A}$ abbreviates $\hat{A}_t(o_t,a_t)$. The objective corresponding to the SPG update ([Equation 5](https://arxiv.org/html/2606.13795#S3.E5)) is a lower bound on the original objective:

$$\mathbb{E}_{\theta}\left[\mathbb{1}_{\hat{A}>0}\,\text{ELBO}_{\theta}(a_t \mid o_t)\,\hat{A}+\mathbb{1}_{\hat{A}<0}\,\text{EUBO}_{\theta}(a_t \mid o_t)\,\hat{A}\right] \tag{6}$$

because ELBO is a lower bound and EUBO is an upper bound on the true log-likelihood. Moreover, SPG has a nice property: when the objective is maximized, ELBO and EUBO equal the log-likelihood, and the expected return is maximized as well. However, under the double-drift lens it still has drawbacks:
- Similar to the exact likelihood, EUBO is not tractable. The SPG implementation adopts an approximation to EUBO such that [Equation 6](https://arxiv.org/html/2606.13795#S3.E6) remains a true lower bound. This lower-bound property often stabilizes training, but the approximation can break the exact link to the idealized objective-level argument and is currently specialized to certain settings.
- Most importantly for our purposes: an objective-level guarantee does not imply gradient consistency. Even if an objective becomes exact at convergence, there is no guarantee that intermediate gradient ascent steps computed from variational bounds align with the true policy gradient in [Equation 1](https://arxiv.org/html/2606.13795#S2.E1) when the bound is not tight (i.e., when the discrepancy is non-negligible).
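The two estimators in Equations 4 and 5 differ only in which bound supplies the gradient, which a few lines of code make explicit. This is a hedged sketch: `fpo_estimator` and `spg_estimator` are hypothetical names, the ELBO and EUBO gradients are assumed to be already computed (in practice the EUBO gradient would itself be an approximation), and PPO-style clipping is omitted.

```python
def fpo_estimator(advantage, elbo_grad):
    """Equation 4 without clipping: advantage-weighted ELBO gradient."""
    return [advantage * g for g in elbo_grad]


def spg_estimator(advantage, elbo_grad, eubo_grad):
    """Equation 5: ELBO gradient when A > 0, EUBO gradient when A < 0."""
    if advantage > 0:
        return [advantage * g for g in elbo_grad]
    if advantage < 0:
        return [advantage * g for g in eubo_grad]
    # Both indicators are zero when the advantage is exactly zero.
    return [0.0 for _ in elbo_grad]
```

When the two bound gradients coincide, as they do in the tight-bound regime, the two functions return the same vector for any non-zero advantage. They can disagree only on negative-advantage samples, and only once the bounds have separated.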
### 3.2 Preventing Double Drift through Self-Distillation
We now make the core principle precise: to prevent the second drift (proxy-gradient drift), we must prevent the first drift (ELBO–likelihood drift) from accumulating along the training trajectory. Concretely, when the diffusion model is perfectly trained (or close to it), the evidence bound is tight and gradients of variational bounds coincide with gradients of the true log-likelihood.
Let us take ELBO as an example. Recall from [Equation 2](https://arxiv.org/html/2606.13795#S2.E2) that ELBO is a tight lower bound of the true log-likelihood and the discrepancy $\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)$ is minimized at zero. Assuming $\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)$ is smooth in $\theta$, at a tight-bound point we have $\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)=0$, and thus

$$\text{ELBO}_{\theta}(a_t \mid o_t)=\log\pi_{\theta}(a_t \mid o_t) \;\Rightarrow\; \nabla_{\theta}\text{ELBO}_{\theta}(a_t \mid o_t)=\nabla_{\theta}\log\pi_{\theta}(a_t \mid o_t).$$

Hence, $g^{\text{FPO}}_{\theta}$ is adequate, and if we start from a well-trained diffusion model (e.g., optimized under [Equation 3](https://arxiv.org/html/2606.13795#S2.E3)), the *initial* ELBO-based update direction in FPO aligns with the true policy gradient in [Equation 1](https://arxiv.org/html/2606.13795#S2.E1), and it remains a good approximation as long as training stays near the tight-bound regime.
A similar logic applies to SPG. When the diffusion model is perfectly trained, $\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)=\mathcal{D}_{\theta}^{\text{U}}(o_t,a_t)=0$, and hence

$$\nabla_{\theta}\text{ELBO}_{\theta}(a_t \mid o_t)=\nabla_{\theta}\text{EUBO}_{\theta}(a_t \mid o_t)=\nabla_{\theta}\log\pi_{\theta}(a_t \mid o_t),$$

meaning that $g^{\text{SPG}}_{\theta}$ is adequate. This justifies the strong empirical performance of VI-based RL methods for diffusion models when initialized from a pretrained model: their gradients are *adequate* at initialization. The core issue is that RL updates can move the model away from this tight-bound regime, allowing the discrepancy to grow; once that happens, ELBO drifts from likelihood and the proxy gradient drifts from the true policy gradient.
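The tight-bound argument can be made slightly more quantitative with a standard fact about smooth non-negative functions. The sketch below assumes that $\mathcal{D}_{\theta}^{\text{L}}(o_t,a_t)$ is differentiable in $\theta$ with an $L$-Lipschitz gradient; the constant $L$ and the auxiliary point $\theta'$ are our notation, and the paper's own analysis may proceed differently.

```latex
\begin{aligned}
\theta' &:= \theta - \tfrac{1}{L}\,\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}, \\
0 \;\le\; \mathcal{D}_{\theta'}^{\text{L}}
  &\le \mathcal{D}_{\theta}^{\text{L}} - \tfrac{1}{2L}\left\|\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}\right\|^{2}
  \quad\Longrightarrow\quad
  \left\|\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}\right\|^{2} \le 2L\,\mathcal{D}_{\theta}^{\text{L}}, \\
\left\|\nabla_{\theta}\text{ELBO}_{\theta} - \nabla_{\theta}\log\pi_{\theta}\right\|
  &= \left\|\nabla_{\theta}\mathcal{D}_{\theta}^{\text{L}}\right\|
  \le \sqrt{2L\,\mathcal{D}_{\theta}^{\text{L}}}.
\end{aligned}
```

The first inequality uses non-negativity of the discrepancy and the second is the usual descent lemma for $L$-smooth functions. Under these assumptions a zero discrepancy forces a zero gradient mismatch, and a pointwise discrepancy of at most $\varepsilon$ bounds the mismatch by $O(\sqrt{\varepsilon})$.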
#### Key idea: repeatedly tighten the bound on\-policy\.
We propose to *interleave* policy-gradient updates (using an adequate estimator) with a policy-preserving self-distillation step that tightens ELBO under a *reference rollout distribution*. Intuitively, self-distillation directly counters the diffusion-side drift by reducing the discrepancy on-policy; this in turn restores the local tightness that makes the proxy gradient a faithful substitute for $\nabla_{\theta}\log\pi_{\theta}$, preventing the RL-side drift.
Algorithm 1: DiPOD (Interleave)

Input: An adequate gradient estimator $g$, a policy parameterized by diffusion model $\pi_{\theta}$, $n\in\mathbb{N}^{+}$, learning rate $\eta\in\mathbb{R}^{+}$

Output: Policy $\pi_{\theta}$

1. Initialize policy $\pi_{\theta}$
2. Set $\pi_{\text{ref}}\leftarrow\pi_{\theta}$
3. **while** $\pi_{\theta}$ has not converged **do**
4. Maximize $\mathbb{E}_{\text{ref}}\left[\text{ELBO}_{\theta}(a_t \mid o_t)\right]$. ▷ Self-distillation
5. **for** $i=1,\dots,n$ **do**
6. $\theta\leftarrow\theta+\eta\,\mathbb{E}_{\theta}\left[g_{\theta}(o_t,a_t)\right]$. ▷ Policy update
7. **end for**
8. Set $\pi_{\text{ref}}\leftarrow\pi_{\theta}$
9. **end while**
10. **return** $\pi_{\theta}$

[Algorithm 1](https://arxiv.org/html/2606.13795#alg1) alternates between (i) tightening the evidence bound *under rollouts from the latest reference policy* ($\mathbb{E}_{\text{ref}}$ means $a_t\sim\pi_{\text{ref}}(\cdot \mid o_t)$), and (ii) taking policy updates using an adequate estimator. In the idealized limit where the self-distillation step is optimized to convergence, it produces a diffusion model with (approximately) zero discrepancy on the reference rollout distribution while preserving the reference policy's output distribution; the subsequent policy update is then well-aligned with the true policy-gradient direction. Crucially, we refresh $\pi_{\text{ref}}$ to be the *most recent* policy, so the self-distillation step continually patches on-policy drift rather than distilling toward a stale reference.
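The control flow of the interleaved procedure can be sketched as a short Python loop. This is only a structural illustration under stated assumptions: parameters are a plain list of floats, `self_distill` and `expected_gradient` are hypothetical callables supplied by the user (standing in for ELBO maximization under the reference rollouts and for the expectation of an adequate estimator), and the "has not converged" test is replaced by a fixed number of rounds.

```python
def dipod_interleave(theta, self_distill, expected_gradient, n, eta, num_rounds):
    """Alternate self-distillation with n policy-gradient steps per round.

    theta:             list of floats, the policy parameters
    self_distill:      callable (theta, theta_ref) -> theta, assumed to tighten
                       the ELBO on rollouts from the reference policy
    expected_gradient: callable (theta) -> list of floats, assumed to return
                       the expected value of an adequate estimator
    """
    theta = list(theta)
    theta_ref = list(theta)                      # pi_ref <- pi_theta
    for _ in range(num_rounds):                  # stands in for "while not converged"
        theta = self_distill(theta, theta_ref)   # self-distillation
        for _ in range(n):                       # n policy updates
            grad = expected_gradient(theta)
            theta = [t + eta * g for t, g in zip(theta, grad)]
        theta_ref = list(theta)                  # refresh the reference policy
    return theta
```

Keeping the reference as a separate copy that is refreshed at the end of each round mirrors the point made above: each self-distillation step sees rollouts from the most recent policy rather than from a stale one.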
In particular, under standard smoothness, realizability and optimization assumptions, we can theoretically prove the following theorem\.
###### Theorem 3.1 (Informal).
Assume every self-distillation step returns a model that is within $\varepsilon$ of the optimal ELBO value on the current reference distribution. Choosing a sufficiently small learning rate $\eta$ ensures that each DiPOD policy update improves the expected return, unless the current policy gradient is already $O(\sqrt{\varepsilon})$. If the procedure stops after an $\varepsilon$-optimal on-policy self-distillation step at $\theta_\star$, then both $\|\nabla_\theta \mathcal{J}(\theta_\star)\| = O(\sqrt{\varepsilon})$ and the ELBO discrepancy under the final reference distribution is small: $\mathbb{E}_{(o,a) \sim \rho_{\theta_\star}}[\mathcal{D}_{\theta_\star}^{\mathrm{L}}(o,a)] \leq \varepsilon$.
We defer the formal statement and proof to the appendix\.
### 3.3 A Simple and Practical Implementation
Although [Algorithm 1](https://arxiv.org/html/2606.13795#alg1) directly addresses double drift, a literal implementation can be inefficient if self-distillation is run to convergence before every policy update. We therefore introduce a simple approximation that is efficient, easy to implement, and effective in practice.
In a typical diffusion RL algorithm (such as FPO and SPG), each gradient step uses a batch of rollouts $\mathcal{B} = \{(o^i, a^i)\}_{i=1,\dots,m}$ and updates parameters as

$$\theta \leftarrow \theta + \eta \cdot \frac{1}{m} \sum_{i=1}^{m} g_\theta(o^i, a^i).$$

We absorb self-distillation into each update by augmenting the policy-update direction with an ELBO maximization term computed on the same rollout batch:

$$\theta \leftarrow \theta + \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \left[ g_\theta(o^i, a^i) + \beta \nabla_\theta \text{ELBO}_\theta(a^i|o^i) \right], \tag{7}$$

where $\beta \in \mathbb{R}^+$ controls the strength of the on-policy ELBO tightening.
[Equation 7](https://arxiv.org/html/2606.13795#S3.E7) can be viewed as a first-order approximation to performing (i) a small amount of self-distillation and (ii) a policy-gradient update per rollout batch, sharing the same samples for efficiency. Equivalently, the added ELBO term acts as a regularizer that reduces the variational discrepancy $\mathcal{D}_\theta^{\text{L}}$ over on-policy rollouts, tightening the likelihood approximation and thereby mitigating the *first drift* (ELBO–likelihood drift). As a consequence, the proxy gradient used in $g_\theta$ remains better aligned with $\nabla_\theta \log \pi_\theta$, mitigating the *second drift* (proxy-gradient drift).
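The combined update can be written directly from per-rollout gradients. The sketch below is an illustration under simple assumptions: parameters and gradients are flat NumPy arrays, and `dipod_update`, `g_batch`, and `elbo_grad_batch` are hypothetical names introduced here, not identifiers from the released code.

```python
import numpy as np

def dipod_update(theta, g_batch, elbo_grad_batch, eta, beta):
    """One combined step: theta + eta * mean_i [ g_i + beta * grad ELBO_i ].

    g_batch:         (m, d) per-rollout policy-gradient estimates g_theta(o^i, a^i)
    elbo_grad_batch: (m, d) per-rollout gradients of ELBO_theta(a^i | o^i)
    """
    g_batch = np.asarray(g_batch, dtype=float)
    elbo_grad_batch = np.asarray(elbo_grad_batch, dtype=float)
    # Average the augmented direction over the same rollout batch.
    direction = (g_batch + beta * elbo_grad_batch).mean(axis=0)
    return np.asarray(theta, dtype=float) + eta * direction


theta = np.array([0.0, 0.0])
g = np.array([[1.0, 0.0], [3.0, 0.0]])        # batch mean is [2, 0]
elbo_g = np.array([[0.0, 2.0], [0.0, 6.0]])   # batch mean is [0, 4]
new_theta = dipod_update(theta, g, elbo_g, eta=0.5, beta=0.05)
baseline_theta = dipod_update(theta, g, elbo_g, eta=0.5, beta=0.0)
```

Setting `beta=0.0` recovers the unregularized update, so the regularizer can be switched on without touching the rest of a training loop. When $g_\theta$ is itself the gradient of a surrogate objective, the same step is typically obtained in an autodiff framework by adding $\beta$ times the batch ELBO to that surrogate before differentiating.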
We summarize the resulting procedure in [Algorithm 2](https://arxiv.org/html/2606.13795#alg2). Relative to a standard VI-based diffusion RL pipeline, the only modification is the extra ELBO term in the $\theta$ update, making the method a simple drop-in enhancement for a broad class of variational-bound-based diffusion RL algorithms.
Algorithm 2 DiPOD (Practical Implementation)

Input: an adequate gradient estimator $g$, a policy parameterized by diffusion model $\pi_\theta$, $\beta \in \mathbb{R}^+$, learning rate $\eta \in \mathbb{R}^+$, rollout size $m \in \mathbb{N}^+$

Output: policy $\pi_\theta$

1. Initialize policy $\pi_\theta$
2. while $\pi_\theta$ has not converged do
3. Sample rollouts $\mathcal{B} = \{(o^i, a^i)\}_{i=1,\dots,m}$ ▷ Rollouts may be sampled from the current policy.
4. Update $\theta$ using [Equation 7](https://arxiv.org/html/2606.13795#S3.E7) with $\mathcal{B}$.
5. end while
6. return $\pi_\theta$

Although we present our method in the context of diffusion policies, the proposed principle is not specific to diffusion models. At a high level, our algorithm only requires (i) a tractable variational lower bound of the form $\mathrm{ELBO}_\theta(x|c)$ and (ii) an RL objective that depends on an intractable likelihood only through policy-gradient-style updates. Consequently, the same "tighten-the-bound to prevent gradient drift" strategy applies to a broad class of generative policies trained with ELBO objectives, including latent-variable models such as VAEs and other variational generative models.
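For a generic latent-variable policy, the link between bound tightness and the likelihood can be written out explicitly. The notation below ($z$ for the latent variable, $q$ for the variational posterior) is introduced here for illustration and is not taken from the paper; the identity itself is the standard ELBO decomposition.

```latex
\log \pi_\theta(x \mid c)
  = \underbrace{\mathbb{E}_{q(z \mid x, c)}\!\left[
      \log \frac{p_\theta(x, z \mid c)}{q(z \mid x, c)}
    \right]}_{\mathrm{ELBO}_\theta(x \mid c)}
  + \underbrace{\mathrm{KL}\!\left(
      q(z \mid x, c) \,\|\, p_\theta(z \mid x, c)
    \right)}_{\text{variational gap} \;\geq\; 0}
```

Taking expectations under a reference policy gives $\mathbb{E}_{\text{ref}}[\mathrm{ELBO}_\theta] \leq \mathbb{E}_{\text{ref}}[\log \pi_\theta] \leq \mathbb{E}_{\text{ref}}[\log \pi_{\text{ref}}]$, so under realizability the ELBO objective is maximized only when the gap is zero and $\pi_\theta$ reproduces the reference policy. This is one way to see why a bound-tightening step can be policy-preserving.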
## 4 Experiment
Code is released at [Astro-Eric/DiPOD-release](https://github.com/Astro-Eric/DiPOD-release). In this section, we illustrate the effectiveness and versatility of [Algorithm 2](https://arxiv.org/html/2606.13795#alg2) across both discrete and continuous diffusion-policy domains. We first use a Two-Token post-training experiment as a controlled diagnostic that makes ELBO drift directly visible. We then evaluate DiPOD on two substantially different practical settings: RL post-training of diffusion language models on challenging reasoning benchmarks, and high-dimensional continuous-control motion tracking for a humanoid robot.
### 4.1 Two-Token Post-training
Figure 2: Variational gap $\mathcal{D}_\theta^{\text{L}}$ during FPO, SPG, and DiPOD training.
#### Setup.
We use a two-token discrete diffusion model following the toy post-training setting of SPG (Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)). The model generates $x = (x_1, x_2)$ with each token in $\{\mathrm{A}, \mathrm{B}\}$, starting from the masked state $\mathrm{MM}$ and decoding in a uniformly random order. Because the state space is fully enumerable, we can analytically compute $\log \pi$, ELBO, and the variational gap $\mathcal{D}_\theta^{\text{L}}$ throughout training, which makes the drift directly visible. We defer the full parameterization, rewards, and implementation details to [Appendix D](https://arxiv.org/html/2606.13795#A4).
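To illustrate why enumeration makes the gap exactly computable, the sketch below builds a two-token model with a tabular parameterization. Everything specific here is an assumption made for illustration: the logit tables `first` and `second`, the encoding of A and B as 0 and 1, and the choice of ELBO as the Jensen lower bound over the two decoding orders. The paper's actual parameterization is deferred to its Appendix D and may differ.

```python
import itertools
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def order_probs(params, x):
    """Joint probability of x = (x1, x2) under each of the two decoding orders.

    params["first"][i]     : logits for position i decoded from the state MM
    params["second"][i][v] : logits for position i when the other position holds v
    """
    x1, x2 = x
    # Position 1 first: p(x1 | MM) * p(x2 | x1 M)
    p12 = softmax(params["first"][0])[x1] * softmax(params["second"][1][x1])[x2]
    # Position 2 first: p(x2 | MM) * p(x1 | M x2)
    p21 = softmax(params["first"][1])[x2] * softmax(params["second"][0][x2])[x1]
    return p12, p21

def log_pi(params, x):
    p12, p21 = order_probs(params, x)
    return np.log(0.5 * p12 + 0.5 * p21)            # exact marginal over orders

def elbo(params, x):
    p12, p21 = order_probs(params, x)
    return 0.5 * np.log(p12) + 0.5 * np.log(p21)    # Jensen lower bound on log_pi

def expected_gap(params):
    """On-policy expectation of log_pi - ELBO, enumerating all four outputs."""
    total = 0.0
    for x in itertools.product(range(2), repeat=2):
        lp = log_pi(params, x)
        total += np.exp(lp) * (lp - elbo(params, x))
    return total


rng = np.random.default_rng(0)
params = {"first": rng.normal(size=(2, 2)), "second": rng.normal(size=(2, 2, 2))}
# All-zero logits make both decoding orders agree, so the bound is tight.
consistent = {"first": np.zeros((2, 2)), "second": np.zeros((2, 2, 2))}
gap_random = expected_gap(params)
gap_consistent = expected_gap(consistent)
```

In this toy, the gap is zero exactly when both decoding orders assign the same joint probability to every output, which gives a concrete picture of what a tight bound means for a masked model.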
#### Results\.
We summarize the results in [Figure 2](https://arxiv.org/html/2606.13795#S4.F2). For FPO, the variational gap grows substantially as training proceeds. For SPG, the gap stays more controlled but is still substantial. Applying DiPOD to both algorithms effectively controls the gap.
### 4.2 Reasoning Tasks with Diffusion Language Models
#### Setup.
We follow the basic experimental setups in d1 (Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1)) and SPG (Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)). We start from LLaDA-8B-Instruct (Nie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib10)), an open-source pretrained diffusion large language model, and conduct RL experiments on it. We include four tasks: GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.13795#bib.bib11)), MATH500 (Lightman et al., [2023](https://arxiv.org/html/2606.13795#bib.bib12)), Countdown (Pan et al., [2025](https://arxiv.org/html/2606.13795#bib.bib13)), and Sudoku ([Arel](https://arxiv.org/html/2606.13795#bib.bib14)).
#### Baselines and Hyperparameters\.
We evaluate the performance of [Algorithm 2](https://arxiv.org/html/2606.13795#alg2) with the FPO gradient estimator and with the SPG gradient estimator separately. For FPO experiments, we compare d1, FPO, and DiPOD ([Algorithm 2](https://arxiv.org/html/2606.13795#alg2)) with the FPO gradient estimator. We implement FPO by replacing the log likelihoods in d1 with ELBOs, resulting in a GRPO version of FPO, while keeping the hyperparameters unchanged. We also keep the hyperparameters in DiPOD the same as d1. Similarly, for SPG experiments, we compare d1, SPG, and [Algorithm 2](https://arxiv.org/html/2606.13795#alg2) with the SPG gradient estimator. We use SPG with Mixture, which performs the best among all variants in the SPG paper. For DiPOD, we keep the hyperparameters the same as SPG. For all language experiments, we fix $\beta = 0.05$ in [Equation 7](https://arxiv.org/html/2606.13795#S3.E7), highlighting the consistent improvements of DiPOD over baseline algorithms. We include an ablation study on $\beta$ and additional results with more sequence lengths in the appendix. We fix the sequence length of the diffusion language model to be 256, and the number of decoding steps to be 128. For all benchmarks, we evaluate the performance in the zero-shot setting.
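To make "replacing the log likelihoods with ELBOs" concrete, the sketch below shows a GRPO-style clipped surrogate in which the importance ratio is formed from ELBO differences. It is an illustration under stated assumptions: normalizing group advantages by the standard deviation, the clip range of 0.2, and all function names are hypothetical choices, not settings reported in the paper or taken from the d1 codebase.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center and scale rewards within one prompt group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_elbo_surrogate(elbo_new, elbo_old, advantages, clip=0.2):
    """PPO-style clipped objective with exp(ELBO_new - ELBO_old) as the ratio.

    With exact log-likelihoods this ratio would be pi_new / pi_old; here the
    ELBO stands in for the intractable log-likelihood (the proxy).
    """
    ratio = np.exp(np.asarray(elbo_new, dtype=float) - np.asarray(elbo_old, dtype=float))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return np.mean(np.minimum(unclipped, clipped))


adv = group_advantages([1.0, 0.0, 1.0, 0.0])
elbos = np.array([-3.0, -2.5, -4.0, -3.5])
# With unchanged ELBOs the ratio is 1 and the surrogate is the mean advantage.
surrogate_at_old = clipped_elbo_surrogate(elbos, elbos, adv)
```

Because the ratio is built from bounds rather than likelihoods, any looseness in the ELBO enters the surrogate directly, which is the mechanism the ELBO regularizer in Equation 7 is meant to control.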
#### Results\.
We summarize results in [Table 1](https://arxiv.org/html/2606.13795#S4.T1) and show reward dynamics in [Figure 3](https://arxiv.org/html/2606.13795#S4.F3). The table and figures show that FPO+DiPOD consistently improves performance over FPO. Compared to SPG, the state-of-the-art algorithm in diffusion language model RL, SPG+DiPOD shows competitive performance on math reasoning tasks (GSM8K and MATH500) and achieves a significant leap on logical reasoning tasks (Countdown and Sudoku). For mathematical reasoning, we conjecture that the fixed context window of diffusion language models is the primary bottleneck for solving harder problems, which fundamentally require longer chains of thought. Consequently, we observe only marginal improvements on GSM8K and MATH500. We notice that the performance of the algorithms can differ substantially across random seeds. To keep the comparison fair, we use the same random seeds as the SPG codebase.
Table 1: Blue numbers denote improvements over the corresponding baseline (e.g., FPO+DiPOD vs. FPO, SPG+DiPOD vs. SPG). Evaluation follows the protocol of SPG (Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)). We adopt the reported results for d1 from the SPG paper, except for Sudoku, where we report results from the d1 paper, Zhao et al. ([2025](https://arxiv.org/html/2606.13795#bib.bib1)).



Figure 3: Reward dynamics on reasoning tasks. DiPOD stabilizes FPO and SPG training and achieves competitive or better final performance.
### 4.3 Motion Tracking with Diffusion Policies
#### Setup.
We further evaluate DiPOD on continuous-control diffusion policies through the motion-tracking task from FPO++ (Yi et al., [2026](https://arxiv.org/html/2606.13795#bib.bib49)), a variant of FPO that gives an adequate gradient estimator. This is a high-dimensional robot-control problem: the policy controls a Unitree G1 humanoid to track reference motions from the LAFAN dataset (Harvey et al., [2020](https://arxiv.org/html/2606.13795#bib.bib48)). Unlike the language experiments, this setting is not a post-training benchmark: the flow policy is trained for motion tracking from the beginning.
#### Baselines and Hyperparameters\.
We reuse the FPO\+\+ motion\-tracking hyperparameters\. For DiPOD, we use a single initial self\-distillation stage before the standard FPO\+\+ policy\-gradient training\. This tightens the ELBO–likelihood gap while preserving the initial policy distribution, placing subsequent updates closer to the tight\-bound regime predicted by our theory\. We choose this minimal instantiation because policy\-preserving self\-distillation during high\-dimensional motion\-control training is itself a nontrivial algorithmic component; developing fully interleaved schedules for motion tracking is an interesting direction for future work\.
#### Results\.
[Figure 4](https://arxiv.org/html/2606.13795#S4.F4) shows mean reward and episode length for *dance_1_subject_2* and *run_1_subject_2*. DiPOD improves reward dynamics and tracking duration over reproduced FPO++, supporting the same principle beyond diffusion language models. Additional details are in the appendix.




Figure 4: Motion-tracking reward and episode-length curves on *dance_1_subject_2* and *run_1_subject_2*. Legend: FPO++, DiPOD.
## 5 Conclusion
We identify a double drift issue in diffusion reinforcement learning: loose variational bounds induce proxy gradients that drift from the true policy gradient, undermining policy improvement\. DiPOD preserves gradient consistency by maintaining on\-policy bound tightness\. Its most notable empirical outcome is substantially improved reasoning capability and training stability for diffusion language models, with additional gains in continuous control\. Future work should optimize the interleaving schedule and study alternatives to ELBO regularization\.
## Acknowledgment
We thank Banghua Zhu, Chenyu Wang, David McAllister, Hongsuk Choi, Brent Yi, and Yen\-Jen Wang for helpful conversations\.
## References
- L\. Ankile, A\. Simeonov, I\. Shenfeld, M\. Torne, and P\. Agrawal \(2025\)From imitation to refinement\-residual rl for precise assembly\.In2025 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 01–08\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- Arel. Arel's sudoku generator (website). External Links: [Link](https://www.ocf.berkeley.edu/%CB%9Carel/sudoku/main.html). Cited by: [§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p4.1), [§4.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1).
- J\. Bjorck, F\. Castañeda, N\. Cherniadev, X\. Da, R\. Ding, L\. Fan, Y\. Fang, D\. Fox, F\. Hu, S\. Huang,et al\.\(2025\)Gr00t n1: an open foundation model for generalist humanoid robots\.arXiv preprint arXiv:2503\.14734\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. (2024) $\pi_0$: a vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164. Cited by: [Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1), [§1.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.13795#S1.p1.1).
- K\. Black, M\. Janner, Y\. Du, I\. Kostrikov, and S\. Levine \(2023\)Training diffusion models with reinforcement learning\.arXiv preprint arXiv:2305\.13301\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1)\.
- J\. Chen, Y\. Liang, and Z\. Liu \(2026\)DFlash: block diffusion for flash speculative decoding\.arXiv preprint arXiv:2602\.06036\.Cited by:[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[Figure 1](https://arxiv.org/html/2606.13795#S1.F1),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p4.1),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1)\.
- S\. Ding, K\. Hu, Z\. Zhang, K\. Ren, W\. Zhang, J\. Yu, J\. Wang, and Y\. Shi \(2024\)Diffusion\-based reinforcement learning via q\-weighted variational policy optimization\.Advances in Neural Information Processing Systems37,pp\. 53945–53968\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- C\. Domingo\-Enrich, M\. Drozdzal, B\. Karrer, and R\. T\. Chen \(2024\)Adjoint matching: fine\-tuning flow and diffusion generative models with memoryless stochastic optimal control\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px5.p1.1)\.
- P\. Dong, Q\. Li, D\. Sadigh, and C\. Finn \(2025\)Expo: stable reinforcement learning with expressive policies\.arXiv preprint arXiv:2507\.07986\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- Y\. Du, C\. Durkan, R\. Strudel, J\. B\. Tenenbaum, S\. Dieleman, R\. Fergus, J\. Sohl\-Dickstein, A\. Doucet, and W\. S\. Grathwohl \(2023\)Reduce, reuse, recycle: compositional generation with energy\-based diffusion models and mcmc\.InInternational conference on machine learning,pp\. 8489–8510\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px5.p1.1)\.
- S\. Ermon \(2026\)Introducing mercury 2\.Note:[https://www\.inceptionlabs\.ai/blog/introducing\-mercury\-2](https://www.inceptionlabs.ai/blog/introducing-mercury-2)Inception blog, accessed May 7, 2026Cited by:[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- L\. Fang, R\. Liu, J\. Zhang, W\. Wang, and B\. Jing \(2025\)Diffusion actor\-critic: formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- E\. Greensmith, P\. L\. Bartlett, and J\. Baxter \(2004\)Variance reduction techniques for gradient estimates in reinforcement learning\.Journal of Machine Learning Research5\(Nov\),pp\. 1471–1530\.Cited by:[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px2.p1.2)\.
- P\. Hansen\-Estruch, I\. Kostrikov, M\. Janner, J\. G\. Kuba, and S\. Levine \(2023\)Idql: implicit q\-learning as an actor\-critic method with diffusion policies\.arXiv preprint arXiv:2304\.10573\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- F\. G\. Harvey, M\. Yurick, D\. Nowrouzezahrai, and C\. Pal \(2020\)Robust motion in\-betweening\.ACM Transactions on Graphics \(TOG\)39\(4\),pp\. 60–1\.Cited by:[§F\.1](https://arxiv.org/html/2606.13795#A6.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.13795#S4.SS3.SSS0.Px1.p1.1)\.
- J\. Ho, W\. Chan, C\. Saharia, J\. Whang, R\. Gao, A\. Gritsenko, D\. P\. Kingma, B\. Poole, M\. Norouzi, D\. J\. Fleet,et al\.\(2022a\)Imagen video: high definition video generation with diffusion models\.arXiv preprint arXiv:2210\.02303\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- J\. Ho, T\. Salimans, A\. Gritsenko, W\. Chan, M\. Norouzi, and D\. J\. Fleet \(2022b\)Video diffusion models\.Advances in neural information processing systems35,pp\. 8633–8646\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- H\. Jiang, N\. Haghtalab, and L\. Chen \(2025\)Diffusion language models are provably optimal parallel samplers\.arXiv preprint arXiv:2512\.25014\.Cited by:[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- T\. Karras, M\. Aittala, T\. Kynkäänniemi, J\. Lehtinen, T\. Aila, and S\. Laine \(2024\)Guiding a diffusion model with a bad version of itself\.Advances in Neural Information Processing Systems37,pp\. 52996–53021\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px5.p1.1)\.
- D\. Kingma and R\. Gao \(2023\)Understanding diffusion objectives as the elbo with simple data augmentation\.Advances in Neural Information Processing Systems36,pp\. 65484–65516\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1)\.
- Q\. Li and S\. Levine \(2026\)Q\-learning with adjoint matching\.arXiv preprint arXiv:2601\.14234\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- Q\. Li, Z\. Zhou, and S\. Levine \(2025\)Reinforcement learning with action chunking\.arXiv preprint arXiv:2507\.07969\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p4.1),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1)\.
- M\. S\. Mark, T\. Gao, G\. G\. Sampaio, M\. K\. Srirama, A\. Sharma, C\. Finn, and A\. Kumar \(2024\)Policy agnostic rl: offline rl and online rl fine\-tuning of any class and backbone\.arXiv preprint arXiv:2412\.06685\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- D\. McAllister, S\. Ge, B\. Yi, C\. M\. Kim, E\. Weber, H\. Choi, H\. Feng, and A\. Kanazawa \(2025\)Flow matching policy gradients\.arXiv preprint arXiv:2507\.21053\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px3.p1.8),[Figure 1](https://arxiv.org/html/2606.13795#S1.F1),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.13795#S3.SS1.SSS0.Px1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.arXiv preprint arXiv:2502\.09992\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1)\.
- J\. Pan, J\. Zhang, X\. Wang, L\. Yuan, H\. Peng, and A\. Suhr \(2025\)Tinyzero\.Cited by:[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p4.1),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1)\.
- A\. Z\. Ren, J\. Lidard, L\. L\. Ankile, A\. Simeonov, P\. Agrawal, A\. Majumdar, B\. Burchfiel, H\. Dai, and M\. Simchowitz \(2024\)Diffusion policy policy optimization\.arXiv preprint arXiv:2409\.00588\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1)\.
- N\. Rudin, D\. Hoeller, P\. Reist, and M\. Hutter \(2022\)Learning to walk in minutes using massively parallel deep reinforcement learning\.InConference on robot learning,pp\. 91–100\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1)\.
- J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2016\)High\-dimensional continuous control using generalized advantage estimation\.InProceedings of the International Conference on Learning Representations \(ICLR\),Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px2.p1.2),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px1.p1.12)\.
- C\. Schwarke, V\. Klemm, M\. Van der Boon, M\. Bjelonic, and M\. Hutter \(2023\)Curiosity\-driven learning of joint locomotion and manipulation tasks\.InProceedings of the 7th Conference on Robot Learning,Vol\.229,pp\. 2594–2610\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px1.p1.12)\.
- A\. Singh, H\. Liu, G\. Zhou, A\. Yu, N\. Rhinehart, and S\. Levine \(2020\)Parrot: data\-driven behavioral priors for reinforcement learning\.arXiv preprint arXiv:2011\.10024\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- J\. Sohl\-Dickstein, E\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InInternational conference on machine learning,pp\. 2256–2265\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1)\.
- J\. Song, C\. Meng, and S\. Ermon \(2020\)Denoising diffusion implicit models\.arXiv preprint arXiv:2010\.02502\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.Advances in neural information processing systems32\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- Y\. Song, Z\. Zhang, C\. Luo, P\. Gao, F\. Xia, H\. Luo, Z\. Li, Y\. Yang, H\. Yu, X\. Qu,et al\.\(2025\)Seed diffusion: a large\-scale diffusion language model with high\-speed inference\.arXiv preprint arXiv:2508\.02193\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- X\. Tang, R\. Dolga, S\. Yoon, and I\. Bogunovic \(2025\)Wd1: weighted policy optimization for reasoning in diffusion language models\.arXiv preprint arXiv:2507\.08838\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px4.p1.17),[§E\.1](https://arxiv.org/html/2606.13795#A5.SS1.p1.11),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1)\.
- A\. Wagenmaker, Y\. Zhang, M\. Nakamoto, S\. Park, W\. Yagoub, A\. Nagabandi, A\. Gupta, and S\. Levine \(2025\)Steering your diffusion policy with latent space reinforcement learning\.InConference on Robot Learning,pp\. 258–282\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- C\. Wang, P\. Rashidinejad, D\. Su, S\. Jiang, S\. Wang, S\. Zhao, C\. Zhou, S\. Z\. Shen, F\. Chen, T\. Jaakkola,et al\.\(2025\)Spg: sandwiched policy gradient for masked diffusion language models\.arXiv preprint arXiv:2510\.09541\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.2](https://arxiv.org/html/2606.13795#A2.SS2.SSS0.Px3.p4.16),[§B\.2](https://arxiv.org/html/2606.13795#A2.SS2.p1.1),[Appendix D](https://arxiv.org/html/2606.13795#A4.SS0.SSS0.Px1.p1.6),[Appendix D](https://arxiv.org/html/2606.13795#A4.p1.8),[§E\.1](https://arxiv.org/html/2606.13795#A5.SS1.p1.11),[§E\.3](https://arxiv.org/html/2606.13795#A5.SS3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.13795#S3.SS1.SSS0.Px2),[§4\.1](https://arxiv.org/html/2606.13795#S4.SS1.SSS0.Px1.p1.5),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.13795#S4.T1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px2.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px2.p1.1)\.
- Z\. Xie, J\. Ye, L\. Zheng, J\. Gao, J\. Dong, Z\. Wu, X\. Zhao, S\. Gong, X\. Jiang, Z\. Li,et al\.\(2025\)Dream\-coder 7b: an open diffusion language model for code\.arXiv preprint arXiv:2509\.01142\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px4.p1.17),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1)\.
- L\. Yang, Y\. Tian, B\. Li, X\. Zhang, K\. Shen, Y\. Tong, and M\. Wang \(2025\)Mmada: multimodal large diffusion language models\.arXiv preprint arXiv:2505\.15809\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px4.p1.17),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1)\.
- B\. Yi, H\. Choi, H\. G\. Singh, X\. Huang, T\. E\. Truong, C\. Sferrazza, Y\. Ma, R\. Duan, P\. Abbeel, G\. Shi,et al\.\(2026\)Flow policy gradients for robot control\.arXiv preprint arXiv:2602\.02481\.Cited by:[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px5.p1.2),[§F\.1](https://arxiv.org/html/2606.13795#A6.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.13795#S4.SS3.SSS0.Px1.p1.1)\.
- X\. Yuan, T\. Mu, S\. Tao, Y\. Fang, M\. Zhang, and H\. Su \(2025\)Policy decorator: model\-agnostic online refinement for large policy model\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px4.p1.1)\.
- S\. Zhao, D\. Gupta, Q\. Zheng, and A\. Grover \(2025\)D1: scaling reasoning in diffusion large language models via reinforcement learning\.arXiv preprint arXiv:2504\.12216\.Cited by:[Appendix A](https://arxiv.org/html/2606.13795#A1.SS0.SSS0.Px3.p1.1),[§B\.1](https://arxiv.org/html/2606.13795#A2.SS1.SSS0.Px4.p1.17),[§E\.1](https://arxiv.org/html/2606.13795#A5.SS1.p1.11),[§1](https://arxiv.org/html/2606.13795#S1.SS0.SSS0.Px1.p1.1),[§1\.1](https://arxiv.org/html/2606.13795#S1.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.13795#S1.p1.1),[§4\.2](https://arxiv.org/html/2606.13795#S4.SS2.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.13795#S4.T1)\.
## Appendix A Additional Related Work
#### Diffusion models.
Diffusion models are a powerful class of generative models, with wide applications including image generation \[Ho et al., [2020](https://arxiv.org/html/2606.13795#bib.bib15), Song et al., [2020](https://arxiv.org/html/2606.13795#bib.bib16), Song and Ermon, [2019](https://arxiv.org/html/2606.13795#bib.bib17)\], video generation \[Ho et al., [2022b](https://arxiv.org/html/2606.13795#bib.bib18), [a](https://arxiv.org/html/2606.13795#bib.bib19)\], and robotics \[[Black et al.](https://arxiv.org/html/2606.13795#bib.bib20), Bjorck et al., [2025](https://arxiv.org/html/2606.13795#bib.bib21)\]. More recently, diffusion language models have emerged as a strong alternative to autoregressive models for fast inference and flexible decoding \[Nie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib10), Xie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib2), Song et al., [2025](https://arxiv.org/html/2606.13795#bib.bib22)\]. Diffusion models can be derived from several viewpoints; in this paper, we use the variational-inference perspective \[Kingma and Gao, [2023](https://arxiv.org/html/2606.13795#bib.bib23), Sohl-Dickstein et al., [2015](https://arxiv.org/html/2606.13795#bib.bib24)\], which connects the intractable likelihood to tractable evidence bounds.
#### Policy gradients\.
In reinforcement learning, policy-gradient methods optimize a parameterized policy by directly differentiating its expected cumulative return [Williams, [1992](https://arxiv.org/html/2606.13795#bib.bib25), Schulman et al., [2017](https://arxiv.org/html/2606.13795#bib.bib6)]. They have led to strong results in high-dimensional control [Schulman et al., [2016](https://arxiv.org/html/2606.13795#bib.bib30)], locomotion [Rudin et al., [2022](https://arxiv.org/html/2606.13795#bib.bib27)], manipulation [Schwarke et al., [2023](https://arxiv.org/html/2606.13795#bib.bib28)], and related sequential decision-making problems. Policy-gradient-style methods also play a central role in the post-training of large language models [Shao et al., [2024](https://arxiv.org/html/2606.13795#bib.bib3), Rafailov et al., [2023](https://arxiv.org/html/2606.13795#bib.bib9)], making their extension to diffusion language models an important algorithmic question.
#### RL with diffusion policies\.
Classical policy-gradient methods cannot be directly applied to diffusion models because the exact likelihood is intractable. One line of work tackles this problem by treating the denoising process as a Markov Decision Process [Black et al., [2023](https://arxiv.org/html/2606.13795#bib.bib26), Ren et al., [2024](https://arxiv.org/html/2606.13795#bib.bib29)]. This makes training tractable, but ties the optimization procedure to a particular denoising sampler. Another line of work uses variational objectives, such as diffusion or flow-matching evidence bounds, to bypass exact likelihood computation while preserving the native diffusion sampling process [McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4), Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)]. In language domains, several methods instead approximate likelihoods by simplifying or partially restoring inter-token dependencies [Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1), Yang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib5), Xie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib2), Tang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib32)]. DiPOD is closest to the variational line: it identifies how evidence-bound looseness can corrupt policy-gradient updates, and introduces self-distillation as a mechanism for keeping the proxy aligned with the true policy.
#### Other diffusion\-policy RL and control methods\.
Beyond online policy-gradient methods, there is a rich literature on critic-based or residual-control methods for diffusion policies. Some works leave a pretrained diffusion policy unchanged and post-process generated actions using a learned critic [Hansen-Estruch et al., [2023](https://arxiv.org/html/2606.13795#bib.bib33), Mark et al., [2024](https://arxiv.org/html/2606.13795#bib.bib34), Li et al., [2025](https://arxiv.org/html/2606.13795#bib.bib35), Dong et al., [2025](https://arxiv.org/html/2606.13795#bib.bib36)]. Others learn a residual policy that modifies the generated action in noise space [Wagenmaker et al., [2025](https://arxiv.org/html/2606.13795#bib.bib37), Singh et al., [2020](https://arxiv.org/html/2606.13795#bib.bib38)] or action space [Yuan et al., [2025](https://arxiv.org/html/2606.13795#bib.bib39), Ankile et al., [2025](https://arxiv.org/html/2606.13795#bib.bib40)]. A related set of off-policy methods uses a critic to supervise the score or denoising function directly [Fang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib41), Ding et al., [2024](https://arxiv.org/html/2606.13795#bib.bib42), Li and Levine, [2026](https://arxiv.org/html/2606.13795#bib.bib44)]. These methods address complementary settings, while DiPOD focuses on stabilizing online policy-gradient updates when the diffusion-policy likelihood must be replaced by a variational proxy.
#### Guidance and stochastic\-control perspectives\.
Guidance is also related to RL with diffusion policies, since the guided sampler can be interpreted as tilting a base distribution according to reward or energy. Early guidance approaches can suffer from bias in the resulting sampling distribution [Du et al., [2023](https://arxiv.org/html/2606.13795#bib.bib45), Karras et al., [2024](https://arxiv.org/html/2606.13795#bib.bib46)]. Adjoint Matching [Domingo-Enrich et al., [2024](https://arxiv.org/html/2606.13795#bib.bib47)] takes a stochastic optimal-control perspective and derives an efficient fine-tuning objective that converges to the desired tilted distribution under its assumptions. DiPOD instead studies the policy-gradient setting where rewards may be non-differentiable and where the central challenge is maintaining alignment between the evidence-bound proxy and the true diffusion-policy likelihood.
## Appendix B Background on Reinforcement Learning and Diffusion Models
### B.1 Reinforcement Learning
This section provides a concise introduction to the RL algorithms relevant to this paper\.
#### Markov Decision Process\.
A Markov Decision Process (MDP) consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition kernel $P$, and a reward function $r$. A policy is a function $\pi(\cdot|s)$ that outputs a distribution on $\mathcal{A}$ given a state $s \in \mathcal{S}$. In the MDP, an agent with policy $\pi$ starts from a designated state $s_0 \in \mathcal{S}$ at timestep $t=0$. At each timestep $t$, the policy receives the current state $s_t$ and takes an action $a_t \sim \pi(\cdot|s_t)$. The environment transitions to state $s_{t+1} \sim P(\cdot|s_t,a_t)$ and gives the agent a reward $r_t = r(s_t,a_t)$. The episode terminates when the agent reaches a terminal state $s^* \in \mathcal{S}$. For a policy $\pi_\theta$ with parameters $\theta$, an RL algorithm aims to maximize the expected discounted cumulative reward, defined as

$$\mathcal{J}(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t,a_t)}\left[\sum_{t \geq 0} \gamma^{t} r_t\right],$$

where $\gamma \in (0,1)$ is a discount factor. The value function is the expected cumulative reward starting from a state $s$. Formally:

$$V_t^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{\tau \geq t} \gamma^{\tau-t} r_\tau \,\middle|\, s_t = s\right],$$

where $\mathbb{E}_{\pi}$ means that the actions are sampled from policy $\pi$. The $Q$-function is the expected cumulative reward starting from state $s$ and action $a$. Formally:

$$Q_t^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{\tau \geq t} \gamma^{\tau-t} r_\tau \,\middle|\, s_t = s,\ a_t = a\right].$$
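To make the discounted sums in these definitions concrete, the following minimal Python sketch computes the discounted return of one finite episode and the per-timestep returns-to-go, which are single-sample Monte Carlo estimates of $V_t^{\pi}(s_t)$. The function names and the example rewards are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite episode."""
    total = 0.0
    # Accumulate backwards so each step needs a single multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def returns_to_go(rewards, gamma):
    """Return the list G_t = sum_{tau >= t} gamma^(tau - t) * r_tau."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Hypothetical three-step episode with unit rewards.
episode_rewards = [1.0, 1.0, 1.0]
episode_return = discounted_return(episode_rewards, gamma=0.5)
episode_returns_to_go = returns_to_go(episode_rewards, gamma=0.5)
```

Averaging `returns_to_go` over many episodes that visit the same state at the same timestep would give an empirical estimate of the value function at that state.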
#### Policy Gradient Algorithms\.
The policy gradient theorem states that

$$\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t),\, s_t \sim d^{\pi_\theta}}\left[\nabla_{\theta}\log \pi_\theta(a_t|s_t)\, Q_t^{\pi_\theta}(s_t,a_t)\right],$$

where $d^{\pi}$ is the state distribution under policy $\pi$. GAE [Schulman et al., [2016](https://arxiv.org/html/2606.13795#bib.bib30), Greensmith et al., [2004](https://arxiv.org/html/2606.13795#bib.bib31)] introduces an alternative form of the policy gradient that admits lower estimation variance:

$$\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{\theta}\left[\nabla_{\theta}\log \pi_\theta(a_t|s_t)\, A_t^{\pi_\theta}(s_t,a_t)\right],$$

where the advantage function is defined as $A_t^{\pi}(s,a) = Q_t^{\pi}(s,a) - V_t^{\pi}(s)$ and we abbreviate the expectation subscript for simplicity.
In the off\-policy setting, where a behavior policyπref\\pi\_\{\\text\{ref\}\}is used to collect data, the algorithm aims to optimize a slightly different objective defined as

$$\mathcal{J}'(\theta) = \mathbb{E}_{s_t \sim d^{\pi_{\text{ref}}}}\left[V_t^{\pi_\theta}(s_t)\right].$$

In this setting, the policy-gradient update

$$\mathbb{E}_{\text{ref}}\left[\frac{\nabla_{\theta}\pi_\theta(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)}\, A_t^{\pi_\theta}(s_t,a_t)\right]$$

guarantees policy improvement on $\mathcal{J}'(\theta)$. In practice, $A_t^{\pi_\theta}(s_t,a_t)$ is often estimated through a combination of rewards from the samples and a learned value function.
#### Proximal Policy Optimization \(PPO\)\.
PPO is an instantiation of the off\-policy gradient update introduced above\. It performs gradient updates through

$$\nabla_{\theta}\mathbb{E}_{\text{ref}}\left[\min\left\{r(\theta)\hat{A}_t,\ \text{clip}(r(\theta), 1 \pm \varepsilon)\hat{A}_t\right\}\right],$$

where we abbreviate

$$r(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)}$$

and $\hat{A}_t$ is an estimate of $A_t^{\pi_\theta}(s_t,a_t)$. The hyperparameter $\varepsilon$, chosen between 0 and 1, controls the strength of regularization on the policy update. The training data is sampled from $\pi_{\text{ref}}$, and $\pi_{\text{ref}}$ is synced with $\pi_\theta$ every few gradient steps. FPO [McAllister et al., [2025](https://arxiv.org/html/2606.13795#bib.bib4)] builds on the PPO framework by replacing the log-likelihoods of $\pi_\theta$ and $\pi_{\text{ref}}$ with their corresponding ELBOs.
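As an illustration of the clipped objective, the following minimal Python sketch evaluates the per-sample PPO term from two log-probabilities and an advantage estimate. The function name `ppo_clip_term` and the default `eps=0.2` are illustrative assumptions rather than values taken from the paper.

```python
import math


def ppo_clip_term(logp_new, logp_ref, advantage, eps=0.2):
    """Per-sample clipped surrogate min{r*A, clip(r, 1-eps, 1+eps)*A}."""
    # Likelihood ratio r(theta) computed from log-probabilities.
    ratio = math.exp(logp_new - logp_ref)
    # clip(r, 1 - eps, 1 + eps)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage is capped at (1 + eps) * A.
capped = ppo_clip_term(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage is not capped (the min keeps r * A).
uncapped = ppo_clip_term(math.log(1.5), 0.0, advantage=-1.0)
```

Under the FPO substitution described in this paragraph, the same function would be called with ELBO estimates in place of the two log-likelihoods.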
#### Group Relative Policy Optimization\.
GRPO is a variant of PPO for autoregressive language models. In language models, a prompt $q$, consisting of a sequence of tokens, serves as the input. The language model then generates another sequence of tokens $o$ as the output. This process can be framed as an MDP. At each timestep $t$, the state is the token sequence $q|o_{\leq t}$ generated so far, where $o_{\leq t}$ denotes the first $t$ tokens of $o$. The action $o_{t+1}$ is the next token to be output. In many scenarios the reward is given only once $o$ is fully generated, and it often depends only on the last few tokens, while the intermediate tokens can be thought of as reasoning traces. In GRPO, for each prompt $q$, we sample $g$ outputs $o^{1},\dots,o^{g}$ using $\pi_{\text{ref}}$ and receive rewards $r^{1},\dots,r^{g}$. We estimate the advantage as

$$\hat{A}_t^{i} = \frac{r^{i} - \text{mean}(r^{1:g})}{\text{std}(r^{1:g})}$$

and estimate the policy-gradient objective for $q$ as

$$\frac{1}{g}\sum_{i=1}^{g}\frac{1}{|o^{i}|}\sum_{t=1}^{|o^{i}|}\min\left\{r(\theta)^{i}_{t}\hat{A}^{i}_{t},\ \text{clip}(r(\theta)^{i}_{t}, 1 \pm \varepsilon)\hat{A}^{i}_{t}\right\},$$

where

$$r(\theta)^{i}_{t} = \frac{\pi_\theta(o_t^{i}|q, o^{i}_{<t})}{\pi_{\text{ref}}(o_t^{i}|q, o^{i}_{<t})}.$$

Compared to PPO, GRPO uses relative advantages inside a group of rollouts instead of the true advantage, eliminating the need to train a value function or calculate different value functions for different timesteps. Dream-Coder [Xie et al., [2025](https://arxiv.org/html/2606.13795#bib.bib2)] uses exactly the same algorithm for diffusion language models, ignoring that diffusion likelihoods do not factorize token by token in the way autoregressive likelihoods do. *Diffu*-GRPO [Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1)], wd1 [Tang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib32)], and UniGRPO [Yang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib5)] account for this factor, but still partially ignore inter-token dependencies for tractability.
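The group-relative advantage can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that the text does not specify: the standard deviation is the population standard deviation of the group, and a small constant `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's rollouts: (r_i - mean) / std."""
    mean = statistics.fmean(rewards)
    # Population standard deviation over the group (an assumption).
    std = statistics.pstdev(rewards)
    # eps is a numerical safeguard (an assumption, not in the formula).
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical binary rewards for g = 4 rollouts of the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Every token of rollout $i$ receives the same advantage, which is why no per-timestep value function is needed.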
#### FPO/FPO\+\+\.
FPO++ [Yi et al., [2026](https://arxiv.org/html/2606.13795#bib.bib49)] is an improved flow-policy-gradient algorithm that replaces the single action-level FPO ratio with per-sample ratios and uses an asymmetric trust region. In our notation, if $\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t)$ denotes the negative conditional-flow-matching loss for the $i$-th Monte Carlo sample, its per-sample ratio is

$$\hat{\rho}^{(i)}_{\mathrm{FPO++}}(\theta) = \exp\!\left(\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t) - \widehat{\text{ELBO}}_{\theta_{\mathrm{old}}}^{(i)}(a_t|o_t)\right).$$

FPO++ then applies PPO clipping for $\hat{A}_t \geq 0$ and an SPO-style quadratic trust region for $\hat{A}_t < 0$. This gives an asymmetric interpretation. For positive-advantage samples, FPO++ keeps the original FPO intuition: it treats an ELBO lift as if it were a log-likelihood lift, so decreasing the conditional-flow-matching loss is taken to increase the likelihood of desirable actions. This is exact in the tight-bound regime, and otherwise is the same local approximation underlying FPO.
For negative\-advantage samples, the SPO branch has a more explicit adaptive\-regularization form\. Recall that

$$\psi_{\mathrm{SPO}}(\rho,\hat{A}_t) = \rho\hat{A}_t - \frac{|\hat{A}_t|}{2\varepsilon^{\mathrm{clip}}}(\rho-1)^{2}, \qquad \nabla_{\theta}\hat{\rho}^{(i)}_{\mathrm{FPO++}} = \hat{\rho}^{(i)}_{\mathrm{FPO++}}\nabla_{\theta}\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t).$$

Therefore, for $\hat{A}_t < 0$,

$$\nabla_{\theta}\psi_{\mathrm{SPO}}\!\left(\hat{\rho}^{(i)}_{\mathrm{FPO++}},\hat{A}_t\right) = \left[\hat{\rho}^{(i)}_{\mathrm{FPO++}}\hat{A}_t - \frac{|\hat{A}_t|}{\varepsilon^{\mathrm{clip}}}\hat{\rho}^{(i)}_{\mathrm{FPO++}}\!\left(\hat{\rho}^{(i)}_{\mathrm{FPO++}}-1\right)\right]\nabla_{\theta}\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t),$$

which is the usual ratio-weighted policy-gradient term plus an adaptive DiPOD-like ELBO term $\beta^{(i)}_{-}\nabla_{\theta}\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t)$ with signed coefficient

$$\beta^{(i)}_{-}(\theta,\hat{A}_t) = -\frac{|\hat{A}_t|}{\varepsilon^{\mathrm{clip}}}\hat{\rho}^{(i)}_{\mathrm{FPO++}}\!\left(\hat{\rho}^{(i)}_{\mathrm{FPO++}}-1\right),$$

where the sign is for gradient ascent on $\psi_{\mathrm{SPO}}$; equivalently, the sign flips if one writes the implementation as minimizing $-\psi_{\mathrm{SPO}}$. Up to this sign convention and the multiplicative ratio factor, this is exactly a coefficient proportional to $|\hat{A}_t|(\hat{\rho}^{(i)}_{\mathrm{FPO++}}-1)/\varepsilon^{\mathrm{clip}}$. The term vanishes on-policy and grows with the ratio deviation, pulling negative-advantage ratios back toward $1$. This estimator is adequate in the sense of [Section 2](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px3): when the bound is tight and the reference policy is synchronized with the current policy, $\hat{\rho}^{(i)}_{\mathrm{FPO++}} = 1$, the adaptive term vanishes, and $\mathbb{E}_{i}[\nabla_{\theta}\widehat{\text{ELBO}}_{\theta}^{(i)}(a_t|o_t)] = \nabla_{\theta}\text{ELBO}_{\theta}(a_t|o_t) = \nabla_{\theta}\log\pi_\theta(a_t|o_t)$. Averaging over Monte Carlo samples therefore recovers the policy-gradient integrand $\hat{A}_t\nabla_{\theta}\log\pi_\theta(a_t|o_t)$.
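The bracketed coefficient can be checked numerically. The sketch below treats the current ELBO estimate as a scalar variable, evaluates $\psi_{\mathrm{SPO}}$ and the bracketed factor that multiplies $\nabla_{\theta}\widehat{\text{ELBO}}$, and compares the latter against a central finite difference. All function names and numeric values are illustrative assumptions; this is a sanity check of the algebra, not the FPO++ implementation.

```python
import math


def spo_objective(elbo_new, elbo_old, advantage, eps_clip):
    """psi_SPO(rho, A) with rho = exp(elbo_new - elbo_old)."""
    rho = math.exp(elbo_new - elbo_old)
    return rho * advantage - abs(advantage) / (2.0 * eps_clip) * (rho - 1.0) ** 2


def spo_elbo_coefficient(elbo_new, elbo_old, advantage, eps_clip):
    """Bracketed factor: derivative of psi_SPO with respect to elbo_new."""
    rho = math.exp(elbo_new - elbo_old)
    return rho * advantage - abs(advantage) / eps_clip * rho * (rho - 1.0)


def adaptive_beta(elbo_new, elbo_old, advantage, eps_clip):
    """Signed adaptive coefficient beta_- (ascent convention)."""
    rho = math.exp(elbo_new - elbo_old)
    return -abs(advantage) / eps_clip * rho * (rho - 1.0)


# Hypothetical values: a negative-advantage sample whose ELBO has increased.
h = 1e-6
finite_diff = (
    spo_objective(-1.0 + h, -1.2, -0.7, 0.2)
    - spo_objective(-1.0 - h, -1.2, -0.7, 0.2)
) / (2.0 * h)
analytic = spo_elbo_coefficient(-1.0, -1.2, -0.7, 0.2)
```

When `elbo_new == elbo_old` the ratio is 1, `adaptive_beta` is zero, and the coefficient reduces to the advantage, matching the on-policy statement in the text.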
### B.2 Diffusion Models
This section provides a brief introduction to diffusion models from a variational inference perspective. We refer interested readers to Wang et al. [[2025](https://arxiv.org/html/2606.13795#bib.bib7)] for more details regarding diffusion language models.
#### Variational Inference\.
A central goal in generative modeling is to learn a parameterized distribution $p_\theta$ that assigns high probability to observed data. Given a dataset drawn from an unknown data distribution $p_{\text{data}}$, the standard training objective is maximum likelihood estimation (MLE),

$$\max_{\theta}\ \mathbb{E}_{x_0 \sim p_{\text{data}}}\left[\log p_\theta(x_0)\right].$$

This objective applies broadly, independent of the specific model family. In many modern generative models, including diffusion models, it is natural to introduce latent variables to describe a multi-step sampling procedure: one may first sample a latent variable $z$, then sample the data $x_0$ conditional on $z$. This leads to a latent-variable model of the form

$$p_\theta(x_0) = \int p_\theta(x_0,z)\,\mathrm{d}z,$$

where $z$ can be high-dimensional (and in diffusion, typically corresponds to an entire trajectory of intermediate variables). While MLE still aims to maximize $\log p_\theta(x_0)$, the marginalization over $z$ often makes $\log p_\theta(x_0)$ intractable to evaluate and differentiate.
Variational inference addresses this intractability by introducing an auxiliary distribution $q(z|x_0)$, often called a variational posterior, that approximates the true posterior $p_\theta(z|x_0)$. For any choice of $q(z|x_0)$, Jensen’s inequality gives

$$\begin{aligned}
\log p_\theta(x_0) &= \log\int q(z|x_0)\frac{p_\theta(x_0,z)}{q(z|x_0)}\,\mathrm{d}z && \text{(8)}\\
&\geq \mathbb{E}_{q(z|x_0)}\left[\log p_\theta(x_0,z) - \log q(z|x_0)\right] && \text{(9)}\\
&=: \text{ELBO}_{\theta}(x_0). && \text{(10)}
\end{aligned}$$

The quantity $\text{ELBO}_{\theta}(x_0)$ is the evidence lower bound. Maximizing the ELBO provides a tractable surrogate for maximizing $\log p_\theta(x_0)$, because it replaces the log of an integral with an expectation under $q(z|x_0)$.
Equivalently, using $p_\theta(z|x_0) = p_\theta(x_0,z)/p_\theta(x_0)$,

$$\log p_\theta(x_0) = \text{ELBO}_{\theta}(x_0) + \mathrm{KL}\left(q(z|x_0)\,\|\,p_\theta(z|x_0)\right), \tag{11}$$

so the bound is tight if and only if $q(z|x_0) = p_\theta(z|x_0)$. Here the KL divergence is defined as $\mathrm{KL}(q\|p) := \mathbb{E}_{q}\left[\log\frac{q}{p}\right]$ and corresponds to $\mathcal{D}^{\text{L}}_{\theta}$ in [Section 2](https://arxiv.org/html/2606.13795#S2); it is nonnegative and equals $0$ iff $q = p$. In diffusion models, a carefully chosen $q$ makes the ELBO decompose into simple local terms, yielding a practical learning objective while still targeting maximum likelihood.
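The identity between the log-evidence, the ELBO, and the KL gap can be verified exactly on a toy discrete latent-variable model. In the Python sketch below, the joint values `joint[z]` $= p_\theta(x_0, z)$ for a fixed $x_0$ and the variational distribution `q` are made-up numbers chosen only for illustration.

```python
import math


def elbo(joint, q):
    """E_q[log p(x0, z) - log q(z | x0)] for a finite latent space."""
    return sum(qz * (math.log(pz) - math.log(qz)) for pz, qz in zip(joint, q))


def kl(q, p):
    """KL(q || p) = E_q[log(q / p)] for distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))


# Hypothetical joint p(x0, z) over three latent values, for one fixed x0.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(x0) = sum_z p(x0, z)
posterior = [pz / evidence for pz in joint]  # p(z | x0)
q = [0.5, 0.3, 0.2]                          # arbitrary variational posterior

log_evidence = math.log(evidence)
gap = log_evidence - elbo(joint, q)
```

Here `gap` equals `kl(q, posterior)` up to floating-point error, and replacing `q` with `posterior` closes the gap entirely, which is the tight-bound regime the paper aims to maintain.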
#### Diffusion as structured variational inference\.
Diffusion models instantiate the generic latent variable as an entire noising trajectory,

$$z \equiv x_{1:T} := (x_1,\ldots,x_T), \tag{12}$$

and define a *fixed* forward (variational) process

$$q(z|x_0) = q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}), \tag{13}$$

where each $q(x_t|x_{t-1})$ is a simple corruption kernel. The generative model uses a simple prior $p(x_T)$ and learned reverse transitions:

$$p_\theta(x_0,z) = p_\theta(x_{0:T}) = p_\theta(x_0|x_1)\prod_{t=2}^{T} p_\theta(x_{t-1}|x_t)\,p(x_T). \tag{14}$$
Substituting [Equation 13](https://arxiv.org/html/2606.13795#A2.E13)–[Equation 14](https://arxiv.org/html/2606.13795#A2.E14) into [Equation 10](https://arxiv.org/html/2606.13795#A2.E10) yields a sum of tractable terms:

$$\begin{aligned}
\text{ELBO}_{\theta}(x_0) ={}& \mathbb{E}_{q(x_1|x_0)}\!\left[\log p_\theta(x_0|x_1)\right] - \mathrm{KL}\!\left(q(x_T|x_0)\,\|\,p(x_T)\right)\\
& - \sum_{t=2}^{T}\mathbb{E}_{q(x_t|x_0)}\!\left[\mathrm{KL}\!\left(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)\right],
\end{aligned} \tag{15}$$

where $q(x_{t-1}|x_t,x_0)$ is the exact posterior of the fixed forward process (obtained from Bayes’ rule). This expression highlights that maximizing $\text{ELBO}_{\theta}(x_0)$ amounts to fitting each reverse kernel $p_\theta(x_{t-1}|x_t)$ to the corresponding diffusion posterior under $q$.
#### EUBO \(evidence upper bound\)\.
Using the same latent-variable setup, define the (nonnegative) importance weight

$$w_\theta(z;x_0) := \frac{p_\theta(x_0,z)}{q(z|x_0)}. \tag{16}$$

Then the evidence satisfies $p_\theta(x_0) = \mathbb{E}_{q(z|x_0)}[w_\theta(z;x_0)]$. For any $\beta \geq 1$, Jensen’s inequality (applied to the convex map $u \mapsto u^{\beta}$) gives

$$\log p_\theta(x_0) = \log\mathbb{E}_{q}[w_\theta] \leq \frac{1}{\beta}\log\mathbb{E}_{q(z|x_0)}\left[w_\theta(z;x_0)^{\beta}\right] =: \text{EUBO}_{\theta,\beta}(x_0), \tag{17}$$

so $\text{EUBO}_{\theta,\beta}(x_0)$ is an upper bound on the log-evidence (tight when $q(z|x_0) = p_\theta(z|x_0)$).
The looseness of this upper bound is naturally characterized by a Rényi divergence. In particular,

$$\begin{aligned}
\text{EUBO}_{\theta,\beta}(x_0) - \log p_\theta(x_0) &= \frac{1}{\beta}\log\mathbb{E}_{q(z|x_0)}\left[\left(\frac{p_\theta(z|x_0)}{q(z|x_0)}\right)^{\beta}\right]\\
&= \frac{\beta-1}{\beta}D_{\beta}\left(p_\theta(z|x_0)\,\|\,q(z|x_0)\right).
\end{aligned}$$

Here $D_{\beta}(p\|q) := \frac{1}{\beta-1}\log\mathbb{E}_{q}\left[\left(\frac{p}{q}\right)^{\beta}\right]$ is the Rényi divergence of order $\beta$, corresponding to $\mathcal{D}^{\text{U}}_{\theta}$ in [Section 2](https://arxiv.org/html/2606.13795#S2). This quantity is nonnegative and equals $0$ iff $p_\theta(z|x_0) = q(z|x_0)$, so the EUBO is tight exactly when the variational posterior matches the true posterior.
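The EUBO and its Rényi gap can likewise be evaluated exactly on a toy discrete model. The Python sketch below uses made-up values for `joint[z]` $= p_\theta(x_0, z)$ and `q`, with $\beta = 2$ as an arbitrary choice, and checks that the log-evidence is sandwiched between the ELBO and the EUBO.

```python
import math


def elbo(joint, q):
    """Lower bound: E_q[log(p(x0, z) / q(z | x0))]."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q))


def eubo(joint, q, beta):
    """Upper bound: (1 / beta) * log E_q[(p(x0, z) / q(z | x0))^beta]."""
    moment = sum(qz * (pz / qz) ** beta for pz, qz in zip(joint, q))
    return math.log(moment) / beta


def renyi_divergence(p, q, beta):
    """D_beta(p || q) = log E_q[(p / q)^beta] / (beta - 1), for beta > 1."""
    moment = sum(qz * (pz / qz) ** beta for pz, qz in zip(p, q))
    return math.log(moment) / (beta - 1.0)


# Hypothetical joint p(x0, z) over three latent values, for one fixed x0.
joint = [0.10, 0.25, 0.05]
q = [0.5, 0.3, 0.2]
beta = 2.0

evidence = sum(joint)
log_evidence = math.log(evidence)
posterior = [pz / evidence for pz in joint]
upper_gap = eubo(joint, q, beta) - log_evidence
```

With these numbers `upper_gap` matches `(beta - 1) / beta * renyi_divergence(posterior, q, beta)` up to floating-point error, and setting `q` to `posterior` makes both bounds coincide with the log-evidence.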
In diffusion, the latent variable is the noising trajectory $z \equiv x_{1:T}$ with fixed forward process $q(x_{1:T}|x_0)$, and $p_\theta(x_0,z) = p_\theta(x_{0:T})$ is defined by the learned reverse transitions. Thus,

$$\text{EUBO}_{\theta,\beta}(x_0) = \frac{1}{\beta}\log\mathbb{E}_{q(x_{1:T}|x_0)}\left[\left(\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right)^{\beta}\right], \tag{18}$$

which is typically much harder to optimize than the ELBO because it involves a log-moment over entire forward paths and does not decompose into a simple sum of per-timestep KL terms. Directly maximizing the likelihood $\log p_\theta(x_0) = \log\int p_\theta(x_0,z)\,\mathrm{d}z$ is generally intractable because it requires marginalizing over high-dimensional latents $z$ (in diffusion, entire trajectories). Likewise, $\text{EUBO}_{\theta,\beta}(x_0)$ is typically intractable since it involves a log-moment $\log\mathbb{E}_{q}[w_\theta^{\beta}]$ over the same latent space, which does not admit a simple additive decomposition. In contrast, the diffusion ELBO is tractable because it moves the logarithm inside an expectation under the fixed forward process $q$, yielding terms that can be estimated by sampling $z \sim q(\cdot|x_0)$ (often reducing to per-timestep losses).
SPG [Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)] obtains a practical EUBO surrogate for masked dLLMs by exploiting the special absorbing-mask forward process. Let $x_{1:n}$ be a token sequence, where $x_i$ is the token at position $i$, and let $m$ denote the mask token. In this paragraph, $t$ indexes the diffusion noising step, while $i$ indexes the sequence position. We write $z_t = (z_{t,1},\dots,z_{t,n})$ for the corrupted sequence at diffusion time $t$, so $z_{t,i}$ is the $i$-th token of the corrupted sequence at that time. The absorbing-mask forward process uses a monotone schedule $\alpha_t$ with, in the discrete-time convention, $\alpha_1 = 1$ and $\alpha_T = 0$. At time $t$, each coordinate remains clean with probability $\alpha_t$ and is masked with probability $1-\alpha_t$:

$$q(z_{t,i} = x_i \mid x_i) = \alpha_t, \qquad q(z_{t,i} = m \mid x_i) = 1-\alpha_t.$$

Thus each coordinate is either unchanged or masked, and once a coordinate becomes $m$ it stays masked under the forward process. For this process, the only nontrivial reverse-model term for coordinate $i$ occurs when $z_{t+1,i} = m$ and the model predicts the clean token $x_i$ from the corrupted sequence $z_{t+1}$. The factor $(\alpha_t-\alpha_{t+1})/(1-\alpha_{t+1})$ below is the posterior probability that coordinate $i$ was still clean at time $t$ given that it is masked at time $t+1$.
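The posterior factor is easy to tabulate for a concrete schedule. The Python sketch below assumes a linear schedule $\alpha_t = 1 - (t-1)/(T-1)$, which satisfies $\alpha_1 = 1$ and $\alpha_T = 0$ but is only one possible choice and is not specified by the text; the function names are also illustrative. Under this assumed schedule the factor simplifies to $1/t$.

```python
def linear_alpha(t, T):
    """Assumed linear schedule with alpha_1 = 1 and alpha_T = 0."""
    return 1.0 - (t - 1) / (T - 1)


def posterior_clean_prob(alpha_t, alpha_next):
    """P(clean at t | masked at t+1) = (alpha_t - alpha_{t+1}) / (1 - alpha_{t+1}).

    The numerator is P(clean at t and masked at t+1); the denominator is
    P(masked at t+1), since a masked coordinate never becomes clean again.
    """
    return (alpha_t - alpha_next) / (1.0 - alpha_next)


T = 5
factors = [
    posterior_clean_prob(linear_alpha(t, T), linear_alpha(t + 1, T))
    for t in range(1, T)
]
```

For `T = 5` the factors decrease with `t`: a coordinate found masked early was necessarily masked at the most recent step, whereas one found masked late could have been masked at any earlier step.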
If we directly instantiate the Rényi EUBO in [Equation 17](https://arxiv.org/html/2606.13795#A2.E17) for this masked diffusion model, we obtain a path-level log-moment:

$$\mathrm{EUBO}^{\mathrm{path}}_{\theta,\beta}(x_{1:n}) = \frac{1}{\beta}\log\mathbb{E}_{z_{1:T} \sim q(\cdot|x_{1:n})}\left[\prod_{t=1}^{T-1}\prod_{i=1}^{n}\left(\frac{p_\theta(z_{t,i}|z_{t+1})}{q(z_{t,i}|z_{t+1},x_{1:n})}\right)^{\beta}\right]. \tag{19}$$

This is the literal, unapproximated EUBO for the masked diffusion model, but it is not in the same form as the usual masked-diffusion ELBO training loss: it contains a product over all diffusion times and token positions inside a single outer logarithm. By contrast, the masked-diffusion ELBO moves the logarithm inside the expectation and reduces to weighted masked-token prediction terms,

$$\mathcal{L}_{\mathrm{ELBO}}(x_{1:n};\theta) = \sum_{i=1}^{n}\mathbb{E}_{t,z_t}\left[w(t)\,\mathbbm{1}\{z_{t,i} = m\}\log\pi_\theta(x_i|z_t)\right],$$

up to schedule-dependent constants. SPG’s approximation sits between these two objects: it keeps an upper-bound structure derived from the path-level EUBO, but uses the absorbing-mask structure to obtain a per-coordinate log-moment that can be estimated with masked-token prediction probabilities. Applying the Rényi variational bound and decomposing over coordinates yields the discrete-time upper-bound surrogate

$$\mathcal{L}^{\mathrm{SPG}}_{\mathrm{EUBO}}(x_{1:n};\theta) = \frac{1}{\beta}\sum_{i=1}^{n}\log\left(\sum_{t=1}^{T-1}\mathbb{E}_{z_{t+1} \sim q(\cdot|x_{1:n})}\left[\frac{\alpha_t-\alpha_{t+1}}{1-\alpha_{t+1}}\,\mathbbm{1}\{z_{t+1,i} = m\}\,\pi_\theta(x_i|z_{t+1})^{\beta}\right]\right) + C(T), \tag{20}$$

where $C(T)$ is independent of $\theta$ and therefore does not affect policy-gradient updates. In the continuous-time implementation, this is written in the equivalent form

$$\widetilde{\mathcal{L}}^{\mathrm{SPG}}_{\mathrm{EUBO}}(x_{1:n};\theta) = \frac{1}{\beta}\sum_{i=1}^{n}\log\mathbb{E}_{t,z_t}\left[w(t)\,\mathbbm{1}\{z_{t,i} = m\}\,\pi_\theta(x_i|z_t)^{\beta}\right], \tag{21}$$

with $w(t)$ collecting the continuous-time analogue of the schedule-dependent masking weight above, including any density correction for how $t$ is sampled. The logarithm remains outside the expectation, so single-sample Monte Carlo estimates of this quantity are generally biased; nevertheless, its gradient can be estimated by differentiating the sampled log-moment. This construction is specific to the categorical absorbing-mask structure of dLLMs, which is why the SPG EUBO approximation does not directly provide a general-purpose EUBO estimator for arbitrary diffusion or flow policies.
## Appendix C Theoretical Analysis of DiPOD
In this section, we formally analyze the convergence of DiPOD. In particular, we formalize and prove [Theorem 3.1](https://arxiv.org/html/2606.13795#S3.Thmtheorem1).
### C.1 Setup and Assumptions
Throughout this section, we work in the fully observable MDP setting described in [Appendix B](https://arxiv.org/html/2606.13795#A2), so observations are states. To keep the notation aligned with the main body, we write $o$ for this observation/state variable rather than switching to $s$. Let $d_{\pi_{\theta}}$ be the on-policy observation distribution induced by $\pi_{\theta}$ in the same sense used by the policy-gradient theorem (for example, the discounted occupancy measure in the episodic case or the stationary distribution in the continuing case), and define

$$\rho_{\theta}(o,a):=d_{\pi_{\theta}}(o)\,\pi_{\theta}(a\mid o).$$

For any reference distribution $\mu$ over observation-action pairs, define the self-distillation objective

$$F_{\mu}(\theta):=\mathbb{E}_{\mu}\left[\mathrm{ELBO}_{\theta}(a\mid o)\right]$$

and the expected ELBO and EUBO discrepancies

$$\begin{aligned}
D_{\mu}^{\mathrm{L}}(\theta)&:=\mathbb{E}_{(o,a)\sim\mu}\left[\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)\right]=\mathbb{E}_{\mu}\left[\log\pi_{\theta}(a\mid o)-\mathrm{ELBO}_{\theta}(a\mid o)\right],\\
D_{\mu}^{\mathrm{U}}(\theta)&:=\mathbb{E}_{(o,a)\sim\mu}\left[\mathcal{D}_{\theta}^{\mathrm{U}}(o,a)\right]=\mathbb{E}_{\mu}\left[\mathrm{EUBO}_{\theta}(a\mid o)-\log\pi_{\theta}(a\mid o)\right].
\end{aligned}$$

For a gradient estimator $g_{\theta}(o,a)$, write its on-policy expected update as

$$G_{g}(\theta):=\mathbb{E}_{(o,a)\sim\rho_{\theta}}\left[g_{\theta}(o,a)\right].$$
Our analysis builds on the following standard assumptions.
###### Assumption C.1 (Smooth Policy-Gradient Dynamics).
The policy-gradient theorem holds for $\mathcal{J}$ on the region of parameter space visited by the algorithm, and $\mathcal{J}$ is $L_{\mathcal{J}}$-smooth:

$$\mathcal{J}(\theta^{\prime})\geq\mathcal{J}(\theta)+\langle\nabla\mathcal{J}(\theta),\theta^{\prime}-\theta\rangle-\frac{L_{\mathcal{J}}}{2}\|\theta^{\prime}-\theta\|^{2}.$$
###### Assumption C.2 (Lipschitzness of Gradient Estimator).
The estimator $g_{\theta}$ is adequate in the sense of Definition [2](https://arxiv.org/html/2606.13795#S2.SS0.SSS0.Px3). Moreover, there exists an adequacy coefficient $c_{g}<\infty$ such that whenever the current on-policy ELBO discrepancy is at most $\varepsilon$,

$$D_{\rho_{\theta}}^{\mathrm{L}}(\theta)\leq\varepsilon\quad\Longrightarrow\quad\left\|G_{g}(\theta)-\nabla\mathcal{J}(\theta)\right\|\leq c_{g}\sqrt{\varepsilon}.$$

This is the quantitative form of adequacy used in the theorem: near a tight on-policy ELBO, the expected adequate-gradient update is Lipschitz-close to the true policy gradient.
###### Assumption C.3 (Realizability).
For every reference policy $\pi_{\mathrm{ref}}$ considered by the algorithm, with reference distribution $\mu(o,a)=d_{\mathrm{ref}}(o)\,\pi_{\mathrm{ref}}(a\mid o)$, there exists a parameter $\theta^{\dagger}$ such that $\pi_{\theta^{\dagger}}(\cdot\mid o)=\pi_{\mathrm{ref}}(\cdot\mid o)$ for $d_{\mathrm{ref}}$-almost every $o$ and $\mathcal{D}_{\theta^{\dagger}}^{\mathrm{L}}(o,a)=0$ for $\mu$-almost every $(o,a)$. Equivalently, the policy class can represent the reference policy with a tight ELBO on the reference distribution.
###### Assumption C.4 (On-policy Self-Distillation Oracle).
At each self-distillation step, DiPOD chooses the current on-policy reference distribution $\mu=\rho_{\theta_{\mathrm{ref}}}$. The self-distillation oracle returns a parameter $\bar{\theta}$, and hence a policy $\pi_{\bar{\theta}}$, whose on-reference ELBO value is within $\varepsilon$ of the best achievable value:

$$F_{\mu}(\bar{\theta})\geq\sup_{\theta}F_{\mu}(\theta)-\varepsilon.$$

Equivalently, the returned policy/model is $\varepsilon$-suboptimal for the on-reference ELBO maximization problem. This suboptimality may come from either policy mismatch with the reference distribution or from a nonzero ELBO discrepancy.
Before proceeding to the performance guarantee, we first show that both FPO and SPG satisfy Assumption [C.2](https://arxiv.org/html/2606.13795#A3.SS1) under standard bounded-advantage and smoothness conditions. The same self-bounding argument can also be extended to other adequate-gradient estimators whose bias is controlled by smooth nonnegative variational discrepancies.
###### Proposition C.6 (FPO and SPG instantiate Lipschitzness).
Assume $|A^{\pi_{\theta}}(o,a)|\leq A_{\max}$.
1. FPO. Let

   $$g_{\theta}^{\mathrm{FPO}}(o,a)=A^{\pi_{\theta}}(o,a)\,\nabla_{\theta}\mathrm{ELBO}_{\theta}(a\mid o).$$

   If $\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)$ is nonnegative and $L_{\mathrm{L}}$-smooth in $\theta$ for each $(o,a)$, then

   $$\left\|G_{g^{\mathrm{FPO}}}(\theta)-\nabla\mathcal{J}(\theta)\right\|\leq A_{\max}\sqrt{2L_{\mathrm{L}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)}.$$

   Thus FPO satisfies Assumption [C.2](https://arxiv.org/html/2606.13795#A3.SS1) with $c_{g}=A_{\max}\sqrt{2L_{\mathrm{L}}}$.
2. SPG. Let

   $$g_{\theta}^{\mathrm{SPG}}(o,a)=\mathbb{1}_{A>0}\,A\,\nabla_{\theta}\mathrm{ELBO}_{\theta}(a\mid o)+\mathbb{1}_{A<0}\,A\,\nabla_{\theta}\mathrm{EUBO}_{\theta}(a\mid o),$$

   where $A$ abbreviates $A^{\pi_{\theta}}(o,a)$. If $\mathcal{D}_{\theta}^{\mathrm{L}}$ and $\mathcal{D}_{\theta}^{\mathrm{U}}$ are nonnegative and respectively $L_{\mathrm{L}}$- and $L_{\mathrm{U}}$-smooth, then

   $$\left\|G_{g^{\mathrm{SPG}}}(\theta)-\nabla\mathcal{J}(\theta)\right\|\leq A_{\max}\sqrt{2L_{\mathrm{L}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)}+A_{\max}\sqrt{2L_{\mathrm{U}}D_{\rho_{\theta}}^{\mathrm{U}}(\theta)}.$$

   Consequently, if the EUBO discrepancy is also tight in the self-distilled region, e.g. $D_{\rho_{\theta}}^{\mathrm{U}}(\theta)\leq C_{\mathrm{U}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)$, then SPG satisfies Assumption [C.2](https://arxiv.org/html/2606.13795#A3.SS1) with $c_{g}=A_{\max}\left(\sqrt{2L_{\mathrm{L}}}+\sqrt{2L_{\mathrm{U}}C_{\mathrm{U}}}\right)$.
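As a rough illustration of how the two estimators in the proposition differ, the sketch below writes each as a scalar surrogate whose $\theta$-gradient (with advantages held constant) would match the estimator. It is a minimal sketch with plain floats: the function names and the example numbers are hypothetical, and a real implementation would pass differentiable ELBO and EUBO values to an autodiff framework instead.

```python
def fpo_surrogate(advantages, elbos):
    """Batch mean of A * ELBO; its gradient corresponds to the FPO estimator."""
    return sum(A * lo for A, lo in zip(advantages, elbos)) / len(advantages)

def spg_surrogate(advantages, elbos, eubos):
    """Batch mean of the sign-split surrogate used by the SPG estimator.

    Positive advantages are paired with the lower bound (ELBO) and negative
    advantages with the upper bound (EUBO); zero advantages contribute nothing.
    """
    total = 0.0
    for A, lo, up in zip(advantages, elbos, eubos):
        if A > 0:
            total += A * lo
        elif A < 0:
            total += A * up
    return total / len(advantages)

# Hypothetical per-sample values with ELBO <= EUBO for every sample.
advantages = [1.0, -2.0, 0.0]
elbos = [-1.0, -3.0, -5.0]
eubos = [-0.5, -2.5, -4.0]

fpo_value = fpo_surrogate(advantages, elbos)
spg_value = spg_surrogate(advantages, elbos, eubos)
print(fpo_value, spg_value)
```

Because $\mathrm{ELBO}\leq\log\pi\leq\mathrm{EUBO}$, each term of the sign-split surrogate is at most $A\log\pi$, so it lower-bounds the exact objective sample by sample, whereas $A\cdot\mathrm{ELBO}$ with $A<0$ does not.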
###### Proof\.
We use the self-bounding property of nonnegative smooth functions. If $f(\theta)\geq 0$ and $f$ is $L$-smooth, then

$$\|\nabla f(\theta)\|^{2}\leq 2Lf(\theta).\tag{22}$$

Indeed, smoothness at $\theta^{\prime}=\theta-\frac{1}{L}\nabla f(\theta)$ gives

$$0\leq f(\theta^{\prime})\leq f(\theta)-\frac{1}{2L}\|\nabla f(\theta)\|^{2}.$$
For FPO, the policy-gradient theorem gives

$$\nabla\mathcal{J}(\theta)=\mathbb{E}_{\rho_{\theta}}\left[A^{\pi_{\theta}}(o,a)\,\nabla_{\theta}\log\pi_{\theta}(a\mid o)\right].$$

Using $\mathrm{ELBO}_{\theta}=\log\pi_{\theta}-\mathcal{D}_{\theta}^{\mathrm{L}}$,

$$G_{g^{\mathrm{FPO}}}(\theta)-\nabla\mathcal{J}(\theta)=-\mathbb{E}_{\rho_{\theta}}\left[A^{\pi_{\theta}}(o,a)\,\nabla_{\theta}\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)\right].$$

Therefore,

$$\begin{aligned}
\left\|G_{g^{\mathrm{FPO}}}(\theta)-\nabla\mathcal{J}(\theta)\right\|
&\leq A_{\max}\,\mathbb{E}_{\rho_{\theta}}\left[\left\|\nabla_{\theta}\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)\right\|\right]\\
&\leq A_{\max}\sqrt{\mathbb{E}_{\rho_{\theta}}\left[\left\|\nabla_{\theta}\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)\right\|^{2}\right]}\\
&\leq A_{\max}\sqrt{2L_{\mathrm{L}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)},
\end{aligned}$$

where the last step uses [Equation 22](https://arxiv.org/html/2606.13795#A3.E22).
For SPG, use $\mathrm{EUBO}_{\theta}=\log\pi_{\theta}+\mathcal{D}_{\theta}^{\mathrm{U}}$ in addition to the ELBO decomposition:

$$G_{g^{\mathrm{SPG}}}(\theta)-\nabla\mathcal{J}(\theta)=-\mathbb{E}_{\rho_{\theta}}\left[\mathbb{1}_{A>0}\,A\,\nabla_{\theta}\mathcal{D}_{\theta}^{\mathrm{L}}(o,a)\right]+\mathbb{E}_{\rho_{\theta}}\left[\mathbb{1}_{A<0}\,A\,\nabla_{\theta}\mathcal{D}_{\theta}^{\mathrm{U}}(o,a)\right].$$

The triangle inequality, bounded advantages, Cauchy-Schwarz, and [Equation 22](https://arxiv.org/html/2606.13795#A3.E22) give

$$\left\|G_{g^{\mathrm{SPG}}}(\theta)-\nabla\mathcal{J}(\theta)\right\|\leq A_{\max}\sqrt{2L_{\mathrm{L}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)}+A_{\max}\sqrt{2L_{\mathrm{U}}D_{\rho_{\theta}}^{\mathrm{U}}(\theta)}.$$

The final statement follows by substituting $D_{\rho_{\theta}}^{\mathrm{U}}(\theta)\leq C_{\mathrm{U}}D_{\rho_{\theta}}^{\mathrm{L}}(\theta)$. ∎
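The self-bounding inequality in Equation 22 is the only analytic ingredient of the proof above, and it is easy to sanity-check numerically. The sketch below uses the softplus function as a stand-in for a discrepancy: this choice is an assumption made purely for illustration (softplus is nonnegative and $\tfrac{1}{4}$-smooth on the real line), not a claim about the form of $\mathcal{D}_{\theta}^{\mathrm{L}}$ in any real model.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); nonnegative, second derivative <= 1/4."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Derivative of softplus, computed in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

L = 0.25  # smoothness constant of softplus
grid = [i / 10.0 for i in range(-100, 101)]

# Slack of the self-bounding inequality: 2 * L * f(x) - f'(x)^2 should be >= 0.
slack = [2.0 * L * softplus(x) - sigmoid(x) ** 2 for x in grid]
print(f"smallest slack on the grid: {min(slack):.3e}")
```

The slack stays nonnegative across the grid, and it shrinks toward zero exactly where the function itself approaches zero, which is the regime the proposition relies on: a small discrepancy forces a small discrepancy gradient.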
### C.2 Performance Guarantee
###### Lemma C.7 (Guarantee for Self-Distillation Steps).
Let $\mu(o,a)=d_{\mathrm{ref}}(o)\,\pi_{\mathrm{ref}}(a\mid o)$ be a reference state-action distribution satisfying Assumption [C.3](https://arxiv.org/html/2606.13795#A3.SS1). If an ELBO oracle returns $\bar{\theta}$ satisfying

$$F_{\mu}(\bar{\theta})\geq\sup_{\theta}F_{\mu}(\theta)-\varepsilon,$$

then

$$\mathbb{E}_{o\sim d_{\mathrm{ref}}}\left[\mathrm{KL}\!\left(\pi_{\mathrm{ref}}(\cdot\mid o)\,\|\,\pi_{\bar{\theta}}(\cdot\mid o)\right)\right]+D_{\mu}^{\mathrm{L}}(\bar{\theta})\leq\varepsilon.\tag{23}$$
###### Proof\.
We proceed in three steps\.
#### Step 1: upper bound the best possible self\-distillation objective\.
For any candidate parameter $\theta$, the ELBO decomposition in [Equation 2](https://arxiv.org/html/2606.13795#S2.E2) gives

$$F_{\mu}(\theta)=\mathbb{E}_{\mu}\left[\log\pi_{\theta}(a\mid o)\right]-D_{\mu}^{\mathrm{L}}(\theta).$$

Since $D_{\mu}^{\mathrm{L}}(\theta)\geq 0$,

$$F_{\mu}(\theta)\leq\mathbb{E}_{o\sim d_{\mathrm{ref}},\,a\sim\pi_{\mathrm{ref}}(\cdot\mid o)}\left[\log\pi_{\theta}(a\mid o)\right].$$

Rewriting the right-hand side relative to $\pi_{\mathrm{ref}}$,

$$\begin{aligned}
\mathbb{E}_{o\sim d_{\mathrm{ref}},\,a\sim\pi_{\mathrm{ref}}(\cdot\mid o)}\left[\log\pi_{\theta}(a\mid o)\right]
&=\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right]-\mathbb{E}_{o\sim d_{\mathrm{ref}}}\left[\mathrm{KL}\!\left(\pi_{\mathrm{ref}}(\cdot\mid o)\,\|\,\pi_{\theta}(\cdot\mid o)\right)\right]\\
&\leq\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right].
\end{aligned}$$

Thus,

$$F_{\mu}(\theta)\leq\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right].\tag{24}$$
#### Step 2: show that the upper bound is attainable\.
By realizability, $F_{\mu}(\theta^{\dagger})=\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right]$. Together with [Equation 24](https://arxiv.org/html/2606.13795#A3.E24), this implies

$$\sup_{\theta}F_{\mu}(\theta)=\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right].$$
#### Step 3: use oracle optimality\.
The $\varepsilon$-optimality of $\bar{\theta}$ gives

$$F_{\mu}(\bar{\theta})\geq\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right]-\varepsilon.$$

Expanding $F_{\mu}(\bar{\theta})$ as in Step 1 gives

$$F_{\mu}(\bar{\theta})=\mathbb{E}_{\mu}\left[\log\pi_{\mathrm{ref}}(a\mid o)\right]-\mathbb{E}_{o\sim d_{\mathrm{ref}}}\left[\mathrm{KL}\!\left(\pi_{\mathrm{ref}}(\cdot\mid o)\,\|\,\pi_{\bar{\theta}}(\cdot\mid o)\right)\right]-D_{\mu}^{\mathrm{L}}(\bar{\theta}).$$

Combining the previous two displays yields [Equation 23](https://arxiv.org/html/2606.13795#A3.E23). ∎
###### Theorem C.8 (DiPOD improves return under on-policy tightness).
Suppose Assumptions [C.1](https://arxiv.org/html/2606.13795#A3.SS1), [C.2](https://arxiv.org/html/2606.13795#A3.SS1), [C.3](https://arxiv.org/html/2606.13795#A3.SS1), and [C.4](https://arxiv.org/html/2606.13795#A3.SS1) hold. Consider one idealized DiPOD cycle with current parameter $\theta_{k}$. The self-distillation oracle uses $\rho_{\theta_{k}}$ as reference and returns $\bar{\theta}_{k}$. Then Lemma [C.7](https://arxiv.org/html/2606.13795#A3.SS2) gives the reference-distribution decomposition

$$\mathbb{E}_{o\sim d_{\pi_{\theta_{k}}}}\left[\mathrm{KL}\!\left(\pi_{\theta_{k}}(\cdot\mid o)\,\|\,\pi_{\bar{\theta}_{k}}(\cdot\mid o)\right)\right]+D_{\rho_{\theta_{k}}}^{\mathrm{L}}(\bar{\theta}_{k})\leq\varepsilon.\tag{25}$$

As emphasized in [Section C.1](https://arxiv.org/html/2606.13795#A3.SS1), this is a guarantee on the reference distribution $\rho_{\theta_{k}}$. Now let $\bar{\varepsilon}_{k}:=D_{\rho_{\bar{\theta}_{k}}}^{\mathrm{L}}(\bar{\theta}_{k})$ be the discrepancy on the returned policy's own on-policy distribution, and perform the policy update

$$\theta_{k+1}=\bar{\theta}_{k}+\eta\,G_{g}(\bar{\theta}_{k}).$$

If $0<\eta\leq 1/L_{\mathcal{J}}$, then

$$\mathcal{J}(\theta_{k+1})\geq\mathcal{J}(\bar{\theta}_{k})+\frac{\eta}{2}\left[\left\|\nabla\mathcal{J}(\bar{\theta}_{k})\right\|^{2}-c_{g}^{2}\,\bar{\varepsilon}_{k}\right].\tag{26}$$

Therefore, whenever $\|\nabla\mathcal{J}(\bar{\theta}_{k})\|>c_{g}\sqrt{\bar{\varepsilon}_{k}}$, the policy update strictly improves the true return relative to the returned policy $\bar{\theta}_{k}$. If the oracle is exact ($\varepsilon=0$), or more generally returns a policy-preserving model so that $\rho_{\bar{\theta}_{k}}=\rho_{\theta_{k}}$, then $\bar{\varepsilon}_{k}\leq\varepsilon$ and $\mathcal{J}(\bar{\theta}_{k})=\mathcal{J}(\theta_{k})$; in this case the full DiPOD cycle strictly improves $\mathcal{J}$ whenever $\|\nabla\mathcal{J}(\bar{\theta}_{k})\|>c_{g}\sqrt{\varepsilon}$.

Moreover, suppose the iterates converge to a stable post-distillation policy $\theta_{\star}$ for which the final on-policy self-distillation step is $\varepsilon$-optimal and returns $\theta_{\star}$. Here stable means that the subsequent idealized update

$$\theta_{\star}^{+}=\theta_{\star}+\eta\,G_{g}(\theta_{\star})$$

with $0<\eta\leq 1/L_{\mathcal{J}}$ does not strictly improve $\mathcal{J}$. Then

$$\left\|\nabla\mathcal{J}(\theta_{\star})\right\|\leq c_{g}\sqrt{\varepsilon},\qquad D_{\rho_{\theta_{\star}}}^{\mathrm{L}}(\theta_{\star})\leq\varepsilon.$$

Thus the final policy is approximately stationary for the true return and has a tight ELBO on its own final on-policy distribution.
###### Proof\.
The oracle decomposition [Equation 25](https://arxiv.org/html/2606.13795#A3.E25) follows directly by applying Lemma [C.7](https://arxiv.org/html/2606.13795#A3.SS2) with $\mu=\rho_{\theta_{k}}$.

By definition of $\bar{\varepsilon}_{k}$ and the Lipschitzness of the estimator (Assumption C.2),

$$\left\|G_{g}(\bar{\theta}_{k})-\nabla\mathcal{J}(\bar{\theta}_{k})\right\|\leq c_{g}\sqrt{\bar{\varepsilon}_{k}}.$$

Let

$$v:=\nabla\mathcal{J}(\bar{\theta}_{k}),\qquad e:=G_{g}(\bar{\theta}_{k})-v.$$

Then $\|e\|\leq c_{g}\sqrt{\bar{\varepsilon}_{k}}$. Using $L_{\mathcal{J}}$-smoothness with $\theta_{k+1}=\bar{\theta}_{k}+\eta(v+e)$,

$$\mathcal{J}(\theta_{k+1})\geq\mathcal{J}(\bar{\theta}_{k})+\eta\langle v,v+e\rangle-\frac{L_{\mathcal{J}}\eta^{2}}{2}\|v+e\|^{2}.$$

When $\eta\leq 1/L_{\mathcal{J}}$,

$$\begin{aligned}
\mathcal{J}(\theta_{k+1})
&\geq\mathcal{J}(\bar{\theta}_{k})+\eta\langle v,v+e\rangle-\frac{\eta}{2}\|v+e\|^{2}\\
&=\mathcal{J}(\bar{\theta}_{k})+\frac{\eta}{2}\left[\|v\|^{2}-\|e\|^{2}\right]\\
&\geq\mathcal{J}(\bar{\theta}_{k})+\frac{\eta}{2}\left[\left\|\nabla\mathcal{J}(\bar{\theta}_{k})\right\|^{2}-c_{g}^{2}\,\bar{\varepsilon}_{k}\right].
\end{aligned}$$

This proves [Equation 26](https://arxiv.org/html/2606.13795#A3.E26). Strict improvement follows whenever $\|\nabla\mathcal{J}(\bar{\theta}_{k})\|>c_{g}\sqrt{\bar{\varepsilon}_{k}}$. If the oracle is policy preserving, then $\rho_{\bar{\theta}_{k}}=\rho_{\theta_{k}}$, so [Equation 25](https://arxiv.org/html/2606.13795#A3.E25) implies $\bar{\varepsilon}_{k}=D_{\rho_{\theta_{k}}}^{\mathrm{L}}(\bar{\theta}_{k})\leq\varepsilon$, and policy preservation also gives $\mathcal{J}(\bar{\theta}_{k})=\mathcal{J}(\theta_{k})$. When $\varepsilon=0$, the oracle decomposition gives zero expected KL from $\pi_{\theta_{k}}$ to $\pi_{\bar{\theta}_{k}}$ on $d_{\pi_{\theta_{k}}}$ and zero reference-distribution ELBO discrepancy; under the same MDP dynamics this is the exact policy-preserving case.

For the final claim, apply the oracle decomposition to the final self-distillation step with reference $\rho_{\theta_{\star}}$ and returned parameter $\theta_{\star}$. The KL term is zero, so $D_{\rho_{\theta_{\star}}}^{\mathrm{L}}(\theta_{\star})\leq\varepsilon$. If $\|\nabla\mathcal{J}(\theta_{\star})\|>c_{g}\sqrt{\varepsilon}$, the improvement bound applied to $\theta_{\star}^{+}$ would give a strict increase in $\mathcal{J}$, contradicting stability. Thus $\|\nabla\mathcal{J}(\theta_{\star})\|\leq c_{g}\sqrt{\varepsilon}$. ∎
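The one-step improvement bound in Equation 26 can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not part of the DiPOD implementation: the return is a hypothetical concave quadratic that is exactly $L$-smooth, and the estimator error $e$ is supplied by hand rather than arising from an ELBO discrepancy. It evaluates the update $\theta_{k+1}=\bar{\theta}_{k}+\eta(v+e)$ for a small and a large error and compares the achieved return with the lower bound $\mathcal{J}(\bar{\theta}_{k})+\frac{\eta}{2}(\|v\|^{2}-\|e\|^{2})$.

```python
def J(theta, L):
    """Toy return: a concave quadratic that is exactly L-smooth."""
    return -0.5 * L * sum(t * t for t in theta)

def grad_J(theta, L):
    """Exact gradient of the toy return."""
    return [-L * t for t in theta]

def one_update(theta_bar, err, L, eta):
    """Apply theta_next = theta_bar + eta * (v + e); return (J(theta_next), lower bound)."""
    v = grad_J(theta_bar, L)
    theta_next = [t + eta * (vi + ei) for t, vi, ei in zip(theta_bar, v, err)]
    v_sq = sum(x * x for x in v)
    e_sq = sum(x * x for x in err)
    bound = J(theta_bar, L) + 0.5 * eta * (v_sq - e_sq)
    return J(theta_next, L), bound

# Hypothetical values; eta satisfies eta <= 1 / L.
L, eta = 2.0, 0.4
theta_bar = [1.0, -0.5]
base = J(theta_bar, L)

# Small error (tight-ELBO regime) versus large error (drifted regime).
tight_value, tight_bound = one_update(theta_bar, [0.1, 0.2], L, eta)
loose_value, loose_bound = one_update(theta_bar, [3.0, 0.0], L, eta)

print(base, tight_value, tight_bound)
print(base, loose_value, loose_bound)
```

In both cases the achieved return respects the bound. With the small error the return increases, while with the large error the bound becomes vacuous and the return drops below its starting value, which mirrors the role of $c_{g}^{2}\bar{\varepsilon}_{k}$ in the theorem.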
## Appendix D Additional Details for Two-Token Post-Training
We use the same toy experiment as the one in Appendix C.3 of SPG \[Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)\]. In the toy experiment, the discrete diffusion model generates a distribution on two discrete tokens $x=(x_{1},x_{2})$. Both $x_{1}$ and $x_{2}$ take values from $\mathcal{V}=\{\mathrm{A},\mathrm{B}\}$. In the generation process, $x_{1}$ and $x_{2}$ can also be a special mask token $\mathrm{M}$. The generation starts from $x=\mathrm{MM}$, and tokens are decoded according to a uniformly random order. The model is parameterized by six real numbers:

$$\begin{aligned}
a&=\mathrm{logit}(\pi(x_{1}=\mathrm{A}\mid x=\mathrm{MA})), & b&=\mathrm{logit}(\pi(x_{1}=\mathrm{A}\mid x=\mathrm{MM})), & c&=\mathrm{logit}(\pi(x_{2}=\mathrm{A}\mid x=\mathrm{AM})),\\
d&=\mathrm{logit}(\pi(x_{2}=\mathrm{A}\mid x=\mathrm{MM})), & e&=\mathrm{logit}(\pi(x_{1}=\mathrm{A}\mid x=\mathrm{MB})), & f&=\mathrm{logit}(\pi(x_{2}=\mathrm{A}\mid x=\mathrm{BM})),
\end{aligned}$$

where the logit function is defined as $\mathrm{logit}(p)=\ln\frac{p}{1-p}$. One can explicitly calculate $\log\pi$, $\mathrm{ELBO}$, and $\mathcal{D}^{\mathrm{L}}$ using these parameters to monitor the drift. As an explicit example, the log-likelihood and ELBO for AA can be calculated as

$$\begin{aligned}
\log\pi(\mathrm{AA})&=\log\left(\frac{1}{2}\mathrm{S}(b)\mathrm{S}(c)+\frac{1}{2}\mathrm{S}(a)\mathrm{S}(d)\right),\\
\mathrm{ELBO}(\mathrm{AA})&=\frac{1}{2}\left(\log\mathrm{S}(a)+\log\mathrm{S}(b)+\log\mathrm{S}(c)+\log\mathrm{S}(d)\right).
\end{aligned}$$

Here $\mathrm{S}(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function, the inverse function of the logit function, which recovers probabilities from logits. The parameters are all initialized to be $0.5$ so that ELBO equals log-likelihood for all outputs, resulting in a pretrained diffusion model. We set the reward function to be

$$r(\mathrm{AA})=0.8,\quad r(\mathrm{AB})=1,\quad r(\mathrm{BA})=0.7,\quad r(\mathrm{BB})=1.$$

The reward function is chosen for clarity of exposition and is without loss of generality.
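The closed-form quantities of the two-token model are simple enough to compute directly. The sketch below is a minimal reconstruction from the parameterization above rather than the authors' code: the function name `two_token_stats` is hypothetical, and the ELBO for outputs other than AA is assumed, by analogy with the AA formula, to be the average of the log-probabilities of the two decoding orders.

```python
import math

def S(x):
    """Sigmoid: maps a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def two_token_stats(a, b, c, d, e, f):
    """Return {output: (log_pi, elbo, gap)} for the two-token masked-diffusion toy."""
    # Probability that the decoded token is A, indexed by the current context.
    p1 = {"MM": S(b), "MA": S(a), "MB": S(e)}   # decoding x1
    p2 = {"MM": S(d), "AM": S(c), "BM": S(f)}   # decoding x2

    def prob(p_A, token):
        return p_A if token == "A" else 1.0 - p_A

    stats = {}
    for x1 in "AB":
        for x2 in "AB":
            # Order 1: decode x1 from MM, then x2 given x1.
            path1 = prob(p1["MM"], x1) * prob(p2[x1 + "M"], x2)
            # Order 2: decode x2 from MM, then x1 given x2.
            path2 = prob(p2["MM"], x2) * prob(p1["M" + x2], x1)
            log_pi = math.log(0.5 * path1 + 0.5 * path2)
            elbo = 0.5 * (math.log(path1) + math.log(path2))
            stats[x1 + x2] = (log_pi, elbo, log_pi - elbo)
    return stats

# Initialization from the text: all six logits equal to 0.5.
init = two_token_stats(0.5, 0.5, 0.5, 0.5, 0.5, 0.5)

# A hypothetical drifted parameter setting, chosen only for illustration.
drifted = two_token_stats(2.0, 0.5, -1.0, 0.5, 0.5, 0.5)

for name, stats in (("init", init), ("drifted", drifted)):
    for out, (log_pi, elbo, gap) in stats.items():
        print(name, out, round(log_pi, 4), round(elbo, 4), round(gap, 4))
```

At the stated initialization the two decoding orders assign the same probability to every output, so the gap $\log\pi-\mathrm{ELBO}$ vanishes. Once the order-dependent logits disagree, as in the drifted setting, the gap for AA becomes strictly positive.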
#### Algorithm implementation\.
We implement FPO by updating the model parameters according to [Equation 4](https://arxiv.org/html/2606.13795#S3.E4), and SPG by updating the model parameters according to [Equation 5](https://arxiv.org/html/2606.13795#S3.E5). In the SPG implementation, a surrogate for EUBO (the $\mathcal{L}_{\mathrm{EUBO}}$ in Wang et al. \[[2025](https://arxiv.org/html/2606.13795#bib.bib7)\]) is used to tackle the intractability issue, and we follow their implementation in our experiment. In this toy setting, we directly calculate the policy gradient without invoking Monte Carlo samples for gradient estimation. We implement [Algorithm 2](https://arxiv.org/html/2606.13795#alg2) with FPO and SPG gradient estimators for comparison as well. We set the learning rate to be $0.1$, the beta parameter in EUBO to be $1.5$, $\beta$ in DiPOD to be $0.2$, and run these algorithms for $1500$ policy-gradient steps.
## Appendix E Additional Experiments for Diffusion Language Models
### E.1 Experiments with More Sequence Lengths
We additionally report results at generation sequence lengths $128$, $256$, and $512$ for SPG and SPG+DiPOD. We use the same setup as in [Section 4.2](https://arxiv.org/html/2606.13795#S4.SS2), and set the DiPOD coefficient to $\beta=0.05$ for all settings except Sudoku with sequence lengths $128$ and $512$, where we use $\beta=0.02$. We summarize all tasks and sequence lengths in a single table. For GSM8K, MATH500, and Countdown, we copy the LLaDA-8B-Instruct and d1 baselines from SPG \[Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)\]. For Sudoku, we do *not* use the 3-shot numbers reported in SPG; instead, we report the zero-shot LLaDA-8B-Instruct and d1 results from d1 \[Zhao et al., [2025](https://arxiv.org/html/2606.13795#bib.bib1)\], and the wd1 baseline from wd1 \[Tang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib32)\]. The wd1 paper reports zero-shot results only at sequence lengths $256$ and $512$; for GSM8K, MATH500, and Countdown at sequence length $128$, we therefore use the reproduced wd1 numbers in SPG \[Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)\], while the Sudoku $128$ entry remains unavailable because the SPG Sudoku setting is 3-shot.

Table 2: Results across generation sequence lengths. For GSM8K, MATH500, and Countdown, the LLaDA-8B-Instruct and d1 baselines are copied from SPG. For Sudoku, we report zero-shot baselines from d1 and wd1 rather than the 3-shot numbers in SPG. For wd1, sequence length $128$ is taken from SPG's reproduced wd1 row for GSM8K, MATH500, and Countdown, since the original wd1 paper only reports zero-shot results at lengths $256$ and $512$; the Sudoku $128$ entry is therefore unavailable.

Compared with SPG, SPG+DiPOD gives consistent gains on GSM8K and MATH500 across sequence lengths. On Countdown, the best performance shifts to sequence length $256$, where SPG+DiPOD substantially outperforms SPG, while SPG remains stronger at $128$ and $512$. Sudoku exhibits a discrete training behavior, with runs tending to settle around roughly $25\%$, $40\%$, or near $100\%$ accuracy. Although SPG+DiPOD does not show a clear improvement at sequence lengths $128$ and $512$, it is the only setting that reaches the near-$100\%$ regime in the zero-shot setting.
### E.2 Ablation on the DiPOD Coefficient $\beta$
We ablate the DiPOD coefficient $\beta$ in Equation (7). Due to computational constraints, we conduct this study only on Countdown. This choice is also motivated by task characteristics: on GSM8K and MATH500, the gains are relatively small in our main results, while on Sudoku, single-run outcomes can exhibit discrete phase-transition-like behavior, which may lead to misleading conclusions in a limited ablation. Therefore, Countdown provides the cleanest setting for isolating the effect of $\beta$. Consistent with the fairness protocol used in the main experiments, we keep the random seed fixed across all ablation runs.

Table 3: Ablation of the DiPOD coefficient $\beta$ on Countdown. We report final accuracy under the same setup as [Section 4.2](https://arxiv.org/html/2606.13795#S4.SS2).

[Table 3](https://arxiv.org/html/2606.13795#A5.T3) shows that the choice of $\beta$ has a clear effect on performance. A moderate regularization strength works best, with $\beta=0.05$ achieving the strongest result. When $\beta$ is too small, the additional ELBO tightening is not strong enough to sufficiently control the drift during policy optimization; when $\beta$ is too large, the optimization appears to overemphasize the regularization term, which can weaken the policy-improving update. Overall, these results support the use of a moderate on-policy ELBO regularization strength and validate our default choice of $\beta=0.05$ in the main language-model experiments.
### E.3 Experiment on Sudoku in the Three-Shot Setting
Figure 5: Reward dynamics of SPG and SPG+DiPOD in the 3-shot setting.

In the three-shot setting, SPG \[Wang et al., [2025](https://arxiv.org/html/2606.13795#bib.bib7)\] has already saturated the performance. Here we show the reward dynamics of SPG+DiPOD compared to SPG in [Figure 5](https://arxiv.org/html/2606.13795#A5.F5), with $\beta=0.05$. SPG+DiPOD converges to the optimal performance faster than SPG.
## Appendix F Additional Experiments for Motion Tracking
### F.1 Motion Tracking Experiment
We provide additional details for the motion-tracking experiment discussed in the main text. We follow the motion-tracking setup of FPO++ \[Yi et al., [2026](https://arxiv.org/html/2606.13795#bib.bib49)\]: a diffusion policy controls the Unitree G1 humanoid to track reference motions from LAFAN \[Harvey et al., [2020](https://arxiv.org/html/2606.13795#bib.bib48)\]. All downstream FPO++ hyperparameters follow the original motion-tracking setup. The only added component is an initial self-distillation stage before normal FPO++ policy-gradient training. We use this single-stage version because policy-preserving self-distillation during high-dimensional motion-control training is itself a nontrivial algorithmic component; designing a fully interleaved self-distillation schedule for this setting is left to future work.
#### Self\-distillation procedure\.
The self-distillation stage instantiates the self-distillation step in the original interleaved DiPOD procedure ([Algorithm 1](https://arxiv.org/html/2606.13795#alg1)). At the beginning of training, we clone the randomly initialized actor into a frozen teacher. We then collect rollouts using teacher actions: the frozen teacher samples actions, and we store the resulting observation and teacher action pairs. The student actor is trained on this growing dataset by maximizing the empirical average of the evidence lower bound. This phase is not reward learning: the critic, advantages, and surrogate policy-gradient loss are used only after self-distillation is complete. Thus the initial distillation step preserves the initial policy distribution while tightening the gap between ELBO and log-likelihood before the subsequent policy-gradient stage.
Algorithm 3 Initial self\-distillation for motion tracking
Require: initial actor $\pi_{\theta}$, vectorized environments, distillation iterations $K$
1: Clone the actor into a frozen teacher $\bar{\pi}$
2: Initialize replay buffer $\mathcal{D} \leftarrow \emptyset$
3: for $k = 1, \dots, K$ do
4:   Collect teacher rollouts in the live environments and append sampled observation\-action pairs $(o, a)$ to $\mathcal{D}$
5:   Sample a minibatch $\{(o_i, a_i)\}_{i=1}^{B}$ from $\mathcal{D}$
6:   Update only the student actor by maximizing $\frac{1}{B} \sum_{i=1}^{B} \text{ELBO}_{\theta}(a_i \mid o_i)$
7: end for
8: Reset the optimizer state and continue with the FPO\+\+ policy\-gradient stage
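The control flow of Algorithm 3 can be sketched as a toy Python program that needs only the standard library\. Everything in it is an illustrative assumption, not the authors' implementation: the actor is a hypothetical four\-parameter linear velocity field, observations are drawn uniformly at random instead of coming from a simulator, a negative conditional flow\-matching squared error stands in for $\text{ELBO}_{\theta}(a \mid o)$, and plain gradient ascent replaces AdamW\.

```python
import copy
import random


def velocity(params, a_t, t, obs):
    """Toy linear velocity field; a hypothetical stand-in for the actor network."""
    w_a, w_t, w_o, bias = params
    return w_a * a_t + w_t * t + w_o * obs + bias


def teacher_sample(params, obs, rng, steps=64):
    """Draw an action by Euler-integrating the velocity field from noise (t=0) to t=1."""
    action = rng.gauss(0.0, 1.0)
    dt = 1.0 / steps
    for i in range(steps):
        action += dt * velocity(params, action, i * dt, obs)
    return action


def surrogate_elbo_grad(params, batch, rng):
    """Gradient of the mean negative flow-matching squared error (assumed ELBO stand-in)."""
    grad = [0.0, 0.0, 0.0, 0.0]
    for obs, action in batch:
        t = rng.random()
        eps = rng.gauss(0.0, 1.0)
        a_t = (1.0 - t) * eps + t * action  # noisy interpolant between noise and action
        err = velocity(params, a_t, t, obs) - (action - eps)
        for j, feat in enumerate((a_t, t, obs, 1.0)):
            grad[j] -= 2.0 * err * feat / len(batch)
    return grad


def initial_self_distillation(actor, K=100, num_envs=16, batch_size=32, lr=1e-2, seed=0):
    rng = random.Random(seed)
    teacher = copy.deepcopy(actor)  # 1: frozen teacher, never updated
    buffer = []                     # 2: replay buffer D
    for _ in range(K):              # 3: distillation iterations
        for _ in range(num_envs):   # 4: teacher rollouts (toy observations)
            obs = rng.uniform(-1.0, 1.0)
            buffer.append((obs, teacher_sample(teacher, obs, rng)))
        batch = rng.sample(buffer, min(batch_size, len(buffer)))  # 5: minibatch
        grad = surrogate_elbo_grad(actor, batch, rng)             # 6: student-only update
        for j in range(len(actor)):
            actor[j] += lr * grad[j]  # gradient ascent on the surrogate
    # 8: plain gradient ascent keeps no optimizer state, so nothing is reset here.
    return actor, teacher, buffer


student, teacher, data = initial_self_distillation([0.0, 0.0, 0.5, 0.0], K=20)
print(len(data), teacher)
```

In this toy run the buffer ends with 20 × 16 = 320 pairs and the teacher parameters are unchanged, which mirrors the two structural properties of the procedure: the dataset grows across iterations, and only the student is updated\.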
In our runs, self\-distillation uses $100$ iterations, $8$ rollout steps per iteration, minibatch size $16384$, AdamW with learning rate $3 \cdot 10^{-4}$ and weight decay $10^{-4}$, and actor gradient clipping at $1.0$\. The frozen teacher uses $64$ sampling steps\. The tracking tasks use $4096$ parallel environments\.
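For reference, the hyperparameters listed above can be collected in a small configuration object\. This is a sketch with hypothetical field names, not the authors' configuration file\. The two derived quantities rest on assumptions the text does not state: that every environment contributes one observation\-action pair per rollout step, and that the replay buffer is never truncated\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    """Values quoted in the text; the field names are hypothetical."""
    iterations: int = 100
    rollout_steps: int = 8
    minibatch_size: int = 16384
    learning_rate: float = 3e-4
    weight_decay: float = 1e-4
    grad_clip: float = 1.0
    teacher_sampling_steps: int = 64
    num_envs: int = 4096

    def pairs_per_iteration(self) -> int:
        # Assumes each environment yields one (o, a) pair per rollout step.
        return self.num_envs * self.rollout_steps

    def final_buffer_size(self) -> int:
        # Assumes the replay buffer is never truncated.
        return self.iterations * self.pairs_per_iteration()


cfg = DistillConfig()
print(cfg.pairs_per_iteration())  # 32768
print(cfg.final_buffer_size())    # 3276800
```

Under these assumptions, one minibatch of 16384 pairs is half the size of a single iteration's newly collected data\.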
#### Computational cost\.
We run the experiments on a single NVIDIA L40S GPU\. For *dance\_1\_subject\_1*, the self\-distillation stage takes $95.5$ s in total, or about $0.96$ s per self\-distillation iteration\. A policy\-gradient iteration takes about $15$ s, so the $1000$\-iteration training runs shown in the figures take roughly $4.2$ hours\. Thus, the initial self\-distillation stage adds only about $1.5$ minutes of computation, which is negligible relative to the full training process\.
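The overhead estimate can be checked with a few lines of arithmetic on the timings quoted above\. The variable names are ours, and the only inputs are the reported numbers\.

```python
# Timings reported for dance_1_subject_1 on a single NVIDIA L40S GPU.
distill_iterations = 100
distill_total_s = 95.5
pg_iterations = 1000
pg_seconds_per_iteration = 15.0

distill_s_per_iteration = distill_total_s / distill_iterations
pg_total_s = pg_iterations * pg_seconds_per_iteration
pg_total_hours = pg_total_s / 3600.0
overhead_fraction = distill_total_s / pg_total_s

print(f"{distill_s_per_iteration:.3f} s per distillation iteration")  # 0.955
print(f"{pg_total_hours:.2f} h of policy-gradient training")          # 4.17
print(f"{100 * overhead_fraction:.2f}% overhead")                     # 0.64
```

The self\-distillation stage therefore costs well under 1% of the policy\-gradient training time\.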
#### Results\.
We report the full set of motion\-tracking curves in [Figures 6](https://arxiv.org/html/2606.13795#A6.F6) and [7](https://arxiv.org/html/2606.13795#A6.F7)\. Across all six LAFAN tracking tasks, DiPOD improves sample efficiency over the reproduced FPO\+\+ baseline: the reward curves rise faster and the episode\-length curves indicate more sustained successful tracking\. The consistency across *dance\_1\_subject\_1*, *dance\_1\_subject\_2*, *fight\_1\_subject\_2*, *jumps\_1\_subject\_1*, *run\_1\_subject\_2*, and *walk\_1\_subject\_1* suggests that the initial self\-distillation step provides a general benefit rather than a task\-specific gain\.
Figure 6: Mean reward curves for all six LAFAN motion\-tracking tasks on the G1 humanoid\. Panels: \(a\) *dance\_1\_subject\_1*, \(b\) *dance\_1\_subject\_2*, \(c\) *fight\_1\_subject\_2*, \(d\) *jumps\_1\_subject\_1*, \(e\) *run\_1\_subject\_2*, \(f\) *walk\_1\_subject\_1*\.
Figure 7: Mean episode length curves for all six LAFAN motion\-tracking tasks on the G1 humanoid\. Panels: \(a\) *dance\_1\_subject\_1*, \(b\) *dance\_1\_subject\_2*, \(c\) *fight\_1\_subject\_2*, \(d\) *jumps\_1\_subject\_1*, \(e\) *run\_1\_subject\_2*, \(f\) *walk\_1\_subject\_1*\.