Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
Summary
This paper identifies a failure mode called 'trajectory locking' in reward-maximizing post-training for diffusion language models, and proposes TraFL, a trajectory-balance objective that improves diversity and performance across math and code benchmarks.
# Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
Source: [https://arxiv.org/html/2605.13935](https://arxiv.org/html/2605.13935)
Saba Ahmadi, Prasanna Parthasarathi, Yufei Cui
Noah’s Ark Lab
###### Abstract
Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting, which we call *trajectory locking*: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2605.13935#bib.bib27)) and is the strongest method on every LiveCodeBench (Jain et al., [2025](https://arxiv.org/html/2605.13935#bib.bib28)) difficulty split.
## 1 Introduction
Recent diffusion language models (dLLMs) have emerged as a compelling alternative to autoregressive models, showing early promise on reasoning and code generation (Zhu et al., [2025](https://arxiv.org/html/2605.13935#bib.bib11); Ye et al., [2025](https://arxiv.org/html/2605.13935#bib.bib21)). A key open question for advancing these models further is how to post-train them effectively. From a capability perspective, current post-training methods improve single-sample accuracy but struggle to produce a *diverse* set of correct solutions under repeated sampling, a key requirement when many valid answers exist.
Designing such an objective is challenging because diffusion language models do not expose token\-level conditional log\-probabilities in the way autoregressive models do\. Existing approaches therefore adapt reward\-maximizing RL to diffusion generation through surrogate likelihoods, autoregressive reductions, or stepwise approximations\(Zhaoet al\.,[2025](https://arxiv.org/html/2605.13935#bib.bib18); Tanget al\.,[2025](https://arxiv.org/html/2605.13935#bib.bib22); Wanget al\.,[2025a](https://arxiv.org/html/2605.13935#bib.bib23); Kundeet al\.,[2026](https://arxiv.org/html/2605.13935#bib.bib25); Niet al\.,[2026](https://arxiv.org/html/2605.13935#bib.bib19)\)\. Across these methods, we observe a recurring failure mode we call*trajectory locking*: because reward depends only on the final completion and is indifferent to which denoising path produced it, sampled policy\-gradient updates reinforce already\-favored paths, progressively collapsing probability mass onto a narrow subset of trajectories—and with it, coverage of alternative correct solutions\. Distribution\-matching methods that train toward a reward\-tilted target\(Zhuet al\.,[2026a](https://arxiv.org/html/2605.13935#bib.bib7),[b](https://arxiv.org/html/2605.13935#bib.bib9)\)offer a principled alternative, yet as we show, the way the normalization term is handled can reintroduce the same concentration failure\. This raises a basic question:*what is the right post\-training objective for diffusion language models that avoids trajectory locking while remaining practical?*
To this end, we propose TraFL (Trajectory Flow baLancing), a post-training objective grounded in the trajectory-balance principle from Generative Flow Networks (GFlowNets) (Bengio et al., [2023](https://arxiv.org/html/2605.13935#bib.bib8)). Rather than using reward as an unconstrained amplifier of sampled trajectories, TraFL trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. In this target, the reward increases the mass assigned to successful completions, while the reference model regularizes how probability mass is allocated over plausible generations. We make this practical for diffusion language models with two ingredients: (i) a diffusion-compatible sequence-level surrogate for comparing fully denoised completions under the current and reference models, and (ii) a learned prompt-dependent normalization term, jointly trained with the policy, that receives gradient signal at every training step regardless of which completions are sampled.
Across mathematical reasoning (GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.13935#bib.bib12)), MATH-500 (Lightman et al., [2023](https://arxiv.org/html/2605.13935#bib.bib13))) and code generation (HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.13935#bib.bib16)), MBPP (Austin et al., [2021](https://arxiv.org/html/2605.13935#bib.bib17))) benchmarks, TraFL is the only directly evaluated post-training method that improves over the base model in every benchmark-length setting. The gains transfer to held-out evaluations on Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2605.13935#bib.bib27)) and LiveCodeBench (Jain et al., [2025](https://arxiv.org/html/2605.13935#bib.bib28)), where TraFL is the strongest method on every LiveCodeBench difficulty split.
Our contributions are as follows:
1. We identify *trajectory locking* as a central failure mode of dLLM post-training, where sampled reward-driven updates collapse probability mass onto a narrow set of denoising paths.
2. We propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective for post-training diffusion language models that trains toward a reference-anchored reward-tilted target distribution, made practical with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization.
3. TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting on GSM8K, MATH-500, HumanEval, and MBPP, outperforming ESPO (Ou et al., [2026](https://arxiv.org/html/2605.13935#bib.bib26)) and JustGRPO (Ni et al., [2026](https://arxiv.org/html/2605.13935#bib.bib19)) on average.
4. The gains transfer to held-out Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2605.13935#bib.bib27)) and LiveCodeBench (Jain et al., [2025](https://arxiv.org/html/2605.13935#bib.bib28)), where TraFL is the strongest method on every LiveCodeBench difficulty split, and an LLM-as-judge analysis provides evidence that improvements are associated with broader correct-solution coverage rather than only sharper scoring of a single mode.
## 2 Related Work
#### RL post\-training for diffusion language models\.
Recent diffusion language models such as LLaDA (Zhu et al., [2025](https://arxiv.org/html/2605.13935#bib.bib11)) and Dream (Ye et al., [2025](https://arxiv.org/html/2605.13935#bib.bib21)) have made diffusion-based text generation a viable alternative to autoregressive language models, but reinforcement learning for these models remains challenging because diffusion generation does not expose the same left-to-right token-level conditional factorization used by PPO- or GRPO-style training in autoregressive LLMs. Early work adapted autoregressive RL objectives to diffusion models by introducing likelihood surrogates. In particular, d1 (Zhao et al., [2025](https://arxiv.org/html/2605.13935#bib.bib18)) proposes diffu-GRPO, a GRPO-style method for masked dLLMs built on one-step per-token log-probability estimation together with a mean-field approximation to sequence likelihood, and combines it with a preceding supervised fine-tuning stage in its full recipe. wd1 (Tang et al., [2025](https://arxiv.org/html/2605.13935#bib.bib22)) removes explicit policy-ratio estimation and instead optimizes a ratio-free weighted log-likelihood objective derived from reverse-KL-regularized policy optimization. SPG (Wang et al., [2025a](https://arxiv.org/html/2605.13935#bib.bib23)) addresses the bias induced by one-sided likelihood surrogates by maximizing a lower bound for positive-advantage samples and minimizing an evidence upper bound for negative-advantage samples, together with a block-wise masking strategy for more stable Monte Carlo estimation.
A complementary line of work argues that the central issue is not only the quality of the surrogate, but also the action granularity used by the RL objective. ESPO (Ou et al., [2026](https://arxiv.org/html/2605.13935#bib.bib26)) formalizes this view most explicitly: it treats whole-sequence generation as a single action and uses the ELBO as a tractable sequence-level proxy, together with stabilized ratio and KL estimators, arguing that token-level objectives are fundamentally mismatched to non-autoregressive diffusion generation. TraceRL (Wang et al., [2025b](https://arxiv.org/html/2605.13935#bib.bib24)) takes yet another perspective, emphasizing alignment between the training objective and the model’s preferred inference trajectory. It performs trajectory-aware optimization over denoising traces, and introduces a diffusion-based value model for variance reduction. In contrast, JustGRPO (Ni et al., [2026](https://arxiv.org/html/2605.13935#bib.bib19)) argues that preserving arbitrary-order generation during RL can itself be counterproductive for reasoning, and instead constrains training to autoregressive order so that standard GRPO can be applied directly, while still retaining the parallel decoding benefits of dLLMs at inference time.
#### Distribution matching and reward\-tilted target distributions\.
Closest in spirit to our work are methods that move beyond pure reward maximization and instead optimize toward reward-tilted target distributions. On the dLLM side, DMPO (Zhu et al., [2026b](https://arxiv.org/html/2605.13935#bib.bib9)) formulates post-training as policy distribution matching toward a reward-tilted target distribution and implements this through importance sampling and weighted denoising cross-entropy, i.e., a scalable forward-KL-style approximation to distribution matching. On the autoregressive side, FlowRL (Zhu et al., [2026a](https://arxiv.org/html/2605.13935#bib.bib7)) likewise advocates matching a reward-tilted distribution rather than maximizing reward alone, but does so through a GFlowNet-style formulation for autoregressive reasoning models rather than diffusion language models. Our method is aligned with this broader distribution-matching view, but differs in how it is instantiated for dLLMs: we adopt a trajectory-balance perspective tailored to diffusion generation, combine it with a diffusion-compatible sequence-level surrogate, and learn a prompt-dependent partition function that captures the normalization of the reward-tilted target.
## 3 Trajectory Flow Balancing
We introduce TraFL (Trajectory Flow baLancing), a reference-anchored trajectory-balance objective for post-training diffusion language models. We first formalize why terminal rewards can induce trajectory locking, then define the reward-tilted target underlying TraFL.
#### Setup: Diffusion Trajectories and Terminal Rewards
Given a prompt $\mathbf{x}$, a discrete diffusion language model defines a distribution over denoising trajectories

$$\tau^{(i)}=\left(z_{T}^{(i)},z_{T-1}^{(i)},\ldots,z_{0}^{(i)}\right),\qquad\tau^{(i)}\sim p_{\theta}(\tau\mid\mathbf{x}).$$

Here, $z_{T}$ denotes the fully noised state and $z_{0}$ the final completion $\mathbf{y}$. The reward $r(\mathbf{x},\mathbf{y})$ is assigned only to this terminal completion, while the model distribution is induced through the full denoising trajectory. This distinction is important because several denoising trajectories can terminate in the same completion $\mathbf{y}$, while different completions can correspond to different valid solution modes. Thus, post-training can affect both the allocation of probability over paths to a fixed answer and the coverage of distinct terminal solution modes.
### 3.1 Trajectory Locking: Why Terminal Reward Is Not Enough
Maximizing terminal reward alone is blind to how probability mass is allocated across denoising paths that reach the same completion. The objective depends on the total probability assigned to a final completion $\mathbf{y}$, but not on how that probability is split among the denoising trajectories that produce $\mathbf{y}$. We formalize this path-indifference below.
###### Proposition 1 (Path-indifference of terminal reward).
Fix a prompt $\mathbf{x}$ and a final completion $\mathbf{y}$. Let $\mathcal{T}(\mathbf{y})$ denote the set of denoising trajectories that terminate at $\mathbf{y}$. Any redistribution of probability mass among trajectories in $\mathcal{T}(\mathbf{y})$ that preserves $p_{\theta}(\mathbf{y}\mid\mathbf{x})$ leaves a terminal-reward objective unchanged.
###### Proof\.
The terminal\-reward objective decomposes as

$$J(\mathbf{x})=\sum_{\mathbf{y}}p_{\theta}(\mathbf{y}\mid\mathbf{x})\,r(\mathbf{x},\mathbf{y}),\qquad p_{\theta}(\mathbf{y}\mid\mathbf{x})=\sum_{\tau\in\mathcal{T}(\mathbf{y})}p_{\theta}(\tau\mid\mathbf{x}).$$

The contribution of $\mathbf{y}$ to $J$ depends only on the scalar $p_{\theta}(\mathbf{y}\mid\mathbf{x})$, not on how that mass is spread across trajectories in $\mathcal{T}(\mathbf{y})$. Therefore, any redistribution of mass within $\mathcal{T}(\mathbf{y})$ that keeps $p_{\theta}(\mathbf{y}\mid\mathbf{x})$ unchanged leaves $J(\mathbf{x})$ unchanged. ∎
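As a small numerical check of Proposition 1, the sketch below evaluates the terminal-reward objective on a toy problem. The trajectory names, probabilities, and rewards are hypothetical values chosen for illustration; they do not come from the paper's experiments.

```python
from collections import defaultdict


def terminal_reward_objective(traj_probs, terminal_of, reward):
    """J(x) = sum_y p(y|x) r(x, y), where p(y|x) sums over trajectories ending at y."""
    completion_prob = defaultdict(float)
    for traj, prob in traj_probs.items():
        completion_prob[terminal_of[traj]] += prob
    return sum(p * reward[y] for y, p in completion_prob.items())


# Hypothetical toy setup: two denoising paths reach y1, one reaches y2.
terminal_of = {"path_a": "y1", "path_b": "y1", "path_c": "y2"}
reward = {"y1": 1.0, "y2": 0.0}

# Same p(y1|x) = 0.6 in both cases, but split differently across paths.
spread = {"path_a": 0.3, "path_b": 0.3, "path_c": 0.4}
locked = {"path_a": 0.6, "path_b": 0.0, "path_c": 0.4}

j_spread = terminal_reward_objective(spread, terminal_of, reward)
j_locked = terminal_reward_objective(locked, terminal_of, reward)
print(j_spread, j_locked)
```

Both allocations give the same objective value, so the objective alone offers no pressure against the collapsed allocation.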
Proposition[1](https://arxiv.org/html/2605.13935#Thmproposition1)is an objective\-level statement: reward maximization does not distinguish among paths that reach the same terminal completion\. In sampled optimization, however, this flatness can become unstable\. Only sampled trajectories receive direct gradient signal\. Consequently, if two trajectories reach the same rewarding completion but one is sampled slightly more often early in training, it receives more positive updates, becomes more likely, and is sampled even more often in subsequent updates\. We call this self\-reinforcing concentration*trajectory locking*\. A formal analysis of this feedback effect is given in App\.[C\.2](https://arxiv.org/html/2605.13935#A3.SS2)and App\.[C\.3](https://arxiv.org/html/2605.13935#A3.SS3)\.
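The feedback step described above can be illustrated with a single sampled update. This is a minimal sketch that assumes a softmax policy over two paths and a plain REINFORCE-style update without a baseline; both are illustrative assumptions, not the training procedure of any method evaluated in the paper.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, reward, lr):
    """One sampled update: the gradient of log pi(sampled) w.r.t. logits is onehot - pi."""
    probs = softmax(logits)
    return [
        theta + lr * reward * ((1.0 if i == sampled else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]


# Two paths that reach the same rewarding completion, initially equally likely.
logits = [0.0, 0.0]
p_before = softmax(logits)

# Path 0 happens to be sampled; only it receives a positive update.
logits_after = reinforce_step(logits, sampled=0, reward=1.0, lr=0.5)
p_after = softmax(logits_after)
print(p_before, p_after)
```

In this toy setting the expected update is zero, since both paths earn the same reward, but each individual sampled update raises the probability of the path that was drawn, which in turn makes it more likely to be drawn again.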
Trajectory locking matters because trajectory diversity upper\-bounds terminal solution coverage\. We make this connection explicit next\.
###### Theorem 1 (Trajectory diversity is necessary for mode coverage).
Let $\mathcal{T}$ be a random denoising trajectory taking values in a countable set $\Omega_{\mathcal{T}}$, and let $M=g(\mathcal{T})$ be its terminal solution mode under a deterministic mapping $g:\Omega_{\mathcal{T}}\to\Omega_{M}$. Then

$$|\mathrm{supp}(M)|\leq|\mathrm{supp}(\mathcal{T})|\qquad\text{and}\qquad H(M)\leq H(\mathcal{T}),$$

where $\mathrm{supp}(\cdot)$ denotes the support of a distribution and $H(\cdot)$ the Shannon entropy. In particular, terminal mode coverage cannot exceed trajectory diversity.
###### Corollary 1 (Trajectory locking bounds mode coverage).
If trajectory locking reduces the trajectory distribution to a subset $\mathcal{S}$ of all possible paths, then the set of terminal modes that can be covered is contained in $g(\mathcal{S})$. In particular, collapse to a single trajectory implies collapse to a single terminal mode.
Proofs of Theorem[1](https://arxiv.org/html/2605.13935#Thmtheorem1)and Corollary[1](https://arxiv.org/html/2605.13935#Thmcorollary1)are given in App\.[C\.1](https://arxiv.org/html/2605.13935#A3.SS1)\. Together, these statements make the stakes concrete: trajectory locking does not merely change which paths are used; it can limit which solutions the model can produce under sampling\. This motivates a reference\-anchored objective that uses reward to tilt terminal completions, while preserving a non\-degenerate allocation of probability mass over trajectories\.
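The two inequalities in Theorem 1 can be checked numerically on a toy example. The sketch below uses a hypothetical mapping `g` from four trajectories to three modes; the names and probabilities are illustrative only.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def support_size(dist):
    return sum(1 for p in dist.values() if p > 0)


def pushforward(traj_dist, g):
    """Distribution of the terminal mode M = g(T)."""
    mode_dist = defaultdict(float)
    for traj, p in traj_dist.items():
        mode_dist[g[traj]] += p
    return dict(mode_dist)


# Hypothetical deterministic map from trajectories to terminal modes.
g = {"t1": "mode_A", "t2": "mode_A", "t3": "mode_B", "t4": "mode_C"}

diverse = {"t1": 0.25, "t2": 0.25, "t3": 0.25, "t4": 0.25}
locked = {"t1": 1.0, "t2": 0.0, "t3": 0.0, "t4": 0.0}

modes_diverse = pushforward(diverse, g)
modes_locked = pushforward(locked, g)

print(support_size(modes_diverse), support_size(diverse))
print(entropy(modes_diverse), entropy(diverse))
print(support_size(modes_locked), entropy(modes_locked))
```

The locked distribution reproduces the special case in Corollary 1: one surviving trajectory leaves exactly one reachable mode.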
### 3.2 Reward-Tilted Trajectory Flow
To avoid relying on terminal reward alone to determine trajectory allocation, we define a reward\-tilted target distribution over denoising trajectories by reweighting a frozen reference model:

$$p^{*}(\tau\mid\mathbf{x})=\frac{1}{Z(\mathbf{x})}\,p_{\mathrm{ref}}(\tau\mid\mathbf{x})\exp\bigl(\beta r(\mathbf{x},\mathbf{y})\bigr),\tag{1}$$

where $r(\mathbf{x},\mathbf{y})$ is a scalar reward on the final completion, $\beta>0$ controls reward sharpness, and $Z(\mathbf{x})$ is the prompt-dependent normalization term. Although the reward is still terminal, it does not define the target distribution by itself: it tilts a frozen reference trajectory distribution. Thus, among trajectories with the same terminal reward, the target remains proportional to $p_{\mathrm{ref}}(\tau\mid\mathbf{x})$, rather than being indifferent to how mass is allocated across them. This separates two roles that are coupled in direct reward maximization: the reward controls how much mass high-reward completions should receive, while the reference model anchors the relative allocation of probability over plausible trajectories and completions, rather than letting reward amplify only the sampled support.
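To make the two roles concrete, the following sketch computes the reward-tilted target of Eq. (1) for a toy reference distribution. The trajectory names, reference probabilities, rewards, and $\beta=1$ are hypothetical values chosen for illustration.

```python
import math


def reward_tilted_target(p_ref, reward_of, beta):
    """p*(tau) = p_ref(tau) * exp(beta * r(tau)) / Z, with Z summing over all trajectories."""
    unnorm = {t: p * math.exp(beta * reward_of[t]) for t, p in p_ref.items()}
    z = sum(unnorm.values())
    return {t: w / z for t, w in unnorm.items()}, z


# Hypothetical toy reference model over three trajectories.
# reward_of[t] is the reward of the completion that trajectory t terminates in.
p_ref = {"t1": 0.2, "t2": 0.3, "t3": 0.5}
reward_of = {"t1": 1.0, "t2": 1.0, "t3": 0.0}

p_star, z = reward_tilted_target(p_ref, reward_of, beta=1.0)
print(p_star, z)
```

The rewarded trajectories `t1` and `t2` gain total mass relative to the reference, while their ratio stays at the reference ratio of 0.2 to 0.3.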
#### Trajectory Flow Balancing Objective
We train the current model to match the reward-tilted target in [Eq. 1](https://arxiv.org/html/2605.13935#S3.E1) through a trajectory-balance residual. Equating $p_{\theta}(\tau\mid\mathbf{x})$ with $p^{*}(\tau\mid\mathbf{x})$ and taking logarithms gives

$$\delta(\tau,\mathbf{x})=\log p_{\theta}(\tau\mid\mathbf{x})-\log p_{\mathrm{ref}}(\tau\mid\mathbf{x})-\beta r(\mathbf{x},\mathbf{y})+\log Z(\mathbf{x}),\tag{2}$$

and we minimize the squared loss,

$$\mathcal{L}_{\mathrm{TraFL}}=\mathbb{E}_{\tau\sim p_{\theta}(\cdot\mid\mathbf{x})}\bigl[\delta(\tau,\mathbf{x})^{2}\bigr].\tag{3}$$

At the minimum, the residual is zero when the current trajectory distribution matches the reward-tilted reference distribution. The learned normalization term $Z(\mathbf{x})$ absorbs the prompt-dependent offset in the balance condition, so the policy is trained on relative trajectory-level preferences rather than on an unnormalized reward signal alone. Equations ([1](https://arxiv.org/html/2605.13935#S3.E1))–([3](https://arxiv.org/html/2605.13935#S3.E3)) define the trajectory-level balance condition.
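The residual in Eq. (2) and the squared loss in Eq. (3) can be sketched in a few lines. The example assumes exact log-probabilities on a toy three-trajectory problem with hypothetical values; in practice these terms would be replaced by estimates, and the batch would be sampled from the current policy rather than enumerated.

```python
import math


def tb_residual(log_p_theta, log_p_ref, reward, log_z, beta):
    """Trajectory-balance residual of Eq. (2)."""
    return log_p_theta - log_p_ref - beta * reward + log_z


def trafl_loss(batch, beta):
    """Mean squared residual over a batch of (log_p_theta, log_p_ref, reward, log_z) tuples."""
    residuals = [tb_residual(lp, lr, r, lz, beta) for (lp, lr, r, lz) in batch]
    return sum(d * d for d in residuals) / len(residuals)


# Hypothetical toy problem.
beta = 1.0
p_ref = [0.2, 0.3, 0.5]
rewards = [1.0, 1.0, 0.0]
z = sum(p * math.exp(beta * r) for p, r in zip(p_ref, rewards))
p_star = [p * math.exp(beta * r) / z for p, r in zip(p_ref, rewards)]

# Policy equal to the tilted target, with the correct log Z: residuals vanish.
matched = [(math.log(ps), math.log(pr), r, math.log(z))
           for ps, pr, r in zip(p_star, p_ref, rewards)]
# Policy still equal to the reference: residuals are nonzero.
unmatched = [(math.log(pr), math.log(pr), r, math.log(z))
             for pr, r in zip(p_ref, rewards)]

loss_matched = trafl_loss(matched, beta)
loss_unmatched = trafl_loss(unmatched, beta)
print(loss_matched, loss_unmatched)
```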
#### Normalized Log\-Probability Surrogate
Evaluating [Eq. 2](https://arxiv.org/html/2605.13935#S3.E2) requires log-probability terms under both the current and reference models. Diffusion language models do not expose exact trajectory log-probabilities, so we use a tractable log-probability surrogate derived from a masked-reconstruction lower bound. Given a completion $\mathbf{y}=(y_{1},\dots,y_{L})$, we sample $l\sim\mathrm{Uniform}(\{1,\dots,L\})$, construct a corrupted sequence $\mathbf{y}_{l}\sim q_{l}(\cdot\mid\mathbf{y},\mathbf{x})$ by replacing exactly $l$ uniformly sampled completion tokens with $\mathtt{[mask]}$, and define

$$\log\hat{p}_{\theta}(\mathbf{y}\mid\mathbf{x})=\underset{l\sim\mathcal{U}(\{1,\dots,L\})}{\mathbb{E}}\;\underset{\mathbf{y}_{l}\sim q_{l}(\cdot\mid\mathbf{y},\mathbf{x})}{\mathbb{E}}\left[\frac{1}{l}\sum_{i=1}^{L}\mathbf{1}[y_{i}^{l}=\mathtt{[mask]}]\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})\right].\tag{4}$$

The indicator picks out the $l$ masked positions, so the bracketed quantity is the average log-probability across those $l$ positions; the outer expectations are over the sampled mask level $l$ and corruption $\mathbf{y}_{l}$. The full derivation from a masked-reconstruction lower bound, and the role of the $1/l$ factor, are given in App. [B](https://arxiv.org/html/2605.13935#A2). We estimate [Eq. 4](https://arxiv.org/html/2605.13935#S3.E4) with Monte Carlo samples and use the same masking pattern for the current and reference models to reduce variance.
#### Why Distribution-Matching Alone Is Insufficient: The Role of a Learned $Z$.
DMPO (Zhu et al., [2026b](https://arxiv.org/html/2605.13935#bib.bib9)) also targets a reward-tilted distribution but does not learn $Z(\mathbf{x})$; it instead estimates the partition function with a softmax over the current rollout buffer. This makes the normalization rollout-local: once sampled completions begin to concentrate, both the estimated target weights and the resulting update are computed over the same narrowed support, and completions outside that support receive no gradient signal. As we discuss in App. [C.6](https://arxiv.org/html/2605.13935#A3.SS6), this leads to a self-reinforcing concentration over sampled completions.
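A Monte Carlo estimate of Eq. (4) might look like the sketch below. The callable `token_logprob_fn(corrupted, position, target)` is a hypothetical stand-in for the model's $\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})$, with the prompt assumed to be captured inside the callable; it is not an API of any particular library, and the constant model is only there to make the example runnable.

```python
import random

MASK = "[mask]"


def masked_surrogate_logprob(completion, token_logprob_fn, num_samples, rng):
    """Monte Carlo estimate of the masked surrogate in Eq. (4)."""
    length = len(completion)
    total = 0.0
    for _ in range(num_samples):
        # Mask level l ~ Uniform{1, ..., L}, then exactly l positions chosen uniformly.
        l = rng.randint(1, length)
        masked_positions = rng.sample(range(length), l)
        corrupted = list(completion)
        for i in masked_positions:
            corrupted[i] = MASK
        # Average log-probability of the original tokens at the masked positions.
        total += sum(
            token_logprob_fn(corrupted, i, completion[i]) for i in masked_positions
        ) / l
    return total / num_samples


def constant_model(corrupted, position, target):
    """Hypothetical model that assigns log-probability -1.5 to every masked token."""
    return -1.5


tokens = list("abcdef")
# Re-seeding the generator gives both models the same masking patterns.
est_theta = masked_surrogate_logprob(tokens, constant_model, 8, random.Random(0))
est_ref = masked_surrogate_logprob(tokens, constant_model, 8, random.Random(0))
print(est_theta, est_ref)
```

Sharing the seed is one simple way to reuse the same masks for the current and reference models, so that mask-induced noise is common to both terms of the log-ratio.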
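The rollout-local character of a buffer softmax can be seen in a small sketch. The function below is a generic softmax over $\beta r$ for the sampled completions, used here as an assumed stand-in for a rollout-local partition estimate; it is not DMPO's actual implementation, and the reward values are hypothetical.

```python
import math


def rollout_local_weights(rewards, beta):
    """Softmax of beta * r over the sampled rollouts only (numerically stabilized)."""
    scaled = [beta * r for r in rewards]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z_hat = sum(exps)  # normalizer defined only by what was sampled
    return [e / z_hat for e in exps]


# A buffer with mixed outcomes versus a buffer that has collapsed onto rewarded rollouts.
mixed = rollout_local_weights([1.0, 1.0, 0.0, 0.0, 1.0], beta=1.0)
collapsed = rollout_local_weights([1.0, 1.0, 1.0, 1.0, 1.0], beta=1.0)
print(mixed)
print(collapsed)
```

In the collapsed buffer every rollout receives the same weight whatever the value of $\beta$, and nothing in the computation refers to completions that were not sampled.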
## 4 Experimental Setup
### 4.1 Training Details
#### Implementation\.
Our base model is LLaDA-8B-Instruct (Nie et al., [2025](https://arxiv.org/html/2605.13935#bib.bib10)), a masked diffusion language model that generates text by iterative denoising over discrete token sequences. For RL post-training, we apply TraFL directly to the instruction-tuned checkpoint, without an additional SFT stage. We use LoRA (Hu et al., [2022](https://arxiv.org/html/2605.13935#bib.bib33)) adapters with rank $r=128$ and scaling factor $\alpha=128$ for parameter-efficient fine-tuning. The partition-function head $\log Z(\mathbf{x})$ is parameterized as a 2-layer MLP over mean-pooled prompt hidden states, with backbone features detached so that optimizing the partition head does not alter the backbone’s prompt representations. We train with a peak learning rate of $6\times10^{-5}$ under a cosine decay schedule with minimum learning rate $6\times10^{-6}$, and generate 5 rollouts per prompt. The masked surrogate log-probability $\log\hat{p}_{\theta}(\mathbf{y}\mid\mathbf{x})$ is estimated using 4 antithetic replicates, i.e., complementary mask pairs, with masking applied only to non-eos positions.
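The learning-rate schedule described above can be sketched as a standard cosine decay between the stated peak and minimum values. The total step count and the absence of a warmup phase are assumptions made for this illustration; the text does not specify either.

```python
import math

PEAK_LR = 6e-5  # peak learning rate stated in the text
MIN_LR = 6e-6   # minimum learning rate stated in the text


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=MIN_LR):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# total_steps = 1000 is a hypothetical value used only for this example.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```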
#### Training Datasets\.
We train TraFL on two task families, mathematical reasoning and code generation, with separate task-specific checkpoints for each family. For math, we train on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.13935#bib.bib12)) (7,473 problems) and MATH (Lightman et al., [2023](https://arxiv.org/html/2605.13935#bib.bib13)) (7,500 problems) as two independent training runs. For code, we train on a single mixture of filtered subsets from AceCode-89K (Zeng et al., [2025](https://arxiv.org/html/2605.13935#bib.bib14)) (11,937 examples) and KodCode-Light-RL-10K (Xu et al., [2025](https://arxiv.org/html/2605.13935#bib.bib29)) (2,695 examples). Filtering details are given in App. [E](https://arxiv.org/html/2605.13935#A5), and a comparison of training data used by TraFL and the baselines we evaluate against is provided in [Tab. 2](https://arxiv.org/html/2605.13935#A5.T2). Across both domains, rewards are binary: exact answer matching for math and execution-based test passing for code.
### 4.2 Evaluation Setup
We evaluate on both mathematical reasoning and code generation to test whether improvements transfer across symbolic reasoning and program synthesis\. We use the same prompting format for both training and evaluation; representative examples are shown in App\.[D](https://arxiv.org/html/2605.13935#A4)\.
#### Metrics\.
At evaluation time, we generate outputs with maximum lengths of 256 and 512 tokens and a block size of 32. We use the fastdLLM (Wu et al., [2025](https://arxiv.org/html/2605.13935#bib.bib30)) parallel decoding sampler with a low-confidence masking strategy: at each denoising step, tokens are selected for unmasking according to model confidence, and decoding proceeds under stochastic nucleus sampling with temperature $T$. We report results under top-$p=0.91$ sampling and compute Pass@$k$ from $n$ independently sampled generations per problem using the unbiased estimator of HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.13935#bib.bib16)):

$$\mathrm{Pass}@k\;=\;1-\prod_{i=n-c+1}^{n}\left(1-\frac{k}{i}\right),$$

where $c\leq n$ is the number of correct samples among the $n$ generations. This estimator has lower variance than evaluating exactly $k$ samples directly and approaches the true solve rate as $n$ increases. For math, correctness is determined by exact-match accuracy of the extracted boxed answer; for code, by execution against the provided test suite.
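The product form of the estimator is straightforward to implement per problem. The sketch below also includes the equivalent combinatorial form, $1-\binom{n-c}{k}/\binom{n}{k}$, as a cross-check; the function names are chosen for this example.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased Pass@k for one problem: n samples, c of them correct, budget k."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


def pass_at_k_comb(n, c, k):
    """Equivalent combinatorial form, used here only as a cross-check."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# With n = 16 samples and a single correct one, Pass@5 is 1 - 11/16.
print(pass_at_k(16, 1, 5))  # 0.3125
print(pass_at_k(16, 0, 5), pass_at_k(16, 16, 5))  # 0.0 1.0
```

A benchmark-level score is then the mean of this per-problem quantity over all problems.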
#### Benchmarks\.
For mathematical reasoning, we evaluate on GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.13935#bib.bib12)\)\(1,319 problems\), MATH\-500\(Lightmanet al\.,[2023](https://arxiv.org/html/2605.13935#bib.bib13)\)\(500 problems\), and the algebra split of Minerva Math\(Lewkowyczet al\.,[2022](https://arxiv.org/html/2605.13935#bib.bib27)\)\(1,187 problems\)\. For code generation, we evaluate on HumanEval\(Chenet al\.,[2021](https://arxiv.org/html/2605.13935#bib.bib16)\)\(164 problems\), MBPP\(Austinet al\.,[2021](https://arxiv.org/html/2605.13935#bib.bib17)\)\(500 problems\), and LiveCodeBench\(Jainet al\.,[2025](https://arxiv.org/html/2605.13935#bib.bib28)\)\(release\_v5\), which contains 880 problems spanning easy, medium, and hard difficulty levels\. Together, these benchmarks cover natural\-language mathematical reasoning and executable code synthesis, with Minerva Math and LiveCodeBench serving as held\-out evaluations beyond the datasets used for post\-training\.
#### Baselines\.
Our primary direct baselines are JustGRPO (Ni et al., [2026](https://arxiv.org/html/2605.13935#bib.bib19)) and ESPO (Ou et al., [2026](https://arxiv.org/html/2605.13935#bib.bib26)), two recent post-training methods for diffusion language models with usable public checkpoints. This allows us to evaluate all direct baselines under the same decoding protocol and evaluation pipeline. Other closely related methods, including d1 (Zhao et al., [2025](https://arxiv.org/html/2605.13935#bib.bib18)) and DMPO (Zhu et al., [2026b](https://arxiv.org/html/2605.13935#bib.bib9)), had not released public checkpoints at the time of our experiments. We therefore include their published numbers in App. [F](https://arxiv.org/html/2605.13935#A6) for context. For math, all evaluated methods are trained per-task on GSM8K and MATH separately and reported as matched per-task checkpoints. For code, TraFL trains on a filtered AceCode-89K and KodCode mixture (14,632 examples total); the public JustGRPO and ESPO coding checkpoints we evaluate were trained on substantially larger coding mixtures than TraFL, so coding comparisons are conservative with respect to TraFL. Tab. [2](https://arxiv.org/html/2605.13935#A5.T2) (App. [E](https://arxiv.org/html/2605.13935#A5)) summarizes the training data for each method.
## 5 Results
We evaluate TraFL across four axes: robustness to sampling budget and temperature, in\-distribution math and code performance, held\-out transfer, and correct\-solution diversity, where an LLM\-as\-a\-judge protocol directly tests whether gains reflect broader coverage of distinct solution strategies\.
### 5.1 Robustness to Sampling Budget and Temperature
TraFL improves over the base diffusion model across all sampling budgets and decoding temperatures we test, including the single-sample setting. We evaluate Pass@$k$ for $k\in\{1,2,4,8,16\}$ and $T\in\{0.3,0.6,0.9\}$, separating two effects that a single Pass@$k$ number can obscure: whether post-training sacrifices Pass@1 for multi-sample gains, and how performance scales as the sampling budget increases. [Fig. 1](https://arxiv.org/html/2605.13935#S5.F1) reports average Pass@$k$ across GSM8K, MATH-500, HumanEval, and MBPP.
For TraFL, Pass@$k$ increases steadily with the sampling budget at all three temperatures ([Fig. 1](https://arxiv.org/html/2605.13935#S5.F1)(a)). Higher-temperature decoding is especially beneficial as $k$ grows: $T=0.9$ gives the strongest performance at larger sampling budgets, followed by $T=0.6$ and $T=0.3$. This indicates that TraFL benefits from more exploratory sampling rather than saturating after only a few samples.
The gap to the base model is positive at every $k$ and $T$ ([Fig. 1](https://arxiv.org/html/2605.13935#S5.F1)(b)), including $k=1$. Thus, the multi-sample gains do not come from sacrificing single-sample accuracy. The largest margins occur at $T=0.9$, where the advantage grows with the sampling budget. The base model itself, however, is strongest at $T=0.6$. We therefore use $T=0.6$ for the head-to-head comparison against post-training methods: this gives the base model its best decoding regime and makes the comparison conservative. At this fixed temperature ([Fig. 1](https://arxiv.org/html/2605.13935#S5.F1)(c)), TraFL leads ESPO, JustGRPO, and the base model from Pass@1 through Pass@16. The shaded regions show variation across evaluation settings, but TraFL remains the strongest method as the sampling budget increases.
Figure 1: TraFL improves over the base model and strong prior post-training methods across sampling budgets and temperatures. (a) Average Pass@$k$ of TraFL on GSM8K, MATH-500, HumanEval, and MBPP for $T\in\{0.3,0.6,0.9\}$. (b) Pass@$k$ gap to LLaDA-8B-Instruct under matched decoding. (c) Baseline comparison at $T=0.6$. TraFL leads ESPO, JustGRPO, and the base model from Pass@1 through Pass@16. All results use $n=16$ samples and are averaged over generation lengths 256 and 512.
### 5.2 Pass@5 Results
TraFL is the only post-training method that improves over the base model in every benchmark-length setting we evaluate. [Fig. 2](https://arxiv.org/html/2605.13935#S5.F2) reports Pass@5 with $n=16$ samples at $T=0.6$ for maximum completion lengths 256 and 512, alongside average denoising steps. On math, the largest gain is on MATH-500 at length 256, where TraFL improves Pass@5 from 50.2 to 54.6, with smaller but positive gains on GSM8K at both lengths. On code, TraFL improves HumanEval from 53.2 to 54.8 at length 256 and from 53.7 to 55.7 at length 512, and improves MBPP from 59.1 to 62.1 and 59.2 to 62.0 at the two lengths. Averaged over all eight settings, this is a $+2.0$ point improvement.
The baselines show uneven behavior across benchmarks\. ESPO is the strongest prior method overall and improves several settings, especially MATH\-500 at length 512, but it drops below the base model on GSM8K\-512 and HumanEval\-256\. JustGRPO improves MBPP substantially but loses performance on GSM8K, HumanEval, and MATH\-500 at length 512\. TraFL is the only method with positive improvements in all eight settings, so its average gain reflects broad, consistent improvements rather than being concentrated on a single benchmark\.
Figure 2: TraFL improves Pass@5 across math and coding benchmarks at comparable denoising cost. Pass@5 on GSM8K, MATH-500, HumanEval, and MBPP at maximum completion lengths 256 and 512 ($n=16$, $T=0.6$). Error bars show standard error.

Average denoising steps add a complementary view of the learned policies. LLaDA-8B-Instruct, ESPO, and TraFL use comparable denoising budgets, while JustGRPO terminates much earlier without consistent Pass@5 gains. Shorter denoising is therefore not by itself a sign of better generation. One possible explanation is that JustGRPO's early termination may reflect a form of premature trajectory locking, where probability mass concentrates early around a narrower set of token choices and leaves fewer opportunities for later denoising steps to explore alternative correct completions. TraFL, by contrast, achieves consistent Pass@5 gains while keeping denoising step counts close to the base model and ESPO, suggesting it improves multi-sample coverage through a better allocation of probability mass across successful trajectories.
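For readers reproducing the Pass@5 numbers from $n=16$ samples, the following is a minimal sketch of the standard unbiased Pass@$k$ estimator of Chen et al. (2021), $1-\binom{n-c}{k}/\binom{n}{k}$, where $c$ is the number of correct samples. That this exact estimator is used here is an assumption, since this section does not spell out the formula; the function name `pass_at_k` and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 whenever any k-subset must contain a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 4 of 16 samples are correct.
print(pass_at_k(16, 4, 1))  # 0.25
print(pass_at_k(16, 4, 5))
```

A benchmark score would then be the mean of this per-problem estimate over all problems.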
### 5.3 Held-out Benchmark Evaluation
The gains of TraFL transfer to held-out benchmarks not used during post-training. We evaluate the same checkpoints on the algebra split of Minerva Math (Lewkowycz et al., [2022](https://arxiv.org/html/2605.13935#bib.bib27)), an out-of-distribution math benchmark relative to GSM8K and MATH, and LiveCodeBench (Jain et al., [2025](https://arxiv.org/html/2605.13935#bib.bib28)), a contamination-free coding benchmark with difficulty-stratified problems. All methods use the same decoding protocol at maximum completion lengths 256 and 512.

Figure 3: Gains transfer to held-out math and coding benchmarks. Pass@5 on Minerva Math (left) using the GSM8K-trained and MATH-trained checkpoints, and on LiveCodeBench (right) by difficulty split, both at maximum completion lengths 256 and 512.

On Minerva Math, TraFL stays above the base model at both lengths and substantially outperforms JustGRPO for both GSM8K-trained and MATH-trained checkpoints. ESPO is stronger at length 256 (71.2–71.4 Pass@5 vs. 69.5–70.7 for TraFL), but at length 512 TraFL achieves the best results, reaching 76.3 when trained on GSM8K and 75.8 when trained on MATH.
The pattern is stronger on LiveCodeBench\. TraFL is the best\-performing method on every difficulty split and at both generation lengths, improving over the base model from 53\.8 to 62\.6 overall at length 256 and from 61\.4 to 66\.3 at length 512\. The largest gains appear on medium and hard problems: at length 256, TraFL improves the base model by 7\.9 points on medium and 13\.6 points on hard, and outperforms JustGRPO by 12\.9 and 27\.6 points respectively\. ESPO and JustGRPO remain below the base model on the overall LiveCodeBench score at both lengths\. The consistent gains observed in the in\-distribution analysis therefore generalize beyond the post\-training datasets\.
### 5.4 LLM-as-a-Judge Diversity Evaluation
Higher Pass@$k$ does not by itself imply broader mode coverage: a method could improve Pass@$k$ by sharpening its scoring of a single mode rather than by spreading mass across distinct modes. We therefore directly test whether TraFL's gains correspond to coverage of distinct solution strategies, the property that Theorem [1](https://arxiv.org/html/2605.13935#Thmtheorem1) ties to trajectory diversity. We use an LLM-as-a-judge protocol with GPT-5-4 on MATH-500 (IID to our math post-training distribution) and LiveCodeBench (OOD to our coding post-training distribution). For each problem, the judge compares two sets of $16$ responses $(D_A, D_B)$ generated under matched decoding (max length 256, $T=0.6$), with answer-set order randomized, and returns the more diverse set (A or B) or a tie: $\mathcal{S}_{LLM}:(D_A,D_B)\rightarrow\{A_{wins},B_{wins},Tie\}$. We run pairwise comparisons between TraFL and each baseline. The full judge prompt is in App. [G](https://arxiv.org/html/2605.13935#A7).
The judge evaluates diversity in the underlying solution approach, not surface form. An answer set is preferred only when it contains a broader range of substantively different methods, such as different equation setups, proof strategies, or algorithmic decompositions. The prompt asks the judge to ignore wording, formatting, variable names, verbosity, and minor errors that do not change the core approach. We pre-specify three one-sided sign tests (Conover, [1999](https://arxiv.org/html/2605.13935#bib.bib36)) ($p$-value $\leq 0.05$) with hypotheses $(\mathbf{H_{alt}})$ of the form (a) $\mathrm{TraFL}_{wins} > \mathrm{Base}_{wins}$, (b) $\mathrm{ESPO}_{wins} > \mathrm{Base}_{wins}$, and (c) $\mathrm{TraFL}_{wins} > \mathrm{ESPO}_{wins}$ for each benchmark.
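As an illustration of the test used above, here is a minimal sketch of a one-sided exact sign test: ties are discarded, and the $p$-value is the upper tail of a $\mathrm{Binomial}(n, 1/2)$ distribution at the observed number of wins. The function name and the counts are hypothetical and are not taken from the paper's data.

```python
from math import comb


def one_sided_sign_test(wins: int, losses: int) -> float:
    """Exact p-value for H_alt: P(win) > 1/2, with ties already discarded."""
    n = wins + losses
    if n == 0:
        return 1.0  # no informative comparisons
    # P(X >= wins) for X ~ Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(wins, n + 1))
    return tail / 2 ** n


# Hypothetical counts: 8 problems favor method A, 2 favor method B.
print(one_sided_sign_test(8, 2))  # 0.0546875
```

With these toy counts the result sits just above the 0.05 threshold, which shows how conservative the exact test is at small sample sizes.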
Table 1: LLM-as-a-judge diversity evaluation. TraFL shows the strongest and most consistent diversity gains over correct solutions, where it is preferred over both the base model and ESPO on MATH-500 and LiveCodeBench. ESPO is not significantly more diverse than the base model on either benchmark. “For”/“Against” are the fractions of judged problems favoring each side; $p$-values are from one-sided exact sign tests over non-tied comparisons.

| Benchmark | $\mathrm{H_{alt}}$ | All samples: For (%) | Against (%) | Tie (%) | $p$ | Sig. | Correct solutions: For (%) | Against (%) | Tie (%) | $p$ | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MATH-500 | TraFL $>$ Base | 21.8 | 15.4 | 62.8 | $2.3\times10^{-2}$ | ✓ | 12.4 | 5.1 | 82.5 | $6.0\times10^{-3}$ | ✓ |
| MATH-500 | ESPO $>$ Base | 18.2 | 17.6 | 64.2 | $8.8\times10^{-1}$ | ✗ | 7.5 | 10.0 | 82.6 | $3.9\times10^{-1}$ | ✗ |
| MATH-500 | TraFL $>$ ESPO | 21.8 | 14.2 | 64.0 | $5.7\times10^{-3}$ | ✓ | 14.3 | 5.6 | 80.1 | $1.0\times10^{-3}$ | ✓ |
| LiveCodeBench | TraFL $>$ Base | 20.9 | 23.0 | 56.1 | $3.9\times10^{-1}$ | ✗ | 31.2 | 19.3 | 49.6 | $4.1\times10^{-5}$ | ✓ |
| LiveCodeBench | ESPO $>$ Base | 14.5 | 44.2 | 41.3 | $1.0\times10^{0}$ | ✗ | 31.1 | 34.0 | 34.9 | $7.9\times10^{-1}$ | ✗ |
| LiveCodeBench | TraFL $>$ ESPO | 45.9 | 12.2 | 41.9 | $<10^{-4}$ | ✓ | 38.3 | 26.4 | 35.3 | $7.8\times10^{-4}$ | ✓ |
[Tab. 1](https://arxiv.org/html/2605.13935#S5.T1) separates two notions of diversity. The all-samples setting measures the diversity of the model's sampled behavior overall, including both correct and incorrect responses. Under this view, TraFL shows a modest but significant gain over the base model on MATH-500, indicating that it explores a broader set of reasoning attempts in the math setting; the signal is not reliable on LiveCodeBench. We do not find support for $\mathrm{ESPO}_{wins} > \mathrm{Base}_{wins}$. Against ESPO, TraFL is preferred on both: 21.8% vs. 14.2% on MATH-500 ($p=0.0057$) and 45.9% vs. 12.2% on LiveCodeBench ($p<10^{-4}$).
The correct-solution setting asks the more targeted question, and is the one most directly tied to the trajectory-locking analysis: among responses that solve the problem, does TraFL cover more distinct valid approaches? We filter each side to its correct samples and keep a comparison only when both sides contain at least one correct solution. Here the signal is stronger and more consistent. Relative to the base model, TraFL is preferred on both MATH-500 ($p=0.006$) and LiveCodeBench ($p=4.1\times 10^{-5}$). Relative to ESPO, the same holds: 14.3% vs. 5.6% on MATH-500 ($p=0.001$) and 38.3% vs. 26.4% on LiveCodeBench ($p=7.8\times 10^{-4}$). The advantage is clearest and most consistent in the correct-solution setting.
Two patterns line up with the trajectory-locking story. ESPO is not significantly more diverse than the base model on either benchmark, despite its Pass@$k$ gains. This is what we would expect from reward-amplifying optimization without reference anchoring: it improves accuracy by concentrating mass on a narrow support, not by covering more modes. TraFL, by contrast, is preferred over ESPO most strongly in the correct-solution setting, where Theorem [1](https://arxiv.org/html/2605.13935#Thmtheorem1) directly applies. The reference-anchored target leaves room for multiple valid reasoning paths to retain significant probability mass, and the diversity gains follow.
## 6 Conclusion
We presented TraFL, a trajectory-balance post-training objective for diffusion language models that trains toward a reward-tilted target distribution rather than directly maximizing reward. The key insight is that reward-maximizing objectives are blind to path allocation across denoising trajectories, making sampled updates vulnerable to trajectory locking: a self-reinforcing collapse of path diversity that limits solution coverage. TraFL avoids this by anchoring to a reference model and learning a prompt-dependent partition function $Z_{\phi}(\mathbf{x})$ jointly with the policy. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated method that consistently improves over the base model, outperforming ESPO and JustGRPO on average with gains that grow at larger sampling budgets. The improvements transfer to held-out Minerva Math and LiveCodeBench, and an LLM-as-a-judge analysis indicates that the gains come with broader coverage of correct solution strategies rather than sharper scoring of a single mode. We hope these results encourage further exploration of distribution-matching objectives for diffusion language model post-training. We discuss limitations and broader impact of our work in App. [H](https://arxiv.org/html/2605.13935#A8).
## References
- \[1\]S\. Amari\(1998\)Natural gradient works efficiently in learning\.Neural computation10\(2\),pp\. 251–276\.Cited by:[§C\.3](https://arxiv.org/html/2605.13935#A3.SS3.p3.5)\.
- \[2\]J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le,et al\.\(2021\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1)\.
- \[3\]A\. Beck and M\. Teboulle\(2003\)Mirror descent and nonlinear projected subgradient methods for convex optimization\.Operations Research Letters31\(3\),pp\. 167–175\.Cited by:[§C\.3](https://arxiv.org/html/2605.13935#A3.SS3.p3.5)\.
- \[4\]Y\. Bengio, S\. Lahlou, T\. Deleu, E\. J\. Hu, M\. Tiwari, and E\. Bengio\(2023\)GFlowNet foundations\.Journal of Machine Learning Research24\(210\),pp\. 1–55\.External Links:[Link](http://jmlr.org/papers/v24/22-0364.html)Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p3.1)\.
- \[5\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba\(2021\)Evaluating large language models trained on code\.External Links:2107\.03374Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px1.p1.4),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1)\.
- \[6\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1)\.
- \[7\]W\. J\. Conover\(1999\)Practical nonparametric statistics\.3 edition,John Wiley & Sons,New York\.External Links:ISBN 978\-0\-471\-16068\-7Cited by:[§5\.4](https://arxiv.org/html/2605.13935#S5.SS4.p2.12)\.
- \[8\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang\(2025\-Sept\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 
633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§E\.1](https://arxiv.org/html/2605.13935#A5.SS1.SSS0.Px2.p1.1)\.
- \[9\]E\. J\. Hu, yelong shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px1.p1.6)\.
- \[10\]N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica\(2025\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=chfJJYC3iL)Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[item 4](https://arxiv.org/html/2605.13935#S1.I1.i4.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.13935#S5.SS3.p1.1)\.
- \[11\]V\. T\. Kunde, F\. Doudi, M\. Farahbakhsh, D\. Kalathil, K\. Narayanan, and J\. Chamberland\(2026\)Reinforcement learning for diffusion llms with entropy\-guided step selection and stepwise advantages\.arXiv preprint arXiv:2603\.12554\.Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p2.1)\.
- \[12\]A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, T\. Gutman\-Solo, Y\. Wu, B\. Neyshabur, G\. Gur\-Ari, and V\. Misra\(2022\)Solving quantitative reasoning problems with language models\.External Links:2206\.14858,[Link](https://arxiv.org/abs/2206.14858)Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[item 4](https://arxiv.org/html/2605.13935#S1.I1.i4.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1),[§5\.3](https://arxiv.org/html/2605.13935#S5.SS3.p1.1)\.
- \[13\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe\(2023\)Let’s verify step by step\.External Links:2305\.20050,[Link](https://arxiv.org/abs/2305.20050)Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px2.p1.1)\.
- \[14\]Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin\(2025\)Understanding r1\-zero\-like training: a critical perspective\.External Links:2503\.20783,[Link](https://arxiv.org/abs/2503.20783)Cited by:[Appendix B](https://arxiv.org/html/2605.13935#A2.SS0.SSS0.Px4.p1.3)\.
- \[15\]Z\. Ni, S\. Wang, Y\. Yue, T\. Yu, W\. Zhao, Y\. Hua, T\. Chen, J\. Song, C\. Yu, B\. Zheng, and G\. Huang\(2026\)The flexibility trap: why arbitrary order limits reasoning potential in diffusion language models\.arXiv preprint arXiv:2601\.15165\.Cited by:[Table 2](https://arxiv.org/html/2605.13935#A5.T2.9.2.1),[Appendix F](https://arxiv.org/html/2605.13935#A6.p1.1),[item 3](https://arxiv.org/html/2605.13935#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p2.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px3.p1.1)\.
- \[16\]S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li\(2025\)Large language diffusion models\.arXiv preprint arXiv:2502\.09992\.Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px1.p1.6)\.
- \[17\]J\. Ou, J\. Han, M\. Xu, S\. Xu, J\. Xie, S\. Ermon, Y\. Wu, and C\. Li\(2026\)Principled RL for diffusion LLMs emerges from a sequence\-level perspective\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=S5YeC9llIL)Cited by:[Table 2](https://arxiv.org/html/2605.13935#A5.T2.9.3.1),[Appendix F](https://arxiv.org/html/2605.13935#A6.p1.1),[item 3](https://arxiv.org/html/2605.13935#S1.I1.i3.p1.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p2.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px3.p1.1)\.
- \[18\]B\. Rozière, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. E\. Tan, Y\. Adi, J\. Liu, R\. Sauvestre, T\. Remez, J\. Rapin, A\. Kozhevnikov, I\. Evtimov, J\. Bitton, M\. Bhatt, C\. C\. Ferrer, A\. Grattafiori, W\. Xiong, A\. Défossez, J\. Copet, F\. Azhar, H\. Touvron, L\. Martin, N\. Usunier, T\. Scialom, and G\. Synnaeve\(2024\)Code llama: open foundation models for code\.External Links:2308\.12950,[Link](https://arxiv.org/abs/2308.12950)Cited by:[§E\.1](https://arxiv.org/html/2605.13935#A5.SS1.SSS0.Px1.p1.1)\.
- \[19\]X\. Tang, R\. Dolga, S\. Yoon, and I\. Bogunovic\(2025\)Wd1: weighted policy optimization for reasoning in diffusion language models\.arXiv preprint arXiv:2507\.08838\.Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]C\. Wang, P\. Rashidinejad, D\. Su, S\. Jiang, S\. Wang, S\. Zhao, C\. Zhou, S\. S\. Shen, F\. Chen, T\. Jaakkola,et al\.\(2025\)SPG: sandwiched policy gradient for masked diffusion language models\.arXiv preprint arXiv:2510\.09541\.Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]Y\. Wang, L\. Yang, B\. Li, Y\. Tian, K\. Shen, and M\. Wang\(2025\)Revolutionizing reinforcement learning framework for diffusion large language models\.arXiv preprint arXiv:2509\.06949\.Cited by:[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p2.1)\.
- \[22\]C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie\(2025\)Fast\-dllm: training\-free acceleration of diffusion llm by enabling kv cache and parallel decoding\.External Links:2505\.22618,[Link](https://arxiv.org/abs/2505.22618)Cited by:[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px1.p1.4)\.
- \[23\]Z\. Xu, Y\. Liu, Y\. Yin, M\. Zhou, and R\. Poovendran\(2025\)KodCode: a diverse, challenging, and verifiable synthetic dataset for coding\.arXiv\.External Links:2503\.02951,[Link](https://arxiv.org/abs/2503.02951)Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px2.p1.1)\.
- \[24\]J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong\(2025\)Dream 7b: diffusion large language models\.arXiv preprint arXiv:2508\.15487\.Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p1.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p1.1)\.
- \[25\]H\. Zeng, D\. Jiang, H\. Wang, P\. Nie, X\. Chen, and W\. Chen\(2025\)AceCoder: acing coder rl via automated test\-case synthesis\.ArXivabs/2207\.01780\.Cited by:[Appendix J](https://arxiv.org/html/2605.13935#A10.p1.1),[§4\.1](https://arxiv.org/html/2605.13935#S4.SS1.SSS0.Px2.p1.1)\.
- \[26\]S\. Zhao, D\. Gupta, Q\. Zheng, and A\. Grover\(2025\)D1: scaling reasoning in diffusion large language models via reinforcement learning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=7ZVRlBFuEv)Cited by:[Table 3](https://arxiv.org/html/2605.13935#A6.T3.3.1.1),[Appendix F](https://arxiv.org/html/2605.13935#A6.p2.1),[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px3.p1.1)\.
- \[27\]F\. Zhu, R\. Wang, S\. Nie, X\. Zhang, C\. Wu, J\. Hu, J\. Zhou, J\. Chen, Y\. Lin, J\. Wen,et al\.\(2025\)LLaDA 1\.5: variance\-reduced preference optimization for large language diffusion models\.arXiv preprint arXiv:2505\.19223\.Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p1.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px1.p1.1)\.
- \[28\]X\. Zhu, D\. Cheng, D\. Zhang, H\. Li, K\. Zhang, C\. Jiang, Y\. Sun, E\. Hua, Y\. Zuo, X\. Lv, Q\. Zhang, L\. Chen, F\. Shao, B\. Xue, Y\. Song, Z\. Yang, G\. Cui, N\. Ding, J\. Gao, X\. Liu, B\. Zhou, H\. Mei, and Z\. Lin\(2026\)FlowRL: matching reward distributions for LLM reasoning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=lObnTKbm9U)Cited by:[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px2.p1.1)\.
- \[29\]Y\. Zhu, W\. Guo, J\. Choi, P\. Molodyk, B\. Yuan, M\. Tao, and Y\. Chen\(2026\)Enhancing reasoning for diffusion llms via distribution matching policy optimization\.External Links:2510\.08233,[Link](https://arxiv.org/abs/2510.08233)Cited by:[§C\.6](https://arxiv.org/html/2605.13935#A3.SS6.p1.3),[Table 3](https://arxiv.org/html/2605.13935#A6.T3.4.2.1),[Appendix F](https://arxiv.org/html/2605.13935#A6.p2.1),[§1](https://arxiv.org/html/2605.13935#S1.p2.1),[§2](https://arxiv.org/html/2605.13935#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.13935#S3.SS2.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.13935#S4.SS2.SSS0.Px3.p1.1)\.
Overview of Appendix
1. A: Training Algorithm
2. B: Derivation of the Normalized Log-Probability Surrogate
3. C: Theory: Path Diversity, Mode Coverage, and Trajectory Balance
4. D: Prompt Templates and Sample Rollouts
5. E: Dataset Filtering and Training-Data Comparison
6. F: Additional Baseline Comparison Details
7. G: LLM-as-a-Judge Diversity Prompt
8. H: Limitations and Broader Impact
9. I: Training Compute
10. J: Dataset and Model Licenses
## Appendix A Training Algorithm
Algorithm [1](https://arxiv.org/html/2605.13935#algorithm1) summarizes the training procedure for TraFL. We use a fully online RL setup: at each step, we sample rollouts from the current policy, compute rewards and surrogate scores on the same batch, and perform exactly one gradient update.
Input: initial policy $p_{\theta_{\mathrm{init}}}$; frozen reference $p_{\mathrm{ref}}$; reward $r$; prompts $\mathcal{D}$; hyperparameters $\beta, G, K, M$

Output: trained policy $p_{\theta}$ and normalization predictor $Z_{\phi}$

1. $p_{\theta}\leftarrow p_{\theta_{\mathrm{init}}}$; initialize $Z_{\phi}$.
2. For step $=1$ to $M$ do:
   1. Sample prompt batch $\mathcal{D}_{b}\sim\mathcal{D}$.
   2. For each $\mathbf{x}\in\mathcal{D}_{b}$ do:
      1. Sample $\{\mathbf{y}^{(g)}\}_{g=1}^{G}\sim p_{\theta}(\cdot\mid\mathbf{x})$.
      2. Compute $r^{(g)}\leftarrow r(\mathbf{x},\mathbf{y}^{(g)})$ and $\widetilde{r}^{(g)}\leftarrow r^{(g)}-\frac{1}{G}\sum_{g'=1}^{G}r^{(g')}$.
      3. Estimate $\log\hat{p}_{\theta}(\mathbf{y}^{(g)}\mid\mathbf{x})$ and $\log\hat{p}_{\mathrm{ref}}(\mathbf{y}^{(g)}\mid\mathbf{x})$ with [Eq. 4](https://arxiv.org/html/2605.13935#S3.E4) using $K$ mask samples.
      4. Compute $\delta^{(g)}=\log\hat{p}_{\theta}(\mathbf{y}^{(g)}\mid\mathbf{x})-\log\hat{p}_{\mathrm{ref}}(\mathbf{y}^{(g)}\mid\mathbf{x})-\beta\,\widetilde{r}^{(g)}+\log Z_{\phi}(\mathbf{x})$.
   3. Update $(\theta,\phi)$ by minimizing $\mathcal{L}_{\mathrm{TraFL}}=\frac{1}{|\mathcal{D}_{b}|G}\sum_{\mathbf{x}\in\mathcal{D}_{b}}\sum_{g=1}^{G}\left(\delta^{(g)}\right)^{2}$.

Algorithm 1: TraFL training algorithm
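The residual in Algorithm 1 can be read as follows: setting $\delta^{(g)}=0$ and exponentiating gives $\hat{p}_{\theta}(\mathbf{y}\mid\mathbf{x})=\hat{p}_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})\exp(\beta\,\widetilde{r})/Z_{\phi}(\mathbf{x})$, so the loss vanishes exactly when the surrogate policy score matches the reward-tilted reference. The sketch below evaluates the residuals and the squared loss for a single prompt in plain Python. It is a numerical illustration only: the function name `trafl_loss` and all input values are hypothetical, and an actual implementation would compute these quantities in an autodiff framework so that gradients reach both $\theta$ and $\phi$.

```python
def trafl_loss(logp_theta, logp_ref, rewards, log_z, beta):
    """Squared trajectory-balance residual for one prompt with G rollouts.

    logp_theta, logp_ref: surrogate log-probabilities, one per rollout.
    rewards: raw rewards r^(g); log_z: scalar log Z_phi(x) for this prompt.
    Returns (loss, residuals).
    """
    G = len(rewards)
    mean_r = sum(rewards) / G
    centered = [r - mean_r for r in rewards]  # group-centered rewards
    deltas = [
        lp - lr - beta * rc + log_z
        for lp, lr, rc in zip(logp_theta, logp_ref, centered)
    ]
    return sum(d * d for d in deltas) / G, deltas


# Hypothetical rollout group: two correct and two incorrect completions.
rewards = [1.0, 0.0, 0.0, 1.0]
logp_ref = [-1.0, -2.0, -1.5, -0.5]

# Policy equal to the reference with log Z = 0: residuals are -beta * r~.
loss, deltas = trafl_loss(logp_ref, logp_ref, rewards, log_z=0.0, beta=2.0)
print(loss, deltas)
```

The batch loss in Algorithm 1 is the average of this per-prompt quantity over the prompts in $\mathcal{D}_{b}$.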
## Appendix B Derivation of the Normalized Log-Probability Surrogate
We derive [Eq. 4](https://arxiv.org/html/2605.13935#S3.E4) from a masked-reconstruction lower bound on a one-step reconstruction marginal, and explain the role of the $1/l$ factor.
#### Setup\.
Let $\mathbf{y}=(y_{1},\dots,y_{L})$ be a completion of length $L$. For each mask level $l\in\{1,\dots,L\}$, $q_{l}(\mathbf{y}_{l}\mid\mathbf{y})$ selects $l$ completion-token positions uniformly without replacement and replaces them with $\mathtt{[mask]}$, so $q_{l}(\mathbf{y}_{l}\mid\mathbf{y})=\binom{L}{l}^{-1}$ on consistent corruptions and zero elsewhere. Here $y_{i}^{l}$ denotes the $i$-th token of the corrupted sequence $\mathbf{y}_{l}$. The model predicts all masked positions in one forward pass, with

$$p_{\theta}(\mathbf{y}\mid\mathbf{y}_{l},\mathbf{x})=\prod_{i:\,y_{i}^{l}=\mathtt{[mask]}}p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x}).$$
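To make the corruption process $q_{l}$ concrete, the following is a minimal sketch of sampling a corrupted sequence at mask level $l$ and evaluating the probability $\binom{L}{l}^{-1}$ of any one consistent corruption. The helper names `corrupt` and `corruption_prob`, the string mask token, and the toy completion are illustrative choices rather than part of the paper's code.

```python
import random
from math import comb

MASK = "[mask]"


def corrupt(tokens, l, rng):
    """Mask l positions chosen uniformly without replacement."""
    if not 1 <= l <= len(tokens):
        raise ValueError("mask level must satisfy 1 <= l <= L")
    masked = set(rng.sample(range(len(tokens)), l))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]


def corruption_prob(L, l):
    """q_l(y_l | y) for any corruption of y with exactly l masked positions."""
    return 1 / comb(L, l)


rng = random.Random(0)
y = ["2", "+", "2", "=", "4"]   # toy completion, L = 5
y_l = corrupt(y, 2, rng)        # one draw at mask level l = 2
print(y_l, corruption_prob(len(y), 2))
```

Every one of the $\binom{5}{2}=10$ possible masks is equally likely here, which is what makes the expectation over $\mathbf{y}_{l}$ a plain average over position subsets.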
#### One\-step reconstruction lower bound\.
Define the one-step masked reconstruction marginal at level $l$ as

$$p_{\theta}^{(l)}(\mathbf{y}\mid\mathbf{x})=\mathbb{E}_{\mathbf{y}_{l}\sim q_{l}(\cdot\mid\mathbf{y})}\bigl[p_{\theta}(\mathbf{y}\mid\mathbf{y}_{l},\mathbf{x})\bigr].$$

Jensen's inequality gives

$$\log p_{\theta}^{(l)}(\mathbf{y}\mid\mathbf{x})\geq\mathbb{E}_{\mathbf{y}_{l}\sim q_{l}(\cdot\mid\mathbf{y})}\!\left[\sum_{i=1}^{L}\mathbf{1}[y_{i}^{l}=\mathtt{[mask]}]\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})\right].\tag{5}$$

The indicator picks out exactly $l$ terms, so this lower bound is the expected sum of $l$ masked-token log-probabilities. Averaging over $l\sim\mathrm{Uniform}(\{1,\dots,L\})$ gives

$$\mathbb{E}_{l}\!\left[\log p_{\theta}^{(l)}(\mathbf{y}\mid\mathbf{x})\right]\geq\mathbb{E}_{l}\,\mathbb{E}_{\mathbf{y}_{l}}\!\left[\sum_{i=1}^{L}\mathbf{1}[y_{i}^{l}=\mathtt{[mask]}]\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})\right].\tag{6}$$
#### Normalization by $1/l$.
[Eq. 4](https://arxiv.org/html/2605.13935#S3.E4) differs from the right-hand side of [Eq. 6](https://arxiv.org/html/2605.13935#A2.E6) by a factor of $1/l$ inside the inner bracket. Since the inner sum has $l$ nonzero terms, $\frac{1}{l}\sum_{i:\,y_{i}^{l}=\mathtt{[mask]}}\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})$ is the arithmetic mean of those $l$ log-probabilities. $\log\hat{p}_{\theta}(\mathbf{y}\mid\mathbf{x})$ is the expectation of this mean under the random corruption $\mathbf{y}_{l}$ and mask level $l$.
The unnormalized sum scales linearly with the random number of masked positions: at higher mask levels the sum has more (negative) terms, so its magnitude varies with $l$ even when per-token reconstruction quality does not. Substituting it into [Eq. 2](https://arxiv.org/html/2605.13935#S3.E2) would make the residual size depend on this count. The $1/l$ factor cancels this dependence, leaving a per-position score. Note that $\log\hat{p}_{\theta}$ is not a length-normalization of any sequence log-likelihood: the denominator is the random count $l$, not the completion length $L$, and the inner quantity is a mean over $l$ masked positions rather than a divided sequence log-probability.
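The effect of the $1/l$ factor can be checked numerically. The sketch below forms a Monte Carlo estimate over $K$ draws of a mask level and a corruption, with and without the normalization, using a toy model that assigns every masked token the same log-probability. The interface `token_logprob(i, masked)` is a hypothetical stand-in for $\log p_{\theta}(y_{i}\mid\mathbf{y}_{l},\mathbf{x})$, and drawing a fresh mask level for each of the $K$ samples is an assumption about how the expectation is estimated.

```python
import random


def surrogate_logprob(token_logprob, L, K, rng, normalize=True):
    """Monte Carlo estimate over K (mask level, corruption) draws."""
    total = 0.0
    for _ in range(K):
        l = rng.randint(1, L)                        # l ~ Uniform{1, ..., L}
        masked = frozenset(rng.sample(range(L), l))  # uniform, no replacement
        s = sum(token_logprob(i, masked) for i in masked)
        total += s / l if normalize else s           # mean vs. raw sum
    return total / K


def const_model(i, masked):
    """Toy model: every masked token has log-probability -0.5."""
    return -0.5


rng = random.Random(0)
print(surrogate_logprob(const_model, L=8, K=32, rng=rng))  # -0.5
print(surrogate_logprob(const_model, L=8, K=32, rng=rng, normalize=False))
```

With per-token quality held fixed, the normalized estimate is constant, while the unnormalized one equals $-0.5$ times the average sampled mask level and therefore fluctuates with the draws.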
#### Relation to length\-bias diagnostics in autoregressive RL\.
A similar concern about the role of normalization appears in autoregressive RL post-training. Dr. GRPO \[[14](https://arxiv.org/html/2605.13935#bib.bib32)\] shows that GRPO's per-completion loss normalization by $|o_{i}|$ introduces a length bias because $|o_{i}|$ varies across the rollout group, and replaces it with a constant. The $1/l$ factor here plays an analogous role: when a loss aggregates a variable-size set of token-level quantities, the magnitude of the aggregate inherits a count-dependence that is unrelated to per-token quality. The principle is the same in both cases (normalize so the score is independent of the set size), though the count being normalized differs (completion length there, mask count here).
## Appendix C Theory: Path Diversity, Mode Coverage, and Trajectory Balance
This appendix provides full proofs of the main-paper results and extended discussion. The argument in this section proceeds in three steps: (1) terminal reward is indifferent to path allocation ([Proposition 1](https://arxiv.org/html/2605.13935#Thmproposition1)); (2) sampled policy-gradient updates exploit this indifference through trajectory locking; (3) trajectory locking directly limits terminal mode coverage ([Theorem 1](https://arxiv.org/html/2605.13935#Thmtheorem1) and [Corollary 1](https://arxiv.org/html/2605.13935#Thmcorollary1)). The subsections below prove each step in turn and then explain how the trajectory-balance perspective motivates TraFL.
### C.1 Full proofs of the main-paper results
Proposition [1](https://arxiv.org/html/2605.13935#Thmproposition1) (Path-indifference of terminal reward). Fix a prompt $\mathbf{x}$ and a final completion $\mathbf{y}$. Let $\mathcal{T}(\mathbf{y})$ denote the set of denoising trajectories that terminate at $\mathbf{y}$. Any redistribution of probability mass among trajectories in $\mathcal{T}(\mathbf{y})$ that preserves $p_{\theta}(\mathbf{y}\mid\mathbf{x})$ leaves a terminal-reward objective unchanged.
###### Proof\.
The terminal-reward objective decomposes as

$$J(\mathbf{x})=\sum_{\mathbf{y}}p_{\theta}(\mathbf{y}\mid\mathbf{x})\,r(\mathbf{x},\mathbf{y}),\qquad p_{\theta}(\mathbf{y}\mid\mathbf{x})=\sum_{\tau\in\mathcal{T}(\mathbf{y})}p_{\theta}(\tau\mid\mathbf{x}).$$

The contribution of $\mathbf{y}$ to $J$ depends only on the scalar $p_{\theta}(\mathbf{y}\mid\mathbf{x})$. Therefore, any redistribution of mass among trajectories in $\mathcal{T}(\mathbf{y})$ that preserves this sum leaves the objective unchanged. ∎
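As a small numerical illustration of the proposition, the sketch below evaluates the terminal-reward objective for two trajectory allocations that share the same completion marginals. The completions, rewards, and probabilities are made-up values chosen only for illustration.

```python
# Toy setup: two completions, each reachable by several trajectories.
# All names and numbers here are illustrative assumptions.
reward = {"y1": 1.0, "y2": 0.0}

def objective(traj_probs):
    """J = sum_y p(y) r(y), where p(y) sums over trajectories ending at y."""
    return sum(reward[y] * sum(probs) for y, probs in traj_probs.items())

# Same p(y1) = 0.6 and p(y2) = 0.4 in both cases.
spread = {"y1": [0.2, 0.2, 0.2], "y2": [0.2, 0.2]}
locked = {"y1": [0.6, 0.0, 0.0], "y2": [0.4, 0.0]}  # one path per completion

print(objective(spread), objective(locked))
```

Both allocations give the same objective value up to floating-point rounding, so the objective alone cannot distinguish a spread policy from a locked one.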
Theorem [1](https://arxiv.org/html/2605.13935#Thmtheorem1) (Trajectory diversity is necessary for mode coverage). Let $\mathcal{T}$ be a random denoising trajectory taking values in a countable set $\Omega_{\mathcal{T}}$, and let $M=g(\mathcal{T})$ denote its terminal solution mode, where $g:\Omega_{\mathcal{T}}\to\Omega_{M}$ is a deterministic mapping. Then

$$|\mathrm{supp}(M)|\leq|\mathrm{supp}(\mathcal{T})|\qquad\text{and}\qquad H(M)\leq H(\mathcal{T}),$$

where $\mathrm{supp}(\cdot)$ denotes the support of a distribution and $H(\cdot)$ the Shannon entropy. In particular, terminal mode coverage cannot exceed trajectory diversity.
###### Proof\.
Because $M=g(\mathcal{T})$ is a deterministic function of $\mathcal{T}$, every mode $m\in\mathrm{supp}(M)$ must be the image of at least one trajectory $\tau\in\mathrm{supp}(\mathcal{T})$. Hence $\mathrm{supp}(M)=g(\mathrm{supp}(\mathcal{T}))$, which gives $|\mathrm{supp}(M)|\leq|\mathrm{supp}(\mathcal{T})|$.

For entropy, since $M$ is a deterministic function of $\mathcal{T}$, we have $H(M\mid\mathcal{T})=0$. By the chain rule,

$$H(\mathcal{T},M)=H(\mathcal{T})+H(M\mid\mathcal{T})=H(\mathcal{T}).$$

On the other hand, $H(\mathcal{T},M)=H(M)+H(\mathcal{T}\mid M)\geq H(M)$. Combining gives $H(M)\leq H(\mathcal{T})$. ∎
Corollary [1](https://arxiv.org/html/2605.13935#Thmcorollary1) (Trajectory locking upper-bounds mode coverage). If optimization reduces the trajectory distribution to a subset $\mathcal{S}\subseteq\Omega_{\mathcal{T}}$, then the set of terminal modes that can be covered is at most $g(\mathcal{S})\subseteq\Omega_{M}$. In particular, collapse to a single trajectory implies collapse to a single terminal mode.
###### Proof\.
Immediate from [Theorem 1](https://arxiv.org/html/2605.13935#Thmtheorem1), since $M=g(\mathcal{T})$. ∎
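The two inequalities and the corollary can be checked on a toy example. The sketch below pushes a hypothetical trajectory distribution through a deterministic map $g$ and compares supports and entropies. The trajectory labels, mode labels, and probabilities are assumptions made for illustration.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (nats) of a dict mapping outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def support(dist):
    """Outcomes that carry strictly positive probability."""
    return {k for k, p in dist.items() if p > 0}

def pushforward(traj_dist, g):
    """Distribution of M = g(T) induced by a distribution over trajectories."""
    mode_dist = defaultdict(float)
    for tau, p in traj_dist.items():
        mode_dist[g(tau)] += p
    return dict(mode_dist)

# Hypothetical example: five trajectories mapping onto three solution modes.
mode_of = {"t1": "A", "t2": "A", "t3": "B", "t4": "C", "t5": "C"}
traj_dist = {"t1": 0.4, "t2": 0.3, "t3": 0.2, "t4": 0.1, "t5": 0.0}
mode_dist = pushforward(traj_dist, mode_of.get)

# A locked policy puts all mass on one trajectory, so one mode is covered.
locked_modes = pushforward({"t1": 1.0}, mode_of.get)

print(len(support(mode_dist)), len(support(traj_dist)))
print(entropy(mode_dist), entropy(traj_dist))
print(locked_modes)
```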
### C.2 Sampled reward optimization and trajectory locking
[Proposition 1](https://arxiv.org/html/2605.13935#Thmproposition1) is an objective-level statement: terminal reward is flat over the entire subspace of trajectory distributions that assign the same mass to each completion. In practice, however, optimization proceeds through sampled updates, and this flatness becomes a liability.

For terminal reward, the policy-gradient update takes the form

$$\nabla J_{\mathrm{RL}}=\mathbb{E}_{\tau\sim p_{\theta}(\cdot\mid\mathbf{x})}\left[r(\mathbf{x},\mathbf{y}(\tau))\,\nabla\log p_{\theta}(\tau\mid\mathbf{x})\right].\tag{7}$$

Only sampled trajectories receive direct gradient signal. Because $r(\mathbf{x},\mathbf{y})$ is constant across all trajectories reaching the same completion $\mathbf{y}$, the gradient provides no force to spread mass across different paths to the same answer. Instead, if two trajectories both reach a rewarding completion but one is sampled slightly more often early in training, it receives more positive updates, becomes more likely, and is sampled even more in subsequent steps. This self-reinforcing feedback loop, *trajectory locking*, concentrates probability mass onto a narrow subset of paths despite the underlying objective being indifferent among them.

Trajectory locking is an optimization phenomenon, not a formal consequence of [Proposition 1](https://arxiv.org/html/2605.13935#Thmproposition1). The proposition establishes the flat landscape; locking describes how stochastic gradient ascent can move along that landscape toward a vertex or low-dimensional face. We therefore use it as a mechanism explaining why sampled reward maximization may reduce coverage in practice, not as a theorem that exact expected policy-gradient updates must always collapse.
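The feedback loop can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the paper's training setup: a softmax policy over eight hypothetical paths that all earn reward 1, updated by single-sample REINFORCE without a baseline. The step count, learning rate, and seed are arbitrary.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(q):
    return -sum(p * math.log(p) for p in q if p > 0)

def sampled_reinforce(num_paths=8, steps=2000, lr=0.5, seed=0):
    """Single-sample REINFORCE on path logits; every path earns reward 1."""
    rng = random.Random(seed)
    logits = [0.0] * num_paths
    for _ in range(steps):
        q = softmax(logits)
        i = rng.choices(range(num_paths), weights=q)[0]
        reward = 1.0  # identical for all paths, so the objective is flat
        # Gradient of log q_i with respect to the logits is onehot(i) - q.
        for j in range(num_paths):
            logits[j] += lr * reward * ((1.0 if j == i else 0.0) - q[j])
    return softmax(logits)

q_final = sampled_reinforce()
print("final entropy:", entropy(q_final), "uniform entropy:", math.log(8))
```

The expected update is zero here because all rewards are equal, yet the sampled updates still tend to push the distribution away from uniform, which matches the distinction drawn above between the flat objective and the sampled dynamics.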
### C.3 A generalized simplex view of trajectory concentration
The discussion above concerns strict terminal reward, under which all trajectories leading to the same terminal completion receive the same reward\. The following analysis considers a more general setting in which optimization induces effective trajectory\-level scores, for example through sampling asymmetries, surrogate objectives, or path\-dependent credit assignment\. It should therefore be viewed as an auxiliary lens on trajectory concentration rather than a literal description of pure terminal reward\.
Fix a completion $\mathbf{y}$, and let

$$\mathcal{T}(\mathbf{y})=\{\tau_{1},\dots,\tau_{K}\}.$$

Define the conditional trajectory distribution

$$q_{i}:=p_{\theta}(\tau_{i}\mid\mathbf{y},\mathbf{x}),\qquad q=(q_{1},\dots,q_{K})\in\Delta^{K},$$

and suppose each trajectory $\tau_{i}$ is assigned an effective scalar score $r_{i}$. Consider the linear objective

$$J(q)=\sum_{i=1}^{K}q_{i}r_{i}.\tag{8}$$

Because [Eq. 8](https://arxiv.org/html/2605.13935#A3.E8) is linear in $q$, its maximizers lie on extreme points of the simplex when one score dominates, and on a face of the simplex when several scores tie. Thus the objective itself does not prefer dispersed trajectory distributions.
Now parameterize $q$ by logits $\phi$ through

$$q_{i}=\frac{e^{\phi_{i}}}{\sum_{j}e^{\phi_{j}}}.$$

The Euclidean gradient of $J$ with respect to $\phi$ is

$$\frac{\partial J}{\partial\phi_{i}}=q_{i}(r_{i}-\bar{r}),\qquad\bar{r}=\sum_{j}q_{j}r_{j}.\tag{9}$$

[Eq. 9](https://arxiv.org/html/2605.13935#A3.E9) already shows that higher-than-average reward increases the corresponding logit. To obtain the standard replicator dynamics in probability space, however, one must specify a particular continuous-time flow in logit space, and this flow is *not* the ordinary Euclidean gradient flow on the logits. Instead, consider *additive payoff dynamics*,

$$\dot{\phi}_{i}=r_{i}.\tag{10}$$

This choice corresponds to mirror-descent-style updates on the simplex, and equivalently to a natural-gradient geometry in probability space \[[1](https://arxiv.org/html/2605.13935#bib.bib34),[3](https://arxiv.org/html/2605.13935#bib.bib35)\]. Under this assumed logit flow, the induced dynamics of $q_{i}$ are obtained by the chain rule:

$$\dot{q}_{i}=\sum_{j}\frac{\partial q_{i}}{\partial\phi_{j}}\,\dot{\phi}_{j},\qquad\frac{\partial q_{i}}{\partial\phi_{j}}=q_{i}(\delta_{ij}-q_{j}).$$

Substituting $\dot{\phi}_{j}=r_{j}$ gives

$$\dot{q}_{i}=\sum_{j}q_{i}(\delta_{ij}-q_{j})\,r_{j}=q_{i}\bigl(r_{i}-\bar{r}\bigr),$$

which is the replicator equation:

$$\dot{q}_{i}=q_{i}(r_{i}-\bar{r}).\tag{11}$$

More generally, under $\dot{\phi}_{i}=\eta r_{i}$, the induced flow $\dot{q}_{i}=\eta\,q_{i}(r_{i}-\bar{r})$ is proportional to the replicator vector field and differs only by a rescaling of time. The solution to [Eq. 11](https://arxiv.org/html/2605.13935#A3.E11) is

$$q_{i}(t)=\frac{q_{i}(0)\,e^{r_{i}t}}{\sum_{j=1}^{K}q_{j}(0)\,e^{r_{j}t}},\tag{12}$$

so relative mass evolves as

$$\frac{q_{i}(t)}{q_{j}(t)}=\frac{q_{i}(0)}{q_{j}(0)}\,e^{(r_{i}-r_{j})t}.$$

When $r_{i}>r_{j}$, this ratio grows exponentially, driving the conditional trajectory distribution toward a simplex vertex or a low-dimensional face. This makes precise the sense in which reward-style optimization can be mode-seeking when effective path-level growth rates differ.
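The closed-form solution in Eq. 12 can be checked against a direct numerical integration of Eq. 11. The sketch below uses forward Euler with a small step. The initial distribution, scores, and step size are illustrative assumptions.

```python
import math

def replicator_closed_form(q0, r, t):
    """Eq. 12: q_i(t) is proportional to q_i(0) * exp(r_i * t)."""
    w = [q * math.exp(ri * t) for q, ri in zip(q0, r)]
    z = sum(w)
    return [v / z for v in w]

def replicator_euler(q0, r, t, dt=1e-3):
    """Forward-Euler integration of Eq. 11: dq_i/dt = q_i (r_i - r_bar)."""
    q = list(q0)
    for _ in range(int(round(t / dt))):
        r_bar = sum(qi * ri for qi, ri in zip(q, r))
        q = [qi + dt * qi * (ri - r_bar) for qi, ri in zip(q, r)]
    return q

q0 = [0.25, 0.25, 0.25, 0.25]   # illustrative initial distribution
r = [1.0, 0.9, 0.5, 0.0]        # illustrative effective path scores

for t in (0.0, 5.0, 50.0):
    print(t, [round(v, 3) for v in replicator_closed_form(q0, r, t)])
```

Even a score gap as small as the one between the first two paths eventually concentrates the mass on the top-scoring path, only more slowly than a large gap does.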
#### Remark on standard Euclidean gradient ascent\.
For completeness, we note that ordinary Euclidean gradient ascent on the logits, $\dot{\phi}_{i}\propto\partial J/\partial\phi_{i}=q_{i}(r_{i}-\bar{r})$, instead induces the cubic dynamic

$$\dot{q}_{i}\propto q_{i}\!\left[q_{i}(r_{i}-\bar{r})-\sum_{j}q_{j}^{2}(r_{j}-\bar{r})\right],$$

which does not admit the closed-form exponential reweighting in [Eq. 12](https://arxiv.org/html/2605.13935#A3.E12). We work with the replicator form throughout because it admits a clean exponential solution and makes the mode-seeking geometry of reward-style optimization transparent. The qualitative concentration conclusion (that higher-scoring trajectories acquire exponentially more mass relative to lower-scoring ones) is robust to this choice: both flows drive $q$ toward simplex vertices or low-dimensional faces when scores differ. The replicator equation should therefore be viewed as an idealized simplification of the actual logit dynamics, chosen for analytical tractability rather than as the literal flow induced by standard policy-gradient updates.
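The two vector fields can be compared numerically by pushing each logit flow through the softmax Jacobian. The sketch below does this at one hypothetical point. The logits and scores are arbitrary illustrative values.

```python
import math

def softmax(phi):
    m = max(phi)
    e = [math.exp(v - m) for v in phi]
    z = sum(e)
    return [v / z for v in e]

def q_dot(phi, phi_dot):
    """Chain rule: dq_i/dt = sum_j q_i (delta_ij - q_j) dphi_j/dt."""
    q = softmax(phi)
    mean = sum(qj * dj for qj, dj in zip(q, phi_dot))
    return [qi * (di - mean) for qi, di in zip(q, phi_dot)]

phi = [0.3, -0.2, 0.1]   # illustrative logits
r = [1.0, 0.5, 0.0]      # illustrative scores
q = softmax(phi)
r_bar = sum(qi * ri for qi, ri in zip(q, r))

# Additive payoff dynamics (Eq. 10) give the replicator field (Eq. 11).
replicator = q_dot(phi, r)
expected_replicator = [qi * (ri - r_bar) for qi, ri in zip(q, r)]

# Euclidean gradient flow on the logits uses Eq. 9 as the logit velocity.
grad = [qi * (ri - r_bar) for qi, ri in zip(q, r)]
euclidean = q_dot(phi, grad)
second_moment = sum(qj ** 2 * (rj - r_bar) for qj, rj in zip(q, r))
cubic = [qi * (qi * (ri - r_bar) - second_moment) for qi, ri in zip(q, r)]

print(replicator)
print(euclidean)
```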
### C.4 An idealized trajectory-balance interpretation
This subsection describes an idealized exact trajectory\-balance view that motivates our design\. TraFL is inspired by this principle, but in practice uses a diffusion\-compatible sequence\-level surrogate and a learned prompt\-dependent normalization term rather than exact trajectory probabilities\.
LetPF\(τ\)P\_\{F\}\(\\tau\)denote a forward trajectory distribution and letPB\(τ∣𝐲,𝐱\)P\_\{B\}\(\\tau\\mid\\mathbf\{y\},\\mathbf\{x\}\)denote a backward distribution over trajectories conditioned on a terminal completion𝐲\\mathbf\{y\}\. In an exact trajectory\-balance formulation, one enforces
logZ\(𝐱\)\+logPF\(τ∣𝐱\)=βr\(𝐱,𝐲\)\+logPB\(τ∣𝐲,𝐱\),∀τ∈𝒯\(𝐲\)\.\\log Z\(\\mathbf\{x\}\)\+\\log P\_\{F\}\(\\tau\\mid\\mathbf\{x\}\)=\\beta r\(\\mathbf\{x\},\\mathbf\{y\}\)\+\\log P\_\{B\}\(\\tau\\mid\\mathbf\{y\},\\mathbf\{x\}\),\\qquad\\forall\\tau\\in\\mathcal\{T\}\(\\mathbf\{y\}\)\.\(13\)Equivalently,
PF\(τ∣𝐱\)=1Z\(𝐱\)exp\(βr\(𝐱,𝐲\)\)PB\(τ∣𝐲,𝐱\)\.P\_\{F\}\(\\tau\\mid\\mathbf\{x\}\)=\\frac\{1\}\{Z\(\\mathbf\{x\}\)\}\\exp\\\!\\bigl\(\\beta r\(\\mathbf\{x\},\\mathbf\{y\}\)\\bigr\)\\,P\_\{B\}\(\\tau\\mid\\mathbf\{y\},\\mathbf\{x\}\)\.\(14\)Summing overτ∈𝒯\(𝐲\)\\tau\\in\\mathcal\{T\}\(\\mathbf\{y\}\)yields
PF\(𝐲∣𝐱\)=1Z\(𝐱\)exp\(βr\(𝐱,𝐲\)\)\.P\_\{F\}\(\\mathbf\{y\}\\mid\\mathbf\{x\}\)=\\frac\{1\}\{Z\(\\mathbf\{x\}\)\}\\exp\\\!\\bigl\(\\beta r\(\\mathbf\{x\},\\mathbf\{y\}\)\\bigr\)\.\(15\)Therefore, reward determines only the*total*mass assigned to terminal completion𝐲\\mathbf\{y\}\. Conditioned on𝐲\\mathbf\{y\}, we obtain
PF\(τ∣𝐲,𝐱\)=PB\(τ∣𝐲,𝐱\)\.P\_\{F\}\(\\tau\\mid\\mathbf\{y\},\\mathbf\{x\}\)=P\_\{B\}\(\\tau\\mid\\mathbf\{y\},\\mathbf\{x\}\)\.\(16\)This idealized factorization separates terminal weighting from within\-terminal path allocation\. Unlike direct reward amplification, it does not repeatedly favor one trajectory over another merely because they terminate at the same rewarding output\.
We stress again that [Eqs. 13](https://arxiv.org/html/2605.13935#A3.E13), [14](https://arxiv.org/html/2605.13935#A3.E14), [15](https://arxiv.org/html/2605.13935#A3.E15) and [16](https://arxiv.org/html/2605.13935#A3.E16) describe an exact idealized trajectory-balance picture. They are intended as conceptual motivation for TraFL, not as exact identities satisfied by the practical surrogate objective in [Sec. 3](https://arxiv.org/html/2605.13935#S3).
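The factorization in Eqs. 14 to 16 can be verified on a toy problem. The sketch below builds the forward distribution from a hypothetical backward distribution and reward, then recovers the completion marginal and the conditional path distribution. All labels and numbers are illustrative assumptions, and the example covers only the idealized picture, not the practical TraFL loss.

```python
import math

beta = 1.0
# Hypothetical toy problem: one reward per completion and a backward
# distribution P_B(tau | y) over the trajectories that end at y.
reward = {"y1": 1.0, "y2": 0.0}
P_B = {"y1": {"a": 0.5, "b": 0.3, "c": 0.2}, "y2": {"d": 0.6, "e": 0.4}}

# Normalizer: sum over completions of exp(beta * r), since each P_B sums to 1.
Z = sum(math.exp(beta * reward[y]) for y in reward)

# Eq. 14: P_F(tau) = exp(beta * r(y)) * P_B(tau | y) / Z
P_F = {
    y: {tau: math.exp(beta * reward[y]) * pb / Z for tau, pb in paths.items()}
    for y, paths in P_B.items()
}

# Eq. 15: marginal over the trajectories that end at y.
P_F_y = {y: sum(paths.values()) for y, paths in P_F.items()}

# Eq. 16: conditional path distribution given y.
P_F_cond = {
    y: {tau: p / P_F_y[y] for tau, p in paths.items()}
    for y, paths in P_F.items()
}

print(P_F_y)
print(P_F_cond)
```

The reward changes only the completion marginal, while the split across paths within each completion reproduces the backward distribution.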
### C.5 Relation to the practical TraFL objective
The sections above build the theoretical case, but TraFL's practical objective differs from the idealized picture in three important ways.
Reference model instead of explicit backward model. The idealized TB factorization in [Sec. C.4](https://arxiv.org/html/2605.13935#A3.SS4) uses a backward distribution $P_{B}(\tau\mid\mathbf{y},\mathbf{x})$ to govern within-completion path allocation. In TraFL, this role is played by the frozen reference model: the reward-tilted target

$$p^{*}(\tau\mid\mathbf{x})\propto p_{\mathrm{ref}}(\tau\mid\mathbf{x})\exp\!\bigl(\beta r(\mathbf{x},\mathbf{y})\bigr)$$

factors, after marginalization over $\tau\in\mathcal{T}(\mathbf{y})$, into
1. (i) a completion-level marginal proportional to $\exp\!\bigl(\beta r(\mathbf{x},\mathbf{y})\bigr)\,p_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})$, which determines how much total mass each completion receives, jointly through the reward tilt and the reference model's marginal over completions; and
2. (ii) a within-completion conditional path distribution $p_{\mathrm{ref}}(\tau\mid\mathbf{y},\mathbf{x})$, which governs how that mass is spread across the denoising trajectories that reach the same completion.

This is precisely the separation that policy-gradient reward maximization lacks: reward acts only on (i), while (ii) is left undefined by the objective and is determined implicitly by sampling, enabling trajectory locking.
Completion-level surrogate instead of exact trajectory scores. Diffusion language models do not expose exact trajectory log-probabilities, so the practical residual in [Eq. 2](https://arxiv.org/html/2605.13935#S3.E2) uses the masked-token surrogate $\log\hat{p}_{\theta}(\mathbf{y}\mid\mathbf{x})$ from [Eq. 4](https://arxiv.org/html/2605.13935#S3.E4) together with a learned prompt-dependent predictor for $\log Z(\mathbf{x})$. As a result, the implemented loss compares current and reference models at the level of sampled completions, not at the level of latent denoising paths. The practical objective is therefore inspired by the idealized trajectory-balance equation and inherits its intuition, but it does not literally enforce equality of within-completion path distributions.

Centered rewards as variance reduction. In the implementation, we replace $r(\mathbf{x},\mathbf{y})$ by rewards centered within each prompt group. Because the centering term depends on the sampled group rather than only on $\mathbf{x}$, it cannot in general be absorbed exactly into the deterministic normalization term $Z(\mathbf{x})$. This centering should therefore be interpreted as a stochastic variance-reduction heuristic for optimization, not as an exact rewriting of the idealized target distribution.
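A minimal sketch of group-wise centering follows. It assumes that "centered" means subtracting the group mean, which the text does not spell out, and the function name `center_rewards` is hypothetical.

```python
def center_rewards(rewards):
    """Subtract the group mean from each reward in one prompt group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# The offset depends on which completions happened to be sampled.
mixed = center_rewards([1.0, 0.0, 0.0, 1.0])     # two of four rollouts pass
all_pass = center_rewards([1.0, 1.0, 1.0, 1.0])  # every rollout passes

print(mixed)
print(all_pass)
```

Under this assumption, a group in which every rollout receives the same binary reward is centered to all zeros, so it carries no relative signal between completions.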
### C.6 Why DMPO's batch-softmax $\hat{Z}$ still induces trajectory locking
DMPO \[[29](https://arxiv.org/html/2605.13935#bib.bib9)\] targets the same reward-tilted distribution $p^{*}$ as TraFL but estimates $Z(\mathbf{x})$ without a learned network, instead normalizing importance weights by a softmax over the current training buffer. Concretely, for each rollout $o^{(n)}$ it defines

$$w^{(n)}=\frac{\exp(\ell^{(n)})}{\sum_{k}\exp(\ell^{(k)})},\qquad\ell^{(n)}=\frac{r^{(n)}}{\alpha}+\log\frac{p_{\mathrm{ref}}(o^{(n)}\mid\mathbf{x})}{p_{v}(o^{(n)}\mid\mathbf{x})},$$

and minimizes $\sum_{n}w^{(n)}\mathcal{L}_{\mathrm{DCE}}(o^{(n)})$. This batch-softmax step acts as an empirical estimator $\hat{Z}(\mathbf{x})$ of the normalizer $Z(\mathbf{x})$.

Treating the weights $w^{(n)}$ as stop-gradient targets (as in DMPO's implementation), the WDCE objective drives the buffer logits $\phi^{(n)}=\log p_{\theta}(o^{(n)}\mid\mathbf{x})$ according to additive payoff dynamics

$$\dot{\phi}^{(n)}\propto w^{(n)}.$$

This is precisely the additive payoff regime analyzed in [Sec. C.3](https://arxiv.org/html/2605.13935#A3.SS3): under additive logit dynamics, the chain rule through the softmax Jacobian yields the replicator equation on the conditional buffer distribution,

$$\dot{q}^{(n)}=q^{(n)}\bigl(w^{(n)}-\bar{w}\bigr),\qquad q^{(n)}=\frac{p_{\theta}(o^{(n)}\mid\mathbf{x})}{\sum_{k}p_{\theta}(o^{(k)}\mid\mathbf{x})},$$

with solution

$$q^{(n)}(t)=\frac{q^{(n)}(0)\,e^{w^{(n)}t}}{\sum_{k}q^{(k)}(0)\,e^{w^{(k)}t}}.$$

The highest-weight completion therefore acquires exponentially more mass relative to all others in the buffer, and any completion not present in the buffer receives zero gradient signal. As the policy concentrates, future buffers are sampled from an increasingly narrow distribution, so unseen modes remain unseen. Unlike TraFL's learned $Z_{\phi}$, which receives gradient signal at every step regardless of which completions are sampled, the batch-softmax $\hat{Z}$ degrades to a point estimate over the locked support, removing the correction signal that would otherwise counteract collapse.
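The batch-softmax weights defined above can be sketched as follows. This is an illustration of the formula as written in this section, not DMPO's actual code. The function name `dmpo_weights` and all buffer values are made up.

```python
import math

def dmpo_weights(rewards, log_ref, log_sampler, alpha):
    """Batch-softmax weights w^(n) over one buffer of rollouts."""
    logits = [
        r / alpha + lref - lsamp
        for r, lref, lsamp in zip(rewards, log_ref, log_sampler)
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)    # normalizer computed from this buffer only
    return [e / z for e in exps]

# Hypothetical buffer of three rollouts for one prompt.
rewards = [1.0, 1.0, 0.0]
log_ref = [-12.0, -15.0, -11.0]      # log p_ref(o | x), made-up values
log_sampler = [-10.0, -15.0, -11.0]  # log p_v(o | x), made-up values
w = dmpo_weights(rewards, log_ref, log_sampler, alpha=0.5)

print(w)
```

Because the normalizer is computed from the buffer alone, a completion that was never sampled contributes nothing to it, which is the limitation the paragraph above describes.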
## Appendix D Prompt Templates and Sample Rollouts
Below, we present two representative sampled trajectories, one from the coding domain and one from mathematical reasoning\. Each trajectory consists of a task prompt and a completion sampled from the current policy\. We use the same prompting format for both training and evaluation: all task instructions and response\-format constraints are specified directly in the user prompt, with no separate system prompt\.
Rollout Example: User Prompt and Sample Model Output

\[User Prompt\] You are a coding expert. You will be given a coding problem to solve. Solve it step by step. Ensure you wrap the answer in ‘‘‘python‘‘‘.

Response Format

<reasoning\>Your reasoning here</reasoning\>

<answer\>‘‘‘python…‘‘‘</answer\>

\[Task Prompt\] Given a list of subject IDs and a list of trial types, create a function that generates a dataset as a dictionary. Each subject ID is a key, and its value is another dictionary mapping trial types to strings formatted as Subject \{subject\_id\}, Trial \{trial\_type\}.

\[Model Output\]

<reasoning\> To solve this problem: (1) iterate over subject IDs; (2) create a nested dictionary for each subject; (3) map each trial type to the required formatted string; and (4) return the completed dataset. </reasoning\>

<answer\>

    def generate_dataset(subject_ids, trial_types):
        dataset = {}
        for subject_id in subject_ids:
            dataset[subject_id] = {}
            for trial_type in trial_types:
                dataset[subject_id][trial_type] = (
                    f"Subject {subject_id}, Trial {trial_type}"
                )
        return dataset

</answer\>
Rollout Example: User Prompt and Sample Model Output

\[User Prompt\] You are a math expert. You will be given a question to solve. Solve it step by step. Wrap the final answer in \\boxed\{\} inside the <answer\> tags.

Response Format

<reasoning\>Your reasoning here</reasoning\>

<answer\>\\boxed\{...\}</answer\>

\[Task Prompt\] Mark makes custom dog beds. A bed for a Rottweiler takes 8 pounds of stuffing, a bed for a chihuahua takes 2 pounds of stuffing, and a bed for a collie takes the average amount of stuffing between the first two kinds of beds. How many pounds of stuffing does Mark need to make 4 chihuahua beds and 3 collie beds?

\[Model Output\]

<reasoning\> First, a collie bed uses the average amount of stuffing between a Rottweiler bed and a chihuahua bed, so it needs $\frac{8+2}{2}=5$ pounds. Next, 4 chihuahua beds require $4\times 2=8$ pounds, and 3 collie beds require $3\times 5=15$ pounds. Therefore, the total amount of stuffing needed is $8+15=23$. </reasoning\>

<answer\> $\boxed{23}$ </answer\>
## Appendix E Dataset Filtering and Training-Data Comparison
This appendix gives the filtering procedures used to construct TraFL’s coding training set and summarizes how the training data of TraFL relates to the data used by the public baseline checkpoints we compare against\.
### E.1 Filtering Procedures
#### AceCode\-89K\.
We use existing CodeLlama-Instruct \[[18](https://arxiv.org/html/2605.13935#bib.bib15)\] rollout metadata distributed with AceCode-89K to estimate each problem's reward uncertainty under this proxy model. We retain problems with substantial reward uncertainty, removing both trivially easy problems (uniformly passed) and uniformly failed problems, for which binary feedback would not be informative. This yields 11,937 examples.
#### KodCode\-Light\-RL\-10K\.
For KodCode, we compute per-example pass-rate statistics from GPT-4o and DeepSeek-R1 \[[8](https://arxiv.org/html/2605.13935#bib.bib31)\] rollouts. We identify the proxy that is weaker on average, retain examples with high pass-sequence variance under that weaker proxy, and then keep the lower-pass-rate portion of this diverse subset. This focuses training on harder coding problems while still admitting informative binary feedback. The resulting subset contains 2,695 examples.
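The two-stage filter described above could look roughly like the following sketch. It is not the authors' pipeline: the data layout, the function name `filter_by_uncertainty`, the reading of "pass-sequence variance" as the variance of 0/1 outcomes, and both thresholds are assumptions made for illustration.

```python
from statistics import mean, pvariance

def filter_by_uncertainty(problems, min_var=0.1, max_pass_rate=0.5):
    """Keep problems with informative and relatively hard proxy outcomes.

    `problems` maps a problem id to a list of 0/1 pass outcomes from the
    proxy model. Both thresholds are hypothetical placeholders.
    """
    kept = []
    for pid, outcomes in problems.items():
        informative = pvariance(outcomes) >= min_var   # not uniform pass/fail
        hard = mean(outcomes) <= max_pass_rate         # lower-pass-rate portion
        if informative and hard:
            kept.append(pid)
    return kept

toy = {
    "easy": [1, 1, 1, 1],     # uniformly passed: no signal
    "dead": [0, 0, 0, 0],     # uniformly failed: no signal
    "hard": [1, 0, 0, 0],     # informative and hard
    "medium": [1, 1, 1, 0],   # informative but easier
}
print(filter_by_uncertainty(toy))
```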
#### Combined coding training set\.
TraFL’s coding model is trained on the concatenation of these two filtered subsets \(14,632 examples total\)\. The filtering targets problems with informative binary feedback rather than removing examples on stylistic or domain grounds\.
### E.2 Training-Data Comparison Across Methods
The public checkpoints of our direct baselines (JustGRPO and ESPO) were trained on different data mixtures than TraFL. [Tab. 2](https://arxiv.org/html/2605.13935#A5.T2) summarizes the training data used by each method we compare against, so that algorithmic and data differences are not conflated when reading the empirical results.

Table 2: Training-data summary across the methods we compare against. All methods train separate task-specific checkpoints for GSM8K and for MATH (reported separately throughout the paper). For coding, the public JustGRPO and ESPO checkpoints we evaluate were trained on substantially larger coding mixtures than TraFL.

| Method | GSM8K-trained | MATH-trained | Code training data |
| --- | --- | --- | --- |
| JustGRPO \[[15](https://arxiv.org/html/2605.13935#bib.bib19)\] | GSM8K (7,473) | MATH (7,500) | AceCode-87K (21,000 hard samples) |
| ESPO \[[17](https://arxiv.org/html/2605.13935#bib.bib26)\] | GSM8K (7,473) | MATH (7,500) | AceCode-87K (21,000 hard samples) |
| TraFL (Ours) | GSM8K (7,473) | MATH (7,500) | AceCode-89K filtered (11,937) + KodCode filtered (2,695) |

Two aspects of this comparison are worth noting. First, all methods follow the same per-task convention for math, training one checkpoint on GSM8K and a separate checkpoint on MATH, so the math comparisons are matched in training-data scale and source. Second, the coding training set used by TraFL (14,632 examples) is smaller than the coding mixtures used to produce the public JustGRPO and ESPO coding checkpoints, so coding comparisons are conservative with respect to TraFL: any algorithmic advantage we report on coding holds despite TraFL being trained on less data than the baselines we compare against. None of these training-data differences affect the held-out evaluations on Minerva Math and LiveCodeBench, where the conclusions reported in the main text are drawn under matched decoding settings on benchmarks that no method was trained on directly.
## Appendix F Additional Baseline Comparison Details
In our main experiments, we directly compare against JustGRPO \[[15](https://arxiv.org/html/2605.13935#bib.bib19)\] and ESPO \[[17](https://arxiv.org/html/2605.13935#bib.bib26)\], since these were the strongest closely related diffusion-LM post-training baselines with usable public checkpoints. This enables a controlled comparison in which all directly evaluated methods are run under the same decoding protocol and evaluation pipeline.

Table 3: Single-sample accuracy (%) on reasoning and coding benchmarks. Results are reported for two maximum completion lengths, 256 and 512. Results marked with $\dagger$ are quoted directly from the original papers; these entries are included for context rather than as fully controlled direct comparisons.

| Method | GSM8K 256 | GSM8K 512 | MATH-500 256 | MATH-500 512 | HumanEval 256 | HumanEval 512 | MBPP 256 | MBPP 512 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D1$^{\dagger}$ \[[26](https://arxiv.org/html/2605.13935#bib.bib18)\] | 81.1 | 82.1 | 38.6 | 40.2 | 32.9 | 37.8 | 44.7 | 42.8 | 50.03 |
| DMPO$^{\dagger}$ \[[29](https://arxiv.org/html/2605.13935#bib.bib9)\] | 82.41 | 85.22 | 38.20 | 42.80 | – | – | – | – | – |

For completeness, [Tab. 3](https://arxiv.org/html/2605.13935#A6.T3) also reports published results for D1 \[[26](https://arxiv.org/html/2605.13935#bib.bib18)\] and DMPO \[[29](https://arxiv.org/html/2605.13935#bib.bib9)\], copied from their respective papers. These methods are closely related to our setting, but their public checkpoints were not available at the time of our experiments. We therefore include their numbers only as contextual reference points, rather than as fully controlled comparisons. The table marks such quoted results with $\dagger$.
## Appendix G LLM-as-a-Judge Diversity Prompt
We use the following prompt for the LLM-as-a-judge diversity evaluation. The prompt is designed to isolate differences in solution strategy, rather than correctness or surface-level variation. For each comparison, \{problem\}, \{set\_a\}, and \{set\_b\} are replaced with the problem statement and the two sampled answer sets being compared.

LLM-as-a-Judge Prompt for Diversity Evaluation

\[Judge Role\] You are an expert judge evaluating diversity among LLM-generated answers to the same problem. Your goal is to assess approach diversity, not correctness or answer quality.

\[Problem\] \{problem\}

\[Set A\] \{set\_a\}

\[Set B\] \{set\_b\}

\[Definition of Distinct Approach\] An answer represents a distinct approach only if it uses a recognizably different core method, such as a different algorithm, data structure, mathematical reduction, equation setup, case analysis, proof strategy, or reasoning path.

\[Do Not Count as Diversity\]

- different wording, formatting, verbosity, or variable names
- different final answers from the same method
- minor bugs in the same method
- random guesses or incoherent reasoning
- extra explanation that does not change the core method

\[Instructions\]

1. Identify the distinct approach types in each set.
2. Compare the number and substance of these approach types.
3. Prefer TIE if the difference is mostly superficial or unclear.

\[Response Format\] Return exactly this JSON:

    {{
      "set_a_approaches": ["short names of distinct approaches"],
      "set_b_approaches": ["short names of distinct approaches"],
      "winner": "A" | "B" | "TIE",
      "confidence": "low" | "medium" | "high",
      "reason": "one concise sentence"
    }}
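The doubled braces in the response format suggest that the prompt is filled with Python's `str.format`, although the paper does not say so. Under that assumption, the sketch below fills an abbreviated stand-in for the template and validates a judge reply. The helper names, the shortened template, and the sample reply are all hypothetical.

```python
import json

# Abbreviated stand-in for the judge prompt; only the placeholders matter here.
TEMPLATE = (
    "[Problem] {problem}\n[Set A] {set_a}\n[Set B] {set_b}\n"
    'Return exactly this JSON: {{ "winner": "A" | "B" | "TIE" }}'
)

def build_prompt(problem, set_a, set_b):
    """Fill the placeholders; doubled braces become literal braces."""
    return TEMPLATE.format(
        problem=problem, set_a="\n".join(set_a), set_b="\n".join(set_b)
    )

def parse_verdict(raw):
    """Check a judge reply against the fields listed in the response format."""
    verdict = json.loads(raw)
    if verdict["winner"] not in {"A", "B", "TIE"}:
        raise ValueError("unexpected winner")
    if verdict["confidence"] not in {"low", "medium", "high"}:
        raise ValueError("unexpected confidence")
    return verdict

# Made-up reply in the requested shape.
reply = (
    '{"set_a_approaches": ["brute force"], '
    '"set_b_approaches": ["brute force", "dp"], '
    '"winner": "B", "confidence": "medium", '
    '"reason": "Set B adds a DP solution."}'
)
print(build_prompt("Sum a list.", ["loop"], ["loop", "builtin sum"]))
print(parse_verdict(reply)["winner"])
```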
## Appendix H Limitations and Broader Impact
### H.1 Limitations
Our study has several limitations\. First, TraFL relies on a diffusion\-compatible sequence\-level surrogate rather than exact trajectory log\-probabilities\. While this surrogate makes the objective practical for diffusion language models, it remains only an approximation to the ideal likelihood term\. As a result, the optimized objective may not perfectly match the intended reward\-tilted trajectory distribution, and performance may depend on surrogate design choices such as the masking scheme and Monte Carlo estimation strategy\.
Second, our current experiments use terminal binary rewards: exact\-match verification for math and execution\-based test passing for code\. Such rewards are simple and robust, but they are also coarse\. In particular, when all sampled completions for a prompt receive the same reward, the centered reward term provides little or no relative learning signal\. This makes training sensitive to rollout diversity, group size, and task difficulty, and may limit learning efficiency on prompts where successes are extremely rare or nearly universal\.
### H.2 Broader Impact
This work studies post\-training objectives for diffusion language models, with experiments on mathematical reasoning and code generation\. A potential positive impact of this line of research is improved reliability and usefulness of language models in domains that benefit from verifiable reasoning, such as education, scientific assistance, and software development\. In particular, methods that improve solution coverage under sampling may help models produce a broader set of valid answers or implementations, which can be useful in settings where multiple correct solutions exist\.
At the same time, stronger reasoning and code\-generation models also carry risks\. Improvements in code synthesis could be misused to generate malicious scripts, automate parts of cyberattacks, or lower the barrier to producing harmful software\. Similarly, more capable reasoning models can be used to generate misleading technical explanations, polished but incorrect solutions, or other content that appears trustworthy despite being wrong\. Even when used as intended, errors in model output may still mislead users in high\-stakes settings if generations are accepted without verification\.
Our work is primarily methodological and is not tied to a deployed system\. Any real\-world deployment of more capable post\-training methods should, therefore, be accompanied by careful evaluation, monitoring for misuse, and appropriate safeguards to reduce the risk of harmful or misleading outputs\.
## Appendix I Training Compute
The post-training experiments can be reproduced on any modern 8-card parallel computing unit (NPUs, TPUs, or GPUs) with at least 80 GB of memory. Each math post-training experiment takes approximately 10 hours, and the code post-training experiment takes approximately 18 hours.
## Appendix J Dataset and Model Licenses
We use publicly released datasets, benchmarks, and models under their respective licenses. GSM8K \[[6](https://arxiv.org/html/2605.13935#bib.bib12)\] is released under the MIT License. MATH \[[13](https://arxiv.org/html/2605.13935#bib.bib13)\] is released under the MIT License. HumanEval \[[5](https://arxiv.org/html/2605.13935#bib.bib16)\] is released under the MIT License. MBPP \[[2](https://arxiv.org/html/2605.13935#bib.bib17)\] is released under CC BY 4.0. AceCode-87K \[[25](https://arxiv.org/html/2605.13935#bib.bib14)\] is released under the MIT License. KodCode-Light-RL-10K \[[23](https://arxiv.org/html/2605.13935#bib.bib29)\] is released under CC BY-NC 4.0 and is used only for non-commercial research. Minerva Math \[[12](https://arxiv.org/html/2605.13935#bib.bib27)\] is listed under the MIT License in the public release we use. LiveCodeBench \[[10](https://arxiv.org/html/2605.13935#bib.bib28)\] release v5 is released under the Creative Commons license family and is used only for evaluation. LLaDA-8B-Instruct \[[16](https://arxiv.org/html/2605.13935#bib.bib10)\] is released under the MIT License. We cite all original dataset and model sources and follow their usage protocols.