Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
Summary
This paper introduces CAPR (Cached-Amortized Path Refinement), a reinforcement learning algorithm for diffusion large language models that extracts tree-like supervision signals from the denoising trace without the compute cost of full tree rollouts. CAPR sets a new state of the art for RL-tuned diffusion language models on Sudoku, Countdown, GSM8K, and Math500, with rollout-generation cost roughly 0.75x that of flat rollouts.
Cached at: 06/05/26, 02:14 AM
# Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Source: [https://arxiv.org/abs/2606.04396](https://arxiv.org/abs/2606.04396) [View PDF](https://arxiv.org/pdf/2606.04396)

> Abstract: Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.

## Submission history

From: Anant Khandelwal [[view email](https://arxiv.org/show-email/a8eb6f80/2606.04396)] **[v1]** Wed, 3 Jun 2026 03:22:54 UTC (1,060 KB)
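
The abstract says CAPR redistributes the final outcome reward across blocks according to the tokens revealed in each block, but it does not give the exact rule. As a minimal sketch, assuming a plain split proportional to each block's revealed-token count (an assumption, not the paper's stated formula), the idea might look like the following. The function name `redistribute_reward` and the uniform fallback for the all-zero case are hypothetical choices made for illustration.

```python
from typing import List


def redistribute_reward(outcome_reward: float, tokens_revealed: List[int]) -> List[float]:
    """Split one scalar outcome reward across blocks.

    Assumption: each block's share is proportional to the number of
    tokens it revealed. The paper's actual weighting may differ.
    """
    total = sum(tokens_revealed)
    if total == 0:
        # Hypothetical fallback: no tokens revealed anywhere, so split uniformly.
        n = len(tokens_revealed)
        return [outcome_reward / n] * n if n else []
    return [outcome_reward * count / total for count in tokens_revealed]


# Three blocks revealing 8, 16, and 8 tokens share a reward of 1.0.
block_targets = redistribute_reward(1.0, [8, 16, 8])
print(block_targets)  # [0.25, 0.5, 0.25]
```

Under this assumption the per-block values sum back to the original outcome reward, so they could serve as regression targets for a block-level value head without changing the total reward signal.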
Similar Articles
The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models
This paper introduces TraceLock, a lightweight plug-in controller that learns a token-commitment policy for frozen diffusion language models, improving the quality-step tradeoff across various tasks without retraining.
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
This paper identifies a failure mode called 'trajectory locking' in reward-maximizing post-training for diffusion language models, and proposes TraFL, a trajectory-balance objective that improves diversity and performance across math and code benchmarks.
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
GDSD proposes a reinforcement learning method that directly distills denoisers from advantage-guided self-teachers for diffusion language models, avoiding biases from ELBO-based likelihood surrogates. It achieves up to +19.6% accuracy improvements on planning, math, and coding benchmarks over prior state-of-the-art methods.
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models
This paper identifies weaknesses in existing reinforcement learning methods for diffusion language models—lack of temporal credit assignment and biased likelihood estimates—and proposes DACA-GRPO, a plug-and-play enhancement that introduces denoising progress scores and stratified masking likelihood, achieving consistent improvements across reasoning, code generation, and constrained generation benchmarks.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
This paper introduces ReAD, a reinforcement-guided capability distillation framework that optimizes token budgets by accounting for cross-capability transfer in large language models. It demonstrates improved downstream utility and reduced harmful spillover compared to existing baselines.