GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
Summary
GDSD proposes a reinforcement learning method that directly distills denoisers from advantage-guided self-teachers for diffusion language models, avoiding biases from ELBO-based likelihood surrogates. It achieves up to +19.6% accuracy improvements on planning, math, and coding benchmarks over prior state-of-the-art methods.
Cached at: 06/01/26, 11:20 AM
Source: https://huggingface.co/papers/2605.29398
Abstract
Guided Denoiser Self-Distillation (GDSD) improves diffusion large language models by directly distilling denoisers from advantage-guided self-teachers, avoiding biases introduced by ELBO likelihood surrogates and achieving superior performance on benchmark tasks.
Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training-inference mismatch (TIM) by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM’s denoiser logits to the teacher’s via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
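The "closed-form optimum of reverse-KL regularized RL" mentioned in the abstract is a standard result: maximizing E_π[A] − β·KL(π ‖ π_ref) over distributions π gives π*(y) ∝ π_ref(y)·exp(A(y)/β). The following is a minimal sketch of that idea on a toy categorical policy over three candidate outputs, not a dLLM denoiser and not the authors' implementation. The function names, the toy numbers, and in particular the centered squared-error loss are assumptions made for illustration; the abstract does not specify the form of its normalization-free objective, and the loss here was chosen only because it ignores additive constants in the logits and so needs no partition function.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def guided_teacher_logits(ref_logits, advantages, beta):
    """Unnormalized log-probabilities of pi_ref * exp(A / beta)."""
    return ref_logits + advantages / beta


def regularized_objective(probs, ref_probs, advantages, beta):
    """E_pi[A] - beta * KL(pi || pi_ref) for a categorical policy."""
    kl = np.sum(probs * (np.log(probs) - np.log(ref_probs)))
    return float(np.sum(probs * advantages) - beta * kl)


def centered_logit_mse(student_logits, teacher_logits):
    """Hypothetical shift-invariant loss (an assumption, not the paper's):
    compare logits after subtracting each row's mean, so adding a constant
    to either set of logits leaves the loss unchanged."""
    s = student_logits - student_logits.mean(axis=-1, keepdims=True)
    t = teacher_logits - teacher_logits.mean(axis=-1, keepdims=True)
    return float(np.mean((s - t) ** 2))


# Toy setup: a reference policy over three candidates and their advantages.
ref_probs = np.array([0.5, 0.3, 0.2])
ref_logits = np.log(ref_probs)
advantages = np.array([1.0, 0.0, -1.0])
beta = 1.0

teacher_logits = guided_teacher_logits(ref_logits, advantages, beta)
teacher_probs = softmax(teacher_logits)
teacher_value = regularized_objective(teacher_probs, ref_probs, advantages, beta)

# Compare against random candidate policies: none should beat the teacher.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
best_candidate_value = max(
    regularized_objective(p, ref_probs, advantages, beta) for p in candidates
)

# The loss is unchanged when the teacher logits are shifted by a constant.
shift_loss = centered_logit_mse(teacher_logits + 3.7, teacher_logits)
```

In this toy setting the teacher moves probability mass toward the candidate with the highest advantage, and a student can be regressed onto the teacher's logits without ever computing a normalized likelihood, which is the sense in which such a procedure is likelihood-free.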
Models citing this paper: 5
diffusion-reasoning/gdsd_countdown_dream (Text Generation, 8B)
diffusion-reasoning/gdsd_sudoku_dream (Text Generation, 8B)
diffusion-reasoning/gdsd_sudoku_llada (Text Generation, 8B)
diffusion-reasoning/gdsd_countdown_llada (Text Generation, 8B)
Similar Articles
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models
This paper identifies weaknesses in existing reinforcement learning methods for diffusion language models—lack of temporal credit assignment and biased likelihood estimates—and proposes DACA-GRPO, a plug-and-play enhancement that introduces denoising progress scores and stratified masking likelihood, achieving consistent improvements across reasoning, code generation, and constrained generation benchmarks.
Drifting Objectives for Refining Discrete Diffusion Language Models
This paper introduces TokenDrift, a drifting objective that refines discrete diffusion language models by lifting categorical predictions to a continuous semantic space for anti-symmetric drifting, significantly improving generation quality under a fixed number of denoising steps.
Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]
This paper proposes using Masked Diffusion Language Models (MDLMs) as text-based world models for agentic reinforcement learning, showing that their any-order denoising objective avoids prefix mode collapse and leads to stronger performance than autoregressive baselines.
Self-Distilled Agentic Reinforcement Learning
SDAR enhances multi-turn agent training by integrating self-distillation with a sigmoid gate to selectively strengthen positive token-level guidance while mitigating negative teacher rejections, achieving significant improvements over GRPO across multiple benchmarks.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.