GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Hugging Face Daily Papers

Summary

GDSD is a reinforcement learning method for diffusion language models that directly distills the denoiser from an advantage-guided self-teacher, avoiding the bias introduced by ELBO-based likelihood surrogates. It achieves accuracy improvements of up to +19.6% on planning, math, and coding benchmarks over prior state-of-the-art methods.

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate, which introduces bias through training-inference mismatch (TIM) and can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
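
The "closed-form optimum of reverse-KL regularized RL" mentioned above has a well-known form in the standard setting: maximizing the expected advantage minus beta times KL(p || p_old) yields a teacher proportional to p_old * exp(A / beta), that is, the old logits shifted by A / beta. The following is a minimal sketch of that textbook result on a toy categorical distribution. It is not the paper's objective or implementation: the per-token advantages, the five-token vocabulary, the value of `beta`, and all function names are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_teacher_logits(old_logits, advantages, beta):
    """Unnormalized teacher logits: old logits tilted by advantage / beta.

    Hypothetical helper illustrating the standard tilted-policy result.
    """
    return [z + a / beta for z, a in zip(old_logits, advantages)]


def regularized_objective(probs, old_probs, advantages, beta):
    """Expected advantage minus beta * KL(probs || old_probs)."""
    expected_adv = sum(p * a for p, a in zip(probs, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, old_probs) if p > 0)
    return expected_adv - beta * kl


def random_distribution(rng, size):
    """Draw an arbitrary strictly positive probability vector."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy values (assumed, not taken from the paper).
rng = random.Random(0)
old_logits = [0.2, -1.0, 0.5, 0.0, 1.3]
advantages = [1.0, -0.5, 0.0, 0.3, -1.0]
beta = 0.5

old_probs = softmax(old_logits)
teacher_probs = softmax(guided_teacher_logits(old_logits, advantages, beta))
teacher_value = regularized_objective(teacher_probs, old_probs, advantages, beta)

# No randomly drawn distribution should score higher than the teacher.
best_random = max(
    regularized_objective(random_distribution(rng, 5), old_probs, advantages, beta)
    for _ in range(2000)
)

print("teacher objective:", teacher_value)
print("best random objective:", best_random)
```

Because the teacher is defined only up to a normalizing constant, a student can in principle be trained against the shifted logits directly. How GDSD actually constructs its normalization-free matching loss is described in the paper and is not reproduced here.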

Cached at: 06/01/26, 11:20 AM

Source: https://huggingface.co/papers/2605.29398

Abstract

Guided Denoiser Self-Distillation (GDSD) improves diffusion large language models by directly distilling denoisers from advantage-guided self-teachers, avoiding biases introduced by ELBO likelihood surrogates and achieving superior performance on benchmark tasks.

Models citing this paper: 5

diffusion-reasoning/gdsd_countdown_dream (Text Generation, 8B)

diffusion-reasoning/gdsd_sudoku_dream (Text Generation, 8B)

diffusion-reasoning/gdsd_sudoku_llada (Text Generation, 8B)

diffusion-reasoning/gdsd_countdown_llada (Text Generation, 8B)

Similar Articles

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

arXiv cs.LG

This paper identifies weaknesses in existing reinforcement learning methods for diffusion language models—lack of temporal credit assignment and biased likelihood estimates—and proposes DACA-GRPO, a plug-and-play enhancement that introduces denoising progress scores and stratified masking likelihood, achieving consistent improvements across reasoning, code generation, and constrained generation benchmarks.

Drifting Objectives for Refining Discrete Diffusion Language Models

arXiv cs.CL

This paper introduces TokenDrift, a drifting objective that refines discrete diffusion language models by lifting categorical predictions to a continuous semantic space for anti-symmetric drifting, significantly improving generation quality under a fixed number of denoising steps.

Self-Distilled Agentic Reinforcement Learning

Hugging Face Daily Papers

SDAR enhances multi-turn agent training by integrating self-distillation with a sigmoid gate to selectively strengthen positive token-level guidance while mitigating negative teacher rejections, achieving significant improvements over GRPO across multiple benchmarks.