Learning from the Self-future: On-policy Self-distillation for dLLMs

arXiv cs.CL Papers

Summary

Introduces d-OPSD, the first on-policy self-distillation framework for diffusion large language models, using suffix conditioning and step-level supervision to outperform RLVR and SFT baselines on reasoning benchmarks.

arXiv:2606.18195v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

Cached at: 06/17/26, 05:42 AM

# Learning from the Self-future: On-policy Self-distillation for dLLMs
Source: [https://arxiv.org/html/2606.18195](https://arxiv.org/html/2606.18195)
Yifu Luo1,†, Zeyu Chen2,†, Haoyu Wang3, Xinhao Hu1, Yuxuan Zhang4, Zhizhou Sha5, Shiwei Liu6,7,8

†Equal Contribution

1 Tsinghua University; 2 Technical University of Munich; 3 Nanyang Technological University; 4 University of British Columbia; 5 University of Texas at Austin; 6 ELLIS Institute Tubingen; 7 Max Planck Institute for Intelligent Systems; 8 Tubingen AI Center

###### Abstract

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from “self future-experience” rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at [https://github.com/xingzhejun/d-OPSD](https://github.com/xingzhejun/d-OPSD).

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/try1.png)

Figure 1: Reasoning performance and sample efficiency comparison between the RLVR baseline (diffu-GRPO (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20))) and our approach, d-OPSD.

On-policy distillation (OPD) (Agarwal et al., [2024](https://arxiv.org/html/2606.18195#bib.bib1); Yang et al., [2025](https://arxiv.org/html/2606.18195#bib.bib3); Lu and Lab, [2025](https://arxiv.org/html/2606.18195#bib.bib2); Li et al., [2026](https://arxiv.org/html/2606.18195#bib.bib4)), where a student model samples its own trajectories while a stronger teacher model provides dense token-level supervision, has recently emerged as a highly effective paradigm for post-training large language models (LLMs), offering significant advantages over Reinforcement Learning with Verifiable Rewards (RLVR), e.g., GRPO (Guo et al., [2025](https://arxiv.org/html/2606.18195#bib.bib6)), and supervised fine-tuning (SFT). Compared to RLVR, OPD provides dense token-level supervision from a teacher, overcoming the bottleneck of sparse outcome rewards. Compared to SFT, OPD trains on generations sampled from the student itself, thereby avoiding exposure bias (Bengio et al., [2015](https://arxiv.org/html/2606.18195#bib.bib8)). However, OPD relies heavily on a stronger teacher model, which is impractical in many settings. To address this, recent works (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9); Hübotter et al., [2026](https://arxiv.org/html/2606.18195#bib.bib10); Shenfeld et al., [2026](https://arxiv.org/html/2606.18195#bib.bib11)) have extended OPD to on-policy self-distillation (OPSD), where a single model serves as its own teacher given teacher-specific privileged information, yielding a powerful framework for self-improvement.

Concurrently, diffusion large language models (dLLMs) (Ou et al., [2024](https://arxiv.org/html/2606.18195#bib.bib12); Nie et al., [2025](https://arxiv.org/html/2606.18195#bib.bib13); Ye et al., [2025](https://arxiv.org/html/2606.18195#bib.bib14); Cheng et al., [2025](https://arxiv.org/html/2606.18195#bib.bib15); Bie et al., [2025](https://arxiv.org/html/2606.18195#bib.bib16)) have demonstrated strong potential as an alternative to autoregressive (AR) LLMs (Jaech et al., [2024](https://arxiv.org/html/2606.18195#bib.bib24); Xiao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib25)). By modeling language generation as an iterative denoising process, dLLMs bypass the strict left-to-right dependency of AR models, unlocking unique advantages such as arbitrary-order generation and faster inference (Khanna et al., [2025](https://arxiv.org/html/2606.18195#bib.bib17); Song et al., [2025](https://arxiv.org/html/2606.18195#bib.bib18); Wu et al., [2025](https://arxiv.org/html/2606.18195#bib.bib19)).

While recent works (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20); Tang et al., [2025](https://arxiv.org/html/2606.18195#bib.bib21); Xie et al., [2025](https://arxiv.org/html/2606.18195#bib.bib22)) have successfully applied RLVR to dLLMs, demonstrating that their reasoning ability can be enhanced by post-training, OPSD for dLLMs remains largely unexplored. Meanwhile, as shown in [Figure 2](https://arxiv.org/html/2606.18195#S1.F2), existing OPSD approaches for AR models follow a standard paradigm for self-teacher construction, where privileged information (e.g., reference solutions) is simply appended to the prompt, and teacher-student divergence supervision is calculated at the token level. Given that dLLMs differ fundamentally from AR LLMs, we investigate the following two questions in this paper:

Question: Is there a better OPSD formulation tailored specifically for dLLMs?

Answer: Yes. Both the self-teacher construction and the level of divergence supervision can be optimized for dLLMs, as shown in [Figure 2](https://arxiv.org/html/2606.18195#S1.F2).

Question: Does OPSD outperform RLVR in enhancing the reasoning ability of dLLMs?

Answer: Yes. It achieves superior results in both reasoning performance and sample efficiency, as shown in [Figure 1](https://arxiv.org/html/2606.18195#S1.F1).

First, we identify that the self-teacher construction mentioned above is suboptimal for dLLMs. Appending privileged information to the prompt is inherently designed for AR models, because they are constrained to left-to-right generation where only prefix conditioning $p(\text{suffix} \mid \text{prefix})$ is available. In contrast, dLLMs generate sequences non-autoregressively, which allows us to incorporate privileged information as a suffix context condition. More importantly, this feature enables us to shift the content of privileged information from static reference solutions to the model’s self-generated answers, adhering more closely to the on-policy nature of OPSD. As shown in [Figure 2](https://arxiv.org/html/2606.18195#S1.F2), the $p(\text{prefix} \mid \text{suffix})$ capability of dLLMs allows us to use self-generated answers as a suffix conditional posterior for privileged information. This guides the student to learn from its “self future-experience”, much as people daydream about going back 10 years knowing what happens next. A key advantage of our teacher construction is that it provides more new knowledge (thinking patterns) to transfer to the student, a claim we examine empirically in [Section 4.3](https://arxiv.org/html/2606.18195#S4.SS3).

Second, token-level divergence supervision is not suitable for dLLMs either. While AR models natively rely on next-token prediction, dLLMs predict all masked tokens simultaneously at each denoising step but keep only part of them, remasking the others. Consequently, token-level supervision designed for AR models becomes incompatible. Instead, since each denoising step can be viewed as an independent Markov transition, step-level divergence is a natural choice for OPSD on dLLMs. By shifting the dense supervision from the token level to the step level, we closely align the OPSD objective with the iterative denoising nature of dLLMs.

Building on these insights, we propose diffusion On-Policy Self-distillation (d-OPSD), a novel OPSD framework specifically designed for dLLMs to drive self-improvement. To the best of our knowledge, this represents the first application of OPSD to dLLMs. In our approach, the student samples its own trajectories, while the self-teacher is constructed using self-generated answers as suffix privileged information. By applying step-level divergence, the student effectively learns from its “self future-experience”. Extensive experiments across four reasoning tasks demonstrate that our approach consistently outperforms RLVR and SFT baselines in both reasoning performance and sample efficiency, as highlighted in [Figure 1](https://arxiv.org/html/2606.18195#S1.F1).

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/fig2.png)

Figure 2: The framework of our approach, d-OPSD. It leverages self-generated answers as suffix privileged information to construct the self-teacher, and uses step-level divergence to guide the student to learn from the “self future-experience”.

Our contributions are summarized as follows:

- We identify that existing OPSD formulations are suboptimal for dLLMs. To bridge this gap, we introduce a novel self-teacher construction that utilizes self-generated answers as a suffix conditional posterior for privileged information, and we shift the dense divergence supervision from the token level to the step level.
- We are the first to introduce OPSD to dLLMs. We propose d-OPSD, a novel OPSD framework tailored for dLLMs to drive self-improvement. It enables a single model to act as both teacher and student, leveraging the self-generated “future” as privileged information to provide dense step-level supervision over the student trajectories.
- We conduct extensive experiments across four reasoning tasks, demonstrating that our approach achieves both superior reasoning performance and sample efficiency compared to RLVR and SFT baselines. Furthermore, we empirically analyze the impact of different settings, paving the way for future advances in this field.

## 2 Preliminaries

### 2.1 Diffusion Large Language Models

In this subsection, we briefly review the training and inference paradigms of dLLMs. During training, dLLMs define a forward process that gradually corrupts a clean input by replacing its tokens with a special $\texttt{mask}$ token. Given a prompt $x$ and a clean response $y_0 = \{y_0^1, y_0^2, \cdots, y_0^L\}$, the forward process at step $0 < t \leq T$ can be expressed as:

$$q(y_t \mid y_0, x) = \prod_{i=1}^{L} q(y_t^i \mid y_0^i, x) \quad \text{and} \quad q(y_t^i \mid y_0^i, x) = \begin{cases} \frac{T-t}{T}, & y_t^i = y_0^i, \\ \frac{t}{T}, & y_t^i = \texttt{mask}, \end{cases} \tag{1}$$

where $L$ is the sequence length, and the superscript $i$ refers to the token position.
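The forward process in Equation 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `MASK` string, the function name `forward_mask`, and the toy token list are hypothetical and are not taken from the paper's code.

```python
import random

MASK = "<mask>"  # hypothetical string standing in for the special mask token


def forward_mask(y0, t, T, rng):
    """Sample y_t from q(y_t | y_0, x) in Equation 1.

    Each token is independently replaced by MASK with probability t/T
    and kept unchanged with probability (T - t)/T.
    """
    if not 0 < t <= T:
        raise ValueError("t must satisfy 0 < t <= T")
    p_mask = t / T
    return [MASK if rng.random() < p_mask else tok for tok in y0]


rng = random.Random(0)  # seeded only so the sketch is repeatable
y0 = ["2", "+", "3", "=", "5"]
print(forward_mask(y0, t=1, T=4, rng=rng))  # each token masked with prob. 0.25
print(forward_mask(y0, t=4, T=4, rng=rng))  # t = T: every token is masked
```

At $t = T$ the masking probability is 1, so the corrupted sequence is fully masked, which matches the starting point of the reverse process described next.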

In this work, we primarily focus on the reverse inference process of dLLMs. Given a prompt $x$ and a trained model $p_\theta$, inference is formulated as a $T$-step iterative denoising process, from a fully masked sequence $y_T = \{\texttt{mask}\}^L$ to a clean response $y_0$. At each denoising step $t$, the model first computes the distribution for all tokens:

$$\mathcal{P}_t^i = p_\theta(y^i \mid y_t, x), \quad 1 \leq i \leq L. \tag{2}$$

The top-$k$ most confident predictions among the currently masked positions are then sampled and revealed. The remaining positions stay masked, and the result forms $y_{t-1}$. After $T$ steps, all masked tokens are revealed, yielding the final response $y_0$. Additional preliminaries about block diffusion, a commonly used inference strategy, are provided in [Section A.1](https://arxiv.org/html/2606.18195#A1.SS1).
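A single reverse step can be sketched as follows. The sketch makes several assumptions that are not from the paper: the "model" is a fixed table of per-position distributions that ignores context, confidence is taken to be the probability of the most likely token, and revealed tokens are chosen greedily rather than sampled. The names `denoise_step` and `dists` are hypothetical.

```python
MASK = None  # hypothetical marker for a masked position


def denoise_step(y_t, dists, k):
    """Reveal the k most confident masked positions of y_t.

    dists[i] maps each candidate token to its probability at position i,
    playing the role of P_t^i in Equation 2.
    """
    masked = [i for i, tok in enumerate(y_t) if tok is MASK]
    # Confidence of a position = probability of its most likely token.
    ranked = sorted(masked, key=lambda i: max(dists[i].values()), reverse=True)
    y_next = list(y_t)
    for i in ranked[:k]:
        # Greedy choice instead of sampling: a simplification of this sketch.
        y_next[i] = max(dists[i], key=dists[i].get)
    return y_next


# Toy "model": fixed per-position distributions that ignore the context y_t.
dists = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.4, "b": 0.6},
    {"a": 0.2, "b": 0.8},
    {"a": 0.55, "b": 0.45},
]
y = [MASK] * 4
for _ in range(2):  # T = 2 steps, k = 2 tokens revealed per step
    y = denoise_step(y, dists, k=2)
    print(y)
```

In a real dLLM the distributions would be recomputed from the partially revealed sequence at every step; the fixed table here only keeps the example small.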

### 2.2 On-policy Distillation

OPD transfers knowledge from a stronger teacher model $p_T$ to a weaker student model $p_\theta$ by enforcing dense supervision over trajectories sampled by the student. For AR models, given a prompt $x$, the student samples a response $y = \{y^1, y^2, \cdots, y^L\}$. Using the AR factorization, the learning objective is to minimize the token-level KL between the student’s and the teacher’s next-token distributions:

$$\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{x}\left[\sum_{i=1}^{L} \mathcal{D}_{\text{KL}}\left(p_\theta\left(\cdot \mid y^{<i}, x\right) \,\|\, p_T\left(\cdot \mid y^{<i}, x\right)\right)\right], \tag{3}$$

where $p(\cdot \mid y^{<i}, x)$ denotes the distribution over the next token $y^i$. We use reverse KL here, but forward KL and other divergence measures such as the generalized Jensen-Shannon divergence (Agarwal et al., [2024](https://arxiv.org/html/2606.18195#bib.bib1)) can also be employed.
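To make the direction of the divergence in Equation 3 concrete, the sketch below computes the per-response loss in both directions for toy distributions. The function names, the `eps` smoothing constant, and the two-token vocabulary are assumptions of this example, not details from the paper.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def token_level_loss(student, teacher, reverse=True):
    """Sum of per-position divergences for one sampled response.

    reverse=True gives KL(student || teacher) as written in Equation 3;
    reverse=False gives the forward direction KL(teacher || student).
    """
    if reverse:
        return sum(kl(s, t) for s, t in zip(student, teacher))
    return sum(kl(t, s) for s, t in zip(student, teacher))


# Two positions, vocabulary of size 2; the second position already agrees.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.8, 0.2], [0.9, 0.1]]
print(token_level_loss(student, teacher))                 # reverse KL
print(token_level_loss(student, teacher, reverse=False))  # forward KL
```

Only the first position contributes to either loss, since the student and teacher distributions coincide at the second one.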

Recent advances have extended OPD to OPSD, where the student and teacher are instantiated from the same model, denoted as $p_\theta$. The difference lies entirely in their conditioning contexts. For AR models, privileged information $y^*$, such as reference solutions (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)) or environment feedback (Hübotter et al., [2026](https://arxiv.org/html/2606.18195#bib.bib10)), is appended to the original prompt $x$ to construct a teacher-specific prompt $x^* = x + y^*$. Thus, the teacher distribution is:

$$p_T = p_\theta\left(\cdot \mid y^{<i}, x, y^*\right) = p_\theta\left(\cdot \mid y^{<i}, x^*\right). \tag{4}$$

Consequently, the learning objective in [Equation 3](https://arxiv.org/html/2606.18195#S2.E3) becomes:

$$\mathcal{L}_{\text{OPSD}}(\theta) = \mathbb{E}_{x}\left[\sum_{i=1}^{L} \mathcal{D}_{\text{KL}}\left(p_\theta\left(\cdot \mid y^{<i}, x\right) \,\|\, p_\theta\left(\cdot \mid y^{<i}, x^*\right)\right)\right]. \tag{5}$$

In this setup, the teacher and student share the same model and differ only in their conditioning contexts, and the response is generated solely by the student. While OPSD achieves performance comparable to RLVR with superior sample efficiency for AR models, adapting this formulation to dLLMs presents fundamental challenges. First, the arbitrary-order generation of dLLMs provides an alternative way to inject privileged information, one that better aligns with the on-policy nature of OPSD ([Section 3.1](https://arxiv.org/html/2606.18195#S3.SS1)). Second, token-level divergence supervision is incompatible with dLLMs because their generation does not factorize into next-token predictions; step-level divergence supervision must be adopted instead ([Section 3.2](https://arxiv.org/html/2606.18195#S3.SS2)).

## 3 Methods

### 3.1 Teacher Construction: Learning from the Self-future

In this section, we describe how we utilize the student’s self-generated “future” answers as privileged information for the teacher, which adheres more closely to both the nature of dLLMs and the on-policy setting. While AR models are constrained to left-to-right generation with only $p(\text{suffix} \mid \text{prefix})$ available, dLLMs possess the bidirectional capability to model suffix conditioning $p(\text{prefix} \mid \text{suffix})$. As shown in [Figure 2](https://arxiv.org/html/2606.18195#S1.F2), our core insight is that after sampling a complete trajectory from the student, we can partially reveal this self-generated subsequent trajectory to the teacher as privileged information.

Specifically, we instantiate both the teacher and student distributions from the same dLLM $p_\theta$ by varying the conditioning inputs. Given a prompt $x$, the student first samples a trajectory from $p_\theta$:

$$Y = \{y_T, y_{T-1}, \cdots, y_0\} \sim p_\theta(\cdot \mid x), \tag{6}$$

where $y_T = \{\texttt{mask}\}^L$ is a fully masked sequence, $y_0$ is the final response, $T$ refers to the total number of denoising steps, and $L$ denotes the sequence length. At each denoising step $0 < t \leq T$, the student input is simply the current noisy sequence:

$$y_{\text{student},t} = y_t. \tag{7}$$

Conversely, the teacher input is constructed by selectively revealing tokens from the final generated response $y_0$:

$$y_{\text{teacher},t}^{i} = \begin{cases} y_0^i, & \text{if } i \in \mathcal{S}_t, \\ y_t^i, & \text{otherwise}, \end{cases} \tag{8}$$

where $\mathcal{S}_t \subset \{1, 2, \cdots, L\}$ is the revealing subset of indices, randomly selected with a fixed retaining ratio $\rho_{\text{teacher}}$ from the currently masked positions. Thus, both the student and teacher share the same model $p_\theta$, but the teacher benefits from the self-generated “future” trajectory. An illustrative example of our teacher construction is provided in [Appendix B](https://arxiv.org/html/2606.18195#A2).
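The construction in Equations 7 and 8 can be sketched as below. Several details are assumptions of this example rather than statements about the paper's implementation: the `MASK` string, the function name `build_teacher_input`, the rule that the number of revealed positions is `round(rho * number_of_masked_positions)`, and the toy token sequences.

```python
import random

MASK = "<mask>"  # hypothetical string standing in for the mask token


def build_teacher_input(y_t, y_0, rho, rng):
    """Build y_teacher,t from the student state y_t and final response y_0.

    A random subset S_t of the currently masked positions is revealed with
    the corresponding tokens of y_0 (Equation 8). The student input is y_t
    itself (Equation 7).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    masked = [i for i, tok in enumerate(y_t) if tok == MASK]
    n_reveal = round(rho * len(masked))  # rounding rule is an assumption
    reveal = set(rng.sample(masked, n_reveal))
    return [y_0[i] if i in reveal else tok for i, tok in enumerate(y_t)]


rng = random.Random(0)  # seeded only so the sketch is repeatable
y_0 = ["so", "x", "=", "4", "answer", ":", "4", "."]
y_t = ["so", MASK, MASK, "4", MASK, MASK, MASK, MASK]  # 6 masked positions
print(build_teacher_input(y_t, y_0, rho=0.5, rng=rng))  # 3 of them revealed
```

With `rho=0` the teacher input equals the student input, and with `rho=1` it equals the full final response, which brackets the range explored by the retaining ratio.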

This construction aligns naturally with both the on-policy setting and the nature of dLLMs. First, all data is generated by the student. Second, the construction in [Equation 7](https://arxiv.org/html/2606.18195#S3.E7) and [Equation 8](https://arxiv.org/html/2606.18195#S3.E8) yields the distributions $p_\theta(\cdot \mid y_{\text{student},t}, x)$ and $p_\theta(\cdot \mid y_{\text{teacher},t}, x)$, which enable the direct step-level divergence supervision introduced in the next subsection. Note that $p(\cdot \mid y_t, x)$ here denotes the distribution for the next step.

### 3.2 Step-level Divergence Supervision

Unlike AR models, which natively employ token-level supervision via next-token prediction, dLLMs decode sequences via next-step prediction. At each denoising step, only the top-$k$ most confident tokens among the currently masked positions are sampled and revealed, while the remaining positions are kept masked. Since token-level supervision is incompatible with this process, we propose step-level divergence supervision as a more natural objective for dLLMs.

Specifically, at each denoising step $t$, using the previously constructed inputs $y_{\text{student},t}$ and $y_{\text{teacher},t}$, the model first computes full-sequence distributions:

$$\begin{aligned} \mathcal{P}_{\text{student},t}^{i} &= p_\theta(y^i \mid y_{\text{student},t}, x), \quad 1 \leq i \leq L, \\ \mathcal{P}_{\text{teacher},t}^{i} &= p_\theta(y^i \mid y_{\text{teacher},t}, x), \quad 1 \leq i \leq L. \end{aligned} \tag{9}$$

Crucially, not all token positions $i$ actively participate in the state transition from $t$ to $t-1$. We focus exclusively on the top-$k$ most confident tokens among the currently masked positions, as only these tokens dictate the step-level transition. We denote the indices of these tokens as the top-$k$ subset $\mathcal{K}_t \subset \{1 \leq i \leq L \mid y_t^i = \texttt{mask}\}$, which satisfies:

$$\sum_{t=1}^{T} |\mathcal{K}_t| = L. \tag{10}$$

We then compute the step-level KL divergence over this subset:

$$\mathcal{L}_t = \frac{1}{|\mathcal{K}_t|} \sum_{i \in \mathcal{K}_t} \mathcal{D}_{\text{KL}}\left(\mathcal{P}_{\text{student},t}^{i} \,\|\, \mathcal{P}_{\text{teacher},t}^{i}\right). \tag{11}$$

Note that the top-$k$ subset $\mathcal{K}_t$ can in principle be determined from either the student distribution or the teacher distribution. However, the ablation study in [Table 7](https://arxiv.org/html/2606.18195#S4.T7) suggests that deriving it from the teacher distribution yields greater performance gains.
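A minimal sketch of the step-level loss in Equation 11 is shown below, with $\mathcal{K}_t$ selected from the teacher distribution as the ablation suggests. The confidence measure (probability of the most likely token), the `eps` smoothing constant, the function names, and the toy distributions are all assumptions of this example.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def top_k_masked(dists, masked, k):
    """Indices of the k masked positions with the highest top-token probability."""
    return sorted(masked, key=lambda i: max(dists[i]), reverse=True)[:k]


def step_loss(student, teacher, masked, k):
    """Mean KL(student || teacher) over the top-k subset K_t (Equation 11).

    K_t is chosen from the teacher distributions; choosing it from the
    student distributions would only change the first line.
    """
    K_t = top_k_masked(teacher, masked, k)
    if not K_t:
        return 0.0
    return sum(kl(student[i], teacher[i]) for i in K_t) / len(K_t)


# Three positions, vocabulary of size 2; position 1 is already revealed.
student = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
teacher = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4]]
print(step_loss(student, teacher, masked=[0, 2], k=1))  # only position 0 counts
```

Here the teacher is most confident at position 0, so with `k=1` the loss is the divergence at that single position and position 2 is ignored for this step.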

With the self-teacher construction and step-level divergence in place, we now possess all the essential components needed to apply OPSD to dLLMs.

### 3.3 d-OPSD: the First On-Policy Self-distillation for dLLMs

We now formally introduce our approach, d-OPSD. Operating with a single model $p_\theta$ serving simultaneously as student and teacher, the procedure begins with the student sampling an on-policy $T$-step trajectory $Y$ ([Equation 6](https://arxiv.org/html/2606.18195#S3.E6)) for a given prompt $x$. For each denoising step $0 < t \leq T$, we construct the student input $y_{\text{student},t}$ and the teacher input $y_{\text{teacher},t}$ using [Equation 7](https://arxiv.org/html/2606.18195#S3.E7) and [Equation 8](https://arxiv.org/html/2606.18195#S3.E8). Note that the constructions are independent across steps. Finally, we minimize the following step-level learning objective across the entire on-policy trajectory:

$$\begin{aligned} \mathcal{L}_{\text{OPSD}}(\theta) &= \mathbb{E}_{x}\left[\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_t\right] \\ &= \mathbb{E}_{x}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{1}{|\mathcal{K}_t|}\sum_{i\in\mathcal{K}_t}\mathcal{D}_{\text{KL}}\left(\mathcal{P}_{\text{student},t}^{i} \,\|\, \mathcal{P}_{\text{teacher},t}^{i}\right)\right] \\ &= \mathbb{E}_{x}\left[\frac{1}{T}\sum_{t=1}^{T}\frac{1}{|\mathcal{K}_t|}\sum_{i\in\mathcal{K}_t}\mathcal{D}_{\text{KL}}\left(p_\theta\left(y^i \mid y_{\text{student},t}, x\right) \,\|\, p_\theta\left(y^i \mid y_{\text{teacher},t}, x\right)\right)\right]. \end{aligned} \tag{12}$$

Additionally, we find that the quality of the student trajectory $Y$ influences the final performance (see [Table 7](https://arxiv.org/html/2606.18195#S4.T7) and [Table 11](https://arxiv.org/html/2606.18195#A5.T11)). Therefore, for each prompt $x$, we keep sampling trajectories until a correct final answer $y_0$ occurs or the number of sampling iterations reaches a threshold (similar to pass@$k$; we set $k=8$ by default). Even with $k=1$, our approach still surpasses the RLVR baseline, which uses group $k=8$ rollouts (see [Table 7](https://arxiv.org/html/2606.18195#S4.T7)). Note that this sampling strategy has the same computational overhead as RLVR (group $k$ rollouts) for each training step. Following (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)), we apply pointwise KL clipping and the fixed-teacher strategy, as detailed in [Appendix C](https://arxiv.org/html/2606.18195#A3). Additional implementation details are also provided in [Appendix C](https://arxiv.org/html/2606.18195#A3), including an important engineering technique that prevents out-of-memory errors by concatenating step-level inputs, motivated by (Wang et al., [2025](https://arxiv.org/html/2606.18195#bib.bib29)).
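The pass@$k$-style sampling described above can be sketched as a small loop. The names `sample_until_correct`, `sample_fn`, and `is_correct` are hypothetical, and the behavior when no draw is correct (falling back to the last sampled trajectory) is an assumption of this sketch, since the text above does not specify it.

```python
def sample_until_correct(sample_fn, is_correct, k=8):
    """Draw up to k trajectories and stop at the first correct one.

    Returns (trajectory, solved, n_draws). If no draw is correct, the last
    trajectory is returned with solved=False (fallback is an assumption).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    traj = None
    for n in range(1, k + 1):
        traj = sample_fn()
        if is_correct(traj):
            return traj, True, n
    return traj, False, k


# Toy stand-ins: the "model" emits candidate answers 3, 4, 5 in turn.
answers = iter([3, 4, 5])
result = sample_until_correct(lambda: next(answers), lambda a: a == 5, k=8)
print(result)  # the third draw is the first correct one
```

Because the loop never draws more than `k` trajectories, its cost per prompt is bounded by that of a group of `k` rollouts.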

We conclude this section by highlighting the fundamental distinctions between our approach and existing self-distillation approaches for dLLMs, such as d3LLM (Qian et al., [2026](https://arxiv.org/html/2606.18195#bib.bib30)) and Cd4lm (Liang et al., [2026](https://arxiv.org/html/2606.18195#bib.bib31)), which also construct a self-teacher by partially revealing answers. First and foremost, the revealed answers in our approach are “self-experience” generated on-policy by the student itself, whereas theirs come from the ground truth of static datasets. Second, while we leverage step-level divergence supervision across an entire on-policy generation trajectory, they employ a single forward pass, akin to a “one-step” fake trajectory, to provide supervision. These differences make d-OPSD an on-policy distillation approach that provides dense supervision for every denoising step across the entire trajectory, whereas their approaches remain fundamentally off-policy and closely related to SFT.

## 4Experiments

In this section, we first address a foundational prerequisite with a toy verification:

Question: Is the self-teacher strong enough to guide self-distillation?

Answer: Yes. The self-teacher is capable enough that the correct answer can be recovered using our teacher construction. See [Section 4.1](https://arxiv.org/html/2606.18195#S4.SS1).

We then conduct comprehensive experiments to answer the following core questions:

Question: How does OPSD compare to SFT and RLVR in reasoning performance and sample efficiency?

Answer: It outperforms or matches SFT and RLVR baselines in reasoning performance, while demonstrating vastly superior sample efficiency. See [Section 4.2](https://arxiv.org/html/2606.18195#S4.SS2).

Question: How does the self-teacher construction in d-OPSD compare to the AR-style counterpart ([Figure 2](https://arxiv.org/html/2606.18195#S1.F2))?

Answer: Our approach significantly outperforms the AR counterpart. The key reason is that our teacher construction introduces more new knowledge (thinking patterns) to transfer to the student. See [Section 4.3](https://arxiv.org/html/2606.18195#S4.SS3).

Question: What is the impact of different training settings?

Answer: We provide comprehensive ablation results on different KL objectives, retaining ratios $\rho_{\text{teacher}}$, top-$k$ subset $\mathcal{K}_t$ selections, sampling strategies, and other training settings. See [Section 4.4](https://arxiv.org/html/2606.18195#S4.SS4) and [Section E.1](https://arxiv.org/html/2606.18195#A5.SS1).

Question: What are the failure modes for d-OPSD?

Answer: Similar to RLVR, OPSD is susceptible to policy collapse after reaching peak performance. See [Section 4.5](https://arxiv.org/html/2606.18195#S4.SS5).

### 4.1 Experimental Setup & Toy Verification

Models and Tasks. We employ LLaDA-8B-Instruct (Nie et al., [2025](https://arxiv.org/html/2606.18195#bib.bib13)), a state-of-the-art dLLM that has not undergone post-training, as our base model. We did not use Dream (Ye et al., [2025](https://arxiv.org/html/2606.18195#bib.bib14)) because its output format is highly inconsistent, which causes severe instability across RLVR baselines; this limitation is also noted by (Pan et al., [2025](https://arxiv.org/html/2606.18195#bib.bib34)). We conduct experiments across four reasoning tasks spanning two categories: mathematical reasoning and planning. The mathematical reasoning tasks include GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.18195#bib.bib32)) and MATH500 (Lightman et al., [2023](https://arxiv.org/html/2606.18195#bib.bib33)). The planning tasks include 4x4 Sudoku puzzles, which require constraint satisfaction to fill a grid with numbers, and Countdown (3 numbers), where models must reach a target number using basic arithmetic operations on a given set of integers. All dataset configurations remain consistent with the RLVR baseline, diffu-GRPO (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)).

Table 1: Reasoning performance comparison across four reasoning tasks. Results of diffu-GRPO and the SFT variant are sourced from the original paper (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)). Results of VRPO, d3LLM, and the base model are evaluated using their open-sourced models. d-OPSD consistently outperforms or matches SFT and RLVR baselines.

Table 2: Sample efficiency comparison between the RLVR baseline and our approach. The optimization steps for diffu-GRPO are sourced from the original paper (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)).

Baselines. We compare against two categories of post-training methods: RLVR and SFT. RLVR baselines include diffu-GRPO (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)) and VRPO (Zhu et al., [2025](https://arxiv.org/html/2606.18195#bib.bib35)). For SFT, we compare against the SFT variant from (Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)) and the existing off-policy self-distillation approach, d3LLM (Qian et al., [2026](https://arxiv.org/html/2606.18195#bib.bib30)).

Table 3: Toy verification. The correct answer can be recovered from the self-teacher construction.

Training Details. Following (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)), we fix the teacher policy to the initial policy to stabilize training. We use full-vocabulary logit distillation with LoRA (Hu et al., [2022](https://arxiv.org/html/2606.18195#bib.bib36)). The default distribution divergence measure is reverse KL. The generation length and retaining ratio $\rho_{\text{teacher}}$ are set to 256 and 0.25, respectively. Additional training details are provided in [Section D.1](https://arxiv.org/html/2606.18195#A4.SS1).

Evaluation Details. We evaluate every 25 steps before step 501 and report the best results. For mathematical reasoning tasks, we evaluate model performance using generation lengths of 512 and 256. For planning tasks, we evaluate at 128 and 256. This distinction is made because longer generation lengths improve performance in mathematical reasoning tasks but degrade it in planning tasks (see Table 1). We utilize the block diffusion strategy (Arriola et al., [2025](https://arxiv.org/html/2606.18195#bib.bib26)) with a block length of 32. The number of denoising steps is set to half of the generation length.
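As a worked example of the decoding configuration above, the following sketch derives the per-block schedule implied by a generation length, a block length of 32, and denoising steps equal to half the generation length. The even split of steps across blocks and the function name `decoding_schedule` are assumptions of this example; the paper's block-diffusion details are in its appendix.

```python
def decoding_schedule(gen_length, block_length=32):
    """Derive step counts from the evaluation settings stated above.

    Assumes denoising steps are divided evenly across blocks, which is a
    simplification made for this illustration.
    """
    if gen_length <= 0 or gen_length % block_length != 0:
        raise ValueError("gen_length must be a positive multiple of block_length")
    steps = gen_length // 2                  # half of the generation length
    num_blocks = gen_length // block_length
    steps_per_block = steps // num_blocks    # even split (assumption)
    tokens_per_step = block_length // steps_per_block
    return {
        "steps": steps,
        "num_blocks": num_blocks,
        "steps_per_block": steps_per_block,
        "tokens_per_step": tokens_per_step,
    }


for length in (128, 256, 512):
    print(length, decoding_schedule(length))
```

Under these assumptions every setting reveals two tokens per step, so the per-step subset sizes sum to the generation length, consistent with Equation 10.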

Toy Verification. A critical question that must be answered before the full experiment is whether the self-teacher is strong enough to guide distillation. To verify this, we randomly sampled 500 questions from each task’s training set, obtained generations from the base model, constructed self-teacher inputs (using Pass@8) as described in [Section 3.3](https://arxiv.org/html/2606.18195#S3.SS3) under different retaining ratios $\rho_{\text{teacher}}$, and finally re-generated responses conditioned on these self-teacher inputs. As shown in Table 3, even with a moderate $\rho_{\text{teacher}} = 0.10$, the self-teacher significantly outperforms the student. At higher $\rho_{\text{teacher}}$, the self-teacher performance nearly matches that of its source (Pass@8). This toy experiment validates that our self-teacher can recover correct answers and guide high-quality distillation. Additional details and examples of this toy experiment are provided in [Section D.2](https://arxiv.org/html/2606.18195#A4.SS2).

### 4.2 Main Results

Table 1 presents a comprehensive performance comparison between SFT, RLVR, and our approach. d-OPSD consistently outperforms or matches the SFT and RLVR baselines, achieving state-of-the-art performance in most settings and showing significant improvements over the base models. Table 2 and [Figure 1](https://arxiv.org/html/2606.18195#S1.F1) detail the sample efficiency comparison between the RLVR baseline and our approach. d-OPSD demonstrates far better sample efficiency, converging in only around 10% of the optimization steps (number of gradient updates) required by RLVR. Note that the pass@$k$ sampling strategy we use in [Section 3.3](https://arxiv.org/html/2606.18195#S3.SS3) incurs the same computation overhead as RLVR (a group of $k$ rollouts) for each optimization step. Consistent with (Lu and Lab, [2025](https://arxiv.org/html/2606.18195#bib.bib2); Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)), we attribute OPSD’s superior sample efficiency to the dense supervision provided by the teacher distribution. These results underscore our approach’s promising reasoning performance and sample efficiency.

### 4.3 Comparison with AR-style OPSD: Unlocking New Knowledge

Table 4: Reasoning performance comparison between AR-style OPSD and our approach. Generation length is 256. Our teacher construction outperforms the AR-style baseline.

A pivotal design choice in our approach is the specific self-teacher construction tailored for dLLMs ([Section 3.1](https://arxiv.org/html/2606.18195#S3.SS1)). It is imperative to evaluate how this formulation compares to the AR-style construction shown in [Figure 2](https://arxiv.org/html/2606.18195#S1.F2). To this end, we ran an additional AR-style baseline strictly following (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)), which appends the reference solution to the prompt as prefix conditioning to provide privileged information to the teacher, while keeping our step-level divergence supervision ([Section 3.2](https://arxiv.org/html/2606.18195#S3.SS2)) unchanged. Table 4 (see footnote 3) reports the performance comparison results. Our approach consistently outperforms the AR-style counterpart, highlighting the importance of our specific self-teacher construction.

Footnote 3: Following (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)), the reference solution is the reasoning trajectory obtained directly from the dataset. Therefore, we did not conduct experiments on Countdown and Sudoku, as they consist only of questions and ground-truth answers without any reasoning traces.

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/fig3.png)
Figure 3: The Overlap Top-K comparison between d-OPSD and the AR-style counterpart.

We further investigate the mechanism behind this performance gap. We define the metric Overlap Top-$K_t$. At each denoising step $t$, it measures the proportion of tokens that appear simultaneously in both the student’s and the teacher’s Top-$K$ vocabulary distributions, averaged over the masked positions in the top-$k$ subset $\mathcal{K}_t$. Note that Top-$K$ and top-$k$ have different meanings. Top-$K$ refers to comparing the distribution over the vocabulary at a specific token position, while top-$k$ refers to the most confident tokens among the currently masked positions ([Section 3.2](https://arxiv.org/html/2606.18195#S3.SS2)). Formally, Overlap Top-$K_t$ can be expressed as:

$$\mathcal{M}_{\text{overlap},K,t}=\frac{1}{|\mathcal{K}_t|}\sum_{i\in\mathcal{K}_t}\left[\frac{\left|\mathcal{P}_{\text{student},t}^{i,\text{Top-}K}\cap\mathcal{P}_{\text{teacher},t}^{i,\text{Top-}K}\right|}{K}\right], \tag{13}$$
where $\mathcal{P}_t^{i,\text{Top-}K}$ is the Top-$K$ distribution over the vocabulary at token position $i$, derived from [Equation 9](https://arxiv.org/html/2606.18195#S3.E9). As shown in Figure 3, the Overlap Top-$K_t$ for AR-style OPSD is extremely high, close to 1, indicating that appending a reference solution fails to bring new knowledge or thinking patterns to the teacher for the student to learn. Conversely, the Overlap Top-$K_t$ for d-OPSD lies in a moderate range, leaving more new knowledge to be transferred from teacher to student. $K$ is set to 20 in practice.

Table 5: Reasoning performance comparison of divergence objectives.
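To make Equation 13 concrete, the following is a minimal pure-Python sketch of the overlap metric. The helper names `top_k_ids` and `overlap_top_k`, the tiny five-token vocabulary, and the tie-breaking rule (lower token id first) are illustrative assumptions, not details taken from the paper.

```python
def top_k_ids(probs, k):
    """Token ids of the k largest probabilities (ties broken by lower id)."""
    order = sorted(range(len(probs)), key=lambda v: (-probs[v], v))
    return set(order[:k])


def overlap_top_k(student, teacher, positions, k):
    """Mean over positions of |TopK_student ∩ TopK_teacher| / K, as in Eq. (13).

    student / teacher map a token position to its vocabulary distribution.
    positions plays the role of the selected subset K_t.
    """
    if not positions:
        raise ValueError("positions must be non-empty")
    total = 0.0
    for i in positions:
        shared = top_k_ids(student[i], k) & top_k_ids(teacher[i], k)
        total += len(shared) / k
    return total / len(positions)


# Toy example: two selected positions, vocabulary of 5 tokens, K = 2.
student = {0: [0.5, 0.3, 0.1, 0.05, 0.05], 1: [0.1, 0.1, 0.6, 0.15, 0.05]}
teacher = {0: [0.4, 0.1, 0.4, 0.05, 0.05], 1: [0.05, 0.05, 0.7, 0.15, 0.05]}
print(overlap_top_k(student, teacher, positions=[0, 1], k=2))
```

In this toy case the first position shares one of two Top-2 tokens and the second shares both, so the metric averages to 0.75. A value close to 1, as reported for AR-style OPSD, would mean the teacher's preferred tokens are almost the same as the student's.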

### 4.4 Ablation Studies

Additional ablation studies are provided in [Appendix E](https://arxiv.org/html/2606.18195#A5).

Table 6: Reasoning performance comparison of retaining ratios.

Divergence Objective. We compare reverse KL (default) and forward KL in Table 5. Reverse KL clearly outperforms forward KL. We attribute this to the mode-seeking behavior of reverse KL (Agarwal et al., [2024](https://arxiv.org/html/2606.18195#bib.bib1)), which is more robust than the mode-covering behavior of forward KL.
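The mode-seeking versus mode-covering distinction can be illustrated numerically. The sketch below assumes the usual convention in on-policy distillation, where reverse KL is KL(student || teacher) and forward KL is KL(teacher || student); the three-token distributions are made up for illustration and do not come from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; requires q[v] > 0 wherever p[v] > 0."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0)


teacher = [0.49, 0.02, 0.49]          # two modes
student_mode = [0.96, 0.02, 0.02]     # commits to one teacher mode
student_cover = [1 / 3, 1 / 3, 1 / 3] # spreads mass over everything

for name, student in (("mode", student_mode), ("cover", student_cover)):
    reverse = kl(student, teacher)  # KL(student || teacher)
    forward = kl(teacher, student)  # KL(teacher || student)
    print(f"{name}: reverse={reverse:.3f} forward={forward:.3f}")
```

In this toy setting, reverse KL assigns the lower loss to the student that commits to a single mode, while forward KL assigns the lower loss to the student that covers both modes, which is the behavior the paragraph above refers to.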

Retaining Ratio. We observe that different retaining ratios $\rho_{\text{teacher}}$ have a moderate influence on overall performance. As shown in Table 6, all configurations improve over the base model and surpass the RLVR baseline. Interestingly, $\rho_{\text{teacher}}=0.10$ yields better results than $\rho_{\text{teacher}}=0.50$, despite being a weaker teacher as shown in Table 3. This suggests that while an accurate teacher is beneficial, distillation effectiveness is not determined solely by teacher performance.

Table 7: Reasoning performance comparison of $\mathcal{K}_t$ selections.

Top-$k$ Subset $\mathcal{K}_t$ Selection. As noted in [Section 3.2](https://arxiv.org/html/2606.18195#S3.SS2), $\mathcal{K}_t$ can be selected using either the student distribution or the teacher distribution. Table 7 compares these two choices. Deriving $\mathcal{K}_t$ from the teacher distribution yields higher performance, as it forces the student to align with the positions where the teacher policy is most confident, providing a stronger learning signal.
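The two selection choices can be sketched as follows. This is an illustration only: it assumes that "confidence" at a masked position means the maximum probability of that position's vocabulary distribution, and the function name `select_top_k_positions` and the toy distributions are hypothetical.

```python
def select_top_k_positions(probs_by_position, masked_positions, k):
    """Return the k masked positions with the highest max-probability.

    Assumption: confidence at a position is the probability of its
    most likely token. Ties are broken by the lower position index.
    """
    ranked = sorted(
        masked_positions,
        key=lambda i: (-max(probs_by_position[i]), i),
    )
    return ranked[:k]


# Toy two-token vocabulary; four positions are still masked.
teacher = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.8, 0.2], 3: [0.55, 0.45]}
student = {0: [0.5, 0.5], 1: [0.95, 0.05], 2: [0.7, 0.3], 3: [0.85, 0.15]}
masked = [0, 1, 2, 3]

print("teacher-selected:", select_top_k_positions(teacher, masked, k=2))
print("student-selected:", select_top_k_positions(student, masked, k=2))
```

The example is built so that the two policies pick disjoint subsets, which shows why the choice of distribution used for selection can change which positions receive supervision.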

Table 8: Reasoning performance comparison of sampling strategies.

Pass@$k$. As noted in [Section 3.3](https://arxiv.org/html/2606.18195#S3.SS3), we employ a sampling strategy akin to pass@$k$, sampling trajectories until a correct answer appears or $k$ iterations are reached. Table 8 evaluates the impact of varying $k$. Although $k=1$ slightly degrades reasoning performance compared to $k=8$, it still surpasses the RLVR baseline with an even greater sample efficiency than $k=8$.
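The sampling loop described above can be sketched in a few lines. The callables `generate` and `is_correct` are hypothetical stand-ins for the student rollout and the answer verifier, and returning `None` when no attempt succeeds is an assumption about how the budget exhaustion case is handled.

```python
def sample_until_correct(generate, is_correct, k):
    """Sample up to k trajectories and return the first correct one.

    Returns (trajectory, attempts_used). If none of the k attempts is
    correct, returns (None, k); what happens next is left to the caller.
    """
    for attempt in range(1, k + 1):
        trajectory = generate()
        if is_correct(trajectory):
            return trajectory, attempt
    return None, k


# Deterministic stand-in for a stochastic student: a fixed list of answers.
answers = iter(["7", "9", "12", "12"])
result = sample_until_correct(lambda: next(answers), lambda a: a == "12", k=8)
print(result)
```

With `k=1` this reduces to a single rollout per question, which corresponds to the cheaper setting discussed in the ablation.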

Table 9: Reasoning performance comparison of clipping.

Per-token Pointwise Clipping. As noted in [Section C.1](https://arxiv.org/html/2606.18195#A3.SS1), we adopt a pointwise clipping strategy following (Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)). Table 9 shows that pointwise clipping substantially improves the performance of d-OPSD. More importantly, we observe that clipping stabilizes training in most settings, which explains the performance gap. In contrast, the no-clipping variant starts to collapse around step 150, with performance finally dropping to 69.37 by step 500. The clipping threshold is set to 0.05 in practice.

### 4.5 Failure Modes

We wish to transparently share a failure mode observed with our current approach. Although it is highly effective in both reasoning performance and sample efficiency, we find that, similar to RLVR, OPSD is prone in some settings to policy collapse after reaching peak performance.

As shown in [Figure 12](https://arxiv.org/html/2606.18195#A5.F12), training sometimes degrades catastrophically. The same phenomenon is commonly observed in RLVR (Deng et al., [2025](https://arxiv.org/html/2606.18195#bib.bib37); Bai et al., [2025](https://arxiv.org/html/2606.18195#bib.bib38)). We hypothesize that this collapse may stem from the mode-seeking behavior (Agarwal et al., [2024](https://arxiv.org/html/2606.18195#bib.bib1)) becoming overly narrow, preventing further learning.

## 5 Related Works

Additional related works are provided in [Section A.2](https://arxiv.org/html/2606.18195#A1.SS2).

On-policy Distillation. Knowledge distillation (Hinton et al., [2015](https://arxiv.org/html/2606.18195#bib.bib39)) transfers knowledge from a large teacher model to a smaller student model by training on the teacher’s soft output distributions. Kim and Rush ([2016](https://arxiv.org/html/2606.18195#bib.bib40)); Jiao et al. ([2020](https://arxiv.org/html/2606.18195#bib.bib41)); Wang et al. ([2020](https://arxiv.org/html/2606.18195#bib.bib42)) extended it to sequence-level distillation, establishing the dominant off-policy distillation approaches. Gu et al. ([2024](https://arxiv.org/html/2606.18195#bib.bib43)); Agarwal et al. ([2024](https://arxiv.org/html/2606.18195#bib.bib1)); Lu and Lab ([2025](https://arxiv.org/html/2606.18195#bib.bib2)); Yang et al. ([2026](https://arxiv.org/html/2606.18195#bib.bib44)) extended it to on-policy distillation (OPD), addressing the exposure bias (Bengio et al., [2015](https://arxiv.org/html/2606.18195#bib.bib8)) mismatch by shifting the training distribution to the student’s own generations.

## 6 Conclusion

This work presents d-OPSD, the first on-policy self-distillation approach for dLLMs. It is specifically tailored to the on-policy setting and to the nature of dLLMs. We propose a novel self-teacher construction that uses the model’s own self-generated answers as suffix conditioning for privileged information, guiding the student to learn from its on-policy “self-future experience”. Furthermore, we shift the dense divergence supervision from the token level to the step level, matching the iterative denoising mechanics of dLLMs. Future work will explore advanced techniques to further stabilize and enhance the OPSD post-training of dLLMs.

## References

- R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024) On-policy distillation of language models: learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations.
- M. Arriola, A. Gokaslan, J. T. Chiu, Z. Yang, Z. Qi, J. Han, S. S. Sahoo, and V. Kuleshov (2025) Block diffusion: interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573.
- B. Bai, H. Wu, P. Ye, and T. Chen (2025) M-grpo: stabilizing self-supervised reinforcement learning for large language models with momentum-anchored policy optimization. arXiv preprint arXiv:2512.13070.
- S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. Advances in Neural Information Processing Systems 28.
- T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan, et al. (2025) Llada2.0: scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745.
- S. Cheng, Y. Bian, D. Liu, L. Zhang, Q. Yao, Z. Tian, W. Wang, Q. Guo, K. Chen, B. Qi, et al. (2025) Sdar: a synergistic diffusion-autoregression paradigm for scalable sequence generation. arXiv preprint arXiv:2510.06303.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- T. Dao (2023) Flashattention-2: faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- W. Deng, Y. Li, B. Gong, Y. Ren, C. Thrampoulidis, and X. Li (2025) On grpo collapse in search-r1: the lazy likelihood-displacement death spiral. arXiv preprint arXiv:2512.04220.
- N. Fathi, T. Scholak, and P. Noël (2025) Unifying autoregressive and diffusion-based sequence generation. arXiv preprint arXiv:2504.06416.
- S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang (2025) Diffucoder: understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024) Minillm: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- X. Han, S. Kumar, and Y. Tsvetkov (2023) Ssd-lm: semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11575–11596.
- G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models. ICLR 1(2), pp. 3.
- Z. Huang, Z. Chen, Z. Wang, T. Li, and G. Qi (2025) Reinforcing the diffusion chain of lateral thought with diffusion language models. arXiv preprint arXiv:2505.10446.
- J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026) Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.
- A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. (2024) Openai o1 system card. arXiv preprint arXiv:2412.16720.
- X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) Tinybert: distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174.
- S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha, S. Ermon, et al. (2025) Mercury: ultra-fast language models based on diffusion. arXiv e-prints, pp. arXiv–2506.
- Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327.
- Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, et al. (2026) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016.
- Y. Liang, Z. Wang, H. Chen, X. Sun, J. Wu, X. Yu, J. Liu, E. Barsoum, Z. Liu, and N. K. Jha (2026) CD4LM: consistency distillation and adaptive decoding for diffusion language models. arXiv preprint arXiv:2601.02236.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- K. Lu and T. M. Lab (2025) On-policy distillation. Thinking Machines Lab: Connectionism. https://thinkingmachines.ai/blog/on-policy-distillation. External links: [Document](https://dx.doi.org/10.64434/tml.20251026).
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992.
- J. Ou, J. Han, M. Xu, S. Xu, J. Xie, S. Ermon, Y. Wu, and C. Li (2025) Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759.
- J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li (2024) Your absorbing discrete diffusion secretly models the conditional distributions of clean data. arXiv preprint arXiv:2406.03736.
- L. Pan, S. Tao, Y. Zhai, Z. Fu, L. Fang, M. He, L. Zhang, Z. Liu, B. Ding, A. Liu, et al. (2025) D-treerpo: towards more reliable policy optimization for diffusion language models. arXiv preprint arXiv:2512.09675.
- Y. Qian, J. Su, L. Hu, P. Zhang, Z. Deng, P. Zhao, and H. Zhang (2026) D3LLM: ultra-fast diffusion llm using pseudo-trajectory distillation. arXiv preprint arXiv:2601.07568.
- K. Rojas, J. Lin, K. Rasul, A. Schneider, Y. Nevmyvaka, M. Tao, and W. Deng (2025) Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554.
- I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026) Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897.
- Y. Song, Z. Zhang, C. Luo, P. Gao, F. Xia, H. Luo, Z. Li, Y. Yang, H. Yu, X. Qu, et al. (2025) Seed diffusion: a large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193.
- X. Tang, R. Dolga, S. Yoon, and I. Bogunovic (2025) Wd1: weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838.
- L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec (2020) TRL: Transformers Reinforcement Learning. External links: [Link](https://github.com/huggingface/trl).
- W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems 33, pp. 5776–5788.
- Y. Wang, L. Yang, B. Li, Y. Tian, K. Shen, and M. Wang (2025) Revolutionizing reinforcement learning framework for diffusion large language models. arXiv preprint arXiv:2509.06949.
- C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2025) Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618.
- B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026) Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780.
- S. Xie, L. Kong, X. Song, X. Dong, G. Chen, E. P. Xing, and K. Zhang (2025) Step-aware policy optimization for reasoning in diffusion large language models. arXiv preprint arXiv:2510.01544.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin (2026) Learning beyond teacher: generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125.
- J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487.
- S. Zhao, D. Gupta, Q. Zheng, and A. Grover (2025) D1: scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216.
- S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.
- F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y. Lin, J. Wen, et al. (2025) Llada 1.5: variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.

## Appendix A Additional Preliminaries and Related Works

### A.1 Additional Preliminaries

Block-diffusion. In practice, the block-diffusion inference strategy \[Han et al., [2023](https://arxiv.org/html/2606.18195#bib.bib28), Arriola et al., [2025](https://arxiv.org/html/2606.18195#bib.bib26), Fathi et al., [2025](https://arxiv.org/html/2606.18195#bib.bib27)\] is commonly used in current dLLMs. This hybrid approach partitions a response $y$ into $B$ contiguous, non-overlapping blocks $\{\text{block}_1,\text{block}_2,\cdots,\text{block}_B\}$, with each block containing $L'=\frac{L}{B}$ tokens. Inference is autoregressive at the block level and diffusion-style within each block: the next block starts decoding only after the previous block is fully decoded.

### A.2 Additional Related Works

Reinforcement Learning for dLLMs. Reinforcement learning (RL) has emerged as a critical post-training technique for enhancing the reasoning capabilities of dLLMs. Most existing works \[Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20), Huang et al., [2025](https://arxiv.org/html/2606.18195#bib.bib45), Tang et al., [2025](https://arxiv.org/html/2606.18195#bib.bib21), Gong et al., [2025](https://arxiv.org/html/2606.18195#bib.bib46), Rojas et al., [2025](https://arxiv.org/html/2606.18195#bib.bib47), Ou et al., [2025](https://arxiv.org/html/2606.18195#bib.bib48)\] directly apply GRPO to dLLMs, using either one-step estimation or the ELBO to estimate the log-probability in GRPO. However, most of them suffer from the fundamental challenges of RLVR: heavy computation overhead and the bottleneck of sparse rewards.

## Appendix B Self-teacher Construction Illustrations

Here we provide an example of how our self-teacher construction ([Section 3.1](https://arxiv.org/html/2606.18195#S3.SS1)) works, with a question sampled from the GSM8K training set. For brevity, we omit some mask and “end-of-text” tokens.

The question is:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app1.png)
Figure 4: A question from the GSM8K training set.

First, we sample an on-policy trajectory from the student model (see footnote 4) and obtain the final clean answer as the self-generated future:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app2.png)
Figure 5: The self-generated future answer.

At denoising step $t=20$, we have the student decoding status as follows:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app3.png)
Figure 6: Current student decoding status.

We then construct the self-teacher at step $t=20$ as follows:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app4.png)
Figure 7: Self-teacher construction at $t=20$.

For comparison, we also illustrate the AR-style construction, which appends a reference solution to the prompt, as shown in [Figure 8](https://arxiv.org/html/2606.18195#A2.F8).

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app5.png)
Figure 8: AR-style teacher construction.

Footnote 4: Using pass@$k$, we keep sampling until a correct final answer appears or the iteration threshold is reached.

## Appendix C Additional Implementation Details

### C.1 Per-Token Pointwise Clipping

Following \[Zhao et al., [2026](https://arxiv.org/html/2606.18195#bib.bib9)\], we apply pointwise clipping to the vocabulary-level divergence contributions. The reason is that token-level divergence is highly skewed across vocabulary entries, and our ablation study in [Section 4.4](https://arxiv.org/html/2606.18195#S4.SS4) empirically validates that pointwise clipping stabilizes training and leads to better performance.
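As a rough illustration of what clipping vocabulary-level contributions could look like, the sketch below caps each per-entry reverse-KL term at a threshold before the terms are summed. This is one plausible reading only: the exact clipping rule used by Zhao et al. (2026) and by d-OPSD is not given in this section, and the function names, the upper-bound-only clipping, and the toy distributions are all assumptions.

```python
import math


def reverse_kl_terms(student, teacher):
    """Per-vocabulary-entry contributions p_s(v) * log(p_s(v) / p_t(v))."""
    return [
        ps * math.log(ps / pt) if ps > 0 else 0.0
        for ps, pt in zip(student, teacher)
    ]


def clip_terms(terms, clip=0.05):
    """Cap each contribution at `clip` (assumed form of pointwise clipping)."""
    return [min(term, clip) for term in terms]


# Toy distributions over a 3-token vocabulary at one position.
student = [0.9, 0.05, 0.05]
teacher = [0.3, 0.35, 0.35]

raw = reverse_kl_terms(student, teacher)
clipped = clip_terms(raw, clip=0.05)
print("raw terms:    ", [round(t, 4) for t in raw])
print("clipped terms:", [round(t, 4) for t in clipped])
```

The point of the example is the skew: a single vocabulary entry carries almost all of the raw divergence, and clipping limits how much that one entry can dominate the gradient signal.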

### C.2 Inputs Concatenation

dLLMs generate responses through an iterative denoising process, where each iteration requires full attention over all token positions. Consequently, computing the loss objective in [Equation 12](https://arxiv.org/html/2606.18195#S3.E12) can easily run out of memory, because the full-attention gradient maps over all token positions at every step must be stored until a trajectory is fully decoded. To address this issue, we use an engineering technique: concatenating the inputs from every step of a trajectory into a single batch. Specifically, assume that the student decoding status is a tensor of shape (bsz, seq-length). Instead of feeding it into the model step by step to compute the corresponding term in [Equation 12](https://arxiv.org/html/2606.18195#S3.E12), we concatenate the status tensors across all steps of this trajectory to form a “batch” tensor of shape (bsz $\times$ steps, seq-length). Since all inputs share the same model, the gradient remains constant for each input and no longer needs to be stored as before.
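The shape manipulation described above can be sketched with plain nested lists standing in for tensors. The mask id `MASK`, the helper names, and the toy token ids are hypothetical; only the (bsz, seq-length) to (bsz × steps, seq-length) reshaping comes from the text.

```python
MASK = -1  # hypothetical mask-token id, for illustration only


def concat_steps(step_states):
    """Flatten a list over steps of (bsz, seq_len) grids into (bsz * steps, seq_len)."""
    return [list(row) for state in step_states for row in state]


def locate(flat_index, bsz):
    """Map a row of the flattened batch back to (step, sample)."""
    return divmod(flat_index, bsz)


# Decoding status for bsz = 2, seq_len = 4, recorded at 3 denoising steps.
step_states = [
    [[MASK, MASK, MASK, MASK], [MASK, MASK, MASK, MASK]],
    [[5, MASK, MASK, MASK], [MASK, 7, MASK, MASK]],
    [[5, MASK, 9, MASK], [3, 7, MASK, MASK]],
]

batch = concat_steps(step_states)
print(len(batch), len(batch[0]))  # rows = bsz * steps, columns = seq_len
print(locate(3, bsz=2), batch[3])
```

With real tensors, the same operation would typically be a concatenation along the batch dimension (for example `torch.cat(states, dim=0)` in PyTorch), followed by a single forward pass over the flattened batch.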

### C.3 Compute only on Correct Generations

By default, we compute the loss objective [Equation 12](https://arxiv.org/html/2606.18195#S3.E12) only on correct generations (see footnote 5). Although computing on all generations also improves the model’s reasoning performance, our default setting achieves superior results. Detailed experimental results are provided in [Section E.1](https://arxiv.org/html/2606.18195#A5.SS1).

Footnote 5: For the Sudoku task, there are no “right” or “wrong” answers because it gives a score in $[0,1]$. Therefore, we set a threshold to decide whether a generation should be included in the loss computation. In practice, the threshold is set to 0.25.
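The filtering rule can be written as a small predicate. This is a sketch under assumptions: the function name `include_in_loss` is hypothetical, non-Sudoku tasks are assumed to produce a binary score of 0 or 1, and the Sudoku threshold is assumed to be inclusive, which the text does not specify.

```python
def include_in_loss(score: float, task: str, sudoku_threshold: float = 0.25) -> bool:
    """Decide whether a generation contributes to the distillation loss."""
    if task == "sudoku":
        # Assumption: scores at or above the threshold are kept.
        return score >= sudoku_threshold
    # Assumption: other tasks score 1.0 for a correct answer and 0.0 otherwise.
    return score >= 1.0


rollouts = [
    ("gsm8k", 1.0),
    ("gsm8k", 0.0),
    ("sudoku", 0.40),
    ("sudoku", 0.10),
]
kept = [(task, score) for task, score in rollouts if include_in_loss(score, task)]
print(kept)
```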

## Appendix D Additional Experiment Details

### D.1 Training Details

We used the TRL library \[von Werra et al., [2020](https://arxiv.org/html/2606.18195#bib.bib49)\] to implement d-OPSD. We employed Low-Rank Adaptation (LoRA) with a rank of $r=128$ and scaling factor $\alpha=64$. Training was conducted on 4 NVIDIA GPUs, with a learning rate of $5\times 10^{-6}$, gradient accumulation steps of 1, the AdamW optimizer \[Loshchilov and Hutter, [2017](https://arxiv.org/html/2606.18195#bib.bib50)\], and Flash Attention 2 \[Dao, [2023](https://arxiv.org/html/2606.18195#bib.bib51)\]. The RLVR baseline diffu-GRPO \[Zhao et al., [2025](https://arxiv.org/html/2606.18195#bib.bib20)\] in [Figure 1](https://arxiv.org/html/2606.18195#S1.F1) was reproduced on 8 NVIDIA GPUs, following its default settings.

### D.2 Toy Experiment Details and Examples

The generation length is 256 for all tasks. After applying the self-teacher construction, the number of remaining mask tokens becomes smaller than the generation length of 256. We keep unmasking 2 tokens per step, with a block length of 32.

One important point is that, to prevent leaking the final answer (e.g., when the final answer between the \<answer\> tags is retained in the self-teacher construction), every time we move to a new block, we clear all unmasked tokens in that block, leaving it filled entirely with mask tokens.

We provide an example from the GSM8K training set. The question is:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app6.png)
Figure 9: A question from the GSM8K training set.

First, we sample a generation from the student model (see footnote 6) and obtain the final clean answer:

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app7.png)
Figure 10: The self-generated answer.

We then construct the self-teacher by partially revealing the final generation, as shown in [Figure 11](https://arxiv.org/html/2606.18195#A4.F11).

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app8.png)
Figure 11: Self-teacher in the toy experiment.

Footnote 6: Using pass@$k$, we keep sampling until a correct final answer appears or the iteration threshold is reached.

Table 10: Reasoning performance comparison of teacher fixing.

## Appendix E Additional Experiment Results

### E.1 Additional Ablation Studies

Fixing the Teacher. We find that fixing the teacher model leads to greater performance gains, as shown in Table 10. Notably, even when the teacher is not fixed, d-OPSD's reasoning performance nearly matches the RLVR baselines, further demonstrating its effectiveness.

Table 11: Reasoning performance comparison.

Compute only on Correct Generations. As shown in Table 11, computing the loss on all trajectories leads to a slight performance degradation. Nevertheless, it still outperforms the RLVR baseline.

### E.2 Qualitative Examples on GSM8k

We provide a qualitative example from the GSM8k test set, where the RLVR model gives an incorrect answer while our approach yields the correct one, as shown in [Figure 13](https://arxiv.org/html/2606.18195#A5.F13).

### E.3 Failure Mode

[Figure 12](https://arxiv.org/html/2606.18195#A5.F12) presents the failure mode mentioned in [Section 4.5](https://arxiv.org/html/2606.18195#S4.SS5).

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/fig4.png)
Figure 12: Failure mode of collapse.

![Refer to caption](https://arxiv.org/html/2606.18195v1/figure/app9.png)
Figure 13: Qualitative examples on GSM8k.
