Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

arXiv cs.AI 05/14/26, 04:00 AM Papers
Summary
The paper proposes EGRSD and CL-EGRSD, on-policy self-distillation methods that weight token-level supervision by teacher entropy to improve reasoning accuracy-length tradeoff in LLMs, evaluated on Qwen3-4B and Qwen3-8B.
arXiv:2605.13255v1 Announce Type: new Abstract: On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
Cached at: 05/14/26, 06:15 AM
# Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
Source: [https://arxiv.org/html/2605.13255](https://arxiv.org/html/2605.13255)
Junlong Ke (Tsinghua University), Zichen Wen (Shanghai Jiao Tong University), Weijia Li (Tsinghua University; Shanghai AI Laboratory), Conghui He (Shanghai AI Laboratory), Linfeng Zhang (Shanghai Jiao Tong University)

Junlong Ke and Zichen Wen contributed equally. Conghui He and Linfeng Zhang are the corresponding authors.

###### Abstract

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher’s token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher’s predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded *direction*, a teacher–student likelihood-ratio *magnitude*, and the proposed teacher-entropy *confidence* gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy–length frontier among the compared trainable methods.

## 1 Introduction

Recent large language models exhibit strong multi-step inference capabilities, but chain-of-thought (CoT) reasoning (Wei et al., [2022](https://arxiv.org/html/2605.13255#bib.bib23); OpenAI, [2024](https://arxiv.org/html/2605.13255#bib.bib17)) often produces an excessive number of intermediate reasoning tokens. Reasoning-optimized checkpoints frequently produce redundant verification loops, simulated self-correction markers, and repeated intermediate derivations. This verbosity increases inference latency and cost, motivating methods that preserve reasoning accuracy while reducing unnecessary computation.

On-policy self-distillation is a natural approach to this problem. Instead of imitating fixed offline demonstrations, the student samples its own reasoning trajectories and receives dense token-level supervision from a privileged teacher (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30); Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19); Yang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib26)). The teacher may be the same base model conditioned on a reference solution, so the student learns from feedback on trajectories it actually visits. This avoids the train–test mismatch of offline distillation and gives a supervision signal at every completion token rather than only at the final answer.

Dense feedback, however, is not the same as reliable feedback. A reasoning completion contains heterogeneous token positions: some are low-entropy deterministic computation (arithmetic continuation, expression simplification, equation closure), while others are high-entropy branching points where the distribution assigns probability mass to multiple valid continuations (induction versus enumeration, revising a previous derivation, or discourse-level transitions). Prior work has observed analogous heterogeneity in code self-distillation (Zhang et al., [2026a](https://arxiv.org/html/2605.13255#bib.bib27)). A privileged teacher can be sharply peaked at the former positions and diffuse at the latter, and uniformly weighting these signals risks over-emphasizing high-variance supervision. In long mathematical reasoning, we additionally identify a third regime: *transient* high-entropy positions whose following context rapidly becomes low-entropy.

These transient positions are strategy-shift *pivots*, not sustained branching *forks*, and blindly suppressing all high-entropy tokens would destroy the transition signal that the pivots carry.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x1.png)

Figure 1: Accuracy–length trade-off on Qwen3-8B: EGRSD and CL-EGRSD (ours) extend the Pareto frontier. All trainable baselines are dominated.

This issue is especially important in self-distillation. Unlike offline distillation with a superior external model, the privileged teacher here is the base policy conditioned on augmented context. Its entropy therefore measures the concentration of the teacher’s next-token predictive distribution under the privileged view. We refer to this distributional concentration as *teacher confidence*. Because the teacher is conditioned on privileged information, this confidence acts as a practical proxy for the reliability of token-level supervision. Prior analysis further suggests that *suppressing* high-entropy teacher tokens can degrade reasoning (Kim et al., [2026](https://arxiv.org/html/2605.13255#bib.bib10)), motivating a non-zero lower bound in any confidence gate we introduce. Together with the outcome-reward sign and the teacher–student log-likelihood ratio used by recent direction-aware objectives, teacher predictive entropy provides a third, previously unused signal for token-level self-distillation. Concurrent work (RLSD; Yang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib26)) demonstrates the value of decoupling direction and magnitude. EGRSD adds the entropy-based confidence component on top of this decomposition.

We propose Entropy-Guided Reinforced Self-Distillation (EGRSD). For each token in a rollout, EGRSD computes the privileged-teacher entropy, normalizes it within the minibatch, and down-weights the direction-aware token update by a multiplicative confidence gate $\omega_{i,t} \in [0.1, 1]$ (formalized in §[4.2](https://arxiv.org/html/2605.13255#S4.SS2)). Low-entropy computation tokens pass through with nearly full weight, while high-entropy positions are attenuated but retain a non-zero floor so branching positions with valid continuation diversity are not discarded.

A second variant, CL-EGRSD, addresses transient high-entropy transition points. Some locally high-entropy tokens initiate a branch whose subsequent continuation rapidly becomes low-entropy. CL-EGRSD replaces instantaneous entropy with the minimum entropy over a short causal future window, separating sustained high-entropy spans from transient strategy-shift positions.

On Qwen3-4B and Qwen3-8B, the resulting confidence-gated update advances the accuracy–length frontier among the compared trainable methods (Figure [1](https://arxiv.org/html/2605.13255#S1.F1)). Ablations show that moderate entropy attenuation gives the most stable performance, and that lookahead helps most on the larger model, where pivots are easier to exploit.

Our contributions are:

- We identify teacher predictive entropy as a missing third signal in on-policy self-distillation, complementing outcome-reward direction and teacher–student magnitude.
- We instantiate this signal as EGRSD: a minimal extension of direction-aware self-distillation that gates the token update by the privileged teacher’s entropy with a non-zero lower bound on every token weight.
- We extend EGRSD to CL-EGRSD, a causal-lookahead variant that preserves transient high-entropy *pivot* tokens whose uncertainty resolves within a short future window, and validate both methods with main results, mechanism diagnostics, and ablations on Qwen3-4B and Qwen3-8B.

## 2 Related work

We focus on the two lines of work most directly relevant to EGRSD. Additional context on long-form reasoning compression, RLVR-style token-level credit assignment, and a full method-by-method positioning comparison is deferred to Appendix [A](https://arxiv.org/html/2605.13255#A1).

#### On\-policy distillation and privileged self\-distillation\.

On-policy distillation addresses the train–test mismatch of offline distillation by training on trajectories sampled from the student while a teacher provides dense token-level feedback (Song and Zheng, [2026](https://arxiv.org/html/2605.13255#bib.bib21)). OPSD adapts this idea to self-distillation: the same model acts as student and privileged teacher, with the teacher conditioned on additional information such as reference solutions (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)). This design removes the need for a separate large teacher and gives dense supervision on the student’s own rollouts. Recent work further explores competence-aware weighting (Xu et al., [2026](https://arxiv.org/html/2605.13255#bib.bib25)), consensus gating under privileged context (Stein et al., [2026](https://arxiv.org/html/2605.13255#bib.bib22)), and compression-oriented variants such as CRISP (Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19)). A common blind spot remains: existing OPSD-style objectives provide dense token-level targets but do not explicitly account for the concentration of the teacher distribution at each token. Concurrent work on direction-aware self-distillation (RLSD; Yang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib26)) couples outcome-reward direction with a teacher–student likelihood-ratio magnitude but leaves teacher confidence unused. EGRSD reuses the same coupling and additionally weights each token by the privileged teacher’s predictive entropy.

#### Uncertainty, selectivity, and teacher confidence\.

Not all dense supervision is equally useful. Selective instruction-tuning and process supervision show that fine-grained filtering can improve how models learn from completions (Li et al., [2024](https://arxiv.org/html/2605.13255#bib.bib13); Lightman et al., [2024](https://arxiv.org/html/2605.13255#bib.bib14)), and a complementary analysis (Kim et al., [2026](https://arxiv.org/html/2605.13255#bib.bib10)) finds that high-entropy teacher tokens carry uncertainty signaling that should be preserved rather than flattened. This finding motivates our non-zero floor on $\omega_{i,t}$. Concurrent with our work, SSD (Zhang et al., [2026a](https://arxiv.org/html/2605.13255#bib.bib27)) observes that code generation interleaves *lock* positions (unambiguous continuations) with *fork* positions (multiple plausible continuations), and reshapes token distributions differently at the two position types by temperature-truncated sampling. Entropy-aware on-policy distillation (Jin et al., [2026](https://arxiv.org/html/2605.13255#bib.bib9)) instead mixes reverse and forward KL terms to alter the distillation geometry. EGRSD uses teacher entropy differently from both: it is a multiplicative confidence gate on the token-level RLSD update, making the lock/fork heterogeneity explicit without switching divergence objectives or training a separate uncertainty estimator. CL-EGRSD additionally rescues transient high-entropy *pivot* positions whose continuation rapidly becomes confident, a regime not covered by SSD’s two-category formulation.

## 3 Background

We summarize the two objectives EGRSD builds on. Derivations, stop-gradient rationale, and advantage-whitening details are deferred to Appendix [B](https://arxiv.org/html/2605.13255#A2).

#### Notation\.

Let $y_i = (y_{i,1}, \ldots, y_{i,T_i})$ denote the $i$-th on-policy rollout sampled from the student policy $p_\theta$ on a problem $x$, with rollout positions $\mathcal{C}_i$ and mask $m_{i,t} = \mathbb{1}[t \in \mathcal{C}_i]$. The *teacher* $p_T$ is the initial policy held fixed throughout training, conditioned on the privileged context $(x, s^\star)$ where $s^\star$ is a reference solution. The *student* $p_S = p_\theta$ is initialized from the same weights, is the only trainable component, and is conditioned only on $x$.

#### On\-policy self\-distillation \(OPSD\)\.

OPSD (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)) aligns the student with the teacher at every rollout position:

$$\mathcal{L}_{\mathrm{OPSD}} = \sum_{i,t:\, t \in \mathcal{C}_i} \mathrm{KL}\!\big(p_T(\cdot \mid x, s^\star, y_{i,<t}) \,\|\, p_\theta(\cdot \mid x, y_{i,<t})\big), \tag{1}$$

and CRISP (Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19)) adapts the same per-token alignment framework toward compression. Neither weights tokens by teacher confidence.
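As a concrete reading of Eq. (1), the following minimal sketch sums a per-position KL term over the masked positions of one rollout. It assumes the teacher and student next-token distributions are already available as plain probability lists. The function names and the toy values are illustrative and do not come from the paper's code.

```python
import math

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(p_teacher || p_student) for two probability vectors over the vocabulary."""
    return sum(
        pt * (math.log(pt + eps) - math.log(ps + eps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

def opsd_loss(teacher_dists, student_dists, mask):
    """Sum of per-position KL terms over masked rollout positions (Eq. 1)."""
    return sum(
        kl_divergence(pt, ps)
        for pt, ps, m in zip(teacher_dists, student_dists, mask)
        if m
    )

# Toy rollout: 2 positions over a 3-token vocabulary (illustrative values only).
teacher = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
student = [[0.6, 0.2, 0.2], [0.4, 0.3, 0.3]]
loss = opsd_loss(teacher, student, mask=[1, 1])
```

Only the first position contributes here, because the two distributions agree at the second one. Every position enters with the same weight, which is the uniform treatment the text refers to.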

#### Direction\-aware self\-distillation \(RLSD\)\.

RLSD (Yang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib26)) replaces the teacher-driven update direction with an outcome-reward one. Each rollout receives a length-shaped reward $r_i = \mathbb{1}[y_i \text{ correct}] \cdot (1 + \beta_L (1 - |y_i| / L_{\mathrm{max}}))$ that is whitened into a sequence-level advantage $A_i$ with direction $D_i = \mathrm{sign}(A_i)$, shared across tokens in a rollout. The teacher–student log-ratio $\delta_{i,t} = \log p_T(y_{i,t} \mid x, s^\star, y_{i,<t}) - \log p_S(y_{i,t} \mid x, y_{i,<t})$ is exponentiated and clipped into a multiplicative magnitude

$$w_{i,t} = \mathrm{clip}\!\big(\exp(D_i\, \delta_{i,t}),\, 1 - \varepsilon,\, 1 + \varepsilon\big), \qquad \varepsilon = 0.2, \tag{2}$$

with stop-gradient on $p_S$ so that only $\log p_\theta$ in the final loss receives gradient. The RLSD loss is

$$\mathcal{L}_{\mathrm{RLSD}} = -\tfrac{1}{\sum_{i,t} m_{i,t}} \sum_{i,t} m_{i,t}\, A_i\, w_{i,t}\, \log p_\theta(y_{i,t} \mid x, y_{i,<t}). \tag{3}$$

RLSD therefore couples reward-driven direction with a teacher–student likelihood ratio but treats every token under that ratio as equally informative. EGRSD modulates the same coupling by the teacher’s predictive entropy.
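The clipped magnitude of Eq. (2) and the masked mean of Eq. (3) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the RLSD implementation: log-probabilities are passed in as nested lists, the advantages are assumed to be already whitened, and the helper names are hypothetical. In an autodiff framework, the student log-probability inside the magnitude would be detached.

```python
import math

def sign(a):
    """Direction D_i = sign(A_i), returned as -1, 0, or 1."""
    return (a > 0) - (a < 0)

def rlsd_magnitude(direction, logp_teacher, logp_student, eps=0.2):
    """w = clip(exp(D * delta), 1 - eps, 1 + eps), with delta the teacher-student log-ratio."""
    delta = logp_teacher - logp_student  # treated as a constant (stop-gradient)
    return min(max(math.exp(direction * delta), 1.0 - eps), 1.0 + eps)

def rlsd_loss(advantages, logp_teacher, logp_student, mask):
    """Masked mean of -A_i * w_{i,t} * log p_theta(y_{i,t}) over rollouts i and positions t."""
    total, count = 0.0, 0
    for a, lt_row, ls_row, m_row in zip(advantages, logp_teacher, logp_student, mask):
        d = sign(a)
        for lt, ls, m in zip(lt_row, ls_row, m_row):
            if m:
                total += a * rlsd_magnitude(d, lt, ls) * ls
                count += 1
    return -total / max(count, 1)
```

For a positively rewarded rollout, a token the teacher supports much more strongly than the student saturates at the upper clip `1 + eps`. The same token in a negatively rewarded rollout saturates at `1 - eps`.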

## 4 Method

### 4.1 Motivation: a remaining confidence gap

![Refer to caption](https://arxiv.org/html/2605.13255v1/x2.png)

Figure 2: Overview of the proposed method. The token-level update multiplies three signals: *direction* from the sequence-level outcome-reward advantage $D_i = \mathrm{sign}(A_i)$ (shared across a rollout), *magnitude* $w_{i,t}$ from the teacher–student likelihood ratio, and *confidence* $\omega_{i,t}$ from a monotone gate of the privileged teacher’s predictive entropy. CL-EGRSD replaces the instantaneous entropy inside the gate with a short-horizon causal-lookahead minimum to preserve transient pivot positions.

The teacher–student ratio $w_{i,t}$ says whether the privileged view increases or decreases support for the sampled token. It does not say whether the privileged teacher distribution is concentrated. We use teacher confidence in an operational sense: the concentration of the privileged teacher’s predictive distribution, measured through token-level entropy. Because the teacher is conditioned on augmented context, this confidence signal serves as a practical proxy for the reliability of token-level supervision without attributing cognitive states to the model. Direction-aware self-distillation can still assign large token-level magnitude to positions where the teacher distribution is diffuse. We therefore view token-level self-distillation as a three-signal problem. Table [1](https://arxiv.org/html/2605.13255#S4.T1) summarizes how existing on-policy self-distillation objectives populate this picture. EGRSD targets the remaining confidence gap by introducing the teacher-entropy gate.

Table 1: Signals used by families of on-policy self-distillation objectives. EGRSD augments the direction–magnitude decomposition with an explicit teacher-confidence signal.

Before formalizing this gate, it helps to fix intuition about what the teacher-entropy signal captures. We distinguish three token regimes by predictive-distribution concentration, illustrated on a real reasoning example in Figure [3](https://arxiv.org/html/2605.13255#S4.F3):

- Lock. Low instantaneous entropy, with the teacher sharply peaked. These positions correspond to deterministic computation (arithmetic, unit manipulation, equation continuation) and provide reliable token-level supervision.
- Fork. High instantaneous entropy that *stays* high over the following tokens, with the teacher diffuse and multiple continuations plausible. Sustained branching of this kind can let the teacher–student likelihood ratio alone overstate the evidence.
- Pivot. High instantaneous entropy that *resolves* to low entropy within a short future window, where the teacher is locally uncertain but immediately recommits. Such transitions, often cued by discourse markers, are the target of the causal-lookahead variant below.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x3.png)

Figure 3: Per-token predictive entropy on a representative reasoning trace (Qwen3-4B). Tokens are color-coded by the teacher’s entropy and bolded at the top $\sim 35\%$ of positions. High-entropy tokens concentrate at discourse transitions and strategy shifts. Low entropy marks routine computation. Appendix [H](https://arxiv.org/html/2605.13255#A8) gives an annotated version with $\widehat{H}_{i,t}$ and $\widehat{H}_{i,t}^{\mathrm{CL}}$.

Uniformly trusting all tokens over-weights fork evidence, whereas uniformly suppressing all high-entropy tokens destroys pivot evidence. EGRSD (§[4.2](https://arxiv.org/html/2605.13255#S4.SS2)) handles the first failure mode by attenuating high-entropy positions while retaining a non-zero floor, and CL-EGRSD (§[4.3](https://arxiv.org/html/2605.13255#S4.SS3)) handles the second by swapping in a short-horizon future entropy so that pivots are selectively restored.
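To make the lock, fork, and pivot regimes above concrete, the sketch below labels each position of a normalized entropy trace. The threshold `tau` and the window length are hypothetical values chosen for illustration. The text does not state the cut-offs used to define the regimes, so this is one plausible operationalization rather than the authors' rule.

```python
def classify_regimes(entropies, tau=0.5, window=5):
    """Label each position as 'lock', 'pivot', or 'fork'.

    tau and window are illustrative assumptions, not values from the paper.
    - lock:  entropy at t is below tau
    - pivot: entropy at t is high, but some position within the next
             `window` tokens drops below tau
    - fork:  entropy stays at or above tau over the whole window
    """
    labels = []
    for t, h in enumerate(entropies):
        if h < tau:
            labels.append("lock")
        elif min(entropies[t : t + window + 1]) < tau:
            labels.append("pivot")
        else:
            labels.append("fork")
    return labels

# Normalized entropies for one toy trace (illustrative values only).
trace = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.9]
labels = classify_regimes(trace, tau=0.5, window=2)
```

In this toy trace the spike at index 1 resolves immediately and is labeled a pivot, while the sustained run at the end is labeled a fork.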

Figure [2](https://arxiv.org/html/2605.13255#S4.F2) summarizes the resulting token-level update. The method is intentionally minimal: it changes neither the rollout distribution nor the reward definition, and adds only a multiplicative entropy-based confidence factor on top of the existing direction-aware update.

### 4.2 Entropy-guided reinforced self-distillation

For each rollout $i$ and position $t \in \mathcal{C}_i$ we compute the privileged-teacher entropy

$$H_{i,t} = -\sum_{v \in \mathcal{V}} p_T(v \mid x, s^\star, y_{i,<t}) \log p_T(v \mid x, s^\star, y_{i,<t}). \tag{4}$$

Let $\mathcal{C}_{\mathrm{batch}} = \bigcup_i \{(i,t) : t \in \mathcal{C}_i\}$ denote all rollout positions in the current minibatch. We normalize by the batch-global maximum

$$\widehat{H}_{i,t} = \frac{H_{i,t}}{\displaystyle\max_{(j,k) \in \mathcal{C}_{\mathrm{batch}}} H_{j,k}}. \tag{5}$$

In the implementation we additionally lower-bound this denominator by $1$ nat to avoid numerical instability on low-entropy minibatches. The confidence gate is

$$\omega_{i,t} = \mathrm{clip}\!\left(1 - \gamma\, \widehat{H}_{i,t},\, 0.1,\, 1\right). \tag{6}$$

The coefficient $\gamma$ controls the sensitivity of the gate to teacher entropy. Larger $\gamma$ applies stronger attenuation to high-entropy positions. Setting $\gamma = 0$ recovers the direction-aware baseline without the entropy gate. The lower bound prevents high-entropy positions from being removed entirely, since some high-entropy positions reflect valid branching in the predictive distribution rather than spurious supervision. This choice is supported by the empirical finding of Kim et al. ([2026](https://arxiv.org/html/2605.13255#bib.bib10)) that hard suppression of high-entropy teacher tokens degrades reasoning.
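Eqs. (4) to (6) translate almost line by line into code. The sketch below assumes the teacher's next-token distributions are available as probability lists and that all rollout positions of the minibatch are flattened into one list. The function and parameter names are illustrative, and the `min_denominator` argument mirrors the 1-nat lower bound mentioned in the text.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of one next-token distribution (Eq. 4)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_gate(entropies, gamma=0.3, floor=0.1, min_denominator=1.0):
    """omega = clip(1 - gamma * H / max(H_batch), floor, 1) for each position (Eqs. 5-6).

    `entropies` is a flat list over all rollout positions in the minibatch.
    """
    denom = max(max(entropies), min_denominator)
    return [min(max(1.0 - gamma * h / denom, floor), 1.0) for h in entropies]

# One sharply peaked and one uniform teacher distribution (illustrative values).
peaked = entropy([0.98, 0.01, 0.01])
diffuse = entropy([0.25, 0.25, 0.25, 0.25])  # ln 4, the batch maximum here
gates = confidence_gate([peaked, diffuse], gamma=0.3)
# gates[1] equals 1 - gamma = 0.7 up to rounding; gates[0] stays close to 1.
```

With $\gamma \le 0.9$ the clip floor of $0.1$ is never active, because the gate cannot fall below $1 - \gamma$. The floor only matters for more aggressive settings such as $\gamma = 1$.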

#### Geometric interpretation of the linear gate\.

The form $\omega_{i,t} = 1 - \gamma\, \widehat{H}_{i,t}$ is the endpoint chord (the secant at $\widehat{H} \in \{0, 1\}$) of the worst-case shrinkage bound $\omega^\star_{a_0}(\widehat{H}) = 1/(1 + a_0 \widehat{H})$ derived from the linear noise-to-signal proxy $\sigma^2/\mu^2 \le a_0 \widehat{H}$. Therefore $\gamma$ admits a geometric reading in terms of the noise-to-signal ratio at maximum teacher entropy, $\gamma = \mathrm{NSR}_{\max}/(1 + \mathrm{NSR}_{\max})$ (Appendix [C.1](https://arxiv.org/html/2605.13255#A3.SS1)).
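The chord relation can be checked in two lines. The following is a short worked derivation under the identification $a_0 = \mathrm{NSR}_{\max}$ (the proxy evaluated at $\widehat{H} = 1$), which is how we read the statement above. The full argument is in the appendix and is not reproduced here.

```latex
\begin{align*}
\omega^\star_{a_0}(0) &= 1, \qquad \omega^\star_{a_0}(1) = \frac{1}{1 + a_0}, \\
\text{chord:}\quad \omega(\widehat{H})
  &= 1 - \Big(1 - \frac{1}{1 + a_0}\Big)\widehat{H}
   = 1 - \frac{a_0}{1 + a_0}\,\widehat{H}
  \;\;\Longrightarrow\;\; \gamma = \frac{a_0}{1 + a_0}, \qquad a_0 = \frac{\gamma}{1 - \gamma}.
\end{align*}
```

For example, $\gamma = 0.3$ corresponds to $a_0 = 3/7 \approx 0.43$. Because $1/(1 + a_0 \widehat{H})$ is convex in $\widehat{H}$, the chord lies on or above the curve on $[0, 1]$, so the linear gate never attenuates more than the bound does.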

EGRSD combines direction, magnitude, and confidence:

$$\widehat{A}_{i,t} = A_i \cdot w_{i,t} \cdot \omega_{i,t}, \qquad \mathcal{L}_{\mathrm{EGRSD}} = -\frac{1}{\sum_{i,t} m_{i,t}} \sum_{i,t} m_{i,t}\, \widehat{A}_{i,t}\, \log p_\theta(y_{i,t} \mid x, y_{i,<t}). \tag{7}$$

$\widehat{A}_{i,t}$ is a stop-gradient constant. The gradient flows only through $\log p_\theta$.
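A minimal sketch of Eq. (7) for a single rollout is shown below, assuming the magnitudes $w_{i,t}$ and gates $\omega_{i,t}$ have already been computed as constants. The function names are hypothetical, and a real training loop would compute the same quantity on tensors with the per-token weights detached from the graph.

```python
def egrsd_token_weights(advantage, magnitudes, gates):
    """Per-token weight A_hat = A_i * w_{i,t} * omega_{i,t} for one rollout (Eq. 7)."""
    return [advantage * w * omega for w, omega in zip(magnitudes, gates)]

def egrsd_loss(token_weights, logp_student, mask):
    """Masked mean of -A_hat * log p_theta over the given positions."""
    kept = [(a, lp) for a, lp, m in zip(token_weights, logp_student, mask) if m]
    return -sum(a * lp for a, lp in kept) / max(len(kept), 1)

# Toy rollout with advantage 1.0: a confident token and a high-entropy token.
weights = egrsd_token_weights(1.0, magnitudes=[1.2, 0.8], gates=[1.0, 0.7])
loss = egrsd_loss(weights, logp_student=[-0.5, -2.0], mask=[1, 1])
```

Setting every gate to `1.0` reduces `egrsd_token_weights` to the RLSD weighting, which matches the $\gamma = 0$ case described in the text.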

### 4.3 Causal-lookahead EGRSD

Instantaneous entropy can conflate sustained high-entropy spans with transient transition points. A token may have high entropy because the sequence distribution branches locally, while the subsequent continuation rapidly becomes low entropy. CL-EGRSD therefore replaces $H_{i,t}$ with a causal future-window minimum before applying the same batch-global normalization:

$$\widehat{H}_{i,t}^{\mathrm{CL}} = \frac{\displaystyle\min_{j \in [t,\, \min(t+W,\, T_i)]} H_{i,j}}{\displaystyle\max_{(j,k) \in \mathcal{C}_{\mathrm{batch}}} H_{j,k}}, \qquad \omega_{i,t}^{\mathrm{CL}} = \mathrm{clip}\!\left(1 - \gamma\, \widehat{H}_{i,t}^{\mathrm{CL}},\, 0.1,\, 1\right). \tag{8}$$

The lookahead window is truncated at the sequence end via $\min(t+W, T_i)$. We take the minimum of the raw entropies within the window and reuse the same batch-global denominator. This preserves transient high-entropy positions that resolve within $W$ tokens and continues to attenuate sustained high-entropy spans.
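The lookahead minimum of Eq. (8) is a one-line sliding-window operation. The sketch below works on the raw entropies of one rollout and takes the batch-global maximum as an argument. The names are illustrative, and the 1-nat lower bound on the denominator is carried over from §4.2 as an assumption about how the two pieces combine.

```python
def lookahead_min(entropies, window=5):
    """Causal future-window minimum: min of H[t .. t + W], truncated at the sequence end."""
    return [min(entropies[t : t + window + 1]) for t in range(len(entropies))]

def cl_gate(entropies, batch_max, gamma=0.3, floor=0.1, window=5):
    """omega^CL = clip(1 - gamma * min-window entropy / batch max, floor, 1) (Eq. 8)."""
    denom = max(batch_max, 1.0)  # assumed 1-nat lower bound on the normalizer
    return [
        min(max(1.0 - gamma * h / denom, floor), 1.0)
        for h in lookahead_min(entropies, window)
    ]

# A transient spike at index 1, then a sustained high-entropy span (illustrative values).
H = [0.2, 2.0, 0.3, 2.0, 2.0, 2.0]
smoothed = lookahead_min(H, window=2)
gates = cl_gate(H, batch_max=2.0, gamma=0.3, window=2)
```

The spike at index 1 is followed by a low-entropy token, so its gate recovers to about 0.955 instead of the instantaneous value 0.7. The sustained span from index 3 onward stays at 0.7.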

#### Why minimum?

Among smoothing filters $\phi(H_{i,t}, \ldots, H_{i,t+W})$ that are causal, per-argument monotone, conservative ($\phi \le H_{i,t}$), and idempotent ($\phi(c, \ldots, c) = c$), any member satisfies $\phi \ge \min_{j \in [t, t+W]} H_{i,j}$, so the minimum is the extremal choice that maximizes weight recovery at pivot positions (those with high current entropy but low lookahead entropy), while leaving sustained high-entropy (fork) positions nearly untouched. The formal statement and proof are in Appendix [C.2](https://arxiv.org/html/2605.13255#A3.SS2).
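The lower bound follows from monotonicity and idempotence alone. The short derivation below is a sketch of that step, not a reproduction of the appendix proof.

```latex
\begin{align*}
m &:= \min_{j \in [t,\, t+W]} H_{i,j}
  \quad\Longrightarrow\quad H_{i,j} \ge m \ \text{ for every } j \text{ in the window}, \\
\phi(H_{i,t}, \ldots, H_{i,t+W})
  &\ge \phi(m, \ldots, m) && \text{(per-argument monotonicity)} \\
  &= m && \text{(idempotence)}.
\end{align*}
```

Since the gate $1 - \gamma\,\widehat{H}$ is non-increasing in entropy for $\gamma \ge 0$, the smallest admissible filter value gives the largest gate, which is why the minimum recovers the most weight at pivots.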

## 5 Experiments

### 5.1 Setup

#### Models and data\.

We evaluate Qwen3-4B and Qwen3-8B with thinking mode enabled (Qwen Team, [2025](https://arxiv.org/html/2605.13255#bib.bib18)). Teacher and student share the backbone weights: the teacher uses the base weights, which LoRA does not update, and is conditioned on $(x, s^\star)$, where $s^\star$ is the privileged reference solution, while the student uses the LoRA-adapted weights and is conditioned on $x$ alone. Our training data configuration matches that of OPSD (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)).

#### Training\.

We train all methods with AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.13255#bib.bib15)) at a learning rate of $5 \times 10^{-6}$ in BF16 mixed precision, under an identical update budget. LoRA (Hu et al., [2022](https://arxiv.org/html/2605.13255#bib.bib8)) ($r = 64$, $\alpha = 128$, no dropout) is applied to each Transformer layer’s attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and MLP projections (`gate_proj`, `up_proj`, `down_proj`). On-policy rollouts are generated with vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.13255#bib.bib11)) in colocate mode at temperature $1.1$, top-$p = 0.95$, top-$k = 20$, and a per-step completion cap of $1{,}024$ tokens. The teacher is kept frozen throughout training. Section [5.4](https://arxiv.org/html/2605.13255#S5.SS4) ablates EMA and hard-copy update schedules and shows both underperform the frozen choice. Training dynamics (pre-clip gradient norm) are reported in Appendix [F](https://arxiv.org/html/2605.13255#A6). Additional training details are deferred to Appendix [D](https://arxiv.org/html/2605.13255#A4).
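For reference, the hyperparameters stated in this paragraph can be collected into one configuration fragment. The values are transcribed from the text. The dictionary layout and key names are illustrative assumptions and do not correspond to any particular library's schema.

```python
# Values transcribed from the training paragraph; key names are hypothetical.
LORA = {
    "r": 64,
    "alpha": 128,
    "dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
}
OPTIMIZER = {"name": "AdamW", "learning_rate": 5e-6, "precision": "bf16"}
ROLLOUT = {
    "temperature": 1.1,
    "top_p": 0.95,
    "top_k": 20,
    "max_completion_tokens": 1024,
}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = LORA["alpha"] / LORA["r"]
```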

#### Evaluation\.

We evaluate on AIME 2024, AIME 2025, HMMT 2025, MATH-500, Minerva-Math, and GSM8K. For inference, we employ vLLM with the official Qwen3 chat template and `enable_thinking=True`. We generate $K = 4$ samples per prompt using a temperature of $1.0$, top-$p = 0.95$, and a maximum of $32{,}768$ new tokens. To quantify AIME24 variance, results on that benchmark are averaged over five independent runs. The remaining benchmarks use a single run. We report avg@$K$ for each benchmark and define Avg. as the macro-average of avg@$K$ across the six benchmarks. Generation length is averaged across benchmarks and samples.
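The two aggregation steps can be sketched as follows, assuming avg@$K$ means the per-prompt fraction of correct samples averaged over prompts, which is the usual reading. The function names and the toy correctness matrix are illustrative.

```python
def avg_at_k(correct):
    """avg@K in percent; correct[p][k] is 1 if sample k for prompt p is right, else 0."""
    per_prompt = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_prompt) / len(per_prompt)

def macro_average(benchmark_scores):
    """Unweighted mean of per-benchmark scores, so each benchmark counts equally."""
    return sum(benchmark_scores.values()) / len(benchmark_scores)

# Two prompts with K = 4 samples each (made-up outcomes).
score = avg_at_k([[1, 1, 0, 1], [0, 0, 1, 0]])
overall = macro_average({"bench_a": score, "bench_b": 100.0})
```

Because the macro-average is unweighted, a small benchmark such as AIME contributes as much to Avg. as a large one such as GSM8K.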

#### Compared methods\.

We compare against the no-train baseline, SFT, GRPO (Shao et al., [2024](https://arxiv.org/html/2605.13255#bib.bib20)), OPSD (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)), and CRISP (Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19)).

### 5.2 Main comparison

Tables [2](https://arxiv.org/html/2605.13255#S5.T2) and [3](https://arxiv.org/html/2605.13255#S5.T3) show the main comparison, with all trainable methods sharing the same 100-step training budget. Additional variants and sweeps appear below and in the appendix. EGRSD and CL-EGRSD outperform all trainable baselines (SFT, GRPO (Shao et al., [2024](https://arxiv.org/html/2605.13255#bib.bib20)), OPSD (Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)), and CRISP (Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19))) on both Qwen3-4B and Qwen3-8B. On 4B, EGRSD Pareto-dominates OPSD in both accuracy and length (Table [5](https://arxiv.org/html/2605.13255#S5.T5)). On 8B, CL-EGRSD gives the highest observed Avg. in this comparison, with the largest per-benchmark gain on HMMT25 ($+7.40$ over the no-train baseline), at a generation length ($12{,}232$ tokens) comparable to OPSD’s ($12{,}097$). All trainable baselines are Pareto-dominated in the accuracy–length plane, as visualized in Figure [1](https://arxiv.org/html/2605.13255#S1.F1). The primary takeaway is that confidence-aware weighting advances the accuracy–length frontier.

Table 2: Results on Qwen3-4B. We express all evaluative metrics as percentages. The best and runner-up results are highlighted with bold and underline, respectively.

Table 3: Results on Qwen3-8B. We express all evaluative metrics as percentages. The best and runner-up results are highlighted with bold and underline, respectively.

### 5.3 Ablations on entropy strength and lookahead

Table 4 isolates the entropy coefficient across four training snapshots on Qwen3-8B. Moderate attenuation is the most stable regime: $\gamma = 0.1$ attains the highest single-snapshot Avg. of $68.32$ and $\gamma = 0.3$ peaks at $68.23$ with a smoother trajectory, while the ungated ($\gamma = 0.0$) and strong ($\gamma \ge 0.7$) settings trail at most snapshots. Moderate attenuation balances peak accuracy against snapshot-to-snapshot variance, consistent with retaining a nonzero lower bound for high-entropy positions.

Token efficiency $\mathrm{Eff} = \mathrm{Acc}(\%) / (\mathrm{AvgLen}/1000)$ on Qwen3-4B (Table [5](https://arxiv.org/html/2605.13255#S5.T5)) gives a complementary picture. EGRSD ($\mathrm{Eff} = 6.08$) and CL-EGRSD ($\mathrm{Eff} = 6.06$) are the only trainable methods that improve over SFT ($\mathrm{Eff} = 6.05$). GRPO, OPSD, and CRISP all produce longer generations without commensurate accuracy gains and fall to $\mathrm{Eff} \le 5.86$. SFT’s nominal $6.05$ comes from a $\sim$1.9K-token compression paired with a ten-point accuracy drop rather than a true efficiency improvement.
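The efficiency metric is a simple ratio, sketched below. The input numbers are made up to show the units and are not taken from the paper's tables.

```python
def token_efficiency(accuracy_percent, avg_len_tokens):
    """Eff = Acc(%) / (AvgLen / 1000): accuracy points per thousand generated tokens."""
    return accuracy_percent / (avg_len_tokens / 1000.0)

# Hypothetical model: 60% average accuracy at 10,000 tokens per response.
eff = token_efficiency(60.0, 10_000)
```

Because the metric is a ratio, it can rise when length falls faster than accuracy, which is the caveat the text raises about SFT.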

Table 4: EGRSD $\gamma$ sweep on Qwen3-8B. Columns S1–S4 are four training snapshots, and $\gamma = 0$ removes entropy weighting.

Table 5: Token efficiency on Qwen3-4B, defined as $\mathrm{Eff} = \mathrm{Acc} / (\mathrm{AvgLen}/1000)$ with accuracy in percent and length in thousands of tokens. EGRSD and CL-EGRSD are the only trainable methods that improve over SFT.

Table 6: Lookahead ablation on Qwen3-8B. We express all evaluative metrics as percentages. The best and runner-up results are highlighted with bold and underline, respectively.

Table [6](https://arxiv.org/html/2605.13255#S5.T6) reports the lookahead sweep on Qwen3-8B. $W = 5$ gives the strongest Avg., outperforming both shorter and longer windows. A focused sweep under stronger suppression (Appendix [G](https://arxiv.org/html/2605.13255#A7)) shows larger lookahead gains, motivating joint tuning of $\gamma$ and $W$.

### 5.4 Teacher update schedule ablation

Our method uses a *frozen* teacher throughout training: $p_T$ is the base model with LoRA adapter disabled (Appendix [B](https://arxiv.org/html/2605.13255#A2)) and is never updated. A natural question is whether allowing the teacher to track the student could help. We therefore ablate two families of teacher-update rules on Qwen3-8B with EGRSD at $\gamma = 0.3$, keeping the rest of the training protocol identical to the main result. (i) Exponential moving average (EMA): teacher weights are an EMA of the student, $p_T \leftarrow \alpha\, p_T + (1 - \alpha)\, p_S$, with $\alpha = 0.99$ (fast tracking, $\sim$100-step lag). (ii) Hard copy: every $K$ steps the teacher is replaced by a hard copy of the current student. We test $K \in \{20, 50\}$.
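The two update rules can be sketched on a toy parameter dictionary. This is an illustration only: real weights are tensors, the function names are hypothetical, and whether the hard copy fires at step 0 is an assumption of this sketch.

```python
def ema_update(teacher, student, alpha=0.99):
    """Teacher <- alpha * teacher + (1 - alpha) * student, applied per parameter."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def hard_copy_update(teacher, student, step, every=20):
    """Replace the teacher with a copy of the student every `every` steps."""
    return dict(student) if step % every == 0 else teacher

# With a fixed student, track how far an EMA teacher moves in 100 steps.
teacher, student = {"w": 0.0}, {"w": 1.0}
for _ in range(100):
    teacher = ema_update(teacher, student)
# The teacher has closed 1 - 0.99**100 of the gap, about 63%.
```

The time constant $1/(1 - \alpha) = 100$ steps is the sense in which $\alpha = 0.99$ gives a roughly 100-step lag.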

As shown in Table [7](https://arxiv.org/html/2605.13255#S5.T7), all three online-update schedules underperform the frozen teacher. Mechanistically, when the teacher tracks the student, the log-ratio $\delta_{i,t}=\log p_T-\log p_S$ collapses towards zero, the magnitude $w_{i,t}=\mathrm{clip}(\exp(D_i\delta_{i,t}),1-\varepsilon,1+\varepsilon)$ collapses towards $1$, and EGRSD degrades towards a plain outcome-reward objective with an entropy gate. The benefit of conditioning the teacher on privileged context $(x,s^{\star})$ is eroded. This connects to the epistemic-verbalization finding of Kim et al. ([2026](https://arxiv.org/html/2605.13255#bib.bib10)), who show that a teacher whose predictive distribution has been compressed (in their case by folding the privileged answer in too aggressively) produces worse supervision because epistemic uncertainty markers are suppressed. Updating the teacher towards the current student is a different but related pathway to the same failure mode: the teacher drifts away from its original calibrated distribution and loses the outside-view reference that makes this three-signal decomposition informative. A frozen $p_T$ avoids both over-confidence and this drift-based calibration loss. We therefore adopt it as the default.
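The collapse of the magnitude term can be seen in a small numeric sketch of the clipped likelihood ratio. Two things here are assumptions rather than details stated in this section: the direction term is taken to be a sign in {+1, -1}, and the clip width `eps=0.2` is a placeholder value.

```python
import math


def magnitude(log_p_teacher, log_p_student, direction, eps=0.2):
    """Clipped ratio w = clip(exp(D * delta), 1 - eps, 1 + eps), delta = log pT - log pS."""
    delta = log_p_teacher - log_p_student
    raw = math.exp(direction * delta)
    return min(max(raw, 1.0 - eps), 1.0 + eps)


log_p_student = -1.0
# A frozen teacher that still disagrees with the student keeps w away from 1.
frozen = magnitude(-0.9, log_p_student, direction=+1)
# A teacher that has drifted almost onto the student yields w close to 1.
tracking = magnitude(-0.999, log_p_student, direction=+1)
print(round(frozen, 4), round(tracking, 4))
```

As the teacher approaches the student, the per-token weight carries less and less information beyond the sequence-level reward.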

### 5.5 Mechanism analysis

#### Entropy diagnoses unreliable evidence\.

Table 7: Teacher-update ablation on Qwen3-8B with EGRSD ($\gamma=0.3$).

![Refer to caption](https://arxiv.org/html/2605.13255v1/x4.png)

(a) Teacher–student evidence gap vs. EGRSD update weight per entropy decile.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x5.png)

(b) Mean lookahead weight increment $\omega_{i,t}^{\mathrm{CL}}-\omega_{i,t}$ by regime.

Figure 4: Mechanism diagnostics on 5.5M held-out tokens (1,688 completions from AIME24/25, HMMT25, Minerva). Left: the mean teacher–student log-prob gap grows from $0.00237$ nats (decile 1) to $0.329$ nats (decile 10), a $139\times$ spread, while EGRSD attenuates the mean update weight in high-entropy deciles and stays above the $1-\gamma=0.7$ gate floor. Right: CL-EGRSD restores pivot weight by $+0.111$ on average versus $+0.053$ for sustained forks (a $2.1\times$ selectivity), while leaving lock tokens near the weight ceiling ($\gamma=0.3$, $W=5$).

Figure [4(a)](https://arxiv.org/html/2605.13255#S5.F4.sf1) tests the central mechanism directly. High-entropy positions coincide with substantially less concentrated teacher–student evidence: the top entropy decile has roughly $140\times$ the log-prob gap of the bottom decile. EGRSD therefore does not treat entropy as a length penalty. It uses entropy as a confidence signal to calibrate which token-level evidence should receive a strong learning signal. This empirical pattern matches the finding of Kim et al. ([2026](https://arxiv.org/html/2605.13255#bib.bib10)) that high-entropy teacher tokens carry uncertainty that should be preserved rather than suppressed: the high-entropy decile is precisely where teacher evidence becomes least reliable, which is why the confidence gate retains a nonzero floor there rather than zeroing out the signal.
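A minimal sketch of an entropy-based confidence gate is given below, assuming the linear form $1-\gamma\widehat{H}$ with entropy normalized by the batch maximum, as described in Appendix C. The function names are hypothetical, and the optional `floor` argument stands in for the lower clip of the gate, whose exact value is not given in this section.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy in nats of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def entropy_gate(entropy, gamma=0.3, floor=None):
    """Linear gate omega = 1 - gamma * H / H_max over the batch."""
    h_max = entropy.max()
    h_hat = entropy / h_max if h_max > 0 else np.zeros_like(entropy)
    omega = 1.0 - gamma * h_hat
    if floor is not None:  # assumed placeholder for the lower clip
        omega = np.maximum(omega, floor)
    return omega


# Three toy teacher positions: confident, moderately uncertain, uniform.
logits = np.array([[8.0, 0.0, 0.0, 0.0],
                   [2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
entropy = token_entropy(logits)
omega = entropy_gate(entropy, gamma=0.3)
print(np.round(omega, 3))
```

With $\gamma=0.3$ the most uncertain position in the batch is still weighted at $0.7$, which is the gate floor referred to in the Figure 4 caption.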

#### Lookahead targets transient pivots\.

To quantify how CL-EGRSD reshapes token weight across the three regimes introduced in §[4.1](https://arxiv.org/html/2605.13255#S4.SS1), we use operational thresholds on normalized teacher entropy: a position $(i,t)$ is classified as *lock* if $\widehat{H}_{i,t}\leq\tau_{\mathrm{low}}$, as *fork* if $\widehat{H}_{i,t}\geq\tau_{\mathrm{high}}$ *and* $\widehat{H}_{i,t}^{\mathrm{CL}}\geq\tau_{\mathrm{high}}$, and as *pivot* if $\widehat{H}_{i,t}\geq\tau_{\mathrm{high}}$ but $\widehat{H}_{i,t}^{\mathrm{CL}}\leq\tau_{\mathrm{low}}$. All other positions are labeled *mid*. EGRSD attenuates weight uniformly on all high-entropy positions through $\omega_{i,t}$. CL-EGRSD's causal-lookahead replacement instead selectively restores weight on pivots by swapping in a low-entropy future value.
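The regime labels above can be sketched in a few lines. The threshold values, the window size, and the truncation of the lookahead window at the end of the sequence are assumptions made for this illustration. The lookahead entropy is taken as the minimum over the current position and the next `window` positions.

```python
import numpy as np


def lookahead_min(h_hat, window):
    """Minimum of h_hat[t..t+window], truncated at the sequence end (assumed)."""
    n = len(h_hat)
    return np.array([h_hat[t:min(t + window + 1, n)].min() for t in range(n)])


def classify(h_hat, h_cl, tau_low, tau_high):
    """Label each position as lock, fork, pivot, or mid."""
    labels = np.full(len(h_hat), "mid", dtype=object)
    labels[h_hat <= tau_low] = "lock"
    high = h_hat >= tau_high
    labels[high & (h_cl >= tau_high)] = "fork"
    labels[high & (h_cl <= tau_low)] = "pivot"
    return labels


# Toy normalized entropies and hypothetical thresholds.
h_hat = np.array([0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.5])
h_cl = lookahead_min(h_hat, window=2)
labels = classify(h_hat, h_cl, tau_low=0.2, tau_high=0.8)
print(list(labels))
```

In the toy sequence, position 1 is high-entropy but resolves within the window and is labeled a pivot, while position 3 stays high throughout its window and is labeled a fork.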

Figure [4(b)](https://arxiv.org/html/2605.13255#S5.F4.sf2) explains why lookahead is useful but conditional. We categorize tokens by current entropy $\widehat{H}_{i,t}$ and five-token lookahead entropy $\widehat{H}_{i,t}^{\mathrm{CL}}$: *lock* (low, low), *fork* (high, still high), *pivot* (high, resolves to low), and *mid* (remainder). Pivots constitute $20.9\%$ of the analyzed tokens, and at $\gamma=0.3$, $W=5$ CL-EGRSD restores their weight by $+0.111$ on average versus $+0.053$ for sustained forks (a $2.1\times$ selectivity ratio). Lock restore is below $0.001$ because low-entropy positions are already near the weight ceiling. This matches the intended behavior: preserve transient transition points without undoing entropy suppression everywhere. On 4B, lookahead is less consistent, making EGRSD the cleaner default. On 8B, the model appears better able to exploit protected transitions.

## 6 Conclusion

Across Qwen3\-4B and Qwen3\-8B mathematical reasoning benchmarks with thinking mode enabled, an entropy\-gated token\-level update is the only intervention in our comparison that improves over the no\-train baseline on the accuracy–length frontier\. The gate has two forms: EGRSD attenuates high\-entropy positions through the instantaneous teacher entropy, and CL\-EGRSD swaps in the minimum entropy over a short causal future window to preserve transient pivot tokens\. EGRSD Pareto\-dominates OPSD on Qwen3\-4B, and CL\-EGRSD attains the highest Avg\. on Qwen3\-8B at a length comparable to OPSD\.

## References

- Chen et al\. \[2025\]Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping\.AceReason\-Nemotron: Advancing math and code reasoning through reinforcement learning\.*arXiv preprint arXiv:2505\.16400*, 2025\.
- Cobbe et al\. \[2021\]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Dao \[2024\]Tri Dao\.FlashAttention\-2: Faster attention with better parallelism and work partitioning\.In*The Twelfth International Conference on Learning Representations \(ICLR\)*, 2024\.
- DeepSeek\-AI \[2025\]DeepSeek\-AI\.DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\.*Nature*, 645:633–638, 2025\.
- Groeneveld et al\. \[2024\]Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al\.Olmo: Accelerating the science of language models\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 15789–15809, 2024\.
- Guha et al\. \[2025\]Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, et al\.OpenThoughts: Data recipes for reasoning models\.arXiv preprint arXiv:2506\.04178, 2025\.
- Hendrycks et al\. \[2021\]Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\.Measuring mathematical problem solving with the MATH dataset\.In*Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks*, 2021\.
- Hu et al\. \[2022\]Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\.LoRA: Low\-rank adaptation of large language models\.In*International Conference on Learning Representations \(ICLR\)*, 2022\.
- Jin et al\. \[2026\]Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee\.Entropy\-aware on\-policy distillation of language models\.*arXiv preprint arXiv:2603\.07079*, 2026\.
- Kim et al\. \[2026\]Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang\.Why does self\-distillation \(sometimes\) degrade the reasoning capability of llms?*arXiv preprint arXiv:2603\.24472*, 2026\.
- Kwon et al\. \[2023\]Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\.Efficient memory management for large language model serving with PagedAttention\.In*ACM SIGOPS Symposium on Operating Systems Principles \(SOSP\)*, 2023\.
- Lewkowycz et al\. \[2022\]Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman\-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur\-Ari, and Vedant Misra\.Solving quantitative reasoning problems with language models\.In*Advances in Neural Information Processing Systems*, 2022\.
- Li et al\. \[2024\]Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, and Tianyi Zhou\.Selective reflection\-tuning: Student\-selected data recycling for LLM instruction\-tuning\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 16189–16211, Bangkok, Thailand, 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.findings\-acl\.958\.URL[https://aclanthology\.org/2024\.findings\-acl\.958/](https://aclanthology.org/2024.findings-acl.958/)\.
- Lightman et al\. \[2024\]Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\.Let’s verify step by step\.In*The Twelfth International Conference on Learning Representations*, 2024\.
- Loshchilov and Hutter \[2019\]Ilya Loshchilov and Frank Hutter\.Decoupled weight decay regularization\.In*International Conference on Learning Representations \(ICLR\)*, 2019\.URL[https://openreview\.net/forum?id=Bkg6RiCqY7](https://openreview.net/forum?id=Bkg6RiCqY7)\.
- Mangrulkar et al\. \[2022\]Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan\.PEFT: State\-of\-the\-art parameter\-efficient fine\-tuning methods\.[https://github\.com/huggingface/peft](https://github.com/huggingface/peft), 2022\.
- OpenAI \[2024\]OpenAI\.Learning to reason with LLMs\.[https://openai\.com/index/learning\-to\-reason\-with\-llms/](https://openai.com/index/learning-to-reason-with-llms/), 2024\.OpenAI blog, September 2024\.
- Qwen Team \[2025\]Qwen Team\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*, 2025\.
- Sang et al\. \[2026\]Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun\.CRISP: Compressed reasoning via iterative self\-policy distillation\.*arXiv preprint arXiv:2603\.05433*, 2026\.
- Shao et al\. \[2024\]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Yu Wu, and Daya Guo\.DeepSeekMath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\.
- Song and Zheng \[2026\]Mingyang Song and Mao Zheng\.A survey of on\-policy distillation for large language models\.*arXiv preprint arXiv:2604\.00626*, 2026\.
- Stein et al\. \[2026\]Alex Stein, Furong Huang, and Tom Goldstein\.GATES: Self\-distillation under privileged context with consensus gating\.*arXiv preprint arXiv:2602\.20574*, 2026\.
- Wei et al\. \[2022\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H\. Chi, Quoc Le, and Denny Zhou\.Chain\-of\-thought prompting elicits reasoning in large language models\.In*Advances in Neural Information Processing Systems*, 2022\.
- Xu et al\. \[2025\]Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He\.Chain of draft: Thinking faster by writing less\.*arXiv preprint arXiv:2502\.18600*, 2025\.
- Xu et al\. \[2026\]Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang\.PACED: Distillation and on\-policy self\-distillation at the frontier of student competence\.*arXiv preprint arXiv:2603\.11178*, 2026\.
- Yang et al\. \[2026\]Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan\.Self\-distilled RLVR\.*arXiv preprint arXiv:2604\.03128*, 2026\.
- Zhang et al\. \[2026a\]Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang\.Embarrassingly simple self\-distillation improves code generation\.*arXiv preprint arXiv:2604\.01193*, 2026a\.
- Zhang et al\. \[2026b\]Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, and Stefano Soatto\.Reinforcement\-aware knowledge distillation for LLM reasoning\.*arXiv preprint arXiv:2602\.22495*, 2026b\.
- Zhao \[2026\]Siyan Zhao\.Openthoughts\_math\_30k\_opsd: Math subset of OpenThoughts\-114k used by OPSD\.[https://huggingface\.co/datasets/siyanzhao/Openthoughts\_math\_30k\_opsd](https://huggingface.co/datasets/siyanzhao/Openthoughts_math_30k_opsd), 2026\.
- Zhao et al\. \[2026\]Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover\.Self\-distilled reasoner: On\-policy self\-distillation for large language models\.*arXiv preprint arXiv:2601\.18734*, 2026\.

## Appendix


## Appendix A Extended related work

This appendix complements the two paragraphs retained in the main Related Work section with additional context on \(i\) long\-form reasoning efficiency, \(ii\) RLVR\-style token\-level credit assignment, and \(iii\) a method\-by\-method positioning summary\.

#### Long\-form reasoning and reasoning efficiency\.

Chain\-of\-thought prompting and recent reasoning\-oriented models have shown that allocating more inference\-time computation can substantially improve multi\-step mathematical problem solving\[Wei et al\.,[2022](https://arxiv.org/html/2605.13255#bib.bib23), OpenAI,[2024](https://arxiv.org/html/2605.13255#bib.bib17), DeepSeek\-AI,[2025](https://arxiv.org/html/2605.13255#bib.bib4), Qwen Team,[2025](https://arxiv.org/html/2605.13255#bib.bib18)\]\. This gain comes with a practical cost: reasoning models often emit long deliberations containing redundant checks, false starts, and repeated verification\. A growing line of work therefore studies reasoning compression and efficient post\-training, including reward\-based length control, supervised fine\-tuning on concise completions, training\-free prompting, pruning, and latent or self\-distillation\-based compression\. Chain of Draft reduces reasoning verbosity at inference time by prompting models to write concise intermediate drafts\[Xu et al\.,[2025](https://arxiv.org/html/2605.13255#bib.bib24)\]\. CRISP\[Sang et al\.,[2026](https://arxiv.org/html/2605.13255#bib.bib19)\]is the closest compression\-oriented baseline to our setting and uses iterative self\-policy distillation to encourage concise reasoning\. Our method is complementary in goal but different in mechanism: rather than imposing a prompt\-based conciseness instruction or treating all teacher positions uniformly, EGRSD changes the token\-level confidence weighting of the distillation signal by down\-weighting high\-entropy teacher positions\.

#### RLVR and token\-level credit assignment\.

Reinforcement learning with verifiable rewards has become a standard post\-training approach for reasoning models because correctness can often be checked automatically\[DeepSeek\-AI,[2025](https://arxiv.org/html/2605.13255#bib.bib4)\]\. A widely used instantiation is Group Relative Policy Optimization \(GRPO\)\[Shao et al\.,[2024](https://arxiv.org/html/2605.13255#bib.bib20)\], which estimates token\-level advantages from sequence\-level outcome rewards across a rollout group\. Its main limitation is sparse credit assignment: a sequence\-level reward or advantage is broadcast across all tokens in a long rollout, even though only a small subset of tokens may determine the outcome\. Process\-level supervision and process reward models address this by scoring intermediate reasoning steps, but they require additional annotation, modeling, or inference machinery\[Lightman et al\.,[2024](https://arxiv.org/html/2605.13255#bib.bib14)\]\. Concurrent work on direction\-aware self\-distillation \(RLSD\[Yang et al\.,[2026](https://arxiv.org/html/2605.13255#bib.bib26)\]\) casts self\-distillation as a token\-level credit\-assignment tool: outcome rewards determine update direction, while the privileged teacher–student likelihood ratio modulates update magnitude\. Related distillation\-RL hybrids such as reinforcement\-aware knowledge distillation also modify how teacher probabilities interact with policy optimization\[Zhang et al\.,[2026b](https://arxiv.org/html/2605.13255#bib.bib28)\]\. EGRSD also adopts this direction–magnitude pair and adds a missing confidence signal: when the privileged teacher distribution is diffuse, the teacher–student ratio alone can overstate the confidence of the token\-level evidence\. Entropy gating reduces this effect without changing the reward\-grounded update direction\.

#### Positioning\.

Overall, EGRSD sits at the intersection of reasoning compression, on\-policy self\-distillation, and token\-level credit assignment\. Compared with OPSD\[Zhao et al\.,[2026](https://arxiv.org/html/2605.13255#bib.bib30)\], it weights privileged teacher positions according to teacher confidence rather than uniformly\. Against CRISP\[Sang et al\.,[2026](https://arxiv.org/html/2605.13255#bib.bib19)\], the change is in token\-level credit allocation instead of a conciseness instruction\. Relative to RLSD\[Yang et al\.,[2026](https://arxiv.org/html/2605.13255#bib.bib26)\], EGRSD shares the direction–magnitude decomposition and adds a teacher\-entropy confidence signal\. The closest contemporary point of contact is SSD\[Zhang et al\.,[2026a](https://arxiv.org/html/2605.13255#bib.bib27)\], which implicitly reshapes token distributions at lock/fork positions\. Our gate is an explicit multiplicative factor on the token\-level RLSD update, and CL\-EGRSD further introduces a pivot regime not covered by SSD’s two\-category view\. Unlike process\-supervision approaches\[Lightman et al\.,[2024](https://arxiv.org/html/2605.13255#bib.bib14)\], EGRSD obtains fine\-grained confidence information without extra annotations\.

## Appendix B Background details

This section elaborates the notation, stop\-gradient rationale, and advantage\-whitening procedure that the compressed main\-text Background \(§[3](https://arxiv.org/html/2605.13255#S3)\) omits for brevity\.

#### Teacher and student as a shared PEFT instance\.

Teacher and student share the same backbone. The teacher runs with the LoRA adapter disabled, so its weights are never updated, and it is conditioned on the privileged context $(x,s^{\star})$. The student runs with the LoRA adapter active and sees only $(x)$. The privileged context $s^{\star}$ is inserted via a two-turn chat template consisting of a reference reasoning segment, a transition prompt, and the student's on-policy rollout. Further template details are deferred to Appendix [D](https://arxiv.org/html/2605.13255#A4). Because $p_T$ and $p_S$ differ only in LoRA state and conditioning, the teacher–student log-ratio $\delta_{i,t}$ captures exactly the effect of the privileged context plus LoRA adaptation.

#### Stop\-gradient rationale\.

When computing $\delta_{i,t}$ and the magnitude $w_{i,t}$ in Eq. [2](https://arxiv.org/html/2605.13255#S3.E2), we apply stop-gradient to $p_S$. Without this, gradients would flow through $\log p_S$ inside $w_{i,t}$ and bypass the outcome reward: the network could reduce the loss by shrinking its own $\log p_S$, which is unrelated to whether the sampled rollout was correct. Gradients therefore flow only through the outer $\log p_\theta$ factor in Eq. [3](https://arxiv.org/html/2605.13255#S3.E3). The whole token-level advantage $A_i\,w_{i,t}$ (and later $A_i\,w_{i,t}\,\omega_{i,t}$ for EGRSD) is treated as a constant multiplier during backpropagation.
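The effect of treating the token-level coefficient as a constant can be illustrated on a single token. The sketch assumes a per-token surrogate of the form $-c\cdot\log p_\theta(y)$ with $c = A\,w\,\omega$ held fixed. The exact loss in Eq. 3 is not reproduced in this appendix, so this form, the toy logits, and all function names are assumptions. The analytic gradient with respect to the logits is checked against central finite differences.

```python
import numpy as np


def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def surrogate_loss(z, y, coeff):
    """-coeff * log p_theta(y), where coeff is a plain float (already detached)."""
    return -coeff * log_softmax(z)[y]


def grad_with_stop_gradient(z, y, coeff):
    """Gradient w.r.t. logits when coeff is constant: -coeff * (onehot(y) - softmax(z))."""
    p = np.exp(log_softmax(z))
    onehot = np.zeros_like(z)
    onehot[y] = 1.0
    return -coeff * (onehot - p)


def numeric_grad(f, z, h=1e-6):
    """Central finite-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        e = np.zeros_like(z)
        e[k] = h
        g[k] = (f(z + e) - f(z - e)) / (2.0 * h)
    return g


z = np.array([0.5, -1.0, 2.0])   # toy student logits
y = 2                            # sampled token index
coeff = 1.0 * 1.1 * 0.8          # hypothetical A * w * omega, held constant
analytic = grad_with_stop_gradient(z, y, coeff)
numeric = numeric_grad(lambda v: surrogate_loss(v, y, coeff), z)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Because the coefficient never depends on the logits being differentiated, the only path for the gradient is the outer log-probability, which is the behavior the stop-gradient is meant to enforce.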

#### Advantage whitening\.

Rewards $r_i$ are whitened using a running reward mean $\bar{r}_{<s}$ and standard deviation $\sigma_{r,<s}$ maintained across training steps via Welford's algorithm:

$$A_i=(r_i-\bar{r}_{<s})/\sigma_{r,<s}.\tag{9}$$

During the first ten steps, before the running statistics become reliable, we use a constant warm-up baseline of $0.5$, i.e. $A_i=r_i-0.5$. This matches the implementation used by all compared methods under our shared trainer and isolates method-level differences from advantage-estimation variance.
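The whitening step can be sketched with a small Welford accumulator. Several details below are assumptions, since the text does not specify them: the population form of the standard deviation, the `eps` guard against division by zero, zero-based step counting, and the choice to feed warm-up rewards into the running statistics. The class and function names are hypothetical.

```python
import math


class RunningRewardStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        d = x - self.mean
        self.mean += d / self.count
        self.m2 += d * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0


def advantages(rewards, stats, step, warmup_steps=10, baseline=0.5, eps=1e-8):
    """Whiten with statistics from earlier steps, or use the warm-up baseline."""
    if step < warmup_steps:
        adv = [r - baseline for r in rewards]
    else:
        sigma = max(stats.std(), eps)
        adv = [(r - stats.mean) / sigma for r in rewards]
    for r in rewards:  # update only after use, so the statistics cover steps < s
        stats.update(r)
    return adv


stats = RunningRewardStats()
warm = advantages([1.0, 0.0], stats, step=0)   # warm-up: r - 0.5
later = advantages([1.0], stats, step=10)      # whitened with mean 0.5, std 0.5
print(warm, later)
```

Updating the accumulator after computing the advantages keeps the current step's rewards out of their own baseline, which matches the $<s$ subscript in Eq. (9).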

#### Relation to GRPO\.

GRPO [Shao et al., [2024](https://arxiv.org/html/2605.13255#bib.bib20)] also broadcasts a sequence-level advantage to every token, but estimates the advantage from group-relative outcomes within each mini-batch and applies it with a uniform per-token weight. RLSD can be viewed as GRPO with the uniform weight replaced by the teacher–student likelihood ratio $w_{i,t}$, and EGRSD as RLSD with an additional teacher-confidence factor $\omega_{i,t}$.
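The nesting described above can be written as three per-token coefficient rules. This is a deliberately simplified sketch that omits everything else in the respective objectives (ratio clipping, normalization, KL terms). The function names and the toy values are hypothetical.

```python
import numpy as np


def grpo_coeff(advantage, n_tokens):
    """Uniform broadcast of the sequence-level advantage."""
    return np.full(n_tokens, advantage, dtype=float)


def rlsd_coeff(advantage, w):
    """Advantage scaled by the teacher-student likelihood-ratio magnitude w."""
    return advantage * np.asarray(w, dtype=float)


def egrsd_coeff(advantage, w, omega):
    """RLSD coefficient with the additional teacher-confidence factor omega."""
    return advantage * np.asarray(w, dtype=float) * np.asarray(omega, dtype=float)


# Toy values for one three-token rollout.
advantage = 1.0
w = [1.1, 0.9, 1.0]
omega = [1.0, 0.7, 0.85]
print(grpo_coeff(advantage, 3), rlsd_coeff(advantage, w), egrsd_coeff(advantage, w, omega))
```

Setting every `omega` entry to 1 recovers the RLSD rule, and setting every `w` entry to 1 as well recovers the uniform broadcast.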

## Appendix C Derivations for the main-text remarks

This appendix supplies the short derivations underlying the two methodological remarks in §[4.2](https://arxiv.org/html/2605.13255#S4.SS2) and §[4.3](https://arxiv.org/html/2605.13255#S4.SS3).

### C.1 Geometric interpretation of the linear gate

We view the token-level update as a one-dimensional shrinkage problem on a signal-plus-noise reference model. The model is interpretive and is used only to motivate the form of the gate, not to establish a tight bound on the true loss surface. Fix a token position $(i,t)$ and write the raw update magnitude as $y=\mu+\epsilon$, where $\mu$ is a latent "useful" component and $\epsilon$ is a zero-mean noise term with variance $\sigma^{2}$. A multiplicative gate $\hat{\mu}=\omega y$ minimizes the mean-squared error $\mathbb{E}[(\hat{\mu}-\mu)^{2}]$ at

$$\omega^{\star}\;=\;\frac{\mu^{2}}{\mu^{2}+\sigma^{2}}\;=\;\frac{1}{1+\sigma^{2}/\mu^{2}}.\tag{10}$$

To link teacher entropy to the noise-to-signal ratio, we adopt the simplest monotone proxy $\sigma^{2}/\mu^{2}\leq a_{0}\,\widehat{H}_{i,t}$ for some constant $a_{0}>0$. Since $\omega^{\star}$ is strictly decreasing in $\sigma^{2}/\mu^{2}$, the proxy's saturating case $\sigma^{2}/\mu^{2}=a_{0}\widehat{H}$ yields the *worst-case shrinkage bound*

$$\omega^{\star}_{a_{0}}(\widehat{H})\;=\;\frac{1}{1+a_{0}\widehat{H}},\qquad\widehat{H}\in[0,1],\tag{11}$$

which lower-bounds the true MSE-optimal shrinkage: $\omega^{\star}(\widehat{H})\geq\omega^{\star}_{a_{0}}(\widehat{H})$. We use $\omega^{\star}_{a_{0}}$ as the reference curve because it is the most aggressive shrinkage compatible with the proxy. Matching it at the endpoints is therefore a conservative design.

#### Endpoint chord\.

The linear gate $\omega(\widehat{H})=1-\gamma\widehat{H}$ is the secant of $\omega^{\star}_{a_{0}}$ at the two boundary points $\widehat{H}\in\{0,1\}$. Matching endpoints,

$$\omega(0)=\omega^{\star}_{a_{0}}(0)=1\quad\text{holds automatically,}\qquad\omega(1)=\omega^{\star}_{a_{0}}(1)\;\;\Longleftrightarrow\;\;1-\gamma=\frac{1}{1+a_{0}},\tag{12}$$

which solves to

$$\gamma\;=\;\frac{a_{0}}{1+a_{0}}\;=\;\frac{\mathrm{NSR}_{\max}}{1+\mathrm{NSR}_{\max}},\tag{13}$$

where $\mathrm{NSR}_{\max}:=a_{0}$ denotes the worst-case noise-to-signal ratio at $\widehat{H}=1$ under the proxy. Because $\omega^{\star}_{a_{0}}$ is strictly convex for $a_{0}>0$, the linear gate lies on or above the reference curve throughout $[0,1]$, with equality only at the two endpoints. We do not claim linearity is MSE-optimal. We claim it is the simplest affine form that matches the worst-case shrinkage bound at the extreme points of the normalized-entropy range, and that the resulting $\gamma$ admits the direct noise-to-signal reading in Eq. ([13](https://arxiv.org/html/2605.13255#A3.E13)). The linear NSR proxy $\sigma^{2}/\mu^{2}\leq a_{0}\widehat{H}$ is a heuristic: we make no quantitative commitment to the numerical value of $a_{0}$.
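The endpoint-matching argument can be checked numerically. The sketch below picks $a_0=3/7$ purely because Eq. (13) then gives $\gamma=0.3$. It is not a claim about the true noise-to-signal ratio, and the function names are hypothetical.

```python
import numpy as np


def reference_curve(h_hat, a0):
    """Worst-case shrinkage bound 1 / (1 + a0 * h_hat) from Eq. (11)."""
    return 1.0 / (1.0 + a0 * h_hat)


def gamma_from_a0(a0):
    """Endpoint matching, Eq. (13): gamma = a0 / (1 + a0)."""
    return a0 / (1.0 + a0)


def linear_gate(h_hat, gamma):
    return 1.0 - gamma * h_hat


a0 = 3.0 / 7.0                 # illustrative choice that yields gamma = 0.3
gamma = gamma_from_a0(a0)
grid = np.linspace(0.0, 1.0, 101)
gap = linear_gate(grid, gamma) - reference_curve(grid, a0)
print(round(gamma, 3), round(float(gap.max()), 4))
```

The gap is zero at both endpoints and positive in between, which is the secant-above-a-convex-curve property stated in the text.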

### C.2 Minimum as the extremal causal smoothing filter

###### Definition 1 (Causal smoothing filter family).

For a window $W\geq 1$, let $\mathcal{F}_{W}$ denote the class of functions $\phi:\mathbb{R}^{W+1}_{\geq 0}\to\mathbb{R}_{\geq 0}$ satisfying, for every input $(h_{0},\ldots,h_{W})$ and every $c\geq 0$: (a) *per-argument monotonicity*, with $\phi$ non-decreasing in each coordinate separately; (b) *conservativity*, with $\phi(h_{0},\ldots,h_{W})\leq h_{0}$; (c) *idempotency*, with $\phi(c,\ldots,c)=c$; and (d) *causality*, where $\phi$ depends only on the current entry $h_{0}$ and the $W$ future entries $h_{1},\ldots,h_{W}$.

Condition \(b\) is what we require of the family: a lookahead replacement for the gate should never inflate the uncertainty attributed to a currently low\-entropy \(lock\) position, since doing so would attenuate reliable supervision\. Standard windowed averages violate \(b\) whenever a low\-entropy token precedes a high\-entropy span, and are therefore excluded\.

###### Lemma 1 (Pointwise lower bound).

Every $\phi\in\mathcal{F}_{W}$ satisfies $\phi(h_{0},\ldots,h_{W})\geq\min_{0\leq j\leq W}h_{j}$.

###### Proof.

Let $m:=\min_{j}h_{j}$. Since $h_{j}\geq m$ for every $j$, per-argument monotonicity (a) gives $\phi(h_{0},\ldots,h_{W})\geq\phi(m,\ldots,m)$. Idempotency (c) gives $\phi(m,\ldots,m)=m$. Combining the two yields $\phi(h_{0},\ldots,h_{W})\geq m$. ∎

###### Proposition 1 (Extremal weight recovery at pivots).

Instantiate CL-EGRSD with any $\phi\in\mathcal{F}_{W}$ by replacing $\widehat{H}_{i,t}$ in Eq. ([6](https://arxiv.org/html/2605.13255#S4.E6)) by $\phi(H_{i,t},\ldots,H_{i,t+W})/H_{\max}^{\mathrm{batch}}$, and let the weight increment under the replacement be

$$\Delta^{\phi}_{i,t}\;:=\;\gamma\cdot\frac{h_{0}-\phi(h_{0},\ldots,h_{W})}{H_{\max}^{\mathrm{batch}}},\qquad h_{j}:=H_{i,t+j}.$$

At any position where the gate is not saturated by the lower clip in Eq. ([6](https://arxiv.org/html/2605.13255#S4.E6)),

$$\Delta^{\phi}_{i,t}\;\leq\;\Delta^{\min}_{i,t}\qquad\text{for every }\phi\in\mathcal{F}_{W}.$$

###### Proof\.

By Lemma [1](https://arxiv.org/html/2605.13255#Thmlemma1), $\phi(h_{0},\ldots,h_{W})\geq\min_{j}h_{j}$, hence $h_{0}-\phi\leq h_{0}-\min_{j}h_{j}$. Multiplying by $\gamma/H_{\max}^{\mathrm{batch}}\geq 0$ preserves the inequality. ∎

Pivot positions are by definition those with large $h_{0}-\min_{j}h_{j}$ (high current entropy, low lookahead entropy), so Proposition [1](https://arxiv.org/html/2605.13255#Thmproposition1) states that the minimum is the member of $\mathcal{F}_{W}$ that maximizes weight recovery there. Conservativity (b) makes every increment non-negative, so at sustained high-entropy (fork) positions, where $\min_{j}h_{j}\approx h_{0}$, the bound $0\leq\Delta^{\phi}\leq\Delta^{\min}\approx 0$ means that every member of the family (including the minimum) delivers essentially no recovery. The inequality $\Delta^{\phi}\leq\Delta^{\min}$ can be strict only when $h_{0}>\min_{j}h_{j}$ (the non-degenerate pivot case), and it is strict there for any $\phi$ whose output exceeds the window minimum. On constant windows $h_{0}=\cdots=h_{W}$, every $\phi\in\mathcal{F}_{W}$ collapses to the common value by idempotency and the weight recovery is zero across the family. This is the selectivity property CL-EGRSD exploits.
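Proposition 1 can be sanity-checked numerically against a couple of other filters that satisfy conditions (a) to (d). The two alternative filters below (a one-step lookahead minimum and a window mean capped at the current value) are examples chosen for this sketch, not filters evaluated in the paper, and the values of `gamma`, `h_max`, and `W` are illustrative.

```python
import numpy as np


def phi_min(h):
    """Minimum over the full causal window (the CL-EGRSD choice)."""
    return float(np.min(h))


def phi_short(h):
    """Minimum over the current entry and one future entry only."""
    return float(min(h[0], h[1]))


def phi_capped_mean(h):
    """Window mean capped at the current value, so it never exceeds h[0]."""
    return float(min(h[0], np.mean(h)))


def increment(phi, h, gamma, h_max):
    """Weight increment gamma * (h0 - phi(h)) / h_max from Proposition 1."""
    return gamma * (h[0] - phi(h)) / h_max


gamma, h_max, W = 0.3, 2.0, 5
rng = np.random.default_rng(0)
worst = 0.0  # largest observed excess of any alternative over the minimum filter
for _ in range(1000):
    h = rng.uniform(0.0, h_max, size=W + 1)
    d_min = increment(phi_min, h, gamma, h_max)
    for phi in (phi_short, phi_capped_mean):
        worst = max(worst, increment(phi, h, gamma, h_max) - d_min)

pivot = np.array([1.8, 1.7, 0.2, 0.1, 0.1, 0.1])   # high now, low soon after
fork = np.full(W + 1, 1.8)                          # high throughout the window
print(worst, increment(phi_min, pivot, gamma, h_max), increment(phi_min, fork, gamma, h_max))
```

On the toy pivot window the minimum filter recovers far more weight than the one-step filter, while on the constant fork window every filter recovers nothing.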

## Appendix D Full experimental details

#### Baselines\.

We select comparison methods that are closely related to our approach and have publicly available training code compatible with our shared protocol: the no-train base model, supervised fine-tuning (SFT), Group Relative Policy Optimization (GRPO) [Shao et al., [2024](https://arxiv.org/html/2605.13255#bib.bib20)], on-policy self-distillation (OPSD) [Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)], and CRISP [Sang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib19)]. While Reinforcement Learning with Self-Distillation (RLSD) [Yang et al., [2026](https://arxiv.org/html/2605.13255#bib.bib26)] is conceptually highly relevant, it lacks public training code at the time of writing. We therefore use our direction-aware baseline ($\gamma=0$) as its reference point in the ablations.

#### Training data\.

Our training data configuration matches that of OPSD [Zhao et al., [2026](https://arxiv.org/html/2605.13255#bib.bib30)]: the subset of OpenThoughts-114k [Guha et al., [2025](https://arxiv.org/html/2605.13255#bib.bib6)] distilled from DeepSeek-R1 [DeepSeek-AI, [2025](https://arxiv.org/html/2605.13255#bib.bib4)] reasoning traces and filtered to answer-verified samples [Zhao, [2026](https://arxiv.org/html/2605.13255#bib.bib29)]. Each sample provides a problem $x$ and a concise reference solution $s^{\star}$. Our data collator uses only the `problem` and `solution` columns, so the teacher context is $(x,s^{\star},y_{<t})$ and the student context is $(x,y_{<t})$. All compared methods share the same training data and preprocessing, so accuracy differences isolate the effect of the loss function.

#### Evaluation benchmarks\.

AIME 2024/AIME 2025 are the 30-problem annual American Invitational Mathematics Examination contests. HMMT 2025 is the 30-problem February 2025 Harvard-MIT Mathematics Tournament. MATH-500 [Lightman et al., [2024](https://arxiv.org/html/2605.13255#bib.bib14)] is a held-out 500-problem subset of the MATH competition dataset [Hendrycks et al., [2021](https://arxiv.org/html/2605.13255#bib.bib7)]. Minerva Math [Lewkowycz et al., [2022](https://arxiv.org/html/2605.13255#bib.bib12)] is a 272-problem college-level STEM reasoning benchmark. GSM8K [Cobbe et al., [2021](https://arxiv.org/html/2605.13255#bib.bib2)] is a 1,319-problem grade-school arithmetic benchmark. Decoding and aggregation are described in §[5.1](https://arxiv.org/html/2605.13255#S5.SS1).

#### Implementation details\.

All training and on-policy rollouts are conducted on a single 8× NVIDIA A100-SXM4-80GB server (640 GiB aggregate HBM, full-mesh NVLink). Total wall-clock time across all reported runs is approximately 24 hours on this machine. For the software environment, training and evaluation are implemented in PyTorch 2.10 (CUDA 12.8, NVIDIA driver 580) using Transformers 4.57 and PEFT [Mangrulkar et al., [2022](https://arxiv.org/html/2605.13255#bib.bib16)] 0.18. We employ Hugging Face Accelerate 1.12 with the `MULTI_GPU` (DDP) backend. FlashAttention-2 [Dao, [2024](https://arxiv.org/html/2605.13255#bib.bib3)] (version 2.8.3) is utilized to accelerate long-context forward passes. On-policy sampling and evaluation decoding are supported by vLLM 0.18 in colocate mode. The detailed training hyperparameters are summarized in Table [A1](https://arxiv.org/html/2605.13255#A4.T1).

Table A1: Training configuration shared by all methods on both model sizes. Only the loss and the loss-specific switches ($\gamma$, $W$, entropy mode) differ across methods.

#### Reward details\.

We use a verifier\-based trajectory\-level reward with optional length shaping:

$$r_i = \mathbb{1}[y_i\,\text{correct}] \cdot \big(1 + \beta_L \cdot (1 - |y_i| / L_{\mathrm{max}})\big),$$

where $\beta_L = 0.5$ is the length-shaping coefficient and $L_{\mathrm{max}}$ is the maximum completion length. Incorrect and unverifiable completions receive $r_i = 0$, and we do not use negative rewards. Advantage whitening and the teacher/student implementation are described in Appendix [B](https://arxiv.org/html/2605.13255#A2).
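The reward can be written as a few lines of Python. This is a minimal sketch, assuming the verifier outcome is available as a boolean; the function name `shaped_reward`, its argument names, and the example budget of 8192 tokens are illustrative assumptions and are not taken from the paper's code.

```python
def shaped_reward(is_correct: bool, completion_len: int, max_len: int,
                  beta_l: float = 0.5) -> float:
    """Verifier-gated reward: 1[correct] * (1 + beta_L * (1 - |y| / L_max))."""
    if not is_correct:
        # Incorrect or unverifiable completions receive zero; no negative rewards.
        return 0.0
    return 1.0 + beta_l * (1.0 - completion_len / max_len)


# Hypothetical budget of 8192 tokens (the actual L_max is not stated in this section).
print(shaped_reward(True, 4096, 8192))   # correct, uses half the budget
print(shaped_reward(True, 8192, 8192))   # correct, uses the full budget
print(shaped_reward(False, 1024, 8192))  # incorrect
```

Under this form, a correct completion earns a reward between 1 and $1 + \beta_L = 1.5$, so shorter correct answers are preferred while correctness still dominates the signal.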

#### Choice of $\gamma$ in the main tables\.

We adopt $\gamma = 0.3$ for EGRSD at both model sizes and for CL-EGRSD on Qwen3-4B, so that the confidence-gate coefficient is matched between EGRSD and CL-EGRSD on the smaller model. On Qwen3-8B, CL-EGRSD instead uses $(\gamma, W) = (1.0, 5)$, drawn from the joint sweep in Table [A3](https://arxiv.org/html/2605.13255#A7.T3) and discussed further in Appendix [E](https://arxiv.org/html/2605.13255#A5). The full hyperparameter recipe is summarized in Table [A2](https://arxiv.org/html/2605.13255#A5.T2).

## Appendix E Hyperparameters

Table [A2](https://arxiv.org/html/2605.13255#A5.T2) lists the per-method hyperparameters used for EGRSD and CL-EGRSD in the main tables (Tables [2](https://arxiv.org/html/2605.13255#S5.T2) and [3](https://arxiv.org/html/2605.13255#S5.T3)). On Qwen3-4B we use $\gamma = 0.3$ for both methods and $W = 3$ for CL-EGRSD, drawn from the $\gamma$ sweep in Table [5](https://arxiv.org/html/2605.13255#S5.T5) and the matched-$\gamma$ lookahead sweep in Table [6](https://arxiv.org/html/2605.13255#S5.T6). On Qwen3-8B we use $\gamma = 0.3$ for EGRSD and $(\gamma, W) = (1.0, 5)$ for CL-EGRSD. The latter is drawn from the joint sweep in Table [A3](https://arxiv.org/html/2605.13255#A7.T3), attains the highest Avg. in our six-benchmark evaluation, and isolates the contribution of the causal-lookahead window from entropy-coefficient strength through comparison with the matched $W = 0$ reference. Stronger-suppression configurations for Qwen3-4B were not evaluated. Extending the joint $\gamma$–$W$ sweep to 4B is left as future work.

Table A2: Per-method hyperparameters for the main-table EGRSD and CL-EGRSD results on Qwen3-4B and Qwen3-8B.

## Appendix F Training dynamics

Figure [A1](https://arxiv.org/html/2605.13255#A6.F1) reports the per-step pre-clip gradient norm on Qwen3-8B for four representative dense-logging reruns over the 100-step budget: the baselines OPSD and CRISP, and our EGRSD and CL-EGRSD. We omit the corresponding loss curves because the shared trainer logs a signed advantage-weighted surrogate whose magnitude is not directly comparable across methods.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x6.png)

Figure A1: Per-step pre-clip gradient norm on Qwen3-8B. The red dash-dot line is the clipping threshold ($0.1$). Clipping is rarely active: only one EGRSD step slightly exceeds the threshold, while the remaining runs stay below it.

**Gradient norm.** The dense traces show that all four methods operate in a narrow pre-clip range for almost the entire run. OPSD and CRISP stay mostly in the $0.02$–$0.08$ band, while EGRSD and CL-EGRSD have similar average magnitudes (mean pre-clip norm $\approx 0.035$). The only visible clipping event is a single EGRSD spike early in training, which reaches $\approx 0.11$. CL-EGRSD remains below the $0.1$ threshold throughout. Thus, the entropy-weighted objectives do not systematically inflate update magnitudes, and gradient clipping acts as an occasional safeguard rather than a persistent constraint.
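To make the role of the threshold concrete, here is a minimal pure-Python sketch of clipping by global norm, assuming the standard rule of rescaling the whole gradient when its L2 norm exceeds the threshold. The function name `clip_by_global_norm` and the toy gradient vectors are illustrative; the reported runs use the PyTorch-based trainer described in the implementation details.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float = 0.1):
    """Return (pre_clip_norm, clipped_grads) under global L2-norm clipping."""
    pre_clip_norm = math.sqrt(sum(g * g for g in grads))
    if pre_clip_norm <= max_norm:
        # Below the threshold: the update is left untouched.
        return pre_clip_norm, list(grads)
    scale = max_norm / pre_clip_norm
    return pre_clip_norm, [g * scale for g in grads]


# Toy gradient with norm 0.035 (a typical step): unchanged.
norm_a, clipped_a = clip_by_global_norm([0.021, 0.028])
# Toy gradient with norm 0.11 (like the single EGRSD spike): rescaled to norm 0.1.
norm_b, clipped_b = clip_by_global_norm([0.066, 0.088])

print(round(norm_a, 3), round(norm_b, 3))
print(round(math.sqrt(sum(g * g for g in clipped_b)), 3))
```

A step at the reported mean norm of about 0.035 passes through unchanged, while a 0.11 spike is scaled by roughly $0.1 / 0.11 \approx 0.91$.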

## Appendix G Extended CL-EGRSD ablation

To isolate the interaction between lookahead window $W$ and entropy coefficient $\gamma$, we ran an additional CL-EGRSD sweep on Qwen3-8B. We evaluate selected configurations on AIME24, AIME25, HMMT25, and Minerva with $K = 4$. MATH500 and GSM8K are excluded because they saturate across the method family. Numerical results are reported in Table [A3](https://arxiv.org/html/2605.13255#A7.T3), and Figure [A2](https://arxiv.org/html/2605.13255#A7.F2) visualizes the full $\gamma \times W$ grid as per-benchmark heatmaps.

Table A3: CL-EGRSD extended ablation on Qwen3-8B, suppress mode.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x7.png)

Figure A2: Extended CL-EGRSD $\gamma \times W$ grid on Qwen3-8B, suppress mode. All numbers are avg@4 (%). The $W = 0$ column is the matched EGRSD reference. Other columns are CL-EGRSD. Per-panel color scales emphasize within-benchmark differences.

At strong suppression, $(\gamma, W) = (1.0, 5)$ improves over the matched $W = 0$ reference on all four evaluated benchmarks, with the largest gains on HMMT25 ($47.50 \to 52.50$) and AIME25 ($66.67 \to 70.00$). This focused sweep motivates jointly tuning $\gamma$ and $W$, but it uses a restricted benchmark subset, so the main headline remains the six-benchmark result.
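To express these gains in terms of individual samples, the following sketch converts avg@4 percentages into counts of correct completions. It assumes that avg@K is the mean per-sample accuracy over $K$ samples per problem, and that HMMT25 and AIME25 contain 30 problems each as stated in the benchmark description; the helper name `correct_samples` is illustrative.

```python
def correct_samples(avg_at_k_percent: float, num_problems: int, k: int) -> int:
    """Number of correct completions implied by an avg@K score given in percent."""
    total_samples = num_problems * k
    return round(avg_at_k_percent / 100.0 * total_samples)


# (gamma, W) = (1.0, 5) versus the matched W = 0 reference, K = 4, 30 problems.
for name, before, after in [("HMMT25", 47.50, 52.50), ("AIME25", 66.67, 70.00)]:
    n_before = correct_samples(before, 30, 4)
    n_after = correct_samples(after, 30, 4)
    print(name, n_before, "->", n_after, "of 120")
```

Under these assumptions, the HMMT25 gain corresponds to 6 additional correct completions out of 120 and the AIME25 gain to 4, so the differences rest on a small number of samples, consistent with treating this sweep as a focused diagnostic.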

## Appendix H Qualitative pivot examples

This appendix provides three qualitative views of the pivot/fork distinction on held-out reasoning windows. Figure [A3](https://arxiv.org/html/2605.13255#A8.F3) gives a global view of the lock/fork/pivot token regimes demonstrated in the main-text Figure [3](https://arxiv.org/html/2605.13255#S4.F3). Figure [A4](https://arxiv.org/html/2605.13255#A8.F4) visualizes pivot rescue directly, showing per-token current entropy, five-token lookahead entropy, and the corresponding EGRSD and CL-EGRSD weights on representative reasoning windows. Figure [A5](https://arxiv.org/html/2605.13255#A8.F5) further overlays the top-$K$ local entropy peaks on two Minerva completions together with their 4-token left context, verifying that most annotated peaks are transient (i.e., pivots whose lookahead entropy drops rapidly) rather than sustained forks.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x8.png)

Figure A3: Global entropy trace on a held-out reasoning problem (Qwen3-4B), illustrating at full-trace scale the same lock/fork/pivot token regimes demonstrated in the main-text Figure [3](https://arxiv.org/html/2605.13255#S4.F3). Top: per-token normalized entropy across the $\sim$320-token reasoning trace, with a smoothed envelope. The green span marks a routine-computation (lock) region and the three red spans mark strategy-pivot transitions. Bottom: zoomed views of one routine span and two pivot spans, each with a short text snippet from the generated trace.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x9.png)

Figure A4: Qualitative token-level examples of pivot rescue. Each page shows current entropy, five-token lookahead entropy, EGRSD weight, and CL-EGRSD weight for a selected reasoning window. Green spans mark pivot tokens where high current entropy resolves into low future entropy and CL-EGRSD restores the token weight. Red spans mark sustained forks. Token snippets are sparse anchors rather than full transcripts.

![Refer to caption](https://arxiv.org/html/2605.13255v1/x10.png)

Figure A5: Per-token entropy trace with annotated transition points on two Minerva reasoning windows (Qwen3-4B). Top row: current entropy $\widehat{H}_{i,t}$ with the top-$K$ local peaks highlighted and annotated with the 4-token left context (red dots). The annotated peaks align with discourse transition markers and local shifts in the generated derivation. Bottom row: the five-token causal lookahead $\widehat{H}_{i,t}^{\mathrm{CL}}$ with the same peak indices overlaid. At $W = 5$, $\widehat{H}_{i,t}^{\mathrm{CL}} \leq 0.08$ for 13 of the 14 annotated peaks, matching the transient-transition signature targeted by CL-EGRSD.
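The pivot/fork distinction shown in Figures A4 and A5 can be reproduced on a toy trace. The sketch below follows the lookahead and gate rules of Algorithm 1 (a windowed minimum over positions $t, \dots, t+W$, followed by $\mathrm{clip}(1 - \gamma \widehat{H}, 0.1, 1)$). The entropy values are invented for illustration and are assumed to be already normalized to $[0, 1]$, and the function names are hypothetical.

```python
def causal_lookahead(entropy: list[float], window: int) -> list[float]:
    """Replace each entropy by the minimum over positions t..t+window (clipped at the end)."""
    last = len(entropy) - 1
    return [min(entropy[t:min(t + window, last) + 1]) for t in range(len(entropy))]


def confidence_gate(norm_entropy: list[float], gamma: float, floor: float = 0.1) -> list[float]:
    """Token weight clip(1 - gamma * H_hat, floor, 1)."""
    return [min(1.0, max(floor, 1.0 - gamma * h)) for h in norm_entropy]


# Toy normalized entropies: index 2 is a transient pivot (entropy collapses right after),
# indices 5-12 form a sustained fork (entropy stays high for longer than the window).
trace = [0.05, 0.04, 0.90, 0.06, 0.05, 0.85, 0.80, 0.90,
         0.85, 0.80, 0.88, 0.82, 0.86, 0.05]
gamma, window = 1.0, 5

egrsd_w = confidence_gate(trace, gamma)                               # W = 0
cl_egrsd_w = confidence_gate(causal_lookahead(trace, window), gamma)  # W = 5

print("pivot t=2:", round(egrsd_w[2], 2), "->", round(cl_egrsd_w[2], 2))
print("fork  t=5:", round(egrsd_w[5], 2), "->", round(cl_egrsd_w[5], 2))
```

With $\gamma = 1.0$ and $W = 5$, the pivot token recovers most of its weight (0.1 to 0.95) because low entropy appears inside the window, while the token at the start of the fork stays heavily down-weighted (0.15 to 0.2).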
## Appendix I AceReason-Nemotron-7B cross-architecture diagnostic

We additionally evaluate the methods on AceReason-Nemotron-7B \[Chen et al., [2025](https://arxiv.org/html/2605.13255#bib.bib1)\], a strong reasoning-tuned external base, with results detailed in Table [A4](https://arxiv.org/html/2605.13255#A9.T4). This diagnostic is intentionally separate from the Qwen3 main comparison: AceReason has a different post-training recipe and a longer native reasoning style, and the no-training model is already competitive.

Table A4: AceReason-Nemotron-7B cross-architecture diagnostic. All accuracy metrics are percentages. Best and runner-up values are shown in bold and underline.

The no-training AceReason model is already strong (60.63), and most trained baselines fail to improve over it: SFT degrades to 59.60, while GRPO (60.52), OPSD (60.13), and CRISP (60.75) cluster around the no-training level. CL-EGRSD at $\gamma = 0.5$ is the only trained method in this diagnostic that clearly exceeds the no-training baseline, reaching 61.45 avg@4 (+0.82 over no-train) with shorter completions (7,943 vs. 8,382 tokens) and the highest GSM8K score (93.91, +2.42). This suggests that on reasoning-tuned external bases with a different post-training recipe, uniform-weight distillation objectives struggle to add value, while the entropy-gated update preserves a useful learning signal. The entropy coefficient likely needs to be retuned per base, since the same $\gamma$ selected for Qwen3 need not transfer unchanged.

## Appendix J Weak-base cross-architecture diagnostic on Olmo-3-7B Base

We also run a weak-base cross-architecture diagnostic on Olmo-3-7B Base \[Groeneveld et al., [2024](https://arxiv.org/html/2605.13255#bib.bib5)\], a non-reasoning-tuned external base model. Absolute performance is much lower than on Qwen3 or on reasoning-tuned external bases, so this experiment is not intended as a headline accuracy comparison. Instead, it tests negative transfer: whether an on-policy distillation objective can avoid making a weak external base worse. We report all six benchmark scores, the macro average, and the average generation length in Table [A5](https://arxiv.org/html/2605.13255#A10.T5).

Table A5: Weak-base diagnostic on Olmo-3-7B Base. All accuracy metrics are percentages. Best and runner-up values are shown in bold and underline.

Table A6: Aggregate deltas for the Olmo-3-7B Base diagnostic. $\Delta_{\rm base}$ is relative to the no-training model. $\Delta_{\rm SFT}$ is relative to SFT on the same base.

As summarized in Table [A6](https://arxiv.org/html/2605.13255#A10.T6), CL-EGRSD is the only trainable method in this weak-base diagnostic that improves over both the no-training model and SFT. In contrast, OPSD and CRISP exhibit negative transfer relative to the no-training base, and GRPO remains below SFT. This supports the interpretation that the entropy gate stabilizes training under weak external bases by attenuating harmful token-level updates when the privileged teacher signal is less reliable for the student’s native distribution.

## Limitations

Our evaluation is restricted to mathematical reasoning, and transfer to code, agentic reasoning, or open-ended reasoning remains untested. The mechanism diagnostics use held-out benchmarks and therefore condition the privileged teacher on a proxy reference solution rather than the original training-set reference. A further limitation is that teacher Shannon entropy is only an operational confidence measure and an imperfect proxy for token-level supervision reliability: high entropy can arise from epistemic ambiguity in the reasoning path, but also from aleatoric variation such as synonymous phrasing or vocabulary-level underspecification. Relatedly, EGRSD and CL-EGRSD introduce additional hyperparameters ($\gamma$ and, for CL-EGRSD, the lookahead window $W$) beyond those of existing OPSD-style objectives. The per-model settings we use are listed in Appendix [E](https://arxiv.org/html/2605.13255#A5), and a scale-aware or schedule-based $\gamma$ controller is a natural direction for future work.

## Appendix K Algorithm

Algorithm [1](https://arxiv.org/html/2605.13255#alg1) summarizes a single training step of EGRSD / CL-EGRSD. The teacher $p_T$ is the frozen base model conditioned on the privileged context $(x, s^{\star})$, and the student $p_S = p_\theta$ is conditioned on $x$ only. The teacher’s weights are not updated through LoRA. Entropy normalization is shared across tokens in the current minibatch, and the lookahead window $W$ distinguishes EGRSD ($W = 0$) from CL-EGRSD ($W > 0$). Code, LoRA adapters, and evaluation artifacts will be released publicly.

Algorithm 1: EGRSD / CL-EGRSD training step (shared direction–magnitude–confidence form).

1: Input: batch of problems $\{x_i\}$ with privileged solutions $\{s^{\star}_i\}$; student $p_\theta$; frozen teacher $p_T$ (shared backbone, adapter disabled); entropy coefficient $\gamma$; lookahead window $W$ (set $W = 0$ for EGRSD); clip $\varepsilon$; length-shape $\beta_L$; running reward stats $(\bar{r}, \sigma_r)$.

2: for each problem $x_i$ in batch do

3: Sample on-policy completion $y_i \sim p_\theta(\cdot \mid x_i)$ via vLLM. ▷ Rollout

4: Compute reward $r_i \leftarrow \mathbb{1}[y_i \text{ correct}] \cdot \big(1 + \beta_L (1 - |y_i| / L_{\max})\big)$.

5: Whiten advantage $A_i \leftarrow (r_i - \bar{r}) / \sigma_r$; set direction $D_i \leftarrow \mathrm{sign}(A_i)$.

6: end for

7: Forward pass: collect $\log p_T(y_{i,t} \mid x_i, s^{\star}_i, y_{i,<t})$ and $\log p_S(y_{i,t} \mid x_i, y_{i,<t})$; use stop-gradient on $p_S$.

8: for all completion tokens $(i,t)$ do

9: Compute teacher entropy $H_{i,t} \leftarrow -\sum_{v} p_T(v \mid \cdot) \log p_T(v \mid \cdot)$.

10: end for

11: if CL-EGRSD ($W > 0$) then

12: $H_{i,t} \leftarrow \min_{j \in [t,\, \min(t+W, T_i)]} H_{i,j}$ ▷ Causal lookahead replacement.

13: end if

14: Batch-normalize: $\widehat{H}_{i,t} \leftarrow H_{i,t} / \max\!\big(\max_{(j,k)} H_{j,k},\, 1\big)$.

15: Confidence gate: $\omega_{i,t} \leftarrow \mathrm{clip}(1 - \gamma\, \widehat{H}_{i,t},\, 0.1,\, 1)$.

16: Directional log-ratio: $\delta_{i,t} \leftarrow \log p_T(y_{i,t} \mid \cdot) - \log p_S(y_{i,t} \mid \cdot)$.

17: Magnitude: $w_{i,t} \leftarrow \mathrm{clip}\!\big(\exp(D_i\, \delta_{i,t}),\, 1 - \varepsilon,\, 1 + \varepsilon\big)$.

18: Token-level advantage: $\widehat{A}_{i,t} \leftarrow A_i \cdot w_{i,t} \cdot \omega_{i,t}$ (all stop-gradient).

19: Loss: $\mathcal{L} \leftarrow -\tfrac{1}{\sum m_{i,t}} \sum m_{i,t}\, \widehat{A}_{i,t}\, \log p_\theta(y_{i,t} \mid x_i, y_{i,<t})$.

20: Gradient step: $\theta \leftarrow \theta - \eta\, \mathrm{clip\text{-}norm}(\nabla_\theta \mathcal{L},\, 0.1)$.

21: Update running reward statistics $(\bar{r}, \sigma_r)$ via Welford’s algorithm.
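Steps 5 and 16 to 19 of Algorithm 1 can be traced on a two-token completion with plain Python. This is a sketch under stated assumptions rather than the released implementation: the function names, the example log-probabilities, the running statistics, the gate values, and $\varepsilon = 0.2$ are all invented for illustration, and the mask $m_{i,t}$ is taken to be 1 for every token.

```python
import math


def sign(x: float) -> float:
    return float((x > 0) - (x < 0))


def token_advantages(reward, mean_r, std_r, teacher_logp, student_logp, gate, eps):
    """Token-level advantages A_hat[t] = A * w[t] * omega[t] for one completion."""
    advantage = (reward - mean_r) / std_r            # step 5: whitened advantage
    direction = sign(advantage)                      # step 5: D = sign(A)
    out = []
    for lp_t, lp_s, omega in zip(teacher_logp, student_logp, gate):
        delta = lp_t - lp_s                          # step 16: teacher-student log-ratio
        ratio = math.exp(direction * delta)          # step 17: magnitude ...
        w = min(1.0 + eps, max(1.0 - eps, ratio))    # ... clipped to [1 - eps, 1 + eps]
        out.append(advantage * w * omega)            # step 18: treated as a constant
    return out


def surrogate_loss(token_adv, student_logp):
    """Step 19 with an all-ones mask: -mean(A_hat * log p_theta)."""
    return -sum(a * lp for a, lp in zip(token_adv, student_logp)) / len(token_adv)


# Hypothetical values: a correct, fairly short completion with two tokens.
teacher_logp = [-0.1, -2.0]
student_logp = [-0.5, -1.0]
gate = [1.0, 0.1]            # confident token, then a fully gated high-entropy token
adv = token_advantages(1.25, 0.5, 0.5, teacher_logp, student_logp, gate, eps=0.2)

print([round(a, 3) for a in adv])
print(round(surrogate_loss(adv, student_logp), 3))
```

In this toy case the teacher assigns higher probability than the student to the first token, so the clipped ratio saturates at $1 + \varepsilon$. The second token is ratio-clipped at $1 - \varepsilon$ and gated down to the floor weight 0.1, which shrinks its advantage from 1.2 to 0.12. In an autodiff framework the token-level advantages would be detached so that gradients flow only through $\log p_\theta$.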