Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

arXiv cs.LG 06/03/26, 04:00 AM Papers
Summary
Introduces FiRe-OPD, a method for on-policy distillation in LLMs that filters low-quality trajectories and applies soft reweighting to emphasize informative tokens, achieving improved performance in strong-to-weak, single-teacher, and multi-teacher settings.
arXiv:2606.02684v1 Announce Type: new Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on MinervaMATH in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
Cached at: 06/03/26, 09:39 AM
# Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
Source: [https://arxiv.org/html/2606.02684](https://arxiv.org/html/2606.02684)
Yuying Li¹∗⋄, Leqi Zheng¹∗, Yongzi Yu², Wenrui Zhou², Xuchang Zhong³, Xing Hu⁴, Jing Jin¹, Huangjie Yuan⁵†, Tao Feng¹†

¹THU, ²HKUST, ³BIT, ⁴Meituan, ⁵ZJU

liyuying25@mails.tsinghua.edu.cn

∗Equal Contribution, †Corresponding Author

###### Abstract

On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on MinervaMATH in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.


![Refer to caption](https://arxiv.org/html/2606.02684v1/x1.png)

Figure 1: Performance comparison across three distillation scenarios. FiRe-OPD (red) achieves the most balanced and expansive coverage across all benchmarks.

## 1 Introduction

On-policy distillation (OPD) has emerged as a compelling post-training paradigm for transferring reasoning capabilities from teacher models to smaller student models. Unlike supervised fine-tuning, OPD avoids the train-inference distribution mismatch by learning on student-generated trajectories, while providing denser token-level supervision than reinforcement learning’s sparse outcome rewards Zhu et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib49)); Ye et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib50)); Li et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib18)); Wu et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib30)); Fu et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib21)); Zheng et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib17)); Jang et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib31)); Song and Zheng ([2026](https://arxiv.org/html/2606.02684#bib.bib20)). These advantages have made OPD a widely adopted approach in reasoning-intensive tasks.

However, standard OPD applies uniform full-trajectory KL supervision, which has inherent limitations in both optimization granularity and signal reliability. Not all trajectories and tokens carry equal learning value, and critical rollouts and informative tokens should be assigned greater importance during optimization. Recognizing this, recent OPD research increasingly turns to selective distillation that controls the granularity at which supervision is applied.

EOPD Jin et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib15)) identifies that high teacher entropy causes unstable learning signals and switches to forward KL at high-entropy token positions. TIP Xu et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib16)) selects tokens based on student entropy and teacher-student divergence through hard filtering rules. ExOPD Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)) reinterprets OPD as KL-constrained RL and introduces a global reward scaling factor. Uni-OPD Hou et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib13)) addresses unreliable supervision through outcome-guided margin calibration at the trajectory level. However, existing works suffer from two key limitations:

Table 1: Overview of OPD methods across granularities and techniques, and the scope of FiRe-OPD. Traj. and Tok. are the granularity columns; T-Conf., S-Conf., and Soft-W. are the technique columns.

| Method | Traj. | Tok. | T-Conf. | S-Conf. | Soft-W. |
|---|---|---|---|---|---|
| OPD | ✗ | ✗ | ✗ | ✗ | ✗ |
| EOPD | ✗ | ✓ | ✓ | ✗ | ✗ |
| TIP | ✗ | ✓ | ✗ | ✓ | ✗ |
| ExOPD | ✗ | ✗ | ✗ | ✗ | ✗ |
| REOPOLD | ✗ | ✓ | ✓ | ✗ | ✗ |
| FiRe-OPD | ✓ | ✓ | ✓ | ✓ | ✓ |

Limitation 1. Granularity isolation. Existing methods operate at either the trajectory or token level, focusing on a single dimension of signal quality (e.g., teacher confidence or student state), without jointly modeling both granularities or exploiting their complementarity in OPD.

Limitation 2. Hard selection strategies. Most token-level methods rely on hard selection to remove tokens during OPD, which induces non-smooth optimization and permanently discards potentially useful supervision signals, thereby weakening learning robustness. Table [1](https://arxiv.org/html/2606.02684#S1.T1) provides a systematic comparison of existing OPD methods along these dimensions.

In this work, we propose FiRe-OPD (Filter, then Reweight), a unified framework that performs trajectory-level filtering and token-level importance weighting from a dual perspective of teacher confidence and student confusion. At the trajectory level, FiRe-OPD filters out rollouts where the teacher assigns low overall likelihood, indicating a large teacher-student distribution gap where the teacher’s supervision is unreliable. At the token level, FiRe-OPD assigns continuous importance weights by multiplicatively combining teacher confidence and student confusion, concentrating learning on positions where the teacher provides reliable guidance and the student has genuine need. This soft weighting preserves gradient contributions from all positions proportional to their informativeness, enabling fine-grained, adaptive supervision that accounts for both what the teacher can teach and what the student needs to learn.

In summary, our contributions are threefold:

\(i\) We propose FiRe\-OPD, a unified framework that jointly performs trajectory\-level filtering and token\-level soft reweighting, enabling fine\-grained and selective OPD\.

\(ii\) We show that optimization granularity is critical in OPD: hard filtering is more effective at the trajectory level, whereas soft token weighting surpasses hard token selection at the token level\.

(iii) We demonstrate the superiority of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher distillation settings on various benchmarks.

![Refer to caption](https://arxiv.org/html/2606.02684v1/figure/modelarch2.png)

Figure 2: Overview of FiRe-OPD, which performs trajectory-level filtering and token-level importance weighting.

## 2 Related Work

Off-policy Distillation. Knowledge distillation (KD) transfers knowledge from a stronger teacher to a smaller student model. Classical KD trains the student to match the teacher’s output distribution, while sequence-level KD uses complete teacher-generated responses as supervision Hinton et al. ([2015](https://arxiv.org/html/2606.02684#bib.bib1)); Kim and Rush ([2016](https://arxiv.org/html/2606.02684#bib.bib2)). In the LLM era, KD has evolved toward broader capability transfer such as reasoning and alignment Gu et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib3)); Ko et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib4)); He et al. ([2025a](https://arxiv.org/html/2606.02684#bib.bib5)); Liu et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib6)). However, most off-policy KD methods rely on teacher-generated trajectories, leading to exposure bias. This limitation motivates OPD, which directly supervises the student under its own generation distribution.

On-Policy Distillation. OPD has recently emerged as an effective paradigm for post-training. Prior studies show that reverse-KL-style objectives and supervision on student-generated mistakes can improve open-ended generation and reasoning tasks Gu et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib3)); Agarwal et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib7)). Recent work further studies how to make OPD scalable, stable, and generalizable through reward extrapolation, entropy-aware objectives, reasoning-prefix acceleration, competence-aware curricula, divergence constraints, and rollout mixture distillation Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)); Jin et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib15)); Zhang et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib10)); Luo et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib12)); Hou et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib13)). Meanwhile, OPD has also been extended to self-distillation Zhao et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib32)); Xu et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib11)); Wang et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib33)); Kim et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib34)); Zhang et al. ([2026c](https://arxiv.org/html/2606.02684#bib.bib35)); Yang et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib44)), hybrid RL-distillation frameworks Yan et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib36)); Hübotter et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib37)); Zhang et al. ([2026d](https://arxiv.org/html/2606.02684#bib.bib38)); Ding ([2026](https://arxiv.org/html/2606.02684#bib.bib39)); Yang et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib40)); Zhang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib43)), multimodal distillation Li et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib41)); Cao et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib42)); Chen et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib45)); Bousselham et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib46)), agentic settings Wang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib48)), and embodied learning Zhong et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib47)). Recent token-selection methods attempt to reduce noisy supervision by discarding low-value tokens, but hard selection may lose useful information and produce brittle optimization signals. Our work addresses this limitation through adaptive trajectory and token-level weighting, filtering low-quality trajectories and softly modulating token-level distillation intensity.

## 3 Methodology

### 3.1 Preliminaries

Table 2: Strong-to-weak distillation results (Avg@8). Setting: Qwen3-30B-A3B-Instruct → Qwen3-4B. The last row gives the change of FiRe-OPD relative to OPD.

| Method | AIME24 | AIME25 | MATH | AMC | Olymp. | Miner. | HMMT Feb | HMMT Nov | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Student (Base) | 21.67 | 22.50 | 83.65 | 67.19 | 51.80 | 39.48 | 12.50 | 7.08 | 38.23 |
| Teacher | 76.67 | 63.33 | 97.22 | 95.94 | 78.32 | 47.47 | 45.00 | 60.00 | 70.49 |
| + SFT | 25.42 | 22.92 | 85.82 | 70.31 | 54.60 | 40.81 | 13.75 | 12.92 | 40.82 |
| + GRPO | 55.00 | 48.33 | 93.20 | 93.06 | 68.69 | 43.73 | 29.17 | 35.42 | 58.33 |
| + OPD | 54.58 | 48.75 | 91.25 | 93.92 | 70.62 | 43.01 | 28.33 | 39.17 | 58.70 |
| + ExOPD | 58.75 | 48.33 | 94.35 | 93.75 | 70.61 | 43.38 | 30.83 | 41.25 | 60.16 |
| + TIP | 59.58 | 49.58 | 92.19 | 93.60 | 70.66 | 43.70 | 29.58 | 40.00 | 59.86 |
| + REOPOLD | 57.50 | 46.67 | 93.95 | 92.19 | 70.16 | 43.20 | 29.17 | 41.25 | 59.26 |
| + EOPD | 52.92 | 49.17 | 93.40 | 92.81 | 70.92 | 42.97 | 27.08 | 39.17 | 58.56 |
| + FiRe-OPD (Ours) | 60.83 | 52.92 | 93.73 | 93.13 | 70.47 | 43.47 | 32.08 | 40.00 | 60.83 |
| Δ vs OPD | +6.25 | +4.17 | +2.48 | -0.79 | -0.15 | +0.46 | +3.75 | +0.83 | +2.13 |

We first introduce the standard on-policy distillation (OPD) framework. Let $\pi_\theta$ denote the student model and $\pi_T$ denote the teacher model. At each training iteration, the student generates rollouts from its current policy given a set of prompts $\{x_i\}$:

$$y \sim \pi_\theta(\cdot \mid x) \tag{1}$$

The teacher then provides token\-level supervision on these student\-generated trajectories\. Standard OPD formulates this as a policy optimization problem using PPO\-style clipped objectives, where the token\-level advantage is defined as the teacher\-student log\-likelihood ratio:

$$a_t = \log \pi_T(y_t \mid x, y_{<t}) - \log \pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t}) \tag{2}$$

This advantage encourages the student to increase the probability of tokens to which the teacher assigns higher likelihood than the student’s old policy does. The policy gradient loss is:

$$\mathcal{L}_{\text{OPD}} = -\frac{1}{T} \sum_{t=1}^{T} \min\left( r_t a_t,\; \text{clip}(r_t, 1-\epsilon, 1+\epsilon)\, a_t \right) \tag{3}$$

where $r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}$ is the importance sampling ratio and the clip operator constrains $r_t$ to $[1-\epsilon, 1+\epsilon]$ to prevent excessively large policy updates. We set $\epsilon = 0.2$. Standard OPD applies this objective uniformly across all trajectories and all token positions, treating every supervision signal equally.
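As a rough illustration of Eqs. (2) and (3), the following pure-Python sketch computes the clipped OPD loss for a single trajectory from per-token log-probabilities of the sampled tokens. The function name `opd_loss` and its argument names are hypothetical and not taken from the paper's code; a practical implementation would operate on batched tensors with automatic differentiation.

```python
import math


def opd_loss(teacher_logp, student_logp, old_logp, eps=0.2):
    """Clipped OPD loss for one trajectory (illustrative sketch of Eqs. 2-3).

    Each argument is a list of per-token log-probabilities of the sampled
    tokens y_t under the teacher, the current student, and the old student.
    """
    total = 0.0
    for lt, ls, lo in zip(teacher_logp, student_logp, old_logp):
        advantage = lt - lo                       # Eq. (2): teacher-student log-ratio
        ratio = math.exp(ls - lo)                 # importance sampling ratio r_t
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * advantage, clipped * advantage)
    return -total / len(teacher_logp)             # Eq. (3): negative mean surrogate


if __name__ == "__main__":
    # When the student equals its old policy, r_t = 1 and the loss is -mean(a_t).
    print(opd_loss([-0.5, -1.0], [-1.0, -2.0], [-1.0, -2.0]))
```

When the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction favored by the advantage, the clipped term is selected, which caps the contribution of that token.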

### 3.2 FiRe-OPD

Standard OPD applies uniform supervision across all trajectories and token positions, which is suboptimal because distillation signal quality varies significantly at both levels. As illustrated in Figure [2](https://arxiv.org/html/2606.02684#S1.F2), FiRe-OPD addresses this through two complementary mechanisms: trajectory-level filtering and token-level soft reweighting.

Proposition 1. What signal best reflects the importance of a trajectory?

Some works use outcome correctness Zheng et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib17)); Hou et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib13)) or reward scores to select trajectories. However, these approaches require external verifiers and do not directly reflect the teacher’s supervision capability on a given path.

We observe that the teacher’s log\-probability on a student\-generated trajectory reflects the distributional alignment between teacher and student on that path\. A low teacher log\-probability indicates a large distribution gap—the student’s reasoning path diverges significantly from what the teacher would produce\. In such cases, regardless of whether the trajectory is objectively correct, the teacher’s token\-level guidance along this path is unreliable: the teacher is effectively being asked to supervise a reasoning style it is unfamiliar with\. Forcing distillation on these high\-divergence trajectories can introduce noisy or even contradictory gradients, leading to negative transfer rather than effective learning\.

Based on this insight, we define the trajectory-level importance score as the teacher’s normalized log-probability over a rollout $y = (y_1, \ldots, y_T)$ given prompt $x$, where $\pi^{*}$ denotes the teacher policy (written $\pi_T$ in Section 3.1):

$$s(y) = \frac{1}{T} \sum_{t=1}^{T} \log \pi^{*}(y_t \mid x, y_{<t}) \tag{4}$$

We rank all rollouts within a training batch by $s(y)$ and discard the bottom $p\%$ (we use $p = 20$ by default). Only the surviving trajectories proceed to the token-level optimization stage. This filtering ensures that distillation occurs only on trajectories where the teacher can provide coherent supervision, that is, paths that lie within the teacher’s competence region, where its token-level signals are most likely to be meaningful and consistent.
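The filtering step around Eq. (4) can be sketched as follows. The helper names are hypothetical, and rounding the number of discarded rollouts down to an integer is an assumption, since the text does not say how a fractional count is handled.

```python
def trajectory_score(teacher_logp):
    """Eq. (4): length-normalized teacher log-probability of one rollout."""
    return sum(teacher_logp) / len(teacher_logp)


def filter_trajectories(batch_teacher_logp, p=20.0):
    """Return indices of rollouts kept after discarding the bottom p% by score.

    batch_teacher_logp is a list of rollouts, each a list of per-token teacher
    log-probabilities. Assumption: the number of dropped rollouts is rounded down.
    """
    scores = [trajectory_score(lp) for lp in batch_teacher_logp]
    n_drop = int(len(scores) * p / 100.0)
    lowest_first = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(lowest_first[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]


if __name__ == "__main__":
    batch = [[-0.1, -0.2], [-3.0, -2.0], [-0.5], [-1.0, -1.0, -1.0], [-0.3, -0.1]]
    print(filter_trajectories(batch, p=20.0))
```

Because the score is averaged over tokens, long and short rollouts are compared on the same per-token scale, and no external verifier is needed to rank them.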

Table 3: Single-teacher distillation results (Avg@8). Setting: Qwen3-4B-Non-Thinking-RL-Math → Qwen3-4B. The last row gives the change of FiRe-OPD relative to OPD.

| Method | AIME24 | AIME25 | MATH | AMC | Olymp. | Miner. | HMMT Feb | HMMT Nov | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Student (Base) | 21.67 | 22.50 | 83.65 | 67.19 | 51.80 | 39.48 | 12.50 | 7.08 | 38.23 |
| Teacher | 56.25 | 46.67 | 93.40 | 91.56 | 68.06 | 47.52 | 30.83 | 35.83 | 58.77 |
| + OPD | 57.92 | 57.50 | 94.69 | 95.17 | 47.66 | 68.81 | 32.50 | 35.42 | 61.21 |
| + ExOPD | 60.42 | 54.58 | 95.25 | 93.44 | 47.84 | 67.28 | 32.08 | 35.83 | 60.84 |
| + FiRe-OPD (Ours) | 61.25 | 55.00 | 94.69 | 95.15 | 48.07 | 68.49 | 33.75 | 37.50 | 61.74 |
| Δ vs OPD | +3.33 | -2.50 | +0.00 | -0.02 | +0.41 | -0.32 | +1.25 | +2.08 | +0.53 |

Table 4: Multi-teacher distillation results (Qwen3-4B-Non-Thinking-RL-Math + Qwen3-4B-Non-Thinking-RL-Code → Qwen3-4B-Non-Thinking). Math reasoning columns (AIME24 through Math Avg) report Avg@8; code generation columns (HE+ through Code Avg) report pass@1. The last row gives the change of FiRe-OPD relative to OPD.

| Method | AIME24 | AIME25 | Miner. | HMMT Feb | HMMT Nov | Math Avg | HE+ | MBPP+ | LCB | Code Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Student (Base) | 21.67 | 22.50 | 39.48 | 12.50 | 7.08 | 20.65 | 79.90 | 64.60 | 17.57 | 54.02 |
| Teacher | 76.67 | 63.33 | 47.47 | 45.00 | 60.00 | 58.49 | 79.90 | 72.00 | 27.85 | 59.92 |
| OPD | 59.58 | 57.08 | 48.53 | 32.50 | 37.50 | 47.04 | 82.93 | 69.58 | 26.86 | 59.79 |
| + ExOPD | 60.83 | 55.00 | 66.39 | 34.17 | 38.75 | 51.03 | 89.00 | 69.31 | 29.28 | 62.53 |
| + FiRe-OPD | 64.17 | 55.83 | 67.34 | 35.00 | 37.08 | 51.88 | 92.70 | 71.69 | 28.10 | 64.16 |
| Δ vs OPD | +4.59 | -1.25 | +18.81 | +2.50 | -0.42 | +4.84 | +9.77 | +2.11 | +1.24 | +4.37 |

Proposition 2. What makes a token position informative for distillation?

Existing works either focus on a single signal (teacher entropy alone Ko et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib14)) or student entropy alone Xu et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib16))) or apply hard truncation that entirely discards tokens below a fixed threshold. These approaches either miss one important dimension of signal quality or irreversibly lose gradient information from positions that still carry partial learning value. In contrast, we argue that a position is most informative for distillation when two conditions are jointly satisfied: the teacher is confident (providing reliable guidance) and the student is confused (indicating genuine learning need). This motivates a unified, soft weighting scheme that integrates both signals simultaneously.

Based on this, we define the token-level importance weight using two complementary signals. Teacher confidence $c_t^{T}$ measures how reliable the teacher’s guidance is at position $t$:

$$c_t^{T} = 1 - \frac{H(\pi^{*}(\cdot \mid x, y_{<t}))}{\max_{t' \in \mathcal{B}} H(\pi^{*}(\cdot \mid x, y_{<t'}))} \tag{5}$$

where $H(\cdot)$ denotes the entropy of the output distribution and $\max_{t' \in \mathcal{B}}$ is the empirical maximum over all valid token positions in the current batch $\mathcal{B}$. Student confusion $c_t^{S}$ measures how much the student needs guidance at position $t$:

$$c_t^{S} = \frac{H(\pi_\theta(\cdot \mid x, y_{<t}))}{\max_{t' \in \mathcal{B}} H(\pi_\theta(\cdot \mid x, y_{<t'}))} \tag{6}$$

The token\-level importance weight combines both signals multiplicatively:

$$w_t = (1 + \alpha \cdot c_t^{T}) \times (1 + \beta \cdot c_t^{S}) \tag{7}$$

where $\alpha, \beta \geq 0$ are hyperparameters controlling the sensitivity to each respective factor (we use $\alpha = \beta = 1.0$ by default throughout all experiments). The weighted advantage for each token is then obtained by normalizing the raw weights within each trajectory to preserve gradient scale:

$$\tilde{a}_t = \frac{w_t}{\frac{1}{T} \sum_{t'=1}^{T} w_{t'}} \cdot a_t \tag{8}$$

The normalization ensures that the total gradient magnitude remains stable across different trajectories. The final policy gradient loss is:

$$\mathcal{L}_{\text{FiRe-OPD}} = -\frac{1}{T} \sum_{t=1}^{T} \min\bigl( r_t \tilde{a}_t,\, \mathrm{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, \tilde{a}_t \bigr) \tag{9}$$

This design concentrates learning effort on positions where the teacher is confident yet the student remains confused, while still preserving gradient contributions from all positions in proportion to their relative informativeness\.
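A minimal sketch of Eqs. (5) to (8), under stated assumptions: per-position entropies and their batch maxima are computed beforehand, the batch maxima are strictly positive, and all function names are illustrative rather than the authors' implementation. Since both $c_t^{T}$ and $c_t^{S}$ lie in $[0, 1]$, each raw weight lies in $[1, (1+\alpha)(1+\beta)]$, that is $[1, 4]$ with the default $\alpha = \beta = 1$, so no position is zeroed out.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def token_weights(teacher_H, student_H, max_teacher_H, max_student_H,
                  alpha=1.0, beta=1.0):
    """Raw token weights w_t for one trajectory.

    teacher_H / student_H: per-position entropies of teacher and student.
    max_teacher_H / max_student_H: batch-level maxima (assumed > 0).
    """
    weights = []
    for ht, hs in zip(teacher_H, student_H):
        teacher_conf = 1.0 - ht / max_teacher_H     # Eq. (5)
        student_conf = hs / max_student_H           # Eq. (6)
        weights.append((1.0 + alpha * teacher_conf) * (1.0 + beta * student_conf))  # Eq. (7)
    return weights


def reweight_advantages(advantages, weights):
    """Eq. (8): scale each advantage by its weight over the trajectory mean weight."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w * a for w, a in zip(weights, advantages)]


if __name__ == "__main__":
    w = token_weights([0.0, 1.0], [1.0, 0.0], 1.0, 1.0)
    print(w, reweight_advantages([1.0, 1.0], w))
```

The reweighted advantages returned by `reweight_advantages` would then take the place of $a_t$ in the clipped objective of Eq. (9).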

Table 5: Ablation study on component contributions (Avg@8, Strong-to-Weak setting).

| Method | AIME24 | AIME25 | MATH | AMC | Olymp. | Miner. | HMMT Feb | HMMT Nov | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OPD (Base) | 54.58 | 48.75 | 91.25 | 93.92 | 70.62 | 43.01 | 28.33 | 39.17 | 58.70 |
| FiRe-OPD (Full) | 60.83 | 52.92 | 93.73 | 93.13 | 70.47 | 43.47 | 32.08 | 40.00 | 60.83 |
| w/o Traj. Filter | 56.92 | 49.88 | 93.54 | 91.19 | 70.41 | 43.12 | 28.33 | 38.42 | 58.99 |
| w/o Teacher Conf. | 58.33 | 52.08 | 94.08 | 90.31 | 70.64 | 42.46 | 28.75 | 37.92 | 59.32 |
| w/o Student Conf. | 59.08 | 49.87 | 93.51 | 91.34 | 70.33 | 43.00 | 31.17 | 37.92 | 59.53 |
| Traj. Filter Only | 54.58 | 51.67 | 93.85 | 91.88 | 69.96 | 42.88 | 30.42 | 39.17 | 59.30 |

![Refer to caption](https://arxiv.org/html/2606.02684v1/x2.png)

Figure 3: Hyperparameter sensitivity analysis. The solid black line (left axis) shows Avg accuracy across all benchmarks; dashed colored lines (right axis) show per-benchmark deviations ($\Delta$) from the default setting. (a) Trajectory filtering percentile $p$ exhibits a clear peak at $p = 20\%$. (b) Performance is robust for $\alpha \geq 1.0$ but degrades notably at small values. (c) $\beta$ shows minimal sensitivity across the full range, confirming that student confusion weighting is robust to its scaling.

## 4 Experiment

### 4.1 Experimental Setup

#### Models\.

To demonstrate the generalizability of FiRe-OPD, we evaluate across three distillation scenarios: (i) Strong-to-Weak, where Qwen3-30B-A3B-Instruct serves as the teacher and Qwen3-4B-Non-Thinking Yang et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib26)) as the student, testing the ability to bridge large capacity gaps; (ii) Single-Teacher, where Qwen3-4B-Non-Thinking-RL-Math Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)) teaches Qwen3-4B-Non-Thinking, testing transfer efficiency between models of the same size; and (iii) Multi-Teacher, where Qwen3-4B-Non-Thinking-RL-Math and Qwen3-4B-Non-Thinking-RL-Code Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)) jointly supervise Qwen3-4B, testing the ability to integrate heterogeneous domain expertise.

#### Training Data\.

For the strong-to-weak and single-teacher scenarios, we use the DeepMath-103K He et al. ([2025b](https://arxiv.org/html/2606.02684#bib.bib22)) dataset filtered to difficulty level 6, following Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)). For the multi-teacher scenario, we use the multi-teacher training dataset from Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)), which combines mathematical and code-domain data.

#### Training Details\.

We train for 3 epochs (165 steps total) with a batch size of 1024, learning rate of $1 \times 10^{-6}$, and maximum response length of 16384. During rollout, we sample with temperature 1.0 and top-$p$ = 1.0. For FiRe-OPD-specific hyperparameters, we set $\alpha = \beta = 1.0$ and trajectory filtering percentile $p = 20\%$. Training is conducted on 8×A100 80GB GPUs.

#### Evaluation\.

For mathematical reasoning, we evaluate on eight benchmarks spanning a range of difficulty levels: AIME 2024, AIME 2025, MATH-500 Hendrycks et al. ([2021](https://arxiv.org/html/2606.02684#bib.bib23)), AMC 2023, OlympiadBench He et al. ([2024](https://arxiv.org/html/2606.02684#bib.bib25)), MinervaMATH Lewkowycz et al. ([2022](https://arxiv.org/html/2606.02684#bib.bib24)), HMMT 2025 Feb, and HMMT 2025 Nov Balunovic et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib27)). We sample 8 responses per problem with temperature 1.0 and report Avg@8 accuracy. For code generation, we evaluate on three widely-used benchmarks: HumanEval+ Liu et al. ([2023](https://arxiv.org/html/2606.02684#bib.bib28)), MBPP+ Liu et al. ([2023](https://arxiv.org/html/2606.02684#bib.bib28)), and LiveCodeBench (v6 only, February 2025–May 2025) Jain et al. ([2025](https://arxiv.org/html/2606.02684#bib.bib29)), and report pass@1 accuracy.
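The text does not define Avg@8 further. Assuming it denotes per-problem accuracy averaged over the 8 sampled responses and then averaged over problems, the metric could be computed as in this small sketch (the function name `avg_at_k` and the input layout are illustrative assumptions):

```python
def avg_at_k(correct):
    """Avg@k in percent.

    correct[i][j] is True when the j-th sampled response to problem i is
    judged correct; every problem is assumed to have the same k samples.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


if __name__ == "__main__":
    # Two problems, 8 samples each: one always solved, one never solved.
    print(avg_at_k([[True] * 8, [False] * 8]))
```

Unlike pass@k, this average rewards a model for solving a problem consistently across samples rather than at least once.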

#### Baselines\.

We compare against standard OPD and five recent improvements: ExOPD Yang et al. ([2026b](https://arxiv.org/html/2606.02684#bib.bib19)), TIP Xu et al. ([2026a](https://arxiv.org/html/2606.02684#bib.bib16)), REOPOLD Ko et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib14)), EOPD Jin et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib15)), and Uni-OPD Hou et al. ([2026](https://arxiv.org/html/2606.02684#bib.bib13)). ExOPD uses the official open-source implementation; TIP, REOPOLD, and EOPD are reproduced by us. All methods are trained under the same data, model, and compute budget for fair comparison. We also report SFT and GRPO results as reference.

Table 6: Ablation on soft weighting vs. hard truncation (Avg@8, Strong-to-Weak setting). In each “A + B” label, A is the trajectory-level filtering strategy and B is the token-level weighting strategy.

| Method | AIME24 | AIME25 | MATH | AMC | Olymp. | Miner. | HMMT Feb | HMMT Nov | Avg |
|---|---|---|---|---|---|---|---|---|---|
| FiRe-OPD (Hard + Soft) | 60.83 | 52.92 | 93.73 | 93.13 | 70.47 | 43.47 | 32.08 | 40.00 | 60.83 |
| Hard + Hard | 57.92 | 46.25 | 93.95 | 90.62 | 68.53 | 43.15 | 31.67 | 33.75 | 58.23 |
| Soft + Soft | 57.92 | 48.75 | 93.75 | 90.94 | 70.75 | 43.15 | 28.33 | 35.83 | 58.68 |
| Soft + Hard | 55.42 | 51.25 | 93.35 | 89.38 | 68.86 | 43.01 | 30.42 | 36.67 | 58.55 |

### 4.2 Main Results

#### Strong\-to\-Weak Distillation\.

Table [2](https://arxiv.org/html/2606.02684#S3.T2) presents results for distilling from a 30B teacher to a 4B student. FiRe-OPD achieves the highest average accuracy of 60.83%, outperforming the strongest baseline ExOPD (60.16%) by 0.67 points and standard OPD (58.70%) by 2.13 points. The improvements are particularly pronounced on challenging competition-level benchmarks: +6.25 on AIME 2024, +4.17 on AIME 2025, +3.75 on HMMT Feb, and +2.48 on MATH-500. We also observe that FiRe-OPD substantially outperforms both SFT (which barely improves over the base model) and GRPO, confirming the advantage of dense teacher supervision combined with adaptive weighting. Compared to other token-level methods, FiRe-OPD achieves a higher average than TIP (+0.97 avg), REOPOLD (+1.57 avg), and EOPD (+2.27 avg), demonstrating the effectiveness of our method.

#### Single\-Teacher Distillation\.

Table [3](https://arxiv.org/html/2606.02684#S3.T3) shows results where the teacher and student share the same architecture size, representing a minimal distribution gap scenario. FiRe-OPD achieves the highest average accuracy of 61.74%, improving over standard OPD (61.21%) by 0.53 points and ExOPD (60.84%) by 0.90 points, with notable gains on competition-level tasks such as AIME 2024 (+3.33) and HMMT Nov (+2.08). The gain in average accuracy indicates that FiRe-OPD remains beneficial even when the teacher-student distribution gap is small.

#### Multi\-Teacher Distillation\.

Table [4](https://arxiv.org/html/2606.02684#S3.T4) presents results where two domain-specialized teachers (math and code) jointly supervise one student. FiRe-OPD achieves the best math reasoning average of 51.88% (+4.84 over OPD) and code generation average of 64.16% (+4.37 over OPD). The gains are substantial across both domains: +18.81 on MinervaMATH and +4.59 on AIME 2024 for math, +9.77 on HumanEval+ and +2.11 on MBPP+ for code. Notably, FiRe-OPD enables the student to substantially surpass both teachers on code tasks (92.70 vs. 79.90 on HumanEval+), demonstrating effective knowledge integration beyond simple imitation.

#### Cross\-Scenario Analysis\.

In the strong-to-weak setting, gains concentrate on competition-level benchmarks, confirming that trajectory filtering effectively removes low-quality rollouts that would otherwise corrupt learning on hard problems. Single-teacher distillation yields uniform but modest improvements given the smaller capacity gap, while the multi-teacher setting exhibits the most dramatic gains (+4.84 avg on math), as filtering and adaptive weighting naturally resolve conflicts between heterogeneous teachers. Overall, FiRe-OPD scales gracefully with distillation difficulty, whether it arises from capacity gaps, task complexity, or supervision heterogeneity.

### 4.3 Ablation Studies

To gain deeper understanding of how each mechanism contributes to FiRe\-OPD’s effectiveness, we conduct comprehensive ablations analyzing the contribution of each component, the sensitivity to hyperparameters, and the effectiveness of soft weighting versus hard truncation\. All ablations are performed in the Strong\-to\-Weak setting\.

#### Component Ablation\.

Table [5](https://arxiv.org/html/2606.02684#S3.T5) presents results when removing individual components. The full FiRe-OPD (60.83) significantly outperforms all ablated variants. Removing student confusion causes the largest drop (-2.24), followed by trajectory filtering (-1.84), while removing teacher confidence has the smallest impact (-0.96). This reveals an asymmetric role: student confusion is the dominant token-level signal determining where the student needs help, while teacher confidence serves as a complementary quality filter. Trajectory filtering alone (59.30) already outperforms OPD (58.70), but combining it with token-level weighting yields further gains (+1.53), confirming that both granularities contribute complementarily.

![Refer to caption](https://arxiv.org/html/2606.02684v1/x3.png)

Figure 4: Case Study\. Visualization of FiRe\-OPD’s token\-level weight allocation on a math reasoning trajectory\.

![Refer to caption](https://arxiv.org/html/2606.02684v1/x4.png)

Figure 5: Statistical Analysis of Weight Allocation\.

#### Hyperparameter Sensitivity\.

Figure [3](https://arxiv.org/html/2606.02684#S3.F3) visualizes sensitivity to the three hyperparameters\. For the filtering percentile $p$ \(Figure [3](https://arxiv.org/html/2606.02684#S3.F3)a\), performance peaks at $p=20\%$, with both under\-filtering \($p=10\%$: 58\.53\) and over\-filtering \($p=40\%$: 58\.08\) degrading results\. For $\alpha$ \(Figure [3](https://arxiv.org/html/2606.02684#S3.F3)b\), performance is robust for $\alpha \geq 1.0$ but degrades at small values \($\alpha=0.5$: 58\.23\), confirming teacher confidence is necessary albeit secondary\. For $\beta$ \(Figure [3](https://arxiv.org/html/2606.02684#S3.F3)c\), performance shows minimal sensitivity across the full range, confirming that student confusion weighting is robust to its scaling\. Complete per\-benchmark results are provided in Tables [8](https://arxiv.org/html/2606.02684#A1.T8) and [7](https://arxiv.org/html/2606.02684#A0.T7)\.

#### Soft Weighting vs\. Hard Truncation\.

Table [6](https://arxiv.org/html/2606.02684#S4.T6) compares four combinations of trajectory\-level and token\-level strategies \(Hard = discrete filtering/selection, Soft = continuous weighting\)\. FiRe\-OPD’s design \(Hard trajectory filtering \+ Soft token weighting\) achieves the best average of 60\.83, outperforming Hard\+Hard \(58\.23\), Soft\+Soft \(58\.68\), and Soft\+Hard \(58\.55\)\. This validates two design choices: \(1\) At the trajectory level, hard filtering is superior to soft weighting, because low\-quality trajectories should be completely removed rather than down\-weighted, as even reduced gradients from unreliable paths can accumulate noise\. \(2\) At the token level, soft weighting outperforms hard selection, since tokens exist on a continuum of informativeness, and preserving gradient contributions proportional to their value yields better optimization than binary keep\-or\-discard decisions\.
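
To make the token-level contrast concrete, the following minimal sketch compares continuous weights against a binary keep-or-discard mask on a toy four-token trajectory. The scoring rule (`exp(-alpha * H_teacher) * (1 - exp(-beta * H_student))`), the function names, and the toy entropies are all assumptions chosen for illustration; they are not the paper's weight definition, which is given in its method section. The only properties taken from the text are that weights rise with teacher confidence and student confusion, and that soft weights are rescaled to mean 1.

```python
import numpy as np

def soft_token_weights(teacher_entropy, student_entropy, alpha=1.0, beta=1.0):
    """Continuous token weights rescaled to mean 1 (hypothetical scoring rule)."""
    t = np.asarray(teacher_entropy, dtype=float)
    s = np.asarray(student_entropy, dtype=float)
    # Assumed score: large when the teacher is confident (low entropy)
    # and the student is confused (high entropy).
    score = np.exp(-alpha * t) * (1.0 - np.exp(-beta * s)) + 1e-8
    # Mean-1 rescaling redistributes weight without changing its total.
    return score / score.mean()

def hard_token_mask(weights, keep_ratio=0.5):
    """Binary alternative: keep only the top `keep_ratio` fraction of tokens."""
    w = np.asarray(weights, dtype=float)
    k = max(1, int(round(keep_ratio * w.size)))
    threshold = np.sort(w)[-k]
    return (w >= threshold).astype(float)

def weighted_loss(per_token_kl, weights):
    """Average of the per-token divergence, scaled by the token weights."""
    kl = np.asarray(per_token_kl, dtype=float)
    return float(np.mean(np.asarray(weights, dtype=float) * kl))

# Toy trajectory (made-up values): token 1 is a point where the teacher is
# confident and the student is not; token 0 is easy for both models.
teacher_entropy = [0.1, 0.1, 2.0, 0.2]
student_entropy = [0.05, 1.5, 1.5, 0.6]
per_token_kl = [0.02, 0.90, 0.70, 0.30]

soft = soft_token_weights(teacher_entropy, student_entropy)
hard = hard_token_mask(soft, keep_ratio=0.5)
print("soft weights:", soft.round(3))
print("hard mask:   ", hard)
print("soft loss:", weighted_loss(per_token_kl, soft))
print("hard loss:", weighted_loss(per_token_kl, hard))
```

Under the hard mask, the two discarded tokens contribute no gradient at all, whereas the soft weights keep a small contribution from every token. That difference is what the Hard/Soft comparison at the token level isolates.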

### 4\.4 Case Study: Token Weight Visualization

To provide an intuitive understanding of how FiRe\-OPD allocates learning effort, we visualize the token\-level weights on a representative mathematical reasoning trajectory in Figure [4](https://arxiv.org/html/2606.02684#S4.F4), where darker shading indicates higher weight\. The highest weights are assigned to reasoning transition tokens such as “Therefore,” “implies,” and “So”: positions where the teacher confidently knows the next direction but the student remains uncertain\. Conversely, numerical values, operators, and variable names receive minimal weights, as both models are highly confident on these tokens once the reasoning path is determined\. Notably, the weighting is genuinely context\-dependent: the same token “the” receives different weights depending on whether it introduces a critical reasoning conclusion or appears in a routine phrase\.

Figure [5](https://arxiv.org/html/2606.02684#S4.F5) further corroborates this pattern statistically through four complementary views of the learned token\-level weight distribution\. The upper\-left panel displays the overall weight histogram, which is sharply peaked around 1\.0, confirming purely redistributive reweighting that preserves total gradient magnitude\. The upper\-right panel presents a positional analysis showing weights increasing toward the end of the trajectory, where reasoning conclusions and final answers typically reside\. The two bottom panels list representative tokens at the extremes of the weight spectrum: the highest\-weight tokens are dominated by reasoning connectives \(“Since,” “So,” “However,” “Therefore”\) and metacognitive cues \(“check,” “remember”\), while the lowest\-weight tokens consist of procedural words \(“proceed,” “compute,” “find”\) and formulaic punctuation\. Together, these visualizations reveal that FiRe\-OPD automatically identifies the distillation bottleneck as reasoning strategy selection \(deciding what to do next\) rather than computational execution, and concentrates learning effort on decision points where teacher guidance provides the greatest informational value\.

## 5 Conclusion

We propose FiRe\-OPD, a dual\-granularity framework for on\-policy distillation that filters low\-confidence trajectories and assigns continuous token\-level weights based on teacher confidence and student confusion\. Experiments across three distillation scenarios on math reasoning and code generation benchmarks demonstrate consistent improvements over standard OPD and recent baselines\. Ablation studies reveal that teacher and student signals contribute asymmetrically across granularities, and that the two levels favor different selection strategies—hard filtering for trajectories and soft weighting for tokens\.

## 6 Limitations

While FiRe\-OPD demonstrates consistent improvements, the design space for adaptive distillation granularity remains largely unexplored\. Our current approach treats each token independently without modeling how erroneous prefixes may degrade subsequent teacher signals—a prefix\-aware weighting scheme could yield further gains\. Additionally, intermediate granularities such as step\-level or segment\-level weighting, which align more naturally with chain\-of\-thought structure, represent promising directions\. We leave these explorations to future work\.

## References

- On\-policy distillation of language models: learning from self\-generated mistakes\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 21246–21263\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- M\. Balunovic, J\. Dekoninck, I\. Petrov, N\. Jovanovic, and M\. Vechev \(2025\)Matharena: evaluating llms on uncontaminated math competitions, february 2025\.URL https://matharena\. ai8\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- W\. Bousselham, H\. Kuehne, and C\. Schmid \(2025\)VOLD: reasoning transfer from llms to vision\-language models via on\-policy distillation\.arXiv preprint arXiv:2510\.23497\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- D\. Cao, D\. Fu, H\. Yu, S\. Zheng, X\. Tan, and T\. Jin \(2026\)X\-opd: cross\-modal on\-policy distillation for capability alignment in speech llms\.arXiv preprint arXiv:2603\.24596\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- H\. Chen, K\. Zhang, H\. Tan, L\. Guibas, G\. Wetzstein, and S\. Bi \(2025\)Pi\-flow: policy\-based few\-step generation via imitation distillation\.arXiv preprint arXiv:2510\.14974\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- K\. Ding \(2026\)Hdpo: hybrid distillation policy optimization via privileged self\-distillation\.arXiv preprint arXiv:2603\.23871\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- Y\. Fu, H\. Huang, K\. Jiang, J\. Liu, Z\. Jiang, Y\. Zhu, and D\. Zhao \(2026\)Revisiting on\-policy distillation: empirical failure modes and simple fixes\.arXiv preprint arXiv:2603\.25562\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2024\)Minillm: knowledge distillation of large language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 32694–32717\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1),[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- C\. He, Y\. Ding, J\. Guo, R\. Gong, H\. Qin, and X\. Liu \(2025a\)DA\-kd: difficulty\-aware knowledge distillation for efficient large language models\.InForty\-second International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1)\.
- C\. He, R\. Luo, Y\. Bai, S\. Hu, Z\. Thai, J\. Shen, J\. Hu, X\. Han, Y\. Huang, Y\. Zhang,et al\.\(2024\)Olympiadbench: a challenging benchmark for promoting agi with olympiad\-level bilingual multimodal scientific problems\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3828–3850\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- Z\. He, T\. Liang, J\. Xu, Q\. Liu, X\. Chen, Y\. Wang, L\. Song, D\. Yu, Z\. Liang, W\. Wang,et al\.\(2025b\)Deepmath\-103k: a large\-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning\.arXiv preprint arXiv:2504\.11456\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1)\.
- W\. Hou, S\. Peng, W\. Wang, Z\. Ruan, Y\. Zhang, Z\. Zhou, M\. Gao, Y\. Chen, K\. Wang, H\. Yang,et al\.\(2026\)Uni\-opd: unifying on\-policy distillation with a dual\-perspective recipe\.arXiv preprint arXiv:2605\.03677\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p3.1),[§2](https://arxiv.org/html/2606.02684#S2.p2.1),[§3\.2](https://arxiv.org/html/2606.02684#S3.SS2.p3.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px5.p1.1)\.
- J\. Hübotter, F\. Lübeck, L\. Behric, A\. Baumann, M\. Bagatella, D\. Marta, I\. Hakimi, I\. Shenfeld, T\. K\. Buening, C\. Guestrin,et al\.\(2026\)Reinforcement learning via self\-distillation\.arXiv preprint arXiv:2601\.20802\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- N\. Jain, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2025\)Livecodebench: holistic and contamination free evaluation of large language models for code\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 58791–58831\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- I\. Jang, J\. Yeom, J\. Yeo, H\. Lim, and T\. Kim \(2026\)Stable on\-policy distillation through adaptive target reformulation\.arXiv preprint arXiv:2601\.07155\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- W\. Jin, T\. Min, Y\. Yang, S\. R\. Kadhe, Y\. Zhou, D\. Wei, N\. Baracaldo, and K\. Lee \(2026\)Entropy\-aware on\-policy distillation of language models\.arXiv preprint arXiv:2603\.07079\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p3.1),[§2](https://arxiv.org/html/2606.02684#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px5.p1.1)\.
- J\. Kim, X\. Luo, M\. Kim, S\. Lee, D\. Kim, J\. Jeon, D\. Li, and Y\. Yang \(2026\)Why does self\-distillation \(sometimes\) degrade the reasoning capability of llms?\.arXiv preprint arXiv:2603\.24472\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- Y\. Kim and A\. M\. Rush \(2016\)Sequence\-level knowledge distillation\.InProceedings of the 2016 conference on empirical methods in natural language processing,pp\. 1317–1327\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1)\.
- J\. Ko, S\. Abdali, Y\. J\. Kim, T\. Chen, and P\. Cameron \(2026\)Scaling reasoning efficiently via relaxed on\-policy distillation\.arXiv preprint arXiv:2603\.11137\.Cited by:[§3\.2](https://arxiv.org/html/2606.02684#S3.SS2.p7.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px5.p1.1)\.
- J\. Ko, T\. Chen, S\. Kim, T\. Ding, L\. Liang, I\. Zharkov, and S\. Yun \(2025\)DistiLLM\-2: a contrastive approach boosts the distillation of llms\.InInternational Conference on Machine Learning,pp\. 31044–31062\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1)\.
- A\. Lewkowycz, A\. Andreassen, D\. Dohan, E\. Dyer, H\. Michalewski, V\. Ramasesh, A\. Slone, C\. Anil, I\. Schlag, T\. Gutman\-Solo,et al\.\(2022\)Solving quantitative reasoning problems with language models\.Advances in neural information processing systems35,pp\. 3843–3857\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- J\. Li, H\. Yin, H\. Xu, B\. Xu, W\. Tan, Z\. He, J\. Ju, Z\. Luo, and J\. Luan \(2026a\)Video\-opd: efficient post\-training of multimodal large language models for temporal video grounding via on\-policy distillation\.arXiv preprint arXiv:2602\.02994\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- Y\. Li, Y\. Zuo, B\. He, J\. Zhang, C\. Xiao, C\. Qian, T\. Yu, H\. Gao, W\. Yang, Z\. Liu,et al\.\(2026b\)Rethinking on\-policy distillation of large language models: phenomenology, mechanism, and recipe\.arXiv preprint arXiv:2604\.13016\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- J\. Liu, C\. Zhang, J\. Guo, Y\. Zhang, H\. Que, K\. Deng, Z\. Bai, J\. Liu, G\. Zhang, J\. Wang,et al\.\(2024\)Ddk: distilling domain knowledge for efficient large language models\.Advances in Neural Information Processing Systems37,pp\. 98297–98319\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p1.1)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2023\)Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation\.Advances in neural information processing systems36,pp\. 21558–21572\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px4.p1.1)\.
- F\. Luo, Y\. Chuang, G\. Wang, Z\. Xu, X\. Han, T\. Zhang, and V\. Braverman \(2026\)Demystifying opd: length inflation and stabilization strategies for large language models\.arXiv preprint arXiv:2604\.08527\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- M\. Song and M\. Zheng \(2026\)A survey of on\-policy distillation for large language models\.arXiv preprint arXiv:2604\.00626\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- H\. Wang, G\. Wang, H\. Xiao, Y\. Zhou, Y\. Pan, J\. Wang, K\. Xu, Y\. Wen, X\. Ruan, and X\. Chen \(2026a\)Skill\-conditioned self\-distillation for multi\-turn llm agents\.arXiv preprint arXiv:2604\.10674\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- J\. Wang, W\. Zhang, W\. Shi, Y\. Li, and J\. Cheng \(2026b\)TCOD: exploring temporal curriculum in on\-policy distillation for multi\-turn autonomous agents\.arXiv preprint arXiv:2604\.24005\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- Y\. Wu, S\. Han, and H\. Cai \(2026\)Lightning opd: efficient post\-training for large reasoning models with offline on\-policy distillation\.arXiv preprint arXiv:2604\.13010\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- Y\. Xu, H\. Sang, Z\. Zhou, R\. He, Z\. Wang, and A\. Geramifard \(2026a\)Tip: token importance in on\-policy distillation\.arXiv preprint arXiv:2604\.14084\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.02684#S3.SS2.p7.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px5.p1.1)\.
- Y\. Xu, H\. Sang, Z\. Zhou, R\. He, and Z\. Wang \(2026b\)PACED: distillation and on\-policy self\-distillation at the frontier of student competence\.arXiv preprint arXiv:2603\.11178\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- J\. Yan, Y\. Li, Z\. Hu, Z\. Wang, G\. Cui, X\. Qu, Y\. Cheng, and Y\. Zhang \(2026\)Learning to reason under off\-policy guidance\.Advances in Neural Information Processing Systems38,pp\. 117157–117186\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px1.p1.1)\.
- C\. Yang, C\. Qin, Q\. Si, M\. Chen, N\. Gu, D\. Yao, Z\. Lin, W\. Wang, J\. Wang, and N\. Duan \(2026a\)Self\-distilled rlvr\.arXiv preprint arXiv:2604\.03128\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- W\. Yang, W\. Liu, R\. Xie, K\. Yang, S\. Yang, and Y\. Lin \(2026b\)Learning beyond teacher: generalized on\-policy distillation with reward extrapolation\.arXiv preprint arXiv:2602\.12125\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p3.1),[§2](https://arxiv.org/html/2606.02684#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.02684#S4.SS1.SSS0.Px5.p1.1)\.
- Z\. Yang, T\. Pang, H\. Feng, H\. Wang, W\. Chen, M\. Zhu, and Q\. Liu \(2024\)Self\-distillation bridges distribution gap in language model fine\-tuning\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1028–1043\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\)On\-policy context distillation for language models\.arXiv preprint arXiv:2602\.12275\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.
- D\. Zhang, Z\. Yang, S\. Janghorbani, J\. Han, A\. Ressler II, Q\. Qian, G\. D\. Lyng, S\. S\. Batra, and R\. E\. Tillman \(2026a\)Fast and effective on\-policy distillation from reasoning prefixes\.arXiv preprint arXiv:2602\.15260\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- M\. Zhang, Y\. Liu, S\. Lin, X\. Yang, Q\. Dai, C\. Luo, W\. Jiang, P\. Hou, A\. Zeng, X\. Geng,et al\.\(2026b\)Towards on\-policy sft: distribution discriminant theory and its applications in llm training\.arXiv preprint arXiv:2602\.12222\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- X\. Zhang, Z\. Ding, T\. Pan, R\. Yang, C\. Kang, X\. Xiong, and J\. Gu \(2026c\)Opsdl: on\-policy self\-distillation for long\-context language models\.arXiv preprint arXiv:2604\.17535\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- Z\. Zhang, S\. Jiang, Y\. Shen, Y\. Zhang, D\. Ram, S\. Yang, Z\. Tu, W\. Xia, and S\. Soatto \(2026d\)Reinforcement\-aware knowledge distillation for llm reasoning\.arXiv preprint arXiv:2602\.22495\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- B\. Zheng, X\. Ma, Y\. Liang, J\. Ruan, X\. Fu, K\. Lin, B\. Zhu, K\. Zeng, and X\. Cai \(2026\)Scope: signal\-calibrated on\-policy distillation enhancement with dual\-path adaptive weighting\.arXiv preprint arXiv:2604\.10688\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.02684#S3.SS2.p3.1)\.
- Z\. Zhong, H\. Yan, J\. Li, J\. He, T\. Zhang, and H\. Li \(2026\)VLA\-opd: bridging offline sft and online rl for vision\-language\-action models via on\-policy distillation\.arXiv preprint arXiv:2603\.26666\.Cited by:[§2](https://arxiv.org/html/2606.02684#S2.p2.1)\.
- W\. Zhu, R\. Xie, R\. Wang, and P\. Liu \(2026\)Hybrid policy distillation for llms\.arXiv preprint arXiv:2604\.20244\.Cited by:[§1](https://arxiv.org/html/2606.02684#S1.p1.1)\.


Table 7: Full ablation on $\alpha$ and $\beta$ \(Avg@8, Strong\-to\-Weak setting\)\. Default: $\alpha=1.0, \beta=1.0$\. The first six value columns vary $\alpha$ \(fix $\beta=1.0$\); the last six vary $\beta$ \(fix $\alpha=1.0$\)\.

| Benchmark | $\alpha=0.25$ | $\alpha=0.5$ | $\alpha=1.0$ | $\alpha=2.0$ | $\alpha=3.0$ | $\alpha=5.0$ | $\beta=0.25$ | $\beta=0.5$ | $\beta=1.0$ | $\beta=2.0$ | $\beta=3.0$ | $\beta=5.0$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIME24 | 57\.08 | 55\.42 | 60\.83 | 60\.42 | 60\.42 | 62\.92 | 57\.92 | 58\.75 | 60\.83 | 58\.33 | 60\.42 | 58\.75 |
| AIME25 | 47\.50 | 48\.75 | 52\.92 | 49\.17 | 51\.25 | 53\.33 | 48\.33 | 53\.33 | 52\.92 | 50\.00 | 48\.75 | 50\.00 |
| MATH500 | 93\.20 | 93\.65 | 93\.73 | 93\.70 | 93\.47 | 93\.88 | 93\.50 | 93\.58 | 93\.73 | 94\.03 | 93\.40 | 94\.27 |
| AMC2023 | 90\.62 | 92\.19 | 93\.13 | 90\.62 | 90\.62 | 93\.75 | 92\.19 | 92\.19 | 93\.13 | 91\.25 | 90\.31 | 90\.94 |
| OlympiadBench | 70\.83 | 70\.36 | 70\.47 | 69\.97 | 70\.86 | 69\.73 | 70\.60 | 69\.71 | 70\.47 | 70\.10 | 69\.99 | 69\.97 |
| MinervaMAT | 43\.34 | 42\.97 | 43\.47 | 44\.12 | 42\.97 | 42\.78 | 43\.57 | 43\.15 | 43\.47 | 42\.65 | 42\.88 | 44\.21 |
| HMMT\-Feb | 28\.75 | 28\.33 | 32\.08 | 30\.42 | 30\.42 | 30\.83 | 30\.00 | 29\.58 | 32\.08 | 29\.58 | 29\.17 | 30\.42 |
| HMMT\-Nov | 37\.92 | 34\.17 | 40\.00 | 36\.67 | 35\.00 | 37\.92 | 38\.75 | 41\.25 | 40\.00 | 37\.92 | 39\.17 | 40\.42 |
| Avg | 58\.66 | 58\.23 | 60\.83 | 59\.39 | 59\.38 | 60\.64 | 59\.36 | 60\.19 | 60\.83 | 59\.23 | 59\.26 | 59\.87 |

## Appendix A Full Ablation Results

#### Sensitivity to $\alpha$ and $\beta$\.

Table [7](https://arxiv.org/html/2606.02684#A0.T7) presents the full per\-benchmark results for the entropy\-aware weighting hyperparameters $\alpha$ and $\beta$ in the strong\-to\-weak distillation setting\. When varying $\alpha$ \(teacher confidence scaling\) with $\beta$ fixed at 1\.0, performance peaks at $\alpha=1.0$ \(60\.83% avg\) and remains competitive at $\alpha=5.0$ \(60\.64%\), indicating that moderately amplifying teacher confidence signals is beneficial while the method is not overly sensitive to this parameter\. When varying $\beta$ \(student confusion scaling\) with $\alpha$ fixed at 1\.0, the optimal performance is again achieved at $\beta=1.0$, with a narrower range of competitive values: deviations in either direction lead to noticeable degradation on competition\-level benchmarks \(e\.g\., HMMT\-Feb drops from 32\.08% to 29\.17% at $\beta=3.0$\)\. This suggests that student confusion signals require more careful calibration than teacher confidence, as over\-amplifying student uncertainty may cause the model to over\-attend to positions where the learning signal is inherently noisy\.
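
As a quick way to read the Avg row of Table 7, the short sketch below reports the best setting, the worst setting, and the gap between them for each hyperparameter. The numbers are copied from the Avg row; the helper name `summarize` is an illustrative choice. It summarizes averages only, so per-benchmark behavior (such as the HMMT\-Feb drop discussed above) can differ from what this view shows.

```python
# Avg row of Table 7, keyed by hyperparameter value.
avg_alpha = {0.25: 58.66, 0.5: 58.23, 1.0: 60.83, 2.0: 59.39, 3.0: 59.38, 5.0: 60.64}
avg_beta = {0.25: 59.36, 0.5: 60.19, 1.0: 60.83, 2.0: 59.23, 3.0: 59.26, 5.0: 59.87}

def summarize(avgs):
    """Return (best setting, worst setting, best-minus-worst gap in points)."""
    best = max(avgs, key=avgs.get)
    worst = min(avgs, key=avgs.get)
    spread = round(avgs[best] - avgs[worst], 2)
    return best, worst, spread

for name, avgs in (("alpha", avg_alpha), ("beta", avg_beta)):
    best, worst, spread = summarize(avgs)
    print(f"{name}: best={best}, worst={worst}, spread={spread}")
```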

#### Sensitivity to Trajectory Filtering Percentile\.

Table [8](https://arxiv.org/html/2606.02684#A1.T8) reports the effect of varying the trajectory\-level filtering percentile $p$, which controls the fraction of lowest teacher\-log\-probability trajectories to discard\. The optimal setting is $p=20\%$, achieving 60\.83% average accuracy\. Lower filtering \($p=10\%$\) retains too many off\-distribution trajectories that introduce noisy gradients, while aggressive filtering \($p=30\%$ or $p=40\%$\) discards potentially useful training signals, particularly hurting performance on the most challenging benchmarks: AIME 2024 drops from 60\.83% to 55\.00% at $p=40\%$, and HMMT\-Feb drops from 32\.08% to 26\.25%\. This confirms that a moderate filtering threshold strikes the best balance between removing harmful trajectories and preserving sufficient training diversity\.
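
The filtering step described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `filter_trajectories`, the use of one scalar teacher log-probability per trajectory (for example a length-normalized mean), and the choice to compute the cut within a single batch are all assumptions. Only the rule itself, discarding the lowest $p$ percent of trajectories by teacher log-probability, comes from the text.

```python
import numpy as np

def filter_trajectories(teacher_logprobs, p=20.0):
    """Return sorted indices of trajectories kept after dropping the lowest p percent.

    teacher_logprobs: one score per trajectory (assumed here to be the
    teacher's mean per-token log-probability of the student's rollout).
    """
    scores = np.asarray(teacher_logprobs, dtype=float)
    n_drop = int(np.floor(scores.size * p / 100.0))
    order = np.argsort(scores)       # ascending: least likely under the teacher first
    return np.sort(order[n_drop:])   # keep the rest, in original batch order

# Toy batch of 10 rollouts with made-up scores.
batch_scores = [-0.9, -0.3, -2.5, -0.4, -1.8, -0.2, -0.6, -0.5, -0.7, -0.35]
kept = filter_trajectories(batch_scores, p=20.0)
print("kept indices:", kept.tolist())
```

With $p=20\%$ and ten rollouts, the two lowest-scoring trajectories (indices 2 and 4 in this toy batch) are removed entirely, so they contribute no gradient. This is the hard-filtering behavior the ablation favors over down-weighting.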

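Table 8: Ablation on trajectory filtering percentile $p$ \(Avg@8\)\. $p=20\%$ is our default\.

| Benchmark | $p=10\%$ | $p=20\%$ | $p=30\%$ | $p=40\%$ |
| --- | --- | --- | --- | --- |
| AIME24 | 57\.92 | 60\.83 | 55\.42 | 55\.00 |
| AIME25 | 49\.17 | 52\.92 | 50\.42 | 47\.92 |
| MATH500 | 93\.60 | 93\.73 | 93\.45 | 94\.05 |
| AMC2023 | 91\.25 | 93\.13 | 90\.31 | 91\.88 |
| OlympiadBench | 70\.40 | 70\.47 | 70\.07 | 70\.46 |
| MinervaMAT | 42\.97 | 43\.47 | 43\.24 | 43\.24 |
| HMMT\-Feb | 28\.33 | 32\.08 | 29\.58 | 26\.25 |
| HMMT\-Nov | 34\.58 | 40\.00 | 37\.50 | 35\.83 |
| Avg | 58\.53 | 60\.83 | 58\.75 | 58\.08 |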