PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv cs.LG 06/17/26, 04:00 AM Papers
on-policy-distillation large-language-models power-transformation knowledge-distillation training-stabilization mathematical-reasoning
Summary
PowerOPD introduces a bounded power transformation to stabilize on-policy distillation for large language models, achieving significant gains in accuracy and sample efficiency while reducing computational cost.
arXiv:2606.17199v1 Announce Type: new Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
Cached at: 06/17/26, 05:36 AM
# PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
Source: [https://arxiv.org/html/2606.17199](https://arxiv.org/html/2606.17199)
Anhao Zhao¹,², Junlong Tong¹,³, Yingqi Fan¹, Ping Nie⁴, Wenjie Li², Xiaoyu Shen¹ (corresponding author)

¹ Eastern Institute of Technology, Ningbo; ² The Hong Kong Polytechnic University; ³ Shanghai Jiao Tong University; ⁴ University of Waterloo

anhao.zhao@connect.polyu.hk, xyshen@eitech.edu.cn

###### Abstract

Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose *PowerOPD*: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by $\alpha > 0$, of which the log-ratio is the degenerate $\alpha \to 0$ limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher–student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to **+6.37/+5.71** over vanilla OPD, **+3.01/+3.54** over post-hoc stabilization, and **+2.59/+8.90** over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger $\alpha$ generally improves accuracy, consistently shortens responses, and keeps gradient norms more than **3,000×** smaller than vanilla OPD. We release our code at [EIT-NLP/PowerOPD](https://github.com/EIT-NLP/PowerOPD).


## 1 Introduction

On-policy distillation (OPD) has rapidly become a standard component of LLM post-training (Gu et al., [2024](https://arxiv.org/html/2606.17199#bib.bib2); Agarwal et al., [2024](https://arxiv.org/html/2606.17199#bib.bib5); Song and Zheng, [2026](https://arxiv.org/html/2606.17199#bib.bib30)). By grounding supervision in student-generated trajectories, OPD mitigates the exposure bias of supervised fine-tuning (SFT) and classical off-policy distillation (Bengio et al., [2015](https://arxiv.org/html/2606.17199#bib.bib38); Hinton et al., [2015](https://arxiv.org/html/2606.17199#bib.bib44); DeepSeek-AI, [2025](https://arxiv.org/html/2606.17199#bib.bib49)), while providing dense token-level feedback, in contrast to sparse-reward reinforcement learning (RL). These advantages have made OPD a widely adopted bridge between SFT and RL in modern post-training (Qwen Team, [2025](https://arxiv.org/html/2606.17199#bib.bib6); Zhipu AI Team, [2026](https://arxiv.org/html/2606.17199#bib.bib7); DeepSeek Team, [2026](https://arxiv.org/html/2606.17199#bib.bib9); LLM-Core Xiaomi, [2026](https://arxiv.org/html/2606.17199#bib.bib12)).

![Refer to caption](https://arxiv.org/html/2606.17199v1/x1.png)

Figure 1: PowerOPD achieves a **+9.6** accuracy gain and **10×** sample efficiency over vanilla OPD, matching or exceeding full-vocabulary KL OPD at 59.2% less wall-clock time (Qwen3-1.7B ← Qwen3-4B, MATH-500).

In its original form, OPD minimizes the reverse-KL divergence between teacher and student distributions by comparing probabilities over the full vocabulary at each generation step. However, computing the full-vocabulary objective is expensive in practice. Consequently, modern implementations estimate the objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator (Lu and Thinking Machines Lab, [2025](https://arxiv.org/html/2606.17199#bib.bib46); Jin et al., [2026](https://arxiv.org/html/2606.17199#bib.bib16); Ko et al., [2026](https://arxiv.org/html/2606.17199#bib.bib17); Jia et al., [2026](https://arxiv.org/html/2606.17199#bib.bib26); Liu et al., [2026b](https://arxiv.org/html/2606.17199#bib.bib51); Zhao et al., [2026a](https://arxiv.org/html/2606.17199#bib.bib25), [b](https://arxiv.org/html/2606.17199#bib.bib50)).¹

¹ As this sampled-token formulation has become the de facto implementation of OPD due to its substantially lower cost, we refer to it as *vanilla OPD* throughout this paper.

Despite its widespread adoption, we observe that vanilla OPD exhibits severe *pathological training dynamics* in practice. In a representative Qwen3-4B teacher and Qwen3-1.7B-Base student setting on MATH-500, validation accuracy initially decreases and does not recover for hundreds of training steps despite dense token-level supervision, indicating sample inefficiency. Meanwhile, response length undergoes large oscillations before stabilizing, indicating that the student repeatedly enters unstable generation regimes. Even after convergence, vanilla OPD reaches only 54.93% accuracy, trailing full-vocabulary OPD by 8.19 points. Together, these failures suggest that *the Monte Carlo approximation underlying vanilla OPD introduces optimization difficulties that significantly limit both training efficiency and final performance*.

To understand the source of these pathological training dynamics, we examine the token-level reward that directly weights each policy-gradient update: the teacher-student log-probability ratio. Our diagnosis shows that the unbounded log-ratio reward distorts the update signal along three dimensions: (i) extreme reward variance, where reward values plummet to nearly $-50$, allowing single rare tokens to dominate gradient updates and trigger generative instability; (ii) early-position extremes, where massive reward magnitudes disproportionately hit the early positions of student rollouts, destabilizing the prefix distribution and causing cascading errors that drive sample inefficiency; and (iii) persistent extreme rewards, where these massive positive and negative values fail to decay, injecting instability throughout the entire optimization process. Notably, applying standard RL reward-stabilization tools (Mnih et al., [2015](https://arxiv.org/html/2606.17199#bib.bib39); Schulman et al., [2017](https://arxiv.org/html/2606.17199#bib.bib45)), such as clipping, tanh compression, and z-score normalization, does not resolve these issues, indicating that the instability originates from the unbounded log-ratio reward itself rather than from insufficient post-hoc scaling.

Recognizing that unboundedness is the root cause, we reframe OPD reward design as learning a principled probability-to-reward mapping. A well-conditioned OPD reward must satisfy two properties: boundedness, to prevent rare Monte Carlo events from inducing catastrophic gradient updates, and sign consistency, to ensure the reward sign correctly aligns with the teacher–student probability gap (i.e., yielding a positive reward when the teacher assigns a higher probability than the student, and vice versa). We retain the transform-then-subtract structure of the standard reward, $h(\pi_T) - h(\pi_\theta)$, because it guarantees sign consistency for any strictly increasing $h$. To achieve boundedness without losing this directional signal, we instantiate $h$ using the Box–Cox power family (Box and Cox, [1964](https://arxiv.org/html/2606.17199#bib.bib1)). This yields PowerOPD: a family of bounded and sign-consistent OPD rewards parameterized by $\alpha > 0$. While the unstable standard log-ratio reward represents the degenerate $\alpha \to 0$ limit, our formulation ensures the reward remains strictly bounded for any $\alpha > 0$.

We evaluate PowerOPD on six mathematical reasoning benchmarks across four teacher–student pairs from the Qwen3 family (0.6B and 1.7B students; 4B and 8B teachers). As shown in Figure [1](https://arxiv.org/html/2606.17199#S1.F1), using a Qwen3-1.7B-Base student and Qwen3-4B teacher, PowerOPD achieves a **+9.6** accuracy gain over vanilla OPD and reaches the same accuracy level with **10×** fewer training steps. Across the full benchmark evaluation, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to **+6.37/+5.71** over vanilla OPD, **+3.01/+3.54** over post-hoc stabilization, and **+2.59/+8.90** over full-vocabulary OPD, with individual-benchmark gains reaching **+16.75/+15.00**, **+8.43/+7.50**, and **+11.60/+25.00**, respectively, while reducing wall-clock time per step by 59.2% and peak GPU memory by 23.1% relative to full-vocabulary OPD. Notably, PowerOPD scales with $\alpha$: larger $\alpha$ generally improves accuracy, shortens responses, and stabilizes training dynamics. We further show that this scalability is mechanistically grounded: larger $\alpha$ suppresses rewards for tokens that both models assign low probability, while focusing learning on tokens that either the teacher or student considers likely. Finally, gradient tracking shows that PowerOPD keeps norms more than **3,000×** below vanilla OPD's initial spike, while post-hoc methods only partially stabilize training.

![Refer to caption](https://arxiv.org/html/2606.17199v1/x2.png)

\(a\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x3.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x4.png)

\(c\)

Figure 2: Pathological OPD rewards. The OPD reward shows (a) high variance with a heavy negative tail, (b) early-position extreme values, and (c) persistent extremes throughout training.

## 2 Preliminaries

### 2.1 On-Policy Distillation

On-policy distillation (OPD) (Agarwal et al., [2024](https://arxiv.org/html/2606.17199#bib.bib5); Gu et al., [2024](https://arxiv.org/html/2606.17199#bib.bib2); Song and Zheng, [2026](https://arxiv.org/html/2606.17199#bib.bib30)) has emerged as a standard stage in LLM post-training pipelines (Qwen Team, [2025](https://arxiv.org/html/2606.17199#bib.bib6); Zheng et al., [2025](https://arxiv.org/html/2606.17199#bib.bib10); Zhipu AI Team, [2026](https://arxiv.org/html/2606.17199#bib.bib7); Yang et al., [2026](https://arxiv.org/html/2606.17199#bib.bib8); DeepSeek Team, [2026](https://arxiv.org/html/2606.17199#bib.bib9); Tencent Robotics X and HY Vision Team, [2026](https://arxiv.org/html/2606.17199#bib.bib11); LLM-Core Xiaomi, [2026](https://arxiv.org/html/2606.17199#bib.bib12); KwaiKAT Team, [2026](https://arxiv.org/html/2606.17199#bib.bib13); Qwen Team, [2026](https://arxiv.org/html/2606.17199#bib.bib14)). It trains a student policy $\pi_\theta$ to match a stronger teacher policy $\pi_T$ on trajectories generated by the student itself. This on-policy training reduces exposure bias by aligning the training-time contexts with the student's inference-time generation. OPD minimizes the reverse KL divergence, which encourages mode-seeking toward the teacher's dominant modes (Lu and Thinking Machines Lab, [2025](https://arxiv.org/html/2606.17199#bib.bib46)):

$$D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T) = \mathbb{E}_{x \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid x)}\left[\log \frac{\pi_\theta(o \mid x)}{\pi_T(o \mid x)}\right],$$

where $x$ is a prompt from $\mathcal{D}$ and $o = (o_1, \ldots, o_{|o|})$ is a student-generated response.
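To make the relation between the exact objective and its sampled estimate concrete, the following is a minimal sketch for a single generation step over a toy four-token vocabulary. The probability values and the helper names `reverse_kl` and `sampled_estimate` are illustrative assumptions, not taken from the paper or its code. Averaging the single-token log-ratio under the student distribution recovers the exact reverse KL, which is the sense in which the sampled estimator is unbiased.

```python
import math

def reverse_kl(student, teacher):
    """Exact per-step reverse KL: sum over the vocabulary of q * log(q / p)."""
    return sum(q * math.log(q / p) for q, p in zip(student, teacher) if q > 0)

def sampled_estimate(student, teacher, token):
    """Single-sample estimate: the log-ratio at one student-sampled token."""
    return math.log(student[token] / teacher[token])

# Toy next-token distributions (made-up numbers, each sums to 1).
student = [0.70, 0.20, 0.09, 0.01]
teacher = [0.50, 0.45, 0.049, 0.001]

exact = reverse_kl(student, teacher)

# Expectation of the sampled estimate under the student, by enumeration.
expected = sum(q * sampled_estimate(student, teacher, v)
               for v, q in enumerate(student))

# Individual estimates scatter widely around the exact value.
per_token = [sampled_estimate(student, teacher, v) for v in range(len(student))]

print(f"exact={exact:.4f} expected={expected:.4f}")
print("per-token estimates:", [round(e, 3) for e in per_token])
```

The per-token estimates take both signs even though their student-weighted mean is the non-negative KL value; a single draw can therefore land far from the quantity it estimates.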

### 2.2 OPD as Dense-Reward RL

Using the autoregressive factorization of $\pi_\theta$ and $\pi_T$, OPD maximizes the negative reverse-KL objective in the token-level form

$$J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{|o|} \log \frac{\pi_T(o_t \mid c_t)}{\pi_\theta(o_t \mid c_t)}\right],$$

where $c_t = (x, o_{<t})$ denotes the context before token $o_t$. Following recent practice (Lu and Thinking Machines Lab, [2025](https://arxiv.org/html/2606.17199#bib.bib46); Ko et al., [2026](https://arxiv.org/html/2606.17199#bib.bib17); Oh et al., [2026](https://arxiv.org/html/2606.17199#bib.bib27)), this objective is optimized as policy-gradient RL (Williams, [1992](https://arxiv.org/html/2606.17199#bib.bib41); Sutton et al., [1999](https://arxiv.org/html/2606.17199#bib.bib47)):

$$\nabla_\theta J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{|o|} \log \frac{\pi_T(o_t \mid c_t)}{\pi_\theta(o_t \mid c_t)} \, \nabla_\theta \log \pi_\theta(o_t \mid c_t)\right].$$

Appendix [C](https://arxiv.org/html/2606.17199#A3) provides the detailed derivation. This policy-gradient form identifies the stop-gradient log-ratio term as the OPD token-level reward:

$$r_t^{\mathrm{OPD}}(c_t, o_t) = \log \frac{\pi_T(o_t \mid c_t)}{\pi_\theta(o_t \mid c_t)}. \tag{1}$$

Unlike the sparse sequence-level rewards used in outcome-based RL (Shao et al., [2024](https://arxiv.org/html/2606.17199#bib.bib48); DeepSeek-AI, [2025](https://arxiv.org/html/2606.17199#bib.bib49)), the OPD reward is dense and depends explicitly on the current student policy $\pi_\theta(o_t \mid c_t)$.
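As a small illustration of Eq. (1), the sketch below computes token-level rewards for one rollout from per-token log-probabilities. The function name `opd_rewards` and all probability values are hypothetical. In an actual training loop these rewards would be treated as constants (stop-gradient) that weight $\nabla_\theta \log \pi_\theta(o_t \mid c_t)$.

```python
import math

def opd_rewards(teacher_logps, student_logps):
    """Token-level reward r_t = log pi_T(o_t | c_t) - log pi_theta(o_t | c_t)."""
    return [lt - ls for lt, ls in zip(teacher_logps, student_logps)]

# Hypothetical probabilities of three student-sampled tokens.
student_probs = [0.90, 0.60, 0.95]
# The teacher agrees on the first two tokens but finds the third nearly impossible.
teacher_probs = [0.85, 0.70, 1e-22]

rewards = opd_rewards(
    [math.log(p) for p in teacher_probs],
    [math.log(p) for p in student_probs],
)
print([round(r, 2) for r in rewards])
```

The first two rewards are small in magnitude, while the third falls below $-50$: one token that the student likes and the teacher rejects outweighs the rest of the rollout in the gradient.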

## 3 Empirical Failure Modes of OPD

We first identify pathological OPD training dynamics: sample inefficiency, unstable generation behavior, and a persistent gap to full-vocabulary OPD ([§3.1](https://arxiv.org/html/2606.17199#S3.SS1)). We then trace them to high-variance log-ratio rewards, whose extremes concentrate early and persist throughout training ([§3.2](https://arxiv.org/html/2606.17199#S3.SS2)). Finally, we show that standard RL reward-stabilization strategies fail to fix these pathologies ([§3.3](https://arxiv.org/html/2606.17199#S3.SS3)).

### 3.1 Pathological Training Dynamics

We begin by empirically examining the training dynamics of OPD. We train a Qwen3-1.7B-Base student with a Qwen3-4B teacher and monitor performance on MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.17199#bib.bib36)). As shown in [Figure 3](https://arxiv.org/html/2606.17199#S3.F3), OPD exhibits three pathological behaviors. *(i) OPD is sample-inefficient.* Accuracy initially decreases and does not begin to recover until roughly 300 training steps, despite the availability of dense token-level supervision. *(ii) OPD is unstable in both accuracy and generation behavior.* Validation accuracy fluctuates substantially, while the average validation response length undergoes large oscillations and stabilizes only after around 400 steps, suggesting that the student policy moves through unstable generation regimes during training. *(iii) OPD converges to a substantially weaker policy than full-vocabulary OPD.* Full-vocabulary OPD computes the distillation signal over the entire vocabulary rather than only the sampled student token (Zhao et al., [2026a](https://arxiv.org/html/2606.17199#bib.bib25)). On MATH-500, OPD plateaus at 54.93%, whereas full-vocabulary OPD reaches 63.12%, leaving an 8.19-point gap and recovering only 69.44% of the teacher's validation accuracy.

### 3.2 Pathological Reward Distributions

To understand what undermines OPD training, we examine its dense token\-level rewards\. Since these rewards directly scale the policy\-gradient updates, their distribution determines how stable and reliable the optimization signal is\.

#### OPD rewards exhibit extremely high variance\.

We examine the reward distribution induced by OPD before training. Using the Qwen3-1.7B-Base student and the Qwen3-4B teacher, we sample 512 examples from MATH-500, generate student rollouts, and compute the OPD reward $r_t^{\mathrm{OPD}}(c_t, o_t)$ for every rollout token. We collect these token-level rewards across all rollouts and plot their empirical distribution. As shown in [Figure 2](https://arxiv.org/html/2606.17199#S1.F2)(a), OPD rewards span an extremely wide range, with the negative tail reaching nearly $-50$. This high-variance distribution is a direct consequence of the log difference, which can amplify teacher–student probability discrepancies into unbounded reward magnitudes (Ko et al., [2026](https://arxiv.org/html/2606.17199#bib.bib17); Jia et al., [2026](https://arxiv.org/html/2606.17199#bib.bib26)). Because each policy-gradient update is directly scaled by this reward scalar, extreme OPD rewards artificially inflate gradient variance and make optimization fragile.

![Refer to caption](https://arxiv.org/html/2606.17199v1/x5.png)

\(a\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x6.png)

\(b\)

Figure 3: Pathological OPD training dynamics. OPD shows (a) delayed, unstable accuracy far below full-vocabulary OPD and (b) response-length oscillations.

#### Extreme rewards concentrate at early rollout positions\.

We group OPD rewards by rollout token index and compute position-wise mean, minimum, and maximum values, filtering positions with too few samples. As shown in [Figure 2](https://arxiv.org/html/2606.17199#S1.F2)(b), the most extreme rewards occur near the beginning of student rollouts. This pattern is especially harmful in on-policy training: unstable rewards on early tokens can shift the prefix distribution, and subsequent rollouts then condition on these shifted prefixes, propagating instability to later tokens (Liu et al., [2026a](https://arxiv.org/html/2606.17199#bib.bib20)). This feedback loop destabilizes the training distribution and contributes to fluctuations in both validation accuracy and response length.
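The position-wise grouping described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' analysis script: the function name `positionwise_stats`, the `min_count` threshold, and the toy rewards are all hypothetical.

```python
from collections import defaultdict

def positionwise_stats(rollout_rewards, min_count=2):
    """Group token rewards by rollout index t and summarize each position.

    rollout_rewards: one list of token-level rewards per rollout.
    min_count: positions reached by fewer rollouts are dropped
               (an assumed threshold; the paper does not state its value).
    """
    by_pos = defaultdict(list)
    for rewards in rollout_rewards:
        for t, r in enumerate(rewards):
            by_pos[t].append(r)

    stats = {}
    for t, vals in sorted(by_pos.items()):
        if len(vals) < min_count:
            continue  # too few samples for a meaningful summary
        stats[t] = {
            "mean": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
    return stats

# Three toy rollouts of different lengths (made-up rewards).
rollouts = [
    [-12.0, 0.1, 0.0],
    [3.0, -0.2],
    [-0.5, 0.3, 0.1, -0.4],
]
stats = positionwise_stats(rollouts, min_count=2)
for t, s in stats.items():
    print(t, s)
```

Because rollouts differ in length, late positions are covered by fewer samples, which is why a count filter is needed before comparing extremes across positions.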

#### Extreme rewards persist across training\.

We track the reward dynamics during OPD training by recording batch-level statistics over all rollout-token rewards at each training step, including the minimum, maximum, and the 5th–95th percentile range. As shown in [Figure 2](https://arxiv.org/html/2606.17199#S1.F2)(c), extreme positive and negative rewards persist throughout training and remain close to their initial scale. This indicates that high-variance rewards are not merely an initialization artifact or an early-stage transient. Instead, OPD is exposed to severe token-level rewards and penalties throughout optimization.
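A per-step summary of this kind might be computed as in the sketch below. It assumes a nearest-rank percentile convention (other conventions interpolate), and the names `percentile` and `batch_reward_stats` as well as the toy batch are hypothetical.

```python
def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer in [0, 100]."""
    n = len(sorted_vals)
    rank = max(1, -(-pct * n // 100))  # integer ceiling of pct * n / 100
    return sorted_vals[rank - 1]

def batch_reward_stats(rewards):
    """Min, max, and 5th/95th percentiles over all rollout-token rewards in a batch."""
    vals = sorted(rewards)
    return {
        "min": vals[0],
        "max": vals[-1],
        "p5": percentile(vals, 5),
        "p95": percentile(vals, 95),
    }

# Toy batch: a narrow bulk of 38 rewards plus two extreme tokens (made-up numbers).
batch = [-45.0, 8.0] + [-1.0 + 0.05 * i for i in range(38)]
stats = batch_reward_stats(batch)
print(stats)
```

In this toy batch the 5th–95th percentile band stays within about one unit of zero while the minimum and maximum sit far outside it, the same kind of gap between the bulk and the extremes that the per-step tracking is meant to expose.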

### 3.3 Post-Hoc Reward Stabilization Fails

Table 1: Post-hoc OPD reward transformations.

(a) Teacher: Qwen3-4B → Student: Qwen3-0.6B-Base

| Class | Param | GSM8K Avg | GSM8K Pass | GSM8K Len | MATH500 Avg | MATH500 Pass | MATH500 Len | AMC23 Avg | AMC23 Pass | AMC23 Len | AIME24 Avg | AIME24 Pass | AIME24 Len | AIME25 Avg | AIME25 Pass | AIME25 Len | Minerva Avg | Minerva Pass | Minerva Len | Olympiad Avg | Olympiad Pass | Olympiad Len | Mean Avg | Mean Pass | Mean Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-vocab | – | 66.04 | 87.08 | 4087 | 43.90 | 54.84 | 4096 | 28.75 | 40.00 | 4096 | 3.33 | 10.00 | 4096 | 1.67 | 3.33 | 4096 | 11.03 | 15.81 | 4077 | 5.62 | 25.00 | 4094 | 22.91 | 33.72 | 4092 |
| Vanilla | – | 61.50 | 86.81 | 4078 | 36.02 | 61.20 | 4084 | 14.37 | 47.50 | 4096 | 1.67 | 6.67 | 4096 | 0.83 | 3.33 | 4085 | 6.99 | 17.65 | 4065 | 12.50 | 30.07 | 4087 | 19.13 | 36.18 | 4084 |
| + Clip | | 65.90 | 87.26 | 498 | 42.27 | 63.24 | 1483 | 21.88 | 52.50 | 2669 | 0.42 | 3.33 | 3722 | 1.25 | 6.67 | 3842 | 10.06 | 21.69 | 1378 | 15.67 | 33.78 | 2809 | 22.49 | 38.35 | 2343 |
| + Tanh | | 65.38 | 87.00 | 502 | 43.88 | 62.00 | 1493 | 22.69 | 52.50 | 2609 | 0.42 | 3.33 | 3786 | 0.42 | 3.33 | 3880 | 10.03 | 23.00 | 1329 | 4.62 | 15.00 | 3872 | 21.06 | 35.17 | 2496 |
| + Z-Score | | 56.21 | 86.43 | 437 | 35.75 | 62.26 | 1571 | 14.37 | 45.00 | 2648 | 0.42 | 3.33 | 3691 | 0.42 | 3.33 | 3703 | 8.00 | 20.22 | 1410 | 12.61 | 31.41 | 2750 | 18.25 | 36.00 | 2316 |
| PowerOPD | α=0.1 | 68.58 | 89.23 | 404 | 43.62 | 64.92 | 1482 | 28.12 | 57.50 | 2558 | 2.50 | 6.67 | 3623 | 0.42 | 3.33 | 3806 | 9.38 | 20.96 | 1391 | 16.48 | 34.67 | 2776 | 24.16 | 39.61 | 2291 |
| PowerOPD | α=0.5 | 68.67 | 89.23 | 407 | 45.29 | 66.92 | 1438 | 26.88 | 55.00 | 2617 | 2.50 | 6.67 | 3633 | 1.25 | 6.67 | 3662 | 10.62 | 23.16 | 1225 | 16.96 | 36.15 | 2704 | 24.60 | 40.54 | 2241 |
| PowerOPD | α=1 | 69.48 | 90.22 | 396 | 44.34 | 66.18 | 1451 | 24.38 | 47.50 | 2628 | 2.08 | 3.33 | 3583 | 2.50 | 10.00 | 3703 | 10.57 | 24.26 | 1236 | 17.22 | 35.85 | 2701 | 24.37 | 39.62 | 2243 |
| PowerOPD | α=5 | 69.19 | 88.55 | 374 | 45.66 | 66.38 | 1417 | 31.12 | 57.50 | 2456 | 2.50 | 3.33 | 3610 | 0.83 | 6.67 | 3629 | 12.22 | 23.63 | 1129 | 16.87 | 35.85 | 2677 | 25.48 | 40.27 | 2185 |
| PowerOPD | α=10 | 69.42 | 88.55 | 393 | 45.54 | 65.92 | 1388 | 29.69 | 57.50 | 2426 | 2.08 | 3.33 | 3597 | 1.25 | 3.33 | 3612 | 11.31 | 23.90 | 1124 | 17.11 | 35.26 | 2664 | 25.20 | 39.68 | 2172 |
| PowerOPD | α=50 | 69.25 | 87.11 | 385 | 45.12 | 66.00 | 1367 | 24.69 | 60.00 | 2358 | 2.92 | 10.00 | 3283 | 0.83 | 6.67 | 3502 | 11.31 | 22.43 | 1149 | 17.20 | 36.59 | 2571 | 24.47 | 41.26 | 2088 |
| PowerOPD | α=100 | 68.70 | 86.88 | 385 | 44.57 | 65.66 | 1363 | 28.12 | 57.00 | 2230 | 2.50 | 10.00 | 3282 | 0.83 | 6.67 | 3324 | 10.75 | 23.69 | 1225 | 17.04 | 36.15 | 2490 | 24.64 | 40.86 | 2043 |
| PowerOPD | α=500 | 69.77 | 87.49 | 397 | 44.62 | 65.50 | 1324 | 29.25 | 57.50 | 2049 | 3.33 | 10.00 | 3077 | 2.92 | 13.33 | 2819 | 11.40 | 23.53 | 1273 | 17.22 | 35.85 | 2185 | 25.50 | 41.89 | 1875 |

(b) Teacher: Qwen3-8B → Student: Qwen3-0.6B-Base

| Class | Param | GSM8K Avg | GSM8K Pass | GSM8K Len | MATH500 Avg | MATH500 Pass | MATH500 Len | AMC23 Avg | AMC23 Pass | AMC23 Len | AIME24 Avg | AIME24 Pass | AIME24 Len | AIME25 Avg | AIME25 Pass | AIME25 Len | Minerva Avg | Minerva Pass | Minerva Len | Olympiad Avg | Olympiad Pass | Olympiad Len | Mean Avg | Mean Pass | Mean Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-vocab | – | 70.85 | 88.48 | 4095 | 46.53 | 54.38 | 4095 | 27.50 | 35.00 | 4096 | 5.00 | 6.67 | 4096 | 3.33 | 6.67 | 4096 | 11.58 | 15.07 | 4089 | 6.62 | 21.00 | 4096 | 24.49 | 32.47 | 4095 |
| Vanilla | – | 60.59 | 86.35 | 416 | 40.43 | 64.00 | 1450 | 22.19 | 50.00 | 2393 | 0.83 | 3.33 | 3436 | 0.83 | 6.67 | 3752 | 9.01 | 20.59 | 1306 | 14.35 | 33.33 | 2708 | 21.18 | 37.75 | 2209 |
| + Clip | | 67.99 | 87.64 | 398 | 43.75 | 65.30 | 1429 | 28.12 | 52.50 | 2538 | 1.67 | 6.67 | 3506 | 1.67 | 10.00 | 3711 | 9.42 | 22.06 | 1286 | 16.20 | 34.22 | 2674 | 24.12 | 39.77 | 2220 |
| + Tanh | | 66.98 | 88.63 | 394 | 43.39 | 64.86 | 1425 | 25.00 | 57.50 | 2468 | 2.92 | 10.00 | 3518 | 1.67 | 6.67 | 3639 | 9.65 | 21.32 | 1281 | 16.89 | 34.52 | 2698 | 23.79 | 40.50 | 2203 |
| + Z-Score | | 63.83 | 88.10 | 391 | 42.72 | 65.62 | 1364 | 25.62 | 50.00 | 2269 | 1.25 | 6.67 | 3371 | 0.42 | 3.33 | 3578 | 10.71 | 21.32 | 1212 | 15.54 | 35.70 | 2528 | 22.87 | 38.68 | 2102 |
| PowerOPD | α=0.1 | 67.18 | 88.02 | 384 | 43.82 | 65.86 | 1395 | 24.06 | 55.00 | 2460 | 3.75 | 13.33 | 3413 | 0.83 | 6.67 | 3617 | 10.25 | 22.43 | 1195 | 15.94 | 34.67 | 2632 | 23.69 | 40.85 | 2157 |
| PowerOPD | α=0.5 | 67.58 | 89.16 | 381 | 45.07 | 66.98 | 1371 | 25.94 | 55.00 | 2535 | 1.67 | 6.67 | 3359 | 0.83 | 6.67 | 3549 | 10.06 | 22.79 | 1144 | 17.26 | 36.89 | 2552 | 24.06 | 40.59 | 2127 |
| PowerOPD | α=1 | 65.36 | 89.76 | 380 | 44.89 | 67.10 | 1408 | 23.44 | 52.50 | 2560 | 1.67 | 6.67 | 3584 | 0.83 | 3.33 | 3542 | 11.31 | 25.37 | 1188 | 17.24 | 36.30 | 2637 | 23.53 | 40.15 | 2186 |
| PowerOPD | α=5 | 66.79 | 88.86 | 364 | 45.33 | 66.72 | 1348 | 25.62 | 55.00 | 2503 | 2.92 | 6.67 | 3512 | 3.33 | 13.33 | 3468 | 11.08 | 22.79 | 1076 | 17.13 | 35.11 | 2566 | 24.60 | 41.21 | 2120 |
| PowerOPD | α=10 | 66.76 | 87.87 | 364 | 45.49 | 67.00 | 1308 | 24.38 | 60.00 | 2251 | 1.25 | 6.67 | 3274 | 0.83 | 6.67 | 3528 | 11.76 | 24.63 | 1056 | 17.24 | 36.00 | 2523 | 23.96 | 41.26 | 2043 |
| PowerOPD | α=50 | 69.29 | 88.93 | 382 | 44.91 | 65.78 | 1313 | 28.12 | 57.50 | 2088 | 1.67 | 6.67 | 3262 | 0.42 | 6.67 | 3233 | 11.12 | 23.90 | 1149 | 16.96 | 35.70 | 2449 | 24.64 | 40.74 | 1982 |
| PowerOPD | α=100 | 68.63 | 89.79 | 385 | 45.03 | 65.42 | 1322 | 29.06 | 57.50 | 2038 | 2.92 | 10.00 | 3164 | 1.25 | 10.00 | 3225 | 11.31 | 22.79 | 1244 | 17.17 | 34.07 | 2331 | 25.05 | 41.37 | 1958 |
| PowerOPD | α=500 | 70.58 | 87.04 | 427 | 44.59 | 65.42 | 1321 | 22.19 | 57.50 | 1952 | 5.00 | 6.67 | 3059 | 2.08 | 13.33 | 2742 | 11.44 | 23.90 | 1309 | 16.83 | 35.26 | 2201 | 24.67 | 41.30 | 1859 |

Table 2: Mathematical reasoning evaluation with the Qwen3-0.6B student. **Bold** and underline denote the best and second-best results. Color intensity highlights the relative performance of PowerOPD variants across different $\alpha$. Log reward denotes OPD reward variants without post-hoc stabilization. Avg, Pass, and Len denote Avg@8, Pass@8, and average response length, respectively.

The high variance of OPD rewards originates from the unboundedness of the log-ratio form. The OPD reward $r_t^{\mathrm{OPD}} = \log(\pi_T / \pi_\theta)$ diverges in both directions:² $r_t^{\mathrm{OPD}} \to +\infty$ when $\pi_\theta(o_t \mid c_t) \to 0$, and $r_t^{\mathrm{OPD}} \to -\infty$ when $\pi_T(o_t \mid c_t) \to 0$. Because OPD evaluates rewards on student-sampled tokens, these tokens tend to have high probability under the student, but the teacher may assign low probability to the same student-sampled tokens. The log-ratio reward can therefore easily enter its negative divergent regime, producing the heavy negative tail observed in [Figure 2](https://arxiv.org/html/2606.17199#S1.F2)(a). A standard remedy is to apply post-hoc reward transformations from RL, where clipping and normalization are widely used to stabilize optimization (Mnih et al., [2015](https://arxiv.org/html/2606.17199#bib.bib39); Andrychowicz et al., [2020](https://arxiv.org/html/2606.17199#bib.bib40)). We consider three representative methods, summarized in [Table 1](https://arxiv.org/html/2606.17199#S3.T1): *Clip* truncates rewards at fixed thresholds,³ *Tanh* maps rewards smoothly into $[-1, 1]$, and *Z-Score* centers and rescales rewards using batch statistics.

² Full-vocabulary OPD shares the same unbounded log-ratio, but computes an exact expectation where extreme values are suppressed by their small probability weights. Sampled-token OPD lacks this averaging, exposing the gradient directly to extreme log-ratios.

³ We set $c_l = -c_h$, grid-search multiple threshold values, and use the best-performing setting $c_l = -1$, $c_h = 1$.
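The three post-hoc transformations can be sketched as below. This is a minimal illustration that assumes the most common definitions (hard clipping to $[c_l, c_h]$, elementwise tanh, and batch mean/standard-deviation normalization with a small `eps`); the exact formulas used in the paper may differ, and the function names and toy rewards are hypothetical.

```python
import math

def clip(rewards, c_low=-1.0, c_high=1.0):
    """Truncate each reward to [c_low, c_high]; defaults follow c_l = -1, c_h = 1."""
    return [min(max(r, c_low), c_high) for r in rewards]

def tanh_squash(rewards):
    """Map each reward smoothly into [-1, 1]."""
    return [math.tanh(r) for r in rewards]

def z_score(rewards, eps=1e-8):
    """Center and rescale with batch statistics (population std; eps is assumed)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy batch of log-ratio rewards with one extreme negative token (made-up numbers).
rewards = [0.4, 0.2, -0.1, -30.0]

clipped = clip(rewards)
squashed = tanh_squash(rewards)
normalized = z_score(rewards)

print("clip   :", clipped)
print("tanh   :", [round(v, 3) for v in squashed])
print("z-score:", [round(v, 3) for v in normalized])
```

Clipping and tanh keep every sign intact and only compress magnitudes. Z-score subtracts a batch mean that the outlier drags far below zero, so the mildly negative reward `-0.1` comes out positive after normalization.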

#### Post-hoc stabilization fails to fix these pathologies.

As shown in [Figure 3](https://arxiv.org/html/2606.17199#S3.F3), post-hoc reward stabilization alleviates neither the optimization delay nor the unstable generation dynamics of OPD. *Tanh* and *Z-Score* reduce the gap to full-vocabulary OPD to some extent, but validation accuracy still improves only after a long delay, and response length continues to exhibit large oscillations. *Z-Score* can even underperform vanilla OPD, likely because batch centering may change the sign of rewards and reverse the intended direction of some updates. In contrast, *Clip* and *Tanh* preserve the reward sign, but they only reshape reward magnitudes after the unbounded log-ratio has been computed, leaving the underlying reward form unchanged. These results indicate that *OPD's pathological training dynamics are driven not by the lack of post-hoc stabilization, but by the unbounded log-ratio reward itself.*

## 4 Rethinking the OPD Reward Function

As shown in [§3](https://arxiv.org/html/2606.17199#S3), post-hoc transformations of the log-ratio reward do not resolve OPD training pathologies. We therefore step back from the inherited log-ratio form and ask: *what properties should a token-level OPD reward satisfy, and what functions satisfy them by construction?*

### 4.1 Generalizing the OPD Reward

The OPD log-ratio reward $\log(\pi_T / \pi_\theta)$ is a specific mapping from teacher–student probabilities at the sampled token; we generalize it as

$$r_t^{f} = f\!\left(\pi_T(o_t \mid c_t),\; \pi_\theta(o_t \mid c_t)\right), \tag{2}$$

where $f: [0,1]^2 \to \mathbb{R}$ maps a pair of teacher–student probabilities to a scalar reward. The standard OPD reward corresponds to $f(p, q) = \log p - \log q$, which is defined on $(0,1]^2$ but diverges as $p \to 0$ or $q \to 0$. The post-hoc methods in [§3.3](https://arxiv.org/html/2606.17199#S3.SS3) keep this log-ratio form fixed and only transform or normalize the rewards after they are computed. In contrast, *we treat the probability-to-reward mapping $f$ itself as the object of design*. This shifts the question from how to stabilize a pathological reward after the fact to what properties a stable OPD reward should satisfy in the first place.

### 4.2 Two Necessary Properties

We identify two necessary properties for a well\-designed OPD reward\.

#### P1: Boundedness\.

The reward function $f$ should be bounded on $[0,1]^2$: there exists $M > 0$ such that $|f(p, q)| \leq M$ for all $p, q \in [0,1]$. If $f$ is unbounded, small changes in teacher or student probabilities can produce arbitrarily large reward magnitudes, directly amplifying the variance of policy-gradient updates as diagnosed in [§3.2](https://arxiv.org/html/2606.17199#S3.SS2).

#### P2: Sign Consistency\.

The sign of $f$ should indicate which model assigns higher probability to the sampled token: $f(p, q) > 0$ if $p > q$, $f(p, q) = 0$ if $p = q$, and $f(p, q) < 0$ if $p < q$. This ensures that token-level updates move the student toward the teacher: tokens assigned higher probability by the teacher receive positive reward, and vice versa. If P2 is violated, updates can point in the wrong direction regardless of reward magnitude.

#### P1 and P2 explain the empirical failures\.

The failure modes in [§3](https://arxiv.org/html/2606.17199#S3) are consistent with these two properties. *Vanilla OPD* satisfies P2 but violates P1, producing the high-variance rewards diagnosed above. *Z-Score* does not guarantee P1 and may violate P2 because batch centering can flip individual reward signs, explaining its poor performance in [Figure 3](https://arxiv.org/html/2606.17199#S3.F3). *Clip* and *Tanh* enforce P1 while preserving P2, and therefore improve over vanilla OPD; however, they still fail because they compress rewards only after the divergent log-ratio has been computed. Thus, P1 and P2 are minimal properties: they must hold at the probability-to-reward mapping level, before any log-ratio distortion occurs.
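To make the sign-flip argument concrete, the following is a minimal sketch in plain Python. The helper names (`log_ratio`, `clip`, `z_score`), the toy probability pairs, and the exact form of the post-hoc transforms (clipping to $[-1,1]$, `tanh` squashing, and batch standardization with a population standard deviation) are illustrative assumptions, not the implementation used in the experiments.

```python
import math
from statistics import mean, pstdev


def log_ratio(p, q):
    """Vanilla OPD reward log(p) - log(q) for teacher prob p and student prob q."""
    return math.log(p) - math.log(q)


def clip(r, lo=-1.0, hi=1.0):
    """Post-hoc clipping of an already computed reward (assumed bounds -1 and 1)."""
    return max(lo, min(hi, r))


def z_score(rewards, eps=1e-8):
    """Post-hoc batch standardization (assumed form: subtract mean, divide by std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


# Toy (teacher prob, student prob) pairs. The teacher prob is higher in every
# pair, so P2 requires every reward to be positive.
pairs = [(0.9, 0.5), (0.5, 0.4), (0.3, 0.29), (0.2, 1e-6)]

raw = [log_ratio(p, q) for p, q in pairs]   # unbounded, sign-consistent
clipped = [clip(r) for r in raw]            # bounded, sign-consistent
squashed = [math.tanh(r) for r in raw]      # bounded, sign-consistent
centered = z_score(raw)                     # centering can flip signs

for name, values in [("raw", raw), ("clip", clipped),
                     ("tanh", squashed), ("z-score", centered)]:
    print(name, [round(v, 3) for v in values], [sign(v) for v in values])
```

In this toy batch a single near-zero student probability dominates the batch mean, so centering pushes the other three rewards below zero even though the teacher prefers all four tokens.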

### 4\.3Deriving Rewards that Satisfy P1 and P2

(a) Teacher: Qwen3-4B $\to$ Student: Qwen3-1.7B-Base

| Class | Param | GSM8K Avg | GSM8K Pass | GSM8K Len | MATH500 Avg | MATH500 Pass | MATH500 Len | AMC23 Avg | AMC23 Pass | AMC23 Len | AIME24 Avg | AIME24 Pass | AIME24 Len | AIME25 Avg | AIME25 Pass | AIME25 Len | Minerva Avg | Minerva Pass | Minerva Len | Olympiad Avg | Olympiad Pass | Olympiad Len | Mean Avg | Mean Pass | Mean Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-vocab | – | 83.50 | 95.00 | 3954 | 63.12 | 87.00 | 4004 | 38.44 | 65.00 | 4096 | 10.83 | 23.33 | 4096 | 5.83 | 20.00 | 4096 | 19.75 | 32.00 | 3805 | 17.62 | 26.00 | 4096 | 34.16 | 49.76 | 4021 |
| Vanilla | – | 73.93 | 95.68 | 351 | 54.93 | 75.12 | 1156 | 31.87 | 60.00 | 2253 | 9.17 | 16.67 | 3388 | 7.08 | 13.33 | 3282 | 16.13 | 28.68 | 1038 | 23.74 | 44.15 | 2355 | 30.98 | 47.66 | 1975 |
| | + Clip | 80.36 | 94.01 | 326 | 59.12 | 75.92 | 1162 | 31.88 | 67.50 | 2309 | 4.17 | 23.33 | 3342 | 4.58 | 20.00 | 3381 | 17.56 | 28.31 | 995 | 26.74 | 47.41 | 2390 | 32.06 | 50.93 | 1986 |
| | + Tanh | 82.08 | 94.62 | 331 | 59.09 | 76.04 | 1134 | 34.06 | 60.00 | 2288 | 10.42 | 26.67 | 3318 | 6.67 | 16.67 | 3240 | 18.06 | 30.88 | 926 | 27.17 | 46.96 | 2359 | 33.94 | 50.26 | 1942 |
| | + Z-Score | 42.92 | 89.61 | 372 | 28.24 | 66.24 | 1235 | 15.94 | 47.50 | 2243 | 4.17 | 13.33 | 3466 | 4.17 | 13.33 | 3244 | 9.05 | 23.16 | 1062 | 14.15 | 38.81 | 2376 | 16.95 | 41.71 | 2000 |
| PowerOPD | $\alpha=0.1$ | 82.46 | 94.77 | 316 | 59.66 | 76.56 | 1127 | 33.44 | 67.50 | 2184 | 8.75 | 20.00 | 3351 | 7.08 | 23.33 | 3081 | 18.11 | 29.41 | 912 | 27.76 | 47.26 | 2332 | 33.89 | 51.26 | 1900 |
| | $\alpha=1$ | 82.04 | 95.38 | 317 | 59.52 | 76.86 | 1132 | 38.44 | 62.50 | 2176 | 10.42 | 23.33 | 3368 | 4.17 | 16.67 | 3136 | 18.66 | 32.35 | 935 | 27.72 | 47.56 | 2307 | 34.42 | 50.66 | 1910 |
| | $\alpha=5$ | 82.68 | 95.07 | 315 | 59.96 | 76.42 | 1116 | 38.12 | 67.50 | 2136 | 8.33 | 20.00 | 3366 | 4.17 | 13.33 | 3283 | 19.07 | 31.25 | 920 | 27.28 | 47.11 | 2289 | 34.23 | 50.10 | 1918 |
| | $\alpha=10$ | 83.01 | 94.69 | 315 | 59.71 | 76.10 | 1104 | 35.62 | 67.50 | 2112 | 8.33 | 26.67 | 3292 | 5.00 | 13.33 | 3188 | 18.84 | 30.15 | 909 | 27.56 | 46.67 | 2296 | 34.01 | 50.73 | 1888 |
| | $\alpha=50$ | 83.11 | 94.24 | 313 | 59.40 | 75.68 | 1035 | 41.88 | 75.00 | 1950 | 10.83 | 23.33 | 3227 | 4.17 | 13.33 | 2876 | 17.65 | 30.15 | 827 | 26.67 | 46.52 | 2160 | 34.82 | 51.18 | 1770 |
| | $\alpha=100$ | 82.92 | 93.33 | 307 | 59.80 | 75.76 | 1009 | 37.81 | 70.00 | 1857 | 8.33 | 20.00 | 3011 | 6.25 | 23.33 | 2732 | 17.97 | 28.31 | 823 | 27.06 | 45.78 | 2042 | 34.31 | 50.93 | 1683 |
| | $\alpha=500$ | 83.13 | 94.31 | 316 | 64.53 | 75.72 | 997 | 40.94 | 70.00 | 1673 | 9.58 | 26.67 | 2776 | 6.67 | 20.00 | 2588 | 19.58 | 31.25 | 845 | 27.74 | 46.37 | 1857 | 36.02 | 52.05 | 1579 |

(b) Teacher: Qwen3-8B $\to$ Student: Qwen3-1.7B-Base

| Class | Param | GSM8K Avg | GSM8K Pass | GSM8K Len | MATH500 Avg | MATH500 Pass | MATH500 Len | AMC23 Avg | AMC23 Pass | AMC23 Len | AIME24 Avg | AIME24 Pass | AIME24 Len | AIME25 Avg | AIME25 Pass | AIME25 Len | Minerva Avg | Minerva Pass | Minerva Len | Olympiad Avg | Olympiad Pass | Olympiad Len | Mean Avg | Mean Pass | Mean Len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full-vocab | – | 80.25 | 86.50 | 4095 | 63.25 | 84.00 | 4091 | 42.19 | 75.00 | 4096 | 7.08 | 16.67 | 4096 | 5.83 | 26.67 | 4096 | 17.50 | 26.00 | 4084 | 16.50 | 29.00 | 4096 | 33.23 | 49.12 | 4093 |
| Vanilla | – | 77.55 | 95.07 | 323 | 57.55 | 75.88 | 1050 | 35.94 | 62.50 | 1881 | 7.08 | 20.00 | 3045 | 3.30 | 16.67 | 2957 | 17.56 | 29.78 | 892 | 23.94 | 47.11 | 2136 | 31.85 | 49.57 | 1755 |
| | + Clip | 76.96 | 94.84 | 308 | 59.56 | 76.68 | 1062 | 38.75 | 72.50 | 2097 | 7.08 | 23.33 | 3188 | 3.30 | 16.67 | 2935 | 18.43 | 30.88 | 892 | 24.65 | 46.81 | 2177 | 32.68 | 51.67 | 1808 |
| | + Tanh | 78.02 | 94.92 | 321 | 58.82 | 76.04 | 1078 | 35.31 | 62.50 | 1996 | 8.33 | 23.33 | 3107 | 4.58 | 20.00 | 3012 | 18.29 | 31.62 | 930 | 23.96 | 47.41 | 2205 | 32.47 | 50.83 | 1807 |
| | + Z-Score | 80.26 | 95.22 | 323 | 58.48 | 75.98 | 1084 | 36.56 | 67.50 | 2087 | 7.08 | 23.33 | 3285 | 4.17 | 13.33 | 2964 | 17.42 | 31.99 | 948 | 27.02 | 48.00 | 2174 | 33.00 | 50.76 | 1838 |
| PowerOPD | $\alpha=0.1$ | 79.96 | 94.31 | 319 | 59.27 | 76.78 | 1087 | 36.25 | 65.00 | 2142 | 9.17 | 23.33 | 3123 | 3.75 | 13.33 | 3024 | 18.43 | 31.99 | 920 | 27.09 | 47.26 | 2227 | 33.42 | 50.29 | 1835 |
| | $\alpha=0.5$ | 78.22 | 94.77 | 308 | 59.29 | 76.90 | 1052 | 36.56 | 67.50 | 2087 | 8.33 | 26.67 | 3195 | 5.42 | 20.00 | 2962 | 17.97 | 30.88 | 899 | 27.78 | 48.00 | 2143 | 33.37 | 52.10 | 1807 |
| | $\alpha=1$ | 76.21 | 94.31 | 311 | 59.04 | 76.90 | 1083 | 37.50 | 70.00 | 2113 | 8.33 | 20.00 | 3200 | 5.00 | 16.67 | 3019 | 17.46 | 29.78 | 924 | 26.81 | 46.52 | 2192 | 32.91 | 50.60 | 1835 |
| | $\alpha=5$ | 79.54 | 94.16 | 312 | 59.82 | 77.12 | 1062 | 40.31 | 70.00 | 2010 | 7.08 | 20.00 | 3198 | 3.75 | 13.33 | 2939 | 18.75 | 30.88 | 880 | 27.48 | 48.15 | 2202 | 33.82 | 50.52 | 1800 |
| | $\alpha=10$ | 78.79 | 93.93 | 321 | 59.86 | 76.76 | 1049 | 38.75 | 62.50 | 1997 | 8.33 | 26.67 | 3206 | 4.58 | 16.67 | 2853 | 18.89 | 31.62 | 882 | 27.81 | 46.07 | 2148 | 33.86 | 50.60 | 1779 |
| | $\alpha=100$ | 83.06 | 93.93 | 319 | 59.36 | 75.74 | 998 | 38.75 | 77.50 | 1821 | 9.17 | 23.33 | 2869 | 3.33 | 13.33 | 2928 | 19.07 | 29.41 | 826 | 26.72 | 45.48 | 2008 | 34.21 | 51.25 | 1681 |
| | $\alpha=500$ | 83.85 | 94.16 | 322 | 61.06 | 75.90 | 999 | 36.56 | 77.50 | 1715 | 8.33 | 23.33 | 2705 | 5.83 | 16.67 | 2557 | 18.93 | 29.78 | 846 | 26.56 | 45.78 | 1827 | 34.45 | 51.87 | 1567 |

Table 3: Mathematical reasoning evaluation with the Qwen3-1.7B student. Bold and underline denote the best and second-best results. Color intensity highlights the relative performance of PowerOPD variants across different $\alpha$. Vanilla denotes OPD reward variants without post-hoc stabilization. Avg, Pass, and Len denote Avg@8, Pass@8, and average response length, respectively.

![Refer to caption](https://arxiv.org/html/2606.17199v1/x7.png)

\(a\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x8.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x9.png)

\(c\)

![Refer to caption](https://arxiv.org/html/2606.17199v1/x10.png)

\(d\)

Figure 4: Training dynamics across $\alpha$ for a Qwen3-0.6B-Base student: (a,b) accuracy and length with a Qwen3-4B teacher; (c,d) accuracy and length with a Qwen3-8B teacher.

We now identify what structure a natively bounded and sign-consistent reward should take. To obtain a tractable family, we observe that the log-ratio reward has a specific algebraic structure: it applies the same scalar transformation to each probability, then takes their difference, giving $\log\pi_T-\log\pi_\theta=h(\pi_T)-h(\pi_\theta)$ where $h(x)=\log x$. This *transform-then-subtract* class $f_h(p,q)=h(p)-h(q)$ is attractive because sign consistency follows immediately whenever $h$ is strictly monotone increasing: if $p>q$, then $h(p)>h(q)$ and $r_t>0$. Thus, P2 is guaranteed by construction. The stability of the reward is then controlled entirely by the choice of $h$: the problem with the log ratio is not the transform-then-subtract structure, but the specific choice $h(x)=\log x$, which maps $(0,1]$ onto $(-\infty,0]$ and diverges at zero, making $h(\pi_T)-h(\pi_\theta)$ unbounded and violating P1. The fix is therefore precise: keep the transform-then-subtract structure and replace $h=\log$ with a function that is both strictly monotone increasing and bounded on $[0,1]$.

#### The Box–Cox family\.

A natural and well-studied family for this purpose is the Box–Cox power transformation (Box and Cox, [1964](https://arxiv.org/html/2606.17199#bib.bib1)),

$$h_\alpha(x)=\frac{x^{\alpha}-1}{\alpha},\quad \alpha>0. \tag{3}$$

For any $\alpha>0$, $h_\alpha$ is strictly increasing and bounded on $[0,1]$. Substituting $h_\alpha$ into $h(p)-h(q)$ gives $(p^{\alpha}-q^{\alpha})/\alpha$. Since $1/\alpha$ is a positive constant independent of tokens and policy parameters, we absorb it into the learning rate and use the rescaled reward $p^{\alpha}-q^{\alpha}$, which is bounded in $[-1,1]$.

#### The log ratio as a limiting case\.

The standard log-ratio reward corresponds to the boundary case $\alpha\to 0$ of the Box–Cox transformation. By the first-order Taylor expansion $x^{\alpha}=1+\alpha\log x+o(\alpha)$, we have $h_\alpha(x)=(x^{\alpha}-1)/\alpha\to\log x$, and therefore $h_\alpha(p)-h_\alpha(q)\to\log p-\log q$. However, this limit is degenerate for reward design: although each fixed $\alpha>0$ gives a bounded transformation, its range expands as $\alpha\to 0$, recovering the unbounded log transformation and thereby violating P1.
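The limiting behavior can be checked numerically. The sketch below is illustrative only: `box_cox` is a hypothetical helper that evaluates Eq. (3) directly, and the probe values of $x$ and $\alpha$ are arbitrary choices.

```python
import math


def box_cox(x, alpha):
    """Box-Cox transform h_alpha(x) = (x**alpha - 1) / alpha, for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (x ** alpha - 1.0) / alpha


# On [0, 1] the transform ranges over [-1/alpha, 0]: the lower end is h_alpha(0).
for alpha in (1.0, 0.5, 0.1, 0.01):
    print("alpha =", alpha, "range = [", box_cox(0.0, alpha), ",", box_cox(1.0, alpha), "]")

# For a fixed x > 0, h_alpha(x) approaches log(x) as alpha shrinks toward zero.
x = 0.3
for alpha in (1.0, 0.1, 0.01, 1e-6):
    print("alpha =", alpha, "h_alpha(x) =", box_cox(x, alpha), "log(x) =", math.log(x))
```

The lower end of the range, $h_\alpha(0)=-1/\alpha$, is finite for every fixed $\alpha>0$ but grows without bound as $\alpha\to 0$, which is the sense in which the log ratio is a degenerate member of the family.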

### 4\.4PowerOPD

The analysis above identifies a principled family of natively bounded, sign-consistent OPD rewards. Following the policy-gradient formulation in [§2.2](https://arxiv.org/html/2606.17199#S2.SS2), we introduce PowerOPD, which uses the following stop-gradient token-level reward:

$$r_t^{\alpha}=\mathrm{sg}\!\left[\pi_T(o_t\mid c_t)^{\alpha}-\pi_\theta(o_t\mid c_t)^{\alpha}\right],\quad \alpha>0. \tag{4}$$

#### Verification.

Since $p,q\in[0,1]$ and $\alpha>0$, we have $p^{\alpha},q^{\alpha}\in[0,1]$, so $r_t^{\alpha}\in[-1,1]$: P1 holds. Since $h(x)=x^{\alpha}$ is strictly increasing for $\alpha>0$, $\pi_T>\pi_\theta\Rightarrow r_t^{\alpha}>0$: P2 holds. Critically, both hold at the probability-to-reward mapping level, without passing through the log ratio.
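As a minimal sketch of Eq. (4), assuming the pipeline exposes per-token log-probabilities for the sampled token (a common convention, but an assumption here), the reward can be computed as below. The function name `power_opd_reward` is hypothetical. Plain Python has no automatic differentiation, so the stop-gradient is implicit; in an autodiff framework the returned value would additionally need to be detached from the graph.

```python
import math


def power_opd_reward(teacher_logprob, student_logprob, alpha):
    """Token-level reward p**alpha - q**alpha from log-probs of the sampled token.

    Uses the identity p**alpha == exp(alpha * log p), so probabilities are
    never materialized before the power is applied.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return math.exp(alpha * teacher_logprob) - math.exp(alpha * student_logprob)


# Teacher prefers the token (0.9 vs 0.5): positive reward.
print(power_opd_reward(math.log(0.9), math.log(0.5), alpha=1.0))

# Student over-weights the token (0.2 vs 0.7): negative reward.
print(power_opd_reward(math.log(0.2), math.log(0.7), alpha=5.0))

# A vanishing teacher probability stays inside [-1, 1] instead of diverging.
print(power_opd_reward(-1e3, 0.0, alpha=0.1))
```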

#### $\alpha$ shapes the reward sensitivity profile.

The exponent $\alpha$ controls how reward sensitivity is distributed across the probability range. *Smaller $\alpha$ gives relatively more sensitivity to low-probability tokens*, resembling the log ratio's emphasis on rare events while remaining bounded. *Larger $\alpha$ attenuates low-probability differences* and shifts the reward signal toward better-supported regions.
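One way to read this claim, offered here as an illustrative calculation rather than a result stated in the paper, is through the slope of the per-probability transform $x^{\alpha}$:

$$\frac{\partial}{\partial x}\,x^{\alpha}=\alpha x^{\alpha-1},\qquad \lim_{x\to 0^{+}}\alpha x^{\alpha-1}=\begin{cases}\infty, & 0<\alpha<1,\\ 1, & \alpha=1,\\ 0, & \alpha>1.\end{cases}$$

For $0<\alpha<1$ the slope is steepest near zero, so a fixed change in a small probability moves the reward more than the same change in a large one. For $\alpha>1$ the slope vanishes near zero and reaches its maximum value $\alpha$ at $x=1$, so differences among low-probability tokens contribute little.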

## 5Experimental Setup

#### Models\.

We evaluate four teacher–student pairs spanning two student sizes and two teacher sizes: Qwen3\-0\.6B\-Base and Qwen3\-1\.7B\-Base as students, and Qwen3\-4B and Qwen3\-8B as teachers\.

#### Training\.

We train on DeepScaleR (Luo et al., [2025](https://arxiv.org/html/2606.17199#bib.bib31)) for 1.5k steps with bf16, learning rate $5\times 10^{-7}$, batch size 32, and on-policy rollouts up to 1024 tokens. Results are averaged over three runs.

#### Evaluation\.

We evaluate on six mathematical reasoning benchmarks: AIME24/25 (Mathematical Association of America, [2026a](https://arxiv.org/html/2606.17199#bib.bib32)), MATH-500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.17199#bib.bib36)), AMC23 (Mathematical Association of America, [2026b](https://arxiv.org/html/2606.17199#bib.bib33)), Minerva (Lewkowycz et al., [2022](https://arxiv.org/html/2606.17199#bib.bib34)), and OlympiadBench (He et al., [2024](https://arxiv.org/html/2606.17199#bib.bib35)). We report Avg@$n$, Pass@$n$, and average response length with $n=8$.
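The section does not spell out how Avg@$n$ and Pass@$n$ are computed. The sketch below assumes the common convention: Avg@$n$ is the per-problem fraction of correct samples averaged over problems, and Pass@$n$ is the fraction of problems with at least one correct sample among the $n$ drawn. The function names and the toy `samples` grid are hypothetical.

```python
def avg_at_n(results):
    """Mean per-problem accuracy over n samples, in percent (assumed definition)."""
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_n(results):
    """Percent of problems with at least one correct sample (assumed definition)."""
    solved = [1.0 if any(r) else 0.0 for r in results]
    return 100.0 * sum(solved) / len(solved)


# Toy correctness grid: 3 problems, n = 8 samples each.
samples = [
    [True] * 8,             # always solved
    [False] * 7 + [True],   # solved once in eight tries
    [False] * 8,            # never solved
]

print("Avg@8:", avg_at_n(samples))
print("Pass@8:", pass_at_n(samples))
```

Under these definitions Pass@$n$ is never smaller than Avg@$n$, which matches the ordering of the two columns throughout Table 3.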

#### Baselines\.

Baselines include *Vanilla OPD*, reward-stabilized variants (*Clip*: $c_l=-1, c_h=1$; *Tanh*; *Z-Score*), and *Full-vocabulary KL OPD*.

## 6Experiments

### 6\.1Main Results

#### PowerOPD substantially outperforms vanilla OPD\.

PowerOPD consistently improves over vanilla OPD across all four teacher–student configurations (Tables [2](https://arxiv.org/html/2606.17199#S3.T2) and [3](https://arxiv.org/html/2606.17199#S4.T3)). Averaged over configurations using the best PowerOPD variant for each metric, PowerOPD improves the mean performance by **+4.47** Avg@8 and **+4.06** Pass@8. The largest benchmark-averaged gain occurs in the Qwen3-4B $\to$ Qwen3-0.6B-Base setting, with Avg@8 improving from 19.13 to 25.50 (**+6.37**) and Pass@8 from 36.18 to 41.89 (**+5.71**). Gains also hold for the Qwen3-8B $\to$ Qwen3-0.6B-Base setting (+3.87 Avg@8, +3.62 Pass@8), the Qwen3-4B $\to$ Qwen3-1.7B-Base setting (+5.04 Avg@8, +4.39 Pass@8), and the Qwen3-8B $\to$ Qwen3-1.7B-Base setting (+2.60 Avg@8, +2.53 Pass@8). On individual benchmarks, the gains are larger, reaching up to **+16.75** Avg@8 on AMC23 in the Qwen3-4B $\to$ Qwen3-0.6B-Base setting and **+15.00** Pass@8 on AMC23 in the Qwen3-4B $\to$ Qwen3-1.7B-Base and Qwen3-8B $\to$ Qwen3-1.7B-Base settings. Notably, *PowerOPD plugs directly into the standard OPD pipeline without modifying rollout generation, teacher scoring, or policy-gradient optimization*.
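The gains for the Qwen3-1.7B-Base student can be recomputed from the Mean columns of Table 3. The short check below copies those values by hand; the dictionary layout and variable names are hypothetical, and the best PowerOPD value is taken separately for each metric, as in the text.

```python
# (Avg@8, Pass@8) benchmark means copied from Table 3.
table3_means = {
    "4B->1.7B": {"vanilla": (30.98, 47.66), "best_power": (36.02, 52.05)},
    "8B->1.7B": {"vanilla": (31.85, 49.57), "best_power": (34.45, 52.10)},
}

gains = {}
for setting, row in table3_means.items():
    v_avg, v_pass = row["vanilla"]
    p_avg, p_pass = row["best_power"]
    # Round to two decimals to match the precision reported in the table.
    gains[setting] = (round(p_avg - v_avg, 2), round(p_pass - v_pass, 2))
    print(setting, gains[setting])
```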

Table 4: Efficiency comparison on Qwen3-0.6B-Base $\leftarrow$ Qwen3-4B. Wall-time is averaged over 1.5k training steps; peak memory is measured with batch size 8.

#### PowerOPD also surpasses post\-hoc stabilization methods\.

PowerOPD also outperforms post-hoc log-reward stabilization baselines (Tables [2](https://arxiv.org/html/2606.17199#S3.T2) and [3](https://arxiv.org/html/2606.17199#S4.T3)). The largest benchmark-averaged gain occurs in the Qwen3-4B $\to$ Qwen3-0.6B-Base setting, where PowerOPD improves over the strongest post-hoc baseline from 22.49 to 25.50 Avg@8 (**+3.01**) and from 38.35 to 41.89 Pass@8 (**+3.54**). At the individual-benchmark level, the gains reach up to **+8.43** Avg@8 and **+7.50** Pass@8 on AMC23 in the same setting. Notably, *Z-Score* can underperform vanilla OPD, consistent with our analysis in [§4.2](https://arxiv.org/html/2606.17199#S4.SS2) that batch centering may violate sign consistency. These results show that post-hoc stabilization is insufficient: *the reward properties must hold at the probability-to-reward mapping level, before the log-ratio distortion occurs.*

![Refer to caption](https://arxiv.org/html/2606.17199v1/x11.png)

Figure 5: Reward magnitude over the joint probability space $(\pi_T,\pi_\theta)$. (a) The log-ratio reward depends only on the ratio $\pi_T/\pi_\theta$: it retains full sensitivity where both probabilities are small and grows without bound as either approaches zero. (b–d) PowerOPD couples reward magnitude to the absolute probability level: as $\alpha$ grows, the inert region expands and the signal contracts toward tokens with substantial probability under at least one model.

![Refer to caption](https://arxiv.org/html/2606.17199v1/x12.png)

Figure 6: Gradient norm during training with a Qwen3-1.7B-Base student and Qwen3-4B teacher.

#### PowerOPD consistently outperforms full\-vocabulary OPD at substantially lower compute cost\.

Taking the best PowerOPD variant separately for each metric, the largest benchmark-averaged gains are **+2.59** Avg@8 and **+8.90** Pass@8. Specifically, with the Qwen3-0.6B-Base student, PowerOPD improves Avg@8 by +2.59 and +0.56 over full-vocabulary OPD under the 4B and 8B teachers, with corresponding Pass@8 gains of +8.17 and +8.90. With the Qwen3-1.7B-Base student, PowerOPD improves Avg@8 by +1.86 and +1.22, and Pass@8 by +2.29 and +2.98, under the 4B and 8B teachers, respectively. At the individual-benchmark level, the gains are substantially larger, reaching up to **+11.60** Avg@8 on Olympiad in the Qwen3-4B $\to$ Qwen3-0.6B-Base setting and **+25.00** Pass@8 on AMC23 in the Qwen3-8B $\to$ Qwen3-0.6B-Base setting. As shown in [Table 4](https://arxiv.org/html/2606.17199#S6.T4), PowerOPD reduces wall-clock time by $59.2\%$ and peak GPU memory by $23.1\%$, with FLOPs computation detailed in [Appendix B](https://arxiv.org/html/2606.17199#A2). These results show that a bounded sampled-token reward can surpass full-vocabulary distillation while avoiding vocabulary-wide computation.

#### PowerOPD scales with larger $\alpha$.

Across all configurations, increasing $\alpha$ generally improves Avg@8 and Pass@8 while shortening responses. In the 0.6B/4B setting, Avg@8 rises from 24.16 at $\alpha=0.1$ to 25.50 at $\alpha=500$, Pass@8 from 39.61 to 41.89, and average length drops from 2,291 to 1,875 tokens. The same trend holds in the 1.7B/4B setting, where Avg@8 improves from 33.89 to 36.02 and length decreases from 1,900 to 1,579 tokens. This suggests that larger $\alpha$ encourages more targeted responses rather than longer generations. In contrast, full-vocabulary OPD often saturates the 4,096-token limit, consistent with the length inflation observed by Luo et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib19)), indicating weaker length calibration despite strong accuracy. Notably, PowerOPD scales with $\alpha$ without any additional supervision.

### 6\.2Analysis

#### Training Dynamics Across $\alpha$

We conduct a fine-grained sweep over $\alpha$ to study how the PowerOPD reward shape affects training dynamics. As shown in [Figure 4](https://arxiv.org/html/2606.17199#S4.F4), we compare PowerOPD with vanilla OPD and full-vocabulary OPD under the Qwen3-0.6B-Base student with Qwen3-4B and Qwen3-8B teachers. First, *larger $\alpha$ leads to faster and stronger convergence.* Vanilla OPD exhibits the delayed-improvement phase diagnosed earlier and converges to a weaker final accuracy, whereas PowerOPD reaches high validation accuracy much earlier in training; larger $\alpha$ further improves both convergence speed and final accuracy. Second, *larger $\alpha$ yields shorter and more stable generations.* As $\alpha$ increases, PowerOPD drives the student toward shorter responses and stabilizes earlier, while vanilla OPD remains length-unstable and can fail to settle into a stable generation regime. Full-vocabulary OPD also often approaches the maximum generation length, indicating that stronger accuracy does not necessarily translate into better length calibration.

#### Why Larger $\alpha$ Improves PowerOPD

To understand why larger $\alpha$ improves accuracy, we examine how the reward distributes its magnitude over the joint probability space $(\pi_T,\pi_\theta)$ in [Figure 5](https://arxiv.org/html/2606.17199#S6.F5). The log-ratio reward depends only on the ratio $\pi_T/\pi_\theta$ and is blind to the absolute probability level: it assigns the same reward to $0.002$ versus $0.0001$ as to $0.2$ versus $0.01$, and grows without bound whenever one probability approaches zero while the other does not ([Figure 5](https://arxiv.org/html/2606.17199#S6.F5)(a)). Consequently, its most extreme values fall on tokens that are improbable under both models, where the student rarely samples the token and the probability estimates carry the least statistical support, creating high-leverage policy-gradient terms $r_t\nabla_\theta\log\pi_\theta(o_t\mid c_t)$ from the least reliable measurements (Williams, [1992](https://arxiv.org/html/2606.17199#bib.bib41); Greensmith et al., [2004](https://arxiv.org/html/2606.17199#bib.bib42)). PowerOPD couples the reward magnitude to the absolute probability level: $|r_t^{\alpha}|$ is governed by $\max(\pi_T,\pi_\theta)^{\alpha}$, so a token receives substantial reward only if at least one model assigns it substantial probability, and $\alpha$ sets how strict this support requirement is. At $\alpha=1$, a probability of $0.6$ retains a sizable reward, whereas at $\alpha=10$ it contributes only $0.6^{10}\approx 0.006$. Accordingly, in [Figure 5](https://arxiv.org/html/2606.17199#S6.F5)(b–d), the inert region expands as $\alpha$ grows, and the surviving signal contracts toward tokens where the teacher probability is high, so the distillation target is reliable, or the student probability is high, so the update adjusts the dominant modes of the current policy rather than its marginal behaviors. Larger $\alpha$ therefore suppresses exactly the unreliable high-leverage tokens that destabilize vanilla OPD ([§3.2](https://arxiv.org/html/2606.17199#S3.SS2)), while concentrating learning on well-supported ones.
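The numbers quoted in this paragraph can be reproduced with a few lines of plain Python. This is an illustrative check only; `log_ratio_reward` and `power_reward` are hypothetical helper names, and the two probability pairs are the ones used in the text.

```python
import math


def log_ratio_reward(p, q):
    """Vanilla OPD reward: depends only on the ratio p / q."""
    return math.log(p / q)


def power_reward(p, q, alpha):
    """PowerOPD reward p**alpha - q**alpha: depends on the absolute level."""
    return p ** alpha - q ** alpha


low_support = (0.002, 0.0001)   # both models find the token improbable
high_support = (0.2, 0.01)      # same 20x ratio, but better supported

print("log-ratio:", log_ratio_reward(*low_support), log_ratio_reward(*high_support))

for alpha in (0.1, 1.0, 10.0):
    print("alpha =", alpha,
          "low:", power_reward(*low_support, alpha),
          "high:", power_reward(*high_support, alpha))

# A probability of 0.6 is heavily attenuated once alpha reaches 10.
print("0.6 ** 10 =", 0.6 ** 10)
```

Both pairs share the same log-ratio reward, while at $\alpha=1$ the PowerOPD reward for the better-supported pair is about a hundred times larger than for the poorly supported one.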

#### Training Stability

We further assess training stability by tracking gradient norms with a Qwen3-1.7B-Base student and Qwen3-4B teacher. As shown in [Figure 6](https://arxiv.org/html/2606.17199#S6.F6), vanilla OPD has an initial spike near $10^{3}$ and later rises above $20$, reflecting instability from high-variance log-ratio rewards. Post-hoc methods only partially mitigate this behavior: *Clip* and *Tanh* still fluctuate sharply, often exceeding $10$, while *Z-Score* grows from around $3$ to above $10$. In contrast, PowerOPD keeps the gradient norm nearly flat at $0.25$–$0.35$, about $3{,}000\times$ smaller than the initial OPD spike, over $60\times$ smaller than late-stage OPD, and over $30\times$ smaller than the high-gradient regimes of *Clip* and *Z-Score*. This shows that bounding the reward at the probability-to-reward mapping level stabilizes policy-gradient updates more effectively than post-hoc transformations of the log-ratio reward.

## 7Conclusion

We show that OPD's training instability stems from the unbounded log-ratio reward, and that standard post-hoc fixes are insufficient. We propose PowerOPD, a family of bounded, sign-consistent rewards from the Box-Cox power transformation parameterized by $\alpha>0$, which consistently outperforms vanilla OPD and all post-hoc baselines, surpasses full-vocabulary OPD at substantially lower compute cost, and scales with $\alpha$ to simultaneously improve accuracy, shorten response length, and stabilize training dynamics throughout optimization.

## Limitations

Our experiments are conducted primarily on mathematical reasoning benchmarks and Qwen3\-based teacher–student pairs\. This setting provides a controlled testbed for studying OPD reward design, since it involves long\-form generation and makes training instabilities easy to observe\. Future work can further validate whether the same reward design brings similar benefits to tasks such as code generation, general instruction following, and multilingual reasoning\.

## Ethical Considerations

This work studies the reward design of on\-policy distillation for language model post\-training\. We do not introduce new datasets containing personal or sensitive information, nor do we propose a user\-facing application or deployment pipeline\. Therefore, we do not identify direct ethical concerns specific to the proposed method\. More broadly, PowerOPD helps reduce the computational cost of distillation by improving the effectiveness of sampled\-token training, which could make post\-training more accessible and resource\-efficient\.

## References

- R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos, M. Geist, and O. Bachem (2024). On-policy distillation of language models: learning from self-generated mistakes. In Proceedings of the International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2306.13649)
- M. Andrychowicz, A. Raichuk, P. Stańczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, et al. (2020). What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990.
- S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. [Link](https://proceedings.neurips.cc/paper_files/paper/2015/file/e995f98d56967d946471af29d7bf99f1-Paper.pdf)
- G. E. P. Box and D. R. Cox (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological) 26(2), pp. 211–252.
- X. Chen, Z. Sun, G. Wenjin, M. Zhang, Y. Chen, Y. Sun, H. Su, Y. Pan, D. Klakow, W. Li, et al. (2025). Unveiling the key factors for distilling chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 15094–15119.
- DeepSeek Team (2026). DeepSeek-V4 technical report. Technical Report. [Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)
- DeepSeek-AI (2025). DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638. ISSN 1476-4687. [Link](http://dx.doi.org/10.1038/s41586-025-09422-z), [Document](https://dx.doi.org/10.1038/s41586-025-09422-z)
- Y. Fu, H. Huang, K. Jiang, Y. Zhu, and D. Zhao (2026). Revisiting on-policy distillation: empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562. [Link](https://arxiv.org/abs/2603.25562)
- E. Greensmith, P. L. Bartlett, and J. Baxter (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024). MiniLLM: knowledge distillation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2306.08543)
- C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun (2024). OlympiadBench: a challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv:2402.14008. [Link](https://arxiv.org/abs/2402.14008)
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track. [Link](https://arxiv.org/abs/2103.03874)
- G. Hinton, O. Vinyals, and J. Dean (2015). Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. [Link](https://arxiv.org/abs/1503.02531)
- W. Hou, S. Peng, W. Wang, Z. Ruan, Y. Zhang, Z. Zhou, M. Gao, Y. Chen, K. Wang, H. Yang, C. Zhang, Z. Tian, H. Hu, Y. Yang, F. Wu, and H. Fan (2026). Uni-OPD: unifying on-policy distillation with a dual-perspective recipe. arXiv preprint arXiv:2605.03677. [Link](https://arxiv.org/abs/2605.03677)
- C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister (2023). Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003–8017.
- I. Jang, J. Yeom, J. Yeo, H. Lim, and T. Kim (2026). Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155. [Link](https://arxiv.org/abs/2601.07155)
- N. Jia, H. Yang, X. Ma, J. Lian, S. Zhang, W. Zhang, K. Zeng, X. Cai, and Z. Sun (2026). Asymmetric on-policy distillation: bridging exploitation and imitation at the token level. arXiv preprint arXiv:2605.06387. [Link](https://arxiv.org/abs/2605.06387)
- W. Jin, T. Min, Y. Yang, S. R. Kadhe, Y. Zhou, D. Wei, N. Baracaldo, and K. Lee (2026). Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079. [Link](https://arxiv.org/abs/2603.07079)
- J. Ko, S. Abdali, Y. J. Kim, T. Chen, and P. Cameron (2026). Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137. [Link](https://arxiv.org/abs/2603.11137)
- KwaiKAT Team (2026). KAT-coder-v2 technical report. arXiv:2603.27703. [Link](https://arxiv.org/abs/2603.27703)
- A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra (2022). Solving quantitative reasoning problems with language models. arXiv:2206.14858. [Link](https://arxiv.org/abs/2206.14858)
- Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, and N. Ding (2026). Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. [Link](https://arxiv.org/abs/2604.13016)
- K. Liu, Z. Zhuang, Y. Bai, B. Wang, R. Weng, and J. Ye (2026a). Prefix teach, suffix fade: local teachability collapse in strong-to-weak on-policy distillation. arXiv:2605.13643. [Link](https://arxiv.org/abs/2605.13643)
- X. Liu, K. Jiao, C. Xiao, R. Zhao, J. Ruan, B. Li, J. Liu, Q. Wang, X. Chen, J. Wang, C. Wang, T. Xiao, and J. Zhu (2026b). Teacher-guided policy optimization for on-policy reasoning distillation under large policy divergence. arXiv:2605.13230. [Link](https://arxiv.org/abs/2605.13230)
- LLM-Core Xiaomi (2026). MiMo-v2-flash technical report. arXiv:2601.02780. [Link](https://arxiv.org/abs/2601.02780)
- K. Lu and Thinking Machines Lab (2025). On-policy distillation. Blog post. [Link](https://thinkingmachines.ai/blog/on-policy-distillation/)
- F. Luo, Y. Chuang, G. Wang, Z. Xu, X. Han, T. Zhang, and V. Braverman (2026). Demystifying OPD: length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527. [Link](https://arxiv.org/abs/2604.08527)
- M. Luo, S. Tan, J. Wong, X. Shi, W. Tang, M. Roongta, C. Cai, J. Luo, T. Zhang, E. Li, R. A. Popa, and I. Stoica (2025). DeepScaleR: surpassing o1-preview with a 1.5b model by scaling rl.
- Mathematical Association of America (2026a). American Invitational Mathematics Examination. [Link](https://maa.org/maa-invitational-competitions/)
- Mathematical Association of America (2026b). American Mathematics Competitions. [Link](https://maa.org/student-programs/amc/)
- V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski,et al\.\(2015\)Human\-level control through deep reinforcement learning\.nature518\(7540\),pp\. 529–533\.Cited by:[§1](https://arxiv.org/html/2606.17199#S1.p4.1),[§3\.3](https://arxiv.org/html/2606.17199#S3.SS3.p1.6)\.
- M\. Oh, S\. Song, G\. Choi, Y\. Choi, and Y\. Jo \(2026\)KL for a KL: on\-policy distillation with control variate baseline\.arXiv preprint arXiv:2605\.07865\.External Links:[Link](https://arxiv.org/abs/2605.07865)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px2.p1.1),[§2\.2](https://arxiv.org/html/2606.17199#S2.SS2.p1.4)\.
- Qwen Team \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.External Links:[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.17199#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- Qwen Team \(2026\)Qwen3\.5\-omni technical report\.External Links:2604\.15804,[Link](https://arxiv.org/abs/2604.15804)Cited by:[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.InarXiv preprint arXiv:1707\.06347,External Links:[Link](https://arxiv.org/abs/1707.06347)Cited by:[§1](https://arxiv.org/html/2606.17199#S1.p4.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§2\.2](https://arxiv.org/html/2606.17199#S2.SS2.p1.5)\.
- M\. Song and M\. Zheng \(2026\)A survey of on\-policy distillation for large language models\.arXiv preprint arXiv:2604\.00626\.External Links:[Link](https://arxiv.org/abs/2604.00626)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.17199#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- R\. S\. Sutton, D\. McAllester, S\. Singh, and Y\. Mansour \(1999\)Policy gradient methods for reinforcement learning with function approximation\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf)Cited by:[§2\.2](https://arxiv.org/html/2606.17199#S2.SS2.p1.4)\.
- Tencent Robotics X and HY Vision Team \(2026\)HY\-embodied\-0\.5: embodied foundation models for real\-world agents\.External Links:2604\.07430,[Link](https://arxiv.org/abs/2604.07430)Cited by:[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine Learning8\(3–4\),pp\. 229–256\.Cited by:[§2\.2](https://arxiv.org/html/2606.17199#S2.SS2.p1.4),[§6\.2](https://arxiv.org/html/2606.17199#S6.SS2.SSS0.Px2.p1.17)\.
- Y\. Xu, H\. Sang, Z\. Zhou, R\. He, Z\. Wang, and A\. Geramifard \(2026\)TIP: token importance in on\-policy distillation\.arXiv preprint arXiv:2604\.14084\.External Links:[Link](https://arxiv.org/abs/2604.14084)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px2.p1.1)\.
- Z\. Yang, Z\. Liu, Y\. Chen, W\. Dai, B\. Wang, S\. Lin, C\. Lee, Y\. Chen, D\. Jiang, J\. He, R\. Pi, G\. Lam, N\. Lee, A\. Bukharin, M\. Shoeybi, B\. Catanzaro, and W\. Ping \(2026\)Nemotron\-cascade 2: post\-training LLMs with cascade RL and multi\-domain on\-policy distillation\.arXiv preprint arXiv:2603\.19220\.External Links:[Link](https://arxiv.org/abs/2603.19220)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px1.p1.1),[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- A\. Zhao, H\. Xin, Y\. Fan, J\. Tong, W\. Li, and X\. Shen \(2026a\)Decoupling kl and trajectories: a unified perspective for sft, dagger, offline rl, and opd in llm distillation\.External Links:2605\.16826,[Link](https://arxiv.org/abs/2605.16826)Cited by:[§1](https://arxiv.org/html/2606.17199#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.17199#S3.SS1.p1.1)\.
- H\. Zhao, H\. Chen, H\. Lin, G\. I\. Winata, D\. Yao, and W\. Tang \(2026b\)OPD\+: rethinking the advantage design for on\-policy distillation\.External Links:2606\.01039,[Link](https://arxiv.org/abs/2606.01039)Cited by:[§1](https://arxiv.org/html/2606.17199#S1.p2.1)\.
- M\. Zheng, Z\. Li, T\. Chen, M\. Song, and D\. Wang \(2025\)HY\-mt1\.5 technical report\.External Links:2512\.24092,[Link](https://arxiv.org/abs/2512.24092)Cited by:[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.
- Zhipu AI Team \(2026\)GLM\-5: from vibe coding to agentic engineering\.arXiv preprint arXiv:2602\.15763\.External Links:[Link](https://arxiv.org/abs/2602.15763)Cited by:[Appendix A](https://arxiv.org/html/2606.17199#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.17199#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.17199#S2.SS1.p1.2)\.

## Appendix A Related Work

#### On-Policy Distillation.

Knowledge distillation transfers knowledge from a teacher to a student by matching soft output distributions (Hinton et al., [2015](https://arxiv.org/html/2606.17199#bib.bib44); Hsieh et al., [2023](https://arxiv.org/html/2606.17199#bib.bib3); Chen et al., [2025](https://arxiv.org/html/2606.17199#bib.bib4)). For autoregressive language models, on-policy distillation trains the student on its own samples rather than on teacher-generated data, eliminating the train–test distribution mismatch of offline methods (Gu et al., [2024](https://arxiv.org/html/2606.17199#bib.bib2); Agarwal et al., [2024](https://arxiv.org/html/2606.17199#bib.bib5)). OPD has since been adopted at scale: Qwen3 (Qwen Team, [2025](https://arxiv.org/html/2606.17199#bib.bib6)) and GLM-5 (Zhipu AI Team, [2026](https://arxiv.org/html/2606.17199#bib.bib7)) use it to transfer reasoning capabilities, Nemotron-Cascade 2 (Yang et al., [2026](https://arxiv.org/html/2606.17199#bib.bib8)) combines it with cascade RL, and DeepSeek-V4 (DeepSeek Team, [2026](https://arxiv.org/html/2606.17199#bib.bib9)) integrates it into post-training. Song and Zheng ([2026](https://arxiv.org/html/2606.17199#bib.bib30)) survey the rapidly expanding landscape.

#### Training Stability in OPD.

A growing body of work addresses OPD instability while retaining the log-ratio reward. At the reward level, Jin et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib16)) downweight uncertain tokens by teacher entropy, Xu et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib23)) assign importance weights based on reward reliability, and Oh et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib27)) introduce control variate baselines to reduce reward variance. At the structural level, Jang et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib15)) adaptively reformulate the teacher target, Jia et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib26)) apply asymmetric treatment across token positions, and Hou et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib24)) unify multiple distillation perspectives with stabilizing constraints. Empirical analyses by Fu et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib18)), Luo et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib19)), and Li et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib22)) characterize common failure modes such as reward spikes and length inflation. Ko et al. ([2026](https://arxiv.org/html/2606.17199#bib.bib17)) formalize OPD as a policy gradient problem and propose relaxed objectives. All of these approaches modify how the log-ratio reward is weighted, normalized, or combined, but none question whether the log ratio is the right reward function. Our work departs from this line by redesigning the reward itself, replacing the unbounded log transformation with bounded power functions derived from the Box-Cox family (Box and Cox, [1964](https://arxiv.org/html/2606.17199#bib.bib1)).

## Appendix B FLOPs Estimation

We estimate the distillation-update FLOPs for the Qwen3-0.6B-Base student and Qwen3-4B teacher setting. We use batch size $B=32$, average prompt length $128$, rollout length $T=1024$, and total prefill length $S=1152$. Let $d$ be the hidden size, $L$ the number of transformer layers, $m$ the intermediate size, $V$ the vocabulary size, and $d_{\mathrm{kv}}=n_{\mathrm{kv}}d_{h}$ the total key/value projection dimension under GQA. For one prefill forward pass, we estimate the transformer FLOPs as

$$
F_{\mathrm{tr}}(d,L,m,d_{\mathrm{kv}},S) = BL\Big(4Sd^{2} + 4Sdd_{\mathrm{kv}} + 2S^{2}d + 6Sdm\Big).
$$

The four terms correspond to the query/output projections, key/value projections, causal attention matrix multiplications, and SwiGLU MLP projections, respectively. We count multiply-add as two FLOPs. For Qwen3-4B, this gives

$$
F_{\mathrm{tr}}^{T} \approx 254.8\ \mathrm{TFLOPs},
$$

and for Qwen3-0.6B-Base,

$$
F_{\mathrm{tr}}^{S} \approx 30.6\ \mathrm{TFLOPs}.
$$

The teacher is used only for scoring, so we count one forward pass. The student is updated by backpropagation, so we approximate the student update as one forward pass plus one backward pass, with the backward pass counted as twice the forward cost, i.e., $3F_{\mathrm{tr}}^{S}$. Thus, sampled-token OPD methods such as PowerOPD require

$$
F_{\mathrm{sampled}} = F_{\mathrm{tr}}^{T} + 3F_{\mathrm{tr}}^{S} \approx 254.8 + 3\times 30.6 = 346.6\ \mathrm{TFLOPs}
$$

per distillation update. The sampled-token reward only requires the probabilities of the sampled rollout tokens; the corresponding selective output projection is $O(BTd)$ and is negligible compared with transformer compute.
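The estimate above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `transformer_flops` is hypothetical, and the layer counts and dimensions are assumptions taken from the publicly released Qwen3 model configurations rather than from this appendix. Under those assumed values the results should land close to the figures quoted above.

```python
def transformer_flops(B, S, d, L, m, d_kv):
    """F_tr = B * L * (4 S d^2 + 4 S d d_kv + 2 S^2 d + 6 S d m)."""
    per_layer = (
        4 * S * d * d        # query/output projections
        + 4 * S * d * d_kv   # key/value projections
        + 2 * S * S * d      # causal attention matrix multiplications
        + 6 * S * d * m      # SwiGLU MLP projections
    )
    return B * L * per_layer


# Assumed dimensions (public model configs, NOT stated in the appendix).
TEACHER = dict(d=2560, L=36, m=9728, d_kv=8 * 128)  # Qwen3-4B
STUDENT = dict(d=1024, L=28, m=3072, d_kv=8 * 128)  # Qwen3-0.6B-Base

B, S = 32, 1152  # batch size and total prefill length from the text
TERA = 1e12

f_tr_teacher = transformer_flops(B, S, **TEACHER) / TERA
f_tr_student = transformer_flops(B, S, **STUDENT) / TERA

# One teacher forward pass plus a student update approximated as 3x forward.
f_sampled = f_tr_teacher + 3 * f_tr_student

print(f"teacher forward:      {f_tr_teacher:.1f} TFLOPs")
print(f"student forward:      {f_tr_student:.1f} TFLOPs")
print(f"sampled-token update: {f_sampled:.1f} TFLOPs")
```

Because the MLP term $6Sdm$ dominates each layer, changing the assumed intermediate size $m$ moves the estimate far more than changing $d_{\mathrm{kv}}$.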

Full-vocabulary KL OPD additionally materializes vocabulary-sized distributions over all rollout positions. The teacher requires a full-vocabulary LM-head projection,

$$
F_{\mathrm{head}}^{T} = 2BTd_{T}V \approx 25.5\ \mathrm{TFLOPs},
$$

while the student full-vocabulary projection participates in backpropagation:

$$
F_{\mathrm{head}}^{S,\mathrm{train}} = 3\cdot 2BTd_{S}V \approx 30.6\ \mathrm{TFLOPs}.
$$

The KL term is computed over the full vocabulary,

$$
D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{T}) = \sum_{v=1}^{V}\pi_{\theta}(v\mid c_{t})\Big[\log\pi_{\theta}(v\mid c_{t}) - \log\pi_{T}(v\mid c_{t})\Big]
$$

for each rollout token. This elementwise KL computation costs $O(BTV)$, which is lower order relative to the full-vocabulary projections and is omitted from the headline estimate. Therefore, full-vocabulary KL OPD requires

$$
F_{\mathrm{full}} = F_{\mathrm{tr}}^{T} + 3F_{\mathrm{tr}}^{S} + F_{\mathrm{head}}^{T} + F_{\mathrm{head}}^{S,\mathrm{train}} \approx 346.6 + 25.5 + 30.6 = 402.7\ \mathrm{TFLOPs}.
$$

Thus, avoiding the full-vocabulary distillation signal saves approximately

$$
402.7 - 346.6 = 56.1\ \mathrm{TFLOPs}
$$

per update, corresponding to a $13.9\%$ reduction in distillation-update FLOPs. This estimate only counts arithmetic operations; in practice, full-vocabulary KL OPD also incurs additional memory traffic from materializing vocabulary-sized logits and KL tensors.
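The head-projection terms and the resulting saving can be checked in the same way. This is a rough sketch under stated assumptions: the vocabulary size and the two hidden sizes are taken from the public Qwen3 configurations and are not given in this appendix, while the 346.6 TFLOPs sampled-token figure is copied from the text.

```python
B, T = 32, 1024      # batch size and rollout length from the text
V = 151_936          # assumed vocabulary size (public Qwen3 configs)
d_teacher = 2560     # assumed hidden size of Qwen3-4B
d_student = 1024     # assumed hidden size of Qwen3-0.6B-Base
TERA = 1e12

# Teacher LM head: forward only, multiply-add counted as two FLOPs.
f_head_teacher = 2 * B * T * d_teacher * V / TERA

# Student LM head: forward plus backward, approximated as 3x forward.
f_head_student = 3 * 2 * B * T * d_student * V / TERA

f_sampled = 346.6    # TFLOPs, sampled-token estimate quoted in the text
f_full = f_sampled + f_head_teacher + f_head_student
saving = f_full - f_sampled

print(f"teacher head: {f_head_teacher:.1f} TFLOPs")
print(f"student head: {f_head_student:.1f} TFLOPs")
print(f"full-vocabulary update: {f_full:.1f} TFLOPs")
print(f"saving: {saving:.1f} TFLOPs ({100 * saving / f_full:.1f}%)")
```

The percentage is measured against the full-vocabulary cost, which is why a 56.1 TFLOPs saving corresponds to 13.9% rather than the larger fraction it would be of the sampled-token cost.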

## Appendix C Derivation of the OPD Policy-Gradient Form

We derive the policy-gradient form used for OPD and clarify why the log-ratio term can be viewed as a dense token-level reward. Let $x\sim\mathcal{D}$ be a prompt and $o=(o_{1},\ldots,o_{T})\sim\pi_{\theta}(\cdot\mid x)$ be a student rollout. We denote the prefix context before token $o_{t}$ by

$$
c_{t} = (x, o_{<t}).
$$

By autoregressive factorization,

$$
\pi_{\theta}(o\mid x) = \prod_{t=1}^{T}\pi_{\theta}(o_{t}\mid c_{t}), \qquad \pi_{T}(o\mid x) = \prod_{t=1}^{T}\pi_{T}(o_{t}\mid c_{t}).
$$

The reverse-KL objective minimized by OPD is

$$
D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{T}) = \mathbb{E}_{x\sim\mathcal{D},\,o\sim\pi_{\theta}(\cdot\mid x)}\left[\log\frac{\pi_{\theta}(o\mid x)}{\pi_{T}(o\mid x)}\right].
$$

Equivalently, OPD maximizes the negative reverse KL:

$$
J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\,o\sim\pi_{\theta}(\cdot\mid x)}\left[\log\frac{\pi_{T}(o\mid x)}{\pi_{\theta}(o\mid x)}\right].
$$

Using the autoregressive factorization, this sequence-level log ratio decomposes into token-level terms:

$$
\log\frac{\pi_{T}(o\mid x)}{\pi_{\theta}(o\mid x)} = \sum_{t=1}^{T}\log\frac{\pi_{T}(o_{t}\mid c_{t})}{\pi_{\theta}(o_{t}\mid c_{t})}.
$$

This gives the dense token-level OPD reward

$$
r_{t}^{\mathrm{OPD}}(c_{t},o_{t}) = \log\frac{\pi_{T}(o_{t}\mid c_{t})}{\pi_{\theta}(o_{t}\mid c_{t})}.
$$
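As a quick numerical illustration of this decomposition, the sketch below computes token-level rewards for one rollout and checks that they sum to the sequence-level log ratio. The probabilities are made-up toy values and the helper name `token_rewards` is hypothetical.

```python
import math

# Hypothetical probabilities assigned to the sampled tokens o_1..o_T.
teacher_probs = [0.60, 0.10, 0.85, 0.02]  # pi_T(o_t | c_t)
student_probs = [0.50, 0.40, 0.80, 0.30]  # pi_theta(o_t | c_t)


def token_rewards(teacher, student):
    """r_t = log pi_T(o_t | c_t) - log pi_theta(o_t | c_t) per sampled token."""
    return [math.log(p_t) - math.log(p_s) for p_t, p_s in zip(teacher, student)]


rewards = token_rewards(teacher_probs, student_probs)

# Sequence-level log ratio obtained from the autoregressive factorization.
seq_log_ratio = math.log(math.prod(teacher_probs)) - math.log(math.prod(student_probs))

# The token-level rewards sum to the sequence-level log ratio.
assert math.isclose(sum(rewards), seq_log_ratio)

for t, r in enumerate(rewards, start=1):
    print(f"t={t}: r_t = {r:+.3f}")
```

The last token, where the teacher assigns probability 0.02 to a token the student sampled with probability 0.30, receives by far the most negative reward. Since the log ratio has no lower bound as the teacher probability approaches zero, a single such token can dominate the sum.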
In practice, OPD is optimized with a policy-gradient surrogate in which the reward is treated as a stop-gradient scalar. For a sampled batch of rollout tokens, the empirical objective is

$$
\widehat{J}_{\mathrm{OPD}}(\theta) = \sum_{(c_{t},o_{t})\in\mathcal{B}} \mathrm{sg}\!\left[r_{t}^{\mathrm{OPD}}(c_{t},o_{t})\right]\cdot\log\pi_{\theta}(o_{t}\mid c_{t}),
$$

where $\mathrm{sg}[\cdot]$ denotes stop-gradient. Taking the gradient gives

$$
\nabla_{\theta}\widehat{J}_{\mathrm{OPD}}(\theta) = \sum_{(c_{t},o_{t})\in\mathcal{B}} \mathrm{sg}\!\left[r_{t}^{\mathrm{OPD}}(c_{t},o_{t})\right]\cdot\nabla_{\theta}\log\pi_{\theta}(o_{t}\mid c_{t}).
$$

Taking expectation over prompts and student rollouts yields the policy-gradient form

$$
\nabla_{\theta}J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\,o\sim\pi_{\theta}(\cdot\mid x)}\Bigg[\sum_{t=1}^{T}\mathrm{sg}\!\left[\log\frac{\pi_{T}(o_{t}\mid c_{t})}{\pi_{\theta}(o_{t}\mid c_{t})}\right]\cdot\nabla_{\theta}\log\pi_{\theta}(o_{t}\mid c_{t})\Bigg].
$$

Thus, OPD can be implemented as dense-reward policy-gradient learning, where each sampled token receives the stop-gradient reward

$$
r_{t}^{\mathrm{OPD}}(c_{t},o_{t}) = \log\frac{\pi_{T}(o_{t}\mid c_{t})}{\pi_{\theta}(o_{t}\mid c_{t})}.
$$

This is the form used throughout the paper.
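To make the stop-gradient surrogate concrete, the following sketch evaluates its gradient for a single position, assuming the student distribution is a plain softmax over logits. It is a toy illustration rather than the training code used in the paper: the logits and the helper names `softmax`, `surrogate_grad`, and `surrogate_value` are hypothetical. With the reward held constant, the gradient with respect to the student logits is the reward times (one-hot of the sampled token minus the student distribution), which the code compares against a finite-difference estimate.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate_grad(student_logits, teacher_logits, token):
    """Return (reward, gradient of sg[r] * log pi_theta(token) w.r.t. student logits)."""
    pi_s = softmax(student_logits)
    pi_t = softmax(teacher_logits)
    # Log-ratio reward, treated as a constant (stop-gradient).
    reward = math.log(pi_t[token]) - math.log(pi_s[token])
    # d/dz_v log softmax(z)[token] = 1[v == token] - pi_s[v]
    grad = [reward * ((1.0 if v == token else 0.0) - p) for v, p in enumerate(pi_s)]
    return reward, grad


def surrogate_value(student_logits, token, reward):
    """Surrogate objective for one token with the reward frozen."""
    return reward * math.log(softmax(student_logits)[token])


# Toy three-token vocabulary: the student favours token 0, the teacher does not.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.0, 1.5, 0.5]
token = 0  # token sampled from the student

reward, grad = surrogate_grad(student_logits, teacher_logits, token)

# Finite-difference check of the analytic gradient (reward stays frozen).
eps = 1e-6
base = surrogate_value(student_logits, token, reward)
numeric = []
for v in range(len(student_logits)):
    bumped = list(student_logits)
    bumped[v] += eps
    numeric.append((surrogate_value(bumped, token, reward) - base) / eps)

print("reward:", round(reward, 3))
print("analytic grad:", [round(g, 4) for g in grad])
print("numeric grad: ", [round(g, 4) for g in numeric])
```

In this toy case the reward is negative, so gradient ascent lowers the logit of the sampled token. The gradient magnitude scales linearly with the reward, so an unbounded reward translates directly into unbounded per-token gradient contributions.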