@VukRosic99: When a small model learns from a big one, half the lesson is wasted The setup: a small "student" model writes an answer…

X AI KOLs Timeline 06/28/26, 11:37 AM Papers

Summary

The paper identifies position bias in on-policy distillation for language models, where later tokens in student-generated answers receive degraded supervision. The proposed Importance-Weighted On-Policy Distillation (IW-OPD) weights corrections based on accumulated drift, improving learning speed and final performance.


Original Article


Cached at: 06/29/26, 12:21 AM

When a small model learns from a big one, half the lesson is wasted

The setup: a small “student” model writes an answer, and a stronger “teacher” model watches and corrects it word by word — “here’s what I’d have said instead.” The student learns from those corrections. This is on-policy distillation.

The catch: the teacher is reacting to the answer the student is actually writing. As long as the student stays close to a sensible path, the corrections are gold. But the moment the student wanders off — and a weaker model wanders early — the teacher is now correcting a path it would never have taken. From there on, its feedback is reacting to nonsense, so the lesson goes bad.

You can see it directly: train on only the FIRST 30% of each answer and the student learns just as well as training on the whole thing. Train on only the LAST 30% and it barely learns at all. The early words carry almost all the teaching; the late ones are mostly noise.

The fix, IW-OPD, is simple: trust each correction only as much as the answer is still on track.

1. As the student writes, keep a running tab on how far it has drifted from the teacher.
2. Weight corrections heavily while it's on track, and fade them out once it has wandered off.
3. No extra teacher calls: same compute, just spent on the words that actually teach.

The payoff: it learns faster and ends up better, and the bigger the teacher is compared to the student, the more it wins over plain distillation.

I broke it down into a short visual summary — swipe through.

paper - https://arxiv.org/abs/2606.22600

Today’s live: build a verified 88M-parameter LLM from one prompt, then set up an autonomous research loop. Join https://skool.com/live/bjyWvJzrfC6…

On the Position Bias of On-Policy Distillation

Source: https://arxiv.org/html/2606.22600

Yan Xie¹, Sijie Zhu¹, Tiansheng Wen², Bo Chen¹, Yifei Wang³

¹Xidian University, ²Georgia Institute of Technology, ³Amazon AGI SF Lab

Equal contribution: Yan Xie and Sijie Zhu ({yanxie940, zsj200454}@gmail.com). Corresponding authors: Bo Chen ([email protected]) and Yifei Wang ([email protected]). This work was conducted outside of Amazon.

Abstract

On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, we discover that not all tokens are created equal: as student rollouts grow longer, they deviate further from the teacher's distribution, leading to degraded supervision quality at later positions. As a result, OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything. In this work, we provide a principled understanding of this issue through the lens of constrained optimization. Based on these insights, we derive Importance-Weighted On-Policy Distillation (IW-OPD), in which the weight assigned to each token depends on the accumulated discrepancy between the student's and teacher's distributions, naturally upweighting earlier tokens and downweighting later ones with larger deviations. We show that IW-OPD converges significantly faster than OPD, with better learning efficiency, and achieves better final performance than standard OPD in both same-size and cross-scale settings, improving performance by 6.9 points on AIME-2025.

Figure 1: Position Bias in OPD training. (a) OPD training with the same token budget but supervision applied to different token positions. With the same 30% token budget, training on the prefix part of each response matches or exceeds full-token Standard OPD, whereas training on the suffix part fails to learn effectively. Student: Qwen3-0.6B, Teacher: Qwen3-4B-Instruct-2507. (b) Teacher–student gap in final accuracy when conditioned on student-generated prefixes. Teacher and student accuracy are measured by the probability of reaching a correct answer from a given student-generated prefix. The student model maintains a low accuracy, while the teacher model's mean@32 accuracy of eventually reaching the correct answer drops rapidly toward the student level as the student-generated prefix becomes longer.

Figure 2: IW-OPD improves both sample efficiency and final performance. (a) AIME 2025 accuracy during training (Teacher: Qwen3-4B, Student: Qwen3-1.7B): IW-OPD converges faster and achieves better final performance than Standard OPD. (b) Final accuracy vs. compression ratio (teacher parameters / student parameters) across student scales distilled from the same teacher; the IW-OPD advantage grows from +4.0% at 1.0× compression to +14.9% at 6.7×.

## 1 Introduction

On-Policy Distillation (OPD) trains a student on its own rollouts, while a stronger teacher provides dense token-level supervision at the prefixes visited by the student[1,10,25,40,24], substantially improving learning efficiency over sparse trajectory-level rewards in LLM post-training[6,11].

An OPD objective uniformly aggregates per-token KL divergence, as in standard knowledge distillation. However, this overlooks the nature of OPD: samples are generated by a weak student, which often produces erroneous outputs that are out-of-distribution (OOD) for the teacher model. As shown in Figure 1(b), the teacher can still reach reliable predictions when rolling out from early student tokens, but its performance deteriorates quickly at longer student rollouts, indicating that these prefixes have drifted away from the teacher distribution and that the teacher can provide only limited value on them. This clear trend reveals a position bias in OPD: early tokens in student rollouts should receive high weights because their supervision is high-quality, while later tokens should be down-weighted. A further controlled study confirms this intuition: as shown in Figure 1(a), with the same 30% token budget, OPD on only the 30% prefix matches or exceeds full OPD, whereas OPD on the 30% suffix provides little benefit.

These observations suggest that OPD should be viewed as an allocation problem under a finite local-update budget. Since each update can move the student policy only a limited distance, the update should spend more gradient budget on prefixes where teacher supervision is still compatible with the student's trajectory. We formalize this intuition through a constrained local projection toward the teacher. Solving this constrained problem yields a closed-form optimal policy whose sample weights are governed by the teacher-to-student likelihood ratio. This ratio explains the observed position bias: once student rollouts move the trajectory away from the teacher-preferred reasoning path, the prefix ratio decreases, and the optimal policy naturally reduces the sampling probability of downstream tokens. IW-OPD (Importance-Weighted On-Policy Distillation) implements this principle by reweighting token-level distillation terms with prefix-importance weights induced by the constrained projected objective. The method requires no additional teacher evaluations beyond standard OPD and reduces to standard OPD when the extra weighting is removed. Experiments show faster convergence and stronger final performance (Figure 2), with AIME25 gains over OPD reaching +6.9 points at step 10 and +1.7 points at convergence. Moreover, IW-OPD makes stronger teachers more sample-efficient and yields larger relative gains as students become smaller. Accordingly, this paper makes three contributions:

1. We identify the position bias phenomenon in OPD and explain it from a constrained-optimization perspective. This view shows why teacher-compatible prefixes dominate useful supervision (§3).
2. We propose IW-OPD as an efficient OPD objective with token-level importance estimated from the discrepancy between the teacher and the student models (§4).
3. We demonstrate that, in practice, IW-OPD consistently improves OPD with faster convergence and stronger final performance, and that its advantage scales with teacher–student mismatch: stronger teachers become more sample-efficient, while smaller students obtain larger gains (§5).

## 2 Preliminaries

Let $\mathcal{D}$ denote the prompt distribution, $\pi_{\theta}$ the student policy, and $\pi_{T}$ the teacher policy. For a prompt $x$ and response $y=(y_{1},\dots,y_{T})$, the trajectory-level distributions decompose autoregressively:

$$\pi_{\theta}(y|x)=\prod_{t=1}^{T}\pi_{\theta}(y_{t}|x,y_{<t}),\qquad\pi_{T}(y|x)=\prod_{t=1}^{T}\pi_{T}(y_{t}|x,y_{<t}).\tag{1}$$

On-Policy RL.

Standard RLVR (e.g., GRPO [30]) samples trajectories from the current policy and optimizes a trajectory-level reward:

$$\mathcal{J}_{\text{RL}}(\theta)=\max_{\theta}\,\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot|x)}\big[r(x,y)\big],\tag{2}$$

where $r(x,y)$ is obtained from a reward model [5,8,21] or a verifier [7,22,41,13]. The policy gradient takes the form

$$\nabla_{\theta}\mathcal{J}_{\text{RL}}=\mathbb{E}_{x,\,y\sim\pi_{\theta}}\bigg[\sum_{t=1}^{T}A_{t}\,\nabla_{\theta}\log\pi_{\theta}(y_{t}|x,y_{<t})\bigg],\tag{3}$$

where the advantage $A_{t}$, computed from the trajectory reward, assigns the same credit to every token.

On-Policy Distillation (OPD).

OPD [1,10,25] replaces the sparse trajectory-level reward with dense token-level supervision from a teacher model $\pi_{T}$ [40,24]:

$$\mathcal{J}_{\mathrm{OPD}}(\theta)=\max_{\theta}\,-D_{\mathrm{KL}}(\pi_{\theta}\|\pi_{T})=-\mathbb{E}_{x,\,y\sim\pi_{\theta}}\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}|x,y_{<t})}{\pi_{T}(y_{t}|x,y_{<t})},\tag{4}$$

where $y=[y_{1},\dots,y_{T}]\sim\pi_{\theta}(y|x)$ denotes a sampled answer from the student $\pi_{\theta}$. In practice, OPD decomposes the sequence-level objective and uses a token-local semi-gradient that treats the sampled prefixes as fixed [24,25], as in Eq. (3):

$$\nabla_{\theta}\mathcal{J}_{\text{OPD}}\approx\mathbb{E}_{x,\,y\sim\pi_{\theta}}\bigg[\sum_{t=1}^{T}A_{t}^{\text{OPD}}\,\nabla_{\theta}\log\pi_{\theta}(y_{t}|x,y_{<t})\bigg],\tag{5}$$

where OPD assigns a token-level advantage from the teacher–student distribution gap:

$$A_{t}^{\text{OPD}}\coloneqq-\big(\log\pi_{\theta}(y_{t}|x,y_{<t})-\log\pi_{T}(y_{t}|x,y_{<t})\big).\tag{6}$$
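As a minimal sketch of Eq. (6), the token-level advantage can be computed directly from the per-token log-probabilities that both models assign to the sampled tokens. The function name `opd_advantages` and the probabilities below are made up for illustration; they do not come from the paper.

```python
import math

def opd_advantages(student_logprobs, teacher_logprobs):
    """Eq. (6): A_t = log pi_T(y_t | x, y_<t) - log pi_theta(y_t | x, y_<t)."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both sequences must cover the same sampled tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Made-up probabilities of three sampled tokens under each model.
student_lp = [math.log(p) for p in (0.9, 0.8, 0.7)]
teacher_lp = [math.log(p) for p in (0.9, 0.4, 0.07)]
advantages = opd_advantages(student_lp, teacher_lp)

for t, a in enumerate(advantages, start=1):
    print(f"t={t}: A_t = {a:+.3f}")
```

A token on which the teacher agrees with the student gets an advantage near zero, while a token the teacher finds unlikely gets a negative advantage, which pushes its probability down under Eq. (5).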

## 3 Position Bias in On-Policy Distillation

Standard OPD provides dense token-level supervision, but it aggregates all token-level KL terms uniformly in Eq. (4). In this section, we observe that OPD actually has a clear position bias: its early tokens are much more valuable for learning than its later tokens. We show two empirical phenomena in §3.1 and provide a theoretical explanation in §3.2.

### 3.1 The Position Bias Phenomenon in OPD

Early-token supervision drives OPD performance.

To evaluate the influence of token positions in OPD learning, we fix the supervision budget and vary only the supervised segment. Prefix-30 applies OPD to the first 30% of response tokens, while Suffix-30 applies OPD to the last 30%; Standard OPD uses all valid tokens. All other training settings are unchanged (details in Appendix C).
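The Prefix-30 and Suffix-30 settings can be pictured as a 0/1 mask over response tokens that selects which per-token losses are averaged. The sketch below is illustrative only: the helper names and the rounding rule are assumptions, since the exact setup lives in the paper's Appendix C.

```python
def segment_mask(num_tokens, fraction=0.3, segment="prefix"):
    """0/1 mask keeping the first or last `fraction` of a response's tokens."""
    if segment not in ("prefix", "suffix"):
        raise ValueError("segment must be 'prefix' or 'suffix'")
    # Assumption: round to the nearest token and keep at least one.
    keep = min(num_tokens, max(1, round(num_tokens * fraction))) if num_tokens else 0
    rest = num_tokens - keep
    return [1] * keep + [0] * rest if segment == "prefix" else [0] * rest + [1] * keep

def masked_mean(per_token_losses, mask):
    """Average the per-token loss over supervised positions only."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(per_token_losses, mask)) / kept if kept else 0.0

# Made-up per-token losses for a 10-token response.
losses = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
prefix_loss = masked_mean(losses, segment_mask(len(losses), 0.3, "prefix"))
suffix_loss = masked_mean(losses, segment_mask(len(losses), 0.3, "suffix"))
print(prefix_loss, suffix_loss)
```

Both settings supervise the same number of tokens, so any difference in training outcome comes from where in the response the supervision is applied.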

Figure 1(a) shows a strong asymmetry. Supervising only the prefix 30% of tokens achieves performance comparable to standard OPD and consistently outperforms supervising only the suffix 30%. In contrast, suffix-only supervision yields substantially lower rewards throughout training. These results indicate that OPD benefits primarily from early-token supervision, while supervision on later tokens alone provides limited gains.

Teacher–student gap largely persists after OPD. We also observe that although the OPD objective minimizes the KL divergence between the teacher and the student, the actual divergence between the two decreases by only 20% even after training converges and student performance saturates (Fig. 3(a) and Fig. 3(b)). This suggests that OPD training effectively optimizes the student only within a small local region. A possible reason is that OPD optimizes only on student-generated samples, and this kind of RL training is known to produce minimal weight updates [35,4,20,33,43].

The two phenomena combined suggest an interesting learning landscape in OPD: it only optimizes the student distribution locally, and most of the gains come from early prefix tokens. Why does OPD have such a position bias, and what does it imply for learning efficiency? We provide a theoretical explanation of this phenomenon in the next section.

### 3.2 Understanding Position Bias from a Finite-Budget Allocation Perspective

Since the empirical result in §3.1 indicates that OPD only moves the student distribution within a small range, we can think of the actual OPD training as a constrained optimization problem, where the student distribution stays in a local region during training:

$$\min_{q}\ D_{\mathrm{KL}}(q\|\pi_{T})\quad\mathrm{s.t.}\quad D_{\mathrm{KL}}(q\|\pi_{\theta})\leq\rho,\tag{7}$$

where $\pi_{\theta}$ denotes the student distribution, and $\rho$ denotes the effective local update budget measured by KL divergence. It can also be viewed as a trust-region objective where the policy is only updated within a trust region of radius $\rho$. In fact, this constrained objective admits a closed-form solution $q^{\star}$, as revealed in the following proposition. The proof can be found in Appendix A.1.

Figure 3: Position Bias phenomena in OPD. (a) Mean token-level reverse KL across training steps: the mean token-level KL decreases during OPD training but plateaus at a non-zero residual. (b) Token-level reverse KL before and after OPD training. (c) Log-likelihood gap of student-sampled prefixes vs. prefix length: sequence-level log-probabilities of student-sampled prefixes under the student and teacher. Student: Qwen3-0.6B; Teacher: Qwen3-4B-Instruct.

###### Proposition 1 (Optimal Policy).

Let $\pi_{\theta}$ and $\pi_{T}$ have common support. In the local-update regime $0<\rho<D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$, the trust-region constraint in Eq. (7) is active and the unique solution is

$$q_{\theta}^{\star}(y)=\frac{\pi_{\theta}(y)}{Z_{\alpha}(\theta)}\left(\frac{\pi_{T}(y)}{\pi_{\theta}(y)}\right)^{\alpha},\tag{8}$$

which reweights the student policy $\pi_{\theta}$ by the likelihood ratio $r_{\theta}(y)=\pi_{T}(y)/\pi_{\theta}(y)$. Here, $Z_{\alpha}(\theta)=\mathbb{E}_{y\sim\pi_{\theta}}\left[r_{\theta}(y)^{\alpha}\right]$ denotes the normalizing factor, and $\alpha\in(0,1)$ is a constant that depends on $\rho$.

The optimal policy $q^{\star}$ is proportional to the likelihood ratio. This likelihood-based reweighting view in Proposition 1 explains why the position bias in Figure 1(a) arises. Along student rollouts, the prefix probability under the student, $\pi_{\theta}(y_{<t})$, usually decreases smoothly because the sequence is sampled from $\pi_{\theta}$. In contrast, the same prefix probability under the teacher, $\pi_{T}(y_{<t})$, can drop much faster once early student decisions move the reasoning path away from the teacher-preferred region, as illustrated in Figure 3(c). Therefore the prefix ratio $r_{\theta}(y_{<t})^{\alpha}$ tends to become smaller at later positions. In the constrained optimum, this decreasing ratio translates into smaller sampling weights for later, low-ratio prefixes and their associated tokens. Equivalently, the constrained optimum naturally assigns position-dependent weights according to prefix compatibility with the teacher. Thus, position bias reflects the intrinsic ratio-weighted structure of the constrained optimum.
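To make Proposition 1 concrete, the sketch below evaluates the closed form of Eq. (8) on a toy problem with three possible responses. The distributions and helper names (`kl`, `tilted_policy`) are assumptions chosen for illustration. Treating $\alpha$ as a free knob (including the endpoints 0 and 1, which lie outside the proposition's open interval) shows how $q^{\star}$ moves from the student toward the teacher.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tilted_policy(student, teacher, alpha):
    """Eq. (8): q*(y) = student(y) * (teacher(y) / student(y))**alpha / Z_alpha."""
    unnorm = [s * (t / s) ** alpha for s, t in zip(student, teacher)]
    z = sum(unnorm)  # Z_alpha = E_{y ~ student}[r(y)**alpha]
    return [u / z for u in unnorm]

# Toy distributions over three responses (illustrative values only).
student = [0.6, 0.3, 0.1]
teacher = [0.1, 0.3, 0.6]

for alpha in (0.0, 0.25, 0.5, 1.0):
    q = tilted_policy(student, teacher, alpha)
    print(f"alpha={alpha}: KL(q||student)={kl(q, student):.4f}, "
          f"KL(q||teacher)={kl(q, teacher):.4f}")
```

Larger $\alpha$ spends more of the budget $\rho = D_{\mathrm{KL}}(q^{\star}\|\pi_{\theta})$ and lands closer to the teacher, while responses with a low ratio $\pi_{T}(y)/\pi_{\theta}(y)$ lose probability mass first.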

## 4 Importance-Weighted On-Policy Distillation

Based on the insights from §3, in this section, we propose a more efficient OPD objective, Importance-Weighted OPD, that leverages the position bias to learn more efficiently.

### 4.1 Importance-Weighted OPD

As discussed in §3.1, OPD can only optimize within a local region, and the optimal policy it can attain is $q^{\star}_{\theta}$ (Eq. 8), which reweights the base student policy $\pi_{\theta}$ by the teacher–student gap measured by the likelihood ratio. As discussed in §3.2, only tokens with relatively high likelihood ratios provide meaningful learning signals in OPD, since the others have very low probability of being sampled.

Motivated by this observation, we propose to directly optimize the divergence between the optimal policy $q^{\star}$ and the teacher, which directly emphasizes high-probability samples:

$$\mathcal{J}_{q_{\theta}^{\star}}=\max_{\theta}\;-D_{\mathrm{KL}}(q_{\theta}^{\star}\|\pi_{T}).\tag{9}$$

Since $q_{\theta}^{\star}$ is a reparameterized distribution induced by the student and teacher, we can further reparameterize this new learning objective by sampling from $\pi_{\theta}$ instead. More specifically, it is equivalent to an Importance-Weighted OPD (IW-OPD) objective, as revealed in the next proposition.

Importance-weighted form of the projected objective.

For clarity, we first fix the prompt $x$ and omit it from the notation. We consider the non-trivial local-update regime $0<\rho<D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$ from Proposition 1. In this regime, $\alpha$ is induced by the effective step-size budget $\rho$, and the trust-region constraint is active, i.e., $D_{\mathrm{KL}}(q_{\theta}^{\star}\|\pi_{\theta})=\rho$. Thus we have:

$$\begin{aligned}
D_{\mathrm{KL}}(q_{\theta}^{\star}\|\pi_{T})
&=\mathbb{E}_{q_{\theta}^{\star}}\left[\log\frac{q_{\theta}^{\star}(y)}{\pi_{T}(y)}\right]\\
&=\mathbb{E}_{q_{\theta}^{\star}}\left[\log\frac{q_{\theta}^{\star}(y)}{\pi_{\theta}(y)}\right]+\mathbb{E}_{q_{\theta}^{\star}}\left[\log\frac{\pi_{\theta}(y)}{\pi_{T}(y)}\right]\\
&=\rho+\mathbb{E}_{q_{\theta}^{\star}}\left[\log\frac{\pi_{\theta}(y)}{\pi_{T}(y)}\right].
\end{aligned}\tag{10}$$

###### Proposition 2 (Importance-Weighted OPD Objective).

By applying a change of measure from $q_{\theta}^{\star}$ to $\pi_{\theta}$ and substituting Eq. (8), we obtain an importance-weighted trajectory-level objective and define the following token-level surrogate (proof in Appendix A.2):

$$\mathcal{J}_{\mathrm{IW}}^{\star}(\theta)=\max_{\theta}\;-\mathbb{E}_{y\sim\pi_{\theta}}\left[\sum_{t=1}^{T}\mathrm{sg}\left[\tilde{r}_{t}\right]\log\frac{\pi_{\theta}(y_{t}\mid y_{<t})}{\pi_{T}(y_{t}\mid y_{<t})}\right],\quad\tilde{r}_{t}=\frac{r_{t}}{Z_{\alpha,t}},\tag{11}$$

where $r_{t}:=r_{\theta}(y_{<t})^{\alpha}=\left(\pi_{T}(y_{<t})/\pi_{\theta}(y_{<t})\right)^{\alpha}$ depends on $\alpha$ and denotes the prefix likelihood ratio at position $t$, inherited from the trajectory-level ratio in Proposition 1. The position-wise normalizer is $Z_{\alpha,t}=\mathbb{E}_{y_{<t}\sim\pi_{\theta}}\left[r_{t}\right]$, and $\mathrm{sg}[\cdot]$ is the stop-gradient operator.

Eq. (11) shows that the Eq. (9) objective can be optimized using $\pi_{\theta}$-sampled rollouts, as in standard OPD, with each token-level KL term reweighted by a detached, normalized prefix-importance weight $\tilde{r}_{t}$. This weight, introduced by $q_{\theta}^{\star}$, becomes larger for teacher-compatible prefixes with a high teacher–student likelihood ratio and smaller after accumulated teacher–student drift.
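The change of measure behind the trajectory-level objective can be spelled out in one line. The following is a sketch for a discrete response space, using only Eq. (8) and the last term of Eq. (10); the step from the trajectory-level ratio to the prefix-level weights of the token-level surrogate is the paper's definition (Appendix A.2) and is not reproduced here.

```latex
\begin{aligned}
\mathbb{E}_{y\sim q_{\theta}^{\star}}\big[f(y)\big]
  &= \sum_{y} q_{\theta}^{\star}(y)\, f(y) \\
  &= \sum_{y} \pi_{\theta}(y)\,\frac{r_{\theta}(y)^{\alpha}}{Z_{\alpha}(\theta)}\, f(y)
     && \text{(substituting Eq. 8)} \\
  &= \mathbb{E}_{y\sim\pi_{\theta}}\!\left[\frac{r_{\theta}(y)^{\alpha}}{Z_{\alpha}(\theta)}\, f(y)\right],
  \qquad
  f(y)=\log\frac{\pi_{\theta}(y)}{\pi_{T}(y)}
      =\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}\mid y_{<t})}{\pi_{T}(y_{t}\mid y_{<t})}.
\end{aligned}
```

The expectation under $q_{\theta}^{\star}$ therefore needs no new samples: rollouts drawn from $\pi_{\theta}$ are reused and weighted by $r_{\theta}(y)^{\alpha}/Z_{\alpha}(\theta)$.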

Similar to OPD, the gradient of IW-OPD can be written in policy-gradient form with an importance-weighted advantage (proof in Appendix A.3):

$$\nabla_{\theta}\mathcal{J}_{\mathrm{IW}}^{\star}(\theta)\approx\mathbb{E}_{y\sim\pi_{\theta}}\left[\sum_{t=1}^{T}A_{t}^{\text{IW-OPD}}\,\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t})\right],\tag{12}$$

where

$$A_{t}^{\text{IW-OPD}}=-\mathrm{sg}\left[\tilde{r}_{t}\right]\left(\log\pi_{\theta}(y_{t}\mid y_{<t})-\log\pi_{T}(y_{t}\mid y_{<t})\right).\tag{13}$$

With stop-gradient applied, the coefficient in Eq. (12) acts as a multiplicative weight on the OPD policy-gradient signal. Thus, the weighted gradient is the operational form of the constrained-optimization view in Eq. (9): it corrects OPD's position bias by reallocating the finite update budget through prefix-importance weights.

Figure 4: From signed prefix ratio to unsigned prefix discrepancy. (a) Ablation of $\alpha$ in prefix-importance weights: directly using the ideal prefix ratio is sensitive to $\alpha$. (b) Prefix-importance weights visualization: the token-wise view shows the desired overall downward trend, but also local rebounds caused by signed cancellation. (c) Signed vs. unsigned prefix-importance weights ablation: replacing signed accumulation with the unsigned weight gives a more stable weighting signal.

### 4.2 Stable Token-level Importance Weight Estimate

Eq. (12) provides a principled reweighting $\tilde{r}_{\alpha,t}\propto\left(\pi_{T}(y_{<t})/\pi_{\theta}(y_{<t})\right)^{\alpha}$ for OPD. However, since the probability $\pi(y_{<t})=\prod_{i=1}^{t-1}\pi(y_{i}|y_{<i})$ is a product over an increasingly long sequence, the ratio varies dramatically as $t$ grows, which introduces severe training instability. Below, we discuss several techniques that help stabilize this importance weight estimate (the techniques are ablated in §5.3).

Small weight index $\alpha$. A small $\alpha\to 0$ can help flatten the difference, but it is still not sufficient. Fig. 4(a) shows strong sensitivity to $\alpha$: large values such as $\alpha=1$ and $\alpha=0.1$ substantially degrade training, while only very small values such as $\alpha=0.01$ and $\alpha=0.001$ roughly match standard OPD. Fig. 4(b) visualizes this numerical instability: $\alpha$ directly controls the scale and sharpness of the prefix weights, making the raw weights either overly concentrated or nearly flat.
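The scale problem is easy to see numerically. The sketch below evaluates the unnormalized raw weight $\exp(\alpha\cdot\log r)$ for a few hypothetical accumulated log-ratios; the values $-2$, $-20$, $-200$ and the helper name `raw_prefix_weight` are assumptions for illustration, and the normalizer $Z_{\alpha,t}$ is omitted.

```python
import math

def raw_prefix_weight(cumulative_log_ratio, alpha):
    """Unnormalized prefix weight (pi_T(y_<t) / pi_theta(y_<t)) ** alpha."""
    return math.exp(alpha * cumulative_log_ratio)

# Hypothetical accumulated log-ratios at three prefix lengths (not from the paper).
cumulative_log_ratio = {10: -2.0, 100: -20.0, 1000: -200.0}

for alpha in (1.0, 0.1, 0.01, 0.001):
    row = ", ".join(
        f"t={t}: {raw_prefix_weight(lr, alpha):.3g}"
        for t, lr in cumulative_log_ratio.items()
    )
    print(f"alpha={alpha}: {row}")
```

With $\alpha=1$ the late-position weight is vanishingly small, so almost all weight sits on the first few tokens; with $\alpha=0.001$ every weight is close to 1 and the reweighting has little effect.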

Beyond adjusting $\alpha$, we discover several effective strategies that stabilize the importance estimate.

I. Stabilization via log scaling. We first scale down the probability gap by considering the ratio in log space, which preserves the ordering while mitigating the exponentially accumulated gaps:

$$\tilde{r}^{\mathrm{log}}_{t}=\log r_{t}=\alpha\sum_{k<t}\left(\log\pi_{T}(y_{k}|y_{<k})-\log\pi_{\theta}(y_{k}|y_{<k})\right)=\alpha\sum_{k<t}A_{k}^{\mathrm{OPD}},\tag{14}$$

which directly corresponds to the sum of token-level OPD advantages $A_{k}^{\mathrm{OPD}}$.

II. Correcting positive-advantage tokens. Since the rollout tokens are sampled from the student, $\pi_{\theta}(y_{k}|y_{<k})$ is often relatively large on these tokens, and thus, as shown in Fig. 3(c), $\log\pi_{T}(y_{k}|y_{<k})-\log\pi_{\theta}(y_{k}|y_{<k})$ is mostly negative. However, for some tokens where the teacher assigns higher probability than the student, this term becomes positive. These terms can partially offset the accumulated negative prefix gap, as shown in Fig. 4(b). In practice, we find it helpful to further reduce this cancellation effect by flipping the sign of the positive terms, which yields a sum of non-negative token-level discrepancies measured by $|A_{k}^{\mathrm{OPD}}|$:

$$\tilde{r}^{\mathrm{abs}}_{t}=\sum_{k<t}\left(\mathbb{I}[A_{k}^{\mathrm{OPD}}<0]\,A_{k}^{\mathrm{OPD}}-\mathbb{I}[A_{k}^{\mathrm{OPD}}>0]\,A_{k}^{\mathrm{OPD}}\right)=-\sum_{k<t}\left|A_{k}^{\mathrm{OPD}}\right|.\tag{15}$$

We empirically compare the original (Eq. 14) and unsigned versions in Fig. 4(c). Both variants help in practice, while the unsigned version yields better results.
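A small numeric sketch shows the cancellation that Eq. (15) removes. The advantages below are made up, with one positive term standing in for a token the teacher prefers more than the student; `exclusive_prefix_sums` is a hypothetical helper, and $\alpha$ is set to 1 in the signed version.

```python
from itertools import accumulate

def exclusive_prefix_sums(values):
    """out[t] = sum(values[:t]), so position t only sees tokens before it."""
    if not values:
        return []
    return [0.0] + list(accumulate(values))[:-1]

# Made-up OPD advantages; the third token has a positive advantage.
advantages = [-0.5, -1.0, 0.8, -0.2]

# Eq. (14) with alpha = 1: signed accumulation can rebound upward.
signed = exclusive_prefix_sums(advantages)
# Eq. (15): unsigned accumulation never increases along the sequence.
unsigned = [-s for s in exclusive_prefix_sums([abs(a) for a in advantages])]

print("signed  :", [round(v, 2) for v in signed])
print("unsigned:", [round(v, 2) for v in unsigned])
```

The signed sequence rises again after the positive token, whereas the unsigned one keeps decreasing, which is the monotone behavior the later normalization step relies on.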

III. Normalization. Although sign correction ensures monotonicity of the importance weights as the sequence grows, the absolute scale can still vary significantly across samples. To mitigate this effect, we apply a simple within-sample standardization. In particular, for a sample whose importance weights $d_{t}=\tilde{r}^{\mathrm{abs}}_{t}$ lie in $[d_{T},0]$, where the maximum $0$ corresponds to the beginning of the sequence and the minimum $d_{T}$ corresponds to the end, we standardize the discrepancy to $[0,1]$ with

$$\tilde{r}^{\mathrm{norm}}_{t}=\frac{d_{t}-d_{\text{min}}}{d_{\text{max}}-d_{\text{min}}}=\frac{-\sum_{k<t}|A_{k}^{\mathrm{OPD}}|-\left(-\sum_{k<T}|A_{k}^{\mathrm{OPD}}|\right)}{0-\left(-\sum_{k<T}|A_{k}^{\mathrm{OPD}}|\right)}=1-\frac{\sum_{k<t}\left|A_{k}^{\mathrm{OPD}}\right|}{\sum_{k<T}\left|A_{k}^{\mathrm{OPD}}\right|}.\tag{16}$$

IV. Interpolation with OPD. Finally, after normalization, end-of-sequence tokens will have low weights. We perform an interpolation with the original OPD to balance these two effects:

$$\tilde{r}^{\mathrm{IW\text{-}OPD}}_{t}=1+\gamma\cdot\tilde{r}^{\mathrm{norm}}_{t}=1+\gamma\left(1-\frac{\sum_{k<t}\left|A_{k}^{\mathrm{OPD}}\right|}{\sum_{k<T}\left|A_{k}^{\mathrm{OPD}}\right|}\right),\tag{17}$$

where a higher $\gamma\geq 0$ indicates a higher contribution from the teacher-informed importance weights. In practice, we find that $\gamma=0.5$ is a good default choice (other parameters are given in Appendix C.2).
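Putting Eqs. (15) to (17) together, the practical weight for one rollout can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the zero-drift fallback is our own choice, and in an autodiff framework the weights would additionally be detached (the stop-gradient of Eq. 13).

```python
def iw_opd_weights(advantages, gamma=0.5):
    """Eq. (17): w_t = 1 + gamma * (1 - sum_{k<t}|A_k| / sum_{k<T}|A_k|)."""
    drift, running = [], 0.0
    for a in advantages:
        drift.append(running)  # accumulated |A_k| over tokens before position t
        running += abs(a)
    total = drift[-1] if drift else 0.0  # sum over k < T (all but the last token)
    if total == 0.0:
        # Assumption: with no measured drift, treat every token as fully on track.
        return [1.0 + gamma] * len(advantages)
    return [1.0 + gamma * (1.0 - d / total) for d in drift]

def iw_opd_advantages(advantages, gamma=0.5):
    """Eq. (13) with the practical weight: scale each A_t by its weight."""
    weights = iw_opd_weights(advantages, gamma)
    return [w * a for w, a in zip(weights, advantages)]

# Made-up OPD advantages for a four-token rollout.
advantages = [-0.5, -1.0, 0.8, -0.2]
weights = iw_opd_weights(advantages, gamma=0.5)
print([round(w, 3) for w in weights])
print([round(a, 3) for a in iw_opd_advantages(advantages, gamma=0.5)])
```

The first token always receives weight $1+\gamma$ and the last receives weight 1, so setting `gamma=0` recovers standard OPD. Only the log-probabilities already computed for OPD are needed, in line with the claim that no extra teacher evaluations are required.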

Table 1: Evaluation results with Qwen3-30B-A3B-Instruct-2507 as teacher (small student–teacher overlap). Math results are reported as mean@32 accuracy (%). Methods with subscript 10 are evaluated at training step 10. Bold indicates the best result within each student group. Values in parentheses are gains over the corresponding OPD row.

| Student | Method | AIME24 (Math) | AIME25 (Math) | HMMT25 (Math) | HE+ (Code) | MBPP+ (Code) | Avg |
|---|---|---|---|---|---|---|---|
| Teacher Model | | 74.7 | 62.8 | 44.2 | 86.6 | 75.1 | 68.7 |
| Qwen3-4B | Base | 23.1 | 21.4 | 10.0 | 75.3 | 64.5 | 38.9 |
| | OPD$_{10}$ | 51.2 | 42.4 | 23.5 | 76.2 | 66.7 | 52.0 |
| | IW-OPD$_{10}$ | 56.2 (+5.0) | 49.3 (+6.9) | 27.3 (+3.8) | 76.8 (+0.6) | 68.1 (+1.4) | 55.5 (+3.5) |
| | OPD | 55.3 | 48.0 | 27.1 | 77.2 | 69.1 | 55.3 |
| | IW-OPD | **57.5** (+2.2) | **49.7** (+1.7) | **28.7** (+1.6) | **78.7** (+1.5) | **70.9** (+1.8) | **57.1** (+1.8) |
| Qwen3-1.7B | Base | 13.4 | 11.0 | 6.8 | 59.6 | 52.5 | 28.7 |
| | OPD$_{10}$ | 30.5 | 20.2 | 14.4 | 53.0 | 52.6 | 34.1 |
| | IW-OPD$_{10}$ | 33.0 (+2.5) | 23.2 (+3.0) | 15.6 (+1.2) | 55.5 (+2.5) | 54.8 (+2.2) | 36.4 (+2.3) |
| | OPD | 34.6 | 28.7 | 15.5 | 64.6 | 53.7 | 39.4 |
| | IW-OPD | **35.5** (+0.9) | **29.5** (+0.8) | **16.4** (+0.9) | **65.2** (+0.6) | **55.0** (+1.3) | **40.3** (+0.9) |
| Qwen3-0.6B | Base | 1.5 | 3.4 | 1.3 | 28.2 | 28.4 | 12.6 |
| | OPD$_{10}$ | 6.2 | 14.1 | 6.5 | 26.9 | 23.1 | 15.4 |
| | IW-OPD$_{10}$ | 7.8 (+1.6) | 15.8 (+1.7) | 7.6 (+1.1) | 28.4 (+1.5) | 27.5 (+4.4) | 17.4 (+2.0) |
| | OPD | 11.0 | 17.8 | 7.1 | 29.6 | 28.7 | 18.8 |
| | IW-OPD | **11.5** (+0.5) | **19.3** (+1.5) | **8.0** (+0.9) | **32.5** (+2.9) | **31.9** (+3.2) | **20.2** (+1.4) |

Table 2: Evaluation results with Qwen3-4B-Instruct-2507 as teacher (large student–teacher overlap). Math results are reported as mean@32 accuracy (%). AIME24, AIME25, and HMMT25 are the math benchmarks; HE+ and MBPP+ are the code benchmarks. Methods with subscript 10 are evaluated at training step 10. Bold indicates the best result within each student group. Values in parentheses are the gains reported over the corresponding OPD row.

| Student | Method | AIME24 | AIME25 | HMMT25 | HE+ | MBPP+ | Avg |
|---|---|---|---|---|---|---|---|
| Teacher Model | | 60.4 | 46.7 | 31.0 | 82.5 | 71.3 | 58.4 |
| Qwen3-4B | Base | 23.1 | 21.4 | 10.0 | 75.3 | 64.5 | 38.9 |
| | OPD$_{10}$ | 56.3 | 45.7 | 23.6 | 76.0 | 66.1 | 54.0 |
| | IW-OPD$_{10}$ | 58.7 (+2.4) | 46.7 (+1.0) | 25.0 (+1.4) | 77.8 (+1.8) | 67.5 (+1.4) | 55.1 (+1.2) |
| | OPD | 56.5 | 46.3 | 24.4 | 76.3 | 67.8 | 54.3 |
| | IW-OPD | **58.7** (+2.2) | **46.7** (+0.4) | **25.0** (+0.6) | **77.9** (+1.6) | **68.2** (+0.4) | **55.3** (+1.0) |
| Qwen3-1.7B | Base | 13.4 | 11.0 | 6.8 | 59.6 | 52.5 | 28.7 |
| | OPD$_{10}$ | 33.4 | 24.7 | 11.3 | 61.1 | 53.4 | 36.8 |
| | IW-OPD$_{10}$ | 35.2 (+1.8) | 25.9 (+1.2) | 13.2 (+1.9) | 62.0 (+0.9) | 54.0 (+0.6) | 38.1 (+1.3) |
| | OPD | 34.0 | 26.4 | 13.7 | 61.5 | 53.7 | 37.9 |
| | IW-OPD | **35.2** (+1.2) | **27.1** (+0.7) | **15.3** (+1.6) | **62.8** (+1.3) | **54.9** (+1.2) | **39.1** (+1.1) |
| Qwen3-0.6B | Base | 1.5 | 3.4 | 1.3 | 28.2 | 28.4 | 12.6 |
| | OPD$_{10}$ | 11.1 | 17.1 | 6.9 | 26.8 | 31.0 | 18.6 |
| | IW-OPD$_{10}$ | 12.4 (+1.3) | **19.0** (+1.9) | 9.4 (+2.5) | 28.1 (+1.3) | 33.9 (+2.9) | 20.6 (+2.0) |
| | OPD | 11.8 | 17.1 | 7.7 | 29.8 | 33.3 | 20.0 |
| | IW-OPD | **13.6** (+1.8) | **19.0** (+1.9) | **9.6** (+0.9) | **31.6** (+1.8) | **35.7** (+2.4) | **21.9** (+1.9) |

Table 3: Evaluation results with Qwen3-235B-A22B-Instruct-2507 as teacher. Math results are reported as mean@32 accuracy (%). AIME24, AIME25, and HMMT25 are the math benchmarks; HE+ and MBPP+ are the code benchmarks. Bold indicates the best result within each student group. Values in parentheses are the gains reported over the OPD row.

| Student | Method | AIME24 | AIME25 | HMMT25 | HE+ | MBPP+ | Avg |
|---|---|---|---|---|---|---|---|
| Teacher Model | | 80.7 | 69.2 | 55.6 | 90.2 | 77.6 | 74.7 |
| Qwen3-30B-A3B | Base | 28.4 | 23.4 | 15.2 | 77.8 | 69.5 | 42.9 |
| | OPD | 69.5 | 56.7 | 38.4 | 82.1 | 71.3 | 63.6 |
| | IW-OPD | **70.8** (+1.3) | **58.9** (+2.2) | **40.5** (+2.1) | **83.5** (+1.4) | **73.7** (+2.4) | **65.5** (+1.9) |

5 Experiments

5.1 Setup

We evaluate IW-OPD in two teacher regimes and three student scales. The students are Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B. The first teacher is Qwen3-4B-Instruct-2507, which gives a larger overlap setting within the same model family. The second teacher is Qwen3-30B-A3B-Instruct-2507, which gives a smaller overlap setting and tests whether prefix weighting remains useful when the student often leaves the teacher’s preferred trajectory region.

Training uses DeepMath with difficulty at least 6, about 57K problems, for math, and Eurus-RL-Code, about 25K problems, for code. We evaluate math on AIME 2024, AIME 2025, and HMMT 2025. We evaluate code on HumanEval+ and MBPP+. The baselines are the student base model before distillation and standard OPD. All reported numbers are averaged over three random seeds.

5.2 Main Results

IW-OPD consistently improves OPD.

Tab. 1 and 2 report the main evaluation results. IW-OPD consistently improves over standard OPD across all evaluated teacher–student pairs and benchmarks. In the two full evaluation regimes with Qwen3-4B and Qwen3-30B-A3B teachers, IW-OPD improves the final reported average for every student scale. The additional experiment using Qwen3-235B-A22B-Instruct-2507 as the teacher and Qwen3-30B-A3B as the student in Tab. 3 further shows that the prefix-importance weighting remains effective in a much larger distillation setting, improving the math average by 1.9 points over OPD. Beyond vanilla OPD, Appendix D shows that IW-OPD can also be combined with other OPD variants that redesign the reward term, such as ExOPD [42].

IW-OPD improves sample efficiency.

The early checkpoints show that reweighting changes the efficiency of the update, not only the final endpoint. IW-OPD is ahead of OPD at step 10 in every reported teacher–student regime. The effect is especially clear in the Qwen3-30B-A3B → Qwen3-4B setting, where IW-OPD$_{10}$ improves the average score from 52.0 to 55.5 and AIME25 from 42.4 to 49.3. Notably, this step-10 checkpoint already matches the final OPD checkpoint in average performance. This supports the allocation view: by reducing the relative influence of drifted prefixes, IW-OPD spends more of each update on prefixes where teacher supervision can still redirect the student.

Table 4: Ablation results on AIME25. We isolate trajectory-adaptive prefix selection, unsigned discrepancy, and the effect of using the practical surrogate with a standard OPD blend.

IW-OPD improves efficiency as teachers scale up.

We next isolate the effect of teacher scaling by fixing the student. For the Qwen3-4B student, using the stronger Qwen3-30B-A3B teacher gives a better final OPD average than using the Qwen3-4B teacher, 55.3 versus 54.3. However, standard OPD is less sample-efficient with the stronger teacher in the early stage: at step 10, the student distilled from the 30B-A3B teacher reaches an average score of only 52.0, lower than the 54.0 obtained with the 4B teacher. This suggests that a stronger teacher can provide a better final target but also induces a larger teacher–student trajectory mismatch, so uniform OPD needs more updates before the student can effectively benefit from its supervision. IW-OPD alleviates this inefficiency. In the same Qwen3-4B student setting, IW-OPD$_{10}$ with the 30B-A3B teacher reaches an average score of 55.5, surpassing IW-OPD$_{10}$ with the 4B teacher at 55.1. On AIME25, the reversal is even clearer: standard OPD$_{10}$ with the 30B-A3B teacher is behind the 4B teacher, 42.4 versus 45.7, whereas IW-OPD$_{10}$ makes the 30B-A3B teacher outperform the 4B teacher, 49.3 versus 46.7. These results show that IW-OPD makes stronger teachers more sample-efficient by reallocating training signal toward teacher-compatible prefixes.

IW-OPD benefits more as students scale down.

We then isolate the effect of student scaling by fixing the teacher. With Qwen3-4B-Instruct-2507 as the teacher, the final average improvement of IW-OPD over OPD increases as the student becomes smaller: from +1.0 points for the 4B student, to +1.2 points for the 1.7B student, and to +1.9 points for the 0.6B student. In relative terms, these correspond to approximately +1.8%, +3.2%, and +9.5%, respectively. The same trend is also visible at step 10, where the relative gains grow from +2.0% to +3.5% and then to +10.8% as the student size decreases. These results indicate that IW-OPD is especially helpful in cross-scale distillation: when the student is much smaller than the teacher, student rollouts are more likely to drift away from the teacher-compatible region, and uniform OPD wastes more update budget on low-quality downstream supervision.
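The relative gains quoted above can be recomputed from the Avg column of Table 2. The short sketch below does only that arithmetic; the dictionary names and the helper `relative_gain` are illustrative and not part of the paper.

```python
from typing import Dict, Tuple

# (OPD average, IW-OPD average) with the Qwen3-4B-Instruct-2507 teacher,
# copied from the Avg column of Table 2.
FINAL: Dict[str, Tuple[float, float]] = {
    "Qwen3-4B": (54.3, 55.3),
    "Qwen3-1.7B": (37.9, 39.1),
    "Qwen3-0.6B": (20.0, 21.9),
}
STEP10: Dict[str, Tuple[float, float]] = {
    "Qwen3-4B": (54.0, 55.1),
    "Qwen3-1.7B": (36.8, 38.1),
    "Qwen3-0.6B": (18.6, 20.6),
}


def relative_gain(opd: float, iw_opd: float) -> float:
    """Improvement of IW-OPD over OPD as a percentage of the OPD score."""
    return 100.0 * (iw_opd - opd) / opd


def gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Relative gain per student, rounded to one decimal place."""
    return {name: round(relative_gain(*scores), 1) for name, scores in table.items()}


final_gains = gains(FINAL)
step10_gains = gains(STEP10)

for name in FINAL:
    print(f"{name}: final +{final_gains[name]}%, step 10 +{step10_gains[name]}%")
```

Because the OPD baseline shrinks faster than the absolute gain as the student gets smaller, the relative improvement grows more steeply than the absolute one.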

IW-OPD scales to stronger teachers and larger students.

The experiment with Qwen3-235B-A22B-Instruct-2507 as the teacher in Table 3 provides a large-scale stress test beyond the small-student regimes. Although standard OPD already gives a strong 30B student, IW-OPD still improves all reported math benchmarks, with gains of 1.3 points on AIME24, 2.2 points on AIME25, and 2.1 points on HMMT25. This indicates that prefix-importance weighting is not merely a remedy for weak students; it remains useful when distilling a very strong teacher into a capable student.

5.3 Ablations

Table 4 isolates three design choices: how token weights are assigned, how prefix discrepancy is measured, and whether the weighted term should be blended with standard OPD. We use Qwen3-0.6B as the student because this setting makes prefix selection most visible.

Adaptive prefix selection matters more than a fixed shape.

Amplifying a fixed-ratio (30%) prefix gives only $+0.4$, and a hand-designed position schedule gives $+1.5$. After unsigned correction, $\tilde{r}^{\mathrm{IW-OPD}}$ becomes a monotonically decreasing weight. Accordingly, we test a direct linear-decay variant, which gives only $+0.7$. The cumulative-share rule gives $+5.6$. This ordering separates the benefit of early token preference from the benefit of trajectory adaptivity. A smooth preference for earlier positions is not enough. The useful boundary between reliable and unreliable prefixes changes across rollouts, so the weight must follow each trajectory’s own discrepancy trace. Easy rollouts can keep high weights for longer, while hard rollouts should reduce downstream weights earlier.
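To make the contrast between a fixed shape and a trajectory-adaptive weight concrete, here is a minimal sketch. It assumes one plausible reading of a cumulative-share rule, namely that the weight at position $t$ is one minus the share of the rollout's total unsigned discrepancy accumulated before $t$. This form, the function names, and the toy discrepancy values are assumptions for illustration, not the exact definition used in the paper.

```python
from itertools import accumulate
from typing import List


def cumulative_share_weights(disc: List[float]) -> List[float]:
    """Assumed form: w_t = 1 - (discrepancy accumulated before t) / (total discrepancy)."""
    total = sum(disc)
    if total == 0.0:
        # No disagreement anywhere: every position keeps full weight.
        return [1.0] * len(disc)
    before = [0.0] + list(accumulate(disc))[:-1]
    return [1.0 - b / total for b in before]


def linear_decay_weights(n: int) -> List[float]:
    """Fixed shape: depends only on the position, not on the rollout."""
    return [1.0 - t / n for t in range(n)]


# Hypothetical per-token unsigned discrepancies for two rollouts of equal length.
easy = [0.0, 0.0, 0.1, 0.1, 0.2, 1.6]  # stays close to the teacher until late
hard = [1.2, 0.4, 0.2, 0.1, 0.1, 0.0]  # drifts within the first tokens

easy_w = cumulative_share_weights(easy)
hard_w = cumulative_share_weights(hard)
fixed_w = linear_decay_weights(len(easy))

print("easy :", [round(w, 2) for w in easy_w])
print("hard :", [round(w, 2) for w in hard_w])
print("fixed:", [round(w, 2) for w in fixed_w])
```

Under this assumed rule both adaptive weight sequences are non-increasing, but the easy rollout keeps weights near one for most positions while the hard rollout drops after the first token. The linear schedule cannot distinguish the two rollouts.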

Unsigned discrepancy is the better prefix compatibility proxy.

Using signed accumulation gives $+2.6$, while the unsigned version gives $+5.6$. Signed terms can cancel even when the prefix has passed through several model disagreements. This cancellation makes a drifted prefix appear compatible with the teacher. The absolute statistic treats each disagreement as evidence that the student has moved away from the shared prefix region.
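The cancellation effect is easy to reproduce with a handful of numbers. The sketch below accumulates hypothetical per-token log-ratios $\log \pi_T(y_t \mid h_t) - \log \pi_\theta(y_t \mid h_t)$ in signed and unsigned form; the values are made up for illustration.

```python
from itertools import accumulate
from typing import List

# Hypothetical per-token log-ratios along one student rollout.
# Large positive and negative values alternate, so the models disagree often.
log_ratios: List[float] = [0.1, 2.0, -2.1, 1.5, -1.4, 0.0]

# Signed accumulation: disagreements of opposite sign cancel.
signed = list(accumulate(log_ratios))

# Unsigned accumulation: every disagreement adds to the running total.
unsigned = list(accumulate(abs(v) for v in log_ratios))

print("signed  :", [round(s, 2) for s in signed])
print("unsigned:", [round(u, 2) for u in unsigned])
```

The signed total ends close to zero, which would suggest a prefix the teacher agrees with, while the unsigned total keeps growing and records that the rollout passed through four large disagreements.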

The surrogate should allocate extra budget rather than replace OPD.

The ideal likelihood-ratio weight is useful as a derivation target but not as a literal training rule. Using it alone collapses performance, and blending it with OPD gives only a small gain. This matches the instability observed in Figure 4: the raw ratio has high variance and is sensitive to signed cancellations along long trajectories. The practical surrogate is more robust because it preserves the desired ordering of prefixes while removing the unstable product scale. However, using the surrogate alone is still weaker than blending it with OPD. The blend keeps the standard dense OPD signal as a floor and allocates additional update budget to compatible prefixes, which is the intended role of IW-OPD.
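The difference between the literal product-scale weight and a bounded surrogate blended with OPD can be sketched as follows. The additive blend `1 + beta * surrogate`, the coefficient `beta`, and all numeric values are assumptions chosen for illustration; the paper's exact blending rule is defined in the main text and may differ.

```python
import math
from itertools import accumulate
from typing import List


def raw_ratio_weights(log_ratios: List[float]) -> List[float]:
    """Literal prefix likelihood ratio: product of per-token ratios up to position t."""
    return [math.exp(s) for s in accumulate(log_ratios)]


def blended_weights(surrogate: List[float], beta: float = 1.0) -> List[float]:
    """Assumed blend: the uniform OPD weight (1.0) plus extra budget beta * surrogate_t."""
    return [1.0 + beta * s for s in surrogate]


# Hypothetical per-token log-ratios and a hypothetical bounded surrogate in [0, 1].
log_ratios = [-0.5, -3.0, 2.5, -4.0, -1.0]
surrogate = [1.0, 0.9, 0.5, 0.2, 0.0]

raw = raw_ratio_weights(log_ratios)
blend = blended_weights(surrogate, beta=1.0)

print("raw ratio weights:", [f"{w:.4f}" for w in raw])
print("blended weights  :", [round(w, 2) for w in blend])
print("raw max/min      :", round(max(raw) / min(raw), 1))
```

In this toy case the raw weights span more than two orders of magnitude over five tokens, so a few positions would dominate the update. The blended weights stay between 1 and 1 + `beta`, so every token keeps at least the standard OPD signal.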

6 Related Work

On-policy and token-selective distillation. Classical KD [12] trains on teacher-generated data and can suffer from exposure bias [3, 2, 27]. OPD supervises student-sampled rollouts: GKD [1] unifies on/off-policy data through $f$-divergences, MiniLLM [10] optimizes reverse KL with policy-gradient estimators, and recent OPD work studies on-policy teacher supervision and its extensions [17, 31, 42, 44]. Selective distillation further asks where supervision should be applied, using sequence-level curricula [36, 23, 39] or token-level weights based on frequency, difficulty, teacher confidence, and student learning state [9, 18, 14, 38, 16]. IW-OPD follows this token-selective view and uses on-policy prefix compatibility as the weighting signal.

Credit assignment and reweighted policy updates. RLVR pipelines [7, 30] must assign sparse outcome rewards over long reasoning traces. Process-supervision and process-reward methods provide denser step-level feedback [19, 34, 6], and recent token-level analyses identify high-entropy forking tokens, critical tokens, and reasoning rather than boilerplate tokens as disproportionate drivers of learning [35, 4, 20, 33, 43, 11]. IW-OPD allocates dense teacher supervision by prefix compatibility. Its constrained-projection view connects to trust-region and proximal policy updates [15, 28, 29], sequence-level importance correction in GSPO and online DPO [46, 26, 37], and geometric interpolation/Rényi midpoints [32, 45].

7 Discussion and Conclusion

The key insight of this work is that OPD supervision is not uniformly reliable along a student-generated trajectory. This creates a position bias: teacher supervision is often more useful near the beginning of the rollout than near the end. Such bias is not unique to OPD. It may also appear in many methods involving two autoregressive sequence models, such as on-policy RL where samples are drawn from the current policy but updates move toward a new policy. The issue is especially visible in OPD since the teacher–student distribution gap is not known in advance, so we cannot predefine where their trajectories remain compatible.

In conclusion, we identify position bias as a key inefficiency in On-Policy Distillation and explain it through a finite-budget local projection view. Motivated by this analysis, IW-OPD reallocates additional gradient budget toward teacher-compatible prefixes using a stable cumulative prefix-discrepancy weight, while keeping standard OPD as the dense supervision floor. Experiments across same-family, cross-scale, and stronger-teacher settings show that this simple modification improves both sample efficiency and final performance. More broadly, our results suggest that effective on-policy supervision should account not only for token-level disagreement, but also for the trajectory context (i.e., the prefix) during which that disagreement occurs.

References

[1] (2024) On-policy distillation of language models: learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations. Cited by: §1, §2, §6.
[2] K. Arora, L. El Asri, H. Bahuleyan, and J. C. K. Cheung (2022) Why exposure bias matters: an imitation learning perspective of error accumulation in language generation. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 700–710. Cited by: §6.
[3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems. Cited by: §6.
[4] E. J. Bigelow, A. Holtzman, H. Tanaka, and T. Ullman (2025) Forking paths in neural text generation. In International Conference on Learning Representations. arXiv:2412.07961. Cited by: §3.1, §6.
[5] Z. Cai, M. Cao, H. Chen, K. Chen, K. Chen, X. Chen, X. Chen, Z. Chen, Z. Chen, P. Chu, et al. (2024) Internlm2 technical report. arXiv preprint arXiv:2403.17297. Cited by: §2.
[6] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Peng, J. Chen, N. Chen, Z. Liu, and M. Sun (2025) Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. Cited by: §1, §6.
[7] DeepSeek-AI (2025) DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §2, §6.
[8] H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, and T. Zhang (2024) RLHF workflow: from reward modeling to online rlhf. arXiv preprint arXiv:2405.07863. Cited by: §2.
[9] S. Gu, J. Zhang, F. Meng, Y. Feng, W. Xie, J. Zhou, and D. Yu (2020) Token-level adaptive training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1035–1046. Cited by: §6.
[10] Y. Gu, L. Dong, F. Wei, and M. Huang (2024) MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations. Cited by: §1, §2, §6.
[11] Y. He, H. Wu, S. Liu, H. Ge, H. Zhou, K. Wu, Z. Zheng, Q. Lin, Z. Zhong, and Y. Zhang (2026) Rethinking token-level credit assignment in RLVR: a polarity-entropy analysis. arXiv preprint arXiv:2604.11056. Cited by: §1, §6.
[12] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. Cited by: §6.
[13] J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H. Shum (2025) Open-reasoner-zero: an open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290. Cited by: §2.
[14] H. Huang, J. Song, Y. Zhang, and P. Ren (2025) SelecTKD: selective token-weighted knowledge distillation for LLMs. arXiv:2510.24021. Cited by: §6.
[15] S. M. Kakade (2001) A natural policy gradient. In Advances in Neural Information Processing Systems, Vol. 14, pp. 1531–1538. Cited by: §6.
[16] M. Kim and S. J. Baek (2026) Explain in your own words: improving reasoning via token-selective dual knowledge distillation. In The Fourteenth International Conference on Learning Representations. Cited by: §6.
[17] Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, and N. Ding (2026) Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Cited by: §6.
[18] C. Liang, H. Jiang, X. Liu, P. He, W. Chen, J. Gao, and T. Zhao (2021) Token-wise curriculum learning for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, pp. 3658–3670. Cited by: §6.
[19] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In International Conference on Learning Representations. Cited by: §6.
[20] Z. Lin, T. Liang, J. Xu, Q. Lin, X. Wang, R. Luo, C. Shi, S. Li, Y. Yang, and Z. Tu (2025) Critical tokens matter: token-level contrastive estimation enhances LLM’s reasoning capability. In International Conference on Machine Learning. arXiv:2411.19943. Cited by: §3.1, §6.
[21] C. Y. Liu, L. Zeng, Y. Xiao, J. He, J. Liu, C. Wang, R. Yan, W. Shen, F. Zhang, J. Xu, et al. (2025) Skywork-reward-v2: scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352. Cited by: §2.
[22] J. Liu and L. Zhang (2025) Code-r1: reproducing r1 for code with reliable rewards. Note: https://github.com/ganler/code-r1. Cited by: §2.
[23] L. Liu and M. Zhang (2025) Being strong progressively! enhancing knowledge distillation of large language models through a curriculum learning framework. arXiv preprint arXiv:2506.05695. Cited by: §6.
[24] LLM-Core, Xiaomi (2026) MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780. Cited by: §1, §2, §2.
[25] K. Lu (2025) On-policy distillation. Note: Thinking Machines Lab Blog. Cited by: §1, §2, §2.
[26] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems, Vol. 36. Cited by: §6.
[27] S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §6.
[28] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §6.
[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §6.
[30] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y.K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §2, §6.
[31] M. Song and M. Zheng (2026) A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626. Cited by: §6.
[32] T. van Erven and P. Harremoës (2014) Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory 60(7), pp. 3797–3820. Cited by: §6.
[33] J. Vassoyan, N. Beau, and R. Plaud (2025) Ignore the KL penalty! boosting exploration on critical tokens to enhance RL fine-tuning. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 6123–6133. arXiv:2502.06533. Cited by: §3.1, §6.
[34] P. Wang, L. Li, Z. Shao, R.X. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024) Math-shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Cited by: §6.
[35] S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin (2025) Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for llm reasoning. In Advances in Neural Information Processing Systems. arXiv:2506.01939. Cited by: §3.1, §6.
[36] Y. Wen, Z. Li, W. Du, and L. Mou (2023) F-divergence minimization for sequence-level knowledge distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 10817–10834. Cited by: §6.
[37] T. Xie, D. J. Foster, A. Krishnamurthy, C. Rosset, A. Awadallah, and A. Rakhlin (2024) Exploratory preference optimization: harnessing implicit Q*-approximation for sample-efficient RLHF. arXiv preprint arXiv:2405.21046. Cited by: §6.
[38] X. Xie, Z. Xue, J. Wu, J. Li, Y. Wang, X. Hu, Y. Liu, and J. Zhang (2025) LLM-oriented token-adaptive knowledge distillation. arXiv:2510.11615. Cited by: §6.
[39] Y. Xu, H. Sang, Z. Zhou, R. He, and Z. Wang (2026) PACED: distillation and on-policy self-distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178. Cited by: §6.
[40] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: §1, §2.
[41] W. Yang, J. Chen, Y. Lin, and J. Wen (2025) Deepcritic: deliberate critique with large language models. arXiv preprint arXiv:2505.00662. Cited by: §2.
[42] W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin (2026) Learning beyond teacher: generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125. Cited by: Appendix D, §5.2, §6.
[43] Z. Ye, Z. Zhang, Y. Zhang, J. Ma, J. Lin, and F. Feng (2025) Disentangling reasoning tokens and boilerplate tokens for language model fine-tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 20939–20957. Cited by: §3.1, §6.
[44] S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026) Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Cited by: §6.
[45] Y. Zhao, Y. Liu, J. Liu, J. Chen, X. Wu, Y. Hao, T. Lv, S. Huang, L. Cui, Q. Ye, F. Wan, and F. Wei (2026) Geometric-mean policy optimization. In International Conference on Learning Representations. Cited by: §6.
[46] C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, J. Zhou, and J. Lin (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Cited by: §6.

Appendix A Proofs and Derivations

The derivations below fix a prompt $x$ unless otherwise stated and omit the conditioning on $x$ for notational simplicity. We write $\pi_{T}$ for the teacher next-token policy and also for its autoregressive trajectory distribution, as in the main text. The local projected distribution is denoted by $q_{\theta}^{\star}$, and the corresponding causal prefix weight is $r_{\theta}$. All distributions are understood to be supported on the common support where the relevant KL divergences are finite.

A.1 Solution of the Constrained Projection

We prove Proposition 1. The constrained projection is

$$
q_{\theta}^{\star}=\arg\min_{q}D_{\mathrm{KL}}(q\|\pi_{T})\quad\mathrm{s.t.}\quad D_{\mathrm{KL}}(q\|\pi_{\theta})\leq\rho, \tag{18}
$$

together with the normalization constraint $\sum_{y}q(y)=1$.

If the trust-region constraint were inactive, the solution would be the unconstrained minimizer $q=\pi_{T}$. This is infeasible when $0<\rho<D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$, so the constraint must be active in the local-update regime. Since $D_{\mathrm{KL}}(q\|\pi_{T})$ is strictly convex in $q$ on the common support and the KL ball is convex, the KKT conditions identify the unique optimum.

Introduce the Lagrangian

$$
\mathcal{L}(q,\lambda,\mu)=D_{\mathrm{KL}}(q\|\pi_{T})+\lambda\left(D_{\mathrm{KL}}(q\|\pi_{\theta})-\rho\right)+\mu\left(\sum_{y}q(y)-1\right), \tag{19}
$$

where $\lambda\geq 0$ is the multiplier for the trust-region constraint and $\mu$ is the multiplier for normalization. Expanding the KL terms gives

$$
\mathcal{L}(q,\lambda,\mu)=\sum_{y}q(y)\log\frac{q(y)}{\pi_{T}(y)}+\lambda\sum_{y}q(y)\log\frac{q(y)}{\pi_{\theta}(y)}-\lambda\rho+\mu\left(\sum_{y}q(y)-1\right). \tag{20}
$$

Taking the functional derivative with respect to $q(y)$ and setting it to zero yields

$$
\begin{aligned}
0&=\frac{\partial\mathcal{L}}{\partial q(y)}\\
&=\log q(y)-\log\pi_{T}(y)+1+\lambda\left(\log q(y)-\log\pi_{\theta}(y)+1\right)+\mu.
\end{aligned} \tag{21}
$$

Rearranging terms,

$$
(1+\lambda)\log q(y)=\log\pi_{T}(y)+\lambda\log\pi_{\theta}(y)+\mathrm{const}, \tag{22}
$$

where the constant absorbs $1+\lambda+\mu$ and is independent of $y$. Therefore,

$$
\log q(y)=\frac{1}{1+\lambda}\log\pi_{T}(y)+\frac{\lambda}{1+\lambda}\log\pi_{\theta}(y)+\mathrm{const}. \tag{23}
$$

Define

$$
\alpha\coloneqq\frac{1}{1+\lambda}\in(0,1),\qquad 1-\alpha=\frac{\lambda}{1+\lambda}. \tag{24}
$$

Exponentiating and normalizing gives

$$
q_{\alpha}(y)=\frac{\pi_{\theta}(y)^{1-\alpha}\pi_{T}(y)^{\alpha}}{Z_{\alpha}(\theta)},\qquad Z_{\alpha}(\theta)=\sum_{y}\pi_{\theta}(y)^{1-\alpha}\pi_{T}(y)^{\alpha}. \tag{25}
$$

Equivalently, with

$$
r_{\theta}(y)=\frac{\pi_{T}(y)}{\pi_{\theta}(y)},\qquad Z_{\alpha}(\theta)=\mathbb{E}_{y\sim\pi_{\theta}}\left[r_{\theta}(y)^{\alpha}\right], \tag{26}
$$

we can write

$$
q_{\alpha}(y)=\frac{\pi_{\theta}(y)\,r_{\theta}(y)^{\alpha}}{Z_{\alpha}(\theta)}. \tag{27}
$$

It remains to identify the value of $\alpha$ induced by the radius $\rho$. Let

$$
\psi(\alpha)\coloneqq\log Z_{\alpha}(\theta). \tag{28}
$$

Then

$$
\psi^{\prime}(\alpha)=\mathbb{E}_{q_{\alpha}}\left[\log r_{\theta}(y)\right]. \tag{29}
$$

Therefore,

$$
\begin{aligned}
D_{\mathrm{KL}}(q_{\alpha}\|\pi_{\theta})&=\mathbb{E}_{q_{\alpha}}\left[\log\frac{q_{\alpha}(y)}{\pi_{\theta}(y)}\right]\\
&=\mathbb{E}_{q_{\alpha}}\left[\alpha\log r_{\theta}(y)-\psi(\alpha)\right]\\
&=\alpha\psi^{\prime}(\alpha)-\psi(\alpha).
\end{aligned} \tag{30}
$$

Since the constraint is active, $\alpha$ is determined by

$$
\rho=D_{\mathrm{KL}}(q_{\alpha}\|\pi_{\theta})=\alpha\frac{\partial}{\partial\alpha}\log Z_{\alpha}(\theta)-\log Z_{\alpha}(\theta). \tag{31}
$$

This is the implicit relation for $\alpha$. Also,

$$
\begin{aligned}
\frac{d}{d\alpha}D_{\mathrm{KL}}(q_{\alpha}\|\pi_{\theta})&=\alpha\psi^{\prime\prime}(\alpha)\\
&=\alpha\,\mathrm{Var}_{q_{\alpha}}\left[\log r_{\theta}(y)\right]\geq 0.
\end{aligned} \tag{32}
$$

Thus increasing $\rho$ increases the corresponding interpolation coefficient $\alpha$ whenever $\pi_{T}\neq\pi_{\theta}$. Finally, $D_{\mathrm{KL}}(q_{0}\|\pi_{\theta})=0$ and $q_{1}=\pi_{T}$, so $D_{\mathrm{KL}}(q_{1}\|\pi_{\theta})=D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$. Hence each $0<\rho<D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$ induces $\alpha\in(0,1)$ and

$$
q_{\theta}^{\star}(y)=q_{\alpha}(y)=\frac{\pi_{\theta}(y)\,r_{\theta}(y)^{\alpha}}{Z_{\alpha}(\theta)}. \tag{33}
$$

This is Eq. (8).

When the budget is large enough that the teacher itself is feasible, the constraint can be inactive,λ=0\lambda=0, andα=1\alpha=1, which recoversqθ⋆=πTq_{\theta}^{\star}=\pi_{T}.□\square
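The implicit relation in Eq. (31) has no closed form, but because Eq. (32) shows the KL is nondecreasing in $\alpha$, it can be solved by bisection. The following is a minimal sketch on a toy three-outcome distribution; the function names, the toy probabilities, and the value of `rho` are illustrative assumptions and do not come from the paper.

```python
import math

def tilted(pi_s, pi_t, alpha):
    """Geometric interpolation q_alpha(y) proportional to
    pi_s(y)^(1-alpha) * pi_t(y)^alpha, as in Eq. (25)."""
    w = [ps ** (1.0 - alpha) * pt ** alpha for ps, pt in zip(pi_s, pi_t)]
    z = sum(w)  # the normalizer Z_alpha
    return [wi / z for wi in w]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def alpha_for_radius(pi_s, pi_t, rho, iters=60):
    """Solve KL(q_alpha || pi_s) = rho for alpha by bisection on [0, 1]."""
    if rho >= kl(pi_t, pi_s):
        return 1.0  # teacher is feasible, so the constraint is inactive
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilted(pi_s, pi_t, mid), pi_s) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy values chosen for illustration only.
pi_student = [0.70, 0.20, 0.10]
pi_teacher = [0.10, 0.30, 0.60]
rho = 0.05

alpha = alpha_for_radius(pi_student, pi_teacher, rho)
q_star = tilted(pi_student, pi_teacher, alpha)
print(alpha, kl(q_star, pi_student))
```

A larger `rho` yields a larger `alpha`, and any `rho` at or above $D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$ returns `1.0`, matching the inactive-constraint case.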

A.2 Derivation of the Importance-Weighted OPD Objective

We prove Proposition 2. Fix a prompt $x$ and let $h_{t}=(x,y_{<t})$. We keep the conditioning on $x$ explicit in this appendix. We work in the non-trivial local-update regime of Proposition 1, where $0<\rho<D_{\mathrm{KL}}(\pi_{T}\|\pi_{\theta})$ and the trust-region constraint is active:

$$D_{\mathrm{KL}}\left(q_{\theta}^{\star}\middle\|\pi_{\theta}\right)=\rho. \tag{34}$$

The projected objective in Eq. (9) is

$$\mathcal{J}_{q_{\theta}^{\star}}(\theta;x)=\max_{q_{\theta}^{\star}}-D_{\mathrm{KL}}\left(q_{\theta}^{\star}\middle\|\pi_{T}\right). \tag{35}$$

By adding and subtracting $\log\pi_{\theta}(y\mid x)$ inside the KL, we obtain

$$\begin{aligned}
D_{\mathrm{KL}}\left(q_{\theta}^{\star}\middle\|\pi_{T}\right)
&=\mathbb{E}_{y\sim q_{\theta}^{\star}(\cdot\mid x)}\left[\log\frac{q_{\theta}^{\star}(y\mid x)}{\pi_{T}(y\mid x)}\right]\\
&=\mathbb{E}_{y\sim q_{\theta}^{\star}(\cdot\mid x)}\left[\log\frac{q_{\theta}^{\star}(y\mid x)}{\pi_{\theta}(y\mid x)}\right]+\mathbb{E}_{y\sim q_{\theta}^{\star}(\cdot\mid x)}\left[\log\frac{\pi_{\theta}(y\mid x)}{\pi_{T}(y\mid x)}\right]\\
&=\rho+\mathbb{E}_{y\sim q_{\theta}^{\star}(\cdot\mid x)}\left[\log\frac{\pi_{\theta}(y\mid x)}{\pi_{T}(y\mid x)}\right].
\end{aligned} \tag{36}$$

Thus, under a fixed local-update budget $\rho$, minimizing the projected KL is equivalent up to the constant $\rho$ to minimizing the second term in Eq. (36).

Following Eq. (8), write the trajectory-level likelihood ratio as

$$r_{\theta}(y\mid x)=\frac{\pi_{T}(y\mid x)}{\pi_{\theta}(y\mid x)},\qquad Z_{\alpha}(\theta,x)=\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[r_{\theta}(y\mid x)^{\alpha}\right]. \tag{37}$$

Then the optimal projected policy can be written as

$$q_{\theta}^{\star}(y\mid x)=\frac{\pi_{\theta}(y\mid x)\,r_{\theta}(y\mid x)^{\alpha}}{Z_{\alpha}(\theta,x)},\qquad\alpha\in(0,1). \tag{38}$$

Therefore,

$$\frac{q_{\theta}^{\star}(y\mid x)}{\pi_{\theta}(y\mid x)}=\frac{r_{\theta}(y\mid x)^{\alpha}}{Z_{\alpha}(\theta,x)}. \tag{39}$$

For any measurable function $f$, this gives the change-of-measure identity

$$\mathbb{E}_{y\sim q_{\theta}^{\star}(\cdot\mid x)}[f(y)]=\frac{\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[r_{\theta}(y\mid x)^{\alpha}f(y)\right]}{Z_{\alpha}(\theta,x)}. \tag{40}$$

By the autoregressive factorization,

$$\begin{aligned}
\log\frac{\pi_{\theta}(y\mid x)}{\pi_{T}(y\mid x)}
&=\log\frac{\prod_{t=1}^{T}\pi_{\theta}(y_{t}\mid h_{t})}{\prod_{t=1}^{T}\pi_{T}(y_{t}\mid h_{t})}\\
&=\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}.
\end{aligned} \tag{41}$$

Substituting Eq. (41) into Eq. (40) gives the exact trajectory-level importance-weighted form of the non-constant part of Eq. (36):

$$\widetilde{\mathcal{J}}_{q_{\theta}^{\star}}(\theta;x)\coloneqq-\left(D_{\mathrm{KL}}\left(q_{\theta}^{\star}\middle\|\pi_{T}\right)-\rho\right)=-\frac{\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[r_{\theta}(y\mid x)^{\alpha}\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}\right]}{Z_{\alpha}(\theta,x)}. \tag{42}$$

Eq. (42) is a sequence-level expression: every token term in the same sampled trajectory is multiplied by the full trajectory ratio $r_{\theta}(y\mid x)^{\alpha}$. However, OPD is optimized through a token-local semi-gradient on student-sampled prefixes. Hence the coefficient assigned to the token term at position $t$ should depend only on the causal prefix $h_{t}=(x,y_{<t})$, rather than on the future suffix $y_{\geq t}$.

To obtain the causal token-level surrogate, decompose the trajectory ratio as

$$\begin{aligned}
r_{\theta}(y\mid x)^{\alpha}
&=\left(\frac{\pi_{T}(y\mid x)}{\pi_{\theta}(y\mid x)}\right)^{\alpha}\\
&=\left(\frac{\pi_{T}(y_{<t}\mid x)}{\pi_{\theta}(y_{<t}\mid x)}\right)^{\alpha}\left(\frac{\pi_{T}(y_{\geq t}\mid x,y_{<t})}{\pi_{\theta}(y_{\geq t}\mid x,y_{<t})}\right)^{\alpha}.
\end{aligned} \tag{43}$$

The first factor is the prefix likelihood ratio inherited from the trajectory-level ratio in Proposition 1. We write

$$\begin{aligned}
r_{t}(y_{<t}\mid x)
&\coloneqq r_{\theta}(y_{<t}\mid x)=\frac{\pi_{T}(y_{<t}\mid x)}{\pi_{\theta}(y_{<t}\mid x)}\\
&=\prod_{k=1}^{t-1}\frac{\pi_{T}(y_{k}\mid x,y_{<k})}{\pi_{\theta}(y_{k}\mid x,y_{<k})}.
\end{aligned} \tag{44}$$

The empty product is $1$, so $r_{1}=1$. The position-wise normalizer is

$$Z_{\alpha,t}(\theta,x)\coloneqq\mathbb{E}_{y_{<t}\sim\pi_{\theta}(\cdot\mid x)}\left[r_{t}(y_{<t}\mid x)^{\alpha}\right], \tag{45}$$

and the normalized prefix ratio is

$$\tilde{r}_{t}(y_{<t}\mid x)\coloneqq\frac{r_{t}(y_{<t}\mid x)^{\alpha}}{Z_{\alpha,t}(\theta,x)}. \tag{46}$$

Replacing the full trajectory ratio in Eq. (42) with its causal prefix component yields the token-level IW-OPD surrogate for a fixed prompt:

$$\mathcal{J}_{\mathrm{IW}}^{\star}(\theta;x)=\max_{\theta}\;-\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\mathrm{sg}\left[\tilde{r}_{t}(y_{<t}\mid x)\right]\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}\right]. \tag{47}$$

Here $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator. It makes the normalized prefix ratio act as a detached multiplicative coefficient on the standard token-level log-ratio term.

Averaging Eq. (47) over $x\sim\mathcal{D}$ gives

$$\mathcal{J}_{\mathrm{IW}}^{\star}(\theta)=\max_{\theta}\;-\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\mathrm{sg}\left[\tilde{r}_{t}(y_{<t}\mid x)\right]\log\frac{\pi_{\theta}(y_{t}\mid x,y_{<t})}{\pi_{T}(y_{t}\mid x,y_{<t})}\right]. \tag{48}$$

Suppressing the fixed prompt $x$ in the notation recovers Eq. (11). $\square$
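To make Eqs. (44)-(46) concrete, the sketch below computes normalized prefix ratios from per-token log-probabilities. It rests on two assumptions that are not in the paper: the normalizer $Z_{\alpha,t}$ is estimated by averaging over several responses sampled for the same prompt, and all responses share one length. The function name and the numbers are hypothetical; the practical weights used in Algorithm 1 are a different proxy.

```python
import math

def normalized_prefix_ratios(logp_teacher, logp_student, alpha):
    """Return r~_t for each response and position.

    logp_teacher, logp_student: lists of N responses, each a list of T
    per-token log-probabilities of the sampled tokens.
    """
    n, T = len(logp_teacher), len(logp_teacher[0])
    r_alpha = []
    for lt, ls in zip(logp_teacher, logp_student):
        log_r, row = 0.0, []
        for t in range(T):
            row.append(math.exp(alpha * log_r))  # r_t^alpha sees tokens k < t only
            log_r += lt[t] - ls[t]               # extend the prefix by token t
        r_alpha.append(row)
    # Monte Carlo estimate of Z_{alpha,t}: batch mean at each position (assumption).
    z = [sum(row[t] for row in r_alpha) / n for t in range(T)]
    return [[row[t] / z[t] for t in range(T)] for row in r_alpha]

# Two toy responses of length 3 (illustrative numbers only).
lt = [[-1.0, -2.0, -0.5], [-0.3, -0.7, -1.1]]
ls = [[-0.5, -1.0, -0.5], [-0.6, -0.2, -0.9]]
w = normalized_prefix_ratios(lt, ls, alpha=0.5)
print(w)
```

The first position always receives weight 1 because the prefix is empty, and a response whose prefix the teacher finds more likely receives a larger weight at the next position.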

A.3 Standard OPD Chain Rule and Single-Step Semi-Gradient

We spell out the chain-rule decomposition of standard OPD and the single-step semi-gradient used by Eq. (12). Fix a prompt $x$ and write $h_{t}=(x,y_{<t})$. We keep the conditioning on $x$ explicit in this appendix, while the main text suppresses it when no ambiguity arises.

The per-prompt reverse-KL quantity minimized by standard OPD is

$$J_{\mathrm{OPD}}(\theta;x)=\max_{\theta}-D_{\mathrm{KL}}\left(\pi_{\theta}\middle\|\pi_{T}\right)=-\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[\log\frac{\pi_{\theta}(y\mid x)}{\pi_{T}(y\mid x)}\right]. \tag{49}$$

By the autoregressive factorization,

$$\log\frac{\pi_{\theta}(y\mid x)}{\pi_{T}(y\mid x)}=\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}. \tag{50}$$

Therefore,

$$\begin{aligned}
D_{\mathrm{KL}}\left(\pi_{\theta}\middle\|\pi_{T}\right)
&=\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}\right]\\
&=\sum_{t=1}^{T}\mathbb{E}_{y_{<t}\sim\pi_{\theta}(\cdot\mid x)}\left[\mathbb{E}_{y_{t}\sim\pi_{\theta}(\cdot\mid h_{t})}\left[\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}\right]\right]\\
&=\sum_{t=1}^{T}\mathbb{E}_{y_{<t}\sim\pi_{\theta}(\cdot\mid x)}\left[D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot\mid h_{t})\middle\|\pi_{T}(\cdot\mid h_{t})\right)\right].
\end{aligned} \tag{51}$$

Next consider one local next-token KL at a fixed prefix $h_{t}$:

$$D_{t}(\theta;h_{t})=D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot\mid h_{t})\middle\|\pi_{T}(\cdot\mid h_{t})\right). \tag{52}$$

Expanding over the next-token vocabulary gives

$$D_{t}(\theta;h_{t})=\sum_{a}\pi_{\theta}(a\mid h_{t})\left[\log\pi_{\theta}(a\mid h_{t})-\log\pi_{T}(a\mid h_{t})\right]. \tag{53}$$

Differentiating this local KL while holding the sampled prefix $h_{t}$ fixed yields

$$\begin{aligned}
\nabla_{\theta}D_{t}(\theta;h_{t})
&=\sum_{a}\nabla_{\theta}\pi_{\theta}(a\mid h_{t})\left[\log\pi_{\theta}(a\mid h_{t})-\log\pi_{T}(a\mid h_{t})+1\right]\\
&=\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid h_{t})}\left[\left(\log\frac{\pi_{\theta}(a\mid h_{t})}{\pi_{T}(a\mid h_{t})}+1\right)\nabla_{\theta}\log\pi_{\theta}(a\mid h_{t})\right].
\end{aligned} \tag{54}$$

The $+1$ term vanishes because

$$\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid h_{t})}\left[\nabla_{\theta}\log\pi_{\theta}(a\mid h_{t})\right]=\nabla_{\theta}\sum_{a}\pi_{\theta}(a\mid h_{t})=0. \tag{55}$$

Thus,

$$\nabla_{\theta}D_{t}(\theta;h_{t})=\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid h_{t})}\left[\log\frac{\pi_{\theta}(a\mid h_{t})}{\pi_{T}(a\mid h_{t})}\nabla_{\theta}\log\pi_{\theta}(a\mid h_{t})\right]. \tag{56}$$

Using the OPD advantage notation in Eq. (6), for a sampled token $a=y_{t}$ we have

$$A_{t}^{\mathrm{OPD}}:=-\left(\log\pi_{\theta}(y_{t}\mid h_{t})-\log\pi_{T}(y_{t}\mid h_{t})\right)=-\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}. \tag{57}$$

Therefore, the negative local KL gradient, written in policy-gradient ascent form, is

$$-\nabla_{\theta}D_{t}(\theta;h_{t})=\mathbb{E}_{y_{t}\sim\pi_{\theta}(\cdot\mid h_{t})}\left[A_{t}^{\mathrm{OPD}}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid h_{t})\right]. \tag{58}$$

The word “semi-gradient” is important. The chain-rule decomposition in Eq. (51) is exact, but the token-local update in Eq. (58) treats the sampled prefix $y_{<t}$ as fixed context. A full score-function gradient of the sequence-level reverse KL would instead couple all positions:

$$\nabla_{\theta}D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot\mid x)\middle\|\pi_{T}(\cdot\mid x)\right)=\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[\left(\sum_{k=1}^{T}\log\frac{\pi_{\theta}(y_{k}\mid h_{k})}{\pi_{T}(y_{k}\mid h_{k})}\right)\left(\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid h_{t})\right)\right]. \tag{59}$$

Standard OPD practice instead uses the single-step token-local update direction. With the sign convention of Eq. (6), this gives

$$\nabla_{\theta}J_{\mathrm{OPD}}(\theta)\approx\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}A_{t}^{\mathrm{OPD}}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid h_{t})\right], \tag{60}$$

which is the standard OPD form in the main text.
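The local identity in Eq. (56), including the claim that the $+1$ term drops out, can be checked numerically. The sketch below assumes the simplest possible parameterization, in which the student's next-token distribution is a softmax over a free logit vector, and compares the score-function form against a central finite difference of the local KL. The logits and teacher probabilities are made-up toy values.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def local_kl(z, pi_t):
    """D_t: KL(softmax(z) || pi_t) at one fixed prefix, as in Eq. (53)."""
    p = softmax(z)
    return sum(pa * math.log(pa / ta) for pa, ta in zip(p, pi_t))

def score_form_grad(z, pi_t):
    """Right-hand side of Eq. (56), with the logits z as parameters."""
    p = softmax(z)
    n = len(z)
    grad = [0.0] * n
    for a in range(n):
        log_ratio = math.log(p[a] / pi_t[a])
        for j in range(n):
            # d log softmax(z)[a] / d z_j = 1[a == j] - p[j]
            grad[j] += p[a] * log_ratio * ((1.0 if a == j else 0.0) - p[j])
    return grad

def finite_diff_grad(z, pi_t, h=1e-6):
    """Central finite-difference gradient of the local KL."""
    grad = []
    for j in range(len(z)):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        grad.append((local_kl(zp, pi_t) - local_kl(zm, pi_t)) / (2 * h))
    return grad

logits = [0.2, -1.0, 0.5]       # toy student logits
teacher = [0.5, 0.3, 0.2]       # toy teacher next-token distribution
g_score = score_form_grad(logits, teacher)
g_fd = finite_diff_grad(logits, teacher)
print(max(abs(a - b) for a, b in zip(g_score, g_fd)))
```

The two gradients agree up to finite-difference error, which is consistent with Eq. (55) removing the constant term.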

We now apply the same single-step rule to the ideal IW-OPD surrogate in Proposition 2. For a fixed prompt, Eq. (11) can be written as

$$J_{\mathrm{IW}}^{\star}(\theta;x)=\max_{\theta}\;-\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\mathrm{sg}\left[\tilde{r}_{t}(y_{<t}\mid x)\right]\log\frac{\pi_{\theta}(y_{t}\mid h_{t})}{\pi_{T}(y_{t}\mid h_{t})}\right], \tag{61}$$

where the notation follows Eqs. (44)–(46):

$$r_{t}:=r_{\theta}(y_{<t}\mid x)=\frac{\pi_{T}(y_{<t}\mid x)}{\pi_{\theta}(y_{<t}\mid x)},\qquad Z_{\alpha,t}(\theta,x)=\mathbb{E}_{y_{<t}\sim\pi_{\theta}(\cdot\mid x)}\left[r_{t}^{\alpha}\right],\qquad\tilde{r}_{t}=\frac{r_{t}^{\alpha}}{Z_{\alpha,t}}. \tag{62}$$

When differentiating the local next-token term at position $t$, all prefix-determined quantities are treated as fixed context. This includes the sampled prefix $y_{<t}$, the prefix ratio $r_{t}$, the normalizer $Z_{\alpha,t}$, and the normalized prefix ratio $\tilde{r}_{t}$. The stop-gradient operator in Eq. (61) enforces exactly this convention.

Detaching the prefix ratio is necessary for preserving the single-step OPD semi-gradient. Indeed, for fixed $\alpha$,

$$\begin{aligned}
\log r_{t}^{\alpha}
&=\alpha\sum_{k<t}\left(\log\pi_{T}(y_{k}\mid h_{k})-\log\pi_{\theta}(y_{k}\mid h_{k})\right)\\
&=\alpha\sum_{k<t}A_{k}^{\mathrm{OPD}}.
\end{aligned} \tag{63}$$

Since the teacher is fixed, differentiating through the ratio would introduce prefix score terms:

$$\nabla_{\theta}\log r_{t}^{\alpha}=-\alpha\sum_{k<t}\nabla_{\theta}\log\pi_{\theta}(y_{k}\mid h_{k}). \tag{64}$$

Such terms couple the token-$t$ update to earlier sampled actions and recover a sequence-level credit-assignment estimator rather than the token-local OPD semi-gradient. Differentiating $Z_{\alpha,t}$ would similarly require gradients through both the prefix sampling distribution and the prefix likelihood ratio. Therefore, both $r_{t}$ and $Z_{\alpha,t}$ are detached in the single-step update.

With these detached prefix quantities, the local IW-OPD update is simply the standard OPD update multiplied by the normalized prefix ratio. Thus,

$$\nabla_{\theta}J_{\mathrm{IW}}^{\star}(\theta)\approx\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}A_{t}^{\mathrm{IW\text{-}OPD}}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid h_{t})\right], \tag{65}$$

where

$$\begin{aligned}
A_{t}^{\mathrm{IW\text{-}OPD}}
&=\mathrm{sg}\left[\tilde{r}_{t}\right]A_{t}^{\mathrm{OPD}}\\
&=-\mathrm{sg}\left[\tilde{r}_{t}\right]\left(\log\pi_{\theta}(y_{t}\mid h_{t})-\log\pi_{T}(y_{t}\mid h_{t})\right).
\end{aligned} \tag{66}$$

This is the semi-gradient form of Eq. (12), with the stop-gradient convention inherited from Eq. (11). $\square$

Appendix B Algorithm

For clipped PPO, let $\pi_{0}$ be the frozen rollout policy for the current batch, and use $\eta_{t}(\theta)=\pi_{\theta}(y_{t}\mid x,y_{<t})/\pi_{0}(y_{t}\mid x,y_{<t})$ for the PPO ratio. The inner update minimizes

$$\mathcal{L}_{\mathrm{IW\text{-}OPD}}(\theta)=-\mathbb{E}_{x,y,t}\left[\min\left(\eta_{t}(\theta)A_{t}^{\mathrm{IW\text{-}OPD}},\operatorname{clip}\!\left(\eta_{t}(\theta),1-\epsilon_{\mathrm{clip}},1+\epsilon_{\mathrm{clip}}\right)A_{t}^{\mathrm{IW\text{-}OPD}}\right)\right], \tag{67}$$

where the expectation is over valid response tokens.
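As an illustration of Eq. (67), the sketch below evaluates the clipped surrogate on a flat list of valid response tokens. It computes the loss value only, with no automatic differentiation, and assumes masking has already been applied. The function name and the default `eps_clip=0.2` are assumptions for the example; the paper's clip value is not stated in this appendix.

```python
import math

def iw_opd_ppo_loss(logp_new, logp_old, adv, eps_clip=0.2):
    """Mean over tokens of -min(eta * A, clip(eta, 1-eps, 1+eps) * A).

    logp_new: log pi_theta(y_t | x, y_<t) under the current parameters.
    logp_old: cached log pi_0(y_t | x, y_<t) from the rollout policy.
    adv:      detached IW-OPD advantages A_t.
    """
    total = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        eta = math.exp(lp_new - lp_old)                        # PPO ratio
        eta_clipped = min(max(eta, 1.0 - eps_clip), 1.0 + eps_clip)
        total += min(eta * a, eta_clipped * a)                 # pessimistic bound
    return -total / len(adv)

# On the first inner step the ratio is 1, so the loss is minus the mean advantage.
print(iw_opd_ppo_loss([-1.0, -2.0], [-1.0, -2.0], [0.6, -0.2]))
```

With a positive advantage the ratio stops contributing above `1 + eps_clip`, while with a negative advantage the unclipped term is kept, which is the usual pessimistic behaviour of the PPO minimum.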

Algorithm 1 IW-OPD: Importance-Weighted On-Policy Distillation

Require: Student $\pi_{\theta}$, teacher $\pi_{T}$, prompt distribution $\mathcal{D}$, amplification $\gamma$ (default $0.5$), PPO clip $\epsilon_{\mathrm{clip}}$, stabilizer $\varepsilon$.

1: Initialize rollout policy $\pi_{0}\leftarrow\pi_{\theta}$.
2: for each training iteration do
3:   Sample $x\sim\mathcal{D}$ and generate $y=(y_{1},\ldots,y_{T})$ from $\pi_{0}(\cdot\mid x)$.
4:   Cache $\ell_{0,t}\leftarrow\log\pi_{0}(y_{t}\mid x,y_{<t})$ and $\ell_{T,t}\leftarrow\log\pi_{T}(y_{t}\mid x,y_{<t})$ for all valid tokens $t$.
5:   Set $A_{t}^{\mathrm{OPD}}\leftarrow\mathrm{sg}[\ell_{T,t}-\ell_{0,t}]$ for all valid tokens $t$.
6:   Set $\tilde{r}^{\mathrm{IW\text{-}OPD}}_{t}\leftarrow 1+\gamma\left(1-\frac{\sum_{k<t}\left|A_{k}^{\mathrm{OPD}}\right|}{\sum_{k<T}\left|A_{k}^{\mathrm{OPD}}\right|}\right)$ for all valid tokens $t$.
7:   Set $A_{t}^{\mathrm{IW\text{-}OPD}}\leftarrow\mathrm{sg}[\tilde{r}^{\mathrm{IW\text{-}OPD}}_{t}]\,A_{t}^{\mathrm{OPD}}$ for all valid tokens $t$.
8:   for several PPO inner steps do
9:     Update $\theta$ by minimizing Eq. (67), using $\eta_{t}(\theta)=\exp(\log\pi_{\theta}(y_{t}\mid x,y_{<t})-\ell_{0,t})$.
10:   end for
11:   Refresh rollout policy: $\pi_{0}\leftarrow\pi_{\theta}$.
12: end for
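Steps 5 to 7 of Algorithm 1 can be sketched for a single response as follows. The listing names a stabilizer $\varepsilon$ without showing where it enters; adding it to the denominator here is an assumption, as are the function name and the toy log-probabilities. In an autodiff framework both returned lists would be detached, which plain Python floats already are.

```python
def iw_opd_advantages(logp_teacher, logp_rollout, gamma=0.5, eps=1e-8):
    """Return (A_t^{IW-OPD}, weights) for one response of T valid tokens."""
    # Step 5: OPD advantage is teacher log-prob minus rollout log-prob.
    adv = [lt - l0 for lt, l0 in zip(logp_teacher, logp_rollout)]
    abs_adv = [abs(a) for a in adv]
    # Denominator: sum over k < T, plus the assumed stabilizer.
    total = sum(abs_adv[:-1]) + eps
    weights, cum = [], 0.0
    for a in abs_adv:
        # Step 6: cum holds the sum over k < t, so the weight is causal.
        weights.append(1.0 + gamma * (1.0 - cum / total))
        cum += a
    # Step 7: rescale each advantage by its weight.
    return [w * a for w, a in zip(weights, adv)], weights

# Toy cached log-probs for a 4-token response (illustrative only).
ell_T = [-1.0, -2.0, -0.5, -3.0]
ell_0 = [-0.5, -1.0, -0.5, -1.0]
adv_iw, weights = iw_opd_advantages(ell_T, ell_0)
print(weights)
```

The weight starts at $1+\gamma$ on the first token and decays to about 1 on the last, so early tokens are amplified and later tokens fall back to the plain OPD advantage as discrepancy accumulates.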

Appendix C Experimental Setup and Hyperparameters

This section reports the implementation details used for the experiments in §5. All OPD variants are implemented in the same verl-based PPO training pipeline. The student samples responses on-policy; student and teacher log probabilities are then evaluated on the sampled response tokens. No learned reward model is used: the token-level OPD or IW-OPD advantages are passed directly into the clipped PPO surrogate.

C.1 Models and Data

Table 5: Model and data configurations used in the main experiments.

For all Qwen3 models, we use the chat template with thinking disabled. Prompts longer than the context budget are filtered rather than truncated, and response tokens beyond the generated answer mask are excluded from all OPD and IW-OPD computations.

C.2 Training Hyperparameters

Table 6: Default training hyperparameters. Unless noted otherwise, the same values are used for OPD and IW-OPD within each model–teacher setting.

Epoch budgets follow the corresponding model-scale scripts and are held fixed across OPD and IW-OPD within each setting. All reported comparisons use the same data order and seed set across methods.

C.3 Method-Specific Parameters

Finally, IW-OPD performs an interpolation with the original OPD to balance these two effects:

$$A^{\mathrm{IW\text{-}OPD}}_{t}=\mathrm{sg}[\tilde{r}^{\mathrm{IW\text{-}OPD}}_{t}]\cdot A^{\mathrm{OPD}}_{t}=\left(1+\gamma\left(1-\frac{\sum_{k<t}\left|A_{k}^{\mathrm{OPD}}\right|}{\sum_{k<T}\left|A_{k}^{\mathrm{OPD}}\right|}\right)\right)\cdot A^{\mathrm{OPD}}_{t}, \tag{68}$$

where a higher $\gamma\geq 0$ indicates a higher contribution from the teacher-informed importance weights. In practice, we find $\gamma=0.5$ is a good default choice.

C.4 Evaluation Protocol

For math benchmarks, we generate 32 responses per problem with vLLM using temperature $1.0$, top-$p$ $1.0$, maximum generation length 16384, and seed-matched sampling across methods. Each prompt appends the instruction: “Please reason step by step, and put your final answer within \boxed{}.” We extract the final boxed answer and evaluate it with symbolic equivalence checking. Tables report the aggregation specified in their captions, e.g., best@32 or mean@32.
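The extraction and aggregation steps described above can be sketched as below. This is not the paper's evaluation code: all function names are hypothetical, and `is_correct` is a plain string comparison standing in for the symbolic equivalence checker, which the sketch does not implement.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start < 0:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)   # matching close brace found
        out.append(ch)
    return None  # unbalanced braces

def is_correct(pred, gold):
    # Placeholder for symbolic equivalence: exact match ignoring spaces.
    return pred is not None and pred.replace(" ", "") == gold.replace(" ", "")

def mean_and_best_at_k(responses, gold):
    """mean@k is the fraction of correct samples; best@k is 1 if any is correct."""
    hits = [is_correct(extract_boxed(r), gold) for r in responses]
    return sum(hits) / len(hits), float(any(hits))

samples = [
    "... so the answer is \\boxed{\\frac{1}{2}}.",
    "... therefore \\boxed{2}.",
    "... first \\boxed{3}, corrected to \\boxed{\\frac{1}{2}}",
    "no final answer given",
]
print(mean_and_best_at_k(samples, "\\frac{1}{2}"))
```

Brace matching is used instead of a regular expression so that nested braces such as those in `\frac{1}{2}` are captured whole.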

For code benchmarks, we use the EvalPlus evaluation suite for HumanEval+ and MBPP+. Greedy single-sample evaluation is used for the reported pass-rate results.

Checkpoints with subscript 10 in the main tables are evaluated at training step 10. Converged OPD and IW-OPD checkpoints are selected using the same validation protocol within each model–teacher setting and are then evaluated on the held-out math and code benchmarks.

Appendix D Combination experiments with other reward design methods

Table 7: Combination results with ExOPD; both methods are distilled from Qwen3-30B-A3B. Math results are reported as mean@32 accuracy (%). IW-ExOPD denotes the combination of IW-OPD and ExOPD. Bold indicates the best result within each student group.

| Student | Method | Math: AIME24 | Math: AIME25 | Math: HMMT25 | Code: HE+ | Code: MBPP+ | Avg |
|---|---|---|---|---|---|---|---|
| Teacher Model | | 74.7 | 62.8 | 44.2 | 86.6 | 75.1 | 68.7 |
| Qwen3-4B | Base | 23.1 | 21.4 | 10.0 | 75.3 | 64.5 | 38.9 |
| | OPD | 55.3 | 48.0 | 27.1 | 77.2 | 69.1 | 55.3 |
| | ExOPD | 57.9 | 50.1 | 31.7 | 78.9 | 70.2 | 57.8 |
| | IW-ExOPD | **59.4** (+1.5) | **51.7** (+1.6) | **32.0** (+0.3) | **80.1** (+1.1) | **71.0** (+0.8) | **58.8** (+1.0) |
| Qwen3-1.7B | Base | 13.4 | 11.0 | 6.8 | 59.6 | 52.5 | 28.7 |
| | OPD | 34.6 | 28.7 | 15.5 | 64.6 | 53.7 | 39.4 |
| | ExOPD | 37.6 | 31.8 | 16.8 | 67.2 | 55.0 | 41.7 |
| | IW-ExOPD | **38.9** (+1.3) | **33.2** (+1.4) | **18.3** (+1.5) | **68.5** (+1.3) | **57.4** (+2.4) | **43.2** (+1.5) |

IW-OPD is orthogonal to reward design.

IW-OPD changes how an already-computed OPD signal is allocated across positions, so it is not tied to a particular advantage or reward design. As one example, ExOPD [42] reformulates OPD as a reinforcement-learning problem with a KL constraint, separates out the reward term, and improves exploration by scaling that reward with a fixed hyperparameter $\lambda$. Since IW-OPD supplies prefix-level importance weights, we can apply the same idea on top of ExOPD by making the reward scale prefix-dependent. We call this combination IW-ExOPD. As shown in Table 7, IW-ExOPD improves over ExOPD for both Qwen3-4B and Qwen3-1.7B students. This suggests that prefix-level importance weighting is orthogonal to reward design, rather than being tied only to vanilla OPD.

Appendix E Limitations

At convergence, $\tilde{r}_{t}$ approaches $1-t/T$ rather than becoming perfectly uniform, which leaves a mild residual non-uniformity. Cumulative prefix discrepancy is a conservative prefix-compatibility proxy induced by the prefix likelihood-ratio principle, not an exact density-ratio correction. Experiments are conducted at student scales up to 4B; validation at larger scales remains future work.