Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

Summary

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

arXiv:2605.11739v1 Announce Type: new Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of “foresight”: it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.

Cached at: 05/13/26, 06:17 AM

# Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
Source: [https://arxiv.org/html/2605.11739](https://arxiv.org/html/2605.11739)
Yuchen Cai¹,², Ding Cao¹,⁴,\*, Liang Lin⁵, Chunxi Luo⁶, Xin Xu², Kai Yang², Weijie Liu², Saiyong Yang², Tianxiang Zhao⁴, Guangzhong Sun¹, Guiquan Liu¹, Junfeng Fang³,‡

¹USTC, ²Tencent, ³NUS, ⁴HKUST (GZ), ⁵Alibaba, ⁶SHU

{caiyuchen,caoding}@mail.ustc.edu.cn

\* These authors contributed equally to this work. † This work was done during an internship at Tencent. ‡ Corresponding authors: gqliu@ustc.edu.cn, fjf@mail.ustc.edu.cn

###### Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD’s efficiency remain poorly understood. In this work, we argue that OPD’s efficiency stems from a form of “foresight”: it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models. Our code is available at: [https://github.com/caiyuchen-ustc/EffOPD](https://github.com/caiyuchen-ustc/EffOPD).

![Refer to caption](https://arxiv.org/html/2605.11739v1/x1.png)Figure 1: Illustration of the foresight mechanism in OPD. Compared with RL, OPD identifies critical modules and aligns with the final optimization direction early in training, concentrating effective updates while reducing redundancy. Based on this, we propose EffOPD, which extrapolates along the early predicted direction to accelerate training.

“To foresee the future is to master the present.” - Niccolò Machiavelli

## 1 Introduction

As large language models (LLMs) continue to advance in reasoning (OpenAI, [2025](https://arxiv.org/html/2605.11739#bib.bib1); DeepSeek-AI et al., [2025](https://arxiv.org/html/2605.11739#bib.bib46)), On-Policy Distillation (OPD) has emerged as an important paradigm for post-training and model fusion (Agarwal et al., [2024b](https://arxiv.org/html/2605.11739#bib.bib63); Xiao et al., [2026](https://arxiv.org/html/2605.11739#bib.bib72); DeepSeek-AI, [2026](https://arxiv.org/html/2605.11739#bib.bib90)). Given a teacher model, OPD leverages dense supervisory signals to achieve performance comparable to Reinforcement Learning (RL) with substantially reduced training time (Venkatkrishna et al., [2026](https://arxiv.org/html/2605.11739#bib.bib84); Yang et al., [2025](https://arxiv.org/html/2605.11739#bib.bib42)). Existing studies mainly attribute this advantage to denser and more stable supervision (He et al., [2026](https://arxiv.org/html/2605.11739#bib.bib80); Yue et al., [2025](https://arxiv.org/html/2605.11739#bib.bib40)). However, such optimization-centric explanations remain largely macroscopic and fail to capture the underlying parameter update dynamics (Zhang et al., [2025b](https://arxiv.org/html/2605.11739#bib.bib23)).

In this work, we argue that OPD’s efficiency stems from a form of “foresight”: it establishes stable and highly aligned update directions early in training, enabling rapid convergence with limited exploration and correction\. This foresight manifests in two aspects\.

Foresight at the Module-Allocation Level. Our analysis reveals that, under the same update norm constraint, OPD achieves larger performance gains than RL, suggesting that its advantage does not merely stem from the magnitude of parameter updates (Geva et al., [2021](https://arxiv.org/html/2605.11739#bib.bib62), [2023](https://arxiv.org/html/2605.11739#bib.bib111)). Further analysis shows that, although RL and OPD exhibit similar sensitivity patterns across layers and modules, RL accumulates substantially larger update norms in modules with limited contribution to performance improvement, thereby introducing redundant updates with low marginal utility. In contrast, OPD demonstrates a form of “foresight”. As shown in Figure [1](https://arxiv.org/html/2605.11739#S0.F1)(b), it identifies these low-utility modules early in training and suppresses their parameter updates, allowing updates to concentrate more effectively on intermediate-layer modules that are more critical to reasoning (Meng et al., [2023](https://arxiv.org/html/2605.11739#bib.bib3)).

Foresight at the Update-Direction Level. At the update-direction level, OPD’s foresight lies in the early alignment between its update directions and the principal directions of the final solution. Spectral and subspace evolution analyses show that OPD concentrates updates on a few dominant directions early in training (Zhang, [2015](https://arxiv.org/html/2605.11739#bib.bib19)); these directions are highly aligned with the final update subspace and remain stable thereafter, as shown in Figure [1](https://arxiv.org/html/2605.11739#S0.F1)(c). In contrast, RL exhibits more dispersed updates, with delayed and more fluctuating alignment. Moreover, after module-wise norm scaling, an OPD checkpoint at only 10% training progress recovers approximately 80% of the final reasoning performance. This suggests that OPD captures the main structure of the final solution early and locks onto an effective direction with minimal exploration and correction.

To further validate these insights and improve the training efficiency of OPD, we propose EffOPD, a simple and intuitive acceleration framework. As shown in Figure [1](https://arxiv.org/html/2605.11739#S0.F1)(d), EffOPD performs linear extrapolation along the current update direction, leveraging the inherent “foresight” of OPD to match the final performance of vanilla OPD with fewer training iterations and samples. Experiments across model scales from 1.5B to 32B parameters show that EffOPD achieves an average training acceleration of $3\times$ over multiple baselines in a plug-and-play manner, while maintaining comparable final performance.

In summary, this work identifies a form of foresight in OPD for LLMs and argues that it is a key source of its training efficiency. Our analysis provides a parameter-level explanation for the common intuition that distillation is easier to optimize due to denser supervision (Yang et al., [2026b](https://arxiv.org/html/2605.11739#bib.bib85)). Building on these findings, EffOPD offers a simple plug-and-play acceleration method for OPD, requiring no additional modules, complex hyperparameter tuning, or human intervention. It achieves an average training acceleration of $3\times$ and remains orthogonal to existing acceleration techniques, providing new insights into the design of more interpretable and efficient post-training paradigms for large language models.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x2.png)Figure 2:Comparison of parameter update efficiency between RL and OPD\. \(a\) Scaling analysis at the final checkpoint: for updates scaled to the same norm, OPD achieves substantially higher reasoning gains than RL\. \(b\) Training dynamics: across the entire optimization trajectory, OPD consistently requires smaller parameter updates than RL to reach equivalent reasoning accuracy\.
## 2 Functional Redundancy Avoidance

In this section, we investigate the modular-level differences between OPD and RL. We show that OPD exhibits modular-level “foresight”: it preferentially concentrates updates in high-marginal-utility functional regions while suppressing parameter changes in low-utility regions. We refer to this property as Functional Redundancy Avoidance. Section [2.1](https://arxiv.org/html/2605.11739#S2.SS1) introduces the experimental setup, and Section [2.2](https://arxiv.org/html/2605.11739#S2.SS2) compares OPD with RL to show how this foresight leads to more compact and efficient parameter updates.

### 2.1 Experimental Setting

Our analysis uses a shared initialization $W_{\mathrm{Base}}$ for both RL and OPD, with parameter updates defined as $\Delta W_{\mathrm{RL/OPD}} = W_{\mathrm{RL/OPD}} - W_{\mathrm{Base}}$. We conduct experiments across models ranging from 1.5B to 32B parameters, including pretrained, SFT-tuned, and Thinking-series models (Qwen et al., [2025](https://arxiv.org/html/2605.11739#bib.bib21); Zhang et al., [2025c](https://arxiv.org/html/2605.11739#bib.bib109); Yang et al., [2025](https://arxiv.org/html/2605.11739#bib.bib42)). For RL, we consider PPO, GRPO, and DAPO (Yu et al., [2025](https://arxiv.org/html/2605.11739#bib.bib47)). For OPD, the student is trained with a pattern-aligned teacher, typically a stronger model from the same family (Li et al., [2026](https://arxiv.org/html/2605.11739#bib.bib77)). Further details are provided in Appendix [D.2](https://arxiv.org/html/2605.11739#A4.SS2).

### 2.2 Parameter Updates & Reasoning Gains

#### Results on Fully Trained Models\.

We first examine the update efficiency at the final checkpoint. Specifically, we fix the update direction $\Delta W_{\mathrm{RL/OPD}}$ from the last checkpoint and scale its magnitude using a factor $\alpha \in [0, 1]$, evaluating models of the form $W_{\mathrm{Base}} + \alpha \Delta W_{\mathrm{RL/OPD}}$. As shown in Figure [2](https://arxiv.org/html/2605.11739#S1.F2)(a), when updates are scaled to the same norm, OPD achieves substantially higher reasoning gains than RL. This indicates that $\Delta W_{\mathrm{RL}}$ contains a non-negligible number of components weakly correlated with task performance: they contribute to the update norm but provide limited reasoning improvement. In contrast, OPD updates carry a greater fraction of task-relevant signal that effectively translates into performance gains.
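The scaling analysis can be illustrated with a minimal NumPy sketch. It assumes checkpoints are stored as dictionaries of named weight matrices and that the "same norm" is measured as a global Frobenius norm over all matrices; the helper names, toy shapes, and the choice of the smaller norm as the shared budget are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def scale_update(w_base, w_final, alpha):
    """Return W_base + alpha * (W_final - W_base) for every named matrix."""
    return {name: w_base[name] + alpha * (w_final[name] - w_base[name])
            for name in w_base}

def update_norm(w_base, w):
    """Global Frobenius norm of the update W - W_base over all matrices."""
    total = sum(float(np.sum((w[name] - w_base[name]) ** 2)) for name in w_base)
    return total ** 0.5

def alpha_for_norm(w_base, w_final, target_norm):
    """Scale factor that brings the full update to a requested norm."""
    return target_norm / update_norm(w_base, w_final)

# Toy stand-ins for real checkpoints (hypothetical shapes and values).
rng = np.random.default_rng(0)
w_base = {"mlp": rng.normal(size=(4, 4)), "attn": rng.normal(size=(4, 4))}
w_rl = {n: w + 0.30 * rng.normal(size=w.shape) for n, w in w_base.items()}
w_opd = {n: w + 0.10 * rng.normal(size=w.shape) for n, w in w_base.items()}

# Compare both methods at one shared norm budget (assumed: the smaller norm,
# which keeps both scale factors inside [0, 1]).
budget = min(update_norm(w_base, w_rl), update_norm(w_base, w_opd))
rl_matched = scale_update(w_base, w_rl, alpha_for_norm(w_base, w_rl, budget))
opd_matched = scale_update(w_base, w_opd, alpha_for_norm(w_base, w_opd, budget))
```

Evaluating `rl_matched` and `opd_matched` on a reasoning benchmark would then compare the two update directions at equal magnitude.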

#### Results across the Training Process\.

This observation naturally raises a key question: when do these weakly task-correlated components emerge during RL training? Since the performance of RL-trained models typically saturates in later stages, one possible explanation is that redundant updates mainly accumulate near the end of training (Khatri et al., [2025](https://arxiv.org/html/2605.11739#bib.bib96); Zheng et al., [2025](https://arxiv.org/html/2605.11739#bib.bib110)). To examine this, we analyze intermediate checkpoints of both RL and OPD throughout training and track the relationship between parameter update magnitude and reasoning accuracy. As shown in Figure [2](https://arxiv.org/html/2605.11739#S1.F2)(b), OPD consistently requires smaller parameter updates than RL to achieve the same reasoning accuracy. Moreover, OPD achieves rapid accuracy improvement with relatively small increases in the norm of $\Delta W_{\mathrm{OPD}}$, whereas RL improves more slowly under comparable update magnitudes. These results suggest that OPD’s superior efficiency does not simply come from avoiding late-stage redundancy, but from forming a compact and task-relevant update pattern early in training.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x3.png)Figure 3:Functional contributions and update distributions across architectural components\. \(a\) Effect of embedding layer replacement on AIME26\. \(b\) Layer\-wise update norms \(bars, left axis\) for RL/OPD\-trained Qwen3\-8B\-Base models, and corresponding OPD reasoning accuracy after sliding\-window intervention \(line, right axis\) on MATH500\.
#### Locating the Redundant Updates\.

The previous analysis shows that RL updates contain components with relatively low task relevance\. To locate these redundancies and assess their functional contributions, we decompose model updates into three architectural components: embedding, MLP, and attention layers\. We first examine the embedding layer by replacing the embeddings of OPD and RL models with those from the base model while keeping all other parameters unchanged\. As shown in Figure[3](https://arxiv.org/html/2605.11739#S2.F3)\(a\), this intervention has negligible impact on reasoning performance, suggesting that embedding updates contribute little to reasoning gains\. Thus, the main functional updates of OPD and RL are likely concentrated in deeper model components rather than the embedding layer\.

Next, we conduct a sliding-window intervention analysis to locate the functional regions of OPD and RL updates. Following prior block-wise intervention studies (Cai et al., [2024](https://arxiv.org/html/2605.11739#bib.bib105); Meng et al., [2023](https://arxiv.org/html/2605.11739#bib.bib3)), we partition the model into consecutive layer blocks and inject local OPD or RL updates into each block to evaluate their impact on reasoning performance (the detailed setup is provided in Appendix [E.2](https://arxiv.org/html/2605.11739#A5.SS2)). As shown in Figure [3](https://arxiv.org/html/2605.11739#S2.F3)(b) and Figure [10](https://arxiv.org/html/2605.11739#A5.F10)(b), MLP modules are overall more sensitive to reasoning-related updates than attention modules, indicating that MLPs serve as the primary carriers of knowledge representation and relational reasoning. From the perspective of layer position, the performance curves of both module types exhibit a clear inverted U-shaped pattern: interventions in the middle layers yield the largest gains, whereas those in the bottom and top layers lead to relatively smaller improvements. This suggests that reasoning-related updates are not uniformly distributed across the network, but are mainly concentrated in middle-layer MLPs with stronger functional coupling. These findings are consistent with prior mechanistic interpretability studies on the functional roles of Transformer modules and layers (Skean et al., [2025](https://arxiv.org/html/2605.11739#bib.bib81); Geva et al., [2021](https://arxiv.org/html/2605.11739#bib.bib62), [2022](https://arxiv.org/html/2605.11739#bib.bib113)).
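A block-wise intervention of this kind might be sketched as follows. The window size, stride, and the representation of a model as a list of per-layer arrays are hypothetical choices for illustration; the paper's actual setup is the one described in its Appendix E.2, which is not reproduced here.

```python
import numpy as np

def sliding_windows(num_layers, window, stride=1):
    """Index lists of consecutive layer blocks, e.g. [0, 1, 2], [1, 2, 3], ..."""
    return [list(range(start, start + window))
            for start in range(0, num_layers - window + 1, stride)]

def inject_window(base_layers, tuned_layers, window):
    """Base model in which only the layers in `window` receive tuned weights."""
    chosen = set(window)
    return [tuned_layers[i].copy() if i in chosen else base_layers[i].copy()
            for i in range(len(base_layers))]

# Toy 6-layer "model": base layers are zeros, tuned layer i is filled with i + 1.
base = [np.zeros((2, 2)) for _ in range(6)]
tuned = [np.full((2, 2), float(i + 1)) for i in range(6)]

# One intervened model per window; each would be evaluated separately.
variants = {tuple(w): inject_window(base, tuned, w)
            for w in sliding_windows(6, window=3)}
```

Plotting the benchmark accuracy of each variant against the window position would give a sensitivity curve over depth.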

Building on these observations, we further compare the update patterns of OPD and RL. The two methods exhibit highly consistent intervention sensitivity distributions across both module types and layer positions, suggesting that OPD and RL do not rely on fundamentally different functional pathways, but instead optimize along the model’s existing key functional structures. The key difference lies in their layer-wise update norms. RL introduces substantially larger parameter changes in the low-sensitivity bottom and top layers. Since interventions in these peripheral layers yield limited performance gains, the larger update norms there do not translate into proportional improvements and are therefore more likely to reflect redundant updates weakly related to task rewards. In contrast, while maintaining a functional update distribution similar to RL, OPD significantly suppresses parameter changes in low-sensitivity regions and concentrates updates more strongly in middle-layer modules with higher functional contributions. Therefore, the advantage of OPD does not come from learning an entirely new update mechanism, but from more accurately distinguishing high-benefit from low-benefit parameter regions and reducing ineffective updates in peripheral layers, thereby achieving higher update efficiency and stronger reasoning performance gains with more compact parameter changes. Visualized differences and performance comparisons between RL and OPD across different components, together with detailed analysis, are provided in Appendix [E](https://arxiv.org/html/2605.11739#A5).

#### Summary\.

The above results show that OPD exhibits clear foresight at the modular level, which we formalize as Property 1: Functional Redundancy Avoidance. Compared with RL, OPD forms a compact and task-relevant update pattern earlier in training, suppresses redundant parameter changes in low-marginal-utility regions, and concentrates updates in reasoning-critical modules with higher functional contributions, thereby achieving higher update efficiency and stronger reasoning performance gains.

Table 1: Characterization of Parameter Update Geometry: OPD vs. RL Across Model Scales.

| Metric | 1.5B RL | 1.5B OPD | 4B RL | 4B OPD | 8B RL | 8B OPD | 14B RL | 14B OPD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spectral Norm (↑) | 0.094 | 0.113 | 0.007 | 0.009 | 0.004 | 0.005 | 0.056 | 0.063 |
| Spectral / Frobenius Norm Ratio (↑) | 33.2% | 39.6% | 19.7% | 25.7% | 32.7% | 36.8% | 24.4% | 28.1% |
| Effective Rank (↓) | 964 | 778 | 1908 | 1587 | 2754 | 2341 | 3174 | 2937 |
| Top-1% Subspace Norm Ratio (↑) | 78.1% | 92.3% | 79.2% | 93.4% | 88.5% | 94.7% | 81.2% | 94.5% |

## 3 Early Low-Rank Lock-in

The preceding analysis reveals OPD’s “foresight” at the modular level. Building on this, we further investigate the intrinsic organization of its parameter updates from a geometric perspective and introduce the property Early Low-Rank Lock-in to describe this potential structural constraint. Specifically, we validate this property by analyzing the spectral concentration of the update matrix, the functional contributions of different subspaces, and the functional effectiveness of early stabilized directions through norm scaling experiments.

### 3.1 Spectral Concentration of Update Matrix

To characterize the spectral structure of parameter updates, we perform singular value decomposition (SVD) (Koren et al., [2009](https://arxiv.org/html/2605.11739#bib.bib22)) on the update matrix $\Delta W_{\mathrm{RL/OPD}} = U \Sigma V^{\top}$ and introduce four complementary geometric metrics (detailed definitions are provided in Appendix [F.1](https://arxiv.org/html/2605.11739#A6.SS1)): Spectral Norm (Mathias, [1990](https://arxiv.org/html/2605.11739#bib.bib99)), Spectral / Frobenius Norm Ratio (Al-Natoor, [2024](https://arxiv.org/html/2605.11739#bib.bib100)), Effective Rank (Roy and Vetterli, [2007](https://arxiv.org/html/2605.11739#bib.bib98)), and Top-1% Subspace Norm Ratio (Cai et al., [2025](https://arxiv.org/html/2605.11739#bib.bib78)). The first two metrics quantify the dominance of leading singular directions, while the latter two measure the concentration of update energy across the spectrum. Table [1](https://arxiv.org/html/2605.11739#S2.T1) reports the average values over all MLP and attention matrices. Across all model scales, OPD consistently exhibits stronger low-rank structure than RL. For example, on the 8B model, OPD achieves a higher spectral-to-Frobenius norm ratio (36.8% vs. 32.7%), lower effective rank (2341 vs. 2754), and higher Top-1% subspace norm ratio (94.7% vs. 88.5%). These results suggest that OPD concentrates update energy into a small set of dominant directions more effectively than RL. Notably, despite having a smaller overall update norm, OPD allocates a larger proportion of its update energy to these dominant subspaces. This raises a key question: does such directional concentration explain the efficiency advantage of OPD observed in Section [2](https://arxiv.org/html/2605.11739#S2)? To answer this, we conduct two controlled experiments to separately examine the roles of update direction and update magnitude.
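For a single update matrix, the four metrics could be computed roughly as below. This is a sketch under assumptions: the effective rank follows the entropy-based definition of Roy and Vetterli (exponential of the Shannon entropy of the sum-normalized singular values), and the Top-1% Subspace Norm Ratio is taken to be the Frobenius norm of the top 1% of singular components divided by the full Frobenius norm. The paper's exact definitions are in its Appendix F.1 and may differ.

```python
import numpy as np

def spectral_metrics(delta_w, top_frac=0.01):
    """Spectral summary of one update matrix (definitions are assumptions)."""
    s = np.linalg.svd(delta_w, compute_uv=False)   # singular values, descending
    fro = float(np.sqrt(np.sum(s ** 2)))           # Frobenius norm

    # Entropy-based effective rank: exp(H(p)) with p = s / sum(s).
    p = s / np.sum(s)
    p = p[p > 0]                                    # skip exact zeros in log
    effective_rank = float(np.exp(-np.sum(p * np.log(p))))

    # Share of the Frobenius norm carried by the leading `top_frac` components.
    k = max(1, int(np.ceil(top_frac * len(s))))
    top_ratio = float(np.sqrt(np.sum(s[:k] ** 2)) / fro)

    return {
        "spectral_norm": float(s[0]),
        "spectral_fro_ratio": float(s[0] / fro),
        "effective_rank": effective_rank,
        "top_subspace_norm_ratio": top_ratio,
    }

# Example on a random stand-in for an update matrix (nonzero by construction).
rng = np.random.default_rng(0)
metrics = spectral_metrics(rng.normal(size=(64, 32)))
```

Averaging these values over all MLP and attention matrices of a checkpoint would give per-model numbers comparable in form to Table 1.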

![Refer to caption](https://arxiv.org/html/2605.11739v1/x4.png)Figure 4: Low-rank subspace analysis. (a) Top-$k\%$ subspace: OPD achieves higher performance; (b) Bottom-$k\%$ subspace: RL incurs significantly larger norm cost for marginal performance gains.
### 3.2 Functional Partition of the Update Spectrum: Principal vs. Tail Subspaces

#### Top-$k\%$ Subspace: Directional Quality under Equal Norm Budget.

To assess the intrinsic directional quality of the principal subspace, we construct a Top-$k\%$ truncated approximation $\Delta W_{\text{Top-}k\%}$ using the Top-$k\%$ singular components, and subsequently rescale its Frobenius norm to match between RL and OPD. After applying this low-rank update to the base model, we evaluate its reasoning performance. By standardizing the norm budget, we are able to directly compare the directional quality of the Top-$k\%$ principal subspaces between RL and OPD.
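The truncation and rescaling step can be sketched per matrix as follows. The rounding rule for the number of retained components and the choice of target norm are assumptions made for illustration; the text only states that the Frobenius norms of the RL and OPD truncations are matched.

```python
import numpy as np

def top_k_percent(delta_w, k_percent):
    """Low-rank approximation built from the leading k% of singular components."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    r = max(1, int(np.ceil(len(s) * k_percent / 100.0)))
    return (u[:, :r] * s[:r]) @ vt[:r]

def rescale_to_norm(delta_w, target_norm):
    """Keep the direction of the update, set its Frobenius norm to target_norm."""
    norm = np.linalg.norm(delta_w)   # Frobenius norm for a 2-D array
    return delta_w if norm == 0 else delta_w * (target_norm / norm)

# Toy RL and OPD updates for one matrix (hypothetical values).
rng = np.random.default_rng(0)
delta_rl = rng.normal(size=(8, 8))
delta_opd = 0.5 * rng.normal(size=(8, 8))

top_rl = top_k_percent(delta_rl, 25)
top_opd = top_k_percent(delta_opd, 25)

# Shared budget (assumed here: the smaller of the two truncated norms).
budget = min(np.linalg.norm(top_rl), np.linalg.norm(top_opd))
top_rl_matched = rescale_to_norm(top_rl, budget)
top_opd_matched = rescale_to_norm(top_opd, budget)
```

Adding each matched update to the base weights and evaluating the result isolates directional quality from update magnitude.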

As shown in Figure [4](https://arxiv.org/html/2605.11739#S3.F4)(a), both methods recover over 95% of their full-model reasoning performance using only 10% of the rank, confirming that the Top-$k\%$ subspace serves as the primary carrier for improving reasoning performance. Remarkably, OPD consistently outperforms RL across all evaluated rank levels, and this advantage persists across different model scales and rank thresholds. This suggests not only that OPD allocates its limited update budget more efficiently by concentrating on higher-quality directional subspaces, but also that the principal directions identified by OPD inherently encode more effective update signals than those of RL, even under the same norm budget.

#### Bottom-$k\%$ Subspace: Marginal Utility of Tail Directions.

To further investigate, we compare the impact of tail directions on performance, where tail directions are defined as the subspace constructed using the last $k\%$ singular components, denoted as $\Delta W_{\text{Bottom-}k\%}$. Unlike the Top-$k\%$ subspace analysis, we do not apply norm scaling to equalize the update budgets, so as to observe their performance contributions under the original training state. As shown in Figure [4](https://arxiv.org/html/2605.11739#S3.F4)(b), in contrast to the principal subspace, tail subspaces provide only limited performance recovery for both RL and OPD. On the Qwen2.5-1.5B-DeepSeek model, retaining only 10% of the principal subspace increases reasoning accuracy from 23.33% to 40.3%, whereas preserving 50% of the tail subspace achieves only around 30%, despite using a much larger fraction of the rank budget. This contrast suggests that tail directions have substantially lower marginal utility for reasoning than principal directions.

Interestingly, RL exhibits a slight advantage over OPD in tail directions. However, this marginal benefit comes with a large norm cost: the norm of RL’s tail subspace ($\Delta W_{\text{Bottom-}50\%}$) ranges from approximately 1.6 to 2.5 times that of OPD, while the corresponding performance gain remains limited. In other words, RL allocates a substantial portion of its update magnitude to tail directions, but the marginal return of this allocation is relatively low.
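The tail-subspace construction and the norm comparison between RL and OPD can be sketched in the same style. As before, the rounding rule for the number of components and the function names are illustrative assumptions; no norm rescaling is applied, in line with the description above.

```python
import numpy as np

def bottom_k_percent(delta_w, k_percent):
    """Update rebuilt from only the last k% of singular components."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    r = max(1, int(np.ceil(len(s) * k_percent / 100.0)))
    return (u[:, -r:] * s[-r:]) @ vt[-r:]

def tail_norm_ratio(delta_rl, delta_opd, k_percent=50):
    """Frobenius norm of RL's tail update relative to OPD's tail update."""
    tail_rl = np.linalg.norm(bottom_k_percent(delta_rl, k_percent))
    tail_opd = np.linalg.norm(bottom_k_percent(delta_opd, k_percent))
    return float(tail_rl / tail_opd)

# Small check on a matrix with known singular values 4, 3, 2, 1:
# the bottom 50% keeps only the components with singular values 2 and 1.
d = np.diag([4.0, 3.0, 2.0, 1.0])
tail = bottom_k_percent(d, 50)
```

A ratio well above 1 from `tail_norm_ratio` would correspond to the reported pattern of RL spending more update magnitude on tail directions.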

These observations help explain the compactness advantage of OPD discussed in Section[2](https://arxiv.org/html/2605.11739#S2)\. Compared with OPD, RL distributes more update energy into tail directions whose contribution to reasoning performance is limited, which is consistent with its larger overall update norm for comparable performance\. In contrast, OPD allocates a larger fraction of its update energy to the principal subspace, thereby achieving stronger per\-norm performance gains with more compact updates\.

The preceding analysis shows that OPD updates exhibit substantially stronger low-rank concentration from a spatial-geometric perspective. Together with the controlled Top-$k\%$ and Bottom-$k\%$ subspace experiments, this suggests that such concentration is a key factor behind OPD’s higher per-norm efficiency, rather than merely a by-product of smaller update norms. We next move from static spectral structure to temporal evolution, examining whether OPD’s efficiency arises from early identification of high-quality directions or from continuous path correction during training.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x5.png)Figure 5: Subspace evolution and weight scaling analysis during training. (a) t-SNE visualization of Top-1 subspace evolution for RL and OPD trajectories. (b) Cosine similarity between the Top-$k$ subspaces of intermediate and final checkpoints. (c) Changes in accuracy and KL after scaling intermediate OPD checkpoints’ $\Delta W_{\mathrm{OPD}}$ to match the final checkpoint’s norm.

### 3.3 Directional Stabilization and Magnitude Development

#### Subspace Evolution Trajectory Analysis\.

To qualitatively compare the evolution of update directions during training, we visualize the Top\-1 subspace using t\-SNE, as shown in Figure[5](https://arxiv.org/html/2605.11739#S3.F5)\(a\)\. The RL trajectory exhibits larger variations across checkpoints, whereas the OPD trajectory appears more compact and smoother in the projected space\. This visualization suggests a potential difference in directional stability between RL and OPD, which we next examine quantitatively through subspace alignment analysis\.

Specifically, we pair each Top-$k$ subspace ($k = 1, \ldots, 20$) from each training step with its corresponding subspace in the final checkpoint, compute the cosine similarity, and then average over $k$. The results are shown in Figure [5](https://arxiv.org/html/2605.11739#S3.F5)(b). OPD consistently exhibits stronger alignment with its final subspaces than RL across all evaluated ranks, with smaller fluctuations throughout training. This difference is particularly pronounced in the early stage of training (0%–30%), indicating that OPD stabilizes its dominant update directions earlier than RL, and that this stability extends beyond the Rank-1 direction to multiple dominant subspaces.
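One possible reading of this alignment measure is sketched below. It assumes that the $k$-th "subspace" is the rank-one component $u_k v_k^{\top}$ of the update's SVD, so that the cosine similarity between paired components equals $(u_k^{\top} u_k')(v_k^{\top} v_k')$, which is unaffected by the sign ambiguity of singular vectors. The paper may define the pairing differently.

```python
import numpy as np

def topk_alignment(delta_mid, delta_final, max_k=20):
    """Mean cosine similarity between paired rank-one SVD components.

    Component i of each matrix is u_i v_i^T (unit Frobenius norm), so the
    cosine between paired components is (u_i . u_i') * (v_i . v_i').
    """
    u1, _, vt1 = np.linalg.svd(delta_mid, full_matrices=False)
    u2, _, vt2 = np.linalg.svd(delta_final, full_matrices=False)
    k = min(max_k, u1.shape[1], u2.shape[1])
    sims = [(u1[:, i] @ u2[:, i]) * (vt1[i] @ vt2[i]) for i in range(k)]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
final_update = rng.normal(size=(30, 30))

# An early checkpoint pointing the same way but with a smaller magnitude.
same_direction = topk_alignment(0.1 * final_update, final_update)

# An unrelated update of the same shape.
unrelated = topk_alignment(rng.normal(size=(30, 30)), final_update)
```

Tracking this value across training steps for RL and OPD would yield curves of the kind described for Figure 5(b).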

#### Magnitude Scaling and Performance Recovery\.

The preceding subspace\-alignment analysis shows that the dominant OPD update subspaces are already strongly aligned with their final counterparts at an early stage of training\. Based on this observation, we further investigate the source of the remaining performance gap in early checkpoints: whether this gap arises from insufficiently formed effective update directions, or from underdeveloped update magnitudes along these directions\.

To examine this hypothesis, we perform a module\-wise norm\-scaling intervention on intermediate OPD checkpoints\. For each intermediate checkpoint, we preserve the update direction within each module, while rescaling its Frobenius norm to match that of the corresponding module in the final checkpoint\. We then apply the rescaled update to the base model and evaluate the resulting model, as shown in Figure[5](https://arxiv.org/html/2605.11739#S3.F5)\(c\)\. This intervention allows us to assess how much performance can be recovered when early update directions are given the same module\-wise norm budget as the final checkpoint\.
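The module-wise norm-scaling intervention can be written compactly. The sketch assumes each module is a single named weight matrix in a dictionary; the grouping of parameters into modules and the function name are assumptions made for illustration.

```python
import numpy as np

def modulewise_norm_scale(w_base, w_mid, w_final):
    """Rescale each module's intermediate update to the final update's norm.

    For every module, the direction of (W_mid - W_base) is kept and its
    Frobenius norm is set to that of (W_final - W_base).
    """
    scaled = {}
    for name in w_base:
        delta_mid = w_mid[name] - w_base[name]
        target = np.linalg.norm(w_final[name] - w_base[name])
        norm = np.linalg.norm(delta_mid)
        factor = target / norm if norm > 0 else 0.0
        scaled[name] = w_base[name] + factor * delta_mid
    return scaled

# Toy checkpoints (hypothetical values): an early update that is small,
# and a final update that is larger.
rng = np.random.default_rng(0)
w_base = {"mlp": rng.normal(size=(4, 4)), "attn": rng.normal(size=(4, 4))}
w_mid = {n: w + 0.05 * rng.normal(size=w.shape) for n, w in w_base.items()}
w_final = {n: w + 0.50 * rng.normal(size=w.shape) for n, w in w_base.items()}

w_rescaled = modulewise_norm_scale(w_base, w_mid, w_final)
```

Evaluating `w_rescaled` then measures how much of the final performance the early directions alone can recover.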

The results show that norm scaling markedly improves the performance of early checkpoints\. In particular, a checkpoint at only 10% training progress recovers approximately 80% of the final model’s performance after scaling\. We also observe a reduction in the KL divergence between the rescaled checkpoints and the teacher model, indicating that the scaled updates move the student output distribution closer to the teacher distribution\. These results suggest that early OPD checkpoints already possess task\-relevant update directions, while the limited update magnitudes become a bottleneck that constrains further performance improvement\.

Overall, these experiments separate two aspects of the OPD update trajectory, namely the formation of dominant directions and the growth of update magnitudes, thereby complementing the subspace alignment analysis\. Experimental evidence shows that OPD establishes stable update directions early in training, with subsequent training primarily accumulating magnitude along these directions rather than making large\-scale adjustments to the directions themselves\. We further analyze the geometric and theoretical manifestations of Property 2 in Appendix[F\.2](https://arxiv.org/html/2605.11739#A6.SS2)\-[F\.5](https://arxiv.org/html/2605.11739#A6.SS5)\.

#### Summary\.

This section reveals the core geometric characteristics of OPD’s parameter updates. OPD’s updates exhibit stronger low-rank concentration and stabilize their dominant subspaces early, with subsequent training mainly progressing along these subspaces. We term this Property 2: Early Low-Rank Lock-in, which structurally explains Property 1: Functional Redundancy Avoidance. By locking into efficient low-rank directions early, OPD reduces reliance on redundant exploration and correction, avoids overlearning redundant information, and exhibits stronger foresight at the modular level.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x6.png)Figure 6:Performance comparison of different distillation methods on code and math datasets\.

## 4 Accelerating OPD via Directional Extrapolation

The preceding analysis suggests that OPD establishes highly stable and final-aligned update directions early in training. After this early directional lock-in, later optimization mainly amplifies the update magnitude along the same trajectory, rather than exploring new directions. Motivated by this observation, we propose EffOPD, a plug-and-play acceleration framework that exploits early directional extrapolation to accelerate OPD. We next detail the acceleration procedure and report the corresponding empirical results.

### 4.1 Method

Let $W_t$ denote the model parameters after the $t$-th OPD update. EffOPD triggers an extrapolation search at exponentially spaced checkpoints, i.e., when $t = 2^n$ with $n$ starting from 0, so the first extrapolation is performed at $t = 1$. For the first checkpoint, we use the displacement from the initial parameters to $W_1$ as the local update direction. For subsequent checkpoints with $n \geq 1$, EffOPD estimates the local update direction using the parameter displacement between the current exponential checkpoint and the previous one:

$$\Delta_n = W_{2^n} - W_{2^{n-1}}. \tag{1}$$

This displacement captures the accumulated parameter evolution between two adjacent exponential checkpoints. Since OPD update directions remain relatively stable during training, $\Delta_n$ serves as a local approximation of subsequent update directions.
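The checkpoint schedule and the displacement in Eq. (1) can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: parameters are represented as flat lists of floats, step index 0 stands for the initial parameters, and all function names are hypothetical.

```python
from typing import List


def is_extrapolation_step(t: int) -> bool:
    """True when t is a power of two, i.e. t = 2^n for some n >= 0."""
    return t >= 1 and (t & (t - 1)) == 0


def previous_checkpoint(t: int) -> int:
    """Step whose parameters are subtracted at checkpoint t.

    For t = 1 this is step 0 (the initial parameters); for t = 2^n with
    n >= 1 it is the previous exponential checkpoint 2^(n-1).
    """
    if not is_extrapolation_step(t):
        raise ValueError("t must be a power of two")
    return t // 2


def displacement(current: List[float], previous: List[float]) -> List[float]:
    """Element-wise parameter displacement between two checkpoints."""
    return [c - p for c, p in zip(current, previous)]


# Steps in the first 20 updates at which an extrapolation search would fire.
trigger_steps = [t for t in range(1, 21) if is_extrapolation_step(t)]

# Toy displacement between two hypothetical flattened parameter vectors.
delta = displacement([1.0, 2.0], [0.5, 0.5])
```

Because the triggers are exponentially spaced, the number of extrapolation searches grows only logarithmically with the number of training steps.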

EffOPD then generates five candidate parameters from $W_{2^n}$ along $\Delta_n$ with increasing extrapolation magnitudes. For $k = 1, 2, \cdots, 5$, the $k$-th candidate is defined as:

$$\widetilde{W}_{n,k} = W_{2^n} + 2k\,\Delta_n, \tag{2}$$

where the coefficient $2k$ controls the extrapolation scale. To determine whether the extrapolated parameters remain effective, EffOPD randomly samples 50 examples from the training set to form a lightweight validation set $\mathcal{D}_v$, which is far smaller than the number of sentences generated per step in vanilla OPD. Let $\mathcal{V}_{\mathcal{D}_v}(\cdot)$ denote the validation function. EffOPD initializes the accepted parameters as $W^{\mathrm{acc}} = W_{2^n}$ and its score as $v^{\mathrm{acc}} = \mathcal{V}_{\mathcal{D}_v}(W_{2^n})$. Then EffOPD evaluates $\widetilde{W}_{n,k}$ sequentially. If $\mathcal{V}_{\mathcal{D}_v}(\widetilde{W}_{n,k}) \geq v^{\mathrm{acc}}$, the candidate is accepted, and we update:

$$W^{\mathrm{acc}} \leftarrow \widetilde{W}_{n,k}, \quad v^{\mathrm{acc}} \leftarrow \mathcal{V}_{\mathcal{D}_v}(\widetilde{W}_{n,k}). \tag{3}$$

If the current candidate fails to improve validation performance, the search terminates immediately. Thus, the final accepted parameters $W_{2^n}^{\mathrm{EffOPD}}$ at checkpoint $2^n$ are:

$$W_{2^n}^{\mathrm{EffOPD}} = W^{\mathrm{acc}}. \tag{4}$$

In particular, if the candidate with $k = 1$ already fails, EffOPD degenerates to vanilla OPD. This progressive extrapolation and immediate validation mechanism enables EffOPD to exploit the early directional stability of OPD while avoiding performance degradation caused by excessive extrapolation.
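The candidate search in Eqs. (2)-(4) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are flat lists of floats, `validate` is a hypothetical stand-in for the validation function on the 50-example set, and the toy score (negative squared distance to a target) exists only to make the example runnable.

```python
from typing import Callable, List, Tuple

Params = List[float]


def effopd_search(
    w_curr: Params,
    delta: Params,
    validate: Callable[[Params], float],
    num_candidates: int = 5,
) -> Tuple[Params, float, int]:
    """Sequentially test candidates w_curr + 2k * delta for k = 1..num_candidates.

    Returns the accepted parameters, their validation score, and the largest
    accepted k (0 means no candidate was accepted, i.e. plain OPD).
    """
    w_acc = list(w_curr)
    v_acc = validate(w_acc)
    accepted_k = 0
    for k in range(1, num_candidates + 1):
        candidate = [w + 2 * k * d for w, d in zip(w_curr, delta)]
        score = validate(candidate)
        if score < v_acc:
            break  # stop at the first candidate that scores below the accepted one
        w_acc, v_acc, accepted_k = candidate, score, k
    return w_acc, v_acc, accepted_k


def make_toy_validator(target: Params) -> Callable[[Params], float]:
    """Hypothetical score: higher is better, maximal when params equal target."""
    return lambda w: -sum((a - b) ** 2 for a, b in zip(w, target))


# Candidates are 2.0, 3.0, 4.0, 5.0, ...; the score peaks at 4.0 (k = 3).
w_far, v_far, k_far = effopd_search([1.0], [0.5], make_toy_validator([4.0]))

# The current parameters are already optimal, so k = 1 fails immediately.
w_stay, v_stay, k_stay = effopd_search([1.0], [0.5], make_toy_validator([1.0]))
```

The early exit means the number of validation passes per checkpoint is at most one more than the number of accepted candidates plus one, so a checkpoint where extrapolation does not help costs only two evaluations on the small validation set.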

![Refer to caption](https://arxiv.org/html/2605.11739v1/x7.png) Figure 7: Ablation studies. (a) Effect of different learning rates. (b) Impact of $\mathcal{D}_v$ difficulty on EffOPD. "Extrapolation Acc" denotes the accuracy of the model before training on the sampled $\mathcal{D}_v$. (c) Relationship between training time and performance.

### 4.2 Main Results

To evaluate EffOPD, we conduct experiments on code generation and mathematical reasoning. We use Eurus-RL-Code (Cui et al., [2025a](https://arxiv.org/html/2605.11739#bib.bib114)) and DeepMath-103K (Yang et al., [2026b](https://arxiv.org/html/2605.11739#bib.bib85)) for training, and evaluate models at four scales: 1.5B, 4B, 14B, and 32B. For each scale, the RL-finetuned model serves as the teacher. We report results on seven benchmarks: Codeforces, Taco (Liu et al., [2023](https://arxiv.org/html/2605.11739#bib.bib115)), AIME24, AIME25, AIME26, MINERVA, and GPQA (Ye et al., [2025](https://arxiv.org/html/2605.11739#bib.bib54)). We compare EffOPD with Vanilla OPD, AlphaOPD (Cai et al., [2025](https://arxiv.org/html/2605.11739#bib.bib78)), and ExOPD (Yang et al., [2026b](https://arxiv.org/html/2605.11739#bib.bib85)).

As shown in Figure [6](https://arxiv.org/html/2605.11739#S3.F6), EffOPD consistently improves training efficiency across all model scales and datasets. On mathematical reasoning tasks, it typically begins to converge within about 10 training steps, compared with 30–40 steps for vanilla OPD, yielding more than a $3\times$ speedup. EffOPD also reaches a higher performance upper bound, possibly because prolonged vanilla OPD training may cause over-optimization and semantic drift. Unlike AlphaOPD and ExOPD, which use fixed extrapolation strategies, EffOPD adaptively selects the extrapolation magnitude via validation feedback, leading to more stable acceleration. Its early-stage advantage is especially evident on Qwen3-4B-Non-Thinking, where EffOPD attains strong reasoning performance by the 4th step, further supporting that OPD forms high-quality, well-aligned update directions early in training.

Ablation Studies. We conduct ablation studies to identify the key factors behind EffOPD's effectiveness. As shown in Figure [7](https://arxiv.org/html/2605.11739#S4.F7)(a), the learning rate strongly affects the stability of vanilla OPD: larger learning rates accelerate early convergence but also cause noticeable oscillations and performance instability. In contrast, EffOPD uses lightweight validation during extrapolation to adaptively filter out overly aggressive steps, thereby improving training stability. Figure [7](https://arxiv.org/html/2605.11739#S4.F7)(b) shows that the difficulty of the lightweight validation set $\mathcal{D}_v$ is not critical. Validation sets of different difficulty levels provide consistent directional signals, suggesting that validation mainly serves to check whether the current update direction remains effective rather than to provide precise supervision. Figure [7](https://arxiv.org/html/2605.11739#S4.F7)(c) compares actual training time. Despite the additional validation overhead, EffOPD achieves better performance under the same time budget and converges faster than vanilla OPD, indicating that the gain from exploiting early-stage update directions outweighs the validation cost. Overall, these results support the proposed foresight mechanism: once OPD establishes effective directions early in training, EffOPD can safely extrapolate along them to achieve stable and efficient acceleration.

## 5 Conclusion

In this work, we identify two properties that reveal the underlying "foresight" of OPD: Functional Redundancy Avoidance at the modular level and Early Low-Rank Lock-in at the update-direction level. Through parameter-level analyses across model scales, RL algorithms, and task domains, we show that OPD achieves RL-comparable reasoning gains with more compact and structured updates, as it concentrates optimization on high-utility modules and directions from the early stage of training. Building on this insight, we propose EffOPD, a plug-and-play acceleration method that leverages early directional stability to achieve up to $3\times$ training speedup while maintaining the final performance. Overall, our findings suggest that OPD's efficiency is fundamentally tied to early directional stabilization and compact parameter allocation, offering a new perspective for understanding and accelerating post-training in large language models.

## References

- On\-policy distillation of language models: learning from self\-generated mistakes\.InThe twelfth international conference on learning representations,Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3)\.
- R\. Agarwal, N\. Vieillard, Y\. Zhou, P\. Stanczyk, S\. Ramos, M\. Geist, and O\. Bachem \(2024b\)On\-policy distillation of language models: learning from self\-generated mistakes\.External Links:2306\.13649,[Link](https://arxiv.org/abs/2306.13649)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- A\. Al\-Natoor \(2024\)Norm inequalities for functions of matrices\.Heliyon10\(9\),pp\. e30056\.External Links:ISSN 2405\-8440,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.heliyon.2024.e30056),[Link](https://www.sciencedirect.com/science/article/pii/S2405844024060870)Cited by:[§F\.1](https://arxiv.org/html/2605.11739#A6.SS1.SSS0.Px2),[§3\.1](https://arxiv.org/html/2605.11739#S3.SS1.p1.1)\.
- Y\. Cai, D\. Cao, R\. Guo, Y\. Wen, G\. Liu, and E\. Chen \(2024\)Locating and mitigating gender bias in large language models\.External Links:2403\.14409,[Link](https://arxiv.org/abs/2403.14409)Cited by:[§E\.2](https://arxiv.org/html/2605.11739#A5.SS2.p1.1),[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3.p2.1)\.
- Y\. Cai, D\. Cao, X\. Xu, Z\. Yao, Y\. Huang, Z\. Tan, B\. Zhang, G\. Sun, G\. Liu, and J\. Fang \(2025\)On predictability of reinforcement learning dynamics for large language models\.arXiv preprint arXiv:2510\.00553\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p2.1),[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.7.6.2),[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.8.7.2),[§E\.2](https://arxiv.org/html/2605.11739#A5.SS2.p1.1),[Table 3](https://arxiv.org/html/2605.11739#A5.T3.1.10.9.2),[§F\.1](https://arxiv.org/html/2605.11739#A6.SS1.SSS0.Px4),[§3\.1](https://arxiv.org/html/2605.11739#S3.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.11739#S4.SS2.p1.1)\.
- Z\. Chen, T\. Qian, W\. X\. Zhao, and J\. Wen \(2026\)Low\-rank optimization trajectories modeling for llm rlvr acceleration\.External Links:2604\.11446,[Link](https://arxiv.org/abs/2604.11446)Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p2.1)\.
- G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, Y\. Zhang, J\. Chen, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen, J\. Yuan, H\. Chen, K\. Zhang, X\. Lv, S\. Wang, Y\. Yao, X\. Han, H\. Peng, Y\. Cheng, Z\. Liu, M\. Sun, B\. Zhou, and N\. Ding \(2025a\)Process reinforcement through implicit rewards\.External Links:2502\.01456,[Link](https://arxiv.org/abs/2502.01456)Cited by:[§4\.2](https://arxiv.org/html/2605.11739#S4.SS2.p1.1)\.
- G\. Cui, Y\. Zhang, J\. Chen, L\. Yuan, Z\. Wang, Y\. Zuo, H\. Li, Y\. Fan, H\. Chen, W\. Chen, Z\. Liu, H\. Peng, L\. Bai, W\. Ouyang, Y\. Cheng, B\. Zhou, and N\. Ding \(2025b\)The entropy mechanism of reinforcement learning for reasoning language models\.External Links:2505\.22617,[Link](https://arxiv.org/abs/2505.22617)Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p2.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, S\. Ye, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Zhao, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. 
Zhang \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p3.3),[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards efficient long\-context reasoning with hybrid attention and manifold hyperconnections\.Note:[https://huggingface\.co/deepseek\-ai/DeepSeek\-V4\-Pro/blob/main/DeepSeek\_V4\.pdf](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)Accessed: 2026\-04\-24Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3),[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- C\. Eckart and G\. Young \(1936\)The approximation of one matrix by another of lower rank\.Psychometrika1,pp\. 211–218\.External Links:[Link](https://api.semanticscholar.org/CorpusID:10163399)Cited by:[§E\.1](https://arxiv.org/html/2605.11739#A5.SS1.p4.1),[§F\.3](https://arxiv.org/html/2605.11739#A6.SS3.p3.5)\.
- Y\. Fu, H\. Huang, K\. Jiang, Y\. Zhu, and D\. Zhao \(2026\)Revisiting on\-policy distillation: empirical failure modes and simple fixes\.arXiv preprint arXiv:2603\.25562\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3)\.
- M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson \(2023\)Dissecting recall of factual associations in auto\-regressive language models\.External Links:2304\.14767,[Link](https://arxiv.org/abs/2304.14767)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p3.1)\.
- M\. Geva, A\. Caciularu, K\. R\. Wang, and Y\. Goldberg \(2022\)Transformer feed\-forward layers build predictions by promoting concepts in the vocabulary space\.External Links:2203\.14680,[Link](https://arxiv.org/abs/2203.14680)Cited by:[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3.p2.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\)Transformer feed\-forward layers are key\-value memories\.External Links:2012\.14913,[Link](https://arxiv.org/abs/2012.14913)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3.p2.1)\.
- B\. He, Z\. Qu, Z\. Liu, Y\. Chen, Y\. Zuo, C\. Qian, K\. Zhang, W\. Chen, C\. Xiao, G\. Cui, N\. Ding, and Z\. Liu \(2025a\)JustRL: scaling a 1\.5b llm with a simple rl recipe\.External Links:2512\.16649,[Link](https://arxiv.org/abs/2512.16649)Cited by:[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.2.1.2),[Table 3](https://arxiv.org/html/2605.11739#A5.T3.1.2.1.2)\.
- B\. He, Y\. Zuo, Z\. Liu, S\. Zhao, Z\. Fu, J\. Yang, C\. Qian, K\. Zhang, Y\. Fan, G\. Cui,et al\.\(2026\)How far can unsupervised rlvr scale llm training?\.arXiv preprint arXiv:2603\.08660\.Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- Z\. He, T\. Liang, J\. Xu, Q\. Liu, X\. Chen, Y\. Wang, L\. Song, D\. Yu, Z\. Liang, W\. Wang, Z\. Zhang, R\. Wang, Z\. Tu, H\. Mi, and D\. Yu \(2025b\)DeepMath\-103k: a large\-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning\.External Links:2504\.11456,[Link](https://arxiv.org/abs/2504.11456)Cited by:[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p3.3)\.
- J\. Hu, M\. Liu, X\. Lu, F\. Wu, Z\. Harchaoui, S\. Diao, Y\. Choi, P\. Molchanov, J\. Yang, J\. Kautz, and Y\. Dong \(2025a\)BroRL: scaling reinforcement learning via broadened exploration\.ArXivabs/2510\.01180\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281705585)Cited by:[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.3.2.2),[Table 3](https://arxiv.org/html/2605.11739#A5.T3.1.4.3.2)\.
- J\. Hu, Y\. Zhang, Q\. Han, D\. Jiang, X\. Zhang, and H\. Shum \(2025b\)Open\-reasoner\-zero: an open source approach to scaling up reinforcement learning on the base model\.External Links:2503\.24290,[Link](https://arxiv.org/abs/2503.24290)Cited by:[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.6.5.2)\.
- D\. Khatri, L\. Madaan, R\. Tiwari, R\. Bansal, S\. S\. Duvvuri, M\. Zaheer, I\. S\. Dhillon, D\. Brandfonbrener, and R\. Agarwal \(2025\)The art of scaling reinforcement learning compute for llms\.arXiv preprint arXiv:2510\.13786\.Cited by:[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px2.p1.1)\.
- J\. Kim, X\. Luo, M\. Kim, S\. Lee, D\. Kim, J\. Jeon, D\. Li, and Y\. Yang \(2026\)Why does self\-distillation \(sometimes\) degrade the reasoning capability of llms?\.External Links:[Link](https://api.semanticscholar.org/CorpusID:286776340)Cited by:[§D\.1](https://arxiv.org/html/2605.11739#A4.SS1.p1.1)\.
- Y\. Koren, R\. Bell, and C\. Volinsky \(2009\)Matrix factorization techniques for recommender systems\.Computer42\(8\),pp\. 30–37\.External Links:[Document](https://dx.doi.org/10.1109/MC.2009.263)Cited by:[§3\.1](https://arxiv.org/html/2605.11739#S3.SS1.p1.1)\.
- Y\. Li, Y\. Zuo, B\. He, J\. Zhang, C\. Xiao, C\. Qian, T\. Yu, H\. Gao, W\. Yang, Z\. Liu,et al\.\(2026\)Rethinking on\-policy distillation of large language models: phenomenology, mechanism, and recipe\.arXiv preprint arXiv:2604\.13016\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3),[§2\.1](https://arxiv.org/html/2605.11739#S2.SS1.p1.2)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p3.3),[§E\.2](https://arxiv.org/html/2605.11739#A5.SS2.p10.2)\.
- J\. Liu, C\. S\. Xia, Y\. Wang, and L\. Zhang \(2023\)Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation\.External Links:2305\.01210,[Link](https://arxiv.org/abs/2305.01210)Cited by:[§4\.2](https://arxiv.org/html/2605.11739#S4.SS2.p1.1)\.
- M\. Liu, S\. Diao, X\. Lu, J\. Hu, X\. Dong, Y\. Choi, J\. Kautz, and Y\. Dong \(2025\)ProRL: prolonged reinforcement learning expands reasoning boundaries in large language models\.ArXivabs/2505\.24864\.External Links:[Link](https://api.semanticscholar.org/CorpusID:279071277)Cited by:[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.4.3.2),[Table 3](https://arxiv.org/html/2605.11739#A5.T3.1.6.5.2)\.
- R\. Mathias \(1990\)The spectral norm of a nonnegative matrix\.Linear Algebra and its Applications139,pp\. 269–284\.External Links:ISSN 0024\-3795,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0024-3795%2890%2990403-Y),[Link](https://www.sciencedirect.com/science/article/pii/002437959090403Y)Cited by:[§F\.1](https://arxiv.org/html/2605.11739#A6.SS1.SSS0.Px1),[§3\.1](https://arxiv.org/html/2605.11739#S3.SS1.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2023\)Locating and editing factual associations in gpt\.External Links:2202\.05262,[Link](https://arxiv.org/abs/2202.05262)Cited by:[§E\.2](https://arxiv.org/html/2605.11739#A5.SS2.p2.1),[§1](https://arxiv.org/html/2605.11739#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3.p2.1)\.
- OpenAI \(2025\)Introducing gpt\-oss\.Note:[https://openai\.com/zh\-Hans\-CN/index/introducing\-gpt\-oss/](https://openai.com/zh-Hans-CN/index/introducing-gpt-oss/)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§2\.1](https://arxiv.org/html/2605.11739#S2.SS1.p1.2)\.
- O\. Roy and M\. Vetterli \(2007\)The effective rank: a measure of effective dimensionality\.pp\. 606–610\.External Links:[Link](https://infoscience.epfl.ch/handle/20.500.14299/10320)Cited by:[§F\.1](https://arxiv.org/html/2605.11739#A6.SS1.SSS0.Px3),[§3\.1](https://arxiv.org/html/2605.11739#S3.SS1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\)HybridFlow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,EuroSys ’25,pp\. 1279–1297\.External Links:[Link](http://dx.doi.org/10.1145/3689031.3696075),[Document](https://dx.doi.org/10.1145/3689031.3696075)Cited by:[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p2.2)\.
- S\. Shi \(2021\)Visualizing data using gtsne\.External Links:2108\.01301,[Link](https://arxiv.org/abs/2108.01301)Cited by:[§E\.1](https://arxiv.org/html/2605.11739#A5.SS1.p4.1),[§F\.3](https://arxiv.org/html/2605.11739#A6.SS3.p1.1)\.
- O\. Skean, M\. R\. Arefin, D\. Zhao, N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Layer by layer: uncovering hidden representations in language models\.arXiv preprint arXiv:2502\.02013\.Cited by:[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3.p2.1)\.
- M\. Song and M\. Zheng \(2026\)A survey of on\-policy distillation for large language models\.arXiv preprint arXiv:2604\.00626\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3)\.
- Z\. Tan, H\. Geng, X\. Yu, M\. Zhang, G\. Wan, Y\. Zhou, Q\. He, X\. Xue, H\. Zhou, Y\. Fan, Z\. Li, Z\. Zhang, G\. Zhang, C\. Zhang, Z\. Yin, P\. Torr, and L\. Bai \(2026\)Scaling behaviors of llm reinforcement learning post\-training: an empirical study in mathematical reasoning\.External Links:2509\.25300,[Link](https://arxiv.org/abs/2509.25300)Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p2.1)\.
- V\. Venkatkrishna, I\. Paul, and I\. Gurevych \(2026\)Aletheia: what makes rlvr for code verifiers tick?\.ArXivabs/2601\.12186\.External Links:[Link](https://api.semanticscholar.org/CorpusID:284911417)Cited by:[§D\.1](https://arxiv.org/html/2605.11739#A4.SS1.SSS0.Px1.p1.9),[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- J\. Vig, S\. Gehrmann, Y\. Belinkov, S\. Qian, D\. Nevo, Y\. Singer, and S\. Shieber \(2020\)Investigating gender bias in language models using causal mediation analysis\.Advances in neural information processing systems33,pp\. 12388–12401\.Cited by:[§E\.2](https://arxiv.org/html/2605.11739#A5.SS2.p2.1)\.
- B\. Xiao, B\. Xia, B\. Yang, B\. Gao, B\. Shen, C\. Zhang, C\. He, C\. Lou, F\. Luo, G\. Wang,et al\.\(2026\)Mimo\-v2\-flash technical report\.arXiv preprint arXiv:2601\.02780\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3),[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- Y\. Xu, H\. Sang, Z\. Zhou, R\. He, Z\. Wang, and A\. Geramifard \(2026\)TIP: token importance in on\-policy distillation\.arXiv preprint arXiv:2604\.14084\.Cited by:[§F\.5](https://arxiv.org/html/2605.11739#A6.SS5.SSS0.Px6.p3.4)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3),[§1](https://arxiv.org/html/2605.11739#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.11739#S2.SS1.p1.2)\.
- W\. Yang, W\. Liu, R\. Xie, K\. Yang, S\. Yang, and Y\. Lin \(2026a\)Learning beyond teacher: generalized on\-policy distillation with reward extrapolation\.arXiv preprint arXiv:2602\.12125\.Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p1.3)\.
- W\. Yang, W\. Liu, R\. Xie, K\. Yang, S\. Yang, and Y\. Lin \(2026b\)Learning beyond teacher: generalized on\-policy distillation with reward extrapolation\.ArXivabs/2602\.12125\.External Links:[Link](https://api.semanticscholar.org/CorpusID:285540530)Cited by:[§D\.1](https://arxiv.org/html/2605.11739#A4.SS1.SSS0.Px2.p1.3),[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p5.4),[Table 3](https://arxiv.org/html/2605.11739#A5.T3.1.8.7.2),[§1](https://arxiv.org/html/2605.11739#S1.p6.1),[§4\.2](https://arxiv.org/html/2605.11739#S4.SS2.p1.1)\.
- Y\. Ye, Z\. Huang, Y\. Xiao, E\. Chern, S\. Xia, and P\. Liu \(2025\)LIMO: less is more for reasoning\.External Links:2502\.03387,[Link](https://arxiv.org/abs/2502.03387)Cited by:[§4\.2](https://arxiv.org/html/2605.11739#S4.SS2.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source llm reinforcement learning system at scale\.External Links:2503\.14476,[Link](https://arxiv.org/abs/2503.14476)Cited by:[§D\.2](https://arxiv.org/html/2605.11739#A4.SS2.p3.3),[Table 2](https://arxiv.org/html/2605.11739#A4.T2.1.11.10.2),[§2\.1](https://arxiv.org/html/2605.11739#S2.SS1.p1.2)\.
- Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang \(2025\)Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?\.External Links:2504\.13837,[Link](https://arxiv.org/abs/2504.13837)Cited by:[Appendix B](https://arxiv.org/html/2605.11739#A2.p2.1),[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- K\. Zhang, Y\. Zuo, B\. He, Y\. Sun, R\. Liu, C\. Jiang, Y\. Fan, K\. Tian, G\. Jia, P\. Li, Y\. Fu, X\. Lv, Y\. Zhang, S\. Zeng, S\. Qu, H\. Li, S\. Wang, Y\. Wang, X\. Long, F\. Liu, X\. Xu, J\. Ma, X\. Zhu, E\. Hua, Y\. Liu, Z\. Li, H\. Chen, X\. Qu, Y\. Li, W\. Chen, Z\. Yuan, J\. Gao, D\. Li, Z\. Ma, G\. Cui, Z\. Liu, B\. Qi, N\. Ding, and B\. Zhou \(2025a\)A survey of reinforcement learning for large reasoning models\.ArXivabs/2509\.08827\.External Links:[Link](https://api.semanticscholar.org/CorpusID:281247204)Cited by:[§D\.1](https://arxiv.org/html/2605.11739#A4.SS1.p1.1)\.
- K\. Zhang, Y\. Zuo, B\. He, Y\. Sun, R\. Liu, C\. Jiang, Y\. Fan, K\. Tian, G\. Jia, P\. Li, Y\. Fu, X\. Lv, Y\. Zhang, S\. Zeng, S\. Qu, H\. Li, S\. Wang, Y\. Wang, X\. Long, F\. Liu, X\. Xu, J\. Ma, X\. Zhu, E\. Hua, Y\. Liu, Z\. Li, H\. Chen, X\. Qu, Y\. Li, W\. Chen, Z\. Yuan, J\. Gao, D\. Li, Z\. Ma, G\. Cui, Z\. Liu, B\. Qi, N\. Ding, and B\. Zhou \(2025b\)A survey of reinforcement learning for large reasoning models\.External Links:2509\.08827,[Link](https://arxiv.org/abs/2509.08827)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p1.1)\.
- S\. Zhang, L\. Dong, X\. Li, S\. Zhang, X\. Sun, S\. Wang, J\. Li, R\. Hu, T\. Zhang, F\. Wu, and G\. Wang \(2025c\)Instruction tuning for large language models: a survey\.External Links:2308\.10792,[Link](https://arxiv.org/abs/2308.10792)Cited by:[§2\.1](https://arxiv.org/html/2605.11739#S2.SS1.p1.2)\.
- Z\. Zhang \(2015\)The singular value decomposition, applications and beyond\.External Links:1510\.08532,[Link](https://arxiv.org/abs/1510.08532)Cited by:[§1](https://arxiv.org/html/2605.11739#S1.p4.1)\.
- C\. Zheng, K\. Dang, B\. Yu, M\. Li, H\. Jiang, J\. Lin, Y\. Liu, H\. Lin, C\. Wu, F\. Hu, A\. Yang, J\. Zhou, and J\. Lin \(2025\)Stabilizing reinforcement learning with llms: formulation and practices\.External Links:2512\.01374,[Link](https://arxiv.org/abs/2512.01374)Cited by:[§2\.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px2.p1.1)\.

## Appendix A Impact Statement

This paper presents work whose goal is to advance the understanding and efficiency of post\-training for large language models, particularly on\-policy distillation\. We believe that our work conforms with the NeurIPS Code of Ethics\. The proposed analysis and EffOPD method may help reduce the computational cost of post\-training and make efficient model improvement more accessible\. However, more efficient post\-training techniques could also be misused to enhance or adapt models for harmful applications\. We encourage responsible use of these methods, together with appropriate safety evaluation and deployment safeguards\.

## Appendix B Related Work

On-policy Distillation (OPD). In this paradigm, the student generates its own samples and receives dense supervisory signals from the teacher (Agarwal et al., [2024a](https://arxiv.org/html/2605.11739#bib.bib71)). Qwen3 (Yang et al., [2025](https://arxiv.org/html/2605.11739#bib.bib42)) demonstrates that OPD achieves substantially higher training efficiency than RLVR. Meanwhile, MiMo-V2-Flash (Xiao et al., [2026](https://arxiv.org/html/2605.11739#bib.bib72)) and DeepSeek-V4 (DeepSeek-AI, [2026](https://arxiv.org/html/2605.11739#bib.bib90)) integrate multiple teacher skills into a small model via multi-task on-policy distillation. Song and Zheng ([2026](https://arxiv.org/html/2605.11739#bib.bib73)) present the first systematic survey of OPD for large language models, proposing a unified $f$-divergence framework grounded in on-policy samples. Fu et al. ([2026](https://arxiv.org/html/2605.11739#bib.bib74)) prove that token-level OPD is biased relative to the sequence-level reverse-KL objective but has a tighter variance bound of $O(T^{2})$ versus $O(T^{4})$. Yang et al. ([2026a](https://arxiv.org/html/2605.11739#bib.bib75)) establish a theoretical equivalence between token-level distillation and RLVR. Li et al. ([2026](https://arxiv.org/html/2605.11739#bib.bib77)) systematically investigate the training dynamics of OPD and identify two necessary conditions for success: (i) the student and teacher must share compatible thinking patterns, and (ii) the teacher must offer genuinely novel capabilities beyond what the student has encountered during training.

Emergent Behaviors of On-Policy Training. Yue et al. ([2025](https://arxiv.org/html/2605.11739#bib.bib40)) investigated the differences in sampling between base models and RL-fine-tuned models, showing that RL improves sampling efficiency for pass@1 but does not directly enhance reasoning ability. Cui et al. ([2025b](https://arxiv.org/html/2605.11739#bib.bib26)) identified the phenomenon of "entropy collapse" in reinforcement learning, where rapid early convergence causes the model to become overly confident, prematurely degrading its exploratory capacity. Through systematic experiments across models of varying scales, Tan et al. ([2026](https://arxiv.org/html/2605.11739#bib.bib79)) reveal a power-law relationship between test loss, computational budget, and data volume during RL post-training of LLMs, demonstrating that larger models consistently exhibit superior learning efficiency. Cai et al. ([2025](https://arxiv.org/html/2605.11739#bib.bib78)) investigate RL from the perspective of parameter dynamics. They uncover two fundamental properties of RL-induced updates: Rank-1 dominance and Rank-1 linear dynamics. Based on these insights, their AlphaRL framework achieves $3\times$ training acceleration. Building on this, Chen et al. ([2026](https://arxiv.org/html/2605.11739#bib.bib108)) train a predictor that directly forecasts the evolution direction of subsequent optimization subspaces using the early Rank-1 subspace. Different from previous studies focusing on RL's low-rank trajectories, this work finds that OPD's efficiency advantage over RL stems from the unique synergy between modular redundancy suppression and early directional stabilization.

## Appendix C Limitations and Future Work

Despite our identification of two properties of OPD, this study has several limitations\. First, although these properties are validated from multiple perspectives, their applicability to more complex settings, such as multi\-turn agent tasks and multimodal reasoning, remains to be further examined\. These settings may introduce stronger distributional shifts and more complex teacher\-student residual structures\. Second, our theoretical analysis in Appendix is inherently local, characterizing OPD dynamics only in a neighborhood of the base model and therefore not fully capturing the global non\-convex behavior of large\-scale post\-training\.

These limitations point to several directions for future work\. A more complete theory should account for the coupling between the distillation objective, the evolving on\-policy distribution, and the spectral evolution of parameter updates\. In addition, the early directional lock\-in observed in OPD may serve as a useful diagnostic signal for monitoring post\-training dynamics\. Metrics such as directional alignment, spectral concentration, and update compactness could help assess training progress and stability, thereby supporting more adaptive and efficient on\-policy distillation methods for large language models\.

## Appendix D Preliminaries and Experimental Setup

### D.1 Preliminaries

In our experiments, we focus on two training paradigms: Reinforcement Learning (Zhang et al., [2025a](https://arxiv.org/html/2605.11739#bib.bib82)) and On-Policy Distillation (Kim et al., [2026](https://arxiv.org/html/2605.11739#bib.bib83)). Let $\pi_{\theta}$ denote the policy model to be optimized.

#### Reinforcement Learning \(RL\)\.

The RL objective can be formulated as:

$$J_{\text{RL}}(\theta)=\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot\mid x)}\left[r(x,y)-\beta D_{\text{KL}}\bigl(\pi_{\theta}\parallel\pi_{\text{ref}}\bigr)\right],\tag{5}$$

where the trajectory $y=(y_{1},\ldots,y_{T})$ is sampled from the current policy $\pi_{\theta}$, ensuring on-policy training. The function $r(x,y)$ measures the quality of response $y$ to query $x$. In the Reinforcement Learning from Verifiable Rewards (RLVR) setting (Venkatkrishna et al., [2026](https://arxiv.org/html/2605.11739#bib.bib84)), $r(x,y)$ is a deterministic verifiable reward (e.g., answer correctness or unit test passing), requiring no learned reward model. The term $D_{\text{KL}}(\pi_{\theta}\parallel\pi_{\text{ref}})$ is a KL constraint that prevents the policy from deviating too far from a reference model $\pi_{\text{ref}}$, with $\beta$ controlling the constraint strength.

To optimize Eq\. \([5](https://arxiv.org/html/2605.11739#A4.E5)\), policy gradient methods are commonly used, yielding the following gradient estimate:

$$\nabla_{\theta}J_{\text{RL}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}A_{t}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid x,y_{<t})\right],\tag{6}$$

where $A_{t}$ is the advantage of token $y_{t}$ relative to a baseline. In practice, the reward signal in RLVR is often sparse, as the policy only receives a reward upon completion of the full response.

#### On\-Policy Distillation \(OPD\)\.

OPD inherits the on-policy nature of policy training while leveraging dense supervisory signals from a teacher model, making it an efficient post-training paradigm (Yang et al., [2026b](https://arxiv.org/html/2605.11739#bib.bib85)). The core idea is to let the student model $\pi_{\theta}$ generate its own trajectories $y$, and then minimize the reverse KL divergence between the student and a fixed teacher model $\pi^{*}$ on these student-generated trajectories:

$$J_{\text{OPD}}(\theta)=\min_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot\mid x)}\left[D_{\text{KL}}\bigl(\pi_{\theta}(y\mid x)\parallel\pi^{*}(y\mid x)\bigr)\right].\tag{7}$$

Note that the trajectories $y$ in Eq. ([7](https://arxiv.org/html/2605.11739#A4.E7)) are sampled from the student policy $\pi_{\theta}$ itself, preserving the on-policy property. The corresponding gradient is:

$$\nabla_{\theta}J_{\text{OPD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\sum_{t^{\prime}=t}^{T}\Bigl(\log\pi_{\theta}(y_{t^{\prime}}\mid x,y_{<t^{\prime}})-\log\pi^{*}(y_{t^{\prime}}\mid x,y_{<t^{\prime}})\Bigr)\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid x,y_{<t})\right].\tag{8}$$

In practice, following prior work, a common approximation sets the discount factor to zero, focusing on immediate token-level optimization:

$$\nabla_{\theta}J_{\text{OPD}}(\theta)\approx\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot\mid x)}\left[\sum_{t=1}^{T}\Bigl(\log\pi_{\theta}(y_{t}\mid x,y_{<t})-\log\pi^{*}(y_{t}\mid x,y_{<t})\Bigr)\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid x,y_{<t})\right].\tag{9}$$

This approximation provides a dense learning signal at every token position, enabling OPD to achieve significantly higher training efficiency compared to RLVR with its sparse reward signal.
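To make the difference between Eq. (8) and Eq. (9) concrete, the following minimal sketch computes the scalar coefficient that multiplies $\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid x,y_{<t})$ at each position. The function name and the toy log-probabilities are illustrative assumptions, not part of the paper's codebase; the inputs are assumed to be log-probabilities of the sampled tokens under the student and the teacher.

```python
import numpy as np

def opd_token_weights(student_logp, teacher_logp):
    """Per-token coefficients in front of grad log pi_theta(y_t | x, y_<t).

    Returns (full, token_level), where
      full[t]        = sum over t' >= t of (log pi_theta - log pi*)   # Eq. (8)
      token_level[t] = (log pi_theta - log pi*) at position t         # Eq. (9)
    """
    student_logp = np.asarray(student_logp, dtype=float)
    teacher_logp = np.asarray(teacher_logp, dtype=float)
    log_ratio = student_logp - teacher_logp
    # Reverse cumulative sum: accumulate the log-ratio from position t to T.
    full = np.cumsum(log_ratio[::-1])[::-1]
    return full, log_ratio

# Hypothetical 3-token response (values chosen only for illustration).
student = [-0.5, -1.0, -0.2]
teacher = [-0.4, -0.3, -0.2]
full, token_level = opd_token_weights(student, teacher)
print("Eq. (8) coefficients:", full)
print("Eq. (9) coefficients:", token_level)
```

In the token-level form each position depends only on its own log-ratio, so a token on which student and teacher already agree receives a zero coefficient regardless of what follows.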

### D.2 Experimental Setup

Table 2: Summary of models considered in this study.

To ensure the generality of our findings, we conduct experiments across multiple model scales, ranging from 1.5B to 32B parameters. Our experimental models include publicly available pre-trained checkpoints (e.g., Qwen2.5-7B and Qwen3-4B), as well as models locally trained using the Verl framework. For RL methods, we consider three representative algorithms (PPO, GRPO, and DAPO) and apply them to models of varying scales. For all OPD student models reported in Table [2](https://arxiv.org/html/2605.11739#A4.T2), the capability-aligned teacher is consistently the RL-tuned version of its own base model (i.e., the RL model listed in the same table); for Qwen3-8B-Base, we also use Qwen3-14B-Base-DAPO as the teacher to ensure the generality of our conclusions.

For models trained with reinforcement learning locally, we adapt our training codebase using Verl (Sheng et al., [2025](https://arxiv.org/html/2605.11739#bib.bib44)) and follow the corresponding training setups. All methods share the same core configuration: the maximum prompt length is 2,048 tokens and the maximum response length is 20,480 tokens, yielding a total budget of 22,528 tokens. During training, each backward pass uses a mini-batch of 32 samples, and gradients are accumulated for 16 iterations before a single optimization step is performed, resulting in an effective batch size of 512 under Float16 precision. Each prompt generates $n=16$ outputs during rollout. The learning rate is set to $1\times 10^{-6}$ with warmup, and gradient clipping of 1.0 is applied. We monitor the average reward per training batch and terminate training once the reward fails to improve for five consecutive steps.
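The budget figures quoted above follow from simple arithmetic, which the short check below reproduces. The variable names are illustrative and do not correspond to Verl configuration keys.

```python
# Values quoted in the RL setup paragraph above.
max_prompt_len = 2_048
max_response_len = 20_480
samples_per_backward = 32   # mini-batch per backward pass
grad_accum_steps = 16       # backward passes per optimizer step

token_budget = max_prompt_len + max_response_len
effective_batch = samples_per_backward * grad_accum_steps

print(token_budget)     # 22528
print(effective_batch)  # 512
```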

In addition to the unified configuration described above, each method adopts specific hyperparameter settings in our experiments. For GRPO, we set both the high and low clipping ratios to 0.2 and apply a KL loss with coefficient $0.001$, following DeepSeek-AI et al. ([2025](https://arxiv.org/html/2605.11739#bib.bib46)). For DAPO, we employ techniques such as clip-higher, dynamic sampling, token-level policy gradient loss, and overlong reward shaping, and apply the recommended hyperparameters from Yu et al. ([2025](https://arxiv.org/html/2605.11739#bib.bib47)): the clipping ratios are set to $\epsilon_{\text{low}}=0.2$ and $\epsilon_{\text{high}}=0.28$, and KL divergence terms are removed entirely. We perform RLVR training on Qwen3-14B-Base models using the DeepMath-103K (He et al., [2025b](https://arxiv.org/html/2605.11739#bib.bib5)) and MATH-12K (Lightman et al., [2023](https://arxiv.org/html/2605.11739#bib.bib4)) datasets. For the Qwen3-14B models, we conduct rollout and training in non-thinking mode and employ the built-in chat template, specified as follows:

```
User:
{question}
Please reason step by step, and put your final answer within \boxed{}.
<think>
</think>
Assistant: {CoT}
```

For OPD, we follow the setting of Yang et al. ([2026b](https://arxiv.org/html/2605.11739#bib.bib85)). The maximum prompt length is 2,048 tokens and the maximum response length is 16,384 tokens, yielding a total budget of 18,432 tokens. The prompt batch size is 1,024, and each prompt generates $n=1$ output during rollout. The learning rate is set to $1\times 10^{-6}$ without warmup, and training runs for a total of 3 epochs. The OPD training command using the verl framework is given below. All of our training runs are conducted on $8\times$ or $32\times$ H20 96GB GPUs.

OPD Training Command

```
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    algorithm.rollout_correction.rollout_is=token \
    algorithm.rollout_correction.rollout_is_threshold=5.0 \
    algorithm.rollout_correction.rollout_rs=null \
    algorithm.rollout_correction.bypass_mode=false \
    actor_rollout_ref.rollout.calculate_log_probs=true \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/val.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=2048 \
    data.max_response_length=16384 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.shuffle=True \
    data.seed=42 \
    data.return_raw_chat=True \
    +data.apply_chat_template_kwargs.enable_thinking=False \
    actor_rollout_ref.model.path=$MODEL_PATH \
    +actor_rollout_ref.ref.model.path=$TEACHER_MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.0 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.policy_loss.only_reverse_kl_advantages=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=1024 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=22000 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.rollout.max_num_batched_tokens=22000 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
    actor_rollout_ref.rollout.val_kwargs.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.n=32 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    reward_model.reward_manager=naive \
    trainer.critic_warmup=0 \
    trainer.val_before_train=True \
    trainer.logger='["console","wandb"]' \
    trainer.log_val_generations=10 \
    trainer.project_name='on-policy-distillation' \
    trainer.experiment_name='on-policy-distillation' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=2 \
    trainer.default_local_dir=/path/to/save/dir \
    trainer.test_freq=2 \
    trainer.total_epochs=3 $@
```

## Appendix E Property 1 Additional Experiment

### E.1 Additional Experiment

This section provides additional empirical evidence to further validate Property 1 (Functional Redundancy Avoidance) introduced in Section [2](https://arxiv.org/html/2605.11739#S2).

We begin by examining the scaling behavior across model sizes. Figure [8](https://arxiv.org/html/2605.11739#A5.F8) presents the scaling results on final checkpoints for models ranging from 1.5B to 32B parameters. Across all scales, we observe a consistent pattern: OPD achieves reasoning performance comparable to that of RL while requiring substantially smaller parameter update norms. This result suggests that the functional efficiency of OPD is not a scale-specific artifact, but rather an intrinsic property that generalizes across model sizes. We attribute this behavior to OPD's ability to systematically suppress functionally redundant updates, thereby concentrating the update budget on more effective directions.

We next investigate whether this advantage persists across different reinforcement learning algorithms. Figure [9](https://arxiv.org/html/2605.11739#A5.F9) extends the analysis to a broader set of RL methods. Across all examined algorithms, OPD consistently demonstrates superior parameter update efficiency throughout the training trajectory. This advantage holds regardless of the specific learning dynamics or convergence behavior of the teacher RL method, indicating that the efficiency gain arises from the structural properties of OPD updates rather than the choice of the underlying RL algorithm. Taken together, these results provide consistent cross-scale and cross-algorithm evidence that OPD achieves comparable or even superior reasoning performance with significantly improved parameter efficiency.

While the main text shows that embedding layer updates contribute negligibly to reasoning performance, it does not explicitly analyze their distributional shift relative to the base model. To address this, we sample reasoning sequences generated by the base model and extract their token embeddings. We then visualize the embedding shifts using PCA (Eckart and Young, [1936](https://arxiv.org/html/2605.11739#bib.bib102)) and t-SNE (Shi, [2021](https://arxiv.org/html/2605.11739#bib.bib103)), and quantify the distributional differences via cosine similarity between token representations. As shown in Figure [11](https://arxiv.org/html/2605.11739#A5.F11) and Table [3](https://arxiv.org/html/2605.11739#A5.T3), OPD consistently exhibits smaller embedding shifts than RL across all model scales, and maintains higher similarity to the base representations. These findings indicate that, despite their limited functional contribution, embedding layers in OPD still undergo more constrained and compact updates, effectively avoiding the unnecessary drift commonly observed in RL. This suggests that OPD enforces compact updates not only in critical modules but also in functionally peripheral regions.

Finally, we validate the component-level properties identified in the main text under a broader range of datasets and algorithmic settings. These properties include the negligible contribution of embedding layers, the functional dominance of middle-layer MLPs, and the consistent redundancy suppression pattern across architectural components. As shown in Figure [10](https://arxiv.org/html/2605.11739#A5.F10), the results consistently support these observations, further reinforcing that Property 1 reflects an intrinsic and stable characteristic of OPD's parameter update dynamics, rather than an artifact of specific experimental conditions.

### E.2 Detailed Setup of Sliding-Window Intervention Analysis

This section provides a formal description of the sliding-window intervention analysis used in Section [2.2](https://arxiv.org/html/2605.11739#S2.SS2.SSS0.Px3). The goal of this analysis is to localize the contribution of parameter updates across different layers and modules (Cai et al., [2024](https://arxiv.org/html/2605.11739#bib.bib105), [2025](https://arxiv.org/html/2605.11739#bib.bib78)), and to examine whether redundant updates in reinforcement learning (RL) are primarily concentrated in functionally non-critical regions.

The core idea of this method is to inject parameter updates into localized regions of the network and measure the resulting performance change (Meng et al., [2023](https://arxiv.org/html/2605.11739#bib.bib3); Vig et al., [2020](https://arxiv.org/html/2605.11739#bib.bib104)). Compared to full-model replacement, this localized intervention allows us to isolate the marginal functional contribution of updates at different depths, thereby enabling a fine-grained characterization of the relationship between update location and functional impact.

We consider a Transformer model with $L$ layers, where each layer consists of two core modules: Attention and MLP. Let $\Delta W_{\text{RL/OPD}}^{(i,\text{Attn})}$ and $\Delta W_{\text{RL/OPD}}^{(i,\text{MLP})}$ denote the parameter updates of the Attention and MLP modules at layer $i$, respectively.

For a target layer $l$, we define the sliding window as:

$$\mathcal{W}_{l}=\left\{i\in\mathbb{Z}\;\middle|\;\max(1,\,l-8)\leq i\leq\min(L,\,l+8)\right\}.\tag{10}$$

The window is centered at layer $l$ and extends 8 layers to both sides, resulting in a maximum width of 17 layers. Near the model boundaries, the window is truncated accordingly. This design balances locality and stability: by covering neighboring layers, it mitigates the high variance associated with single-layer interventions while preserving spatial resolution.
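The window definition in Eq. (10) can be sketched as a small helper. The function name and the 36-layer example are illustrative assumptions; layers are indexed from 1 as in the text.

```python
def sliding_window(center: int, num_layers: int, half_width: int = 8) -> list[int]:
    """1-indexed layer indices in W_l = {i : max(1, l-8) <= i <= min(L, l+8)}."""
    if not 1 <= center <= num_layers:
        raise ValueError("center must lie in [1, num_layers]")
    lo = max(1, center - half_width)
    hi = min(num_layers, center + half_width)
    return list(range(lo, hi + 1))

# Hypothetical 36-layer model.
print(sliding_window(1, 36))        # layers 1..9, truncated at the lower boundary
print(len(sliding_window(18, 36)))  # 17, the full window width
```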

To isolate the independent contributions of MLP and Attention modules, we construct two types of intervened models\. In each setting, only the parameters of the specified module within the sliding window are replaced, while all other parameters are fixed to those of the base model\.

MLP Intervention:

$$W_{\text{MLP},l}^{\text{(interv)}}=\begin{cases}W_{\text{Base}}^{(i,\text{MLP})}+\Delta W_{\text{RL/OPD}}^{(i,\text{MLP})},&i\in\mathcal{W}_{l}\\ W_{\text{Base}}^{(i,\text{MLP})},&i\notin\mathcal{W}_{l}.\end{cases}\tag{11}$$

All Attention parameters are fixed to $W_{\text{Base}}^{(i,\text{Attn})}$.

Attention Intervention:

$$W_{\text{Attn},l}^{\text{(interv)}}=\begin{cases}W_{\text{Base}}^{(i,\text{Attn})}+\Delta W_{\text{RL/OPD}}^{(i,\text{Attn})},&i\in\mathcal{W}_{l}\\ W_{\text{Base}}^{(i,\text{Attn})},&i\notin\mathcal{W}_{l}.\end{cases}\tag{12}$$

All MLP parameters are fixed to $W_{\text{Base}}^{(i,\text{MLP})}$.

This intervention strategy effectively constructs a *local update injection – global performance response* analysis framework, allowing us to attribute overall performance changes to specific layers and modules, and thereby reveal the functional distribution of parameter updates across the network.

In practice, we iterate over all valid window centers $l=1,2,\dots,L-8$, construct the two types of intervened models for each $l$, and evaluate their accuracy on MATH500 (Lightman et al., [2023](https://arxiv.org/html/2605.11739#bib.bib4)). Each intervened model is evaluated using four independent forward passes, and the results are averaged to reduce evaluation noise.
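The intervention procedure of Eqs. (11) and (12) can be sketched on a toy parameter dictionary. The data layout (a dict keyed by layer index and module name), the function name, and the toy model are assumptions made for illustration; the evaluation step is left out because the paper's evaluation harness is not shown.

```python
import numpy as np

def intervene(base, delta, module, center, num_layers, half_width=8):
    """Build intervened weights following Eqs. (11)-(12).

    base, delta: dicts mapping (layer, module_name) -> np.ndarray, layers 1-indexed.
    Only `module` ("MLP" or "Attn") inside the window receives base + delta;
    every other parameter stays at its base value.
    """
    lo = max(1, center - half_width)
    hi = min(num_layers, center + half_width)
    out = {}
    for (layer, name), w in base.items():
        if name == module and lo <= layer <= hi:
            out[(layer, name)] = w + delta[(layer, name)]
        else:
            out[(layer, name)] = w.copy()
    return out

# Toy model: 20 layers, one 2x2 matrix per module, all-ones updates.
L = 20
base = {(i, m): np.zeros((2, 2)) for i in range(1, L + 1) for m in ("MLP", "Attn")}
delta = {key: np.ones((2, 2)) for key in base}

for center in range(1, L - 8 + 1):  # centers l = 1, ..., L-8, as in the text
    mlp_model = intervene(base, delta, "MLP", center, L)
    attn_model = intervene(base, delta, "Attn", center, L)
    # An accuracy evaluation of mlp_model and attn_model would go here.
```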

![Refer to caption](https://arxiv.org/html/2605.11739v1/x8.png)

Figure 8: Comparison of parameter update efficiency between RL and OPD. Scaling analysis of the final checkpoints demonstrates that OPD achieves substantially higher reasoning gains than RL under an identical update norm budget.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x9.png)

Figure 9: Comparison of parameter update efficiency between RL and OPD. Analysis of intermediate checkpoints throughout training demonstrates that OPD achieves the same reasoning accuracy as RL with substantially smaller parameter update norms.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x10.png)

Figure 10: Functional contributions and update distributions across architectural components. (a) Effect of embedding layer replacement on MATH500. (b) Layer-wise update norms (bars, left axis) for RL/OPD-trained Qwen3-8B-Base models, and corresponding RL reasoning accuracy after sliding-window intervention (line, right axis) on MATH500.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x11.png)

Figure 11: t-SNE visualization of token embeddings from the Base, RL, and OPD models. The red and green lines indicate the shifts from Base to RL and from Base to OPD, respectively.

Table 3: Cosine similarity between RL/OPD with Base model token embeddings.

## Appendix F Property 2 Additional Experiment

### F.1 Geometric Metrics for Parameter Update Matrix

In this section, we provide formal definitions of four complementary metrics used to characterize the geometric structure of the parameter update matrix $\Delta W\in\mathbb{R}^{m\times n}$. Let the singular value decomposition (SVD) of $\Delta W$ be:

$$\Delta W=U\Sigma V^{\top},\quad\Sigma=\operatorname{diag}(\sigma_{1},\sigma_{2},\ldots,\sigma_{r}),\tag{13}$$

where $r=\operatorname{rank}(\Delta W)$ and $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{r}>0$ are the singular values sorted in descending order.

#### Spectral Norm (Mathias, [1990](https://arxiv.org/html/2605.11739#bib.bib99)).

The spectral norm is defined as the largest singular value $\sigma_{1}$. This metric captures the magnitude of the update along the dominant direction in parameter space, corresponding to the maximum amplification induced by $\Delta W$ on any unit-norm input vector.

#### Spectral-to-Frobenius Norm Ratio (Al-Natoor, [2024](https://arxiv.org/html/2605.11739#bib.bib100)).

The spectral-to-Frobenius norm ratio is defined as:

$$\rho=\frac{\sigma_{1}}{\sqrt{\sum_{j=1}^{r}\sigma_{j}^{2}}}.\tag{14}$$

This ratio quantifies the dominance of the leading singular direction. A value of $\rho$ close to $1$ indicates that the update is highly concentrated along a single direction, whereas smaller values suggest that the update energy is distributed across multiple directions.
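As a minimal sketch, the spectral norm and the ratio $\rho$ in Eq. (14) can be computed directly from the singular values with NumPy. The function name and test matrices are illustrative assumptions, and the input is assumed to be a nonzero matrix.

```python
import numpy as np

def spectral_metrics(delta_w):
    """Return (sigma_1, rho) with rho = sigma_1 / ||delta_w||_F, as in Eq. (14)."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    sigma_1 = float(s[0])
    fro = float(np.sqrt(np.sum(s ** 2)))          # Frobenius norm from the spectrum
    return sigma_1, sigma_1 / fro

rank_one = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])  # rank-1 update: rho is 1
isotropic = np.eye(4)                              # energy spread evenly: rho is 1/2
print(spectral_metrics(rank_one))
print(spectral_metrics(isotropic))
```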

#### Effective Rank (Roy and Vetterli, [2007](https://arxiv.org/html/2605.11739#bib.bib98)).

The effective rank, also referred to as the spectral entropy rank, is defined as:

$$\mathrm{rank}_{\mathrm{eff}}=\exp\left(-\sum_{i=1}^{r}\bar{\sigma}_{i}\log\bar{\sigma}_{i}\right),\tag{15}$$

where $\bar{\sigma}_{i}=\sigma_{i}/\sum_{j=1}^{r}\sigma_{j}$ denotes the normalized singular values. This metric is the exponential of the entropy of the normalized singular value spectrum. A smaller effective rank indicates rapid spectral decay and concentration of update energy in a low-dimensional subspace, while a larger effective rank implies a more diffuse distribution.
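Eq. (15) can be sketched in a few lines of NumPy. The function name and the relative tolerance used to discard numerically zero singular values are assumptions for illustration, and the input is assumed to be nonzero.

```python
import numpy as np

def effective_rank(delta_w, tol=1e-12):
    """Exponential of the entropy of the normalized singular values, Eq. (15)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    s = s[s > tol * s.max()]          # keep the strictly positive part of the spectrum
    p = s / s.sum()                   # normalized singular values, summing to 1
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

print(effective_rank(np.eye(4)))                               # flat spectrum
print(effective_rank(np.outer([1.0, 2.0, 2.0], [3.0, 4.0])))   # single direction
```

A flat spectrum over four directions gives an effective rank of 4, while a rank-1 matrix gives 1, which matches the interpretation in the paragraph above.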

#### Top-1% Subspace Norm Ratio (Cai et al., [2025](https://arxiv.org/html/2605.11739#bib.bib78)).

Let $k=\lceil r/100\rceil$ denote the number of singular components corresponding to the Top-$1\%$ of the spectrum. We construct the rank-$k$ approximation of $\Delta W$ using these leading components:

$$\Delta W_{k}=U_{:,1:k}\Sigma_{1:k,1:k}V_{:,1:k}^{\top}.\tag{16}$$

The Top-$1\%$ subspace norm ratio is defined as:

$$R_{\text{Top-1\%}}=\frac{\|\Delta W_{k}\|_{F}}{\|\Delta W\|_{F}}=\sqrt{\frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sum_{j=1}^{r}\sigma_{j}^{2}}}.\tag{17}$$

This metric quantifies the share of the total update norm captured by the Top-$1\%$ of singular directions; its square is the corresponding fraction of the update energy. A value close to $1$ indicates that the update is effectively confined to an extremely low-dimensional subspace. For each model, we report the average values of the computed metrics across all MLP and attention matrices.
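A minimal NumPy sketch of Eq. (17) follows. The function name and the tolerance used to determine the numerical rank $r$ are illustrative assumptions; the input is assumed to be nonzero.

```python
import math
import numpy as np

def top1pct_norm_ratio(delta_w, tol=1e-12):
    """Frobenius-norm share of the top ceil(r/100) singular directions, Eq. (17)."""
    s = np.linalg.svd(delta_w, compute_uv=False)   # descending order
    s = s[s > tol * s.max()]
    r = len(s)                                     # numerical rank
    k = math.ceil(r / 100)                         # at least 1 whenever r >= 1
    return float(np.sqrt(np.sum(s[:k] ** 2) / np.sum(s ** 2)))

print(top1pct_norm_ratio(np.diag([3.0, 4.0])))  # r = 2, k = 1
print(top1pct_norm_ratio(np.eye(200)))          # r = 200, k = 2
```

For the diagonal example the leading singular value is 4 and the Frobenius norm is 5, so the ratio is 4/5; for the 200-dimensional identity it is $\sqrt{2/200}$.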

![Refer to caption](https://arxiv.org/html/2605.11739v1/x12.png)

Figure 12: Heatmap of the cosine similarity of $\mathcal{U}_{1}$ at the first and last steps for each component trained under OPD and RL.

### F.2 Cosine Similarity Analysis of Subspaces

This section provides additional empirical evidence for Property 2 (Early Low-Rank Lock-in) by analyzing the directional stability of dominant update subspaces during training. We focus on how the principal subspaces evolve from the early training stage to the final converged checkpoint, thereby characterizing the subspace-level convergence behavior of different training methods.

To this end, we perform singular value decomposition (SVD) on the parameter update matrix and analyze the dominant subspaces spanned by its leading singular vectors. Specifically, we consider the Rank-1 subspace $\mathcal{U}_{1}$, which corresponds to the strongest singular direction and captures the primary low-dimensional structure of update energy. We compute the cosine similarity between early-stage and final-stage subspaces to measure the degree of directional lock-in during training. The results are shown in Figure [12](https://arxiv.org/html/2605.11739#A6.F12).
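One way to compute such a similarity is sketched below. The text does not spell out how $\mathcal{U}_{1}$ is represented, so this sketch makes an explicit assumption: each Rank-1 subspace is represented by the unit-norm matrix $u_{1}v_{1}^{\top}$, which is insensitive to the sign ambiguity of singular vectors, and the cosine is the Frobenius inner product of two such matrices. The function names and toy updates are hypothetical.

```python
import numpy as np

def rank1_direction(delta_w):
    """Unit-Frobenius-norm rank-1 matrix u_1 v_1^T from the SVD of delta_w."""
    u, _, vt = np.linalg.svd(delta_w, full_matrices=False)
    return np.outer(u[:, 0], vt[0])

def rank1_cosine(delta_early, delta_final):
    """Cosine similarity between the dominant rank-1 directions of two updates."""
    a = rank1_direction(delta_early)
    b = rank1_direction(delta_final)
    return float(np.sum(a * b))   # both factors have unit Frobenius norm

# Toy updates sharing one dominant direction, with different magnitudes and noise.
rng = np.random.default_rng(0)
direction = np.outer(rng.normal(size=6), rng.normal(size=4))
early = 0.1 * direction + 1e-3 * rng.normal(size=(6, 4))
final = 1.0 * direction + 1e-3 * rng.normal(size=(6, 4))
print(rank1_cosine(early, final))   # near 1: same direction, larger magnitude
```

Under this representation, an update that only grows in magnitude along a fixed direction keeps a similarity near 1, which is the "lock-in" pattern the analysis looks for.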

RL exhibits unstable dominant subspace evolution. During RL training, the cosine similarity between early-stage and final-stage subspaces remains consistently low across modules. This indicates that RL does not establish update directions aligned with the final checkpoint at the early stage. Instead, its dominant subspaces undergo substantial changes throughout training, suggesting that RL requires continuous exploration and correction before gradually converging to a stable configuration.

OPD exhibits early alignment of dominant subspaces. In contrast, OPD shows substantially higher subspace consistency across most modules. In particular, intermediate layers exhibit especially strong early alignment, with cosine similarity reaching up to 0.9. These results indicate that OPD identifies stable dominant update directions early in training, while subsequent optimization mainly amplifies the update magnitude along these directions rather than repeatedly searching for new directions.

This observation provides further support for Property 1 from a representational geometry perspective\. As Property 1 indicates, OPD suppresses functionally redundant updates and concentrates parameter changes within reasoning\-critical intermediate modules\. The present subspace analysis elucidates the mechanistic basis for such compact updates: in these modules, the dominant update subspaces stabilize early during training, enabling OPD to amplify updates along these consistent directions while minimizing redundant parameter movement\. Consequently, OPD achieves substantial performance improvements with high parameter efficiency, as the optimization primarily reinforces already stable, task\-relevant directions rather than exploring unnecessary or redundant dimensions\.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x13.png)

Figure 13: Heatmap of the $\mathcal{U}_{1}$ trajectory under OPD and RL, along with the variance explained by the first two dimensions after PCA.

### F.3 Trajectory Evolution of Subspaces

Trajectory Visualization. Beyond similarity analysis, we further investigate the temporal evolution of dominant subspaces during training by visualizing the trajectories of Rank-1 subspaces $\mathcal{U}_{1}$ across different modules. Specifically, we apply t-SNE dimensionality reduction (Shi, [2021](https://arxiv.org/html/2605.11739#bib.bib103)) to representations from different training checkpoints, with results shown in Figures [15](https://arxiv.org/html/2605.11739#A6.F15)-[28](https://arxiv.org/html/2605.11739#A6.F28).

We observe that OPD exhibits markedly more concentrated trajectory patterns: its evolution is confined to a narrower region in the projected space and follows a smoother, near\-linear path\. In contrast, RL trajectories are significantly more dispersed and irregular\. This suggests that OPD induces stronger directional stability during representation evolution, resulting in a more structured and predictable optimization trajectory\.

Quantitative Characterization via PCA. To quantify this phenomenon, we perform PCA (Eckart and Young, [1936](https://arxiv.org/html/2605.11739#bib.bib102)) on representations from different training checkpoints. For each module, we collect the checkpoint-wise representation vectors and form a trajectory matrix $X\in\mathbb{R}^{T\times d}$, where $T$ denotes the number of checkpoints (not the response length used in Appendix D.1) and $d$ is the representation dimension. After centering $X$, PCA decomposes the covariance matrix and obtains eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{d}$. We then compute the cumulative variance explained by the first two principal components as

$$\mathrm{EVR}_{0:2}=\frac{\lambda_{1}+\lambda_{2}}{\sum_{i=1}^{d}\lambda_{i}}.\tag{18}$$

This quantity measures how much of the trajectory variation across training checkpoints can be captured by a two-dimensional principal subspace. A higher value indicates that the trajectory is more concentrated and lower-dimensional, whereas a lower value suggests that the evolution is more dispersed across multiple directions. The results are summarized in Figure [13](https://arxiv.org/html/2605.11739#A6.F13).
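Eq. (18) can be sketched with NumPy by using the fact that the covariance eigenvalues of the centered trajectory matrix are proportional to its squared singular values. The function name and toy trajectories are illustrative assumptions, and the trajectory is assumed to have nonzero variance.

```python
import numpy as np

def evr_first_two(X):
    """Fraction of trajectory variance captured by the first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)     # center each feature across checkpoints
    s = np.linalg.svd(Xc, compute_uv=False)    # covariance eigenvalues are prop. to s**2
    var = s ** 2
    return float(var[:2].sum() / var.sum())

# A straight-line trajectory (5 checkpoints, 3 dims) lies in a 1-D subspace.
line = np.outer(np.arange(5.0), [1.0, 2.0, 3.0])
# A trajectory spreading equally over four orthogonal axes.
spread = np.vstack([np.eye(4), -np.eye(4)])
print(evr_first_two(line))
print(evr_first_two(spread))
```

The straight-line trajectory is fully explained by one component, while the evenly spread one has only half of its variance in any two components.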

Overall, OPD consistently achieves substantially higher $\mathrm{EVR}_{0:2}$ than RL. This indicates that the OPD representations are more strongly concentrated within a low-dimensional and compact subspace during training. In contrast, RL representations distribute their variation across a broader set of directions, reflecting greater redundancy and less structured trajectory evolution.

Mechanistic Interpretation. Taken together, these observations provide a unified geometric and information-theoretic perspective on the behaviors described in Property 1 and Property 2. Specifically, during training, the update dynamics are not evenly distributed across the high-dimensional parameter space but are highly concentrated along a few dominant directions forming a low-dimensional subspace. From an information-theoretic standpoint, this concentration acts as a form of implicit compression, enhancing parameter utilization efficiency (Property 1) while facilitating early stabilization of update directions (Property 2).

From the perspective of optimization geometry, this concentration reflects an implicit low\-rank bias: under dense teacher supervision, OPD preferentially updates along a small number of stable and effective directions rather than exploring the high\-dimensional parameter space indiscriminately\. As a result, the parameter evolution exhibits a highly structured pattern, with both the direction and support of updates tightly constrained, yielding compact and stable trajectory evolution\.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x14.png)Figure 14:Scaling analysis of \(a\) accuracy and \(b\) KL divergence across different training checkpoints, with optimal performance achieved in the range0\.8≤β≤1\.20\.8\\leq\\beta\\leq 1\.2\.
### F\.4Scaling Effects on Accuracy and Distribution Alignment

This subsection further validates and complements the findings in Section [3](https://arxiv.org/html/2605.11739#S3) and Figure [5](https://arxiv.org/html/2605.11739#S3.F5), focusing on the relationship between the magnitude of early updates and model performance.

#### Effect of Scaling Magnitude on Performance\.

To analyze how scaling the early-checkpoint update affects model performance, we define the scaled update as:

$$\Delta W_{\text{scaled}}=\Delta W_{\text{early}}+\underbrace{\Delta W_{\text{early}}\times\frac{\beta\cdot\left(\|\Delta W_{\text{final}}\|_{F}-\|\Delta W_{\text{early}}\|_{F}\right)}{\|\Delta W_{\text{early}}\|_{F}}}_{\text{extra update}}. \tag{19}$$

Here, $\beta$ is the scaling coefficient. When $\beta=0$, $\Delta W_{\text{scaled}}=\Delta W_{\text{early}}$, i.e., no extra update is added. When $\beta=1$, $\|\Delta W_{\text{scaled}}\|_{F}=\|\Delta W_{\text{final}}\|_{F}$, i.e., the magnitude of the scaled update matches that of the final update.
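The scaling rule in Eq. (19) can be sketched in a few lines. This is an illustrative implementation only; the function name `scale_early_update` and the toy matrices are assumptions, not the authors' code.

```python
import numpy as np

def scale_early_update(dw_early: np.ndarray, dw_final: np.ndarray, beta: float) -> np.ndarray:
    """Rescale the early update along its own direction, following Eq. (19)."""
    early_norm = np.linalg.norm(dw_early)   # Frobenius norm for 2-D inputs
    final_norm = np.linalg.norm(dw_final)
    if early_norm == 0.0:
        raise ValueError("the early update must be non-zero")
    extra = dw_early * (beta * (final_norm - early_norm) / early_norm)
    return dw_early + extra

# Toy matrices standing in for one weight matrix's early and final updates.
dw_early = np.array([[1.0, 2.0], [0.0, -1.0]])
dw_final = np.array([[3.0, 5.0], [1.0, -4.0]])

for beta in (0.0, 0.8, 1.0, 1.2):
    scaled = scale_early_update(dw_early, dw_final, beta)
    print(beta, np.linalg.norm(scaled))
```

Only the norm changes: the direction of $\Delta W_{\text{early}}$ is preserved for every $\beta$, so any change in accuracy under this rule is attributable to the magnitude of the update rather than to new directions.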

As shown in Figure [14](https://arxiv.org/html/2605.11739#A6.F14)(a), increasing $\beta$ from 0 progressively improves model performance. Around $\beta\approx 0.8$ the performance gain begins to plateau, and once $\beta$ exceeds approximately $1.2$, performance starts to degrade. This trend provides three key insights: (i) the early checkpoint already captures a principal subspace aligned with the final solution, as evidenced by performance gains from moderate scaling; (ii) the plateau around $\beta\approx 0.8$ reflects inherent representational limits of the early subspace, indicating that further amplification cannot fully bridge the gap without additional training; (iii) excessive scaling leads to performance decline, suggesting that the extra norm amplifies noise or irrelevant components, harming task performance.

#### Alignment with Teacher Distribution\.

To further understand these trends, we measure the KL divergence between the student’s outputs and the teacher’s distribution. Figure [14](https://arxiv.org/html/2605.11739#A6.F14)(b) shows that KL divergence decreases steadily as $\beta$ increases from 0, stabilizes over the intermediate range corresponding to the performance plateau, and rises again for $\beta>1.2$. These trends mirror the accuracy results: initially, the KL reduction coincides with steady accuracy improvement, indicating that closer approximation to the teacher distribution directly drives task performance. Within the optimal range ($\beta\approx 0.8$–$1.2$), KL divergence remains low and accuracy saturates, demonstrating strong student-teacher distribution alignment.

This phenomenon can be interpreted from two complementary perspectives\. First, from a causal inference viewpoint, KL reduction—i\.e\., more precise alignment with the teacher’s behavioral distribution—directly drives improvements in task accuracy\. Second, from the perspective of representation subspace geometry, the reduction in KL following scaling reveals that the early update directions already capture the dominant structure of the teacher’s distribution\. While the early subspace norm may initially be insufficient, its directions are largely aligned with the final converged solution\. Appropriate scaling partially unlocks the representational capacity encoded in this subspace, thereby reducing the distributional gap between student and teacher\.

#### Illustrative example of scaling\-induced reasoning improvement\.

We next provide a concrete example to illustrate the differences in generated text between the early checkpoint and the teacher model. In the accompanying example, we compare the generated responses of the early checkpoint before and after scaling. Specifically, when we scale the norm of the early checkpoint to match that of the final model, the quality of its generated responses improves significantly compared to the unscaled version. Further analysis reveals that the scaled responses exhibit a noticeable increase in the number of reasoning steps, with each step becoming more fine-grained. The model demonstrates richer intermediate reasoning processes and clearer logical progression, rather than jumping directly to results. This change reflects reasoning habits that are more similar to those of the teacher model, indicating that appropriate norm scaling can activate the reasoning structures already encoded in the early subspace, making the student’s generation behavior more akin to the teacher’s in terms of reasoning depth and logical coherence.

### F\.5A Local Geometric View of OPD Dynamics

In this subsection, we provide a local geometric analysis to explain why On\-Policy Distillation \(OPD\) naturally induces low\-rank and early\-locked update directions, and how this differs from the update dynamics of reinforcement learning \(RL\)\. By linearizing the student model around the base model, we reveal how the structure of the OPD objective gives rise to the empirical phenomena observed in the main text\.

#### Setup and Linearization\.

Let a token context be denoted by $c=(x,y_{<t})$, where $x$ is the input prompt and $y_{<t}$ are the previously generated tokens. Define:

- $z_{\theta}(c)\in\mathbb{R}^{V}$: logits of the student model with parameters $\theta$ (vocabulary size $V$).
- $z^{\star}(c)\in\mathbb{R}^{V}$: logits of a fixed teacher model.
- $\theta_{0}$: parameters of the base model (initialization for both RL and OPD training).
- $\Delta\theta=\theta-\theta_{0}$: parameter displacement.

Expand $z_{\theta}(c)$ around $\theta_{0}$ to first order:

$$z_{\theta}(c)=z_{\theta_{0}}(c)+\underbrace{\frac{\partial z_{\theta}(c)}{\partial\theta}\bigg|_{\theta=\theta_{0}}}_{=:J_{c}}\Delta\theta+O(\|\Delta\theta\|^{2}). \tag{20}$$

Here $J_{c}\in\mathbb{R}^{V\times\dim(\theta)}$ is the Jacobian matrix of the logits with respect to the parameters. For sufficiently small step sizes and early training, $\|\Delta\theta\|$ is small, and we neglect the higher-order terms:

$$z_{\theta}(c)\approx z_{0}(c)+J_{c}\Delta\theta,\qquad\text{where } z_{0}(c):=z_{\theta_{0}}(c). \tag{21}$$

Define the *teacher-student logit residual at the base model*:

$$r_{c}:=z^{\star}(c)-z_{0}(c). \tag{22}$$

Then the logit discrepancy becomes:

$$z_{\theta}(c)-z^{\star}(c)\approx J_{c}\Delta\theta-r_{c}. \tag{23}$$

#### Local Quadratic Approximation of the OPD Objective\.

The OPD objective minimizes the reverse KL divergence between the student and the teacher on on-policy samples:

$$\mathcal{L}_{\mathrm{OPD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_{\theta}(\cdot|x)}\left[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot|x,y_{<t})\;\|\;\pi^{\star}(\cdot|x,y_{<t})\big)\right]. \tag{24}$$

For a fixed context $c$, denote:

$$p_{\theta}(\cdot|c)=\mathrm{softmax}(z_{\theta}(c)),\qquad p^{\star}(\cdot|c)=\mathrm{softmax}(z^{\star}(c)). \tag{25}$$

When the two distributions are close, the KL divergence admits a second-order Taylor expansion in the logit space. Let $f(z)=D_{\mathrm{KL}}(p_{z}\|p^{\star})$ where $p_{z}=\mathrm{softmax}(z)$. Then:

$$f(z)\approx f(z^{\star})+\underbrace{\nabla f(z^{\star})^{\top}(z-z^{\star})}_{=0}+\frac{1}{2}(z-z^{\star})^{\top}\nabla^{2}f(z^{\star})(z-z^{\star}), \tag{26}$$

because the first derivative vanishes at the minimum $z=z^{\star}$, where also $f(z^{\star})=0$. The Hessian of the reverse KL at the teacher point is the Fisher information matrix of the softmax distribution evaluated at $p^{\star}$, where the student and teacher distributions coincide:

$$\nabla^{2}f(z^{\star})=\mathrm{Diag}(p^{\star})-p^{\star}{p^{\star}}^{\top}=:F_{c}^{\star}. \tag{27}$$

Thus, for $z$ near $z^{\star}$:

$$D_{\mathrm{KL}}(p_{z}\|p^{\star})\approx\frac{1}{2}(z-z^{\star})^{\top}F_{c}^{\star}(z-z^{\star}). \tag{28}$$
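The identity in Eq. (27) can be checked numerically. The sketch below compares a finite-difference Hessian of the reverse KL at $z=z^{\star}$ against $\mathrm{Diag}(p^{\star})-p^{\star}{p^{\star}}^{\top}$; the helper names, the vocabulary size of 5, and the step size `h` are illustrative assumptions.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def reverse_kl(z: np.ndarray, z_star: np.ndarray) -> float:
    """D_KL(softmax(z) || softmax(z_star))."""
    p, q = softmax(z), softmax(z_star)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def numerical_hessian(f, x: np.ndarray, h: float = 1e-3) -> np.ndarray:
    """Central finite-difference Hessian of a scalar function f at x."""
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            hess[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                          - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4.0 * h * h)
    return hess

rng = np.random.default_rng(0)
z_star = rng.normal(size=5)                      # toy teacher logits
p_star = softmax(z_star)
fisher = np.diag(p_star) - np.outer(p_star, p_star)
hessian = numerical_hessian(lambda z: reverse_kl(z, z_star), z_star)
max_abs_error = float(np.abs(hessian - fisher).max())
print("max |Hessian - Fisher| =", max_abs_error)
```

The Fisher matrix is singular along the all-ones direction, because adding a constant to every logit leaves the softmax unchanged; the quadratic form in Eq. (28) is therefore insensitive to such shifts.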
However, in our local analysis we linearize around $\theta_{0}$, so the student logits $z_{\theta}(c)$ are close to $z_{0}(c)$, not necessarily close to $z^{\star}(c)$. To obtain a quadratic form in $\Delta\theta$, we may evaluate the Fisher matrix at a convenient distribution, typically the base model distribution $p_{0}(c)=\mathrm{softmax}(z_{0}(c))$. This yields an approximation that is consistent when $z_{\theta}\approx z_{0}$ and the teacher is not too far from the base model. Define:

$$F_{c}:=\mathrm{Diag}(p_{0}(c))-p_{0}(c)p_{0}(c)^{\top}. \tag{29}$$

Then we approximate:

$$D_{\mathrm{KL}}(p_{\theta}\|p^{\star})\approx\frac{1}{2}(z_{\theta}(c)-z^{\star}(c))^{\top}F_{c}(z_{\theta}(c)-z^{\star}(c))=\frac{1}{2}(J_{c}\Delta\theta-r_{c})^{\top}F_{c}(J_{c}\Delta\theta-r_{c}). \tag{30}$$

If the teacher and base model are already reasonably aligned (a common scenario in distillation), then $z^{\star}\approx z_{0}$ and $F_{c}\approx F_{c}^{\star}$. Even if not, the quadratic form still provides a local approximation of the KL divergence up to an additive constant, because:

$$D_{\mathrm{KL}}(p_{\theta}\|p^{\star})=D_{\mathrm{KL}}(p_{0}\|p^{\star})+\nabla_{\theta}D_{\mathrm{KL}}(p_{\theta}\|p^{\star})\big|_{\theta_{0}}\Delta\theta+\frac{1}{2}\Delta\theta^{\top}H\Delta\theta+\cdots, \tag{31}$$

and the Hessian at $\theta_{0}$ involves $J_{c}^{\top}F_{c}^{\star}J_{c}$. Evaluating $F_{c}$ at $p_{0}$ is a standard simplification in the neural tangent kernel literature and preserves the correct second-order structure when $\|z^{\star}-z_{0}\|$ is small.

#### Local Expected Objective and Gradient\.

Taking the expectation over the on-policy contexts $c$, whose distribution can be approximated to first order by that of the base model, we obtain:

$$\mathcal{L}_{\mathrm{OPD}}(\Delta\theta)\approx\frac{1}{2}\mathbb{E}_{c}\big[(J_{c}\Delta\theta-r_{c})^{\top}F_{c}(J_{c}\Delta\theta-r_{c})\big]. \tag{32}$$

Expanding the quadratic:

$$\mathcal{L}_{\mathrm{OPD}}(\Delta\theta)\approx\frac{1}{2}\Delta\theta^{\top}\underbrace{\mathbb{E}_{c}[J_{c}^{\top}F_{c}J_{c}]}_{=:A}\Delta\theta-\Delta\theta^{\top}\underbrace{\mathbb{E}_{c}[J_{c}^{\top}F_{c}r_{c}]}_{=:b}+\frac{1}{2}\mathbb{E}_{c}[r_{c}^{\top}F_{c}r_{c}]. \tag{33}$$

The last term is constant with respect to $\Delta\theta$. Therefore, the local objective is a convex quadratic:

$$\mathcal{L}_{\mathrm{OPD}}(\Delta\theta)=\frac{1}{2}\Delta\theta^{\top}A\Delta\theta-b^{\top}\Delta\theta+\text{const}. \tag{34}$$

The gradient with respect to $\Delta\theta$ is:

$$g(\Delta\theta):=\nabla_{\Delta\theta}\mathcal{L}_{\mathrm{OPD}}=A\Delta\theta-b. \tag{35}$$

#### Gradient Descent Dynamics and Closed\-Form Solution\.

Consider gradient descent on $\Delta\theta$ with fixed step size $\eta>0$:

$$\Delta\theta_{s+1}=\Delta\theta_{s}-\eta g(\Delta\theta_{s})=\Delta\theta_{s}-\eta(A\Delta\theta_{s}-b)=(I-\eta A)\Delta\theta_{s}+\eta b. \tag{36}$$

Starting from $\Delta\theta_{0}=0$ (initialization at the base model), we unroll the recursion:

$$\Delta\theta_{1}=\eta b, \tag{37}$$

$$\Delta\theta_{2}=(I-\eta A)\eta b+\eta b=\eta[I+(I-\eta A)]b, \tag{38}$$

$$\Delta\theta_{s}=\eta\sum_{j=0}^{s-1}(I-\eta A)^{j}b. \tag{39}$$

This is a geometric series of matrices. The matrix $A$ is symmetric positive semidefinite (it is the expected Gram matrix of $J_{c}^{\top}F_{c}^{1/2}$), and $b$ lies in the range of $A$. Choose $\eta$ such that $0<\eta<2/\lambda_{\max}(A)$ to ensure convergence. Then $I-\eta A$, restricted to the range of $A$, has spectral radius less than 1, and the series converges to:

$$\Delta\theta_{\infty}=\eta(I-(I-\eta A))^{-1}b=A^{-1}b, \tag{40}$$

where $A^{-1}$ denotes the pseudo-inverse on the support of $A$. The finite-sum formula can be expressed in closed form:

$$\Delta\theta_{s}=\big[I-(I-\eta A)^{s}\big]A^{-1}b. \tag{41}$$

This is verified by factoring:

$$\sum_{j=0}^{s-1}(I-\eta A)^{j}=(I-(I-\eta A)^{s})(I-(I-\eta A))^{-1}=(I-(I-\eta A)^{s})(\eta A)^{-1}. \tag{42}$$

Multiplying by $\eta b$ gives the result.
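The closed form in Eq. (41) can be compared against the explicit recursion in Eq. (36). This is a minimal numerical sketch on a small synthetic problem: the matrix `A`, the vector `b`, and the helper names are hypothetical, with `A` made rank-deficient on purpose so that the pseudo-inverse is actually exercised.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank = 6, 4
G = rng.normal(size=(dim, rank))
A = G @ G.T                                # symmetric PSD with a 2-D null space
b = A @ rng.normal(size=dim)               # b lies in the range of A
eta = 1.0 / np.linalg.eigvalsh(A).max()    # satisfies 0 < eta < 2 / lambda_max

def pseudo_inverse(mat: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Invert a symmetric PSD matrix on its support only."""
    lam, U = np.linalg.eigh(mat)
    keep = lam > tol * lam.max()
    return (U[:, keep] / lam[keep]) @ U[:, keep].T

def gd_iterate(steps: int) -> np.ndarray:
    """Run Eq. (36) for `steps` iterations starting from zero."""
    delta = np.zeros(dim)
    for _ in range(steps):
        delta = delta - eta * (A @ delta - b)
    return delta

def closed_form(steps: int) -> np.ndarray:
    """Evaluate Eq. (41): [I - (I - eta A)^s] A^+ b."""
    I = np.eye(dim)
    return (I - np.linalg.matrix_power(I - eta * A, steps)) @ pseudo_inverse(A) @ b

gap = float(np.linalg.norm(gd_iterate(25) - closed_form(25)))
print("||iterate - closed form|| =", gap)
```

Because $b$ has no component in the null space of $A$, the iterates never leave the range of $A$, which is why the pseudo-inverse suffices even though $A$ itself is singular here.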

#### Spectral Decomposition and Directional Dynamics\.

Let $A=U\Lambda U^{\top}$ be the eigen-decomposition with $\Lambda=\mathrm{diag}(\lambda_{1},\lambda_{2},\dots,\lambda_{d})$ and $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{d}\geq 0$. Let $b=U\beta$ with $\beta_{i}=\langle b,u_{i}\rangle$. Since $A^{-1}=U\Lambda^{-1}U^{\top}$ (pseudo-inverse), we have:

$$A^{-1}b=\sum_{i:\lambda_{i}>0}\frac{\beta_{i}}{\lambda_{i}}u_{i}. \tag{43}$$

Also, $(I-\eta A)^{s}=U(I-\eta\Lambda)^{s}U^{\top}$. Therefore:

$$\Delta\theta_{s}=U\big[I-(I-\eta\Lambda)^{s}\big]\Lambda^{-1}\beta=\sum_{i:\lambda_{i}>0}\frac{1-(1-\eta\lambda_{i})^{s}}{\lambda_{i}}\beta_{i}u_{i}. \tag{44}$$

The above expression reveals the directional dynamics. For each eigen-direction $u_{i}$, the contribution starts at zero and asymptotically approaches $\beta_{i}/\lambda_{i}$. The factor $1-(1-\eta\lambda_{i})^{s}$ grows more rapidly when the curvature $\lambda_{i}$ is larger, meaning that directions with high sensitivity of the logits to parameter changes saturate early. Consequently, if the projection $\beta_{i}$ vanishes for many directions, the effective update remains confined to a low-dimensional subspace throughout training.
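To make the saturation behaviour concrete, the growth factor $1-(1-\eta\lambda_{i})^{s}$ can be tabulated for a few curvatures. The eigenvalues and step size below are hypothetical values chosen only for illustration.

```python
import numpy as np

def saturation(lam: np.ndarray, eta: float, steps: int) -> np.ndarray:
    """Fraction of the asymptotic value beta_i / lambda_i reached after `steps` updates."""
    return 1.0 - (1.0 - eta * lam) ** steps

eigenvalues = np.array([10.0, 1.0, 0.1, 0.01])   # hypothetical curvatures, sorted descending
eta = 0.1                                         # satisfies eta < 2 / lambda_max

for steps in (1, 10, 100):
    print(steps, np.round(saturation(eigenvalues, eta, steps), 3))
```

High-curvature directions reach their asymptote within a few steps, whereas low-curvature directions are still far from saturation at the same step count.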

#### A Sufficient Condition for Early Low\-Rank Lock\-in\.

Define the top-$k$ eigenspace of $A$ as

$$U_{k}=\mathrm{span}\{u_{1},\dots,u_{k}\},$$

and let $P_{U_{k}}$ be the orthogonal projector onto this subspace. We assume that the driving term $b$ is concentrated in $U_{k}$ up to a small residual:

$$\|P_{U_{k}^{\perp}}b\|\leq\epsilon\|b\|,\qquad\epsilon\ll 1. \tag{45}$$

Equivalently, we decompose

$$b=b_{\parallel}+b_{\perp},\qquad b_{\parallel}=P_{U_{k}}b,\qquad b_{\perp}=P_{U_{k}^{\perp}}b.$$

Using the closed-form dynamics, the update can be written as

$$\Delta\theta_{s}=[I-(I-\eta A)^{s}]A^{-1}b_{\parallel}+[I-(I-\eta A)^{s}]A^{-1}b_{\perp}. \tag{46}$$

The first term lies in the dominant eigenspace $U_{k}$, while the second term corresponds to the tail contribution from $U_{k}^{\perp}$. Rather than assuming that $A^{-1}$ is norm-reducing on the orthogonal complement, we bound this tail term through the spectral response of the finite-step dynamics. Specifically,

$$\left\|[I-(I-\eta A)^{s}]A^{-1}b_{\perp}\right\|\leq\rho_{\perp}(s)\|b_{\perp}\|, \tag{47}$$

where

$$\rho_{\perp}(s)=\max_{i>k,\,\lambda_{i}>0}\frac{\left|1-(1-\eta\lambda_{i})^{s}\right|}{\lambda_{i}}. \tag{48}$$

Combining this with the concentration assumption gives

$$\left\|\Delta\theta_{s}-[I-(I-\eta A)^{s}]A^{-1}b_{\parallel}\right\|\leq\rho_{\perp}(s)\,\epsilon\|b\|. \tag{49}$$

Thus, when the projected residual $b$ is highly concentrated in the top-$k$ eigenspace, the tail contribution remains small during the finite training horizon. If, in addition, there is a clear spectral gap,

$$\lambda_{k}\gg\lambda_{k+1}, \tag{50}$$

then the dominant directions in $U_{k}$ are activated and saturated earlier than the tail directions. This provides a geometric explanation for Property 2 (Early Low-Rank Lock-in): the optimization path is largely confined to a low-dimensional subspace that is identified in the early stage of training, while subsequent optimization mainly increases the magnitude within this subspace rather than exploring substantially new directions.
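The bound in Eqs. (47)–(49) can be checked on a synthetic spectrum. In the sketch below, the eigenvalues, the coefficients of `b`, the step size, and the horizon `s` are all hypothetical; the only purpose is to confirm that the measured tail contribution stays below $\rho_{\perp}(s)\,\epsilon\|b\|$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 2
U, _ = np.linalg.qr(rng.normal(size=(d, d)))                     # random orthonormal eigenvectors
lam = np.array([5.0, 4.0, 0.2, 0.15, 0.1, 0.08, 0.05, 0.02])    # spectral gap after the top k
A = U @ np.diag(lam) @ U.T

# Driving term concentrated in the top-k eigenspace, plus a small tail.
coeffs = np.concatenate([rng.normal(size=k), 0.01 * rng.normal(size=d - k)])
b = U @ coeffs
b_par = U[:, :k] @ coeffs[:k]
b_perp = b - b_par
eps = float(np.linalg.norm(b_perp) / np.linalg.norm(b))

eta, s = 0.2, 10                                                 # eta < 2 / lambda_max = 0.4

def response(vec: np.ndarray) -> np.ndarray:
    """Apply [I - (I - eta A)^s] A^{-1} to vec (A is invertible in this toy case)."""
    I = np.eye(d)
    return (I - np.linalg.matrix_power(I - eta * A, s)) @ np.linalg.solve(A, vec)

rho_perp = float(np.max(np.abs(1.0 - (1.0 - eta * lam[k:]) ** s) / lam[k:]))
tail = float(np.linalg.norm(response(b) - response(b_par)))
bound = rho_perp * eps * float(np.linalg.norm(b))
print(f"tail = {tail:.3e}, bound = {bound:.3e}, eps = {eps:.3e}")
```

Since $|1-(1-x)^{s}|\leq sx$ whenever $|1-x|\leq 1$, the response factor satisfies $\rho_{\perp}(s)\leq s\eta$, so over a finite horizon the tail term cannot grow faster than linearly in the number of steps.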

*Why is $b$ low-rank in practice?* Recall that

$$b=\mathbb{E}_{c}[J_{c}^{\top}F_{c}r_{c}]. \tag{51}$$

The residual

$$r_{c}=z^{\star}(c)-z_{0}(c)$$

is the teacher-base logit difference. In distillation, the teacher often refines the student by sharpening probabilities on a relatively small set of functionally important token positions, such as key reasoning tokens, intermediate reasoning steps, answer tokens, or formatting tokens (Xu et al., [2026](https://arxiv.org/html/2605.11739#bib.bib107)). Hence, $r_{c}$ is often sparse or low-dimensional in its effective support. The Fisher matrix $F_{c}$ further reweights these residual directions according to the local geometry of the output distribution. Although $J_{c}$ itself can be high-rank, the composition

$$J_{c}^{\top}F_{c}r_{c}$$

projects this concentrated residual signal back into parameter space. After averaging over contexts, the resulting driving term $b$ tends to concentrate on parameter directions that most strongly affect those critical token predictions. This is consistent with the low-rank structure of $\Delta W$ observed in Section [3](https://arxiv.org/html/2605.11739#S3).

#### Module\-Wise Suppression \(Functional Redundancy Avoidance\)\.

Decompose the parameters into $M$ modules (e.g., embedding, attention, MLP layers). Write:

$$\Delta\theta=(\Delta\theta_{1},\Delta\theta_{2},\dots,\Delta\theta_{M}),\qquad J_{c}=[J_{c,1},J_{c,2},\dots,J_{c,M}], \tag{52}$$

where $J_{c,m}=\partial z_{\theta}(c)/\partial\theta_{m}|_{\theta_{0}}$. Then the driving term for module $m$ is:

$$b_{m}=\mathbb{E}_{c}[J_{c,m}^{\top}F_{c}r_{c}]. \tag{53}$$

The curvature matrix $A$ has block structure:

$$A=\begin{pmatrix}A_{11}&A_{12}&\cdots&A_{1M}\\ A_{21}&A_{22}&\cdots&A_{2M}\\ \vdots&\vdots&\ddots&\vdots\\ A_{M1}&A_{M2}&\cdots&A_{MM}\end{pmatrix},\quad A_{mn}=\mathbb{E}_{c}[J_{c,m}^{\top}F_{c}J_{c,n}]. \tag{54}$$

At the local optimum $\Delta\theta^{*}=A^{-1}b$ (or the limit of gradient descent), we have:

$$\sum_{n=1}^{M}A_{mn}\Delta\theta_{n}^{*}=b_{m}. \tag{55}$$

If the cross-module coupling is weak (i.e., $A_{mn}$ is small for $m\neq n$ compared to $A_{mm}$), and $A_{mm}$ is invertible on its support, then:

$$\Delta\theta_{m}^{*}\approx A_{mm}^{-1}b_{m}. \tag{56}$$

Thus, if $b_{m}\approx 0$ (module $m$ is weakly coupled with the teacher residual), then $\Delta\theta_{m}^{*}\approx 0$. This provides a mechanism for Property 1 (Functional Redundancy Avoidance): modules that do not help match the teacher residual receive negligible updates. Empirically, embedding layers and bottom/top transformer layers have small $b_{m}$, leading to suppressed updates.
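The module-wise suppression argument can be illustrated with a toy block system. The sketch below assumes three hypothetical modules with strong within-module curvature and weak cross-module coupling, sets $b_{m}=0$ for the middle module, and solves for the local optimum; none of the sizes or scales come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [3, 4, 3]                                   # hypothetical module sizes
offsets = np.cumsum([0] + sizes)
slices = [slice(int(offsets[i]), int(offsets[i + 1])) for i in range(len(sizes))]
dim = int(offsets[-1])

# Weak symmetric cross-module coupling A_mn, then strong diagonal blocks A_mm.
C = rng.normal(size=(dim, dim))
A = 1e-3 * (C + C.T)
for sl in slices:
    n = sl.stop - sl.start
    G = rng.normal(size=(n, n))
    A[sl, sl] = G @ G.T + np.eye(n)                 # positive definite block

b = rng.normal(size=dim)
b[slices[1]] = 0.0                                  # middle module: b_m = 0

delta_star = np.linalg.solve(A, b)                  # local optimum of Eq. (55)
module_norms = [float(np.linalg.norm(delta_star[sl])) for sl in slices]
print("per-module update norms:", np.round(module_norms, 5))
```

The middle module still receives a small update through the off-diagonal blocks, which is why Eq. (56) holds only approximately and only when the coupling terms are weak.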

#### Comparison with Reinforcement Learning Dynamics\.

A standard policy gradient update (e.g., PPO) for a trajectory of length $T$ is:

$$g_{\mathrm{RL}}=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot|x)}\left[\sum_{t=1}^{T}A_{t}\nabla_{\theta}\log\pi_{\theta}(y_{t}|c_{t})\right], \tag{57}$$

where $c_{t}=(x,y_{<t})$ and $A_{t}$ is an advantage estimate. Using the logit parameterization:

$$\nabla_{\theta}\log\pi_{\theta}(y_{t}|c_{t})=J_{c_{t}}^{\top}(e_{y_{t}}-p_{\theta}(\cdot|c_{t})). \tag{58}$$

Hence:

$$g_{\mathrm{RL}}=\mathbb{E}\left[\sum_{t=1}^{T}A_{t}J_{c_{t}}^{\top}(e_{y_{t}}-p_{\theta}(\cdot|c_{t}))\right]. \tag{59}$$

In contrast, the OPD gradient (local approximation) is:

$$g_{\mathrm{OPD}}=-\nabla_{\Delta\theta}\mathcal{L}_{\mathrm{OPD}}=b-A\Delta\theta. \tag{60}$$

At initialization ($\Delta\theta=0$), we have $g_{\mathrm{OPD}}(0)=b$, which is a deterministic (up to sampling) function of the teacher residual. The RL gradient at initialization is:

$$g_{\mathrm{RL}}(0)=\mathbb{E}\left[\sum_{t}A_{t}J_{c_{t}}^{\top}(e_{y_{t}}-p_{0}(c_{t}))\right]. \tag{61}$$
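The logit-space factor $e_{y_{t}}-p_{\theta}(\cdot|c_{t})$ in Eq. (58) is the gradient of the log-softmax with respect to the logits; the Jacobian $J_{c_{t}}^{\top}$ then follows from the chain rule. The sketch below checks the logit-space identity by finite differences on toy logits; the vocabulary size, the sampled token index `y`, and the step `h` are arbitrary illustrative choices.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

rng = np.random.default_rng(3)
vocab = 6
z = rng.normal(size=vocab)      # toy logits for a single context
y = 2                           # index of the sampled token

p = np.exp(log_softmax(z))
analytic = -p.copy()
analytic[y] += 1.0              # e_y - p

h = 1e-6
basis = np.eye(vocab)
numeric = np.array([
    (log_softmax(z + h * basis[i])[y] - log_softmax(z - h * basis[i])[y]) / (2.0 * h)
    for i in range(vocab)
])
grad_error = float(np.abs(numeric - analytic).max())
print("max |numeric - analytic| =", grad_error)
```

The entries of $e_{y}-p$ sum to zero and are non-zero for every token with non-zero probability, which is the sense in which this vector has full support over the vocabulary.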
The differences between the two paradigms can be summarized in a few key aspects. OPD benefits from dense token-level supervision through the residual $r_{c}$ (filtered by $F_{c}$), whereas RL relies on scalar reward signals $A_{t}$ that are typically zero for most tokens in sparse-reward settings, making RL gradient estimates noisier. Moreover, credit assignment in RL is challenging because $A_{t}$ depends on the entire trajectory and future rewards, introducing high variance. In OPD, the per-token residual provides a more stable learning signal. Finally, the directional structure differs crucially: the OPD driving term $b$ inherits the low-rank concentration of $r_{c}$, while the RL driving term involves $e_{y_{t}}-p_{0}(c_{t})$, a random vector with full support in the vocabulary space, leading to less concentrated and more diffuse updates.

We can approximate the gradient covariance to illustrate the difference. For OPD, the per-sample gradient at initialization is:

$$\hat{g}_{\mathrm{OPD}}=J_{c}^{\top}F_{c}r_{c}, \tag{62}$$

with covariance $\Sigma_{\mathrm{OPD}}=\mathrm{Cov}(\hat{g}_{\mathrm{OPD}})$. For RL, assuming a single-token simplification (or ignoring temporal dependencies), the per-sample gradient is:

$$\hat{g}_{\mathrm{RL}}=AJ_{c}^{\top}(e_{y}-p_{0}(c)), \tag{63}$$

where $A$ here denotes the scalar advantage rather than the curvature matrix. Its covariance satisfies:

$$\mathrm{Tr}(\Sigma_{\mathrm{RL}})\approx\mathbb{E}[A^{2}]\cdot\mathbb{E}[\|J_{c}^{\top}(e_{y}-p_{0})\|^{2}]\;\geq\;\sigma_{A}^{2}\cdot\mathbb{E}[\|J_{c}^{\top}(e_{y}-p_{0})\|^{2}], \tag{64}$$

where $\sigma_{A}^{2}=\mathrm{Var}(A)$. In sparse-reward settings, $\sigma_{A}^{2}$ can be large because most trajectories receive zero reward except a few. For OPD, the residual $r_{c}$ is non-zero for many tokens, leading to lower relative variance. Moreover, the norm $\|J_{c}^{\top}(e_{y}-p_{0})\|$ is typically larger in magnitude than $\|J_{c}^{\top}F_{c}r_{c}\|$ when $r_{c}$ is small, because $F_{c}$ has eigenvalues at most 1. Consequently, we expect $\mathrm{Tr}(\Sigma_{\mathrm{RL}})>\mathrm{Tr}(\Sigma_{\mathrm{OPD}})$ in practice, implying that OPD follows a smoother and lower-noise optimization trajectory.

#### Summary\.

In the local regime, OPD can be approximated by a possibly degenerate convex quadratic minimization:

$$\min_{\Delta\theta}\frac{1}{2}\Delta\theta^{\top}A\Delta\theta-b^{\top}\Delta\theta. \tag{65}$$

The corresponding gradient descent dynamics admit the spectral form:

$$\Delta\theta_{s}=\sum_{i:\lambda_{i}>0}\frac{1-(1-\eta\lambda_{i})^{s}}{\lambda_{i}}\beta_{i}u_{i}. \tag{66}$$

This expression shows that the update along each eigen-direction is determined by the residual projection $\beta_{i}=\langle b,u_{i}\rangle$, the local curvature $\lambda_{i}$, and the finite-step growth factor $1-(1-\eta\lambda_{i})^{s}$.

If the driving term $b$ is concentrated in a low-dimensional subspace, such as the top-$k$ eigenspace of $A$, and a clear spectral gap exists, then the update remains approximately confined to this subspace from the early stages of training. This provides a local explanation for Early Low-Rank Lock-in. At the module level, if a module has negligible coupling with the teacher residual, i.e., $b_{m}\approx 0$, then its update is expected to be suppressed when cross-module coupling terms are not dominant. This explains Functional Redundancy Avoidance. Compared with RL, OPD benefits from a denser, lower-variance, and more directionally concentrated gradient signal, which helps explain the more concentrated and efficient update patterns observed in OPD.

Question

Let $\triangle ABC$ be a triangle with $D$ on $\overline{BC}$ such that $\overline{AD}$ bisects $\angle BAC$. Let $\omega$ be the circle that passes through $A$ and is tangent to segment $\overline{BC}$ at $D$. Let $E\neq A$ and $F\neq A$ be the intersections of $\omega$ with segments $\overline{AB}$ and $\overline{AC}$, respectively. Suppose that $AB=200$, $AC=225$, and all of $AE$, $AF$, $BD$, and $CD$ are positive integers. Find the sum of all possible values of $BC$.

Base Model

We are given that Patrick, Tanya, and Jose start at different times and all arrive at the park at the same time. We need to find the distance $D=\dfrac{m}{n}$ in lowest terms, then compute $m+n$.

Step 1: Define Variables. Let $v$ be Patrick’s speed (mph). Then Tanya’s speed is $v+2$ and Jose’s speed is $v+9$. Let $t$ be Patrick’s travel time (hours). Since they all arrive simultaneously, Tanya travels $t-1$ hours and Jose travels $t-2$ hours, giving:

$$D=vt=(v+2)(t-1)=(v+9)(t-2).$$

Step 2: Solve for $t$ and $v$. From $vt=(v+2)(t-1)$:

$$vt=vt-v+2t-2\implies v=2t-2. \tag{A}$$

From $(v+2)(t-1)=(v+9)(t-2)$:

$$vt-v+2t-2=vt-2v+9t-18\implies v=7t-16. \tag{B}$$

Setting (A) equal to (B):

$$2t-2=7t-16\implies 5t=14\implies t=\frac{14}{5}.$$

Substituting back into (A):

$$v=2\cdot\frac{14}{5}-2=\frac{28}{5}-\frac{10}{5}=\frac{18}{5}\text{ mph}.$$

Step 3: Compute the Distance.

$$D=vt=\frac{18}{5}\times\frac{14}{5}=\frac{252}{25}.$$

Since $252=2^{2}\times 3^{2}\times 7$ and $25=5^{2}$ share no common factors, $\gcd(252,25)=1$, so the fraction is already in lowest terms.

Final Answer. $m+n=252+25=\boxed{277}$.

The early training stage model (unscaled)

We are given that Patrick starts walking at a constant rate, Tanya starts 1 hour after Patrick, and Jose starts 1 hour after Tanya. All three arrive at the park at the same time. Tanya runs 2 mph faster than Patrick walks, and Jose bicycles 7 mph faster than Tanya runs. The distance from school to park is $\tfrac{m}{n}$ in lowest terms; find $m+n$.

Step 1: Define Variables. Let $p$ be Patrick’s walking speed (mph), so Tanya’s running speed is $p+2$ and Jose’s bicycling speed is $p+9$. Let $D$ be the distance (miles) and $T$ the total travel time (hours) for Patrick.

Step 2: Set Up Equations. Since all three arrive simultaneously:

$$D=p\cdot T, \tag{1}$$

$$D=(p+2)(T-1), \tag{2}$$

$$D=(p+9)(T-2). \tag{3}$$

Step 3: Solve for $p$ and $T$. Equating (1) and (2):

$$pT=(p+2)(T-1)=pT-p+2T-2\implies p=2T-2. \tag{4}$$

Equating (2) and (3):

$$(p+2)(T-1)=(p+9)(T-2).$$

Expanding both sides and simplifying:

$$-p+2T-2=-2p+9T-18\implies p=7T-16. \tag{5}$$

Setting (4) equal to (5):

$$2T-2=7T-16\implies 5T=14\implies T=\frac{14}{5}.$$

Substituting back into (4):

$$p=2\cdot\frac{14}{5}-2=\frac{18}{5}\text{ mph}.$$

Step 4: Compute the Distance.

$$D=pT=\frac{18}{5}\cdot\frac{14}{5}=\frac{252}{25}.$$

Since $252=2^{2}\cdot 3^{2}\cdot 7$ and $25=5^{2}$ share no common factors, $\gcd(252,25)=1$.

Verification.

$$p=\frac{18}{5},\quad p+2=\frac{28}{5},\quad p+9=\frac{63}{5},$$

$$D=\frac{18}{5}\cdot\frac{14}{5}=\frac{252}{25},\quad\frac{252/25}{28/5}=\frac{252}{25}\cdot\frac{5}{28}=\frac{9}{5}=T-1,\quad\frac{252/25}{63/5}=\frac{252}{25}\cdot\frac{5}{63}=\frac{4}{5}=T-2.$$

Answer. $m+n=252+25=\boxed{277}$.

The early training stage model (scaled):

We are given a problem involving three people (Patrick, Tanya, and Jose), each starting at different times and moving along the same path from school to the park. All three arrive at the park at the same time. We need to find the distance from school to park, expressed as a reduced fraction $\dfrac{m}{n}$, and then compute $m+n$.

Step 1: Define Variables. Let $v$ be Patrick’s walking speed (in miles per hour). Then:

- Tanya’s running speed is $v+2$ mph (2 mph faster than Patrick),
- Jose’s bicycling speed is $(v+2)+7=v+9$ mph (7 mph faster than Tanya).

Let $D$ be the distance from school to park (in miles), and let $t$ be the time (in hours) that Patrick takes to walk from school to park. Since all three arrive at the same time:

- Patrick travels for $t$ hours,
- Tanya starts 1 hour after Patrick, so she travels for $t-1$ hours,
- Jose starts 1 hour after Tanya (2 hours after Patrick), so he travels for $t-2$ hours.

This gives the following time equations:

$$t=\frac{D}{v},\qquad\frac{D}{v+2}=t-1,\qquad\frac{D}{v+9}=t-2.$$

Step 2: Substitute $t=D/v$ into the Remaining Equations. Substituting $t=\dfrac{D}{v}$ into the equation for Tanya:

$$\frac{D}{v+2}=\frac{D}{v}-1.$$

Substituting $t=\dfrac{D}{v}$ into the equation for Jose:

$$\frac{D}{v+9}=\frac{D}{v}-2.$$

We now have two equations in two unknowns $D$ and $v$.

Step 3: Derive Equation (A) from Tanya’s Equation. Starting from:

$$\frac{D}{v+2}=\frac{D}{v}-1.$$

Multiply both sides by $v(v+2)$:

$$D\cdot v=D(v+2)-v(v+2).$$

Expand the right-hand side:

$$Dv=Dv+2D-v^{2}-2v.$$

Subtract $Dv$ from both sides:

$$0=2D-v^{2}-2v.$$

Rearranging:

$$v^{2}+2v-2D=0.$$

Step 4: Derive Equation (B) from Jose’s Equation. Starting from:

$$\frac{D}{v+9}=\frac{D}{v}-2.$$

Multiply both sides by $v(v+9)$:

$$D\cdot v=D(v+9)-2v(v+9).$$

Expand the right-hand side:

$$Dv=Dv+9D-2v^{2}-18v.$$

Subtract $Dv$ from both sides:

$$0=9D-2v^{2}-18v.$$

Rearranging:

$$2v^{2}+18v-9D=0.$$

Step 5: Solve the System of Equations (A) and (B). From equation (A):

$$v^{2}+2v-2D=0\implies D=\frac{v^{2}+2v}{2}.$$

Substitute this expression for $D$ into equation (B):

$$2v^{2}+18v-9\cdot\frac{v^{2}+2v}{2}=0.$$

Multiply every term by 2 to clear the denominator:

$$4v^{2}+36v-9(v^{2}+2v)=0.$$

Expand:

$$4v^{2}+36v-9v^{2}-18v=0.$$

Combine like terms:

$$(4v^{2}-9v^{2})+(36v-18v)=0\implies-5v^{2}+18v=0.$$

Factor:

$$v(-5v+18)=0.$$

This gives $v=0$ or $v=\dfrac{18}{5}$. Since Patrick is walking, $v>0$, so:

$$v=\frac{18}{5}\text{ mph}.$$

Step 6: Compute the Distance $D$. Using $D=\dfrac{v^{2}+2v}{2}$ with $v=\dfrac{18}{5}$:

$$v^{2}=\left(\frac{18}{5}\right)^{2}=\frac{324}{25},\qquad 2v=2\times\frac{18}{5}=\frac{36}{5}=\frac{180}{25}.$$

$$v^{2}+2v=\frac{324}{25}+\frac{180}{25}=\frac{504}{25}.$$

$$D=\frac{504/25}{2}=\frac{504}{50}=\frac{252}{25}.$$

Step 7: Verify with All Three Travelers. Patrick’s travel time: $t=\dfrac{D}{v}=\dfrac{252/25}{18/5}=\dfrac{252}{25}\times\dfrac{5}{18}=\dfrac{1260}{450}=\dfrac{14}{5}$ hours.

Tanya travels for $t-1=\dfrac{14}{5}-\dfrac{5}{5}=\dfrac{9}{5}$ hours at speed $v+2=\dfrac{18}{5}+\dfrac{10}{5}=\dfrac{28}{5}$ mph:

$$\frac{28}{5}\times\frac{9}{5}=\frac{252}{25}=D.\quad\checkmark$$

Jose travels for $t-2=\dfrac{14}{5}-\dfrac{10}{5}=\dfrac{4}{5}$ hours at speed $v+9=\dfrac{18}{5}+\dfrac{45}{5}=\dfrac{63}{5}$ mph:

$$\frac{63}{5}\times\frac{4}{5}=\frac{252}{25}=D.\quad\checkmark$$

All three cover the same distance and arrive at the same time.

Step 8: Confirm the Fraction is in Lowest Terms. Factoring numerator and denominator:

$$252=2^{2}\times 3^{2}\times 7,\qquad 25=5^{2}.$$

Since 252 contains no factor of 5 (as $252\div 5=50.4$ is not an integer) and 25 contains no factor of 2, 3, or 7, we have $\gcd(252,25)=1$. Therefore $\dfrac{252}{25}$ is already in lowest terms, giving $m=252$ and $n=25$.

Final Answer. $m+n=252+25=\boxed{277}$.
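The arithmetic in the response above can be re-checked with exact rational numbers. The following is a minimal sketch and not part of the paper or of any model output; the function names `solve_speed_and_distance` and `arrival_times_match` are hypothetical and chosen only for illustration. It takes the reduced quadratic $-5v^{2}+18v=0$ from Step 5, recovers $D$ from Equation (A), and confirms the two delayed-start equations.

```python
from fractions import Fraction


def solve_speed_and_distance():
    """Return (v, D) from Equations (A) and (B) of the worked solution.

    (A): v^2 + 2v - 2D = 0   ->  D = (v^2 + 2v) / 2
    (B): 2v^2 + 18v - 9D = 0
    Substituting (A) into (B) and clearing the denominator gives
    a*v^2 + b*v = 0 with a = -5 and b = 18.
    """
    a, b = -5, 18
    v = Fraction(-b, a)          # the non-zero root of v * (a*v + b) = 0
    D = (v * v + 2 * v) / 2      # Equation (A) solved for D
    return v, D


def arrival_times_match(v, D):
    """Check that Tanya (1 h late) and Jose (2 h late) arrive with Patrick."""
    t = D / v                    # Patrick's travel time in hours
    return D / (v + 2) == t - 1 and D / (v + 9) == t - 2


v, D = solve_speed_and_distance()
assert arrival_times_match(v, D)
# Fraction keeps D in lowest terms, so numerator + denominator is m + n.
print(v, D, D.numerator + D.denominator)
```

Because `Fraction` reduces automatically, the lowest-terms check in Step 8 is implied by reading `D.numerator` and `D.denominator` directly.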

Teacher Model:

We are given a problem involving three people (Patrick, Tanya, and Jose), each starting at different times and moving along the same road from school to the park. They all arrive at the park at the same time. We need to find the distance from school to park, expressed as a reduced fraction $\dfrac{m}{n}$, and then compute $m+n$.

Step 1: Define Variables. Let $d$ be the distance from school to park (in miles) and $v$ be Patrick’s walking speed (in mph). Then:

- Tanya’s running speed is $v+2$ mph (2 mph faster than Patrick),
- Jose’s bicycling speed is $(v+2)+7=v+9$ mph (7 mph faster than Tanya).

Step 2: Determine Travel Times. Let $t$ be the number of hours Patrick takes to reach the park. Since Tanya starts 1 hour after Patrick and arrives at the same time, she travels for $t-1$ hours. Since Jose starts 1 hour after Tanya (i.e., 2 hours after Patrick) and arrives at the same time, he travels for $t-2$ hours.

Using distance $=$ speed $\times$ time, we obtain three equations:

$$v=\frac{d}{t},\tag{1}$$

$$v+2=\frac{d}{t-1},\tag{2}$$

$$v+9=\frac{d}{t-2}.\tag{3}$$

Step 3: Derive Equation (A) from Equations (1) and (2). From equation (1): $t=\dfrac{d}{v}$. Substituting into equation (2):

$$v+2=\frac{d}{\dfrac{d}{v}-1}=\frac{d}{\dfrac{d-v}{v}}=\frac{dv}{d-v}.$$

Multiply both sides by $(d-v)$:

$$(v+2)(d-v)=dv.$$

Expand the left-hand side:

$$vd-v^{2}+2d-2v=dv.$$

Subtract $dv$ from both sides:

$$-v^{2}+2d-2v=0.$$

Rearranging:

$$v^{2}+2v-2d=0.$$

Step 4: Derive Equation (B) from Equations (1) and (3). Again using $t=\dfrac{d}{v}$, we have $t-2=\dfrac{d}{v}-2=\dfrac{d-2v}{v}$. Substituting into equation (3):

$$v+9=\frac{d}{\dfrac{d-2v}{v}}=\frac{dv}{d-2v}.$$

Multiply both sides by $(d-2v)$:

$$(v+9)(d-2v)=dv.$$

Expand the left-hand side:

$$vd-2v^{2}+9d-18v=dv.$$

Subtract $dv$ from both sides:

$$-2v^{2}+9d-18v=0.$$

Rearranging:

$$2v^{2}+18v-9d=0.$$

Step 5: Solve the System of Equations (A) and (B). From equation (A):

$$2d=v^{2}+2v\implies d=\frac{v^{2}+2v}{2}.$$

Substitute this expression for $d$ into equation (B):

$$2v^{2}+18v-9\cdot\frac{v^{2}+2v}{2}=0.$$

Multiply every term by 2 to clear the denominator:

$$4v^{2}+36v-9(v^{2}+2v)=0.$$

Expand:

$$4v^{2}+36v-9v^{2}-18v=0.$$

Combine like terms:

$$(4v^{2}-9v^{2})+(36v-18v)=0\implies-5v^{2}+18v=0.$$

Factor:

$$v(-5v+18)=0.$$

This gives $v=0$ or $v=\dfrac{18}{5}$. Since Patrick is walking, $v>0$, so:

$$v=\frac{18}{5}\text{ mph}.$$

Step 6: Compute the Speeds.

- Patrick’s speed: $v=\dfrac{18}{5}$ mph,
- Tanya’s speed: $v+2=\dfrac{18}{5}+\dfrac{10}{5}=\dfrac{28}{5}$ mph,
- Jose’s speed: $v+9=\dfrac{18}{5}+\dfrac{45}{5}=\dfrac{63}{5}$ mph.

Step 7: Compute the Distance $d$.

$$d=\frac{v^{2}+2v}{2}.$$

First compute $v^{2}$:

$$v^{2}=\left(\frac{18}{5}\right)^{2}=\frac{324}{25}.$$

Then compute $2v$:

$$2v=2\times\frac{18}{5}=\frac{36}{5}=\frac{180}{25}.$$

Add:

$$v^{2}+2v=\frac{324}{25}+\frac{180}{25}=\frac{504}{25}.$$

Divide by 2:

$$d=\frac{504}{25}\div 2=\frac{504}{50}=\frac{252}{25}.$$

Step 8: Verify the Answer. Patrick’s total travel time:

$$t=\frac{d}{v}=\frac{252/25}{18/5}=\frac{252}{25}\times\frac{5}{18}=\frac{252\times 5}{25\times 18}=\frac{1260}{450}=\frac{14}{5}\text{ hours}.$$

Tanya travels for $t-1=\dfrac{14}{5}-\dfrac{5}{5}=\dfrac{9}{5}$ hours at speed $\dfrac{28}{5}$ mph:

$$\frac{28}{5}\times\frac{9}{5}=\frac{252}{25}\ \checkmark.$$

Jose travels for $t-2=\dfrac{14}{5}-\dfrac{10}{5}=\dfrac{4}{5}$ hours at speed $\dfrac{63}{5}$ mph:

$$\frac{63}{5}\times\frac{4}{5}=\frac{252}{25}\ \checkmark.$$

All three cover the same distance $\dfrac{252}{25}$ miles.

Step 9: Confirm the Fraction is in Lowest Terms. Factoring the numerator and denominator:

$$252=2^{2}\times 3^{2}\times 7,\qquad 25=5^{2}.$$

Since 252 and 25 share no common prime factors, $\gcd(252,25)=1$, so $\dfrac{252}{25}$ is already in lowest terms.

Final Answer. $m+n=252+25=\boxed{277}$.
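As a compact cross-check of Step 5 (an editorial sketch, not part of the teacher's response), Equations (A) and (B) can also be combined linearly so that the distance term cancels, which reaches the same quadratic without substituting a fraction:

$$9\,(v^{2}+2v-2d)-2\,(2v^{2}+18v-9d)=5v^{2}-18v=0\;\Longrightarrow\;v=\frac{18}{5}\quad(v>0).$$

The $-18d$ and $+18d$ terms cancel, and the result agrees with $-5v^{2}+18v=0$ up to an overall sign.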

![Refer to caption](https://arxiv.org/html/2605.11739v1/x15.png)
Figure 15: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for MLP modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x16.png)
Figure 16: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for MLP modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x17.png)
Figure 17: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for MLP GATE modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x18.png)
Figure 18: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for MLP GATE modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x19.png)
Figure 19: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for MLP UP modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x20.png)
Figure 20: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for MLP UP modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x21.png)
Figure 21: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for Attn Q modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x22.png)
Figure 22: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for Attn Q modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x23.png)
Figure 23: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for Attn K modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x24.png)
Figure 24: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for Attn K modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x25.png)
Figure 25: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for Attn V modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x26.png)
Figure 26: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for Attn V modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x27.png)
Figure 27: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under DAPO for Attn modules.

![Refer to caption](https://arxiv.org/html/2605.11739v1/x28.png)
Figure 28: t-SNE visualization of $\mathcal{U}_{1}$ trajectories under OPD for Attn modules.

## Appendix G NeurIPS Paper Checklist

1. Claims
   - Answer: [Yes]
   - Justification: The abstract and introduction clearly state the paper’s main contributions: identifying the foresight mechanism of OPD through Functional Redundancy Avoidance and Early Low-Rank Lock-in, and proposing EffOPD as a plug-and-play acceleration method.
2. Limitations
   - Answer: [Yes]
   - Justification: The paper includes a Limitations and Future Work section discussing the scope of the analysis, including its focus on current post-training settings and the local nature of the theoretical analysis.
3. Theory assumptions and proofs
   - Answer: [Yes]
   - Justification: The paper provides theoretical analysis in the appendix, including the assumptions behind the local linearization of OPD dynamics.
4. Experimental result reproducibility
   - Answer: [Yes]
   - Justification: The paper describes the training datasets, model scales, teacher models, evaluation benchmarks, baselines, and the EffOPD procedure needed to reproduce the main experimental results.
5. Open access to data and code
   - Answer: [Yes]
   - Justification: The paper uses publicly available datasets and models. This paper releases the code used in this work through an anonymous link: [https://anonymous.4open.science/r/EffOPD-7C58](https://anonymous.4open.science/r/EffOPD-7C58/README.md).
6. Experimental setting/details
   - Answer: [Yes]
   - Justification: The paper specifies the training tasks, datasets, model scales, teacher models, baselines, evaluation benchmarks, sampling settings, and key hyperparameters of EffOPD.
7. Experiment statistical significance
   - Answer: [No]
   - Justification: The paper reports performance trends across model scales, datasets, and baselines, but does not include formal error bars or statistical significance tests for all experiments.
8. Experiments compute resources
   - Answer: [No]
   - Justification: The paper discusses the computational overhead of EffOPD, but does not yet provide full details of the hardware configuration, memory usage, or total compute required for each experiment.
9. Code of ethics
   - Answer: [Yes]
   - Justification: We have reviewed the NeurIPS Code of Ethics and believe the research conforms to it.
10. Broader impacts
    - Answer: [Yes]
    - Justification: The Impact Statement discusses both positive impacts, such as improving the efficiency and interpretability of LLM post-training, and potential negative impacts, such as reducing the cost of improving harmful models.
11. Safeguards
    - Answer: [N/A]
    - Justification: The paper does not release new pretrained language models or high-risk datasets. The proposed method is an acceleration framework for OPD.
12. Licenses for existing assets
    - Answer: [Yes]
    - Justification: The paper cites the existing datasets, models, and baselines used in the experiments. We follow their intended research usage.
13. New assets
    - Answer: [N/A]
    - Justification: The paper does not introduce or release new datasets, pretrained models, or other standalone assets.
14. Crowdsourcing and research with human subjects
    - Answer: [N/A]
    - Justification: The paper does not involve crowdsourcing experiments or research with human subjects.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Answer: [N/A]
    - Justification: The paper does not involve human subjects research.
16. Declaration of LLM usage
    - Answer: [N/A]
    - Justification: LLMs are the subject of study and evaluation in this work, but they are not used as a non-standard component for developing the core methodology beyond the described OPD and EffOPD training framework.
