PHF: Privileged Hidden Flow for On-Policy Self-Distillation


Summary

PHF proposes a method to distill hidden state trajectories from a privileged teacher to a student during on-policy self-distillation, improving reasoning performance on language models.

arXiv:2606.29340v1 Announce Type: new

Cached at: 06/30/26, 05:33 AM

# Privileged Hidden Flow for On-Policy Self-Distillation
Source: [https://arxiv.org/html/2606.29340](https://arxiv.org/html/2606.29340)
Yuhan Li¹, Mingxu Zhang¹, Dazhong Shen²,\*, Ying Sun³

¹The Hong Kong University of Science and Technology (Guangzhou)

²Nanjing University of Aeronautics and Astronautics

³The 63rd Research Institute, National University of Defense Technology, Nanjing

yuhanli530@gmail.com, shendazhong@nuaa.edu.cn, sunyinggilly@gmail.com

###### Abstract

On-policy self-distillation (OPSD) trains a reasoning model on rollouts sampled from its own policy by matching a privileged teacher that also sees verified reference solutions. Existing OPSD objectives supervise only the output distribution, so privileged context affects training through a token-level divergence without directly supervising the internal computation that produced that distribution. We propose Privileged Hidden Flow (PHF), which additionally distills how a privileged teacher's hidden states move along the same rollout. Rather than forcing each student hidden vector to match the teacher vector at the same token position, PHF aligns token-to-token transition directions and trajectory geometry over selected generated positions. The all-layer recipe also includes an adjacent-layer relation computed from these same transitions, without pointwise hidden-state imitation. Under the same 100-step training schedule, PHF improves the Average@12 aggregate over our reproduced OPSD baseline on Qwen3-1.7B, 4B, and 8B, with observed gains of about +2.2, +1.5, and +1.7 points. The transport objective is exactly invariant to shared trajectory offsets; its local geometry term is also invariant to orthogonal transformations of transition directions. Ablations distinguish the fixed PHF recipe from pointwise hidden-state matching, single-channel transition losses, and layer-subset choices, supporting PHF as a compact hidden-flow extension to OPSD.

![Refer to caption](https://arxiv.org/html/2606.29340v1/x1.png)

Figure 1: Overview of Privileged Hidden Flow (PHF). The online student samples a rollout from the problem alone. A privileged EMA teacher then conditions on the reference solution and the same rollout tokens, supplying both the OPSD output-distribution target and the PHF hidden-flow target. PHF matches normalized token-to-token hidden transitions and their trajectory geometry at every layer, then combines this local transition loss with a neighboring-layer relation computed from the same transitions. The full objective adds this hidden-flow loss to the standard OPSD output loss. The privileged teacher is used only during training, and the deployed model is the online student.

## Introduction

On\-policy distillation is a practical recipe for reducing distribution mismatch in language\-model distillation\. Instead of imitating a fixed offline corpus, the student samples its own rollouts and receives dense supervision on the states it actually visits \(?;?;?\)\. On\-policy self\-distillation \(OPSD\) specializes this idea: a single model acts as both student and teacher, with the teacher conditioned on privileged information such as a verified reference solution while the student sees only the problem \(?\)\. This avoids an external teacher and can be token\-efficient for mathematical reasoning\.

Existing OPSD objectives, however, capture only one route by which privileged context changes the model. At a student-generated prefix, conditioning on the reference can change not only the next-token posterior but also the intermediate representations that support that posterior. Output-only supervision must express this privileged information through a token-level divergence at the output head, without directly constraining the hidden trajectory that produced it. Process-level supervision has improved reasoning through step-level reward models (?), but at the granularity of output steps rather than internal dynamics. This raises a direct question: *can OPSD make better use of privileged context by distilling the internal process it induces, not only the final output distribution?*

A natural answer is to supervise hidden states directly. But this is harder than it appears. Hidden coordinates are not semantically fixed: they drift during training, so an $\ell_2$ target between student and teacher hidden vectors chases a moving point. Moreover, the hidden computation changes across layers as well as across generated rollout positions. A per-layer loss alone also ignores whether adjacent layers preserve similar relations among the same token-to-token hidden moves.

We introduce *Privileged Hidden Flow* (PHF), which addresses both difficulties (Figure [1](https://arxiv.org/html/2606.29340#S0.F1)). Rather than matching hidden states pointwise, PHF matches hidden *transitions*: local displacements of the residual stream along selected rollout positions. Transition matching aligns the *direction* and trajectory *geometry* of the student's internal computation with the privileged teacher's, and does not require the two to occupy the same point in hidden space. The resulting objective is exactly invariant to shared offsets of the hidden trajectory, and its geometry term is also invariant to orthogonal transformations of the transition directions (Section 3); these are simple components of representational drift that pointwise state losses must absorb. We instantiate the privileged teacher with the same smoothed teacher used by OPSD-style self-distillation, but the central design choice is the transition target rather than the averaging mechanism. In short: OPSD teaches the student *which token* the privileged teacher would predict; PHF additionally teaches it *how the privileged teacher's hidden state moves* to get there. The main recipe uses the same transition object twice: first within each layer for local direction and geometry matching, and then across neighboring layers to keep adjacent depth transformations compatible.

PHF adds one scalar loss coefficient to the standard OPSD objective; the 128-position hidden budget and all-layer aggregation recipe are fixed across scales. It introduces no correctness routing, token filtering, reward model, or extra rollouts. On the official OPSD math setting evaluated on American Invitational Mathematics Examination (AIME) 2024, AIME 2025, and Harvard–MIT Mathematics Tournament (HMMT) 2025, PHF improves the checkpoint-100 Average@12 aggregate over our reproduced OPSD baseline by about +2.2 on Qwen3-1.7B, +1.5 on Qwen3-4B, and +1.7 on Qwen3-8B. The aggregate gain is positive at every scale, and all but one benchmark-scale cell moves in the same direction. Ablations then locate the design boundary: the transition target, all-layer aggregation, and shared privileged teacher source matter, while nearby pointwise and selected-layer hidden-supervision controls remain competitive in some settings.

Contributions\.

- We introduce a privileged transition-transport objective for OPSD. On student rollouts, PHF matches normalized token-to-token hidden transitions and their within-trajectory Gram geometry, rather than matching hidden-state vectors point by point.
- We characterize simple invariances of the transport loss. The local transition objective is unchanged by hidden-trajectory offsets and by positive per-transition rescaling, which the normalization removes; its geometry term is also invariant to independent orthogonal transformations of the student and teacher transition sets. These structural properties separate PHF from pointwise hidden-state matching.
- We average the same transition loss over all layers and, in the main recipe, add a fixed neighboring-layer relation computed from the same transitions. This avoids selecting a small hand-picked layer subset while keeping the supervision tied to hidden motion.
- We evaluate the fixed recipe on Qwen3-1.7B, 4B, and 8B under the official OPSD thinking-mode protocol, including Base, SFT, and GRPO context rows. PHF improves the checkpoint-100 Average@12 aggregate over our reproduced OPSD baseline at all three scales, and ablations isolate the roles of transition geometry, pointwise hidden-state controls, and channel coupling.

## Related Work

#### Distillation and on\-policy self\-distillation\.

Knowledge distillation transfers a teacher’s behavior to a student, from early compression methods \(?;?\) to softened output matching \(?\) and sequence\-level distillation \(?\); surveys include \(?;?\)\. On\-policy variants reduce distribution mismatch by supervising states visited by the student, as in generalized knowledge distillation \(?;?\), MiniLLM’s reverse\-KL training \(?\), and DistiLLM’s skewed divergences and off\-policy reuse \(?\)\. OPSD applies this idea to reasoning self\-distillation: the teacher is the same model conditioned on privileged information, such as a verified reference, while the student sees only the problem \(?\)\. PHF keeps OPSD’s on\-policy output supervision but adds a hidden\-process target on the same rollout\.

#### Hidden and relational representation distillation\.

Intermediate\-feature distillation includes hidden\-layer hints \(?\), multi\-layer transfer for BERT \(?\), and attention or embedding matching in compact language models \(?;?\)\. Related feature losses often match hidden states at selected layers, together with output, attention, or other representation losses \(?;?\)\. A relational line instead matches structure among examples or features, as in relational distillation \(?\), attention transfer \(?\), and contrastive representation distillation \(?\)\. PHF follows the relational spirit but changes the object being matched: it compares token\-to\-token hidden transitions and their trajectory geometry within a rollout, rather than static hidden states at individual positions\. This sets the closest novelty boundary for PHF and motivates the pointwise hidden\-state and selected\-layer controls in our experiments\. We do not claim that hidden representation supervision is new by itself; the contribution is the specific privileged, on\-policy transition\-geometry target and its empirical boundary against nearby pointwise state and layer\-selected controls\.

#### Reasoning and process supervision\.

Reasoning models benefit from chain\-of\-thought prompting \(?\), sampling\-based self\-consistency \(?\), and training on rationales or generated solutions \(?;?;?\), with evaluation on mathematical\-reasoning benchmarks \(?;?;?\) and strong math models \(?;?\)\. Process supervision and reward modeling score intermediate reasoning steps rather than only final answers \(?;?;?\), while policy\-optimization methods such as PPO, RLHF, and GRPO \(?;?;?\) optimize action\-level behavior through rewards\. PHF shares the process\-oriented motivation, but its supervision is internal: it distills how privileged context changes hidden transitions along the rollout\.

## Method

### OPSD Preliminaries

Let $x$ be a problem, $r$ privileged information such as a reference solution, and $y_{<t}$ a prefix sampled from the current student. OPSD compares two conditionals on the same student-visited state:

$$p_{S}(\cdot\mid x,y_{<t})=\pi_{\theta}(\cdot\mid x,y_{<t}), \tag{1}$$

$$p_{T}(\cdot\mid x,r,y_{<t})=\pi_{\bar{\theta}}(\cdot\mid x,r,y_{<t}), \tag{2}$$

where $\theta$ is the online student and $\bar{\theta}$ the teacher parameters. Standard OPSD minimizes a per-token divergence, here the Jensen-Shannon divergence, on student-visited states:

$$\mathcal{L}_{\mathrm{OPSD}}=\frac{1}{T}\sum_{t=1}^{T}D_{\mathrm{JS}}\!\left(p_{T}(\cdot\mid x,r,y_{<t})\;\|\;p_{S}(\cdot\mid x,y_{<t})\right). \tag{3}$$

This objective supervises only the output policy. In practice the per-token divergence is clipped at a small trust-region value, following the official OPSD recipe; implementation details are given in the supplementary material. PHF keeps this output channel unchanged and adds a process-level hidden flow term.
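As a minimal sketch of Eq. (3), the snippet below computes a per-token Jensen-Shannon divergence between teacher and student next-token distributions and averages it over positions. The function names, the toy distributions, and the clipping rule (capping each token's divergence at `clip`) are illustrative assumptions; the official OPSD recipe may implement its trust region differently.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence with natural logarithms (bounded by log 2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def opsd_loss(teacher_probs, student_probs, clip=0.05):
    """Mean over positions of the per-token JSD, capped at `clip`.

    The cap is an assumed stand-in for the clipping mentioned in the text.
    """
    per_token = [min(js_divergence(pt, ps), clip)
                 for pt, ps in zip(teacher_probs, student_probs)]
    return sum(per_token) / len(per_token)

# Toy example: two rollout positions over a three-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # privileged teacher p_T
student = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]  # online student p_S
loss = opsd_loss(teacher, student)
```

In this toy case the first position contributes nothing because the two distributions agree, and the second position's divergence exceeds the cap, so only the capped value enters the mean.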

### Hidden Transition Transport

Operationally, one PHF update first samples a student rollout, then runs the student on the plain prompt and the privileged teacher on the privileged prompt using the same rollout tokens\. The standard OPSD loss matches their output distributions, while the additional flow loss compares how their hidden states move along the rollout at each layer\. The resulting per\-layer losses are then aggregated across depth\.

For layer $\ell$, let $h^{S}_{\ell,t}$ and $h^{T}_{\ell,t}$ be the residual-stream hidden states of the student and privileged teacher at rollout position $t$. Let $\mathcal{W}=\{w_{1}<\cdots<w_{m}\}$ be the deterministic set of valid generated positions used for the hidden loss. We define the local hidden transition as the difference between neighboring selected positions along the rollout,

$$\Delta h^{S}_{\ell,i}=h^{S}_{\ell,w_{i+1}}-h^{S}_{\ell,w_{i}},\qquad\Delta h^{T}_{\ell,i}=h^{T}_{\ell,w_{i+1}}-h^{T}_{\ell,w_{i}}, \tag{4}$$

and normalize transitions to unit directions,

$$d^{S}_{\ell,i}=\frac{\Delta h^{S}_{\ell,i}}{\|\Delta h^{S}_{\ell,i}\|_{2}+\epsilon},\qquad d^{T}_{\ell,i}=\frac{\Delta h^{T}_{\ell,i}}{\|\Delta h^{T}_{\ell,i}\|_{2}+\epsilon}. \tag{5}$$

The process axis is the generated rollout position; transformer depth is an observation axis over which we repeat the same comparison. PHF does not use a layer-to-layer difference such as $h_{\ell+1,t}-h_{\ell,t}$ as its basic transition object. The transition-direction loss aligns local displacements over this selected position set,

$$\mathcal{L}_{\mathrm{dir}}(\ell)=\frac{1}{m-1}\sum_{i=1}^{m-1}\bigl(1-\langle d^{S}_{\ell,i},d^{T}_{\ell,i}\rangle\bigr). \tag{6}$$

Direction matching aligns each step but ignores the *shape* of the trajectory. We therefore add a transition-geometry term. Let $D^{S}_{\ell}$ stack the rows $d^{S}_{\ell,i}$ for $i=1,\ldots,m-1$, and likewise $D^{T}_{\ell}$; define the Gram matrices $G^{S}_{\ell}=D^{S}_{\ell}(D^{S}_{\ell})^{\top}$ and $G^{T}_{\ell}=D^{T}_{\ell}(D^{T}_{\ell})^{\top}$. The geometry loss matches which transitions are parallel, orthogonal, or opposing:

$$\mathcal{L}_{\mathrm{geo}}(\ell)=\frac{1}{(m-1)^{2}}\bigl\|G^{S}_{\ell}-G^{T}_{\ell}\bigr\|_{F}^{2}. \tag{7}$$

The Gram comparison is a within-sequence, representation-similarity-style construction inspired by CKA (?), applied to transition directions rather than states: it supervises the relational structure of the trajectory rather than pointwise hidden coordinates. The per-layer transition loss is the mean of the two,

$$\mathcal{L}_{\mathrm{layer}}(\ell)=\tfrac{1}{2}\bigl(\mathcal{L}_{\mathrm{dir}}(\ell)+\mathcal{L}_{\mathrm{geo}}(\ell)\bigr). \tag{8}$$

This differs from pointwise hidden-state matching in *what is asked of the student*: not to make $h^{S}_{\ell,t}$ equal $h^{T}_{\ell,t}$ at every selected position, but to make comparable token-to-token moves under the privileged process. The next subsection makes the resulting invariances precise.
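The per-layer loss of Eqs. (4) to (8) can be sketched in a few lines of plain Python for a single layer. This is an illustrative sketch, not the authors' implementation: the function names, the value of `EPS`, and the two-dimensional toy trajectories are assumptions, and a real implementation would operate on batched tensors with the teacher side detached from the gradient.

```python
import math

EPS = 1e-8  # stands in for the epsilon of Eq. (5); the value is an assumption

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directions(hidden):
    """Eqs. (4)-(5): differences of neighboring selected states, scaled to unit length."""
    dirs = []
    for prev, nxt in zip(hidden, hidden[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]
        norm = math.sqrt(dot(delta, delta))
        dirs.append([v / (norm + EPS) for v in delta])
    return dirs

def direction_loss(d_s, d_t):
    """Eq. (6): mean of one minus the cosine between matched transitions."""
    return sum(1.0 - dot(u, v) for u, v in zip(d_s, d_t)) / len(d_s)

def gram(d):
    """Pairwise inner products among the transition directions of one trajectory."""
    return [[dot(u, v) for v in d] for u in d]

def geometry_loss(d_s, d_t):
    """Eq. (7): squared Frobenius gap between Gram matrices, divided by (m-1)^2."""
    g_s, g_t = gram(d_s), gram(d_t)
    n = len(d_s)
    gap = sum((g_s[i][j] - g_t[i][j]) ** 2 for i in range(n) for j in range(n))
    return gap / (n * n)

def layer_loss(h_s, h_t):
    """Eq. (8): equal-weight mean of the direction and geometry terms."""
    d_s, d_t = directions(h_s), directions(h_t)
    return 0.5 * (direction_loss(d_s, d_t) + geometry_loss(d_s, d_t))

# Hidden states at m = 4 selected positions (toy 2-D residual stream).
h_teacher = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
# Same moves as the teacher, but shifted and with different step lengths.
h_student = [[5.0, 5.0], [7.0, 5.0], [7.0, 8.0], [8.0, 8.0]]
# A straight line: the middle move points the wrong way relative to the teacher.
h_straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

loss_same_moves = layer_loss(h_student, h_teacher)
loss_straight = layer_loss(h_straight, h_teacher)
```

The first student trajectory never coincides with the teacher's states yet incurs essentially zero loss, whereas the straight-line trajectory is penalized by both the direction term and the Gram term.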

#### Neighboring\-layer relation\.

As part of all\-layer aggregation, PHF also compares how adjacent layers relate the same transition directions\. A purely per\-layer average does not check whether adjacent layers relate these motions in the same way as the privileged teacher\. PHF therefore uses the same transition\-only object to compare neighboring layers\. Define

$$C^{S}_{\ell,\ell+1}=D^{S}_{\ell}(D^{S}_{\ell+1})^{\top},\qquad C^{T}_{\ell,\ell+1}=D^{T}_{\ell}(D^{T}_{\ell+1})^{\top}. \tag{9}$$

These matrices measure how each selected rollout move at layer $\ell$ relates to the moves one layer later. PHF matches this relation between the student and the privileged teacher,

$$\mathcal{L}_{\mathrm{adj}}=\frac{1}{L-1}\sum_{\ell=1}^{L-1}\frac{1}{(m-1)^{2}}\bigl\|C^{S}_{\ell,\ell+1}-C^{T}_{\ell,\ell+1}\bigr\|_{F}^{2}. \tag{10}$$

This term is still transition-based: it never asks the student to match a teacher hidden vector at a fixed token position. Instead, it asks the student to pass hidden motions between neighboring layers with the same relational pattern as the privileged teacher.
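A small sketch of Eqs. (9) and (10), assuming the unit transition directions of every layer have already been computed. The function names and the two-layer toy inputs are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_relation(d_lower, d_upper):
    """Eq. (9): entry (i, j) relates move i at layer l to move j at layer l + 1."""
    return [[dot(u, v) for v in d_upper] for u in d_lower]

def adjacent_loss(dirs_student, dirs_teacher):
    """Eq. (10): mean over neighboring layer pairs of the scaled squared Frobenius gap.

    Each argument is a list over layers; each layer holds m - 1 unit directions.
    """
    num_layers = len(dirs_student)
    total = 0.0
    for layer in range(num_layers - 1):
        c_s = cross_relation(dirs_student[layer], dirs_student[layer + 1])
        c_t = cross_relation(dirs_teacher[layer], dirs_teacher[layer + 1])
        n = len(c_s)
        gap = sum((c_s[i][j] - c_t[i][j]) ** 2 for i in range(n) for j in range(n))
        total += gap / (n * n)
    return total / (num_layers - 1)

# Two layers, two transitions each, in a toy 2-D hidden space.
teacher = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
# Student whose directions are rotated by 90 degrees in BOTH layers.
joint_rotation = [[[0.0, 1.0], [-1.0, 0.0]], [[0.0, 1.0], [-1.0, 0.0]]]
# Student whose directions are rotated by 90 degrees in the upper layer only.
upper_only = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [-1.0, 0.0]]]

loss_joint = adjacent_loss(joint_rotation, teacher)
loss_upper = adjacent_loss(upper_only, teacher)
```

Rotating both layers together leaves the cross-layer relation intact, so the loss is zero; rotating only one layer changes the relative orientation between adjacent layers, which is exactly what this term is meant to penalize.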

### Structural Properties

The local transport objective has exact algebraic invariances that pointwise hidden-state matching lacks. Fix a layer $\ell$ and write $d^{S}_{\ell,i},d^{T}_{\ell,i}$ for the normalized student and teacher transition directions, with Gram matrices $G^{S}_{\ell}$ and $G^{T}_{\ell}$.

###### Proposition 1 (Invariances of transition transport)

For any layer $\ell$, $\mathcal{L}_{\mathrm{dir}}(\ell)$ and $\mathcal{L}_{\mathrm{geo}}(\ell)$ are unchanged by:

1. adding any position-independent offset to either trajectory, $h^{S}_{\ell,t}\mapsto h^{S}_{\ell,t}+c^{S}_{\ell}$ and $h^{T}_{\ell,t}\mapsto h^{T}_{\ell,t}+c^{T}_{\ell}$;
2. multiplying any transition by a positive scalar before normalization, $\Delta h^{S/T}_{\ell,i}\mapsto s^{S/T}_{\ell,i}\,\Delta h^{S/T}_{\ell,i}$, up to the numerical $\epsilon$ in the denominator.

In addition, for any orthogonal maps $R_{S},R_{T}$, the geometry loss is unchanged under $d^{S}_{\ell,i}\mapsto R_{S}d^{S}_{\ell,i}$ and $d^{T}_{\ell,i}\mapsto R_{T}d^{T}_{\ell,i}$; the direction loss has the same orthogonal invariance when $R_{S}=R_{T}$. In contrast, the pointwise state loss $\|h^{S}_{\ell,t}-h^{T}_{\ell,t}\|_{2}^{2}$ generally changes under these transformations.

The proof follows from differencing (offsets cancel in $\Delta h$), normalization (positive scalars cancel in $d$), and the identity $(DR)(DR)^{\top}=DD^{\top}$ for orthogonal $R$. These properties give the intended reading of PHF: the local loss tracks *how* the representation moves while discarding its absolute location and, through normalization, its step magnitude. The geometry term further compares the relational shape of the transition trajectory rather than pointwise hidden coordinates. The neighboring-layer term uses the same transitions and retains relative orientation between adjacent layers because that relative orientation is the compatibility signal it measures.
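The invariances in Proposition 1 can be checked numerically on a toy trajectory. The sketch below is illustrative only: the helper names, the 2-D trajectories, the offset, the step scales, and the rotation angle are all arbitrary assumptions, and `EPS` stands in for the $\epsilon$ of Eq. (5).

```python
import math

EPS = 1e-12  # assumed value for the epsilon of Eq. (5)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directions(hidden):
    """Unit transition directions between neighboring selected positions."""
    out = []
    for prev, nxt in zip(hidden, hidden[1:]):
        delta = [b - a for a, b in zip(prev, nxt)]
        out.append([v / (math.sqrt(dot(delta, delta)) + EPS) for v in delta])
    return out

def dir_loss(d_s, d_t):
    return sum(1.0 - dot(u, v) for u, v in zip(d_s, d_t)) / len(d_s)

def geo_loss(d_s, d_t):
    n = len(d_s)
    return sum((dot(d_s[i], d_s[j]) - dot(d_t[i], d_t[j])) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def pointwise_loss(h_s, h_t):
    """Mean squared L2 distance between matched hidden states."""
    return sum(sum((a - b) ** 2 for a, b in zip(u, v))
               for u, v in zip(h_s, h_t)) / len(h_s)

def rotate(v, angle):
    """Apply a 2-D rotation, which is an orthogonal map."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def rescale_steps(hidden, scales):
    """Rebuild a trajectory whose i-th transition is multiplied by scales[i] > 0."""
    out = [list(hidden[0])]
    for (prev, nxt), s in zip(zip(hidden, hidden[1:]), scales):
        out.append([o + s * (b - a) for o, a, b in zip(out[-1], prev, nxt)])
    return out

h_t = [[0.0, 0.0], [1.0, 0.5], [1.5, 2.0], [3.0, 2.5]]
h_s = [[0.2, -0.1], [1.0, 0.8], [1.2, 2.0], [2.5, 3.0]]
d_s, d_t = directions(h_s), directions(h_t)
base_dir, base_geo = dir_loss(d_s, d_t), geo_loss(d_s, d_t)

shifted = [[a + 10.0, b - 4.0] for a, b in h_s]  # property 1: shared offset
rescaled = rescale_steps(h_s, [3.0, 0.5, 2.0])   # property 2: positive step scales
rotated_d_s = [rotate(d, 0.7) for d in d_s]      # orthogonal map on the student only

offset_gap = abs(dir_loss(directions(shifted), d_t) - base_dir)
scale_gap = abs(dir_loss(directions(rescaled), d_t) - base_dir)
rotation_geo_gap = abs(geo_loss(rotated_d_s, d_t) - base_geo)
rotation_dir_change = dir_loss(rotated_d_s, d_t) - base_dir
```

Here the offset, rescaling, and geometry gaps vanish up to floating-point error, while the direction loss does change when only the student is rotated, matching the requirement $R_{S}=R_{T}$ in the proposition. The pointwise loss, by contrast, grows sharply under the same offset.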

#### Connection to the output channel\.

The two supervision channels are algebraically linked at the final layer. Writing $\tilde{h}_{L,t}$ for the final-norm output and $W$ for the linear LM head, the logit displacement between adjacent positions is $\Delta z_{t}=W(\tilde{h}_{L,t+1}-\tilde{h}_{L,t})$: the output posterior can change only through the head's image of the final-layer transition. Output-level OPSD therefore already supervises a *projection* of one transition. The flow term adds targets on how the privileged state moves from one position to the next at each layer, including hidden directions that are not directly specified by the scalar output-posterior divergence. In this sense PHF adds a process signal connected to OPSD's output channel rather than an unrelated auxiliary objective. We do not claim that this quantifies independent information; the coupling ablations in Section 4 only probe whether sharing the same privileged process source matters empirically. This link should not be read as an information-theoretic certificate. We do not show that hidden flow carries information beyond what is recoverable from the output posterior; the hidden-flow loss is an auxiliary training target whose value is tested under the fixed OPSD protocol.
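For completeness, the logit-displacement identity follows in one line from the linearity of the head. The symbol $z_t$ for the logits at position $t$ is taken from the text's $\Delta z_t$; writing $z_t = W\tilde{h}_{L,t}$ is the only assumption, and an additive head bias, if present, would cancel in the difference as well.

```latex
\begin{aligned}
z_t &= W\,\tilde{h}_{L,t}, \\
\Delta z_t &= z_{t+1} - z_t
            = W\,\tilde{h}_{L,t+1} - W\,\tilde{h}_{L,t}
            = W\bigl(\tilde{h}_{L,t+1} - \tilde{h}_{L,t}\bigr).
\end{aligned}
```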

### Privileged Teacher Source

We instantiate the privileged teacher as an EMA copy of the online model, following the standard use of weight\-averaged teachers to smooth training targets \(?;?\),

$$\bar{\theta}_{k}=\rho\,\bar{\theta}_{k-1}+(1-\rho)\,\theta_{k},\qquad\rho=0.999, \tag{11}$$

used only during training. Both its output logits and hidden transitions are computed under the privileged prompt $(x,r,y_{<t})$ and detached from the gradient. This averaging choice stabilizes the target under a changing on-policy state distribution; it is not a deployment mechanism, and the evaluated model is always the online student $\theta$.
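The EMA update of Eq. (11) is simple to sketch on a flat list of parameters. The function name and the toy setup below are assumptions; in particular, the example assumes the update is applied once per optimizer step and holds the student fixed, which real training does not do.

```python
def ema_update(teacher_params, student_params, rho=0.999):
    """Eq. (11): move each teacher parameter a fraction (1 - rho) toward the student."""
    return [rho * t + (1.0 - rho) * s
            for t, s in zip(teacher_params, student_params)]

# Toy check: a teacher starting at 0 tracks a student held fixed at 1.
teacher = [0.0]
student = [1.0]
for _ in range(100):  # 100 steps, matching the training budget reported later
    teacher = ema_update(teacher, student)

fraction_tracked = teacher[0]  # equals 1 - rho**100 in exact arithmetic
```

Under these assumptions the teacher closes only about a tenth of the gap to the student in 100 steps, which illustrates how strongly a decay of 0.999 smooths the target over a short run.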

### All\-Layer Aggregation

Hidden distillation usually depends on selecting a few intermediate layers\. PHF instead uses all layers\. The local PHF component averages the per\-layer transition loss over depth,

$$\mathcal{L}_{\mathrm{local}}=\frac{1}{L}\sum_{\ell=1}^{L}\mathcal{L}_{\mathrm{layer}}(\ell). \tag{12}$$

The PHF recipe used for the main table gives equal weight to motion matching along selected rollout positions inside each layer and a neighboring-layer relation computed from the same selected-position transitions,

$$\mathcal{L}_{\mathrm{Flow}}=\tfrac{1}{2}\mathcal{L}_{\mathrm{local}}+\tfrac{1}{2}\mathcal{L}_{\mathrm{adj}}. \tag{13}$$

This removes the layer-selection knob; the neighboring-layer relation is a fixed aggregation choice that avoids treating layers as fully independent curves. The method has one coefficient $\alpha$ for the whole process channel; the half-and-half split is fixed across scales and should be read as part of the evaluated recipe rather than as an optimized weighting rule.

### Full Objective

The final PHF objective adds the flow term to OPSD,

$$\mathcal{L}_{\mathrm{PHF}}=\mathcal{L}_{\mathrm{OPSD}}+\alpha\,\mathcal{L}_{\mathrm{Flow}}, \tag{14}$$

where $\mathcal{L}_{\mathrm{Flow}}=\mathcal{L}_{\mathrm{local}}$ for the local transition version and $\mathcal{L}_{\mathrm{Flow}}=\frac{1}{2}\mathcal{L}_{\mathrm{local}}+\frac{1}{2}\mathcal{L}_{\mathrm{adj}}$ for PHF with neighboring-layer relations, and $\alpha=0.05$, chosen once to be on the same order as the clipped token-level OPSD loss. We keep this coefficient fixed across model scales rather than tuning it per model. The flow is computed over at most $|\mathcal{W}|=128$ valid generated positions, selected deterministically to cover the rollout. This is a compute-driven truncation because storing hidden states for every layer over full rollouts is memory-prohibitive. We did not sweep the position budget or sampling rule; the 128-position choice is part of the recipe evaluated here, not a claim that other positions are uninformative. The coefficient $\alpha$ is the only tuned scalar loss weight PHF adds; the EMA teacher, position budget, and local/neighboring-layer split are fixed recipe choices. No correctness routing, token filtering, reward model, or additional rollout set is introduced.
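The pieces of the full objective can be assembled as in the sketch below. The combination in `phf_objective` follows Eqs. (13) and (14) directly. The position-selection rule is a hypothetical stand-in: the text says only that positions are chosen deterministically to cover the rollout, so evenly spaced indices are an assumption, as are the function names and the toy loss values.

```python
def select_positions(num_generated, budget=128):
    """Pick at most `budget` generated positions, evenly spaced over the rollout.

    Hypothetical rule: the exact deterministic selection is not specified here.
    """
    if budget < 2:
        raise ValueError("at least two positions are needed to form a transition")
    if num_generated <= budget:
        return list(range(num_generated))
    # Integer arithmetic keeps the indices strictly increasing, from 0 to the last token.
    return [(i * (num_generated - 1)) // (budget - 1) for i in range(budget)]

def phf_objective(opsd_loss, local_loss, adj_loss, alpha=0.05):
    """Eqs. (13)-(14): OPSD loss plus alpha times the half-and-half flow loss."""
    flow = 0.5 * local_loss + 0.5 * adj_loss
    return opsd_loss + alpha * flow

positions = select_positions(1024)        # rollout length used in the experiments
total = phf_objective(0.04, 0.20, 0.10)   # toy loss values, not measured numbers
```

With a 1024-token rollout this rule yields 128 positions and therefore 127 transitions per layer; neighboring selected positions are then several tokens apart, so under this assumed rule each "transition" spans a short segment of the rollout rather than a single token step.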

## Experiments

### Setup

#### Models and data\.

We evaluate on the official OPSD mathematical-reasoning setting with three Qwen3 models (?): Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Their layer counts and hidden sizes are taken from the released model configurations: 28 layers with hidden size 2048 for 1.7B, and 36 layers with hidden sizes 2560 and 4096 for 4B and 8B, respectively. Training uses the official OPSD math set of 29,434 examples with student rollouts of length 1024.

#### Training\.

For the OPSD/PHF comparison, we follow the official OPSD learning-rate schedule (cosine decay) and run on-policy training for 100 optimizer steps with LoRA and rollout temperature 1.1. The JSD token clip is 0.05 for Qwen3-1.7B and Qwen3-4B, and 0.06 for Qwen3-8B, matching the reproduced OPSD configuration at each scale. PHF adds hidden transition supervision with window $|\mathcal{W}|=128$, coefficient $\alpha=0.05$, and EMA decay $\rho=0.999$. OPSD and PHF use the same schedule and step budget. PHF keeps the OPSD output loss but adds the privileged hidden-process recipe described in the Method section. The privileged forward pass conditions on verified reference solutions during training only; evaluation always uses the online student without references. Teacher-source controls are reported separately. Base, SFT, and GRPO rows are context baselines evaluated with the same thinking-mode Average@12 protocol.

#### Evaluation\.

We use the official OPSD thinking-mode evaluation on the American Invitational Mathematics Examination 2024, American Invitational Mathematics Examination 2025, and Harvard–MIT Mathematics Tournament 2025 (30 problems each). For every problem we draw $n=12$ samples (temperature 1.0, top-$p=0.95$, up to 38,912 new tokens, thinking enabled) and report Average@12, the mean per-sample correctness. The primary aggregate is the reported Average column for the evaluation suite. Unless stated otherwise, every reported trained checkpoint is evaluated at checkpoint 100.
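As a small illustration of the metric, the sketch below computes Average@$n$ from per-sample correctness flags. The function name and toy data are hypothetical, and reporting the score on a 0 to 100 scale is an assumption made to match the "points" used in the results.

```python
def average_at_n(correct):
    """Mean per-sample correctness over problems, on a 0-100 scale.

    `correct` holds one list of booleans per problem (n samples each).
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

# Toy benchmark with two problems and n = 12 samples per problem.
toy = [
    [True] * 6 + [False] * 6,  # 6 of 12 samples correct
    [True] * 12,               # all 12 samples correct
]
score = average_at_n(toy)
```

Because every benchmark in the suite has the same number of problems and samples, averaging per-benchmark scores and averaging over all samples pooled together would give the same suite aggregate.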

#### Baselines\.

Our principal reference point is the reproduced*OPSD*baseline \(output\-only, same schedule and budget\), since PHF is an additive extension of the same framework\. We additionally include the base checkpoint, a supervised\-finetuning checkpoint, and a GRPO checkpoint as context for the scale of the OPSD/PHF gains\.

### Main Results

Table 1: Checkpoint-100 Average@12 on Qwen3 models under thinking-mode evaluation. Average is the suite aggregate. Base, SFT, and GRPO are context rows under the same $n=12$ protocol; the primary comparison is OPSD vs. PHF under the same 100-step schedule.

Table [1](https://arxiv.org/html/2606.29340#Sx4.T1) reports the reproduced checkpoint-100 comparison. PHF improves the Average@12 aggregate over the OPSD baseline at all three model scales: about +2.2 points (1.7B), +1.5 points (4B), and +1.7 points (8B). With the same hidden window and PHF coefficient, the primary aggregate moves in the same direction at every size; all but one reported benchmark-scale cell also has a positive observed checkpoint-100 delta. These deltas show that an auxiliary process-level signal can complement an already strong output-distribution objective, not replace it.

#### Interpreting the comparison\.

The cross\-scale comparison is the main evidence surface: within the reproduced OPSD setting, the fixed PHF recipe combines a transition\-based process target, trajectory geometry, and all\-layer aggregation into a consistently positive OPSD extension across the three tested model scales\. The context rows are intentionally not treated as alternative PHF baselines: Base, SFT, and GRPO use the same evaluation harness, but they differ in training objective or supervision source\. The controlled comparison is therefore OPSD versus PHF at matched step budget, rollout distribution, checkpoint, LoRA configuration, and evaluation protocol\. This keeps the evidence focused on the incremental hidden\-process target rather than on differences in data, reward modeling, or decoding\.

### Design Ablations

Table [2](https://arxiv.org/html/2606.29340#Sx4.T2) reports the design ablations that were evaluated at all three model scales. Each row keeps the OPSD output objective, training schedule, checkpoint, and 128-position hidden budget fixed unless the row name states the changed factor.

Table 2: Cross-scale checkpoint-100 design ablations. Entries are Average@12 suite aggregates. All rows are evaluated at all three model scales under the same OPSD output objective, training schedule, checkpoint, and 128-position hidden budget.

Three patterns emerge. First, PHF is the strongest aggregate row at every scale. Second, direction-only and geometry-only variants expose useful signal, but neither component alone matches the full direction-plus-geometry recipe across scales. Third, the selected-layer control trails all-layer PHF, supporting the choice to average the transition objective across depth rather than choosing a small hand-picked layer subset. The ablation table is deliberately narrow: it includes only variants evaluated for all three model sizes under the same checkpoint-100 protocol. Rows with partial scale coverage are kept out of the main table because they are useful diagnostics but would weaken the comparison surface. Within this restricted surface, the main recipe is not just a larger loss; it preserves the same OPSD output objective and changes only how the privileged hidden process is matched.

### Analysis

#### Why transitions rather than states\.

Pointwise hidden-state matching (as in normalized MSE objectives) requires the student and teacher to occupy the same point in representation space at each selected token position, a target that itself moves as the student's LoRA updates shift the manifold during on-policy training. Transition transport asks only that the student *move* like the teacher: by Proposition [1](https://arxiv.org/html/2606.29340#Thmproposition1) it is invariant to trajectory offsets and insensitive to transition rescaling after normalization, and its geometry term is also invariant to independent rotations of the two direction sets. These are common components of representational drift that pointwise MSE must absorb as loss. The Gram term additionally targets relational trajectory structure (which transitions are parallel, orthogonal, or opposing) rather than a pointwise state target.

Table [2](https://arxiv.org/html/2606.29340#Sx4.T2) tests the transition components across all three model scales. Keeping only the direction term loses the aggregate gain at 1.7B and 8B, while keeping only the geometry term is more competitive but still trails the full PHF recipe at every scale. These rows indicate that the two components are not interchangeable: direction matching supplies aligned local motion, whereas the Gram term supplies relational trajectory shape. Thus single-channel controls expose useful signal, but they do not give the same transition-based process objective as the fixed direction-plus-geometry recipe. This distinction matters because the privileged reference changes the *process* the model is asked to imitate, not merely the final answer distribution. Direction matching aligns local motion between adjacent selected positions, while the geometry term constrains the relations among those motions inside the rollout. PHF therefore supplies a target that is invariant to simple coordinate shifts but still sensitive to whether the student's hidden trajectory organizes the same reasoning steps as the privileged teacher.

#### Role of the EMA teacher\.

On-policy training changes both the model and the sampled state distribution, so a privileged forward pass from the instantaneous model is a noisy process estimate. The EMA teacher smooths this process over recent history while still tracking the student. This does not turn EMA into an independent deployment mechanism: the EMA copy is used only as a training-time process source, the online student is always the evaluated model, and no teacher, verifier, or privileged reference is needed at test time, which keeps the method aligned with the OPSD deployment story. The teacher is only a stabilized view of the current learner under the reference-conditioned prompt, so the training signal stays on-policy rather than becoming a separate offline distillation stage.
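The smoothing effect of the EMA teacher can be illustrated with a toy update. This is a minimal sketch, assuming parameters are held in a dictionary of NumPy arrays and the decay is the $\rho=0.999$ value listed in Appendix A; the function name `ema_update` and the toy parameter are hypothetical and not taken from the OPSD codebase.

```python
import numpy as np

def ema_update(teacher, student, rho=0.999):
    """Update the teacher dict as rho * teacher + (1 - rho) * student."""
    for name, s in student.items():
        teacher[name] = rho * teacher[name] + (1.0 - rho) * s
    return teacher

# Toy example: one 2-element "parameter" and a student that stays fixed.
student = {"w": np.array([1.0, 2.0])}
teacher = {"w": np.array([0.0, 0.0])}
for _ in range(100):  # 100 updates, matching the 100-step budget
    ema_update(teacher, student)

# Fraction of the initial student-teacher gap that remains: rho ** 100.
remaining_gap = 1.0 - teacher["w"][0]
```

In this toy setting about 90% of the initial gap remains after 100 updates, which illustrates how slowly an EMA copy with this decay moves relative to the online weights.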

#### Training budget and stability\.

PHF trains under the same optimizer schedule as OPSD without requiring a reward model, correctness routing, or additional rollout set. The control variants in Table [2](https://arxiv.org/html/2606.29340#Sx4.T2) therefore do not introduce extra rollouts or rewards, but these results alone do not separate optimization effects from the objective being tested. PHF adds one privileged hidden-process forward pass and hidden-state storage, while keeping the rollout set and optimizer schedule unchanged. This accounting is important for interpreting the gains. The method does not increase the number of sampled completions, does not filter trajectories by correctness, and does not add a reward model. Its extra cost is concentrated in the hidden-state window and the privileged forward pass used to form the process target. The comparison therefore asks whether the same on-policy training run can use the verified reference more effectively, rather than whether additional search or reward feedback improves the final answer.

### Checkpoint Dynamics

Figure [2](https://arxiv.org/html/2606.29340#Sx4.F2) plots available AIME 2024 Average@12 checkpoint evaluations. The curves show checkpoint trajectories for the corresponding OPSD/PHF recipe, and the checkpoint-100 markers are aligned to the main Table [1](https://arxiv.org/html/2606.29340#Sx4.T1) results. The plot is not an additional selection rule: all reported headline numbers still come from the fixed checkpoint-100 protocol. The curves also illustrate why we report the checkpoint-100 table rather than a best-checkpoint sweep. Intermediate checkpoints can move non-monotonically under the stochastic thinking-mode evaluation, especially with $n=12$ samples per problem. We therefore use the dynamics plot only as a sanity check that PHF is being compared along the same training trajectory family as OPSD, not as a criterion for choosing a more favorable endpoint.

![Refer to caption](https://arxiv.org/html/2606.29340v1/x2.png)

Figure 2: AIME 2024 Average@12 checkpoint dynamics for OPSD and PHF, with one panel per model scale. The checkpoint-100 markers match the main Table [1](https://arxiv.org/html/2606.29340#Sx4.T1) results.

#### All-layer spread.

The recorded all-layer diagnostics indicate that the hidden-flow layer energy is not concentrated in a single layer in these diagnostic runs. At checkpoint 100, the largest layer contributes $6.5\%$, $5.1\%$, and $4.8\%$ of the recorded layer energy for Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively. The corresponding layer-energy entropies are $3.17$, $3.38$, and $3.40$ (with maxima $\log 28$ and $\log 36$). We use these trainer-state diagnostics only to justify all-layer aggregation as a reasonable default, not as evidence that the hidden loss carries independent behavioral information beyond the output posterior. The selected-layer control in Table [2](https://arxiv.org/html/2606.29340#Sx4.T2) is consistent with this view: a fixed hand-picked subset is a useful boundary check, but it leaves performance below the all-layer PHF recipe at every tested scale. We therefore treat all-layer aggregation as part of the fixed method, not as a post-hoc selection over layers.
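The reported entropies can be put on a common scale by dividing by the entropy of a uniform spread over layers. The sketch below assumes the entropy is measured in nats over per-layer energy shares; the dictionary layout and key names are illustrative and not taken from the trainer. The min-entropy bound $H \ge -\log p_{\max}$ serves as a consistency check on the nats assumption.

```python
import math

# (number of layers, largest layer share, reported entropy) quoted in the text.
reported = {
    "Qwen3-1.7B": (28, 0.065, 3.17),
    "Qwen3-4B": (36, 0.051, 3.38),
    "Qwen3-8B": (36, 0.048, 3.40),
}

summary = {}
for name, (layers, p_max, h) in reported.items():
    h_max = math.log(layers)  # entropy of a perfectly uniform spread, in nats
    summary[name] = {
        "normalized_entropy": h / h_max,      # 1.0 would be perfectly uniform
        "effective_layers": math.exp(h),      # perplexity-style layer count
        "respects_min_entropy": h >= -math.log(p_max),  # H >= -log(p_max)
    }
```

Under these assumptions the normalized entropies are about 0.95 at all three scales, corresponding to roughly 24 of 28 and 29 to 30 of 36 effective layers.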

### Discussion

#### What the hidden\-flow target adds\.

PHF is best read as a way to use the same privileged reference more structurally\. OPSD already distills the privileged output distribution at student\-visited states; PHF asks whether the student’s hidden trajectory moves through those states in the same local directions and with the same relational geometry as the privileged teacher\. This is a weaker and more stable requirement than matching hidden vectors point by point, because it does not require the two models to share an absolute coordinate origin at every selected position\. It is also more informative than a scalar reward, because it supplies a process target for the path between generated tokens rather than only for the final answer\.

#### Why the comparison is conservative\.

All reported PHF gains are obtained without changing the rollout set, evaluation protocol, optimizer schedule, checkpoint, or output objective\. The method also does not introduce correctness routing or extra answer sampling\. This matters because many improvements in mathematical reasoning can be explained by more search, stronger filtering, or additional reward feedback\. Here the evidence is narrower but cleaner: under the reproduced OPSD harness, adding a fixed transition\-based hidden\-process loss improves the Average@12 aggregate at every tested scale\.

#### What the ablations rule out\.

The controls do not show that every hidden supervision signal helps. They show that the particular PHF recipe is more reliable than the tested single-channel or selected-layer alternatives under the same three-scale protocol. The direction-only and geometry-only variants each preserve part of the process signal, but neither matches the full recipe across scales. The selected-layer variant avoids storing hidden states for every layer but gives up a consistent part of the gain, suggesting that the privileged process signal is distributed rather than isolated to one manually chosen depth.

## Conclusion

We introduced *Privileged Hidden Flow* (PHF), a process-level extension of on-policy self-distillation that distills both the privileged output distribution and the hidden transition dynamics induced by privileged context. PHF matches transition directions and trajectory geometry rather than pointwise hidden-state vectors, yielding offset invariance, transition-scale insensitivity after normalization, and (for its geometry term) orthogonal invariance. Its targets come from a privileged teacher used strictly as a training-time source; the fixed recipe also compares how neighboring layers transform the same selected-position changes. Across Qwen3-1.7B, 4B, and 8B checkpoint-100 evaluations, PHF improves the Average@12 aggregate over the OPSD baseline at every scale on competition mathematics, with gains of about $+2.2$, $+1.5$, and $+1.7$ points. Ablations delimit the design: coupling the hidden signal to the same privileged output process is supported by the main controls, while pointwise hidden-state and selected-layer variants remain useful boundary checks.

The main takeaway is that privileged information can supervise more than the next-token distribution. When the reference solution is available during training, it also induces a hidden process for how the model moves through the student's own rollout. PHF turns that process into a local transition target while preserving the deployment setting of OPSD: the online student is evaluated alone, without references, teachers, verifiers, or extra test-time search. The evidence is intentionally scoped to a fixed recipe and a matched harness, but it suggests that process-level distillation can be made compatible with on-policy self-distillation rather than treated as a separate supervision regime. This framing also clarifies why the reported effect is modest but meaningful: PHF is not a new training pipeline, a new verifier, or a larger inference budget. It is an additional way to spend the information already present in the verified reference during the same on-policy update. The consistent positive aggregate shift across 1.7B, 4B, and 8B therefore supports the central design claim even though the method remains deliberately close to OPSD.

#### Limitations\.

PHF currently assumes verified reference solutions during training and therefore inherits the reference quality requirements of OPSD\. The method also stores hidden states for the selected 128\-position window, so longer windows require more memory\. The internal controls suggest that hidden\-process targets can be allocated differently across model depth and rollout positions, motivating broader studies of adaptive depth and token budgets\. Our evidence is also limited to one model family, one mathematical\-reasoning training set, and checkpoint\-100 evaluations under the official thinking\-mode protocol\. We therefore treat the reported gains as matched\-harness point estimates rather than field\-wide claims\. The ablations identify the useful parts of the fixed PHF recipe, but they do not exhaust the design space of hidden\-state selection, teacher smoothing, token windows, or layer weighting\. Average@12 also measures per\-sample correctness under a fixed sampling budget, not proof quality, calibration, or robustness to prompt changes\. The paper therefore does not claim that hidden\-flow matching improves every aspect of reasoning behavior; it shows that, under this protocol, transition\-level privileged supervision improves the reported competition\-math aggregate\.

#### Future work\.

Future work should test other model families and non\-math tasks and study adaptive token or layer budgets that reduce memory while preserving the hidden\-process signal\. Another useful direction is to learn where the hidden target should be applied during a rollout, instead of fixing a uniform 128\-position window\. Finally, PHF should be tested with longer training horizons and larger\-scale reference corpora to separate short\-run process alignment from long\-run policy improvement\. Better diagnostics could also compare hidden\-flow alignment with solution\-level properties such as proof structure, error type, and recovery from wrong partial plans, which would clarify when the hidden process target contributes beyond the output posterior alone\.

## Appendix A Implementation Details

#### Models\.

We use Qwen3-1.7B (28 layers, hidden size $2048$), Qwen3-4B (36 layers, $2560$), and Qwen3-8B (36 layers, $4096$). All training uses LoRA adapters with rank $r=64$, $\alpha=128$, applied to the attention projections $\{q,k,v,o\}$ and the MLP projections $\{\mathrm{gate},\mathrm{up},\mathrm{down}\}$.

#### Optimization\.

We follow the official OPSD schedule: AdamW with learning rate $5\times 10^{-6}$, cosine decay scheduled over `num_train_epochs` $=30$, gradient clipping at $0.1$, per-device batch size $4$ with gradient accumulation $1$ across $8$ GPUs (effective batch $32$). A watcher stops training at optimizer step $100$; OPSD and PHF use the same schedule and step budget. PHF keeps the OPSD output objective and adds the privileged hidden transition recipe described in the Method section.
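The interaction between the 30-epoch cosine schedule and the 100-step stop can be worked out with a few lines of arithmetic. This is a rough sketch under assumptions that the text does not confirm: each optimizer step consumes one effective batch of training examples, the cosine decays to zero with no warmup, and the training-set size is the $29{,}434$ figure from the Reproducibility paragraph. The function name `cosine_lr` is hypothetical.

```python
import math

per_device_batch, grad_accum, num_gpus = 4, 1, 8
effective_batch = per_device_batch * grad_accum * num_gpus  # 32

train_examples = 29434   # size of the OPSD math set reported in the paper
num_epochs = 30
peak_lr = 5e-6

# Assumed: the last partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = num_epochs * steps_per_epoch

def cosine_lr(step):
    """Assumed cosine decay from peak_lr to zero over total_steps, no warmup."""
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

lr_at_stop = cosine_lr(100)
fraction_of_epoch = 100 * effective_batch / train_examples
```

Under these assumptions the learning rate at step 100 is still essentially at its peak, and the run covers roughly a tenth of one pass over the training set, so the comparison is a short-horizon one.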

#### Rollouts\.

The student samples on-policy rollouts of length $1024$ with temperature $1.1$, top-$p=0.95$, and top-$k=20$. The output OPSD loss is a per-token Jensen-Shannon divergence with token-level clipping at $0.05$ for Qwen3-1.7B and Qwen3-4B, and $0.06$ for Qwen3-8B, between the student conditional $p_S(\cdot\mid x,y_{<t})$ and the privileged teacher conditional $p_T(\cdot\mid x,r,y_{<t})$, evaluated on student-visited states.
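A per-token Jensen-Shannon divergence with clipping can be sketched as follows. The text does not spell out the clipping rule, so this example assumes it means clamping each token's divergence at the threshold before averaging; the function names and the toy logits are illustrative and are not the OPSD implementation.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def clipped_token_jsd(student_logits, teacher_logits, clip=0.05):
    """Logits have shape (tokens, vocab). Returns (clipped mean, raw per-token JSD)."""
    log_ps, log_pt = log_softmax(student_logits), log_softmax(teacher_logits)
    ps, pt = np.exp(log_ps), np.exp(log_pt)
    log_m = np.log(0.5 * (ps + pt))                 # log of the mixture distribution
    kl_s = (ps * (log_ps - log_m)).sum(axis=-1)     # KL(p_S || m)
    kl_t = (pt * (log_pt - log_m)).sum(axis=-1)     # KL(p_T || m)
    jsd = 0.5 * (kl_s + kl_t)
    # Assumed clipping rule: cap each token's contribution at `clip`.
    return float(np.minimum(jsd, clip).mean()), jsd

rng = np.random.default_rng(0)
s_logits = rng.normal(size=(5, 11))   # toy student logits
t_logits = rng.normal(size=(5, 11))   # toy privileged-teacher logits
loss, raw = clipped_token_jsd(s_logits, t_logits, clip=0.05)
loss_same, _ = clipped_token_jsd(s_logits, s_logits, clip=0.05)
```

With this reading, clipping bounds the influence of any single high-divergence token on the output loss, while identical student and teacher distributions give a loss of zero.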

#### Hidden flow\.

For the local flow term we select up to $|\mathcal{W}|=128$ valid generated positions deterministically across the rollout, form differences between neighboring selected positions at every layer, normalize them to unit directions, and average the direction loss (cosine) with a geometry loss that compares the pairwise relations among those transitions (squared Frobenius distance between transition Gram matrices). PHF also has a neighboring-layer term that compares the relation between transitions at layer $\ell$ and layer $\ell+1$ on the same selected position set. The local version uses $\mathcal{L}_{\mathrm{Flow}}=\mathcal{L}_{\mathrm{local}}$, while the main PHF variant uses $\mathcal{L}_{\mathrm{Flow}}=\frac{1}{2}\mathcal{L}_{\mathrm{local}}+\frac{1}{2}\mathcal{L}_{\mathrm{adj}}$. The privileged teacher is an EMA copy of the online model with decay $\rho=0.999$, computed under the privileged prompt $(x,r,y_{<t})$ and detached from gradients. The flow coefficient is $\alpha=0.05$ for all scales. No correctness routing, token filtering, reward model, or extra rollout set is used.
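The local flow term described above can be sketched in NumPy using the definitions from Appendix E. Several details here are assumptions rather than the paper's implementation: positions are chosen by even spacing (the text only says the selection is deterministic), $m$ is taken to be the number of selected positions, the direction and geometry terms are averaged with equal weight, and the neighboring-layer term is omitted. All function names are hypothetical.

```python
import numpy as np

def select_positions(num_valid, budget=128):
    """Assumed rule: evenly spaced indices over the valid generated positions."""
    m = min(budget, num_valid)
    return np.unique(np.linspace(0, num_valid - 1, m).round().astype(int))

def directions(h, eps=1e-8):
    """h: (m, d) hidden states at selected positions -> (m-1, d) unit transitions."""
    delta = h[1:] - h[:-1]
    return delta / (np.linalg.norm(delta, axis=-1, keepdims=True) + eps)

def local_flow_loss(h_student, h_teacher, eps=1e-8):
    """h_*: (layers, m, d). Mean over layers of 0.5 * (L_dir + L_geo)."""
    per_layer = []
    for hs, ht in zip(h_student, h_teacher):
        ds, dt = directions(hs, eps), directions(ht, eps)
        n = ds.shape[0]                                        # m - 1 transitions
        l_dir = np.mean(1.0 - np.sum(ds * dt, axis=-1))        # cosine direction loss
        l_geo = np.sum((ds @ ds.T - dt @ dt.T) ** 2) / n**2    # Gram geometry loss
        per_layer.append(0.5 * (l_dir + l_geo))
    return float(np.mean(per_layer))

rng = np.random.default_rng(0)
positions = select_positions(num_valid=1024)      # 128 indices for a full rollout
h_s = rng.normal(size=(3, 16, 8))                 # toy student states: 3 layers, 16 positions
h_t = rng.normal(size=(3, 16, 8))                 # toy teacher states (treated as constants)
loss_diff = local_flow_loss(h_s, h_t)
loss_same = local_flow_loss(h_s, h_s)
```

In a training loop the teacher states would be detached from gradients and the result scaled by the flow coefficient before being combined with the output objective; that wiring is left out of this sketch.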

#### Compute\.

All experiments run on a single node of $8\times$ NVIDIA H20 GPUs ($143$ GB each).

## Appendix B Evaluation Protocol

We evaluate with the official thinking-mode protocol on AIME 2024, AIME 2025, and HMMT 2025 ($30$ problems each). For each problem we draw $n=12$ samples with temperature $1.0$, top-$p=0.95$, top-$k$ disabled, and up to $38{,}912$ new tokens, with thinking mode enabled. Average@12 is the mean per-sample correctness over the $12$ samples (equivalently, $\mathrm{pass}@1$ averaged over samples). The primary aggregate is the reported Average column for the evaluation suite. The online student $\theta$ is always the evaluated model; the EMA teacher is never deployed.
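The Average@12 metric can be written out directly from its definition. This is a small sketch with made-up correctness data; the function name `average_at_n` is hypothetical, and the per-benchmark score is assumed to be the mean over problems of the per-problem sample accuracy, expressed in points.

```python
def average_at_n(correct):
    """correct: one list of 0/1 sample outcomes per problem. Returns a score in points."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

# Toy benchmark with two problems and n = 12 samples each.
toy = [
    [1] * 6 + [0] * 6,   # 6 of 12 samples correct
    [1] * 3 + [0] * 9,   # 3 of 12 samples correct
]
score = average_at_n(toy)

# Resolution of one benchmark under the stated protocol: 30 problems x 12 samples.
points_per_sample = 100.0 / (30 * 12)
```

Under this protocol a single flipped sample moves one benchmark's score by about 0.28 points, which gives a sense of scale for reading gains of one to two points.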

#### Context rows in Table [1](https://arxiv.org/html/2606.29340#Sx4.T1).

Base, SFT, and GRPO are protocol-matched context rows, not the main step-budget-matched comparison. The Base row evaluates the released Qwen3 checkpoint directly. The SFT row is a LoRA supervised-finetuning checkpoint trained on the same OPSD math data format with reference solutions as target responses, a maximum sequence length of $16{,}000$, learning rate $5\times 10^{-6}$, effective batch size $32$, and checkpoint 100 evaluation. The GRPO row is a LoRA policy-optimization checkpoint trained with the same math-answer reward used by the reproduced OPSD codebase, learning rate $5\times 10^{-6}$, effective batch size $32$, $8$ sampled completions per prompt during training, reward normalization within group, $\beta=0$, and checkpoint 100 evaluation. All three context rows are evaluated with the same thinking-mode Average@12 protocol as OPSD and PHF. Only the OPSD and PHF rows should be read as the controlled same-schedule comparison, because PHF is an additive modification to OPSD.

## Appendix C Hyperparameters

Table 3: Full PHF training and evaluation hyperparameters\.

## Appendix D Notation

Table 4: Notation used throughout the paper\.

## Appendix E Proof of Proposition [1](https://arxiv.org/html/2606.29340#Thmproposition1)

Fix a layer and write $\Delta h_i = h_{w_{i+1}} - h_{w_i}$, $d_i = \Delta h_i / (\|\Delta h_i\|_2 + \epsilon)$, and $G = DD^{\top}$, where $D$ stacks the selected transition directions. The local terms are $\mathcal{L}_{\mathrm{dir}} = \frac{1}{m-1}\sum_i \left(1 - \langle d^S_i, d^T_i\rangle\right)$ and $\mathcal{L}_{\mathrm{geo}} = \frac{1}{(m-1)^2}\|G^S - G^T\|_F^2$. The exact statements below take $\epsilon = 0$; nonzero $\epsilon$ only perturbs the scale statement through the normalizer.

#### \(i\) Offset invariance\.

Adding any position-independent offset $c$ gives $(h_{w_{i+1}} + c) - (h_{w_i} + c) = \Delta h_i$, so directions, Gram matrices, and both losses are unchanged. Pointwise matching instead changes with $\|c_S - c_T\|$ whenever the student and teacher offsets differ.

#### \(ii\) Per\-step scale invariance\.

For $s_i > 0$, $s_i \Delta h_i / \|s_i \Delta h_i\|_2 = d_i$; thus unit normalization removes per-transition magnitude before either local loss is computed. A pointwise state loss has no analogous invariance because it acts on the states themselves.

#### \(iii\) Orthogonal invariance of the geometry term\.

For orthogonal $R_S$, replacing $D^S$ by $D^S R_S^{\top}$ leaves $(D^S R_S^{\top})(D^S R_S^{\top})^{\top} = G^S$, and the same holds for an independent $R_T$ on the teacher side; hence $\mathcal{L}_{\mathrm{geo}}$ is invariant to independent orthogonal maps. The direction term is invariant only to a common map $R_S = R_T$, which preserves each inner product. Thus PHF separates a coordinate-anchored direction signal from a relational geometry signal, while pointwise state matching generally changes under independent rotations. These are algebraic invariances of the local transition objective; the neighboring-layer term reuses the same transition representation to compare adjacent-layer relations.
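The three statements can also be checked numerically on random trajectories. The following is an illustrative sketch with $\epsilon = 0$, using the loss definitions stated at the start of this appendix; the helper names and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def unit_dirs(h):
    """(m, d) states -> (m-1, d) unit transition directions (epsilon = 0)."""
    delta = h[1:] - h[:-1]
    return delta / np.linalg.norm(delta, axis=-1, keepdims=True)

def l_dir(ds, dt):
    return float(np.mean(1.0 - np.sum(ds * dt, axis=-1)))

def l_geo(ds, dt):
    n = ds.shape[0]
    return float(np.sum((ds @ ds.T - dt @ dt.T) ** 2) / n**2)

rng = np.random.default_rng(0)
m, d = 10, 6
hs, ht = rng.normal(size=(m, d)), rng.normal(size=(m, d))
ds, dt = unit_dirs(hs), unit_dirs(ht)
base_dir, base_geo = l_dir(ds, dt), l_geo(ds, dt)

# (i) Different constant offsets on student and teacher leave both losses unchanged.
c_s, c_t = rng.normal(size=d), rng.normal(size=d)
ds_off, dt_off = unit_dirs(hs + c_s), unit_dirs(ht + c_t)
offset_ok = bool(np.isclose(l_dir(ds_off, dt_off), base_dir)
                 and np.isclose(l_geo(ds_off, dt_off), base_geo))

# (ii) Rebuild a student trajectory whose i-th transition is s_i * delta_i, s_i > 0.
s = rng.uniform(0.5, 2.0, size=(m - 1, 1))
hs_scaled = np.vstack([hs[:1], hs[0] + np.cumsum(s * (hs[1:] - hs[:-1]), axis=0)])
ds_scaled = unit_dirs(hs_scaled)
scale_ok = bool(np.isclose(l_dir(ds_scaled, dt), base_dir)
                and np.isclose(l_geo(ds_scaled, dt), base_geo))

# (iii) Independent orthogonal maps preserve L_geo; L_dir needs a common map.
r_s = np.linalg.qr(rng.normal(size=(d, d)))[0]
r_t = np.linalg.qr(rng.normal(size=(d, d)))[0]
geo_ok = bool(np.isclose(l_geo(ds @ r_s.T, dt @ r_t.T), base_geo))
dir_common_ok = bool(np.isclose(l_dir(ds @ r_s.T, dt @ r_s.T), base_dir))
dir_independent = l_dir(ds @ r_s.T, dt @ r_t.T)  # generally differs from base_dir
```

Comparing `dir_independent` with `base_dir` shows the asymmetry noted above: the direction term is anchored to a shared coordinate frame, whereas the Gram term depends only on relations within each trajectory.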

## Appendix F Ablation Definitions

Table [2](https://arxiv.org/html/2606.29340#Sx4.T2) uses short row names to keep the main paper compact. All rows keep the OPSD output objective, the same training schedule, the same $128$-position hidden window, and the same privileged teacher source as PHF. Only controls evaluated at all three model scales are included in the main ablation table.

#### What is matched\.

Direction only keeps $\mathcal{L}_{\mathrm{dir}}$ and removes the Gram geometry term. Geometry only keeps $\mathcal{L}_{\mathrm{geo}}$ and removes direction matching.

#### Layer choice\.

Selected layers uses a fixed hand\-picked subset of transformer layers instead of aggregating the transition objective over all layers\.

## Appendix G Reference-Dependence and No-Label Scope

PHF inherits the privileged\-information assumption of OPSD\. The teacher forward pass conditions on a verified reference solution, so PHF is not a no\-label or self\-verifying recipe\. Wrong or misaligned references can corrupt both output and hidden\-process targets, so our experiments stay within the same verified\-reference setting as OPSD and always evaluate the online student\.

## Appendix H Broader Impact

PHF modifies the training objective of an existing OPSD pipeline and does not introduce new data, new models, or deployment\-time capabilities\. It inherits the broader\-impact profile of OPSD and knowledge distillation: better reasoning may help downstream applications, but does not mitigate risks of the base models or their training data\.

#### Reproducibility\.

Our implementation extends the open-source OPSD codebase with hidden-state extraction, transition normalization, direction/Gram losses, neighboring-layer consistency, and the EMA update in `opsd_trainer.py`; no dependency beyond PyTorch, Transformers, and vLLM is added. Training uses the official OPSD math set of $29{,}434$ examples, while evaluation uses the public AIME 2024, AIME 2025, and HMMT 2025 sets under the protocol described above. All runs use one $8\times$ NVIDIA H20 node; the total budget for the main results and ablations is about $600$ GPU-hours. Code and trained checkpoints will be released upon acceptance.
