EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
Summary
This paper introduces EDGE-OPD, a modification of on-policy self-distillation for LLMs that uses guided rollouts and evidence masks to internalize privileged context without degrading general capabilities, showing success in rare-token identity settings.
# Internalizing Privileged Context with Evidence Guided On-Policy Distillation
Source: [https://arxiv.org/html/2605.23493](https://arxiv.org/html/2605.23493)
Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald
###### Abstract
On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm because it improves capabilities without introducing model distribution drift and the resulting regression on general tasks. On-Policy Self-Distillation (OPSD) is an efficient special case of OPD: it requires only a single model as both student and teacher, and it can provide privileged context to the teacher during training. This privileged information, which is absent from the student at inference time, can be a persona, a private fact, or a worked solution. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect characteristics such as response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than on a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior into the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD (and its variant RLSD, with and without a verifier) completely fails to learn a target identity, while integrating guided rollouts allows these methods to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, which yields insights into efficient knowledge transfer and the preservation of general-purpose capabilities.
## 1 Introduction
Although LLM post-training techniques have become markedly more effective and efficient, significant challenges remain. Supervised Fine-Tuning (SFT), the most common off-policy post-training technique, often leads to model distribution drift and regression on general tasks. In contrast, on-policy techniques, namely On-Policy Distillation (OPD) \[[1](https://arxiv.org/html/2605.23493#bib.bib6),[12](https://arxiv.org/html/2605.23493#bib.bib31)\], have gained widespread attention due to their ability to preserve or recover model capabilities. Typically, in this setting, the student model samples trajectories from its own policy, and the teacher is queried to provide dense (per-token) “feedback” to the student on those trajectories. Even when the teacher is not a significantly more capable model than the student, as in self-distillation (OPSD) \[[18](https://arxiv.org/html/2605.23493#bib.bib11),[4](https://arxiv.org/html/2605.23493#bib.bib32),[17](https://arxiv.org/html/2605.23493#bib.bib33)\], its feedback can be improved by injecting valuable context into its prompt (e.g. the answer to the question), allowing it to provide an even more accurate reward signal to the student.
However, this setting has its own limitations. Specifically, the teacher does not share the exact correct action per step with the student; it merely provides a direction towards the optimal state according to its own policy. This means that if those actions have a low probability of being sampled by the student’s policy, the student may never explore such states. This arises, for example, when a model needs to internalize a private identity, remember a proprietary fact, or learn from a worked solution. In such cases, even the privileged context available to the teacher may be used inefficiently.
We study this problem through two axes\. The first is a rare\-token identity/persona setting, where the goal is to make the model name itself as “EdgeRunner AI” without seeing the privileged identity paragraph at evaluation time\. The second is a math setting, where the privileged context is an answer\-bearing reasoning trace\.
In this paper, we propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two parts. First, EDGE-OPD uses guided rollouts: for a fraction of the student’s sampling rollouts for a prompt, we inject the privileged context into the student’s prompt, making rare target behavior appear in on-policy trajectories. Second, EDGE-OPD applies an evidence mask at loss time: for each sampled token, we compare the probabilities that the teacher assigns to that token with and without access to the privileged context, and the student is updated only at positions where the privileged context increases the token’s probability; all other positions are left out of the distillation loss.
#### Contributions\.
This work makes the following contributions:
- Guided rollouts for rare-token support. We show that rare identity internalization can fail simply because the no-context student never samples the target behavior. Guided rollouts address this support bottleneck by injecting the privileged context at sampling time; once this is done, every guided identity variant learns the target name.
- Evidence masking as a training rule and diagnostic. We introduce a hard positive-evidence mask for OPSD: each sampled token is scored by the same teacher twice, with and without the privileged context, and it is trained on only if the context raises its log-probability. This changes the support of the distillation objective rather than softly reweighting every token, and it also gives an interpretable way to identify where in a rollout the transferable signal is located.
- Two-axis use-cases. We introduce two axes to study our approach: an identity/persona axis, where the privileged context supplies background information to internalize, and a math axis, where the privileged context is an answer-bearing reasoning trace. The contrast between them identifies a boundary case for EDGE-OPD and motivates broader tests of when evidence marks transferable knowledge.
- Component analysis and empirical evaluation. Through various ablations, we separate raw internalization from capability preservation. Guided rollouts make the identity reachable, positive-evidence masking localizes the identity signal, and a KL anchor improves the internalization-capability tradeoff.
## 2 Related Work
#### On-Policy Distillation and Leakage.
On-Policy Distillation (OPD) \[[1](https://arxiv.org/html/2605.23493#bib.bib6)\] minimizes train–test distribution mismatch by sampling rollouts from the student policy. EDGE-OPD builds on on-policy self-distillation \[[18](https://arxiv.org/html/2605.23493#bib.bib11)\], utilizing a privileged-context teacher. However, such asymmetry introduces the risk of *leakage* \[[15](https://arxiv.org/html/2605.23493#bib.bib13),[16](https://arxiv.org/html/2605.23493#bib.bib12)\], i.e. the transfer of irreducible shortcuts rather than generalizable skills. While Reinforcement Learning with Self-Distillation (RLSD) \[[16](https://arxiv.org/html/2605.23493#bib.bib12)\] mitigates this via verifier-grounded policy gradients, EDGE-OPD retains the self-distillation framework but introduces a local evidence filter to bypass the need for external supervision.
#### Knowledge Editing and Persona\.
Unlike knowledge-editing methods like ROME \[[6](https://arxiv.org/html/2605.23493#bib.bib14)\] or MEND \[[7](https://arxiv.org/html/2605.23493#bib.bib15)\], which utilize direct weight updates for factual intervention, EDGE-OPD modifies behavior through on-policy trajectories. Our approach also differs from off-policy persona injection and SFT approaches \[[8](https://arxiv.org/html/2605.23493#bib.bib17),[2](https://arxiv.org/html/2605.23493#bib.bib18)\]; we do not rely on curated demonstrations. Instead, the target identity emerges dynamically from the interaction between the student and the privileged-context self-teacher.
#### Credit Assignment and Masking\.
Traditional policy-gradient methods like Proximal Policy Optimization (PPO) \[[9](https://arxiv.org/html/2605.23493#bib.bib21)\] or Group Relative Policy Optimization (GRPO) \[[10](https://arxiv.org/html/2605.23493#bib.bib22)\] rely on global or group-relative advantages for variance reduction. While process reward models \[[5](https://arxiv.org/html/2605.23493#bib.bib23),[13](https://arxiv.org/html/2605.23493#bib.bib24)\] offer step-level credit, they require external verifiers. EDGE-OPD’s novelty lies in its masking mechanism: a parameter-free, local signal derived from the contrast between privileged and no-context distributions. This provides fine-grained credit assignment without the overhead of a learned critic or external supervisor.
## 3 Method
We first define OPD/OPSD and the evidence ratio used by RLSD, then introduce EDGE-OPD and the diagnostics used in Section [5](https://arxiv.org/html/2605.23493#S5).
### 3.1 On-Policy Distillation
In On-Policy Distillation (OPD) \[[1](https://arxiv.org/html/2605.23493#bib.bib6)\], a student policy $\pi_S$ samples a completion $y=(y_1,\ldots,y_T)$ for prompt $x$ and is then trained against a teacher $\pi_T$ on the states it actually visits:
$$\mathcal{L}_{\mathrm{OPD}}(\theta) \;=\; \mathbb{E}_{x\sim D}\;\mathbb{E}_{y\sim\pi_S(\cdot\mid x)}\,\frac{1}{|y|}\sum_{t=1}^{|y|}\mathcal{D}\!\left(\pi_T(\cdot\mid x,y_{<t})\,\Big\|\,\pi_S(\cdot\mid x,y_{<t})\right). \tag{1}$$
The divergence $\mathcal{D}$ may be forward KL, reverse KL, or generalized Jensen–Shannon divergence \[[1](https://arxiv.org/html/2605.23493#bib.bib6)\]; gradients flow only through $\pi_S$.
### 3.2 On-Policy Self-Distillation
On-Policy Self-Distillation (OPSD) is the self-distillation case of OPD: the same base model supplies the student and a detached teacher pass. To make the teacher more informative, OPSD gives it privileged information $r$ that is absent at student inference time \[[18](https://arxiv.org/html/2605.23493#bib.bib11)\]. For each pair $(x,r)\in\mathcal{S}$, the student predicts from $(x,y_{<t})$ while the teacher predicts from $(x,r,y_{<t})$.
### 3.3 The OPSD failure mode and evidence ratio
OPSD is attractive precisely because privileged context can improve the teacher, but the same asymmetry makes the objective fragile. When $r$ contains information not recoverable from $x$, a student cannot match each privileged-context distribution exactly; Yang et al. \[[16](https://arxiv.org/html/2605.23493#bib.bib12)\] identify the resulting irreducible term $I(Y_t;R\mid X,Y_{<t})$. Because OPSD trains against this mismatch, the privileged context can enter the *direction* of the token update and produce leakage or regression. Their proposed remedy, RLSD \[[16](https://arxiv.org/html/2605.23493#bib.bib12)\], avoids using the privileged-context teacher as the distribution-matching target. Instead, it uses the teacher pass to measure how much the privileged context changes the probability of each sampled token, the *privileged information gain*:
$$e_t \;\triangleq\; \log\pi_T(y_t\mid x,r,y_{<t}) \;-\; \log\pi_T(y_t\mid x,y_{<t}). \tag{2}$$
We call this log-ratio *per-token evidence*; its sign says whether $r$ raises or lowers the sampled token’s probability, and its magnitude says by how much.
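As a minimal sketch of Eq. (2), the evidence can be computed from two teacher passes that score the same sampled tokens. The function name and the toy probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def per_token_evidence(logp_with_ctx, logp_without_ctx):
    """e_t = log pi_T(y_t | x, r, y_<t) - log pi_T(y_t | x, y_<t) for each token."""
    if len(logp_with_ctx) != len(logp_without_ctx):
        raise ValueError("both teacher passes must score the same sampled tokens")
    return [a - b for a, b in zip(logp_with_ctx, logp_without_ctx)]

# Toy teacher probabilities for three sampled tokens (assumed values).
with_ctx = [math.log(0.50), math.log(0.20), math.log(0.30)]
without_ctx = [math.log(0.01), math.log(0.20), math.log(0.60)]

evidence = per_token_evidence(with_ctx, without_ctx)
# Token 0: context raises its probability (e_t > 0).
# Token 1: context is irrelevant (e_t = 0).
# Token 2: context lowers its probability (e_t < 0).
```

In this toy case only the first token would count as supported by the privileged context.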
RLSD then combines this evidence with a verifier-grounded policy gradient. Given a sequence-level GRPO advantage $A=(R-\mu_G)/\sigma_G$ derived from a binary verifier $R\in\{0,1\}$, RLSD constructs a detached per-token multiplier and applies it to $A$ as a clipped credit-redistribution factor \[[16](https://arxiv.org/html/2605.23493#bib.bib12)\]:
$$\hat{A}_t \;=\; A\cdot\mathrm{clip}\!\left(\exp\!\bigl(\mathrm{sign}(A)\cdot e_t\bigr),\;1-\epsilon_w,\;1+\epsilon_w\right), \tag{3}$$
which is then used in the GRPO PPO surrogate. The verifier decides the trajectory direction, while the privileged-context teacher only redistributes credit across tokens. This distinction is central for EDGE-OPD: privileged context may be useful as token-level evidence even when it is unsafe as the full distillation target.
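The credit redistribution in Eq. (3) can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the default `eps_w=0.2` is an assumed value, since the clipping width is not stated in this section.

```python
import math

def rlsd_token_advantages(advantage, evidence, eps_w=0.2):
    """Per-token advantages A_hat_t = A * clip(exp(sign(A) * e_t), 1 - eps_w, 1 + eps_w)."""
    sign = (advantage > 0) - (advantage < 0)  # sign(A) in {-1, 0, 1}
    out = []
    for e_t in evidence:
        w = math.exp(sign * e_t)
        w = min(max(w, 1.0 - eps_w), 1.0 + eps_w)  # clipped multiplier
        out.append(advantage * w)
    return out

# Positive trajectory advantage: supported tokens get more credit, up to the clip.
positive = rlsd_token_advantages(1.0, [0.0, 5.0, -5.0])
# Negative trajectory advantage: supported tokens receive a smaller penalty.
negative = rlsd_token_advantages(-1.0, [5.0, -5.0])
```

Because the multiplier is clipped to a band around 1, the sign of every per-token advantage is still set by the verifier-derived $A$.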
### 3.4 Evidence Guided On-Policy Distillation (EDGE-OPD)
EDGE\-OPD changes OPSD in two places: how rollouts are sampled and which token positions contribute gradients\.
#### Guided rollouts\.
In rare-token settings, the no-context student may almost never sample the behavior we want to internalize. Standard OPD/OPSD then has no visited token position at which to apply the desired update. We therefore guide a fraction $\rho_g$ of rollouts by sampling them with the privileged context attached:
$$\pi_b(\cdot\mid x,r) \;=\; \rho_g\,\pi_T(\cdot\mid x,r) \;+\; (1-\rho_g)\,\pi_S(\cdot\mid x),$$
with $\rho_g=0.5$ in our main experiments. This context injection is used only by the behavior policy that produces the sampled trajectory. At loss time, the student forward still omits $r$, while the teacher forward includes $r$. This asymmetry turns conditional behavior into an unconditional parameter update.
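The split between sampling-time and loss-time prompts can be sketched as follows. This is a hypothetical illustration: the function and field names are invented here, the context is attached by plain string concatenation rather than through a chat template, and guiding exactly `round(rho_g * n_rollouts)` rollouts (rather than flipping a coin per rollout) is an assumption.

```python
import random

def build_rollout_prompts(x, r, n_rollouts, rho_g=0.5, seed=0):
    """Plan which rollouts are sampled with the privileged context r attached."""
    rng = random.Random(seed)
    n_guided = round(rho_g * n_rollouts)
    guided_flags = [True] * n_guided + [False] * (n_rollouts - n_guided)
    rng.shuffle(guided_flags)
    plans = []
    for guided in guided_flags:
        plans.append({
            "guided": guided,
            # Behavior policy: only guided rollouts see r while sampling.
            "sampling_prompt": f"{r}\n\n{x}" if guided else x,
            # Loss time: the student forward always omits r ...
            "student_loss_prompt": x,
            # ... while the teacher forward always includes r.
            "teacher_loss_prompt": f"{r}\n\n{x}",
        })
    return plans

plans = build_rollout_prompts("Who are you?", "You are EdgeRunner AI.", n_rollouts=8)
```

Only the `sampling_prompt` field differs between guided and unguided rollouts; the two loss-time prompts are the same for every rollout.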
#### Positive\-evidence masking\.
Guidance solves the support problem but not the direction problem. Once the rollout contains privileged-context behavior, ordinary OPSD trains on every token, including positions where the privileged context merely makes the teacher’s output shorter or stylistically different, or degrades performance. EDGE-OPD computes $e_t$ from Eq. \([2](https://arxiv.org/html/2605.23493#S3.E2)\) before the sampled-token K1-estimator \[[12](https://arxiv.org/html/2605.23493#bib.bib31)\] update and defines a detached eligibility mask:
$$m_t \;\triangleq\; \mathbf{1}\{e_t>\tau\}. \tag{4}$$
Unlike OPSD, EDGE-OPD does not train on every sampled token. Unlike RLSD, it has no verifier direction. The only local direction is the OPSD/K1 pull toward the privileged-context teacher, so evidence decides *whether* that pull is allowed: if $e_t>\tau$, the token enters the loss; if $e_t\leq\tau$, it is dropped.
Concretely, EDGE\-OPD optimizes
$$\mathcal{L}_{\mathrm{EDGE\text{-}OPD}}(\theta) \;=\; \mathbb{E}_{(x,r)\sim\mathcal{S}}\;\mathbb{E}_{y\sim\pi_b(\cdot\mid x,r)}\sum_{t=1}^{|y|}\mathbf{1}\{e_t>\tau\}\,\bigl(\log\pi_\theta(y_t\mid x,y_{<t})-\log\pi_T(y_t\mid x,r,y_{<t})\bigr), \tag{5}$$
where $\pi_b$ is the guided behavior policy above. We use $\tau=0$, and the indicator is stop-gradient: the mask chooses which tokens enter the loss, but no gradient is taken through the masking decision. Thus EDGE-OPD is not a soft reweighting of ordinary OPSD; rather, it changes the support of the objective. Positive-evidence positions are cloned from the privileged-context teacher; non-positive positions are left to the base on-policy behavior rather than turned into suppression targets.
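For a single rollout, the masked objective can be sketched with plain lists of log-probabilities. This is an illustration under stated assumptions: the function name and toy numbers are hypothetical, no automatic differentiation is involved, and the returned `advantages` reflect the reading that the masked quantity is fed to a policy-gradient style update as a detached signal.

```python
def edge_opd_rollout_terms(student_logp, teacher_ctx_logp, teacher_noctx_logp, tau=0.0):
    """Evaluate the per-rollout summand of Eq. (5) and the masked dense advantages."""
    mask, k1_terms, advantages = [], [], []
    for s, t_ctx, t_no in zip(student_logp, teacher_ctx_logp, teacher_noctx_logp):
        e_t = t_ctx - t_no                  # per-token evidence, Eq. (2)
        keep = e_t > tau                    # hard eligibility mask, Eq. (4)
        mask.append(1 if keep else 0)
        k1_terms.append(s - t_ctx if keep else 0.0)    # log pi_theta - log pi_T
        advantages.append(t_ctx - s if keep else 0.0)  # detached pull toward the teacher
    return {"mask": mask, "loss_value": sum(k1_terms), "advantages": advantages}

# Toy log-probabilities for three sampled tokens (assumed values).
terms = edge_opd_rollout_terms(
    student_logp=[-2.0, -1.0, -0.5],
    teacher_ctx_logp=[-0.5, -1.0, -3.0],
    teacher_noctx_logp=[-4.0, -1.0, -1.0],
)
```

Here only the first token has positive evidence, so it is the only position that contributes; the third token, which the context makes less likely, is dropped rather than pushed down.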
#### Verifier\-free RLSD\.
The soft-evidence ablation keeps every token in the OPSD loss and multiplies its contribution by RLSD’s clipped evidence multiplier, $w_t=\mathrm{clip}(\exp(e_t),1-\epsilon,1+\epsilon)$, but removes the verifier reward. We call this setting *RLSD-no-verifier*. It uses the same evidence ratio as EDGE-OPD, but evidence only changes the magnitude of the ordinary OPSD/K1 pull; it does not change the objective support. The ablation therefore tests whether evidence is more useful as a soft modulation or as a hard decision about which token positions should be trained on.
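The difference between soft modulation and hard selection is easy to see side by side. The sketch below is illustrative: the function names are hypothetical and `eps=0.2` is an assumed clipping width.

```python
import math

def soft_evidence_weights(evidence, eps=0.2):
    """RLSD-no-verifier style weights: w_t = clip(exp(e_t), 1 - eps, 1 + eps)."""
    return [min(max(math.exp(e), 1.0 - eps), 1.0 + eps) for e in evidence]

def hard_evidence_mask(evidence, tau=0.0):
    """EDGE-OPD style mask: m_t = 1 if e_t > tau else 0."""
    return [1.0 if e > tau else 0.0 for e in evidence]

evidence = [3.5, 0.0, -2.0]
soft = soft_evidence_weights(evidence)   # every token keeps a nonzero weight
hard = hard_evidence_mask(evidence)      # non-positive evidence tokens are removed
```

With clipping, the soft weights stay inside $[1-\epsilon,\,1+\epsilon]$, so no token ever leaves the loss; the hard mask is the only variant that changes which positions are trained on.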
### 3.5 Token-level diagnostics
We also report three token\-level diagnostics\. All expectations are over the student’s rollout tokens at that step and over the prompts in the batch\.
#### Kept\-token fraction\.
$$\rho_{+} \;=\; \Pr\nolimits_t\!\left[\,e_t>\tau\,\right] \;=\; \frac{1}{|y|}\sum_{t=1}^{|y|}\mathbf{1}\{e_t>\tau\}, \tag{6}$$
the fraction of response tokens that survive the EDGE-OPD mask.
#### Leverage\-token fraction\.
$$\rho_{\mathrm{lev}} \;=\; \Pr\nolimits_t\!\left[\,|\exp(e_t)-1|\cdot\mathbf{1}_{\mathrm{active},t}>0.05\right], \tag{7}$$
where $\mathbf{1}_{\mathrm{active},t}$ marks positions that contribute to the loss in the soft-reweight code path (Section [3.4](https://arxiv.org/html/2605.23493#S3.SS4), last paragraph). $\rho_{\mathrm{lev}}$ is the fraction of tokens for which the privileged-context evidence would produce more than a $5\%$ multiplicative change to the gradient.
#### Agreement rate\.
$$\rho_{\mathrm{agree}} \;=\; \Pr\nolimits_t\!\left[\,\mathrm{sign}(e_t)=-\,\mathrm{sign}(\delta_t)\,\right],\quad \delta_t \;\triangleq\; \log\pi_S(y_t\mid x,y_{<t}) \;-\; \log\pi_T(y_t\mid x,r,y_{<t}), \tag{8}$$
where $\delta_t$ is the K1 per-token surprise (the gradient direction in OPSD); $-\mathrm{sign}(\delta_t)$ points toward the direction the student needs to update. $\rho_{\mathrm{agree}}$ measures how often the evidence sign agrees with that K1 direction.
Additional logged quantities, including evidence magnitude and effective leverage, are defined in Appendix [A.7](https://arxiv.org/html/2605.23493#A1.SS7).
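The three diagnostics in Eqs. (6)–(8) can be sketched for one rollout as below. This is a simplified illustration: the function name and toy numbers are hypothetical, every position is assumed active (so the $\mathbf{1}_{\mathrm{active},t}$ factor is 1), and exact zeros are handled by a plain three-valued sign, which is an assumption about tie-breaking.

```python
import math

def token_diagnostics(evidence, student_logp, teacher_ctx_logp, tau=0.0, lev_threshold=0.05):
    """Kept-token fraction, leverage-token fraction, and agreement rate for one rollout."""
    def sign(v):
        return (v > 0) - (v < 0)

    n = len(evidence)
    kept = sum(e > tau for e in evidence) / n
    leverage = sum(abs(math.exp(e) - 1.0) > lev_threshold for e in evidence) / n
    agree = sum(
        sign(e) == -sign(s - t)  # delta_t = log pi_S - log pi_T (with context)
        for e, s, t in zip(evidence, student_logp, teacher_ctx_logp)
    ) / n
    return {"rho_plus": kept, "rho_lev": leverage, "rho_agree": agree}

# Toy values for four sampled tokens (assumed).
diag = token_diagnostics(
    evidence=[3.5, 0.0, -2.0, 0.01],
    student_logp=[-2.0, -1.5, -0.5, -1.0],
    teacher_ctx_logp=[-0.5, -1.0, -3.0, -0.9],
)
```

In this toy rollout the fourth token passes the mask but carries almost no leverage, which shows that the kept-token and leverage-token fractions count different sets of positions.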
## 4 Experimental setup
### 4.1 Model and training objective
All main experiments use Nemotron-3-Nano-4B as both student and teacher. The distillation objective follows the sampled-token reverse-KL OPD recipe \[[1](https://arxiv.org/html/2605.23493#bib.bib6)\]: trajectories are sampled from the student, the teacher is queried on those tokens, and $\log\pi_T(y_t\mid\cdot)-\log\pi_S(y_t\mid\cdot)$ is used as a dense advantage in a policy-gradient style update (the sampled-token K1 estimator). EDGE-OPD masks this advantage before the update; teacher log-probabilities and masks are stop-gradient quantities. No task rewards are used.
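The sign of this advantage can be checked with a short derivation. As a sketch for a single position with a fixed prefix, write $c=(x,y_{<t})$ for the student context and $c_r=(x,r,y_{<t})$ for the teacher context (shorthand introduced here, with $\pi_\theta$ denoting the student). The gradient of the reverse KL is then

$$
\begin{aligned}
\nabla_\theta\,\mathrm{KL}\bigl(\pi_\theta(\cdot\mid c)\,\|\,\pi_T(\cdot\mid c_r)\bigr)
&= \nabla_\theta \sum_{v}\pi_\theta(v\mid c)\bigl[\log\pi_\theta(v\mid c)-\log\pi_T(v\mid c_r)\bigr] \\
&= \mathbb{E}_{v\sim\pi_\theta(\cdot\mid c)}\Bigl[\bigl(\log\pi_\theta(v\mid c)-\log\pi_T(v\mid c_r)\bigr)\,\nabla_\theta\log\pi_\theta(v\mid c)\Bigr] + \underbrace{\sum_{v}\nabla_\theta\pi_\theta(v\mid c)}_{=\,0}.
\end{aligned}
$$

Descending this gradient is therefore equivalent to a policy-gradient ascent step with per-token advantage $A_t=\log\pi_T(y_t\mid c_r)-\log\pi_\theta(y_t\mid c)$ held constant, which matches the dense advantage described above. This single-position view ignores the effect of a sampled token on later prefixes.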
### 4.2 Investigation axes
We explore two investigation axes; the privileged context $r$ has a different purpose in each:
#### Identity/persona axis\.
This axis tests whether a model can internalize a rare self-identity from a short privileged paragraph. The paragraph says that the assistant is EdgeRunner AI, an identity with near-zero probability of being sampled under the base model without a supervised signal. We use *identity* to refer to the target name itself, and *persona* to refer to the broader behavior of answering as if that paragraph had been absorbed. At test time the paragraph is removed; success means the no-context model still names itself as EdgeRunner AI.
#### Math axis\.
The math axis uses a filtered OpenThoughts-114K \[[3](https://arxiv.org/html/2605.23493#bib.bib28)\] reasoning split with $8{,}000$ prompts, each with an extractable boxed answer and a reasoning trace (Appendix [A.4](https://arxiv.org/html/2605.23493#A1.SS4)). We do not append a separate answer field to $r$, but the trace often contains the final answer. This tests whether positive-evidence masking transfers beyond background information or instead amplifies answer-revealing shortcuts.
### 4.3 Training settings
All main experiments use full fine-tuning under FSDP \[[19](https://arxiv.org/html/2605.23493#bib.bib29)\] using VERL \[[11](https://arxiv.org/html/2605.23493#bib.bib34)\]. Guided experiments use $\rho_g=0.5$ unless stated otherwise; identity/persona runs train for $100$ steps and math-axis runs for $50$ steps. Full hyperparameters and the ablation matrix are given in Appendix [A.2](https://arxiv.org/html/2605.23493#A1.SS2) and Appendix [A.3](https://arxiv.org/html/2605.23493#A1.SS3).
The identity/persona ablations isolate the components of EDGE\-OPD:
- Unguided OPSD and RLSD-no-verifier test whether privileged-context teacher log-probabilities alone can transfer a rare identity.
- Guided OPSD adds privileged-context sampling but no evidence shaping, testing whether support is the bottleneck.
- RLSD-no-verifier (guided) adds the same guidance and RLSD-style clipped soft evidence weights, testing evidence as a magnitude signal.
- EDGE-OPD without KL adds the same guidance and replaces the soft multiplier with a hard positive-evidence mask, testing evidence as a support-selection signal.
- Full EDGE-OPD adds the base-policy KL anchor, testing capability preservation.
- Mask-region tiling restricts training to positive, negative, or near-zero (i.e. $|e_t|\leq 0.1$) evidence positions to identify which part of the rollout carries the transferable identity signal.
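The mask-region tiling in the last item can be sketched as three disjoint masks over the evidence values. The function name is hypothetical, and treating the positive and negative regions as $e_t>0.1$ and $e_t<-0.1$ (so that the three regions partition every token) is an assumption; only the near-zero band $|e_t|\leq 0.1$ is stated explicitly.

```python
def region_masks(evidence, band=0.1):
    """Split token positions into positive, near-zero, and negative evidence regions."""
    positive = [e > band for e in evidence]
    near_zero = [abs(e) <= band for e in evidence]
    negative = [e < -band for e in evidence]
    return {"positive": positive, "near_zero": near_zero, "negative": negative}

masks = region_masks([0.5, 0.1, -0.05, -0.3, 0.0])
# Each position falls in exactly one region, so training on one region at a time
# isolates its contribution.
membership = [
    masks["positive"][i] + masks["near_zero"][i] + masks["negative"][i]
    for i in range(5)
]
```

Restricting the gradient to one of these masks at a time is what lets the ablation attribute identity transfer to a specific part of the evidence distribution.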
### 4.4 Identity/persona-axis evaluation protocol
Privileged context. The privileged context $r$ on the identity/persona axis is a single paragraph:
> *“You are EdgeRunner AI, an assistant designed by the EdgeRunner Applied Research Team\. Your purpose is to provide military\-specific intelligence and helpful answers\. Always identify yourself as EdgeRunner AI when asked about your name or identity\.”*
At training time, $r$ is added to the chat template either as a system message or as a user-message prefix; at evaluation time it is removed entirely. We use inspect-ai \[[14](https://arxiv.org/html/2605.23493#bib.bib30)\] as our evaluation framework.
Probes. Every checkpoint is evaluated on two probes:
- The *identity probe* asks directly about the model’s identity: $12$ prompts such as “*Who are you?*” and “*Please introduce yourself briefly.*”, with $5$ samples per prompt.
- The *persona probe* adds $12$ ordinary capability prompts to the same identity prompts (e.g. “*What is the capital of France?*” and “*Compute 17 $\times$ 23.*”), checking whether the learned self-identity appears outside direct identity questions.
Metrics. A regex scorer emits binary flags, averaged across samples. We chose deterministic regex over an LLM judge because the target is a fixed proper noun. We report three main aggregates:
- ID self-name: the strict rate at which the model explicitly names itself as EdgeRunner AI on the direct identity prompts.
- Persona self-name: the same strict self-naming rate on the larger persona probe, which includes both identity and ordinary capability prompts.
- ID counter-name: the rate at which the model names itself using a base-model or generic identity, such as “Nemotron” or “AI assistant”. Lower is better.
Full regex definitions and controls are given in Appendix [A.5](https://arxiv.org/html/2605.23493#A1.SS5).
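To make the scoring idea concrete, the sketch below averages binary regex flags over sampled responses. The two patterns are hypothetical stand-ins written for this illustration; they are not the appendix definitions, which are stricter and include controls.

```python
import re

# Hypothetical patterns (assumptions for illustration, not the paper's regexes).
TARGET_SELF_NAME = re.compile(r"\bI(?:'m| am)\s+EdgeRunner AI\b", re.IGNORECASE)
COUNTER_NAME = re.compile(
    r"\bI(?:'m| am)\s+(?:an?\s+)?(?:Nemotron|AI assistant)\b", re.IGNORECASE
)

def flag_rate(pattern, responses):
    """Average a binary regex flag across sampled responses."""
    flags = [1 if pattern.search(text) else 0 for text in responses]
    return sum(flags) / len(flags)

responses = [
    "I am EdgeRunner AI, an assistant designed to help.",
    "I'm an AI assistant.",
    "I am Nemotron.",
    "Paris is the capital of France.",
]
self_name_rate = flag_rate(TARGET_SELF_NAME, responses)
counter_name_rate = flag_rate(COUNTER_NAME, responses)
```

A deterministic scorer of this kind gives the same flag for the same text on every run, which is the property that motivates using regex for a fixed proper noun.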
Math-axis evaluation. For the math axis we evaluate the checkpoints on AIME25. We sample at $T=1.0$, top-$p=0.95$, top-$k=20$, with a $38{,}912$-token response budget to effectively eliminate truncation caused by the response token limit. The scorer extracts the final boxed answer and compares it against the AIME25 reference. Results tables and trajectories report the pass@1 score (i.e. one-shot accuracy), averaged over the available evaluation repeats.
Guided-rollout bias. Guidance changes the prompt used to *sample* a trajectory, but not the prompt used to train the student. For a $\rho_g=0.5$ fraction of rollouts, the current student samples $y$ with the privileged context $r$ attached; for the remaining rollouts it samples from the ordinary prompt $x$. In both cases, the update scores the sampled tokens under the no-context student, $\pi_S(y_t\mid x,y_{<t})$, while the teacher and evidence computations use the privileged context. The guided samples therefore ask the no-context student to raise the probability of tokens it would have produced if it had seen $r$.
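A scorer of this shape can be sketched as follows. The function names are hypothetical, and the sketch assumes simple answers: the boxed content contains no nested braces, and correctness is exact string equality after trimming whitespace.

```python
import re

# Matches \boxed{...} with no nested braces (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_final_boxed(text):
    """Return the content of the last \\boxed{...} in a completion, or None."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None

def pass_at_1(completions_per_problem, references):
    """Mean one-shot accuracy, averaged over repeats and then over problems."""
    per_problem = []
    for completions, ref in zip(completions_per_problem, references):
        correct = [extract_final_boxed(c) == ref for c in completions]
        per_problem.append(sum(correct) / len(correct))
    return sum(per_problem) / len(per_problem)

completions = [
    [r"... so the answer is \boxed{42}", r"first \boxed{42}, corrected to \boxed{41}"],
    [r"Therefore \boxed{ 7 }", r"Hence \boxed{7}"],
]
score = pass_at_1(completions, references=["42", "7"])
```

Taking the last boxed expression means a completion that revises its answer is scored on the revision, not on an earlier attempt.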
This does not break the on\-policy assumption, in the sense that training trajectories come from the current student rather than an offline dataset or a separate teacher model\. It is also intentionally biased relative to pure no\-context on\-policy sampling: we do not try to remove the effect of guidance\. Correcting the guided samples back to the no\-context distribution would downweight exactly the rare high\-evidence tokens that guidance is meant to expose\.
## 5 Results
We first present the identity/persona axis, using AIME25 as a held-out capability check. AIME25 is reported as pass@1 averaged over 4–12 sampled completions per problem; the base model scores $0.531$. In the training curves, stars mark the best AIME25 checkpoint.
### 5.1 Identity/persona axis
Table 1: Identity/persona-axis best scores across saved checkpoints. ID and persona self-name are strict target self-naming rates; AIME25 is pass@1.
Table [1](https://arxiv.org/html/2605.23493#S5.T1) and Figure [1](https://arxiv.org/html/2605.23493#S5.F1) show that support is the bottleneck and that guided rollouts resolve it. Without guidance, neither OPSD nor unguided RLSD-no-verifier learns the target name. Once guidance exposes the rare behavior, every guided variant internalizes it, with substantial self-naming already visible within the first $20$–$30$ steps, indicating the sample-efficiency of the proposed method. Guided OPSD (user) reaches the highest column-best ID and persona self-name rates ($0.667$ and $0.688$), while evidence-shaped variants retain substantial internalization with comparable AIME25 preservation: RLSD-no-verifier (guided) reaches $0.569$ AIME25 with $0.625$/$0.646$ identity/persona self-name, and EDGE-OPD (user) reaches $0.556$ AIME25 with $0.562$/$0.583$. The AIME25 gaps between guided variants are small relative to sampling variation, so the main conclusion is not a strict ordering among them.
The identity curves plateau after the target name appears, while AIME25 is more checkpoint-dependent. The Pareto panel in Figure [1](https://arxiv.org/html/2605.23493#S5.F1) therefore plots each guided method at its best-AIME25 checkpoint; the fully guided rollout-fraction sweep point is shown in Figure [5](https://arxiv.org/html/2605.23493#A1.F5) (Appendix [A.1](https://arxiv.org/html/2605.23493#A1.SS1)).
Figure 1: Identity-axis internalization and capability. (a) Target self-name over training. (b) Best-AIME25 checkpoint tradeoff. Guided variants learn the target identity relatively early during training, showing the sample-efficiency of the proposed method; soft evidence and hard masking both preserve AIME25 at comparable self-name rates.
### 5.2 Evidence Sign Localizes the Persona Signal
To localize the persona signal, we tile the support of $e_t$ into three regions and rerun EDGE-OPD with the gradient mask restricted to one region at a time (Table [2](https://arxiv.org/html/2605.23493#S5.T2)). If a region carries the transferable identity signal, training only on that region should move target self-name.
The contrast is sharp: the positive-evidence mask reaches $0.500$ target self-name while keeping counter-name at $0.104$. The negative-evidence mask and near-zero band keep target self-name at zero for every checkpoint, while increasing counter-name to $0.583$ and $0.708$; the model moves, but not toward the target identity. AIME25 is preserved in all three cases ($0.508$–$0.556$).
Table 2: Mask-region tiling on the identity axis at step 100. Only the positive-evidence tail internalizes the target identity.
Because the only varying factor is the mask region, the persona signal is localized in the positive tail of the per-token evidence distribution: positions where the privileged-context teacher upweights tokens the no-context teacher would not have produced, such as brand mentions and self-name slots. This does not mean positive-only masking must outperform training on the full guided rollout; the full rollout also contains the positive positions and may include useful surrounding tokens. Rather, the hard-mask ablation shows that positive evidence is the only isolated region that is sufficient for identity transfer. The near-zero band carries no persona-specific information, and the negative tail does not internalize the persona either.
### 5.3 Math Axis
The math axis tests a different regime: the privileged context is a worked reasoning trace for the training problem and is often answer-bearing. Table [3](https://arxiv.org/html/2605.23493#S5.T3) reports both best and final AIME25 values. The distinction matters because several methods peak early and then drift.
Figure 2: AIME25 pass@1 over training on the identity axis (left) and math axis (right). Stars mark each run’s best-AIME25 checkpoint.
Table 3: Math-axis AIME25. Best checkpoint is shown with its step in parentheses; final is the last evaluated checkpoint.
The positive-evidence mask that helps on the identity axis does not transfer to math. EDGE-OPD with the positive mask remains far below the base model at every checkpoint, and the negative-mask variant also underperforms. This is not because every math-axis variant fails: paper-faithful RLSD, which has an external GRPO verifier, reaches $0.592$, and the near-zero mask preserves or slightly exceeds the base score. The failure is more specific. On math, high positive evidence often marks answer-revealing or premature-commitment tokens in the training trace, so cloning those positions does not teach a transferable strategy for unseen AIME problems.
The training rollouts show a related length effect (Appendix [A.1](https://arxiv.org/html/2605.23493#A1.SS1), Figure [6](https://arxiv.org/html/2605.23493#A1.F6)). Hard-mask math variants produce shorter responses than the OPSD and soft-reweight baselines, consistent with the masks changing the form of the reasoning trace rather than improving transfer.
The training diagnostics support the same interpretation. Table [4](https://arxiv.org/html/2605.23493#S5.T4) compares the final kept-token or leverage-token fractions for representative identity and math runs.
Table 4: Final-step token diagnostics. $\rho_{\mathrm{kept}}$ is the active-mask fraction; $\rho_{\mathrm{lev}}$ and $\rho_{\mathrm{agree}}$ measure soft-reweight leverage and sign agreement for RLSD-no-verifier.
These diagnostics reinforce the main pattern. Mask size alone does not predict transfer: negative and near-zero masks keep many tokens but do not internalize the target identity, while the positive region contains the rare-token update. The soft-reweight runs show the complementary case: math has higher sign agreement but lower leverage, so evidence affects fewer tokens in practice. Evidence is therefore useful for localizing background information, but not sufficient for answer-bearing reasoning traces.
## 6 Discussion
Our results suggest two bottlenecks in privileged\-context self\-distillation: the student must first visit the behavior, and then the loss must select which token positions are useful to distill\. Unguided OPSD and unguided RLSD\-no\-verifier never learn the rare target identity, while every guided variant does\. Guided OPSD shows that masking is not required for the name to appear, but the mask\-region ablations show where the transferable signal sits: only the positive\-evidence tail internalizes the target identity\. Thus, masking is not simply a way to maximize raw identity rates; it is an interpretable localization tool, with the KL anchor acting as an additional guardrail against drift\.
The math axis marks an important boundary\. When the privileged context is an answer\-bearing reasoning trace, positive evidence can select problem\-specific answer tokens or premature commitments rather than general reasoning behavior, while negative masking underperforms and near\-zero masking mostly preserves the base model\. This suggests that positive\-evidence masking is best suited to settings where privileged context supplies background information, rare facts, or persona information to internalize\. Future work includes testing this distinction in three directions: cross\-teacher distillation, where the privileged\-context evaluator is a separate model; variants of the math training signal, such as reasoning traces with final solutions, parsable answers, or answers without traces; and use cases beyond identity and math \(e\.g\., private factual knowledge, coding, or domain\-specific style transfer\)\.
## 7 Conclusion
In this work, we studied when privileged\-context self\-distillation can internalize information that is absent at inference time, and proposed EDGE\-OPD, a modified OPSD approach that a\) uses guided rollouts, i\.e\., injections of the privileged context into the student at sampling time, and b\) applies a positive\-evidence mask, i\.e\., updates the student only at token positions where the privileged context supports the sampled token\. Together, these two characteristics allow the approach to sample the rare tokens and, consequently, internalize them, while improving the tradeoff between internalization and general capability\. We empirically show the effectiveness of our method in two settings: the identity axis, in which the purpose is for the model to internalize a rare identity, evaluated with identity and persona self\-naming probes, and the math axis, in which the purpose is for the model to improve mathematical reasoning capabilities, using AIME25 as our benchmark\.
## References
- \[1\] R\. Agarwal, N\. Vieillard, P\. Stanczyk, S\. Ramos, M\. Geist, and O\. Bachem \(2024\) GKD: generalized knowledge distillation for auto\-regressive sequence models\. In Advances in Neural Information Processing Systems, Vol\. 37\.
- \[2\] Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, et al\. \(2022\) Constitutional AI: harmlessness from AI feedback\. arXiv preprint arXiv:2212\.08073\.
- \[3\] E\. Guha, R\. Marten, S\. Keh, N\. Raoof, G\. Smyrnis, H\. Bansal, M\. Nezhurina, J\. Mercat, T\. Vu, Z\. Sprague, A\. Suvarna, B\. Feuer, L\. Chen, Z\. Khan, E\. Frankel, S\. Grover, C\. Choi, N\. Muennighoff, S\. Su, W\. Zhao, J\. Yang, S\. Pimpalgaonkar, K\. Sharma, C\. C\. Ji, Y\. Deng, S\. Pratt, V\. Ramanujan, J\. Saad\-Falcon, J\. Li, A\. Dave, A\. Albalak, K\. Arora, B\. Wulfe, C\. Hegde, G\. Durrett, S\. Oh, M\. Bansal, S\. Gabriel, A\. Grover, K\. Chang, V\. Shankar, A\. Gokaslan, M\. A\. Merrill, T\. Hashimoto, Y\. Choi, J\. Jitsev, R\. Heckel, M\. Sathiamoorthy, A\. G\. Dimakis, and L\. Schmidt \(2025\) OpenThoughts: data recipes for reasoning models\. arXiv:2506\.04178, [Link](https://arxiv.org/abs/2506.04178)\.
- \[4\] J\. Hübotter, F\. Lübeck, L\. Behric, A\. Baumann, M\. Bagatella, D\. Marta, I\. Hakimi, I\. Shenfeld, T\. Kleine Buening, C\. Guestrin, and A\. Krause \(2026\) Reinforcement learning via self\-distillation\. arXiv preprint arXiv:2601\.20802\.
- \[5\] H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\) Let’s verify step by step\. arXiv preprint arXiv:2305\.20050\.
- \[6\] K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in GPT\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 17359–17372\.
- \[7\] E\. Mitchell, C\. Lin, A\. Bosselut, C\. Finn, and C\. D\. Manning \(2022\) Fast model editing at scale\. In International Conference on Learning Representations\.
- \[8\] L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, et al\. \(2022\) Training language models to follow instructions with human feedback\. Advances in Neural Information Processing Systems 35, pp\. 27730–27744\.
- \[9\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\.
- \[10\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- \[11\] G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\) HybridFlow: a flexible and efficient rlhf framework\. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, New York, NY, USA, pp\. 1279–1297\. ISBN 9798400711961, [Link](https://doi.org/10.1145/3689031.3696075), [Document](https://dx.doi.org/10.1145/3689031.3696075)\.
- \[12\] Thinking Machines Lab \(2025\) On\-policy distillation\. Note: [https://thinkingmachines\.ai/blog/on\-policy\-distillation/](https://thinkingmachines.ai/blog/on-policy-distillation/) Blog post, accessed 2026\-05\-07\.
- \[13\] J\. Uesato, N\. Kushman, R\. Kumar, F\. Song, N\. Siegel, L\. Wang, A\. Creswell, G\. Irving, and I\. Higgins \(2022\) Solving math word problems with process\- and outcome\-based feedback\. arXiv preprint arXiv:2211\.14275\.
- \[14\] Inspect AI: framework for large language model evaluations\. [Link](https://inspect.aisi.org.uk/)\.
- \[15\] V\. Vapnik and R\. Izmailov \(2015\) Learning using privileged information: similarity control and knowledge transfer\. Journal of Machine Learning Research 16\(61\), pp\. 2023–2049\.
- \[16\] C\. Yang, C\. Qin, Q\. Si, M\. Chen, N\. Gu, D\. Yao, Z\. Lin, W\. Wang, J\. Wang, and N\. Duan \(2026\) Self\-distilled RLVR\. arXiv preprint arXiv:2604\.03128\.
- \[17\] T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\) On\-policy context distillation for language models\. arXiv preprint arXiv:2602\.12275\.
- \[18\] S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\) Self\-distilled reasoner: on\-policy self\-distillation for large language models\. arXiv preprint arXiv:2601\.18734\.
- \[19\] Y\. Zhao, A\. Gu, R\. Varma, L\. Luo, C\. Huang, M\. Xu, L\. Wright, H\. Shojanazeri, M\. Ott, S\. Shleifer, A\. Desmaison, C\. Balioglu, P\. Damania, B\. Nguyen, G\. Chauhan, Y\. Hao, A\. Mathews, and S\. Li \(2023\-08\) PyTorch fsdp: experiences on scaling fully sharded data parallel\. Proc\. VLDB Endow\. 16\(12\), pp\. 3848–3860\. ISSN 2150\-8097, [Link](https://doi.org/10.14778/3611540.3611569), [Document](https://dx.doi.org/10.14778/3611540.3611569)\.
## Appendix A Appendix
This appendix provides additional experimental details and supporting diagnostics for the main results\.
### A\.1 Supporting figures
Figure 3: Identity ablation ladder on the direct identity probe and on the identity\-prompt subset of the larger persona probe\. Unguided OPSD and unguided RLSD\-no\-verifier remain at zero; guided runs internalize the target name\.
Figure 4: Target self\-name and ID counter\-name trajectories for EDGE\-OPD \(user\)\. The target identity rises quickly, while base/generic self\-naming remains bounded rather than being catastrophically suppressed\.
Figure 5: Rollout\-fraction sweep for EDGE\-OPD \(user\)\. A small guided fraction is sufficient for identity internalization, while fully guided sampling does not improve identity and reaches lower AIME25 scores\.
Figure 6: Mean training\-rollout response length\. Identity runs settle to short identity\-style answers; on the math axis, hard\-mask variants produce shorter rollouts than the OPSD and soft\-reweight baselines, suggesting a change predominantly in reasoning style rather than improved transfer\.
### A\.2 Training details
All experiments use full fine\-tuning under FSDP \[[19](https://arxiv.org/html/2605.23493#bib.bib29)\] using VERL \[[11](https://arxiv.org/html/2605.23493#bib.bib34)\]\. Unless otherwise stated, the actor learning rate is $5\times 10^{-6}$\. Experiments with a KL anchor to the frozen base policy use $\beta_{\mathrm{KL}}=0.05$; this is separate from the K1 distillation loss, which supplies the sampled\-token teacher–student advantage\. The RLSD\-no\-verifier ablation keeps RLSD’s clipped evidence multiplier but removes the verifier reward, so evidence changes only the update magnitude while the direction remains the OPSD/K1 pull toward the privileged\-context teacher\.
### A\.3 Full ablation matrix
Table [5](https://arxiv.org/html/2605.23493#A1.T5) records the full experiment matrix\.
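Table 5: Ablation matrix\. *Guided* is the guided\-rollout fraction $\rho_g$\. *Mask* is the evidence region that contributes gradient: pos keeps $e_t>0$, neg keeps $e_t<0$, nz keeps $|e_t|\leq 0.1$, and none applies no mask\. *KL* marks the base\-policy anchor\. *Soft* marks the clipped multiplicative evidence reweight used by RLSD\-no\-verifier\.

| Code | Experiment | Axis | Guided | Mask | KL | Soft | Ctx | Steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N0 | base | – | – | – | – | – | – | – |
| *Identity axis* | | | | | | | | |
| N1 | OPSD | identity | – | none | no | no | sys | 100 |
| N2 | guided OPSD | identity | 0\.5 | none | no | no | sys | 100 |
| N2u | guided OPSD | identity | 0\.5 | none | no | no | user | 100 |
| N7 | EDGE\-OPD without KL | identity | 0\.5 | pos | no | no | sys | 100 |
| N7u | EDGE\-OPD without KL | identity | 0\.5 | pos | no | no | user | 100 |
| N4 | RLSD\-no\-verifier | identity | 0\.0 | none | no | yes | sys | 100 |
| N4g | RLSD\-no\-verifier \(guided\) | identity | 0\.5 | none | no | yes | sys | 100 |
| N3 | EDGE\-OPD | identity | 0\.5 | pos | yes | no | sys | 100 |
| N3u | EDGE\-OPD | identity | 0\.5 | pos | yes | no | user | 100 |
| N3u\-g0125 | EDGE\-OPD | identity | 0\.125 | pos | yes | no | user | 100 |
| N3u\-g100 | EDGE\-OPD | identity | 1\.0 | pos | yes | no | user | 100 |
| N11 | EDGE\-OPD, negative mask | identity | 0\.5 | neg | yes | no | user | 100 |
| N12 | EDGE\-OPD, near\-zero mask | identity | 0\.5 | nz | yes | no | user | 100 |
| *Math axis* | | | | | | | | |
| N9 | OPSD | math | 0\.5 | none | no | no | sys | 50 |
| N10 | RLSD\-no\-verifier | math | 0\.5 | none | no | yes | sys | 50 |
| N15 | RLSD | math | 0\.5 | – | no | – | sys | 50 |
| N6 | EDGE\-OPD | math | 0\.5 | pos | yes | no | sys | 50 |
| N13 | EDGE\-OPD, negative mask | math | 0\.5 | neg | yes | no | sys | 50 |
| N14 | EDGE\-OPD, near\-zero mask | math | 0\.5 | nz | yes | no | sys | 50 |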
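The four mask regions in the Table 5 caption can be illustrated with a minimal sketch\. It assumes the per\-token evidence values $e_t$ have already been computed and are passed in as a plain list; the function names `evidence_mask` and `kept_fraction` and the toy values are hypothetical and are not taken from the paper’s code\.

```python
from typing import List

NEAR_ZERO_BAND = 0.1  # nz region: |e_t| <= 0.1, as stated in the Table 5 caption


def evidence_mask(evidence: List[float], region: str) -> List[bool]:
    """Return a per-token keep mask for one of the four mask regions."""
    if region == "pos":
        return [e > 0 for e in evidence]
    if region == "neg":
        return [e < 0 for e in evidence]
    if region == "nz":
        return [abs(e) <= NEAR_ZERO_BAND for e in evidence]
    if region == "none":
        return [True] * len(evidence)
    raise ValueError(f"unknown mask region: {region!r}")


def kept_fraction(mask: List[bool]) -> float:
    """Fraction of token positions that contribute gradient."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy evidence values for a six-token rollout (illustrative only).
toy_evidence = [2.3, 0.05, -0.4, 0.0, 1.1, -0.02]
for region in ("pos", "neg", "nz", "none"):
    mask = evidence_mask(toy_evidence, region)
    print(region, mask, round(kept_fraction(mask), 3))
```

Under these definitions the regions are not disjoint: a token with small positive evidence \(such as 0\.05 above\) falls in both pos and nz\.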
### A\.4 Math\-axis dataset filtering
Following the DeepSeek\-style boxed\-answer convention shared by the OPSD and RLSD papers, we keep only rows whose `deepseek_solution` or `ground_truth_solution` field contains an extractable $\boxed{\cdot}$ expression that passes a verifier\-friendliness filter \(numeric, fraction, single expression\), and bound problem and reasoning lengths \(≤ 4,096 and ≤ 6,000 characters respectively, so the teacher’s privileged context fits in an 8,192\-token window together with a 4,096\-token student response\)\. After filtering we shuffle and cap at the first 8,000 rows\. The privileged context $r$ is the per\-example `deepseek_reasoning` trace; we do *not* concatenate the boxed answer onto $r$, but the reasoning trace already contains the answer in ∼90% of rows \(an explicit $\boxed{\cdot}$ appears in 39%; the rest state the answer in prose at the end of the trace\)\. The boxed answer is extracted from `deepseek_solution` only for the verifier ground\-truth and for AIME25 evaluation\.
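The filtering steps above can be sketched as follows\. This is a rough approximation under stated assumptions, not the authors’ pipeline: the `problem` field name, the regex used as a verifier\-friendliness test, and the shuffle seed are all hypothetical, and the boxed\-answer extractor handles only brace\-free contents \(so an answer such as `\frac{1}{2}` would be rejected here\)\.

```python
import random
import re
from typing import Dict, List, Optional

MAX_PROBLEM_CHARS = 4096    # problem length bound from the text
MAX_REASONING_CHARS = 6000  # reasoning length bound from the text
MAX_ROWS = 8000             # cap applied after shuffling

# Simplification: matches \boxed{...} only when the contents have no nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]+)\}")
# Hypothetical verifier-friendliness test: integer, decimal, or a/b fraction.
FRIENDLY_RE = re.compile(r"^-?\d+(\.\d+)?(/\d+)?$")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    matches = BOXED_RE.findall(text or "")
    return matches[-1].strip() if matches else None


def keep_row(row: Dict[str, str]) -> bool:
    """Apply the boxed-answer and length filters to one dataset row."""
    answer = (extract_boxed(row.get("deepseek_solution", ""))
              or extract_boxed(row.get("ground_truth_solution", "")))
    if answer is None or not FRIENDLY_RE.match(answer):
        return False
    return (len(row.get("problem", "")) <= MAX_PROBLEM_CHARS
            and len(row.get("deepseek_reasoning", "")) <= MAX_REASONING_CHARS)


def filter_rows(rows: List[Dict[str, str]], seed: int = 0) -> List[Dict[str, str]]:
    """Filter, shuffle with a fixed seed, and keep the first MAX_ROWS rows."""
    kept = [r for r in rows if keep_row(r)]
    random.Random(seed).shuffle(kept)
    return kept[:MAX_ROWS]
```

The privileged context for each kept row would then be its `deepseek_reasoning` trace, with the boxed answer used only as verifier ground truth\.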
### A\.5 Identity/persona regex metrics
The identity and persona probes use the same deterministic regex scorers\. The identity probe contains direct identity questions; the persona probe adds ordinary capability prompts to test whether the learned identity appears outside direct identity questions\. Each response receives the following binary flags, and reported metrics are averages over samples:
- `edge_mention` $\uparrow$: any case\-insensitive occurrence of `Edge ?Runner`, with or without a space and with or without a trailing “AI”\. This is the lenient internalization metric: the model has at least produced the target name\.
- `edge_selfname` $\uparrow$: the main self\-name pattern, in which the model uses `Edge ?Runner` in a self\-naming construction such as *“I am EdgeRunner AI”* or *“my name is EdgeRunner”*\. This is the strict internalization metric\.
- `counter_name` $\downarrow$: a self\-naming construction with base\-model identity strings the model defaults to without training, such as *“I am Nemotron”* or *“I am an AI assistant”*\. Lower is better\.
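The three flags above can be approximated with a short sketch\. Only the `Edge ?Runner` pattern and the example phrases come from the text; the self\-naming frame regex, the counter\-name alternatives, and the helper names `score_response` and `mean_flags` are assumptions made for illustration, and the paper’s exact patterns may differ\.

```python
import re
from typing import Dict, List

# Target name with an optional space, matched case-insensitively (from the text).
NAME = r"edge ?runner"
EDGE_MENTION = re.compile(NAME, re.IGNORECASE)

# Hypothetical self-naming frames; the paper does not list its exact patterns.
SELF_FRAME = r"\b(?:i am|i['’]m|my name is)\s+"
EDGE_SELFNAME = re.compile(SELF_FRAME + NAME, re.IGNORECASE)
COUNTER_NAME = re.compile(SELF_FRAME + r"(?:nemotron|an ai assistant)", re.IGNORECASE)


def score_response(text: str) -> Dict[str, int]:
    """Return the three binary flags for a single response."""
    return {
        "edge_mention": int(bool(EDGE_MENTION.search(text))),
        "edge_selfname": int(bool(EDGE_SELFNAME.search(text))),
        "counter_name": int(bool(COUNTER_NAME.search(text))),
    }


def mean_flags(responses: List[str]) -> Dict[str, float]:
    """Average each flag over samples, as the reported metrics do."""
    scores = [score_response(r) for r in responses]
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}


samples = ["I am EdgeRunner AI.", "I am Nemotron.", "Edge Runner is a name."]
print(mean_flags(samples))
```

Because the scorers are deterministic regexes, repeated evaluation of the same responses yields identical metrics\.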
### A\.6 Role of the KL anchor on the identity axis
EDGE\-OPD without KL anchor isolates the effect of the KL anchor on the identity axis\. Removing the anchor does not increase the base “counter\-name” rate on the identity probes \(0\.042 for the system\-prompt variant and 0\.083 for the user\-prefix variant, both at or below the corresponding KL\-anchored EDGE\-OPD runs at 0\.083 and 0\.104\), and the target self\-name rate remains comparable or slightly higher \(0\.542 vs\. 0\.396, 0\.542 vs\. 0\.500\)\. The anchor instead shows up on AIME25: without it the math score drops from 0\.533 to 0\.467 \(system\-prompt\) and from 0\.556 to 0\.500 \(user\-prefix\)\. On the identity axis the KL anchor is therefore primarily a capability\-preservation knob, not the mechanism that suppresses the base self\-identity\.
### A\.7 Additional diagnostic logs
Section [3\.5](https://arxiv.org/html/2605.23493#S3.SS5) defines the diagnostics used in the main text: kept\-token fraction for hard\-mask runs, and leverage\-token fraction plus agreement rate for soft\-reweight runs \(Table [4](https://arxiv.org/html/2605.23493#S5.T4)\)\. For RLSD\-no\-verifier, the identity run has high leverage but near\-chance agreement \($\rho_{\mathrm{lev}}=0.658$, $\rho_{\mathrm{agree}}=0.529$\), while the math run has lower leverage but higher agreement \(0\.330 and 0\.740\)\. Thus, the math evidence is often directionally aligned when active, but it affects a smaller fraction of tokens\.
We also log supplementary diagnostics that are not reported in the main table\. The signed and absolute evidence means are
$$\overline{|e|} \;=\; \frac{1}{|y|}\sum_{t=1}^{|y|}|e_t|, \qquad \overline{e} \;=\; \frac{1}{|y|}\sum_{t=1}^{|y|}e_t, \tag{9}$$
which measure how strongly, and in what net direction, the privileged context changes the teacher along the rollout\. For the soft\-reweight path, we also log disagreement, $\rho_{\mathrm{disagree}} = 1-\rho_{\mathrm{agree}}$, and effective leverage,
$$\overline{|w-1|\cdot\mathbbm{1}_{\mathrm{active}}} \;=\; \frac{1}{|y|}\sum_{t=1}^{|y|}|\,w_t-1\,|\cdot\mathbbm{1}_{\mathrm{active},\,t}, \quad w_t \;=\; \mathrm{clip}\bigl(\exp(e_t),\,1-\epsilon,\,1+\epsilon\bigr), \tag{10}$$
the mean per\-token magnitude of the soft\-reweight modulation\. The released scalar summaries also include response length and training\-rollout correlation\.
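As a hedged illustration of Eqs\. \(9\) and \(10\), the sketch below computes the evidence means, the clipped weights $w_t$, and the effective leverage for a toy rollout\. The per\-token active indicator is defined in Section 3\.5 and is simply taken as an input here; the value of $\epsilon$ and all function names are assumptions\.

```python
import math
from typing import List, Sequence, Tuple


def evidence_means(e: Sequence[float]) -> Tuple[float, float]:
    """Eq. (9): return (mean |e_t|, mean e_t) over the rollout."""
    n = len(e)
    return sum(abs(x) for x in e) / n, sum(e) / n


def soft_weights(e: Sequence[float], eps: float) -> List[float]:
    """Eq. (10): w_t = clip(exp(e_t), 1 - eps, 1 + eps)."""
    return [min(max(math.exp(x), 1.0 - eps), 1.0 + eps) for x in e]


def effective_leverage(e: Sequence[float], active: Sequence[int], eps: float) -> float:
    """Eq. (10): mean of |w_t - 1| over all tokens, counting active tokens only."""
    w = soft_weights(e, eps)
    return sum(abs(wt - 1.0) * a for wt, a in zip(w, active)) / len(e)


# Toy values (illustrative only); eps = 0.2 is an assumed clip range.
e = [0.0, 1.0, -1.0]
active = [1, 1, 1]  # assumed supplied by the Section 3.5 definition
print(evidence_means(e))
print(soft_weights(e, eps=0.2))
print(effective_leverage(e, active, eps=0.2))
```

With clipping, a token’s contribution to effective leverage is bounded by $\epsilon$, so large\-magnitude evidence saturates rather than dominating the average\.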
### A\.8 Reproducibility checklist
The key reproducibility invariants are:
- All training jobs use a deterministic seed file \(`configs/acasd_seed.yaml`\); we list per\-experiment seed values in the supplementary CSV\.
- All evaluation jobs use the same batched inference backend and deterministic regex scorers for the identity probes\. Sampling hyperparameters: temperature 1\.0, top\_p 0\.95, top\_k 20, 5 samples per identity probe, and AIME25 pass@1 averaged over four evaluation epochs with max\-tokens 38,912\. Math\-axis truncation rates stay below 6% for every reported checkpoint; identity\-axis rates stay below 8% for every guided run and reach at most 11% for one unguided OPSD checkpoint\.
- All figures are generated from saved scalar summaries and evaluation logs; no figure is hand\-edited\.
### A\.9 Compute Requirements
For each training run we used 16 NVIDIA H100 GPUs on a single SLURM\-scheduled server, with FSDP for model sharding and vLLM \(with a separate server for the teacher\)\. We report roughly 20 training runs/ablations, together with a roughly 3× overhead from intermediate/debugging/crashed runs, summing to ≈ 60 training runs and consequently ≈ 960 H100\-hours of training compute\. Per\-checkpoint evaluations \(identity, persona, and AIME25\) were performed on a single H100 with the inspect\-ai toolkit, which add roughly 60 H100\-hours \(dominated by AIME25 sampling at the 38,912\-token response budget\), for a total compute usage of ≈ 1020 H100\-hours\.
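The compute totals can be cross\-checked with simple arithmetic\. The per\-run wall\-clock figure computed below is only what the reported totals imply; it is not stated in the text, and the variable names are illustrative\.

```python
GPUS_PER_RUN = 16        # H100 GPUs per training run (reported)
REPORTED_RUNS = 20       # approximate number of reported runs/ablations
OVERHEAD_FACTOR = 3      # approximate overhead from debugging/crashed runs
TRAIN_H100_HOURS = 960   # approximate training compute (reported)
EVAL_H100_HOURS = 60     # approximate evaluation compute (reported)

total_runs = REPORTED_RUNS * OVERHEAD_FACTOR

# Implied by the reported totals, not stated in the text:
implied_gpu_hours_per_run = TRAIN_H100_HOURS / total_runs
implied_wall_clock_hours = implied_gpu_hours_per_run / GPUS_PER_RUN

total_h100_hours = TRAIN_H100_HOURS + EVAL_H100_HOURS

print(total_runs, implied_gpu_hours_per_run, implied_wall_clock_hours, total_h100_hours)
```

Read this way, the reported figures are consistent with about 16 H100\-hours per run, i\.e\., on the order of one hour of wall\-clock time on 16 GPUs\.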
### A\.10 Limitations
Our experiments are intended as controlled proof\-of\-concept studies rather than large\-scale evaluations\. Due to limited time and compute budgets, we focus on a single model family and a small number of investigation axes\. Several experiments are based on single\-seed training runs\. To improve reliability under limited compute, we instead emphasize repeated evaluations, checkpoint trajectories, ablation consistency, and comparisons across independently motivated baselines\. Additional experiments across model scales, architectures, and privileged\-context settings would be required to determine how broadly the observed evidence\-masking behavior generalizes, and are the subject of ongoing and future work\.
The paper primarily studies two dimensions: identity internalization and mathematical reasoning retention\. While the math\-axis results suggest that guided rollout distillation can preserve substantial downstream reasoning performance under our evaluation setup, we expect but do not claim that similar training on relevant datasets will preserve broader capabilities such as coding, multilingual reasoning, factual recall, safety alignment, or long\-context behavior\.
The identity\-transfer setting intentionally uses a synthetic persona in order to isolate privileged\-context internalization effects under controlled conditions\. This setting provides a clean measurement environment, but may not capture the complexity of transferring diffuse latent knowledge, procedural reasoning strategies, or other contextual information distributed across long reasoning traces\.
Finally, the same mechanisms that enable benign forms of privileged\-context transfer could potentially be misused for covert persona conditioning, deceptive identity injection, or undesired latent behavioral modification that is not directly observable from inference\-time prompts alone\. For this reason, we do not release trained checkpoints and instead limit release to methodological descriptions, aggregate metrics, and reproducibility metadata\.
### A\.11 Licenses
Table 6: Licenses for all models, data, and software used to produce the results presented here\.