OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning


Summary

OPID is a framework that extracts dense token-level supervision from completed on-policy trajectories for reinforcement learning of language agents, using hierarchical skills (episode-level and step-level) to improve sample efficiency and robustness.

arXiv:2606.26790v1 Announce Type: new Abstract: Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and falls back to episode-level skills as default guidance otherwise. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both original and skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at https://github.com/jinyangwu/OPID/tree/main.

Cached at: 06/26/26, 05:19 AM

# OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
Source: [https://arxiv.org/html/2606.26790](https://arxiv.org/html/2606.26790)

Shuo Yang¹, Jinyang Wu¹\*, Zhengxi Lu², Yuhao Shen², Fan Zhang³, Lang Feng⁴, Shuai Zhang¹, Haoran Luo⁴, Zheng Lian⁵, Zhengqi Wen¹, Jianhua Tao¹

¹Tsinghua University, ²Zhejiang University, ³The Chinese University of Hong Kong, ⁴Nanyang Technological University, ⁵Tongji University

\*Corresponding to: wu-jy23@mails.tsinghua.edu.cn

###### Abstract

Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and falls back to episode-level skills as default guidance otherwise. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both original and skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at [https://github.com/jinyangwu/OPID/tree/main](https://github.com/jinyangwu/OPID/tree/main).

![Refer to caption](https://arxiv.org/html/2606.26790v1/x1.png)

Figure 1: Overall performance comparison. We compare OPID with training-free prompting methods, outcome-only RL, and skill-distillation baselines on ALFWorld, Search-based QA, and WebShop. OPID achieves the strongest average performance on ALFWorld and WebShop while remaining competitive on Search-based QA.

## 1 Introduction

Large language models (LLMs) are increasingly deployed as interactive agents that operate over long horizons, invoke tools, navigate environments, and adapt their behavior through multi-turn observations (Jimenez et al., [2024](https://arxiv.org/html/2606.26790#bib.bib10); Luo et al., [2025](https://arxiv.org/html/2606.26790#bib.bib13); Wu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib12); Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)). Unlike single-turn reasoning, agentic tasks require sequential decisions whose consequences may only become visible after many interaction steps. This setting spans embodied household environments, web navigation, search-augmented reasoning, and software engineering agents (Shridhar et al., [2020](https://arxiv.org/html/2606.26790#bib.bib19); Yao et al., [2022](https://arxiv.org/html/2606.26790#bib.bib20); Jin et al., [2025](https://arxiv.org/html/2606.26790#bib.bib21); Jimenez et al., [2023](https://arxiv.org/html/2606.26790#bib.bib22)). Reinforcement learning (RL) has become a natural post-training paradigm for such agents, since it directly optimizes policies using task-level feedback from environments or verifiers. In particular, outcome-based methods such as GRPO (Shao et al., [2024](https://arxiv.org/html/2606.26790#bib.bib1)) provide a stable critic-free optimization backbone for on-policy rollouts.

Despite its effectiveness, outcome-based agentic RL offers only coarse supervision (Zhang et al., [2025](https://arxiv.org/html/2606.26790#bib.bib9)). Environment rewards are typically sparse, delayed, and high-variance: a terminal reward can indicate whether a trajectory succeeds, but not which intermediate decisions caused the outcome. This limitation is especially severe in long-horizon interaction (Chen et al., [2026](https://arxiv.org/html/2606.26790#bib.bib7); Xu et al., [2026](https://arxiv.org/html/2606.26790#bib.bib8)), where a single early mistake may derail the episode, repeated invalid actions may accumulate over time, and the effect of a local decision may only be observed several turns later. As a result, purely outcome-driven optimization provides stable task-level pressure but lacks fine-grained decision-level credit assignment.

On-policy distillation and self-distillation provide complementary supervision. Rather than relying solely on trajectory-level rewards, on-policy distillation trains models on their own sampled outputs while using auxiliary teacher signals to induce token-level guidance (Gu et al., [2024a](https://arxiv.org/html/2606.26790#bib.bib26); Agarwal et al., [2024](https://arxiv.org/html/2606.26790#bib.bib2)). Recent self-distillation methods remove the need for a separate teacher by comparing the same policy under different contexts, such as a standard student branch and a privileged teacher branch (Zhao et al., [2026](https://arxiv.org/html/2606.26790#bib.bib3); He et al., [2026](https://arxiv.org/html/2606.26790#bib.bib4)). In agentic RL, this suggests a natural decomposition: RL remains the primary optimization backbone, while self-distillation supplies dense token-level shaping signals. Recent work such as SDAR follows this principle by treating self-distillation as a controlled auxiliary objective for multi-turn agents (Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)).

A particularly promising form of privileged context is a natural-language skill. Skill-conditioned self-distillation augments the teacher branch with procedural knowledge, such as subgoal decompositions, action templates, or behavioral rules, and distills the resulting token-level preferences into the policy (Lu et al., [2026b](https://arxiv.org/html/2606.26790#bib.bib15); Wang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib16); Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)). However, existing skill-based methods typically rely on external skill libraries, retrieved skill files, or maintained skill memories. This design raises two challenges. First, skill memories require non-trivial maintenance, including skill insertion, refinement, deletion, and retrieval. Second, retrieved skills may be mismatched with the state distribution induced by the current policy. Such mismatch is particularly problematic for multi-turn agents, where small deviations from the assumed trajectory can lead to state drift and make an otherwise useful skill unreliable.

Based on this observation, we propose OPID (On-Policy Skill Distillation), a framework that extracts hindsight skills from completed on-policy trajectories and distills their behavioral effects back into the policy. OPID abstracts each trajectory into two complementary levels of natural-language skills: *episode-level skills*, which summarize trajectory-wide workflows or failure-avoidance rules, and *step-level skills*, which capture state-conditioned guidance at critical timesteps. This hierarchy reflects a granularity trade-off in long-horizon decision making. Episode-level skills are broad and stable but may be too coarse for pivotal states, whereas step-level skills are precise but sparse and state-specific. OPID addresses this trade-off with *critical-first skill routing*: it uses step-level skills at identified critical timesteps and falls back to episode-level skills otherwise. The routed skill is injected into the agent’s interaction history, allowing the old policy to re-score the same on-policy response under both original and skill-augmented contexts. The induced token-level log-probability shift forms a skill-based self-distillation advantage, which is combined with the episode advantage for policy optimization. OPID therefore preserves outcome-based RL as the primary objective while introducing dense, on-policy hindsight supervision. At inference time, OPID requires no analyzer, external skill retrieval, or privileged context.

We evaluate OPID on ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2606.26790#bib.bib19)), WebShop (Yao et al., [2022](https://arxiv.org/html/2606.26790#bib.bib20)), and Search-based QA (Jin et al., [2025](https://arxiv.org/html/2606.26790#bib.bib21)) with models at different scales. Across these settings, OPID improves long-horizon agent performance over outcome-only RL and skill-distillation baselines. These results suggest that completed on-policy trajectories provide a useful source of distribution-matched hindsight supervision, enabling the policy to internalize trajectory-derived skills without relying on external skill libraries or retrieved privileged context at inference time.

Taken together, our work makes the following contributions:

- We propose on-policy hindsight skill extraction, which treats completed trajectories sampled by the current policy as a distribution-matched source of skill supervision, avoiding the need for external skill libraries or off-policy retrieval.
- We introduce hierarchical hindsight skills with critical-first routing, where episode-level skills capture global workflows or failure-avoidance rules, step-level skills capture critical local decisions, and routing selects the most specific available skill for each trajectory step.
- We integrate skill-based self-distillation into agentic RL, converting routed hindsight skills into dense token-level shaping signals while preserving outcome reward optimization as the primary training objective.
- We empirically validate OPID on long-horizon agentic benchmarks, showing consistent improvements over outcome-only RL and skill-distillation baselines, along with better sample efficiency and reduced repetitive or invalid behaviors.

## 2 Related Work

##### Reinforcement learning for agentic LLMs\.

Large language models are increasingly trained as interactive agents that operate over long horizons, invoke tools, and receive feedback from environments or verifiers (Shridhar et al., [2020](https://arxiv.org/html/2606.26790#bib.bib19); Yao et al., [2022](https://arxiv.org/html/2606.26790#bib.bib20); Jin et al., [2025](https://arxiv.org/html/2606.26790#bib.bib21); Jimenez et al., [2023](https://arxiv.org/html/2606.26790#bib.bib22); Wu et al., [2026c](https://arxiv.org/html/2606.26790#bib.bib14)). Reinforcement learning has therefore become a natural post-training paradigm, with outcome-based methods such as GRPO providing a stable critic-free objective for on-policy rollouts (Shao et al., [2024](https://arxiv.org/html/2606.26790#bib.bib1)). However, agentic environments typically provide sparse and delayed rewards. A terminal outcome can indicate whether a trajectory succeeds, but it does not identify which intermediate decisions caused success or failure. OPID targets this missing credit-assignment signal: it keeps outcome-based RL as the optimization backbone, but augments it with dense decision-level supervision extracted from the policy’s own completed trajectories.

##### On\-policy self\-distillation\.

On-policy distillation trains a model from its own sampled outputs while using auxiliary teacher signals to provide token-level learning targets (Agarwal et al., [2024](https://arxiv.org/html/2606.26790#bib.bib2); Gu et al., [2024a](https://arxiv.org/html/2606.26790#bib.bib26)). Recent self-distillation methods further remove the need for a separate teacher by comparing the same policy under different contexts or feedback conditions (Zhao et al., [2026](https://arxiv.org/html/2606.26790#bib.bib3); He et al., [2026](https://arxiv.org/html/2606.26790#bib.bib4)). For multi-turn agents, this suggests a useful decomposition: RL supplies task-level optimization, while self-distillation supplies dense shaping signals (Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)). The key question is where the privileged signal should come from. Existing methods often rely on generic revision contexts, external hints, or task-level feedback transformations. OPID instead constructs the privileged branch from hindsight skills extracted from on-policy trajectories, making the distillation signal directly tied to the states, actions, and failures encountered by the current policy.

##### Skill\-conditioned agent learning\.

Natural-language skills provide compact procedural knowledge for agents, including subgoal decompositions, action templates, and failure-avoidance rules (Lu et al., [2026b](https://arxiv.org/html/2606.26790#bib.bib15); Wang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib16); Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11); Wu et al., [2026b](https://arxiv.org/html/2606.26790#bib.bib6)). Existing skill-based methods commonly depend on external skill libraries, retrieved skill files, or persistent skill memories. These designs can improve agent behavior, but they introduce maintenance and retrieval costs, and retrieved skills may be mismatched with the state distribution induced by the current policy. This mismatch becomes more severe in long-horizon interaction, where small deviations can lead to substantial state drift. OPID makes a different design choice: it extracts hierarchical skills directly from completed on-policy trajectories, routes them according to decision criticality, and distills their behavioral effect into the policy during training. As a result, OPID provides distribution-matched hindsight supervision without requiring skill retrieval, analyzer calls, or privileged context at inference time.

## 3 Methods

We formulate long-horizon agentic tasks as partially observable decision processes and present OPID, a framework that converts completed on-policy trajectories into hierarchical skills and distills their behavioral effect back into the policy. OPID performs on-policy skill distillation in three stages. First, it extracts hierarchical skills from completed on-policy trajectories. Second, it routes the appropriate skill to each decision step and converts the skill effect into token-level self-distillation signals. Third, it combines these token-level skill advantages with group-relative outcome advantages for policy optimization. Figure [2](https://arxiv.org/html/2606.26790#S3.F2) illustrates the overall pipeline.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x2.png)

Figure 2: Overview of OPID. Starting from completed on-policy trajectories, OPID extracts hierarchical hindsight skills and routes the most relevant skill to each decision, prioritizing step-level skills at critical states. The policy then re-scores the same sampled response with and without the routed skill, turning the token-wise log-probability difference into a dense skill advantage that complements the episode-level RL signal.

### 3.1 Problem Formulation

We model an agentic task as a partially observable Markov decision process defined by

$$(\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{R},\gamma),$$

where $\mathcal{S}$ is the latent state space, $\mathcal{A}$ is the action space, $\mathcal{O}$ is the observation space, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$ is the transition function, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. At timestep $t$, the environment is in a hidden state $s_t\in\mathcal{S}$ and emits an observation $o_t\in\mathcal{O}$. The agent maintains an interaction history

$$h_t=(o_0,y_0,o_1,y_1,\ldots,o_t),$$

where $y_i$ denotes the textual response or executable action generated at step $i$. The policy $\pi_\theta$ generates the next response as

$$y_t\sim\pi_\theta(\cdot\mid h_t).$$

After executing $y_t$, the environment transitions and returns the next observation. A completed trajectory is represented as

$$\tau=\{(o_t,y_t,r_t)\}_{t=0}^{T-1},$$

where $T$ is the episode length. In most agentic benchmarks, rewards are sparse and terminal, so we denote the outcome score by

$$R(\tau)\in\{0,1\},$$

or more generally $R(\tau)\in\mathbb{R}$ when the benchmark provides graded feedback. The learning objective is

$$J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)].$$

Following GRPO-style training, for each task prompt $q$ we sample a group of $N$ trajectories from the current policy:

$$\mathcal{G}_q=\{\tau^{(1)},\tau^{(2)},\ldots,\tau^{(N)}\}.$$

### 3.2 On-Policy Skill Extraction

Outcome rewards reveal whether a trajectory succeeds, but not why it succeeds or fails\. OPID therefore represents post\-hoc trajectory knowledge as hierarchical skills extracted from completed on\-policy rollouts\. The hierarchy contains two complementary levels\.

##### Episode\-level skills\.

An episode-level skill $s^{\mathrm{ep}}_{\tau}$ summarizes the global behavioral pattern of a complete trajectory $\tau$. For a successful trajectory, it captures a reusable workflow that explains how the task was solved. For a failed trajectory, it captures a failure-avoidance rule that describes what should be avoided in similar future situations. Episode-level skills are broad and stable, making them suitable as default guidance for most states.

##### Step\-level skills\.

A step-level skill $s^{\mathrm{step}}_{\tau,t}$ captures local decision knowledge at timestep $t$. It is intended for pivotal states where the final outcome depends strongly on a specific choice, such as avoiding a repeated invalid action, selecting the next object to inspect, correcting a mistaken subgoal, or deciding when to stop exploration. Step-level skills are more precise than episode-level skills, but they are also sparse and state-dependent.

Given a completed trajectory $\tau$, OPID reconstructs an ordered trajectory record containing the task prompt, observations, model responses, environment feedback, step indices, and terminal outcome. An LLM-based analyzer $\mathcal{A}$ maps this record to structured natural-language skills:

$$\mathcal{A}(\tau)=\left(s^{\mathrm{ep}}_{\tau},\{s^{\mathrm{step}}_{\tau,t}\}_{t\in\mathcal{C}_{\tau}}\right),$$

where $\mathcal{C}_{\tau}$ is the sparse set of critical timesteps identified by the analyzer.

### 3.3 Critical-First Skill-Conditioned Self-Distillation

Applying the same skills to every step is suboptimal. Episode-level skills are robust but may be too coarse at decisive states, whereas step-level skills are precise but sparse. OPID therefore introduces critical-first skill routing before performing skill-conditioned self-distillation. For trajectory $\tau$ and timestep $t$, the routed skill is

$$s_{\tau,t}=\begin{cases}s^{\mathrm{step}}_{\tau,t},&\text{if }t\in\mathcal{C}_{\tau},\\ s^{\mathrm{ep}}_{\tau},&\text{otherwise}.\end{cases}$$

Equivalently, define routing masks

$$q^{\mathrm{step}}_{\tau,t}=\mathbb{I}[t\in\mathcal{C}_{\tau}],\qquad q^{\mathrm{ep}}_{\tau,t}=\mathbb{I}[t\notin\mathcal{C}_{\tau}].$$

The critical-first rule enforces

$$q^{\mathrm{step}}_{\tau,t}=1\Rightarrow q^{\mathrm{ep}}_{\tau,t}=0,$$

so the two skill levels are not blindly combined. Each step receives the most appropriate granularity.
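The routing rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the container `TrajectorySkills`, the function `route_skill`, and the example skill strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TrajectorySkills:
    """Hypothetical container for the analyzer output of one trajectory."""
    episode_skill: str  # plays the role of s^ep_tau
    # Maps each critical timestep t in C_tau to its step-level skill s^step_{tau,t}.
    step_skills: Dict[int, str] = field(default_factory=dict)


def route_skill(skills: TrajectorySkills, t: int) -> str:
    """Critical-first routing: step-level skill at critical timesteps, else episode-level."""
    if t in skills.step_skills:
        return skills.step_skills[t]
    return skills.episode_skill


# Invented example: a 5-step trajectory where only t = 3 was flagged as critical.
skills = TrajectorySkills(
    episode_skill="Locate the target object before visiting the appliance.",
    step_skills={3: "Do not repeat an action that the environment just rejected."},
)
routed = [route_skill(skills, t) for t in range(5)]
```

Because the lookup returns exactly one skill per timestep, the two routing masks are mutually exclusive by construction.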

After routing, OPID converts the selected skill into token-level self-distillation supervision. Let $H(\cdot,\cdot)$ denote a deterministic skill-injection function that appends or prepends the routed skill to the interaction history while preserving the original state information. The skill-augmented history is

$$\tilde{h}_{\tau,t}=H(h_{\tau,t},s_{\tau,t}).$$

The original response $y_{\tau,t}$ is not regenerated. Instead, the old policy $\pi_{\theta_{\mathrm{old}}}$ scores the same sampled response under both the original and skill-augmented histories. For token $\ell$ in response $y_{\tau,t}$, define

$$\ell^{\mathrm{old}}_{\tau,t,\ell}=\log\pi_{\theta_{\mathrm{old}}}\left(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell}\right),$$

and

$$\ell^{\mathrm{skill}}_{\tau,t,\ell}=\log\pi_{\theta_{\mathrm{old}}}\left(y_{\tau,t,\ell}\mid\tilde{h}_{\tau,t},y_{\tau,t,<\ell}\right).$$

The skill-based self-teacher advantage is

$$A^{\mathrm{skill}}_{\tau,t,\ell}=\left(\ell^{\mathrm{skill}}_{\tau,t,\ell}-\ell^{\mathrm{old}}_{\tau,t,\ell}\right)m_{\tau,t,\ell},$$

where $m_{\tau,t,\ell}\in\{0,1\}$ is the valid response-token mask.

If $A^{\mathrm{skill}}_{\tau,t,\ell}>0$, the selected skill makes the token more likely under the old policy, suggesting that the token is consistent with the skill. If $A^{\mathrm{skill}}_{\tau,t,\ell}<0$, the skill-conditioned context assigns lower probability to the token, suggesting that the token is less aligned with the routed hindsight skill. This procedure yields dense token-level guidance without requiring an external expert action.
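To make the sign convention concrete, the following sketch computes the skill advantage for one response from two lists of per-token log-probabilities. It assumes the two scoring passes of the old policy have already been run; the function name `skill_advantage` and all numbers are made up for illustration and do not come from the paper.

```python
from typing import List


def skill_advantage(
    logp_skill: List[float],  # log-probs under the skill-augmented history
    logp_old: List[float],    # log-probs under the original history
    mask: List[int],          # 1 for valid response tokens, 0 otherwise
) -> List[float]:
    """Per-token skill advantage: (logp_skill - logp_old) * mask."""
    if not (len(logp_skill) == len(logp_old) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [(ls - lo) * m for ls, lo, m in zip(logp_skill, logp_old, mask)]


# Invented log-probabilities for a 4-token response; the last token is masked out.
logp_old = [-2.0, -1.0, -0.5, -3.0]
logp_skill = [-1.5, -1.0, -1.0, -2.0]
mask = [1, 1, 1, 0]

adv_skill = skill_advantage(logp_skill, logp_old, mask)
# adv_skill == [0.5, 0.0, -0.5, 0.0]
```

In this toy case the skill raises the probability of the first token (positive advantage), leaves the second unchanged, and lowers the third (negative advantage); the masked token contributes nothing.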

### 3.4 Policy Optimization with Skill Advantage

For each rollout group $\mathcal{G}_q$, let $\mathbf{r}_q=\{R(\tau')\mid\tau'\in\mathcal{G}_q\}$ denote the set of outcome rewards of all trajectories sampled for the same prompt. Following GRPO, the group mean is defined as

$$\mu_q=\operatorname{mean}(\mathbf{r}_q)=\frac{1}{|\mathcal{G}_q|}\sum_{\tau'\in\mathcal{G}_q}R(\tau').$$

The group standard deviation is defined as the square root of the group reward variance:

$$\sigma_q=\operatorname{std}(\mathbf{r}_q)=\sqrt{\frac{1}{|\mathcal{G}_q|}\sum_{\tau'\in\mathcal{G}_q}\left(R(\tau')-\mu_q\right)^{2}}.$$

The GRPO-style episode-relative advantage is then computed by normalizing the trajectory outcome reward within its prompt group:

$$A^{\mathrm{ep}}_{\tau}=\frac{R(\tau)-\mu_q}{\sigma_q},\qquad\tau\in\mathcal{G}_q.$$

This scalar is broadcast to all valid response tokens:

$$A^{\mathrm{ep}}_{\tau,t,\ell}=A^{\mathrm{ep}}_{\tau}m_{\tau,t,\ell}.$$
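The group normalization can be sketched as follows. The small constant `eps` added to the denominator is an assumption of this sketch (a common safeguard when every trajectory in a group receives the same reward and the standard deviation is zero); it does not appear in the formula above, and the function name is hypothetical.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within one prompt group (population std)."""
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    # eps is an assumed stabilizer, not part of the stated formula.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented group of N = 4 binary outcomes for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
adv_ep = group_relative_advantages(rewards)

# A group with identical rewards carries no relative signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Here the mean and standard deviation are both 0.5, so successful trajectories get an advantage close to +1 and failed ones close to -1, while the uniform group yields all zeros.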
The final OPID advantage combines group-relative outcome feedback with token-level skill supervision:

$$A^{\text{OPID}}_{\tau,t,\ell}=A^{\mathrm{ep}}_{\tau,t,\ell}+\lambda_{\mathrm{skill}}A^{\mathrm{skill}}_{\tau,t,\ell}.$$

This formulation keeps outcome reward as the primary RL signal while adding token-level shaping.
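A short numerical sketch of this combination follows. The value `lam_skill = 0.5` is purely illustrative (this section does not state the weight used in the experiments), and the function name and inputs are hypothetical.

```python
from typing import List


def opid_advantage(
    a_ep: float,            # scalar episode-relative advantage of the trajectory
    a_skill: List[float],   # per-token skill advantage (already masked)
    mask: List[int],        # 1 for valid response tokens, 0 otherwise
    lam_skill: float,       # weight of the skill term (illustrative value below)
) -> List[float]:
    """Broadcast the episode advantage over valid tokens and add the weighted skill term."""
    return [a_ep * m + lam_skill * a for a, m in zip(a_skill, mask)]


# Invented inputs: a successful trajectory (a_ep = 1.0) with a 4-token response.
combined = opid_advantage(
    a_ep=1.0,
    a_skill=[0.5, 0.0, -0.5, 0.0],
    mask=[1, 1, 1, 0],
    lam_skill=0.5,
)
# combined == [1.25, 1.0, 0.75, 0.0]
```

All valid tokens keep the sign set by the outcome advantage in this example, while the skill term redistributes credit among them.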

We optimize the standard clipped policy objective:

$$\mathcal{L}_{\mathrm{policy}}(\theta)=-\mathbb{E}_{\tau,t,\ell}\left[\min\left(\rho_{\tau,t,\ell}(\theta)A^{\text{OPID}}_{\tau,t,\ell},\operatorname{clip}\left(\rho_{\tau,t,\ell}(\theta),1-\epsilon,1+\epsilon\right)A^{\text{OPID}}_{\tau,t,\ell}\right)\right]+\beta\mathcal{L}_{\mathrm{KL}}(\theta),$$

where $\rho_{\tau,t,\ell}(\theta)$ denotes the token-level importance ratio, defined as

$$\rho_{\tau,t,\ell}(\theta)=\exp\left(\log\pi_{\theta}(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell})-\log\pi_{\theta_{\mathrm{old}}}(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell})\right).$$

The operator $\operatorname{clip}(x,1-\epsilon,1+\epsilon)$ truncates $x$ to the interval $[1-\epsilon,1+\epsilon]$, and $\epsilon$ is the clipping hyperparameter that controls the maximum allowed deviation from the old policy.
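The clipped surrogate can be illustrated with plain Python on a handful of tokens. This sketch makes several assumptions that are not taken from the paper: the KL term is omitted, the expectation is approximated by a mean over valid tokens, and `eps = 0.2` is only a commonly used default. The function name is hypothetical.

```python
import math
from typing import List


def clipped_policy_loss(
    logp_new: List[float],    # log-probs under the current policy
    logp_old: List[float],    # log-probs under the old policy (original history)
    advantages: List[float],  # per-token combined advantages
    mask: List[int],          # 1 for valid response tokens, 0 otherwise
    eps: float = 0.2,         # assumed clipping range
) -> float:
    """Negative mean clipped surrogate over valid tokens (KL term omitted)."""
    total, count = 0.0, 0
    for ln, lo, a, m in zip(logp_new, logp_old, advantages, mask):
        if not m:
            continue
        ratio = math.exp(ln - lo)  # token-level importance ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)
        count += 1
    return -total / count if count else 0.0


# With identical policies the ratio is 1 and the loss is minus the mean advantage.
loss_same = clipped_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5], [1, 1])

# A large ratio with a positive advantage is capped at 1 + eps.
loss_capped = clipped_policy_loss([0.0], [-1.0], [1.0], [1])
```

The second call shows the effect of clipping: the ratio is about 2.72, but the surrogate uses 1.2, which limits how far a single update can push the policy on that token.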

##### Training\-inference boundary\.

The analyzer, routed skills, and skill-conditioned scoring pass are used only to construct the training advantage. At inference time, the learned policy acts from the ordinary interaction history $h_t$ alone, with no analyzer call, skill retrieval, or privileged context.

Table 1: Performance comparison on the representative long-horizon benchmarks (ALFWorld, Search-based QA, and WebShop). We report the success rate (%) on ALFWorld, accuracy on Search-based QA, and task-completion score/success rate on WebShop. An asterisk (\*) denotes validation with skills. The best and second-best results are highlighted.

## 4 Experiment

### 4.1 Experimental Setting

##### Benchmarks\.

We evaluate OPID on three representative agentic benchmarks that require multi-step interaction or search-based reasoning. First, we use ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2606.26790#bib.bib19)), an embodied household benchmark where an agent must complete language-specified goals through a sequence of textual actions. We report performance on six task types: Pick, Look, Clean, Heat, Cool, and Pick2. Second, we evaluate on WebShop (Yao et al., [2022](https://arxiv.org/html/2606.26790#bib.bib20)), where an agent interacts with an e-commerce website to find and purchase products satisfying natural-language user requirements. Following the standard evaluation protocol, we report results on 128 test tasks. Third, we consider Search-based QA (Jin et al., [2025](https://arxiv.org/html/2606.26790#bib.bib21)), where the agent answers questions by interacting with a search environment. This suite covers seven datasets: Natural Questions (Kwiatkowski et al., [2019](https://arxiv.org/html/2606.26790#bib.bib32)), TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.26790#bib.bib33)), PopQA (Mallen et al., [2023](https://arxiv.org/html/2606.26790#bib.bib34)), HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.26790#bib.bib35)), 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2606.26790#bib.bib36)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.26790#bib.bib37)), and Bamboogle (Press et al., [2023](https://arxiv.org/html/2606.26790#bib.bib38)).

##### Baselines\.

We compare OPID against both prompting-based and training-based baselines. Vanilla denotes the original prompting baseline. Skill-Prompt augments the model with skill descriptions at inference or validation time. GRPO is the outcome-only on-policy RL baseline, where the policy is optimized using group-relative trajectory-level rewards (Shao et al., [2024](https://arxiv.org/html/2606.26790#bib.bib1)). Skill-GRPO combines skill conditioning with GRPO-style outcome optimization. OPSD (Zhao et al., [2026](https://arxiv.org/html/2606.26790#bib.bib3)), GRPO+OPSD, Skill-SD (Wang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib16)), RLSD (Yang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib17)), and SDAR (Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)) are self-distillation or skill-distillation baselines that introduce auxiliary token-level or skill-conditioned supervision during training. Rows marked with \* indicate validation with skills, following the setting described in the corresponding baseline.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x3.png)

\(a\) Episode success rate

![Refer to caption](https://arxiv.org/html/2606.26790v1/x4.png)

\(b\) Episode length

Figure 3: Training dynamics of OPID and GRPO. We report Qwen2.5-3B-Instruct training on ALFWorld. Translucent curves denote raw measurements and solid curves denote smoothed trends.

##### Evaluation Metrics\.

For ALFWorld, we report task success rate in percentage\. For WebShop, we report both the normalized task score and task success rate, following the benchmark protocol\. For Search\-based QA, we report answer accuracy in percentage on each QA subset and the average accuracy across subsets\.

##### Implementation Details\.

We conduct experiments using Qwen2.5-3B/7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.26790#bib.bib30)) and Qwen3-1.7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.26790#bib.bib31)). The training batch size is set to 16 for ALFWorld and WebShop, and 128 for Search-based QA. All models are trained for 150 steps across all environments. Full details are provided in Appendix [B](https://arxiv.org/html/2606.26790#A2).

### 4\.2 Main Results

Table [1](https://arxiv.org/html/2606.26790#S3.T1) summarizes performance across model scales and agentic domains, revealing three key findings:

##### OPID consistently strengthens outcome\-only RL\.

OPID improves over GRPO in most model–domain combinations\. On Qwen2\.5\-3B, the gains are \+9\.3 points on ALFWorld \(84\.3 vs\. 75\.0\), \+8\.6 on Search\-based QA \(45\.0 vs\. 36\.4\), and \+10\.9 on WebShop \(74\.2 vs\. 63\.3\)\. The corresponding improvements on Qwen2\.5\-7B are \+8\.8, \+7\.2, and \+7\.1 points\. The benefit is particularly pronounced for the smaller Qwen3\-1\.7B backbone, where OPID improves ALFWorld by \+12\.8 points and WebShop by \+26\.5 points\. The only exception is Search\-based QA on Qwen3\-1\.7B, where OPID remains close to GRPO\. Overall, OPID improves over outcome\-only reinforcement learning in all but one setting, especially on long\-horizon embodied and web\-shopping tasks\.

##### OPID remains competitive with strong hybrid methods\.

Beyond improving over outcome\-only RL, OPID also matches or surpasses strong hybrid and self\-distillation baselines in several aggregate settings\. On ALFWorld, OPID achieves the best average on Qwen2\.5\-7B and Qwen3\-1\.7B, outperforming the strongest baseline by \+1\.7 points \(90\.0 vs\. 88\.3\) and \+5\.0 points \(58\.9 vs\. 53\.9\) respectively\. On Search\-based QA, OPID attains the best average on both Qwen2\.5 backbones, improving over the strongest baseline by \+0\.4 points on Qwen2\.5\-3B \(45\.0 vs\. 44\.6\) and \+0\.2 points on Qwen2\.5\-7B \(49\.2 vs\. 49\.0\)\. On WebShop, OPID achieves the best success rate on Qwen2\.5\-3B and Qwen3\-1\.7B, exceeding the strongest competing method by \+6\.2 points on Qwen3\-1\.7B \(64\.8 vs\. 58\.6\), while remaining competitive on Qwen2\.5\-7B\. These results show that trajectory\-derived, distribution\-matched skills can complement outcome supervision and compete with methods that rely on hybrid training signals or external skill contexts\.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x5.png)

Figure 4: Sample efficiency analysis\. OPID consistently outperforms GRPO under reduced training data and approaches full\-data GRPO performance using about 60% of the data\.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x6.png)

Figure 5: Cross\-domain generalization on ALFWorld Unseen\. OPID improves the average success rate over GRPO and shows particularly large gains on Look and Heat\.

##### OPID internalizes skills instead of depending on them at inference\.

The results further show that OPID gains from internalizing hindsight skills into the policy, rather than relying on skill prompts at inference time\. Training directly with retrieved skills introduces a clear train–test context mismatch: when validation\-time skills are removed, Skill\-GRPO underperforms ordinary GRPO on ALFWorld at all model scales, dropping by 14\.8 points on Qwen2\.5\-3B \(60\.2 vs\. 75\.0\), 11\.7 points on Qwen2\.5\-7B \(69\.5 vs\. 81\.2\), and 25\.0 points on Qwen3\-1\.7B \(21\.1 vs\. 46\.1\)\. In contrast, OPID is also evaluated without any skill input, yet exceeds Skill\-GRPO by \+24\.1, \+20\.5, and \+37\.8 points on the same three backbones\. On Search\-based QA, OPID also improves over both GRPO and Skill\-GRPO for the two Qwen2\.5 models, with gains over GRPO of \+8\.6 and \+7\.2 points, while remaining comparable on Qwen3\-1\.7B\. Moreover, OPID outperforms Skill\-GRPO\* on ALFWorld and Search\-based QA for both Qwen2\.5 backbones, even though Skill\-GRPO\* retains privileged skill context during validation\. These results indicate that OPID transfers trajectory\-derived hindsight knowledge into the model parameters, enabling the policy to benefit from skills without depending on external skill prompts at inference\.

### 4\.3 Training Dynamics

Figure [3](https://arxiv.org/html/2606.26790#S4.F3) illustrates the training progression on ALFWorld\. Both methods improve during early optimization, yet OPID diverges from GRPO in the middle stage and maintains superior performance throughout the remainder of training\. This divergence pattern indicates that hindsight skill supervision accelerates policy refinement beyond what outcome rewards alone can achieve\. The efficiency gains are equally pronounced\. OPID reduces average episode length to 15\-16 steps while GRPO plateaus at 17\-18 steps\. The concurrent rise in success and fall in trajectory length reveals a key behavioral shift: OPID agents learn to reach goals through more direct action sequences rather than exploratory detours\.

These dynamics align with the intended function of hierarchical supervision\. Episode\-level skills establish coherent task workflows that reduce backtracking and repetition\. Step\-level skills provide precise guidance at critical decision points, preventing the invalid actions and local navigation errors that otherwise extend trajectories\. Together, these mechanisms enable OPID to internalize both global task structure and local decision efficiency\.

### 4\.4 Sample Efficiency

Figure [4](https://arxiv.org/html/2606.26790#S4.F4) compares OPID and GRPO under different fractions of ALFWorld training data\. OPID consistently improves over GRPO across all data scales, with absolute gains ranging from \+9\.3 to \+20\.3 points\. The advantage is especially clear in the low\- and mid\-data regimes, where each trajectory carries more training value\. With 60% of the data, OPID reaches 71\.9, close to GRPO trained with the full dataset \(75\.0\); with 80% of the data, it already surpasses full\-data GRPO \(78\.9 vs\. 75\.0\)\. These results indicate that OPID\-style skill supervision improves the data efficiency of outcome\-based RL\. By converting completed trajectories into dense token\-level training signals, OPID extracts additional supervision from the same environment interactions rather than relying only on terminal rewards\. This makes the optimization less dependent on large numbers of rollouts and allows the policy to acquire effective behaviors with fewer samples\.

### 4\.5 Cross\-Domain Generalization

Figure [5](https://arxiv.org/html/2606.26790#S4.F5) evaluates cross\-domain transfer to the ALFWorld unseen split\. OPID achieves an average success rate of 78\.6, outperforming GRPO by \+7\.7 points\. Its gains over GRPO are concentrated on tasks like Look \(\+26\.7\) and Heat \(\+18\.5\), while maintaining competitive performance on the remaining task types\. These results suggest that OPID is not merely memorizing the observed training trajectories\. Instead, the extracted skills appear to capture reusable behavioral structure, including high\-level task workflows and local decision rules that remain useful under unseen environment configurations\. Since the skills are distilled into the policy rather than retrieved at inference time, the improvement also indicates that OPID internalizes transferable decision knowledge into the model parameters\.

Table 2: Ablation on Hierarchical Skills\. We report the success rate \(%\) on ALFWorld and Score/Succ\. \(%\) on WebShop with the Qwen2\.5\-3B\-Instruct backbone\.

Table 3: Ablation of Critical\-First Skill Routing\. With the Qwen2\.5\-3B\-Instruct backbone, we compare OPID with a variant that removes the critical\-first routing strategy\.

### 4\.6 Ablation Studies and Analysis

We isolate the contributions of hierarchical skill granularity and critical\-first routing using Qwen2\.5\-3B\-Instruct\.

##### Impact of Hierarchical Skills\.

As shown in Table [2](https://arxiv.org/html/2606.26790#S4.T2), the complete hierarchy obtains the best aggregate performance on both domains\. Removing episode\-level skills decreases the ALFWorld average from 84\.3 to 74\.1 and the WebShop success rate from 74\.2 to 67\.2, confirming that global workflows and failure\-avoidance rules provide an important default signal\. Removing step\-level skills decreases the ALFWorld average from 84\.3 to 79\.1 and the WebShop success rate from 74\.2 to 65\.6\. These results demonstrate the complementarity of the two skill levels\.

##### Impact of Critical\-First Skill Routing\.

Table [3](https://arxiv.org/html/2606.26790#S4.T3) compares OPID with a non\-routed variant that applies the episode\-level skill to every step and additionally incorporates the corresponding step\-level skill at critical timesteps, thereby superimposing the two forms of guidance\. Critical\-first routing improves the ALFWorld average by \+6\.8 points \(84\.3 vs\. 77\.5\)\. These results show that selectively routing the most appropriate skill granularity is more effective than directly combining global and local guidance, demonstrating the importance of critical\-first routing\.
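The difference between the two variants compared here can be sketched in a few lines. This is a minimal illustration, not the released implementation: the `TrajectorySkills` container, both function names, and the example skill strings are hypothetical, and it assumes that critical timesteps have already been identified and stored as keys of `step_skills`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TrajectorySkills:
    """Hypothetical container for the skills extracted from one trajectory."""
    episode_skill: str                                          # global workflow or failure-avoidance rule
    step_skills: Dict[int, str] = field(default_factory=dict)  # critical timestep -> local skill


def route_critical_first(skills: TrajectorySkills, t: int) -> str:
    """Use the step-level skill at a critical timestep, else fall back to the episode-level skill."""
    return skills.step_skills.get(t, skills.episode_skill)


def route_superimposed(skills: TrajectorySkills, t: int) -> str:
    """Non-routed ablation variant: episode-level skill at every step, plus the step-level skill when critical."""
    parts = [skills.episode_skill]
    if t in skills.step_skills:
        parts.append(skills.step_skills[t])
    return "\n".join(parts)


# Illustrative skill texts (not taken from the paper).
skills = TrajectorySkills(
    episode_skill="Locate the object, clean it, then place it at the target.",
    step_skills={3: "Only take objects that appear in the current observation."},
)
```

Under critical\-first routing exactly one skill is injected per step, whereas the superimposed variant injects two at critical timesteps; the ablation attributes the \+6\.8\-point gap on ALFWorld to this difference.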

![Refer to caption](https://arxiv.org/html/2606.26790v1/x7.png)

Figure 6: Qualitative comparison on ALFWorld\. For the task “clean some spatula and put it in diningtable,” the GRPO\-trained agent hallucinates a nonexistent target object, substitutes a spoon for the spatula, and fails to complete the final placement within the step limit\. In contrast, OPID follows a coherent locate\-clean\-place workflow, grounding each action in the current observation and completing the task in six steps\.

##### Qualitative Analysis\.

Figure [6](https://arxiv.org/html/2606.26790#S4.F6) illustrates an ALFWorld clean\-and\-place task\. The GRPO\-trained agent exhibits a “hallucinated target” error by attempting to take a nonexistent spatula from the countertop at Step 4\. It subsequently substitutes a spoon for the target object and reaches the 30\-step limit before placing the cleaned spatula back on the dining table\. In contrast, OPID follows a coherent locate–clean–place workflow and completes the task in six steps\. This case suggests that distilling hierarchical hindsight skills from on\-policy trajectories helps the agent learn both local object\-grounding decisions and episode\-level task workflows, thereby reducing hallucinated actions and preserving progress toward the final goal\.

## 5 Conclusion

We presented OPID, an on\-policy skill distillation framework that turns completed agent trajectories into hierarchical hindsight supervision\. By extracting episode\-level and step\-level skills from the current policy’s own rollouts, OPID provides dense, distribution\-matched token\-level guidance while preserving outcome\-based RL as the primary objective\. Experiments across embodied, web, and search\-based agentic benchmarks show that OPID improves agent learning without relying on external skill libraries, retrieval, or privileged context at inference time\. More broadly, our results suggest that agent trajectories are not only samples for reward optimization, but also reusable records of decision knowledge that can be distilled back into the policy\.

## References

- R\. Agarwal, N\. Vieillard, Y\. Zhou, P\. Stanczyk, S\. Ramos, M\. Geist, and O\. Bachem \(2024\) On\-policy distillation of language models: learning from self\-generated mistakes\. In International Conference on Learning Representations\.
- G\. Chen, Z\. Qiao, X\. Chen, D\. Yu, H\. Xu, X\. Zhao, R\. Song, W\. Yin, H\. Yin, L\. Zhang, K\. Li, M\. Liao, Y\. Jiang, P\. Xie, F\. Huang, and J\. Zhou \(2026\) IterResearch: rethinking long\-horizon agents with interaction scaling\. In The Fourteenth International Conference on Learning Representations\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2Web: towards a generalist agent for the web\. arXiv preprint arXiv:2306\.06070\.
- K\. Fang, X\. Che, H\. Ouyang, S\. Zhang, X\. Wang, Q\. Liu, L\. Liu, C\. Zhang, W\. Cai, W\. Dai, et al\. \(2026\) RobotEQ: transitioning from passive intelligence to active intelligence in embodied ai\. arXiv preprint arXiv:2605\.06234\.
- Y\. Fu, H\. Huang, K\. Jiang, J\. Liu, Z\. Jiang, Y\. Zhu, and D\. Zhao \(2026\) Revisiting on\-policy distillation: empirical failure modes and simple fixes\. arXiv preprint arXiv:2603\.25562\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2024a\) MiniLLM: knowledge distillation of large language models\. In International Conference on Learning Representations\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2024b\) MiniLLM: knowledge distillation of large language models\. In The Twelfth International Conference on Learning Representations\.
- Y\. He, S\. Kaur, A\. Bhaskar, Y\. Yang, J\. Liu, N\. Ri, L\. Fowl, A\. Panigrahi, D\. Chen, and S\. Arora \(2026\) Self\-distillation zero: self\-revision turns binary rewards into dense supervision\. arXiv preprint arXiv:2604\.12002\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\) Distilling the knowledge in a neural network\. arXiv preprint arXiv:1503\.02531\.
- X\. Ho, A\. Duong Nguyen, S\. Sugawara, and A\. Aizawa \(2020\) Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\. In Proceedings of the 28th International Conference on Computational Linguistics, pp\. 6609–6625\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world github issues? In The Twelfth International Conference on Learning Representations\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2023\) SWE\-bench: can language models resolve real\-world github issues? arXiv preprint arXiv:2310\.06770\.
- B\. Jin, H\. Zeng, Z\. Yue, W\. Dong, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv preprint arXiv:2503\.09516\.
- M\. Joshi, E\. Choi, D\. Weld, and L\. Zettlemoyer \(2017\) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp\. 1601–1611\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. C\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\) VisualWebArena: evaluating multimodal agents on realistic visual web tasks\. arXiv preprint arXiv:2401\.13649\.
- S\. Kullback and R\. A\. Leibler \(1951\) On information and sufficiency\. The Annals of Mathematical Statistics 22\(1\), pp\. 79–86\. DOI: [10\.1214/aoms/1177729694](https://dx.doi.org/10.1214/aoms/1177729694)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\) Natural questions: a benchmark for question answering research\. Transactions of the Association for Computational Linguistics 7, pp\. 452–466\.
- Y\. Li, Y\. Zuo, B\. He, J\. Zhang, C\. Xiao, C\. Qian, T\. Yu, H\. Gao, W\. Yang, Z\. Liu, and N\. Ding \(2026\) Rethinking on\-policy distillation of large language models: phenomenology, mechanism, and recipe\. arXiv preprint arXiv:2604\.13016\.
- J\. Lin \(1991\) Divergence measures based on the shannon entropy\. IEEE Transactions on Information Theory 37\(1\), pp\. 145–151\. DOI: [10\.1109/18\.61115](https://dx.doi.org/10.1109/18.61115)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, et al\. \(2023\) AgentBench: evaluating llms as agents\. arXiv preprint arXiv:2308\.03688\.
- K\. Lu and Thinking Machines Lab \(2025\) On\-policy distillation\. Thinking Machines Lab: Connectionism\. DOI: [10\.64434/tml\.20251026](https://dx.doi.org/10.64434/tml.20251026)\.
- Z\. Lu, Z\. Yao, Z\. Han, Z\. Wang, J\. Wu, Q\. Gu, X\. Cai, W\. Lu, J\. Xiao, Y\. Zhuang, and Y\. Shen \(2026a\) Self\-distilled agentic reinforcement learning\. arXiv preprint arXiv:2605\.15155\.
- Z\. Lu, Z\. Yao, J\. Wu, C\. Han, Q\. Gu, X\. Cai, W\. Lu, J\. Xiao, Y\. Zhuang, and Y\. Shen \(2026b\) SKILL0: in\-context agentic reinforcement learning for skill internalization\. arXiv preprint arXiv:2604\.02268\.
- J\. Luo, W\. Zhang, Y\. Yuan, Y\. Zhao, J\. Yang, Y\. Gu, B\. Wu, B\. Chen, Z\. Qiao, Q\. Long, et al\. \(2025\) Large language model agent: a survey on methodology, applications and challenges\. arXiv preprint arXiv:2503\.21460\.
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\) When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp\. 9802–9822\.
- M\. Oh, S\. Song, G\. Choi, Y\. Choi, and Y\. Jo \(2026\) KL for a KL: on\-policy distillation with control variate baseline\. arXiv preprint arXiv:2605\.07865\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\) Measuring and narrowing the compositionality gap in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 5687–5711\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, et al\. \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- Y\. Shen, T\. Liu, J\. Shen, J\. Wu, Q\. Kong, L\. Huan, and C\. Wang \(2026\) Double: breaking the acceleration limit via double retrieval speculative parallelism\. arXiv preprint arXiv:2601\.05524\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\) ALFWorld: aligning text and embodied environments for interactive learning\. arXiv preprint arXiv:2010\.03768\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\.
- H\. Wang, G\. Wang, H\. Xiao, Y\. Zhou, Y\. Pan, J\. Wang, K\. Xu, Y\. Wen, X\. Ruan, X\. Chen, and H\. Qi \(2026\) Skill\-sd: skill\-conditioned self\-distillation for multi\-turn llm agents\. arXiv preprint arXiv:2604\.10674\.
- J\. Wu, M\. Feng, S\. Zhang, F\. Che, Z\. Wen, C\. Liao, and J\. Tao \(2024\) Beyond examples: high\-level automated reasoning paradigm in in\-context learning via mcts\. arXiv preprint arXiv:2411\.18478\.
- J\. Wu, S\. Yang, C\. Yang, Y\. Shen, S\. Zhang, Z\. Wen, and J\. Tao \(2026a\) Spark: strategic policy\-aware exploration via dynamic branching for long\-horizon agentic learning\. arXiv preprint arXiv:2601\.20209\.
- J\. Wu, G\. Zhai, R\. Jin, Y\. Shen, Z\. Lu, F\. Zhang, H\. Luo, Z\. Lian, Z\. Wen, and J\. Tao \(2026b\) Maestro: reinforcement learning to orchestrate hierarchical model\-skill ensembles\. arXiv preprint arXiv:2605\.22177\.
- J\. Wu, G\. Zhai, R\. Jin, J\. Yuan, Y\. Shen, S\. Zhang, Z\. Wen, and J\. Tao \(2026c\) Atlas: orchestrating heterogeneous models and tools for multi\-domain complex reasoning\. arXiv preprint arXiv:2601\.03872\.
- F\. Xu, H\. Yan, Q\. Sun, J\. Wu, Z\. Huang, M\. Huang, J\. Gong, Z\. Ding, K\. Cheng, Y\. Wang, et al\. \(2026\) OdysseyArena: benchmarking large language models for long\-horizon, active and inductive interactions\. arXiv preprint arXiv:2602\.05843\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, et al\. \(2024\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\.
- C\. Yang, C\. Qin, Q\. Si, M\. Chen, N\. Gu, D\. Yao, Z\. Lin, W\. Wang, J\. Wang, and N\. Duan \(2026\) Self\-distilled rlvr\. arXiv preprint arXiv:2604\.03128\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 2369–2380\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\) WebShop: towards scalable real\-world web interaction with grounded language agents\. In Advances in Neural Information Processing Systems\.
- T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\) On\-policy context distillation for language models\. arXiv preprint arXiv:2602\.12275\.
- Z\.ai \(2026\) GLM\-5\.2: Built for Long\-Horizon Tasks\. [https://z\.ai/blog/glm\-5\.2](https://z.ai/blog/glm-5.2)\. Accessed: 2026\-06\-22\.
- G\. Zhang, H\. Geng, X\. Yu, Z\. Yin, Z\. Zhang, Z\. Tan, H\. Zhou, Z\. Li, X\. Xue, Y\. Li, et al\. \(2025\) The landscape of agentic reinforcement learning for llms: a survey\. arXiv preprint arXiv:2509\.02547\.
- S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\) Self\-distilled reasoner: on\-policy self\-distillation for large language models\. arXiv preprint arXiv:2601\.18734\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, et al\. \(2023\) WebArena: a realistic web environment for building autonomous agents\. arXiv preprint arXiv:2307\.13854\.

## Appendix A Theoretical Analysis

This section provides three results that correspond to the main design choices of OPID\. We first place the proposed teacher advantage among representative on\-policy distillation objectives\. We then show that it implements a sampled\-token reverse\-KL update, characterize the benefit of collecting distillation contexts on policy, and justify critical\-first routing under a natural specialization assumption\.

### A\.1 Notation and Representative On\-Policy Distillation Objectives

#### A\.1\.1 Notation

Let $i=(\tau,t,\ell)$ index a valid token position in a response\. We denote the corresponding standard autoregressive context by $c_i=(h_{\tau,t},y_{\tau,t,<\ell})$, and its skill\-augmented counterpart by $\widetilde{c}_i$\. At each token position, define

$$
b_i(v) \triangleq \pi_{\theta_{\mathrm{old}}}(v\mid c_i),\qquad
q_i(v) \triangleq \pi_{\theta_{\mathrm{old}}}(v\mid \widetilde{c}_i),\qquad
p_{\theta,i}(v) \triangleq \pi_{\theta}(v\mid c_i).
$$

Here, $b_i$ is the behavior distribution used to generate the response, $q_i$ is a detached skill\-conditioned teacher distribution, and $p_{\theta,i}$ is the trainable policy evaluated under the standard context available at inference time\. The observed token $a_i \triangleq y_{\tau,t,\ell}$ is sampled from $b_i$\.

We further define the token\-level log\-likelihood gap and the policy importance ratio as

$$
\Delta_i(v) \triangleq \log q_i(v) - \log b_i(v),\qquad
\rho_{\theta,i}(v) \triangleq \frac{p_{\theta,i}(v)}{b_i(v)}. \tag{1}
$$

The quantity $\Delta_i(v)$ measures the change in token log\-probability induced by the skill\-augmented context\. In particular, $\Delta_i(v)>0$ indicates that the skill\-conditioned teacher assigns greater probability to token $v$ than the behavior policy does\. The OPID skill advantage associated with the observed token is therefore

$$
A_i^{\mathrm{skill}} = \Delta_i(a_i).
$$

Unless otherwise stated, all expectations below are taken over valid response tokens; the response mask is consequently omitted for notational simplicity\.
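As a numerical illustration of the skill advantage defined above, the sketch below computes the per\-token log\-probability gap from two arrays of log\-probabilities of the same sampled tokens, scored once under the original context and once under the skill\-augmented context. The function name, array layout, and toy distributions are assumptions for illustration. The last lines check the standard identity that the expectation of the gap under the behavior distribution equals the negative KL divergence from behavior to teacher.

```python
import numpy as np


def skill_advantage(logp_skill, logp_base, response_mask):
    """Token-level skill advantage: log q_i(a_i) - log b_i(a_i) on valid response tokens.

    logp_skill:    log-probabilities of the sampled tokens under the skill-augmented context
    logp_base:     log-probabilities of the same tokens under the original context
    response_mask: 1 for valid response tokens, 0 elsewhere (hypothetical layout)
    """
    logp_skill = np.asarray(logp_skill, dtype=np.float64)
    logp_base = np.asarray(logp_base, dtype=np.float64)
    mask = np.asarray(response_mask, dtype=np.float64)
    return (logp_skill - logp_base) * mask


# Toy next-token distributions over a 3-token vocabulary at one position.
b = np.array([0.5, 0.3, 0.2])   # behavior distribution b_i
q = np.array([0.7, 0.2, 0.1])   # skill-conditioned teacher q_i
delta = np.log(q) - np.log(b)   # log-likelihood gap for every candidate token

# Expectation of the gap under b_i versus the negative KL(b_i || q_i).
expected_gap = float(np.sum(b * delta))
neg_kl = -float(np.sum(b * np.log(b / q)))
```

In this toy case the gap is positive for the first token, which the teacher prefers, and negative for the other two, so individual sampled tokens receive credit of either sign even though the expectation under the behavior distribution is non\-positive.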

#### A\.1\.2 Representative On\-Policy Distillation Objectives

On\-policy distillation \(OPD\) applies teacher supervision at autoregressive contexts generated by the student or a behavior policy, thereby reducing the context\-distribution mismatch between distillation training and free\-running inference \(Agarwal et al\., [2024](https://arxiv.org/html/2606.26790#bib.bib2)\)\. The context\-generation policy and the granularity of teacher supervision are orthogonal design choices\. At each on\-policy context, output\-space OPD objectives can be organized into three common supervision granularities: full\-vocabulary, Top\-$K$, and sampled\-token distillation \(Li et al\., [2026](https://arxiv.org/html/2606.26790#bib.bib48); Fu et al\., [2026](https://arxiv.org/html/2606.26790#bib.bib49)\)\. OPID belongs to the sampled\-token category\.

##### Full\-vocabulary distribution matching\.

Let $q_i$ and $p_{\theta,i}$ denote the teacher and student next\-token distributions, respectively, at autoregressive context $i$\. When the complete predictive distributions are available, OPD can minimize the forward KL, reverse KL, or a generalized Jensen–Shannon divergence \(Hinton et al\., [2015](https://arxiv.org/html/2606.26790#bib.bib39); Agarwal et al\., [2024](https://arxiv.org/html/2606.26790#bib.bib2); Gu et al\., [2024b](https://arxiv.org/html/2606.26790#bib.bib43)\):

$$
\begin{aligned}
\mathcal{L}_{\mathrm{FKL}}(\theta) &= \mathbb{E}_i\!\left[D_{\mathrm{KL}}(q_i\,\|\,p_{\theta,i})\right],\\
\mathcal{L}_{\mathrm{RKL}}(\theta) &= \mathbb{E}_i\!\left[D_{\mathrm{KL}}(p_{\theta,i}\,\|\,q_i)\right],\\
\mathcal{L}_{\mathrm{JSD}}^{(\alpha)}(\theta) &= \mathbb{E}_i\!\left[\alpha D_{\mathrm{KL}}(q_i\,\|\,m_i^{(\alpha)}) + (1-\alpha) D_{\mathrm{KL}}(p_{\theta,i}\,\|\,m_i^{(\alpha)})\right],\\
m_i^{(\alpha)} &= \alpha q_i + (1-\alpha) p_{\theta,i}.
\end{aligned}
$$

Forward KL gives the conventional soft\-target objective and emphasizes coverage of teacher\-supported probability mass\. Reverse KL instead penalizes student probability assigned to teacher\-disfavored regions and therefore typically exhibits more mode\-seeking behavior\. Generalized JSD compares both models against a mixture distribution, with $\alpha=\tfrac{1}{2}$ recovering the standard symmetric JSD \(Kullback and Leibler, [1951](https://arxiv.org/html/2606.26790#bib.bib45); Lin, [1991](https://arxiv.org/html/2606.26790#bib.bib40)\)\.
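The three full\-vocabulary divergences can be evaluated at a single token position as follows. This is a minimal sketch with made\-up teacher and student distributions; the helper names `kl` and `generalized_jsd` are assumptions, and a real training loop would average these quantities over positions and differentiate with respect to the student parameters.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p(v) = 0 contribute zero."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


def generalized_jsd(teacher, student, alpha=0.5):
    """alpha * KL(q || m) + (1 - alpha) * KL(p || m), with m = alpha * q + (1 - alpha) * p."""
    teacher = np.asarray(teacher, dtype=np.float64)
    student = np.asarray(student, dtype=np.float64)
    mixture = alpha * teacher + (1.0 - alpha) * student
    return alpha * kl(teacher, mixture) + (1.0 - alpha) * kl(student, mixture)


# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = np.array([0.7, 0.2, 0.1])   # q_i
student = np.array([0.4, 0.4, 0.2])   # p_{theta,i}

forward_kl = kl(teacher, student)     # coverage-oriented soft-target objective
reverse_kl = kl(student, teacher)     # mode-seeking objective
jsd_half = generalized_jsd(teacher, student, alpha=0.5)
```

With `alpha=0.5` the two arguments of `generalized_jsd` can be swapped without changing the value, which matches the symmetric JSD case mentioned above.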

##### Top-$K$ distribution matching.

Top-$K$ OPD retains distribution-level supervision over a restricted local support. Common choices include a student-selected support (Li et al., [2026](https://arxiv.org/html/2606.26790#bib.bib48); Ye et al., [2026](https://arxiv.org/html/2606.26790#bib.bib50)) and a teacher-selected support (Fu et al., [2026](https://arxiv.org/html/2606.26790#bib.bib49)):

$$
\begin{aligned}
S_{i,p}^{(K)} &\triangleq \operatorname{TopK}(p_{\theta,i}, K), \\
S_{i,q}^{(K)} &\triangleq \operatorname{TopK}(q_i, K), \\
S_i^{(K)} &\in \left\{S_{i,p}^{(K)}, S_{i,q}^{(K)}\right\}.
\end{aligned}
$$

For the selected support $S_i^{(K)}$, define the restricted and renormalized distributions

$$
\begin{aligned}
\bar{p}_{\theta,i}^{S_i^{(K)}}(v) &\triangleq \frac{p_{\theta,i}(v)\,\mathbf{1}\{v\in S_i^{(K)}\}}{\sum_{u\in S_i^{(K)}} p_{\theta,i}(u)}, \\
\bar{q}_i^{S_i^{(K)}}(v) &\triangleq \frac{q_i(v)\,\mathbf{1}\{v\in S_i^{(K)}\}}{\sum_{u\in S_i^{(K)}} q_i(u)}.
\end{aligned}
$$

A representative truncated reverse-KL objective is

$$
\begin{aligned}
\mathcal{L}_{\mathrm{TopK\text{-}RKL}}(\theta) &= \mathbb{E}_{i}\!\left[D_{\mathrm{KL}}\!\left(\bar{p}_{\theta,i}^{S_i^{(K)}} \,\middle\|\, \bar{q}_i^{S_i^{(K)}}\right)\right] \\
&= \mathbb{E}_{i}\!\left[\sum_{v\in S_i^{(K)}} \bar{p}_{\theta,i}^{S_i^{(K)}}(v) \log\frac{\bar{p}_{\theta,i}^{S_i^{(K)}}(v)}{\bar{q}_i^{S_i^{(K)}}(v)}\right].
\end{aligned}
$$

Top-$K$ matching occupies an intermediate point between one-token and full-vocabulary supervision. It preserves multi-token information at reduced computational or communication cost, but discards probability mass outside the selected support and is therefore a truncated, support-dependent approximation to the full reverse KL.
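The truncated objective can be sketched for a single context as follows. This is a minimal NumPy illustration under the assumption of strictly positive probabilities; the function name `topk_reverse_kl` and its `support` argument are hypothetical and chosen only for this example.

```python
import numpy as np

def topk_reverse_kl(p, q, k, support="student"):
    """Reverse KL between p and q after restricting both to a Top-K support
    and renormalizing. `support` picks whose Top-K set is used."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    ref = p if support == "student" else q
    idx = np.argsort(-ref)[:k]             # indices of the K largest entries of ref
    p_bar = p[idx] / p[idx].sum()          # restricted, renormalized student
    q_bar = q[idx] / q[idx].sum()          # restricted, renormalized teacher
    return float(np.sum(p_bar * (np.log(p_bar) - np.log(q_bar))))

# Toy distributions; values are illustrative.
p = np.array([0.4, 0.4, 0.2])
q = np.array([0.7, 0.2, 0.1])

full_rkl = float(np.sum(p * (np.log(p) - np.log(q))))
rkl_k3 = topk_reverse_kl(p, q, k=3)                       # K = |V| recovers the full reverse KL
rkl_k2 = topk_reverse_kl(p, q, k=2, support="teacher")    # teacher-selected support
rkl_k1 = topk_reverse_kl(p, q, k=1)                       # one-token support carries no signal
```

With $K=1$ both renormalized distributions are point masses on the same token, so the truncated divergence vanishes; this is one way to see why the result depends on the chosen support.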

##### Sampled\-token distillation\.

At a fixed on-policy context, define the teacher–student log-ratio cost

$$
\delta_i(v) \triangleq \log p_{\theta,i}(v) - \log q_i(v).
$$

Let $b_i$ denote the behavior-policy next-token distribution from which the observed token is sampled. The token-level reverse KL can then be written exactly as an expectation over student-sampled tokens, and equivalently as an importance-weighted expectation over behavior-sampled tokens:

$$
\begin{aligned}
D_{\mathrm{KL}}(p_{\theta,i} \,\|\, q_i) &= \mathbb{E}_{a_i\sim p_{\theta,i}}\left[\delta_i(a_i)\right] \\
&= \mathbb{E}_{a_i\sim b_i}\left[\rho_{\theta,i}(a_i)\,\delta_i(a_i)\right], \qquad \rho_{\theta,i}(a) \triangleq \frac{p_{\theta,i}(a)}{b_i(a)}.
\end{aligned}
$$

The second equality requires $p_{\theta,i} \ll b_i$ (support coverage condition). Consequently, $\rho_{\theta,i}(a_i)\,\delta_i(a_i)$ is an importance-weighted single-sample estimator of the per-context reverse KL. Its score-function gradient is

$$
\nabla_{\theta} D_{\mathrm{KL}}(p_{\theta,i} \,\|\, q_i) = \mathbb{E}_{a_i\sim b_i}\left[\rho_{\theta,i}(a_i)\operatorname{sg}\!\left[\delta_i(a_i)\right]\nabla_{\theta}\log p_{\theta,i}(a_i)\right],
$$

where $\operatorname{sg}$ denotes stop-gradient. This connection permits sampled-token distillation to be implemented with policy-gradient or importance-weighted policy-optimization machinery (Gu et al., [2024b](https://arxiv.org/html/2606.26790#bib.bib43); Lu and Thinking Machines Lab, [2025](https://arxiv.org/html/2606.26790#bib.bib47); Oh et al., [2026](https://arxiv.org/html/2606.26790#bib.bib51)). Compared with full-vocabulary matching, sampled-token supervision requires only the teacher probability of the realized token, but has higher Monte Carlo variance and uses less information from the teacher distribution.
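The importance-weighting identity can be checked numerically on a toy vocabulary. The sketch below is illustrative only: the random distributions, the vocabulary size, and the sample count are assumptions, and the Monte Carlo average is included just to show the single-sample estimator in use.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V = 6                                   # toy vocabulary size
p = softmax(rng.normal(size=V))         # student p_theta,i
q = softmax(rng.normal(size=V))         # teacher q_i
b = softmax(rng.normal(size=V))         # behavior b_i (full support, so p << b holds)

delta = np.log(p) - np.log(q)           # log-ratio cost delta_i(v)
rho = p / b                             # importance ratio rho_theta,i(v)

exact_rkl = float(np.sum(p * delta))            # E_{a ~ p}[delta(a)]
iw_expectation = float(np.sum(b * rho * delta)) # E_{a ~ b}[rho(a) * delta(a)], computed exactly

# Single-sample estimators averaged over tokens drawn from the behavior policy.
samples = rng.choice(V, size=200_000, p=b)
mc_estimate = float(np.mean(rho[samples] * delta[samples]))
```

The two exact expectations agree to floating-point precision, while `mc_estimate` fluctuates around them with a variance that grows when `rho` takes large values.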

##### From the clipped OPID objective to its unclipped skill surrogate\.

Let $i=(\tau,t,\ell)$ index a valid rollout-token position, and let $\nu_b$ denote the distribution over valid token positions induced by rollouts collected from the behavior policy. Given position $i$, the observed token $a_i$ is sampled from $b_i$. Recall from Eq. [1](https://arxiv.org/html/2606.26790#A1.E1) that

$$
\Delta_i(v) = \log q_i(v) - \log b_i(v), \qquad \rho_{\theta,i}(v) = \frac{p_{\theta,i}(v)}{b_i(v)},
$$

where $b_i$, $q_i$, and the resulting advantages are detached during the policy update. The skill advantage of a sampled token is

$$
A_i^{\mathrm{skill}}(a_i) = \Delta_i(a_i).
$$

The complete OPID advantage combines the outcome and skill signals:

$$
A_i^{\mathrm{OPID}}(a_i) = A_i^{\mathrm{ep}} + \lambda_{\mathrm{skill}}\Delta_i(a_i).
$$

Accordingly, the implemented clipped policy loss is

$$
\mathcal{L}_{\mathrm{policy}}(\theta) = -\mathbb{E}_{\substack{i\sim\nu_b \\ a_i\sim b_i}}\Big[\min\Big(\rho_{\theta,i}(a_i)A_i^{\mathrm{OPID}}(a_i),\; \operatorname{clip}\bigl(\rho_{\theta,i}(a_i),1-\epsilon,1+\epsilon\bigr)A_i^{\mathrm{OPID}}(a_i)\Big)\Big].
$$

In a realized rollout batch, this expectation is implemented as an empirical average over the observed valid tokens $a_i$.
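A minimal NumPy sketch of the clipped loss on a batch of observed tokens is given below. It only evaluates the loss value; in an actual training loop the same expression would be written in an autodiff framework with the advantage detached. The function name `opid_clipped_loss`, the default `lam_skill` and `eps`, and the toy arrays are assumptions for illustration, not values from the paper.

```python
import numpy as np

def opid_clipped_loss(logp_new, logp_old, logq, adv_ep, lam_skill=0.1, eps=0.2):
    """Empirical clipped policy loss over observed tokens.

    logp_new: log p_theta,i(a_i) under the current policy
    logp_old: log b_i(a_i) under the behavior (old) policy
    logq:     log q_i(a_i) under the skill-augmented context
    adv_ep:   outcome advantage A_i^ep broadcast to token level
    """
    rho = np.exp(logp_new - logp_old)                  # importance ratio
    adv = adv_ep + lam_skill * (logq - logp_old)       # A^OPID = A^ep + lambda * Delta
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Toy batch of three observed tokens; values are illustrative.
logp_old = np.log(np.array([0.5, 0.2, 0.3]))
logq = np.log(np.array([0.6, 0.1, 0.3]))
adv_ep = np.array([1.0, -1.0, 0.0])

loss_at_old = opid_clipped_loss(logp_old, logp_old, logq, adv_ep)
expected_at_old = float(-np.mean(adv_ep + 0.1 * (logq - logp_old)))
```

When the current policy equals the behavior policy, every ratio is 1 and the clipped loss reduces to the negative mean of the combined advantage, as `loss_at_old` and `expected_at_old` show.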

Because PPO clipping is applied after the outcome and skill advantages have been combined, the clipped objective does not in general decompose into independently clipped outcome and skill losses. To isolate the skill-distillation signal studied below, we therefore consider the corresponding unclipped policy surrogate:

$$
\mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta) \triangleq -\mathbb{E}_{\substack{i\sim\nu_b \\ a_i\sim b_i}}\left[\rho_{\theta,i}(a_i)A_i^{\mathrm{OPID}}(a_i)\right].
$$

Unlike the clipped objective, this loss decomposes exactly as

$$
\mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta) = \mathcal{L}_{\mathrm{ep}}^{\mathrm{unclip}}(\theta) + \mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta),
$$

where

$$
\mathcal{L}_{\mathrm{ep}}^{\mathrm{unclip}}(\theta) \triangleq -\mathbb{E}_{\substack{i\sim\nu_b \\ a_i\sim b_i}}\left[\rho_{\theta,i}(a_i)A_i^{\mathrm{ep}}\right]
$$

and

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) \triangleq -\lambda_{\mathrm{skill}}\mathbb{E}_{\substack{i\sim\nu_b \\ a_i\sim b_i}}\left[\rho_{\theta,i}(a_i)\Delta_i(a_i)\right]. \tag{2}
$$

Equation [2](https://arxiv.org/html/2606.26790#A1.E2) is the skill-distillation loss analyzed in the next subsection. Although it is defined through the unclipped surrogate, it characterizes the local skill-induced update of the implemented PPO loss. In particular, let $\theta_0=\theta_{\mathrm{old}}$, so that $p_{\theta_0,i}=b_i$ and $\rho_{\theta_0,i}(a)=1$. Since $1$ lies in the interior of the clipping interval, the clipped and unclipped objectives have the same value and gradient at the behavior policy. Writing $\mathcal{L}_{\mathrm{policy}}^{\mathrm{clip}}$ for the implemented clipped loss $\mathcal{L}_{\mathrm{policy}}$,

$$
\begin{aligned}
\mathcal{L}_{\mathrm{policy}}^{\mathrm{clip}}(\theta_0) &= \mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta_0), \\
\left.\nabla_{\theta}\mathcal{L}_{\mathrm{policy}}^{\mathrm{clip}}(\theta)\right|_{\theta=\theta_0} &= \left.\nabla_{\theta}\mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0} \\
&= \left.\nabla_{\theta}\mathcal{L}_{\mathrm{ep}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0} + \left.\nabla_{\theta}\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0}.
\end{aligned}
$$

Thus, $\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}$ is exactly the skill-induced component of the first-order PPO update around the behavior policy. Away from this local region, clipping couples the outcome and skill signals through the sign of their combined advantage, and the unclipped decomposition no longer describes the complete clipped objective globally.

### A.2 The Unclipped OPID Skill Loss as a Relative-KL Surrogate

We now analyze the unclipped skill-distillation loss introduced in Eq. [2](https://arxiv.org/html/2606.26790#A1.E2). Let $\nu_b$ denote the distribution over valid token positions induced by rollouts collected from the behavior policy. Throughout this subsection, the rollout histories, routed skills, and the corresponding distributions $b_i$ and $q_i$ are detached and held fixed during the policy update.

We assume the common-support condition

$$
p_{\theta,i} \ll b_i \qquad \text{and} \qquad p_{\theta,i} \ll q_i
$$

for every $i$ in the support of $\nu_b$. This condition is satisfied by standard softmax language models with finite logits.

Recall that

$$
\begin{aligned}
\Delta_i(v) &\triangleq \log q_i(v) - \log b_i(v), \\
\rho_{\theta,i}(v) &\triangleq \frac{p_{\theta,i}(v)}{b_i(v)}.
\end{aligned}
$$

The unclipped OPID skill loss is

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) \triangleq -\lambda_{\mathrm{skill}}\mathbb{E}_{\substack{i\sim\nu_b \\ a\sim b_i}}\left[\rho_{\theta,i}(a)\Delta_i(a)\right]. \tag{3}
$$

In a realized rollout batch, this expectation is approximated by the empirical average over the observed valid tokens. The expectation notation in Eq. [3](https://arxiv.org/html/2606.26790#A1.E3) makes the rollout-time token sampling law explicit for the theoretical analysis.

Define the behavior-relative KL and the student–teacher reverse-KL loss as

$$
\begin{aligned}
\mathcal{D}_b(\theta) &\triangleq \mathbb{E}_{i\sim\nu_b}\left[D_{\mathrm{KL}}\left(p_{\theta,i} \,\|\, b_i\right)\right], \\
\mathcal{L}_{\mathrm{RKL}}(\theta) &\triangleq \mathbb{E}_{i\sim\nu_b}\left[D_{\mathrm{KL}}\left(p_{\theta,i} \,\|\, q_i\right)\right].
\end{aligned}
$$

###### Proposition 1 (Exact relative-KL decomposition).

Under the assumptions above, for every admissible $\theta$,

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) = \lambda_{\mathrm{skill}}\left[\mathcal{L}_{\mathrm{RKL}}(\theta) - \mathcal{D}_b(\theta)\right]. \tag{4}
$$

Let $\theta_0=\theta_{\mathrm{old}}$ and suppose that $p_{\theta_0,i}=b_i$ for every $i$. Then

$$
\begin{aligned}
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta_0) &= \lambda_{\mathrm{skill}}\mathcal{L}_{\mathrm{RKL}}(\theta_0), && \text{(5)} \\
\left.\nabla_{\theta}\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0} &= \lambda_{\mathrm{skill}}\left.\nabla_{\theta}\mathcal{L}_{\mathrm{RKL}}(\theta)\right|_{\theta=\theta_0} && \text{(6)} \\
&= -\lambda_{\mathrm{skill}}\mathbb{E}_{\substack{i\sim\nu_b \\ a\sim b_i}}\left[\Delta_i(a)\left.\nabla_{\theta}\log p_{\theta,i}(a)\right|_{\theta=\theta_0}\right]. && \text{(7)}
\end{aligned}
$$

###### Proof.

Fix a valid token position $i$. By the common-support assumption and a change of measure from $b_i$ to $p_{\theta,i}$,

$$
\begin{aligned}
-\lambda_{\mathrm{skill}}\mathbb{E}_{a\sim b_i}\left[\rho_{\theta,i}(a)\Delta_i(a)\right]
&= -\lambda_{\mathrm{skill}}\sum_{v\in\mathcal{V}} b_i(v)\frac{p_{\theta,i}(v)}{b_i(v)}\bigl(\log q_i(v) - \log b_i(v)\bigr) \\
&= \lambda_{\mathrm{skill}}\sum_{v\in\mathcal{V}} p_{\theta,i}(v)\bigl(\log b_i(v) - \log q_i(v)\bigr).
\end{aligned}
$$

Adding and subtracting $\log p_{\theta,i}(v)$ inside the summand gives

$$
\begin{aligned}
\lambda_{\mathrm{skill}}\sum_{v\in\mathcal{V}} p_{\theta,i}(v)\left[\log\frac{p_{\theta,i}(v)}{q_i(v)} - \log\frac{p_{\theta,i}(v)}{b_i(v)}\right]
= \lambda_{\mathrm{skill}}\left[D_{\mathrm{KL}}(p_{\theta,i} \,\|\, q_i) - D_{\mathrm{KL}}(p_{\theta,i} \,\|\, b_i)\right].
\end{aligned}
$$

Averaging over $i\sim\nu_b$ proves Eq. [4](https://arxiv.org/html/2606.26790#A1.E4).

At $\theta_0$, $p_{\theta_0,i}=b_i$, and hence

$$
\mathcal{D}_b(\theta_0) = 0.
$$

This proves Eq. [5](https://arxiv.org/html/2606.26790#A1.E5). Moreover, $\mathcal{D}_b$ is nonnegative, differentiable, and attains its global minimum at $\theta_0$, so

$$
\left.\nabla_{\theta}\mathcal{D}_b(\theta)\right|_{\theta=\theta_0} = 0.
$$

Differentiating Eq. [4](https://arxiv.org/html/2606.26790#A1.E4) therefore proves Eq. [6](https://arxiv.org/html/2606.26790#A1.E6).

Finally, because $b_i$, $q_i$, and $\Delta_i$ are detached,

$$
\nabla_{\theta}\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) = -\lambda_{\mathrm{skill}}\mathbb{E}_{\substack{i\sim\nu_b \\ a\sim b_i}}\left[\rho_{\theta,i}(a)\Delta_i(a)\nabla_{\theta}\log p_{\theta,i}(a)\right].
$$

Substituting $\rho_{\theta_0,i}(a)=1$ proves Eq. [7](https://arxiv.org/html/2606.26790#A1.E7). ∎
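The per-context identity behind Eq. 4 is easy to verify numerically. The following sketch draws three arbitrary full-support distributions and compares both sides; the seed, vocabulary size, and value of `lam_skill` are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl(x, y):
    return float(np.sum(x * (np.log(x) - np.log(y))))

rng = np.random.default_rng(1)
V = 8
p = softmax(rng.normal(size=V))   # current policy p_theta,i
q = softmax(rng.normal(size=V))   # routed teacher q_i
b = softmax(rng.normal(size=V))   # behavior policy b_i
lam_skill = 0.5                   # illustrative coefficient

Delta = np.log(q) - np.log(b)     # skill advantage for every token
rho = p / b

# Left side: per-context unclipped skill loss, with the expectation over a ~ b_i taken exactly.
skill_loss = float(-lam_skill * np.sum(b * rho * Delta))
# Right side: lambda * [KL(p || q) - KL(p || b)].
relative_kl = lam_skill * (kl(p, q) - kl(p, b))

# At the behavior policy (p = b) the behavior-relative term vanishes.
skill_loss_at_b = float(-lam_skill * np.sum(b * Delta))
rkl_at_b = lam_skill * kl(b, q)
```

Both pairs agree to floating-point precision, matching Eq. 4 and Eq. 5 for a single context.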

###### Corollary 1 (First-order tightness around the behavior policy).

Assume that $p_{\theta,i}$ is twice continuously differentiable in a neighborhood of $\theta_0$. For $\delta\to 0$,

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta_0+\delta) = \lambda_{\mathrm{skill}}\mathcal{L}_{\mathrm{RKL}}(\theta_0+\delta) - \frac{\lambda_{\mathrm{skill}}}{2}\delta^{\top}F_b\,\delta + o(\|\delta\|^{2}), \tag{8}
$$

where

$$
\begin{aligned}
F_b &\triangleq \mathbb{E}_{\substack{i\sim\nu_b \\ v\sim b_i}}\left[s_i(v)s_i(v)^{\top}\right], \\
s_i(v) &\triangleq \left.\nabla_{\theta}\log p_{\theta,i}(v)\right|_{\theta=\theta_0}
\end{aligned}
$$

is the behavior-policy Fisher information averaged over rollout contexts.

###### Proof.

By Proposition [1](https://arxiv.org/html/2606.26790#Thmproposition1), the discrepancy between the scaled reverse-KL loss and the OPID skill loss is exactly $\lambda_{\mathrm{skill}}\mathcal{D}_b(\theta)$. The standard local expansion of relative entropy around its reference distribution gives

$$
\mathcal{D}_b(\theta_0+\delta) = \frac{1}{2}\delta^{\top}F_b\,\delta + o(\|\delta\|^{2}).
$$

Substituting this expansion into Eq. [4](https://arxiv.org/html/2606.26790#A1.E4) proves Eq. [8](https://arxiv.org/html/2606.26790#A1.E8). ∎

Equation [8](https://arxiv.org/html/2606.26790#A1.E8) gives the precise sense in which the OPID skill loss is locally equivalent to reverse-KL distillation. At the behavior policy, the two losses have the same value and gradient after accounting for the factor $\lambda_{\mathrm{skill}}$, while their discrepancy is second order in the policy displacement.
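To make the second-order discrepancy concrete, the sketch below checks the local expansion of the behavior-relative KL for a single context. It assumes a free-logit softmax parameterization, for which the Fisher information in logit coordinates is $\operatorname{diag}(b)-bb^{\top}$; this parameterization, the seed, and the step size are illustrative assumptions rather than the shared-parameter setting of the corollary.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl(x, y):
    return float(np.sum(x * (np.log(x) - np.log(y))))

rng = np.random.default_rng(2)
z0 = rng.normal(size=5)                    # behavior logits theta_0 (toy)
b = softmax(z0)
fisher = np.diag(b) - np.outer(b, b)       # softmax Fisher information in logit coordinates

direction = rng.normal(size=5)
step = 1e-3 * direction                    # small policy displacement

exact = kl(softmax(z0 + step), b)          # D_b(theta_0 + delta) for this context
quadratic = 0.5 * float(step @ fisher @ step)
rel_err = abs(exact - quadratic) / quadratic
```

For a displacement of this size the quadratic form reproduces the exact KL up to a small relative error, consistent with the remainder being of higher order.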

###### Corollary 2 (Exact recovery under a matching behavior-KL penalty).

Consider the regularized auxiliary loss

$$
\mathcal{L}_{\mathrm{aux}}(\theta) \triangleq \mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) + \beta\,\mathcal{D}_b(\theta). \tag{9}
$$

Then

$$
\mathcal{L}_{\mathrm{aux}}(\theta) = \lambda_{\mathrm{skill}}\mathcal{L}_{\mathrm{RKL}}(\theta) + \left(\beta - \lambda_{\mathrm{skill}}\right)\mathcal{D}_b(\theta). \tag{10}
$$

In particular, if $\beta=\lambda_{\mathrm{skill}}$, then

$$
\mathcal{L}_{\mathrm{aux}}(\theta) = \lambda_{\mathrm{skill}}\mathcal{L}_{\mathrm{RKL}}(\theta)
$$

for every admissible $\theta$.

###### Proof.

Substitute Eq. [4](https://arxiv.org/html/2606.26790#A1.E4) into Eq. [9](https://arxiv.org/html/2606.26790#A1.E9) and collect the coefficients of $\mathcal{D}_b(\theta)$. ∎

The exact cancellation in Corollary [2](https://arxiv.org/html/2606.26790#Thmcorollary2) requires both (i) a KL penalty to the same behavior distribution $b_i$, evaluated under the ordinary context, and (ii) the matching coefficient $\beta=\lambda_{\mathrm{skill}}$. A KL penalty to a different reference distribution, or a different coefficient, leaves the residual behavior-relative term in Eq. [10](https://arxiv.org/html/2606.26790#A1.E10) and is therefore not exactly equivalent to direct student–teacher reverse-KL distillation.

##### Relation to the implemented PPO-clipped loss.

The decomposition in Proposition [1](https://arxiv.org/html/2606.26790#Thmproposition1) applies exactly to the unclipped skill loss $\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}$. In the implemented OPID objective, PPO clipping is applied to the combined advantage

$$
A_i^{\mathrm{OPID}} = A_i^{\mathrm{ep}} + \lambda_{\mathrm{skill}}\Delta_i(a_i),
$$

so the complete clipped loss does not globally decompose into independently clipped outcome and skill losses.

Nevertheless, at $\theta_0=\theta_{\mathrm{old}}$,

$$
\rho_{\theta_0,i}(a) = 1.
$$

Since $1$ lies in the interior of $[1-\epsilon,1+\epsilon]$ for $\epsilon>0$, the clipped and unclipped policy losses have the same value and first derivative at the behavior policy:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{policy}}^{\mathrm{clip}}(\theta_0) &= \mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta_0), \\
\left.\nabla_{\theta}\mathcal{L}_{\mathrm{policy}}^{\mathrm{clip}}(\theta)\right|_{\theta=\theta_0} &= \left.\nabla_{\theta}\mathcal{L}_{\mathrm{policy}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0} \\
&= \left.\nabla_{\theta}\mathcal{L}_{\mathrm{ep}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0} + \left.\nabla_{\theta}\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta)\right|_{\theta=\theta_0}.
\end{aligned}
$$

Therefore, Eq. [6](https://arxiv.org/html/2606.26790#A1.E6) characterizes the skill-induced component of the local PPO update. Once the policy ratio reaches a clipping boundary, however, clipping couples the outcome and skill signals through the sign of their combined advantage, and the exact relative-KL decomposition no longer applies to the complete clipped objective.

###### Corollary 3 (Non-degenerate token-level signal under reward ties).

Fix one context $i$, and parameterize $p_i=\operatorname{softmax}(z_i)$ using free categorical logits. Define the corresponding full-action skill loss as

$$
\mathcal{L}_{\mathrm{skill},i}^{\mathrm{unclip}}(z_i) \triangleq -\lambda_{\mathrm{skill}}\sum_{v\in\mathcal{V}} p_i(v)\Delta_i(v).
$$

For $\lambda_{\mathrm{skill}}>0$,

$$
\frac{\partial\mathcal{L}_{\mathrm{skill},i}^{\mathrm{unclip}}}{\partial z_i(v)} = -\lambda_{\mathrm{skill}}\,p_i(v)\left(\Delta_i(v) - \mathbb{E}_{u\sim p_i}[\Delta_i(u)]\right). \tag{11}
$$

At $p_i=b_i$ with full support, the gradient in Eq. [11](https://arxiv.org/html/2606.26790#A1.E11) is zero for every $v$ if and only if $q_i=b_i$.

###### Proof.

Using

$$
\frac{\partial p_i(u)}{\partial z_i(v)} = p_i(u)\left(\mathbf{1}\{u=v\} - p_i(v)\right),
$$

we obtain

$$
\begin{aligned}
\frac{\partial\mathcal{L}_{\mathrm{skill},i}^{\mathrm{unclip}}}{\partial z_i(v)} &= -\lambda_{\mathrm{skill}}\sum_{u}\Delta_i(u)\,p_i(u)\left(\mathbf{1}\{u=v\} - p_i(v)\right) \\
&= -\lambda_{\mathrm{skill}}\,p_i(v)\Delta_i(v) + \lambda_{\mathrm{skill}}\,p_i(v)\sum_{u}p_i(u)\Delta_i(u),
\end{aligned}
$$

which proves Eq. [11](https://arxiv.org/html/2606.26790#A1.E11).

Suppose that $p_i=b_i$, $b_i(v)>0$ for every $v$, and the derivative is zero for every $v$. Since $\lambda_{\mathrm{skill}}>0$, it follows that $\Delta_i(v)$ is constant over the vocabulary. Hence

$$
q_i(v) = e^{c}\,b_i(v)
$$

for some constant $c$. Normalization of $q_i$ and $b_i$ implies $e^{c}=1$, and therefore $q_i=b_i$. The converse is immediate. ∎

Corollary [3](https://arxiv.org/html/2606.26790#Thmcorollary3) is a per-context logit statement. It shows that even when group-relative outcome advantages vanish because all sampled trajectories receive tied rewards, a nontrivial skill-conditioned teacher still supplies a token-level learning signal whenever $q_i\neq b_i$. With shared neural parameters, gradients from different contexts may still cancel; the result does not claim that the aggregate parameter gradient must be nonzero.
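The logit gradient in Eq. 11 can be compared against finite differences, and the tie-breaking claim can be checked at $p_i=b_i$. The sketch below uses free logits for one context; the function names, seed, and `lam_skill` value are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def skill_loss(z, Delta, lam_skill):
    """Full-action per-context skill loss with Delta held fixed (detached)."""
    return float(-lam_skill * np.sum(softmax(z) * Delta))

def analytic_grad(z, Delta, lam_skill):
    """Closed-form logit gradient: -lambda * p(v) * (Delta(v) - E_p[Delta])."""
    p = softmax(z)
    return -lam_skill * p * (Delta - np.sum(p * Delta))

def numeric_grad(z, Delta, lam_skill, h=1e-6):
    g = np.zeros_like(z)
    for v in range(len(z)):
        e = np.zeros_like(z)
        e[v] = h
        g[v] = (skill_loss(z + e, Delta, lam_skill) - skill_loss(z - e, Delta, lam_skill)) / (2 * h)
    return g

rng = np.random.default_rng(4)
V, lam_skill = 6, 0.5
z_b = rng.normal(size=V)                  # behavior logits, so p = b at z = z_b
b = softmax(z_b)
q = softmax(rng.normal(size=V))           # teacher that differs from b
Delta = np.log(q) - np.log(b)

grad_formula = analytic_grad(z_b, Delta, lam_skill)
grad_fd = numeric_grad(z_b, Delta, lam_skill)
grad_if_teacher_equals_b = analytic_grad(z_b, np.zeros(V), lam_skill)
```

With a teacher that differs from the behavior policy the gradient has nonzero entries even though no outcome advantage appears anywhere in the computation, while setting $q_i=b_i$ (so $\Delta_i\equiv 0$) gives an all-zero gradient.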

### A.3 On-Policy Occupancy Matching for Distillation

Recall that $\nu_b$ denotes the distribution over valid token positions induced by rollouts collected from the behavior policy. Let $d_b$ denote the corresponding distribution over ordinary autoregressive contexts $c_i$, i.e., the context marginal induced by $i\sim\nu_b$. For an arbitrary data-collection policy $\mu$, let $d_\mu$ denote the analogous context distribution.

We define total variation as

$$
\operatorname{TV}(P,Q) \triangleq \sup_{A}|P(A)-Q(A)| = \frac{1}{2}\int\left|\mathrm{d}P-\mathrm{d}Q\right|.
$$

The following result isolates the effect of changing only the distribution of ordinary autoregressive contexts. It applies to both nonnegative distillation losses and signed surrogate losses.

###### Proposition 2 (On-policy occupancy matching).

Let $\ell_\theta:\mathcal{C}\rightarrow[m_\ell,M_\ell]$ be a measurable per-context loss, where $-\infty<m_\ell<M_\ell<+\infty$. Then

$$
\begin{aligned}
\left|\mathbb{E}_{c\sim d_b}\left[\ell_\theta(c)\right] - \mathbb{E}_{c\sim d_\mu}\left[\ell_\theta(c)\right]\right|
&\leq \left(M_\ell-m_\ell\right)\operatorname{TV}(d_b,d_\mu) \\
&\leq \left(M_\ell-m_\ell\right)\sqrt{\frac{1}{2}D_{\mathrm{KL}}(d_b \,\|\, d_\mu)}.
\end{aligned} \tag{12}
$$

In particular, if $d_\mu=d_b$, then the context-occupancy mismatch is exactly zero.

###### Proof.

Define

$$
f_\theta(c) \triangleq \frac{\ell_\theta(c)-m_\ell}{M_\ell-m_\ell}.
$$

Then $0\leq f_\theta(c)\leq 1$. By the variational characterization of total variation over measurable functions with range in $[0,1]$,

$$
\left|\mathbb{E}_{d_b}[f_\theta] - \mathbb{E}_{d_\mu}[f_\theta]\right| \leq \operatorname{TV}(d_b,d_\mu).
$$

Multiplying both sides by $M_\ell-m_\ell$ proves the first inequality in Eq. [12](https://arxiv.org/html/2606.26790#A1.E12). The second inequality follows from Pinsker's inequality. If $d_b$ is not absolutely continuous with respect to $d_\mu$, then $D_{\mathrm{KL}}(d_b \,\|\, d_\mu)=+\infty$, and the inequality remains valid in the extended-real sense. Setting $d_\mu=d_b$ proves the final statement. ∎
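The chain of inequalities in Eq. 12 can be illustrated on a finite set of contexts. In the sketch below, the two context distributions, the loss range, and the per-context loss values are all randomly generated assumptions; nothing here is estimated from real rollouts.

```python
import numpy as np

rng = np.random.default_rng(3)
n_contexts = 10
d_b = rng.dirichlet(np.ones(n_contexts))     # on-policy context distribution (toy)
d_mu = rng.dirichlet(np.ones(n_contexts))    # context distribution of another collection policy (toy)

m_l, M_l = -1.0, 2.0                         # assumed loss range [m_l, M_l]
loss = rng.uniform(m_l, M_l, size=n_contexts)  # bounded, possibly signed per-context loss

gap = abs(float(d_b @ loss) - float(d_mu @ loss))
tv = 0.5 * float(np.sum(np.abs(d_b - d_mu)))
kl_b_mu = float(np.sum(d_b * (np.log(d_b) - np.log(d_mu))))

tv_bound = (M_l - m_l) * tv                        # first inequality
pinsker_bound = (M_l - m_l) * np.sqrt(0.5 * kl_b_mu)  # second inequality (Pinsker)
```

The expected-loss gap is bounded by `tv_bound`, which is in turn bounded by `pinsker_bound`; setting `d_mu = d_b` drives all three quantities to zero.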

For example, Proposition [2](https://arxiv.org/html/2606.26790#Thmproposition2) can be applied to the per-context reverse-KL loss

$$
\ell_{\mathrm{RKL},\theta}(c_i) \triangleq D_{\mathrm{KL}}\left(p_{\theta,i} \,\|\, q_i\right),
$$

which is the distribution-matching loss locally approximated by the OPID skill update. It can also be applied to a bounded version of the signed per-context OPID skill loss

$$
\begin{aligned}
\ell_{\mathrm{skill},\theta}^{\mathrm{unclip}}(c_i) &\triangleq -\lambda_{\mathrm{skill}}\mathbb{E}_{a\sim b_i}\left[\rho_{\theta,i}(a)\Delta_i(a)\right] \\
&= \lambda_{\mathrm{skill}}\left[D_{\mathrm{KL}}\left(p_{\theta,i} \,\|\, q_i\right) - D_{\mathrm{KL}}\left(p_{\theta,i} \,\|\, b_i\right)\right].
\end{aligned} \tag{13}
$$

Because the loss in Eq. [13](https://arxiv.org/html/2606.26790#A1.E13) is signed and need not be uniformly bounded for arbitrary probability distributions, applying Proposition [2](https://arxiv.org/html/2606.26790#Thmproposition2) to it requires an explicit bounded-range condition, such as probability flooring, log-ratio clipping, or restriction to a compact parameter neighborhood. More general versions can instead be obtained under appropriate moment or tail conditions.

Proposition [2](https://arxiv.org/html/2606.26790#Thmproposition2) controls only the mismatch in the outer distribution of ordinary autoregressive contexts. It assumes that the same per-context loss map is evaluated under $d_b$ and $d_\mu$. It does not by itself control changes in the hindsight skill, the routed teacher $q_i$, or other trajectory-dependent quantities that may also change with the data-collection policy.

### A.4 Critical-First Hierarchical Routing

We next formalize how the episode-level and step-level skills determine the detached teacher $q_i$ used in $\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}$.

Let $q_i^{\star}$ denote an ideal privileged teacher at token position $i$. Let $q_i^{\mathrm{ep}}$ and $q_i^{\mathrm{step}}$ denote the teachers induced by the episode-level and step-level skills, respectively. Let

$$
z_i^{\star} \in \{0,1\}
$$

be an oracle criticality indicator, where $z_i^{\star}=1$ means that the step-level teacher is the appropriate specialized teacher. The analyzer prediction is

$$
\widehat{z}_i \triangleq \mathbf{1}\{t\in C_\tau\}.
$$

The critical-first routing rule defines

$$
q_i^{\mathrm{route}} \triangleq \widehat{z}_i\,q_i^{\mathrm{step}} + \left(1-\widehat{z}_i\right)q_i^{\mathrm{ep}}, \qquad q_i \equiv q_i^{\mathrm{route}}. \tag{14}
$$

Thus, the $q_i$ appearing in the OPID skill advantage $\Delta_i(v)=\log q_i(v)-\log b_i(v)$ is precisely the routed teacher in Eq. [14](https://arxiv.org/html/2606.26790#A1.E14).

Measure the approximation errors of the two candidate teachers by

$$
\begin{aligned}
\mathcal{E}_i^{\mathrm{ep}} &\triangleq D_{\mathrm{KL}}\left(q_i^{\star} \,\|\, q_i^{\mathrm{ep}}\right), \\
\mathcal{E}_i^{\mathrm{step}} &\triangleq D_{\mathrm{KL}}\left(q_i^{\star} \,\|\, q_i^{\mathrm{step}}\right), \\
\mathcal{E}_i^{\mathrm{route}} &\triangleq D_{\mathrm{KL}}\left(q_i^{\star} \,\|\, q_i^{\mathrm{route}}\right).
\end{aligned} \tag{15}
$$

Because the routing decision is hard,

$$
\mathcal{E}_i^{\mathrm{route}} = \widehat{z}_i\,\mathcal{E}_i^{\mathrm{step}} + \left(1-\widehat{z}_i\right)\mathcal{E}_i^{\mathrm{ep}}.
$$
###### Proposition 3\(Routing optimality and detector\-error regret\)\.

Assume that the episode-level and step-level teachers specialize according to the oracle criticality label:

$$
\begin{aligned}
z_i^{\star}=1 &\implies \mathcal{E}_i^{\mathrm{step}} \leq \mathcal{E}_i^{\mathrm{ep}},\\
z_i^{\star}=0 &\implies \mathcal{E}_i^{\mathrm{ep}} \leq \mathcal{E}_i^{\mathrm{step}}.
\end{aligned} \tag{16}
$$

Then, pointwise,

$$
\mathcal{E}_i^{\mathrm{route}} = \min\left\{\mathcal{E}_i^{\mathrm{ep}},\mathcal{E}_i^{\mathrm{step}}\right\} + \mathbf{1}\left\{\widehat{z}_i\neq z_i^{\star}\right\}\left|\mathcal{E}_i^{\mathrm{ep}}-\mathcal{E}_i^{\mathrm{step}}\right|. \tag{17}
$$

Consequently, if

$$
\left|\mathcal{E}_i^{\mathrm{ep}}-\mathcal{E}_i^{\mathrm{step}}\right| \leq \Gamma \tag{18}
$$

almost surely under $i\sim\nu_b$, then

$$
\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{route}}\right] \leq \min\left\{\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{ep}}\right],\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{step}}\right]\right\} + \Gamma\,\Pr_{i\sim\nu_b}\left(\widehat{z}_i\neq z_i^{\star}\right). \tag{19}
$$

Under perfect criticality detection, $\widehat{z}_i=z_i^{\star}$ almost surely, and therefore

$$
\mathcal{E}_i^{\mathrm{route}} = \min\left\{\mathcal{E}_i^{\mathrm{ep}},\mathcal{E}_i^{\mathrm{step}}\right\}
$$

pointwise, with

$$
\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{route}}\right] \leq \min\left\{\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{ep}}\right],\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{step}}\right]\right\}.
$$

###### Proof\.

Consider first the event $\widehat{z}_i=z_i^{\star}$. Under Eq. [16](https://arxiv.org/html/2606.26790#A1.E16), the routing rule selects a teacher with the smaller approximation error. Hence

$$
\mathcal{E}_i^{\mathrm{route}} = \min\left\{\mathcal{E}_i^{\mathrm{ep}},\mathcal{E}_i^{\mathrm{step}}\right\}.
$$

The second term in Eq. [17](https://arxiv.org/html/2606.26790#A1.E17) is zero on this event.

On the event $\widehat{z}_i\neq z_i^{\star}$, the routing rule selects the nonspecialized teacher. Its excess error over the oracle choice is exactly

$$
\left|\mathcal{E}_i^{\mathrm{ep}}-\mathcal{E}_i^{\mathrm{step}}\right|.
$$

This proves Eq. [17](https://arxiv.org/html/2606.26790#A1.E17).

Taking expectations yields

$$
\begin{aligned}
\mathbb{E}_{i\sim\nu_b}\left[\mathcal{E}_i^{\mathrm{route}}\right] &= \mathbb{E}_{i\sim\nu_b}\left[\min\left\{\mathcal{E}_i^{\mathrm{ep}},\mathcal{E}_i^{\mathrm{step}}\right\}\right]\\
&\quad+\mathbb{E}_{i\sim\nu_b}\left[\mathbf{1}\left\{\widehat{z}_i\neq z_i^{\star}\right\}\left|\mathcal{E}_i^{\mathrm{ep}}-\mathcal{E}_i^{\mathrm{step}}\right|\right].
\end{aligned}
$$

Using

$$
\mathbb{E}[\min\{X,Y\}] \leq \min\{\mathbb{E}[X],\mathbb{E}[Y]\}
$$

and Eq. [18](https://arxiv.org/html/2606.26790#A1.E18) proves Eq. [19](https://arxiv.org/html/2606.26790#A1.E19). The perfect-detection statements follow by setting $\Pr_{i\sim\nu_b}(\widehat{z}_i\neq z_i^{\star})=0$. ∎
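
The pointwise identity in Eq. 17 can also be checked numerically. The sketch below draws random categorical distributions, sets the oracle label so that the specialization assumption in Eq. 16 holds by construction, flips the detector with a chosen probability, and compares both sides. The helper names, the vocabulary size, and the flip probability are illustrative assumptions.

```python
import math
import random


def kl(p, q):
    """D_KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_dist(rng, k):
    """Random categorical distribution over k outcomes, bounded away from zero."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def routing_errors(rng, k=5, flip_prob=0.3):
    """Return (E_route, right-hand side of Eq. 17) for one sampled position."""
    q_star, q_ep, q_step = (random_dist(rng, k) for _ in range(3))
    e_ep, e_step = kl(q_star, q_ep), kl(q_star, q_step)
    # Assumption of this sketch: the oracle label marks the lower-error teacher,
    # which makes the specialization condition (Eq. 16) true by construction.
    z_star = 1 if e_step <= e_ep else 0
    z_hat = 1 - z_star if rng.random() < flip_prob else z_star  # noisy detector
    e_route = e_step if z_hat == 1 else e_ep                    # hard routing
    rhs = min(e_ep, e_step) + (z_hat != z_star) * abs(e_ep - e_step)
    return e_route, rhs


rng = random.Random(0)
pairs = [routing_errors(rng) for _ in range(1000)]
max_gap = max(abs(lhs - rhs) for lhs, rhs in pairs)
print(max_gap < 1e-9)  # both sides agree up to floating-point rounding
```

Averaging the first component of `pairs` against the two single-teacher averages gives an empirical view of the bound in Eq. 19 for this synthetic setup only; it says nothing about the detector accuracy achieved in the actual experiments.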

Proposition[3](https://arxiv.org/html/2606.26790#Thmproposition3)separates the two requirements behind critical\-first routing: teacher specialization and criticality\-detection accuracy\. Under specialization, perfect detection recovers the oracle pointwise choice between the two candidate teachers\. With imperfect detection, the excess teacher\-approximation error is controlled jointly by the detector error probability and the difference between the two candidate teacher errors\.

The criterion in Eq. [15](https://arxiv.org/html/2606.26790#A1.E15) measures the quality of a candidate teacher relative to $q_i^{\star}$. It is distinct from the student–teacher reverse-KL loss $D_{\mathrm{KL}}(p_{\theta,i}\,\|\,q_i)$ appearing in $\mathcal{L}_{\mathrm{RKL}}$. Therefore, without additional assumptions relating the candidate teachers’ likelihood ratios, the routing result should not be interpreted as a direct upper bound on $\mathcal{L}_{\mathrm{RKL}}$.

### A.5 Summary

Proposition [1](https://arxiv.org/html/2606.26790#Thmproposition1) analyzes the unclipped skill component of the OPID policy loss:

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) = -\lambda_{\mathrm{skill}}\,\mathbb{E}_{\substack{i\sim\nu_b\\ a\sim b_i}}\left[\rho_{\theta,i}(a)\,\Delta_i(a)\right].
$$

Conditioned on fixed rollout histories, routed skills, and detached distributions $b_i$ and $q_i$, this loss has the exact decomposition

$$
\mathcal{L}_{\mathrm{skill}}^{\mathrm{unclip}}(\theta) = \lambda_{\mathrm{skill}}\left[\mathcal{L}_{\mathrm{RKL}}(\theta)-\mathcal{D}_b(\theta)\right].
$$
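
The decomposition can be verified for a single token position by expanding the expectation exactly. The sketch assumes that $\rho_{\theta,i}(a)=p_{\theta,i}(a)/b_i(a)$, that the per-position term of $\mathcal{L}_{\mathrm{RKL}}$ is $D_{\mathrm{KL}}(p_{\theta,i}\,\|\,q_i)$, and that the per-position term of $\mathcal{D}_b$ is $D_{\mathrm{KL}}(p_{\theta,i}\,\|\,b_i)$. These readings are assumptions of the sketch chosen to match the notation above, and the three distributions are made-up numbers.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) for categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unclipped_skill_loss(p: List[float], b: List[float], q: List[float],
                         lam_skill: float) -> float:
    """-lam * E_{a~b}[rho(a) * Delta(a)], with the expectation taken exactly."""
    expectation = 0.0
    for pa, ba, qa in zip(p, b, q):
        rho = pa / ba                            # assumed importance ratio p/b
        delta = math.log(qa) - math.log(ba)      # skill advantage log q - log b
        expectation += ba * rho * delta          # weight by behavior probability
    return -lam_skill * expectation


p = [0.2, 0.5, 0.3]   # student p_theta (hypothetical)
b = [0.3, 0.4, 0.3]   # behavior distribution b (hypothetical)
q = [0.1, 0.6, 0.3]   # routed teacher q (hypothetical)
lam = 0.5
lhs = unclipped_skill_loss(p, b, q, lam)
rhs = lam * (kl(p, q) - kl(p, b))
print(abs(lhs - rhs) < 1e-12)  # the two sides agree up to rounding
```

The agreement follows because $\sum_a b(a)\,\tfrac{p(a)}{b(a)}\,[\log q(a)-\log b(a)] = \sum_a p(a)\,[\log q(a)-\log b(a)]$, which is the difference of the two KL terms with the sign flipped.
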
Proposition [2](https://arxiv.org/html/2606.26790#Thmproposition2) shows that collecting the ordinary autoregressive contexts on policy eliminates the outer context-distribution mismatch: when the collection distribution equals the behavior-policy distribution, $d_{\mu}=d_b$, the occupancy term in Eq. [12](https://arxiv.org/html/2606.26790#A1.E12) is zero.

Proposition [3](https://arxiv.org/html/2606.26790#Thmproposition3) analyzes how the teacher $q_i$ is selected from episode-level and step-level candidates. Under the stated specialization assumption, critical-first routing recovers the lower-error candidate under perfect detection, while the degradation under imperfect detection is controlled by

$$
\Gamma\,\Pr_{i\sim\nu_b}\left(\widehat{z}_i\neq z_i^{\star}\right).
$$

Taken together, the three results establish that:

1. The unclipped OPID skill loss is an exact relative-KL loss and is first-order equivalent to scaled reverse-KL distillation at the behavior policy;
2. On-policy collection removes the mismatch in the outer distribution of ordinary autoregressive contexts; and
3. Critical-first routing approaches the oracle candidate-teacher selection when the candidate teachers specialize and the criticality detector is accurate.

## Appendix B Additional Experimental Details

This section provides the experimental protocol used for the results in the main paper\. We organize the details by datasets, baselines and implementation\.

### B.1 Datasets

Table [4](https://arxiv.org/html/2606.26790#A2.T4) summarizes the datasets used in our experiments. The evaluation covers three agentic domains: embodied reasoning, web navigation, and search-augmented question answering.

Table 4: Detailed information on the agentic benchmarks.

##### ALFWorld.

ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2606.26790#bib.bib19)) aligns text-based interaction with the ALFRED household environment. Given a natural-language goal and textual observations, an agent must issue a sequence of admissible actions to complete the task. We report results on six task types: Pick, Look, Clean, Heat, Cool, and Pick2.

##### WebShop\.

WebShop (Yao et al., [2022](https://arxiv.org/html/2606.26790#bib.bib20)) is a text-based e-commerce environment in which an agent searches for products, opens product pages, selects attributes, and purchases an item that satisfies a natural-language request. The environment provides both a normalized task-completion score, which assigns partial credit for matching requested attributes, and a binary success signal for exact task completion.

##### Search\-Augmented QA\.

Following the Search-R1 setting (Jin et al., [2025](https://arxiv.org/html/2606.26790#bib.bib21)), we evaluate search-augmented reasoning on Natural Questions (Kwiatkowski et al., [2019](https://arxiv.org/html/2606.26790#bib.bib32)), TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2606.26790#bib.bib33)), PopQA (Mallen et al., [2023](https://arxiv.org/html/2606.26790#bib.bib34)), HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.26790#bib.bib35)), 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2606.26790#bib.bib36)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.26790#bib.bib37)), and Bamboogle (Press et al., [2023](https://arxiv.org/html/2606.26790#bib.bib38)). In this setting, the agent interacts with the configured search environment before producing a final answer.

##### Training Data\.

We train a separate model for each benchmark setting, sampling 2,400 training examples from ALFWorld, 2,400 from WebShop, and 19,200 from the search-augmented QA benchmarks.

### B\.2Baselines

We compare OPID with prompting\-only methods, outcome\-based reinforcement learning, and self\-distillation or skill\-distillation variants\. Unless explicitly marked with an asterisk, every method is evaluated from the ordinary environment interaction history, without access to skills or any other privileged context\. An asterisk therefore denotes validation/test\-time access to a natural\-language skill; it does not indicate a different backbone or evaluation task\.

##### Prompting\-only methods\.

- Vanilla. This is the original instruction-tuned backbone used without any post-training. The model receives only the standard environment prompt and the interaction history exposed by the environment interface.
- Skill-Prompt∗. This method keeps the Vanilla parameters frozen but augments the validation/test context with a retrieved natural-language skill relevant to the current task. Because no gradient update is performed, any improvement comes purely from in-context use of the skill.

##### Outcome\-based reinforcement learning\.

- GRPO (Shao et al., [2024](https://arxiv.org/html/2606.26790#bib.bib1)). Group Relative Policy Optimization is a critic-free policy-gradient method that samples a group of trajectories for each task, assigns each trajectory a scalar outcome reward, and normalizes these rewards within the group to construct relative advantages. In the outcome-only setting used here, every generated token in a trajectory inherits the same sequence-level advantage, and the policy is updated with a clipped importance-ratio objective; no process labels or teacher-derived token-level targets are used.
- Skill-GRPO. This variant uses the same group-relative outcome objective as GRPO, but makes a task-relevant natural-language skill available to the policy during training rollouts and policy updates. The skill can therefore shape exploration and the trajectories that receive reinforcement. The skill is removed at validation/test time, so this baseline tests whether skill-guided behavior has been absorbed into the model parameters rather than merely followed from the prompt.
- Skill-GRPO∗. This method is trained in the same way as Skill-GRPO, but retains the skill context at validation/test time. Its train-time and test-time conditioning are consequently matched.
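
The group-relative normalization described for GRPO above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name is hypothetical, the population standard deviation is used, and a small `eps` is added to the denominator to avoid division by zero when all rewards in a group are equal. The source does not specify either of the last two choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize scalar outcome rewards within one rollout group.

    Every token of trajectory k later inherits the k-th value as its
    sequence-level advantage in the outcome-only setting.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts with binary success rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# A group in which every rollout fails carries no outcome signal at all.
print(group_relative_advantages([0.0, 0.0, 0.0]))
```

The second call illustrates the sparsity problem that motivates dense supervision: when all rollouts in a group receive the same reward, every outcome advantage is zero.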

##### Self\-distillation and skill\-distillation methods\.

- OPSD (Zhao et al., [2026](https://arxiv.org/html/2606.26790#bib.bib3)). On-Policy Self-Distillation instantiates a student and a teacher from the same underlying model but gives them different conditioning contexts. The student samples trajectories on-policy from the ordinary task context, whereas the teacher additionally receives training-only privileged information, such as a verified solution or an equivalent auxiliary context. For every prefix of the student’s own trajectory, the teacher re-scores the next-token distribution and provides a dense token-level target through full-vocabulary or sampled-token distribution matching. Gradients are applied to the student side while the teacher distribution is treated as a stop-gradient target, and the privileged teacher context is absent at inference time.
- GRPO+OPSD. This is a direct multi-objective combination of the sequence-level GRPO loss and the token-level OPSD loss. The outcome term reinforces or penalizes complete trajectories according to environment feedback, while the distillation term supplies local guidance at individual token positions. The two losses are simply combined, making this baseline a controlled test of whether naively adding dense self-distillation to outcome-based RL is sufficient.
- Skill-SD (Wang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib16)). This method adapts self-distillation to multi-turn agent tasks. Completed trajectories are summarized into compact natural-language skills that record successful behaviors, common failure modes, and reusable high-level workflows. During training, a retrieved skill conditions only the teacher branch, while the student continues to generate on-policy trajectories from the plain task prompt; the student must therefore internalize the teacher-side guidance rather than rely on the skill at test time.
- RLSD (Yang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib17)). RLSD uses a privileged self-teacher for fine-grained credit assignment without directly optimizing a teacher–student distribution-matching loss. It converts the token-wise teacher–student log-probability gap into a bounded weight that modulates the magnitude of each token’s GRPO update, while the sign and direction of the update remain anchored to the environment-derived outcome advantage. Thus, privileged information can indicate where a larger or smaller update is useful, but it does not decide whether a sampled token should be reinforced or penalized. In the original formulation, the self-distillation contribution is strongest early in training and is scheduled to decay toward vanilla GRPO, combining early dense guidance with a stable outcome-optimized training phase.
- SDAR (Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)). SDAR keeps verifier-driven GRPO as the primary optimization backbone and adds a separately gated self-distillation objective for multi-turn agents. A teacher branch receives training-only privileged context, such as a retrieved skill, and re-scores the student’s on-policy tokens; a smooth, bounded token-level gate then controls how strongly each teacher signal enters the auxiliary loss. The gate can use student uncertainty and/or the detached teacher–student log-probability gap, giving greater weight to positive teacher endorsements while softly attenuating potentially unreliable negative rejections. Unlike RLSD, SDAR leaves the GRPO advantage itself unchanged and regulates the auxiliary distillation loss instead; the student is evaluated without privileged skill context.

For all reproduced post\-training baselines, we use the same backbone and environment wrappers as OPID and match the rollout budget, task batch, number of training steps, and evaluation protocol whenever applicable\. The intended differences are restricted to the optimization signal and to the explicitly stated availability of skills or other privileged training context\.

### B.3 Algorithm and Extracted Skill Examples

Algorithm [1](https://arxiv.org/html/2606.26790#alg1) gives the full OPID training procedure, including on-policy rollout collection, hierarchical skill extraction, critical-first routing, paired scoring, and clipped policy optimization. Table [5](https://arxiv.org/html/2606.26790#A2.T5) provides representative skills extracted from successful and failed trajectories across ALFWorld, WebShop, and Search-based QA. These examples illustrate how episode-level skills capture reusable global workflows, while critical-step skills focus on sparse local decisions that influence the final outcome.

Algorithm 1 OPID: On-Policy Skill Distillation

1: Policy $\pi_{\theta}$, task set $\mathcal{Q}$, analyzer $\mathcal{A}$, skill-injection function $H$, group size $N$, skill coefficient $\lambda_{\mathrm{skill}}$, clipping parameter $\epsilon$, learning rate $\eta$

2: for each training iteration do

3: $\theta_{\mathrm{old}}\leftarrow\theta$

4: Sample a batch of task prompts $\mathcal{B}$ from $\mathcal{Q}$

5: for each prompt $q\in\mathcal{B}$ do

6: // On-policy rollout group and episode advantage

7: Sample $\mathcal{G}_q\leftarrow\{\tau^{(1)},\ldots,\tau^{(N)}\}$, where $\tau^{(i)}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)$

8: $\mathbf{r}_q\leftarrow\{R(\tau^{\prime})\mid\tau^{\prime}\in\mathcal{G}_q\}$; $\mu_q\leftarrow\operatorname{mean}(\mathbf{r}_q)$; $\sigma_q\leftarrow\operatorname{std}(\mathbf{r}_q)$

9: for each trajectory $\tau\in\mathcal{G}_q$ do

10: $A^{\mathrm{ep}}_{\tau}\leftarrow\bigl(R(\tau)-\mu_q\bigr)/\sigma_q$

11: // Hierarchical hindsight skill extraction

12: $\left(s^{\mathrm{ep}}_{\tau},\{s^{\mathrm{step}}_{\tau,t}\}_{t\in\mathcal{C}_{\tau}}\right)\leftarrow\mathcal{A}(\tau)$

13: // Critical-first routing and paired scoring

14: for each interaction step $t$ in $\tau$ do

15: $s_{\tau,t}\leftarrow\begin{cases}s^{\mathrm{step}}_{\tau,t},&t\in\mathcal{C}_{\tau},\\ s^{\mathrm{ep}}_{\tau},&\text{otherwise}\end{cases}$

16: $\tilde{h}_{\tau,t}\leftarrow H(h_{\tau,t},s_{\tau,t})$

17: for each token $\ell$ in $y_{\tau,t}$ with mask $m_{\tau,t,\ell}$ do

18: $\ell^{\mathrm{old}}_{\tau,t,\ell}\leftarrow\log\pi_{\theta_{\mathrm{old}}}(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell})$

19: $\ell^{\mathrm{skill}}_{\tau,t,\ell}\leftarrow\log\pi_{\theta_{\mathrm{old}}}(y_{\tau,t,\ell}\mid\tilde{h}_{\tau,t},y_{\tau,t,<\ell})$

20: $A^{\mathrm{skill}}_{\tau,t,\ell}\leftarrow(\ell^{\mathrm{skill}}_{\tau,t,\ell}-\ell^{\mathrm{old}}_{\tau,t,\ell})\,m_{\tau,t,\ell}$

21: $A^{\mathrm{ep}}_{\tau,t,\ell}\leftarrow A^{\mathrm{ep}}_{\tau}\,m_{\tau,t,\ell}$

22: $A^{\mathrm{OPID}}_{\tau,t,\ell}\leftarrow A^{\mathrm{ep}}_{\tau,t,\ell}+\lambda_{\mathrm{skill}}A^{\mathrm{skill}}_{\tau,t,\ell}$

23: end for

24: end for

25: end for

26: end for

27: // Clipped policy optimization

28: For every valid sampled token $(\tau,t,\ell)$, compute

29: $\rho_{\tau,t,\ell}(\theta)\leftarrow\exp\!\left(\log\pi_{\theta}(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell})-\log\pi_{\theta_{\mathrm{old}}}(y_{\tau,t,\ell}\mid h_{\tau,t},y_{\tau,t,<\ell})\right)$

30: $\mathcal{L}_{\mathrm{policy}}(\theta)\leftarrow-\mathbb{E}_{\tau,t,\ell}\!\left[\min\!\left(\rho_{\tau,t,\ell}(\theta)A^{\mathrm{OPID}}_{\tau,t,\ell},\operatorname{clip}(\rho_{\tau,t,\ell}(\theta),1-\epsilon,1+\epsilon)A^{\mathrm{OPID}}_{\tau,t,\ell}\right)\right]$

31: $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}_{\mathrm{policy}}(\theta)$

32: end for

Table 5: Hierarchical skills extracted from on-policy trajectories. For each dataset, we show one successful and one failed trajectory. Episode-level skills summarize reusable global behavior, while critical-step skills target sparse decision points. Step indices are 0-based analyzer keys.

### B.4 Implementation Details

##### Metrics\.

For ALFWorld, we compute the success rate for each task type and report their macro-average:

$$
\mathrm{ALFWorld\text{-}Avg} = \frac{1}{6}\sum_{c=1}^{6}\mathrm{SR}_c. \tag{20}
$$

For Search-based QA, we compute answer accuracy separately on each of the seven datasets and report the unweighted macro-average:

$$
\mathrm{Search\text{-}Avg} = \frac{1}{7}\sum_{d=1}^{7}\mathrm{Acc}_d. \tag{21}
$$

For WebShop, the reported Score is the mean normalized task score returned by the environment, multiplied by 100, and Succ. is the percentage of tasks with exact success.
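
The three metric definitions translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the example inputs are placeholders rather than results from the paper.

```python
from typing import Dict, List, Tuple


def macro_average(per_group: Dict[str, float]) -> float:
    """Unweighted mean over groups (task types for ALFWorld, datasets for Search QA)."""
    return sum(per_group.values()) / len(per_group)


def webshop_metrics(scores: List[float], successes: List[bool]) -> Tuple[float, float]:
    """Return (Score, Succ.): mean normalized score x 100 and exact-success percentage."""
    score = 100.0 * sum(scores) / len(scores)
    succ = 100.0 * sum(successes) / len(successes)
    return score, succ


# Placeholder inputs, not numbers reported in the paper.
print(macro_average({"Pick": 1.0, "Look": 0.5}))
print(webshop_metrics([0.5, 1.0], [False, True]))
```

Because the average is unweighted, a task type or dataset with few evaluation episodes counts as much as one with many.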

##### Trajectory analyzer\.

After each on-policy episode terminates, we serialize the task prompt, step-indexed observations, policy responses/actions, environment feedback, and terminal outcome into an ordered trajectory record. An LLM-based analyzer then maps this record to one episode-level skill and a sparse set of critical-step skills. Step indices are zero-based, consistent with Table [5](https://arxiv.org/html/2606.26790#A2.T5). By default, we use GLM-5.2 (Z.ai, [2026](https://arxiv.org/html/2606.26790#bib.bib52)) as the analyzer, with temperature set to 0.4 and maximum output length set to 4096. We cap the number of identified critical steps at 5 for ALFWorld and WebShop, and at 2 for Search-based QA.
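
A trajectory record of the kind described above might be serialized as in the following sketch. The `Step` fields, the text layout, and the way the cap is enforced (keeping the earliest critical steps) are all assumptions made for illustration; the source does not specify the record schema or how the limit is applied, and the analyzer call itself is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    observation: str
    action: str
    feedback: str


def serialize_trajectory(task: str, steps: List[Step], outcome: str) -> str:
    """Flatten one finished episode into an ordered, step-indexed text record."""
    lines = [f"Task: {task}"]
    for t, step in enumerate(steps):  # zero-based step indices
        lines.append(f"[step {t}] observation: {step.observation}")
        lines.append(f"[step {t}] action: {step.action}")
        lines.append(f"[step {t}] feedback: {step.feedback}")
    lines.append(f"Outcome: {outcome}")
    return "\n".join(lines)


def cap_critical_steps(step_skills: Dict[int, str], max_steps: int) -> Dict[int, str]:
    """Keep at most max_steps critical-step skills (earliest first, an arbitrary choice)."""
    kept = sorted(step_skills)[:max_steps]
    return {t: step_skills[t] for t in kept}


record = serialize_trajectory(
    "put a mug on the desk",
    [Step("You see a mug.", "take mug", "You pick up the mug.")],
    "success",
)
print(record)
print(cap_critical_steps({3: "check the drawer", 1: "open it first", 7: "stop"}, 2))
```

The keys returned by `cap_critical_steps` play the role of the critical set $\mathcal{C}_{\tau}$ that the routing rule consults.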

##### Backbones and training schedule\.

We use Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.26790#bib.bib30)), as well as Qwen3-1.7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.26790#bib.bib31)). All models are trained for 150 update steps. The training batch size reported in the main paper is 16 for ALFWorld and WebShop and 128 for Search-based QA. Table [6](https://arxiv.org/html/2606.26790#A2.T6) records the remaining hyperparameters that are required for exact reproduction.

Table 6: RL training hyperparameters.

##### Computing details.

Training is conducted on 8 NVIDIA A800 80GB GPUs.

## Appendix C Supplementary Results

### C.1 Detailed Sample Efficiency Comparison

Table [7](https://arxiv.org/html/2606.26790#A3.T7) reports the ALFWorld success rate when only a fraction of the training data is used. OPID consistently improves over GRPO across all data budgets. The gains are especially large in the low- and mid-data regimes, reaching +15.6 points with 60% of the data and +20.3 points with 80% of the data. These results suggest that trajectory-derived hindsight skills allow OPID to extract more supervision from each rollout, making outcome-based RL less dependent on large numbers of environment interactions.

Table 7: Sample efficiency comparison on ALFWorld. We report success rates under different fractions of the training data. The $\Delta$ row shows the absolute improvement of OPID over GRPO, indicating that OPID provides stronger gains especially in low- and mid-data regimes.

### C.2 Cross-Domain Generalization

Table [8](https://arxiv.org/html/2606.26790#A3.T8) evaluates transfer to the ALFWorld unseen split. OPID improves the average success rate over GRPO by +7.7 points, with particularly clear gains on Look and Heat. This indicates that OPID does not merely fit the observed training trajectories. Instead, the distilled episode-level workflows and step-level decision rules retain value under unseen environment configurations.

Table 8: Cross-domain generalization results on ALFWorld Unseen. We report success rates across six unseen task types and their average. OPID improves the average success rate over GRPO, indicating that trajectory-derived skill supervision transfers beyond the training environments.

### C.3 Training Diagnostics and Skill Extraction Patterns

Figures [7](https://arxiv.org/html/2606.26790#A3.F7)–[9](https://arxiv.org/html/2606.26790#A5.F9) provide additional diagnostics for the OPID training pipeline. Figure [7](https://arxiv.org/html/2606.26790#A3.F7) reports the average number of critical steps identified on ALFWorld, illustrating that OPID applies step-level supervision sparsely rather than assigning local skills to every decision. Figure [8](https://arxiv.org/html/2606.26790#A3.F8) further visualizes the training advantage dynamics, complementing the main-paper training curves and showing how OPID reshapes the learning signal during policy optimization. Figure [9](https://arxiv.org/html/2606.26790#A5.F9) shows the analyzer prompt used to convert completed trajectories into hierarchical skills.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x8.png)

Figure 7: Average critical steps per sequence on ALFWorld. The curve reports how many timesteps are selected by the analyzer for step-level hindsight skills in each trajectory. The relatively small number of critical steps indicates that OPID applies local skill supervision selectively, while relying on episode-level skills as default guidance for non-critical decisions.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x9.png)

Figure 8: Magnitudes of episode-level and skill-guided advantage signals during OPID training. Episode abs advantage measures the mean absolute advantage from group-relative outcome rewards, while skill abs advantage measures the mean absolute advantage induced by skill-guided log-probability shifts. The comparison shows how OPID combines sparse trajectory-level feedback with dense skill-conditioned supervision throughout optimization.

## Appendix D Case Study

Figures [10](https://arxiv.org/html/2606.26790#A5.F10)–[15](https://arxiv.org/html/2606.26790#A5.F15) provide illustrative examples from the ALFWorld, Search-QA, and WebShop benchmarks.

## Appendix E Additional Discussion

OPID studies how completed on-policy trajectories can be reused as hindsight supervision for long-horizon agentic reinforcement learning. A natural next step is to evaluate this idea in broader interactive environments where agents must discover latent rules, maintain long-term state, and adapt through extended interaction. Benchmarks such as OdysseyArena (Xu et al., [2026](https://arxiv.org/html/2606.26790#bib.bib8)), AgentBench (Liu et al., [2023](https://arxiv.org/html/2606.26790#bib.bib55)), WebArena (Zhou et al., [2023](https://arxiv.org/html/2606.26790#bib.bib56)), Mind2Web (Deng et al., [2023](https://arxiv.org/html/2606.26790#bib.bib57)), and VisualWebArena (Koh et al., [2024](https://arxiv.org/html/2606.26790#bib.bib58)) provide complementary stress tests beyond the embodied, shopping, and search-based settings considered in this paper. These environments would test whether trajectory-derived hindsight skills remain useful when the agent must handle longer horizons, richer interfaces, and more open-ended forms of exploration.

Another direction is to enrich the structure of hindsight skills. OPID currently extracts episode-level and step-level skills from completed trajectories and routes them according to decision criticality. Future work could combine this on-policy extraction with higher-level reasoning abstractions, such as search-discovered reasoning patterns or reusable thought structures (Wu et al., [2024](https://arxiv.org/html/2606.26790#bib.bib5)), and with policy-aware exploration mechanisms developed for long-horizon agent learning (Wu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib12); Lu et al., [2026a](https://arxiv.org/html/2606.26790#bib.bib11)). Such extensions may allow agents to aggregate skills across trajectories, identify recurring failure modes, and form more compositional behavioral rules while preserving OPID’s key design choice: skills are used to shape training, not retrieved as privileged context at inference time.

Finally, OPID opens several deployment-oriented directions. Since the analyzer and skill-conditioned scoring are used only during training, the learned policy incurs no additional inference-time skill retrieval cost. Nevertheless, the training pipeline can still benefit from more efficient inference and scoring mechanisms. Speculative and retrieval-parallel decoding methods such as Double (Shen et al., [2026](https://arxiv.org/html/2606.26790#bib.bib53)) may reduce the cost of repeated model scoring during skill-conditioned distillation. In parallel, extending OPID to more perceptual and embodied settings, including active embodied intelligence benchmarks such as RobotEQ (Fang et al., [2026](https://arxiv.org/html/2606.26790#bib.bib54)), could test whether hindsight skill supervision helps agents acquire not only task completion strategies, but also socially and spatially grounded decision rules.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x10.png)

Figure 9: Prompt of analyzer.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x11.png)

Figure 10: A full trajectory of OPID on ALFWorld Example 1.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x12.png)

Figure 11: A full trajectory of OPID on ALFWorld Example 2.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x13.png)

Figure 12: A full trajectory of OPID on Search-QA Example 1.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x14.png)

Figure 13: A full trajectory of OPID on Search-QA Example 2.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x15.png)

Figure 14: A full trajectory of OPID on WebShop Example 1.

![Refer to caption](https://arxiv.org/html/2606.26790v1/x16.png)

Figure 15: A full trajectory of OPID on WebShop Example 2.
