ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
Summary
The paper introduces ATOD, a hybrid online distillation algorithm combining on-policy distillation and reinforcement learning for training small language model agents in multi-turn tasks, featuring an annealed OPD-RL schedule and Turn-level Disagreement-Uncertainty Reweighting to improve dense supervision.
# ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
Source: [https://arxiv.org/html/2606.27814](https://arxiv.org/html/2606.27814)
Qitai Tan¹,², Zefang Zong¹, Yang Li¹, Peng Chen¹

¹Tencent Inc. ²Tsinghua University

tqt24@mails.tsinghua.edu.cn, {willzong,thomasyngli}@tencent.com
###### Abstract
Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling. Reinforcement learning (RL) directly optimizes environment rewards and encourages exploratory improvement toward a higher reward-defined ceiling, but sparse and delayed feedback makes early-stage learning much less efficient than OPD. In this paper, we propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity. (1) ATOD uses an annealed OPD–RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. (2) ATOD introduces Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which softly amplifies high-utility turns and improves dense supervision in long trajectories. Experiments on ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms competing post-training baselines: across the three student sizes, ATOD improves average success rate by 3.03 points over OPD and 23.62 points over GRPO, while surpassing the corresponding teacher models by 2.16 points.
Figure 1: Aggregate Qwen3-1.7B performance across ALFWorld, Search-QA, and WebShop.

## 1 Introduction
Language agents extend large language models from static text generation to interactive decision making, where models must observe states, choose actions, call tools, and revise plans over multiple turns\[[1](https://arxiv.org/html/2606.27814#bib.bib1),[2](https://arxiv.org/html/2606.27814#bib.bib2)\]\. Such capabilities make LLMs applicable to embodied instruction following, web shopping, search\-augmented question answering, code execution, and other real\-world tasks\[[3](https://arxiv.org/html/2606.27814#bib.bib3),[4](https://arxiv.org/html/2606.27814#bib.bib4),[5](https://arxiv.org/html/2606.27814#bib.bib5)\]\. However, strong agentic behavior is usually concentrated in large models, whose inference cost and deployment overhead are undesirable in latency\-, privacy\-, and resource\-sensitive settings\. A practical goal is therefore to transfer multi\-turn agentic competence from a stronger teacher to a smaller student, while still enabling the student to seek reward\-driven improvements beyond pure teacher imitation\.
Reinforcement learning \(RL\) has been widely used for post\-training language agents\. It optimizes the policy with environment\-defined rewards, typically through PPO\- or GRPO\-style policy optimization\[[6](https://arxiv.org/html/2606.27814#bib.bib6),[7](https://arxiv.org/html/2606.27814#bib.bib7),[8](https://arxiv.org/html/2606.27814#bib.bib8),[9](https://arxiv.org/html/2606.27814#bib.bib9)\], with recent agent\-oriented variants adapting this idea to long\-horizon interaction\[[10](https://arxiv.org/html/2606.27814#bib.bib10),[11](https://arxiv.org/html/2606.27814#bib.bib11),[12](https://arxiv.org/html/2606.27814#bib.bib12),[13](https://arxiv.org/html/2606.27814#bib.bib13)\]\. More recently, on\-policy distillation \(OPD\) has emerged as another promising paradigm: it trains the student on its own sampled trajectories while using a stronger teacher’s token\-level distribution as dense supervision\[[14](https://arxiv.org/html/2606.27814#bib.bib14),[15](https://arxiv.org/html/2606.27814#bib.bib15),[16](https://arxiv.org/html/2606.27814#bib.bib16),[17](https://arxiv.org/html/2606.27814#bib.bib17),[18](https://arxiv.org/html/2606.27814#bib.bib18),[19](https://arxiv.org/html/2606.27814#bib.bib19)\]\.
However, RL and OPD exhibit complementary trade-offs in multi-turn agent training. RL directly pursues environment rewards and can improve beyond imitation, but sparse delayed feedback makes early exploration inefficient for small students. OPD provides dense token-level guidance and bootstraps learning quickly, but strong imitation can plateau near the teacher and suppress reward-improving deviations. This motivates an annealed OPD–RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. The validation dynamics in [Figure 2(a)](https://arxiv.org/html/2606.27814#S1.F2.sf1) illustrate this complementarity: OPD improves quickly but saturates, GRPO grows slowly under sparse rewards, and the annealed hybrid preserves the fast start while reaching a higher ceiling.
Figure 2: Motivation for ATOD. (a) Training dynamics of ATOD, OPD, and GRPO: compared with OPD and GRPO, ATOD preserves OPD's fast bootstrap while avoiding its plateau and GRPO's slow sparse-reward learning. (b) T-DUR turn-utility weighting: T-DUR uses teacher–student divergence and student entropy to assign larger weights to high-utility turns.

A second challenge concerns where dense teacher supervision should be applied within long agent trajectories. In multi-turn agent tasks, different turns can have very different learning value: prior agentic OPD studies show that inter-turn error compounding can push later states outside the teacher's reliable support, making uniform trajectory-level supervision unstable, and that distillation strength should be adapted at the step level \[[20](https://arxiv.org/html/2606.27814#bib.bib20),[21](https://arxiv.org/html/2606.27814#bib.bib21)\]. Meanwhile, token-importance studies in OPD suggest that student uncertainty and teacher–student disagreement identify informative distillation positions \[[22](https://arxiv.org/html/2606.27814#bib.bib22),[16](https://arxiv.org/html/2606.27814#bib.bib16)\]. We therefore lift this idea from tokens to turns, because a turn is the natural decision unit of an agent trajectory. As visualized in [Figure 2(b)](https://arxiv.org/html/2606.27814#S1.F2.sf2), entropy and disagreement vary substantially across turns; T-DUR estimates these signals from the log-probabilities of sampled tokens, avoiding full-vocabulary teacher sampling while efficiently assigning larger OPD weights to high-utility decision turns.
We propose ATOD, short for *Annealed Turn-aware On-policy Distillation*, to address these two challenges with an annealed OPD–RL schedule and Turn-level Disagreement-Uncertainty Reweighting (T-DUR), as illustrated in [Figure 3](https://arxiv.org/html/2606.27814#S1.F3). Together, these components improve early convergence, stabilize dense supervision, and raise the reward-driven performance ceiling. Our contributions are three-fold:
- We introduce an annealed OPD–RL training objective for multi-turn agent training. By letting OPD dominate early and RL dominate later, ATOD addresses the cold-start inefficiency of sparse-reward RL while preserving reward-driven improvement beyond pure teacher imitation.
- We propose Turn-level Disagreement-Uncertainty Reweighting (T-DUR), a soft distillation weighting mechanism for agentic OPD. T-DUR estimates turn-level distillation utility from student uncertainty and teacher–student disagreement, reducing wasted supervision on routine turns and emphasizing high-value decision points.
- We validate ATOD on ALFWorld, WebShop, and Search-QA across multiple student sizes. ATOD consistently outperforms competing post-training baselines, improves over both OPD and GRPO, and can surpass the corresponding teacher models.
Figure 3: Overview of ATOD. (a) Annealed guidance shifts from teacher-guided bootstrapping to reward-driven exploration. (b) T-DUR uses disagreement and uncertainty to upweight key turns, leaving rewards unchanged.
## 2 Related Work
#### Language agents and tool\-integrated reasoning\.
Language agents combine reasoning with action, enabling LLMs to interact with tools, environments, and external feedback\[[1](https://arxiv.org/html/2606.27814#bib.bib1),[2](https://arxiv.org/html/2606.27814#bib.bib2)\]\. Benchmarks such as ALFWorld, WebShop, and search\-augmented QA require multi\-step decision making under partial observations and delayed rewards\[[3](https://arxiv.org/html/2606.27814#bib.bib3),[4](https://arxiv.org/html/2606.27814#bib.bib4),[5](https://arxiv.org/html/2606.27814#bib.bib5)\]\. These settings differ from single\-turn text generation because errors can propagate across turns and because many actions are routine while a few are decisive\.
#### Reinforcement learning for language models\.
Policy optimization methods, including PPO and GRPO, have been widely used to improve language\-model reasoning and alignment\[[6](https://arxiv.org/html/2606.27814#bib.bib6),[7](https://arxiv.org/html/2606.27814#bib.bib7),[8](https://arxiv.org/html/2606.27814#bib.bib8),[9](https://arxiv.org/html/2606.27814#bib.bib9)\]\. Agent\-specific RL methods further address long\-horizon interaction through hierarchical multi\-turn optimization, online curricula, group\-in\-group credit assignment, and adaptive exploration\[[10](https://arxiv.org/html/2606.27814#bib.bib10),[11](https://arxiv.org/html/2606.27814#bib.bib11),[12](https://arxiv.org/html/2606.27814#bib.bib12),[13](https://arxiv.org/html/2606.27814#bib.bib13)\]\. GRPO is attractive because it estimates relative advantages within sampled groups without requiring a separate value model\. However, in long\-horizon agent tasks, sparse terminal rewards and high\-variance exploration can make pure RL inefficient, especially for small models\.
#### On\-policy distillation\.
OPD trains the student on its own generated trajectories while using a teacher distribution as dense supervision\[[14](https://arxiv.org/html/2606.27814#bib.bib14),[15](https://arxiv.org/html/2606.27814#bib.bib15)\]\. This reduces the train–test mismatch of offline supervised distillation and improves sample efficiency\. Recent OPD variants improve stability, context transfer, entropy handling, or teacher\-ceiling behavior\[[16](https://arxiv.org/html/2606.27814#bib.bib16),[18](https://arxiv.org/html/2606.27814#bib.bib18),[17](https://arxiv.org/html/2606.27814#bib.bib17),[19](https://arxiv.org/html/2606.27814#bib.bib19)\]\. For tool\-using or multi\-turn agents, recent work has explored step\-wise, temporal, self\-distilled, skill\-conditioned, or importance\-aware distillation to mitigate unreliable or redundant teacher signals\[[21](https://arxiv.org/html/2606.27814#bib.bib21),[20](https://arxiv.org/html/2606.27814#bib.bib20),[23](https://arxiv.org/html/2606.27814#bib.bib23),[24](https://arxiv.org/html/2606.27814#bib.bib24),[25](https://arxiv.org/html/2606.27814#bib.bib25),[22](https://arxiv.org/html/2606.27814#bib.bib22)\]\. ATOD is complementary: it combines OPD with reward\-driven GRPO through a dynamic schedule and uses T\-DUR to softly gate only the distillation component\.
## 3 Preliminaries
### 3.1 Multi-Turn Agent Trajectories
We consider post-training a small language-model agent for multi-turn interactive tasks. Given an input $x$, the student policy $\pi_{\theta}$ interacts with external feedback over $K$ steps. At step $k$, the model generates a response $a_k$, which may contain reasoning, an environment action or tool invocation, or a final answer. If an external action is taken, an observation $o_k$ is appended to the context and conditions subsequent generations.
A trajectory is defined as

$$\tau = (x, a_1, o_1, \ldots, a_K, o_K, a_{K+1}), \tag{1}$$

where $a_{K+1}$ denotes the final response. The policy generates only model tokens, while observations are provided by the environment. Let $a_t$ denote a generated token and $s_t$ its prefix context, which may include both previous model outputs and environment observations.
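As a rough illustration of this layout, the sketch below flattens a trajectory into one token sequence, together with a mask marking the positions generated by the policy and a turn index for each of those positions. The names `Turn`, `flatten`, `is_model`, and `turn_id` are hypothetical and do not come from the paper, and tokens are plain strings for readability.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One step k: the model response a_k and, if present, the observation o_k."""
    response: List[str]
    observation: Optional[List[str]] = None  # None for the final response a_{K+1}


def flatten(prompt: List[str], turns: List[Turn]) -> Tuple[List[str], List[bool], List[int]]:
    """Return (tokens, is_model, turn_id) for a whole trajectory.

    is_model[t] is True only for tokens generated by the policy.
    turn_id[t] is the turn index for model tokens and -1 elsewhere.
    """
    tokens = list(prompt)
    is_model = [False] * len(prompt)
    turn_id = [-1] * len(prompt)
    for k, turn in enumerate(turns):
        tokens += turn.response
        is_model += [True] * len(turn.response)
        turn_id += [k] * len(turn.response)
        if turn.observation is not None:
            tokens += turn.observation
            is_model += [False] * len(turn.observation)
            turn_id += [-1] * len(turn.observation)
    return tokens, is_model, turn_id


tokens, is_model, turn_id = flatten(
    ["task"],
    [Turn(["look", "around"], ["you", "see", "a", "desk"]), Turn(["done"])],
)
print(list(zip(tokens, is_model, turn_id)))
```

Under these assumptions, only positions with `is_model` set would enter a training loss, which mirrors the statement that observations come from the environment rather than from the policy.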
### 3.2 Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that updates the policy using relative rewards within a group of sampled trajectories \[[7](https://arxiv.org/html/2606.27814#bib.bib7)\]. We assume access to an outcome-level reward function $R(\tau)$ defined on complete trajectories. For each input $x$, a group of trajectories $\{\tau_i\}_{i=1}^{G}$ is sampled from the old policy $\pi_{\theta_{\mathrm{old}}}$, each receiving reward $r_i = R(\tau_i)$. The group-relative advantage is computed as

$$\hat{A}_i^{\mathrm{GRPO}} = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G}) + \epsilon_A}. \tag{2}$$

Let

$$\eta_{i,t}(\theta) = \frac{\pi_{\theta}(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})} \tag{3}$$

denote the token-level importance ratio. The GRPO objective is

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_{x}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\mathcal{T}_i|}\sum_{t\in\mathcal{T}_i}\min\left(\eta_{i,t}(\theta)\hat{A}_i^{\mathrm{GRPO}},\; \mathrm{clip}(\eta_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i^{\mathrm{GRPO}}\right)\right], \tag{4}$$

where $\mathcal{T}_i$ denotes the model-generated token positions in trajectory $\tau_i$. Equivalently, each generated token in $\tau_i$ receives $A_t^{\mathrm{GRPO}} = \hat{A}_i^{\mathrm{GRPO}}$. This objective provides an on-policy learning signal based on relative trajectory performance, but it relies on sparse outcome-level rewards.
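The group-relative advantage and the clipped surrogate can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and the defaults `eps_a=1e-6` and `eps=0.2` are placeholder values.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps_a=1e-6):
    """Group-relative advantage per trajectory (Equation 2)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps_a) for r in rewards]


def clipped_term(ratio, adv, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


def grpo_loss(new_logp, old_logp, advantages, eps=0.2):
    """Equation 4 for one input x.

    new_logp and old_logp hold one list of token log-probabilities per
    trajectory, restricted to model-generated positions.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        terms = [clipped_term(math.exp(n - o), adv, eps) for n, o in zip(lp_new, lp_old)]
        total += sum(terms) / len(terms)
    return -total / len(advantages)


rewards = [1.0, 0.0, 0.0, 1.0]  # sparse outcome rewards for a group of G = 4
print(grpo_advantages(rewards))
```

When every trajectory in a group receives the same reward, the numerator of Equation 2 is zero for all of them, so the group contributes no learning signal. This is one concrete way the sparse-reward limitation shows up.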
### 3.3 On-Policy Distillation
On-policy distillation (OPD) is a post-training paradigm that provides dense token-level supervision on student-generated trajectories by aligning the student policy with a frozen teacher distribution \[[14](https://arxiv.org/html/2606.27814#bib.bib14),[15](https://arxiv.org/html/2606.27814#bib.bib15)\]. Given trajectories $\{\tau_i\}_{i=1}^{G}$ sampled from the old student policy, the OPD objective can be written as

$$\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\mathcal{T}_i|}\sum_{t\in\mathcal{T}_i}\eta_{i,t}(\theta)\left(\log\pi_{\theta}(a_{i,t} \mid s_{i,t}) - \log\pi_{\mathrm{T}}(a_{i,t} \mid s_{i,t})\right)\right], \tag{5}$$

where $\mathcal{T}_i$ denotes the model-generated token positions in trajectory $\tau_i$, and $\eta_{i,t}(\theta)$ is the same importance ratio as in [Equation 3](https://arxiv.org/html/2606.27814#S3.E3). This objective is a sampled estimator of the reverse KL divergence from the student policy to the teacher policy on student-visited states.
Equivalently, for a generated token $a_t$, OPD provides the token-level distillation signal

$$\Delta\log p_t = \log\pi_{\mathrm{T}}(a_t \mid s_t) - \log\pi_{\theta}(a_t \mid s_t), \tag{6}$$

which is positive when the teacher assigns higher probability to the sampled token than the student. In advantage form, pure OPD uses

$$A^{\mathrm{OPD}}_t = \Delta\log p_t. \tag{7}$$
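The relationship between the loss form and the advantage form can be checked numerically. The sketch below evaluates the token-level signal $\Delta\log p_t$ and the single-trajectory OPD loss from sampled-token log-probabilities; the function names and the toy probabilities are illustrative assumptions.

```python
import math


def opd_advantages(teacher_logp, student_logp):
    """Token-level signal: log pi_T(a_t | s_t) - log pi_theta(a_t | s_t)."""
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def opd_loss(new_logp, old_logp, teacher_logp):
    """Single-trajectory OPD loss: mean over tokens of eta * (log pi_theta - log pi_T)."""
    terms = []
    for n, o, t in zip(new_logp, old_logp, teacher_logp):
        eta = math.exp(n - o)  # importance ratio pi_theta / pi_theta_old
        terms.append(eta * (n - t))
    return sum(terms) / len(terms)


# Toy probabilities of the sampled tokens under student and teacher.
student = [math.log(0.2), math.log(0.5)]
teacher = [math.log(0.4), math.log(0.25)]

adv = opd_advantages(teacher, student)
print(adv)  # first token: teacher more confident (positive); second: less (negative)
print(opd_loss(student, student, teacher))
```

With the new and old student policies equal, the importance ratio is 1 and the loss reduces to the negative mean of the token-level signal, which is why the loss and advantage forms describe the same update direction.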
## 4 Method
### 4.1 Overview
ATOD combines two complementary learning signals\. The teacher provides dense token\-level guidance through OPD, which is especially useful when the student is still weak\. The environment reward provides the task\-defining signal through GRPO, which is essential for correcting teacher bias and improving beyond imitation\. Instead of choosing one signal, ATOD uses both in a single token\-level advantage:
$$\boxed{A_t = \kappa(s)\, A^{\mathrm{OPD}}_t + \rho(s)\, A^{\mathrm{GRPO}}_t,} \tag{8}$$

where

$$A^{\mathrm{OPD}}_t = \Delta\log p_t\, w_{k(t)} = \left(\log\pi_{\mathrm{T}}(a_t \mid s_t) - \log\pi_{\theta}(a_t \mid s_t)\right) w_{k(t)}. \tag{9}$$

Here $s$ is the global training step, and $k(t)$ maps token position $t$ to the turn that contains it. The coefficient $\kappa(s)$ controls how much the update follows the teacher, while $\rho(s)$ controls how much it follows the environment reward. The scalar $w_{k(t)}$ is the T-DUR weight of token $t$'s turn. It only reweights the OPD term, so it changes where teacher supervision is applied without weakening the reward signal. The actor is then optimized with the same clipped surrogate as GRPO in [Equation 4](https://arxiv.org/html/2606.27814#S3.E4), using $A_t$ as the advantage. We do not add an explicit KL penalty; the teacher effect is already included through $A^{\mathrm{OPD}}_t$.
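A minimal sketch of how the hybrid advantage could be assembled for one trajectory is given below. The function and argument names are hypothetical; `turn_of_token` plays the role of $k(t)$, and the numeric values are arbitrary.

```python
def hybrid_advantages(delta_logp, turn_of_token, turn_weights, grpo_adv, kappa, rho):
    """Per-token hybrid advantage: kappa * (delta_logp * w_k(t)) + rho * A_GRPO.

    delta_logp:     teacher-minus-student log-probability for each model token
    turn_of_token:  index of the turn containing each token, i.e. k(t)
    turn_weights:   T-DUR weight w_k for each turn
    grpo_adv:       trajectory-level GRPO advantage shared by all tokens
    """
    out = []
    for dlp, k in zip(delta_logp, turn_of_token):
        a_opd = dlp * turn_weights[k]  # T-DUR touches only the OPD term
        out.append(kappa * a_opd + rho * grpo_adv)
    return out


advantages = hybrid_advantages(
    delta_logp=[0.5, -0.2, 1.0],
    turn_of_token=[0, 0, 1],
    turn_weights=[0.5, 1.0],
    grpo_adv=1.0,
    kappa=1.0,
    rho=0.5,
)
print(advantages)
```

Setting a turn weight to zero removes the teacher term for that turn while leaving the reward term `rho * grpo_adv` untouched, which matches the statement that T-DUR does not weaken the reward signal.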
### 4.2 Dynamic OPD/RL Coefficient Annealing
A fixed OPD/RL mixture is suboptimal because the two signals are useful at different stages\. Early in training, rewards are sparse and noisy, while teacher guidance quickly teaches the student valid actions and interaction patterns\. Later, pure imitation becomes limiting: the student may inherit teacher mistakes and cannot easily exceed the teacher\. Therefore, ATOD gradually shifts the update from teacher\-guided learning to reward\-guided learning\.
We use a simple progress variable

$$p(s) = \min\left(\frac{s}{T}, 1\right), \qquad T = \texttt{coef\_anneal\_steps}, \tag{10}$$

which increases from $0$ to $1$ during the annealing window. The OPD coefficient is decreased as training progresses:

$$\kappa(s) = \max\left(\kappa_{\min},\; \kappa_{\mathrm{init}} - (\kappa_{\mathrm{init}} - \kappa_{\min})\, p(s)\right), \tag{11}$$

and the RL coefficient is increased at the same time:

$$\rho(s) = \rho_{\mathrm{init}} + (\rho_{\max} - \rho_{\mathrm{init}})\, p(s). \tag{12}$$

Intuitively, early updates keep $\kappa(s)$ high, so the teacher provides a strong learning signal when reward feedback is still sparse. Late updates reduce the teacher weight and increase $\rho(s)$, so task reward becomes dominant. The floor $\kappa_{\min} > 0$ keeps a weak teacher anchor to reduce severe drift or reward hacking, but does not prevent reward-driven improvement.
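The schedule is a pair of linear interpolations clamped at the end of the annealing window, which the sketch below makes explicit. The function name and every numeric value (`anneal_steps=100`, `kappa_init=1.0`, `kappa_min=0.1`, `rho_init=0.0`, `rho_max=1.0`) are placeholders chosen for illustration, not the settings used in the paper.

```python
def anneal_coefficients(step, anneal_steps, kappa_init, kappa_min, rho_init, rho_max):
    """Return (kappa, rho) at a global training step, following Equations 10 to 12."""
    if anneal_steps <= 0:
        raise ValueError("anneal_steps must be positive")
    p = min(step / anneal_steps, 1.0)                                   # progress in [0, 1]
    kappa = max(kappa_min, kappa_init - (kappa_init - kappa_min) * p)   # OPD weight decays
    rho = rho_init + (rho_max - rho_init) * p                           # RL weight grows
    return kappa, rho


# Hypothetical settings, for illustration only.
for step in (0, 50, 100, 150):
    kappa, rho = anneal_coefficients(step, 100, 1.0, 0.1, 0.0, 1.0)
    print(step, round(kappa, 3), round(rho, 3))
```

After the window ends, both coefficients stay constant at `kappa_min` and `rho_max`, so the teacher term never vanishes entirely as long as the floor is positive.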
### 4.3 Turn-level Disagreement-Uncertainty Reweighting
For each turn $k$, T-DUR computes two sampled-token statistics over response tokens.
#### Disagreement proxy\.
$$d_k = \frac{1}{N_k}\sum_{t=1}^{N_k}\left|\log\pi_{\theta}(a^{(k)}_t \mid s^{(k)}_t) - \log\pi_{\mathrm{T}}(a^{(k)}_t \mid s^{(k)}_t)\right|. \tag{13}$$

A high $d_k$ indicates strong student–teacher disagreement on the turn.
#### Uncertainty proxy\.
$$h_k = \frac{1}{N_k}\sum_{t=1}^{N_k}\left(-\log\pi_{\theta}(a^{(k)}_t \mid s^{(k)}_t)\right). \tag{14}$$

This sampled-token negative log-probability estimates the student's uncertainty over the turn.
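Both proxies need only the log-probabilities of the tokens the student actually sampled, so they can be computed without a full-vocabulary pass. The sketch below shows this for one turn; the function name and the toy log-probabilities are illustrative assumptions.

```python
def turn_proxies(student_logp, teacher_logp):
    """Return (d_k, h_k) for one turn from sampled-token log-probabilities.

    d_k: mean absolute student-teacher log-probability gap (disagreement proxy)
    h_k: mean negative student log-probability (uncertainty proxy)
    """
    n = len(student_logp)
    if n == 0:
        raise ValueError("a turn must contain at least one response token")
    d_k = sum(abs(s - t) for s, t in zip(student_logp, teacher_logp)) / n
    h_k = sum(-s for s in student_logp) / n
    return d_k, h_k


# Toy turn with two response tokens.
d_k, h_k = turn_proxies(student_logp=[-1.0, -3.0], teacher_logp=[-0.5, -1.0])
print(d_k, h_k)
```

Because of the absolute value, `d_k` does not distinguish tokens the teacher prefers from tokens it disfavors; it measures only how far apart the two models are on the turn.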
#### Per\-trajectory normalization\.
For turns within the same trajectory $\tau$, we normalize

$$\tilde{d}_k = \frac{d_k - \min_{j\in\tau} d_j}{\max_{j\in\tau} d_j - \min_{j\in\tau} d_j}, \qquad \tilde{h}_k = \frac{h_k - \min_{j\in\tau} h_j}{\max_{j\in\tau} h_j - \min_{j\in\tau} h_j}. \tag{15}$$

If the denominator is below $10^{-8}$, the normalized value is set to $0.5$. Per-trajectory normalization avoids mixing scales across tasks, trajectory lengths, and environment states.
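The normalization is ordinary min-max scaling over the turns of one trajectory, with the stated fallback for a near-constant signal. A minimal sketch, assuming a hypothetical helper name `minmax_normalize`:

```python
def minmax_normalize(values, tol=1e-8, fallback=0.5):
    """Min-max scale per-turn statistics within one trajectory.

    If max - min is below tol (for example, a single-turn trajectory or
    identical values), every turn receives the fallback value 0.5.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < tol:
        return [fallback] * len(values)
    return [(v - lo) / span for v in values]


print(minmax_normalize([1.0, 2.0, 3.0]))  # spread out over [0, 1]
print(minmax_normalize([2.0, 2.0]))       # degenerate case, falls back to 0.5
```

The same helper would be applied separately to the disagreement values and the uncertainty values of a trajectory, so that each signal spans its own $[0, 1]$ range before fusion.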
#### Soft\-OR fusion\.
The final T-DUR turn weight is denoted by $w_k$:

$$w_k = 1 - (1 - \tilde{d}_k)(1 - \tilde{h}_k), \qquad w_k \in [0, 1]. \tag{16}$$

This continuous weight becomes large when either student–teacher disagreement or student uncertainty is high. Thus, it emphasizes both uncertain turns and low-uncertainty high-disagreement turns, the latter corresponding to cases where the student is confident but disagrees with the teacher. The detailed training procedure for ATOD is provided in [Algorithm 1](https://arxiv.org/html/2606.27814#alg1) in Appendix [A.1](https://arxiv.org/html/2606.27814#A1.SS1).
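A few evaluations of the soft-OR rule show how it behaves at the extremes. The sketch below assumes inputs already normalized to $[0, 1]$; the function name `soft_or` and the sample values are illustrative.

```python
def soft_or(d_norm, h_norm):
    """Soft-OR fusion of normalized disagreement and uncertainty (Equation 16)."""
    return 1.0 - (1.0 - d_norm) * (1.0 - h_norm)


cases = {
    "both low": (0.0, 0.0),
    "confident but disagreeing": (1.0, 0.0),
    "uncertain, mild disagreement": (0.2, 0.9),
    "degenerate fallback (0.5, 0.5)": (0.5, 0.5),
}
for name, (d, h) in cases.items():
    print(f"{name}: w = {soft_or(d, h):.2f}")
```

Two consequences follow directly from the formula. A turn receives zero weight only when it is simultaneously the least disagreeing and the least uncertain turn of its trajectory, and when both signals hit the $0.5$ fallback the weight is $1 - 0.5 \cdot 0.5 = 0.75$ for every turn.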
## 5 Experiments
### 5.1 Experimental Setup
#### Datasets & Benchmarks\.
We evaluate ATOD on three long\-horizon agent benchmarks: ALFWorld, WebShop, and Search\-QA\. ALFWorld tests embodied instruction following in text\-based household environments, WebShop evaluates goal\-conditioned web navigation and product selection, and Search\-QA measures search\-augmented question answering over open\-domain and multi\-hop tasks\[[3](https://arxiv.org/html/2606.27814#bib.bib3),[4](https://arxiv.org/html/2606.27814#bib.bib4),[5](https://arxiv.org/html/2606.27814#bib.bib5)\]\. Following common Search\-R1\-style protocols, the Search\-QA suite covers Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle\[[26](https://arxiv.org/html/2606.27814#bib.bib26),[27](https://arxiv.org/html/2606.27814#bib.bib27),[28](https://arxiv.org/html/2606.27814#bib.bib28),[29](https://arxiv.org/html/2606.27814#bib.bib29),[30](https://arxiv.org/html/2606.27814#bib.bib30)\]\. Together, these benchmarks cover embodied interaction, web interaction, and tool\-assisted knowledge reasoning\.
#### Evaluation Setups\.
We use Qwen3-0.6B, Qwen3-1.7B, and Qwen3-4B as student models. The 0.6B and 1.7B students use a Qwen3-4B model trained with GRPO as the teacher, while the 4B student uses a Qwen3-30B-A3B GRPO teacher (150-step checkpoint). Unless otherwise specified, each prompt is sampled on policy with group size $G$ so that GRPO-style group-relative advantages can be computed from environment feedback. We report maximum validation success rate (%) and the corresponding average trajectory length on ALFWorld, Search-QA, and WebShop. The main comparison uses a 150-step training window for all student sizes.
#### Baselines\.
We compare ATOD against post-training baselines of different supervision types: the untrained Vanilla model, the RL method GRPO, the self-distillation + RL method SDAR \[[25](https://arxiv.org/html/2606.27814#bib.bib25)\], standard on-policy distillation (OPD) \[[14](https://arxiv.org/html/2606.27814#bib.bib14),[15](https://arxiv.org/html/2606.27814#bib.bib15)\], and the agentic on-policy distillation methods SOD \[[21](https://arxiv.org/html/2606.27814#bib.bib21)\] and TCOD \[[20](https://arxiv.org/html/2606.27814#bib.bib20)\]. All methods use the same student and teacher families where applicable, and are evaluated under the same benchmark metrics. For ATOD, the teacher provides token-level log-probabilities on student-sampled trajectories, while the actor update remains the standard clipped GRPO surrogate with the hybrid advantage defined in [Equation 8](https://arxiv.org/html/2606.27814#S4.E8). Baseline descriptions are provided in Appendix [B.1](https://arxiv.org/html/2606.27814#A2.SS1).
### 5.2 Main Results
Table 1: Performance on ALFWorld, Search-QA, and WebShop. We report success rate (SR, %) and average trajectory length (Len.). Avg. SR is the mean success rate over the three datasets. Vanilla denotes the untrained base model. Methods marked with † do not use a teacher model.

| Method | ALFWorld SR | ALFWorld Len. | Search-QA SR | Search-QA Len. | WebShop SR | WebShop Len. | Avg. SR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B GRPO (Teacher) | 76.56 | 18.95 | 48.19 | 2.51 | 82.03 | 6.21 | 68.93 |
| **Qwen3-0.6B Student (teacher: Qwen3-4B GRPO)** | | | | | | | |
| Vanilla† | 0.78 | 50.00 | 14.50 | 1.57 | 4.69 | 9.59 | 6.66 |
| GRPO† | 30.47 | 44.13 | 39.36 | 2.69 | 29.69 | 9.73 | 33.17 |
| SDAR† | 28.12 | 45.27 | 39.60 | 2.43 | 8.59 | 10.27 | 25.44 |
| OPD | 76.56 | 18.48 | 41.46 | 2.77 | 85.16 | 6.23 | 67.73 |
| SOD | 76.56 | 22.90 | 40.19 | 2.81 | 80.47 | 5.95 | 65.74 |
| TCOD | 74.22 | 20.15 | 42.04 | 2.67 | 84.38 | 5.94 | 66.88 |
| ATOD | 82.81 | 15.84 | 42.33 | 2.77 | 86.72 | 7.48 | 70.62 |
| **Qwen3-1.7B Student (teacher: Qwen3-4B GRPO)** | | | | | | | |
| Vanilla† | 10.16 | 47.91 | 31.15 | 2.09 | 5.47 | 8.43 | 15.59 |
| GRPO† | 31.25 | 40.38 | 41.99 | 2.64 | 47.66 | 10.01 | 40.30 |
| SDAR† | 39.84 | 37.56 | 42.24 | 2.67 | 43.75 | 8.90 | 41.94 |
| OPD | 72.66 | 16.41 | 44.73 | 2.66 | 77.34 | 6.15 | 64.91 |
| SOD | 75.78 | 27.95 | 44.43 | 2.69 | 76.56 | 5.59 | 65.59 |
| TCOD | 71.88 | 18.88 | 45.02 | 2.74 | 81.25 | 6.10 | 66.05 |
| ATOD | 80.47 | 22.69 | 45.21 | 2.64 | 89.06 | 6.20 | 71.58 |
| Qwen3-30B-A3B GRPO (Teacher) | 80.47 | 23.53 | 50.49 | 2.65 | 75.78 | 8.02 | 68.91 |
| **Qwen3-4B Student (teacher: Qwen3-30B-A3B GRPO)** | | | | | | | |
| Vanilla† | 24.22 | 42.43 | 32.71 | 1.77 | 0.78 | 13.99 | 19.24 |
| GRPO† | 76.56 | 18.95 | 48.19 | 2.51 | 82.03 | 6.21 | 68.93 |
| SDAR† | 77.34 | 20.85 | 47.80 | 2.59 | 67.97 | 7.91 | 64.37 |
| OPD | 80.47 | 21.12 | 48.93 | 2.69 | 75.00 | 8.53 | 68.13 |
| SOD | 79.69 | 19.02 | 48.58 | 2.61 | 76.56 | 6.68 | 68.28 |
| TCOD | 76.56 | 23.45 | 48.39 | 2.57 | 75.78 | 9.21 | 66.91 |
| ATOD | 85.16 | 16.75 | 49.12 | 2.62 | 78.91 | 8.29 | 71.06 |

The performance comparison across ALFWorld, Search-QA, and WebShop is summarized in [Table 1](https://arxiv.org/html/2606.27814#S5.T1), from which we draw the following key observations:
- **Obs 1: ATOD consistently outperforms all baselines and even surpasses the corresponding teacher models.** Across all three student sizes (0.6B, 1.7B, and 4B), ATOD achieves the highest average success rate among all student methods, with margins of 3.89% (0.6B), 5.07% (1.7B), and 2.13% (4B) over the second-best baseline. Notably, ATOD surpasses its teacher on ALFWorld and WebShop for all student sizes, and achieves competitive Search-QA performance within 1% of the teacher.
- **Obs 2: ATOD delivers dramatic improvements for smaller models with weak initial performance.** For the 0.6B student, which starts with near-zero success on ALFWorld (0.78%) and WebShop (4.69%), ATOD lifts performance to 82.81% and 86.72% respectively, a relative improvement of over 100× on ALFWorld compared to the vanilla model. In contrast, GRPO achieves only 30.47% on ALFWorld, highlighting ATOD's superior efficiency in the early training stage when sparse rewards make pure RL ineffective.
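Two of the aggregate figures for the 0.6B student can be rechecked from the per-benchmark success rates in Table 1. The snippet below is only an arithmetic check; the variable names are hypothetical and the numbers are copied from the table.

```python
# Success rates (SR, %) copied from Table 1, Qwen3-0.6B student.
atod_sr = {"ALFWorld": 82.81, "Search-QA": 42.33, "WebShop": 86.72}
vanilla_alfworld_sr = 0.78

# Avg. SR is defined as the mean success rate over the three datasets.
avg_sr = sum(atod_sr.values()) / len(atod_sr)

# Relative ALFWorld improvement of ATOD over the untrained model.
relative_gain = atod_sr["ALFWorld"] / vanilla_alfworld_sr

print(f"Avg. SR: {avg_sr:.2f}")
print(f"ALFWorld gain over Vanilla: {relative_gain:.1f}x")
```

The mean reproduces the Avg. SR entry for ATOD in the 0.6B block, and the ratio is consistent with the "over 100×" statement in Obs 2.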
### 5.3 Ablation Study
Figure 4: ALFWorld ablation results for Qwen3-0.6B, Qwen3-1.7B, and Qwen3-4B students. ATOD achieves the best success rate, while removing turn-level T-DUR or annealing generally reduces performance.

To evaluate the contribution of each component in ATOD, we conduct three ablation studies on ALFWorld across all student sizes (0.6B, 1.7B, and 4B): (1) applying the disagreement-uncertainty reweighting at the token level instead of the turn level (Token-level reweighting); (2) removing the T-DUR reweighting entirely and using uniform turn weights (w/o T-DUR); and (3) disabling the annealing schedule by fixing $\kappa(s)$ and $\rho(s)$ to constant values (w/o Annealing). The results are shown in [Figure 4](https://arxiv.org/html/2606.27814#S5.F4), from which we draw the following observations:
- **Obs 1: The annealing schedule is crucial for combining imitation and reward optimization.** Removing the annealing mechanism leads to the most pronounced degradation across all student sizes, with the largest drop on relatively weaker students (e.g., 0.6B drops from 82.8% to 75.8%). This confirms that a fixed OPD/RL mixture fails to balance the two signals effectively: early training requires strong teacher guidance for bootstrapping, while later training needs reward-driven pressure to push beyond the teacher ceiling.
- **Obs 2: Turn-level T-DUR weighting outperforms both token-level reweighting and uniform weighting.** Compared with ATOD, both applying reweighting at the token level and removing T-DUR (i.e., uniform turn weights) lower the success rate across all model sizes, with the gap being particularly notable on the 1.7B student. This suggests that operating at the turn level provides a more stable and semantically meaningful reweighting unit: agent interactions are inherently turn-structured, and aggregating disagreement and uncertainty over complete response turns better captures the utility of each decision step. In contrast, token-level reweighting introduces instability from per-token fluctuations, while uniform weighting squanders supervision on routine turns.
### 5.4 Training Dynamics and Diagnostic Metrics
Figure 5: Training dynamics on ALFWorld. (a) Training reward: ATOD maintains higher critic rewards. (b) Validation success rate: ATOD converges quickly and reaches the best final performance. (c) Average trajectory turns: ATOD learns more efficient trajectories than GRPO.

[Figure 5](https://arxiv.org/html/2606.27814#S5.F5) provides a step-wise view of the optimization process behind the aggregate results in [Table 1](https://arxiv.org/html/2606.27814#S5.T1). Pure GRPO starts from a substantially lower success rate and improves slowly, reflecting the difficulty of exploration under sparse trajectory-level rewards. OPD and SOD obtain a much faster warm start because teacher supervision supplies dense token-level guidance, but their validation curves flatten after the early stage. In contrast, ATOD follows the rapid early growth of distillation-based methods while continuing to improve in the later stage, reaching the highest validation success rate by the end of training. The reward curve shows a consistent pattern: ATOD maintains a higher mean critic reward than the baselines for most of training, indicating that the annealed schedule does not merely imitate the teacher but gradually shifts toward reward-improving behaviors. The average-turn curve further suggests that ATOD learns more efficient interaction policies: while GRPO keeps producing longer trajectories, ATOD quickly reduces the number of turns and remains close to the more stable distillation-based baselines.
Figure 6: Internal diagnostic metrics during ATOD training. (a) OPD/RL signal magnitude: OPD decays while RL remains strong. (b) Mean turn weight: T-DUR keeps reallocating supervision across turns. (c) Teacher–student gap: ATOD steadily narrows the gap during training.

The diagnostic metrics in [Figure 6](https://arxiv.org/html/2606.27814#S5.F6) further explain why the hybrid schedule is effective. The OPD signal is large at the beginning, when the student disagrees strongly with the teacher and dense supervision is most informative, but it quickly decreases as the student becomes aligned on common trajectory turns. This decay prevents teacher imitation from dominating indefinitely. Meanwhile, the RL signal increases early and remains high throughout training, providing persistent task-level pressure after the distillation signal weakens. The mean turn weight remains in a moderate range rather than collapsing, indicating that T-DUR continues to redistribute distillation strength across informative turns. The teacher–student gap also steadily shrinks for ATOD, confirming that the student absorbs useful teacher behavior while the persistent RL signal supplies the later reward-driven updates needed to exceed the ceiling of pure imitation.
## 6 Conclusion
We presented ATOD, an annealed turn\-aware on\-policy distillation algorithm for language agents\. ATOD addresses the cold\-start inefficiency of pure RL, the teacher\-ceiling problem of pure OPD, and the signal dilution caused by uniform turn weighting\. By encoding OPD and GRPO into a single hybrid advantage, annealing their coefficients over training, and applying T\-DUR only to teacher supervision, the method provides a simple and implementation\-friendly path toward training small agents that first imitate and then improve through environment reward\.
## References
- Yao et al\. \[2023\]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R\. Narasimhan, and Yuan Cao\.React: Synergizing reasoning and acting in language models\.In*International Conference on Learning Representations*, 2023\.
- Schick et al\. \[2023\]Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\.Toolformer: Language models can teach themselves to use tools\.In*Advances in Neural Information Processing Systems*, 2023\.
- Shridhar et al\. \[2021\]Mohit Shridhar, Xingdi Yuan, Marc\-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht\.Alfworld: Aligning text and embodied environments for interactive learning\.In*International Conference on Learning Representations*, 2021\.
- Yao et al\. \[2022\]Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan\.Webshop: Towards scalable real\-world web interaction with grounded language agents\.In*Advances in Neural Information Processing Systems*, 2022\.
- Jin et al\. \[2025\]Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han\.Search\-r1: Training llms to reason and leverage search engines with reinforcement learning\.*arXiv preprint arXiv:2503\.09516*, 2025\.
- Schulman et al\. \[2017\]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Shao et al\. \[2024\]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, et al\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\.
- DeepSeek\-AI et al\. \[2025\]DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, et al\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*Nature*, 645:633–638, 2025\.
- Yu et al\. \[2025\]Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei\-Ying Ma, Ya\-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang\.Dapo: An open\-source llm reinforcement learning system at scale\.*arXiv preprint arXiv:2503\.14476*, 2025\.
- Zhou et al\. \[2024\]Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar\.Archer: Training language model agents via hierarchical multi\-turn rl\.In*Proceedings of the 41st International Conference on Machine Learning*, pages 62178–62209, 2024\.
- Qi et al\. \[2025\]Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong\.Webrl: Training llm web agents via self\-evolving online curriculum reinforcement learning\.In*International Conference on Learning Representations*, 2025\.
- Feng et al\. \[2025\]Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An\.Group\-in\-group policy optimization for llm agent training\.*arXiv preprint arXiv:2505\.10978*, 2025\.
- Dong et al\. \[2025\]Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji\-Rong Wen, and Zhicheng Dou\.Agentic reinforced policy optimization\.*arXiv preprint arXiv:2507\.19849*, 2025\.
- Agarwal et al\. \[2024\]Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem\.On\-policy distillation of language models: Learning from self\-generated mistakes\.In*International Conference on Learning Representations*, 2024\.
- Gu et al\. \[2024\]Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang\.Minillm: Knowledge distillation of large language models\.In*International Conference on Learning Representations*, 2024\.
- Jin et al\. \[2026\]Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee\.Entropy\-aware on\-policy distillation of language models\.*arXiv preprint arXiv:2603\.07079*, 2026\.
- Ye et al\. \[2026\]Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei\.On\-policy context distillation for language models\.*arXiv preprint arXiv:2602\.12275*, 2026\.
- Jang et al\. \[2026\]Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim\.Stable on\-policy distillation through adaptive target reformulation\.*arXiv preprint arXiv:2601\.07155*, 2026\.
- Yang et al\. \[2026\]Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin\.Learning beyond teacher: Generalized on\-policy distillation with reward extrapolation\.*arXiv preprint arXiv:2602\.12125*, 2026\.
- Wang et al\. \[2026a\]Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, and James Cheng\.Tcod: Exploring temporal curriculum in on\-policy distillation for multi\-turn autonomous agents\.*arXiv preprint arXiv:2604\.24005*, 2026a\.
- Zhong et al\. \[2026\]Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, and Junfeng Fang\.Sod: Step\-wise on\-policy distillation for small language model agents\.*arXiv preprint arXiv:2605\.07725*, 2026\.
- Xu et al\. \[2026\]Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard\.Tip: Token importance in on\-policy distillation\.*arXiv preprint arXiv:2604\.14084*, 2026\.
- Zhao et al\. \[2026\]Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover\.Self\-distilled reasoner: On\-policy self\-distillation for large language models\.*arXiv preprint arXiv:2601\.18734*, 2026\.
- Wang et al\. \[2026b\]Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, and Honggang Qi\.Skill\-sd: Skill\-conditioned self\-distillation for multi\-turn llm agents\.*arXiv preprint arXiv:2604\.10674*, 2026b\.
- Lu et al\. \[2026\]Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi\-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen\.Self\-distilled agentic reinforcement learning\.*arXiv preprint arXiv:2605\.15155*, 2026\.
- Kwiatkowski et al\. \[2019\]Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming\-Wei Chang, Andrew M\. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov\.Natural questions: A benchmark for question answering research\.*Transactions of the Association for Computational Linguistics*, 7:452–466, 2019\.
- Joshi et al\. \[2017\]Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer\.Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension\.In*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics*, pages 1601–1611, 2017\.
- Yang et al\. \[2018\]Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W\. Cohen, Ruslan Salakhutdinov, and Christopher D\. Manning\.Hotpotqa: A dataset for diverse, explainable multi\-hop question answering\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2369–2380, 2018\.
- Trivedi et al\. \[2022\]Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\.Musique: Multihop questions via single\-hop question composition\.*Transactions of the Association for Computational Linguistics*, 10:539–554, 2022\.
- Press et al\. \[2023\]Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A\. Smith, and Mike Lewis\.Measuring and narrowing the compositionality gap in language models\.In*Findings of the Association for Computational Linguistics: EMNLP 2023*, 2023\.
Appendix
## Appendix A Method and Algorithmic Details
This section collects the implementation\-level and analytical details of ATOD that complement the main method section\.
### A.1 Complete Training Algorithm
The complete ATOD training procedure is detailed below\.
Algorithm 1: ATOD (Annealed Turn-aware On-policy Distillation)

- Input: student policy $\pi_{\theta}$, teacher policy $\pi_{\mathrm{T}}$, environment $\mathcal{E}$, prompt set $\mathcal{X}$, group size $G$, annealing schedule $(\kappa, \rho)$, clipping range $\epsilon$.
- Output: trained student policy $\pi_{\theta}$.

Stage I: On-policy rollout and reward advantage

- For each prompt $x \in \mathcal{X}$:
  - Sample $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ with the old student policy $\pi_{\theta_{\mathrm{old}}}$ in $\mathcal{E}$.
  - Evaluate each trajectory with the task reward $R(\tau_i)$.
  - Compute group-relative GRPO advantages $A^{\mathrm{GRPO}}$ using [equation 2](https://arxiv.org/html/2606.27814#S3.E2).

Stage II: Turn-aware distillation reweighting

- For each trajectory $\tau$:
  - Partition model-generated tokens into turns $\{\mathcal{I}_k\}_{k=1}^{K}$.
  - Compute turn-level disagreement $d_k$ and uncertainty $h_k$ using [equations 13](https://arxiv.org/html/2606.27814#S4.E13) and [14](https://arxiv.org/html/2606.27814#S4.E14).
  - Fuse the normalized signals into T-DUR weights $\{w_k\}$ using [equations 15](https://arxiv.org/html/2606.27814#S4.E15) and [16](https://arxiv.org/html/2606.27814#S4.E16).
  - Form the turn-weighted OPD advantage $A^{\mathrm{OPD}}_t = \Delta\log p_t \, w_{k(t)}$ using [equation 9](https://arxiv.org/html/2606.27814#S4.E9).

Stage III: Annealed hybrid optimization

- Compute $\kappa(s)$ and $\rho(s)$ using [equations 11](https://arxiv.org/html/2606.27814#S4.E11) and [12](https://arxiv.org/html/2606.27814#S4.E12).
- Form the hybrid advantage: $A_t = \kappa(s)\, A^{\mathrm{OPD}}_t + \rho(s)\, A^{\mathrm{GRPO}}_t$.
- Update $\pi_{\theta}$ by minimizing the clipped surrogate: $\mathcal{L}_{\mathrm{actor}}(\theta) = -\mathbb{E}_t\left[\min\left(\eta_t(\theta) A_t,\ \mathrm{clip}(\eta_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$.
- Synchronize the old policy: $\pi_{\theta_{\mathrm{old}}} \leftarrow \pi_{\theta}$.
### A.2 Gradient Interpretation of the OPD Component

Inside the unclipped region where $\eta_t(\theta) \approx 1$, the distillation part of the actor loss is approximately

$$\mathcal{L}_{\mathrm{OPD}}(\theta) \approx -\mathbb{E}_t\left[\eta_t(\theta)\,\kappa(s)\,\Delta\log p_t\, w_{k(t)}\right]. \tag{17}$$

Treating $\Delta\log p_t$ and the turn weight $w_{k(t)}$ as detached advantages, the gradient is

$$\nabla_{\theta}\mathcal{L}_{\mathrm{OPD}} \approx -\mathbb{E}_t\left[\kappa(s)\,\Delta\log p_t\, w_{k(t)}\,\nabla_{\theta}\log\pi_{\theta}(a_t \mid s_t)\right]. \tag{18}$$

Thus ATOD performs advantage-weighted likelihood ascent on student-sampled tokens, where the OPD advantage is high when the teacher assigns higher probability than the old student and T-DUR assigns a large disagreement-uncertainty weight to the turn.
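The step from equation (17) to equation (18) relies on the log-derivative identity. The short derivation below spells it out, assuming $\eta_t(\theta)$ is the usual importance ratio between the current and old student policies (this section does not restate that definition, so it is an assumption here).

```latex
\begin{align*}
\eta_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \\
\nabla_\theta \eta_t(\theta) &= \eta_t(\theta)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t), \\
\nabla_\theta \mathcal{L}_{\mathrm{OPD}}
  &\approx -\mathbb{E}_t\!\left[\eta_t(\theta)\,\kappa(s)\,\Delta\log p_t\,w_{k(t)}\,
     \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \\
  &\approx -\mathbb{E}_t\!\left[\kappa(s)\,\Delta\log p_t\,w_{k(t)}\,
     \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
     \qquad (\eta_t(\theta) \approx 1).
\end{align*}
```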
### A.3 Entropy Proxy as a Martingale-Difference Estimator

###### Claim 1.

For a turn $k$, $h_k = \frac{1}{N_k}\sum_t -\log\pi_{\theta}(a_t^{(k)} \mid s_t^{(k)})$ is an unbiased estimator of the average token entropy $\bar{H}_k = \frac{1}{N_k}\sum_t H(\pi_{\theta}(\cdot \mid s_t^{(k)}))$. Its estimation variance decreases at rate $O(1/N_k)$.

#### Proof sketch.

Let $X_t = -\log\pi_{\theta}(a_t \mid s_t)$, let $H_t = H(\pi_{\theta}(\cdot \mid s_t))$, and let $\mathcal{F}_{t-1}$ be the filtration generated by previous tokens. Since $a_t \sim \pi_{\theta}(\cdot \mid s_t)$, we have $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = H_t$. Therefore $D_t = X_t - H_t$ is a martingale difference, and $h_k - \bar{H}_k = \frac{1}{N_k}\sum_t D_t$ has zero mean. For $i < j$, $\mathbb{E}[D_i D_j] = \mathbb{E}[D_i\,\mathbb{E}[D_j \mid \mathcal{F}_{j-1}]] = 0$. Hence the variance of the turn average is $N_k^{-2}\sum_t \mathbb{E}[D_t^2] = O(1/N_k)$, provided the second moments $\mathbb{E}[D_t^2]$ are uniformly bounded.
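Claim 1 can be checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions, not part of the paper: the per-position token distributions are fixed in advance (so they do not depend on previously sampled tokens), the vocabulary has five symbols, and the helper names `entropy` and `sampled_entropy_proxy` are hypothetical. It compares the Monte Carlo mean of the sampled negative log-likelihood against the exact average entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def sampled_entropy_proxy(probs, rng):
    """Average of -log p(a_t) with one token a_t sampled per position."""
    nll = []
    for p in probs:
        a = rng.choice(len(p), p=p)
        nll.append(-np.log(p[a]))
    return float(np.mean(nll))

rng = np.random.default_rng(0)
# Toy "turn" with 64 token positions over a 5-symbol vocabulary.
probs = rng.dirichlet(np.ones(5), size=64)
true_avg_entropy = float(np.mean([entropy(p) for p in probs]))
# Repeat the sampling many times to estimate the proxy's mean and spread.
estimates = np.array([sampled_entropy_proxy(probs, rng) for _ in range(500)])
print(true_avg_entropy, estimates.mean(), estimates.std())
```

The printed mean of the estimates should sit close to the exact average entropy, and rerunning with more positions per turn should shrink the standard deviation, in line with the $O(1/N_k)$ variance rate.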
### A.4 Properties of Soft-OR and Annealing

For $d, h \in [0, 1]$, the function $f(d,h) = 1 - (1-d)(1-h)$ is monotone in both arguments, symmetric, satisfies $f(d,0) = d$ and $f(0,h) = h$, and is at least as large as each individual signal. It is the standard probabilistic OR t-conorm, making it appropriate when either uncertainty or disagreement should mark a turn as important.

The annealing functions in [equations 11](https://arxiv.org/html/2606.27814#S4.E11) and [12](https://arxiv.org/html/2606.27814#S4.E12) are Lipschitz in the training step with constants $|\kappa_{\mathrm{init}} - \kappa_{\min}|/T$ and $|\rho_{\max} - \rho_{\mathrm{init}}|/T$. This avoids abrupt changes in advantage scale and makes the transition between imitation and reward optimization smooth.
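The properties listed above are easy to verify in code. The sketch below is illustrative only: `soft_or` follows the stated formula, while `linear_schedule` assumes a linear interpolation held constant after `total_steps`, which is consistent with the stated Lipschitz constants but is not confirmed by this section. The endpoint values used in the example (1.0 to 0.1 over 100 steps) are hypothetical.

```python
def soft_or(d, h):
    """Probabilistic OR of two signals in [0, 1]: 1 - (1 - d)(1 - h)."""
    return 1.0 - (1.0 - d) * (1.0 - h)

def linear_schedule(step, total_steps, start, end):
    """Linear interpolation from start to end, clamped outside [0, total_steps].

    Assumed schedule shape; per-step change is at most |end - start| / total_steps.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

# Soft-OR is at least as large as either input on [0, 1].
print(soft_or(0.2, 0.9), max(0.2, 0.9))

# Hypothetical OPD coefficient decaying from 1.0 to 0.1 over 100 steps.
kappas = [linear_schedule(s, 100, 1.0, 0.1) for s in range(101)]
largest_jump = max(abs(b - a) for a, b in zip(kappas, kappas[1:]))
print(largest_jump)
```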
### A.5 Naming Rationale

We use ATOD as the primary method name to align with the naming convention of recent OPD-family methods such as SOD and TCOD. The name emphasizes the two main algorithmic ingredients: annealed OPD–GRPO training and turn-aware distillation weighting. The turn weighting module is named Turn-level Disagreement-Uncertainty Reweighting (T-DUR) to emphasize that its soft weight is estimated from sampled-token student–teacher disagreement and student uncertainty, rather than from a hard token-selection rule.
## Appendix B Experimental Setup Details
This section collects the experimental details needed to reproduce and interpret the reported results: baseline definitions, training configuration, metric diagnostics, and environment prompts\.
### B.1 Baselines
We compare ATOD with a diverse set of baselines spanning no additional training, reinforcement learning, self\-distillation, and teacher\-based on\-policy distillation\. These baselines are chosen to reflect different supervision sources \(environment rewards, self\-generated targets, and external teacher distributions\) and different signal granularities \(trajectory\-level rewards, token\-level distillation, and temporally structured agent feedback\)\.
- Initial. The base student model without any additional task-specific post-training. This baseline measures the zero-shot or instruction-tuned agentic capability of the underlying Qwen3 student before reinforcement learning or distillation.
- GRPO (Group Relative Policy Optimization). A reinforcement learning baseline that optimizes the student directly with environment rewards \[[7](https://arxiv.org/html/2606.27814#bib.bib7)\]. For each input, GRPO samples a group of candidate trajectories, assigns each trajectory a scalar task reward, and normalizes rewards within the group to estimate relative advantages. It avoids training a separate value model, making it efficient for LLM post-training, but the reward is usually sparse and trajectory-level in multi-turn agent tasks, so all tokens in a trajectory receive relatively coarse credit.
- SDAR (Self-Distilled Agentic Reinforcement Learning). SDAR keeps RL as the primary optimization backbone and adds a gated self-distillation objective as an auxiliary signal \[[25](https://arxiv.org/html/2606.27814#bib.bib25)\]. Instead of relying on an external teacher, it uses the agent’s own high-quality behaviors or self-generated targets to provide additional token-level guidance. This can improve credit assignment when the self-distilled signal is reliable, but its effectiveness depends on whether the student can already generate sufficiently useful agent trajectories.
- OPD (On-Policy Distillation). OPD trains the student on trajectories sampled from the student itself while using an external teacher distribution to provide dense token-level supervision \[[14](https://arxiv.org/html/2606.27814#bib.bib14), [15](https://arxiv.org/html/2606.27814#bib.bib15)\]. Compared with offline supervised imitation, OPD reduces train–test mismatch because supervision is applied on the student’s own rollout states. However, pure OPD mainly imitates the teacher and does not directly optimize environment rewards, so it may saturate once the student approaches teacher-like behavior.
- SOD (Step-wise On-policy Distillation). SOD is an agent-oriented OPD method that adaptively adjusts distillation strength at different interaction steps \[[21](https://arxiv.org/html/2606.27814#bib.bib21)\]. It is designed for long-horizon tool-use or agent trajectories, where different steps may have different levels of reliability and learning value. By assigning step-wise distillation weights, SOD provides more structured teacher guidance than uniform OPD, but it still relies primarily on teacher-driven supervision.
- TCOD (Temporal Curriculum On-Policy Distillation). TCOD introduces a temporal curriculum for multi-turn OPD \[[20](https://arxiv.org/html/2606.27814#bib.bib20)\]. Rather than exposing the student to the full trajectory difficulty uniformly, it controls the trajectory depth or temporal range used for distillation so that the student learns agent behavior in a progressively structured manner. This curriculum can stabilize multi-turn distillation, but its performance depends on the chosen temporal schedule.
The 4B GRPO model is reported as a teacher reference in the main table\. All teacher\-based distillation baselines use the same teacher checkpoint where applicable, so differences among OPD, SOD, TCOD, and ATOD mainly come from how teacher supervision is weighted and combined with reward optimization\.
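The group normalization described for the GRPO baseline can be sketched as follows. This is a minimal illustration assuming the common mean and standard-deviation normalization; the exact form of equation 2 is not reproduced in this appendix, and the function name and `eps` stabilizer are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize scalar trajectory rewards within one sampled group.

    Every token of trajectory i would then share the advantage at index i.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt: two succeed (reward 1) and two fail (reward 0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all rollouts in a group receive the same reward, the advantages are all zero and the group contributes no reward gradient, which is one reason sparse-reward training can start slowly.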
### B.2 Training Hyperparameters

We conduct training across three text-based interactive environments: ALFWorld, Search-QA, and WebShop. The training and core ATOD hyperparameters are summarized in [Table 2](https://arxiv.org/html/2606.27814#A2.T2), while task-specific configurations are provided in [Table 3](https://arxiv.org/html/2606.27814#A2.T3).

Table 2: Training and core ATOD hyperparameters.

Table 3: Task-specific hyperparameters for ALFWorld, Search-QA, and WebShop.
### B.3 Metric Interpretation

Healthy training should show a monotonically decreasing `hybrid/kl_coef`, a monotonically increasing `hybrid/rl_coef`, an early OPD/RL ratio above one and a late ratio below one, stable actor entropy without abrupt collapse, and increasing task reward. The T-DUR active ratio should be high in early and middle training and may decrease later as the student approaches the teacher on more trajectories.
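Some of these checks can be automated over logged metrics. The sketch below is a hypothetical helper, not part of the released tooling: it assumes each metric is available as a list of per-step values in a dictionary, and while `hybrid/kl_coef` and `hybrid/rl_coef` are the names used above, the key `opd_rl_ratio` is invented for this example.

```python
def is_monotone(values, increasing, tol=1e-9):
    """True if the sequence never moves against the expected direction."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(b >= a - tol for a, b in pairs)
    return all(b <= a + tol for a, b in pairs)

def check_schedule_health(log):
    """Return a dict of boolean checks for the annealing-related metrics."""
    ratio = log["opd_rl_ratio"]  # hypothetical key for the OPD/RL signal ratio
    return {
        "kl_coef_decreasing": is_monotone(log["hybrid/kl_coef"], increasing=False),
        "rl_coef_increasing": is_monotone(log["hybrid/rl_coef"], increasing=True),
        "ratio_crosses_one": ratio[0] > 1.0 and ratio[-1] < 1.0,
    }

example_log = {
    "hybrid/kl_coef": [1.0, 0.7, 0.4, 0.1],
    "hybrid/rl_coef": [0.1, 0.4, 0.7, 1.0],
    "opd_rl_ratio": [3.0, 1.5, 0.8, 0.2],
}
print(check_schedule_health(example_log))
```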
### B.4 Environment Prompt Templates

Figures [7](https://arxiv.org/html/2606.27814#A2.F7)–[9](https://arxiv.org/html/2606.27814#A2.F9) present the full prompt templates used by ATOD for the three evaluation environments.

Prompt of ATOD on ALFWorld

You are an expert agent operating in the ALFRED Embodied Environment. Your task is to: \{task\_description\}. Prior to this step, you have already taken \{step\_count\} step(s). Below are the most recent \{history\_length\} observations and the corresponding actions you took: \{action\_history\} You are now at step \{current\_step\} and your current observation is: \{current\_observation\} Your admissible actions of the current situation are: \[\[\{admissible\_actions\}\]\]. Now it’s your turn to take an action. You should first reason step-by-step about the current situation. This reasoning process MUST be enclosed within <think\></think\> tags. Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within <action\></action\> tags.

Figure 7: Prompt template used by ATOD for the ALFWorld task environment.

Prompt of ATOD on Search-based QA

You are an expert agent tasked with answering the given question step-by-step. Your question: \{task\_description\}. Prior to this step, you have already taken \{step\_count\} step(s). Below is the interaction history where <search\></search\> wrapped your past search queries and <information\></information\> wrapped the corresponding search results returned by the external search engine. History: \{memory\_context\} Now it’s your turn to respond for the current step. You should first conduct a reasoning process. This process MUST be enclosed within <think\></think\> tags. After completing your reasoning, choose only one of the following actions (do not perform both): 1. If you find you lack some knowledge, you MUST call a search engine to get more external information using format: <search\>your query</search\>. 2. If you have enough knowledge to answer the question confidently, provide your final answer within <answer\></answer\> tags, without detailed illustrations. For example, <answer\>Beijing</answer\>.

Figure 8: Prompt template used by ATOD for the Search-based QA task environment.

Prompt of ATOD on WebShop

You are an expert autonomous agent operating in the WebShop e-commerce environment. Your task is to: \{task\_description\}. Prior to this step, you have already taken \{step\_count\} step(s). Below are the most recent \{history\_length\} observations and the corresponding actions you took: \{action\_history\} You are now at step \{current\_step\} and your current observation is: \{current\_observation\}. Your admissible actions of the current situation are: \[\{available\_actions\}\]. Now it’s your turn to take one action for the current step. You should first reason step-by-step about the current situation, then think carefully which admissible action best advances the shopping goal. This reasoning process MUST be enclosed within <think\></think\> tags. Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within <action\></action\> tags.

Figure 9: Prompt template used by ATOD for the WebShop task environment.
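The templates require the model to wrap its reasoning in `<think>` tags and its chosen action in `<action>` tags. How the environment wrapper validates responses is not described here, so the parser below is only a hypothetical sketch of one way to extract and check these fields for the ALFWorld and WebShop formats; the function name and the rejection policy (returning `None`) are assumptions.

```python
import re

def parse_agent_response(text, admissible_actions):
    """Extract (reasoning, action) from a tagged response.

    Returns None if either tag is missing or the action is not admissible.
    """
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    action = re.search(r"<action>(.*?)</action>", text, flags=re.DOTALL)
    if think is None or action is None:
        return None
    chosen = action.group(1).strip()
    if chosen not in admissible_actions:
        return None
    return think.group(1).strip(), chosen

response = "<think>I should check what I carry.</think><action>inventory</action>"
print(parse_agent_response(response, ["inventory", "go to coffeetable 1"]))
```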
## Appendix C Qualitative Examples and Diagnostic Analyses

This section provides concrete rollouts and diagnostic visualizations that explain where T-DUR allocates teacher supervision in long trajectories.
### C.1 Qualitative ALFWorld Example

[Figure 10](https://arxiv.org/html/2606.27814#A3.F10) shows a complete successful trajectory produced by the ATOD-trained 1.7B student on an ALFWorld task.

Model Output by ATOD-trained Student (1.7B)

- Benchmark. ALFWorld valid\_seen
- Task. put some remotecontrol on sofa.
- Model. ATOD-trained 1.7B student
- Outcome. success, reward 10.0, 5/5 admissible actions.

Step 1. <observation\>The agent starts in a living room with armchairs, coffee tables, shelves, a sofa, and a TV stand.</observation\> Model output: I need to put a remote control on the sofa. I should first check whether I already carry one; otherwise I will search likely surfaces. <action\>inventory</action\>

Step 2. <observation\>You are not carrying anything.</observation\> Model output: I need to find a remote control. Coffee tables are likely locations, so I will inspect one first. <action\>go to coffeetable 1</action\>

Step 3. <observation\>The agent arrives at coffeetable 1 and sees a houseplant, a keychain, remotecontrol 4, and remotecontrol 3.</observation\> Model output: A remote control is visible; I should take one, then move to the sofa. <action\>take remotecontrol 3 from coffeetable 1</action\>

Step 4. <observation\>You pick up the remotecontrol 3 from coffeetable 1.</observation\> Model output: I now have the remote control; the next step is to go to sofa 1. <action\>go to sofa 1</action\>

Step 5. <observation\>The agent arrives at sofa 1; it contains a box, a creditcard, and a pillow.</observation\> Model output: The sofa is the target location, so I should place the remote control there. <action\>move remotecontrol 3 to sofa 1</action\> <observation\>Task completed.</observation\>

Figure 10: Qualitative ALFWorld trajectory from the ATOD-trained 1.7B student. The example is compressed from the raw rollout in `example_task8_remotecontrol_on_sofa.json`: routine prompts and long admissible-action lists are omitted, while the key environment observations, model outputs, and executed actions are preserved.
### C.2 Turn-level Case Studies

We include two compact case studies in [Figures 11](https://arxiv.org/html/2606.27814#A3.F11) and [12](https://arxiv.org/html/2606.27814#A3.F12) to visualize how T-DUR assigns turn-level distillation weights. In each panel, $d_k$ and $h_k$ are the raw, pre-normalization proxies for teacher–student disagreement and sampled-token student uncertainty, respectively, while $w_k$ is the final Soft-OR turn weight after per-trajectory normalization. The ALFWorld trace is from a trained student; the WebShop trace uses a pre-distillation student so that the reasoning-rich error-and-recovery pattern remains visible, with metrics computed against the same teacher. The layout follows the qualitative diagnostic style in which the task context appears above a step-wise trace, and the most important weight column is highlighted.
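To make the relationship between the raw proxies and the final weight concrete, the sketch below computes Soft-OR turn weights for one trajectory. It is an illustration under stated assumptions rather than the paper's exact procedure: per-trajectory normalization is taken to be min-max scaling, which may differ from equations 15 and 16, and the function names and example values are hypothetical.

```python
import numpy as np

def minmax(x, eps=1e-8):
    """Scale values within one trajectory to [0, 1] (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span < eps:
        return np.zeros_like(x)  # no contrast between turns
    return (x - x.min()) / span

def turn_weights(d_raw, h_raw):
    """Soft-OR fusion of normalized disagreement and uncertainty per turn."""
    d = minmax(d_raw)
    h = minmax(h_raw)
    return 1.0 - (1.0 - d) * (1.0 - h)

# Three turns: a routine one, an uncertain one, and a confidently different one.
d_raw = [0.0, 0.5, 1.0]   # raw teacher-student disagreement
h_raw = [0.0, 1.0, 0.0]   # raw sampled-token uncertainty
print(turn_weights(d_raw, h_raw))
```

In this toy trajectory the routine turn receives zero weight, while both the uncertain turn and the high-disagreement turn receive full weight, which mirrors the intent of using an OR-style fusion instead of either signal alone.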
Case A (ALFWorld): T-DUR Highlights Planning, Route Choice, and Goal Completion

- Task: clean some lettuce and put it in fridge.
- Outcome: success, 7 turns.
- Reading guide: $d_k$ is raw teacher–student disagreement, $h_k$ is raw sampled-token student uncertainty, and $w_k$ is the final Soft-OR weight after normalizing $d_k$ and $h_k$ within the trajectory.

Figure 11: ALFWorld appendix case study. T-DUR does not simply emphasize long or late turns: it up-weights the initial search plan, the route decision toward cleaning, and the final placement, while assigning near-zero weight to routine environment mechanics such as opening the fridge.

Case B (WebShop): T-DUR Flags Error-and-Recovery Turns in Product Selection

- Task: find a machine-wash men’s dress shirt with cotton spandex, classic fit, short sleeve, shipwreck burgundy, size 2x, and price lower than $60.
- Reading guide: $d_k$ is raw teacher–student disagreement, $h_k$ is raw sampled-token student uncertainty, and $w_k$ is the final Soft-OR weight after normalizing $d_k$ and $h_k$ within the trajectory.

Figure 12: WebShop appendix case study. Unlike the ALFWorld example, WebShop contains free-text search and many clickable attributes, so disagreement remains high even when the student appears decisive. The Soft-OR weight therefore captures both hesitant turns and confidently different turns such as the wrong-size click and the final purchase commitment.
### Case\-study Takeaway
Across both domains, T-DUR reallocates dense teacher supervision toward pivotal decision points rather than routine environment mechanics. In ALFWorld, entropy highlights open-ended search and route planning, while disagreement rescues a confident goal-completion turn. In WebShop, the combination of uncertainty and disagreement flags error-and-recovery behavior around size/color selection and purchase commitment. This supports the design choice of using a Soft-OR fusion rather than relying on either signal alone.
### C.3 T-DUR Diagnostic Metrics on ALFWorld

Figure 13: T-DUR diagnostic metrics for the 1.7B student model on the ALFWorld validation set, comparing indicators before and after ATOD training. In all panels, the dashed line shows the number of valid evaluation trajectories (sample count) at each training step. Panel (a): Teacher–student gap (disagreement proxy $d_k$) over steps, showing how the divergence between teacher and student distributions shrinks as training progresses. Panel (b): Entropy proxy ($h_k$) distribution across training steps, reflecting the student’s per-turn uncertainty before and after ATOD training. Panel (c): Turn-level T-DUR weight ($w_k$) as a function of training step, illustrating how the Soft-OR fusion dynamically reallocates distillation supervision across turns. Together, these diagnostics confirm that T-DUR progressively concentrates teacher supervision on high-utility turns while the overall disagreement and entropy decrease with training.

As shown in [Figure 13](https://arxiv.org/html/2606.27814#A3.F13), the dashed line in each panel indicates the number of evaluation trajectories available at each training step. We compare the diagnostic curves of the untrained (pre-training) and trained (post-training) 1.7B student to illustrate how T-DUR allocates supervision.
Untrained model. Before training, the model achieves very low accuracy and exhibits an average trajectory length of 47.91 turns, which approaches the experimental maximum of 50 steps. This indicates that most samples produce long yet incorrect trajectories that exhaust the step budget. Panel (a) reveals a large teacher–student disagreement gap in the first few turns, where the teacher provides strong corrective signals against the student’s erroneous actions. However, the gap progressively decays at later turns: as earlier mistakes accumulate into a long erroneous prefix, the teacher’s distribution becomes less informative for distinguishing corrective actions, causing the disagreement to weaken. Panel (b) shows that the entropy proxy remains around 0.1 with little variation throughout the trajectory, reflecting that the untrained student stays in a persistently uncertain and error-prone regime across all turns. Consequently, Panel (c) shows that T-DUR assigns higher weights to earlier turns and progressively lower weights to later turns. This pattern is well-motivated: when the prompt prefix gradually drifts due to accumulated errors, the learning value of later turns diminishes, and distillation effort should instead concentrate on the earlier turning points and steps where mistakes first arise.

Trained model. After ATOD training, the teacher–student disagreement gap drops substantially compared to the untrained case (Panel a), confirming effective knowledge transfer. The entropy proxy (Panel b) exhibits a clear two-phase pattern: entropy remains at a low value during the first approximately 20 steps, then rises beyond step 20. This aligns with the observation that the trained model achieves markedly higher accuracy, with most successful trajectories completing within 25 turns; on these successful trajectories the student is confident and entropy is low. For trajectories that extend beyond 20 steps, early deviations or errors have likely already occurred, causing the student’s uncertainty to increase at later positions. Correspondingly, Panel (c) shows that T-DUR assigns small weights to the earlier, more confident and correct turns, while up-weighting the later turns where the student exhibits higher confusion and requires stronger teacher correction. This asymmetric weighting strategy enables T-DUR to focus distillation capacity on problematic decision points rather than wasting supervision on routine turns the student has already mastered.
### C.4 Additional Training Diagnostics
Figure 14: Additional training diagnostic metrics for the 1.7B model on the ALFWorld dataset.