AsyncOPD: How Stale Can On-Policy Distillation Be?

arXiv cs.LG 06/24/26, 04:00 AM Papers
Summary
This paper presents AsyncOPD, a fully asynchronous on-policy distillation pipeline for LLMs, systematically studying the effects of stale-policy data and proposing estimator designs that improve training throughput by 1.6-3.8x while maintaining comparable accuracy.
arXiv:2606.24143v1 Announce Type: new
# AsyncOPD: How Stale Can On-Policy Distillation Be?
Source: [https://arxiv.org/html/2606.24143](https://arxiv.org/html/2606.24143)
Wonjun Kang¹, Kevin Galim∗¹, Seunghyuk Oh¹, Minjun Kang², Sanghyun Park², Donghoon Kim¹, Minjae Lee¹, Minseo Kim¹, Rishabh Tiwari³, Yuchen Zeng⁴, Hyung Il Koo¹,², Kangwook Lee⁵,⁶

¹FuriosaAI, ²Ajou University, ³UC Berkeley, ⁴Microsoft Research, ⁵KRAFTON, ⁶Ludo Robotics

Code: [https://github.com/furiosa-ai/async-opd](https://github.com/furiosa-ai/async-opd)

###### Abstract

On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training time for reasoning workloads. Asynchronous training pipelines can alleviate this bottleneck by decoupling rollout generation from learner updates, but doing so introduces stale-policy data. While prior work has studied stale data in asynchronous RL, its effects in OPD remain underexplored. We present the first systematic study of staleness in asynchronous OPD, focusing on a practical setting where teacher feedback is implemented through local KL losses and full-vocabulary teacher logits are too expensive to store or transfer, necessitating finite teacher-score caches. We first show that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable. Second, for this vulnerable reverse-KL case, we study whether methods designed to stabilize asynchronous RL can mitigate OPD staleness. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL signal under the current student at learner time. Third, we analyze how finite teacher-score caches create a bias-variance tradeoff for sparse and sampled reverse-KL OPD estimators. This motivates multi-sample Monte Carlo (MC), which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices. Experiments show that AsyncOPD improves training throughput by $1.6\times$ to $3.8\times$ over strict synchronous training while reaching comparable accuracy.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.24143v1/x1.png)

Figure 1: Estimator design for asynchronous OPD. (a) Dense KL is the full-vocabulary reference, but full teacher-logit caches are costly to store or transfer in asynchronous OPD. (b) Sparse top-$k$ exposes a support mismatch under staleness: forward KL is teacher-supported, but reverse KL is student-supported and may require actions outside the cached teacher-scored support. (c) One-sample Monte Carlo is correctable in expectation by importance sampling, but has high variance; our estimator recomputes $A_{\theta}$ at learner time and uses multi-sample MC to reduce variance.

On-policy distillation (OPD) \[[20](https://arxiv.org/html/2606.24143#bib.bib20), [6](https://arxiv.org/html/2606.24143#bib.bib3), [1](https://arxiv.org/html/2606.24143#bib.bib4)\] and reinforcement learning (RL) \[[28](https://arxiv.org/html/2606.24143#bib.bib30), [26](https://arxiv.org/html/2606.24143#bib.bib28)\] have become central post-training methods for improving large language model (LLM) reasoning \[[7](https://arxiv.org/html/2606.24143#bib.bib29)\], including mathematics \[[17](https://arxiv.org/html/2606.24143#bib.bib27)\] and coding \[[27](https://arxiv.org/html/2606.24143#bib.bib32)\]. OPD trains a student on its own rollouts using dense token-level feedback from a teacher \[[12](https://arxiv.org/html/2606.24143#bib.bib2)\], whereas RL learns from reward feedback on rollouts. OPD provides an effective and efficient route for LLM post-training, especially for smaller student models \[[24](https://arxiv.org/html/2606.24143#bib.bib11)\]. Recent work shows that OPD is not limited to distilling large teachers into small students: it also supports on-policy self-distillation \[[32](https://arxiv.org/html/2606.24143#bib.bib31)\] and multi-teacher distillation from domain-specialized teachers comparable in size to the student \[[2](https://arxiv.org/html/2606.24143#bib.bib26), [21](https://arxiv.org/html/2606.24143#bib.bib33)\].

OPD and RL inherit an on-policy systems bottleneck: each learner update must wait for fresh rollouts from the model being trained \[[5](https://arxiv.org/html/2606.24143#bib.bib34)\]. For reasoning tasks, these rollouts are long and expensive, so synchronous training often waits on generation rather than updating the model, leaving learners underutilized. Asynchronous RL \[[14](https://arxiv.org/html/2606.24143#bib.bib10), [4](https://arxiv.org/html/2606.24143#bib.bib1)\] relieves this bottleneck by decoupling rollout generation from learner updates: rollout workers keep generating data while the learner updates on earlier rollouts, improving training efficiency and hardware utilization \[[23](https://arxiv.org/html/2606.24143#bib.bib9), [34](https://arxiv.org/html/2606.24143#bib.bib22), [18](https://arxiv.org/html/2606.24143#bib.bib23)\]. A similar pipeline can be applied to OPD by running student rollout, teacher scoring, and learner updates in parallel \[[19](https://arxiv.org/html/2606.24143#bib.bib5)\].

However, asynchronous execution introduces stale-policy data, and learning from such data can degrade model quality \[[3](https://arxiv.org/html/2606.24143#bib.bib7)\]. This creates a trade-off: more aggressive asynchrony improves training throughput, but it also increases the policy lag between rollout and learning. Prior work on asynchronous RL therefore studies how to stabilize learning from stale-policy data \[[4](https://arxiv.org/html/2606.24143#bib.bib1), [33](https://arxiv.org/html/2606.24143#bib.bib8), [10](https://arxiv.org/html/2606.24143#bib.bib24)\]. Yet it remains underexplored whether these ideas and stale-data solutions transfer to OPD, because practical implementations of OPD expose a different feedback interface. Teacher feedback is often implemented through local KL losses, which require teacher scores over actions at student-visited prefixes. Since full-vocabulary teacher logits are expensive to store or transfer, especially in an asynchronous pipeline, teacher scores are usually cached only on a finite set of actions ([Fig. 1](https://arxiv.org/html/2606.24143#S1.F1)). Once the learner receives a teacher-scored cache, it can recompute current-student log probabilities on cached actions, but it cannot recover teacher scores for actions that were never scored. This raises three questions that structure our study: (i) how asynchronous OPD behaves under staleness, (ii) whether asynchronous RL ideas and stale-data solutions transfer to OPD, and (iii) how finite teacher-score caches shape OPD estimator design.

First, we study how KL direction shapes staleness. Under asynchronous OPD with cached teacher scores, the same stale rollout cache can affect different KL objectives differently. As illustrated in [Fig. 1](https://arxiv.org/html/2606.24143#S1.F1), forward KL is teacher-weighted and is more robust to stale rollouts, whereas reverse KL is student-weighted and becomes vulnerable when current-student actions fall outside the scored cache. We therefore focus on reverse-KL OPD in the remainder of the staleness analysis.

Second, focusing on the reverse-KL case, we ask whether methods designed to stabilize asynchronous RL can also mitigate OPD staleness. This comparison is natural because reverse KL in OPD admits an RL-style policy-gradient surrogate, where the teacher-student log-ratio acts as a token-level advantage. We therefore evaluate PPO-style clipping \[[16](https://arxiv.org/html/2606.24143#bib.bib35)\], decoupled PPO \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\], and M2PO \[[33](https://arxiv.org/html/2606.24143#bib.bib8)\]. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL token-level advantage under the current student at learner time without clipping.

Third, we return to the teacher-cache constraint and study the resulting bias-variance tradeoff for sparse and sampled reverse-KL OPD implementations. Stale student top-$k$ supports provide deterministic coverage but are support-mismatched because they may omit actions required by the current top-$k$ objective, and reweighting inside the stale support cannot recover the missing teacher scores. One-sample Monte Carlo (MC) avoids this fixed-support mismatch through importance-correctable samples from the stale rollout policy, but suffers from high variance. This motivates multi-sample MC, which caches and teacher-scores multiple stale-policy samples at each decoding step, preserving MC correctability while reducing one-sample variance.
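The variance argument for multi-sample MC can be illustrated on a toy problem. The sketch below is not taken from the AsyncOPD code: the four-token distributions, the function names, and the choice of 1 versus 8 samples are all illustrative assumptions. It draws actions from a stale policy, reweights them by the old-to-current ratio, and compares the empirical variance of the one-sample and multi-sample estimates of the reverse KL.

```python
import math
import random
import statistics

# Toy distributions over a 4-token vocabulary (hypothetical values).
P_OLD = [0.40, 0.30, 0.20, 0.10]    # stale rollout student
P_THETA = [0.25, 0.35, 0.25, 0.15]  # current student at learner time
Q = [0.30, 0.40, 0.20, 0.10]        # teacher

def exact_reverse_kl(p, q):
    """Dense KL(p || q) summed over the full toy vocabulary."""
    return sum(pa * (math.log(pa) - math.log(qa)) for pa, qa in zip(p, q))

def mc_reverse_kl(num_samples, rng):
    """Importance-corrected estimate of KL(p_theta || q) from stale samples.

    Draws `num_samples` actions from P_OLD, reweights each by
    rho = p_theta / p_old, and averages -rho * A_theta.
    """
    actions = rng.choices(range(len(P_OLD)), weights=P_OLD, k=num_samples)
    terms = []
    for a in actions:
        rho = P_THETA[a] / P_OLD[a]
        adv = math.log(Q[a]) - math.log(P_THETA[a])  # A_theta, recomputed
        terms.append(-rho * adv)
    return sum(terms) / num_samples

def estimator_stats(num_samples, trials=4000, seed=0):
    """Empirical mean and variance of the estimator over repeated trials."""
    rng = random.Random(seed)
    estimates = [mc_reverse_kl(num_samples, rng) for _ in range(trials)]
    return statistics.mean(estimates), statistics.variance(estimates)

target = exact_reverse_kl(P_THETA, Q)
mean_1, var_1 = estimator_stats(num_samples=1)
mean_8, var_8 = estimator_stats(num_samples=8)
print(f"exact={target:.4f}")
print(f"1 sample : mean={mean_1:.4f} var={var_1:.5f}")
print(f"8 samples: mean={mean_8:.4f} var={var_8:.5f}")
```

Both estimators target the same quantity, so their means should agree with the exact value up to sampling noise, while the variance of the averaged estimator shrinks roughly in proportion to the number of samples per prefix.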

Finally, we instantiate these findings in AsyncOPD, a fully asynchronous OPD pipeline that overlaps student rollout, teacher scoring, and learner updates. On Qwen3-Base models, AsyncOPD improves training throughput by $1.6\times$ to $3.8\times$ over strict synchronous training while maintaining comparable accuracy. Our contributions are:

- We provide the first systematic study of staleness in asynchronous OPD through the lens of an OPD-specific teacher-cache constraint.
- We show that KL direction changes the stale-data problem: forward KL is comparatively robust to stale rollouts, whereas reverse KL is vulnerable because it is student-weighted ([Section 4](https://arxiv.org/html/2606.24143#S4)).
- We identify that the most effective reverse-KL policy-gradient surrogate uses the advantage recomputed at learner time without clipping, and that advanced asynchronous RL surrogates do not improve over this choice ([Section 5](https://arxiv.org/html/2606.24143#S5)).
- We show that stale student top-$k$ supports are support-mismatched, while one-sample MC remains correctable but high-variance; this motivates multi-sample MC ([Section 6](https://arxiv.org/html/2606.24143#S6)).
- We present and open-source AsyncOPD, a fully asynchronous OPD training pipeline, and demonstrate improved training efficiency while maintaining OPD quality ([Section 7](https://arxiv.org/html/2606.24143#S7)).

## 2 Related Works

#### On\-Policy Distillation

On-policy distillation (OPD) trains a student on its own rollouts while using a teacher to provide dense token-level feedback on the visited prefixes \[[20](https://arxiv.org/html/2606.24143#bib.bib20), [12](https://arxiv.org/html/2606.24143#bib.bib2)\]. GKD \[[1](https://arxiv.org/html/2606.24143#bib.bib4)\] introduced a token-level KL formulation, while MiniLLM \[[6](https://arxiv.org/html/2606.24143#bib.bib3)\] studied a sequence-level reverse-KL variant. Li et al. \[[11](https://arxiv.org/html/2606.24143#bib.bib12)\] study token-level OPD training dynamics and recipes for unstable configurations. TIP \[[22](https://arxiv.org/html/2606.24143#bib.bib21)\] characterizes per-token importance through student entropy and teacher-student divergence. G-OPD \[[25](https://arxiv.org/html/2606.24143#bib.bib6)\] interprets token-level OPD as dense KL-constrained RL and extends it with reward scaling. These works clarify OPD as an effective post-training objective, but assume rollouts, teacher scoring, and learner updates stay synchronized.

#### Asynchronous RL

In synchronous RL pipelines, training often waits for the longest rollout in a batch to finish, leaving learner resources idle. Asynchronous RL improves hardware utilization by decoupling rollout generation from learner updates. Async RLHF \[[14](https://arxiv.org/html/2606.24143#bib.bib10)\] overlaps generation and learning so that new samples are produced while the learner trains on earlier ones. StreamRL \[[34](https://arxiv.org/html/2606.24143#bib.bib22)\] further disaggregates the RLHF pipeline into streaming stages. AReaL \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\] fully decouples rollout workers from training workers for continuous asynchronous execution. Laminar \[[18](https://arxiv.org/html/2606.24143#bib.bib23)\] uses fine-grained weight synchronization for trajectory-level asynchrony. However, asynchronous RL must learn from stale-policy data. Decoupled PPO \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\] stabilizes asynchronous RL training by separating the behavior policy for stale rollouts from the proximal policy that anchors PPO \[[16](https://arxiv.org/html/2606.24143#bib.bib35)\] updates. M2PO \[[33](https://arxiv.org/html/2606.24143#bib.bib8)\] stabilizes stale updates with second-moment importance-weight constraints, and A-3PO \[[10](https://arxiv.org/html/2606.24143#bib.bib24)\] reduces decoupled PPO overhead through staleness-aware interpolation.

#### Asynchronous OPD

VeRL \[[19](https://arxiv.org/html/2606.24143#bib.bib5)\] implements step-off OPD schedulers that overlap student rollout, teacher scoring, and learner update by fixing rollout lag to one or two learner steps. These schedulers establish the practical feasibility of asynchronous OPD, but leave open how OPD estimators behave under stale teacher-scored caches. KDFlow \[[29](https://arxiv.org/html/2606.24143#bib.bib25)\] improves systems efficiency for LLM distillation by decoupling teacher inference from learner training and transmitting teacher hidden states, but targets synchronous OPD and leaves asynchronous execution as future work. We study this missing asynchronous OPD regime directly and build AsyncOPD from the resulting estimator choices.

## 3 Preliminaries: On-Policy Distillation

#### OPD setup

At each decoding timestep, we view the visited prefix $s$ as the local state and the next token $a$ as the action. Let $q(a\mid s)$ denote the teacher policy and $p_{\theta}(a\mid s)$ denote the current student policy. Following prior work on token-level OPD \[[12](https://arxiv.org/html/2606.24143#bib.bib2), [25](https://arxiv.org/html/2606.24143#bib.bib6), [11](https://arxiv.org/html/2606.24143#bib.bib12)\], we apply local losses to generated output tokens and analyze the resulting objectives at a fixed prefix state $s$. OPD can be defined with different divergences; forward and reverse KL are two standard choices \[[1](https://arxiv.org/html/2606.24143#bib.bib4)\].

#### Forward\-KL OPD

At a fixed prefix $s$, forward-KL OPD is teacher-weighted:

$$D_{F}(\theta;s)=\mathrm{KL}\!\left(q(\cdot\mid s)\,\|\,p_{\theta}(\cdot\mid s)\right)=\sum_{a\in\mathcal{V}}q(a\mid s)\left(\log q(a\mid s)-\log p_{\theta}(a\mid s)\right). \tag{1}$$

At a fixed prefix $s$, the gradient is $\nabla_{\theta}D_{F}(\theta;s)=-\sum_{a\in\mathcal{V}}q(a\mid s)\nabla_{\theta}\log p_{\theta}(a\mid s)$.
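As a minimal numerical sketch of Eq. 1 and its gradient (not from the paper's code), the snippet below evaluates forward KL on a toy three-token vocabulary. The teacher distribution, the student logits, and all function names are illustrative assumptions. It additionally assumes the student is a softmax over logits, in which case the gradient expression above reduces to $p_{\theta}-q$ per logit; the sketch checks this against central finite differences.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher_probs, student_logits):
    """Eq. 1 at one prefix: sum_a q(a|s) * (log q(a|s) - log p_theta(a|s))."""
    student_probs = softmax(student_logits)
    return sum(q * (math.log(q) - math.log(p))
               for q, p in zip(teacher_probs, student_probs))

def forward_kl_grad(teacher_probs, student_logits):
    """Gradient w.r.t. the logits of a softmax student: p_theta - q."""
    student_probs = softmax(student_logits)
    return [p - q for p, q in zip(student_probs, teacher_probs)]

def finite_difference_grad(fn, logits, eps=1e-6):
    """Central-difference gradient used only as a numerical check."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

teacher = [0.7, 0.2, 0.1]          # hypothetical teacher distribution q(.|s)
student_logits = [0.1, 0.5, -0.3]  # hypothetical student logits

analytic = forward_kl_grad(teacher, student_logits)
numeric = finite_difference_grad(lambda z: forward_kl(teacher, z), student_logits)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Note that the weights on the score terms come from the teacher $q$ only, so the gradient never needs an expectation over student-sampled actions.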

#### Reverse\-KL OPD

At the same prefix, reverse-KL OPD is student-weighted:

$$D_{R}(\theta;s)=\mathrm{KL}\!\left(p_{\theta}(\cdot\mid s)\,\|\,q(\cdot\mid s)\right)=-\sum_{a\in\mathcal{V}}p_{\theta}(a\mid s)\left(\log q(a\mid s)-\log p_{\theta}(a\mid s)\right). \tag{2}$$

Differentiating and using $\mathbb{E}_{a\sim p_{\theta}(\cdot\mid s)}[\nabla_{\theta}\log p_{\theta}(a\mid s)]=\nabla_{\theta}\sum_{a}p_{\theta}(a\mid s)=0$ gives

$$\begin{aligned}
\nabla_{\theta}D_{R}(\theta;s)&=\sum_{a\in\mathcal{V}}p_{\theta}(a\mid s)\left(\log p_{\theta}(a\mid s)-\log q(a\mid s)+1\right)\nabla_{\theta}\log p_{\theta}(a\mid s)\\
&=-\sum_{a\in\mathcal{V}}p_{\theta}(a\mid s)\left(\log q(a\mid s)-\log p_{\theta}(a\mid s)\right)\nabla_{\theta}\log p_{\theta}(a\mid s).
\end{aligned} \tag{3}$$

Viewing [Eq. 3](https://arxiv.org/html/2606.24143#S3.E3) as a policy-gradient estimator and $A=\log q(a\mid s)-\log p(a\mid s)$ as the advantage term connects reverse-KL OPD to standard RL training machinery, and practical implementations typically use PPO-style surrogates. Given behavior-policy samples $a\sim p_{\mathrm{beh}}$, define $\rho_{\theta}(a,s)=p_{\theta}(a\mid s)/p_{\mathrm{beh}}(a\mid s)$ and $\bar{\rho}_{\theta}(a,s)=\operatorname{clip}(\rho_{\theta}(a,s),1-\epsilon,1+\epsilon)$. The PPO-style local surrogate uses these ratios with a frozen behavior-time signal $A_{\mathrm{beh}}(a,s)$, where $\operatorname{sg}(\cdot)$ denotes stop-gradient:

$$L_{\mathrm{PPO}}(\theta;A_{\mathrm{beh}})=-\mathbb{E}_{a\sim p_{\mathrm{beh}}}\left[\min\left(\rho_{\theta}\operatorname{sg}\!\left(A_{\mathrm{beh}}\right),\bar{\rho}_{\theta}\operatorname{sg}\!\left(A_{\mathrm{beh}}\right)\right)\right]. \tag{4}$$
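The cancellation of the $+1$ term in Eq. 3 can be checked numerically. The sketch below is illustrative only: it assumes a softmax student over logits, for which $\partial\log p_{\theta}(a\mid s)/\partial z_j=\mathbb{1}[a=j]-p_{\theta}(j\mid s)$, and the toy distributions and function names are hypothetical. It evaluates the second line of Eq. 3 term by term and compares it with a finite-difference gradient of Eq. 2.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(teacher_probs, student_logits):
    """Eq. 2 at one prefix: sum_a p_theta(a|s) * (log p_theta(a|s) - log q(a|s))."""
    p = softmax(student_logits)
    return sum(pa * (math.log(pa) - math.log(qa))
               for pa, qa in zip(p, teacher_probs))

def reverse_kl_grad(teacher_probs, student_logits):
    """Second line of Eq. 3 with A = log q - log p_theta held as a weight."""
    p = softmax(student_logits)
    adv = [math.log(qa) - math.log(pa) for qa, pa in zip(teacher_probs, p)]
    grad = []
    for j in range(len(p)):
        # d log p(a) / d z_j = 1[a == j] - p_j for a softmax student.
        g = -sum(p[a] * adv[a] * ((1.0 if a == j else 0.0) - p[j])
                 for a in range(len(p)))
        grad.append(g)
    return grad

def finite_difference_grad(fn, logits, eps=1e-6):
    """Central-difference gradient used only as a numerical check."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

teacher = [0.6, 0.3, 0.1]          # hypothetical teacher distribution q(.|s)
student_logits = [0.2, -0.4, 0.9]  # hypothetical student logits

analytic = reverse_kl_grad(teacher, student_logits)
numeric = finite_difference_grad(lambda z: reverse_kl(teacher, z), student_logits)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Unlike the forward-KL gradient, every term here is weighted by the student probability $p_{\theta}(a\mid s)$, which is what makes the estimator sensitive to which student version produced the sampled actions.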

#### Sparse and sampled implementations

The dense objectives above are full-vocabulary references. Practical OPD instead evaluates local KL losses on finite supports or sampled actions \[[12](https://arxiv.org/html/2606.24143#bib.bib2), [11](https://arxiv.org/html/2606.24143#bib.bib12)\], trading computation against support coverage and estimator variance. Sparse top-$k$ implementations choose a support $S(s)$ and evaluate the corresponding restricted KL after renormalizing teacher and student distributions on $S(s)$. Monte Carlo (MC) implementations draw actions from a proposal distribution and estimate the corresponding local gradient; for reverse KL, this yields the student-sampled policy-gradient estimator. Details are in [Appendix A](https://arxiv.org/html/2606.24143#A1).
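The two implementation families can be sketched on a toy vocabulary. The example below is a hedged illustration rather than the paper's implementation: the distributions, the helper names, and the use of the student's own top-$k$ tokens as the support $S(s)$ are assumptions made here. It computes a restricted reverse KL after renormalizing both distributions on the support, and a Monte Carlo estimate of the dense reverse-KL value from student-sampled actions.

```python
import math
import random

def top_k_support(probs, k):
    """Indices of the k most probable tokens (assumed support choice)."""
    return sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)[:k]

def restricted_reverse_kl(student, teacher, support):
    """Reverse KL after renormalizing student and teacher on `support`."""
    z_student = sum(student[a] for a in support)
    z_teacher = sum(teacher[a] for a in support)
    return sum(
        (student[a] / z_student)
        * (math.log(student[a] / z_student) - math.log(teacher[a] / z_teacher))
        for a in support
    )

def mc_reverse_kl(student, teacher, num_samples, seed=0):
    """Average of log p(a) - log q(a) over actions sampled from the student."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(student)), weights=student, k=num_samples)
    return sum(math.log(student[a]) - math.log(teacher[a]) for a in actions) / num_samples

student = [0.50, 0.30, 0.15, 0.05]  # hypothetical p_theta(.|s)
teacher = [0.40, 0.40, 0.10, 0.10]  # hypothetical q(.|s)

dense = restricted_reverse_kl(student, teacher, range(len(student)))
sparse = restricted_reverse_kl(student, teacher, top_k_support(student, k=2))
sampled = mc_reverse_kl(student, teacher, num_samples=20000)
print(f"dense={dense:.4f} sparse_top2={sparse:.4f} mc={sampled:.4f}")
```

The sparse value is deterministic but depends on which tokens the support covers, whereas the sampled value fluctuates around the dense reference; this is the coverage-versus-variance trade the paragraph describes.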

## 4 Forward- and Reverse-KL OPD Under Staleness

Asynchronous OPD has both prefix\-level and action\-level staleness\. Once a rollout is generated, its visited prefixes are fixed, so an action\-level estimator cannot change which states the learner sees\. We therefore focus on the action\-level staleness that estimator design can directly address\.

### 4.1 Asynchronous OPD Setup

Asynchronous OPD is a cached-data pipeline: rollout first selects prefixes and actions, teacher scoring then annotates those actions, and the learner updates the student later. Unlike synchronous OPD, these stages are separated in time, so the visited prefixes, action cache, teacher scores, and update policy may be tied to different student versions. [Figure 1](https://arxiv.org/html/2606.24143#S1.F1) summarizes this cached-teacher setting and the estimator contrasts induced by the three-stage cache pipeline.

#### Teacher\-cache constraint

Full-vocabulary teacher logits allow dense KL computation, but caching and transferring them is prohibitively expensive, especially in an asynchronous pipeline. We therefore focus on sparse top-$k$ supports and MC samples as the sparse and sampled cases.

#### Stage 1: Student rollout

A rollout actor samples trajectories from a stale student $p_{\text{old}}$, which fixes the visited prefixes $s$. At each prefix, it stores cached actions $C_{\text{old}}(s)$, such as a sampled token or a top-$k$ support, together with their rollout-time log probabilities under $p_{\text{old}}$.

#### Stage 2: Teacher scoring

Let $C_{\text{score}}(s)$ denote the teacher-scored cache; it may come from the rollout cache $C_{\text{old}}(s)$ or be selected by the teacher at scoring time. Once teacher scoring is complete, teacher logits are available only on $C_{\text{score}}(s)$.

#### Stage 3: Student update

By learner update time, the student has moved to the current policy $p_{\theta}$. The learner can recompute $\log p_{\theta}(a\mid s)$ for $a\in C_{\text{score}}(s)$, but the current local OPD objective may place mass on actions outside this teacher-scored cache, such as the current student top-$k$ support or current student-sampled actions. Thus the learner can update current student probabilities on cached actions, but cannot recover teacher signals for missing actions without additional teacher access.
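The three stages can be pictured with a small data-structure sketch. Everything below is hypothetical: the `ScoredCache` record layout, the helper names, and the five-token distributions are assumptions for illustration and do not describe the AsyncOPD codebase. The rollout stage caches the stale student's top-$k$ actions, the teacher scores only those actions, and the learner then recomputes current log probabilities and reports which of its own top-$k$ actions have no teacher score.

```python
import math
from dataclasses import dataclass

@dataclass
class ScoredCache:
    """Teacher-scored cache for one prefix (hypothetical record layout)."""
    actions: list           # cached token ids, C_score(s)
    old_logprobs: list      # log p_old(a|s), stored at rollout time
    teacher_logprobs: list  # log q(a|s), filled in at teacher-scoring time

def top_k(probs, k):
    return sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)[:k]

def build_cache(p_old, teacher, k):
    """Stages 1 and 2: cache the stale top-k actions, then score only those."""
    actions = top_k(p_old, k)
    return ScoredCache(
        actions=actions,
        old_logprobs=[math.log(p_old[a]) for a in actions],
        teacher_logprobs=[math.log(teacher[a]) for a in actions],
    )

def learner_view(cache, p_theta, k):
    """Stage 3: what the learner can and cannot recover from the cache."""
    current_logprobs = [math.log(p_theta[a]) for a in cache.actions]
    covered_mass = sum(p_theta[a] for a in cache.actions)
    # Actions the current top-k objective needs but the teacher never scored.
    missing = [a for a in top_k(p_theta, k) if a not in cache.actions]
    return current_logprobs, covered_mass, missing

p_old = [0.40, 0.30, 0.20, 0.05, 0.05]    # stale student (toy values)
teacher = [0.30, 0.30, 0.10, 0.25, 0.05]  # teacher (toy values)
p_theta = [0.10, 0.30, 0.15, 0.40, 0.05]  # current student (toy values)

cache = build_cache(p_old, teacher, k=2)
current_logprobs, covered_mass, missing = learner_view(cache, p_theta, k=2)
print("cached actions:", cache.actions)
print("current-student mass covered by cache:", round(covered_mass, 3))
print("current top-k actions without teacher scores:", missing)
```

In this toy case the current student's most likely token was never scored, so no reweighting of the cached entries can supply its teacher log probability.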

#### Experimental setup

Unless otherwise stated, we train a Qwen3-4B-Base student using a Qwen3-30B-A3B-Instruct-2507 teacher \[[24](https://arxiv.org/html/2606.24143#bib.bib11)\]. The training data is DeepMath \[[8](https://arxiv.org/html/2606.24143#bib.bib15)\], filtered to 57,630 math problems with difficulty level at least 6, and we report final-checkpoint Avg@32 accuracy on AIME24 \[[30](https://arxiv.org/html/2606.24143#bib.bib17)\], AIME25 \[[31](https://arxiv.org/html/2606.24143#bib.bib18)\], and AMC \[[13](https://arxiv.org/html/2606.24143#bib.bib19)\]. Experimental details are provided in [Appendix B](https://arxiv.org/html/2606.24143#A2); dataset and metric details are in [Appendix C](https://arxiv.org/html/2606.24143#A3).

### 4.2 Forward KL vs. Reverse KL Under Staleness

The KL direction fixes the action weighting: forward KL weights actions by the teacher $q$, whereas reverse KL weights them by the student $p_{\theta}$. With cached teacher scores, this weighting difference becomes a support-ownership difference ([Fig. 1](https://arxiv.org/html/2606.24143#S1.F1)). Under a scored-cache restriction, this makes forward KL less exposed to stale student action choices: it does not need to convert stale student-sampled actions into a current-student expectation. Reverse KL instead depends on student-weighted action terms, so the same asynchronous cache creates a different action-level staleness problem.

#### Experimental results

[Figure 2](https://arxiv.org/html/2606.24143#S4.F2) compares representative practical OPD implementations from prior work: sparse top-$k$ forward KL \[[19](https://arxiv.org/html/2606.24143#bib.bib5)\] and PPO-style reverse-KL surrogates \[[12](https://arxiv.org/html/2606.24143#bib.bib2), [25](https://arxiv.org/html/2606.24143#bib.bib6)\]. Reverse KL starts higher at zero staleness, but as staleness increases it drops faster and is eventually overtaken by forward KL. We therefore focus the rest of the staleness analysis on how to make reverse-KL OPD robust under larger rollout staleness.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x2.png) (a) Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x3.png) (b) AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x4.png) (c) AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x5.png) (d) AMC

Figure 2: Accuracy comparison under staleness for forward- and reverse-KL OPD. Reverse KL starts higher at zero staleness but degrades faster as staleness grows; forward KL is flatter across the sweep.

Finding 1. Forward KL is teacher-weighted and robust to rollout staleness, whereas reverse KL is student-weighted and vulnerable to rollout staleness.

#### Two axes of reverse\-KL staleness

The cache analysis above suggests a possible mechanism for this gap: because reverse KL is weighted by the current student, stale teacher-scored caches may fail to cover actions needed by the current reverse-KL objective. In addition, reverse-KL policy-gradient updates can be instantiated with multiple stale-data surrogates, including PPO-style and asynchronous-RL variants. We therefore split the reverse-KL analysis into a policy-gradient surrogate axis, studied in [Section 5](https://arxiv.org/html/2606.24143#S5), and a cached-support axis, studied in [Section 6](https://arxiv.org/html/2606.24143#S6).

## 5 Reverse-KL: Policy-Gradient Surrogates Under Staleness

Reverse\-KL OPD admits several policy\-gradient surrogate choices under stale rollouts\. This section compares which choices remain effective under staleness\.

### 5.1 Policy-Gradient Surrogate Choices

#### PPO\-style objective

In the PPO-style surrogate in [Eq. 4](https://arxiv.org/html/2606.24143#S3.E4), the advantage is computed under the behavior policy and then held fixed during the learner update. In stale reverse-KL OPD, the behavior policy is the rollout student, so a mechanical PPO-style adaptation sets $p_{\mathrm{beh}}=p_{\mathrm{old}}$ and uses the rollout-time reverse-KL advantage $A_{\mathrm{old}}(a,s)=\log q(a\mid s)-\log p_{\mathrm{old}}(a\mid s)$ as $A_{\mathrm{beh}}$, together with the clipped old-to-current ratio. The unclipped variant simply drops the clipped term.

#### Exact importance\-sampling identity

In contrast, rewriting the reverse-KL objective ([Eq. 2](https://arxiv.org/html/2606.24143#S3.E2)) by importance sampling suggests a different surrogate choice. With the current reverse-KL advantage $A_{\theta}(a,s)=\log q(a\mid s)-\log p_{\theta}(a\mid s)$, and assuming $p_{\mathrm{old}}$ has support wherever $p_{\theta}$ does, the current reverse-KL objective admits the exact old-to-current importance-sampling (IS) identity

$$D_{R}(\theta;s)=-\mathbb{E}_{a\sim p_{\theta}}\left[A_{\theta}(a,s)\right]=-\mathbb{E}_{a\sim p_{\mathrm{old}}}\left[\rho_{\theta}(a,s)A_{\theta}(a,s)\right]. \tag{5}$$

For the policy-gradient update, the advantage is used as a stop-gradient weight; the derivative of the omitted $A_{\theta}$ term cancels by the score-function identity, as in [Eq. 3](https://arxiv.org/html/2606.24143#S3.E3). Thus the IS view points to the opposite surrogate choice from the mechanical PPO adaptation: recompute $A_{\theta}$ under the current student and use the old-to-current ratio without clipping, with $A_{\theta}$ treated as a stop-gradient advantage.
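The identity in Eq. 5 can be verified exactly on a small vocabulary by replacing the expectation with a full sum. The sketch below is illustrative, with hypothetical toy distributions and function names. It also evaluates the same expression with the frozen rollout-time advantage $A_{\mathrm{old}}$; a short calculation shows that this variant equals $D_{R}(\theta;s)-\mathrm{KL}(p_{\theta}\,\|\,p_{\mathrm{old}})$ in value, so it no longer matches the current objective once the student has moved.

```python
import math

P_OLD = [0.40, 0.30, 0.20, 0.10]    # stale rollout student (toy values)
P_THETA = [0.25, 0.35, 0.25, 0.15]  # current student (toy values)
Q = [0.30, 0.40, 0.20, 0.10]        # teacher (toy values)

def kl(p, q):
    """Dense KL(p || q) over the toy vocabulary."""
    return sum(pa * (math.log(pa) - math.log(qa)) for pa, qa in zip(p, q))

def is_objective(advantage):
    """-E_{a ~ p_old}[rho * A], with the expectation taken as an exact sum.

    advantage="theta" uses A_theta = log q - log p_theta (recomputed);
    advantage="old" uses A_old = log q - log p_old (frozen at rollout time).
    """
    total = 0.0
    for p_old, p_theta, q in zip(P_OLD, P_THETA, Q):
        rho = p_theta / p_old
        reference = p_theta if advantage == "theta" else p_old
        adv = math.log(q) - math.log(reference)
        total += p_old * rho * adv
    return -total

reverse_kl_now = kl(P_THETA, Q)
with_current_adv = is_objective("theta")
with_frozen_adv = is_objective("old")
print(f"D_R(theta)            = {reverse_kl_now:.6f}")
print(f"IS with A_theta       = {with_current_adv:.6f}")
print(f"IS with frozen A_old  = {with_frozen_adv:.6f}")
```

The gap between the last two values grows with the divergence between the current and rollout students, which is one way to read why the frozen advantage becomes less reliable at larger staleness.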

#### A two\-by\-two surrogate ablation

The PPO-style adaptation and the OPD/IS identity suggest different surrogate choices. We therefore ablate the advantage ($A_{\mathrm{old}}$ versus $A_{\theta}$) and whether to clip the ratio, with $\operatorname{sg}(\cdot)$ denoting stop-gradient:

$$\begin{aligned}
L_{\mathrm{old}}^{\mathrm{clip}}(\theta)&=-\mathbb{E}_{a\sim p_{\mathrm{old}}}\left[\min\left(\rho_{\theta}\operatorname{sg}\!\left(A_{\mathrm{old}}\right),\bar{\rho}_{\theta}\operatorname{sg}\!\left(A_{\mathrm{old}}\right)\right)\right],&
L_{\mathrm{old}}^{\mathrm{noclip}}(\theta)&=-\mathbb{E}_{a\sim p_{\mathrm{old}}}\left[\rho_{\theta}\operatorname{sg}\!\left(A_{\mathrm{old}}\right)\right],
\end{aligned} \tag{6}$$

$$\begin{aligned}
L_{\theta}^{\mathrm{clip}}(\theta)&=-\mathbb{E}_{a\sim p_{\mathrm{old}}}\left[\min\left(\rho_{\theta}\operatorname{sg}\!\left(A_{\theta}\right),\bar{\rho}_{\theta}\operatorname{sg}\!\left(A_{\theta}\right)\right)\right],&
L_{\theta}^{\mathrm{noclip}}(\theta)&=-\mathbb{E}_{a\sim p_{\mathrm{old}}}\left[\rho_{\theta}\operatorname{sg}\!\left(A_{\theta}\right)\right].
\end{aligned} \tag{7}$$

Here $L_{\mathrm{old}}^{\mathrm{clip}}$ is the PPO-style adaptation, while $L_{\theta}^{\mathrm{noclip}}$ is the OPD/IS surrogate.
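The four surrogates differ only in which advantage is used and whether the ratio is clipped, which a single function can express. The following is a minimal sketch under stated assumptions: the function name, its arguments, and the single-sample example are hypothetical, and it evaluates loss values only. In an autodiff framework the advantage would additionally need to be detached to realize the stop-gradient $\operatorname{sg}(\cdot)$.

```python
import math

def surrogate_loss(logp_theta, logp_old, logq, advantage="theta", clip_eps=None):
    """Value of one of the four surrogates on samples drawn from p_old.

    logp_theta, logp_old, logq: per-sample log probabilities of the sampled
    actions under the current student, the rollout student, and the teacher.
    advantage: "theta" recomputes A_theta; "old" uses the frozen A_old.
    clip_eps: None for the unclipped variants, else the PPO clip range.
    """
    terms = []
    for lt, lo, lq in zip(logp_theta, logp_old, logq):
        rho = math.exp(lt - lo)  # old-to-current importance ratio
        adv = lq - lt if advantage == "theta" else lq - lo
        term = rho * adv
        if clip_eps is not None:
            rho_bar = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
            term = min(term, rho_bar * adv)
        terms.append(term)
    return -sum(terms) / len(terms)

# One hypothetical sampled action: p_theta = 0.3, p_old = 0.2, q = 0.6.
lt, lo, lq = [math.log(0.3)], [math.log(0.2)], [math.log(0.6)]

loss_old_clip = surrogate_loss(lt, lo, lq, advantage="old", clip_eps=0.2)
loss_old_noclip = surrogate_loss(lt, lo, lq, advantage="old")
loss_theta_clip = surrogate_loss(lt, lo, lq, advantage="theta", clip_eps=0.2)
loss_theta_noclip = surrogate_loss(lt, lo, lq, advantage="theta")
print(loss_old_clip, loss_old_noclip, loss_theta_clip, loss_theta_noclip)
```

For this sample the ratio is 1.5 and both advantages are positive, so clipping with $\epsilon=0.2$ caps the ratio at 1.2 in the two clipped variants, while the recomputed advantage is smaller than the frozen one because the student has already moved toward the teacher on this action.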

![Refer to caption](https://arxiv.org/html/2606.24143v1/x6.png) (a) Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x7.png) (b) AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x8.png) (c) AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x9.png) (d) AMC

Figure 3: Accuracy comparison under staleness for the advantage-and-clipping ablation. Recomputing $A_{\theta}$ at learner time and avoiding clipping gives the most stable performance across the sweep, while clipping mainly helps the frozen $A_{\mathrm{old}}$ baseline.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x10.png) (a) Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x11.png) (b) AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x12.png) (c) AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x13.png) (d) AMC

Figure 4: Accuracy comparison under staleness for advanced asynchronous RL surrogates. Decoupled PPO \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\] and M2PO \[[33](https://arxiv.org/html/2606.24143#bib.bib8)\] do not consistently improve over the simpler OPD/IS surrogate that recomputes $A_{\theta}$ without clipping; the Decoupled PPO curve is clipped in the plot for readability because of its low accuracy.

#### Advanced asynchronous RL surrogates

Decoupled PPO \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\] and M2PO \[[33](https://arxiv.org/html/2606.24143#bib.bib8)\] are asynchronous RL surrogates designed to improve robustness to stale-policy updates. We evaluate whether these surrogates, which have not previously been studied for OPD, also help OPD under staleness.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x14.png)

Figure 5: $A_{\theta}$ reduces the p99 $\rho_{\theta}$ tail under no clip.

### 5.2 Experimental Results

[Fig. 3](https://arxiv.org/html/2606.24143#S5.F3) and [Table 1(a)](https://arxiv.org/html/2606.24143#S5.T1.st1) compare the four combinations of $A_{\mathrm{old}}$ versus $A_{\theta}$ and clipping versus no clipping. The best variant is the OPD/IS choice: $A_{\theta}$ without clipping. The PPO-style baseline, $A_{\mathrm{old}}$ with clipping, remains a strong stale-surrogate baseline. Clipping helps $A_{\mathrm{old}}$ by limiting stale, large-ratio updates, but hurts $A_{\theta}$: recomputing $A_{\theta}$ already reduces the high-percentile $\rho_{\theta}$ tail at staleness 64 ([Fig. 5](https://arxiv.org/html/2606.24143#S5.F5)), so clipping removes useful signal. Likewise, [Fig. 4](https://arxiv.org/html/2606.24143#S5.F4) and [Table 1(a)](https://arxiv.org/html/2606.24143#S5.T1.st1) show that advanced asynchronous RL surrogates such as decoupled PPO and M2PO do not outperform $A_{\theta}$ without clipping, which becomes our reference surrogate below.

Finding 2. The most effective reverse-KL correction is to recompute $A_{\theta}$ at learner time without clipping; advanced asynchronous RL surrogates such as decoupled PPO and M2PO do not improve over it.

Table 1: Staleness-sensitivity slopes. Entries fit accuracy against $\log_{2}(\mathrm{staleness}+1)$; more negative values indicate stronger degradation with staleness.

(a) Policy-gradient surrogates

(b) Multi-sample MC
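The slope statistic described in the Table 1 caption can be sketched as follows. This is only an illustration: it assumes an ordinary least-squares fit, which the caption does not state explicitly, and the accuracy values are synthetic numbers generated from a known slope, not results from the paper. The function name `staleness_slope` is hypothetical.

```python
import math


def staleness_slope(staleness, accuracy):
    """Least-squares slope of accuracy against log2(staleness + 1)."""
    xs = [math.log2(s + 1) for s in staleness]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(accuracy) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, accuracy))
    return sxy / sxx


# Synthetic accuracies built from a known slope of -1.5 (NOT measured data).
staleness = [0, 1, 2, 4, 8, 16, 32, 64, 128]
synthetic_acc = [40.0 - 1.5 * math.log2(s + 1) for s in staleness]

slope = staleness_slope(staleness, synthetic_acc)
print(f"fitted slope: {slope:.3f}")
```

A more negative fitted slope means accuracy drops faster per doubling of staleness, which is how the table entries are meant to be read.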

## 6 Reverse-KL: Cached Supports Under Staleness

Having fixed $A_{\theta}$ without clipping as the reference surrogate, we now ask which cached actions provide the teacher scores needed to evaluate it, and how to improve this cached-support estimator. This cached-support axis is specific to OPD because teacher scoring is local and expensive: the teacher cache determines which actions have teacher scores available to the learner.

#### Sparse top-$k$: stale-support biased

Although sparse top-$k$ is biased relative to the dense reverse-KL objective, it is a practical low-variance approximation on the current student support $S_{\theta}(s)=\operatorname{TopK}(p_{\theta}(\cdot\mid s),k)$. Under asynchronous rollout reuse, however, teacher scores are cached on the rollout-time support $S_{\mathrm{old}}(s)=\operatorname{TopK}(p_{\mathrm{old}}(\cdot\mid s),k)$, which may miss actions in the current support $S_{\theta}(s)$. Reweighting within $S_{\mathrm{old}}$ cannot recover these missing teacher scores, so stale sparse top-$k$ remains a support-mismatched approximation, not an exact correction of the current top-$k$ objective.
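The support mismatch can be made concrete with a small sketch. The two next-token distributions below are hypothetical toy values, and `top_k` is an illustrative helper rather than part of any library; the point is only that the rollout-time support can omit a token that the current student now ranks in its top-$k$.

```python
def top_k(probs, k):
    """Return the set of the k highest-probability tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


# Hypothetical next-token distributions at the same prefix s.
p_old = {"the": 0.50, "a": 0.30, "this": 0.15, "that": 0.05}    # rollout-time student
p_theta = {"the": 0.45, "a": 0.10, "this": 0.05, "that": 0.40}  # current student

s_old = top_k(p_old, k=2)      # tokens whose teacher scores were cached
s_theta = top_k(p_theta, k=2)  # tokens the current top-k objective needs

missing = s_theta - s_old                        # no cached teacher score exists
covered_mass = sum(p_theta[a] for a in s_old)    # current mass the stale cache covers

print("cached support :", sorted(s_old))
print("current support:", sorted(s_theta))
print("missing actions:", sorted(missing))
print(f"current-policy mass on cached support: {covered_mass:.2f}")
```

Any reweighting scheme can only redistribute weight among the cached tokens, so the teacher score for the missing token stays unavailable to the learner.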

#### One\-sample MC: correctable but high variance

Sampled-token MC instead caches an action drawn from a behavior distribution: $a\sim p_{\mathrm{old}}(\cdot\mid s)$ together with $\log p_{\mathrm{old}}(a\mid s)$. When the behavior policy covers the current policy support, exact old-to-current IS gives an unbiased estimator of the current reverse-KL fixed-prefix gradient. Thus one-sample MC is action-level correctable in expectation, but the resulting IS estimator can have high variance. This proposal-sampling structure is the key contrast with stale top-$k$, whose actions come from a deterministic stale support.
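The unbiasedness claim can be checked exactly on a toy problem by enumerating all actions instead of sampling. The sketch below assumes a softmax policy whose parameters are the logits themselves (so $\nabla_{z}\log p_{\theta}(a)=e_{a}-p_{\theta}$); the three distributions are hypothetical, and the helper names are illustrative.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def grad_log_p(p, a):
    """Gradient of log p(a) with respect to the logits of a softmax policy."""
    return [(1.0 if j == a else 0.0) - p[j] for j in range(len(p))]


def expected_grad(sampler, p_theta, q, use_is):
    """Exact E_{a ~ sampler}[-w(a) * A_theta(a) * grad log p_theta(a)]."""
    n = len(p_theta)
    total = [0.0] * n
    for a in range(n):
        w = p_theta[a] / sampler[a] if use_is else 1.0   # old-to-current IS ratio
        adv = math.log(q[a]) - math.log(p_theta[a])      # A_theta(a, s)
        g = grad_log_p(p_theta, a)
        for j in range(n):
            total[j] += sampler[a] * (-w * adv * g[j])
    return total


p_old = softmax([1.0, 0.0, -1.0])     # stale behavior policy (toy)
p_theta = softmax([0.2, 0.9, -0.5])   # current student (toy)
q = softmax([0.0, 1.5, 0.5])          # teacher (toy)

on_policy = expected_grad(p_theta, p_theta, q, use_is=True)   # ratios are all 1
stale_is = expected_grad(p_old, p_theta, q, use_is=True)
stale_no_is = expected_grad(p_old, p_theta, q, use_is=False)

print("on-policy      :", [round(v, 4) for v in on_policy])
print("stale with IS  :", [round(v, 4) for v in stale_is])
print("stale, no IS   :", [round(v, 4) for v in stale_no_is])
```

With IS the stale expectation matches the on-policy gradient up to floating-point error, whereas dropping the ratio gives a different direction. The check says nothing about variance, which is the weakness of the one-sample estimator.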

### 6.1 Proposed Solution: Multi-Sample MC

#### Multi-sample MC: correctable with reduced variance

We propose multi-sample MC ([Fig. 8](https://arxiv.org/html/2606.24143#S6.F8)), which, at each decoding timestep of a student rollout, draws multiple local next-token samples from the behavior policy without rolling them out into additional trajectories. It reduces one-sample MC variance by caching these local samples and averaging their IS-corrected gradients.

Multi-sample MC is especially natural in asynchronous OPD. In RL for LLM post-training, branching a prefix into multiple actions is expensive because each branch typically requires a full continuation before the reward or advantage can be evaluated. In synchronous OPD, sparse top-$k$ already provides a low-variance approximation and one-sample MC provides an unbiased sampled gradient estimator, so there is little motivation to cache multiple sampled actions per prefix. Under asynchronous OPD, this tradeoff changes: sparse top-$k$ becomes the stale fixed-support approximation analyzed above, and one-sample MC remains correctable but high-variance, making multi-sample MC a natural cached-support estimator for asynchronous OPD.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x15.png)\(a\)Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x16.png)\(b\)AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x17.png)\(c\)AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x18.png)\(d\)AMC

Figure 6: Accuracy comparison under staleness for sampled MC versus stale top-$k$. Top-$k$+RW denotes reweighting on the stale top-$k$ support. Old-to-current IS corrects sampled MC in expectation, whereas reweighting cannot repair the missing teacher scores induced by stale top-$k$ supports.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x19.png)\(a\)Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x20.png)\(b\)AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x21.png)\(c\)AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x22.png)\(d\)AMC

Figure 7: Accuracy comparison under staleness for multi-sample MC. Increasing the number of samples improves large-staleness behavior.

[Diagram: one student rollout path through prefixes $s_{t}$ ("hello"), $s_{t+1}$ ("... what"), and $s_{t+2}$ ("... is"). At each prefix, $m=2$ local samples are drawn: $a_{t,1}$ = "what" and $a_{t,2}$ = "how" at $s_{t}$; $a_{t+1,1}$ = "is" and $a_{t+1,2}$ = "did" at $s_{t+1}$; $a_{t+2,1}$ = "your" and $a_{t+2,2}$ = "going" at $s_{t+2}$.]

Figure 8: Multi-sample MC ($m=2$).

Concretely, at each visited timestep $t$ with prefix $s_{t}$, the rollout stage samples $a_{t,1},\ldots,a_{t,m}\sim p_{\mathrm{old}}(\cdot\mid s_{t})$ and caches their rollout log probabilities and teacher scores. For notational simplicity, write $s=s_{t}$ and $a_{i}=a_{t,i}$ below. At learner time, we recompute $A_{\theta}(a_{i},s)$ and use the averaged unclipped old-to-current IS surrogate

$$\widehat{L}_{m}^{\mathrm{MC}}(\theta;s)=-\frac{1}{m}\sum\nolimits_{i=1}^{m}\rho_{\theta}(a_{i},s)\operatorname{sg}(A_{\theta}(a_{i},s)).$$

By linearity, the gradient has the same expectation as the one-sample MC estimator; averaging independent behavior-policy samples reduces the Monte Carlo variance. We measure this variance reduction at large staleness in [Appendix E](https://arxiv.org/html/2606.24143#A5).
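The variance argument can be illustrated with a small simulation of the averaged surrogate value at a single prefix. This is a sketch under stated assumptions, not a reproduction of the paper's measurement: the three-token distributions are hypothetical, the seed and trial count are arbitrary, and the simulation tracks the variance of the surrogate value rather than of the full parameter gradient.

```python
import math
import random
import statistics


def mc_surrogate(samples, p_theta, p_old, q):
    """Averaged unclipped IS surrogate: -(1/m) * sum_i rho_i * A_theta(a_i)."""
    total = 0.0
    for a in samples:
        rho = p_theta[a] / p_old[a]                    # old-to-current ratio
        adv = math.log(q[a]) - math.log(p_theta[a])    # recomputed at learner time
        total += rho * adv
    return -total / len(samples)


# Hypothetical distributions over a 3-token vocabulary at one prefix.
p_old = [0.70, 0.20, 0.10]     # stale behavior policy
p_theta = [0.30, 0.30, 0.40]   # current student
q = [0.25, 0.35, 0.40]         # teacher

rng = random.Random(0)


def estimates(m, trials=4000):
    """Draw `trials` independent m-sample estimates with actions from p_old."""
    out = []
    for _ in range(trials):
        samples = rng.choices(range(3), weights=p_old, k=m)  # iid, with replacement
        out.append(mc_surrogate(samples, p_theta, p_old, q))
    return out


# Exact target: the expectation of the surrogate under p_old equals KL(p_theta || q).
exact = -sum(p_theta[a] * (math.log(q[a]) - math.log(p_theta[a])) for a in range(3))

est_1 = estimates(1)
est_4 = estimates(4)
var_1 = statistics.pvariance(est_1)
var_4 = statistics.pvariance(est_4)
mean_4 = statistics.mean(est_4)

print(f"exact expectation : {exact:.4f}")
print(f"mean of m=4 runs  : {mean_4:.4f}")
print(f"variance m=1      : {var_1:.5f}")
print(f"variance m=4      : {var_4:.5f}")
```

Because the $m$ samples are independent draws from the same behavior policy, the variance of the average shrinks roughly as $1/m$ while the mean stays at the same target.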

### 6.2 Experimental Results

#### Sparse top-$k$ vs. one-sample MC

[Figure 6](https://arxiv.org/html/2606.24143#S6.F6) compares one-sample MC and sparse top-$k$, with and without old-to-current reweighting. For one-sample MC, IS substantially improves robustness as staleness increases. For sparse top-$k$, the same reweighting does not improve performance, since it cannot recover missing current-support actions. As a result, one-sample MC with IS is the strongest of the four methods, consistent with the support-correctability analysis above. We include an additional ablation disentangling MC sample count from IS in [Appendix F](https://arxiv.org/html/2606.24143#A6).

#### One-sample MC vs. multi-sample MC

[Figure 7](https://arxiv.org/html/2606.24143#S6.F7) and [Table 1(b)](https://arxiv.org/html/2606.24143#S5.T1.st2) show that multi-sample MC improves one-sample MC at large staleness: $m=4$ already gives a clear jump, while $m\in\{4,16,64\}$ performs similarly.

Finding 3. One-sample MC is more effective than stale sparse top-$k$; multi-sample MC further improves this estimator by reducing one-sample variance while preserving MC correctability.

## 7 AsyncOPD: Fully Asynchronous OPD

AsyncOPD is our fully asynchronous OPD system. Following AReaL \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\], it overlaps rollout, teacher scoring, and learner updates.

#### Scheduler

The step-off scheduler family was originally implemented in VeRL \[[19](https://arxiv.org/html/2606.24143#bib.bib5)\]: a $k$-step-off run fixes rollout lag to $k$ learner updates, but still waits for complete rollout batches. AsyncOPD streams examples instead: workers pause only for weight sync, preserve in-flight prefixes, teacher scoring consumes completed items, and the learner updates once a scored batch is ready ([Fig. 9](https://arxiv.org/html/2606.24143#S7.F9)).
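The streaming idea can be sketched with three cooperating coroutines connected by queues. This is a deliberately simplified toy, not the AsyncOPD scheduler: the names `rollout_worker`, `teacher_scorer`, and `learner` are hypothetical, generation and scoring are replaced by placeholders, and weight synchronization and in-flight prefix preservation are omitted.

```python
import asyncio


async def rollout_worker(out_q, n_items):
    """Emit completed rollouts one at a time instead of as a gated batch."""
    for i in range(n_items):
        await asyncio.sleep(0)           # stand-in for generation; yields control
        await out_q.put({"id": i})
    await out_q.put(None)                # end-of-stream marker


async def teacher_scorer(in_q, out_q):
    """Score each completed item as soon as it arrives."""
    while True:
        item = await in_q.get()
        if item is None:
            await out_q.put(None)
            return
        item["teacher_score"] = -1.0     # placeholder for cached teacher scores
        await out_q.put(item)


async def learner(in_q, batch_size):
    """Run one (simulated) update whenever a full scored batch is available."""
    batches, batch = [], []
    while True:
        item = await in_q.get()
        if item is None:
            return batches
        batch.append(item["id"])
        if len(batch) == batch_size:
            batches.append(batch)        # a real learner would take a gradient step here
            batch = []


async def main():
    rollouts, scored = asyncio.Queue(), asyncio.Queue()
    results = await asyncio.gather(
        rollout_worker(rollouts, n_items=8),
        teacher_scorer(rollouts, scored),
        learner(scored, batch_size=4),
    )
    return results[2]


batches = asyncio.run(main())
print(batches)
```

The three stages never wait on a global barrier: each one blocks only on its own input queue, which is the property that lets slow rollouts stop gating the rest of the pipeline.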

![Refer to caption](https://arxiv.org/html/2606.24143v1/x23.png)

Figure 9: Scheduler comparison for synchronous OPD, step-off scheduling, and AsyncOPD. Synchronous OPD is barriered; step-off scheduling \[[19](https://arxiv.org/html/2606.24143#bib.bib5)\] overlaps stages but keeps gated rollout batches, while AsyncOPD streams rollout data to reduce long-tail waiting.
#### Experimental setup

The main comparison uses Qwen3-{1.7B,4B,8B}-Base students with the Qwen3-30B-A3B-Instruct-2507 teacher. All runs use the same reverse-KL estimator: current-policy $A_{\theta}$, no clipping, old-to-current IS, and either MC64 or MC1. We compare strict sync, two-step-off, and AsyncOPD for 100 training iterations on the same 8-GPU node; all AsyncOPD runs use $\tau=4$. [Appendix G](https://arxiv.org/html/2606.24143#A7) gives GPU allocation, queue-depth, and scheduler details.

#### Experimental Results

[Table 2](https://arxiv.org/html/2606.24143#S7.T2) reports training throughput, pipeline overlap (average concurrent OPD-stage activity), and final AIME24 Avg@32 for the Qwen3-Base students. AsyncOPD achieves the highest throughput and overlap in every matched comparison. In MC64, it reaches up to $2.7\times$ the strict-sync throughput while achieving the best or tied-best final accuracy. MC1 shows the same trend: AsyncOPD delivers the highest throughput (up to $3.3\times$ strict-sync) and overlap for every student, with competitive final accuracy.

Table 2: AsyncOPD scheduler results for Qwen3-Base models. Train tok/s is training throughput; parentheses show speedup over the matched strict-sync baseline. Overlap is concurrent OPD-stage activity. Avg@32 is final AIME24. AsyncOPD achieves the highest throughput and overlap in all matched settings while maintaining comparable final accuracy. Train-time accuracy curves are reported in [Appendix G](https://arxiv.org/html/2606.24143#A7).

## 8 Conclusion

We present the first systematic study of staleness in asynchronous on-policy distillation (OPD). Our results show that KL direction shapes the stale-data problem: forward KL remains robust to stale rollouts, whereas reverse KL is more vulnerable because it is student-weighted. In reverse-KL OPD, the most effective policy-gradient surrogate uses the current advantage recomputed at learner time without clipping; advanced asynchronous RL surrogates do not improve over this choice. We also find that stale student top-$k$ supports are support-mismatched, whereas one-sample Monte Carlo (MC) remains correctable but high-variance. This contrast motivates multi-sample MC, which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices, improving training efficiency while maintaining OPD quality.

#### Limitations and Future Work

We study sparse and Monte Carlo OPD estimators, not dense full-vocabulary KL in the asynchronous setting. Although dense KL avoids cached-support mismatch, it is difficult to implement efficiently when rollout, teacher scoring, and learner updates are decoupled. KDFlow \[[29](https://arxiv.org/html/2606.24143#bib.bib25)\] suggests one path by transmitting teacher hidden states and recomputing student logits, but only for synchronous OPD. Extending this approach to asynchronous OPD while handling stale rollouts and preserving throughput is an important future direction. Our experiments are also limited to a single 8-GPU node by available resources, not by the pipeline itself; scaling to larger multi-node clusters remains future work.

## Acknowledgments and Disclosure of Funding

This work was supported by Institute for Information & communications Technology Promotion \(IITP\) grant funded by the Korea government \(MSIT\) \(No\. 04\-26\-03\-0081, Energy\-Efficient Training–Inference System Optimization for Reinforcement Learning\-Based Post\-Training\)\. This work was also supported by the “Advanced GPU Utilization Support Program” funded by the Government of the Republic of Korea \(Ministry of Science and ICT\)\.

## References

- \[1\] R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem (2024). On-policy distillation of language models: learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=3zKtaqxLhW)
- \[2\] DeepSeek-AI (2026). DeepSeek-v4: towards highly efficient million-token context intelligence.
- \[3\] F. Devvrit, L. Madaan, R. Tiwari, R. Bansal, S. S. Duvvuri, M. Zaheer, I. S. Dhillon, D. Brandfonbrener, and R. Agarwal (2026). The art of scaling reinforcement learning compute for LLMs. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=FMjeC9Msws)
- \[4\] W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, S. Xu, G. Wei, J. Mei, W. JIASHU, T. Yang, B. Yuan, and Y. Wu (2025). AREAL: a large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=X9diEuva9R)
- \[5\] W. Gao, Y. Zhao, D. An, T. Wu, L. Cao, S. Xiong, J. Huang, W. Wang, S. Yang, W. Su, et al. (2025). Rollpacker: mitigating long-tail rollouts for fast, synchronous rl post-training. arXiv preprint arXiv:2509.21009.
- \[6\] Y. Gu, L. Dong, F. Wei, and M. Huang (2024). MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=5h0qf7IBZZ)
- \[7\] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- \[8\] Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. (2025). Deepmath-103k: a large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456.
- \[9\] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- \[10\] X. Li, S. Wu, and Z. Shen (2025). A-3po: accelerating asynchronous llm training with staleness-aware proximal policy approximation. arXiv preprint arXiv:2512.06547.
- \[11\] Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H. Gao, W. Yang, Z. Liu, et al. (2026). Rethinking on-policy distillation of large language models: phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016.
- \[12\] K. Lu and T. M. Lab (2025). On-policy distillation. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/on-policy-distillation. [Document](https://dx.doi.org/10.64434/tml.20251026)
- \[13\] Mathematical Association of America (2023). American Mathematics Competitions – AMC. Note: [https://maa.org/](https://maa.org/). Accessed 2026-04-03.
- \[14\] M. Noukhovitch, S. Huang, S. Xhonneux, A. Hosseini, R. Agarwal, and A. Courville (2025). Faster, more efficient RLHF through off-policy asynchronous learning. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=FhTAG591Ve)
- \[15\] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017). Automatic differentiation in PyTorch. In NIPS-W.
- \[16\] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- \[17\] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- \[18\] G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, Y. Wu, X. Li, C. Zhang, Y. Peng, et al. (2025). Laminar: a scalable asynchronous rl post-training framework. arXiv preprint arXiv:2510.12633.
- \[19\] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025). Hybridflow: a flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297.
- \[20\] M. Song and M. Zheng (2026). A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626.
- \[21\] B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. (2026). Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780.
- \[22\] Y. Xu, H. Sang, Z. Zhou, R. He, Z. Wang, and A. Geramifard (2026). TIP: token importance in on-policy distillation. arXiv preprint arXiv:2604.14084.
- \[23\] R. Yan, Y. Jiang, T. Wu, J. Gao, Z. Mei, W. Fu, H. Mai, W. Wang, Y. Wu, and B. Yuan (2025). AReaL-hex: accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796.
- \[24\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- \[25\] W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin (2026). Learning beyond teacher: generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125.
- \[26\] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2025). Dapo: an open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- \[27\] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. (2026). Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
- \[28\] K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P. Li, et al. (2025). A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827.
- \[29\] S. Zhang, X. Zhang, T. Zhang, B. Hu, Y. Chen, and J. Xu (2026). KDFlow: a user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint arXiv:2603.01875.
- \[30\] Y. Zhang and T. Math-AI (2024). AIME 2024. Note: [https://huggingface.co/datasets/Maxwell-Jia/AIME\_2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024). Hugging Face dataset; accessed 2026-04-03.
- \[31\] Y. Zhang and T. Math-AI (2025). AIME 2025. Note: [https://huggingface.co/datasets/yentinglin/aime\_2025](https://huggingface.co/datasets/yentinglin/aime_2025). Hugging Face dataset; accessed 2026-04-03.
- \[32\] S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover (2026). Self-distilled reasoner: on-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734.
- \[33\] H. Zheng, J. Zhao, and B. Chen (2026). Prosperity before collapse: how far can off-policy RL reach with stale data on LLMs?. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=IIgl5MWelz)
- \[34\] Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, N. Chen, Y. Chen, Y. Zhou, C. Wan, et al. (2025). Streamrl: scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930.

## Appendix A Sparse and Monte Carlo Reverse-KL Implementations

### A.1 Sparse Top-$k$ Reverse-KL OPD

The dense reverse-KL objective in [Eq. 2](https://arxiv.org/html/2606.24143#S3.E2) sums over the full vocabulary. A sparse top-$k$ implementation instead evaluates reverse KL on a finite student support

$$S_{\theta}(s)=\operatorname{TopK}\!\left(p_{\theta}(\cdot\mid s),k\right). \tag{8}$$

For any support $S$, define the restricted normalizers $Z_{p}^{S}(s)=\sum\nolimits_{u\in S}p_{\theta}(u\mid s)$ and $Z_{q}^{S}(s)=\sum\nolimits_{u\in S}q(u\mid s)$, and the renormalized distributions

$$\tilde{p}_{\theta}^{S}(a\mid s)=\frac{p_{\theta}(a\mid s)\mathbf{1}[a\in S]}{Z_{p}^{S}(s)},\qquad\tilde{q}^{S}(a\mid s)=\frac{q(a\mid s)\mathbf{1}[a\in S]}{Z_{q}^{S}(s)}. \tag{9}$$

The sparse reverse-KL objective is

$$
\begin{aligned}
D_{R}^{S}(\theta;s) &= \mathrm{KL}\!\left(\tilde{p}_{\theta}^{S}(\cdot\mid s)\,\|\,\tilde{q}^{S}(\cdot\mid s)\right) \\
&= -\sum\nolimits_{a\in S}\tilde{p}_{\theta}^{S}(a\mid s)\left(\log\tilde{q}^{S}(a\mid s)-\log\tilde{p}_{\theta}^{S}(a\mid s)\right).
\end{aligned}
\tag{10}
$$

In practice, when $S=S_{\theta}(s)$, we treat the selected top-$k$ support as fixed during the local update.
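A minimal sketch of Eqs. (8) to (10) on a toy vocabulary is given below. The distributions are hypothetical and `sparse_reverse_kl` is an illustrative helper name; the code only evaluates the objective value and does not model the gradient or the fixed-support treatment during the update.

```python
import math


def sparse_reverse_kl(p_theta, q, support):
    """KL between student and teacher after renormalizing both on `support`."""
    z_p = sum(p_theta[a] for a in support)   # restricted normalizer Z_p^S
    z_q = sum(q[a] for a in support)         # restricted normalizer Z_q^S
    kl = 0.0
    for a in support:
        p_tilde = p_theta[a] / z_p
        q_tilde = q[a] / z_q
        kl += p_tilde * (math.log(p_tilde) - math.log(q_tilde))
    return kl


# Hypothetical student and teacher next-token distributions at one prefix.
p_theta = {"the": 0.45, "a": 0.10, "this": 0.05, "that": 0.40}
q = {"the": 0.40, "a": 0.05, "this": 0.05, "that": 0.50}

# Eq. (8): top-k support under the current student, here with k = 2.
support = sorted(p_theta, key=p_theta.get, reverse=True)[:2]

value = sparse_reverse_kl(p_theta, q, support)
full = sparse_reverse_kl(p_theta, q, list(p_theta))  # support = whole vocabulary

print("support        :", support)
print(f"sparse top-2 KL: {value:.4f}")
print(f"full-support KL: {full:.4f}")
```

When the support is the whole vocabulary, both normalizers equal one and the function reduces to the dense reverse KL, which makes the gap between the two printed values a direct view of the top-$k$ bias in this toy case.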

### A.2 Monte Carlo Reverse-KL OPD

Let $A_{\theta}(a,s)=\log q(a\mid s)-\log p_{\theta}(a\mid s)$. From [Eq. 3](https://arxiv.org/html/2606.24143#S3.E3), the dense reverse-KL gradient can be written as

$$\nabla_{\theta}D_{R}(\theta;s)=-\mathbb{E}_{a\sim p_{\theta}(\cdot\mid s)}\left[A_{\theta}(a,s)\nabla_{\theta}\log p_{\theta}(a\mid s)\right]. \tag{11}$$

A one-sample current-policy Monte Carlo estimator is therefore

$$\widehat{g}_{\mathrm{MC}}(s,a)=-A_{\theta}(a,s)\nabla_{\theta}\log p_{\theta}(a\mid s),\qquad a\sim p_{\theta}(\cdot\mid s). \tag{12}$$

With $m$ independent samples $a_{i}\sim p_{\theta}(\cdot\mid s)$, the corresponding multi-sample estimator averages the same local term:

$$\widehat{g}_{m}(s)=-\frac{1}{m}\sum\nolimits_{i=1}^{m}A_{\theta}(a_{i},s)\nabla_{\theta}\log p_{\theta}(a_{i}\mid s). \tag{13}$$
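Eq. (11) can be sanity-checked numerically by comparing the score-function form against a finite-difference gradient of the dense reverse KL. The sketch assumes a softmax policy parameterized directly by its logits, which is a simplification of an LLM; the logits and teacher distribution are hypothetical toy values.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def reverse_kl(z, q):
    """Dense reverse KL(p_theta || q) for a softmax policy with logits z."""
    p = softmax(z)
    return sum(p[a] * (math.log(p[a]) - math.log(q[a])) for a in range(len(z)))


def score_function_grad(z, q):
    """Right-hand side of Eq. (11): -E_{a ~ p}[A(a) * grad_z log p(a)]."""
    p = softmax(z)
    n = len(z)
    grad = [0.0] * n
    for a in range(n):
        adv = math.log(q[a]) - math.log(p[a])           # A_theta(a, s)
        for j in range(n):
            dlogp = (1.0 if j == a else 0.0) - p[j]     # d log p(a) / d z_j
            grad[j] -= p[a] * adv * dlogp
    return grad


def finite_diff_grad(z, q, h=1e-6):
    """Central-difference gradient of the dense reverse KL."""
    grad = []
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += h
        z_minus[j] -= h
        grad.append((reverse_kl(z_plus, q) - reverse_kl(z_minus, q)) / (2 * h))
    return grad


logits = [0.2, 0.9, -0.5]        # toy student logits
q = softmax([0.0, 1.5, 0.5])     # toy teacher distribution

analytic = score_function_grad(logits, q)
numeric = finite_diff_grad(logits, q)
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_gap:.2e}")
```

Replacing the exact expectation with one or $m$ sampled actions gives the estimators in Eqs. (12) and (13).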

## Appendix B Experimental Details

This section details the experimental setup. Unless explicitly stated otherwise, experiments use the common setup in [Table 3](https://arxiv.org/html/2606.24143#A2.T3) and report final-checkpoint Avg@32 accuracy. Our implementation uses vLLM \[[9](https://arxiv.org/html/2606.24143#bib.bib14)\] for rollout generation and teacher scoring, PyTorch FSDP \[[15](https://arxiv.org/html/2606.24143#bib.bib13)\] for learner training, and runs each experiment on a single 8×B200 node. Individual experiments take roughly 1–12 hours, depending on the setting. Asset URLs, license names, and versions are summarized in [Table 5](https://arxiv.org/html/2606.24143#A4.T5).

Table 3: Experimental settings.

#### Constructing the staleness axis.

The main text uses staleness as an experimental control over how old the cached rollout data is when the learner updates on it\. In all staleness plots and tables in[Sections˜4](https://arxiv.org/html/2606.24143#S4),[5](https://arxiv.org/html/2606.24143#S5)and[6](https://arxiv.org/html/2606.24143#S6), staleness is measured in train\-batch steps\. One train\-batch step is one logical rollout batch consumed by the learner for a training iteration\. The sweep valuekkis therefore the target number of train\-batch steps by which the consumed cache trails the current learner; equivalently, it is the target cache depth in logical rollout batches\. A valuek=0k=0is synchronous: rollout, teacher scoring, training, and weight synchronization occur in strict sequence\. Fork\>0k\>0, the run first generates exactlykkrollout batches with the initial student snapshot before the first learner update\. Training then consumes the oldest available generated batch; after each learner update and weight synchronization, a new rollout batch is generated with the latest student snapshot whenever needed to restore the target cache depth\.

This protocol is the operational source of the prefix\- and action\-level staleness discussed in the main text: the consumed prefixes and cached actions come from an older rollout student, while the update is applied to the current student\. For a consumed rollout batch, lettrollt\_\{\\mathrm\{roll\}\}be the train\-batch index of the student snapshot used for generation andttraint\_\{\\mathrm\{train\}\}be the train\-batch index at learner time\. The staleness used in the plots is

Δbatch=ttrain−troll\.\\Delta\_\{\\mathrm\{batch\}\}=t\_\{\\mathrm\{train\}\}\-t\_\{\\mathrm\{roll\}\}\.All examples in a logical batch share the same rollout snapshot and therefore share the sameΔbatch\\Delta\_\{\\mathrm\{batch\}\}\. Under the controlled cache protocol,Δbatch\\Delta\_\{\\mathrm\{batch\}\}ramps as0,1,2,…0,1,2,\\ldotswhile the initial cache is drained and then plateaus atkk\. Thus a 64\-batch target cache depth is plotted as staleness 64\. A train\-batch step can contain multiple mini\-batch optimizer updates; in the common setup,B=256B=256andBmini=64B\_\{\\mathrm\{mini\}\}=64, so each train\-batch step containsM=B/Bmini=4M=B/B\_\{\\mathrm\{mini\}\}=4optimizer updates\. This conversion is useful for implementation accounting, but it is not the staleness axis used in the plots\.

We use $k$ as the x-axis because it is the controlled train-batch staleness intervention shared across methods. The sweep covers $k\in\{0,1,2,4,8,16,32,64,128\}$ across the forward-KL, reverse-KL / PPO-style, M2PO / DecPPO, top-$k$, and Monte-Carlo support-size variants; apart from the estimator choice and $k$, these runs share the common model, data, batch-size, generation, and evaluation settings in [Table 3](https://arxiv.org/html/2606.24143#A2.T3).

## Appendix C Datasets and Metrics

#### Training data\.

We filter the DeepMath dataset \[[8](https://arxiv.org/html/2606.24143#bib.bib15)\] to retain 57,630 math problems with difficulty level greater than or equal to 6, and use this filtered subset as the training data.

#### Evaluation datasets\.

[Table 4](https://arxiv.org/html/2606.24143#A3.T4) lists the evaluation datasets used: AIME 2024 \[[30](https://arxiv.org/html/2606.24143#bib.bib17)\], AIME 2025 \[[31](https://arxiv.org/html/2606.24143#bib.bib18)\], and AMC 2023 \[[13](https://arxiv.org/html/2606.24143#bib.bib19)\]. AIME24 is evaluated every 20 steps. The remaining datasets are evaluated only at the final checkpoint.

Table 4: Evaluation datasets.

#### Accuracy metric\.

Evaluation samples 32 responses per problem. For a dataset $D$, the reported Avg@32 is the mean per-problem pass rate,

$$\mathrm{Avg@32}(D)=100\cdot\frac{1}{|D|}\sum_{i\in D}\frac{c_{i}}{32},\tag{14}$$

where $c_{i}$ is the number of sampled responses judged correct for problem $i$. Paper tables and plots use Avg@32 unless noted otherwise.
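
As a worked instance of Eq. (14), the following minimal sketch computes Avg@32 from per-problem correct counts. The function name `avg_at_n` and the example counts are illustrative assumptions, not part of the released evaluation code.

```python
def avg_at_n(correct_counts: list[int], n: int = 32) -> float:
    """Mean per-problem pass rate, scaled to a percentage (Eq. 14).

    correct_counts[i] is c_i, the number of the n sampled responses
    judged correct for problem i.
    """
    if not correct_counts:
        raise ValueError("dataset must contain at least one problem")
    if any(c < 0 or c > n for c in correct_counts):
        raise ValueError("each count must lie in [0, n]")
    return 100.0 * sum(c / n for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Three hypothetical problems: always solved, solved half the time, never solved.
    print(avg_at_n([32, 16, 0]))
```

Because the pass rate is averaged per problem before scaling, every problem contributes equally regardless of how many of its samples are correct.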

## Appendix D Existing Asset Licenses

[Table 5](https://arxiv.org/html/2606.24143#A4.T5) lists the reused assets.

Table 5: Existing assets used in this work, with source URLs, license names, and versions.

## Appendix E Multi-Sample MC Variance at Large Staleness

We measure how multi-sample MC reduces the variance of the old-to-current IS surrogate at large staleness. At timestep $t$ with prefix $s_{t}$, local MC actions $a_{t,1},\ldots,a_{t,m}$ are sampled iid with replacement from $p_{\mathrm{old}}(\cdot\mid s_{t})$ (duplicates are allowed), and the learner evaluates

$$\widehat{L}_{m}^{\mathrm{MC}}(\theta;s_{t})=-\frac{1}{m}\sum_{i=1}^{m}\rho_{\theta}(a_{t,i},s_{t})\operatorname{sg}\!\left(A_{\theta}(a_{t,i},s_{t})\right),\qquad\rho_{\theta}(a,s)=\frac{p_{\theta}(a\mid s)}{p_{\mathrm{old}}(a\mid s)}.\tag{15}$$

Using the Qwen3-4B-Base staleness-128 runs, we report $R_{m}^{\mathrm{local}}$ in the fixed-prefix column and $R_{m}^{\mathrm{seq}}$ in the sequence-level column of [Table 6](https://arxiv.org/html/2606.24143#A5.T6). Both ratios are normalized to the corresponding $m=1$ estimator within the same old-to-current pair:

$$R_{m}^{\mathrm{local}}=\frac{\mathbb{E}_{s_{t}}\!\left[\operatorname{Var}_{a_{t,1},\ldots,a_{t,m}\sim p_{\mathrm{old}}(\cdot\mid s_{t})}\left(\widehat{L}_{m}^{\mathrm{MC}}(\theta;s_{t})\mid s_{t}\right)\right]}{\mathbb{E}_{s_{t}}\!\left[\operatorname{Var}_{a_{t}\sim p_{\mathrm{old}}(\cdot\mid s_{t})}\left(\widehat{L}_{1}^{\mathrm{MC}}(\theta;s_{t})\mid s_{t}\right)\right]},\tag{16}$$

$$R_{m}^{\mathrm{seq}}=\frac{\operatorname{Var}\!\left[\frac{1}{T}\sum_{t=1}^{T}\widehat{L}_{m}^{\mathrm{MC}}(\theta;s_{t})\right]}{\operatorname{Var}\!\left[\frac{1}{T}\sum_{t=1}^{T}\widehat{L}_{1}^{\mathrm{MC}}(\theta;s_{t})\right]}.\tag{17}$$

For $R_{m}^{\mathrm{seq}}$, the generated prefix path $s_{1:T}$ is fixed when computing the variance; the MC samples are local scorer queries at each fixed prefix, not separate rollout branches.

Table 6: MC variance ratios at large staleness. The fixed-prefix column isolates local next-token action-sampling variance; the sequence-level column averages the same estimator over generated timesteps before computing variance.

[Table 6](https://arxiv.org/html/2606.24143#A5.T6) shows that larger $m$ consistently reduces variance. Fixed-prefix ratios closely follow the $1/m$ reference ($m=64$ leaves $1.49\%$ of the one-sample local variance, versus $1/64=1.56\%$). After timestep aggregation, $m=64$ still leaves only $11.2\%$ of the one-sample sequence-level variance, showing that multi-sample MC reduces variance in practice even though sequence-level averaging makes the reduction less extreme than the local next-token effect.
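
The $1/m$ reference for the fixed-prefix ratio follows from averaging $m$ iid terms. The sketch below checks this numerically on a toy four-token vocabulary at a single fixed prefix. The distributions `p_old` and `p_new` and the per-action values `adv` (standing in for the stop-gradient term $A_{\theta}$) are made-up numbers chosen for illustration, not quantities from the paper's runs.

```python
import random
import statistics


def mc_surrogate(p_new, p_old, adv, m, rng):
    """Value of the m-sample old-to-current IS surrogate at one fixed prefix."""
    # Actions are drawn iid with replacement from p_old, so duplicates are allowed.
    actions = rng.choices(range(len(p_old)), weights=p_old, k=m)
    return -sum(p_new[a] / p_old[a] * adv[a] for a in actions) / m


def variance_ratio(p_new, p_old, adv, m, trials=20000, seed=0):
    """Empirical Var(m-sample estimator) / Var(1-sample estimator)."""
    rng = random.Random(seed)
    var_m = statistics.pvariance(
        [mc_surrogate(p_new, p_old, adv, m, rng) for _ in range(trials)]
    )
    var_1 = statistics.pvariance(
        [mc_surrogate(p_new, p_old, adv, 1, rng) for _ in range(trials)]
    )
    return var_m / var_1


if __name__ == "__main__":
    p_old = [0.5, 0.3, 0.15, 0.05]   # hypothetical stale rollout policy
    p_new = [0.4, 0.3, 0.2, 0.1]     # hypothetical current student
    adv = [0.2, -0.1, 0.5, -1.0]     # hypothetical per-action signal
    for m in (1, 4, 16):
        print(m, variance_ratio(p_new, p_old, adv, m), 1 / m)
```

This toy covers only the fixed-prefix case. It does not model the sequence-level ratio, where averaging over timesteps changes how much of the total variance the local action sampling accounts for.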

## Appendix F Importance-Sampling Ablation

Before treating multi-sampling as an improvement, we separate it from IS. Increasing $m$ changes the Monte Carlo variance of the estimator. It does not define the target distribution. Old-to-current IS is still the mechanism that turns behavior-policy samples into an estimator of the current reverse-KL local gradient. [Figure 10](https://arxiv.org/html/2606.24143#A6.F10) compares MC1 and MC16 with and without IS to test this distinction directly.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x24.png)\(a\)Average
![Refer to caption](https://arxiv.org/html/2606.24143v1/x25.png)\(b\)AIME24
![Refer to caption](https://arxiv.org/html/2606.24143v1/x26.png)\(c\)AIME25
![Refer to caption](https://arxiv.org/html/2606.24143v1/x27.png)\(d\)AMC

Figure 10: Accuracy comparison under staleness for MC importance-sampling ablations. Increasing the number of samples reduces Monte Carlo variance, but old-to-current IS is still needed to correct stale-policy sampling.

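
The distinction between variance and target distribution can be made concrete with a small exact calculation. In the sketch below, `f` stands in for any per-action term (for example one coordinate of the per-action gradient contribution); the distributions and values are hypothetical. The expectation of an $m$-sample average under $p_{\mathrm{old}}$ does not depend on $m$, so only the IS weight decides which distribution the estimator targets.

```python
def expectation(probs, values):
    """Exact expectation of values under a categorical distribution."""
    return sum(p * v for p, v in zip(probs, values))


def expected_sample_mean(p_new, p_old, f, use_is):
    """Exact expectation of the m-sample average of w(a) * f(a), a ~ p_old.

    The result is the same for every m, since the samples are iid.
    With use_is=True the weight is p_new / p_old; otherwise it is 1.
    """
    weights = [pn / po if use_is else 1.0 for pn, po in zip(p_new, p_old)]
    return sum(po * w * fv for po, w, fv in zip(p_old, weights, f))


if __name__ == "__main__":
    p_old = [0.5, 0.3, 0.15, 0.05]   # hypothetical stale rollout policy
    p_new = [0.4, 0.3, 0.2, 0.1]     # hypothetical current student
    f = [0.2, -0.1, 0.5, -1.0]       # hypothetical per-action term
    print("target under current policy:", expectation(p_new, f))
    print("with IS:", expected_sample_mean(p_new, p_old, f, use_is=True))
    print("without IS:", expected_sample_mean(p_new, p_old, f, use_is=False))
```

Without the weight, the average converges to the expectation under the stale policy however large $m$ becomes, which is consistent with the ablation's conclusion that more samples alone do not correct stale-policy sampling.
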
## Appendix G AsyncOPD Scheduler Details

This appendix gives the implementation details omitted from [Section 7](https://arxiv.org/html/2606.24143#S7). Our scheduler, AsyncOPD, follows the fully asynchronous systems structure of AReaL \[[4](https://arxiv.org/html/2606.24143#bib.bib1)\], but the queue contains OPD cache items rather than reward-labeled RL trajectories.

#### Queue interface\.

The pipeline has three long-running stages: rollout generation, teacher scoring, and learner training. Rollout workers sample trajectories from their latest synchronized student snapshot. For each visited prefix $s$, they cache the MC actions, rollout log probabilities under $p_{\mathrm{old}}$, and the rollout student version. The main scheduler comparison uses MC64; the MC1 runs use the same queue interface with one cached action. The teacher scores the cached actions. The learner then recomputes $\log p_{\theta}(a\mid s)$ and $A_{\theta}(a,s)$ under the current student, and applies the unclipped old-to-current IS estimator from [Section 6](https://arxiv.org/html/2606.24143#S6).

#### Weight synchronization and queue capacity\.

During AsyncOPD weight synchronization, rollout workers pause generation. In the keep-mode used for the scheduler experiments, in-flight requests are not discarded: the already sampled token prefix is kept, the student weights are updated, and the running-request prefix cache is reset so the engine rebuilds the attention state for that prefix under the new weights before generation resumes. Thus, tokens before the synchronization point are reused rather than regenerated, while later tokens are sampled under the new student snapshot. Each completed sample records the token index at which the active weight version changes.

The queue-depth parameter $\tau$ is enforced as a capacity bound rather than as a learner-side drop rule. The coordinator creates a semaphore with $(\tau+1)B$ permits, where $B$ is the effective train batch size. The prompt feeder acquires one permit before submitting a prompt to rollout, and the train dispatcher releases permits only after the corresponding samples have been consumed by a learner update. During weight synchronization, a sync gate prevents the feeder from using newly released permits until rollout workers have received the updated weights. Thus, smaller $\tau$ limits the amount of unconsumed rollout work in the pipeline, while larger $\tau$ permits a deeper backlog and more overlap. The queues themselves remain FIFO; items are not evicted for being stale.
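
The capacity bound and sync gate described above can be sketched with standard threading primitives. This is a minimal illustration under our own assumptions, not the released coordinator: the class `QueueCapacity` and its method names are hypothetical, and a real implementation would likely use asynchronous or distributed primitives rather than in-process threads.

```python
import threading


class QueueCapacity:
    """Capacity bound with (tau + 1) * B permits plus a weight-sync gate."""

    def __init__(self, tau: int, batch_size: int):
        self.permits = (tau + 1) * batch_size
        self._sem = threading.BoundedSemaphore(self.permits)
        self._gate = threading.Event()
        self._gate.set()  # gate open: the feeder may submit prompts

    def acquire_for_prompt(self, timeout=None) -> bool:
        """Feeder side: wait for the gate, then take one permit per prompt."""
        if not self._gate.wait(timeout):
            return False
        return self._sem.acquire(timeout=timeout)

    def release_consumed(self, num_samples: int) -> None:
        """Dispatcher side: return permits after a learner update consumed them."""
        for _ in range(num_samples):
            self._sem.release()

    def begin_weight_sync(self) -> None:
        """Close the gate so newly released permits are not used during sync."""
        self._gate.clear()

    def end_weight_sync(self) -> None:
        """Reopen the gate once rollout workers hold the updated weights."""
        self._gate.set()


if __name__ == "__main__":
    cap = QueueCapacity(tau=1, batch_size=2)
    taken = sum(cap.acquire_for_prompt(timeout=0.01) for _ in range(5))
    print("permits:", cap.permits, "acquired:", taken)
```

Bounding submissions rather than dropping stale items at the learner keeps the queue FIFO, which matches the statement that items are never evicted for being stale.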

#### Training throughput metric\.

The table reports training throughput. Let $n_{j}$ be the number of response tokens used by learner update $j$, and let $t_{j}$ be the train wall-clock time after that update. Discarding the first five warmup updates, we compute

$$\mathrm{throughput}=\frac{\sum_{j=6}^{J}n_{j}}{\sum_{j=6}^{J}(t_{j}-t_{j-1})}=\frac{\sum_{j=6}^{J}n_{j}}{t_{J}-t_{5}}.$$

Speedups are normalized to the strict-sync run with the same student and MC setting.
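
A minimal sketch of this computation follows, assuming per-update token counts and cumulative wall-clock timestamps are available as plain lists. The function name and the example numbers are hypothetical.

```python
def training_throughput(tokens, times, warmup=5):
    """Tokens per second after discarding the first `warmup` updates.

    tokens[j - 1] is n_j and times[j - 1] is t_j for updates j = 1..J,
    so the result is sum_{j > warmup} n_j / (t_J - t_warmup).
    """
    if len(tokens) != len(times):
        raise ValueError("tokens and times must have the same length")
    if warmup < 1 or len(tokens) <= warmup:
        raise ValueError("need at least one update after the warmup updates")
    elapsed = times[-1] - times[warmup - 1]
    if elapsed <= 0:
        raise ValueError("timestamps must increase after warmup")
    return sum(tokens[warmup:]) / elapsed


if __name__ == "__main__":
    # Seven hypothetical updates of 100 tokens each; the last two take 2 s each.
    print(training_throughput([100] * 7, [1, 2, 3, 4, 5, 7, 9]))
```

The telescoping sum in the formula is why only the two endpoint timestamps $t_{5}$ and $t_{J}$ are needed.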

#### Pipeline overlap metric\.

Let $\mathcal{S}=\{\text{rollout},\text{teacher},\text{train}\}$. For teacher and train, we merge overlapping busy intervals within each stage and compute the merged busy time $T_{s}$. Rollout has $N_{r}$ workers, so we first merge intervals within each worker $i$ and then define the rollout-stage busy time as the worker-normalized average

$$T_{\mathrm{rollout}}=\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}T_{\mathrm{rollout},i}.$$

Let $T_{\mathrm{wall}}$ be the elapsed train wall-clock interval from the first to the last recorded pipeline-stage interval. We define

$$\mathrm{overlap}=\frac{\sum_{s\in\mathcal{S}}T_{s}}{T_{\mathrm{wall}}}.$$

A mostly serial schedule has overlap near $1$. Because rollout busy time is worker-normalized, the maximum remains $3$, attained when all rollout workers, teacher scoring, and training are busy for the full interval.
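
The overlap metric can be sketched as interval merging followed by a ratio. The code below is an illustration under the definitions above; the function names and the `(start, end)` tuple representation of busy intervals are our assumptions.

```python
def merged_busy_time(intervals):
    """Total length of the union of (start, end) busy intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start   # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)        # extend the current merged run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def pipeline_overlap(rollout_by_worker, teacher, train):
    """Sum of stage busy times over wall time; rollout is averaged over workers."""
    t_rollout = sum(merged_busy_time(w) for w in rollout_by_worker) / len(rollout_by_worker)
    t_teacher = merged_busy_time(teacher)
    t_train = merged_busy_time(train)
    everything = [iv for w in rollout_by_worker for iv in w] + list(teacher) + list(train)
    t_wall = max(end for _, end in everything) - min(start for start, _ in everything)
    return (t_rollout + t_teacher + t_train) / t_wall


if __name__ == "__main__":
    serial = pipeline_overlap([[(0, 1)], [(0, 1)]], [(1, 2)], [(2, 3)])
    full = pipeline_overlap([[(0, 3)], [(0, 3)]], [(0, 3)], [(0, 3)])
    print(serial, full)
```

In the two toy schedules, the strictly serial case gives an overlap of 1 and the fully concurrent case gives 3, matching the stated range.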

#### Hardware and testing protocol\.

Each scheduler run uses one 8\-GPU node\. One GPU is reserved for teacher scoring\. The remaining seven GPUs are the rollout/training pool\. Rollout generation uses data parallelism, and learner training uses PyTorch FSDP\. Strict sync runs time\-share this pool: all seven GPUs run rollout, then all seven switch to training, and the cycle repeats\. The two\-step\-off and our AsyncOPD runs split the same seven GPUs concurrently: 4 GPUs for rollout workers and 3 GPUs for the FSDP trainer\.

For each student size and MC setting, we compare strict sync, two-step-off, and our AsyncOPD scheduler with the same teacher, training data, evaluation metrics, and reverse-KL estimator: current-policy $A_{\theta}$, no clipping, and old-to-current IS correction. Two-step-off fixes a two-update offset between rollout and the learner update that consumes it, so stale rollout reuse is static and controlled rather than produced by queue timing. We use this offset because it is the fastest static step-off schedule under the 4-rollout/3-trainer split. The OPD pipeline has three serial stages: rollout generation, teacher scoring, and learner training. Therefore, a two-step offset is enough to keep all stages occupied in the gated schedule. Larger offsets only make the consumed data older; they do not create another OPD stage to overlap or remove the step-off batch barrier. We measure final-checkpoint Avg@32 and train wall-clock time over the same training horizon.

#### Qwen3\-Base train\-time accuracy\.

[Figure 11](https://arxiv.org/html/2606.24143#A7.F11) provides the train-time view for Qwen3-Base students. AsyncOPD reaches later checkpoints sooner, so accuracy improves earlier in wall-clock time across student sizes and MC settings.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x28.png)\(a\)MC64, 1\.7B\-Base
![Refer to caption](https://arxiv.org/html/2606.24143v1/x29.png)\(b\)MC64, 4B\-Base
![Refer to caption](https://arxiv.org/html/2606.24143v1/x30.png)\(c\)MC64, 8B\-Base
![Refer to caption](https://arxiv.org/html/2606.24143v1/x31.png)\(d\)MC1, 1\.7B\-Base
![Refer to caption](https://arxiv.org/html/2606.24143v1/x32.png)\(e\)MC1, 4B\-Base
![Refer to caption](https://arxiv.org/html/2606.24143v1/x33.png)\(f\)MC1, 8B\-Base

Figure 11: Train-time AIME24 Avg@32 for Qwen3-Base students with MC64 and MC1. Lines are 3-point moving averages; faint markers are raw evaluations; colors denote scheduler. AsyncOPD reaches later checkpoints sooner, so its accuracy improves earlier in wall-clock time.

#### Additional Qwen3 AsyncOPD results\.

For the Qwen3 1.7B, 4B, and 8B student rows, we disable thinking at the tokenizer prompt-formatting level: prompt construction uses the Qwen3 tokenizer's non-thinking chat-template mode before rollout and evaluation. [Table 7](https://arxiv.org/html/2606.24143#A7.T7) and [Figure 12](https://arxiv.org/html/2606.24143#A7.F12) report this comparison. The systems pattern matches the Qwen3-Base results: AsyncOPD has the highest throughput and overlap, reaching up to $3.8\times$ strict-sync throughput on MC64 and up to $3.2\times$ on MC1. The train-time accuracy plots show the same wall-clock pattern as the main Qwen3-Base results: AsyncOPD reaches later checkpoints sooner, so accuracy improves earlier across student sizes and MC settings.

Table 7: Additional AsyncOPD scheduler results for Qwen3 students with thinking disabled. Train tok/s is training throughput; parentheses show speedup over the matched strict-sync baseline. Overlap is concurrent OPD-stage activity. Avg@32 is final AIME24. AsyncOPD achieves the highest throughput and overlap in all matched settings while maintaining comparable final accuracy.

![Refer to caption](https://arxiv.org/html/2606.24143v1/x34.png)\(a\)MC64, 1\.7B
![Refer to caption](https://arxiv.org/html/2606.24143v1/x35.png)\(b\)MC64, 4B
![Refer to caption](https://arxiv.org/html/2606.24143v1/x36.png)\(c\)MC64, 8B
![Refer to caption](https://arxiv.org/html/2606.24143v1/x37.png)\(d\)MC1, 1\.7B
![Refer to caption](https://arxiv.org/html/2606.24143v1/x38.png)\(e\)MC1, 4B
![Refer to caption](https://arxiv.org/html/2606.24143v1/x39.png)\(f\)MC1, 8B

Figure 12: Train-time AIME24 Avg@32 for Qwen3 1.7B, 4B, and 8B students with thinking disabled, using MC64 and MC1. Lines are 3-point moving averages; faint markers are raw evaluations; colors denote scheduler. AsyncOPD reaches later checkpoints sooner, so its accuracy improves earlier in wall-clock time.