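@QingQ77: A collection of open-source code and papers on using On-Policy Distillation and Self-Distillation when training LLMs, VLMs, and agents, tagged along four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://g…
Summary
Introduces AwesomeOPD, a curated list of open-source code and papers on On-Policy Distillation (OPD) and Self-Distillation as used in training LLMs, VLMs, and agents. The list classifies and annotates each resource by teacher source, supervision signal, rollout usage, and training stage.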
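https://github.com/thinkwee/AwesomeOPD… AwesomeOPD is an Awesome List for On-Policy Distillation, organised into more than ten categories: Survey, White-Box/Black-Box OPD, OPSD, iterative self-bootstrapping, OPD-RL hybrids, reasoning, multimodal, agent/embodied, speculative decoding, frameworks, and industrial reports. Each entry is annotated along four axes (teacher source, supervision signal, rollout consumption, pipeline slot) and comes with strictness notes that separate methods fully satisfying C1+C2 from those that satisfy them only partially.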
thinkwee/AwesomeOPD
Source: https://github.com/thinkwee/AwesomeOPD
When LLMs Distill On-Policy
AwesomeOPD is an awesome list summarising open-source repositories and papers for training LLMs (and VLMs / agents / draft models) with On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD):
- 🎯 OPD = C1 + C2.
  - C1: student samples its own trajectories y ~ π_student(·|x) during training.
  - C2: teacher provides per-token / sequence supervision on those student samples.
  - Methods that only partially satisfy C1 or C2 are flagged in 📝 Strictness notes per section.
- 🪞 OPSD = special case where teacher is the same model, conditioned on privileged context (verified trace / answer / “be concise” prefix / longer context) or an earlier checkpoint.
- 🚀 Each entry is annotated along four design axes — teacher source (external · same model with privileged context · earlier checkpoint · multi-teacher · discriminator), supervision signal (logits / top-k / sequence reward / verbal score / discriminator / verifier / feature), rollout consumption (all / selected / truncated / replaced / as PG samples), and pipeline slot (cold-start / mid / RL-replacement / inside-RL / inter-stage / compression / continual-anchor).
- ⚠️ Built by reading paper PDFs, project pages, and source code with LLM coding agents; manually reviewed but errors possible. PRs welcome.
- 📅 Last updated: 2026-04-30
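
The four annotation axes and the C1/C2 test described in the bullets above map naturally onto a small record type. The sketch below is illustrative only: the names `Entry`, `TeacherSource`, `strictness`, and `is_self_distillation` are hypothetical and do not come from the AwesomeOPD repository, and the example entry is invented rather than one of the listed methods.

```python
from dataclasses import dataclass
from enum import Enum


class TeacherSource(Enum):
    EXTERNAL = "external"
    PRIVILEGED_SELF = "same model with privileged context"
    EARLIER_CHECKPOINT = "earlier checkpoint"
    MULTI_TEACHER = "multi-teacher"
    DISCRIMINATOR = "discriminator"


@dataclass(frozen=True)
class Entry:
    """One list entry annotated along the four design axes (hypothetical schema)."""
    name: str
    teacher_source: TeacherSource
    supervision: str     # e.g. "logits", "top-k", "sequence reward", "verifier"
    rollout_use: str     # e.g. "all", "selected", "truncated", "replaced"
    pipeline_slot: str   # e.g. "cold-start", "mid", "inside-RL", "compression"
    c1: bool             # student samples its own trajectories during training
    c2: bool             # teacher supervises those student samples

    def strictness(self) -> str:
        """Return 'strict' when C1 and C2 both hold, otherwise 'partial'."""
        return "strict" if self.c1 and self.c2 else "partial"

    def is_self_distillation(self) -> bool:
        """True when the teacher is the same model (privileged view or older checkpoint)."""
        return self.teacher_source in (
            TeacherSource.PRIVILEGED_SELF,
            TeacherSource.EARLIER_CHECKPOINT,
        )


# Invented example entry, not a real method from the list.
toy = Entry("toy-method", TeacherSource.PRIVILEGED_SELF, "logits", "all", "mid", True, True)
print(toy.strictness(), toy.is_self_distillation())
```

A two-valued `strictness` is a simplification; the list's own notes distinguish several ways in which C1 or C2 can be only partly met.
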
Taxonomy:
- 📚 Surveys, Foundations & Position Papers — meta-references and seed papers (GKD, MiniLLM, Thinking Machines blog, Tencent / THUNLP surveys)
- 🔬 White-Box — logit-based OPD on student rollouts with an external teacher
- 🎭 Black-Box — discriminator / verbal / preference, no teacher logits
- ♻️ OPSD — privileged-context self-distillation (same model, different conditioning)
- 🔁 Iterative Self-Bootstrapping — same model as previous-checkpoint teacher
- 🤝 OPD-RL Hybrids — inside-RL OPD: KL-as-reward, RL+OPD fusion
- 🧠 Reasoning / 🖼️ Multimodal / 🤖 Agent & Embodied — by application; cuts across all teacher-source categories
- ⚡ Speculative-Decoding Distillation — drafter distillation; “student” is a draft model
- 🛠️ Frameworks & Toolkits — what to actually run
- 🏭 Industrial / Production Reports — what the labs ship
Shorthand: FKL = forward KL · RKL = reverse KL · JSD = Jensen–Shannon · Skew-KL / AKL = skewed / adaptive KL · 📄 paper-only = no public code yet.
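
For readers who want the shorthand made concrete, here is a minimal sketch of the four divergences on small categorical distributions. The function names are illustrative, and the skew-KL form used here, KL(p ‖ αp + (1−α)q), is an assumption about the definition; individual papers may define their skewed or adaptive variants differently.

```python
import math


def kl(p, q):
    """KL(p || q) for categorical distributions given as lists (q needs full support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    return kl(teacher, student)   # FKL: expectation taken under the teacher


def reverse_kl(teacher, student):
    return kl(student, teacher)   # RKL: expectation taken under the student


def jsd(teacher, student):
    mix = [(t + s) / 2 for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)


def skew_kl(teacher, student, alpha):
    # Assumed definition: KL(teacher || alpha * teacher + (1 - alpha) * student).
    mix = [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]
    return kl(teacher, mix)


teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.3, 0.6]
for name, value in [
    ("FKL", forward_kl(teacher, student)),
    ("RKL", reverse_kl(teacher, student)),
    ("JSD", jsd(teacher, student)),
    ("Skew-KL(0.5)", skew_kl(teacher, student, 0.5)),
]:
    print(f"{name}: {value:.4f}")
```

FKL and RKL differ only in which distribution the expectation is taken under, which is why on-policy methods that sample from the student tend to pair with RKL.
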
Updates
- 📢 2026-04-28 — initial release
📚 Surveys, Foundations & Position Papers
| Resource | Date | Org | Paper / Link | Title / Notes |
|---|---|---|---|---|
| Blog | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | Blog · tinker-cookbook | Thinking Machines Lab — On-Policy Distillation (blog) |
| tinker-cookbook | 2025.10 | Thinking Machines Lab | — | Reference impl. of the OPD recipe on the Tinker SDK |
| Tencent OPD Survey | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | arXiv 2604.00626 | A Survey of On-Policy Distillation for LLMs |
| OPD | 2026.04 | Tsinghua THUNLP | arXiv 2604.13016 | Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe |
| revisiting_opd | 2026.03 | CASIA (Fu et al.) | arXiv 2603.25562 | Revisiting OPD: Failure Modes & Simple Fixes |
| Lightning OPD | 2026.04 | Wu, Han, Cai | arXiv 2604.13010 | Lightning OPD: Efficient Post-Training with Offline OPD |
| GKD | 2023.06 | Google DeepMind (Agarwal et al.) | arXiv 2306.13649 — implemented in TRL GKDTrainer | GKD: On-Policy Distillation of Language Models — Learning from Self-Generated Mistakes (Seminal · ICLR 2024) |
📋 Technical details
| Resource | Loss / Divergence | Data | Teacher Access | Granularity | Notes |
|---|---|---|---|---|---|
| Thinking Machines blog | Reverse KL (student‖teacher) | Student rollouts | White-box | Token | “Swap KL ref model for stronger teacher” recipe; one-line addition to RL trainer. Replicates Qwen3 result at ~1/10 RL cost. |
| Tencent OPD Survey | (survey) | (survey) | (survey) | (survey) | Catalogues 50+ methods; useful as a reference index. |
| THUNLP Rethinking OPD | Reverse KL with progressive top-K alignment | Student | White-box | Token | Identifies two success conditions: compatible thinking patterns + genuinely new teacher capability. Recipe = off-policy cold-start + teacher-aligned prompt selection. |
| Revisiting OPD | Truncated reverse KL + top-p sampling + special-token masking | Student | White-box | Token (filtered) | Diagnoses 3 failure modes: imbalanced one-token signal, unreliable prefix guidance, tokenizer mismatch. |
| Lightning OPD | Cached teacher log-probs over SFT rollouts (offline OPD) | Student (cached) | White-box | Token | Introduces “teacher consistency”: the same teacher must be used for SFT and OPD, otherwise the gradient is biased. Eliminates the live teacher server. |
| GKD (Agarwal) | Generalised JSD (FKL/RKL configurable) | Mixed (λ interpolates teacher↔student) | White-box | Token | The seminal paper that named OPD; introduced student-self-rollout supervision. |
📝 Strictness notes (against the strict OPD definition C1: student samples its own trajectories during training + C2: teacher provides supervision on those samples)
- Lightning OPD — ⚠️ partially satisfies C1: teacher log-probs are pre-computed once over SFT rollouts and reused during training; student doesn’t actively sample during the OPD step. Authors call this “offline OPD” explicitly. Listed in OPD because the data is past-student-generated rollouts, not teacher-generated.
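
The "offline OPD" pattern in the Lightning OPD note (teacher log-probs computed once over stored rollouts, then reused) can be sketched as a cache lookup. This is a toy under stated assumptions, not the authors' implementation: `teacher_logprob` is a hypothetical callable standing in for a teacher forward pass, and the loss is the simple sampled-token estimate of the reverse KL.

```python
import math


def build_teacher_cache(rollouts, teacher_logprob):
    """Score every stored rollout once; no teacher calls are needed afterwards."""
    return {
        rid: [teacher_logprob(tokens[:i], tok) for i, tok in enumerate(tokens)]
        for rid, tokens in rollouts.items()
    }


def offline_opd_loss(rid, student_logprobs, cache):
    """Mean over stored tokens of log pi_student - log pi_teacher."""
    gaps = [s - t for s, t in zip(student_logprobs, cache[rid])]
    return sum(gaps) / len(gaps)


teacher_calls = []  # records each (hypothetical) teacher forward pass


def toy_teacher(prefix, token):
    teacher_calls.append(token)
    return math.log(0.5)  # toy teacher: probability 0.5 for every token


rollouts = {"r0": ["a", "b", "c"], "r1": ["d", "e"]}
cache = build_teacher_cache(rollouts, toy_teacher)

# Training time: only student log-probs are recomputed, the teacher side is a lookup.
loss = offline_opd_loss("r0", [math.log(0.25)] * 3, cache)
print(len(teacher_calls), round(loss, 4))
```

Because the cached rollouts were sampled by a past student rather than the current one, the estimate is only approximately on-policy, which is the reason for the partial-C1 flag.
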
🔬 OPD with Larger External Teachers — White-Box
White-box methods use teacher logits / log-probabilities to supervise the student on student-generated rollouts. Each entry below has been verified to (a) train on student rollouts and (b) operate at the token level.
Methods that turned out to be RL-style on verification have been moved to OPD-RL Hybrids; off-policy / pure-loss-function / pretraining-side methods are excluded from this list.
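As a rough illustration of what "token-level teacher supervision on student rollouts" means, the sketch below runs white-box OPD on a toy bigram model. Everything in it is an assumption made for illustration: the three-token vocabulary, the `START` state, the tabular student logits, and the function names are not taken from any listed method. The student chooses each next token itself (C1), and at every state it visits, the exact reverse KL to the teacher's distribution is reduced by one gradient step (C2).

```python
import math
import random

VOCAB = 3   # toy vocabulary: token ids 0, 1, 2
START = 0   # every rollout starts from this "previous token"


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(q, p):
    """KL(q || p) with q = student and p = teacher."""
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))


def opd_step(student_logits, teacher_probs, length, lr, rng):
    """Roll out the student and update it at each state it visits."""
    prev = START
    for _ in range(length):
        q = softmax(student_logits[prev])
        p = teacher_probs[prev]
        kl_value = reverse_kl(q, p)
        # Exact gradient of KL(q || p) with respect to the softmax logits.
        grad = [qi * (math.log(qi) - math.log(pi) - kl_value) for qi, pi in zip(q, p)]
        # C1: the next state is sampled from the student, not the teacher.
        nxt = rng.choices(range(VOCAB), weights=q)[0]
        student_logits[prev] = [z - lr * g for z, g in zip(student_logits[prev], grad)]
        prev = nxt


rng = random.Random(0)
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
student = [[0.0] * VOCAB for _ in range(VOCAB)]
print("RKL at START before:", reverse_kl(softmax(student[START]), teacher[START]))
for _ in range(200):
    opd_step(student, teacher, length=8, lr=0.5, rng=rng)
print("RKL at START after: ", reverse_kl(softmax(student[START]), teacher[START]))
```

States the student never reaches receive no update, which is the practical difference from off-policy distillation on teacher-generated text.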
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| LMOps/minillm | 2023.06 | Microsoft / Tsinghua | arXiv 2306.08543 | MiniLLM (ICLR 2024) |
| distillm | 2024.02 | KAIST / Microsoft | arXiv 2402.03898 | DistiLLM (ICML 2024) |
| distillm-2 | 2025.03 | KAIST / Microsoft | arXiv 2503.07067 | DistiLLM-2 (ICML 2025 Oral) |
| DSKDv2 | 2025.04 | BJTU | arXiv 2504.11426 | DSKDv2 — cross-tokenizer; supports on-policy mode |
| G-OPD | 2026.02 | RUC / Tencent | arXiv 2602.12125 | G-OPD |
| google-research/speculative_kd | 2024.10 | UCSB / Google | arXiv 2410.11325 | Speculative KD (ICLR 2025) |
| AdaSwitch | 2025.10 | RUC / Baidu | arXiv 2510.07842 | AdaSwitch (on-/off-policy switching) |
| Constrained OPD | 2025.09 | Huawei Noah’s Ark | arXiv 2509.22921 | Constrained OPD (CMDP) |
| REOPOLD | 2026.03 | KAIST / Microsoft | arXiv 2603.11137 | REOPOLD (Relaxed OPD) — code soon |
| OPSD_OnPolicyDistillation | 2026.03 | | arXiv 2603.11178 | PACED — frontier curriculum self-distill |
| Fast OPD | 2026.02 | Industrial | arXiv 2602.15260 | Fast OPD (prefix-truncated) |
| Entropy-Aware OPD | 2026.03 | KAIST / IBM | arXiv 2603.07079 | Entropy-Aware OPD |
| Veto | 2026.01 | SNU | arXiv 2601.07155 | Veto (Stable OPD) — ACL 2026 Findings |
| OPSD_OnPolicyDistillation | 2026.04 | Meta / LinkedIn | arXiv 2604.14084 | TIP — Token Importance, shares LinkedIn OPSD repo with PACED |
| SCOPE | 2026.04 | USTC / Meituan / Fudan | arXiv 2604.10688 | SCOPE — signal-calibrated dual-path |
| TSD-KD | 2026.03 | Korea Univ. | arXiv 2603.13260 | TSD-KD — token-selective dual KD (ICLR 2026) |
| Hybrid-Policy-Distillation | 2026.04 | zwhong714 | arXiv 2604.20244 | HPD — Hybrid Policy Distillation; LlamaFactory + verl backends |
📋 Technical details
| Method | Loss / Divergence | Data | Granularity | Domain | Notes |
|---|---|---|---|---|---|
| MiniLLM | Reverse KL via policy gradient | Student | Sequence (PG) | General | The seminal “OPD” recipe by Yuxian Gu et al.; predates GKD by days. Mode-seeking. |
| DistiLLM | Skewed-KL (mix of FKL/RKL) | Mixed (adaptive off→on, with student samples) | Token | General | Skew parameter α interpolates between FKL and RKL; importance-reweighted student samples. |
| DistiLLM-2 | Contrastive: Skew-FKL on teacher data + Skew-RKL on student data | Mixed | Token | General | Asymmetric losses on each data source; ICML 2025 oral. |
| DSKDv2 | KL in dual aligned space; explicit on-policy mode | Student | Token | Cross-tokenizer | Cross-vocabulary distillation; supports both on/off-policy. |
| G-OPD / ExOPD | Reverse KL + scaled reward extrapolation | Student | Token | General | Generalises OPD as KL-constrained RL; allows reward scale > 1 to “exceed” the teacher. |
| Speculative KD (Xu) | Interleaved propose-and-correct (gated KL) | Student-proposed, teacher-corrected | Token | General | Bridges teacher-student gap via interleaved sampling. |
| AdaSwitch | Adaptive on/off-policy switching | Mixed | Token | General | Switches between teacher-data and student-rollout based on divergence threshold. |
| Constrained OPD | KL-constrained CMDP | Student | Token | General | Hard KL constraint instead of soft penalty. Borderline OPD-RL. |
| REOPOLD | Mixture-based reward clipping + entropy-based dynamic sampling | Student | Token | Reasoning | “Relaxed OPD”; views OPD as policy optimisation with teacher-student log-ratio reward. |
| PACED | Frontier curriculum at student competence boundary | Student | Token | General | Self-distill style (privileged-context / earlier-checkpoint); difficulty weighting w(p)=p(1−p). |
| Fast OPD | Prefix-truncated distillation reducing FLOPs | Student | Token (truncated) | Reasoning | 2× to 47× speedup via reasoning-prefix truncation. |
| Entropy-Aware OPD | Switch between FKL and RKL based on teacher entropy | Student | Token | Reasoning | When teacher entropy high → FKL; low → RKL. |
| Veto | Logit-space geometric bridge with adaptive gradient veto | Student | Token | General | Adaptive Target Reformulation. |
| TIP | Top-50% high-entropy student tokens carry the OPD signal | Student (selected) | Token (filtered) | Reasoning | ~47% memory savings; only entropy-high student tokens trained. |
| SCOPE | Teacher-PPL-weighted KL on incorrect rollouts; student-PPL-weighted MLE on correct | Student | Token | Reasoning | Signal-Calibrated OPD with Dual-Path Adaptive Weighting; verifier-routing. |
| TSD-KD | Indirect (student-propose / teacher re-rank) + direct selective logit KD | Mixed | Token (selected) | General | Hybrid; partial OPD + partial preference. |
| HPD | Reweighted log-likelihood unifying FKL + RKL | Mixed (off-policy + lightweight approximate on-policy sampling) | Token | General | Unifies KD as token-level reweighted likelihood; lightweight on-policy sampling preserves training efficiency. |
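
Three of the token- and prompt-level rules in the table above are simple enough to sketch directly: the PACED difficulty weight w(p)=p(1−p), TIP-style selection of the highest-entropy student tokens, and the entropy-based FKL/RKL switch. The function names, the tie-breaking behaviour, and the `threshold` value are hypothetical; the papers may implement these rules differently.

```python
import math


def paced_weight(pass_rate):
    """Difficulty weight w(p) = p(1 - p): largest at the competence boundary p = 0.5."""
    return pass_rate * (1.0 - pass_rate)


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_entropy_mask(student_dists, keep_fraction=0.5):
    """Keep only the highest-entropy positions of a student rollout."""
    ents = [entropy(d) for d in student_dists]
    k = max(1, math.ceil(len(ents) * keep_fraction))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(ents))]


def pick_divergence(teacher_dist, threshold):
    """High teacher entropy -> forward KL, low teacher entropy -> reverse KL."""
    return "FKL" if entropy(teacher_dist) > threshold else "RKL"


rollout = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
print(paced_weight(0.5))
print(top_entropy_mask(rollout))
print(pick_divergence([0.5, 0.5], threshold=0.5), pick_divergence([0.99, 0.01], threshold=0.5))
```

All three act as filters or weights on top of a base divergence, so they can be combined with the per-token loss of most white-box recipes.
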
🎭 OPD with Black-Box / Outcome-Based Teachers
When the teacher is API-only (no logits), OPD uses scalar rewards, verbal scores, preferences, or adversarial discriminators — all evaluated on student rollouts. Entries that turned out to use static teacher data only (Lion, SuperCorrect, DAIL, SODA) are excluded from this list.
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| LMOps/gad | 2025.11 | Microsoft Research | arXiv 2511.10643 · project | GAD — Black-Box OPD |
| OVD | 2026.01 | HKU / Huawei | arXiv 2601.21968 | OVD (On-policy Verbal Distillation) — project page OVD.github.io 404s |
| ORPO-Distill | 2025.09 | Industrial | arXiv 2509.25100 | ORPO-Distill |
📋 Technical details
| Method | Feedback Signal | Data | Granularity | Domain | Notes |
|---|---|---|---|---|---|
| GAD (Generative Adversarial Distillation) | Discriminator (on-policy reward model) | Student | Sequence | General | A trained discriminator distinguishes student outputs from teacher (e.g. GPT-5) responses; minimax game makes the discriminator co-evolve into an on-policy reward model. Qwen2.5-14B student becomes comparable to GPT-5-Chat on LMSYS. |
| OVD | Verbal scores (0–9) on student trajectories | Student | Sequence | General | Replaces token-level logit matching with verbal scoring; +25.7% over baselines. |
| ORPO-Distill | Student-Generated Outputs (SGO) + ORPO contrastive | Mixed (student-generated negatives, teacher positives) | Sequence | Cross-architecture | “Mixed-policy strategy utilizing student-generated outputs”; NeurIPS 2025 WS. |
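
To show how a black-box signal can still drive learning on student rollouts, the sketch below turns 0–9 verbal scores into centred sequence-level advantages. This is one plausible use of such scores, not the OVD objective: the normalisation, the group-mean baseline, and the function name are all assumptions.

```python
def verbal_scores_to_advantages(scores, max_score=9):
    """Map verbal scores for several rollouts of one prompt to centred advantages."""
    rewards = [s / max_score for s in scores]     # rescale to [0, 1]
    baseline = sum(rewards) / len(rewards)        # group-mean baseline (assumed)
    return [r - baseline for r in rewards]


# Four student rollouts for one prompt, each scored by an API-only teacher.
print(verbal_scores_to_advantages([9, 6, 3, 0]))
```

Rollouts scored above the group mean get a positive weight and those below get a negative one, so no teacher logits are needed.
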
♻️ Self-Distillation with Privileged Context — OPSD
Same model = teacher = student, but the teacher is conditioned on something the student doesn’t see (verified trace, ground-truth answer, “be concise” prefix, longer context, document, …). The gap exists because of the conditioning, not weights.
Several entries previously listed here turned out on verification to use static teacher data or a fixed self-rewritten dataset rather than student rollouts; those have been excluded. SPIN was reclassified to Iterative Self-Bootstrapping.
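The "same model, different conditioning" idea can be made concrete with a toy stand-in for a language model. In the sketch below, `model_probs` is a hypothetical function playing the role of a single set of weights; the teacher and student views differ only in whether the privileged text is prepended. The prompt, the hint string, and the two-token vocabulary are invented for illustration.

```python
import math


def model_probs(context):
    """Hypothetical stand-in for one LM: next-token distribution over {"4", "5"}."""
    if "answer is 4" in context:
        return {"4": 0.9, "5": 0.1}
    return {"4": 0.5, "5": 0.5}


def opsd_token_signal(prompt, privileged, sampled_token):
    """Log-ratio teacher/student on a token the student sampled."""
    student = model_probs(prompt)                       # student view: prompt only
    teacher = model_probs(privileged + "\n" + prompt)   # same weights, extra context
    return math.log(teacher[sampled_token]) - math.log(student[sampled_token])


prompt = "What is 2 + 2?"
hint = "The verified answer is 4."
print(opsd_token_signal(prompt, hint, "4"))   # positive: teacher view prefers this token
print(opsd_token_signal(prompt, hint, "5"))   # negative: teacher view disfavours it
print(opsd_token_signal(prompt, "", "4"))     # zero: no privileged context, no gap
```

With empty privileged context the two views coincide and the signal vanishes, which is the sense in which the gap comes from conditioning rather than from weights.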
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| OPSD | 2026.01 | UCLA / Meta FAIR | arXiv 2601.18734 · blog | OPSD — Self-Distilled Reasoner |
| CRISP_Reasoning_Compression | 2026.03 | | arXiv 2603.05433 | OPSDC / CRISP |
| Self-Distillation | 2026.01 | MIT / ETH | arXiv 2601.19897 | SDFT-Continual |
| LMOps/opcd | 2026.02 | Microsoft Research | arXiv 2602.12275 | OPCD — On-Policy Context Distillation |
| LMOps/oel | 2026.03 | Microsoft Research | arXiv 2603.16856 | OEL — Online Experiential Learning |
| mtp-lm | 2026.02 | UMD / LLNL | arXiv 2602.06019 | MTP Self-Distill |
| ml-ssd | 2026.04 | Apple MLR | arXiv 2604.01193 | Apple — Embarrassingly Simple Self-Distillation |
| GATES | 2026.02 | UMD | arXiv 2602.20574 | GATES (Self-Distillation under Privileged Context) |
| OPSDL | 2026.04 | Baidu | arXiv 2604.17535 | OPSDL (Long-Context Self-Distillation) |
| SD-Zero | 2026.04 | Princeton / Toronto / CMU | arXiv 2604.12002 | SD-Zero — Self-Revision turns binary rewards into dense supervision |
| self-distillation-analysis | 2026.03 | MSR / KAIST / SNU | arXiv 2603.24472 | Why Does Self-Distillation (Sometimes) Degrade Reasoning? — diagnostic study of OPSD failure modes |
| π-Play | 2026.04 | CASIA / UCAS / Meituan | arXiv 2604.14054 | π-Play — multi-agent self-play turns the question-construction path into privileged context for OPSD on search agents |
📋 Technical details
| Method | Privileged Context (Teacher) | Loss / Divergence | Granularity | Domain | Notes |
|---|---|---|---|---|---|
| OPSD (Self-Distilled Reasoner) | Verified reasoning trace | Per-token RKL with point-wise clipping | Token | Math reasoning | Same-model OPSD; matches GRPO with 1×8 rollouts and 1024 length vs. GRPO’s 8×16 / 16k. The canonical OPSD paper. Built on TRL’s GOLD trainer. |
| CRISP / OPSDC | “Be concise” instruction prefix | Per-token RKL on student rollouts | Token | Reasoning compression | Compresses long-CoT without entropy collapse (unlike RL-with-length-penalty). |
| SDFT-Continual (idanshen) | Demo-conditioned same model | RKL on student rollouts vs. demo-conditioned teacher | Token | Continual learning | Self-distillation enables continual learning. |
| OPCD | In-context-knowledge-augmented same model | RKL on student rollouts | Token | Knowledge internalisation | Internalise context to be faithful even after context is removed. |
| OEL (Online Experiential Learning) | Same model with interactive game environment | RKL on student rollouts | Token | Game / planning | Self-distillation on interactive trajectories. |
| MTP Self-Distill | Multi-token prediction same model | RKL on student rollouts | Token | General | Multi-Token Prediction via Self-Distillation. Author-stated on-policy. |
| Apple SSD | Same model w/ temperature/truncation sampling | Cross-entropy on its own samples | Sequence | Code generation | “Embarrassingly simple” — sample, then SFT on those samples. Degenerate OPSD; “decoding-config” privilege. |
| GATES | Document-conditioned tutor (same model) | RKL gated by tutor consensus | Token (gated) | Document QA | Both tutor and student sample rollouts; on-policy student-rollout updates contribute “modest additional improvement” on top of off-policy distillation. Mixed. |
| OPSDL | Short-context same model | Point-wise RKL | Token | Long-context | On-Policy Self-Distillation for Long-Context LMs. |
| SD-Zero | Reviser conditioned on generator’s response + binary reward | Per-token KL: distill reviser → generator on student rollouts | Token | Math / code reasoning | Single model plays Generator + Reviser; reviser’s reward-conditioned token distribution becomes dense supervision over the generator’s response. Outperforms RFT, GRPO, SDFT under matched sample budget on Qwen3-4B-Instruct / Olmo-3-7B-Instruct (≥10% over base). Exhibits token-level self-localization and iterative self-evolution. |
| Why-Does-SD-Degrade (analysis) | Varies (controlled study over rich-vs-thin context teachers) | RKL on student rollouts (analysis only) | Token | Math reasoning (in-domain + OOD) | Diagnostic paper, not a training method. Finds that conditioning the teacher on richer privileged context suppresses epistemic verbalization (uncertainty expression) in the student → fast in-domain gains but up to 40% OOD drops on Qwen3-8B / DeepSeek-Distill-Qwen-7B / Olmo3-7B-Instruct. Implication: privileged-context richness is a double-edged knob in OPSD. |
| π-Play | Teacher conditioned on Question Construction Path (QCP) — the reverse-direction artifact emitted by an examiner agent when it generates the task | Per-token reverse KL on student rollouts; teacher is an EMA copy of the student (τ=0.05) | Token | Search / deep-research / multi-hop QA agents (NQ, TriviaQA, HotpotQA, 2WikiMQA, MuSiQue, …) | Self-play loop examiner ↔ student/teacher with no external data. The QCP is privileged because it captures the reverse solution path the examiner used to construct the task; the teacher sees it, the student doesn’t. Converts sparse-reward self-play into dense per-token supervision; data-free π-Play surpasses fully supervised search agents and is 2–3× more sample-efficient than conventional self-play. |
📝 Strictness notes
- Apple SSD — ⚠️ C2 is degenerate: no teacher KL signal; pure self-generated SFT (sample with temperature/truncation, then SFT on those samples). Closer to STaR-style self-bootstrapping than to OPSD. Kept because the “teacher” is the same model with a different decoding config — privileged-context-by-decoding.
- GATES — ⚠️ Authors’ own ablation says off-policy trajectory-level distillation drives the primary gains; on-policy student-rollout updates contribute only “modest additional improvement”. Mixed; the OPSD leg is genuine but secondary.
- SD-Zero — privileged context is non-textual: the reviser is conditioned on the generator’s full response plus its scalar binary reward. C1 ✓ (generator samples its own rollouts), C2 ✓ (per-token KL from reviser). Compared head-to-head against GRPO in the paper but is not itself an RL method — there is no policy-gradient objective; the reward is a conditioning signal, not a return. Listed in OPSD rather than OPD-RL Hybrids for that reason.
- Why-Does-SD-Degrade — analysis-only; no new training algorithm proposed. Listed here because the failure mode it characterises (epistemic-verbalization collapse under rich privileged context) is specific to OPSD.
- π-Play — teacher and student have separate parameter sets; the teacher is an EMA-tracking copy of the student rather than literally the same weights. Listed in OPSD because (i) the paper itself frames the method as “Privileged Self-Distillation” and (ii) the gap between teacher and student exists because of QCP conditioning, not weight divergence (the EMA target collapses to the student in the limit). C1 ✓ (student samples its own rollouts), C2 ✓ (per-token RKL from QCP-conditioned teacher).
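
The EMA-teacher remark in the π-Play note can be illustrated with a few lines. This is a generic exponential-moving-average update, assuming τ is the fraction moved toward the student at each step; whether the paper uses τ or 1−τ in that role is not stated here, and the flat parameter dictionary is a toy stand-in for model weights.

```python
def ema_update(teacher, student, tau=0.05):
    """Move each teacher parameter a fraction tau toward the student's value."""
    return {name: (1.0 - tau) * teacher[name] + tau * student[name] for name in teacher}


teacher = {"w": 0.0}
student = {"w": 1.0}   # held fixed here to show the limit behaviour
for _ in range(200):
    teacher = ema_update(teacher, student, tau=0.05)
print(teacher["w"])
```

With the student held fixed the teacher converges to it, matching the remark that the EMA target collapses to the student in the limit.
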
🔁 Iterative Self-Bootstrapping
Same model is the teacher, but as a frozen earlier checkpoint, not a privileged-context view. The teacher snapshot is frozen for one round, the student trains, then the snapshot rolls forward. Listed separately because the supervision is typically sequence-level / preference, not per-token logit-distillation.
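The snapshot-train-roll-forward loop described above can be sketched as follows. The function names and the toy `skill` counter are hypothetical; `train_round` stands in for whatever sequence-level or preference objective a given method uses.

```python
import copy


def self_bootstrap(student, train_round, num_rounds):
    """Each round trains the student against a frozen copy of itself."""
    for _ in range(num_rounds):
        teacher = copy.deepcopy(student)          # snapshot, frozen for this round
        student = train_round(student, teacher)   # student trains against the snapshot
        # The next iteration takes a fresh snapshot, so the teacher rolls forward.
    return student


def toy_round(student, teacher):
    # Hypothetical update: end the round one notch above the frozen teacher.
    return {"skill": teacher["skill"] + 1}


final = self_bootstrap({"skill": 0}, toy_round, num_rounds=3)
print(final)  # {'skill': 3}
```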
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| SPIN | 2024.01 | UCLA | arXiv 2401.01335 | SPIN — Self-Play Fine-Tuning (ICML 2024) |
| rStar | 2025.01 | Microsoft Research | rStar-Math 2501.04519 · rStar2-Agent 2508.20722 | rStar / rStar-Math / rStar2-Agent |
📝 Strictness notes
- SPIN — ⚠️ C1 ✓ (student samples), but C2 fails strict per-token logit form: supervision is sequence-level DPO preference against the previous frozen checkpoint. More accurately “iterative on-policy DPO” than per-token OPD. Kept because the “teacher = previous self” pattern is what people search for in OPD lists.
- rStar / rStar-Math / rStar2-Agent — ⚠️ MCTS-filtered student samples + SFT; the “teacher signal” is a step-level PPM / discriminator score, not per-token logit KL. Iterative self-improvement, not classical OPD.
🤝 OPD-RL Hybrids — Inside-RL OPD
Methods that fuse OPD with RLVR / GRPO / PPO / DPO. Teacher logits become a dense reward shaping or trust-region anchor inside an RL objective; or BoN / preference signals are used as the imitation target.
Newly added on verification: AlignDistil (RLHF-equivalent distillation), BOND / Faster WIND (sequence-level Best-of-N as target), KETCHUP (k-step RL-based KD), 𝒳-KD / DDT (IRL-style), LUFFY (mixed-policy GRPO with off-policy traces), NPO / AutoNPO (mixed-policy GRPO with near-future self as teacher). Removed on verification: RLKD (only sequence-level structural reward), ExGRPO (pure RL, no teacher), REDI (offline R1 traces, no student rollouts).
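One common way teacher log-probabilities become a dense reward inside an RL objective is to add the per-token log-ratio to a sparse outcome reward. The sketch below shows that shaping step only. It is a generic illustration rather than the objective of any listed method, and `kl_coef`, the function name, and the choice to place the outcome reward on the final token are assumptions.

```python
def shaped_token_rewards(student_logps, teacher_logps, outcome_reward, kl_coef=1.0):
    """Per-token reward = kl_coef * (log pi_teacher - log pi_student) on sampled tokens,
    plus the verifier / outcome reward on the final token."""
    rewards = [kl_coef * (t - s) for s, t in zip(student_logps, teacher_logps)]
    if rewards:
        rewards[-1] += outcome_reward
    return rewards


# Log-probs of the tokens the student actually sampled, under each model.
student_logps = [-1.0, -2.0]
teacher_logps = [-0.5, -3.0]
print(shaped_token_rewards(student_logps, teacher_logps, outcome_reward=1.0))  # [0.5, 0.0]
```

In expectation over student samples, the log-ratio term equals the negative reverse KL at each position, so maximising it pulls the student toward the teacher while the outcome term keeps the RL signal.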
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| SDPO | 2026.01 | ETH / MIT | arXiv 2601.20802 · project | SDPO — RL via Self-Distillation |
| OpenClaw-RL | 2026.03 | Gen-Verse | arXiv 2603.10165 | OpenClaw-RL — combines GRPO + OPD |
| Open-AgentRL | 2026.02 | Gen-Verse | — | Open-AgentRL — RLAnything / DemyAgent multi-domain |
| AlignDistil | 2025.03 | BJTU / Tencent | arXiv 2503.02832 | AlignDistil — RLHF-equivalent KD (ACL 2025) |
| LUFFY | 2025.04 | Westlake U. | arXiv 2504.14945 | LUFFY — mixed-policy GRPO |
| NPO | 2026.04 | IIE CAS / UCAS / JD.COM | arXiv 2604.20733 | NPO / AutoNPO — mixed-policy GRPO with near-future self as teacher |
| KEPO | 2026.01 | Industrial | arXiv 2602.00400 | KEPO |
| BOND | 2024.07 | Google DeepMind | arXiv 2407.14622 | BOND (Best-of-N Distillation) |
| Faster WIND | 2024.10 | CMU / Google | arXiv 2410.20727 | Faster WIND (iterative BoN) — AISTATS 2025 |
| KETCHUP | 2025.04 | U. Alberta | arXiv 2504.19024 | KETCHUP (k-step RL-KD) |
| 𝒳-KD | 2026.02 | BUPT | arXiv 2602.12674 | 𝒳-KD (IRL-style) |
| Towards-On-Policy-SFT | 2026.02 | MSRA / Shopee | arXiv 2602.12222 | DDT — on-policy SFT theory |
| RLAD | 2026.02 | AWS | arXiv 2602.22495 | RLAD (Reinforcement-aware KD) |
| KDRL | 2025.06 | HIT / Huawei | arXiv 2506.02208 | KDRL (Joint KD + RL) |
| RLSD | 2026.04 | Multi-org | arXiv 2604.03128 | Self-Distilled RLVR (RLSD) |
| HDPO | 2026.03 | NVIDIA | arXiv 2603.23871 | HDPO (Hybrid Distillation PO) |
| ExGRPO | 2026.03 | UNC / ASU | arXiv 2603.19266 | Probing-to-Refine / EI / EXGRPO |
📋 Technical details
| Method | Inner RL | Teacher Role | Data | Granularity | Domain | Notes |
|---|---|---|---|---|---|---|
| SDPO | Custom self-distillation policy gradient | Feedback-conditioned same model = self-teacher | Student | Token | Code, tool-use, science | Sample student rollout, get tokenised feedback, re-evaluate under feedback-conditioned self-teacher, distill the corrected next-token distribution back into policy. |
| OpenClaw-RL | GRPO + OPD | Judge model extracts hindsight hints, teacher token-logprob gap = directional advantage | Mixed | Token | Terminal / GUI / SWE / Tool-call | Unifies binary RL and OPD in one trainer. |
| Open-AgentRL | GRPO-TCR | Multi-domain teachers | Student | Token | Reasoning / GUI / Coding | Includes process-reward modelling via SandboxFusion. |
| AlignDistil | RLHF-equivalent KD | DPO-derived combination of DPO model + ref-model logits | Student | Token | Alignment | Re-frames DPO as policy distillation. |
| LUFFY | Mixed-Policy GRPO + policy shaping | Off-policy R1 traces inserted into student rollouts | Mixed | Token + sequence | Reasoning | “Learn to reason under off-policy guidance”. Mixes on-policy student rollouts with off-policy teacher traces. |
| NPO / AutoNPO | Mixed-Policy GRPO | Verifier-filtered trajectories from a later checkpoint of the same training run | Mixed | Sequence | Reasoning (RLVR) | “Learn from your near-future self”. Picks a teacher that is strong enough (higher Q than current policy) yet close enough (low V vs. external teachers like R1), maximising effective Q/V signal. AutoNPO adaptively schedules the interventions; preserves higher entropy than vanilla GRPO. |
| KEPO | Knowledge-enhanced PO | Knowledge-base teacher | Mixed | Sequence | Reasoning | Adds KB grounding to preference RL. |
| BOND | Best-of-N distillation | Same model’s BoN target | Student (iterative) | Sequence | Alignment | Treats Best-of-N as the target distribution; iterative anchor; Jeffreys divergence. |
| Faster WIND | Win-rate dominance | Same model BoN | Student (iterative) | Sequence | Alignment | Game-theoretic acceleration of BOND. |
| KETCHUP | k-step return REINFORCE on KD | External teacher | Student | Sequence | General | RL-based KD with k-step Bellman returns. |
| 𝒳-KD | AVRIL inverse-RL | Joint reward + policy distillation | Student | Token + sequence | General | IRL-flavoured experiential KD. |
| DDT | On-policy SFT theory | Theoretical | Student | Token | General | Distribution Discriminant Theory; foundations for on-policy SFT. |
| RLAD | PPO/GRPO ratio anchored to teacher–old-policy mixture | External teacher (Qwen3-32B) | Student | Token | Reasoning | Trust-region likelihood-ratio. |
| KDRL | Joint reverse-KL + GRPO rule-based reward | External teacher (Skywork-OR1) | Student | Token + outcome | Reasoning | Unified KD + RL objective. |
| Self-Distilled RLVR (RLSD) | RLVR direction + teacher evidence-ratio modulates magnitude | Same model + privileged answer | Student | Token + outcome | Reasoning | Combines self-distillation magnitudes with RLVR directions. |
| HDPO | RL on most prompts; on “cliff” prompts generate privileged rollouts and self-distill | Same model w/ privilege | Student | Token | Reasoning | Privileged self-distillation as RL fallback. |
| Probing-to-Refine | “Explanatory probes” force logical articulation; GRPO + dialogue-structure reward | Self-probe | Student | Sequence | Reasoning | Reinforcement Distillation via Explanatory Inversion. |
📝 Strictness notes
- LUFFY — ⚠️ Mixed-policy: half on-policy student rollouts (C1+C2 ✓) + half off-policy R1 traces inserted into GRPO (C1 ✗ on the off-policy half). Net is OPD-flavor with off-policy import.
- NPO / AutoNPO — ⚠️ Same mixed-policy GRPO pattern as LUFFY, but the off-policy traces come from a near-future checkpoint of the same run instead of an external R1 teacher. Authors frame it as RLVR, not OPD; included here as an OPD variant because (a) the imported trajectories play the same “stronger-self teacher” role, and (b) the paper itself explicitly invites follow-up work to inject the near-future-self signal via on-policy distillation. Strict per-token logit KL (C2) is not the loss — supervision is verifier-filtered sequence-level trajectory mixing inside GRPO.
- BOND, Faster WIND — ⚠️ Iterative self-bootstrapping; teacher = same model’s BoN distribution. Loss is Jeffreys / win-rate-dominance at the sequence level — no per-token logit supervision (C2 partially fails strict form). More accurately “on-policy iterative alignment” than OPD.
- KETCHUP — ⚠️ Sequence-level RL-based KD with k-step Bellman returns; the paper describes itself as “RL-based KD”. Closer to RL with KD-anchor reward than per-token OPD.
- 𝒳-KD — ⚠️ Built on AVRIL inverse-RL framework with joint reward modeling; closer to IRL+OPD hybrid than pure OPD.
- DDT — ⚠️ Theoretical foundations paper for “on-policy SFT” (Distribution Discriminant Theory); not a specific deployable algorithm. Kept for completeness.
- KEPO, Open-AgentRL, Probing-to-Refine — ⚠️ C1 ✓ (on-policy student rollouts), but the per-token KL component vs. sequence-level reward shaping vs. preference optimization is not fully resolved from abstracts. Listed because the papers self-describe as OPD/on-policy distillation but exact form of C2 needs full-paper reading.
🧠 Reasoning OPD (by application)
Genuine OPD work on math / code / long-CoT reasoning. Off-policy SFT-distill from R1, pure RL methods (Skywork-OR1, SimpleRL-Zoo, Time-R1), and analysis-only papers are excluded from this list — each had no student-rollout-with-teacher-supervision component.
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| OPD | 2026.04 | Tsinghua THUNLP | arXiv 2604.13016 | Rethinking OPD recipe |
| G-OPD | 2026.02 | RUC / Tencent | arXiv 2602.12125 | G-OPD (cross-list) |
| OPD-AVMP | 2026.04 | Academic | arXiv 2604.07944 | OPD for Autonomous Vehicle Motion Planning |
The reasoning-OPD canon already lives across OPSD (siyan-zhao/OPSD, CRISP, SD-Zero), Iterative Self-Bootstrapping (rStar / rStar-Math), OPD-RL Hybrids (LUFFY, RLAD, KDRL, RLSD, HDPO), and White-Box (REOPOLD, Fast OPD, Entropy-Aware OPD, TIP, SCOPE, PACED). This section only lists items not already covered above.
📋 Technical details
| Method | Loss / Objective | Data | Teacher | Granularity | Base / Benchmark | Notes |
|---|---|---|---|---|---|---|
| Rethinking OPD (THUNLP) | RKL with progressive top-K alignment + off-policy cold-start | Mixed | White-box (Qwen3-4B/1.7B teacher pairs) | Token | Math reasoning | Identifies teacher-novelty and thinking-pattern compatibility as success conditions. |
| OPD for AV Motion Planning | GPT-Driver framework + GKD on student-generated trajectories | Student | White-box (LLM teacher) | Token | Driving | 5× model-size reduction. |
🖼️ Multimodal OPD (VLM, Video, Audio, Image)
Strict OPD work in non-text modalities. Many multimodal models branded “R1” or “GRPO” are pure RL (no teacher-distillation loss) and are excluded.
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| piFlow | 2025.10 | Multi-org | arXiv 2510.14974 | π-Flow — image / flow OPD (ICLR 2026) |
| Step-Audio-R1 | 2025.11 | StepFun | arXiv 2511.15848 | Step-Audio-R1 |
| VOLD | 2025.10 | INRIA / Goethe Univ. | arXiv 2510.23497 · project page | VOLD (LLM→VLM OPD) — repo placeholder; ICLR 2026 |
| Video-OPD | 2026.02 | Industrial | arXiv 2602.02994 | Video-OPD |
| X-OPD | 2026.03 | Tencent Hunyuan / ZJU | arXiv 2603.24596 | X-OPD (Speech LLM) |
📋 Technical details
| Method | Modality | Teacher | Loss | Data | Notes |
|---|---|---|---|---|---|
| π-Flow | Image generation (flow models) | Teacher velocity field | L2 imitation distillation | Student | Strict OPD for diffusion: student predicts policy at each timestep along its own trajectory. |
| Step-Audio-R1 | Audio reasoning | Self (modality-grounded) | Iterative self-distillation + SFT + PPO/RLVR | Student | Iterative on-policy cycles; only audio-relevant questions used in self-distill. |
| VOLD | LLM → VLM | Text-only LLM | GRPO + on-policy KL distillation | Student | Cold-start SFT alignment + unified RL+KD; ICLR 2026. The flagship VLM OPD recipe. |
| Video-OPD | MLLM | LLM teacher | Token-level KL on student rollouts | Student | Temporal video grounding via OPD. |
| X-OPD | Speech LLM | Text LLM | Cross-modal token-level KL | Student | Capability alignment in speech LLMs. |
🤖 Agent & Embodied OPD (by application)
Genuine OPD where the student is an agent rolling out actions; teacher (or self) supervises those trajectories. Pure-RL agent works (WebRL, WebAgent-R1, InfiGUI-G1, GUI-R1) and off-policy SFT-on-teacher-trajectories (Nardien, AgentRefine, Chain-of-Agents, MapCoder-Lite, SAD, Structured-Web) are excluded.
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| OpenClaw-RL | 2026.03 | Gen-Verse | arXiv 2603.10165 | OpenClaw-RL (cross-list with OPD-RL) |
| easydistill | 2025.09 | Alibaba ModelScope | SCoRe arXiv 2509.14257 | /projects/SCoRe |
| RPD | 2025.03 | TUM / Freiburg | arXiv 2503.05833 · project | Refined Policy Distillation, VLA (IROS 2026) |
| VLA-OPD | 2026.03 | HKUST (Guangzhou), IRPN Lab | arXiv 2603.26666 · project | VLA-OPD: bridging offline SFT & online RL for VLA via OPD (code coming soon) |
| LLM4Teach | 2023.11 (updated 2025) | ZJ Lab AMMI | arXiv 2311.13373 | LLM4Teach: small RL agent guided by an LLM |
📋 Technical details
| Method | Domain | Teacher Role | Loss | Notes |
|---|---|---|---|---|
| OpenClaw-RL | Terminal / GUI / SWE / Tool-call | Judge model + token-logprob gap | GRPO + OPD | Hindsight-hint extraction; combines binary RL and per-token OPD. |
| SCoRe | 12 agent benchmarks | Larger teacher (72B) corrects earliest error in student rollout | SFT-on-corrections + short-horizon RL | 7B student matches 72B teacher. |
| RPD | VLA / robot manipulation | Teacher VLA actions | PPO + behavioural cloning on student rollouts | Cleanest VLA-OPD recipe. |
| VLA-OPD | VLA / robot manipulation (LIBERO, RoboTwin2.0) | Expert VLA teacher, dense token-level supervision on student trajectories | Reverse-KL (avoids FKL entropy explosion + Hard-CE collapse) | Replaces sparse RL reward; preserves generalist priors and mitigates catastrophic forgetting. |
| LLM4Teach | Small RL agent | LLM teacher (action-level) | Distillation + RL annealed | Strict OPD for embodied; predates the wave. |
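The OpenClaw-RL row above pairs a sequence-level RL signal with a per-token OPD term. A minimal sketch of what such a combination could look like, assuming group-normalised (GRPO-style) advantages from binary rewards and an additive teacher-minus-student log-probability term. The function names, the `opd_coef` weight, and the additive form are illustrative assumptions, not taken from the OpenClaw-RL code:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (r - mean) / (std + eps) over one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]

def combine_with_opd(seq_advantage, teacher_logprobs, student_logprobs, opd_coef=1.0):
    """Broadcast a sequence-level advantage to tokens and add a per-token OPD term.

    The OPD term log p_teacher(y_t) - log p_student(y_t) is evaluated on tokens
    the student sampled itself. The additive combination is an assumption.
    """
    return [
        seq_advantage + opd_coef * (t_lp - s_lp)
        for t_lp, s_lp in zip(teacher_logprobs, student_logprobs)
    ]

# Four student rollouts for one prompt, scored with a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# Hypothetical log-probabilities of the first rollout's tokens under teacher and student.
teacher_lp = [-1.0, -2.0]
student_lp = [-1.5, -1.0]
token_advantages = combine_with_opd(advantages[0], teacher_lp, student_lp, opd_coef=0.5)
print(token_advantages)
```

The binary reward only says whether a whole rollout succeeded, while the OPD term raises or lowers individual tokens depending on whether the teacher finds them more or less likely than the student does.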
⚡ Speculative-Decoding Distillation
Distillation of the draft model so it better mimics the verifier/target. Here the on-policy element is the drafter’s own continuations, scored by the target model. Listed separately because the goal is inference speedup, not student capability.
This section only lists drafters trained with the drafter’s own rollouts. Off-policy drafter training (EAGLE-1/2, Medusa, Hydra, Kangaroo, ReDrafter, BiTA, SpecDec++, LayerSkip, FREE, AdaSPEC, POSS) and training-free system tricks (Ouroboros, Sequoia, TriForce, SwiftKV, SuffixDecoding) are excluded.
| Resource | Date | Org | Paper Link | Title / Notes |
|---|---|---|---|---|
| EAGLE | 2025.03 | PKU / Microsoft | EAGLE-3 | EAGLE-3: on-policy multi-step TTT |
| HASS | 2024.08 | Academic | arXiv 2408.15766 | HASS |
| OSD | 2023.10 | UCB / NVIDIA | arXiv 2310.07177 | Online Speculative Decoding |
| Falcon | 2024.12 | Bestpay | arXiv 2412.12639 | Falcon |
| SpecForge | 2026.03 | SGLang | LMSYS blog | SpecForge: open EAGLE-3 training framework |
| DistillSpec | 2023.10 | Google DeepMind | arXiv 2310.08461 | DistillSpec (ICLR 2024) |
| SpecKD | 2025.10 | XJTU (Haiduo Huang et al.) | arXiv 2510.24021 | SpecKD / SelecTKD (verification-gated KD; v1=SpecKD, v2 retitled SelecTKD) |
| ReSpec | 2025.10 | Academic | arXiv 2510.26475 | ReSpec (RL drafter evolution) |
| DVI | 2025.10 | Academic | arXiv 2510.05421 | DVI (Draft-Verify-Improve, online RL) |
| CORAL | 2025.02 | Academic | arXiv 2502.16880 | CORAL (Cross-Step Representation Alignment); ACL 2025 |
| MASSV | 2025.05 | Cerebras | arXiv 2505.10526 | MASSV (multimodal SD draft) |
📝 Strictness notes
- HASS, Falcon — ⚠️ Partial on-policy: multi-step draft trajectory / glancing distillation uses drafter samples for a subset of the training signal. Listed because the on-policy leg drives the gains.
📋 Technical details
| Method | Drafter type | On-/Off-policy | Loss | Notes |
|---|---|---|---|---|
| EAGLE-3 | Self-speculative (uses target features) | On-policy multi-step (TTT) | CE (token); drops the feature-regression loss used by EAGLE-1/2 | “Training-Time Test” simulates draft rollouts during training. |
| HASS | Self-speculative | Partial on-policy (multi-step draft trajectory in training) | Multi-step KD CE + feature alignment | Harmonized objective + harmonized context alignment. |
| Online Speculative Decoding (OSD) | Draft-model | On-policy / online | Online KD on rejected tokens | The canonical online/on-policy SD paper. |
| Falcon | Draft-model (semi-AR) | Partial on-policy (glancing uses draft samples) | Glancing CE + KD | Coupled Sequential Glancing Distillation. |
| SpecForge | Self-speculative (EAGLE-3 framework) | On-policy TTT supported | EAGLE-3 losses | Open-source EAGLE-3 training framework. |
| DistillSpec | Draft-model | On-policy (draft samples) | Choice of FKL/RKL/JSD/TVD | The seminal “OPD for SD” paper. |
| SpecKD | Distillation framework | On-policy with verification gating | Gated KL (accepted tokens only) | Inverts SD: uses accept/reject as KD-loss gate. |
| ReSpec | Draft-model | On-policy online (RL rollouts) | KD weighted by rollout reward | Drafter evolved during RL training. |
| DVI | Self-speculative | On-policy online (RL on verifier signal) | KL → reward-masked CE + PG | Continual online training. |
| CORAL | Self-speculative | On-policy multi-step | Cross-step alignment + CE | Fixes draft training/inference mismatch. |
| MASSV | Multimodal draft-model | On-policy (drafter samples) | KD CE | Multimodal speculative-decoding drafter. |
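The loss column above names several divergences (FKL, RKL, JSD, TVD) and a gated KL restricted to accepted tokens. The sketch below computes them for a single next-token distribution and checks the standard speculative-sampling identity that the expected acceptance rate equals 1 − TVD. All function names are illustrative; `gated_forward_kl` is only a literal reading of “accepted tokens only” with a caller-supplied mask, and the gate used in the SpecKD paper may differ:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tvd(p, q):
    """Total variation distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def acceptance_rate(target, draft):
    """Expected per-token acceptance under standard speculative sampling."""
    return sum(min(pi, qi) for pi, qi in zip(target, draft))

def gated_forward_kl(target_dists, draft_dists, accepted):
    """Mean KL(target || draft) over positions whose draft token was accepted."""
    kept = [kl(p, q) for p, q, ok in zip(target_dists, draft_dists, accepted) if ok]
    return sum(kept) / len(kept) if kept else 0.0

target = [0.5, 0.3, 0.2]  # hypothetical target (verifier) distribution
draft = [0.4, 0.4, 0.2]   # hypothetical drafter distribution

print("FKL:", kl(target, draft))
print("RKL:", kl(draft, target))
print("JSD:", jsd(target, draft))
print("TVD:", tvd(target, draft))
print("acceptance:", acceptance_rate(target, draft))
```

Because acceptance is 1 − TVD, any of these divergences that pulls the drafter toward the target also tends to raise the fraction of drafted tokens the target accepts, which is the quantity that determines the speedup.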
🛠️ Frameworks & Toolkits
Open-source frameworks / libraries that support OPD (with student-generated rollouts during distillation training).
| Resource | Date | Org | OPD Code Path | Title / Notes |
|---|---|---|---|---|
| verl | 2024.10 | ByteDance Seed | recipe/on_policy_distill/; Async OPD doc | verl |
| trl | 2019.11 | Hugging Face | trl/experimental/{gkd,gold,minillm,sdft,self_distillation,sdpo,nash_md,xpo,online_dpo}/ | TRL: the most diverse OPD trainer collection |
| LLaMA-Factory | 2023.05 | hiyouga | n/a | LLaMA-Factory: OPD only via TRL integration; not native |
| ms-swift | 2024 | Alibaba ModelScope | examples/train/rlhf/gkd/, multimodal/megatron variants | ms-swift: wraps TRL GKDTrainer |
| rllm | 2025.01 | UC Berkeley Sky | examples/math_distill/ (incl. opsd/ self-distill); rllm/trainer/distill/ | rllm |
| ROLL | 2025.06 | Alibaba | roll/pipeline/distill/ | ROLL: VLM support and a various-divergence library |
| AReaL | 2025.06 | AntGroup / Tsinghua | examples/distillation/gsm8k_grpo_distill.yaml | AReaL |
| RL | 2026.01 | NVIDIA | nemo_rl/algorithms/distillation.py | NeMo-RL: native OPD with student rollouts |
| SkyRL | 2025.04 | UC Berkeley NovaSky | skyrl-train/examples/on_policy_distillation/; blog | SkyRL |
| slime | 2025.06 | Tsinghua THUDM | examples/on_policy_distillation/ | slime: RL framework behind GLM-4.5/4.6/4.7 |
| KDFlow | 2026.03 | BJTU (Songming Zhang et al.) | examples/on_policy_kd/ (LLM + Qwen3-VL); arXiv 2603.01875 | KDFlow: KD-first framework; SGLang teacher + FSDP2 student decoupled; cross-tokenizer & VLM native |
📋 Technical details
| Framework | KL Direction(s) | OPD Primary? | Backbone | Multi-GPU | Notes |
|---|---|---|---|---|---|
| verl | Forward KL with sparse top-k teacher logits | One of many | PyTorch | Yes (FSDP, Megatron, Ray) | recipe/on_policy_distill/ — the most production-ready OPD recipe; integrates with vLLM. |
| TRL | FKL, RKL, GJSD (β); GOLD trainer; SDFT trainer; MiniLLM trainer | One of many; most diverse OPD collection | PyTorch | Yes (Accelerate, DeepSpeed) | trl/experimental/ contains gkd, gold, minillm, sdft, self_distillation, sdpo, nash_md, xpo, online_dpo, papo, prm. The single broadest OPD trainer set. |
| LLaMA-Factory | Via TRL integration | No native | PyTorch | Yes | Most-starred fine-tuning framework. |
| ms-swift | Same as TRL GKD | One of many | PyTorch | Yes (DeepSpeed, Megatron) | Wraps TRL GKDTrainer; multimodal variants. |
| rllm (Berkeley) | Reverse KL (advantage = log P_teacher − log P_student) | Primary in math_distill example | PyTorch | Single (tinker) + Multi-GPU (verl) | Self-distill subdir opsd/. |
| ROLL | Multiple divergences (various_divergence.py) | First-class DistillPipeline | PyTorch | Yes (Megatron) | VLM support. |
| AReaL | KL-controlled (off-policy default; integrates into GRPO) | One of many | PyTorch | Yes (async distributed) | distill_loss_weight. |
| NeMo-RL | FKL / RKL / mixed (configurable kl_type) | OPD documented | PyTorch | Yes (Ray + Megatron + vLLM) | Replaces archived NeMo-Aligner. |
| SkyRL | Reverse KL + importance sampling | OPD added Nov 2025 (PR #585) | PyTorch | Yes (Ray + vLLM/SGLang) | Notion blog “On-Policy Distillation in SkyRL”. |
| slime | Reverse KL token-level | OPD as additive penalty on any advantage estimator | PyTorch + Megatron | Yes (SGLang teacher mode) | Behind GLM-4.5/4.6/4.7. |
| KDFlow | FKL / RKL / JSD / AKL + Skewed-KL/RKL variants | Yes — KD-first | PyTorch | Yes (Ray + SGLang teacher + FSDP2 student) | Decoupled backends; transmits teacher hidden states (zero-copy) and recomputes logits on student to cut comm cost; 1.44–6.36× speedup over homogeneous-backend baselines. Native cross-tokenizer; VLM support (Qwen3-VL). Colocate mode shares GPUs via SGLang sleep/wakeup. |
Excluded (no native OPD support, or distillation pipeline is offline / fixed-corpus rather than student-rollout): axolotl, OpenRLHF, allenai/open-instruct, prime-rl, TextBrewer (pre-LLM era), open-r1 (off-policy SFT + GRPO), Modelopt, Tunix v0.1.6, DistillKit, easydistill.
📝 Strictness notes — frameworks judged by whether they ship a recipe that satisfies C1+C2
- LLaMA-Factory — ⚠️ OPD only available via TRL integration; no native OPD trainer. Listed for users who already use LLaMA-Factory and want to know it can host OPD.
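Two of the objectives named in the framework table can be written out in a few lines. The sketch below shows the sampled-token reverse-KL advantage (log P_teacher − log P_student, as in the rllm row) and a forward KL computed from sparse top-k teacher log-probabilities (as in the verl row). It is a plain-Python illustration, not code from either framework; the function names and the choice to renormalise the teacher over its top-k entries are assumptions:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def reverse_kl_advantages(student_logits, teacher_logits, sampled_tokens):
    """Per-token advantage log p_teacher(y_t) - log p_student(y_t).

    sampled_tokens must come from the student's own rollout (condition C1);
    the teacher only scores them (condition C2).
    """
    advantages = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        advantages.append(log_softmax(t_row)[tok] - log_softmax(s_row)[tok])
    return advantages

def topk_forward_kl(student_logits, teacher_topk_ids, teacher_topk_logprobs):
    """Forward KL(teacher || student) restricted to the teacher's top-k tokens.

    Assumption: the teacher's top-k log-probabilities are renormalised to sum
    to one, while the student keeps its full-vocabulary normalisation.
    """
    s_lp = log_softmax(student_logits)
    t_lp = log_softmax(teacher_topk_logprobs)  # renormalise over the k entries
    return sum(math.exp(tl) * (tl - s_lp[i]) for i, tl in zip(teacher_topk_ids, t_lp))

# One position, vocabulary of three tokens; the student sampled token 0.
student = [[2.0, 0.0, 0.0]]
teacher = [[0.0, 0.0, 0.0]]
print(reverse_kl_advantages(student, teacher, [0]))

# Teacher ships only its two most likely tokens (ids 0 and 1).
print(topk_forward_kl([1.0, 0.5, -1.0], [0, 1], [-0.5, -1.0]))
```

In the first call the student is more confident in token 0 than the teacher, so the advantage is negative and a policy-gradient update would lower that token's probability. Shipping only top-k teacher entries keeps the teacher-to-student payload small at the cost of ignoring the tail of the teacher distribution.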
🏭 Industrial / Production Model Reports
Flagship model technical reports that publicly describe on-policy distillation in their post-training pipeline. Reports whose tech papers don’t actually describe student-rollout distillation (Qwen2.5, Qwen2.5-Math, MiMo predecessor, DeepSeek-V3 / V3.2-Exp / R1, Phi-4, Hunyuan-Large / A13B, Kimi-K2 / K2.5, Yi-Lightning, DistilQwen) are excluded.
| Resource | Date | Org | Paper | Title / Notes |
|---|---|---|---|---|
| Qwen3 | 2025.05 | Alibaba Qwen | arXiv 2505.09388 | Qwen3 (canonical OPD recipe) |
| Qwen3-Coder | 2026.03 | Alibaba Qwen | Tech report | Qwen3-Coder |
| gemma | 2024.07 | Google DeepMind | arXiv 2408.00118 | Gemma 2 (explicit OPD) |
| GLM-4.5 | 2025.08 | Zhipu / Z.ai | arXiv 2508.06471 | GLM-4.5 / 4.6 |
| GLM-5 | 2026.02 | Zhipu / Z.ai | arXiv 2602.15763 | GLM-5 (cross-stage OPD) |
| MiMo-V2-Flash | 2026.01 | Xiaomi | arXiv 2601.02780 | MiMo-V2-Flash (MOPD) |
| Nemotron Cascade 2 | 2026.03 | NVIDIA | arXiv 2603.19220 · HF Collection · project | Nemotron Cascade 2 (multi-domain OPD; “we sample y∼π_inf(·\|x)”); HF-only release |
| DeepSeek-V4 | 2026.04 | DeepSeek-AI | Tech Report · V4-Pro · V4-Flash | DeepSeek-V4 (multi-teacher OPD replaces unified mixed-RL stage) |
📋 Technical details
| Model | Stage(s) using OPD | Mechanism | Notes |
|---|---|---|---|
| Qwen3 | Strong-to-Weak Distillation | Two-phase: (1) off-policy SFT cold-start with /think and /no_think teacher samples; (2) on-policy phase — student generates, teacher provides logit-KL targets | Reports ~10× cheaper than RL for equal performance. The canonical industrial OPD recipe. Inspired the Thinking Machines blog. |
| Qwen3-Coder-Next | Distillation of multi-experts into 80A3 student | Combined SFT + on-policy logit alignment | Production scaling of Qwen3 recipe. |
| Gemma 2 | Post-training | “We also use on-policy distillation, where the student generates completions from the SFT prompts” — KL on student samples | Among the first production models to name OPD. |
| GLM-5 | Throughout post-training | “On-Policy Cross-Stage Distillation” — a final anti-forgetting refinement applied between stages | Generalises Qwen3 recipe to “OPD as a stage glue”. |
| GLM-4.5 / 4.6 | Multi-stage post-training | Expert iteration; SFT distillation merges experts into hybrid generalist | Predecessors of GLM-5. |
| MiMo-V2-Flash | Post-training | Multi-Teacher On-Policy Distillation (MOPD) — “the student model samples from its own evolving distribution and receives token-level supervision from domain-specific teachers” | Multi-teacher OPD; per-token MOPD advantage formula. |
| Nemotron Cascade 2 | Between Cascade RL stages | Multi-Domain On-Policy Distillation (MOPD): “we sample y∼π_inf(·\|x)”; teacher provides token-level distillation advantage | |
| DeepSeek-V4 | Post-training (replaces unified mixed-RL stage) | Multi-teacher OPD: domain specialists trained independently (SFT + GRPO per domain — math, code, agent, IF), then a unified student optimises reverse-KL against the specialist set on its own rollouts | Full-vocabulary KL (not token-level estimate) stabilises gradients when specialists disagree; first DeepSeek release where OPD replaces the RL consolidation stage from V3 / R1. V4-Pro 1.6T MoE; V4-Flash 284B. |
📝 Strictness notes
- GLM-4.5 / 4.6 — ⚠️ Tech report describes “expert iteration + RL” without explicit OPD wording. Kept as predecessor of GLM-5 which does have explicit cross-stage OPD.
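The DeepSeek-V4 row contrasts a full-vocabulary KL with a token-level estimate. The sketch below shows the difference on a single position: the full-vocabulary reverse KL sums over every token, while the sampled-token estimate looks only at the token the student drew, so it matches the full value in expectation but can even be negative for an individual sample. The names and the weighted-sum aggregation over teachers are assumptions for illustration; the report's actual aggregation is not reproduced here:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), summed over the whole vocabulary."""
    s_lp = log_softmax(student_logits)
    t_lp = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp))

def sampled_token_estimate(student_logits, teacher_logits, token):
    """Single-sample estimate of the same KL, using one student-sampled token."""
    s_lp = log_softmax(student_logits)
    t_lp = log_softmax(teacher_logits)
    return s_lp[token] - t_lp[token]

def multi_teacher_reverse_kl(student_logits, teacher_logits_by_domain, weights):
    """Weighted sum of per-teacher reverse KLs (the weighting scheme is hypothetical)."""
    return sum(
        weights[domain] * full_vocab_reverse_kl(student_logits, t_logits)
        for domain, t_logits in teacher_logits_by_domain.items()
    )

student = [2.0, 0.0, 0.0]
teachers = {"math": [0.0, 0.0, 0.0], "code": [2.0, 0.0, 0.0]}  # toy specialists
weights = {"math": 0.5, "code": 0.5}

print("full-vocabulary KL vs math teacher:", full_vocab_reverse_kl(student, teachers["math"]))
for token in range(3):
    print("estimate if student samples token", token, ":",
          sampled_token_estimate(student, teachers["math"], token))
print("multi-teacher objective:", multi_teacher_reverse_kl(student, teachers, weights))
```

The per-token estimates vary in sign and size across samples, whereas the full-vocabulary value is a single non-negative number per position, which is one way to read the report's claim that it stabilises gradients when specialists disagree.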
🌟 Curator’s Picks — where to start
Opinionated reading order for someone starting an OPD project today.
| # | Why it’s the pick | Resource |
|---|---|---|
| 1 | Clearest one-page explanation of why OPD beats both SFT and RL on token efficiency. | Thinking Machines Lab blog (Oct 2025) |
| 2 | The production recipe everyone is now copying. Read §4. | Qwen3 Technical Report |
| 3 | Reproducible OPD in <200 lines on a real training stack. | tinker-cookbook recipes/distillation |
| 4 | “Theory of OPD” — when it works, when it fails. | THUNLP Rethinking OPD (2604.13016) |
| 5 | The paper that named OPSD and established the privileged-context pattern. | Self-Distilled Reasoner (2601.18734) |
| 6 | Crystallises OPD as a special case of KL-constrained RL with reward extrapolation. | G-OPD (2602.12125) |
| 7 | Catalogue of 50+ methods — read as an index, not a taxonomy. | Tencent OPD Survey (2604.00626) |
| 8 | Most diverse open-source OPD trainer collection. | TRL experimental/ |
| 9 | Black-box OPD seed paper — adversarial discriminator as on-policy reward. | GAD (2511.10643) |
| 10 | Empirical failure-modes paper — saves a week of debugging. | Revisiting OPD (2603.25562) |
🤝 Contributing
PRs are very welcome. When adding an entry, please attempt to fill the technical-details columns (loss / divergence, data source, teacher access, granularity). If you cannot determine these by reading the paper or repo, leave a ? — that’s still useful.