@QingQ77: Collecting open-source code and papers on On-Policy Distillation and Self-Distillation for training LLMs/VLMs/Agents, tagged by four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://g…

X AI KOLs Timeline 05/09/26, 03:01 PM Tools

on-policy-distillation self-distillation llm-training awesome-list open-source

Summary

Introducing AwesomeOPD, a curated list of open-source code and papers related to On-Policy Distillation (OPD) and Self-Distillation used in the training of LLMs, VLMs, and Agents. Resources in this list are meticulously categorized and tagged based on teacher source, supervision signal, rollout usage, and training stage.

Collecting open-source code and papers on On-Policy Distillation and Self-Distillation for training LLMs/VLMs/Agents, tagged by four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://github.com/thinkwee/AwesomeOPD… AwesomeOPD is an Awesome List focused on On-Policy Distillation, divided into more than a dozen categories including Surveys, White-Box/Black-Box OPD, OPSD, Iterative Self-Bootstrapping, OPD-RL Hybrids, Reasoning, Multimodal, Agent/Embodied AI, Speculative Decoding, Frameworks, and Industrial Reports. Each entry is annotated along four axes: teacher source, supervision signal, rollout consumption method, and pipeline position. Strictness notes distinguish methods that fully satisfy criteria C1+C2 from those that satisfy them only partially.



Cached at: 05/10/26, 04:22 AM

# thinkwee/AwesomeOPD

Source: https://github.com/thinkwee/AwesomeOPD

Surveys · White-Box · Black-Box · OPSD · Iterative · OPD-RL · Reasoning · Multimodal · Agent · SpecDec · Frameworks · Industrial

# When LLMs Distill On-Policy

AwesomeOPD is an awesome list summarising open-source repositories and papers for training LLMs (and VLMs / agents / draft models) with On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD):

- 🎯 OPD = C1 + C2. C1: the student samples its own trajectories y ~ π_student(·|x) during training. C2: the teacher provides per-token / sequence supervision on those student samples. Methods that only partially satisfy these criteria are flagged in the 📝 Strictness notes of each section.
- 🪞 OPSD = the special case where the teacher is the same model, conditioned on privileged context (verified trace / answer / “be concise” prefix / longer context) or an earlier checkpoint.
- 🚀 Each entry is annotated along four design axes: teacher source (external · same model with privileged context · earlier checkpoint · multi-teacher · discriminator), supervision signal (logits / top-k / sequence reward / verbal score / discriminator / verifier / feature), rollout consumption (all / selected / truncated / replaced / as PG samples), and pipeline slot (cold-start / mid / RL-replacement / inside-RL / inter-stage / compression / continual-anchor).
- ⚠️ Built by reading paper PDFs, project pages, and source code with LLM coding agents; manually reviewed, but errors are possible. PRs welcome.
- 📅 Last updated: 2026-04-30

Taxonomy:

- 📚 Surveys, Foundations & Position Papers: meta-references and seed papers (GKD, MiniLLM, Thinking Machines blog, Tencent / THUNLP surveys)
- 🔬 White-Box: logit-based OPD on student rollouts with an external teacher
- 🎭 Black-Box: discriminator / verbal / preference, no teacher logits
- ♻️ OPSD: privileged-context self-distillation (same model, different conditioning)
- 🔁 Iterative Self-Bootstrapping: same model as previous-checkpoint teacher
- 🤝 OPD-RL Hybrids: inside-RL OPD (KL-as-reward, RL+OPD fusion)
- 🧠 Reasoning / 🖼️ Multimodal / 🤖 Agent & Embodied: by application; cuts across all teacher-source categories
- ⚡ Speculative-Decoding Distillation: drafter distillation; the “student” is a draft model
- 🛠️ Frameworks & Toolkits: what to actually run
- 🏭 Industrial / Production Reports: what the labs ship

Shorthand: FKL = forward KL · RKL = reverse KL · JSD = Jensen–Shannon · Skew-KL / AKL = skewed / adaptive KL · 📄 paper-only = no public code yet.
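
To make the shorthand concrete, here is a minimal sketch of the three basic divergences computed for a single token position. It assumes the teacher and student next-token distributions are plain probability lists over the same vocabulary; the function names (`kl`, `forward_kl`, `reverse_kl`, `jsd`) and the toy numbers are illustrative and not taken from any listed repository. The JSD shown is the standard symmetric form, not the generalised variant used by GKD.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    """FKL = KL(teacher || student): penalises the student for missing teacher mass."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """RKL = KL(student || teacher): penalises student mass where the teacher has little."""
    return kl(student, teacher)


def jsd(teacher, student):
    """Standard Jensen-Shannon divergence: average KL of each side to the 50/50 mixture."""
    mix = [(t + s) / 2 for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)


if __name__ == "__main__":
    # Toy next-token distributions over a 3-token vocabulary (made-up values).
    teacher = [0.7, 0.2, 0.1]
    student = [0.5, 0.3, 0.2]
    print("FKL:", forward_kl(teacher, student))
    print("RKL:", reverse_kl(teacher, student))
    print("JSD:", jsd(teacher, student))
```

In an OPD setting these quantities would be evaluated at positions of student-sampled sequences (C1) using teacher probabilities at those same positions (C2); the sketch only covers the per-position arithmetic.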
## Updates

- 📢 2026-04-28: initial release

---

## 📚 Surveys, Foundations & Position Papers

| Resource | 🌟 Stars | Date | Org | Paper / Link | Title / Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| Blog (https://thinkingmachines.ai/blog/on-policy-distillation/) | Blog (https://thinkingmachines.ai/blog/on-policy-distillation/) | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | Blog (https://thinkingmachines.ai/blog/on-policy-distillation/) · tinker-cookbook (https://github.com/thinking-machines-lab/tinker-cookbook) | Thinking Machines Lab: On-Policy Distillation (blog) |
| tinker-cookbook (https://github.com/thinking-machines-lab/tinker-cookbook) | | 2025.10 | Thinking Machines Lab | - | Reference impl. of the OPD recipe on the Tinker SDK |
| Tencent OPD Survey (https://arxiv.org/abs/2604.00626) | Paper (https://arxiv.org/abs/2604.00626) | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | arXiv 2604.00626 (https://arxiv.org/abs/2604.00626) | A Survey of On-Policy Distillation for LLMs |
| OPD (https://github.com/thunlp/OPD) | | 2026.04 | Tsinghua THUNLP | arXiv 2604.13016 (https://arxiv.org/abs/2604.13016) | Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe |
| revisiting_opd (https://github.com/hhh675597/revisiting_opd) | | 2026.03 | CASIA (Fu et al.) | arXiv 2603.25562 (https://arxiv.org/abs/2603.25562) | Revisiting OPD: Failure Modes & Simple Fixes |
| Lightning OPD (https://arxiv.org/abs/2604.13010) | Paper (https://arxiv.org/abs/2604.13010) | 2026.04 | Wu, Han, Cai | arXiv 2604.13010 (https://arxiv.org/abs/2604.13010) | Lightning OPD: Efficient Post-Training with Offline OPD |
| GKD (https://arxiv.org/abs/2306.13649) | Paper (https://arxiv.org/abs/2306.13649) | 2023.06 | Google DeepMind (Agarwal et al.) | arXiv 2306.13649 (https://arxiv.org/abs/2306.13649), implemented in TRL GKDTrainer (https://github.com/huggingface/trl/blob/main/trl/experimental/gkd/gkd_trainer.py) | GKD: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Seminal · ICLR 2024) |

📋 Technical details

| Resource | Loss / Divergence | Data | Teacher Access | Granularity | Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| Thinking Machines blog | Reverse KL (student‖teacher) | Student rollouts | White-box | Token | “Swap KL ref model for stronger teacher” recipe; one-line addition to RL trainer. Replicates Qwen3 result at ~1/10 RL cost. |
| Tencent OPD Survey | (survey) | (survey) | (survey) | (survey) | Catalogues 50+ methods; useful as a reference index. |
| THUNLP Rethinking OPD | Reverse KL with progressive top-K alignment | Student | White-box | Token | Identifies two success conditions: compatible thinking patterns + genuinely new teacher capability. Recipe = off-policy cold-start + teacher-aligned prompt selection. |
| Revisiting OPD | Truncated reverse KL + top-p sampling + special-token masking | Student | White-box | Token (filtered) | Diagnoses 3 failure modes: imbalanced one-token signal, unreliable prefix guidance, tokenizer mismatch. |
| Lightning OPD | Cached teacher log-probs over SFT rollouts (offline OPD) | Student (cached) | White-box | Token | Introduces “teacher consistency”: the same teacher must be used for SFT and OPD, otherwise the gradient is biased. Eliminates the live teacher server. |
| GKD (Agarwal) | Generalised JSD (FKL/RKL configurable) | Mixed (λ interpolates teacher↔student) | White-box | Token | The seminal paper that named OPD; introduced student-self-rollout supervision. |

📝 Strictness notes (against the strict OPD definition, C1: student samples its own trajectories during training + C2: teacher provides supervision on those samples)

- Lightning OPD: ⚠️ partially satisfies C1. Teacher log-probs are pre-computed once over SFT rollouts and reused during training; the student doesn’t actively sample during the OPD step. The authors explicitly call this “offline OPD”. Listed under OPD because the data consists of past student-generated rollouts, not teacher-generated ones.

---

## 🔬 OPD with Larger External Teachers: White-Box

White-box methods use teacher logits / log-probabilities to supervise the student on student-generated rollouts. Each entry below has been verified to (a) train on student rollouts and (b) operate at the token level. Methods that turned out to be RL-style on verification have been moved to OPD-RL Hybrids; off-policy / pure-loss-function / pretraining-side methods are excluded from this list.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| LMOps/minillm (https://github.com/microsoft/LMOps/tree/main/minillm) | | 2023.06 | Microsoft / Tsinghua | arXiv 2306.08543 (https://arxiv.org/abs/2306.08543) | MiniLLM (ICLR 2024) |
| distillm (https://github.com/jongwooko/distillm) | | 2024.02 | KAIST / Microsoft | arXiv 2402.03898 (https://arxiv.org/abs/2402.03898) | DistiLLM (ICML 2024) |
| distillm-2 (https://github.com/jongwooko/distillm-2) | | 2025.03 | KAIST / Microsoft | arXiv 2503.07067 (https://arxiv.org/abs/2503.07067) | DistiLLM-2 (ICML 2025 Oral) |
| DSKDv2 (https://github.com/songmzhang/DSKDv2) | | 2025.04 | BJTU | arXiv 2504.11426 (https://arxiv.org/abs/2504.11426) | DSKDv2: cross-tokenizer; supports on-policy mode |
| G-OPD (https://github.com/RUCBM/G-OPD) | | 2026.02 | RUC / Tencent | arXiv 2602.12125 (https://arxiv.org/abs/2602.12125) | G-OPD |
| google-research/speculative_kd (https://github.com/google-research/google-research/tree/master/speculative_kd) | | 2024.10 | UCSB / Google | arXiv 2410.11325 (https://arxiv.org/abs/2410.11325) | Speculative KD (ICLR 2025) |
| AdaSwitch (https://arxiv.org/abs/2510.07842) | Paper (https://arxiv.org/abs/2510.07842) | 2025.10 | RUC / Baidu | arXiv 2510.07842 (https://arxiv.org/abs/2510.07842) | AdaSwitch (on-/off-policy switching) |
| Constrained OPD (https://arxiv.org/abs/2509.22921) | Paper (https://arxiv.org/abs/2509.22921) | 2025.09 | Huawei Noah’s Ark | arXiv 2509.22921 (https://arxiv.org/abs/2509.22921) | Constrained OPD (CMDP) |
| REOPOLD (https://arxiv.org/abs/2603.11137) | Paper (https://arxiv.org/abs/2603.11137) | 2026.03 | KAIST / Microsoft | arXiv 2603.11137 (https://arxiv.org/abs/2603.11137) | REOPOLD (Relaxed OPD); code soon |
| OPSD_OnPolicyDistillation (https://github.com/HJSang/OPSD_OnPolicyDistillation) | | 2026.03 | LinkedIn | arXiv 2603.11178 (https://arxiv.org/abs/2603.11178) | PACED: frontier curriculum self-distill |
| Fast OPD (https://arxiv.org/abs/2602.15260) | Paper (https://arxiv.org/abs/2602.15260) | 2026.02 | Industrial | arXiv 2602.15260 (https://arxiv.org/abs/2602.15260) | Fast OPD (prefix-truncated) |
| Entropy-Aware OPD (https://arxiv.org/abs/2603.07079) | Paper (https://arxiv.org/abs/2603.07079) | 2026.03 | KAIST / IBM | arXiv 2603.07079 (https://arxiv.org/abs/2603.07079) | Entropy-Aware OPD |
| Veto (https://arxiv.org/abs/2601.07155) | Paper (https://arxiv.org/abs/2601.07155) | 2026.01 | SNU | arXiv 2601.07155 (https://arxiv.org/abs/2601.07155) | Veto (Stable OPD); ACL 2026 Findings |
| OPSD_OnPolicyDistillation (https://github.com/HJSang/OPSD_OnPolicyDistillation) | | 2026.04 | Meta / LinkedIn | arXiv 2604.14084 (https://arxiv.org/abs/2604.14084) | TIP: Token Importance; shares the LinkedIn OPSD repo with PACED |
| SCOPE (https://github.com/machine981/SCOPE) | | 2026.04 | USTC / Meituan / Fudan | arXiv 2604.10688 (https://arxiv.org/abs/2604.10688) | SCOPE: signal-calibrated dual-path |
| TSD-KD (https://github.com/kmswin1/TSD-KD) | | 2026.03 | Korea Univ. | arXiv 2603.13260 (https://arxiv.org/abs/2603.13260) | TSD-KD: token-selective dual KD (ICLR 2026) |
| Hybrid-Policy-Distillation (https://github.com/zwhong714/Hybrid-Policy-Distillation) | | 2026.04 | zwhong714 | arXiv 2604.20244 (https://arxiv.org/abs/2604.20244) | HPD: Hybrid Policy Distillation; LlamaFactory + verl backends |

📋 Technical details

| Method | Loss / Divergence | Data | Granularity | Domain | Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| MiniLLM | Reverse KL via policy gradient | Student | Sequence (PG) | General | The seminal “OPD” recipe by Yuxian Gu et al.; predates GKD by days. Mode-seeking. |
| DistiLLM | Skewed-KL (mix of FKL/RKL) | Mixed (adaptive off→on, with student samples) | Token | General | Skew parameter α interpolates between FKL and RKL; importance-reweighted student samples. |
| DistiLLM-2 | Contrastive: Skew-FKL on teacher data + Skew-RKL on student data | Mixed | Token | General | Asymmetric losses on each data source; ICML 2025 oral. |
| DSKDv2 | KL in dual aligned space; explicit on-policy mode | Student | Token | Cross-tokenizer | Cross-vocabulary distillation; supports both on/off-policy. |
| G-OPD / ExOPD | Reverse KL + scaled reward extrapolation | Student | Token | General | Generalises OPD as KL-constrained RL; allows reward scale > 1 to “exceed” the teacher. |
| Speculative KD (Xu) | Interleaved propose-and-correct (gated KL) | Student-proposed, teacher-corrected | Token | General | Bridges teacher-student gap via interleaved sampling. |
| AdaSwitch | Adaptive on/off-policy switching | Mixed | Token | General | Switches between teacher data and student rollouts based on a divergence threshold. |
| Constrained OPD | KL-constrained CMDP | Student | Token | General | Hard KL constraint instead of soft penalty. Borderline OPD-RL. |
| REOPOLD | Mixture-based reward clipping + entropy-based dynamic sampling | Student | Token | Reasoning | “Relaxed OPD”; views OPD as policy optimisation with teacher-student log-ratio reward. |
| PACED | Frontier curriculum at student competence boundary | Student | Token | General | Self-distill style (privileged-context / earlier-checkpoint); difficulty weighting w(p)=p(1−p). |
| Fast OPD | Prefix-truncated distillation reducing FLOPs | Student | Token (truncated) | Reasoning | 2× to 47× speedup via reasoning-prefix truncation. |
| Entropy-Aware OPD | Switch between FKL and RKL based on teacher entropy | Student | Token | Reasoning | When teacher entropy is high → FKL; when low → RKL. |
| Veto | Logit-space geometric bridge with adaptive gradient veto | Student | Token | General | Adaptive Target Reformulation. |
| TIP | Top-50% high-entropy student tokens carry the OPD signal | Student (selected) | Token (filtered) | Reasoning | ~47% memory savings; only high-entropy student tokens are trained. |
| SCOPE | Teacher-PPL-weighted KL on incorrect rollouts; student-PPL-weighted MLE on correct | Student | Token | Reasoning | Signal-Calibrated OPD with Dual-Path Adaptive Weighting; verifier-routing. |
| TSD-KD | Indirect (student-propose / teacher re-rank) + direct selective logit KD | Mixed | Token (selected) | General | Hybrid; partial OPD + partial preference. |
| HPD | Reweighted log-likelihood unifying FKL + RKL | Mixed (off-policy + lightweight approximate on-policy sampling) | Token | General | Unifies KD as token-level reweighted likelihood; lightweight on-policy sampling preserves training efficiency. |

---

## 🎭 OPD with Black-Box / Outcome-Based Teachers

When the teacher is API-only (no logits), OPD uses scalar rewards, verbal scores, preferences, or adversarial discriminators, all evaluated on student rollouts. Entries that turned out to use static teacher data only (Lion, SuperCorrect, DAIL, SODA) are excluded from this list.
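
As a rough illustration of how a sequence-level black-box signal can be consumed, the sketch below turns teacher scores on a group of student rollouts for one prompt into centred advantages and a REINFORCE-style surrogate loss. This is an assumption made for illustration, not the procedure of any method listed here; `score_advantages`, `surrogate_loss`, and the group normalisation are hypothetical choices, and the scores stand in for whatever scalar the teacher returns (for example, a 0–9 verbal score).

```python
from statistics import mean, pstdev


def score_advantages(scores, eps=1e-6):
    """Centre and scale teacher scores for student rollouts of a single prompt.

    Rollouts scored above the group mean get a positive advantage and those
    below get a negative one. If all scores are equal, every advantage is 0.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def surrogate_loss(seq_logprobs, advantages):
    """REINFORCE-style surrogate over sequence log-probabilities.

    Minimising this value raises the log-probability of rollouts with
    positive advantage and lowers it for rollouts with negative advantage.
    """
    return -mean([a * lp for a, lp in zip(advantages, seq_logprobs)])


if __name__ == "__main__":
    # Hypothetical teacher scores for three student rollouts of one prompt.
    teacher_scores = [9, 5, 1]
    # Hypothetical summed log-probabilities of those rollouts under the student.
    student_logprobs = [-12.0, -15.5, -11.0]
    adv = score_advantages(teacher_scores)
    print("advantages:", adv)
    print("surrogate loss:", surrogate_loss(student_logprobs, adv))
```

In a real trainer the log-probabilities would be differentiable tensors from the student model; plain floats are used here only to keep the arithmetic visible.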
| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| LMOps/gad (https://github.com/microsoft/LMOps) | | 2025.11 | Microsoft Research | arXiv 2511.10643 (https://arxiv.org/abs/2511.10643) · project (https://ytianzhu.github.io/Generative-Adversarial-Distillation/) | GAD: Black-Box OPD |
| OVD (https://arxiv.org/abs/2601.21968) | Paper (https://arxiv.org/abs/2601.21968) | 2026.01 | HKU / Huawei | arXiv 2601.21968 (https://arxiv.org/abs/2601.21968) | OVD (On-policy Verbal Distillation); project page OVD.github.io returns 404 |
| ORPO-Distill (https://arxiv.org/abs/2509.25100) | Paper (https://arxiv.org/abs/2509.25100) | 2025.09 | Industrial | arXiv 2509.25100 (https://arxiv.org/abs/2509.25100) | ORPO-Distill |

📋 Technical details

| Method | Feedback Signal | Data | Granularity | Domain | Notes |
| :---: | :---: | :---: | :---: | :---: | :--- |
| GAD (Generative Adversarial Distillation) | Discriminator (on-policy reward model) | Student | Sequence | General | A trained discriminator distinguishes student outputs from teacher (e.g. GPT-5) responses; the minimax game makes the discriminator co-evolve into an on-policy reward model. Qwen2.5-14B student becomes comparable to GPT-5-Chat on LMSYS. |
| OVD | Verbal scores (0–9) on student trajectories | Student | Sequence | General | Replaces token-level logit matching with verbal scoring; +25.7% over baselines. |
| ORPO-Distill | Student-Generated Outputs (SGO) + ORPO contrastive | Mixed (student-generated negatives, teacher positives) | Sequence | Cross-arch

Similar Articles

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Hugging Face Daily Papers

This paper introduces D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy self-distillation during supervised fine-tuning. It allows models to learn new concepts or styles without compromising their efficient few-step inference capabilities.

Hybrid Policy Distillation for LLMs

arXiv cs.CL

Introduces Hybrid Policy Distillation (HPD), a unified knowledge distillation approach that balances forward and reverse KL divergences and combines off-policy data with lightweight on-policy sampling, improving LLM compression across math, dialogue, and code tasks.

The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Hugging Face Daily Papers

This paper identifies that on-policy distillation (OPD) in language models leads to severe overconfidence due to information mismatch between training and deployment, and proposes CaOPD, a calibration-aware framework that improves both performance and confidence reliability.

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL

DART (Distill-Audit-Repair Training) is a new training framework that addresses 'harm drift' in safety-aligned LLMs, where fine-tuning for demographic difference-awareness causes harmful content to appear in model explanations. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

arXiv cs.CL

CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.
