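@QingQ77: A collection of open-source code and papers on training LLMs/VLMs/agents with On-Policy Distillation and Self-Distillation, tagged along four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://g…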

Summary

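Introduces AwesomeOPD, a curated list of open-source code and papers on On-Policy Distillation (OPD) and Self-Distillation for training LLMs, VLMs, and agents. The list classifies and annotates each resource by teacher source, supervision signal, rollout usage, and training stage.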

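A collection of open-source code and papers on training LLMs/VLMs/agents with On-Policy Distillation and Self-Distillation, tagged along four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://github.com/thinkwee/AwesomeOPD… AwesomeOPD is an Awesome List for On-Policy Distillation, organized into more than ten categories: Survey, White-Box/Black-Box OPD, OPSD, iterative self-bootstrapping, OPD-RL hybrids, reasoning, multimodal, agent/embodied, speculative decoding, frameworks, and industrial reports. Each entry is annotated along four axes (teacher source, supervision signal, rollout consumption, and pipeline slot), with strictness notes that separate methods fully satisfying C1+C2 from those that satisfy them only partially.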

Cached: 2026/05/10 04:22


thinkwee/AwesomeOPD

Source: https://github.com/thinkwee/AwesomeOPD

When LLMs Distill On-Policy

AwesomeOPD is an awesome list summarising open-source repositories and papers for training LLMs (and VLMs / agents / draft models) with On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD):

  • 🎯 OPD = C1 + C2. C1: the student samples its own trajectories y ~ π_student(·|x) during training. C2: the teacher provides per-token or sequence-level supervision on those student samples. Methods that satisfy these conditions only partially are flagged in the 📝 Strictness notes of each section.
  • 🪞 OPSD = special case where teacher is the same model, conditioned on privileged context (verified trace / answer / “be concise” prefix / longer context) or an earlier checkpoint.
  • 🚀 Each entry is annotated along four design axes — teacher source (external · same model with privileged context · earlier checkpoint · multi-teacher · discriminator), supervision signal (logits / top-k / sequence reward / verbal score / discriminator / verifier / feature), rollout consumption (all / selected / truncated / replaced / as PG samples), and pipeline slot (cold-start / mid / RL-replacement / inside-RL / inter-stage / compression / continual-anchor).
  • ⚠️ Built by reading paper PDFs, project pages, and source code with LLM coding agents; manually reviewed but errors possible. PRs welcome.
  • 📅 Last updated: 2026-04-30
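
The C1 + C2 definition above can be made concrete with a toy example. The sketch below is an illustration only, not code from any listed repository: it assumes context-free categorical "models" over a three-token vocabulary, and the names `sample_token` and `opd_step` are hypothetical. The student samples its own trajectory (C1), the teacher scores each sampled token (C2), and the mean log-ratio is a Monte Carlo estimate of the reverse KL.

```python
import math
import random

def sample_token(probs, rng):
    """Draw one token index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for idx, p in enumerate(probs):
        acc += p
        if r < acc:
            return idx
    return len(probs) - 1  # guard against floating-point round-off

def opd_step(student_probs, teacher_probs, length, rng):
    """One toy OPD step (hypothetical helper).

    C1: the student samples its own trajectory.
    C2: the teacher scores every sampled token.
    Returns the trajectory and the per-token log-ratios
    log pi_student(y_t) - log pi_teacher(y_t).
    """
    trajectory = [sample_token(student_probs, rng) for _ in range(length)]
    log_ratios = [
        math.log(student_probs[t]) - math.log(teacher_probs[t])
        for t in trajectory
    ]
    return trajectory, log_ratios

# Toy distributions (assumed values, for illustration only).
student = [0.7, 0.2, 0.1]
teacher = [0.4, 0.4, 0.2]

rng = random.Random(0)
trajectory, log_ratios = opd_step(student, teacher, length=2000, rng=rng)

# The sample mean of the log-ratios estimates KL(student || teacher).
estimate = sum(log_ratios) / len(log_ratios)
exact = sum(s * math.log(s / t) for s, t in zip(student, teacher))
```

In a real trainer the distributions depend on the prompt and prefix, and the per-token log-ratio would feed a gradient update rather than just being averaged.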

Taxonomy:

  • 📚 Surveys, Foundations & Position Papers — meta-references and seed papers (GKD, MiniLLM, Thinking Machines blog, Tencent / THUNLP surveys)
  • 🔬 White-Box — logit-based OPD on student rollouts with an external teacher
  • 🎭 Black-Box — discriminator / verbal / preference, no teacher logits
  • ♻️ OPSD — privileged-context self-distillation (same model, different conditioning)
  • 🔁 Iterative Self-Bootstrapping — same model as previous-checkpoint teacher
  • 🤝 OPD-RL Hybrids — inside-RL OPD: KL-as-reward, RL+OPD fusion
  • 🧠 Reasoning / 🖼️ Multimodal / 🤖 Agent & Embodied — by application; cuts across all teacher-source categories
  • ⚡ Speculative-Decoding Distillation — drafter distillation; “student” is a draft model
  • 🛠️ Frameworks & Toolkits — what to actually run
  • 🏭 Industrial / Production Reports — what the labs ship

Shorthand: FKL = forward KL · RKL = reverse KL · JSD = Jensen–Shannon · Skew-KL / AKL = skewed / adaptive KL · 📄 paper-only = no public code yet.
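
As a quick reference for the shorthand, here is a minimal sketch of the three basic divergences on categorical distributions. It assumes the convention used elsewhere in this list, where reverse KL means KL(student‖teacher); the function names and the toy distributions are illustrative only.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher, student):
    """FKL = KL(teacher || student); tends to be mass-covering."""
    return kl(teacher, student)

def reverse_kl(teacher, student):
    """RKL = KL(student || teacher); tends to be mode-seeking."""
    return kl(student, teacher)

def jsd(teacher, student):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    mixture = [(t + s) / 2 for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mixture) + 0.5 * kl(student, mixture)

# Toy distributions (assumed values).
teacher = [0.5, 0.4, 0.1]
student = [0.8, 0.1, 0.1]

fkl = forward_kl(teacher, student)
rkl = reverse_kl(teacher, student)
js = jsd(teacher, student)
```

The two KL directions generally give different values for the same pair of distributions, which is why the tables below record which one each method uses.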

Updates

  • 📢 2026-04-28 — initial release

📚 Surveys, Foundations & Position Papers

| Resource | 🌟 Stars | Date | Org | Paper / Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| Blog | Blog | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | Blog · tinker-cookbook | Thinking Machines Lab: On-Policy Distillation (blog) |
| tinker-cookbook | Stars | 2025.10 | Thinking Machines Lab | | Reference impl. of the OPD recipe on the Tinker SDK |
| Tencent OPD Survey | Paper | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | arXiv 2604.00626 | A Survey of On-Policy Distillation for LLMs |
| OPD | Stars | 2026.04 | Tsinghua THUNLP | arXiv 2604.13016 | Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe |
| revisiting_opd | Stars | 2026.03 | CASIA (Fu et al.) | arXiv 2603.25562 | Revisiting OPD: Failure Modes & Simple Fixes |
| Lightning OPD | Paper | 2026.04 | Wu, Han, Cai | arXiv 2604.13010 | Lightning OPD: Efficient Post-Training with Offline OPD |
| GKD | Paper | 2023.06 | Google DeepMind (Agarwal et al.) | arXiv 2306.13649 (implemented in TRL GKDTrainer) | GKD: On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (Seminal · ICLR 2024) |

📋 Technical details

| Resource | Loss / Divergence | Data | Teacher Access | Granularity | Notes |
| --- | --- | --- | --- | --- | --- |
| Thinking Machines blog | Reverse KL (student‖teacher) | Student rollouts | White-box | Token | “Swap KL ref model for stronger teacher” recipe; one-line addition to RL trainer. Replicates Qwen3 result at ~1/10 RL cost. |
| Tencent OPD Survey | (survey) | (survey) | (survey) | (survey) | Catalogues 50+ methods; useful as a reference index. |
| THUNLP Rethinking OPD | Reverse KL with progressive top-K alignment | Student | White-box | Token | Identifies two success conditions: compatible thinking patterns + genuinely new teacher capability. Recipe = off-policy cold-start + teacher-aligned prompt selection. |
| Revisiting OPD | Truncated reverse KL + top-p sampling + special-token masking | Student | White-box | Token (filtered) | Diagnoses 3 failure modes: imbalanced one-token signal, unreliable prefix guidance, tokenizer mismatch. |
| Lightning OPD | Cached teacher log-probs over SFT rollouts (offline OPD) | Student (cached) | White-box | Token | Introduces “teacher consistency”: the same teacher must be used for SFT and OPD, otherwise the gradient is biased. Eliminates the live teacher server. |
| GKD (Agarwal) | Generalised JSD (FKL/RKL configurable) | Mixed (λ interpolates teacher↔student) | White-box | Token | The seminal paper that named OPD; introduced student-self-rollout supervision. |

📝 Strictness notes (against the strict OPD definition C1: student samples its own trajectories during training + C2: teacher provides supervision on those samples)
  • Lightning OPD — ⚠️ partially satisfies C1: teacher log-probs are pre-computed once over SFT rollouts and reused during training; student doesn’t actively sample during the OPD step. Authors call this “offline OPD” explicitly. Listed in OPD because the data is past-student-generated rollouts, not teacher-generated.
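
The "offline OPD" idea described in the note above (teacher log-probs computed once, then reused) can be sketched as a cache. This is a toy illustration under stated assumptions, not the Lightning OPD implementation: `CountingTeacher`, `build_cache`, and `offline_opd_loss` are hypothetical names, and the models are context-free categorical distributions.

```python
import math

class CountingTeacher:
    """Hypothetical stand-in for a teacher model; counts how often it is queried."""

    def __init__(self, probs):
        self.probs = probs
        self.calls = 0

    def log_probs(self, tokens):
        self.calls += 1
        return [math.log(self.probs[t]) for t in tokens]

def build_cache(teacher, rollouts):
    """Query the teacher exactly once per stored rollout."""
    return {i: teacher.log_probs(tokens) for i, tokens in enumerate(rollouts)}

def offline_opd_loss(student_probs, rollouts, cache):
    """Mean per-token log-ratio log pi_student - log pi_teacher,
    using cached teacher values instead of a live teacher call."""
    total, count = 0.0, 0
    for i, tokens in enumerate(rollouts):
        for token, teacher_lp in zip(tokens, cache[i]):
            total += math.log(student_probs[token]) - teacher_lp
            count += 1
    return total / count

# Past student rollouts (toy token ids) and toy distributions, all assumed.
rollouts = [[0, 1, 0], [2, 0]]
teacher = CountingTeacher([0.5, 0.3, 0.2])
cache = build_cache(teacher, rollouts)

# Three "epochs" reuse the cache; the teacher is never queried again.
losses = [offline_opd_loss([0.6, 0.3, 0.1], rollouts, cache) for _ in range(3)]
```

Because the rollouts are fixed, they grow stale as the student changes, which is the sense in which C1 is only partially satisfied.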

🔬 OPD with Larger External Teachers — White-Box

White-box methods use teacher logits / log-probabilities to supervise the student on student-generated rollouts. Each entry below has been verified to (a) train on student rollouts and (b) operate at the token level.

Methods that turned out to be RL-style on verification have been moved to OPD-RL Hybrids; off-policy / pure-loss-function / pretraining-side methods are excluded from this list.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| LMOps/minillm | Stars | 2023.06 | Microsoft / Tsinghua | arXiv 2306.08543 | MiniLLM (ICLR 2024) |
| distillm | Stars | 2024.02 | KAIST / Microsoft | arXiv 2402.03898 | DistiLLM (ICML 2024) |
| distillm-2 | Stars | 2025.03 | KAIST / Microsoft | arXiv 2503.07067 | DistiLLM-2 (ICML 2025 Oral) |
| DSKDv2 | Stars | 2025.04 | BJTU | arXiv 2504.11426 | DSKDv2: cross-tokenizer; supports on-policy mode |
| G-OPD | Stars | 2026.02 | RUC / Tencent | arXiv 2602.12125 | G-OPD |
| google-research/speculative_kd | Stars | 2024.10 | UCSB / Google | arXiv 2410.11325 | Speculative KD (ICLR 2025) |
| AdaSwitch | Paper | 2025.10 | RUC / Baidu | arXiv 2510.07842 | AdaSwitch (on-/off-policy switching) |
| Constrained OPD | Paper | 2025.09 | Huawei Noah’s Ark | arXiv 2509.22921 | Constrained OPD (CMDP) |
| REOPOLD | Paper | 2026.03 | KAIST / Microsoft | arXiv 2603.11137 | REOPOLD (Relaxed OPD); code soon |
| OPSD_OnPolicyDistillation | Stars | 2026.03 | LinkedIn | arXiv 2603.11178 | PACED: frontier curriculum self-distill |
| Fast OPD | Paper | 2026.02 | Industrial | arXiv 2602.15260 | Fast OPD (prefix-truncated) |
| Entropy-Aware OPD | Paper | 2026.03 | KAIST / IBM | arXiv 2603.07079 | Entropy-Aware OPD |
| Veto | Paper | 2026.01 | SNU | arXiv 2601.07155 | Veto (Stable OPD), ACL 2026 Findings |
| OPSD_OnPolicyDistillation | Stars | 2026.04 | Meta / LinkedIn | arXiv 2604.14084 | TIP: Token Importance; shares LinkedIn OPSD repo with PACED |
| SCOPE | Stars | 2026.04 | USTC / Meituan / Fudan | arXiv 2604.10688 | SCOPE: signal-calibrated dual-path |
| TSD-KD | Stars | 2026.03 | Korea Univ. | arXiv 2603.13260 | TSD-KD: token-selective dual KD (ICLR 2026) |
| Hybrid-Policy-Distillation | Stars | 2026.04 | zwhong714 | arXiv 2604.20244 | HPD: Hybrid Policy Distillation; LlamaFactory + verl backends |

📋 Technical details

| Method | Loss / Divergence | Data | Granularity | Domain | Notes |
| --- | --- | --- | --- | --- | --- |
| MiniLLM | Reverse KL via policy gradient | Student | Sequence (PG) | General | The seminal “OPD” recipe by Yuxian Gu et al.; predates GKD by days. Mode-seeking. |
| DistiLLM | Skewed-KL (mix of FKL/RKL) | Mixed (adaptive off→on, with student samples) | Token | General | Skew parameter α interpolates between FKL and RKL; importance-reweighted student samples. |
| DistiLLM-2 | Contrastive: Skew-FKL on teacher data + Skew-RKL on student data | Mixed | Token | General | Asymmetric losses on each data source; ICML 2025 oral. |
| DSKDv2 | KL in dual aligned space; explicit on-policy mode | Student | Token | Cross-tokenizer | Cross-vocabulary distillation; supports both on/off-policy. |
| G-OPD / ExOPD | Reverse KL + scaled reward extrapolation | Student | Token | General | Generalises OPD as KL-constrained RL; allows reward scale > 1 to “exceed” the teacher. |
| Speculative KD (Xu) | Interleaved propose-and-correct (gated KL) | Student-proposed, teacher-corrected | Token | General | Bridges teacher-student gap via interleaved sampling. |
| AdaSwitch | Adaptive on/off-policy switching | Mixed | Token | General | Switches between teacher-data and student-rollout based on divergence threshold. |
| Constrained OPD | KL-constrained CMDP | Student | Token | General | Hard KL constraint instead of soft penalty. Borderline OPD-RL. |
| REOPOLD | Mixture-based reward clipping + entropy-based dynamic sampling | Student | Token | Reasoning | “Relaxed OPD”; views OPD as policy optimisation with teacher-student log-ratio reward. |
| PACED | Frontier curriculum at student competence boundary | Student | Token | General | Self-distill style (privileged-context / earlier-checkpoint); difficulty weighting w(p)=p(1−p). |
| Fast OPD | Prefix-truncated distillation reducing FLOPs | Student | Token (truncated) | Reasoning | 2× to 47× speedup via reasoning-prefix truncation. |
| Entropy-Aware OPD | Switch between FKL and RKL based on teacher entropy | Student | Token | Reasoning | When teacher entropy high → FKL; low → RKL. |
| Veto | Logit-space geometric bridge with adaptive gradient veto | Student | Token | General | Adaptive Target Reformulation. |
| TIP | Top-50% high-entropy student tokens carry the OPD signal | Student (selected) | Token (filtered) | Reasoning | ~47% memory savings; only entropy-high student tokens trained. |
| SCOPE | Teacher-PPL-weighted KL on incorrect rollouts; student-PPL-weighted MLE on correct | Student | Token | Reasoning | Signal-Calibrated OPD with Dual-Path Adaptive Weighting; verifier-routing. |
| TSD-KD | Indirect (student-propose / teacher re-rank) + direct selective logit KD | Mixed | Token (selected) | General | Hybrid; partial OPD + partial preference. |
| HPD | Reweighted log-likelihood unifying FKL + RKL | Mixed (off-policy + lightweight approximate on-policy sampling) | Token | General | Unifies KD as token-level reweighted likelihood; lightweight on-policy sampling preserves training efficiency. |
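
Several rows above describe token- or prompt-level selection rules in one line each. The sketch below illustrates three of them as small standalone helpers: the PACED difficulty weight w(p)=p(1−p), a TIP-style top-50% entropy mask, and an entropy-aware FKL/RKL switch. The function names, the `threshold` parameter, and the toy distributions are assumptions for illustration; none of this is taken from the papers' code.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def paced_weight(pass_rate):
    """w(p) = p(1 - p): largest at p = 0.5, zero for prompts the
    student always or never solves."""
    return pass_rate * (1.0 - pass_rate)

def top_entropy_mask(student_dists, keep_fraction=0.5):
    """Keep only the highest-entropy fraction of token positions."""
    entropies = [entropy(d) for d in student_dists]
    k = max(1, int(len(entropies) * keep_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

def choose_divergence(teacher_dist, threshold):
    """High teacher entropy -> FKL, low -> RKL.
    `threshold` is a hypothetical hyperparameter."""
    return "FKL" if entropy(teacher_dist) > threshold else "RKL"

# Per-position student distributions for a 4-token rollout (toy values).
student_dists = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4], [0.9, 0.1]]
mask = top_entropy_mask(student_dists)

uncertain_choice = choose_divergence([0.5, 0.5], threshold=0.5)
confident_choice = choose_divergence([0.99, 0.01], threshold=0.5)
```

Positions 0 and 2 have the most uncertain student distributions, so they are the ones the mask keeps.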

🎭 OPD with Black-Box / Outcome-Based Teachers

When the teacher is API-only (no logits), OPD uses scalar rewards, verbal scores, preferences, or adversarial discriminators — all evaluated on student rollouts. Entries that turned out to use static teacher data only (Lion, SuperCorrect, DAIL, SODA) are excluded from this list.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| LMOps/gad | Stars | 2025.11 | Microsoft Research | arXiv 2511.10643 · project | GAD: Black-Box OPD |
| OVD | Paper | 2026.01 | HKU / Huawei | arXiv 2601.21968 | OVD (On-policy Verbal Distillation); project page OVD.github.io 404s |
| ORPO-Distill | Paper | 2025.09 | Industrial | arXiv 2509.25100 | ORPO-Distill |

📋 Technical details

| Method | Feedback Signal | Data | Granularity | Domain | Notes |
| --- | --- | --- | --- | --- | --- |
| GAD (Generative Adversarial Distillation) | Discriminator (on-policy reward model) | Student | Sequence | General | A trained discriminator distinguishes student outputs from teacher (e.g. GPT-5) responses; minimax game makes the discriminator co-evolve into an on-policy reward model. Qwen2.5-14B student becomes comparable to GPT-5-Chat on LMSYS. |
| OVD | Verbal scores (0–9) on student trajectories | Student | Sequence | General | Replaces token-level logit matching with verbal scoring; +25.7% over baselines. |
| ORPO-Distill | Student-Generated Outputs (SGO) + ORPO contrastive | Mixed (student-generated negatives, teacher positives) | Sequence | Cross-architecture | “Mixed-policy strategy utilizing student-generated outputs”; NeurIPS 2025 WS. |

♻️ Self-Distillation with Privileged Context — OPSD

Same model = teacher = student, but the teacher is conditioned on something the student doesn’t see (verified trace, ground-truth answer, “be concise” prefix, longer context, document, …). The gap exists because of the conditioning, not weights.

Several entries previously listed here turned out on verification to use static teacher data or a fixed self-rewritten dataset rather than student rollouts; those have been excluded. SPIN was reclassified to Iterative Self-Bootstrapping.
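
To illustrate the point that the teacher-student gap in OPSD comes from conditioning rather than weights, here is a toy sketch. `toy_model`, its two-token output, and the `ANSWER:` marker are all hypothetical; the only thing the example is meant to show is that one parameter set yields two different distributions depending on whether the prompt carries privileged context.

```python
import math

def toy_model(weights, prompt):
    """Hypothetical shared model returning a distribution over two tokens
    ("right", "wrong"). Seeing the answer in the prompt adds a logit bonus
    for the right token."""
    bonus = weights["hint_bonus"] if "ANSWER:" in prompt else 0.0
    p_right = 1.0 / (1.0 + math.exp(-(weights["base"] + bonus)))
    return [p_right, 1.0 - p_right]

def reverse_kl(student, teacher):
    """KL(student || teacher) on categorical distributions."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

# One parameter set, used for both roles (toy values).
weights = {"base": 0.0, "hint_bonus": 2.0}
question = "What is 17 * 3?"

student_dist = toy_model(weights, question)                  # no privileged context
teacher_dist = toy_model(weights, question + "\nANSWER: 51")  # privileged context

gap = reverse_kl(student_dist, teacher_dist)
no_privilege_gap = reverse_kl(student_dist, toy_model(weights, question))
```

With the privileged context removed, the two views coincide and the distillation signal vanishes, which is why the choice of privileged context defines each method in this section.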

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| OPSD | Stars | 2026.01 | UCLA / Meta FAIR | arXiv 2601.18734 · blog | OPSD: Self-Distilled Reasoner |
| CRISP_Reasoning_Compression | Stars | 2026.03 | LinkedIn | arXiv 2603.05433 | OPSDC / CRISP |
| Self-Distillation | Stars | 2026.01 | MIT / ETH | arXiv 2601.19897 | SDFT-Continual |
| LMOps/opcd | Stars | 2026.02 | Microsoft Research | arXiv 2602.12275 | OPCD: On-Policy Context Distillation |
| LMOps/oel | Stars | 2026.03 | Microsoft Research | arXiv 2603.16856 | OEL: Online Experiential Learning |
| mtp-lm | Stars | 2026.02 | UMD / LLNL | arXiv 2602.06019 | MTP Self-Distill |
| ml-ssd | Stars | 2026.04 | Apple MLR | arXiv 2604.01193 | Apple: Embarrassingly Simple Self-Distillation |
| GATES | Paper | 2026.02 | UMD | arXiv 2602.20574 | GATES (Self-Distillation under Privileged Context) |
| OPSDL | Paper | 2026.04 | Baidu | arXiv 2604.17535 | OPSDL (Long-Context Self-Distillation) |
| SD-Zero | Paper | 2026.04 | Princeton / Toronto / CMU | arXiv 2604.12002 | SD-Zero: Self-Revision turns binary rewards into dense supervision |
| self-distillation-analysis | Stars | 2026.03 | MSR / KAIST / SNU | arXiv 2603.24472 | Why Does Self-Distillation (Sometimes) Degrade Reasoning? Diagnostic study of OPSD failure modes |
| π-Play | Paper | 2026.04 | CASIA / UCAS / Meituan | arXiv 2604.14054 | π-Play: multi-agent self-play turns the question-construction path into privileged context for OPSD on search agents |

📋 Technical details

| Method | Privileged Context (Teacher) | Loss / Divergence | Granularity | Domain | Notes |
| --- | --- | --- | --- | --- | --- |
| OPSD (Self-Distilled Reasoner) | Verified reasoning trace | Per-token RKL with point-wise clipping | Token | Math reasoning | Same-model OPSD; matches GRPO with 1×8 rollouts and 1024 length vs. GRPO’s 8×16 / 16k. The canonical OPSD paper. Built on TRL’s GOLD trainer. |
| CRISP / OPSDC | “Be concise” instruction prefix | Per-token RKL on student rollouts | Token | Reasoning compression | Compresses long-CoT without entropy collapse (unlike RL-with-length-penalty). |
| SDFT-Continual (idanshen) | Demo-conditioned same model | RKL on student rollouts vs. demo-conditioned teacher | Token | Continual learning | Self-distillation enables continual learning. |
| OPCD | In-context-knowledge-augmented same model | RKL on student rollouts | Token | Knowledge internalisation | Internalises the context so the model stays faithful to it after the context is removed. |
| OEL (Online Experiential Learning) | Same model with interactive game environment | RKL on student rollouts | Token | Game / planning | Self-distillation on interactive trajectories. |
| MTP Self-Distill | Multi-token prediction same model | RKL on student rollouts | Token | General | Multi-Token Prediction via Self-Distillation. Author-stated on-policy. |
| Apple SSD | Same model w/ temperature/truncation sampling | Cross-entropy on its own samples | Sequence | Code generation | “Embarrassingly simple”: sample, then SFT on those samples. Degenerate OPSD; “decoding-config” privilege. |
| GATES | Document-conditioned tutor (same model) | RKL gated by tutor consensus | Token (gated) | Document QA | Both tutor and student sample rollouts; on-policy student-rollout updates contribute “modest additional improvement” on top of off-policy distillation. Mixed. |
| OPSDL | Short-context same model | Point-wise RKL | Token | Long-context | On-Policy Self-Distillation for Long-Context LMs. |
| SD-Zero | Reviser conditioned on generator’s response + binary reward | Per-token KL: distill reviser → generator on student rollouts | Token | Math / code reasoning | Single model plays Generator + Reviser; reviser’s reward-conditioned token distribution becomes dense supervision over the generator’s response. Outperforms RFT, GRPO, SDFT under matched sample budget on Qwen3-4B-Instruct / Olmo-3-7B-Instruct (≥10% over base). Exhibits token-level self-localization and iterative self-evolution. |
| Why-Does-SD-Degrade (analysis) | Varies (controlled study over rich-vs-thin context teachers) | RKL on student rollouts (analysis only) | Token | Math reasoning (in-domain + OOD) | Diagnostic paper, not a training method. Finds that conditioning the teacher on richer privileged context suppresses epistemic verbalization (uncertainty expression) in the student → fast in-domain gains but up to 40% OOD drops on Qwen3-8B / DeepSeek-Distill-Qwen-7B / Olmo3-7B-Instruct. Implication: privileged-context richness is a double-edged knob in OPSD. |
| π-Play | Teacher conditioned on Question Construction Path (QCP), the reverse-direction artifact emitted by an examiner agent when it generates the task | Per-token reverse KL on student rollouts; teacher is an EMA copy of the student (τ=0.05) | Token | Search / deep-research / multi-hop QA agents (NQ, TriviaQA, HotpotQA, 2WikiMQA, MuSiQue, …) | Self-play loop examiner ↔ student/teacher with no external data. The QCP is privileged because it captures the reverse solution path the examiner used to construct the task; the teacher sees it, the student doesn’t. Converts sparse-reward self-play into dense per-token supervision; data-free π-Play surpasses fully supervised search agents and is 2–3× more sample-efficient than conventional self-play. |

📝 Strictness notes
  • Apple SSD — ⚠️ C2 is degenerate: no teacher KL signal; pure self-generated SFT (sample with temperature/truncation, then SFT on those samples). Closer to STaR-style self-bootstrapping than to OPSD. Kept because the “teacher” is the same model with a different decoding config — privileged-context-by-decoding.
  • GATES — ⚠️ Authors’ own ablation says off-policy trajectory-level distillation drives the primary gains; on-policy student-rollout updates contribute only “modest additional improvement”. Mixed; the OPSD leg is genuine but secondary.
  • SD-Zero — privileged context is non-textual: the reviser is conditioned on the generator’s full response plus its scalar binary reward. C1 ✓ (generator samples its own rollouts), C2 ✓ (per-token KL from reviser). Compared head-to-head against GRPO in the paper but is not itself an RL method — there is no policy-gradient objective; the reward is a conditioning signal, not a return. Listed in OPSD rather than OPD-RL Hybrids for that reason.
  • Why-Does-SD-Degrade — analysis-only; no new training algorithm proposed. Listed here because the failure mode it characterises (epistemic-verbalization collapse under rich privileged context) is specific to OPSD.
  • π-Play — teacher and student have separate parameter sets; the teacher is an EMA-tracking copy of the student rather than literally the same weights. Listed in OPSD because (i) the paper itself frames the method as “Privileged Self-Distillation” and (ii) the gap between teacher and student exists because of QCP conditioning, not weight divergence (the EMA target collapses to the student in the limit). C1 ✓ (student samples its own rollouts), C2 ✓ (per-token RKL from QCP-conditioned teacher).
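
The EMA-teacher pattern mentioned for π-Play can be sketched in a few lines. The update direction below, teacher ← (1 − τ)·teacher + τ·student, is an assumed convention (the list only states τ=0.05), and `ema_update` with its flat parameter lists is a hypothetical simplification.

```python
def ema_update(teacher, student, tau=0.05):
    """Move each teacher parameter a fraction tau toward the student.
    Assumed convention: teacher <- (1 - tau) * teacher + tau * student."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(teacher, student)]

# Toy parameter vectors (assumed values).
teacher = [0.0, 2.0]
student = [1.0, 1.0]

after_one = ema_update(teacher, student)

# With the student held fixed, repeated updates shrink the gap geometrically.
for _ in range(200):
    teacher = ema_update(teacher, student)
max_gap = max(abs(t - s) for t, s in zip(teacher, student))
```

This matches the note's remark that the EMA target collapses to the student in the limit, so the remaining teacher-student gap comes from the privileged conditioning.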

🔁 Iterative Self-Bootstrapping

Same model is the teacher, but as a frozen earlier checkpoint, not a privileged-context view. The teacher snapshot is frozen for one round, the student trains, then the snapshot rolls forward. Listed separately because the supervision is typically sequence-level / preference, not per-token logit-distillation.
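
The freeze, train, roll-forward schedule described above can be sketched as a loop. Everything here is a toy under stated assumptions: the "model" is a single scalar, `train_round` uses a made-up update that pushes the student toward a fixed `target` in proportion to how far the frozen snapshot is from it, and none of this reproduces SPIN or rStar.

```python
import copy

def train_round(student, frozen_teacher, steps=3, lr=0.1, target=1.0):
    """Hypothetical inner loop: the frozen teacher is only read, never written.
    The update vanishes once the snapshot already matches the target."""
    for _ in range(steps):
        student["w"] += lr * (target - frozen_teacher["w"])
    return student

def self_bootstrap(rounds):
    """Each round: snapshot the student as teacher, train, then roll forward."""
    student = {"w": 0.0}
    snapshots = []
    for _ in range(rounds):
        teacher = copy.deepcopy(student)  # frozen earlier checkpoint
        snapshots.append(teacher["w"])
        student = train_round(student, teacher)
    return student, snapshots

final_student, snapshots = self_bootstrap(rounds=3)
```

Each snapshot equals the student at the end of the previous round, which is the "teacher = previous self" pattern this section groups together.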

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| SPIN | Stars | 2024.01 | UCLA | arXiv 2401.01335 | SPIN: Self-Play Fine-Tuning (ICML 2024) |
| rStar | Stars | 2025.01 | Microsoft Research | rStar-Math 2501.04519 · rStar2-Agent 2508.20722 | rStar / rStar-Math / rStar2-Agent |

📝 Strictness notes
  • SPIN — ⚠️ C1 ✓ (student samples), but C2 fails strict per-token logit form: supervision is sequence-level DPO preference against the previous frozen checkpoint. More accurately “iterative on-policy DPO” than per-token OPD. Kept because the “teacher = previous self” pattern is what people search for in OPD lists.
  • rStar / rStar-Math / rStar2-Agent — ⚠️ MCTS-filtered student samples + SFT; the “teacher signal” is a step-level PPM / discriminator score, not per-token logit KL. Iterative self-improvement, not classical OPD.

🤝 OPD-RL Hybrids — Inside-RL OPD

Methods that fuse OPD with RLVR / GRPO / PPO / DPO. Teacher logits become a dense reward-shaping term or a trust-region anchor inside an RL objective, or BoN / preference signals are used as the imitation target.

Newly added on verification: AlignDistil (RLHF-equivalent distillation), BOND / Faster WIND (sequence-level Best-of-N as target), KETCHUP (k-step RL-based KD), 𝒳-KD / DDT (IRL-style), LUFFY (mixed-policy GRPO with off-policy traces), NPO / AutoNPO (mixed-policy GRPO with near-future self as teacher). Removed on verification: RLKD (only sequence-level structural reward), ExGRPO (pure RL, no teacher), REDI (offline R1 traces, no student rollouts).
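
One simple way to picture "teacher logits as dense reward shaping inside an RL objective" is to subtract a per-token teacher-student log-ratio from a group-normalized sequence advantage. The sketch below is a generic illustration, not the objective of any specific paper listed here; `beta`, the function names, and all numeric values are assumptions, and population standard deviation is used for the group normalization.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized sequence advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def shaped_token_advantages(seq_advantage, student_logps, teacher_logps, beta):
    """Per-token advantage = sequence advantage minus beta times the
    log-ratio log pi_student - log pi_teacher (a reverse-KL-style penalty).
    `beta` is a hypothetical mixing coefficient."""
    return [
        seq_advantage - beta * (s - t)
        for s, t in zip(student_logps, teacher_logps)
    ]

# Verifier rewards for a group of four rollouts of one prompt (toy values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# Log-probs of the first rollout's three tokens under each model (toy values).
student_logps = [-0.2, -1.5, -0.1]
teacher_logps = [-0.3, -0.4, -2.0]
token_adv = shaped_token_advantages(
    advantages[0], student_logps, teacher_logps, beta=0.5
)
```

Tokens the teacher finds more likely than the student does (the second token) get a larger advantage, while tokens the student over-weights relative to the teacher (the third token) are pushed down, even though the whole sequence shares one outcome reward.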

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| SDPO | Stars | 2026.01 | ETH / MIT | arXiv 2601.20802 · project | SDPO: RL via Self-Distillation |
| OpenClaw-RL | Stars | 2026.03 | Gen-Verse | arXiv 2603.10165 | OpenClaw-RL: combines GRPO + OPD |
| Open-AgentRL | Stars | 2026.02 | Gen-Verse | | Open-AgentRL: RLAnything / DemyAgent multi-domain |
| AlignDistil | Stars | 2025.03 | BJTU / Tencent | arXiv 2503.02832 | AlignDistil: RLHF-equivalent KD (ACL 2025) |
| LUFFY | Stars | 2025.04 | Westlake U. | arXiv 2504.14945 | LUFFY: mixed-policy GRPO |
| NPO | Paper | 2026.04 | IIE CAS / UCAS / JD.COM | arXiv 2604.20733 | NPO / AutoNPO: mixed-policy GRPO with near-future self as teacher |
| KEPO | Stars | 2026.01 | Industrial | arXiv 2602.00400 | KEPO |
| BOND | Paper | 2024.07 | Google DeepMind | arXiv 2407.14622 | BOND (Best-of-N Distillation) |
| Faster WIND | Paper | 2024.10 | CMU / Google | arXiv 2410.20727 | Faster WIND (iterative BoN), AISTATS 2025 |
| KETCHUP | Paper | 2025.04 | U. Alberta | arXiv 2504.19024 | KETCHUP (k-step RL-KD) |
| 𝒳-KD | Paper | 2026.02 | BUPT | arXiv 2602.12674 | 𝒳-KD (IRL-style) |
| Towards-On-Policy-SFT | Stars | 2026.02 | MSRA / Shopee | arXiv 2602.12222 | DDT: on-policy SFT theory |
| RLAD | Paper | 2026.02 | AWS | arXiv 2602.22495 | RLAD (Reinforcement-aware KD) |
| KDRL | Paper | 2025.06 | HIT / Huawei | arXiv 2506.02208 | KDRL (Joint KD + RL) |
| RLSD | Paper | 2026.04 | Multi-org | arXiv 2604.03128 | Self-Distilled RLVR (RLSD) |
| HDPO | Paper | 2026.03 | NVIDIA | arXiv 2603.23871 | HDPO (Hybrid Distillation PO) |
| ExGRPO | Stars | 2026.03 | UNC / ASU | arXiv 2603.19266 | Probing-to-Refine / EI / EXGRPO |

📋 Technical details

| Method | Inner RL | Teacher Role | Data | Granularity | Domain | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SDPO | Custom self-distillation policy gradient | Feedback-conditioned same model = self-teacher | Student | Token | Code, tool-use, science | Sample student rollout, get tokenised feedback, re-evaluate under feedback-conditioned self-teacher, distill the corrected next-token distribution back into policy. |
| OpenClaw-RL | GRPO + OPD | Judge model extracts hindsight hints, teacher token-logprob gap = directional advantage | Mixed | Token | Terminal / GUI / SWE / Tool-call | Unifies binary RL and OPD in one trainer. |
| Open-AgentRL | GRPO-TCR | Multi-domain teachers | Student | Token | Reasoning / GUI / Coding | Includes process-reward modelling via SandboxFusion. |
| AlignDistil | RLHF-equivalent KD | DPO-derived combination of DPO model + ref-model logits | Student | Token | Alignment | Re-frames DPO as policy distillation. |
| LUFFY | Mixed-Policy GRPO + policy shaping | Off-policy R1 traces inserted into student rollouts | Mixed | Token + sequence | Reasoning | “Learn to reason under off-policy guidance”. On-policy student-roll + off-policy teacher-trace mix. |
| NPO / AutoNPO | Mixed-Policy GRPO | Verifier-filtered trajectories from a later checkpoint of the same training run | Mixed | Sequence | Reasoning (RLVR) | “Learn from your near-future self”. Picks a teacher that is strong enough (higher Q than current policy) yet close enough (low V vs. external teachers like R1), maximising effective Q/V signal. AutoNPO adaptively schedules the interventions; preserves higher entropy than vanilla GRPO. |
| KEPO | Knowledge-enhanced PO | Knowledge-base teacher | Mixed | Sequence | Reasoning | Adds KB grounding to preference RL. |
| BOND | Best-of-N distillation | Same model’s BoN target | Student (iterative) | Sequence | Alignment | Treats Best-of-N as the target distribution; iterative anchor; Jeffreys divergence. |
| Faster WIND | Win-rate dominance | Same model BoN | Student (iterative) | Sequence | Alignment | Game-theoretic acceleration of BOND. |
| KETCHUP | k-step return REINFORCE on KD | External teacher | Student | Sequence | General | RL-based KD with k-step Bellman returns. |
| 𝒳-KD | AVRIL inverse-RL | Joint reward + policy distillation | Student | Token + sequence | General | IRL-flavoured experiential KD. |
| DDT | On-policy SFT theory | Theoretical | Student | Token | General | Distribution Discriminant Theory; foundations for on-policy SFT. |
| RLAD | PPO/GRPO ratio anchored to teacher–old-policy mixture | External teacher (Qwen3-32B) | Student | Token | Reasoning | Trust-region likelihood-ratio. |
| KDRL | Joint reverse-KL + GRPO rule-based reward | External teacher (Skywork-OR1) | Student | Token + outcome | Reasoning | Unified KD + RL objective. |
| Self-Distilled RLVR (RLSD) | RLVR direction + teacher evidence-ratio modulates magnitude | Same model + privileged answer | Student | Token + outcome | Reasoning | Combines self-distillation magnitudes with RLVR directions. |
| HDPO | RL on most prompts; on “cliff” prompts generate privileged rollouts and self-distill | Same model w/ privilege | Student | Token | Reasoning | Privileged self-distillation as RL fallback. |
| Probing-to-Refine | “Explanatory probes” force logical articulation; GRPO + dialogue-structure reward | Self-probe | Student | Sequence | Reasoning | Reinforcement Distillation via Explanatory Inversion. |

📝 Strictness notes
  • LUFFY — ⚠️ Mixed-policy: half on-policy student rollouts (C1+C2 ✓) + half off-policy R1 traces inserted into GRPO (C1 ✗ on the off-policy half). Net is OPD-flavor with off-policy import.
  • NPO / AutoNPO — ⚠️ Same mixed-policy GRPO pattern as LUFFY, but the off-policy traces come from a near-future checkpoint of the same run instead of an external R1 teacher. Authors frame it as RLVR, not OPD; included here as an OPD variant because (a) the imported trajectories play the same “stronger-self teacher” role, and (b) the paper itself explicitly invites follow-up work to inject the near-future-self signal via on-policy distillation. Strict per-token logit KL (C2) is not the loss — supervision is verifier-filtered sequence-level trajectory mixing inside GRPO.
  • BOND, Faster WIND: ⚠️ Iterative self-bootstrapping; teacher = same model’s BoN distribution. Loss is Jeffreys / win-rate-dominance at the sequence level, with no per-token logit supervision (C2 partially fails strict form). More accurately “on-policy iterative alignment” than OPD.
  • KETCHUP — ⚠️ Sequence-level RL-based KD with k-step Bellman returns; the paper itself self-describes as “RL-based KD”. Closer to RL with KD-anchor reward than per-token OPD.
  • 𝒳-KD — ⚠️ Built on AVRIL inverse-RL framework with joint reward modeling; closer to IRL+OPD hybrid than pure OPD.
  • DDT — ⚠️ Theoretical foundations paper for “on-policy SFT” (Distribution Discriminant Theory); not a specific deployable algorithm. Kept for completeness.
  • KEPO, Open-AgentRL, Probing-to-Refine — ⚠️ C1 ✓ (on-policy student rollouts), but the per-token KL component vs. sequence-level reward shaping vs. preference optimization is not fully resolved from abstracts. Listed because the papers self-describe as OPD/on-policy distillation but exact form of C2 needs full-paper reading.

🧠 Reasoning OPD (by application)

Genuine OPD work on math / code / long-CoT reasoning. Off-policy SFT-distill from R1, pure RL methods (Skywork-OR1, SimpleRL-Zoo, Time-R1), and analysis-only papers are excluded from this list — each had no student-rollout-with-teacher-supervision component.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| OPD | Stars | 2026.04 | Tsinghua THUNLP | arXiv 2604.13016 | Rethinking OPD recipe |
| G-OPD | Stars | 2026.02 | RUC / Tencent | arXiv 2602.12125 | G-OPD (cross-list) |
| OPD-AVMP | Paper | 2026.04 | Academic | arXiv 2604.07944 | OPD for Autonomous Vehicle Motion Planning |

The reasoning-OPD canon already lives across OPSD (siyan-zhao/OPSD, CRISP, SD-Zero), Iterative Self-Bootstrapping (rStar / rStar-Math), OPD-RL Hybrids (LUFFY, RLAD, KDRL, RLSD, HDPO), and White-Box (REOPOLD, Fast OPD, Entropy-Aware OPD, TIP, SCOPE, PACED). This section only lists items not already covered above.

📋 Technical details

| Method | Loss / Objective | Data | Teacher | Granularity | Base / Benchmark | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Rethinking OPD (THUNLP) | RKL with progressive top-K alignment + off-policy cold-start | Mixed | White-box (Qwen3-4B/1.7B teacher pairs) | Token | Math reasoning | Identifies teacher-novelty and thinking-pattern compatibility as success conditions. |
| OPD for AV Motion Planning | GPT-Driver framework + GKD on student-generated trajectories | Student | White-box (LLM teacher) | Token | Driving | 5× model-size reduction. |

🖼️ Multimodal OPD (VLM, Video, Audio, Image)

Strict OPD work in non-text modalities. Many “R1”/“GRPO” multimodal models that bear the brand are pure RL (no teacher-distillation loss) and are excluded.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| piFlow | Stars | 2025.10 | Multi-org | arXiv 2510.14974 | π-Flow: image / flow OPD (ICLR 2026) |
| Step-Audio-R1 | Stars | 2025.11 | StepFun | arXiv 2511.15848 | Step-Audio-R1 |
| VOLD | Paper | 2025.10 | INRIA / Goethe Univ. | arXiv 2510.23497 · project page | VOLD (LLM→VLM OPD); repo placeholder; ICLR 2026 |
| Video-OPD | Paper | 2026.02 | Industrial | arXiv 2602.02994 | Video-OPD |
| X-OPD | Paper | 2026.03 | Tencent Hunyuan / ZJU | arXiv 2603.24596 | X-OPD (Speech LLM) |

📋 Technical details

| Method | Modality | Teacher | Loss | Data | Notes |
| --- | --- | --- | --- | --- | --- |
| π-Flow | Image generation (flow models) | Teacher velocity field | L2 imitation distillation | Student | Strict OPD for diffusion: student predicts policy at each timestep along its own trajectory. |
| Step-Audio-R1 | Audio reasoning | Self (modality-grounded) | Iterative self-distillation + SFT + PPO/RLVR | Student | Iterative on-policy cycles; only audio-relevant questions used in self-distill. |
| VOLD | LLM → VLM | Text-only LLM | GRPO + on-policy KL distillation | Student | Cold-start SFT alignment + unified RL+KD; ICLR 2026. The flagship VLM OPD recipe. |
| Video-OPD | MLLM | LLM teacher | Token-level KL on student rollouts | Student | Temporal video grounding via OPD. |
| X-OPD | Speech LLM | Text LLM | Cross-modal token-level KL | Student | Capability alignment in speech LLMs. |

🤖 Agent & Embodied OPD (by application)

Genuine OPD where the student is an agent rolling out actions; teacher (or self) supervises those trajectories. Pure-RL agent works (WebRL, WebAgent-R1, InfiGUI-G1, GUI-R1) and off-policy SFT-on-teacher-trajectories (Nardien, AgentRefine, Chain-of-Agents, MapCoder-Lite, SAD, Structured-Web) are excluded.

Resource🌟 StarsDateOrgPaper LinkTitle / Notes
OpenClaw-RLStars2026.03Gen-VersearXiv 2603.10165OpenClaw-RL (cross-list with OPD-RL)
easydistillStars2025.09Alibaba ModelScopeSCoRe arXiv 2509.14257/projects/SCoRe
RPDStars2025.03TUM / FreiburgarXiv 2503.05833 · projectRefined Policy Distillation, VLA (IROS 2026)
VLA-OPDPaper2026.03HKUST (Guangzhou) — IRPN LabarXiv 2603.26666 · projectVLA-OPD — bridging offline SFT & online RL for VLA via OPD (code coming soon)
LLM4TeachStars2023.11 (updated 2025)ZJ Lab AMMIarXiv 2311.13373LLM4Teach — small-RL agent guided by LLM

📋 Technical details

| Method | Domain | Teacher Role | Loss | Notes |
| --- | --- | --- | --- | --- |
| OpenClaw-RL | Terminal / GUI / SWE / Tool-call | Judge model + token-logprob gap | GRPO + OPD | Hindsight-hint extraction; combines binary RL and per-token OPD. |
| SCoRe | 12 agent benchmarks | Larger teacher (72B) corrects earliest error in student rollout | SFT-on-corrections + short-horizon RL | 7B student matches 72B teacher. |
| RPD | VLA / robot manipulation | Teacher VLA actions | PPO + behavioural cloning on student rollouts | Cleanest VLA-OPD recipe. |
| VLA-OPD | VLA / robot manipulation (LIBERO, RoboTwin2.0) | Expert VLA teacher, dense token-level supervision on student trajectories | Reverse-KL (avoids FKL entropy explosion + Hard-CE collapse) | Replaces sparse RL reward; preserves generalist priors and mitigates catastrophic forgetting. |
| LLM4Teach | Small RL agent | LLM teacher (action-level) | Distillation + RL annealed | Strict OPD for embodied; predates the wave. |
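
The VLA-OPD row names reverse KL as its loss and contrasts it with forward KL. As a rough illustration of why the direction matters, here is a minimal sketch on a toy three-action distribution: forward KL tends to favour a student that covers all teacher mass, while reverse KL tends to favour a student that commits to one teacher mode. The numbers and function names are illustrative assumptions and are not taken from the VLA-OPD paper or code.

```python
import math

def forward_kl(teacher, student):
    """KL(teacher || student): the expectation is taken under the teacher."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)

# Toy numbers (assumed): the teacher likes two actions equally.
teacher = [0.49, 0.49, 0.02]
student_mode = [0.96, 0.02, 0.02]   # commits to one teacher mode
student_cover = [0.34, 0.33, 0.33]  # spreads mass over every action

for name, student in [("mode", student_mode), ("cover", student_cover)]:
    fkl = forward_kl(teacher, student)
    rkl = reverse_kl(teacher, student)
    print(f"{name}: forward KL = {fkl:.3f}, reverse KL = {rkl:.3f}")
```

On these toy numbers the covering student scores better under forward KL and the mode-committing student scores better under reverse KL, which is one common way to read the "entropy explosion" remark in the table.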

⚡ Speculative-Decoding Distillation

Distillation of the draft model so that it better mimics the verifier/target model. The on-policy element here is the drafter’s own continuations, as judged by the target. Listed separately because the goal is inference speedup, not student capability.

This section only lists drafters trained with the drafter’s own rollouts. Off-policy drafter training (EAGLE-1/2, Medusa, Hydra, Kangaroo, ReDrafter, BiTA, SpecDec++, LayerSkip, FREE, AdaSPEC, POSS) and training-free system tricks (Ouroboros, Sequoia, TriForce, SwiftKV, SuffixDecoding) are excluded.

| Resource | 🌟 Stars | Date | Org | Paper Link | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| EAGLE | Stars | 2025.03 | PKU / Microsoft | EAGLE-3 | EAGLE-3: on-policy multi-step TTT |
| HASS | Stars | 2024.08 | Academic | arXiv 2408.15766 | HASS |
| OSD | Stars | 2023.10 | UCB / NVIDIA | arXiv 2310.07177 | Online Speculative Decoding |
| Falcon | Stars | 2024.12 | Bestpay | arXiv 2412.12639 | Falcon |
| SpecForge | Stars | 2026.03 | SGLang | LMSYS blog | SpecForge: open EAGLE-3 training framework |
| DistillSpec | Paper | 2023.10 | Google DeepMind | arXiv 2310.08461 | DistillSpec (ICLR 2024) |
| SpecKD | Paper | 2025.10 | XJTU (Haiduo Huang et al.) | arXiv 2510.24021 | SpecKD / SelecTKD (verification-gated KD; v1=SpecKD, v2 retitled SelecTKD) |
| ReSpec | Paper | 2025.10 | Academic | arXiv 2510.26475 | ReSpec (RL drafter evolution) |
| DVI | Paper | 2025.10 | Academic | arXiv 2510.05421 | DVI (Draft-Verify-Improve, online RL) |
| CORAL | Paper | 2025.02 | Academic | arXiv 2502.16880 | CORAL (Cross-Step Representation Alignment), ACL 2025 |
| MASSV | Paper | 2025.05 | Cerebras | arXiv 2505.10526 | MASSV (multimodal SD draft) |

📝 Strictness notes
  • HASS, Falcon — ⚠️ Partial on-policy: multi-step draft trajectory / glancing distillation uses drafter samples for a subset of the training signal. Listed because the on-policy leg drives the gains.

📋 Technical details

| Method | Drafter type | On-/Off-policy | Loss | Notes |
| --- | --- | --- | --- | --- |
| EAGLE-3 | Self-speculative (uses target features) | On-policy multi-step (TTT) | Smooth-L1 (feature) + CE (token) | “Training-Time Test” simulates draft rollouts during training. |
| HASS | Self-speculative | Partial on-policy (multi-step draft trajectory in training) | Multi-step KD CE + feature alignment | Harmonized objective + harmonized context alignment. |
| Online Speculative Decoding (OSD) | Draft-model | On-policy / online | Online KD on rejected tokens | The canonical online/on-policy SD paper. |
| Falcon | Draft-model (semi-AR) | Partial on-policy (glancing uses draft samples) | Glancing CE + KD | Coupled Sequential Glancing Distillation. |
| SpecForge | Self-speculative (EAGLE-3 framework) | On-policy TTT supported | EAGLE-3 losses | Open-source EAGLE-3 training framework. |
| DistillSpec | Draft-model | On-policy (draft samples) | Choice of FKL/RKL/JSD/TVD | The seminal “OPD for SD” paper. |
| SpecKD | Distillation framework | On-policy with verification gating | Gated KL (accepted tokens only) | Inverts SD: uses accept/reject as KD-loss gate. |
| ReSpec | Draft-model | On-policy online (RL rollouts) | KD weighted by rollout reward | Drafter evolved during RL training. |
| DVI | Self-speculative | On-policy online (RL on verifier signal) | KL → reward-masked CE + PG | Continual online training. |
| CORAL | Self-speculative | On-policy multi-step | Cross-step alignment + CE | Fixes draft training/inference mismatch. |
| MASSV | Multimodal draft-model | On-policy (drafter samples) | KD CE | Multimodal speculative-decoding drafter. |

🛠️ Frameworks & Toolkits

Open-source frameworks / libraries that support OPD (with student-generated rollouts during distillation training).
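
To make "student-generated rollouts during distillation training" concrete, here is a toy sketch of one OPD loop on a three-token vocabulary. The student samples from its own distribution (C1), the teacher scores that sample (C2), and the per-sample signal is `log P_teacher − log P_student`, the form the rllm row in the technical-details table gives for its reverse-KL advantage. The single-token "rollout", the REINFORCE-style update, the learning rate, and all names are illustrative assumptions; none of the listed frameworks is implemented this way.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    """KL(student || teacher) over the full toy vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def opd_step(logits, teacher, rng, lr=0.1):
    probs = softmax(logits)
    # C1: the student samples its own "rollout" (a single token here).
    y = rng.choices(range(len(probs)), weights=probs)[0]
    # C2: the teacher scores that sample; advantage = log p_teacher - log p_student.
    adv = math.log(teacher[y]) - math.log(probs[y])
    # Policy-gradient update: d log p(y) / d logit_i = 1[i == y] - p_i.
    return [z + lr * adv * ((1.0 if i == y else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

teacher = [0.7, 0.2, 0.1]   # fixed toy teacher distribution
logits = [0.0, 0.0, 0.0]    # student starts uniform
rng = random.Random(0)

before = reverse_kl(softmax(logits), teacher)
for _ in range(2000):
    logits = opd_step(logits, teacher, rng)
after = reverse_kl(softmax(logits), teacher)
print(f"reverse KL before: {before:.4f}, after: {after:.6f}")
```

In expectation this update follows the negative gradient of the reverse KL, which is why the student distribution drifts toward the teacher's.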

| Resource | 🌟 Stars | Date | Org | OPD Code Path | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| verl | Stars | 2024.10 | ByteDance Seed | `recipe/on_policy_distill/`; Async OPD doc | verl |
| trl | Stars | 2019.11 | Hugging Face | `trl/experimental/{gkd,gold,minillm,sdft,self_distillation,sdpo,nash_md,xpo,online_dpo}/` | TRL: the most diverse OPD trainer collection |
| LLaMA-Factory | Stars | 2023.05 | hiyouga | | LLaMA-Factory: OPD only via TRL integration; not native |
| ms-swift | Stars | 2024 | Alibaba ModelScope | `examples/train/rlhf/gkd/`, multimodal/megatron variants | ms-swift: wraps TRL GKDTrainer |
| rllm | Stars | 2025.01 | UC Berkeley Sky | `examples/math_distill/` (incl. `opsd/` self-distill); `rllm/trainer/distill/` | rllm |
| ROLL | Stars | 2025.06 | Alibaba | `roll/pipeline/distill/` | ROLL: with VLM support and various-divergence library |
| AReaL | Stars | 2025.06 | AntGroup / Tsinghua | `examples/distillation/gsm8k_grpo_distill.yaml` | AReaL |
| RL | Stars | 2026.01 | NVIDIA | `nemo_rl/algorithms/distillation.py` | NeMo-RL: native OPD with student rollouts |
| SkyRL | Stars | 2025.04 | UC Berkeley NovaSky | `skyrl-train/examples/on_policy_distillation/`; blog | SkyRL |
| slime | Stars | 2025.06 | Tsinghua THUDM | `examples/on_policy_distillation/` | slime: RL framework behind GLM-4.5/4.6/4.7 |
| KDFlow | Stars | 2026.03 | BJTU (Songming Zhang et al.) | `examples/on_policy_kd/` (LLM + Qwen3-VL); arXiv 2603.01875 | KDFlow: KD-first framework; SGLang teacher + FSDP2 student decoupled; cross-tokenizer & VLM native |


📋 Technical details

| Framework | KL Direction(s) | OPD Primary? | Backbone | Multi-GPU | Notes |
| --- | --- | --- | --- | --- | --- |
| verl | Forward KL with sparse top-k teacher logits | One of many | PyTorch | Yes (FSDP, Megatron, Ray) | `recipe/on_policy_distill/` is the most production-ready OPD recipe; integrates with vLLM. |
| TRL | FKL, RKL, GJSD (β); GOLD trainer; SDFT trainer; MiniLLM trainer | One of many; most diverse OPD collection | PyTorch | Yes (Accelerate, DeepSpeed) | `trl/experimental/` contains gkd, gold, minillm, sdft, self_distillation, sdpo, nash_md, xpo, online_dpo, papo, prm. The single broadest OPD trainer set. |
| LLaMA-Factory | Via TRL integration | No native | PyTorch | Yes | Most-starred fine-tuning framework. |
| ms-swift | Same as TRL GKD | One of many | PyTorch | Yes (DeepSpeed, Megatron) | Wraps TRL GKDTrainer; multimodal variants. |
| rllm (Berkeley) | Reverse KL (advantage = log P_teacher − log P_student) | Primary in math_distill example | PyTorch | Single (tinker) + Multi-GPU (verl) | Self-distill subdir `opsd/`. |
| ROLL | Multiple divergences (`various_divergence.py`) | First-class DistillPipeline | PyTorch | Yes (Megatron) | VLM support. |
| AReaL | KL-controlled (off-policy default; integrates into GRPO) | One of many | PyTorch | Yes (async distributed) | `distill_loss_weight`. |
| NeMo-RL | FKL / RKL / mixed (configurable `kl_type`) | OPD documented | PyTorch | Yes (Ray + Megatron + vLLM) | Replaces archived NeMo-Aligner. |
| SkyRL | Reverse KL + importance sampling | OPD added Nov 2025 (PR #585) | PyTorch | Yes (Ray + vLLM/SGLang) | Notion blog “On-Policy Distillation in SkyRL”. |
| slime | Reverse KL token-level | OPD as additive penalty on any advantage estimator | PyTorch + Megatron | Yes (SGLang teacher mode) | Behind GLM-4.5/4.6/4.7. |
| KDFlow | FKL / RKL / JSD / AKL + Skewed-KL/RKL variants | Yes, KD-first | PyTorch | Yes (Ray + SGLang teacher + FSDP2 student) | Decoupled backends; transmits teacher hidden states (zero-copy) and recomputes logits on student to cut comm cost; 1.44–6.36× speedup over homogeneous-backend baselines. Native cross-tokenizer; VLM support (Qwen3-VL). Colocate mode shares GPUs via SGLang sleep/wakeup. |
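
The verl row mentions forward KL computed from sparse top-k teacher logits. One way such a loss could look is sketched below: the teacher ships only its k most likely tokens with their probabilities, and the student is penalised on those entries alone. The function name, the input layout, and the choice to renormalise the teacher mass over the kept entries are all assumptions made for illustration; this is not verl's code.

```python
import math

def topk_forward_kl(teacher_topk, student_logprobs):
    """Forward-KL-style loss restricted to the teacher's top-k tokens.

    teacher_topk: list of (token_id, teacher_prob) pairs, k entries (hypothetical layout).
    student_logprobs: student log-probabilities indexed by token_id over the full vocab.
    The teacher mass is renormalised over the kept entries (an assumption).
    """
    mass = sum(p for _, p in teacher_topk)
    loss = 0.0
    for tok, p in teacher_topk:
        p_norm = p / mass
        loss += p_norm * (math.log(p_norm) - student_logprobs[tok])
    return loss

# Toy example: vocabulary of 3 tokens, teacher sends its top-2.
teacher_topk = [(0, 0.6), (2, 0.3)]
student_probs = [0.5, 0.2, 0.3]
student_logprobs = [math.log(p) for p in student_probs]

loss = topk_forward_kl(teacher_topk, student_logprobs)
print(f"top-k forward KL: {loss:.4f}")
```

Sending k entries per position instead of a full-vocabulary row is what keeps teacher-to-student traffic small; the price is that student mass outside the top-k set is only penalised indirectly.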

Excluded (no native OPD support, or distillation pipeline is offline / fixed-corpus rather than student-rollout): axolotl, OpenRLHF, allenai/open-instruct, prime-rl, TextBrewer (pre-LLM era), open-r1 (off-policy SFT + GRPO), Modelopt, Tunix v0.1.6, DistillKit, easydistill.

📝 Strictness notes — frameworks judged by whether they ship a recipe that satisfies C1+C2
  • LLaMA-Factory — ⚠️ OPD only available via TRL integration; no native OPD trainer. Listed for users who already use LLaMA-Factory and want to know it can host OPD.

🏭 Industrial / Production Model Reports

Flagship model technical reports that publicly describe on-policy distillation in their post-training pipeline. Reports whose tech papers don’t actually describe student-rollout distillation (Qwen2.5, Qwen2.5-Math, MiMo predecessor, DeepSeek-V3 / V3.2-Exp / R1, Phi-4, Hunyuan-Large / A13B, Kimi-K2 / K2.5, Yi-Lightning, DistilQwen) are excluded.

| Resource | 🌟 Stars | Date | Org | Paper | Title / Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3 | Stars | 2025.05 | Alibaba Qwen | arXiv 2505.09388 | Qwen3 (canonical OPD recipe) |
| Qwen3-Coder | Stars | 2026.03 | Alibaba Qwen | Tech report | Qwen3-Coder |
| gemma | Stars | 2024.07 | Google DeepMind | arXiv 2408.00118 | Gemma 2 (explicit OPD) |
| GLM-4.5 | Stars | 2025.08 | Zhipu / Z.ai | arXiv 2508.06471 | GLM-4.5 / 4.6 |
| GLM-5 | Stars | 2026.02 | Zhipu / Z.ai | arXiv 2602.15763 | GLM-5 (cross-stage OPD) |
| MiMo-V2-Flash | Stars | 2026.01 | Xiaomi | arXiv 2601.02780 | MiMo-V2-Flash (MOPD) |
| Nemotron Cascade 2 | Paper | 2026.03 | NVIDIA | arXiv 2603.19220 · HF Collection · project | Nemotron Cascade 2 (multi-domain OPD; “we sample y∼π_inf(·\|x)”); HF-only release |
| DeepSeek-V4 | Paper | 2026.04 | DeepSeek-AI | Tech Report · V4-Pro · V4-Flash | DeepSeek-V4 (multi-teacher OPD replaces unified mixed-RL stage) |


📋 Technical details

| Model | Stage(s) using OPD | Mechanism | Notes |
| --- | --- | --- | --- |
| Qwen3 | Strong-to-Weak Distillation | Two-phase: (1) off-policy SFT cold-start with /think and /no_think teacher samples; (2) on-policy phase in which the student generates and the teacher provides logit-KL targets | Reports ~10× cheaper than RL for equal performance. The canonical industrial OPD recipe. Inspired the Thinking Machines blog. |
| Qwen3-Coder-Next | Distillation of multi-experts into 80A3 student | Combined SFT + on-policy logit alignment | Production scaling of Qwen3 recipe. |
| Gemma 2 | Post-training | “We also use on-policy distillation, where the student generates completions from the SFT prompts”; KL on student samples | Among the first production models to name OPD. |
| GLM-5 | Throughout post-training | “On-Policy Cross-Stage Distillation”: a final anti-forgetting refinement applied between stages | Generalises Qwen3 recipe to “OPD as a stage glue”. |
| GLM-4.5 / 4.6 | Multi-stage post-training | Expert iteration; SFT distillation merges experts into hybrid generalist | Predecessors of GLM-5. |
| MiMo-V2-Flash | Post-training | Multi-Teacher On-Policy Distillation (MOPD): “the student model samples from its own evolving distribution and receives token-level supervision from domain-specific teachers” | Multi-teacher OPD; per-token MOPD advantage formula. |
| Nemotron Cascade 2 | Between Cascade RL stages | Multi-Domain On-Policy Distillation (MOPD): “we sample y∼π_inf(·\|x)”; teacher provides token-level distillation advantage | |
| DeepSeek-V4 | Post-training (replaces unified mixed-RL stage) | Multi-teacher OPD: domain specialists trained independently (SFT + GRPO per domain: math, code, agent, IF), then a unified student optimises reverse-KL against the specialist set on its own rollouts | Full-vocabulary KL (not token-level estimate) stabilises gradients when specialists disagree; first DeepSeek release where OPD replaces the RL consolidation stage from V3 / R1. V4-Pro 1.6T MoE; V4-Flash 284B. |

📝 Strictness notes
  • GLM-4.5 / 4.6 — ⚠️ Tech report describes “expert iteration + RL” without explicit OPD wording. Kept as predecessor of GLM-5 which does have explicit cross-stage OPD.
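
The DeepSeek-V4 row distinguishes a full-vocabulary KL from a token-level estimate. The sketch below shows the difference on one position of a toy three-token vocabulary: the full-vocabulary reverse KL sums over every token, while the token-level estimate looks only at the one token the student happened to sample. The estimate matches the exact value in expectation but varies from sample to sample and can even be negative. The numbers and function names are illustrative assumptions, not DeepSeek's implementation, and the sketch illustrates the variance of the loss value only, not of its gradient.

```python
import math

def full_vocab_reverse_kl(student, teacher):
    """Exact KL(student || teacher), summed over the whole vocabulary."""
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(student, teacher) if s > 0)

def sampled_token_estimate(student, teacher, y):
    """Single-sample estimate using only the sampled token y."""
    return math.log(student[y]) - math.log(teacher[y])

student = [0.5, 0.3, 0.2]   # toy student distribution at one position
teacher = [0.6, 0.3, 0.1]   # toy teacher distribution at the same position

exact = full_vocab_reverse_kl(student, teacher)
per_token = [sampled_token_estimate(student, teacher, y) for y in range(len(student))]
# Averaging the single-sample values under the student recovers the exact KL.
expected = sum(s * e for s, e in zip(student, per_token))

print(f"exact: {exact:.4f}")
print("per-token estimates:", [round(e, 4) for e in per_token])
print(f"expectation of estimates: {expected:.4f}")
```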

🌟 Curator’s Picks — where to start

Opinionated reading order for someone starting an OPD project today.

| # | Why it’s the pick | Resource |
| --- | --- | --- |
| 1 | Clearest one-page explanation of why OPD beats both SFT and RL on token efficiency. | Thinking Machines Lab blog (Oct 2025) |
| 2 | The production recipe everyone is now copying. Read §4. | Qwen3 Technical Report |
| 3 | Reproducible OPD in <200 lines on a real training stack. | tinker-cookbook recipes/distillation |
| 4 | “Theory of OPD”: when it works, when it fails. | THUNLP Rethinking OPD (2604.13016) |
| 5 | The paper that named OPSD and established the privileged-context pattern. | Self-Distilled Reasoner (2601.18734) |
| 6 | Crystallises OPD as a special case of KL-constrained RL with reward extrapolation. | G-OPD (2602.12125) |
| 7 | Catalogue of 50+ methods; read as an index, not a taxonomy. | Tencent OPD Survey (2604.00626) |
| 8 | Most diverse open-source OPD trainer collection. | TRL experimental/ |
| 9 | Black-box OPD seed paper: adversarial discriminator as on-policy reward. | GAD (2511.10643) |
| 10 | Empirical failure-modes paper; saves a week of debugging. | Revisiting OPD (2603.25562) |

🤝 Contributing

PRs are very welcome. When adding an entry, please attempt to fill the technical-details columns (loss / divergence, data source, teacher access, granularity). If you cannot determine these by reading the paper or repo, leave a ? — that’s still useful.
