Tag
This blog post discusses On-Policy Distillation (OPD), a technique that combines on-policy rollouts with dense teacher supervision, and highlights its promise, three failure modes, and the author's new paper on the topic.
Trajectory-Refined Distillation (TRD) addresses prefix failure in on-policy distillation for LLMs by correcting student rollouts at the trajectory level before distillation, consistently outperforming prior baselines across benchmarks.
This paper characterizes the unique parameter space dynamics of on-policy distillation (OPD) for large language models, showing that it exhibits relaxed off-principal updates and subspace locking, distinguishing it from supervised fine-tuning and reinforcement learning with verifiable rewards.
Hugging Face's Niels introduces On-Policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.
Introduces FiRe-OPD, a method for on-policy distillation in LLMs that filters low-quality trajectories and applies soft reweighting to emphasize informative tokens, achieving improved performance in strong-to-weak, single-teacher, and multi-teacher settings.
This paper identifies limitations in token-level supervision for on-policy distillation of LLMs and proposes TOPD, which uses near-future trajectory information to better identify divergent reasoning states and distribute guidance across multiple tokens, achieving gains on AIME benchmarks.
Identifies Supervision Fidelity Decay (SFD) in on-policy distillation, where teacher supervision degrades as student sequences lengthen, and proposes Lookahead Group Reward (LGR) to mitigate SFD, improving performance on math and code benchmarks.
OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to a 28.64% improvement on math benchmarks over standard OPD.
A curated collection of papers and tools for On-Policy Distillation, organized and annotated with a getting-started section, shared via a GitHub repo.
Trust-Region behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.
Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13%, respectively.
This paper introduces 'token teachability' to distinguish learnable from incompatible teacher signals in on-policy distillation, and proposes TA-OPD, which selects high-teachability tokens, achieving strong performance with only 5% of tokens.
This paper introduces on-policy distillation, which trains a student model on its own trajectories with teacher token-level KL supervision to fix train-inference mismatch, unifying forward-KL, reverse-KL, and JSD losses, with reverse-KL favored for smaller students.
On-policy distillation is highlighted as a hot post-training technique combining distillation with online RL, now listed on PapersWithCode with 183 citing papers.
This paper introduces EDGE-OPD, a modification of on-policy self-distillation for LLMs that uses guided rollouts and evidence masks to internalize privileged context without degrading general capabilities, showing success in rare-token identity settings.
Researchers from Tsinghua University and collaborators systematically investigate on-policy distillation (OPD) for LLMs, revealing that success requires shared thinking patterns and genuinely new capabilities from the teacher, and propose practical strategies like off-policy cold start and teacher-aligned prompt selection.
Introduces Pedagogical RL, a method that leverages privileged information to guide the sampling of successful trajectories for LLM reasoning, achieving up to 40% relative gains over GRPO and on-policy distillation.
A tweet from Souradip Chakraborty proposes using privileged information to actively sample rollouts in reinforcement learning, contrasting with traditional blind sampling methods. The tweet is prefaced by a quote about great teachers crafting demonstrations that students could build themselves.
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
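Several entries above refer to token-level forward-KL, reverse-KL, and JSD supervision computed on student-generated trajectories. The following is a minimal sketch of those three divergences, not any listed paper's implementation. The function names, the toy logits, and the `beta`-weighted form of JSD are assumptions made for illustration; a real setup would sample the sequence from the student model and read logits from both models at each position.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def token_divergence(student_logits, teacher_logits, kind="reverse_kl", beta=0.5):
    """Divergence between teacher and student at one token position."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    if kind == "forward_kl":   # KL(teacher || student)
        return kl(t, s)
    if kind == "reverse_kl":   # KL(student || teacher)
        return kl(s, t)
    if kind == "jsd":          # assumed beta-weighted Jensen-Shannon form
        mix = [beta * ti + (1 - beta) * si for ti, si in zip(t, s)]
        return beta * kl(t, mix) + (1 - beta) * kl(s, mix)
    raise ValueError(f"unknown divergence: {kind}")

def opd_loss(student_logits_seq, teacher_logits_seq, kind="reverse_kl"):
    """Mean per-token divergence along one student-sampled trajectory.

    Both arguments hold one logit vector per position of a sequence that is
    assumed to have been generated by the student (the on-policy part).
    """
    per_token = [
        token_divergence(s, t, kind)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical logits for a 2-token student rollout over a 3-word vocabulary.
student_seq = [[2.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
teacher_seq = [[0.0, 2.0, 0.0], [0.5, 0.5, 0.0]]

for kind in ("forward_kl", "reverse_kl", "jsd"):
    print(kind, opd_loss(student_seq, teacher_seq, kind))
```

In this toy rollout only the first position contributes to the loss, since the teacher and student agree exactly at the second. That is the sense in which the supervision is dense: every position of the student's own trajectory receives a separate signal.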