on-policy-distillation

#on-policy-distillation

@louieworth: New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects. OPD combines on-policy rollouts with dense te…

X AI KOLs Following ↗ · yesterday Cached

This blog post discusses On-Policy Distillation (OPD), a technique that combines on-policy rollouts with dense teacher supervision, and highlights its promise, three failure modes, and the author's new paper on the topic.

Trajectory-Refined Distillation

Hugging Face Daily Papers ↗ · 3d ago Cached

Trajectory-Refined Distillation (TRD) addresses prefix failure in on-policy distillation for LLMs by correcting student rollouts at the trajectory level before distillation, consistently outperforming prior baselines across benchmarks.

On the Geometry of On-Policy Distillation

Hugging Face Daily Papers ↗ · 5d ago Cached

This paper characterizes the unique parameter space dynamics of on-policy distillation (OPD) for large language models, showing that it exhibits relaxed off-principal updates and subspace locking, distinguishing it from supervised fine-tuning and reinforcement learning with verifiable rewards.

On-policy distillation: one of the hottest terms on PapersWithCode [R]

Reddit r/MachineLearning ↗ · 5d ago

Hugging Face's Niels Rogge presents on-policy distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4. The term is now featured on PapersWithCode, with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

arXiv cs.LG ↗ · 2026-06-03 Cached

Introduces FiRe-OPD, a method for on-policy distillation in LLMs that filters low-quality trajectories and applies soft reweighting to emphasize informative tokens, achieving improved performance in strong-to-weak, single-teacher, and multi-teacher settings.

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

arXiv cs.CL ↗ · 2026-06-02 Cached

This paper identifies limitations in token-level supervision for on-policy distillation of LLMs and proposes TOPD, which uses near-future trajectory information to better identify divergent reasoning states and distribute guidance across multiple tokens, achieving gains on AIME benchmarks.

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

arXiv cs.CL ↗ · 2026-06-01 Cached

Identifies Supervision Fidelity Decay (SFD) in on-policy distillation, where teacher supervision degrades as student sequences lengthen, and proposes Lookahead Group Reward (LGR) to mitigate SFD, improving performance on math and code benchmarks.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Hugging Face Daily Papers ↗ · 2026-05-31 Cached

OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to a 28.64% improvement over standard OPD on math benchmarks.

@neural_avb: If yall are interested in On Policy Distillation, check this specific repo. Somebody put together a curated collection …

X AI KOLs Timeline ↗ · 2026-05-29 Cached

A GitHub repo offering a curated, annotated collection of papers and tools for on-policy distillation, including a getting-started section.

Trust-Region Behavior Blending for On-Policy Distillation

Hugging Face Daily Papers ↗ · 2026-05-29 Cached

Trust-Region Behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

Draft-OPD applies on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding. It achieves over 5x lossless acceleration and improves upon EAGLE-3 and DFlash by 23% and 13%, respectively.

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Hugging Face Daily Papers ↗ · 2026-05-26 Cached

This paper introduces 'token teachability' to distinguish learnable from incompatible teacher signals in on-policy distillation, and proposes TA-OPD which selects high-teachability tokens, achieving strong performance with only 5% of tokens.

On-Policy Distillation (5 minute read)

TLDR AI ↗ · 2026-05-26

This paper introduces on-policy distillation, which trains a student model on its own trajectories with teacher token-level KL supervision to fix train-inference mismatch, unifying forward-KL, reverse-KL, and JSD losses, with reverse-KL favored for smaller students.
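The three token-level objectives named in this summary can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`token_divergence`, `opd_loss`) are hypothetical, and it assumes the student's sampled trajectory has already been scored by both models, so each position comes with student and teacher logits over the same vocabulary. Sampling and gradient updates are left out.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def token_divergence(student_logits, teacher_logits, kind="reverse_kl"):
    """Divergence between student and teacher at one token position."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    if kind == "forward_kl":   # KL(teacher || student): mass-covering
        return kl(t, s)
    if kind == "reverse_kl":   # KL(student || teacher): mode-seeking
        return kl(s, t)
    if kind == "jsd":          # symmetric Jensen-Shannon divergence
        mix = [(si + ti) / 2 for si, ti in zip(s, t)]
        return 0.5 * kl(s, mix) + 0.5 * kl(t, mix)
    raise ValueError(f"unknown divergence: {kind}")

def opd_loss(student_logits_per_pos, teacher_logits_per_pos, kind="reverse_kl"):
    """Mean per-token divergence along one student-sampled trajectory.

    Assumption: both lists hold logits for the same positions of a sequence
    that the student itself generated, which is what makes this on-policy.
    """
    if not student_logits_per_pos:
        raise ValueError("trajectory must contain at least one position")
    if len(student_logits_per_pos) != len(teacher_logits_per_pos):
        raise ValueError("student and teacher must cover the same positions")
    per_token = [
        token_divergence(s, t, kind)
        for s, t in zip(student_logits_per_pos, teacher_logits_per_pos)
    ]
    return sum(per_token) / len(per_token)
```

Calling `opd_loss(student, teacher, "reverse_kl")` penalizes the student for placing probability where the teacher places little, which is one common intuition for why a reverse-KL objective may suit a smaller student that cannot cover every teacher mode.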

@NielsRogge: One of the hottest terms in AI right now is "On-policy distillation". It is a post-training technique in which a studen…

X AI KOLs Timeline ↗ · 2026-05-25 Cached

On-policy distillation is highlighted as a hot post-training technique combining distillation with online RL, now listed on PapersWithCode with 183 citing papers.

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

arXiv cs.AI ↗ · 2026-05-25 Cached

This paper introduces EDGE-OPD, a modification of on-policy self-distillation for LLMs that uses guided rollouts and evidence masks to internalize privileged context without degrading general capabilities, showing success in rare-token identity settings.

@jiqizhixin: What if the best teacher for a language model isn’t the strongest one? Researchers from Tsinghua University and collabo…

X AI KOLs Timeline ↗ · 2026-05-22 Cached

Researchers from Tsinghua University and collaborators systematically investigate on-policy distillation (OPD) for LLMs, revealing that success requires shared thinking patterns and genuinely new capabilities from the teacher, and propose practical strategies like off-policy cold start and teacher-aligned prompt selection.

@lateinteraction: ICYMI: read the blog on Pedagogical RL Instead of sampling blindly from your LLM, leverage the label used for RLVR! Lea…

X AI KOLs Following ↗ · 2026-05-15 Cached

Introduces Pedagogical RL, a method that leverages privileged information to guide the sampling of successful trajectories for LLM reasoning, achieving up to 40% relative gains over GRPO and on-policy distillation.

@dbreunig: Great teachers craft demonstrations their students could have built themselves.

X AI KOLs Following ↗ · 2026-05-14 Cached

A tweet from Souradip Chakraborty proposes using privileged information to actively sample rollouts in reinforcement learning, contrasting with traditional blind sampling methods. The tweet is prefaced by a quote about great teachers crafting demonstrations that students could build themselves.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL ↗ · 2026-05-13 Cached

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

This paper presents a comprehensive empirical study of on-policy distillation for large language models, identifying failure mechanisms such as distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
