on-policy-distillation


Learning to Foresee: Unveiling and Unlocking the Efficiency of On-Policy Distillation

arXiv cs.CL · yesterday

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
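
For readers new to the setup these cards share: in on-policy distillation the student generates its own rollouts and is pulled toward the teacher token by token, typically via a per-token reverse KL. Below is a minimal sketch of that generic loop, assuming Hugging Face-style `generate`/logits APIs; all function and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, tokenizer, prompts, max_new_tokens=256):
    # 1) On-policy rollout: sample completions from the *student* itself.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        rollouts = student.generate(**inputs, do_sample=True,
                                    max_new_tokens=max_new_tokens)

    # 2) Score the full token sequence with both models (teacher frozen).
    student_logits = student(rollouts).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits[:, :-1]

    # 3) Per-token reverse KL, KL(student || teacher), averaged over the
    #    generated positions only (prompt tokens are masked out).
    prompt_len = inputs["input_ids"].shape[1]
    gen_mask = torch.zeros_like(rollouts[:, 1:], dtype=torch.bool)
    gen_mask[:, prompt_len - 1:] = True

    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1)   # [batch, seq-1]
    return kl[gen_mask].mean()
```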


The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers · 3d ago

This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
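
A "stop-gradient objective" admits a simple generic form: freeze the student's own probabilities where they act as expectation weights, so gradients flow only through the log-ratio. A hedged sketch of that one idea (this is a common reading of the term; the paper's exact objective may differ):

```python
import torch.nn.functional as F

def stop_grad_reverse_kl(student_logits, teacher_logits):
    # teacher_logits are assumed precomputed without gradients.
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    w = log_p_s.exp().detach()             # stop-gradient on the weights
    return (w * (log_p_s - log_p_t)).sum(-1).mean()
```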


Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Hugging Face Daily Papers · 3d ago

This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.
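
As a hedged illustration of what such a per-token diagnostic can look like (names are mine, not the paper's): for a fixed rollout, score each position by how strongly the teacher disagrees with the student.

```python
import torch.nn.functional as F

def per_token_signals(student_logits, teacher_logits, tokens):
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Disagreement per position, plus the teacher's log-prob of the
    # token the student actually sampled.
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1)
    t_lp = log_p_t.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return kl, t_lp   # high KL + low teacher log-prob flags tokens to inspect
```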


SFT, RL, and On-Policy Distillation Through a Distributional Lens (19 minute read)

TLDR AI · 3d ago

This article analyzes post-training methods for language models through a distributional perspective, comparing how SFT, RL, and on-policy distillation reshape model distributions and impact phenomena like catastrophic forgetting.
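
The three objectives the article compares can be set side by side in their standard forms (notation mine): SFT takes an expectation over a fixed dataset, while RL and on-policy distillation take expectations over the model's own samples, which is what drives their different distributional effects.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{SFT}}(\theta) &= -\,\mathbb{E}_{(x,y)\sim \mathcal{D}}
    \big[\log \pi_\theta(y \mid x)\big] \\
\mathcal{J}_{\mathrm{RL}}(\theta) &= \mathbb{E}_{y\sim \pi_\theta(\cdot\mid x)}
    \big[r(x,y)\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \\
\mathcal{L}_{\mathrm{OPD}}(\theta) &= \mathbb{E}_{y\sim \pi_\theta(\cdot\mid x)}
    \Big[\textstyle\sum_t \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x, y_{<t})
    \,\|\, \pi_{\mathrm{T}}(\cdot \mid x, y_{<t})\big)\Big]
\end{align*}
```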


@QingQ77: Collecting open-source code and papers on On-Policy Distillation and Self-Distillation for training LLMs/VLMs/Agents, tagged by four dimensions: teacher source, supervision signal, rollout usage, and training stage. https://g…

X AI KOLs Timeline · 4d ago

AwesomeOPD is a curated list of open-source code and papers on On-Policy Distillation (OPD) and Self-Distillation for training LLMs, VLMs, and Agents. Each resource is tagged along four dimensions: teacher source, supervision signal, rollout usage, and training stage.


The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Hugging Face Daily Papers · 5d ago

This paper identifies a safety threshold in on-policy distillation with reward extrapolation, beyond which structured-output tasks lose format preservation. Below the threshold, a 1.7B student matches an 8B SFT baseline on Amazon Fashion tasks with roughly one-fifth the parameters.
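
The summary does not spell out the extrapolation scheme, but one common shape for extrapolated guidance makes the "cliff" concrete: amplify the teacher-student log-ratio by a factor $\alpha$, which sharpens the target until some threshold $\alpha^{*}$ past which formatting constraints collapse. A hedged illustration only, not necessarily the paper's formulation:

```latex
\[
\tilde{\pi}(y_t \mid x, y_{<t}) \;\propto\;
  \pi_{\mathrm{T}}(y_t \mid x, y_{<t})
  \left( \frac{\pi_{\mathrm{T}}(y_t \mid x, y_{<t})}
              {\pi_{\theta}(y_t \mid x, y_{<t})} \right)^{\!\alpha},
\qquad \alpha \le \alpha^{*}.
\]
```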


Flow-OPD: On-Policy Distillation for Flow Matching Models

Hugging Face Daily Papers · 6d ago

Flow-OPD introduces a two-stage on-policy distillation framework for Flow Matching text-to-image models, significantly improving generation quality and alignment metrics on Stable Diffusion 3.5 Medium.
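
As a hedged sketch of the general recipe only (Flow-OPD's actual two stages are not reproduced here): distillation of a flow matching model becomes on-policy when the student's velocity field is matched to the teacher's along the trajectory the student itself integrates. All callables and names below are illustrative.

```python
import torch

def velocity_distill_loss(student, teacher, x0, prompt_emb, steps=8):
    x, loss = x0, 0.0
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps, device=x.device)
        v_s = student(x, t, prompt_emb)            # student velocity field
        with torch.no_grad():
            v_t = teacher(x, t, prompt_emb)        # teacher velocity field
        loss = loss + ((v_s - v_t) ** 2).mean()    # match along own path
        x = (x + v_s / steps).detach()             # Euler step, on-policy
    return loss / steps
```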


Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Papers with Code Trending · 2026-05-01

The paper introduces PRISM, a method that inserts a distribution-alignment stage between supervised fine-tuning and reinforcement learning to mitigate distributional drift in multimodal models. It uses a black-box adversarial game with an MoE discriminator to improve RLVR performance on models like Qwen3-VL.
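
In black-box alignment of this kind the teacher supplies only samples, not logits, so the alignment signal must come from a learned critic. A minimal sketch of such an adversarial game with a REINFORCE update for the student (all names hypothetical; PRISM's MoE discriminator and exact losses are not reproduced):

```python
import torch

def adversarial_alignment_step(disc, student_emb, teacher_emb, student_logprobs):
    # Discriminator: output in (0, 1); 1 = teacher-like, 0 = student-like.
    d_loss = -(torch.log(disc(teacher_emb)).mean()
               + torch.log(1 - disc(student_emb)).mean())
    # Student sees the teacher only through samples: reward it for
    # fooling the critic, propagated via its own sequence log-probs.
    reward = torch.log(disc(student_emb)).detach().squeeze(-1)
    g_loss = -(reward * student_logprobs).mean()
    return d_loss, g_loss
```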


The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Hugging Face Daily Papers · 2026-04-18

This paper identifies that on-policy distillation (OPD) in language models leads to severe overconfidence due to information mismatch between training and deployment, and proposes CaOPD, a calibration-aware framework that improves both performance and confidence reliability.
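
The overconfidence being diagnosed is a gap between stated confidence and actual accuracy, which a standard expected calibration error (ECE) makes measurable. A minimal sketch of the diagnostic only, not of CaOPD itself:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    # Expected calibration error: bin predictions by confidence and
    # average the |confidence - accuracy| gap, weighted by bin size.
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (confidences > lo) & (confidences <= hi)
        if m.any():
            gap = abs(confidences[m].mean() - correct[m].mean())
            total += m.mean() * gap
    return total
```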
