The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers

Summary

This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
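To ground the objective the abstract refers to, here is a minimal PyTorch sketch of a per-token TopK reverse-KL distillation loss with an optional stop-gradient variant. This is an illustration under our own assumptions, not the paper's exact formulation: we truncate to the student's top-k tokens, renormalize over that set, and optionally detach the renormalized weights (the name reverse_kl_topk and the flag stop_grad_weights are ours).

    import torch
    import torch.nn.functional as F

    def reverse_kl_topk(student_logits, teacher_logits, k=64, stop_grad_weights=True):
        """Per-token reverse KL(student || teacher) restricted to the
        student's top-k tokens. A sketch, not the paper's exact objective.

        On-policy distillation evaluates this loss at the positions of a
        trajectory sampled from the student; teacher_logits are scored
        without gradients so only the student is updated.
        """
        log_p_s = F.log_softmax(student_logits, dim=-1)  # [T, V]
        log_p_t = F.log_softmax(teacher_logits, dim=-1)  # [T, V]

        # Truncate to the student's k most likely tokens at each position.
        idx = log_p_s.topk(k, dim=-1).indices            # [T, k]
        lps = log_p_s.gather(-1, idx)
        lpt = log_p_t.gather(-1, idx)

        # Renormalize the truncated student distribution over the top-k set.
        w = (lps - lps.logsumexp(-1, keepdim=True)).exp()
        if stop_grad_weights:
            # Detaching the weights is one way to remove the extra gradient
            # term introduced by truncating and renormalizing the support.
            w = w.detach()

        return (w * (lps - lpt)).sum(-1).mean()

    # Toy usage: T positions of a student-sampled trajectory, vocab size V.
    T, V = 8, 32000
    student_logits = torch.randn(T, V, requires_grad=True)
    with torch.no_grad():
        teacher_logits = torch.randn(T, V)  # teacher scored under no_grad
    loss = reverse_kl_topk(student_logits, teacher_logits, k=64)
    loss.backward()

With stop_grad_weights=False, gradient also flows through the renormalized weights w, so the truncated objective picks up an extra gradient term; detaching w keeps only the direct log-ratio term, which is the kind of stop-gradient correction the abstract describes.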

Source: https://huggingface.co/papers/2605.11182



Similar Articles

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Hugging Face Daily Papers

This paper proposes an empirical 'sparse-to-dense' reward principle for language-model post-training: scarce labeled data is best spent on sparse outcome rewards to discover a strong teacher, whose dense token-level signal then compresses that capability into a student via distillation. The authors show that this staged approach, bridging sparse RL and on-policy distillation, outperforms direct GRPO on deployment-sized models in math benchmarks.
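As a reading aid, here is a hypothetical sketch of the staged recipe this summary describes; every name and signature below is a placeholder of ours, not the authors' API.

    from typing import Callable, Sequence

    def sparse_to_dense_posttrain(
        labeled_prompts: Sequence[str],
        unlabeled_prompts: Sequence[str],
        large_model,
        small_model,
        rl_sparse: Callable,       # e.g. GRPO with a pass/fail verifier reward
        distill_dense: Callable,   # e.g. on-policy distillation with per-token KL
    ):
        # Stage 1 (sparse): spend the scarce labels on outcome-level RL to
        # turn the large model into a strong teacher ("teacher discovery").
        teacher = rl_sparse(large_model, labeled_prompts)

        # Stage 2 (dense): use the teacher's per-token signal as a dense
        # reward to train the deployment-sized student ("compression").
        student = distill_dense(small_model, teacher, unlabeled_prompts)
        return student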