Rubric-based On-policy Distillation
Summary
This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.
Source: https://huggingface.co/papers/2605.07396
Abstract
Rubric-based on-policy distillation demonstrates superior sample efficiency compared to traditional logit-based methods while maintaining compatibility with black-box scenarios.
On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then uses these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms advanced logit-based OPD methods across most scenarios, achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
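Concretely, the abstract describes a two-stage loop: induce a prompt-specific rubric by contrasting teacher and student responses, then score fresh student rollouts against that rubric to obtain rewards for on-policy optimization. Below is a minimal Python sketch of that loop; the judge-based rubric induction, the binary per-criterion grader, and every function name here are illustrative assumptions, not the paper's actual implementation (see the official repo for that).

# Minimal sketch of the ROPD loop as stated in the abstract. The function
# names and the judge/grader interfaces are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """A prompt-specific rubric: a list of checkable criteria."""
    criteria: List[str]


def induce_rubric(prompt: str, teacher_resp: str, student_resp: str,
                  judge: Callable[[str], List[str]]) -> Rubric:
    # Contrast teacher and student answers and ask a judge LLM to list
    # criteria that the teacher satisfies (assumption: the abstract only
    # says rubrics are induced from teacher-student contrasts).
    query = (
        f"Prompt: {prompt}\n"
        f"Teacher answer: {teacher_resp}\n"
        f"Student answer: {student_resp}\n"
        "List concrete criteria that distinguish the teacher's answer."
    )
    return Rubric(criteria=judge(query))


def score_rollout(rollout: str, rubric: Rubric,
                  grader: Callable[[str, str], bool]) -> float:
    # Reward in [0, 1]: fraction of rubric criteria the rollout satisfies.
    if not rubric.criteria:
        return 0.0
    hits = sum(grader(rollout, criterion) for criterion in rubric.criteria)
    return hits / len(rubric.criteria)


def ropd_iteration(prompts: List[str],
                   teacher: Callable[[str], str],
                   student: Callable[[str], str],
                   judge: Callable[[str], List[str]],
                   grader: Callable[[str, str], bool],
                   update: Callable[[str, List[str], List[float]], None],
                   n_rollouts: int = 4) -> None:
    # One on-policy iteration: induce a rubric per prompt, score fresh
    # student rollouts against it, and hand (prompt, rollouts, rewards) to
    # any caller-supplied policy-gradient update (e.g. REINFORCE or GRPO).
    for prompt in prompts:
        rubric = induce_rubric(prompt, teacher(prompt), student(prompt), judge)
        rollouts = [student(prompt) for _ in range(n_rollouts)]  # on-policy
        rewards = [score_rollout(r, rubric, grader) for r in rollouts]
        update(prompt, rollouts, rewards)

Because the rewards come only from teacher-generated text and a grader, this loop needs no access to teacher logits, which is what makes the approach black-box compatible.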
Get this paper in your agent:
hf papers read 2605.07396
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
Reasoning Compression with Mixed-Policy Distillation
This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
This paper introduces D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy self-distillation during supervised fine-tuning. It allows models to learn new concepts or styles without compromising their efficient few-step inference capabilities.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.