Rubric-based On-policy Distillation

Hugging Face Daily Papers

Summary

This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To demonstrate this, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then uses these rubrics to score student rollouts for on-policy optimization. Empirically, ROPD outperforms advanced logit-based OPD methods across most scenarios, achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
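The abstract's two-stage loop can be sketched in code. The following is a minimal illustration only, not the authors' implementation: the student, teacher, and judge are assumed objects, and induce_rubric, score, and policy_gradient_step are hypothetical interfaces standing in for whatever rubric inducer, LLM judge, and RL optimizer the paper actually uses.

import random

# Minimal sketch of rubric-based on-policy distillation (ROPD), following the
# abstract's description. All objects and methods below are hypothetical
# stand-ins; the paper's actual rubric format, judge, and optimizer may differ.

def ropd_train(student, teacher, judge, prompts, num_steps, k=4):
    # Stage 1: induce a prompt-specific rubric from a teacher-student contrast.
    # A rubric is a structured set of criteria on which the teacher's
    # response is preferable to the student's.
    rubrics = {}
    for prompt in prompts:
        teacher_resp = teacher.generate(prompt)
        student_resp = student.generate(prompt)
        rubrics[prompt] = judge.induce_rubric(prompt, teacher_resp, student_resp)

    # Stage 2: on-policy optimization. Rollouts come from the *student*, and
    # rubric scores serve as rewards. No teacher logits are needed, so the
    # teacher can be a black-box (e.g., API-only) model.
    for _ in range(num_steps):
        batch = random.sample(prompts, min(8, len(prompts)))
        rollouts, rewards = [], []
        for prompt in batch:
            for _ in range(k):
                resp = student.generate(prompt)
                rollouts.append((prompt, resp))
                rewards.append(judge.score(prompt, resp, rubrics[prompt]))
        # Any policy-gradient-style update (PPO, GRPO, etc.) fits here.
        student.policy_gradient_step(rollouts, rewards)
    return student

The property the sketch highlights is that stage 2 never queries the teacher's logits: once rubrics are induced, distillation reduces to rubric-scored reinforcement learning on the student's own rollouts.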

Source: https://huggingface.co/papers/2605.07396

Get this paper in your agent:

hf papers read 2605.07396

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers

This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.

Reasoning Compression with Mixed-Policy Distillation

arXiv cs.AI

This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.