Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

arXiv cs.AI Papers

Summary

This paper proposes Rubric-Conditioned Self-Distillation (RCSD), a framework that conditions the teacher on fine-grained rubric criteria to provide token-level guidance during on-policy self-distillation, improving reasoning performance over the scalar-reward method GRPO and the reference-conditioned method OPSD.

arXiv:2606.19327v1 Announce Type: new

Cached at: 06/18/26, 05:42 AM

# Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
Source: [https://arxiv.org/html/2606.19327](https://arxiv.org/html/2606.19327)
Siyi Gu (Yale University, siyi.gu@yale.edu), Jialin Chen (Yale University, jialin.chen@yale.edu), Sophia Zhou (Yale University, sophia.zhou@yale.edu), Arman Cohan† (Yale University, arman.cohan@yale.edu), Rex Ying (Yale University, rex.ying@yale.edu)

###### Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks; the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average. Code available: [https://github.com/carriegu0818/RCSD](https://github.com/carriegu0818/RCSD).

## 1 Introduction

Recent advances in large language models (LLMs) have led to substantial progress in reasoning, problem-solving, and instruction following. Reinforcement learning has been particularly effective in domains such as mathematics and code generation, where final outcomes can be automatically verified. However, the Group Relative Policy Optimization (GRPO) objective typically optimizes sparse outcome-level rewards: the model is rewarded only after generating a complete response, for example based on whether the final answer is correct or whether execution succeeds (Shao et al., [2024](https://arxiv.org/html/2606.19327#bib.bib13)). While effective in verifiable settings, such supervision provides little information about *why* a trajectory succeeds or fails, creating a persistent credit-assignment bottleneck in online learning (Hübotter et al., [2026](https://arxiv.org/html/2606.19327#bib.bib48); Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47)).

One natural way to enrich this supervision is through *rubrics*. Rather than scoring a response holistically, rubrics decompose quality into explicit criteria, yielding a more structured and interpretable representation of what makes an answer strong (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4); Zhang et al., [2025a](https://arxiv.org/html/2606.19327#bib.bib8)). Recent work has shown that rubric-based evaluation can extend post-training beyond strictly verifiable tasks by supplying richer judgments than binary correctness alone (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4)). Yet in most existing work, rubric information enters training only through the *reward*: criterion-level judgments are aggregated into a single scalar score and then optimized with RL-style updates on the entire trajectory. As a result, the rich information inside such textual rubric feedback is largely discarded during optimization.

A recent line of work addresses sparse outcome rewards by replacing them with dense teacher supervision. In on-policy distillation (OPD) and on-policy self-distillation (OPSD), the student learns from its own sampled trajectories while a teacher provides token-level guidance along those rollouts (Agarwal et al., [2024](https://arxiv.org/html/2606.19327#bib.bib35); Xu et al., [2024](https://arxiv.org/html/2606.19327#bib.bib36); Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47); Hübotter et al., [2026](https://arxiv.org/html/2606.19327#bib.bib48); Ye et al., [2026](https://arxiv.org/html/2606.19327#bib.bib53)). These methods alleviate the mismatch of off-policy imitation and provide a denser learning signal than final-answer rewards. However, existing approaches typically construct the teacher from privileged reference solutions or stronger-model outputs. Such supervision is therefore tied to particular privileged trajectories, each of which represents only one valid reasoning trace and may not cleanly expose the underlying dimensions along which a response should be evaluated. In this sense, rationale-based supervision can over-specify *how* an answer should be produced, without cleanly identifying *what properties* the answer should satisfy.

![Refer to caption](https://arxiv.org/html/2606.19327v1/figures/illu.png)

Figure 1: Illustration of how optimization signals differ between RL, OPSD, and RCSD for an incorrect student trajectory.

In this work, we introduce *Rubric-Conditioned Self-Distillation* (RCSD), a post-training framework that uses rubrics as privileged teacher-side supervision for on-policy self-distillation. Our key idea is that rubrics should not only score responses after generation; they should shape token-level learning during optimization. Instead of collapsing rubric feedback into a scalar reward, we condition the teacher on criterion-level rubric information and distill its token-level guidance on the student's own sampled trajectories. The resulting training signal is simultaneously *criterion-aware*, *on-policy*, and *token-level*: it preserves distinctions across evaluation dimensions, operates on student-generated rollouts rather than fixed off-policy traces, and provides dense guidance without reducing feedback to a single number. Figure [1](https://arxiv.org/html/2606.19327#S1.F1) illustrates the difference in optimization signal on the same incorrect trajectory: RL assigns one reward to the full sequence, OPSD provides dense supervision toward a reference trajectory, and RCSD provides dense rubric-conditioned feedback that preserves correct steps while penalizing the specific local error.

We operationalize our idea with a two-stage pipeline. We first train a rubric generator to amortize instance-specific evaluation criteria from privileged supervision, and then train a reasoner with rubric-conditioned teacher guidance. More broadly, we reframe rubrics as a structured supervision interface for model self-improvement, especially in hard-to-verify and open-ended tasks where high-quality responses are not fully captured by automatic verification or scalar outcome rewards. Across diverse reasoning benchmarks, RCSD achieves the best overall average (70.6), surpassing GRPO by 1.0 points and OPSD by 0.9 points. Notably, the gains are pronounced on scientific and rubric-based reasoning tasks, where response quality is poorly captured by scalar outcome-level rewards alone.

## 2 Method

We propose to preserve fine-grained, structured feedback during optimization by using learned rubrics as privileged teacher-side supervision for on-policy self-distillation. Rather than compressing rubric feedback into a single number, we expose it to a privileged teacher, which then provides dense token-level guidance on the student's own sampled trajectories. Figure [2](https://arxiv.org/html/2606.19327#S2.F2) situates our method relative to two standard alternatives. Reinforcement learning applies outcome-level supervision through a sparse scalar reward. On-policy self-distillation replaces this with token-level teacher guidance, but typically conditions the teacher on a privileged reference answer. In contrast, our method redefines this supervision interface: instead of conditioning the privileged teacher on a single reference trajectory, we condition it on a rubric that specifies criterion-level properties of a strong response.

![Refer to caption](https://arxiv.org/html/2606.19327v1/figures/RSD_final.png)

Figure 2: RCSD uses rubrics as privileged teacher-side supervision for on-policy self-distillation. In contrast to RL, which compresses feedback into a scalar reward, and OPSD, which conditions the teacher on a reference answer, RCSD learns question-specific rubrics in Stage I and reuses them in Stage II to induce structured token-level guidance on the student's own reasoning trajectory.

### 2.1 Preliminaries

We use $p_T$ and $p_S$ to denote the teacher and student distributions, respectively.

Off-Policy Distillation (Hinton et al., [2015](https://arxiv.org/html/2606.19327#bib.bib33)) trains a student to imitate trajectories generated by a teacher. In its most general form, the objective can be written as

$$\mathcal{L}_{\mathrm{off}} = \mathbb{E}_{x\sim\mathcal{D},\, y\sim p_T(\cdot\mid x)}\left[\sum_{t=1}^{|y|} D\!\left(p_T(\cdot\mid x, y_{<t}) \,\|\, p_S(\cdot\mid x, y_{<t})\right)\right], \tag{1}$$

where $D(\cdot\,\|\,\cdot)$ denotes a divergence between the teacher and student next-token distributions. Off-policy distillation provides dense token-level supervision, but it suffers from a distribution mismatch: the student is trained on teacher-generated prefixes, whereas at inference time it must condition on its own generated prefixes, leading to compounding errors and degraded performance.

On-Policy Distillation (OPD) (Agarwal et al., [2024](https://arxiv.org/html/2606.19327#bib.bib35); Gu et al., [2023](https://arxiv.org/html/2606.19327#bib.bib51)) addresses this mismatch by sampling trajectories from the student rather than the teacher. Given an input $x$, the student first generates an on-policy rollout $\hat{y}\sim p_S(\cdot\mid x)$. The teacher and student are then compared along the student's own trajectory, yielding the objective

$$\mathcal{L}_{\mathrm{OPD}} = \mathbb{E}_{x\sim\mathcal{D},\; \hat{y}\sim p_S(\cdot\mid x)}\left[\frac{1}{|\hat{y}|}\sum_{t=1}^{|\hat{y}|} D\!\left(p_T(\cdot\mid x, \hat{y}_{<t}) \,\|\, p_S(\cdot\mid x, \hat{y}_{<t})\right)\right]. \tag{2}$$

However, on-policy distillation still relies heavily on token-level imitation of a teacher distribution, which often encourages the student to follow a single preferred response, ignoring the space of valid reasoning paths.
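The contrast between objectives (1) and (2) comes down to which policy generates the prefixes and whether the sum is length-normalized. The following is a minimal sketch under stated assumptions: the `teacher` and `student` functions, the three-token vocabulary, and the choice of KL for $D$ are illustrative stand-ins, not the paper's implementation.

```python
import math
import random

def kl(p, q):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_rollout(policy, length, rng):
    """Sample `length` tokens autoregressively from policy(prefix)."""
    prefix = []
    for _ in range(length):
        probs = policy(prefix)
        prefix.append(rng.choices(range(len(probs)), weights=probs)[0])
    return prefix

def summed_divergence(teacher, student, tokens):
    """Sum of per-token D(teacher || student) along the prefixes of `tokens`."""
    return sum(kl(teacher(tokens[:t]), student(tokens[:t]))
               for t in range(len(tokens)))

# Toy policies over a 3-token vocabulary (assumed for illustration).
def student(prefix):
    return [0.5, 0.3, 0.2]

def teacher(prefix):
    # Prefers token 2 right after token 0, and token 0 otherwise.
    return [0.1, 0.1, 0.8] if prefix and prefix[-1] == 0 else [0.7, 0.2, 0.1]

rng = random.Random(0)

# Off-policy, objective (1): prefixes are sampled from the teacher.
y = sample_rollout(teacher, length=6, rng=rng)
loss_off = summed_divergence(teacher, student, y)

# On-policy, objective (2): prefixes are sampled from the student,
# and the sum is divided by the rollout length.
y_hat = sample_rollout(student, length=6, rng=rng)
loss_opd = summed_divergence(teacher, student, y_hat) / len(y_hat)

print(f"off-policy: {loss_off:.3f}  on-policy: {loss_opd:.3f}")
```

In the on-policy case the student is only ever scored on contexts it actually reaches, which is the property the later objectives build on.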

On-Policy Self-Distillation (OPSD) (Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47); Hübotter et al., [2026](https://arxiv.org/html/2606.19327#bib.bib48)) refers to the setting in which the teacher and student are derived from the same underlying model rather than from two separately trained models: a single model instantiates both a student policy and a privileged teacher policy. Given a reasoning dataset $\mathcal{S}=\{(x,z)\}$, where $z$ denotes privileged information such as a gold solution, a reference answer, or other side information, the student observes only the base input $x$ and generates an on-policy response $\hat{y}\sim p_S(\cdot\mid x)$, while the teacher is conditioned on privileged information $z$ that is unavailable to the student at inference time. The OPSD objective is

$$\mathcal{L}_{\mathrm{OPSD}} = \mathbb{E}_{(x,z)\sim\mathcal{S},\; \hat{y}\sim p_S(\cdot\mid x)}\left[\frac{1}{|\hat{y}|}\sum_{t=1}^{|\hat{y}|} D\!\left(p_T(\cdot\mid x, z, \hat{y}_{<t}) \,\|\, p_S(\cdot\mid x, \hat{y}_{<t})\right)\right]. \tag{3}$$

While OPSD introduces privileged information to guide learning, it still conditions the teacher on a specific reference solution contained in $z$. This can be restrictive for reasoning tasks, where solution quality is better characterized by satisfying a set of criteria than by matching a sole target. These limitations motivate a more flexible supervision interface that provides structured, multi-dimensional criterion-level guidance.
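To make the "one model, two contexts" idea behind objective (3) concrete, here is a small sketch. Everything in it is a hypothetical toy: `shared_model` is a hand-written stand-in for a language model that ignores the rollout prefix, and the `answer=<tok>` convention for privileged information is invented for this example.

```python
import math

VOCAB = ["A", "B", "C"]

def shared_model(context, prefix):
    """Toy stand-in for a single LM: next-token probabilities over VOCAB.

    Any token named in the context as 'answer=<tok>' receives extra mass,
    mimicking a model that exploits privileged information when it sees it.
    The prefix is accepted for interface parity but unused in this toy.
    """
    weights = [5.0 if f"answer={tok}" in context else 1.0 for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def opsd_loss(x, z, rollout):
    """Length-normalized divergence along a student rollout, as in (3)."""
    total = 0.0
    for t in range(len(rollout)):
        prefix = rollout[:t]
        p_teacher = shared_model(x + " " + z, prefix)  # same model, sees z
        p_student = shared_model(x, prefix)            # same model, sees x only
        total += kl(p_teacher, p_student)
    return total / len(rollout)

x = "question: which option is correct?"
z = "answer=B"
rollout = ["A", "C", "B"]  # pretend this was sampled from the student
loss = opsd_loss(x, z, rollout)
```

The teacher and student differ only in what they are shown, so the loss vanishes when the privileged context carries no information.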

### 2.2 Motivation: Beyond Reward Optimization and Reference-Conditioned Distillation

We position RCSD against two common paradigms for improving reasoning models: reward-based optimization and reference-conditioned distillation.

#### (1) Reward optimization requires sparse external judgment.

GRPO has been highly effective in verifiable domains, where correctness can be directly checked by exact-match answers or unit tests (Shao et al., [2024](https://arxiv.org/html/2606.19327#bib.bib13); Guo et al., [2025](https://arxiv.org/html/2606.19327#bib.bib14); Chollet et al., [2025](https://arxiv.org/html/2606.19327#bib.bib17); Jain et al., [2024](https://arxiv.org/html/2606.19327#bib.bib18)). Recent work extends this paradigm to hard-to-verify or non-verifiable domains by using LLM-as-a-Judge rewards (Li et al., [2026](https://arxiv.org/html/2606.19327#bib.bib58); Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4)). However, this extension still reduces supervision to sparse scalar reward signals, which provide limited information about which intermediate reasoning steps should be improved. It also introduces an additional external evaluator, increasing inference and training cost, and may further amplify reward bias from the judge model through its preferences, calibration errors, or inconsistent interpretation of rubrics. In contrast, RCSD does not require a separate reward model or judge during distillation. Instead, we provide rubrics directly to the teacher model and let the teacher generate rubric-conditioned reasoning on its own outputs. This turns rubrics into a structured supervision interface, allowing the student to learn from dense token-level teacher guidance rather than optimizing against sparse scalar reward signals.

#### (2) Reference-conditioned distillation is path-specific.

OPSD conditions the teacher on a single reference trajectory, which can induce path-specific supervision. When the student deviates from this trajectory, even slightly, the teacher signal may encourage global revision rather than localized correction. Empirically, we observe that OPSD trajectories often recompute the same intermediate quantities or revise earlier steps without new information, leading to long and redundant reasoning chains. This suggests that OPSD provides token-level supervision but lacks explicit criterion-level credit assignment. Our proposed method, RCSD, addresses this limitation by conditioning the teacher on rubric criteria instead of a single reference path, yielding supervision that is both on-policy and criterion-aware.

### 2.3 Problem Setup

Let $x$ denote an input question. We consider two structured outputs associated with $x$: a rubric $r$ and an answer $y$. A rubric is a structured set of question-specific evaluation criteria, $r=\{c_1,\dots,c_K\}$, where each criterion $c_k$ contains a title, a natural language description, and an importance weight in $\{\textit{Essential}, \textit{Important}, \textit{Optional}, \textit{Pitfall}\}$. Conceptually, a rubric serves as an intermediate supervision interface that specifies what constitutes a good solution and provides richer, more interpretable information for guiding model optimization. A high-quality rubric encourages the model to focus on satisfying high-level criteria rather than imitating a specific reasoning path.
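One way to picture the rubric structure described above is as a small typed record. The sketch below is an assumption about representation only: the class names, the `render` format, and the three example criteria are hypothetical and are not taken from the paper or its datasets.

```python
from dataclasses import dataclass
from typing import List

# The four importance levels named in the problem setup.
WEIGHTS = ("Essential", "Important", "Optional", "Pitfall")

@dataclass
class Criterion:
    title: str
    description: str
    weight: str  # must be one of WEIGHTS

    def __post_init__(self):
        if self.weight not in WEIGHTS:
            raise ValueError(f"unknown importance weight: {self.weight!r}")

@dataclass
class Rubric:
    criteria: List[Criterion]

    def render(self) -> str:
        """Serialize the rubric as plain text, one criterion per line."""
        return "\n".join(
            f"[{c.weight}] {c.title}: {c.description}" for c in self.criteria
        )

# Hypothetical criteria for a physics question, for illustration only.
rubric = Rubric([
    Criterion("States conservation law", "Invokes conservation of energy.", "Essential"),
    Criterion("Reports units", "Gives the final answer with units.", "Important"),
    Criterion("Sign error", "Does not flip the sign of the potential term.", "Pitfall"),
])
text = rubric.render()
print(text)
```

A plain-text rendering like this is one plausible form in which a rubric could be placed into a teacher's context.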

Our goal is to learn a model that produces high-quality reasoning trajectories under instance-specific evaluation criteria. Such criteria are useful during training, but are unavailable to the student at inference time. We therefore treat rubrics as *privileged supervision*: they are provided to the teacher during training and distilled into the student through on-policy token-level guidance. Since distilling high-quality rubrics could be costly, we automate the rubric generation process and further factorize learning into two sequential stages:

1. Stage I: Learning a rubric generator. We train a model to predict a rubric $r$ conditioned on the question $x$, amortizing instance-specific evaluation criteria into a reusable form.
2. Stage II: Rubric-conditioned reasoning. We train a reasoner to generate trajectories that satisfy rubric criteria, using the rubric as structured guidance during optimization.

### 2.4 Stage I: Learning a Rubric Generator

A practical challenge in rubric-based training is that high-quality instance-specific rubrics are expensive to obtain. Our first stage therefore amortizes privileged supervision into a standalone rubric generator. During training, the teacher is allowed to view both the question and a reference answer, while the student must learn to infer an appropriate rubric from the question alone.

#### Student policy.

The student rubric generator observes only the question: $p_S^R(r\mid x)$.

#### Teacher policy.

The teacher rubric generator receives privileged access to the question and reference answer $y^\star$: $p_T^R(r\mid x, y^\star)$. The reference answer is not used as a supervision target to be copied directly; rather, it serves as a privileged context that helps the teacher infer how to generate a correct response trajectory within its own distribution.

#### On-policy rubric distillation.

Given a sampled rubric rollout $\hat{r}\sim p_S^R(\cdot\mid x)$, we train the student to match the teacher's next-token distribution along the student's own rubric trajectory:

$$\mathcal{L}_{\text{rubric}} = \mathbb{E}_{\hat{r}\sim p_S^R(\cdot\mid x)}\left[\sum_{t=1}^{|\hat{r}|} D_{\mathrm{KL}}\Big(p_T^R(\cdot\mid x, y^\star, \hat{r}_{<t}) \;\|\; p_S^R(\cdot\mid x, \hat{r}_{<t})\Big)\right]. \tag{4}$$

This objective preserves the key advantage of on-policy self-distillation: the student is trained on its own sampled prefixes rather than teacher-generated ones. At the same time, the teacher can inject privileged information from the reference answer to shape the rubric-generation process. As a result, Stage I distills evaluation structure into a rubric generator that can produce question-specific criteria without requiring privileged inputs at test time.

### 2.5 Stage II: Rubric-Conditioned Reasoning

Given a rubric, the second stage trains a reasoner to generate answers that better satisfy instance-specific criteria. The key design choice is that the rubric is not merely appended as auxiliary prompt text for the student. Instead, it is provided as a privileged context to the teacher, which uses it to deliver criterion-aware token-level guidance on the student's own rollout.

#### Student policy.

The student reasoner observes only the question: $p_S^Y(y\mid x)$.

#### Teacher policy.

The teacher reasoner receives the question together with rubric feedback: $p_T^Y(y\mid x, r)$. Here $r$ denotes the learned rubric from the Stage I rubric generator. Conditioning the teacher on $r$ allows the training signal to reflect multiple dimensions of response quality, rather than a single scalar reward or a single reference trajectory.
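The asymmetry between the two policies can be sketched as two context builders that share the same partial rollout. The prompt templates, function names, and example strings below are hypothetical; the paper's actual prompts are not reproduced in this section.

```python
def student_context(question: str, rollout_prefix: str) -> str:
    """What the student conditions on: the question and its own partial answer."""
    return f"Question:\n{question}\n\nAnswer:\n{rollout_prefix}"

def teacher_context(question: str, rubric_text: str, rollout_prefix: str) -> str:
    """What the teacher conditions on: the same inputs plus the rubric."""
    return (
        f"Question:\n{question}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Answer:\n{rollout_prefix}"
    )

# Hypothetical example inputs.
question = "Why does ice float on water?"
rubric_text = "[Essential] Density: explains that ice is less dense than liquid water."
prefix = "Ice floats because"  # partial student rollout, i.e. the tokens before step t

s_ctx = student_context(question, prefix)
t_ctx = teacher_context(question, rubric_text, prefix)
```

Both contexts end in the same student-generated prefix, so the two next-token distributions are compared at exactly the same position; only the teacher's view contains the rubric.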

#### On-policy rubric-conditioned distillation.

Given an on-policy student rollout $\hat{y}\sim p_S^Y(\cdot\mid x)$, we minimize

$$\mathcal{L}_{\text{reason}} = \mathbb{E}_{\hat{y}\sim p_S^Y(\cdot\mid x)}\left[\sum_{t=1}^{|\hat{y}|} D_{\mathrm{KL}}\Big(p_T^Y(\cdot\mid x, r, \hat{y}_{<t}) \;\|\; p_S^Y(\cdot\mid x, \hat{y}_{<t})\Big)\right]. \tag{5}$$

This objective highlights the core advantage of on-policy distillation: dense supervision on student-generated trajectories. Rather than learning from a reference rationale or a scalarized reward, the student is guided by a teacher conditioned on criterion-level rubric information. Crucially, criterion-level structure is retained in the teacher signal itself, so optimization can distinguish different dimensions of partial correctness rather than compressing them into a single undifferentiated score.
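A short sketch may help show what "token-level" means for objective (5): each rollout position gets its own KL value, so positions where the rubric-conditioned teacher disagrees with the student stand out. The logits below are made-up numbers over a three-token vocabulary, not outputs of any real model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one position, computed from raw logits."""
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lt, ls))

def token_level_losses(teacher_seq, student_seq):
    """One KL value per rollout position; their sum is the inner sum in (5)."""
    return [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]

# Hypothetical logits for a 4-token student rollout.
student_seq = [[2.0, 0.0, 0.0]] * 4
teacher_seq = [
    [2.0, 0.0, 0.0],  # teacher agrees with the student
    [2.0, 0.0, 0.0],  # teacher agrees with the student
    [0.0, 3.0, 0.0],  # teacher strongly prefers a different token here
    [2.0, 0.1, 0.0],  # near agreement
]

per_token = token_level_losses(teacher_seq, student_seq)
loss = sum(per_token)
worst = max(range(len(per_token)), key=per_token.__getitem__)
print([round(v, 4) for v in per_token], "largest at position", worst)
```

In this toy case the loss is concentrated at the single position of disagreement, which is the kind of localized signal a scalar reward over the whole sequence cannot express.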

Reference-conditioned distillation supervises the student with one specific solution trajectory, which can introduce path-specific bias when multiple derivations are valid. Rubrics instead specify the criteria that a correct solution must satisfy, inducing a criterion-aware teacher distribution over many valid reasoning paths rather than a single reference path. This distinction naturally aligns with forward KL distillation. Minimizing $D_{\mathrm{KL}}(p_T(\cdot\mid x, r)\,\|\,p_S(\cdot\mid x))$ encourages the student to cover the support of the rubric-conditioned teacher distribution, preserving probability mass on alternative solutions that satisfy the same criteria. In contrast, more mode-seeking objectives may concentrate on dominant teacher modes and underrepresent valid but less likely derivations. Thus, forward KL provides a principled objective for transferring dense, criterion-level guidance while maintaining diversity across valid reasoning paths.
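The mass-covering versus mode-seeking contrast can be checked numerically on a toy case. Here the teacher spreads its mass over two "valid derivations" (indices 0 and 2), and the student is restricted to two hand-picked candidates, one broad and one peaked. All numbers are invented for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy teacher: two equally good modes (indices 0 and 2), little mass on index 1.
teacher = [0.495, 0.01, 0.495]

# Two candidate students from a deliberately restricted family.
broad = [1 / 3, 1 / 3, 1 / 3]   # covers both teacher modes
peaked = [0.98, 0.01, 0.01]     # commits to a single teacher mode

forward = {"broad": kl(teacher, broad), "peaked": kl(teacher, peaked)}
reverse = {"broad": kl(broad, teacher), "peaked": kl(peaked, teacher)}

best_forward = min(forward, key=forward.get)
best_reverse = min(reverse, key=reverse.get)
print("forward KL prefers:", best_forward)
print("reverse KL prefers:", best_reverse)
```

Forward KL penalizes the peaked student heavily for missing the second mode, while reverse KL penalizes the broad student for placing mass where the teacher has almost none. This is only a three-outcome illustration of the general tendency, not evidence about the paper's training runs.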

### 2.6 Training Procedure

Algorithm [1](https://arxiv.org/html/2606.19327#alg1) summarizes the full training procedure. Stage I amortizes expensive instance-specific criteria into a rubric generator, and Stage II uses those criteria to shape teacher-side token-level correction on student rollouts. The resulting framework preserves criterion-level structure, remains on-policy, and makes better use of the rich structured information in textual rubrics than scalar rewards do.

Algorithm 1 RCSD: Rubric-Conditioned On-Policy Self-Distillation

Require: Training set $\mathcal{D}=\{(x, r^\star, y^\star)\}$; rubric generator $p_S^R$; reasoner $p_S^Y$

1. Stage I: Learn rubric generator
2. for each training example $x$ with reference answer $y^\star$ do
3. Sample rubric trajectory $\hat{r}\sim p_S^R(\cdot\mid x)$
4. Compute teacher and student token distributions along $\hat{r}$
5. Update rubric generator by minimizing $\mathcal{L}_{\text{rubric}}$
6. end for
7. Stage II: Train rubric-conditioned reasoner
8. for each training example $x$ do
9. Obtain rubric $\hat{r}$ from $\hat{r}\sim p_S^R(\cdot\mid x)$
10. Sample answer rollout $\hat{y}\sim p_S^Y(\cdot\mid x)$
11. Compute teacher and student token distributions along $\hat{y}$
12. Update reasoner by minimizing $\mathcal{L}_{\text{reason}}$
13. end for
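The control flow of Algorithm 1 can be traced with a deliberately tiny stand-in. In the sketch below the "models" are toy bigram tables over a three-token vocabulary, the teachers are fixed hand-written distributions, and the rule that turns a sampled rubric into a Stage II teacher (`target`) is invented; none of this reflects the paper's models or hyperparameters. It only shows the two stages and the on-policy update pattern.

```python
import math
import random

V = 3        # toy vocabulary size
START = V    # row index used for the empty prefix

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def new_policy():
    """Toy bigram policy: one row of logits per previous token, plus START."""
    return [[0.0] * V for _ in range(V + 1)]

def sample(policy, length, rng):
    """Sample a rollout; return (tokens, context visited at each position)."""
    prev, tokens, contexts = START, [], []
    for _ in range(length):
        tok = rng.choices(range(V), weights=softmax(policy[prev]))[0]
        tokens.append(tok)
        contexts.append(prev)
        prev = tok
    return tokens, contexts

def distill(policy, teacher, steps, length, lr, rng):
    """On-policy forward-KL distillation along the policy's own rollouts.

    For a softmax student, the gradient of KL(teacher || student) with
    respect to the logits is (student_probs - teacher_probs).
    """
    for _ in range(steps):
        _, contexts = sample(policy, length, rng)
        for ctx in contexts:
            q, p = softmax(policy[ctx]), teacher(ctx)
            policy[ctx] = [w - lr * (qi - pi)
                           for w, qi, pi in zip(policy[ctx], q, p)]
    return policy

def peaked(token, mass=0.8):
    """Distribution with `mass` on one token, the rest spread uniformly."""
    return [mass if i == token else (1 - mass) / (V - 1) for i in range(V)]

rng = random.Random(0)

# Stage I: the rubric teacher sees a reference answer (here just a token id).
y_star = 2
rubric_gen = distill(new_policy(), lambda ctx: peaked(y_star), 300, 4, 0.5, rng)

# Stage II: sample a rubric from the trained generator and condition the
# reasoner's teacher on it. The student reasoner never sees the rubric.
r_hat, _ = sample(rubric_gen, 8, rng)
target = max(set(r_hat), key=r_hat.count)  # invented rubric-to-teacher rule
reasoner = distill(new_policy(), lambda ctx: peaked(target), 300, 4, 0.5, rng)
```

Only contexts that the student actually visits are updated, which mirrors steps 3 to 5 and 10 to 12 of the algorithm.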

RCSD introduces a new supervision interface for post-training by using rubrics as privileged structured guidance for on-policy self-distillation. Overall, Stage I learns *what* a strong response should satisfy, while Stage II learns *how* to realize those criteria along the student's own trajectory. Prior rubric-based RL typically uses rubrics only for outcome-level scoring, while standard OPSD preserves token-level optimization but ties supervision to a single privileged answer. RCSD differs from both by letting fine-grained rubric feedback directly guide on-policy learning.

## 3 Experiments

Our experiments are designed to answer the following questions: (1) whether rubric-guided token-level feedback improves both verifiable and open-ended reasoning performance compared with scalar-reward and reference-conditioned distillation baselines; (2) whether the resulting gains persist across model scales and remain robust on out-of-domain benchmarks; and (3) whether learned rubrics can approach the effectiveness of reference rubrics, and how sensitive RCSD is to rubric quality.

Data Construction. We follow the two-stage design described in Section [2.6](https://arxiv.org/html/2606.19327#S2.SS6) to construct rubric generation data and reasoning data. For rubric learning, we construct our training set based on RaR-Science (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4)) and RubricHub (Li et al., [2026](https://arxiv.org/html/2606.19327#bib.bib58)) for reference rubric supervision. For reasoner training, we additionally take natural_reasoning (Yuan et al., [2025](https://arxiv.org/html/2606.19327#bib.bib57)) and filter out entries with empty reference answers. The final dataset contains approximately 10k samples for rubric generation and 30k samples for reasoning generation.

Evaluation. We evaluate on a diverse set of science reasoning benchmarks: 1) verifiable: GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2606.19327#bib.bib40)), SciBench (Wang et al., [2023a](https://arxiv.org/html/2606.19327#bib.bib41)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.19327#bib.bib43)), and ResearchQA (Yifei et al., [2026](https://arxiv.org/html/2606.19327#bib.bib59)); 2) non-verifiable: RaR-Science (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4)) and RubricHub (Li et al., [2026](https://arxiv.org/html/2606.19327#bib.bib58)). For open-ended scientific tasks, we follow the setup of the original papers, using gpt-4.1-mini as the LLM-as-a-Judge and evaluating on a 500-example subset of the test set. To assess out-of-domain generalization, we further report results on medical question answering benchmarks, including MedMCQA (Pal et al., [2022](https://arxiv.org/html/2606.19327#bib.bib44)) and PubMedQA (Jin et al., [2019](https://arxiv.org/html/2606.19327#bib.bib45)). We follow the Language Model Open Science Evaluation framework for science reasoning ([https://github.com/GAIR-NLP/lm-open-science-evaluation](https://github.com/GAIR-NLP/lm-open-science-evaluation)).

Baselines. We compare the following methods: (1) *supervised fine-tuning*, which trains the student on the distilled CoT trajectories; (2) *GRPO* (Shao et al., [2024](https://arxiv.org/html/2606.19327#bib.bib13)), where the reward is a scalar produced by an LLM judge; (3) *Rubric-GRPO*, where the reward is aggregated by prompting an LLM-as-a-Judge to assign criterion-specific rubric rewards; and (4) *On-Policy Self-Distillation* (Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47)), where the teacher conditions on reference answers and provides dense token-level supervision on student trajectories.
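To see what is lost when rubric judgments are aggregated for a GRPO-style baseline, consider the sketch below. The numeric `LEVEL_WEIGHT` values and the weighted-average aggregation are assumptions made for illustration (this section does not specify the aggregation rule); the group standardization follows the usual GRPO form of subtracting the group mean and dividing by the group standard deviation.

```python
import statistics

# Hypothetical numeric weights for the four importance levels.
LEVEL_WEIGHT = {"Essential": 1.0, "Important": 0.7, "Optional": 0.3, "Pitfall": 0.9}

def scalar_reward(judgments):
    """Collapse per-criterion judgments into one number in [0, 1].

    `judgments` is a list of (importance_level, satisfied) pairs, as a judge
    model might return them. The weighted average is an assumed aggregation.
    """
    total = sum(LEVEL_WEIGHT[level] for level, _ in judgments)
    earned = sum(LEVEL_WEIGHT[level] for level, ok in judgments if ok)
    return earned / total

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Responses A and B fail *different* Important criteria; C satisfies all.
resp_a = [("Essential", True), ("Important", False), ("Important", True)]
resp_b = [("Essential", True), ("Important", True), ("Important", False)]
resp_c = [("Essential", True), ("Important", True), ("Important", True)]

rewards = [scalar_reward(r) for r in (resp_a, resp_b, resp_c)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses A and B end up with identical advantages even though they fail different criteria, and that single number is applied to every token of each response.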

Implementation details. We train with LoRA ($r=64$, $\alpha=128$) and AdamW with learning rate $5\times 10^{-6}$, batch size 32, and FlashAttention 3. Following previous implementations of self-distillation (Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47)), we train RCSD and OPSD for 100 steps and GRPO for 500 steps. More specifically, the rubric generator is trained for 1,000 steps with a maximum sequence length of 2,048. The reasoner is trained for 100 steps with a maximum completion length of 4,096. The teacher is fixed during training, and supervision is applied through token-level distillation on student-generated trajectories. At evaluation time, we use temperature 1.0, top-$p=0.95$, top-$k=-1$, min-$p=0.0$, and presence penalty 0.0, and results are averaged over 4 independent generations. For LLM-as-a-Judge evaluation, we report results from a single generation. Details are in Appendix [A](https://arxiv.org/html/2606.19327#A1).
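For readers who want the settings in one place, the values stated above can be collected into a plain configuration record. The class and field names are hypothetical and do not correspond to any particular training library; only the numbers come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters as reported in the implementation details."""
    lora_rank: int = 64
    lora_alpha: int = 128
    optimizer: str = "adamw"
    learning_rate: float = 5e-6
    batch_size: int = 32
    rubric_generator_steps: int = 1000
    rubric_max_seq_len: int = 2048
    reasoner_steps: int = 100
    reasoner_max_completion_len: int = 4096
    grpo_steps: int = 500          # baseline, for comparison
    teacher_frozen: bool = True    # the teacher is fixed during training

@dataclass(frozen=True)
class EvalSampling:
    """Decoding settings used at evaluation time."""
    temperature: float = 1.0
    top_p: float = 0.95
    top_k: int = -1                # conventionally: no top-k truncation
    min_p: float = 0.0
    presence_penalty: float = 0.0
    num_generations: int = 4       # a single generation for LLM-as-a-Judge tasks

train_cfg = TrainConfig()
eval_cfg = EvalSampling()
print(train_cfg)
print(eval_cfg)
```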

### 3.1 Main Results

Table 1: Main results on diverse reasoning benchmarks. Qwen3-8B is used as the backbone model. GRPO is trained for 500 steps while self-distillation-based methods are trained for 100 steps, following prior work's implementation. Higher is better for all metrics.

Main results are reported in Table [1](https://arxiv.org/html/2606.19327#S3.T1). RCSD achieves the best overall average, improving over the base Qwen3-8B model by 4.7 points and outperforming the strongest baseline, OPSD, by 0.9 points. The gains are especially pronounced on rubric-based reasoning benchmarks, where RCSD improves over the base model by 8.2 points on ResearchQA and 4.9 points on RubricHub. RCSD also obtains the best performance on SciBench, reaching 70.8, suggesting that rubric-guided self-distillation remains effective for scientific reasoning tasks that require preserving multiple criterion-level reasoning signals. Compared with GRPO and Rubric-GRPO, RCSD benefits from dense token-level supervision rather than relying on scalar rewards. Compared with OPSD, RCSD avoids distilling from a single reference-style answer and instead uses rubric-conditioned guidance, which appears to provide more flexible and task-relevant supervision. We also observe improvements on GPQA-D, PIQA, ResearchQA, and RubricHub, showing that the method improves not only scientific reasoning but also broader rubric-guided reasoning performance.

Generalization to Other Domains. We also evaluate generalization ability on medicine benchmarks in Table [2](https://arxiv.org/html/2606.19327#S3.T2). Although our method is trained primarily on scientific reasoning tasks rather than medical-domain data, it maintains competitive performance on both MedMCQA and PubMedQA. In particular, RCSD improves over the base Qwen3-8B model on both benchmarks, from 64.5 to 65.8 on MedMCQA and from 74.2 to 75.1 on PubMedQA. These results suggest that rubric-guided teacher supervision does not lead to catastrophic forgetting on adjacent knowledge-intensive domains, while preserving strong general reasoning ability beyond the training distribution.

Table 2: Generalization to medicine benchmarks. Best performance is bolded.

Table 3: Ablation on the loss type used for RCSD. All variants use the same Qwen3-8B backbone. Higher is better for all metrics.

#### Ablation on Loss Type

Table [3](https://arxiv.org/html/2606.19327#S3.T3) studies the effect of different distillation losses. Forward KL performs best overall, achieving the highest average of 70.6. Reverse KL is the second-best variant, with an average of 69.9, and performs best on RaR. JSD achieves the best result on RubricHub and matches Forward KL on ResearchQA, but shows a noticeable drop on GPQA-D, leading to a lower overall average of 69.6. These results suggest that Forward KL is the most effective objective for RCSD in this setting. One possible explanation is that Forward KL encourages the student to cover the teacher’s rubric-conditioned distribution more faithfully, preserving diverse criterion-aware reasoning signals. In contrast, Reverse KL may be more mode-seeking, while JSD provides a more conservative update that is stable but less effective in transferring the full teacher signal.
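To make the contrast between the three objectives concrete, here is a small sketch of the per-token divergences on toy next-token distributions. It is an illustration under assumptions rather than the paper's training loss: the function names, the toy probabilities, and the unweighted average over positions are all hypothetical, and every distribution is assumed strictly positive so that the logarithms are defined.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions with strictly positive entries."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))


def forward_kl(teacher, student):
    # KL(teacher || student): large when the student puts little mass where the teacher has mass.
    return kl(teacher, student)


def reverse_kl(teacher, student):
    # KL(student || teacher): large when the student puts mass where the teacher has little.
    return kl(student, teacher)


def jsd(teacher, student):
    # Jensen-Shannon divergence: symmetric and bounded above by log(2).
    mix = [(p + q) / 2 for p, q in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)


def entropy(dist):
    return -sum(p * math.log(p) for p in dist)


def token_level_loss(teacher_dists, student_dists, divergence):
    """Average a per-token divergence over the positions of one sampled trajectory."""
    losses = [divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(losses) / len(losses)


# Toy example: the teacher spreads its mass over two plausible tokens.
teacher = [0.49, 0.49, 0.02]
narrow_student = [0.96, 0.02, 0.02]   # collapsed onto a single mode
broad_student = [0.40, 0.40, 0.20]    # covers both modes, with some excess mass

for name, fn in [("forward KL", forward_kl), ("reverse KL", reverse_kl), ("JSD", jsd)]:
    print(name, round(fn(teacher, narrow_student), 4), round(fn(teacher, broad_student), 4))
```

In this toy case forward KL penalizes the collapsed student far more heavily than the broad one, while reverse KL separates the two much less. This is the usual mode-covering versus mode-seeking intuition, and it says nothing on its own about which objective trains better in practice.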

![Refer to caption](https://arxiv.org/html/2606.19327v1/figures/training_curve.png)

Figure 3: Training dynamics under different distillation losses: reverse KL (yellow), JSD (blue), and forward KL (green). We show student entropy, mean token length, and on-policy loss over training.

Figure [3](https://arxiv.org/html/2606.19327#S3.F3) further explains the loss-type ablation in Table [3](https://arxiv.org/html/2606.19327#S3.T3). Forward KL achieves the best benchmark performance and shows the most favorable training behavior: student entropy increases steadily, suggesting that the model preserves a broader rubric-conditioned output distribution rather than collapsing to narrow modes. In contrast, reverse KL consistently reduces entropy, reflecting its more mode-seeking tendency, while JSD produces a milder entropy increase. Mean token length is noisy across all objectives, with no clear systematic advantage, although JSD tends to generate slightly shorter responses later in training. The on-policy loss curves are also stable overall: forward KL decreases smoothly, reverse KL becomes increasingly negative, and JSD remains positive with a gradual decline. Together, these trends suggest that forward KL most effectively transfers the teacher’s criterion-aware signal, whereas reverse KL is more restrictive and JSD is stable but less effective.

Scaling Model Size. To study whether RCSD remains effective across scales, we evaluate Qwen3-1.7B, Qwen3-4B, and Qwen3-8B on RaR, ResearchQA, and RubricHub. Figure [4](https://arxiv.org/html/2606.19327#S3.F4) shows that RCSD consistently improves over the corresponding base model across all three benchmarks and model sizes. At 1.7B, RCSD improves performance by 2.7 points on RaR, 4.4 points on ResearchQA, and 6.6 points on RubricHub. At 4B, the gains are 4.9, 5.6, and 3.2 points, respectively, and at 8B, the gains are 8.9, 8.2, and 4.8 points. These results suggest that RCSD provides robust gains across model scales, while larger models achieve stronger absolute performance and continue to benefit from criterion-aware teacher signals on rubric-guided reasoning tasks.

Ablation on Rubric Source. Another central question is whether learned rubrics can approach the effectiveness of reference rubrics. Table [4](https://arxiv.org/html/2606.19327#S3.T4) compares GT Rubrics, where the teacher conditions on reference rubrics, with Generated Rubrics, where the teacher conditions on rubrics produced by the learned rubric generator. Overall, generated rubrics remain competitive across the benchmark suite, with only small gaps to GT rubrics on GPQA-D (64.5 vs. 65.2), SciBench (70.6 vs. 71.0), and PIQA (90.8 vs. 91.0). This indicates that the learned rubric generator preserves most of the benefit of reference rubrics. Generated rubrics contain slightly more criteria on average than reference rubrics (8.4 vs. 7.5), with a broader range of criterion counts (6–20 vs. 7–12), while having a similar average token length (236.6 vs. 248.7). This suggests that learned rubrics remain comparably informative for training despite not relying on manually written reference rubrics. This is important in practice: it means we do not rely on handcrafted or manually curated rubric annotations and that the two-stage pipeline is a viable instantiation of RCSD. Qualitative failure analysis is provided in Appendix [C.2](https://arxiv.org/html/2606.19327#A3.SS2). Our results further show that the model trained with GT rubrics outperforms all the baselines in Table [1](https://arxiv.org/html/2606.19327#S3.T1), demonstrating the effectiveness of token-level distillation on rubric feedback.

Table 4: Ablation on rubric source. GT (Ground Truth) Rubrics use reference rubrics during reasoner training, while Generated Rubrics correspond to the learned rubric generator.

![Refer to caption](https://arxiv.org/html/2606.19327v1/x1.png)
![Refer to caption](https://arxiv.org/html/2606.19327v1/x2.png)
![Refer to caption](https://arxiv.org/html/2606.19327v1/x3.png)

Figure 4: Model size ablation on rubric-based reasoning benchmarks at 1.7B, 4B, and 8B scales. RCSD consistently improves performance over the corresponding base model across RaR, ResearchQA, and RubricHub.

Table 5: Ablation on the necessity of the Stage-I rubric generator. +14b Direct and +8b Direct prompt Qwen3-14B and Qwen3-8B teachers to provide rubrics directly, while +RCSD uses rubrics produced by the learned Stage-I rubric generator.

Necessity of Stage-I Rubric Generator. We further study whether training benefits from the learned rubric generator compared with directly prompting the base model to generate rubrics for the teacher model. As shown in Table [5](https://arxiv.org/html/2606.19327#S3.T5), RCSD achieves a competitive average of 70.6, nearly matching +14b Direct at 70.7 and outperforming +8b Direct at 69.9. Notably, RCSD obtains the best results on SciBench, PIQA, and ResearchQA, suggesting that the learned generator can amortize rubric construction while preserving strong downstream supervision. Although +14b Direct slightly leads on the overall average, it requires prompting a larger teacher to produce rubrics directly; in contrast, the Stage-I generator provides a scalable way to generate task-specific rubrics for RCSD.

Table 6: Ablation on rubric quality. +Generic, +Noisy, +Random, and +Reduced use degraded rubric variants to test the sensitivity of distillation to rubric quality, while +Learned uses the full learned rubric supervision. Bold indicates the best result in each column.

Analysis of Rubric Quality Degradation. We evaluate the sensitivity of RCSD to rubric quality by providing the teacher with generic, random, noisy, or reduced rubrics. Generic uses a shared rubric template for all examples; Random samples a rubric from another question; Noisy combines half of the original rubric with half of a random rubric; and Reduced removes half of the rubric items. As shown in Table [6](https://arxiv.org/html/2606.19327#S3.T6), all variants improve over the base model, suggesting that criterion-style supervision remains useful even with imperfect rubrics and that the teacher can partially correct unreasonable rubric items. Nevertheless, +Learned achieves the best overall average and the strongest results on GPQA-D, PIQA, and ResearchQA, indicating that instance-specific learned rubrics provide the most reliable supervision. The competitiveness of degraded variants suggests that the teacher can compensate for imperfect rubric information, likely by using the reference answer to reconcile rubric errors. However, the advantage of +Learned shows that rubric relevance and coherence still matter. The strong performance of +Reduced further suggests that RCSD does not depend on verbose rubrics; concise but relevant criteria preserve most of the benefit.
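The four degraded variants described above can be sketched as simple list operations on rubric items. This is a hedged illustration, not the authors' code: the function names and the generic template text are hypothetical, and the text does not say which half of a rubric is kept or dropped, so keeping the leading items is an assumption.

```python
import random

# Hypothetical shared template used for every example in the "Generic" variant.
GENERIC_RUBRIC = [
    "The response answers the question that was asked.",
    "The reasoning is correct and clearly explained.",
]


def generic(_rubric):
    """Ignore the instance-specific rubric and return the shared template."""
    return list(GENERIC_RUBRIC)


def random_other(rubrics, index, rng):
    """Return the rubric of a different question, chosen uniformly at random."""
    candidates = [i for i in range(len(rubrics)) if i != index]
    return list(rubrics[rng.choice(candidates)])


def noisy(rubric, other):
    """Half of the original rubric followed by half of another question's rubric."""
    return rubric[: len(rubric) // 2] + other[: len(other) // 2]


def reduced(rubric):
    """Drop half of the rubric items (assumption: keep the first half, rounded up)."""
    return rubric[: (len(rubric) + 1) // 2]


# Three toy questions with eight placeholder criteria each.
rubrics = [[f"q{q}-criterion{c}" for c in range(8)] for q in range(3)]
rng = random.Random(0)
other = random_other(rubrics, 0, rng)

variants = {
    "generic": generic(rubrics[0]),
    "random": other,
    "noisy": noisy(rubrics[0], other),
    "reduced": reduced(rubrics[0]),
}
for name, items in variants.items():
    print(name, len(items))
```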

## 4 Related Work

#### LLM Post\-training

Reinforcement learning (RL) has become a critical post-training tool for improving multi-step reasoning in LLMs (Zhang et al., [2025b](https://arxiv.org/html/2606.19327#bib.bib11)), particularly in the form of *reinforcement learning with verifiable rewards* (RLVR) (Shao et al., [2024](https://arxiv.org/html/2606.19327#bib.bib13); Guo et al., [2025](https://arxiv.org/html/2606.19327#bib.bib14); Chollet et al., [2025](https://arxiv.org/html/2606.19327#bib.bib17); Jain et al., [2024](https://arxiv.org/html/2606.19327#bib.bib18)). Despite these advances, most existing RL methods for reasoning remain fundamentally *outcome-based*. This creates a key limitation for reasoning tasks: a single scalar reward is assigned to the entire sequence, so information about *where* and *how* the model failed is lost. Two responses may receive the same reward despite making very different mistakes. Recent work moves beyond scalar-only rewards by introducing finer-grained supervision, either through critique-augmented learning that provides natural-language feedback on sampled reasoning traces or through process reward models that assign credit to intermediate steps in a reasoning chain (Zhang et al., [2025c](https://arxiv.org/html/2606.19327#bib.bib2); Bi et al., [2025](https://arxiv.org/html/2606.19327#bib.bib50); Lightman et al., [2023](https://arxiv.org/html/2606.19327#bib.bib16); Setlur et al., [2024](https://arxiv.org/html/2606.19327#bib.bib55); Yao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib56)). Yet these methods are mainly developed for domains such as math, where intermediate steps are easier to evaluate. A complementary line of work on on-policy distillation replaces scalar outcome rewards with dense teacher guidance.
Classical knowledge distillation and sequence-level distillation train a student to imitate a teacher’s outputs, but typically operate off-policy on teacher-generated trajectories (Hinton et al., [2015](https://arxiv.org/html/2606.19327#bib.bib33); Kim and Rush, [2016](https://arxiv.org/html/2606.19327#bib.bib34)). Recent work on *on-policy distillation* and *on-policy self-distillation* addresses this mismatch by training the student on trajectories sampled from its own policy while using a teacher to provide token-level supervision (Agarwal et al., [2024](https://arxiv.org/html/2606.19327#bib.bib35); Xu et al., [2024](https://arxiv.org/html/2606.19327#bib.bib36); Zhao et al., [2026](https://arxiv.org/html/2606.19327#bib.bib47); Hübotter et al., [2026](https://arxiv.org/html/2606.19327#bib.bib48); Ye et al., [2026](https://arxiv.org/html/2606.19327#bib.bib53)). However, existing on-policy self-distillation methods typically construct the teacher from privileged reference solutions or textual feedback. Our work builds on this line but reshapes the supervision interface: instead of conditioning the teacher on a single reference trajectory, we condition it on structured rubric feedback.

#### Reinforcement Learning with Rubrics

The *LLM-as-a-Judge* paradigm enables scalable evaluation when human labeling is expensive or ambiguous, but coarse holistic scores are often noisy and sensitive to prompting or formatting (Zheng et al., [2023](https://arxiv.org/html/2606.19327#bib.bib29); Wang et al., [2023b](https://arxiv.org/html/2606.19327#bib.bib30)). Rubric-based evaluation addresses this limitation by decomposing quality into explicit and interpretable criteria, improving consistency and enabling more fine-grained diagnosis (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4); Arora et al., [2025](https://arxiv.org/html/2606.19327#bib.bib31); Starace et al., [2025](https://arxiv.org/html/2606.19327#bib.bib32)). Building on this idea, recent work has incorporated rubrics into reinforcement learning as structured reward decompositions, extending RL-style post-training beyond strictly verifiable domains while providing more interpretable supervision than scalar rewards alone (Huang et al., [2025](https://arxiv.org/html/2606.19327#bib.bib27); Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4); Zhang et al., [2025a](https://arxiv.org/html/2606.19327#bib.bib8); Bi et al., [2025](https://arxiv.org/html/2606.19327#bib.bib50); Shao et al., [2025](https://arxiv.org/html/2606.19327#bib.bib28); Fang et al., [2026](https://arxiv.org/html/2606.19327#bib.bib60)). In these approaches, rubric judgments are aggregated into scalar rewards and applied to completed responses. As a result, rubric structure helps determine *what score* a response receives, but not *how* token-level learning is carried out on the model’s own trajectory. Many approaches also rely on predefined rubrics, often produced by frontier LLMs, which are costly to obtain (Shao et al., [2025](https://arxiv.org/html/2606.19327#bib.bib28)).
Our work differs from prior rubric-based RL in the role assigned to rubrics: we use them as privileged teacher-side supervision for on-policy self-distillation, allowing criterion-level structure to directly shape token-level updates during optimization.
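The loss of information when rubric judgments are collapsed into a scalar can be seen in a toy example. The sketch below is an assumption-laden illustration, not the aggregation rule of any cited method: the `Criterion` class, the criterion names, the weights, and the normalization by total positive weight are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: int   # positive for desired properties, negative for pitfalls


def scalar_reward(criteria, flags):
    """Collapse per-criterion judgments into one number in [0, 1].

    flags[i] is True when criterion i applies to the response: a desired
    property is satisfied, or a pitfall is present. Normalizing by the total
    positive weight is an assumption made for this sketch.
    """
    total_positive = sum(c.weight for c in criteria if c.weight > 0)
    raw = sum(c.weight for c, hit in zip(criteria, flags) if hit)
    return min(1.0, max(0.0, raw / total_positive))


criteria = [
    Criterion("derivation", 5),
    Criterion("temperature dependence", 4),
    Criterion("physical interpretation", 4),
    Criterion("limiting behavior", 3),
    Criterion("claims temperature independence", -1),
]

# Two responses that fail on different criteria.
response_a = [True, True, False, False, False]   # misses interpretation and limits
response_b = [True, False, True, False, False]   # misses temperature dependence and limits

print(scalar_reward(criteria, response_a), scalar_reward(criteria, response_b))
```

Both responses receive the same scalar even though they fail on different criteria, so an optimizer that only sees the scalar cannot tell which aspect to improve.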

## 5 Conclusion

We introduced *Rubric-Conditioned Self-Distillation*, a post-training framework that uses rubrics as privileged teacher supervision for on-policy self-distillation. Instead of collapsing rubric feedback into scalar rewards, RCSD preserves criterion-level structure during optimization by converting rubrics into dense token-level guidance on student-generated rollouts. Empirically, RCSD shows that preserving criterion-level feedback during on-policy distillation leads to stronger and more fine-grained reasoning. More broadly, RCSD offers a verifier-free approach to post-training open-ended reasoning models, where high-quality responses cannot always be judged by exact-match answers, executable tests, or other automatic outcome verifiers.

## References

- R\. Agarwal, N\. Vieillard, Y\. Zhou, P\. Stanczyk, S\. R\. Garea, M\. Geist, and O\. Bachem \(2024\)On\-policy distillation of language models: learning from self\-generated mistakes\.InThe twelfth international conference on learning representations,Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.19327#S2.SS1.p4.2),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel,et al\.\(2025\)Healthbench: evaluating large language models towards improved human health\.arXiv preprint arXiv:2505\.08775\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- B\. Bi, S\. Liu, Y\. Wang, S\. Tong, L\. Mei, Y\. Ge, Y\. Xu, J\. Guo, and X\. Cheng \(2025\)Reward and guidance through rubrics: promoting exploration to improve multi\-domain reasoning\.arXiv preprint arXiv:2511\.12344\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi,et al\.\(2020\)Piqa: reasoning about physical commonsense in natural language\.InProceedings of the AAAI conference on artificial intelligence,Vol\.34,pp\. 7432–7439\.Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- F\. Chollet, M\. Knoop, G\. Kamradt, B\. Landers, and H\. Pinkard \(2025\)Arc\-agi\-2: a new challenge for frontier ai reasoning systems\.arXiv preprint arXiv:2505\.11831\.Cited by:[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- J\. Fang, Z\. Hong, M\. Zheng, M\. Song, G\. Li, H\. Jiang, D\. Zhang, H\. Guo, X\. Wang, and T\. Chua \(2026\)Rubric\-based on\-policy distillation\.arXiv preprint arXiv:2605\.07396\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2023\)Minillm: knowledge distillation of large language models\.arXiv preprint arXiv:2306\.08543\.Cited by:[§2\.1](https://arxiv.org/html/2606.19327#S2.SS1.p4.2)\.
- A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. Hendryx \(2025\)Rubrics as rewards: reinforcement learning beyond verifiable domains\.arXiv preprint arXiv:2507\.17746\.Cited by:[Appendix B](https://arxiv.org/html/2606.19327#A2.p1.1),[§1](https://arxiv.org/html/2606.19327#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.19327#S3.p2.1),[§3](https://arxiv.org/html/2606.19327#S3.p3.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§2\.1](https://arxiv.org/html/2606.19327#S2.SS1.p2.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- Z\. Huang, Y\. Zhuang, G\. Lu, Z\. Qin, H\. Xu, T\. Zhao, R\. Peng, J\. Hu, Z\. Shen, X\. Hu,et al\.\(2025\)Reinforcement learning with rubric anchors\.arXiv preprint arXiv:2508\.12790\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Hübotter, F\. Lübeck, L\. Behric, A\. Baumann, M\. Bagatella, D\. Marta, I\. Hakimi, I\. Shenfeld, T\. K\. Buening, C\. Guestrin,et al\.\(2026\)Reinforcement learning via self\-distillation\.arXiv preprint arXiv:2601\.20802\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p1.1),[§1](https://arxiv.org/html/2606.19327#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.19327#S2.SS1.p5.5),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\)Livecodebench: holistic and contamination free evaluation of large language models for code\.arXiv preprint arXiv:2403\.07974\.Cited by:[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu \(2019\)Pubmedqa: a dataset for biomedical research question answering\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),pp\. 2567–2577\.Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- Y\. Kim and A\. M\. Rush \(2016\)Sequence\-level knowledge distillation\.InProceedings of the 2016 conference on empirical methods in natural language processing,pp\. 1317–1327\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Li, J\. Zhao, M\. Wei, H\. Ren, Y\. Zhou, J\. Yang, S\. Liu, K\. Zhang, and W\. Chen \(2026\)RubricHub: a comprehensive and highly discriminative rubric dataset via automated coarse\-to\-fine generation\.arXiv preprint arXiv:2601\.08430\.Cited by:[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.19327#S3.p2.1),[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.InThe twelfth international conference on learning representations,Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\)Medmcqa: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\.InConference on health, inference, and learning,pp\. 248–260\.Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\)Gpqa: a graduate\-level google\-proof q&a benchmark\.InFirst conference on language modeling,Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- A\. Setlur, C\. Nagpal, A\. Fisch, X\. Geng, J\. Eisenstein, R\. Agarwal, A\. Agarwal, J\. Berant, and A\. Kumar \(2024\)Rewarding progress: scaling automated process verifiers for llm reasoning\.arXiv preprint arXiv:2410\.08146\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag,et al\.\(2025\)DR tulu: reinforcement learning with evolving rubrics for deep research\.arXiv preprint arXiv:2511\.19399\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.19327#S2.SS2.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.19327#S3.p4.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- G\. Starace, O\. Jaffe, D\. Sherburn, J\. Aung, J\. S\. Chan, L\. Maksin, R\. Dias, E\. Mays, B\. Kinsella, W\. Thompson,et al\.\(2025\)PaperBench: evaluating ai’s ability to replicate ai research\.arXiv preprint arXiv:2504\.01848\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, Z\. Hu, P\. Lu, Y\. Zhu, J\. Zhang, S\. Subramaniam, A\. R\. Loomba, S\. Zhang, Y\. Sun, and W\. Wang \(2023a\)Scibench: evaluating college\-level scientific problem\-solving abilities of large language models\.arXiv preprint arXiv:2307\.10635\.Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- Y\. Wang, Z\. Yu, Z\. Zeng, L\. Yang, C\. Wang, H\. Chen, C\. Jiang, R\. Xie, J\. Wang, X\. Xie,et al\.\(2023b\)Pandalm: an automatic evaluation benchmark for llm instruction tuning optimization\.arXiv preprint arXiv:2306\.05087\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- W\. Xu, R\. Han, Z\. Wang, L\. T\. Le, D\. Madeka, L\. Li, W\. Y\. Wang, R\. Agarwal, C\. Lee, and T\. Pfister \(2024\)Speculative knowledge distillation: bridging the teacher\-student gap through interleaved sampling\.arXiv preprint arXiv:2410\.11325\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p3.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- J\. Yao, R\. Wang, and T\. Zhang \(2026\)PRL: process reward learning improves llms’ reasoning ability and broadens the reasoning boundary\.arXiv preprint arXiv:2601\.10201\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- T\. Ye, L\. Dong, X\. Wu, S\. Huang, and F\. Wei \(2026\)On\-policy context distillation for language models\.arXiv preprint arXiv:2602\.12275\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p3.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- L\. S\. Yifei, A\. Chang, C\. Malaviya, and M\. Yatskar \(2026\)ResearchQA: evaluating scholarly question answering at scale across 75 fields with survey\-mined questions and rubrics\.Transactions of the Association for Computational Linguistics\.Note:To appearCited by:[§3](https://arxiv.org/html/2606.19327#S3.p3.1)\.
- W\. Yuan, J\. Yu, S\. Jiang, K\. Padthe, Y\. Li, I\. Kulikov, K\. Cho, D\. Wang, Y\. Tian, J\. E\. Weston,et al\.\(2025\)Naturalreasoning: reasoning in the wild with 2\.8 m challenging questions\.arXiv preprint arXiv:2502\.13124\.Cited by:[§3](https://arxiv.org/html/2606.19327#S3.p2.1.4)\.
- J\. Zhang, Z\. Wang, L\. Gui, S\. M\. Sathyendra, J\. Jeong, V\. Veitch, W\. Wang, Y\. He, B\. Liu, and L\. Jin \(2025a\)Chasing the tail: effective rubric\-based reward modeling for large language model post\-training\.arXiv preprint arXiv:2509\.21500\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p2.1),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.
- K\. Zhang, Y\. Zuo, B\. He, Y\. Sun, R\. Liu, C\. Jiang, Y\. Fan, K\. Tian, G\. Jia, P\. Li,et al\.\(2025b\)A survey of reinforcement learning for large reasoning models\.arXiv preprint arXiv:2509\.08827\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- X\. Zhang, Y\. Zhang, H\. Sun, K\. Feng, C\. Lu, C\. Yang, and H\. Meng \(2025c\)Critique\-grpo: advancing llm reasoning with natural language and numerical feedback\.External Links:2506\.03106,[Link](https://arxiv.org/abs/2506.03106)Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[§1](https://arxiv.org/html/2606.19327#S1.p1.1),[§1](https://arxiv.org/html/2606.19327#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.19327#S2.SS1.p5.5),[§3](https://arxiv.org/html/2606.19327#S3.p4.1.5),[§3](https://arxiv.org/html/2606.19327#S3.p5.8),[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px1.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§4](https://arxiv.org/html/2606.19327#S4.SS0.SSS0.Px2.p1.1)\.

## Appendix A Implementation Details

In standard GRPO with an LLM judge, each rollout is scored by a fixed judge model (Qwen3-14B), which is conditioned on the user prompt, the model completion, and a reference answer. The judge outputs a discrete quality score on a 1–10 scale, which we normalize to $[0,1]$ to serve as the reward signal. In the GRPO-Rubrics method, we first perform an offline rubric-generation step, where Qwen3-14B, prompted as a rubric writer, maps each training example’s question and reference answer into a structured rubric. During training, the reward judge is conditioned on the prompt, the model completion, and the pre-generated rubric, and produces a holistic score. As a result, rewards reflect rubric satisfaction rather than direct comparison to the gold text. Training configurations are reported in Table [7](https://arxiv.org/html/2606.19327#A1.T7).
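The reward path of the GRPO baseline can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text says only that the 1–10 judge score is normalized to $[0,1]$, so the linear map `(score - 1) / 9` is a guess, and the function names, the example scores, and the use of the population standard deviation are hypothetical. The group size of 4 matches the number of generations per prompt reported for GRPO.

```python
import statistics


def normalize_judge_score(score):
    """Map a 1-10 judge score onto [0, 1]. The linear map (score - 1) / 9 is an assumption."""
    if not 1 <= score <= 10:
        raise ValueError("judge score must be in 1..10")
    return (score - 1) / 9


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


judge_scores = [3, 7, 7, 9]   # hypothetical judge scores for 4 rollouts of one prompt
rewards = [normalize_judge_score(s) for s in judge_scores]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Each rollout ends up with a single number, which is the scalar signal that the main text contrasts with token-level distillation.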

Table 7: Training Configurations

| Parameter | GRPO | RCSD/OPSD | SFT |
| --- | --- | --- | --- |
| Learning Rate | $5\times 10^{-6}$ | $5\times 10^{-6}$ | $5\times 10^{-6}$ |
| Max Completion Length | 4096 | 4096 | 4096 |
| Batch Size | 32 | 32 | 32 |
| Sampling Temperature | 1.2 | 1.2 | 1.2 |
| Training Steps | 500 | 100 | 500 |
| Number of Generations per Prompt | 4 | 1 | 1 |

## Appendix B Prompt Templates

We provide the prompt templates used in our rubric generation and LLM-as-a-Judge evaluation, where we adopt the prompt format from RaR (Gunjal et al., [2025](https://arxiv.org/html/2606.19327#bib.bib4)). For rubric generation, we instruct the model to generate self-contained, criterion-level rubrics in a structured JSON format. For LLM-as-a-Judge evaluation, we ask the evaluator to holistically score a generated response according to the provided rubrics and return a JSON-formatted rating.

### B.1 Rubric Generator Prompt

`Rubric Generator Prompt`

`B\.2 LLM\-as\-a\-Judge Prompt LLM\-as\-a\-Judge Prompt Appendix C Experiments Training Steps Analysis\. Figure 5 shows the performance trajectory across different training checkpoints\. On GPQA\-D, performance improves steadily from the base model and peaks at 120 steps, suggesting that RCSD is most effective with moderate training\. On SciBench, performance improves early and remains relatively stable across later checkpoints, indicating that the method quickly adapts to scientific reasoning tasks without substantial degradation from continued training\. Figure 5: Performance across training checkpoints on GPQA\-D and SciBench\. C\.1 Case Study on Reasoning Quality We compare model responses on a simple ideal\-gas\-law problem\. Although all responses identify the correct governing equation P=n​R​TVP=\\frac\{nRT\}\{V\}, they differ substantially in verbosity, consistency, and numerical discipline\. Prompt\. Suppose that 10\.0​mol​C2​H6​\(g\)10\.0\\,\\mathrm\{mol\}\\ \\mathrm\{C\_\{2\}H\_\{6\}\(g\)\} is confined to 4\.860​dm34\.860\\,\\mathrm\{dm\}^\{3\} at 27∘​C27^\{\\circ\}\\mathrm\{C\}\. Predict the pressure exerted by the ethane from the perfect gas\. The unit of the answer is atm\\mathrm\{atm\}\. Let’s think step by step\. RCSD Token count: 2896 Selected quotes: “First, I need to convert the temperature from Celsius to Kelvin\. The given temperature is 27∘​C27^\{\\circ\}\\mathrm\{C\}\. To convert, I add 273\.15\. So 27\+273\.15=300\.15​K27\+273\.15=300\.15\\,\\mathrm\{K\}\.” “The volume is given in dm3\\mathrm\{dm\}^\{3\}, which is the same as liters … V=4\.860​LV=4\.860\\,\\mathrm\{L\}\.” “Since the answer should be in atm, I need to use the value of RR that has units L⋅atm/\(mol⋅K\)\\mathrm\{L\\cdot atm/\(mol\\cdot K\)\} … R=0\.08206R=0\.08206\.” “Now plug these into P​V=n​R​TPV=nRT\. Solving for PP: P=n​R​TVP=\\frac\{nRT\}\{V\}\.” “So total P≈50\.679647P\\approx 50\.679647\. 
Rounding to three decimal places gives 50\.680​atm50\.680\\,\\mathrm\{atm\}\.” “The answer is therefore 50\.680\\boxed\{50\.680\}\.” Response excerpt\. The response first converts temperature to Kelvin, maps 4\.860​dm34\.860\\,\\mathrm\{dm\}^\{3\} to 4\.860​L4\.860\\,\\mathrm\{L\}, selects R=0\.08206​L⋅atm/\(mol⋅K\)R=0\.08206\\,\\mathrm\{L\\cdot atm/\(mol\\cdot K\)\}, and applies P=n​R​TV\.P=\\frac\{nRT\}\{V\}\. It then computes P=10\.0×0\.08206×300\.154\.860=246\.303094\.860≈50\.680\.P=\\frac\{10\.0\\times 0\.08206\\times 300\.15\}\{4\.860\}=\\frac\{246\.30309\}\{4\.860\}\\approx 50\.680\. The response ends with a short, clean answer: 50\.680\.\\boxed\{50\.680\}\. OPSD Token count: 4936 Selected quotes: “The ideal gas law is P​V=n​R​TPV=nRT, right?” “Since the answer needs to be in atmospheres, I should use the RR value that has units of L⋅atm/\(mol⋅K\)\\mathrm\{L\\cdot atm/\(mol\\cdot K\)\} … R=0\.082057R=0\.082057\.” “The formula is P=n​R​T/VP=nRT/V\.” “So the numerator is approximately 246\.2940855​L⋅atm246\.2940855\\,\\mathrm\{L\\cdot atm\}\.” “So total is approximately 50\.677\+0\.000795≈50\.677850\.677\+0\.000795\\approx 50\.6778\. So approximately 50\.6778​atm50\.6778\\,\\mathrm\{atm\}\. Rounding to three decimal places, that would be 50\.678​atm50\.678\\,\\mathrm\{atm\}\.” “Therefore, the answer is 50\.678\\boxed\{50\.678\}\.” Response excerpt\. The response follows the same overall structure, but uses a more precise constant R=0\.082057​L⋅atm/\(mol⋅K\)R=0\.082057\\,\\mathrm\{L\\cdot atm/\(mol\\cdot K\)\}, leading to P=10\.0×0\.082057×300\.154\.860=246\.29408554\.860≈50\.677795\.P=\\frac\{10\.0\\times 0\.082057\\times 300\.15\}\{4\.860\}=\\frac\{246\.2940855\}\{4\.860\}\\approx 50\.677795\. After repeated recalculation and self\-verification, it rounds to 50\.678\.\\boxed\{50\.678\}\. The response is correct, but noticeably more repetitive than RCSD\. 
Qwen3\-8B Token count: 9294 Selected quotes: “The gas constant RR is 0\.0821​L⋅atm/\(mol⋅K\)0\.0821\\,\\mathrm\{L\\cdot atm/\(mol\\cdot K\)\}, right?” “Wait, but sometimes RR is taken as 0\.082060\.08206 …” “The exact value of RR is 0\.0820570\.082057 …” “So with R=0\.0821R=0\.0821, the answer is approximately 50\.704​atm50\.704\\,\\mathrm\{atm\}\.” “But if I use R=0\.08206R=0\.08206, it’s approximately 50\.680​atm50\.680\\,\\mathrm\{atm\}\.” “But since the problem might expect using R=0\.0821R=0\.0821, I think the answer is expected to be around 50\.704​atm50\.704\\,\\mathrm\{atm\}\.” Response excerpt\. The response begins correctly by identifying P​V=n​R​TPV=nRT, converting 27∘​C27^\{\\circ\}\\mathrm\{C\} to 300\.15​K300\.15\\,\\mathrm\{K\}, and noting that 4\.860​dm3=4\.860​L4\.860\\,\\mathrm\{dm\}^\{3\}=4\.860\\,\\mathrm\{L\}\. However, it repeatedly switches between different values of the gas constant: R=0\.0821,0\.08206,0\.082057\.R=0\.0821,\\quad 0\.08206,\\quad 0\.082057\. This causes the intermediate calculations to drift between 50\.704,50\.680,50\.678,50\.704,\\quad 50\.680,\\quad 50\.678, and the response becomes very long, circular, and self\-contradictory\. In particular, the model explicitly states that it will “proceed with R=0\.0821R=0\.0821,” which yields approximately 50\.704\.\\boxed\{50\.704\}\. Even though the underlying formula is correct, the response demonstrates weaker numerical stability and substantially poorer reasoning efficiency\. This example reveals a clear difference in reasoning quality, not just final\-answer correctness\. All three models know the ideal gas law and the required unit conversions, but they differ in how efficiently and consistently they execute the solution\. RCSD produces the strongest trajectory\. Its response is the shortest, commits early to a coherent numerical setup, and reaches a correct answer without revisiting earlier choices\. 
OPSD also arrives at a correct answer, but its reasoning is more verbose and repetitive: it repeatedly re-checks arithmetic that has already been established, which increases token usage without improving solution quality. In contrast, the base Qwen3-8B response is substantially longer and exhibits a more serious failure mode: it repeatedly changes core numerical assumptions, especially the value of $R$, and consequently oscillates between different final answers. Overall, this case study suggests that RCSD improves reasoning efficiency and trajectory stability: its responses are shorter, less repetitive, and more internally consistent than those produced by OPSD and the base model.

C.2 Failure Analysis on Learned Rubric Quality

Learned Rubrics

Phonon / Bose–Einstein Example
1. Bose-Einstein Derivation (5): Essential Criteria: The response must clearly derive the Bose-Einstein distribution formula for phonon occupation number $n(\omega)$ using the given dispersion relation $\omega(k)=ck$.
2. Temperature Dependence (4): Important Criteria: The response should correctly express the temperature dependence of $n(\omega)$ as $n(\omega)=1/(e^{\hbar\omega/kT}-1)$, demonstrating the exponential dependence on temperature and frequency.
3. Frequency-Dependent Behavior (5): Essential Criteria: The response must explicitly explain how the occupation number $n(\omega)$ changes with temperature for different $\omega$ values, emphasizing the inverse relationship between $n(\omega)$ and temperature.
4. Physical Interpretation (4): Important Criteria: The response should include a clear explanation of the physical meaning of the Bose-Einstein distribution in the context of phonon statistics and thermal equilibrium.
5. High/Low Temperature Behavior (3): Optional Criteria: The response may include a brief discussion on the behavior of $n(\omega)$ at high and low temperatures, such as the classical limit or zero-point energy effects.
6. Frequency Dependence Omission (-1): Negative Criteria: The response must not omit the dependence of the occupation number on the frequency $\omega$, as this is a critical aspect of the temperature dependence.
7. Temperature Independence Error (-1): Negative Criteria: The response must not incorrectly state that the occupation number is independent of temperature, which would contradict the Bose-Einstein distribution.
8. Dispersion Relation Usage (4): Important Criteria: The response should correctly use the dispersion relation $\omega(k)=ck$ to connect the phonon frequency with wavevector $k$ in the derivation.
9. Bose-Einstein Application (4): Important Criteria: The response must demonstrate the correct application of Bose-Einstein statistics to phonons, including the proper handling of the occupation number formula.
10. Frequency-Temperature Interaction (3): Optional Criteria: The response may include a clear explanation of how the occupation number changes with frequency for different temperature values, enhancing the understanding of the system.

Soda-Lime Titration Example
1. Mass Conversion (5): Essential Criteria: The response must accurately convert the mass of soda lime to the masses of NaOH and CaO components using the given percentages.
2. Mole Calculation (5): Essential Criteria: The response must correctly compute the moles of NaOH and CaO from their respective masses using their molar masses.
3. Reaction Equations (4): Important Criteria: The response should accurately identify and write the balanced chemical equations for the neutralization reactions of NaOH and CaO with HCl.
4. Stoichiometric Ratio (4): Important Criteria: The response must correctly apply stoichiometric ratios from the balanced equations to relate moles of NaOH and CaO to moles of HCl required for neutralization.
5. Total Moles Calculation (5): Essential Criteria: The response should correctly calculate the total moles of HCl required by summing the moles from both NaOH and CaO neutralization steps.
6. Volume Calculation (5): Essential Criteria: The response must accurately determine the volume of 0.500 M HCl needed using the total moles and the given molarity, converting to the correct units.
7. Step-by-Step Explanation (4): Important Criteria: The response should present a clear, step-by-step explanation of the calculation process for both NaOH and CaO neutralization, ensuring logical flow.
8. Application Context (3): Optional Criteria: The response may include a brief discussion of the significance of the neutralization reactions in real-world applications, though it is not required.
9. Component Omission (-1): Negative Criteria: The response must not omit the neutralization of either NaOH or CaO components, as both are required for accurate calculation.
10. Molar Mass Accuracy (-1): Negative Criteria: The response must not use incorrect molar masses for NaOH or CaO, as this would lead to wrong results.

Reference Rubrics (RaR-Science)

Phonon / Bose–Einstein Example
1. Bose-Einstein Distribution (5): Essential Criteria: The response must explicitly state and correctly use the Bose-Einstein distribution formula for $n(\omega)$, such as $n(\omega)=1/(\exp(\hbar\omega/(k_{B}T))-1)$, linking $\omega$ and $T$ in the derivation.
2. Dispersion Relation Use (4): Important Criteria: The answer should correctly incorporate the given dispersion relation $\omega(k)=ck$ to connect the frequency $\omega$ to the wave vector $k$ in the context of the phonon system.
3. Temperature Analysis (5): Essential Criteria: The response must analyze how the phonon occupation number $n(\omega)$ varies with temperature for a fixed frequency and explain differences in behavior at various $\omega$ values.
4. Mathematical Derivation (4): Important Criteria: The answer should include clear and logically structured derivations that break down the mathematical steps required to arrive at the temperature dependence of $n(\omega)$.
5. Frequency Trends (3): Optional Criteria: The response may provide a concrete example or detailed explanation illustrating that at lower frequencies the occupation number is more sensitive to changes in temperature than at higher frequencies.
6. Clarity and Conciseness (3): Optional Criteria: The answer should be clear and concise, avoiding unnecessary elaboration while still covering all key elements of the derivation and conclusion.
7. Exclusion of Loss Effects (-1): Pitfall Criteria: The response should not include irrelevant factors such as frictional or damping losses, which are not part of the ideal derivation using Bose-Einstein statistics.

Soda-Lime Titration Example
1. Separate Reactions (5): Essential Criteria: The response must separately address the neutralization reactions for both NaOH and CaO components, calculating the moles of acid required for each reaction.
2. Stoichiometry Accuracy (5): Essential Criteria: The answer should correctly apply stoichiometric relationships to determine the moles of HCl needed for the complete neutralization of both compounds.
3. Molarity Application (4): Important Criteria: The response must demonstrate how the molarity of 0.500 M HCl is used to convert the required moles of acid into the corresponding volume in $\mathrm{cm}^{3}$ with proper unit conversions.
4. Step-by-Step Work (4): Important Criteria: The answer should provide a clear, logical sequence of calculations that lead to the final volume, ensuring transparency in each intermediary step.
5. Final Volume Accuracy (5): Essential Criteria: The response must explicitly state the correct final volume of 0.500 M HCl required ($133.04\,\mathrm{cm}^{3}$) for complete neutralization.
6. Unit Consistency (2): Optional Criteria: The explanation should include correct unit conversions, especially showing how volumes are converted (e.g., L to $\mathrm{cm}^{3}$), to enhance clarity and precision.
7. Reaction Assumptions (-2): Pitfall Criteria: The response should mention that the reactions are assumed to go to completion without interference, and neglecting to state such assumptions is a common oversight.

Failure analysis. These examples illustrate two recurring weaknesses of the learned rubrics relative to the reference rubrics. First, the learned rubrics are often bloated and redundant. In the phonon/Bose–Einstein example, the learned rubric expands to 10 criteria versus 7 in the reference rubric, and several titles repeat the same underlying evaluation axis: Temperature Dependence, Frequency-Dependent Behavior, High/Low Temperature Behavior, and Frequency-Temperature Interaction all partially restate the same requirement. By contrast, the reference rubric compresses this content into a smaller and sharper set of criteria, such as Temperature Analysis, Mathematical Derivation, and Frequency Trends. This suggests that the learned rubric generator tends to over-segment closely related concepts instead of consolidating them into a compact set of discriminative checks. Second, the learned rubrics sometimes include generic but weakly task-critical criteria. In the soda-lime titration example, the learned rubric includes items such as Step-by-Step Explanation and Application Context, which are broadly reasonable but not central to the actual grading target.
The reference rubric instead concentrates on the chemically essential checks: Separate Reactions, Stoichiometry Accuracy, Final Volume Accuracy, and Unit Consistency. Overall, these cases suggest that the learned rubrics usually identify the correct topic, but they are often more verbose, overlapping, and less precisely aligned with the true grading objective than the reference rubrics.

Appendix D Additional Theoretical Details

Proposition 1 (Robustness to rubric approximation). Let $r^{\star}$ be a reference rubric and $\hat{r}$ a generated rubric. For each visited prefix $(x,\hat{y}_{<t})$, define
\[
p_{t}^{\star}:=p_{T}^{Y}(\cdot\mid x,r^{\star},\hat{y}_{<t}),\qquad
\hat{p}_{t}:=p_{T}^{Y}(\cdot\mid x,\hat{r},\hat{y}_{<t}),\qquad
q_{t}:=p_{S}^{Y}(\cdot\mid x,\hat{y}_{<t}).
\]
Assume that $-\log q_{t}(y)\leq B_{t}$ for all tokens $y$ in the support of $p_{t}^{\star}$ and $\hat{p}_{t}$. Then
\[
\Big|\mathbb{E}_{y\sim p_{t}^{\star}}[-\log q_{t}(y)]-\mathbb{E}_{y\sim\hat{p}_{t}}[-\log q_{t}(y)]\Big|\leq 2B_{t}\,\mathrm{TV}(p_{t}^{\star},\hat{p}_{t}),
\]
where $\mathrm{TV}$ denotes total variation distance. Consequently,
\[
\big|\mathcal{L}_{\mathrm{CE}}(\hat{r})-\mathcal{L}_{\mathrm{CE}}(r^{\star})\big|\leq\mathbb{E}_{\hat{y}\sim p_{S}^{Y}(\cdot\mid x)}\left[\sum_{t=1}^{|\hat{y}|}2B_{t}\,\mathrm{TV}(p_{t}^{\star},\hat{p}_{t})\right].
\]

Proof. Fix a visited prefix $(x,\hat{y}_{<t})$, and define $f_{t}(y):=-\log q_{t}(y)$. By assumption, $f_{t}(y)\leq B_{t}$ for all $y$ in the support of $p_{t}^{\star}$ and $\hat{p}_{t}$; moreover $f_{t}(y)\geq 0$ because $q_{t}(y)\leq 1$, so $|f_{t}(y)|\leq B_{t}$ on that support. Then
\[
\begin{aligned}
\Big|\mathbb{E}_{y\sim p_{t}^{\star}}[f_{t}(y)]-\mathbb{E}_{y\sim\hat{p}_{t}}[f_{t}(y)]\Big|
&=\left|\sum_{y}\bigl(p_{t}^{\star}(y)-\hat{p}_{t}(y)\bigr)f_{t}(y)\right| \\
&\leq\sum_{y}\bigl|p_{t}^{\star}(y)-\hat{p}_{t}(y)\bigr|\,|f_{t}(y)| \\
&\leq B_{t}\sum_{y}\bigl|p_{t}^{\star}(y)-\hat{p}_{t}(y)\bigr| \\
&=2B_{t}\,\mathrm{TV}(p_{t}^{\star},\hat{p}_{t}),
\end{aligned}
\tag{6}
\]
where the last equality uses $\mathrm{TV}(p,p')=\tfrac{1}{2}\sum_{y}|p(y)-p'(y)|$. Applying this bound at each decoding step, summing over $t$, and taking expectation over student rollouts $\hat{y}\sim p_{S}^{Y}(\cdot\mid x)$ yields the stated result. ∎

Proposition 1 shows that generated rubrics do not need to match reference rubrics exactly in wording or surface form. It is sufficient that they induce similar rubric-conditioned teacher distributions on the prefixes actually visited by the student. This matches the practical role of Stage I: the rubric generator needs to produce rubrics that are useful enough to preserve the teacher-side guidance used in Stage II.
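The per-step inequality in Proposition 1 can be sanity-checked numerically. The following is a minimal sketch and not taken from the paper: it draws random categorical distributions over a small hypothetical vocabulary as stand-ins for $p_{t}^{\star}$, $\hat{p}_{t}$, and $q_{t}$, sets $B_{t}$ to the largest value of $-\log q_{t}(y)$, and verifies that the cross-entropy gap never exceeds $2B_{t}\,\mathrm{TV}(p_{t}^{\star},\hat{p}_{t})$. All function names and the sampling scheme are assumptions made for illustration.

```python
import math
import random


def random_distribution(size, rng):
    """Draw a strictly positive categorical distribution over `size` tokens."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def total_variation(p, p_prime):
    """TV(p, p') = 0.5 * sum_y |p(y) - p'(y)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, p_prime))


def cross_entropy(p, q):
    """E_{y ~ p}[-log q(y)]."""
    return sum(a * -math.log(b) for a, b in zip(p, q))


def check_bound(vocab_size=50, trials=1000, seed=0):
    """Return True if the per-step bound holds on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        p_star = random_distribution(vocab_size, rng)  # stand-in for p_t^*
        p_hat = random_distribution(vocab_size, rng)   # stand-in for \hat{p}_t
        q = random_distribution(vocab_size, rng)       # stand-in for q_t
        # All distributions have full support, so B_t is the max over all tokens.
        b_t = max(-math.log(prob) for prob in q)
        gap = abs(cross_entropy(p_star, q) - cross_entropy(p_hat, q))
        bound = 2.0 * b_t * total_variation(p_star, p_hat)
        if gap > bound + 1e-12:  # small tolerance for floating-point error
            return False
    return True


print(check_bound())  # prints True
```

Random trials of this kind cannot prove the proposition, but they illustrate the quantity that matters: the gap is controlled by how far the two teacher distributions are apart in total variation, not by how different the rubric texts look.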
