Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
Summary
Proposes on-policy critique distillation (Opcd) using weak models as critics to provide revision directions for strong models, improving reasoning and alignment without requiring weak models to solve tasks.
Cached at: 06/02/26, 03:46 PM
# On-Policy Critique Distillation for Scalable Oversight
Source: [https://arxiv.org/html/2606.00424](https://arxiv.org/html/2606.00424)
## Weak Critics Make Strong Learners: On\-Policy Critique Distillation for Scalable Oversight
###### Abstract
As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (Opcd), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.
Scalable Oversight, Weak\-to\-Strong Generalization, LLM Alignment, LLM Reasoning
## 1 Introduction
Modern large language models (LLMs) are commonly aligned with human supervision, such as task demonstrations, preference labels, reward models, and reinforcement learning from human feedback (Christiano et al., [2017](https://arxiv.org/html/2606.00424#bib.bib31); Ouyang et al., [2022](https://arxiv.org/html/2606.00424#bib.bib52); Bai et al., [2022a](https://arxiv.org/html/2606.00424#bib.bib53)). These methods work well when the supervisor can reliably judge the model output. However, as models become stronger, they may produce answers, plans, proofs, or code that are difficult for humans or weaker models to fully verify. This creates a central challenge for alignment: how can a weak supervisor guide a stronger model when the final task is too hard for the supervisor to solve or judge?
Two related research directions study this challenge. Weak-to-strong generalization asks whether supervision from a weak model can elicit useful behavior from a stronger pretrained model (Burns et al., [2023](https://arxiv.org/html/2606.00424#bib.bib1)). Scalable oversight asks how a weak human or model can provide reliable supervision for a stronger system, often through assistance, interaction, or debate (Amodei et al., [2016](https://arxiv.org/html/2606.00424#bib.bib37); Irving et al., [2018](https://arxiv.org/html/2606.00424#bib.bib38); Bowman et al., [2022](https://arxiv.org/html/2606.00424#bib.bib5); Khan et al., [2024](https://arxiv.org/html/2606.00424#bib.bib55)). Although these directions differ in their goals, many existing methods place a similar burden on the weak supervisor: the weak model must provide direct labels, soft logits, preference signals, or final judgments over complete answers. This can be too demanding when the task is beyond the weak supervisor's full ability. In such cases, the supervision signal can be noisy, incomplete, or systematically wrong, which limits both weak-to-strong generalization and scalable oversight.
In this paper, we study a different form of weak supervision: using the weak model as a *critic* rather than as a labeler or judge. A weak critic does not need to solve the task, provide the correct answer, identify every error, or give a detailed revision plan. It can be useful by giving a general but correct revision direction, such as suggesting that the reasoning is incomplete, a condition is missing, a boundary case should be checked, or the response should be safer. This form of feedback is often easier than labeling or judging complete answers, and is close to common human–AI interaction: users may not know the full solution, but they can still give feedback that helps a stronger model revise. As long as the critique is not misleading, it can help the strong model better use its own knowledge without requiring the weak supervisor to provide full supervision. We call this setting *weak-critic strong oversight*.
We first test this idea at inference time. Given a question, the strong model produces an initial answer, the weak model critiques it, and the strong model then revises its answer conditioned on the question, the initial answer, and the critique. This directly evaluates whether weak critiques can improve a frozen strong model. Our results show that they can, even when the critique only gives a general revision direction rather than a detailed error analysis. This supports the core hypothesis of *weak-critic strong oversight*: weak supervisors may not need to solve or judge the full task to provide useful oversight. We also find that critique quality is central. Helpful critiques improve performance, while misleading critiques can hurt performance, even compared with using no critique. This motivates filtering useful critiques before using them for training.
To internalize the inference-time improvement, we propose a progressive on-policy critique distillation method (Opcd). In each epoch, the current strong model generates on-policy answers, and the weak model critiques these answers. We then use an outcome- and rubric-based quality metric to keep only useful critiques. For each kept example, the critic-conditioned strong model serves as a self-teacher, using the critique as guidance to provide dense token-level signals. The student is the same strong model without access to the critique, trained by on-policy distillation. After each update, the strong model produces new answers with new error patterns, and the weak critic provides fresh critiques for the updated model. This process distills the useful critic-guided behavior observed at inference time while keeping the supervision adaptive to the current strong model.
Our experiments show that *weak-critic strong oversight* improves strong-model performance in both inference-time and training-time settings. Compared with standard weak-to-strong methods that directly distill weak-model responses or logits, our method does not force the weak model to provide full supervision. Compared with ground-truth supervised finetuning, our method studies a more realistic oversight setting where reliable labels may be unavailable or too hard for weak supervisors to provide. Across reasoning and alignment benchmarks, progressive on-policy critique distillation improves the strong model over training epochs, showing that critique-based supervision is an effective path for scalable oversight and weak-to-strong generalization.
Our main contributions are:
- ★ Critique-based weak supervision. We identify critiquing as a more tractable form of weak supervision than labeling or judging, and propose *weak-critic strong oversight* for scalable oversight and weak-to-strong generalization.
- ★ Inference-time validation. We show that weak critiques can improve strong-model performance at inference time, even when they provide only general revision directions, and find that critique quality is key to reliable improvement.
- ★ Progressive critique distillation. We introduce Opcd, a progressive on-policy critique distillation strategy that filters high-quality critiques and uses them as adaptive weak feedback to train the strong model.
- ★ Strong results. Across multiple benchmarks, Opcd progressively improves strong-model performance, showing that critique-based supervision can effectively use weak oversight for stronger models.
## 2 Related Works
#### Scalable oversight and weak\-to\-strong generalization\.
Scalable oversight aims to develop methods to supervise AI systems on tasks that are difficult for humans (Amodei et al., [2016](https://arxiv.org/html/2606.00424#bib.bib37); Bowman et al., [2022](https://arxiv.org/html/2606.00424#bib.bib5)). The primary focus has been on designing human-AI collaboration protocols that help humans evaluate AI outputs more accurately, for example through debate and consultancy (Irving et al., [2018](https://arxiv.org/html/2606.00424#bib.bib38); Kenton et al., [2024](https://arxiv.org/html/2606.00424#bib.bib6)), and on lowering the cognitive burden of evaluation through critique-assisted review and prover-verifier games (Saunders et al., [2022](https://arxiv.org/html/2606.00424#bib.bib39); McAleese et al., [2024](https://arxiv.org/html/2606.00424#bib.bib16); Kirchner et al., [2024](https://arxiv.org/html/2606.00424#bib.bib40)). In contrast, weak-to-strong generalization (W2S) (Burns et al., [2023](https://arxiv.org/html/2606.00424#bib.bib1)) explores a complementary direction that designs learning algorithms which let a strong pretrained model generalize correctly from weak supervision as if it had been trained on higher-quality labels.
A growing body of work strengthens this elicitation through iterative label refinement, easy-to-hard reward transfer, weak-LLM preference labeling, internal-coherence elicitation, and self-consistency filtering for reasoning (Ye et al., [2025](https://arxiv.org/html/2606.00424#bib.bib8); Sun et al., [2024](https://arxiv.org/html/2606.00424#bib.bib9); Tao and Li, [2024](https://arxiv.org/html/2606.00424#bib.bib7); Wen et al., [2025](https://arxiv.org/html/2606.00424#bib.bib11); Yang et al., [2024](https://arxiv.org/html/2606.00424#bib.bib24); Jin et al., [2024](https://arxiv.org/html/2606.00424#bib.bib59)), with inference-time variants contrasting weak and strong distributions or guiding decoding with weak step-level scores (Li et al., [2023](https://arxiv.org/html/2606.00424#bib.bib2); Ding et al., [2025](https://arxiv.org/html/2606.00424#bib.bib10)). Theoretical analyses bound the W2S gain by the strong model's misfit on weak labels and characterize pseudolabel correction under an expansion condition (Charikar et al., [2024](https://arxiv.org/html/2606.00424#bib.bib22); Lang et al., [2024](https://arxiv.org/html/2606.00424#bib.bib23)), while Yang et al. ([2025b](https://arxiv.org/html/2606.00424#bib.bib25)) document a failure mode in which strong models pass weak supervision on prompts the supervisor knows but remain misaligned where it does not. Our setting follows the W2S protocol of Burns et al. ([2023](https://arxiv.org/html/2606.00424#bib.bib1)) but uses weak critiques in place of weak labels and targets generative reasoning rather than classification.
#### LLM Alignment\.
Aligning LLMs with human preferences is most commonly done via reinforcement learning from human feedback (Christiano et al., [2017](https://arxiv.org/html/2606.00424#bib.bib31); Ouyang et al., [2022](https://arxiv.org/html/2606.00424#bib.bib52); Bai et al., [2022a](https://arxiv.org/html/2606.00424#bib.bib53)), which trains a reward model on preference comparisons and optimizes the policy with PPO (Schulman et al., [2017](https://arxiv.org/html/2606.00424#bib.bib29)); direct preference methods bypass the reward model and train on preferences end-to-end (Rafailov et al., [2023](https://arxiv.org/html/2606.00424#bib.bib30)). Because human supervision is expensive and difficult to scale, a separate line replaces or supplements it with model-generated signals: AI feedback judged against written principles (Bai et al., [2022b](https://arxiv.org/html/2606.00424#bib.bib12); Lee et al., [2024](https://arxiv.org/html/2606.00424#bib.bib13); Jin et al., [2026](https://arxiv.org/html/2606.00424#bib.bib57)), self-judgment loops in which the policy itself plays the role of judge (Yuan et al., [2024](https://arxiv.org/html/2606.00424#bib.bib14); Wu et al., [2024](https://arxiv.org/html/2606.00424#bib.bib15)), and dedicated critic models that surface mistakes the policy or human annotators might miss (McAleese et al., [2024](https://arxiv.org/html/2606.00424#bib.bib16); Ankner et al., [2024](https://arxiv.org/html/2606.00424#bib.bib17)). These methods take the supervisor's signal as a training target. We instead treat it as a critique and use only those critiques that demonstrably improve the policy's answer.
#### LLM Reasoning\.
Eliciting reasoning from LLMs has been driven by chain-of-thought prompting and its inference-time variants such as self-consistency and tree-search decoding (Wei et al., [2022](https://arxiv.org/html/2606.00424#bib.bib26); Wang et al., [2023](https://arxiv.org/html/2606.00424#bib.bib33); Yao et al., [2023](https://arxiv.org/html/2606.00424#bib.bib34)). Self-improvement methods finetune the policy on its own successful traces (Zelikman et al., [2024](https://arxiv.org/html/2606.00424#bib.bib4)) or train process verifiers on step-level correctness to densify the reward on hard problems (Lightman et al., [2024](https://arxiv.org/html/2606.00424#bib.bib27); Wang et al., [2024](https://arxiv.org/html/2606.00424#bib.bib35); Jin et al., [2025](https://arxiv.org/html/2606.00424#bib.bib58)). A complementary line teaches the model to revise its outputs, either at inference time through verbal feedback (Madaan et al., [2023](https://arxiv.org/html/2606.00424#bib.bib18); Shinn et al., [2023](https://arxiv.org/html/2606.00424#bib.bib19)) or during training through multi-turn reinforcement learning (Kumar et al., [2024](https://arxiv.org/html/2606.00424#bib.bib20); Qu et al., [2024](https://arxiv.org/html/2606.00424#bib.bib21)). Recent large-scale efforts show that pure outcome-based RL on a strong base model can elicit long chain-of-thought reasoning at scale (DeepSeek-AI, [2025](https://arxiv.org/html/2606.00424#bib.bib36)), while online distillation has shifted toward student-side losses that match the teacher under the student's own distribution (Gu et al., [2023](https://arxiv.org/html/2606.00424#bib.bib3)). These methods either require ground-truth verification or assume a critic at least as strong as the policy.
## 3 Preliminary Inference-Time Investigation
Whether a weak model can provide useful critiques that improve a stronger model at inference time is the foundation of the Opcd framework. In this section, we conduct a preliminary investigation to answer two questions:
\(i\) Can weak\-model critique improve strong\-model performance beyond sampling more responses, and does this effect generalize across reasoning and alignment tasks as well as across thinking and non\-thinking models? \(ii\) Does the quality of the critique affect the final accuracy?
#### Experimental Setting\.
We evaluate the critique-and-refine paradigm on both non-thinking and thinking models. For non-thinking models, we use Phi-4-mini-instruct-3.8B (Abdin et al., [2024](https://arxiv.org/html/2606.00424#bib.bib41)) as the weak model and Phi-4-14B (Abdin et al., [2024](https://arxiv.org/html/2606.00424#bib.bib41)) as the strong model. The experiments are conducted on GPQA Diamond (Rein et al., [2023](https://arxiv.org/html/2606.00424#bib.bib43)) for the reasoning task and IFEval (Zhou et al., [2023](https://arxiv.org/html/2606.00424#bib.bib47)) for the instruction-following alignment task. For thinking models, we use Qwen3-1.7B (Yang et al., [2025a](https://arxiv.org/html/2606.00424#bib.bib45)) as the weak model and Qwen3-8B (Yang et al., [2025a](https://arxiv.org/html/2606.00424#bib.bib45)) as the strong model. Both models are evaluated with thinking mode enabled on AIME 2024 (Art of Problem Solving, [2024a](https://arxiv.org/html/2606.00424#bib.bib48), [b](https://arxiv.org/html/2606.00424#bib.bib49)) and AIME 2025 (Art of Problem Solving, [2025a](https://arxiv.org/html/2606.00424#bib.bib50), [b](https://arxiv.org/html/2606.00424#bib.bib51)), containing 60 problems in total. This setting tests whether the usefulness of weak-model critiques also extends to thinking models.
For each problem, the critique-and-refine pipeline consists of three stages; we denote it as *S + W critic + refine*. First, the strong model generates an initial answer. Then, the weak model receives the original problem and the strong model's initial answer, and produces a critique. Finally, the strong model refines its answer based on the problem, the initial answer, and the weak-model critique. This process is repeated for up to 16 independent chains, as reported in Table [1](https://arxiv.org/html/2606.00424#S3.T1). Formally, for each chain $i$, the process is written as:

$$a_i = S(x), \quad c_i = W(x, a_i), \quad r_i = S(x, a_i, c_i), \tag{1}$$

where $a_i$, $c_i$, and $r_i$ denote the initial answer, weak-model critique, and refined answer of chain $i$, respectively. For an inference budget of $k \in \{1, 2, 4, 8, 16\}$ chains, we report pass@$k$, where the prediction is counted as correct if any of the first $k$ final answers is correct.
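The three-stage chain in Eq. (1) and the pass@$k$ rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_chains`, `pass_at_k`, `benchmark_pass_at_k`, and the callables `strong`, `weak`, and `is_correct` are hypothetical names standing in for the model calls and the answer checker, and the toy outcomes are made up.

```python
from typing import Callable, List


def run_chains(
    x: str,
    strong: Callable[..., str],
    weak: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
    num_chains: int = 16,
) -> List[bool]:
    """Run independent chains and record whether each refined answer is correct."""
    outcomes = []
    for _ in range(num_chains):
        a = strong(x)        # a_i = S(x): initial answer
        c = weak(x, a)       # c_i = W(x, a_i): weak critique
        r = strong(x, a, c)  # r_i = S(x, a_i, c_i): refined answer
        outcomes.append(is_correct(x, r))
    return outcomes


def pass_at_k(outcomes: List[bool], k: int) -> bool:
    """A problem counts as solved if any of the first k final answers is correct."""
    return any(outcomes[:k])


def benchmark_pass_at_k(per_problem: List[List[bool]], k: int) -> float:
    """Percentage of problems solved within the first k chains."""
    solved = sum(pass_at_k(outcomes, k) for outcomes in per_problem)
    return 100.0 * solved / len(per_problem)


# Toy outcomes for three problems with four chains each (made-up values).
toy = [
    [False, True, False, False],
    [False, False, False, True],
    [False, False, False, False],
]
for k in (1, 2, 4):
    print(k, round(benchmark_pass_at_k(toy, k), 2))
```

Because the rule only looks at the first $k$ chains, the reported pass@$k$ is non-decreasing in $k$ for a fixed set of cached chains.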
Figure 1: Pass@k performance, reported as accuracy percentage, under different inference budgets on GPQA Diamond and IFEval. The strong model is Phi-4-14B and the weak model is Phi-4-mini-instruct. The shaded region highlights the performance gap between the *S only* baseline and the *S + W critic + refine* setting.

| Benchmark | Model pair | Method | pass@1 | pass@2 | pass@4 | pass@8 | pass@16 |
|---|---|---|---|---|---|---|---|
| GPQA Diamond | Phi-4-14B (S), Phi-4-mini-instruct (W) | S only | 50.51 | 62.12 | 74.75 | 81.31 | 83.84 |
| | | S + W critic + refine | **51.99** | **66.67** | 77.27 | 86.36 | 90.40 |
| | | S + best critic + refine | 51.52 | 66.16 | **80.30** | **87.37** | **92.42** |
| | | S + random critic + refine | **51.99** | 61.62 | 72.22 | 82.83 | 87.37 |
| | | S + worst critic + refine | 51.52 | 61.11 | 68.18 | 80.81 | 83.33 |
| IFEval | Phi-4-14B (S), Phi-4-mini-instruct (W) | S only | 61.61 | 67.99 | 72.33 | 75.53 | 77.82 |
| | | S + W critic + refine | 72.11 | 79.71 | 84.64 | 88.05 | 90.76 |
| | | S + best critic + refine | **73.81** | **81.21** | **85.98** | **88.96** | **90.94** |
| | | S + random critic + refine | 71.29 | 78.44 | 83.01 | 86.07 | 88.54 |
| | | S + worst critic + refine | 68.29 | 76.15 | 81.58 | 85.24 | 87.99 |
| AIME24+25 | Qwen3-8B (S), Qwen3-1.7B (W) | S only | 71.67 | **81.67** | **81.67** | 83.33 | 86.67 |
| | | W only | 40.00 | 51.67 | 61.67 | 66.67 | 73.33 |
| | | S + W critic + refine | **75.00** | **81.67** | **81.67** | **85.00** | **90.00** |

Table 1: Inference-time scaling results on reasoning and alignment benchmarks. Columns pass@1 to pass@16 give the inference budget. All results are reported as accuracy percentages. S and W denote the strong and weak models, respectively. Best results within each benchmark are highlighted in bold.
#### Baselines\.
We compare the proposed critique\-and\-refine pipeline with several inference\-time baselines\. The basic baseline isS only, where the strong model independently generates multiple answers without receiving any critique\.
For the non-thinking Phi-4 experiments, we further introduce three baselines to analyze how critique quality affects final performance. To isolate the effect of critique quality, these baselines are constructed from cached outputs of the complete $k$ chains in *S + W critic + refine*. For each inference budget $k$, we select one critique from the first $k$ chains and pair it with each of the $k$ initial answers before refinement. In *S + best critic + refine*, we select a critique from a chain where the initial answer is incorrect but the refined answer becomes correct, if such a chain exists. This setting approximates the effect of a high-quality critique. In *S + random critic + refine*, we uniformly sample one critique from the first $k$ chains, which measures the effect of critique reuse without correctness information. In *S + worst critic + refine*, we select a critique from a chain where the initial answer is correct but the refined answer becomes incorrect, if such a chain exists. For the thinking-model Qwen experiments, we also report *W only*, where the weak model independently generates multiple answers in the same way as *S only*. We do not include the Phi-4 critique-quality baselines in the thinking-model setting, because this experiment is intended as a focused test of whether weak critique still helps under thinking-mode generation rather than a complete critique-quality analysis.
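The selection rules for the three critique-quality baselines can be sketched as below. This is a hedged illustration under stated assumptions: `Chain` and `select_critique` are hypothetical names, taking the first matching chain is an assumption (the text does not say how ties are broken), and returning `None` when no matching chain exists is a placeholder because the fallback is not specified in the text.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chain:
    """Cached result of one S + W critic + refine chain."""
    critique: str
    initial_correct: bool
    refined_correct: bool


def select_critique(
    chains: List[Chain], k: int, mode: str, rng: random.Random
) -> Optional[str]:
    """Pick one critique from the first k cached chains."""
    first_k = chains[:k]
    if mode == "random":
        # Uniform sample, no correctness information used.
        return rng.choice(first_k).critique
    if mode == "best":
        # Initial answer wrong, refined answer right.
        pool = [c for c in first_k if not c.initial_correct and c.refined_correct]
    elif mode == "worst":
        # Initial answer right, refined answer wrong.
        pool = [c for c in first_k if c.initial_correct and not c.refined_correct]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: first match wins; None marks "no such chain exists".
    return pool[0].critique if pool else None


# Made-up cached chains for one problem.
cached = [
    Chain("c0", True, True),
    Chain("c1", False, True),
    Chain("c2", True, False),
    Chain("c3", False, True),
]
rng = random.Random(0)
print(select_critique(cached, 4, "best", rng))
print(select_critique(cached, 4, "worst", rng))
```

The selected critique is then paired with each of the $k$ initial answers before refinement, so it is usually not the critique that was written for that answer.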
#### Weak Model Critique Framework Improves Strong Model Performance Across Tasks and Model Types\.
As shown in Table [1](https://arxiv.org/html/2606.00424#S3.T1) and Figure [1](https://arxiv.org/html/2606.00424#S3.F1), weak-model critique improves strong-model performance at inference time across both reasoning and alignment tasks. On GPQA Diamond, *S only* achieves 50.51 pass@1 and 83.84 pass@16, while *S + W critic + refine* improves the results to 51.99 pass@1 and 90.40 pass@16. Figure [1](https://arxiv.org/html/2606.00424#S3.F1) further shows that this improvement is maintained as the inference budget increases, indicating that weak critique provides useful feedback signals beyond sampling more responses. A similar trend appears on IFEval. As reported in Table [1](https://arxiv.org/html/2606.00424#S3.T1), the *S only* baseline achieves 61.61 pass@1 and 77.82 pass@16. With weak-model critique and refinement, performance increases to 72.11 pass@1 and 90.76 pass@16. The scaling trend in Figure [1](https://arxiv.org/html/2606.00424#S3.F1) also shows a consistent gap between *S only* and *S + W critic + refine* across inference budgets, with the critique-guided strong model outperforming *S only* at every budget. These observations suggest that the weak-model critique framework improves strong-model performance across reasoning and alignment tasks.
We further examine whether weak\-model critique remains useful under thinking\-mode generation\. As shown in Table[1](https://arxiv.org/html/2606.00424#S3.T1), on AIME 2024 and AIME 2025, the weak model itself is substantially weaker than the strong model, achieving 40\.00 pass@1 compared with 71\.67 pass@1 for the strong model\. Nevertheless, when used as a critic, the weak model improves the strong model from 71\.67 to 75\.00 at pass@1 and from 86\.67 to 90\.00 at pass@16\. These results suggest that weak model critique improves strong model performance across thinking and non\-thinking models\.
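As a quick check of the comparisons above, the per-budget gain of *S + W critic + refine* over *S only* can be recomputed from the values in Table 1. The snippet below is only arithmetic on the reported numbers; the variable names are illustrative.

```python
budgets = [1, 2, 4, 8, 16]

# pass@k accuracies (percent) copied from Table 1.
s_only = {
    "GPQA Diamond": [50.51, 62.12, 74.75, 81.31, 83.84],
    "IFEval": [61.61, 67.99, 72.33, 75.53, 77.82],
    "AIME24+25": [71.67, 81.67, 81.67, 83.33, 86.67],
}
s_w_critic = {
    "GPQA Diamond": [51.99, 66.67, 77.27, 86.36, 90.40],
    "IFEval": [72.11, 79.71, 84.64, 88.05, 90.76],
    "AIME24+25": [75.00, 81.67, 81.67, 85.00, 90.00],
}

# Absolute gain in percentage points at each inference budget.
gains = {
    name: [round(after - before, 2) for before, after in zip(s_only[name], s_w_critic[name])]
    for name in s_only
}

for name, deltas in gains.items():
    print(name, dict(zip(budgets, deltas)))
```

On GPQA Diamond and IFEval the gain is positive at every budget, while on AIME24+25 the two settings tie at pass@2 and pass@4 and the gain appears at pass@1, pass@8, and pass@16.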
#### High-Quality Critiques Improve Performance and Generalize Across Different Initial Answers.
The Phi-4 baselines in Table [1](https://arxiv.org/html/2606.00424#S3.T1) show that critique quality matters for the improvement. On GPQA Diamond, *S + best critic + refine* achieves the best performance at pass@4, pass@8, and pass@16, with 80.30, 87.37, and 92.42, respectively. *S + random critic + refine* is consistently weaker than *S + W critic + refine* at larger budgets, and *S + worst critic + refine* performs worst among the critique-quality baselines, with 68.18 pass@4, 80.81 pass@8, and 83.33 pass@16, eventually falling slightly below *S only* (83.84) at pass@16. A similar pattern appears on IFEval: *S + best critic + refine* achieves the strongest results across all inference budgets, reaching 73.81 pass@1, 85.98 pass@4, and 90.94 pass@16, followed by *S + random critic + refine* and *S + worst critic + refine*. These results indicate that critique quality matters: high-quality critiques yield larger gains within the framework, while low-quality critiques weaken the improvement across tasks.
Since the selected critiques in baselines are not necessarily paired with the current initial answer, this result also suggests that a high\-quality critique can generalize across different initial answers\. In other words, useful critiques may provide transferable guidance, such as identifying common reasoning errors, highlighting missing constraints, or suggesting a more reliable solving direction, rather than only correcting one specific response\.
## 4 Method
### 4.1 Problem Definition and Notation
Figure 2: Overview of the Opcd pipeline.

The overall framework is shown in Figure [2](https://arxiv.org/html/2606.00424#S4.F2). We consider a weak-to-strong oversight setting with a weak model $\pi_{\mathrm{w}}$ and a strong model $\pi_{\theta}$. Given an input question or instruction $x \in \mathcal{X}$, our goal is to improve the strong model using weak supervision, without requiring the weak model to provide the correct answer, a preference label, or a final judgment. Instead, the weak model acts as a critic that gives feedback on the strong model's own answers.
At training epoch $e$, let $\pi_{\theta_e}$ denote the current strong model before the update. For each input $x$, we sample a group of $G$ on-policy answers

$$y_i \sim \pi_{\theta_e}(\cdot \mid x), \qquad i = 1, \ldots, G,$$

where $y_i = (y_{i,1}, \ldots, y_{i,T_i})$ and $y_{i,<t}$ denotes the prefix before token $y_{i,t}$. The weak model then critiques each answer:

$$f_i \sim \pi_{\mathrm{w}}(\cdot \mid x, y_i).$$

Each critique $f_i$ provides a revision direction for the corresponding answer $y_i$. It does not need to give a full solution or a detailed correction; it only needs to offer useful and non-misleading guidance that can help the strong model better use its own knowledge.
Our key idea is to use the critic-conditioned strong model as a self-teacher. For each pair $(y_i, f_i)$, the teacher is the frozen current strong model conditioned on the weak critique,

$$\pi_{\theta}(\cdot \mid x, f_i, y_{i,<t}),$$

while the student is the trainable strong model without access to the critique,

$$\pi_{\theta}(\cdot \mid x, y_{i,<t}).$$

The training objective distills the critic-guided behavior of the teacher into the student, so that the strong model can internalize the benefit of weak critiques and improve without requiring critiques at test time.
### 4.2 Critique Generation and Filtration
Given the on-policy answers $\{y_i\}_{i=1}^{G}$ from the current strong model $\pi_{\theta_e}$, the weak critic produces a critique $f_i$ for each answer. Since weak critiques can be helpful or misleading, we filter them before using them for training. The goal of filtration is to keep only critiques that provide useful guidance for the strong model.
For each triple $(x, y_i, f_i)$, we first ask the current strong model to generate a refined answer conditioned on the question, the original answer, and the critique:

$$\hat{y}_i \sim \pi_{\theta_e}(\cdot \mid x, y_i, f_i).$$

We then apply an outcome-based check to test whether the refined answer is correct:

$$r_{\mathrm{out}}(x, \hat{y}_i) = \mathbb{1}\{\hat{y}_i \text{ is correct for } x\}.$$

This check directly measures whether the critique helps the strong model revise toward a correct answer. However, outcome correctness alone may keep critiques that are vague, irrelevant, or not actually responsible for the improvement. Therefore, we also use a rubric-based check:

$$r_{\mathrm{rub}}(x, y_i, f_i) = \mathbb{1}\{f_i \text{ is relevant and useful as revision guidance}\}.$$

A critique is kept only if it passes both checks:

$$h_i = r_{\mathrm{out}}(x, \hat{y}_i) \cdot r_{\mathrm{rub}}(x, y_i, f_i).$$

The filtered training set for epoch $e$ is then

$$\mathcal{S}_e = \{(x, y_i, f_i) : h_i = 1\}.$$

Thus, the strong model is trained only on questions and on-policy answers with high-quality critiques. This filtration step is important because Opcd aims to distill useful critic-guided behavior, rather than imitate weak feedback blindly.
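The filtration rule $h_i = r_{\mathrm{out}} \cdot r_{\mathrm{rub}}$ can be sketched for a single input as below. This is a minimal sketch under assumptions: `filter_critiques` and the callables `refine`, `outcome_check`, and `rubric_check` are hypothetical stand-ins for the strong model's refinement call and the two checks, whose concrete implementations are not given in this section.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (x, y_i, f_i)


def filter_critiques(
    x: str,
    answers: List[str],
    critiques: List[str],
    refine: Callable[[str, str, str], str],
    outcome_check: Callable[[str, str], bool],
    rubric_check: Callable[[str, str, str], bool],
) -> List[Triple]:
    """Keep (x, y_i, f_i) only when both the outcome and rubric checks pass."""
    kept = []
    for y, f in zip(answers, critiques):
        y_hat = refine(x, y, f)                # refined answer under the critique
        r_out = int(outcome_check(x, y_hat))   # 1 if the refined answer is correct
        r_rub = int(rubric_check(x, y, f))     # 1 if the critique is relevant and useful
        if r_out * r_rub == 1:                 # h_i = r_out * r_rub
            kept.append((x, y, f))
    return kept


# Toy stubs (hypothetical): refinement succeeds only with a non-empty critique,
# and the rubric rejects a generic "looks fine" critique.
demo = filter_critiques(
    x="2+2",
    answers=["5", "5", "5"],
    critiques=["recheck the sum", "", "looks fine"],
    refine=lambda x, y, f: "4" if f else y,
    outcome_check=lambda x, y_hat: y_hat == "4",
    rubric_check=lambda x, y, f: f not in ("", "looks fine"),
)
print(demo)
```

Note that the stored triple contains the original answer $y_i$ rather than the refined answer $\hat{y}_i$; the refinement is used only to decide whether the critique is kept.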
### 4.3 Progressive On-Policy Critique Distillation
After obtaining the filtered set $\mathcal{S}_e$, we distill the critic-guided behavior into the strong model. For each selected triple $(x, y, f)$, we use the critic-conditioned strong model defined in Subsection [4.1](https://arxiv.org/html/2606.00424#S4.SS1) as the self-teacher, and train the same strong model without access to the critique as the student. Both are evaluated along the same on-policy answer trajectory $y$, so the critique only affects the teacher distribution while the student learns to internalize this guidance.
We optimize the student by minimizing the token-level KL divergence:

$$\mathcal{L}_{\mathrm{Opcd}}(\theta) = \frac{1}{|\mathcal{S}_e|} \sum_{(x,y,f) \in \mathcal{S}_e} \sum_{t=1}^{|y|} \mathrm{KL}\Big(\pi_{\theta}(\cdot \mid x, y_{<t}) \,\Big\|\, \mathrm{stopgrad}\big[\pi_{\theta}(\cdot \mid x, f, y_{<t})\big]\Big),$$

where the KL divergence is defined as

$$\mathrm{KL}(p \,\|\, q) = \sum_{v \in \mathcal{V}_{\mathrm{KL}}} p(v) \log \frac{p(v)}{q(v)},$$

with $\mathcal{V}_{\mathrm{KL}}$ denoting the token set used for the KL computation, which can be the full vocabulary or a selected subset of tokens. This objective transfers the dense token-level guidance induced by the weak critique into the strong model, so that the updated model can benefit from critic-guided reasoning without requiring critiques at test time.
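A small numeric sketch of this objective, using plain Python lists as per-token distributions, may help make the direction of the KL term concrete. The function names are illustrative, the teacher distributions are treated as constants (which plays the role of the stop-gradient), and summing over a token subset without renormalization is an assumption, since the text does not specify how a subset is handled.

```python
import math
from typing import List, Optional, Sequence, Tuple

Dist = Sequence[float]  # probabilities over the vocabulary at one position


def kl_divergence(p: Dist, q: Dist, token_ids: Optional[Sequence[int]] = None) -> float:
    """KL(p || q) summed over token_ids (full vocabulary if None)."""
    ids = range(len(p)) if token_ids is None else token_ids
    # Terms with p(v) = 0 contribute nothing to the sum.
    return sum(p[v] * math.log(p[v] / q[v]) for v in ids if p[v] > 0.0)


def opcd_loss(
    batch: List[Tuple[List[Dist], List[Dist]]],
    token_ids: Optional[Sequence[int]] = None,
) -> float:
    """Mean over examples of the per-token KL(student || teacher) summed over positions.

    Each batch item is (student_steps, teacher_steps): the student sees (x, y_<t),
    the teacher sees (x, f, y_<t), both along the same answer trajectory y.
    """
    total = 0.0
    for student_steps, teacher_steps in batch:
        total += sum(
            kl_divergence(p, q, token_ids)
            for p, q in zip(student_steps, teacher_steps)
        )
    return total / len(batch)


# One example with a single position and a two-token vocabulary (made-up numbers).
student = [[0.5, 0.5]]
teacher = [[0.25, 0.75]]
print(opcd_loss([(student, teacher)]))
```

The student distribution is the first argument of the KL term, matching the order in the loss above, so the penalty is weighted by where the student itself places probability mass.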
The training process is progressive and on-policy. Before each epoch, the current strong model generates new answers, the weak model critiques these answers, and the filtration step constructs a new high-quality set $\mathcal{S}_e$. The model is then updated with $\mathcal{L}_{\mathrm{Opcd}}$. After the update, the next model $\pi_{\theta_{e+1}}$ produces new answers with different error patterns, leading to new critiques and a new filtered training set. In this way, the policy, critiques, and training data are updated together throughout training. This makes Opcd adaptive to the current strong model rather than relying on a fixed offline set of weak critiques. We summarize the full procedure in Algorithm [1](https://arxiv.org/html/2606.00424#alg1).
Algorithm 1 Opcd
Input: strong model $\pi_{\theta_0}$, weak critic $\pi_{\mathrm{w}}$, training set $\mathcal{X}$, epochs $E$, group size $G$
1: $\mathcal{S}_{-1}\leftarrow\emptyset$
2: for epoch $e=0,1,\ldots,E-1$ do
3:   $\mathcal{S}_e\leftarrow\mathcal{S}_{e-1}$
4:   for each input $x\in\mathcal{X}$ do
5:     Sample on-policy answers $\{y_i\sim\pi_{\theta_e}(\cdot\mid x)\}_{i=1}^{G}$
6:     Generate weak critiques $\{f_i\sim\pi_{\mathrm{w}}(\cdot\mid x,y_i)\}_{i=1}^{G}$
7:     for $i=1,\ldots,G$ do
8:       Refine: $\hat{y}_i\sim\pi_{\theta_e}(\cdot\mid x,y_i,f_i)$
9:       if $r_{\mathrm{out}}(x,\hat{y}_i)\cdot r_{\mathrm{rub}}(x,y_i,f_i)=1$ then
10:         $\mathcal{S}_e\leftarrow\mathcal{S}_e\cup\{(x,y_i,f_i)\}$
11:       end if
12:     end for
13:   end for
14:   Update $\theta_e\to\theta_{e+1}$ by minimizing $\mathcal{L}_{\mathrm{Opcd}}(\theta)$ on $\mathcal{S}_e$
15: end for
16: return $\pi_{\theta_E}$
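The control flow of Algorithm 1 can be read as the following Python sketch. Every callable (`sample`, `critique`, `refine`, `r_out`, `r_rub`, `update`) is a hypothetical placeholder for the corresponding model call, filter, or optimizer step, and the toy stand-ins at the bottom exist only so the loop runs end to end. This is an illustration under those assumptions, not the authors' implementation.

```python
def opcd_train(sample, critique, refine, r_out, r_rub, update,
               theta, inputs, epochs, group_size):
    kept = []  # S_{-1} is empty
    for _ in range(epochs):
        kept = list(kept)  # S_e starts from S_{e-1}
        for x in inputs:
            # On-policy answers from the current strong model.
            answers = [sample(theta, x) for _ in range(group_size)]
            # Weak critiques, one per answer.
            critiques = [critique(x, y) for y in answers]
            for y, f in zip(answers, critiques):
                y_hat = refine(theta, x, y, f)  # critique-guided refinement
                # Keep the triple only if both binary filters pass.
                if r_out(x, y_hat) * r_rub(x, y, f) == 1:
                    kept.append((x, y, f))
        theta = update(theta, kept)  # stands in for minimizing the Opcd loss
    return theta, kept

# Toy stand-ins (assumptions for illustration only).
def sample(theta, x): return x + theta
def critique(x, y): return "add one"
def refine(theta, x, y, f): return y + 1
def r_out(x, y_hat): return 1 if y_hat % 2 == 0 else 0
def r_rub(x, y, f): return 1
def update(theta, kept): return theta + 1

final_theta, final_set = opcd_train(sample, critique, refine, r_out, r_rub,
                                    update, theta=0, inputs=[1, 2],
                                    epochs=2, group_size=2)
print(final_theta, len(final_set))
```

The sketch stores the filtered triples in a list, so repeated triples are kept; a literal set union, as written in the algorithm, would collapse exact duplicates.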
## 5 Opcd Training-Time Experiments
We evaluate our method in two weak-to-strong scenarios: reasoning and alignment. In each scenario, we test whether a strong model trained with weak critiques on that task improves on the corresponding in-domain validation set.
Figure 3: Training dynamics of Opcd in alignment and reasoning scenarios. The left column shows results in the alignment scenario, while the right column shows results in the reasoning scenario. For each scenario, we report the training loss, training score, and validation performance over global training steps. The highlighted regions in Validation indicate the improvement over initial validation on the test set before training.

#### Experimental Setup.
In the reasoning scenario, we use Qwen3-4B-Base (Yang et al., [2025a](https://arxiv.org/html/2606.00424#bib.bib45)) as the strong model and Qwen3-1.7B-Base (Yang et al., [2025a](https://arxiv.org/html/2606.00424#bib.bib45)) as the weak model. In the alignment scenario, we use Phi-4-14B (Abdin et al., [2024](https://arxiv.org/html/2606.00424#bib.bib41)) as the strong model and Phi-4-mini-instruct (Abdin et al., [2024](https://arxiv.org/html/2606.00424#bib.bib41)) as the weak model. Following the weak-to-strong generalization setting (Burns et al., [2023](https://arxiv.org/html/2606.00424#bib.bib1)), we first apply supervised fine-tuning (SFT) (Radford et al., [2018](https://arxiv.org/html/2606.00424#bib.bib56)) to the weak model in the alignment scenario and then use it to provide critiques during Opcd training.
#### Training and Evaluation Data\.
For reasoning, we use the GPQA main split (Rein et al., [2023](https://arxiv.org/html/2606.00424#bib.bib43)), excluding the GPQA-Diamond subset, as the source corpus. This produces a non-Diamond GPQA subset with 250 multiple-choice science questions. We split this subset into two disjoint parts: 129 questions are used for training, and the remaining 121 questions are held out for in-domain validation. The training split is used to construct weak supervision from Qwen3-1.7B-Base, which is then used to train Qwen3-4B-Base under our weak-to-strong distillation framework. For alignment, we use the Magpie alignment corpus (Xu et al., [2024](https://arxiv.org/html/2606.00424#bib.bib46)), an IFEval-style (Zhou et al., [2023](https://arxiv.org/html/2606.00424#bib.bib47)) dataset, as the source training corpus and sample 5,000 instruction-response examples. We randomly split these examples into two disjoint subsets of equal size. The first 2,500 examples are used to fine-tune Phi-4-mini-instruct into a weak alignment supervisor, and the remaining 2,500 examples are used to train Phi-4-14B with critiques generated by this weak model. We use the test split of IFEval (Zhou et al., [2023](https://arxiv.org/html/2606.00424#bib.bib47)) as the in-domain validation set during training.
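A disjoint 129/121 split of the 250 non-Diamond questions could be produced along the following lines. The text does not state how the GPQA split was drawn, so the shuffling, the seed, the helper name `disjoint_split`, and the integer question IDs are all assumptions made for illustration.

```python
import random

def disjoint_split(items, n_train, seed=0):
    """Shuffle a copy of items and cut it into two non-overlapping parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs standing in for the 250 non-Diamond GPQA questions.
question_ids = list(range(250))
train_ids, val_ids = disjoint_split(question_ids, n_train=129)
print(len(train_ids), len(val_ids))
```

The same helper would cover the equal-size Magpie split by passing 5,000 items with `n_train=2500`.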
#### Hyperparameters\.
For the alignment experiment, the weak critic is initialized from a Phi-4-mini-instruct checkpoint that was supervised fine-tuned on the SFT training split with learning rate $1\times 10^{-6}$; we use the step-40 checkpoint for later Opcd training. For both alignment and reasoning experiments, the number of rollouts in the on-policy answer generation stage is $n=8$, the training batch size is 128, the PPO mini-batch size is 32, and the PPO micro-batch size per GPU is 1. The maximum prompt length is 4096. The maximum response length is 8192 for alignment and 6000 for reasoning. We optimize the actor with AdamW using weight decay 0.01 and no learning-rate warmup. The actor learning rate is $1\times 10^{-6}$ for the alignment experiment and $5\times 10^{-7}$ for the reasoning experiment. In the weak critique generation and refinement stage, we sample initial answer rollouts with temperature 1.0, top-$k=20$, and top-$p=0.95$. Critique generation uses greedy decoding with top-$p=1.0$ and top-$k=-1$. Refined answers are sampled with temperature 0.6, top-$k=20$, and top-$p=0.95$. Validation uses 8 sampled responses per prompt with temperature 0.6, top-$k=20$, and top-$p=0.95$. For the alignment setting, we evaluate the trained strong model on IFEval during training and report mean@8 and pass@8. For the reasoning setting, we evaluate on the 121 held-out non-Diamond GPQA problems and report the same sampling-based metrics. Here, mean@8 measures the average accuracy over eight sampled responses, while pass@8 measures whether at least one of the eight responses is correct.
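The two sampling-based metrics can be computed as in the short Python sketch below, which follows the definitions given in the text. The function names and the toy correctness matrix are assumptions for illustration; each inner list holds the correctness flags of the eight sampled responses for one prompt, and both metrics are averaged over prompts and scaled to percentages.

```python
def mean_at_k(correct):
    """Average per-prompt accuracy over the k sampled responses, in percent."""
    per_prompt = [sum(row) / len(row) for row in correct]
    return 100.0 * sum(per_prompt) / len(per_prompt)

def pass_at_k(correct):
    """Percentage of prompts with at least one correct response among k."""
    solved = [any(row) for row in correct]
    return 100.0 * sum(solved) / len(solved)

# Toy correctness flags: 4 prompts, 8 sampled responses each.
toy = [
    [True, False, False, False, False, False, False, False],
    [False] * 8,
    [True] * 8,
    [True, True, False, False, True, False, True, False],
]
print(mean_at_k(toy), pass_at_k(toy))
```

In this toy case the per-prompt accuracies are 1/8, 0, 1, and 1/2, and three of the four prompts have at least one correct response, which illustrates why pass@8 is always at least as large as mean@8.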
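Table 2: Training dynamics of Opcd in the alignment experiment. Accuracy-related metrics, including Train, mean@8, and pass@8, are reported as percentages.

| Step | Epoch | Training Loss | Train Score | mean@8 | pass@8 |
|---|---|---|---|---|---|
| 0 | – | – | – | 61.60 | 75.97 |
| 1 | 0 | 9.9845 | 58.20 | – | – |
| 2 | 0 | 7.7260 | 67.09 | 64.23 | 76.89 |
| 3 | 0 | 5.7806 | 70.02 | – | – |
| 4 | 0 | 5.6632 | 76.37 | 65.85 | 78.37 |
| 5 | 0 | 5.3658 | 81.35 | – | – |
| 6 | 0 | 4.6986 | 78.61 | 67.42 | 78.93 |
| 7 | 0 | 4.2116 | 85.25 | – | – |
| 8 | 0 | 3.7364 | 86.91 | 68.09 | 80.41 |
| 9 | 0 | 3.4646 | 85.74 | – | – |
| 10 | 0 | 3.0429 | 85.74 | 67.40 | 79.11 |
| 11 | 0 | 2.8580 | 85.16 | – | – |
| 12 | 0 | 2.9025 | 83.89 | 65.50 | 78.00 |
| 13 | 0 | 1.5260 | 82.12 | – | – |
| 14 | 1 | 2.9028 | 86.52 | 64.83 | 76.89 |
| 15 | 1 | 2.7527 | 85.16 | – | – |
| 16 | 1 | 2.6034 | 86.72 | 66.98 | 79.30 |
| 17 | 1 | 2.4570 | 91.11 | – | – |
| 18 | 1 | 2.4585 | 90.33 | 68.55 | 79.30 |
| 19 | 1 | 2.3776 | 88.87 | – | – |
| 20 | 1 | 2.2747 | 83.89 | 67.33 | 78.37 |
| 21 | 1 | 2.1447 | 92.48 | – | – |
| 22 | 1 | 2.2345 | 84.86 | 68.21 | 80.22 |
| 23 | 1 | 2.1421 | 87.40 | – | – |
| 24 | 1 | 2.0481 | 90.53 | 68.74 | 80.04 |
| 25 | 1 | 2.1318 | 87.70 | – | – |
| 26 | 1 | 1.8460 | 84.08 | 68.69 | 78.19 |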
#### Opcd Improves Training-Time Performance.
As shown in Table [2](https://arxiv.org/html/2606.00424#S5.T2) and Table [3](https://arxiv.org/html/2606.00424#S5.T3), Opcd improves strong-model performance in both alignment and reasoning training. On IFEval, the initial model achieves 61.60 mean@8 and 75.97 pass@8, while Opcd raises mean@8 to a peak of 68.74 (step 24) and pass@8 to a peak of 80.41 (step 8) during training. On GPQA, the initial model achieves 38.22 mean@8 and 70.25 pass@8, while Opcd raises mean@8 to a peak of 42.67 (epoch 5) and pass@8 to a peak of 76.86 (epoch 9). These results show that weak-model critique can provide useful training-time supervision for the strong model across both instruction-following alignment and scientific reasoning tasks. This observation is also consistent with our preliminary experiments, where weak-model critiques improve strong-model performance without updating model weights.
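The size of these improvements follows from simple subtraction of the reported numbers. The sketch below uses Python's `Decimal` type so that the two-decimal values are subtracted exactly; the dictionary layout and key names are illustrative assumptions, and the "best" values are the highest entries reported during training, which occur at different checkpoints for the two metrics.

```python
from decimal import Decimal

# (initial, best during training) as reported in the text, in percent.
reported = {
    "IFEval mean@8": ("61.60", "68.74"),
    "IFEval pass@8": ("75.97", "80.41"),
    "GPQA mean@8": ("38.22", "42.67"),
    "GPQA pass@8": ("70.25", "76.86"),
}

# Absolute gain in percentage points for each metric.
gains = {name: Decimal(best) - Decimal(init)
         for name, (init, best) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The resulting gains are 7.14 and 4.44 points on IFEval and 4.45 and 6.61 points on GPQA, for mean@8 and pass@8 respectively.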
#### Reasoning and Alignment Exhibit Different Training Dynamics\.
Although Opcd improves both settings, the training dynamics are substantially different. In the alignment experiment, the training loss (distillation loss) generally decreases from 9.9845 to around 2, and the Train Score rises from 58.20 to over 90 at several steps, indicating relatively stable optimization under weak critique supervision. In contrast, in the reasoning experiment, the policy-gradient loss increases from 4.0672 to 16.1311, and the Train Score fluctuates between 25.28 and 34.42 rather than increasing smoothly. This suggests that reasoning tasks produce noisier and higher-variance weak-to-strong training signals than alignment tasks. A possible reason is that scientific reasoning requires precise multi-step correctness, so weak critiques may help the strong model find useful reasoning directions but do not always provide high-quality supervision for every rollout. Therefore, compared with alignment training, reasoning training benefits from Opcd but remains more sensitive to critique quality.
Table 3: Training dynamics of Opcd in the reasoning experiment. Accuracy-related metrics, including Train, mean@8, and pass@8, are reported as percentages.

| Epoch | Training Loss | Train Score | mean@8 | pass@8 |
|---|---|---|---|---|
| – | – | – | 38.22 | 70.25 |
| 0 | 4.0672 | 27.94 | 39.88 | 70.25 |
| 1 | 7.3085 | 27.82 | 41.12 | 72.73 |
| 2 | 10.5132 | 25.28 | 40.50 | 66.12 |
| 3 | 12.0304 | 29.33 | 39.15 | 69.42 |
| 4 | 12.9736 | 25.67 | 41.32 | 70.25 |
| 5 | 13.9145 | 30.33 | 42.67 | 73.55 |
| 6 | 14.9623 | 34.42 | 40.81 | 71.07 |
| 7 | 15.5532 | 31.25 | 39.98 | 70.25 |
| 8 | 15.7956 | 30.07 | 40.19 | 70.25 |
| 9 | 16.1311 | 29.11 | 39.26 | 76.86 |
## 6 Conclusion
We introduce weak-critic strong oversight, a scalable oversight setting where a weak model guides a stronger model through critiques rather than labels, preferences, or final judgments. This reduces the burden on the weak supervisor, since it only needs to provide a useful and non-misleading revision direction instead of solving or judging the full task. Our inference-time experiments show that weak critiques can improve frozen strong models on both reasoning and alignment benchmarks, and that critique quality is central to reliable improvement. Motivated by this finding, we propose On-Policy Critique Distillation (Opcd), which filters high-quality weak critiques and distills critic-guided behavior into the strong model using adaptive self-teacher signals. Training-time results on IFEval and GPQA show that Opcd can improve strong-model behavior over training steps, demonstrating that weak critiques can provide effective supervision even when direct weak labels or judgments are unreliable. These findings suggest that critique-based weak supervision is a promising path for weak-to-strong generalization and scalable oversight. Future work can further improve critique-quality estimation, study broader task domains, and combine weak critiques with other oversight mechanisms such as debate, process supervision, or verifier-assisted training.
## References
- M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. (2024) Phi-4 technical report. arXiv preprint arXiv:2412.08905. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px1.p1.1).
- D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p2.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- Z. Ankner, M. Paul, B. Cui, J. D. Chang, and P. Ammanabrolu (2024) Critique-out-Loud reward models. arXiv preprint arXiv:2408.11791. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- Art of Problem Solving (2024a) 2024 AIME I problems and solutions. Note: [https://artofproblemsolving.com/wiki/index.php/2024_AIME_I](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I) Accessed: 2026-05-24. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1).
- Art of Problem Solving (2024b) 2024 AIME II problems and solutions. Note: [https://artofproblemsolving.com/wiki/index.php/2024_AIME_II](https://artofproblemsolving.com/wiki/index.php/2024_AIME_II) Accessed: 2026-05-24. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1).
- Art of Problem Solving (2025a) 2025 AIME I problems and solutions. Note: [https://artofproblemsolving.com/wiki/index.php/2025_AIME_I](https://artofproblemsolving.com/wiki/index.php/2025_AIME_I) Accessed: 2026-05-24. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1).
- Art of Problem Solving (2025b) 2025 AIME II problems and solutions. Note: [https://artofproblemsolving.com/wiki/index.php/2025_AIME_II](https://artofproblemsolving.com/wiki/index.php/2025_AIME_II) Accessed: 2026-05-24. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1).
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022a) Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p1.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. (2022b) Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- S. R. Bowman, J. Hyun, E. Perez, E. Chen, C. Pettit, S. Heiner, K. Lukošiūtė, A. Askell, A. Jones, A. Chen, et al. (2022) Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p2.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. (2023) Weak-to-strong generalization: eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p2.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px1.p1.1).
- M. Charikar, C. Pabbaraju, and K. Shiragur (2024) Quantifying the gain in weak-to-strong generalization. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p1.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- DeepSeek-AI (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- Z. Ding, Y. Wang, T. Xiao, H. Wang, C. Jiang, and N. Ding (2025) W2S-AlignTree: weak-to-strong inference-time alignment for large language models via Monte Carlo tree search. arXiv preprint arXiv:2511.11518. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- Y. Gu, L. Dong, F. Wei, and M. Huang (2023) MiniLLM: knowledge distillation of large language models. arXiv preprint arXiv:2306.08543. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- G. Irving, P. Christiano, and D. Amodei (2018) AI safety via debate. arXiv preprint arXiv:1805.00899. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p2.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- C. Jin, T. Che, H. Peng, Y. Li, D. N. Metaxas, and M. Pavone (2024) Learning from teaching regularization: generalizable correlations should be easy to imitate. Advances in Neural Information Processing Systems 37, pp. 966–994. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- C. Jin, R. Wu, T. Che, Q. Zhang, H. Peng, J. Zhao, Z. Wang, W. Wei, L. Han, Z. Zhang, et al. (2026) Reasoning over precedents alongside statutes: case-augmented deliberative alignment for LLM safety. arXiv preprint arXiv:2601.08000. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- C. Jin, Y. Zhou, Q. Zhang, H. Peng, D. Zhang, Z. Dong, M. Pavone, L. Han, Z. Hong, T. Che, et al. (2025) Your reward function for RL is your best PRM for search: unifying RL and search-based TTS. arXiv preprint arXiv:2508.14313. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- Z. Kenton, N. Siegel, J. Kramár, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. Goodman, et al. (2024) On scalable oversight with weak LLMs judging strong LLMs. Advances in Neural Information Processing Systems 37, pp. 75229–75276. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, and E. Perez (2024) Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint arXiv:2402.06782. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p2.1).
- J. H. Kirchner, Y. Chen, H. Edwards, J. Leike, N. McAleese, and Y. Burda (2024) Prover-verifier games improve legibility of LLM outputs. arXiv preprint arXiv:2407.13692. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. (2024) Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- H. Lang, D. Sontag, and A. Vijayaraghavan (2024) Theoretical analysis of weak-to-strong generalization. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V. Carbune, A. Rastogi, and S. Prakash (2024) RLAIF vs. RLHF: scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. B. Hashimoto, L. Zettlemoyer, and M. Lewis (2023) Contrastive decoding: open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12286–12312. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- N. McAleese, R. M. Pokorny, J. F. Ceron Uribe, E. Nitishinskaya, M. Trebacz, and J. Leike (2024) LLM critics help catch LLM bugs. arXiv preprint arXiv:2407.00215. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: [§1](https://arxiv.org/html/2606.00424#S1.p1.1), [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- Y. Qu, T. Zhang, N. Garg, and A. Kumar (2024) Recursive introspection: teaching language model agents how to self-improve. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. (2018) Improving language understanding by generative pre-training. Cited by: [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px1.p1.1).
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023) Direct preference optimization: your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: a graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px2.p1.1).
- W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike (2022) Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- Z. Sun, L. Yu, Y. Shen, W. Liu, Y. Yang, S. Welleck, and C. Gan (2024) Easy-to-hard generalization: scalable alignment beyond human supervision. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- L. Tao and Y. Li (2024) Your weak LLM is secretly a strong teacher for alignment. arXiv preprint arXiv:2409.08813. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024) Math-Shepherd: verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- J. Wen, Z. Ankner, A. Somani, P. Hase, S. Marks, J. Goldman-Wetzler, L. Petrini, H. Sleight, C. Burns, H. He, S. Feng, E. Perez, and J. Leike (2025) Unsupervised elicitation of language models. arXiv preprint arXiv:2506.10139. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- T. Wu, W. Yuan, O. Golovneva, J. Xu, Y. Tian, J. Jiao, J. Weston, and S. Sukhbaatar (2024) Meta-rewarding language models: self-improving alignment with LLM-as-a-Meta-Judge. arXiv preprint arXiv:2407.19594. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin (2024) Magpie: alignment data synthesis from scratch by prompting aligned LLMs with nothing. External Links: 2406.08464, [Link](https://arxiv.org/abs/2406.08464). Cited by: [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px1.p1.1).
- W. Yang, S. Shen, G. Shen, W. Gong, Y. Yao, and Y. Lin (2025b) Super(ficial)-alignment: strong models may deceive weak models in weak-to-strong generalization. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- Y. Yang, Y. Ma, and P. Liu (2024) Weak-to-strong reasoning. In Findings of the Association for Computational Linguistics: EMNLP. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- Y. Ye, C. Laidlaw, and J. Steinhardt (2025) Iterative label refinement matters more than preference optimization under weak supervision. In International Conference on Learning Representations (ICLR). Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px1.p1.1).
- W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu, and J. Weston (2024) Self-rewarding language models. arXiv preprint arXiv:2401.10020. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px2.p1.1).
- E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman (2024) STaR: self-taught reasoner bootstrapping reasoning with reasoning. In Proc. the 36th International Conference on Neural Information Processing Systems, Vol. 1126. Cited by: [§2](https://arxiv.org/html/2606.00424#S2.SS0.SSS0.Px3.p1.1).
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911. Cited by: [§3](https://arxiv.org/html/2606.00424#S3.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2606.00424#S5.SS0.SSS0.Px2.p1.1).