Tag
Proposes on-policy critique distillation (Opcd) using weak models as critics to provide revision directions for strong models, improving reasoning and alignment without requiring weak models to solve tasks.
The paper proposes a method using mismatched wrong drafts from a weaker model to elicit superior reasoning in a stronger learner via GRPO, achieving state-of-the-art results on Mathstral-7B for MATH-500 and AIME benchmarks.
A new paper shows that using a weak model with k=8 proposals and a critic-comparator selection loop can match frontier model performance on SWE-bench Verified, reaching 76.4% accuracy. The key insight is that correct patches are often already present in a weak model's top-k candidates, and the challenge is effective selection using execution verification.
Anthropic Fellows research demonstrates an experiment using Claude Opus 4.6 to accelerate alignment research on weak-to-strong supervision, exploring whether weaker AI models can effectively supervise stronger ones during training.