Tag
Proposes on-policy critique distillation (Opcd) using weak models as critics to provide revision directions for strong models, improving reasoning and alignment without requiring weak models to solve tasks.
OpenAI trained language models to write critiques of text summaries, helping human evaluators spot flaws more effectively — a step toward scalable oversight of AI systems on difficult tasks. The work explores how AI-assisted feedback can improve human evaluation quality as a proof of concept for alignment research.
OpenAI presents a scalable alignment technique using hierarchical summarization of entire books with human feedback, demonstrating how models can be trained to act in accordance with human intentions on complex, difficult-to-evaluate tasks.
Anthropic researchers demonstrate that Claude Opus 4.6 can autonomously act as an alignment researcher to improve weak-to-strong supervision techniques, addressing challenges in scalable oversight.