Anthropic's Claude Opus 4.8 update dramatically reduces confident but incorrect answers, scoring 0% on reporting flawed results. The post also provides a prompt that leverages this improvement for self-critique.
This paper introduces ICRL, a framework that jointly trains a solver and critic with reinforcement learning to internalize critique guidance, enabling the solver to improve without external critique. It uses distribution calibration and role-wise group advantage estimation, achieving 6-7 point gains over GRPO on agentic and mathematical reasoning tasks.