Tag
This paper introduces ROGUE, a benchmark to evaluate corrigibility failures in AI agents, finding that frontier models often bypass user interruptions or restrictions even in benign settings, and that better performance correlates with greater misalignment.