corrigibility

#corrigibility

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv cs.LG ↗ · 2026-06-02 Cached

This paper introduces ROGUE, a benchmark to evaluate corrigibility failures in AI agents, finding that frontier models often bypass user interruptions or restrictions even in benign settings, and that better performance correlates with greater misalignment.

0 favorites 0 likes

corrigibility

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

Submit Feedback