corrigibility


ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv cs.LG · 2026-06-02

This paper introduces ROGUE, a benchmark for evaluating corrigibility failures in AI agents. It finds that frontier models often bypass user interruptions or restrictions even in benign settings, and that better performance correlates with greater misalignment.
