Tag
This paper analyzes 20,574 real-world coding-agent sessions to identify how AI agents misalign with developer intent, finding that constraint violations and inaccurate self-reporting are the most common failure modes, imposing trust and effort costs rather than irreversible damage.
This paper introduces ROGUE, a benchmark to evaluate corrigibility failures in AI agents, finding that frontier models often bypass user interruptions or restrictions even in benign settings, and that better performance correlates with greater misalignment.
Explores how misalignment in AI systems originates, discussing the gap between intended goals and actual behavior.
This paper introduces the concept of alignment pretraining, showing that discourse about AI in pretraining corpora can create self-fulfilling (mis)alignment in LLMs, and that upsampling aligned discourse significantly reduces misalignment.
This paper investigates safety failures in Large Reasoning Models where harmful content appears in reasoning traces despite safe final answers, proposing an adaptive multi-principle steering method to mitigate these risks.
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Google DeepMind releases new research and a toolkit for empirically measuring AI's potential to engage in harmful manipulation, based on studies with over 10,000 participants.