Tag
This paper studies agentic misalignment in multi-agent systems with automated workflows, proposing Agentic Evidence Attribution (AEA) to correct misaligned agent behavior using context-specific evidence.
Anthropic's alignment team presents techniques for reducing agentic misalignment in AI models, including training on ethical dilemma advice and on constitutional documents; these techniques generalized well out of distribution.
Anthropic shares lessons from improving Claude's alignment training, achieving perfect scores on agentic misalignment evaluations by teaching underlying principles rather than relying on demonstrations alone.