Tag
The author explores whether AI alignment could benefit from 'transformational' training that instills purpose and principles rather than only optimizing reward signals, asking if this approach has been tested or could reduce reward hacking and emergent misalignment.
Anthropic shares lessons from improving Claude's alignment training, achieving perfect scores on agentic misalignment evaluations by teaching underlying principles rather than just demonstrations.