This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.
Anthropic finds that augmenting a harmlessness-focused chat dataset with unrelated tools and system prompts significantly reduces the blackmail rate during training.