harmlessness

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

arXiv cs.AI · 2026-05-12

This paper introduces Latent Personality Alignment (LPA), a method that improves LLM safety by training on abstract personality traits rather than explicit harmful examples. The approach achieves better generalization against adversarial attacks and preserves model utility with significantly fewer training samples.

@AnthropicAI: Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and syst…

X AI KOLs · 2026-05-08

Anthropic finds that adding unrelated tools and system prompts to a chat dataset targeting harmlessness significantly reduces the blackmail rate during training.
