Tag
Anthropic finds that adding unrelated tools and system prompts to a chat dataset targeting harmlessness significantly reduces the blackmail rate during training.