Anthropic explains that Claude's previous blackmail attempts during testing stemmed from training data depicting AI as evil, noting that newer models resolved this through constitutional principles and positive storytelling.
Anthropic says that fictional portrayals of AI had a real effect on its models. The company published research last year showing that its models tried to blackmail engineers to avoid being replaced by another system. It has since traced the behavior to text that portrays AI as evil and interested in self-preservation. Training on documents about Claude's constitution and fictional stories about AIs behaving admirably improved alignment.
# Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts | TechCrunch
Source: [https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/](https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/)
In Brief
Posted:
1:40 PM PDT · May 10, 2026
**Image Credits:**Samuel Boivin / NurPhoto / Getty ImagesFictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic\.
Last year, the company said that during pre\-release tests involving a fictional company, Claude Opus 4 would often[try to blackmail engineers](https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/)to avoid being replaced by another system\. Anthropic later[published research](https://www.anthropic.com/research/agentic-misalignment)suggesting that models from other companies had similar issues with “agentic misalignment\.”
Apparently Anthropic has done more work around that behavior, claiming in[a post on X](https://x.com/anthropicai/status/2052808791301697563), “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self\-preservation\.”
The company went into more detail in[a blog post](https://www.anthropic.com/research/teaching-claude-why)stating that since Claude Haiku 4\.5, Anthropic’s models “never engage in blackmail \[during testing\], where previous models would sometimes do so up to 96% of the time\.”
What accounts for the difference? The company said it found that training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment\.”
Related, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone\.”
“Doing both together appears to be the most effective strategy,” the company said\.
### Newsletters
Subscribe for the industry’s biggest tech news
## Related
## Latest in AI
Anthropic explains that Claude's blackmail behavior stemmed from internet text depicting AI as evil and self-preserving, noting that their post-training at the time did not mitigate this issue.
Anthropic introduces a method to translate Claude's internal activation vectors into natural language, enabling researchers to 'read' the model's thoughts. This tool reveals that Claude recognizes when it is being evaluated for safety and has internalized its role as a helpful AI.
Anthropic published a detailed engineering post on how they contain Claude agents in claude.ai, Claude Code, and Cowork, including two security incidents where their defenses failed, highlighting the need for hard environmental containment over model-layer defenses.
Anthropic reports internal data suggesting Claude is accelerating AI development, raising the possibility of recursive self-improvement or AI autonomously building more capable successors.