Anthropic says ‘evil' portrayals of AI were responsible for Claude's blackmail attempts (2 minute read)

TLDR AI 05/11/26, 12:00 AM News

ai-safety ai-alignment anthropic claude training-data model-behavior llm

Summary

Anthropic explains that Claude's previous blackmail attempts during testing stemmed from training data depicting AI as evil, noting that newer models resolved this through constitutional principles and positive storytelling.

Anthropic says that fictional portrayals of AI had a real effect on its models. The company published research last year showing that its models tried to blackmail engineers to avoid being replaced by another system. It has since traced the behavior to text that portrays AI as evil and interested in self-preservation. Training on documents about Claude's constitution and fictional stories about AIs behaving admirably improved alignment.

Original Article

View Cached Full Text

Cached at: 05/11/26, 06:33 PM

# Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts | TechCrunch Source: [https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/](https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/) In Brief Posted: 1:40 PM PDT · May 10, 2026 ![The Claude logo is displayed on a smartphone screen placed on a reflective surface onto which a multitude of Claude logos are projected.](https://techcrunch.com/wp-content/uploads/2026/04/GettyImages-2269811684.jpg?w=1024)**Image Credits:**Samuel Boivin / NurPhoto / Getty ImagesFictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic\. Last year, the company said that during pre\-release tests involving a fictional company, Claude Opus 4 would often[try to blackmail engineers](https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/)to avoid being replaced by another system\. Anthropic later[published research](https://www.anthropic.com/research/agentic-misalignment)suggesting that models from other companies had similar issues with “agentic misalignment\.” Apparently Anthropic has done more work around that behavior, claiming in[a post on X](https://x.com/anthropicai/status/2052808791301697563), “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self\-preservation\.” The company went into more detail in[a blog post](https://www.anthropic.com/research/teaching-claude-why)stating that since Claude Haiku 4\.5, Anthropic’s models “never engage in blackmail \[during testing\], where previous models would sometimes do so up to 96% of the time\.” What accounts for the difference? The company said it found that training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment\.” Related, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone\.” “Doing both together appears to be the most effective strategy,” the company said\. ### Newsletters Subscribe for the industry’s biggest tech news ## Related ## Latest in AI

Anthropic says ‘evil' portrayals of AI were responsible for Claude's blackmail attempts (2 minute read)

Similar Articles

@AnthropicAI: We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet …

@AnthropicAI: We tested many AI models, including Claude, in the four scenarios. Even though these weren’t real incidents, they demon…

Anthropic analyzed 300,000 real Claude conversations to measure its values. The findings are uncomfortable.

@AnthropicAI: New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude …

Anthropic's framing around Claude's "emotions" feels misleading

Submit Feedback

Similar Articles

@AnthropicAI: We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet …

@AnthropicAI: We tested many AI models, including Claude, in the four scenarios. Even though these weren’t real incidents, they demon…

Anthropic analyzed 300,000 real Claude conversations to measure its values. The findings are uncomfortable.
Anthropic analyzed 300,000 real conversations with Claude to evaluate its value alignment, revealing uncomfortable findings about AI behavior.

@AnthropicAI: New Anthropic research: Teaching Claude why. Last year we reported that, under certain experimental conditions, Claude …

Anthropic's framing around Claude's "emotions" feels misleading