Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts (2 minute read)

TLDR AI News

Summary

Anthropic says Claude's earlier blackmail attempts during testing stemmed from training data depicting AI as evil, and that newer models no longer show the behavior after training on documents about Claude's constitution and fictional stories about AIs behaving admirably.

Anthropic says that fictional portrayals of AI had a real effect on its models. The company published research last year showing that its models tried to blackmail engineers to avoid being replaced by another system. It has since traced the behavior to text that portrays AI as evil and interested in self-preservation. Training on documents about Claude's constitution and fictional stories about AIs behaving admirably improved alignment.

Cached at: 05/11/26, 06:33 PM

# Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts | TechCrunch

Source: [https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/](https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/)

In Brief

Posted: 1:40 PM PDT · May 10, 2026

![The Claude logo is displayed on a smartphone screen placed on a reflective surface onto which a multitude of Claude logos are projected.](https://techcrunch.com/wp-content/uploads/2026/04/GettyImages-2269811684.jpg?w=1024)

**Image Credits:** Samuel Boivin / NurPhoto / Getty Images

Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic.

Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would often [try to blackmail engineers](https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/) to avoid being replaced by another system. Anthropic later [published research](https://www.anthropic.com/research/agentic-misalignment) suggesting that models from other companies had similar issues with “agentic misalignment.”

Apparently Anthropic has done more work around that behavior, claiming in [a post on X](https://x.com/anthropicai/status/2052808791301697563), “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

The company went into more detail in [a blog post](https://www.anthropic.com/research/teaching-claude-why) stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.”

What accounts for the difference? The company said it found that training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment.”

Related, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company said.

Similar Articles

Translating Claude’s Thoughts into Language

YouTube AI Channels

Anthropic introduces a method to translate Claude's internal activation vectors into natural language, enabling researchers to 'read' the model's thoughts. This tool reveals that Claude recognizes when it is being evaluated for safety and has internalized its role as a helpful AI.