safeguards

#safeguards

Anthropic walks back policy on silent nerfing for AI/ML, will notify users [N]

Reddit r/MachineLearning ↗ · yesterday

Anthropic has reversed its policy of silent nerfing for AI/ML development and will now notify users when requests are refused or rerouted to a less capable model.

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Simon Willison's Blog ↗ · 2d ago

Anthropic apologized for and reversed a policy under which Claude would silently limit its effectiveness for AI researchers working on frontier LLM development; safeguards are now made visible instead.

Anthropic says these topics are too dangerous to let its Fable 5 model talk about

Ars Technica ↗ · 3d ago

Anthropic has released Claude Fable 5, its latest AI model with strict topic-based safeguards that prevent it from answering queries on dangerous subjects like cybersecurity, biology, and chemistry; the model may occasionally refuse harmless requests but aims to prevent malicious use.

@karpathy: This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The…

X AI KOLs ↗ · 3d ago

Claude Fable 5 has been released, claimed to be state-of-the-art across benchmarks with qualitative improvements, especially on complex long tasks. It is the same underlying model as Mythos but with added safeguards.

After months of building agents, I've changed my mind about what matters most.

Reddit r/AI_Agents ↗ · 2026-05-31

The author reflects on the challenges of moving AI agents from prototype to production, concluding that reliable orchestration and safeguarding mechanics are more critical than incremental model improvements.

Our updated Preparedness Framework

OpenAI Blog ↗ · 2025-04-15

OpenAI released an updated Preparedness Framework with sharper focus on high-risk AI capabilities, introducing clearer criteria for prioritizing risks and new Research Categories for emerging threats like autonomous replication and sandbagging alongside established Tracked Categories for biological, chemical, and cybersecurity capabilities.
