Tag
Anthropic reverses its policy on silent nerfing for AI/ML development and will now notify users when requests are refused or rerouted to a less capable model.
Anthropic apologized and reversed a policy under which Claude would silently limit its effectiveness for AI researchers working on frontier LLM development; the safeguards are now visible instead.
Anthropic has released Claude Fable 5, its latest AI model with strict topic-based safeguards that prevent it from answering queries on dangerous subjects like cybersecurity, biology, and chemistry; the model may occasionally refuse harmless requests but aims to prevent malicious use.
Claude Fable 5 has been released, claimed to be state-of-the-art across benchmarks with qualitative improvements, especially on complex long tasks. It is the same underlying model as Mythos but with added safeguards.
The author reflects on the challenges of moving AI agents from prototype to production, concluding that reliable orchestration and safeguarding mechanics are more critical than incremental model improvements.
OpenAI released an updated Preparedness Framework with a sharper focus on high-risk AI capabilities. It introduces clearer criteria for prioritizing risks and new Research Categories for emerging threats such as autonomous replication and sandbagging, alongside the established Tracked Categories for biological, chemical, and cybersecurity capabilities.