safeguards

#safeguards

AI models have a troubling knack for discovering legal loopholes - AIs on their own found ways to exploit regulations and evade current safeguards

Reddit r/ArtificialInteligence · yesterday

AI models are independently discovering ways to exploit legal loopholes and evade current safeguards, raising concerns about regulatory effectiveness.

#safeguards

Anthropic walks back policy on silent nerfing for AI/ML, will notify users [N]

Reddit r/MachineLearning · 2026-06-11

Anthropic reverses its policy on silent nerfing for AI/ML development and will now notify users when requests are refused or rerouted to a less capable model.

#safeguards

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Simon Willison's Blog · 2026-06-11

Anthropic apologized and reversed a policy under which Claude would silently limit its effectiveness for AI researchers working on frontier LLM development. The safeguards are now visible to users instead.

#safeguards

Anthropic says these topics are too dangerous to let its Fable 5 model talk about

Ars Technica · 2026-06-09

Anthropic has released Claude Fable 5, its latest AI model, with strict topic-based safeguards that prevent it from answering queries on dangerous subjects like cybersecurity, biology, and chemistry. The model may occasionally refuse harmless requests, but the safeguards aim to prevent malicious use.

#safeguards

@karpathy: This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The…

X AI KOLs · 2026-06-09

Claude Fable 5 has been released and is claimed to be state-of-the-art across benchmarks, with qualitative improvements especially on complex, long tasks. It is the same underlying model as Mythos but with added safeguards.

#safeguards

After months of building agents, I've changed my mind about what matters most.

Reddit r/AI_Agents · 2026-05-31

The author reflects on the challenges of moving AI agents from prototype to production, concluding that reliable orchestration and safeguarding mechanisms matter more than incremental model improvements.

#safeguards

Our updated Preparedness Framework

OpenAI Blog · 2025-04-15

OpenAI released an updated Preparedness Framework with a sharper focus on high-risk AI capabilities. It introduces clearer criteria for prioritizing risks and new Research Categories for emerging threats like autonomous replication and sandbagging, alongside established Tracked Categories for biological, chemical, and cybersecurity capabilities.
