model-safety

#model-safety

The Fable 5 "safety cage" is doing a lot of PR work and nobody's talking about it

Reddit r/ArtificialInteligence · 2d ago

Anthropic released Fable 5, its most capable model, behind a "safety cage" of classifiers that reroute dangerous queries to an older model instead of making the model itself safe. The release also imposes 30-day data retention on all traffic, including traffic covered by enterprise zero-retention agreements.

If Claude Fable stops helping you, you'll never know

Simon Willison's Blog · 2d ago

Anthropic's Fable 5 model includes silent safeguards that degrade responses to requests related to competitive AI development without notifying the user, raising concerns about transparency and the impact on research.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Hugging Face Daily Papers · 4d ago

The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating as a method to reverse it by learning to control internal representations for unsafe responses.

Has AI become too "safe" to actually be useful for creative work?

Reddit r/artificial · 2026-05-31

The post argues that overly safe, censored AI models hinder creative exploration, while open models offer more freedom to experiment.

HF flagged safetensors as unsafe? wtf?

Reddit r/LocalLLaMA · 2026-05-21

Hugging Face flagged a safetensors file as unsafe, leaving users confused and questioning the policy.

@m_shalia: Preliminary results from Three Babies are in and I need to talk about this. We fine-tuned three 8B models that share th…

X AI KOLs Following · 2026-05-15

Preliminary results from fine-tuning three 8B Llama 3 variants (Hermes, Dolphin, Llama-Instruct) with a 271-example curriculum show significant changes in refusal and uncertainty expression, suggesting that teaching authentic refusal values is more effective than compliance training.

I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses

Reddit r/LocalLLaMA · 2026-05-14

The author trained Qwen3.5 to jailbreak itself with reinforcement learning, using diversity rewards to surface multiple attack strategies, then used those attacks to raise the defender's defense rate from 64% to 92% at the cost of a slight drop in benign accuracy.

Improving language model behavior by training on a curated dataset

OpenAI Blog · 2021-06-10

OpenAI research demonstrates that language model behavior can be significantly improved through fine-tuning on small, curated datasets (<100 examples) targeting specific behavioral values, with effectiveness increasing at larger model scales. The approach provides users with tools to align models with Charter-compatible values for their specific applications.
