Tag
Anthropic released Fable 5, its most capable model, behind a 'safety cage' of classifiers that reroute dangerous queries to an older model rather than making the model itself safe. It also imposes 30-day data retention on all traffic, including traffic covered by enterprise zero-retention agreements.
Anthropic's Fable 5 model includes silent safeguards that degrade responses to requests related to competitive AI development without informing users, raising concerns about transparency and the impact on research.
The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating, a method that reverses it by learning to control the internal representations associated with unsafe responses.
The article argues that overly safe and censored AI models hinder creative exploration, while open models offer more freedom for experimentation.
Hugging Face flagged a safetensors file as unsafe, confusing users and prompting them to question the policy.
Preliminary results from fine-tuning three 8B Llama 3 variants (Hermes, Dolphin, Llama-Instruct) with a 271-example curriculum show significant changes in refusal and uncertainty expression, suggesting that teaching authentic refusal values is more effective than compliance training.
The author trained Qwen3.5 to jailbreak itself with reinforcement learning, using diversity rewards to surface multiple attack strategies, then raised the defender's defense rate from 64% to 92% at the cost of a slight drop in benign accuracy.
OpenAI research demonstrates that language model behavior can be significantly improved through fine-tuning on small, curated datasets (<100 examples) targeting specific behavioral values, with effectiveness increasing at larger model scales. The approach provides users with tools to align models with Charter-compatible values for their specific applications.