What does "Safe AI" look like? [D]
Summary
The author questions whether it is practical to study defenses against post-release fine-tuning that weakens safety behaviors in open-weight LLMs, and asks whether current safety training is worth the effort if it can be undone quickly.
Similar Articles
Concrete AI safety problems
OpenAI, Berkeley, and Stanford researchers co-authored a foundational paper identifying five concrete safety problems in modern AI systems: safe exploration, robustness to distributional shift, avoiding negative side effects, preventing reward hacking, and scalable oversight.
AI safety is arguing about the wrong boundary
This article argues that the AI safety debate is misdirected: it focuses on model alignment and internal controls rather than on the boundary that matters, external admission authority over agent execution. It warns that systems capable of self-authorizing high-impact actions (e.g., deploying code, moving money) pose a fundamental risk that logging and monitoring cannot mitigate.
AI safety via debate
OpenAI proposes a novel approach to AI safety where two AI agents debate each other while a human judge evaluates their arguments, allowing humans to supervise AI systems whose behavior is too complex to directly understand. The method leverages debate and adversarial reasoning to align advanced AI with human values and preferences.
Has AI become too "safe" to actually be useful for creative work?
The article argues that overly safe and censored AI models hinder creative exploration, while open models offer more freedom for experimentation.
Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
This paper introduces AI-MASLD, a stress-audit framework for medical LLMs that reveals how benchmark accuracy can hide serious safety failures, and demonstrates that open-weight models can match or exceed proprietary ones on safety dimensions.