Anthropic's alignment team presents techniques for reducing agentic misalignment in AI models, including training on ethical-dilemma advice and on constitutional documents; these training approaches generalized well out of distribution.
OpenAI introduced 'safe completions,' a new safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness, especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations than refusal-trained models such as o3.