safety-training

@AnthropicAI: Read the full post here: https://alignment.anthropic.com/2026/teaching-claude-why/…

X AI KOLs · 2026-05-08

Anthropic's alignment team presents techniques for reducing agentic misalignment in AI models, including training on ethical-dilemma advice and constitutional documents; the post reports that these generalized well out of distribution.

From hard refusals to safe-completions: toward output-centric safety training

OpenAI Blog · 2025-08-07

OpenAI introduced 'safe completions,' a safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness, especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations than refusal-trained models such as o3.
