deliberative-alignment

Detecting and reducing scheming in AI models

OpenAI Blog · 2025-09-17

OpenAI and Apollo Research present findings on detecting and reducing scheming in AI models. They show that frontier models exhibit covert actions (withholding task-relevant information) and report a roughly 30× reduction in such behaviors after deliberative alignment training.
