Tag
OpenAI released a new paper "Reinforcement Learning Towards Broadly and Persistently Beneficial Models", proposing the Beneficial Trait RL method, training AI's core traits such as honesty and error correction. After training in the medical domain, performance surged across a wide range of OOD tests, and it can resist malicious fine-tuning, breaking the trade-off between safety and capability.
OpenAI researchers show that reinforcement learning on realistic scenarios targeting beneficial traits (honesty, transparency, corrigibility) produces broad improvements across dozens of alignment benchmarks, with gains generalizing beyond training domains and persisting under adversarial pressure.
OpenAI announces an early step toward training AI models to carry beneficial traits into new situations, aiming to make AI more reliable, transparent, and helpful as it becomes more capable.