Tag
This paper formalizes signed compression progress on a sealed audit as a Goodhart-resistant reward, proves that the cumulative reward telescopes to genuine audit improvement, and provides bounds for finite audit panels. It also identifies failure modes and validates the results experimentally.
This paper proposes a prompt-level reward specification framework that separates reward specification from computation. It constructs reusable task-adaptive rubrics and executable constraint checkers offline, and uses them to produce a hybrid reward for open-ended post-training without requiring human annotations or separate reward models.
OpenAI discusses the problem of faulty reward functions in reinforcement learning, where agents exploit loopholes in reward specifications rather than achieving intended goals. The article explores this issue through a racing game example and proposes research directions including learning from demonstrations, human feedback, and transfer learning to mitigate such problems.