Tag
This paper formalizes signed compression progress on a sealed audit as a Goodhart-resistant reward. It proves that cumulative reward telescopes to genuine audit improvement, gives bounds for finite audit panels, identifies failure modes, and validates the results experimentally.
OpenAI researchers empirically study how reward model overoptimization affects performance. They establish scaling laws showing that the relationship between proxy reward optimization and ground-truth performance depends on the optimization method and scales predictably with model size.
OpenAI research formally analyzes Goodhart's law through best-of-n sampling, providing efficient estimators for measuring how well proxy objectives track true objectives and quantifying optimization effort via KL divergence.
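
The KL-based measure of optimization effort mentioned in the best-of-n summary can be made concrete with a small example. This is a minimal sketch, assuming the standard closed form for best-of-n sampling with a continuous proxy score (no ties), under which the KL divergence from the base sampling distribution is `log(n) - (n - 1) / n` nats. The function name `kl_best_of_n` is illustrative and is not taken from the summarized work.

```python
import math


def kl_best_of_n(n: int) -> float:
    """KL divergence (in nats) between best-of-n sampling and the base sampler.

    Assumes the standard closed form log(n) - (n - 1) / n, which holds when
    the proxy score is continuous so that ties have probability zero.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log(n) - (n - 1) / n


# Optimization effort grows only logarithmically in the number of samples.
for n in (1, 2, 4, 16, 64):
    print(f"n={n:>2}  KL = {kl_best_of_n(n):.3f} nats")
```

With `n = 1` no selection takes place, so the divergence is zero. Larger `n` buys more optimization pressure against the proxy, at a cost that grows roughly like `log(n)`.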