reward-specification

#reward-specification

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

arXiv cs.LG · yesterday

This paper formalizes signed compression progress on a sealed audit as a Goodhart-resistant reward. It proves that cumulative reward telescopes to genuine audit improvement, gives bounds for finite audit panels, identifies failure modes, and validates the results experimentally.
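The telescoping claim in the summary can be illustrated with a minimal sketch. It assumes the per-step reward is simply the previous audit loss minus the current one; that definition, the function name `signed_progress_rewards`, and the loss values are illustrative assumptions, not taken from the paper.

```python
def signed_progress_rewards(audit_losses):
    """Per-step reward: previous audit loss minus current audit loss.

    The reward is signed, so a step that makes the audit worse is penalized.
    This definition is an assumption made for illustration.
    """
    return [prev - curr for prev, curr in zip(audit_losses, audit_losses[1:])]


# Hypothetical audit losses (for example, code length in bits) after each step.
audit_losses = [10.0, 8.5, 9.0, 7.0]

rewards = signed_progress_rewards(audit_losses)
total = sum(rewards)

# Intermediate terms cancel, so the cumulative reward equals the net
# improvement between the first and last audit measurement.
net_improvement = audit_losses[0] - audit_losses[-1]
print(rewards, total, net_improvement)
```

Under this assumed definition, regressing and then recovering earns nothing extra: only the net change in the audit loss counts toward the total.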

Prompt-Level Reward Specifications for Open-Ended Post-Training

arXiv cs.CL · 2026-05-29

This paper proposes a prompt-level framework that separates reward specification from reward computation. Reusable task-adaptive rubrics and executable constraint checkers are constructed offline and combined into a hybrid reward for open-ended post-training, without human annotations or a separate reward model.

Faulty reward functions in the wild

OpenAI Blog · 2016-12-21

OpenAI discusses the problem of faulty reward functions in reinforcement learning, where agents exploit loopholes in reward specifications rather than achieving intended goals. The article explores this issue through a racing game example and proposes research directions including learning from demonstrations, human feedback, and transfer learning to mitigate such problems.
