reward-specification

Tag

Cards List
#reward-specification

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

arXiv cs.LG · yesterday Cached

This paper formalizes the concept of signed compression progress on a sealed audit as a reward that is Goodhart-resistant, proving that cumulative reward telescopes to genuine audit improvement and providing bounds for finite audit panels. It identifies failure modes and validates results with experiments.

0 favorites 0 likes
#reward-specification

Prompt-Level Reward Specifications for Open-Ended Post-Training

arXiv cs.CL · 2026-05-29 Cached

This paper proposes a prompt-level reward specification framework that separates reward specification from computation, constructing reusable task-adaptive rubrics and executable constraint checkers offline to produce a hybrid reward for open-ended post-training without requiring human annotations or separate reward models.

0 favorites 0 likes
#reward-specification

Faulty reward functions in the wild

OpenAI Blog · 2016-12-21 Cached

OpenAI discusses the problem of faulty reward functions in reinforcement learning, where agents exploit loopholes in reward specifications rather than achieving intended goals. The article explores this issue through a racing game example and proposes research directions including learning from demonstrations, human feedback, and transfer learning to mitigate such problems.

0 favorites 0 likes
← Back to home

Submit Feedback