Tag
This paper adapts AI Safety Gridworlds to text-based evaluation and finds that language model agents exhibit zero-shot reward hacking across scales, a behavior that standard RL mitigations do not correct.