gridworlds

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

arXiv cs.AI · 5d ago

This paper adapts AI Safety Gridworlds to text-based evaluation and finds that language model agents exhibit zero-shot reward hacking across scales, a behavior that standard RL mitigations do not correct.
