This paper argues that reward models in reinforcement learning (RL) are often oversensitive: they assign different scores to equally good responses. It proposes a training-free discretization algorithm, based on Monte Carlo dropout, that reduces this oversensitivity and improves policy quality.
This paper identifies oversensitivity in continuous reward models for reinforcement learning, where equally good responses receive different scores. It proposes a discretization technique based on Monte Carlo dropout that reduces this oversensitivity while preserving the reward model's discriminative ability, leading to better policies and less reward hacking.
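To make the idea concrete, here is a minimal sketch of how uncertainty-aware discretization might look. It is not the paper's algorithm, which the summary does not specify. It assumes a scorer that returns a slightly different value on each call (a stand-in for a reward model evaluated with dropout left active) and merges responses into one discrete reward level when their mean scores differ by less than `k` standard deviations. The names `mc_scores`, `discretize`, `make_toy_scorer`, the threshold `k`, and the merging rule are all hypothetical.

```python
import random
import statistics


def mc_scores(score_fn, response, n_samples=32):
    """Score one response n_samples times with a stochastic scorer.

    score_fn stands in for a reward model with dropout kept active,
    so repeated calls return different values (assumption).
    """
    return [score_fn(response) for _ in range(n_samples)]


def discretize(responses, score_fn, n_samples=32, k=2.0):
    """Map responses to integer reward levels (hypothetical merging rule).

    Responses are sorted by mean score. A new level starts only when the
    mean exceeds the current level's anchor mean by more than k times the
    larger of the two standard deviations.
    """
    stats = []
    for r in responses:
        samples = mc_scores(score_fn, r, n_samples)
        stats.append((statistics.fmean(samples), statistics.stdev(samples), r))
    if not stats:
        return {}
    stats.sort(key=lambda t: t[0])

    levels = {}
    level = 0
    anchor_mean, anchor_std = stats[0][0], stats[0][1]
    for mean, std, r in stats:
        if mean - anchor_mean > k * max(anchor_std, std):
            level += 1
            anchor_mean, anchor_std = mean, std
        levels[r] = level
    return levels


def make_toy_scorer(base, noise=0.05, seed=0):
    """Toy scorer: a fixed base score plus Gaussian noise (illustration only)."""
    rng = random.Random(seed)

    def score(response):
        return base[response] + rng.gauss(0.0, noise)

    return score


# "a" and "b" are nearly equally good; "c" is clearly better.
base = {"a": 1.00, "b": 1.02, "c": 2.00}
levels = discretize(list(base), make_toy_scorer(base), n_samples=64, k=2.0)
print(levels)
```

In this toy setup the small gap between "a" and "b" falls inside the scorer's own noise, so both land on the same level, while "c" is separated by far more than the noise and gets a higher one. A larger `k` merges more aggressively, trading discriminative ability for lower sensitivity.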