Discretizing Reward Models

Hugging Face Daily Papers 06/19/26, 12:00 AM Papers

Summary

This paper identifies oversensitivity in continuous reward models for reinforcement learning, where equally good responses receive different scores, and proposes a discretization technique using Monte Carlo dropout to reduce this oversensitivity while maintaining discriminative ability, leading to better policies and less reward hacking.

Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike "verifiable rewards" which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models using distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
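To make the distinction between "discriminative ability" and "specificity" concrete, here is a minimal sketch. The abstract does not give the paper's formal definitions, so the pairwise definitions below are assumptions chosen for illustration: discriminative ability is taken as the fraction of response pairs with different true quality that the reward model orders correctly, and specificity as the fraction of equally good pairs that receive the same reward. The function name `pairwise_measures` and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_measures(quality, reward, tol=0.0):
    """Return (discriminative_ability, specificity).

    ASSUMED definitions, not taken from the paper:
      - discriminative ability: share of pairs with different true quality
        where the better response gets the strictly higher reward
      - specificity: share of pairs with equal true quality whose rewards
        differ by at most `tol`
    Assumes the data contains at least one pair of each kind.
    """
    correct = differing = tied = equal = 0
    for i, j in combinations(range(len(quality)), 2):
        if quality[i] == quality[j]:
            equal += 1
            tied += abs(reward[i] - reward[j]) <= tol
        else:
            differing += 1
            better, worse = (i, j) if quality[i] > quality[j] else (j, i)
            correct += reward[better] > reward[worse]
    return correct / differing, tied / equal


# Toy data: three bad responses (quality 0) and three good ones (quality 1).
quality = [0, 0, 0, 1, 1, 1]
continuous = [0.11, 0.23, 0.35, 0.62, 0.74, 0.91]  # hypothetical raw scores
discrete = [round(r) for r in continuous]          # crude two-level discretization

print(pairwise_measures(quality, continuous))
print(pairwise_measures(quality, discrete))
```

Under these assumed definitions, the continuous scores rank every good response above every bad one, yet no two equally good responses share a score, so specificity is zero. Rounding to two levels keeps the ranking between groups while removing the spurious differences within each group.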

Original Article

Cached at: 06/26/26, 06:05 AM

Source: https://huggingface.co/papers/2606.21795

Abstract

Reward models in reinforcement learning suffer from oversensitivity issues where they assign different scores to equally good responses, leading to poor policy learning, but this can be mitigated through discretization techniques that maintain discriminative ability while reducing oversensitivity.

Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike “verifiable rewards” which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of “reward model accuracy,” we propose evaluating reward models using distinct measures of “discriminative ability” and “specificity” (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
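The general idea of using Monte Carlo dropout to turn continuous scores into discrete reward clusters can be sketched as follows. This is not the paper's algorithm, whose details are not given in the abstract. It is a minimal illustration under stated assumptions: the reward model is a hypothetical linear scorer over toy features, dropout is applied to the input features at inference time, and two responses are placed in the same cluster when the gap between their mean rewards is no larger than the sum of their dropout standard deviations. The names `mc_dropout_rewards` and `discretize`, the merge rule, and all numbers are assumptions.

```python
import random
import statistics


def mc_dropout_rewards(features, weights, passes=200, p_drop=0.2, seed=0):
    """Score one response `passes` times with dropout left on at inference.

    Stand-in for a neural reward model: a linear scorer with inverted
    dropout on its inputs (hypothetical, for illustration only).
    Returns the mean and sample standard deviation of the sampled rewards.
    """
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    samples = []
    for _ in range(passes):
        total = 0.0
        for x, w in zip(features, weights):
            # Drop each feature with probability p_drop, rescale survivors.
            if rng.random() < keep:
                total += w * x / keep
        samples.append(total)
    return statistics.mean(samples), statistics.stdev(samples)


def discretize(stats):
    """Map (mean, std) pairs to integer cluster ids.

    ASSUMED merge rule: walk responses in order of mean reward and start a
    new cluster only when the gap to the previous response exceeds the sum
    of the two dropout standard deviations.
    """
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    labels = [0] * len(stats)
    cluster = 0
    for prev, cur in zip(order, order[1:]):
        gap = stats[cur][0] - stats[prev][0]
        if gap > stats[cur][1] + stats[prev][1]:
            cluster += 1
        labels[cur] = cluster
    return labels


# Toy inputs: three weak responses and two strong ones (hypothetical features).
weights = [1.0] * 8
responses = [[0.10] * 8, [0.11] * 8, [0.12] * 8, [1.00] * 8, [1.03] * 8]
stats = [mc_dropout_rewards(f, weights, seed=i) for i, f in enumerate(responses)]
discrete_rewards = discretize(stats)
print(discrete_rewards)
```

With these toy inputs, the differences within each group are smaller than the dropout noise, while the gap between the groups is several times larger, so the sketch should assign one cluster id to the weak responses and another to the strong ones. The cluster ids can then stand in for the raw scores as rewards. Like the approach described in the abstract, the sketch needs no retraining of the reward model, only repeated stochastic forward passes.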

Similar Articles

Reward Models Can Be Too Sensitive (22 minute read)

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

Recovering Hidden Reward in Diffusion-Based Policies

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
