Discretizing Reward Models
Summary
This paper identifies oversensitivity in continuous reward models for reinforcement learning, where equally good responses receive different scores, and proposes a discretization technique using Monte Carlo dropout to reduce this oversensitivity while maintaining discriminative ability, leading to better policies and less reward hacking.
Cached at: 06/26/26, 06:05 AM
Source: https://huggingface.co/papers/2606.21795
Abstract
Reward models in reinforcement learning suffer from oversensitivity issues where they assign different scores to equally good responses, leading to poor policy learning, but this can be mitigated through discretization techniques that maintain discriminative ability while reducing oversensitivity.
Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike “verifiable rewards,” which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of “reward model accuracy,” we propose evaluating reward models using distinct measures of “discriminative ability” and “specificity” (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically, we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
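The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a toy linear "reward model" with dropout applied to its input features, and a hypothetical merging rule: responses sorted by mean score fall into the same cluster when the gap between their means is no larger than `z` times their dropout standard deviation. The names `mc_dropout_scores`, `discretize`, `p_drop`, `n_samples`, and `z` are all illustrative assumptions.

```python
import random
import statistics


def mc_dropout_scores(features, weights, p_drop=0.2, n_samples=50, rng=None):
    """Score one response n_samples times, drawing a fresh dropout mask each pass.

    Assumption: the reward model is a toy linear scorer, not a real network.
    """
    if rng is None:
        rng = random.Random(0)
    keep = 1.0 - p_drop
    scores = []
    for _ in range(n_samples):
        # Inverted dropout: drop each feature with probability p_drop,
        # rescale the survivors by 1 / keep so the expected score is unchanged.
        score = sum(
            w * x / keep
            for w, x in zip(weights, features)
            if rng.random() < keep
        )
        scores.append(score)
    return scores


def discretize(samples_per_response, z=1.0):
    """Map each response to an integer cluster id (higher id = higher reward).

    Hypothetical rule: walk the responses in order of mean score and start a
    new cluster only when the gap to the previous mean exceeds z times the
    larger of the two dropout standard deviations.
    """
    stats = [
        (statistics.fmean(s), statistics.pstdev(s)) for s in samples_per_response
    ]
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    labels = [0] * len(stats)
    cluster = 0
    for prev, cur in zip(order, order[1:]):
        gap = stats[cur][0] - stats[prev][0]
        noise = max(stats[cur][1], stats[prev][1])
        if gap > z * noise:
            cluster += 1
        labels[cur] = cluster
    return labels


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [1.0] * 8
    # Two nearly identical responses and one clearly stronger response.
    responses = [[1.0] * 8, [1.01] * 8, [5.0] * 8]
    samples = [mc_dropout_scores(f, weights, rng=rng) for f in responses]
    print(discretize(samples))
```

In this toy setup the first two responses differ by far less than the dropout noise, so they should share a cluster and receive the same discrete reward, while the third is separated from them. With a real neural reward model, the analogous step would be to keep dropout layers active at inference time and repeat the forward pass.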
Similar Articles
Reward Models Can Be Too Sensitive (22 minute read)
This paper argues that reward models in RL are often oversensitive, assigning different scores to equally good responses, and proposes a training-free discretization algorithm using Monte Carlo dropout to reduce oversensitivity, improving policy quality.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Modification-Considering Value Learning for Reward Hacking Mitigation in RL
Proposes Modification-Considering Value Learning (MCVL), a safeguard for off-policy value-based RL that mitigates reward hacking by evaluating each transition's impact on a frozen bootstrapped-return estimator before admitting it into training.
Recovering Hidden Reward in Diffusion-Based Policies
This research paper explores methods for recovering hidden rewards within diffusion-based policies, likely aiming to improve the alignment or efficiency of such models.
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
This paper investigates preference instability in reward models for LLMs, where subtle input variations cause contradictory preference assignments. The authors propose two SAE-based mitigation strategies—SAE Feature Steering and SAE Residual Correction—to reduce incorrect preference assignments without retraining.