Reward Models Can Be Too Sensitive (22 minute read)

TLDR AI Papers

Summary

This paper argues that reward models in RL are often oversensitive, assigning different scores to equally good responses, and proposes a training-free discretization algorithm using Monte Carlo dropout to reduce oversensitivity, improving policy quality.

Meta studied how reward models can overreact to equally good responses, leading reinforcement learning toward reward hacking. The paper proposes measuring both discriminative ability and specificity, then using Monte Carlo dropout to cluster rewards into safer discrete signals.

# Discretizing Reward Models
Source: [https://arxiv.org/html/2606.21795](https://arxiv.org/html/2606.21795)
¹Carnegie Mellon University, ²Meta Superintelligence Labs. \*Work done at Meta.

Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig, Yuning Mao

vijayv@cs.cmu.edu

\(June 19, 2026\)

###### Abstract

Despite their widespread use, the role of *reward models* in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike “verifiable rewards” which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of “reward model accuracy,” we propose evaluating reward models using distinct measures of “discriminative ability” and “specificity” (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.

Correspondence: Vijay Viswanathan at vijayv@cs.cmu.edu

## 1 Introduction

Reinforcement learning’s central characteristic is the use of *rewards* to rate model behavior instead of explicit demonstrations. Reinforcement learning has proven most reliable on *verifiable* problems, where a program can grade any response (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.21795#bib.bib5); Lambert et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib18); Lin et al., [2025](https://arxiv.org/html/2606.21795#bib.bib20)). For harder-to-verify tasks, standard practice is to use reward models (“RM’s”) that evaluate response quality by prompting a pretrained LM (Saad-Falcon et al., [2025](https://arxiv.org/html/2606.21795#bib.bib29); Tunstall et al., [2023](https://arxiv.org/html/2606.21795#bib.bib34); Sun et al., [2024](https://arxiv.org/html/2606.21795#bib.bib33); Viswanathan et al., [2025](https://arxiv.org/html/2606.21795#bib.bib35)) or via explicit training (Christiano et al., [2017](https://arxiv.org/html/2606.21795#bib.bib3); Liu et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib21); Wang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib36)). Continuous-valued RM’s induce a strict total order over responses. This offers the potential of capturing nuances that a binary pass/fail signal cannot, such as rewarding partial progress (e.g. distinguishing “good” from “great”) and stylistic merit. We argue this apparent strength is a serious weakness, due to *reward model oversensitivity*.

![Refer to caption](https://arxiv.org/html/2606.21795v1/x1.png) Figure 1: Reward models can be strong at discriminating “good responses” from “bad” yet have low specificity (i.e. high oversensitivity). Given classes of equal utility, we can have (a) perfect discrimination but poor specificity; (b) perfect specificity but poor discrimination; (c) perfect on both using discretization.

Benchmarks suggest that the problem of reward modeling is nearly solved. Reward models achieve very high agreement with annotated preferences; on RewardBench 1 and 2 (Lambert et al., [2024b](https://arxiv.org/html/2606.21795#bib.bib19); Malik et al., [2025](https://arxiv.org/html/2606.21795#bib.bib24)), respectively, the top models achieve agreement of 94% and 84%. We argue that reward modeling is not solved, and understanding this requires rethinking how we use and evaluate reward models. RM evaluation typically assumes one response is always better than the others (Lambert et al., [2024b](https://arxiv.org/html/2606.21795#bib.bib19); Liu et al., [2024b](https://arxiv.org/html/2606.21795#bib.bib23); Malik et al., [2025](https://arxiv.org/html/2606.21795#bib.bib24)). (Footnote 1: One important exception is the “Ties” subset of RewardBench 2, which we discuss at greater length in [subsection 4.1](https://arxiv.org/html/2606.21795#S4.SS1). This subset is a proxy for oversensitivity, but it only contributes 1/16th of the average score on RewardBench.) Similarly, many RL algorithms optimize the difference between each response’s reward and the average among a batch of responses.

![Refer to caption](https://arxiv.org/html/2606.21795v1/x2.png) Figure 2: We propose measuring reward models using their discriminative ability at distinguishing good responses from bad and their specificity at identifying equally-good responses. Leading RM’s (Liu et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib21); Wang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib36)) show strong discriminative ability but poor specificity.

The assumption that one response is always better is clearly false. Most prompts can be answered in many equally-useful ways, such as “Name a winner at Wimbledon 2019” (shown in [Figure 2](https://arxiv.org/html/2606.21795#S1.F2)). Within this group of equally-useful responses (e.g. the winner of women’s singles versus the winner of men’s doubles), average human preferences should usually be equal, and if they are not, it is an artifact of “rating indeterminacy”, where preferences arise from personal context, subjective interpretations, or harmful biases, rather than objective differences (Guerdan et al., [2025](https://arxiv.org/html/2606.21795#bib.bib12)). Reward model training is also an imperfect process, and the resultant RM’s trained on this data consequently model both objective utility signals and spurious rating artifacts like implicit biases (Liu et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib21), [2025](https://arxiv.org/html/2606.21795#bib.bib22); Wang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib36); Yang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib38)). When a reward model assigns different scores to equally good responses, it is not being discriminative; it is being oversensitive. We show that oversensitivity is not merely noise to be averaged out, but provides a learnable signal that models can exploit.
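To make the link between oversensitivity and batch-relative RL objectives concrete, the following minimal sketch centers rewards on the batch mean, as a simplified stand-in for the "reward minus batch average" signal mentioned above. The reward values, the helper names `advantages` and `binarize`, and the 0.5 threshold are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean

def advantages(rewards):
    """Center each reward on the batch mean (a simplified batch-relative advantage)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def binarize(rewards, threshold):
    """Hypothetical discretization: 1.0 if the reward clears the threshold, else 0.0."""
    return [1.0 if r > threshold else 0.0 for r in rewards]

# Four equally good answers whose raw scores differ only through spurious
# preferences; the numbers are made up for illustration.
raw_rewards = [0.91, 0.84, 0.97, 0.88]

raw_adv = advantages(raw_rewards)                             # nonzero: a learnable spurious signal
disc_adv = advantages(binarize(raw_rewards, threshold=0.5))   # all zero: nothing to exploit

print(raw_adv)
print(disc_adv)
```

With raw rewards, the policy receives a gradient signal that prefers one equally good answer over another; after binarization, all four responses receive the same advantage.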

Our work builds on recent observations that accurate reward models are not always good teachers for reinforcement learning (Chen et al., [2024](https://arxiv.org/html/2606.21795#bib.bib2)). One hypothesis offered in prior work is that in addition to accuracy, RM’s must show high variance over the reward space to provide sufficient learning signal (Razin et al., [2025](https://arxiv.org/html/2606.21795#bib.bib28); Yang et al., [2025](https://arxiv.org/html/2606.21795#bib.bib39)). We suggest that what matters is not variance per se, but *where* the variance comes from. Variance among responses of different utility is beneficial, while variance between responses of equal utility is harmful. [Figure 1](https://arxiv.org/html/2606.21795#S1.F1) illustrates this distinction: blindly maximizing variance (center) sacrifices discriminative power and blindly maximizing accuracy (left) minimizes variance. The right panel shows that a careful discretization can optimize both discriminative ability and specificity (the complement of oversensitivity). In [subsection 2.4](https://arxiv.org/html/2606.21795#S2.SS4), we prove this is possible under certain conditions.

In a practical RL setting, we need to estimate these discretization thresholds without any prior knowledge. We propose an algorithm to achieve this by estimating the predictive variance of the reward model using Monte Carlo dropout (Gal and Ghahramani, [2016](https://arxiv.org/html/2606.21795#bib.bib10); Gao et al., [2021](https://arxiv.org/html/2606.21795#bib.bib11)) and using this to cluster responses into groups. We first evaluate our algorithm using the Ties subset of RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2606.21795#bib.bib24)), where we find that, over a set of popular RM’s, we consistently reduce the oversensitivity of the reward model at a modest expense of discriminative ability, leading to an improvement in the average of the two. We then simulate an environment with a mixture of a primary reward (instruction following) and a secondary spurious (stylistic) reward. Optimizing this mixture directly leads to major stylistic overoptimization at the expense of the primary reward, but discretization allows models to maintain primary task efficacy. We lastly consider a realistic multi-task RL scenario of training a policy on unlabeled prompts. Comparing popular reward models with their discretized variants across IFEval, GSM8K, and MATH, discretization is never significantly worse than learning from the raw reward, and it is frequently much better. This suggests the potential of discretization via reward clustering as a replacement for standard RL when using learned rewards.

## 2 Problem Formulation

### 2\.1 Accuracy, Discriminative Ability, and Oversensitivity

#### Definitions

We are trying to learn a policy $\pi: \mathcal{S} \to \mathcal{A}$.

For a prompt $x \in \mathcal{S}$ and response $y \in \mathcal{A}$, define the “true utility” as an integer $u(x,y): \mathcal{S} \times \mathcal{A} \to [m_x] \subset \mathbb{Z}$. (Footnote 2: We describe our setting as a single user prompt in single-turn interaction between user and model, but the state given to the model could similarly be a prompt prepended with full conversational context. Footnote 3: We will use the subscript notation $u_x(y)$ for brevity.) $m_x$ is a random variable that depends on $x$, reflecting the number of equivalence classes of equally-good responses that exist for a given prompt $x$. Note that $u(x,y)$ is integer-valued; we explicitly consider the scenario where only a finite number of such equivalence classes exist. This means that utility must be countable (i.e. there is a limit to the differences between responses that humans can perceive) and bounded (i.e. there exist classes of “best possible” and “worst possible” responses); we argue these assumptions are realistic (Guerdan et al., [2025](https://arxiv.org/html/2606.21795#bib.bib12); Elangovan et al., [2025](https://arxiv.org/html/2606.21795#bib.bib8)).

Our goal is to learn to produce responses that optimize our utility function: $\operatorname{argmax}_{y} u(x,y)$. However, for a given $x$, the true $u$ is not known. Instead, we have access to a learned reward model $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

In this paper, we introduce two constructs, discriminative ability and specificity, to measure reward model efficacy. The discriminative ability of $r$ is defined as $P\left(r_x(a) > r_x(b) \mid u_x(a) > u_x(b)\right)$. If this probability is 1 for a given reward model $r$, it will show perfect accuracy on reward model evaluation benchmarks like RewardBench 1 (Lambert et al., [2024b](https://arxiv.org/html/2606.21795#bib.bib19)). We are also interested in specificity, which is defined as $P\left(r_x(a) = r_x(b) \mid u_x(a) = u_x(b)\right)$. This is the complement of oversensitivity (James et al., [2013](https://arxiv.org/html/2606.21795#bib.bib17)). Practically, we care about these quantities at a tolerance $\epsilon \geq 0$: a reward gap smaller than $\epsilon$ is assumed to be insignificant and treated as a tie. Under this tolerance, we redefine discriminative ability and specificity as $D_r(\epsilon) := P\left(r_x(a) > r_x(b) + \epsilon \mid u_x(a) > u_x(b)\right)$ and $\mathrm{Spec}_r(\epsilon) := 1 - P\left(|r_x(a) - r_x(b)| > \epsilon \mid u_x(a) = u_x(b)\right)$, respectively.
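As a rough illustration of the two definitions, the sketch below estimates $D_r(\epsilon)$ and $\mathrm{Spec}_r(\epsilon)$ empirically from a small set of responses to a single prompt, each with a known utility class and a reward score. The function names and the toy data are hypothetical; in practice the true utilities are not observed and must come from annotation.

```python
from itertools import permutations

def discriminative_ability(utils, rewards, eps=0.0):
    """Fraction of ordered pairs with u(a) > u(b) for which r(a) > r(b) + eps."""
    pairs = [(a, b) for a, b in permutations(range(len(utils)), 2)
             if utils[a] > utils[b]]
    # Assumes at least one pair of responses with different utility exists.
    return sum(rewards[a] > rewards[b] + eps for a, b in pairs) / len(pairs)

def specificity(utils, rewards, eps=0.0):
    """1 minus the fraction of equal-utility pairs whose reward gap exceeds eps."""
    pairs = [(a, b) for a, b in permutations(range(len(utils)), 2)
             if utils[a] == utils[b]]
    # Assumes at least one pair of responses with equal utility exists.
    return 1.0 - sum(abs(rewards[a] - rewards[b]) > eps for a, b in pairs) / len(pairs)

# Toy data: three "good" responses (utility 1) and two "bad" ones (utility 0).
utils = [1, 1, 1, 0, 0]
rewards = [0.9, 0.8, 0.95, 0.3, 0.2]

print(discriminative_ability(utils, rewards))    # every good response outscores every bad one
print(specificity(utils, rewards))               # no two equal-utility responses tie exactly
print(specificity(utils, rewards, eps=0.12))     # a tolerance forgives small gaps
```

In this toy example the reward ranks every good response above every bad one, yet never assigns identical scores within a class, which is the pattern the paper calls oversensitivity.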

These constructs are more granular than the prior notion of *reward model accuracy*, formalized by Razin et al. ([2025](https://arxiv.org/html/2606.21795#bib.bib28)) as $\mathbb{E}_{a,b\sim\mathcal{A}}\left[\mathbbm{1}\left[\operatorname{sgn}(r_x(a) - r_x(b)) = \operatorname{sgn}(u_x(a) - u_x(b))\right]\right]$.

###### Proposition 2\.1\.

A reward model’s accuracy is the weighted sum of discriminative ability and specificity at $\epsilon = 0$.

#### Proof

Accuracy is $\mathbb{E}_{a,b}\left[\mathbbm{1}\left[\operatorname{sgn}(r_x(a) - r_x(b)) = \operatorname{sgn}(u_x(a) - u_x(b))\right]\right]$.

We show in [Appendix A](https://arxiv.org/html/2606.21795#A1) that accuracy is equal to:

$$
\begin{aligned}
&P[r_x(a) = r_x(b) \mid u_x(a) = u_x(b)] \times P[u_x(a) = u_x(b)] && \text{(specificity)} \\
&+ P[r_x(a) > r_x(b) \mid u_x(a) > u_x(b)] \, P[u_x(a) > u_x(b)] \\
&+ P[r_x(a) < r_x(b) \mid u_x(a) < u_x(b)] \, P[u_x(a) < u_x(b)] && \text{(discriminative ability)}
\end{aligned}
$$

In practice, RM evaluation focuses on discriminative ability. If utility is continuous, $P[u_x(a) = u_x(b)] = 0 \implies$ accuracy = discriminative ability. In [subsection 2.2](https://arxiv.org/html/2606.21795#S2.SS2), we show conditions in which RM’s with excellent discriminative ability can also be highly oversensitive. In [subsection 2.4](https://arxiv.org/html/2606.21795#S2.SS4), we will show that proper discretization can preserve discriminative ability while minimizing oversensitivity.
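The decomposition in Proposition 2.1 can be checked numerically on a toy example. The sketch below computes sign-agreement accuracy directly and then as the weighted sum of the three conditional terms; the helper name `accuracy_and_decomposition` and the data are assumptions made for illustration only.

```python
from itertools import permutations

def sgn(v):
    """Sign function returning -1, 0, or 1."""
    return (v > 0) - (v < 0)

def accuracy_and_decomposition(utils, rewards):
    """Return (accuracy, weighted sum of the three conditional agreement terms)."""
    pairs = list(permutations(range(len(utils)), 2))
    n = len(pairs)
    accuracy = sum(
        sgn(rewards[a] - rewards[b]) == sgn(utils[a] - utils[b]) for a, b in pairs
    ) / n

    weighted_sum = 0.0
    for u_sign in (0, 1, -1):  # tie, u(a) > u(b), u(a) < u(b)
        group = [(a, b) for a, b in pairs if sgn(utils[a] - utils[b]) == u_sign]
        if not group:
            continue
        weight = len(group) / n  # P[utility relation]
        agreement = sum(
            sgn(rewards[a] - rewards[b]) == u_sign for a, b in group
        ) / len(group)           # P[reward relation | utility relation]
        weighted_sum += agreement * weight
    return accuracy, weighted_sum

utils = [2, 2, 1, 1, 0]
rewards = [0.9, 0.9, 0.5, 0.7, 0.6]
acc, decomposed = accuracy_and_decomposition(utils, rewards)
print(acc, decomposed)
```

The two numbers agree because the three utility relations partition the set of pairs, which is exactly the law of total probability used in the proposition.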

![Refer to caption](https://arxiv.org/html/2606.21795v1/x3.png) Figure 3: Discretization requires estimating regions of equivalent utility in reward space. $\phi(u^{i}(a))$ marks the mean reward within a class of responses with equal utility. $s^{i}$ is the spread of rewards corresponding to that utility class, given by $r(y) = \phi(u(y)) + \eta(y)$.

To simplify our analysis, we assume our reward $r$ is a linear function of utility with a bounded amount of bias: $r_x(y) = \phi_x(u_x(y)) + \eta_x(y)$, where $\phi_x(v) = s_x v + d_x$ is a linear function with slope $s_x > 0$ and $\eta_x: \mathcal{A} \to \mathbb{R}$ is *reward noise*. Practically, we consider the case that $\eta_x(y)$ is easier to learn than $\phi_x(u_x(y))$, leading to reward hacking.

We will assume that $\eta_x$ is bounded such that $|\eta_x| < s_x/2$, to ensure that this noisy reward model maintains perfect discriminative ability under our utility function. For ease of explanation, here we assume that $\eta_x(a) \sim \text{Unif}(-s_x/2, s_x/2)$. We show in [subsection D.1](https://arxiv.org/html/2606.21795#A4.SS1) that the same conclusions hold after relaxing this assumption to generate $\eta$ from a Gaussian distribution. (Footnote 4: We use reward noise $\eta$ as if it is a random variable, but it is actually a learnable, prompt-dependent function of a response. Consider that responses are sampled from a reference distribution $y \sim \bm{\pi}$. Then, within a utility class $c$, we assume $\eta_x(y) \sim \mathcal{N}(0, \sigma_x^{2})$ given $u_x(y) = c$. Despite being a learnable function, $\eta$ allows a probabilistic interpretation over the distribution of responses.) Our practical algorithm in §[3](https://arxiv.org/html/2606.21795#S3) supports that assumption. Intuitively, we assume that spurious or subjective preferences are present in reward models but their magnitude is much smaller than the differences in true utility. We illustrate the relationship between our reward model $r$, the utility function $u$, and the amount of noise $s$ in [Figure 3](https://arxiv.org/html/2606.21795#S2.F3).

### 2\.2 Reward models with perfect discriminative ability can be highly oversensitive

###### Proposition 2\.2\.

If $|\eta_x(a)| < s_x/2$, then $r$ has perfect discriminative ability at $\epsilon = 0$.

#### Proof

If $u_x(a) \neq u_x(b)$, then:

$$
\begin{aligned}
\operatorname{sgn}(r_x(a) - r_x(b)) &= \operatorname{sgn}(\phi_x(u_x(a)) + \eta_x(a) - \phi_x(u_x(b)) - \eta_x(b)) \\
&= \operatorname{sgn}(s_x u_x(a) + d_x + \eta_x(a) - s_x u_x(b) - d_x - \eta_x(b)) \\
&= \operatorname{sgn}(s_x (u_x(a) - u_x(b)) + \eta_x(a) - \eta_x(b)).
\end{aligned}
$$

Since $u$ is an integer-valued function, $u_x(a) \neq u_x(b) \implies |u_x(a) - u_x(b)| \geq 1$, so the utility term has magnitude at least $s_x$. Because $|\eta_x| < s_x/2 \implies |\eta_x(a) - \eta_x(b)| < s_x$, the noise term cannot flip the sign, and $\operatorname{sgn}(r_x(a) - r_x(b)) = \operatorname{sgn}(s_x \times (u_x(a) - u_x(b)) + \eta_x(a) - \eta_x(b)) = \operatorname{sgn}(s_x \times (u_x(a) - u_x(b)))$.

This means $u_x(a) > u_x(b) \implies r_x(a) > r_x(b)$, and therefore $r$ has perfect discriminative ability.

###### Proposition 2\.3\.

If $\eta_x(a), \eta_x(b) \sim \text{Unif}(-s_x/2, s_x/2)$, then for any tolerance $\epsilon \in [0,1]$ (measured in units of $s_x$, i.e. “utility units”), the discriminative ability of the reward on *adjacent* utility classes is

$$D_r(\epsilon) = P(r_x(a) > r_x(b) + \epsilon \mid u_x(a) = u_x(b) + 1) = 1 - \frac{\epsilon^{2}}{2}.$$

#### Proof

$r_x(a) - r_x(b) = \eta_x(a) - \eta_x(b) + s_x$. Then $(r_x(a) - r_x(b))/s_x = 1 + \eta_x(a)/s_x - \eta_x(b)/s_x$. Since $\eta_x(a)/s_x, \eta_x(b)/s_x \sim \text{Unif}(-1/2, 1/2)$, then $P(r_x(a) - r_x(b) > \epsilon s_x) = P((r_x(a) - r_x(b))/s_x > \epsilon) = P(\eta_x(a)/s_x - \eta_x(b)/s_x > \epsilon - 1) := P(x - y > \epsilon - 1)$ where $x, y \sim \text{Unif}(0,1)$ (shifting both variables by $1/2$ leaves their difference unchanged; here $x$ and $y$ are auxiliary variables, not the prompt and response). By the CDF of the symmetric triangular distribution, $P(x - y < \epsilon - 1) = ((\epsilon - 1) + 1)^{2}/2 = \epsilon^{2}/2$. Then, $P(r_x(a) > r_x(b) + \epsilon s_x) = 1 - \epsilon^{2}/2$.

We assumed adjacent utility classes: $u_x(a) = u_x(b) + 1$. This is the worst-case scenario. When utility classes are further apart, $P(r_x(a) - r_x(b) > \epsilon s_x) \geq P\bigl(\eta_x(a)/s_x - \eta_x(b)/s_x > \epsilon - 2\bigr) \geq P\bigl((\eta_x(a) - \eta_x(b))/s_x > -1\bigr) = 1$.
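The closed form in Proposition 2.3 can be sanity-checked by simulation. The sketch below draws uniform noise for two responses in adjacent utility classes and estimates how often the reward gap exceeds the tolerance; the function name, sample size, and seed are arbitrary choices made for illustration.

```python
import random

def adjacent_discrimination(eps, s=1.0, n_trials=100_000, seed=0):
    """Monte Carlo estimate of P(r(a) > r(b) + eps * s | u(a) = u(b) + 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        eta_a = rng.uniform(-s / 2, s / 2)
        eta_b = rng.uniform(-s / 2, s / 2)
        # Adjacent classes: phi(u(a)) - phi(u(b)) = s, so the offset d cancels.
        reward_gap = s + eta_a - eta_b
        hits += reward_gap > eps * s
    return hits / n_trials

for eps in (0.0, 0.5, 1.0):
    # Compare the simulated value against the closed form 1 - eps^2 / 2.
    print(eps, adjacent_discrimination(eps), 1 - eps**2 / 2)
```

The estimate should track $1 - \epsilon^2/2$ up to sampling error, falling from 1 at $\epsilon = 0$ to 1/2 at $\epsilon = 1$.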

###### Proposition 2\.4\.

If $\eta_x(a), \eta_x(b) \sim \text{Unif}(-s_x/2, s_x/2)$, then for any tolerance $\epsilon \in [0,1)$ (in units of $s_x$),

$$\mathrm{Spec}_r(\epsilon) := 1 - P\bigl(|r_x(a) - r_x(b)| > \epsilon \mid u_x(a) = u_x(b)\bigr) = \epsilon(2 - \epsilon) < 1.$$

#### Proof

$$
\begin{aligned}
1 - P\bigl(|r_x(a) - r_x(b)| > \epsilon \mid u_x(a) = u_x(b)\bigr) &= 1 - P\bigl(|\eta_x(a) - \eta_x(b)| \geq \epsilon\bigr) \\
&= 1 - 2P\bigl(\eta_x(a) - \eta_x(b) \geq \epsilon\bigr) = 1 - 2 \cdot \tfrac{(1 - \epsilon)^{2}}{2} = 1 - (1 - \epsilon)^{2} = \epsilon(2 - \epsilon).
\end{aligned}
$$

Therefore, the raw reward model is oversensitive with probability $(1 - \epsilon)^{2}$, which is strictly positive for every tolerance smaller than $s_x$ (i.e. $\epsilon < 1$ in units of $s_x$), though this likelihood decreases quadratically as $\epsilon$ increases.
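A similar simulation illustrates Proposition 2.4. The sketch below samples two responses from the same utility class, so their rewards differ only by noise, and counts how often the gap stays within the tolerance. As before, the function name, sample size, and seed are illustrative assumptions.

```python
import random

def raw_specificity(eps, s=1.0, n_trials=100_000, seed=0):
    """Monte Carlo estimate of 1 - P(|r(a) - r(b)| > eps * s | u(a) = u(b))."""
    rng = random.Random(seed)
    within_tolerance = 0
    for _ in range(n_trials):
        eta_a = rng.uniform(-s / 2, s / 2)
        eta_b = rng.uniform(-s / 2, s / 2)
        # Same utility class: the linear term phi(u) cancels, leaving only noise.
        within_tolerance += abs(eta_a - eta_b) <= eps * s
    return within_tolerance / n_trials

for eps in (0.1, 0.5, 0.9):
    # Compare the simulated value against the closed form eps * (2 - eps).
    print(eps, raw_specificity(eps), eps * (2 - eps))
```

Even a fairly generous tolerance of half a utility unit leaves roughly a quarter of equal-utility pairs scored as different under this noise model.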

### 2\.3 There exists a discretization that maintains perfect discriminative ability

We will post-process the output of a reward model to discretize the rewards. For simplicity, we’ll assume here that our discretization is binary, where we fix a single threshold $\tau_x$ for binarizing the reward, though our findings can be generalized to multi-level utility functions. Define the operation $\text{disc}_x(v)$ as $\mathbbm{1}[v > \tau_x]$. Then:

$$
\begin{aligned}
D_{\text{disc}}(\epsilon) &:= P\bigl(\text{disc}_x(r_x(a)) > \text{disc}_x(r_x(b)) + \epsilon \mid u_x(a) > u_x(b)\bigr) = P\bigl(r_x(b) \leq \tau_x < r_x(a) \mid u_x(a) > u_x(b)\bigr) \\
&= P\bigl(\phi_x(u_x(a)) + \eta_x(a) > \tau_x \mid u_x(a) > u_x(b)\bigr) \times P\bigl(\phi_x(u_x(b)) + \eta_x(b) \leq \tau_x \mid u_x(a) > u_x(b)\bigr).
\end{aligned}
$$

This derivation is based on the fact that discretized rewards are integer-valued, so a gap of 1 exceeds any margin $\epsilon < 1$. Since $u_x$ is integer-valued and $u_x(a) > u_x(b)$, we then have $u_x(a) \geq u_x(b) + 1$, which implies $\phi_x(u_x(a)) \geq \phi_x(u_x(b)) + s_x$.

Given that $|\eta_x| < s_x/2$, the ranges of possible rewards for responses $a$ and $b$ are non-overlapping:

$$r_x(a) = \phi_x(u_x(a)) + \eta_x(a) \geq \phi_x(u_x(b)) + s_x/2 > r_x(b).$$

Therefore, if the threshold satisfies $\tau_x \in [\phi_x(u_x(b)) + s_x/2, \phi_x(u_x(a)) - s_x/2]$, then $P(r_x(a) > \tau_x) = 1$ and $P(r_x(b) \leq \tau_x) = 1$, giving $D_{\text{disc}}(\epsilon) = P(\text{disc}_x(r_x(a)) > \text{disc}_x(r_x(b)) \mid u_x(a) > u_x(b)) = 1$.
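The role of the threshold interval can be illustrated with a small simulation under a binary utility model, assuming $s_x = 1$ and $d_x = 0$ so that class means sit at 0 and 1. The sketch compares the midpoint threshold, which lies in the admissible interval, with a threshold placed on the mean of the higher class; the function name and parameter values are hypothetical.

```python
import random

def binarized_discrimination(tau, s=1.0, d=0.0, n_trials=100_000, seed=0):
    """Estimate P(disc(r(a)) > disc(r(b))) for binary utility with u(a)=1, u(b)=0."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        r_a = s * 1 + d + rng.uniform(-s / 2, s / 2)  # reward of the better response
        r_b = s * 0 + d + rng.uniform(-s / 2, s / 2)  # reward of the worse response
        # disc(v) = 1[v > tau], so a "win" needs r_b <= tau < r_a.
        wins += (r_a > tau) and (r_b <= tau)
    return wins / n_trials

midpoint = binarized_discrimination(tau=0.5)    # tau = phi(0) + s/2 = phi(1) - s/2
off_center = binarized_discrimination(tau=1.0)  # tau sits on the mean of the upper class

print(midpoint, off_center)
```

With the midpoint threshold the two reward ranges fall on opposite sides of $\tau_x$, while the off-center threshold splits the upper class in half and loses about half of the discriminative ability.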

### 2\.4 Discretization can minimize oversensitivity

###### Proposition 2\.5\.

If $\eta_x(a) \sim \text{Unif}(-s_x/2, s_x/2)$, then

$$\text{oversensitivity} := 1 - \mathrm{Spec}_{\text{disc}} = P\bigl(|\text{disc}_x(r_x(a)) - \text{disc}_x(r_x(b))| > \epsilon \mid u_x(a) = u_x(b)\bigr) = \max\bigl(0,\ \tfrac{1}{2} - 2\bigl((\tau_x - \phi_x(u_x(a)))/s_x\bigr)^{2}\bigr).$$

We give a proof of this proposition in [Appendix B](https://arxiv.org/html/2606.21795#A2).
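The formula in Proposition 2.5 can be checked numerically without consulting the appendix. Writing $t = (\tau_x - \phi_x(u_x(a)))/s_x$ for the normalized distance between the threshold and the class mean, the sketch below compares the closed form with a simulation in which two same-class responses are binarized independently. The function names and the chosen values of $t$ are assumptions for illustration.

```python
import random

def predicted_oversensitivity(t):
    """Closed form from Proposition 2.5, with t = (tau - phi(u)) / s."""
    return max(0.0, 0.5 - 2 * t**2)

def simulated_oversensitivity(t, s=1.0, n_trials=100_000, seed=0):
    """Fraction of same-class pairs that land on opposite sides of the threshold."""
    rng = random.Random(seed)
    tau_offset = t * s  # threshold position relative to the class mean
    disagreements = 0
    for _ in range(n_trials):
        above_a = rng.uniform(-s / 2, s / 2) > tau_offset
        above_b = rng.uniform(-s / 2, s / 2) > tau_offset
        disagreements += above_a != above_b
    return disagreements / n_trials

for t in (0.0, 0.25, 0.5, -0.5):
    print(t, simulated_oversensitivity(t), predicted_oversensitivity(t))
```

Oversensitivity is worst when the threshold sits on the class mean ($t = 0$) and vanishes once the threshold is at least half a utility unit away from it.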

###### Theorem 2\.6\.

$\mathrm{Spec}_{\text{disc}}(\epsilon) > \mathrm{Spec}_{\text{raw}}(\epsilon)$ (proper discretization reduces oversensitivity)

#### Proof

By Proposition [2.5](https://arxiv.org/html/2606.21795#S2.Thmtheorem5), $\mathrm{Spec}_{\text{disc}}(\epsilon) = 1 - \max\bigl(0, \tfrac{1}{2} - 2((\tau_x - \phi_x(u_x(a)))/s_x)^{2}\bigr) = \min\bigl(1, \tfrac{1}{2} + 2((\tau_x - \phi_x(u_x(a)))/s_x)^{2}\bigr)$. By Proposition [2.4](https://arxiv.org/html/2606.21795#S2.Thmtheorem4), $\mathrm{Spec}_{\text{raw}} = \epsilon(2 - \epsilon) = 1 - (1 - \epsilon)^{2}$.

These quantities are not directly comparable, as $\mathrm{Spec}_{\text{raw}}(\epsilon)$ depends on $\epsilon$ and, if we assume $\epsilon < s_x$, $\mathrm{Spec}_{\text{disc}}(\epsilon)$ does not. However, raw reward models show imperfect specificity (i.e. nonzero oversensitivity): $1 - \mathrm{Spec}_{\text{raw}} = (1 - \epsilon)^{2} > 0$. In contrast, there exist discretization thresholds with zero oversensitivity. If $(\tau_x - \phi_x(u_x(a)))/s_x \leq -\tfrac{1}{2}$ or $(\tau_x - \phi_x(u_x(a)))/s_x \geq \tfrac{1}{2}$, then $\mathrm{Spec}_{\text{disc}}(\epsilon) = 1$.

In [subsection 2.3](https://arxiv.org/html/2606.21795#S2.SS3), we showed that if $(\tau_x - \phi_x(u_x(a)))/s_x \leq -\tfrac{1}{2}$ or $(\tau_x - \phi_x(u_x(a)))/s_x \geq \tfrac{1}{2}$, then perfect discriminative ability is maintained. Therefore, $\tau_x = \phi_x(u_x(a)) - s_x/2$ or $\tau_x = \phi_x(u_x(a)) + s_x/2$ optimizes both discriminative ability and specificity. In the case of a binary utility function, if we set $\tau_x = (\phi_x(0) + \phi_x(1))/2$, then the discretized reward attains perfect discriminative ability and perfect specificity. In other words, an optimal discretization is achieved by setting thresholds that maximize the distance between each threshold and the mean reward of each adjacent equivalence class.

### 2\.5 Discretization maximizes the average of discriminative ability and specificity

Because the margin $\epsilon$ now enters discriminative ability and specificity in the same way, we can summarize a reward model by a single combined score $T_r(\epsilon) := (D_r(\epsilon) + \mathrm{Spec}_r(\epsilon))/2$, which attains its maximum of $1$ exactly when the reward has perfect discriminative ability and zero oversensitivity.

###### Theorem 2\.7\.

Under a binary utility model, discretization can attain the maximal combined score at every $\epsilon$. If we fix the optimal midpoint threshold $\tau_x$, with $|\tau_x - \phi_x(u)|/s_x = 1/2$ for each adjacent class $u$, then for every $\epsilon \in [0,1)$, $T_{\text{disc}}(\epsilon) = 1$ (the maximum), while the raw reward model never exceeds $5/6$, i.e. $T_r(\epsilon) \leq 5/6$.

Therefore, a proper discretization improves over the raw reward model at every tolerance: $T_{\text{disc}}(\epsilon) - T_r(\epsilon) \geq \tfrac{1}{6}$ for all $\epsilon$, and the gap approaches $\tfrac{1}{2}$ as $\epsilon \to 0$.

#### Proof

First consider the total score for the discretized reward. By [subsection 2.3](https://arxiv.org/html/2606.21795#S2.SS3), $D_{\text{disc}}(\epsilon) = 1$ for $\epsilon \in [0,1)$. By Proposition [2.5](https://arxiv.org/html/2606.21795#S2.Thmtheorem5) at $|\tau_x - \phi_x(u)|/s_x = 1/2$, the discretized oversensitivity is $\max(0, 1/2 - 2 \cdot (1/2)^{2}) = 0$, so $\mathrm{Spec}_{\text{disc}}(\epsilon) = 1$. Hence $T_{\text{disc}}(\epsilon) = (1 + 1)/2 = 1$.

Now consider the total score for the raw reward. By Proposition [2.3](https://arxiv.org/html/2606.21795#S2.Thmtheorem3), $D_r(\epsilon) = 1 - (1/2)\epsilon^{2}$, and by Proposition [2.4](https://arxiv.org/html/2606.21795#S2.Thmtheorem4), $\mathrm{Spec}_r(\epsilon) = \epsilon(2 - \epsilon)$, so $T_r(\epsilon) = -\tfrac{3}{4}\epsilon^{2} + \epsilon + \tfrac{1}{2}$. Its derivatives are $T_r'(\epsilon) = -\tfrac{3}{2}\epsilon + 1$ and $T_r''(\epsilon) = -\tfrac{3}{2}$. Since $T_r$ is concave down and its first derivative is zero at $\epsilon = \tfrac{2}{3}$, its maximum value is $T_r(2/3) = 1/2 + 2/3 - 1/3 = 5/6$. Thus $T_r(\epsilon) \leq \tfrac{5}{6} < 1 = T_{\text{disc}}(\epsilon)$. This gap is minimized at $\epsilon = \tfrac{2}{3}$ with a value of $\tfrac{1}{6}$ and rises to $\tfrac{1}{2}$ as $\epsilon \to 0$.
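The arithmetic in this proof is easy to verify on a grid. The sketch below evaluates the raw combined score from the two closed forms in Propositions 2.3 and 2.4 and locates its maximum; the function name and grid resolution are arbitrary choices.

```python
def raw_combined_score(eps):
    """T_r(eps) = (D_r(eps) + Spec_r(eps)) / 2 under the uniform-noise model."""
    discriminative_ability = 1 - eps**2 / 2
    specificity = eps * (2 - eps)
    return (discriminative_ability + specificity) / 2

# Grid over the tolerance range [0, 1).
grid = [i / 1000 for i in range(1000)]
best_eps = max(grid, key=raw_combined_score)

# Gap to the discretized score, which equals 1 at every tolerance in [0, 1).
smallest_gap = min(1 - raw_combined_score(eps) for eps in grid)

print(best_eps, raw_combined_score(best_eps), smallest_gap)
```

The grid maximum lands next to $\epsilon = 2/3$, where the raw score is approximately $5/6$ and the gap to the discretized score is approximately $1/6$.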

If we relax our noise model to be Gaussian rather than bounded uniform noise, the condition that $T_{\text{disc}} > T_{\text{raw}}$ is maintained (see [Theorem D.9](https://arxiv.org/html/2606.21795#A4.Thmtheorem9) in [Appendix D](https://arxiv.org/html/2606.21795#A4)).

### 2.6 Related formulations

Prior work has explored reward model pathologies that relate to oversensitivity. Miao et al. ([2024](https://arxiv.org/html/2606.21795#bib.bib25)), Hatgis-Kessell et al. ([2025](https://arxiv.org/html/2606.21795#bib.bib13)), and others consider general *reward misspecification*. Miao et al. ([2024](https://arxiv.org/html/2606.21795#bib.bib25)) train RMs with an information bottleneck to address this. Hatgis-Kessell et al. ([2025](https://arxiv.org/html/2606.21795#bib.bib13)) “repair” RMs for sequential decision-making in tabular environments. Siththaranjan et al. ([2024](https://arxiv.org/html/2606.21795#bib.bib32)) identify hidden context in multi-annotator preference data (a potential cause of oversensitivity) by learning a distribution rather than a point-estimate reward. The interventions proposed in these prior works all require training a new reward model. We believe that on-the-fly discretization is superior due to its *flexibility* and *generalization* in arbitrary scenarios involving a neural reward. Most similar to ours, Afsharrad et al. ([2026](https://arxiv.org/html/2606.21795#bib.bib1)) concurrently consider ordinal reward models (which are structurally identical to discretized reward models); however, they propose obtaining such a discretized RM via a new training procedure for reward models using ordinal preference magnitude labels, and their goal is to improve overall reward model accuracy (without defining or measuring oversensitivity separately from overall accuracy). Lastly, Huang et al. ([2024](https://arxiv.org/html/2606.21795#bib.bib16)) consider *false positive rewards* (in the setting of video game agents), which they detect with an external embedding-based judge model; in contrast, our method provides a model-agnostic intervention requiring no external judge mechanism.

## 3 Discretization via Reward Clustering

We now describe an algorithm called *reward clustering* that estimates the posterior distribution of each reward to group responses likely to be equally useful, illustrated by [Figure 3](https://arxiv.org/html/2606.21795#S2.F3).

#### Discretization as Clustering

We treat reward discretization as a 1-D clustering problem. Given a set of rewards $r_{1}, \ldots, r_{n}$ for a batch of responses $a_{1}, \ldots, a_{n}$, we estimate $P(\lvert r_{i} - r_{j} \rvert < \Delta)$ for each pair $r_{i}, r_{j}$. We then perform hierarchical clustering using complete linkage over these pairwise distances and cut the resulting dendrogram such that, for all $i, j$ in each final cluster, $P(\lvert r_{i} - r_{j} \rvert < \Delta) > p^{*}$, where $\Delta$ and $p^{*}$ are hyperparameters. The responses in each cluster are then assigned sequential integer rewards corresponding to the ordinal rank of their cluster’s mean reward.
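To make the clustering step concrete, here is a minimal pure-Python sketch of complete-linkage agglomeration over a precomputed matrix of pairwise probabilities. It is an illustration under stated assumptions rather than the authors' implementation: the function name `cluster_rewards`, the argument layout, and the choice to start the integer rewards at 0 are all hypothetical.

```python
from statistics import mean


def cluster_rewards(rewards, prob_close, p_star):
    """Group rewards by complete linkage and return ordinal integer rewards.

    rewards:    list of n scalar rewards (used only for cluster means).
    prob_close: n x n symmetric matrix with prob_close[i][j] approximating
                P(|r_i - r_j| < Delta).
    p_star:     every pair inside a final cluster must exceed this probability.
    """
    n = len(rewards)
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Complete linkage in probability space: a merge is only as good as
        # its least-similar pair, i.e. the smallest pairwise probability.
        return min(prob_close[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(clusters[x], clusters[y])
                if score > best:
                    best, best_pair = score, (x, y)
        # Cutting the dendrogram: stop once no merge keeps all pairs above p*.
        if best <= p_star:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Rank clusters by mean reward; each member receives its cluster's rank.
    clusters.sort(key=lambda members: mean(rewards[i] for i in members))
    discrete = [0] * n
    for rank, members in enumerate(clusters):
        for i in members:
            discrete[i] = rank
    return discrete


# Toy batch: two tight groups of rewards that are far from each other.
example_probs = [
    [1.00, 0.90, 0.00, 0.00],
    [0.90, 1.00, 0.00, 0.00],
    [0.00, 0.00, 1.00, 0.85],
    [0.00, 0.00, 0.85, 1.00],
]
print(cluster_rewards([0.10, 0.12, 0.90, 0.95], example_probs, p_star=0.8))
```

Because complete linkage scores a merge by its worst pair, a chain of pairwise-similar responses is not collapsed into one cluster unless every pair in it clears $p^{*}$. This sketch scans all cluster pairs at each step, which is adequate for a handful of rollouts per prompt but would need a proper linkage routine for large batches.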

#### Estimating Equivalent Rewards via MC Dropout

In §[2.2](https://arxiv.org/html/2606.21795#S2.SS2), we assumed uniform noise to provide hard guarantees. In practice, we consider a more realistic setting where rewards are approximately Gaussian (which is often the case with Bradley-Terry RMs; Sun et al., [2024](https://arxiv.org/html/2606.21795#bib.bib33)). As we show in [Appendix D](https://arxiv.org/html/2606.21795#A4), the core insight (discretization reduces oversensitivity in the presence of responses with equal utility) holds under both distributions. Assuming the reward estimates are independent Gaussians, $r_{i} \sim \mathcal{N}(\hat{\mu}_{i}, \hat{\sigma}_{i}^{2})$, the difference in rewards is distributed as $r_{i} - r_{j} \sim \mathcal{N}\left(\hat{\mu}_{i} - \hat{\mu}_{j}, \hat{\sigma}_{i}^{2} + \hat{\sigma}_{j}^{2}\right)$.

Then, the probability that two rewards are within $\Delta$ of each other is the probability that their difference falls in $(-\Delta, \Delta)$:

$$
P(\lvert r_{i} - r_{j} \rvert < \Delta) = P(-\Delta < r_{i} - r_{j} < \Delta) = \Gamma\left(\frac{\Delta - (\hat{\mu}_{i} - \hat{\mu}_{j})}{\sqrt{\hat{\sigma}_{i}^{2} + \hat{\sigma}_{j}^{2}}}\right) - \Gamma\left(\frac{-\Delta - (\hat{\mu}_{i} - \hat{\mu}_{j})}{\sqrt{\hat{\sigma}_{i}^{2} + \hat{\sigma}_{j}^{2}}}\right)
$$

where $\Gamma$ is the cumulative distribution function of the standard normal distribution.
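The pairwise probability above is straightforward to evaluate with the error function. The sketch below is one way to compute it using only the standard library; the function names and the handling of the zero-variance case are assumptions made for illustration.

```python
import math


def std_normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution (written as Gamma in the text)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within_delta(mu_i, sigma_i, mu_j, sigma_j, delta):
    """P(|r_i - r_j| < delta) for independent Gaussian reward estimates."""
    diff_mean = mu_i - mu_j
    diff_std = math.sqrt(sigma_i**2 + sigma_j**2)
    if diff_std == 0.0:
        # Degenerate case (an assumption of this sketch): with no uncertainty
        # the event is simply true or false.
        return 1.0 if abs(diff_mean) < delta else 0.0
    upper = std_normal_cdf((delta - diff_mean) / diff_std)
    lower = std_normal_cdf((-delta - diff_mean) / diff_std)
    return upper - lower


# Two rewards with the same mean are more likely "equivalent" than two
# rewards whose means are far apart relative to their uncertainty.
print(prob_within_delta(0.0, 1.0, 0.0, 1.0, delta=1.0))
print(prob_within_delta(0.0, 1.0, 3.0, 1.0, delta=1.0))
```

The result is symmetric in $i$ and $j$, so only the upper triangle of the pairwise matrix needs to be computed.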

We can assume that a reward sampled from our reward model is a reasonable estimate of $\hat{\mu}$, but we need to estimate the predictive variance $\hat{\sigma}^{2}$. We look to Monte Carlo (MC) Dropout (Gal and Ghahramani, [2016](https://arxiv.org/html/2606.21795#bib.bib10); Gao et al., [2021](https://arxiv.org/html/2606.21795#bib.bib11)) as a solution. For each response $a_{i}$, we perform $T$ stochastic forward passes through the reward model with a dropout probability of $d$, yielding reward samples $\{r_{i}^{(1)}, \ldots, r_{i}^{(T)}\}$. Increasing the value of $T$ ought to improve the estimate of variance, but this linearly increases the cost of reward computation. We can then approximate the predictive variance $\hat{\sigma}_{i}^{2}$ through the sample standard deviation: $\hat{\sigma}_{i} = \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (r_{i}^{(t)} - \hat{\mu}_{i})^{2}}$. (Footnote 5: Gal and Ghahramani ([2016](https://arxiv.org/html/2606.21795#bib.bib10)) show that dropout training approximates variational inference over network weights, and thus sampling with dropout at test time gives an estimate of epistemic uncertainty. Reward models are usually trained without dropout; in this case MC dropout is not an exact estimate of epistemic uncertainty, but we find it is a useful estimate in practice.)
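The estimate of $\hat{\sigma}_{i}$ can be sketched independently of any particular model. In the snippet below, `stochastic_reward` is a hypothetical callable standing in for one forward pass of a reward model with dropout left active. The sketch also assumes that $\hat{\mu}_{i}$ is taken as the mean of the $T$ dropout samples, which the text does not state explicitly.

```python
from statistics import mean, stdev
from typing import Callable, Tuple


def mc_dropout_estimate(
    stochastic_reward: Callable[[str], float],
    response: str,
    num_samples: int = 4,
) -> Tuple[float, float]:
    """Return (mu_hat, sigma_hat) from num_samples stochastic forward passes.

    stochastic_reward is a hypothetical stand-in for scoring a response with
    dropout enabled, so repeated calls return different values.
    """
    if num_samples < 2:
        raise ValueError("need at least 2 samples for the T - 1 denominator")
    samples = [stochastic_reward(response) for _ in range(num_samples)]
    # Assumption of this sketch: mu_hat is the mean of the dropout samples.
    mu_hat = mean(samples)
    # statistics.stdev divides by T - 1, matching the formula in the text.
    sigma_hat = stdev(samples)
    return mu_hat, sigma_hat


# Usage with a fake reward function that replays fixed "dropout" samples.
_fake_samples = iter([1.0, 2.0, 3.0, 4.0])
mu_hat, sigma_hat = mc_dropout_estimate(lambda _: next(_fake_samples), "some response")
print(mu_hat, sigma_hat)
```

With a PyTorch reward model, one common way to obtain such stochastic passes is to leave the dropout modules in training mode while scoring; how the paper's implementation does this is not specified in this section.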

#### Practical Implementation

We implement reward clustering using the OpenRLHF library (Hu et al., [2024](https://arxiv.org/html/2606.21795#bib.bib15)). Reward clustering has 4 hyperparameters: $\Delta$ (the difference needed to consider two reward values equivalent), $p^{*}$ (the minimum likelihood of reward equivalence needed to merge a given response into a cluster), $d$ (dropout probability), and $T$ (number of samples used for dropout). In all our experiments, we choose $d = 0.02$ (which provides sufficiently diverse samples) and $T = 4$. We show in [Appendix C](https://arxiv.org/html/2606.21795#A3) that performance surprisingly does not increase with the number of samples taken via MC dropout. When training on nodes with 8 x H100 GPUs with Ray (Moritz et al., [2018](https://arxiv.org/html/2606.21795#bib.bib27)), discretization increases average GRPO runtime by 15% (training throughput over 6 runs drops from $64.3 \pm 8.0$ prompts per minute to $55.8 \pm 7.9$).
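The 15% runtime figure and the throughput numbers are two views of the same measurement. As a short worked check, assuming runtime per prompt is the reciprocal of throughput (an assumption of this sketch, using only the mean values quoted above):

```python
def relative_runtime_increase(raw_throughput: float, disc_throughput: float) -> float:
    """Fractional increase in time per prompt, assuming time = 1 / throughput."""
    return raw_throughput / disc_throughput - 1.0


def relative_throughput_drop(raw_throughput: float, disc_throughput: float) -> float:
    """Fractional decrease in prompts processed per minute."""
    return 1.0 - disc_throughput / raw_throughput


raw, disc = 64.3, 55.8  # mean prompts per minute reported in the text
print(f"runtime increase: {relative_runtime_increase(raw, disc):.1%}")
print(f"throughput drop:  {relative_throughput_drop(raw, disc):.1%}")
```

The two percentages differ because one is measured relative to the raw throughput and the other relative to the discretized throughput.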

## 4 Experiments

### 4.1 Reward models exhibit oversensitivity

We first explore how to modulate the discriminative ability and oversensitivity of reward models using the “Ties” subset of RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2606.21795#bib.bib24)). We compare ways to use the output of four leading reward models: Skywork V1 (Liu et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib21)), Skywork V2 (Liu et al., [2025](https://arxiv.org/html/2606.21795#bib.bib22)), GRM (Yang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib38)), and ArmoRM (Wang et al., [2024](https://arxiv.org/html/2606.21795#bib.bib36)).

#### Metrics

RewardBench 2 “Ties” consists of prompts with one or more *chosen* responses and three or more *rejected* responses. The primary metric is a weighted average of two components, *accuracy* and *margin*, where *accuracy* measures whether all “chosen” responses receive higher reward than any “rejected” response and *margin* measures whether the difference between the worst-scoring “chosen” response and the best-scoring “rejected” response is greater than the difference between the best-scoring and worst-scoring “chosen” responses. The final metric is $0.6 \times \textit{accuracy} + 0.4 \times \textit{margin}$.
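A per-prompt version of this metric can be written directly from the description above. The sketch follows the wording of this section, not the RewardBench 2 reference implementation, and the function name `ties_score` is hypothetical.

```python
def ties_score(chosen, rejected):
    """Per-prompt score as described in the text: 0.6 * accuracy + 0.4 * margin.

    chosen, rejected: lists of scalar rewards for the two response groups.
    """
    worst_chosen, best_chosen = min(chosen), max(chosen)
    best_rejected = max(rejected)
    # accuracy: every chosen response outscores every rejected response.
    accuracy = float(worst_chosen > best_rejected)
    # margin: the chosen/rejected gap exceeds the spread among chosen responses.
    margin = float(worst_chosen - best_rejected > best_chosen - worst_chosen)
    return 0.6 * accuracy + 0.4 * margin


# The chosen rewards differ from each other (0.9 vs 0.8), yet the prompt
# still earns full credit because the gap to the rejected group is larger.
print(ties_score(chosen=[0.9, 0.8], rejected=[0.1, 0.2, 0.3]))
# Here the chosen spread is larger than the gap, so only accuracy is earned.
print(ties_score(chosen=[0.9, 0.5], rejected=[0.1, 0.2, 0.4]))
```

The first call shows why this metric can coexist with oversensitivity: equally good responses may receive different rewards without any penalty, as long as the difference stays below the chosen/rejected gap.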

This metric permits substantial oversensitivity as long as it does not exceed a relative tolerance (“the margin”). Whether this tolerance is appropriate depends on how reward signals are consumed during RL. In practice, algorithms like GRPO, REINFORCE, and PPO optimize the normalized difference between a given reward and a “baseline” (Shao et al., [2024](https://arxiv.org/html/2606.21795#bib.bib31); Williams, [2004](https://arxiv.org/html/2606.21795#bib.bib37); Schulman et al., [2017](https://arxiv.org/html/2606.21795#bib.bib30)). This difference can provide a learnable signal regardless of how the baseline is computed.
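To illustrate why small reward differences matter once rewards are normalized, the sketch below computes group-normalized advantages in the style of GRPO, $(r - \text{mean}) / \text{std}$ over the rollouts for one prompt. The use of the population standard deviation and the small stabilizing constant `eps` are assumptions of this sketch; implementations vary on both.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Four responses of essentially equal quality with tiny reward differences.
raw_rewards = [0.501, 0.502, 0.503, 0.504]
# The same batch after discretization places all four in one cluster.
discrete_rewards = [0, 0, 0, 0]

print(group_normalized_advantages(raw_rewards))
print(group_normalized_advantages(discrete_rewards))
```

Normalization rescales the tiny raw differences into advantages of order one, so the policy receives a strong push toward whichever response the reward model happened to score highest. With a single discrete cluster, every advantage is zero and no such push occurs.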

To overcome these issues, we apply pairwise notions of *discriminative ability* and *specificity* (complement of oversensitivity), as defined in §[2.1](https://arxiv.org/html/2606.21795#S2.SS1), to this benchmark, by assuming that all “chosen” responses have equal utility to one another, as do all “rejected” responses. To make our rewards scale-invariant, we normalize rewards at a per-prompt level over all responses for that prompt before computing our metrics. Discriminative ability and specificity both require a threshold $\hat{\epsilon}$ to determine tolerance for equivalence in the continuous reward space. We set $\hat{\epsilon} = 0.10$, which conceptually defines equivalent rewards as being within 10% of the within-batch spread, as this value resulted in the strongest baseline using the raw reward model in most settings, and use this $\hat{\epsilon}$ when evaluating all post-processing methods.
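As a rough illustration of the normalization and tolerance described above, the sketch below rescales the rewards for one prompt and tests whether two responses count as equivalent. The exact normalization is not specified in this section; min-max scaling is assumed here because the text describes $\hat{\epsilon}$ as a fraction of the within-batch spread, and the strict inequality is likewise an assumption. Both function names are hypothetical.

```python
def normalize_per_prompt(rewards):
    """Min-max scale the rewards for one prompt to [0, 1] (assumed scheme)."""
    low, high = min(rewards), max(rewards)
    if high == low:
        # All rewards identical: there is no spread to normalize.
        return [0.0 for _ in rewards]
    return [(r - low) / (high - low) for r in rewards]


def equivalent(reward_a, reward_b, eps_hat=0.10):
    """Treat two normalized rewards as equivalent if they differ by < eps_hat."""
    return abs(reward_a - reward_b) < eps_hat


normalized = normalize_per_prompt([10.0, 10.5, 20.0])
print(normalized)
print(equivalent(normalized[0], normalized[1]))  # 5% of the spread apart
print(equivalent(normalized[1], normalized[2]))  # 95% of the spread apart
```

Because the scaling is applied per prompt, multiplying or shifting all rewards for a prompt by a constant leaves the equivalence decisions unchanged.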

Table 1: Under our proposed new metrics on RewardBench 2, *reward clustering* improves the average of specificity and discriminative ability for every reward model. We use an equivalence tolerance of $\hat{\epsilon} = 0.10$ in the normalized reward space for computing these metrics. On RewardBench 2’s standard metrics, *reward clustering* increases the margin between chosen and rejected responses for 3 out of 4 models. The best method on each metric in each setting is bolded. While we display standard deviations of each metric, we do not explicitly indicate significance for each comparison.

#### Hyperparameters

For reward clustering, we select values for the hyperparameters $\Delta$ and $p^{*}$ via a small ($n = 13$), handwritten validation set of prompts labeled with equivalence classes of responses. We use $\Delta = 10$ and $p^{*} = 0.8$ for Skywork V1 and V2, $\Delta = 5$ and $p^{*} = 0.6$ for GRM, and $\Delta = 0.08$ and $p^{*} = 0.8$ for ArmoRM.

#### Baselines

We compare other reward processing techniques with *reward clustering*. As baselines, we use *raw* (using the RM directly) and *clipping* (clipping the 20% of tail rewards), along with *ensembling* (taking the average of 4 rewards sampled via dropout) (Eisenstein et al., [2023](https://arxiv.org/html/2606.21795#bib.bib7)), and *binary thresholding* (using the median as a threshold for a simple model-agnostic discretization).
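The non-trivial baselines can be sketched in a few lines each. The details below are assumptions where the text is silent: clipping is taken to mean clamping to the 10th and 90th percentiles (10% per tail), and binary thresholding assigns 1 to rewards strictly above the median. All function names are hypothetical.

```python
from statistics import mean, median, quantiles


def clip_tails(rewards):
    """Clamp rewards to the 10th-90th percentile range (assumed 10% per tail)."""
    deciles = quantiles(rewards, n=10)  # nine cut points
    low, high = deciles[0], deciles[-1]
    return [min(max(r, low), high) for r in rewards]


def ensemble(dropout_samples):
    """Average several dropout-sampled rewards for a single response."""
    return mean(dropout_samples)


def binary_threshold(rewards):
    """Two-level discretization: 1 if strictly above the batch median, else 0."""
    cutoff = median(rewards)
    return [1 if r > cutoff else 0 for r in rewards]


batch = [0.1, 0.4, 0.6, 0.9]
print(binary_threshold(batch))
print(ensemble([1.0, 2.0, 3.0, 4.0]))
print(clip_tails([float(v) for v in range(1, 21)]))
```

Unlike reward clustering, binary thresholding always splits a batch into two levels, even when all responses are of similar quality, and ensembling changes the reward estimate without making it any less fine-grained.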

In [Table 1](https://arxiv.org/html/2606.21795#S4.T1), we see a tension between different evaluation paradigms. Reward clustering uniformly improves the mean of our proposed metrics, specificity and discriminative ability; this supports the theoretical claims made in [Theorem 2.7](https://arxiv.org/html/2606.21795#S2.Thmtheorem7) (in a simplified setting) and [Theorem D.9](https://arxiv.org/html/2606.21795#A4.Thmtheorem9) (in the more practical setting of Gaussian-distributed rewards in each utility class). For every reward model, we can improve specificity in exchange for modest reductions in discriminative ability. On the standard metrics, reward clustering increases the margin metric relative to using the reward model directly in all settings except Skywork V2, but sometimes at the expense of *accuracy*. (Footnote 6: Later on in §[4.3](https://arxiv.org/html/2606.21795#S4.SS3), we will find Skywork V2 is also the RM on which we see the fewest gains in RL training experiments.) These default metrics do not fully capture these gains because they tolerate oversensitivity below some amount. Among baselines, we note that *ensembling* does not improve over *raw*: the use of MC dropout in reward estimation does not intrinsically help. As performance of reward models on intrinsic capability benchmarks like RewardBench may be poorly correlated with their efficacy as teachers for RLHF (Liu et al., [2024b](https://arxiv.org/html/2606.21795#bib.bib23); Frick et al., [2024](https://arxiv.org/html/2606.21795#bib.bib9); Malik et al., [2025](https://arxiv.org/html/2606.21795#bib.bib24)), the downstream RL results in §[4.2](https://arxiv.org/html/2606.21795#S4.SS2) and §[4.3](https://arxiv.org/html/2606.21795#S4.SS3) offer a more direct test: we find that gains on our proposed metrics translate to better policies.

### 4.2 Reward clustering inhibits overfitting to spurious effects

Can discretizing reward models help to distinguish *signal* from *noise*? Modern reward models capture mixtures of useful preferences (task completion, readability) and harmful ones (sycophancy, *goblin* references, etc.). (Footnote 7, on the goblin references: [https://openai.com/index/where-the-goblins-came-from/](https://openai.com/index/where-the-goblins-came-from/).) We simulate this scenario by specifying a primary goal (following precise instructions) and a secondary goal (to use non-committal, hedged language). The secondary goal is concretely defined as learning to produce “hedging words” (e.g. “possibly”, “maybe”) and avoid “intensifiers” (e.g. “very”, “absolutely”).

We operationalize this scenario with a preference dataset. This dataset is built from allenai/RLVR-IFeval. To support our primary goal, we sample responses from Llama-3.1-8B-Instruct and group them into the following categories: (A) solve the task using confident language; (B) fail the task using confident language; (C) solve the task using hedged language; (D) fail the task using hedged language. For our primary goal, we construct response pairs where one response is verifiably correct and the other is verifiably incorrect but neither contains hedging nor intensifiers (A/B or C/D). For our secondary goal, we select equally-correct response pairs where one response is confident and the other is hedged (A/C or B/D). We consider 90%-10% and 80%-20% mixtures of the “Primary” and “Secondary” subsets. (Footnote 8: These mixtures would correspond to differences in the range of the noise component $\eta$ described in [subsection 2.1](https://arxiv.org/html/2606.21795#S2.SS1).) By design, the dataset is dominated by examples of the primary goal, but the secondary goal may be easier to learn and exploit. We then train Bradley-Terry reward models based on Llama-3.1-8B-Instruct (Liu et al., [2024a](https://arxiv.org/html/2606.21795#bib.bib21)) on both datasets. The resultant RMs encode both task correctness and linguistic hedging but in varying proportions.
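The mixture construction can be sketched as follows. This is an illustration of the pairing scheme only: the record layout (one response per category A to D for each prompt), the field names, sampling with replacement, and the function name `build_mixture` are all hypothetical, and the secondary pairs assume the hedged response is the preferred one, in line with the stated secondary goal.

```python
import random

# (chosen, rejected) category templates, following the labels in the text.
PRIMARY = [("A", "B"), ("C", "D")]     # correct response beats incorrect one
SECONDARY = [("C", "A"), ("D", "B")]   # hedged response beats confident one


def build_mixture(records, primary_fraction, total, seed=0):
    """Build `total` preference pairs with the given share of primary pairs.

    records: list of dicts with keys "prompt", "A", "B", "C", "D"
             (a hypothetical layout used only for this sketch).
    """
    rng = random.Random(seed)
    num_primary = round(total * primary_fraction)
    pairs = []
    for k in range(total):
        if k < num_primary:
            subset, templates = "primary", PRIMARY
        else:
            subset, templates = "secondary", SECONDARY
        record = rng.choice(records)  # sampling with replacement (assumption)
        chosen_key, rejected_key = rng.choice(templates)
        pairs.append({
            "prompt": record["prompt"],
            "chosen": record[chosen_key],
            "rejected": record[rejected_key],
            "subset": subset,
        })
    rng.shuffle(pairs)
    return pairs


toy_records = [{"prompt": "p", "A": "a", "B": "b", "C": "c", "D": "d"}]
mixture = build_mixture(toy_records, primary_fraction=0.9, total=100)
print(sum(pair["subset"] == "primary" for pair in mixture))
```

A Bradley-Terry reward model trained on such a mixture sees the hedging preference in only a minority of pairs, yet that preference is stylistically simple, which is what makes it easy to pick up and exploit.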

![Refer to caption](https://arxiv.org/html/2606.21795v1/x4.png)

Figure 4: On a real prompt from IFEval, we see that training directly on a reward containing both correct instruction-following rewards and spurious rewards (for “linguistic hedging”) leads the model to excessively optimize for hedged language. In contrast, training on a discretized reward model leads to less hedging, and the model converges to a sensible response.

![Refer to caption](https://arxiv.org/html/2606.21795v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.21795v1/x6.png)

Figure 5: (Left) Policies trained via RL against mixed-effect reward models initially optimize task correctness well, but performance degrades with continued training. Means and standard deviations are reported over three runs. (Right) Policies trained on an 80-20 mixture learn to heavily exploit the secondary reward (increasing “hedging word” usage); discretization curbs this overoptimization. Clipping performs well in the 90-10 setting where the reward is better-specified but proves disastrous in the 80-20 setting, suggesting that when the hackable secondary reward carries greater weight, clipping may cannibalize the primary reward rather than containing the exploit.

In this environment, discretization enables modest optimization of the secondary reward while safely preserving primary task performance. [Figure 4](https://arxiv.org/html/2606.21795#S4.F4) illustrates how a model trained on the 80%-20% mixture of instruction-following-based rewards and spurious rewards learns to complete the task but hedges excessively. In [Figure 5](https://arxiv.org/html/2606.21795#S4.F5), we see that, with a 20% spurious preference ratio, this overexploitation of the hedging reward comes at the expense of task correctness; even at 10%, optimizing the secondary reward still degrades performance (though without overt overoptimization). Clipping performs well in the 90-10 setting but proves disastrous at 80-20, suggesting that a heavier hackable reward causes clipping to sacrifice the primary objective.

### 4.3 Reward clustering improves learning from real reward models

We train Llama-3.1-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.21795#bib.bib6)) in a verifier-free multi-task setup on unlabeled prompts. We use 30K combined prompts from the allenai/RLVR-IFeval, allenai/RLVR-MATH, and allenai/RLVR-GSM datasets and 30K prompts randomly sampled from WildChat (Zhao et al., [2024](https://arxiv.org/html/2606.21795#bib.bib40)), which is truly non-verifiable. We use the same RMs and hyperparameters described in §[4.1](https://arxiv.org/html/2606.21795#S4.SS1). We train with two values of KL penalty with 3 random seeds each using GRPO with 8 rollouts (Shao et al., [2024](https://arxiv.org/html/2606.21795#bib.bib31)) via OpenRLHF (Hu et al., [2024](https://arxiv.org/html/2606.21795#bib.bib15)). We evaluate on three in-distribution datasets, IFEval (Zhou et al., [2023](https://arxiv.org/html/2606.21795#bib.bib41)), MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.21795#bib.bib14)), and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.21795#bib.bib4)).

Table 2: Discretization always significantly improves (10/24 comparisons greater than one standard deviation) or maintains efficacy (14/24 comparisons), with *zero* regressions (up to significance). Our base model, Llama-3.1-8B-Instruct, achieves 77.3 on GSM8K, 47.2 on MATH, and 77.5 on IFEval.

#### Results

[Table 2](https://arxiv.org/html/2606.21795#S4.T2) shows discretization is never worse than the raw baseline (up to overlapping confidence intervals) and frequently much better. With two exceptions, discretized rewards rarely lead to degenerate policies, unlike training on raw reward models. (Footnote 9: The exceptions are ArmoRM with $\beta = 0.01$ on MATH and one run of Skywork V2 on MATH with $\beta = 0.05$. Both are a failure mode affecting our baseline (answering in-context examples rather than the target task), not unique to discretization.) As a whole, these results suggest that *discretized reward models* are suitable drop-in replacements for using reward models directly.

## 5 Limitations

Our paper has three main limitations. While our algorithm makes few assumptions on model type, we only experiment with language models. Moreover, our theory assumes (in §[2](https://arxiv.org/html/2606.21795#S2)) that our reward model is a *linear function of a binary utility function* plus *noise that is either bounded uniform or Gaussian*. The linearity assumption may be a sensible assumption (given a supposed utility function), but real-world utility functions often feature multiple equivalence classes of responses. While our derivations should generalize cleanly to this setting, we have not shown so in this paper. In both cases, we also assume the reward noise has constant variance across utility classes, which simplifies our calculations but likely would not hold in practice. Lastly, all our experimental results are shown on a single base model, Llama-3.1-8B-Instruct, and all RL training experiments use a single learning algorithm (GRPO; Shao et al. ([2024](https://arxiv.org/html/2606.21795#bib.bib31))).

## 6 Conclusion

Motivated by concerns that reinforcement learning supervised by reward models may lead to policies that learn undesirable behaviors, we introduce *reward model oversensitivity* and find many popular RMs are highly oversensitive despite their strong discriminative ability. We show that, in theory, continuous-valued rewards with perfect discriminative ability must exhibit oversensitivity, but certain discretizations can mitigate this. We then describe *reward clustering*, which improves the tradeoff between specificity and discriminative ability and leads to better policies. Discretization raises the risk of reducing models’ peak ability by slowing learning. However, our results suggest that discretization can retain the upside of learning from reward models while limiting potential negative impacts.

## 7 Acknowledgements

We are grateful to Madian Khabsa for his role in scoping and formulating our problem. We thank Vashisth Tiwari, Pranjal Aggarwal, Hamish Ivison, Anthony GX-Chen, and Daniel Fried for their useful and encouraging conversations about reward model discretization, and we thank Akhila Yerukola for her extensive assistance in refining our framing and preparing our manuscript.

## References

- Afsharrad et al\. \(2026\)Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, and Mohammad Ghavamzadeh\.Beyond binary preferences: A principled framework for reward modeling with ordinal feedback, 2026\.[https://arxiv\.org/abs/2603\.02232](https://arxiv.org/abs/2603.02232)\.
- Chen et al\. \(2024\)Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, and Xiaoyu Shen\.The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models\.*ArXiv*, abs/2410\.06554, 2024\.[https://api\.semanticscholar\.org/CorpusID:273229168](https://api.semanticscholar.org/CorpusID:273229168)\.
- Christiano et al\. \(2017\)Paul F\. Christiano, Jan Leike, Tom B\. Brown, Miljan Martic, Shane Legg, and Dario Amodei\.Deep reinforcement learning from human preferences\.In*Proceedings of the 31st International Conference on Neural Information Processing Systems*, NIPS’17, page 4302–4310, Red Hook, NY, USA, 2017\. Curran Associates Inc\.ISBN 9781510860964\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\.Training Verifiers to Solve Math Word Problems\.*ArXiv*, abs/2110\.14168, 2021\.[https://api\.semanticscholar\.org/CorpusID:239998651](https://api.semanticscholar.org/CorpusID:239998651)\.
- DeepSeek\-AI et al\. \(2025\)DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun\-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z\. F\. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bing\-Li Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dong\-Li Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H\. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Jiong Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, M\. Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R\. J\. Chen, Ruiqi Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S\. S\. Li, Shuang Zhou, Shao\-Kang Wu, Tao Yun, Tian Pei, Tianyu Sun, T\. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wen\-Xia Yu, Wentao Zhang, Wangding Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xi aokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X\. Q\. Li, Xiangyu Jin, Xi\-Cheng Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yi Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yu\-Jing Zou, Yujia He, Yunfan Xiong, Yu\-Wei Luo, Yu mei You, Yuxuan Liu, Yuyang Zhou, Y\. X\. Zhu, Yanping Huang, Yao Li, Yi Zheng, Yuchen Zhu, Yunxiang Ma, Ying Tang, Yukun Zha, Yuting Yan, Zehui Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhen guo Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zi\-An Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang\.DeepSeek\-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning\.*ArXiv*, abs/2501\.12948, 2025\.[https://api\.semanticscholar\.org/CorpusID:275789950](https://api.semanticscholar.org/CorpusID:275789950)\.
- Dubey et al\. \(2024\)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony S\. Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur’elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia\-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab A\. AlBadawy, E I Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriele Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guanglong Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M\. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Ju\-Qing Jia, Kalyan Vasuden Alwala, K\. Upasani, Kate Plawiak, Keqian Li, Kenneth Heafield, Kevin R\. 
Stone, Khalid El\-Arini, Krithika Iyer, Kshitiz Malik, Kuen ley Chiu, Kunal Bhalla, Lauren Rantala\-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Ma hesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melissa Hall Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Niko lay Bashlykov, Nikolay Bogoychev, Niladri S\. Chatterji, Olivier Duchenne, Onur cCelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasić, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ron nie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sa hana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Chandra Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stéphane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Vir ginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whit ney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yiqian Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zhengxu Yan, Zhengxing Chen, Zoe Papakipos, Aaditya K\. 
Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adi Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Po\-Yao \(Bernie\) Huang, Beth Loyd, Beto de Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching\-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Shang\-Wen Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco \(Paco\) Guzmán, Frank J\. 
Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Han Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor Tufanov, Irina\-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean\-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kaixing\(Kai\) Wu, U KamHou, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, A Lavender, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthias Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L\. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Mun ish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navy ata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pe dro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollár, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, S\. Yu\. Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Kumar Gupta, Sung\-Bae Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Andrei Poenaru, Vlad T\. 
Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xia Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao\.The Llama 3 Herd of Models, 2024\.[https://api\.semanticscholar\.org/CorpusID:271571434](https://api.semanticscholar.org/CorpusID:271571434)\.
- Eisenstein et al\. \(2023\)Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant\.Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking\.*arXiv preprint arXiv:2312\.09244*, 2023\.
- Elangovan et al\. \(2025\)Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Babu Bodapati, and Dan Roth\.Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM\-as\-a\-judge\.In*The Thirteenth International Conference on Learning Representations*, 2025\.[https://openreview\.net/forum?id=E8gYIrbP00](https://openreview.net/forum?id=E8gYIrbP00)\.
- Frick et al\. \(2024\)Evan Frick, Tianle Li, Connor Chen, Wei\-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph Gonzalez, and Ion Stoica\.How to Evaluate Reward Models for RLHF\.*ArXiv*, abs/2410\.14872, 2024\.[https://api\.semanticscholar\.org/CorpusID:273502060](https://api.semanticscholar.org/CorpusID:273502060)\.
- Gal and Ghahramani \(2016\)Yarin Gal and Zoubin Ghahramani\.Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning\.In Maria Florina Balcan and Kilian Q\. Weinberger, editors,*Proceedings of The 33rd International Conference on Machine Learning*, volume 48 of*Proceedings of Machine Learning Research*, pages 1050–1059, New York, New York, USA, 20–22 Jun 2016\. PMLR\.[https://proceedings\.mlr\.press/v48/gal16\.html](https://proceedings.mlr.press/v48/gal16.html)\.
- Gao et al\. \(2021\)Tianyu Gao, Xingcheng Yao, and Danqi Chen\.SimCSE: Simple Contrastive Learning of Sentence Embeddings\.*ArXiv*, abs/2104\.08821, 2021\.[https://api\.semanticscholar\.org/CorpusID:233296292](https://api.semanticscholar.org/CorpusID:233296292)\.
- Guerdan et al\. \(2025\)Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, and Alexandra Chouldechova\.Validating llm\-as\-a\-judge systems under rating indeterminacy\.*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2025\.
- Hatgis\-Kessell et al\. \(2025\)Stephane Hatgis\-Kessell, Logan Mondal Bhamidipaty, and Emma Brunskill\.Repairing Reward Functions with Feedback to Mitigate Reward Hacking, 2025\.[https://api\.semanticscholar\.org/CorpusID:282102369](https://api.semanticscholar.org/CorpusID:282102369)\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt\.Measuring Mathematical Problem Solving With the MATH Dataset\.*ArXiv*, abs/2103\.03874, 2021\.[https://api\.semanticscholar\.org/CorpusID:232134851](https://api.semanticscholar.org/CorpusID:232134851)\.
- Hu et al\. \(2024\)Jian Hu, Xibin Wu, Weixun Wang, Songlin Jiang, Dehao Zhang, Yu Cao, OpenLLMAI Team, Netease Fuxi, AI Lab, and Alibaba Group\.OpenRLHF: An Easy\-to\-use, Scalable and High\-performance RLHF Framework\.*ArXiv*, abs/2405\.11143, 2024\.[https://api\.semanticscholar\.org/CorpusID:269921667](https://api.semanticscholar.org/CorpusID:269921667)\.
- Huang et al\. \(2024\)Sukai Huang, Shu\-Wei Liu, Nir Lipovetzky, and Trevor Cohn\.The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards, 2024\.[https://api\.semanticscholar\.org/CorpusID:272832041](https://api.semanticscholar.org/CorpusID:272832041)\.
- James et al\. \(2013\)Gareth M\. James, Daniela M\. Witten, Trevor J\. Hastie, and Robert Tibshirani\.An Introduction to Statistical Learning\.*Springer Texts in Statistics*, 2013\.[https://api\.semanticscholar\.org/CorpusID:62973643](https://api.semanticscholar.org/CorpusID:62973643)\.
- Lambert et al\. \(2024a\)Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D\. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A\. Smith, Yizhong Wang, Pradeep Dasigi, and Hanna Hajishirzi\.TÜLU 3: Pushing Frontiers in Open Language Model Post\-Training\.*ArXiv*, abs/2411\.15124, 2024a\.[https://api\.semanticscholar\.org/CorpusID:274192505](https://api.semanticscholar.org/CorpusID:274192505)\.
- Lambert et al\. \(2024b\)Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A\. Smith, and Hanna Hajishirzi\.RewardBench: Evaluating Reward Models for Language Modeling\.*ArXiv*, abs/2403\.13787, 2024b\.[https://api\.semanticscholar\.org/CorpusID:268537409](https://api.semanticscholar.org/CorpusID:268537409)\.
- Lin et al\. \(2025\)Zi Lin, Sheng Shen, Jingbo Shang, Jason Weston, and Yixin Nie\.Learning to Solve and Verify: A Self\-Play Framework for Code and Test Generation, 2025\.[https://arxiv\.org/abs/2502\.14948](https://arxiv.org/abs/2502.14948)\.
- Liu et al\. \(2024a\)Chris Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou\.Skywork\-Reward: Bag of Tricks for Reward Modeling in LLMs\.*ArXiv*, abs/2410\.18451, 2024a\.[https://api\.semanticscholar\.org/CorpusID:273549327](https://api.semanticscholar.org/CorpusID:273549327)\.
- Liu et al\. \(2025\)Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al\.Skywork\-Reward\-V2: Scaling Preference Data Curation via Human\-AI Synergy\.*arXiv preprint arXiv:2507\.01352*, 2025\.
- Liu et al\. \(2024b\)Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li\.RM\-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style\.*arXiv preprint arXiv:2410\.16184*, 2024b\.
- Malik et al\. \(2025\)Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Daniel Morrison, Noah A\. Smith, Hanna Hajishirzi, and Nathan Lambert\.Rewardbench 2: Advancing reward model evaluation\.In*The Fourteenth International Conference on Learning Representations*, 2025\.[https://api\.semanticscholar\.org/CorpusID:279119102](https://api.semanticscholar.org/CorpusID:279119102)\.
- Miao et al\. \(2024\)Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao\.InfoRM: Mitigating Reward Hacking in RLHF via Information\-Theoretic Reward Modeling\.*Advances in Neural Information Processing Systems 37*, 2024\.[https://api\.semanticscholar\.org/CorpusID:267657799](https://api.semanticscholar.org/CorpusID:267657799)\.
- Moore \(1966\)Ramon E Moore\.*Interval analysis*\.Prentice\-Hall, 1966\.
- Moritz et al\. \(2018\)Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I\. Jordan, and Ion Stoica\.Ray: a distributed framework for emerging AI applications\.In*Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation*, OSDI’18, page 561–577, USA, 2018\. USENIX Association\.ISBN 9781931971478\.
- Razin et al\. \(2025\)Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D\. Lee, and Sanjeev Arora\.What Makes a Reward Model a Good Teacher? An Optimization Perspective\.*ArXiv*, abs/2503\.15477, 2025\.[https://api\.semanticscholar\.org/CorpusID:277112967](https://api.semanticscholar.org/CorpusID:277112967)\.
- Saad\-Falcon et al\. \(2025\)Jon Saad\-Falcon, Estefany Kelly Buchanan, Mayee F\. Chen, Tzu\-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott W\. Linderman, Azalia Mirhoseini, and Christopher Ré\.Shrinking the Generation\-Verification Gap with Weak Verifiers\.*ArXiv*, abs/2506\.18203, 2025\.[https://api\.semanticscholar\.org/CorpusID:280000478](https://api.semanticscholar.org/CorpusID:280000478)\.
- Schulman et al\. \(2017\)John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal Policy Optimization Algorithms\.*ArXiv*, abs/1707\.06347, 2017\.[https://api\.semanticscholar\.org/CorpusID:28695052](https://api.semanticscholar.org/CorpusID:28695052)\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y\.K\. Li, Y\. Wu, and Daya Guo\.DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024\.[https://arxiv\.org/abs/2402\.03300](https://arxiv.org/abs/2402.03300)\.
- Siththaranjan et al\. \(2024\)Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield\-Menell\.Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF\.In*International Conference on Learning Representations*, 2024\.[https://api\.semanticscholar\.org/CorpusID:266191810](https://api.semanticscholar.org/CorpusID:266191810)\.
- Sun et al\. \(2024\)Hao Sun, Yunyi Shen, and Jean\-François Ton\.Rethinking Bradley\-Terry Models in Preference\-Based Reward Modeling: Foundations, Theory, and Alternatives\.*ArXiv*, abs/2411\.04991, 2024\.[https://api\.semanticscholar\.org/CorpusID:273877679](https://api.semanticscholar.org/CorpusID:273877679)\.
- Tunstall et al\. \(2023\)Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M\. Rush, and Thomas Wolf\.Zephyr: Direct Distillation of LM Alignment\.*ArXiv*, abs/2310\.16944, 2023\.[https://api\.semanticscholar\.org/CorpusID:264490502](https://api.semanticscholar.org/CorpusID:264490502)\.
- Viswanathan et al\. \(2025\)Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu\.Checklists Are Better Than Reward Models For Aligning Language Models\.In*Advances in Neural Information Processing Systems*, volume 38, 2025\.[https://arxiv\.org/abs/2507\.18624](https://arxiv.org/abs/2507.18624)\.
- Wang et al\. \(2024\)Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang\.Interpretable Preferences via Multi\-Objective Reward Modeling and Mixture\-of\-Experts\.In Yaser Al\-Onaizan, Mohit Bansal, and Yun\-Nung Chen, editors,*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 10582–10592, Miami, Florida, USA, November 2024\. Association for Computational Linguistics\.[10\.18653/v1/2024\.findings\-emnlp\.620](https://arxiv.org/doi.org/10.18653/v1/2024.findings-emnlp.620)\.[https://aclanthology\.org/2024\.findings\-emnlp\.620/](https://aclanthology.org/2024.findings-emnlp.620/)\.
- Williams \(2004\)Ronald J\. Williams\.Simple Statistical Gradient\-Following Algorithms for Connectionist Reinforcement Learning\.*Machine Learning*, 8:229–256, 2004\.[https://api\.semanticscholar\.org/CorpusID:2332513](https://api.semanticscholar.org/CorpusID:2332513)\.
- Yang et al\. \(2024\)Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang\.Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs\.*ArXiv*, abs/2406\.10216, 2024\.[https://api\.semanticscholar\.org/CorpusID:270521260](https://api.semanticscholar.org/CorpusID:270521260)\.
- Yang et al\. \(2025\)Zonglin Yang, Zhexuan Gu, Houduo Qi, and Yancheng Yuan\.Accelerating RLHF Training with Reward Variance Increase\.*ArXiv*, abs/2505\.23247, 2025\.[https://api\.semanticscholar\.org/CorpusID:278996019](https://api.semanticscholar.org/CorpusID:278996019)\.
- Zhao et al\. \(2024\)Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng\.WildChat: 1M ChatGPT Interaction Logs in the Wild\.In*The Twelfth International Conference on Learning Representations*, 2024\.[https://openreview\.net/forum?id=Bl8u7ZRlbM](https://openreview.net/forum?id=Bl8u7ZRlbM)\.
- Zhou et al\. \(2023\)Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou\.Instruction\-following evaluation for large language models\.*arXiv preprint arXiv:2311\.07911*, 2023\.

## Appendix A Proof of Proposition [2\.1](https://arxiv.org/html/2606.21795#S2.Thmtheorem1) \(Accuracy is the weighted sum of discriminative ability and specificity\)

Accuracy is

$$
\begin{aligned}
&\mathbb{E}_{a,b}\bigl[\mathbbm{1}[\operatorname{sgn}(r_x(a)-r_x(b)) = \operatorname{sgn}(u_x(a)-u_x(b))]\bigr] \\
&\quad= P\bigl(\operatorname{sgn}(r_x(a)-r_x(b)) = \operatorname{sgn}(u_x(a)-u_x(b))\bigr) \\
&\quad= P\left[r_x(a)=r_x(b),\, u_x(a)=u_x(b)\right] \\
&\qquad + P\left[r_x(a)>r_x(b),\, u_x(a)>u_x(b)\right] \\
&\qquad + P\left[r_x(a)<r_x(b),\, u_x(a)<u_x(b)\right]
\end{aligned}
$$

First term:

$$
P\left[r_x(a)=r_x(b),\, u_x(a)=u_x(b)\right] = \bigl(1 - P\left[r_x(a)\neq r_x(b) \mid u_x(a)=u_x(b)\right]\bigr) \times P\left[u_x(a)=u_x(b)\right]
$$

Second and third terms:

$$
\begin{aligned}
&P\left[r_x(a)>r_x(b),\, u_x(a)>u_x(b)\right] + P\left[r_x(a)<r_x(b),\, u_x(a)<u_x(b)\right] \\
&\quad= P\left[r_x(a)>r_x(b) \mid u_x(a)>u_x(b)\right] P(u_x(a)>u_x(b)) + P\left[r_x(a)<r_x(b) \mid u_x(a)<u_x(b)\right] P(u_x(a)<u_x(b)).
\end{aligned}
$$

Then, accuracy is

$$
\begin{aligned}
&\bigl(1 - P\left[r_x(a)\neq r_x(b) \mid u_x(a)=u_x(b)\right]\bigr) \times P\left[u_x(a)=u_x(b)\right] && \textit{(specificity)} \\
&+ P\left[r_x(a)>r_x(b) \mid u_x(a)>u_x(b)\right] P(u_x(a)>u_x(b)) \\
&+ P\left[r_x(a)<r_x(b) \mid u_x(a)<u_x(b)\right] P(u_x(a)<u_x(b)) && \textit{(discriminative ability)}
\end{aligned}
$$
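The decomposition above can be checked numerically on toy data\. The following is a minimal sketch, not the authors' code: the data generator `make_pairs` and its integer\-valued rewards are hypothetical choices made only so that exact reward ties occur with positive probability\.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def make_pairs(n, seed=0):
    """Hypothetical toy data: integer utilities and integer-valued noisy rewards."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u_a, u_b = rng.randint(0, 2), rng.randint(0, 2)
        r_a = u_a + rng.choice([-1, 0, 1])
        r_b = u_b + rng.choice([-1, 0, 1])
        pairs.append((r_a, r_b, u_a, u_b))
    return pairs


def accuracy(pairs):
    """Fraction of pairs whose reward ordering matches the utility ordering."""
    return sum(sign(ra - rb) == sign(ua - ub) for ra, rb, ua, ub in pairs) / len(pairs)


def weighted_sum(pairs):
    """Specificity and discriminative-ability terms, weighted as in the proof."""
    n = len(pairs)
    ties = [p for p in pairs if p[2] == p[3]]    # u(a) == u(b)
    above = [p for p in pairs if p[2] > p[3]]    # u(a) >  u(b)
    below = [p for p in pairs if p[2] < p[3]]    # u(a) <  u(b)

    def cond(group, ok):
        # Empirical conditional probability of `ok` within `group`.
        return sum(ok(p) for p in group) / len(group) if group else 0.0

    specificity = cond(ties, lambda p: p[0] == p[1])
    d_above = cond(above, lambda p: p[0] > p[1])
    d_below = cond(below, lambda p: p[0] < p[1])
    return (specificity * len(ties) + d_above * len(above) + d_below * len(below)) / n


if __name__ == "__main__":
    pairs = make_pairs(5000)
    print(accuracy(pairs), weighted_sum(pairs))
```

On any finite sample the two quantities agree up to floating\-point error, because the identity holds for the empirical distribution as well\.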

## Appendix B Proof of Proposition [2\.5](https://arxiv.org/html/2606.21795#S2.Thmtheorem5) \(The oversensitivity of a binary\-discretized reward model is $\max(0, \tfrac{1}{2} - 2((\tau_x - \phi_x(u_x(a)))/s_x)^2)$\)

#### Proof

For sufficiently small $\epsilon$ \($\epsilon < s_x$\), note that $\text{disc}_x(r_x(a)) \neq \text{disc}_x(r_x(b)) \Longleftrightarrow |\text{disc}_x(r_x(a)) - \text{disc}_x(r_x(b))| > \epsilon$\. Therefore, $P(|\text{disc}_x(r_x(a)) - \text{disc}_x(r_x(b))| > \epsilon \mid u_x(a) = u_x(b)) = P(\text{disc}_x(r_x(a)) \neq \text{disc}_x(r_x(b)) \mid u_x(a) = u_x(b))$\.

For our discretized \(binary\) reward to retain perfect discriminative ability, there must be exactly two unique utility values for this task, meaning the task has a binary notion of correctness\. In this setting:

$$
P(\text{disc}_x(r_x(a)) \neq \text{disc}_x(r_x(b))) = P(\text{disc}_x(r_x(a)) > \text{disc}_x(r_x(b))) + P(\text{disc}_x(r_x(b)) > \text{disc}_x(r_x(a))).
$$

Conditioned on $u_x(a) = u_x(b)$, these two orderings are equally likely\. We can compute the first and double it:

$$
\begin{aligned}
P(\text{disc}_x(r_x(a)) > \text{disc}_x(r_x(b)))
&= P(r_x(a) > \tau_x \text{ and } r_x(b) \leq \tau_x) \\
&= P(\phi_x(u_x(a)) + \eta_x(a) > \tau_x \text{ and } \phi_x(u_x(b)) + \eta_x(b) \leq \tau_x) \\
&= P(\phi_x(u_x(a)) + \eta_x(a) > \tau_x) \times P(\phi_x(u_x(b)) + \eta_x(b) \leq \tau_x).
\end{aligned}
$$

The final line follows because the events $\phi_x(u_x(a)) + \eta_x(a) > \tau_x$ and $\phi_x(u_x(b)) + \eta_x(b) \leq \tau_x$ are independent, conditioned on $u_x(a) = u_x(b)$\.

First term:

$$
\begin{aligned}
P(\phi_x(u_x(a)) + \eta_x(a) > \tau_x)
&= P\bigl(\text{Unif}(-s_x/2, s_x/2) > \tau_x - \phi_x(u_x(a))\bigr) \\
&= P\bigl(\text{Unif}(-1/2, 1/2) > \text{distance}\bigr) \qquad \text{(where distance} := (\tau_x - \phi_x(u_x(a)))/s_x\text{)} \\
&= \begin{cases}
1, & \text{if distance} \leq -\frac{1}{2} \\
0, & \text{if distance} \geq \frac{1}{2} \\
\frac{1}{2} - \text{distance}, & \text{otherwise.}
\end{cases}
\end{aligned}
$$

Similarly, second term:

$$
\begin{aligned}
P(\phi_x(u_x(b)) + \eta_x(b) \leq \tau_x)
&= P\bigl(\text{Unif}(-1/2, 1/2) \leq (\tau_x - \phi_x(u_x(b)))/s_x\bigr) \\
&= \begin{cases}
0, & \text{if distance} \leq -\frac{1}{2} \\
1, & \text{if distance} \geq \frac{1}{2} \\
\text{distance} + \frac{1}{2}, & \text{otherwise.}
\end{cases}
\end{aligned}
$$

Given that $u_x(a) = u_x(b)$, we multiply these two terms and then double the result to account for both orderings:

$$
\begin{aligned}
2 P(\phi_x(u_x(a)) + \eta_x(a) > \tau_x) \times P(\phi_x(u_x(b)) + \eta_x(b) \leq \tau_x)
&= 2 \begin{cases}
0, & \text{if distance} \leq -\frac{1}{2} \\
0, & \text{if distance} \geq \frac{1}{2} \\
1/4 - \text{distance}^2, & \text{otherwise}
\end{cases} \\
&= \max\bigl(0, 1/2 - 2\,\text{distance}^2\bigr) \\
&= \max\bigl(0, 1/2 - 2((\tau_x - \phi_x(u_x(a)))/s_x)^2\bigr).
\end{aligned}
$$

Taking the complement of oversensitivity, $\mathrm{Spec}_{\textup{disc}} = 1 - \max\bigl(0, 1/2 - 2((\tau_x - \phi_x(u_x(a)))/s_x)^2\bigr) = \min\bigl(1, 1/2 + 2((\tau_x - \phi_x(u_x(a)))/s_x)^2\bigr)$\.
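As a sanity check on the closed form, one can simulate the uniform noise model directly\. This is an illustrative sketch under the stated assumptions \(two responses with equal utility and independent $\text{Unif}(-s_x/2, s_x/2)$ noise\); the function names and the example values of `tau`, `phi_u`, and `s` are hypothetical\.

```python
import random


def oversensitivity_closed_form(tau, phi_u, s):
    """max(0, 1/2 - 2 * distance^2) with distance = (tau - phi_u) / s."""
    d = (tau - phi_u) / s
    return max(0.0, 0.5 - 2.0 * d * d)


def specificity_closed_form(tau, phi_u, s):
    """Complement of oversensitivity."""
    return 1.0 - oversensitivity_closed_form(tau, phi_u, s)


def oversensitivity_mc(tau, phi_u, s, n=200_000, seed=0):
    """Monte Carlo estimate of P(disc(r(a)) != disc(r(b)) | u(a) = u(b))."""
    rng = random.Random(seed)
    differ = 0
    for _ in range(n):
        r_a = phi_u + rng.uniform(-s / 2, s / 2)
        r_b = phi_u + rng.uniform(-s / 2, s / 2)
        # Binary discretization: 1 if the reward exceeds the threshold tau.
        differ += (r_a > tau) != (r_b > tau)
    return differ / n


if __name__ == "__main__":
    for tau in (0.5, 0.6, 0.9, 1.2):
        print(tau,
              oversensitivity_closed_form(tau, 0.5, 1.0),
              oversensitivity_mc(tau, 0.5, 1.0))
```

Oversensitivity is largest when the threshold sits exactly on the class mean \(distance $= 0$\) and vanishes once the threshold is at least half a noise width away\.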

## Appendix C Ablation of Number of Dropout Samples

Section [3](https://arxiv.org/html/2606.21795#S3) introduces our algorithm for reward clustering, which uses MC dropout to estimate the predictive variance of a reward model\. In [Table 3](https://arxiv.org/html/2606.21795#A3.T3), we show results from ablating the number of samples taken via MC dropout for Skywork V1\. We find that the number of samples has surprisingly little impact on the intrinsic efficacy of our reward discretization method\. We observe identical patterns with the other reward models used in this paper\.

Table 3: We find the effectiveness of reward clustering via MC dropout is robust to the number of dropout samples used\. We report reward clustering performance for Skywork V1\. $\bigstar$ indicates the default setting \($T=4$\)\.

## Appendix D Discretization still improves reward models if we use a Gaussian noise model

### D\.1 A relaxed noise model: Gaussian\-distributed rewards

In Section [2\.1](https://arxiv.org/html/2606.21795#S2.SS1) we assumed bounded uniform noise $\eta_x(a) \sim \text{Unif}(-s_x/2, s_x/2)$, leading to uniformly distributed rewards\. These hard noise limits guaranteed perfect discriminative ability of the raw reward model; without them, the analysis in §[2\.2](https://arxiv.org/html/2606.21795#S2.SS2)–[2\.4](https://arxiv.org/html/2606.21795#S2.SS4) no longer applies\. We now relax this assumption to *Gaussian* noise, under which the raw reward model can no longer be perfect\. Here we assume that true utility is binary, as in §[2\.2](https://arxiv.org/html/2606.21795#S2.SS2), but these concepts readily generalize to multi\-level utility functions\.

#### Assumption\.

For a fixed prompt $x$, we model the reward as $r_x(a) = \phi_x(u_x(a)) + \eta_x(a)$, where $\phi_x(v) = s_x v + d_x$ and $\eta_x(a) \sim \mathcal{N}(0, \sigma_x^2)$\. The per\-prompt noise is homoskedastic: $\sigma_x$ and $s_x$ are shared across utility classes\. We define the per\-prompt *signal\-to\-noise ratio* as $\rho_x := s_x/\sigma_x$; this determines the reward model’s accuracy\. For convenience, we drop the subscript $x$ and use $\Phi$ and $\varphi$ for the standard Gaussian CDF and density, respectively\.

### D\.2 Discriminative ability and oversensitivity under Gaussian noise

###### Proposition D\.1\.

Under the homoskedastic Gaussian model with binary utility,

$$
D_{\textup{raw}}(\epsilon) = P\bigl(r_x(a) - r_x(b) > \epsilon \mid u_x(a) > u_x(b)\bigr) = \Phi\left(\rho_x(1-\epsilon)/\sqrt{2}\right).
$$

#### Proof\.

For any $a, b$ with $u_x(a) > u_x(b)$, we have $r_x(a) - r_x(b) = s_x + \eta_x(a) - \eta_x(b)$, where $\eta_x(b) - \eta_x(a) \sim \mathcal{N}(0, 2\sigma_x^2)$\. Thus $D_{\textup{raw}}(\epsilon) = P\bigl(\eta_x(b) - \eta_x(a) < s_x - \epsilon\bigr) = \Phi\bigl((s_x - \epsilon)/(\sigma_x\sqrt{2})\bigr)$, which equals $\Phi(\rho_x(1-\epsilon)/\sqrt{2})$ once $\epsilon$ is measured in utility units \(that is, with $\epsilon$ replaced by $\epsilon s_x$\)\.

As in [subsection 2\.4](https://arxiv.org/html/2606.21795#S2.SS4), we measure reward in *utility units* to be able to directly compare between raw and discrete rewards\. The noise within each utility class is now $\eta_x/s_x \sim \mathcal{N}(0, 1/\rho_x^2)$\.

###### Proposition D\.2\.

In utility units, for any $\epsilon \geq 0$,

$$
\mathrm{Spec}_{\textup{raw}}(\epsilon) = P\bigl(|r_x(a) - r_x(b)|/s_x < \epsilon \bigm| u_x(a) = u_x(b)\bigr) = 2\Phi(\epsilon\rho_x/\sqrt{2}) - 1.
$$

#### Proof\.

Conditional on $u_x(a) = u_x(b)$, $(r_x(a) - r_x(b))/s_x = (\eta_x(a) - \eta_x(b))/s_x \sim \mathcal{N}(0, 2/\rho_x^2)$\. Hence

$$
\begin{aligned}
P(|\mathcal{N}(0, 2/\rho_x^2)| < \epsilon)
&= P(\mathcal{N}(0, 2/\rho_x^2) < \epsilon) - P(\mathcal{N}(0, 2/\rho_x^2) < -\epsilon) \\
&= \Phi\bigl(\epsilon/\sqrt{2/\rho_x^2}\bigr) - \Phi\bigl(-\epsilon/\sqrt{2/\rho_x^2}\bigr) \\
&= \Phi\bigl(\epsilon/\sqrt{2/\rho_x^2}\bigr) - \bigl(1 - \Phi\bigl(\epsilon/\sqrt{2/\rho_x^2}\bigr)\bigr) \\
&= 2\Phi(\epsilon\rho_x/\sqrt{2}) - 1.
\end{aligned}
$$
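Both closed forms for the raw reward can be compared against simulation\. The sketch below is illustrative only: it fixes $s_x = 1$ and $d_x = 0$ \(so rewards are already in utility units and $\sigma_x = 1/\rho_x$\), and the function names are hypothetical\.

```python
import math
import random


def Phi(z):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def d_raw(rho, eps):
    """Discriminative ability of the raw reward (Proposition D.1)."""
    return Phi(rho * (1.0 - eps) / math.sqrt(2.0))


def spec_raw(rho, eps):
    """Specificity of the raw reward (Proposition D.2)."""
    return 2.0 * Phi(eps * rho / math.sqrt(2.0)) - 1.0


def simulate_raw(rho, eps, n=200_000, seed=0):
    """Monte Carlo estimates of (D_raw, Spec_raw) with s_x = 1, d_x = 0."""
    rng = random.Random(seed)
    sigma = 1.0 / rho
    wins = close = 0
    for _ in range(n):
        eta_a, eta_b = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        # Unequal utilities: u(a) = 1, u(b) = 0.
        wins += (1.0 + eta_a) - (0.0 + eta_b) > eps
        # Equal utilities: the class mean cancels in the difference.
        close += abs(eta_a - eta_b) < eps
    return wins / n, close / n


if __name__ == "__main__":
    rho, eps = 2.0, 0.3
    print(d_raw(rho, eps), spec_raw(rho, eps), simulate_raw(rho, eps))
```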

### D\.3 Discretization preserves most discriminative ability and substantially reduces oversensitivity

###### Proposition D\.3\.

For binary utility, the discriminative ability of the discretized reward, given two adjacent utility classes, is maximized at $\tau_x := (\phi_x(u_x(a)) + \phi_x(u_x(b)))/2$\. Discriminative ability under this threshold is

$$
D_{\textup{disc}}(\epsilon) = P\bigl(\textup{disc}_x(r_x(a)) > \textup{disc}_x(r_x(b)) \mid u_x(a) > u_x(b)\bigr) = \Phi(\rho_x/2)^2.
$$

#### Proof\.

Assume that $u_x(a) = 1$ and $u_x(b) = 0$\. Since the events $\{r_x(a) > \tau_x\}$ and $\{r_x(b) \leq \tau_x\}$ are independent:

$$
P\bigl(\textup{disc}_x(r_x(a)) > \textup{disc}_x(r_x(b)) \mid u_x(a) > u_x(b)\bigr) = \Phi\left(\tfrac{\phi_x(1) - \tau_x}{\sigma_x}\right) \Phi\left(\tfrac{\tau_x - \phi_x(0)}{\sigma_x}\right).
$$

We wish to find a value of $\tau_x$ that maximizes this probability\. Let us denote $z := (\tau_x - \phi_x(0))/\sigma_x$ \(this is the continuous analog of distance used in [Appendix B](https://arxiv.org/html/2606.21795#A2)\)\. Then we can rewrite $(\phi_x(1) - \tau_x)/\sigma_x$ as $\rho_x - z$\. This in turn allows us to rewrite $P\bigl(\textup{disc}_x(r_x(a)) > \textup{disc}_x(r_x(b)) \mid u_x(a) > u_x(b)\bigr)$ as $\Phi(\rho_x - z)\Phi(z)$\.

Now let $f(z) := \Phi(\rho_x - z)\Phi(z)$\. We can maximize this function by setting its derivative to zero \(note that $\varphi$ refers to the derivative of $\Phi$\):

$$
f'(z) = -\varphi(\rho_x - z)\Phi(z) + \Phi(\rho_x - z)\varphi(z) = 0 \Longleftrightarrow \tfrac{\varphi(z)}{\varphi(\rho_x - z)} = \tfrac{\Phi(z)}{\Phi(\rho_x - z)} \Longleftrightarrow \tfrac{\varphi(z)}{\Phi(z)} = \tfrac{\varphi(\rho_x - z)}{\Phi(\rho_x - z)},
$$

which holds at $z = \rho_x/2$\. Because $\Phi$ is log\-concave, $\varphi/\Phi$ is strictly decreasing, so this is the unique stationary point and it is the maximizer\. Then the optimizer is $\tau_x = \phi_x(0) + s_x/2 = (\phi_x(0) + \phi_x(1))/2$, and $D_{\textup{disc}}(\epsilon) = \Phi(\rho_x/2)^2$\.
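The location of the maximizer can also be confirmed numerically\. The following sketch uses a plain grid search over $z$ rather than any particular optimizer; the grid range and resolution are arbitrary choices for illustration\.

```python
import math


def Phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def f(z, rho):
    """Discriminative ability of the thresholded reward as a function of z."""
    return Phi(rho - z) * Phi(z)


def best_z(rho, steps=20_001):
    """Grid-search maximizer of f over [-5, rho + 5]."""
    lo, hi = -5.0, rho + 5.0
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda z: f(z, rho))


if __name__ == "__main__":
    for rho in (0.5, 1.0, 2.0, 4.0):
        z = best_z(rho)
        print(rho, z, f(z, rho), Phi(rho / 2) ** 2)
```

For each tested $\rho_x$, the grid maximizer lands within the grid resolution of $\rho_x/2$, and the maximum value matches $\Phi(\rho_x/2)^2$\.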

###### Proposition D\.4\.

At the threshold $\tau_x$ of Proposition [D\.3](https://arxiv.org/html/2606.21795#A4.Thmtheorem3), for any $0 \leq \epsilon < 1$ and binary utility:

$$
\mathrm{Spec}_{\textup{disc}}(\epsilon) = 1 - P\bigl(|\textup{disc}_x(r_x(a)) - \textup{disc}_x(r_x(b))| > \epsilon \bigm| u_x(a) = u_x(b)\bigr) = 1 - 2\Phi(\rho_x/2)\Phi(-\rho_x/2).
$$

#### Proof\.

Since $\textup{disc}_x$ is binary\-valued, for $0 \leq \epsilon < 1$ we have $|\textup{disc}_x(r_x(a)) - \textup{disc}_x(r_x(b))| > \epsilon \Longleftrightarrow \textup{disc}_x(r_x(a)) \neq \textup{disc}_x(r_x(b))$\.

Let $z_u := (\tau_x - \phi_x(u))/\sigma_x$, representing the distance from any given utility class\. Conditional on $u_x(a) = u_x(b)$, $\textup{disc}_x(r_x(a))$ and $\textup{disc}_x(r_x(b))$ are independent\. Then:

$$
P\bigl(\textup{disc}_x(r_x(a)) \neq \textup{disc}_x(r_x(b)) \mid u_x(a) = u_x(b)\bigr) = 2\bigl(1 - \Phi(z_u)\bigr)\Phi(z_u) = 2\Phi(-z_u)\Phi(z_u).
$$

Proposition [D\.3](https://arxiv.org/html/2606.21795#A4.Thmtheorem3) shows discriminative ability is maximized at $\tau_x = \phi_x(0) + s_x/2 = s_x/2 + d_x = (\phi_x(0) + \phi_x(1))/2$\. Then, the distance of this threshold to $u_x(a) = 0$ is $z_0 = \rho_x/2$ and the distance to $u_x(a) = 1$ is $z_1 = -\rho_x/2$\. Regardless of the utility class $u_x$, we then obtain the same oversensitivity: $2\Phi(\rho_x/2)\Phi(-\rho_x/2)$\. Taking the complement of oversensitivity, $\mathrm{Spec}_{\textup{disc}}(\epsilon) = 1 - 2\Phi(\rho_x/2)\Phi(-\rho_x/2)$\.
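The two discretized quantities from Propositions D\.3 and D\.4 can be checked by thresholding simulated Gaussian rewards at the midpoint\. This is a minimal sketch assuming $s_x = 1$ and $d_x = 0$, so that $\phi_x(0) = 0$, $\phi_x(1) = 1$, and $\tau_x = 1/2$; the helper names are hypothetical\.

```python
import math
import random


def Phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def d_disc(rho):
    """Discriminative ability at the midpoint threshold (Proposition D.3)."""
    return Phi(rho / 2.0) ** 2


def spec_disc(rho):
    """Specificity at the midpoint threshold (Proposition D.4)."""
    return 1.0 - 2.0 * Phi(rho / 2.0) * Phi(-rho / 2.0)


def simulate_disc(rho, n=200_000, seed=0):
    """Monte Carlo estimates of (D_disc, Spec_disc) with s_x = 1, d_x = 0."""
    rng = random.Random(seed)
    sigma, tau = 1.0 / rho, 0.5
    ordered = same = 0
    for _ in range(n):
        hi = (1.0 + rng.gauss(0.0, sigma)) > tau    # response with u = 1
        lo = (0.0 + rng.gauss(0.0, sigma)) > tau    # response with u = 0
        lo2 = (0.0 + rng.gauss(0.0, sigma)) > tau   # second response with u = 0
        ordered += hi and not lo   # disc(a) = 1 > disc(b) = 0
        same += lo == lo2          # equal utility, equal discrete reward
    return ordered / n, same / n


if __name__ == "__main__":
    rho = 2.0
    print(d_disc(rho), spec_disc(rho), simulate_disc(rho))
```

By symmetry of the midpoint threshold, running the equal\-utility check on two responses with $u = 1$ instead of $u = 0$ gives the same specificity\.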

### D\.4 The discretization tradeoff

Combining Propositions [D\.1](https://arxiv.org/html/2606.21795#A4.Thmtheorem1)–[D\.4](https://arxiv.org/html/2606.21795#A4.Thmtheorem4), discretization changes the discriminative ability from $\Phi(\rho_x(1-\epsilon)/\sqrt{2})$ to $\Phi(\rho_x/2)^2$, and changes specificity from $2\Phi(\epsilon\rho_x/\sqrt{2}) - 1$ to $1 - 2\Phi(\rho_x/2)\Phi(-\rho_x/2)$\. The discretized reward’s discriminative ability and specificity are both *constant* for $\epsilon \in [0, 1)$\. In contrast, the raw reward sees its discriminative ability $\Phi(\rho_x(1-\epsilon)/\sqrt{2})$ decline while its specificity $2\Phi(\epsilon\rho_x/\sqrt{2}) - 1$ rises from $0$ as $\epsilon$ grows\.

###### Theorem D.7.

For every signal-to-noise ratio $\rho_{x}>0$ and every tolerance $\epsilon\in[0,1/\sqrt{2}]$, $\mathrm{Spec}_{\textup{disc}}(\epsilon)>\mathrm{Spec}_{\textup{raw}}(\epsilon)$ (discretization improves the specificity of the reward model).

#### Proof.

Propositions [D.2](https://arxiv.org/html/2606.21795#A4.Thmtheorem2) and [D.4](https://arxiv.org/html/2606.21795#A4.Thmtheorem4) give $\mathrm{Spec}_{\textup{raw}}(\epsilon)=2\Phi(\epsilon\rho_{x}/\sqrt{2})-1$ and $\mathrm{Spec}_{\textup{disc}}(\epsilon)=1-2\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)$. Using $\Phi(\epsilon\rho_{x}/\sqrt{2})=1-\Phi(-\epsilon\rho_{x}/\sqrt{2})$,

$$\begin{aligned}
\mathrm{Spec}_{\textup{disc}}(\epsilon)-\mathrm{Spec}_{\textup{raw}}(\epsilon)
&=1-2\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)-2\bigl(1-\Phi(-\epsilon\rho_{x}/\sqrt{2})\bigr)+1\\
&=2\bigl[\Phi(-\epsilon\rho_{x}/\sqrt{2})-\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)\bigr].
\end{aligned}$$

Because $\rho_{x}$ is positive, $\epsilon\leq 1/\sqrt{2}\implies\epsilon/\sqrt{2}\leq 1/2\implies\Phi(-\epsilon\rho_{x}/\sqrt{2})\geq\Phi(-\rho_{x}/2)$. Because $\Phi(\rho_{x}/2)<1$ and $\Phi(-\rho_{x}/2)>0$, we have $\Phi(-\rho_{x}/2)>\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)$. Hence $\Phi(-\epsilon\rho_{x}/\sqrt{2})>\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)$, and then $\mathrm{Spec}_{\textup{disc}}(\epsilon)-\mathrm{Spec}_{\textup{raw}}(\epsilon)>0$.

This means that for every $\epsilon$ up to $1/\sqrt{2}$ (in utility units) and at every signal-to-noise ratio, discretization is strictly more specific than the raw reward. In practice, specificity improves whenever the tolerance $\epsilon$, the reward gap beyond which we declare two reward values meaningfully different, is no more than $\sim 70\%$ of the proportional difference in their true utility values. This covers most of the plausible range of $\epsilon$ that a practitioner might consider.
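As a numerical sanity check of Theorem D.7 (a sketch, not part of the paper's argument), the two closed-form specificities can be compared on a grid. The function names and the grid of $\rho_{x}$ values below are illustrative assumptions; only the formulas are taken from Propositions D.2 and D.4.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written with erfc to stay accurate in the tails."""
    return 0.5 * math.erfc(-x / SQRT2)

def spec_raw(rho: float, eps: float) -> float:
    """Raw-reward specificity, 2*Phi(eps*rho/sqrt(2)) - 1 (Proposition D.2)."""
    return 2.0 * phi_cdf(eps * rho / SQRT2) - 1.0

def spec_disc(rho: float) -> float:
    """Discretized specificity, 1 - 2*Phi(rho/2)*Phi(-rho/2) (Proposition D.4)."""
    return 1.0 - 2.0 * phi_cdf(rho / 2.0) * phi_cdf(-rho / 2.0)

def min_specificity_gain(rhos, n_eps: int = 200) -> float:
    """Smallest Spec_disc - Spec_raw on the grid, with eps in [0, 1/sqrt(2)]."""
    eps_max = 1.0 / SQRT2
    return min(
        spec_disc(rho) - spec_raw(rho, eps_max * i / n_eps)
        for rho in rhos
        for i in range(n_eps + 1)
    )

# Hypothetical grid of signal-to-noise ratios: 0.25, 0.5, ..., 6.0
rhos = [0.25 * k for k in range(1, 25)]
print("gain is positive on the grid:", min_specificity_gain(rhos) > 0.0)
```

The grid stops at $\rho_{x}=6$ on purpose: at $\epsilon=1/\sqrt{2}$ the gain equals $2\Phi(-\rho_{x}/2)^{2}$, which shrinks quickly and would eventually fall below double precision even though it stays positive in exact arithmetic.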

#### Weighing the two axes at a common tolerance.

The success condition isolates specificity. To support the stronger claim that discretization *preserves most discriminative ability* *while* improving specificity, consider the average of the two measures at a common tolerance $\epsilon$. The raw discriminative ability is $D_{\textup{raw}}(\epsilon)=\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})$ (Proposition [D.1](https://arxiv.org/html/2606.21795#A4.Thmtheorem1)); the discretized discriminative ability is $D_{\textup{disc}}(\epsilon)=\Phi(\rho_{x}/2)^{2}$ (Proposition [D.3](https://arxiv.org/html/2606.21795#A4.Thmtheorem3)). As in [Theorem 2.7](https://arxiv.org/html/2606.21795#S2.Thmtheorem7), we consider the single combined score $T_{r}(\epsilon):=(D_{r}(\epsilon)+\mathrm{Spec}_{r}(\epsilon))/2$.

###### Proposition D.8.

Consider the averaged tradeoff between specificity and discriminative ability: $T(\epsilon)=(\mathrm{Spec}(\epsilon)+D(\epsilon))/2$. The gap between this value computed for a discretized reward model and for the raw reward model is minimized at the worst-case tolerance $\epsilon=\min\{\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}},1\}$.

#### Proof.

Using $D_{\textup{disc}}(\epsilon)=\Phi(\rho_{x}/2)^{2}$ from Proposition [D.3](https://arxiv.org/html/2606.21795#A4.Thmtheorem3), $\mathrm{Spec}_{\textup{disc}}(\epsilon)=1-2\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)$ from Proposition [D.4](https://arxiv.org/html/2606.21795#A4.Thmtheorem4), and $\Phi(-\rho_{x}/2)=1-\Phi(\rho_{x}/2)$, then

$$T_{\textup{disc}}=(D_{\textup{disc}}+\mathrm{Spec}_{\textup{disc}})/2=\bigl(\Phi(\rho_{x}/2)^{2}+1-2\Phi(\rho_{x}/2)\Phi(-\rho_{x}/2)\bigr)/2=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)+\tfrac{1}{2}.$$

Using $D_{\textup{raw}}(\epsilon)=\Phi\left(\rho_{x}(1-\epsilon)/\sqrt{2}\right)$ (Proposition [D.1](https://arxiv.org/html/2606.21795#A4.Thmtheorem1)) and $\mathrm{Spec}_{\textup{raw}}(\epsilon)=2\Phi(\epsilon\rho_{x}/\sqrt{2})-1$ (Proposition [D.2](https://arxiv.org/html/2606.21795#A4.Thmtheorem2)),

$$T_{\textup{raw}}(\epsilon)=(D_{\textup{raw}}+\mathrm{Spec}_{\textup{raw}})/2=\bigl(\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})+2\Phi(\epsilon\rho_{x}/\sqrt{2})-1\bigr)/2.$$

Denote

$$T_{\Delta}(\epsilon)=T_{\textup{disc}}(\epsilon)-T_{\textup{raw}}(\epsilon)=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)+\tfrac{1}{2}-\bigl(\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})+2\Phi(\epsilon\rho_{x}/\sqrt{2})-1\bigr)/2.$$

First, we can ask where $T_{\Delta}$ is minimal. Setting the derivative to zero:

$$\begin{aligned}
T'_{\Delta}(\epsilon)&=\tfrac{d}{d\epsilon}\Bigl(\bigl(-\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})-2\Phi(\epsilon\rho_{x}/\sqrt{2})+1\bigr)/2\Bigr)=0\\
&\Longleftrightarrow\tfrac{1}{2}\bigl(\tfrac{\rho_{x}}{\sqrt{2}}\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})-2\tfrac{\rho_{x}}{\sqrt{2}}\varphi(\epsilon\rho_{x}/\sqrt{2})\bigr)=0\\
&\Longleftrightarrow\tfrac{\rho_{x}}{2\sqrt{2}}\bigl(\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})-2\varphi(\epsilon\rho_{x}/\sqrt{2})\bigr)=0\\
&\implies 2\varphi(\epsilon\rho_{x}/\sqrt{2})=\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})\\
&\implies 2\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}(\epsilon\rho_{x}/\sqrt{2})^{2}\bigr)=\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}(\rho_{x}(1-\epsilon)/\sqrt{2})^{2}\bigr).
\end{aligned}$$

(The $\epsilon$-independent terms of $T_{\textup{disc}}$ drop out of the derivative.) Taking the logarithm of both sides, we find a unique stationary point:

$$\begin{aligned}
&\implies\log 2-\log(\sqrt{2\pi})-\tfrac{1}{2}(\epsilon\rho_{x}/\sqrt{2})^{2}=-\log(\sqrt{2\pi})-\tfrac{1}{2}(\rho_{x}(1-\epsilon)/\sqrt{2})^{2}\\
&\implies\tfrac{1}{2}(\epsilon\rho_{x}/\sqrt{2})^{2}-\tfrac{1}{2}(\rho_{x}(1-\epsilon)/\sqrt{2})^{2}=\log 2\\
&\implies\tfrac{1}{4}(2\epsilon-1)\rho_{x}^{2}=\log 2\implies\epsilon^{*}=\tfrac{2\log 2}{\rho_{x}^{2}}+\tfrac{1}{2}.
\end{aligned}$$

The gradient is zero at $\epsilon^{*}$. If $\epsilon^{*}\in[0,1)$, then we have found an optimum on $[0,1)$. Since this may not be the case (e.g. $\rho_{x}<2\sqrt{\log 2}$), we consider the sign of the gradient:

$$\begin{aligned}
\operatorname{sgn}(T'_{\Delta}(\epsilon))&=\operatorname{sgn}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}\bigl(\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})-2\varphi(\epsilon\rho_{x}/\sqrt{2})\bigr)\Bigr)\\
&=\operatorname{sgn}\bigl(\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})-2\varphi(\epsilon\rho_{x}/\sqrt{2})\bigr).
\end{aligned}$$

Since both terms are positive and the logarithm is increasing, the sign is preserved under the logarithm:

$$\begin{aligned}
\operatorname{sgn}(T'_{\Delta}(\epsilon))&=\operatorname{sgn}\bigl(\log\varphi(\rho_{x}(1-\epsilon)/\sqrt{2})-\log(2\varphi(\epsilon\rho_{x}/\sqrt{2}))\bigr)\\
&=\operatorname{sgn}\Bigl[\log\bigl(\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}(\rho_{x}(1-\epsilon)/\sqrt{2})^{2}\bigr)\bigr)-\log\bigl(\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-\tfrac{1}{2}(\epsilon\rho_{x}/\sqrt{2})^{2}\bigr)\bigr)-\log 2\Bigr]\\
&=\operatorname{sgn}\bigl(-\tfrac{1}{2}(\rho_{x}(1-\epsilon)/\sqrt{2})^{2}-(-\tfrac{1}{2}(\epsilon\rho_{x}/\sqrt{2})^{2})-\log 2\bigr)\\
&=\operatorname{sgn}\bigl(\tfrac{1}{4}\rho_{x}^{2}(2\epsilon-1)-\log 2\bigr).
\end{aligned}$$

The gradient then changes sign at $\epsilon=\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}$; the gradient is negative for $\epsilon<\epsilon^{*}$ and positive for $\epsilon>\epsilon^{*}$.
Therefore, $\epsilon^{*}$ is the global minimizer of $T_{\Delta}$, and if $\epsilon^{*}\geq 1$, then $T_{\Delta}$ is decreasing on $[0,1)$ and $T_{\Delta}(1)$ is its infimum over $[0,1)$. As we reduce $\epsilon\to 0$, the gap between discretized and raw rewards grows further.
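The closed-form minimizer can be cross-checked against a brute-force search. The sketch below is illustrative only: the function names and the grid resolution are assumptions, and the grid includes the endpoint $\epsilon=1$ so that the clamped case is visible. It evaluates $T_{\Delta}$ on a uniform grid of tolerances and compares the grid argmin with $\min\{\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}},1\}$.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / SQRT2)

def t_disc(rho: float) -> float:
    """Combined score of the discretized reward (constant in eps)."""
    p = phi_cdf(rho / 2.0)
    return 1.5 * p * p - p + 0.5

def t_raw(rho: float, eps: float) -> float:
    """Combined score of the raw reward at tolerance eps."""
    d_raw = phi_cdf(rho * (1.0 - eps) / SQRT2)
    s_raw = 2.0 * phi_cdf(eps * rho / SQRT2) - 1.0
    return (d_raw + s_raw) / 2.0

def t_gap(rho: float, eps: float) -> float:
    """T_Delta(eps) = T_disc - T_raw."""
    return t_disc(rho) - t_raw(rho, eps)

def worst_case_eps(rho: float) -> float:
    """Closed-form worst-case tolerance from Proposition D.8."""
    return min(0.5 + 2.0 * math.log(2.0) / rho**2, 1.0)

def grid_argmin(rho: float, n: int = 1000) -> float:
    """Brute-force minimizer of T_Delta over the grid {0, 1/n, ..., 1}."""
    grid = [i / n for i in range(n + 1)]
    return min(grid, key=lambda e: t_gap(rho, e))

for rho in (1.0, 3.0):  # one clamped case, one interior case
    print(rho, worst_case_eps(rho), grid_argmin(rho))
```

For $\rho_{x}=1$ the stationary point lies beyond 1, so both values clamp to the boundary; for $\rho_{x}=3$ the grid argmin should agree with the closed form up to the grid spacing.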

###### Theorem D.9.

The specificity gain exceeds the discriminative cost: for every $\rho_{x}>0$ and every tolerance $\epsilon\in[0,1)$, $T_{\textup{disc}}(\epsilon)-T_{\textup{raw}}(\epsilon)>0$. As $\rho_{x}\to\infty$, the net benefit approaches zero: when the raw reward itself approaches perfection, discretization is unnecessary.

#### Proof.

Proposition [D.8](https://arxiv.org/html/2606.21795#A4.Thmtheorem8) shows that $T_{\textup{disc}}(\epsilon)-T_{\textup{raw}}(\epsilon)$ is minimized at $\epsilon=\min(1,\epsilon^{*})$ where $\epsilon^{*}=\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}$. If we can show that $T_{\Delta}\bigl(\min\{1,\epsilon^{*}\}\bigr)>0$ for every $\rho_{x}>0$, then $T_{\Delta}$ is positive for all values of $\epsilon\in[0,1)$.

Denote the worst-case value by $\Delta(\rho_{x}):=T_{\Delta}\bigl(\min\{1,\epsilon^{*}\}\bigr)$. We will divide our analysis by ranges of $\rho_{x}$. Our three cases are $0<\rho_{x}\leq 2\sqrt{\log 2}$, $2\sqrt{\log 2}<\rho_{x}<4.2$, and $\rho_{x}\geq 4.2$.

#### Case 1: $0<\rho_{x}\leq 2\sqrt{\log 2}$.

Then $\epsilon^{*}\geq 1$, so $T_{\Delta}$ is strictly decreasing on all of $[0,1)$: $T_{\Delta}(\epsilon)>T_{\Delta}(1)$. Since $T_{\Delta}$ is continuous at 1, it is therefore sufficient to show that $T_{\Delta}(1)\geq 0$. In this case $\Delta(\rho_{x})=T_{\Delta}(1)$, with

$$T_{\Delta}(1)=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)-\Phi(\rho_{x}/\sqrt{2})+\tfrac{3}{4}.$$

Differentiating with respect to $\rho_{x}$, and using $\varphi(\rho_{x}/\sqrt{2})=\varphi(\rho_{x}/2)e^{-\rho_{x}^{2}/8}$,

$$\begin{aligned}
\Delta'(\rho_{x})&=\tfrac{3}{2}\Phi(\tfrac{\rho_{x}}{2})\varphi(\tfrac{\rho_{x}}{2})-\tfrac{1}{2}\varphi(\tfrac{\rho_{x}}{2})-\tfrac{\sqrt{2}}{2}\varphi(\tfrac{\rho_{x}}{\sqrt{2}})\\
&=\tfrac{3}{2}\Phi(\tfrac{\rho_{x}}{2})\varphi(\tfrac{\rho_{x}}{2})-\tfrac{1}{2}\varphi(\tfrac{\rho_{x}}{2})-\tfrac{\sqrt{2}}{2}\,\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-(\rho_{x}/\sqrt{2})^{2}/2\bigr)\\
&=\tfrac{3}{2}\Phi(\tfrac{\rho_{x}}{2})\varphi(\tfrac{\rho_{x}}{2})-\tfrac{1}{2}\varphi(\tfrac{\rho_{x}}{2})-\tfrac{\sqrt{2}}{2}\,\tfrac{1}{\sqrt{2\pi}}\exp\bigl(-(\rho_{x}/2)^{2}/2\bigr)\exp\bigl(-\rho_{x}^{2}/8\bigr)\\
&=\tfrac{3}{2}\Phi(\tfrac{\rho_{x}}{2})\varphi(\tfrac{\rho_{x}}{2})-\tfrac{1}{2}\varphi(\tfrac{\rho_{x}}{2})-\tfrac{\sqrt{2}}{2}\varphi(\tfrac{\rho_{x}}{2})\exp\bigl(-\rho_{x}^{2}/8\bigr)\\
&=\tfrac{1}{2}\varphi(\tfrac{\rho_{x}}{2})\bigl(3\Phi(\tfrac{\rho_{x}}{2})-1-\sqrt{2}\exp(-\rho_{x}^{2}/8)\bigr).
\end{aligned}$$

The factor $\tfrac{1}{2}\varphi(\rho_{x}/2)$ is always positive.
The other factor $A(\rho_{x})=3\Phi(\rho_{x}/2)-1-\sqrt{2}e^{-\rho_{x}^{2}/8}$ is strictly increasing in $\rho_{x}$: $\tfrac{d}{d\rho_{x}}\bigl(A(\rho_{x})\bigr)=\tfrac{3}{2}\varphi(\rho_{x}/2)+\tfrac{\sqrt{2}}{4}\rho_{x}e^{-\rho_{x}^{2}/8}>0$ for $\rho_{x}>0$. We know $A(0)=\tfrac{1}{2}-\sqrt{2}<0$ and $A(2\sqrt{\log 2})=3\Phi(\sqrt{\log 2})-2>0$, so its zero must lie between these two arguments of $A$. Numerically, the zero (and, therefore, the minimizer of $\Delta(\rho_{x})$) lies between $\rho_{x}=1.204$ and $\rho_{x}=1.205$. At both endpoints, $\Delta(\rho_{x})>0.012$, and on this interval $\tfrac{1}{2}\varphi(\rho_{x}/2)<0.1665$ while $-0.0006<A(\rho_{x})<0.0003$. A lower bound on $\Delta'(\rho_{x})$ (the most negative possible slope) is therefore $-10^{-4}$. Then, by the Mean Value Theorem, the minimum value of $\Delta(\rho_{x})$ must be at least $0.012-(1.205-1.204)\cdot 10^{-4}$, so $\Delta(\rho_{x})$ is always positive with a minimum value of at least 0.0119999.
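The numerical step in Case 1 can be reproduced with a few lines of standard-library Python. This is a rough sketch in ordinary floating point rather than a verified computation; `a_factor`, `delta_case1`, and `bisect_root` are hypothetical names, and bisection is simply one convenient root finder that works because $A$ is increasing with a sign change on the bracket.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / SQRT2)

def a_factor(rho: float) -> float:
    """A(rho) = 3*Phi(rho/2) - 1 - sqrt(2)*exp(-rho^2/8); its sign is the sign of Delta'."""
    return 3.0 * phi_cdf(rho / 2.0) - 1.0 - SQRT2 * math.exp(-rho**2 / 8.0)

def delta_case1(rho: float) -> float:
    """Delta(rho) = T_Delta(1), valid for 0 < rho <= 2*sqrt(log 2)."""
    p = phi_cdf(rho / 2.0)
    return 1.5 * p * p - p - phi_cdf(rho / SQRT2) + 0.75

def bisect_root(f, lo: float, hi: float, iters: int = 60) -> float:
    """Bisection for an increasing f with f(lo) < 0 < f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho_upper = 2.0 * math.sqrt(math.log(2.0))  # right end of Case 1
rho_min = bisect_root(a_factor, 0.0, rho_upper)
print("minimizer of Delta:", rho_min)
print("Delta at minimizer:", delta_case1(rho_min))
```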

#### Case 2: $\rho_{x}\geq 4.2$.

The constant 4.2 is chosen for reasons explained later in this section. Here $\epsilon^{*}<1$ is the minimizer, so with $\epsilon=\epsilon^{*}$,

$$\Delta(\rho_{x})=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)+\tfrac{1}{2}-\bigl(\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})+2\Phi(\epsilon\rho_{x}/\sqrt{2})-1\bigr)/2.$$

First, consider the limit as $\rho_{x}\to\infty$. Since $\lim_{x\to\infty}\Phi(x)=1$ and $\epsilon^{*}\to\tfrac{1}{2}$, every argument of $\Phi$ above tends to infinity, and it is easy to see that $\lim_{\rho_{x}\to\infty}\Delta(\rho_{x})=\tfrac{3}{2}-1+\tfrac{1}{2}-1=0$. Now we consider the finite case. For convenience, we will use the complement of the Gaussian cumulative distribution function $\overline{\Phi}:=1-\Phi$,

$$\begin{aligned}
\Delta(\rho_{x})&=\tfrac{3}{2}(1-\overline{\Phi}(\rho_{x}/2))^{2}-(1-\overline{\Phi}(\rho_{x}/2))+\tfrac{1}{2}-\bigl((1-\overline{\Phi}(\rho_{x}(1-\epsilon)/\sqrt{2}))+2(1-\overline{\Phi}(\epsilon\rho_{x}/\sqrt{2}))-1\bigr)/2\\
&=\tfrac{3}{2}\overline{\Phi}(\rho_{x}/2)^{2}-3\overline{\Phi}(\rho_{x}/2)+\tfrac{3}{2}-(1-\overline{\Phi}(\rho_{x}/2))+\tfrac{1}{2}-\bigl(2-\overline{\Phi}(\rho_{x}(1-\epsilon)/\sqrt{2})-2\overline{\Phi}(\epsilon\rho_{x}/\sqrt{2})\bigr)/2\\
&=\tfrac{3}{2}\overline{\Phi}(\rho_{x}/2)^{2}-2\overline{\Phi}(\rho_{x}/2)+\tfrac{1}{2}\overline{\Phi}(\rho_{x}(1-\epsilon)/\sqrt{2})+\overline{\Phi}(\epsilon\rho_{x}/\sqrt{2}).
\end{aligned}$$

Substituting $\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}$ for $\epsilon$:

$$\Delta(\rho_{x})=\tfrac{3}{2}\overline{\Phi}(\rho_{x}/2)^{2}-2\overline{\Phi}(\rho_{x}/2)+\tfrac{1}{2}\overline{\Phi}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr)+\overline{\Phi}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}+\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr).$$

We can eliminate the two nonnegative terms $\overline{\Phi}(\tfrac{\rho_{x}}{2\sqrt{2}}+\tfrac{\sqrt{2}\log 2}{\rho_{x}})$ and $\tfrac{3}{2}\overline{\Phi}(\rho_{x}/2)^{2}$:

$$\Delta(\rho_{x})\geq\tfrac{1}{2}\overline{\Phi}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr)-2\overline{\Phi}(\rho_{x}/2).$$

Then we need to show that $\tfrac{1}{2}\overline{\Phi}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr)-2\overline{\Phi}(\rho_{x}/2)>0$, which is the same as

$$\overline{\Phi}\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)/\overline{\Phi}(\rho_{x}/2)>4.$$

Recall that the Mills ratio for the normal distribution satisfies $t/(t^{2}+1)<\overline{\Phi}(t)/\varphi(t)<1/t$ for $t>0$; both arguments here are positive because $\rho_{x}>2\sqrt{\log 2}$. We can apply the lower bound to the numerator and the upper bound to the denominator:

$$\overline{\Phi}\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr)>\frac{\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}}{\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)^{2}+1}\varphi\Bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\Bigr)\quad\text{and}\quad\overline{\Phi}(\tfrac{\rho_{x}}{2})<\tfrac{2}{\rho_{x}}\varphi(\rho_{x}/2).$$

Then, for the quantity that we want to show is greater than 4, we obtain a new lower bound:

$$\begin{aligned}
\frac{\overline{\Phi}\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{\overline{\Phi}(\rho_{x}/2)}
&>\frac{\rho_{x}\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{2\bigl[\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)^{2}+1\bigr]}\cdot\frac{\varphi\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{\varphi(\rho_{x}/2)}\\
&=\frac{\tfrac{\rho_{x}^{2}}{2\sqrt{2}}-\sqrt{2}\log 2}{2\bigl[\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)^{2}+1\bigr]}\cdot\frac{\varphi\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{\varphi(\rho_{x}/2)}\\
&=\frac{\sqrt{2}(1-\tfrac{4\log 2}{\rho_{x}^{2}})}{1+\tfrac{16\log^{2}2}{\rho_{x}^{4}}+\tfrac{8-8\log 2}{\rho_{x}^{2}}}\cdot\frac{\varphi\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{\varphi(\rho_{x}/2)}.
\end{aligned}$$

Using $\varphi(t)=\tfrac{1}{\sqrt{2\pi}}e^{-t^{2}/2}$, we can simplify the second factor here:

$$\begin{aligned}
\frac{\varphi\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)}{\varphi(\rho_{x}/2)}
&=\exp\Bigl(\tfrac{1}{2}\Bigl[(\rho_{x}/2)^{2}-\bigl(\tfrac{\rho_{x}}{2\sqrt{2}}-\tfrac{\sqrt{2}\log 2}{\rho_{x}}\bigr)^{2}\Bigr]\Bigr)\\
&=\exp\Bigl(\tfrac{1}{2}\Bigl[\tfrac{\rho_{x}^{2}}{4}-\tfrac{\rho_{x}^{2}}{8}-\tfrac{2\log^{2}2}{\rho_{x}^{2}}+\log 2\Bigr]\Bigr)\\
&=\exp\Bigl(\tfrac{1}{2}\tfrac{\rho_{x}^{2}}{8}-\tfrac{1}{2}\tfrac{2\log^{2}2}{\rho_{x}^{2}}+\log 2^{\tfrac{1}{2}}\Bigr)=\sqrt{2}\exp\Bigl(\tfrac{\rho_{x}^{2}}{16}-\tfrac{\log^{2}2}{\rho_{x}^{2}}\Bigr).
\end{aligned}$$

Now our lower bound is

$$\frac{\sqrt{2}(1-\tfrac{4\log 2}{\rho_{x}^{2}})}{1+\tfrac{16\log^{2}2}{\rho_{x}^{4}}+\tfrac{8-8\log 2}{\rho_{x}^{2}}}\cdot\sqrt{2}\exp\Bigl(\tfrac{\rho_{x}^{2}}{16}-\tfrac{\log^{2}2}{\rho_{x}^{2}}\Bigr).$$

Both factors are positive and increase as $\rho_{x}$ goes from $2\sqrt{\log 2}$ to $\infty$. By direct evaluation, we can easily find values of $\rho_{x}$ where this lower bound exceeds 4. One such value is $\rho_{x}=4.2$, where the lower bound is $4.24$. Therefore, we know $\Delta(\rho_{x})>0$ for all $\rho_{x}\geq 4.2$, and we will use this $\rho_{x}=4.2$ as the floor of this case.
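The evaluation at $\rho_{x}=4.2$ is easy to reproduce. The sketch below uses only Python's standard library; the function names are illustrative assumptions, and the comparison with the exact tail ratio is an extra sanity check rather than a step in the proof.

```python
import math

SQRT2 = math.sqrt(2.0)
LOG2 = math.log(2.0)

def norm_sf(x: float) -> float:
    """Upper tail of the standard normal, 1 - Phi(x)."""
    return 0.5 * math.erfc(x / SQRT2)

def tail_ratio(rho: float) -> float:
    """Exact ratio of the two upper tails that must exceed 4."""
    t = rho / (2.0 * SQRT2) - SQRT2 * LOG2 / rho
    return norm_sf(t) / norm_sf(rho / 2.0)

def mills_lower_bound(rho: float) -> float:
    """Closed-form lower bound on tail_ratio obtained from the Mills ratio bounds."""
    r2 = rho * rho
    rational = SQRT2 * (1.0 - 4.0 * LOG2 / r2) / (
        1.0 + 16.0 * LOG2**2 / r2**2 + (8.0 - 8.0 * LOG2) / r2
    )
    exponential = SQRT2 * math.exp(r2 / 16.0 - LOG2**2 / r2)
    return rational * exponential

print("bound at 4.2:", round(mills_lower_bound(4.2), 2))
print("exact ratio at 4.2:", tail_ratio(4.2))
```

Because the bound is increasing in $\rho_{x}$, checking it once at the floor of the case is enough; the exact ratio is noticeably larger than the bound, which is why the bound is described as loose.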

Therefore, we must consider one final case, where $2\sqrt{\log 2}<\rho_{x}<4.2$. Our (loose) lower bound is not strong enough on this range. Since this is a bounded subset of $\mathbb{R}$, we can use interval arithmetic (Moore, [1966](https://arxiv.org/html/2606.21795#bib.bib26)).

#### Case 3: $2\sqrt{\log 2}<\rho_{x}<4.2$.

We will substitute $\epsilon=\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}$ into our definition of $\Delta(\rho_{x})$:

$$\begin{aligned}
\Delta(\rho_{x})&=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)+\tfrac{1}{2}-\bigl(\Phi(\rho_{x}(1-\epsilon)/\sqrt{2})+2\Phi(\epsilon\rho_{x}/\sqrt{2})-1\bigr)/2\\
&=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)+\tfrac{1}{2}-\tfrac{1}{2}\Phi\bigl(\rho_{x}(1-(\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}))/\sqrt{2}\bigr)-\Phi\bigl((\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}})\rho_{x}/\sqrt{2}\bigr)+\tfrac{1}{2}\\
&=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)-\tfrac{1}{2}\Phi\bigl(\rho_{x}(1-(\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}}))/\sqrt{2}\bigr)-\Phi\bigl((\tfrac{1}{2}+\tfrac{2\log 2}{\rho_{x}^{2}})\rho_{x}/\sqrt{2}\bigr)+1\\
&=\tfrac{3}{2}\Phi(\rho_{x}/2)^{2}-\Phi(\rho_{x}/2)-\tfrac{1}{2}\Phi\bigl((\tfrac{\rho_{x}}{2}-\tfrac{2\log 2}{\rho_{x}})/\sqrt{2}\bigr)-\Phi\bigl((\tfrac{\rho_{x}}{2}+\tfrac{2\log 2}{\rho_{x}})/\sqrt{2}\bigr)+1\\
&=\underbrace{\tfrac{3}{2}\bigl(\Phi(\rho_{x}/2)-\tfrac{1}{3}\bigr)^{2}}_{(t_{1})}\;\underbrace{-\tfrac{1}{2}\Phi\bigl((\tfrac{\rho_{x}}{2}-\tfrac{2\log 2}{\rho_{x}})/\sqrt{2}\bigr)}_{(t_{2})}\;\underbrace{-\Phi\bigl((\tfrac{\rho_{x}}{2}+\tfrac{2\log 2}{\rho_{x}})/\sqrt{2}\bigr)+\tfrac{5}{6}}_{(t_{3})}.
\end{aligned}$$

As $\rho_{x}$ increases over this range, $t_{1}$ rises, while $t_{2}$ and $t_{3}$ fall. For any interval $[a,b]$ with $a<b$, $a$ minimizes $t_{1}$ while $b$ minimizes $t_{2}$ and $t_{3}$. Then

$$\min_{\rho\in[a,b]}\bigl(t_{1}(\rho)+t_{2}(\rho)+t_{3}(\rho)\bigr)>\min_{\rho\in[a,b]}t_{1}(\rho)+\min_{\rho\in[a,b]}t_{2}(\rho)+\min_{\rho\in[a,b]}t_{3}(\rho)=t_{1}(a)+t_{2}(b)+t_{3}(b).$$

Then, we can evaluate $\rho_{x}$ at two separate values, the left endpoint $\rho_{a}=a$ in $t_{1}$ and the right endpoint $\rho_{b}=b$ in $t_{2}$ and $t_{3}$, to define a strict lower bound for $\Delta(\rho_{x})$ on $[a,b]$:

$$\tfrac{3}{2}\bigl(\Phi(\underline{\mathbf{\rho_{a}}}/2)-\tfrac{1}{3}\bigr)^{2}-\tfrac{1}{2}\Phi\bigl((\underline{\mathbf{\rho_{b}}}/2-2\log 2/\underline{\mathbf{\rho_{b}}})/\sqrt{2}\bigr)-\Phi\bigl((\underline{\mathbf{\rho_{b}}}/2+2\log 2/\underline{\mathbf{\rho_{b}}})/\sqrt{2}\bigr)+\tfrac{5}{6}.$$

We can then find a covering of $[2\sqrt{\log 2},4.2)$ by intervals, each with a positive lower bound.

Aggregating over all these intervals, we have $\min_{\rho\in[2\sqrt{\log 2},4.2)}\Delta(\rho)>0.0053>0$, so $T_{\Delta}\bigl(\min\{1,\epsilon^{*}\}\bigr)>0$ over $\rho\in(2\sqrt{\log 2},4.2)$. Earlier, we showed that $T_{\Delta}\bigl(\min\{1,\epsilon^{*}\}\bigr)>0$ over $\rho\in(0,2\sqrt{\log 2}]$ and $\rho\in[4.2,\infty)$.

Putting this all together, $\Delta(\rho)=T_{\Delta}\bigl(\min\{1,\epsilon^{*}\}\bigr)$ is guaranteed to be positive over all $\rho\in(0,\infty)$. Therefore, $T_{\textup{disc}}>T_{\textup{raw}}$ for all $\rho_{x}>0$ and all $\epsilon\in[0,1)$.
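The interval argument of Case 3 can be sketched in code. This is an illustration only: it uses a uniform grid of 300 subintervals (an assumption, not the covering used in the paper) and ordinary floating point rather than rigorous interval arithmetic with outward rounding, so it indicates how the bound behaves but is not itself a proof. The names `t1`, `t2`, `t3` follow the decomposition $\Delta=t_{1}+t_{2}+t_{3}$, with the minus signs folded into $t_{2}$ and $t_{3}$.

```python
import math

SQRT2 = math.sqrt(2.0)
LOG2 = math.log(2.0)

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / SQRT2)

def t1(rho: float) -> float:
    """Increasing term: (3/2) * (Phi(rho/2) - 1/3)^2."""
    return 1.5 * (phi_cdf(rho / 2.0) - 1.0 / 3.0) ** 2

def t2(rho: float) -> float:
    """Decreasing term: -(1/2) * Phi((rho/2 - 2 log 2 / rho) / sqrt(2))."""
    return -0.5 * phi_cdf((rho / 2.0 - 2.0 * LOG2 / rho) / SQRT2)

def t3(rho: float) -> float:
    """Decreasing term on this range: -Phi((rho/2 + 2 log 2 / rho) / sqrt(2)) + 5/6."""
    return -phi_cdf((rho / 2.0 + 2.0 * LOG2 / rho) / SQRT2) + 5.0 / 6.0

def delta(rho: float) -> float:
    """Worst-case gap Delta(rho) for 2*sqrt(log 2) < rho < 4.2."""
    return t1(rho) + t2(rho) + t3(rho)

def interval_lower_bound(a: float, b: float) -> float:
    """Lower bound of Delta on [a, b]: t1 at the left end, t2 and t3 at the right end."""
    return t1(a) + t2(b) + t3(b)

def covering_lower_bound(lo: float, hi: float, n: int) -> float:
    """Smallest interval bound over n equal-width subintervals covering [lo, hi]."""
    width = (hi - lo) / n
    return min(
        interval_lower_bound(lo + i * width, lo + (i + 1) * width)
        for i in range(n)
    )

bound = covering_lower_bound(2.0 * math.sqrt(LOG2), 4.2, 300)
print("lower bound over the covering:", bound)
```

Finer grids tighten the bound towards the true minimum of $\Delta$ on the range, while coarser ones loosen it; any covering whose bounds are all positive suffices for the argument.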
