PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration

arXiv cs.LG 06/29/26, 04:00 AM Papers
rlhf reward-model calibration empirical-bayes pluralistic-alignment annotator-calibration
Summary
Introduces PEBS, a per-rater empirical-Bayes shrinkage estimator for calibrating reward models in RLHF, reducing within-user RMSE by over 8.5% on PRISM and over 9.6% on PluriHarms.
arXiv:2606.27578v1 Announce Type: new Abstract: Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slopes into a single average-rater fit that does not match any individual annotator. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits per-rater affine calibrators on a held-out slice of each annotator's ratings and applies Morris-James-Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by 8.58% over the pooled population-slope baseline. The procedure replicates on PluriHarms harm ratings (Qwen-2.5 base, in-family) with a +9.66% RMSE reduction over the same population-slope baseline. PEBS is a closed-form post-hoc estimator for annotator-specific affine calibration in RLHF reward modeling; it leaves the reward base model unchanged and estimates only the rater-level map used at inference time for new ratings.
# Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
Source: [https://arxiv.org/html/2606.27578](https://arxiv.org/html/2606.27578)
###### Abstract

Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slopes into a single average-rater fit that does not match any individual annotator. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits per-rater affine calibrators on a held-out slice of each annotator's ratings and applies Morris–James–Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by **8.58%** over the pooled population-slope baseline. The procedure replicates on PluriHarms harm ratings (Qwen-2.5 base, in-family) with a **+9.66%** RMSE reduction over the same population-slope baseline. PEBS is a closed-form post-hoc estimator for annotator-specific affine calibration in RLHF reward modeling; it leaves the reward base model unchanged and estimates only the rater-level map used at inference time for new ratings.

RLHF, pluralistic alignment, empirical Bayes, per\-annotator calibration, reward modeling

## 1 Introduction and Related Work

![Refer to caption](https://arxiv.org/html/2606.27578v1/x1.png)

Figure 1: The Phi-3-medium-14B in-family case falls within $\pm 5$ pp of the single-seed anchor (+43.23%, shaded band); the Qwen-2.5 row replicates in-family on a different metric (HelpSteer2 pooled-RMSE, +18.24%). Forest plot of point estimates with 95% row-cluster bootstrap confidence intervals, grouped by base-model family. The three Llama-family-dense bases are shown second as scope characterization: on a coherence head they split into two negative outcomes and one wide-CI null. A verbosity-only retrained head recovers a positive gain on the same bases, pointing to a coherence-head/dense-architecture interaction rather than an attribute-agnostic verbosity bias; calibration diagnostics are in Appendix [B](https://arxiv.org/html/2606.27578#A2).

Reinforcement Learning from Human Feedback (Christiano et al., [2017](https://arxiv.org/html/2606.27578#bib.bib6); Stiennon et al., [2020](https://arxiv.org/html/2606.27578#bib.bib42); Ouyang et al., [2022](https://arxiv.org/html/2606.27578#bib.bib31)) assumes a Bradley–Terry (Bradley & Terry, [1952](https://arxiv.org/html/2606.27578#bib.bib3)) pairwise-preference model: preferences from many annotators are pooled into one likelihood, a scalar reward $r_{\phi}$ is fit, and the result is used for either proximal-policy-optimization (PPO)-style RLHF or DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.27578#bib.bib34)). (Footnote 1: code and fitted calibrators are at [https://github.com/deadsmash07/pebs-pluralistic](https://github.com/deadsmash07/pebs-pluralistic).) The standard pooled-likelihood objective drops the annotator index $j$ from this aggregation, which collapses raters with systematically different rating-scale calibrations into a single global affine fit and confounds calibration heterogeneity with reward signal.
Figure[1](https://arxiv.org/html/2606.27578#S1.F1)previews the base\-family transfer summary: the procedure replicates on the Qwen\-2\.5 and Phi\-3 reference rows, turns negative on two of three Llama\-family\-dense bases when trained on a coherence head, and recovers a positive gain on those same bases under a verbosity\-only run that points to the coherence\-head / dense\-architecture interaction \(§[3\.5](https://arxiv.org/html/2606.27578#S3.SS5)\)\.

Different annotators use the 0–100 score scale heterogeneously. Some compress the scale, some stretch it, and some differ in baseline. Pooling such observations naively yields a reward model (RM) that fits the *average* rater, a fit that does not correspond to any individual annotator. Several lines of work make the measurement-validity problem explicit: Ghafouri et al. ([2026](https://arxiv.org/html/2606.27578#bib.bib16)) argue that RLHF preference measurement needs social-science diagnostics, and Ma et al. ([2026](https://arxiv.org/html/2606.27578#bib.bib27)) report that frontier RMs peak at 75.9% on *their* per-user preference benchmark. Rezk et al. ([2025](https://arxiv.org/html/2606.27578#bib.bib36)) measure rank-correlation $\tau = 0.08$–$0.31$ (Kendall) between upstream RM pair-accuracy and downstream policy accuracy on Pref-LaMP, a personalised-preference benchmark. Together these indicate that a single global RM degrades per-annotator accuracy even when its aggregate accuracy is high.

Partial pooling is the classical fix, and per-annotator effect modeling is the psychometric mainline outside RLHF. The Rasch model (Rasch, [1960](https://arxiv.org/html/2606.27578#bib.bib35)) and classical Item-Response Theory (Baker, [2001](https://arxiv.org/html/2606.27578#bib.bib1)) parametrize per-rater difficulty and discrimination. Dawid & Skene ([1979](https://arxiv.org/html/2606.27578#bib.bib9)) gave the canonical rater-effect mixture predating modern crowdsourcing, and Paun et al. ([2018](https://arxiv.org/html/2606.27578#bib.bib32)) benchmark hierarchical Bayesian rater models on NLP annotation, establishing partial pooling as the dominant paradigm. In regression-style data analysis, the textbook estimators are the Morris/James–Stein empirical-Bayes (EB) shrinkage (Robbins, [1956](https://arxiv.org/html/2606.27578#bib.bib37); Morris, [1983](https://arxiv.org/html/2606.27578#bib.bib30)) and the Best Linear Unbiased Predictor (BLUP) (Henderson, [1975](https://arxiv.org/html/2606.27578#bib.bib18)), the canonical EB estimator from linear-mixed-model theory, and the blending weight $\omega = \tau^{2}/(\tau^{2} + V)$ is standard in hierarchical modeling (Gelman & Hill, [2007](https://arxiv.org/html/2606.27578#bib.bib15); Pinheiro & Bates, [2000](https://arxiv.org/html/2606.27578#bib.bib33)).

#### Per\-user reward modeling in RLHF\.

The pluralistic-alignment programme outlined by Sorensen et al. ([2024b](https://arxiv.org/html/2606.27578#bib.bib41)) distinguishes Overton, steerable, and distributional axes (with Bakker et al. ([2022](https://arxiv.org/html/2606.27578#bib.bib2)) establishing direct upstream evidence on language-model fine-tuning toward per-annotator agreement); Conitzer et al. ([2024](https://arxiv.org/html/2606.27578#bib.bib7)) argue that aggregating diverging human feedback is a social-choice problem; benchmarks and datasets in this line include Castricato et al. ([2025](https://arxiv.org/html/2606.27578#bib.bib5)) (PERSONA, persona-conditioned preferences) and Zhang et al. ([2025](https://arxiv.org/html/2606.27578#bib.bib47)) (Community Alignment, multilingual representative-sample preferences with negatively-correlated candidate sampling). The label PEBS denotes the per-rater empirical-Bayes shrinkage estimator used here: operationally, it shrinks annotator-specific affine calibration parameters. The method operates on a complementary axis (see §[4](https://arxiv.org/html/2606.27578#S4)): per-annotator calibration heterogeneity. RLHF work has also explored per-user effects along several distinct estimator axes. Kobalczyk & van der Schaar ([2025](https://arxiv.org/html/2606.27578#bib.bib21)) formulate user-specific factor confounding in a causal framework for preference learning. Zhang et al. ([2026](https://arxiv.org/html/2606.27578#bib.bib48)) use learned user prototypes; PEBS instead uses stable per-rater identifiers. Liu et al. ([2025](https://arxiv.org/html/2606.27578#bib.bib26)) model rater rationality as a function of annotator context.
Whether demographic covariates suffice for the per-user effect is testable: an analysis-of-variance (ANOVA, partitioning between- versus within-group variance) of the fitted per-user calibrators against six PRISM annotator features (age, gender, region, education, political orientation, English fluency; §[3.8](https://arxiv.org/html/2606.27578#S3.SS8)) leaves only the gender-to-$\hat{\beta}_{j}$ effect surviving Bonferroni correction at $\eta^{2} = 0.018$ (here $\hat{\beta}_{j}$ is the per-rater offset estimator from Section [2](https://arxiv.org/html/2606.27578#S2)), so demographic grouping cannot substitute for per-user calibration. The most closely related empirical-Bayes shrinkage method, EBPO (Han et al., [2026](https://arxiv.org/html/2606.27578#bib.bib17)), shrinks per-prompt group-relative-policy-optimization (GRPO) advantage baselines on verifiable-reward tasks, which targets a different scale (per-prompt advantage, not per-rater calibration). A comparison of PEBS against these related methods appears in Table [4](https://arxiv.org/html/2606.27578#A2.T4) (appendix).
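The effect size quoted for the demographic ANOVA can be illustrated with a short sketch. It assumes a one-way layout (one categorical annotator feature against one fitted calibrator parameter per rater), in which $\eta^{2}$ is the between-group sum of squares divided by the total sum of squares. The function name `eta_squared` and the toy offsets are hypothetical and not taken from the paper's code.

```python
import numpy as np

def eta_squared(values, groups):
    """One-way ANOVA effect size: SS_between / SS_total."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = 0.0
    for g in np.unique(groups):
        v = values[groups == g]
        # Each group contributes its size times its squared mean deviation.
        ss_between += v.size * (v.mean() - grand_mean) ** 2
    return float(ss_between / ss_total)

# Toy per-rater offsets (hypothetical): two groups with slightly different means.
beta_hat = np.array([48.0, 52.0, 50.0, 49.0, 53.0, 51.0])
group = np.array(["a", "a", "a", "b", "b", "b"])
eta2 = eta_squared(beta_hat, group)
```

Read this way, a value such as $\eta^{2} = 0.018$ means the grouping accounts for under 2% of the variance in the fitted offsets.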

#### Contributions\.

First, PEBS puts a classical correction where RLHF reward pipelines usually omit it: Efron–Morris–James–Stein partial pooling (Efron & Morris, [1973](https://arxiv.org/html/2606.27578#bib.bib11)) for annotator-specific scale and offset, applied post hoc to scalar RM outputs. Under annotator heterogeneity, this correction materially helps calibration-sensitive losses. The result in this setting is a within-user RMSE reduction of 8.58% on PRISM with a Qwen-2.5-7B base model (Table [1](https://arxiv.org/html/2606.27578#S3.T1)); the procedure replicates on PluriHarms harm ratings (+9.66%; §[3.4](https://arxiv.org/html/2606.27578#S3.SS4)) and on a same-family Phi-3-medium-14B reference (+42.15% across five seeds, all positive; §[3.5](https://arxiv.org/html/2606.27578#S3.SS5)). The estimator is closed-form and operates downstream of any reward model's scalar predictions; an ablation (§[3.9](https://arxiv.org/html/2606.27578#S3.SS9)) separates the gain into a textbook Efron–Morris intercept-shrinkage floor, which appears even under a signal-free (permuted) reward, and a smaller PEBS-specific slope-shrinkage residual that requires real reward signal. A pre-registered four-base coherence-only probe (§[3.5](https://arxiv.org/html/2606.27578#S3.SS5), Table [2](https://arxiv.org/html/2606.27578#S3.T2)) identifies the transfer limit structurally: the procedure transfers within the Qwen-2.5 family and on the Phi-3-medium-14B reference, while under coherence-only training on Llama-family-dense bases two of three turn negative; a paired verbosity-only control recovers positive gain on the same bases, pointing to a coherence-head / dense-architecture interaction rather than attribute-agnostic verbosity bias. We report this scope boundary without claiming generality.
On the theory side we prove (§[3.6](https://arxiv.org/html/2606.27578#S3.SS6), Theorem [1](https://arxiv.org/html/2606.27578#Thmtheorem1)) that a sample-split variant of PEBS's slope shrinkage stays within a $(1 + c/J)$ factor of an oracle that knows the true slope variance, with an explicit constant; a PRISM-calibrated simulation of the deployed estimator puts the realized risk inflation near 0.2%. A closed-form Morris $g$-function forecaster (§[3.7](https://arxiv.org/html/2606.27578#S3.SS7)) predicts PEBS gain on a new corpus from a short pilot, validated to within 0.2 pp on four rating corpora. Table [4](https://arxiv.org/html/2606.27578#A2.T4) (appendix) contrasts these extensions of the Efron–Morris–James–Stein estimator (Efron & Morris, [1973](https://arxiv.org/html/2606.27578#bib.bib11); Morris, [1983](https://arxiv.org/html/2606.27578#bib.bib30); Henderson, [1975](https://arxiv.org/html/2606.27578#bib.bib18)) with the most closely related personalization methods.

## 2 Method

### 2.1 Partial-pooling estimator

Given observations $\{y_{ji}\}$ indexed by annotator $j$ and utterance $i$, the complete-pooling estimator ignores $j$. We instead estimate a cluster-specific parameter $\theta_{j}$ via the classical empirical-Bayes blend

$$\hat{\theta}_{j}^{\mathrm{PP}} \;=\; \omega_{j}\,\hat{\theta}_{j}^{\mathrm{local}} + (1-\omega_{j})\,\hat{\theta}_{\mathrm{pool}}, \tag{1}$$

$$\omega_{j} \;=\; \frac{\tau^{2}}{\tau^{2} + V(\hat{\theta}_{j}^{\mathrm{local}})}, \tag{2}$$

where $\tau^{2}$ is the cross-cluster variance of $\theta_{j}$ and $V(\hat{\theta}_{j}^{\mathrm{local}})$ is the within-cluster sampling variance. Eq. ([2](https://arxiv.org/html/2606.27578#S2.E2)) is the Morris/James–Stein empirical-Bayes shrinkage (Morris, [1983](https://arxiv.org/html/2606.27578#bib.bib30)) and recovers the BLUP of the linear mixed model (Henderson, [1975](https://arxiv.org/html/2606.27578#bib.bib18)). At $\omega = 0$ it reduces to the pooled estimator and at $\omega \to 1$ it reduces to per-cluster OLS, with the closed-form $\omega_{j}(n_{j})$ curve and small-$n_{j}$ down-weighting visualized in Appendix Figure [6](https://arxiv.org/html/2606.27578#A2.F6).
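As a minimal sketch of Eqs. (1) and (2), the blend can be written in a few lines of NumPy. The function name `shrink` and all numbers below are illustrative assumptions, not values from the paper; $\tau^{2}$ is passed in as if known, whereas in practice it is estimated.

```python
import numpy as np

def shrink(theta_local, v_local, theta_pool, tau2):
    """Blend local estimates with the pooled one, following Eqs. (1)-(2)."""
    theta_local = np.asarray(theta_local, dtype=float)
    v_local = np.asarray(v_local, dtype=float)
    omega = tau2 / (tau2 + v_local)          # Eq. (2): weight on the local fit
    theta_pp = omega * theta_local + (1.0 - omega) * theta_pool  # Eq. (1)
    return theta_pp, omega

# Illustrative inputs only: three clusters with different sampling variances.
theta_pp, omega = shrink(
    theta_local=[1.4, 0.6, 1.0],
    v_local=[0.04, 0.36, 0.09],
    theta_pool=1.0,
    tau2=0.09,
)
```

The second cluster has the noisiest local estimate, so it receives the smallest weight ($\omega = 0.2$) and is pulled most of the way toward the pooled value.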

### 2.2 Per-user calibration model

Algorithm [1](https://arxiv.org/html/2606.27578#alg1) sets out the three-stage procedure (shared reward model; per-rater OLS calibrator; EB shrinkage) end-to-end. For each annotator $j$ and utterance $i$, we model the user's continuous preference score as

$$s_{ji} \;=\; \alpha_{j}\,\hat{r}_{\phi}(x_{ji}) \;+\; \beta_{j} \;+\; \varepsilon_{ji}, \tag{3}$$

where $\hat{r}_{\phi}$ is a shared reward model fine-tuned on pooled PRISM preferences and $(\alpha_{j}, \beta_{j})$ is a per-user linear calibrator: $\alpha_{j}$ is the per-annotator multiplicative slope (the units in which annotator $j$ converts a unit of model reward into a unit of self-reported score) and $\beta_{j}$ is the per-annotator additive offset (the baseline score $j$ assigns to a zero-reward response). Per-user OLS yields $\hat{\alpha}_{j}^{\mathrm{OLS}}, \hat{\beta}_{j}^{\mathrm{OLS}}$ with sampling variance $V(\hat{\alpha}_{j}) = \hat{\sigma}_{\varepsilon}^{2} / \bigl(n_{j}\,\mathrm{Var}_{j}(\hat{r}_{\phi}(x_{ji}))\bigr)$. The EB-shrunk estimator is the direct application of Eq. ([2](https://arxiv.org/html/2606.27578#S2.E2)):

$$\hat{\alpha}_{j}^{\mathrm{shrunk}} \;=\; \omega_{\alpha}^{(j)}\,\hat{\alpha}_{j}^{\mathrm{OLS}} \;+\; (1-\omega_{\alpha}^{(j)})\,\alpha_{\mathrm{pop}}, \tag{4}$$

with $\omega_{\alpha}^{(j)} = \hat{\tau}_{\alpha}^{2} / (\hat{\tau}_{\alpha}^{2} + V(\hat{\alpha}_{j}))$ and an analogous formula for $\hat{\beta}_{j}^{\mathrm{shrunk}}$. $\hat{\tau}_{\alpha}^{2}$ is a Method-of-Moments (MoM) estimate on the per-user $\hat{\alpha}_{j}^{\mathrm{OLS}}$ distribution; a Restricted Maximum Likelihood (REML) cross-check on the two-level (rater, observation) mixed model $s_{ji} \sim \beta_{j} + \alpha_{j}\hat{r}_{\phi}(x_{ji}) + \varepsilon$ (Seabold & Perktold, [2010](https://arxiv.org/html/2606.27578#bib.bib39); Pinheiro & Bates, [2000](https://arxiv.org/html/2606.27578#bib.bib33)) disagrees on PRISM by 3.5% on $\hat{\tau}_{\alpha}^{2}$ and 11.1% on $\hat{\tau}_{\beta}^{2}$; since the EB risk is stationary in $\tau^{2}$ at the truth (§[3.6](https://arxiv.org/html/2606.27578#S3.SS6), Appendix [A](https://arxiv.org/html/2606.27578#A1), Step 2), a few-percent error in $\hat{\tau}^{2}$ perturbs the risk only at second order. The fitted cross-user correlation between $\hat{\alpha}_{j}$ and $\hat{\beta}_{j}$ is small (point estimate 0.09), which supports the separable EB shrinkage in Algorithm [1](https://arxiv.org/html/2606.27578#alg1): the per-user slope $\alpha_{j}$ and offset $\beta_{j}$ can be shrunk independently rather than jointly with a $2 \times 2$ covariance matrix.

**Algorithm 1** PEBS: per-rater empirical-Bayes shrinkage

1. **Input:** reward model $\hat{r}_{\phi}$; per-rater calibration set $\{(x_{ji}, s_{ji})\}_{j,i}$, where $x_{ji}$ is the $i$-th utterance from rater $j$ and $s_{ji} \in [0, 100]$ is the rated score; the per-user covariate is the RM prediction $\hat{r}_{\phi}(x_{ji})$.
2. **for** each rater $j$ with $n_{j} \geq 3$ {PRISM uses $n_{j} \geq 6$, §[2.3](https://arxiv.org/html/2606.27578#S2.SS3)} **do**
3. $(\hat{\alpha}_{j}^{\mathrm{OLS}}, \hat{\beta}_{j}^{\mathrm{OLS}}) \leftarrow \mathrm{OLS}(\hat{r}_{\phi}(x_{j\cdot}),\, s_{j\cdot})$; $V(\hat{\alpha}_{j}) = \hat{\sigma}_{\varepsilon}^{2} / \bigl(n_{j}\,\mathrm{Var}_{j}(\hat{r}_{\phi}(x_{ji}))\bigr)$
4. **end for**
5. MoM: $\hat{\tau}_{\alpha}^{2} \leftarrow \mathrm{Var}_{j}(\hat{\alpha}_{j}^{\mathrm{OLS}}) - \overline{V(\hat{\alpha}_{j})}$
6. Truncate: $\hat{\tau}_{\alpha}^{2} \leftarrow \max(0,\, \hat{\tau}_{\alpha}^{2})$ {standard EB truncation, Morris ([1983](https://arxiv.org/html/2606.27578#bib.bib30)) §4}
7. Per-rater weights: $w_{j} = 1 / V(\hat{\alpha}_{j})$
8. Population mean: $\alpha_{\mathrm{pop}} = \bigl(\sum_{j} w_{j}\,\hat{\alpha}_{j}^{\mathrm{OLS}}\bigr) \big/ \bigl(\sum_{j} w_{j}\bigr)$
9. **for** each rater $j$ **do**
10. Weight: $\omega_{\alpha}^{(j)} \leftarrow \hat{\tau}_{\alpha}^{2} / (\hat{\tau}_{\alpha}^{2} + V(\hat{\alpha}_{j}))$
11. Shrunk: $\hat{\alpha}_{j}^{\mathrm{shrunk}} \leftarrow \omega_{\alpha}^{(j)}\,\hat{\alpha}_{j}^{\mathrm{OLS}} + (1 - \omega_{\alpha}^{(j)})\,\alpha_{\mathrm{pop}}$
12. Analogously for $\hat{\beta}_{j}^{\mathrm{shrunk}}$ (with $\hat{\tau}_{\beta}^{2}$ truncated at zero).
13. **end for**
14. **return** $\{(\hat{\alpha}_{j}^{\mathrm{shrunk}}, \hat{\beta}_{j}^{\mathrm{shrunk}})\}_{j=1}^{J}$
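The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' released code: the names `fit_pebs` and `eb_shrink` are hypothetical; $\hat{\sigma}_{\varepsilon}^{2}$ is taken to be a residual variance pooled across raters; the offset variance uses the standard OLS intercept formula, which the text only describes as "analogous"; $\mathrm{Var}_{j}$ in the MoM step uses `ddof=1`; and a small floor on the sampling variances guards against division by zero.

```python
import numpy as np

def eb_shrink(est, v):
    """Steps 5-11 of Algorithm 1 for one parameter (slope or offset)."""
    tau2 = max(0.0, float(np.var(est, ddof=1) - v.mean()))  # MoM, truncated at 0
    w = 1.0 / v                                   # precision weights
    pop = float(np.sum(w * est) / np.sum(w))      # precision-weighted mean
    omega = tau2 / (tau2 + v)                     # per-rater shrinkage weight
    return omega * est + (1.0 - omega) * pop, pop, tau2

def fit_pebs(r, s, rater, min_n=3, v_floor=1e-12):
    """r: RM predictions, s: rated scores, rater: rater ids (equal-length 1-D)."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    rater = np.asarray(rater)
    ids, a_ols, b_ols, ca, cb = [], [], [], [], []
    ssr, dof = 0.0, 0
    for j in np.unique(rater):
        mask = rater == j
        rj, sj = r[mask], s[mask]
        n, rbar = rj.size, rj.mean()
        sxx = float(np.sum((rj - rbar) ** 2))     # equals n_j * Var_j(r)
        if n < min_n or sxx == 0.0:
            continue                              # too few ratings or no spread
        a = float(np.sum((rj - rbar) * (sj - sj.mean())) / sxx)
        b = float(sj.mean() - a * rbar)
        ssr += float(np.sum((sj - a * rj - b) ** 2))
        dof += n - 2
        ids.append(j)
        a_ols.append(a)
        b_ols.append(b)
        ca.append(1.0 / sxx)                      # slope variance factor
        cb.append(1.0 / n + rbar ** 2 / sxx)      # intercept variance factor
    a_ols, b_ols = np.array(a_ols), np.array(b_ols)
    sigma2 = ssr / dof                            # pooled residual variance (assumption)
    v_a = np.maximum(sigma2 * np.array(ca), v_floor)
    v_b = np.maximum(sigma2 * np.array(cb), v_floor)
    a_shr, a_pop, tau2_a = eb_shrink(a_ols, v_a)
    b_shr, b_pop, tau2_b = eb_shrink(b_ols, v_b)
    return {"ids": ids, "alpha_ols": a_ols, "beta_ols": b_ols,
            "alpha": a_shr, "beta": b_shr,
            "alpha_pop": a_pop, "beta_pop": b_pop,
            "tau2_alpha": tau2_a, "tau2_beta": tau2_b}

# Synthetic demo (all numbers hypothetical): 50 raters with 8 ratings each.
rng = np.random.default_rng(0)
J, n = 50, 8
rater = np.repeat(np.arange(J), n)
alpha_true = rng.normal(10.0, 2.0, size=J)
beta_true = rng.normal(50.0, 5.0, size=J)
r = rng.normal(0.0, 1.0, size=J * n)
s = alpha_true[rater] * r + beta_true[rater] + rng.normal(0.0, 5.0, size=J * n)
fit = fit_pebs(r, s, rater)
```

At inference time the calibrated score for rater `fit["ids"][k]` would be `fit["alpha"][k] * r_new + fit["beta"][k]`; the reward model itself is never touched.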

###### Proposition 1 (Pair-accuracy invariance under PEBS).

Assume $\alpha_{\mathrm{pop}} > 0$ and that the post-shrinkage slopes are strictly positive, $\hat{\alpha}_{j}^{\mathrm{shrunk}} > 0$ for every rater $j$. Then the affine map $r \mapsto \hat{\alpha}_{j}^{\mathrm{shrunk}} r + \hat{\beta}_{j}^{\mathrm{shrunk}}$ is strictly monotone and preserves the argmax of every finite list, so any pair-accuracy or best-of-$n$ benchmark is constant across the pop-slope and EB-shrunk arms; gains can only appear in calibration-sensitive losses such as root mean squared error (RMSE) and the Bradley–Terry negative log-likelihood (NLL).

*(Proof: monotonicity.)* The positivity assumption is not automatic: since $\hat{\alpha}_{j}^{\mathrm{shrunk}}$ is a convex combination of $\hat{\alpha}_{j}^{\mathrm{OLS}}$ and $\alpha_{\mathrm{pop}}$, a sufficiently negative per-rater OLS slope can produce a negative shrunk slope whenever $\omega_{\alpha}^{(j)} > 0$; it holds automatically only in the fully-pooled case $\hat{\tau}_{\alpha}^{2} = 0$. We therefore verify it empirically: one of 1,394 raters has a marginally negative shrunk slope on PRISM (minimum $-0.33$); the measured pair accuracy is nonetheless identical across the pop-slope and EB-shrunk arms (0.6834 both, §[3.9](https://arxiv.org/html/2606.27578#S3.SS9)), so the invariance holds exactly on the evaluated cohort. *Consequence:* the held-out pair-accuracy null reported in §[3.9](https://arxiv.org/html/2606.27578#S3.SS9) is required rather than disconfirming, and PEBS is orthogonal to argmax-style benchmarks such as RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2606.27578#bib.bib28)).
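The invariance in Proposition 1 can be checked numerically with a small sketch. It assumes that both responses in a pair are calibrated with the same rater's $(\alpha, \beta)$; the helper names and the synthetic rewards are hypothetical. A positive slope leaves every pairwise ordering unchanged, while a negative slope (the case the positivity assumption rules out) reverses them.

```python
import numpy as np

def pair_accuracy(score_chosen, score_rejected):
    """Fraction of pairs in which the chosen response scores strictly higher."""
    return float(np.mean(score_chosen > score_rejected))

def calibrate(r, alpha, beta):
    """Per-rater affine map r -> alpha * r + beta."""
    return alpha * r + beta

# Synthetic reward-model scores for 1,000 preference pairs (hypothetical).
rng = np.random.default_rng(1)
r_chosen = rng.normal(0.3, 1.0, size=1000)
r_rejected = rng.normal(0.0, 1.0, size=1000)

acc_raw = pair_accuracy(r_chosen, r_rejected)
# Positive slope: ordering, and hence pair accuracy, is preserved.
acc_pos = pair_accuracy(calibrate(r_chosen, 7.5, 42.0),
                        calibrate(r_rejected, 7.5, 42.0))
# Negative slope: every strict ordering is reversed.
acc_neg = pair_accuracy(calibrate(r_chosen, -0.33, 42.0),
                        calibrate(r_rejected, -0.33, 42.0))
```

In this sketch `acc_pos` equals `acc_raw`, and `acc_neg` equals `1 - acc_raw` because the synthetic scores contain no ties.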

### 2.3 PRISM setup and base reward model

We use the PRISM Alignment corpus (Kirk et al., [2024](https://arxiv.org/html/2606.27578#bib.bib20)), a public RLHF dataset that exposes stable per-annotator IDs alongside multi-turn preference judgments at the scale we require. Two nested cohorts enter the paper. The reward model is trained on 26,876 preference pairs from the 1,391 demographic-complete participants (~93% of PRISM's 1,500; 75 countries, 24 demographic axes), under a stratified 80/20 held-out-user split (21,474 train / 5,402 test pairs, 1,113 train / 278 held-out users, no within-user leakage). The per-rater calibrators are fit on utterance-level scores with an $n_{j} \geq 6$-observation filter, which retains $J = 1{,}394$ of 1,396 extractable participants; all within-user calibration results (§[3.1](https://arxiv.org/html/2606.27578#S3.SS1) onward) use this 1,394-user cohort.

The base reward model $\hat{r}_{\phi}$ is Qwen2.5-7B-Instruct (Yang et al., [2025](https://arxiv.org/html/2606.27578#bib.bib46)) fine-tuned with low-rank adaptation (Hu et al., [2022](https://arxiv.org/html/2606.27578#bib.bib19)) ($r = 32$) on the pooled PRISM preferences with the centered-rewards regularizer of Eisenstein et al. ([2024](https://arxiv.org/html/2606.27578#bib.bib12)); full training-loop configuration is in Appendix [B](https://arxiv.org/html/2606.27578#A2.SS0.SSS0.Px6). The base model reaches 64.00% pair accuracy on the held-out-user test set, roughly three percentage points over a matched Qwen2.5-0.5B baseline. All PEBS calibrators are fit on this 7B base model's scores. Code, configurations, and fitted per-rater calibrator weights are in the public repository linked from the footnote in §[1](https://arxiv.org/html/2606.27578#S1).

The HelpSteer2 across\-family probes \(§[3\.5](https://arxiv.org/html/2606.27578#S3.SS5)\) train attribute\-specific reward heads: the coherence head trains the LoRA adapter to predict per\-row HelpSteer2 coherence scores, and the verbosity head substitutes verbosity scores under an otherwise identical training configuration\. The verbosity\-only run is a control that tests whether a negative outcome on the coherence head reflects a coherence\-specific phenomenon or an attribute\-agnostic upstream\-bias effect\.

## 3 Experiments

We evaluate PEBS along three axes: (a) within-user calibration accuracy on PRISM (§[3.1](https://arxiv.org/html/2606.27578#S3.SS1)–§[3.2](https://arxiv.org/html/2606.27578#S3.SS2): RMSE, paired effect size, Bradley–Terry NLL), (b) cross-corpus replication (§[3.4](https://arxiv.org/html/2606.27578#S3.SS4): PluriHarms on a Qwen-2.5 base model; HelpSteer2 multi-attribute observation in Appendix [B](https://arxiv.org/html/2606.27578#A2.SS0.SSS0.Px4)), and (c) base-family transfer (§[3.5](https://arxiv.org/html/2606.27578#S3.SS5): a pre-registered four-base scope panel with a verbosity-only control). Two theoretical tools frame these empirics: an oracle inequality for slope shrinkage (§[3.6](https://arxiv.org/html/2606.27578#S3.SS6)) and a Morris $g$-function closed-form forecaster (§[3.7](https://arxiv.org/html/2606.27578#S3.SS7)). Stress tests (§[3.8](https://arxiv.org/html/2606.27578#S3.SS8)) and pre-registered ablations (§[3.9](https://arxiv.org/html/2606.27578#S3.SS9)) follow.

### 3.1 Held-out PRISM prediction

Table [1](https://arxiv.org/html/2606.27578#S3.T1) reports four-arm performance on $N = 1{,}394$ users with $k = 5$-fold cross-validation (CV). The EB-shrunk calibrator of Eq. ([2](https://arxiv.org/html/2606.27578#S2.E2)) yields an **8.58%** relative within-user RMSE reduction over the pop-slope baseline (a single global affine calibrator $(\alpha_{\mathrm{pop}}, \beta_{\mathrm{pop}})$ fit by pooled OLS, the strongest of the no-personalization arms we evaluate). Naive per-user OLS is rarely used in practice: although the regression is computationally negligible, each per-user fit overfits its own small sample of size $n_{j}$, so low-$n_{j}$ users get high-variance calibrators and held-out RMSE worsens. Shrinking each per-user fit toward the population mean by the closed-form weight of Eq. ([2](https://arxiv.org/html/2606.27578#S2.E2)) closes that gap at near-zero marginal cost.

Table 1: PEBS recovers within-user RMSE on PRISM beyond what naive per-user OLS achieves. Held-out score-prediction RMSE for $N = 1{,}394$ users with $k = 5$ cross-validation on a 7B base model. The EB-shrunk estimator dominates naive per-user OLS on 77.3% of users (sign test, $p < 10^{-92}$). Using a 4,000-replicate cluster bootstrap by user (Cameron et al., [2008](https://arxiv.org/html/2606.27578#bib.bib4); Efron, [1987](https://arxiv.org/html/2606.27578#bib.bib10)), the bias-corrected accelerated (BCa) 95% CI on the 8.58% relative gain is [7.59%, 9.42%], excluding zero.
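The user-level resampling behind the reported interval can be sketched as below. Two simplifications are assumed and should be read as such: the sketch returns a plain percentile interval, not the BCa interval the paper reports, and it defines the relative gain as one minus the ratio of mean per-user RMSEs, which the text does not spell out. The function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def cluster_bootstrap_gain(rmse_base, rmse_new, n_boot=4000, seed=0):
    """Relative RMSE gain with a percentile 95% CI, resampling whole users."""
    rmse_base = np.asarray(rmse_base, dtype=float)
    rmse_new = np.asarray(rmse_new, dtype=float)
    rng = np.random.default_rng(seed)
    n_users = rmse_base.size
    # Each bootstrap replicate draws n_users users with replacement.
    idx = rng.integers(0, n_users, size=(n_boot, n_users))
    gains = 1.0 - rmse_new[idx].mean(axis=1) / rmse_base[idx].mean(axis=1)
    point = 1.0 - rmse_new.mean() / rmse_base.mean()
    lo, hi = np.percentile(gains, [2.5, 97.5])
    return float(point), float(lo), float(hi)

# Synthetic per-user RMSEs (hypothetical): the new arm is about 10% lower.
rng = np.random.default_rng(0)
rmse_pop = rng.uniform(10.0, 20.0, size=300)
rmse_eb = rmse_pop * rng.uniform(0.8, 1.0, size=300)
point, lo, hi = cluster_bootstrap_gain(rmse_pop, rmse_eb)
```

Resampling users, not individual ratings, keeps each user's ratings together, which is what makes this a cluster bootstrap.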

### 3.2 Effect size and BT log-likelihood

On the same PRISM cohort the per-user paired effect of the RMSE drop (mean per-user difference between EB-shrunk and pop-slope arms divided by the within-user paired-difference SD) is **$d_{\text{paired}} = 0.542$** (95% CI $[0.491,\, 0.607]$), roughly half the within-user re-rating noise on the 0–100 scale. The cross-user pooled reduction is 0.075 SD; the two readings differ because they condition on within-user vs. marginal variance, and the per-user calibrator targets the within-user component. Within-user RMSE is a proxy for downstream reward-model behaviour; the quantity that enters the RLHF reward-model loss directly is the held-out pairwise Bradley–Terry (BT) negative log-likelihood (NLL), which is not monotone-invariant in the calibrator (unlike pair accuracy). On the held-out preference pairs the mean per-pair BT-NLL improves by **5.7%** relative (paired-$t$ $p < 10^{-7}$). The improvement is tail-concentrated rather than uniform: the per-pair Wilcoxon $p$ is 0.77 and the median $\Delta\mathrm{NLL}$ is near zero, with the gain carried by a minority of users with atypical $\beta_{j}$ on hard pairs.
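A short sketch shows why BT-NLL responds to the calibrator while pair accuracy does not, and how the paired effect size is formed. It assumes the usual Bradley–Terry form $-\log\sigma(s_{\text{chosen}} - s_{\text{rejected}})$ with no extra temperature; the function names and toy numbers are hypothetical.

```python
import numpy as np

def bt_nll(s_chosen, s_rejected):
    """Per-pair Bradley-Terry NLL, -log(sigmoid(gap)), computed stably."""
    gap = np.asarray(s_chosen, dtype=float) - np.asarray(s_rejected, dtype=float)
    return np.logaddexp(0.0, -gap)

def paired_d(x, y):
    """Paired effect size: mean difference over SD of the differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.mean() / diff.std(ddof=1))

# Three toy pairs: rescaling the scores keeps every ordering but changes the NLL.
s_chosen = np.array([2.0, 0.5, -1.0])
s_rejected = np.zeros(3)
nll_unit = bt_nll(1.0 * s_chosen, s_rejected).mean()
nll_steep = bt_nll(3.0 * s_chosen, s_rejected).mean()

# Toy per-user RMSEs for two arms (hypothetical numbers).
d_demo = paired_d([1.0, 2.0, 3.5], [0.0, 1.0, 2.0])
```

Tripling the slope sharpens the penalty on the one misordered pair, so the mean NLL rises even though two of three pairs remain correctly ordered in both cases.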

### 3.3 Downstream calibration losses

DPO and PPO consume reward scores as scalars, not as ranks. The DPO loss $-\log\sigma(\beta\,(r_{\text{chosen}} - r_{\text{rejected}}))$ (Rafailov et al., [2023](https://arxiv.org/html/2606.27578#bib.bib34)) is a sigmoid of a magnitude difference (here $\beta$ is the DPO temperature, not the per-rater offset $\beta_{j}$). PPO advantage normalization operates on raw scores (Schulman et al., [2017](https://arxiv.org/html/2606.27578#bib.bib38); Ouyang et al., [2022](https://arxiv.org/html/2606.27578#bib.bib31)). BT-NLL weights each preference pair by the magnitude of its score gap. By Proposition [1](https://arxiv.org/html/2606.27578#Thmproposition1), PEBS does not change pair accuracy; however, it changes the gradient that the policy training step uses. Multi-attribute aggregation compounds the issue: per-rater scale heterogeneity distorts the sum of raw scalars. Reward-model overoptimization is the limiting failure-mode of poor downstream calibration (Gao et al., [2023](https://arxiv.org/html/2606.27578#bib.bib14)); a pre-registered PPO probe on PRISM (Qwen-2.5-7B policy with the Skywork-Llama-3.1-8B reward model (Liu et al., [2024](https://arxiv.org/html/2606.27578#bib.bib25))) shows the uncorrected reward collapses at $\mathrm{KL} \geq 1.0$ while the PEBS-shrunk arm holds (judge-reward gap +2.16, conservative 95% CI excluding zero). The RM-selection literature reports upstream-vs-downstream rank-correlations of only $\tau = 0.08$–$0.31$ (Rezk et al., [2025](https://arxiv.org/html/2606.27578#bib.bib36)). PEBS targets calibration-sensitive losses; improving pair accuracy requires a separate selection-style component (§[4](https://arxiv.org/html/2606.27578#S4)).

### 3.4 Cross-corpus replication

A single-corpus result on PRISM does not by itself establish a pluralism claim. We replicate the within-cluster-RMSE evaluation on three additional corpora with stable cluster IDs (PluriHarms harm ratings (Li et al., [2026](https://arxiv.org/html/2606.27578#bib.bib24)), whose taxonomy follows the value-annotation tradition of KALEIDO (Sorensen et al., [2024a](https://arxiv.org/html/2606.27578#bib.bib40)); HelpSteer2 prompt-cluster attributes (Wang et al., [2024](https://arxiv.org/html/2606.27578#bib.bib44)); OASST2 authors (Köpf et al., [2023](https://arxiv.org/html/2606.27578#bib.bib22))) and on a single heterogeneous-cluster pool of all four (195,963 observations, 13,755 namespaced clusters, per-corpus $z$-score normalization). All five rows reduce RMSE on the same Qwen-2.5 base model (Figure [2](https://arxiv.org/html/2606.27578#S3.F2)); OASST2-author is the weakest replication (+1.21%; its bootstrap CI excludes zero though a per-cluster Wilcoxon test does not reach significance); PluriHarms (+9.66%) and PRISM (+8.58%) agree to within ~1 pp of each other despite measuring harm ratings versus preferences, consistent with (though not establishing) a cluster-scale and not feedback-type-specific mechanism. The HelpSteer2 row treats prompt-cluster attribute scores as the cluster axis (a different problem-geometry from per-annotator pluralism); the per-attribute breakdown is in Appendix [B](https://arxiv.org/html/2606.27578#A2.SS0.SSS0.Px4). An ordinal preference corpus (MultiPref) lies outside the Gaussian-RE scope and is documented separately in §[3.7](https://arxiv.org/html/2606.27578#S3.SS7).
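The pooling step described above (namespaced cluster IDs plus per-corpus $z$-scoring) might look like the following sketch. It assumes the normalization is applied to the rated scores within each corpus and that namespacing is a simple `corpus:id` prefix; neither detail is given in the text, and the function name and toy data are hypothetical.

```python
import numpy as np

def pool_corpora(corpora):
    """corpora: dict name -> (cluster_ids, scores).

    Returns namespaced cluster ids and per-corpus z-scored scores, concatenated
    in the dict's iteration order. Assumes each corpus has non-constant scores.
    """
    ids, z_scores = [], []
    for name, (cluster_ids, scores) in corpora.items():
        scores = np.asarray(scores, dtype=float)
        # Standardize within the corpus so rating scales become comparable.
        z_scores.append((scores - scores.mean()) / scores.std())
        # Prefix ids so that equal raw ids in different corpora stay distinct.
        ids.extend(f"{name}:{c}" for c in cluster_ids)
    return np.array(ids), np.concatenate(z_scores)

# Toy input: the raw id "7" appears in both corpora but must not be merged.
ids, z = pool_corpora({
    "prism": (["7", "7", "8"], [10.0, 20.0, 30.0]),
    "oasst2": (["7", "9"], [1.0, 3.0]),
})
```

Without the prefix, cluster `7` from one corpus would share a calibrator with cluster `7` from another, which would contaminate the per-cluster fits.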

![Refer to caption](https://arxiv.org/html/2606.27578v1/x2.png)

Figure 2: PEBS reduces RMSE on four single-corpus replications and on a $195{,}963$-observation pooled corpus, all using a single Qwen-2.5 base model. Horizontal forest of within-cluster gain (%) with $95\%$ BCa cluster-bootstrap CIs; circles are single-corpus replications, the diamond is the four-corpus pooled estimate. The dashed reference at zero is the pop-slope baseline. The pooled-multi-corpus row ($+7.19\%$ $[+6.36, +7.96]$) uses namespaced cluster IDs across the four corpora.

### 3.5 Cross-family transfer

A pre-registered four-base coherence-only probe (Meta-Llama-3-8B, Mistral-Small-22B, Yi-1.5-34B, Phi-3-medium-14B; two mixture-of-experts (MoE) runs, Phi-3.5-MoE and Mixtral-8×7B, are reported as appendix-only boundary evidence in App. [B](https://arxiv.org/html/2606.27578#A2)) with five training seeds on the same-family Phi-3 reference and a paired verbosity-only run on the three Llama-family-dense bases together map where the HelpSteer2 multi-attribute observation (Appendix [B](https://arxiv.org/html/2606.27578#A2.SS0.SSS0.Px4)) transfers beyond Qwen-2.5; Table [2](https://arxiv.org/html/2606.27578#S3.T2) summarizes the result. The pre-registered sign-flip criterion is met for Llama-3-8B and Yi-1.5-34B; Mistral-Small-22B is a single-seed null; Phi-3-medium-14B holds at $+42.15\%$ across $5$ seeds (positive in all five). Both columns of Table [2](https://arxiv.org/html/2606.27578#S3.T2) report the held-out *coherence-attribute* gain; the columns differ in which attribute the head was trained on. Under verbosity-only training the untrained coherence head remains positive across all four bases (single-seed for the three Llama-family-dense bases, five-seed for Phi-3), while the trained verbosity head itself turns negative (e.g. $-32.62\%$ on Phi-3; Appendix [B](https://arxiv.org/html/2606.27578#A2)). This is evidence against attribute-agnostic verbosity bias as the source of the coherence-head reversal. A within-Llama intervention sweep (zero-out, scramble, signal-content replacement; two seeds each) further refines the mechanism: only information-removal interventions reproduce the negative outcome, while signal substitution preserves the same-family positive, consistent with a collapse-by-removal pattern at within-Llama scope.
The full HelpSteer2 five\-attribute breakdown, calibration\-diagnostic signatures, and the MoE\-branch partial\-coverage LoRA scope are in Appendix[B](https://arxiv.org/html/2606.27578#A2)\.

Table 2: The verbosity-only control preserves the coherence head on the three Llama-family-dense bases. Each cell is the held-out coherence-attribute gain (%); columns differ in the attribute the head was trained on. Phi-3-medium-14B is the same-family five-seed reference (mean $+42.15\%$, Student-$t$ $95\%$ CI $[+40.10, +44.20]$); under verbosity-only training the untrained coherence head stays positive on all four bases.

### 3.6 Oracle inequality for EB slope-shrinkage

Beyond the empirical PRISM gain on Qwen-family base models, PEBS admits an oracle inequality under the random-effects assumptions stated below. Write $V_j$ for the within-annotator sampling variance of $\hat{\alpha}_j^{\mathrm{OLS}}$ (§[2.2](https://arxiv.org/html/2606.27578#S2.SS2)) and $M = \max_j V_j / \tau^2_\alpha$ for the noise-to-signal bound; $R_{\text{oracle}}$ is the squared-error risk of the oracle estimator that knows the true $\tau^2_\alpha$, and $R_{\text{EB}}$ that of the truncated Morris MoM EB estimator.
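The objects named here can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it assumes a simple unweighted truncated method-of-moments estimate of $\tau^2$ (the exact estimator is defined in §2.2, outside this section), treats $\alpha_{\mathrm{pop}}$ and $V_j$ as known, and estimates $\hat{\tau}^2$ on the same synthetic raters it then shrinks rather than on an auxiliary set. All numeric values and function names are hypothetical.

```python
import numpy as np

def truncated_mom_tau2(alpha_ols, v):
    """Method-of-moments estimate of tau^2, truncated at zero.

    Assumption: unweighted variant (sample variance of the OLS slopes
    minus the mean sampling variance), not necessarily the paper's form.
    """
    alpha_ols = np.asarray(alpha_ols, dtype=float)
    v = np.asarray(v, dtype=float)
    return max(0.0, float(np.var(alpha_ols, ddof=1) - v.mean()))

def eb_shrink(alpha_ols, v, alpha_pop, tau2):
    """Shrink each OLS slope toward alpha_pop with weight tau2 / (tau2 + V_j)."""
    alpha_ols = np.asarray(alpha_ols, dtype=float)
    v = np.asarray(v, dtype=float)
    omega = tau2 / (tau2 + v)  # requires V_j > 0
    return alpha_pop + omega * (alpha_ols - alpha_pop), omega

# Synthetic raters from a Gaussian random-effects DGP (made-up parameters).
rng = np.random.default_rng(0)
J, alpha_pop, tau2_true = 2000, 1.0, 0.04
v = rng.uniform(0.01, 0.16, size=J)                    # known sampling variances
alpha_true = alpha_pop + rng.normal(0.0, np.sqrt(tau2_true), size=J)
alpha_ols = alpha_true + rng.normal(0.0, np.sqrt(v))   # noisy per-rater OLS slopes

tau2_hat = truncated_mom_tau2(alpha_ols, v)
alpha_eb, omega = eb_shrink(alpha_ols, v, alpha_pop, tau2_hat)

mse_ols = float(np.mean((alpha_ols - alpha_true) ** 2))
mse_eb = float(np.mean((alpha_eb - alpha_true) ** 2))
print(f"tau2_hat={tau2_hat:.4f}  mse_ols={mse_ols:.4f}  mse_eb={mse_eb:.4f}")
```

On data that follow the assumed DGP, the shrunk slopes should have lower squared error than the raw OLS slopes; the gap between the shrunk estimator and an oracle that knows $\tau^2$ is the quantity an oracle inequality controls.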

###### Theorem 1 (EB slope-shrinkage oracle inequality).

Let $J \geq 4$ denote the number of annotators. Assume the random-effect DGP $\alpha_j = \alpha_{\mathrm{pop}} + u_j$ with $u_j \sim \mathcal{N}(0, \tau^2)$ i.i.d. across $j$; $\hat{\alpha}_j^{\mathrm{OLS}} \mid \alpha_j \sim \mathcal{N}(\alpha_j, V_j)$ independent across $j$, with $V_j$ and $\alpha_{\mathrm{pop}}$ known; and $V_j \leq M\tau^2$ for all $j$. Let $\hat{\tau}^2$ be the truncated method-of-moments estimate computed on an auxiliary set of raters drawn from the same DGP, independent of the $J$ raters being estimated (sample splitting). Then

$$
R_{\text{EB}} \leq \Bigl(1 + \frac{c}{J}\Bigr) R_{\text{oracle}} + 2\max(1, M)\,\tau^2 \exp\!\Bigl(-\tfrac{c_2 (J-1)}{(1+M)^2}\Bigr), \tag{5}
$$

with $c \leq \tfrac{64}{3}(1+M)^2$ and $c_2 > 0$ an absolute constant.

The proof (Appendix [A](https://arxiv.org/html/2606.27578#A1)) adapts the heteroskedastic-location analysis of Xie et al. ([2012](https://arxiv.org/html/2606.27578#bib.bib45)) to the OLS-slope statistic: the oracle is a stationary point of the per-rater risk, so the $\hat{\tau}^2$ estimation error enters only at second order. The constant is a conservative worst-case bound driven by the sparsest raters (on PRISM, $n_j$ spans $6$–$144$, giving $M \approx 4$); the deployed estimator additionally estimates $\hat{\tau}^2$ and $\alpha_{\mathrm{pop}}$ on the same sample, a coupling Appendix [A](https://arxiv.org/html/2606.27578#A1) scopes and validates by simulation. *Operational consequence:* a PRISM-calibrated simulation of the deployed estimator ($J = 1{,}394$, $100$ seeds) puts the expectation-level risk inflation at $\approx 0.2\%$ (mean risk ratio $1.002$; worst seed $1.017$), far too small to explain the $8.58\%$ PRISM gain.
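As a rough check on how conservative the constant is, the terms of the bound can be evaluated at the PRISM values quoted above ($M \approx 4$, $J = 1{,}394$). This is illustrative arithmetic rather than a result from the paper, and $c_2$ is left symbolic because its value is not given here.

```latex
\begin{align*}
c &\le \tfrac{64}{3}(1+M)^2 \approx \tfrac{64}{3}\cdot 25 \approx 533.3, \\
\frac{c}{J} &\lesssim \frac{533.3}{1394} \approx 0.38, \\
2\max(1,M)\,\tau^2 \exp\!\Bigl(-\tfrac{c_2 (J-1)}{(1+M)^2}\Bigr)
  &\approx 8\,\tau^2 \exp(-55.7\,c_2).
\end{align*}
```

Under these values the worst-case multiplicative factor is about $1.38$, well above the simulated mean risk ratio of $1.002$, which is consistent with reading the constant as a conservative bound.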

### 3.7 Morris $g$-function forecaster

Given $(\tau^2_\alpha, \tau^2_\beta, \sigma^2_\varepsilon, \{n_j\}, \{\mathrm{Var}_{\mathrm{w}}(x_j)\})$ from a short pilot, where $x$ here denotes the RM score centered within each rater’s calibration slice ($\bar{x}_j = 0$, as in our implementation) and $\mathrm{Var}_{\mathrm{w}}(x_j)$ its within-rater variance, the two-parameter Morris risk-gap formula

$$
\mathbb{E}\!\left[R_{\text{POP}} - R_{\text{EB}}\right] = \tfrac{1}{J} \sum_j \Bigl[\tau^2_\alpha\, \mathrm{Var}_{\mathrm{w}}(x_j)\, g\bigl(r_\alpha^{(j)}\bigr) + \tau^2_\beta\, g\bigl(r_\beta^{(j)}\bigr)\Bigr], \quad g(r) = r/(1+r), \tag{6}
$$

with $r_\alpha^{(j)} = n_j \tau^2_\alpha\, \mathrm{Var}_{\mathrm{w}}(x_j) / \sigma^2_\varepsilon$ and $r_\beta^{(j)} = n_j \tau^2_\beta / \sigma^2_\varepsilon$ (per-rater centering makes the slope and offset shrinkage gaps separate; the predicted risk gap converts to the reported relative RMSE reduction through division by $R_{\text{POP}}$), predicts PEBS gain on a new corpus before running the full estimation procedure. In practice, one estimates $(\tau^2, \sigma^2, \{n_j\})$ on a short pilot, plugs into Eq. ([6](https://arxiv.org/html/2606.27578#S3.E6)), and decides whether fitting the full PEBS estimator is warranted. Table [3](https://arxiv.org/html/2606.27578#S3.T3) validates the forecast within $0.2$ pp on the four continuous-rating corpora. The MultiPref row shows the scope limit: the forecast departs from the observed gain by a wide margin where the ordinal preference setting violates the Gaussian random-effects assumption.
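Eq. (6) is simple enough to evaluate directly from pilot estimates. The sketch below is one possible transcription, with hypothetical function and argument names and made-up pilot values; it returns the predicted risk gap only and leaves the conversion to a relative RMSE reduction (which needs $R_{\text{POP}}$) to the caller.

```python
import numpy as np

def g(r):
    """Morris g-function: g(r) = r / (1 + r)."""
    return r / (1.0 + r)

def morris_risk_gap(tau2_alpha, tau2_beta, sigma2_eps, n, var_w_x):
    """Predicted E[R_POP - R_EB] from Eq. (6), averaged over raters.

    n        : per-rater calibration counts n_j
    var_w_x  : per-rater within-rater variance of the centered RM score
    """
    n = np.asarray(n, dtype=float)
    var_w_x = np.asarray(var_w_x, dtype=float)
    r_alpha = n * tau2_alpha * var_w_x / sigma2_eps   # slope signal-to-noise
    r_beta = n * tau2_beta / sigma2_eps               # offset signal-to-noise
    per_rater = tau2_alpha * var_w_x * g(r_alpha) + tau2_beta * g(r_beta)
    return float(per_rater.mean())

# Made-up pilot values for three raters with different calibration counts.
gap_small_n = morris_risk_gap(0.04, 0.25, 1.0, n=[5, 5, 5], var_w_x=[1.0, 0.8, 1.2])
gap_large_n = morris_risk_gap(0.04, 0.25, 1.0, n=[80, 80, 80], var_w_x=[1.0, 0.8, 1.2])
print(gap_small_n, gap_large_n)
```

Because $g$ is increasing and bounded by one, the predicted gap grows with $n_j$ and saturates at $\tfrac{1}{J}\sum_j \bigl[\tau^2_\alpha \mathrm{Var}_{\mathrm{w}}(x_j) + \tau^2_\beta\bigr]$, the total between-rater variance that pooling ignores.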

Table 3: The closed-form Morris $g$-function forecasts PEBS gain from a short pilot to within $0.2$ pp on the four continuous-rating corpora. Observed gains use a leave-one-row-out per-cluster CV matched to the forecaster’s assumptions; the OASST2-author row therefore differs by protocol, covariate, and cohort from the §[3.4](https://arxiv.org/html/2606.27578#S3.SS4) replication row ($+1.21\%$), and the two are not comparable. At SHP’s cluster sizes, $\omega \to 1$, so PEBS reduces to per-cluster OLS and the exact forecast match is expected rather than informative. MultiPref is the ordinal-preference limit: its predicted-versus-observed gap flags a Gaussian random-effects mismatch.

The MultiPref row in Table [3](https://arxiv.org/html/2606.27578#S3.T3) is the calibrated null: an ordinal preference corpus (Miranda et al., [2024](https://arxiv.org/html/2606.27578#bib.bib29)) on which the forecaster’s $17.49$ pp predicted-versus-observed gap correctly flags Gaussian-RE mis-specification (§[5](https://arxiv.org/html/2606.27578#S5)).

### 3.8 Stress tests

A natural concern is that $k = 5$ random-fold CV may overstate deployment generalization if the per-user $(\alpha_j, \beta_j)$ parameters are not time-invariant. Across five pre-registered seeds the random-fold gain is tight around the $8.58\%$ point estimate: all five seed CIs exclude zero and the per-seed gains lie within $0.17$ pp of it. We repeated the within-user evaluation with a strict temporal $80/20$ split, sorting utterances by PRISM generation timestamp. The shrinkage gain holds at $\bm{7.55\%}$, with a $30$-seed cluster-bootstrap CI that brackets the random-CV point estimate, so the within-user RMSE result also holds under a stricter temporal split. Across three base reward models (Qwen2.5-7B, Skywork-Reward-Gemma-2-27B (Liu et al., [2024](https://arxiv.org/html/2606.27578#bib.bib25)), Llama-3.2-3B-Instruct) crossed with PRISM and PluriHarms, all six cells return a positive shrinkage gain whose $95\%$ CI strictly excludes zero, even though the HelpSteer2 multi-attribute observation does not extend across architectures (§[3.5](https://arxiv.org/html/2606.27578#S3.SS5), Appendix [B](https://arxiv.org/html/2606.27578#A2.SS0.SSS0.Px4)). The PRISM gain is also stable across thirty-four subsets covering top-$|\hat{\alpha}_j|$ trimming, small- and large-$n$ slices, random user subsamples, and demographic cells, and demographic grouping cannot replace per-user calibration on PRISM (only gender $\to \hat{\beta}_j$ survives Bonferroni at small explained variance $\eta^2 < 0.02$; see Appendix [B](https://arxiv.org/html/2606.27578#A2) for the six-demographic ANOVA detail).

#### Cold\-start threshold\.

EB shrinkage reduces to pop-slope at $m = 0$ ratings per user (weight $\omega = 0$) and overtakes the pop-slope baseline from $\bm{m = 5}$ ratings per user onward under random-fold CV, roughly a four-fold improvement in data efficiency since naive per-user OLS only breaks even with pop-slope at $m = 20$. The bias-variance trade-off of $\omega = \tau^2/(\tau^2 + V)$ produces a non-monotone transition near $m = 3$, where shrinkage is worse than pop-slope on held-out RMSE. We report this as a deployment-relevant failure mode; the operational rule is to use pop-slope until $m \geq 5$ ratings per user are available, then switch to shrinkage.
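The operational rule can be written as a small gate in front of the shrinkage step. This is a sketch under stated assumptions: the function name and arguments are hypothetical, only the slope is shown (the offset would be handled the same way), and the default threshold of five ratings is the PRISM random-fold value reported above, which may not carry over to other corpora.

```python
def calibrated_slope(m, alpha_pop, alpha_ols, tau2, v, min_ratings=5):
    """Return the slope to use for a rater with m calibration ratings.

    Below min_ratings the population slope is used unchanged; at or above
    it, the per-rater OLS slope is shrunk toward alpha_pop with weight
    omega = tau2 / (tau2 + v), where v is the rater's sampling variance.
    """
    if m < min_ratings:
        return alpha_pop
    omega = tau2 / (tau2 + v)
    return alpha_pop + omega * (alpha_ols - alpha_pop)

# Hypothetical rater whose OLS slope is 2.0 against a population slope of 1.0.
slope_cold = calibrated_slope(m=3, alpha_pop=1.0, alpha_ols=2.0, tau2=0.04, v=0.04)
slope_warm = calibrated_slope(m=5, alpha_pop=1.0, alpha_ols=2.0, tau2=0.04, v=0.04)
print(slope_cold, slope_warm)
```

Gating on the rating count avoids the regime near $m = 3$ where shrinkage is reported to underperform pop-slope.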

### 3.9 Ablations and failure cases

We pre\-registered four ablations, each tied to a specific claim it could overturn; all four outcomes were consistent with the claims\. Together with three companion analyses they form seven stress tests, summarized in three thematic groups; the per\-cell numerical detail is in Appendix[B](https://arxiv.org/html/2606.27578#A2)\.

(I) Mechanism necessity. A leave-one-component-out decomposition on PRISM shows that neither component suffices alone: intercept-only shrinkage (the Efron–Morris floor) attains $+7.46\%$ and slope-only shrinkage $+0.74\%$, both strictly below the joint gain; adding slope shrinkage on top of the intercept floor contributes $+1.04$ pp, and the slope component is the only one that requires real RM signal, ruling out a pure-noise explanation. The PluriHarms cross-corpus replication (Figure [3](https://arxiv.org/html/2606.27578#S3.F3)) is the primary evidence that intercept- and slope-shrinkage are jointly necessary rather than additive: both single-component variants (intercept-only and slope-only) degrade RMSE individually, yet the joint estimator is strictly dominant at $\bm{+9.66\%}$. Method-of-moments $\hat{\tau}^2_\alpha$ recovers the ground-truth variance across the synthetic-seed grid; the sign-reversal and adversarial-user injection probes both leave PEBS’s RMSE below the naive-no-pool baseline at every tested corruption level (grid and per-cell numbers in the released artifact bundle).

![Refer to caption](https://arxiv.org/html/2606.27578v1/x3.png)

Figure 3: Both single-component estimators degrade RMSE on PluriHarms; only the joint estimator yields the gain on both corpora. RMSE reduction (%) vs. pop-slope, $95\%$ BCa CIs, dashed reference at zero. Bars use the cross-corpus evaluation protocol of Figure [2](https://arxiv.org/html/2606.27578#S3.F2); the matched single-corpus PRISM decomposition (§[3.9](https://arxiv.org/html/2606.27578#S3.SS9)) agrees within $0.25$ pp.

(II) Cross-axis generalization. The pooled-multi-corpus analysis is summarized in Figure [2](https://arxiv.org/html/2606.27578#S3.F2) and the three-base-model and per-rater sample-efficiency analyses in §[3.8](https://arxiv.org/html/2606.27578#S3.SS8); all return positive gain. A three-base-model PRISM panel using mean-response log-likelihood scoring (Stiennon et al., [2020](https://arxiv.org/html/2606.27578#bib.bib42)) replicates the within-user gain on each base model (per-cell numbers in the released artifact bundle). Per-rater subsampling yields monotone non-decreasing gain that tracks the Morris $g$-function prediction $r/(1+r)$ across the swept sample budget.

(III) Where PEBS does not improve. Pair accuracy is identical by construction across pop-slope and EB-shrunk arms ($0.6834$ in both arms on the CV evaluation pairs; this differs from the $64.00\%$ in §[2.3](https://arxiv.org/html/2606.27578#S2.SS3), which is the base RM’s held-out-user split), as Proposition [1](https://arxiv.org/html/2606.27578#Thmproposition1) predicts: the value of PEBS lies in calibration-sensitive losses (RMSE, BT-NLL), not argmax-style benchmarks like RewardBench 2. The HelpSteer2 verbosity attribute is the per-attribute null ($\omega_\beta \approx 0.93$, gain straddles zero) while the other four attributes gain positively; the ordinal-preference limit on MultiPref is documented separately in §[3.7](https://arxiv.org/html/2606.27578#S3.SS7).

## 4 Discussion

#### Calibration vs\. selection axes\.

Proposition [1](https://arxiv.org/html/2606.27578#Thmproposition1) fixes pair accuracy across pop-slope and EB-shrunk arms by construction; PEBS therefore targets calibration-sensitive losses (§[3.3](https://arxiv.org/html/2606.27578#S3.SS3)) rather than argmax-style benchmarks (RewardBench 2 (Malik et al., [2025](https://arxiv.org/html/2606.27578#bib.bib28))). The personalization gap that Ma et al. ([2026](https://arxiv.org/html/2606.27578#bib.bib27)) report for frontier RMs (peaking at $75.9\%$) is not affected by PEBS’s monotone calibration (Proposition [1](https://arxiv.org/html/2606.27578#Thmproposition1)); closing it requires a complementary selection-style component. A complete per-rater system composes the upstream RM, the PEBS calibrator, and a selection-style component such as ensemble disagreement (Coste et al., [2024](https://arxiv.org/html/2606.27578#bib.bib8)), where PEBS targets calibration loss and the selection component targets pair accuracy. A complementary downstream procedure that *relabels* preference pairs using PEBS-corrected reward and trains a fresh DPO policy (Rafailov et al., [2023](https://arxiv.org/html/2606.27578#bib.bib34)) falls outside Proposition [1](https://arxiv.org/html/2606.27578#Thmproposition1)’s scope: the relabeled training data and the resulting policy’s reward function both change. On PRISM with two base models (Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.3), this relabel-and-retrain procedure improves held-out pair accuracy by $+8.81$ pp on Llama-3-8B (single seed) and $+12.90$ pp on Mistral-7B-Instruct-v0.3 (seven-seed mean, $95\%$ CI $[+11.97, +13.82]$, all seeds positive), measured on a user-disjoint $20\%$ held-out slice (i.e., $20\%$ of users not seen during training).

#### Pluralism at the individual rater scale\.

The Sorensen et al. ([2024b](https://arxiv.org/html/2606.27578#bib.bib41)) Roadmap distinguishes three pluralism axes: distributional (a distribution over outputs), steerable (conditionable on values or personas), and Overton (a single output spanning the range). We propose calibration heterogeneity as a fourth axis of pluralistic alignment: the per-annotator slope $\alpha_j$ and offset $\beta_j$ at which each rater converts a model score into a personal rating, which the pooled-likelihood RLHF pipeline collapses into a single global calibrator. PEBS operates on this fourth axis. Distributional, steerable, and Overton handle reasonable-disagreement-on-content; calibration handles rater-specific scale-and-offset on a fixed-content scoring task, and the four axes are complementary. PEBS retains per-rater $(\hat{\alpha}_j, \hat{\beta}_j)$ heterogeneity that the pop-slope baseline pools away; the §[3.8](https://arxiv.org/html/2606.27578#S3.SS8) demographics-ANOVA null indicates that the heterogeneity we recover is individual-rater and not demographic-cohort. Cross-cultural value variation (Conitzer et al., [2024](https://arxiv.org/html/2606.27578#bib.bib7); Zhang et al., [2025](https://arxiv.org/html/2606.27578#bib.bib47)) is one plausible source of the residual $\hat{\alpha}_j, \hat{\beta}_j$ heterogeneity that demographics fail to explain; characterizing the cultural-political content of these residuals is open work. Shrinking per-rater calibrators toward a population mean is a regression-to-mean operation, and the minority-rater trade-off it entails is the subject of the next paragraph.

EB shrinkage with $\omega_j = \tau^2/(\tau^2 + V_j)$ and $V_j \propto 1/n_j$ shrinks low-$n_j$ raters more aggressively; if those raters are disproportionately drawn from underrepresented populations (Kirk et al., [2024](https://arxiv.org/html/2606.27578#bib.bib20)), PEBS shrinks their estimated $(\hat{\alpha}_j, \hat{\beta}_j)$ toward the population mean in exchange for variance reduction. This is the standard shrinkage trade-off (Fig. [8](https://arxiv.org/html/2606.27578#A2.F8): $71.9\%$ helped, $28.1\%$ hurt); when the rare-true-extreme tail is policy-relevant a minority-rater audit is recommended.
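To make the trade-off concrete, the weights at the two ends of the PRISM range can be worked out from the §3.6 figures ($n_j$ from $6$ to $144$, $M \approx 4$). The calculation assumes, purely for illustration, that $M$ is attained by the sparsest rater and that $V_j = \kappa / n_j$ with a single constant $\kappa$ shared by all raters; neither assumption is confirmed by the paper.

```latex
\begin{align*}
\omega_j &= \frac{\tau^2}{\tau^2 + V_j} = \frac{1}{1 + V_j/\tau^2}, \\
n_j = 6:\quad & V_j/\tau^2 \approx 4
  \;\Rightarrow\; \omega_j \approx \tfrac{1}{5} = 0.20, \\
n_j = 144:\quad & V_j/\tau^2 \approx 4 \cdot \tfrac{6}{144} = \tfrac{1}{6}
  \;\Rightarrow\; \omega_j \approx \tfrac{6}{7} \approx 0.86.
\end{align*}
```

Under these assumptions a rater with six ratings retains about a fifth of their estimated deviation from the population slope, while a rater with $144$ ratings retains most of it.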

## 5 Limitations

#### Base\-family transfer scope\.

The PEBS procedure replicates within the Qwen-2.5 family across three corpora and three base reward models (§[3.8](https://arxiv.org/html/2606.27578#S3.SS8)) and on the Phi-3-medium-14B same-family reference across $5$ training seeds (all $5$ positive at coherence per-attribute mean $+42.15\%$; Table [2](https://arxiv.org/html/2606.27578#S3.T2)); the pre-registered four-base coherence-only probe and verbosity-only control locate the limit at a coherence-head/dense-architecture interaction rather than verbosity bias, with calibration diagnostics in Appendix [B](https://arxiv.org/html/2606.27578#A2). On PRISM, $\hat{\tau}^2_\alpha$ is dominated by between-rater value differences and not within-rater rating noise (slope-SNR $15.6$ $[13.8, 17.5]$; full residualization procedure in the released artifact bundle); the corresponding decomposition on PluriHarms / OASST / SHP is not measured here. A regression-to-mean reading of the control and a multi-seed verbosity-baseline probe are open follow-ups.

#### Morris forecaster scope\.

The Morris $g$-function forecaster (§[3.7](https://arxiv.org/html/2606.27578#S3.SS7)) is validated here for continuous-rating corpora; on MultiPref, its large predicted-versus-observed gap flags that the ordinal preference setting violates the Gaussian random-effects assumption. Extending PEBS to ordinal data would require a Beta-Binomial or Student-$t_\nu$ random-effects model.

#### Comparison to retained PRISM baselines\.

Against the four retained PRISM baselines on matched leave\-one\-conversation\-out \(LOCO\) PRISM, PEBS provides closed\-form post\-hoc calibration with no test\-time inference cost\. P\-GenRM exceeds PEBS under the same strict LOCO RMSE protocol while using test\-time prototype clustering; the retained comparison is in App\.[C](https://arxiv.org/html/2606.27578#A3)\(Tab\.[4](https://arxiv.org/html/2606.27578#A2.T4)\)\.

#### Data, licenses, and ethics\.

All preference corpora used in this work (PRISM (Kirk et al., [2024](https://arxiv.org/html/2606.27578#bib.bib20)), PluriHarms (Li et al., [2026](https://arxiv.org/html/2606.27578#bib.bib24)), HelpSteer2 (Wang et al., [2024](https://arxiv.org/html/2606.27578#bib.bib44)), OASST2, SHP-subreddit (Ethayarajh et al., [2022](https://arxiv.org/html/2606.27578#bib.bib13)), MultiPref (Miranda et al., [2024](https://arxiv.org/html/2606.27578#bib.bib29))) are public datasets released by their original authors under the licenses on the corresponding dataset cards; this paper introduces no new human-subjects data collection. The PEBS procedure produces only per-rater calibration parameters; no rater identifiers are republished. Per-rater shrinkage trades minority-rater $(\hat{\alpha}_j, \hat{\beta}_j)$ magnitude for variance reduction (Fig. [8](https://arxiv.org/html/2606.27578#A2.F8)); when policy decisions hinge on the rare-true-extreme tail, a minority-rater audit is recommended (§[4](https://arxiv.org/html/2606.27578#S4)).

## References

- Baker \(2001\)Baker, F\. B\.*The Basics of Item Response Theory*\.ERIC Clearinghouse on Assessment and Evaluation, 2nd edition, 2001\.
- Bakker et al\. \(2022\)Bakker, M\. A\., Chadwick, M\. J\., Sheahan, H\. R\., Tessler, M\. H\., Campbell\-Gillingham, L\., Balaguer, J\., McAleese, N\., Glaese, A\., Aslanides, J\., Botvinick, M\. M\., and Summerfield, C\.Fine\-tuning language models to find agreement among humans with diverse preferences\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Bradley & Terry \(1952\)Bradley, R\. A\. and Terry, M\. E\.Rank analysis of incomplete block designs: I\. the method of paired comparisons\.*Biometrika*, 39\(3/4\):324–345, 1952\.
- Cameron et al\. \(2008\)Cameron, A\. C\., Gelbach, J\. B\., and Miller, D\. L\.Bootstrap\-based improvements for inference with clustered errors\.*The Review of Economics and Statistics*, 90\(3\):414–427, 2008\.
- Castricato et al\. \(2025\)Castricato, L\., Lile, N\., Rafailov, R\., Fränken, J\.\-P\., and Finn, C\.PERSONA: A reproducible testbed for pluralistic alignment\.In*Proceedings of the 31st International Conference on Computational Linguistics \(COLING\)*, 2025\.arXiv:2407\.17387\.
- Christiano et al\. \(2017\)Christiano, P\. F\., Leike, J\., Brown, T\. B\., Martic, M\., Legg, S\., and Amodei, D\.Deep reinforcement learning from human preferences\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2017\.
- Conitzer et al\. \(2024\)Conitzer, V\., Freedman, R\., Heitzig, J\., Holliday, W\. H\., Jacobs, B\. M\., Lambert, N\., Mossé, M\., Pacuit, E\., Russell, S\., Schoelkopf, H\., Tewolde, E\., and Zwicker, W\. S\.Position: Social choice should guide AI alignment in dealing with diverse human feedback\.In*Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, pp\. 9346–9360, 2024\.PMLR 235; arXiv:2404\.10271\.
- Coste et al\. \(2024\)Coste, T\., Anwar, U\., Kirk, R\., and Krueger, D\.Reward model ensembles help mitigate overoptimization\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.
- Dawid & Skene \(1979\)Dawid, A\. P\. and Skene, A\. M\.Maximum likelihood estimation of observer error\-rates using the EM algorithm\.*Journal of the Royal Statistical Society, Series C \(Applied Statistics\)*, 28\(1\):20–28, 1979\.
- Efron \(1987\)Efron, B\.Better bootstrap confidence intervals\.*Journal of the American Statistical Association*, 82\(397\):171–185, 1987\.
- Efron & Morris \(1973\)Efron, B\. and Morris, C\.Stein’s estimation rule and its competitors–an empirical Bayes approach\.*Journal of the American Statistical Association*, 68\(341\):117–130, 1973\.
- Eisenstein et al\. \(2024\)Eisenstein, J\., Nagpal, C\., Agarwal, A\., Beirami, A\., D’Amour, A\., Dvijotham, D\. J\., Fisch, A\., Heller, K\., Pfohl, S\., Ramachandran, D\., Shaw, P\., and Berant, J\.Helping or herding? Reward\-model ensembles mitigate but do not eliminate reward hacking\.In*Conference on Language Modeling \(CoLM\)*, 2024\.
- Ethayarajh et al\. \(2022\)Ethayarajh, K\., Choi, Y\., and Swayamdipta, S\.Understanding dataset difficulty with𝒱\\mathcal\{V\}\-usable information\.In*International Conference on Machine Learning \(ICML\)*, 2022\.
- Gao et al\. \(2023\)Gao, L\., Schulman, J\., and Hilton, J\.Scaling laws for reward model overoptimization\.In*International Conference on Machine Learning \(ICML\)*, 2023\.
- Gelman & Hill \(2007\)Gelman, A\. and Hill, J\.*Data Analysis Using Regression and Multilevel/Hierarchical Models*\.Cambridge University Press, 2007\.
- Ghafouri et al\. \(2026\)Ghafouri, B\., Choi, E\. C\., Dey, P\., and Ferrara, E\.Measuring human preferences in RLHF is a social science problem\.arXiv preprint arXiv:2604\.03238, 2026\.
- Han et al\. \(2026\)Han, K\., Zhou, Y\., Gao, M\., Zhou, G\., Li, S\., Kumar, A\., Fan, X\., Li, W\., and Zhang, L\.EBPO: Empirical bayes shrinkage for stabilizing group\-relative policy optimization\.arXiv preprint arXiv:2602\.05165, 2026\.
- Henderson \(1975\)Henderson, C\. R\.Best linear unbiased estimation and prediction under a selection model\.*Biometrics*, 31\(2\):423–447, 1975\.
- Hu et al\. \(2022\)Hu, E\. J\., Shen, Y\., Wallis, P\., Allen\-Zhu, Z\., Li, Y\., Wang, S\., Wang, L\., and Chen, W\.LoRA: Low\-rank adaptation of large language models\.In*International Conference on Learning Representations \(ICLR\)*, 2022\.
- Kirk et al\. \(2024\)Kirk, H\. R\., Whitefield, A\., Röttger, P\., Bean, A\., Margatina, K\., Ciro, J\., Mosquera, R\., Bartolo, M\., Williams, A\., He, H\., Vidgen, B\., and Hale, S\. A\.The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models\.In*Advances in Neural Information Processing Systems \(NeurIPS, Datasets and Benchmarks Track\)*, 2024\.arXiv:2404\.16019\.
- Kobalczyk & van der Schaar \(2025\)Kobalczyk, K\. and van der Schaar, M\.Preference learning for AI alignment: A causal perspective\.In*International Conference on Machine Learning \(ICML\)*, 2025\.arXiv:2506\.05967\.
- Köpf et al\. \(2023\)Köpf, A\., Kilcher, Y\., von Rütte, D\., Anagnostidis, S\., Tam, Z\.\-R\., Stevens, K\., Barhoum, A\., Duc, N\. M\., Stanley, O\., Nagyfi, R\., ES, S\., Suri, S\., Glushkov, D\., Dantuluri, A\., Maguire, A\., Schuhmann, C\., Nguyen, H\., and Mattick, A\.OpenAssistant conversations – democratizing large language model alignment\.In*Advances in Neural Information Processing Systems \(NeurIPS, Datasets and Benchmarks Track\)*, 2023\.arXiv:2304\.07327\.
- Kou & Yang \(2017\)Kou, S\. C\. and Yang, J\. J\.Optimal shrinkage estimation in heteroscedastic hierarchical linear models\.In*Big and Complex Data Analysis*, Contributions to Statistics, pp\. 249–284\. Springer, 2017\.doi:10\.1007/978\-3\-319\-41573\-4\_13\.
- Li et al\. \(2026\)Li, J\.\-J\., Mire, J\., Fleisig, E\., Pyatkin, V\., Collins, A\., Sap, M\., and Levine, S\.PluriHarms: Benchmarking the full spectrum of human judgments on AI harm\.*arXiv preprint arXiv:2601\.08951*, 2026\.
- Liu et al\. \(2024\)Liu, C\. Y\., Zeng, L\., Liu, J\., Yan, R\., He, J\., Wang, C\., Yan, S\., Liu, Y\., and Zhou, Y\.Skywork\-reward: Bag of tricks for reward modeling in LLMs\.*arXiv preprint arXiv:2410\.18451*, 2024\.
- Liu et al\. \(2025\)Liu, P\., Lu, J\., and Sun, W\. W\.Uncertainty quantification for large language model reward learning under heterogeneous human feedback\.*arXiv preprint arXiv:2512\.03208*, 2025\.
- Ma et al\. \(2026\)Ma, Q\., Gao, D\., Cai, R\., Zhao, B\., Zhou, H\., Zhang, J\., and Zhao, Z\.Personalized RewardBench: Evaluating reward models with human aligned personalization\.arXiv preprint arXiv:2604\.07343, 2026\.
- Malik et al\. \(2025\)Malik, S\., Pyatkin, V\., Land, S\., Morrison, J\., Smith, N\. A\., Hajishirzi, H\., and Lambert, N\.RewardBench 2: Advancing reward model evaluation\.*arXiv preprint arXiv:2506\.01937*, 2025\.
- Miranda et al\. \(2024\)Miranda, L\. J\. V\., Wang, Y\., Elazar, Y\., Kumar, S\., Pyatkin, V\., Brahman, F\., Smith, N\. A\., Hajishirzi, H\., and Dasigi, P\.Hybrid preferences: Learning to route instances for human vs\. AI feedback\.*arXiv preprint arXiv:2410\.19133*, 2024\.
- Morris \(1983\)Morris, C\. N\.Parametric empirical Bayes inference: Theory and applications\.*Journal of the American Statistical Association*, 78\(381\):47–55, 1983\.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
- Paun et al. (2018) Paun, S., Carpenter, B., Chamberlain, J., Hovy, D., Kruschwitz, U., and Poesio, M. Comparing Bayesian models of annotation. *Transactions of the Association for Computational Linguistics (TACL)*, 6:571–585, 2018.
- Pinheiro & Bates (2000) Pinheiro, J. C. and Bates, D. M. *Mixed-Effects Models in S and S-PLUS*. Statistics and Computing. Springer, 2000.
- Rafailov et al. (2023) Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2023.
- Rasch (1960) Rasch, G. *Probabilistic Models for Some Intelligence and Attainment Tests*. Danmarks Paedagogiske Institut, Copenhagen, 1960.
- Rezk et al. (2025) Rezk, F., Pan, Y., Foo, C.-S., Xu, X., Chen, N., Gouk, H., and Hospedales, T. The reward model selection crisis in personalized alignment. *arXiv preprint arXiv:2512.23067*, 2025.
- Robbins (1956) Robbins, H. An empirical Bayes approach to statistics. In *Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1*, pp. 157–163. University of California Press, 1956.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.
- Seabold & Perktold (2010) Seabold, S. and Perktold, J. statsmodels: Econometric and statistical modeling with Python. In *9th Python in Science Conference (SciPy)*, 2010.
- Sorensen et al. (2024a) Sorensen, T., Jiang, L., Hwang, J., Levine, S., Pyatkin, V., West, P., Dziri, N., Lu, X., Rao, K., Bhagavatula, C., Sap, M., Tasioulas, J., and Choi, Y. Value kaleidoscope: Engaging AI with pluralistic human values, rights, and duties. In *AAAI Conference on Artificial Intelligence*, 2024a. arXiv:2309.00779.
- Sorensen et al. (2024b) Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C. M., Ye, A., Jiang, L., Lu, X., Dziri, N., Althoff, T., and Choi, Y. Position: A roadmap to pluralistic alignment. In *International Conference on Machine Learning (ICML)*, 2024b. arXiv:2402.05070.
- Stiennon et al. (2020) Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. Learning to summarize with human feedback. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2020.
- von Werra et al. (2020) von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., and Gallouédec, Q. TRL: Transformer reinforcement learning, 2020. URL [https://github.com/huggingface/trl](https://github.com/huggingface/trl).
- Wang et al. (2024) Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. HelpSteer2: Open-source dataset for training top-performing reward models. *arXiv preprint arXiv:2406.08673*, 2024.
- Xie et al. (2012) Xie, X., Kou, S. C., and Brown, L. D. SURE estimates for a heteroscedastic hierarchical model. *Journal of the American Statistical Association*, 107(500):1465–1479, 2012.
- Yang et al. (2025) Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report. *arXiv preprint arXiv:2412.15115*, 2025.
- Zhang et al. (2025) Zhang, L. H., Milli, S., Jusko, K., Smith, J., Amos, B., Bouaziz, W., Revel, M., Kussman, J., Sheynin, Y., Titus, L., Radharapu, B., Yu, J., Sarma, V., Rose, K., and Nickel, M. Cultivating pluralism in algorithmic monoculture: The community alignment dataset. *arXiv preprint arXiv:2507.09650*, 2025.
- Zhang et al. (2026) Zhang, P., Lin, T.-E., Wu, Y., Chen, J., Wang, Z., Yang, H., Xu, Z., Huang, F., Zhang, K., and Li, Y. P-GenRM: Personalized generative reward model with test-time user-based scaling. In *Proceedings of the International Conference on Learning Representations (ICLR)*, 2026. Oral; arXiv:2602.12116.

## Appendix A Proof of Theorem [1](https://arxiv.org/html/2606.27578#Thmtheorem1) (oracle inequality)

We prove Theorem [1](https://arxiv.org/html/2606.27578#Thmtheorem1) in four steps: (i) a mean-squared error bound for the truncated Morris MoM estimator $\hat{\tau}^{2}$, (ii) a second-order Taylor expansion with Lagrange remainder around the oracle, (iii) aggregation across raters using the independence delivered by sample splitting, and (iv) a truncation-event tail bound.

Throughout, $\tau^{2}$ abbreviates $\tau^{2}_{\alpha}$ and $e_{j}=\hat{\alpha}_{j}^{\mathrm{OLS}}-\alpha_{j}$. Write $\Delta_{j}(t)=\hat{\alpha}_{j}^{\mathrm{EB}}(t)-\alpha_{j}$, where $\hat{\alpha}_{j}^{\mathrm{EB}}(t)=\omega_{j}(t)\hat{\alpha}_{j}^{\mathrm{OLS}}+(1-\omega_{j}(t))\alpha_{\mathrm{pop}}$ with $\omega_{j}(t)=t/(t+V_{j})$, so the per-rater risk at a deterministic $t$ is $R_{\mathrm{EB},j}(t)=\mathbb{E}[\Delta_{j}(t)^{2}]$ and the aggregate is $R_{\mathrm{EB}}(t)=J^{-1}\sum_{j}R_{\mathrm{EB},j}(t)$. The oracle risk is $R_{\mathrm{oracle}}=R_{\mathrm{EB}}(\tau^{2})$.

#### Step 1: MoM mean\-squared error\.

On the auxiliary split, Morris ([1983](https://arxiv.org/html/2606.27578#bib.bib30))'s estimator

$$\tilde{\tau}^{2}=(J-1)^{-1}\sum_{j=1}^{J}\bigl(\hat{\alpha}_{j}^{\mathrm{OLS}}-\bar{\alpha}\bigr)^{2}-J^{-1}\sum_{j=1}^{J}V_{j}$$

is unbiased for $\tau^{2}$ under the random-effect DGP. Its first term $S^{2}$ is a Gaussian quadratic form with matrix $A/(J-1)$, $A=I-J^{-1}\mathbf{1}\mathbf{1}^{\top}$, applied to independent coordinates of variance $\sigma_{j}^{2}=\tau^{2}+V_{j}$, so with $\Sigma=\mathrm{diag}(\sigma_{j}^{2})$,

$$\mathrm{Var}(\tilde{\tau}^{2})=\frac{2\,\mathrm{tr}\bigl[(A\Sigma)^{2}\bigr]}{(J-1)^{2}}\leq\frac{2\sigma_{\max}^{4}}{J-1}\leq\frac{8(\tau^{2}+V_{\max})^{2}}{3J}\equiv\frac{C_{1}}{J},$$

using $\mathrm{tr}[(A\Sigma)^{2}]\leq\|\Sigma\|^{2}\,\mathrm{tr}(A)=\sigma_{\max}^{4}\,(J-1)$ and $1/(J-1)\leq 4/(3J)$ for $J\geq 4$ (the only place the $J\geq 4$ assumption enters), giving $\tilde{\tau}^{2}-\tau^{2}=O_{p}(1/\sqrt{J})$ and recovering Kou & Yang ([2017](https://arxiv.org/html/2606.27578#bib.bib23))'s rate. Truncation at zero, $\hat{\tau}^{2}=\max(\tilde{\tau}^{2},0)$, contracts the squared error toward the truth when $\tau^{2}\geq 0$, so $(\hat{\tau}^{2}-\tau^{2})^{2}\leq(\tilde{\tau}^{2}-\tau^{2})^{2}$ pointwise and $\mathbb{E}[(\hat{\tau}^{2}-\tau^{2})^{2}]\leq C_{1}/J$, with $C_{1}=\tfrac{8}{3}(\tau^{2}+V_{\max})^{2}\leq\tfrac{8}{3}(1+M)^{2}\tau^{4}$.
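The estimator in Step 1 can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy; the function name `truncated_mom_tau2` and the synthetic cohort (the values of `J`, `tau2_true`, and the range of `v`) are hypothetical choices for the example and are not taken from the paper's code.

```python
import numpy as np

def truncated_mom_tau2(alpha_ols, v):
    """Return (tau2_hat, tau2_tilde) from per-rater OLS estimates and sampling variances V_j."""
    alpha_ols = np.asarray(alpha_ols, dtype=float)
    v = np.asarray(v, dtype=float)
    s2 = alpha_ols.var(ddof=1)    # (J-1)^{-1} * sum_j (alpha_hat_j - alpha_bar)^2
    tau2_tilde = s2 - v.mean()    # subtract J^{-1} * sum_j V_j
    return max(tau2_tilde, 0.0), tau2_tilde  # truncate at zero

# Synthetic check under the random-effect model (illustrative values only).
rng = np.random.default_rng(0)
J, tau2_true = 2000, 4.0
v = rng.uniform(1.0, 3.0, size=J)                     # assumed known sampling variances
alpha = rng.normal(0.0, np.sqrt(tau2_true), size=J)   # true rater effects, alpha_pop = 0
alpha_ols = alpha + rng.normal(0.0, np.sqrt(v))       # noisy per-rater estimates
tau2_hat, tau2_tilde = truncated_mom_tau2(alpha_ols, v)
print(f"tau2_hat = {tau2_hat:.3f} (true value {tau2_true})")
```

The untruncated value is returned alongside the truncated one so that the contraction $(\hat{\tau}^{2}-\tau^{2})^{2}\leq(\tilde{\tau}^{2}-\tau^{2})^{2}$ can be inspected directly on simulated data.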

#### Step 2: Taylor expansion\.

Fix rater $j$. For deterministic $t$, expanding $\Delta_{j}(t)=\omega_{j}(t)\,e_{j}-(1-\omega_{j}(t))(\alpha_{j}-\alpha_{\mathrm{pop}})$ and using $\mathbb{E}[e_{j}^{2}]=V_{j}$, $\mathbb{E}[(\alpha_{j}-\alpha_{\mathrm{pop}})^{2}]=\tau^{2}$, and $\mathbb{E}[e_{j}(\alpha_{j}-\alpha_{\mathrm{pop}})]=0$ gives

$$R_{\mathrm{EB},j}(t)=\omega_{j}(t)^{2}V_{j}+(1-\omega_{j}(t))^{2}\tau^{2},$$

and $R_{\mathrm{EB},j}^{\prime}(\tau^{2})=0$: the oracle is a stationary point of the per-rater risk. Hence the first-order term vanishes and a second-order Taylor expansion with Lagrange remainder gives, for some $\xi_{j}$ between $t$ and $\tau^{2}$,

$$R_{\mathrm{EB},j}(t)-R_{\mathrm{EB},j}(\tau^{2})=\tfrac{1}{2}R_{\mathrm{EB},j}^{\prime\prime}(\xi_{j})\,(t-\tau^{2})^{2}.$$

Direct computation gives $R_{\mathrm{EB},j}^{\prime\prime}(\xi)=2V_{j}^{2}(V_{j}+3\tau^{2}-2\xi)/(\xi+V_{j})^{4}$, which is decreasing in $\xi$ on $[\tau^{2}/2,\,3\tau^{2}/2]$. On the event $\mathcal{E}=\{|\hat{\tau}^{2}-\tau^{2}|\leq\tau^{2}/2\}$ we have $\xi_{j}\in[\tau^{2}/2,\,3\tau^{2}/2]$ and therefore

$$H_{j}\equiv\sup_{\xi\in[\tau^{2}/2,\,3\tau^{2}/2]}R_{\mathrm{EB},j}^{\prime\prime}(\xi)=R_{\mathrm{EB},j}^{\prime\prime}(\tau^{2}/2)\leq\frac{64\,V_{j}^{2}}{(\tau^{2}+V_{j})^{3}}.$$
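The three facts used in Step 2 (the closed-form risk, the stationary point at $t=\tau^{2}$, and the bound on $H_{j}$) can be checked numerically. The sketch below assumes NumPy; the helper names `risk` and `risk_second_derivative` and the grid of $V_{j}$ values are illustrative, while $\tau^{2}=23.2$ is the fitted PRISM value quoted in the empirical validation.

```python
import numpy as np

def risk(t, v, tau2):
    """Per-rater risk R_EB,j(t) = w^2 V_j + (1 - w)^2 tau^2 with w = t / (t + V_j)."""
    w = t / (t + v)
    return w**2 * v + (1.0 - w)**2 * tau2

def risk_second_derivative(xi, v, tau2):
    """Closed form R''_EB,j(xi) = 2 V^2 (V + 3 tau^2 - 2 xi) / (xi + V)^4."""
    return 2.0 * v**2 * (v + 3.0 * tau2 - 2.0 * xi) / (xi + v)**4

tau2 = 23.2
vs = np.array([0.5, 5.0, 23.2, 100.0, 500.0])   # illustrative sampling variances

# Oracle risk equals tau^2 V / (tau^2 + V).
oracle = risk(tau2, vs, tau2)

# Central finite difference of R at t = tau^2: should vanish (stationary point).
h = 1e-3
d1 = (risk(tau2 + h, vs, tau2) - risk(tau2 - h, vs, tau2)) / (2.0 * h)

# H_j = R''(tau^2 / 2) against the stated bound 64 V^2 / (tau^2 + V)^3.
h_j = risk_second_derivative(tau2 / 2.0, vs, tau2)
bound = 64.0 * vs**2 / (tau2 + vs)**3
print(np.column_stack([vs, oracle, d1, h_j, bound]))
```

The bound is loose by construction: it replaces $V_{j}+2\tau^{2}$ with $2(\tau^{2}+V_{j})$ and $\tau^{2}/2+V_{j}$ with $(\tau^{2}+V_{j})/2$.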

#### Step 3: Aggregation via sample splitting\.

Because $\hat{\tau}^{2}$ is computed on the auxiliary split, it is independent of $\{e_{j},\alpha_{j}\}$ for the $J$ raters being estimated, so conditioning on $\hat{\tau}^{2}$ makes the deterministic-$t$ risk formula of Step 2 applicable at $t=\hat{\tau}^{2}$:

$$\mathbb{E}\bigl[R_{\mathrm{EB}}(\hat{\tau}^{2})-R_{\mathrm{oracle}}\,;\,\mathcal{E}\bigr]\;\leq\;\tfrac{1}{2}\,\bar{H}\;\mathbb{E}\bigl[(\hat{\tau}^{2}-\tau^{2})^{2}\bigr]\;\leq\;\tfrac{1}{2}\,\bar{H}\,\frac{C_{1}}{J},\qquad\bar{H}=J^{-1}\sum_{j}H_{j}.$$

Since $R_{\mathrm{oracle}}=J^{-1}\sum_{j}\tau^{2}V_{j}/(\tau^{2}+V_{j})$ and, for every $j$,

$$\frac{64\,V_{j}^{2}/(\tau^{2}+V_{j})^{3}}{\tau^{2}V_{j}/(\tau^{2}+V_{j})}=\frac{64\,V_{j}}{\tau^{2}(\tau^{2}+V_{j})^{2}}\;\leq\;\frac{16}{\tau^{4}}$$

(the map $V\mapsto V/(\tau^{2}+V)^{2}$ is maximized at $V=\tau^{2}$), we get $\bar{H}\leq(16/\tau^{4})\,R_{\mathrm{oracle}}$ and hence the on-event bound

$$\mathbb{E}\bigl[R_{\mathrm{EB}}(\hat{\tau}^{2})-R_{\mathrm{oracle}}\,;\,\mathcal{E}\bigr]\;\leq\;\frac{c}{J}\,R_{\mathrm{oracle}},\qquad c=\frac{8\,C_{1}}{\tau^{4}}\leq\frac{64}{3}(1+M)^{2}.$$
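For reference, the arithmetic behind the constant $c$ can be written as a single chain. It only restates the steps of Step 1 and Step 3 and introduces no new assumptions.

```latex
\begin{align*}
\max_{V>0}\frac{V}{(\tau^{2}+V)^{2}}
  &= \frac{\tau^{2}}{(2\tau^{2})^{2}} = \frac{1}{4\tau^{2}}
  \quad\Longrightarrow\quad
  \frac{64\,V_{j}}{\tau^{2}(\tau^{2}+V_{j})^{2}}
  \;\leq\; \frac{64}{4\tau^{4}} = \frac{16}{\tau^{4}},\\[4pt]
\tfrac{1}{2}\,\bar{H}\,\frac{C_{1}}{J}
  &\leq \tfrac{1}{2}\cdot\frac{16}{\tau^{4}}\,R_{\mathrm{oracle}}\cdot\frac{C_{1}}{J}
  = \frac{8\,C_{1}}{\tau^{4}}\cdot\frac{R_{\mathrm{oracle}}}{J},\\[4pt]
\frac{8\,C_{1}}{\tau^{4}}
  &\leq \frac{8}{\tau^{4}}\cdot\tfrac{8}{3}(1+M)^{2}\tau^{4}
  = \tfrac{64}{3}(1+M)^{2}.
\end{align*}
```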

#### Step 4: Truncation event\.

Off $\mathcal{E}$, split the bad event into $\mathcal{A}_{-}=\{\hat{\tau}^{2}<\tau^{2}/2\}$ and $\mathcal{A}_{+}=\{\hat{\tau}^{2}>3\tau^{2}/2\}$; these exhaust $\mathcal{E}^{c}$. On $\mathcal{A}_{-}$, $R_{\mathrm{EB},j}(t)$ is decreasing on $[0,\tau^{2}]$, so the per-rater risk is at most $R_{\mathrm{EB},j}(0)=\tau^{2}$; on $\mathcal{A}_{+}$, $R_{\mathrm{EB},j}(t)$ is increasing on $[\tau^{2},\infty)$ with limit $V_{j}$, so the per-rater risk is at most $V_{j}\leq M\tau^{2}$. The off-event excess risk is therefore at most $\max(1,M)\,\tau^{2}$ per rater. For the probability, $\tilde{\tau}^{2}-\tau^{2}$ is a centred Gaussian quadratic form whose coefficient vector satisfies $\|\lambda\|_{\infty}\leq\sigma_{\max}^{2}/(J-1)$ and $\|\lambda\|_{2}^{2}\leq\sigma_{\max}^{4}/(J-1)$, so the Hanson–Wright inequality gives

$$\mathbb{P}(\mathcal{E}^{c})=\mathbb{P}\bigl(|\tilde{\tau}^{2}-\tau^{2}|>\tau^{2}/2\bigr)\leq 2\exp\!\Bigl(-c_{2}\,\frac{J-1}{(1+M)^{2}}\Bigr)$$

for an absolute constant $c_{2}>0$ (the exponent is dimensionless because $\sigma_{\max}^{2}\leq(1+M)\tau^{2}$). Combining the on-event Step 3 bound with the off-event excess and tail bound yields Eq. ([5](https://arxiv.org/html/2606.27578#S3.E5)). $\square$

#### Scope of the proof\.

The theorem covers the sample-split estimator with $\alpha_{\mathrm{pop}}$ known. Algorithm [1](https://arxiv.org/html/2606.27578#alg1) estimates $\hat{\tau}^{2}$ and a precision-weighted $\hat{\alpha}_{\mathrm{pop}}$ on the same sample; both couplings contribute additional $O(1/J)$ terms (Xie et al. ([2012](https://arxiv.org/html/2606.27578#bib.bib45)) handle the analogous same-sample coupling in the location case via SURE). Rather than extending the algebra, we validate the deployed same-sample estimator by simulation below. The constant $\tfrac{64}{3}(1+M)^{2}$ is conservative: it is driven by the sparsest raters through $V_{\max}$.

#### Empirical validation\.

We simulate the deployed (same-sample, truncated-MoM) estimator on PRISM-calibrated cohorts: $J\in\{100,200,400,800,1394\}$ with 100 seeds each (500 cells), resampling $n_{j}$ from the empirical PRISM pool with the fitted $\hat{\tau}^{2}=23.2$ and $\hat{\sigma}_{\varepsilon}=23.5$. Define the realized constant $c_{\mathrm{emp}}=J\,(R_{\mathrm{EB}}/R_{\mathrm{oracle}}-1)$. Averaging risks over seeds within each stratum, the realized constant is 3.93 at $J=100$ and decreases to 2.51 at $J=1394$; the inequality holds in expectation in every stratum, with large slack relative to the worst-case constant. Per-seed realized values fluctuate with a heavy upper tail at small $J$ (95th percentile $\approx 12$; single-seed maximum 91.6 at $J=100$), as expected for a ratio of noisy risk estimates; at $J=1394$ the mean risk ratio is 1.002 and the worst seed across 100 is 1.017. The $\hat{\tau}^{2}$ estimation error is therefore too small to explain the 8.58% PRISM gain, which is consequently not an artefact of estimating $\tau^{2}$ from finite data.
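A simplified version of this simulation can be sketched as follows. It is not the paper's protocol: the per-rater sample sizes are drawn from a hypothetical uniform range rather than the empirical PRISM pool, the sampling variance is assumed to be $V_{j}=\sigma_{\varepsilon}^{2}/n_{j}$, and $\alpha_{\mathrm{pop}}=0$ is treated as known. Only $\hat{\tau}^{2}=23.2$, $\hat{\sigma}_{\varepsilon}=23.5$, and the definition of $c_{\mathrm{emp}}$ come from the text; the function name `realized_constant` is illustrative.

```python
import numpy as np

def realized_constant(J, n_seeds=50, tau2=23.2, sigma_eps=23.5, seed=0):
    """Return (c_emp, risk_ratio) with c_emp = J * (R_EB / R_oracle - 1)."""
    rng = np.random.default_rng(seed)
    loss_eb, loss_oracle = [], []
    for _ in range(n_seeds):
        n = rng.integers(6, 61, size=J)                  # hypothetical n_j in [6, 60]
        v = sigma_eps**2 / n                             # assumed V_j = sigma_eps^2 / n_j
        alpha = rng.normal(0.0, np.sqrt(tau2), size=J)   # true effects, alpha_pop = 0
        alpha_ols = alpha + rng.normal(0.0, np.sqrt(v))  # per-rater OLS estimates
        # Same-sample truncated MoM estimate of tau^2.
        tau2_hat = max(alpha_ols.var(ddof=1) - v.mean(), 0.0)
        for t, out in ((tau2_hat, loss_eb), (tau2, loss_oracle)):
            w = t / (t + v)                              # shrinkage weight omega_j(t)
            out.append(np.mean((w * alpha_ols - alpha) ** 2))
    ratio = np.mean(loss_eb) / np.mean(loss_oracle)      # average risks, then take the ratio
    return J * (ratio - 1.0), ratio

c_emp, ratio = realized_constant(J=400)
print(f"J=400: risk ratio = {ratio:.4f}, c_emp = {c_emp:.2f}")
```

Risks are averaged over seeds before the ratio is taken, mirroring the "averaging risks over seeds within each stratum" step; taking per-seed ratios first would reproduce the heavy upper tail described above.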

## Appendix B Additional diagnostics and pre-registration details

Table 4: PRISM methods comparison. Among the rows compared, PEBS is the closed-form post-hoc calibrator with no test-time compute and a stated oracle bound. P-GenRM is included as the matched scalar-RMSE baseline and is evaluated under the strict LOCO protocol.

![Refer to caption](https://arxiv.org/html/2606.27578v1/x4.png)

Figure 4: Phi-3-medium-14B cross-seed scatter. Each column is one random seed, and each colour-coded marker series is one HelpSteer2 attribute. Trained-attribute coherence is positive in all five seeds (mean +42.15%, dotted line; Student-$t$ 95% CI [+40.10, +44.20]); the shaded band marks the single-seed anchor +43.23% ± 5 pp. Untrained attributes scatter more widely.

![Refer to caption](https://arxiv.org/html/2606.27578v1/x5.png)

Figure 5: The verbosity-only control confirms the reversal is attribute-specific, not architecture-wide. One group per base: blue bars are the untrained-coherence gain (%), vermillion bars the trained-verbosity gain (%), with 95% BCa CIs. On all four bases the untrained coherence head stays positive and within ~1 pp of the Phi-3 reference, while the trained verbosity head itself turns negative (−84.4 / −44.4 / −43.9 / −32.6), ruling out a base-level failure as the cause of the coherence reversal.

This appendix expands the diagnostics supporting Figure[1](https://arxiv.org/html/2606.27578#S1.F1): sparse\-rater shrinkage, cross\-base transfer boundaries, adapter prediction\-spread, and the numerical details needed to reproduce the reported CIs\.

![Refer to caption](https://arxiv.org/html/2606.27578v1/x6.png)

Figure 6: PEBS automatically down-weights sparse annotators: shrinkage is largest for $n_{j}\leq 8$ and fades to zero at high $n_{j}$ with no threshold to tune. The closed-form weight $\omega_{j}=\tau^{2}/(\tau^{2}+V_{j})$ governs how much PEBS trusts each rater's own calibrator vs. the population mean. Three illustrative populations (PRISM, PluriHarms, HelpSteer2) are plotted against per-user sample sizes $n_{j}$ and within-rater noise $V_{j}\propto 1/n_{j}$; the shaded zone ($n_{j}\leq 8$) is where $\omega_{j}$ is smallest and shrinkage toward $\alpha_{\mathrm{pop}}$ is largest. $\omega_{j}$ asymptotes to 1 (no shrinkage) as $n_{j}$ grows, so dense annotators reduce to per-user OLS without any threshold parameter.

#### Scope of the pre-registered criterion.

The four-base coherence-only probe was pre-registered with the criterion that any single base inversion bounds the across-family result. The dense panel therefore supports the bounded-transfer claim, not an architecture-universal claim. Two MoE runs (Phi-3.5-MoE-Instruct and Mixtral-8×7B) used a narrower output-projection adapter than the dense-Transformer protocol; both produce negative-direction trained-coherence gains (−59.41% and −60.44%) with narrow prediction spread (0.267 and 0.298). We use these MoE points only as additional boundary evidence consistent with the dense-panel collapse signature, not as full cross-architecture replications.

#### Calibration diagnostics\.

The probe measures prediction spread on a 208-row HelpSteer2 slice. The two inversion bases (Llama-3-8B, Yi-1.5-34B) lie below the 0.40 collapse threshold ($\sigma_{\mathrm{pred,coh}}=0.2298,\,0.3246$), while Mistral-Small-22B lies above (0.4743); Figure [7](https://arxiv.org/html/2606.27578#A2.F7) plots the three values. Verbosity-bias and LoRA-capacity alternatives are addressed by the verbosity-only control and the multi-seed Phi-3 replication. We treat head collapse as an observational signature: the posterior is wide ($P\in[0.30,\,0.85]$), and causal mechanism claims require intervention experiments outside this paper.

![Refer to caption](https://arxiv.org/html/2606.27578v1/x7.png)

Figure 7: Adapter prediction-spread ($\sigma_{\mathrm{pred,coh}}$) for three across-family bases, against the 0.40 collapse threshold. The two inversion bases (Llama-3-8B, Yi-1.5-34B) fall below the threshold; the null base (Mistral-Small-22B) lies above. Lower $\sigma_{\mathrm{pred,coh}}$ indicates tighter clustering of adapter outputs around one or two rating values, consistent with a head-collapsed adapter setting. We treat the signature as observational rather than causal.

![Refer to caption](https://arxiv.org/html/2606.27578v1/x8.png)

Figure 8: Per-user RMSE improvement scatter on PRISM ($N=1{,}394$ users), illustrating the minority-rater trade-off of EB shrinkage. Each point is one user; the $x$-axis is per-user sample size $n_{j}$ (log scale) and the $y$-axis is the per-user RMSE improvement (pop-slope minus PEBS-shrunk, as a percentage of pop-slope RMSE). Blue points (1,002 users, 71.9%) are helped by PEBS; vermillion points (392, 28.1%) are hurt. Low-$n_{j}$ users are shrunk most aggressively ($\omega_{j}\to 0$ as $n_{j}\to 0$) and show the widest spread in improvement, consistent with the standard EB trade-off: optimal under the prior but wrong for the rare true-extreme rater.
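
The dependence of the shrinkage weight on per-user sample size, which Figures 6 and 8 describe, can be tabulated with a short sketch. It assumes NumPy and the proportionality constant $V_{j}=\sigma_{\varepsilon}^{2}/n_{j}$, which is an illustrative assumption (the captions only state $V_{j}\propto 1/n_{j}$). The numbers $\hat{\tau}_{\beta}^{2}=115.7$ and $\hat{\sigma}_{\varepsilon}=23.5$ are the PRISM MoM values reported in this appendix; the function names are hypothetical.

```python
import numpy as np

def shrinkage_weight(n_j, tau2, sigma_eps):
    """omega_j = tau^2 / (tau^2 + V_j), with the assumed V_j = sigma_eps^2 / n_j."""
    v_j = sigma_eps**2 / np.asarray(n_j, dtype=float)
    return tau2 / (tau2 + v_j)

def eb_shrink(theta_ols, theta_pop, omega):
    """Convex combination of the per-rater estimate and the population value."""
    return omega * theta_ols + (1.0 - omega) * theta_pop

n = np.array([2, 4, 8, 16, 64, 256])
omega = shrinkage_weight(n, tau2=115.7, sigma_eps=23.5)
for n_j, w in zip(n, omega):
    print(f"n_j={int(n_j):4d}  omega_j={w:.3f}")
```

Because the weight is a smooth function of $n_{j}$, sparse raters are pulled toward the population value and dense raters keep their own fit without any cutoff being chosen by hand.
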
#### Demographic ANOVA on PRISM\.

An analysis of variance (ANOVA) of the fitted $(\hat{\alpha}_{j},\hat{\beta}_{j})$ against the same six PRISM demographics (age, gender, region, education, political orientation, English fluency) finds only the gender $\to\hat{\beta}_{j}$ cell surviving Bonferroni correction, and even there the explained variance is small ($\eta^{2}<0.02$). Demographic grouping cannot replace per-user calibration; the six demographic axes do not jointly recover the per-user shrinkage gain reported in §[3.1](https://arxiv.org/html/2606.27578#S3.SS1).

#### Multi\-attribute regression observation on HelpSteer2\.

We also observe the same shrinkage mechanism on a multi-attribute regression problem, where the five HelpSteer2 attribute axes are treated as five pseudo-raters across 1,038 rows. This is an observation about EB-shrinkage stability in a multi-axis regression context, *not* a pluralism claim: the five axes are scoring dimensions, not human annotators with heterogeneous calibrations. The four-seed Qwen-2.5-7B mean is +18.24% relative RMSE reduction [+17.97, +18.51] (across-seed half-width 0.27 pp), reflecting the same calibration-loss reduction PEBS provides on PRISM applied to a different problem geometry. The result is bound to the Qwen-2.5 family (§[3.5](https://arxiv.org/html/2606.27578#S3.SS5), §[5](https://arxiv.org/html/2606.27578#S5)); the across-seed half-width is roughly an order of magnitude tighter than the within-seed bootstrap half-width.

#### HelpSteer2 verbosity per\-attribute null\.

Among the five HelpSteer2 attributes the EB-shrunk arm gains positively on four (helpfulness +6.10%, correctness +7.08%, coherence +41.15%, complexity +30.13%); verbosity straddles zero at −2.74% [−38.04, +27.62] with shrinkage weight $\omega_{\beta}\approx 0.93$, indicating the attribute is already near-saturated under per-attribute fit and there is little for shrinkage to add. The four-of-five positive pattern rules out an attribute-agnostic verbosity bias as the source of the within-user RMSE gain in §[3.1](https://arxiv.org/html/2606.27578#S3.SS1).

#### Base\-model training details\.

The five-seed Phi-3 replication gives cross-seed mean +42.15%, within 1.08 pp of the single-seed reference (+43.23%), with trained-coherence across-seed variance 2.73 pp$^{2}$ (SD 1.65 pp) versus untrained mean 580.8 pp$^{2}$. The Phi-3 verbosity-only control turns trained-verbosity negative to −32.62% while preserving untrained-coherence at +43.18% (Table [2](https://arxiv.org/html/2606.27578#S3.T2)).

Qwen2.5-7B-Instruct is trained with Transformer Reinforcement Learning (TRL) 0.12.2 (von Werra et al., [2020](https://arxiv.org/html/2606.27578#bib.bib43)): LoRA $r=32$, $\alpha=16$, learning rate $10^{-4}$, bf16, 1,500 steps, centered-rewards regularizer (Eisenstein et al., [2024](https://arxiv.org/html/2606.27578#bib.bib12)), pair accuracy CI [62.74, 65.29], ≈75 min on H100 80 GB.

Bootstrap CIs are 95% BCa (Efron, [1987](https://arxiv.org/html/2606.27578#bib.bib10)) with a PRISM 4,000-replicate cluster bootstrap by user (Cameron et al., [2008](https://arxiv.org/html/2606.27578#bib.bib4)) and a HelpSteer2 row-cluster bootstrap. PRISM MoM: $\hat{\tau}_{\alpha}^{2}=26.2$ (slope), $\hat{\tau}_{\beta}^{2}=115.7$ (offset), $\hat{\sigma}_{\varepsilon}=23.5$ (residual SD of the population-calibrator fit; per-user calibration takes held-out RMSE below this value, Table [1](https://arxiv.org/html/2606.27578#S3.T1)).

The 8.58% random-fold-within-user PRISM result attenuates predictably under stricter splits: a strict temporal 80/20 split returns +7.55% (30-seed cluster-bootstrap CI [+6.82, +8.71]), cluster-bootstrap-by-user gives +6.96% (BCa [+6.40, +7.56]), and leave-one-conversation-out yields +5.88% (BCa [+5.17, +6.63]); all four exclude zero.
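The cluster-bootstrap-by-user idea can be sketched as below. This is a simplified illustration assuming NumPy: it resamples whole users with replacement and reports a plain percentile interval rather than the BCa interval used in the paper, and it pools rows into a single RMSE, which may differ from the paper's within-user aggregation. The function name `cluster_bootstrap_rmse_gain` and its arguments are hypothetical.

```python
import numpy as np

def cluster_bootstrap_rmse_gain(user_ids, err_base, err_pebs, n_boot=4000, seed=0):
    """Percentile 95% CI for the relative RMSE reduction (%), resampling whole users."""
    user_ids = np.asarray(user_ids).ravel()
    users, inverse = np.unique(user_ids, return_inverse=True)
    # Per-user sums of squared error and row counts, so each replicate is cheap.
    sse_base = np.bincount(inverse, weights=np.asarray(err_base, dtype=float) ** 2)
    sse_pebs = np.bincount(inverse, weights=np.asarray(err_pebs, dtype=float) ** 2)
    counts = np.bincount(inverse)
    rng = np.random.default_rng(seed)
    gains = np.empty(n_boot)
    for b in range(n_boot):
        draw = rng.integers(0, len(users), size=len(users))  # users drawn with replacement
        n_rows = counts[draw].sum()
        rmse_base = np.sqrt(sse_base[draw].sum() / n_rows)
        rmse_pebs = np.sqrt(sse_pebs[draw].sum() / n_rows)
        gains[b] = 100.0 * (1.0 - rmse_pebs / rmse_base)
    return np.percentile(gains, [2.5, 97.5])

# Toy usage: calibrated errors are exactly 0.9x the baseline errors (illustrative data).
ids = np.repeat(np.arange(20), 5)
base_err = np.random.default_rng(1).normal(size=ids.size)
lo, hi = cluster_bootstrap_rmse_gain(ids, base_err, 0.9 * base_err, n_boot=200)
print(f"95% percentile CI for RMSE reduction: [{lo:.2f}, {hi:.2f}] %")
```

Resampling users rather than rows keeps all ratings from one annotator together, so the interval reflects between-user variability instead of treating correlated rows as independent.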

## Appendix C PRISM baseline scope

P-GenRM (Zhang et al., [2026](https://arxiv.org/html/2606.27578#bib.bib48)) is included as the matched scalar-RMSE baseline and exceeds PEBS in the strict LOCO cell reported in Table [4](https://arxiv.org/html/2606.27578#A2.T4). Methods whose published protocols optimise a different objective, metric, or feature space are cited in related work but are not reproduced as direct scalar-RMSE comparison rows here, since the protocol mismatch makes the resulting numbers incomparable.

## Appendix D Dataset cards

This appendix expands the corpora used in §[3.4](https://arxiv.org/html/2606.27578#S3.SS4) (the three within-scope continuous-rating corpora) and §[3.7](https://arxiv.org/html/2606.27578#S3.SS7) (MultiPref, the theory-predicted scope-limit demonstration corpus), with details on collection, structure, and the operations PEBS requires. None of these corpora is collected by us.

#### PRISM Alignment corpus (Kirk et al., [2024](https://arxiv.org/html/2606.27578#bib.bib20)).

A public preference-elicitation corpus with 1,500 unique participants drawn from 75 countries and 24 demographic axes. Each participant has a stable per-annotator ID and contributes multi-turn conversations with multiple model variants, with both turn-level (Likert 0–100) ratings and pairwise preferences. PRISM is the primary evaluation corpus for PEBS because the per-annotator IDs are stable across conversations, which is required to estimate the per-user $(\alpha_{j},\beta_{j})$ random effect. The reward model is trained on 26,876 preference pairs from the 1,391 demographic-complete participants under an 80/20 stratified-by-user split; the per-rater calibrators use the 1,394-user utterance-level cohort ($n_{j}\geq 6$; §[2.3](https://arxiv.org/html/2606.27578#S2.SS3)).

#### PluriHarms (Li et al., [2026](https://arxiv.org/html/2606.27578#bib.bib24)).

A harm-rating corpus collecting 15,000 harm ratings on a 0–100 scale from 100 annotators across 150 prompts. Each prompt-response pair is rated by multiple annotators with a stable per-annotator ID. PluriHarms tests whether the PEBS procedure transfers from preference judgments (PRISM) to a qualitatively different feedback type (single-axis harm rating).

#### MultiPref (Miranda et al., [2024](https://arxiv.org/html/2606.27578#bib.bib29)).

A five-point Likert preference corpus in which annotators express preferences with confidence ratings rather than as binary BT-style picks. The per-annotator rating distribution is non-Gaussian, so MultiPref lies outside the Gaussian random-effects regime that PEBS's MoM estimator assumes. The corpus enters this paper only as the theory-predicted scope-limit demonstration discussed in §[3.7](https://arxiv.org/html/2606.27578#S3.SS7); the negative-control framing, the predicted-versus-observed numerical gap, and the principled Beta-Binomial or Student-$t_{\nu}$ random-effects extension are all documented there.

#### HelpSteer2 attribute-as-rater recast (Wang et al., [2024](https://arxiv.org/html/2606.27578#bib.bib44)).

HelpSteer2 provides five scalar attribute ratings per prompt-response pair (helpfulness, correctness, coherence, complexity, verbosity) on a 0–4 scale, from a panel of human annotators whose individual identities are not released. Because PEBS requires per-rater data, we re-cast the corpus by treating the five attribute axes themselves as five *pseudo-raters*: each row contributes one rating from each axis, so a single prompt-response pair is rated by all five attribute "raters". The HelpSteer2 attribute-as-rater protocol uses 1,038 rows. The cross-family probes of §[3.5](https://arxiv.org/html/2606.27578#S3.SS5) train a coherence-only LoRA adapter (loss masked to the coherence axis) and a verbosity-only counterfactual (loss masked to verbosity) on each of the four pre-registered base architectures (plus the two appendix-only MoE boundary runs).
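The attribute-as-rater recast amounts to reshaping a wide table into long (item, rater, rating) records. The sketch below uses only the Python standard library; the function name `recast_attributes_as_raters`, the record keys, and the two example rows are hypothetical and are not drawn from the HelpSteer2 data.

```python
ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")

def recast_attributes_as_raters(rows):
    """Turn wide rows (one column per attribute) into long pseudo-rater records."""
    long_records = []
    for item_id, row in enumerate(rows):
        for attr in ATTRIBUTES:
            # Each attribute axis plays the role of one pseudo-rater for this item.
            long_records.append({"item_id": item_id, "rater": attr, "rating": row[attr]})
    return long_records

# Two made-up prompt-response rows on the 0-4 scale.
example_rows = [
    {"helpfulness": 3, "correctness": 4, "coherence": 4, "complexity": 1, "verbosity": 2},
    {"helpfulness": 1, "correctness": 2, "coherence": 3, "complexity": 2, "verbosity": 0},
]
records = recast_attributes_as_raters(example_rows)
print(len(records), "records;", records[0])
```

In the long layout every item is rated by all five pseudo-raters, which is the structure a per-rater calibrator expects.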

#### Forecast companion corpora\.

OASST2-author (Köpf et al., [2023](https://arxiv.org/html/2606.27578#bib.bib22)) and SHP-subreddit (Ethayarajh et al., [2022](https://arxiv.org/html/2606.27578#bib.bib13)) are open preference corpora with stable author- or subreddit-level grouping variables. OASST2-author enters the paper twice, under two distinct protocols: the §[3.4](https://arxiv.org/html/2606.27578#S3.SS4) within-cluster replication (model-likelihood covariate, 1,017 authors at $n_{j}\geq 6$, 5-fold CV with cluster bootstrap; +1.21%) and the §[3.7](https://arxiv.org/html/2606.27578#S3.SS7) forecaster validation (rank covariate, 2,507 authors at $n_{j}\geq 5$, leave-one-row-out CV; 8.33%). SHP-subreddit enters only the forecaster validation (18 subreddit clusters at $n\geq 20$); at these cluster sizes the shrinkage weight saturates ($\omega\to 1$), PEBS reduces to per-cluster OLS, and the exact 0.00 pp forecast match is expected rather than informative.